Page 1:

File Systems: Designs

Kamen Yotov

CS 614 Lecture, 04/26/2001

Page 2:

Overview

The Design and Implementation of a Log-Structured File System
  Sequential structure
  Speeds up writes & crash recovery

The Zebra Striped Network File System
  Striping across multiple servers
  RAID-equivalent data recovery

Page 3:

Log-structured FS: Intro

An order of magnitude faster!?!
  The future is dominated by writes
  Main memory keeps increasing
  Reads are handled by the cache

Logging is old, but this use of it is different
  NTFS, Linux kernel 2.4

Challenge: finding free space
  Bandwidth utilization of 65-75% vs. 5-10%

Page 4:

Design issues of the 1990’s

Important factors
  CPU: exponential growth
  Main memory: caches, buffers
  Disks: bandwidth, access time

Workloads
  Small files: single random I/Os
  Large files: bandwidth vs. FS policies

Page 5:

Problems with current FS

Scattered information
  5 I/O operations to access a file under BSD

Synchronous writes
  May be only the metadata, but it’s enough
  Hard to benefit from the faster CPUs

Network file systems
  More synchrony, in the wrong place

Page 6:

Log-structured FS (LFS)

Fundamentals
  Buffer many small write operations
  Write them at once to a single, contiguous region of the disk

Simple or not?
  How to retrieve information?
  How to manage the free space?
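
To make the buffering idea concrete, here is a minimal C sketch (names and sizes are assumptions, not the Sprite LFS code) of accumulating dirty blocks in an in-memory segment and flushing them to the end of the log with one large sequential write:

```c
/* Illustrative sketch of LFS-style write buffering: dirty blocks are appended
 * to an in-memory segment and flushed with one large sequential write. */
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 4096
#define SEG_BLOCKS 256            /* 256 * 4 KB = 1 MB segment */

struct segment {
    uint8_t  data[SEG_BLOCKS][BLOCK_SIZE];
    unsigned used;                /* blocks buffered so far */
};

void lfs_write_block(int disk_fd, off_t *log_end, struct segment *seg,
                     const void *block)
{
    memcpy(seg->data[seg->used++], block, BLOCK_SIZE);
    if (seg->used == SEG_BLOCKS) {
        /* One big sequential write at the current end of the log. */
        pwrite(disk_fd, seg->data, sizeof seg->data, *log_end);
        *log_end += sizeof seg->data;
        seg->used = 0;
    }
}
```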

Page 7:

LFS: File Location and Reading

Index structures to permit random access retrieval

Inodes again, …but at random positions in the log!

inode map: indexed, memory resident

Writes are better, while reads are at least as good!
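
A minimal sketch of how a read works under this scheme: the memory-resident inode map translates an inode number into the inode's current address in the log, and the inode then points at the data blocks. Structures and helpers below are illustrative assumptions, not Sprite's actual definitions.

```c
/* Illustrative LFS read path: inode map -> inode address in the log,
 * inode -> data block address in the log. */
#include <stdint.h>

struct lfs_inode {
    uint64_t direct[12];   /* log addresses of the file's first data blocks */
    /* ... indirect pointers, size, version, ... */
};

extern uint64_t inode_map[];                          /* indexed by inode number */
extern void read_from_log(uint64_t addr, void *buf);  /* assumed disk read helper */

void lfs_read_block(uint32_t inum, unsigned blkno, void *buf)
{
    struct lfs_inode ino;
    read_from_log(inode_map[inum], &ino);   /* fetch the inode from the log */
    read_from_log(ino.direct[blkno], buf);  /* then the data block itself */
}
```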

Page 8:

Example: Creating 2 Files

[Figure: log layout after creating two files, showing inode, directory, and data blocks plus the inode map]

Page 9:

LFS: Free Space Management

Need large free chunks of space

Threading
  Excessive fragmentation of free space
  Not better than other file systems

Copying
  Can be to different places…
  Big costs

Combination

Page 10:

Threading & copying

Page 11:

Solution: Segmentation

Large fixed-size segments (e.g. 1 MB)
  Threading through segments
  Copying inside segments
  Transfer time is much longer than seek time

Segment cleaning
  Which are the live chunks?
  To what file do they belong, and at what position? (for the inode update)
  Segment summary block(s)
  File version stamps
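
A small sketch of the liveness test the cleaner needs, assuming each segment summary entry records the owning file and block number for each block slot (the helper and field names are hypothetical; a real LFS would resolve them through the inode map and inode):

```c
/* Sketch of the cleaner's liveness check over a segment summary entry. */
#include <stdbool.h>
#include <stdint.h>

struct summary_entry {
    uint32_t inum;     /* owning file's inode number */
    uint32_t blkno;    /* block number within that file */
};

/* Assumed helper: current log address of block `blkno` of file `inum`. */
extern uint64_t current_block_addr(uint32_t inum, uint32_t blkno);

bool block_is_live(const struct summary_entry *e, uint64_t addr_in_segment)
{
    /* Live only if the file still points at this exact copy of the block. */
    return current_block_addr(e->inum, e->blkno) == addr_in_segment;
}
```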

Page 12:

Segment Cleaning Policies

When should the cleaner execute?
  Continuously, at night, or when space is exhausted

How many segments should be cleaned at once?

Which segments should be cleaned?
  The most fragmented ones, or…

How should the live blocks be grouped when written back?
  Locality for future reads…
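
As a baseline for these policy questions, here is a purely illustrative sketch of the simplest "greedy" choice: clean the least-utilized segments first (the cost-benefit alternative appears a few slides later).

```c
/* Baseline "greedy" segment selection: sort candidates so the segments
 * with the least live data are cleaned first. */
#include <stdlib.h>

struct seg_info {
    unsigned id;
    double   utilization;   /* fraction of the segment that is still live */
};

static int by_utilization(const void *a, const void *b)
{
    double ua = ((const struct seg_info *)a)->utilization;
    double ub = ((const struct seg_info *)b)->utilization;
    return (ua > ub) - (ua < ub);
}

void choose_segments_greedy(struct seg_info *segs, size_t n)
{
    qsort(segs, n, sizeof *segs, by_utilization);
    /* Clean segs[0], segs[1], ... until enough clean segments exist. */
}
```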

Page 13:

Measuring & Analyzing

Write cost
  Average amount of time the disk is busy per byte of new data written, including cleaning overhead
  1.0 is perfect: full bandwidth, no overhead
  Bigger is worse
  For LFS, seek and rotational latency are negligible, so it's just total bytes moved over new data written

Performance trade-off: utilization vs. speed

The key: bimodal segment distribution!
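
In the paper's model this definition works out to a closed form: cleaning segments at utilization u reads one full segment and rewrites the live fraction u in order to free 1 - u of a segment for new data, so

\[
\text{write cost} \;=\; \frac{\text{total bytes read and written}}{\text{new data written}}
\;=\; \frac{1 + u + (1-u)}{1-u} \;=\; \frac{2}{1-u}, \qquad u > 0 .
\]

(For u = 0 the segment need not be read at all, and the cost drops to 1.)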

Page 14:

Simulation & Results

Purpose: analyze different cleaning policies

Harsh model
  File system is modeled as a set of 4 KB files
  At each step a file is chosen and rewritten
  Uniform: each file is equally likely to be chosen
  Hot-and-cold: the 10-90 formula (10% of the files receive 90% of the writes)
  Runs until the write cost stabilizes
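
A tiny sketch of the hot-and-cold ("10-90") file chooser used in such a simulation, assuming 10% of the files receive 90% of the rewrites (constants and names are illustrative):

```c
/* Illustrative hot-and-cold workload: 10% of the files get 90% of the writes. */
#include <stdlib.h>

#define NFILES 10000

unsigned pick_file_to_rewrite(void)
{
    unsigned hot = NFILES / 10;               /* the "hot" 10% of the files */
    if (rand() % 100 < 90)                    /* ...receive 90% of the rewrites */
        return rand() % hot;
    return hot + rand() % (NFILES - hot);     /* the cold 90% get the rest */
}
```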

Page 15:

Write Cost vs. Disk Utilization

[Graph: write cost (y-axis, 2-18) vs. disk utilization (x-axis, 0.1-0.9) for FFS today, FFS improved, LFS uniform, LFS hot-and-cold, and the no-variance bound]

Page 16:

Hot & Cold Segments

Why is locality worse than no locality?

Free space in cold segments is more valuable
  Value is based on data stability
  Approximate stability with age

Cost-benefit policy
  Benefit
    Amount of space cleaned (inverse of the segment's utilization)
    How long it stays free (estimated from the timestamp of the youngest block)
  Cost
    Read the segment + write back the live data
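
Putting those pieces together, the cleaner ranks segments by the paper's cost-benefit ratio, where u is the segment's utilization and the age of its youngest block stands in for the stability of its data:

\[
\frac{\text{benefit}}{\text{cost}}
\;=\; \frac{\text{free space generated} \times \text{age of data}}{\text{cost}}
\;=\; \frac{(1-u) \cdot \text{age}}{1+u} ,
\]

since cleaning reads the whole segment (cost 1), writes back the live fraction u, and frees 1 - u of a segment. Segments with the highest ratio are cleaned first, so cold, mostly-empty segments are preferred.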

Page 17:

Segment Utilization Distributions

[Graph: fraction of segments (y-axis, in units of 0.001) vs. segment utilization (x-axis, 0.1-0.9) for the uniform, hot-and-cold (greedy), and hot-and-cold (cost-benefit) policies]

Page 18:

Write Cost vs. Disk Utilization (revisited)

[Graph: write cost (y-axis, 2-18) vs. disk utilization (x-axis, 0.1-0.9) for FFS today, FFS improved, LFS uniform, LFS cost-benefit, and the no-variance bound]

Page 19:

Crash Recovery

Current file systems require a full scan

Log-based systems are definitely better

Check-pointing (two-phase, with trade-offs)
  (Meta-)information is in the log
  Checkpoint region at a fixed position
    inode map blocks
    segment usage table
    time & last segment written

Roll-forward
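
A hypothetical C rendering of what such a fixed-position checkpoint region could hold, mirroring the bullets above (field names and sizes are invented for illustration, not Sprite LFS definitions):

```c
/* Hypothetical layout of the fixed-position checkpoint region. */
#include <stdint.h>
#include <time.h>

#define IMAP_BLOCKS  64
#define MAX_SEGMENTS 1024

struct checkpoint_region {
    uint64_t imap_block_addr[IMAP_BLOCKS];   /* where the inode map blocks live in the log */
    uint32_t seg_live_bytes[MAX_SEGMENTS];   /* segment usage table */
    time_t   checkpoint_time;                /* when the checkpoint was taken */
    uint64_t last_segment_written;           /* where roll-forward starts scanning */
};
```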

Page 20:

Crash Recovery (cont.)

Naïve method: on a crash, just use the latest checkpoint and go from there!

Roll-forward recovery
  Scan segment summary blocks for new inodes
  If there is data but no inode, assume the write was incomplete and ignore it
  Adjust the utilization of segments
  Restore consistency between directory entries and inodes (special records in the log serve this purpose)
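
A rough sketch of that roll-forward loop; the summary structure and helper functions are hypothetical placeholders for reading and applying segment summaries after the last checkpoint:

```c
/* Sketch of roll-forward: replay segments written after the last checkpoint,
 * accepting blocks only when their inode was also logged. */
#include <stdbool.h>
#include <stdint.h>

struct summary {          /* per-segment summary, simplified */
    bool has_inode;       /* did this segment log new inode(s)? */
    /* ... per-block (inode number, block number) entries ... */
};

/* Assumed helpers for walking and applying the log. */
extern bool read_next_summary(uint64_t *seg_addr, struct summary *sum);
extern void apply_inodes_and_blocks(uint64_t seg_addr, const struct summary *sum);
extern void update_segment_usage(uint64_t seg_addr, const struct summary *sum);

void roll_forward(uint64_t last_checkpointed_segment)
{
    uint64_t seg = last_checkpointed_segment;
    struct summary sum;

    while (read_next_summary(&seg, &sum)) {
        if (!sum.has_inode)
            continue;                         /* data without an inode: incomplete, skip */
        apply_inodes_and_blocks(seg, &sum);   /* make the new blocks reachable */
        update_segment_usage(seg, &sum);      /* keep the usage table consistent */
    }
}
```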

Page 21:

Experience with Sprite LFS

Part of the Sprite Network Operating System

Everything implemented, but roll-forward disabled

Short 30-second check-pointing interval

No more complicated to implement than a normal “allocation” file system
  NTFS and Ext2 are even more so…

Not a great improvement for the user, as few applications are disk-bound!

Page 22:

So, let’s go micro!

Micro-benchmarks were run

10 times faster when creating small files

Faster at reading, if the order is preserved

The only case where Sprite LFS is slower:
  Write a file randomly
  Read it back sequentially

The locality produced differs a lot!

Page 23:

Sprite LFS vs. SunOS 4.0.3

Sprite LFS
  Sizes: 4 KB block, 1 MB segment
  ~10x speed-up in writes/deletes
  Temporal locality
  Saturates the CPU!!!
  Random write

SunOS 4.0.3
  Size: 8 KB block
  Slow on writes/deletes
  Logical locality
  Keeps the disk busy
  Sequential read

Page 24:

Related Work

WORM media have always been logged
  Maintain indexes
  No deletion necessary

Garbage collection
  Scavenging = segment cleaning
  Generational = cost-benefit scheme
  Difference: random vs. sequential access

Logging is similar to database systems
  The use of the log differs (as in NTFS, Linux 2.4)
  Recovery is like “redo” logging

Page 25:

Zebra Networked FS: Intro

Multi-server networked file system
  Clients stripe data across the servers
  Redundancy ensures fault-tolerance & recoverability

Suitable for multimedia & parallel tasks

Borrows from RAID and LFS principles

Achieves speed-ups from 20% to 5x

Page 26:

Zebra: Background

RAID
  Definitions
    Stripes
    Fragments
  Problems
    Bandwidth bottleneck
    Small files

Differences from distributed file systems

[Figure: a stripe made up of data fragments plus a parity fragment]

Page 27:

Per-file vs. per-client striping

Per-file (RAID standard)
  4 I/Os for small files: 2 reads, 2 writes

Per-client (LFS-style)
  Data distribution
  Parity distribution
  Storage efficient

[Figure: per-file striping of a large file and two small files vs. per-client (LFS-style) striping of many files into stripes with parity]

Page 28:

Zebra: Network LFS

Logging between clients and servers (as opposed to between a file server and its disks)

Per-client striping
  More efficient use of storage space
  Parity mechanism is simplified
    No overhead for small files
    Parity never needs to be modified

Typical distributed computing problems
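
A minimal sketch of the RAID-style parity Zebra borrows: the parity fragment is the bytewise XOR of a stripe's data fragments, so any single lost fragment can be rebuilt from the survivors (fragment size and names below are illustrative assumptions):

```c
/* Illustrative stripe parity: XOR of the data fragments in a stripe. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAG_SIZE (512 * 1024)   /* illustrative fragment size */

void compute_parity(uint8_t *parity, uint8_t *const frags[], size_t nfrags)
{
    memset(parity, 0, FRAG_SIZE);
    for (size_t f = 0; f < nfrags; f++)
        for (size_t i = 0; i < FRAG_SIZE; i++)
            parity[i] ^= frags[f][i];
}
```

Because a client builds whole stripes out of its own log, parity is computed once over data it already has in memory, which is why small writes add no parity overhead and existing parity never has to be read back and patched.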

Page 29:

Zebra: Components

File Manager

Stripe Cleaner

Storage Servers

Clients

File Manager and Stripe cleaner may reside on a Storage Server as separate processes – useful for fault tolerance!

[Figure: clients, Storage Servers, the File Manager, and the Stripe Cleaner connected by a fast network]

Page 30:

Zebra: Component Dynamics

Clients
  Location, fetching & delivery of fragments
  Striping, parity computation, writing

Storage servers
  Bulk data repositories
  Fragment operations: store, append, retrieve, delete, identify
  Synchronous, non-overwrite semantics

File Manager
  Metadata repository: just pointers to blocks
  RPC bottleneck for many small files
  Can run as a separate process on a Storage Server

Stripe Cleaner
  Similar to the Sprite LFS cleaner we discussed
  Runs as a separate, user-mode process

Page 31:

Zebra: System Operation - Deltas

Communication via deltas

Fields
  File ID, file version, block #
  Old & new block pointers

Types
  Update, cleaner, reject

Reliable, because they are stored in the log
  Replayed after crashes
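
A hypothetical C rendering of such a delta record, following the field list above (names and widths are illustrative, not Zebra's actual format):

```c
/* Hypothetical Zebra delta record. */
#include <stdint.h>

enum delta_type { DELTA_UPDATE, DELTA_CLEANER, DELTA_REJECT };

struct delta {
    enum delta_type type;
    uint64_t file_id;
    uint32_t file_version;
    uint32_t block_no;
    uint64_t old_block_ptr;   /* previous location of the block, if any */
    uint64_t new_block_ptr;   /* new location written by the client or cleaner */
};
```

Because deltas are written into the same striped log as the data, they survive crashes and can simply be replayed, which is what makes this communication channel reliable.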

Page 32:

Zebra: System Operation (cont.)

Writing files
  Flushes on
    Threshold age (30 s)
    Cache full & dirty
    Application fsync
    File manager request
  Striping, delta updates, concurrent transfers

Reading files
  Nearly identical to a conventional FS
  Good client caching
  Consistency

Stripe cleaning
  Choosing which stripes to clean…
  Space utilization tracked through deltas
  Stripe Status File

Page 33:

Zebra: Advanced System Operations

Adding Storage Servers
  Scalable

Restoring from crashes
  Consistency & availability
  Specifics due to distributed system state
    Internally inconsistent stripes
    Stripe information inconsistent with the File Manager
    Stripe Cleaner state inconsistent with the Storage Servers
  Logging and check-pointing
  Fast recoveries after failures

Page 34:

Prototyping

Most of the interesting parts exist only on paper

Included
  All UNIX file commands, file system semantics
  Functional cleaner
  Clients construct fragments and write parity
  File Manager and Storage Servers checkpoint

Some advanced crash recovery methods omitted
  Metadata not yet stored on Storage Servers
  Clients do not automatically reconstruct fragments upon a Storage Server crash
  Storage Servers do not reconstruct fragments on recovery
  File Manager and Cleaner are not automatically restarted

Page 35:

Measurements: Platform

Cluster of DECstation-5000 Model 200 machines
  100 Mb/s FDDI local network ring
  20 SPECint, 32 MB RAM
  12 MB/s memory-to-memory copy
  8 MB/s memory-to-controller copy
  RZ57 1 GB disks, 15 ms seek
    2 MB/s native transfer bandwidth
    1.6 MB/s real transfer bandwidth (due to the controller)
  Caching disk controllers (1 MB)

Page 36:

Measurements: Results (1)

Large File Writes

[Graph: total throughput (MB/s, 0-4) vs. number of servers (1-4) for 1, 2, and 3 Zebra clients, 1 client with parity, Sprite, and NFS]

Page 37:

Measurements: Results (2)

Large File Reads

[Graph: total throughput (MB/s, 0-6) vs. number of servers (1-4) for 1, 2, and 3 Zebra clients, 1 client reconstructing data, Sprite, and NFS]

Page 38:

Measurements: Results (3)

Small File Writes

[Bar chart: elapsed time in seconds (0-70) for NFS, Sprite, Zebra, Sprite N.C., and Zebra N.C., broken down into open/close, write, client flush, and server flush time]

Page 39:

Measurements: Results (4)

Resource Utilization

[Bar chart: % utilization (0-100%) of the File Manager CPU and disk, client CPU, and Storage Server CPU and disk for Zebra and Sprite under the large-write (LW), large-read (LR), and small-write (SW) workloads]

Page 40:

Zebra: Conclusions

Pros
  Applies parity and log structure to network file systems
  Performance
  Scalability
  Cost-effective servers
  Availability
  Simplicity

Cons
  Lacks name caching, causing severe performance degradation
  Not well suited for transaction processing
  Metadata problems
  Small reads are problematic again