
Page 1: CSC 660: Advanced OS

Distributed Filesystems

Page 2: Topics

1. Filesystem History

2. Distributed Filesystems

3. AFS

4. GoogleFS

5. Common filesystem issues

Page 3: Filesystem History

• FS (1974)
• Fast Filesystem (FFS) / UFS (1984)
• Log-structured Filesystem (1991)
• ext2 (1993)
• ext3 (2001)
• WAFL (1994)
• XFS (1994)
• Reiserfs (1998)
• ZFS (2004)

Page 4: FS

• First UNIX filesystem (1974)
• Simple
  – Layout: superblock, inodes, then data blocks.
  – Unused blocks stored in free linked list, not bitmap.
  – 512-byte blocks, no fragments.
  – Short filenames.
• Slow: 2% of raw disk bandwidth.
  – Disk seeks consume most file access time due to small block size and high fragmentation.
  – Later doubled perf by using 1KB blocks.

Page 5: FFS

• BSD (1984), basis for SYSV UFS
• More complex
  – Cylinder groups: inodes, bitmaps, data blocks.
  – Larger blocks (4K) with 1K fragments.
  – Block layout based on physical disk parameters.
  – Long filenames, symlinks, file locks, quotas.
  – 10% space reserved by default.
• Faster: 14-47% of raw disk bandwidth.
  – Creating a new file requires 5 seeks: 2 inode seeks, 1 file data, 1 dir data, 1 dir inode.
  – User/kernel memory copies take 40% of disk op time.

Page 6: Log-structured Filesystem (LFS)

• All data stored as sequential log entries.
  – Divided into large log segments.
  – Cleaner defragments, produces new segments.
• Fast recovery: checkpoint + roll forward.
• Performance: 70% of raw disk bandwidth.
  – Large sequential writes vs. multiple writes/seeks.
  – Inode map tracks dynamic locations of inodes (sketched below).
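
The inode map is the indirection that lets LFS append everything: an updated inode is written to a new place in the log, and only the map entry changes. A minimal sketch in Python (hypothetical names; a list stands in for the on-disk log):

    log = []        # the on-disk log, written strictly sequentially
    inode_map = {}  # inode number -> index of the latest copy in the log

    def write_inode(ino, contents):
        # Updates never overwrite: append a new copy, repoint the map.
        log.append((ino, contents))
        inode_map[ino] = len(log) - 1

    def read_inode(ino):
        # Follow the map to the inode's current home in the log.
        return log[inode_map[ino]][1]

    write_inode(2, {"size": 0})
    write_inode(2, {"size": 4096})          # update = another append
    assert read_inode(2) == {"size": 4096}  # map tracks the newest copy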

Page 7: ext2 and ext3

FFS + performance features.
  – Variable block size (1K-4K), no fragments.
  – Partitions disk into block groups.
  – Data block preallocation + read-ahead.
  – Fast symlinks (stored in inode).
  – 5% space reserved by default.
  – Very fast.

ext3 adds journaling capabilities.

Page 8: WAFL

Network Appliance (1994)

Metadata in files
  – Root inode points to inode file.
  – Filesystem is a tree of blocks rooted at the inode file.
  – Writing metadata anywhere is faster with RAID.
  – Allows filesystem to be expanded on the fly.

Page 9: WAFL

Copy-on-write snapshots
  – Hourly (4/day, keep 2 days), daily (keep 7 days).
  – Users can get deleted files from .snapshot dirs.
  – Snapshots created by just copying the root inode (sketched below).
  – Creates consistency-point snapshot every few seconds.
  – Writes only to unused blocks between consistency points.
  – Recovery = last consistency point + replay NVRAM log.
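
A sketch of why a snapshot costs only a root-inode copy under copy-on-write (illustrative Python, not WAFL's actual structures): blocks are never overwritten, so an old root keeps describing the old tree.

    blocks = {}   # block id -> contents; a block is never rewritten
    next_id = 0

    def alloc(contents):
        # Writes always go to an unused block, as between consistency points.
        global next_id
        blocks[next_id] = contents
        next_id += 1
        return next_id - 1

    root = alloc({"file_a": alloc("v1")})   # root inode -> file blocks
    snapshot = root                         # snapshot = copy the root inode

    # Updating file_a allocates fresh blocks and a fresh root.
    tree = dict(blocks[root])
    tree["file_a"] = alloc("v2")
    root = alloc(tree)

    assert blocks[blocks[snapshot]["file_a"]] == "v1"  # old view preserved
    assert blocks[blocks[root]["file_a"]] == "v2"      # live view updated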

Page 10: XFS

SGI (1994)

Complex journaling filesystem
  – Uses B+ trees to track free space, index dirs, locate file blocks and inodes.
  – Dynamic inode allocation, metadata journaling, volume manager, multithreaded, allocate on flush.
  – 64-bit filesystem (filesystems up to 2^63 bytes).
  – Fast: 90-95% of raw disk bandwidth.

Page 11: Reiserfs

Multiple different versions (v1-v4)

Complex tree-based filesystem
  – Uses B+ trees (v3) or dancing trees (v4).
  – Journaling, allocate on flush, COW, tail-packing.
  – High perf with small files, large directories.
  – Second to ext2 in perf (v3).

Page 12: ZFS

Sun (2004)

Copy-on-write + volume management
  – Variable block size + compression.
  – Built-in volume manager (striping, pooling).
  – Self-healing with 64-bit checksums + mirroring.
  – COW transactional model (live data never overwritten).
  – Fast snapshots (just don't release old blocks).
  – 128-bit filesystem.

Page 13: Distributed Filesystems

Use the filesystem to transparently share data between computers.

Accessing files via a distributed filesystem (sketched below):

1. Client mounts network filesystem.

2. Client makes a request for file access.

3. Client kernel sends network request to server.

4. Server performs file ops on physical disk.

5. Server sends response across network to client.
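
A toy walk-through of those five steps (hypothetical Python; plain function calls stand in for the network messages, and a dict stands in for the server's disk):

    server_disk = {"/export/notes.txt": b"hello"}   # server's "physical disk"

    def server_handle(request):
        # Steps 4-5: server performs the file op and returns a response.
        op, path = request
        if op == "read":
            return server_disk[path]

    def client_read(path):
        # Steps 2-3: client request, "sent" to the server by the kernel.
        return server_handle(("read", path))

    # Step 1, mounting, would map a local mount point to the server's
    # export; this sketch just uses the server-side path directly.
    assert client_read("/export/notes.txt") == b"hello"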

Page 14: Naming

Mapping between logical and physical objects.
  UNIX filenames mapped to inodes.
  Network filenames map to <hostname, vnode> pairs.

Location-independent names
  Filename is a dynamic one-to-many mapping.
  Files can migrate to other servers w/o renaming.
  Files can be replicated across multiple servers.

Page 15: Naming Implementation

Location-dependent (non-transparent):
  filename -> <system, disk, inode>

Location-independent (transparent):
  filename -> file_identifier -> <system, disk, inode>

Identifiers must be unique.

Identifiers must be updated to point to a new physical location when a file is moved (see the sketch below).
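
A minimal sketch of the extra level of indirection (hypothetical Python; real systems keep these tables in directories and location databases): moving the file touches one row, and no filename changes.

    name_to_fid = {"/home/alice/notes": "fid-1234"}
    fid_to_location = {"fid-1234": ("serverA", "disk0", 8112)}  # <system, disk, inode>

    def resolve(filename):
        # filename -> file_identifier -> <system, disk, inode>
        return fid_to_location[name_to_fid[filename]]

    # Migration updates the identifier's location only; every name
    # mapping to fid-1234 follows automatically.
    fid_to_location["fid-1234"] = ("serverB", "disk2", 77)
    assert resolve("/home/alice/notes") == ("serverB", "disk2", 77)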

Page 16: Caching

Problem: Every file access uses the network.

Solution: Store remote data on the local system.
  Cache can be memory or disk based.

Read-ahead can reduce accesses further.

Page 17: Cache Update Policies

Write Through
  Write data to server and cache at once.
  Return to program when server write completes.
  High reliability, poor performance.

Delayed Write
  Write data to cache, then return to program.
  Modifications written through to server later.
  High performance, poor reliability.

(Both policies are sketched below.)
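
A sketch contrasting the two policies (hypothetical Python; dicts stand in for the server and the client cache):

    server, cache, dirty = {}, {}, set()

    def write_through(path, data):
        server[path] = data     # wait for the server write to complete...
        cache[path] = data      # ...before returning: reliable but slow

    def delayed_write(path, data):
        cache[path] = data      # return to the program immediately
        dirty.add(path)         # lost if the client crashes before flushing

    def flush():
        # Modifications written through to the server later.
        for path in dirty:
            server[path] = cache[path]
        dirty.clear()

    delayed_write("/tmp/a", b"x")
    assert "/tmp/a" not in server   # fast, but not yet durable
    flush()
    assert server["/tmp/a"] == b"x"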

Page 18: NFS with Cachefs

Page 19: Cache Consistency Problem

Keeping cached copies consistent with the server.

Consistency overhead can decrease performance if too many writes are done on the same set of files.

Client-initiated consistency
  Client asks server if its cached data is still valid (sketched below).
  When: every file access, or periodically.

Server-initiated consistency
  Server detects conflicts and invalidates client caches.
  Server has to maintain state of what is cached where.
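
A sketch of client-initiated validation (hypothetical Python, using version numbers where a real client might compare modification times):

    server = {"f": ("data-v2", 2)}   # path -> (data, version)
    cache = {"f": ("data-v1", 1)}    # client holds a stale copy

    def server_version(path):
        # The cheap "is my copy still good?" query sent to the server.
        return server[path][1]

    def read(path):
        data, version = cache[path]
        if server_version(path) != version:   # validate on every access
            cache[path] = server[path]        # stale: refetch from server
        return cache[path][0]

    assert read("f") == "data-v2"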

Page 20: Stateful File Access

Stateful process (sketched below):

1. Client sends open request to server.

2. Server opens file, inserts it into open file table.

3. Server returns file identifier to client.

4. Client uses identifier to read/write file.

5. Client closes file.

6. Server removes file from open file table.

Features
  High performance, because fewer disk accesses.
  Problem: clients that crash without closing files.
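
A sketch of the stateful exchange (hypothetical Python): the open-file table is exactly the server state that a crashed client leaves behind.

    import itertools

    files = {"/a": b"abcdef"}
    open_table = {}               # file identifier -> [path, offset]
    ids = itertools.count(1)

    def srv_open(path):
        fid = next(ids)           # steps 2-3: record state, return id
        open_table[fid] = [path, 0]
        return fid

    def srv_read(fid, n):
        # Step 4: request carries only the id; the server remembers the rest.
        path, off = open_table[fid]
        open_table[fid][1] = off + n
        return files[path][off:off + n]

    def srv_close(fid):
        del open_table[fid]       # step 6; never happens if the client crashes

    fid = srv_open("/a")
    assert srv_read(fid, 3) == b"abc"
    assert srv_read(fid, 3) == b"def"   # offset tracked server-side
    srv_close(fid)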

Page 21: Stateless File Service

Every request is self-contained (sketched below).
  Must specify filename and position in every request.
  Server doesn't know which files are open.

Server crashes have minimal effect.
  Stateful servers must poll clients to recover state.
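
The stateless counterpart of the previous sketch (hypothetical Python): every request names the file and position itself, so the server keeps no table and a reboot loses nothing.

    files = {"/a": b"abcdef"}

    def srv_read(path, offset, n):
        # Self-contained request: no open-file table to consult.
        return files[path][offset:offset + n]

    pos = 0                                  # the client tracks its own position
    assert srv_read("/a", pos, 3) == b"abc"
    pos += 3
    assert srv_read("/a", pos, 3) == b"def"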

Page 22: NFS

Sun
  v2 (1984)
  v3 (1992): TCP + 64-bit.

Implementation
  – System calls via Sun RPC calls.
  – Stateless: client obtains filesystem ID on mount, then uses filesystem ID (like a filehandle) in subsequent requests.
  – UNIX-centric (UIDs, GIDs, permissions).
  – Server authenticates by client IP address.
    • Client UIDs mapped to server UIDs with root squashing (sketched below).
    • Danger: client root user can su to any desired UID.
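
A sketch of the UID handling just described (hypothetical Python): the server takes the client-supplied UID at face value, which is the su danger, except that root is squashed to an unprivileged "nobody" UID.

    NOBODY = 65534   # the conventional "nobody" UID

    def map_uid(client_uid):
        # Root squashing: client root gets no special rights on the server.
        return NOBODY if client_uid == 0 else client_uid

    assert map_uid(0) == NOBODY
    assert map_uid(1001) == 1001   # any other UID is trusted as-is,
                                   # so client root can su to 1001 and win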

Page 23: CIFS

Microsoft (1998)
  Derived from the 1980s IBM SMB network filesystem.

Implementation
  Originally ran over NetBIOS, not TCP/IP.
  \\svr\share\path (Universal Naming Convention)
  Auth: NTLM (insecure), NTLMv2, Kerberos.
  MS Windows-centric (filenames, ACLs, EOLs).

Page 24: AFS

CMU (1983)
  – Sold by Transarc/IBM, then free as OpenAFS.

Features
  – Uniform /afs name space.
  – Location-independent file sharing.
  – Whole-file caching on client.
  – Secure authentication via Kerberos.

Page 25: AFS

Global namespace divided into cells
  – Cells are separate authorization domains.
  – Cells included in pathname: /afs/CELL/
  – Ex: cmu.edu, intel.com

Cells contain multiple servers
  – Location independence managed via volume db.
  – Files are located on volumes.
  – Volumes can migrate between servers.
  – Volumes can be replicated in read-only fashion.

Page 26: NFSv4

IETF (2000)
  Based on a 1998 Sun draft.

New Features
  – Only one protocol.
  – Global namespace.
  – Security (ACLs, Kerberos, encryption).
  – Cross-platform + internationalized.
  – Better caching via delegation of files to clients.

Page 27: GoogleFS Assumptions

1. High rate of commodity hardware failures.

2. Small number of huge files (multi-GB +).

3. Reads: large streaming + small random.

4. Most modifications are appends.

5. High bandwidth matters more than low latency.

6. Applications / filesystem co-designed.

Page 28: GoogleFS Architecture

Page 29: GoogleFS Architecture

• Master server
  – Metadata: namespace, ACLs, chunk mapping (sketched below).
  – Chunk lease management, garbage collection, chunk migration.

• Chunk servers
  – Serve chunks (64MB + checksum) of files.
  – Chunks replicated on multiple (3) servers.
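
A sketch of the master's chunk mapping (hypothetical Python; handles and server names are made up): a byte offset selects a chunk index, and the master maps that to a chunk handle and its replicas.

    CHUNK = 64 * 2**20                           # 64MB chunks

    chunk_handle = {("/logs/web", 0): "h-01"}    # (path, chunk index) -> handle
    replicas = {"h-01": ["cs2", "cs5", "cs9"]}   # handle -> 3 chunkservers

    def lookup(path, offset):
        index = offset // CHUNK                  # which chunk holds this byte
        handle = chunk_handle[(path, index)]
        return handle, replicas[handle]

    assert lookup("/logs/web", 10 * 2**20) == ("h-01", ["cs2", "cs5", "cs9"])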

Page 30: GoogleFS Writing

1. Client asks master which chunkserver has the lease.

2. Master responds: leaseholder + replicas.

3. Client pushes data to all replicas.

4. Client sends write to primary replica.

5. Primary forwards request to secondaries.

6. Secondaries reply to primary on completion.

7. Primary replies to client.

(A toy trace of these steps appears below.)
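
A toy trace of the seven steps (hypothetical Python; function calls stand in for RPCs, and per-server dicts stand in for chunk storage):

    stores = {"primary": {}, "sec1": {}, "sec2": {}}   # chunk data per server
    buffers = {}                                       # pushed-but-unapplied data

    def master_lookup(chunk):
        # Steps 1-2: master names the leaseholder and the replicas.
        return "primary", ["sec1", "sec2"]

    def client_write(chunk, offset, data):
        primary, secondaries = master_lookup(chunk)
        for server in [primary] + secondaries:
            buffers[server] = data                     # step 3: push the data
        # Step 4: write request to the primary; step 5: primary applies it
        # and forwards the same mutation order to the secondaries.
        for server in [primary] + secondaries:
            stores[server][(chunk, offset)] = buffers.pop(server)
        return "ok"   # step 6: secondaries ack; step 7: reply to client

    assert client_write("h-01", 0, b"rec") == "ok"
    assert all(("h-01", 0) in s for s in stores.values())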

Page 31: GoogleFS Consistency

File regions can be:
  Consistent: all clients see the same data.
  Defined: consistent + clients will see the entire write.
  Inconsistent: different clients see different data.

Files can be modified by:
  Random write: data written at a specified offset.
  Record append: data is appended atomically at least once. Padding or record duplicates may be inserted as part of an append operation.

Page 32: GoogleFS Consistency

Writers deal with consistency issues by:

1. Preferring appends to random writes.

2. Application-level checkpoints.

3. Self-identifying records with checksums (sketched below).

Readers deal with consistency issues by:

1. Processing file only up until checkpoint.

2. Ignoring padding.

3. Discarding records with duplicate checksums.
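
A sketch of what a self-identifying record buys the reader (hypothetical Python record format: length, checksum, record id, payload): checksums expose padding and torn writes, and record ids expose duplicates from retried appends.

    import struct, zlib

    def make_record(rec_id, payload):
        body = struct.pack(">I", rec_id) + payload
        return struct.pack(">II", len(body), zlib.crc32(body)) + body

    def scan(region):
        seen, pos = set(), 0
        while pos + 8 <= len(region):
            length, crc = struct.unpack_from(">II", region, pos)
            body = region[pos + 8 : pos + 8 + length]
            pos += 8 + length
            if length < 4 or zlib.crc32(body) != crc:
                break                        # padding or a torn write
            rec_id = struct.unpack_from(">I", body)[0]
            if rec_id not in seen:           # drop duplicated appends
                seen.add(rec_id)
                yield body[4:]

    region = make_record(1, b"a") + make_record(1, b"a") + b"\x00" * 16
    assert list(scan(region)) == [b"a"]      # duplicate and padding skipped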

Page 33: Chunk Replication

New Chunks
  – Replicate new chunks on servers with below-average disk utilization.
  – Limit the number of recent chunk creations on each server, due to imminent writes.

Re-replication (prioritization sketched below)
  – Prioritize chunks based on how far a chunk is from its replication goal.
  – Master clones a chunk by choosing a server and telling it to replicate the chunk from the closest replica.
  – Master rebalances chunk distribution periodically.
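
A sketch of the prioritization (hypothetical Python): the further a chunk has fallen below its replication goal, the sooner the master clones it.

    GOAL = 3                                        # desired replica count
    live = {"h-01": 1, "h-02": 2, "h-03": 3}        # handle -> live replicas

    def priority(handle):
        return GOAL - live[handle]                  # replicas still missing

    queue = sorted(live, key=priority, reverse=True)
    assert queue[0] == "h-01"                       # two missing: clone it first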

Page 34: GoogleFS Reliability

Chunk-level reliability
  Incremental checksums on each chunk.
  Chunks replicated by default across 3 servers.

Single master server
  Metadata stored in memory, plus operation log.
  Metadata recovered by polling chunk servers.
  Shadow masters provide read-only access if primary is down.

Page 35: Common Problems

1. Consistency after crash.

2. Large contiguous allocations.

3. Metadata allocation.

Page 36: Consistency

• Detect + Repair
  – Use fsck to repair.
  – Journal replay.

• Always Consistent
  – Copy on write.

Page 37: Large Contiguous Allocations

• Pre-allocation.

• Block groups.

• Multiple block sizes.

Page 38: Metadata Allocation

• Fixed number in one location.

• Fixed number spread across disk.

• Dynamically allocated in files.

Page 39: References

1. Jerry Breecher, “Distributed Filesystems,” http://cs.clarku.edu/~jbreecher/os/lectures/Section17-Dist_File_Sys.ppt
2. Florian Buchholz, “The structure of the Reiser file system,” http://homes.cerias.purdue.edu/~florian/reiser/reiserfs.php, 2006.
3. Remy Card, Theodore Ts'o, Stephen Tweedie, “Design and Implementation of the Second Extended Filesystem,” http://web.mit.edu/tytso/www/linux/ext2intro.html, 1994.
4. Sanjay Ghemawat et al., “The Google File System,” SOSP, 2003.
5. Christopher Hertel, Implementing CIFS, Prentice Hall, 2003.
6. Val Henson, “A Brief History of UNIX Filesystems,” http://infohost.nmt.edu/~val/fs_slides.pdf
7. Dave Hitz, James Lau, Michael Malcolm, “File System Design for an NFS File Server Appliance,” Proceedings of the USENIX Winter 1994 Technical Conference, http://www.netapp.com/library/tr/3002.pdf
8. John Howard et al., “Scale and Performance in a Distributed File System,” ACM Transactions on Computer Systems 6(1), 1988.
9. Marshall K. McKusick, “A Fast File System for UNIX,” ACM Transactions on Computer Systems 2(3), 1984.
10. Brian Pawlowski et al., “The NFS Version 4 Protocol,” SANE 2000.
11. Daniel Robbins, “Advanced Filesystem Implementor's Guide,” IBM developerWorks, http://www-128.ibm.com/developerworks/linux/library/l-fs9.html, 2002.
12. Claudia Rodriguez et al., The Linux Kernel Primer, Prentice Hall, 2005.
13. Mendel Rosenblum and John K. Ousterhout, “The Design and Implementation of a Log-Structured File System,” 13th ACM SOSP, 1991.
14. R. Sandberg, “Design and Implementation of the Sun Network Filesystem,” Proceedings of the USENIX 1985 Summer Conference, 1985.
15. Adam Sweeney et al., “Scalability in the XFS File System,” Proceedings of the USENIX 1996 Annual Technical Conference, 1996.
16. Wikipedia, http://en.wikipedia.org/wiki/Comparison_of_file_systems