
Page 1: CSC 660: Advanced OS

Distributed Filesystems

Page 2: Topics

1. Filesystem History

2. Distributed Filesystems

3. AFS

4. GoogleFS

5. Common filesystem issues

Page 3: Filesystem History

• FS (1974)
• Fast Filesystem (FFS) / UFS (1984)
• Log-structured Filesystem (1991)
• ext2 (1993)
• ext3 (2001)
• WAFL (1994)
• XFS (1994)
• Reiserfs (1998)
• ZFS (2004)

Page 4: FS

• First UNIX filesystem (1974)
• Simple
  – Layout: superblock, inodes, then data blocks.
  – Unused blocks stored in free linked list, not bitmap.
  – 512-byte blocks, no fragments.
  – Short filenames.
• Slow: 2% of raw disk bandwidth.
  – Disk seeks consume most file access time due to small block size and high fragmentation.
  – Later doubled perf by using 1KB blocks.

Page 5: FFS

• BSD (1984), basis for SYSV UFS
• More complex
  – Cylinder groups: inodes, bitmaps, data blocks.
  – Larger blocks (4K) with 1K fragments.
  – Block layout based on physical disk parameters.
  – Long filenames, symlinks, file locks, quotas.
  – 10% space reserved by default.
• Faster: 14-47% of raw disk bandwidth.
  – Creating a new file requires 5 seeks: 2 inode seeks, 1 file data, 1 dir data, 1 dir inode.
  – User/kernel memory copies take 40% of disk op time.

Page 6: Log-structured Filesystem (LFS)

• All data stored as sequential log entries.
  – Divided into large log segments.
  – Cleaner defragments, produces new segments.
• Fast recovery: checkpoint + roll forward.
• Performance: 70% of raw disk bandwidth.
  – Large sequential writes vs. multiple writes/seeks.
  – Inode map tracks dynamic locations of inodes (sketched below).
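
The inode map is the indirection that lets LFS append everything: an updated inode is written to a new place in the log, and only the map entry changes. A minimal sketch in Python (hypothetical names; a list stands in for the on-disk log):

    log = []        # the on-disk log, written strictly sequentially
    inode_map = {}  # inode number -> index of the latest copy in the log

    def write_inode(ino, contents):
        # Updates never overwrite: append a new copy, repoint the map.
        log.append((ino, contents))
        inode_map[ino] = len(log) - 1

    def read_inode(ino):
        # Follow the map to the inode's current home in the log.
        return log[inode_map[ino]][1]

    write_inode(2, {"size": 0})
    write_inode(2, {"size": 4096})          # update = another append
    assert read_inode(2) == {"size": 4096}  # map tracks the newest copy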

Page 7: ext2 and ext3

FFS + performance features.
  – Variable block size (1K-4K), no fragments.
  – Partitions disk into block groups.
  – Data block preallocation + read-ahead.
  – Fast symlinks (stored in inode).
  – 5% space reserved by default.
  – Very fast.

ext3 adds journaling capabilities.

Page 8: WAFL

Network Appliance (1994)

Metadata in files
  – Root inode points to inode file.
  – Filesystem is a tree of blocks rooted at the inode file.
  – Writing metadata anywhere is faster with RAID.
  – Allows filesystem to be expanded on the fly.

Page 9: WAFL

Copy-on-write snapshots
  – Hourly (4/day, keep 2 days), daily (keep 7 days).
  – Users can get deleted files from .snapshot dirs.
  – Snapshots created by just copying the root inode (sketched below).
  – Creates consistency-point snapshot every few seconds.
  – Writes only to unused blocks between consistency points.
  – Recovery = last consistency point + replay NVRAM log.
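
A sketch of why a snapshot costs only a root-inode copy under copy-on-write (illustrative Python, not WAFL's actual structures): blocks are never overwritten, so an old root keeps describing the old tree.

    blocks = {}   # block id -> contents; a block is never rewritten
    next_id = 0

    def alloc(contents):
        # Writes always go to an unused block, as between consistency points.
        global next_id
        blocks[next_id] = contents
        next_id += 1
        return next_id - 1

    root = alloc({"file_a": alloc("v1")})   # root inode -> file blocks
    snapshot = root                         # snapshot = copy the root inode

    # Updating file_a allocates fresh blocks and a fresh root.
    tree = dict(blocks[root])
    tree["file_a"] = alloc("v2")
    root = alloc(tree)

    assert blocks[blocks[snapshot]["file_a"]] == "v1"  # old view preserved
    assert blocks[blocks[root]["file_a"]] == "v2"      # live view updated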

Page 10: XFS

SGI (1994)

Complex journaling filesystem
  – Uses B+ trees to track free space, index dirs, locate file blocks and inodes.
  – Dynamic inode allocation, metadata journaling, volume manager, multithreaded, allocate on flush.
  – 64-bit filesystem (filesystems up to 2^63 bytes).
  – Fast: 90-95% of raw disk bandwidth.

Page 11: Reiserfs

Multiple different versions (v1-v4)

Complex tree-based filesystem
  – Uses B+ trees (v3) or dancing trees (v4).
  – Journaling, allocate on flush, COW, tail-packing.
  – High perf with small files, large directories.
  – Second to ext2 in perf (v3).

Page 12: ZFS

Sun (2004)

Copy-on-write + volume management
  – Variable block size + compression.
  – Built-in volume manager (striping, pooling).
  – Self-healing with 64-bit checksums + mirroring.
  – COW transactional model (live data never overwritten).
  – Fast snapshots (just don't release old blocks).
  – 128-bit filesystem.

Page 13: Distributed Filesystems

Use the filesystem to transparently share data between computers.

Accessing files via a distributed filesystem (sketched below):

1. Client mounts network filesystem.

2. Client makes a request for file access.

3. Client kernel sends network request to server.

4. Server performs file ops on physical disk.

5. Server sends response across network to client.
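
A toy walk-through of those five steps (hypothetical Python; plain function calls stand in for the network messages, and a dict stands in for the server's disk):

    server_disk = {"/export/notes.txt": b"hello"}   # server's "physical disk"

    def server_handle(request):
        # Steps 4-5: server performs the file op and returns a response.
        op, path = request
        if op == "read":
            return server_disk[path]

    def client_read(path):
        # Steps 2-3: client request, "sent" to the server by the kernel.
        return server_handle(("read", path))

    # Step 1, mounting, would map a local mount point to the server's
    # export; this sketch just uses the server-side path directly.
    assert client_read("/export/notes.txt") == b"hello"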

Page 14: Naming

Mapping between logical and physical objects.
  UNIX filenames mapped to inodes.
  Network filenames map to <hostname, vnode> pairs.

Location-independent names
  Filename is a dynamic one-to-many mapping.
  Files can migrate to other servers w/o renaming.
  Files can be replicated across multiple servers.

Page 15: Naming Implementation

Location-dependent (non-transparent):
  filename -> <system, disk, inode>

Location-independent (transparent):
  filename -> file_identifier -> <system, disk, inode>

Identifiers must be unique.

Identifiers must be updated to point to a new physical location when a file is moved (see the sketch below).
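
A minimal sketch of the extra level of indirection (hypothetical Python; real systems keep these tables in directories and location databases): moving the file touches one row, and no filename changes.

    name_to_fid = {"/home/alice/notes": "fid-1234"}
    fid_to_location = {"fid-1234": ("serverA", "disk0", 8112)}  # <system, disk, inode>

    def resolve(filename):
        # filename -> file_identifier -> <system, disk, inode>
        return fid_to_location[name_to_fid[filename]]

    # Migration updates the identifier's location only; every name
    # mapping to fid-1234 follows automatically.
    fid_to_location["fid-1234"] = ("serverB", "disk2", 77)
    assert resolve("/home/alice/notes") == ("serverB", "disk2", 77)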

Page 16: Caching

Problem: Every file access uses the network.

Solution: Store remote data on the local system.
  Cache can be memory or disk based.

Read-ahead can reduce accesses further.

Page 17: Cache Update Policies

Write Through
  Write data to server and cache at once.
  Return to program when server write completes.
  High reliability, poor performance.

Delayed Write
  Write data to cache, then return to program.
  Modifications written through to server later.
  High performance, poor reliability.

(Both policies are sketched below.)
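
A sketch contrasting the two policies (hypothetical Python; dicts stand in for the server and the client cache):

    server, cache, dirty = {}, {}, set()

    def write_through(path, data):
        server[path] = data     # wait for the server write to complete...
        cache[path] = data      # ...before returning: reliable but slow

    def delayed_write(path, data):
        cache[path] = data      # return to the program immediately
        dirty.add(path)         # lost if the client crashes before flushing

    def flush():
        # Modifications written through to the server later.
        for path in dirty:
            server[path] = cache[path]
        dirty.clear()

    delayed_write("/tmp/a", b"x")
    assert "/tmp/a" not in server   # fast, but not yet durable
    flush()
    assert server["/tmp/a"] == b"x"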

Page 18: NFS with Cachefs

Page 19: Cache Consistency Problem

Keeping cached copies consistent with the server.

Consistency overhead can decrease performance if too many writes are done on the same set of files.

Client-initiated consistency
  Client asks server if its cached data is still valid (sketched below).
  When: every file access, or periodically.

Server-initiated consistency
  Server detects conflicts and invalidates client caches.
  Server has to maintain state of what is cached where.
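
A sketch of client-initiated validation (hypothetical Python, using version numbers where a real client might compare modification times):

    server = {"f": ("data-v2", 2)}   # path -> (data, version)
    cache = {"f": ("data-v1", 1)}    # client holds a stale copy

    def server_version(path):
        # The cheap "is my copy still good?" query sent to the server.
        return server[path][1]

    def read(path):
        data, version = cache[path]
        if server_version(path) != version:   # validate on every access
            cache[path] = server[path]        # stale: refetch from server
        return cache[path][0]

    assert read("f") == "data-v2"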

Page 20: Stateful File Access

Stateful process (sketched below):

1. Client sends open request to server.

2. Server opens file, inserts it into open file table.

3. Server returns file identifier to client.

4. Client uses identifier to read/write file.

5. Client closes file.

6. Server removes file from open file table.

Features
  High performance, because fewer disk accesses.
  Problem: clients that crash without closing files.
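
A sketch of the stateful exchange (hypothetical Python): the open-file table is exactly the server state that a crashed client leaves behind.

    import itertools

    files = {"/a": b"abcdef"}
    open_table = {}               # file identifier -> [path, offset]
    ids = itertools.count(1)

    def srv_open(path):
        fid = next(ids)           # steps 2-3: record state, return id
        open_table[fid] = [path, 0]
        return fid

    def srv_read(fid, n):
        # Step 4: request carries only the id; the server remembers the rest.
        path, off = open_table[fid]
        open_table[fid][1] = off + n
        return files[path][off:off + n]

    def srv_close(fid):
        del open_table[fid]       # step 6; never happens if the client crashes

    fid = srv_open("/a")
    assert srv_read(fid, 3) == b"abc"
    assert srv_read(fid, 3) == b"def"   # offset tracked server-side
    srv_close(fid)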

Page 21: Stateless File Service

Every request is self-contained (sketched below).
  Must specify filename and position in every request.
  Server doesn't know which files are open.

Server crashes have minimal effect.
  Stateful servers must poll clients to recover state.
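
The stateless counterpart of the previous sketch (hypothetical Python): every request names the file and position itself, so the server keeps no table and a reboot loses nothing.

    files = {"/a": b"abcdef"}

    def srv_read(path, offset, n):
        # Self-contained request: no open-file table to consult.
        return files[path][offset:offset + n]

    pos = 0                                  # the client tracks its own position
    assert srv_read("/a", pos, 3) == b"abc"
    pos += 3
    assert srv_read("/a", pos, 3) == b"def"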

Page 22: NFS

Sun
  v2 (1984)
  v3 (1992): TCP + 64-bit.

Implementation
  – System calls via Sun RPC calls.
  – Stateless: client obtains filesystem ID on mount, then uses filesystem ID (like a filehandle) in subsequent requests.
  – UNIX-centric (UIDs, GIDs, permissions).
  – Server authenticates by client IP address.
    • Client UIDs mapped to server UIDs with root squashing (sketched below).
    • Danger: client root user can su to any desired UID.
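
A sketch of the UID handling just described (hypothetical Python): the server takes the client-supplied UID at face value, which is the su danger, except that root is squashed to an unprivileged "nobody" UID.

    NOBODY = 65534   # the conventional "nobody" UID

    def map_uid(client_uid):
        # Root squashing: client root gets no special rights on the server.
        return NOBODY if client_uid == 0 else client_uid

    assert map_uid(0) == NOBODY
    assert map_uid(1001) == 1001   # any other UID is trusted as-is,
                                   # so client root can su to 1001 and win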

Page 23: CIFS

Microsoft (1998)
  Derived from the 1980s IBM SMB network filesystem.

Implementation
  Originally ran over NetBIOS, not TCP/IP.
  \\svr\share\path (Universal Naming Convention)
  Auth: NTLM (insecure), NTLMv2, Kerberos.
  MS Windows-centric (filenames, ACLs, EOLs).

Page 24: AFS

CMU (1983)
  – Sold by Transarc/IBM, then free as OpenAFS.

Features
  – Uniform /afs name space.
  – Location-independent file sharing.
  – Whole-file caching on client.
  – Secure authentication via Kerberos.

Page 25: AFS

Global namespace divided into cells
  – Cells are separate authorization domains.
  – Cells included in pathname: /afs/CELL/
  – Ex: cmu.edu, intel.com

Cells contain multiple servers
  – Location independence managed via volume db.
  – Files are located on volumes.
  – Volumes can migrate between servers.
  – Volumes can be replicated in read-only fashion.

Page 26: NFSv4

IETF (2000)
  Based on a 1998 Sun draft.

New Features
  – Only one protocol.
  – Global namespace.
  – Security (ACLs, Kerberos, encryption).
  – Cross-platform + internationalized.
  – Better caching via delegation of files to clients.

Page 27: GoogleFS Assumptions

1. High rate of commodity hardware failures.

2. Small number of huge files (multi-GB +).

3. Reads: large streaming + small random.

4. Most modifications are appends.

5. High bandwidth matters more than low latency.

6. Applications / filesystem co-designed.

Page 28: GoogleFS Architecture

Page 29: GoogleFS Architecture

• Master server
  – Metadata: namespace, ACLs, chunk mapping (sketched below).
  – Chunk lease management, garbage collection, chunk migration.

• Chunk servers
  – Serve chunks (64MB + checksum) of files.
  – Chunks replicated on multiple (3) servers.
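
A sketch of the master's chunk mapping (hypothetical Python; handles and server names are made up): a byte offset selects a chunk index, and the master maps that to a chunk handle and its replicas.

    CHUNK = 64 * 2**20                           # 64MB chunks

    chunk_handle = {("/logs/web", 0): "h-01"}    # (path, chunk index) -> handle
    replicas = {"h-01": ["cs2", "cs5", "cs9"]}   # handle -> 3 chunkservers

    def lookup(path, offset):
        index = offset // CHUNK                  # which chunk holds this byte
        handle = chunk_handle[(path, index)]
        return handle, replicas[handle]

    assert lookup("/logs/web", 10 * 2**20) == ("h-01", ["cs2", "cs5", "cs9"])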

Page 30: GoogleFS Writing

1. Client asks master which chunkserver has the lease.

2. Master responds: leaseholder + replicas.

3. Client pushes data to all replicas.

4. Client sends write to primary replica.

5. Primary forwards request to secondaries.

6. Secondaries reply to primary on completion.

7. Primary replies to client.

(A toy trace of these steps appears below.)
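
A toy trace of the seven steps (hypothetical Python; function calls stand in for RPCs, and per-server dicts stand in for chunk storage):

    stores = {"primary": {}, "sec1": {}, "sec2": {}}   # chunk data per server
    buffers = {}                                       # pushed-but-unapplied data

    def master_lookup(chunk):
        # Steps 1-2: master names the leaseholder and the replicas.
        return "primary", ["sec1", "sec2"]

    def client_write(chunk, offset, data):
        primary, secondaries = master_lookup(chunk)
        for server in [primary] + secondaries:
            buffers[server] = data                     # step 3: push the data
        # Step 4: write request to the primary; step 5: primary applies it
        # and forwards the same mutation order to the secondaries.
        for server in [primary] + secondaries:
            stores[server][(chunk, offset)] = buffers.pop(server)
        return "ok"   # step 6: secondaries ack; step 7: reply to client

    assert client_write("h-01", 0, b"rec") == "ok"
    assert all(("h-01", 0) in s for s in stores.values())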

Page 31: GoogleFS Consistency

File regions can be:
  Consistent: all clients see the same data.
  Defined: consistent + clients will see the entire write.
  Inconsistent: different clients see different data.

Files can be modified by:
  Random write: data written at a specified offset.
  Record append: data is appended atomically at least once. Padding or record duplicates may be inserted as part of an append operation.

Page 32: GoogleFS Consistency

Writers deal with consistency issues by:

1. Preferring appends to random writes.

2. Application-level checkpoints.

3. Self-identifying records with checksums (sketched below).

Readers deal with consistency issues by:

1. Processing file only up until checkpoint.

2. Ignoring padding.

3. Discarding records with duplicate checksums.
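
A sketch of what a self-identifying record buys the reader (hypothetical Python record format: length, checksum, record id, payload): checksums expose padding and torn writes, and record ids expose duplicates from retried appends.

    import struct, zlib

    def make_record(rec_id, payload):
        body = struct.pack(">I", rec_id) + payload
        return struct.pack(">II", len(body), zlib.crc32(body)) + body

    def scan(region):
        seen, pos = set(), 0
        while pos + 8 <= len(region):
            length, crc = struct.unpack_from(">II", region, pos)
            body = region[pos + 8 : pos + 8 + length]
            pos += 8 + length
            if length < 4 or zlib.crc32(body) != crc:
                break                        # padding or a torn write
            rec_id = struct.unpack_from(">I", body)[0]
            if rec_id not in seen:           # drop duplicated appends
                seen.add(rec_id)
                yield body[4:]

    region = make_record(1, b"a") + make_record(1, b"a") + b"\x00" * 16
    assert list(scan(region)) == [b"a"]      # duplicate and padding skipped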

Page 33: Chunk Replication

New Chunks
  – Replicate new chunks on servers with below-average disk utilization.
  – Limit the number of recent chunk creations on each server, due to imminent writes.

Re-replication (prioritization sketched below)
  – Prioritize chunks based on how far a chunk is from its replication goal.
  – Master clones a chunk by choosing a server and telling it to replicate the chunk from the closest replica.
  – Master rebalances chunk distribution periodically.
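
A sketch of the prioritization (hypothetical Python): the further a chunk has fallen below its replication goal, the sooner the master clones it.

    GOAL = 3                                        # desired replica count
    live = {"h-01": 1, "h-02": 2, "h-03": 3}        # handle -> live replicas

    def priority(handle):
        return GOAL - live[handle]                  # replicas still missing

    queue = sorted(live, key=priority, reverse=True)
    assert queue[0] == "h-01"                       # two missing: clone it first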

Page 34: GoogleFS Reliability

Chunk-level reliability
  Incremental checksums on each chunk.
  Chunks replicated by default across 3 servers.

Single master server
  Metadata stored in memory, plus operation log.
  Metadata recovered by polling chunk servers.
  Shadow masters provide read-only access if primary is down.

Page 35: Common Problems

1. Consistency after crash.

2. Large contiguous allocations.

3. Metadata allocation.

Page 36: Consistency

• Detect + Repair
  – Use fsck to repair.
  – Journal replay.

• Always Consistent
  – Copy on write.

Page 37: Large Contiguous Allocations

• Pre-allocation.

• Block groups.

• Multiple block sizes.

Page 38: Metadata Allocation

• Fixed number in one location.

• Fixed number spread across disk.

• Dynamically allocated in files.

Page 39: References

1. Jerry Breecher, “Distributed Filesystems,” http://cs.clarku.edu/~jbreecher/os/lectures/Section17-Dist_File_Sys.ppt
2. Florian Buchholz, “The structure of the Reiser file system,” http://homes.cerias.purdue.edu/~florian/reiser/reiserfs.php, 2006.
3. Remy Card, Theodore Ts'o, Stephen Tweedie, “Design and Implementation of the Second Extended Filesystem,” http://web.mit.edu/tytso/www/linux/ext2intro.html, 1994.
4. Sanjay Ghemawat et al., “The Google File System,” SOSP, 2003.
5. Christopher Hertel, Implementing CIFS, Prentice Hall, 2003.
6. Val Henson, “A Brief History of UNIX Filesystems,” http://infohost.nmt.edu/~val/fs_slides.pdf
7. Dave Hitz, James Lau, Michael Malcolm, “File System Design for an NFS File Server Appliance,” Proceedings of the USENIX Winter 1994 Technical Conference, http://www.netapp.com/library/tr/3002.pdf
8. John Howard et al., “Scale and Performance in a Distributed File System,” ACM Transactions on Computer Systems 6(1), 1988.
9. Marshall K. McKusick, “A Fast File System for UNIX,” ACM Transactions on Computer Systems 2(3), 1984.
10. Brian Pawlowski et al., “The NFS Version 4 Protocol,” SANE 2000.
11. Daniel Robbins, “Advanced Filesystem Implementor's Guide,” IBM developerWorks, http://www-128.ibm.com/developerworks/linux/library/l-fs9.html, 2002.
12. Claudia Rodriguez et al., The Linux Kernel Primer, Prentice Hall, 2005.
13. Mendel Rosenblum and John K. Ousterhout, “The Design and Implementation of a Log-Structured File System,” 13th ACM SOSP, 1991.
14. R. Sandberg, “Design and Implementation of the Sun Network Filesystem,” Proceedings of the USENIX 1985 Summer Conference, 1985.
15. Adam Sweeney et al., “Scalability in the XFS File System,” Proceedings of the USENIX 1996 Annual Technical Conference, 1996.
16. Wikipedia, http://en.wikipedia.org/wiki/Comparison_of_file_systems