file system topics
FILE SYSTEM TOPICS
Lei Xu
Agenda
Introduction, VFS, Optimizations, Examples, FAQ
Introduction
“A file system is a means to organize data expected to be retained after a program terminates by providing procedures to store, retrieve and update data, as well as manage the available space on the device(s) which contain it.” (from Wikipedia)
Store data
Organize data
Access data
Manage storage resources (e.g. hard drive)
Relationship to Architecture Course
A file system sits between memory and secondary storage (or remote servers).
It is one of the most complex parts of an operating system.
Main R&D focuses:
Performance: throughput, latency, scalability
Reliability and availability
Management: snapshots, etc.
Acknowledgement: slides from the 830 course
Different types of file systems
Local file systems: store data on local hard drives, SSDs, floppy drives, optical disks, etc. Examples: NTFS, Ext4, HFS+, ZFS
Network/distributed file systems: store data on remote file server(s). Examples: NFS, CIFS/Samba, AFP, Hadoop DFS, Ceph
Pseudo file systems. Examples: procfs, devfs, tmpfs
See “List of file systems”: http://en.wikipedia.org/wiki/List_of_file_systems
Overall Architecture of Linux file system components
Acknowledgement: “Anatomy of the Linux file system”, IBM developerWorks.
Virtual File System (VFS)
VFS is the essential concept in UNIX-like file systems.
It specifies an interface between the kernel and a concrete file system. Introduced by Sun in 1985.
It passes system calls to the underlying file system, e.g. passes sys_write() to Ext4 (i.e. ext4_write()).
Three major kinds of metadata in VFS: super block, dentry and inode. (Metadata: the data about data, per Wikipedia.)
OO design: each component defines a set of data members and the functions to access them.
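
The dispatch idea on this slide can be sketched in Python under loose assumptions. The names FileSystem, ToyExt4 and VFS below are hypothetical and only mimic the roles described above; the real kernel uses C structures of function pointers, not Python classes.

```python
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Hypothetical stand-in for the interface a concrete file system implements."""

    @abstractmethod
    def write(self, path, data):
        """Store data under path and return the number of bytes written."""


class ToyExt4(FileSystem):
    """Toy 'concrete' file system (not real Ext4): keeps contents in a dict."""

    def __init__(self):
        self.files = {}  # path -> bytes

    def write(self, path, data):
        self.files[path] = self.files.get(path, b"") + data
        return len(data)


class VFS:
    """Routes a generic write call to whichever file system owns the path."""

    def __init__(self):
        self.mounts = {}  # mount point -> FileSystem

    def mount(self, mount_point, fs):
        self.mounts[mount_point] = fs

    def sys_write(self, path, data):
        # Pick the longest mount point that is a prefix of the path.
        candidates = [
            m for m in self.mounts
            if path == m or path.startswith(m.rstrip("/") + "/")
        ]
        if not candidates:
            raise FileNotFoundError(path)
        best = max(candidates, key=len)
        return self.mounts[best].write(path, data)
```

The caller only ever talks to VFS.sys_write(); which concrete write() runs is decided by the mount table, which is the point of the interface.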
Super block
A segment of metadata that describes a file system.
It is constructed when a file system is mounted.
Usually, a persistent copy of the super block is stored at the beginning of the storage device.
Describes:
File system type, size, status (e.g. dirty bit, read-only bit)
Block size, max file size in bytes, device size, ...
How to find other metadata and data
How to manipulate these data (i.e. sb_ops)
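
Some of these superblock-level facts are visible from user space. A minimal sketch, assuming a Unix-like system where Python's os.statvfs() is available; the helper name describe_mounted_fs is made up for illustration, and the call reports what the kernel exposes rather than the raw on-disk super block.

```python
import os


def describe_mounted_fs(path="/"):
    """Return a few superblock-style facts about the file system holding path."""
    st = os.statvfs(path)  # Unix-only call
    return {
        "block_size": st.f_bsize,                  # file system block size in bytes
        "total_bytes": st.f_frsize * st.f_blocks,  # size of the file system
        "free_bytes": st.f_frsize * st.f_bfree,    # unallocated space
        "total_inodes": st.f_files,                # number of inode slots
        "max_name_length": st.f_namemax,           # longest allowed file name
        "read_only": bool(st.f_flag & os.ST_RDONLY),
    }
```

Calling describe_mounted_fs("/") returns a dictionary whose values depend on the machine, so no sample output is shown here.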
Inode
“Index node” in Unix-style file systems.
Holds all information about one file (or directory) except its name, e.g. owner, access rights, mode, size, timestamps, etc.
In UNIX-like systems, file names are stored in the directory file: its content is an “array” of file names.
Holds pointers to the file's data.
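
That the name lives outside the inode can be observed with a hard link: two names, one inode. The sketch below assumes a file system that supports hard links; inode_demo and the file names are hypothetical.

```python
import os
import tempfile


def inode_demo():
    """Create one file with two names and report what stat() says about it."""
    with tempfile.TemporaryDirectory() as d:
        first = os.path.join(d, "report.txt")
        second = os.path.join(d, "alias.txt")
        with open(first, "w") as f:
            f.write("hello")
        os.link(first, second)  # second directory entry for the same inode
        a, b = os.stat(first), os.stat(second)
        return {
            "same_inode": a.st_ino == b.st_ino,  # both names resolve to one inode
            "link_count": a.st_nlink,            # names pointing at the inode
            "size": a.st_size,                   # stored in the inode, not the name
            "mode": oct(a.st_mode & 0o777),      # access rights
        }
```

Owner, mode, size and timestamps all come back from os.stat(), while neither name appears anywhere in the result.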
Directory Entry (dentry)
A dentry conceptually maps a file name to its corresponding inode.
Each file/directory has a dentry representing it.
File systems use dentries to look up a file in the hierarchical namespace.
Each dentry has a pointer to the dentry of its parent directory.
Each dentry of a directory has a list of the dentries of its sub-directories and files.
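
The parent pointer and child list described above can be modelled with a small tree. This is an illustrative sketch only: the Dentry class and lookup() function are hypothetical and store a bare inode number where the kernel would hold a pointer to an inode object.

```python
class Dentry:
    """Toy dentry: a name, an inode number, a parent pointer and children."""

    def __init__(self, name, inode, parent=None):
        self.name = name
        self.inode = inode
        self.parent = parent
        self.children = {}  # child name -> Dentry
        if parent is not None:
            parent.children[name] = self

    def path(self):
        """Rebuild the full path by following parent pointers up to the root."""
        parts = []
        node = self
        while node.parent is not None:
            parts.append(node.name)
            node = node.parent
        return "/" + "/".join(reversed(parts))


def lookup(root, path):
    """Walk from the root one component at a time; None if a name is missing."""
    node = root
    for component in path.strip("/").split("/"):
        if not component:
            continue  # handles "/" and doubled slashes
        node = node.children.get(component)
        if node is None:
            return None
    return node
```

Child lists make the downward walk in lookup() possible, and parent pointers make the upward walk in path() possible.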
Optimizations
Most file system optimizations are designed around the characteristics of the memory hierarchy and storage devices.
Recall: RAM access takes 50-100 ns; disk access takes 5-10 ms. That is roughly 5 orders of magnitude difference (5 ms / 100 ns = 50,000).
Almost all widely used local file systems are designed for hard disk drives, which have their own unique characteristics.
Hard Disk Drive (HDD)
Stores data on one or more rotating disks coated with magnetic material.
Introduced by IBM in 1956.
Uses a magnetic head to read and write data.
The very early HDD…
Acknowledge to:
HDD (Cont’d)
The essential structure of the HDD has not changed much.
It consists of several disks (platters).
Each disk is divided into tracks, each of which is in turn divided into sectors.
The single most significant performance factor: seek time
Why seek time matters
To access a piece of data (a sector), the HDD head must first move to the right track (seek time), then wait for the disk to rotate the sector under the head (rotational latency).
Seek time: 3 ms on high-end server disks, 12 ms on desktop-level disks [1]
Average rotational latency: 5.56 ms on a 5400 RPM HDD, 4.17 ms on a 7200 RPM HDD [1]
As a result, sequential IO is much faster than random IO, because it avoids most seek and rotational delays.
[1] http://en.wikipedia.org/wiki/Disk-drive_performance_characteristics
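
The rotational figures follow from the spindle speed: on average the head waits half a revolution. The sketch below reproduces that arithmetic and compares positioning costs. The function names are hypothetical, and the model assumes transfer time is zero and that a sequential run pays for positioning only once.

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for a sector: half of one full revolution, in milliseconds."""
    ms_per_revolution = 60_000.0 / rpm
    return ms_per_revolution / 2.0


def random_positioning_ms(n_requests, seek_ms, rpm):
    """Every request pays a seek plus the average rotational latency."""
    return n_requests * (seek_ms + avg_rotational_latency_ms(rpm))


def sequential_positioning_ms(seek_ms, rpm):
    """Only the first request pays for positioning (simplifying assumption)."""
    return seek_ms + avg_rotational_latency_ms(rpm)
```

With a 12 ms seek on a 7200 RPM disk, 1,000 random requests cost about 16 seconds of positioning in this model, against about 16 ms for one sequential run of the same length.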
General Optimizations
Based on two principles:
RAM access is much faster than disk access.
Sequential IO is much faster than random IO on disk.
So we design file systems that:
Make heavy use of CPU/RAM to reduce IO to disks (various caches and write buffers)
Prefer sequential IO
Compute the disk layout so that related data is located sequentially on disk
Dcache
Dentry cache (dcache)
Directories are stored as files on disk. For each file lookup, we want to obtain the inode from the given full file path.
The OS looks up the dentries from the root through all parent directories in the path.
E.g. to look up the file “/Users/john/Documents/course.pdf”, the OS needs to traverse the dentries that represent “/”, “Users”, “john”, “Documents”, and “course.pdf”.
To accelerate this:
We use a global hash table (dcache) to map “file path” -> dentry.
A two-list solution: one list for active dentries, and one for recently unused dentries (LRU).
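
A hash table combined with an LRU list of unused entries can be sketched as follows. The DentryCache class is hypothetical. It keys on the full path, as the slide does, whereas the Linux kernel hashes on the parent dentry plus one name component; it also stores an inode number in place of a dentry object.

```python
from collections import OrderedDict


class DentryCache:
    """Toy dcache: in-use entries in a dict, unused entries in an LRU list."""

    def __init__(self, max_unused):
        self.active = {}             # path -> inode number, currently in use
        self.unused = OrderedDict()  # path -> inode number, oldest first
        self.max_unused = max_unused

    def lookup(self, path):
        """Return the cached inode number, or None on a miss."""
        if path in self.active:
            return self.active[path]
        if path in self.unused:
            # A hit on the unused list makes the entry active again.
            self.active[path] = self.unused.pop(path)
            return self.active[path]
        return None

    def insert(self, path, inode):
        """Add a freshly resolved entry as active."""
        self.unused.pop(path, None)
        self.active[path] = inode

    def release(self, path):
        """Last user is done: move the entry to the LRU list, evicting if full."""
        if path not in self.active:
            return
        self.unused[path] = self.active.pop(path)
        while len(self.unused) > self.max_unused:
            self.unused.popitem(last=False)  # drop the least recently used
```

Only entries on the unused list are ever evicted, so a dentry that is still in use cannot disappear underneath its user.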
Inode cache
Similar to the dcache, the OS maintains a cache of inode objects.
Each cached inode object is tied to a dentry (more than one when the file has hard links).
If the dentry object is evicted, its inode is evicted as well.
Page Cache
“…a ‘transparent’ buffer for disk-backed pages kept in RAM for fast access…” [Wikipedia]
A write-back cache.
Main purpose: reducing the number of IOs to disk.
Access is page based (usually 4 KB).
The page cache is kept per file, as a radix tree in the inode object.
Prefetches pages to serve future reads.
Absorbs writes to reduce the number of IOs.
Dirty (modified) pages are flushed to disk 1) periodically (every 30 s or 5 s), or 2) when the OS wants to reclaim RAM.
A flush can also be forced by calling the fsync() system call.
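
How a write-back cache absorbs writes can be shown with a toy model. The PageCache class below is hypothetical: it uses a plain dict where the kernel uses a radix tree, it leaves out prefetching and the periodic flush timer, and its fsync() method only imitates the effect of the real system call.

```python
PAGE_SIZE = 4096  # bytes per page, the usual size named on the slide


class PageCache:
    """Toy write-back page cache for one file, counting I/Os sent to 'disk'."""

    def __init__(self):
        self.disk = {}      # page index -> bytes, stands in for the device
        self.pages = {}     # page index -> bytearray, cached in RAM
        self.dirty = set()  # indices of pages modified since the last flush
        self.disk_reads = 0
        self.disk_writes = 0

    def _load(self, index):
        """Bring a page into RAM on first touch; later touches are cache hits."""
        if index not in self.pages:
            self.disk_reads += 1
            data = self.disk.get(index, bytes(PAGE_SIZE))
            self.pages[index] = bytearray(data)
        return self.pages[index]

    def write(self, offset, data):
        """Modify cached pages only; nothing reaches the disk yet."""
        for i, byte in enumerate(data):
            index, within = divmod(offset + i, PAGE_SIZE)
            self._load(index)[within] = byte
            self.dirty.add(index)

    def read(self, offset, length):
        """Serve reads from cached pages, loading any that are missing."""
        out = bytearray()
        for pos in range(offset, offset + length):
            index, within = divmod(pos, PAGE_SIZE)
            out.append(self._load(index)[within])
        return bytes(out)

    def fsync(self):
        """Write every dirty page back, one disk write per page."""
        for index in sorted(self.dirty):
            self.disk[index] = bytes(self.pages[index])
            self.disk_writes += 1
        self.dirty.clear()
```

One hundred small writes that land in the same page cost a single disk write at flush time, which is the absorption effect described above.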
Examples
Several concrete file system designs:
Ext4: classic UNIX-like file system concepts
NTFS: advanced Windows file system
ZFS: “the last word in file systems”
NFS: a standard network file system
Google File System: a special distributed file system for special requirements
Ext4
The latest version of the “extended file system” (Ext2/3/4).
The standard Linux file system for a long time.
Inspired by UFS from BSD/Solaris.
Groups files into block groups.
Keeps file data near its inodes.
Ack: http://bit.ly/tjipWY
NTFS
“New Technology File System” (NTFS).
The standard file system in the Windows world.
A Master File Table (MFT) contains all metadata.
A directory is also a file.
ZFS
ZFS: “the last word in file systems”
The most advanced local file system in production.
128-bit address space (2^128 bytes in theory), larger than the number of grains of sand on Earth…
A lot of advanced features, e.g. transactional commits, end-to-end integrity, snapshots, volume management and much more…
Designed to never lose data and to always be consistent.
Every OS community wants to clone or port its features: Btrfs on Linux, ReFS on Windows, ZFS on FreeBSD.
NFS
“Network File System” (NFS)
A protocol developed by Sun in 1984.
A set of RPC calls.
An IETF standard.
Supported by all major OSs.
Simple and efficient.
Google File System (GFS)
A large distributed file system specially designed for the MapReduce framework.
High throughput.
High availability.
Specially designed: not compatible with the VFS/POSIX API.
Requires clients to be linked against the GFS library.
Hadoop DFS clones the concepts of GFS.
More File Systems
Interesting file systems that are worth exploring:
Btrfs (B-tree FS) from Oracle, expected to be the next standard Linux file system. Many concepts are shared with ZFS.
ReFS: the file system for Windows 8 (from Microsoft). Many concepts are shared with ZFS (too!).
WAFL (Write Anywhere File Layout): a file system from NetApp.
FUSE (Filesystem in Userspace): a cross-platform library that allows developers to write file systems that run in user mode.
Thanks
FAQ?