File system concepts
• Ease of searching for specific data
– File to group data: variable size, naming
– Directory to group files
[Figure: directory entries (file name, file offset) pointing to file data]
Unix file systems history
• Unix file system (System V, 1974)
• Berkeley fast file system (BSD 4.2, 1984)
• HFS (1985)
• Minix file system (Minix, 1987)
• Journaling file system (AIX, 1990)
• Log-structured file system (1991)
• Extended file system (Linux, 1992)
• Ext2 file system (1993)
• XFS (IRIX, 1994)
• HFS+ (1998)
• Journaling file system (OS/2, 1999)
• Ext3 file system (2001)
• Journaling file system (Linux, 2001)
• XFS (Linux, 2002)
• Ext4 file system (2008)
• BTRFS (2009)
• F2FS (2012)
DOS/Windows file systems history
• File Allocation Table
– FAT (8bit, 1977) / FAT12 (1980) / FAT16 (1984)
Targeted at floppy disks
– HPFS (OS/2, 1989)
– FAT32/VFAT (1996)
– exFAT (2006)
• NTFS
– Since Windows NT 3.1 (1993)
Network/distributed file systems
• Network file systems
– Mount remote file system to local directory
– Network File System
– Server Message Block/CIFS (Samba)
– AppleTalk Filing Protocol
• Distributed file systems
– Share storage device to build a large file system
– Andrew File System
– Google file system
– Hadoop file system (HDFS)
File system interfaces
• R. C. Daley, P. G. Neumann, A General-Purpose File System For Secondary Storage, 1965
– Defined what a file system is and how it works
– Concepts of user, file, directory, directory hierarchy
– Backup storage and its usage
• Incremental backup / weekly full backup recovery
• POSIX [IEEE 1003 / Richard Stallman / 1988]
– Standardized file system interfaces
– Standard I/O API
– Direct I/O API
– Memory mapped I/O API
File system interface : stream I/O
• Buffered and line-by-line I/O interface
• Header: <stdio.h>
• Handle: FILE *fp;
• Functions
– fopen, fclose
– fprintf, fscanf
– fgets, fputs
– fread, fwrite
– fseek, ftell
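#include <stdio.h>
#include <stdlib.h>   /* malloc, free */

int main(void)
{
    FILE *fp;
    char *str;

    /* Open this source file for reading; fopen returns NULL on failure. */
    if ((fp = fopen("main.c", "r")) != NULL)
    {
        str = malloc(4096);
        /* fgets reads at most size-1 characters and always NUL-terminates. */
        while (str != NULL && fgets(str, 4096, fp))
            printf("%s", str);
        fclose(fp);
        free(str);
    }
    return 0;
}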
File system interface : direct I/O
• Header: <fcntl.h>, <unistd.h>, …
• Handle: int fd; (file descriptor)
• Functions
– open, creat, close
– read, write
– lseek, lseek64
– posix_fallocate, posix_fadvise
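#include <fcntl.h>
#include <stdlib.h>   /* malloc, free */
#include <unistd.h>

int main(void)
{
    int fd;
    ssize_t n;
    void *buf;

    /* open takes flag constants, not an fopen-style mode string; -1 means failure. */
    if ((fd = open("main.c", O_RDONLY)) >= 0)
    {
        buf = malloc(4096);
        /* Write only the n bytes that read actually returned. */
        while (buf != NULL && (n = read(fd, buf, 4096)) > 0)
            write(1, buf, n);
        close(fd);
        free(buf);
    }
    return 0;
}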
File system interface : mmap I/O
• Memory access to read/write a file
• Header: <sys/mman.h>
• Handle: void *ptr;
• Functions
– void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset)
– int munmap(void *addr, size_t length)
File system interface : mmap I/O
• Example
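#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    int fd;
    off_t length;
    void *buf;

    if ((fd = open("main.c", O_RDONLY)) >= 0)
    {
        /* Seeking to the end returns the file size in bytes. */
        length = lseek(fd, 0, SEEK_END);
        buf = mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0);
        /* mmap signals failure with MAP_FAILED, not NULL (e.g. for an empty file). */
        if (buf != MAP_FAILED)
        {
            write(1, buf, length);
            munmap(buf, length);
        }
        close(fd);
    }
    return 0;
}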
Stream I/O illustrated
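[Figure: call flow across the Application, libc, VFS, and Page cache layers. fopen → open → sys_open; fgets → read → sys_read, followed by a second fgets; fprintf of "Hello, World" followed by fflush → write → sys_write; fclose → close → sys_close. Sample file content: "Hello, Guys".]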
Memory mapped I/O illustrated
[Figure: call flow across the Application, libc, VFS, and Page cache layers. mmap → sys_mmap; the access c=buf[0] raises a page fault served by aops->readpage(); the store buf[1]='\n' modifies the page, which is written by aops->writepage() on replacement; munmap → sys_munmap. The sample file content is Korean text.]
File system benchmarks
• IOzone
• Iometer
• Filebench
• FFSB
• sysbench
• Bonnie
• Postmark
• TPC
• SPECsfs
• dbench
IOzone
• File I/O performance analysis
• Installation
– apt-get install iozone3
• Parameters
– -s filesize_Kb
– -r record_size_Kb
– -f [path]filename
– -i test
– -a / -A / -z / -Z : auto mode
– -t children
-i Description
0 write/rewrite
1 read/re-read
2 random-read/write
3 read-backwards
4 re-write-record
5 stride-read
6 fwrite/re-fwrite
7 fread/re-fread
8 random_mix
9 pwrite/re-pwrite
10 pread/re-pread
11 pwritev/re-pwritev
12 preadv/re-preadv
Filebench
• File system operation analysis
• Installation
– http://sourceforge.net/projects/filebench/files/latest/download
– configure ; make; make install
• Execution
– go_filebench
• load workload (…/share/filebench/workloads/*)
• set $dir=path
• run duration
• quit
Workloads: filemicro_..., singlestream..., fivestream..., random..., fileserver, networkfs, oltp, varmail, webserver, videoserver
Postmark
• Mail-server workload simulation
• Installation
– apt-get install postmark
– Distributed as a C source file
• Execution
– postmark config
– Commands: set size, set number, set transactions, set location, run, quit
Sysbench
• A modular, cross-platform and multi-threaded benchmark tool
– Target: CPU, memory, threads, mutex, fileio, oltp
• Installation
– apt-get install sysbench
• Parameters
– --test=fileio
– --file-test-mode=rndwr
– --file-total-size=1G
– --file-block-size=16K
– command: prepare, run, cleanup
File system design
File system design elements
• Space allocation
– Contiguous allocation vs. fragmented allocation
– File to block mapping management
– Managing free space
• Name space management
– File naming: name length, case sensitivity, …
• ex. early UNIX file systems limit names to 14 characters; FAT uses the 8.3 naming system
– Directory hierarchy
• Single level array
• Tree-structured multi-level directory
• Graph-structured directory
Disk layout and file abstraction
• Abstractions in file system
– File data
– Inode: per file metadata
• name, size, data location, modified time, owner, …
– Directory hierarchy
– Superblock
– Metadata for free space management
[Figure: on-disk layout containing Superblock, Dir b, Inode a, and the data blocks File a, 0 and File a, 1]
Allocated/free space management
• Bitmap approach (ext*fs)
– Low storage capacity usage
– High free space search cost
• Linked List approach (FAT)
– Low free space search cost
11011000
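The bitmap approach can be sketched in a few lines of C. This is a minimal illustration, not the ext* implementation: the function names are hypothetical, and the bit order (bit i stored in byte i / 8 at position i % 8) is an assumption. Read left to right as blocks 0 to 7, the example bitmap 11011000 corresponds to the byte value 0x1B under that assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index of the first free (0) bit among nbits, or -1 if none.
 * Assumption: bit i lives in byte i / 8 at position i % 8 (LSB first). */
long bitmap_find_free(const uint8_t *bitmap, size_t nbits)
{
    assert(bitmap != NULL);
    for (size_t i = 0; i < nbits; i++) {
        if (!(bitmap[i / 8] & (1u << (i % 8))))
            return (long)i;
    }
    return -1;
}

/* Mark block i as allocated. */
void bitmap_set(uint8_t *bitmap, size_t i)
{
    assert(bitmap != NULL);
    bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
}
```

The linear scan in bitmap_find_free is where the search cost comes from: one bit per block keeps the structure small, but finding a free block may touch the whole map.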
Allocated/free space management
• Tree-based approach
– Inode and indirect blocks
– Extents: (start block number, contiguous blocks)
[Figure: an inode holds the file name, attributes, direct block pointers, and single, double, and triple indirect pointers; each indirect block points to data blocks or to further indirect blocks.]
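To make the inode and indirect block tree concrete, the sketch below classifies a logical block index by how many levels of indirection are needed to reach it. The constants are assumptions chosen to resemble ext2 (12 direct pointers, 4-byte block pointers); the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define N_DIRECT 12u   /* assumption: 12 direct pointers, as in ext2 */

/* Return 0 for a direct block, 1/2/3 for single/double/triple indirect,
 * or -1 if the logical block index is beyond what the inode can map. */
int indirection_level(uint64_t lblock, uint32_t block_size)
{
    assert(block_size >= 4);
    uint64_t ptrs = block_size / 4;      /* assumption: 4-byte block pointers */
    uint64_t span = 1;

    if (lblock < N_DIRECT)
        return 0;
    lblock -= N_DIRECT;
    for (int level = 1; level <= 3; level++) {
        span *= ptrs;                    /* blocks reachable at this level */
        if (lblock < span)
            return level;
        lblock -= span;
    }
    return -1;
}
```

With 4 KB blocks, each indirect block holds 1024 pointers under these assumptions, so logical blocks 12 to 1035 need one extra block read, and later blocks need two or three.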
Allocated/free space management
• Tree-based approach
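– B-Tree (XFS, btrfs, …)
• Useful for extent-based allocation
[Figure: a file allocation tree with keys 3, 4, 8 over the extents (1, 3) (7, 1) (10, 4), covering file blocks 1 to 8; a free space tree with keys 5, 3, 2 over the extents (14, 5) (4, 3) (8, 2), plus the extent (0, 1).]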
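The extent notation (start block number, contiguous blocks) can be exercised with a small lookup routine. This sketch walks a flat array of extents instead of a B-tree, which is a simplification; the struct and function names are hypothetical, and logical blocks are counted from 0 here.

```c
#include <assert.h>
#include <stddef.h>

struct extent {
    unsigned long start;   /* first physical block */
    unsigned long count;   /* number of contiguous blocks */
};

/* Map a 0-based logical block of a file to its physical block by walking
 * the file's extents in order. Returns -1 if the file has no such block. */
long extent_lookup(const struct extent *ext, size_t n, unsigned long lblock)
{
    assert(ext != NULL);
    for (size_t i = 0; i < n; i++) {
        if (lblock < ext[i].count)
            return (long)(ext[i].start + lblock);
        lblock -= ext[i].count;
    }
    return -1;
}
```

For the file allocation extents (1, 3) (7, 1) (10, 4), the fourth logical block (index 3) falls in the second extent and maps to physical block 7. A tree keyed by cumulative length would find the right extent without scanning all of them.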
Directory implementation
• Array
– Easy to manage
– File name length limit
• Linear list
– Variable length file name
– Hard to manage
• Hash table
– Indexed by file name: fast search
– Hash collision
[Figure: an array of fixed-size entries (RUN.EXE, README.TXT, DATA.DB, …) next to a linear list that also holds a variable-length entry (Long named file.docx).]
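The hash table variant can be sketched as follows. This is an in-memory illustration only: the table size, the hash function, and the id field (standing in for an inode number or first cluster) are all assumptions, and collisions are handled by chaining.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DIR_BUCKETS 8   /* hypothetical table size */

struct dir_entry {
    const char *name;
    unsigned id;              /* hypothetical: inode number or first cluster */
    struct dir_entry *next;   /* chain of entries that hash to the same bucket */
};

/* Simple multiplicative string hash; any reasonable hash would do. */
unsigned name_hash(const char *name)
{
    unsigned h = 5381;
    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return h % DIR_BUCKETS;
}

/* Add an entry at the head of its bucket's chain. */
void dir_insert(struct dir_entry *table[DIR_BUCKETS], struct dir_entry *e)
{
    assert(e != NULL && e->name != NULL);
    unsigned b = name_hash(e->name);
    e->next = table[b];
    table[b] = e;
}

/* Look up a name: hash to a bucket, then compare names along the chain. */
struct dir_entry *dir_lookup(struct dir_entry *table[DIR_BUCKETS], const char *name)
{
    for (struct dir_entry *e = table[name_hash(name)]; e != NULL; e = e->next) {
        if (strcmp(e->name, name) == 0)
            return e;
    }
    return NULL;
}
```

The name comparison along the chain is what resolves hash collisions; only entries in one bucket are examined instead of the whole directory.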
Example: FAT
Characteristics
• Background: 1970s
– Personal computer
– Floppy disks (~ 1MB)
• 8.3 name space
– Case insensitive
– Long name format extension
• No protection mechanism
• No consistency guarantee
– chkdsk, scandisk
• File data location management
– Linked list approach
• FAT entry (1 entry / 1 cluster)
– Next cluster number (cluster: 512 bytes ~ 32 KB)
– 0: free, -1: end of file
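[Figure: disk layout of Boot block, FAT, Backup (FAT), Root dir., and Data. The root directory entry A.EXE leads into the file allocation table, where free entries hold 0 and the file's entries hold 00003, 00005, 00006, -1.]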
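The linked list in the file allocation table can be followed with a short loop. The sketch below models the table as an int array using the slide's conventions (0 for free, -1 for end of file); the function name and the starting cluster used in the usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define FAT_FREE 0     /* entry value for a free cluster, as on the slide */
#define FAT_EOF  (-1)  /* entry value marking the end of a file */

/* Follow a file's cluster chain from its first cluster, storing the cluster
 * numbers in out. Returns the chain length, or -1 if the chain reaches a
 * free or out-of-range entry or is longer than max (a corrupted table). */
int fat_chain(const int *fat, size_t nclusters, int first, int *out, size_t max)
{
    assert(fat != NULL && out != NULL);
    size_t len = 0;
    int c = first;

    while (c != FAT_EOF) {
        if (c <= 0 || (size_t)c >= nclusters || fat[c] == FAT_FREE || len == max)
            return -1;
        out[len++] = c;
        c = fat[c];
    }
    return (int)len;
}
```

Assuming a file that starts at cluster 2 with table entries 3, 5, 6, -1 at positions 2, 3, 5, 6, the call returns the chain 2, 3, 5, 6. Reaching the n-th cluster of a file always requires walking the chain from the start, which is the cost of the linked list approach.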
Directory
• A special file with 32 bytes directory entries
• Entries
– File name: 11 bytes (name 8, extension 3)
– Attributes
• Read-only, hidden, system, sub-directory, archive, long file name
– ctime, atime, mtime
• Field widths in bits: year (7), month (4), day (5), hour (5), min (6), second/2 (5)
– First data cluster
– File size (max. 4 GB)
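The timestamp bit fields can be packed and unpacked as shown below. This is a sketch assuming the usual FAT layout: a 16-bit date word and a 16-bit time word, with the fields in the order listed above from the most significant bit down, and years counted from 1980. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct fat_timestamp {
    int year, month, day, hour, minute, second;
};

/* Pack a calendar date into 16 bits: year offset (7), month (4), day (5). */
uint16_t fat_encode_date(int year, int month, int day)
{
    assert(year >= 1980 && year <= 2107);   /* range of a 7-bit offset from 1980 */
    return (uint16_t)(((year - 1980) << 9) | (month << 5) | day);
}

/* Unpack a 16-bit date and a 16-bit time (hour 5, minute 6, second/2 5). */
struct fat_timestamp fat_decode(uint16_t date, uint16_t time)
{
    struct fat_timestamp t;
    t.year   = 1980 + (date >> 9);
    t.month  = (date >> 5) & 0x0F;
    t.day    = date & 0x1F;
    t.hour   = time >> 11;
    t.minute = (time >> 5) & 0x3F;
    t.second = (time & 0x1F) * 2;   /* stored in 2-second units */
    return t;
}
```

Because seconds are stored divided by two in 5 bits, these timestamps have a 2-second resolution.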
Long name extension
• Combining consecutive directory entries
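– First entry: normal directory entry (first 11 characters)
– LFN entries
• File name segment: 26 bytes
• Reserved critical entries
– First data cluster
– File type, sequence number, etc.
[Figure: the name "Introduction to File System.pptx" split across a normal entry ("Introductio", attributes, ctime, atime, mtime, FDC, length) and LFN entries ("n to File", "System.pptx"), each LFN entry carrying a sequence number, a file type, and a first cluster field kept for compatibility.]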
Boot sector
• Bootstrap code
• File system summary
– File system size (sectors)
– Logical sector size
– Cluster size
– # of FATs
– Root directory entries
• Root directory first cluster
– Volume label
– Drive number
Free space management
• Next free cluster pointer
– FAT32 maintains last allocated cluster number
• Possible to undelete recently deleted files
– Produces fragmentation
[Figure: FAT entries 0 0 0 0 00003 00005 00006 -1, with a pointer marking the last allocated cluster]
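A next-fit allocator driven by the last allocated cluster number might look like the sketch below. It is an illustration under assumptions, not FAT32 driver code: the table is an int array with 0 for free and -1 for end of file, and the function name and the first_data parameter (the lowest usable cluster number) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define FAT_FREE 0
#define FAT_EOF  (-1)

/* Allocate one cluster by scanning forward from the cluster after *last,
 * wrapping around once. Clusters below first_data are never used.
 * On success, marks the cluster as end-of-file, updates *last, and returns
 * its number; returns -1 when the table has no free entry. */
int fat_alloc_next(int *fat, size_t nclusters, size_t first_data, size_t *last)
{
    assert(fat != NULL && last != NULL);
    assert(first_data < nclusters && *last >= first_data && *last < nclusters);
    size_t span = nclusters - first_data;

    for (size_t i = 1; i <= span; i++) {
        size_t c = first_data + (*last - first_data + i) % span;
        if (fat[c] == FAT_FREE) {
            fat[c] = FAT_EOF;
            *last = c;
            return (int)c;
        }
    }
    return -1;
}
```

Since the scan starts after the last allocation instead of at the beginning of the table, clusters freed by a recent delete tend not to be reused immediately, which is what makes undelete possible. The same policy scatters files across the disk over time.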
Example: ext3
Characteristics
• Background
– Linux operating system: multi-user
– Evolving from desktop to server and real-time systems
• Based on block groups
– Each block group works as an independent file system
– Inode, directory, file data
• Inodes for allocation and attribute management
• Journaling supported since ext3
Block group
• Ext file system = an array of block groups
• Block group size: determined by block size
– 4 KB block: 128 MB block group
– Why? Data block bitmap must fit in a block (4 KB × 8 bits = 32,768 blocks per group)
Block group descriptor fields: bg_block_bitmap, bg_inode_bitmap, bg_inode_table, bg_free_blocks_count, bg_free_inodes_count, …
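The block group size follows directly from the rule that the data block bitmap must fit in a single block. The sketch below works through that arithmetic; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One bitmap block holds block_size * 8 bits, one bit per block in the group. */
uint64_t blocks_per_group(uint32_t block_size)
{
    assert(block_size > 0);
    return (uint64_t)block_size * 8;
}

/* Size of one block group in bytes. */
uint64_t group_size_bytes(uint32_t block_size)
{
    return blocks_per_group(block_size) * block_size;
}
```

For a 4096-byte block this gives 32,768 blocks per group and 134,217,728 bytes, which is the 128 MB figure; a 1 KB block would give 8 MB groups by the same rule.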
Inode
• Size: 128-byte / 256-byte (ext4)
Directory
• ext3 and later support HTree: hashing for entry lookup [Daniel Phillips, A Directory Index for Ext2, Linux Symposium ’02]
Free space management
• Data block bitmap / inode bitmap in each block group
• Block allocation rule
– Top-level directory’s inode
• In the empty block group, if possible
• Block group with maximum free inodes
– Other inodes and data blocks
• In the block group where its inode or parent resides, if possible
• Nearest following block group with more free blocks than average
Example top-level directories: /usr /home /var /etc