Post on 24-Feb-2016
Yonggang Liu, University of Florida
Learning the Data Management in Linux Kernel v2.6
Layout
Picture of Today’s Topics
Virtual File System
The Page Cache
Accessing the Files
Ext2 and Ext3 file systems
Picture of Today’s Topics
[Figure: the layers involved in a file access, from the processes at the top down to the hard disks at the bottom.]
• Virtual File System (VFS): Provides a uniform file system interface to the processes.
• Disk Caches: Keep the most recently accessed data in RAM. Include the page cache, dentry cache and inode cache.
• Specific file systems (Ext3, FAT, UFS) and Mapping Layer: The specific file systems determine the physical location of the data on disk.
• Generic Block Layer: Offers an abstract view of the block devices. Every I/O operation is a “block I/O”.
• I/O scheduler layer: Groups requests for data that lie near each other on the physical medium.
• Block Device Drivers: Take care of the actual data transfer by sending suitable commands to the hardware.
• Hard Disks
Layout
Picture of Today’s Topics
Virtual File System
• Uniform System Calls - VFS Calls
• Common File Model - VFS Objects
• Interaction between Processes and VFS Objects
• Files Associated with a Process
The Page Cache
Accessing the Files
Ext2 and Ext3 file systems
Uniform System Calls - VFS Calls
[Figure: Process 1, Process 2 and Process 3 sit above the Virtual File System, which sits above the three families of file systems listed below.]
• Disk-based file systems: Ext3, NTFS, ReiserFS, UDF DVD FS …
• Network file systems: NFS, Coda, AFS, CIFS, NCP …
• Special file systems: root FS, sysfs, tmpfs, usbfs, sockfs …
VFS defines the uniform system calls: mount(), umount(), sysfs(), statfs(), chroot(), chdir(), fchdir(), getcwd(), mkdir(), rmdir(), getdents(), link(), rename(), readlink(), chown(), chmod(), stat(), open(), close(), creat(), dup(), fcntl(), select(), poll(), truncate(), lseek(), read(), write() …
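As a rough illustration of the uniform interface, the sketch below issues open(), write(), lseek(), read(), fstat() and close() through Python's os module, which wraps the system calls of the same names. The helper name roundtrip and the file name are assumptions made for the example; the point is only that the code stays the same whichever file system backs the directory.

```python
import os
import tempfile

def roundtrip(directory=None, payload=b"hello vfs"):
    """Write payload to a new file and read it back with plain system calls.

    directory is where the scratch file is created; None means the
    platform's default temporary directory.
    """
    with tempfile.TemporaryDirectory(dir=directory) as d:
        path = os.path.join(d, "vfs_demo.txt")  # hypothetical file name
        fd = os.open(path, os.O_CREAT | os.O_RDWR | os.O_TRUNC, 0o600)
        try:
            os.write(fd, payload)            # write()
            os.lseek(fd, 0, os.SEEK_SET)     # lseek() back to the start
            data = os.read(fd, len(payload)) # read()
            size = os.fstat(fd).st_size      # stat() family
        finally:
            os.close(fd)                     # close()
    return data, size
```

Calling roundtrip("/mnt/some_nfs_share") instead of roundtrip() would exercise a network file system with no change to the code, assuming such a mount point exists.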
Common File Model - VFS Objects
File Object
Describes how a process interacts with a file it has opened. Created when the file is opened. Has no image on disk.
Some fields: f_dentry, f_op, f_pos, f_version, f_mapping
Dentry Object
A directory entry object associates a pathname with its inode. Copied to memory during pathname lookups.
Some fields: d_inode, d_parent, d_name, d_subdirs, d_sb
Superblock Object
Each file system has a superblock recording the information of the file system; it is copied to memory when used.
Some fields: s_blocksize, s_type, s_root, s_inodes, s_bdev
Inode Object
Includes all information needed by the file system to handle a file. Copied to memory when the file's attributes are accessed.
Some fields: i_ino, i_size, i_atime, i_sb, i_mapping
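Part of what the inode and superblock objects hold is visible from User Mode. The sketch below is only an approximation: it uses stat() and statvfs() and labels the results with the kernel field names from the lists above. Treating statvfs's f_bsize as s_blocksize is an assumption that holds for common disk-based file systems but is not guaranteed in general; the function name describe is hypothetical.

```python
import os

def describe(path):
    """Return a few inode and superblock values for path, keyed by the
    kernel field each one roughly corresponds to."""
    st = os.stat(path)        # per-file data, kept in the inode
    vfs = os.statvfs(path)    # per-file-system data, kept in the superblock
    return {
        "i_ino": st.st_ino,          # inode number
        "i_size": st.st_size,        # file length in bytes
        "i_atime": st.st_atime,      # time of last access
        "s_blocksize": vfs.f_bsize,  # assumption: f_bsize mirrors s_blocksize
    }
```

For example, describe(".") reports the values for the current working directory.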
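Interaction between Processes and VFS Objects
[Figure: Process 1, Process 2 and Process 3 each hold an fd that refers to a separate file object. Each file object points through f_dentry to a dentry object in the dentry cache; two of the file objects share one dentry object. Both dentry objects point through d_inode to the same inode object, which points through i_sb to the superblock object and corresponds to the disk file.]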
In the example, 3 processes have opened the same file, 2 of them using the same hard link.
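The situation in the example can be approximated from User Mode. In the sketch below a single process opens the file three times, standing in for the three processes; that substitution, the function name and the file names are assumptions for illustration. Two opens use one hard link and the third uses another, yet all three reach the same inode, while each open gets its own file object and therefore its own file pointer.

```python
import os
import tempfile

def hard_link_demo():
    """Open one file three times through two hard links.

    Returns (number of distinct inodes, file offsets, link count).
    """
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a.txt")   # hypothetical names
        b = os.path.join(d, "b.txt")
        with open(a, "wb") as f:
            f.write(b"0123456789")
        os.link(a, b)                   # second pathname for the same inode
        fd1 = os.open(a, os.O_RDONLY)   # stands in for process 1
        fd2 = os.open(a, os.O_RDONLY)   # process 2, same hard link
        fd3 = os.open(b, os.O_RDONLY)   # process 3, the other hard link
        try:
            inodes = {os.fstat(fd).st_ino for fd in (fd1, fd2, fd3)}
            os.read(fd1, 4)             # moves only the first file pointer
            offsets = [os.lseek(fd, 0, os.SEEK_CUR) for fd in (fd1, fd2, fd3)]
            links = os.fstat(fd1).st_nlink
        finally:
            for fd in (fd1, fd2, fd3):
                os.close(fd)
    return len(inodes), offsets, links
```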
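Files Associated with a Process
[Figure: the process descriptor has two fields, fs and files. The files field points to a files_struct whose fd_array is indexed by fd (0 = stdin, 1 = stdout, 2 = stderr, 3, …) and whose entries point to file objects. The fs field points to an fs_struct holding pwd and root.]
• files (files_struct): Stores which files are currently opened by the process.
• fs (fs_struct): Stores the current working directory, the process's own root directory, etc.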
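The split between fd_array slots and file objects can be observed with dup(), which fills a new slot with a pointer to the same file object. A minimal sketch, with a hypothetical function name: because both descriptors share one file object, moving the file pointer through one of them is visible through the other.

```python
import os
import tempfile

def fd_table_demo():
    """Return (fds differ, offset seen through the duplicate)."""
    with tempfile.TemporaryFile() as f:
        f.write(b"abcdef")
        f.flush()
        fd = f.fileno()
        dup_fd = os.dup(fd)              # new fd_array slot, same file object
        try:
            os.lseek(fd, 2, os.SEEK_SET) # move the shared file pointer
            shared = os.lseek(dup_fd, 0, os.SEEK_CUR)
        finally:
            os.close(dup_fd)
    return fd != dup_fd, shared
```

The working directory kept in fs_struct is likewise visible through os.getcwd().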
Layout
Picture of Today’s Topics
Virtual File System
The Page Cache
• Three Kinds of Disk Cache
• Page descriptors
• Find a Page in Page Cache
• Typical Layout of a Page
• Buffer Pages
Accessing the Files
Ext2 and Ext3 file systems
Three Kinds of Disk Cache
Page Cache
The main disk cache used by the Linux kernel. Stores the pages containing:
• Data of regular files
• Directories
• Data directly read from block device files
• Data of User Mode processes swapped out on disk
• Special file systems (e.g., tmpfs)
Dentry Cache
Stores dentry objects representing file system pathnames.
Inode Cache
Stores inode objects representing disk inodes.
Page descriptors
• Page descriptors are used by the kernel to keep track of the status of each page frame.
• Size: 32 bytes.
• All page descriptors are stored in the mem_map array, which takes about 1% of RAM.
[Figure: A View of Memory Address Space (Abbreviated), showing Reserved (kernel), Reserved (HD), the mem_map array and the pages it describes.]
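The "about 1%" figure follows from the descriptor size. A small worked calculation, assuming 4 KiB page frames (the page size is not stated on the slide):

```python
def mem_map_bytes(ram_bytes, page_size=4096, descriptor_size=32):
    """Size of mem_map: one descriptor per page frame.

    page_size = 4096 is an assumption; descriptor_size = 32 is the
    figure given above.
    """
    return (ram_bytes // page_size) * descriptor_size

def mem_map_fraction(page_size=4096, descriptor_size=32):
    """Fraction of RAM taken by mem_map."""
    return descriptor_size / page_size
```

With these assumptions, 1 GiB of RAM needs 8 MiB of page descriptors, a fraction of 32/4096, or roughly 0.8%.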
Find a Page in Page Cache
• Each inode object owns an address_space object, which has a pointer to a radix tree.
• A radix tree is a tree used to look up a page in the page cache.
– An offset in the file leads to the position of a page descriptor in the radix tree.
[Figure: the inode object points through i_mapping to the address_space object, whose page_tree field points to the root of the radix tree; the root leads through intermediate nodes to the page descriptors.]
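The lookup can be sketched as follows. This is a toy model, not the kernel's code: it assumes 4 KiB pages and 64 slots per node, uses a fixed tree height, and the class and function names are hypothetical. The file offset is first turned into a page index, and the index is then consumed six bits at a time to pick one slot per tree level.

```python
PAGE_SHIFT = 12               # assumption: 4 KiB pages
MAP_SHIFT = 6                 # assumption: 64 slots per tree node
MAP_SIZE = 1 << MAP_SHIFT
MAP_MASK = MAP_SIZE - 1

class RadixTree:
    """Toy radix tree of fixed height mapping page index -> page."""

    def __init__(self, height=2):
        self.height = height
        self.root = [None] * MAP_SIZE

    def _slots(self, index):
        # Slot numbers from the top level down to the leaf level.
        return [(index >> (MAP_SHIFT * level)) & MAP_MASK
                for level in range(self.height - 1, -1, -1)]

    def insert(self, index, page):
        if index >> (MAP_SHIFT * self.height):
            raise ValueError("index too large for this tree height")
        node = self.root
        *inner, last = self._slots(index)
        for slot in inner:
            if node[slot] is None:
                node[slot] = [None] * MAP_SIZE   # allocate a child node
            node = node[slot]
        node[last] = page

    def lookup(self, index):
        if index >> (MAP_SHIFT * self.height):
            return None
        node = self.root
        *inner, last = self._slots(index)
        for slot in inner:
            node = node[slot]
            if node is None:
                return None                      # page not cached
        return node[last]

def find_page(tree, file_offset):
    """Map a byte offset in the file to the cached page, if any."""
    return tree.lookup(file_offset >> PAGE_SHIFT)
```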
Typical Layout of a Page
[Figure: a page divided into segments, each segment into blocks, and each block into sectors.]
• Sector (typically 512 B): The smallest unit of data when accessing the block device.
• Block (a multiple of the sector size, a power of 2, and no larger than a page frame): The smallest unit of data transfer for the VFS and the file systems. It corresponds to one or more adjacent sectors.
• Segment (a multiple of the block size): If some blocks in a page hold data that is adjacent on disk, they belong to one segment. Segments are used because each block I/O operation transfers a group of adjacent blocks on disk.
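The size constraints above can be checked mechanically. A minimal sketch, assuming a 4 KiB page and a 1 KiB block as example values (only the 512 B sector comes from the slide; the function name is hypothetical):

```python
def layout(page_size=4096, block_size=1024, sector_size=512):
    """Validate a block size and return how the units nest."""
    is_power_of_two = (block_size & (block_size - 1)) == 0
    if block_size % sector_size or not is_power_of_two or block_size > page_size:
        raise ValueError("invalid block size")
    return {
        "sectors_per_block": block_size // sector_size,
        "blocks_per_page": page_size // block_size,
    }
```

With the defaults, each block covers 2 sectors and each page holds 4 blocks; a 3072-byte block is rejected because it is not a power of 2.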
Buffer Pages
[Figure: a page holding four buffers (blocks), together with its page descriptor and four buffer heads, one per buffer.]
Buffer pages are used to address the individual disk blocks held in a page.
Buffer page = a regular page + several buffer heads
Buffer pages are created only when necessary. Two common cases:
• When reading/writing pages of a file that are not stored in contiguous disk blocks.
• When accessing a single disk block (e.g., superblock or inode).
Layout
Picture of Today’s Topics
Virtual File System
The Page Cache
Accessing the Files
• I/O Modes
• Read A File
• Read-ahead
• Read-ahead Considerations
• Writing to a File
• When to Flush Dirty Pages
• Process of Flushing Dirty Pages
Ext2 and Ext3 file systems
I/O Modes
Canonical Mode
O_SYNC and O_DIRECT are cleared. read() is blocking; write() terminates as soon as the data is copied to the page cache.
Synchronous Mode
O_SYNC is set. The flag affects only the write operation, which blocks the calling process until the data is effectively written to disk.
Memory Mapping Mode
The application issues an mmap() system call to map the file into memory, so the file appears as an array of bytes in RAM.
Direct I/O Mode
O_DIRECT is set. Any read or write operation transfers data directly from the User Mode address space to disk, or vice versa, bypassing the page cache.
Asynchronous Mode
The requests for data never block the calling process; rather, they are carried out “in the background” while the application continues its normal execution.
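Two of these modes are easy to try from User Mode. The sketch below, with a hypothetical function and file name, writes a file opened with O_SYNC and then reads it back through a memory mapping. Direct I/O is left out because O_DIRECT requires suitably aligned buffers, which would obscure the example.

```python
import mmap
import os
import tempfile

def sync_write_then_mmap(payload=b"synchronous"):
    """Write payload in synchronous mode, then read it via mmap()."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "modes.bin")      # hypothetical file name
        fd = os.open(path, os.O_CREAT | os.O_RDWR | os.O_SYNC, 0o600)
        try:
            # With O_SYNC set, write() blocks until the data is on disk.
            os.write(fd, payload)
            # Memory mapping mode: the file appears as an array of bytes.
            with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as view:
                return bytes(view[:len(payload)])
        finally:
            os.close(fd)
```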
Reading a File
1. Get the address_space object and inode object.
2. Derive the page’s logical index and offset, and save them locally.
3. Start the following cycle to read all requested pages:
• Read ahead if necessary.
• Find the page descriptor from the address_space.
• If the page descriptor is NULL, allocate a new page, and start the I/O of reading the page from disk.
• Copy the page from the page cache to the User Mode buffer.
• Update the index and offset variables.
• Continue if there are more pages to read.
4. All data are read. Increment the file pointer.
5. Update atime in the inode.
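The cycle above can be sketched in a few lines. This is a simplified model under stated assumptions, not the kernel routine: the page cache is a plain dictionary keyed by page index, a bytes object stands in for the file on disk, pages are 4 KiB, and read-ahead and the atime update are omitted. The name read_file is hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages

def read_file(disk, page_cache, pos, count):
    """Read up to count bytes starting at pos; return (data, new_pos).

    disk is a bytes object standing in for the file's on-disk data;
    page_cache maps page index -> page contents.
    """
    end = min(pos + count, len(disk))
    index, offset = divmod(pos, PAGE_SIZE)   # logical index and offset
    out = bytearray()
    while pos + len(out) < end:
        page = page_cache.get(index)         # look the page up in the cache
        if page is None:
            # Cache miss: "read the page from disk" and cache it.
            page = disk[index * PAGE_SIZE:(index + 1) * PAGE_SIZE]
            page_cache[index] = page
        nr = min(PAGE_SIZE - offset, end - pos - len(out))
        out += page[offset:offset + nr]      # copy to the user buffer
        offset += nr                         # update index and offset
        index += offset // PAGE_SIZE
        offset %= PAGE_SIZE
    return bytes(out), pos + len(out)        # new file pointer
```

A read that straddles a page boundary, such as 200 bytes starting at offset 4000, brings pages 0 and 1 into the cache.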
Read-ahead
Read-ahead consists of reading several adjacent pages of data before they are actually requested. In most cases, read-ahead significantly enhances disk performance.
Tune the read-ahead size for an opened file: modify the ra_pages field of the file->f_ra object.
• POSIX_FADV_NORMAL: 32 pages (default)
• POSIX_FADV_SEQUENTIAL: 2 × NORMAL
• POSIX_FADV_RANDOM: 0 pages
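From User Mode, these settings are requested with the posix_fadvise() system call, which Python exposes as os.posix_fadvise on platforms that support it. A minimal sketch with hypothetical function names; the advice is only a hint, and the data returned is the same with or without it.

```python
import os
import tempfile

def read_sequentially(path, chunk=65536):
    """Read a whole file after hinting that access will be sequential."""
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):     # not available everywhere
            # offset 0, length 0 means "the whole file"
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        total = 0
        while True:
            buf = os.read(fd, chunk)
            if not buf:
                break
            total += len(buf)
        return total
    finally:
        os.close(fd)

def demo(size=100_000):
    """Create a scratch file of size bytes and read it back."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"\0" * size)
        f.flush()
        return read_sequentially(f.name)
```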
Read-ahead Considerations
Read-ahead may be gradually increased as long as the process keeps accessing the file sequentially.
Read-ahead must be scaled down or even disabled when the current access is not sequential with respect to the previous one (random access).
Read-ahead should be stopped when a process keeps accessing the same page, or when almost all pages of the file are already in the page cache.
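The three considerations can be turned into a toy policy. The function below is purely illustrative and is not the kernel's algorithm: the doubling and halving factors are assumptions, the check for "almost all pages already cached" is omitted, and the name next_window is hypothetical. Only the 32-page maximum comes from the previous slide.

```python
DEFAULT_MAX = 32  # pages, the default maximum given above

def next_window(window, prev_index, index, max_pages=DEFAULT_MAX):
    """Return the read-ahead window (in pages) after an access to page
    index, given the previously accessed page prev_index."""
    if index == prev_index:
        return 0                                  # same page again: stop
    if index == prev_index + 1:
        return min(max(window, 1) * 2, max_pages) # sequential: grow
    return window // 2                            # random: scale down
```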
Writing to a File
1. Find the inode object of the file.
2. If O_APPEND is set, move the file pointer to the end.
3. Update mtime and ctime in the inode; mark the inode as dirty.
4. Start the following cycle to update all the pages involved:
• Search for the page in the page cache. If the page is not in the page cache, allocate a new page frame.
• Allocate and initialize the buffer heads for the page.
• Copy the characters from the User Mode buffer to the page.
• Mark the underlying buffers as dirty so they can be written to disk later.
• Check whether the ratio of dirty pages in the page cache has risen above vm.dirty_ratio (typically 40%); if so, flush a few tens of pages to disk.
5. Once all pages involved have been handled, update the file pointer.
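The page-update cycle can be modeled in the same simplified way as a read. This sketch makes several assumptions: 4 KiB pages, a dictionary as the page cache, a set of page indexes as the dirty marker instead of per-buffer flags, and new pages that start zero-filled rather than being read from disk first. The name write_file is hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages

def write_file(page_cache, dirty, pos, data):
    """Copy data into the page cache at pos and return the new file pointer.

    page_cache maps page index -> bytearray; dirty is a set of the page
    indexes that still have to be written to disk.
    """
    written = 0
    while written < len(data):
        index, offset = divmod(pos + written, PAGE_SIZE)
        # Find the page in the cache, or allocate a new page frame.
        page = page_cache.setdefault(index, bytearray(PAGE_SIZE))
        nr = min(PAGE_SIZE - offset, len(data) - written)
        page[offset:offset + nr] = data[written:written + nr]
        dirty.add(index)          # to be flushed later
        written += nr
    return pos + written
```

Note that the function returns as soon as the page cache is updated; nothing has reached the disk yet, which matches the canonical write behaviour.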
When to Flush Dirty Pages
The pdflush kernel thread is responsible for writing out dirty pages in the background. Each time, pdflush tries to flush 1024 dirty pages.
A pdflush thread is woken up when:
• The User Mode process issues a sync() system call.
• The kernel fails to allocate a new buffer page or memory pool element.
• The page reclaiming algorithm (LRU) wants to free more memory.
• A process modifies a page in the page cache and causes the fraction of dirty pages to rise above vm.dirty_background_ratio (typically 10%).
A process itself may invoke a system call to write back a few tens of pages when:
• A process modifies a page in the page cache and causes the fraction of dirty pages to rise above vm.dirty_ratio (typically 40%).
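The two ratio thresholds can be summarized as a small decision function. It is a sketch only: it uses the typical values quoted above as defaults, ignores the other wake-up conditions, and the function name and return strings are hypothetical.

```python
def writeback_action(dirty_pages, total_pages,
                     dirty_ratio=40, background_ratio=10):
    """Decide what a page modification triggers, based on the share of
    dirty pages (ratios are percentages)."""
    percent = 100 * dirty_pages / total_pages
    if percent > dirty_ratio:
        return "writer flushes"      # the process itself writes pages back
    if percent > background_ratio:
        return "wake pdflush"        # background writeback
    return "none"
```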
Process of Flushing Dirty Pages
For each dirty inode in each superblock, do:
• If the request queue is write-congested and the process does not want to block, terminate.
• Find the file’s initial page to be considered.
• Look up the descriptors of the dirty pages in the radix tree.
• For each page descriptor found, flush the page to disk right away or record it (depending on the file system).
• Start the disk I/O if the “record” method was used in the previous step.
If the inode is dirty, write it back.
Continue until the specified number of pages has been flushed.
Layout
Picture of Today’s Topics
Virtual File System
The Page Cache
Accessing the Files
Ext2 and Ext3 File Systems
• Ext2 Block Groups
• Data Blocks Addressing
• Allocating a Data Block
• The Ext3 Journaling File System
Ext2 Block Groups
Disk layout: Boot Block | Block group 0 | … | Block group n
Layout of each block group:
Super Block | Group Descriptors | Data block Bitmap | inode Bitmap | inode Table | Data blocks
1 block | n blocks | 1 block | 1 block | n blocks | n blocks
• The Ext2 file system partitions the disk blocks into block groups of the same size.
• The maximum number of blocks in a block group is 8 × b, where b is the block size in bytes, because the data block bitmap must fit in one block.
• The kernel tries to keep the data blocks belonging to a file in the same block group, if possible.
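A short worked calculation of the 8 × b limit: a bitmap occupying one block of b bytes has 8 × b bits, and each bit tracks one block. The function names are hypothetical.

```python
def max_blocks_per_group(block_size):
    """One bitmap block of block_size bytes holds 8 * block_size bits,
    one bit per block in the group."""
    return 8 * block_size

def max_group_bytes(block_size):
    """Largest block group size in bytes for a given block size."""
    return max_blocks_per_group(block_size) * block_size
```

For 1 KiB blocks this gives 8192 blocks (8 MiB) per group; for 4 KiB blocks, 32768 blocks (128 MiB).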
Data Blocks Addressing
Given an offset f inside a file, how to derive the logical block number of this block on disk?
1. Get the file block number by dividing f by the block size.
2. Translate the file block number to the corresponding logical block number by “Data Blocks Addressing”.
“Address Mapping Table”: Inode -> i_block, entries 0 to 14 (blocks numbered with file block number; b is the block size in bytes)
• Entries 0 to 11 (direct): file blocks 0 to 11
• Entry 12 (single indirect): file blocks 12 to (b/4)+11
• Entry 13 (double indirect): file blocks (b/4)+12 to (b/4)^2+(b/4)+11
• Entry 14 (triple indirect): file blocks (b/4)^2+(b/4)+12 to (b/4)^3+(b/4)^2+(b/4)+11
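The translation can be sketched as a function that returns which i_block entry covers a file block and which slot to follow at each level of indirection. It assumes 4-byte block numbers, which is where the b/4 terms come from; the function names and the default block size are hypothetical choices for the example.

```python
def file_block_number(f, b=4096):
    """Step 1: file block number for byte offset f."""
    return f // b

def i_block_path(file_block, b=4096):
    """Step 2: return [i_block entry, slot, slot, ...] for a file block.

    b is the block size in bytes; each index block holds b // 4 block
    numbers (assumption: 4 bytes per block number).
    """
    n = b // 4
    if file_block < 12:
        return [file_block]                       # direct
    file_block -= 12
    if file_block < n:
        return [12, file_block]                   # single indirect
    file_block -= n
    if file_block < n * n:
        return [13, file_block // n, file_block % n]   # double indirect
    file_block -= n * n
    if file_block < n ** 3:
        return [14, file_block // (n * n),
                (file_block // n) % n, file_block % n]  # triple indirect
    raise ValueError("file block beyond triple indirection")
```

With b = 1024 (so b/4 = 256), file block 267 is the last one reached through entry 12, and file block 268 is the first one reached through entry 13.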
Allocating a Data Block
To reduce file fragmentation, when allocating a block for a file, Ext2 follows this order:
1. Get a new block for a file near the block already allocated for the file.
2. Get a new block in the block group that includes the file’s inode.
3. Get a new block from one of the other block groups.
Preallocation of data blocks
To reduce file fragmentation, each time the file gets not only the requested block but 8 adjacent blocks. When the file is closed, all the unused preallocated blocks are freed.
The Ext3 Journaling File System
Goal of journaling file systems
When doing a consistency check, the file system only needs to look in the journal, the part of the disk that contains the most recent disk write operations, instead of checking the whole file system. This saves a large amount of time after a system failure.
Two Steps in Ext3 Journaling Process
1. A copy of the blocks to be written is stored in the journal.
2. When the I/O data transfer to the journal is completed, the blocks are written in the file system. When this finishes, the copies in the journal are discarded.
[Figure: outcome of a system failure. During step 1: discard the changes; the file system is still consistent. During step 2: apply the changes from the journal; the file system is consistent.]
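The two steps and the two failure cases can be modeled with a toy class. Everything here is an illustrative assumption: blocks are dictionary entries, the journal holds at most one pending set of blocks, a failure is simulated with a crash argument, and the class and method names are hypothetical. It is meant to show why either failure leaves a consistent state, not how Ext3 implements its journal.

```python
class JournaledDisk:
    """Toy model of two-step journaling."""

    def __init__(self):
        self.blocks = {}      # the file system proper
        self.journal = None   # complete copy of pending blocks, or None

    def write(self, updates, crash=None):
        if crash == "step1":
            # Failure before the journal copy is complete: the partial
            # copy is ignored and the file system is untouched.
            return
        self.journal = dict(updates)              # step 1 completed
        for i, (number, data) in enumerate(updates.items()):
            if crash == "step2" and i == 1:
                return                            # failure mid-way in step 2
            self.blocks[number] = data
        self.journal = None                       # step 2 completed

    def recover(self):
        """After a failure, replay whatever the journal still holds."""
        if self.journal is not None:
            self.blocks.update(self.journal)
            self.journal = None
```

After a failure in step 1, recover() finds nothing to replay and the old contents remain; after a failure in step 2, recover() reapplies the journaled blocks, so the half-finished update is completed.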
Thank you!