i/o systems (based on 4.4bsd)
DESCRIPTION
I/O Systems (based on 4.4BSD). Properties of the I/O Subsystem. Anonymizes (unifies the interface to) hardware devices, both to users and to other parts of the kernel Buffer-caching systems General device-driver code Hardware-specific device drivers. Four Main Kinds of I/O. Filesystem - PowerPoint PPT PresentationTRANSCRIPT
1
I/O Systems (based on 4.4BSD)I/O Systems (based on 4.4BSD)
2
Properties of the I/O SubsystemProperties of the I/O Subsystem
Anonymizes (unifies the interface to) Anonymizes (unifies the interface to) hardware devices, both to users and to hardware devices, both to users and to other parts of the kernelother parts of the kernel
Buffer-caching systemsBuffer-caching systems General device-driver codeGeneral device-driver code Hardware-specific device driversHardware-specific device drivers
3
Four Main Kinds of I/OFour Main Kinds of I/O
FilesystemFilesystem Actual files (although this is too restrictive)Actual files (although this is too restrictive) Hierarchical name space, locking, quotas, Hierarchical name space, locking, quotas,
attribute management, protectionattribute management, protection Socket interfaceSocket interface
Networking requestsNetworking requests
4
Four Types of I/O IIFour Types of I/O II
Block-device interfaceBlock-device interface Structured access to devicesStructured access to devices Uses buffer cache Uses buffer cache
– Minimizes I/O operationsMinimizes I/O operations– Synchronize with filesystem operationsSynchronize with filesystem operations– Uses kernel I/O buffersUses kernel I/O buffers
E.g. reading a byte from a file causes a E.g. reading a byte from a file causes a block to be fetched, and that block is stored block to be fetched, and that block is stored in the disk buffer cachein the disk buffer cache
5
Four Types of I/O IIIFour Types of I/O III
Character-device interfaceCharacter-device interface Unstructured access to devicesUnstructured access to devices Two typesTwo types
– Actual character devices (e.g. terminals)Actual character devices (e.g. terminals)– ““raw” access to block devices (e.g. disks, tapes)raw” access to block devices (e.g. disks, tapes)
I/O doesn’t use the buffer cacheI/O doesn’t use the buffer cache Requests are still in multiples of block sizeRequests are still in multiples of block size Requests might need to be alignedRequests might need to be aligned
6
Internal I/O StructureInternal I/O Structure
Device drivers are modules of code dedicated Device drivers are modules of code dedicated to handling I/O for a particular piece or kind of to handling I/O for a particular piece or kind of hardwarehardware
Drivers define a set of operations via well-Drivers define a set of operations via well-known entry points (e.g., read, write, etc.)known entry points (e.g., read, write, etc.)
Block device interface is kept in Block device interface is kept in bdevsw bdevsw structurestructure
Character device interface is kept in Character device interface is kept in cdevswcdevsw structure.structure.
7
Kernel I/O StructureKernel I/O Structure
buffer cache
character-device driver
the hardware
system call interface to the kernel
VM
network-interface drivers
block-device driver
MFS
cooked disk
raw disk and tty
tty
line disciplin
e
FFSLFS
NFS
network protocols
socket
VNODE layer
local naming (UFS) special devices
active file entries
swap- space mgmt.
8
Device NumbersDevice Numbers
How we select the device driver and How we select the device driver and actual deviceactual device
Device number is a (major, minor) pairDevice number is a (major, minor) pair Major number selects the device driverMajor number selects the device driver
Single device can have two major numbersSingle device can have two major numbers—one each for block and char.—one each for block and char.
Minor number selects which (of possibly Minor number selects which (of possibly many) devices—interpreted by the drivermany) devices—interpreted by the driver
9
Device DriversDevice Drivers
InitializationInitialization Probes device, sets initial software stateProbes device, sets initial software state
Top HalfTop Half System calls, may call sleep()System calls, may call sleep()
Bottom HalfBottom Half Interrupt handlers; no per-process state or blockingInterrupt handlers; no per-process state or blocking
Error handlingError handling Crash-dump routine, called by system when an Crash-dump routine, called by system when an
unrecoverable error occurs.unrecoverable error occurs.
10
I/O QueuingI/O Queuing
How data moves between the top and How data moves between the top and bottom halves of the kernelbottom halves of the kernel
Top half Top half receives request from user through system receives request from user through system
callcall records request in data structurerecords request in data structure places data structure in queueplaces data structure in queue starts device if necessarystarts device if necessary
11
Interrupt HandlersInterrupt Handlers
Device interrupts after completing the Device interrupts after completing the I/O operationI/O operation
Removes request from device queueRemoves request from device queue Notifies requester that command has Notifies requester that command has
completed (how?)completed (how?) Starts next request from the queueStarts next request from the queue
12
Top/Bottom CoordinationTop/Bottom Coordination
What might happen if an interrupt What might happen if an interrupt occurred while a request was being occurred while a request was being enqueued in the top half?enqueued in the top half?
The queue is just like any other shared The queue is just like any other shared data structuredata structure
To enforce synchronization, use the To enforce synchronization, use the splxxx() routines.splxxx() routines.
13
Interrupt MechanismInterrupt Mechanism Device interruptsDevice interrupts Call glue routine through Call glue routine through
interrupt table (glue routine interrupt table (glue routine allows ISR to be machine allows ISR to be machine independent)independent)
Glue routine saves volatile Glue routine saves volatile registersregisters
… … updates statisticsupdates statistics … … calls ISR passing in calls ISR passing in
device unit numberdevice unit number … … restores volatile registersrestores volatile registers … … returns from interruptreturns from interrupt
glue routine
interrupt service routine
interrupt
interruptdispatch
table
14
Block DevicesBlock Devices
User abstraction: array of bytesUser abstraction: array of bytes Physical device: disk, tape, …Physical device: disk, tape, … Driver converts byte-based requests to Driver converts byte-based requests to
block I/Oblock I/O.. Block I/O uses the buffer cacheBlock I/O uses the buffer cache One can use raw devices to bypass the One can use raw devices to bypass the
buffer cache, but coordination is criticalbuffer cache, but coordination is critical
15
Buffer CacheBuffer Cache
System caches blocks and returns System caches blocks and returns portion read (think small reads from portion read (think small reads from disk)disk)
Synchronous writes can be done with Synchronous writes can be done with sync()sync()
System caches writes and periodically System caches writes and periodically flushes dirty blocks to disk.flushes dirty blocks to disk.
fsync() flushes a single file.fsync() flushes a single file.
16
Block Device Interface: open()Block Device Interface: open()
Prepare device for I/OPrepare device for I/O Verify the integrity of the medium (e.g. Verify the integrity of the medium (e.g.
make sure that the floppy disk is make sure that the floppy disk is inserted when you’re mounting a inserted when you’re mounting a filesystem there).filesystem there).
This is not the same as opening a file, This is not the same as opening a file, although it uses the same interfacealthough it uses the same interface
17
Block Device Interface: strategy()Block Device Interface: strategy()
Start read/write operation and return Start read/write operation and return without waiting for it to complete without waiting for it to complete (default)(default)
Can be called synchronously (caller Can be called synchronously (caller sleeps on buffer passed in)sleeps on buffer passed in)
bread()/bwrite() map through to thisbread()/bwrite() map through to this
18
Block Device Interface: close()Block Device Interface: close()
Called when the final client is using the Called when the final client is using the device terminatesdevice terminates
Device-dependent behaviorDevice-dependent behavior Disks: do nothingDisks: do nothing Tapes: write end-of-tape marker, rewindTapes: write end-of-tape marker, rewind
19
Block Device Interface: dump()Block Device Interface: dump()
Used to save the contents of physical Used to save the contents of physical memory to backing storememory to backing store
Dump used to analyze the crashDump used to analyze the crash All disks must support thisAll disks must support this This is what ends up in /lost+found on a This is what ends up in /lost+found on a
filesystemfilesystem
20
Block Device Interface: psize()Block Device Interface: psize()
Returns the size of a partition, in terms Returns the size of a partition, in terms of blocks of size DEV_BSIZEof blocks of size DEV_BSIZE
Makes sense mostly in disksMakes sense mostly in disks Used during bootstrap to calculate Used during bootstrap to calculate
location where a crash dump should go, location where a crash dump should go, and to determine size of swap deviceand to determine size of swap device
21
Disk Scheduling: disksort()Disk Scheduling: disksort()
Implements a variant of the elevator Implements a variant of the elevator algorithm: which one is it?algorithm: which one is it?
Keeps two sorted lists of requests:Keeps two sorted lists of requests: The current one (the one we’re servicing)The current one (the one we’re servicing) The other one (for the opposite direction)The other one (for the opposite direction) Switch whenever we hit the endSwitch whenever we hit the end
Modern disk controllers will allow multiple Modern disk controllers will allow multiple outstanding requests and will sort the outstanding requests and will sort the requests themselvesrequests themselves
22
Disk LabelsDisk Labels
The geometry of a disk is the number of The geometry of a disk is the number of cylinders/tracks/sectors on a disk, and the cylinders/tracks/sectors on a disk, and the size of the sectorssize of the sectors
Disks can have different geometries imposed Disks can have different geometries imposed by a low-level formatby a low-level format
Disk labels are written in a well-known Disk labels are written in a well-known location on the disk, and contain the location on the disk, and contain the geometry, as well as information about geometry, as well as information about partition usage.partition usage.
23
BootstrappingBootstrapping
The first level bootstrap loader reads the The first level bootstrap loader reads the first few disk sectors to findfirst few disk sectors to find The geometry of the diskThe geometry of the disk A second-level bootstrap routineA second-level bootstrap routine
– Think of grub, lilo, multiboot, et al.Think of grub, lilo, multiboot, et al.
See, for example, See, for example, http://www.nomoa.com/bsd/dualboot.htmhttp://www.nomoa.com/bsd/dualboot.htm
24
Character DevicesCharacter Devices
Most peripherals (except networks) have Most peripherals (except networks) have character interfacescharacter interfaces
For many devices, this is just a byte For many devices, this is just a byte streamstream /dev/lp0/dev/lp0 /dev/tty00/dev/tty00 /dev/mem/dev/mem /dev/null/dev/null
25
Character Devices IICharacter Devices II For “block” devices, the character interface is For “block” devices, the character interface is
called the “raw” interfacecalled the “raw” interface Direct access to the deviceDirect access to the device Must manage own buffers (not use buffer cache)Must manage own buffers (not use buffer cache) User-level buffer must be wired during I/OUser-level buffer must be wired during I/O Not a byte-stream model; still have to talk in terms Not a byte-stream model; still have to talk in terms
understood by underlying deviceunderstood by underlying device– Size, alignmentSize, alignment
Should only be used by programs that have an Should only be used by programs that have an intimate knowledge of the underlying device and intimate knowledge of the underlying device and format; otherwise, chaos may ensueformat; otherwise, chaos may ensue
26
Character Device InterfaceCharacter Device Interface
Open, close: just Open, close: just what you thinkwhat you think
Read, write: call Read, write: call physio() to do the physio() to do the transfertransfer
Ioctl: the “swiss Ioctl: the “swiss army knife” call; army knife” call; highly-dependent on highly-dependent on devicedevice
Select: see whether the Select: see whether the device is ready for device is ready for reading/writing (cf. reading/writing (cf. select() syscall)select() syscall)
Stop: flow control on Stop: flow control on terminal devicesterminal devices
Mmap: map device Mmap: map device memory into address memory into address spacespace
Reset: re-initialize the Reset: re-initialize the device statedevice state
27
File DescriptorsFile Descriptors
System calls use file descriptors to refer System calls use file descriptors to refer to open files and devicesto open files and devices File entry contains file type and pointer to File entry contains file type and pointer to
underlying objectunderlying object– Socket (IPC)Socket (IPC)– Vnode (device, file)Vnode (device, file)– Virtual memoryVirtual memory
file descriptor
descriptor table
file entry
interprocess communicatio
n
vnode
virtual memoryuser process filedesc process
substructure
kernel list
28
Open File EntriesOpen File Entries
““object-oriented” data structureobject-oriented” data structure Methods:Methods:
ReadRead WriteWrite SelectSelect IoctlIoctl Close/deallocateClose/deallocate
Really just a type and a table of function Really just a type and a table of function pointerspointers
29
Open File Entries IIOpen File Entries II
Each file entry has a pointer to a data structure Each file entry has a pointer to a data structure describing the file objectdescribing the file object Reference passed to each function call operating on Reference passed to each function call operating on
the filethe file All object state is stored in the data structureAll object state is stored in the data structure Opaque to the routines that manipulate the file entryOpaque to the routines that manipulate the file entry
—only interpreted by code for underlying operations—only interpreted by code for underlying operations File offset determines position in file for next File offset determines position in file for next
read or writeread or write Not stored in per-object data structure. Why not?Not stored in per-object data structure. Why not?
30
File Descriptor FlagsFile Descriptor Flags
Protection flags (R, W, RW)Protection flags (R, W, RW) Checked before system call executedChecked before system call executed System calls themselves don’t have to checkSystem calls themselves don’t have to check
NDELAY: return EWOULDBLOCK if the call NDELAY: return EWOULDBLOCK if the call would block the processwould block the process
ASYNC: send SIGIO to the process when I/O ASYNC: send SIGIO to the process when I/O becomes possiblebecomes possible
Append: set offset to EOF on each writeAppend: set offset to EOF on each write Locking information (shared/exclusive)Locking information (shared/exclusive)
31
File Entry Structure CountsFile Entry Structure Counts
Reference CountsReference Counts Count the number of times processes point to the Count the number of times processes point to the
file structurefile structure– A single process can have multiple references to (i.e., A single process can have multiple references to (i.e.,
descriptors for) the file (dup, fcntl)descriptors for) the file (dup, fcntl)– Processes inherit file entries on forkProcesses inherit file entries on fork
Message CountsMessage Counts Descriptors can be sent in messages to local filesDescriptors can be sent in messages to local files
– Increment message count when descriptor sent, and Increment message count when descriptor sent, and decrement when receiveddecrement when received
32
File Structure Manipulation:File Structure Manipulation:fcntl()fcntl()
Duplicate a descriptor (same as dup())Duplicate a descriptor (same as dup()) Get or set close-on-exec flagGet or set close-on-exec flag
When set, descriptor is closed when a new When set, descriptor is closed when a new program is exec()’edprogram is exec()’ed
Set descriptor into nonblocking modeSet descriptor into nonblocking mode Set NDELAY flagSet NDELAY flag Not implemented for regular files in 4.4BSDNot implemented for regular files in 4.4BSD
33
File Structure Manipulation:File Structure Manipulation:fcntl() IIfcntl() II
Set append flagSet append flag Set ASYNC flagSet ASYNC flag Send signal to process when urgent data Send signal to process when urgent data
arisesarises Set or get the PID or PGID to which to send Set or get the PID or PGID to which to send
the two signalsthe two signals Test or set status of a lock on a range of Test or set status of a lock on a range of
bytes in the underlying filebytes in the underlying file
34
File Descriptor LockingFile Descriptor Locking
Early Unix didn’t have file lockingEarly Unix didn’t have file locking So, it used an atomic operation on the file system So, it used an atomic operation on the file system
to simulate it. How?to simulate it. How?
Downsides of this approach:Downsides of this approach: Processes consumed CPU time loopingProcesses consumed CPU time looping Locks left lying around have to be cleaned up by Locks left lying around have to be cleaned up by
handhand Root processes would override the lock Root processes would override the lock
mechanismmechanism
Real locks were added to 4.2 BSDReal locks were added to 4.2 BSD
35
File Locking Options: File Locking Options: Lock the Whole FileLock the Whole File
Serializes access to the fileSerializes access to the file Fast (can just record the PID/PGRP of Fast (can just record the PID/PGRP of
the locking process and proceed)the locking process and proceed) Inherited by childrenInherited by children Released when last descriptor for a file Released when last descriptor for a file
is closedis closed When does it work well? When poorly?When does it work well? When poorly? 4.2 BSD and 4.3 BSD had this4.2 BSD and 4.3 BSD had this
36
File Locking Options:File Locking Options:Byte-Range LocksByte-Range Locks
Can lock a piece of a file (down to a byte)Can lock a piece of a file (down to a byte) Not powerful enough to do nested Not powerful enough to do nested
hierarchical locks (databases)hierarchical locks (databases) Complex implementationComplex implementation Mandated by POSIX standard, so put Mandated by POSIX standard, so put
into 4.4 BSDinto 4.4 BSD Released Released eacheach time a close is done on a time a close is done on a
descriptor referencing the filedescriptor referencing the file
37
File Locking Options:File Locking Options:Advisory vs. MandatoryAdvisory vs. Mandatory
Mandatory:Mandatory: Locks enforced for all processesLocks enforced for all processes Cannot be bypassed (even by root)Cannot be bypassed (even by root)
Advisory:Advisory: Locks enforced only for processes that Locks enforced only for processes that
request themrequest them
38
File Locking Options:File Locking Options:Shared vs. ExclusiveShared vs. Exclusive
Shared locks: multiple processes may access Shared locks: multiple processes may access the filethe file
Exclusive locks: only one processExclusive locks: only one process Request a lock when another process has an Request a lock when another process has an
exclusive lock => you blockexclusive lock => you block Request an exclusive lock when another Request an exclusive lock when another
process holds a lock => you blockprocess holds a lock => you block Think of the classic readers-writers problemThink of the classic readers-writers problem
39
4.4 BSD Locks4.4 BSD Locks
Both whole-file and byte-rangeBoth whole-file and byte-range Whole-file implemented as byte-range over the entire fileWhole-file implemented as byte-range over the entire file Closing behavior handled as special case checkClosing behavior handled as special case check
Advisory locksAdvisory locks Both shared and exclusive may be requestedBoth shared and exclusive may be requested May request lock when opening a file (avoid race May request lock when opening a file (avoid race
condition)condition) Done on per-filesystem basisDone on per-filesystem basis Process may specify to return with error instead of Process may specify to return with error instead of
blocking if lock is busyblocking if lock is busy
40
Doing I/O for Doing I/O for Multiple DescriptorsMultiple Descriptors
Consider the X ServerConsider the X Server Read from keyboardRead from keyboard Read from mouseRead from mouse Write to frame buffer (video)Write to frame buffer (video) Read from multiple network connectionsRead from multiple network connections Write to multiple network connectionsWrite to multiple network connections
If we try to read from a descriptor that If we try to read from a descriptor that isn’t ready, we’ll blockisn’t ready, we’ll block
41
Historical SolutionHistorical Solution
Use multiple processes communicating Use multiple processes communicating through shared facilitiesthrough shared facilities
Example:Example: Have a “master” process fork off a separate Have a “master” process fork off a separate
process to handle each I/O device or client for the process to handle each I/O device or client for the X ServerX Server
For N children, N+1 pipes:For N children, N+1 pipes:– One for all children to talk to X ServerOne for all children to talk to X Server– One per child for the X Server to talk backOne per child for the X Server to talk back
High overhead in context switchesHigh overhead in context switches
42
4.4 BSD’s Solution: 4.4 BSD’s Solution: Three I/O OptionsThree I/O Options
Polling I/O: Polling I/O: Repeatedly check a set of descriptors for Repeatedly check a set of descriptors for
available I/O; see discussion of Selectavailable I/O; see discussion of Select Nonblocking I/O:Nonblocking I/O:
Complete immediately w/partial resultsComplete immediately w/partial results Signal-driven I/O:Signal-driven I/O:
Notify process when I/O becomes possibleNotify process when I/O becomes possible
43
Possibilities to Avoid the Possibilities to Avoid the Blocking Problem (1-2)Blocking Problem (1-2)
Set all descriptors to nonblocking modeSet all descriptors to nonblocking mode This ends up being a different form of This ends up being a different form of
polling (busy waiting)polling (busy waiting) Set all descriptors to send signalsSet all descriptors to send signals
Signals are expensive to catch; not good Signals are expensive to catch; not good for a high I/O process like the X Serverfor a high I/O process like the X Server
44
Possibilities to Avoid the Possibilities to Avoid the Blocking Problem (3-4)Blocking Problem (3-4)
Have a query mechanism to find out Have a query mechanism to find out which descriptors are readywhich descriptors are ready Block process if none are readyBlock process if none are ready Two system calls: one for check, one for I/OTwo system calls: one for check, one for I/O
Have the process give the kernel a set of Have the process give the kernel a set of descriptors to read from, and be told on descriptors to read from, and be told on return which one provided the datareturn which one provided the data Can’t mix reads and writesCan’t mix reads and writes
45
The Virtual Filesystem InterfaceThe Virtual Filesystem Interface
In older systems, file entries pointed In older systems, file entries pointed directly to a filesystem index node (inode).directly to a filesystem index node (inode). An index node describes the layout of a fileAn index node describes the layout of a file
However, this tied file entries to the However, this tied file entries to the particular filesystem (the Berkeley Fast particular filesystem (the Berkeley Fast File System, or FFS).File System, or FFS).
Add a layer to sit between the file entries Add a layer to sit between the file entries and the inodes: the virtual node (vnode)and the inodes: the virtual node (vnode)
46
VnodesVnodes
An object-oriented interfaceAn object-oriented interface Vnodes for local filesystems map onto Vnodes for local filesystems map onto
inodesinodes Vnodes for remote filesystem map to Vnodes for remote filesystem map to
network protocol control blocks describing network protocol control blocks describing how to find and access the filehow to find and access the file
47
Remember This?Remember This?
buffer cache
character-device driver
the hardware
system call interface to the kernel
VM
network-interface drivers
block-device driver
MFS
cooked disk
raw disk and tty
tty
line disciplin
e
FFSLFS
NFS
network protocols
socket
VNODE layerVNODE layer
local naming (UFS) special devices
active file entries
swap- space mgmt.
48
Vnode entriesVnode entries
Flags for locking and Flags for locking and generic attributes (e.g. generic attributes (e.g. filesystem root)filesystem root)
Reference countsReference counts Pointer to mount Pointer to mount
structure for filesystem structure for filesystem containing the vnodecontaining the vnode
File read-ahead File read-ahead informationinformation
An NFS lease referenceAn NFS lease reference
Reference to state of Reference to state of special devices, sockets, special devices, sockets, FIFOsFIFOs
Pointer to vnode operationsPointer to vnode operations Pointer to private Pointer to private
information (inode, information (inode, nsfnode)nsfnode)
Type of underlying objectType of underlying object Clean and dirty buffer poolsClean and dirty buffer pools Count of write operations in Count of write operations in
progressprogress
49
Vnode OperationsVnode Operations
At boot time, each filesystem registers At boot time, each filesystem registers the set of vnode operations it supports.the set of vnode operations it supports.
Kernel builds a table listing the union of Kernel builds a table listing the union of all operations on any filesystem, and an all operations on any filesystem, and an operations vector for each filesystemoperations vector for each filesystem
Supported operations point to routines; Supported operations point to routines; unsupported point to a default routine.unsupported point to a default routine.
50
Filesystem OperationsFilesystem Operations
File systems provide two types of File systems provide two types of operations:operations: Name space operationsName space operations On-disk file storageOn-disk file storage
4.3 BSD had these tied together; 4.3 BSD had these tied together; 4.4BSD separates them4.4BSD separates them
51
Pathname TranslationPathname Translation
The pathname is copied in from the user The pathname is copied in from the user process (or extracted from network process (or extracted from network buffer for remote filesystem request)buffer for remote filesystem request)
Starting point of the pathname is Starting point of the pathname is determined (root for absolute path determined (root for absolute path names, current directory for relative names, current directory for relative names)—this is the lookup directory in names)—this is the lookup directory in the next stepthe next step
52
Pathname Translation IIPathname Translation II
Vnode layer calls filesystem-specific Vnode layer calls filesystem-specific lookup() operation, passing in the lookup() operation, passing in the remaining pathname and the lookup remaining pathname and the lookup directorydirectory
Filesystem code searches the lookup Filesystem code searches the lookup directory for the next component of the directory for the next component of the path, and returns either the vnode for path, and returns either the vnode for that component or an errorthat component or an error
53
Pathname Translation III:Pathname Translation III:Possible OutcomesPossible Outcomes
Error returned: propagate itError returned: propagate it Pathname not exhausted, but current vnode is not Pathname not exhausted, but current vnode is not
a directory: “not a directory” errora directory: “not a directory” error Pathname exhausted, then the current vnode is Pathname exhausted, then the current vnode is
the result; return itthe result; return it If vnode is the mount point for another file system, If vnode is the mount point for another file system,
make the lookup directory the mounted make the lookup directory the mounted filesystem; else make the lookup directory the filesystem; else make the lookup directory the returned vnodereturned vnode In either case, iteratively call lookup() with the new In either case, iteratively call lookup() with the new
lookup directorylookup directory
54
Common VnodeCommon Vnode(Exported) Filesystem Services(Exported) Filesystem Services
Generic mount optionsGeneric mount options Noexec: don’t execute any files found on this file Noexec: don’t execute any files found on this file
system (useful in heterogeneous networked system (useful in heterogeneous networked environments)environments)
Nosuid: don’t honor setuid or setgid flags for Nosuid: don’t honor setuid or setgid flags for executables on this filesystem (security)executables on this filesystem (security)
Nodev: don’t allow any special devices on this Nodev: don’t allow any special devices on this filesystem to be opened (again, heterogeneous filesystem to be opened (again, heterogeneous systems)systems)
ReadonlyReadonly User mountable: CD drives, floppiesUser mountable: CD drives, floppies
55
Common Vnode Services II:Common Vnode Services II:Informational servicesInformational services
statfs(): returns information about the statfs(): returns information about the filesystem resources (number of used filesystem resources (number of used and free disk blocks and inodes, and free disk blocks and inodes, filesystem mount point, and source of filesystem mount point, and source of the mounted filesystem)the mounted filesystem)
getfsstat(): returns information about all getfsstat(): returns information about all mounted filesystemsmounted filesystems
56
Filesystem-Independent Services: Filesystem-Independent Services: Inactive()Inactive()
Called by the vnode layer when last Called by the vnode layer when last reference to a file is closed, usage count reference to a file is closed, usage count goes to 0goes to 0
Filesystem can then write dirty blocks, Filesystem can then write dirty blocks, etc.etc.
Vnode is placed on system-wide free listVnode is placed on system-wide free list Floating free list allows vnodes to migrate Floating free list allows vnodes to migrate
to the file system where they’re neededto the file system where they’re needed
57
reclaim() disassociates the vnode with reclaim() disassociates the vnode with the underlying file systemthe underlying file system
vgone() uses reclaim and then vgone() uses reclaim and then associates the vnode with the dead associates the vnode with the dead filesystemfilesystem All calls except close() generate an errorAll calls except close() generate an error Used to disassociate a terminal when the Used to disassociate a terminal when the
session leader exits, or when unmounting a session leader exits, or when unmounting a filesystem where processes have open filesfilesystem where processes have open files
Filesystem-Independent Services: Filesystem-Independent Services: reclaim() & vgone()reclaim() & vgone()
58
Name CacheName Cache
Associate a name with its vnode to save Associate a name with its vnode to save lookups in the file systemlookups in the file system Add name and vnode to cacheAdd name and vnode to cache Look up vnode by nameLook up vnode by name Delete name from the cacheDelete name from the cache Invalidate all names that reference a vnodeInvalidate all names that reference a vnode
– Use capabilities to speed this upUse capabilities to speed this up
59
vnode Capabilitiesvnode Capabilities
Unique, 32-bit numbers assigned by the Unique, 32-bit numbers assigned by the kernel (one per vnode)kernel (one per vnode) If the space is exhausted, purge all If the space is exhausted, purge all
capabilities and start again (once per year)capabilities and start again (once per year) Copy capability into name table when Copy capability into name table when
making entrymaking entry Comparison of capabilities as part of Comparison of capabilities as part of
lookup in name cachelookup in name cache
60
Negative CachingNegative Caching
Names of files that aren’t there are looked up Names of files that aren’t there are looked up in the cache, miss, then also not found in the in the cache, miss, then also not found in the filesystem—expensive!filesystem—expensive!
Alternative: put an entry in the cache with that Alternative: put an entry in the cache with that name, and a null vnode pointer.name, and a null vnode pointer.
This entry will be found on next lookup, short-This entry will be found on next lookup, short-circuiting the processcircuiting the process
If a directory is modified, then we change its If a directory is modified, then we change its capability to avoid improper negative cachingcapability to avoid improper negative caching think about this; what is really happening?think about this; what is really happening?
61
Buffer Management:Buffer Management:the Buffer Cachethe Buffer Cache
Remember that the buffer cache speeds up Remember that the buffer cache speeds up access to block devicesaccess to block devices 85% of transfers can be skipped85% of transfers can be skipped
Buffer formatBuffer format HeaderHeader ContentContent Stored separately (allows buffers to be on page Stored separately (allows buffers to be on page
boundaries, for efficient transfer and manipulation)boundaries, for efficient transfer and manipulation) Buffer data can be from 512 up to 64k bytesBuffer data can be from 512 up to 64k bytes
62
Buffer FormatBuffer Format
hash link
free-list link
flags
vnode pointer
file offset
byte count
buffer size
buffer pointer
(64 Kbyte)MAXBSIZE
buffer contentsbuffer header
63
Buffer Management IIIBuffer Management III
Each buffer initially gets 4096 bytes (multiple Each buffer initially gets 4096 bytes (multiple pages)pages) May give up part of its physical memory laterMay give up part of its physical memory later
Buffers identified by logical block # in fileBuffers identified by logical block # in file Used to be done by physical block #, butUsed to be done by physical block #, but
– Didn’t always have access to physical deviceDidn’t always have access to physical device– Slow lookup through indirect blocks in filesystem to find Slow lookup through indirect blocks in filesystem to find
physical numberphysical number To avoid aliasing, kernel prohibits mounting a To avoid aliasing, kernel prohibits mounting a
partition while its block device is open (and vice partition while its block device is open (and vice versa)versa)
64
Buffer CallsBuffer Calls
bread(): takes a vnode, logical block bread(): takes a vnode, logical block number, and size, and returns pointer to number, and size, and returns pointer to locked bufferlocked buffer Further accesses to this buffer by other Further accesses to this buffer by other
processes cause them to sleepprocesses cause them to sleep Four ways for a node to be unlockedFour ways for a node to be unlocked
– brelse(), bdwrite(), bawrite(), bwrite()brelse(), bdwrite(), bawrite(), bwrite()
65
Buffer Calls IIBuffer Calls II
brelse(): release an unmodified bufferbrelse(): release an unmodified buffer bdwrite(): mark the buffer as dirty and put it bdwrite(): mark the buffer as dirty and put it
back in the free list, and awaken processes back in the free list, and awaken processes waiting for it—how long, on average, will it waiting for it—how long, on average, will it wait to be flushed, and by whom/what?wait to be flushed, and by whom/what?
bawrite(): schedule a write on a full buffer, and bawrite(): schedule a write on a full buffer, and return (asynchronous; implicit brelse())return (asynchronous; implicit brelse())
bwrite(): ensure write is complete before bwrite(): ensure write is complete before returning (implicit brelse())returning (implicit brelse())
66
Buffer ListsBuffer Lists
Bufhash: quick lookup for buffers with Bufhash: quick lookup for buffers with valid contents (exactly one)valid contents (exactly one)
Four “free” lists:Four “free” lists: LOCKED: buffers cant be flushed LOCKED: buffers cant be flushed
(superblock, now only used by lfs)(superblock, now only used by lfs) LRU: an LRU cache of used buffersLRU: an LRU cache of used buffers AGE: Rather like the inactive page listAGE: Rather like the inactive page list EMPTY: buffers with no physical memoryEMPTY: buffers with no physical memory
67
Buffer PoolBuffer Pool
LOCKED LRU AGE EMPTY
bufhash
bfreelist
V1,X0k
V1,X8k
V1,X16k
V1,X24k
V2,X16k
V2,X48k
V8,X40k
V4,X32k
V8,X48k
V7,X0k
V7,X8k
V7,X16k
V8,X56k
-- --
-- --
68
Buffer Management Buffer Management ImplementationImplementation
bread() called with request for a bread() called with request for a datablock of specified size for a datablock of specified size for a specified vnodespecified vnode breadn() gets block and starts readaheadbreadn() gets block and starts readahead
bread()
allocbuf()getnewbuf()
bremfree() getblk()
VOP_STRATEGY() Do the I/O
Check for buffer on free list
Take buffer off of freelist and mark busy.
Adjust memory in buffer to requested size
Find next available buffer on the free list
69
Buffer Management Buffer Management Implementation IIImplementation II
getblk() finds out if the data block is already in getblk() finds out if the data block is already in a buffera buffer If in a buffer, call bremfree() to take it off of If in a buffer, call bremfree() to take it off of
whatever free list it’s on and mark it busywhatever free list it’s on and mark it busy If not in buffer, call getnewbuf(), then allocbuf()If not in buffer, call getnewbuf(), then allocbuf()
getnewbuf() allocates a new buffergetnewbuf() allocates a new buffer allocbuf() makes sure buffer has physical allocbuf() makes sure buffer has physical
memorymemory bread() then passes the buffer to the strategy bread() then passes the buffer to the strategy
routine for the underlying filesystem.routine for the underlying filesystem.
70
Buffer AllocationBuffer Allocation
allocbuf() ensures that the buffer has enough allocbuf() ensures that the buffer has enough physical memoryphysical memory
If there is excess physical memory, take some If there is excess physical memory, take some of it and give it to a buffer on the EMPTY list of it and give it to a buffer on the EMPTY list (move it to AGE)—keep if none there(move it to AGE)—keep if none there
If not enough, call getnewbuf() and then steal If not enough, call getnewbuf() and then steal its memory (putting it on the EMPTY list if you its memory (putting it on the EMPTY list if you steal all of its memory)steal all of its memory)
Keeps each physical page mapeed into Keeps each physical page mapeed into exactly one buffer at all timesexactly one buffer at all times
71
Overlapping BuffersOverlapping Buffers
The kernel must ensure that disk blocks The kernel must ensure that disk blocks appear in at most one bufferappear in at most one buffer
If multiple dirty buffers held portions of If multiple dirty buffers held portions of the same block, the kernel wouldn’t the same block, the kernel wouldn’t know which one was more recentknow which one was more recent
Kernel purges buffers when files are Kernel purges buffers when files are shortened or removedshortened or removed
72
Vnode EvolutionVnode Evolution
Vnodes were originally only used to Vnodes were originally only used to provide an OO-interface to the provide an OO-interface to the underlying filesystemunderlying filesystem
Concept then generalized to allow Concept then generalized to allow vnodes to point to other vnodesvnodes to point to other vnodes
73
Stackable FilesystemStackable Filesystem
With vnodes being able to point to other With vnodes being able to point to other vnodes, we can treat filesystems as vnodes, we can treat filesystems as layers, and stack themlayers, and stack them
So, we can add features (such as So, we can add features (such as transparent encryption) to an underlying transparent encryption) to an underlying filesystem without changing that filesystem without changing that filesystem at allfilesystem at all
74
Mounting LayersMounting Layers
Mount introduces an alias for one filesystem Mount introduces an alias for one filesystem in anotherin another
Traditional mount: map device into a Traditional mount: map device into a filesystem, overlaying the old directory with a filesystem, overlaying the old directory with a file systemfile system
With layers, mount pushes a new layer on a With layers, mount pushes a new layer on a vnode stackvnode stack Can overlay old vnode (use same name in file Can overlay old vnode (use same name in file
space)space) Can appear separately (use different name)Can appear separately (use different name)
75
Layered FilesystemsLayered Filesystems
File access is dependent upon which File access is dependent upon which name is used for the filename is used for the file New name: use new layerNew name: use new layer Old name: use old layerOld name: use old layer This assumes new mounts are at different This assumes new mounts are at different
points from old onespoints from old ones
76
Vnode Options on File AccessVnode Options on File Access
Perform requested operation and return Perform requested operation and return resultresult
Pass operation unchanged to next-lower Pass operation unchanged to next-lower vnode; might (or might not) modify resultvnode; might (or might not) modify result
Modify operands and pass on to next-Modify operands and pass on to next-lower vnode; might (or might not) modify lower vnode; might (or might not) modify resultresult
Operation passed ot the bottom of the Operation passed ot the bottom of the stack without action generates an errorstack without action generates an error
77
Old vs. New VnodesOld vs. New Vnodes
Old vnodesOld vnodes Vnode operations implemented as indirect function Vnode operations implemented as indirect function
callscalls
New vnodesNew vnodes Intermediate layers pass operation to lower layersIntermediate layers pass operation to lower layers New operations added into the system at boot timeNew operations added into the system at boot time Pass operations that may not have been defined Pass operations that may not have been defined
when the filesystem was writtenwhen the filesystem was written Pass parameters of unknown type and numberPass parameters of unknown type and number
78
SolutionSolution
Kernel places vnode operation name Kernel places vnode operation name and arguments into a structand arguments into a struct
That struct passed as single parameter That struct passed as single parameter to vnode operation (uniform interface)to vnode operation (uniform interface)
If the operation is supported, the vnode If the operation is supported, the vnode knows how to decode the argumentknows how to decode the argument
If it’s not supported, then it’s passed to If it’s not supported, then it’s passed to the next lower layer w/same argumentthe next lower layer w/same argument
79
The Null File System: nullfsThe Null File System: nullfs
This would be better called the “identity This would be better called the “identity file system”file system”
Just passes all requests and responses Just passes all requests and responses without changing them at allwithout changing them at all
Can make a filesystem appear within Can make a filesystem appear within itselfitself
Useful as starting point for implementing Useful as starting point for implementing other file systemsother file systems
80
User Mapping File System: User Mapping File System: umapfsumapfs
umapfs mounts the same file system in two places (/local into umapfs mounts the same file system in two places (/local into /export) on the file server/export) on the file server
/export is exported to external domains that use different UID/GID./export is exported to external domains that use different UID/GID. Mapping is automatically done by the umapfs layerMapping is automatically done by the umapfs layer Request handed to /localRequest handed to /local
outside administrative exports local administrative exports
UFSFFSuid/gid mapping
NFS server
operation not supported
NFS server
/export/export/local/local
81
Benefits of umapfsBenefits of umapfs
No cost to local clients (they don’t go No cost to local clients (they don’t go through the umapfs)through the umapfs)
No changes required to NFS or local No changes required to NFS or local filesystem codefilesystem code
We can expand this idea to give each We can expand this idea to give each outside domain its own mapping, outside domain its own mapping, accessed through its own mount pointaccessed through its own mount point
82
Union Mount Filesystem: unionfsUnion Mount Filesystem: unionfs Traditional mount hides any files in the directory Traditional mount hides any files in the directory
before the mountbefore the mount Union filesystem does a transparent overlay (you Union filesystem does a transparent overlay (you
can still see the old)can still see the old) Both can be accessed simultaneouslyBoth can be accessed simultaneously
/usr/src/usr/src
src src src
shell shell shell
Makefile shsh.h sh.c
/usr/src/usr/src/tmp/src/tmp/src
shsh.h sh.cMakefile
83
More unionfsMore unionfs Start with an empty top-level directoryStart with an empty top-level directory As the underlying fs is traversed, entries are made in As the underlying fs is traversed, entries are made in
the top levelthe top level Lower levels are read-onlyLower levels are read-only Any names appearing in top fs obscure lower fsAny names appearing in top fs obscure lower fs
/usr/src/usr/src
src src src
shell shell shell
Makefile shsh.h sh.c
/usr/src/usr/src/tmp/src/tmp/src
shsh.h sh.cMakefile
84
Still More unionfsStill More unionfs
Two conflicting requirements:Two conflicting requirements: Lower layers are read-onlyLower layers are read-only User should think this is a regular file systemUser should think this is a regular file system
We already copy files up on writes, but how do We already copy files up on writes, but how do we handle deletions?we handle deletions? Deletions of unmodified files (lower layer)Deletions of unmodified files (lower layer) Deletions of previously modified files (top layer)Deletions of previously modified files (top layer)
Use “whiteout” entries in the top level: inode 1Use “whiteout” entries in the top level: inode 1
On hitting a whiteout entry, kernel returns error.On hitting a whiteout entry, kernel returns error.
85
Tricky Issues With Whiteout: Tricky Issues With Whiteout: DirectoriesDirectories
What happens if I create a file with the What happens if I create a file with the same name as a whiteout file?same name as a whiteout file?
What about a directory? Is there any What about a directory? Is there any difference?difference?
Compare what happens in the “normal” Compare what happens in the “normal” situation where the same directory situation where the same directory appears in multiple layers of the appears in multiple layers of the filesystem stackfilesystem stack
86
Uses of the unionfsUses of the unionfs
Heterogeneous architectures can build from a Heterogeneous architectures can build from a common source basecommon source base NFS mount source treeNFS mount source tree Union mount over the topUnion mount over the top Builds stay localBuilds stay local
Can compile sources “on” read-only media Can compile sources “on” read-only media (e.g., CD-ROM)(e.g., CD-ROM)
Private source directories from common code Private source directories from common code basesbases
87
Other Common Filesystems: Other Common Filesystems: PortalPortal
Mount a process into the filesystemMount a process into the filesystem Pass remainder of path name to process as Pass remainder of path name to process as
inputinput Process returns file descriptorProcess returns file descriptor
Could be to processCould be to process Could be to fileCould be to file
Can be used, e.g., to provide filesystem Can be used, e.g., to provide filesystem access to network services (opening file access to network services (opening file creates TCP connection to server)creates TCP connection to server)
88
Other Common Filesystems:Other Common Filesystems:procfsprocfs
procfs provides information on running procfs provides information on running processes, through the file system (viz ps)processes, through the file system (viz ps)
An ls of /proc shows a directory for each An ls of /proc shows a directory for each process, which contains:process, which contains: ctl: file allowing control of the processctl: file allowing control of the process file: the executable (compiled code)file: the executable (compiled code) mem: virtual memory for the processmem: virtual memory for the process regs: registers for the processregs: registers for the process status: text file with information (e.g. state, ppid, …)status: text file with information (e.g. state, ppid, …)
89
End of LectureEnd of Lecture