i/o systems (based on 4.4bsd)

89
1 I/O Systems (based on I/O Systems (based on 4.4BSD) 4.4BSD)

Upload: donkor

Post on 11-Jan-2016

27 views

Category:

Documents


0 download

DESCRIPTION

I/O Systems (based on 4.4BSD). Properties of the I/O Subsystem. Anonymizes (unifies the interface to) hardware devices, both to users and to other parts of the kernel Buffer-caching systems General device-driver code Hardware-specific device drivers. Four Main Kinds of I/O. Filesystem - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: I/O Systems (based on 4.4BSD)

1

I/O Systems (based on 4.4BSD)I/O Systems (based on 4.4BSD)

Page 2: I/O Systems (based on 4.4BSD)

2

Properties of the I/O SubsystemProperties of the I/O Subsystem

Anonymizes (unifies the interface to) Anonymizes (unifies the interface to) hardware devices, both to users and to hardware devices, both to users and to other parts of the kernelother parts of the kernel

Buffer-caching systemsBuffer-caching systems General device-driver codeGeneral device-driver code Hardware-specific device driversHardware-specific device drivers

Page 3: I/O Systems (based on 4.4BSD)

3

Four Main Kinds of I/OFour Main Kinds of I/O

FilesystemFilesystem Actual files (although this is too restrictive)Actual files (although this is too restrictive) Hierarchical name space, locking, quotas, Hierarchical name space, locking, quotas,

attribute management, protectionattribute management, protection Socket interfaceSocket interface

Networking requestsNetworking requests

Page 4: I/O Systems (based on 4.4BSD)

4

Four Types of I/O IIFour Types of I/O II

Block-device interfaceBlock-device interface Structured access to devicesStructured access to devices Uses buffer cache Uses buffer cache

– Minimizes I/O operationsMinimizes I/O operations– Synchronize with filesystem operationsSynchronize with filesystem operations– Uses kernel I/O buffersUses kernel I/O buffers

E.g. reading a byte from a file causes a E.g. reading a byte from a file causes a block to be fetched, and that block is stored block to be fetched, and that block is stored in the disk buffer cachein the disk buffer cache

Page 5: I/O Systems (based on 4.4BSD)

5

Four Types of I/O IIIFour Types of I/O III

Character-device interfaceCharacter-device interface Unstructured access to devicesUnstructured access to devices Two typesTwo types

– Actual character devices (e.g. terminals)Actual character devices (e.g. terminals)– ““raw” access to block devices (e.g. disks, tapes)raw” access to block devices (e.g. disks, tapes)

I/O doesn’t use the buffer cacheI/O doesn’t use the buffer cache Requests are still in multiples of block sizeRequests are still in multiples of block size Requests might need to be alignedRequests might need to be aligned

Page 6: I/O Systems (based on 4.4BSD)

6

Internal I/O StructureInternal I/O Structure

Device drivers are modules of code dedicated Device drivers are modules of code dedicated to handling I/O for a particular piece or kind of to handling I/O for a particular piece or kind of hardwarehardware

Drivers define a set of operations via well-Drivers define a set of operations via well-known entry points (e.g., read, write, etc.)known entry points (e.g., read, write, etc.)

Block device interface is kept in Block device interface is kept in bdevsw bdevsw structurestructure

Character device interface is kept in Character device interface is kept in cdevswcdevsw structure.structure.

Page 7: I/O Systems (based on 4.4BSD)

7

Kernel I/O StructureKernel I/O Structure

buffer cache

character-device driver

the hardware

system call interface to the kernel

VM

network-interface drivers

block-device driver

MFS

cooked disk

raw disk and tty

tty

line disciplin

e

FFSLFS

NFS

network protocols

socket

VNODE layer

local naming (UFS) special devices

active file entries

swap- space mgmt.

Page 8: I/O Systems (based on 4.4BSD)

8

Device NumbersDevice Numbers

How we select the device driver and How we select the device driver and actual deviceactual device

Device number is a (major, minor) pairDevice number is a (major, minor) pair Major number selects the device driverMajor number selects the device driver

Single device can have two major numbersSingle device can have two major numbers—one each for block and char.—one each for block and char.

Minor number selects which (of possibly Minor number selects which (of possibly many) devices—interpreted by the drivermany) devices—interpreted by the driver

Page 9: I/O Systems (based on 4.4BSD)

9

Device DriversDevice Drivers

InitializationInitialization Probes device, sets initial software stateProbes device, sets initial software state

Top HalfTop Half System calls, may call sleep()System calls, may call sleep()

Bottom HalfBottom Half Interrupt handlers; no per-process state or blockingInterrupt handlers; no per-process state or blocking

Error handlingError handling Crash-dump routine, called by system when an Crash-dump routine, called by system when an

unrecoverable error occurs.unrecoverable error occurs.

Page 10: I/O Systems (based on 4.4BSD)

10

I/O QueuingI/O Queuing

How data moves between the top and How data moves between the top and bottom halves of the kernelbottom halves of the kernel

Top half Top half receives request from user through system receives request from user through system

callcall records request in data structurerecords request in data structure places data structure in queueplaces data structure in queue starts device if necessarystarts device if necessary

Page 11: I/O Systems (based on 4.4BSD)

11

Interrupt HandlersInterrupt Handlers

Device interrupts after completing the Device interrupts after completing the I/O operationI/O operation

Removes request from device queueRemoves request from device queue Notifies requester that command has Notifies requester that command has

completed (how?)completed (how?) Starts next request from the queueStarts next request from the queue

Page 12: I/O Systems (based on 4.4BSD)

12

Top/Bottom CoordinationTop/Bottom Coordination

What might happen if an interrupt What might happen if an interrupt occurred while a request was being occurred while a request was being enqueued in the top half?enqueued in the top half?

The queue is just like any other shared The queue is just like any other shared data structuredata structure

To enforce synchronization, use the To enforce synchronization, use the splxxx() routines.splxxx() routines.

Page 13: I/O Systems (based on 4.4BSD)

13

Interrupt MechanismInterrupt Mechanism Device interruptsDevice interrupts Call glue routine through Call glue routine through

interrupt table (glue routine interrupt table (glue routine allows ISR to be machine allows ISR to be machine independent)independent)

Glue routine saves volatile Glue routine saves volatile registersregisters

… … updates statisticsupdates statistics … … calls ISR passing in calls ISR passing in

device unit numberdevice unit number … … restores volatile registersrestores volatile registers … … returns from interruptreturns from interrupt

glue routine

interrupt service routine

interrupt

interruptdispatch

table

Page 14: I/O Systems (based on 4.4BSD)

14

Block DevicesBlock Devices

User abstraction: array of bytesUser abstraction: array of bytes Physical device: disk, tape, …Physical device: disk, tape, … Driver converts byte-based requests to Driver converts byte-based requests to

block I/Oblock I/O.. Block I/O uses the buffer cacheBlock I/O uses the buffer cache One can use raw devices to bypass the One can use raw devices to bypass the

buffer cache, but coordination is criticalbuffer cache, but coordination is critical

Page 15: I/O Systems (based on 4.4BSD)

15

Buffer CacheBuffer Cache

System caches blocks and returns System caches blocks and returns portion read (think small reads from portion read (think small reads from disk)disk)

Synchronous writes can be done with Synchronous writes can be done with sync()sync()

System caches writes and periodically System caches writes and periodically flushes dirty blocks to disk.flushes dirty blocks to disk.

fsync() flushes a single file.fsync() flushes a single file.

Page 16: I/O Systems (based on 4.4BSD)

16

Block Device Interface: open()Block Device Interface: open()

Prepare device for I/OPrepare device for I/O Verify the integrity of the medium (e.g. Verify the integrity of the medium (e.g.

make sure that the floppy disk is make sure that the floppy disk is inserted when you’re mounting a inserted when you’re mounting a filesystem there).filesystem there).

This is not the same as opening a file, This is not the same as opening a file, although it uses the same interfacealthough it uses the same interface

Page 17: I/O Systems (based on 4.4BSD)

17

Block Device Interface: strategy()Block Device Interface: strategy()

Start read/write operation and return Start read/write operation and return without waiting for it to complete without waiting for it to complete (default)(default)

Can be called synchronously (caller Can be called synchronously (caller sleeps on buffer passed in)sleeps on buffer passed in)

bread()/bwrite() map through to thisbread()/bwrite() map through to this

Page 18: I/O Systems (based on 4.4BSD)

18

Block Device Interface: close()Block Device Interface: close()

Called when the final client is using the Called when the final client is using the device terminatesdevice terminates

Device-dependent behaviorDevice-dependent behavior Disks: do nothingDisks: do nothing Tapes: write end-of-tape marker, rewindTapes: write end-of-tape marker, rewind

Page 19: I/O Systems (based on 4.4BSD)

19

Block Device Interface: dump()Block Device Interface: dump()

Used to save the contents of physical Used to save the contents of physical memory to backing storememory to backing store

Dump used to analyze the crashDump used to analyze the crash All disks must support thisAll disks must support this This is what ends up in /lost+found on a This is what ends up in /lost+found on a

filesystemfilesystem

Page 20: I/O Systems (based on 4.4BSD)

20

Block Device Interface: psize()Block Device Interface: psize()

Returns the size of a partition, in terms Returns the size of a partition, in terms of blocks of size DEV_BSIZEof blocks of size DEV_BSIZE

Makes sense mostly in disksMakes sense mostly in disks Used during bootstrap to calculate Used during bootstrap to calculate

location where a crash dump should go, location where a crash dump should go, and to determine size of swap deviceand to determine size of swap device

Page 21: I/O Systems (based on 4.4BSD)

21

Disk Scheduling: disksort()Disk Scheduling: disksort()

Implements a variant of the elevator Implements a variant of the elevator algorithm: which one is it?algorithm: which one is it?

Keeps two sorted lists of requests:Keeps two sorted lists of requests: The current one (the one we’re servicing)The current one (the one we’re servicing) The other one (for the opposite direction)The other one (for the opposite direction) Switch whenever we hit the endSwitch whenever we hit the end

Modern disk controllers will allow multiple Modern disk controllers will allow multiple outstanding requests and will sort the outstanding requests and will sort the requests themselvesrequests themselves

Page 22: I/O Systems (based on 4.4BSD)

22

Disk LabelsDisk Labels

The geometry of a disk is the number of The geometry of a disk is the number of cylinders/tracks/sectors on a disk, and the cylinders/tracks/sectors on a disk, and the size of the sectorssize of the sectors

Disks can have different geometries imposed Disks can have different geometries imposed by a low-level formatby a low-level format

Disk labels are written in a well-known Disk labels are written in a well-known location on the disk, and contain the location on the disk, and contain the geometry, as well as information about geometry, as well as information about partition usage.partition usage.

Page 23: I/O Systems (based on 4.4BSD)

23

BootstrappingBootstrapping

The first level bootstrap loader reads the The first level bootstrap loader reads the first few disk sectors to findfirst few disk sectors to find The geometry of the diskThe geometry of the disk A second-level bootstrap routineA second-level bootstrap routine

– Think of grub, lilo, multiboot, et al.Think of grub, lilo, multiboot, et al.

See, for example, See, for example, http://www.nomoa.com/bsd/dualboot.htmhttp://www.nomoa.com/bsd/dualboot.htm

Page 24: I/O Systems (based on 4.4BSD)

24

Character DevicesCharacter Devices

Most peripherals (except networks) have Most peripherals (except networks) have character interfacescharacter interfaces

For many devices, this is just a byte For many devices, this is just a byte streamstream /dev/lp0/dev/lp0 /dev/tty00/dev/tty00 /dev/mem/dev/mem /dev/null/dev/null

Page 25: I/O Systems (based on 4.4BSD)

25

Character Devices IICharacter Devices II For “block” devices, the character interface is For “block” devices, the character interface is

called the “raw” interfacecalled the “raw” interface Direct access to the deviceDirect access to the device Must manage own buffers (not use buffer cache)Must manage own buffers (not use buffer cache) User-level buffer must be wired during I/OUser-level buffer must be wired during I/O Not a byte-stream model; still have to talk in terms Not a byte-stream model; still have to talk in terms

understood by underlying deviceunderstood by underlying device– Size, alignmentSize, alignment

Should only be used by programs that have an Should only be used by programs that have an intimate knowledge of the underlying device and intimate knowledge of the underlying device and format; otherwise, chaos may ensueformat; otherwise, chaos may ensue

Page 26: I/O Systems (based on 4.4BSD)

26

Character Device InterfaceCharacter Device Interface

Open, close: just Open, close: just what you thinkwhat you think

Read, write: call Read, write: call physio() to do the physio() to do the transfertransfer

Ioctl: the “swiss Ioctl: the “swiss army knife” call; army knife” call; highly-dependent on highly-dependent on devicedevice

Select: see whether the Select: see whether the device is ready for device is ready for reading/writing (cf. reading/writing (cf. select() syscall)select() syscall)

Stop: flow control on Stop: flow control on terminal devicesterminal devices

Mmap: map device Mmap: map device memory into address memory into address spacespace

Reset: re-initialize the Reset: re-initialize the device statedevice state

Page 27: I/O Systems (based on 4.4BSD)

27

File DescriptorsFile Descriptors

System calls use file descriptors to refer System calls use file descriptors to refer to open files and devicesto open files and devices File entry contains file type and pointer to File entry contains file type and pointer to

underlying objectunderlying object– Socket (IPC)Socket (IPC)– Vnode (device, file)Vnode (device, file)– Virtual memoryVirtual memory

file descriptor

descriptor table

file entry

interprocess communicatio

n

vnode

virtual memoryuser process filedesc process

substructure

kernel list

Page 28: I/O Systems (based on 4.4BSD)

28

Open File EntriesOpen File Entries

““object-oriented” data structureobject-oriented” data structure Methods:Methods:

ReadRead WriteWrite SelectSelect IoctlIoctl Close/deallocateClose/deallocate

Really just a type and a table of function Really just a type and a table of function pointerspointers

Page 29: I/O Systems (based on 4.4BSD)

29

Open File Entries IIOpen File Entries II

Each file entry has a pointer to a data structure Each file entry has a pointer to a data structure describing the file objectdescribing the file object Reference passed to each function call operating on Reference passed to each function call operating on

the filethe file All object state is stored in the data structureAll object state is stored in the data structure Opaque to the routines that manipulate the file entryOpaque to the routines that manipulate the file entry

—only interpreted by code for underlying operations—only interpreted by code for underlying operations File offset determines position in file for next File offset determines position in file for next

read or writeread or write Not stored in per-object data structure. Why not?Not stored in per-object data structure. Why not?

Page 30: I/O Systems (based on 4.4BSD)

30

File Descriptor FlagsFile Descriptor Flags

Protection flags (R, W, RW)Protection flags (R, W, RW) Checked before system call executedChecked before system call executed System calls themselves don’t have to checkSystem calls themselves don’t have to check

NDELAY: return EWOULDBLOCK if the call NDELAY: return EWOULDBLOCK if the call would block the processwould block the process

ASYNC: send SIGIO to the process when I/O ASYNC: send SIGIO to the process when I/O becomes possiblebecomes possible

Append: set offset to EOF on each writeAppend: set offset to EOF on each write Locking information (shared/exclusive)Locking information (shared/exclusive)

Page 31: I/O Systems (based on 4.4BSD)

31

File Entry Structure CountsFile Entry Structure Counts

Reference CountsReference Counts Count the number of times processes point to the Count the number of times processes point to the

file structurefile structure– A single process can have multiple references to (i.e., A single process can have multiple references to (i.e.,

descriptors for) the file (dup, fcntl)descriptors for) the file (dup, fcntl)– Processes inherit file entries on forkProcesses inherit file entries on fork

Message CountsMessage Counts Descriptors can be sent in messages to local filesDescriptors can be sent in messages to local files

– Increment message count when descriptor sent, and Increment message count when descriptor sent, and decrement when receiveddecrement when received

Page 32: I/O Systems (based on 4.4BSD)

32

File Structure Manipulation:File Structure Manipulation:fcntl()fcntl()

Duplicate a descriptor (same as dup())Duplicate a descriptor (same as dup()) Get or set close-on-exec flagGet or set close-on-exec flag

When set, descriptor is closed when a new When set, descriptor is closed when a new program is exec()’edprogram is exec()’ed

Set descriptor into nonblocking modeSet descriptor into nonblocking mode Set NDELAY flagSet NDELAY flag Not implemented for regular files in 4.4BSDNot implemented for regular files in 4.4BSD

Page 33: I/O Systems (based on 4.4BSD)

33

File Structure Manipulation:File Structure Manipulation:fcntl() IIfcntl() II

Set append flagSet append flag Set ASYNC flagSet ASYNC flag Send signal to process when urgent data Send signal to process when urgent data

arisesarises Set or get the PID or PGID to which to send Set or get the PID or PGID to which to send

the two signalsthe two signals Test or set status of a lock on a range of Test or set status of a lock on a range of

bytes in the underlying filebytes in the underlying file

Page 34: I/O Systems (based on 4.4BSD)

34

File Descriptor LockingFile Descriptor Locking

Early Unix didn’t have file lockingEarly Unix didn’t have file locking So, it used an atomic operation on the file system So, it used an atomic operation on the file system

to simulate it. How?to simulate it. How?

Downsides of this approach:Downsides of this approach: Processes consumed CPU time loopingProcesses consumed CPU time looping Locks left lying around have to be cleaned up by Locks left lying around have to be cleaned up by

handhand Root processes would override the lock Root processes would override the lock

mechanismmechanism

Real locks were added to 4.2 BSDReal locks were added to 4.2 BSD

Page 35: I/O Systems (based on 4.4BSD)

35

File Locking Options: File Locking Options: Lock the Whole FileLock the Whole File

Serializes access to the fileSerializes access to the file Fast (can just record the PID/PGRP of Fast (can just record the PID/PGRP of

the locking process and proceed)the locking process and proceed) Inherited by childrenInherited by children Released when last descriptor for a file Released when last descriptor for a file

is closedis closed When does it work well? When poorly?When does it work well? When poorly? 4.2 BSD and 4.3 BSD had this4.2 BSD and 4.3 BSD had this

Page 36: I/O Systems (based on 4.4BSD)

36

File Locking Options:File Locking Options:Byte-Range LocksByte-Range Locks

Can lock a piece of a file (down to a byte)Can lock a piece of a file (down to a byte) Not powerful enough to do nested Not powerful enough to do nested

hierarchical locks (databases)hierarchical locks (databases) Complex implementationComplex implementation Mandated by POSIX standard, so put Mandated by POSIX standard, so put

into 4.4 BSDinto 4.4 BSD Released Released eacheach time a close is done on a time a close is done on a

descriptor referencing the filedescriptor referencing the file

Page 37: I/O Systems (based on 4.4BSD)

37

File Locking Options:File Locking Options:Advisory vs. MandatoryAdvisory vs. Mandatory

Mandatory:Mandatory: Locks enforced for all processesLocks enforced for all processes Cannot be bypassed (even by root)Cannot be bypassed (even by root)

Advisory:Advisory: Locks enforced only for processes that Locks enforced only for processes that

request themrequest them

Page 38: I/O Systems (based on 4.4BSD)

38

File Locking Options:File Locking Options:Shared vs. ExclusiveShared vs. Exclusive

Shared locks: multiple processes may access Shared locks: multiple processes may access the filethe file

Exclusive locks: only one processExclusive locks: only one process Request a lock when another process has an Request a lock when another process has an

exclusive lock => you blockexclusive lock => you block Request an exclusive lock when another Request an exclusive lock when another

process holds a lock => you blockprocess holds a lock => you block Think of the classic readers-writers problemThink of the classic readers-writers problem

Page 39: I/O Systems (based on 4.4BSD)

39

4.4 BSD Locks4.4 BSD Locks

Both whole-file and byte-rangeBoth whole-file and byte-range Whole-file implemented as byte-range over the entire fileWhole-file implemented as byte-range over the entire file Closing behavior handled as special case checkClosing behavior handled as special case check

Advisory locksAdvisory locks Both shared and exclusive may be requestedBoth shared and exclusive may be requested May request lock when opening a file (avoid race May request lock when opening a file (avoid race

condition)condition) Done on per-filesystem basisDone on per-filesystem basis Process may specify to return with error instead of Process may specify to return with error instead of

blocking if lock is busyblocking if lock is busy

Page 40: I/O Systems (based on 4.4BSD)

40

Doing I/O for Doing I/O for Multiple DescriptorsMultiple Descriptors

Consider the X ServerConsider the X Server Read from keyboardRead from keyboard Read from mouseRead from mouse Write to frame buffer (video)Write to frame buffer (video) Read from multiple network connectionsRead from multiple network connections Write to multiple network connectionsWrite to multiple network connections

If we try to read from a descriptor that If we try to read from a descriptor that isn’t ready, we’ll blockisn’t ready, we’ll block

Page 41: I/O Systems (based on 4.4BSD)

41

Historical SolutionHistorical Solution

Use multiple processes communicating Use multiple processes communicating through shared facilitiesthrough shared facilities

Example:Example: Have a “master” process fork off a separate Have a “master” process fork off a separate

process to handle each I/O device or client for the process to handle each I/O device or client for the X ServerX Server

For N children, N+1 pipes:For N children, N+1 pipes:– One for all children to talk to X ServerOne for all children to talk to X Server– One per child for the X Server to talk backOne per child for the X Server to talk back

High overhead in context switchesHigh overhead in context switches

Page 42: I/O Systems (based on 4.4BSD)

42

4.4 BSD’s Solution: 4.4 BSD’s Solution: Three I/O OptionsThree I/O Options

Polling I/O: Polling I/O: Repeatedly check a set of descriptors for Repeatedly check a set of descriptors for

available I/O; see discussion of Selectavailable I/O; see discussion of Select Nonblocking I/O:Nonblocking I/O:

Complete immediately w/partial resultsComplete immediately w/partial results Signal-driven I/O:Signal-driven I/O:

Notify process when I/O becomes possibleNotify process when I/O becomes possible

Page 43: I/O Systems (based on 4.4BSD)

43

Possibilities to Avoid the Possibilities to Avoid the Blocking Problem (1-2)Blocking Problem (1-2)

Set all descriptors to nonblocking modeSet all descriptors to nonblocking mode This ends up being a different form of This ends up being a different form of

polling (busy waiting)polling (busy waiting) Set all descriptors to send signalsSet all descriptors to send signals

Signals are expensive to catch; not good Signals are expensive to catch; not good for a high I/O process like the X Serverfor a high I/O process like the X Server

Page 44: I/O Systems (based on 4.4BSD)

44

Possibilities to Avoid the Possibilities to Avoid the Blocking Problem (3-4)Blocking Problem (3-4)

Have a query mechanism to find out Have a query mechanism to find out which descriptors are readywhich descriptors are ready Block process if none are readyBlock process if none are ready Two system calls: one for check, one for I/OTwo system calls: one for check, one for I/O

Have the process give the kernel a set of Have the process give the kernel a set of descriptors to read from, and be told on descriptors to read from, and be told on return which one provided the datareturn which one provided the data Can’t mix reads and writesCan’t mix reads and writes

Page 45: I/O Systems (based on 4.4BSD)

45

The Virtual Filesystem InterfaceThe Virtual Filesystem Interface

In older systems, file entries pointed In older systems, file entries pointed directly to a filesystem index node (inode).directly to a filesystem index node (inode). An index node describes the layout of a fileAn index node describes the layout of a file

However, this tied file entries to the However, this tied file entries to the particular filesystem (the Berkeley Fast particular filesystem (the Berkeley Fast File System, or FFS).File System, or FFS).

Add a layer to sit between the file entries Add a layer to sit between the file entries and the inodes: the virtual node (vnode)and the inodes: the virtual node (vnode)

Page 46: I/O Systems (based on 4.4BSD)

46

VnodesVnodes

An object-oriented interfaceAn object-oriented interface Vnodes for local filesystems map onto Vnodes for local filesystems map onto

inodesinodes Vnodes for remote filesystem map to Vnodes for remote filesystem map to

network protocol control blocks describing network protocol control blocks describing how to find and access the filehow to find and access the file

Page 47: I/O Systems (based on 4.4BSD)

47

Remember This?Remember This?

buffer cache

character-device driver

the hardware

system call interface to the kernel

VM

network-interface drivers

block-device driver

MFS

cooked disk

raw disk and tty

tty

line disciplin

e

FFSLFS

NFS

network protocols

socket

VNODE layerVNODE layer

local naming (UFS) special devices

active file entries

swap- space mgmt.

Page 48: I/O Systems (based on 4.4BSD)

48

Vnode entriesVnode entries

Flags for locking and Flags for locking and generic attributes (e.g. generic attributes (e.g. filesystem root)filesystem root)

Reference countsReference counts Pointer to mount Pointer to mount

structure for filesystem structure for filesystem containing the vnodecontaining the vnode

File read-ahead File read-ahead informationinformation

An NFS lease referenceAn NFS lease reference

Reference to state of Reference to state of special devices, sockets, special devices, sockets, FIFOsFIFOs

Pointer to vnode operationsPointer to vnode operations Pointer to private Pointer to private

information (inode, information (inode, nsfnode)nsfnode)

Type of underlying objectType of underlying object Clean and dirty buffer poolsClean and dirty buffer pools Count of write operations in Count of write operations in

progressprogress

Page 49: I/O Systems (based on 4.4BSD)

49

Vnode OperationsVnode Operations

At boot time, each filesystem registers At boot time, each filesystem registers the set of vnode operations it supports.the set of vnode operations it supports.

Kernel builds a table listing the union of Kernel builds a table listing the union of all operations on any filesystem, and an all operations on any filesystem, and an operations vector for each filesystemoperations vector for each filesystem

Supported operations point to routines; Supported operations point to routines; unsupported point to a default routine.unsupported point to a default routine.

Page 50: I/O Systems (based on 4.4BSD)

50

Filesystem OperationsFilesystem Operations

File systems provide two types of File systems provide two types of operations:operations: Name space operationsName space operations On-disk file storageOn-disk file storage

4.3 BSD had these tied together; 4.3 BSD had these tied together; 4.4BSD separates them4.4BSD separates them

Page 51: I/O Systems (based on 4.4BSD)

51

Pathname TranslationPathname Translation

The pathname is copied in from the user The pathname is copied in from the user process (or extracted from network process (or extracted from network buffer for remote filesystem request)buffer for remote filesystem request)

Starting point of the pathname is Starting point of the pathname is determined (root for absolute path determined (root for absolute path names, current directory for relative names, current directory for relative names)—this is the lookup directory in names)—this is the lookup directory in the next stepthe next step

Page 52: I/O Systems (based on 4.4BSD)

52

Pathname Translation IIPathname Translation II

Vnode layer calls filesystem-specific Vnode layer calls filesystem-specific lookup() operation, passing in the lookup() operation, passing in the remaining pathname and the lookup remaining pathname and the lookup directorydirectory

Filesystem code searches the lookup Filesystem code searches the lookup directory for the next component of the directory for the next component of the path, and returns either the vnode for path, and returns either the vnode for that component or an errorthat component or an error

Page 53: I/O Systems (based on 4.4BSD)

53

Pathname Translation III:Pathname Translation III:Possible OutcomesPossible Outcomes

Error returned: propagate itError returned: propagate it Pathname not exhausted, but current vnode is not Pathname not exhausted, but current vnode is not

a directory: “not a directory” errora directory: “not a directory” error Pathname exhausted, then the current vnode is Pathname exhausted, then the current vnode is

the result; return itthe result; return it If vnode is the mount point for another file system, If vnode is the mount point for another file system,

make the lookup directory the mounted make the lookup directory the mounted filesystem; else make the lookup directory the filesystem; else make the lookup directory the returned vnodereturned vnode In either case, iteratively call lookup() with the new In either case, iteratively call lookup() with the new

lookup directorylookup directory

Page 54: I/O Systems (based on 4.4BSD)

54

Common VnodeCommon Vnode(Exported) Filesystem Services(Exported) Filesystem Services

Generic mount optionsGeneric mount options Noexec: don’t execute any files found on this file Noexec: don’t execute any files found on this file

system (useful in heterogeneous networked system (useful in heterogeneous networked environments)environments)

Nosuid: don’t honor setuid or setgid flags for Nosuid: don’t honor setuid or setgid flags for executables on this filesystem (security)executables on this filesystem (security)

Nodev: don’t allow any special devices on this Nodev: don’t allow any special devices on this filesystem to be opened (again, heterogeneous filesystem to be opened (again, heterogeneous systems)systems)

ReadonlyReadonly User mountable: CD drives, floppiesUser mountable: CD drives, floppies

Page 55: I/O Systems (based on 4.4BSD)

55

Common Vnode Services II:Common Vnode Services II:Informational servicesInformational services

statfs(): returns information about the statfs(): returns information about the filesystem resources (number of used filesystem resources (number of used and free disk blocks and inodes, and free disk blocks and inodes, filesystem mount point, and source of filesystem mount point, and source of the mounted filesystem)the mounted filesystem)

getfsstat(): returns information about all getfsstat(): returns information about all mounted filesystemsmounted filesystems

Page 56: I/O Systems (based on 4.4BSD)

56

Filesystem-Independent Services: Filesystem-Independent Services: Inactive()Inactive()

Called by the vnode layer when last Called by the vnode layer when last reference to a file is closed, usage count reference to a file is closed, usage count goes to 0goes to 0

Filesystem can then write dirty blocks, Filesystem can then write dirty blocks, etc.etc.

Vnode is placed on system-wide free listVnode is placed on system-wide free list Floating free list allows vnodes to migrate Floating free list allows vnodes to migrate

to the file system where they’re neededto the file system where they’re needed

Page 57: I/O Systems (based on 4.4BSD)

57

reclaim() disassociates the vnode with reclaim() disassociates the vnode with the underlying file systemthe underlying file system

vgone() uses reclaim and then vgone() uses reclaim and then associates the vnode with the dead associates the vnode with the dead filesystemfilesystem All calls except close() generate an errorAll calls except close() generate an error Used to disassociate a terminal when the Used to disassociate a terminal when the

session leader exits, or when unmounting a session leader exits, or when unmounting a filesystem where processes have open filesfilesystem where processes have open files

Filesystem-Independent Services: Filesystem-Independent Services: reclaim() & vgone()reclaim() & vgone()

Page 58: I/O Systems (based on 4.4BSD)

58

Name CacheName Cache

Associate a name with its vnode to save Associate a name with its vnode to save lookups in the file systemlookups in the file system Add name and vnode to cacheAdd name and vnode to cache Look up vnode by nameLook up vnode by name Delete name from the cacheDelete name from the cache Invalidate all names that reference a vnodeInvalidate all names that reference a vnode

– Use capabilities to speed this upUse capabilities to speed this up

Page 59: I/O Systems (based on 4.4BSD)

59

vnode Capabilitiesvnode Capabilities

Unique, 32-bit numbers assigned by the Unique, 32-bit numbers assigned by the kernel (one per vnode)kernel (one per vnode) If the space is exhausted, purge all If the space is exhausted, purge all

capabilities and start again (once per year)capabilities and start again (once per year) Copy capability into name table when Copy capability into name table when

making entrymaking entry Comparison of capabilities as part of Comparison of capabilities as part of

lookup in name cachelookup in name cache

Page 60: I/O Systems (based on 4.4BSD)

60

Negative CachingNegative Caching

Names of files that aren’t there are looked up Names of files that aren’t there are looked up in the cache, miss, then also not found in the in the cache, miss, then also not found in the filesystem—expensive!filesystem—expensive!

Alternative: put an entry in the cache with that Alternative: put an entry in the cache with that name, and a null vnode pointer.name, and a null vnode pointer.

This entry will be found on next lookup, short-This entry will be found on next lookup, short-circuiting the processcircuiting the process

If a directory is modified, then we change its If a directory is modified, then we change its capability to avoid improper negative cachingcapability to avoid improper negative caching think about this; what is really happening?think about this; what is really happening?

Page 61: I/O Systems (based on 4.4BSD)

61

Buffer Management:Buffer Management:the Buffer Cachethe Buffer Cache

Remember that the buffer cache speeds up Remember that the buffer cache speeds up access to block devicesaccess to block devices 85% of transfers can be skipped85% of transfers can be skipped

Buffer formatBuffer format HeaderHeader ContentContent Stored separately (allows buffers to be on page Stored separately (allows buffers to be on page

boundaries, for efficient transfer and manipulation)boundaries, for efficient transfer and manipulation) Buffer data can be from 512 up to 64k bytesBuffer data can be from 512 up to 64k bytes

Page 62: I/O Systems (based on 4.4BSD)

62

Buffer FormatBuffer Format

hash link

free-list link

flags

vnode pointer

file offset

byte count

buffer size

buffer pointer

(64 Kbyte)MAXBSIZE

buffer contentsbuffer header

Page 63: I/O Systems (based on 4.4BSD)

63

Buffer Management IIIBuffer Management III

Each buffer initially gets 4096 bytes (multiple Each buffer initially gets 4096 bytes (multiple pages)pages) May give up part of its physical memory laterMay give up part of its physical memory later

Buffers identified by logical block # in fileBuffers identified by logical block # in file Used to be done by physical block #, butUsed to be done by physical block #, but

– Didn’t always have access to physical deviceDidn’t always have access to physical device– Slow lookup through indirect blocks in filesystem to find Slow lookup through indirect blocks in filesystem to find

physical numberphysical number To avoid aliasing, kernel prohibits mounting a To avoid aliasing, kernel prohibits mounting a

partition while its block device is open (and vice partition while its block device is open (and vice versa)versa)

Page 64: I/O Systems (based on 4.4BSD)

64

Buffer CallsBuffer Calls

bread(): takes a vnode, logical block bread(): takes a vnode, logical block number, and size, and returns pointer to number, and size, and returns pointer to locked bufferlocked buffer Further accesses to this buffer by other Further accesses to this buffer by other

processes cause them to sleepprocesses cause them to sleep Four ways for a node to be unlockedFour ways for a node to be unlocked

– brelse(), bdwrite(), bawrite(), bwrite()brelse(), bdwrite(), bawrite(), bwrite()

Page 65: I/O Systems (based on 4.4BSD)

65

Buffer Calls IIBuffer Calls II

brelse(): release an unmodified bufferbrelse(): release an unmodified buffer bdwrite(): mark the buffer as dirty and put it bdwrite(): mark the buffer as dirty and put it

back in the free list, and awaken processes back in the free list, and awaken processes waiting for it—how long, on average, will it waiting for it—how long, on average, will it wait to be flushed, and by whom/what?wait to be flushed, and by whom/what?

bawrite(): schedule a write on a full buffer, and bawrite(): schedule a write on a full buffer, and return (asynchronous; implicit brelse())return (asynchronous; implicit brelse())

bwrite(): ensure write is complete before bwrite(): ensure write is complete before returning (implicit brelse())returning (implicit brelse())

Page 66: I/O Systems (based on 4.4BSD)

66

Buffer ListsBuffer Lists

Bufhash: quick lookup for buffers with Bufhash: quick lookup for buffers with valid contents (exactly one)valid contents (exactly one)

Four “free” lists:Four “free” lists: LOCKED: buffers cant be flushed LOCKED: buffers cant be flushed

(superblock, now only used by lfs)(superblock, now only used by lfs) LRU: an LRU cache of used buffersLRU: an LRU cache of used buffers AGE: Rather like the inactive page listAGE: Rather like the inactive page list EMPTY: buffers with no physical memoryEMPTY: buffers with no physical memory

Page 67: I/O Systems (based on 4.4BSD)

67

Buffer PoolBuffer Pool

LOCKED LRU AGE EMPTY

bufhash

bfreelist

V1,X0k

V1,X8k

V1,X16k

V1,X24k

V2,X16k

V2,X48k

V8,X40k

V4,X32k

V8,X48k

V7,X0k

V7,X8k

V7,X16k

V8,X56k

-- --

-- --

Page 68: I/O Systems (based on 4.4BSD)

68

Buffer Management Buffer Management ImplementationImplementation

bread() called with request for a bread() called with request for a datablock of specified size for a datablock of specified size for a specified vnodespecified vnode breadn() gets block and starts readaheadbreadn() gets block and starts readahead

bread()

allocbuf()getnewbuf()

bremfree() getblk()

VOP_STRATEGY() Do the I/O

Check for buffer on free list

Take buffer off of freelist and mark busy.

Adjust memory in buffer to requested size

Find next available buffer on the free list

Page 69: I/O Systems (based on 4.4BSD)

69

Buffer Management Buffer Management Implementation IIImplementation II

getblk() finds out if the data block is already in getblk() finds out if the data block is already in a buffera buffer If in a buffer, call bremfree() to take it off of If in a buffer, call bremfree() to take it off of

whatever free list it’s on and mark it busywhatever free list it’s on and mark it busy If not in buffer, call getnewbuf(), then allocbuf()If not in buffer, call getnewbuf(), then allocbuf()

getnewbuf() allocates a new buffergetnewbuf() allocates a new buffer allocbuf() makes sure buffer has physical allocbuf() makes sure buffer has physical

memorymemory bread() then passes the buffer to the strategy bread() then passes the buffer to the strategy

routine for the underlying filesystem.routine for the underlying filesystem.

Page 70: I/O Systems (based on 4.4BSD)

70

Buffer AllocationBuffer Allocation

allocbuf() ensures that the buffer has enough allocbuf() ensures that the buffer has enough physical memoryphysical memory

If there is excess physical memory, take some If there is excess physical memory, take some of it and give it to a buffer on the EMPTY list of it and give it to a buffer on the EMPTY list (move it to AGE)—keep if none there(move it to AGE)—keep if none there

If not enough, call getnewbuf() and then steal If not enough, call getnewbuf() and then steal its memory (putting it on the EMPTY list if you its memory (putting it on the EMPTY list if you steal all of its memory)steal all of its memory)

Keeps each physical page mapeed into Keeps each physical page mapeed into exactly one buffer at all timesexactly one buffer at all times

Page 71: I/O Systems (based on 4.4BSD)

71

Overlapping BuffersOverlapping Buffers

The kernel must ensure that disk blocks The kernel must ensure that disk blocks appear in at most one bufferappear in at most one buffer

If multiple dirty buffers held portions of If multiple dirty buffers held portions of the same block, the kernel wouldn’t the same block, the kernel wouldn’t know which one was more recentknow which one was more recent

Kernel purges buffers when files are Kernel purges buffers when files are shortened or removedshortened or removed

Page 72: I/O Systems (based on 4.4BSD)

72

Vnode EvolutionVnode Evolution

Vnodes were originally only used to Vnodes were originally only used to provide an OO-interface to the provide an OO-interface to the underlying filesystemunderlying filesystem

Concept then generalized to allow Concept then generalized to allow vnodes to point to other vnodesvnodes to point to other vnodes

Page 73: I/O Systems (based on 4.4BSD)

73

Stackable FilesystemStackable Filesystem

With vnodes being able to point to other With vnodes being able to point to other vnodes, we can treat filesystems as vnodes, we can treat filesystems as layers, and stack themlayers, and stack them

So, we can add features (such as So, we can add features (such as transparent encryption) to an underlying transparent encryption) to an underlying filesystem without changing that filesystem without changing that filesystem at allfilesystem at all

Page 74: I/O Systems (based on 4.4BSD)

74

Mounting LayersMounting Layers

Mount introduces an alias for one filesystem Mount introduces an alias for one filesystem in anotherin another

Traditional mount: map device into a Traditional mount: map device into a filesystem, overlaying the old directory with a filesystem, overlaying the old directory with a file systemfile system

With layers, mount pushes a new layer on a With layers, mount pushes a new layer on a vnode stackvnode stack Can overlay old vnode (use same name in file Can overlay old vnode (use same name in file

space)space) Can appear separately (use different name)Can appear separately (use different name)

Page 75: I/O Systems (based on 4.4BSD)

75

Layered FilesystemsLayered Filesystems

File access is dependent upon which File access is dependent upon which name is used for the filename is used for the file New name: use new layerNew name: use new layer Old name: use old layerOld name: use old layer This assumes new mounts are at different This assumes new mounts are at different

points from old onespoints from old ones

Page 76: I/O Systems (based on 4.4BSD)

76

Vnode Options on File AccessVnode Options on File Access

Perform requested operation and return Perform requested operation and return resultresult

Pass operation unchanged to next-lower Pass operation unchanged to next-lower vnode; might (or might not) modify resultvnode; might (or might not) modify result

Modify operands and pass on to next-Modify operands and pass on to next-lower vnode; might (or might not) modify lower vnode; might (or might not) modify resultresult

Operation passed ot the bottom of the Operation passed ot the bottom of the stack without action generates an errorstack without action generates an error

Page 77: I/O Systems (based on 4.4BSD)

77

Old vs. New VnodesOld vs. New Vnodes

Old vnodesOld vnodes Vnode operations implemented as indirect function Vnode operations implemented as indirect function

callscalls

New vnodesNew vnodes Intermediate layers pass operation to lower layersIntermediate layers pass operation to lower layers New operations added into the system at boot timeNew operations added into the system at boot time Pass operations that may not have been defined Pass operations that may not have been defined

when the filesystem was writtenwhen the filesystem was written Pass parameters of unknown type and numberPass parameters of unknown type and number

Page 78: I/O Systems (based on 4.4BSD)

78

SolutionSolution

Kernel places vnode operation name Kernel places vnode operation name and arguments into a structand arguments into a struct

That struct passed as single parameter That struct passed as single parameter to vnode operation (uniform interface)to vnode operation (uniform interface)

If the operation is supported, the vnode If the operation is supported, the vnode knows how to decode the argumentknows how to decode the argument

If it’s not supported, then it’s passed to If it’s not supported, then it’s passed to the next lower layer w/same argumentthe next lower layer w/same argument

Page 79: I/O Systems (based on 4.4BSD)

79

The Null File System: nullfsThe Null File System: nullfs

This would be better called the “identity This would be better called the “identity file system”file system”

Just passes all requests and responses Just passes all requests and responses without changing them at allwithout changing them at all

Can make a filesystem appear within Can make a filesystem appear within itselfitself

Useful as starting point for implementing Useful as starting point for implementing other file systemsother file systems

Page 80: I/O Systems (based on 4.4BSD)

80

User Mapping File System: User Mapping File System: umapfsumapfs

umapfs mounts the same file system in two places (/local into umapfs mounts the same file system in two places (/local into /export) on the file server/export) on the file server

/export is exported to external domains that use different UID/GID./export is exported to external domains that use different UID/GID. Mapping is automatically done by the umapfs layerMapping is automatically done by the umapfs layer Request handed to /localRequest handed to /local

outside administrative exports local administrative exports

UFSFFSuid/gid mapping

NFS server

operation not supported

NFS server

/export/export/local/local

Page 81: I/O Systems (based on 4.4BSD)

81

Benefits of umapfsBenefits of umapfs

No cost to local clients (they don’t go No cost to local clients (they don’t go through the umapfs)through the umapfs)

No changes required to NFS or local No changes required to NFS or local filesystem codefilesystem code

We can expand this idea to give each We can expand this idea to give each outside domain its own mapping, outside domain its own mapping, accessed through its own mount pointaccessed through its own mount point

Page 82: I/O Systems (based on 4.4BSD)

82

Union Mount Filesystem: unionfsUnion Mount Filesystem: unionfs Traditional mount hides any files in the directory Traditional mount hides any files in the directory

before the mountbefore the mount Union filesystem does a transparent overlay (you Union filesystem does a transparent overlay (you

can still see the old)can still see the old) Both can be accessed simultaneouslyBoth can be accessed simultaneously

/usr/src/usr/src

src src src

shell shell shell

Makefile shsh.h sh.c

/usr/src/usr/src/tmp/src/tmp/src

shsh.h sh.cMakefile

Page 83: I/O Systems (based on 4.4BSD)

83

More unionfsMore unionfs Start with an empty top-level directoryStart with an empty top-level directory As the underlying fs is traversed, entries are made in As the underlying fs is traversed, entries are made in

the top levelthe top level Lower levels are read-onlyLower levels are read-only Any names appearing in top fs obscure lower fsAny names appearing in top fs obscure lower fs

/usr/src/usr/src

src src src

shell shell shell

Makefile shsh.h sh.c

/usr/src/usr/src/tmp/src/tmp/src

shsh.h sh.cMakefile

Page 84: I/O Systems (based on 4.4BSD)

84

Still More unionfsStill More unionfs

Two conflicting requirements:Two conflicting requirements: Lower layers are read-onlyLower layers are read-only User should think this is a regular file systemUser should think this is a regular file system

We already copy files up on writes, but how do We already copy files up on writes, but how do we handle deletions?we handle deletions? Deletions of unmodified files (lower layer)Deletions of unmodified files (lower layer) Deletions of previously modified files (top layer)Deletions of previously modified files (top layer)

Use “whiteout” entries in the top level: inode 1Use “whiteout” entries in the top level: inode 1

On hitting a whiteout entry, kernel returns error.On hitting a whiteout entry, kernel returns error.

Page 85: I/O Systems (based on 4.4BSD)

85

Tricky Issues With Whiteout: Tricky Issues With Whiteout: DirectoriesDirectories

What happens if I create a file with the What happens if I create a file with the same name as a whiteout file?same name as a whiteout file?

What about a directory? Is there any What about a directory? Is there any difference?difference?

Compare what happens in the “normal” Compare what happens in the “normal” situation where the same directory situation where the same directory appears in multiple layers of the appears in multiple layers of the filesystem stackfilesystem stack

Page 86: I/O Systems (based on 4.4BSD)

86

Uses of the unionfsUses of the unionfs

Heterogeneous architectures can build from a Heterogeneous architectures can build from a common source basecommon source base NFS mount source treeNFS mount source tree Union mount over the topUnion mount over the top Builds stay localBuilds stay local

Can compile sources “on” read-only media Can compile sources “on” read-only media (e.g., CD-ROM)(e.g., CD-ROM)

Private source directories from common code Private source directories from common code basesbases

Page 87: I/O Systems (based on 4.4BSD)

87

Other Common Filesystems: Other Common Filesystems: PortalPortal

Mount a process into the filesystemMount a process into the filesystem Pass remainder of path name to process as Pass remainder of path name to process as

inputinput Process returns file descriptorProcess returns file descriptor

Could be to processCould be to process Could be to fileCould be to file

Can be used, e.g., to provide filesystem Can be used, e.g., to provide filesystem access to network services (opening file access to network services (opening file creates TCP connection to server)creates TCP connection to server)

Page 88: I/O Systems (based on 4.4BSD)

88

Other Common Filesystems:Other Common Filesystems:procfsprocfs

procfs provides information on running procfs provides information on running processes, through the file system (viz ps)processes, through the file system (viz ps)

An ls of /proc shows a directory for each An ls of /proc shows a directory for each process, which contains:process, which contains: ctl: file allowing control of the processctl: file allowing control of the process file: the executable (compiled code)file: the executable (compiled code) mem: virtual memory for the processmem: virtual memory for the process regs: registers for the processregs: registers for the process status: text file with information (e.g. state, ppid, …)status: text file with information (e.g. state, ppid, …)

Page 89: I/O Systems (based on 4.4BSD)

89

End of LectureEnd of Lecture