UnixForum Chicago - March 8, 2001
Daniel P. Bovet, University of Rome "Tor Vergata"
INSIDE THE LINUX KERNEL
WHAT IS A KERNEL? (1/2)
it’s a program that runs in Kernel Mode
CPUs run either in Kernel Mode or in User Mode
when in User Mode, some parts of RAM can’t be addressed, some instructions can’t be executed, and I/O ports can’t be accessed
when in Kernel Mode, no restriction is put on the program
WHAT IS A KERNEL? (2/2)
besides running in Kernel Mode, kernels have three other peculiarities:
large size (millions of machine language instructions)
machine dependency (some parts of the kernel must be coded in Assembly language)
loading into RAM at boot time in a rather primitive way
ENTERING THE KERNEL PROGRAM (1/2)
when the CPU is running in User Mode
[Figure: User Mode and Kernel Mode]
ENTERING THE KERNEL PROGRAM (2/2)
when the CPU is running in Kernel Mode
[Figure: User Mode and Kernel Mode]
NESTED KERNEL INVOCATIONS
some similarity with nested function calls
[Figure: nested invocations A, B, C]
different because events causing kernel invocations are not (usually) related to the running program
KERNEL ENTRY POINTS
events that enter the Kernel:
software interrupt
I/O device requires attention
time interval elapsed
hardware failure
faulty instruction
IS AN INSTRUCTION REALLY FAULTY?
faulty instructions may occur for two distinct reasons:
programming error
deferred allocation of some kind of resource
the kernel must be able to identify the reason that caused the exception
EXCEPTIONS RELATED TO DEFERRED ALLOCATION
two cases of deferred allocation of resources in Linux
page frames (demand paging, Copy On Write)
floating point registers
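The distinction between a programming error and a deferred allocation can be sketched for the page frame case. This is a simplified user-space model, not kernel code: the `region` structure, the `classify_fault` function, and the enum names are hypothetical, and the only rule assumed is that a fault inside a region the process legitimately owns points to a deferred allocation, while any other fault is a programming error. Floating point registers are not modeled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one memory region owned by a process. */
struct region {
    uintptr_t start;     /* first address of the region */
    uintptr_t end;       /* first address past the region */
    bool      writable;  /* may the process write here? */
};

enum fault_kind {
    FAULT_DEMAND_PAGE,   /* legal access, page frame not allocated yet */
    FAULT_COPY_ON_WRITE, /* legal write to a page still write-protected */
    FAULT_PROGRAM_ERROR  /* illegal access: the process is at fault */
};

/* Classify a faulting access made by the process at address addr. */
enum fault_kind classify_fault(const struct region *regions, size_t n,
                               uintptr_t addr, bool is_write,
                               bool page_present)
{
    for (size_t i = 0; i < n; i++) {
        const struct region *r = &regions[i];

        if (addr < r->start || addr >= r->end)
            continue;                    /* not in this region */
        if (is_write && !r->writable)
            return FAULT_PROGRAM_ERROR;  /* write to a read-only region */
        if (!page_present)
            return FAULT_DEMAND_PAGE;    /* allocate the frame now */
        /* Page is present but the access faulted anyway. */
        return is_write ? FAULT_COPY_ON_WRITE : FAULT_PROGRAM_ERROR;
    }
    return FAULT_PROGRAM_ERROR;          /* address outside every region */
}
```

In this model only the last outcome would lead to a signal for the process; the first two are resolved by the kernel and the faulting instruction is simply restarted.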
WHY IS A KERNEL SO COMPLEX?
large program with many entry points
must offer disk caching to lower average disk access time
must support nested kernel invocations --> must run with interrupts enabled most of the time
must be updated quite frequently to support new hardware circuits and devices
HW CONCURRENCY (1/2)
[Figure: I/O device, I/O APIC, and CPU, connected by the IRQ, INT, and INT ACK signals]
the I/O APIC polls the devices and issues interrupts
no new interrupt can be issued until the CPU acknowledges the previous one
good kernels run with interrupts enabled most of the time
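The acknowledge rule can be pictured with a toy model. Everything here is an illustrative assumption: the `apic` structure and the `apic_raise`, `apic_issue`, and `apic_ack` functions are hypothetical names, real hardware is far more elaborate, and serving the lowest-numbered line first is an arbitrary choice made for the sketch.

```c
#include <stdbool.h>

/* Toy model of the handshake: at most one interrupt is outstanding. */
struct apic {
    unsigned pending;       /* bit i set: IRQ line i waits for delivery */
    bool     awaiting_ack;  /* an INT was sent and not yet acknowledged */
};

/* A device signals that it requires attention (irq in 0..31). */
void apic_raise(struct apic *a, int irq)
{
    a->pending |= 1u << irq;
}

/* Returns the IRQ delivered to the CPU, or -1 if none can be issued. */
int apic_issue(struct apic *a)
{
    if (a->awaiting_ack || a->pending == 0)
        return -1;
    for (int irq = 0; irq < 32; irq++) {
        if (a->pending & (1u << irq)) {
            a->pending &= ~(1u << irq);
            a->awaiting_ack = true;  /* blocks further INTs until the ACK */
            return irq;
        }
    }
    return -1;
}

/* The CPU acknowledges the interrupt it received. */
void apic_ack(struct apic *a)
{
    a->awaiting_ack = false;
}
```

The sooner the kernel acknowledges and re-enables interrupts, the sooner the next pending request can be delivered, which is one way to read the last bullet above.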
HW CONCURRENCY (2/2)
Symmetrical MultiProcessor architectures (SMP) include two or more CPUs
SMP kernels must be able to execute concurrently on available CPUs
one service routine related to networking runs on a CPU while another routine related to file system runs concurrently on another CPU
LIMITING KERNEL SIZE
try to distribute kernel functions in smaller programs that can be linked separately
two approaches: microkernels and modules
Linux prefers modules for reasons of efficiency
MICROKERNELS
only a few functions, such as process scheduling and interprocess communication, are included in the microkernel
other kernel functions such as memory allocation, file system handling, and device drivers are implemented as system processes running in User Mode
microkernels introduce a lot of interprocess communication
MODULES (1/2)
modules are object files containing kernel functions that are linked dynamically to the kernel
Linux offers excellent support for implementing and handling modules
MODULES (2/2)
[Figure: object module mmm.o, whose external references to kernel symbols are resolved through the kernel symbol table]
thanks to the kernel symbol table, it is possible to defer linking of an object module
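Deferred linking can be sketched as a lookup of each external reference in a table of exported names. This is a minimal illustration under stated assumptions, not the kernel's module loader: the `ksym` and `extref` structures and the `ksym_lookup` and `link_module` functions are hypothetical, and an address of 0 is used here to mean "unresolved".

```c
#include <stddef.h>
#include <string.h>

/* One exported kernel symbol: a name and the address it resolves to. */
struct ksym {
    const char   *name;
    unsigned long addr;
};

/* One external reference found inside an object module. */
struct extref {
    const char   *name;  /* symbol the module refers to */
    unsigned long addr;  /* filled in at link time, 0 while unresolved */
};

/* Look a name up in the table; returns 0 if the symbol is not exported. */
unsigned long ksym_lookup(const struct ksym *table, size_t n,
                          const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].addr;
    return 0;
}

/* Resolve every reference; returns the number left unresolved. */
size_t link_module(const struct ksym *table, size_t n,
                   struct extref *refs, size_t nrefs)
{
    size_t missing = 0;

    for (size_t i = 0; i < nrefs; i++) {
        refs[i].addr = ksym_lookup(table, n, refs[i].name);
        if (refs[i].addr == 0)
            missing++;
    }
    return missing;
}
```

A loader built along these lines would refuse to insert a module when `link_module` reports missing symbols, since the module would otherwise jump to an unknown address.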
MODULES AND DISTRIBUTIONS
modern computer architectures based on PCI busses support autoprobe of installed I/O devices while booting the system
recent Linux distributions put all non-critical I/O drivers into modules
at boot time, only the I/O modules of identified I/O devices are dynamically linked to the kernel
SUPPORT FOR CLIENT/SERVER APPLICATIONS
scenario: many tasks executing concurrently in a common address space (for instance, a web server handling thousands of requests per second)
problem: implementing each client request as a new process causes a lot of overhead
process creation/elimination are time-consuming kernel functions
THE THREAD SOLUTION
introduce a new kernel object called thread
each process includes one or more threads
all threads associated with a given process share the same address space
CPU scheduling is done at the thread level (Windows NT)
thread switching is more efficient than process switching
THE CLONE SOLUTION
introduce groups of lightweight processes called clones that share a common address space, opened files, signals, etc.
CPU scheduling is done at the process level in a standard way
clones have been invented by Linux
the mpmt_pthread and dexter modules used by the Linux version of Apache 2.0 are both based on clones
LINUX PEARLS
we selected in a rather arbitrary way a few pearls related to two distinct kernel design areas:
clever design choices
efficient coding
CLEVER DESIGN CHOICES
isolate the architecture-dependent code
rely on the VFS abstraction
avoid over-designing
ISOLATE THE ARCHITECTURE-DEPENDENT CODE (1/2)
Linux source code includes two architecture-dependent directories: /usr/src/linux/arch and /usr/src/linux/include
arch: i386 ….. s390
include: asm asm-i386 …. asm-s390
ISOLATE THE ARCHITECTURE-DEPENDENT CODE (2/2)
the schedule() function invokes the switch_to() Assembly language function to perform process switching
the code for switch_to() is stored in the include/asm/system.h file
depending on the target system, the asm symbolic link is set to asm-i386, asm-s390, etc.
RELY ON THE VFS ABSTRACTION
VFS is an abstraction for representing several kinds of information containers (IC) in a common way
standard operations on ICs: open(), close(), seek(), ioctl(), read(), write()
VFS associates a logical inode with each opened IC
EXAMPLES OF ICs
files stored in a disk-based filesystem
files stored in a network filesystem
disk partitions
kernel data structures (/proc filesystem)
RAM content (/dev/mem)
RAM disk (/dev/ram0)
serial port (/dev/ttyS0)
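The idea of one set of operations over many kinds of containers can be sketched with a table of function pointers. This is an illustration only, not the VFS data structures: `struct ic`, `struct ic_ops`, `ic_read`, and `ic_write` are hypothetical names, only two of the standard operations are shown, and the two container types (a small RAM-backed file and a sink that discards data) are stand-ins chosen to keep the example short.

```c
#include <stddef.h>
#include <string.h>

struct ic;  /* an "information container" in the sense used above */

/* Hypothetical operation table shared by every kind of container. */
struct ic_ops {
    long (*read)(struct ic *c, void *buf, size_t len);
    long (*write)(struct ic *c, const void *buf, size_t len);
};

struct ic {
    const struct ic_ops *ops;  /* selects the concrete implementation */
    char   data[64];           /* backing store of the RAM variant */
    size_t size;               /* bytes currently stored */
    size_t pos;                /* current read position */
};

/* RAM-backed container: behaves like a tiny file. */
static long ram_read(struct ic *c, void *buf, size_t len)
{
    size_t left = c->size - c->pos;

    if (len > left)
        len = left;
    memcpy(buf, c->data + c->pos, len);
    c->pos += len;
    return (long)len;
}

static long ram_write(struct ic *c, const void *buf, size_t len)
{
    size_t room = sizeof c->data - c->size;

    if (len > room)
        len = room;
    memcpy(c->data + c->size, buf, len);
    c->size += len;
    return (long)len;
}

/* Sink container: accepts every write, never has data to read. */
static long sink_read(struct ic *c, void *buf, size_t len)
{
    (void)c; (void)buf; (void)len;
    return 0;
}

static long sink_write(struct ic *c, const void *buf, size_t len)
{
    (void)c; (void)buf;
    return (long)len;
}

const struct ic_ops ram_ops  = { ram_read,  ram_write  };
const struct ic_ops sink_ops = { sink_read, sink_write };

/* Callers use these entry points and never see the concrete type. */
long ic_read(struct ic *c, void *buf, size_t len)
{
    return c->ops->read(c, buf, len);
}

long ic_write(struct ic *c, const void *buf, size_t len)
{
    return c->ops->write(c, buf, len);
}
```

Adding a new kind of container then means supplying one more operation table; the code that calls `ic_read` and `ic_write` stays unchanged.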
AVOID OVER-DESIGNING
Linux scheduler is simple and works for most applications
no attempt to transform Linux into a real-time system
A GENERAL-PURPOSE SCHEDULER
the scheduler of the System V Release 4 provides a set of class-independent routines that implement common services
object-oriented approach based on scheduling class: the scheduler represents an abstract base class, and each scheduling class acts as a subclass
A HEATED DISCUSSION
If the Linux development community is not responsive to the end user community, refusing to incorporate necessary functionality on the basis of aesthetics, then that community will abandon Linux in favor of something else. Is that really what you want?
Yes - If it turns into a pile of shit they'll abandon it even faster. I'd rather have a decent OS that works and does the right thing for most people than a single OS that tries to do everything and does nothing right (Alan Cox)
EXAMPLES OF EFFICIENT CODING
retrieving the process descriptor of the running process
handling dynamic timers
catching invalid addresses passed as system call parameters
RETRIEVING THE PROCESS DESCRIPTOR OF THE RUNNING PROCESS (1/3)
classic solution: introduce an array current[NCPU] whose components point to the process descriptors of the processes running on the CPUs
clever solution: store the process Kernel Mode stack and the process descriptor at contiguous addresses, so that the address of the process descriptor can be derived from the value of the CPU stack pointer register (esp)
DESCRIPTOR OF THE RUNNING PROCESS (2/3)
Kernel Mode stack + process descriptor are stored in 2 contiguous page frames (8 KB)
[Figure: the 8 KB area holds the fixed-length process descriptor and the variable-length Kernel Mode stack; esp points into the stack]
DESCRIPTOR OF THE RUNNING PROCESS (3/3)
[Figure: applying the mask to esp yields the starting address of the fixed-length process descriptor, which lies below the variable-length Kernel Mode stack]
value of esp register: 0x00bdbad4
mask: 0xffffe000
starting address of process descriptor: 0x00bda000
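The computation amounts to clearing the low-order bits of esp. A minimal sketch, assuming only what is stated above (an 8 KB area aligned on an 8 KB boundary, with the process descriptor at its lowest address); the names `KSTACK_SIZE` and `descriptor_from_esp` are hypothetical:

```c
#include <stdint.h>

#define KSTACK_SIZE 8192u  /* two contiguous 4 KB page frames */

/* Clear the 13 low-order bits of esp. The result is the base of the
 * 8 KB area, which is where the process descriptor starts. */
uint32_t descriptor_from_esp(uint32_t esp)
{
    return esp & ~(uint32_t)(KSTACK_SIZE - 1);
}
```

With esp equal to 0x00bdbad4 the function returns 0x00bda000. No memory access and no per-CPU array lookup is needed, which is what makes this cheaper than the classic solution.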
HANDLING DYNAMIC TIMERS (1/3)
I/O drivers and user applications may create hundreds of timers
find an efficient way to check at each timer interrupt whether at least one timer has expired
trivial solution: maintain a list of timers ordered by increasing expiration times and start checking from the first element of the list
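The trivial solution can be sketched as a singly linked list kept in order. The `timer` structure and the `timer_insert` and `timers_run` functions are hypothetical names, and wraparound of the tick counter is ignored to keep the example short.

```c
#include <stddef.h>

struct timer {
    unsigned long expires;  /* tick at which the timer fires */
    struct timer *next;
};

/* Insert while keeping the list ordered by expiration time: O(n). */
void timer_insert(struct timer **head, struct timer *t)
{
    while (*head && (*head)->expires <= t->expires)
        head = &(*head)->next;
    t->next = *head;
    *head = t;
}

/* Called at each tick: detach and count every expired timer.
 * Only the front of the list has to be inspected. */
size_t timers_run(struct timer **head, unsigned long now)
{
    size_t fired = 0;

    while (*head && (*head)->expires <= now) {
        *head = (*head)->next;
        fired++;
    }
    return fired;
}
```

Checking is cheap because the earliest timer is always first, but every insertion walks the list, which becomes costly once hundreds of timers exist.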
HANDLING DYNAMIC TIMERS (2/3)
clever solution (timing wheel): use percolation and maintain strict ordering only for the next 256 ticks (in Linux-i386, one tick = 10 ms)
use several lists of timers
HANDLING DYNAMIC TIMERS (3/3)
tv1: slots 0 1 2 …… 255, index incremented by 1 once every tick
tv2: slots 0 1 2 …… 63, index incremented by 1 once every 256 ticks
when tv1 becomes empty, it is replenished by emptying one slot of tv2, and so forth
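A two-level wheel along these lines can be sketched as follows. This is a simplified model under stated assumptions rather than the kernel's timer code: only tv1 and tv2 are implemented, so timers more than 256 * 64 ticks ahead are rejected, and the `wheel` and `wtimer` structures and the `wheel_place`, `wheel_add`, and `wheel_tick` functions are hypothetical names.

```c
#include <stddef.h>

#define TV1_SLOTS 256u  /* one slot per tick for the next 256 ticks */
#define TV2_SLOTS 64u   /* one slot per 256-tick interval */

struct wtimer {
    unsigned long  expires;  /* absolute tick at which the timer fires */
    struct wtimer *next;
};

struct wheel {
    unsigned long  now;      /* current tick */
    struct wtimer *tv1[TV1_SLOTS];
    struct wtimer *tv2[TV2_SLOTS];
};

/* File a timer in the right list. No ordering is kept inside a slot. */
static void wheel_place(struct wheel *w, struct wtimer *t)
{
    unsigned long delta = t->expires - w->now;
    struct wtimer **slot = (delta < TV1_SLOTS)
        ? &w->tv1[t->expires % TV1_SLOTS]
        : &w->tv2[(t->expires / TV1_SLOTS) % TV2_SLOTS];

    t->next = *slot;
    *slot = t;
}

/* Returns 0 on success, -1 if the timer is outside the supported range. */
int wheel_add(struct wheel *w, struct wtimer *t)
{
    if (t->expires <= w->now ||
        t->expires - w->now >= (unsigned long)TV1_SLOTS * TV2_SLOTS)
        return -1;
    wheel_place(w, t);
    return 0;
}

/* Advance one tick; returns how many timers expired on this tick. */
size_t wheel_tick(struct wheel *w)
{
    size_t fired = 0;
    struct wtimer **slot;

    w->now++;
    if (w->now % TV1_SLOTS == 0) {
        /* tv1 index wrapped: percolate one tv2 slot down into tv1 */
        struct wtimer **src = &w->tv2[(w->now / TV1_SLOTS) % TV2_SLOTS];
        struct wtimer *t = *src;

        *src = NULL;
        while (t) {
            struct wtimer *next = t->next;
            wheel_place(w, t);
            t = next;
        }
    }
    slot = &w->tv1[w->now % TV1_SLOTS];
    while (*slot) {          /* every timer in this slot is due now */
        *slot = (*slot)->next;
        fired++;
    }
    return fired;
}
```

Insertion is constant time and each tick inspects a single tv1 slot; the price is the occasional burst of work when a tv2 slot is percolated down.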
CATCHING INVALID ADDRESSES (1/4)
many system calls require one or more addresses specified as parameters
invalid addresses passed as parameters should not cause a system crash
classic solution: perform a preliminary check before servicing the system call
clever solution: defer checking until an exception caused by the invalid address occurs in Kernel Mode
CATCHING INVALID ADDRESSES (2/4)
deferred checking is more efficient since system calls are issued most of the time with correct parameters
if an addressing error occurs in Kernel Mode, the kernel must be able to tell whether it was caused by a faulty process or by a kernel bug
in the first case, the kernel sends a SIGSEGV signal to the faulty process
CATCHING INVALID ADDRESSES (3/4)
clever idea: force the kernel to always use the same group of functions when copying data to or from the process address space
if an addressing error occurs while doing that, the CPU will signal the address of the instruction that contained an invalid address operand
CATCHING INVALID ADDRESSES (4/4)
the kernel knows from the address of the faulty instruction that it belongs to one of the functions used to access data in the process address space
it can then execute some kind of “fixup code”: as a result, the system call returns an error code
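One way to organize this lookup is a sorted table that pairs each instruction allowed to touch the process address space with the address of its fixup code. The sketch below models only the decision, in user space: the `ex_entry` structure, the `find_fixup` and `handle_kernel_fault` functions, and all addresses used with them are hypothetical.

```c
#include <stddef.h>

/* One entry per instruction that may access the process address space. */
struct ex_entry {
    unsigned long insn;   /* address of the instruction that may fault */
    unsigned long fixup;  /* address of the recovery code to jump to */
};

/* Binary search in a table sorted by insn; returns 0 when not found. */
unsigned long find_fixup(const struct ex_entry *tab, size_t n,
                         unsigned long addr)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tab[mid].insn == addr)
            return tab[mid].fixup;
        if (tab[mid].insn < addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}

enum verdict { RESUME_AT_FIXUP, KERNEL_BUG };

/* Decide what an addressing error raised in Kernel Mode means. */
enum verdict handle_kernel_fault(const struct ex_entry *tab, size_t n,
                                 unsigned long faulting_insn,
                                 unsigned long *resume_at)
{
    unsigned long fixup = find_fixup(tab, n, faulting_insn);

    if (fixup == 0)
        return KERNEL_BUG;  /* fault outside the copy functions */
    *resume_at = fixup;     /* fixup code makes the call return an error */
    return RESUME_AT_FIXUP;
}
```

The common case pays nothing: the table is consulted only after a fault has already occurred, so system calls with valid addresses run without any preliminary check.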