Introduction to Linux Kernel Internals for OpenVMS Experts
Keith Parris
HP
History
• Unix heritage
• Richard M. Stallman and GNU project
• Berkeley Software Distribution (BSD)
• Andrew Tanenbaum and Minix
• Linus Torvalds and Linux
  – 1991: Unix-like kernel for Intel 386
Philosophy
• Open Source
  – Source can be:
    • Studied
    • Modified
    • Distributed without restrictions (except those of the GPL itself)
• Do-it-yourself attitude
• Rivalry with Microsoft Windows
• Simplicity as Elegance
History
• August 1991: Version 0.01
• October 1991: Version 0.02
• November 1993: First Slackware distribution, with kernel 0.99
• March 1994: Version 1.0
• June 1995: Ported to Alpha architecture
• January 1999: Version 2.2
• January 2001: Version 2.4
Release Numbering
• Major.Minor-Step format, e.g. 2.4-7
• Major:
  – 0 to 1: First stable 386 release
  – 1 to 2: SMP support
• Minor:
  – Odd-numbered “dot” releases for new functionality, e.g. 0.9, 2.3, 2.5
  – Even-numbered releases for stability, e.g. 1.0, 2.2, 2.4
• Step: Indicates roll-up of patches
Some Major Linux Distributions
• Red Hat – leader in overall market share
• SuSE – leader in the European market, making inroads in the USA
• Debian – favorite of Open Source purists
• Slackware – first commercial Linux distribution; favorite of macho do-it-yourselfers
• Mandrake – easy to install for novices; available at Wal-Mart
• TurboLinux – family of distributions, including server, workstation, and cluster
Source Language
• Linux is written in GNU C (gcc)
  – GNU C is a dialect of ANSI C (not K&R)
    • ANSI C has better type checking
• So Linux is portable to any machine to which gcc has been ported
CPU Platform Support
• 32-bit:
  – Intel x86, Crusoe, MIPS, ARM, SPARC, PowerPC, Motorola 68000
• 64-bit:
  – Alpha, MIPS, Itanium, PA-RISC, SPARC
• Original version was on Intel 80386, and many artifacts remain from that legacy
Byte Ordering
• Both types of platforms supported by Linux:
  – Little-Endian: IA32, Alpha, Itanium
  – Big-Endian: PA-RISC, PowerPC, SPARC64
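To make the two byte orders concrete, here is a minimal Python sketch (not kernel code) that packs the same 32-bit value both ways. The helper name `encode_u32` and the sample value are illustrative assumptions.

```python
import struct
import sys

def encode_u32(value, byteorder):
    """Return the 4 bytes of a 32-bit unsigned integer in the given byte order."""
    fmt = "<I" if byteorder == "little" else ">I"
    return struct.pack(fmt, value)

value = 0x12345678                    # arbitrary sample value
little = encode_u32(value, "little")  # memory layout on IA32, Alpha, Itanium
big = encode_u32(value, "big")        # memory layout on PA-RISC, PowerPC, SPARC64

print("host byte order:", sys.byteorder)
print("little-endian bytes:", little.hex())
print("big-endian bytes:   ", big.hex())
```

The little-endian form stores the least significant byte first, so the two byte strings are exact reversals of each other.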
Windowing Software
• XFree86
• KDE – K Desktop Environment
• GNOME – GNU Network Object Model Environment
Kernel Component Interaction
[Diagram labels: CPU, Terminal, Disk, Network Interface, System Memory, Processes & Scheduler, Traps & Faults, Network device drivers, File Systems, Network protocols, Block device drivers, Char device drivers, Virtual Memory, Physical Memory, Interrupts, System Calls, Signals, Task]
Modes
• User and Kernel Modes only
  – No Executive Mode
  – No Supervisor Mode
• So no protection between different portions of the kernel
Linux Process Model
• Basic execution unit on Linux is called a “task”, and is a thread of execution with an associated address space
• Multiple tasks can share the same address space (for multi-threading) in a task group
Task data
• Task data structure is about 1 KB in size
  – This is the most complex data structure in the Linux kernel
• task_struct is placed at the end of the kernel stack area, so kernel stack pointer can be used to locate it efficiently
  – But if kernel stack overflows, it clobbers task structure
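The stack-pointer trick can be sketched with plain integer arithmetic. This is an illustration, not kernel source. It assumes an 8 KB aligned area shared by the task structure and the kernel stack, with the task structure at the lowest address and the stack growing down toward it. `THREAD_SIZE`, the function names, and the sample stack pointer are all assumptions.

```python
THREAD_SIZE = 8192  # assumed size of the combined task/stack area (8 KB, aligned)

def task_area_base(stack_pointer):
    """Round a kernel stack pointer down to the start of its aligned area,
    which is where the task structure is assumed to live."""
    return stack_pointer & ~(THREAD_SIZE - 1)

def stack_room_left(stack_pointer, task_struct_size=1024):
    """Bytes the stack can still grow downward before it reaches the
    (roughly 1 KB) task structure and starts to clobber it."""
    return stack_pointer - (task_area_base(stack_pointer) + task_struct_size)

sp = 0xC12345F0  # hypothetical kernel stack pointer value
print(hex(task_area_base(sp)))
print(stack_room_left(sp))
```

A single mask operation finds the task structure, which is the efficiency the slide refers to. The second function shows why an overflow is dangerous: nothing sits between the stack and the structure.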
Task Creation
• init task is Process ID (PID) #1
• fork() creates duplicate task
  – except Child is given different PID and stack
  – Address space is same (but with copy-on-write)
  – Child is independent of Parent
  – Typically exec() is then called in Child to load and run an executable image
• clone() is used to create more threads
  – Address space, file descriptors, signal handlers can be shared
  – PID is the same for all threads in the same process
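The fork() behavior described above can be observed from user space. The following is a small sketch using Python's `os.fork()` wrapper on a Unix-like system. The function name `fork_demo` and the pipe-based reporting are illustrative choices, not anything from the kernel itself.

```python
import os

def fork_demo():
    """Fork a child, let it modify its copy of a variable, and report back."""
    value = ["parent data"]
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: starts with the same memory contents, but its writes are
        # private to it (copy-on-write).
        try:
            os.close(read_fd)
            value[0] = "changed in child"
            os.write(write_fd, f"{os.getpid()}:{value[0]}".encode())
            os.close(write_fd)
        finally:
            os._exit(0)  # leave immediately without running parent code
    # Parent: fork() returned the child's PID.
    os.close(write_fd)
    report = os.read(read_fd, 1024).decode()
    os.close(read_fd)
    os.waitpid(pid, 0)
    child_pid, child_value = report.split(":", 1)
    return os.getpid(), int(child_pid), value[0], child_value

parent_pid, child_pid, parent_value, child_value = fork_demo()
print(parent_pid != child_pid, parent_value, child_value)
```

The parent's copy of `value` is untouched by the child's write, and the two PIDs differ, matching the bullets above.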
Symmetrical Multi-Processing
• SMP support was introduced in 2.0
  – But scalability was limited to about 4 CPUs
  – Versions 2.4 and 2.5 are better in scalability
• All CPUs can process interrupts and execute kernel functions
• Kernel can be built with or without MP support (for efficiency on Uni-Processor machines)
SMP Synchronization
• 2 types of spinlocks:
  – Adaptive: if lock holder is running, spin-wait; otherwise, block (sleep)
  – Spin: spin-wait until lock becomes free
Spinlocks
• Conflicting priorities in SMP design:
  – Safety is easier to ensure with few spinlocks
    • Linux originally had just one, the Big Kernel Lock (BKL)
  – Performance is better with lots of individual spinlocks for more parallelism
• Hierarchy is needed to avoid deadlocks with multiple spinlocks (Linux has only a few hierarchies)
• Keeping per-CPU data separate avoids some spinlocks
• Linux kernel presently uses > 100 different locks
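The role of a lock hierarchy can be shown with a user-space sketch. It uses Python's blocking `threading.Lock` as a stand-in for kernel spinlocks, which is a deliberate simplification. The `RankedLock` class, the lock names, and the rank values are hypothetical. The idea is that every thread acquires locks in one agreed order, so two threads can never each hold the lock the other is waiting for.

```python
import threading

class RankedLock:
    """A lock with a fixed rank; lower-ranked locks must be taken first."""
    def __init__(self, name, rank):
        self.name = name
        self.rank = rank
        self.lock = threading.Lock()

def acquire_in_order(*locks):
    """Acquire locks sorted by rank so every thread uses the same order."""
    ordered = sorted(locks, key=lambda item: item.rank)
    for item in ordered:
        item.lock.acquire()
    return ordered

def release_all(ordered):
    """Release in the reverse of the acquisition order."""
    for item in reversed(ordered):
        item.lock.release()

# Hypothetical locks; the names are for illustration only.
runqueue = RankedLock("runqueue", 1)
pagecache = RankedLock("pagecache", 2)
stats = {"count": 0}

def worker(first, second):
    for _ in range(1000):
        held = acquire_in_order(first, second)
        stats["count"] += 1   # protected by both locks
        release_all(held)

# The two threads name the locks in opposite orders; without the ranking
# this is the classic pattern that can deadlock.
threads = [
    threading.Thread(target=worker, args=(runqueue, pagecache)),
    threading.Thread(target=worker, args=(pagecache, runqueue)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(stats["count"])
```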
Time keeping
• 10 millisecond clock ticks on IA32
  – 1024 per second on Alpha
• Timers are available for processes (similar to VMS TQEs)
• syscall gettimeofday() analogous to $GETTIM system service
  – except down to microsecond resolution instead of 10-millisecond
Interrupts
• Interrupts are divided into two classes:
  – “top-half” (hardware) and
  – “bottom-half” (software)
Interrupts
• Interrupt modes:
  – Critical: all interrupts masked, & uses kernel stack of current task, for lowest interrupt latency
  – Noncritical: only interrupt of same IRQ is masked; higher IRQs might pre-empt
  – Deferred: uses software interrupt to defer low-priority work
Interrupts
• Interrupts can be directed to different CPUs in an SMP system via IRQ affinity
• Kernel can be profiled, either by timer IRQ or (using gcc options) by function call
Virtual Memory
• Linux kernel itself is not pageable
• Despite terminology like “swap partition”, Linux actually does only paging, not swapping
• Page replacement is LRU (for v2.4)
  – A linked list of all pages in memory is kept, with most-recently used pages at the front and least-recently used pages at the end
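The list discipline described above can be sketched in a few lines. This is an exact-LRU toy model in Python, not the kernel's implementation. The class name `LRUPageList` and the fixed `capacity` are assumptions made for illustration.

```python
from collections import OrderedDict

class LRUPageList:
    """Ordered list of resident pages: front = most recently used,
    end = least recently used (the next eviction candidate)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()

    def touch(self, page):
        """Record an access to a page; return the evicted page, if any."""
        evicted = None
        if page not in self.pages:
            if len(self.pages) >= self.capacity:
                # Memory is full: reclaim the page at the end of the list.
                evicted, _ = self.pages.popitem(last=True)
            self.pages[page] = True
        # Whether new or already resident, the page moves to the front.
        self.pages.move_to_end(page, last=False)
        return evicted

    def order(self):
        """Pages from most recently used to least recently used."""
        return list(self.pages)

lru = LRUPageList(capacity=3)
evictions = [lru.touch(page) for page in [1, 2, 3, 1, 4]]
print(evictions)    # page 2 is evicted when page 4 arrives
print(lru.order())
```

Page 1 survives because it was touched again before memory filled, while page 2 had drifted to the end of the list.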
Virtual Memory
• Linux has a 3-tier virtual address translation table model
  – IA32 has only 2 tiers, so 2nd level is mapped 1-to-1
• Linux page size is typically 4 KB
• Virtual address format for IA32:
  – Page Directory Entry (PGD): 10 bits
  – Page Table Entry (PTE): 10 bits
  – Byte within page: 12 bits (4 KB pages)
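The 10/10/12 split of an IA32 virtual address can be worked through with shifts and masks. The sketch below is only arithmetic on the field widths listed above. The function name and the sample address are illustrative assumptions.

```python
PAGE_SHIFT = 12   # 12 bits of byte offset (4 KB pages)
INDEX_BITS = 10   # 10 bits each for the page directory and page table index

def split_virtual_address(vaddr):
    """Split a 32-bit IA32 virtual address into
    (page directory index, page table index, byte offset within page)."""
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    pte_index = (vaddr >> PAGE_SHIFT) & ((1 << INDEX_BITS) - 1)
    pgd_index = (vaddr >> (PAGE_SHIFT + INDEX_BITS)) & ((1 << INDEX_BITS) - 1)
    return pgd_index, pte_index, offset

pgd, pte, offset = split_virtual_address(0xC0101234)  # hypothetical address
print(pgd, pte, hex(offset))
```

Each 10-bit index selects one of 1024 entries, and 1024 x 1024 pages of 4 KB each cover the full 4 GB address space.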
Virtual Memory
• Memory reclamation algorithms must also factor in the presence of:
  – Page cache: File system data presently in memory
  – Buffer cache: File system meta-data
  – Swap cache: Pages being written to swap space
Virtual Memory
• The swap daemon kswapd tries to free memory when it is short, in this order:
1. Tries to free “clean” pages (page cache, buffer cache)
2. Shrinks the dentry cache
3. Shrinks the inode cache
4. Tries to page shared memory out
5. Tries to free “dirty” pages
Virtual Memory
• Linux keeps Accessed and Dirty bits for pages in memory
• VMS forgoes an Accessed bit and uses the Free Page List and Modified Page List as temporary caches, to rescue frequently-accessed pages before they are freed or written to disk
Kernel
• Linux kernel is not pre-emptible (yet)
• Linux kernel is monolithic
  – i.e., it is not a micro-kernel based OS
    • although most kernel components (such as drivers) can be built as Dynamically Loadable Kernel Modules (DLKMs), which are loaded on demand
  – DLKMs can be upgraded incrementally, so it is theoretically possible to improve the kernel without rebooting
Scheduler
• Process priorities can be in the range of -20 to +19, with -20 being the “highest” priority
• Idle process is PID #0, and is scheduled when nothing else is runnable
  – Requires context switch; no equivalent of VMS loop in scheduler, or code to clear demand-zero pages in advance of need
Scheduler
• Compute-bound tasks tend to be given lower priority than I/O bound tasks
• Real-time priorities are static, ranging from 1 to 99. Two scheduling policies available:
  – FIFO: Threads run to completion in order
  – Round-robin: Thread runs for time slice, then next thread runs, and so forth
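The priority ranges can be queried from user space. The sketch below uses Python's `os` module wrappers around the corresponding system calls. The helper name `realtime_priority_range` is an assumption, and the helper returns `None` on platforms whose `os` module does not expose these scheduling calls.

```python
import os

def realtime_priority_range(policy_name):
    """Return (min, max) static priority for a scheduling policy, or None
    if this platform's os module does not expose the policy."""
    policy = getattr(os, policy_name, None)
    if policy is None or not hasattr(os, "sched_get_priority_min"):
        return None
    return (os.sched_get_priority_min(policy),
            os.sched_get_priority_max(policy))

# An increment of 0 changes nothing and just reports the current nice value.
current_nice = os.nice(0)
print("nice value:", current_nice)

for name in ("SCHED_FIFO", "SCHED_RR", "SCHED_OTHER"):
    print(name, realtime_priority_range(name))
```

On a Linux system the FIFO and round-robin policies are expected to report the 1 to 99 range from the slide, while the ordinary time-sharing policy has no static priority range of its own and relies on the nice value instead.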
syscalls and Signals
• syscalls are used by a program to request something of the operating system
• Signals are used by the operating system to inform a task of errors or asynchronous events
syscalls
• Each syscall has a unique number
  – There are presently about 200
• Parameters are passed in registers
  – Implication from 80386: limit of 6 parameters
Signals
• Each signal has a default action associated with it:
  – Exit – forces the process to exit
  – Core – forces the process to exit and create a core file
  – Stop – stops the process (which then awaits a signal to continue)
  – Ignore – ignores the signal; no action taken
• A process can define a signal handler for most signals, to override the default action
• Behavior of System V and BSD differed when a task received a signal while performing a syscall, so Linux provides a choice
Signals
• Examples:
  – SIGINT signal is sent on a keyboard interrupt (i.e. Control-C)
    • Default action: Terminate the process
  – SIGSEGV signal is sent when a segmentation violation (an attempt to access memory that one has no right to access) occurs
    • Default action: Write core dump, then terminate the process
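Overriding a default action with a handler can be demonstrated from user space. The sketch below uses Python's `signal` module. It installs a handler for SIGINT, sends the signal to its own process (standing in for Control-C), and then restores the previous handling. The handler name and the `received` list are illustrative, and the code is assumed to run in the main thread, which Python requires for installing handlers.

```python
import os
import signal

received = []

def on_sigint(signum, frame):
    """Record the signal instead of letting it terminate the process."""
    received.append(signal.Signals(signum).name)

# Install the handler, overriding the default action for SIGINT.
previous = signal.signal(signal.SIGINT, on_sigint)

# Send ourselves the same signal a keyboard interrupt would generate.
os.kill(os.getpid(), signal.SIGINT)

# Put back whatever handling was in place before.
signal.signal(signal.SIGINT, previous)
print(received)
```

The process survives because the handler replaced the default terminate action. Signals such as SIGKILL and SIGSTOP cannot be handled this way, which is why the slide says "most signals".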
File Systems
• Linux Virtual File System (VFS) layer allows different file systems underneath
  – 32-bit interface causes some limitations
• Using a different file system may require a kernel re-compile or even patches
File Systems
• ext2 – standard out-of-the-box, traditional Unix file system
• ReiserFS – first journaled file system publicly available for Linux
• JFS from IBM – Journaled File System from OS/2 and AIX
• XFS from SGI – journaled file system taken from Irix
• ext3 – journaled file system compatible with ext2
File Systems with Remote Mirroring
• enbd – Enhanced Network Block Device can be used with Software RAID to mirror over TCP/IP to a disk on a remote system
• drbd – Distributed Replicated Block Device integrates network driver with RAID 1 and preserves write ordering as needed by journaled file systems
Why Journaling File Systems?
• ext2
  – Meta-data writes are asynchronous, with no journaling
    • must run fsck upon reboot
    • damage may be unrepairable
  – Slow linear search of unordered directory entries
  – 32-bit meta-data design
    • Limits file system and file sizes
• JFS, reiserfs, ext3, etc.
  – A journal is used to log all changes to file system meta-data
    • much faster restart times
    • improved reliability for file system (but not data)
  – B-tree data structures for faster lookup & access
  – 64-bit meta-data designs
    • But still some limits due to 32-bit VFS interface
Storage Management
• LVM – Logical Volume Manager
• RAID
Logical Volume Manager
• Adds layer between block I/O interface in the kernel and the physical disks
• Volume Groups can consist of multiple Physical Volumes
• Data can be spread across disks
• Space can be added or removed dynamically
• Data can be migrated between physical disks
Logical Volume Manager
• “Snapshots” can be taken of data at a point in time and accessed as a read-only copy
  – But in practice, file system metadata may be inconsistent unless file system is unmounted or quiesced before the snapshot is taken
Software RAID
• Supports RAID levels 0, 1, 4, 5, 10
• “Linear mode” (basically disk concatenation) is another option
Software RAID Levels
RAID 0 – Disk Striping
RAID 1 – Mirroring (shadowing) of 2 or more disks, with optional spare disk(s)
RAID 4 – Striping across 2 or more disks, with Parity all on one disk
RAID 5 – Striping with Parity distributed across sets of 3 or more disks, with optional spare disk(s)
RAID 10 – RAID 1 array of two or more RAID 0 arrays (mirrored stripesets)
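The parity used by RAID 4 and RAID 5 is a byte-wise XOR across the data blocks of a stripe, which is what lets a single lost disk be rebuilt. Here is a minimal Python sketch of that arithmetic. The function name and the tiny 4-byte "blocks" are illustrative assumptions, and the code says nothing about how the Linux driver lays stripes out on disk.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe spread over three data disks (toy 4-byte blocks).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)            # stored on the parity disk

# Suppose the disk holding data[1] fails: XOR the survivors with the parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])
```

The same property explains the recovery step of recalculating parity from the data disks: parity is always derivable from the data, and any one block is derivable from all the others.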
RAID 1 Recovery
• After a crash or power loss, a utility needs to be run, preferably at boot time:
  – ckraid --fix
    • By default, chooses first working member as master copy and copies it to the other(s)
  – Then run fsck on the file system(s)
RAID 4 or 5 Recovery
• After a crash or power loss, a utility needs to be run manually:
  – ckraid
    • To determine what changes need to be done, then:
  – ckraid --fix --suggest-failed-disk-mask
    • Where the mask is a binary bit mask with one bit set, or
  – ckraid --fix --suggest-fix-parity
    • To recalculate the parity from the data disks
  – Then run fsck on the file system(s)
File System Implications
• Prudent kernel development typically requires two separate Linux systems:
  – One to keep source code on, and do compiles
  – One to test new kernel code (in case code corrupts pages in buffer cache and makes file system unrepairable)
Future Development Directions
• Lots of projects underway (but no guarantee they will all reach mainstream Linux):
  – Pre-emptible kernel
  – Hot plugging for USB and PCI
  – Finer privilege granularity than ‘user or root’, using POSIX Capabilities
  – User-Mode Linux for safer kernel debugging
  – Lots of cluster projects
Questions?
Speaker Contact Info
Keith Parris
E-mail: [email protected] or [email protected] or [email protected]
Web: http://www.geocities.com/keithparris/ and http://encompasserve.org/~kparris/
ext2 File System Weaknesses
• Meta-data writes are asynchronous
  – Good for performance, but…
  – fsck required after a crash or power loss
    • Can take hours to complete
• Linear search of unordered directory entries is slow
• 32-bit meta-data design
  – Limits file system and file sizes
Storage Allocation
• Block-based (ext2)
• Extent-based (JFS, reiserfs)
JFS File System
• Authored by IBM; released in February 2000
• Log-based file system
  – Writing log implies some performance penalty over ext2
  – Only file system meta-data is logged – not file contents
JFS File System
• 64-bit file system design
  – although VFS on 32-bit Linux may limit file size
• Uses B+ trees for extent mapping as well as ones keyed on name for directory entries
  – Standard Unix file systems do linear search of filenames in a directory
JFS File System
• An “aggregate” is an array of disk blocks allocated on a logical volume (partition)
• Primary aggregate Super-block and its backup copy Secondary Aggregate Super-block are at fixed locations
• i-nodes are 512 bytes in size
• JFS supports block sizes of 512 bytes, 1 KB, 2 KB, and 4 KB
JFS File System
• A utility is provided to check/recover file system metadata by replaying the log and applying committed changes to the file system meta-data
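Log replay can be illustrated with a toy model. The sketch below is not the JFS on-disk format or its recovery utility. The record layout (`begin`, `update`, `commit` tuples), the function name `replay_log`, and the metadata keys are all hypothetical. It shows only the core rule stated above: updates from committed transactions are applied, while updates from transactions that never committed are discarded.

```python
def replay_log(metadata, log):
    """Return a copy of the metadata with all updates from committed
    transactions applied, in log order. Uncommitted updates are skipped."""
    committed = {record[1] for record in log if record[0] == "commit"}
    result = dict(metadata)
    for record in log:
        if record[0] == "update" and record[1] in committed:
            _, _txid, key, value = record
            result[key] = value
    return result

# Hypothetical metadata as found on disk after a crash.
on_disk = {"inode 12 size": 0, "inode 13 size": 0}

# Hypothetical log: transaction 1 committed, transaction 2 did not.
log = [
    ("begin", 1),
    ("update", 1, "inode 12 size", 4096),
    ("commit", 1),
    ("begin", 2),
    ("update", 2, "inode 13 size", 8192),   # crash happened before commit
]

recovered = replay_log(on_disk, log)
print(recovered)
```

Because only the log has to be scanned rather than the whole file system, restart after a crash is much faster than a full fsck, which is the main benefit claimed for journaling.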
ext2 File System
• ext2 name means Second Extended File System
• inode (index node) is fundamental building block
• Each inode has a unique number
• inode 2 is the root directory file
ext2 File System
• Two types of files: ordinary and directory files
• Directory files have filenames and i-node numbers
• i-node size is 256 bytes
ext2 File System
• Blocks may be 1KB, 2KB, or 4KB (default)
• Blocks are grouped (up to 8)
• Block groups have descriptor in array after super-block
ext2 File System
• Super-Block is Block 1 of file system (after the Boot Block at Block 0)
  – Multiple backup copies of Super-Block are kept
ext2 File System
• Meta-data writes are asynchronous
  – fsck required after a crash or power loss
    • Can take hours to complete
• 32-bit meta-data design
  – Limits file system and file sizes