Introduction to Linux Kernel Internals for OpenVMS Experts

Keith Parris

HP

History

• Unix heritage

• Richard M. Stallman and GNU project

• Berkeley Software Distribution (BSD)

• Andy Tanenbaum and Minix

• Linus Torvalds and Linux

– 1991: Unix-like kernel for Intel 386

Philosophy

• Open Source

– Source can be:

• Studied

• Modified

• Distributed without restrictions (except those of the GPL itself)

• Do-it-yourself attitude

• Rivalry with Microsoft Windows

• Simplicity as Elegance

History

• September 1991: Version 0.01 (announced in August 1991)

• October 1991: Version 0.02

• July 1993: First Slackware distribution, with kernel 0.99

• March 1994: Version 1.0

• June 1995: Ported to Alpha architecture

• January 1999: Version 2.2

• January 2001: Version 2.4

Release Numbering

• Major.Minor-Step format, e.g. 2.4-7

• Major:

– 0 to 1: First stable 386 release

– 1 to 2: SMP support

• Minor:

– Odd-numbered “dot” releases for new functionality, e.g. 0.9, 2.3, 2.5

– Even-numbered releases for stability, e.g. 1.0, 2.2, 2.4

• Step: Indicates roll-up of patches
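
As a rough illustration of the numbering rule above, the following Python sketch splits a release string and classifies it by the parity of the minor number. The function name is hypothetical, and accepting either a dot or a hyphen before the step is an assumption made for the example.

```python
import re


def classify_release(version):
    """Split 'major.minor[.step]' or 'major.minor-step' and classify it.

    Hypothetical helper for illustration only; not a kernel interface.
    """
    parts = [int(p) for p in re.split(r"[.-]", version)]
    major, minor = parts[0], parts[1]
    step = parts[2] if len(parts) > 2 else None
    # Even minor numbers mark stable series, odd ones development series.
    kind = "stable" if minor % 2 == 0 else "development"
    return major, minor, step, kind


for v in ("2.2", "2.4-7", "2.5.1"):
    print(v, classify_release(v))
```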

Some Major Linux Distributions

• Red Hat – leader in overall market share

• SuSE – leader in the European market, making inroads in the USA

• Debian – favorite of Open Source purists

• Slackware – first commercial Linux distribution; favorite of macho do-it-yourselfers

• Mandrake – easy to install for novices; available at Wal-Mart

• TurboLinux – family of distributions, including server, workstation, and cluster

Source Language

• Linux is written in GNU C (gcc)

– GNU C is a dialect of ANSI C (not K&R)

• ANSI C has better type checking

• So Linux is portable to any machine to which gcc has been ported

CPU Platform Support

• 32-bit:

– Intel x86, Crusoe, MIPS, ARM, SPARC, PowerPC, Motorola 68000

• 64-bit:

– Alpha, MIPS, Itanium, PA-RISC, SPARC

• Original version was on Intel 80386, and many artifacts remain from that legacy

Byte Ordering

• Both types of platforms supported by Linux:

– Little-Endian: IA32, Alpha, Itanium

– Big-Endian: PA-RISC, PowerPC, SPARC64
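
To make the two byte orders concrete, here is a small Python sketch that packs the same 32-bit value both ways using the standard struct module. The value 0x12345678 is an arbitrary example.

```python
import struct
import sys

value = 0x12345678

little = struct.pack("<I", value)  # least significant byte stored first
big = struct.pack(">I", value)     # most significant byte stored first

print("little-endian bytes:", little.hex())
print("big-endian bytes:   ", big.hex())
print("this machine is:", sys.byteorder)
```

Code that exchanges binary data between platforms of different byte order has to pick one order explicitly, as the format strings above do.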

Windowing Software

• XFree86

• KDE – K Desktop Environment

• GNOME – GNU Network Object Model Environment

Kernel Component Interaction

[Diagram: kernel component interaction. Labels: CPU, Terminal, Disk, Network Interface, System Memory, Processes & Scheduler, Traps & Faults, Network device drivers, File Systems, Network protocols, Block device drivers, Char device drivers, Virtual Memory, Physical Memory, Interrupts, System Calls, Signals, Task]

Modes

• User and Kernel Modes only

– No Executive Mode

– No Supervisor Mode

• So no protection between different portions of the kernel

Linux Process Model

• Basic execution unit on Linux is called a “task”, and is a thread of execution with an associated address space

• Multiple tasks can share the same address space (for multi-threading) in a task group

Task data

• Task data structure is about 1 KB in size

– This is the most complex data structure in the Linux kernel

• task_struct is placed at the end of the kernel stack area, so kernel stack pointer can be used to locate it efficiently

– But if kernel stack overflows, it clobbers task structure
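
The stack-pointer trick can be sketched with simple address arithmetic. The Python model below assumes an 8 KB, 8 KB-aligned combined area with the task structure at its low end (the layout used on IA32 in the 2.4 series) and the roughly 1 KB structure size quoted above; the function names and the sample stack pointer are hypothetical.

```python
THREAD_SIZE = 8192  # assumption: 8 KB combined task/stack area (IA32, 2.4)


def task_struct_address(stack_pointer):
    """Round the stack pointer down to the start of its aligned area."""
    return stack_pointer & ~(THREAD_SIZE - 1)


def stack_headroom(stack_pointer, task_struct_size=1024):
    """Bytes left before a downward-growing stack reaches the task structure."""
    return stack_pointer - (task_struct_address(stack_pointer) + task_struct_size)


sp = 0xC12F5F40  # hypothetical kernel stack pointer value
print(hex(task_struct_address(sp)))
print(stack_headroom(sp))
```

Because the lookup is a single mask operation, it costs almost nothing, but the same layout is why an overflowing stack runs straight into the task structure.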

Task Creation

• init task is Process ID (PID) #1

• fork() creates duplicate task

– except Child is given different PID and stack

– Address space is same (but with copy-on-write)

– Child is independent of Parent

– Typically exec() is then called in Child to load and run an executable image

• clone() is used to create more threads

– Address space, file descriptors, signal handlers can be shared

– PID is the same for all threads in the same process
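
The fork() behavior described above can be observed from user space. The following Python sketch uses the os module's wrappers around fork(), pipe() and waitpid(); the function name fork_demo and the counter variable are made up for the example. The child changes its own copy of a variable and reports back through a pipe, while the parent's copy stays untouched.

```python
import os


def fork_demo():
    """Fork a child, let it modify its copy of a variable, and report back."""
    counter = [100]
    read_end, write_end = os.pipe()
    parent_pid = os.getpid()

    pid = os.fork()
    if pid == 0:
        # Child: has its own PID and a copy-on-write view of the address space.
        try:
            os.close(read_end)
            counter[0] += 1  # changes the child's private copy only
            os.write(write_end, f"{os.getpid()} {counter[0]}".encode())
            os.close(write_end)
        finally:
            os._exit(0)  # leave without running the parent's cleanup code

    # Parent: fork() returned the child's PID.
    os.close(write_end)
    data = os.read(read_end, 64).decode()
    os.close(read_end)
    _, status = os.waitpid(pid, 0)
    child_pid, child_counter = (int(x) for x in data.split())
    return parent_pid, pid, child_pid, child_counter, counter[0], status


print(fork_demo())
```

In a real program the child would typically call one of the exec() variants (for example os.execvp in Python) instead of exiting immediately.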

Symmetrical Multi-Processing

• SMP support was introduced in 2.0

– But scalability was limited to about 4 CPUs

– Versions 2.4 and 2.5 scale better

• All CPUs can process interrupts and execute kernel functions

• Kernel can be built with or without MP support (for efficiency on Uni-Processor machines)

SMP Synchronization

• 2 types of spinlocks:

– Adaptive: if lock holder is running, spin-wait; otherwise, block (sleep)

– Spin: spin-wait until lock becomes free

Spinlocks

• Conflicting priorities in SMP design:

– Safety is easier to ensure with few spinlocks

• Linux originally had just one, the Big Kernel Lock (BKL)

– Performance is better with lots of individual spinlocks for more parallelism

• Hierarchy is needed to avoid deadlocks with multiple spinlocks (Linux has only a few hierarchies)

• Keeping per-CPU data separate avoids some spinlocks

• Linux kernel presently uses > 100 different locks
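
The idea of a lock hierarchy can be sketched in user space. In the Python model below, ordinary threading.Lock objects stand in for spinlocks, each lock carries a rank, and a thread may only take locks in ascending rank order. The class, the helper functions and the two lock names are hypothetical and do not correspond to actual kernel locks.

```python
import threading


class RankedLock:
    """A lock with a fixed rank; locks must be taken in ascending rank order."""

    def __init__(self, name, rank):
        self.name = name
        self.rank = rank
        self._lock = threading.Lock()  # stand-in for a spinlock


_held = threading.local()  # per-thread stack of currently held locks


def acquire(lock):
    stack = getattr(_held, "stack", [])
    if stack and stack[-1].rank >= lock.rank:
        raise RuntimeError(f"lock order violation: {stack[-1].name} -> {lock.name}")
    lock._lock.acquire()
    stack.append(lock)
    _held.stack = stack


def release(lock):
    stack = _held.stack
    assert stack and stack[-1] is lock, "locks must be released in reverse order"
    stack.pop()
    lock._lock.release()


# Hypothetical locks, ranked from outermost to innermost.
inode_lock = RankedLock("inode", 1)
page_lock = RankedLock("page", 2)

acquire(inode_lock)
acquire(page_lock)   # allowed: rank 1, then rank 2
release(page_lock)
release(inode_lock)
```

If every code path takes locks in the same order, two CPUs can never each hold one lock while waiting for the other, which is the deadlock the hierarchy is meant to rule out.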

Time keeping

• 10 millisecond clock ticks on IA32

– 1024 per second on Alpha

• Timers are available for processes (similar to VMS TQEs)

• syscall gettimeofday() analogous to $GETTIM system service

– except down to microsecond resolution instead of 10-millisecond
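
The tick arithmetic and the seconds-plus-microseconds form of the time of day can be illustrated in a few lines of Python. Here time.time() is used as a user-space stand-in for reading the wall clock; the helper name tick_length_ms is made up for the example, and the rate of 100 ticks per second for IA32 follows from the 10 ms figure above.

```python
import time


def tick_length_ms(hz):
    """Length of one clock tick in milliseconds for a given tick rate."""
    return 1000.0 / hz


print(tick_length_ms(100))   # IA32: 100 ticks per second
print(tick_length_ms(1024))  # Alpha: 1024 ticks per second

# Split the current wall-clock time (seconds since the Unix epoch, as a
# float) into the seconds/microseconds pair that gettimeofday() reports.
now = time.time()
seconds = int(now)
microseconds = int((now - seconds) * 1_000_000)
print(seconds, microseconds)
```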

Interrupts

• Interrupts are divided into two classes:

– “top-half” (hardware) and

– “bottom-half” (software)

Interrupts

• Interrupt modes:

– Critical: all interrupts masked, & uses kernel stack of current task, for lowest interrupt latency

– Noncritical: only interrupt of same IRQ is masked; higher IRQs might pre-empt

– Deferred: uses software interrupt to defer low-priority work

Interrupts

• Interrupts can be directed to different CPUs in an SMP system via IRQ affinity

• Kernel can be profiled, either by timer IRQ or (using gcc options) by function call

Virtual Memory

• Linux kernel itself is not pageable

• Despite terminology like “swap partition”, Linux actually does only paging, not swapping

• Page replacement is LRU (for v2.4)

– A linked list of all pages in memory is kept, with most-recently used pages at the front and least-recently used pages at the end
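
A toy model of such a list is sketched below in Python, using collections.OrderedDict to keep pages in recency order. This is a simplified illustration of the LRU idea only; the class name and capacity are invented, and the kernel's actual page aging is more involved than this.

```python
from collections import OrderedDict


class LruPageList:
    """Toy model: most-recently used pages at the front, least-recently at the end."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # key order: front = most recently used

    def touch(self, page):
        """Reference a page; return the evicted page number, or None."""
        evicted = None
        if page in self.pages:
            self.pages.move_to_end(page, last=False)  # move to the front
        else:
            if len(self.pages) >= self.capacity:
                evicted, _ = self.pages.popitem(last=True)  # drop from the end
            self.pages[page] = True
            self.pages.move_to_end(page, last=False)
        return evicted


lru = LruPageList(capacity=3)
for p in (1, 2, 3, 1, 4):
    victim = lru.touch(p)
    if victim is not None:
        print("evicted page", victim)
print(list(lru.pages))
```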

Virtual Memory

• Linux has a 3-tier virtual address translation table model

– IA32 has only 2 tiers, so 2nd level is mapped 1-to-1

• Linux page size is typically 4 KB

• Virtual address format for IA32:

– Page Directory Entry (PGD): 10 bits

– Page Table Entry (PTE): 10 bits

– Byte within page: 12 bits (4 KB pages)
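
The 10/10/12 split can be worked through with shifts and masks. The Python sketch below decomposes a 32-bit virtual address into the three fields listed above; the constant names and the sample address are chosen for the example.

```python
PAGE_SHIFT = 12  # 4 KB pages: low 12 bits are the byte offset
PTE_BITS = 10    # next 10 bits index the page table
PGD_BITS = 10    # top 10 bits index the page directory


def split_virtual_address(va):
    """Return (page directory index, page table index, byte offset)."""
    offset = va & ((1 << PAGE_SHIFT) - 1)
    pte_index = (va >> PAGE_SHIFT) & ((1 << PTE_BITS) - 1)
    pgd_index = (va >> (PAGE_SHIFT + PTE_BITS)) & ((1 << PGD_BITS) - 1)
    return pgd_index, pte_index, offset


print(split_virtual_address(0xC0101234))
```

With 10 bits per level, each table holds 1024 entries, and 1024 × 1024 × 4 KB covers the full 4 GB address space.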

Virtual Memory

• Memory reclamation algorithms must also factor in the presence of:

– Page cache: File system data presently in memory

– Buffer cache: File system meta-data

– Swap cache: Pages being written to swap space

Virtual Memory

• The swap daemon kswapd tries to free memory when it is short, in this order:

1. Tries to free “clean” pages (page cache, buffer cache)

2. Shrinks the dentry cache

3. Shrinks the inode cache

4. Tries to page shared memory out

5. Tries to free “dirty” pages

Virtual Memory

• Linux keeps Accessed and Dirty bits for pages in memory

• VMS forgoes an Accessed bit and uses the Free Page List and Modified Page List as temporary caches, to rescue frequently-accessed pages before they are freed or written to disk

Kernel

• Linux kernel is not pre-emptible (yet)

• Linux kernel is monolithic– i.e., it is not a micro-kernel based OS

• although most kernel components (such as drivers) can be built as Dynamically Loadable Kernel Modules (DLKMs), which are loaded on demand

– DLKMs can be upgraded incrementally, so it is theoretically possible to improve the kernel without rebooting

Scheduler

• Process priorities (nice values) can be in the range of -20 to +19, with -20 being the “highest” priority

• Idle process is PID #0, and is scheduled when nothing else is runnable

– Requires context switch; no equivalent of VMS loop in scheduler, or code to clear demand-zero pages in advance of need

Scheduler

• Compute-bound tasks tend to be given lower priority than I/O bound tasks

• Real-time priorities are static, ranging from 1 to 99. Two scheduling policies available:

– FIFO: Threads run to completion in order

– Round-robin: Thread runs for time slice, then next thread runs, and so forth
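
The priority ranges can be queried from user space. The Python sketch below asks the running system for the static priority range of the two real-time policies and for the nice value of the calling process, using functions from the standard os module. The values printed depend on the system it runs on.

```python
import os

for name in ("SCHED_FIFO", "SCHED_RR"):
    policy = getattr(os, name)
    low = os.sched_get_priority_min(policy)
    high = os.sched_get_priority_max(policy)
    print(f"{name}: static priorities {low}..{high}")

# os.nice(0) adds nothing to the niceness and returns the current value.
print("nice value:", os.nice(0))
```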

syscalls and Signals

• syscalls are used by a program to request something of the operating system

• Signals are used by the operating system to inform a task of errors or asynchronous events

syscalls

• Each syscall has a unique number

– There are presently about 200

• Parameters are passed in registers

– Implication from 80386: limit of 6 parameters

Signals

• Each signal has a default action associated with it:

– Exit – forces the process to exit

– Core – forces the process to exit and create a core file

– Stop – stops the process (which then awaits a signal to continue)

– Ignore – ignores the signal; no action taken

• A process can define a signal handler for most signals, to override the default action

• Behavior of System V and BSD differed when a task received a signal while performing a syscall, so Linux provides a choice

Signals

• Examples:

– SIGINT signal is sent on a keyboard interrupt (i.e. Control-C)

• Default action: Terminate the process

– SIGSEGV signal is sent when a segmentation violation (an attempt to access memory that one has no right to access) occurs

• Default action: Write core dump, then terminate the process
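
Overriding a default action with a handler can be sketched with Python's signal module. In the example below the process installs a handler for SIGINT, sends the signal to itself, and then restores the earlier disposition. The handler name and the received list are invented for the example; signal.raise_signal() requires Python 3.8 or later, and the code is assumed to run in the main thread.

```python
import signal

received = []


def on_sigint(signum, frame):
    """Handler that records the signal instead of terminating the process."""
    received.append(signum)


previous = signal.signal(signal.SIGINT, on_sigint)  # install, keep old handler
try:
    signal.raise_signal(signal.SIGINT)  # send SIGINT to ourselves
finally:
    signal.signal(signal.SIGINT, previous)  # restore the earlier disposition

print("handled:", received == [signal.SIGINT])
```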

File Systems

• Linux Virtual File System (VFS) layer allows different file systems underneath

– 32-bit interface causes some limitations

• Using a different file system may require a kernel re-compile or even patches

File Systems

• ext2 – standard out-of-the-box, traditional Unix file system

• ReiserFS – first journaled file system publicly available for Linux

• JFS from IBM– Journaled File System from OS/2 and AIX

• XFS from SGI – journaled file system taken from Irix

• ext3 – journaled file system compatible with ext2

File Systems with Remote Mirroring

• enbd – Enhanced Network Block Device can be used with Software RAID to mirror over TCP/IP to a disk on a remote system

• drbd – Distributed Replicated Block Device integrates network driver with RAID 1 and preserves write ordering as needed by journaled file systems

Why Journaling File Systems?

• ext2

– Meta-data writes are asynchronous, with no journaling

• must run fsck upon reboot

• damage may be unrepairable

– Slow linear search of unordered directory entries

– 32-bit meta-data design

• Limits file system and file sizes

• JFS, reiserfs, ext3, etc.

– A journal is used to log all changes to file system meta-data

• much faster restart times

• improved reliability for file system (but not data)

– B-tree data structures for faster lookup & access

– 64-bit meta-data designs

• But still some limits due to 32-bit VFS interface

Storage Management

• LVM – Logical Volume Manager

• RAID

Logical Volume Manager

• Adds layer between block I/O interface in the kernel and the physical disks

• Volume Groups can consist of multiple Physical Volumes

• Data can be spread across disks

• Space can be added or removed dynamically

• Data can be migrated between physical disks

Logical Volume Manager

• “Snapshots” can be taken of data at a point in time and accessed as a read-only copy

– But in practice, file system metadata may be inconsistent unless file system is unmounted or quiesced before the snapshot is taken

Software RAID

• Supports RAID levels 0, 1, 4, 5, 10

• “Linear mode” (basically disk concatenation) is another option

Software RAID Levels

RAID 0 – Disk Striping

RAID 1 – Mirroring (shadowing) of 2 or more disks, with optional spare disk(s)

RAID 4 – Striping across 2 or more disks, with Parity all on one disk

RAID 5 – Striping with Parity distributed across sets of 3 or more disks, with optional spare disk(s)

RAID 10 – RAID 1 array of two or more RAID 0 arrays (mirrored stripesets)
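
The parity used by RAID 4 and RAID 5 is a byte-wise XOR across the data blocks of a stripe, which is what allows any single missing block to be rebuilt. The Python sketch below shows the principle on made-up four-byte blocks; the function name and the sample data are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread over three hypothetical data disks.
data = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_blocks(data)

# Simulate losing disk 1 and rebuilding it from the survivors plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)
```

The same operation recalculates parity from the data disks when the parity block itself is the one in doubt.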

RAID 1 Recovery

• After a crash or power loss, a utility needs to be run, preferably at boot time:

– ckraid --fix

• By default, chooses first working member as master copy and copies it to the other(s)

– Then run fsck on the file system(s)

RAID 4 or 5 Recovery

• After a crash or power loss, a utility needs to be run manually:

– ckraid

• To determine what changes need to be done, then:

– ckraid --fix --suggest-failed-dsk-mask

• Where the mask is a binary bit mask with one bit set, or

– ckraid --fix --suggest-fix-parity

• To recalculate the parity from the data disks

– Then run fsck on the file system(s)

File System Implications

• Prudent kernel development typically requires two separate Linux systems:

– One to keep source code on, and do compiles

– One to test new kernel code (in case code corrupts pages in buffer cache and makes file system unrepairable)

Future Development Directions

• Lots of projects underway (but no guarantee they will all reach mainstream Linux):

– Pre-emptible kernel

– Hot plugging for USB and PCI

– Finer privilege granularity than ‘user or root’, using POSIX Capabilities

– User-Mode Linux for safer kernel debugging

– Lots of cluster projects

Questions?

Speaker Contact Info

Keith Parris

E-mail: [email protected] or [email protected] or [email protected]

Web: http://www.geocities.com/keithparris/ and http://encompasserve.org/~kparris/


ext2 File System Weaknesses

• Meta-data writes are asynchronous

– Good for performance, but…

– fsck required after a crash or power loss

• Can take hours to complete

• Linear search of unordered directory entries is slow

• 32-bit meta-data design

– Limits file system and file sizes

Why Journaling File Systems

• ext2

– Meta-data writes are asynchronous, with no journaling

• must run fsck upon reboot

• damage may be unrepairable

– slow linear search of unordered directory entries

– 32-bit meta-data design

• Limits file system and file sizes

• JFS, reiserfs, ext3, etc.

– journal logs metadata changes

• much faster restart times

• improved reliability

– B-tree data structures for faster lookup & access

– 64-bit meta-data designs

Storage Allocation

• Block-based (ext2)

• Extent-based (JFS, reiserfs)
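
The difference between the two allocation styles can be shown by converting a per-block map into extents. The Python sketch below collapses a sorted list of block numbers into (start, length) pairs; the function name and the block numbers are invented for the example and do not reflect any on-disk format.

```python
def blocks_to_extents(block_numbers):
    """Collapse a sorted list of block numbers into (start, length) extents."""
    extents = []
    for block in block_numbers:
        if extents and extents[-1][0] + extents[-1][1] == block:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)  # extend the current run
        else:
            extents.append((block, 1))         # start a new run
    return extents


block_map = [100, 101, 102, 103, 200, 201, 500]
print(blocks_to_extents(block_map))
```

For a large, mostly contiguous file, a handful of extents replaces thousands of individual block pointers.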

JFS File System

• Authored by IBM; released in February 2000

• Log-based file system

– Writing log implies some performance penalty over ext2

– Only file system meta-data is logged – not file contents

JFS File System

• 64-bit file system design

– although VFS on 32-bit Linux may limit file size

• Uses B+ trees for extent mapping as well as ones keyed on name for directory entries

– Standard Unix file systems do linear search of filenames in a directory

JFS File System

• An “aggregate” is an array of disk blocks allocated on a logical volume (partition)

• Primary aggregate Super-block and its backup copy Secondary Aggregate Super-block are at fixed locations

• i-nodes are 512 bytes in size

• JFS supports block sizes of 512 bytes, 1KB, 2KB, and 4 KB

JFS File System

• A utility is provided to check/recover file system metadata by replaying the log and applying committed changes to the file system meta-data
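
Log replay can be sketched as follows. The Python model below applies only those metadata updates that belong to a committed transaction, in log order. The record layout, keys and values are invented for illustration and are not the JFS on-disk log format.

```python
def replay(journal, metadata):
    """Apply updates from committed transactions only, in log order."""
    committed = {rec["txn"] for rec in journal if rec["op"] == "commit"}
    for rec in journal:
        if rec["op"] == "update" and rec["txn"] in committed:
            metadata[rec["key"]] = rec["value"]
    return metadata


# Hypothetical log contents after a crash: transaction 2 never committed.
journal = [
    {"txn": 1, "op": "update", "key": "inode 12 size", "value": 4096},
    {"txn": 1, "op": "commit"},
    {"txn": 2, "op": "update", "key": "inode 12 size", "value": 8192},
]
print(replay(journal, {"inode 12 size": 0}))
```

Because only the log has to be scanned, recovery time depends on the size of the log rather than on the size of the file system, which is why restart is so much faster than a full fsck.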

ext2 File System

• ext2 name means Second Extended File System

• inode (index node) is fundamental building block

• Each inode has a unique number

• inode 2 is the root directory file

ext2 File System

• Two types of files: ordinary and directory files

• Directory files have filenames and i-node numbers

• i-node size is 128 bytes

ext2 File System

• Blocks may be 1KB, 2KB, or 4KB (default)

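• Blocks are grouped into block groups (up to 8 × block-size blocks per group, since the group's block bitmap occupies a single block)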

• Block groups have descriptor in array after super-block

ext2 File System

• Super-Block is Block 1 of file system (after the Boot Block at Block 0)

– Multiple backup copies of Super-Block are kept

ext2 File System

• Meta-data writes are asynchronous

– fsck required after a crash or power loss

• Can take hours to complete

• 32-bit meta-data design

– Limits file system and file sizes
