
Page 1: LINUX INTERNALS - cse.msu.edu

LINUX INTERNALS

• Topics:

– Process definition and scheduling

– Memory management

– Process control and interaction

– Boot sequence

– I/O subsystem

– File system and VFS

– Networking

1

Page 2: LINUX INTERNALS - cse.msu.edu

Introduction to Linux

• Public domain OS

– developed originally by Linus Torvalds, a Finnish computer science student, in 1991

– placed under the GNU Public License (anyone can use, copy, and modify it)

– designed primarily for Intel PCs, but also runs on many other processors (Sparc, Alpha, etc.)

2

Page 3: LINUX INTERNALS - cse.msu.edu

Introduction to Linux (continued)

• distributed (CD with everything, including source and installation/management tools) by a number of companies (Redhat, Slackware, ...), which is sometimes better than pulling it off the Internet

• "maintained" by Internet users at large

3

Page 4: LINUX INTERNALS - cse.msu.edu

Linux Information Sources

4

Page 5: LINUX INTERNALS - cse.msu.edu

Linux Kernel Source Code

• The top level of the source tree is /usr/src/linux. The OS source is found in subdirectories according to functionality:

– arch - all of the architecture-specific kernel code; subdirectories for each architecture.

– include - most of the header files needed to build the kernel code.

– init - initialization code for the kernel (good place to start looking at how the kernel works).

– mm - memory management code (architecture-dependent mm code is under arch/*/mm/).

5

Page 6: LINUX INTERNALS - cse.msu.edu

Linux Kernel Source Code (continued)

• drivers - contains all the system's device drivers, further sub-divided into classes (char, block, net, ...)

• ipc - interprocess communications code.

• modules - used to hold built modules.

• fs - file system code; further divided into fs types (e.g., vfat and ext2)

• kernel - main kernel code (except architecture specific)

• net - networking code (sockets, tcp, ip, ...)

• lib - various simple library routines.

• scripts - scripts (for example awk and tk scripts) that are used when the kernel is configured.

6

Page 7: LINUX INTERNALS - cse.msu.edu

Traditional Unix Kernel

• Unix is a monolithic operating system. Traditionally, the entire Unix kernel was loaded into physical memory and remained memory resident.

• Newer systems, such as Linux, relax this requirement by allowing demand-loadable kernel modules.

7

Page 8: LINUX INTERNALS - cse.msu.edu

Traditional Unix Kernel (continued)

• Typical kernel functions

– Controlling the execution of processes by allowing their creation, termination or suspension, and communication

– Scheduling processes fairly for execution on the CPU

– Allocating main memory for an executing process

– Allowing processes controlled access to peripheral devices

– Allocating secondary memory for efficient storage and retrieval of user files and process images

• In Unix, the kernel does not persist as a process itself. Rather, its routines are executed (in protected mode) on behalf of user processes.

8

Page 9: LINUX INTERNALS - cse.msu.edu

Outline of Linux Topics

• Process definition and scheduling

• Memory management and paging

• Process control and interprocess communication

• Linux boot sequence

• I/O subsystem and device drivers

• Linux file system design

• Linux networking

• Dynamically loadable modules

9

Page 10: LINUX INTERNALS - cse.msu.edu

Process Definition

• In many (old and new) versions of Unix, two kernel data structures describe the state of a process

– proc table entry: everything that must be known when the process is swapped out

– u area: everything else

• In Linux, these are combined into a single task_struct structure. Contains information on:

– current process state, timers, signals, links to other tasks, pointers to memory management info, open files, permissions, etc.

10

Page 11: LINUX INTERNALS - cse.msu.edu

Process Definition (continued)

• Process space comprises three regions

– text: read only and can be shared by other processes

– data: global variables, usually private to the process

– stack: growable, for execution of program

11

Page 12: LINUX INTERNALS - cse.msu.edu

Process Activities

• Process creation

– In Unix, processes are created through the system call fork or derivatives (vfork, clone).

– In all versions of Unix, certain processes created at boot time have special significance.

• Process termination through exit.

• Process suspension and resumption

– Example: Sleep/Wakeup. Processes go to sleep because they are awaiting the occurrence of some event (sleep on an event).

– Sleeping processes do not consume CPU resources

12

Page 13: LINUX INTERNALS - cse.msu.edu

Linux Task Structure

• task struct data structure (large!)

– one per process, pointed to in an array task[] of length 512 (default) in the kernel, defined in include/linux/sched.h

– task array allocated in kernel/sched.c:
struct task_struct *task[NR_TASKS] = {&init_task, };

– a global variable in the kernel, current, points to the currently running process

13

Page 14: LINUX INTERNALS - cse.msu.edu

Linux Task Structure (continued)

• State

– running - either running or ready to run

– waiting, interruptible - can be interrupted by signals

– waiting, uninterruptible - cannot be interrupted under any circumstances.

– stopped - for example, being debugged

– zombie - halted, but task struct still allocated

struct task_struct{

/* these are hardcoded - don’t touch */

volatile long state;

/* -1 unrunnable, 0 runnable, 1 stopped */

...

14

Page 15: LINUX INTERNALS - cse.msu.edu

Linux Task Structure (continued)

• Scheduling information

– determines which process is selected to run

– policy - scheduling policy for this process

– counter - how many clock ticks till time slice ends

– priority - static priority of process

– rt_priority - real-time priority

• Identifiers

– pid, pgrp, session, leader, groups[],

– uid, euid, gid, egid, ...

– (what is effective uid and how is it set?)

15

Page 16: LINUX INTERNALS - cse.msu.edu

Linux Task Structure (continued)

• IPC information

– status of signals, which are blocked (signal and blocked)

– signal handlers

/* signal handlers */

struct signal_struct *sig; <-- in task_struct

...

struct signal_struct{

int count;

struct sigaction action[32];

}

16

Page 17: LINUX INTERNALS - cse.msu.edu

Task Structure Contents (continued)

• Links to parent and children processes

– tree of processes, wait queue for dying children

/*

* pointers to (original) parent process,

* youngest child, younger sibling,

* older sibling, respectively.

* (p->father can be replaced with

* p->p_pptr->pid)

*/

struct task_struct *p_opptr, *p_pptr, *p_cptr,

*p_ysptr, *p_osptr;

struct wait_queue *wait_chldexit;

/* for wait4() */

17

Page 18: LINUX INTERNALS - cse.msu.edu

Task Structure Contents (continued)

• can see tree using pstree command (ptree in Solaris)

– also, doubly linked list of all processes; also, doubly linked list of processes on run queue

• struct task_struct *next_task, *prev_task;

• struct task_struct *next_run, *prev_run;

• Exercise: Use the Solaris ptree command to examine the processes running on your machine.

18

Page 19: LINUX INTERNALS - cse.msu.edu

Task Structure Contents (continued)

• Times and Timers

– scheduling information

– user-defined timers, interval timers (see setitimer and getitimer system calls)

unsigned long timeout, policy, rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
unsigned long it_real_incr, it_prof_incr, it_virt_incr;
struct timer_list real_timer;
long utime, stime, cutime, cstime, start_time;

19

Page 20: LINUX INTERNALS - cse.msu.edu

Task Structure Contents (continued)

• Memory management

– pointer to structure defining virtual memory and process image

/* memory management info */

struct mm_struct *mm;

...

/* mm fault and swap info: this can arguably

* be seen as either mm-specific or

* thread-specific */

unsigned long min_flt, maj_flt, nswap,

cmin_flt, cmaj_flt, cnswap;

int swappable:1;

20

Page 21: LINUX INTERNALS - cse.msu.edu

unsigned long swap_address;

unsigned long old_maj_flt;

/* old value of maj_flt */

unsigned long dec_flt;

/* page fault count of the last time */

unsigned long swap_cnt;

/* number of pages to swap on next pass */

21

Page 22: LINUX INTERNALS - cse.msu.edu

File Information in the Task Structure

• File system information

– local filesystem root, current working directory, and open files

/* filesystem information */

struct fs_struct *fs;

/* open file information */

struct files_struct *files;

struct fs_struct{

int count; /* reserved*/

unsigned short umask;

struct inode * root, * pwd;

...

}

22

Page 23: LINUX INTERNALS - cse.msu.edu

File Information in the Task Structure (continued)

struct files_struct {

int count; /* reserved*/

fd_set close_on_exec;

/* bit map - close these on exec */

fd_set open_fds;

struct file * fd[NR_OPEN];

/* fds are index */

}

23

Page 24: LINUX INTERNALS - cse.msu.edu

File Structures

24

Page 25: LINUX INTERNALS - cse.msu.edu

Task Structure Contents (continued)

• Personality - Linux can run more than just i386-based Unix environments

struct exec_domain *exec_domain;

unsigned long personality;

• Status information

int exit_code, exit_signal;

int errno;

• Program name

char comm[16];

25

Page 26: LINUX INTERNALS - cse.msu.edu

Task Structure Contents (continued)

• Multiprocessor information

#ifdef __SMP__
int processor;
int last_processor;

#endif

• Processor specific context

– different thread_struct for each architecture

– include/asm-i386/processor.h
∗ struct thread_struct tss; /* tss for this task */

26

Page 27: LINUX INTERNALS - cse.msu.edu

Unix Scheduling

• Common features

– all versions of Unix support a time-slice scheduler (time slice varies)

– process may also give up processor when it waits on an event

– timeout, sleep, wakeup: internal kernel routines used

27

Page 28: LINUX INTERNALS - cse.msu.edu

Unix Scheduling (continued)

• Example states

– executing in user mode

– executing in kernel mode

– not executing but is ready to run as soon as the kernel schedules it

– sleeping and interruptible

– sleeping and non-interruptible

– returning from the kernel to user mode, but the kernel preempts it and does a context switch to schedule another process

– stopped (for example, by a debugger)

– has executed the exit system call and is in the zombie state, which is the final state of a process

28

Page 29: LINUX INTERNALS - cse.msu.edu

Process/Scheduler Interaction

• The scheduler always executes in the context of a user process.

• Context Switching

– Context of a process is its state (text, values in data and registers, values in process structure(s), and stack)

– When doing a context switch, the kernel saves enough information so that the process can be recovered and resumed later.

29

Page 30: LINUX INTERNALS - cse.msu.edu

Linux Scheduler

• Operation

– run whenever a process voluntarily relinquishes control or its time slice expires

– time slice of 200 ms

– selects "most deserving" process on run queue

• Priority

– in priority field of task_struct

– equal to the number of clock ticks (jiffies) for which it will run if it does not relinquish the processor

– can be changed dynamically with renice

– counter field in task_struct initially set to priority of process

– decremented with each clock tick

30

Page 31: LINUX INTERNALS - cse.msu.edu

Linux Scheduler (continued)

• Linux also supports real-time processes

– identified by t.policy

– have higher priority than any non-real-time processes

– t.rt_priority holds relative real-time priority

31

Page 32: LINUX INTERNALS - cse.msu.edu

Process selection

• Algorithm

– step through run queue, and note process with highest priority

– uses goodness function to compute priority (includes real-time and SMP weights)

• goodness() (kernel/sched.c)

/*

* This is the function that decides how desirable a

* process is. You can weigh different processes

* against each other depending on what CPU they’ve

* run on lately etc to try to handle cache and TLB

* miss penalties.

*

* Return values:

* -1000: never select this

32

Page 33: LINUX INTERNALS - cse.msu.edu

* 0: out of time, recalculate counters

* (but it might still be selected)

* +ve: "goodness" value (the larger, the better)

* +1000: realtime process, select this.

*/

33

Page 34: LINUX INTERNALS - cse.msu.edu

Process selection (cont)

int weight;

#ifdef __SMP__

/* We are not permitted to run a task someone

* else is running */

if (p->processor != NO_PROC_ID)

return -1000;

#ifdef PAST_2_0

/* This process is locked to a processor group */

if (p->processor_mask &&

!(p->processor_mask & (1<<this_cpu)))

return -1000;

#endif

#endif

/*

34

Page 35: LINUX INTERNALS - cse.msu.edu

* Realtime process, select the first one on the

* runqueue (taking priorities within processes

* into account).

*/

if (p->policy != SCHED_OTHER)

return 1000 + p->rt_priority;

/*

* Give the process a first-approximation goodness value

* according to the number of clock-ticks it has left.

*

* Don’t do any other calculations if the time slice is

* over..

*/

35

Page 36: LINUX INTERNALS - cse.msu.edu

Process selection (continued)

weight = p->counter;

if (weight) {

#ifdef __SMP__

/* Give a largish advantage to the same processor... */

/* (this is equivalent to penalizing other processors) */

if (p->last_processor == this_cpu)

weight += PROC_CHANGE_PENALTY;

#endif

/* .. and a slight advantage to the current process */

if (p == prev)

weight += 1;

}

return weight;

}

36

Page 37: LINUX INTERNALS - cse.msu.edu

Scheduler Invocation

• Scheduler is invoked "voluntarily" from many places in the kernel

• In addition, the scheduler is called whenever current->counter expires.

• Wait queues

– simply a list of processes associated with some resource

– processes add themselves to queue, then call scheduler

37

Page 38: LINUX INTERNALS - cse.msu.edu

Scheduler Invocation

struct wait_queue {

struct task_struct * task;

struct wait_queue * next;

};

#define WAIT_QUEUE_HEAD(x) ((struct wait_queue *)((x)-1))

static inline void init_waitqueue(struct wait_queue **q)

{

*q = WAIT_QUEUE_HEAD(q);

}

38

Page 39: LINUX INTERNALS - cse.msu.edu

Bottom Half Handling

• Requiring interrupt handlers of device drivers to do all processing of tasks related to a given interrupt may not be advisable

– the rest of the system is suspended during interrupt

– many such tasks may not be time critical

– what is time critical is "registering" these tasks to be done later

• Linux supports up to 32 "bottom half" handlers

– bh_base points to handling routines

– bh_mask indicates which entries are valid

– bh_active indicates which need service

• Priority is low (0 is for timers) to high

• Jobs to be handled later are placed on task queues
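A minimal sketch of how a driver in the 2.x-era kernels described here might register and trigger a bottom half. The slot number MY_DRIVER_BH and the handler names are hypothetical; real drivers used one of the fixed slots (TIMER_BH, NET_BH, TQUEUE_BH, ...), and exact prototypes varied by kernel version.

#include <linux/interrupt.h>   /* init_bh(), mark_bh() */

#define MY_DRIVER_BH 15        /* hypothetical slot in the 32-entry bh_base[] */

/* Deferred work: runs later, with interrupts enabled, when the kernel
 * processes active bottom halves (e.g., on return from interrupt). */
static void my_bottom_half(void)
{
        /* ... non-time-critical part of servicing the device ... */
}

static void my_interrupt_handler(int irq, void *dev_id, struct pt_regs *regs)
{
        /* time-critical work only: acknowledge the device, grab data ... */
        mark_bh(MY_DRIVER_BH);          /* set bit in bh_active: run the rest later */
}

int my_driver_init(void)
{
        init_bh(MY_DRIVER_BH, my_bottom_half);   /* install handler in bh_base[], set bh_mask bit */
        return 0;
}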

39

Page 40: LINUX INTERNALS - cse.msu.edu

Bottom Half Data Structures

Bit maps and function vector

40

Page 41: LINUX INTERNALS - cse.msu.edu

Bottom Half Data Structures (continued)

Bit maps and function vector

41

Page 42: LINUX INTERNALS - cse.msu.edu

Memory Management Techniques

• Swapping - used in early System V Unix versions

– whole processes are moved in and out of memory

– first-fit allocation of main memory

– swapper process wakes up periodically and makes swapping decisions

• Paging - bring pages into memory on demand

– use page-fault handler

– pre-paging may be done for performance

– page-replacement algorithm (such as LRU or an approximation) is a major design decision

– if paging system is overloaded (thrashing), swapper can swap out whole processes

42

Page 43: LINUX INTERNALS - cse.msu.edu

Linux Memory Management

• Features

– demand paged virtual memory

– memory space protection provided by architecture

– processes can share virtual memory (text, shared (dynamic) libraries, shmem)

• Implementation

– virtual and physical memory divided into pages (4K on Intel processors)

– if a page is not present, a page fault results

43

Page 44: LINUX INTERNALS - cse.msu.edu

Linux Memory Management (continued)

44

Page 45: LINUX INTERNALS - cse.msu.edu

Linux Page Tables

• Memory address structure

– Linux assumes three levels of page tables

– virtual address is broken into fields, three of which point to page table entries, another is the offset

– definition of field bits is architecture dependent, but macros hide this from the Linux kernel

– idea is to have a "virtual" mm, just as we see Unix supports a virtual file system

45

Page 46: LINUX INTERNALS - cse.msu.edu

Linux Page Tables (continued)

46

Page 47: LINUX INTERNALS - cse.msu.edu

Details for i386 Architecture

• Example for 386 architecture

– two levels of indirection in address translation

– page directory contains pointers to 1024 page tables

– each page table contains pointers to 1024 pages

– the register CR3 contains the physical base address of the page directory; it is stored as part of the TSS in the task_struct and is loaded on each task switch

– 32 bit linear address is divided as follows: 31-22 DIR, 21-12 TABLE, 11-0 OFFSET

47

Page 48: LINUX INTERNALS - cse.msu.edu

Details for i386 Architecture

• Example for 386 architecture (continued)

– physical address is then computed (in hardware) as follows:
∗ table base = CR3 + DIR
∗ page base = table base + TABLE
∗ physical address = page base + OFFSET
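As a worked illustration of the field split just described (user-space code, not part of the kernel), the following sketch extracts DIR, TABLE, and OFFSET from a 32-bit linear address:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t linear = 0x0804A123;               /* arbitrary example address */

    unsigned dir    = (linear >> 22) & 0x3FF;   /* index into page directory (bits 31-22) */
    unsigned table  = (linear >> 12) & 0x3FF;   /* index into page table     (bits 21-12) */
    unsigned offset =  linear        & 0xFFF;   /* byte offset within page   (bits 11-0)  */

    printf("DIR=%u TABLE=%u OFFSET=0x%03x\n", dir, table, offset);
    return 0;
}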

48

Page 49: LINUX INTERNALS - cse.msu.edu

Page Table Entry

• Upper 20 bits contain address information

• lower 12 bits are used to store useful information about the page table (or page) pointed to by the entry

• Format for page directory and page table entries:

31-12 11-9 8 7 6 5 4 3 2 1 0

ADDR OS 0 0 D A 0 0 U/S R/W P

OS : used by replacement policy.

D : used to mark a page as dirty (1) (undefined for a page directory entry)

A : to denote if page has been accessed (1).

U/S : to denote whether a user page(1) or a system page (0).

R/W : used to denote if the page is read-only (0).

P : to denote if the page is in memory (1).
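A small user-space sketch (illustrative only) of testing the flag bits described above with masks; the bit positions follow the entry format on this slide:

#include <stdio.h>

#define PTE_PRESENT   (1u << 0)   /* P   : page is in memory      */
#define PTE_RW        (1u << 1)   /* R/W : 0 = read-only          */
#define PTE_USER      (1u << 2)   /* U/S : 1 = user page          */
#define PTE_ACCESSED  (1u << 5)   /* A   : page has been accessed */
#define PTE_DIRTY     (1u << 6)   /* D   : page has been written  */
#define PTE_ADDR_MASK 0xFFFFF000u /* upper 20 bits: frame address */

int main(void)
{
    unsigned int pte = 0x00403067;   /* example entry value */

    printf("frame address : 0x%08x\n", pte & PTE_ADDR_MASK);
    printf("present       : %s\n", (pte & PTE_PRESENT)  ? "yes" : "no");
    printf("writable      : %s\n", (pte & PTE_RW)       ? "yes" : "no");
    printf("user page     : %s\n", (pte & PTE_USER)     ? "yes" : "no");
    printf("accessed      : %s\n", (pte & PTE_ACCESSED) ? "yes" : "no");
    printf("dirty         : %s\n", (pte & PTE_DIRTY)    ? "yes" : "no");
    return 0;
}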

49

Page 50: LINUX INTERNALS - cse.msu.edu

Page table entry (continued)

• When a page is swapped out, bits 1-31 of the page table entry are used to mark where the page is stored in swap (bit 0 must be 0).

• Of course, a TLB (translation lookaside buffer) is used as a "cache" of address translations.

50

Page 51: LINUX INTERNALS - cse.msu.edu

Page Allocation and Deallocation

• Each physical page is described by a mem_map_t data structure (an array of these, called mem_map, is defined in include/linux/mm.h):

typedef struct page {

/* these must be first (free area handling) */

struct page *next;

struct page *prev;

struct inode *inode;

unsigned long offset;

struct page *next_hash;

atomic_t count;

unsigned flags;

/* atomic flags, some possibly updated asynchronously */

unsigned dirty:16,

age:8;

51

Page 52: LINUX INTERNALS - cse.msu.edu

Page Allocation and Deallocation (continued)

struct wait_queue *wait;

struct page *prev_hash;

struct buffer_head * buffers;

unsigned long swap_unlock_entry;

} mem_map_t;

/* Page flag bit values */

#define PG_locked 0

#define PG_error 1

#define PG_referenced 2

#define PG_uptodate 3

#define PG_free_after 4

#define PG_decr_after 5

#define PG_swap_unlock_after 6

#define PG_DMA 7

#define PG_reserved 31

52

Page 53: LINUX INTERNALS - cse.msu.edu

Free Area

• The free_area vector is used by page allocation code to find free pages.

– 1st element points to list of single free pages

– 2nd element points to blocks of two (consecutive) free pages

– and so on

• Each element also points to a bit map of blocks of the corresponding size.

• Standard buddy algorithm used for allocation, deallocation.
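As an illustration of the buddy scheme (a sketch, not the kernel's code): a request for n pages is rounded up to the next power of two, and the corresponding free_area list is searched; if that list is empty, a block from the next larger list is split in half.

#include <stdio.h>

#define NR_MEM_LISTS 6   /* free_area[0] = 1-page blocks, [1] = 2-page, ... [5] = 32-page */

/* Index of the free_area list a request for n pages is satisfied from
 * (smallest power-of-two block >= n). */
static int buddy_order(unsigned int n)
{
    int order = 0;
    unsigned int size = 1;

    while (size < n && order < NR_MEM_LISTS - 1) {
        size <<= 1;
        order++;
    }
    return order;
}

int main(void)
{
    unsigned int requests[] = { 1, 2, 3, 5, 8, 20 };
    for (int i = 0; i < 6; i++)
        printf("%2u pages -> free_area[%d] (block of %d pages)\n",
               requests[i], buddy_order(requests[i]), 1 << buddy_order(requests[i]));
    return 0;
}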

53

Page 54: LINUX INTERNALS - cse.msu.edu

Free Area (continued)

54

Page 55: LINUX INTERNALS - cse.msu.edu

MM Definitions

• Here are the definitions as they appear in mm/page_alloc.c

/*

* Free area management

*

* The free_area_list arrays point to the queue heads of the

* free areas of different sizes

*/

#define NR_MEM_LISTS 6

/* The start of this MUST match the start of "struct page" */

struct free_area_struct {

struct page *next;

struct page *prev;

unsigned int * map;

};

#define memory_head(x) ((struct page *)(x))

static struct free_area_struct free_area[NR_MEM_LISTS];

55

Page 56: LINUX INTERNALS - cse.msu.edu

Memory Mapping

• When an image is executed, the contents of the executable image must be brought into the virtual address space of the process.

The same is true for any shared libraries to which the image has been linked.

• Of course, these parts are not brought into memory all at once, but rather are mapped into the address space of the process, then demand paged (see the user-space sketch below).
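A user-space sketch of the same idea: mmap() establishes the mapping immediately, but pages are only read in when first touched. The path /tmp/example.dat is hypothetical; any readable file will do.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
    const char *path = "/tmp/example.dat";      /* hypothetical file */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the file; no data is read yet -- only vm_area bookkeeping. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* First access to each page triggers a page fault; the kernel's
     * nopage / page-cache path reads the data in on demand. */
    long sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)
        sum += p[i];

    printf("touched %ld pages, checksum %ld\n",
           (long)((st.st_size + 4095) / 4096), sum);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}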

56

Page 57: LINUX INTERNALS - cse.msu.edu

Memory Mapping (continued)

• Specifically, when a process attempts to access a virtual address within the new memory region for the first time,

– the processor will attempt to decode the virtual address,

– since there are no page table entries for this new area, the processor will raise a page fault exception

– kernel will create entries, allocate a physical page frame, and bring in one (or more) pages from disk (either filesystem or swap)

– process is then resumed at the instruction that caused the page fault

57

Page 58: LINUX INTERNALS - cse.msu.edu

Process Address Space

• The address space of a process is described by an mm_struct data structure (sched.h), which points to a number of vm_area_struct data structures (mm.h)

• Each vm_area describes the start and end of part of the virtual memory of the process, such as the code, data, shared regions, etc. The key point is that the area is treated as a unit (permissions, etc.).

58

Page 59: LINUX INTERNALS - cse.msu.edu

Process Address Space (continued)

struct mm_struct {

int count;

pgd_t * pgd;

unsigned long context;

unsigned long start_code, end_code, start_data, end_data;

unsigned long start_brk, brk, start_stack, start_mmap;

unsigned long arg_start, arg_end, env_start, env_end;

unsigned long rss, total_vm, locked_vm;

unsigned long def_flags;

struct vm_area_struct * mmap;

struct vm_area_struct * mmap_avl;

struct semaphore mmap_sem;

};

59

Page 60: LINUX INTERNALS - cse.msu.edu

Process Virtual Memory

• Since an area of memory may be associated with an image on disk, the vm_area_struct points to an inode.

• Also, the vm_ops field points to the specific functions to be used to map and unmap this area, etc.

• A very important such operation is the nopage operation, which specifies what to do when a page fault occurs (for example, bring in a page from an image on disk). Different page fault routines may be applied to different areas.

60

Page 61: LINUX INTERNALS - cse.msu.edu

Process Virtual Memory (continued)

61

Page 62: LINUX INTERNALS - cse.msu.edu

Traversing the Memory Structure of a Process

• The vm_area_structs of a process are arranged as an AVL tree for fast searching. (An AVL tree guarantees worst case O(log n) time for insert, delete, and membership, instead of linear.)

62

Page 63: LINUX INTERNALS - cse.msu.edu

Demand Paging

• See mm/memory.c

• When a page fault occurs, Linux searches the AVL tree to find which area is involved. Possibilities:

– if no such area is found, what happens?

– legal area, but wrong operation.

• Assuming the address is legal, Linux checks to see if the page is

– in the swap file (page table entry is marked invalid but the address is not empty)

– in an executable image (invalid and address is empty), in which case the page is read through the page cache.

63

Page 64: LINUX INTERNALS - cse.msu.edu

Page Cache

• Role

– similar to file cache, used to speed up access to any memory mapped files (images)

– always checked first before going to disk

– read-aheads are done when pages are brought in from files

• Exercise: Peruse mm/filemap.c to become familiar with functions that manage the page cache.

64

Page 65: LINUX INTERNALS - cse.msu.edu

Page Cache Implementation

• Structure

– page hash table is a vector of pointers to mem_map_t data structures

– hash function (index) derived from VFS inode number and the offset of the page in the file

– if page is present in cache, pointer to its mem_map_t structure is returned to fault handler

– page copied into user space
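A sketch of the kind of hash function just described (illustrative, not the kernel's exact macro): the inode identity and page-aligned file offset are folded together and masked down to a bucket index.

#define PAGE_HASH_BITS  11
#define PAGE_HASH_SIZE  (1 << PAGE_HASH_BITS)   /* number of hash buckets */

/* Fold the inode identity and file offset into a bucket index; the real
 * kernel macro differs in detail across versions, but the idea is the
 * same: identical (inode, offset) pairs always land in the same bucket. */
static unsigned long page_hash_index(unsigned long inode, unsigned long offset)
{
    unsigned long key = inode + (offset >> 12);   /* offset counted in pages */
    return (key ^ (key >> PAGE_HASH_BITS)) & (PAGE_HASH_SIZE - 1);
}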

65

Page 66: LINUX INTERNALS - cse.msu.edu

Page Cache Implementation (continued)

66

Page 67: LINUX INTERNALS - cse.msu.edu

Kernel Swap Daemon (kswapd)

• Description

– job is to keep enough free pages in system for Linux’s needs

– kernel thread - runs in kernel mode in the physical address space

• Operation

– periodically awakens and ensures the number of free pages is not too low

– tries to free up 4 pages each time it runs

• Three methods

– reduce size of buffer cache and page cache

– swap out shared pages

– swap out (or discard) other pages

67

Page 68: LINUX INTERNALS - cse.msu.edu

Kernel Swap Daemon (kswapd)

• Exercise: Find kswapd in mm/vmscan.c and peruse the routines it calls to see how pages are freed.

68

Page 69: LINUX INTERNALS - cse.msu.edu

Swapping Out Pages

• Rules

– never save a page to swap if it can be later retrieved from some other place

– pages cannot be swapped or discarded if they are locked in memory

• Linux swap algorithm

– based on page aging, using the age counter in the mem_map_t structure

– initial age of 3, count bumped when referenced, up to a max of 20

– swap daemon ages pages by decrementing count

– only pages with age = 0 are considered
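A sketch of the aging policy just described. The constants 3 and 20 come from the slide; the per-reference increment and the function names are assumptions for illustration.

#define PAGE_INITIAL_AGE  3
#define PAGE_ADVANCE      3    /* assumed bump per reference */
#define PAGE_MAX_AGE      20

/* Called when the page is referenced: make it look "younger". */
static void touch_page_age(unsigned char *age)
{
    *age = (*age + PAGE_ADVANCE > PAGE_MAX_AGE) ? PAGE_MAX_AGE
                                                : *age + PAGE_ADVANCE;
}

/* Called by the swap daemon on each pass: age the page and report
 * whether it is now a candidate for swapping (age reached 0). */
static int age_page_and_check(unsigned char *age)
{
    if (*age > 0)
        (*age)--;
    return (*age == 0);     /* only age == 0 pages are considered */
}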

69

Page 70: LINUX INTERNALS - cse.msu.edu

Swapping Out Pages (continued)

• When a page is swapped out, its PTE is replaced by one marked as invalid but holding a pointer (offset) to its location in the swap file

• When a shared page is swapped, the page tables of all processes using it must be modified.

70

Page 71: LINUX INTERNALS - cse.msu.edu

Page and Buffer Cache Sizes

• Kernel swap daemon checks these to see if they are getting too large; may discard some pages from memory.

• Uses clock algorithm (cyclical scan through the mem_map page vector)

• Buffer cache page may actually contain several buffers, depending on the block size of the file system (when all buffers in a page are freed, the page is freed)

• While pages are only in the buffer or page cache, they are not in the virtual memory of any process, so no page tables need updating.

71

Page 72: LINUX INTERNALS - cse.msu.edu

Swap Cache

• Simply a list of page table entries, one per physical page in the system

– just page table entries for swapped out pages

– non-zero entry represents a page, held in the swap file on disk, that has not been modified

– when page is modified, its entry is removed from swap cache

• When swapping out a page

– first check swap cache

– if there is a valid entry, a copy is already on disk and the page can simply be freed

72

Page 73: LINUX INTERNALS - cse.msu.edu

Swap Cache (continued)

• When swapping in a page, its PTE points to the location in the swap file. If the access that caused the fault was not a write, then

– entry for page is left in swap cache.

– page table entry is not marked as writable

– if page is later written, a page fault occurs, the page is marked as dirty, and the entry is removed from the swap cache

• If the access was a write, then the entry is removed from the swap cache and the page table entry is marked as dirty and writable.

73

Page 74: LINUX INTERNALS - cse.msu.edu

Process Control in Unix/Linux

• System calls controlling process context

– fork, clone - create a new process

– exit - terminate a process

– wait - synchronize with death of child

– exec variations - invoke a new program

– sbrk - change address space

• Signal Operations

– inform processes of asynchronous events

– may be sent (posted) by other processes

– may be sent by the kernel

74

Page 75: LINUX INTERNALS - cse.msu.edu

Evolution of Fork

• Traditional fork system call in swapped systems

– complete copy of data and stack segments created for new process

– text segment could be shared as read-only

• BSD optimization

– full copy, as in swapping system, is very wasteful

– in paging systems, just copy page tables and update page frame data table reference counts

– for data pages, remain shared until written to, at which time a copy occurs

75

Page 76: LINUX INTERNALS - cse.msu.edu

Evolution of Fork (continued)

• Vfork

– child executes in parent’s address space

– can be ”dangerous”

• In Linux, fork is simply an alias for vfork, both of which are copy-on-write (note: NOT like the original BSD vfork)

• The Linux clone call does allow parent and child to share writeable data.

76

Page 77: LINUX INTERNALS - cse.msu.edu

Process Creation in Linux

• Fork/vfork System Call

– pid = fork();

– pid - process id of child returned to parent, 0 returned to child

– child process differs from the parent process only in its PID and PPID; file locks and pending signals are not inherited (see the sketch below).
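A minimal user-space sketch of the fork() interface described above: the parent receives the child's PID, while the child receives 0.

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();               /* one call, two returns */

    if (pid < 0) {
        perror("fork");               /* creation failed */
        return 1;
    } else if (pid == 0) {
        /* child: differs from parent only in PID/PPID */
        printf("child : pid=%d ppid=%d\n", getpid(), getppid());
        _exit(0);
    } else {
        /* parent: pid holds the child's process id */
        printf("parent: pid=%d child=%d\n", getpid(), pid);
        wait(NULL);                   /* reap the child */
    }
    return 0;
}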

77

Page 78: LINUX INTERNALS - cse.msu.edu

Process Creation in Linux (continued)

• Fork procedures of the kernel (see kernel/fork.c, do_fork())

– allocate new task struct

– assign unique id to child

– get free page for kernel stack

– make "logical" copy of parent process

– copy-on-write: vm_area_structs (segments) have the copy-on-write flag set, generating a page fault upon write access

– increment file and inode counters for files associated with the process

– return appropriate values to parent and child

78

Page 79: LINUX INTERNALS - cse.msu.edu

Clone System Call

• Synopsis: pid_t clone(void *sp, unsigned long flags)

• clone is an alternate interface to fork, with more options; fork is equivalent to

clone(0, CLONE_VM).

79

Page 80: LINUX INTERNALS - cse.msu.edu

Clone System Call (continued)

• Parameters

– if sp is non-zero, the child process uses sp as its initial stack pointer.

– CLONE_VM flag: if set, child pages are copy-on-write images of the parent pages. If not set, the child process shares the same pages as the parent, and both parent and child may write on the same data.

– CLONE_FD flag: if set, the child's file descriptors are copies of the parent's file descriptors. If not set, the child's file descriptors are shared with the parent.

80

Page 81: LINUX INTERNALS - cse.msu.edu

Process Termination

• Exit system call (see sys_exit in kernel/exit.c)

– exit(status)

– SIGCHLD signal posted against parent process

– status is returned to parent process

– most of the code is cleaning up of resources

– if the process exited due to an uncaught signal, the status is the signal number

81

Page 82: LINUX INTERNALS - cse.msu.edu

Awaiting Process Termination

• Wait system call

– pid_t wait(int *status) or pid_t waitpid(pid_t pid, int *status, int options)

– pid : process ID of zombie child, or can specify any of a set of children (see the wait man page)

– status : address at which the returned status (exit code or signal) will be stored

– if child is already in ZOMBIE state, returns immediately

– else waits in kernel for child specified by pid

– else waits in kernel for child specified by pid
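A small sketch of waitpid() usage, decoding the returned status with the standard macros:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    pid_t child = fork();
    if (child == 0)
        exit(42);                                /* child terminates with status 42 */

    int status;
    pid_t reaped = waitpid(child, &status, 0);   /* block until this child dies */

    if (WIFEXITED(status))
        printf("child %d exited with status %d\n", reaped, WEXITSTATUS(status));
    else if (WIFSIGNALED(status))
        printf("child %d killed by signal %d\n", reaped, WTERMSIG(status));
    return 0;
}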

82

Page 83: LINUX INTERNALS - cse.msu.edu

Awaiting Process Termination (continued)

• In the kernel (sys_wait4 in kernel/exit.c)

– parent adds itself to wait queue

add_wait_queue(&current->wait_chldexit,

&wait);

– loops, checking for a child to have died (in earlier versions of Unix, would have slept in INTERRUPTIBLE state)

– invokes scheduler each time through loop

83

Page 84: LINUX INTERNALS - cse.msu.edu

Invoking Other Programs

• execve(2) system call

– many variations of front end: execl, execlp, execle, exect, execv, execvp

– synopsis:

int execve (const char *filename,

const char *argv[],

const char *envp[]);
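A brief usage sketch of execve() as declared above; on success it does not return.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char *argv[] = { "/bin/ls", "-l", "/", NULL };   /* argv[0] is the program name */
    char *envp[] = { "PATH=/bin:/usr/bin", NULL };

    execve("/bin/ls", argv, envp);   /* replaces this process image with /bin/ls */

    /* reached only if execve failed */
    perror("execve");
    return 1;
}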

84

Page 85: LINUX INTERNALS - cse.msu.edu

Invoking Other Programs (continued)

• Operation

– executes the program pointed to by filename; can be either a binary executable or a shell script

– on success, does not return

– text, data, bss, and stack of the calling process are overwritten by those of the program loaded.

– the new program invoked inherits the calling process's PID and any open file descriptors that are not set to close-on-exec

– signals pending on the parent process are cleared

85

Page 86: LINUX INTERNALS - cse.msu.edu

Invoking Other Programs (continued)

• The kernel (see do_exec in fs/exec.c) locates and reads the beginning of the image, tries different binary formats till one works, sets up the memory map, and lets the executable get demand paged.

• Linux can support various object file formats, but the most commonly used is ELF.

86

Page 87: LINUX INTERNALS - cse.msu.edu

Executable and Linkable Format

• An object file format designed at Unix System Laboratories, an alternative to earlier formats (ECOFF, a.out)

• Description

– tables in the image describe how the program should be placed in memory

– statically linked images are built by the linker (ld) into a single image containing all code and data

– dynamically linked routines are listed in tables so that the library can be found and linked

87

Page 88: LINUX INTERNALS - cse.msu.edu

Executable and Linkable Format (continued)

• Loading an image (ELF or otherwise)

– flush current executable image (e.g., shell) from its virtual memory, clear any signals, close all files

– set up mm_struct (start of text, data, pointers to environment, etc.)

– set up vm_area_struct structures and corresponding page tables.

88

Page 89: LINUX INTERNALS - cse.msu.edu

An ELF Example

89

Page 90: LINUX INTERNALS - cse.msu.edu

Dynamically Linked (Shared) Libraries

• DLLs have been a part of Unix since the development of libc

• Executable image tables provide information on all library routines referenced, indicating to the dynamic linker (e.g., ld.so.1, lib.so.1, ...) how to locate the library routine and link it into the address space of the program.

90

Page 91: LINUX INTERNALS - cse.msu.edu

Linux System Calls

• System call implementation is architecture specific

• i386 Implementation

– i386-compatible architectures support programmed exceptions.

– execution of a system call is invoked by a programmed exception, caused by the instruction "int 0x80"

– interrupt vector 0x80 is set up (at boot, along with other interrupt vectors) to transfer control to the kernel

91

Page 92: LINUX INTERNALS - cse.msu.edu

Linux System Calls (continued)

• Library expansion

– each call is vectored through a stub in libc

– the routine is generally a syscallX() macro, where X is the number of parameters used by the actual routine

– each syscall macro expands to an assembly routine which sets up the calling stack frame and calls system_call() through an interrupt, via the instruction int $0x80

92

Page 93: LINUX INTERNALS - cse.msu.edu

Implementation

Example library macro

#define _syscall1(type,name,type1,arg1) \

type name(type1 arg1) \

{ \

long __res; \

__asm__ volatile ("int $0x80" \

: "=a" (__res) \

: "0" (__NR_##name), \

"b" ((long)(arg1))); \

if (__res >= 0) \

return (type) __res; \

errno = -__res; \

return -1; \

}

93

Page 94: LINUX INTERNALS - cse.msu.edu

Implementation (continued)

• See arch/*/kernel/entry.S, which defines entry points for interrupts/exceptions set up at boot (segmentation error, divide by zero, system call, ...)

– ENTRY(system_call) : this code is responsible for saving all registers, checking to make sure a valid system call was invoked, and then ultimately transferring control to the actual system call code via the offsets in the sys_call_table

– ret_from_sys_call() checks to see if the scheduler should be run, and if so, calls it

94

Page 95: LINUX INTERNALS - cse.msu.edu

System Call Table

• Defined at the end of entry.S

– simply a table that maps a routine to an index

– over 170 system calls presently defined

.data

ENTRY(sys_call_table)

.long SYMBOL_NAME(sys_setup) /* 0 */

.long SYMBOL_NAME(sys_exit)

.long SYMBOL_NAME(sys_fork)

.long SYMBOL_NAME(sys_read)

.long SYMBOL_NAME(sys_write)

.long SYMBOL_NAME(sys_open) /* 5 */

.long SYMBOL_NAME(sys_close)

.long SYMBOL_NAME(sys_waitpid)

95

Page 96: LINUX INTERNALS - cse.msu.edu

Linux Interprocess Communication

• Signals

• Pipes

• System V IPC

• message queues

• semaphores

• shared memory

• sockets

96

Page 97: LINUX INTERNALS - cse.msu.edu

Unix Signals

• Operations

– inform processes of asynchronous events

– may be sent by other processes using kill system call

– may be sent by the kernel

97

Page 98: LINUX INTERNALS - cse.msu.edu

Unix Signals (continued)

• Classes of signals

– indicating termination of a process

– process induced exceptions

– unrecoverable conditions during system call

– unexpected error during system call

– user mode signals

– terminal-related signals

– tracing-related signals

98

Page 99: LINUX INTERNALS - cse.msu.edu

List of Signals (Linux)

#define SIGHUP 1

#define SIGINT 2

#define SIGQUIT 3

#define SIGILL 4

#define SIGTRAP 5

#define SIGABRT 6

#define SIGIOT 6

#define SIGBUS 7

#define SIGFPE 8

#define SIGKILL 9

#define SIGUSR1 10

#define SIGSEGV 11

#define SIGUSR2 12

#define SIGPIPE 13

#define SIGALRM 14

...

#define SIGUNUSED 31

99

Page 100: LINUX INTERNALS - cse.msu.edu

Kernel and Signals

• Sending a signal to a process

– set bit in signal field of task_struct

– if process asleep and interruptible, kernel wakes it up

– processes do not know how many times a signal fired

• Kernel handling of signals

– kernel checks for received signals when the process is ready to return to user mode

– signal has no instant effect on a kernel mode process

– if the process is running in user mode, it is interrupted; the signal takes effect when returning from the interrupt

– kernel only dumps core for signals that imply program error

100

Page 101: LINUX INTERNALS - cse.msu.edu

Kernel and Signals (continued)

• Possible actions taken by process receiving signal

– exit in kernel mode (default)

– ignore the signal

– execute a particular user function

101

Page 102: LINUX INTERNALS - cse.msu.edu

Signal System Call

• Syntax

void (*signal(int signum, void (*handler)(int)))(int);

– signum is the number of the signal

– handler is a user-defined function to be called upon receipt of the signal, or a special flag:

– if handler is SIG_IGN, the signal will be ignored if it is posted against the process

– if handler is SIG_DFL, the default action for the signal is reinstated.

– the kernel keeps track of how it is to handle signals in the task_struct (see the sketch below)
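A minimal user-space sketch of installing a handler with signal(); sigaction() is the more robust interface, but signal() matches the call shown above.

#include <stdio.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal = 0;

/* Handler: runs asynchronously when SIGINT is delivered. */
static void on_sigint(int signum)
{
    got_signal = signum;              /* only async-signal-safe work here */
}

int main(void)
{
    signal(SIGINT, on_sigint);        /* install user-defined catcher */

    printf("pid %d: press Ctrl-C (or: kill -INT %d)\n", getpid(), getpid());
    while (!got_signal)
        pause();                      /* sleep until a signal arrives */

    printf("caught signal %d\n", (int)got_signal);

    signal(SIGINT, SIG_DFL);          /* restore default action */
    return 0;
}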

102

Page 103: LINUX INTERNALS - cse.msu.edu

Sending a Signal from a Process

• kill(pid, signum) system call

– pid - identifies set of processes to receive signal

– signum - signal being sent

• pid can specify process or a process group

• In the kernel (kernel/exit.c)

– simply post the signal against the process or each member of a process group

103

Page 104: LINUX INTERNALS - cse.msu.edu

Executing a User-Defined Signal Catcher

• Kernel activities

– access saved register context to get PC and SP

– set signal handler field to default state

– create a new stack frame on the user stack, so it looks like the user program directly called the signal handler

– change PC and SP in the saved register context to indicate the new function

– when the kernel returns to user mode, the process will execute the signal handling code, then return to the place in the code where it was interrupted

104

Page 105: LINUX INTERNALS - cse.msu.edu

Pipes

• Enable standard output from one process to be directed to standard input of another process

– processes themselves are not aware of the redirection

– example: ls | more

– created by shell using the pipe(2) call (do_pipe in fs/pipe.c); a sketch follows below
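A user-space sketch of how a shell wires up something like ls | more: pipe() plus fork() plus dup2(), so that neither child is aware of the redirection.

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) < 0) { perror("pipe"); return 1; }   /* fds[0] = read end, fds[1] = write end */

    if (fork() == 0) {                 /* first child: "ls" */
        dup2(fds[1], STDOUT_FILENO);   /* stdout -> pipe write end */
        close(fds[0]); close(fds[1]);
        execlp("ls", "ls", "-l", (char *)NULL);
        _exit(127);
    }

    if (fork() == 0) {                 /* second child: "more" */
        dup2(fds[0], STDIN_FILENO);    /* stdin <- pipe read end */
        close(fds[0]); close(fds[1]);
        execlp("more", "more", (char *)NULL);
        _exit(127);
    }

    close(fds[0]); close(fds[1]);      /* parent keeps neither end open */
    while (wait(NULL) > 0)             /* reap both children */
        ;
    return 0;
}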

105

Page 106: LINUX INTERNALS - cse.msu.edu

Pipes

• Implementation

– two file data structures point at a temporary inode

– inode points at a physical page within memory

– file points to operations that are specific to pipes (pipe_read, pipe_write, ...) instead of the regular file operations

– Linux uses locks, wait queues, and signals to synchronize access to the pipe (see fs/pipe.c)

106

Page 107: LINUX INTERNALS - cse.msu.edu

Pipe Implementation

107

Page 108: LINUX INTERNALS - cse.msu.edu

System V Interprocess Communication

• Message queues - allow processes to send formatted messages to other processes

– msgget - create (or return) message queue

– msgctl - control

– msgsnd - send message

– msgrcv - receive message

• Shared memory - processes communicate by sharing parts of their virtual address space (see the sketch below)

– shmget - create (or return) new shared memory region

– shmat - attach to a region

– shmdt - detach

– shmctl - control
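A minimal sketch of the System V shared memory calls just listed: one process creates and attaches a segment, writes into it, then detaches and removes it.

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    key_t key = ftok("/tmp", 'A');                    /* any agreed-upon key */

    /* shmget: create (or return) a 4 KB shared memory region */
    int shmid = shmget(key, 4096, IPC_CREAT | 0600);
    if (shmid < 0) { perror("shmget"); return 1; }

    /* shmat: attach the region into this process's address space */
    char *mem = shmat(shmid, NULL, 0);
    if (mem == (char *)-1) { perror("shmat"); return 1; }

    strcpy(mem, "hello via SysV shared memory");      /* visible to any attacher */
    printf("segment %d says: %s\n", shmid, mem);

    shmdt(mem);                                       /* shmdt: detach */
    shmctl(shmid, IPC_RMID, NULL);                    /* shmctl: remove segment */
    return 0;
}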

108

Page 109: LINUX INTERNALS - cse.msu.edu

System V Interprocess Communication (continued)

• Semaphores - generalization of Dijkstra’s P and V operations

– semget - allocate entry for an array of semaphores

– semop - manipulate a semaphore with an operation

109

Page 110: LINUX INTERNALS - cse.msu.edu

Linux System Boot

• PC boot is carried out by the BIOS

– initializes interrupt vector

– tries to read first sector (boot sector) of the first floppy

– if that fails, tries to read the boot sector (called the master boot record, MBR) of the first hard disk

– (in many systems, you can re-order this sequence by reconfiguring the BIOS)

– the MBR contains code to determine which partition to boot from (the active partition), load the boot sector of that partition, and jump to the beginning of that code

110

Page 111: LINUX INTERNALS - cse.msu.edu

Linux System Boot (continued)

• Boot sector (code in arch/*/boot/bootsect.S)

– loaded by the BIOS to address 0x7C00

– relocates self to 0x90000

– loads the next 2 kBytes of code from the boot device to address 0x90200

– loads the rest of the kernel to address 0x10000.

– prints the message "Loading..."

– passes control to setup (boot/setup.S)

111

Page 112: LINUX INTERNALS - cse.msu.edu

Linux Boot (continued)

• Setup

– identifies various hardware features of the host system (memory size, video card type, hard disk info, ...)

– prompts user to choose the video mode for the console

– moves the whole system from address 0x10000 to address 0x1000, enters protected mode, and jumps to the rest of the system (at 0x1000)

• Kernel decompression (head.S)

– invokes decompress_kernel(), which in turn is made up of inflate.c, unzip.c and misc.c.

– decompressed kernel is put at address 0x100000 (1 MB) and execution transfers to start_kernel()

112

Page 113: LINUX INTERNALS - cse.msu.edu

Linux Boot (continued)

• start_kernel (init/main.c)

– no assembly language after this point

– sets the memory bounds and calls paging_init().

– initializes the traps, IRQ channels and scheduling.

– parses the boot command line

– initializes all the device drivers and disk buffering, as well as many other data structures

113

Page 114: LINUX INTERNALS - cse.msu.edu

Linux Boot (continued)

asmlinkage void start_kernel(void)

{

char * command_line;

/*

* Interrupts are still disabled. Do necessary setups, then

* enable them

*/

setup_arch(&command_line, &memory_start, &memory_end);

memory_start = paging_init(memory_start,memory_end);

trap_init();

init_IRQ();

sched_init();

time_init();

parse_options(command_line);

114

Page 115: LINUX INTERNALS - cse.msu.edu

Linux Boot (continued)

calibrate_delay();

...

mem_init(memory_start,memory_end);

buffer_init();

sock_init();

ipc_init();

dquot_init();

arch_syms_export();

sti();

check_bugs();

printk(linux_banner);

sysctl_init();

115

Page 116: LINUX INTERNALS - cse.msu.edu

Linux Boot (continued)

/*

* We count on the initial thread going ok

* Like idlers init is an unlocked kernel thread,

* which will make syscalls (and thus be locked).

*/

kernel_thread(init, NULL, 0);

116

Page 117: LINUX INTERNALS - cse.msu.edu

Linux Boot (continued)

/* task[0] is meant to be used as an

* "idle" task: it may not sleep, but it

* might do some general things like

* count free pages or it could be used

* to implement a reasonable LRU

* algorithm for the paging routines:

* anything that can be useful, but

* shouldn’t take time from the real

* processes. * Right now task[0] just

* does a infinite idle loop. */

cpu_idle(NULL);

}

117

Page 118: LINUX INTERNALS - cse.msu.edu

Init Process

• System’s first real process

– created by start_kernel

– process id of 1

– task_struct is pointed to by a global variable, init_task

118

Page 119: LINUX INTERNALS - cse.msu.edu

Init Process (continued)

• Operation

– initial processing: open console, mount root file system

– exec's the system initialization program, historically /etc/init, but now usually found in /sbin/init

– reads /etc/inittab, which contains shell scripts to start up a variety of daemons (see /etc/rc.d)

– as such, init becomes the ancestor of all other processes (except the idle loop)

– from now on, the kernel runs only in one of two modes:
∗ executing a system call on behalf of a user process
∗ handling some asynchronous event, such as an interrupt

119

Page 120: LINUX INTERNALS - cse.msu.edu

Unix Shells

• A getty process is usually set up by init to wait on each tty line.

• Getty gets the user login id and spawns login, which gets and checks the password, then spawns whatever shell is indicated in the password file.

• A number of shells have evolved over the years: sh, csh, ksh, tcsh, ...

120

Page 121: LINUX INTERNALS - cse.msu.edu

Unix Shells (continued)

• Shell operation

– shell is one large loop

– each time through, shell reads command line

– interprets line according to standard set of rules

– built-in commands (cd, for, while, etc) executed internally

– else, assumes command is the name of an executable file, forks, then execs the new program (see the sketch below)

• After exec'ing the command

– by default the shell executes a wait for the child to die, then goes to the top of the loop

– if the command was executed in background mode, goes to the top of the loop
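A toy sketch of this read/fork/exec/wait loop (no parsing, pipes, or built-ins; one whitespace-free command per line):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    char line[256];

    for (;;) {                                     /* the shell is one large loop */
        printf("tinysh> ");
        if (fgets(line, sizeof line, stdin) == NULL)
            break;                                 /* EOF: exit the shell */
        line[strcspn(line, "\n")] = '\0';
        if (line[0] == '\0')
            continue;

        pid_t pid = fork();                        /* create a process for the command */
        if (pid == 0) {
            execlp(line, line, (char *)NULL);      /* overlay child with the program */
            perror("exec");
            _exit(127);
        }
        waitpid(pid, NULL, 0);                     /* foreground: wait, then re-prompt */
    }
    return 0;
}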

121

Page 122: LINUX INTERNALS - cse.msu.edu

Unix I/O Subsystem

• Components

– general device driver code

– drivers for specific hardware devices

– buffer cache

• Device access

– drivers are accessed through device files (/dev/...)

– device numbers (major and minor) specify device

122

Page 123: LINUX INTERNALS - cse.msu.edu

Unix I/O Subsystem (continued)

• Three kinds of I/O

– block devices (disks and tapes) - usually accessed via filesystem instead of directly

– character devices (terminals, printers,...)

– network devices

123

Page 124: LINUX INTERNALS - cse.msu.edu

Unix I/O Subsystem (continued)

• Buffer cache

– keep disk blocks in memory for future use (read and write)

– dirty buffers written periodically to secondary storage using an elevator algorithm

124

Page 125: LINUX INTERNALS - cse.msu.edu

Linux Device Drivers

• Every physical device in the system has its own hardware controller

– SuperIO chip for keyboard

– IDE controller for IDE disks

– SCSI controller for SCSI disks...

• Each hardware controller has its own control and status registers, used by the device driver to interact with the device

125

Page 126: LINUX INTERNALS - cse.msu.edu

Linux Device Drivers

• Device files

– one of the most elegant features of Unix

– make hardware devices look like files, to be opened, closed, read, written

– e.g., /dev/tty, /dev/hda

– created with the mknod command or by Linux upon initialization

• Major and minor device numbers

– all devices controlled by the same driver have the same major device number

– minor device numbers distinguish different (physical or logical) devices

– e.g., disk partitions, ttys

126

Page 127: LINUX INTERNALS - cse.msu.edu

Features of Linux Drivers

• Kernel code - even though drivers are often added to the system for new devices, by third parties, they are kernel code and, if buggy, can easily crash the system or worse.

• Kernel interfaces - must provide a standard interface to the Linux kernel or subsystem (file I/O interface, SCSI interface, etc.)

• Kernel mechanisms - make use of standard kernel services, such as wait queues

• Most drivers can be configured as modules, so they are demand loadable as well as boot configurable. If a driver is present but the hardware is not, no problem.

• Drivers may use DMA for data transfers between an adapter card and main memory

127

Page 128: LINUX INTERNALS - cse.msu.edu

DMA

• Data transfer between devices and memory

• Only a small number of DMA channels are available

• DMA channels cannot be shared between devices

• Limited addressing capabilities (24 bits)

• A device registers with the kernel for a DMA channel

128

Page 129: LINUX INTERNALS - cse.msu.edu

Polling vs. Interrupts

• Polling device drivers

– don’t use true polling

– instead, a timer interrupt routine checks the status of the command and indicates to the driver when it is complete

– floppy driver (has been) implemented this way

• Interrupts

– more efficient and responsive than polling

– device driver needs to register its usage of the interrupt with the kernel

129

Page 130: LINUX INTERNALS - cse.msu.edu

Polling vs. Interrupts

• /proc/interrupts indicates which drivers use which interrupts

0: 727432 timer

1: 20534 keyboard

2: 0 cascade

...

• Drivers should do as little as possible in the interrupt handling routine, deferring non-time-critical work to a "bottom half" handler (called when the scheduler runs)

130

Page 131: LINUX INTERNALS - cse.msu.edu

The Kernel and Character Devices

• Simplest of Linux’s devices

– accessed as files

– standard open, close, read, write calls used

• Initialization

– device driver registers itself by adding an entry to the chrdevs vector (see the registration sketch below)

– major device number is the index into this vector

– each entry contains a pointer to the name of the driver and a pointer to its set of file operations
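A sketch of how a simple character driver might register itself in the 2.x-era kernels described here. The device name "mydev", the major number, and the handlers are hypothetical, and the exact layout of struct file_operations and the prototype of register_chrdev() varied between kernel versions.

#include <linux/fs.h>      /* struct file_operations, register_chrdev() */
#include <linux/errno.h>

/* Hypothetical handlers for the device. */
static int mydev_open(struct inode *inode, struct file *file)  { return 0; }
static int mydev_read(struct inode *inode, struct file *file,
                      char *buf, int count)                    { return 0; }

/* Only the operations we implement are filled in; the rest stay NULL
 * (positional initialization, as 2.0-era drivers did it). */
static struct file_operations mydev_fops = {
        NULL,           /* lseek   */
        mydev_read,     /* read    */
        NULL,           /* write   */
        NULL,           /* readdir */
        NULL,           /* select  */
        NULL,           /* ioctl   */
        NULL,           /* mmap    */
        mydev_open,     /* open    */
        NULL,           /* release */
};

#define MYDEV_MAJOR 60   /* hypothetical major number */

int mydev_init(void)
{
        /* adds an entry for MYDEV_MAJOR to the chrdevs vector */
        if (register_chrdev(MYDEV_MAJOR, "mydev", &mydev_fops) < 0)
                return -EIO;
        return 0;
}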

131

Page 132: LINUX INTERNALS - cse.msu.edu

Kernel and Character Devices (continued)

• File operations

– inode for a character device file points only to the open operation

– upon open, other ops retrieved from the chrdevs entry and placed in the open file structure

132

Page 133: LINUX INTERNALS - cse.msu.edu

The Kernel and Block Devices

• blkdevs table plays similar role as chrdevs

• There are classes of block devices (SCSI, IDE)

– class registers with the kernel and provides file operations

– driver provides interfaces to the appropriate subsystem (e.g., SCSI), which the subsystem uses when providing interfaces to the kernel

• blk_dev vector

– each block device driver must also provide an interface to the buffer cache (address of request routine and pointer to list of requests, each containing a pointer to one or more buffer_head structures)

133

Page 134: LINUX INTERNALS - cse.msu.edu

Kernel and Block Devices (continued)

• when the buffer cache wishes to read/write a block of data, it places a request on the appropriate list, and the request function is called

• after a request is completed, each buffer_head is unlocked, waking up any waiting processes

• Details of device drivers for specific hard disk types can be found in Rusling, Ch. 8.

134

Page 135: LINUX INTERNALS - cse.msu.edu

Network Devices

• Typically an adapter card, but could be software only, such as the loopback device

• All packets are represented by sk_buff structures, which allow headers to be easily added and removed (see include/linux/skbuff.h)

• Device files

– created at boot time as network devices are discovered and initialized

– names are standard; multiple devices of the same type are numbered starting at 0

– e.g., /dev/eth0, /dev/eth1, ...

135

Page 136: LINUX INTERNALS - cse.msu.edu

Device Data Structure Contents

• See include/linux/netdevice.h

• Name, as discussed earlier

• Bus information

– interrupt, base address in I/O memory, DMA channel being used

• Interface flags - characteristics/abilities of device

136

Page 137: LINUX INTERNALS - cse.msu.edu

Device Data Structure Contents (continued)

• Protocol information

– MTU for this interface

– family (AF_INET for all Linux network devices)

– type (Ethernet, X.25, SLIP, PPP, ...)

– addresses (hw address, IP address, ...)

• Packet queue - queue of sk buff packets queued for transmission

• Support functions

– setup and frame routines

– statistics routines

– etc.

137

Page 138: LINUX INTERNALS - cse.msu.edu

Interface Flags

#define IFF_UP 0x1 /* interface is up */

#define IFF_BROADCAST 0x2 /* broadcast address valid */

#define IFF_DEBUG 0x4 /* turn on debugging */

#define IFF_LOOPBACK 0x8 /* is a loopback net */

#define IFF_POINTOPOINT 0x10 /* interface is has p-p link */

#define IFF_NOTRAILERS 0x20 /* avoid use of trailers */

#define IFF_RUNNING 0x40 /* resources allocated */

#define IFF_NOARP 0x80 /* no ARP protocol */

#define IFF_PROMISC 0x100 /* receive all packets */

#define IFF_ALLMULTI 0x200 /* receive all multicast packets */

#define IFF_MASTER 0x400 /* master of a load balancer */

#define IFF_SLAVE 0x800 /* slave of a load balancer */

#define IFF_MULTICAST 0x1000 /* Supports multicast */

138

Page 139: LINUX INTERNALS - cse.msu.edu

Linux Virtual File System

• Linux actually supports many file systems (simultaneously)

– ext, ext2, xia, minix, msdos, vfat, proc, ...

– an extremely powerful and useful feature

• Virtual file system

– supplies applications with system calls for file management

– hides details of individual file systems from user programs

139

Page 140: LINUX INTERNALS - cse.msu.edu

Unix File Systems - Review

• Two main objects, files and directories

– directories are just files with a special format

– files are made up of data blocks on disk

140

Page 141: LINUX INTERNALS - cse.msu.edu

Unix File Systems - Review (continued)

• A file is represented by an inode.

– resides on disk, copies in memory

– defines ownership, permissions, status

– type: plain, directory, symbolic link, character device, block device, socket

• Mapping path names to inodes

– responsibility of kernel

– follow path and read inodes

– if directory, read to get inodes

141

Page 142: LINUX INTERNALS - cse.msu.edu

Unix File Systems - Review (continued)

• Files opened by applications

– file descriptor points into table of open files

– entry there points to file-structure table entry

– entry there points to in-core copy of inode

142

Page 143: LINUX INTERNALS - cse.msu.edu

Unix File System Structure

• File system data structures maintained by the kernel

– each process has its own file descriptor table, which identifies all open files for a process

– a system-wide file table keeps track of the byte offset in the file where the user's next read or write will start, and the access rights allowed to the opening process

– a system-wide inode table

143

Page 144: LINUX INTERNALS - cse.msu.edu

Unix File System Structure (continued)

• Traditional Unix file system layout

– boot block: bootstrap code (typically the first sector)

– superblock: describes the state of a file system - how large it is, how many files it can store, where to find free space, etc.

– inode list: includes the root inode

– data blocks: each block can only belong to one file

144

Page 145: LINUX INTERNALS - cse.msu.edu

Traditional Unix FS Operation

• Contents of superblock

– size of the file system

– number of free blocks in file system

– list of free blocks in file system

– index of next free block in the free block list

– size of inode list

– number of free inodes in file system

– list of free inodes in file system

– index of next free inode in the free inode list

145

Page 146: LINUX INTERNALS - cse.msu.edu

Traditional Unix FS Operation (continued)

• Linear list of inodes follows super block

– each inode has a type field: 0 (available), 1 (used)

– linear search for a free one would be expensive (many disk accesses)

146

Page 147: LINUX INTERNALS - cse.msu.edu

Traditional Unix FS Operation

• Superblock holds a list of free inodes (free inode list), and the kernel keeps track of a search point (in the real inode list) where it should begin looking next time it needs to fill up the free inode list.

• When the free inode list is empty, the kernel will bring more free inodes to the superblock; it will start looking at the remembered inode

• Freeing an inode

– increment the total number of available inodes

– place in superblock free inode list if there is a slot available

• Access of superblock must be a critical section.

• Managing regular disk blocks is similar to that for inodes, except that an explicit list of free disk blocks is maintained.

147

Page 148: LINUX INTERNALS - cse.msu.edu

Performance Problems

• For the Unix file management system described, the effective data transfer rate is less than the disk bandwidth

• Major factors that affect the file system performance

– allocation of blocks of a file: the free list of blocks gets scrambled as files are created and removed; eventually the free list becomes random, causing files to have their blocks allocated randomly

– overallocation of files: files in the same directory are typically not allocated consecutive slots in the inode list (need many non-consecutive blocks of inodes to be accessed)

– inodes are not usually near their respective files

EXT2 File System

• History

– the Minix file system was the one originally supported, but it was very restrictive (14-character filenames and a 64 MB maximum file system size)

– EXT was introduced in 1992 specifically for Linux (the VFS was added at the same time), but its performance was poor

– EXT2 added in 1993

• EXT2

– design heavily influenced by BSD Fast File System (FFS)

– logical partitions are divided into block groups

– superblock is replicated on each block group, for fault tolerance

EXT2 Inode

• A fairly traditional Unix inode (a simplified C sketch follows the list)

– type (file, directory, symbolic link, block device, char device, FIFO)

– ownership

– permissions (owner, group, others)

– size

– timestamps (inode creation and modification times)

– pointers to data blocks
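
• A simplified C sketch of such an inode (field names are illustrative and the layout abridged; see struct ext2_inode in the kernel headers for the real thing):

    #define EXT2_N_BLOCKS 15                /* 12 direct + 1 indirect + 1 double + 1 triple */

    struct ext2_inode_sketch {
        unsigned short i_mode;              /* type (regular, dir, symlink, ...) and permissions */
        unsigned short i_uid, i_gid;        /* ownership                                         */
        unsigned int   i_size;              /* size in bytes                                     */
        unsigned int   i_ctime, i_mtime;    /* timestamps                                        */
        unsigned short i_links_count;       /* directory entries referring to this inode         */
        unsigned int   i_block[EXT2_N_BLOCKS];  /* pointers to data blocks                       */
    };

• The first 12 entries of i_block point directly at data blocks; the remaining three are single, double and triple indirect blocks, which is how a small fixed-size inode can describe very large files.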

EXT2 Superblock

• Read into memory when file system is mounted

• Each block group contains a duplicate in case of corruption

• Contents (an abridged C sketch follows the list)

– magic number (0xEF53) indicates superblock of ext2 fs

– revision level

– mount count and maximum mount count.

– block group number - which group holds this copy

– block size (e.g. 1024 bytes)

– blocks per group

– number of free blocks in fs

– number of free inodes in fs

– inode number of first inode in fs (directory entry for ’/’)
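
• The same contents as an abridged C structure (illustrative; the kernel’s struct ext2_super_block carries many more fields):

    struct ext2_super_block_sketch {
        unsigned int   s_inodes_count;      /* total inodes in the file system   */
        unsigned int   s_blocks_count;      /* total blocks in the file system   */
        unsigned int   s_free_blocks_count; /* free blocks                       */
        unsigned int   s_free_inodes_count; /* free inodes                       */
        unsigned int   s_log_block_size;    /* block size = 1024 << this value   */
        unsigned int   s_blocks_per_group;  /* blocks in each block group        */
        unsigned int   s_inodes_per_group;  /* inodes in each block group        */
        unsigned short s_mnt_count;         /* mounts since the last check       */
        unsigned short s_max_mnt_count;     /* force a check after this many     */
        unsigned short s_magic;             /* 0xEF53                            */
        unsigned int   s_rev_level;         /* revision level                    */
        unsigned short s_block_group_nr;    /* block group holding this copy     */
    };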

Group Descriptor

• Each group descriptor describes one block group, but the full table of group descriptors is replicated in each block group

• Contents (sketched as a C structure after this list)

– blocks bitmap for allocation and deallocation

– inode bitmap for inode alloc and dealloc

– first block of inode table

– free block count

– free inode count

– used directory count
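
• As a C sketch (close to, but not copied verbatim from, the kernel’s struct ext2_group_desc):

    struct ext2_group_desc_sketch {
        unsigned int   bg_block_bitmap;      /* block number of the block bitmap    */
        unsigned int   bg_inode_bitmap;      /* block number of the inode bitmap    */
        unsigned int   bg_inode_table;       /* first block of the inode table      */
        unsigned short bg_free_blocks_count; /* free blocks in this group           */
        unsigned short bg_free_inodes_count; /* free inodes in this group           */
        unsigned short bg_used_dirs_count;   /* directories allocated in this group */
    };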

EXT2 Directories

• Definition: a list of directory entries, each containing

– inode (index into inode table of the block group)

– entry length

– name length

– name [255]

– the first two entries are always "." and ".." (a C sketch of one entry follows)
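
• A sketch of one on-disk entry (patterned after ext2_dir_entry_2 but simplified):

    struct ext2_dir_entry_sketch {
        unsigned int   inode;       /* inode number of the entry (0 = unused)      */
        unsigned short rec_len;     /* bytes to the next entry (may include slack) */
        unsigned char  name_len;    /* length of the name actually stored          */
        unsigned char  file_type;   /* regular file, directory, symlink, ...       */
        char           name[255];   /* not NUL-terminated on disk; use name_len    */
    };

• Entries are variable length: rec_len says where the next entry begins, so deleting a file can be as simple as folding its space into the previous entry’s rec_len.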

EXT2 Directories (continued)

• When a file’s size is increased, EXT2 tries to allocate the new blocks physically close to the file’s current data blocks, or at least in the same block group

• Also, whenever it needs a new block for a file, it looks for a run of 8 free blocks if it can find one (preallocation)

Finding Files in EXT2

• A path name is a series of directory names separated by forward slashes, ending in the file’s own name.

• The filename itself can be any length and consist of any printable characters

• Linux parses the path name one directory at a time until it finds the inode it wants

– first, get inode of root directory (given in superblock)

– read its contents (every component but the last must be a directory)

– get inode of next path element, and repeat

• Exercise: Find the code in Linux that parses path names to find the inode of the file being referenced (a much-simplified sketch of the loop follows).
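
• A much-simplified sketch of the loop (the real code lives in fs/namei.c and is far more involved; next_component(), lookup_in_dir() and ROOT_INO are hypothetical stand-ins):

    #define ROOT_INO 2                         /* root directory inode, found via the superblock */

    const char *next_component(const char *path, char *out);              /* split "/usr/src/..." */
    unsigned long lookup_in_dir(unsigned long dir_ino, const char *name); /* scan one directory   */

    /* resolve a path such as "/usr/src/linux" to an inode number, one component at a time */
    unsigned long path_to_inode(const char *path)
    {
        unsigned long ino = ROOT_INO;
        char component[256];

        while ((path = next_component(path, component)) != NULL) {
            /* every component except the last must be a directory */
            ino = lookup_in_dir(ino, component);
            if (ino == 0)
                return 0;                      /* no such file or directory */
        }
        return ino;
    }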

Changing File Size

• EXT2 tries to allocate new blocks for file in same Block Groupas its current blocks (and inode)

• When a write will go past last allocated block

– Linux suspends process

– locks EXT2 Superblock for this file system

– checks to make sure there are at least some free blocks left

– allocates block

Changing File Size (continued)

• Block allocation sequence (compressed into a code sketch after this list)

– blocks may have been preallocated (prealloc_block and prealloc_count in the in-core inode); if so, grab one and update those fields

– else, see if next block after last in file is available (ideal)

– else, look for a block within 64 blocks of the ideal one (same block group)

– else, look in other block groups (looking for sets of 8 blocks, if available)

• When a free block is found, update the block allocation bitmap and allocate a buffer for the new block in the buffer cache (zeroed and marked dirty)

• Superblock is marked as dirty and is unlocked
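
• The same sequence compressed into an illustrative sketch (block_in_use() and find_free_in_group() are hypothetical stand-ins for the bitmap scans the kernel actually performs):

    struct alloc_inode_sketch { long prealloc_block; int prealloc_count; };

    int  block_in_use(long block);             /* bitmap test, provided elsewhere      */
    long find_free_in_group(int group);        /* wider search, provided elsewhere     */

    /* choose a block for the next write, preferring locality to `last`,
     * the file's last allocated block, inside block group `group`          */
    long choose_block(struct alloc_inode_sketch *ino, long last, int group)
    {
        long b;

        if (ino->prealloc_count > 0) {         /* 1. use a preallocated block          */
            ino->prealloc_count--;
            return ino->prealloc_block++;
        }
        if (!block_in_use(last + 1))           /* 2. ideal: right after the last block */
            return last + 1;
        for (b = last + 2; b <= last + 64; b++)    /* 3. within 64 blocks of the ideal */
            if (!block_in_use(b))
                return b;
        return find_free_in_group(group);      /* 4. fall back to other block groups   */
    }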

Virtual File System

• Provides processes with transparent access to many types of local file systems and to remote file systems

• Maintains data structures that describe the whole (virtual) file system and the real, mounted file systems

• Maintains (VFS) superblocks and (VFS) inodes, just like real file systems, BUT these exist only in memory, not on disk, and are constructed based on information in their "real" counterparts

VFS Operation

• As each file system is initialized at boot time, it registers itself with the VFS

• A file system can be either built into the kernel or built as a loadable module.

• When a file system is mounted, VFS reads its superblock andmaps this information onto a VFS superblock structure.

• VFS keeps list of mounted file systems with their superblocks.

• Each VFS superblock contains pointers to routines that perform file-system-specific functions, e.g., reading an inode. The generic read-inode path (fs/inode.c) simply calls the operation registered in the superblock, which fills in a (VFS!) inode, as sketched below:
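
• An illustrative sketch of that indirection (patterned on the classic 2.x interface, where struct super_operations carried a read_inode hook; newer kernels structure this differently):

    struct inode;                                 /* opaque to the generic code here   */

    struct super_operations_sketch {
        void (*read_inode)(struct inode *);       /* fill a VFS inode from the real fs */
        void (*write_inode)(struct inode *);      /* flush it back                     */
    };

    struct super_block_sketch {
        struct super_operations_sketch *s_op;     /* set when the fs type registers    */
    };

    /* generic VFS code: it neither knows nor cares whether the inode lives on
     * EXT2, FAT or NFS -- it just calls through the function pointer           */
    void vfs_read_inode(struct super_block_sketch *sb, struct inode *inode)
    {
        sb->s_op->read_inode(inode);              /* e.g. ends up in ext2_read_inode() */
    }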

VFS Caches

• Three major caches are used by the VFS:

• Inode cache

– a hash table containing VFS inodes

– reading an inode will put it in the cache, and references keep it there

• Buffer cache

– cache of blocks from underlying file systems

– contains not only files but also (raw) inodes from various file systems

– shared by all file systems

– buffers are identified by block number and unique device identifier

VFS Caches (continued)

• Directory cache

– a hash table that stores the mapping between full directory names and their inode numbers (a toy lookup sketch follows)
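
• All three caches have the same flavour: hash the key, walk a short chain, reuse the cached object if present. A toy illustration for the directory cache (the real dcache works on per-component dentries, not whole strings like this):

    #include <string.h>

    #define DHASH_SIZE 64

    struct dcache_entry_sketch {
        char                        name[128];   /* full directory name         */
        unsigned long               ino;         /* inode number it resolved to */
        struct dcache_entry_sketch *next;        /* hash chain                  */
    };

    static struct dcache_entry_sketch *dhash[DHASH_SIZE];

    static unsigned int dhash_fn(const char *name)
    {
        unsigned int h = 0;
        while (*name)
            h = h * 31 + (unsigned char)*name++;
        return h % DHASH_SIZE;
    }

    /* return the cached inode number, or 0 if the real file system must be consulted */
    unsigned long dcache_lookup(const char *name)
    {
        struct dcache_entry_sketch *e;

        for (e = dhash[dhash_fn(name)]; e != NULL; e = e->next)
            if (strcmp(e->name, name) == 0)
                return e->ino;
        return 0;
    }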

VFS Superblock Contents

• Device identifier (e.g., /dev/hda1 is 0x301)

• Inode pointers

– mounted - points at first inode in this file system

– covered - points at the directory that got covered by the mount (the root fs does not have one)

• Blocksize for this fs

• Pointer to set of superblock routines (file-system-specific operations)

• File system type

• File system specific information (the whole set is sketched as a structure below)
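
• Summarised as a structure (an illustrative reduction; the kernel’s struct super_block is far larger):

    struct inode;                                /* VFS inode (see the next slide)      */
    struct super_operations_sketch;              /* the dispatch table sketched earlier */
    struct file_system_type;                     /* filled in at registration time      */

    struct vfs_superblock_sketch {
        unsigned short  s_dev;                   /* device id, e.g. 0x301 for /dev/hda1 */
        struct inode   *s_mounted;               /* first (root) inode of this fs       */
        struct inode   *s_covered;               /* directory covered by the mount;     */
                                                 /* NULL for the root fs                */
        unsigned long   s_blocksize;             /* block size for this fs              */
        struct super_operations_sketch *s_op;    /* fs-specific superblock routines     */
        struct file_system_type *s_type;         /* registered file system type         */
        void           *s_fs_info;               /* file-system-specific information    */
    };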

VFS Inode Contents

• Identifier of device holding file (or whatever inode represents)

• (Raw) inode number (unique within given file system)

• Mode - file type and permissions

• User and group ids

• Pointer to set of file system specific inode routines

• Count of number of current users of the inode

• Lock used when being read from file system (disk/buffer cache)

• Dirty flag - raw inode will have to be written

• File system specific information (again summarised as a structure below)
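
• The same list as an illustrative structure (again a reduction of the kernel’s struct inode):

    struct inode_operations_sketch;              /* fs-specific inode routines                    */

    struct vfs_inode_sketch {
        unsigned short  i_dev;                   /* device holding whatever this inode represents */
        unsigned long   i_ino;                   /* raw inode number, unique within that fs       */
        unsigned short  i_mode;                  /* file type and permissions                     */
        unsigned short  i_uid, i_gid;            /* user and group ids                            */
        struct inode_operations_sketch *i_op;    /* fs-specific operations                        */
        unsigned short  i_count;                 /* current users of this in-core inode           */
        unsigned char   i_lock;                  /* held while the raw inode is being read        */
        unsigned char   i_dirt;                  /* raw inode will have to be written back        */
        void           *i_fs_info;               /* file-system-specific information              */
    };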

File System Mounting

• Three pieces of info passed to kernel

– file system type

– physical block device containing file system

– where in existing file system to mount file system

• Steps taken by the kernel (sketched in code after this list)

– checks the registered file system types, gets the routine to read the superblock

– get (VFS) inode of mount point directory

– allocate VFS superblock and read the superblock

– fill in vfsmount structure
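
• A condensed sketch of that flow (all helper names are hypothetical; the real path runs through sys_mount()/do_mount() and carries far more error handling):

    struct inode;
    struct vfs_superblock_sketch;

    struct file_system_type_sketch {
        const char *name;                                             /* e.g. "ext2" */
        struct vfs_superblock_sketch *(*read_super)(const char *dev);
    };

    /* hypothetical helpers, assumed to exist elsewhere for this sketch */
    struct file_system_type_sketch *find_filesystem(const char *name);
    struct inode *namei(const char *path);
    void add_vfsmnt(struct vfs_superblock_sketch *sb, struct inode *mount_point);

    int do_mount_sketch(const char *fs_type, const char *dev, const char *mount_point)
    {
        struct file_system_type_sketch *type = find_filesystem(fs_type);
        if (type == NULL)
            return -1;                           /* file system type never registered */

        struct inode *dir = namei(mount_point);  /* VFS inode of the mount-point dir  */
        struct vfs_superblock_sketch *sb = type->read_super(dev);
        if (dir == NULL || sb == NULL)
            return -1;

        add_vfsmnt(sb, dir);                     /* record it in the VFS mount list   */
        return 0;
    }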

VFS Mount List

• List of all mounted file systems

• Each entry points to superblock, root inode, file system type

Buffer Cache

• As mounted file systems are used, they produce requests to the underlying device drivers to read blocks from various devices.

• These requests take the form of buffer_head data structures (sketched below), which contain all the information necessary to read the appropriate block

• All block devices are viewed as linear collections of blocks of the same size
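
• A reduced sketch of such a request descriptor (loosely modelled on the 2.x struct buffer_head):

    struct buffer_head_sketch {
        unsigned short  b_dev;                   /* device identifier, e.g. 0x301          */
        unsigned long   b_blocknr;               /* block number on that device            */
        unsigned long   b_size;                  /* block size in bytes                    */
        char           *b_data;                  /* the data itself, once read             */
        unsigned char   b_uptodate;              /* contents are valid                     */
        unsigned char   b_dirt;                  /* contents must be written back          */
        unsigned short  b_count;                 /* current users of this buffer           */
        struct buffer_head_sketch *b_next;       /* hash chain keyed on (b_dev, b_blocknr) */
    };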

Buffer Cache (continued)

• Buffer cache shared among all devices and real file systems

– lists of free buffers of different sizes (512, 1024, 2048, ...)

– hash table that contains used buffers, with the index generated from the device identifier and block number

• the bdflush (kflushd) kernel daemon is awakened when the number of dirty buffers grows too large (60% of the total!) or when there are not enough free buffers; it then writes a batch of dirty buffers (e.g., up to 500) out to disk

The /proc File System

• Example of the power of the Linux VFS

• Neither /proc nor its subdirectories actually exist on disk

• /proc registers itself with VFS

• When its files are opened and read, /proc routines generate their contents on the fly from information held in the kernel

• /proc is a user-readable window into the kernel’s inner workings (a minimal example module follows)
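
• A minimal module in that spirit (uses the seq_file/proc_create interface of roughly the 3.x/4.x kernels; newer kernels replace struct file_operations here with struct proc_ops, so treat this as an illustrative sketch):

    #include <linux/module.h>
    #include <linux/proc_fs.h>
    #include <linux/seq_file.h>
    #include <linux/jiffies.h>

    /* the "file" contents are generated at read time -- nothing lives on disk */
    static int demo_show(struct seq_file *m, void *v)
    {
        seq_printf(m, "jiffies right now: %lu\n", jiffies);
        return 0;
    }

    static int demo_open(struct inode *inode, struct file *file)
    {
        return single_open(file, demo_show, NULL);
    }

    static const struct file_operations demo_fops = {
        .owner   = THIS_MODULE,
        .open    = demo_open,
        .read    = seq_read,
        .llseek  = seq_lseek,
        .release = single_release,
    };

    static int __init demo_init(void)
    {
        proc_create("internals_demo", 0444, NULL, &demo_fops);   /* shows up as /proc/internals_demo */
        return 0;
    }

    static void __exit demo_exit(void)
    {
        remove_proc_entry("internals_demo", NULL);
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");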
