CS162 Operating Systems and Systems Programming
Midterm Review
March 18, 2013
CS162 ©UCB Fall 2013
Overview
• Synchronization, Critical Sections
• Deadlock
• CPU Scheduling
• Memory Multiplexing, Address Translation
• Caches, TLBs
• Page Allocation and Replacement
• Kernel/User Interface
Synchronization, Critical Sections
Definitions
• Synchronization: using atomic operations to ensure cooperation between threads
• Mutual Exclusion: ensuring that only one thread does a particular thing at a time
– One thread excludes the other while doing its task
• Critical Section: piece of code that only one thread can execute at once
– Critical section is the result of mutual exclusion
– Critical section and mutual exclusion are two ways of describing the same thing
Locks: using interrupts
• Key idea: maintain a lock variable and impose mutual exclusion only during operations on that variable

int value = FREE;

Acquire() {
  disable interrupts;
  if (value == BUSY) {
    put thread on wait queue;
    go to sleep();  // Enable interrupts?
  } else {
    value = BUSY;
  }
  enable interrupts;
}

Release() {
  disable interrupts;
  if (anyone on wait queue) {
    take thread off wait queue;
    place on ready queue;
  } else {
    value = FREE;
  }
  enable interrupts;
}
Better locks: using test&set
• test&set(&address) {  /* most architectures */
    result = M[address];
    M[address] = 1;
    return result;
  }

int guard = 0;
int value = FREE;

Acquire() {
  // Short busy-wait time
  while (test&set(guard));
  if (value == BUSY) {
    put thread on wait queue;
    go to sleep() & guard = 0;
  } else {
    value = BUSY;
    guard = 0;
  }
}

Release() {
  // Short busy-wait time
  while (test&set(guard));
  if (anyone on wait queue) {
    take thread off wait queue;
    place on ready queue;
  } else {
    value = FREE;
  }
  guard = 0;
}
• Does this busy wait? If so, how long does it wait?
• Why is this better than the version that disables interrupts?
• Remember that a thread can be context-switched while holding a lock.
• What implications does this have on the performance of the system?
Semaphores
• Definition: a semaphore has a non-negative integer value and supports the following two operations:
– P(): an atomic operation that waits for the semaphore to become positive, then decrements it by 1
  » Think of this as the wait() operation
– V(): an atomic operation that increments the semaphore by 1, waking up a waiting P, if any
  » Think of this as the signal() operation
– You cannot read the integer value! The only methods are P() and V().
– Unlike a lock, there is no owner (thus no priority donation). The thread that calls V() may never call P(). Example: producer/consumer
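As a concrete sketch of the producer/consumer pattern, Python's threading module provides counting semaphores: P() corresponds to acquire() and V() to release(). The buffer size and all names below are illustrative, not from the slides.

```python
import threading
from collections import deque

# Bounded producer/consumer sketch using counting semaphores.
# 'empty' counts free slots, 'full' counts filled slots; 'mutex' guards the buffer.
buffer = deque()
CAPACITY = 2
empty = threading.Semaphore(CAPACITY)  # P(empty) blocks when the buffer is full
full = threading.Semaphore(0)          # P(full) blocks when the buffer is empty
mutex = threading.Lock()

def produce(item):
    empty.acquire()        # P(empty): wait for a free slot
    with mutex:
        buffer.append(item)
    full.release()         # V(full): wake one waiting consumer, if any

def consume():
    full.acquire()         # P(full): wait for an item
    with mutex:
        item = buffer.popleft()
    empty.release()        # V(empty): wake one waiting producer, if any
    return item
```

Note that no thread ever "owns" either semaphore: producers only V(full) and consumers only V(empty), matching the point above.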
Condition Variables
• Condition Variable: a queue of threads waiting for something inside a critical section
– Key idea: allow sleeping inside critical section by atomically releasing lock at time we go to sleep
– Contrast to semaphores: Can’t wait inside critical section
• Operations:
– Wait(&lock): Atomically release lock and go to sleep. Re-acquire lock later, before returning.
– Signal(): Wake up one waiter, if any
– Broadcast(): Wake up all waiters
• Rule: Must hold lock when doing condition variable ops!
Complete Monitor Example (with condition variable)
• Here is an (infinite) synchronized queue:

Lock lock;
Condition dataready;
Queue queue;

AddToQueue(item) {
  lock.Acquire();           // Get Lock
  queue.enqueue(item);      // Add item
  dataready.signal();       // Signal any waiters
  lock.Release();           // Release Lock
}

RemoveFromQueue() {
  lock.Acquire();           // Get Lock
  while (queue.isEmpty()) {
    dataready.wait(&lock);  // If nothing, sleep
  }
  item = queue.dequeue();   // Get next item
  lock.Release();           // Release Lock
  return(item);
}
Mesa vs. Hoare monitors
• Need to be careful about the precise definition of signal and wait. Consider a piece of our dequeue code:

while (queue.isEmpty()) {
  dataready.wait(&lock);  // If nothing, sleep
}
item = queue.dequeue();   // Get next item

– Why didn’t we do this?

if (queue.isEmpty()) {
  dataready.wait(&lock);  // If nothing, sleep
}
item = queue.dequeue();   // Get next item

• Answer: depends on the type of scheduling
– Hoare-style (most textbooks):
  » Signaler gives lock, CPU to waiter; waiter runs immediately
  » Waiter gives up lock, processor back to signaler when it exits critical section or if it waits again
– Mesa-style (most real operating systems):
  » Signaler keeps lock and processor
  » Waiter placed on ready queue with no special priority
  » Practically, need to check condition again after wait
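Python's threading.Condition has Mesa semantics: a signaled waiter merely becomes runnable, so the condition must be rechecked in a while loop after wait() returns. A minimal sketch of the synchronized queue with illustrative names:

```python
import threading
from collections import deque

# Mesa-style monitor sketch: lock + condition variable sharing that lock.
lock = threading.Lock()
dataready = threading.Condition(lock)
queue = deque()

def add_to_queue(item):
    with lock:
        queue.append(item)
        dataready.notify()       # wake one waiter, if any (signaler keeps the lock)

def remove_from_queue():
    with lock:
        while not queue:         # Mesa semantics: recheck after every wakeup
            dataready.wait()     # atomically releases lock, re-acquires before returning
        return queue.popleft()
```

An `if` instead of the `while` would break here for exactly the reason above: another thread may empty the queue between the notify and the waiter actually running.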
Reader/Writer
• Two types of lock: read lock and write lock (i.e. shared lock and exclusive lock)
• Shared locks are an example where you might want to do broadcast/wakeAll on a condition variable. If you do it in other situations, correctness is generally preserved, but it is inefficient: one thread gets in, and then the rest of the woken threads go back to sleep.
Read/Writer

Reader() {
  // check into system
  lock.Acquire();
  while ((AW + WW) > 0) {
    WR++;
    okToRead.wait(&lock);
    WR--;
  }
  AR++;
  lock.Release();

  // read-only access
  AccessDbase(ReadOnly);

  // check out of system
  lock.Acquire();
  AR--;
  if (AR == 0 && WW > 0)
    okToWrite.signal();    // What if we remove this line?
  lock.Release();
}

Writer() {
  // check into system
  lock.Acquire();
  while ((AW + AR) > 0) {
    WW++;
    okToWrite.wait(&lock);
    WW--;
  }
  AW++;
  lock.Release();

  // read/write access
  AccessDbase(ReadWrite);

  // check out of system
  lock.Acquire();
  AW--;
  if (WW > 0) {
    okToWrite.signal();
  } else if (WR > 0) {
    okToRead.broadcast();
  }
  lock.Release();
}
Read/Writer Revisited

Reader() {
  // check into system
  lock.Acquire();
  while ((AW + WW) > 0) {
    WR++;
    okToRead.wait(&lock);
    WR--;
  }
  AR++;
  lock.Release();

  // read-only access
  AccessDbase(ReadOnly);

  // check out of system
  lock.Acquire();
  AR--;
  if (AR == 0 && WW > 0)
    okToWrite.broadcast();  // What if we turn signal to broadcast?
  lock.Release();
}

Writer() {
  // check into system
  lock.Acquire();
  while ((AW + AR) > 0) {
    WW++;
    okToWrite.wait(&lock);
    WW--;
  }
  AW++;
  lock.Release();

  // read/write access
  AccessDbase(ReadWrite);

  // check out of system
  lock.Acquire();
  AW--;
  if (WW > 0) {
    okToWrite.signal();
  } else if (WR > 0) {
    okToRead.broadcast();
  }
  lock.Release();
}
Read/Writer Revisited

Reader() {
  // check into system
  lock.Acquire();
  while ((AW + WW) > 0) {
    WR++;
    okContinue.wait(&lock);
    WR--;
  }
  AR++;
  lock.Release();

  // read-only access
  AccessDbase(ReadOnly);

  // check out of system
  lock.Acquire();
  AR--;
  if (AR == 0 && WW > 0)
    okContinue.signal();
  lock.Release();
}

Writer() {
  // check into system
  lock.Acquire();
  while ((AW + AR) > 0) {
    WW++;
    okContinue.wait(&lock);
    WW--;
  }
  AW++;
  lock.Release();

  // read/write access
  AccessDbase(ReadWrite);

  // check out of system
  lock.Acquire();
  AW--;
  if (WW > 0) {
    okToWrite.signal();
  } else if (WR > 0) {
    okContinue.broadcast();
  }
  lock.Release();
}

What if we turn okToWrite and okToRead into okContinue?
Read/Writer Revisited
(Same code as the previous slide, with all waits on okContinue.)
• R1 arrives
• W1, R2 arrive while R1 reads
• R1 signals R2
Read/Writer Revisited

Reader() {
  // check into system
  lock.Acquire();
  while ((AW + WW) > 0) {
    WR++;
    okContinue.wait(&lock);
    WR--;
  }
  AR++;
  lock.Release();

  // read-only access
  AccessDbase(ReadOnly);

  // check out of system
  lock.Acquire();
  AR--;
  if (AR == 0 && WW > 0)
    okContinue.broadcast();  // Need to change to broadcast! Why?
  lock.Release();
}

Writer() {
  // check into system
  lock.Acquire();
  while ((AW + AR) > 0) {
    WW++;
    okContinue.wait(&lock);
    WW--;
  }
  AW++;
  lock.Release();

  // read/write access
  AccessDbase(ReadWrite);

  // check out of system
  lock.Acquire();
  AW--;
  if (WW > 0) {
    okToWrite.signal();
  } else if (WR > 0) {
    okContinue.broadcast();
  }
  lock.Release();
}
Example Synchronization Problems
• Fall 2012 Midterm Exam #3
• Spring 2013 Midterm Exam #2
• Hard Synchronization Problems from Past Exams
– http://inst.eecs.berkeley.edu/~cs162/fa13/hand-outs/synch-problems.html
Deadlock
Four requirements for Deadlock
• Mutual exclusion
– Only one thread at a time can use a resource
• Hold and wait
– Thread holding at least one resource is waiting to acquire additional resources held by other threads
• No preemption
– Resources are released only voluntarily by the thread holding the resource, after the thread is finished with it
• Circular wait
– There exists a set {T1, …, Tn} of waiting threads
  » T1 is waiting for a resource that is held by T2
  » T2 is waiting for a resource that is held by T3
  » …
  » Tn is waiting for a resource that is held by T1
Resource Allocation Graph Examples
[Figures: a simple resource allocation graph; an allocation graph with deadlock; an allocation graph with a cycle, but no deadlock]
• Recall:
– request edge: directed edge Ti → Rj
– assignment edge: directed edge Rj → Ti
Deadlock Detection Algorithm
• Only one of each type of resource ⇒ look for loops
• More general deadlock detection algorithm:
– Let [X] represent an m-ary vector of non-negative integers (quantities of resources of each type):
  [FreeResources]: current free resources of each type
  [RequestX]: current requests from thread X
  [AllocX]: current resources held by thread X
– See if tasks can eventually terminate on their own:

  [Avail] = [FreeResources]
  Add all nodes to UNFINISHED
  do {
    done = true
    Foreach node in UNFINISHED {
      if ([Requestnode] <= [Avail]) {
        remove node from UNFINISHED
        [Avail] = [Avail] + [Allocnode]
        done = false
      }
    }
  } until(done)

– Nodes left in UNFINISHED ⇒ deadlocked
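The pseudocode above translates almost directly into Python. In this sketch, resource vectors are plain lists indexed by resource type, and the function/variable names are illustrative:

```python
# Deadlock detection sketch: repeatedly retire any thread whose request
# vector fits within the currently available resources, reclaiming its
# allocation; whatever remains in 'unfinished' is deadlocked.
def detect_deadlock(free, requests, allocs):
    avail = list(free)
    unfinished = set(requests)          # all thread names start unfinished
    done = False
    while not done:
        done = True
        for node in list(unfinished):
            if all(r <= a for r, a in zip(requests[node], avail)):
                unfinished.remove(node)  # node can finish; reclaim its resources
                avail = [a + h for a, h in zip(avail, allocs[node])]
                done = False
    return unfinished                    # set of deadlocked threads (empty if none)
```

For example, two threads each holding one resource and requesting the other's resource (with nothing free) are reported as deadlocked; freeing one unit breaks the cycle.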
Banker’s algorithm
• Method of deadlock avoidance.
• Process specifies max resources it will ever request. Process can initially request less than max and later ask for more. Before a request is granted, the Banker’s algorithm checks that the system is “safe” if the requested resources are granted.
• Roughly: some process could request its specified max and eventually finish; when it finishes, that process frees up its resources. A system is safe if there exists some sequence by which processes can request their max, terminate, and free up their resources so that all other processes can finish (in the same way).
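The safety check described above can be sketched as follows. This is the textbook algorithm, not code from the course; all names are illustrative, and each process's declared maximum and current allocation are per-resource-type lists:

```python
# Banker's safety check sketch: the system is "safe" if some ordering lets
# every process obtain its declared maximum, finish, and release everything.
def is_safe(avail, max_claim, alloc):
    work = list(avail)                  # resources currently available
    unfinished = set(max_claim)
    progress = True
    while unfinished and progress:
        progress = False
        for p in list(unfinished):
            # need = what p might still request up to its declared maximum
            need = [m - a for m, a in zip(max_claim[p], alloc[p])]
            if all(n <= w for n, w in zip(need, work)):
                # p could run to its max and terminate, freeing alloc[p]
                work = [w + a for w, a in zip(work, alloc[p])]
                unfinished.remove(p)
                progress = True
    return not unfinished               # safe iff every process can finish
```

To check whether a specific request may be granted, one would tentatively grant it and accept only if the resulting state still passes this test.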
Example Deadlock Problems
• Spring 2004 Midterm Exam #4
• Spring 2011 Midterm Exam #2
CPU Scheduling
Scheduling Assumptions
• CPU scheduling was a big area of research in the early 70’s
• Many implicit assumptions for CPU scheduling:
– One program per user
– One thread per program
– Programs are independent
• In general unrealistic, but they simplify the problem
– For instance: is “fair” about fairness among users or programs?
  » If I run one compilation job and you run five, you get five times as much CPU on many operating systems
• The high-level goal: dole out CPU time to optimize some desired parameters of the system
[Figure: timeline of CPU time multiplexed among USER1, USER2, and USER3]
Scheduling Metrics
• Waiting Time: time the job is waiting in the ready queue
– Time between job’s arrival in the ready queue and launching the job
• Service (Execution) Time: time the job is running
• Response (Completion) Time:
– Time between job’s arrival in the ready queue and job’s completion
– Response time is what the user sees:
  » Time to echo a keystroke in editor
  » Time to compile a program

  Response Time = Waiting Time + Service Time

• Throughput: number of jobs completed per unit of time
– Throughput related to response time, but not the same thing:
  » Minimizing response time will lead to more context switching than if you only maximized throughput
Scheduling Policy Goals/Criteria
• Minimize Response Time
– Minimize elapsed time to do an operation (or job)
• Maximize Throughput
– Two parts to maximizing throughput:
  » Minimize overhead (for example, context switching)
  » Efficient use of resources (CPU, disk, memory, etc.)
• Fairness
– Share CPU among users in some equitable way
– Fairness is not minimizing average response time:
  » Better average response time by making system less fair
– Tradeoff: fairness gained by hurting average response time!
First-Come, First-Served (FCFS) Scheduling
• First-Come, First-Served (FCFS)
– Also “First In, First Out” (FIFO) or “run until done”
  » In early systems, FCFS meant one program scheduled until done (including I/O)
  » Now, means keep CPU until thread blocks
• Example:
    Process  Burst Time
    P1       24
    P2       3
    P3       3
– Suppose processes arrive in the order: P1, P2, P3. The Gantt chart for the schedule is:

    0        24   27   30
    |   P1   | P2 | P3 |

– Waiting time for P1 = 0; P2 = 24; P3 = 27
– Average waiting time: (0 + 24 + 27)/3 = 17
– Average completion time: (24 + 27 + 30)/3 = 27
• Convoy effect: short process behind long process
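The arithmetic in the example can be checked with a short script (function name is illustrative; all jobs are assumed to arrive at t=0):

```python
# Recompute the FCFS example: bursts run to completion in arrival order.
def fcfs_times(bursts):
    waits, completions, t = [], [], 0
    for b in bursts:
        waits.append(t)        # waiting time = start time when all arrive at t=0
        t += b
        completions.append(t)  # completion time = finish time
    return waits, completions

waits, completions = fcfs_times([24, 3, 3])           # P1, P2, P3
avg_wait = sum(waits) / len(waits)                    # (0 + 24 + 27)/3 = 17
avg_completion = sum(completions) / len(completions)  # (24 + 27 + 30)/3 = 27
```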
Round Robin (RR)
• FCFS scheme: potentially bad for short jobs!
– Depends on submit order
– If you are first in line at the supermarket with milk, you don’t care who is behind you; on the other hand…
• Round Robin scheme
– Each process gets a small unit of CPU time (time quantum), usually 10-100 milliseconds
– After quantum expires, the process is preempted and added to the end of the ready queue
– n processes in ready queue and time quantum is q:
  » Each process gets 1/n of the CPU time
  » In chunks of at most q time units
  » No process waits more than (n-1)q time units
• Performance
– q large ⇒ FCFS
– q small ⇒ interleaved
– q must be large with respect to context switch time, otherwise overhead is too high (all overhead)
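A minimal round-robin sketch makes the "q large ⇒ FCFS" point concrete. This is an illustrative simulator, assuming all jobs arrive at t=0 and context switches are free:

```python
from collections import deque

# Round-robin sketch: each job runs at most q units, then returns to the
# back of the ready queue. Returns the completion time of each job.
def round_robin(bursts, q):
    ready = deque(enumerate(bursts))   # (job id, remaining time), arrival order
    t = 0
    completion = [0] * len(bursts)
    while ready:
        job, remaining = ready.popleft()
        run = min(q, remaining)
        t += run
        if remaining > run:
            ready.append((job, remaining - run))  # preempted: back of the queue
        else:
            completion[job] = t                   # job finished within its quantum
    return completion
```

With the FCFS example bursts [24, 3, 3] and q=4, the short jobs finish at t=7 and t=10 instead of waiting behind the 24-unit job; with a very large q the schedule degenerates to FCFS.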
What if we Knew the Future?
• Could we always mirror best FCFS?
• Shortest Job First (SJF):
– Run whatever job has the least amount of computation to do
• Shortest Remaining Time First (SRTF):
– Preemptive version of SJF: if a job arrives and has a shorter time to completion than the remaining time on the current job, immediately preempt CPU
• These can be applied either to a whole program or the current CPU burst of each program
– Idea is to get short jobs out of the system
– Big effect on short jobs, only small effect on long ones
– Result is better average response time
Discussion
• SJF/SRTF are the best you can do at minimizing average response time
– Provably optimal (SJF among non-preemptive, SRTF among preemptive)
– Since SRTF is always at least as good as SJF, focus on SRTF
• Comparison of SRTF with FCFS and RR
– What if all jobs are the same length?
  » SRTF becomes the same as FCFS (i.e., FCFS is the best you can do if all jobs are the same length)
– What if jobs have varying length?
  » SRTF (and RR): short jobs not stuck behind long ones
Summary
• Scheduling: selecting a process from the ready queue and allocating the CPU to it
• FCFS Scheduling:
– Run threads to completion in order of submission
– Pros: Simple (+)
– Cons: Short jobs get stuck behind long ones (-)
• Round-Robin Scheduling:
– Give each thread a small amount of CPU time when it executes; cycle between all ready threads
– Pros: Better for short jobs (+)
– Cons: Poor when jobs are the same length (-)
Summary (cont’d)
• Shortest Job First (SJF) / Shortest Remaining Time First (SRTF):
– Run whatever job has the least amount of computation to do / least remaining amount of computation to do
– Pros: Optimal (average response time)
– Cons: Hard to predict the future, unfair
• Multi-Level Feedback Scheduling:
– Multiple queues of different priorities
– Automatic promotion/demotion of process priority in order to approximate SJF/SRTF
• Lottery Scheduling:
– Give each thread a number of tokens (short tasks get more tokens)
– Reserve a minimum number of tokens for every thread to ensure forward progress/fairness
Example CPU Scheduling Problems
• Spring 2011 Midterm Exam #4
• Fall 2012 Midterm Exam #4
• Spring 2013 Midterm Exam #5
Memory Multiplexing, Address Translation
Important Aspects of Memory Multiplexing
• Controlled overlap:
– Processes should not collide in physical memory
– Conversely, would like the ability to share memory when desired (for communication)
• Protection:
– Prevent access to private memory of other processes
  » Different pages of memory can be given special behavior (read only, invisible to user programs, etc.)
  » Kernel data protected from user programs
  » Programs protected from themselves
• Translation:
– Ability to translate accesses from one address space (virtual) to a different one (physical)
– When translation exists, processor uses virtual addresses, physical memory uses physical addresses
– Side effects:
  » Can be used to avoid overlap
  » Can be used to give uniform view of memory to programs
Why Address Translation?
[Figure: two programs, each with its own virtual address space (code, data, heap, stack), are mapped by separate translation maps (Translation Map 1 and Translation Map 2) into one shared physical address space, alongside OS code, OS data, and the OS heap & stacks]
Addr. Translation: Segmentation vs. Paging
[Figure, segmentation: a virtual address is split into seg # and offset; the seg # indexes a table of (Base, Limit, Valid) entries; the offset is checked against the limit (too large raises an error) and added to the base to form the physical address.
Figure, paging: a virtual address is split into virtual page # and offset; the page table at PageTablePtr maps the page # to a physical page # with permission bits (V, R, W); a permission check can raise an access error; the physical page # concatenated with the offset forms the physical address]
Review: Address Segmentation
[Figure: the virtual memory view (code, data, heap, stack, addresses 0000 0000 through 1111 1111) is mapped into the physical memory view through the segment table below; a virtual address is split into seg # and offset]

Seg #   base        limit
00      0001 0000   10 0000
01      0101 0000   10 0000
10      0111 0000   1 1000
11      1011 0000   1 0000
Review: Address Segmentation
[Same figure and segment table as the previous slide]
What happens if the stack grows to 1110 0000?
Review: Address Segmentation
[Same figure and segment table as the previous slide]
No room to grow!! Buffer overflow error
Review: Paging
[Figure: the virtual memory view (code at 0000 0000, data at 0100 0000, heap at 1000 0000, stack up to 1111 1111) is mapped page by page into the physical memory view; a virtual address is split into page # and offset]

Page Table (VPN → PPN):
11111 → 11101    01111 → null
11110 → 11100    01110 → null
11101 → null     01101 → null
11100 → null     01100 → null
11011 → null     01011 → 01101
11010 → null     01010 → 01100
11001 → null     01001 → 01011
11000 → null     01000 → 01010
10111 → null     00111 → null
10110 → null     00110 → null
10101 → null     00101 → null
10100 → null     00100 → null
10011 → null     00011 → 00101
10010 → 10000    00010 → 00100
10001 → 01111    00001 → 00011
10000 → 01110    00000 → 00010
Review: Paging
[Same figure and page table as the previous slide]
What happens if the stack grows to 1110 0000?
Review: Paging
[Same figure as the previous slide, with two new stack pages allocated: entries 11101 → 10111 and 11100 → 10110 are now valid]
Allocate new pages where there is room!
Challenge: table size is equal to the # of pages in virtual memory!
Design a Virtual Memory system
Would you choose Segmentation or Paging?
Design a Virtual Memory system
How would you choose your Page Size?
Design a Virtual Memory system
How would you choose your Page Size?
• Page Table Size
• TLB Size
• Internal Fragmentation
• Disk Access
Design a Virtual Memory system
You have a 32-bit address space and 2 KB pages. In a single-level page table, how many bits would you use for the index and how many for the offset?
Design a Virtual Memory system
2 KB = 2^11 bytes ⇒ 11-bit offset
32-bit address space - 11-bit offset = 21 bits ⇒ 21-bit Virtual Page Number

  31                      11 10          0
  |  Virtual Page Number    | Byte Offset |
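A quick way to sanity-check this split in code (illustrative helper, assuming byte addressing):

```python
# With 2 KB pages in a 32-bit address space, the low 11 bits are the byte
# offset and the high 21 bits are the virtual page number (VPN).
PAGE_SIZE = 2 * 1024
OFFSET_BITS = PAGE_SIZE.bit_length() - 1   # log2(2048) = 11
VPN_BITS = 32 - OFFSET_BITS                # 32 - 11 = 21

def split_address(vaddr):
    vpn = vaddr >> OFFSET_BITS             # top 21 bits index the page table
    offset = vaddr & (PAGE_SIZE - 1)       # low 11 bits select the byte
    return vpn, offset
```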
Design a Virtual Memory system
[Figure: the page table is an array indexed by virtual page number (VPN 0, VPN 1, VPN 2, …), with one PTE per entry]
What does each page table entry (PTE) look like?
Design a Virtual Memory system
What does each page table entry (PTE) look like?
sizeof(VPN) == sizeof(PPN) == 21 bits ⇒ 21-bit Physical Page Number
Don’t forget the 4 auxiliary bits:
• Valid (resident in memory)
• Read (has read permission)
• Write (has write permission)
• Execute (has execute permission)
= 25-bit PTE; 32 - 25 = 7 unused bits

  | PPN (21 bits) | V R W E (4 bits) | Unused (7 bits) |
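One plausible way to pack and unpack such a PTE in code. The slide only fixes the field widths (21 + 4 + 7 bits); the exact bit positions below are an assumption for illustration:

```python
# 32-bit PTE sketch: [ PPN: bits 31..11 | V R W E: bits 10..7 | unused: bits 6..0 ]
def pack_pte(ppn, v, r, w, e):
    assert 0 <= ppn < 2**21            # PPN must fit in 21 bits
    return (ppn << 11) | (v << 10) | (r << 9) | (w << 8) | (e << 7)

def unpack_pte(pte):
    return (pte >> 11,                 # PPN
            (pte >> 10) & 1,           # Valid
            (pte >> 9) & 1,            # Read
            (pte >> 8) & 1,            # Write
            (pte >> 7) & 1)            # Execute
```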
Design a Virtual Memory system
If you only had 16 MB of physical memory, how many bits of the PPN field would actually be used?
Design a Virtual Memory system
If you only had 16 MB of physical memory, how many bits of the PPN field would actually be used?
16 MB = 2^24 bytes. With an 11-bit offset, 24 - 11 = 13-bit Physical Page Number.

  | PPN (13 bits) | V R W E (4 bits) | Unused (15 bits) |
Design a Virtual Memory system
How large would the page table be?
Design a Virtual Memory system
How large would the page table be?
2^21 entries, each a 4-byte PTE (PPN, V, R, W, E, unused)
Total size: 2^21 × 4 B = 8 MB
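Checking the arithmetic in code:

```python
# One PTE per virtual page: 2^21 entries of 4 bytes each is 8 MB per process.
NUM_ENTRIES = 2 ** 21                     # 21-bit VPN => 2^21 virtual pages
PTE_BYTES = 4                             # 25 bits used, rounded up to 4 bytes
table_bytes = NUM_ENTRIES * PTE_BYTES
table_mb = table_bytes // (1024 * 1024)   # 8
```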
Review: Two-Level Paging
[Figure: the virtual memory view (code, data, heap, stack) is translated through a two-level page table into the physical memory view; a virtual address is split into page1 #, page2 #, and offset]

Page Table (level 1):      Page Tables (level 2):
111 → table A              A: 11 → 11101   B: 11 → null
110 → null                    10 → 11100      10 → 10000
101 → null                    01 → 10111      01 → 01111
100 → table B                 00 → 10110      00 → 01110
011 → null
010 → table C              C: 11 → 01101   D: 11 → 00101
001 → null                    10 → 01100      10 → 00100
000 → table D                 01 → 01011      01 → 00011
                              00 → 01010      00 → 00010
Review: Two-Level Paging
[Same figure and tables as the previous slide, showing translation of virtual address 1001 0000]
In the best case, total size of page tables ≈ number of pages used by the program. But requires one additional memory access!
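The two-level walk can be sketched as follows. The data structures and names are illustrative: each level is a plain list, and `None` stands in for a null (invalid) entry, so unused regions of the address space cost only one null level-1 slot:

```python
# Two-level lookup sketch: the level-1 table maps the top page-number bits
# to a level-2 table; the level-2 table maps the remaining page-number bits
# to a physical page number. Each level is one extra memory access.
def translate(level1, vpn1, vpn2, offset, page_size):
    level2 = level1[vpn1]
    if level2 is None:
        raise MemoryError("page fault: level-1 entry invalid")
    ppn = level2[vpn2]
    if ppn is None:
        raise MemoryError("page fault: level-2 entry invalid")
    return ppn * page_size + offset      # physical frame base + offset
```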
X86_64: Four-level page table!
[Figure: a 48-bit virtual address is split into four 9-bit indices (virtual P4, P3, P2, P1 index) plus a 12-bit offset; starting at PageTablePtr, each 9-bit index selects an 8-byte entry in the next-level table; the final physical page # concatenated with the 12-bit offset gives a 40-50 bit physical address]
4096-byte pages (12-bit offset); page tables are also 4 KB (pageable)
Review: Inverted Table
[Figure: the virtual memory view (code, data, heap, stack) is translated into the physical memory view through an inverted table: hash(virtual page #) → physical page #]

Inverted Table (only pages in use have entries):
11111 → 11101    01011 → 01101
11110 → 11100    01010 → 01100
11101 → 10111    01001 → 01011
11100 → 10110    01000 → 01010
10010 → 10000    00011 → 00101
10001 → 01111    00010 → 00100
10000 → 01110    00001 → 00011
                 00000 → 00010

Total size of page table ≈ number of pages used by program. But the hash is more complex.
Address Translation Comparison

Segmentation
  Advantages: fast context switching (segment mapping maintained by CPU)
  Disadvantages: external fragmentation
Paging (single-level page)
  Advantages: no external fragmentation
  Disadvantages: large size (table size ~ virtual memory); internal fragmentation
Paged segmentation; Multi-level pages
  Advantages: no external fragmentation; table size ~ memory used by program
  Disadvantages: multiple memory references per page access; internal fragmentation
Inverted Table
  Disadvantages: hash function more complex
Example Memory Multiplexing, Address Translation Problems
• Spring 2004 Midterm Exam #5
• Fall 2009 Midterm Exam #4
• Spring 2013 Midterm Exam #3b
Caches, TLBs
Review: Sources of Cache Misses
• Compulsory (cold start): first reference to a block
– “Cold” fact of life: not a whole lot you can do about it
– Note: when running “billions” of instructions, compulsory misses are insignificant
• Capacity:
– Cache cannot contain all blocks accessed by the program
– Solution: increase cache size
• Conflict (collision):
– Multiple memory locations mapped to the same cache location
– Solutions: increase cache size, or increase associativity
• Two others:
– Coherence (invalidation): other process (e.g., I/O) updates memory
– Policy: due to non-optimal replacement policy
Where does a Block Get Placed in a Cache?
• Example: block 12 placed in an 8-block cache (32-block address space)
– Direct mapped: block 12 (01100) can go only into block 4 (12 mod 8); address split as tag 01, index 100
– Set associative (4 sets): block 12 can go anywhere in set 0 (12 mod 4); address split as tag 011, index 00
– Fully associative: block 12 can go anywhere; the whole block address 01100 is the tag
Review: Caching Applied to Address Translation
• Problem: address translation is expensive (especially multi-level)
• Solution: cache address translation (TLB)
– Instruction accesses spend a lot of time on the same page (since accesses are sequential)
– Stack accesses have definite locality of reference
– Data accesses have less page locality, but still some…
[Figure: the CPU sends a virtual address to the TLB; on a hit, the cached physical address goes straight to physical memory; on a miss, the MMU translates, the result is saved in the TLB, and the (untranslated) data read or write proceeds to physical memory]
TLB organization
• How big does a TLB actually have to be?
– Usually small: 128-512 entries
– Not very big; can support higher associativity
• TLB usually organized as a fully-associative cache
– Lookup is by virtual address
– Returns physical address
• What happens when fully-associative is too slow?
– Put a small (4-16 entry) direct-mapped cache in front
– Called a “TLB Slice”
• When does TLB lookup occur?
– Before cache lookup?
– In parallel with cache lookup?
Reducing translation time further
• As described, TLB lookup is in serial with cache lookup:
[Figure: the virtual address (virtual page no. + 10-bit offset) feeds a TLB lookup that yields V, access rights, and PA; the physical address (physical page no. + 10-bit offset) then goes to the cache]
• Machines with TLBs go one step further: they overlap TLB lookup with cache access
– Works because the offset is available early
Overlapping TLB & Cache Access
• Here is how this might work with a 4K cache:
[Figure: the 20-bit page # goes to an associative TLB lookup while the 12-bit displacement (10-bit index + 2-bit offset selecting 4 bytes) indexes the 1K-entry 4K cache in parallel; the PA from the TLB is compared with the cache tag to produce hit/miss and data]
• What if cache size is increased to 8KB?
– Overlap not complete
– Need to do something else. See CS152/252
• Another option: Virtual Caches
– Tags in cache are virtual addresses
– Translation only happens on cache misses
Example Caches, TLB Problems
• Spring 2011 Midterm Exam #6
• Spring 2012 Midterm Exam #6
• Spring 2013 Midterm Exam #3a
Putting Everything Together
Paging & Address Translation
[Figure: a virtual address (virtual P1 index, virtual P2 index, offset) walks a two-level page table starting at PageTablePtr; the resulting physical page # plus offset forms the physical address into physical memory]
Translation Look-aside Buffer
[Figure: the same two-level walk, with a TLB consulted first; on a hit, the TLB supplies the physical page # directly, and on a miss, the page tables are walked and the TLB is filled]
Caching
[Figure: the physical address produced by the TLB/page-table walk is split into tag, index, and byte fields to look up the cache: the index selects a cache line, the tag is compared, and the byte field selects data within the block]
Page Allocation and Replacement
Demand Paging
• Modern programs require a lot of physical memory
– Memory per system growing faster than 25%-30%/year
• But they don’t use all their memory all of the time
– 90-10 rule: programs spend 90% of their time in 10% of their code
– Wasteful to require all of a user’s code to be in memory
• Solution: use main memory as a cache for disk
[Figure: the caching hierarchy — processor (datapath + on-chip cache), second-level cache (SRAM), main memory (DRAM), secondary storage (disk), tertiary storage (tape)]
Demand Paging Mechanisms
• PTE helps us implement demand paging
– Valid ⇒ page in memory, PTE points at physical page
– Not valid ⇒ page not in memory; use info in PTE to find it on disk when necessary
• Suppose user references page with invalid PTE?
– Memory Management Unit (MMU) traps to OS
  » Resulting trap is a “Page Fault”
– What does OS do on a Page Fault?
  » Choose an old page to replace
  » If old page modified (“D=1”), write contents back to disk
  » Change its PTE and any cached TLB entry to be invalid
  » Load new page into memory from disk
  » Update page table entry, invalidate TLB for new entry
  » Continue thread from original faulting location
– TLB for new page will be loaded when thread is continued!
– While pulling pages off disk for one process, OS runs another process from ready queue
  » Suspended process sits on wait queue
What Factors Lead to Misses?
• Compulsory Misses:
– Pages that have never been paged into memory before
– How might we remove these misses?
» Prefetching: loading them into memory before needed
» Need to predict the future somehow! More later.
• Capacity Misses:
– Not enough memory. Must somehow increase size.
– Can we do this?
» One option: increase amount of DRAM (not a quick fix!)
» Another option: if multiple processes are in memory, adjust the percentage of memory allocated to each one!
• Conflict Misses:
– Technically, conflict misses don't exist in virtual memory, since it is a "fully-associative" cache
• Policy Misses:
– Caused when pages were in memory, but kicked out prematurely because of the replacement policy
– How to fix? Better replacement policy
Replacement Policies (Con't)
• LRU (Least Recently Used):
– Replace the page that hasn't been used for the longest time
– Programs have locality, so if something is not used for a while, it is unlikely to be used in the near future.
– Seems like LRU should be a good approximation to MIN.
• How to implement LRU? Use a list!
– On each use, remove page from list and place at head
– LRU page is at tail

Head → Page 6 → Page 7 → Page 1 → Page 2 ← Tail (LRU)

• Problems with this scheme for paging?
– List operations complex
» Many instructions for each hardware access
• In practice, people approximate LRU (more later)
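The list-based scheme can be sketched with an ordered dictionary standing in for the linked list. The `LRU` class name and its API are invented for illustration; this is a minimal sketch, not how hardware or a kernel would implement it.

```python
from collections import OrderedDict

# List-based LRU sketch: most-recently-used pages sit at the "head"
# (here, the end of the OrderedDict); the LRU victim sits at the tail.
class LRU:
    def __init__(self, nframes):
        self.nframes = nframes
        self.pages = OrderedDict()      # ordered least- to most-recently used

    def access(self, page):
        """Touch a page; return True if the access page-faulted."""
        if page in self.pages:
            self.pages.move_to_end(page)   # remove from list, place at head
            return False
        if len(self.pages) == self.nframes:
            self.pages.popitem(last=False) # evict the LRU page at the tail
        self.pages[page] = True
        return True

lru = LRU(3)
faults = sum(lru.access(p) for p in "ABCABDADBCB")
print(faults)   # 5 faults on this stream
```

The per-access list manipulation in `access` is exactly the "many instructions per hardware access" cost the slide warns about.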
Example: FIFO
• Suppose we have 3 page frames, 4 virtual pages, and the following reference stream:
– A B C A B D A D B C B
• Consider FIFO page replacement:

Ref:     A  B  C  A  B  D  A  D  B  C  B
Page 1:  A  A  A  A  A  D  D  D  D  C  C
Page 2:     B  B  B  B  B  A  A  A  A  A
Page 3:        C  C  C  C  C  C  B  B  B

– FIFO: 7 faults.
– When referencing D, replacing A is a bad choice, since we need A again right away
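The FIFO trace above can be checked with a short simulation (the function name is invented for illustration):

```python
from collections import deque

# FIFO page replacement: on a fault with full frames, evict the page
# that arrived earliest, regardless of how recently it was used.
def fifo_faults(refs, nframes):
    resident, order, faults = set(), deque(), 0
    for page in refs:
        if page in resident:
            continue
        faults += 1
        if len(resident) == nframes:
            resident.discard(order.popleft())  # evict the oldest arrival
        resident.add(page)
        order.append(page)
    return faults

print(fifo_faults("ABCABDADBCB", 3))   # 7, matching the slide
```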
Example: MIN
• Suppose we have the same reference stream:
– A B C A B D A D B C B
• Consider MIN page replacement:

Ref:     A  B  C  A  B  D  A  D  B  C  B
Page 1:  A  A  A  A  A  A  A  A  A  A  A
Page 2:     B  B  B  B  B  B  B  B  B  B
Page 3:        C  C  C  D  D  D  D  C  C

– MIN: 5 faults
– Look for the page not referenced for the farthest time in the future.
• What will LRU do?
– Same decisions as MIN here, but that won't always be true!
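MIN (Belady's optimal algorithm) can also be simulated, since the whole reference stream is known in advance; the function name is invented for illustration:

```python
# MIN: on a fault with full frames, evict the resident page whose next
# reference is farthest in the future (or that is never used again).
def min_faults(refs, nframes):
    resident, faults = set(), 0
    for i, page in enumerate(refs):
        if page in resident:
            continue
        faults += 1
        if len(resident) == nframes:
            def next_use(p):
                nxt = refs.find(p, i + 1)
                return nxt if nxt != -1 else len(refs)  # never used again
            resident.discard(max(resident, key=next_use))
        resident.add(page)
    return faults

print(min_faults("ABCABDADBCB", 3))   # 5, matching the slide
```

MIN needs the future reference stream, so it cannot be implemented in a real OS; it serves as the lower bound that LRU and its approximations are measured against.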
Implementing LRU & Second Chance
• Perfect LRU:
– Timestamp page on each reference
– Keep list of pages ordered by time of reference
– Too expensive to implement in reality for many reasons
• Second Chance Algorithm:
– Approximates LRU
» Replace an old page, not the oldest page
– FIFO with "use" bit
• Details:
– A "use" bit per physical page
– On page fault, check page at head of queue
» If use bit = 1 ⇒ clear bit, and move page to tail (give the page a second chance!)
» If use bit = 0 ⇒ replace page
– Moving pages to tail still complex
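The queue-plus-use-bit scheme can be sketched as follows; the `SecondChance` class name and API are invented for illustration:

```python
from collections import deque

# Second Chance: a FIFO queue plus a per-page "use" bit.  On a fault,
# head pages with use=1 get the bit cleared and move to the tail; the
# first page found with use=0 is the victim.
class SecondChance:
    def __init__(self, nframes):
        self.nframes = nframes
        self.queue = deque()   # head = oldest page
        self.use = {}          # page -> use bit

    def access(self, page):
        """Touch a page; return True if the access page-faulted."""
        if page in self.use:
            self.use[page] = 1              # hardware sets the use bit on reference
            return False
        if len(self.queue) == self.nframes:
            while True:
                victim = self.queue.popleft()
                if self.use[victim]:
                    self.use[victim] = 0
                    self.queue.append(victim)   # second chance!
                else:
                    del self.use[victim]        # replace this page
                    break
        self.queue.append(page)
        self.use[page] = 0
        return True

sc = SecondChance(3)
faults = sum(sc.access(p) for p in "ABCABDADBCB")
print(faults)   # 5 on this stream
```

The `while` loop cannot spin forever: each pass clears a use bit, so within one sweep of the queue some page has use = 0.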
Clock Algorithm
• Clock Algorithm: more efficient implementation of the second chance algorithm
– Arrange physical pages in a circle with a single clock hand
• Details:
– On page fault:
» Check use bit: 1 ⇒ used recently; clear bit and leave the page alone
» Check use bit: 0 ⇒ selected candidate for replacement
» Advance clock hand (not real time)
– Will it always find a page, or loop forever?
» It always finds one: even if every use bit is set, the hand clears bits as it sweeps, so within one full revolution some page has use = 0 (degenerating to FIFO)
Clock Replacement Illustration
• Max page table size 4
• Invariant: clock hand points at oldest page
• Event trace (u = use bit):
– Page B arrives ⇒ B u:0
– Page A arrives ⇒ A u:0
– Access page A ⇒ A u:1
– Page D arrives ⇒ D u:0
– Page C arrives ⇒ C u:0 (all four frames full: B, A, D, C; hand at B)
– Page F arrives ⇒ hand finds B with u:0, so B is replaced by F (F u:0)
– Access page D ⇒ D u:1
– Page E arrives ⇒ hand sweeps: A u:1→0, D u:1→0, then C (u:0) is replaced by E
• Final frames: F u:0, A u:0, D u:0, E u:0
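The event trace above can be replayed with a small clock simulation; the `Clock` class name and API are invented for illustration:

```python
# Clock sketch: frames in a circle, one hand; the hand clears use bits
# as it sweeps and replaces the first page it finds with use = 0.
class Clock:
    def __init__(self, nframes):
        self.nframes = nframes
        self.frames = []    # circular list of [page, use bit]
        self.hand = 0

    def access(self, page):
        for entry in self.frames:
            if entry[0] == page:
                entry[1] = 1            # reference sets the use bit
                return
        if len(self.frames) < self.nframes:
            self.frames.append([page, 0])   # free frame, no replacement yet
            return
        while self.frames[self.hand][1]:    # use=1: clear and advance
            self.frames[self.hand][1] = 0
            self.hand = (self.hand + 1) % self.nframes
        self.frames[self.hand] = [page, 0]  # use=0: replace this page
        self.hand = (self.hand + 1) % self.nframes

clk = Clock(4)
for ref in ["B", "A", "A", "D", "C", "F", "D", "E"]:  # the slide's event trace
    clk.access(ref)
print([page for page, _ in clk.frames])   # ['F', 'A', 'D', 'E']
```

The final frame contents match the illustration: F replaced B, and E replaced C after the hand cleared the use bits on A and D.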
Thrashing
• If a process does not have "enough" pages, the page-fault rate is very high. This leads to:
– low CPU utilization
– operating system spends most of its time swapping pages to disk
• Thrashing ≡ a process is busy swapping pages in and out
• Questions:
– How do we detect thrashing?
– What is the best response to thrashing?
Locality In A Memory-Reference Pattern
• Program memory access patterns have temporal and spatial locality
– The group of pages accessed along a given time slice is called the "Working Set"
– The Working Set defines the minimum number of pages needed for the process to behave well
• Not enough memory for Working Set ⇒ Thrashing
– Better to swap out the process?
Working-Set Model
• Δ ≡ working-set window ≡ fixed number of page references
– Example: 10,000 accesses
• WSi (working set of Process Pi) = total set of pages referenced in the most recent Δ (varies in time)
– if Δ too small, will not encompass entire locality
– if Δ too large, will encompass several localities
– if Δ = ∞, will encompass entire program
• D = Σ|WSi| ≡ total demand frames
• if D > physical memory ⇒ Thrashing
– Policy: if D > physical memory, then suspend/swap out processes
– This can improve overall system behavior by a lot!
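The working-set definition above can be sketched directly; the function name and the sample reference stream are invented for illustration:

```python
# Working set WS(t, delta): the set of pages touched in the most recent
# delta references, ending at time t.
def working_set(refs, delta, t):
    return set(refs[max(0, t - delta + 1): t + 1])

refs = ["A", "B", "A", "C", "A", "B", "D", "D", "D", "E"]
print(working_set(refs, 4, 9))    # small window: only the current locality
print(working_set(refs, 100, 9))  # huge window: encompasses the whole stream
```

Summing |WS| across all processes gives D; if D exceeds the number of physical frames, the policy above says to suspend or swap out processes rather than thrash.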
Kernel/User Interface
Dual-Mode Operation
• Can an application modify its own translation maps or PTE bits?
– If it could, it could get access to all of physical memory
– Has to be restricted somehow
• To assist with protection, hardware provides at least two modes (Dual-Mode Operation):
– "Kernel" mode (or "supervisor" or "protected")
– "User" mode (normal program mode)
– Mode set with bits in a special control register only accessible in kernel mode
• Intel processors actually have four "rings" of protection:
– PL (Privilege Level) from 0 – 3
» PL0 has full access, PL3 has least
– Typical OS kernels on Intel processors use only PL0 ("kernel") and PL3 ("user")
How to get from Kernel→User
• What does the kernel do to create a new user process?
– Allocate and initialize process control block
– Read program off disk and store in memory
– Allocate and initialize translation map
» Point at code in memory so program can execute
» Possibly point at statically initialized data
– Run Program:
» Set machine registers
» Set hardware pointer to translation table
» Set processor status word for user mode
» Jump to start of program
• How does kernel switch between processes (we learned about this!)?
– Same saving/restoring of registers as before
– Save/restore hardware pointer to translation map
User→Kernel (System Call)
• Can't let inmate (user) get out of padded cell on own
– Would defeat purpose of protection!
– So, how does the user program get back into the kernel?
• System call: voluntary procedure call into the kernel
– Hardware for controlled User→Kernel transition
– Can any kernel routine be called?
» No! Only specific ones
– System call ID encoded into system call instruction
» Index forces well-defined interface with kernel
• Example system calls:
– I/O: open, close, read, write, lseek
– Files: delete, mkdir, rmdir, chown
– Process: fork, exit, join
– Network: socket create, select
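The I/O calls in the list map to thin user-level wrappers in most languages; in Python, for instance, `os.open`/`os.write`/`os.lseek`/`os.read`/`os.close` each trap into the kernel via the corresponding system call. A minimal POSIX-style example (file name and directory are arbitrary):

```python
import os
import tempfile

# Each os.* call below crosses the user/kernel boundary once via the
# named system call.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
fd = os.open(path, os.O_CREAT | os.O_RDWR)   # open: returns a file descriptor
os.write(fd, b"hello kernel")                # write
os.lseek(fd, 0, os.SEEK_SET)                 # lseek: back to the start
data = os.read(fd, 5)                        # read
os.close(fd)                                 # close
print(data)                                  # b'hello'
```

The file descriptor returned by `open` is the index into the kernel's per-process table; user code never touches kernel data structures directly, only these narrow entry points.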
User→Kernel (Exceptions: Traps and Interrupts)
• A system call instruction causes a synchronous exception (or "trap")
– In fact, often called a software "trap" instruction
• Other sources of synchronous exceptions:
– Divide by zero, illegal instruction, bus error (bad address, e.g. unaligned access)
– Segmentation fault (address out of range)
– Page fault
• Interrupts are asynchronous exceptions
– Examples: timer, disk ready, network, etc.
– Interrupts can be disabled, traps cannot!
• SUMMARY – On system call, exception, or interrupt:
– Hardware enters kernel mode with interrupts disabled
– Saves PC, then jumps to appropriate handler in kernel
– For some processors (x86), processor also saves registers, changes stack, etc.
Questions?