
CS/EE 5810 / CS/EE 6810, F00

Virtual Memory


Virtual Memory: An Address Remapping Scheme

• Permits applications to grow bigger than main memory size

• Helps with multiple process management

– Each process gets its own chunk of memory

– Permits protection of one process’ chunk from another

– Mapping of multiple chunks onto shared physical memory

– Mapping also facilitates relocation

• Think of it as another level of cache in the hierarchy:

– Caches disk pages into DRAM
» Miss becomes a page fault

» Block becomes a page (or segment)


Virtual Memory Basics

• Programs reference “virtual” addresses in a non-existent memory

– These are then translated into real “physical” addresses

– Virtual address space may be bigger than physical address space

• Divide physical memory into blocks, called pages

– Anywhere from 512 bytes to 16 MB (4 KB typical)

• Virtual-to-physical translation by indexed table lookup

– Add another cache for recent translations (the TLB)

• Invisible to the programmer

– Looks to your application like you have a lot of memory!

– Anyone remember overlays?


VM: Page Mapping

[Figure: process 1's and process 2's virtual address spaces are mapped onto the page frames of a shared physical memory, with unmapped pages residing on disk.]


VM: Address Translation

[Figure: the virtual address splits into a virtual page number (20 bits here) and a page offset (12 bits – log2 of the page size). A page table base register points to the per-process page table; the VPN indexes it, and each entry holds a valid bit, protection bits, a dirty bit, and a reference bit along with the physical page number. The PPN is concatenated with the unchanged page offset to form the address sent to physical memory.]
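To make the lookup concrete, here is a minimal Python sketch of the translation above, assuming 4 KB pages (so a 12-bit offset) and a toy per-process page table; the table contents and fault handling are illustrative only, not any particular machine's.

    PAGE_OFFSET_BITS = 12                   # log2 of the 4 KB page size
    PAGE_SIZE = 1 << PAGE_OFFSET_BITS

    # toy per-process page table: VPN -> (valid bit, physical page number)
    page_table = {0: (True, 7), 1: (True, 3)}

    def translate(vaddr):
        vpn = vaddr >> PAGE_OFFSET_BITS     # upper bits: virtual page number
        offset = vaddr & (PAGE_SIZE - 1)    # lower 12 bits pass through unchanged
        valid, ppn = page_table.get(vpn, (False, None))
        if not valid:
            raise RuntimeError("page fault: OS must bring the page in from disk")
        return (ppn << PAGE_OFFSET_BITS) | offset   # concatenate PPN and offset

    print(hex(translate(0x1234)))           # VPN 1 -> PPN 3, so 0x3234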


Typical Page Parameters

• It’s a lot like what happens in a cache

– But everything (except miss rate) is a LOT worse

Parameter                            Value
Page Size                            4 KB – 64 KB
L1 Cache Hit Time                    1-2 clock cycles
Virtual Hit (e.g., mapped to DRAM)   50-400 clock cycles
Miss Penalty (all the way to disk)   700K-6M clock cycles
  Disk Access Time                   500K-4M clock cycles
  Page Transfer Time                 200K-2M clock cycles
Page Fault Rate                      .001% – .00001%
Main Memory Size                     4 MB – 4 GB


Cache vs. VM Differences

• Replacement

– cache miss handled by hardware

– page fault usually handled by the OS

» This is OK, since the fault penalty is so horrific

» hence some strategy of what to replace makes sense

• Addresses

– VM space is determined by the address size of the CPU

– cache size is independent of the CPU address size

• Lower-level memory

– For caches, the main memory is not shared by something else (well, there is I/O…)

– For VM, most of the disk contains the file system

» The file system is addressed differently, usually in I/O space

» The VM lower level is usually called swap space


Paging vs. Segmentation

• Pages are fixed-size blocks

• Segments vary from 1 byte to 2^32 bytes (for 32-bit addresses)

Aspect               Page                               Segment
Words per address    One – contains page and offset     Two – possibly large max size, so need segment and offset words
Programmer visible?  No                                 Sometimes
Replacement          Trivial – because of fixed size    Hard – need to find contiguous space; use garbage collection
Memory Efficiency    Internal fragmentation             External fragmentation
Disk Efficiency      Yes – adjust page size to balance access and transfer time     Not always – segment size varies


Pages are Cached in a Virtual Memory System

Can Ask the Same Four Questions we did about caches

• Q1: Block Placement

– choice: lower miss rates and complex placement or vice versa

» miss penalty is huge

» so choose low miss rate ==> place page anywhere in physical memory

» similar to fully associative cache model

• Q2: Block Addressing - use additional data structure

– fixed-size pages – use a page table
» virtual page number ==> physical page number, and concatenate offset

» tag bit to indicate presence in main memory


Normal Page Tables

• Size is number of virtual pages

• Purpose is to hold the translation of VPN to PPN

– Permits ease of page relocation

– Make sure to keep tags to indicate page is mapped

• Potential problem:

– Consider a 32-bit virtual address and 4 KB pages

– 4 GB / 4 KB = 1M words required just for the page table!

– Might have to page in the page table…
» Consider how the problem gets worse on 64-bit machines with even larger virtual address spaces!

» The Alpha has a 43-bit virtual address with 8 KB pages…

– Might have multi-level page tables
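The arithmetic behind that warning, as a small Python sketch (one entry per virtual page, per the 1M-word figure above):

    def page_table_entries(va_bits, page_bits):
        return 1 << (va_bits - page_bits)   # one entry per virtual page

    # 32-bit VA, 4 KB pages: 2^20 entries = 4 MB of table at 4 bytes/entry
    print(page_table_entries(32, 12))       # 1048576 (1M)
    # Alpha: 43-bit VA, 8 KB pages: 2^30 entries per process!
    print(page_table_entries(43, 13))       # 1073741824 (1G)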


Inverted Page Tables

Similar to a set-associative mechanism

• Make the page table reflect the # of physical pages (not virtual)

• Use a hash mechanism

– virtual page number ==> hashed page number (HPN), an index into the inverted page table

– Compare the virtual page number with the tag to make sure it is the one you want

– if it matches
» check to see that the page is in memory – OK if yes; if not, page fault

– if it doesn't match – miss
» go to the full page table on disk to get the new entry

» implies 2 disk accesses in the worst case

» trades an increased worst-case penalty for a decrease in capacity-induced miss rate, since there is now more room for real pages with a smaller page table
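A minimal Python sketch of that lookup, with one table slot per physical frame; the hash function and slot layout are toy choices, and collision handling is omitted.

    NUM_FRAMES = 8

    # one slot per physical frame: (valid bit, VPN tag); frame # == slot index
    inverted = [(False, None)] * NUM_FRAMES

    def lookup(vpn):
        idx = vpn % NUM_FRAMES              # toy hash of the virtual page number
        valid, tag = inverted[idx]
        if valid and tag == vpn:            # tag check: is this the page we want?
            return idx                      # the index itself is the frame number
        # miss: go to the full page table on disk (2 disk accesses, worst case)
        raise RuntimeError("inverted-table miss")

    inverted[3] = (True, 11)                # VPN 11 hashes to slot 3 (11 % 8)
    print(lookup(11))                       # -> 3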


Inverted Page Table

[Figure: the virtual page number is hashed to index the inverted page table; each slot holds a valid bit, a page tag, and a frame number. The stored tag is compared (=) against the page number, and on a match (OK) the frame number is concatenated with the page offset. Only entries for pages in physical memory are stored.]


Address Translation Reality

• The translation process using page tables takes too long!

• Use a cache to hold recent translations

– Translation Lookaside Buffer
» Typically 8-1024 entries

» Block size same as a page table entry (1 or 2 words)

» Only holds translations for pages in memory

» 1 cycle hit time

» Highly or fully associative

» Miss rate < 1%

» Miss goes to main memory (where the whole page table lives)

» Must be purged on a process switch
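In software terms the TLB acts like a small recency-ordered map in front of the page table; a Python sketch (the size and the LRU replacement policy here are illustrative), including the purge on a process switch:

    from collections import OrderedDict

    TLB_ENTRIES = 64
    tlb = OrderedDict()                     # VPN -> PPN, kept in recency order

    def tlb_translate(vpn, page_table):
        if vpn in tlb:                      # hit: 1 cycle
            tlb.move_to_end(vpn)
            return tlb[vpn]
        ppn = page_table[vpn]               # miss: walk the in-memory page table
        if len(tlb) >= TLB_ENTRIES:
            tlb.popitem(last=False)         # evict the least recently used entry
        tlb[vpn] = ppn
        return ppn

    def process_switch():
        tlb.clear()                         # purge: old process's translations

    print(tlb_translate(5, {5: 42}))        # miss, then cached: -> 42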


Back to the 4 Questions

• Q3: Block Replacement (pages in physical memory)

– LRU is best
» So use it to minimize the horrible miss penalty

– However, real LRU is expensive
» Page table contains a use tag

» On access the use tag is set

» OS checks them every so often, records what it sees, and resets them all

» On a miss, the OS decides who has been used the least

– Basic strategy: Miss penalty is so huge, you can spend a few OS cycles to help reduce the miss rate
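A Python sketch of that use-bit approximation (the scan interval and the victim choice are simplified; all names are hypothetical):

    use_bit = {}                            # resident pages: VPN -> use bit
    idle_last_interval = set()              # the OS's record from its last scan

    def touch(vpn):
        use_bit[vpn] = True                 # set by hardware on every access

    def periodic_scan():
        # OS: note which pages went untouched, then reset all the use bits
        idle_last_interval.clear()
        idle_last_interval.update(v for v, used in use_bit.items() if not used)
        for v in use_bit:
            use_bit[v] = False

    def choose_victim():
        # on a page fault, prefer a page that sat idle over the last interval
        if idle_last_interval:
            return next(iter(idle_last_interval))
        return next(iter(use_bit))          # otherwise evict an arbitrary page

    use_bit.update({1: False, 2: False}); touch(2); periodic_scan()
    print(choose_victim())                  # -> 1 (untouched since the scan)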


Last Question

• Q4: Write Policy

– Always write-back
» Due to the access time of the disk

» So, you need to keep tags to show when pages are dirty and need to be written back to disk when they’re swapped out.

– Anything else is pretty silly

– Remember – the disk is SLOW!
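The dirty-tag bookkeeping amounts to a few lines; a Python sketch with dictionaries standing in for memory and disk:

    memory, disk, dirty = {}, {}, {}        # toy stand-ins, keyed by VPN

    def store(vpn, data):
        memory[vpn] = data
        dirty[vpn] = True                   # set on a write; checked at eviction

    def evict(vpn):
        if dirty.pop(vpn, False):
            disk[vpn] = memory[vpn]         # write back only if the page is dirty
        del memory[vpn]                     # a clean page is simply dropped

    store(7, "data"); evict(7)
    print(disk)                             # {7: 'data'} – written back once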


Page Sizes: An Architectural Choice

• Large pages are good:

– reduces page table size

– amortizes the long disk access

– if spatial locality is good then hit rate will improve

• Large pages are bad:

– more internal fragmentation
» if everything is random, each structure's last page is only half full

» Half of bigger is still bigger

» if there are 3 structures per process (text, heap, and control stack)

» then 1.5 pages are wasted for each process

– process start-up time takes longer
» since at least 1 page of each type is required prior to start

» the transfer-time penalty aspect is higher
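The fragmentation point is just expected-value arithmetic; a one-function Python sketch of the slide's model (3 structures, last page half empty on average):

    def expected_waste(page_size, structures=3):
        # each structure's last page is, on average, half empty
        return structures * page_size // 2

    print(expected_waste(4 * 1024))         # 6144 bytes per process at 4 KB pages
    print(expected_waste(64 * 1024))        # 98304 bytes: half of bigger is bigger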


More on TLBs

• The TLB must be on chip

– otherwise it is worthless

– small TLBs are worthless anyway

– large TLBs are expensive
» high associativity is likely

• ==> The price of CPUs is going up!

– OK as long as performance goes up faster


Alpha AXP 21064 TLB

[Figure: a 43-bit virtual address splits into a 30-bit virtual page number and a 13-bit page offset. The VPN is compared against the VPN tags of all 32 fully associative TLB entries; each 56-bit entry holds a 30-bit VPN tag, a 21-bit physical page number, and W/R/V (write/read/valid) protection bits. A 32:1 mux selects the hitting entry's physical page number, which is concatenated with the page offset to form a 34-bit physical address.]


Protection

• Multiprogramming forces us to worry about it

– think about what happens on your workstation
» it would be annoying if your program clobbered your email files

• There are lots of processes

– Implies lots of task switch overhead

– HW must provide savable state

– OS must promise to save and restore properly

– most machines task switch every few milliseconds

– a task switch typically takes several microseconds

– also implies inter-process communication
» which implies OS intervention

» which implies a task switch

» which implies less of the duty cycle gets spent on the application


Protection Options

• Simplest – base and bound (see the sketch after this list)

– 2 registers – check that each address falls between the values
» these registers must be changed by the OS but not the app

– need for 2 modes: regular & privileged
» hence the need to privilege-trap and provide mode-switch ability

• VM provides another option

– check as part of the VA --> PA translation process
» The protection bits reside in the page table & TLB

• Other options

– rings – à la MULTICS and now the Pentium
» inner is most privileged – outer is least

– capabilities (i432) – similar to a key or password model
» the OS hands them out, so they're difficult to forge

» in some cases they can be passed between apps
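A Python sketch of the base-and-bound check from the first bullet, with the two registers modeled as plain variables that only privileged (OS) code may change:

    base, bound = 0x4000, 0x8000            # set by the OS in privileged mode

    def check(vaddr):
        # every user-mode address must fall between the two register values
        if not (base <= vaddr < bound):
            raise RuntimeError("protection trap: address outside base/bound")
        return vaddr

    print(hex(check(0x5000)))               # in range: fine
    # check(0x9000) would raise the protection trap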


VAX 11/780 Example

• Hybrid – segmented + paged

– segments used for system/process separation

– paging used for VM, relocation, and protection

– reasonably common again (was common in the late 60's too)

• Segments – 1 for system and 1 for processes

– high-order address bit 31: 1 – system, 0 – process

– all processes have their own space but share the process segment

– bit 30 divides the process segment into 2 regions

– =1: P1 grows downward (stack); =0: P0 grows upward (text, heap)

• Protection

– pair of base and bound registers for S, P0, and P1

– saves page table space, since the page size was 512 bytes


VAX-11/780 Address Mapping

[Figure: address bits 31:30 select one of 3 separate page tables (SPT, P0PT, P1PT). The 21-bit virtual page number (bits 29:9) is added to the selected page table base and bounds-checked (page index fault if out of range); the resulting entry supplies a 21-bit page frame number, which is concatenated with the 9-bit page offset and sent to the TLB.]


More VAX-11/780

• System page tables are pinned (frozen) in physical memory

– virtual and physical page numbers are the same

– OS handles replacement so it never moves these pages

• P0 and P1 tables are in virtual memory

– hence they can be missed as well
» a double page fault is therefore possible

• Page Table Entry

– M - modify indicating dirty page

– V - valid indicating a real entry

– PROT - 4 protection bits

– 21 bit physical page number

– no reference or use bit - hence hokey LRU accounting - use M


VAX PROT bits

• 4 levels of use – each with its own stack and its own copy of the stack pointer (R14)

– kernel - most trusted

– executive

– supervisor

– user - least trusted

• 3 types of access

– no access, read, or write

• Bizarre encoding

– all 16 4-bit patterns meant something
» if there was a model to their encoding method, it eludes everybody I know

– 1001 – R/W for kernel and exec, R for supervisor, zip for user

VAX 11/780 TLB Parameters

• 2-way set associative but partitioned
» two 64-entry banks – the top 32 entries of each bank are reserved for the system
» on a task switch, only half the TLB needs to be invalidated
» the high-order address bit selects the bank (corresponds to the P0/P1 distinction)
» the split increases the miss rate, but under a high task-switch rate it may still be a win

Parameter              Value
Block Size             1 page table entry (4 bytes)
Hit Time               1 cycle
Average Miss Penalty   22 clock cycles
Miss Rate              1% – 2%
TLB Size               128 entries
Block Selection        Random, but not last used
Block Placement        2-way set-associative
Write Strategy         Explicit by OS


VAX 11/780 TLB Action

• Steps 1 & 2

» bit 31 ## Index passed to both banks (both set members)

» V must be 1 for anything to continue - TLB miss trap if V=0

» PROT bits checked against R/W access type and which kind of user (from the PSW) - TLB protection trap if violation

• Step 3
» page table tag compared with TLB tag

» both banks done in parallel

» both cannot match - if no match then TLB-miss trap

• Step 4
» the matching side's 21-bit physical page address is passed through the MUX

• Step 5
» 21-bit physical page address ## 9-bit page offset = physical address

» if P=1 then cache hit and address goes to the cache - note cache hit time stretch

» if P=0 then page fault trap


The Alpha AXP 21064

Also a page/segment combo

• Segmented 64-bit address space

– Seg0 (addr[63] = 0) & Seg1 (addr[63:62] = 11)
» Mapped into pages for user code
» Seg0, for text and heap sections, grows upward
» Seg1, for the stack, grows downward

– Kseg (addr[63:62] = 10)
» Reserved for the O.S.
» Uniformly protected space, no memory management

• Advantages of the split model

– Segmentation conserves page table space

– Paging provides VM, protection, and relocation

– The O.S. tends to need to be resident anyway


Page Table Problems
The 64-bit address is the first-order cause

• Reduction approach - go hierarchical

– 3 levels – each page table is limited to a single page in size
» the current page size is 8 KB, but 16, 32, and 64 KB are claimed to be supported in the future

» Superpage model extends TLB REACH – used in MIPS R10k

» Uses 34-bit physical addresses (max could be 41)

– Virtual address = [seg-selector Lvl1 Lvl2 Lvl3 Offset]

• Mapping

– Page table base register points to base of LVL1-TBL

– LVL1-TBL[Lvl1] + Lvl2 points to LVL2-TBL entry

– LVL2-TBL entry + Lvl3 points to LVL3-TBL entry

– LVL3-TBL entry provides physical page number finally

– PPN##Offset => physical address for main memory
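A Python sketch of that three-level walk (field widths per the 8 KB page / 1K-entry assumption; nested dicts stand in for the per-level tables):

    OFFSET_BITS, INDEX_BITS = 13, 10        # 8 KB pages, 1K entries per table

    def walk(l1_table, vaddr):
        mask = (1 << INDEX_BITS) - 1
        off = vaddr & ((1 << OFFSET_BITS) - 1)
        l3 = (vaddr >> OFFSET_BITS) & mask
        l2 = (vaddr >> (OFFSET_BITS + INDEX_BITS)) & mask
        l1 = (vaddr >> (OFFSET_BITS + 2 * INDEX_BITS)) & mask
        l2_table = l1_table[l1]             # lookup 1 \
        l3_table = l2_table[l2]             # lookup 2  } 3 sequential lookups
        ppn = l3_table[l3]                  # lookup 3 /
        return (ppn << OFFSET_BITS) | off   # PPN ## offset

    tables = {0: {0: {5: 42}}}              # maps VPN (0, 0, 5) to PPN 42
    print(hex(walk(tables, 5 << OFFSET_BITS)))   # -> 0x54000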

Alpha VPN => PPN Mapping

[Figure: the virtual address = seg0/1 selector ## level-1 index ## level-2 index ## level-3 index ## page offset. The page table base register locates the L1 table; the selected L1 entry locates the L2 table, the L2 entry locates the L3 table, and the L3 entry supplies the physical page frame number, which is concatenated with the page offset to form the physical address. Note the potential delay problem – 3 sequential lookups.]


Page Table Entries
64-bit entries ==> 1K entries per table

• low-order 32 bits = page frame number

• 5 protection fields

– valid – i.e., OK to do address translation

– user Read Enable

– kernel Read Enable

– user Write Enable

– kernel Write Enable

• Other fields

– system accounting – like a USE field

– high-order bits basically unused – the real virtual address = 43 bits

» common hack to save chip area in the early implementations - different now

– it will be interesting to see the OS problems as VM goes to bigger pages


Alpha TLB Stats

• Contiguous pages mapped as 1

– Option of 8, 64, or 512 pages – also extends TLB reach

– Complicates the TLB somewhat…
» Separate ITLB and DTLB

Parameter              Description
Block Size             1 PTE (8 bytes)
Hit Time               1 cycle
Average Miss Penalty   20 cycles
ITLB Size              96 bytes (12 entries for 8 KB pages, 4 entries for max-4 MB superpages)
DTLB Size              32 PTEs, for any size superpage
Block Replacement      Random, but not last used
Block Placement        Fully associative
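TLB reach is just entries × page size, scaled by the superpage grouping; a quick Python sketch of the arithmetic:

    def tlb_reach(entries, page_size, pages_per_entry=1):
        # contiguous pages mapped as one entry multiply the coverage
        return entries * page_size * pages_per_entry

    print(tlb_reach(32, 8 * 1024))          # 262144: 256 KB with plain 8 KB pages
    print(tlb_reach(32, 8 * 1024, 512))     # 134217728: 128 MB when 512 pages map as one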


The Whole Alpha Memory System

• Boot

– initial Icache fill from boot PROM

– hence no Valid bit needed - Vbits in Dcache are cleared

– PC set to kseg, so no address translation is needed and the TLB can be bypassed

– then start loading the TLB for real from the kseg OS code
» subsequent misses call PAL (Privileged Architecture Library) code to remap and fill the miss

• Once ready to run user code

– OS sets PC to appropriate seg0 address

– then the memory hierarchy is ready to go


Things to Note

• Prefetch stream buffer

– holds next Ifetch access so an Icache miss checks there first

• L1 is write through with a write buffer

– avoids CPU stall on a write miss

– 4 block capacity

– write merge – if the block is the same, the writes are merged

• L2 is write back with a victim buffer

– 1 block – so the stall on a dirty L2 miss is delayed until the 2nd miss

– normally this is safe

• Full datapath shown in Figure 5.47

– worth the time to follow the numbered balls


Alpha 21064 I-fetch

• I-Cache hit (1 cycle)
» Step 1: Send the address to the ITLB and I-Cache
» Step 2: I-Cache lookup (8 KB, DM, 32-byte blocks)
» Step 3: ITLB lookup (12 entries, FA)
» Step 4: Cache hit and valid PTE
» Step 5: Send 8 bytes to the CPU

• I-Cache miss, prefetch buffer (PFB) hit (1 cycle)
» Step 6: Start the L2 access, just in case
» Step 7: Check the prefetch buffer
» Step 8: Prefetch buffer hit – send 8 bytes to the CPU
» Step 9: Refill the I-Cache from the PFB, and cancel the L2 access

• I-Cache miss, prefetch buffer miss, L2 cache hit
» Step 10: Check the L2 cache tag
» Step 11: Return the critical 16 B (5 cycles)
» Step 12: Return the other 16 B (5 cycles)
» Step 13: Prefetch the next sequential block into the PFB (10 cycles)


Alpha 21064 L2 Cache Miss

• L2 cache miss
» Step 14: Send a new request to main memory

» Step 15: Put dirty victim block in victim buffer

» Step 16: Load new block in L2 cache, 16B at a time

» Step 17: Write Victim buffer to memory

• Data loads are like instruction fetches, except use DTLB and D-Cache instead of the ITLB and I-Cache

• Allows hits under miss

• On a read miss, the write buffer is flushed first to avoid RAW hazards


Alpha 21064 Data Store

• Data store
» Step 18: DTLB lookup and protection-violation check

» Step 19: D-Cache lookup (8KB, DM, write through)

» Step 22: Check D-Cache tag

» Step 24: Send data to write buffer

» Step 25: Send data to delayed write buffer in front of D-Cache

• Write hits are pipelined

» Step 26: Write previous delayed write buffer to D-Cache

» Step 27: Merge data into write buffer

» Step 28: Write data at the head of write buffer to L2 cache (15 cycles)


AXP Memory is Quite Complex
How well does it work?

• Table 5.48 in the book tells…

• Summary – interesting to note:

– Alpha 21064 is a dual-issue machine, but nowhere is the CPI < 1

– The worst cases are the TPC benchmarks
» Memory intensive, so more stalls
» CPI bloats to 4.3 (1.8 of which is from the caches)
» 1.67 due to other stalls (i.e., load dependencies)

– Average SPECint CPI is 1.86
» .77 contributed by caches, .74 by other stalls

– Average SPECfp CPI is 2.14
» .45 by caches, .98 by other stalls

Superscalar issue is sorely limited!


Summary 1

• CPU performance is outpacing main memory performance

• Principle of locality saves us: thus the Memory Hierarchy

• The memory hierarchy should be designed as a system

• Key old ideas:

– Bigger caches

– Higher set-associativity

• Key new ideas:

– Non-blocking caches

– Ancillary caches

– Multi-port caches

– Prefetching (software and/or hardware controlled)

Summary 2: Typical Choices

Option                 TLB                     L1 Cache          L2 Cache          VM (page)
Block Size             4-8 bytes (1 PTE)       4-32 bytes        32-256 bytes      4 KB-16 KB
Hit Time               1 cycle                 1-2 cycles        6-15 cycles       10-100 cycles
Miss Penalty           10-30 cycles            8-66 cycles       30-200 cycles     700K-6M cycles
Local Miss Rate        .1-2%                   .5-20%            13-15%            .00001-.001%
Size                   32 B-8 KB               1-128 KB          256 KB-16 MB      —
Backing Store          L1 Cache                L2 Cache          DRAM              Disks
Q1: Block Placement    Fully or set assoc.     DM                DM or SA          Fully associative
Q2: Block ID           Tag/block               Tag/block         Tag/block         Table
Q3: Block Replacement  Random (not last used)  N.A. for DM       Random (if SA)    LRU/LFU
Q4: Writes             Flush on PTE write      Through or back   Write-back        Write-back