
CS/EE 5810 / CS/EE 6810, F00

Virtual Memory


Virtual Memory: An Address Remapping Scheme

• Permits applications to grow bigger than main memory size

• Helps with multiple process management

– Each process gets its own chunk of memory

– Permits protection of one process’ chunk from another

– Mapping of multiple chunks onto shared physical memory

– Mapping also facilitates relocation

• Think of it as another level of cache in the hierarchy:

– Caches disk pages into DRAM
» Miss becomes a page fault

» Block becomes a page (or segment)


Virtual Memory Basics

• Programs reference “virtual” addresses in a non-existent memory

– These are then translated into real “physical” addresses

– Virtual address space may be bigger than physical address space

• Divide physical memory into blocks, called pages

– Anywhere from 512 bytes to 16 MB (4 KB typical)

• Virtual-to-physical translation by indexed table lookup

– Add another cache for recent translations (the TLB)

• Invisible to the programmer

– Looks to your application like you have a lot of memory!

– Anyone remember overlays?


VM: Page Mapping

[Figure: process 1's and process 2's virtual address spaces are mapped onto the page frames of a shared physical memory, with unmapped pages residing on disk.]


VM: Address Translation

[Figure: the virtual address splits into a virtual page number (20 bits here) and a page offset (12 bits – log2 of the page size). A page table base register points to the per-process page table; the VPN indexes it, and each entry holds a valid bit, protection bits, a dirty bit, and a reference bit along with the physical page number. The PPN is concatenated with the unchanged page offset to form the address sent to physical memory.]
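To make the lookup concrete, here is a minimal Python sketch of the translation above, assuming 4 KB pages (so a 12-bit offset) and a toy per-process page table; the table contents and fault handling are illustrative only, not any particular machine's.

    PAGE_OFFSET_BITS = 12                   # log2 of the 4 KB page size
    PAGE_SIZE = 1 << PAGE_OFFSET_BITS

    # toy per-process page table: VPN -> (valid bit, physical page number)
    page_table = {0: (True, 7), 1: (True, 3)}

    def translate(vaddr):
        vpn = vaddr >> PAGE_OFFSET_BITS     # upper bits: virtual page number
        offset = vaddr & (PAGE_SIZE - 1)    # lower 12 bits pass through unchanged
        valid, ppn = page_table.get(vpn, (False, None))
        if not valid:
            raise RuntimeError("page fault: OS must bring the page in from disk")
        return (ppn << PAGE_OFFSET_BITS) | offset   # concatenate PPN and offset

    print(hex(translate(0x1234)))           # VPN 1 -> PPN 3, so 0x3234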


Typical Page Parameters

• It’s a lot like what happens in a cache

– But everything (except miss rate) is a LOT worse

Parameter                            Value
Page Size                            4 KB – 64 KB
L1 Cache Hit Time                    1-2 clock cycles
Virtual Hit (e.g., mapped to DRAM)   50-400 clock cycles
Miss Penalty (all the way to disk)   700K-6M clock cycles
  Disk Access Time                   500K-4M clock cycles
  Page Transfer Time                 200K-2M clock cycles
Page Fault Rate                      .001% – .00001%
Main Memory Size                     4 MB – 4 GB


Cache vs. VM Differences

• Replacement

– cache miss handled by hardware

– page fault usually handled by the OS

» This is OK, since the fault penalty is so horrific

» hence some strategy of what to replace makes sense

• Addresses

– VM space is determined by the address size of the CPU

– cache size is independent of the CPU address size

• Lower-level memory

– For caches, the main memory is not shared by something else (well, there is I/O…)

– For VM, most of the disk contains the file system

» The file system is addressed differently, usually in I/O space

» The VM lower level is usually called swap space


Paging vs. Segmentation

• Pages are fixed-size blocks

• Segments vary from 1 byte to 2^32 bytes (for 32-bit addresses)

Aspect               Page                               Segment
Words per address    One – contains page and offset     Two – possibly large max size, so need segment and offset words
Programmer visible?  No                                 Sometimes
Replacement          Trivial – because of fixed size    Hard – need to find contiguous space; use garbage collection
Memory Efficiency    Internal fragmentation             External fragmentation
Disk Efficiency      Yes – adjust page size to balance access and transfer time     Not always – segment size varies


Pages are Cached in a Virtual Memory System

Can Ask the Same Four Questions we did about caches

• Q1: Block Placement

– choice: lower miss rates and complex placement or vice versa

» miss penalty is huge

» so choose low miss rate ==> place page anywhere in physical memory

» similar to fully associative cache model

• Q2: Block Addressing - use additional data structure

– fixed-size pages – use a page table
» virtual page number ==> physical page number, and concatenate offset

» tag bit to indicate presence in main memory


Normal Page Tables

• Size is number of virtual pages

• Purpose is to hold the translation of VPN to PPN

– Permits ease of page relocation

– Make sure to keep tags to indicate page is mapped

• Potential problem:

– Consider a 32-bit virtual address and 4 KB pages

– 4 GB / 4 KB = 1M words required just for the page table!

– Might have to page in the page table…
» Consider how the problem gets worse on 64-bit machines with even larger virtual address spaces!

» The Alpha has a 43-bit virtual address with 8 KB pages…

– Might have multi-level page tables
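The arithmetic behind that warning, as a small Python sketch (one entry per virtual page, per the 1M-word figure above):

    def page_table_entries(va_bits, page_bits):
        return 1 << (va_bits - page_bits)   # one entry per virtual page

    # 32-bit VA, 4 KB pages: 2^20 entries = 4 MB of table at 4 bytes/entry
    print(page_table_entries(32, 12))       # 1048576 (1M)
    # Alpha: 43-bit VA, 8 KB pages: 2^30 entries per process!
    print(page_table_entries(43, 13))       # 1073741824 (1G)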


Inverted Page Tables

Similar to a set-associative mechanism

• Make the page table reflect the # of physical pages (not virtual)

• Use a hash mechanism

– virtual page number ==> hashed page number (HPN), an index into the inverted page table

– Compare the virtual page number with the tag to make sure it is the one you want

– if it matches
» check to see that the page is in memory – OK if yes; if not, page fault

– if it doesn't match – miss
» go to the full page table on disk to get the new entry

» implies 2 disk accesses in the worst case

» trades an increased worst-case penalty for a decrease in capacity-induced miss rate, since there is now more room for real pages with a smaller page table
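A minimal Python sketch of that lookup, with one table slot per physical frame; the hash function and slot layout are toy choices, and collision handling is omitted.

    NUM_FRAMES = 8

    # one slot per physical frame: (valid bit, VPN tag); frame # == slot index
    inverted = [(False, None)] * NUM_FRAMES

    def lookup(vpn):
        idx = vpn % NUM_FRAMES              # toy hash of the virtual page number
        valid, tag = inverted[idx]
        if valid and tag == vpn:            # tag check: is this the page we want?
            return idx                      # the index itself is the frame number
        # miss: go to the full page table on disk (2 disk accesses, worst case)
        raise RuntimeError("inverted-table miss")

    inverted[3] = (True, 11)                # VPN 11 hashes to slot 3 (11 % 8)
    print(lookup(11))                       # -> 3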


Inverted Page Table

[Figure: the virtual page number is hashed to index the inverted page table; each slot holds a valid bit, a page tag, and a frame number. The stored tag is compared (=) against the page number, and on a match (OK) the frame number is concatenated with the page offset. Only entries for pages in physical memory are stored.]


Address Translation Reality

• The translation process using page tables takes too long!

• Use a cache to hold recent translations

– Translation Lookaside Buffer
» Typically 8-1024 entries

» Block size same as a page table entry (1 or 2 words)

» Only holds translations for pages in memory

» 1 cycle hit time

» Highly or fully associative

» Miss rate < 1%

» Miss goes to main memory (where the whole page table lives)

» Must be purged on a process switch
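In software terms the TLB acts like a small recency-ordered map in front of the page table; a Python sketch (the size and the LRU replacement policy here are illustrative), including the purge on a process switch:

    from collections import OrderedDict

    TLB_ENTRIES = 64
    tlb = OrderedDict()                     # VPN -> PPN, kept in recency order

    def tlb_translate(vpn, page_table):
        if vpn in tlb:                      # hit: 1 cycle
            tlb.move_to_end(vpn)
            return tlb[vpn]
        ppn = page_table[vpn]               # miss: walk the in-memory page table
        if len(tlb) >= TLB_ENTRIES:
            tlb.popitem(last=False)         # evict the least recently used entry
        tlb[vpn] = ppn
        return ppn

    def process_switch():
        tlb.clear()                         # purge: old process's translations

    print(tlb_translate(5, {5: 42}))        # miss, then cached: -> 42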


Back to the 4 Questions

• Q3: Block Replacement (pages in physical memory)

– LRU is best
» So use it to minimize the horrible miss penalty

– However, real LRU is expensive
» Page table contains a use tag

» On access the use tag is set

» OS checks them every so often, records what it sees, and resets them all

» On a miss, the OS decides who has been used the least

– Basic strategy: Miss penalty is so huge, you can spend a few OS cycles to help reduce the miss rate
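A Python sketch of that use-bit approximation (the scan interval and the victim choice are simplified; all names are hypothetical):

    use_bit = {}                            # resident pages: VPN -> use bit
    idle_last_interval = set()              # the OS's record from its last scan

    def touch(vpn):
        use_bit[vpn] = True                 # set by hardware on every access

    def periodic_scan():
        # OS: note which pages went untouched, then reset all the use bits
        idle_last_interval.clear()
        idle_last_interval.update(v for v, used in use_bit.items() if not used)
        for v in use_bit:
            use_bit[v] = False

    def choose_victim():
        # on a page fault, prefer a page that sat idle over the last interval
        if idle_last_interval:
            return next(iter(idle_last_interval))
        return next(iter(use_bit))          # otherwise evict an arbitrary page

    use_bit.update({1: False, 2: False}); touch(2); periodic_scan()
    print(choose_victim())                  # -> 1 (untouched since the scan)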


Last Question

• Q4: Write Policy

– Always write-back
» Due to the access time of the disk

» So, you need to keep tags to show when pages are dirty and need to be written back to disk when they’re swapped out.

– Anything else is pretty silly

– Remember – the disk is SLOW!
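The dirty-tag bookkeeping amounts to a few lines; a Python sketch with dictionaries standing in for memory and disk:

    memory, disk, dirty = {}, {}, {}        # toy stand-ins, keyed by VPN

    def store(vpn, data):
        memory[vpn] = data
        dirty[vpn] = True                   # set on a write; checked at eviction

    def evict(vpn):
        if dirty.pop(vpn, False):
            disk[vpn] = memory[vpn]         # write back only if the page is dirty
        del memory[vpn]                     # a clean page is simply dropped

    store(7, "data"); evict(7)
    print(disk)                             # {7: 'data'} – written back once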


Page Sizes: An Architectural Choice

• Large pages are good:

– reduces page table size

– amortizes the long disk access

– if spatial locality is good then hit rate will improve

• Large pages are bad:

– more internal fragmentation
» if everything is random, each structure's last page is only half full

» Half of bigger is still bigger

» if there are 3 structures per process (text, heap, and control stack)

» then 1.5 pages are wasted for each process

– process start-up time takes longer
» since at least 1 page of each type is required prior to start

» the transfer-time penalty aspect is higher
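The fragmentation point is just expected-value arithmetic; a one-function Python sketch of the slide's model (3 structures, last page half empty on average):

    def expected_waste(page_size, structures=3):
        # each structure's last page is, on average, half empty
        return structures * page_size // 2

    print(expected_waste(4 * 1024))         # 6144 bytes per process at 4 KB pages
    print(expected_waste(64 * 1024))        # 98304 bytes: half of bigger is bigger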


More on TLBs

• The TLB must be on chip

– otherwise it is worthless

– small TLBs are worthless anyway

– large TLBs are expensive
» high associativity is likely

• ==> The price of CPUs is going up!

– OK as long as performance goes up faster


Alpha AXP 21064 TLB

[Figure: a 43-bit virtual address splits into a 30-bit virtual page number and a 13-bit page offset. The VPN is compared against the VPN tags of all 32 fully associative TLB entries; each 56-bit entry holds a 30-bit VPN tag, a 21-bit physical page number, and W/R/V (write/read/valid) protection bits. A 32:1 mux selects the hitting entry's physical page number, which is concatenated with the page offset to form a 34-bit physical address.]


Protection

• Multiprogramming forces us to worry about it

– think about what happens on your workstation
» it would be annoying if your program clobbered your email files

• There are lots of processes

– Implies lots of task switch overhead

– HW must provide savable state

– OS must promise to save and restore properly

– most machines task switch every few milliseconds

– a task switch typically takes several microseconds

– also implies inter-process communication
» which implies OS intervention

» which implies a task switch

» which implies less of the duty cycle gets spent on the application


Protection Options

• Simplest – base and bound (see the sketch after this list)

– 2 registers – check that each address falls between the values
» these registers must be changed by the OS but not the app

– need for 2 modes: regular & privileged
» hence the need to privilege-trap and provide mode-switch ability

• VM provides another option

– check as part of the VA --> PA translation process
» The protection bits reside in the page table & TLB

• Other options

– rings – à la MULTICS and now the Pentium
» inner is most privileged – outer is least

– capabilities (i432) – similar to a key or password model
» the OS hands them out, so they're difficult to forge

» in some cases they can be passed between apps
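A Python sketch of the base-and-bound check from the first bullet, with the two registers modeled as plain variables that only privileged (OS) code may change:

    base, bound = 0x4000, 0x8000            # set by the OS in privileged mode

    def check(vaddr):
        # every user-mode address must fall between the two register values
        if not (base <= vaddr < bound):
            raise RuntimeError("protection trap: address outside base/bound")
        return vaddr

    print(hex(check(0x5000)))               # in range: fine
    # check(0x9000) would raise the protection trap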


VAX 11/780 Example

• Hybrid – segmented + paged

– segments used for system/process separation

– paging used for VM, relocation, and protection

– reasonably common again (was common in the late 60's too)

• Segments – 1 for system and 1 for processes

– high-order address bit 31: 1 – system, 0 – process

– all processes have their own space but share the process segment

– bit 30 divides the process segment into 2 regions

– =1: P1 grows downward (stack); =0: P0 grows upward (text, heap)

• Protection

– pair of base and bound registers for S, P0, and P1

– saves page table space, since the page size was 512 bytes


VAX-11/780 Address Mapping

[Figure: address bits 31:30 select one of 3 separate page tables (SPT, P0PT, P1PT). The 21-bit virtual page number (bits 29:9) is added to the selected page table base and bounds-checked (page index fault if out of range); the resulting entry supplies a 21-bit page frame number, which is concatenated with the 9-bit page offset and sent to the TLB.]


More VAX-11/780

• System page tables are pinned (frozen) in physical memory

– virtual and physical page numbers are the same

– OS handles replacement so it never moves these pages

• P0 and P1 tables are in virtual memory

– hence they can be missed as well
» a double page fault is therefore possible

• Page Table Entry

– M - modify indicating dirty page

– V - valid indicating a real entry

– PROT - 4 protection bits

– 21 bit physical page number

– no reference or use bit - hence hokey LRU accounting - use M


VAX PROT bits

• 4 levels of use – each with its own stack and its own copy of the stack pointer (R14)

– kernel - most trusted

– executive

– supervisor

– user - least trusted

• 3 types of access

– no access, read, or write

• Bizarre encoding

– all 16 4-bit patterns meant something
» if there was a model to their encoding method, it eludes everybody I know

– 1001 – R/W for kernel and exec, R for supervisor, zip for user

VAX 11/780 TLB Parameters

• 2-way set associative but partitioned
» two 64-entry banks – the top 32 entries of each bank are reserved for the system
» on a task switch, only half the TLB needs to be invalidated
» the high-order address bit selects the bank (corresponds to the P0/P1 distinction)
» the split increases the miss rate, but under a high task-switch rate it may still be a win

Parameter              Value
Block Size             1 page table entry (4 bytes)
Hit Time               1 cycle
Average Miss Penalty   22 clock cycles
Miss Rate              1% – 2%
TLB Size               128 entries
Block Selection        Random, but not last used
Block Placement        2-way set-associative
Write Strategy         Explicit by OS


VAX 11/780 TLB Action

• Steps 1 & 2

» bit 31 ## Index passed to both banks (both set members)

» V must be 1 for anything to continue - TLB miss trap if V=0

» PROT bits checked against R/W access type and which kind of user (from the PSW) - TLB protection trap if violation

• Step 3
» page table tag compared with TLB tag

» both banks done in parallel

» both cannot match - if no match then TLB-miss trap

• Step 4
» the matching side's 21-bit physical page address is passed through the MUX

• Step 5
» 21-bit physical page address ## 9-bit page offset = physical address

» if P=1 then cache hit and address goes to the cache - note cache hit time stretch

» if P=0 then page fault trap


The Alpha AXP 21064

Also a page/segment combo

• Segmented 64-bit address space

– Seg0 (addr[63] = 0) & Seg1 (addr[63:62] = 11)
» Mapped into pages for user code
» Seg0, for text and heap sections, grows upward
» Seg1, for the stack, grows downward

– Kseg (addr[63:62] = 10)
» Reserved for the O.S.
» Uniformly protected space, no memory management

• Advantages of the split model

– Segmentation conserves page table space

– Paging provides VM, protection, and relocation

– The O.S. tends to need to be resident anyway


Page Table Problems
The 64-bit address is the first-order cause

• Reduction approach - go hierarchical

– 3 levels – each page table is limited to a single page in size
» the current page size is 8 KB, but 16, 32, and 64 KB are claimed to be supported in the future

» Superpage model extends TLB REACH – used in MIPS R10k

» Uses 34-bit physical addresses (max could be 41)

– Virtual address = [seg-selector Lvl1 Lvl2 Lvl3 Offset]

• Mapping

– Page table base register points to base of LVL1-TBL

– LVL1-TBL[Lvl1] + Lvl2 points to LVL2-TBL entry

– LVL2-TBL entry + Lvl3 points to LVL3-TBL entry

– LVL3-TBL entry provides physical page number finally

– PPN##Offset => physical address for main memory
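A Python sketch of that three-level walk (field widths per the 8 KB page / 1K-entry assumption; nested dicts stand in for the per-level tables):

    OFFSET_BITS, INDEX_BITS = 13, 10        # 8 KB pages, 1K entries per table

    def walk(l1_table, vaddr):
        mask = (1 << INDEX_BITS) - 1
        off = vaddr & ((1 << OFFSET_BITS) - 1)
        l3 = (vaddr >> OFFSET_BITS) & mask
        l2 = (vaddr >> (OFFSET_BITS + INDEX_BITS)) & mask
        l1 = (vaddr >> (OFFSET_BITS + 2 * INDEX_BITS)) & mask
        l2_table = l1_table[l1]             # lookup 1 \
        l3_table = l2_table[l2]             # lookup 2  } 3 sequential lookups
        ppn = l3_table[l3]                  # lookup 3 /
        return (ppn << OFFSET_BITS) | off   # PPN ## offset

    tables = {0: {0: {5: 42}}}              # maps VPN (0, 0, 5) to PPN 42
    print(hex(walk(tables, 5 << OFFSET_BITS)))   # -> 0x54000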

Alpha VPN => PPN Mapping

[Figure: the virtual address = seg0/1 selector ## level-1 index ## level-2 index ## level-3 index ## page offset. The page table base register locates the L1 table; the selected L1 entry locates the L2 table, the L2 entry locates the L3 table, and the L3 entry supplies the physical page frame number, which is concatenated with the page offset to form the physical address. Note the potential delay problem – 3 sequential lookups.]


Page Table Entries
64-bit entries ==> 1K entries per table

• low-order 32 bits = page frame number

• 5 protection fields

– valid – i.e., OK to do address translation

– user Read Enable

– kernel Read Enable

– user Write Enable

– kernel Write Enable

• Other fields

– system accounting – like a USE field

– high-order bits basically unused – the real virtual address = 43 bits

» common hack to save chip area in the early implementations - different now

– it will be interesting to see the OS problems as VM goes to bigger pages


Alpha TLB Stats

• Contiguous pages mapped as 1

– Option of 8, 64, or 512 pages – also extends TLB reach

– Complicates the TLB somewhat…
» Separate ITLB and DTLB

Parameter              Description
Block Size             1 PTE (8 bytes)
Hit Time               1 cycle
Average Miss Penalty   20 cycles
ITLB Size              96 bytes (12 entries for 8 KB pages, 4 entries for max-4 MB superpages)
DTLB Size              32 PTEs, for any size superpage
Block Replacement      Random, but not last used
Block Placement        Fully associative
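TLB reach is just entries × page size, scaled by the superpage grouping; a quick Python sketch of the arithmetic:

    def tlb_reach(entries, page_size, pages_per_entry=1):
        # contiguous pages mapped as one entry multiply the coverage
        return entries * page_size * pages_per_entry

    print(tlb_reach(32, 8 * 1024))          # 262144: 256 KB with plain 8 KB pages
    print(tlb_reach(32, 8 * 1024, 512))     # 134217728: 128 MB when 512 pages map as one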


The Whole Alpha Memory System

• Boot

– initial Icache fill from boot PROM

– hence no Valid bit needed - Vbits in Dcache are cleared

– PC set to kseg, so no address translation is needed and the TLB can be bypassed

– then start loading the TLB for real from the kseg OS code
» subsequent misses call PAL (Privileged Architecture Library) code to remap and fill the miss

• Once ready to run user code

– OS sets PC to appropriate seg0 address

– then the memory hierarchy is ready to go


Things to Note

• Prefetch stream buffer

– holds next Ifetch access so an Icache miss checks there first

• L1 is write through with a write buffer

– avoids CPU stall on a write miss

– 4 block capacity

– write merge – if the block is the same, the writes are merged

• L2 is write back with a victim buffer

– 1 block – so the stall on a dirty L2 miss is delayed until the 2nd miss

– normally this is safe

• Full datapath shown in Figure 5.47

– worth the time to follow the numbered balls


Alpha 21064 I-fetch

• I-Cache hit (1 cycle)
» Step 1: Send the address to the ITLB and I-Cache
» Step 2: I-Cache lookup (8 KB, DM, 32-byte blocks)
» Step 3: ITLB lookup (12 entries, FA)
» Step 4: Cache hit and valid PTE
» Step 5: Send 8 bytes to the CPU

• I-Cache miss, prefetch buffer (PFB) hit (1 cycle)
» Step 6: Start the L2 access, just in case
» Step 7: Check the prefetch buffer
» Step 8: Prefetch buffer hit – send 8 bytes to the CPU
» Step 9: Refill the I-Cache from the PFB, and cancel the L2 access

• I-Cache miss, prefetch buffer miss, L2 cache hit
» Step 10: Check the L2 cache tag
» Step 11: Return the critical 16 B (5 cycles)
» Step 12: Return the other 16 B (5 cycles)
» Step 13: Prefetch the next sequential block into the PFB (10 cycles)


Alpha 21064 L2 Cache Miss

• L2 cache miss
» Step 14: Send a new request to main memory

» Step 15: Put dirty victim block in victim buffer

» Step 16: Load new block in L2 cache, 16B at a time

» Step 17: Write Victim buffer to memory

• Data loads are like instruction fetches, except use DTLB and D-Cache instead of the ITLB and I-Cache

• Allows hits under miss

• On a read miss, the write buffer is flushed first to avoid RAW hazards


Alpha 21064 Data Store

• Data store
» Step 18: DTLB lookup and protection-violation check

» Step 19: D-Cache lookup (8KB, DM, write through)

» Step 22: Check D-Cache tag

» Step 24: Send data to write buffer

» Step 25: Send data to delayed write buffer in front of D-Cache

• Write hits are pipelined

» Step 26: Write previous delayed write buffer to D-Cache

» Step 27: Merge data into write buffer

» Step 28: Write data at the head of write buffer to L2 cache (15 cycles)


AXP Memory is Quite Complex
How well does it work?

• Table 5.48 in the book tells…

• Summary – interesting to note:

– Alpha 21064 is a dual-issue machine, but nowhere is the CPI < 1

– The worst cases are the TPC benchmarks
» Memory intensive, so more stalls
» CPI bloats to 4.3 (1.8 of which is from the caches)
» 1.67 due to other stalls (i.e., load dependencies)

– Average SPECint CPI is 1.86
» .77 contributed by caches, .74 by other stalls

– Average SPECfp CPI is 2.14
» .45 by caches, .98 by other stalls

Superscalar issue is sorely limited!


Summary 1

• CPU performance is outpacing main memory performance

• Principle of locality saves us: thus the Memory Hierarchy

• The memory hierarchy should be designed as a system

• Key old ideas:

– Bigger caches

– Higher set-associativity

• Key new ideas:

– Non-blocking caches

– Ancillary caches

– Multi-port caches

– Prefetching (software and/or hardware controlled)

Summary 2: Typical Choices

Option                 TLB                     L1 Cache          L2 Cache          VM (page)
Block Size             4-8 bytes (1 PTE)       4-32 bytes        32-256 bytes      4 KB-16 KB
Hit Time               1 cycle                 1-2 cycles        6-15 cycles       10-100 cycles
Miss Penalty           10-30 cycles            8-66 cycles       30-200 cycles     700K-6M cycles
Local Miss Rate        .1-2%                   .5-20%            13-15%            .00001-.001%
Size                   32 B-8 KB               1-128 KB          256 KB-16 MB      —
Backing Store          L1 Cache                L2 Cache          DRAM              Disks
Q1: Block Placement    Fully or set assoc.     DM                DM or SA          Fully associative
Q2: Block ID           Tag/block               Tag/block         Tag/block         Table
Q3: Block Replacement  Random (not last used)  N.A. for DM       Random (if SA)    LRU/LFU
Q4: Writes             Flush on PTE write      Through or back   Write-back        Write-back