Memory Hierarchy
How to improve memory access
Outline
• Locality
• Structure of memory hierarchy
• Cache
• Virtual memory
Locality
• Principle of locality
– Programs access a relatively small portion of their address space at any instant of time.
• Temporal locality
– If an item is referenced, it tends to be referenced again soon.
• Spatial locality
– If an item is referenced, items nearby tend to be referenced soon.
Memory Hierarchy
• Multiple levels of memory with different speeds and sizes.
• Give users the perception that the memory is as large as the largest level and as fast as the fastest level.
• The unit of memory considered in a memory hierarchy is a block.
[Figure: CPU registers, SRAM, DRAM, and magnetic disk arranged as a hierarchy]
Structure of memory hierarchy
[Figure: CPU registers → SRAM → DRAM → magnetic disk; going down the hierarchy, size increases while speed and cost per bit decrease]
Structure of memory hierarchy

Memory type          Access time    Cost per bit
Registers            ~0.2 ns
SRAM: Static RAM     0.5 – 5 ns     $4,000 – $10,000
DRAM: Dynamic RAM    50 – 70 ns     $100 – $200
Magnetic Disk        5 – 20 ms      $0.5 – $2
Cache
• A level of the memory hierarchy between the CPU and main memory.
[Figure: registers → cache → memory → disk; the illusion presented at each level: "Everything you need is in a register", "Everything you need is in cache", "Everything you need is in memory"]
How to improve memory access time
[Figure: blocks A–D held in the CPU registers and cache; blocks a–h spread across memory and disk; blocks move up the hierarchy as the CPU requests them]
Address Space
Suppose
• 1 block = 256 bytes = 2^8 bytes
• the cache has 8 blocks
• the memory has 32 blocks
• the disk has 64 blocks.
Then,
• the cache has 8 × 2^8 = 2^11 bytes
• the memory has 32 × 2^8 = 2^13 bytes
• the disk has 64 × 2^8 = 2^14 bytes.
• For the cache, a block number has 3 bits, and an address has 11 bits.
• For the memory, a block number has 5 bits, and an address has 13 bits.
• For the disk, a block number has 6 bits, and an address has 14 bits.
[Figure: an 11-, 13-, or 14-bit address selects one byte (8 bits) of data in the cache, memory, or disk, respectively]
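The arithmetic above can be checked with a short sketch (the sizes are the example's: 256-byte blocks; 8, 32, and 64 blocks):

```python
BLOCK_BYTES = 2 ** 8  # 256 bytes per block, as in the example

for name, blocks in [("cache", 8), ("memory", 32), ("disk", 64)]:
    total = blocks * BLOCK_BYTES
    block_bits = blocks.bit_length() - 1  # log2, exact for powers of two
    addr_bits = total.bit_length() - 1
    print(f"{name}: {total} bytes = 2^{addr_bits}; "
          f"{block_bits}-bit block number, {addr_bits}-bit address")
```

Running it reproduces the bit widths on the slide: 3/11 bits for the cache, 5/13 for memory, and 6/14 for the disk.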
Address Space
[Figure: cache block numbers 000–111 (8 blocks), memory block numbers 00000–11111 (32 blocks), disk block numbers 000000–111111 (64 blocks)]
Address: block number || offset in block
• Address in cache:  xxx || xxxxxxxx
• Address in memory: xxxxx || xxxxxxxx
• Address in disk:   xxxxxx || xxxxxxxx
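This split can be sketched with bit operations (a minimal sketch, using the example's 8-bit block offset):

```python
OFFSET_BITS = 8  # 256-byte blocks, as in the example

def split_address(addr):
    """Split an address into (block number, offset in block)."""
    return addr >> OFFSET_BITS, addr & ((1 << OFFSET_BITS) - 1)

# A 13-bit memory address: 5-bit block number 10110, 8-bit offset 01110001
block, offset = split_address(0b10110_01110001)
```

The high-order bits name the block; the low-order bits locate the byte within it.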
Hit / Miss
Hit
• The requested data is found in the upper level of the hierarchy.
Hit rate (or hit ratio)
• The fraction of memory accesses found in the upper level.
Hit time
• The time to access data when it hits (= time to check whether the data is in the upper level + access time).
Miss
• The requested data is not found in the upper level, but is in the lower level, of the hierarchy.
Miss rate (or miss ratio)
• 1 – hit rate
Miss penalty
• The time to get a block of data into the upper level, and then into the CPU.
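These quantities combine into the standard average memory access time formula (not stated on the slide, but it follows directly from the definitions): every access pays the hit time, and the fraction of accesses that miss additionally pays the miss penalty.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Illustrative numbers (assumptions): 1 ns hit time, 5% miss rate,
# 100 ns miss penalty -> about 6 ns on average.
avg = amat(hit_time=1.0, miss_rate=0.05, miss_penalty=100.0)
```

The formula makes explicit why reducing the miss rate or the miss penalty improves overall memory performance.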
Cache
• A level of the memory hierarchy between the CPU and main memory.
• To access data in the memory hierarchy
– CPU requests data from the cache.
– Check if the data is in the cache.
• Cache hit
– Transfer the requested data from cache to CPU.
• Cache miss
– Transfer a block containing the requested data from memory to cache.
– Transfer the requested data from cache to CPU.
How cache works
[Figure: the CPU requests A, B, C, D, E, F in turn; each first request misses and its block is brought from memory (A B C D E F) into the cache; a repeated request hits; once the cache is full, a block must be replaced]
Where to place a block in cache
Direct-mapped cache
• Each memory location is mapped to exactly one location in the cache.
(But one cache location can be mapped to different memory locations at different times.)
• Other mappings can be used.
[Figure: cache-memory mapping; memory blocks b0, b1, b2, … map onto cache blocks c0–c3]
Direct-mapped cache
[Figure: memory addresses 000000–111111 in 4-byte blocks; each memory block maps to exactly one of the four cache slots 00–11, selected by the low two bits of its block number]
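The direct mapping in the figure reduces to a modulo: a block's only possible slot is its block number modulo the number of cache blocks (a sketch, using the figure's four slots):

```python
NUM_CACHE_BLOCKS = 4  # cache slots 00..11, as in the figure

def cache_slot(block_number):
    """Direct-mapped placement: block number mod number of cache blocks,
    i.e. the low-order bits of the block number."""
    return block_number % NUM_CACHE_BLOCKS

# Memory blocks 0, 4, 8, ... all compete for slot 0; 1, 5, 9, ... for slot 1.
slots = [cache_slot(b) for b in range(8)]
```

Because many memory blocks share one slot, the tag (below) is needed to tell which one is currently resident.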
Fully-associative cache
[Figure: the same memory (4-byte blocks); any memory block can be placed in any of the four cache slots 00–11]
Set-associative cache
[Figure: cache slots 000–111 grouped into sets; each memory block maps to exactly one set but may occupy any slot within that set]
Determine if a block is in the cache
• For each block in the cache
– Valid bit
• Indicates that the block contains valid data.
– Tag
• Contains the information identifying the associated block in memory.
• Example:
– If the valid bit is false, no block from memory is stored in that block of the cache.
– If the valid bit is true, the address of the data stored in the block is stored in the tag.
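A lookup using the valid bit and tag can be sketched as follows (the 2-bit index and 2-bit offset widths are illustrative assumptions, not the slide's parameters):

```python
# Address layout (assumed): tag | 2-bit cache index | 2-bit offset in block
INDEX_BITS, OFFSET_BITS = 2, 2

class Line:
    def __init__(self):
        self.valid = False  # no memory block stored here yet
        self.tag = 0

cache = [Line() for _ in range(1 << INDEX_BITS)]

def access(addr):
    """Return True on a hit; on a miss, install the block's tag and set
    the valid bit (the data transfer itself is not modeled)."""
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    line = cache[index]
    if line.valid and line.tag == tag:
        return True   # valid AND tags match
    line.valid, line.tag = True, tag
    return False

access(0b110100)  # miss: the valid bit was still 0
access(0b110101)  # hit: same block, different offset within it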
Example: direct-mapped
[Figure: memory blocks 000000–111111; each of the four cache slots has a valid bit and a tag holding the high bits of the resident memory block number; valid bits 1 1 0 1, tags 01 11 11 00]
Example: Fully-associative
[Figure: all valid bits are 1; each tag holds a full memory block number: 1101, 0101, 1110, 0110]
Example: set-associative cache
[Figure: eight cache slots 000–111 grouped into sets; valid bits 1 1 1 0 0 0 1 0, tags 11 01 11 00 01 11 00 00]
Access a direct-mapped cache

Cache index   Valid bit   Tag
000           1           00111
001           1           10011
010           0           11000
…             …           …
111           1           01101

[Figure: the memory address 1001100100110 splits into tag 10011, cache index 001, and offset 00110; the index selects one cache entry, the stored tag is compared (=) with the address tag, and hit = (tags equal) AND (valid bit = 1); the cache address is the index concatenated with the offset]
Access a fully-associative cache

Cache index   Valid bit   Tag
000           1           00111000
001           1           10011001
010           0           11000111
…             …           …
111           1           01101100

[Figure: the tag of the memory address 1001100100110 is compared in parallel (AND-ed with each valid bit) against every entry's tag; entry 001 matches, giving cache address 00100110]
Access a set-associative cache

Cache index   Valid bit   Tag     Valid bit   Tag
000           0           10001   1           11000
001           1           11001   1           00001
010           1           11010   0           11000
…             …           …       …           …
111           1           01111   1           01111

[Figure: the index selects one set; within the set, the address tag is compared in parallel with each way's tag, each comparison AND-ed with that way's valid bit, to form the cache address on a hit]
Access a set-associative cache

Cache index   Valid bit   Tag     Valid bit   Tag
000           1           00000   1           01000
001           1           11001   1           00001
010           1           11010   0           10010
…             …           …       …           …
111           0           01101   0           01101

[Figure: the memory address is split and checked against both ways of the selected set; each way produces a hit signal (hit0, hit1) = (tag match) AND (valid bit)]
Block size vs. Miss rate
[Figure: miss rate as a function of block size]
Handling Cache Misses
• If an instruction is not in the cache, we have to wait for the memory to respond and write the data into the cache (multiple cycles).
• This causes a processor stall.
• Steps to handle a miss
– Send PC − 4 to memory (the PC has already been incremented, so the missing instruction's address is PC − 4).
– Read from memory to cache and wait for the result.
– Update the cache information (tag + valid bit).
– Restart the instruction execution.
Handling Writes
Write-through
• When data is written, both the cache and the memory are updated.
• Keeps the copies in cache and memory consistent.
• Slow, because writing to memory is slower.
• Improve by using a write buffer that stores data waiting to be written to memory; the processor can then continue execution.

Write-back
• When data is written, only the cache is updated.
• Memory is inconsistent with the cache.
• Faster.
• But once a block is removed from the cache, it must be written back to memory.
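The write-back behavior can be sketched with a dirty bit (a hypothetical model; real caches track this per block in hardware):

```python
# Write-back sketch: writes go to the cache only and set a dirty bit;
# memory sees the new value when the block is removed from the cache.
memory = {7: "old value"}  # block number -> contents
cache = {}                 # block number -> (contents, dirty bit)

def write(block, data):
    cache[block] = (data, True)  # update cache only; mark the block dirty

def evict(block):
    data, dirty = cache.pop(block)
    if dirty:
        memory[block] = data     # write the dirty block back on removal

write(7, "new value")
# memory[7] is still "old value" here: cache and memory are inconsistent
evict(7)
# only now does memory[7] hold "new value"
```

Under write-through, the `write` above would update `memory` immediately, trading speed for consistency.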
Performance Improvement
• Increase hit rate / reduce miss rate
– Increase the cache size
– Choose a good block size
– Good cache associativity
– Good replacement policy
• Reduce cache access time
– Multilevel cache
Multilevel Cache
[Figure: CPU → L1 cache → L2 cache → Memory]

Processor                 L1 cache   L2 cache
Pentium                   16 KB
Pentium Pro               16 KB      256/512 KB
Pentium MMX               32 KB
Pentium II and III        32 KB
Celeron                   32 KB      128 KB
Pentium III Cumine        32 KB      256 KB
AMD K6 and K6-2           64 KB
AMD K6-3                  64 KB      256 KB
AMD K7 Athlon             128 KB
AMD Duron                 128 KB     64 KB
AMD Athlon Thunderbird    128 KB     256 KB
Virtual Memory
• Similar to cache
– Based on the principle of locality.
– Memory is divided into equal blocks called pages.
– If a requested page is not found in memory, a page fault occurs.
• Allows efficient and safe sharing of memory among multiple programs
– Each program has its own address space.
• Virtually extends the memory size
– A program can be larger than the memory.
Virtual Memory
[Figure: programs A, B, and C each have their own virtual address space; address translation maps virtual addresses to physical addresses in main memory, with swap space on disk]
Virtual Memory
[Figure: program A's virtual address space mapped onto main memory]
• Virtual address space can be larger than physical address space.
Address Calculation
• A virtual address = virtual page number || page offset.
• A physical address = physical page number || page offset.
• Address translation (via the page table) maps the virtual page number to a physical page number; the page offset is unchanged.
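The translation step can be sketched as follows (the 4 KB page size and the mapping values are illustrative assumptions; the slide leaves them open):

```python
PAGE_OFFSET_BITS = 12  # assume 4 KB pages for illustration
page_table = {0x0: 0xA, 0x1: 0x7}  # virtual page number -> physical page number

def translate(vaddr):
    """Replace the virtual page number with the physical page number;
    the page offset passes through unchanged."""
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    ppn = page_table[vpn]  # a missing entry would mean a page fault
    return (ppn << PAGE_OFFSET_BITS) | offset

paddr = translate(0x1ABC)  # vpn 0x1 maps to ppn 0x7; offset 0xABC unchanged
```

Only the page-number bits change; the low-order offset bits are identical in the virtual and physical addresses.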
Page Table
[Figure: the page table register points to the page table; the virtual page number indexes the table, whose entry holds a valid bit and a physical page number; the physical page number is concatenated with the page offset to form the physical address]

Virtual page number   Valid bit   Physical page number
0000…000              1
0000…001              1
…                     …
0011…110              0
1111…111              1
Page fault
• When the valid bit of the requested page = 0, a page fault occurs.
• Handling a page fault
– Get the requested page from disk (using information in the page table).
– Find an available page frame in memory.
• If there is one, put the requested page in and update its entry in the page table.
• If there is none, find a page to be replaced (according to the page replacement policy), replace it, and update both entries in the page table.
Page Replacement
• Page replacement policy
– Least recently used (LRU): replace the page that has not been used for the longest time.
• Updating data in the virtual memory
– If the replaced page was changed (written to), the page must be updated in the virtual memory.
– Write-back is more efficient here than write-through.
– If the replaced page was not changed, no virtual memory update is necessary.
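LRU replacement can be sketched with an ordered mapping (a minimal model; the three-frame memory and the reference sequence are illustrative assumptions):

```python
from collections import OrderedDict

NUM_FRAMES = 3             # assume memory holds three pages
frames = OrderedDict()     # page -> contents, least recently used first

def reference(page):
    """Reference a page, evicting the LRU page on a fault when full."""
    if page in frames:
        frames.move_to_end(page)       # this page is now most recently used
        return "hit"
    if len(frames) >= NUM_FRAMES:
        frames.popitem(last=False)     # evict the least recently used page
    frames[page] = None                # load the page (contents not modeled)
    return "fault"

results = [reference(p) for p in (1, 2, 3, 1, 4)]
# Referencing 4 evicts page 2, not page 1: 1 was used more recently.
```

The ordering of the dictionary plays the role of the use/reference information a real system keeps per page.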
Other information in page tables
• Use/reference bit
– Used for the LRU policy.
• Dirty bit
– Used for updating the virtual memory.
Translation-lookaside buffer (TLB)
• A cache that stores recently used page table entries for efficiency.
• When the operating system switches from process A to process B (called a context switch), A's page table entries must be replaced by B's page table entries in the TLB.
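The TLB's role can be sketched as a small table in front of the page table (the sizes, mapping, and FIFO eviction are illustrative assumptions; real TLBs are hardware structures):

```python
page_table = {vpn: vpn + 100 for vpn in range(64)}  # made-up translations
tlb = {}
TLB_SIZE = 4  # a TLB holds only a handful of recent entries

def lookup(vpn):
    """Return the physical page number, trying the TLB before the table."""
    if vpn in tlb:
        return tlb[vpn]              # TLB hit: no page-table access needed
    ppn = page_table[vpn]            # TLB miss: read the page table
    if len(tlb) >= TLB_SIZE:
        tlb.pop(next(iter(tlb)))     # make room (simple FIFO eviction)
    tlb[vpn] = ppn
    return ppn

def context_switch():
    tlb.clear()  # the old process's translations are no longer valid

lookup(5)         # miss: fills the TLB from the page table
lookup(5)         # hit: answered from the TLB
context_switch()  # the TLB must be refilled for the new process
```

This is why frequent context switches hurt performance: each one discards the accumulated translations.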
[Figure: processes A, B, and C each have a swap space on disk and parts resident in memory; each has its own page table; the TLB holds the currently used part of the page table, and the cache holds the currently used data and program]
Three C’s
Effects of the three C’sCompulsory misses are too small to be seen in this graph.
One-way set associativity
two-way set
associativity
Four and eight-way set associativity
Design factors