TRANSCRIPT
ECE200 – Computer Organization
Chapter 7 – Large and Fast: Exploiting Memory
Hierarchy
Copyright 2003 David H. Albonesi and the University of Rochester.
Outline for Chapter 7 lectures
Motivation for, and concept of, memory hierarchies
Caches
Main memory
Characterizing memory hierarchy performance
Virtual memory
Real memory hierarchies
The memory dilemma
Chapter 6 assumption: on-chip instruction and data memories hold the entire program and its data and can be accessed in one cycle
Reality check
In high-performance machines, programs may require hundreds of megabytes or even gigabytes of memory to run
Embedded processors have smaller needs, but there is also less room for on-chip memory
Basic problem
We need much more memory than can fit on the microprocessor chip
But we do not want to incur stall cycles every time the pipeline accesses instructions or data
At the same time, we need the memory to be economical for the machine to be competitive in the market
Solution: a hierarchy of memories
Another view
Typical characteristics of each level
First level (L1) is separate on-chip instruction and data caches placed where our instruction and data memories reside
16-64KB for each cache (desktop/server machine)
Fast, power-hungry, not-so-dense static RAM (SRAM)
Second level (L2) consists of another larger unified cache
Holds both instructions and data
256KB-4MB
On- or off-chip
SRAM
Third level is main memory
64-512MB
Slower, lower-power, denser dynamic RAM (DRAM)
Final level is I/O (e.g., disk)
Caches and the pipeline
L1 instruction and data caches and L2 cache
[Figure: the two L1 caches sit in the pipeline where the instruction and data memories were; both connect to the unified L2 cache, which connects to main memory]
Memory hierarchy operation
(1) Search L1 for the instruction or data
If found (cache hit), done
(2) Else (cache miss), search the L2 cache
If found, place it in L1 and repeat (1)
(3) Else, search main memory
If found, place it in L2 and repeat (2)
(4) Else, get it from I/O (Chapter 8)
Steps (1)-(3) are performed in hardware
1-3 cycles to get from the L1 caches
5-20 cycles to get from the L2 cache
50-200 cycles to get from main memory
(a latency sketch follows the figure below)
[Figure: the search flows from L1 to L2 to main memory]
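To make these numbers concrete, here is a minimal C sketch of the average access time this search process produces, assuming illustrative latencies from the ranges above and assumed hit rates (not measured values):

    #include <stdio.h>

    int main(void) {
        /* Illustrative numbers only: latencies from the ranges quoted
           above, hit rates assumed for the sake of the example. */
        double l1_latency  = 2;     /* cycles, within 1-3     */
        double l2_latency  = 10;    /* cycles, within 5-20    */
        double mem_latency = 100;   /* cycles, within 50-200  */
        double l1_hit = 0.95;       /* assumed: >90% hit in L1 */
        double l2_hit = 0.80;       /* assumed L2 hit rate     */

        /* Each miss pays the cost of searching the next level too. */
        double amat = l1_latency
                    + (1 - l1_hit) * (l2_latency
                    + (1 - l2_hit) * mem_latency);

        printf("average access time = %.2f cycles\n", amat);
        return 0;
    }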
Principle of locality
Programs access a small portion of memory within a short time period
Temporal locality: recently accessed memory locations will likely be accessed soon
Spatial locality: memory locations near recently accessed locations will likely be accessed soon
POL makes memory hierarchies work
A large percentage of the time (typically >90%) the instruction or data is found in L1, the fastest memory
Cheap, abundant main memory is accessed more rarely
The memory hierarchy operates at nearly the speed of the expensive on-chip SRAM but at roughly the cost per byte of main memory (DRAM)
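As a hypothetical illustration of both kinds of locality, the C loop below sums an array: each element sits next to the previously accessed one (spatial locality), while the sum variable and the loop instructions are reused every iteration (temporal locality).

    #include <stdio.h>

    int main(void) {
        int a[1024];
        for (int i = 0; i < 1024; i++)
            a[i] = i;

        int sum = 0;
        /* Spatial locality: a[i] is adjacent to a[i-1], so most
           accesses hit in a block brought in by an earlier miss.
           Temporal locality: sum, i, and the loop instructions are
           reused on every iteration. */
        for (int i = 0; i < 1024; i++)
            sum += a[i];

        printf("sum = %d\n", sum);
        return 0;
    }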
Caches
Caches are small, fast, memories that hold recently accessed instructions and/or data
Separate L1 instruction and L1 data caches
Need simultaneous access of instructions and data in pipelines
L2 cache holds both instructions and data
Simultaneous access not as critical since >90% of the instructions and data will be found in L1
The PC or effective address from L1 is sent to L2 to search for the instruction or data
How caches exploit the POL
On a cache miss, a block of several instructions or data words, including the requested item, is returned
The entire block is placed into the cache so that future searches for items in the block will be successful
[Figure: a four-word block containing instruction i (the requested instruction) along with instructions i+1, i+2, and i+3]
Consider sequence of instructions and data accesses in this loop with a block size of 4 words
Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop
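One way to see the payoff is to simulate just the data-side behavior of this loop. The sketch below is a simplification that tracks only the most recently touched 16-byte block, assumes $s1 starts at 4092 (the last word of a block) and that the loop runs 16 times, and ignores instruction fetches; it shows the lw missing once per block (once every four iterations) while the sw to the same word always hits.

    #include <stdio.h>

    #define BLOCK_BYTES 16   /* 4-word blocks, as in the example */

    int main(void) {
        unsigned addr = 4092;    /* assumed starting value of $s1 */
        long last_block = -1;    /* block holding the last access */
        int hits = 0, misses = 0;

        for (int i = 0; i < 16; i++) {
            for (int access = 0; access < 2; access++) {  /* lw, sw */
                long block = addr / BLOCK_BYTES;
                if (block == last_block) hits++;
                else { misses++; last_block = block; }
            }
            addr -= 4;           /* addi $s1, $s1, -4 */
        }
        printf("data hits=%d misses=%d\n", hits, misses);
        return 0;                /* prints: data hits=28 misses=4 */
    }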
Four Questions for Memory Hierarchy
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)
The cache is much smaller than main memory
Multiple memory blocks must share the same cache location
Searching the cache
[Figure: many memory blocks map to the same cache block location]
Need a way to determine whether the desired instruction or data is held in the cache
Need a scheme for replacing blocks when a new block needs to be brought in on a miss
Cache organization alternatives
Direct mapped: each block can be placed in only one cache location
Set associative: each block can be placed in any of n cache locations
Fully associative: each block can be placed in any cache location
Searching for block 12 in caches of size 8 blocks
[Figure: the locations block 12 may occupy in direct mapped, set associative, and fully associative caches]
Searching a direct mapped cache
Need log2(number of sets) address bits (the index) to select the set
The block offset is used to select the desired byte, half-word, or word within the block
The remaining bits (the tag) are used to determine whether this is the desired block or another that shares the same cache location
[Figure: memory address divided into tag, index, and block offset fields]
Assume a data cache with 16-byte blocks and 8 sets:
4 block offset bits
3 index bits
25 tag bits
The block is placed in the set given by the index
Number of sets = cache size / block size
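A small C sketch of this field extraction for the example cache (16-byte blocks, 8 sets, 32-bit addresses); the example address is arbitrary.

    #include <stdio.h>
    #include <stdint.h>

    #define OFFSET_BITS 4   /* 16-byte blocks -> 4 block offset bits */
    #define INDEX_BITS  3   /*  8 sets        -> 3 index bits        */

    int main(void) {
        uint32_t addr = 0x12345678;  /* arbitrary example address */

        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);  /* 25 bits */

        printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
        return 0;
    }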
Direct mapped cache organization
64KB instruction cache with 16-byte (4-word) blocks
4K sets (64KB/16B), so 12 address bits are needed to select a set
The data section of the cache holds the instructions
The tag section holds the part of the memory address used to distinguish different blocks
A valid bit associated with each set indicates whether its contents are valid
Direct mapped cache access
The index bits are used to select one of the sets
The data, tag, and valid bit from the selected set are accessed simultaneously
The tag from the selected set is compared with the tag field of the address
A match between the tags and a valid bit that is set indicates a cache hit
The block offset selects the desired instruction
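The same access sequence sketched in C. This is an illustrative software model, not the hardware: one valid bit, tag, and block per set, with a hit declared exactly as described above (tags match and the valid bit is set).

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    #define SETS        4096   /* 64KB / 16-byte blocks = 4K sets */
    #define OFFSET_BITS 4
    #define INDEX_BITS  12

    struct dm_cache {
        bool     valid[SETS];
        uint32_t tag[SETS];
        uint8_t  data[SETS][16];   /* one 16-byte block per set */
    };

    /* Returns true on a hit; on a hit, *word receives the selected word. */
    bool dm_lookup(struct dm_cache *c, uint32_t addr, uint32_t *word) {
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);

        if (c->valid[index] && c->tag[index] == tag) {   /* cache hit */
            memcpy(word, &c->data[index][offset & ~3u], 4);
            return true;                /* block offset selects the word */
        }
        return false;   /* miss: the block must come from the next level */
    }

    int main(void) {
        static struct dm_cache c;   /* zero-initialized: valid bits clear */
        uint32_t w;
        printf("hit = %d (empty cache, so a miss)\n",
               dm_lookup(&c, 0x1000, &w));
        return 0;
    }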
Set associative cache
A block is placed in one way of the set given by the index
Number of sets = cache size / (block size × ways)
[Figure: 4-way set associative cache, ways 0-3]
Set associative cache operation
The index bits are used to select one of the sets
The data, tags, and valid bits from all ways of the selected set are accessed simultaneously
The tags from all ways of the selected set are compared with the tag field of the address
A match between the tags and a valid bit that is set indicates a cache hit (a hit in way 1 is shown in the figure)
The data from the way that had a hit is returned through the MUX
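A matching sketch for a 4-way set associative lookup, with assumed sizes (1K sets, 16-byte blocks); the loop stands in for the parallel comparators, and returning the matching way's word stands in for the MUX.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define WAYS        4
    #define SETS        1024   /* illustrative size */
    #define OFFSET_BITS 4
    #define INDEX_BITS  10

    struct sa_set {
        bool     valid[WAYS];
        uint32_t tag[WAYS];
        uint32_t data[WAYS][4];  /* 4 words per 16-byte block */
    };

    /* Returns the way that hit, or -1 on a miss. */
    int sa_lookup(struct sa_set sets[SETS], uint32_t addr, uint32_t *word) {
        uint32_t index = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
        uint32_t woff  = (addr >> 2) & 3;      /* word within the block */
        struct sa_set *s = &sets[index];

        /* In hardware all ways are compared in parallel;
           this loop models those parallel comparators. */
        for (int way = 0; way < WAYS; way++) {
            if (s->valid[way] && s->tag[way] == tag) {
                *word = s->data[way][woff];    /* MUX selects this way */
                return way;
            }
        }
        return -1;
    }

    int main(void) {
        static struct sa_set sets[SETS];
        uint32_t w;
        printf("way = %d (miss on an empty cache)\n",
               sa_lookup(sets, 0x1234, &w));
        return 0;
    }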
Fully associative cache
A block can be placed in any location in the single set
[Figure: fully associative cache; the tag field of the address is compared with the tags of all entries in parallel, and a MUX steers the data from the matching valid entry to the output, along with a hit signal]
Different degrees of associativity
Four different caches of size 8 blocks: direct mapped, 2-way set associative, 4-way set associative, and fully associative
Cache misses
A cache miss occurs when the block is not found in the cache
The block is requested from the next level of the hierarchy
When the block returns, it is loaded into the cache and provided to the requester
A copy of the block remains in the lower levels of the hierarchy
The cache miss rate is found by dividing the total number of misses by the total number of accesses (misses/accesses)
The hit rate is 1 - miss rate
[Figure: a miss is passed down through L1, L2, and main memory]
Classifying cache misses
Compulsory misses: caused by the first access to a block that has never been in the cache
Capacity misses: due to the cache not being big enough to hold all the blocks that are needed
Conflict misses: due to multiple blocks competing for the same set
A fully associative cache with a "perfect" replacement policy has no conflict misses
Cache miss classification examples
Direct mapped cache of size two blocks
Blocks A and B map to set 0, C and D to set 1
Access pattern 1: A, B, C, D, A, B, C, D
Access pattern 2: A, A, B, A
[Figure: cache contents of sets 0 and 1 after each access in the two patterns; the classification of each miss is left as "??" to work out]
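To check your answers, here is a small simulation of this two-block direct mapped cache under the assumed mapping (A and B to set 0, C and D to set 1). It labels a miss compulsory when the block has never been in the cache; deciding whether the remaining misses are conflict or capacity misses is still left to you.

    #include <stdio.h>
    #include <stdbool.h>

    int set_of(char b) { return (b == 'A' || b == 'B') ? 0 : 1; }

    void run(const char *pattern) {
        char cache[2] = {0, 0};     /* one block per set */
        bool seen[26] = {false};    /* has this block ever been cached? */

        for (const char *p = pattern; *p; p++) {
            int s = set_of(*p);
            if (cache[s] == *p) {
                printf("%c: hit\n", *p);
            } else {
                printf("%c: miss (%s)\n", *p,
                       seen[*p - 'A'] ? "conflict or capacity"
                                      : "compulsory");
                cache[s] = *p;
                seen[*p - 'A'] = true;
            }
        }
        printf("\n");
    }

    int main(void) {
        run("ABCDABCD");   /* access pattern 1 */
        run("AABA");       /* access pattern 2 */
        return 0;
    }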
Reducing capacity misses
Increase the cache size
More cache blocks can be simultaneously held in the cache
Drawback: increased access time
Reducing compulsory misses
Increase the block size
Each miss results in more words being loaded into the cache
Block size should only be increased to a certain point!
As block size is increased:
Fewer cache sets (increased contention)
A larger percentage of the block may not be referenced
Reducing conflict misses
Increase the associativity
More locations in which a block can be held
Drawback: increased access time
Cache miss rates for SPEC92
Block replacement policy
Determines what block to replace on a cache miss to make room for the new block
Least recently used (LRU)
Pick the block that has been unused for the longest time (sketched below)
Based on temporal locality
Requires ordering bits to be kept with each set
Too expensive beyond 4-way
Random
Pseudo-randomly pick a block
Generally not as effective as LRU (higher miss rates)
Simple even for highly associative organizations
Most recently used (MRU)
Keep track of which block was accessed last
Randomly pick a block other than that one
A compromise between LRU and random
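A minimal sketch of LRU bookkeeping for one 4-way set, using an age counter per way in place of real ordering bits (an assumption made for clarity; hardware encodes the ordering more compactly).

    #include <stdio.h>
    #include <stdint.h>

    #define WAYS 4

    struct lru_set {
        uint32_t tag[WAYS];
        uint8_t  age[WAYS];   /* 0 = most recently used */
    };

    /* Mark `way` as most recently used by aging every newer way. */
    void touch(struct lru_set *s, int way) {
        for (int w = 0; w < WAYS; w++)
            if (s->age[w] < s->age[way])
                s->age[w]++;
        s->age[way] = 0;
    }

    /* Victim for replacement: the way unused for the longest time. */
    int lru_victim(const struct lru_set *s) {
        int victim = 0;
        for (int w = 1; w < WAYS; w++)
            if (s->age[w] > s->age[victim])
                victim = w;
        return victim;
    }

    int main(void) {
        struct lru_set s = { {0, 0, 0, 0}, {0, 1, 2, 3} };
        touch(&s, 3);                   /* way 3 becomes MRU */
        printf("victim = way %d\n", lru_victim(&s));  /* now way 2 */
        return 0;
    }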
Cache writes
The L1 data cache needs to handle writes (stores) in addition to reads (loads)
Need to check for a hit before writing
Don't want to write over another block on a miss
Requires a two-cycle operation (tag check followed by write)
Write back cache
Check for a hit
If a hit, write the byte, halfword, or word to the correct location in the block
If a miss, request the block from the next level in the hierarchy
Load the block into the cache, and then perform the write
Cache writes and block replacement
With a write back cache, when a block is written to, copies of the block in the lower levels are not updated
If this block is chosen for replacement on a miss, we need to save it to the next level
Solution:
A dirty bit is associated with each cache block
The dirty bit is set if the block is written to
A block with a set dirty bit that is chosen for replacement is written to the next level before being overwritten with the new block (see the sketch after the figure below)
[Figure: (1) cache miss in the L1 data cache; (2) the dirty block is written to L2; (3) a read request is sent to L2; (4) the requested block is loaded into the L1D cache]
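The four numbered steps in the figure, as a C sketch. The structure and the helpers write_block_to_l2 and read_block_from_l2 are hypothetical stand-ins for the L2 interface, not a real design.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCK_WORDS 4

    struct line {
        bool     valid, dirty;
        uint32_t tag;
        uint32_t data[BLOCK_WORDS];
    };

    /* Hypothetical stand-ins for the L2 interface. */
    static void write_block_to_l2(uint32_t tag, const uint32_t *data) {
        (void)data;
        printf("(2) dirty block with tag 0x%x written to L2\n", tag);
    }
    static void read_block_from_l2(uint32_t tag, uint32_t *data) {
        (void)data;
        printf("(3) read request for tag 0x%x sent to L2\n", tag);
    }

    /* Handle an L1D miss that replaces `line` with the block for new_tag. */
    void replace(struct line *line, uint32_t new_tag) {
        printf("(1) cache miss\n");
        if (line->valid && line->dirty)        /* modified copy is the */
            write_block_to_l2(line->tag, line->data);  /* only one: save it */
        read_block_from_l2(new_tag, line->data);
        printf("(4) block loaded into L1D cache\n");
        line->tag   = new_tag;
        line->valid = true;
        line->dirty = false;
    }

    int main(void) {
        struct line l = { .valid = true, .dirty = true, .tag = 0x12 };
        replace(&l, 0x34);
        return 0;
    }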
Write Buffer for Write Through
A Write Buffer is needed between the Cache and Memory
Processor: writes data into the cache and the write buffer
Memory controller: writes the contents of the buffer to memory
The write buffer is just a FIFO (sketched after the figure below):
Typical number of entries: 4
Works fine if store frequency (w.r.t. time) << 1 / DRAM write cycle
Memory system designer's nightmare:
Store frequency (w.r.t. time) -> 1 / DRAM write cycle
Write buffer saturation
[Figure: Processor -> Cache, with stores also entering the Write Buffer -> DRAM]
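A sketch of the four-entry FIFO write buffer as a ring buffer (the ring-buffer layout is an implementation assumption): the processor side enqueues completed stores, and the memory controller side drains the oldest entry at the DRAM write rate.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define WB_ENTRIES 4   /* typical number of entries */

    struct write_buffer {
        uint32_t addr[WB_ENTRIES], data[WB_ENTRIES];
        int head, tail, count;
    };

    /* Processor side: returns false if the buffer is full (the
       pipeline would have to stall: write buffer saturation). */
    bool wb_push(struct write_buffer *wb, uint32_t addr, uint32_t data) {
        if (wb->count == WB_ENTRIES)
            return false;
        wb->addr[wb->tail] = addr;
        wb->data[wb->tail] = data;
        wb->tail = (wb->tail + 1) % WB_ENTRIES;
        wb->count++;
        return true;
    }

    /* Memory controller side: drain the oldest store to DRAM. */
    bool wb_drain(struct write_buffer *wb, uint32_t *addr, uint32_t *data) {
        if (wb->count == 0)
            return false;
        *addr = wb->addr[wb->head];
        *data = wb->data[wb->head];
        wb->head = (wb->head + 1) % WB_ENTRIES;
        wb->count--;
        return true;
    }

    int main(void) {
        struct write_buffer wb = {{0}, {0}, 0, 0, 0};
        uint32_t a, d;
        wb_push(&wb, 0x100, 1);
        wb_push(&wb, 0x104, 2);
        while (wb_drain(&wb, &a, &d))
            printf("DRAM write: [0x%x] = %u\n", a, d);
        return 0;
    }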
Write Buffer Saturation
Store frequency (w.r.t. time) -> 1 / DRAM write cycle
If this condition exists for a long period of time (CPU cycle time too short and/or too many store instructions in a row):
The store buffer will overflow no matter how big you make it
(this happens whenever CPU cycle time <= DRAM write cycle time)
Solutions for write buffer saturation:
Use a write back cache
Install a second level (L2) cache
[Figure: Processor -> Cache -> Write Buffer -> DRAM; with an L2 cache the path becomes Processor -> Cache -> L2 Cache -> Write Buffer -> DRAM]
Write-miss Policy: Write Allocate versus Not Allocate
Assume a 16-bit write to memory location 0x0 that causes a miss
Do we read in the block? (the two choices are sketched after the figure below)
Yes: Write Allocate
No: Write Not Allocate
[Figure: 1KB direct mapped cache with 32-byte blocks; the address divides into a cache tag (ex: 0x00), a cache index (ex: 0x00), and a byte select field (ex: 0x00)]
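The two policies side by side as a C sketch; fetch_block, write_to_cache, and write_to_memory are hypothetical helpers standing in for the miss path.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical helpers standing in for the rest of the hierarchy. */
    static void fetch_block(uint32_t addr) {
        printf("fetch block for 0x%x into the cache\n", addr);
    }
    static void write_to_cache(uint32_t addr, uint16_t v) {
        printf("write 0x%hx into the cache block for 0x%x\n", v, addr);
    }
    static void write_to_memory(uint32_t addr, uint16_t v) {
        printf("write 0x%hx to memory at 0x%x\n", v, addr);
    }

    /* A 16-bit write that misses, under the two policies. */
    void write_allocate(uint32_t addr, uint16_t v) {
        fetch_block(addr);        /* yes: read the block in first    */
        write_to_cache(addr, v);  /* then perform the write in cache */
    }

    void write_not_allocate(uint32_t addr, uint16_t v) {
        write_to_memory(addr, v); /* no: update memory only; the cache
                                     is left unchanged on this miss  */
    }

    int main(void) {
        write_allocate(0x0, 0xBEEF);
        write_not_allocate(0x0, 0xBEEF);
        return 0;
    }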
Summary #1/3: The Principle of Locality
Programs are likely to access a relatively small portion of the address space at any instant of time
Temporal Locality: locality in time
Spatial Locality: locality in space
Three major categories of cache misses:
Compulsory misses: sad facts of life; example: cold start misses
Conflict misses: increase cache size and/or associativity; nightmare scenario: the ping-pong effect!
Capacity misses: increase cache size
Cache design space: total size, block size, associativity, replacement policy, write-hit policy (write-through, write-back), write-miss policy
Summary #2 / 3: The Cache Design Space
Several interacting dimensions:
cache size
block size
associativity
replacement policy
write-through vs. write-back
write allocation
The optimal choice is a compromise:
depends on access characteristics
workload
use (I-cache, D-cache, TLB)
depends on technology / cost
Simplicity often wins
[Figure: the cache design space; varying associativity, cache size, or block size from less to more can be good for one factor (Factor A) and bad for another (Factor B)]
Summary #3 / 3
Caches are understood by examining how they deal with four questions:
1. Where can a block be placed?
2. How is a block found?
3. Which block is replaced on a miss?
4. How are writes handled?