CSC 4250 Computer Architectures
December 5, 2006
Chapter 5. Memory Hierarchy.
Cache Optimizations
1. Reducing miss penalty: multilevel caches, critical word first, read miss before write miss, merging write buffers, and victim caches
2. Reducing miss rate: larger block size, larger cache size, higher associativity, way prediction, and compiler optimization
3. Reducing miss penalty or miss rate via parallelism: hardware and compiler prefetching
4. Reducing time to hit in cache: small and simple caches, and pipelined cache access
Three Categories of Misses (Three C’s)
Three C’s: Compulsory, Capacity, and Conflict
Compulsory ─ The very first access to a block cannot be in the cache; also called cold-start misses or first-reference misses
Capacity ─ If the cache cannot contain all the blocks needed during execution, capacity misses (in addition to compulsory misses) will occur because of blocks being discarded and later retrieved
Conflict ─ If the block placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block may be discarded and later retrieved if too many blocks map to its set; also called collision misses or interference misses
Figure 5.15. Total miss rate (top) and distribution of miss rate (bottom) for each size of data cache, according to the three C’s.
Interpretation of Figure 5.15
The figure shows the relative frequencies of cache misses, broken down by the “three C’s”:
Compulsory misses are those that occur in an infinite cache
Capacity misses are those that occur in a fully associative cache
Conflict misses are those that occur in going from fully associative to 8-way associative, 4-way associative, and so on
To show the benefit of associativity, conflict misses are divided by each decrease in associativity:
8-way ─ Conflict misses from fully assoc. to 8-way assoc.
4-way ─ Conflict misses from 8-way assoc. to 4-way assoc.
2-way ─ Conflict misses from 4-way assoc. to 2-way assoc.
1-way ─ Conflict misses from 2-way assoc. to 1-way assoc.
Reducing Miss Rate
1. Larger Block Size
2. Larger Caches
3. Higher Associativity
4. Way Prediction
5. Compiler Optimizations
1. Larger Block Size
Larger block size reduces compulsory misses, due to spatial locality
Larger blocks increase the miss penalty
Larger blocks increase conflict misses, and even capacity misses if the cache is small
Do not increase the block size beyond the point at which either the miss rate or the average memory access time starts to increase
Figure 5.16. Miss rate versus block size
Figure 5.18. Average memory access time versus block size for four caches sized 4KB, 16KB, 64KB, and 256KB.
Block sizes of 32B and 64B dominate; the smallest average time per cache size is shown in italic
What is the memory access overhead included in the miss penalty?
Block size   Miss penalty   4KB      16KB     64KB     256KB
16B          82             8.027    4.231    2.673    1.894
32B          84             7.082    3.411    2.134    1.588
64B          88             7.160    3.323    1.933    1.449
128B         96             8.469    3.659    1.979    1.470
256B         112            11.651   4.685    2.288    1.549
2. Larger Caches
An obvious way to reduce capacity misses in Fig. 5.15 is to increase the capacity of the cache
The drawback is a longer hit time and a higher dollar cost
This technique is especially popular in off-chip caches: The size of second- or third-level caches in 2001 equals the size of main memory in desktop computers in 1990
3. Higher Associativity
Figure 5.15 shows how miss rates improve with higher associativity. There are two general rules of thumb:
1. 8-way set associative is for practical purposes as effective in reducing misses as fully associative
2. A direct-mapped cache of size N has about the same miss rate as a 2-way set associative cache of size N/2
Improving one aspect of the average memory access time comes at the expense of another:
1. Increasing block size reduces miss rate while increasing miss penalty
2. Greater associativity comes at the cost of an increased hit time
Fig. 5.19. Average memory access time versus associativity
Italic entries show where higher associativity increases average memory access time
Smaller caches need higher associativity
Cache size   1-way   2-way   4-way   8-way
4KB          3.44    3.25    3.22    3.28
8KB          2.69    2.58    2.55    2.62
16KB         2.23    2.40    2.46    2.53
32KB         2.06    2.30    2.37    2.45
64KB         1.92    2.14    2.18    2.25
128KB        1.52    1.84    1.92    2.00
256KB        1.32    1.66    1.74    1.82
512KB        1.20    1.55    1.59    1.66
4. Way Prediction
This approach reduces conflict misses while maintaining the hit speed of a direct-mapped cache
Extra bits are kept in the cache to predict the way of the next cache access
Alpha 21264 uses way prediction in its 2-way set associative instruction cache: added to each block is a prediction bit, used to select which block to try on the next cache access
If the predictor is correct, the instruction cache latency is 1 clock cycle; if not, the cache tries the other block, changes the way predictor, and has a latency of 3 clock cycles
SPEC95 suggests a way prediction accuracy of 85%
5. Compiler Optimizations
Code can be rearranged without affecting correctness:
Reordering the procedures of a program might reduce instruction miss rates by reducing conflict misses. Use profiling information to determine likely conflicts between groups of instructions
Aim for better efficiency from long cache blocks: aligning basic blocks so that the entry point is at the beginning of a cache block decreases the chance of a cache miss for sequential code
Improve the spatial and temporal locality of data
Loop Interchange
Some programs have nested loops that access data in memory in nonsequential order. Simply exchanging the nesting of the loops can make the code access the data in the order they are stored. Assuming the arrays do not fit in the cache, this technique reduces misses by improving spatial locality: reordering maximizes use of data in a cache block before the data are discarded.
/* Before */
for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2*x[i][j];

/* After */
for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
        x[i][j] = 2*x[i][j];
Reducing Hit Time
Small and Simple Caches
A time-consuming part of a cache hit is using the index portion of the address to read tag memory and then compare it to the address
We already know that smaller hardware is faster
It is critical to keep the cache small enough to fit on the same chip as the processor, to avoid the time penalty of going off chip
Keep the cache simple: say, use direct mapping; a main advantage is that we can overlap the tag check with transmission of the data
We use small and simple caches for level-1 caches
For level-2 caches, some designs strike a compromise by keeping tags on chip and data off chip, promising a fast tag check yet providing the greater capacity of separate memory chips
Fig. 5.26. Summary of Cache Optimizations

Technique                                   Miss penalty  Miss rate  Hit time  Hardware complexity
Multilevel caches                                +                                      2
Critical word first and early restart            +                                      2
Priority to read misses over write misses        +                                      1
Merging write buffer                             +                                      1
Victim caches                                    +            +                         2
Larger block size                                −            +                         0
Larger cache size                                             +           −             1
Higher associativity                                          +           −             1
Way prediction                                                            +             2
Compiler techniques                                           +                         0
Small and simple caches                                       −           +             0
Pipelined cache access                                                    +             1
Virtual Cache
The guideline of making the common case fast suggests that we use virtual addresses for the cache, since hits are much more common than misses
Such caches are termed virtual caches, with physical cache used to identify the traditional cache that uses physical addresses
It is important to distinguish two tasks: indexing the cache and comparing addresses
The issues are whether a virtual or physical address is used to index the cache and whether a virtual or physical address is used in the tag comparison
Full virtual addressing for both indices and tags eliminates address translation time from a cache hit
Why doesn’t everyone build virtually addressed caches?
Reasons against Virtual Caches
The first reason is protection. Page-level protection is checked as part of the virtual-to-physical address translation.
The second reason is that every time a process is switched, the virtual addresses refer to different physical addresses, requiring the cache to be flushed. One solution is to widen the cache address tag with a process-identifier tag (PID).
The third reason is that operating systems and user programs may use two different virtual addresses for the same physical address. These duplicate addresses could result in two copies of the same data in a virtual cache; if one is modified, the other will have the wrong value. With a physical cache this wouldn’t happen, since the accesses would first be translated to the same physical cache block.
The fourth reason is I/O. I/O typically uses physical addresses and thus would require mapping to virtual addresses to interact with a virtual cache.
One Good Choice
One way to get the best of both virtual and physical caches is to use part of the page offset (the part that is identical in both virtual and physical addresses) to index the cache
At the same time as the cache is being read using the index, the virtual part of the address is translated, and the tag match uses physical addresses
This strategy allows the cache read to begin immediately, and yet the tag comparison is still with physical addresses
The limitation of this virtually indexed, physically tagged alternative is that a direct-mapped cache can be no bigger than the page size
Example
In this figure, the index is 9 bits and the cache block offset is 6 bits
To use the trick on the previous slide, what should be the virtual page size?
The virtual page size would have to be at least 2^(9+6) bytes, or 32KB
What is the size of the cache? 64KB (=2×32KB)
How to Build a Large Cache
Associativity can keep the index in the physical part of the address and yet still support a large cache
Doubling associativity and doubling the cache size do not change the size of the index
Pentium III, with 8KB pages, avoids translation with its 16KB cache by using 2-way set associativity
IBM 3033 cache is 16-way set associative, even though studies show that there is little benefit to miss rates above 8-way associativity. This high associativity allows a 64KB cache to be addressed with a physical index, despite the handicap of 4KB pages.