memory hierarchy designmemory hierarchy...
TRANSCRIPT
![Page 1: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/1.jpg)
Memory Hierarchy DesignMemory Hierarchy Design
Chapter 5 and Appendix C
1
![Page 2: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/2.jpg)
Overview
• Problem– CPU vs Memory
performance imbalance• Solution
– Driven by temporal and spatial localityMemory hierarchies– Memory hierarchies
• Fast L1, L2, L3 caches• Larger but slower
memories• Even larger but even
slower secondary storage• Keep most of the action in
the higher levels
2
![Page 3: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/3.jpg)
Locality of Reference• Temporal and Spatial
S ti l t• Sequential access to memory• Unit-stride loop (cache lines = 256 bits)
for (i = 1; i < 100000; i++)sum = sum + a[i];
• Non-unit stride loop (cache lines = 256 bits)f (i 0 i < 100000 i i+8)for (i = 0; i <= 100000; i = i+8)
sum = sum + a[i];
3
![Page 4: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/4.jpg)
Cache Systems
CPU Main MainCPUCPU
400MHz
Main Memory 10MHz
Main Memory 10MHz
CPU
Cache
Bus 66MHz Bus 66MHz
Data object t f Block transfer
CPU Cache Main Memory
transfer
4
![Page 5: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/5.jpg)
Example: Two-level HierarchyAccess Time
T1+T2
0 1
T1
Hit ti5
0 1Hit ratio
![Page 6: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/6.jpg)
Basic Cache Read Operation
• CPU requests contents of memory location• CPU requests contents of memory location• Check cache for this data
If t t f h (f t)• If present, get from cache (fast)• If not present, read required block from
i t hmain memory to cache• Then deliver from cache to CPU• Cache includes tags to identify which block
of main memory is in each cache slot
6
![Page 7: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/7.jpg)
Elements of Cache Designg• Cache size• Line (block) size• Number of caches• Mapping function
Block placement– Block placement – Block identification
R l t Al ith• Replacement Algorithm• Write Policy
7
![Page 8: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/8.jpg)
Cache Size• Cache size << main memory size
Small eno gh• Small enough– Minimize cost
Speed up access (less gates to address the cache)– Speed up access (less gates to address the cache)– Keep cache on chip
• Large enough• Large enough– Minimize average access time
O ti i d d th kl d• Optimum size depends on the workload• Practical size?
8
![Page 9: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/9.jpg)
Line Size• Optimum size depends on workload
S ll bl k d t l lit f f• Small blocks do not use locality of reference principle
• Larger blocks reduce the number of blocks– Replacement overhead
• Practical sizes? Cache Main MemoryTag
9
![Page 10: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/10.jpg)
Number of Caches• Increased logic density => on-chip cache
I l h l l 1 (L1)– Internal cache: level 1 (L1)– External cache: level 2 (L2)
• Unified cache– Balances the load between instruction and data fetches– Only one cache needs to be designed / implemented
• Split caches (data and instruction)p ( )– Pipelined, parallel architectures
10
![Page 11: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/11.jpg)
Mapping Function• Cache lines << main memory blocks
Di t i• Direct mapping– Maps each block into only one possible line– (block address) MOD (number of lines)
• Fully associative– Block can be placed anywhere in the cache
• Set associative– Block can be placed in a restricted set of lines– (block address) MOD (number of sets in cache)
11
(block address) MOD (number of sets in cache)
![Page 12: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/12.jpg)
Cache Addressing
Block address l k ffBlock address Block offset
IndexTag
Block offset – selects data object from the block
Index – selects the block set
Tag – used to detect a hit
12
![Page 13: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/13.jpg)
Direct Mapping
13
![Page 14: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/14.jpg)
Associative Mapping
14
![Page 15: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/15.jpg)
K-Way Set Associative Mapping
15
![Page 16: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/16.jpg)
Replacement Algorithm• Simple for direct-mapped: no choice
R d• Random– Simple to build in hardware
• LRUAssociativityAssociativity
Two-way Four-way Eight-waySize LRU Random LRU Random LRU Random16KB 5.18% 5.69% 4.67% 5.29% 4.39% 4.96%64KB 1.88% 2.01% 1.54% 1.66% 1.39% 1.53%256KB 1 15% 1 17% 1 13% 1 13% 1 12% 1 12%
16
256KB 1.15% 1.17% 1.13% 1.13% 1.12% 1.12%
![Page 17: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/17.jpg)
Write Policy• Write is more complex than read
Write and tag comparison can not proceed– Write and tag comparison can not proceed simultaneously
– Only a portion of the line has to be updatedOnly a portion of the line has to be updated• Write policies
– Write through – write to the cache and memoryWrite through write to the cache and memory– Write back – write only to the cache (dirty bit)
• Write miss:Write miss:– Write allocate – load block on a write miss– No-write allocate – update directly in memory
17
No write allocate update directly in memory
![Page 18: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/18.jpg)
Alpha AXP 21064 Cache
T I d ff21 8 5 Address
CPU
Tag Index offsetData dataIn out
Valid Tag Data (256)Valid Tag Data (256)
Write buffer
L l l=?
18
Lower level memory
![Page 19: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/19.jpg)
Write MergingWrite address V V V V
100 1 0 0 0100
104
108
1
1
0 0 0
0
0
0 0
0 0108
112
1
1
0
0
0
0
0
0
Write address V V V V
100 1 1 1 1
0 0 0 0
0 0 0 0
19
0 0 0 0
0 0 0 0
![Page 20: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/20.jpg)
DECstation 5000 Miss Rates
25
30
15
20
25
%
Instr. CacheData Cache
5
10
15% Data CacheUnified
0
5
1 KB 2 KB 4 KB 8 KB 16 KB 32 KB 64 KB 128 KB
C h iCache size
Direct-mapped cache with 32-byte blocks
20
Direct mapped cache with 32 byte blocks
Percentage of instruction references is 75%
![Page 21: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/21.jpg)
Cache Performance Measures• Hit rate: fraction found in that level
S hi h th t ll t lk b t Mi t– So high that usually talk about Miss rate– Miss rate fallacy: as MIPS to CPU performance,
• Average memory access time• Average memory-access time = Hit time + Miss rate x Miss penalty (ns)
• Miss penalty: time to replace a block from lower• Miss penalty: time to replace a block from lower level, including time to replace in CPU– access time to lower level = f(latency to lower level)access time to lower level f(latency to lower level)
– transfer time: time to transfer block =f(bandwidth)
21
![Page 22: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/22.jpg)
Cache Performance Improvements• Average memory-access time
Hit time + Miss rate Miss penalt= Hit time + Miss rate x Miss penalty• Cache optimizations
– Reducing the miss rate– Reducing the miss penalty– Reducing the hit time
22
![Page 23: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/23.jpg)
ExampleWhich has the lower average memory access time:
A 16 KB instruction cache with a 16 KB data cache orA 16-KB instruction cache with a 16-KB data cache orA 32-KB unified cache
Hit time = 1 cycleHit time 1 cycleMiss penalty = 50 cyclesLoad/store hit = 2 cycles on a unified cacheGiven: 75% of memory accesses are instruction references.Overall miss rate for split caches = 0.75*0.64% + 0.25*6.47% = 2.10%
Miss rate for unified cache = 1.99%Average memory access times:
S lit 0 75 * (1 + 0 0064 * 50) + 0 25 * (1 + 0 0647 * 50) 2 05
23
Split = 0.75 * (1 + 0.0064 * 50) + 0.25 * (1 + 0.0647 * 50) = 2.05
Unified = 0.75 * (1 + 0.0199 * 50) + 0.25 * (2 + 0.0199 * 50) = 2.24
![Page 24: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/24.jpg)
Cache Performance EquationsCPUtime = (CPU execution cycles + Mem stall cycles) * Cycle time
Mem stall cycles = Mem accesses * Miss rate * Miss penalty
CPUtime = IC * (CPIexecution + Mem accesses per instr * Miss rate * Mi lt ) * C l tiMiss penalty) * Cycle time
Misses per instr = Mem accesses per instr * Miss rate
CPUtime = IC * (CPIexecution + Misses per instr * Miss penalty) * Cycle time
24
![Page 25: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/25.jpg)
Reducing Miss Penalty• Multi-level Caches• Critical Word First and Early Restart• Priority to Read Misses over Writes• Merging Write Buffers• Victim CachesVictim Caches
25
![Page 26: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/26.jpg)
Multi-Level Caches• Avg mem access time = Hit time(L1) + Miss Rate
(L1) X Miss Penalty(L1)• Miss Penalty (L1) = Hit Time (L2) + Miss Rate
(L2) X Miss Penalty (L2)• Avg mem access time = Hit Time (L1) + Miss• Avg mem access time = Hit Time (L1) + Miss
Rate (L1) X (Hit Time (L2) + Miss Rate (L2) X Miss Penalty (L2)
• Local Miss Rate: number of misses in a cache divided by the total number of accesses to the cache
• Global Miss Rate: number of misses in a cache divided by the total number of memory accesses generated by the cache
26
generated by the cache
![Page 27: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/27.jpg)
Performance of Multi-Level Caches
27
![Page 28: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/28.jpg)
Critical Word First and Early Restart
• Critical Word First: Request the missed ord first from memorword first from memory
• Early Restart: Fetch in normal order, but as h d d i d isoon as the requested word arrives, send it
to CPU
28
![Page 29: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/29.jpg)
Giving Priority to Read Misses over Writes
SW R3, 512(R0)LW R1, 1024 (R0)LW R2, 512 (R0)• Direct-mapped, write-through cache
mapping 512 and 1024 to the same block pp gand a four word write buffer
• Will R2=R3?Will R2 R3?• Priority for Read Miss?
29
![Page 30: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/30.jpg)
Victim Caches
30
![Page 31: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/31.jpg)
Reducing Miss Rates:Types of Cache MissesTypes of Cache Misses
• Compulsory– First reference or cold start missesFirst reference or cold start misses
• Capacity– Working set is too big for the cache– Working set is too big for the cache– Fully associative caches
• Conflict (collision)• Conflict (collision)– Many blocks map to the same block frame (line)– AffectsAffects
• Set associative caches• Direct mapped caches
31
![Page 32: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/32.jpg)
Miss Rates: Absolute and DistributionDistribution
32
![Page 33: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/33.jpg)
Reducing the Miss Rates
1. Larger block size2. Larger Caches3. Higher associativityg y4. Pseudo-associative caches5 Compiler optimizations5. Compiler optimizations
33
![Page 34: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/34.jpg)
1. Larger Block Size• Effects of larger block sizes
R d i f l i– Reduction of compulsory misses• Spatial locality
I f i lt (t f ti )– Increase of miss penalty (transfer time)– Reduction of number of blocks
P t ti l i f fli t i• Potential increase of conflict misses
• Latency and bandwidth of lower-level memory– High latency and bandwidth => large block size
• Small increase in miss penalty
34
![Page 35: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/35.jpg)
Example
35
![Page 36: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/36.jpg)
2. Larger Caches• More blocks• Higher probability of getting the data• Longer hit time and higher cost• Primarily used in 2nd level caches
36
![Page 37: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/37.jpg)
3. Higher Associativity
• Eight-way set associative is good enough• 2:1 Cache Rule:
– Miss Rate of direct mapped cache size N = Miss Rate 2-way cache size N/2
• Higher Associativity can increaseg y– Clock cycle time– Hit time for 2-way vs. 1-way y y
external cache +10%, internal + 2%
37
![Page 38: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/38.jpg)
4. Pseudo-Associative Caches• Fast hit time of direct mapped and lower conflict
misses of 2-way set-associative cache?misses of 2-way set-associative cache? • Divide cache: on a miss, check other half of cache
to see if there, if so have a pseudo-hit (slow hit), p ( )Hit time
Pseudo hit time Miss penalty
• Drawback: i li d i i h d if hi k l
Miss penalty
– CPU pipeline design is hard if hit takes 1 or 2 cycles– Better for caches not tied directly to processor (L2)– Used in MIPS R1000 L2 cache similar in UltraSPARC
38
Used in MIPS R1000 L2 cache, similar in UltraSPARC
![Page 39: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/39.jpg)
Pseudo Associative Cache
CPUAddress
Data Datain out
D tTag
Data11
=?
223
Write buffer
=?
39Lower level memory
![Page 40: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/40.jpg)
5. Compiler Optimizations• Avoid hardware changes
I t ti• Instructions– Profiling to look at conflicts between groups of
i t tiinstructions• Data
– Loop Interchange: change nesting of loops to access data in order stored in memory
– Blocking: Improve temporal locality by accessing “blocks” of data repeatedly vs. going down whole columns or rows
40
columns or rows
![Page 41: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/41.jpg)
Loop Interchange/* Before */
for (j = 0; j < 100; j = j+1)for (i = 0; i < 5000; i = i+1)for (i = 0; i < 5000; i = i+1)
x[i][j] = 2 * x[i][j];
/* After *// /for (i = 0; i < 5000; i = i+1)
for (j = 0; j < 100; j = j+1)x[i][j] = 2 * x[i][j];x[i][j] 2 x[i][j];
•Sequential accesses instead of striding through memory every 100 words; improved spatial localitymemory every 100 words; improved spatial locality•Same number of executed instructions
41
![Page 42: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/42.jpg)
Blocking (1/2)/* Before */for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1){for (j = 0; j < N; j = j+1){r = 0;for (k = 0; k < N; k = k+1)
r = r + y[i][k]*z[k][j];x[i][j] = r;
};
•Two Inner Loops:p–Read all NxN elements of z[]–Read N elements of 1 row of y[] repeatedly–Write N elements of 1 row of x[]–Write N elements of 1 row of x[]
•Capacity Misses a function of N & Cache Size:–3 NxNx4 => no capacity misses
42
p y–Idea: compute on BxB submatrix that fits
![Page 43: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/43.jpg)
Blocking (2/2)/* After */
for (jj = 0; jj < N; jj = jj+B)for (kk = 0; kk < N; kk = kk+B)for (kk = 0; kk < N; kk = kk+B)
for (i = 0; i < N; i = i+1) for (j = jj; j < min(jj+B-1,N); j=j+1){
r = 0;for(k=kk; k<min(kk+B-1,N);k =k+1)
r = r + y[i][k]*z[k][j];x[i][j] = x[i][j] + r;
};};
•B called Blocking Factor
43
![Page 44: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/44.jpg)
Reducing Cache Miss Penalty or Miss Rate via ParallelismMiss Rate via Parallelism
1. Nonblocking Caches2. Hardware Prefetching3. Compiler controlled Prefetchingp g
44
![Page 45: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/45.jpg)
1. Nonblocking Cache
• Out-of-order execution– Proceeds with next
fetches while waiting for data to comefor data to come
45
![Page 46: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/46.jpg)
2. Hardware Prefetching• Instruction Prefetching
– Alpha 21064 fetches 2 blocks on a missAlpha 21064 fetches 2 blocks on a miss– Extra block placed in stream buffer– On miss check stream buffer
• Works with data blocks too:– 1 data stream buffer gets 25% misses from 4KB DM
cache; 4 streams get 43%cache; 4 streams get 43%– For scientific programs: 8 streams got 50% to 70% of
misses from 2 64KB, 4-way set associative caches• Prefetching relies on having extra memory
bandwidth that can be used without penalty
46
![Page 47: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/47.jpg)
3. Compiler-Controlled Prefetching
• Compiler inserts data prefetch instructionsLoad data into register (HP PA RISC loads)– Load data into register (HP PA-RISC loads)
– Cache Prefetch: load into cache (MIPS IV, PowerPC)– Special prefetching instructions cannot cause faults;Spec a p e etc g st uct o s ca ot cause au ts;
a form of speculative execution
• Nonblocking cache: overlap execution with prefetch• Issuing Prefetch Instructions takes time
– Is cost of prefetch issues < savings in reduced misses?– Higher superscalar reduces difficulty of issue bandwidth
47
![Page 48: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/48.jpg)
Reducing Hit Time
1. Small and Simple Caches2. Avoiding address Translation during
Indexing of the Cache
48
![Page 49: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/49.jpg)
Small and Simple Caches
49
![Page 50: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/50.jpg)
Avoiding Address Translation during IndexingIndexing
• Virtual vs. Physical Cache• What happens when address translation is
done• Page offset bits to index the cache
50
![Page 51: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/51.jpg)
TLB and Cache Operation
Virtual address
TLB Operation
Page# Offset TLB
Mi
Hit
Miss
Hit
Real address
Cache Operation
Tag Remainder Cache
Miss
Hit
Value+
MainPage Table
51
Memory Value
![Page 52: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/52.jpg)
Main Memory Background• Performance of Main Memory:
– Latency: Cache Miss Penalty• Access Time: time between request and word arrives• Cycle Time: time between requests
– Bandwidth: I/O & Large Block Miss Penalty (L2)M i M i DRAM D i R d A M• Main Memory is DRAM: Dynamic Random Access Memory– Dynamic since needs to be refreshed periodically – Addresses divided into 2 halves (Memory as a 2D matrix):
RAS R A St b• RAS or Row Access Strobe• CAS or Column Access Strobe
• Cache uses SRAM: Static Random Access MemoryNo refresh (6 transistors/bit vs 1 transistor /bit area is 10X)– No refresh (6 transistors/bit vs. 1 transistor /bit, area is 10X)
– Address not divided: Full addreess• Size: DRAM/SRAM - 4-8
Cost & Cycle time: SRAM/DRAM 8 1652
Cost & Cycle time: SRAM/DRAM - 8-16
![Page 53: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/53.jpg)
Main Memory Organizations
CPU CPUCPU
Simple Wide InterleavedCPU
Cache
CPU
Cache
CPU
Multiplexor
bus
Memory
bus
Memory Memory Memory Memory
Cache
busMemory Memory bank 0
Memory bank 1
Memory bank 2
Memory bank 3
Memory
bus
256/512 bits
5332/64 bits sp
![Page 54: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/54.jpg)
Interleaved Memory
54
![Page 55: Memory Hierarchy DesignMemory Hierarchy Designweb.cse.ohio-state.edu/~panda.2/775/slides/Ch5_App_C_1.pdf · – CPU pipeline design is hard if hit takes 1 or 2 cycles – Better for](https://reader036.vdocuments.us/reader036/viewer/2022062607/6051d3ef41482935a2438940/html5/thumbnails/55.jpg)
Performance• Timing model (word size is 32 bits)
– 1 to send address1 to send address, – 6 access time, 1 to send data– Cache Block is 4 words
• Simple M.P. = 4 x (1+6+1) = 32• Wide M.P. = 1 + 6 + 1 = 8• Interleaved M.P. = 1 + 6 + 4x1 = 11
55