The Goal: illusion of large, fast, cheap memory
• Fact: Large memories are slow, fast memories are small
• How do we create a memory that is large, cheap and fast (most of the time)?
– Hierarchy
– Parallelism
An Expanded View of the Memory System
[Figure: the processor (control and datapath) connected to a chain of memories, from the one closest to the processor to the one farthest away]
Speed: fastest → slowest
Size: smallest → biggest
Cost: highest → lowest
Memory Hierarchy: How Does it Work?
• Temporal Locality (Locality in Time): keep most recently accessed data items closer to the processor
• Spatial Locality (Locality in Space): move blocks consisting of contiguous words to the upper levels
[Figure: blocks Blk X and Blk Y being transferred between upper-level and lower-level memory, with data flowing to and from the processor]
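To make the two kinds of locality concrete, here is a small C sketch (my own illustration, not from the slides): the running total `sum` is reused on every iteration (temporal locality), and the array `a` is traversed in address order, so each block fetched into the cache supplies several consecutive words (spatial locality).

```c
#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];
    long sum = 0;

    for (int i = 0; i < N; i++) {
        a[i] = i;        /* sequential writes: spatial locality */
    }
    for (int i = 0; i < N; i++) {
        sum += a[i];     /* `sum` reused every iteration: temporal locality;
                            a[i] walks addresses in order: spatial locality */
    }
    printf("sum = %ld\n", sum);
    return 0;
}
```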
Memory Hierarchy of a Modern Computer System
• By taking advantage of the principle of locality:
– Present the user with as much memory as is available in the cheapest technology.
– Provide access at the speed offered by the fastest technology.
[Figure: the processor (registers, on-chip cache, datapath, and control) backed by a second-level cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (disk)]
Speed (ns): 1s → 10s → 100s → 10,000,000s (10s ms) → 10,000,000,000s (10s sec)
Size (bytes): 100s → Ks → Ms → Gs → Ts
• Users want large and fast memories!
– SRAM access times are 2–25 ns at a cost of $100 to $250 per MByte
– DRAM access times are 60–120 ns at a cost of $5 to $10 per MByte
– Disk access times are 10 to 20 million ns at a cost of $0.10 to $0.20 per MByte
(1997 figures)
• Try and give it to them anyway
– build a memory hierarchy
Exploiting Memory Hierarchy
[Figure: pyramid of levels in the memory hierarchy, with the CPU at the top above Level 1, Level 2, ..., Level n; distance from the CPU in access time increases downward, and the size of the memory at each level grows]
How is the hierarchy managed?
• Registers <-> memory
– by the compiler (programmer?)
• Cache <-> memory
– by the hardware
• Memory <-> disks
– by the hardware and operating system (virtual memory)
– by the programmer (files)
Hits vs. Misses
• Read hits
– this is what we want!
• Read misses
– stall the CPU, fetch the block from memory, deliver it to the cache, restart
• Write hits:
– can replace data in cache and memory (write-through)
– write the data only into the cache (write it back to memory later)
• Write misses:
– read the entire block into the cache, then write the word
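The two write-hit policies are easy to contrast in code. The C sketch below is a minimal illustration of the idea only, not the slides' design: the geometry, the names, and the one-word blocks are hypothetical, and both routines assume the addressed word is already resident (a write hit).

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 256           /* hypothetical direct-mapped cache */
#define MEM_WORDS  (1 << 16)     /* toy backing store (256 KB) */

typedef struct {
    bool     valid;
    bool     dirty;              /* used only by write-back */
    uint32_t tag;
    uint32_t data;               /* one-word blocks for simplicity */
} Line;

static Line     cache[NUM_BLOCKS];
static uint32_t memory[MEM_WORDS];

/* Write-through: on a write hit, update the cache AND memory,
 * so memory is always up to date (addr word-aligned, < 256 KB). */
void write_through(uint32_t addr, uint32_t value) {
    uint32_t idx = (addr >> 2) % NUM_BLOCKS;
    cache[idx].data   = value;
    memory[addr >> 2] = value;
}

/* Write-back: on a write hit, update only the cache and mark the
 * line dirty; memory is written later, when the line is evicted. */
void write_back(uint32_t addr, uint32_t value) {
    uint32_t idx = (addr >> 2) % NUM_BLOCKS;
    cache[idx].data  = value;
    cache[idx].dirty = true;     /* memory is now stale */
}
```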
Cache
• Two issues:
– How do we know if a data item is in the cache?
– If it is, how do we find it?
• Our first example:
– block size is one word of data
– "direct mapped": for each item of data at the lower level, there is exactly one location in the cache where it might be (i.e., lots of items at the lower level share locations in the upper level)
• Mapping: address is modulo the number of blocks in the cache
Direct Mapped Cache
[Figure: an eight-block direct-mapped cache with indexes 000–111 below a larger memory; memory addresses 00001, 01001, 10001, and 11001 all map to cache block 001, and 00101, 01101, 10101, and 11101 all map to block 101]
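As a quick check of the modulo rule, this short C sketch (my own, using the memory addresses from the figure) computes each address's cache index; every address ending in 001 lands in cache block 1 (binary 001), and every address ending in 101 lands in block 5 (binary 101).

```c
#include <stdio.h>

#define NUM_BLOCKS 8   /* the eight-block cache from the figure */

int main(void) {
    /* The memory block addresses shown in the figure (00001 ... 11101). */
    unsigned addrs[] = {1, 5, 9, 13, 17, 21, 25, 29};

    for (int i = 0; i < 8; i++) {
        /* Direct mapping: cache index = address modulo number of blocks. */
        printf("memory block %2u -> cache block %u\n",
               addrs[i], addrs[i] % NUM_BLOCKS);
    }
    return 0;
}
```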
• For MIPS: what kind of locality are we taking advantage of?
Direct Mapped Cache
[Figure: address (showing bit positions). Bits 31–12 form the 20-bit tag, bits 11–2 the 10-bit index, and bits 1–0 the byte offset. The index selects one of 1024 entries (0–1023), each holding a valid bit, a 20-bit tag, and 32 bits of data; the stored tag is compared against the address tag to produce the Hit signal, and the 32-bit Data is driven out.]
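Following the bit positions in the figure, a lookup starts by splitting the 32-bit address into byte offset, index, and tag. This C sketch is my own rendering of that split for the 1024-entry, one-word-block cache above; the example address is arbitrary.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t addr = 0x1234ABCD;             /* arbitrary example address */

    uint32_t byte_offset = addr & 0x3;      /* bits 1..0  (2 bits) */
    uint32_t index = (addr >> 2) & 0x3FF;   /* bits 11..2 (10 bits, 1024 entries) */
    uint32_t tag = addr >> 12;              /* bits 31..12 (20 bits) */

    /* The cache compares `tag` against the tag stored at `index`;
       equality (with the valid bit set) asserts the Hit signal. */
    printf("tag = 0x%05X, index = %u, byte offset = %u\n",
           tag, index, byte_offset);
    return 0;
}
```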
Direct Mapped Cache
• Taking advantage of spatial locality:
[Figure: address (showing bit positions). A 16-bit tag (bits 31–16), a 12-bit index (bits 15–4), a 2-bit block offset (bits 3–2), and the byte offset (bits 1–0). The index selects one of 4K entries, each holding a valid bit, a 16-bit tag, and a 128-bit (four-word) block; a mux driven by the block offset selects one of the four 32-bit words, and the tag comparison produces the Hit signal.]
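With four-word blocks, the 2-bit block offset does in software what the mux does in hardware: it picks one of the four 32-bit words in the selected block. The sketch below is my own, using the field widths from the figure; the structure names are hypothetical.

```c
#include <stdint.h>

/* Geometry from the figure: 4K entries, four 32-bit words per block. */
typedef struct {
    int      valid;
    uint16_t tag;            /* 16 bits, address bits 31..16 */
    uint32_t words[4];       /* 128-bit block */
} Entry;

static Entry cache[4096];

/* Returns 1 on a hit and stores the selected word through *out. */
int lookup(uint32_t addr, uint32_t *out) {
    uint32_t block_offset = (addr >> 2) & 0x3;     /* bits 3..2 */
    uint32_t index        = (addr >> 4) & 0xFFF;   /* bits 15..4 */
    uint16_t tag          = (uint16_t)(addr >> 16);/* bits 31..16 */

    Entry *e = &cache[index];
    if (e->valid && e->tag == tag) {
        *out = e->words[block_offset];   /* the mux: one of four words */
        return 1;                        /* hit */
    }
    return 0;                            /* miss */
}
```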
Choosing a block size
• Large block sizes help with spatial locality, but...
– It takes time to read the memory in: larger block sizes increase the time for misses
– It reduces the number of blocks in the cache: number of blocks = cache size / block size (a worked example follows this list)
• Need to find a middle ground
– 16–64 bytes works nicely
• Use split caches (separate instruction and data caches) because there is more spatial locality in code
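For instance, applying the formula above (the cache size is an arbitrary illustrative choice): a 64 KB cache with 16-byte blocks holds 65,536 / 16 = 4096 blocks, while the same 64 KB with 64-byte blocks holds only 1024, so each miss brings in more neighboring words but fewer distinct blocks can be resident at once.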
Performance
• Simplified model:
execution time = (execution cycles + stall cycles) × cycle time
stall cycles = # of instructions × miss ratio × miss penalty
• Two ways of improving performance:
– decreasing the miss ratio
– decreasing the miss penalty
What happens if we increase block size?
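As a worked example of the simplified model (the numbers are illustrative, not from the slides): suppose a program executes 1,000,000 instructions in 1,000,000 cycles with a 5% miss ratio and a 20-cycle miss penalty. Then stall cycles = 1,000,000 × 0.05 × 20 = 1,000,000, so execution time doubles; halving either the miss ratio or the miss penalty removes half of those stalls.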
Decreasing miss ratio with associativity
[Figure: four cache organizations built from the same eight blocks. One-way set associative (direct mapped): blocks 0–7, each with one Tag/Data pair. Two-way set associative: sets 0–3, each with two Tag/Data pairs. Four-way set associative: sets 0–1, each with four Tag/Data pairs. Eight-way set associative (fully associative): one set of eight Tag/Data pairs.]
Fully Associative vs. Direct Mapped
• Fully associative caches provide much greater flexibility
– Nothing gets “thrown out” of the cache until it is completely full
• Direct-mapped caches are more rigid
– Any cached data goes directly where the index says to, even if the rest of the cache is empty
• A problem, though...
– Fully associative caches require a complete search through all the tags to see if there’s a hit
– Direct-mapped caches only need to look in one place
A Compromise
• 2-way set associative
Address = Tag | Index | Block offset
[Figure: eight sets (0–7), each holding two V/Tag/Data entries]
– Each address has two possible locations with the same index
– One fewer index bit: 1/2 the indexes
• 4-way set associative
Address = Tag | Index | Block offset
[Figure: four sets (0–3), each holding four V/Tag/Data entries]
– Each address has four possible locations with the same index
– Two fewer index bits: 1/4 the indexes
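To tie the organizations together, here is a minimal C sketch (my own; the sizes are hypothetical) of an N-way set-associative lookup. Direct mapped is the 1-way case, and making the whole cache a single set gives fully associative; note how the index uses fewer bits as associativity grows, exactly as the slides describe.

```c
#include <stdint.h>

#define NUM_SETS 8     /* as in the 2-way figure above (hypothetical) */
#define WAYS     2     /* associativity: 1 = direct mapped */

typedef struct {
    int      valid;
    uint32_t tag;
    uint32_t data;     /* one-word blocks for simplicity */
} Way;

static Way sets[NUM_SETS][WAYS];

/* The index selects a set; hardware then compares all WAYS tags in
 * parallel (a loop here). Fewer sets means fewer index bits. */
int lookup(uint32_t addr, uint32_t *out) {
    uint32_t index = (addr >> 2) % NUM_SETS;  /* word address mod #sets */
    uint32_t tag   = addr / (4 * NUM_SETS);   /* the remaining upper bits */

    for (int w = 0; w < WAYS; w++) {
        if (sets[index][w].valid && sets[index][w].tag == tag) {
            *out = sets[index][w].data;
            return 1;                         /* hit */
        }
    }
    return 0;                                 /* miss */
}
```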