Chapter 7b: Cache Memory Performance
EE/CS/CPE 3760 - Computer Organization, Seattle Pacific University


Direct Mapping Review (7.2)

Figure: a small main memory with 6-bit addresses (00 00 00 through 11 11 00) feeding a 4-entry direct-mapped cache. The 6-bit memory address splits into a 2-bit tag (bits 5-4), a 2-bit index (bits 3-2), and a 2-bit byte offset (bits 1-0, always zero for word accesses). How the address is split depends on the cache size.

Each word has only one place it can be in the cache: the index selects the entry, and the stored tag must match exactly. Each cache entry holds a valid bit, a tag, and one data word.
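As a concrete illustration (not from the slides), here is a minimal C sketch of that field extraction, assuming the 2-bit tag / 2-bit index / 2-bit byte-offset split shown in the figure; the example address is arbitrary:

    #include <stdio.h>
    #include <stdint.h>

    /* Split the slide's 6-bit word address into its 2-bit tag, 2-bit index,
     * and 2-bit byte offset.  The field widths come from the figure above. */
    int main(void)
    {
        uint8_t addr        = 0x2C;               /* binary 10 11 00          */
        uint8_t byte_offset = addr & 0x3;         /* bits 1-0: always 00      */
        uint8_t index       = (addr >> 2) & 0x3;  /* bits 3-2: cache entry    */
        uint8_t tag         = (addr >> 4) & 0x3;  /* bits 5-4: must match tag */

        printf("tag=%u index=%u offset=%u\n", tag, index, byte_offset);
        return 0;
    }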

Missed me, Missed me... (7.2)

• What to do on a hit:

• Carry on... (Hits should take one cycle or less)

• What to do on an instruction fetch miss (a code sketch follows this list):

• Undo PC increment (PC <-- PC-4)

• Do a memory read

• Stall until memory returns the data

• Update the cache (data, tag and valid) at index

• Un-stall

• What to do on a load miss

• Same thing, except don’t mess with the PC
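A minimal sketch of that instruction-fetch miss sequence, assuming a toy one-word-block direct-mapped I-cache; the cache size, the memory_read stand-in, and the structure are illustrative, not the course's actual hardware:

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_ENTRIES 64                 /* assumed I-cache size for the sketch */

    typedef struct {
        bool     valid;
        uint32_t tag;
        uint32_t data;
    } CacheEntry;

    static CacheEntry icache[NUM_ENTRIES];

    /* Stand-in for main memory; in hardware this is the long stall. */
    static uint32_t memory_read(uint32_t addr) { return addr * 3u; }

    /* Fetch one instruction word, following the slide's miss steps: do a
     * memory read, stall (hidden inside memory_read here), update the
     * data/tag/valid at the index, then un-stall by returning the word.
     * The real pipeline would also undo the PC increment (PC <- PC-4). */
    uint32_t fetch(uint32_t pc)
    {
        uint32_t word_addr = pc >> 2;              /* word-aligned fetch     */
        uint32_t index     = word_addr % NUM_ENTRIES;
        uint32_t tag       = word_addr / NUM_ENTRIES;

        if (icache[index].valid && icache[index].tag == tag)
            return icache[index].data;             /* hit: carry on          */

        uint32_t word = memory_read(pc);           /* miss: read memory      */
        icache[index].data  = word;                /* update the cache       */
        icache[index].tag   = tag;
        icache[index].valid = true;
        return word;                               /* un-stall               */
    }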

Missed me, Missed me... (7.2, continued)

• What to do on a store (hit or miss)

• It won’t do to just write the value to the cache

• The cache would have a different (newer) value than main memory

• Simple Write-Through

• Write both the cache and memory

• Works correctly, but slowly

• Buffered Write-Through

• Write the cache

• Buffer a write request to main memory

• 1 to 10 buffer slots are typical (a buffered write-through sketch follows this list)
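A minimal sketch of buffered write-through, assuming a small FIFO write buffer; the buffer depth and the cache_write/dram_write stand-ins are illustrative:

    #include <stdint.h>

    #define WBUF_SLOTS 4                     /* slides note 1 to 10 slots are typical */

    typedef struct { uint32_t addr, data; } WriteReq;

    static WriteReq wbuf[WBUF_SLOTS];
    static int wbuf_head, wbuf_count;

    /* Stand-ins for the cache array update and the slow DRAM write. */
    static void cache_write(uint32_t addr, uint32_t data) { (void)addr; (void)data; }
    static void dram_write(uint32_t addr, uint32_t data)  { (void)addr; (void)data; }

    /* Buffered write-through: update the cache immediately and queue the
     * memory write; the CPU only waits when the buffer is already full. */
    void store_word(uint32_t addr, uint32_t data)
    {
        cache_write(addr, data);                       /* cache gets the new value */

        if (wbuf_count == WBUF_SLOTS) {                /* buffer full: drain one   */
            WriteReq r = wbuf[wbuf_head];
            dram_write(r.addr, r.data);
            wbuf_head = (wbuf_head + 1) % WBUF_SLOTS;
            wbuf_count--;
        }
        wbuf[(wbuf_head + wbuf_count) % WBUF_SLOTS] = (WriteReq){ addr, data };
        wbuf_count++;                                  /* queue the write-through  */
    }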

Splitting up (7.2)

• It is common to use two separate caches for Instructions and for Data

• All Instruction fetches use the I-cache

• All data accesses (loads and stores) use the D-cache

• This allows the CPU to access the I-cache at the same time it is accessing the D-cache

• Still have to share a single memory

Figure: pipeline stages IF RF EX M WB, with the I-cache serving instruction fetch (IF) while the D-cache serves the memory (M) stage.

Note: The hit rate will probably be lower than for a combined cache of the same total size.

What about Spatial Locality? (7.2)

• Spatial locality says that physically close data is likely to be accessed close together

Figure: a 32-bit address split for a cache with 4-word blocks: an 18-bit tag (bits 31-14), a 10-bit index (bits 13-4), a 2-bit block offset (bits 3-2), and a 2-bit byte offset (bits 1-0). One 4-word block (word 3, word 2, word 1, word 0): all words in the same block have the same index and tag.

• On a cache miss, don’t just grab the word needed, but also the words nearby

• The easiest way to do this is to increase the block size

Cache entry: V | Tag | Data (words 3, 2, 1, 0). Note: 2^2 = 4, so the 2-bit block offset selects one of the 4 words in the block.

32 KByte / 4-Word Block Direct-Mapped Cache (7.2)

Figure: 32 KB / (4 words/block) / (4 bytes/word) --> 2K blocks, so the index is 11 bits (2^11 = 2K) and selects one of entries 0 through 2047. The 32-bit address splits into a 17-bit tag (bits 31-15), an 11-bit index (bits 14-4), a 2-bit block offset (bits 3-2), and a 2-bit byte offset (bits 1-0). A hit requires the valid bit to be set and the stored 17-bit tag to match. The block offset drives a 4-to-1 mux that selects one 32-bit word out of the 4-word block.
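A minimal sketch of that lookup in C, assuming the 17/11/2/2 address split above; the Block type and lookup name are illustrative:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_BLOCKS     2048            /* 32 KB / 16-byte blocks = 2K blocks */
    #define WORDS_PER_BLK  4

    /* One entry of the 32 KB, 4-word-block, direct-mapped cache in the figure. */
    typedef struct {
        bool     valid;
        uint32_t tag;                      /* 17-bit tag stored here             */
        uint32_t data[WORDS_PER_BLK];
    } Block;

    static Block cache[NUM_BLOCKS];

    /* Decode the address exactly as the figure splits it and test for a hit. */
    bool lookup(uint32_t addr, uint32_t *word_out)
    {
        uint32_t block_off = (addr >> 2) & 0x3;       /* bits 3-2: word in block */
        uint32_t index     = (addr >> 4) & 0x7FF;     /* bits 14-4: 11 bits      */
        uint32_t tag       = addr >> 15;              /* bits 31-15: 17 bits     */

        if (cache[index].valid && cache[index].tag == tag) {
            *word_out = cache[index].data[block_off]; /* the 4:1 mux             */
            return true;                              /* hit                     */
        }
        return false;                                 /* miss                    */
    }

    int main(void)
    {
        uint32_t w;
        printf("hit=%d\n", lookup(0x1234ABCD, &w));   /* cold cache: miss        */
        return 0;
    }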

How Much Change? (7.2)

Miss rates for the DEC 3100 (a MIPS machine), with separate 64 KB instruction/data caches (16K 1-word blocks or 4K 4-word blocks):

    Benchmark   Block size (words)   Instruction miss rate   Data miss rate   Combined miss rate
    gcc         1                    6.1%                    2.1%             5.4%
    spice       1                    1.2%                    1.3%             1.2%
    gcc         4                    2.0%                    1.7%             1.9%
    spice       4                    0.3%                    0.6%             0.4%

The cost of a cache miss (7.2)

• For a memory access, assume:

• 1 clock cycle to send address to memory

• 40 clock cycles for each DRAM access (0.5 ns clock cycle, 20 ns access time)

• 1 clock cycle to send each resulting data word

(This actually depends on the bus speed.)

• Miss access time (4-word block)

• 4 x (Address + access + sending data word)

• 4 x (1 + 40 + 1) = 168 cycles for each miss (checked in the snippet below)
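A tiny check of that arithmetic under the slide's timing assumptions (the variable names are just illustrative):

    #include <stdio.h>

    /* Miss penalty under the slide's assumptions: per word, 1 cycle to send
     * the address, 40 cycles of DRAM access, 1 cycle to return the word,
     * with no overlap between the four words of the block. */
    int main(void)
    {
        int send_addr = 1, dram_access = 40, send_data = 1, words_per_block = 4;
        int miss_penalty = words_per_block * (send_addr + dram_access + send_data);
        printf("miss penalty = %d cycles\n", miss_penalty);   /* prints 168 */
        return 0;
    }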

Memory Interleaving (7.2)

Default organization: CPU - cache - a single memory over a 4-byte bus. Each word's access must finish before the next one can start, so a 4-word miss costs (1 + 40 + 1) x 4 = 168 cycles.

Interleaved organization: four separate memories (Memory0 through Memory3), each 1/4 the size, with addresses spread out among the memories. Begin accessing one word and, while waiting, start accessing the other three (pipelining). The four 40-cycle accesses overlap, so a 4-word miss costs about 1 + 40 + 4 x 1 = 45 cycles: only the address cycle and the four returned data words take separate bus cycles.

Interleaving works perfectly with caches, since a miss always fetches a whole block. Sophisticated DRAMs (EDO, SDRAM, etc.) provide support for this.
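A small sketch comparing the two timings under the slide's assumptions (one outstanding access per bank, four banks):

    #include <stdio.h>

    /* Compare the slide's two timings for one 4-word miss:
     *  - default: each word is a complete 1 + 40 + 1 round trip
     *  - 4-way interleaved: the four 40-cycle accesses overlap, so only the
     *    address cycle and the four data-return cycles add up. */
    int main(void)
    {
        int addr = 1, access = 40, xfer = 1, words = 4;

        int serial      = words * (addr + access + xfer);   /* 168 cycles */
        int interleaved = addr + access + words * xfer;     /*  45 cycles */

        printf("default: %d cycles, interleaved: %d cycles\n", serial, interleaved);
        return 0;
    }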

The issue of Writes (7.2)

Perform a write to a location with index 1000, tag 2420, word 1 (value 4334)

On a read miss, we read the entire block from memory into the cache

On a write hit, we write one word into the block. The other words in the block are unchanged.

On a write miss, we write one word into the block and update the tag.

Example: the block at index 1000 currently holds valid data for tag 3000 (V = 1, tag, and words 3 through 0). Writing value 4334 into word 1 and changing the tag to 2420 leaves the other three words still holding the old data (for tag 3000). Bad news!

Solution 1: Don’t update the cache on a write miss. Write only to memory.

Solution 2: On a write miss, first read the referenced block in (including the old value of the word being written), then write the new word into the cache and write through to memory. (This approach is sketched below.)
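A minimal sketch of Solution 2 (fetch the block on a write miss, then write the word and write through), reusing the 32 KB cache geometry from the earlier figure; the dram_* stand-ins are illustrative:

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_BLOCKS    2048
    #define WORDS_PER_BLK 4

    typedef struct {
        bool     valid;
        uint32_t tag;
        uint32_t data[WORDS_PER_BLK];
    } Block;

    static Block cache[NUM_BLOCKS];

    /* Stand-ins for the memory system. */
    static void dram_read_block(uint32_t addr, uint32_t out[WORDS_PER_BLK])
    {
        for (int i = 0; i < WORDS_PER_BLK; i++) out[i] = addr + 4u * (uint32_t)i;
    }
    static void dram_write_word(uint32_t addr, uint32_t data) { (void)addr; (void)data; }

    /* Solution 2: on a write miss, first read the whole referenced block in,
     * then write the one new word into the cache, and write through either way. */
    void store_word(uint32_t addr, uint32_t value)
    {
        uint32_t off   = (addr >> 2) & 0x3;              /* word within block   */
        uint32_t index = (addr >> 4) & (NUM_BLOCKS - 1); /* 11-bit index        */
        uint32_t tag   = addr >> 15;                     /* 17-bit tag          */

        if (!(cache[index].valid && cache[index].tag == tag)) {
            dram_read_block(addr & ~0xFu, cache[index].data);  /* fetch old block */
            cache[index].tag   = tag;
            cache[index].valid = true;
        }
        cache[index].data[off] = value;    /* write just this word into the block */
        dram_write_word(addr, value);      /* write-through to main memory        */
    }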

Choosing a block size (7.2)

• Large block sizes help with spatial locality, but...

• It takes time to read the memory in

• Larger block sizes increase the time for misses

• It reduces the number of blocks in the cache

• Number of blocks = cache size / block size (e.g., the 32 KB cache with 16-byte blocks above has 2K blocks)

• Need to find a middle ground

• 16-64 bytes works nicely

Other Cache organizations (7.3)

Direct Mapped: entries hold V, Tag, and Data, and the index (0: through 15: in the figure) selects exactly one entry. Address = Tag | Index | Block offset. Each address has only one possible location.

Fully Associative: no index at all. Entries still hold V, Tag, and Data, but a block may be placed in any entry. Address = Tag | Block offset.

Fully Associative vs. Direct Mapped (7.3)

• Fully associative caches provide much greater flexibility

• Nothing gets “thrown out” of the cache until it is completely full

• Direct-mapped caches are more rigid

• Any cached data goes directly where the index says to, even if the rest of the cache is empty

• A problem, though...

• Fully associative caches require a complete search through all the tags to see if there’s a hit

• Direct-mapped caches only need to look in one place (the contrast is sketched in code below)
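A minimal sketch of that contrast, with illustrative sizes and names; hardware does the fully associative search with one comparator per entry in parallel, but the difference in lookup work is the same idea:

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_BLOCKS 1024                 /* illustrative size                 */

    typedef struct { bool valid; uint32_t tag; } TagEntry;

    static TagEntry dm[NUM_BLOCKS];         /* direct mapped                     */
    static TagEntry fa[NUM_BLOCKS];         /* fully associative                 */

    /* Direct mapped: only one place to look - index, then compare one tag. */
    bool dm_hit(uint32_t block_addr)
    {
        uint32_t index = block_addr % NUM_BLOCKS;
        uint32_t tag   = block_addr / NUM_BLOCKS;
        return dm[index].valid && dm[index].tag == tag;
    }

    /* Fully associative: any entry may hold the block, so the tag search
     * must cover every entry. */
    bool fa_hit(uint32_t block_addr)
    {
        for (int i = 0; i < NUM_BLOCKS; i++)
            if (fa[i].valid && fa[i].tag == block_addr)
                return true;
        return false;
    }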

A Compromise (7.3)

2-Way set associative: Address = Tag | Index | Block offset. Each address has two possible locations with the same index, so there is one fewer index bit (1/2 the indexes). In the figure, indexes 0: through 7: each hold two (V, Tag, Data) entries.

4-Way set associative: Address = Tag | Index | Block offset. Each address has four possible locations with the same index, so there are two fewer index bits (1/4 the indexes). In the figure, indexes 0: through 3: each hold four (V, Tag, Data) entries.
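A minimal sketch of a 2-way set-associative lookup, using the figure's 8 sets; the names and sizes are illustrative:

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_SETS 8                      /* the 2-way figure has indexes 0-7 */
    #define WAYS     2

    typedef struct { bool valid; uint32_t tag; } Way;

    static Way sets[NUM_SETS][WAYS];

    /* 2-way set-associative lookup: the index picks a set, and the tag is
     * compared against both ways of that set. */
    bool lookup(uint32_t block_addr)
    {
        uint32_t index = block_addr % NUM_SETS;
        uint32_t tag   = block_addr / NUM_SETS;

        for (int w = 0; w < WAYS; w++)
            if (sets[index][w].valid && sets[index][w].tag == tag)
                return true;                /* hit in one of the two locations  */
        return false;                       /* miss: neither way matched        */
    }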

Set Associative Example (7.3)

Configuration: a 128-byte cache with 4-word (16-byte) blocks and 10-bit addresses, so 8 blocks total; associativity varies from 1-way (direct mapped) to 4-way. The address splits into a 2-bit byte offset, a 2-bit block offset, an index of 1 to 3 bits, and a tag of 3 to 5 bits, depending on the associativity.

The same five 10-bit addresses are accessed in order:

    0100111000, 1100110100, 0100111100, 0110110000, 1100111000

    Direct-Mapped    (indexes 000: to 111:):  Miss, Miss, Miss, Miss, Miss
    2-Way Set Assoc. (indexes 00: to 11:):    Miss, Miss, Hit,  Miss, Miss
    4-Way Set Assoc. (indexes 0: to 1:):      Miss, Miss, Hit,  Miss, Hit
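A minimal C simulation of the example. The slide does not name its replacement policy, so LRU within each set is assumed (it is consistent with the hit/miss pattern above), and the code reproduces that pattern:

    #include <stdio.h>
    #include <stdbool.h>

    #define TOTAL_BLOCKS 8                 /* 128 bytes / 16-byte blocks         */

    typedef struct { bool valid; unsigned tag; unsigned last_used; } Way;

    /* Replay the accesses through an N-way cache built from the same 8 blocks,
     * using LRU replacement within each set. */
    static void simulate(int ways, const unsigned *addrs, int n)
    {
        Way cache[TOTAL_BLOCKS] = {0};
        int sets = TOTAL_BLOCKS / ways;

        printf("%d-way:", ways);
        for (int t = 0; t < n; t++) {
            unsigned block = addrs[t] >> 4;          /* drop byte + block offset */
            Way *set = &cache[(block % sets) * ways];
            unsigned tag = block / sets;

            int hit = -1;
            for (int w = 0; w < ways; w++)
                if (set[w].valid && set[w].tag == tag) { hit = w; break; }

            if (hit < 0) {                           /* miss: pick a victim way  */
                hit = 0;
                for (int w = 1; w < ways; w++) {
                    if (!set[hit].valid) break;      /* already found empty way  */
                    if (!set[w].valid || set[w].last_used < set[hit].last_used)
                        hit = w;
                }
                set[hit].valid = true;
                set[hit].tag = tag;
                printf(" Miss");
            } else {
                printf(" Hit");
            }
            set[hit].last_used = t + 1;              /* mark most recently used  */
        }
        printf("\n");
    }

    int main(void)
    {
        /* 0100111000, 1100110100, 0100111100, 0110110000, 1100111000 */
        const unsigned addrs[] = { 0x138, 0x334, 0x13C, 0x1B0, 0x338 };

        for (int ways = 1; ways <= 4; ways *= 2)
            simulate(ways, addrs, 5);                /* 1-way, 2-way, 4-way      */
        return 0;
    }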

New Performance Numbers (7.3)

Miss rates for the DEC 3100 (a MIPS machine), with separate 64 KB instruction/data caches (4K 4-word blocks):

    Benchmark   Associativity   Instruction miss rate   Data miss rate   Combined miss rate
    gcc         Direct          2.0%                    1.7%             1.9%
    gcc         2-way           1.6%                    1.4%             1.5%
    gcc         4-way           1.6%                    1.4%             1.5%
    spice       Direct          0.3%                    0.6%             0.4%
    spice       2-way           0.3%                    0.6%             0.4%
    spice       4-way           0.3%                    0.6%             0.4%

Block Replacement Strategies (7.5)

• We have to replace a block when there is a collision

• Collisions occur whenever the selected set is full

• Strategy 1: Ideal (Oracle)

• Replace the block that won’t be used again for the longest time

• Drawback - Requires knowledge of the future

• Strategy 2: Least Recently Used (LRU)

• Replace the block that was last used (hit) the longest time ago

• Drawback - Requires difficult bookkeeping

• Strategy 3: Approximate LRU

• Set a use bit for each block every time it is hit, clear all periodically

• Replace a block without its use bit set (approximate LRU is sketched in code after this list)

• Strategy 4: Random

• Pick a block at random (works almost as well as approx. LRU)
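A minimal sketch of approximate LRU (strategy 3) with a random fallback (strategy 4), assuming a 4-way set; the names and structure are illustrative:

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdlib.h>

    #define WAYS 4

    typedef struct { bool valid; bool used; uint32_t tag; } Way;

    /* Approximate LRU: set a use bit on every hit, clear all use bits
     * periodically, and on a replacement prefer a way whose use bit is clear.
     * If every way was recently used, fall back to a random pick. */
    int choose_victim(Way set[WAYS])
    {
        for (int w = 0; w < WAYS; w++)
            if (!set[w].valid) return w;     /* empty way: nothing to evict      */
        for (int w = 0; w < WAYS; w++)
            if (!set[w].used) return w;      /* not recently used: evict this    */
        return rand() % WAYS;                /* all recently used: pick randomly */
    }

    void on_hit(Way set[WAYS], int way)  { set[way].used = true; }
    void periodic_clear(Way set[WAYS])   { for (int w = 0; w < WAYS; w++) set[w].used = false; }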

The Three C’s of Misses (7.5)

• Compulsory Misses

• The first time a memory location is accessed, it is always a miss

• Also known as cold-start misses

• The only way to decrease the compulsory miss rate is to increase the block size

• Capacity Misses

• Occur when a program is using more data than can fit in the cache

• Some misses will result because the cache isn’t big enough

• Increasing the size of the cache solves this problem

• Conflict Misses

• Occur when a block forces out another block with the same index

• Increasing Associativity reduces conflict misses

• Worst in Direct-Mapped, non-existent in Fully Associative