Chapter 7a: Cache Memory



TRANSCRIPT

Page 1: Chapter 7a: Cache Memory
Page 2: Chapter 7a: Cache Memory

Ch7a-2 · EE/CS/CPE 3760 - Computer Organization, Seattle Pacific University

Big is Slow

• The more information stored, the slower the access

7.1

Amdahl’s law?

• Consider taking an open-book exam. You might find the answer:

• In your memory

• In a sheet of notes

• In course handouts

• In the textbook

Spatial Locality – you’re likely to have questions on similar topics

Temporal Locality – if you need a particular formula, you’re likely to need it again soon

Page 3: Chapter 7a: Cache Memory


And so it is with Computers

• Our system has two kinds of memory

• Registers: close to the CPU, small in number, fast

• Main memory: big, slow (15ns), “far” from the CPU

7.1

[Diagram: CPU and Registers connected to Main Memory by Load/I-Fetch and Store paths]

Assembly language programmers and compilers manage all transitions between registers and main memory

Page 4: Chapter 7a: Cache Memory


The problem...

7.1

[Pipeline diagram: IF → RF → EX → M → WB; the Instruction Fetch (IF) and Memory Access (M) stages both go to memory]

• Since every instruction has to be fetched from memory, we lose big time

• We lose double big time when executing a load or store

• DRAM Memory access takes around 15ns

• At 100 MHz, that’s 1.5 cycles

• At 1GHz, that’s 15 cycles

• Don’t even get started on 3-4 GHz

Note: Access time is faster in some memory modes, but basic access is around 10-20ns


Page 5: Chapter 7a: Cache Memory


A hopeful thought

7.1

• Static RAMs are much faster than DRAMs

• 3-4 ns possible (instead of 15ns)

• So, build memory out of SRAMs

• SRAMs cost about 20 times as much as DRAM

• Technology limitations cause the price difference

• Access time gets worse if larger SRAM systems are needed (small is fast...)

• Nice try.

Page 6: Chapter 7a: Cache Memory


A more hopeful thought

7.1

• Remember the telephone directory?

• Do the same thing with computer memory

[Diagram: CPU and Registers, then an SRAM Cache, then Main Memory (DRAM); Load/I-Fetch and Store traffic goes through the cache]

The big question: What goes in the cache?

• Build a hierarchy of memories between the registers and main memory

• Closer to CPU: Small and fast (frequently used)

• Closer to Main Memory: Big and slow (more rarely used)

Page 7: Chapter 7a: Cache Memory


Locality

7.1

Temporal locality – the program is very likely to access the same data again and again over time:

    i = i + 1;
    if (i < 20) {
        z = i*i + 3*i - 2;
    }
    q = A[i];

Spatial locality – the program is very likely to access data that is close together:

    p = A[i];
    q = A[i+1];
    r = A[i] * A[i+3] - A[i+2];

    name = employee.name;
    rank = employee.rank;
    salary = employee.salary;

Page 8: Chapter 7a: Cache Memory


The Cache

7.2

Cache – the 4 most recently accessed memory locations (exploits temporal locality):

    Address  Data
    1000     5600
    1016     0
    1048     2447
    1028     43

Issues: How do we know what’s in the cache? What if the cache is full?

Main Memory Fragment:

    Address  Data
    1000     5600
    1004     3223
    1008     23
    1012     1122
    1016     0
    1020     32324
    1024     845
    1028     43
    1032     976
    1036     77554
    1040     433
    1044     7785
    1048     2447
    1052     775
    1056     433

Page 9: Chapter 7a: Cache Memory


Goals for Cache Organization

• Complete

• Data may come from anywhere in main memory

• Fast lookup

• We have to look up data in the cache on every memory access

• Exploits temporal locality

• Stores only the most recently accessed data

• Exploits spatial locality

• Stores related data

Page 10: Chapter 7a: Cache Memory


Direct Mapping

7.2

6-bit Address = Tag (2 bits) | Index (2 bits) | 00 (byte offset – always zero for word addresses)

Main Memory (addresses shown as tag | index | byte offset):

    Address    Data
    00 00 00   5600
    00 01 00   3223
    00 10 00   23
    00 11 00   1122
    01 00 00   0
    01 01 00   32324
    01 10 00   845
    01 11 00   43
    10 00 00   976
    10 01 00   77554
    10 10 00   433
    10 11 00   7785
    11 00 00   2447
    11 01 00   775
    11 10 00   433
    11 11 00   3649

Cache:

    Index  Valid  Tag  Data
    00     Y      00   5600
    01     Y      11   775
    10     Y      01   845
    11     N      00   32324 (stale)

In a direct-mapped cache:

• Each memory address corresponds to one location in the cache

• There are many different memory locations for each cache entry (four in this case)

Page 11: Chapter 7a: Cache Memory


Hits and Misses

7.2

• The hit rate and miss rate are the fraction of memory accesses that are hits and misses

• Typically, hit rates are around 95%

• Many times instructions and data are considered separately when calculating hit/miss rates

• When the CPU reads from memory:

• Calculate the index and tag

• Is the data in the cache? Yes – a hit, you’re done!

• Data not in the cache? This is a miss.

• Read the word from memory, give it to the CPU.

• Update the cache so we won’t miss again. Write the data and tag for this memory location to the cache. (Exploits temporal locality)
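The read procedure above can be sketched in C for a direct-mapped cache with one-word blocks. This is a minimal sketch, not the slide’s hardware: the 1024-entry size and the `mem_word` stand-in for slow DRAM are assumptions for illustration.

```c
#include <stdint.h>

#define ENTRIES 1024   /* assumed geometry: 1024 one-word entries */

/* Stand-in for slow DRAM: returns a deterministic word value. */
static uint32_t mem_word(uint32_t word) { return word * 2 + 1; }

static struct { int valid; uint32_t tag, data; } cache[ENTRIES];
static int misses;

/* Read one word at a byte address: calculate the index and tag, check
   the cache, and on a miss fetch from memory and update the entry so
   the next access to this address hits (temporal locality). */
uint32_t cache_read(uint32_t addr) {
    uint32_t word  = addr >> 2;          /* drop the 2-bit byte offset */
    uint32_t index = word % ENTRIES;     /* which entry to look in     */
    uint32_t tag   = word / ENTRIES;     /* which address owns it      */
    if (cache[index].valid && cache[index].tag == tag)
        return cache[index].data;        /* hit - done!                */
    misses++;                            /* miss - go to memory        */
    cache[index].valid = 1;
    cache[index].tag   = tag;
    cache[index].data  = mem_word(word);
    return cache[index].data;
}
```

Reading the same address twice misses once and then hits: the first access pays the memory latency, the update makes the second one fast.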

Page 12: Chapter 7a: Cache Memory


A 1024-entry Direct-mapped Cache

7.2

[Diagram: a 32-bit memory address split into Tag (bits 31-12, 20 bits), Index (bits 11-2, 10 bits), and Byte offset (bits 1-0). The 10-bit index selects one of the 1024 cache entries (V, 20-bit tag, one-word block of data); Hit! is signaled when the entry is valid and its stored tag matches the address tag, and the 32-bit Data word is returned.]

Page 13: Chapter 7a: Cache Memory


Example - 1024-entry Direct Mapped Cache

7.2

[Diagram: the 1024-entry direct-mapped cache; Index is 10 bits, Tag is 20 bits, byte addresses. Among the entries, index 3 holds V=1, tag=14, data=34238829.]

Assume the cache has been used for a while, so it’s not empty...

LW $t3, 0x0000E00C($0)

address = 0000 0000 0000 0000 1110 0000 0000 1100
tag = 14, index = 3, byte offset = 0

Hit: Data is 34238829

LB $t3, 0x00003005($0) (let’s assume the word at mem[0x00003004] = 8764)

address = 0000 0000 0000 0000 0011 0000 0000 0101
tag = 3, index = 1, byte offset = 1

Miss: load the word from mem[0x00003004] and write it into the cache at index 1 (V=1, tag=3, data=8764)
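The field extraction in this example can be checked mechanically. A sketch: the shift and mask constants follow from the 10-bit index and 2-bit byte offset above.

```c
#include <stdint.h>

/* Split a 32-bit byte address for a 1024-entry, one-word-block cache:
   tag = bits 31-12 (20 bits), index = bits 11-2 (10 bits),
   byte offset = bits 1-0. */
uint32_t dm_tag(uint32_t a)    { return a >> 12; }
uint32_t dm_index(uint32_t a)  { return (a >> 2) & 0x3FF; }
uint32_t dm_offset(uint32_t a) { return a & 0x3; }
```

For 0x0000E00C this gives tag 14, index 3, offset 0 (the LW hit); for 0x00003005 it gives tag 3, index 1, offset 1 (the LB miss).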

Page 14: Chapter 7a: Cache Memory


Separate I- and D-Caches

• It is common to use two separate caches for Instructions and for Data

• All Instruction fetches use the I-cache

• All data accesses (loads and stores) use the D-cache

• This allows the CPU to access the I-cache at the same time it is accessing the D-cache

• Still have to share a single memory

[Pipeline: IF → RF → EX → M → WB. The IF stage reads the Instruction Cache, the M stage accesses the Data Cache; on a miss, each cache goes to the shared Main Memory]

Page 15: Chapter 7a: Cache Memory


So, how’d we do?

7.2

Miss rates for DEC 3100 (MIPS machine)

Note: This isn’t just the average

    Benchmark   Instruction miss rate   Data miss rate   Combined miss rate
    spice       1.2%                    1.3%             1.2%
    gcc         6.1%                    2.1%             5.4%

Separate 64KB Instruction/Data Caches (16K 1-word blocks)
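The combined rate is a weighted average over all memory accesses, not a simple mean: every instruction causes an I-fetch, but only loads and stores touch the D-cache. A sketch of the weighting; the 0.175 data-access fraction used below is an assumption back-solved from the gcc row, not a figure from the slide.

```c
/* Combined miss rate = each cache's miss rate weighted by its share of
   all accesses. data_frac is the fraction of accesses that are data
   accesses (loads/stores); the remainder are instruction fetches. */
double combined_miss(double i_rate, double d_rate, double data_frac) {
    return (1.0 - data_frac) * i_rate + data_frac * d_rate;
}
```

With 17.5% of accesses going to data, 0.825 × 6.1% + 0.175 × 2.1% gives the 5.4% combined rate in the gcc row.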

Page 16: Chapter 7a: Cache Memory


The issue of writes

7.2

• What to do on a store (hit or miss)

• Won’t do to just write it to the cache

• The cache would have a different (newer) value than main memory

• Simple Write-Through

• Write both the cache and memory

• Works correctly, but slowly

• Buffered Write-Through

• Write the cache

• Buffer a write request to main memory

• 1 to 10 buffer slots are typical
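Buffered write-through can be sketched as a small queue between the cache and DRAM. A sketch under assumptions: the 4-slot depth is one point in the 1-to-10 range above, and the cache update and actual DRAM write are elided.

```c
#include <stdint.h>

#define BUF_SLOTS 4   /* assumed buffer depth (1 to 10 slots are typical) */

static struct { uint32_t addr, data; } wbuf[BUF_SLOTS];
static int pending;   /* writes queued but not yet retired to DRAM */

/* Buffered write-through: the cache is updated immediately (not shown);
   the matching memory write is queued so the CPU needn't wait for DRAM.
   Returns 1 if buffered, 0 if the buffer is full and the CPU must stall. */
int buffered_store(uint32_t addr, uint32_t data) {
    if (pending == BUF_SLOTS)
        return 0;                      /* stall until a slot drains */
    wbuf[pending].addr = addr;
    wbuf[pending].data = data;
    pending++;
    return 1;
}

/* Memory-controller side: retire the oldest buffered write. */
int drain_one(void) {
    if (pending == 0) return 0;
    /* ... the DRAM write for wbuf[0] would happen here ... */
    for (int i = 1; i < pending; i++) wbuf[i - 1] = wbuf[i];
    pending--;
    return 1;
}
```

The CPU only stalls when stores arrive faster than memory can retire them, which is exactly the case the buffer is meant to smooth over.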

Page 17: Chapter 7a: Cache Memory


What about Spatial Locality?

7.2

• Spatial locality says that physically close data is likely to be accessed close together

• On a cache miss, don’t just grab the word needed, but also the words nearby

• Organize memory in multi-word blocks

• Memory transfers between cache and memory are always one full block

Main Memory, organized as 4-word blocks; each block is 16 bytes:

    Block     Address    Data
    Block 0   00 00 00   5600
              00 01 00   3223
              00 10 00   23
              00 11 00   1122
    Block 1   01 00 00   0
              01 01 00   32324
              01 10 00   845
              01 11 00   43
    Block 2   10 00 00   976
              10 01 00   77554
              10 10 00   433
              10 11 00   7785
    Block 3   11 00 00   2447
              11 01 00   775
              11 10 00   433
              11 11 00   3649

On a miss, the cache copies the entire block that contains the desired word

Page 18: Chapter 7a: Cache Memory


Working with Blocks

Address = Tag (bits 31-14, 18 bits) | Index (bits 13-4, 10 bits) | Block offset (bits 3-2, 2 bits) | Byte offset (bits 1-0, 2 bits)

One cache entry holds one 4-word block: V | Tag | Word 3 | Word 2 | Word 1 | Word 0

All words in the same block have the same index and tag

The requested word may be at any position within a block.

The block size may be any power of 2: 1, 2, 4, 8, 16, …

Page 19: Chapter 7a: Cache Memory


32KByte / 4-Word Block Direct-Mapped Cache

7.2

32 KB / 4 words/block / 4 bytes/word --> 2K (2^11) blocks

Address = Tag (bits 31-15, 17 bits) | Index (bits 14-4, 11 bits) | Block offset (bits 3-2) | Byte offset (bits 1-0)

[Diagram: the 11-bit index selects one of the 2048 entries (V, 17-bit tag, four 32-bit data words); a valid entry whose tag matches signals Hit!, and a 4-to-1 mux driven by the block offset selects the requested 32-bit word]
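The field widths for this geometry can be written down directly. A sketch: the constants come from the 17-bit tag, 11-bit index, and 2+2 offset bits above.

```c
#include <stdint.h>

/* 32 KB direct-mapped cache with 4-word (16-byte) blocks -> 2^11 blocks:
   tag = bits 31-15 (17 bits), index = bits 14-4 (11 bits),
   block offset = bits 3-2, byte offset = bits 1-0. */
uint32_t blk_tag(uint32_t a)     { return a >> 15; }
uint32_t blk_index(uint32_t a)   { return (a >> 4) & 0x7FF; }
uint32_t blk_woffset(uint32_t a) { return (a >> 2) & 0x3; }
```

Two addresses in the same 16-byte block, such as 0x12340 and 0x1234C, share a tag and index and differ only in the word the mux selects.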

Page 20: Chapter 7a: Cache Memory


How Much Change?

7.2

Miss rates for DEC 3100 (MIPS machine)

    Benchmark   Block size (words)   Instruction miss rate   Data miss rate   Combined
    spice       1                    1.2%                    1.3%             1.2%
    gcc         1                    6.1%                    2.1%             5.4%
    spice       4                    0.3%                    0.6%             0.4%
    gcc         4                    2.0%                    1.7%             1.9%

Separate 64KB Instruction/Data Caches (16K 1-word blocks or 4K 4-word blocks)

Page 21: Chapter 7a: Cache Memory


Choosing a block size

7.2

• Large block sizes help with spatial locality, but...

• It takes time to read the memory in

• Larger block sizes increase the time for misses

• It reduces the number of blocks in the cache

• Number of blocks = cache size / block size

• Need to find a middle ground

• 16-64 bytes works nicely

Page 22: Chapter 7a: Cache Memory


Other Cache organizations

7.3

Direct Mapped

    [Diagram: 16 entries, indexes 0: through 15:, each V | Tag | Data]

Address = Tag | Index | Block offset

Each address has only one possible location

Fully Associative

    [Diagram: the same entries, V | Tag | Data, but with no index — a block may go anywhere]

Address = Tag | Block offset

Page 23: Chapter 7a: Cache Memory


Fully Associative vs. Direct Mapped

7.3

• Fully associative caches provide much greater flexibility

• Nothing gets “thrown out” of the cache until it is completely full

• Direct-mapped caches are more rigid

• Any cached data goes directly where the index says to, even if the rest of the cache is empty

• A problem, though...

• Fully associative caches require a complete search through all the tags to see if there’s a hit

• Direct-mapped caches only need to look one place

Page 24: Chapter 7a: Cache Memory


A Compromise

7.3

2-Way Set Associative

    [Diagram: 8 sets, indexes 0: through 7:, each holding two V | Tag | Data entries]

Address = Tag | Index | Block offset

Each address has two possible locations with the same index

One fewer index bit: 1/2 the indexes

4-Way Set Associative

    [Diagram: 4 sets, indexes 0: through 3:, each holding four V | Tag | Data entries]

Address = Tag | Index | Block offset

Each address has four possible locations with the same index

Two fewer index bits: 1/4 the indexes
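The index-bit trade above follows directly from the geometry: sets = blocks / associativity, and index bits = log2(sets). A sketch:

```c
/* For a fixed number of blocks, doubling the associativity halves the
   number of sets and removes one index bit (that bit moves to the tag). */
int index_bits(int num_blocks, int assoc) {
    int sets = num_blocks / assoc, bits = 0;
    while (sets > 1) { sets >>= 1; bits++; }  /* bits = log2(sets) */
    return bits;
}
```

For an 8-block cache: 3 index bits direct-mapped, 2 bits 2-way, 1 bit 4-way; the 2K-block cache from the earlier slide needs 11 index bits direct-mapped.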

Page 25: Chapter 7a: Cache Memory


Set Associative Example

7.3

128-byte cache, 4-word blocks (8 blocks total), 10-bit addresses, 1-4 way associativity

Address fields: Tag (3-5 bits) | Index (1-3 bits) | Block offset (2 bits) | Byte offset (2 bits)

Access sequence (the same five addresses for each organization):

    0100111000, 1100110100, 0100111100, 0110110000, 1100111000

Results:

    Organization                            Index bits   Access results
    Direct-Mapped (8 indexes, 000:-111:)    3            Miss, Miss, Miss, Miss, Miss
    2-Way Set Assoc. (4 sets, 00:-11:)      2            Miss, Miss, Hit, Miss, Miss
    4-Way Set Assoc. (2 sets, 0:-1:)        1            Miss, Miss, Hit, Miss, Hit

In the direct-mapped cache all five addresses map to index 011, so every access is a conflict miss; adding associativity turns some of those misses into hits.
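The hit/miss patterns above can be reproduced with a small simulator. A sketch under the slide’s parameters: 8 blocks, the same five addresses, true LRU replacement within each set.

```c
#include <stdint.h>

#define BLOCKS 8   /* 128-byte cache / 16-byte blocks */

/* Replay the five-access sequence and count hits for a given
   associativity (1 = direct-mapped), using LRU within each set. */
int count_hits(int assoc) {
    uint32_t tags[BLOCKS];
    int valid[BLOCKS], stamp[BLOCKS];
    /* the slide's addresses: 0100111000, 1100110100, 0100111100,
       0110110000, 1100111000 */
    uint32_t seq[5] = { 0x138, 0x334, 0x13C, 0x1B0, 0x338 };
    int sets = BLOCKS / assoc, hits = 0;
    for (int i = 0; i < BLOCKS; i++) valid[i] = 0;
    for (int t = 0; t < 5; t++) {
        uint32_t blk = seq[t] >> 4;          /* drop byte+block offsets */
        uint32_t set = blk % sets, tag = blk / sets;
        int hit = 0;
        for (int w = 0; w < assoc; w++) {    /* search every way in set */
            int s = set * assoc + w;
            if (valid[s] && tags[s] == tag) { hit = 1; stamp[s] = t; }
        }
        if (hit) { hits++; continue; }
        int victim = 0;                      /* miss: empty or LRU way  */
        for (int w = 0; w < assoc; w++) {
            int s = set * assoc + w;
            if (!valid[s]) { victim = w; break; }
            if (stamp[s] < stamp[set * assoc + victim]) victim = w;
        }
        int s = set * assoc + victim;
        valid[s] = 1; tags[s] = tag; stamp[s] = t;
    }
    return hits;
}
```

Running it gives 0 hits direct-mapped, 1 hit 2-way, and 2 hits 4-way, matching the table above.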

Page 26: Chapter 7a: Cache Memory


New Performance Numbers

7.3

Miss rates for DEC 3100 (MIPS machine)

    Benchmark   Associativity   Instruction miss rate   Data miss rate   Combined
    gcc         Direct          2.0%                    1.7%             1.9%
    gcc         2-way           1.6%                    1.4%             1.5%
    gcc         4-way           1.6%                    1.4%             1.5%
    spice       Direct          0.3%                    0.6%             0.4%
    spice       2-way           0.3%                    0.6%             0.4%
    spice       4-way           0.3%                    0.6%             0.4%

Separate 64KB Instruction/Data Caches (4K 4-word blocks)

Page 27: Chapter 7a: Cache Memory


Block Replacement Strategies

7.5

• We have to replace a block when there is a collision

• Collisions occur whenever the selected set is full

• Strategy 1: Ideal (Oracle)

• Replace the block that won’t be used again for the longest time

• Drawback - Requires knowledge of the future

• Strategy 2: Least Recently Used (LRU)

• Replace the block that was last used (hit) the longest time ago

• Drawback - Requires difficult bookkeeping

• Strategy 3: Approximate LRU

• Set a use bit for each block every time it is hit; clear all of them periodically

• Replace a block whose use bit is not set

• Strategy 4: Random

• Pick a block at random (works almost as well as approx. LRU)
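Strategy 3’s victim selection is cheap enough to sketch directly. An illustration, not the only hardware realization: the use bits are packed into a bitmask, and the periodic clear is folded into the all-bits-set case.

```c
/* Approximate LRU: 'use' is a bitmask with bit i set if block i was hit
   since the last periodic clear. Replace the first block whose use bit
   is clear; if every bit is set, hardware would clear them all and can
   simply take block 0. */
int pick_victim(unsigned use, int n) {
    for (int i = 0; i < n; i++)
        if (!((use >> i) & 1u))
            return i;          /* first block not recently used */
    return 0;                  /* all recently used: reset case */
}
```

This keeps only one bit of state per block, which is why it is an approximation: it distinguishes "used recently" from "not used recently" but not the full ordering that true LRU tracks.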

Page 28: Chapter 7a: Cache Memory


The Three C’s of Misses

7.5

• Compulsory Misses

• The first time a memory location is accessed, it is always a miss

• Also known as cold-start misses

• The only way to reduce them is to increase the block size (each miss then brings in more nearby data)

• Capacity Misses

• Occur when a program is using more data than can fit in the cache

• Some misses will result because the cache isn’t big enough

• Increasing the size of the cache solves this problem

• Conflict Misses

• Occur when a block forces out another block with the same index

• Increasing Associativity reduces conflict misses

• Worst in Direct-Mapped, non-existent in Fully Associative

Page 29: Chapter 7a: Cache Memory


Cache Sizing

[Diagram: CPU and Registers, then the Cache, then Main Memory (DRAM); Load/I-Fetch and Store traffic goes through the cache]

• How big should the cache be?

• As big as possible! Hold as much data in the cache as you can.

• But… Smaller is faster…

• The cache must provide the data within 1 CPU cycle to avoid stalling

• Cache must be on the same chip as the CPU

• Make the cache as large as possible until either:

• Access time is > 1 CPU cycle

• Run out of room on the CPU chip

Page 30: Chapter 7a: Cache Memory


Multi-level Caches

[Diagram: CPU and Registers → L1 Cache → L2 Cache → L3 Cache → Main Memory (DRAM)]

• The difference between a cache hit (1 cycle) and a miss (30-50 cycles) is huge

• Introduce a series of larger, but slower, caches to smooth out the difference

• L1 Cache: As big as can be in 1 cycle

• L2 Cache: As big as can be in 3-5 cycles

• L3 Cache: As big as can be in 5-10 cycles

• L2/L3 Cache may be on/off chip depending on CPU speeds and constraints
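These latencies give a back-of-the-envelope average memory access time. A sketch: the 1/4/8-cycle hit times are taken from the level targets above (picking one point in each range), while the 5%/20%/20% miss rates and the 40-cycle memory latency are illustrative assumptions, not figures from the slides.

```c
/* Average memory access time for a three-level hierarchy, using each
   level's hit time (in cycles) and its *local* miss rate: the time at
   each level is paid only by the fraction of accesses that reach it. */
double amat(double t1, double m1, double t2, double m2,
            double t3, double m3, double tmem) {
    return t1 + m1 * (t2 + m2 * (t3 + m3 * tmem));
}
```

With the assumed numbers, amat(1, 0.05, 4, 0.2, 8, 0.2, 40) comes to about 1.36 cycles: even a modest L1 hit rate hides almost all of the 40-cycle DRAM latency, which is the whole point of the hierarchy.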