Chapter 7a: Cache Memory



TRANSCRIPT

Page 1: Chapter 7a: Cache Memory
Page 2: Chapter 7a: Cache Memory

Ch7a-2 · EE/CS/CPE 3760 - Computer Organization, Seattle Pacific University

Big is Slow

• The more information stored, the slower the access

7.1

Amdahl’s law?

• Consider taking an open-book exam. You might find the answer:

• In your memory

• In a sheet of notes

• In course handouts

• In the textbook

Spatial Locality – you’re likely to have questions on similar topics

Temporal Locality – if you need a particular formula, you’re likely to need it again soon

Page 3: Chapter 7a: Cache Memory


And so it is with Computers

• Our system has two kinds of memory

• Registers: close to the CPU, small in number, fast

• Main memory: big, slow (15ns), “far” from the CPU

7.1

[Diagram: CPU and Registers connected to Main Memory by Load/I-Fetch and Store paths]

Assembly language programmers and compilers manage all transitions between registers and main memory

Page 4: Chapter 7a: Cache Memory


The problem...

7.1

[Pipeline diagram: IF → RF → EX → M → WB; the Instruction Fetch (IF) and Memory Access (M) stages both go to memory]

• Since every instruction has to be fetched from memory, we lose big time

• We lose double big time when executing a load or store

• DRAM Memory access takes around 15ns

• At 100 MHz, that’s 1.5 cycles

• At 1GHz, that’s 15 cycles

• Don’t even get started on 3-4 GHz

Note: Access time is faster in some memory modes, but basic access is around 10-20ns


Page 5: Chapter 7a: Cache Memory


A hopeful thought

7.1

• Static RAMs are much faster than DRAMs

• 3-4 ns possible (instead of 15ns)

• So, build memory out of SRAMs

• SRAMs cost about 20 times as much as DRAM

• Technology limitations cause the price difference

• Access time gets worse if larger SRAM systems are needed (small is fast...)

• Nice try.

Page 6: Chapter 7a: Cache Memory


A more hopeful thought

7.1

• Remember the telephone directory?

• Do the same thing with computer memory

[Diagram: CPU and Registers, then an SRAM Cache, then Main Memory (DRAM); Load/I-Fetch and Store traffic goes through the cache]

The big question: What goes in the cache?

• Build a hierarchy of memories between the registers and main memory

• Closer to CPU: Small and fast (frequently used)

• Closer to Main Memory: Big and slow (more rarely used)

Page 7: Chapter 7a: Cache Memory


Locality

7.1

Temporal locality – the program is very likely to access the same data again and again over time:

    i = i + 1;
    if (i < 20) {
        z = i*i + 3*i - 2;
    }
    q = A[i];

Spatial locality – the program is very likely to access data that is close together:

    p = A[i];
    q = A[i+1];
    r = A[i] * A[i+3] - A[i+2];

    name = employee.name;
    rank = employee.rank;
    salary = employee.salary;

Page 8: Chapter 7a: Cache Memory


The Cache

7.2

Cache – the 4 most recently accessed memory locations (exploits temporal locality):

    Address  Data
    1000     5600
    1016     0
    1048     2447
    1028     43

Issues: How do we know what’s in the cache? What if the cache is full?

Main Memory Fragment:

    Address  Data
    1000     5600
    1004     3223
    1008     23
    1012     1122
    1016     0
    1020     32324
    1024     845
    1028     43
    1032     976
    1036     77554
    1040     433
    1044     7785
    1048     2447
    1052     775
    1056     433

Page 9: Chapter 7a: Cache Memory


Goals for Cache Organization

• Complete

• Data may come from anywhere in main memory

• Fast lookup

• We have to look up data in the cache on every memory access

• Exploits temporal locality

• Stores only the most recently accessed data

• Exploits spatial locality

• Stores related data

Page 10: Chapter 7a: Cache Memory


Direct Mapping

7.2

6-bit Address = Tag (2 bits) | Index (2 bits) | 00 (byte offset – always zero for word addresses)

Main Memory (addresses shown as tag | index | byte offset):

    Address    Data
    00 00 00   5600
    00 01 00   3223
    00 10 00   23
    00 11 00   1122
    01 00 00   0
    01 01 00   32324
    01 10 00   845
    01 11 00   43
    10 00 00   976
    10 01 00   77554
    10 10 00   433
    10 11 00   7785
    11 00 00   2447
    11 01 00   775
    11 10 00   433
    11 11 00   3649

Cache:

    Index  Valid  Tag  Data
    00     Y      00   5600
    01     Y      11   775
    10     Y      01   845
    11     N      00   32324 (stale)

In a direct-mapped cache:

• Each memory address corresponds to one location in the cache

• There are many different memory locations for each cache entry (four in this case)

Page 11: Chapter 7a: Cache Memory


Hits and Misses

7.2

• The hit rate and miss rate are the fraction of memory accesses that are hits and misses

• Typically, hit rates are around 95%

• Many times instructions and data are considered separately when calculating hit/miss rates

• When the CPU reads from memory:

• Calculate the index and tag

• Is the data in the cache? Yes – a hit, you’re done!

• Data not in the cache? This is a miss.

• Read the word from memory, give it to the CPU.

• Update the cache so we won’t miss again. Write the data and tag for this memory location to the cache. (Exploits temporal locality)
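The read procedure above can be sketched in C for a direct-mapped cache with one-word blocks. This is a minimal sketch, not the slide’s hardware: the 1024-entry size and the `mem_word` stand-in for slow DRAM are assumptions for illustration.

```c
#include <stdint.h>

#define ENTRIES 1024   /* assumed geometry: 1024 one-word entries */

/* Stand-in for slow DRAM: returns a deterministic word value. */
static uint32_t mem_word(uint32_t word) { return word * 2 + 1; }

static struct { int valid; uint32_t tag, data; } cache[ENTRIES];
static int misses;

/* Read one word at a byte address: calculate the index and tag, check
   the cache, and on a miss fetch from memory and update the entry so
   the next access to this address hits (temporal locality). */
uint32_t cache_read(uint32_t addr) {
    uint32_t word  = addr >> 2;          /* drop the 2-bit byte offset */
    uint32_t index = word % ENTRIES;     /* which entry to look in     */
    uint32_t tag   = word / ENTRIES;     /* which address owns it      */
    if (cache[index].valid && cache[index].tag == tag)
        return cache[index].data;        /* hit - done!                */
    misses++;                            /* miss - go to memory        */
    cache[index].valid = 1;
    cache[index].tag   = tag;
    cache[index].data  = mem_word(word);
    return cache[index].data;
}
```

Reading the same address twice misses once and then hits: the first access pays the memory latency, the update makes the second one fast.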

Page 12: Chapter 7a: Cache Memory


A 1024-entry Direct-mapped Cache

7.2

[Diagram: a 32-bit memory address split into Tag (bits 31-12, 20 bits), Index (bits 11-2, 10 bits), and Byte offset (bits 1-0). The 10-bit index selects one of the 1024 cache entries (V, 20-bit tag, one-word block of data); Hit! is signaled when the entry is valid and its stored tag matches the address tag, and the 32-bit Data word is returned.]

Page 13: Chapter 7a: Cache Memory


Example - 1024-entry Direct Mapped Cache

7.2

[Diagram: the 1024-entry direct-mapped cache; Index is 10 bits, Tag is 20 bits, byte addresses. Among the entries, index 3 holds V=1, tag=14, data=34238829.]

Assume the cache has been used for a while, so it’s not empty...

LW $t3, 0x0000E00C($0)

address = 0000 0000 0000 0000 1110 0000 0000 1100
tag = 14, index = 3, byte offset = 0

Hit: Data is 34238829

LB $t3, 0x00003005($0) (let’s assume the word at mem[0x00003004] = 8764)

address = 0000 0000 0000 0000 0011 0000 0000 0101
tag = 3, index = 1, byte offset = 1

Miss: load the word from mem[0x00003004] and write it into the cache at index 1 (V=1, tag=3, data=8764)
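The field extraction in this example can be checked mechanically. A sketch: the shift and mask constants follow from the 10-bit index and 2-bit byte offset above.

```c
#include <stdint.h>

/* Split a 32-bit byte address for a 1024-entry, one-word-block cache:
   tag = bits 31-12 (20 bits), index = bits 11-2 (10 bits),
   byte offset = bits 1-0. */
uint32_t dm_tag(uint32_t a)    { return a >> 12; }
uint32_t dm_index(uint32_t a)  { return (a >> 2) & 0x3FF; }
uint32_t dm_offset(uint32_t a) { return a & 0x3; }
```

For 0x0000E00C this gives tag 14, index 3, offset 0 (the LW hit); for 0x00003005 it gives tag 3, index 1, offset 1 (the LB miss).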

Page 14: Chapter 7a: Cache Memory


Separate I- and D-Caches

• It is common to use two separate caches for Instructions and for Data

• All Instruction fetches use the I-cache

• All data accesses (loads and stores) use the D-cache

• This allows the CPU to access the I-cache at the same time it is accessing the D-cache

• Still have to share a single memory

[Pipeline: IF → RF → EX → M → WB. The IF stage reads the Instruction Cache, the M stage accesses the Data Cache; on a miss, each cache goes to the shared Main Memory]

Page 15: Chapter 7a: Cache Memory


So, how’d we do?

7.2

Miss rates for DEC 3100 (MIPS machine)

Note: This isn’t just the average

    Benchmark   Instruction miss rate   Data miss rate   Combined miss rate
    spice       1.2%                    1.3%             1.2%
    gcc         6.1%                    2.1%             5.4%

Separate 64KB Instruction/Data Caches (16K 1-word blocks)
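The combined rate is a weighted average over all memory accesses, not a simple mean: every instruction causes an I-fetch, but only loads and stores touch the D-cache. A sketch of the weighting; the 0.175 data-access fraction used below is an assumption back-solved from the gcc row, not a figure from the slide.

```c
/* Combined miss rate = each cache's miss rate weighted by its share of
   all accesses. data_frac is the fraction of accesses that are data
   accesses (loads/stores); the remainder are instruction fetches. */
double combined_miss(double i_rate, double d_rate, double data_frac) {
    return (1.0 - data_frac) * i_rate + data_frac * d_rate;
}
```

With 17.5% of accesses going to data, 0.825 × 6.1% + 0.175 × 2.1% gives the 5.4% combined rate in the gcc row.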

Page 16: Chapter 7a: Cache Memory


The issue of writes

7.2

• What to do on a store (hit or miss)

• Won’t do to just write it to the cache

• The cache would have a different (newer) value than main memory

• Simple Write-Through

• Write both the cache and memory

• Works correctly, but slowly

• Buffered Write-Through

• Write the cache

• Buffer a write request to main memory

• 1 to 10 buffer slots are typical
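Buffered write-through can be sketched as a small queue between the cache and DRAM. A sketch under assumptions: the 4-slot depth is one point in the 1-to-10 range above, and the cache update and actual DRAM write are elided.

```c
#include <stdint.h>

#define BUF_SLOTS 4   /* assumed buffer depth (1 to 10 slots are typical) */

static struct { uint32_t addr, data; } wbuf[BUF_SLOTS];
static int pending;   /* writes queued but not yet retired to DRAM */

/* Buffered write-through: the cache is updated immediately (not shown);
   the matching memory write is queued so the CPU needn't wait for DRAM.
   Returns 1 if buffered, 0 if the buffer is full and the CPU must stall. */
int buffered_store(uint32_t addr, uint32_t data) {
    if (pending == BUF_SLOTS)
        return 0;                      /* stall until a slot drains */
    wbuf[pending].addr = addr;
    wbuf[pending].data = data;
    pending++;
    return 1;
}

/* Memory-controller side: retire the oldest buffered write. */
int drain_one(void) {
    if (pending == 0) return 0;
    /* ... the DRAM write for wbuf[0] would happen here ... */
    for (int i = 1; i < pending; i++) wbuf[i - 1] = wbuf[i];
    pending--;
    return 1;
}
```

The CPU only stalls when stores arrive faster than memory can retire them, which is exactly the case the buffer is meant to smooth over.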

Page 17: Chapter 7a: Cache Memory


What about Spatial Locality?

7.2

• Spatial locality says that physically close data is likely to be accessed close together

• On a cache miss, don’t just grab the word needed, but also the words nearby

• Organize memory in multi-word blocks

• Memory transfers between cache and memory are always one full block

Main Memory, organized as 4-word blocks; each block is 16 bytes:

    Block     Address    Data
    Block 0   00 00 00   5600
              00 01 00   3223
              00 10 00   23
              00 11 00   1122
    Block 1   01 00 00   0
              01 01 00   32324
              01 10 00   845
              01 11 00   43
    Block 2   10 00 00   976
              10 01 00   77554
              10 10 00   433
              10 11 00   7785
    Block 3   11 00 00   2447
              11 01 00   775
              11 10 00   433
              11 11 00   3649

On a miss, the cache copies the entire block that contains the desired word

Page 18: Chapter 7a: Cache Memory


Working with Blocks

Address = Tag (bits 31-14, 18 bits) | Index (bits 13-4, 10 bits) | Block offset (bits 3-2, 2 bits) | Byte offset (bits 1-0, 2 bits)

One cache entry holds one 4-word block: V | Tag | Word 3 | Word 2 | Word 1 | Word 0

All words in the same block have the same index and tag

The requested word may be at any position within a block.

The block size may be any power of 2: 1, 2, 4, 8, 16, …

Page 19: Chapter 7a: Cache Memory


32KByte / 4-Word Block Direct-Mapped Cache

7.2

32 KB / 4 words/block / 4 bytes/word --> 2K (2^11) blocks

Address = Tag (bits 31-15, 17 bits) | Index (bits 14-4, 11 bits) | Block offset (bits 3-2) | Byte offset (bits 1-0)

[Diagram: the 11-bit index selects one of the 2048 entries (V, 17-bit tag, four 32-bit data words); a valid entry whose tag matches signals Hit!, and a 4-to-1 mux driven by the block offset selects the requested 32-bit word]
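The field widths for this geometry can be written down directly. A sketch: the constants come from the 17-bit tag, 11-bit index, and 2+2 offset bits above.

```c
#include <stdint.h>

/* 32 KB direct-mapped cache with 4-word (16-byte) blocks -> 2^11 blocks:
   tag = bits 31-15 (17 bits), index = bits 14-4 (11 bits),
   block offset = bits 3-2, byte offset = bits 1-0. */
uint32_t blk_tag(uint32_t a)     { return a >> 15; }
uint32_t blk_index(uint32_t a)   { return (a >> 4) & 0x7FF; }
uint32_t blk_woffset(uint32_t a) { return (a >> 2) & 0x3; }
```

Two addresses in the same 16-byte block, such as 0x12340 and 0x1234C, share a tag and index and differ only in the word the mux selects.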

Page 20: Chapter 7a: Cache Memory


How Much Change?

7.2

Miss rates for DEC 3100 (MIPS machine)

    Benchmark   Block size (words)   Instruction miss rate   Data miss rate   Combined
    spice       1                    1.2%                    1.3%             1.2%
    gcc         1                    6.1%                    2.1%             5.4%
    spice       4                    0.3%                    0.6%             0.4%
    gcc         4                    2.0%                    1.7%             1.9%

Separate 64KB Instruction/Data Caches (16K 1-word blocks or 4K 4-word blocks)

Page 21: Chapter 7a: Cache Memory


Choosing a block size

7.2

• Large block sizes help with spatial locality, but...

• It takes time to read the memory in

• Larger block sizes increase the time for misses

• It reduces the number of blocks in the cache

• Number of blocks = cache size / block size

• Need to find a middle ground

• 16-64 bytes works nicely

Page 22: Chapter 7a: Cache Memory


Other Cache organizations

7.3

Direct Mapped

    [Diagram: 16 entries, indexes 0: through 15:, each V | Tag | Data]

Address = Tag | Index | Block offset

Each address has only one possible location

Fully Associative

    [Diagram: the same entries, V | Tag | Data, but with no index — a block may go anywhere]

Address = Tag | Block offset

Page 23: Chapter 7a: Cache Memory


Fully Associative vs. Direct Mapped

7.3

• Fully associative caches provide much greater flexibility

• Nothing gets “thrown out” of the cache until it is completely full

• Direct-mapped caches are more rigid

• Any cached data goes directly where the index says to, even if the rest of the cache is empty

• A problem, though...

• Fully associative caches require a complete search through all the tags to see if there’s a hit

• Direct-mapped caches only need to look one place

Page 24: Chapter 7a: Cache Memory


A Compromise

7.3

2-Way Set Associative

    [Diagram: 8 sets, indexes 0: through 7:, each holding two V | Tag | Data entries]

Address = Tag | Index | Block offset

Each address has two possible locations with the same index

One fewer index bit: 1/2 the indexes

4-Way Set Associative

    [Diagram: 4 sets, indexes 0: through 3:, each holding four V | Tag | Data entries]

Address = Tag | Index | Block offset

Each address has four possible locations with the same index

Two fewer index bits: 1/4 the indexes
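The index-bit trade above follows directly from the geometry: sets = blocks / associativity, and index bits = log2(sets). A sketch:

```c
/* For a fixed number of blocks, doubling the associativity halves the
   number of sets and removes one index bit (that bit moves to the tag). */
int index_bits(int num_blocks, int assoc) {
    int sets = num_blocks / assoc, bits = 0;
    while (sets > 1) { sets >>= 1; bits++; }  /* bits = log2(sets) */
    return bits;
}
```

For an 8-block cache: 3 index bits direct-mapped, 2 bits 2-way, 1 bit 4-way; the 2K-block cache from the earlier slide needs 11 index bits direct-mapped.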

Page 25: Chapter 7a: Cache Memory


Set Associative Example

7.3

128-byte cache, 4-word blocks (8 blocks total), 10-bit addresses, 1-4 way associativity

Address fields: Tag (3-5 bits) | Index (1-3 bits) | Block offset (2 bits) | Byte offset (2 bits)

Access sequence (the same five addresses for each organization):

    0100111000, 1100110100, 0100111100, 0110110000, 1100111000

Results:

    Organization                            Index bits   Access results
    Direct-Mapped (8 indexes, 000:-111:)    3            Miss, Miss, Miss, Miss, Miss
    2-Way Set Assoc. (4 sets, 00:-11:)      2            Miss, Miss, Hit, Miss, Miss
    4-Way Set Assoc. (2 sets, 0:-1:)        1            Miss, Miss, Hit, Miss, Hit

In the direct-mapped cache all five addresses map to index 011, so every access is a conflict miss; adding associativity turns some of those misses into hits.
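The hit/miss patterns above can be reproduced with a small simulator. A sketch under the slide’s parameters: 8 blocks, the same five addresses, true LRU replacement within each set.

```c
#include <stdint.h>

#define BLOCKS 8   /* 128-byte cache / 16-byte blocks */

/* Replay the five-access sequence and count hits for a given
   associativity (1 = direct-mapped), using LRU within each set. */
int count_hits(int assoc) {
    uint32_t tags[BLOCKS];
    int valid[BLOCKS], stamp[BLOCKS];
    /* the slide's addresses: 0100111000, 1100110100, 0100111100,
       0110110000, 1100111000 */
    uint32_t seq[5] = { 0x138, 0x334, 0x13C, 0x1B0, 0x338 };
    int sets = BLOCKS / assoc, hits = 0;
    for (int i = 0; i < BLOCKS; i++) valid[i] = 0;
    for (int t = 0; t < 5; t++) {
        uint32_t blk = seq[t] >> 4;          /* drop byte+block offsets */
        uint32_t set = blk % sets, tag = blk / sets;
        int hit = 0;
        for (int w = 0; w < assoc; w++) {    /* search every way in set */
            int s = set * assoc + w;
            if (valid[s] && tags[s] == tag) { hit = 1; stamp[s] = t; }
        }
        if (hit) { hits++; continue; }
        int victim = 0;                      /* miss: empty or LRU way  */
        for (int w = 0; w < assoc; w++) {
            int s = set * assoc + w;
            if (!valid[s]) { victim = w; break; }
            if (stamp[s] < stamp[set * assoc + victim]) victim = w;
        }
        int s = set * assoc + victim;
        valid[s] = 1; tags[s] = tag; stamp[s] = t;
    }
    return hits;
}
```

Running it gives 0 hits direct-mapped, 1 hit 2-way, and 2 hits 4-way, matching the table above.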

Page 26: Chapter 7a: Cache Memory


New Performance Numbers

7.3

Miss rates for DEC 3100 (MIPS machine)

    Benchmark   Associativity   Instruction miss rate   Data miss rate   Combined
    gcc         Direct          2.0%                    1.7%             1.9%
    gcc         2-way           1.6%                    1.4%             1.5%
    gcc         4-way           1.6%                    1.4%             1.5%
    spice       Direct          0.3%                    0.6%             0.4%
    spice       2-way           0.3%                    0.6%             0.4%
    spice       4-way           0.3%                    0.6%             0.4%

Separate 64KB Instruction/Data Caches (4K 4-word blocks)

Page 27: Chapter 7a: Cache Memory


Block Replacement Strategies

7.5

• We have to replace a block when there is a collision

• Collisions occur whenever the selected set is full

• Strategy 1: Ideal (Oracle)

• Replace the block that won’t be used again for the longest time

• Drawback - Requires knowledge of the future

• Strategy 2: Least Recently Used (LRU)

• Replace the block that was last used (hit) the longest time ago

• Drawback - Requires difficult bookkeeping

• Strategy 3: Approximate LRU

• Set a use bit for each block every time it is hit; clear all of them periodically

• Replace a block whose use bit is not set

• Strategy 4: Random

• Pick a block at random (works almost as well as approx. LRU)
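Strategy 3’s victim selection is cheap enough to sketch directly. An illustration, not the only hardware realization: the use bits are packed into a bitmask, and the periodic clear is folded into the all-bits-set case.

```c
/* Approximate LRU: 'use' is a bitmask with bit i set if block i was hit
   since the last periodic clear. Replace the first block whose use bit
   is clear; if every bit is set, hardware would clear them all and can
   simply take block 0. */
int pick_victim(unsigned use, int n) {
    for (int i = 0; i < n; i++)
        if (!((use >> i) & 1u))
            return i;          /* first block not recently used */
    return 0;                  /* all recently used: reset case */
}
```

This keeps only one bit of state per block, which is why it is an approximation: it distinguishes "used recently" from "not used recently" but not the full ordering that true LRU tracks.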

Page 28: Chapter 7a: Cache Memory


The Three C’s of Misses

7.5

• Compulsory Misses

• The first time a memory location is accessed, it is always a miss

• Also known as cold-start misses

• The only way to reduce them is to increase the block size (each miss then brings in more nearby data)

• Capacity Misses

• Occur when a program is using more data than can fit in the cache

• Some misses will result because the cache isn’t big enough

• Increasing the size of the cache solves this problem

• Conflict Misses

• Occur when a block forces out another block with the same index

• Increasing Associativity reduces conflict misses

• Worst in Direct-Mapped, non-existent in Fully Associative

Page 29: Chapter 7a: Cache Memory


Cache Sizing

[Diagram: CPU and Registers, then the Cache, then Main Memory (DRAM); Load/I-Fetch and Store traffic goes through the cache]

• How big should the cache be?

• As big as possible! Hold as much data in the cache as you can.

• But… Smaller is faster…

• The cache must provide the data within 1 CPU cycle to avoid stalling

• Cache must be on the same chip as the CPU

• Make the cache as large as possible until either:

• Access time is > 1 CPU cycle

• Run out of room on the CPU chip

Page 30: Chapter 7a: Cache Memory


Multi-level Caches

[Diagram: CPU and Registers → L1 Cache → L2 Cache → L3 Cache → Main Memory (DRAM)]

• The difference between a cache hit (1 cycle) and a miss (30-50 cycles) is huge

• Introduce a series of larger, but slower, caches to smooth out the difference

• L1 Cache: As big as can be in 1 cycle

• L2 Cache: As big as can be in 3-5 cycles

• L3 Cache: As big as can be in 5-10 cycles

• L2/L3 Cache may be on/off chip depending on CPU speeds and constraints
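These latencies give a back-of-the-envelope average memory access time. A sketch: the 1/4/8-cycle hit times are taken from the level targets above (picking one point in each range), while the 5%/20%/20% miss rates and the 40-cycle memory latency are illustrative assumptions, not figures from the slides.

```c
/* Average memory access time for a three-level hierarchy, using each
   level's hit time (in cycles) and its *local* miss rate: the time at
   each level is paid only by the fraction of accesses that reach it. */
double amat(double t1, double m1, double t2, double m2,
            double t3, double m3, double tmem) {
    return t1 + m1 * (t2 + m2 * (t3 + m3 * tmem));
}
```

With the assumed numbers, amat(1, 0.05, 4, 0.2, 8, 0.2, 40) comes to about 1.36 cycles: even a modest L1 hit rate hides almost all of the 40-cycle DRAM latency, which is the whole point of the hierarchy.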