TRANSCRIPT
ECE200 – Computer Organization
Chapter 7 – Large and Fast: Exploiting Memory
Hierarchy
Copyright 2003 David H. Albonesi and the University of Rochester.
Outline for Chapter 7 lectures
Motivation for, and concept of, memory hierarchies
Caches
Main memory
Characterizing memory hierarchy performance
Virtual memory
Real memory hierarchies
The memory dilemma
Chapter 6 assumption: on-chip instruction and data memories hold the entire program and its data and can be accessed in one cycle
Reality check
In high-performance machines, programs may require hundreds of megabytes or even gigabytes of memory to run
Embedded processors have smaller needs, but there is also less room for on-chip memory
Basic problem
We need much more memory than can fit on the microprocessor chip
But we do not want to incur stall cycles every time the pipeline accesses instructions or data
At the same time, we need the memory to be economical for the machine to be competitive in the market
Solution: a hierarchy of memories
Another view
Typical characteristics of each level
First level (L1) is separate on-chip instruction and data caches placed where our instruction and data memories reside
16-64KB for each cache (desktop/server machine)
Fast, power-hungry, not-so-dense static RAM (SRAM)
Second level (L2) consists of another larger unified cache
Holds both instructions and data
256KB-4MB
On- or off-chip
SRAM
Third level is main memory
64-512MB
Slower, lower-power, denser dynamic RAM (DRAM)
Final level is I/O (e.g., disk)
Caches and the pipeline
L1 instruction and data caches and L2 cache
[Figure: the two L1 caches sit in the pipeline where the instruction and data memories were; both connect to the unified L2 cache, which connects to main memory]
Memory hierarchy operation
(1) Search L1 for the instruction or data
If found (cache hit), done
(2) Else (cache miss), search the L2 cache
If found, place it in L1 and repeat (1)
(3) Else, search main memory
If found, place it in L2 and repeat (2)
(4) Else, get it from I/O (Chapter 8)
Steps (1)-(3) are performed in hardware
1-3 cycles to get from the L1 caches
5-20 cycles to get from the L2 cache
50-200 cycles to get from main memory
(a latency sketch follows the figure below)
[Figure: the search flows from L1 to L2 to main memory]
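To make these numbers concrete, here is a minimal C sketch of the average access time this search process produces, assuming illustrative latencies from the ranges above and assumed hit rates (not measured values):

    #include <stdio.h>

    int main(void) {
        /* Illustrative numbers only: latencies from the ranges quoted
           above, hit rates assumed for the sake of the example. */
        double l1_latency  = 2;     /* cycles, within 1-3     */
        double l2_latency  = 10;    /* cycles, within 5-20    */
        double mem_latency = 100;   /* cycles, within 50-200  */
        double l1_hit = 0.95;       /* assumed: >90% hit in L1 */
        double l2_hit = 0.80;       /* assumed L2 hit rate     */

        /* Each miss pays the cost of searching the next level too. */
        double amat = l1_latency
                    + (1 - l1_hit) * (l2_latency
                    + (1 - l2_hit) * mem_latency);

        printf("average access time = %.2f cycles\n", amat);
        return 0;
    }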
Principle of locality
Programs access a small portion of memory within a short time period
Temporal locality: recently accessed memory locations will likely be accessed soon
Spatial locality: memory locations near recently accessed locations will likely be accessed soon
POL makes memory hierarchies work
A large percentage of the time (typically >90%) the instruction or data is found in L1, the fastest memory
Cheap, abundant main memory is accessed more rarely
The memory hierarchy operates at nearly the speed of the expensive on-chip SRAM but at roughly the cost per byte of main memory (DRAM)
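As a hypothetical illustration of both kinds of locality, the C loop below sums an array: each element sits next to the previously accessed one (spatial locality), while the sum variable and the loop instructions are reused every iteration (temporal locality).

    #include <stdio.h>

    int main(void) {
        int a[1024];
        for (int i = 0; i < 1024; i++)
            a[i] = i;

        int sum = 0;
        /* Spatial locality: a[i] is adjacent to a[i-1], so most
           accesses hit in a block brought in by an earlier miss.
           Temporal locality: sum, i, and the loop instructions are
           reused on every iteration. */
        for (int i = 0; i < 1024; i++)
            sum += a[i];

        printf("sum = %d\n", sum);
        return 0;
    }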
Caches
Caches are small, fast, memories that hold recently accessed instructions and/or data
Separate L1 instruction and L1 data caches
Need simultaneous access of instructions and data in pipelines
L2 cache holds both instructions and data
Simultaneous access not as critical since >90% of the instructions and data will be found in L1
The PC or effective address from L1 is sent to L2 to search for the instruction or data
How caches exploit the POL
On a cache miss, a block of several instructions or data words, including the requested item, is returned
The entire block is placed into the cache so that future searches for items in the block will be successful
[Figure: a four-word block containing instruction i (the requested instruction) along with instructions i+1, i+2, and i+3]
Consider sequence of instructions and data accesses in this loop with a block size of 4 words
Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop
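One way to see the payoff is to simulate just the data-side behavior of this loop. The sketch below is a simplification that tracks only the most recently touched 16-byte block, assumes $s1 starts at 4092 (the last word of a block) and that the loop runs 16 times, and ignores instruction fetches; it shows the lw missing once per block (once every four iterations) while the sw to the same word always hits.

    #include <stdio.h>

    #define BLOCK_BYTES 16   /* 4-word blocks, as in the example */

    int main(void) {
        unsigned addr = 4092;    /* assumed starting value of $s1 */
        long last_block = -1;    /* block holding the last access */
        int hits = 0, misses = 0;

        for (int i = 0; i < 16; i++) {
            for (int access = 0; access < 2; access++) {  /* lw, sw */
                long block = addr / BLOCK_BYTES;
                if (block == last_block) hits++;
                else { misses++; last_block = block; }
            }
            addr -= 4;           /* addi $s1, $s1, -4 */
        }
        printf("data hits=%d misses=%d\n", hits, misses);
        return 0;                /* prints: data hits=28 misses=4 */
    }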
Four Questions for Memory Hierarchy
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)
The cache is much smaller than main memory
Multiple memory blocks must share the same cache location
Searching the cache
[Figure: many memory blocks map to the same cache block location]
Need a way to determine whether the desired instruction or data is held in the cache
Need a scheme for replacing blocks when a new block needs to be brought in on a miss
Cache organization alternatives
Direct mapped: each block can be placed in only one cache location
Set associative: each block can be placed in any of n cache locations
Fully associative: each block can be placed in any cache location
Searching for block 12 in caches of size 8 blocks
[Figure: the locations block 12 may occupy in direct mapped, set associative, and fully associative caches]
Searching a direct mapped cache
Need log2(number of sets) address bits (the index) to select the set
The block offset is used to select the desired byte, half-word, or word within the block
The remaining bits (the tag) are used to determine whether this is the desired block or another that shares the same cache location
[Figure: memory address divided into tag, index, and block offset fields]
Assume a data cache with 16-byte blocks and 8 sets:
4 block offset bits
3 index bits
25 tag bits
The block is placed in the set given by the index
Number of sets = cache size / block size
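A small C sketch of this field extraction for the example cache (16-byte blocks, 8 sets, 32-bit addresses); the example address is arbitrary.

    #include <stdio.h>
    #include <stdint.h>

    #define OFFSET_BITS 4   /* 16-byte blocks -> 4 block offset bits */
    #define INDEX_BITS  3   /*  8 sets        -> 3 index bits        */

    int main(void) {
        uint32_t addr = 0x12345678;  /* arbitrary example address */

        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);  /* 25 bits */

        printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
        return 0;
    }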
Direct mapped cache organization
64KB instruction cache with 16-byte (4-word) blocks
4K sets (64KB/16B), so 12 address bits are needed to select a set
The data section of the cache holds the instructions
The tag section holds the part of the memory address used to distinguish different blocks
A valid bit associated with each set indicates whether its contents are valid
Direct mapped cache access
The index bits are used to select one of the sets
The data, tag, and valid bit from the selected set are accessed simultaneously
The tag from the selected set is compared with the tag field of the address
A match between the tags and a valid bit that is set indicates a cache hit
The block offset selects the desired instruction
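The same access sequence sketched in C. This is an illustrative software model, not the hardware: one valid bit, tag, and block per set, with a hit declared exactly as described above (tags match and the valid bit is set).

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    #define SETS        4096   /* 64KB / 16-byte blocks = 4K sets */
    #define OFFSET_BITS 4
    #define INDEX_BITS  12

    struct dm_cache {
        bool     valid[SETS];
        uint32_t tag[SETS];
        uint8_t  data[SETS][16];   /* one 16-byte block per set */
    };

    /* Returns true on a hit; on a hit, *word receives the selected word. */
    bool dm_lookup(struct dm_cache *c, uint32_t addr, uint32_t *word) {
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);

        if (c->valid[index] && c->tag[index] == tag) {   /* cache hit */
            memcpy(word, &c->data[index][offset & ~3u], 4);
            return true;                /* block offset selects the word */
        }
        return false;   /* miss: the block must come from the next level */
    }

    int main(void) {
        static struct dm_cache c;   /* zero-initialized: valid bits clear */
        uint32_t w;
        printf("hit = %d (empty cache, so a miss)\n",
               dm_lookup(&c, 0x1000, &w));
        return 0;
    }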
Set associative cache
A block is placed in one way of the set given by the index
Number of sets = cache size / (block size × ways)
[Figure: 4-way set associative cache, ways 0-3]
Set associative cache operation
The index bits are used to select one of the sets
The data, tags, and valid bits from all ways of the selected set are accessed simultaneously
The tags from all ways of the selected set are compared with the tag field of the address
A match between the tags and a valid bit that is set indicates a cache hit (a hit in way 1 is shown in the figure)
The data from the way that had a hit is returned through the MUX
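A matching sketch for a 4-way set associative lookup, with assumed sizes (1K sets, 16-byte blocks); the loop stands in for the parallel comparators, and returning the matching way's word stands in for the MUX.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define WAYS        4
    #define SETS        1024   /* illustrative size */
    #define OFFSET_BITS 4
    #define INDEX_BITS  10

    struct sa_set {
        bool     valid[WAYS];
        uint32_t tag[WAYS];
        uint32_t data[WAYS][4];  /* 4 words per 16-byte block */
    };

    /* Returns the way that hit, or -1 on a miss. */
    int sa_lookup(struct sa_set sets[SETS], uint32_t addr, uint32_t *word) {
        uint32_t index = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
        uint32_t woff  = (addr >> 2) & 3;      /* word within the block */
        struct sa_set *s = &sets[index];

        /* In hardware all ways are compared in parallel;
           this loop models those parallel comparators. */
        for (int way = 0; way < WAYS; way++) {
            if (s->valid[way] && s->tag[way] == tag) {
                *word = s->data[way][woff];    /* MUX selects this way */
                return way;
            }
        }
        return -1;
    }

    int main(void) {
        static struct sa_set sets[SETS];
        uint32_t w;
        printf("way = %d (miss on an empty cache)\n",
               sa_lookup(sets, 0x1234, &w));
        return 0;
    }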
Fully associative cache
A block can be placed in any location in the single set
[Figure: fully associative cache; the tag field of the address is compared with the tags of all entries in parallel, and a MUX steers the data from the matching valid entry to the output, along with a hit signal]
Different degrees of associativity
Four different caches of size 8 blocks: direct mapped, 2-way set associative, 4-way set associative, and fully associative
Cache misses
A cache miss occurs when the block is not found in the cache
The block is requested from the next level of the hierarchy
When the block returns, it is loaded into the cache and provided to the requester
A copy of the block remains in the lower levels of the hierarchy
The cache miss rate is found by dividing the total number of misses by the total number of accesses (misses/accesses)
The hit rate is 1 - miss rate
[Figure: a miss is passed down through L1, L2, and main memory]
Classifying cache misses
Compulsory misses: caused by the first access to a block that has never been in the cache
Capacity misses: due to the cache not being big enough to hold all the blocks that are needed
Conflict misses: due to multiple blocks competing for the same set
A fully associative cache with a "perfect" replacement policy has no conflict misses
Cache miss classification examples
Direct mapped cache of size two blocks
Blocks A and B map to set 0, C and D to set 1
Access pattern 1: A, B, C, D, A, B, C, D
Access pattern 2: A, A, B, A
[Figure: cache contents of sets 0 and 1 after each access in the two patterns; the classification of each miss is left as "??" to work out]
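To check your answers, here is a small simulation of this two-block direct mapped cache under the assumed mapping (A and B to set 0, C and D to set 1). It labels a miss compulsory when the block has never been in the cache; deciding whether the remaining misses are conflict or capacity misses is still left to you.

    #include <stdio.h>
    #include <stdbool.h>

    int set_of(char b) { return (b == 'A' || b == 'B') ? 0 : 1; }

    void run(const char *pattern) {
        char cache[2] = {0, 0};     /* one block per set */
        bool seen[26] = {false};    /* has this block ever been cached? */

        for (const char *p = pattern; *p; p++) {
            int s = set_of(*p);
            if (cache[s] == *p) {
                printf("%c: hit\n", *p);
            } else {
                printf("%c: miss (%s)\n", *p,
                       seen[*p - 'A'] ? "conflict or capacity"
                                      : "compulsory");
                cache[s] = *p;
                seen[*p - 'A'] = true;
            }
        }
        printf("\n");
    }

    int main(void) {
        run("ABCDABCD");   /* access pattern 1 */
        run("AABA");       /* access pattern 2 */
        return 0;
    }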
Reducing capacity misses
Increase the cache size
More cache blocks can be simultaneously held in the cache
Drawback: increased access time
Reducing compulsory misses
Increase the block size
Each miss results in more words being loaded into the cache
Block size should only be increased to a certain point!
As block size is increased:
Fewer cache sets (increased contention)
A larger percentage of the block may not be referenced
Reducing conflict misses
Increase the associativity
More locations in which a block can be held
Drawback: increased access time
Cache miss rates for SPEC92
Block replacement policy
Determines what block to replace on a cache miss to make room for the new block
Least recently used (LRU)
Pick the block that has been unused for the longest time (sketched below)
Based on temporal locality
Requires ordering bits to be kept with each set
Too expensive beyond 4-way
Random
Pseudo-randomly pick a block
Generally not as effective as LRU (higher miss rates)
Simple even for highly associative organizations
Most recently used (MRU)
Keep track of which block was accessed last
Randomly pick a block other than that one
A compromise between LRU and random
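A minimal sketch of LRU bookkeeping for one 4-way set, using an age counter per way in place of real ordering bits (an assumption made for clarity; hardware encodes the ordering more compactly).

    #include <stdio.h>
    #include <stdint.h>

    #define WAYS 4

    struct lru_set {
        uint32_t tag[WAYS];
        uint8_t  age[WAYS];   /* 0 = most recently used */
    };

    /* Mark `way` as most recently used by aging every newer way. */
    void touch(struct lru_set *s, int way) {
        for (int w = 0; w < WAYS; w++)
            if (s->age[w] < s->age[way])
                s->age[w]++;
        s->age[way] = 0;
    }

    /* Victim for replacement: the way unused for the longest time. */
    int lru_victim(const struct lru_set *s) {
        int victim = 0;
        for (int w = 1; w < WAYS; w++)
            if (s->age[w] > s->age[victim])
                victim = w;
        return victim;
    }

    int main(void) {
        struct lru_set s = { {0, 0, 0, 0}, {0, 1, 2, 3} };
        touch(&s, 3);                   /* way 3 becomes MRU */
        printf("victim = way %d\n", lru_victim(&s));  /* now way 2 */
        return 0;
    }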
Cache writes
The L1 data cache needs to handle writes (stores) in addition to reads (loads)
Need to check for a hit before writing
Don't want to write over another block on a miss
Requires a two-cycle operation (tag check followed by write)
Write back cache
Check for a hit
If a hit, write the byte, halfword, or word to the correct location in the block
If a miss, request the block from the next level in the hierarchy
Load the block into the cache, and then perform the write
Cache writes and block replacement
With a write back cache, when a block is written to, copies of the block in the lower levels are not updated
If this block is chosen for replacement on a miss, we need to save it to the next level
Solution:
A dirty bit is associated with each cache block
The dirty bit is set if the block is written to
A block with a set dirty bit that is chosen for replacement is written to the next level before being overwritten with the new block (see the sketch after the figure below)
[Figure: (1) cache miss in the L1 data cache; (2) the dirty block is written to L2; (3) a read request is sent to L2; (4) the requested block is loaded into the L1D cache]
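The four numbered steps in the figure, as a C sketch. The structure and the helpers write_block_to_l2 and read_block_from_l2 are hypothetical stand-ins for the L2 interface, not a real design.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCK_WORDS 4

    struct line {
        bool     valid, dirty;
        uint32_t tag;
        uint32_t data[BLOCK_WORDS];
    };

    /* Hypothetical stand-ins for the L2 interface. */
    static void write_block_to_l2(uint32_t tag, const uint32_t *data) {
        (void)data;
        printf("(2) dirty block with tag 0x%x written to L2\n", tag);
    }
    static void read_block_from_l2(uint32_t tag, uint32_t *data) {
        (void)data;
        printf("(3) read request for tag 0x%x sent to L2\n", tag);
    }

    /* Handle an L1D miss that replaces `line` with the block for new_tag. */
    void replace(struct line *line, uint32_t new_tag) {
        printf("(1) cache miss\n");
        if (line->valid && line->dirty)        /* modified copy is the */
            write_block_to_l2(line->tag, line->data);  /* only one: save it */
        read_block_from_l2(new_tag, line->data);
        printf("(4) block loaded into L1D cache\n");
        line->tag   = new_tag;
        line->valid = true;
        line->dirty = false;
    }

    int main(void) {
        struct line l = { .valid = true, .dirty = true, .tag = 0x12 };
        replace(&l, 0x34);
        return 0;
    }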
Write Buffer for Write Through
A Write Buffer is needed between the Cache and Memory
Processor: writes data into the cache and the write buffer
Memory controller: writes the contents of the buffer to memory
The write buffer is just a FIFO (sketched after the figure below):
Typical number of entries: 4
Works fine if store frequency (w.r.t. time) << 1 / DRAM write cycle
Memory system designer's nightmare:
Store frequency (w.r.t. time) -> 1 / DRAM write cycle
Write buffer saturation
[Figure: Processor -> Cache, with stores also entering the Write Buffer -> DRAM]
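A sketch of the four-entry FIFO write buffer as a ring buffer (the ring-buffer layout is an implementation assumption): the processor side enqueues completed stores, and the memory controller side drains the oldest entry at the DRAM write rate.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define WB_ENTRIES 4   /* typical number of entries */

    struct write_buffer {
        uint32_t addr[WB_ENTRIES], data[WB_ENTRIES];
        int head, tail, count;
    };

    /* Processor side: returns false if the buffer is full (the
       pipeline would have to stall: write buffer saturation). */
    bool wb_push(struct write_buffer *wb, uint32_t addr, uint32_t data) {
        if (wb->count == WB_ENTRIES)
            return false;
        wb->addr[wb->tail] = addr;
        wb->data[wb->tail] = data;
        wb->tail = (wb->tail + 1) % WB_ENTRIES;
        wb->count++;
        return true;
    }

    /* Memory controller side: drain the oldest store to DRAM. */
    bool wb_drain(struct write_buffer *wb, uint32_t *addr, uint32_t *data) {
        if (wb->count == 0)
            return false;
        *addr = wb->addr[wb->head];
        *data = wb->data[wb->head];
        wb->head = (wb->head + 1) % WB_ENTRIES;
        wb->count--;
        return true;
    }

    int main(void) {
        struct write_buffer wb = {{0}, {0}, 0, 0, 0};
        uint32_t a, d;
        wb_push(&wb, 0x100, 1);
        wb_push(&wb, 0x104, 2);
        while (wb_drain(&wb, &a, &d))
            printf("DRAM write: [0x%x] = %u\n", a, d);
        return 0;
    }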
Write Buffer Saturation
Store frequency (w.r.t. time) -> 1 / DRAM write cycle
If this condition exists for a long period of time (CPU cycle time too short and/or too many store instructions in a row):
The store buffer will overflow no matter how big you make it
(this happens whenever CPU cycle time <= DRAM write cycle time)
Solutions for write buffer saturation:
Use a write back cache
Install a second level (L2) cache
[Figure: Processor -> Cache -> Write Buffer -> DRAM; with an L2 cache the path becomes Processor -> Cache -> L2 Cache -> Write Buffer -> DRAM]
Write-miss Policy: Write Allocate versus Not Allocate
Assume a 16-bit write to memory location 0x0 that causes a miss
Do we read in the block? (the two choices are sketched after the figure below)
Yes: Write Allocate
No: Write Not Allocate
[Figure: 1KB direct mapped cache with 32-byte blocks; the address divides into a cache tag (ex: 0x00), a cache index (ex: 0x00), and a byte select field (ex: 0x00)]
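The two policies side by side as a C sketch; fetch_block, write_to_cache, and write_to_memory are hypothetical helpers standing in for the miss path.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical helpers standing in for the rest of the hierarchy. */
    static void fetch_block(uint32_t addr) {
        printf("fetch block for 0x%x into the cache\n", addr);
    }
    static void write_to_cache(uint32_t addr, uint16_t v) {
        printf("write 0x%hx into the cache block for 0x%x\n", v, addr);
    }
    static void write_to_memory(uint32_t addr, uint16_t v) {
        printf("write 0x%hx to memory at 0x%x\n", v, addr);
    }

    /* A 16-bit write that misses, under the two policies. */
    void write_allocate(uint32_t addr, uint16_t v) {
        fetch_block(addr);        /* yes: read the block in first    */
        write_to_cache(addr, v);  /* then perform the write in cache */
    }

    void write_not_allocate(uint32_t addr, uint16_t v) {
        write_to_memory(addr, v); /* no: update memory only; the cache
                                     is left unchanged on this miss  */
    }

    int main(void) {
        write_allocate(0x0, 0xBEEF);
        write_not_allocate(0x0, 0xBEEF);
        return 0;
    }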
Summary #1/3: The Principle of Locality
Programs are likely to access a relatively small portion of the address space at any instant of time
Temporal Locality: locality in time
Spatial Locality: locality in space
Three major categories of cache misses:
Compulsory misses: sad facts of life; example: cold start misses
Conflict misses: increase cache size and/or associativity; nightmare scenario: the ping-pong effect!
Capacity misses: increase cache size
Cache design space: total size, block size, associativity, replacement policy, write-hit policy (write-through, write-back), write-miss policy
Summary #2 / 3: The Cache Design Space
Several interacting dimensions:
cache size
block size
associativity
replacement policy
write-through vs. write-back
write allocation
The optimal choice is a compromise:
depends on access characteristics
workload
use (I-cache, D-cache, TLB)
depends on technology / cost
Simplicity often wins
[Figure: the cache design space; varying associativity, cache size, or block size from less to more can be good for one factor (Factor A) and bad for another (Factor B)]
Summary #3 / 3
Caches are understood by examining how they deal with four questions:
1. Where can a block be placed?
2. How is a block found?
3. Which block is replaced on a miss?
4. How are writes handled?