
Page 1:

CS 147 Cache Memory

Prof. Sin-Min Lee

Department of Computer Science

Page 2:

Memory: Capacity

• Word size: the number of bits in the natural unit of organization
– usually related to the length of an instruction or the number of bits used to represent an integer

• Capacity is expressed as a number of words or a number of bytes
– usually a power of 2, e.g. 1 KB = 1024 bytes. Why?

Page 3:

Other Memory System Characteristics

• Unit of transfer: the number of bits read from, or written into, memory at a time
– internal: usually governed by the data bus width
– external: usually a block of words, e.g. 512 or more

• Addressable unit: the smallest location which can be uniquely addressed
– internal: word or byte
– external: device dependent, e.g. a disk “cluster”

Page 4:

Sequential Access Method

• Start at the beginning and read through in order
• Access time depends on the location of the data and on the previous location
• e.g. tape

[Diagram: starting from the first location, the tape is read through to the location of interest]

Page 5:

Direct Access

• Individual blocks have unique addresses
• Access is by jumping to the vicinity plus a sequential search (or waiting! e.g. waiting for the disk to rotate)
• Access time depends on the target location and the previous location
• e.g. disk

[Diagram: jump to the vicinity of block i, then read through to it]

Page 6:

PRIMARY MEMORY

The memory is that part of the computer where programs and data are stored. Some computer scientists (especially British ones) use the term store or storage rather than memory, although more and more, the term "storage" is used to refer to disk storage.

Page 7:

Memories consist of a number of cells (or locations), each of which can store a piece of information. Each cell has a number, called its address, by which programs can refer to it. If a memory has n cells, they will have addresses 0 to n - 1. All cells in a memory contain the same number of bits. If a cell consists of k bits, it can hold any one of 2^k different bit combinations.

MEMORY ADDRESSES

Page 8:
Page 9:

Computers that use the binary number system (including octal and hexadecimal notation for binary numbers) express memory addresses as binary numbers. If an address has m bits, the maximum number of cells addressable is 2^m.

MEMORY ADDRESSES

Page 10:

For example, an address used to reference a memory of 12 cells needs at least 4 bits in order to express all the numbers from 0 to 11.

Page 11:

A 3-bit address is sufficient here. The number of bits in the address determines the maximum number of directly addressable cells in the memory and is independent of the number of bits per cell.
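To make this concrete, here is a minimal C sketch (my own illustration; the function name is invented, not from the slides) that computes the smallest number of address bits needed for a memory of n cells:

#include <stdio.h>

/* Smallest m with 2^m >= n: the number of address bits
   needed to give each of n cells a unique address. */
static unsigned address_bits(unsigned long n) {
    unsigned m = 0;
    while ((1UL << m) < n)
        m++;
    return m;
}

int main(void) {
    printf("12 cells need %u address bits\n", address_bits(12)); /* 4 */
    printf("8 cells need %u address bits\n", address_bits(8));   /* 3 */
    printf("m = 12 bits can address %lu cells\n", 1UL << 12);    /* 4096 */
    return 0;
}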

Page 12:

A memory with 2^12 cells of 8 bits each and a memory with 2^12 cells of 64 bits each both need 12-bit addresses.

The number of bits per cell for some computers that have been sold commercially is listed in the accompanying table.

Page 13:

Random Access Method

• Individual addresses identify specific locations

• Access time is independent of location or previous access

• e.g. RAM (the main memory type)

[Diagram: any location in memory can be read directly]

Page 14:

Problem: CPU Fast, Memory Slow

• After a memory request, the CPU will not get the word for several cycles

• Two simple solutions:
– Continue execution, but stall the CPU if an instruction references the word before it has arrived (hardware)
– Require the compiler to fetch words before they are needed (software)
• May need to insert NOP instructions
• Very difficult to write compilers that do this effectively

Page 15:

The Root of the Problem: Economics

• Fast memory is possible, but to run at full speed, it needs to be located on the same chip as the CPU
– Very expensive
– Limits the size of the memory

• Do we choose:
– A small amount of fast memory?
– A large amount of slow memory?

Page 16:

Memory Hierarchy Design (1)

• Microprocessor performance improved about 35% per year until 1987, and about 55% per year since 1987
• The accompanying picture shows CPU performance against memory access time improvements over the years
– Clearly there is a processor-memory performance gap that computer architects must take care of


Page 18:

Memory Hierarchy Design (2)

• It is a tradeoff between size, speed, and cost, and it exploits the principle of locality.

• Register
– Fastest memory element, but small storage; very expensive

• Cache
– Fast and small compared to main memory; acts as a buffer between the CPU and main memory: it contains the most recently used memory locations (address and contents are recorded here)

• Main memory is the RAM of the system
• Disk storage - HDD

[Diagram: Registers (CPU) – specialized bus (internal or external to the CPU) – Cache (one or more levels) – memory bus – Main Memory – I/O bus – Disk Storage]

Page 19:

Memory Hierarchy Design (3)

• Comparison between different types of memory (larger, slower, and cheaper as we move right):

            Register      Cache          Memory      HDD
size:       32 - 256 B    32 KB - 4 MB   128 MB      20 GB
speed:      2 ns          4 ns           60 ns       8 ms
$/Mbyte:    -             $100/MB        $1.50/MB    $0.05/MB

Page 20:

The Best of Both Worlds: Cache Memory

• Combine a small amount of fast memory (the cache) with a large amount of slow memory
– When a word is referenced, put it and its neighbours into the cache

• Programs do not access memory randomly
– Temporal locality: recently accessed items are likely to be used again
– Spatial locality: the next access is likely to be near the last one

Page 21:

The Cache Hit Ratio

• How often is a word found in the cache?

• Suppose a word is accessed k times in a short interval
– 1 reference to main memory
– (k-1) references to the cache

• The cache hit ratio h is then

h = (k - 1) / k
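As a quick illustration (a sketch of my own, not from the slides):

#include <stdio.h>

/* Hit ratio for a word accessed k times in a short interval:
   the first access misses, the remaining k-1 hit the cache. */
static double hit_ratio(unsigned k) {
    return (k - 1) / (double)k;
}

int main(void) {
    printf("k = 10  -> h = %.2f\n", hit_ratio(10));  /* 0.90 */
    printf("k = 100 -> h = %.2f\n", hit_ratio(100)); /* 0.99 */
    return 0;
}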

Page 22:

Reasons why we use cache

• Cache memory is made of STATIC RAM – a transistor-based RAM that has very low access times (fast)
• STATIC RAM is, however, very bulky and very expensive
• Main memory is made of DYNAMIC RAM – a capacitor-based RAM that has very high access times because it has to be constantly refreshed (slow)
• DYNAMIC RAM is much smaller and cheaper (per bit)

Page 23:

Performance (Speed)

• Access time
– Time between presenting the address and getting the valid data (memory or other storage)

• Memory cycle time
– Some time may be required for the memory to “recover” before the next access
– cycle time = access time + recovery time

• Transfer rate
– The rate at which data can be moved
– For random access memory, transfer rate = 1 / cycle time = (cycle time)^-1

Page 24:

Memory Hierarchy

• size? speed? cost?

• registers – in the CPU: smallest, fastest, most expensive, most frequently accessed

• internal – may include one or more levels of cache: medium size and speed, price varies

• external memory – backing store: largest, slowest, cheapest, least frequently accessed

Page 25:

Memory: Location

• Registers: inside the CPU
– Fastest – on the CPU chip
– Cache: very fast, semiconductor, close to the CPU

• Internal or main memory
– Typically semiconductor media (transistors)
– Fast, random access, on the system bus

• External or secondary memory
– Peripheral storage devices (e.g. disk, tape)
– Slower, often magnetic media, maybe a slower bus

Page 26:

Memory Hierarchy - Diagram

[Diagram: moving down the hierarchy, cost per bit, speed, and access frequency decrease, while capacity and access time increase]

Page 27:

Performance & Hierarchy List

• Registers
• Level 1 Cache
• Level 2 Cache
• Main memory
• Disk cache
• Disk
• Optical
• Tape

soon (2 slides!)

(top of the list: faster, more $/byte; bottom: slower, less $/byte)

Page 28:

Locality of Reference (circa 1968)

• During program execution, memory references tend to cluster, e.g. loops

• Many instructions in localized areas of the program are executed repeatedly during some time period, and the remainder of the program is accessed infrequently. (Tanenbaum)

• Temporal LOR: a recently executed instruction is likely to be executed again soon

• Spatial LOR: instructions with addresses close to a recently executed instruction are likely to be executed soon.

• The same principles apply to data references.

Page 29:

Cache

• Small amount of fast memory
• Sits between normal main memory and the CPU
• May be located on the CPU chip or module

[Diagram: CPU – word transfer – cache – block transfer – main memory. The cache is smaller than main memory and views main memory as organized in “blocks”]

Page 30:

The Cache Hit Ratio

• How often is a word found in the cache?

• Suppose a word is accessed k times in a short interval
– 1 reference to main memory
– (k-1) references to the cache

• The cache hit ratio h is then

h = (k - 1) / k

Page 31:

Mean Access Time

• Cache access time = c
• Main memory access time = m
• Mean access time = c + (1 - h)m

• If all address references are satisfied by the cache, the access time approaches c
• If no reference is in the cache, the access time approaches c + m
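In C, the formula and its two limiting cases look like this (a minimal sketch; the names and sample numbers are mine):

#include <stdio.h>

/* Mean access time = c + (1 - h) * m: every access pays the cache
   lookup time c, and a miss (probability 1 - h) additionally pays
   the main memory time m. */
static double mean_access_time(double c, double m, double h) {
    return c + (1.0 - h) * m;
}

int main(void) {
    double c = 0.01, m = 0.1;  /* microseconds (illustrative values) */
    printf("h = 1.0 -> %.3f\n", mean_access_time(c, m, 1.0)); /* c   = 0.010 */
    printf("h = 0.0 -> %.3f\n", mean_access_time(c, m, 0.0)); /* c+m = 0.110 */
    return 0;
}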

Page 32:

Cache Design Issues

• How big should the cache be?
– Bigger means more hits, but more expensive

• How big should a cache line be?
• How does the cache keep track of what it contains?
– If we change an item in the cache, how do we write it back to main memory?

• Separate caches for data and instructions?
– Instructions never have to be written back to main memory

• How many caches should there be?
– Primary (on chip), secondary (off chip), tertiary…

Page 33:

Why does Caching Improve Speed?

Example:

• Main memory has 100,000 words; access time is 0.1 µs.
• Cache has 1000 words and an access time of 0.01 µs.
• If a word is
– in the cache (hit), it can be accessed directly by the processor.
– in memory (miss), it must first be transferred to the cache before access.
• Suppose that 95% of access requests are hits.
• Average time to access a word: (0.95)(0.01 µs) + 0.05(0.1 µs + 0.01 µs) = 0.015 µs

Close to cache speed – and the 95% hit rate is the key proviso.
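The slide averages hits and misses explicitly; the sketch below (my own check, with my own names) confirms that this agrees with the earlier formula c + (1-h)m, since h·c + (1-h)(m+c) expands to exactly that:

#include <stdio.h>

int main(void) {
    double c = 0.01, m = 0.1, h = 0.95;  /* microseconds, hit rate */

    /* Slide 33's form: hits cost c; misses cost the transfer m plus c. */
    double explicit_avg = h * c + (1.0 - h) * (m + c);

    /* Slide 31's form: c + (1 - h)m. */
    double formula_avg = c + (1.0 - h) * m;

    printf("explicit: %.3f us\n", explicit_avg); /* 0.015 */
    printf("formula:  %.3f us\n", formula_avg);  /* 0.015 */
    return 0;
}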

Page 34:

Cache Read Operation

• CPU requests the contents of a memory location

• Check the cache for the contents of that location:
– cache hit! (present) – get the data from the cache (fast)
– cache miss! (not present) – read the required block from main memory into the cache, then deliver the data from the cache to the CPU

Page 35:

Cache Design

• Size

• Mapping Function

• Replacement Algorithm

• Write Policy

• Block Size

• Number of Caches

Page 36:

Size

• Cost
– More cache is expensive

• Speed
– More cache is faster (up to a point)
– Checking the cache for data takes time

Page 37:

Mapping Function

• How do the cache contents map to the main memory contents?

[Diagram: each cache line (addresses 000 … xxx) holds a tag field and a data block field; main memory is a table of address/contents pairs containing blocks i, j, …; the tag (and maybe the line address) is used to identify the block address]

Page 38:

Cache Basics

• Cache line vs. main memory location: the same concept – avoid confusion (?)
• A line has an address and contents

• The contents of a cache line are divided into a tag field and a data field
– fixed width
– the fields are used differently!
– the data field holds the contents of a block of main memory
– the tag field helps identify the start address of the block of memory that is in the data field

Note: the cache line width is bigger than the memory location width!

Page 39:

Cache (2)

• Every address reference goes first to the cache;
– if the desired address is not there, then we have a cache miss
• The contents are fetched from main memory into the indicated CPU register, and the contents are also saved into the cache memory
– If the desired data is in the cache, then we have a cache hit
• The desired data is brought from the cache, at very high speed (low access time)

• Most software exhibits temporal locality of access, meaning that it is likely that the same address will be used again soon, and if so, the address will be found in the cache

• Transfers between main memory and cache occur at the granularity of cache lines or cache blocks, around 32 or 64 bytes (rather than bytes or processor words). Burst transfers of this kind receive hardware support and exploit spatial locality of access to the cache (future accesses are often to addresses near the previous one)

Page 40:

Where can a block be placed in Cache? (1)

• Our cache has eight block frames and the main memory has 32 blocks

Page 41:

Where can a block be placed in Cache? (2)

• Direct mapped cache
– Each block has only one place where it can appear in the cache
– line = (Block Address) MOD (Number of blocks in cache)

• Fully associative cache
– A block can be placed anywhere in the cache

• Set associative cache
– A block can be placed in a restricted set of places in the cache
– A set is a group of blocks in the cache
– set = (Block Address) MOD (Number of sets in the cache)

• If there are n blocks in a set, the placement is said to be n-way set associative (see the sketch below)
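A minimal C sketch of these placement rules (the function names and example geometry are mine; the 8-frame/32-block numbers echo the previous slide):

#include <stdio.h>

/* Direct mapped: a block can live in exactly one line. */
static unsigned direct_line(unsigned block, unsigned num_lines) {
    return block % num_lines;
}

/* Set associative: a block's set; it may go in any of that set's n lines. */
static unsigned set_index(unsigned block, unsigned num_sets) {
    return block % num_sets;
}

int main(void) {
    unsigned block = 12;  /* an arbitrary main memory block */
    printf("direct mapped:    block 12 -> line %u of 8\n",
           direct_line(block, 8));  /* 12 MOD 8 = 4 */
    printf("2-way set assoc.: block 12 -> set %u of 4\n",
           set_index(block, 4));    /* 12 MOD 4 = 0 */
    /* Fully associative: any of the 8 lines; there is no index to compute. */
    return 0;
}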

Page 42:

Mapping Function Example

• Cache of 64 KByte
– 16K (2^14) lines
– each line is 5 bytes wide = 40 bits: a 1-byte tag field and a 4-byte data field (one 4-byte block of main memory)
– holds up to 64 KBytes of main memory contents

• 16 MBytes of main memory
• 24-bit address (2^24 = 16M)
• We will consider DIRECT and ASSOCIATIVE mappings

Page 43:

Direct Mapping

• Each block of main memory maps to only one cache line
– i.e. if a block is in the cache, it must be in one specific place – based on the address!
• Split the address into two parts:
– the least significant w bits identify a unique word within a block
– the most significant s bits specify one memory block
• Split the s bits into:
– a cache line address field of r bits
– a tag field of the s-r most significant bits

Address structure: [ tag: s-r bits | line: r bits | word: w bits ]
(the line field identifies the line containing the block!)

Page 44: CS 147 Cache Memory Prof. Sin-Min Lee Department of Computer Science

tags-r

line address r

wordw

8 14 2

24 bit address

s = 22 bit block identifier

2 bit word identifier (4 byte block)

Direct Mapping: Address Structure for Example

• two blocks may have the same r value, but then always have different tag value !
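In C, the 8/14/2 split looks like this (a sketch under the slide's geometry; the struct and function names are mine):

#include <stdio.h>

/* Split a 24-bit address into tag (8 bits), line (14), word (2). */
struct fields { unsigned tag, line, word; };

static struct fields split_direct(unsigned addr) {
    struct fields f;
    f.word = addr & 0x3;            /* least significant 2 bits */
    f.line = (addr >> 2) & 0x3FFF;  /* next 14 bits             */
    f.tag  = (addr >> 16) & 0xFF;   /* most significant 8 bits  */
    return f;
}

int main(void) {
    struct fields f = split_direct(0xFFFFFC);
    printf("tag=%02X line=%04X word=%X\n", f.tag, f.line, f.word);
    /* prints: tag=FF line=3FFF word=0 */
    return 0;
}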

Page 45:

Direct Mapping Cache Line Table

cache line    main memory blocks held
0             0, m, 2m, 3m, …, 2^s - m
1             1, m+1, 2m+1, …, 2^s - m + 1
…             …
m-1           m-1, 2m-1, 3m-1, …, 2^s - 1

where m = 2^14 lines, s = 22, and each block = 4 bytes.

But… a line can contain only one of these at a time!

Page 46:

Direct Mapping Cache Organization

Page 47:

Direct Mapping pros & cons

• Simple
• Inexpensive
• Fixed location for a given block
– If a program repeatedly accesses 2 blocks that map to the same line, cache misses are very high

Page 48:

Associative Memory

• Read: specify a tag field value and a word select
• The memory checks all lines and finds the matching tag
– it returns the contents of the data field at the selected word
• Access time is independent of location or previous access
• Write to the data field at a tag value + word select
– what if no words have a matching tag?

Page 49:

Associative Mapping

• A main memory block can load into any line of the cache
• The memory address is interpreted as a tag and a word select within the block
• The tag uniquely identifies the block of memory!
• Every line's tag is examined for a match
• Cache searching gets expensive

Note: the tag is the full s bits – the address does not use a line field!

Page 50:

Fully Associative Cache Organization

(no line field!)

Page 51:

Associative Mapping Example

tag = most significant 22 bits of the address

(Typo in the figure: the leading F is missing!)

Page 52:

Associative Mapping Address Structure

[ Tag: 22 bits | Word: 2 bits ]

• A 22-bit tag is stored with each 32-bit block of data
• Compare the address's tag field with the tag entries in the cache to check for a hit
• The least significant 2 bits of the address identify which 8-bit word is required from the 32-bit data block
• e.g.
– Address FFFFFC, Tag 3FFFFF, Data 24682468, Cache line: any, e.g. 3FFF
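The same split in C (a sketch with my own names; it reproduces the slide's FFFFFC -> tag 3FFFFF):

#include <stdio.h>

int main(void) {
    unsigned addr = 0xFFFFFC;   /* 24-bit address                  */
    unsigned word = addr & 0x3; /* least significant 2 bits: word  */
    unsigned tag  = addr >> 2;  /* remaining 22 bits: the tag      */
    printf("tag=%06X word=%X\n", tag, word); /* tag=3FFFFF word=0  */
    return 0;
}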

Page 53:

Set Associative Mapping

• The cache is divided into a number of sets
• Each set contains k lines: k-way associative
• A given block maps to any line in a given set
– e.g. block B can be in any line of set i
• e.g. 2 lines per set
– 2-way associative mapping
– a given block can be in one of 2 lines, in only one set

Page 54:

K-Way Set Associative Cache Organization

Direct + associative mapping: the set select is direct, the tag is associative

Page 55:

Set Associative Mapping Address Structure

[ Tag: 9 bits | Set: 13 bits | Word: 2 bits ]

• Use the set field to determine which set of cache lines to look in (direct)
• Within this set, compare the tag fields to see if we have a hit (associative)

• e.g.
– Address FFFFFC: Tag 1FF, Data 12345678, Set number 1FFF
– Address 00FFFF: Tag 001, Data 11223344, Set number 1FFF

Same set, different tag, different word.

Page 56:

e.g. Breaking into Tag, Set, Word

• Given Tag = 9 bits, Set = 13 bits, Word = 2 bits
• Given address FFFFFD (base 16)
• What are the values of Tag, Set, Word?
– The first 9 bits are the Tag, the next 13 the Set, the last 2 the Word
– Rewrite the address in base 2: 1111 1111 1111 1111 1111 1101
– Group each field into groups of 4 bits starting at the right, adding zero bits as necessary to the leftmost group

• Tag = 1 1111 1111 → 0001 1111 1111 = 1FF
• Set = 1 1111 1111 1111 → 0001 1111 1111 1111 = 1FFF
• Word = 01 → 1
• (Tag, Set, Word) = (1FF, 1FFF, 1)
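The same breakdown in C (a sketch with my own names; the field widths are the slide's 9/13/2):

#include <stdio.h>

int main(void) {
    unsigned addr = 0xFFFFFD;              /* the slide's 24-bit address */
    unsigned word = addr & 0x3;            /* least significant 2 bits   */
    unsigned set  = (addr >> 2) & 0x1FFF;  /* next 13 bits               */
    unsigned tag  = (addr >> 15) & 0x1FF;  /* most significant 9 bits    */
    printf("tag=%03X set=%04X word=%X\n", tag, set, word);
    /* prints: tag=1FF set=1FFF word=1, matching the slide */
    return 0;
}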

Page 57:

Replacement Algorithms: Direct Mapping

• What if we are bringing in a new block, but no line is available in the cache?
• We must replace (overwrite) a line – which one?
• Direct mapping: no choice
– each block maps to only one line
• replace that line

Page 58:

Replacement Algorithms: Associative & Set Associative

• Hardware-implemented algorithm (for speed)
• Least Recently Used (LRU)
– e.g. in a 2-way set associative cache, which of the 2 blocks is LRU? (see the sketch below)
• First In First Out (FIFO)
– replace the block that has been in the cache longest
• Least Frequently Used (LFU)
– replace the block which has had the fewest hits
• Random
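For a 2-way set, LRU needs only one bit per set. A minimal C sketch (my own data structure, not any particular hardware's):

#include <stdio.h>

/* One 2-way set: a single bit records which way is least recently used. */
struct set2 {
    unsigned tag[2];
    int      valid[2];
    int      lru;  /* index (0 or 1) of the LRU way */
};

/* Look up a tag; on a miss, evict the LRU way. Returns 1 on a hit. */
static int touch(struct set2 *s, unsigned tag) {
    for (int way = 0; way < 2; way++) {
        if (s->valid[way] && s->tag[way] == tag) {
            s->lru = 1 - way;  /* the other way becomes LRU */
            return 1;
        }
    }
    int victim = s->lru;  /* miss: replace the LRU way */
    s->tag[victim] = tag;
    s->valid[victim] = 1;
    s->lru = 1 - victim;
    return 0;
}

int main(void) {
    struct set2 s = {{0, 0}, {0, 0}, 0};
    unsigned refs[] = {5, 9, 5, 7, 9};  /* block tags mapping to this set */
    for (int i = 0; i < 5; i++)
        printf("tag %u -> %s\n", refs[i], touch(&s, refs[i]) ? "hit" : "miss");
    /* miss, miss, hit, miss (evicts 9), miss (evicts 5) */
    return 0;
}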

Page 59:

Write Policy

• We must not overwrite a cache block unless main memory is up to date

• Complication: multiple CPUs may have individual caches!

• Complication: I/O may address main memory too (read and write)!

• N.B. About 15% of memory references are writes

Page 60:

Write Through Method

• All writes go to main memory as well as to the cache

• Each of multiple CPUs can monitor main memory traffic to keep its own local cache up to date

• Lots of traffic – slows down writes

Page 61:

Write Back Method

• Updates are initially made in the cache only

• An update (dirty) bit for the cache slot is set when an update occurs

• If a block is to be replaced, write it to main memory only if the update bit is set

• Other caches can get out of sync

• I/O must access main memory through the cache
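A toy C sketch contrasting the two policies for a single cache line (real hardware does this, not software; the structure and names are mine):

#include <stdio.h>

struct line { unsigned tag, data; int valid, dirty; };

static unsigned memory[1 << 16];  /* toy word-addressed main memory */

/* Write-through: update the cache and main memory on every write. */
static void write_through(struct line *l, unsigned addr, unsigned value) {
    if (l->valid && l->tag == addr)
        l->data = value;
    memory[addr] = value;  /* memory is always up to date */
}

/* Write-back: update only the cache and set the dirty bit;
   memory is written when the line is evicted. */
static void write_back(struct line *l, unsigned addr, unsigned value) {
    l->tag = addr; l->data = value; l->valid = 1; l->dirty = 1;
}

static void evict(struct line *l) {
    if (l->valid && l->dirty)
        memory[l->tag] = l->data;  /* flush only if the dirty bit is set */
    l->valid = l->dirty = 0;
}

int main(void) {
    struct line l = {0, 0, 0, 0};
    write_back(&l, 42, 7);
    printf("before evict: memory[42] = %u\n", memory[42]); /* 0: stale   */
    evict(&l);
    printf("after evict:  memory[42] = %u\n", memory[42]); /* 7: flushed */
    (void)write_through;  /* shown for contrast */
    return 0;
}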

Page 62:

Multiple Caches on one processor

• Two levels – L1 close to the processor (often on chip)

• L2 – between L1 and main memory

• Check L1 first – if it misses, then check L2
– if L2 misses, get the word from memory

[Diagram: processor – local bus – L1 – L2 – system bus – to high-speed bus]

Page 63:

Unified vs. Split Caches

• Unified: both instructions and data in the same cache

• Split: separate caches for instructions and data
– separate local busses to the caches

• Increased concurrency – pipelining
– allows instruction fetch to be concurrent with operand access

Page 64:

Pentium Family Cache Evolution

• 80386 – no on-chip cache

• 80486 – 8 KB, using 16-byte lines and a four-way set associative organization

• Pentium (all versions) – two on-chip L1 (split) caches: data & instructions

Page 65:

Pentium 4 Cache

• Pentium 4 – split L1 caches
– 8 KBytes
– 128 lines of 64 bytes each
– four-way set associative = 32 sets

• Unified L2 cache – feeding both L1 caches
– 256 KBytes
– 2048 (2K) lines of 128 bytes each
– 8-way set associative = 256 sets

How many bits are w (word) and s (set)? (See the sketch below.)
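Answering the slide's question with a small C sketch (my own helper function; the geometries are the slide's):

#include <stdio.h>

/* Number of bits needed to index among n choices (n a power of 2). */
static unsigned bits(unsigned n) {
    unsigned b = 0;
    while ((1U << b) < n)
        b++;
    return b;
}

int main(void) {
    /* L1: 64-byte lines, 32 sets. */
    printf("L1: w = %u offset bits, s = %u set bits\n", bits(64), bits(32));
    /* L2: 128-byte lines, 256 sets. */
    printf("L2: w = %u offset bits, s = %u set bits\n", bits(128), bits(256));
    /* prints: L1: w = 6, s = 5;  L2: w = 7, s = 8 */
    return 0;
}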

Page 66:

Pentium 4 Diagram (Simplified)

[Diagram: L1 instruction cache and L1 data cache, both fed by a unified L2]

Page 67:

PowerPC Cache Evolution

• 601 – a single 32 KB cache, 8-way set associative

• 603 – 16 KB (2 x 8 KB), two-way set associative

• 604 – 32 KB

• 610 – 64 KB

• G3 & G4
– 64 KB L1 cache, 8-way set associative
– 256 KB, 512 KB, or 1 MB L2 cache, two-way set associative

Page 68:

PowerPC G4

[Diagram: L1 instruction cache and L1 data cache with a unified L2]

Page 69:

Historically, CPUs have always been faster than memories. As memories have improved, so have CPUs, preserving the imbalance. In fact, as it becomes possible to put more and more circuits on a chip, CPU designers are using these new facilities for pipelining and superscalar operation, making CPUs go even faster. Memory designers have usually used new technology to increase the capacity of their chips, not the speed, so the problem appears to be getting worse over time.

CACHE MEMORY

Page 70:

What this imbalance means in practice is that after the CPU issues a memory request, it will not get the word it needs for many CPU cycles. The slower the memory, the more cycles the CPU will have to wait.

CACHE MEMORY

Page 71:

Actually, the problem is not technology, but economics. Engineers know how to build memories that are as fast as CPUs, but to run at full speed, they have to be located on the CPU chip (because going over the bus to memory is very slow).

CACHE MEMORY

Page 72:

Putting a large memory on the CPU chip makes it bigger, which makes it more expensive, and even if cost were not an issue, there are limits to how big a CPU chip can be made. Thus the choice comes down to having a small amount of fast memory or a large amount of slow memory. What we would prefer is a large amount of fast memory at a low price.

CACHE MEMORY

Page 73:

How it works

• Whenever the processor requires access to data or an instruction stored in RAM, it makes a request to a particular memory address

• The Cache Controller intercepts this request and checks cache memory to see if that address is stored in the cache. If it is, the Cache Controller directs the CPU to access the faster Cache RAM instead.

• If it is not, then the Cache Controller copies the contents of that address into the Cache, so that the next time it is requested it will be in the Cache for the CPU's use.

Page 74:

Organization is the key

• A good Cache system has to be able to do the following:
– Find information quickly (search times – low is good)
– Keep data long enough to be used – 256 KB isn't a lot of memory (hit rates – high is good)

• It's pretty easy to do one or the other, but not so easy to do both.

Page 75:

What's a line?

• Often the Main Memory and Cache Memory cells that store data are called “lines.” Each line holds a piece of data or an instruction and has an address associated with it. Although it is up to the designers of the cache system to decide how long these lines are, today they are typically 32 bytes.

Page 76:

Direct Mapped Cache

• A Direct Mapped Cache divides the Cache memory into as many lines as it can, and then assigns each line to a “block” of Main Memory lines

Page 77:

Do until A=5
A=A+1
Print A
END LOOP

(current line: Do until A=5)

Page 78:

Do until A=5
A=A+1
Print A
END LOOP

(current line: A=A+1)

Page 79:

Do until A=5
A=A+1
Print A
END LOOP

(current line: Print A)

Page 80:

Do until A=5
A=A+1
Print A
END LOOP

(current line: END LOOP)

Page 81:

Direct Mapped Cache

• Because there is such high competition for space, data is replaced in the Cache often, and therefore the data we require is seldom (if ever) present in the Cache when we need it – LOW HIT RATES = BAD

• The advantage of this type of organization is that there is only one place to look for any given address, and that makes the cache system very fast – LOW SEARCH TIMES = GOOD

Page 82:

Fully Associative Cache

• In a Fully Associative Cache, the Cache is divided into as many lines as possible and the Main Memory is divided into as many lines as possible. There is no assigning of lines at all. Any line from Main Memory can be stored in any line of the Cache.

Page 83:

Do until A=5
A=A+1
Print A
END LOOP

(in the cache so far: Do until A=5)

Page 84:

Do until A=5
A=A+1
Print A
END LOOP

(in the cache so far: Do until A=5; A=A+1)

Page 85:

Do until A=5
A=A+1
Print A
END LOOP

(in the cache: Do until A=5; A=A+1; Print A; END LOOP)

Page 86:

Fully Associative Cache

• The previous scenario at first seems to work really well – after a single Main Memory access to each line, we can retrieve each line from the Cache on subsequent accesses (HIGH HIT RATE = GOOD)

• The problem, however, is that the Cache Controller has no way of knowing exactly where the data is, and therefore must search through the entire Cache until it finds what it is looking for, taking up a lot of precious time (HIGH SEARCH TIMES = BAD)

Page 87:

A third problem… (which we'll talk about later)

• What happens if the cache fills up? We need to decide what gets discarded, and depending on the Replacement Policy that is used, this can have an impact on performance

Page 88:

N-WAY Set Associative

• An N-WAY Set Associative cache tries to find a middle-ground compromise between the previous two strategies – in fact, the Direct Mapped and Fully Associative strategies are simply two extreme cases of N-WAY Set Associative!

• The “N” in N-WAY actually represents a number (2, 4, 8, etc.). The following example deals with a 2-WAY Set Associative organization.

• The Cache is broken up into groups of two (2) lines, and then Main Memory is divided into the same number of groups that exist in the Cache. Each group in the Cache is assigned to a group of Main Memory. (A sketch follows below.)
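A small C sketch (my own framing) showing how Direct Mapped and Fully Associative fall out as the two extremes of N-WAY:

#include <stdio.h>

/* For a cache of num_lines lines organized N ways, there are
   num_lines / N sets, and a block maps to set (block MOD sets). */
int main(void) {
    unsigned num_lines = 8, block = 12;
    unsigned n_values[] = {1, 2, 8};  /* direct, 2-way, fully associative */
    for (int i = 0; i < 3; i++) {
        unsigned n = n_values[i];
        unsigned sets = num_lines / n;
        printf("N=%u: %u sets, block %u -> set %u, search %u line(s)\n",
               n, sets, block, block % sets, n);
    }
    /* N=1: 8 sets of 1 line  -> direct mapped (one place to look)     */
    /* N=8: 1 set of 8 lines  -> fully associative (search every line) */
    return 0;
}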

Page 89:

Do until A=5
A=A+1
Print A
END LOOP

(one two-line group in the cache: Do until A=5; A=A+1)

Page 90:

N-WAY Set Associative

• Because there is less competition for space in the Cache, there is a higher chance of getting the data you want – AS N INCREASES, HIT RATES GET HIGHER = GOOD

• However, there is more than one line to search, and therefore more time is involved in finding data – AS N INCREASES, SEARCH TIMES GET HIGHER = BAD

• There is still the problem of what data to replace and what the impacts on performance will be

Page 91:

Let's Split

• Any of these Cache organizations can be Split or Unified

• If a Cache is split, it means that half of the Cache is used to store instructions only, and the other half is used to store data created by programs

• Advantage – it's more organized and easier to find what you're looking for

• Disadvantage – instructions are generally much smaller than the data produced by a program, so a lot of space can be wasted

Page 92:

A Final Word

• Cache organizations are not something you can tweak – they are determined by the manufacturer through research on current software packages and benchmarks vs. the amount of memory available on the chip

• Intel and AMD processors currently use a split 8-way L1 cache and a unified 8-way L2 cache

• You may have heard by now of the Intel L3 cache on the Itanium processor. This is simply another layer of memory that works in conjunction with the L1 and L2 caches. It is not on the die of the processor, but it still works faster than Main Memory, and it is larger than L1 or L2, so it is still quite effective