CSIT 301 (Blum) — Cache
Based in part on Chapter 9 in Computer Architecture (Nicholas Carter)


Page 1:

Cache

Based in part on Chapter 9 in Computer Architecture

(Nicholas Carter)

Page 2:

Pentium 4 Blurb (L1)

Some cache terms to understand:

• Data cache

• Execution Trace Cache

Page 3:

Pentium 4 Blurb (L2)

Some cache terms to understand:

• Non-Blocking

• 8-way set associativity

• on-die

Page 4:

Caching Analogy: Why Grading Late Homework Is a Pain

• To grade a student’s homework problem, a professor must
  1. Solve the problem
  2. Compare the answer with the student’s
• When grading the homework of a whole class, the professor can
  1. Solve the problem
  2. Compare the answer with Student 1’s answer
  3. Compare the answer with Student 2’s answer
  4. …

Page 5:

Caching Analogy (Cont.)

• In other words, the professor “caches” the solution so that all students after the first can be graded much more quickly than the first.

• Even if the professor “stores” the solution (that is, files it away), it is not handy when it comes time to grade the late student’s homework.

Page 6:

Caching Analogy (Cont.)

• You might think the benefits of caching are too contrived in the previous analogy since the professor instructed all of the students to solve the same problem and submit it at the same time.

• Suppose students (of their own volition) looked at the problems at the end of the chapter being discussed.
  – It’s hard to imagine, I know.

Page 7:

Caching Analogy (Cont.)

• Then a student might come to the professor’s office for help on a difficult problem.

• The professor should keep the solution handy because a problem that was difficult for one student is likely to be difficult for other students who are likely to turn up soon.

• This is the notion of “locality of reference.”
  – What was needed/used recently is likely to be needed/used again soon.

Page 8:

Locality of Reference

• The memory assigned to an executing program will have both data and instructions. At a given time, the probability that the processor will need to access a given memory location is not equally distributed among all of the memory locations.
  – The program may be more likely to need the same location that it has accessed in the recent past – this is known as temporal locality.
  – The program may be more likely to need a location that is near the one just accessed – this is known as spatial locality.

Page 9:

Loops and Arrays

• Consider that the tasks best suited for automation (to be done by a machine, including a computer) are repetitive.

• Any program with loops and arrays is a good candidate to display locality of reference.

• Waiting for some user event is also very repetitive. This repetition may be hidden from the programmer working with a high-level language.

Page 10:

Locality of reference

• Locality of reference is the principle behind caching.

• Locality of reference is what allows 256-512 KB of cache to stand in for 256-512 MB of memory.

• The cache is a factor of 1000 smaller, yet the processor finds what it needs in cache ninety-some percent of the time.

Page 11:

Caching

• The term cache can be used in different ways.
• Sometimes “cache” is used to refer generally to placing something where it can be retrieved more quickly. In this sense of the term, there is an entire hierarchy of caching: SRAM is faster than DRAM, which is faster than the hard drive, which is faster than the Internet.
• Sometimes “cache” is used to refer specifically to the top layer of the above hierarchy (the SRAM).
  – For the rest of the presentation, we will be using the latter meaning.

Page 12:

What are we caching?

• We have to look one level down in the memory/storage hierarchy to realize what it is we are caching.

• One level down is main memory.
  – Recall how one interacts with memory (DRAM) – one supplies an address to obtain the value located at that address.

Page 13:

What are we caching?

• We must cache the address and the value.
  – Recall our analogy – if the professor writes down the answer (analogous to the value) but does not recall what problem it is the answer to (analogous to the address), it is useless.

• Ultimately we want the value, but it is the (memory) address we will be given, and that is what we will search for in our cache.
  – The student does not ask whether 43 is the answer (the answer to what?); the student asks what the answer to problem 5-15 is.

Page 14:

Some Terminology

• Think of cache as parallel arrays (addresses and values).
• The array of addresses is called the tag array.
• The array of values is called the data array.
  – Don’t confuse the terms “data array” and “data cache.”
• A memory address is supplied:
  – If the memory address is found in the tag array, one is said to have a cache hit, and the corresponding value from the data array is sent out.
  – If the memory address is not found, one has a cache miss, and the processor must go to memory to obtain the desired value.
  – The percentage of cache hits is known as the hit rate (usually one looks for 90% or better).
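The tag-array/data-array bookkeeping above can be sketched as a toy Python model (the addresses and values here are invented for illustration; real hardware searches all tags at once rather than looping):

```python
# Toy model of a cache as parallel tag/data arrays.
# A lookup searches the tag array; a hit returns the matching data entry.
tag_array = [0xFFA0, 0xFF18, 0xFFB0]                    # cached addresses (tags)
data_array = ["some value", "another", "yet another"]   # cached values

def lookup(address):
    """Return (hit, value): search the tag array for the address."""
    for i, tag in enumerate(tag_array):
        if tag == address:
            return True, data_array[i]   # cache hit: send out the value
    return False, None                   # cache miss: must go to memory

print(lookup(0xFF18))   # hit
print(lookup(0xAAAA))   # miss
```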

Page 15:

Cache Controller

• In addition to the tag and data arrays is the cache controller, which runs the show.
  – When L2 cache was separate from the processor, the cache controller was part of the system chipset.
  – When L2 cache moved onto the microprocessor, so too did the controller.
  – Now it is the L3 cache controller that is part of the system chipset.

• Now even L3 is moving onto the microprocessor.

Page 16:

One caches addresses (tags) and values

  Cache Address   Memory Address (tag)   Memory Value (data)
  0000            FFA0                   Some value
  0001            FF18                   Another
  0002            FFB0                   Yet another
  …               …                      …

Page 17:

Data Array versus Data Cache

• The term data array refers to the set of values that are placed in cache. (It doesn’t matter what the values correspond to.)

• The term data cache refers to the caching of data, as opposed to the instruction cache, where instructions are cached.

• In a modern adaptation of the Harvard architecture, called the Harvard cache, data and instructions are sent to separate caches.
  – Unlike data, an instruction is unlikely to be updated – overwritten yes, updated no. Therefore the data cache and the instruction cache can have different write policies.

Page 18:

Capacity

• The usual specification (spec) one is given for cache is called the capacity.
  – E.g., Norwood-core Pentium 4s have a 512 KB L2 cache.

• The capacity refers only to the amount of information in the data array (values).
  – The spec does not include the tag array (addresses), the dirty bits, and so on – though they must of course be there.

Page 19:

Lines and Line Lengths

• The basic unit of memory is a byte; the basic unit of cache is a line.
  – Be careful not to use the word “block” in place of “line.” In cache, blocking means that upon a cache miss, one must write the new values to cache before proceeding.

• A line consists of many bytes (typically a power of 2, such as 32, 64, or 128). The number of bytes in a line is called the line length.

Page 20:

Memory

  Address   Value
  FFA0      26
  FFA1      FD
  FFA2      A7
  FFA3      37
  …         …

Cache

  Cache Address   Tag   Line Value (byte positions 0 1 2 3 …)
  0011            FFA   26 FD A7 37 …

Because cache lines are bigger than memory locations, one does not store the full memory address in the tag array.
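The slide’s FFA3 → tag FFA example implies one hex digit (4 bits) of byte offset, i.e. a 16-byte line. Stripping the offset bits to form the tag can be sketched as:

```python
# Tag derivation for the example above: with a 16-byte line (4 offset bits),
# the tag is the address with its low offset bits stripped off.
OFFSET_BITS = 4   # matches the FFA3 -> FFA example (one hex digit of offset)

def split(address):
    tag = address >> OFFSET_BITS                  # stored in the tag array
    offset = address & ((1 << OFFSET_BITS) - 1)   # byte position in the line
    return tag, offset

print(hex(split(0xFFA3)[0]))   # tag 0xffa
print(split(0xFFA3)[1])        # offset 3
```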

Page 21:

Example

• Assume a capacity of 512 KB.

• Don’t think of an array with 524,288 (512 K) elements with each element a byte long as you would if it were main memory.

• Instead think of an array with 16,384 (16 K) elements with each element 32 bytes long.
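The arithmetic behind the slide’s numbers:

```python
# Capacity arithmetic for the example: 512 KB capacity, 32-byte lines.
capacity = 512 * 1024       # 524,288 bytes
line_length = 32            # bytes per line
num_lines = capacity // line_length
print(num_lines)            # 16384 -- i.e., 16 K lines, each 32 bytes long
```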

Page 22:

Line Length Benefits

• The concept of cache lines has a few benefits:
  1. It directly builds in the notion of spatial locality – cache is physically designed to hold the contents of several consecutive memory locations.
  2. Eventually we must perform a search on the tags to see if the particular memory address has been cached. The line length shortens the tag, i.e., the item one must search for.
     – In the example on the earlier slide one would search for FFA instead of FFA3. That is, the tag is four bits smaller than the address.

Page 23:

Line Length Benefits

3. The cached value must have been read from memory. Recall that one can significantly improve the efficiency of reading memory locations if they are consecutive locations (especially if they are all in the same row).

– So the paging/bursting improvements of reading memory are particularly important because of the way cache is structured.

Page 24:

Hardware Searching

• The cache is handed a memory address; it strips off the least significant bits to form the corresponding search tag; it then must search the tag array for that value.
  – The most efficient search algorithm you know is useless at this level; we need to perform the search in a couple of clock cycles. We need to search using hardware.

Page 25:

Variations

• The hardware search can be executed in a number of ways, and this is where the terms direct-mapped, fully associative, and set-associative come in.
  – The Pentium 4’s Advanced Transfer Cache has 8-way set associativity.

• The variations determine how many comparators (circuitry that determines whether we have a hit or a miss) are necessary.

Page 26:

XNOR: Bit Equality Comparator

Page 27:

ANDed XNORs: Word Equality Comparator
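The comparator in the two figures can be sketched in Python: each bit pair passes through an XNOR (output 1 when the bits match), and the XNOR outputs are ANDed together, so the comparator reports a match only if every bit matches.

```python
# Gate-level sketch of the equality comparator in the figures above.
def xnor(a, b):
    """XNOR of two bits: 1 exactly when they are equal."""
    return 1 if a == b else 0

def word_equal(word_a, word_b):
    """AND of per-bit XNORs over two equal-length bit lists."""
    result = 1
    for a, b in zip(word_a, word_b):
        result &= xnor(a, b)        # one mismatched bit forces the AND to 0
    return result

print(word_equal([1, 0, 1, 1], [1, 0, 1, 1]))   # 1: words are equal
print(word_equal([1, 0, 1, 1], [1, 1, 1, 1]))   # 0: second bit differs
```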

Page 28:

Direct Mapping

• Direct mapping simplifies tag-array searching (i.e., minimizes the number of comparators) by saying that a given memory location can be cached in one and only one line of cache.
  – The mapping is not one-to-one. Since memory is about a thousand times bigger than cache, many memory locations share a cache line, and only one section of memory can be in there at a time.

Page 29:

(Diagram: Memory → Direct-Mapped Cache)

A given memory location is mapped to one and only one line of cache. But each line of cache corresponds to several (sets of) memory locations. Only one of these can be cached at a given time.

Page 30:

A Direct Mapping Scenario

The memory address is divided into three fields:
• The lower bits determine the position within the line of cache.
• The middle bits determine the cache address that will be used.
• The upper bits are the part of the address actually stored in the tag array.

Page 31:

A Direct Mapping Scenario (Cont.)

• A memory address is handed to cache.
• The middle portion is used to select the cache address.
• The tag stored at that cache address and the upper portion of the original memory address are sent to a comparator.
  – Note there’s only one comparator!
• If they are equal (a cache hit), then the lower portion of the original memory address is used to select the byte from within the line.
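The steps above can be sketched as a toy direct-mapped lookup. The field widths below are assumptions chosen to match the earlier 512 KB / 32-byte-line example (5 offset bits, 14 index bits), not anything mandated by the slides:

```python
# Toy direct-mapped lookup: split the address into tag | index | offset,
# use the index to pick the one possible line, compare the stored tag
# (the single comparator), and on a hit select the byte within the line.
OFFSET_BITS = 5                    # 32-byte lines (assumed for illustration)
INDEX_BITS = 14                    # 16 K lines -> 512 KB data array

tags = [None] * (1 << INDEX_BITS)        # tag array
lines = [bytes(32)] * (1 << INDEX_BITS)  # data array

def read(address):
    offset = address & ((1 << OFFSET_BITS) - 1)                  # lower bits
    index = (address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)   # middle bits
    tag = address >> (OFFSET_BITS + INDEX_BITS)                  # upper bits
    if tags[index] == tag:                   # the one comparator
        return True, lines[index][offset]    # hit: select byte within line
    return False, None                       # miss

# Fill one line, then look up a byte inside it.
tags[3] = 0x12
lines[3] = bytes(range(32))
print(read((0x12 << 19) | (3 << 5) | 7))    # (True, 7)
```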

Page 32:

A Potential Problem with Direct Mapping

• Recall that locality of reference (the notion behind caching) is particularly effective during repetitive tasks.

• Imagine that a loop involves two memory locations that share the same cache address (perhaps it processes a large array). Then each time the processor wanted one of the locations, the other would be in the cache. Thus, there would be two cache misses for each iteration of the loop. But loops are when caching is supposed to be at its most effective.

• TOO MANY CACHE MISSES!
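The thrashing scenario can be simulated: two addresses whose index bits coincide keep evicting each other, so every access in the loop misses. The cache geometry and the two addresses here are invented for illustration:

```python
# Conflict misses in a direct-mapped cache: two addresses with the same
# index bits share one line and evict each other on every access.
OFFSET_BITS, INDEX_BITS = 5, 10          # assumed geometry (32 KB of lines)
cached_tag = [None] * (1 << INDEX_BITS)  # one tag per line

def access(address):
    """Return 'hit' or 'miss', filling the single line for this index."""
    index = (address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    if cached_tag[index] == tag:
        return "hit"
    cached_tag[index] = tag              # evict whatever shared this line
    return "miss"

a = 0x0000_8000        # two addresses 32 KB apart: same index, different tags
b = 0x0001_0000
misses = sum(access(x) == "miss" for _ in range(4) for x in (a, b))
print(misses)          # 8 -- all eight accesses in the loop miss
```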

Page 33:

Fully Associative Cache: The Other Extreme

• In direct mapping, a given memory location is mapped onto one and only one cache location.

• In fully associative caches, a given memory location can be mapped to any cache location.
  – This will solve the previous problem. There’s no conflict – one caches whatever is needed for the loop.
  – But with a fully associative cache, searching becomes more difficult; one has to examine the entire tag array, whereas before, with direct mapping, there was only one place to look.

Page 34:

Associativity = Many Comparators

• Looping through the tag array would be prohibitively slow. We must compare the memory address (or the appropriate portion thereof) to all of the values in the tag array simultaneously.

Page 35:

Array of Comparators

(Diagram: the search address is fed, in parallel, to one comparator per tag-array entry; the outputs indicate whether there is a hit – yes or no – and, if so, the address of the hit.)

For each element of the tag array, there is a comparator. Each comparator checks the tag element against the search tag.

Page 36:

Associative memory a.k.a. content addressable memory (CAM)

Page 37:

Associative memory

• In regular memory, one provides an address, and then the value at that address is supplied.

• In associative memory (content addressable memory), one provides the value or some part thereof, and then the address and/or the remainder of the value is supplied.
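The contrast can be sketched with a toy model (the addresses and values are invented; a real CAM does all comparisons in parallel hardware rather than looping):

```python
# Regular memory maps address -> value; associative memory (CAM) is
# searched by value and yields the address where that value is stored.
memory = {0xFFA0: "alpha", 0xFFA1: "beta", 0xFFA2: "gamma"}

def ram_read(address):
    """Regular memory: supply an address, get the value there."""
    return memory[address]

def cam_search(value):
    """CAM: supply a value, get back the address holding it (or None)."""
    for address, stored in memory.items():
        if stored == value:
            return address
    return None

print(ram_read(0xFFA1))           # beta
print(hex(cam_search("gamma")))   # 0xffa2
```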

Page 38:

The Problem with Fully Associative Cache

• All of those comparators are made of transistors. They take up room “on the die.” And any space lost to comparators has to be taken away from the data array.
  – After all, we’re talking about thousands of comparators.

• ASSOCIATIVITY LOWERS CAPACITY!

Page 39:

Set-Associative Caches: The Compromise

• For example, instead of having the 1000-to-1 mapping we had with direct mapping, we could elect to have an 8000-to-8 mapping.

• That is, a given memory location can be cached into any of 8 cache locations, but the set of memory locations sharing those cache locations has also gone up by a factor of 8.

• This would be called an 8-way set associative cache.

Page 40:

A Happy Medium

• 4- or 8-way set associativity provides enough flexibility to allow one (under most circumstances) to cache the necessary memory locations to get the desired effects of caching for an iterative procedure.
  – I.e., it minimizes cache misses.

• But it requires only 4 or 8 comparators instead of the thousands required for fully associative caches.

Page 41:

Bad Direct Mapping Scenario Recalled

• With a direct-mapped cache, the loop involves memory locations that share the same cache address. With a set-associative cache, the loop involves memory locations that share the same set of cache addresses.

• It is thus possible with a set-associative cache that each of these memory locations is cached to a different member of the set. The iterations can proceed without repeated cache misses.

Page 42:

Set-Associative Cache

• Again the memory address is broken into three parts.
  – One part determines the position in the line.
  – One part determines, this time, a set of cache addresses.
  – The last part is compared to what is stored in the tags of the set of cache locations.
  – Etc.
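An N-way lookup can be sketched by extending the direct-mapped toy model: the index bits now select a set of N tags, all of which are compared. The geometry, addresses, and the naive fill/replace rule below are assumptions for illustration (real caches use a policy such as LRU):

```python
# Toy N-way set-associative lookup: the index selects a set of WAYS lines,
# and WAYS comparators check the tags in that set in parallel.
OFFSET_BITS, INDEX_BITS, WAYS = 5, 9, 8    # assumed sizes for illustration

sets = [[None] * WAYS for _ in range(1 << INDEX_BITS)]   # tags, per set

def access(address):
    index = (address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    if tag in sets[index]:               # models the 8 parallel comparators
        return "hit"
    ways = sets[index]                   # miss: fill an empty way, else
    ways[ways.index(None) if None in ways else 0] = tag  # naively replace
    return "miss"

# Two addresses with the same index can now coexist in one set:
a, b = 0x0000_4000, 0x0000_8000
results = [access(x) for _ in range(3) for x in (a, b)]
print(results)    # the first pass misses; later passes hit
```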

Page 43:

PCGuide.com comparison table

To which we add that full associativity has an adverse effect on capacity.

Page 44:

References

• Computer Architecture, Nicholas Carter
• http://www.simmtester.com/page/news/showpubnews.asp?num=101
• http://www.pcguide.com/ref/mbsys/cache/
• http://www.howstuffworks.com/cache.htm/printable
• http://slcentral.com/articles/00/10/cache/print.php