
CS 201 Computer Systems Programming

Chapter 10

Data Cache Architecture

Herbert G. Mayer, PSU

Status 6/28/2015

Syllabus

Introduction
Definitions
Effective Times t_eff
Cache Subsystem and Design Parameters
Single-Line Degenerate Cache
Multi-Line, Single-Set Cache
Single-Line, Multi-Set Cache, Blocked Mapping
Single-Line, Multi-Set, Cyclic Mapping
Multi-Line per Set (Associative), Multi-Set Cache
Replacement Policies
LRU Sample
Compute Cache Size
Trace Cache
Characteristic Cache Curve
Bibliography

Introduction Cache Architecture

Cache-related definitions below are common, though not all manufacturers apply the same nomenclature. Initially we discuss cache designs for single-processor architectures. In another lecture note we progress to more complex and complete multiprocessor (MP) architectures, covering the MESI protocol for a two-processor system with an external L2 cache. The focus will be on data caches

Introduction

The speed with which the processor executes an instruction and references data in its registers is generally vastly superior to the speed with which memory can be accessed

For example, an integer instruction on a Pentium® Pro costs on the order of 1 cycle or less; less is possible because a superscalar processor may execute multiple operations in one step

Fetching an operand from memory on a typical Pentium Pro or newer system takes several dozen cycles

The gap between the slowness of memory and the speed of processors keeps widening over time, even though memories are getting faster!

Introduction

To bridge this long-recognized gap (the von Neumann bottleneck), computer architects invented (at Manchester University in the 1960s; see [5]) a special-purpose memory, now called the cache

Like regular memory, a cache holds bits of information: data or instructions

Unlike regular memory, a cache is very fast but more expensive per bit. If it were not so costly, we’d simply build all of memory out of cache memory, and the speed gap between processor and memory would be solved; but alas!

Even today, with some caches several megabytes in size, caches are small compared with a memory’s logical address space of 2^64 bytes

Introduction

While regular memory is arranged as a linear array of equal cells (bytes, words), caches usually are arranged in lines, also called blocks

Since “block” already has several other meanings, we shall use “line”

Only the address of the first byte of a line need be remembered

Individual bytes within lines are addressable by their offset. Note that only line-size-aligned portions of memory (AKA paragraphs) are moved into cache lines!

Each line represents a small, linearly contiguous subsection of memory, which we’ll call a paragraph

Introduction

Caches evolved into multiple levels and serve multiple purposes

Often the first-level cache (L1) is physically on-chip, allowing the processor to retrieve information sometimes in a single cycle

The next-level cache (L2) is often a separate physical device, larger than L1 and slower to access, due to “having to go off-chip”

With multi-core architectures, L2 caches also tend to move on-chip

On some multi-core chips the L2 is shared between the cores, yet on others there are individual L2 caches per core

Introduction

L3 caches are common in servers that process very large amounts of data

Caches have also become specialized: instructions are stored separately in so-called I-Caches, while data reside in data caches (D-Caches)

In the early 2000s, the trend was to replace I-Caches with trace caches (TC), which store already decoded micro-instructions

Since about 2007, trace caches have been out of favor and I-Caches have re-emerged

Definitions

A cache line’s age is tracked only in an associative cache

Aging tracks when a cache line was accessed, relative to the other lines in the same set

This implies that ages are compared

Generally, the relative ages are of interest (am I older than you?), rather than the absolute age (I was accessed at cycle such and such)

Think about the minimum number of bits needed to store the relative ages of, say, 8 cache lines!

Each memory access touches only one line; hence all lines in a set have distinct (relative) ages

Definitions

Alignment

Alignment is a spacing requirement, i.e. the restriction that an address adhere to a specific placement condition

For example, even-alignment means that an address is even, i.e. divisible by 2

E.g. address 3 is not even-aligned, but address 1000 is; in any even-aligned address the rightmost bit is 0

In virtual memory management (VMM), page addresses are aligned on page boundaries. If a page frame has size 4 KB, then page-aligned addresses are evenly divisible by 4 KB (4096)

As a result, the low-order (rightmost) 12 bits are 0, since 4096 = 2^12. Knowledge of alignment can be exploited to avoid storing those address bits in VMM, caching, etc.

Definitions

Associativity

If a cache has multiple lines per set, we call it associative; K stands for the number of lines in a set

Having multiple lines per set (K > 1) requires searching, i.e. comparing addresses, to determine whether a referenced object is in fact present in the cache; the key term is: to hit the cache

Put another way: in an associative cache, an object at some memory address has more than one line where it might live

Related: full associativity, the extreme case in which the whole cache forms a single set, so any line can hold any paragraph

Antonym: direct mapped; if only a single line exists per set, the search reduces to a single tag comparison

Definitions

Blocked Cache

If a cache cannot be accessed by the HW while some line is currently being streamed in, the cache is said to be blocked

This can limit performance if the current memory access refers to a line different from the one being streamed in

Not to be confused with cache blocks, AKA cache lines!

Definitions

Critical Chunk First

The number of bytes in a line is generally larger than the number of bytes that can be brought into the cache across the bus in 1 step, requiring multiple bus transfers to fill a line completely

It is efficient if the byte actually needed resides in the first chunk brought across the bus

The deliberate policy that accomplishes just that is the Critical Chunk First policy

This allows the cache to be unblocked after the first transfer, even though the line is not yet completely loaded

Other parts of the line may be used later, but the critical byte can thus be accessed right away

Definitions

Direct Mapped

If each memory address has just one possible location in the cache where it could reside (i.e. one single line, K = 1), then that cache is called direct mapped

Antonym: associative, or fully associative

Synonym: non-associative
