CS 201 Computer Systems Programming
Chapter 10
Data Cache Architecture
Herbert G. Mayer, PSU
Status 6/28/2015
Syllabus
Introduction
Definitions
Effective Times teff
Cache Subsystem and Design Parameters
Single-Line Degenerate Cache
Multi-Line, Single-Set Cache
Single-Line, Multi-Set Cache, Blocked Mapping
Single-Line, Multi-Set, Cyclic Mapping
Multi-Line per Set (Associative), Multi-Set Cache
Replacement Policies
LRU Sample
Compute Cache Size
Trace Cache
Characteristic Cache Curve
Bibliography
Introduction Cache Architecture
Cache-related definitions below are common, though not all manufacturers apply the same nomenclature. Initially we discuss cache designs for single-processor architectures. In another lecture note we progress to more complex and complete MP architectures, covering the MESI protocol for a two-processor system with external L2 cache. The focus here is on data caches
Introduction
The speed with which the processor executes an instruction and references data in its registers is generally vastly superior to the speed with which memory can be accessed
For example, an integer type instruction on a Pentium® Pro costs on the order of 1 cycle or less; less is possible, since multiple operations may be executed in one step on a superscalar processor
Getting an operand out of memory on typical Pentium Pro or newer systems costs several dozen cycles
The gap between the slowness of memory and the speed of processors is increasing over time, despite memories getting faster!
Introduction
To bridge this long recognized gap (von Neumann bottleneck), computer architects invented (at Manchester University in the 1960s; see [5]) a special purpose memory, now called the cache
Like regular memory, a cache holds bits of information, data or instructions
Unlike regular memory, a cache is very fast and more expensive per bit. If it were not so costly, we’d simply build all of memory out of cache memory and the speed gap between processor and memory would be solved; but alas!
Even to date, with some caches being several megabytes large, caches are small vs. a memory’s logical addressing space of 2^64 bytes
Introduction
While regular memory is arranged as a linear array of equal cells (bytes, words), caches usually are arranged by lines, also called blocks
Since block already has several other meanings, we shall use line
Only the address of the first byte of a line need be remembered
Individual bytes within lines are addressable by their offset. Note that only line-size-aligned portions of memory (AKA paragraphs) are moved into cache lines!
Each line represents a small linearly contiguous subsection of memory, which we’ll call paragraph
Introduction
Caches evolved into multiple levels and purposes
Often the first level cache (L1) is physically on-chip, allowing the processor to retrieve information sometimes in a single cycle
The next level cache (L2) is often a separate physical device, larger in size than the L1, and slower to access, due to “having to go off-chip”
With multi-core architectures, L2 caches also tend to move on-chip
On some multi-core chips the L2 is shared between the cores, yet on others there are individual L2 caches per core
Introduction
L3 caches are common in servers that process very large amounts of data
Caches also have become specialized. Instructions are stored separately in so-called I-Caches, while data reside in data caches (D-Cache)
In the early 2000s, the trend was to replace I-Caches with trace-caches (TC), which store already pre-decoded micro instructions
Since about 2007, trace-caches have been out of favor and I-caches have re-emerged
Definitions
Aging
A cache line’s age is tracked only in an associative cache
Aging tracks when a cache line was accessed, relative to the other lines in the same set
This implies that ages are compared
Generally, the relative ages are of interest, such as: am I older than you? Rather than the absolute age, e.g.: I was accessed at cycle such and such
Think about the minimum number of bits needed to store the relative ages of, say, 8 cache lines!
A memory access addresses only one line at a time, hence all lines in a set have distinct (relative) ages
Definitions
Alignment
Alignment is a spacing requirement, i.e. the restriction that an address adhere to a specific placement condition
For example, even-alignment means that an address is even, i.e. that it is divisible by 2
E.g. address 3 is not even-aligned, but address 1000 is; the rightmost bit of an even-aligned address is 0
In VMM, page addresses are aligned on page-boundaries. If a page-frame has size 4k, then page addresses that adhere to page-alignment are evenly divisible by 4k
As a result, the low-order (rightmost) 12 bits are 0. Knowledge of alignment can be exploited to save storing address bits in VMM, caching, etc.
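The alignment condition can be sketched as a bit test. This is a minimal illustration, assuming the boundary is a power of 2; the helper name `is_aligned` is hypothetical and not part of the lecture.

```python
def is_aligned(address: int, boundary: int) -> bool:
    """Return True if address is a multiple of boundary (assumed to be a power of 2)."""
    # For a power-of-2 boundary, the low-order log2(boundary) bits must all be 0.
    return address & (boundary - 1) == 0


PAGE_SIZE = 4 * 1024  # 4k page frame, as in the example above

print(is_aligned(3, 2))                # address 3 is not even-aligned
print(is_aligned(1000, 2))             # address 1000 is even-aligned
print(is_aligned(0x12000, PAGE_SIZE))  # low-order 12 bits are 0
print(PAGE_SIZE.bit_length() - 1)      # number of implied zero bits for 4k pages
```

Because the low-order bits of an aligned address are known to be 0, they need not be stored, which is the saving the text refers to.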
Definitions
Associativity
If a cache has multiple lines per set, we call it associative; K stands for number of lines in a set
A cache with multiple lines per set (K > 1) requires searching, i.e. comparing addresses, to determine whether a referenced object is in fact present in the cache; if it is, the access is said to hit the cache
Another way of saying this is: an object at some address in memory has more than one line where it might live in an associative cache
Related term: full associativity, the special case of a single set (N = 1), in which any line can hold any paragraph
Antonym: direct mapped; if only a single line (per set) exists, the search is reduced to a simple, single tag comparison
Definitions
Blocked Cache
If a cache cannot be accessed by the HW while some line is currently being streamed in, the cache is said to be blocked
This can be a performance limiter, if the current memory access wishes to refer to a line different from the one being streamed in
Not to be confused with cache blocks, AKA cache lines!
Definitions
Critical Chunk First
The number of bytes in a line is generally larger than the number of bytes that can be brought into the cache across the bus in 1 step, requiring multiple bus transfers to fill a line completely
It would be efficient if the byte actually needed resided in the first chunk brought across the bus
The deliberate policy that accomplishes just that is the Critical Chunk First policy
This allows the cache to be unblocked after the first transfer, even though the line is not yet completely loaded
Other parts of the line may be used later, but the critical byte can thus be accessed right away
Definitions
Direct Mapped
If each memory address has just one possible location (i.e. one single line, of K = 1) in the cache where it could possibly reside, then that cache is called direct mapped
Antonym: associative, or fully associative
Synonym: non-associative
Directory
The collection of all tags is referred to as the cache directory; opposed to actual data bits in a D-Cache or actual instruction bits in an I-Cache
Definitions
Dirty Bit
If a line in a cache with write-back policy is never modified (written), then that line doesn’t need to be copied back into memory upon retirement; it is already there
However, if at least one write (AKA store, AKA modification) into that cache line has occurred, the line must eventually be copied back into memory, lest memory remain stale
To discern whether or not to copy back, the dirty bit must be set upon write; initially this bit is clear. (See also: modified state)
Synonym: write-bit
Definitions
Effective Cycle Time teff
Let the cache hit rate h be the number of hits divided by the number of all memory accesses, with an ideal hit rate being 1; thus:
teff = tcache + (1-h) * tmem
Alternatively, the effective cycle time might be
teff = max( tcache, (1-h)*tmem )
The latter holds if a memory access is initiated in parallel with the cache access
Here tcache is the time to access a datum in the cache, while tmem is the time to access a data item in memory
The hit rate h varies from 0.0 to 1.0
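The two formulas above can be compared side by side. The following is a minimal sketch; the function names and the sample values (1 cycle for the cache, 10 cycles for memory) are assumptions chosen for illustration.

```python
def teff_serial(h: float, t_cache: float, t_mem: float) -> float:
    """Memory access starts only after the cache lookup has missed."""
    return t_cache + (1.0 - h) * t_mem


def teff_parallel(h: float, t_cache: float, t_mem: float) -> float:
    """Memory access is initiated in parallel with the cache access."""
    return max(t_cache, (1.0 - h) * t_mem)


T_CACHE, T_MEM = 1.0, 10.0  # assumed sample times, in cycles

for h in (0.0, 0.5, 1.0):
    print(h, teff_serial(h, T_CACHE, T_MEM), teff_parallel(h, T_CACHE, T_MEM))
```

With these assumed values, a hit rate of 0.5 yields 6 cycles in the serial case and 5 cycles in the parallel case; at h = 1 both collapse to the cache time.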
Definitions
Hit Rate h
The hit rate h is the number of memory accesses (read/writes, or load/stores) that hit the cache, over the total number of memory accesses
By contrast H is the total number of hits
A hit rate h = 1 means all accesses are served from cache, while h = 0 means all are served from memory, i.e. none hit the cache
Conventional notations are hr and hw for the read and write hit rates
See also miss rate
Definitions
LLC
Acronym for Last Level Cache. This is the largest cache in the memory hierarchy, the one closest to physical memory, or furthest from the processor
Typical on multi-core architectures
Typical cache sizes: 4 MB to 32 MB. See [3]
Common to have one LLC be shared between all cores of an MCP (Multi-Core Processor), but have option of separating (by fusing) and creating dedicated LLC caches, with identical total size
Definitions
LRU
Acronym for Least Recently Used. A cache replacement policy (also page replacement policy discussed under VMM) that requires aging information for the lines in a set
Each time a cache line is accessed, that line becomes by definition the youngest one touched
Other lines of the same set do age by one unit, i.e. get older by 1 event
Relative ages are sufficient for LRU tracking; no need to track exact ages!
Antonym: most recently used (MRU)
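LRU tracking with relative ages can be sketched for a single set as follows. This is an illustrative model, not a hardware description: the class name `LruSet` is hypothetical, and as a simplification only the lines younger than the accessed one are aged, which keeps the ages a permutation of 0..K-1 and preserves the relative order.

```python
class LruSet:
    """One cache set with k lines, replaced by least-recently-used order."""

    def __init__(self, k: int):
        self.tags = [None] * k      # None marks an invalid (free) line
        self.ages = list(range(k))  # distinct relative ages; k-1 is the oldest

    def _touch(self, i: int) -> None:
        # Lines younger than line i get older by one; line i becomes the youngest.
        old = self.ages[i]
        for j in range(len(self.ages)):
            if self.ages[j] < old:
                self.ages[j] += 1
        self.ages[i] = 0

    def access(self, tag) -> bool:
        """Return True on a hit; on a miss, replace the oldest line."""
        if tag in self.tags:
            self._touch(self.tags.index(tag))
            return True
        victim = self.ages.index(max(self.ages))
        self.tags[victim] = tag
        self._touch(victim)
        return False


two_way = LruSet(2)
print([two_way.access(t) for t in "ABACB"])
```

In this run the third access (A) hits, C then evicts B as the least recently used line, and the final access to B therefore misses again.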
Definitions
Line
Storage area in cache able to hold a copy of a contiguous block of memory cells, i.e. a paragraph
The portion of memory stored in that line is aligned on an address modulo the line size
For example, if a line holds 64 bytes on a byte-addressable architecture, the address of the first byte stored in such a line has 6 trailing zero bits, since it is evenly divisible by 64, i.e. it is 64-byte aligned
Such known zeros don’t need to be stored in the tag, the address bits stored in the cache; they are implied
This shortens the tag, which makes the cache cheaper to build: fewer bits!
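As a small illustration of the 64-byte example above, an address can be split into the aligned start of its paragraph and the byte offset within the line. The helper name `split_address` and the sample address are assumptions.

```python
LINE_SIZE = 64                            # bytes per line, as in the example above
OFFSET_BITS = LINE_SIZE.bit_length() - 1  # log2(64) = 6 implied zero bits


def split_address(address: int) -> tuple[int, int]:
    """Return (address of first byte of the line, offset of the byte within it)."""
    offset = address & (LINE_SIZE - 1)  # the low-order 6 bits
    base = address - offset             # 64-byte aligned, so its low 6 bits are 0
    return base, offset


base, offset = split_address(0x1234)
print(hex(base), hex(offset))  # 0x1200 0x34
print(base % LINE_SIZE)        # 0, i.e. the base is 64-byte aligned
```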
Definitions
Locality of Data
A surprising and very beneficial attribute of memory access patterns: when an address is referenced, there is a good chance that in the near future another access will happen at or near that same address
I.e. memory accesses tend to cluster, also observable in hashing functions and memory page accesses
Antonym: randomly (uniformly) distributed
Definitions
Miss Rate
Miss rate is the number of memory (read/write) accesses that miss the cache over total number of accesses, denoted m
Clearly the miss rate, like the hit rate, varies between 0.0 and 1.0
The miss rate m = 1 - h
Definitions
Paragraph
A paragraph is a contiguous portion of memory of exactly line-size bytes
The starting address of a paragraph is evenly divisible by the line-size
For example, if a cache line is 32 bytes long, memory can be thought of as logically partitioned into contiguous byte streams, AKA paragraphs
These start at 32-byte boundaries, each 32 bytes long; hence the rightmost 5 bits = log2(32) are 0
Paragraphs or correspondingly line sizes may be of any size, not just 32 bytes, but a power of 2 seems handy on a binary system
Definitions
Replacement Policy
A replacement policy is a convention that defines which line is to be retired when a new line must be loaded but none is free in the set, so that one has to be evicted
Ideally, the line that will remain unused for the longest time in the future should be replaced and its contents overwritten with new data
Generally we do not know which line will stay unreferenced for the longest time in the future
In a direct-mapped cache the replacement policy is moot, as there is just 1 line per set
Definitions
Set
A logically connected region of memory, to be mapped onto a specific area of cache (line), is a set; there are N sets in memory
Elements of a set don’t need to be physically contiguous in memory; if contiguous (blocked), the leftmost log2(N) address bits identify the set; if distributed cyclically, the rightmost log2(N) bits after the alignment (offset) bits identify the set
The number of sets is conventionally labeled N
A degenerate case is to map all memory onto the whole cache, in which case only a single set exists: N = 1; i.e. one set
Notion of set is meaningful only if there are multiple sets. A memory region belonging to one set can be physically contiguous or distributed cyclically
In the former case the distribution is called blocked, in the latter cyclic. The cache area onto which a portion of memory is mapped is also called a set
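The difference between blocked and cyclic distribution can be sketched by computing the set number of an address both ways. All parameters here (16-bit addresses, 16-byte lines, 4 sets) and both function names are assumptions made for a small example.

```python
ADDRESS_BITS = 16  # assumed small address space
LINE_SIZE = 16     # bytes per line (paragraph size)
NUM_SETS = 4       # N
PARAGRAPHS = (1 << ADDRESS_BITS) // LINE_SIZE  # paragraphs in all of memory


def set_cyclic(address: int) -> int:
    """Consecutive paragraphs go to consecutive sets, wrapping around."""
    return (address // LINE_SIZE) % NUM_SETS


def set_blocked(address: int) -> int:
    """Memory is cut into N contiguous regions; each region is one set."""
    return (address // LINE_SIZE) // (PARAGRAPHS // NUM_SETS)


for address in (0x0000, 0x0010, 0x0020, 0x0030, 0x0040, 0x4000, 0xC000):
    print(hex(address), set_cyclic(address), set_blocked(address))
```

Under these assumptions, five neighboring paragraphs starting at address 0 spread over sets 0, 1, 2, 3, 0 with cyclic mapping, but all land in set 0 with blocked mapping, where only the leftmost address bits select the set.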
Definitions
Set-Associative
A cached system in which each set has multiple lines is called set-associative
For example, 4-way set associative means that there are multiple sets (could be 4 sets, 256 sets, 1024 sets, or any other number of sets) and each of those sets has 4 lines
That’s what the 4 refers to in 4-way
Opposite: non-associative
Definitions
Stale Memory
A valid line may be overwritten in a cache with new data
The write-back policy records such an overwriting
At the moment of a cache write with write-back, cache and memory are out of synch; we say memory is stale
Poses no danger, since the dirty bit (or modified bit) reflects that memory eventually is updated
But until this happens, memory is stale
Note that if two processors’ caches share memory and one cache renders memory stale, the other processor should no longer have access to that portion of shared memory
Definitions
Stream-Out
Streaming out a line refers to the movement of one line of modified data, out of the cache and back into a memory paragraph
Stream-In
The movement of one paragraph of data from memory into a cache line. Since line length generally exceeds the bus width (i.e. exceeds the number of bytes that can be moved in a single bus transaction), a stream-in process requires multiple bus transactions in a row
Possible that the byte actually needed will arrive last in a cache line during a sequence of bus transactions; can be avoided with the critical chunk first policy
Definitions
Tag
A tag is the relevant portion of address bits. If a memory object (paragraph) is present in the cache, its address must be stored, so the cache control unit can determine whether the referenced bits are present
That portion of a memory address that must be stored in the directory is the tag. If there is only one set in the whole cache and any line can hold only a single addressable unit, then the tag would hold the complete address
If there are N sets in the cache, log2(N) = m bits of the virtual address are implied. If there are L aligned bytes per line, log2(L) = n bits can be implied in the tag. Hence, for an address of M bits, only M-m-n bits need be represented in a line’s tag
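The tag-width rule M-m-n can be checked with a few lines of arithmetic. This is a sketch under the assumption that N and L are powers of 2; the function name `tag_bits` and the sample configuration are illustrative only.

```python
def tag_bits(address_bits: int, num_sets: int, line_size: int) -> int:
    """Tag width M - m - n, for power-of-2 num_sets (N) and line_size (L)."""
    m = num_sets.bit_length() - 1   # log2(N) bits implied by the set
    n = line_size.bit_length() - 1  # log2(L) bits implied by line alignment
    return address_bits - m - n


# Assumed example: 32-bit addresses, 256 sets, 64-byte lines.
print(tag_bits(32, 256, 64))  # 32 - 8 - 6 = 18

# Degenerate case from the text: one set, one addressable unit per line.
print(tag_bits(32, 1, 1))     # the tag holds the complete address: 32
```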
Definitions
Trace Cache
Special-purpose cache that holds pre-decoded instructions, AKA micro-ops
Advantage: Repeated decoding for instructions is not needed
See [1]. Trace caches have fallen out of favor in the 2000s
Definitions
Valid Bit
Single-bit data structure per cache line, indicating whether or not the line is free; free means invalid
If a line is not valid (i.e. if valid bit is 0), it can be filled with a new paragraph upon a cache miss
Else, (valid bit 1), the line holds valid information
After a system reset, all valid bits of the whole cache are set to 0
The I bit in the MESI protocol takes on that role on an MP cache subsystem; to be discussed in higher level class
Definitions
Write Back
Write back is a cache write policy that keeps changed bits in cache after modification (after a write), until the line is evicted
Thus, whenever a line is written (AKA modified), that fact must be remembered by the dirty bit
Upon retirement, any dirty line must be written back (streamed-out) to memory
Advantage: Multiple writes to the same line put traffic onto the bus only once: Upon retirement
Definitions
Write Once
Cache write policy that starts out with write through
After the first write hit causing a write through, the policy then changes to write back
This is called write once
Applies to multi-level caches
Definitions
Write Through
Cache write policy that copies modified data from cache back to memory immediately, i.e. when the write hit occurs
Thus cache and main memory are always in synch
Disadvantage: Each cache write consumes memory bus bandwidth
Antonym: Write back
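The difference in bus traffic between write through and write back can be made concrete with a toy model. The sketch below assumes a cache with a single line that allocates on a write miss, counts only writes to memory (stream-in traffic is ignored), and flushes a dirty line at the end; the function name and the write sequence are hypothetical.

```python
def count_bus_writes(writes, policy: str) -> int:
    """Count memory writes for a sequence of written paragraph numbers."""
    cached = None   # paragraph held by the single line; None means invalid
    dirty = False   # dirty bit, used by write-back only
    bus_writes = 0
    for paragraph in writes:
        if paragraph != cached:
            if policy == "write-back" and dirty:
                bus_writes += 1      # stream out the modified victim line
            cached, dirty = paragraph, False
        if policy == "write-through":
            bus_writes += 1          # every write goes to memory immediately
        else:
            dirty = True             # remember the modification
    if policy == "write-back" and dirty:
        bus_writes += 1              # final retirement of the dirty line
    return bus_writes


sequence = [7, 7, 7, 7, 9, 9]  # four writes to paragraph 7, then two to 9
print(count_bus_writes(sequence, "write-through"))  # 6
print(count_bus_writes(sequence, "write-back"))     # 2
```

In this model write through puts one write on the bus per store, while write back puts one write on the bus per retired dirty line.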
Effective Time teff
Starting with teff = tcache + ( 1-h ) * tmem we observe:
No matter how many hits (H) we experience during repeated memory access, the effective cycle time is never less than tcache
No matter how many misses (M) we experience, the effective cycle time to access a datum is never more than tcache + tmem
It is desirable to have teff = tmem in case of a cache miss
Another way to compute the effective access time is to add all memory-access times, and divide them by the total number of accesses, and thus compute the average
Effective Time teff
Average time per access:
teff = ( hits * tcache + misses * ( tcache + tmem ) ) / total_accesses
teff = h * tcache + m * ( tcache + tmem )
or, if the memory access is initiated immediately, in parallel with the cache access:
teff = h * tcache + m * tmem
• Assume an access time of 1 (one) cycle to reference data in the cache
• Assume an access time of 10 (ten) for data in memory
• Assume that a memory access is initiated after a cache miss; then:
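Under the three assumptions above, the average time per access can be tabulated for a few hit rates. This is a minimal sketch; the choice of 100 accesses per row and the function name are assumptions for illustration.

```python
T_CACHE = 1   # cycles to reference data in the cache
T_MEM = 10    # cycles to reference data in memory


def average_access_time(hits: int, misses: int) -> float:
    """teff when the memory access is initiated only after a cache miss."""
    total_accesses = hits + misses
    return (hits * T_CACHE + misses * (T_CACHE + T_MEM)) / total_accesses


for hits in (0, 50, 90, 99, 100):
    misses = 100 - hits
    teff = average_access_time(hits, misses)
    print(f"h = {hits / 100:.2f}   teff = {teff:.2f} cycles")
```

With these numbers a hit rate of 0.90 gives 2 cycles per access and a hit rate of 0.99 gives 1.1 cycles, while a hit rate of 0 costs 11 cycles, more than memory alone.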
Effective Time teff
Symb.    Name            Explanation
H        Hits            Number of successful cache accesses
M        Misses          Number of failed cache accesses
A        All             All accesses A = H + M
T        Total time      Time for A memory accesses
tcache   Cache time      Time to successfully access memory via cache
tmem     Mem time        Time to access memory
teff     Effective tm.   Average time over all memory accesses
h        Hit rate        H / A = h = 1 – m
m        Miss rate       M / A = m = 1 – h
h + m    Total rate = 1  Total rate, either hit or miss, probability is 1
Highlights of Different Kinds of Caches, Not All Useful
Purpose of Cache
Cache is logically part of Memory Subsystem, but physically often part of processor (e.g. on the same silicon die)
Purpose: render slow memory into a fast one
At minimal cost, since the cache is just a few percent of the total physical main store
Works well, if locality is good, but only if locality is good; else performance is same as memory access, or worse, depending on architecture
With poor locality, i.e. random distribution of memory accesses, a cache can even slow down memory access, namely if:
teff = tcache + (1-h) * tmem, and not: teff = max( tcache, (1-h) * tmem )
Purpose of Cache
With good locality, cache delivers available data in close to unit cycle time
Cache must cooperate with other processors’ caches and with memory in MP system
Cache must cooperate with VMM of memory subsystem to jointly render a physically small, slow memory into a virtually large, fast memory at small cost in additional hardware (or silicon), and system SW
L1 cache access time should be within order of magnitude of machine cycle time. For example, a successful L1 data cache access costing 1 cycle is desirable
Cache Design Parameters
Number of lines in set: K
Quick test, students: how large is K in a direct-mapped cache?
Number of bytes in a line, AKA Length of line: L
Number of sets in memory, and hence in the cache: N
Policy upon memory write (cache write policy)
Policy upon access miss (cache read policy)
What to do when an empty line is needed for the next paragraph to be streamed in, but none is available (replacement policy)
Cache Design Parameters
Size in bits = K * ( 8 * L + tag bits + control bits ) * N, with the line length L counted in bytes
Ratio of cache size to physical memory generally is very small
Cache access time, typically close to 1 cycle for L1 cache
Number of processors with cache, 1 in UP, M in MP architecture
Levels of caches, L1, L2, L3 … Last one referred to as LLC, for last level cache
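The size formula on this slide can be evaluated for a sample configuration. The sketch below measures everything in bits, assumes N and L are powers of 2, assumes 2 control bits per line (valid and dirty), and ignores any bits needed for aging; the function name and the sample parameters are illustrative, not taken from a particular processor.

```python
def cache_size_bits(k: int, line_bytes: int, n: int,
                    address_bits: int, control_bits: int = 2) -> int:
    """Total storage of a cache with n sets of k lines each, in bits."""
    offset_bits = line_bytes.bit_length() - 1  # log2(L), implied by alignment
    set_bits = n.bit_length() - 1              # log2(N), implied by the set
    tag_bits = address_bits - set_bits - offset_bits
    bits_per_line = 8 * line_bytes + tag_bits + control_bits
    return k * n * bits_per_line


# Assumed example: 4-way set-associative, 64-byte lines, 256 sets, 32-bit addresses.
total_bits = cache_size_bits(k=4, line_bytes=64, n=256, address_bits=32)
print(total_bits)       # 544768 bits
print(total_bits // 8)  # 68096 bytes, of which 65536 bytes are data
```

For this configuration the tag and control bits add 2560 bytes on top of the 64 KB of data, i.e. roughly 4% overhead.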
Single-Line Degenerate Cache
Quick test, students: what is the minimum size (in number of bits) of the tag for this degenerate cache?
The single-line cache, shown here, stores multiple words
Can improve memory access if extremely good locality exists within narrow address range
Upon miss cache initiates a stream-in operation
Is direct mapped cache: all memory locations know a priori where they’ll reside in cache; there is but one line
Is single-set cache, since all memory locations are mapped onto collection of lines, and there is just one line
Single-Line Degenerate Cache
As data cache: exploits only locality of near-by addresses in the same paragraph
As instruction cache: exploits locality of tight loops that completely fit inside the address range of a single line
However, there will be a cache-miss as soon as an address makes reference outside of line’s range
For example, tight loop with a function call will cause cache miss
Stream-in time is time to load a line worth of data from memory
Total overhead: tag (address bits) + valid bit + dirty bit (if write-back)
Not advisable to build this cache subsystem
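The behavior described on this slide can be reproduced with a tiny simulation of a single-line cache. The line size of 16 bytes and the two address traces are assumptions made for illustration; the function name is hypothetical.

```python
LINE_SIZE = 16  # assumed bytes per line


def count_misses(addresses) -> int:
    """Count misses of a cache that holds exactly one line."""
    cached_base = None  # None plays the role of a cleared valid bit
    misses = 0
    for address in addresses:
        base = address - address % LINE_SIZE  # start of the paragraph
        if base != cached_base:
            misses += 1         # stream-in replaces the only line
            cached_base = base
    return misses


# A tight loop whose three references all lie in one paragraph, run 10 times.
tight_loop = [0x100, 0x104, 0x108] * 10
# The same loop with a call to a function located in a different paragraph.
loop_with_call = [0x100, 0x104, 0x400, 0x108] * 10

print(count_misses(tight_loop))      # 1
print(count_misses(loop_with_call))  # 21
```

The tight loop misses only once, on the first reference, whereas the call evicts the loop's line in every iteration and the return evicts the callee's line again.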
![Page 48: 1 CS 201 Computer Systems Programming Chapter 10 Data Cache Architecture Herbert G. Mayer, PSU Status 6/28/2015](https://reader035.vdocuments.us/reader035/viewer/2022062807/5697c02d1a28abf838cd9bd9/html5/thumbnails/48.jpg)
48
Multi-Line, Single-Set Cache
Next cache has one set, multiple lines; here 2 lines as shown
Quick test students: minimum size of the tag on 32-bit architecture with 2 lines, 1 set?
Each line holds multiple, contiguous addressing units, 4 words AKA 16 bytes shown
Thus 2 disparate areas of memory can be cached at the same time
Is associative cache: all lines in the single set must be searched to determine whether a memory element is present in cache
Is single-set associative cache, since all of memory is mapped onto the same cache lines
Some tight loops with a function call can be completely cached in an I-cache, assuming the loop body fits into one line and the callee fits into the other line
Also would allow one larger loop to be cached, whose total body does not fit into a single line, but would fit into two (or more if available) lines
With multiple lines in a set locality constraints are less stringent
Applies to more realistic programs
But if number of lines K >> 1, the time to search all tags (in set) can grow beyond unit cycle time
Sometimes there is a trade-off between cycle time and cache access time, so that multiple cycles are needed even in case of a cache hit
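A minimal sketch of the associative search in a single-set cache follows. The class name `SingleSetCache`, the 16-byte line, and the round-robin choice of the line to overwrite are assumptions for illustration; the slides do not fix a replacement rule at this point.

```python
class SingleSetCache:
    """One set with several lines; every lookup searches all tags."""

    def __init__(self, num_lines=2, line_bytes=16):
        self.line_bytes = line_bytes
        self.tags = [None] * num_lines    # None marks an invalid line
        self.next_victim = 0              # assumed round-robin replacement

    def access(self, addr):
        tag = addr // self.line_bytes
        # Associative search: any line of the set may hold the paragraph.
        if tag in self.tags:
            return True
        self.tags[self.next_victim] = tag
        self.next_victim = (self.next_victim + 1) % len(self.tags)
        return False


cache = SingleSetCache(num_lines=2, line_bytes=16)
loop_body, callee = 0x100, 0x900          # hypothetical code addresses
trace = [loop_body, callee, loop_body + 4, callee + 4, loop_body + 8]
results = [cache.access(a) for a in trace]
print(results)
# [False, False, True, True, True]
```

After the two cold misses, loop body and callee stay resident together. The `tag in self.tags` search is what grows with K in hardware.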
Single-Line, Multi-Set Cache
Next cache architecture has multiple sets, 2 shown, 2 distinct areas of memory, each being mapped onto separate cache lines: N = 2, K = 1
Quick test students: minimum size of the tag on 32-bit arch.?
Each set has a single line, in this case 4 memory units (e.g. words, AKA 16 bytes) long; AKA paragraph
Thus 2 disparate areas of memory can be cached at the same time
But these areas must reside in separate memory sets, each contiguous, each having only 1 option
Is direct mapped; all memory locations know a priori where they’ll reside in cache
Is multi-set cache, since different blocks of memory are mapped onto different sets. Different parts of memory have their own portion of cache
Allows one larger loop to be cached, whose total body does not fit into a single line of an I-cache, but would fit into two lines
But only if by some great coincidence both parts of that loop reside in different memory sets
If used as instruction cache, all programs consuming half of memory or less never use the second line in the second set. Hence that is again a bad idea!
If used as data cache, all data areas that fit into first block will never utilize second set of cache
Problem specific to blocked mapping; try cyclic instead
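The weakness of blocked mapping can be made concrete with a small sketch. The function name `blocked_set`, the 32-bit address space, and the sample address range are assumptions for the example.

```python
ADDRESS_BITS = 32     # assumed 32-bit, byte-addressable memory
NUM_SETS = 2          # N = 2, as in the slide


def blocked_set(addr):
    """Blocked mapping: memory is cut into N contiguous blocks, one per set."""
    block_size = (1 << ADDRESS_BITS) // NUM_SETS
    return addr // block_size


# A small program (64 bytes here) living low in memory ...
small_program = range(0x0000_1000, 0x0000_1040, 4)
sets_used = {blocked_set(a) for a in small_program}
print(sets_used)                   # {0}
print(blocked_set(0x8000_0000))    # 1
```

Every address of the small program lands in set 0, so the line of set 1 is never used.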
Multi-Set, Single-Line, Cyclic
This cache architecture below also has 2 sets, N = 2
Each set has a single line, each holding 4 contiguous memory units, 4 words, 16 bytes, K = 1
Thus 2 disparate areas of memory can be cached at the same time
Quick test: tag size on 32-bit, 4-byte architecture?
Disparate areas (of line size, equal to paragraph size) are scattered cyclically throughout memory
Cyclically distributed memory areas associated with each respective set
Is direct mapped; all memory locations know a priori where they’ll reside in cache, as each set has a single line
Is multi-set cache: different locations of memory are mapped onto different cache lines, the sets
Also allows one larger loop to be cached, whose total body does not fit into a single line, but would fit into two lines
Even if parts of loop belong to different sets
If used as instruction cache, small code section can use the total cache
If used as data cache, small data areas can utilize complete cache
Cyclic mapping of memory areas to sets is generally superior to blocked mapping
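A short sketch of cyclic (round-robin) mapping, under the slide's parameters N = 2 sets and 16-byte lines. The function name `cyclic_set` and the loop addresses are assumptions for illustration.

```python
LINE_BYTES = 16   # one paragraph: 4 words of 4 bytes
NUM_SETS = 2      # N = 2


def cyclic_set(addr):
    """Cyclic mapping: consecutive paragraphs alternate between the sets."""
    return (addr // LINE_BYTES) % NUM_SETS


# A loop of 32 bytes spans two consecutive paragraphs.
loop = range(0x1000, 0x1020, 4)
sets_used = sorted({cyclic_set(a) for a in loop})
print(sets_used)
# [0, 1]
```

Even this small loop uses both sets, so the complete cache is available to it.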
Multi-Line, Multi-Set, Cyclic
Quick test: minimum size (in bits) of the tag?
Here is a more realistic cache architecture
Two sets, memory will be mapped cyclically, AKA in a round-robin fashion
Each set has two lines, each line holds 4 addressable words (a paragraph)
Associative cache: once set is known, search all tags for the memory address in all lines of that set
In example, line 2 of set 2 is unused
By now you know: sets, lines, associative, non-associative, direct mapped, etc.!!
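The two-step lookup (select the set, then search the tags of its lines) might be sketched as follows for N = 2 sets, K = 2 lines, and 16-byte lines with cyclic mapping. The helper names `split_address` and `lookup` and the sample address are hypothetical, and powers of two are assumed throughout.

```python
def split_address(addr, num_sets=2, line_bytes=16):
    """Split a byte address into (tag, set index, byte offset).

    Assumes num_sets and line_bytes are powers of two.
    """
    offset_bits = (line_bytes - 1).bit_length()
    index_bits = (num_sets - 1).bit_length()
    offset = addr & (line_bytes - 1)
    set_index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, set_index, offset


def lookup(cache_sets, addr):
    """cache_sets[s] lists the tags held by the lines of set s (None = empty)."""
    tag, set_index, _ = split_address(addr, num_sets=len(cache_sets))
    # Only the lines of the selected set are searched.
    return tag in cache_sets[set_index]


tag, set_index, offset = split_address(0x12345678)
cache_sets = [[None, None], [tag, None]]    # 2 sets, 2 lines per set
hit_a = lookup(cache_sets, 0x12345678)
hit_b = lookup(cache_sets, 0x12345668)      # previous paragraph, other set
print(hex(tag), set_index, offset, hit_a, hit_b)
# 0x91a2b3 1 8 True False
```

The second address has the same tag value but maps to the other set, so it misses: a tag is only meaningful within its set.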
Replacement Policy
The replacement policy is the rule that determines:
When all lines are valid, and a new line must be streamed in
Which of the valid lines is to be removed?
Removal can be low cost, if the modified bit (AKA “dirty” bit) is 0
Or removal may be costly, if “dirty” bit is set
| # | Name | Summary |
|---|------|---------|
| 1 | LRU | Replaces Least Recently Used cache line; requires keeping track of relative “ages” of lines. Retire line that has remained unused for the longest time of all candidate lines. Speculate that that line will remain unused for the longest time in the future. |
| 2 | LFU | Replaces Least Frequently Used cache line; requires keeping track of the number m of times this line was used over the last n>=m uses. Depending on how long we track the usage, this may require many bits. |
| 3 | FIFO | First In First Out: The first of the lines in the set that was streamed in is the first to be retired, when it comes time to find a candidate. Has the advantage that no further update is needed, while all lines are in use. |
| 4 | Random | Pick a random line from candidate set for retirement; is not as bad as this irrational algorithm might suggest. Reason: The other methods are not too good either |
| 5 | Optimal | If a cache were omniscient, it could predict which line will remain unused for the longest time in the future. Of course, that is not computable. However, for creating the perfect reference point, we can do this with past memory access patterns, and use the optimal access pattern for comparison, how well our chosen policy rates vs. the optimal strategy! |
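To illustrate how a recorded access pattern can serve as the reference point, here is a sketch that counts misses for one set under LRU, FIFO, and the optimal policy. The function name `count_misses` and the trace of paragraph numbers are made up for the example; LFU and Random are left out to keep it short.

```python
def count_misses(trace, num_lines, policy):
    """Count misses of one set with num_lines lines.

    policy is 'LRU', 'FIFO' or 'OPT'. OPT looks into the future of the
    recorded trace, which only works after the fact.
    """
    lines = []     # resident paragraphs; list order matters for LRU and FIFO
    misses = 0
    for i, block in enumerate(trace):
        if block in lines:
            if policy == "LRU":
                lines.remove(block)
                lines.append(block)    # most recently used moves to the back
            continue
        misses += 1
        if len(lines) == num_lines:
            if policy == "OPT":
                future = trace[i + 1:]

                def next_use(b):
                    # Distance to the next use; never used again = farthest.
                    return future.index(b) if b in future else len(future)

                victim = max(lines, key=next_use)
            else:
                victim = lines[0]      # LRU: least recent; FIFO: oldest arrival
            lines.remove(victim)
        lines.append(block)
    return misses


trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]   # hypothetical paragraph numbers
result = {p: count_misses(trace, 3, p) for p in ("LRU", "FIFO", "OPT")}
print(result)
# {'LRU': 10, 'FIFO': 9, 'OPT': 7}
```

On this particular trace FIFO happens to beat LRU, and both stay above the optimal count, which shows why no single realizable policy is best for all access patterns.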
LRU Sample
Assume the following cache architecture:
• N = 16 sets
• K = 4 lines per set
• 32-bit architecture
• write back (dirty bit)
• valid line indicator (valid bit)
• L = 64 bytes per line
• This results in a tag size of 22 bits
• 2 LRU bits (4 lines per set), to store relative ages of the 4 lines in each set
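The tag and LRU widths quoted above follow from the other parameters. A small sketch of the arithmetic, with hypothetical helper names and assuming power-of-two sizes and byte addressing:

```python
def bits_for(count):
    """Number of bits needed to distinguish `count` items (count >= 1)."""
    return (count - 1).bit_length()


def tag_bits(address_bits, num_sets, line_bytes):
    # The tag is what remains of the address after removing the set
    # index bits and the byte-offset bits within the line.
    return address_bits - bits_for(num_sets) - bits_for(line_bytes)


tag = tag_bits(address_bits=32, num_sets=16, line_bytes=64)
lru = bits_for(4)          # relative ages 0..3 for K = 4 lines
print(tag, lru)
# 22 2
```

Here 16 sets take 4 index bits and 64-byte lines take 6 offset bits, leaving 32 - 4 - 6 = 22 tag bits.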
Let lines be numbered 0..3
And accessed in the order x=0 miss, x=1 miss, 0 hit, x=2 miss, 0 hit, x=3 miss, 0 hit, and another miss
Assume initially a cold cache, all lines in the cache are free
Problem: Once all lines are filled (Valid bit is 1 for all 4 lines) some line must be retired (i.e. kicked out) to make room for the new paragraph x that caused a miss, but which?
The answer is based on the LRU line (Least Recently Used line), which is line 1
The access order, assuming all memory accesses are just reads (loads), no writes (no stores), i.e. dirty bit is always clear:
1. Read miss, all lines invalid, stream paragraph in line 0
2. Read miss (implies a new address), stream another paragraph in line 1
3. Read hit on line 0
4. Read miss to a new address, store paragraph in line 2
5. Read hit, access line 0
6. Read miss, store paragraph in line 3
7. Read hit, access line 0
8. Now another read miss, all lines valid, find line to retire
Note that after these accesses cache line 0 has LRU age 00 (binary) and is the youngest, while cache line 1 has age 11 (binary) and is the oldest (AKA the least recently used) of the 4 lines
Whenever an empty line is filled, its relative age is set to 00, making it the youngest line. All other lines must be checked, and some may be updated. This automatically prevents any of the 4 lines from ever growing as “old” as 4 or “older”. Detail:
1. In a partly cold cache, if we experience a miss and there is an empty line, the paragraph is streamed into the empty line, its relative age is set to 0, and the ages of all other valid lines are incremented by 1
2. In a warm cache (all lines are used), when a line of age X experiences a hit, its new age becomes 0. The ages of all other lines that are younger than X, and only those, are incremented by 1
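The age-update rules can be tried out with a short simulation of one 4-line set. This is a sketch, not the lecture's code: the class name `LruSet` and the paragraph names "A" to "E" are made up. It also assumes that a miss in a warm cache retires the oldest line and then treats it like a freshly filled line, and that the hit rule applies in a partly cold cache as well.

```python
class LruSet:
    """One cache set with relative LRU ages (0 = youngest)."""

    def __init__(self, num_lines=4):
        self.tags = [None] * num_lines    # None marks an empty line
        self.ages = [None] * num_lines

    def access(self, tag):
        """Return (hit, line) for an access to paragraph `tag`."""
        if tag in self.tags:
            line = self.tags.index(tag)
            old_age = self.ages[line]
            # Hit: only lines younger than the hit line age by one.
            for i, age in enumerate(self.ages):
                if age is not None and age < old_age:
                    self.ages[i] = age + 1
            self.ages[line] = 0
            return True, line
        if None in self.tags:
            line = self.tags.index(None)              # fill first empty line
        else:
            line = self.ages.index(max(self.ages))    # retire the oldest line
        # Miss: all other valid lines age by one; the new line is youngest.
        for i, age in enumerate(self.ages):
            if age is not None and i != line:
                self.ages[i] = age + 1
        self.tags[line] = tag
        self.ages[line] = 0
        return False, line


s = LruSet(4)
for t in ["A", "B", "A", "C", "A", "D", "A"]:   # miss, miss, hit, miss, hit, miss, hit
    s.access(t)
ages_before = list(s.ages)
hit, victim = s.access("E")                      # the final miss
print(ages_before, hit, victim, s.ages)
# [0, 3, 2, 1] False 1 [1, 0, 3, 2]
```

Before the final miss line 0 has age 0 and line 1 has age 3, so line 1 is retired, in agreement with the sample above.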
Compute Cache Size
Typical Cache Design Parameters:
1. Number of lines in set: K
2. Number of bytes in a line, Length of line: L
3. Number of sets in memory, and hence in cache: N
4. Policy upon memory write (cache write policy)
5. Policy upon read miss (cache read policy)
6. Replacement policy (e.g. LRU, random, FIFO, etc.)
7. Size (bits) = K *( 8 * L + tag + control bits ) * N
Compute minimum number of bits of an 8-way, set-associative cache with 64 sets, using cyclic allocation of memory sets, cache line length of 32 bytes, using LRU replacement. Use write-back. Memory is byte addressable, 32-bit addresses:
Tag = 32-5-6 = 21 bits
LRU 8-ways = 3 bits
Dirty bit = 1 bit
Valid bit = 1 bit
Overhead per line = 21+3+1+1 = 26 bits
# of lines = K * N = 64*8 = 2^9 = 512 lines
Data bits per cache line = 32*8 = 2^8 = 256 bits
Total cache size = 2^9*(26+2^8) = 144,384 bits
Byte size = 144,384 / 8 = 18,048 bytes, about 17.6 kB (144,384 bits is about 141 kbit)
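The size formula Size (bits) = K * (8 * L + tag + control bits) * N can be checked mechanically. The function below is a sketch with hypothetical names. It assumes power-of-two parameters, byte addressing, and one LRU age field, one valid bit, and (for write-back) one dirty bit per line.

```python
def cache_size_bits(address_bits, num_sets, lines_per_set, line_bytes,
                    write_back=True):
    """Total storage of a set-associative cache in bits."""
    offset_bits = (line_bytes - 1).bit_length()
    index_bits = (num_sets - 1).bit_length()
    tag = address_bits - index_bits - offset_bits
    lru = (lines_per_set - 1).bit_length()
    control = lru + 1 + (1 if write_back else 0)    # LRU age + valid + dirty
    per_line = 8 * line_bytes + tag + control
    return lines_per_set * per_line * num_sets


bits = cache_size_bits(address_bits=32, num_sets=64, lines_per_set=8,
                       line_bytes=32)
print(bits, bits // 8)
# 144384 18048
```

The second number is the size in bytes, obtained by dividing the bit count by 8.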
Trace Cache
Trace Cache is a special-purpose cache that does not hold (raw) instructions, but instead stores pre-decoded operations (micro-ops)
The old AMD K5 uses a Trace Cache (TC); see [1]
Intel’s Pentium® 4 uses a 12 k micro-op TC
Advantage: faster access to ready-to-execute micro-ops, since a cached instruction need not be decoded again
Disadvantage: less dense cache storage exploitation, i.e. wasted cache bits compared to a regular I-cache
Note that cache bits are more costly than memory bits!
Trace caches are falling out of favor in the 2010s
Characteristic Cache Curve
In graph below we use relative number of cache misses [RM], so that the miss axis stays bounded between 0 and 1
RM = 0 is ideal case: No misses at all
RM = 1 is worst case: All memory accesses are cache misses
If a program exhibits good locality, relative cache size of 1 results in good performance; we use this as the reference point:
Very coarsely, in some ranges, doubling the cache’s size results in 30% fewer cache misses
In others, doubling the cache results in only a few % fewer misses: beyond the sweet spot!
Summary
A cache is a special HW storage device that allows fast access
It is costly, hence the size of a cache relative to the size of memory is small; cache holds a subset
Frequently used data (or instruction in an I-cache) are copied in a cache, with the hope that the data present in the cache are accessed relatively frequently
Miraculously, that is generally true, so caches in general do speed up execution despite slow memories
Caches are organized into sets, with each set having 1 or more lines
Defined portions of memory get mapped into any one of these sets
Bibliography
1. http://forums.amd.com/forum/messageview.cfm?catid=11&threadid=29382&enterthread=y
2. Lam, M., E. E. Rothberg, and M. E. Wolf [1991]. "The Cache Performance and Optimizations of Blocked Algorithms," ACM 0-89791-380-9/91, p. 63-74.
3. http://www.ece.umd.edu/~blj/papers/hpca2006.pdf
4. On MESI: http://en.wikipedia.org/wiki/Cache_coherence
5. Kilburn, T., et al.: “One-level storage systems,” IRE Transactions, EC-11, 2, 1962, p. 223-235.