
Page 1: Modern CPUs and Caches - A Starting Point for Programmers

Yaser Zhian, Fanafzar Game Studio
IGDI, Workshop 07, January 2nd, 2013

Page 2: Agenda

- Some notes about the subject
- CPUs and their gimmicks
- Caches and their importance
- How the CPU and OS handle memory logically

Page 3: A Word of Caution

- These are very complex subjects: expect few details and much simplification, generalization and omission
- No time: even a full course would be hilariously insufficient
- Not an expert: sorry, can't help much!
- Just a pile of loosely related stuff

Page 4: A Real Mess

- Pressure for performance
- Backwards compatibility
- Cost, power, etc.
- The ridiculous "numbers game"
- Law of diminishing returns
- Latency vs. throughput

Page 5: Latency vs. Throughput

- You can always solve your bandwidth (throughput) problems with money, but it is rarely so for lag (latency)
- Relative rates of improvement, latency vs. throughput (from David Patterson's keynote, HPEC 2004):
  - CPU, 80286 to Pentium 4: 21x vs. 2250x
  - Ethernet, 10 Mb to 10 Gb: 16x vs. 1000x
  - Disk, 3600 to 15000 rpm: 8x vs. 143x
  - DRAM, plain to DDR: 4x vs. 120x

Page 6: Not the von Neumann Model

- At the simplest level, the von Neumann model stipulates:
  - The program is data, and is stored in memory along with the data (departing from Turing's model)
  - The program is executed sequentially
- Not the way computers function anymore...
  - The abstraction is still used for thinking about programs
  - But it's leaky as heck!
- "Not Your Father's von Neumann Machine!"

Page 7: "Hit the Wall"

- Speed of light: can't send and receive signals to and from all parts of the die within a cycle anymore
- Power: more transistors lead to more power, which leads to much more heat
- Memory: the CPU isn't even close to being the bottleneck anymore; "all your base are belong to" memory
- Complexity: adding more transistors for more sophisticated operation won't give much of a speedup (e.g. doubling the transistors might give 2%)

Page 8: x86

- Family introduced with the 8086 in 1978
- Today, new members are still fully binary backward-compatible with that puny machine (5 MHz clock, 20-bit addressing, 16-bit registers)
- It had very few registers
- It had segmented memory addressing (joy!)
- It had many complex instructions and several addressing modes

Page 9: Newer x86 (1/2)

- 1982 (80286): protected mode, MMU
- 1985 (80386): 32-bit ISA, paging
- 1989 (80486): pipelining, cache, integrated FPU
- 1993 (Pentium): superscalar, 64-bit bus, MMX
- 1995 (Pentium Pro): µ-ops, out-of-order execution, register renaming, speculative execution
- 1997 (K6-2, PIII): 3DNow!/SSE
- 2003 (Opteron): 64-bit ISA
- 2006 (Core 2): multi-core

Page 10: Newer x86 (2/2)

- Registers got expanded from (all 16-bit, not really general purpose):
  - AX, BX, CX, DX
  - SI, DI, BP, SP
  - CS, DS, ES, SS, Flags, IP
- To:
  - 16 x 64-bit GPRs (RAX, RBX, RCX, RDX, RBP, RSP, RSI, RDI, R8-R15), plus RIP, Flags and others
  - 16 x 128-bit XMM registers (XMM0-...), or 16 x 256-bit YMM registers (YMM0-...)
- More than a thousand logically different instructions (the usual, plus string processing, cryptography, CRC, complex numbers, etc.)

Page 11: The Central Processing Unit

- The fetch-decode-execute-retire cycle
- Strategies for more performance:
  - More complex instructions, doing more in hardware (CISCing things up)
  - Faster CPU clock rates (the free lunch)
  - Instruction-level parallelism (SIMD + gimmicks)
  - Adding cores (the free lunch is over!)
- And then, there are the gimmicks...

Page 12: "Performance Enhancement"

- Pipelining
- µ-ops
- Superscalar pipelines
- Out-of-order execution
- Speculative execution
- Register renaming
- Branch prediction
- Prefetching
- Store buffer
- Trace cache
- ...

Page 13: Pipelining (1/4)

- Classic sequential execution:
  - The lengths of instruction executions vary a lot (5-10x is usual; several orders of magnitude also happen)

[Diagram: Instruction 1, Instruction 2, Instruction 3 and Instruction 4 executed one after another, each as a single block]

Page 14: Pipelining (2/4)

- It's really more like this for the CPU:
  - Instructions may have many sub-parts, and they engage different parts of the CPU

[Diagram: the same four instructions, each split into four sub-stages (F, D, R, E), still executed one after another]

Page 15: Pipelining (3/4)

- So why not do this:
  - This is called "pipelining"
  - It increases throughput (significantly)
  - It doesn't decrease the latency of a single instruction

[Diagram, reconstructed: the four instructions overlapped in the pipeline, one starting per cycle]

    Cycle:  1   2   3   4   5   6   7
    i1:     F1  D1  R1  E1
    i2:         F2  D2  R2  E2
    i3:             F3  D3  R3  E3
    i4:                 F4  D4  R4  E4
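
As a rough, illustrative model (the numbers here are assumed, not from the slide): with s single-cycle stages and n independent instructions,

    T_{sequential} = n \cdot s, \qquad T_{pipelined} \approx s + (n - 1)

so for the four 4-stage instructions above, 16 cycles versus 7. Each instruction still takes 4 cycles from fetch to execute (latency unchanged), but once the pipeline is full one instruction completes per cycle (throughput improved).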

Page 16: Pipelining (4/4)

- But it has its own share of problems
  - Hazards, stalls, flushing, etc.
  - Execution of i2 depends on the result of i1
  - At i2 we jump, and i3, i4, ... are flushed out

    i1: add EAX,120
    i2: jmp [EAX]
    i3: mov [4*EBX+42],EDX
    i4: add ECX,[EAX]

[Diagram: the pipeline diagram again, with i3 and i4 flushed after the jump at i2]

Page 17: Micro Operations (µ-ops) (1/2)

- Instructions are broken up into simple, orthogonal µ-ops
- mov EAX,EDX might generate only one µ-op
- mov EAX,[EDX] might generate two:
  1. µld tmp0,[EDX]
  2. µmov EAX,tmp0
- add [EAX],EDX probably generates three:
  1. µld tmp0,[EAX]
  2. µadd tmp0,EDX
  3. µst [EAX],tmp0

Page 18: Micro Operations (µ-ops) (2/2)

- The CPU, then, gets two layers:
  - The one that breaks operations up into µ-ops
  - The one that executes µ-ops
- The part that executes µ-ops can be simpler (more RISCy) and therefore faster
- More complex instructions can be supported without (much) complicating the CPU
- Pipelining (and the other gimmicks) can happen at the µ-op level

Page 19: Superscalar Execution

- CPUs that issue (or retire) more than one instruction per cycle are called superscalar
- Can be thought of as a pipeline with more than one line
- Simplest form: an integer pipe plus a floating-point pipe
- These days, CPUs do 4 or more
- Obviously requires more of each type of operational unit in the CPU

Page 20: Out-of-Order Execution (1/2)

- To prevent the pipeline from stalling as much as possible, issue the next instructions even if you can't start the current one
- But of course, only if there are no hazards (dependencies) and there are operational units available

    add RAX,RAX
    add RAX,RBX
    add RCX,RDX

- The last add can be, and is, started before the previous instruction

Page 21: Out-of-Order Execution (2/2)

- This obviously also applies at the µ-op level:

    mov  RAX,[mem0]
    imul RAX,42
    add  RAX,[mem1]
    push RAX
    call Func

- Fetching mem1 is started long before the result of the multiply becomes available
- Pushing RAX is really sub RSP,8 and then mov [RSP],RAX. Since the call instruction needs RSP too, it only has to wait for the subtraction, not for the store, before it can start

Page 22: Register Renaming (1/3)

- Consider this:

    mov  RAX,[mem0]
    imul RAX,42
    mov  [mem1],RAX

    mov  RAX,[mem2]
    add  RAX,7
    mov  [mem3],RAX

- Logically, the two parts are totally separate. However, the reuse of RAX will stall the pipeline

Page 23: Register Renaming (2/3)

- Modern CPUs have a lot of temporary, unnamed registers at their disposal
- They will detect the logical independence, and will use one of those in the second block instead of RAX
- And they will keep track of which register is which, where
- In effect, they are renaming another register to RAX. There might not even be a real RAX!

Page 24: Register Renaming (3/3)

- This is, for once, simpler than it might seem!
- Every time a register is assigned to, a new temporary register is used in its stead
- Consider this:

    mov  RAX,[cached]
    mov  RBX,[uncached]
    add  RBX,RAX
    imul RAX,42        ; <- a rename happens here
    mov  [mem0],RAX
    mov  [mem1],RBX

- Renaming on the multiply means that it won't clobber RAX (which we need for the add, which is waiting on the load of [uncached]), so we can do the multiply and reach the first store much sooner

Page 25: Branch Prediction (1/9)

- The CPU always depends on knowing where the next instruction is, so it can go ahead and work on it
- That's why branches in code are anathema to modern, deep pipelines and all the gimmicks they pull
- If only the CPU could somehow guess where the target of each branch is going to be...
- That's where branch prediction comes in

Page 26: Branch Prediction (2/9)

- So the CPU guesses the target of a jump (if it doesn't know it for sure), and continues to speculatively execute instructions from there
- For a conditional jump, the CPU must also predict whether the branch is taken or not
- If the CPU is right, the pipeline flows smoothly. If not, the pipeline must be flushed, and much time and many resources are wasted on the misprediction

Page 27: Branch Prediction (3/9)

- In this code, both the target and whether the jump happens or not must be predicted:

    cmp RAX,0
    jne [RBX]

- The above can effectively jump anywhere! But usually branches are closer to this:

    cmp RAX,0
    jne somewhere_specific

- Which can only have two possible targets

Page 28: Branch Prediction (4/9)

- In a simple form, when a branch is executed, its target is stored in a table called the BTB (Branch Target Buffer). When that branch is encountered again, the target address is predicted to be the value read from the BTB
- As you might guess, this doesn't work in many situations (e.g. an alternating branch)
- Also, the size of the BTB is limited, so the CPU will forget the last target of some jumps

Page 29: Branch Prediction (5/9)

- A simple expansion on the previous idea is to use a saturating counter along with each entry of the BTB
- For example, with a 2-bit counter:
  - The branch is predicted not taken if the counter is 0 or 1, and taken if it is 2 or 3
  - Each time the branch is taken the counter is incremented, and each time it is not taken it is decremented (saturating at 3 and 0)

[Diagram: the four states (Strongly Not Taken, Weakly Not Taken, Weakly Taken, Strongly Taken), where a taken outcome moves the state toward Strongly Taken and a not-taken outcome moves it toward Strongly Not Taken]
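
A minimal sketch in C++ (my illustration, not the talk's own code) of the 2-bit scheme just described; the initial state is an assumption:

    #include <cstdint>

    // One 2-bit saturating counter per BTB entry:
    //   0 = strongly not taken, 1 = weakly not taken,
    //   2 = weakly taken,       3 = strongly taken.
    struct TwoBitCounter {
        std::uint8_t state = 1;                   // start "weakly not taken" (arbitrary)

        bool predict_taken() const { return state >= 2; }

        void update(bool taken) {
            if (taken) { if (state < 3) ++state; }   // saturate at 3
            else       { if (state > 0) --state; }   // saturate at 0
        }
    };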

Page 30: Branch Prediction (6/9)

- But this behaves very badly in common situations. For an alternating branch:
  - If the counter starts at 00 or 11, it will mispredict 50% of the time
  - If the counter starts at 01 and the branch is taken the first time, it will mispredict 100% of the time!
- As an improvement, we can store the history of the last N occurrences of the branch in the BTB, and use 2^N counters, one for each possible history pattern

Page 31: Branch Prediction (7/9)

- For N=4 and 2-bit counters, we'll have:

[Diagram: a 4-bit branch history (e.g. 0010) selects one of 16 counters, which yields the prediction (0 or 1)]

- This is an extremely cool method of doing branch prediction!
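
Continuing the sketch (again my illustration, with assumed initial values): the history-based scheme keeps the last N outcomes of the branch in a shift register and uses that pattern to pick one of 2^N saturating counters.

    #include <array>
    #include <cstdint>

    // Two-level scheme for a single branch: the last N outcomes (the history)
    // select one of 2^N counters like the one in the previous sketch.
    template <int N = 4>
    struct HistoryPredictor {
        std::uint32_t history = 0;                        // last N outcomes, bit 0 = most recent
        std::array<std::uint8_t, (1u << N)> counters{};   // 2-bit counters, all start at 0

        bool predict_taken() const { return counters[history] >= 2; }

        void update(bool taken) {
            std::uint8_t& c = counters[history];
            if (taken) { if (c < 3) ++c; }
            else       { if (c > 0) --c; }
            history = ((history << 1) | (taken ? 1u : 0u)) & ((1u << N) - 1u);
        }
    };
    // An alternating branch (T, NT, T, NT, ...) quickly trains the counters for the
    // patterns 0101 and 1010, after which it is predicted correctly every time.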

Page 32: Branch Prediction (8/9)

- Some predictions are simpler:
  - For each ret instruction, the target is somewhere on the stack (pushed earlier). Modern CPUs keep track of return addresses in an internal return stack buffer: each time a call is executed, an entry is added, and it is later used to predict the return address
  - On a cold encounter (a.k.a. static prediction), a branch is sometimes predicted to:
    - fall through if it goes forward
    - be taken if it goes backward

Page 33: Branch Prediction (9/9)

- The best general advice is to arrange your code so that the most common path for branches is "not taken" (i.e. falls through); see the sketch below. This improves the effectiveness of code prefetching and the trace cache
- Branch prediction, register renaming and speculative execution work extremely well together
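
For example (a sketch; __builtin_expect is a GCC/Clang hint, and C++20 code can use [[likely]]/[[unlikely]] instead), the rare case can be kept off the common, fall-through path:

    // Sketch: keep the common case on the fall-through path so the branch is
    // usually not taken.
    int sum_valid(const int* data, int n) {
        int sum = 0;
        for (int i = 0; i < n; ++i) {
            if (__builtin_expect(data[i] < 0, 0)) {   // rare: bad input
                return -1;                            // the taken path is the rare one
            }
            sum += data[i];                           // common case falls through
        }
        return sum;
    }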

Page 34: An Example (1/15)

    0: mov RAX,[RBX+16]
    1: add RBX,16
    2: cmp RAX,0
    3: je  IsNull
    4: mov [RBX-16],RCX
    5: mov RCX,[RDX+0]
    6: mov RAX,[RAX+8]

Page 35: An Example (2/15)

Clock 0 – Instruction 0

- Load RAX from memory
- Assume a cache miss: 300 cycles to load
- The instruction starts, and dispatch continues...

Page 36: An Example (3/15)

Clock 0 – Instruction 1

- This instruction writes RBX, which conflicts with the read in instruction 0
- Rename this instance of RBX and continue...

Page 37: An Example (4/15)

Clock 0 – Instruction 2

- The value of RAX is not available yet; cannot calculate the value of the Flags register
- Queue up behind instruction 0...

Page 38: An Example (5/15)

Clock 0 – Instruction 3

- The Flags register is still not available
- Predict that this branch is not taken
- Assuming 4-wide dispatch, the instruction issue limit for this cycle is reached

Page 39: An Example (6/15)

Clock 1 – Instruction 4

- The store is speculative, so its result is kept in the Store Buffer. Also, RBX might not be available yet (from instruction 1)
- The Load/Store Unit is tied up from now on; no more memory ops can be issued this cycle

Page 40: An Example (7/15)

Clock 2 – Instruction 5

- Had to wait for the L/S Unit
- Assume this is another (and unrelated) cache miss; we now have 2 overlapping cache misses
- The L/S Unit is busy again

Page 41: An Example (8/15)

Clock 3 – Instruction 6

- RAX is not ready yet (300-cycle latency, remember?!)
- This load cannot even start until instruction 0 is done

Page 42: An Example (9/15)

Clock 301 – Instruction 2

- At clock 300 (or 301), RAX is finally ready
- Do the comparison and update the Flags register

Page 43: An Example (10/15)

Clock 301 – Instruction 6

- Issue this load too. Assume a cache hit (finally!)
- The result will be available at clock 304

Page 44: An Example (11/15)

Clock 302 – Instruction 3

- Now the Flags register is ready
- Check the prediction; assume the prediction was correct

Page 45: An Example (12/15)

Clock 302 – Instruction 4

- This speculative store can now actually be committed to memory (or the cache, really)

Page 46: An Example (13/15)

Clock 302 – Instruction 5

- At clock 302, the result of this load arrives

Page 47: An Example (14/15)

Clock 305 – Instruction 6

- The result arrived at clock 304; the instruction is retired at 305

Page 48: An Example (15/15)

To summarize:
- In 4 clocks, we started 7 ops and 2 cache misses
- We retired 7 ops in 306 cycles
- Cache misses totally dominate performance
- The only real benefit came from being able to have 2 overlapping cache misses!

Page 49: A New Performance Goal

To get to the next cache miss as early as possible.

Page 50: What Be This Cache?!

- Main memory is slow; S.L.O.W. Very slow. Painfully slow
- And it especially has very bad (high) latency
- But all is not lost! Many (most) references to memory have high temporal and address locality
- So we use a small amount of very fast memory to keep recently-accessed or likely-to-be-accessed chunks of main memory close to the CPU
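
A classic illustration of address locality (a C++ sketch, not from the slides): row-by-row traversal of a 2-D array touches consecutive bytes and uses every byte of each 64-byte line, while column-by-column traversal jumps a full row ahead on every access.

    #include <cstddef>

    constexpr std::size_t N = 1024;
    float grid[N][N];                       // row-major: grid[i][0..N-1] are contiguous

    // Good locality: consecutive addresses, every byte of each cache line is used.
    float sum_row_major() {
        float s = 0.0f;
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t j = 0; j < N; ++j)
                s += grid[i][j];
        return s;
    }

    // Poor locality: each access is N * sizeof(float) bytes after the previous one,
    // so once the array is bigger than the cache, nearly every access is a miss.
    float sum_column_major() {
        float s = 0.0f;
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t i = 0; i < N; ++i)
                s += grid[i][j];
        return s;
    }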

Page 51: General Properties of CPU Caches

- Typically come in several levels (3 these days)
- Each lower level is several times smaller, but several times faster, than the level above it
- The CPU can only see the L1 cache, each level only sees the level above it, and only the highest level communicates with main memory
- Data is transferred between memory and cache in units of a fixed size, called a cache line. The most common size today is 64 bytes

Page 52: How the Cache Works (1/3)

- When any memory byte is needed, its place in the cache is calculated
- The CPU asks the cache
- If it's there, the cache returns the data
- If not, the data is pulled in from memory
- If the calculated cache line is occupied by data with a different tag, that data is evicted
  - If the line is dirty (modified), it is written back to memory first

[Diagram: main memory, where each block is the size of a cache line, and the cache, where each block also holds metadata such as the tag (address) and some flags]
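
A sketch of this basic (direct-mapped) scheme, with sizes assumed for illustration only (32 KB cache, 64-byte lines): the line index and the tag are both derived directly from the address.

    #include <cstdint>

    // Assumed geometry: 32 KB direct-mapped cache, 64-byte lines.
    constexpr std::uint64_t kLineSize = 64;
    constexpr std::uint64_t kNumLines = (32 * 1024) / kLineSize;   // 512 lines

    struct Line {
        bool          valid = false;
        bool          dirty = false;
        std::uint64_t tag   = 0;
        // ...plus the 64 data bytes themselves
    };

    // The address determines the one slot it may live in, and the tag to match.
    constexpr std::uint64_t line_index(std::uint64_t addr) { return (addr / kLineSize) % kNumLines; }
    constexpr std::uint64_t line_tag(std::uint64_t addr)   { return (addr / kLineSize) / kNumLines; }

    // Returns true on a hit; on a miss, evicts whatever was in that slot and refills it.
    bool access(Line (&cache)[kNumLines], std::uint64_t addr, bool is_write) {
        Line& slot = cache[line_index(addr)];
        bool hit = slot.valid && slot.tag == line_tag(addr);
        if (!hit) {
            // if (slot.valid && slot.dirty) write_back(slot);   // hypothetical helper
            slot = Line{true, false, line_tag(addr)};
        }
        if (is_write) slot.dirty = true;
        return hit;
    }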

Page 53: How the Cache Works (2/3)

- In this basic model, if the CPU repeatedly accesses memory addresses that differ by a multiple of the cache size, they will constantly evict each other, and most cache accesses will be misses. This is called cache thrashing
- An application can innocently and very easily trigger this; see the sketch below
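
For instance (a sketch, reusing the assumed 32 KB direct-mapped geometry from above): elements that sit exactly one cache-size apart all map to the same line and keep evicting one another, even though only a handful of values is in use.

    #include <cstddef>

    constexpr std::size_t kCacheBytes = 32 * 1024;                      // assumed cache size
    constexpr std::size_t kStride     = kCacheBytes / sizeof(float);    // elements 32 KB apart

    float sum_aliasing(const float* data, std::size_t count, std::size_t lanes) {
        // data must point to at least (lanes - 1) * kStride + count floats.
        float s = 0.0f;
        for (std::size_t i = 0; i < count; ++i)
            for (std::size_t w = 0; w < lanes; ++w)
                s += data[w * kStride + i];   // every access in this inner loop maps to the same line
        return s;
    }
    // Power-of-two-sized rows, structures or allocations can produce exactly this
    // access pattern by accident.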

Page 54: How the Cache Works (3/3)

- To alleviate this problem, each cache block is turned into an associative memory that can house more than one cache line
- Each cache block holds several cache lines (2, 4, 8 or more), and still uses the tag to look up the line the CPU requested within the block
- When a new line comes in from memory, an LRU (or similar) policy is used to evict only the least-likely-to-be-needed line
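
A sketch of that refinement (same assumed geometry, now 8-way set-associative): the requested tag is searched among the ways of one set, and on a miss the least-recently-used way of that set is evicted.

    #include <array>
    #include <cstdint>

    // Assumed geometry: 32 KB, 8-way set-associative, 64-byte lines -> 64 sets.
    constexpr std::uint64_t kLineSize = 64;
    constexpr std::uint64_t kWays     = 8;
    constexpr std::uint64_t kNumSets  = (32 * 1024) / (kLineSize * kWays);

    struct Way {
        bool          valid = false;
        std::uint64_t tag   = 0;
        std::uint64_t age   = 0;     // larger = used longer ago
    };
    using Set = std::array<Way, kWays>;

    // Returns true on a hit; on a miss, the LRU (or an empty) way of the set is refilled.
    bool access(std::array<Set, kNumSets>& cache, std::uint64_t addr) {
        const std::uint64_t line = addr / kLineSize;
        const std::uint64_t tag  = line / kNumSets;
        Set& set = cache[line % kNumSets];

        for (Way& w : set) ++w.age;                    // everything in the set ages

        for (Way& w : set)                             // hit: tag found in this set
            if (w.valid && w.tag == tag) { w.age = 0; return true; }

        Way* victim = &set[0];                         // miss: prefer an empty way,
        for (Way& w : set) {                           // otherwise take the oldest
            if (!w.valid) { victim = &w; break; }
            if (w.age > victim->age) victim = &w;
        }
        victim->valid = true;
        victim->tag   = tag;
        victim->age   = 0;
        return false;
    }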

Page 55: Any Questions?

References:
- Patterson & Hennessy, Computer Organization and Design
- Intel 64 and IA-32 Architectures Software Developer's Manual, vols. 1, 2 and 3
- Click & Goetz, A Crash Course in Modern Hardware
- Agner Fog, The Microarchitecture of Intel, AMD and VIA CPUs
- Drepper, What Every Programmer Should Know About Memory