Slide 1: COMP 206: Computer Architecture and Implementation
Montek Singh
Wed, Nov 2, 2005 and Mon, Nov 7, 2005
Topic: Caches (contd.)
Slide 2: Outline
- Cache Organization
- Cache Read/Write Policies
  - Block replacement policies
  - Write-back vs. write-through caches
  - Write buffers
- Cache Performance
  - Means of improving performance
Reading: HP3 Sections 5.3-5.7
Slide 3: Review: 4 Questions for Memory Hierarchy
- Where can a block be placed in the upper level? (Block placement)
- How is a block found if it is in the upper level? (Block identification)
- Which block should be replaced on a miss? (Block replacement)
- What happens on a write? (Write strategy)
Slide 4: Review: Cache Shapes
- Direct-mapped (A = 1, S = 16)
- 2-way set-associative (A = 2, S = 8)
- 4-way set-associative (A = 4, S = 4)
- 8-way set-associative (A = 8, S = 2)
- Fully associative (A = 16, S = 1)
Slide 5: Example 1: 1KB, Direct-Mapped, 32B Blocks
[Figure: a direct-mapped cache with a valid bit and Cache Tag per line, and 32 lines of Cache Data (Byte 0 ... Byte 1023); the 32-bit address splits into Cache Tag (bits 31-10, e.g. 0x50), Cache Index (bits 9-5, e.g. 0x01), and Byte Select (bits 4-0, e.g. 0x00)]
For a 1024 (2^10) byte cache with 32-byte blocks:
- The uppermost 22 = (32 - 10) address bits are the Cache Tag
- The lowest 5 address bits are the Byte Select (Block Size = 2^5)
- The next 5 address bits (bits 5-9) are the Cache Index
Slide 6: Example 1a: Cache Miss; Empty Block
[Figure: the address 0x0002fe / 0x00 / 0x00 (Cache Tag / Cache Index / Byte Select) selects line 0; the line's valid bit is 0, so the tag comparison fails -- a cache miss on an empty block]
Slide 7: Example 1b: ... Read in Data
[Figure: the missing block is fetched from memory and written into line 0; the line's valid bit is set to 1 and its tag is set to 0x0002fe]
Slide 8: Example 1c: Cache Hit
[Figure: the address 0x000050 / 0x01 / 0x08 selects line 1; the line is valid and its stored tag 0x000050 matches the address tag -- a cache hit, and byte 0x08 of the block is returned]
Slide 9: Example 1d: Cache Miss; Incorrect Block
[Figure: the address 0x002450 / 0x02 / 0x04 selects line 2; the line is valid but its stored tag (0x004440) does not match the address tag -- a cache miss caused by a different block occupying the line]
Slide 10: Example 1e: ... Replace Block
[Figure: the old block in line 2 is evicted; a new block of data is read in from memory and the line's tag is updated to 0x002450]
Slide 11: Cache Performance

Average memory access time = Hit time + Miss rate × Miss penalty

CPU time = IC × (Pipeline CPI + (MM refs / Instruction) × (Misses / MM ref) × Miss penalty) × Cycle time

Bus traffic ratio = (Bus traffic with cache) / (Bus traffic without cache)
Slide 12: Block Size Tradeoff
[Figure: three sketches vs. block size -- miss penalty grows with block size; miss rate first falls (exploits spatial locality) and then rises (fewer blocks compromises temporal locality); average access time therefore dips and then climbs as both miss penalty and miss rate increase]
In general, a larger block size takes advantage of spatial locality, BUT:
- A larger block size means a larger miss penalty
  - It takes longer to fill up the block
- If the block size is too big relative to the cache size, the miss rate will go up
  - Too few cache blocks
Average Access Time = Hit Time + Miss Penalty × Miss Rate
Slide 13: Sources of Cache Misses
- Compulsory (cold start, process migration, first reference): first access to a block
  - "Cold" fact of life: not a whole lot you can do about it
- Conflict/Collision/Interference: multiple memory locations mapped to the same cache location
  - Solution 1: increase cache size
  - Solution 2: increase associativity
- Capacity: the cache cannot contain all blocks accessed by the program
  - Solution 1: increase cache size
  - Solution 2: restructure the program
- Coherence/Invalidation: another process (e.g., I/O) updates memory
Slide 14: The 3C Model of Cache Misses
Based on comparison with another cache:
- Compulsory: The first access to a block is not in the cache, so the block must be brought into the cache. These are also called cold start misses or first reference misses. (Misses in an infinite cache)
- Capacity: If the cache cannot contain all the blocks needed during execution of a program (its working set), capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative, size-X cache)
- Conflict: If the block-placement strategy is set-associative or direct-mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. These are also called collision misses or interference misses. (Misses in an A-way associative, size-X cache, but not in a fully associative, size-X cache)
Slide 15: Sources of Cache Misses

                    Direct Mapped   N-way Set Associative   Fully Associative
Cache Size          Big             Medium                  Small
Compulsory Miss     Same            Same                    Same
Capacity Miss       Low(er)         Medium                  High
Conflict Miss       High            Medium                  Zero
Invalidation Miss   Same            Same                    Same

If you are going to run "billions" of instructions, compulsory misses are insignificant.
Slide 16: 3Cs Absolute Miss Rate
[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB); curves for 1-way, 2-way, 4-way, and 8-way associativity sit above shared capacity and compulsory components, and the gap down to the capacity curve is the conflict component]
Slide 17: 3Cs Relative Miss Rate
[Figure: the same data normalized so each cache size totals 100%, showing the relative shares of conflict, capacity, and compulsory misses for 1-way, 2-way, 4-way, and 8-way associativity from 1 KB to 128 KB]
Slide 18: How to Improve Cache Performance
- Latency
  - Reduce miss rate
  - Reduce miss penalty
  - Reduce hit time
- Bandwidth
  - Increase hit bandwidth
  - Increase miss bandwidth
Slide 19: 1. Reduce Misses via Larger Block Size
[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes of 1K, 4K, 16K, 64K, and 256K; larger blocks reduce the miss rate for the large caches, but past a point the miss rate turns back up, most sharply for the small caches]
Slide 20: 2. Reduce Misses via Higher Associativity
- 2:1 Cache Rule:
  - Miss rate of a direct-mapped cache of size N ≈ miss rate of a 2-way set-associative cache of size N/2
  - Not merely empirical: theoretical justification in Sleator and Tarjan, "Amortized efficiency of list update and paging rules", CACM, 28(2):202-208, 1985
- Beware: execution time is the only final measure!
  - Will the clock cycle time increase?
  - Hill [1988] suggested hit time is ~10% higher for 2-way vs. 1-way
[Figure: the 3Cs absolute miss rate chart repeated -- miss rate per type vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way, split into capacity and compulsory components]
Slide 21: Example: Average Memory Access Time vs. Miss Rate
Example: assume the clock cycle time is 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way, vs. the clock cycle time of direct-mapped.
(In the original slide, red marks entries where A.M.A.T. is not improved by more associativity.)

Cache size (KB)   1-way   2-way   4-way   8-way
  1               2.33    2.15    2.07    2.01
  2               1.98    1.86    1.76    1.68
  4               1.72    1.67    1.61    1.53
  8               1.46    1.48    1.47    1.43
 16               1.29    1.32    1.32    1.32
 32               1.20    1.24    1.25    1.27
 64               1.14    1.20    1.21    1.23
128               1.10    1.17    1.18    1.20
Slide 22: 3. Reduce Conflict Misses via Victim Cache
- How to combine the fast hit time of direct-mapped, yet avoid conflict misses?
- Add a small, highly associative buffer to hold data discarded from the cache
- Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache
[Figure: the CPU probes the direct-mapped cache (TAG/DATA) and the small fully associative victim cache (TAG/DATA); only if both miss does the request go to Mem]
Slide 23: 4. Reduce Conflict Misses via Pseudo-Associativity
- How to combine the fast hit time of direct-mapped with the lower conflict misses of a 2-way set-associative cache?
- Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, it is a pseudo-hit (slow hit)
- Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles
  - Better for caches not tied directly to the processor
[Figure: timeline showing Hit Time < Pseudo Hit Time < Miss Penalty]
Slide 24: 5. Reduce Misses by Hardware Prefetching
- Instruction prefetching
  - The Alpha 21064 fetches 2 blocks on a miss
  - The extra block is placed in a stream buffer
  - On a miss, check the stream buffer
- Works with data blocks too
  - Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4 KB cache; 4 stream buffers caught 43%
  - Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64 KB, 4-way set-associative caches
- Prefetching relies on extra memory bandwidth that can be used without penalty
  - e.g., up to 8 prefetch stream buffers in the UltraSPARC III
Slide 25: 6. Reduce Misses by Software Prefetching
- Data prefetch
  - The compiler inserts special "prefetch" instructions into the program
    - Load data into a register (HP PA-RISC loads)
    - Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9)
  - A form of speculative execution
    - We don't really know if the data is needed, or whether it is already in the cache
- The most effective prefetches are "semantically invisible" to the program
  - They do not change registers or memory
  - They cannot cause a fault/exception: if they would fault, they are simply turned into NOPs
- Issuing prefetch instructions takes time
  - Is the cost of issuing prefetches < the savings in reduced misses?
Slide 26: 7. Reduce Misses by Compiler Optimizations
- Instructions
  - Reorder procedures in memory so as to reduce misses
  - Profiling to look at conflicts
  - McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks
- Data
  - Merging arrays: improve spatial locality with a single array of compound elements vs. two arrays
  - Loop interchange: change the nesting of loops to access data in the order it is stored in memory
  - Loop fusion: combine two independent loops that have the same looping and some variable overlap
  - Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
Slide 27: Merging Arrays Example
- Reduces conflicts between val and key
- Addressing expressions are different

/* Before */
int val[SIZE];
int key[SIZE];

/* After */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];
Slide 28: Loop Interchange Example
- Sequential accesses instead of striding through memory every 100 words

/* Before */
for (k = 0; k < 100; k++)
    for (j = 0; j < 100; j++)
        for (i = 0; i < 5000; i++)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k++)
    for (i = 0; i < 5000; i++)
        for (j = 0; j < 100; j++)
            x[i][j] = 2 * x[i][j];
Slide 29: Loop Fusion Example
- Before: 2 misses per access to a and c
- After: 1 miss per access to a and c

/* Before */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }
Slide 30: Blocking Example
- Two inner loops:
  - Read all N×N elements of z[]
  - Read N elements of 1 row of y[] repeatedly
  - Write N elements of 1 row of x[]
- Capacity misses are a function of N and cache size
  - If all 3 N×N matrices fit, there are no capacity misses; otherwise ...
- Idea: compute on a B×B submatrix that fits

/* Before */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        r = 0;
        for (k = 0; k < N; k++)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }
Slide 31: Blocking Example (contd.)
[Figure: snapshots showing the age of accesses to x, y, and z]
- White means not touched yet
- Light gray means touched a while ago
- Dark gray means newer accesses
Slide 32: Blocking Example (contd.)
- Work with B×B submatrices
  - A smaller working set can fit within the cache
  - Fewer capacity misses

/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i++)
            for (j = jj; j < min(jj+B,N); j++) {
                r = 0;
                for (k = kk; k < min(kk+B,N); k++)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }
Slide 33: Blocking Example (contd.)
- The capacity required goes from (2N^3 + N^2) to (2N^3/B + N^2)
- B = "blocking factor"