Slide 1: COMP 206: Computer Architecture and Implementation
Montek Singh
Wed, Nov 2, 2005 and Mon, Nov 7, 2005
Topic: Caches (contd.)
Slide 2: Outline
- Cache Organization
- Cache Read/Write Policies
  - Block replacement policies
  - Write-back vs. write-through caches
  - Write buffers
- Cache Performance
  - Means of improving performance
Reading: HP3 Sections 5.3-5.7
Slide 3: Review: 4 Questions for Memory Hierarchy
- Where can a block be placed in the upper level? (Block placement)
- How is a block found if it is in the upper level? (Block identification)
- Which block should be replaced on a miss? (Block replacement)
- What happens on a write? (Write strategy)
Slide 4: Review: Cache Shapes
- Direct-mapped (A = 1, S = 16)
- 2-way set-associative (A = 2, S = 8)
- 4-way set-associative (A = 4, S = 4)
- 8-way set-associative (A = 8, S = 2)
- Fully associative (A = 16, S = 1)
Slide 5: Example 1: 1KB, Direct-Mapped, 32B Blocks
[Figure: a direct-mapped cache with a valid bit and Cache Tag per line, and 32 lines of Cache Data (Byte 0 ... Byte 1023); the 32-bit address splits into Cache Tag (bits 31-10, e.g. 0x50), Cache Index (bits 9-5, e.g. 0x01), and Byte Select (bits 4-0, e.g. 0x00)]
For a 1024 (2^10) byte cache with 32-byte blocks:
- The uppermost 22 = (32 - 10) address bits are the Cache Tag
- The lowest 5 address bits are the Byte Select (Block Size = 2^5)
- The next 5 address bits (bits 5-9) are the Cache Index
Slide 6: Example 1a: Cache Miss; Empty Block
[Figure: the address 0x0002fe / 0x00 / 0x00 (Cache Tag / Cache Index / Byte Select) selects line 0; the line's valid bit is 0, so the tag comparison fails -- a cache miss on an empty block]
Slide 7: Example 1b: ... Read in Data
[Figure: the missing block is fetched from memory and written into line 0; the line's valid bit is set to 1 and its tag is set to 0x0002fe]
Slide 8: Example 1c: Cache Hit
[Figure: the address 0x000050 / 0x01 / 0x08 selects line 1; the line is valid and its stored tag 0x000050 matches the address tag -- a cache hit, and byte 0x08 of the block is returned]
Slide 9: Example 1d: Cache Miss; Incorrect Block
[Figure: the address 0x002450 / 0x02 / 0x04 selects line 2; the line is valid but its stored tag (0x004440) does not match the address tag -- a cache miss caused by a different block occupying the line]
Slide 10: Example 1e: ... Replace Block
[Figure: the old block in line 2 is evicted; a new block of data is read in from memory and the line's tag is updated to 0x002450]
Slide 11: Cache Performance

Average memory access time = Hit time + Miss rate × Miss penalty

CPU time = IC × (Pipeline CPI + (MM refs / Instruction) × (Misses / MM ref) × Miss penalty) × Cycle time

Bus traffic ratio = (Bus traffic with cache) / (Bus traffic without cache)
Slide 12: Block Size Tradeoff
[Figure: three sketches vs. block size -- miss penalty grows with block size; miss rate first falls (exploits spatial locality) and then rises (fewer blocks compromises temporal locality); average access time therefore dips and then climbs as both miss penalty and miss rate increase]
In general, a larger block size takes advantage of spatial locality, BUT:
- A larger block size means a larger miss penalty
  - It takes longer to fill up the block
- If the block size is too big relative to the cache size, the miss rate will go up
  - Too few cache blocks
Average Access Time = Hit Time + Miss Penalty × Miss Rate
Slide 13: Sources of Cache Misses
- Compulsory (cold start, process migration, first reference): first access to a block
  - "Cold" fact of life: not a whole lot you can do about it
- Conflict/Collision/Interference: multiple memory locations mapped to the same cache location
  - Solution 1: increase cache size
  - Solution 2: increase associativity
- Capacity: the cache cannot contain all blocks accessed by the program
  - Solution 1: increase cache size
  - Solution 2: restructure the program
- Coherence/Invalidation: another process (e.g., I/O) updates memory
Slide 14: The 3C Model of Cache Misses
Based on comparison with another cache:
- Compulsory: The first access to a block is not in the cache, so the block must be brought into the cache. These are also called cold start misses or first reference misses. (Misses in an infinite cache)
- Capacity: If the cache cannot contain all the blocks needed during execution of a program (its working set), capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative, size-X cache)
- Conflict: If the block-placement strategy is set-associative or direct-mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. These are also called collision misses or interference misses. (Misses in an A-way associative, size-X cache, but not in a fully associative, size-X cache)
Slide 15: Sources of Cache Misses

                    Direct Mapped   N-way Set Associative   Fully Associative
Cache Size          Big             Medium                  Small
Compulsory Miss     Same            Same                    Same
Capacity Miss       Low(er)         Medium                  High
Conflict Miss       High            Medium                  Zero
Invalidation Miss   Same            Same                    Same

If you are going to run "billions" of instructions, compulsory misses are insignificant.
Slide 16: 3Cs Absolute Miss Rate
[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB); curves for 1-way, 2-way, 4-way, and 8-way associativity sit above shared capacity and compulsory components, and the gap down to the capacity curve is the conflict component]
Slide 17: 3Cs Relative Miss Rate
[Figure: the same data normalized so each cache size totals 100%, showing the relative shares of conflict, capacity, and compulsory misses for 1-way, 2-way, 4-way, and 8-way associativity from 1 KB to 128 KB]
Slide 18: How to Improve Cache Performance
- Latency
  - Reduce miss rate
  - Reduce miss penalty
  - Reduce hit time
- Bandwidth
  - Increase hit bandwidth
  - Increase miss bandwidth
Slide 19: 1. Reduce Misses via Larger Block Size
[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes of 1K, 4K, 16K, 64K, and 256K; larger blocks reduce the miss rate for the large caches, but past a point the miss rate turns back up, most sharply for the small caches]
Slide 20: 2. Reduce Misses via Higher Associativity
- 2:1 Cache Rule:
  - Miss rate of a direct-mapped cache of size N ≈ miss rate of a 2-way set-associative cache of size N/2
  - Not merely empirical: theoretical justification in Sleator and Tarjan, "Amortized efficiency of list update and paging rules", CACM, 28(2):202-208, 1985
- Beware: execution time is the only final measure!
  - Will the clock cycle time increase?
  - Hill [1988] suggested hit time is ~10% higher for 2-way vs. 1-way
[Figure: the 3Cs absolute miss rate chart repeated -- miss rate per type vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way, split into capacity and compulsory components]
Slide 21: Example: Average Memory Access Time vs. Miss Rate
Example: assume the clock cycle time is 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way, vs. the clock cycle time of direct-mapped.
(In the original slide, red marks entries where A.M.A.T. is not improved by more associativity.)

Cache size (KB)   1-way   2-way   4-way   8-way
  1               2.33    2.15    2.07    2.01
  2               1.98    1.86    1.76    1.68
  4               1.72    1.67    1.61    1.53
  8               1.46    1.48    1.47    1.43
 16               1.29    1.32    1.32    1.32
 32               1.20    1.24    1.25    1.27
 64               1.14    1.20    1.21    1.23
128               1.10    1.17    1.18    1.20
Slide 22: 3. Reduce Conflict Misses via Victim Cache
- How to combine the fast hit time of direct-mapped, yet avoid conflict misses?
- Add a small, highly associative buffer to hold data discarded from the cache
- Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache
[Figure: the CPU probes the direct-mapped cache (TAG/DATA) and the small fully associative victim cache (TAG/DATA); only if both miss does the request go to Mem]
Slide 23: 4. Reduce Conflict Misses via Pseudo-Associativity
- How to combine the fast hit time of direct-mapped with the lower conflict misses of a 2-way set-associative cache?
- Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, it is a pseudo-hit (slow hit)
- Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles
  - Better for caches not tied directly to the processor
[Figure: timeline showing Hit Time < Pseudo Hit Time < Miss Penalty]
Slide 24: 5. Reduce Misses by Hardware Prefetching
- Instruction prefetching
  - The Alpha 21064 fetches 2 blocks on a miss
  - The extra block is placed in a stream buffer
  - On a miss, check the stream buffer
- Works with data blocks too
  - Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4 KB cache; 4 stream buffers caught 43%
  - Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64 KB, 4-way set-associative caches
- Prefetching relies on extra memory bandwidth that can be used without penalty
  - e.g., up to 8 prefetch stream buffers in the UltraSPARC III
Slide 25: 6. Reduce Misses by Software Prefetching
- Data prefetch
  - The compiler inserts special "prefetch" instructions into the program
    - Load data into a register (HP PA-RISC loads)
    - Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9)
  - A form of speculative execution
    - We don't really know if the data is needed, or whether it is already in the cache
- The most effective prefetches are "semantically invisible" to the program
  - They do not change registers or memory
  - They cannot cause a fault/exception: if they would fault, they are simply turned into NOPs
- Issuing prefetch instructions takes time
  - Is the cost of issuing prefetches < the savings in reduced misses?
Slide 26: 7. Reduce Misses by Compiler Optimizations
- Instructions
  - Reorder procedures in memory so as to reduce misses
  - Profiling to look at conflicts
  - McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks
- Data
  - Merging arrays: improve spatial locality with a single array of compound elements vs. two arrays
  - Loop interchange: change the nesting of loops to access data in the order it is stored in memory
  - Loop fusion: combine two independent loops that have the same looping and some variable overlap
  - Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
Slide 27: Merging Arrays Example
- Reduces conflicts between val and key
- Addressing expressions are different

/* Before */
int val[SIZE];
int key[SIZE];

/* After */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];
Slide 28: Loop Interchange Example
- Sequential accesses instead of striding through memory every 100 words

/* Before */
for (k = 0; k < 100; k++)
    for (j = 0; j < 100; j++)
        for (i = 0; i < 5000; i++)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k++)
    for (i = 0; i < 5000; i++)
        for (j = 0; j < 100; j++)
            x[i][j] = 2 * x[i][j];
Slide 29: Loop Fusion Example
- Before: 2 misses per access to a and c
- After: 1 miss per access to a and c

/* Before */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }
Slide 30: Blocking Example
- Two inner loops:
  - Read all N×N elements of z[]
  - Read N elements of 1 row of y[] repeatedly
  - Write N elements of 1 row of x[]
- Capacity misses are a function of N and cache size
  - If all 3 N×N matrices fit, there are no capacity misses; otherwise ...
- Idea: compute on a B×B submatrix that fits

/* Before */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        r = 0;
        for (k = 0; k < N; k++)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }
Slide 31: Blocking Example (contd.)
[Figure: snapshots showing the age of accesses to x, y, and z]
- White means not touched yet
- Light gray means touched a while ago
- Dark gray means newer accesses
Slide 32: Blocking Example (contd.)
- Work with B×B submatrices
  - A smaller working set can fit within the cache
  - Fewer capacity misses

/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i++)
            for (j = jj; j < min(jj+B,N); j++) {
                r = 0;
                for (k = kk; k < min(kk+B,N); k++)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }
Slide 33: Blocking Example (contd.)
- The capacity required goes from (2N^3 + N^2) to (2N^3/B + N^2)
- B = "blocking factor"