TRANSCRIPT
A best-offset prefetcher
Pierre Michaud
2nd Data Prefetching Championship, June 2015
DPC2 rules

[Diagram: core → L1 → L2 (with MSHR) → L3 → DRAM; the DPC2 prefetcher is attached to the L2 in the simulator]

The prefetcher is notified:
• on each L2 access: physical address, L2 hit/miss, IP, time
• on each L2 fill: fill line, victim line, time
• it can also read the MSHR occupancy

The prefetch address must lie in the same 4KB page as the demand address.
Offset prefetching

[Diagram: on a demand access to line X, prefetch line X+O, where O is the offset]

• Next-line prefetching: O=1
• Full-fledged offset prefetcher: varying offset
• Sandbox prefetcher (Pugsley et al., HPCA 2014)
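The basic mechanism can be sketched in a few lines. This is a minimal illustration (function and constant names are my own), showing an offset prefetch together with the DPC2 same-4KB-page constraint:

```python
# Minimal sketch of offset prefetching: on a demand access to cache line X,
# prefetch line X + O, but only if the target stays in the same 4KB page
# (with 64-byte lines, a 4KB page holds 64 lines).

LINE_SIZE = 64
PAGE_LINES = 4096 // LINE_SIZE  # 64 lines per 4KB page

def offset_prefetch(demand_line: int, offset: int):
    """Return the line to prefetch, or None if it would cross the page boundary."""
    target = demand_line + offset
    if target < 0 or target // PAGE_LINES != demand_line // PAGE_LINES:
        return None  # DPC2 rule: prefetch must lie in the same 4KB page
    return target
```

For example, with offset 1, a demand on the last line of a page (line 63) produces no prefetch, while a demand on line 0 prefetches line 1.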
Proposed Best-Offset (BO) prefetcher
• New method for setting the offset automatically - different from Sandbox - first implementation in an in-house simulator in 2011
• Bandwidth & cache pollution prefetch throttling method - somewhat specific to DPC2 - DPC2 rules limit what can be done
8
Sequential stream

[Figure: sequential stream over line addresses 0, 64, ..., 448; with offset=2, prefetches run two lines ahead of demands 1 through 8]

• if the offset is too small, prefetches may not be timely
(neglecting the page boundary effect)
Strided stream

[Figure: stream with stride = +96 bytes over byte addresses 0 to 448; with offset=3, demands 1 through 6 are covered]

• a constant byte-stride yields a periodic sequence of line-strides (1,2,1,2,...)
• offset = sum of line-strides in a period (offset = 1+2 = 3)
• ...or a multiple of that sum (6, 9, ...)
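The stride=+96 example can be checked directly. The sketch below (helper name is my own) computes the lines touched by a constant byte-stride stream and verifies that offset=3, the sum of line-strides in one period, lands every prefetch on a future demand line:

```python
# With a constant byte-stride of +96 bytes and 64-byte lines, the accessed
# lines follow the periodic line-stride sequence 1,2,1,2,... An offset equal
# to the sum of line-strides in one period (1+2 = 3) makes every prefetch
# X+3 hit a line the stream will actually demand.

def demand_lines(byte_stride: int, n: int, line_size: int = 64):
    """Lines touched by n accesses with a constant byte stride, starting at byte 0."""
    return [(i * byte_stride) // line_size for i in range(n)]

lines = demand_lines(96, 8)      # line sequence 0, 1, 3, 4, 6, 7, 9, 10
future = set(lines)
# Skip the last two demands: their +3 targets fall outside the observed window.
covered = all((x + 3) in future for x in lines[:-2])
```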
Interleaved streams

[Figure: two interleaved strided streams, demands 1 through 6 and 1 through 4, both covered with offset=6]

• 1st stream alone → offset = a multiple of 3
• 2nd stream alone → offset = a multiple of 2
• Both streams → offset = a multiple of 6
BO prefetcher: main idea

[Diagram: on a demand for line X (miss or prefetched hit), prefetch line X+O with the current offset O; when a prefetched line Y is filled, record Y-O in a table of recent requests; best-offset learning tests a candidate offset O' by looking up X-O' in that table — a hit means O' would have produced a timely prefetch of X]
Recent Requests (RR) Table

• in 2011: 64-entry fully-associative FIFO
• for DPC2: two direct-mapped banks with different hashing
  - resembles a 2-way skewed-associative cache
  - 2 x 64 x 12-bit tags = 1536 bits
• the same tag is written redundantly in both banks
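The two-bank organization can be sketched as follows. The index hash functions below are my own illustrative choices (the slide does not specify them); only the structure — two 64-entry direct-mapped banks, 12-bit tags, redundant insertion, lookup in both banks — comes from the slide:

```python
# Sketch of the DPC2 RR table: two 64-entry direct-mapped banks indexed by
# different hashes of the line address, each entry holding a 12-bit tag.
# A line is written into both banks; a lookup hits if either bank matches.

BANK_ENTRIES = 64
TAG_BITS = 12

class RRTable:
    def __init__(self):
        self.left = [None] * BANK_ENTRIES
        self.right = [None] * BANK_ENTRIES

    def _hashes(self, line: int):
        # Two different index hashes (illustrative, not from the slides).
        h1 = line % BANK_ENTRIES
        h2 = (line // BANK_ENTRIES) % BANK_ENTRIES
        tag = (line ^ (line >> 7)) & ((1 << TAG_BITS) - 1)
        return h1, h2, tag

    def insert(self, line: int):
        h1, h2, tag = self._hashes(line)
        self.left[h1] = tag      # same tag written redundantly...
        self.right[h2] = tag     # ...in both banks

    def hit(self, line: int) -> bool:
        h1, h2, tag = self._hashes(line)
        return self.left[h1] == tag or self.right[h2] == tag
```

Using two different hashes makes the pair of banks behave like a small skewed-associative cache: two lines that conflict in one bank are unlikely to conflict in the other.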
Learning the best offset

• 46 different offsets evaluated
  - 23 positive + 23 negative
  - 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,20,24,30,32,36,40
• Each offset has a 5-bit score → 46 x 5 = 230 bits
• Test the 46 offsets successively (46 L2 accesses = one round)
  - on a hit in the RR table for an offset, increment its score
• Learning phase finishes after 100 rounds, or if one of the scores reaches 31
  - select the offset with the greatest score → this is the new prefetch offset
  - a new learning phase starts → reset the scores
Prefetch timeliness vs. prefetch accuracy

• the BO prefetcher tries to issue timely prefetches
• however...
• sometimes it is better to choose a smaller offset, even if it generates late prefetches
  - example: short sequential streams
• imperfect solution: a delay queue
BO prefetcher with a delay queue

[Diagram: same scheme as above, with the RR table split into RR left and RR right banks; the demand line X passes through a 60-cycle delay queue before being recorded in the RR table]
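The delay queue itself is simple. A sketch under the assumptions above (FIFO order, fixed 60-cycle delay; class and method names are mine):

```python
# Sketch of the delay queue: a demand line X is not recorded in the
# recent-requests table immediately, but only after a fixed 60-cycle delay,
# so that an offset only scores when it would have prefetched X early
# enough to be timely.

from collections import deque

DELAY = 60  # cycles

class DelayQueue:
    def __init__(self):
        self.q = deque()                  # (ready_time, line), FIFO order

    def push(self, line: int, now: int):
        self.q.append((now + DELAY, line))

    def pop_ready(self, now: int):
        """Release all lines whose delay has elapsed (for RR-table insertion)."""
        out = []
        while self.q and self.q[0][0] <= now:
            out.append(self.q.popleft()[1])
        return out
```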
Prefetch throttling (DPC2)

• Turn prefetch on only if the BO score > BADSCORE
  - DPC2: BADSCORE=1 (10 for the small-L3 configuration)
  - best-offset learning continues while prefetch is off
• Drop a prefetch request if MSHR occupancy is above a threshold
  - the MSHR threshold varies with the BO score and the L3 access rate

[Figure: MSHR threshold as a function of BO score (0 to 31) and L3 access rate (0 to 50% of DRAM bandwidth to full DRAM bandwidth); the threshold is HIGH (about 20) for high scores and low L3 access rates, LOW otherwise]
State (number of bits)

  prefetch bits (1 bit per L2 line)    2048
  recent requests (2x64x12)            1536
  scores (46x5)                         230
  delay queue (15 slots)                473
  miscellaneous                          74
  TOTAL                                4361 bits
Fixed vs. adaptive offset (437.leslie3d)

[Figure: speedup (0.9 to 1.35) vs. fixed offset (1 to 40), with horizontal reference lines for BOP and BOP without the delay queue]
Fixed vs. adaptive offset (433.milc)

[Figure: speedup (0.9 to 1.35) vs. fixed offset (1 to 40), with horizontal reference lines for BOP and BOP without the delay queue]
Fixed vs. adaptive offset (434.zeusmp)

[Figure: speedup (0.9 to 1.4) vs. fixed offset (1 to 40), with horizontal reference lines for BOP and BOP without the delay queue]
BO prefetcher vs. Sandbox prefetcher

• Sandbox prefetcher (Pugsley et al., HPCA 2014)
  - first published full-fledged offset prefetcher
  - fake prefetches evaluate an offset by setting bits in a Bloom filter
  - a demand access hitting in the Bloom filter → the fake prefetch was successful
  - prefetch timeliness is not considered
  - the Sandbox method is orthogonal to offset prefetching
• BO prefetcher
  - no fake prefetches
  - strives for prefetch timeliness
FIN