TRANSCRIPT
A best-offset prefetcher
Pierre Michaud
2nd Data Prefetching Championship, June 2015
DPC2 rules

[Diagram: core → L1 → L2 (with MSHR) → L3 → DRAM; the DPC2 prefetcher is attached to the L2 in the simulator]

The prefetcher is notified:
• on each L2 access: physical address, L2 hit/miss, IP, time
• on each L2 fill: fill line, victim line, time
• it can also read the MSHR occupancy

The prefetch address must lie in the same 4KB page as the demand address.
Offset prefetching

[Diagram: on a demand access to line X, prefetch line X+O, where O is the offset]

• Next-line prefetching: O=1
• Full-fledged offset prefetcher: varying offset
• Sandbox prefetcher (Pugsley et al., HPCA 2014)
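The basic mechanism can be sketched in a few lines. This is a minimal illustration (function and constant names are my own), showing an offset prefetch together with the DPC2 same-4KB-page constraint:

```python
# Minimal sketch of offset prefetching: on a demand access to cache line X,
# prefetch line X + O, but only if the target stays in the same 4KB page
# (with 64-byte lines, a 4KB page holds 64 lines).

LINE_SIZE = 64
PAGE_LINES = 4096 // LINE_SIZE  # 64 lines per 4KB page

def offset_prefetch(demand_line: int, offset: int):
    """Return the line to prefetch, or None if it would cross the page boundary."""
    target = demand_line + offset
    if target < 0 or target // PAGE_LINES != demand_line // PAGE_LINES:
        return None  # DPC2 rule: prefetch must lie in the same 4KB page
    return target
```

For example, with offset 1, a demand on the last line of a page (line 63) produces no prefetch, while a demand on line 0 prefetches line 1.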
Proposed Best-Offset (BO) prefetcher
• New method for setting the offset automatically - different from Sandbox - first implementation in an in-house simulator in 2011
• Bandwidth & cache pollution prefetch throttling method - somewhat specific to DPC2 - DPC2 rules limit what can be done
8
Sequential stream

[Figure: sequential stream over line addresses 0, 64, ..., 448; with offset=2, prefetches run two lines ahead of demands 1 through 8]

• if the offset is too small, prefetches may not be timely
(neglecting the page boundary effect)
Strided stream

[Figure: stream with stride = +96 bytes over byte addresses 0 to 448; with offset=3, demands 1 through 6 are covered]

• a constant byte-stride yields a periodic sequence of line-strides (1,2,1,2,...)
• offset = sum of line-strides in a period (offset = 1+2 = 3)
• ...or a multiple of that sum (6, 9, ...)
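The stride=+96 example can be checked directly. The sketch below (helper name is my own) computes the lines touched by a constant byte-stride stream and verifies that offset=3, the sum of line-strides in one period, lands every prefetch on a future demand line:

```python
# With a constant byte-stride of +96 bytes and 64-byte lines, the accessed
# lines follow the periodic line-stride sequence 1,2,1,2,... An offset equal
# to the sum of line-strides in one period (1+2 = 3) makes every prefetch
# X+3 hit a line the stream will actually demand.

def demand_lines(byte_stride: int, n: int, line_size: int = 64):
    """Lines touched by n accesses with a constant byte stride, starting at byte 0."""
    return [(i * byte_stride) // line_size for i in range(n)]

lines = demand_lines(96, 8)      # line sequence 0, 1, 3, 4, 6, 7, 9, 10
future = set(lines)
# Skip the last two demands: their +3 targets fall outside the observed window.
covered = all((x + 3) in future for x in lines[:-2])
```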
Interleaved streams

[Figure: two interleaved strided streams, demands 1 through 6 and 1 through 4, both covered with offset=6]

• 1st stream alone → offset = a multiple of 3
• 2nd stream alone → offset = a multiple of 2
• Both streams → offset = a multiple of 6
BO prefetcher: main idea

[Diagram: on a demand for line X (miss or prefetched hit), prefetch line X+O with the current offset O; when a prefetched line Y is filled, record Y-O in a table of recent requests; best-offset learning tests a candidate offset O' by looking up X-O' in that table — a hit means O' would have produced a timely prefetch of X]
Recent Requests (RR) Table

• in 2011: 64-entry fully-associative FIFO
• for DPC2: two direct-mapped banks with different hashing
  - resembles a 2-way skewed-associative cache
  - 2 x 64 x 12-bit tags = 1536 bits
• the same tag is written redundantly in both banks
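The two-bank organization can be sketched as follows. The index hash functions below are my own illustrative choices (the slide does not specify them); only the structure — two 64-entry direct-mapped banks, 12-bit tags, redundant insertion, lookup in both banks — comes from the slide:

```python
# Sketch of the DPC2 RR table: two 64-entry direct-mapped banks indexed by
# different hashes of the line address, each entry holding a 12-bit tag.
# A line is written into both banks; a lookup hits if either bank matches.

BANK_ENTRIES = 64
TAG_BITS = 12

class RRTable:
    def __init__(self):
        self.left = [None] * BANK_ENTRIES
        self.right = [None] * BANK_ENTRIES

    def _hashes(self, line: int):
        # Two different index hashes (illustrative, not from the slides).
        h1 = line % BANK_ENTRIES
        h2 = (line // BANK_ENTRIES) % BANK_ENTRIES
        tag = (line ^ (line >> 7)) & ((1 << TAG_BITS) - 1)
        return h1, h2, tag

    def insert(self, line: int):
        h1, h2, tag = self._hashes(line)
        self.left[h1] = tag      # same tag written redundantly...
        self.right[h2] = tag     # ...in both banks

    def hit(self, line: int) -> bool:
        h1, h2, tag = self._hashes(line)
        return self.left[h1] == tag or self.right[h2] == tag
```

Using two different hashes makes the pair of banks behave like a small skewed-associative cache: two lines that conflict in one bank are unlikely to conflict in the other.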
Learning the best offset

• 46 different offsets evaluated
  - 23 positive + 23 negative
  - 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,20,24,30,32,36,40
• Each offset has a 5-bit score → 46 x 5 = 230 bits
• Test the 46 offsets successively (46 L2 accesses = one round)
  - on a hit in the RR table for an offset, increment its score
• Learning phase finishes after 100 rounds, or if one of the scores reaches 31
  - select the offset with the greatest score → this is the new prefetch offset
  - a new learning phase starts → reset the scores
Prefetch timeliness vs. prefetch accuracy

• the BO prefetcher tries to issue timely prefetches
• however...
• sometimes it is better to choose a smaller offset, even if it generates late prefetches
  - example: short sequential streams
• imperfect solution: a delay queue
BO prefetcher with a delay queue

[Diagram: same scheme as above, with the RR table split into RR left and RR right banks; the demand line X passes through a 60-cycle delay queue before being recorded in the RR table]
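The delay queue itself is simple. A sketch under the assumptions above (FIFO order, fixed 60-cycle delay; class and method names are mine):

```python
# Sketch of the delay queue: a demand line X is not recorded in the
# recent-requests table immediately, but only after a fixed 60-cycle delay,
# so that an offset only scores when it would have prefetched X early
# enough to be timely.

from collections import deque

DELAY = 60  # cycles

class DelayQueue:
    def __init__(self):
        self.q = deque()                  # (ready_time, line), FIFO order

    def push(self, line: int, now: int):
        self.q.append((now + DELAY, line))

    def pop_ready(self, now: int):
        """Release all lines whose delay has elapsed (for RR-table insertion)."""
        out = []
        while self.q and self.q[0][0] <= now:
            out.append(self.q.popleft()[1])
        return out
```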
Prefetch throttling (DPC2)

• Turn prefetch on only if the BO score > BADSCORE
  - DPC2: BADSCORE=1 (10 for the small-L3 configuration)
  - best-offset learning continues while prefetch is off
• Drop a prefetch request if MSHR occupancy is above a threshold
  - the MSHR threshold varies with the BO score and the L3 access rate

[Figure: MSHR threshold as a function of BO score (0 to 31) and L3 access rate (0 to 50% of DRAM bandwidth to full DRAM bandwidth); the threshold is HIGH (about 20) for high scores and low L3 access rates, LOW otherwise]
State (number of bits)

  prefetch bits (1 bit per L2 line)    2048
  recent requests (2x64x12)            1536
  scores (46x5)                         230
  delay queue (15 slots)                473
  miscellaneous                          74
  TOTAL                                4361 bits
Fixed vs. adaptive offset (437.leslie3d)

[Figure: speedup (0.9 to 1.35) vs. fixed offset (1 to 40), with horizontal reference lines for BOP and BOP without the delay queue]
Fixed vs. adaptive offset (433.milc)

[Figure: speedup (0.9 to 1.35) vs. fixed offset (1 to 40), with horizontal reference lines for BOP and BOP without the delay queue]
Fixed vs. adaptive offset (434.zeusmp)

[Figure: speedup (0.9 to 1.4) vs. fixed offset (1 to 40), with horizontal reference lines for BOP and BOP without the delay queue]
BO prefetcher vs. Sandbox prefetcher

• Sandbox prefetcher (Pugsley et al., HPCA 2014)
  - first published full-fledged offset prefetcher
  - fake prefetches evaluate an offset by setting bits in a Bloom filter
  - a demand access hitting in the Bloom filter → the fake prefetch was successful
  - prefetch timeliness is not considered
  - the Sandbox method is orthogonal to offset prefetching
• BO prefetcher
  - no fake prefetches
  - strives for prefetch timeliness
FIN