Pierre Michaud, 2nd data prefetching championship, June 2015. comparch-conf.gatech.edu/dpc2/resource/dpc2_michaud_slides.pdf

A best-offset prefetcher

Pierre Michaud

2nd data prefetching championship, june 2015

DPC2 rules

[Diagram: core, L1, L2, L3, DRAM, MSHR, and the prefetcher within the DPC2 simulator]

Information available to the prefetcher:

•  on each L2 access: physical address, L2 hit/miss, IP, time

•  on each L2 fill: L2 fill line, L2 victim line, time

•  MSHR occupancy

Constraint: prefetch address must lie in same 4KB page as demand address
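
The page constraint can be checked with integer division on byte addresses. The following is a minimal sketch, not code from the DPC2 simulator; the function name `same_page` and the constant name are hypothetical.

```python
PAGE_SIZE = 4096  # 4KB page, as stated in the DPC2 rule


def same_page(demand_addr: int, prefetch_addr: int) -> bool:
    """Return True when both byte addresses fall in the same 4KB page."""
    return demand_addr // PAGE_SIZE == prefetch_addr // PAGE_SIZE


# Last line of a page is allowed; first line of the next page is not.
print(same_page(0x1000, 0x1FC0))
print(same_page(0x1FC0, 0x2000))
```

A prefetcher following this rule would simply skip any candidate address for which the check fails.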

Offset prefetching

•  Next-line prefetching: O=1

•  Full-fledged offset prefetcher: varying offset

•  Sandbox prefetcher (Pugsley et al., HPCA 2014)

[Diagram: on a demand access to line X, prefetch line X+O, where O is the offset]

Proposed Best-Offset (BO) prefetcher

•  New method for setting the offset automatically
   - different from Sandbox
   - first implementation in an in-house simulator in 2011

•  Bandwidth & cache pollution: prefetch throttling method
   - somewhat specific to DPC2
   - DPC2 rules limit what can be done

Sequential stream

[Figure: sequential accesses 1 to 8 at byte addresses 0, 64, 128, 192, 256, 320, 384, 448, prefetched with offset=2]

•  if the offset is too small, prefetches may not be timely

(neglect page boundary effect)

Strided stream

[Figure: accesses 1 to 6 of a stream with stride=+96 bytes, drawn over the lines at byte addresses 0, 64, 128, 192, 256, 320, 384, 448, prefetched with offset=3]

•  constant byte-stride gives a periodic sequence of line-strides (1,2,1,2,...)

•  offset = sum of line-strides in a period (offset=1+2=3)

•  ...or multiple of that sum (6,9,...)
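
The stride=+96 bytes example can be reproduced numerically. This is an illustrative sketch that assumes 64-byte lines, a stream starting at byte 0, and positive strides and offsets; the helper names are hypothetical.

```python
LINE_SIZE = 64  # bytes per cache line, matching the 64-byte spacing in the figure


def accessed_lines(byte_stride: int, count: int) -> list[int]:
    """Line numbers touched by the first `count` accesses of a stream starting at byte 0."""
    return [(i * byte_stride) // LINE_SIZE for i in range(count)]


def line_strides(byte_stride: int, count: int) -> list[int]:
    """Differences between consecutive line numbers of the stream."""
    lines = accessed_lines(byte_stride, count)
    return [b - a for a, b in zip(lines, lines[1:])]


def covers_stream(byte_stride: int, offset: int, count: int = 32) -> bool:
    """True if, for each of the first `count` accessed lines X, line X+offset is also accessed."""
    checked = accessed_lines(byte_stride, count)
    limit = max(checked) + offset
    touched, i = set(), 0
    while (i * byte_stride) // LINE_SIZE <= limit:
        touched.add((i * byte_stride) // LINE_SIZE)
        i += 1
    return all(x + offset in touched for x in checked)


print(line_strides(96, 7))                                # [1, 2, 1, 2, 1, 2]
print([o for o in range(1, 10) if covers_stream(96, o)])  # [3, 6, 9]
```

Only multiples of 3 pass the check, which matches the rule that the offset should be the sum of the line-strides in a period or a multiple of it.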

Interleaved streams

•  1st stream alone: offset = multiple of 3

•  2nd stream alone: offset = multiple of 2

•  Both streams: offset = multiple of 6

[Figure: two interleaved streams, one with accesses 1 to 6 and one with accesses 1 to 4, both prefetched with offset=6]
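
An offset that suits both streams must be a multiple of each stream's period, that is, a multiple of their least common multiple. A short sketch (the function names and the upper bound of 40 are illustrative choices):

```python
from math import gcd


def lcm(a: int, b: int) -> int:
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def common_offsets(period_a: int, period_b: int, max_offset: int) -> list[int]:
    """Offsets up to max_offset that are multiples of both stream periods."""
    step = lcm(period_a, period_b)
    return list(range(step, max_offset + 1, step))


print(common_offsets(3, 2, 40))  # [6, 12, 18, 24, 30, 36]
```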

BO prefetcher: main idea

[Diagram: data flow between the L2 events, the recent requests table and best-offset learning]

•  demand line X (miss / prefetched hit): prefetch X+O, where O is the current prefetch offset

•  fill line Y (prefetched): write Y-O into the recent requests table

•  best-offset learning: test O' by looking up X-O' in the recent requests table (hit/miss?)
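
The data flow of the diagram can be sketched in a few lines. This is a toy model under simplifying assumptions: the recent requests table is an unbounded Python set, line addresses are plain integers, and the class and method names are hypothetical.

```python
class OffsetPrefetcherSketch:
    """Toy model of the diagram's data flow; not the DPC2 submission."""

    def __init__(self, offset: int):
        self.offset = offset          # current prefetch offset O, in cache lines
        self.recent_requests = set()  # stand-in for the recent requests table

    def on_demand_access(self, x: int) -> int:
        """Demand access to line X (miss or prefetched hit): return the line to prefetch."""
        return x + self.offset

    def on_prefetched_fill(self, y: int) -> None:
        """Prefetched line Y is filled: record Y - O, the line whose access triggered it."""
        self.recent_requests.add(y - self.offset)

    def test_offset(self, x: int, candidate: int) -> bool:
        """Look up X - O' in the recent requests: a hit suggests O' would have been timely."""
        return (x - candidate) in self.recent_requests


bo = OffsetPrefetcherSketch(offset=1)
target = bo.on_demand_access(100)   # prefetch line 101
bo.on_prefetched_fill(target)       # line 100 is now a completed recent request
print(bo.test_offset(103, 3), bo.test_offset(103, 2))
```

In the usage above, by the time line 103 is accessed the request made at line 100 has completed, so an offset of 3 would have brought line 103 in on time, while an offset of 2 finds no matching entry.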

Recent Requests (RR) Table

•  in 2011: 64-entry fully-associative FIFO

•  for DPC2: two direct-mapped banks with different hashing
   - resembles 2-way skewed-associative
   - 2 x 64 x 12-bit tags = 1536 bits

•  Write same tag redundantly in both banks
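
A two-bank table of this shape might look like the sketch below. The slide only says that the banks use different hashing, so both index functions and the choice of tag bits here are assumptions made for illustration.

```python
class RRTableSketch:
    ENTRIES = 64    # entries per bank (from the slide)
    TAG_BITS = 12   # tag width (from the slide)

    def __init__(self):
        self.banks = [[None] * self.ENTRIES for _ in range(2)]

    def _index(self, bank: int, line: int) -> int:
        # Assumed hashes: bank 0 uses the low bits, bank 1 XORs in the next 6 bits.
        if bank == 0:
            return line % self.ENTRIES
        return (line ^ (line >> 6)) % self.ENTRIES

    def _tag(self, line: int) -> int:
        # Assumed tag: the 12 bits above the 6 low index bits.
        return (line >> 6) & ((1 << self.TAG_BITS) - 1)

    def insert(self, line: int) -> None:
        """Write the same tag redundantly in both banks."""
        for bank in (0, 1):
            self.banks[bank][self._index(bank, line)] = self._tag(line)

    def lookup(self, line: int) -> bool:
        """Hit if either bank holds the tag at its own index."""
        return any(self.banks[b][self._index(b, line)] == self._tag(line) for b in (0, 1))


rr = RRTableSketch()
rr.insert(1000)
rr.insert(1064)  # same bank-0 index as 1000, different bank-1 index
print(rr.lookup(1000), rr.lookup(1064), rr.lookup(1001))
print(2 * RRTableSketch.ENTRIES * RRTableSketch.TAG_BITS)  # storage in bits
```

The usage shows why redundancy helps: line 1064 evicts line 1000 from bank 0, yet line 1000 is still found in bank 1 because the second hash maps the two lines to different entries.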

Learning the best offset

•  46 different offsets evaluated
   - 23 positive + 23 negative
   - 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,20,24,30,32,36,40

•  Each offset has a 5-bit score
   - 46 x 5 = 230 bits

•  Test the 46 offsets successively (46 L2 accesses) = one round
   - if hit in RR table for an offset, increment its score

•  Learning phase finishes after 100 rounds, or if one of the scores reaches 31
   - select the offset with the greatest score: this is the new prefetch offset
   - new learning phase starts: reset scores
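
The learning procedure can be sketched as a small state machine that tests one candidate offset per L2 access. The offset list, the score limit of 31 and the 100-round limit come from the slide; the class name, the `rr_hit` callable, the initial offset of 1 and the tie-breaking by list order are assumptions.

```python
POSITIVE = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
            18, 20, 24, 30, 32, 36, 40]
OFFSETS = POSITIVE + [-o for o in POSITIVE]  # 23 positive + 23 negative
SCORE_MAX = 31   # largest value of a 5-bit score
ROUND_MAX = 100


class BestOffsetLearner:
    def __init__(self):
        self.best_offset = 1  # assumption: behave as next-line until a phase completes
        self.best_score = 0
        self._reset()

    def _reset(self):
        self.scores = dict.fromkeys(OFFSETS, 0)
        self.position = 0
        self.rounds = 0

    def step(self, x, rr_hit):
        """Handle one L2 access to line x; rr_hit(line) says whether line is in the RR table."""
        candidate = OFFSETS[self.position]
        if rr_hit(x - candidate):
            self.scores[candidate] += 1
        self.position += 1
        if self.position == len(OFFSETS):  # all 46 offsets tested: one round done
            self.position = 0
            self.rounds += 1
        if self.scores[candidate] >= SCORE_MAX or self.rounds >= ROUND_MAX:
            self.best_offset = max(self.scores, key=self.scores.get)
            self.best_score = self.scores[self.best_offset]
            self._reset()  # a new learning phase starts


# Hypothetical workload: accesses to lines 0, 3, 6, ...; a request is assumed
# complete (present in the RR table) once it is at least 6 lines behind.
learner = BestOffsetLearner()
x = 0
while learner.best_score == 0:
    learner.step(x, lambda line, x=x: line % 3 == 0 and line <= x - 6)
    x += 3
print(learner.best_offset, learner.best_score)
```

In this synthetic run the offsets 6, 9, 12 and other multiples of 3 all score on every round; 6 is tested first in each round, so it is the first to reach 31 and becomes the new prefetch offset.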

Prefetch timeliness vs. prefetch accuracy

•  BO prefetcher tries to do timely prefetches

•  However...

•  Sometimes, better to choose a smaller offset, even if it generates late prefetches
   - Example: short sequential streams

•  Imperfect solution: delay queue

BO prefetcher with a delay queue

[Diagram: the main-idea data flow, with the RR table split into a left and a right bank and a delay queue added]

•  demand line X (miss / prefetched hit): prefetch X+O, where O is the current prefetch offset

•  demand line X is also written into the left RR bank after a delay of 60 cycles

•  fill line Y (prefetched): write Y-O into the right RR bank

•  best-offset learning: test O' by looking up X-O' in the RR table (hit/miss?)
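
A delay queue of this kind can be modelled as a bounded FIFO of (ready cycle, line) pairs. The 60-cycle delay is from the slide and the 15 slots are from the state budget; the class name and the policy of dropping a new entry when the queue is full are assumptions.

```python
from collections import deque

DELAY_CYCLES = 60  # from the slide
QUEUE_SLOTS = 15   # from the state budget


class DelayQueueSketch:
    def __init__(self):
        self.entries = deque()  # (ready_cycle, line), in arrival order

    def push(self, line: int, now: int) -> bool:
        """Queue demand line X at cycle `now`; assumed policy: drop it if the queue is full."""
        if len(self.entries) >= QUEUE_SLOTS:
            return False
        self.entries.append((now + DELAY_CYCLES, line))
        return True

    def pop_ready(self, now: int) -> list[int]:
        """Remove and return the lines whose delay has elapsed, oldest first."""
        ready = []
        while self.entries and self.entries[0][0] <= now:
            ready.append(self.entries.popleft()[1])
        return ready


dq = DelayQueueSketch()
dq.push(100, now=0)
print(dq.pop_ready(now=59))  # nothing is ready yet
print(dq.pop_ready(now=60))  # line 100 can now be written into the RR table
```

Lines returned by `pop_ready` are the ones that would be written into the RR table, so a small offset can still score a hit even when its prefetch would have been slightly late.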

Prefetch throttling (DPC2)

•  Turn prefetch on only if BO score > BADSCORE
   - DPC2: BADSCORE=1 (10 for small L3 config)
   - best-offset learning continues while prefetch is off

•  Drop prefetch request if MSHR occupancy is above a threshold
   - Vary MSHR threshold depending on BO score and L3 access rate

[Figure: MSHR threshold (LOW or HIGH) as a function of L3 access rate (axis marks: 0, 50% BW, DRAM BW) and BO score (axis marks: 20, 31)]
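
The two throttling checks can be combined into a single predicate. In this sketch the MSHR threshold is passed in as a parameter, because the slide does not give the exact rule that derives it from the BO score and the L3 access rate; the function name is hypothetical.

```python
BADSCORE = 1  # DPC2 value from the slide (10 for the small L3 configuration)


def should_issue_prefetch(best_score: int, mshr_occupancy: int, mshr_threshold: int) -> bool:
    """Apply both checks: prefetch must be on, and the MSHR must not be too full."""
    if best_score <= BADSCORE:
        return False  # prefetch is off; learning would continue regardless
    return mshr_occupancy <= mshr_threshold  # drop when occupancy is above the threshold


print(should_issue_prefetch(best_score=20, mshr_occupancy=4, mshr_threshold=8))
print(should_issue_prefetch(best_score=1, mshr_occupancy=0, mshr_threshold=8))
```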

State (number of bits)

prefetch bits (1 bit per L2 line)    2048
recent requests (2x64x12)            1536
scores (46x5)                         230
delay queue (15 slots)                473
miscellaneous                          74
TOTAL                                4361 bits
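
The total can be checked by recomputing each row from the dimensions given in the table. The delay queue and miscellaneous figures are taken as given, since the slide does not break them down.

```python
state_bits = {
    "prefetch bits": 2048 * 1,       # 1 bit per L2 line, hence 2048 lines
    "recent requests": 2 * 64 * 12,  # two banks of 64 twelve-bit tags
    "scores": 46 * 5,                # 46 offsets, 5-bit score each
    "delay queue": 473,              # 15 slots; per-slot layout not given on the slide
    "miscellaneous": 74,
}
total = sum(state_bits.values())
print(total)  # 4361
```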

Fixed vs. adaptive offset (437.leslie3d)

[Figure: speedup (y-axis, 0.9 to 1.35) versus fixed offset (x-axis, 1 to 40); legend: BOP, BOP w/o DQ]

Fixed vs. adaptive offset (433.milc)

[Figure: speedup (y-axis, 0.9 to 1.35) versus fixed offset (x-axis, 1 to 40); legend: BOP, BOP w/o DQ]

Fixed vs. adaptive offset (434.zeusmp)

[Figure: speedup (y-axis, 0.9 to 1.4) versus fixed offset (x-axis, 1 to 40); legend: BOP, BOP w/o DQ]

BO prefetcher vs. Sandbox prefetcher

•  Sandbox prefetcher (Pugsley et al., HPCA 2014)
   - first published full-fledged offset prefetcher
   - fake prefetches: evaluate an offset by setting bits in a Bloom filter
   - if demand access hits in Bloom filter: fake prefetch successful
   - prefetch timeliness not considered
   - Sandbox method is orthogonal to offset prefetching

•  BO prefetcher
   - no fake prefetches
   - strive for prefetch timeliness
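
To make the contrast concrete, the fake-prefetch idea described in the bullets can be sketched as follows. This toy follows only the slide's description, not the published Sandbox design; the filter size, both hash functions and the class name are assumptions.

```python
class SandboxSketch:
    FILTER_BITS = 1024  # assumed Bloom filter size

    def __init__(self, candidate: int):
        self.candidate = candidate               # offset under evaluation, in lines
        self.bits = bytearray(self.FILTER_BITS)  # one byte per filter bit, for simplicity
        self.score = 0

    def _positions(self, line: int) -> tuple[int, int]:
        # Two assumed hash functions selecting filter bits for a line address.
        return (line % self.FILTER_BITS,
                ((line * 2654435761) >> 7) % self.FILTER_BITS)

    def on_demand_access(self, line: int) -> None:
        # A hit in the filter means an earlier fake prefetch covered this line.
        if all(self.bits[p] for p in self._positions(line)):
            self.score += 1
        # Fake prefetch of line + candidate: only filter bits are set, no memory traffic.
        for p in self._positions(line + self.candidate):
            self.bits[p] = 1


sandbox = SandboxSketch(candidate=2)
for line in range(20):  # a sequential stream of 20 lines
    sandbox.on_demand_access(line)
print(sandbox.score)
```

The score counts covered accesses but says nothing about whether a real prefetch would have arrived in time, which is the point the slide makes about timeliness.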


END