IMPROVING THE PREFETCHING PERFORMANCE THROUGH CODE REGION PROFILING Martí Torrents, Raúl Martínez, and Carlos Molina Computer Architecture Department UPC – BarcelonaTech



Outline

Motivation
- Prefetching
- Prefetching in CMPs
- Prefetch adverse behaviors

Objective
- Proposal
- Code region granularity
- Switch the prefetcher off
- Switch the prefetcher on

Experimental framework

Expected Results



Motivation

• The number of cores on a single chip grows every year
  – Nehalem: 4 to 6 cores
  – Tilera: 64 to 100 cores
  – Intel Polaris: 80 cores
  – Nvidia GeForce: up to 256 cores


Prefetching

• Reduces memory latency
• Brings the data the CPU will need next into a nearer cache
• Increases the hit ratio
• Implemented in most commercial processors
• Erroneous prefetching may produce
  – Cache pollution
  – Resource consumption (queues, bandwidth, etc.)
  – Power consumption
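
Both sides of this trade-off can be reproduced with a toy model. The sketch below is not taken from the slides: it assumes a 4-line LRU cache, a simple next-line prefetcher, and two made-up access patterns (`sequential` and `hot_set`), all chosen only for illustration.

```python
from collections import OrderedDict

def run(addresses, cache_lines=4, prefetch=False):
    """Replay block addresses on a tiny LRU cache.

    Returns (hit_ratio, issued_prefetches, useful_prefetches).
    The next-line prefetcher and all sizes are illustrative assumptions.
    """
    # block address -> True while the block is a prefetch that has not been used yet
    cache = OrderedDict()
    hits = issued = useful = 0

    def insert(block, prefetched):
        if block in cache:
            return
        if len(cache) >= cache_lines:
            cache.popitem(last=False)      # evict the least recently used block
        cache[block] = prefetched

    for block in addresses:
        if block in cache:
            hits += 1
            if cache[block]:               # first demand use of a prefetched block
                useful += 1
                cache[block] = False
            cache.move_to_end(block)       # mark as most recently used
        else:
            insert(block, prefetched=False)
        if prefetch and block + 1 not in cache:
            insert(block + 1, prefetched=True)   # next-line prefetch
            issued += 1
    return hits / len(addresses), issued, useful

sequential = list(range(32))       # streaming access: prefetch-friendly
hot_set = [0, 10, 20, 30] * 8      # four hot blocks that exactly fit the cache

for name, trace in (("sequential", sequential), ("hot_set", hot_set)):
    print(name, run(trace), run(trace, prefetch=True))
```

On the streaming trace the prefetcher turns almost every access into a hit. On the hot-set trace every prefetch evicts a block that is about to be reused, so the hit ratio collapses: a small instance of the cache pollution listed above.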


Prefetch in CMPs

• A useful prefetcher improves performance
  – Avoids network latency
  – Reduces memory access latency

• A useless prefetcher degrades performance
  – More power consumption
  – More NoC congestion
  – Interference with other cores' requests


Prefetch adverse behaviors

M. Torrents, R. Martínez, C. Molina. “Network Aware Performance Evaluation of Prefetching Techniques in CMPs”. Simulation Modeling Practice and Theory (SIMPAT), 2014.


Prefetch in shared memories

• The prefetcher is distributed

• This entails challenges
  – Distributed memory streams
  – Distributed prefetch queue
  – Statistics are generated and collected at different points

• Makes the prefetcher's task more difficult

• Harder to prefetch accurately

M. Torrents, et al. “Prefetching Challenges in Distributed Memories for CMPs”. In Proceedings of the International Conference on Computational Science (ICCS'15), Reykjavík, Iceland, June 2015.
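
One way to picture the "distributed memory streams" challenge is to split a single strided miss stream across the banks of a distributed cache. The slides do not state how addresses map to banks, so the sketch below assumes a simple block-interleaved mapping (`home_bank`) over four banks; both function names and the example stream are hypothetical.

```python
def home_bank(block_addr, num_banks=4):
    """Block-interleaved mapping of a cache block to a bank (assumed policy)."""
    return block_addr % num_banks

def per_bank_streams(block_addrs, num_banks=4):
    """Split one miss stream into the sub-streams that each bank observes."""
    streams = {bank: [] for bank in range(num_banks)}
    for addr in block_addrs:
        streams[home_bank(addr, num_banks)].append(addr)
    return streams

# A stride-2 stream of block addresses: 100, 102, ..., 114.
stream = [100 + 2 * i for i in range(8)]
for bank, seen in per_bank_streams(stream).items():
    print(f"bank {bank}: {seen}")
```

Under this assumed mapping no single bank sees the original stride-2 pattern: two banks each observe half of the stream with an apparent stride of 4, and the other two observe nothing. A component that only watches one bank therefore has to work from a partial view of the stream.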



Objective

• Maximize the prefetching effect
• By using it only when it is working properly
• Minimizing its adverse effects


Proposal

• Identify when the prefetcher generates slowdown
  – Identify code regions at several granularities
  – Analyze the prefetcher performance in these regions
  – Tag these code regions with stats

• Switch the prefetcher off
  – Save power
  – Avoid network contention
  – Avoid cache pollution

• Switch it on again
  – When it generates speedup


Code Region Granularity

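• Divide the code into code regions
  – Single instructions, basic blocks, etc., or all the code

    mov ebx, 0                     ; ebx: loop index
    mov eax, 0                     ; eax: running sum
    mov ecx, 0                     ; ecx: scratch register

_Label_1:
    mov ecx, [esi + ebx * 4]       ; load the 4-byte element at index ebx
    add eax, ecx                   ; accumulate it
    inc ebx
    cmp ebx, 100
    jne _Label_1                   ; repeat until 100 elements are summed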

Instruction level


Basic block level


All the code
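
The three granularities can be made concrete by splitting the loop above into regions. This is only a sketch: `PROGRAM`, `split_regions`, and the small set of branch mnemonics are illustrative assumptions, and the basic-block rule used here (a label starts a block, a branch ends one) is the textbook definition rather than anything specified in the slides.

```python
# The instruction sequence from the slide, as (label, text) pairs.
PROGRAM = [
    (None, "mov ebx, 0"),
    (None, "mov eax, 0"),
    (None, "mov ecx, 0"),
    ("_Label_1", "mov ecx, [esi + ebx * 4]"),
    (None, "add eax, ecx"),
    (None, "inc ebx"),
    (None, "cmp ebx, 100"),
    (None, "jne _Label_1"),
]

BRANCH_MNEMONICS = {"jmp", "jne", "je", "call", "ret"}  # small illustrative subset

def split_regions(program, granularity):
    """Return a list of regions, each a list of instruction indices."""
    if granularity == "instruction":
        return [[i] for i in range(len(program))]
    if granularity == "all":
        return [list(range(len(program)))]
    if granularity == "basic_block":
        regions, current = [], []
        for i, (label, text) in enumerate(program):
            if label is not None and current:          # a branch target starts a new block
                regions.append(current)
                current = []
            current.append(i)
            if text.split()[0] in BRANCH_MNEMONICS:    # a branch ends the block
                regions.append(current)
                current = []
        if current:
            regions.append(current)
        return regions
    raise ValueError(f"unknown granularity: {granularity}")

for granularity in ("instruction", "basic_block", "all"):
    print(granularity, split_regions(PROGRAM, granularity))
```

For this loop the split yields eight single-instruction regions, two basic blocks (the three initializing `mov` instructions and the loop body), or one region covering all the code.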


Code Region Granularity

• Divide the code into code regions
  – Single instructions, basic blocks, etc., or all the code

• Identify and tag the regions
  – Statically (profiling execution)
  – Dynamically (during the warm-up)

• Regions are tagged with statistics
  – Accuracy / miss ratio

• Activate or deactivate the prefetcher at every new code region
  – According to the statistics of the current code region
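
A minimal sketch of the static variant, under stated assumptions: a profiling run is taken to produce per-region prefetch counters, which are turned into an on/off table consulted whenever execution enters a new region. `RegionStats`, `build_policy`, `prefetcher_enabled`, the 0.5 accuracy threshold, and the profile values are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RegionStats:
    issued: int = 0   # prefetches issued while executing the region
    useful: int = 0   # prefetched blocks later hit by a demand access

    @property
    def accuracy(self):
        """Useful / issued, or None when the region issued no prefetches."""
        return self.useful / self.issued if self.issued else None

def build_policy(profile, min_accuracy=0.5):
    """Map each region id to True (prefetcher on) or False (prefetcher off).

    Regions without profiled prefetches keep the prefetcher on.
    """
    policy = {}
    for region_id, stats in profile.items():
        acc = stats.accuracy
        policy[region_id] = True if acc is None else acc >= min_accuracy
    return policy

def prefetcher_enabled(region_id, policy, default=True):
    """Decision taken when execution enters a new code region."""
    return policy.get(region_id, default)

# Hypothetical profiling output, keyed by the address of each region's first instruction.
profile = {
    0x400000: RegionStats(issued=200, useful=180),
    0x400040: RegionStats(issued=150, useful=15),
    0x400080: RegionStats(),
}
policy = build_policy(profile)
print({hex(region): state for region, state in policy.items()})
```

The dynamic variant would fill the same table during warm-up instead of in a separate profiling run; the lookup at region entry stays the same.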


Switching off the prefetcher

• Detect when the prefetcher is useless

• Accuracy
  – Useful prefetches / total number of prefetches
  – Switch off when the accuracy decreases

• Miss ratio
  – Based on the number of misses
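
The accuracy criterion can be sketched as a windowed counter. The slides only give the ratio (useful prefetches over total prefetches); the window size, the threshold, and the `AccuracyMonitor` class itself are assumptions made for this example.

```python
class AccuracyMonitor:
    """Tracks prefetch accuracy over fixed-size windows of issued prefetches."""

    def __init__(self, window=64, threshold=0.4):
        self.window = window          # prefetches per evaluation window (assumed)
        self.threshold = threshold    # minimum acceptable accuracy (assumed)
        self.issued = 0
        self.useful = 0
        self.enabled = True

    def record(self, was_useful):
        """Record the outcome of one prefetch; return whether the prefetcher stays on."""
        if not self.enabled:
            return False
        self.issued += 1
        self.useful += int(was_useful)
        if self.issued == self.window:
            accuracy = self.useful / self.issued
            if accuracy < self.threshold:
                self.enabled = False        # accuracy dropped too low: switch off
            self.issued = self.useful = 0   # start a new window
        return self.enabled

monitor = AccuracyMonitor(window=4, threshold=0.5)
good_phase = [monitor.record(u) for u in (True, True, False, True)]
bad_phase = [monitor.record(u) for u in (False, False, False, True)]
print(good_phase, bad_phase)
```

In hardware the outcome of a prefetch is only known once the block is used or evicted, so `record` would be driven by those events rather than called at issue time.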


Switching on the prefetcher

• A switched-off prefetcher does not generate stats

• It cannot be reactivated based on an accuracy increase

• Reactivate when?
  – Based on miss ratio
  – After a certain timeout
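
Both reactivation options rely only on signals that exist while the prefetcher is off. The sketch below combines them in one hypothetical controller, `PrefetchSwitch`; the timeout length, the miss-ratio limit, and the minimum sample count are assumed values, not figures from the slides.

```python
class PrefetchSwitch:
    """On/off controller whose reactivation uses only signals available while off."""

    def __init__(self, timeout=1000, miss_ratio_limit=0.2, min_samples=16):
        self.timeout = timeout                    # demand accesses to stay off (assumed)
        self.miss_ratio_limit = miss_ratio_limit  # reactivation threshold (assumed)
        self.min_samples = min_samples            # accesses needed before judging the ratio
        self.enabled = True
        self._off_accesses = 0
        self._off_misses = 0

    def switch_off(self):
        self.enabled = False
        self._off_accesses = self._off_misses = 0

    def on_demand_access(self, was_miss):
        """Call on every demand access; returns the current on/off state."""
        if self.enabled:
            return True
        self._off_accesses += 1
        self._off_misses += int(was_miss)
        timed_out = self._off_accesses >= self.timeout
        enough_samples = self._off_accesses >= self.min_samples
        miss_ratio = self._off_misses / self._off_accesses
        if timed_out or (enough_samples and miss_ratio > self.miss_ratio_limit):
            self.enabled = True    # give the prefetcher another chance
        return self.enabled

# Reactivation by miss ratio: four misses in a row while switched off.
by_misses = PrefetchSwitch(timeout=100, miss_ratio_limit=0.5, min_samples=4)
by_misses.switch_off()
miss_states = [by_misses.on_demand_access(True) for _ in range(4)]

# Reactivation by timeout: one hundred hits while switched off.
by_timeout = PrefetchSwitch(timeout=100, miss_ratio_limit=0.5, min_samples=4)
by_timeout.switch_off()
timeout_states = [by_timeout.on_demand_access(False) for _ in range(100)]
print(miss_states, timeout_states[-2:])
```

The timeout acts as a fallback: even if the miss ratio stays low, the prefetcher is eventually re-enabled so that fresh accuracy statistics can be gathered.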



Experimental framework

• Gem5
  – 16 x86 CPUs
  – Ruby memory system
  – L1 prefetchers
  – MOESI coherency protocol
  – Garnet network simulator

• PARSEC 2.1


Simulation environment



Expected Results

• Power savings without losing performance

• Smaller granularity, more accuracy
  – Blocks or superblocks better than the whole code
  – Single instructions more accurate than blocks or superblocks

• Smaller granularity also requires
  – More resources
  – More complexity

• Basic block granularity should provide good results with a realistic complexity


Q & A


Back up slides

Prefetch Distributed Memory Systems

• Increases the complexity of prefetching

• Challenges without trivial solutions

[Figure, built up over seven slides: four CPUs, each with its own L1 and prefetcher, on top of a distributed L2 memory. Labels appearing in the builds: "@", "@+2", "@+4", "L1 MISS for @", "L1 MISS for @ + 2". Build captions: Distributed patterns, Queue filtering, Dynamic profiling.]