Prefetch-Aware Shared-Resource Management for Multi-Core Systems


Page 1: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Prefetch-Aware Shared-Resource Management for Multi-Core Systems

Eiman Ebrahimi*    Chang Joo Lee*+    Onur Mutlu‡    Yale N. Patt*

* HPS Research Group, The University of Texas at Austin
‡ Computer Architecture Laboratory, Carnegie Mellon University
+ Intel Corporation, Austin

Page 2: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Background and Problem

[System diagram: Cores 0 through N, each with its own prefetcher, share on-chip resources (shared cache, memory controller) and, beyond the chip boundary, off-chip DRAM Banks 0 through K. These are the shared memory resources.]

Page 3: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Background and Problem
- Understand the impact of prefetching on previously proposed shared resource management techniques

Page 4: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Background and Problem
- Understand the impact of prefetching on previously proposed shared resource management techniques
  - Fair cache management techniques
  - Fair memory controllers
  - Fair management of on-chip interconnect
  - Fair management of multiple shared resources

Page 5: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Background and Problem
- Understand the impact of prefetching on previously proposed shared resource management techniques
  - Fair cache management techniques
  - Fair memory controllers
    - Network Fair Queuing (Nesbit et al., MICRO’06)
    - Parallelism-Aware Batch Scheduling (Mutlu et al., ISCA’08)
  - Fair management of on-chip interconnect
  - Fair management of multiple shared resources
    - Fairness via Source Throttling (Ebrahimi et al., ASPLOS’10)

Page 6: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Background and Problem
- Fair memory scheduling technique: Network Fair Queuing (NFQ)
  - Improves fairness and performance with no prefetching
  - Significant degradation of performance and fairness in the presence of prefetching

[Charts: normalized performance and maximum slowdown for FR-FCFS and NFQ, with no prefetching and with aggressive stream prefetching]

Page 7: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Background and Problem
- Understanding the impact of prefetching on previously proposed shared resource management techniques
  - Fair cache management techniques
  - Fair memory controllers
  - Fair management of on-chip interconnect
  - Fair management of multiple shared resources
- Goal: Devise general mechanisms for taking prefetch requests into account in fairness techniques

Page 8: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Background and Problem
- Prior work addresses inter-application interference caused by prefetches
  - Hierarchical Prefetcher Aggressiveness Control (HPAC) (Ebrahimi et al., MICRO’09)
    - Dynamically detects interference caused by prefetches and throttles down overly aggressive prefetchers
- Even with controlled prefetching, fairness techniques should be made prefetch-aware
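To make the throttling idea concrete, here is a rough feedback-loop sketch in Python. It is not HPAC itself: the aggressiveness levels, the 0.40/0.75 thresholds, and the interference signal are illustrative assumptions, and the hierarchical combination of per-core and system-wide feedback used by HPAC is omitted.

# Hypothetical aggressiveness levels: (prefetch distance, prefetch degree)
LEVELS = [(4, 1), (8, 1), (16, 2), (32, 4), (64, 4)]

class PrefetcherThrottle:
    def __init__(self):
        self.level = len(LEVELS) - 1  # start at the most aggressive setting

    def update(self, accuracy, caused_interference):
        # Called once per sampling interval.
        #   accuracy: useful_prefetches / issued_prefetches in the interval
        #   caused_interference: True if this core's prefetches were observed
        #   polluting other cores' cache lines or delaying their DRAM requests
        if caused_interference or accuracy < 0.40:             # assumed threshold
            self.level = max(0, self.level - 1)                # throttle down
        elif accuracy > 0.75:                                  # assumed threshold
            self.level = min(len(LEVELS) - 1, self.level + 1)  # throttle up
        return LEVELS[self.level]  # (distance, degree) for the next interval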

Page 9: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Outline
- Problem Statement
- Motivation for Special Treatment of Prefetches
- Prefetch-Aware Shared Resource Management
- Evaluation
- Conclusion

Page 10: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Parallelism-Aware Batch Scheduling (PAR-BS) [Mutlu & Moscibroda, ISCA’08]
- Principle 1: Parallelism-awareness
  - Schedules requests from each thread to different banks back to back
  - Preserves each thread’s bank parallelism
- Principle 2: Request Batching
  - Marks a fixed number of oldest requests from each thread to form a “batch”
  - Eliminates starvation and provides fairness

[Diagram: request queues for Bank 0 and Bank 1 holding requests from threads T0–T3; the oldest requests of each thread are marked to form the current batch]
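The batching and ranking rules above can be condensed into a small scheduling sketch. This is a simplified Python illustration rather than the hardware mechanism: the MemRequest fields, the marking cap value, and the exact ranking rule (threads with the smallest maximum per-bank marked load ranked first) are assumptions chosen to mirror the two principles.

from collections import defaultdict
from dataclasses import dataclass

MARKING_CAP = 5  # assumed cap on marked requests per thread per bank

@dataclass
class MemRequest:
    thread_id: int
    bank: int
    row: int
    arrival_time: int
    marked: bool = False  # True once the request belongs to the current batch

def form_batch(bank_queues):
    # Principle 2 (request batching): mark the oldest MARKING_CAP requests of
    # each thread in every bank queue; marked requests are serviced before any
    # unmarked ones, which bounds how long a thread can be starved.
    for reqs in bank_queues.values():
        marked_per_thread = defaultdict(int)
        for r in sorted(reqs, key=lambda r: r.arrival_time):
            if marked_per_thread[r.thread_id] < MARKING_CAP:
                r.marked = True
                marked_per_thread[r.thread_id] += 1

def rank_threads(bank_queues):
    # Principle 1 (parallelism-awareness): all banks use the same thread
    # ranking, so one thread's requests tend to be serviced back to back in
    # different banks, preserving its bank-level parallelism. Threads with the
    # smallest maximum per-bank marked load are ranked highest.
    load = defaultdict(lambda: defaultdict(int))
    for bank, reqs in bank_queues.items():
        for r in reqs:
            if r.marked:
                load[r.thread_id][bank] += 1
    return sorted(load, key=lambda t: max(load[t].values()))

def next_request(bank_queue, open_row, thread_rank):
    # Per-bank selection: marked requests first, then row-buffer hits, then
    # requests from higher-ranked threads, then older requests.
    pos = {t: i for i, t in enumerate(thread_rank)}
    return min(bank_queue,
               key=lambda r: (not r.marked,
                              r.row != open_row,
                              pos.get(r.thread_id, len(pos)),
                              r.arrival_time))

A new batch is formed (and the threads re-ranked) once all currently marked requests have been serviced.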

Page 11: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Impact of Prefetching on Parallelism-Aware Batch Scheduling
- Policy (a): Include prefetches and demands alike when generating a batch
- Policy (b): Prefetches are not included alongside demands when generating a batch

Page 12: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Impact of Prefetching on Parallelism-Aware Batch Scheduling

[Service-order example: two cores share DRAM Banks 1 and 2; Core 1 issues an inaccurate prefetch plus demands, Core 2 issues accurate prefetches plus demands. With Policy (a), mark prefetches in PAR-BS, Core 2’s accurate prefetches are serviced early enough that its later P2 accesses hit and cycles are saved, but Core 1’s inaccurate prefetch is serviced inside the batch and extends other requests’ stall time. With Policy (b), don’t mark prefetches in PAR-BS, demands are protected from the inaccurate prefetch, but Core 2’s accurate prefetches are serviced too late, its accesses miss, and it stalls.]

Page 13: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Impact of Prefetching on Parallelism-Aware Batch Scheduling
- Policy (a): Include prefetches and demands alike when generating a batch
  - Pros: Accurate prefetches will be more timely
  - Cons: Inaccurate prefetches from one thread can unfairly delay demands and accurate prefetches of others
- Policy (b): Prefetches are not included alongside demands when generating a batch
  - Pros: Inaccurate prefetches cannot unfairly delay demands of other cores
  - Cons: Accurate prefetches will be less timely
    - Less performance benefit from prefetching

Page 14: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Outline
- Problem Statement
- Motivation for Special Treatment of Prefetches
- Prefetch-Aware Shared Resource Management
- Evaluation
- Conclusion

Page 15: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Prefetch-Aware Shared Resource Management
- Three key ideas:
  - Fair memory controllers: Extend underlying prioritization policies to distinguish between prefetches based on prefetch accuracy
  - Fairness via source-throttling technique: Coordinate core and prefetcher throttling decisions
  - Demand boosting for memory non-intensive applications

Page 16: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Prefetch-Aware Shared Resource Management
- Three key ideas:
  - Fair memory controllers: Extend underlying prioritization policies to distinguish between prefetches based on prefetch accuracy
  - Fairness via source-throttling technique: Coordinate core and prefetcher throttling decisions
  - Demand boosting for memory non-intensive applications

Page 17: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Prefetch-Aware PAR-BS (P-PARBS)

[Recap of Policy (a), mark prefetches in PAR-BS: the batch over Banks 1 and 2 contains both Core 1’s inaccurate prefetch and Core 2’s accurate prefetches; Core 2’s subsequent P2 accesses hit, at the cost of servicing the inaccurate prefetch within the batch.]

Page 18: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Prefetch-Aware PAR-BS (P-PARBS)

[Comparison over Banks 1 and 2 for Cores 1 and 2: with Policy (b), don’t mark prefetches in PAR-BS, Core 2’s accurate prefetches are serviced too late, its P2 accesses miss, and it stalls, while Core 1 saves cycles because its demands are not delayed by the inaccurate prefetch; with our policy, mark accurate prefetches only, Core 2’s accurate prefetches are batched and serviced in time, so its P2 accesses hit, while Core 1’s inaccurate prefetch is left out of the batch.]

Underlying prioritization policies need to distinguish between prefetches based on accuracy.
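In code, the three batching policies differ only in which requests are eligible to be marked when a new batch is formed. A minimal sketch, assuming a per-core, interval-based accuracy estimate and a 90% cutoff (both illustrative, not the paper’s exact parameters):

from dataclasses import dataclass

ACCURACY_THRESHOLD = 0.90  # assumed cutoff for treating a core's prefetches as accurate

@dataclass
class Request:
    thread_id: int
    is_prefetch: bool

def estimate_accuracy(useful_prefetches, issued_prefetches):
    # Per-core accuracy over the last interval: fraction of issued prefetches
    # that were demanded by the core before being evicted.
    return useful_prefetches / issued_prefetches if issued_prefetches else 0.0

def eligible_for_batch(req, core_accuracy, policy):
    # policy "a":              demands and all prefetches are marked
    # policy "b":              demands only
    # policy "prefetch-aware": demands plus prefetches from cores whose current
    #                          prefetch accuracy estimate is high
    if not req.is_prefetch:
        return True
    if policy == "a":
        return True
    if policy == "b":
        return False
    return core_accuracy[req.thread_id] >= ACCURACY_THRESHOLD

Everything else in PAR-BS (thread ranking, per-bank selection) is left unchanged; only the marking step consults the accuracy estimate.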

Page 19: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Prefetch-Aware Shared Resource Management
- Three key ideas:
  - Fair memory controllers: Extend underlying prioritization policies to distinguish between prefetches based on prefetch accuracy
  - Fairness via source-throttling technique: Coordinate core and prefetcher throttling decisions (see the decision sketch below)
  - Demand boosting for memory non-intensive applications
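One way to read the “coordinate core and prefetcher throttling decisions” idea is the decision sketch below: when source throttling identifies a core as interfering with others, check whether the interference comes mostly from its prefetches and whether those prefetches are accurate, and throttle the prefetcher rather than the core when inaccurate prefetches are the culprit. The counters, threshold, and return values are hypothetical, not the paper’s exact mechanism.

def coordinated_throttle_decision(core_id,
                                  interference_by_prefetches,
                                  interference_by_demands,
                                  prefetch_accuracy,
                                  accuracy_threshold=0.60):  # assumed
    # interference_by_prefetches / interference_by_demands: how often this
    # core's prefetches vs. demands delayed other cores' requests in the last
    # interval; prefetch_accuracy: this core's estimated prefetch accuracy.
    prefetch_dominated = interference_by_prefetches > interference_by_demands
    if prefetch_dominated and prefetch_accuracy < accuracy_threshold:
        # Inaccurate prefetches are the main culprit: throttle the prefetcher
        # down and leave the core's demand injection rate alone.
        return {"core": core_id, "prefetcher": "down", "demands": "keep"}
    # Otherwise slow the core's request rate; an accurate prefetcher keeps
    # running so its benefit is not lost.
    return {"core": core_id, "prefetcher": "keep", "demands": "down"}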

Page 20: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

[Example over Banks 1 and 2 comparing the service order without and with demand boosting. Core 1 is memory non-intensive and issues only a few demands; Core 2 is memory intensive and issues many demands and prefetches. Without boosting, Core 1’s demands are serviced last, behind Core 2’s requests; with boosting, they are serviced first.]

Demand boosting eliminates starvation of memory non-intensive applications (see the sketch below).
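A minimal sketch of the boosting rule, assuming a per-core memory-intensity estimate (e.g. last-level-cache misses per kilo-instruction) and a hypothetical intensity cutoff; the unbounded boosting and the base_key fallback are simplifications, not the paper’s exact mechanism.

from dataclasses import dataclass

BOOST_MPKI_THRESHOLD = 1.5  # assumed cutoff: cores below this MPKI are "non-intensive"

@dataclass
class Request:
    thread_id: int
    is_prefetch: bool
    arrival_time: int

def is_boosted(req, core_mpki):
    # Boost only the demand requests of memory non-intensive cores, so a flood
    # of an intensive core's requests and prefetches cannot starve them.
    return (not req.is_prefetch) and core_mpki[req.thread_id] < BOOST_MPKI_THRESHOLD

def service_order(bank_queue, core_mpki, base_key):
    # Boosted demands are serviced first; all other requests keep the order
    # given by the underlying policy (e.g. P-PARBS prioritization).
    return sorted(bank_queue,
                  key=lambda r: (not is_boosted(r, core_mpki), base_key(r)))

For example, service_order(queue, mpki, base_key=lambda r: r.arrival_time) falls back to oldest-first among requests that are not boosted.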

Page 21: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Prefetch-Aware Shared Resource Management
- Three key ideas:
  - Fair memory controllers: Extend underlying prioritization policies to distinguish between prefetches based on prefetch accuracy
  - Fairness via source-throttling technique: Coordinate core and prefetcher throttling decisions
  - Demand boosting for memory non-intensive applications

Page 22: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Outline
- Problem Statement
- Motivation for Special Treatment of Prefetches
- Prefetch-Aware Shared Resource Management
- Evaluation
- Conclusion

Page 23: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Evaluation Methodology
- x86 cycle-accurate simulator
- Baseline processor configuration
  - Per-core
    - 4-wide issue, out-of-order, 256-entry ROB
  - Shared (4-core system)
    - 128 MSHRs
    - 2MB, 16-way L2 cache
  - Main memory
    - DDR3 1333 MHz
    - Latency of 15 ns per command (tRP, tRCD, CL)
    - 8B-wide core-to-memory bus

Page 24: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

System Performance Results

[Charts: normalized system performance for NFQ, PARBS, and FST (core throttling), each evaluated with no prefetching, aggressive prefetching, HPAC, and the prefetch-aware mechanisms. The prefetch-aware versions show gains of 11%, 10.9%, and 11.3% for NFQ, PARBS, and FST, respectively.]

Page 25: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Max Slowdown Results

[Charts: maximum slowdown for NFQ, PARBS, and FST (core throttling), each evaluated with no prefetching, aggressive prefetching, HPAC, and the prefetch-aware mechanisms. The prefetch-aware versions reduce maximum slowdown by 9.9%, 18.4%, and 14.5% for NFQ, PARBS, and FST, respectively.]

Page 26: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Conclusion
- State-of-the-art fair shared-resource management techniques can be harmful in the presence of prefetching
- Their underlying prioritization techniques need to be extended to differentiate prefetches based on accuracy
- Core and prefetcher throttling should be coordinated with source-based resource management techniques
- Demand boosting eliminates starvation of memory non-intensive applications
- Our mechanisms improve both fair memory schedulers and source throttling in both system performance and fairness by more than 10%

Page 27: Prefetch-Aware  Shared-Resource Management for Multi-Core Systems

Prefetch-Aware Shared-Resource Management for Multi-Core Systems

Eiman Ebrahimi*    Chang Joo Lee*+    Onur Mutlu‡    Yale N. Patt*

* HPS Research Group, The University of Texas at Austin
‡ Computer Architecture Laboratory, Carnegie Mellon University
+ Intel Corporation, Austin