TRANSCRIPT
Prefetch-Aware Shared-Resource Management
for Multi-Core Systems
Eiman Ebrahimi*
Chang Joo Lee*+
Onur Mutlu‡
Yale N. Patt*
* HPS Research Group, The University of Texas at Austin
‡ Computer Architecture Laboratory, Carnegie Mellon University
+ Intel Corporation, Austin
Background and Problem
[Diagram: Cores 0 through N, each with its own prefetcher, sit on one side of the chip boundary and share an on-chip cache and memory controller; off-chip DRAM Banks 0 through K complete the shared memory resources.]
Background and Problem
Understand the impact of prefetching on previously proposed shared-resource management techniques:
- Fair cache management techniques
- Fair memory controllers
  - Network Fair Queuing (Nesbit et al., MICRO'06)
  - Parallelism-Aware Batch Scheduling (Mutlu et al., ISCA'08)
- Fair management of the on-chip interconnect
- Fair management of multiple shared resources
  - Fairness via Source Throttling (Ebrahimi et al., ASPLOS'10)
Background and Problem
Fair memory scheduling technique: Network Fair Queuing (NFQ)
- Improves fairness and performance with no prefetching
- Significant degradation of performance and fairness in the presence of prefetching
[Chart: performance and max slowdown of FR-FCFS vs. NFQ, with no prefetching and with aggressive stream prefetching.]
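The "fair queueing" idea behind NFQ can be sketched in a few lines. This is a simplified, illustrative model of virtual-finish-time scheduling, not the actual hardware scheduler; the unit service cost and the share table are assumptions:

```python
# Illustrative fair-queueing sketch in the spirit of NFQ (not the real
# hardware scheduler): each thread accrues virtual time in proportion to
# the service it receives divided by its bandwidth share; the scheduler
# services the request whose thread has the earliest virtual finish time.

SERVICE_TIME = 1.0  # assume unit service cost per request (simplification)

class FairQueueScheduler:
    def __init__(self, shares):
        self.shares = shares                     # thread_id -> bandwidth share
        self.vfinish = {t: 0.0 for t in shares}  # per-thread virtual finish time

    def pick(self, queue):
        """queue: list of (thread_id, request) tuples. Returns the chosen tuple."""
        if not queue:
            return None
        choice = min(queue, key=lambda tr: self.vfinish[tr[0]])
        tid = choice[0]
        self.vfinish[tid] += SERVICE_TIME / self.shares[tid]
        return choice
```

With equal shares, a thread that has already been serviced falls behind in virtual time, so a backlogged thread cannot monopolize the banks.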
Background and Problem
Understanding the impact of prefetching on previously proposed shared-resource management techniques: fair cache management techniques, fair memory controllers, fair management of the on-chip interconnect, and fair management of multiple shared resources
Goal: Devise general mechanisms for taking prefetch requests into account in fairness techniques
Background and Problem
Prior work addresses inter-application interference caused by prefetches:
- Hierarchical Prefetcher Aggressiveness Control (Ebrahimi et al., MICRO'09) dynamically detects interference caused by prefetches and throttles down overly aggressive prefetchers
Even with controlled prefetching, fairness techniques should be made prefetch-aware
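As a rough illustration of accuracy-driven prefetcher throttling, here is a simplified sketch in the spirit of such schemes, not HPAC's actual mechanism; the thresholds and the degree ladder are invented values:

```python
# Illustrative sketch (not HPAC's exact policy): a feedback loop that
# lowers a prefetcher's aggressiveness ("degree") when its measured
# accuracy drops, and raises it again when accuracy is high.

DEGREES = [0, 1, 2, 4, 8, 16]      # prefetch degrees, least to most aggressive
HIGH_ACC, LOW_ACC = 0.75, 0.40     # hypothetical accuracy thresholds

class PrefetcherThrottle:
    def __init__(self):
        self.level = len(DEGREES) - 1  # start fully aggressive
        self.useful = 0                # prefetches later hit by a demand
        self.issued = 0

    def record(self, was_useful):
        self.issued += 1
        self.useful += was_useful

    def interval_end(self):
        """Called at the end of each sampling interval; returns new degree."""
        acc = self.useful / self.issued if self.issued else 1.0
        if acc < LOW_ACC and self.level > 0:
            self.level -= 1            # throttle down: inaccurate prefetcher
        elif acc > HIGH_ACC and self.level < len(DEGREES) - 1:
            self.level += 1            # throttle up: prefetches are useful
        self.useful = self.issued = 0
        return DEGREES[self.level]
```

A real scheme would also weigh bandwidth and cache-pollution interference with other cores, not accuracy alone.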
Outline
- Problem Statement
- Motivation for Special Treatment of Prefetches
- Prefetch-Aware Shared Resource Management
- Evaluation
- Conclusion
Parallelism-Aware Batch Scheduling (PAR-BS) [Mutlu & Moscibroda ISCA’08]
Principle 1: Parallelism-awareness
- Schedules requests from each thread to different banks back to back
- Preserves each thread's bank parallelism
Principle 2: Request Batching
- Marks a fixed number of oldest requests from each thread to form a "batch"
- Eliminates starvation and provides fairness
[Diagram: requests from threads T0 through T3 queued at Bank 0 and Bank 1; the oldest requests from each thread are marked to form a batch.]
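The batching principle can be sketched in software as follows. This is my own simplification, not the PAR-BS hardware; the `MARKING_CAP` value and the request representation are assumptions:

```python
# Simplified software sketch of PAR-BS request batching (not the actual
# hardware): mark up to MARKING_CAP oldest outstanding requests from each
# thread; the scheduler then always prefers marked (batched) requests,
# which bounds how long any thread's requests can be deferred.

MARKING_CAP = 5  # max marked requests per thread per batch (assumed value)

def form_batch(queue):
    """queue: list of dicts with 'thread', 'arrival', 'marked' keys."""
    per_thread = {}
    for req in sorted(queue, key=lambda r: r["arrival"]):  # oldest first
        n = per_thread.get(req["thread"], 0)
        if n < MARKING_CAP:
            req["marked"] = True
            per_thread[req["thread"]] = n + 1

def pick_next(queue):
    """Prefer marked requests; break ties by age (oldest first)."""
    if not queue:
        return None
    return min(queue, key=lambda r: (not r["marked"], r["arrival"]))
```

Within a batch, the real scheduler additionally ranks threads to preserve each thread's bank-level parallelism; that ranking step is omitted here.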
Impact of Prefetching on Parallelism-Aware Batch Scheduling
Policy (a): Include prefetches and demands alike when generating a batch
Policy (b): Prefetches are not included alongside demands when generating a batch
Impact of Prefetching on Parallelism-Aware Batch Scheduling
[Diagram: DRAM service order at Banks 1 and 2 for Cores 1 and 2. Under Policy (a), "mark prefetches in PAR-BS," accurate prefetches are serviced in time, so the core hits on P2 and saves cycles, but an inaccurate prefetch (P1) in the batch delays the other core's demands. Under Policy (b), "don't mark prefetches in PAR-BS," inaccurate prefetches cannot delay demands, but accurate prefetches arrive too late: the would-be hits on P2 become misses and both cores stall longer.]
Impact of Prefetching on Parallelism-Aware Batch Scheduling
Policy (a): Include prefetches and demands alike when generating a batch
- Pros: Accurate prefetches will be more timely
- Cons: Inaccurate prefetches from one thread can unfairly delay demands and accurate prefetches of others
Policy (b): Prefetches are not included alongside demands when generating a batch
- Pros: Inaccurate prefetches cannot unfairly delay demands of other cores
- Cons: Accurate prefetches will be less timely, so there is less performance benefit from prefetching
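The two policies differ only in which requests are eligible for marking. A minimal sketch, in my own formulation of the slide's options (the request representation is assumed):

```python
# Minimal sketch of the two baseline batch-marking policies. A request
# is a dict with an 'is_prefetch' flag; marking decides what enters the
# batch and thus what can be prioritized over other cores' requests.

def mark_policy_a(req):
    """Policy (a): include prefetches and demands alike in the batch."""
    return True

def mark_policy_b(req):
    """Policy (b): demands only; prefetches are never batched."""
    return not req["is_prefetch"]

def form_batch(queue, mark_fn):
    """Return the subset of queued requests that joins the batch."""
    return [req for req in queue if mark_fn(req)]
```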
Outline
- Problem Statement
- Motivation for Special Treatment of Prefetches
- Prefetch-Aware Shared Resource Management
- Evaluation
- Conclusion
Prefetch-Aware Shared Resource Management
Three key ideas:
- Fair memory controllers: extend underlying prioritization policies to distinguish between prefetches based on prefetch accuracy
- Fairness via source throttling: coordinate core and prefetcher throttling decisions
- Demand boosting for memory non-intensive applications
Prefetch-Aware PAR-BS (P-PARBS)
[Diagram recap of Policy (a), "mark prefetches in PAR-BS": prefetches enter the batch, so accurate prefetches hit (Hit P2), but an inaccurate prefetch (P1) is serviced within the batch ahead of the other core's demands.]
Prefetch-Aware PAR-BS (P-PARBS)
[Diagram: under Policy (b), "don't mark prefetches in PAR-BS," accurate prefetches are serviced too late and the cores miss. Under our policy, "mark accurate prefetches," only prefetches estimated to be accurate join the batch: accurate prefetches stay timely (Hit P2) while inaccurate prefetches cannot delay other cores' demands, saving cycles on both cores.]
Underlying prioritization policies need to distinguish between prefetches based on accuracy
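The proposed policy, marking only accurate prefetches, can be sketched as a marking predicate driven by a per-core accuracy estimate. The threshold and the bookkeeping below are illustrative assumptions, not the paper's exact parameters:

```python
# Illustrative sketch of accuracy-based batch marking (parameters are
# assumptions): a prefetch joins the batch only if its core's prefetcher
# has recently been accurate enough; demands always join.

ACC_THRESHOLD = 0.6  # hypothetical cutoff for "accurate" prefetching

class AccuracyTracker:
    """Tracks, per core, the fraction of prefetches later used by a demand."""
    def __init__(self):
        self.useful = {}
        self.issued = {}

    def record(self, core, was_useful):
        self.issued[core] = self.issued.get(core, 0) + 1
        self.useful[core] = self.useful.get(core, 0) + was_useful

    def accuracy(self, core):
        n = self.issued.get(core, 0)
        return self.useful.get(core, 0) / n if n else 0.0

def mark_accurate_prefetches(req, tracker):
    """Demands always batch; prefetches batch only if the issuing core's
    measured prefetch accuracy is above the threshold."""
    if not req["is_prefetch"]:
        return True
    return tracker.accuracy(req["core"]) >= ACC_THRESHOLD
```

This keeps Policy (a)'s timeliness for cores whose prefetchers are accurate while retaining Policy (b)'s protection against inaccurate prefetchers.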
Prefetch-Aware Shared Resource Management
Three key ideas:
- Fair memory controllers: extend underlying prioritization policies to distinguish between prefetches based on prefetch accuracy
- Fairness via source throttling: coordinate core and prefetcher throttling decisions
- Demand boosting for memory non-intensive applications
[Diagram: service order at Banks 1 and 2 without and with demand boosting. Legend: Core 1 demands, Core 2 demands, Core 2 prefetches; Core 1 is memory non-intensive, Core 2 is memory-intensive. Without boosting, Core 1's few demands are serviced last, behind Core 2's demands and prefetches; with boosting, Core 1's demands are serviced first.]
Demand boosting eliminates starvation of memory non-intensive applications
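Demand boosting can be sketched as an extra top-priority rule in the scheduler's comparator. The MPKI-based intensity classification and its threshold below are assumptions for illustration, not the paper's exact mechanism:

```python
# Illustrative sketch of demand boosting (parameters are assumptions):
# demands from cores classified as memory non-intensive are raised above
# everything else; otherwise marked/batched requests win, then older ones.

MPKI_THRESHOLD = 1.0  # hypothetical misses-per-kilo-instruction cutoff

def is_non_intensive(core_mpki):
    return core_mpki < MPKI_THRESHOLD

def priority_key(req, mpki):
    """Lower tuple = serviced earlier. req has 'core', 'is_prefetch',
    'marked', 'arrival'; mpki maps core -> measured MPKI."""
    boosted = (not req["is_prefetch"]) and is_non_intensive(mpki[req["core"]])
    return (not boosted, not req["marked"], req["arrival"])

def pick_next(queue, mpki):
    return min(queue, key=lambda r: priority_key(r, mpki)) if queue else None
```

Because a non-intensive core issues few requests, boosting them costs the intensive core little bandwidth while preventing its starvation.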
Prefetch-Aware Shared Resource Management
Three key ideas:
- Fair memory controllers: extend underlying prioritization policies to distinguish between prefetches based on prefetch accuracy
- Fairness via source throttling: coordinate core and prefetcher throttling decisions
- Demand boosting for memory non-intensive applications
Outline
- Problem Statement
- Motivation for Special Treatment of Prefetches
- Prefetch-Aware Shared Resource Management
- Evaluation
- Conclusion
Evaluation Methodology
- x86 cycle-accurate simulator
- Baseline processor configuration:
  - Per-core: 4-wide issue, out-of-order, 256-entry ROB
  - Shared (4-core system): 128 MSHRs; 2MB, 16-way L2 cache
  - Main memory: DDR3, 1333 MHz; latency of 15 ns per command (tRP, tRCD, CL); 8B-wide core-to-memory bus
System Performance Results
[Chart: normalized system performance for NFQ, PARBS, and FST (core throttling), each evaluated with No Prefetching, Aggressive Prefetching, HPAC, and the proposed Prefetch-Aware mechanisms. Annotated gains: 11% (NFQ), 10.9% (PARBS), 11.3% (FST).]
Max Slowdown Results
[Chart: maximum slowdown for NFQ, PARBS, and FST (core throttling), each evaluated with No Prefetching, Aggressive Prefetching, HPAC, and the proposed Prefetch-Aware mechanisms. Annotated reductions: 9.9% (NFQ), 18.4% (PARBS), 14.5% (FST).]
Conclusion
- State-of-the-art fair shared-resource management techniques can be harmful in the presence of prefetching
- Their underlying prioritization techniques need to be extended to differentiate prefetches based on accuracy
- Core and prefetcher throttling should be coordinated with source-based resource management techniques
- Demand boosting eliminates starvation of memory non-intensive applications
- Our mechanisms improve both fair memory schedulers and source throttling in both system performance and fairness by more than 10%