LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors. Javier Lira ψ, Carlos Molina ψ,ф, Antonio González ψ,λ. λ Intel Barcelona Research Center, Intel Labs - UPC, Barcelona, Spain. ф Dept. Enginyeria Informàtica, Universitat Rovira i Virgili, Tarragona, Spain. ψ Dept. Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain. ICCD 2009, Lake Tahoe, CA (USA) - October 6, 2009

Page 1: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Javier Lira ψ

Carlos Molina ψ,ф

Antonio González ψ,λ

λ Intel Barcelona Research Center

Intel Labs - UPC

Barcelona, Spain

ф Dept. Enginyeria Informàtica

Universitat Rovira i Virgili

Tarragona, Spain

ψ Dept. Arquitectura de Computadors

Universitat Politècnica de Catalunya

Barcelona, Spain

ICCD 2009, Lake Tahoe, CA (USA) - October 6, 2009

Page 2: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Outline

Introduction

Methodology

LRU-PEA

Results

Conclusions

Page 3: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Introduction

CMPs have emerged as a dominant paradigm in system design.

1. Sustain performance improvements while reducing power consumption.

2. Take advantage of thread-level parallelism.

Commercial CMPs are currently available.

CMPs incorporate larger and shared last-level caches.

Wire delay is a key constraint.

Page 4: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

NUCA

Non-Uniform Cache Architecture (NUCA) was first proposed at ASPLOS 2002 by Kim et al. [1].

NUCA divides a large cache into smaller, faster banks.

Banks close to the cache controller have lower latencies than banks farther away.

[1] C. Kim, D. Burger and S. W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. ASPLOS '02

Page 5: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

NUCA Policies

Bank Placement Policy

Bank Access Policy

Bank Replacement Policy

Bank Migration Policy

Page 6: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Outline

Introduction

Methodology

LRU-PEA

Results

Conclusions

Page 7: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Methodology

Simulation tools: Simics + GEMS, CACTI v6.0

PARSEC Benchmark Suite

Parameter              Value
Number of cores        8 (UltraSPARC IIIi)
Frequency              1.5 GHz
Main memory size       4 GBytes
Memory bandwidth       512 Bytes/cycle
L1 cache latency       3 cycles
NUCA bank latency      4 cycles
Router delay           1 cycle
On-chip wire delay     1 cycle
Main memory latency    250 cycles (from core)
Private L1 caches      8 x 32 KBytes, 2-way
Shared L2 NUCA cache   8 MBytes, 256 banks
NUCA bank              32 KBytes, 8-way
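
The parameters above can be cross-checked with a few lines of arithmetic. The sketch below confirms that 256 banks of 32 KBytes add up to the 8 MBytes quoted for the shared L2, and estimates how bank latency grows with distance. The latency formula (one router delay plus one wire delay per hop, then the bank access) is an assumed simplification for illustration, not the model used in the simulator; `bank_access_latency` and `hops` are hypothetical names.

```python
# Values copied from the methodology table.
NUM_BANKS = 256
BANK_SIZE_KB = 32
BANK_LATENCY = 4   # cycles
ROUTER_DELAY = 1   # cycles per hop
WIRE_DELAY = 1     # cycles per hop


def total_capacity_mb(num_banks=NUM_BANKS, bank_kb=BANK_SIZE_KB):
    """Total NUCA capacity in MBytes."""
    return num_banks * bank_kb / 1024


def bank_access_latency(hops):
    """Assumed one-way model: per-hop router and wire delay, then the bank access."""
    if hops < 0:
        raise ValueError("hops must be non-negative")
    return hops * (ROUTER_DELAY + WIRE_DELAY) + BANK_LATENCY


print(total_capacity_mb())        # 8.0
print(bank_access_latency(0))     # 4
print(bank_access_latency(5))     # 14
```

Under this assumed model a bank five hops away costs 14 cycles against 4 cycles for an adjacent bank, which is the non-uniformity that NUCA policies try to exploit.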

Page 8: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Baseline NUCA cache architecture

CMP-DNUCA [2]

8 cores

256 banks

Non-inclusive

[2] B. M. Beckmann and D. A. Wood. Managing wire delay in large chip-multiprocessor caches. MICRO '04

Page 9: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Outline

Introduction

Methodology

LRU-PEA
  Background
  How does it work?

Results

Conclusions

Page 10: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Background

Entrance into the NUCA
  Off-chip memory
  L1 cache replacements

Migration movements
  Promotion
  Demotion

Page 11: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Data categories

1. Off-chip

2. L1 cache replacements

3. Promoted data

4. Demoted data

Page 12: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

LRU-PEA

LRU with Priority Eviction Approach (LRU-PEA): a replacement policy for CMP-NUCA architectures.

Data Eviction Policy: Chooses data to evict from a NUCA bank.

Data Target Policy: Determines the destination bank of the evicted data. Globalizes replacement decisions to the whole NUCA.

[Figure: LRU-PEA, comprising the Data Eviction Policy and the Data Target Policy]

Page 13: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Data Eviction Policy

Based on the LRU replacement policy.

Static prioritisation of NUCA data categories.

Lowest-category data is evicted from the NUCA bank.

PROBLEM: Highest-category data could monopolize the NUCA cache.

To avoid this, category comparison is restricted to the LRU and the LRU-1 positions.

Priority      Local bank        Central bank
+ (highest)   L1 Replacements   Promoted
              Promoted          Off-chip
              Off-chip          Demoted
- (lowest)    Demoted           L1 Replacements

Page 14: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Data Eviction Policy

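Example (NUCA bank, 4-way)**:

Position   0 (MRU)    1         2          3 (LRU)
Data       @A         @B        @C         @D
Category   Promoted   Demoted   Off-chip   Promoted

Priority (highest to lowest): L1 Replacement, Promoted, Off-chip, Demoted

LRU-PEA compares the LRU-1 and LRU positions: @C (Off-chip) against @D (Promoted). @C has the lower category, so it is evicted and its way becomes available; @D (Promoted) is kept.

** The set associativity assumed in this work for NUCA banks is 8-way.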
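
The eviction rule in this example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the priority orders are taken from the Local/Central table on the previous slide, the example bank is assumed to be a local bank, and evicting the LRU entry when both categories tie is an assumption, since the slides do not state the tie-break. The names `Category`, `PRIORITY_ORDER`, `priority` and `choose_victim` are hypothetical.

```python
from enum import Enum


class Category(Enum):
    OFF_CHIP = "Off-chip"
    L1_REPLACEMENT = "L1 Replacement"
    PROMOTED = "Promoted"
    DEMOTED = "Demoted"


# Highest priority first, per bank type, as listed in the priority table.
PRIORITY_ORDER = {
    "local": [Category.L1_REPLACEMENT, Category.PROMOTED,
              Category.OFF_CHIP, Category.DEMOTED],
    "central": [Category.PROMOTED, Category.OFF_CHIP,
                Category.DEMOTED, Category.L1_REPLACEMENT],
}


def priority(category, bank_type):
    """Return 1 for the highest-priority category in this bank type, 4 for the lowest."""
    return PRIORITY_ORDER[bank_type].index(category) + 1


def choose_victim(lru_stack, bank_type):
    """Pick the index to evict from a set ordered MRU first, LRU last.

    Only the LRU and LRU-1 positions are compared. Evicting the LRU entry
    on a tie is an assumption; the slides do not state the tie-break.
    """
    if not lru_stack:
        raise ValueError("empty set")
    lru = len(lru_stack) - 1
    if lru == 0:
        return lru
    lru_1 = lru - 1
    if priority(lru_stack[lru_1][1], bank_type) > priority(lru_stack[lru][1], bank_type):
        return lru_1  # LRU-1 holds the lower-category data
    return lru


# The 4-way example from the slide, assuming a local bank.
example_set = [
    ("@A", Category.PROMOTED),   # position 0, MRU
    ("@B", Category.DEMOTED),
    ("@C", Category.OFF_CHIP),   # position 2, LRU-1
    ("@D", Category.PROMOTED),   # position 3, LRU
]
victim = choose_victim(example_set, "local")
print(example_set[victim][0])    # @C
```

Note that @B (Demoted) has the lowest category in the set but is never considered, because the comparison looks only at the last two positions of the LRU stack.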

Page 15: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Data Target Policy

Migration movements provoke bank usage imbalance in the NUCA cache.

Replacements in the most accessed banks are unfair.

LRU-PEA globalizes replacement decisions to evict the most appropriate data from the NUCA cache.

Page 16: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Data Target Policy

Example (256 NUCA banks, 16 possible placements):

Current eviction: Off-chip (P2), Central bank

Step 1: vs. L1 Replac. (P1), Local bank

Step 2: vs. Off-chip (P2), Central bank

Step 3: vs. Demoted (P4), Local bank

…

Current eviction (cascade mode): Demoted (P4), Local bank
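
One way to read the example above is as a walk over the other candidate banks: the current eviction is compared with the victim offered by each bank, and it takes that victim's place only when the victim has a strictly lower priority. The sketch below follows that reading. It is an interpretation of the slide, not the authors' algorithm: the strict comparison, the visiting order, and the behaviour without cascade (stop after the first swap) are all assumptions, and `Victim` and `resolve_eviction` are hypothetical names.

```python
from typing import List, NamedTuple, Tuple


class Victim(NamedTuple):
    category: str
    priority: int   # 1 = highest, 4 = lowest, within its own bank's ordering
    bank: str       # "local" or "central"


def resolve_eviction(current: Victim, candidates: List[Victim],
                     cascade: bool = True) -> Tuple[Victim, int]:
    """Return (data that finally leaves the NUCA, number of swaps performed).

    candidates holds the victim offered by each other bank, in visiting order.
    """
    swaps = 0
    for candidate in candidates:
        if candidate.priority > current.priority:
            # The candidate has lower priority: the current eviction takes its
            # place and the candidate becomes the new current eviction.
            current = candidate
            swaps += 1
            if not cascade:
                break  # assumed: without cascade the displaced data leaves at once
    return current, swaps


# The example from the slide.
start = Victim("Off-chip", 2, "central")
steps = [
    Victim("L1 Replac.", 1, "local"),   # step 1: higher priority, no swap
    Victim("Off-chip", 2, "central"),   # step 2: same priority, no swap
    Victim("Demoted", 4, "local"),      # step 3: lower priority, swap
]
final, swaps = resolve_eviction(start, steps)
print(final.category, final.priority, swaps)   # Demoted 4 1
```

In cascade mode the displaced data keeps searching the remaining banks, which finds better victims at the cost of extra messages on the on-chip network.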

Page 17: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Outline

Introduction

Methodology

LRU-PEA

Results

Conclusions

Page 18: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Increasing network congestion

Steps       No Cascade   Cascade Enabled
                         Direct   Provoked
1 step      64           54       20
2 steps     12           7        7
3 steps     4            2        4
4 steps     3            2        4
5 steps     3            2        3
6 steps     2            1        4
7 steps     2            1        3
8 steps     2            1        4
9 steps     1            1        3
10 steps    1            1        4
11 steps    1            1        3
12 steps    1            1        6
13 steps    1            1        6
14 steps    1            1        30
15 steps    3            21       -

Values in percentage (%)

Page 19: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

NUCA miss rate analysis

Page 20: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Performance analysis

Page 21: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Dynamic EPI analysis

Page 22: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Outline

Introduction

Methodology

LRU-PEA

Results

Conclusions

Page 23: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Conclusions

LRU-PEA is proposed as an alternative to the traditional LRU replacement policy in CMP-NUCA architectures.

It defines four novel NUCA data categories and prioritises them to find the most appropriate data to evict.

In a D-NUCA architecture, data movements provoke unfair replacements in the most accessed banks.

LRU-PEA globalizes replacement decisions taken in a single bank to the whole NUCA cache.

Compared to the traditional LRU policy, LRU-PEA reduces miss rate, increases performance with parallel applications, and reduces the energy consumed per instruction.

Page 24: LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Questions?