lru-pea: a smart replacement policy for nuca caches on chip multiprocessors


LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Javier Lira ψ

Carlos Molina ψ,ф

Antonio González ψ,λ

λ Intel Barcelona Research Center

Intel Labs - UPC

Barcelona, Spain

antonio.gonzalez@intel.com

ф Dept. Enginyeria Informàtica

Universitat Rovira i Virgili

Tarragona, Spain

carlos.molina@urv.net

ψ Dept. Arquitectura de Computadors

Universitat Politècnica de Catalunya

Barcelona, Spain

javier.lira@ac.upc.edu

ICCD 2009, Lake Tahoe, CA (USA) - October 6, 2009

Outline

Introduction

Methodology

LRU-PEA

Results

Conclusions


Introduction

CMPs have emerged as a dominant paradigm in system design.

1. Keep improving performance while reducing power consumption.

2. Take advantage of thread-level parallelism.

Commercial CMPs are currently available.

CMPs incorporate larger and shared last-level caches.

Wire delay is a key constraint.


NUCA

Non-Uniform Cache Architecture (NUCA) was first proposed in ASPLOS 2002 by Kim et al.[1].

NUCA divides a large cache into smaller, faster banks.

Banks close to the cache controller have smaller latencies than farther banks.


[1] C. Kim, D. Burger and S.W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. ASPLOS '02
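The distance-dependent latency idea can be sketched with the delay figures from the methodology slide (bank access 4 cycles, router 1 cycle, on-chip wire 1 cycle). The round-trip model and the hop counts below are illustrative assumptions, not numbers from the talk:

```python
# Hypothetical NUCA latency sketch: banks nearer the cache controller
# respond sooner. Per-component delays are taken from the methodology
# slide; the simple round-trip model is our assumption.

def bank_latency(hops, bank_access=4, router=1, wire=1):
    """Round-trip latency (cycles) to a bank `hops` links away."""
    return bank_access + 2 * hops * (router + wire)

# A nearby bank (1 hop) beats a distant one (8 hops):
near, far = bank_latency(1), bank_latency(8)
assert near < far
```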

NUCA Policies

Bank Placement Policy

Bank Access Policy

Bank Replacement Policy

Bank Migration Policy

Outline

Introduction

Methodology

LRU-PEA

Results

Conclusions


Methodology

Simulation tools: Simics + GEMS, CACTI v6.0

PARSEC Benchmark Suite

Number of cores: 8 (UltraSPARC IIIi)

Frequency: 1.5 GHz

Main memory size: 4 GBytes

Memory bandwidth: 512 bytes/cycle

L1 cache latency: 3 cycles

NUCA bank latency: 4 cycles

Router delay: 1 cycle

On-chip wire delay: 1 cycle

Main memory latency: 250 cycles (from core)

Private L1 caches: 8 x 32 KBytes, 2-way

Shared L2 NUCA cache: 8 MBytes, 256 banks

NUCA bank: 32 KBytes, 8-way

Baseline NUCA cache architecture

CMP-DNUCA

8 cores

256 banks

Non-inclusive

[2] B. M. Beckmann and D. A. Wood. Managing wire delay in large chip-multiprocessor caches. MICRO ‘04

Outline

Introduction

Methodology

LRU-PEA: background, and how it works

Results

Conclusions


Background

Entrance into the NUCA: off-chip memory, L1 cache replacements

Migration movements: promotion, demotion


Data categories


1. Off-chip

2. L1 cache replacements

3. Promoted data

4. Demoted data
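The four data categories above can be written down as a simple enumeration. This is a sketch only; the talk does not prescribe any particular encoding:

```python
from enum import Enum

# The four NUCA data categories named on this slide, as an enum.
class Category(Enum):
    OFF_CHIP = "off-chip"              # lines arriving from off-chip memory
    L1_REPLACEMENT = "l1-replacement"  # lines evicted from an L1 cache
    PROMOTED = "promoted"              # lines migrated toward the requester
    DEMOTED = "demoted"                # lines migrated away after a replacement

assert len(Category) == 4
```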

LRU-PEA

LRU with Priority Eviction Approach: a replacement policy for CMP-NUCA architectures.

Data Eviction Policy: Chooses data to evict from a NUCA bank.

Data Target Policy: Determines the destination bank of the evicted data. Globalizes replacement decisions to the whole NUCA.



LRU-PEA

Data Eviction Policy

Based on the LRU replacement policy, with a static prioritisation of the NUCA data categories: the lowest-priority data is evicted from the NUCA bank.

PROBLEM: The highest-priority category could monopolize the NUCA cache.

Category comparison is restricted to the LRU and the LRU-1 positions.


Priority order in a bank (highest to lowest), by bank type:

Local banks: L1 Replacements > Promoted > Off-chip > Demoted

Central banks: Promoted > Off-chip > Demoted > L1 Replacements
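The two orderings above can be encoded as lookup tables. The bank types and category names follow the talk; the dict encoding and function names are our assumptions:

```python
# Priority orderings from the slide, highest priority first, keyed by
# bank type. Sketch only; not the authors' implementation.
PRIORITY = {
    "local":   ["l1-replacement", "promoted", "off-chip", "demoted"],
    "central": ["promoted", "off-chip", "demoted", "l1-replacement"],
}

def rank(bank_type, category):
    """Lower rank = higher priority = less likely to be evicted."""
    return PRIORITY[bank_type].index(category)

# In a local bank, demoted data is evicted before promoted data:
assert rank("local", "demoted") > rank("local", "promoted")
```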

Data Eviction Policy

Example (NUCA bank, 4-way)**:

A set holds, from MRU to LRU: @A (Promoted), @B (Demoted), @C (Off-chip), @D (Promoted). Incoming data from an L1 replacement needs a way. Plain LRU would evict @D, but LRU-PEA compares the categories of the LRU and LRU-1 positions: @C (Off-chip) has lower priority than @D (Promoted), so @C is evicted and its way becomes available, while @D (Promoted) stays in the set.

** The set associativity assumed in this work for NUCA banks is 8-way.
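The Data Eviction Policy in the example can be sketched as a comparison restricted to the LRU and LRU-1 ways of a local bank. The set layout and category names follow the example above; the code itself is an illustrative assumption, not the authors' implementation:

```python
# Minimal sketch of the Data Eviction Policy: compare only the LRU and
# LRU-1 ways and evict whichever holds the lower-priority category.

# Local-bank priority order, highest first (from the priority slide).
PRIORITY = ["l1-replacement", "promoted", "off-chip", "demoted"]

def choose_victim(ways):
    """`ways` is ordered MRU -> LRU, each entry (address, category).
    Return the index of the way to evict."""
    lru, lru1 = len(ways) - 1, len(ways) - 2
    # A higher index in PRIORITY means lower priority.
    if PRIORITY.index(ways[lru1][1]) > PRIORITY.index(ways[lru][1]):
        return lru1
    return lru

ways = [("@A", "promoted"), ("@B", "demoted"),
        ("@C", "off-chip"), ("@D", "promoted")]
assert ways[choose_victim(ways)][0] == "@C"  # off-chip loses to promoted
```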

Data Target Policy

Migration movements provoke bank usage imbalance in the NUCA cache.

Replacements in the most-accessed banks are unfair.

LRU-PEA globalizes replacement decisions to evict the most appropriate data from the NUCA cache.


Data Target Policy

Example (256 NUCA banks, 16 possible placements):


The line chosen for eviction in a bank (here, Off-chip data with priority P2, in a Central bank) is compared against the LRU line of each of the other banks that can hold the same address:

Step 1: L1 Replacement (P1, Local) has higher priority, so it is skipped.

Step 2: Off-chip (P2, Central) has equal priority, so it is also skipped.

Step 3: Demoted (P4, Local) has lower priority, so the victim is relocated into that bank and the Demoted line becomes the new eviction candidate, from which the search can continue (cascade mode).
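One cascade step of the Data Target Policy can be sketched as a search over the candidate banks for a lower-priority LRU line. The bank contents below are hypothetical; only the shape of the policy follows the talk:

```python
# Sketch of one step of the Data Target Policy in cascade mode: before
# evicting a line from the NUCA, compare it against the LRU line of each
# alternative bank that can hold the same address. If a lower-priority
# line is found, the victim moves there and that line is evicted (or
# cascades again) instead.

PRIORITY = ["l1-replacement", "promoted", "off-chip", "demoted"]  # high -> low

def find_target(victim_cat, candidate_banks):
    """Return the index of the first bank whose LRU line has lower
    priority than the victim, or None if the victim leaves the NUCA."""
    v = PRIORITY.index(victim_cat)
    for i, lru_cat in enumerate(candidate_banks):
        if PRIORITY.index(lru_cat) > v:
            return i
    return None

banks = ["l1-replacement", "promoted", "demoted"]  # LRU category per bank
assert find_target("off-chip", banks) == 2  # the demoted line is sacrificed
```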

Outline

Introduction

Methodology

LRU-PEA

Results

Conclusions


Increasing network congestion

Steps     No Cascade   Cascade (Direct)   Cascade (Provoked)

1 step        64             54                  20
2 steps       12              7                   7
3 steps        4              2                   4
4 steps        3              2                   4
5 steps        3              2                   3
6 steps        2              1                   4
7 steps        2              1                   3
8 steps        2              1                   4
9 steps        1              1                   3
10 steps       1              1                   4
11 steps       1              1                   3
12 steps       1              1                   6
13 steps       1              1                   6
14 steps       1              1                  30
15 steps       3             21                   -

Values in percentage (%)

NUCA miss rate analysis [results chart omitted]

Performance analysis [results chart omitted]

Dynamic EPI analysis [results chart omitted]

Outline

Introduction

Methodology

LRU-PEA

Results

Conclusions


Conclusions

LRU-PEA is proposed as an alternative to the traditional LRU replacement policy in CMP-NUCA architectures.

It defines four novel NUCA data categories and prioritises them to find the most appropriate data to evict.

In a D-NUCA architecture, data movements provoke unfair replacements in most accessed banks.

LRU-PEA globalizes replacement decisions taken in a single bank to the whole NUCA cache.

Compared to the traditional LRU policy, LRU-PEA reduces the miss rate, increases performance on parallel applications, and reduces the energy consumed per instruction.


LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Questions?
