Last Bank: Dealing with Address Reuse in Non-Uniform Cache Architecture for CMPs

Javier Lira ψ

Carlos Molina ф

Antonio González λ

λ Intel Barcelona Research Center

Intel Labs - UPC

Barcelona, Spain


ф Dept. Enginyeria Informàtica

Universitat Rovira i Virgili

Tarragona, Spain


ψ Dept. Arquitectura de Computadors

Universitat Politècnica de Catalunya

Barcelona, Spain


Euro-Par 2009, Delft (The Netherlands) - August 27, 2009

Outline

Introduction

Methodology

Last Bank

Characterization of replacements in NUCA

Last Bank Optimizations

Conclusions


Introduction

CMPs have emerged as a dominant paradigm in system design.

1. Sustain performance improvements while reducing power consumption.

2. Take advantage of thread-level parallelism.

Commercial CMPs are currently available.

CMPs incorporate larger and shared last-level caches.

Wire delay is a key constraint.


NUCA

Non-Uniform Cache Architecture (NUCA) was first proposed at ASPLOS 2002 by Kim et al. [1].

NUCA divides a large cache into smaller and faster banks.

Banks close to the cache controller have lower latencies than banks farther away.


[1] C. Kim, D. Burger and S.W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. ASPLOS '02


Methodology

Simulation tools: Simics + GEMS, CACTI v6.0

PARSEC Benchmark Suite

Simulated configuration (parameter: value)

Number of cores: 8
Core processor: Out-of-order SPARCv9
Main memory size: 4 GBytes
Memory bandwidth: 512 Bytes/cycle
L1 cache latency: 3 cycles
NUCA bank latency: 2 cycles
Router delay: 1 cycle
On-chip wire delay: 1 cycle
Main memory latency: 350 cycles (from core)
Private L1 data caches: 8 KBytes
Private L1 instr. caches: 8 KBytes
Shared L2 NUCA cache: 1 MByte, 256 banks
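As a rough illustration of how the latency parameters above combine, the sketch below estimates the cost of reaching a bank that sits a given number of network hops away from the requester. The function name `nuca_access_latency` and the model itself (each hop costs one router delay plus one wire delay, in both directions, plus one bank access) are assumptions made for illustration; the slides do not state the exact latency model.

```python
# Hypothetical latency model; only the three constants come from the configuration.
BANK_LATENCY = 2   # cycles, NUCA bank latency
ROUTER_DELAY = 1   # cycles per router traversed
WIRE_DELAY = 1     # cycles per on-chip link traversed


def nuca_access_latency(hops: int) -> int:
    """Estimate round-trip cycles to read a bank `hops` links away (assumed model)."""
    if hops < 0:
        raise ValueError("hops must be non-negative")
    one_way = hops * (ROUTER_DELAY + WIRE_DELAY)
    # Request travels to the bank, the bank is read, the reply travels back.
    return 2 * one_way + BANK_LATENCY
```

Under this assumed model, a bank one hop away costs far fewer cycles than one eight hops away, which is the non-uniformity that NUCA exploits.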

Baseline NUCA cache architecture

8 cores

256 banks

[2] B. M. Beckmann and D. A. Wood. Managing wire delay in large chip-multiprocessor caches. MICRO '04


Last Bank

Data movements concentrate the most frequently accessed data in a few banks.

Data replacements in HOT banks are unfair.


An extra bank is included in the NUCA cache.

It acts as a victim cache, but it is not fully associative.

It gives evicted data a second chance to remain in the NUCA cache.
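The behaviour described above can be sketched as a small set-associative victim structure. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name `LastBank`, the set and way counts, the modulo index function, the LRU policy inside the Last Bank, and the choice to remove a line on a hit (so that it can be reinserted into the NUCA) are all illustrative.

```python
from collections import OrderedDict


class LastBank:
    """Set-associative victim bank (sizes are illustrative, not from the slides)."""

    def __init__(self, num_sets: int = 4, ways: int = 2):
        self.num_sets = num_sets
        self.ways = ways
        # One LRU-ordered dict per set: oldest entry first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def insert(self, addr: int, data: object) -> None:
        """Store a line evicted from the NUCA; drop the oldest line if the set is full."""
        lines = self.sets[addr % self.num_sets]
        lines.pop(addr, None)          # refresh the entry if it is already present
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # the oldest line leaves the chip
        lines[addr] = data

    def lookup(self, addr: int):
        """On a NUCA miss, return the line (or None) and remove it from the Last Bank."""
        return self.sets[addr % self.num_sets].pop(addr, None)
```

Because the structure is set-associative rather than fully associative, two evicted lines that map to the same set compete for the same ways even when other sets are empty.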


Performance benefits are restricted by the Last Bank size.

Significant performance potential.

Analysis of reused addresses to find improvement points.



Characterization of replacements in NUCA

How many evicted addresses are later reused?

How many cycles does a reused address usually spend outside the NUCA before being reinserted?

Where were reused addresses located within the NUCA just before being evicted?

What action triggered the eviction of reused addresses from the NUCA?


Reused address statistics


Nearly 70% of evicted addresses return to the NUCA cache.

Most reused addresses return to the NUCA at least twice.
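Statistics of this kind could be gathered from a trace of NUCA insertions and evictions. The sketch below assumes a hypothetical trace format, a time-ordered sequence of `(event, addr)` pairs with `event` being `"insert"` or `"evict"`, and counts per unique address; the slides do not say how the authors collected or normalised their numbers.

```python
from collections import Counter


def reuse_statistics(trace):
    """Return (fraction of evicted addresses that return,
               fraction of those that return at least twice)."""
    outside = set()      # addresses currently evicted from the NUCA
    evicted = set()      # addresses evicted at least once
    returns = Counter()  # addr -> number of reinsertions after an eviction
    for event, addr in trace:
        if event == "evict":
            evicted.add(addr)
            outside.add(addr)
        elif event == "insert" and addr in outside:
            outside.discard(addr)
            returns[addr] += 1
    reused = len(returns)
    frac_reused = reused / len(evicted) if evicted else 0.0
    twice = sum(1 for n in returns.values() if n >= 2)
    frac_twice = twice / reused if reused else 0.0
    return frac_reused, frac_twice
```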

Time between Eviction and Reinsertion


Nearly 30% of evicted addresses return in less than 100,000 cycles.

In blackscholes, almost 50% of reused addresses return to the NUCA in less than 1,000 cycles.
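The time an address spends outside the NUCA can be measured by pairing each eviction with the next insertion of the same address. The following sketch assumes a hypothetical time-ordered trace of `(cycle, event, addr)` tuples; the helper names `reinsertion_delays` and `fraction_below` are illustrative, and the choice of denominator (all evictions or only reused ones) is left to the caller.

```python
def reinsertion_delays(trace):
    """Return the cycles between each eviction and the following reinsertion."""
    evicted_at = {}  # addr -> cycle of its most recent eviction
    delays = []
    for cycle, event, addr in trace:
        if event == "evict":
            evicted_at[addr] = cycle
        elif event == "insert" and addr in evicted_at:
            delays.append(cycle - evicted_at.pop(addr))
    return delays


def fraction_below(delays, threshold):
    """Fraction of measured delays strictly shorter than `threshold` cycles."""
    if not delays:
        return 0.0
    return sum(d < threshold for d in delays) / len(delays)
```

Calling `fraction_below(delays, 1_000)` and `fraction_below(delays, 100_000)` gives the two kinds of threshold quoted on this slide.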

Last location within the NUCA

Most reused addresses were evicted from Local Banks.

Most addresses replaced in Central Banks are not reused later.


Selective Last Bank

Target: reduce pollution in the Last Bank.

This mechanism selects which evicted data blocks are stored in the Last Bank.

Implemented Selective Last Bank: stores a data block if and only if it was evicted from a Local Bank; otherwise, the block is sent back to main memory.
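The filter rule above amounts to a single branch on the type of bank that evicted the block. The sketch below is illustrative only: `BankType`, `handle_eviction`, and the use of plain lists to stand in for the Last Bank and main memory are hypothetical names and simplifications, not part of the original design.

```python
from enum import Enum


class BankType(Enum):
    LOCAL = "local"
    CENTRAL = "central"


def handle_eviction(addr, source_bank, last_bank, memory):
    """Route an evicted block: Last Bank if it left a Local Bank, else main memory."""
    if source_bank is BankType.LOCAL:
        last_bank.append(addr)  # the block gets a second chance on chip
        return "last_bank"
    memory.append(addr)         # blocks from Central Banks bypass the Last Bank
    return "memory"
```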


LRU Prioritising Last Bank

Target: To maintain reused addresses in the NUCA cache.

Modifies the data eviction policy of the NUCA banks.

Prioritises lines that come from the Last Bank during the data replacement process.

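[Figure: LRU Prioritising Last Bank example on a 4-way set, positions 0 (MRU) to 3 (LRU), with a priority bit P per line. First state: @A (P:0), @B (P:0), @C (P:0), @D (P:1). Second state: @D (P:0), @A (P:0), @B (P:0), @C (P:0).]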
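One plausible reading of the example with lines @A to @D is that a line in the LRU position whose priority bit P is set is not evicted: its bit is cleared and it moves to the MRU position, so the next LRU line becomes the victim. The sketch below implements that reading. The function name `select_victim` and the list-based representation of a set are illustrative assumptions, not the authors' implementation.

```python
def select_victim(lru_stack):
    """Pick a victim from one cache set, honouring priority bits.

    lru_stack: list of [addr, priority_bit], index 0 = MRU, last = LRU.
    The list is modified in place; the evicted address is returned.
    """
    if not lru_stack:
        raise ValueError("cannot evict from an empty set")
    # Each pass clears one priority bit, so the loop runs at most len(lru_stack) times.
    while lru_stack[-1][1] == 1:
        line = lru_stack.pop()
        line[1] = 0                # the second chance is consumed
        lru_stack.insert(0, line)  # the prioritised line moves to the MRU position
    return lru_stack.pop()[0]
```

Starting from @A, @B, @C (P:0) and @D (P:1), the first pass reorders the set to @D, @A, @B, @C with all bits cleared, and @C is then evicted.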

Results


Both optimizations increase Last Bank performance benefits.

There is still room for improvement.

Adaptive filters will be analysed in future work.


Conclusions

Data movements provoke unfair replacements in HOT banks.

Last Bank reduces the access latency of promptly reused addresses.

Huge performance potential.

Two optimizations are proposed:

Selective Last Bank: Reduce pollution in Last Bank.

LRU Prioritising Last Bank: Maintain reused addresses in the NUCA cache.


Questions?
