Last Bank: Dealing with Address Reuse in Non-Uniform Cache Architecture for CMPs
Javier Lira ψ
Carlos Molina ф
Antonio González λ
λ Intel Barcelona Research Center
Intel Labs - UPC
Barcelona, Spain
ф Dept. Enginyeria Informàtica
Universitat Rovira i Virgili
Tarragona, Spain
ψ Dept. Arquitectura de Computadors
Universitat Politècnica de Catalunya
Barcelona, Spain
Euro-Par 2009, Delft (The Netherlands) - August 27, 2009
Outline
Introduction
Methodology
Last Bank
Characterization of replacements in NUCA
Last Bank Optimizations
Conclusions
Introduction
CMPs have emerged as a dominant paradigm in system design.
1. Sustain performance improvements while reducing power consumption.
2. Take advantage of thread-level parallelism.
Commercial CMPs are currently available.
CMPs incorporate larger and shared last-level caches.
Wire delay is a key constraint.
NUCA
Non-Uniform Cache Architecture (NUCA) was first proposed in ASPLOS 2002 by Kim et al.[1].
NUCA divides a large cache into smaller and faster banks.
Banks close to the cache controller have lower latencies than banks farther away.
[1] C. Kim, D. Burger and S. W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. ASPLOS '02.
Methodology
Simulation tools: Simics + GEMS, CACTI v6.0
PARSEC Benchmark Suite
Number of cores: 8
Core processor: Out-of-order SPARCv9
Main memory size: 4 GBytes
Memory bandwidth: 512 Bytes/cycle
L1 cache latency: 3 cycles
NUCA bank latency: 2 cycles
Router delay: 1 cycle
On-chip wire delay: 1 cycle
Main memory latency: 350 cycles (from core)
Private L1 data caches: 8 KBytes
Private L1 instr. caches: 8 KBytes
Shared L2 NUCA cache: 1 MByte, 256 banks
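As a rough illustration of how the latency parameters above combine, the sketch below estimates the access latency of a bank from its distance to the requesting core. The hop-based model and the function name are assumptions made for illustration; they are not the timing model used by the simulator.

```python
# Parameters taken from the configuration above (in cycles).
NUCA_BANK_LATENCY = 2
ROUTER_DELAY = 1
WIRE_DELAY = 1


def estimated_bank_access_latency(hops: int) -> int:
    """Estimate the round-trip latency to a bank `hops` links away.

    Assumption: each hop costs one router traversal plus one wire
    traversal, paid once for the request and once for the response.
    """
    if hops < 0:
        raise ValueError("hops must be non-negative")
    one_way = hops * (ROUTER_DELAY + WIRE_DELAY)
    return 2 * one_way + NUCA_BANK_LATENCY
```

Under this assumed model a bank next to the controller costs only the bank latency, while every extra hop adds four cycles, which is the non-uniformity that NUCA exploits.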
Baseline NUCA cache architecture [2]: 8 cores, 256 banks.
[2] B. M. Beckmann and D. A. Wood. Managing wire delay in large chip-multiprocessor caches. MICRO '04.
Last Bank
Data movements concentrate the most accessed data in a few banks.
Data replacements in HOT banks are unfair.
An extra bank is included in the NUCA cache.
It acts as a victim cache, but it is not fully associative.
It gives evicted data a second chance to stay in the NUCA cache.
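The behaviour described above can be sketched as a small set-associative victim bank. This is a minimal illustration only: the class name, the modulo set indexing, the per-set LRU replacement and the use of block addresses as keys are all assumptions, not details given in the slides.

```python
from collections import OrderedDict


class LastBank:
    """Set-associative victim bank with per-set LRU (illustrative only)."""

    def __init__(self, num_sets: int, ways: int) -> None:
        self.num_sets = num_sets
        self.ways = ways
        # One OrderedDict per set; the first key is the LRU line.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _set_for(self, address: int) -> OrderedDict:
        # Assumed indexing: block address modulo the number of sets.
        return self.sets[address % self.num_sets]

    def insert_evicted(self, address: int, data: object):
        """Store a line evicted from the NUCA.

        Returns the (address, data) pair pushed out to main memory,
        or None if the set still had room.
        """
        lines = self._set_for(address)
        victim = None
        if address not in lines and len(lines) >= self.ways:
            victim = lines.popitem(last=False)  # drop the LRU line
        lines[address] = data
        lines.move_to_end(address)  # the new line becomes MRU
        return victim

    def lookup(self, address: int):
        """On a hit, remove the line so it can return to the NUCA."""
        return self._set_for(address).pop(address, None)
```

A hit in `lookup` is the "second chance": the line goes back to the NUCA instead of being fetched from main memory.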
Performance benefits restricted by Last Bank size.
Significant performance potential.
Analysis of reused addresses to find improvement points.
Characterization of replacements in NUCA
How many evicted addresses are later reused?
How many cycles does a reused address usually spend out of the NUCA before being reinserted?
Where were reused addresses located within the NUCA just before being evicted?
What action caused the eviction of reused addresses from the NUCA?
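The first two questions can be answered from a trace of evictions and insertions. The sketch below assumes a hypothetical trace format, a cycle-ordered list of `(cycle, kind, address)` tuples; neither this format nor the function name comes from the simulation infrastructure.

```python
def characterize_reuse(events):
    """Summarise address reuse from a cycle-ordered event trace.

    events: iterable of (cycle, kind, address), kind is "evict" or "insert".
    Returns (fraction of evictions that were later reinserted,
             list of cycles each reused address spent outside the cache).
    """
    evicted_at = {}  # address -> cycle of its pending eviction
    evictions = 0
    gaps = []
    for cycle, kind, address in events:
        if kind == "evict":
            evictions += 1
            evicted_at[address] = cycle
        elif kind == "insert" and address in evicted_at:
            # The address returns: record how long it was away.
            gaps.append(cycle - evicted_at.pop(address))
    fraction = len(gaps) / evictions if evictions else 0.0
    return fraction, gaps
```

A histogram of the returned gaps corresponds to the time between eviction and reinsertion; tagging each eviction with its source bank and cause would extend the same loop to the remaining two questions.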
Reused address statistics
Nearly 70% of evicted addresses return to the NUCA cache.
Most reused addresses return to the NUCA at least twice.
Time between Eviction and Reinsertion
Nearly 30% of evicted addresses return in less than 100,000 cycles.
In blackscholes, almost 50% of reused addresses return to NUCA in less than 1,000 cycles.
Last location within the NUCA
Most reused addresses were evicted from Local Banks.
Most addresses replaced from Central Banks are not reused later.
Selective Last Bank
Target: reduce pollution in the Last Bank.
This mechanism selects which evicted data blocks are stored in the Last Bank.
Implemented Selective Last Bank: stores a data block if and only if it was evicted from a Local Bank. Otherwise, the block is sent back to main memory.
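The filter described above reduces to a single decision per evicted block. The sketch below is illustrative; the `BankType` enum, the function name and the string labels are hypothetical names introduced here.

```python
from enum import Enum


class BankType(Enum):
    LOCAL = "local"
    CENTRAL = "central"


def destination_for_evicted_block(source_bank: BankType) -> str:
    """Return where a block evicted from the NUCA is sent.

    Only blocks evicted from a Local Bank are kept in the Last Bank;
    everything else goes back to main memory.
    """
    if source_bank is BankType.LOCAL:
        return "last_bank"
    return "main_memory"
```

Filtering on the source bank follows the characterization results: reused addresses mostly leave from Local Banks, so blocks from Central Banks would mostly occupy Last Bank space without being reused.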
LRU Prioritising Last Bank
Target: keep reused addresses in the NUCA cache.
Modifies the data eviction policy of the NUCA banks.
Prioritises lines that come from the Last Bank during the data replacement process.
Example (LRU stack of a 4-way set; position 0 = MRU, position 3 = LRU; P is the priority bit):
Before: @A (P: 0), @B (P: 0), @C (P: 0), @D (P: 1)
After: @D (P: 0), @A (P: 0), @B (P: 0), @C (P: 0)
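One possible reading of this policy is sketched below: a prioritised line that reaches the LRU position loses its priority bit and moves to the MRU position, and the next non-prioritised LRU line is evicted instead. This interpretation, along with the `Line` class and the `choose_victim` name, is an assumption made for illustration; the slides do not spell out the exact algorithm.

```python
from dataclasses import dataclass


@dataclass
class Line:
    address: str
    priority: bool = False  # set when the line came from the Last Bank


def choose_victim(lru_stack: list) -> Line:
    """Pick the line to evict from a non-empty set.

    lru_stack[0] is the MRU line and lru_stack[-1] is the LRU line.
    Assumed policy: a prioritised line at the LRU position is spared
    once. Its bit is cleared and it moves to the MRU position. The
    first non-prioritised LRU line is removed and returned.
    """
    while lru_stack[-1].priority:
        spared = lru_stack.pop()
        spared.priority = False
        lru_stack.insert(0, spared)
    return lru_stack.pop()
```

With a set ordered A, B, C, D from MRU to LRU where only D is prioritised, D moves to the MRU position with its bit cleared and C becomes the victim. The loop terminates because each iteration clears one priority bit.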
Results
Both optimizations increase Last Bank performance benefits.
There is still room for improvement.
Adaptive filters will be analysed in future work.
Conclusions
Data movements provoke unfair replacements in HOT banks.
Last Bank reduces the access latency of promptly reused addresses.
Huge performance potential.
Two optimizations are proposed:
Selective Last Bank: reduces pollution in the Last Bank.
LRU Prioritising Last Bank: keeps reused addresses in the NUCA cache.
Questions?