
Page 1: Energy-Efficient Hardware Data Prefetching

Yao Guo, Mahmoud Abdullah Bennaser and Csaba Andras Moritz

Page 2: Contents

Introduction

Hardware prefetching

Hardware data prefetching methods

Performance speedup

Energy-aware prefetching techniques

PARE

Conclusion

References

Page 3: Introduction

Data prefetching is the process of fetching data that the program will need in advance, before the instruction that requires it executes.

It hides the apparent memory latency.

Two types:
• Software prefetching, inserted by the compiler
• Hardware prefetching, using additional circuitry

Page 4: Hardware Prefetching

Uses additional circuitry.

Prefetch tables are used to store recent load instructions and the relations between load instructions.

Gives better performance.

Energy overhead comes from:
• the energy cost of accessing the prefetch tables
• unnecessary L1 cache lookups

Page 5: Hardware Data Prefetching Methods

• Sequential prefetching
• Stride prefetching
• Pointer prefetching
• Combined stride and pointer prefetching

Page 6: Sequential Prefetching

One block lookahead (OBL) approach: initiate a prefetch for block b+1 when block b is accessed.

• Prefetch-on-miss: a prefetch is issued whenever the access to block b results in a cache miss.
• Tagged prefetching: associates a tag bit with every memory block; when a block is demand-fetched, or a prefetched block is referenced for the first time, the next block is prefetched.
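
For illustration only (not from the original slides), here is a minimal C sketch of the two OBL policies on a direct-mapped, block-granularity cache model; the cache structure and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 1024            /* cache modeled at block granularity */

typedef struct {
    uint32_t block_addr;           /* which memory block is cached here */
    bool     valid;
    bool     tag_bit;              /* set on prefetched blocks (tagged scheme) */
} CacheBlock;

static CacheBlock cache[NUM_BLOCKS];

static CacheBlock *lookup(uint32_t block) { return &cache[block % NUM_BLOCKS]; }

static void fetch(uint32_t block, bool is_prefetch) {
    CacheBlock *c = lookup(block);
    c->block_addr = block;
    c->valid      = true;
    c->tag_bit    = is_prefetch;   /* remember that this block was prefetched */
}

/* Prefetch-on-miss: prefetch block b+1 only when the access to b misses. */
void access_prefetch_on_miss(uint32_t block) {
    CacheBlock *c = lookup(block);
    if (!(c->valid && c->block_addr == block)) {   /* miss */
        fetch(block, false);
        fetch(block + 1, true);
    }
}

/* Tagged prefetching: prefetch b+1 on a demand fetch of b, or on the first
 * reference to a block that was itself brought in by a prefetch. */
void access_tagged(uint32_t block) {
    CacheBlock *c = lookup(block);
    if (!(c->valid && c->block_addr == block)) {   /* demand miss */
        fetch(block, false);
        fetch(block + 1, true);
    } else if (c->tag_bit) {                       /* first use of a prefetched block */
        c->tag_bit = false;
        fetch(block + 1, true);
    }
}
```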

Page 7: OBL Approaches

Figure: prefetch-on-miss vs. tagged prefetch, showing which blocks are demand-fetched and which are prefetched, and the tag bit values that mark prefetched blocks in the tagged scheme.

Page 8: Stride Prefetching

Employ special logic to monitor the processor’s address referencing pattern

Detect constant stride array references originating from looping structures

Compare successive addresses used by load or store instructions

Page 9: Reference Prediction Table (RPT)

The RPT has 64 entries of 64 bits each and holds the most recently used memory instructions.

Each entry contains:
• the address of the memory instruction (its PC)
• the previous address accessed by the instruction
• a stride value
• a state field

Page 10: Organization of the RPT

Figure: RPT organization. The load/store PC indexes the table and is matched against the instruction tag; each entry holds the previous address, the stride, and a state field. The current effective address minus the previous address gives the new stride, and the effective address plus the stride gives the prefetch address.
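
A minimal C sketch of an RPT-style stride prefetcher follows, assuming a simplified two-state machine instead of the full state field; the table layout and names are illustrative rather than the exact hardware design.

```c
#include <stdint.h>

#define RPT_ENTRIES 64

typedef enum { RPT_INITIAL, RPT_STEADY } RptState;  /* simplified: real RPTs use more states */

typedef struct {
    uint32_t tag;            /* PC of the load/store instruction   */
    uint32_t prev_addr;      /* last effective address it accessed */
    int32_t  stride;         /* last observed stride               */
    RptState state;
} RptEntry;

static RptEntry rpt[RPT_ENTRIES];

/* Called on every load/store; returns 1 and sets *prefetch_addr
 * when a constant stride has been confirmed. */
int rpt_access(uint32_t pc, uint32_t effective_addr, uint32_t *prefetch_addr) {
    RptEntry *e = &rpt[(pc >> 2) % RPT_ENTRIES];

    if (e->tag != pc) {                      /* new instruction: allocate an entry */
        e->tag = pc;
        e->prev_addr = effective_addr;
        e->stride = 0;
        e->state = RPT_INITIAL;
        return 0;
    }

    int32_t new_stride = (int32_t)(effective_addr - e->prev_addr);
    if (new_stride == e->stride && new_stride != 0) {
        e->state = RPT_STEADY;               /* same stride observed again: confirmed */
    } else {
        e->state = RPT_INITIAL;
        e->stride = new_stride;
    }
    e->prev_addr = effective_addr;

    if (e->state == RPT_STEADY) {
        *prefetch_addr = effective_addr + e->stride;
        return 1;
    }
    return 0;
}
```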

Page 11: Pointer Prefetching

Effective for pointer-intensive programs, where there is no constant stride.

Dependence-based prefetching detects the dependence relationship between loads and uses two hardware tables:
• Correlation Table (CT): stores the dependence information
• Potential Producer Window (PPW): records the most recently loaded values and the corresponding instructions
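
A much-simplified sketch of dependence-based pointer prefetching in the spirit of Roth et al.: the PPW and CT are modeled here as plain arrays rather than the associative structures real hardware would use, and all names and sizes are assumptions.

```c
#include <stdint.h>

#define PPW_SIZE 64
#define CT_SIZE  64

/* Potential Producer Window: recent loads and the values they produced. */
typedef struct { uint32_t producer_pc; uint32_t loaded_value; } PpwEntry;
/* Correlation Table: producer load -> consumer load (with the consumer's offset). */
typedef struct { uint32_t producer_pc; uint32_t consumer_pc; int32_t offset; } CtEntry;

static PpwEntry ppw[PPW_SIZE];
static CtEntry  ct[CT_SIZE];
static int ppw_head, ct_head;

/* Called for every executed load.  base_addr is the load's address operand,
 * loaded_value is the value it returned. */
void pointer_prefetch_on_load(uint32_t pc, uint32_t base_addr, int32_t offset,
                              uint32_t loaded_value,
                              void (*issue_prefetch)(uint32_t addr)) {
    /* 1. If this load's base address was produced by a recent load,
     *    record the producer -> consumer correlation. */
    for (int i = 0; i < PPW_SIZE; i++) {
        if (ppw[i].loaded_value == base_addr && ppw[i].loaded_value != 0) {
            ct[ct_head] = (CtEntry){ ppw[i].producer_pc, pc, offset };
            ct_head = (ct_head + 1) % CT_SIZE;
            break;
        }
    }

    /* 2. If this load is a known producer, prefetch on behalf of its consumers
     *    using the value it just loaded (likely the next pointer). */
    for (int i = 0; i < CT_SIZE; i++) {
        if (ct[i].producer_pc == pc)
            issue_prefetch(loaded_value + ct[i].offset);
    }

    /* 3. Record this load in the PPW as a potential producer. */
    ppw[ppw_head] = (PpwEntry){ pc, loaded_value };
    ppw_head = (ppw_head + 1) % PPW_SIZE;
}
```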

Page 12: Combined Stride and Pointer Prefetching

Objective: a single technique that works for all types of memory access patterns.

Handles both array and pointer accesses.

Gives better performance.

Uses all three tables (RPT, PPW, CT).

Page 13: Performance Speedup

The combined (stride+dep) technique has the best speedup for most benchmarks.

Figure: speedup (roughly 0.8 to 2.4) of no-prefetch, sequential, tagged, stride, dependence, and stride+dep prefetching on mcf, parser, art, bzip2, galgel, bh, em3d, health, mst, perim, and the average.

Page 14: Energy-aware Prefetching Architecture

Figure: the energy-aware prefetching architecture. Load-queue (LDQ) entries carrying compiler hints feed the stride and pointer prefetchers alongside regular cache accesses; a stride counter and a Prefetch Filtering Buffer (PFB) filter prefetch requests before they reach the L1 D-cache tag and data arrays, and surviving prefetches fetch from the L2 cache.

The architecture applies four techniques:
• Compiler-Based Selective Filtering
• Compiler-Assisted Adaptive Prefetching
• Prefetch Filtering using a Stride Counter
• Hardware Filtering using the PFB

Page 15: Energy-aware Prefetching Techniques

• Compiler-Based Selective Filtering (CBSF): only selected memory instructions search the prefetch hardware tables
• Compiler-Assisted Adaptive Prefetching (CAAP): selects between prefetching schemes per access
• Compiler-driven Filtering using a Stride Counter (SC): reduces prefetching energy
• Hardware-based Filtering using the PFB: reduces the L1 cache related energy overhead

Page 16: Compiler-Based Selective Filtering

The prefetch hardware tables are searched only for the memory instructions selected by the compiler.

Energy is reduced by:
• considering only loop-type or recursive-type memory accesses
• considering only array and linked-data-structure memory accesses

Page 17: Compiler-Assisted Adaptive Prefetching

Selects the prefetching scheme based on the kind of access (a sketch of this hint selection follows the list):

• Memory accesses to an array that does not belong to any larger structure are fed only into the stride prefetcher.
• Memory accesses to an array that belongs to a larger structure are fed into both the stride and the pointer prefetcher.
• Memory accesses to a linked data structure that contains no arrays are fed only into the pointer prefetcher.
• Memory accesses to a linked data structure that contains arrays are fed into both prefetchers.
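
A hedged sketch of how such per-access compiler hints could be encoded and consumed; the enum values, struct fields, and selection function are illustrative, not the paper's actual encoding.

```c
/* Hypothetical encoding of the per-load compiler hint (not the actual bit layout). */
typedef enum {
    HINT_NONE    = 0,       /* not selected for prefetching at all           */
    HINT_STRIDE  = 1 << 0,  /* feed the stride prefetcher (RPT)              */
    HINT_POINTER = 1 << 1,  /* feed the pointer prefetcher (PPW/CT)          */
} PrefetchHint;

/* Classification the compiler would derive from the access's source-level type. */
typedef struct {
    int is_array;             /* access goes through an array                 */
    int is_linked_structure;  /* access goes through a linked data structure  */
    int inside_larger_struct; /* the array is a field of a larger structure   */
    int contains_arrays;      /* the linked structure contains arrays         */
} AccessInfo;

PrefetchHint caap_select(const AccessInfo *a) {
    if (a->is_array && !a->inside_larger_struct)
        return HINT_STRIDE;                       /* plain array: stride only        */
    if (a->is_array && a->inside_larger_struct)
        return HINT_STRIDE | HINT_POINTER;        /* array inside a structure: both  */
    if (a->is_linked_structure && !a->contains_arrays)
        return HINT_POINTER;                      /* pure linked structure: pointer  */
    if (a->is_linked_structure && a->contains_arrays)
        return HINT_STRIDE | HINT_POINTER;        /* linked structure with arrays    */
    return HINT_NONE;                             /* everything else is filtered out */
}
```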

Page 18: Compiler-hinted Filtering Using a Runtime Stride Counter

Reduces the prefetching energy wasted on memory access patterns with very small strides.

Small strides are not useful for prefetching; prefetching pays off only when the stride can be larger than half the cache line size.

Each counter entry holds the Program Counter (PC) and a stride counter, which counts how many times the instruction occurs.
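
One plausible reading of the stride-counter filter, sketched in C: the compiler tags small-stride loads, and hardware suppresses their prefetches until the accesses have advanced past a cache line. The constants and table layout are assumptions.

```c
#include <stdint.h>

#define CACHE_LINE_SIZE 32
#define SC_ENTRIES      16

/* One entry per tagged load: how many executions since the last prefetch. */
typedef struct {
    uint32_t pc;        /* tagged load instruction                 */
    uint32_t counter;   /* executions seen since the last prefetch */
} ScEntry;

static ScEntry sc[SC_ENTRIES];

/* Returns 1 if a prefetch should be issued for this execution of the load. */
int sc_filter(uint32_t pc, int32_t stride) {
    ScEntry *e = &sc[pc % SC_ENTRIES];
    if (e->pc != pc) {                       /* allocate an entry for this load */
        e->pc = pc;
        e->counter = 0;
    }

    /* Large strides always prefetch: each access touches a new cache line. */
    if (stride > CACHE_LINE_SIZE / 2 || stride < -(CACHE_LINE_SIZE / 2))
        return 1;

    /* Small strides: only prefetch once enough accesses have accumulated
     * to cross into the next cache line, then restart the count. */
    e->counter++;
    if ((int32_t)e->counter * stride >= CACHE_LINE_SIZE ||
        (int32_t)e->counter * stride <= -CACHE_LINE_SIZE) {
        e->counter = 0;
        return 1;
    }
    return 0;
}
```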

Page 19: PARE: A Power-aware Prefetch Engine

Designed to reduce the power dissipation of the prefetch hardware tables.

Two ways to reduce power:
• reduce the size of each entry, exploiting the spatial locality of memory accesses
• partition the large table into multiple smaller tables

Page 20: Hardware Prefetch Table (figure)

Page 21: PARE Hardware Prefetch Table

Breaks the whole prefetch table up into 16 smaller tables, each containing 4 entries.

Each table also has an associated group number.

Only the lower 16 bits of the PC are used instead of 32 bits.
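
A rough sketch of the partitioned PARE lookup, assuming the group number steers each access to one of the 16 small tables so that only four 16-bit tags are matched per access; apart from the field widths stated on the slide, the details are illustrative.

```c
#include <stdint.h>

#define PARE_GROUPS            16  /* 16 small tables instead of one large one */
#define PARE_ENTRIES_PER_GROUP  4  /* only 4 entries are searched per access   */

typedef struct {
    uint16_t pc_tag;     /* lower 16 bits of the PC, not the full 32 bits */
    uint16_t prev_addr;  /* previous (partial) address                    */
    int16_t  stride;
    uint8_t  valid;
} PareEntry;

static PareEntry pare[PARE_GROUPS][PARE_ENTRIES_PER_GROUP];

/* Look up a load in PARE.  Only the 4 entries of one group are activated,
 * so the CAM match is over 16-bit tags in a 4-entry table rather than
 * 32-bit tags in one large table, which is where the power saving comes from. */
PareEntry *pare_lookup(uint8_t group, uint32_t pc) {
    uint16_t tag = (uint16_t)pc;             /* keep only the lower 16 bits */
    PareEntry *table = pare[group % PARE_GROUPS];
    for (int i = 0; i < PARE_ENTRIES_PER_GROUP; i++) {
        if (table[i].valid && table[i].pc_tag == tag)
            return &table[i];
    }
    return 0;                                 /* miss: caller allocates an entry */
}
```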

Page 22: PARE Table Design (figure)

Page 23: Advantages of the PARE Hardware Table

• Power consumption is reduced
• CAM cell power is reduced
• The small tables reduce total power consumption

Page 24: Conclusion

Hardware data prefetching improves performance; the goal here is to reduce its energy overhead and thereby the total energy consumption.

Compiler-assisted and hardware-based energy-aware techniques, together with a new power-aware prefetch engine (PARE), are used to achieve this.

Page 25: References

Y. Guo et al., "Energy-Efficient Hardware Data Prefetching," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 2, Feb. 2011.

A. J. Smith, "Sequential program prefetching in memory hierarchies," IEEE Computer, vol. 11, no. 12, pp. 7-21, Dec. 1978.

A. Roth, A. Moshovos, and G. S. Sohi, "Dependence based prefetching for linked data structures," in Proc. ASPLOS-VIII, Oct. 1998, pp. 115-126.