Energy-Efficient Hardware Data
Prefetching
Yao Guo, Mahmoud Abdullah Bennaser and Csaba Andras Moritz
CONTENTS
Introduction
Hardware prefetching
Hardware data prefetching methods
Performance speedup
Energy-aware prefetching techniques
PARE
Conclusion
References
Introduction
Data prefetching is the process of fetching data needed by the program in advance, before the instruction that requires it executes.
It hides apparent memory latency. Two types:
Software prefetching: using the compiler
Hardware prefetching: using additional circuitry
Hardware prefetching
Uses additional circuitry
Prefetch tables store recent load instructions and the relations between load instructions.
Better performance
Energy overhead comes from:
The energy cost of prefetch hardware table lookups
Unnecessary L1 cache lookups
Hardware Data Prefetching Methods
Sequential prefetching
Stride prefetching
Pointer prefetching
Combined stride and pointer prefetching
Sequential Prefetching
One block lookahead (OBL) approach: initiate a prefetch for block b+1 when block b is accessed
Prefetch-on-miss: prefetch whenever an access to block b results in a cache miss
Tagged prefetching
Associates a tag bit with every memory block
When a block is demand-fetched, or a prefetched block is referenced for the first time, the next block is fetched.
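The tagged OBL scheme above can be sketched in a few lines; this is a minimal simulation, assuming abstract integer block numbers and an unbounded cache (the class and method names are illustrative, not from the paper):

```python
# Sketch of one-block-lookahead (OBL) tagged prefetching. Each cached
# block carries a tag bit: set when the block was prefetched and has
# not yet been referenced.

class TaggedPrefetchCache:
    def __init__(self):
        self.cache = {}  # block number -> tag bit (True = prefetched, unused)

    def access(self, block):
        """Demand access to `block`; returns the blocks fetched from memory."""
        fetched = []
        if block not in self.cache:
            # Demand miss: fetch the block and prefetch its successor.
            self.cache[block] = False
            fetched.append(block)
            self._prefetch(block + 1, fetched)
        elif self.cache[block]:
            # First reference to a prefetched block: clear the tag and
            # prefetch the next block (the tagged prefetching rule).
            self.cache[block] = False
            self._prefetch(block + 1, fetched)
        return fetched

    def _prefetch(self, block, fetched):
        if block not in self.cache:
            self.cache[block] = True  # tag set: prefetched, not yet used
            fetched.append(block)
```

On a sequential scan this keeps exactly one block of lookahead ahead of the demand stream, while repeated hits to an already-used block trigger nothing.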
[Figure: OBL approaches — prefetch-on-miss vs. tagged prefetch, showing tag bits on demand-fetched and prefetched blocks]
Stride Prefetching
Employ special logic to monitor the processor’s address referencing pattern
Detect constant stride array references originating from looping structures
Compare successive addresses used by load or store instructions
Reference Prediction Table (RPT)
64 entries, 64 bits each
Holds the most recently used memory instructions; each entry stores:
Address of the memory instruction
Previous address accessed by the instruction
Stride value
State field
Organization of RPT
[Figure: the PC indexes an entry of (instruction tag, previous address, stride, state); the stride is the difference between successive addresses, and the prefetch address is the effective address plus the stride]
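The RPT lookup can be sketched as below; this simplification collapses the RPT's state field into a single "steady" flag and uses crude FIFO eviction (both assumptions, not the paper's exact state machine):

```python
# Minimal Reference Prediction Table (RPT) sketch: each entry tracks the
# previous address and stride of one load/store PC. A prefetch for
# (effective address + stride) is issued once the same nonzero stride
# is observed twice in a row.

class RPT:
    def __init__(self, entries=64):
        self.entries = entries
        self.table = {}  # PC -> {'prev': addr, 'stride': int, 'steady': bool}

    def access(self, pc, addr):
        """Record a memory access; return a prefetch address or None."""
        e = self.table.get(pc)
        if e is None:
            if len(self.table) >= self.entries:
                self.table.pop(next(iter(self.table)))  # crude FIFO eviction
            self.table[pc] = {'prev': addr, 'stride': 0, 'steady': False}
            return None
        stride = addr - e['prev']
        e['steady'] = (stride == e['stride'] and stride != 0)
        e['prev'], e['stride'] = addr, stride
        return addr + stride if e['steady'] else None
```

A loop walking an array of 8-byte elements trains the entry on its second iteration and prefetches one element ahead from the third iteration on.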
Pointer Prefetching
Effective for pointer-intensive programs
No constant stride
Dependence-based prefetching detects dependence relationships between loads
Uses two hardware tables:
Correlation Table (CT): stores dependence information
Potential Producer Window (PPW): records the most recently loaded values and the corresponding instructions
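The PPW/CT interplay can be sketched as follows; offsets into structures are ignored for simplicity (an assumption — the consumer is taken to dereference the loaded value directly), and the table sizes are illustrative:

```python
# Sketch of dependence-based prefetching: the Potential Producer Window
# (PPW) records recently loaded values with the PC that produced them;
# the Correlation Table (CT) stores producer->consumer PC pairs. When a
# known producer loads a value, the consumer's likely address (the value
# itself) is prefetched.

from collections import deque

class DependencePrefetcher:
    def __init__(self, ppw_size=16):
        self.ppw = deque(maxlen=ppw_size)  # (loaded value, producer PC)
        self.ct = {}                       # producer PC -> consumer PC

    def load(self, pc, base_addr, loaded_value):
        """Observe a load at `pc` from `base_addr` returning `loaded_value`."""
        prefetches = []
        # 1. If this load's base address was produced by a recent load,
        #    record the producer->consumer dependence in the CT.
        for value, producer in self.ppw:
            if value == base_addr:
                self.ct[producer] = pc
                break
        # 2. If this PC is a known producer, its consumer will likely
        #    dereference the value just loaded: prefetch that address.
        if pc in self.ct:
            prefetches.append(loaded_value)
        self.ppw.append((loaded_value, pc))
        return prefetches
```

For a linked-list traversal (`p = p->next`), the same PC is both producer and consumer, so after one step the dependence is learned and each subsequent load prefetches the next node.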
Combined Stride And Pointer Prefetching
Objective: a technique that works for all types of memory access patterns
Handles both array and pointer accesses
Better performance
Uses all three tables (RPT, PPW, CT)
Performance Speedup
The combined (stride+dep) technique has the best speedup for most benchmarks.
[Figure: speedup (0.8–2.4) over a no-prefetch baseline on mcf, parser, art, bzip2, galgel, bh, em3d, health, mst, perim, and the average, comparing sequential, tagged, stride, dependence, and stride+dep prefetching]
Energy-aware Prefetching Architecture
[Figure: stride and pointer prefetchers issue prefetches to the L1 D-cache through a Prefetch Filtering Buffer (PFB); compiler hints attached to load-queue (LDQ) entries and a stride counter filter prefetches before they reach the tag and data arrays; prefetched data comes from the L2 cache]
Four filtering techniques:
Compiler-Based Selective Filtering
Compiler-Assisted Adaptive Prefetching
Prefetch Filtering using a Stride Counter
Hardware Filtering using the PFB
Energy-aware Prefetching Techniques
Compiler-Based Selective Filtering (CBSF): only search the prefetch hardware tables for selected memory instructions
Compiler-Assisted Adaptive Prefetching (CAAP): select different prefetching schemes per access
Compiler-driven Filtering using a Stride Counter (SC): reduce prefetching energy
Hardware-based Filtering using the PFB: reduce L1 cache-related energy overhead
Compiler-based selective filtering
Only searches the prefetch hardware tables for selected memory instructions identified by the compiler
Energy is reduced by:
Considering only loop and recursive memory accesses
Considering only array and linked-data-structure memory accesses
Compiler-assisted adaptive prefetching
Selects a prefetching scheme based on the kind of access:
Memory accesses to an array that does not belong to any larger structure are fed only into the stride prefetcher.
Memory accesses to an array that belongs to a larger structure are fed into both the stride and pointer prefetchers.
Memory accesses to a linked data structure containing no arrays are fed only into the pointer prefetcher.
Memory accesses to a linked data structure that contains arrays are fed into both prefetchers.
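The four CAAP rules above amount to a small decision function; the boolean hint names below are illustrative, not the paper's actual compiler-hint encoding:

```python
# Sketch of CAAP scheme selection: given compiler hints describing a
# memory access, decide which prefetcher(s) it should be fed into.

def select_prefetchers(is_array, inside_larger_struct,
                       is_linked, contains_arrays):
    """Return the set of prefetchers for this access."""
    targets = set()
    if is_array:
        targets.add('stride')
        if inside_larger_struct:      # array embedded in a larger structure
            targets.add('pointer')
    if is_linked:
        targets.add('pointer')
        if contains_arrays:           # linked structure containing arrays
            targets.add('stride')
    return targets
```

A plain array access thus exercises only the RPT, and a plain linked-list access only the pointer tables, so the other prefetcher's tables stay idle and consume no lookup energy.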
Compiler-hinted Filtering Using a Runtime Stride Counter
Reduces prefetching energy wasted on memory access patterns with very small strides.
Prefetches are issued only when the stride is larger than half the cache line size; smaller strides are filtered.
Each cache line records the Program Counter (PC) and a stride counter.
The counter tracks how many times the instruction accesses the same line.
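A minimal sketch of the stride-counter filter, assuming a 32-byte cache line and a per-PC counter (the class shape and the FIFO-free bookkeeping are simplifications of the hardware):

```python
# Sketch of stride-counter prefetch filtering: small-stride accesses stay
# inside the current cache line for several iterations, so most of their
# prefetch requests are redundant and can be suppressed.

LINE_SIZE = 32  # bytes per cache line (illustrative)

class StrideCounterFilter:
    def __init__(self, line_size=LINE_SIZE):
        self.line_size = line_size
        self.counters = {}  # PC -> accesses remaining within the current line

    def allow(self, pc, stride):
        """Return True if a prefetch for this access should be issued."""
        if stride == 0:
            return False  # same address: nothing new to prefetch
        if abs(stride) > self.line_size // 2:
            return True   # large stride: crosses lines quickly, always allow
        n = self.counters.get(pc, 0)
        if n > 0:
            self.counters[pc] = n - 1
            return False  # still inside the current line: filter
        # Line boundary reached: allow one prefetch, then filter the next
        # line_size/stride - 1 small-stride accesses.
        self.counters[pc] = self.line_size // abs(stride) - 1
        return True
```

With a 4-byte stride, only one in eight accesses triggers a prefetch, cutting both table lookups and L1 tag checks for the filtered seven.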
PARE: A Power-aware Prefetch Engine
Used to reduce power dissipation
Two ways to reduce power:
Reduce the size of each entry, based on the spatial locality of memory accesses
Partition the large table into multiple smaller tables
PARE Hardware Prefetch Table
Breaks the whole prefetch table into 16 smaller tables
Each table contains 4 entries
Each entry also contains a group number
Only the lower 16 bits of the PC are used instead of 32 bits
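The partitioning can be sketched as follows; how PARE actually derives the group number from the PC is not specified here, so the bit selection below is an assumption for illustration:

```python
# Sketch of PARE-style table partitioning: a 64-entry prefetch table is
# split into 16 groups of 4 entries, indexed by a 4-bit group number
# taken from the PC, and each entry stores only the low 16 bits of the
# PC as its tag. Only one 4-entry group is searched per access, which
# is what cuts the CAM lookup power.

NUM_GROUPS, ENTRIES_PER_GROUP = 16, 4

class PARETable:
    def __init__(self):
        self.groups = [[] for _ in range(NUM_GROUPS)]  # lists of 16-bit tags

    @staticmethod
    def _group(pc):
        # Group number from low PC bits, skipping alignment bits (assumption).
        return (pc >> 2) % NUM_GROUPS

    def lookup(self, pc):
        """Search one 4-entry group instead of all 64 entries."""
        tag = pc & 0xFFFF  # low 16 bits of the PC
        return tag in self.groups[self._group(pc)]

    def insert(self, pc):
        g = self.groups[self._group(pc)]
        if len(g) >= ENTRIES_PER_GROUP:
            g.pop(0)  # evict the oldest entry (FIFO, for simplicity)
        g.append(pc & 0xFFFF)
```

Each lookup now compares against at most 4 short tags rather than 64 full ones, which is the source of both claimed power savings: smaller entries and a smaller searched table.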
PARE Table Design
Advantages of the PARE hardware table:
Power consumption is reduced: CAM cell power drops with the smaller entries
Small tables reduce total power consumption
Conclusion
Improves performance
Reduces the energy overhead of hardware data prefetching
Reduces total energy consumption
Compiler-assisted and hardware-based energy-aware techniques, together with a new power-aware prefetch engine (PARE), are used.
References
Y. Guo et al., “Energy-Efficient Hardware Data Prefetching,” IEEE Trans. VLSI Syst., vol. 19, no. 2, Feb. 2011.
A. J. Smith, “Sequential program prefetching in memory hierarchies,”IEEE Computer, vol. 11, no. 12, pp. 7–21, Dec. 1978.
A. Roth, A. Moshovos, and G. S. Sohi, “Dependence based prefetching for linked data structures,” in Proc. ASPLOS-VIII, Oct. 1998, pp.115–126.