Feedback Directed Prefetching
Santhosh Srinath, Onur Mutlu, Hyesoon Kim, Yale N. Patt
Problem
Prefetching can significantly improve performance when prefetches are accurate and timely.
However, prefetching can also significantly degrade performance, due to its memory bandwidth impact and pollution of the cache.
HPCA-13 Feedback Directed Prefetching 2
Solution
Feedback Directed Prefetching is a comprehensive mechanism that reduces the negative effects of prefetching while improving its positive effects.
Outline
Background and Motivation
Feedback Directed Prefetching (FDP)
  Metrics and how to collect them
  How to adapt
    Prefetcher Aggressiveness
    Cache Insertion Policy for Prefetches
Results
Background (Prefetcher Aggressiveness)
[Figure: an access stream reaching block X with the predicted stream ahead of it. Labels mark the Prefetch Distance (up to Pmax), the Prefetch Degree (blocks 1, 2, 3 starting at X+1), and the position of Pmax for the Very Conservative, Middle of the Road, and Very Aggressive configurations.]
Background (Prefetcher Aggressiveness)
Very Aggressive
  Well ahead of the load access stream
  Hides memory access latency better
  More speculative
Very Conservative
  Closer to the load access stream
  Might not hide memory access latency completely
  Reduces potential for cache pollution and bandwidth contention
Motivation
[Figure: Instructions per Cycle (y-axis, 0.0 to 5.0) for No Prefetching, Very Conservative, Middle-of-the-Road, and Very Aggressive. Callouts on the chart: 48%, 29%.]
Very Aggressive improves average performance by 84%.
However, it can also significantly reduce performance on some benchmarks.
Feedback Directed Prefetching
Comprehensive mechanism which takes into account:
  Prefetcher Accuracy
  Prefetcher Lateness
  Prefetcher-caused Cache Pollution
Adapts:
  Prefetcher Aggressiveness
  Cache Insertion Policy for Prefetches
Metrics
Prefetch Accuracy
Prefetch Lateness
Prefetcher-caused Cache Pollution
Prefetch Accuracy
Useful prefetches are those referenced by demand requests while they are in the L2.

\[ \text{Prefetcher Accuracy} = \frac{\text{Number of Useful Prefetches}}{\text{Number of Prefetches Sent to Memory}} \]
Prefetch Accuracy
Low accuracy: more likely that prefetching can reduce performance
[Figure: percentage IPC change over No Prefetching (y-axis, -100% to 400%) versus Prefetcher Accuracy (x-axis, 0 to 1).]
Prefetch Accuracy
Implementation
  pref-bit added to each L2 tag-store entry
  Tracked using two counters: pref_total, used_total

\[ \text{Prefetcher Accuracy} = \frac{\text{used\_total}}{\text{pref\_total}} \]
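The bookkeeping on this slide can be sketched in software. This is only an illustration, assuming a dictionary stands in for the L2 tag store; the class and method names are hypothetical, and only the pref-bit, `pref_total`, and `used_total` come from the slide.

```python
class AccuracyTracker:
    """Toy model of the L2 pref-bit plus the pref_total / used_total counters."""

    def __init__(self):
        self.pref_bit = {}   # block address -> pref-bit (stand-in for the L2 tag store)
        self.pref_total = 0  # prefetches sent to memory
        self.used_total = 0  # prefetched blocks later referenced by a demand request

    def prefetch(self, addr):
        # A prefetch is sent to memory and the block is filled with its pref-bit set.
        self.pref_total += 1
        self.pref_bit[addr] = True

    def demand_access(self, addr):
        # The first demand reference to a prefetched block makes it a useful prefetch.
        # Assumption: the pref-bit is cleared so the block is counted only once.
        if self.pref_bit.get(addr):
            self.used_total += 1
            self.pref_bit[addr] = False

    def accuracy(self):
        return self.used_total / self.pref_total if self.pref_total else 0.0
```

Issuing four prefetches and then referencing two of the prefetched blocks gives an accuracy of 0.5 in this model, regardless of how many times each block is re-referenced.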
Prefetch Lateness
Measure of how timely prefetches are
Used to determine if increasing the aggressiveness helps

\[ \text{Prefetch Lateness} = \frac{\text{Number of Late Prefetches}}{\text{Number of Useful Prefetches}} \]

Implementation
  pref-bit added to each L2 MSHR entry
  New counter: late_total

\[ \text{Prefetch Lateness} = \frac{\text{late\_total}}{\text{used\_total}} \]
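A rough sketch of how the MSHR pref-bit could feed `late_total`, assuming a prefetch counts as late when a demand request finds it still in flight in the MSHR. The class and method names are hypothetical, and counting a late prefetch as useful as well is an assumption made here so that the ratio stays at or below 1.

```python
class LatenessTracker:
    """Toy model of MSHR pref-bits with the late_total / used_total counters."""

    def __init__(self):
        self.mshr_pref_bit = {}  # in-flight block address -> pref-bit
        self.late_total = 0
        self.used_total = 0

    def issue_prefetch(self, addr):
        # Allocate an MSHR entry for the prefetch and set its pref-bit.
        self.mshr_pref_bit[addr] = True

    def fill(self, addr):
        # Data returned from memory: the MSHR entry is freed.
        self.mshr_pref_bit.pop(addr, None)

    def demand_request(self, addr):
        # A demand request that finds its block still in flight as a prefetch
        # means the prefetch was useful but late.
        if self.mshr_pref_bit.get(addr):
            self.late_total += 1
            self.used_total += 1  # assumption: late prefetches also count as useful
            self.mshr_pref_bit[addr] = False

    def timely_use(self):
        # A demand hit on an already-filled prefetched block (tracked in the L2).
        self.used_total += 1

    def lateness(self):
        return self.late_total / self.used_total if self.used_total else 0.0
```

With one late prefetch and two timely ones, `lateness()` is one third in this model.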
Prefetcher-caused Cache Pollution
Measure of the disturbance caused by prefetched data in the cache
Used to determine if the prefetcher is evicting useful data from the cache

\[ \text{Prefetcher-caused Cache Pollution} = \frac{\text{Number of Demand Misses caused by the Prefetcher}}{\text{Number of Demand Misses}} \]
Prefetcher-caused Cache Pollution (2)
Hardware Implementation
  Insight: this does not need to be exact
  Track pollution using a Pollution Filter
    Based on the Bloom Filter concept
    Bit set when a prefetch evicts a demand-fetched block
    Bit reset when a prefetch is serviced
  Two counters: pollution_total, demand_total

\[ \text{Prefetcher-caused Cache Pollution} = \frac{\text{pollution\_total}}{\text{demand\_total}} \]
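The filter can be illustrated with a small bit vector. This is a minimal sketch: the 4096-entry size matches the hardware cost slide, but the modulo index function, the class, and the method names are assumptions for illustration, not the hashing used in the actual design.

```python
class PollutionFilter:
    """Toy Bloom-filter-style bit vector with pollution_total / demand_total."""

    def __init__(self, entries=4096):
        self.bits = [False] * entries
        self.pollution_total = 0
        self.demand_total = 0

    def _index(self, block_addr):
        # Assumption: a simple modulo hash; different addresses may alias,
        # which is acceptable because the estimate need not be exact.
        return block_addr % len(self.bits)

    def prefetch_evicts(self, evicted_addr):
        # A prefetch evicted a demand-fetched block: remember its address.
        self.bits[self._index(evicted_addr)] = True

    def prefetch_serviced(self, prefetched_addr):
        # A prefetch for this address was serviced: clear the matching bit.
        self.bits[self._index(prefetched_addr)] = False

    def demand_miss(self, addr):
        # Every demand miss is counted; it is attributed to the prefetcher
        # if the filter bit for its address is set.
        self.demand_total += 1
        if self.bits[self._index(addr)]:
            self.pollution_total += 1

    def pollution(self):
        return self.pollution_total / self.demand_total if self.demand_total else 0.0
```

Because the filter is indexed by a hash, two addresses that map to the same bit are indistinguishable, which is the inexactness the slide says is tolerable.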
How to adapt? Prefetcher Aggressiveness

Dynamic Configuration Counter   Current Aggressiveness   Distance   Degree
1                               Very Conservative        4          1
2                               Conservative             8          1
3                               Middle-of-the-Road       16         2
4                               Aggressive               32         4
5                               Very Aggressive          64         4
How to adapt? Prefetcher Aggressiveness (2)
For the current phase, based on static thresholds, classify:
  Accuracy
  Lateness
  Cache pollution caused by prefetches

High Accuracy
  Late -> Increase (improve timeliness)
  Not-Late and Polluting -> Decrease (reduce cache pollution)
Med Accuracy
  Not-Poll and Late -> Increase
  Polluting -> Decrease
Low Accuracy
  Not-Poll and Not-Late -> No Change
  Otherwise -> Decrease (reduce memory bandwidth usage and cache pollution)
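The classification above, together with the five-level configuration table, can be written as a small update function. This is a sketch under stated assumptions: the function name and string labels are hypothetical, the counter is assumed to saturate at 1 and 5, and the two cases the slide does not show (High or Med accuracy, Not-Late, Not-Polluting) are assumed to leave the counter unchanged.

```python
# (prefetch distance, prefetch degree) per Dynamic Configuration Counter value,
# taken from the configuration table in the slides.
CONFIGS = {1: (4, 1), 2: (8, 1), 3: (16, 2), 4: (32, 4), 5: (64, 4)}


def next_level(level, accuracy, late, polluting):
    """Return the new counter value for one phase.

    accuracy is "high", "med", or "low"; late and polluting are booleans
    obtained by comparing the measured metrics with static thresholds.
    """
    if accuracy == "high":
        if late:
            step = +1          # improve timeliness
        elif polluting:
            step = -1          # reduce cache pollution
        else:
            step = 0           # assumption: case not shown on the slide
    elif accuracy == "med":
        if polluting:
            step = -1
        elif late:
            step = +1
        else:
            step = 0           # assumption: case not shown on the slide
    else:  # low accuracy
        step = 0 if (not late and not polluting) else -1
    # Assumption: the counter saturates at the table's end points.
    return min(5, max(1, level + step))
```

For example, a Middle-of-the-Road prefetcher (level 3) that is accurate but late moves to level 4, so `CONFIGS[4]` selects distance 32 and degree 4 for the next phase.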
How to adapt? Cache Insertion Policy for Prefetches
Why adapt?
  Reduce the potential for cache pollution
Classify cache pollution based on static thresholds:
  Low: insert at MID (n/2) position
    E.g., for a 16-way cache, MID = 8 in the LRU stack
  Medium: insert at LRU-4 (n/4) position
    E.g., for a 16-way cache, LRU-4 = 4 in the LRU stack
  High: insert at LRU position
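The mapping from pollution class to insertion position is simple enough to state as a function. This is an illustrative sketch: the function name and string labels are hypothetical, and positions are assumed to be counted from the LRU end of the stack, with 0 meaning the LRU position.

```python
def insertion_position(pollution_level, ways=16):
    """Return the LRU-stack position (0 = LRU) at which to insert a prefetched block."""
    if pollution_level == "low":
        return ways // 2   # MID: n/2
    if pollution_level == "medium":
        return ways // 4   # LRU-4: n/4
    if pollution_level == "high":
        return 0           # LRU position
    raise ValueError(f"unknown pollution level: {pollution_level!r}")
```

For the 16-way cache used in the slides this yields positions 8, 4, and 0, matching the examples given for MID, LRU-4, and LRU.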
Evaluation Methodology
Execution-driven Alpha simulator
  Aggressive out-of-order superscalar processor
  1 MB, 16-way, 10-cycle unified L2 cache
  500-cycle minimum main memory latency
  Detailed memory model
Prefetchers modeled:
  Stream Prefetcher tracking 64 different streams
  Global History Buffer Prefetcher (in paper)
  PC-based Stride Prefetcher (in paper)
Results: Adjusting Only Aggressiveness
4.7% higher average IPC over the Very Aggressive configuration
Most of the performance losses have been eliminated
Results: Adjusting Only Cache Insertion Policy
[Figure: Instructions per Cycle (y-axis, 0.0 to 5.0) with the Very Aggressive Prefetcher, for No Prefetching, LRU, LRU-4, MID, MRU, and Dynamic Insertion.]
5.1% better than inserting prefetches in MRU position
1.9% better than inserting prefetches in LRU-4 position
Results: Putting it all together (FDP)
6.5% IPC improvement over the Very Aggressive configuration
Performance losses converted to performance gains!
[Figure callouts: 11%, 13%]
Bandwidth Impact
BPKI: memory bus accesses per 1000 retired instructions
  Includes effects of L2 demand misses as well as pollution-induced misses and prefetches
FDP significantly improves bandwidth efficiency

        No Pref.   Very Cons   Mid     Very Aggr   FDP
IPC     0.85       1.21        1.47    1.57        1.67
BPKI    8.56       9.34        10.60   13.38       10.88

FDP vs Very Aggr: 6.5% higher performance and 18.7% less bandwidth
FDP vs Mid: 13.6% higher performance with similar bandwidth usage
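The comparisons quoted on this slide can be checked against the table with a few lines of arithmetic. The dictionary keys and the helper name below are only illustrative. Because the table values are rounded to two decimals, the recomputed IPC gain over Very Aggressive comes out slightly below the quoted 6.5%.

```python
ipc = {"No Pref.": 0.85, "Very Cons": 1.21, "Mid": 1.47, "Very Aggr": 1.57, "FDP": 1.67}
bpki = {"No Pref.": 8.56, "Very Cons": 9.34, "Mid": 10.60, "Very Aggr": 13.38, "FDP": 10.88}


def pct_change(new, old):
    """Relative change of new with respect to old, in percent."""
    return 100.0 * (new - old) / old


# FDP compared with the Very Aggressive configuration.
ipc_vs_aggr = pct_change(ipc["FDP"], ipc["Very Aggr"])     # about +6.4
bpki_vs_aggr = pct_change(bpki["FDP"], bpki["Very Aggr"])  # about -18.7

# FDP compared with the Middle-of-the-Road configuration.
ipc_vs_mid = pct_change(ipc["FDP"], ipc["Mid"])            # about +13.6
bpki_vs_mid = pct_change(bpki["FDP"], bpki["Mid"])         # about +2.6

print(f"FDP vs Very Aggr: IPC {ipc_vs_aggr:+.1f}%, BPKI {bpki_vs_aggr:+.1f}%")
print(f"FDP vs Mid:       IPC {ipc_vs_mid:+.1f}%, BPKI {bpki_vs_mid:+.1f}%")
```

The roughly 2.6% BPKI difference between FDP and Middle-of-the-Road is what the slide summarizes as similar bandwidth usage.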
Hardware Cost

Component                Size                    Cost
pref-bits for L2 cache   16384 blocks            16384 bits
Pollution Filter         4096 entries * 1 bit    4096 bits
16-bit counters          11 counters             176 bits
pref-bits for MSHR       128 entries             128 bits

Total hardware cost: 20784 bits = 2.54 KB
Percentage area overhead compared to baseline 1 MB L2 cache: 2.5 KB / 1024 KB = 0.24%
NOT on the critical path
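The totals on this slide follow directly from the four components, as the short check below shows. The variable names are illustrative. Note that the slide rounds the total to 2.5 KB before dividing, which gives 0.24%; using the unrounded total gives a value closer to 0.25%.

```python
# Storage cost of each component in bits, as listed on the slide.
components_bits = {
    "pref-bits for L2 cache": 16384 * 1,  # one bit per L2 block
    "Pollution Filter": 4096 * 1,         # 4096 one-bit entries
    "16-bit counters": 11 * 16,           # 11 counters of 16 bits each
    "pref-bits for MSHR": 128 * 1,        # one bit per MSHR entry
}

total_bits = sum(components_bits.values())
total_kb = total_bits / 8 / 1024          # bits -> bytes -> KB
overhead_pct = 100.0 * total_kb / 1024    # relative to a 1 MB (1024 KB) L2 cache

print(f"total: {total_bits} bits = {total_kb:.2f} KB, overhead {overhead_pct:.2f}%")
```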
Outline
Background and Motivation
Feedback Directed Prefetching
  Metrics and collecting this information in hardware
  How to adapt
Results
Conclusions
Contributions
Comprehensive and low-cost feedback mechanism for hardware prefetchers
  Uses:
    Prefetcher Accuracy
    Prefetcher Lateness
    Prefetcher-caused Cache Pollution
  Adapts:
    Aggressiveness
    Cache Insertion Policy for prefetches
6.5% higher performance and 18.7% less bandwidth compared to Very Aggressive Prefetching
  Eliminates negative impact of prefetching
Applicable to any data prefetch algorithm
Questions?
Backups
FDP vs Prefetch Cache
Prefetch caches eliminate prefetcher-induced cache pollution
However, prefetches are now limited to the size of the prefetch cache
FDP has 5.3% higher performance than Very Aggressive + 32 KB prefetch cache
FDP is within 2% of Very Aggressive + 64 KB prefetch cache
Memory bandwidth of FDP is 16% less than with the 32 KB prefetch cache and 9% less than with the 64 KB one
Performance on Other Prefetch Algorithms
Global History Buffer Prefetcher
  20.8% less memory bandwidth than very aggressive, with similar performance
  9.9% better performance than middle-of-the-road, with similar bandwidth usage
PC-based Stride Prefetcher
  4% better performance than very aggressive
  24% reduction in bandwidth usage
IPC Performance
Dynamic Prefetcher Accuracy
Prefetch Lateness
Pollution Filter
Thresholds
Prefetches Sent
Distribution of dynamic aggressiveness level
Distribution of insertion position of prefetched blocks
Effect of FDP on memory bandwidth consumption
Performance of Prefetch cache vs FDP
Bandwidth consumption of prefetch cache vs. FDP
Effect of FDP on GHB
Effect of FDP on GHB (Bandwidth)
Effect of varying L2 size and memory latency
IPC on other benchmarks
BPKI on other benchmarks