![Page 1: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/1.jpg)
Far Fetched Prefetching?
Tomofumi Yuki (INRIA Rennes), Antoine Morvan (ENS Cachan Bretagne), Steven Derrien (University of Rennes 1)
![Page 2: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/2.jpg)
Memory Optimizations

- Memory Wall
  - Memory improves more slowly than processors
  - Very little improvement in latency
- Making sure processors have data to consume
  - Software: tiling, prefetching, array contraction
  - Hardware: cache, prefetching
- Important for both speed and power
  - Accessing memory further from the processor takes more power
![Page 3: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/3.jpg)
Prefetching

- Anticipate future memory accesses and start data transfer in advance
  - Both HW and SW versions exist
- Hides latency of memory accesses
  - Cannot help if bandwidth is the issue
- Inaccurate prefetch is harmful
  - Consumes unnecessary bandwidth/energy
  - Pressure on caches
![Page 4: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/4.jpg)
This talk

- A failure report based on our attempt to improve prefetching
- We target cases where the trip count is small
  - Prefetch instructions must be placed in previous iterations of the outer loop
- Use polyhedral techniques to find where to insert prefetches
![Page 5: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/5.jpg)
Outline

- Introduction
- Software Prefetching
  - When it doesn’t work
  - Improving prefetch placement
- Code Generation
- Simple Example
- Summary and Next Steps
![Page 6: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/6.jpg)
Software Prefetching [Mowry 96]

- Shift the iterations by the prefetching distance
- Simple yet effective method for regular computations

Original loop:

    for (i=0; i<N; i++)
        … = foo(A[i], …)

With prefetching (prefetch distance = 4):

    // Prologue
    for (i=-4; i<0; i++)
        prefetch(A[i+4]);
    // Steady state
    for (i=0; i<N-4; i++) {
        … = foo(A[i], …)
        prefetch(A[i+4]);
    }
    // Epilogue
    for (i=N-4; i<N; i++)
        … = foo(A[i], …)
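The prologue / steady state / epilogue structure can be sketched as a compilable function. This is a minimal sketch, not the authors' generated code: `sum_with_prefetch` is a hypothetical name, a plain summation stands in for `foo`, and the slide's `prefetch` is assumed to map to the GCC/Clang builtin `__builtin_prefetch`, which is only a hint to the hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a[0..n-1] while prefetching `dist` elements ahead of the current read.
   Assumption: prefetch() on the slide maps to __builtin_prefetch (GCC/Clang). */
double sum_with_prefetch(const double *a, size_t n, size_t dist)
{
    assert(dist > 0);
    double sum = 0.0;
    size_t i;
    size_t warm = dist < n ? dist : n;        /* prologue length */
    size_t steady = n > dist ? n - dist : 0;  /* iterations with an element `dist` ahead */

    /* Prologue: prefetch the first elements before any computation. */
    for (i = 0; i < warm; i++)
        __builtin_prefetch(&a[i]);

    /* Steady state: consume a[i] while prefetching a[i + dist]. */
    for (i = 0; i < steady; i++) {
        sum += a[i];
        __builtin_prefetch(&a[i + dist]);
    }

    /* Epilogue: the remaining elements were already prefetched. */
    for (; i < n; i++)
        sum += a[i];

    return sum;
}
```

Unlike the slide, the prologue here is clamped to the array length so that no address past the end of `a` is formed when `n` is smaller than `dist`.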
![Page 7: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/7.jpg)
When Prefetching Works, When It Doesn’t, and Why [Lee, Kim, and Vuduc HiPEAC 2013]

- Difficult to statically determine prefetching distance
  - They suggest use of tuning
  - Not in the scope of our work
- Interference with HW prefetchers
- Cannot handle short streams of accesses
  - Limitation for both software and hardware
- We try to handle short streams
![Page 8: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/8.jpg)
Problem with Short Streams

- When N=5
  - Most prefetches are issued “too early”
  - Not enough computation to hide the latency
- Simply translating iterations in the innermost loop is not sufficient

With prefetch distance = 4 and N=5:

    for (i=-4; i<0; i++)
        prefetch(A[i+4]);
    for (i=0; i<1; i++) {
        … = foo(A[i], …)
        prefetch(A[i+4]);
    }
    for (i=1; i<5; i++)
        … = foo(A[i], …)
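To make the short-stream problem concrete, the following sketch counts how many elements of a stream are prefetched with the full lead of `d` iterations under innermost-loop shifting. The function names are hypothetical, and the model is an assumption: an element `k < d` is prefetched in the prologue, so only `k` iterations of computation separate its prefetch from its use.

```c
#include <assert.h>

/* Iterations of computation between the prefetch of A[k] and its use,
   when the innermost loop is shifted by distance d (assumed model). */
int lead_iterations(int k, int d)
{
    return k < d ? k : d;
}

/* Number of elements in a stream of length n that get the full lead d. */
int full_lead_prefetches(int n, int d)
{
    int count = 0;
    for (int k = 0; k < n; k++)
        if (lead_iterations(k, d) == d)
            count++;
    return count;
}
```

Under this model, `full_lead_prefetches(5, 4)` is 1: only A[4] is prefetched a full four iterations ahead, which matches the single-iteration steady-state loop in the N=5 code.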
![Page 9: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/9.jpg)
2D Illustration

- Large number of useless prefetches

[Figure: 2D iteration space with axes i and j]
![Page 10: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/10.jpg)
Lexicographical Shift

- The main idea to improve prefetch placement

[Figure: 2D iteration space with axes i and j]
![Page 11: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/11.jpg)
Outline

- Introduction
- Software Prefetching
- Code Generation
  - Polyhedral representation
  - Avoiding redundant prefetch
- Simple Example
- Summary and Next Steps
![Page 12: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/12.jpg)
Prefetch Code Generation

- Main Problem: Placement
  - Prefetching distance
    - Manually given parameter in our work
  - Lexicographically shifting the iteration spaces
  - Avoid reading the same array element twice
  - Avoid prefetching the same line multiple times
- We use the polyhedral representation to handle these difficulties
![Page 13: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/13.jpg)
Polyhedral Representation

- Represent the iteration space (of polyhedral programs) as a set of points (polyhedra)
- Simply a different view of the program
- Manipulate using the convenient polyhedral representation, and then generate loops

    for (i=1; i<=N; i++)
        for (j=1; j<=i; j++)
            S0

Domain(S0) = [i,j] : 1≤j≤i≤N

[Figure: the triangular domain of S0 in the (i,j) plane, bounded by i=j, i=N, and j=1]
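The correspondence between the loop nest and its domain can be checked by brute force. The sketch below is illustrative only: `in_domain_s0` and `count_s0_instances` are hypothetical names, and a real polyhedral tool manipulates the constraints symbolically rather than enumerating points.

```c
#include <assert.h>
#include <stdbool.h>

/* Membership test for Domain(S0) = [i,j] : 1 <= j <= i <= N. */
bool in_domain_s0(int i, int j, int n)
{
    return 1 <= j && j <= i && i <= n;
}

/* Run the loop nest, check every visited point against the set
   description, and return the number of instances of S0. */
int count_s0_instances(int n)
{
    int count = 0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= i; j++) {
            assert(in_domain_s0(i, j, n));
            count++;
        }
    return count;
}
```

For N=4 the nest executes S0 ten times, one per integer point of the triangle.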
![Page 14: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/14.jpg)
Transforming Iteration Spaces

- Expressed as affine transformations
  - Shifting by a constant is even simpler
- Example: shift the iterations by 4 along j
  - (i,j→i,j-4)

Domain(S0) = [i,j] : 1≤j≤i≤N

Domain(S0’) = [i,j] : 1≤i≤N && -3≤j≤i-4

[Figure: S0 and the shifted S0’ in the (i,j) plane]
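As a sanity check on the shifted constraints, the sketch below compares the image of Domain(S0) under (i,j→i,j-4) with the stated description of Domain(S0’) over a bounding box. The function names and the size of the box are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Domain(S0) = [i,j] : 1 <= j <= i <= N. */
bool in_s0(int i, int j, int n)
{
    return 1 <= j && j <= i && i <= n;
}

/* Domain(S0') = [i,j] : 1 <= i <= N && -3 <= j <= i-4. */
bool in_s0_shifted(int i, int j, int n)
{
    return 1 <= i && i <= n && -3 <= j && j <= i - 4;
}

/* (i,j) belongs to the shifted domain exactly when (i,j+4) belongs to the
   original one; check this for every point of a box around both sets. */
bool shift_matches(int n)
{
    for (int i = -5; i <= n + 5; i++)
        for (int j = -10; j <= n + 5; j++)
            if (in_s0(i, j + 4, n) != in_s0_shifted(i, j, n))
                return false;
    return true;
}
```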
![Page 15: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/15.jpg)
Lex. Shift as Affine Transform

- Piece-wise affine transformation
  - (i,j→i,j-1) if j>1 or i=1
  - (i,j→i-1,i-1) if j=1 and i>1
- Apply n times for prefetch distance n

Domain(S0) = [i,j] : 1≤j≤i≤N

Domain(S0’) = [i,j] : <complicated>

[Figure: S0 and S0’, shifted lexicographically by n, in the (i,j) plane]
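The two pieces of the transformation can be written directly as a function on points and applied n times. This is a sketch on concrete points only: `Point`, `lex_pred`, and `lex_shift` are hypothetical names, and the approach described here composes the map symbolically on the whole domain instead.

```c
#include <assert.h>

typedef struct { int i, j; } Point;

/* One step back in lexicographic order on the triangle 1 <= j <= i <= N. */
Point lex_pred(Point p)
{
    Point q;
    if (p.j > 1 || p.i == 1) {   /* stay in the same row */
        q.i = p.i;
        q.j = p.j - 1;
    } else {                     /* j == 1 and i > 1: wrap to the end of row i-1 */
        q.i = p.i - 1;
        q.j = p.i - 1;
    }
    return q;
}

/* Apply the one-step map n times for a prefetch distance of n. */
Point lex_shift(Point p, int n)
{
    for (int k = 0; k < n; k++)
        p = lex_pred(p);
    return p;
}
```

For example, shifting (4,2) by a distance of 4 visits (4,1), (3,3), (3,2) and ends at (3,1). Points in row 1 keep moving to j ≤ 0, which corresponds to prologue iterations outside the original domain.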
![Page 16: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/16.jpg)
Avoiding Redundant Prefetch: Same Element

- Given:
  - Target array: A
  - Set of statement instances that read from A
  - Array access functions for each read access
- Find the set of first readers:
  - Statement instances that first read from each element in A
  - Find the lex. min among the set of points accessing the same array element
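A brute-force version of the first-reader computation may help fix the idea. Everything specific in this sketch is an assumption: the read `A[i + j]` over a rectangular space is a made-up access function, `Iter`, `MAX_ELEMS`, and `first_readers` are hypothetical names, and a polyhedral implementation would compute the lexicographic minimum symbolically instead of enumerating iterations.

```c
#include <assert.h>

#define MAX_ELEMS 64

typedef struct { int i, j; } Iter;

/* For the hypothetical read A[i + j] over 0 <= i < n, 0 <= j < m, record for
   each element the first iteration that reads it. The loops run in
   lexicographic order, so the first visit is the lexicographic minimum.
   Returns the number of distinct elements read. */
int first_readers(int n, int m, Iter first[MAX_ELEMS])
{
    int seen[MAX_ELEMS] = {0};
    int distinct = 0;
    assert(n >= 1 && m >= 1 && n + m - 1 <= MAX_ELEMS);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++) {
            int e = i + j;            /* element index read by iteration (i,j) */
            if (!seen[e]) {
                seen[e] = 1;
                first[e].i = i;
                first[e].j = j;
                distinct++;
            }
        }
    return distinct;
}
```

With n = m = 3 there are nine iterations but only five distinct elements, so prefetching only at the first readers would issue five prefetches instead of nine.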
![Page 17: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/17.jpg)
Avoiding Redundant Prefetch: Same Cache Line

- Let an element of array A be ¼ of a line.
- The following prefetches the same line 4 times:

    for i … {
        prefetch(A[i+4]);
        … = foo(A[i], …);
    }

- We apply unrolling to avoid redundancy:

    for i … {
        prefetch(A[i+4]);
        … = foo(A[i], …);
        … = foo(A[i+1], …);
        … = foo(A[i+2], …);
        … = foo(A[i+3], …);
    }
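A compilable sketch of the unrolled form follows, under several assumptions: four elements per cache line (as on the slide), an array that starts on a line boundary, summation standing in for `foo`, and `__builtin_prefetch` (GCC/Clang) standing in for `prefetch`. The name `sum_one_prefetch_per_line` and the prefetch counter are hypothetical additions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define ELEMS_PER_LINE 4   /* assumption from the slide: one element is 1/4 of a line */

/* Sum a[0..n-1], issuing one prefetch hint per cache line, and report
   how many hints were issued through *prefetches. */
double sum_one_prefetch_per_line(const double *a, size_t n, size_t *prefetches)
{
    double sum = 0.0;
    size_t i = 0;
    *prefetches = 0;

    /* Unrolled body: one prefetch covers the four elements of the next line.
       The bound keeps a[i + ELEMS_PER_LINE] inside the array. */
    for (; i + 2 * ELEMS_PER_LINE <= n; i += ELEMS_PER_LINE) {
        __builtin_prefetch(&a[i + ELEMS_PER_LINE]);
        (*prefetches)++;
        sum += a[i];
        sum += a[i + 1];
        sum += a[i + 2];
        sum += a[i + 3];
    }

    /* Remainder: no further line to prefetch. */
    for (; i < n; i++)
        sum += a[i];

    return sum;
}
```

For 16 elements this issues 3 prefetches, where the non-unrolled loop with the same bound would issue 12.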
![Page 18: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/18.jpg)
Outline

- Introduction
- Software Prefetching
- Code Generation
- Simple Example
  - Simulation results
- Summary and Next Steps
![Page 19: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/19.jpg)
Simple Example

- A contrived example that had better work
- Expectation: when M is small and N is large, lexicographical shift should work better
- Compare between
  - unroll only
  - Mowry prefetching (shift in innermost)
  - Proposed (lexicographic shift)

    for (i=0; i<N; i+=1)
        for (j=0; j<M; j+=1)
            sum = sum + A[i][j];
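For this rectangular example, a lexicographic shift can be emulated by hand. The sketch below is not the generated code described in the talk: it assumes a contiguous row-major array, uses integer division and a bounds guard where a polyhedral code generator would emit separate loop nests, omits the prologue, and relies on `__builtin_prefetch` (GCC/Clang). The name `sum_lex_shift` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sum an n-by-m row-major array, prefetching the element that is `dist`
   iterations ahead in lexicographic (i, j) order. The target may lie in a
   later row, which is the point of shifting lexicographically when m is small. */
double sum_lex_shift(const double *a, size_t n, size_t m, size_t dist)
{
    double sum = 0.0;
    size_t total = n * m;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++) {
            size_t p = i * m + j + dist;          /* linear position of the target iteration */
            if (p < total) {
                size_t pi = p / m, pj = p % m;    /* target iteration (pi, pj) */
                __builtin_prefetch(&a[pi * m + pj]);
            }
            sum += a[i * m + j];
        }
    return sum;
}
```

With m = 3 and a distance of 4, every prefetch targets the next row or the one after it, whereas a shift confined to the innermost loop would have no steady-state iterations at all.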
![Page 20: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/20.jpg)
Search for Simulators

- Need simulators to experiment with
  - Memory latency
  - Number of outstanding prefetches
  - Line size, and so on
- Tried on many simulators
  - XEEMU: Intel XScale
  - SoCLib: SoC
  - gem5: Alpha/ARM/SPARC/x86
  - VEX: VLIW
![Page 21: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/21.jpg)
Simulation with VEX (HP Labs)

| variant | cycles | misses | prefetches | effective | speedup |
|---|---|---|---|---|---|
| original | 658k | 2015 | - | - | - |
| unroll | 610k | 2015 | - | - | 1.08 |
| mowry | 551k | 1020 | 2000 | 992 | 1.19 |
| lex. shift | 480k | 25 | 2001 | 1985 | 1.37 |

- Miss penalty: 72 cycles
- Effective: number of useful prefetches
- We see what we expect
- But, for more “complex” examples, the benefit diminishes (<3%)
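The reported speedups appear consistent with dividing the cycle count of the original code by that of each variant; this reading is an assumption, as is the ratio of effective to issued prefetches, which the slide does not report and which is derived here from the two columns.

```latex
\begin{align*}
\text{speedup}_{\text{unroll}} &= \frac{658\text{k}}{610\text{k}} \approx 1.08, &
\text{speedup}_{\text{mowry}} &= \frac{658\text{k}}{551\text{k}} \approx 1.19, &
\text{speedup}_{\text{lex. shift}} &= \frac{658\text{k}}{480\text{k}} \approx 1.37, \\
\left.\frac{\text{effective}}{\text{prefetches}}\right|_{\text{mowry}} &= \frac{992}{2000} \approx 0.50, &
\left.\frac{\text{effective}}{\text{prefetches}}\right|_{\text{lex. shift}} &= \frac{1985}{2001} \approx 0.99.
\end{align*}
```

On this example, roughly half of the prefetches issued by the innermost-loop shift are useful, against nearly all of those issued by the lexicographic shift.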
![Page 22: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/22.jpg)
Lex. Shift was Overkill

- Precision in placement does not necessarily translate to benefit
- Computationally very expensive
  - Requires raising a piecewise affine function to the power of the prefetch distance
  - Takes a lot of memory and time
- High control overhead
  - Code size more than doubles compared to translation in the innermost loop
![Page 23: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/23.jpg)
Summary

- Prefetching doesn’t work with short streams
  - Epilogue dominates
  - Can we refine prefetch placement?
- Lexicographic Shift
  - Increase overlap of prefetch and computation
  - Can be done with polyhedral representation
- But, it didn’t work
  - Overkill
![Page 24: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/24.jpg)
Another Possibility

- Translation in outer dimensions
  - Keeps things simple by selecting one appropriate dimension to shift
  - Works well for rectangular spaces

[Figure: 2D iteration space with axes i and j]
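Translation along the outer dimension can be sketched for a rectangular space as follows. This is a hypothetical illustration, not an evaluated implementation: it assumes a contiguous row-major array, a shift of one row, summation as the computation, and `__builtin_prefetch` (GCC/Clang). It issues one hint per element for brevity, so the per-line unrolling discussed earlier would still be needed to avoid redundant prefetches.

```c
#include <assert.h>
#include <stddef.h>

/* Sum an n-by-m row-major array. While row i is processed, prefetch the
   element in the same column of row i+1 (shift by one along i). */
double sum_outer_shift(const double *a, size_t n, size_t m)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        int has_next = i + 1 < n;   /* the last row has nothing left to prefetch */
        for (size_t j = 0; j < m; j++) {
            if (has_next)
                __builtin_prefetch(&a[(i + 1) * m + j]);
            sum += a[i * m + j];
        }
    }
    return sum;
}
```

Because the shift is a plain translation in one dimension, the loop structure stays the same as the original nest and only the first row goes without a prefetch.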
![Page 25: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/25.jpg)
Coarser Granularity

- Lexicographic shift at the statement instance level seems too fine grained
- Can it work with tiles?
  - Prefetch for the next tile as you execute one
![Page 26: Far Fetched Prefetching?](https://reader036.vdocuments.us/reader036/viewer/2022062501/568164b1550346895dd6bc16/html5/thumbnails/26.jpg)
Thank you