Basic Block Scheduling
Utilize parallelism at the instruction level (ILP)
Time spent in loop execution dominates total execution time
It is a technique that reforms the loop so as to achieve overlapped execution of successive iterations
Process Overview
Parallelize a single operation or the whole loop?
More parallelism achievable if we consider the entire loop
Construct instructions that contain operations from different iterations of the initial loop
Construct a flat schedule and repeat it over time, taking into account resource and dependence constraints
Techniques
Software pipelining restructures loops in order to achieve overlapping of various iterations in time
Although this optimization does not create massive amounts of parallelism, it is desirable
There exist two main methods for software pipelining: kernel recognition and modulo scheduling
Modulo Scheduling
We will focus on modulo scheduling technique (it is incorporated in commercial compilers)
We try to select a schedule for one loop iteration and then repeat the schedule
No unrolling applied
Terminology (Dependences)
To make a legal schedule, it is important to know which operations must follow other operations
A conflict exists if two operations cannot execute at the same time, but it does not matter which one executes first (resource/hardware constraints)
A dependence exists between two operations if interchanging their order changes the result (data/control dependences)
Terminology (Data Dependence Graph)
Represent operations as nodes and dependences between operations as directed arcs
Loop carried arcs show relationships between operations of different iterations (may turn DDG into cyclic graph)
Loop independent arcs represent a must follow relationship among operations of the same iteration
Assign arc weights in the form of a (dif, min) dependence pair
The dif value indicates the number of iterations the dependence spans
The min value is the time that must elapse between consecutive executions of the dependent operations
The value min/dif is called the slope of the schedule
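As an illustration, a DDG with (dif, min) arc weights can be represented as a plain list of arcs; the operation names and weights below are hypothetical, a minimal sketch of the terminology just introduced.

```python
# Hypothetical DDG: each arc is (source, target, dif, min).
# dif = 0 marks a loop-independent arc; dif > 0 marks a loop-carried arc
# (loop-carried arcs are what can make the DDG cyclic).
arcs = [
    ("O1", "O2", 0, 1),  # O2 must start >= 1 cycle after O1, same iteration
    ("O2", "O1", 1, 2),  # O1 of the next iteration must start >= 2 cycles after O2
]

loop_carried = [a for a in arcs if a[2] > 0]
loop_independent = [a for a in arcs if a[2] == 0]
```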
Terminology (Resource Reservation Table)
Construct Resource Reservation Table
Terminology (Loop Types)
Doall: a loop in which iterations can proceed in parallel. These types of loops lead to massive parallelism and are easy to schedule
Doacross: a loop in which synchronization is needed between operations of various iterations
Doall Loop Example
dif=0: no loop-carried dependences
min=1: loop-independent dependences
Construct a valid flat schedule. Then, repeat it
Doacross Loop Example (dif=1)
for (i=1; i<=n; i++)
O1: a[i + 1] = a[i] + 1
O2: b[i] = a[i + 1] /2
O3: c[i] = b[i] + 3
O4: d[i] = c[i]
dif=1 for Operation1 (loop-carried dependences exist)
min=1, loop-independent dependences
Construct a valid flat schedule. Then, repeat it
However, repetition is not easy. We should take into account that dif=1 for O1: each iteration should start with one slot of delay. A legal schedule has then been achieved.
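The repetition rule used above can be sketched as a small helper: each iteration's copy of the flat schedule starts a fixed number of slots after the previous copy (one slot here, since dif=1 and min=1 for O1). The function is an illustrative sketch, not part of the original slides.

```python
def start_time(flat_slot, iteration, delay=1):
    # Slot in the repeated schedule: each iteration's copy of the flat
    # schedule is shifted by `delay` slots relative to the previous one.
    return iteration * delay + flat_slot

# Iteration 0 of O1 runs at slot 0, iteration 1 at slot 1, iteration 2 at slot 2, ...
```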
Doacross Loop Example (dif=2)
for (i=1; i<=n; i++)
O1: a[i + 2] = a[i] + 1
O2: b[i] = a[i + 2] /2
O3: c[i] = b[i] + 3
O4: d[i] = c[i]
dif=2 for Operation1 (loop-carried dependences exist)
min=1, loop-independent dependences
Each second iteration should now start with one slot of delay from the previous one
This is because dif=2: the dependence spans more iterations, which is a less restrictive constraint
Comparison
In our first example where dif=1 and min=1, the kernel is found in the 4th time slot and is equal to 4 3 2 1. Instructions before and after the kernel are defined as the prelude and postlude of the schedule, respectively
In the second example the loop carried dependence is between iterations that are two apart. This is a less restrictive constraint, so iterations are overlapped more. Indeed, the kernel now is 4 4 3 3 2 2 1 1
Main Idea
Let’s combine all these concepts (data dependence graph, resource reservation tables, schedule, loop types, arcs, flat schedule) in some simple examples
Don’t forget that the main idea behind software pipelining (incl. modulo scheduling) is that the body of a loop can be reformed so as to start one loop iteration before previous iterations have finished
Another Loop Example
for (i=1; i<=n; i++)
O1: a[i] = i * i
O2: b[i] = a[i] * b[i-1]
O3: c[i] = b[i] / n
O4: d[i] = b[i] % n
O1 is always scheduled in the first time step.
Thus the distance between O1 and the rest of the operations increases in successive iterations.
A cyclic pattern (such as those achieved in the other examples) never forms.
Initiation Interval
So far we have described the first step of the modulo scheduling procedure, that is, analysis of the DDG for a loop to identify all kinds of dependences
The second step is to try to identify the minimum number of instructions required between initiating execution of successive loop iterations
Specifically, the delay between iterations of the new loop is called the Initiation Interval (II)
a) Resource Constrained II   b) Dependence Constrained II
Resource Constrained IIres
The resource usage imposes a lower bound on the initiation interval (IIres). For each resource, compute the schedule length necessary to accommodate uses of that resource.
If we have a DDG and 4 available resources, we try to calculate the maximum usage for every resource
Example
Resource 2 is required 4 times,
i.e. an operation using resource 2 can only be executed 4 cycles after its previous execution
Suppose that the flat schedule is as shown.
We repeat it with 4 time slots of delay
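The resource-constrained bound can be sketched as follows: count how many times each resource is used per iteration and divide by the number of copies of that resource. The count of 4 for resource 2 follows the example; the other counts and single-copy assumption are illustrative.

```python
from math import ceil

def ii_res(uses_per_iteration, copies):
    # Lower bound on II imposed by resources: a resource used u times per
    # iteration with c copies needs at least ceil(u / c) slots per iteration.
    return max(ceil(u / copies.get(r, 1)) for r, u in uses_per_iteration.items())

# Resource 2 is used 4 times per iteration and has a single copy,
# so the kernel cannot repeat more often than every 4 slots.
```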
Methods for computing IIdep
Modulo scheduling is all about calculating the lower bound for the initiation interval
We will present two techniques to compute the dependence constrained II (the calculation of IIres is straightforward)
1) Shortest Path Algorithm 2) Iterative Shortest Path
1) Shortest Path Algorithm
This method uses the transitive closure of a graph, which is a reachability relationship
Let θ be a cyclic path from a node to itself, minθ be the sum of the min times on the arcs that constitute the cycle, and difθ be the sum of the dif times on the arcs
The time between execution of a node and itself depends on II: the time elapsed between execution of a node and another copy difθ iterations away is II * difθ
The maximum ⌈minθ/difθ⌉ over all cycles is IIdep
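A sketch of this computation: enumerate the simple cycles of a small DDG by DFS, sum min and dif along each, and take the maximum ⌈minθ/difθ⌉. The arc data in the comment is the S1/S2 recurrence that reappears in Example 1; the function itself is an illustrative sketch for small graphs.

```python
from math import ceil

def ii_dep(arcs):
    # arcs: dict node -> list of (successor, dif, min).
    # Enumerate simple cycles by DFS; each cycle is generated once,
    # starting from its smallest node.
    cycles = []

    def dfs(start, node, dif_sum, min_sum, visited):
        for succ, dif, mn in arcs.get(node, []):
            if succ == start:
                cycles.append((dif_sum + dif, min_sum + mn))
            elif succ not in visited and succ > start:
                dfs(start, succ, dif_sum + dif, min_sum + mn, visited | {succ})

    for n in arcs:
        dfs(n, n, 0, 0, {n})
    return max(ceil(mn / dif) for dif, mn in cycles)

# S1 -> S2 (0,1) and S2 -> S1 (1,2) form one cycle with dif = 1, min = 3,
# hence IIdep = 3.
```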
Let’s see the effect of II on cyclic times in this figure
II must be large enough so that II * difθ >= minθ
Repeating Flat schedule
Calculate IIdep
II * difθ >= minθ
0 >= minθ - II * difθ
This must hold for every cycle θ, so:
IIdep = max over all cycles θ of ⌈minθ / difθ⌉
Therefore, we select II:
II = max(IIdep, IIres)
Shortest Path Algorithm Example
IIdep = max(⌈6/2⌉, ⌈4/1⌉, ⌈6/3⌉) = max(3, 4, 2) = 4
Transitive closure of the graph
2) Iterative Shortest Path
Simplify the previous method by recomputing the transitive closure of the graph for each possible II
Use the notion of distance (Mab) between two nodes
In the flat schedule (the relative scheduling of each operation of the original iteration, something like list scheduling), the distance between two nodes a, b joined by an arc whose weight is (dif, min) is given by:
Ma,b = min - II * dif
We want to compute the minimum distance by which two nodes must be separated, but this information is dependent on the initiation interval
Distance Ma,b
Effect of II on node precedence
Procedure to find II
Construct a matrix M where each entry Mi,j represents the min time between two subsequent nodes i and j
This computation gives the earliest time node j can be placed with respect to node i in the flat schedule
Matrix M
Estimate that II = 2
Procedure to find II
The next step is to compute matrix M2, which represents the minimum time difference between nodes along paths of length two
Continue by calculating matrix M3 and so on. Finally, we compute Γ(Μ) as follows:
Γ(Μ) = M ∪ M2 ∪ M3 ∪ ... ∪ Mn-1
Matrix M
Matrix M2
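A sketch of the closure in max-plus arithmetic: entries start at min − II·dif for each arc (−∞ where there is no arc), Γ(M) accumulates the elementwise maximum over increasing path lengths (powers are taken up to M^n here, so every simple cycle reaches the diagonal), and the legality test of the following slides checks the diagonal. The arc data reuses the S1/S2 recurrence as an assumed example.

```python
NEG = float("-inf")

def closure(M):
    # Gamma(M): elementwise max over max-plus powers M, M^2, ..., M^n.
    n = len(M)
    G = [row[:] for row in M]
    P = [row[:] for row in M]
    for _ in range(n - 1):
        P = [[max(P[i][k] + M[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        G = [[max(G[i][j], P[i][j]) for j in range(n)] for i in range(n)]
    return G

def legal(arcs, n, II):
    # arcs: list of (src, dst, dif, min) with integer node ids 0..n-1.
    # II is legal when no diagonal entry of the closure is positive.
    M = [[NEG] * n for _ in range(n)]
    for a, b, dif, mn in arcs:
        M[a][b] = max(M[a][b], mn - II * dif)
    return all(closure(M)[i][i] <= 0 for i in range(n))

# S1 -> S2 (0,1), S2 -> S1 (1,2): II = 2 leaves a positive diagonal entry,
# II = 3 does not.
```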
Example (II=1)
Matrix M
Matrix Γ(M)
Example (II=2)
Matrix M
Matrix M2
Matrix Γ(M)
Example (II=3)
Matrix M
Matrix Γ(M)
Final Result
Γ(Μ) represents the maximum distance between each pair of nodes, considering paths of all lengths
A legal II will produce a closure matrix in which the entries on the main diagonal are non-positive
Positive values on the diagonal are an indication of a too-small initiation interval
Non-positive values on the diagonal indicate an adequate estimate of II
Plus or minus
A drawback of this method is that, before we are able to construct the matrix M, we must estimate II
However, this technique allows us to tell whether the estimate for II is large enough, or whether we need to iteratively try a larger II
Why use the “modulo” term in the first place?
Initially, we have the flat schedule F, consisting of locations F1, F2, ...
Kernel K is formed by overlapping copies of F offset by II
Modulo scheduling results when all operations from locations in the flat schedule that have the same value modulo II are executed simultaneously
Operations from (Fi: i mod II = 0) execute together; operations from (Fi: i mod II = 1) execute together
Flat schedule with II=2 → modulo scheduling
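The grouping rule can be sketched directly: every flat-schedule slot i contributes its operations to kernel row i mod II. The slot assignments in the comment are the flat schedule of Example 1, used here as an assumption.

```python
def form_kernel(flat, II):
    # flat: dict slot -> operation name. Operations whose flat-schedule
    # slots are equal modulo II end up in the same kernel row.
    kernel = [[] for _ in range(II)]
    for slot in sorted(flat):
        kernel[slot % II].append(flat[slot])
    return kernel

# Flat schedule t0: S1, t1: S2, t3: S3, t5: S4 with II = 3:
# kernel rows are {S1, S3}, {S2}, {S4}.
```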
Example 1
DO I=1, 100
S1: a[I] = b[I-1] + 5;
S2: b[I] = a[I] * I; // mul -> 2 clocks
S3: c[I] = a[I-1] * b[I];
S4: d[I] = c[I];
ENDDO
[Figure: DDG of example 1, nodes S1-S4 with (dif, min) arcs S1->S2 (0,1), S2->S1 (1,2), S2->S3 (0,2), S1->S3 (1,1), S3->S4 (0,2)]
Example 1
PRODUCE THE FLAT SCHEDULE
S1 and S2 are strongly connected. Which one should be placed earlier in the flat schedule?
S3 and S4 should be placed after S1 and S2 (S3 should precede S4)
Eliminate all loop-carried dependences
Loop-independent arcs determine the sequence of nodes in the flat schedule
After eliminating the loop-carried arcs, the arcs S1->S2 (0,1), S2->S3 (0,2) and S3->S4 (0,2) remain
The flat schedule is therefore: t0: S1, t1: S2, t3: S3, t5: S4
Example 1
COMPUTE II using the method of the “Shortest Path Algorithm”
Find strongly connected components: S1 and S2 form one (arcs S1->S2 (0,1), S2->S1 (1,2))
Calculate the transitive closure of the graph
Table I: Transitive closure, entries (dif, min)
Source Node \ Destination Node | 1 | 2
1 | (1,3) | (0,1)
2 | (1,2) | (1,3)
II = max(3/1, 3/1) = 3
Example 1
Execution schedule: each blue box represents the kernel of the pipeline
The flat schedule (t0: S1, t1: S2, t3: S3, t5: S4) is repeated every II = 3 time slots
Exploitation E = (length of flat schedule - II) / length of flat schedule
E = (6 - 3) / 6 = 0.5
Worst case scenario: E = 0 when II = length of the flat schedule => no overlapping between adjacent iterations
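The exploitation measure is a one-liner; both examples' values follow from it.

```python
def exploitation(flat_len, II):
    # Fraction of the flat schedule that overlaps with neighboring iterations.
    return (flat_len - II) / flat_len

# Example 1: exploitation(6, 3) = 0.5
# Example 2: exploitation(7, 3) ~ 0.5714
```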
EXAMPLE 2
DO I=1, 100
S1: a[I] = b[I-1] + 5;
S2: b[I] = a[I] * I;
S3: c[I] = a[I-1] * b[I];
S4: d[I] = c[I] + e[I-2];
S5: e[I] = d[I] * f[I-1];
S6: f[I] = d[I] * 4;
ENDDO
[Figure: DDG of example 2, nodes S1-S6 with (dif, min) arc weights (0,1), (1,2), (0,1), (1,1), (0,2), (1,2), (0,1), (2,2), (0,1)]
EXAMPLE 2
PRODUCE THE FLAT SCHEDULE
Eliminate all loop-carried dependences
Loop-independent arcs determine the sequence of nodes in the flat schedule
The flat schedule is therefore: t0: S1, t1: S2, t3: S3, t5: S4, t6: S5 and S6
EXAMPLE 2
COMPUTE II using the method of the “Shortest Path Algorithm”
Find strongly connected components: {S1, S2} and {S4, S5, S6}
The initiation interval for the first component is II = 3 (as in example 1)
Calculate the transitive closure of the second component:
Transitive closure, entries (dif, min)
Source Node \ Destination Node | 4 | 5 | 6
4 | (2,3),(2,5) | (0,1),(0,3) | (0,1)
5 | (2,2) | (2,3),(2,5) | (2,3)
6 | (2,4) | (0,2) | (2,5)
II = max(3/2, 5/2) = 2.5
IItotal = max(2.5, 3) = 3
EXAMPLE 2
Execution schedule of example 2; kernel in blue box; instructions before the kernel form the prologue, those after it the epilogue
Exploitation E = (length of flat schedule - II) / length of flat schedule
E = (7 - 3) / 7 ≈ 0.5714
EXAMPLE 3
DO I=1, 100
S1: a[I] = b[I-3] * 5; // mul: 3 clks
S2: b[I] = sqrt(a[I]); // sqrt: 4 clks
S3: c[I] = a[I-2] * b[I-1]; // mul: 3 clks
S4: d[I] = c[I] + 5; // add: 1 clk
S5: e[I] = d[I-1] + c[I-1]; // add: 1 clk
ENDDO
[Figure: DDG of example 3, nodes S1-S5 with (dif, min) arc weights (0,2), (3,4), (2,2), (1,4), (0,3), (1,3), (1,1)]
Nodes S1, S2 comprise a strongly connected component. II is therefore:
II = ⌈(4 + 2) / (3 + 0)⌉ = 2
EXAMPLE 3
Produce the flat schedule:
• Eliminate all loop-carried dependences
• There is no loop-independent arc across all nodes
• In this case, the flat schedule cannot be produced just by following the loop-independent arcs
• We need a global method to generate the flat schedule
• The previously mentioned method does not always work
• Introduce “Modulo scheduling via hierarchical reduction”
[Figure: after eliminating loop-carried arcs, only S1->S2 (0,2) and S3->S4 (0,3) remain, and S5 is left unconnected]
Modulo Scheduling Via Hierarchical Reduction
Modify the DDG so as to schedule the strongly connected components of the graph first
The strongly connected components of a graph can be found using Tarjan's algorithm
Afterwards, schedule the acyclic DDG
Modulo Scheduling Via Hierarchical Reduction
Compute the upper and low bounds where each node can be placed in the flat schedule, using the equations below
It is an iterative method: we begin with II=1 and try to find a legal schedule. If that is not possible, II is incremented until all nodes are placed in the flat schedule in legal positions
Equations to initialize low and upper bounds:
low(u) = max over v in N of costII(v, u),  upper(u) = ∞
Equations to update low and upper bounds once a node v has been scheduled at slot σ(v):
low(u) = max(low(u), σ(v) + costII(v, u))
upper(u) = min(upper(u), σ(v) - costII(u, v))
costII(v, u) stands for the cost (measured by dif, min values) for node v to reach node u. We thus need the cost matrix for the strongly connected nodes (i.e. the transitive closure).
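The bound-driven placement can be sketched as the loop below. It is an illustrative sketch, not the slides' exact algorithm: the cost table is assumed to hold the max-distance entries (min − II·dif) of the simplified transitive closure, negative low bounds are clamped to slot t0 (an interpretive assumption consistent with the placements shown in Example 4), and the selection key approximates "lowest low bound first, then lowest upper bound" seen there.

```python
def schedule_scc(nodes, cost):
    # cost[(v, u)]: max-distance entry (min - II*dif) of the simplified
    # transitive closure, for every ordered pair of distinct nodes.
    low = {u: max(cost[(v, u)] for v in nodes if v != u) for u in nodes}
    upper = {u: float("inf") for u in nodes}
    sigma = {}
    while len(sigma) < len(nodes):
        pending = [u for u in nodes if u not in sigma]
        # Pick the most constrained node: lowest upper bound, then lowest low.
        u = min(pending, key=lambda x: (upper[x], low[x]))
        sigma[u] = max(low[u], 0)  # assumed clamp: flat-schedule slots start at t0
        for w in pending:
            if w == u:
                continue
            low[w] = max(low[w], sigma[u] + cost[(u, w)])
            upper[w] = min(upper[w], sigma[u] - cost[(w, u)])
    return sigma
```

With the Example 4 cost values (II = 3), this sketch reproduces the slides' placement S3 at t0, S4 at t2, S1 at t0, S2 at t1.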
EXAMPLE 4
DO I=1,100
S1: a(i) = c(i-1) + d(i-3);
S2: b(i) = a(i) * 5;
S3: c(i) = b(i-2) * d(i-1);
S4: d(i) = c(i) + i;
S5: e(i) = d(i);
S6: f(i) = d(i-1) * i;
S7: g(i) = f(i-1);
ENDDO
[Figure: DDG of example 4, nodes S1-S7 with (dif, min) arc weights (0,1), (1,2), (3,1), (2,2), (1,1), (0,2), (0,1), (1,1), (1,2); the strongly connected component consists of S1-S4]
Find strongly connected components. Compute the transitive closure:
Transitive closure, entries (dif, min)
Source Node \ Destination Node | 1 | 2 | 3 | 4
1 | (3,5),(5,6) | (0,1) | (2,3) | (2,5)
2 | (3,4),(5,5) | (3,5),(5,6) | (2,2) | (2,4)
3 | (1,2),(3,3) | (1,3),(3,4) | (3,5),(1,3),(5,6) | (0,2)
4 | (3,1),(2,3) | (3,2),(2,4) | (1,1),(5,4) | (1,3),(5,6)
II = ⌈max(5/3, 6/5, 3/1)⌉ = 3
EXAMPLE 4
(Transitive closure table as on the previous slide)
Compute the Simplified Transitive Closure by keeping, for each pair of nodes, the entry that gives the maximum distance
Initialize the upper and low bounds for the nodes in the strongly connected component:
low(u) = max over v in N of costII(v, u),  upper(u) = ∞
where costII(u, v) = Mab = min - II * dif, taken from the simplified transitive closure
Once a node v has been scheduled, update:
low(u) = max(low(u), σ(v) + costII(v, u))
upper(u) = min(upper(u), σ(v) - costII(u, v))
where σ(v) is the time slot in F where the scheduled node has been placed
EXAMPLE 4 (Initialize nodes)
Using the simplified transitive closure (an entry Mab(dif, min) contributes min - II * dif, with II = 3):
low(1) = max(cost3(2,1), cost3(3,1), cost3(4,1)) = max(Mab(3,4), Mab(1,2), Mab(2,3)) = max(4 - 3*3, 2 - 1*3, 3 - 2*3) = -1
low(2) = max(cost3(1,2), cost3(3,2), cost3(4,2)) = max(Mab(0,1), Mab(1,3), Mab(2,4)) = max(1, 3 - 1*3, 4 - 2*3) = 1
low(3) = max(cost3(1,3), cost3(2,3), cost3(4,3)) = max(Mab(2,3), Mab(2,2), Mab(1,1)) = max(3 - 2*3, 2 - 2*3, 1 - 1*3) = -2
low(4) = max(cost3(1,4), cost3(2,4), cost3(3,4)) = max(Mab(2,5), Mab(2,4), Mab(0,2)) = max(5 - 2*3, 4 - 2*3, 2 - 0*3) = 2
EXAMPLE 4 (Schedule the first node)
S1: [-1, ∞]   S2: [1, ∞]   S3: [-2, ∞]   S4: [2, ∞]
Node S3 has the lowest low bound, so it is scheduled first. It is placed in time slot 0 (t0). Afterwards, we need to update the low and upper bounds of the remaining nodes.
Flat schedule so far: t0: S3
EXAMPLE 4 (Update nodes S1, S2, S4)
With σ(3) = 0:
low(1) = max(low(1), σ(3) + cost3(3,1)) = max(-1, 0 + (2 - 1*3)) = -1, clamped to the first slot: low(1) = 0
upper(1) = min(∞, σ(3) - cost3(1,3)) = min(∞, 0 - (3 - 2*3)) = 3
low(2) = max(low(2), σ(3) + cost3(3,2)) = max(1, 0 + (3 - 1*3)) = max(1, 0) = 1
upper(2) = min(∞, σ(3) - cost3(2,3)) = min(∞, 0 - (2 - 2*3)) = 4
low(4) = max(low(4), σ(3) + cost3(3,4)) = max(2, 0 + (2 - 0*3)) = max(2, 2) = 2
upper(4) = min(∞, σ(3) - cost3(4,3)) = min(∞, 0 - (1 - 1*3)) = 2
S1: [0, 3]   S2: [1, 4]   S4: [2, 2]
Node S4 has the lowest upper bound, so it is scheduled next and is placed in the position indicated by the value of its low bound (i.e. t2). Then, we need to update nodes S1, S2.
Flat schedule so far: t0: S3, t2: S4
EXAMPLE 4 (Update nodes S1, S2)
With σ(4) = 2:
low(1) = max(low(1), σ(4) + cost3(4,1)) = max(0, 2 + (3 - 2*3)) = max(0, -1) = 0
upper(1) = min(upper(1), σ(4) - cost3(1,4)) = min(3, 2 - (5 - 2*3)) = min(3, 3) = 3
low(2) = max(low(2), σ(4) + cost3(4,2)) = max(1, 2 + (4 - 2*3)) = max(1, 0) = 1
upper(2) = min(upper(2), σ(4) - cost3(2,4)) = min(4, 2 - (4 - 2*3)) = min(4, 4) = 4
S1: [0, 3]   S2: [1, 4]
Node S1 has the lowest upper bound, so it is scheduled next and is placed in the position indicated by the value of its low bound (i.e. t0, sharing the slot with S3).
Finally, node S2 is placed in time slot 1 (t1).
Flat schedule of the component: t0: S3, S1   t1: S2   t2: S4
EXAMPLE 4 (Condensed Graph )
After scheduling the strongly connected component, the new (condensed) graph is shown below: node S1234 with arcs S1234->S5 (0,1), S1234->S6 (1,1), S6->S7 (1,2)
The acyclic graph is then easy to schedule, using the equation below, where σ(a) is the position in the flat schedule of the node on which node n depends. The min and dif values are the sums of the mins and difs across the arcs that connect the corresponding nodes.
low(n) = max over arcs (a, n, (dif, min)) with a scheduled of (σ(a) + min - II * dif)
low(6) = σ(S4) + min - II * dif = 2 + 1 - 3*1 = 0
low(5) = σ(S4) + min - II * dif = 2 + 1 - 3*0 = 3
low(7) = σ(S4) + min - II * dif = 2 + 3 - 3*2 = -1 (clamped to t0 when the schedule is formed)
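The acyclic placement rule above can be sketched as a small helper; the arc data in the comments reproduces the condensed graph (with S7 reached through the summed S4→S6→S7 path) and II = 3.

```python
def acyclic_low(arcs_into_n, sigma, II):
    # arcs_into_n: list of (scheduled predecessor, dif, min), where dif and
    # min are summed along the connecting path in the condensed graph.
    return max(sigma[a] + mn - II * dif for a, dif, mn in arcs_into_n)

# With sigma[S4] = 2 and II = 3:
#   S6 via (S4, dif=1, min=1) -> 2 + 1 - 3 = 0
#   S5 via (S4, dif=0, min=1) -> 2 + 1 - 0 = 3
#   S7 via (S4, dif=2, min=3) -> 2 + 3 - 6 = -1 (clamped to t0 when placed)
```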
Flat schedule: t0: S3, S1, S6, S7   t1: S2   t2: S4   t3: S5
EXAMPLE 4 (Execution Schedule )
The execution schedule is then: the flat schedule (t0: S3, S1, S6, S7   t1: S2   t2: S4   t3: S5) is repeated every II = 3 slots; the blue box marks the kernel