CMPUT 680 - Compiler Design and Optimization
CMPUT680 - Winter 2006
Topic E: Software Pipelining
José Nelson Amaral
http://www.cs.ualberta.ca/~amaral/courses/680
Reading List
Tiger book: chapter 20
Other papers such as: GovindAltmanGao97, RutenbergAtAl97
Software Pipeline
Software pipelining is a technique that reduces the execution time of important loops by interweaving operations from many iterations to optimize the use of resources.
[Figure: timeline over cycles 0 to 16 showing the loop operations ldf, fadds, stf, sub, cmp, and bg from overlapping iterations.]
Software Pipeline
What limits the speed of a loop?
• Data dependencies: recurrence initiation interval (rec_mii)
• Processor resources: resource initiation interval (res_mii)
• Memory accesses: memory initiation interval (mem_mii)
[Figure: timeline over cycles 0 to 16 showing the loop operations ldf, fadds, stf, sub, cmp, and bg from overlapping iterations, with the initiation interval marked.]
Problem Formulation (I)
Given a weighted dependence graph, derive a schedule which is “time-optimal” under a machine model M.
Def: A schedule S of a loop L is time-optimal if, among all “legal” schedules of L, no schedule is faster than S.
Note: There may be more than one time-optimal schedule.
Example: The Inner Product
Q = 0.0
DO k = 1, N
   Q = Q + Z(k)*X(k)
ENDDO

z[0] ← &Z(1)
x[0] ← &X(1)
q[0] ← 0.0
DO k = 1, N
   u[k] ← load z[k-1]
   v[k] ← load x[k-1]
   w[k] ← u[k] * v[k]
   q[k] ← q[k-1] + w[k]
   z[k] ← z[k-1] + 4
   x[k] ← x[k-1] + 4
END DO
(Dehnert, J. and Towle, R. A., “Compiling for the Cydra 5”)
Dynamic Single Assignment (DSA): uses an expanded virtual register (EVR), which is an infinite, linearly ordered set of virtual registers.
A program in DSA has no anti-dependencies and no output dependencies.
Machine Model and Resource Constraints
Which unit does each operation in the loop use?
z[0] ← &Z(1)
x[0] ← &X(1)
q[0] ← 0.0
DO k = 1, N
   u[k] ← load z[k-1]        MEM
   v[k] ← load x[k-1]        MEM
   w[k] ← u[k] * v[k]        FMULT
   q[k] ← q[k-1] + w[k]      FADD
   z[k] ← z[k-1] + 4         ADDR
   x[k] ← x[k-1] + 4         ADDR
END DO
Machine Model
Unit     Latency
MEM1     6
MEM2     6
ADDR1    1
ADDR2    1
FMULT    2
FADD     2
Without instruction-level parallelism, how long does the loop take to execute? (6+6+2+2+1+1)*N = 18*N
Resource Minimum Initiation Interval (ResMII)
Each processor resource defines a minimum initiation interval for the execution of the loop.
For instance, in the machine model of the previous example, a loop that requires the computation of 6 addresses has ResMII(ADDR) = 6*1/2 = 3.
The Resource Minimum Initiation Interval of a loop is given by:
$$\mathrm{ResMII} = \max_i \, \mathrm{ResMII}(R_i)$$
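The per-resource bound can be sketched in a few lines of Python. This is a minimal sketch, not the method of any particular compiler: the names `UNITS` and `res_mii` are hypothetical, MEM1/MEM2 and ADDR1/ADDR2 are grouped into resource classes, and every unit is assumed to be fully pipelined (one issue slot per operation).

```python
from math import ceil

# Units per resource class in the machine model of the slides.
# Grouping MEM1/MEM2 and ADDR1/ADDR2 into classes is an assumption of this sketch.
UNITS = {"MEM": 2, "ADDR": 2, "FMULT": 1, "FADD": 1}

def res_mii(uses, units=UNITS):
    """uses maps a resource class to the number of cycles of that class
    that one loop iteration occupies. Returns (ResMII, per-class bounds)."""
    per_resource = {r: ceil(n / units[r]) for r, n in uses.items()}
    return max(per_resource.values()), per_resource

# Six address computations, each occupying an ADDR unit for one cycle.
six_addr, detail = res_mii({"ADDR": 6 * 1})

# Inner-product loop: 2 loads, 2 address updates, 1 multiply, 1 add.
inner, _ = res_mii({"MEM": 2, "ADDR": 2, "FMULT": 1, "FADD": 1})

print(six_addr, inner)
```

The ceiling matters when the occupancy is not a multiple of the number of units: five address computations on two ADDR units still need three cycles.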
ResMII
z[0] ← &Z(1)
x[0] ← &X(1)
q[0] ← 0.0
DO k = 1, N
   u[k] ← load z[k-1]        MEM
   v[k] ← load x[k-1]        MEM
   w[k] ← u[k] * v[k]        FMULT
   q[k] ← q[k-1] + w[k]      FADD
   z[k] ← z[k-1] + 4         ADDR
   x[k] ← x[k-1] + 4         ADDR
END DO
Machine Model
Unit     Latency
MEM1     6
MEM2     6
ADDR1    1
ADDR2    1
FMULT    2
FADD     2
There are enough units to schedule all the instructions of the loop in the same cycle. Therefore ResMII = 1. Can we execute the loop in N+C cycles (C = a small constant)?
Recurrence Minimum Initiation Interval (RecMII)
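z[0] ← &Z(1)
x[0] ← &X(1)
q[0] ← 0.0
DO k = 1, N
   (a) u[k] ← load z[k-1]
   (b) v[k] ← load x[k-1]
   (c) w[k] ← u[k] * v[k]
   (d) q[k] ← q[k-1] + w[k]
   (e) z[k] ← z[k-1] + 4
   (f) x[k] ← x[k-1] + 4
END DO
[Figure: dependence graph of operations a to f replicated for iterations k=1, k=2, k=3, and so on; each of the five loop-carried dependence edges is labeled with its iteration distance, (1).]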
Recurrence Minimum Initiation Interval (RecMII)
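z[0] ← &Z(1)
x[0] ← &X(1)
q[0] ← 0.0
DO k = 1, N                      Unit    Lat.
   (a) u[k] ← load z[k-1]        MEM     (6)
   (b) v[k] ← load x[k-1]        MEM     (6)
   (c) w[k] ← u[k] * v[k]        FMULT   (2)
   (d) q[k] ← q[k-1] + w[k]      FADD    (2)
   (e) z[k] ← z[k-1] + 4         ADDR    (1)
   (f) x[k] ← x[k-1] + 4         ADDR    (1)
END DO
[Figure: dependence graph over nodes a to f. Loop-carried edges are labeled (dist,lat): the recurrence on d carries (1,2), and the four edges leaving e and f carry (1,1).]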
Recurrence Minimum Initiation Interval (RecMII)
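[Figure: dependence graph over nodes a to f with (dist,lat) labels on the loop-carried edges: one edge labeled (1,2) and four edges labeled (1,1).]
The recurrence minimum initiation interval (rec_mii) is given by:
$$\mathrm{rec\_mii} = \max_{\forall\,\mathrm{cycle}\ \theta} \left\lceil \frac{\mathrm{latency}(\theta)}{\mathrm{iteration\ distance}(\theta)} \right\rceil$$
Quiz: What is the rec_mii for the example?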
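One way to check an answer to the quiz is to enumerate the cycles of the dependence graph and apply the formula directly. The sketch below is illustrative only: `EDGES`, `simple_cycles`, and `rec_mii` are hypothetical names, the assignment of the (dist,lat) labels to particular edges is inferred from the loop body, and the brute-force cycle enumeration is only suitable for small graphs.

```python
from math import ceil

# Edges as (source, destination, iteration distance, latency).
# Loop-carried edges follow the (dist,lat) labels; the distance-0 edges are
# derived from the unit latencies and lie on no cycle (an inference of this sketch).
EDGES = [
    ("a", "c", 0, 6), ("b", "c", 0, 6), ("c", "d", 0, 2),
    ("d", "d", 1, 2),                      # q[k] depends on q[k-1]
    ("e", "e", 1, 1), ("f", "f", 1, 1),    # address increments
    ("e", "a", 1, 1), ("f", "b", 1, 1),    # loads use the previous address
]

def simple_cycles(edges):
    """Yield every simple cycle once, as a list of edges (small graphs only)."""
    nodes = sorted({n for s, d, _, _ in edges for n in (s, d)})
    out = {n: [e for e in edges if e[0] == n] for n in nodes}

    def walk(start, node, path, seen):
        for e in out[node]:
            nxt = e[1]
            if nxt == start:
                yield path + [e]
            elif nxt > start and nxt not in seen:
                # Only visit nodes larger than the root, so each cycle is
                # reported once, rooted at its smallest node.
                yield from walk(start, nxt, path + [e], seen | {nxt})

    for s in nodes:
        yield from walk(s, s, [], {s})

def rec_mii(edges):
    best = 0
    for cycle in simple_cycles(edges):
        latency = sum(e[3] for e in cycle)
        distance = sum(e[2] for e in cycle)
        if distance == 0:
            raise ValueError("cycle with zero iteration distance is not schedulable")
        best = max(best, ceil(latency / distance))
    return best

print(rec_mii(EDGES))
```

The only cycles in this graph are the self-loops on d, e, and f, so the bound is set by the floating-point add recurrence on q.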
Minimum Initiation Interval
The Minimum Initiation Interval (MII) for a loop is constrained both by resources and by recurrences; therefore, it is given by:
$$\mathrm{MII} = \max(\mathrm{ResMII}, \mathrm{RecMII})$$
In our example we have MII = max(1,2) = 2. Therefore the best that we can do without transforming the loop is to execute it in 2*N+C cycles.
Modulo Schedule
In modulo scheduling, we:
(1) start with the first instruction;
(2) schedule as many instructions as we can in every cycle, limited only by the available resources and by the dependences.
When a pattern emerges, we adopt the pattern as our modulo schedule.
Instructions before this pattern form the loop prologue.
Instructions after this pattern form the loop epilogue.
Recurrence Minimum Initiation Interval (RecMII)
z[0] ← &Z(1)
x[0] ← &X(1)
q[0] ← 0.0
DO k = 1, N                      Lat.
   (a) u[k] ← load z[k-1]        (6)
   (b) v[k] ← load x[k-1]        (6)
   (c) w[k] ← u[k] * v[k]        (2)
   (d) q[k] ← q[k-1] + w[k]      (2)
   (e) z[k] ← z[k-1] + 4         (1)
   (f) x[k] ← x[k-1] + 4         (1)
END DO

cycle  MEM1  MEM2  ADD1  ADD2  FMLT  FADD
1      a1    b1    e1    f1
2      a2    b2    e2    f2
3      a3    b3    e3    f3
4      a4    b4    e4    f4
5      a5    b5    e5    f5
6      a6    b6    e6    f6
7      a7    b7    e7    f7    c1
8      a8    b8    e8    f8    c2
9      a9    b9    e9    f9    c3    d1
10     a10   b10   e10   f10   c4
11     a11   b11   e11   f11   c5    d2
…      …     …     …     …     …     …
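The FMLT and FADD columns of this table can be reproduced with a small simulation of eager (as early as possible) issue. This is a sketch under stated assumptions: `eager_issue_times` is a hypothetical helper, all units are taken to be fully pipelined, and cycles are numbered from 1 as in the table.

```python
# Latencies from the machine model on the slides.
LAT_MEM, LAT_FMULT, LAT_FADD = 6, 2, 2

def eager_issue_times(n):
    """Issue cycles of c_k and d_k for k = 1..n when every operation is
    issued as soon as its operands are available."""
    c, d = {}, {}
    for k in range(1, n + 1):
        load_issue = k                      # a_k and b_k are issued in cycle k
        c[k] = load_issue + LAT_MEM         # c_k waits for both loads
        ready = c[k] + LAT_FMULT            # w[k] is available
        if k > 1:
            # q[k-1] must also be available: the FADD recurrence.
            ready = max(ready, d[k - 1] + LAT_FADD)
        d[k] = ready
    return c, d

c, d = eager_issue_times(8)
for k in range(1, 9):
    # Columns: iteration, issue cycle of c_k, issue cycle of d_k, gap between them.
    print(k, c[k], d[k], d[k] - c[k])
```

The multiplies are issued once per cycle but the adds only once every two cycles, so the gap between c_k and d_k grows by one cycle per iteration and no repeating pattern can emerge.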
Why an eager scheduler fails in our example
[Figure: eager schedule plotted as cycles (0 to 23) against iterations (1 to 18) for operations b, c, and d. The b and c operations are issued once per cycle, while the d operations are issued only once every two cycles.]
Why an eager scheduler fails in our example
[Figure: schedule plotted as cycles (0 to 23) against iterations (1 to 18) in which the b, c, and d operations are each issued once every two cycles.]
Therefore we can do it in 2*N+9 cycles.
Collision vectors
Given the reservation tables for two operations A and B, the set of forbidden intervals, i.e., the issue distances at which A and B cannot both be issued, is called the collision vector for the reservation tables.
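A minimal sketch of how the forbidden intervals could be computed from two reservation tables. The representation (a dictionary from resource name to the set of cycles, relative to issue, in which the resource is busy), the function name `forbidden_intervals`, and the tables `ADD` and `MUL` are all assumptions made for illustration; they are not taken from the slides.

```python
def forbidden_intervals(table_a, table_b):
    """Intervals t >= 0 such that issuing B t cycles after A makes both
    operations need the same resource in the same cycle."""
    forbidden = set()
    for resource, cycles_a in table_a.items():
        for i in cycles_a:
            for j in table_b.get(resource, ()):
                # A uses the resource at cycle i, B at cycle t + j:
                # they collide when t == i - j.
                if i - j >= 0:
                    forbidden.add(i - j)
    return sorted(forbidden)

# Hypothetical reservation tables.
ADD = {"adder": {0, 1}, "bus": {2}}
MUL = {"mult": {0, 1, 2}, "bus": {3}}

print(forbidden_intervals(ADD, ADD))  # second add issued after the first
print(forbidden_intervals(MUL, ADD))  # add issued after a multiply
print(forbidden_intervals(ADD, MUL))  # multiply issued after an add
```

With these tables, an add cannot follow another add at distance 0 or 1, and an add cannot be issued one cycle after a multiply because both would need the bus in the same cycle.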
A Simplistic Modulo Scheduling Algorithm
1. Compute the MII as discussed.
2. Use a modified list scheduling algorithm to generate a modulo schedule. The scheduling algorithm must obey the following restriction:
If an operation P is scheduled at time t, no other operation that needs the same resource can be scheduled at any time t ± k*II, for any k ≥ 0.
The Modulo Reservation Table has II rows, representing the cycles of the initiation interval, and as many columns as the resources that it needs to keep track of.
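The reservation table described above can be sketched as a small class. This is an illustrative sketch, not the data structure of any specific compiler: the class name `ModuloReservationTable`, the `(resource, offset)` usage format, and the operation `ADD` are hypothetical.

```python
class ModuloReservationTable:
    """II rows (cycles modulo II) by one column per tracked resource."""

    def __init__(self, ii, resources):
        self.ii = ii
        self.slots = {r: [None] * ii for r in resources}

    def fits(self, usage, t):
        """usage: list of (resource, offset) pairs relative to issue time t."""
        rows = [(r, (t + off) % self.ii) for r, off in usage]
        # An operation that needs the same row of a resource twice
        # conflicts with its own copy from another iteration.
        if len(set(rows)) != len(rows):
            return False
        return all(self.slots[r][row] is None for r, row in rows)

    def place(self, name, usage, t):
        """Reserve the slots for an operation issued at time t, if free."""
        if not self.fits(usage, t):
            return False
        for r, off in usage:
            self.slots[r][(t + off) % self.ii] = name
        return True

# Hypothetical add: adder in its issue cycle, bus one cycle later.
ADD = [("adder", 0), ("bus", 1)]

mrt = ModuloReservationTable(ii=3, resources=["adder", "bus"])
print(mrt.place("P", ADD, t=1))   # adder row 1, bus row 2
print(mrt.place("Q", ADD, t=4))   # 4 mod 3 == 1, same adder row as P
print(mrt.place("Q", ADD, t=5))   # adder row 2, bus row 0
```

Because every time is reduced modulo II before the lookup, checking one row covers all times t ± k*II at once.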
Heuristic Method for Modulo Scheduling
Why might a simple variant of list scheduling not work?
Problem: Generate a modulo schedule of a loop by scheduling instructions until a pattern emerges.
Counter Example I: List Scheduling May Fail
[Figure: dependence graph over nodes A, B, C, and D with (dist,lat) edge labels (0,4), (0,2), (0,2), and (1,2).]
There is only one cycle in the dependence graph; therefore RecMII is given by:
$$\mathrm{RecMII} = \frac{2+2}{1+0} = 4$$
Therefore, in a machine with infinite resources, we must be able to schedule the loop in 4 cycles.
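Counter Example I: List Scheduling May Fail
[Figure: dependence graph over nodes A, B, C, and D with (dist,lat) edge labels (0,4), (0,2), (0,2), and (1,2), next to a partial schedule over cycles 0 to 3 in which A, C, and D are placed but B cannot be ("???").]
List Scheduling: a greedy algorithm that schedules each operation at its earliest possible time.
B must be scheduled after the A of the current iteration and before the C of the next iteration.
We are deadlocked!!!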
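Counter Example I: List Scheduling May Fail
[Figure: dependence graph over nodes A, B, C, and D with (dist,lat) edge labels (0,4), (0,2), (0,2), and (1,2), next to a schedule over cycles 0 to 7 and beyond containing A(0), C(0), D(0), A(1), B(0), C(1), ..., D(N), B(N), divided into prologue, kernel, and epilogue.]
The solution is to create a kernel with operations from different iterations, and use a prologue and an epilogue.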
Counter Example II: List Scheduling May Fail
[Figure: dependence graph over nodes A1, C2, A3, A4, M5, and M6 with (dist,lat) edge labels (0,2), (0,1), (0,2), (0,2), (0,3), and (0,3).]
A1, A3, and A4 are non-pipelined adds that take two cycles at the adder.
M5 and M6 are non-pipelined multiply operations that take three cycles each on the multiplier.
C2 is a copy operation that uses the bus for one cycle.
What is the ResMII for these operations in a machine that has one adder, one multiplier, and one bus?
ResMII(Adder) = 6; ResMII(Multiplier) = 6; ResMII(Bus) = 1
ResMII = 6
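Counter Example II: List Scheduling May Fail
[Figure: dependence graph over nodes A1, C2, A3, A4, M5, and M6 with (dist,lat) edge labels (0,2), (0,1), (0,2), (0,2), (0,3), and (0,3), next to a modulo reservation table with rows 0 to 5 and columns Adder, Mult, and Bus. The table is partially filled and A4 cannot be placed ("???").]
We cannot schedule A4 and achieve an MII = ResMII = 6!!!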
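Counter Example II: List Scheduling May Fail
[Figure: dependence graph over nodes A1, C2, A3, A4, M5, and M6 with (dist,lat) edge labels (0,2), (0,1), (0,2), (0,2), (0,3), and (0,3), next to a modulo reservation table with rows 0 to 5 and columns Adder, Mult, and Bus in which A1, A3, A4, M5, M6, and C2 are all placed.]
Although it seems counter-intuitive, we obtain a modulo schedule with MII = 6 if we initially schedule both M6 and A3 one cycle later than the earliest possible time for these operations.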
Complex Reservation Tables
Consider three independent operations with the reservation tables shown below.
[Figure: reservation tables, each with columns Add, Mult, and Bus, for operations A1, M2, and MA3, labeled (0,2), (0,3), and (0,4) respectively.]
What is the MII for a loop formed by these three operations?
ResMII(Add) = 1 + 0 + 1 = 2
ResMII(Mult) = 0 + 1 + 1 = 2
ResMII(Bus) = 1 + 1 + 0 = 2
ResMII = 2
Is the MII = 2 Feasible??
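[Figure: reservation tables (columns Add, Mult, Bus) for operations A1, M2, and MA3, labeled (0,2), (0,3), and (0,4), next to a modulo reservation table with rows 0 and 1 and columns Adder, Mult, and Bus in which A1 and M2 have been placed.]
Deadlocked. Cannot allocate MA3. Even though MII = max(ResMII, RecMII) = 2, MII = 2 is not feasible!!!!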
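Does Increasing MII to 3 Help?
[Figure: reservation tables (columns Add, Mult, Bus) for operations A1, M2, and MA3, labeled (0,2), (0,3), and (0,4), next to a modulo reservation table with rows 0 to 2 and columns Adder, Mult, and Bus in which A1, M2, and MA3 are all placed.]
We find a modulo schedule with MII = 3!!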
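Interaction Between Recurrence Constraints and Resource Constraints
[Figure: dependence graph forming a single cycle through A1, A2, A3, and A4, with (dist,lat) edge labels (0,2), (0,2), (0,2), and (2,2). Each operation A uses the reservation table labeled (0,2), with columns Add, Mult, and Bus.]
What is the RecMII for this loop?
RecMII = (2+2+2+2)/2 = 4
What is the ResMII for the loop?
ResMII(Add) = 1+1+1+1 = 4
ResMII(Mult) = 0+0+0+0 = 0
ResMII(Bus) = 1+1+1+1 = 4
ResMII = 4
Therefore MII = max(ResMII, RecMII) = 4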
Is the MII = 4 feasible?
[Figure: dependence graph through A1, A2, A3, and A4 with (dist,lat) edge labels (0,2), (0,2), (0,2), and (2,2), next to a modulo reservation table with rows 0 to 3 and columns Adder, Mult, and Bus in which A1 and A2 have been placed.]
In order to finish A4 in time to produce the result for two iterations later, A3 must be scheduled at time 4.
But 4 modulo 4 = 0, which conflicts with A1.
Therefore there is no feasible schedule with MII = 4.
Scheduling Strategy
An exhaustive search will eventually reveal that the MII calculated is not feasible, but it might take too long.
In practice, we compute the MII and spend a pre-allocated budget of time trying to find a schedule with the MII. If we don’t find one, we increase the MII.
In some commercial compilers, the search for the smallest feasible II is a binary search, where the II is doubled at each step until a feasible one is found, at which point a linear search between the last infeasible II and the feasible one is conducted.
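The doubling-then-linear search described above can be sketched as follows. This is only an illustration of the search strategy: `find_ii` and its `max_ii` cutoff are hypothetical, and `try_schedule(ii)` stands in for a budget-limited scheduler that returns True when it finds a modulo schedule with that II.

```python
def find_ii(mii, try_schedule, max_ii=1024):
    """Search for a small feasible II, starting from the MII (mii >= 1)."""
    if try_schedule(mii):
        return mii
    # Doubling phase: grow the II until some schedule is found.
    last_bad, ii = mii, 2 * mii
    while not try_schedule(ii):
        if 2 * ii > max_ii:
            return None                 # give up: no pipelined schedule
        last_bad, ii = ii, 2 * ii
    # Linear phase: scan between the last infeasible II and the feasible one.
    for candidate in range(last_bad + 1, ii):
        if try_schedule(candidate):
            return candidate
    return ii

# Hypothetical scheduler outcome: only IIs of 5 or more are feasible.
feasible = lambda ii: ii >= 5
print(find_ii(2, feasible))
```

Starting from MII = 2, this tries II = 2, 4, and 8 in the doubling phase and then II = 5 in the linear phase.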
Previous Approaches
Approach I (Operational): “Emulate” the loop execution under the machine model and a “pattern” will eventually occur [AikenNic88, EbciogluNic89, GaoEtAl91]
Approach II (Periodic scheduling): Specify the scheduling problem as a periodic scheduling problem and find an optimal solution [Lam88, RauEtAl81, GovindAltmanGao94]
Software Pipelining
   Operational Approach
      Heuristic (Aiken88, AikenNic88, Ebcioglu89, etc.)
      Formal Model (GaoWonNin91)
   Periodic Scheduling (Modulo Scheduling)
      Non-Exact Method (Heuristic) (RauGla81, Lam88, RauEtAl92, Huff93, DehnertTow93, Rau94, WanEis93)
      Exact Method
         Basic Formulation (DongenGao92)
         ILP based
            Register Optimal (NingGao91, NingGao93, Ning93)
            Resource Constrained (GovindAltGao94)
            Resource & Register (GovindAltGao95, Altman95, EichenbergerDav95)
         Exhaustive Search (Altman95, AltmanGao96)
   “Showdown” (RuttenbergGaoStouchininWoody96)