University of Michigan
Electrical Engineering and Computer Science
Compiler-directed Synthesis of Programmable Loop Accelerators
Kevin Fan, Hyunchul Park, Scott Mahlke
September 25, 2004
EDCEP Workshop

Loop Accelerators
• Hardware implementation of a critical loop nest
  – Hardwired state machine
  – Digital camera application: 1000x vs. Pentium III
  – Multiple accelerators hooked up in a pipeline
• Loop accelerator vs. customized processor
  – 1 block of code vs. multiple blocks
  – Trivial control flow vs. handling generic branches
  – Traditionally state machine vs. instruction driven

Programmable Loop Accelerators
• Goals
  – Multifunction accelerators: accelerator hardware can handle multiple loops (re-use)
  – Post-programmable: to a degree, allow changes to the application
  – Use compiler as architecture synthesis tool
• But …
  – Don’t build a customized processor
  – Maintain ASIC-level efficiency

NPA (Nonprogrammable Accelerator) Synthesis in PICO
[Figure: a sequential loop nest and a performance requirement lead to a systolic array (datapath, controller, coprocessor interface) attached to an external bus. Labels: Data In, Data Out, commands, timing, done. Inset: systolic processor datapath with Load y, Load w, Load x, an adder (+), and a Store.]

PICO Frontend
for i = 1 to ni
  for j = 1 to nj
    y[i] += w[j] * x[i+j]

for jt = 1 to 100 step 10
  for t = 0 to 502
    for p = 0 to 1
      (i,j) = function of (t,p)
      if (i>1) W[t][p] = W[t-5][p] else w[jt+j]
      if (i>1 && j<bj) X[t][p] = X[t-4][p+1] else x[i+jt+j]
      Y[t][p] += W[t][p] * X[t][p]
• Goals
  – Exploit loop-level parallelism
  – Map loop to abstract hardware
  – Manage global memory BW
• Steps
  – Tiling
  – Load/store elimination
  – Iteration mapping
  – Iteration scheduling
  – Virtual processor clustering
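
As a rough illustration of the tiling step only, the sketch below writes the original loop nest in C next to a version whose j loop is split into tiles. The function names, the 0-based indexing, and the TILE value of 10 (borrowed from the "step 10" in the transformed loop) are assumptions made for this example; it does not reproduce PICO's actual output, and the later steps (load/store elimination, iteration mapping and scheduling) are not shown.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 10  /* assumed tile size, echoing "step 10" on the slide */

/* Original sequential loop nest: y[i] += w[j] * x[i+j].
 * x must hold at least ni + nj - 1 elements. */
void fir_reference(int *y, const int *w, const int *x, size_t ni, size_t nj)
{
    assert(y && w && x);
    for (size_t i = 0; i < ni; i++)
        for (size_t j = 0; j < nj; j++)
            y[i] += w[j] * x[i + j];
}

/* Same computation with the j loop tiled: each outer step touches only
 * TILE consecutive weights, which bounds the data needed per tile. */
void fir_tiled(int *y, const int *w, const int *x, size_t ni, size_t nj)
{
    assert(y && w && x);
    for (size_t jt = 0; jt < nj; jt += TILE) {
        size_t jend = (jt + TILE < nj) ? jt + TILE : nj;
        for (size_t i = 0; i < ni; i++)
            for (size_t j = jt; j < jend; j++)
                y[i] += w[j] * x[i + j];
    }
}
```

Both functions accumulate the same products into y[i], only in a different order, so for integer data they produce identical results.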

PICO Backend
[Figure: operation graph with Select (T/F), Load, +, Store, and Copy nodes]
• Resource allocation (II, operation graph)
• Synthesize machine description for “fake” fully connected processor with allocated resources

Reduced VLIW Processor after Modulo Scheduling
[Figure: Select (T/F), Load, +, and Store units; value labels X(r-1), Y(r-1), t1, t2, t3, y(ii), w(jj), x(ii-jj)]

Data/control-path Synthesis NPA
[Figure: synthesized NPA datapath with Load y(ii), Load w(jj), Load x(ii-jj), an adder (+), and a Store; value labels t1, t2, t3, X(r-1), Y(r-1), y(ii)]

PICO Methodology – Why it Works?
• Systematic design methodology
  – 1. Parameterized meta-architecture: all NPAs have same general organization
  – 2. Performance/throughput is input
  – 3. Abstract architecture: we know how to build compilers for this
  – 4. Mapping mechanism: determine architecture specifics from schedule for abstract architecture

Direct Generalization of PICO?
[Figure: the PICO backend operation graph with Select (T/F), Load, +, Store, and Copy nodes]
• Programmability would require full interconnect between elements
• Back to the meta architecture!
• Generalize connectivity to enable post-programmability
• But stylize it

Programmable Loop Accelerator – Design Strategy
• Compile for partially defined architecture
  – Build long distance communication into schedule
  – Limit global communication bandwidth
• Proposed meta-architecture
  – Multi-cluster VLIW
    • Explicit inter-cluster transfers (varying latency/BW)
    • Intra-cluster communication is complete
  – Hardware partially defined: expensive units

Programmable Loop Accelerator Schema
[Figure: a pipeline of tiled or clustered accelerators alongside DRAM, SRAM, and a stream buffer. Accelerator datapath labels: Control Unit, Stream Units, clusters of FUs and MEM with shift registers and intra-cluster communication, and an inter-cluster register file.]

Flow Diagram
[Figure: design flow from assembly code and II through FU Alloc, Partition, and Modulo Schedule to the loop accelerator. Labels along the flow: # clusters, # expensive FUs; # cheap FUs, FUs assigned to clusters; shift register depth, width, porting; intercluster bandwidth.]

Sobel Kernel
for (i = 0; i < N1; i++) {
  for (j = 0; j < N2; j++) {
    int t00, t01, t02, t10, t12, t20, t21, t22;
    int e1, e2, e12, e22, e, tmp;

    /* Load the 3x3 neighborhood (the center pixel is not needed). */
    t00 = x[i  ][j  ]; t01 = x[i  ][j+1]; t02 = x[i  ][j+2];
    t10 = x[i+1][j  ];                    t12 = x[i+1][j+2];
    t20 = x[i+2][j  ]; t21 = x[i+2][j+1]; t22 = x[i+2][j+2];

    /* Vertical and horizontal gradients. */
    e1 = ((t00 + t01) + (t01 + t02)) - ((t20 + t21) + (t21 + t22));
    e2 = ((t00 + t10) + (t10 + t20)) - ((t02 + t12) + (t12 + t22));

    /* Squared gradient magnitude, thresholded to a binary edge map. */
    e12 = e1*e1;
    e22 = e2*e2;
    e = e12 + e22;
    if (e > threshold) tmp = 1; else tmp = 0;
    edge[i][j] = tmp;
  }
}

FU Allocation
• Determine number of clusters:
    $\#\text{clusters} = \left\lceil \dfrac{\#\text{ops}}{4 \times II} \right\rceil$
• Determine number of expensive FUs
  – MPY, DIV, memory
    $\#\text{FUs}_{\text{type}} = \left\lceil \dfrac{\#\text{ops of type}}{II} \right\rceil$
• Sobel with II=4
  – 41 ops → 3 clusters
  – 2 MPY ops → 1 multiplier
  – 9 memory ops → 3 memory units
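
The allocation arithmetic can be checked with a few lines of C. This sketch assumes the two formulas on this slide are ceilings of ops / (4 × II) and ops-of-type / II, which reproduces the Sobel numbers; the function names and the FUS_PER_CLUSTER constant are hypothetical and not part of the authors' tool.

```c
#include <assert.h>

#define FUS_PER_CLUSTER 4  /* assumed: the 4 in the cluster-count formula */

/* Ceiling of a / b for non-negative a and positive b. */
static int ceil_div(int a, int b)
{
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* A cluster offers FUS_PER_CLUSTER * II issue slots per loop iteration. */
int num_clusters(int total_ops, int ii)
{
    return ceil_div(total_ops, FUS_PER_CLUSTER * ii);
}

/* One FU of a given type can issue II operations of that type per iteration. */
int num_expensive_fus(int ops_of_type, int ii)
{
    return ceil_div(ops_of_type, ii);
}
```

With the Sobel figures, num_clusters(41, 4) gives 3, num_expensive_fus(2, 4) gives 1 multiplier, and num_expensive_fus(9, 4) gives 3 memory units.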

Partitioning
• Multi-level approach consists of two phases
  – Coarsening
  – Refinement
• Minimize inter-cluster communication
• Load balance
  – Max of 4 × II operations per cluster
• Take FU allocation into account
  – Restricted # of expensive units
  – # of cheap units (ADD, logic) determined from partition
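
To make the two partitioning objectives concrete, here is a minimal sketch of how a candidate assignment of operations to clusters could be scored: count the dataflow edges that cross clusters, and check that no cluster exceeds 4 × II operations. The struct, the function names, and the MAX_CLUSTERS bound are assumptions for illustration; the slides do not describe the partitioner's actual cost function.

```c
#include <assert.h>

/* A dataflow edge between two operations, by operation index. */
struct edge { int src, dst; };

/* Count edges whose endpoints sit in different clusters. */
int count_cut_edges(const struct edge *edges, int n_edges, const int *cluster_of)
{
    int cut = 0;
    for (int e = 0; e < n_edges; e++)
        if (cluster_of[edges[e].src] != cluster_of[edges[e].dst])
            cut++;
    return cut;
}

/* Return 1 if no cluster holds more than 4 * II operations, else 0. */
int is_load_balanced(const int *cluster_of, int n_ops, int n_clusters, int ii)
{
    enum { MAX_CLUSTERS = 16 };  /* arbitrary bound for this sketch */
    int load[MAX_CLUSTERS] = {0};
    assert(n_clusters > 0 && n_clusters <= MAX_CLUSTERS);
    for (int op = 0; op < n_ops; op++) {
        assert(cluster_of[op] >= 0 && cluster_of[op] < n_clusters);
        if (++load[cluster_of[op]] > 4 * ii)
            return 0;
    }
    return 1;
}
```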

Coarsening
• Group highly related operations together
  – Pair operations together at each step
  – Forces partitioner to consider several operations as a single unit
• Coarsening Sobel subgraph into 2 groups:
[Figure: four panels showing a Sobel subgraph of loads (L) and adds (+) being grouped]
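
One common way to "pair operations together at each step" in multilevel partitioners is greedy heavy-edge matching, sketched below. This is a stand-in technique chosen for illustration, not necessarily the pairing rule used in this work; the edge weights, the struct, and the function name are all hypothetical.

```c
#include <assert.h>

struct wedge { int a, b, weight; };  /* undirected weighted edge */

/* One coarsening step: visit nodes in index order and pair each unmatched
 * node with the unmatched neighbor on its heaviest edge. A node with no
 * unmatched neighbor forms a group by itself.
 * group_of[v] receives the coarse group id; returns the number of groups. */
int coarsen_once(int n_nodes, const struct wedge *edges, int n_edges, int *group_of)
{
    int n_groups = 0;
    assert(n_nodes > 0 && n_edges >= 0);
    for (int v = 0; v < n_nodes; v++)
        group_of[v] = -1;

    for (int v = 0; v < n_nodes; v++) {
        int best = -1, best_w = 0;
        if (group_of[v] != -1)
            continue;  /* already paired in this step */
        for (int e = 0; e < n_edges; e++) {
            int other = -1;
            if (edges[e].a == v) other = edges[e].b;
            else if (edges[e].b == v) other = edges[e].a;
            if (other >= 0 && other != v && group_of[other] == -1 &&
                edges[e].weight > best_w) {
                best = other;
                best_w = edges[e].weight;
            }
        }
        group_of[v] = n_groups;
        if (best >= 0)
            group_of[best] = n_groups;
        n_groups++;
    }
    return n_groups;
}
```

Repeating the step on the resulting groups, with edge weights summed between groups, would shrink the graph further until the desired number of groups remains.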

Refinement
• Move operations between clusters
• Good moves:
  – Reduce inter-cluster communication
  – Improve load balance
  – Reduce hardware cost
    • Reduce number of expensive units to meet limit
    • Collect similar bitwidth operations together
[Figure: an operation in the add/load subgraph being considered for a move between clusters]
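
The first criterion for a good move can be expressed as a gain: how many cut edges disappear if one operation changes cluster. The sketch below computes that gain in the style of Kernighan-Lin and Fiduccia-Mattheyses refinement; it is an assumed formulation covering only inter-cluster communication, and the names are hypothetical. Load balance and hardware cost would need additional terms.

```c
#include <assert.h>

struct dep { int src, dst; };  /* dataflow edge between operation indices */

/* Reduction in cut edges if operation `op` moves to cluster `target`.
 * Positive means less inter-cluster communication after the move. */
int move_gain(const struct dep *deps, int n_deps, const int *cluster_of,
              int op, int target)
{
    int gain = 0;
    int home = cluster_of[op];
    assert(n_deps >= 0);
    if (home == target)
        return 0;
    for (int e = 0; e < n_deps; e++) {
        int other;
        if (deps[e].src == op) other = deps[e].dst;
        else if (deps[e].dst == op) other = deps[e].src;
        else continue;
        if (other == op)
            continue;                      /* ignore self loops */
        if (cluster_of[other] == target)
            gain++;                        /* edge stops being cut */
        else if (cluster_of[other] == home)
            gain--;                        /* edge becomes cut */
    }
    return gain;
}
```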

Partitioning Example
• From sobel, II=4
• Place MPYs together
• Place each tree of ADD-LOAD-ADDs together
• Cuts 6 edges

Modulo Scheduling
• Determines shift register width, depth, and number of read ports
• Sobel II=4
[Figure: modulo schedule over cycles 0 to 3 on FU0 to FU3, containing LD and ADD operations]
FU   Cycle   Max result lifetime   Req’d depth   Req’d ports
0    2       4                     4             1
1    1       2                     4             2
     3       4
2    4       1                     1             1
3    0       -                     1             1
     3       1
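
The depth column appears to equal the longest result lifetime on each FU, which is what one would expect if a result advances one shift register position per cycle until its last read. The sketch below encodes that reading as an assumption; the function name and the NO_RESULT marker (standing in for the "-" entry) are hypothetical, and the port count is not modeled.

```c
#include <assert.h>

#define NO_RESULT (-1)  /* hypothetical marker for an op with no lifetime ("-") */

/* Depth needed for one FU's shift register: the longest lifetime among
 * the results that FU produces. Returns 0 if it produces none. */
int required_depth(const int *lifetimes, int n)
{
    int depth = 0;
    assert(n >= 0);
    for (int k = 0; k < n; k++)
        if (lifetimes[k] != NO_RESULT && lifetimes[k] > depth)
            depth = lifetimes[k];
    return depth;
}
```

Applied to the lifetimes listed for the Sobel schedule ({4}, {2, 4}, {1}, and {-, 1}), this yields depths of 4, 4, 1, and 1.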

Test Cases
• Sobel and fsed kernels, II=4 designs
• Each machine has 4 clusters with 4 FUs per cluster
[Figure: FU mix per cluster for the sobel and fsed machines; FU labels include M, + -, *, &, <<, and B]

Cross Compile Results
• Computation is localized
  – sobel: 1.5 moves/cycle
  – fsed: 1 move/cycle
• Cross compile
  – Can still achieve II=4
  – More inter-cluster communication
  – May require more units
  – sobel on fsed machine: ~2 moves/cycle
  – fsed on sobel machine: ~3 moves/cycle

Concluding Remarks
• Programmable loop accelerator design strategy
  – Meta-architecture with stylized interconnect
  – Systematic compiler-directed design flow
• Costs of programmability:
  – Interconnect, inter-cluster communication
  – Control: “micro-instructions” are necessary
• Just scratching the surface of this work
• For more, see the CCCP group webpage
  – http://cccp.eecs.umich.edu