retargetable mapping of loop programs on coarse-grained...
TRANSCRIPT
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010
Retargetable Mapping of Loop Programs on Coarse-grained Reconfigurable Arrays
Frank [email protected]
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 2
Overview
• Motivation
• Architecture
• Mapping methodology
• Current and future work
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 3
Motivation
System-on-a-Chip (SoC)
communication network
embeddedprocessor memory
I/Ointerface
acceleratorCGRAaccelerator
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 4
Parallel Architectures’ Trade-offs
Performance
Flexibility
• Coarse-grained reconfiguration data is one to two orders of magnitude smaller than fine-grained
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 5
Architecture
Tightly-coupled processor arrays• Coarse-grained and weakly-programmable• Highly parameterizable architecture template• Reconfigurable
interconnect
GlobalCtrl.
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 6
Multicast Reconfiguration Scheme
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 7
Design flow, based on the PARO HLS toolAlgorithm (PAULA)
High-Level TransformationsLocalization Loop PerfectizationOutput Normal Form Loop UnrollingPartitioning Expression SplittingAffine Transformations ...
Code GenerationVLIW Code for each PE
Configuration of InterconnectCode of Controller
Hardware SynthesisProcessor Element Controller
Processor Array I/O Interface
HDL Generation
WPPA Configuration Hardware Description (VHDL)
Test BenchGeneration
Simulation
SimulationSimulation
Architecture Model
Front EndBack End
WPPA
Space-Time MappingAllocation Scheduling Resource Binding
WPPA FPGA
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 8
Simulation
Hardware SynthesisProcessor Element Controller
Processor Array I/O Interface
HDL Generation
Hardware SynthesisProcessor Element Controller
Processor Array I/O Interface
HDL Generation
Hardware Description (VHDL)
Test BenchGeneration
Simulation
Design flow, based on the PARO HLS toolAlgorithm (PAULA)
High-Level TransformationsLocalization Loop PerfectizationOutput Normal Form Loop UnrollingPartitioning Expression SplittingAffine Transformations ...
Code GenerationVLIW Code for each PE
Configuration of InterconnectCode of Controller
WPPA Configuration Hardware Description (VHDL)
Test BenchGeneration
SimulationSimulation
Architecture Model
Front EndBack End
WPPA FPGAWPPA
Space-Time MappingAllocation Scheduling Resource Binding
FPGA
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 9
Why not starting from C-code?• Most existing high-level synthesis tools start from C/C++ code
• Limitation: Semantics of input language (statement order, loop order) define execution order ⇒ limited parallelismExample:
int s = 0;for (i=0; i<=7; i++) { s += a[i]; }
• Tools have only few high-level transformations to parallelize code
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 10
Design entry: PAULA language
Key features:• Functional programming language• Full static single assignment (SSA) form, also for
multidimensional arrays• Powerful expressions for the specification of polyhedral and
lattice iteration domains• Convenient usage of reductions like ∑• Architectural modeling capabilities
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 11
PAULA and intermediate representation
Extended reduced dependence graph (RDG):
Example, 2-D Gauss window filter:...par (x >= 0 and x < 1280 and y >= 0 and y < 1024){ w[0,0]=1; w[0,1]=2; w[0,2]=1;w[1,0]=2; w[1,1]=4; w[1,2]=2;w[2,0]=1; w[2,1]=2; w[2,2]=1;h[x,y]=SUM[i>=0 and i<=2 and j>=0 and j<=2]
(pic_in[x+i,y+j] * w[i,j]);pic_out[x,y]=h[x,y] >> 4; // divided by 16
}
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 12
Simulation
Hardware SynthesisProcessor Element Controller
Processor Array I/O Interface
HDL Generation
Hardware Description (VHDL)
Test BenchGeneration
Simulation
Design flowAlgorithm (PAULA)
High-Level TransformationsLocalization Loop PerfectizationOutput Normal Form Loop UnrollingPartitioning Expression SplittingAffine Transformations ...
Code GenerationVLIW Code for each PE
Configuration of InterconnectCode of Controller
WPPA Configuration Simulation
Architecture Model
Front EndBack End
WPPA FPGAWPPA
Space-Time MappingAllocation Scheduling Resource Binding
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 13
Localization
• Example of one-dimensional localizationfor (i=0; i <= N; i++) for (i=0; i <= N; i++) { b[i] = a[0]; { if (i > 0) a[i] = a[i-1]; } b[i] = a[i];
}
• Example of two-dimensional localization
:x :y :z
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 14
Simulation
Hardware SynthesisProcessor Element Controller
Processor Array I/O Interface
HDL Generation
Hardware Description (VHDL)
Test BenchGeneration
Simulation
Design flowAlgorithm (PAULA)
High-Level TransformationsLocalization Loop PerfectizationOutput Normal Form Loop UnrollingPartitioning Expression SplittingAffine Transformations ...
Code GenerationVLIW Code for each PE
Configuration of InterconnectCode of Controller
WPPA Configuration Simulation
Architecture Model
Front EndBack End
WPPA FPGAWPPA
Space-Time MappingAllocation Scheduling Resource Binding
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 15
Space mapping / partitioning
• LSGP partitioning
dependence graph architecture
Main advantage:
Place & route for free! Since the space mapping directly defines the placement.
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 16
Space mapping / partitioning
• LPGS partitioning architecture
dependence graph
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 17
Space mapping / partitioning
• Hierarchical partitioning architecture dependence graph
– Balancing of: Communication, I/O and different levels of (local) memory– Adaptation of bandwidth, computational power, and memory
constraints
LS
GS
local memory
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 18
Scheduling
• Features:– Resource constraints (number of functional units)– Module selection (different binding possibilities of operations)– Functional and software pipelining– Some part can be concurrently executed
other have to be serialized– Exact approach based on mixed integer
linear programming (MIP)
• Goal, simultaneous optimization of:– Local schedule (execution order within PEs) and– Global schedule (execution between all PEs)
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 19
Example, Median Filter
• PAULA program of horizontal median filter
• Partitioned algorithm (4 stripes / processors)
2 90 3
48
m=4m=3
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 20
Code Generation, Median Filter
• Substitution of median-function by explicit comparisons
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 21
Code Generation, Median Filter
• VLIW code fragment for one processor
• Zero loop overhead: Run control flow completely in parallel with data flow
Generated by Global Controller
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 22
• Currently:– Static mapping, compilation for one fixed architecture allocation (no. of PEs, etc.)– Static partitioning and scheduling– If other resource constraints are given or different requirements are needed,
program has to compiled again
• Idea: Symbolic loop parallelization and mapping– Symbolic partitioning– Symbolic scheduling– Symbolic control generation
⇒* Replace at run-time only the symbols in theconfiguration data according to the available resources
* Algorithms can be adapted quickly at run-time without recompilation* Self-adaption enables fast reaction on QoS parameters, system load, failures, etc.
Current and future work
University of Erlangen-NurembergFrank Hannig
Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 23
Questions?
Retargetable Mapping of Loop Programs onCoarse-grained Reconfigurable Arrays
Frank HannigHardware/Software Co-DesignDepartment of Computer Science Phone: + 49 9131 85-25153University of Erlangen-Nuremberg Fax: + 49 9131 85-25149Am Weichselgarten 3 Email: [email protected] Erlangen, Germany URL: http://www12.cs.fau.de
AcknowledgementsHritam Dutta, Dmitrij Kissler, Alexey Kupriyanov,Vahid Lari, Holger Ruckdeschel, Jürgen Teich
This work was partially supported by the German Research Foundation (DFG)in projects under contracts TE 163 /3-1, TE 163 /3-2, and SFB TRR 89.