University of Erlangen-Nuremberg, Frank Hannig
CASA 2010, Scottsdale, USA, October 2010
Communication Synthesis of Loop Accelerator Pipelines
Frank Hannig
-
Introduction
• Communicating loops
• Pipeline of hardware accelerators
Problem: Accelerator and Communication Synthesis!
-
Outline
• Introduction
• Basics of loop accelerator design
• Problem definition
• Solution approach
• Results
• Conclusion
-
Polyhedral model: Accelerator synthesis
• Loop nest: Iteration Space + Reduced Dependence Graph
• Processor allocation (Q) and Scheduling (Loop Matrix, L)
for (i: 1 to 8)
  for (j: 1 to 8)
    A[i,j] = func(...);
Mapping (Q, L):
parfor (i: 1 to 8)
  for (j: 1 to 8)
    A[i,j] = func(...);
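The effect of such a mapping on the 8×8 loop nest can be sketched in a few lines of Python. This is only an illustration: the allocation `Q = (1, 0)` (one processing element per value of i) and the schedule vector `LAM = (0, 1)` (the j loop runs sequentially) are assumed values that mirror the `parfor` example, not parameters given on the slides.

```python
# Hypothetical space-time mapping of an 8x8 iteration space.
# Q and LAM are assumed values chosen for illustration only.
Q = (1, 0)    # allocation: processor index p = Q . (i, j) = i
LAM = (0, 1)  # schedule:   start time      t = LAM . (i, j) = j


def dot(a, b):
    """Inner product of two equally long integer tuples."""
    return sum(x * y for x, y in zip(a, b))


def space_time_map(n=8):
    """Map every iteration (i, j), 1 <= i, j <= n, to (processor, time)."""
    return {
        (i, j): (dot(Q, (i, j)), dot(LAM, (i, j)))
        for i in range(1, n + 1)
        for j in range(1, n + 1)
    }


mapping = space_time_map()
processors = sorted({p for p, _ in mapping.values()})
print(processors)        # distinct processor indices
print(mapping[(3, 5)])   # processor and time step of iteration (3, 5)
```

A mapping is only usable if no two iterations land on the same processor at the same time step; for the assumed Q and LAM this holds because (i, j) maps to (i, j) itself.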
-
Tiling
• Tiling determines the granularity of parallelism
• LSGP partitioning
[Figure: iteration space with dependencies; processor array]
partitioning matrix: $P = \begin{pmatrix} 4 & 0 \\ 0 & 4 \end{pmatrix}$
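As a rough sketch of LSGP (local sequential, global parallel) tiling, the following Python snippet splits each iteration into a tile index, which selects the processing element, and an offset inside the tile, which is executed sequentially. It assumes a partitioning matrix P = diag(4, 4) and an 8×8 iteration space with 0-based indices; the function name `lsgp_split` is made up for this example.

```python
# LSGP sketch: tile an 8x8 iteration space with P = diag(4, 4).
# The 8x8 extent is an assumption borrowed from the earlier loop example.
TILE = (4, 4)   # diagonal entries of the partitioning matrix P
SPACE = (8, 8)  # assumed iteration-space extent (0-based indices)


def lsgp_split(i, j):
    """Return (pe, local): the tile (= processor) and the offset inside it."""
    pe = (i // TILE[0], j // TILE[1])    # global parallel: one PE per tile
    local = (i % TILE[0], j % TILE[1])   # local sequential: order inside tile
    return pe, local


pes = {lsgp_split(i, j)[0] for i in range(SPACE[0]) for j in range(SPACE[1])}
print(sorted(pes))       # the PEs of the resulting processor array
print(lsgp_split(5, 2))  # PE and local offset of iteration (5, 2)
```

With these assumed sizes the 64 iterations are distributed over a 2×2 processor array, and each PE processes its 16 iterations one after another.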
-
Tiling, cont’d
• LPGS partitioning
[Figure: iteration space with dependencies; processor array]
partitioning matrix: $P = \begin{pmatrix} 2 & 0 \\ 0 & 4 \end{pmatrix}$
-
Tiling, cont’d
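• Copartitioning
– Balancing of communication cost and different levels of (local) memory
– Note that LSGP and LPGS are special cases of copartitioning
[Figure: iteration space with dependencies, showing LS and GS tiles; processor array with local memory]
partitioning matrices: $P_{LS} = \begin{pmatrix} 2 & 0 \\ 0 & 4 \end{pmatrix}$, $P_{GS} = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}$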
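The two-level decomposition behind copartitioning can be sketched as follows, assuming P_LS = diag(2, 4) and P_GS = diag(2, 1) and 0-based iteration indices. The sketch follows the usual reading of copartitioning (an LS tile is run sequentially on one PE, the LS tiles inside one GS tile run in parallel, and GS tiles are processed one after another); the helper names are hypothetical.

```python
P_LS = (2, 4)  # local sequential tile size per dimension (assumed)
P_GS = (2, 1)  # global sequential tile size, counted in LS tiles (assumed)


def copartition(i, j):
    """Split iteration (i, j) into (local, pe, glob) coordinates."""
    idx = (i, j)
    local = tuple(x % a for x, a in zip(idx, P_LS))   # offset inside LS tile
    tile = tuple(x // a for x, a in zip(idx, P_LS))   # LS tile index
    pe = tuple(t % b for t, b in zip(tile, P_GS))     # PE inside the GS tile
    glob = tuple(t // b for t, b in zip(tile, P_GS))  # GS tile index
    return local, pe, glob


def recompose(local, pe, glob):
    """Inverse of copartition: rebuild the original iteration index."""
    return tuple(
        l + a * (p + b * g)
        for l, p, g, a, b in zip(local, pe, glob, P_LS, P_GS)
    )


print(copartition(5, 6))              # local offset, PE, and GS tile of (5, 6)
print(recompose(*copartition(5, 6)))  # round trip back to (5, 6)
```

With these assumed matrices the processor array has 2 × 1 = 2 PEs, each holding one 2 × 4 LS tile at a time.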
-
Notation
• Assumption: Rectangular iteration spaces and tiles
• Rectangular iteration space
– is written as
• Rectangular tiles imply that tiling matrices are diagonal matrices
– i.e., $P_1 = \operatorname{diag}(a_1)$ (LS), $P_2 = \operatorname{diag}(a_2)$ (GS)
• Succinct representation of copartitioned iteration space
-
Polyhedral model: Communicating loops
• Loop graph for representing communicating loops
• Mapped loop graph: Loop graph + mapping (Q,L)
• Node
– Iteration space
– Reduced dependence graph
• Edge
– Iteration space of transported variable
– Dependency between the iteration spaces
– Processor allocation and read/write scanning order
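One possible in-memory representation of such a loop graph is sketched below. All class and field names are hypothetical and chosen for this example; the reduced dependence graph of a node and the read/write scanning order of an edge are left out to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class LoopNode:
    name: str
    iteration_space: tuple        # rectangular extent, e.g. (12,)
    copartitioning: tuple = None  # mapping parameters, set once the node is mapped


@dataclass
class LoopEdge:
    src: str
    snk: str
    variable: str
    data_space: tuple             # iteration space of the transported variable


@dataclass
class LoopGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node):
        self.nodes[node.name] = node

    def add_edge(self, edge):
        # An edge may only connect loops that are already part of the graph.
        if edge.src not in self.nodes or edge.snk not in self.nodes:
            raise KeyError("edge endpoints must be existing nodes")
        self.edges.append(edge)

    def is_mapped(self):
        """A mapped loop graph has mapping information on every node."""
        return all(n.copartitioning is not None for n in self.nodes.values())


graph = LoopGraph()
graph.add_node(LoopNode("source", (12,), copartitioning=(1, 6)))
graph.add_node(LoopNode("sink", (12,)))
graph.add_edge(LoopEdge("source", "sink", "A", (12,)))
print(graph.is_mapped())  # False until the sink loop is mapped as well
```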
-
Example
• Iteration space and corresponding processor array
[Figure: processor arrays with elements PE0, PE1, PE2, PE3 and PE00, PE10, PE01, PE11]
-
Problem definition
• Communication synthesis of a subsystem for the transport of a multi-dimensional array
• Features of a custom communication subsystem
– Transportation of multi-dimensional arrays
– Support of out-of-order communication
– High throughput by supporting parallel access
[Figure: processor arrays PE0 to PE3 and PE00 to PE11, with the communication subsystem between them marked "???"]
-
Communication synthesis
• WSDF (windowed synchronous data flow) models the transport of a multi-dimensional array [3,2]
– Producer and consumer token
– Virtual token (e.g., image or matrix)
– Read and write communication order
• Complex communication patterns (in order, out-of-order)
• Custom memory architecture with FIFO-like behavior can be generated
-
Method of solution
• Polyhedral to WSDF
– Project mapped loop graph to WSDF model
– I.e., given mapped loop graph,
– Find WSDF edge notation,
• WSDF to MD-FIFO
– Synthesize custom memory architecture called multidimensional FIFO from the given WSDF edge parameters
• All communicating loop nests belonging to our class of algorithms can be converted to the WSDF model
-
Polyhedral to WSDF
• Source data space ; Sink data space
• Simple case: Continuous tokens
• Inner loop parallelization (LPGS) or sequential execution for both source and sink loop
– i.e., if
– Producer and consumer tokens are given by the number of I/O processors
• Virtual token vector refers to common multidimensional array, which is tiled differently
• The read and write order is derived from the loop matrix
-
Example
• Loop graph
• Inner loop parallelization (LPGS) for source and sink loop
– I.e., copartitioning(1,6) and copartitioning(1,4)
– Source and sink iteration space are
• WSDF edge notation
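For the simple one-dimensional case, the token sizes can be illustrated with a small helper. This is a sketch under stated assumptions: the array length of 12 is made up for the example (the slide's iteration spaces did not survive extraction), the number of I/O processors is taken to be the second copartitioning parameter (6 and 4), and `simple_case_tokens` is a hypothetical name, not part of any tool.

```python
def simple_case_tokens(n_src_pes, n_snk_pes, array_len):
    """Token sizes for a 1-D edge with continuous tokens (illustrative only)."""
    if array_len % n_src_pes or array_len % n_snk_pes:
        raise ValueError("array length must be a multiple of both token sizes")
    return {
        "producer_token": n_src_pes,  # elements written per source firing
        "consumer_token": n_snk_pes,  # elements read per sink firing
        "virtual_token": array_len,   # the common array, tiled differently
        "src_firings": array_len // n_src_pes,
        "snk_firings": array_len // n_snk_pes,
    }


edge = simple_case_tokens(n_src_pes=6, n_snk_pes=4, array_len=12)
print(edge)
```

Under these assumptions the source fires twice and the sink three times per virtual token, so both sides transport the same 12 elements.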
-
Polyhedral to WSDF
• Source data space ; Sink data space
• Complex case: Non-continuous tokens
– Outer loop parallelization, i.e., LSGP or copartitioning
• Construction: a copy node is introduced for reordering the data array for supporting parallel access
– Embedding into a new data space, where the tokens which are produced and consumed are continuous
– Parallel access required:
-
Example
• Loop graph
• Mapping: Outer loop parallelization for source and sink loop
– i.e., loops undergo copartitioning(6,2) and copartitioning(4,3)
– Source and sink data space are
• Arrows show non-continuous data required for parallel access
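Why outer loop parallelization leads to non-continuous tokens, and how a reordering can repair this, can be seen in a small one-dimensional sketch. It assumes that copartitioning(6,2) means a local tile of 6 elements on each of 2 PEs over a 12-element array; the permutation below is one simple way to make each parallel access contiguous and is not claimed to be the construction used by the copy node in the talk.

```python
def parallel_accesses(tile, n_pes):
    """Addresses touched together in each step under outer-loop parallelization."""
    return [[pe * tile + t for pe in range(n_pes)] for t in range(tile)]


def reorder(tile, n_pes):
    """Permutation old_address -> new_address that makes each step contiguous."""
    return {
        pe * tile + t: t * n_pes + pe
        for t in range(tile)
        for pe in range(n_pes)
    }


steps = parallel_accesses(tile=6, n_pes=2)
perm = reorder(tile=6, n_pes=2)
print(steps[0], steps[1])             # addresses a full tile apart
print([perm[a] for a in steps[0]],
      [perm[a] for a in steps[1]])    # neighbouring addresses after reordering
```

In step t the two PEs access elements t and t + 6, which are far apart in the original array; after the permutation they sit next to each other at 2t and 2t + 1 and can be delivered as one continuous token.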
-
Example, cont’d
• WSDF edge notation
– Copy actor ensures parallel access of data tokens
[Figure: WSDF edge notation with copy actor]
-
WSDF to multidimensional FIFO
• Multidimensional FIFO
• Number of memory banks:
-
Multidimensional FIFO
• Address Generation [3]
– Linearization in production order
– Sink address generation using address increments
• Fill level control
– Updates available number of data
[Figure: fill level control block diagram; labels: src fill level control, snk fill level control, full, wr_count, wr_ena, empty, rd_count, rd_ena, src iter, snk iter, Δsrc, Δsnk, Δsrc calc, Δsnk calc, avail counter]
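The interplay of linear write addresses, out-of-order reads, and fill level control can be mimicked with a toy software model. This is a strongly simplified sketch, not the hardware architecture from [3]: it stores one complete virtual token, never reuses memory, and takes the sink order as an explicit list of production indices instead of computing address increments. The class name `OutOfOrderFifo` is made up; `wr_count`, `rd_count`, `full`, and `empty` borrow the signal names from the slide.

```python
class OutOfOrderFifo:
    """Toy model: linear write addresses, arbitrary read order, fill-level check."""

    def __init__(self, read_order):
        self.read_order = list(read_order)  # sink order as production indices
        self.mem = [None] * len(self.read_order)
        self.wr_count = 0
        self.rd_count = 0

    def full(self):
        return self.wr_count == len(self.mem)

    def empty(self):
        """True while the next token the sink wants has not been written yet."""
        if self.rd_count == len(self.read_order):
            return True
        return self.read_order[self.rd_count] >= self.wr_count

    def write(self, value):
        if self.full():
            raise RuntimeError("fifo full")
        self.mem[self.wr_count] = value  # linearization in production order
        self.wr_count += 1

    def read(self):
        if self.empty():
            raise RuntimeError("requested token not available yet")
        value = self.mem[self.read_order[self.rd_count]]
        self.rd_count += 1
        return value


fifo = OutOfOrderFifo(read_order=[1, 0, 3, 2])
fifo.write("a")
print(fifo.empty())               # the sink needs token 1 first, so it must wait
fifo.write("b")
print(fifo.read(), fifo.read())   # tokens come out in sink order
```

The example shows why out-of-order communication needs more buffering than a classical FIFO: the sink cannot start until the token it needs first has been produced, so earlier tokens have to be held in memory in the meantime.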
-
Results
• Communication Overhead:
• In-order communication leads to classical FIFOs with less overhead
• Out-of-order communication leads to large buffers and complex logic for address generation, hence a large overhead
• The multidimensional FIFOs are not a throughput bottleneck
-
Conclusions and outlook
• Novel bridge from the polyhedral model to synchronous data flow, clear representation of:
– Communication semantics
– Automated synthesis of a dedicated communication subsystem
• A logical next step is to explore different partitionings and scheduling orders in order to optimize the overall system of communicating loops
[Figure: PARO HLS tool flow; blocks: Algorithm (PAULA); High-Level Transformations (Localization, Loop Perfectization, Output Normal Form, Loop Unrolling, Partitioning, Expression Splitting, Affine Transformations, ...); Space-Time Mapping (Allocation, Scheduling, Resource Binding); Architecture Model; Hardware Synthesis (Processor Element, Controller, Processor Array, I/O Interface); HDL Generation; Hardware Description (VHDL); Test Bench Generation; Simulation; FPGA]
-
Literature
[1] H. Dutta, F. Hannig, M. Schmid, and J. Keinert. Modeling and Synthesis of Communication Subsystems for Loop Accelerator Pipelines. In Proceedings of the 21st IEEE International Conference on Application-specific Systems, Architectures, and Processors (ASAP), pp. 125-132, Rennes, France, July 7-9, 2010.
[2] J. Keinert, H. Dutta, F. Hannig, C. Haubelt, and J. Teich. Model-Based Synthesis and Optimization of Static Multi-Rate Image Processing Algorithms. In Proceedings of Design, Automation and Test in Europe (DATE), pp. 135-140, Nice, France, April 20-24, 2009.
[3] J. Keinert, C. Haubelt, and J. Teich. Synthesis of Multi-Dimensional High-Speed FIFOs for Out-of-Order Communication. In Proceedings of the International Conference on Architecture of Computing Systems (ARCS), pp. 130-143, Dresden, Germany, February 25-28, 2008.
[4] F. Hannig, H. Ruckdeschel, H. Dutta, and J. Teich. PARO: Synthesis of Hardware Accelerators for Multi-Dimensional Dataflow-Intensive Applications. In Proceedings of the Fourth International Workshop on Applied Reconfigurable Computing (ARC), Lecture Notes in Computer Science (LNCS), pp. 287-293, Springer, London, United Kingdom, March 26-28, 2008.
[5] H. Dutta, F. Hannig, H. Ruckdeschel, and J. Teich. Efficient Control Generation for Mapping Nested Loop Programs onto Processor Arrays. In Journal of Systems Architecture, 53(5-6):300-309, 2007.
[6] F. Hannig, H. Dutta, and J. Teich. Mapping a Class of Dependence Algorithms to Coarse-grained Reconfigurable Arrays: Architectural Parameters and Methodology. In International Journal of Embedded Systems, Vol. 2, Nos. 1/2, pp. 114-127, 2006.
-
Questions?
Communication Synthesis of Loop Accelerator Pipelines
Frank Hannig
Hardware/Software Co-Design, Department of Computer Science
University of Erlangen-Nuremberg
Am Weichselgarten 3, Erlangen, Germany
Phone: +49 9131 85-25153
Fax: +49 9131 85-25149
URL: http://www12.cs.fau.de/
Acknowledgements
Hritam Dutta, Joachim Keinert, Moritz Schmid, Jürgen Teich
This work was partially supported by the German Research Foundation (DFG) under contract TE 163/3-2.