
Communication Synthesis of Loop Accelerator Pipelines

Frank Hannig
University of Erlangen-Nuremberg

CASA 2010, Scottsdale, USA, October 2010



Introduction

• Communicating loops

• Pipeline of hardware accelerators

Problem: Accelerator and Communication Synthesis!



Outline

• Introduction

• Basics of loop accelerator design

• Problem definition

• Solution approach

• Results

• Conclusion



Polyhedral model: Accelerator synthesis

• Loop nest: Iteration Space + Reduced Dependence Graph

• Processor allocation (Q) and Scheduling (Loop Matrix, L)

for (i: 1 to 8)
  for (j: 1 to 8)
    A[i,j] = func(...);

Mapping (Q,L):

parfor (i: 1 to 8)
  for (j: 1 to 8)
    A[i,j] = func(...);
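The effect of the mapping can be illustrated with a minimal Python sketch. The values below are assumptions chosen to match the parfor example, not values given on the slide: a hypothetical allocation Q = (1 0) and a hypothetical schedule vector (0 1) standing in for the loop matrix L. Under these assumptions, iteration (i, j) runs on processor p = i at time step t = j.

```python
# Sketch only: Q and LAMBDA are assumed values, chosen to match the parfor example.
Q = (1, 0)       # allocation: processor index p = Q . (i, j) = i
LAMBDA = (0, 1)  # schedule:   time step       t = LAMBDA . (i, j) = j


def dot(a, b):
    """Inner product of two equally long integer tuples."""
    return sum(x * y for x, y in zip(a, b))


def space_time_map(n=8):
    """Map every iteration (i, j) of the n x n loop nest to (processor, time)."""
    mapping = {}
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            mapping[(i, j)] = (dot(Q, (i, j)), dot(LAMBDA, (i, j)))
    return mapping


mapping = space_time_map()
processors = {p for p, _ in mapping.values()}  # distinct processors in use
steps = {t for _, t in mapping.values()}       # distinct time steps in use
```

With these assumed values, the 64 iterations are spread over 8 processors and 8 time steps, and no two iterations share the same (processor, time) pair.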



Tiling

• Tiling determines the granularity of parallelism
• LSGP (local sequential, global parallel) partitioning

[Figure: iteration space with dependencies; processor array]

Partitioning matrix: $P = \begin{pmatrix} 4 & 0 \\ 0 & 4 \end{pmatrix}$
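As a rough illustration of LSGP, the following Python sketch assumes an 8 x 8 iteration space with indices 0 to 7 and the tile sizes 4 and 4 from the partitioning matrix. The function name `lsgp` is hypothetical. Each tile is assigned to one processor, which executes the iterations inside its tile sequentially.

```python
def lsgp(i, j, tile=(4, 4)):
    """LSGP sketch: the tile index selects the processor, the offset inside
    the tile is the position in that processor's sequential execution."""
    pe = (i // tile[0], j // tile[1])    # processor that owns iteration (i, j)
    local = (i % tile[0], j % tile[1])   # sequential position inside the tile
    return pe, local


N = 8  # assumed extent of the square iteration space
pes = {lsgp(i, j)[0] for i in range(N) for j in range(N)}
per_pe = N * N // len(pes)  # iterations executed sequentially by each processor
```

Under these assumptions the tiling yields a 2 x 2 processor array with 16 iterations per processor; larger tiles mean fewer processors and more sequential work on each.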



Tiling, cont’d

• LPGS (local parallel, global sequential) partitioning

[Figure: iteration space with dependencies; processor array]

Partitioning matrix: $P = \begin{pmatrix} 2 & 0 \\ 0 & 4 \end{pmatrix}$



Tiling, cont’d

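• Copartitioning
  – Balancing of communication cost and different levels of (local) memory
  – Note that LSGP and LPGS are special cases of copartitioning

[Figure: iteration space with dependencies, tiled into LS and GS tiles; processor array with local memory]

Partitioning matrices: $P_{LS} = \begin{pmatrix} 2 & 0 \\ 0 & 4 \end{pmatrix}$, $P_{GS} = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}$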
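The two-level tiling can be sketched in Python under one common reading of copartitioning, which is an assumption here: per dimension, an index x is decomposed as x = j1 + p_ls * (j2 + p_gs * j3), where j1 is the sequential position inside an LS tile, j2 selects the processor inside a GS tile, and j3 counts the GS tiles that are processed one after another. The function names are hypothetical.

```python
def copartition(x, p_ls, p_gs):
    """Split one index x into (LS offset, processor index, GS tile number)."""
    j1 = x % p_ls              # position inside the LS tile (sequential)
    j2 = (x // p_ls) % p_gs    # processor inside the GS tile (parallel)
    j3 = x // (p_ls * p_gs)    # GS tile number (sequential)
    return j1, j2, j3


def copartition_2d(i, j, p_ls=(2, 4), p_gs=(2, 1)):
    """Apply the decomposition per dimension, using the tile sizes 2, 4 (LS)
    and 2, 1 (GS) from the partitioning matrices of this slide."""
    return copartition(i, p_ls[0], p_gs[0]), copartition(j, p_ls[1], p_gs[1])


# Processors used on an assumed 8 x 8 iteration space (indices 0..7).
pes = {(copartition_2d(i, j)[0][1], copartition_2d(i, j)[1][1])
       for i in range(8) for j in range(8)}
```

In this reading the special cases fall out directly: an LS tile size of 1 leaves no local sequential part (LPGS), and a GS tile that covers the whole dimension leaves only one global tile (LSGP).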



Notation

• Assumption: Rectangular iteration spaces and tiles

• Rectangular iteration space
  – is written as

• Rectangular tiles imply that tiling matrices are diagonal matrices
  – i.e., $P_1 = \operatorname{diag}(a_1)$, $P_2 = \operatorname{diag}(a_2)$

• Succinct representation of copartitioned iteration space (LS tiles: $P_1$, GS tiles: $P_2$)



Polyhedral model: Communicating loops

• Loop graph for representing communicating loops
• Mapped loop graph: Loop graph + mapping (Q,L)
• Node
  – Iteration space
  – Reduced dependence graph
• Edge
  – Iteration space of transported variable
  – Dependency between the iteration spaces
  – Processor allocation and read/write scanning order
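One way to picture the loop graph is as a small data structure. The Python sketch below is illustrative only: the class and field names are hypothetical and do not reflect how PARO represents loop graphs internally. Nodes carry an iteration space and a reduced dependence graph, edges carry the transported array and the scanning orders, and a graph counts as mapped once every node has a (Q, L) mapping.

```python
from dataclasses import dataclass, field


@dataclass
class LoopNode:
    """One loop nest: iteration space plus reduced dependence graph (RDG)."""
    name: str
    iteration_space: tuple                    # extents, e.g. (8, 8)
    rdg: list = field(default_factory=list)   # (from_var, to_var, distance) triples
    mapping: tuple = None                     # (Q, L) once the node is mapped


@dataclass
class LoopEdge:
    """Transport of one array variable between two loop nests."""
    source: LoopNode
    sink: LoopNode
    variable: str
    data_space: tuple                         # extents of the transported array
    write_order: tuple = ()                   # source scanning order, outermost first
    read_order: tuple = ()                    # sink scanning order, outermost first

    def in_order(self):
        """True when the sink reads in the same order the source writes."""
        return self.write_order == self.read_order


@dataclass
class LoopGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def is_mapped(self):
        """A mapped loop graph has a (Q, L) mapping on every node."""
        return all(n.mapping is not None for n in self.nodes)


# Placeholder strings stand in for real Q and L matrices.
producer = LoopNode("producer", (8, 8), mapping=("Q_src", "L_src"))
consumer = LoopNode("consumer", (8, 8))
edge = LoopEdge(producer, consumer, "A", (8, 8),
                write_order=("i", "j"), read_order=("j", "i"))
graph = LoopGraph([producer, consumer], [edge])
```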



Example

• Iteration space and corresponding processor array

[Figure: two processor arrays, one with PE0 to PE3 and one with PE00, PE10, PE01, PE11]



Problem definition

• Communication synthesis of a subsystem for the transport of a multi-dimensional array

• Features of a custom communication subsystem
  – Transportation of multi-dimensional arrays
  – Support of out-of-order communication
  – High throughput by supporting parallel access

[Figure: the processor arrays PE0 to PE3 and PE00 to PE11, connected by a communication subsystem that is still unknown ("???")]



Communication synthesis

• WSDF (windowed synchronous data flow) models the transport of a multi-dimensional array [3,2]
  – Producer and consumer token
  – Virtual token (e.g., image or matrix)
  – Read and write communication order

• Complex communication patterns (in order, out-of-order)

• Custom memory architecture with FIFO-like behavior can be generated
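The difference between in-order and out-of-order communication can be made concrete with a simplified token-level model. This Python sketch is not the WSDF analysis of [2,3]; it assumes a producer that writes one element per step and a consumer that reads as soon as its next element is available, and it reports the peak number of elements that must be buffered. The function name `peak_buffer` is hypothetical.

```python
def peak_buffer(write_order, read_order):
    """Peak number of buffered elements for given write and read orders."""
    stored = set()
    next_read = 0
    peak = 0
    for token in write_order:
        stored.add(token)                 # producer writes one element
        peak = max(peak, len(stored))
        # Consumer drains every element that is now readable in its own order.
        while next_read < len(read_order) and read_order[next_read] in stored:
            stored.remove(read_order[next_read])
            next_read += 1
    return peak


ROWS, COLS = 4, 4  # assumed array size for illustration
row_major = [(r, c) for r in range(ROWS) for c in range(COLS)]
col_major = [(r, c) for c in range(COLS) for r in range(ROWS)]

in_order_peak = peak_buffer(row_major, row_major)      # same order on both sides
out_of_order_peak = peak_buffer(row_major, col_major)  # reader scans by column
```

In this model the in-order case needs a single slot, while the column-major reader of a row-major writer needs 10 of the 16 elements buffered, which hints at why out-of-order communication calls for a larger memory than a classical FIFO.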



Method of solution

• Polyhedral to WSDF
  – Project mapped loop graph to WSDF model
  – I.e., given the mapped loop graph, find the WSDF edge notation

• WSDF to MD-FIFO
  – Synthesize custom memory architecture called multidimensional FIFO from the given WSDF edge parameters

• All communicating loop nests belonging to our class of algorithms can be converted to the WSDF model



Polyhedral to WSDF

• Source data space ; Sink data space
• Simple case: Continuous tokens
• Inner loop parallelization (LPGS) or sequential execution for both source and sink loop
  – i.e., if
  – Producer and consumer tokens are given by the number of I/O processors

• Virtual token vector refers to common multidimensional array, which is tiled differently

• The read and write order is derived from the loop matrix



Example

• Loop graph

• Inner loop parallelization (LPGS) for source and sink loop
  – I.e., copartitioning(1,6) and copartitioning(1,4)
  – Source and sink iteration space are

• WSDF edge notation
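The token sizes in this example can be related by a small rate calculation. The sketch below assumes a one-dimensional reading in which copartitioning(1,6) means 6 I/O processors on the source side and copartitioning(1,4) means 4 on the sink side, so that the producer token holds 6 elements and the consumer token 4. The function name is hypothetical, and the calculation is the usual synchronous data flow balance condition rather than anything specific to these slides.

```python
from math import gcd


def balanced_firings(produced, consumed):
    """Smallest firing counts (source, sink) that move the same amount of data."""
    common = produced * consumed // gcd(produced, consumed)  # least common multiple
    return common // produced, common // consumed


# Assumed token sizes: 6 elements per source firing, 4 per sink firing.
src_firings, snk_firings = balanced_firings(6, 4)
```

Under these assumptions, two source firings supply exactly the data consumed by three sink firings (12 elements).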



Polyhedral to WSDF

• Source data space ; Sink data space
• Complex case: Non-continuous tokens
  – Outer loop parallelization, i.e., LSGP or copartitioning
• Construction: a copy node is introduced for reordering the data array for supporting parallel access
  – Embedding into a new data space, where the tokens which are produced and consumed are continuous
  – Parallel access required:



Example

• Loop graph

• Mapping: Outer loop parallelization for source and sink loop
  – i.e., loops undergo copartitioning(6,2) and copartitioning(4,3)
  – Source and sink data space are

• Arrows show non-continuous data required for parallel access
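Why the tokens become non-continuous can be seen in a one-dimensional Python sketch. It assumes that copartitioning(6,2) means 2 processors that each walk through 6 consecutive elements, and that copartitioning(4,3) means 3 processors with 4 elements each; the function names and the reordering scheme are illustrative assumptions, not the construction used in the talk.

```python
def parallel_accesses(p_ls, p_gs):
    """Indices touched in parallel at each time step of one GS tile.
    Processor p owns the p_ls consecutive elements starting at p * p_ls."""
    return [[p * p_ls + t for p in range(p_gs)] for t in range(p_ls)]


def copy_order(p_ls, p_gs):
    """Order in which a reordering (copy) stage could lay out the data so
    that every parallel access becomes one contiguous token."""
    return [x for step in parallel_accesses(p_ls, p_gs) for x in step]


source_steps = parallel_accesses(6, 2)  # source: 2 processors, 6 elements each
sink_steps = parallel_accesses(4, 3)    # sink:   3 processors, 4 elements each
sink_layout = copy_order(4, 3)
```

In the first step the source touches elements 0 and 6, and the sink touches 0, 4, and 8. These strided accesses are not neighbours in the original array, which is the situation a reordering step in between is meant to resolve.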



Example, cont’d

• WSDF edge notation
  – Copy actor ensures parallel access of data tokens

[Figure: WSDF graph of source, copy actor, and sink with its edge annotations]



WSDF to multidimensional FIFO

• Multidimensional FIFO

• Number of memory banks:



Multidimensional FIFO

• Address Generation [3]
  – Linearization in production order
  – Sink address generation using address increments

• Fill level control
  – Updates available number of data

[Figure: fill level control. Blocks: src fill level control, snk fill level control, src iter, snk iter, ∆src calc, ∆snk calc, avail counter (source and sink side). Signals: full, wr_count, wr_ena, empty, rd_count, rd_ena, ∆src, ∆snk]
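The address generation idea can be sketched in Python for a small case. The sketch assumes a row-major producer, a column-major consumer, and a buffer large enough to hold the whole array; the multidimensional FIFO of [3] reuses memory and is more involved. Elements are stored at addresses in production order, and the sink walks through them by adding increments instead of recomputing each address. The function names are hypothetical.

```python
def sink_addresses(rows, cols):
    """Addresses read by a column-major sink when a row-major source stored
    element (r, c) at linear address r * cols + c (production order)."""
    return [r * cols + c for c in range(cols) for r in range(rows)]


def address_increments(addresses):
    """Difference between consecutive sink addresses."""
    return [b - a for a, b in zip(addresses, addresses[1:])]


addrs = sink_addresses(2, 3)     # assumed 2 x 3 array
incs = address_increments(addrs)
```

For the 2 x 3 case only two distinct increments occur: +3 to step down a column and -2 to jump to the top of the next column. A sink address generator could therefore be built from a small set of constants selected by the sink iteration counters.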



Results

• Communication Overhead:

• In-order communication leads to classical FIFOs with less overhead

• Out-of-order communication leads to large buffers and complex logic for address generation, hence a large overhead

• The multidimensional FIFOs are not a throughput bottleneck



Conclusions and outlook

• Novel bridge from the polyhedral model to synchronous data flow, clear representation of:
  – Communication semantics
  – Automated synthesis of a dedicated communication subsystem

• A logical next step is the exploration of different partitionings and scheduling orders in order to optimize the overall system of communicating loops

[Figure: PARO HLS tool flow. Algorithm (PAULA); High-Level Transformations (localization, loop perfectization, output normal form, loop unrolling, partitioning, expression splitting, affine transformations, ...); Space-Time Mapping (allocation, scheduling, resource binding) with architecture model; Hardware Synthesis (processor element, controller, processor array, I/O interface); HDL Generation; Hardware Description (VHDL); test bench generation; simulation; FPGA]



Literature

[1] H. Dutta, F. Hannig, M. Schmid, and J. Keinert. Modeling and Synthesis of Communication Subsystems for Loop Accelerator Pipelines. In Proceedings of the 21st IEEE International Conference on Application-specific Systems, Architectures, and Processors (ASAP), pp. 125-132, Rennes, France, July 7-9, 2010.

[2] J. Keinert, H. Dutta, F. Hannig, C. Haubelt, and J. Teich. Model-Based Synthesis and Optimization of Static Multi-Rate Image Processing Algorithms. In Proceedings of Design, Automation and Test in Europe (DATE), pp. 135-140, Nice, France, April 20-24, 2009.

[3] J. Keinert, C. Haubelt, and J. Teich. Synthesis of Multi-Dimensional High-Speed FIFOs for Out-of-Order Communication. In Proceedings of the International Conference on Architecture of Computing Systems (ARCS), pp. 130-143, Dresden, Germany, February 25-28, 2008.

[4] F. Hannig, H. Ruckdeschel, H. Dutta, and J. Teich. PARO: Synthesis of Hardware Accelerators for Multi-Dimensional Dataflow-Intensive Applications. In Proceedings of the Fourth International Workshop on Applied Reconfigurable Computing (ARC), Lecture Notes in Computer Science (LNCS), pp. 287-293, Springer, London, United Kingdom, March 26-28, 2008.

[5] H. Dutta, F. Hannig, H. Ruckdeschel, and J. Teich. Efficient Control Generation for Mapping Nested Loop Programs onto Processor Arrays. In Journal of Systems Architecture, 53(5-6):300-309, 2007.

[6] F. Hannig, H. Dutta, and J. Teich. Mapping a Class of Dependence Algorithms to Coarse-grained Reconfigurable Arrays: Architectural Parameters and Methodology. In International Journal of Embedded Systems, Vol. 2, Nos. 1/2, pp. 114-127, 2006.



Questions?

Communication Synthesis of Loop Accelerator Pipelines

Frank Hannig
Hardware/Software Co-Design
Department of Computer Science
University of Erlangen-Nuremberg
Am Weichselgarten 3
Erlangen, Germany
Phone: +49 9131 85-25153
Fax: +49 9131 85-25149
Email: [email protected]
URL: http://www12.cs.fau.de/

Acknowledgements
Hritam Dutta, Joachim Keinert, Moritz Schmid, Jürgen Teich

This work was partially supported by the German Research Foundation (DFG) under contract TE 163/3-2.
