Communication Synthesis of Loop Accelerator Pipelines
Frank Hannig, University of Erlangen-Nuremberg
CASA 2010, Scottsdale, USA, October 2010



  • Introduction

    • Communicating loops

    • Pipeline of hardware accelerators

    Problem: Accelerator and communication synthesis!

  • Outline

    • Introduction

    • Basics of loop accelerator design

    • Problem definition

    • Solution approach

    • Results

    • Conclusion

  • Polyhedral model: Accelerator synthesis

    • Loop nest: iteration space + reduced dependence graph

    • Processor allocation (Q) and scheduling (loop matrix, L)

    for (i: 1 to 8)
      for (j: 1 to 8)
        A[i,j] = func(...);

    Mapping (Q, L):

    parfor (i: 1 to 8)
      for (j: 1 to 8)
        A[i,j] = func(...);
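
The effect of the mapping on the 8x8 loop nest can be illustrated with a small sketch. It is only an illustration under stated assumptions: the allocation vector `Q = (1, 0)` and the linear schedule vector `SCHEDULE = (0, 1)` are hypothetical choices that reproduce the `parfor` over `i`, and a plain schedule vector stands in for the loop matrix L.

```python
from itertools import product

# Hypothetical mapping for the 8x8 example (not taken from the slides):
# Q projects iteration (i, j) onto processor p = i, and the schedule
# t = j runs the inner loop sequentially on that processor.
Q = (1, 0)
SCHEDULE = (0, 1)


def space_time_map(i, j):
    """Return (processor, time step) for iteration (i, j)."""
    p = Q[0] * i + Q[1] * j
    t = SCHEDULE[0] * i + SCHEDULE[1] * j
    return p, t


iterations = list(product(range(1, 9), repeat=2))
mapped = {it: space_time_map(*it) for it in iterations}

# A valid mapping never places two iterations on one processor at the same time.
assert len(set(mapped.values())) == len(iterations)

processors = sorted({p for p, _ in mapped.values()})
print(len(processors), "processors")
```

Each value of `i` ends up on its own processor, which is what the `parfor` in the mapped loop nest expresses.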

  • Tiling

    • Tiling determines the granularity of parallelism

    • LSGP partitioning (local sequential, global parallel)

    [Figure: iteration space with dependencies; processor array]

    Partitioning matrix: $P = \begin{pmatrix} 4 & 0 \\ 0 & 4 \end{pmatrix}$
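
A minimal sketch of LSGP tiling with the 4x4 partitioning matrix follows. The 8x8 iteration space is an assumption borrowed from the earlier loop example, and the function name `lsgp` is hypothetical: each tile is assigned to one processor, and the iterations inside a tile run sequentially on it.

```python
P = (4, 4)      # diagonal of the partitioning matrix
SPACE = (8, 8)  # assumed iteration-space size (from the 8x8 loop example)


def lsgp(i, j):
    """Map a 0-based iteration (i, j) to (processor, local iteration)."""
    proc = (i // P[0], j // P[1])   # which tile, hence which processor
    local = (i % P[0], j % P[1])    # position inside the tile, executed sequentially
    return proc, local


# Number of processors per dimension (ceiling division of extent by tile size).
array_shape = tuple(-(-n // p) for n, p in zip(SPACE, P))
print(array_shape, lsgp(5, 2))
```

Under these assumptions the 64 iterations are spread over a 2x2 processor array with 16 sequential iterations per processor.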

  • Tiling, cont’d

    • LPGS partitioning (local parallel, global sequential)

    [Figure: iteration space with dependencies; processor array]

    Partitioning matrix: $P = \begin{pmatrix} 2 & 0 \\ 0 & 4 \end{pmatrix}$

  • Tiling, cont’d

    • Copartitioning
      – Balancing of communication cost and different levels of (local) memory
      – Note that LSGP and LPGS are special cases of copartitioning

    [Figure: iteration space with dependencies, showing LS tiles, GS tiles, and local memory; processor array]

    Partitioning matrices: $P_{LS} = \begin{pmatrix} 2 & 0 \\ 0 & 4 \end{pmatrix}$, $P_{GS} = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}$
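
How copartitioning splits an iteration vector can be sketched as follows. This assumes the usual two-level reading: the LS tiling fixes the sequential work per processor, the GS tiling groups LS tiles that run in parallel, and GS tiles are processed one after another. The function name `copartition` and the sample indices are hypothetical; the diagonals (2, 4) and (2, 1) are those of the LS and GS matrices on the slide.

```python
def copartition(index, p_ls, p_gs):
    """Decompose a 0-based iteration vector under copartitioning.

    p_ls, p_gs: diagonals of the LS and GS tiling matrices.
    Returns (global_tile, processor, local) index vectors.
    """
    g, p, l = [], [], []
    for i, a_ls, a_gs in zip(index, p_ls, p_gs):
        l.append(i % a_ls)             # position inside the LS tile (sequential)
        p.append((i // a_ls) % a_gs)   # LS tile inside the GS tile (parallel)
        g.append(i // (a_ls * a_gs))   # GS tile (sequential)
    return tuple(g), tuple(p), tuple(l)


# Copartitioning with the matrices from the slide.
print(copartition((5, 6), (2, 4), (2, 1)))

# Special cases on an assumed 8x8 space:
# LSGP: the GS tile covers the whole space, so the global tile is always 0.
print(copartition((5, 2), (4, 4), (2, 2)))
# LPGS: LS tiles of size 1x1, so the local index is always 0.
print(copartition((5, 2), (1, 1), (2, 4)))
```

The two special cases show why LSGP and LPGS need no separate treatment once copartitioning is supported.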

  • Notation

    • Assumption: Rectangular iteration spaces and tiles

    • Rectangular iteration space
      – is written as […]

    • Rectangular tiles imply that tiling matrices are diagonal matrices
      – i.e., $P_1 = \operatorname{diag}(a_1)$, $P_2 = \operatorname{diag}(a_2)$

    • Succinct representation of copartitioned iteration space

    [Figure: copartitioned iteration space with LS and GS tiles]

  • Polyhedral model: Communicating loops

    • Loop graph for representing communicating loops

    • Mapped loop graph: loop graph + mapping (Q, L)

    • Node
      – Iteration space
      – Reduced dependence graph

    • Edge
      – Iteration space of transported variable
      – Dependency between the iteration spaces
      – Processor allocation and read/write scanning order
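
One possible in-memory form of a (mapped) loop graph is sketched below. All class names, field names, and example values are hypothetical; the sketch only mirrors the node and edge annotations listed above and is not the representation used in the PARO tool.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LoopNode:
    name: str
    iteration_space: tuple                   # rectangular extents, e.g. (12, 7)
    rdg: dict = field(default_factory=dict)  # reduced dependence graph: variable -> dependencies
    mapping: Optional[tuple] = None          # (Q, L) once the node is mapped


@dataclass
class LoopEdge:
    src: str
    snk: str
    variable: str       # transported multi-dimensional array
    data_space: tuple   # iteration space of the transported variable
    write_order: tuple  # scanning order of the producer, outermost dimension first
    read_order: tuple   # scanning order of the consumer, outermost dimension first


@dataclass
class LoopGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def is_mapped(self):
        """A mapped loop graph carries a mapping (Q, L) on every node."""
        return all(n.mapping is not None for n in self.nodes.values())


graph = LoopGraph()
graph.nodes["producer"] = LoopNode("producer", (12, 7))
graph.nodes["consumer"] = LoopNode("consumer", (12, 7))
graph.edges.append(LoopEdge("producer", "consumer", "A", (12, 7), (0, 1), (1, 0)))
print(graph.is_mapped())
```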

  • Example

    • Iteration space and corresponding processor array

    [Figure: two processor arrays, one with PE0, PE1, PE2, PE3 and one with PE00, PE10, PE01, PE11]

  • Problem definition

    • Communication synthesis of a subsystem for the transport of a multi-dimensional array

    • Features of a custom communication subsystem
      – Transportation of multi-dimensional arrays
      – Support of out-of-order communication
      – High throughput by supporting parallel access

    [Figure: unknown communication subsystem ("???") between the processor array PE0, PE1, PE2, PE3 and the processor array PE00, PE10, PE01, PE11]

  • Communication synthesis

    • WSDF (windowed synchronous data flow) models the transport of a multi-dimensional array [3,2]
      – Producer and consumer token
      – Virtual token (e.g., image or matrix)
      – Read and write communication order

    • Complex communication patterns (in order, out-of-order)

    • Custom memory architecture with FIFO-like behavior can be generated

  • Method of solution

    • Polyhedral to WSDF
      – Project the mapped loop graph to the WSDF model
      – I.e., given a mapped loop graph, find the WSDF edge notation

    • WSDF to MD-FIFO
      – Synthesize a custom memory architecture, called multidimensional FIFO (MD-FIFO), from the given WSDF edge parameters

    • All communicating loop nests belonging to our class of algorithms can be converted to the WSDF model

  • Polyhedral to WSDF

    • Source data space […]; sink data space […]

    • Simple case: Continuous tokens

    • Inner loop parallelization (LPGS) or sequential execution for both source and sink loop
      – i.e., if […]
      – Producer and consumer tokens are given by the number of I/O processors

    • Virtual token vector refers to the common multidimensional array, which is tiled differently

    • The read and write order is derived from the loop matrix

  • Example

    • Loop graph

    • Inner loop parallelization (LPGS) for source and sink loop
      – I.e., copartitioning(1,6) and copartitioning(1,4)
      – Source and sink iteration space are […]

    • WSDF edge notation

  • Polyhedral to WSDF

    • Source data space […]; sink data space […]

    • Complex case: Non-continuous tokens
      – Outer loop parallelization, i.e., LSGP or copartitioning

    • Construction: a copy node is introduced for reordering the data array for supporting parallel access
      – Embedding into a new data space, where the tokens which are produced and consumed are continuous
      – Parallel access required: […]

  • Example

    • Loop graph

    • Mapping: Outer loop parallelization for source and sink loop
      – I.e., loops undergo copartitioning(6,2) and copartitioning(4,3)
      – Source and sink data space are […]

    • Arrows show non-continuous data required for parallel access
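
The difference between continuous and non-continuous tokens can be checked with a one-dimensional sketch. Several things here are assumptions rather than facts from the slides: the extent of 12 (chosen only because 6·2 = 4·3 = 12), the reading of copartitioning(a, b) as LS tile size a and b parallel processors, the scanning order, and the function names.

```python
def parallel_accesses(p_ls, p_gs, extent):
    """List, per time step, the 1-D indices accessed in parallel.

    Assumed scanning order: global tiles outermost, local offset innermost.
    """
    tile = p_ls * p_gs
    assert extent % tile == 0, "extent must be a multiple of the tile size"
    steps = []
    for g in range(extent // tile):
        for l in range(p_ls):
            # processor p works on index g*tile + p*p_ls + l in this step
            steps.append([g * tile + p * p_ls + l for p in range(p_gs)])
    return steps


def is_contiguous(indices):
    return all(b - a == 1 for a, b in zip(indices, indices[1:]))


EXTENT = 12  # assumed extent of the parallelized dimension

for name, (p_ls, p_gs) in {"LPGS source (1,6)": (1, 6),
                           "LPGS sink (1,4)": (1, 4),
                           "outer-loop source (6,2)": (6, 2),
                           "outer-loop sink (4,3)": (4, 3)}.items():
    first = parallel_accesses(p_ls, p_gs, EXTENT)[0]
    print(name, first, "contiguous" if is_contiguous(first) else "non-contiguous")
```

With inner loop parallelization every step touches a block of neighbouring elements, whereas outer loop parallelization touches elements that lie one LS tile apart, which is the situation the copy node is meant to repair.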

  • Example, cont’d

    • WSDF edge notation
      – Copy actor ensures parallel access of data tokens

    [Figure: WSDF edge notation with copy actor]

  • WSDF to multidimensional FIFO

    • Multidimensional FIFO

    • Number of memory banks:

  • Multidimensional FIFO

    • Address generation [3]
      – Linearization in production order
      – Sink address generation using address increments

    • Fill level control
      – Updates available number of data

    [Figure: source and sink fill level control; signals src iter, snk iter, ∆src calc, ∆snk calc, ∆src, ∆snk, avail counter (one per side), wr_count, rd_count, full, empty, wr_ena, rd_ena]
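
The two ideas on this slide, linearization in production order and sink address generation by increments, can be illustrated with a deliberately simplified software model. It is not the architecture from [3]: the 3x4 array size, the row-major writer, the column-major reader, and a buffer large enough for the whole array (so the `full` flag never triggers) are all assumptions made for this sketch.

```python
ROWS, COLS = 3, 4  # hypothetical size of the transported array


def src_address(r, c):
    """Linearization in production order (the source writes row by row)."""
    return r * COLS + c


def sink_addresses():
    """Addresses visited by a sink that reads column by column (out of order)."""
    return [src_address(r, c) for c in range(COLS) for r in range(ROWS)]


def address_increments(addrs):
    """Increments the sink address generator has to add between reads."""
    return [b - a for a, b in zip(addrs, addrs[1:])]


def simulate():
    """Interleave writes and reads; a read waits until its element is written."""
    addrs = sink_addresses()
    memory = [None] * (ROWS * COLS)
    wr_count, rd_count, out = 0, 0, []
    while rd_count < len(addrs):
        if wr_count < len(memory):      # write enable: source still has data
            memory[wr_count] = wr_count
            wr_count += 1
        if addrs[rd_count] < wr_count:  # read enable: required element is available
            out.append(memory[addrs[rd_count]])
            rd_count += 1
    return out


print(sink_addresses())
print(sorted(set(address_increments(sink_addresses()))))
print(simulate() == sink_addresses())
```

In this model only two distinct increments occur, one for stepping down a column and one for wrapping to the next column, which hints at why a small increment-based generator can replace a full address computation.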

  • Results

    • Communication overhead:

    • In-order communication leads to classical FIFOs with less overhead

    • Out-of-order communication leads to large buffers and complex logic for address generation, hence a large overhead

    • The multidimensional FIFOs are not a throughput bottleneck

  • Conclusions and outlook

    • Novel bridge from the polyhedral model to synchronous data flow, clear representation of:
      – Communication semantics
      – Automated synthesis of a dedicated communication subsystem

    • A logical next step is the exploration of different partitionings and scheduling orders in order to optimize the overall system of communicating loops

    [Figure: PARO HLS tool flow: algorithm (PAULA); high-level transformations (localization, loop perfectization, output normal form, loop unrolling, partitioning, expression splitting, affine transformations, ...); space-time mapping (allocation, scheduling, resource binding) with architecture model; hardware synthesis (processor element, controller, processor array, I/O interface); HDL generation; hardware description (VHDL); test bench generation; simulation; FPGA]

  • Literature

    [1] H. Dutta, F. Hannig, M. Schmid, and J. Keinert. Modeling and Synthesis of Communication Subsystems for Loop Accelerator Pipelines. In Proceedings of the 21st IEEE International Conference on Application-specific Systems, Architectures, and Processors (ASAP), pp. 125-132, Rennes, France, July 7-9, 2010.

    [2] J. Keinert, H. Dutta, F. Hannig, C. Haubelt, and J. Teich. Model-Based Synthesis and Optimization of Static Multi-Rate Image Processing Algorithms. In Proceedings of Design, Automation and Test in Europe (DATE), pp. 135-140, Nice, France, April 20-24, 2009.

    [3] J. Keinert, C. Haubelt, and J. Teich. Synthesis of Multi-Dimensional High-Speed FIFOs for Out-of-Order Communication. In Proceedings of the International Conference on Architecture of Computing Systems (ARCS), pp. 130-143, Dresden, Germany, February 25-28, 2008.

    [4] F. Hannig, H. Ruckdeschel, H. Dutta, and J. Teich. PARO: Synthesis of Hardware Accelerators for Multi-Dimensional Dataflow-Intensive Applications. In Proceedings of the Fourth International Workshop on Applied Reconfigurable Computing (ARC), Lecture Notes in Computer Science (LNCS), pp. 287-293, Springer, London, United Kingdom, March 26-28, 2008.

    [5] H. Dutta, F. Hannig, H. Ruckdeschel, and J. Teich. Efficient Control Generation for Mapping Nested Loop Programs onto Processor Arrays. In Journal of Systems Architecture, 53(5-6):300-309, 2007.

    [6] F. Hannig, H. Dutta, and J. Teich. Mapping a Class of Dependence Algorithms to Coarse-grained Reconfigurable Arrays: Architectural Parameters and Methodology. In International Journal of Embedded Systems, Vol. 2, Nos. 1/2, pp. 114-127, 2006.

  • Questions?

    Communication Synthesis of Loop Accelerator Pipelines

    Frank Hannig
    Hardware/Software Co-Design, Department of Computer Science
    University of Erlangen-Nuremberg
    Am Weichselgarten 3, Erlangen, Germany
    Phone: +49 9131 85-25153
    Fax: +49 9131 85-25149
    URL: http://www12.cs.fau.de/

    Acknowledgements: Hritam Dutta, Joachim Keinert, Moritz Schmid, Jürgen Teich

    This work was partially supported by the German Research Foundation (DFG) under contract TE 163/3-2.
