
INSTITUT FÜR INFORMATIK 12
Hardware-Software-Co-Design

Friedrich-Alexander-Universität Erlangen-Nürnberg

Prof. Dr.-Ing. Jürgen Teich

INSTITUT FÜR INFORMATIK 10
System Simulation

Friedrich-Alexander-Universität Erlangen-Nürnberg

Prof. Dr. Ulrich Rüde

MASTER THESIS

Mapping of Hierarchically Partitioned Regular Algorithms onto Processor Arrays

Student: Hritam Dutta, M.Sc.

Supervisors: Dipl.-Ing. Frank Hannig
Prof. Dr.-Ing. Jürgen Teich
Prof. Dr. Ulrich Rüde

Starting date: Apr. 26, 2004
Ending date: Oct. 26, 2004


Declaration

I affirm that I have produced this work without outside help and without the use of sources other than those indicated, that the work has not been submitted in the same or a similar form to any other examination authority, and that it has not been accepted by such an authority as part of an examination. All passages that were adopted verbatim or in substance are marked as such.

Erlangen, October 25, 2004

Hritam Dutta


Acknowledgments

My sincere thanks to my guide Frank Hannig, who trusted me with a challenging topic for research and guided my work. I would also like to acknowledge Prof. Dr.-Ing. Teich for pointing out the open problem in control methodology, which now forms an important part of the work. The patient help, friendly advice, and rock and metal support given by Mateusz, Dirk, Christophe, Andre, and Dmitrij helped me go a long way. My heartfelt thanks to all employees and students at Hardware-Software-Co-Design for providing a wonderful atmosphere. Special thanks to my mentor Prof. Dr. Rüde for encouraging me to pursue my interests throughout my stay in Erlangen. I am also grateful to Siemens for sponsoring my stay in Germany. My parents' love, support, and motivation cannot be acknowledged in a few words. Last of all, I cannot forget my girlfriend Katja for her love and understanding, and my friends Lutz, Mukul, and mates from "Comunio Bundesliga" for providing me a wonderful life outside the academic world in Erlangen.


Abstract

In the last decade, there has been a dramatic growth in the research and development of massively parallel processor arrays, both in academia and in industry. Processor array architectures provide an optimal platform for the parallel execution of number-crunching loop programs from the fields of digital signal processing, image processing, linear algebra, etc. However, due to the lack of mapping tools, these massively parallel processor architectures are not able to realize their full potential. The polytope model is an intuitive methodology for loop parallelization. Partitioning is an important transformation in this model for matching the regular algorithm to constraints on hardware resources. Control generation for such partitioned algorithms has lacked a complete methodology. The major contributions of this thesis are: a) transformations are introduced to handle localization for co-partitioning; b) a complete new methodology for efficient control generation for different partitioning techniques (LSGP, LPGS, co-partitioning) and parallelepiped tiles is presented; c) exact formulas for the estimation of local memory size, and upper bounds on the memory required for inter-processor communication and for communication with external memory, are given; d) different address generation schemes for piecewise linear algorithms are discussed.


Kurzfassung

Both in academia and in industry, the last ten years have seen strong growth in the research and development of massively parallel processor arrays. These architectures provide an optimal platform for the parallel processing of a range of computationally intensive loop programs, for example from the fields of digital signal processing, image processing, and linear algebra. Due to the lack of design tools, however, the full potential of these parallel processor architectures can usually not be exploited. An intuitive methodology for loop parallelization is the so-called polytope model. In this model, the partitioning of programs is a central transformation for meeting given constraints on hardware resources. Until now, a complete methodology for mapping such partitioned algorithms onto processor arrays has been missing. This thesis therefore makes the following contributions: a) transformations are introduced to enable the localization of data for co-partitioning; b) a new methodology for efficient control generation for different partitioning techniques (LSGP, LPGS, co-partitioning) and tiles in the shape of arbitrary parallelepipeds is explained; c) exact formulas for the estimation of the local memory requirement, upper bounds on the size of buffers for inter-processor communication, and the communication costs to external memory are given; and d) different schemes for the address generation of piecewise linear algorithms are discussed.


Contents

1. Introduction . . . 1
   1.1 Goal of the Thesis . . . 3
   1.2 Structure of the Thesis . . . 3

2. Definition and Method . . . 7
   2.1 Program Notations . . . 8
   2.2 Methodology . . . 12

3. Partitioning . . . 17
   3.1 Partitioning Techniques . . . 18
       3.1.1 Multiprojection . . . 18
       3.1.2 Local Sequential (LS) Partitioning . . . 18
       3.1.3 Global Sequential (GS) Partitioning . . . 19
       3.1.4 Co-partitioning . . . 21
   3.2 Tiling for Co-partitioning . . . 23
   3.3 Localization . . . 26
   3.4 Embedding for Co-partitioning . . . 28
   3.5 Partial Localization . . . 33

4. Control Transformation . . . 41
   4.1 Why Control Generation? . . . 42
   4.2 Control Models . . . 45
   4.3 The Problem of Control Generation . . . 47
       4.3.1 Why is the Problem of Control Generation Intractable? . . . 51
   4.4 Methodology: Control Generation . . . 53
       4.4.1 Determination of PE Type . . . 54
       4.4.2 Determination of Scanning Code . . . 58
       4.4.3 Control Unit . . . 68
       4.4.4 Propagation . . . 72
   4.5 An Example: FIR Filter . . . 75
       4.5.1 Control Generation . . . 78

5. Memory Consumption and Address Generation . . . 83
   5.1 Effect of Co-partitioning on Memory Requirements . . . 84
       5.1.1 Estimation of Local Memory . . . 87
       5.1.2 Estimation of Memory for Inter-processor Communication . . . 89
       5.1.3 Estimation of FIFOs and Off-chip Memory . . . 91
       5.1.4 An Example: FIR Filter . . . 93
   5.2 Address Generation for Processor Arrays . . . 96

6. A Case Study: Matrix Multiplication . . . 99

7. Conclusion and Future Work . . . 105

Bibliography . . . 107

A. FIR Filter: ArchitectureComposer . . . 113

B. Synthesis Report: Matrix Multiplication . . . 115

C. Contents of CD . . . 125


List of Abbreviations

LPGS Local Parallel Global Sequential.
LSGP Local Sequential Global Parallel.
VLIW Very Long Instruction Word.
PLA Piecewise Linear Algorithm.
PRA Piecewise Regular Algorithm.
RDG Reduced Dependence Graph.
WPPA Weakly Programmable Processor Array.
HNF Hermite Normal Form.
FIR Finite Impulse Response.
IIR Infinite Impulse Response.
PE Processor Element.
PA Processor Array.
ASIC Application Specific Integrated Circuit.
DSP Digital Signal Processor.
FPGA Field Programmable Gate Array.
CDMA Code Division Multiple Access.
UMTS Universal Mobile Telecommunications System.
GSM Global System for Mobile Communications.


List of Figures

1.1 GUI of PARO . . . 5

2.1 PARO design flow . . . 8

2.2 Program and dependence graph . . . 11

2.3 a) Localized dependence graph, b) Processor array after space-time mapping . . . 13

3.1 a) Partitioned dependence graph, b) LSGP processor array . . . 20

3.2 a) Partitioned dependence graph, b) LPGS processor array . . . 21

3.3 a) Partitioned dependence graph, b) Co-partitioned processor array . . . 23

3.4 a) Dependence graph of non-localized program in example 3.3.1, b) Dependence graph of localized program in example 3.3.1, c) Dependence graph on partitioning of localized program . . . 27

3.5 a) Processor array corresponding to figure 3.4a), b) Processor array corresponding to figure 3.4b), c) Reduced processor array corresponding to figure 3.4c) . . . 28

3.6 a) Localized regular program, b) Partitioned pre-localized program . . . 33

3.7 Conventional design flow for mapping algorithms to processor arrays . . . 34

3.8 New design flow for mapping algorithms to processor arrays . . . 35

3.9 a) Dependence graph of equation x[i, j] = y[0, 0] after LSGP partitioning, b) Dependence graph after localization and partitioning, c) Processor array for a), d) Processor array for b) (taken from [TT02a]) . . . 36

3.10 a) Localization of inter-tile dependencies for LSGP scheme, b) Localization of inter-tile dependencies for LPGS scheme, c) Processor array for a), d) Processor array for b) (taken from [TT02a]) . . . 37

3.11 a) Partial localization for co-partitioning, b) Resulting processor array . . . 38


4.1 The dataflow graph of the example C-code . . . 43

4.2 PACT-XPP64 architecture . . . 45

4.3 Processing element . . . 46

4.4 a) Global model, b) Pre-stored local model, c) Propagating local model, d) Intermediate model . . . 47

4.5 Hardware interpretation of prg1 . . . 50

4.6 Hardware interpretation of prg2 . . . 52

4.7 Scheduling of example tile . . . 53

4.8 PE type classification . . . 57

4.9 Counter based control model . . . 59

4.10 a) Dependence of execution order on loop matrix, b) Transformed domain . . . 61

4.11 Pseudo scanning code for example 4.4.3 . . . 65

4.12 Description of enable mechanism . . . 66

4.13 Hardware interpretation of local control program . . . 71

4.14 Hardware interpretation of global control program . . . 73

4.15 Example 4.4.4 . . . 76

4.16 Counter for Controller . . . 79

5.1 PE array architecture for co-partitioned matrix multiplication implementation . . . 85

5.2 a) Dependence graph, b) Processor array, c) Space-time mapping, d) Memory for inter-processor communication, e) Optimal memory for inter-processor communication . . . 87

5.3 Example dependence graph . . . 92

5.4 Local memory vs tiling parameters . . . 94

5.5 PE array architecture for co-partitioned matrix multiplication implementation . . . 95

5.6 Architecture style for AGU . . . 97

6.1 Matrix multiplication algorithm, C-code . . . 99

6.2 The dataflow graph of the co-partitioned matrix multiplication 8 × 8 example . . . 101

A.1 PE array for a FIR filter . . . 113

A.2 First PE block description . . . 114

A.3 Counter for L . . . 114


B.1 Chip view of co-partitioned matrix multiplication on Xilinx Virtex XCV800 . . . 115

C.1 Directory tree of CD . . . 125


1 Introduction

The trends in lithography and process technology indicate that billion-transistor computer chips will be possible before the end of the decade. The major question that arises is: "What functionalities and organization are expected in a billion-transistor chip?" Until now, designers have concentrated on using the increased real estate to improve the performance of uniprocessor cores. The diminishing returns from adding transistors to uniprocessor designs encourage the development of multiprocessor chips with on-chip memory. The Semiconductor Industry Association Roadmap projects that, with the process technology of 2008, a 400 mm² die would easily accommodate 10 to 80 processors running at 6 GHz or higher, along with 64 MB of SRAM. Present technology enables coarse-grained arrays containing tens to hundreds of reconfigurable processing elements capable of executing 32-bit fixed-point operations. Current FPGAs also enable the cost-effective implementation of reconfigurable high-performance processor arrays. A discussion of fine-grained (FPGA), coarse-grained (e.g., PACT-XPP), and large-grained multiprocessor systems can be found in [HDT04]. The hardware trends point to more use of such massively parallel data processing architectures in custom and embedded systems.

Parallel algorithms from the fields of image processing, signal processing, numerical linear algebra, cryptography, etc. can be implemented on massively parallel processor arrays. In the mid-1980s, much research was devoted to the execution of specific computationally intensive algorithms on systolic arrays, and the journals are replete with optimized implementations of FIR filters, QR decomposition, etc. as systolic arrays on ASICs (Application Specific Integrated Circuits). The next generation of digital signal processing devices has to adapt to different protocols and standards such as UMTS, GSM, and CDMA, which contain compute-intensive algorithms such as video filters, Viterbi decoders, and pixel scalers. Future advances in medical and radar processing technology will require hardware that copes with the real-time implementation of algorithms such as Feldkamp backprojection and Kalman filtering, requiring 10-100 Giga operations per second.

The increasing amount of functionality and adaptability of such massively parallel architectures has led to their growing consideration for use in embedded systems. On the other side, however, there is the dilemma that the hardware complexity of such devices cannot be harnessed due to the lack of mapping tools. Massively parallel architectures demand a paradigm shift from utilizing parallelism in the sequential "Von Neumann" sense to data-flow-level parallelism, and hence there is a lack of programmability for fine- and coarse-grained architectures. Parallelization techniques and compilers are therefore of utmost importance in order to map computationally intensive algorithms onto massively parallel architectures. Loop programs offer a rich source of parallelism. The aim of such a parallelizing compiler is to automatically synthesize a guaranteed-correct parallel implementation from a given nested loop description.

There are a few existing tools for the automated design and synthesis of application-specific circuits. PICO-Express, a commercial tool from Synfora, takes an algorithmic description in terms of sequential nested loops and maps it onto a pipeline processor architecture (PPA) [Syn]. There are also a number of fully developed hardware design languages and tools like Handel-C [CEL] or the Spark environment [GDGN03], but they use imperative forms as input code. PARO is another design system for modeling, transformation, optimization, and processor synthesis for the class of Piecewise Linear Algorithms (PLA) [Pro]. PARO can be used for the automated synthesis of regular circuits and is described in this thesis. The GUI of PARO is shown in Figure 1.1.

The systematic design methodology for the synthesis of hardware from loop descriptions is based on the polytope model. The polytope model permits the synthesis of different processor arrays with quantification of latency, number of processors, communication, etc. Compilers based on the polytope model can perform dependence analysis and convert broadcasts into local dependencies using localization transformations. Partitioning, an essential step in this methodology, was introduced in order to match the algorithm to hardware resource constraints in the number of processors, I/O communication, and memory access [Tei93]. The different partitioning techniques such as LSGP, LPGS, co-partitioning, and hierarchical partitioning meet hardware constraints and improve memory utilization, as seen in [Eck01]. The partitioning step tiles the index space using hyperplanes, boxes, or parallelepiped shapes. The tiles can be executed sequentially (LPGS), the index points within a tile can be executed sequentially (LSGP), or intermediate schemes can be used [EM99]. Control generation is another major step in automatic synthesis for dealing with iteration dependent conditionals. The transformation for the generation and propagation of control for processor arrays without partitioning of the iteration space was introduced in [Tei93]. The selection of an optimal space-time mapping is done after a design space exploration of the different possible mappings [HT01]. The space-time mapping decides which iteration is executed on which processor at what time. From the resulting algorithm, an RTL description is obtained which can be mapped onto massively parallel architectures.
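As an illustrative sketch (not code from the thesis), the following Python fragment contrasts how the LSGP and LPGS schemes assign the iterations of a rectangularly tiled 2-D index space to processor elements and time steps; the function names and the dictionary encoding are hypothetical.

```python
# Illustrative sketch (not from the thesis): how the LSGP and LPGS
# partitioning schemes assign the iterations of an N1 x N2 index space,
# tiled into p1 x p2 boxes, to processor elements (PEs) and time steps.

def tile(i, j, p1, p2):
    """Split index (i, j) into tile coordinates and intra-tile offsets."""
    return (i // p1, j // p2), (i % p1, j % p2)

def lsgp_sequence(N1, N2, p1, p2):
    # LSGP: one PE per tile; each PE steps through its own tile
    # sequentially, so the intra-tile offset plays the role of time.
    order = {}
    for i in range(N1):
        for j in range(N2):
            (t1, t2), (o1, o2) = tile(i, j, p1, p2)
            order[(i, j)] = {"pe": (t1, t2), "time": o1 * p2 + o2}
    return order

def lpgs_sequence(N1, N2, p1, p2):
    # LPGS: one PE per intra-tile offset; the tiles are executed one
    # after another, so the tile coordinate plays the role of time.
    n2 = (N2 + p2 - 1) // p2  # number of tiles along j
    order = {}
    for i in range(N1):
        for j in range(N2):
            (t1, t2), (o1, o2) = tile(i, j, p1, p2)
            order[(i, j)] = {"pe": (o1, o2), "time": t1 * n2 + t2}
    return order
```

For a 4 × 4 space with 2 × 2 tiles, LSGP uses one PE per tile with four sequential steps each, while LPGS uses one PE per offset and revisits the four tiles in sequence; co-partitioning combines both views hierarchically.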

1.1 Goal of the Thesis

The objective of this master thesis is to automate the process of mapping partitioned regular algorithms onto processor arrays. The following tasks summarize the objective.

• In processor arrays, multiplexers are used to handle iteration dependent conditionals, and control signals are necessary to select the appropriate inputs. Automatic control generation is formulated as a transformation for iteration dependent conditionals in partitioned regular algorithms.

• The amount of local memory, the memory for inter-processor communication, and the communication with peripheral memory need to be estimated using derived formulas.

• Localization as a transformation needs to be extended to handle hierarchical partitioning methods such as co-partitioning. The idea of partial localization of affine data dependencies is to be extended to co-partitioning.

• Address generators are used to access data stored as arrays in memory. Possible efficient addressing schemes, i.e., custom and incremental address generators, are discussed.
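To illustrate the last point, here is a hedged Python sketch (the function names are hypothetical, not PARO's API) of the two addressing styles: a custom generator that evaluates the full linear address expression for every access, and an incremental generator that produces the same address sequence with additions only, which is cheaper in hardware.

```python
# Hypothetical sketch of the two addressing styles: a "custom" address
# generator recomputes address(i, j) = i * row_stride + j from scratch,
# while an incremental generator emits the same sequence using only
# constant additions per loop step.

def custom_addresses(N1, N2, row_stride):
    # Full linear address expression evaluated per access.
    return [i * row_stride + j for i in range(N1) for j in range(N2)]

def incremental_addresses(N1, N2, row_stride):
    # Same sequence with additions only: step by 1 inside a row and
    # jump by (row_stride - N2) when moving to the next row.
    out, addr = [], 0
    for _ in range(N1):
        for _ in range(N2):
            out.append(addr)
            addr += 1
        addr += row_stride - N2
    return out
```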

1.2 Structure of the Thesis

In chapter 2, the PARO design tool is briefly introduced, and the basic definitions and terminology of the polytope model and the class of loop algorithms under consideration are described.


In chapter 3, the different partitioning methodologies are explained. The extensions to the localization transformation and the partial localization of affine data dependencies for hierarchical partitioning methods such as co-partitioning are presented.

In chapter 4, the algorithm for the generation and propagation of global and local controller signals for processor arrays is introduced. The methodology of control generation is illustrated with the help of an FIR filter example.

In chapter 5, formulas are derived for the estimation of the local memory of each processing element (PE) based only on tiling parameters and dependence vectors. An upper bound on the memory requirement for inter-processor communication is deduced, as is an upper bound for memory accesses to the peripheral hierarchical memory. The optimal architecture style for the address generation unit is also discussed.

Chapter 6 presents a case study on matrix multiplication. The thesis finishes with a discussion of conclusions, future work, and open problems in chapter 7.


Figure 1.1: GUI of PARO


2 Definition and Method

In this chapter, an overview of the fundamentals of the existing mapping methodologies is given. The development of systolic arrays in the mid-1980s led to the idea of a silicon compiler. The silicon compiler was thought of as a CAD tool which would aid in generating synthesizable descriptions of massively parallel processor arrays from regular algorithms using a transformative refinement approach. An example of such a system is PARO, whose design flow is depicted in Figure 2.1.

The flow starts from a given nested loop program in a sequential high-level language (a subset of C). The program is parallelized by data dependence analysis into single assignment code (SAC), where the whole parallelism is explicitly given. SAC is closely related to a set of recurrence equations, a formalism introduced by Karp, Miller, and Winograd [KMW67]. This formalism has been used in many languages and extended over the years to affine dependencies and piecewise definitions, e.g., Systems of Affine Recurrence Equations (SARE), which are used in the Alpha language [WS94], the class of Affine Indexed Algorithms (AIA) [EM99], and the class of Piecewise Linear Algorithms (PLA) [Thi92, Tei93]. This class extends the notion of regular iterative algorithms [Rao85], which may be related to regular processor arrays. The model of computation that makes the synthesis of the above-mentioned classes of algorithms possible is called the polytope model: the loop program is represented by a polytope, a finite convex set with flat surfaces. The intuitive geometric approach of the polytope model offers an excellent basis for the analysis and synthesis of parallel loop programs for massively parallel architectures. The concepts of the polytope model are more explicit in the single assignment form; therefore, the parallelized loop program after dependence analysis is given in SAC.
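A minimal illustration of the parallelization step described above, under simplifying assumptions (a 1-D loop, Python instead of the C subset): a scalar accumulator is expanded into an indexed variable so that every instance is assigned exactly once, which is the single assignment property.

```python
# Illustrative sketch: rewriting a sequential accumulation loop into
# single assignment code (SAC). The scalar `s` is expanded into an
# indexed variable so that each instance is written exactly once.

def dot_sequential(a, u):
    s = 0
    for j in range(len(a)):
        s = s + a[j] * u[j]      # `s` is overwritten in every iteration
    return s

def dot_sac(a, u):
    T = len(a)
    s = [0] * (T + 1)            # one instance of s per iteration
    for j in range(T):
        s[j + 1] = s[j] + a[j] * u[j]   # each s[j+1] is defined once
    return s[T]
```

In the SAC form the dependence s[j + 1] ← s[j] is explicit, which is exactly the information the dependence analysis extracts.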


Figure 2.1: PARO design flow. The flow starts from C code (a subset), which is parallelized into SAC code; core transformations (localization, operator splitting, affine transformations, partitioning, control generation, ...) and a design space exploration (scheduling, cost estimation, energy estimation, search space reduction) follow, before hardware synthesis and HDL generation produce the array structure, processor elements, and controller (allocation, placement). The steps are supported by mathematical libraries and solvers (PolyLib, LPSolve, CPLEX, PIP).

2.1 Program Notations

Definition 2.1.1 (PLA). A piecewise linear algorithm consists of a set of $N$ quantified equations

$$S_1[I], \ldots, S_i[I], \ldots, S_N[I]$$

Each equation $S_i[I]$ is of the form

$$x_i[P_i I + f_i] = \mathcal{F}_i(\ldots, x_j[Q_j I - d_{ji}], \ldots) \quad \text{if} \ \mathcal{C}_i^I(I) \qquad (2.1)$$

for all $I \in \mathcal{I}_i \subseteq \mathbb{Z}^n$, where $\mathcal{I}_i$ is a linearly bounded lattice (definition follows), called the iteration space of the quantified equation $S_i[I]$; $x_i, x_j$ are linearly indexed variables; $\mathcal{F}_i$ denote arbitrary functions; $P_i, Q_j$ are constant rational indexing matrices; and $f_i, d_{ji}$ are constant rational vectors of corresponding dimension. The dots $\ldots$ denote similar arguments. The set of all vectors $P_i I + f_i$, $I \in \mathcal{I}_i$, is called the index space of the variable $x_i$. Furthermore, in order to account for irregularities in programs, we allow quantified equations $S_i[I]$ to have iteration dependent conditionals $\mathcal{C}_i^I(I)$, which can equivalently be expressed by $I \in \mathcal{I}_{C_i} \subseteq \mathbb{Z}^n$, where the space $\mathcal{I}_{C_i}$ is an iteration space called the condition space.

A PLA is called a piecewise regular algorithm (PRA) if the matrices $P_i$ and $Q_j$ are identity matrices. Variables that appear on the left-hand side of equations are called defined. Variables that appear only on the left-hand side of equations are called output variables. Variables that appear on the right-hand side of equations are called used. Variables that appear only on the right-hand side of equations are called input variables. A program is thus a system of quantified equations that implicitly defines a function of the output variables in dependence of the input variables. Piecewise linear algorithms allow index domains that are not only convex but also unions of convex polytopes; this permits a sequence of perfect loop nests instead of one perfect loop nest. Some other semantic properties are particular to the class of programs we are dealing with.

Single assignment property: Any instance of an indexed variable appears at most once on the left-hand side of an equation, or all equations defining the same variable are identical.

Computability: There exists a partial ordering of the equations such that any instance of a variable appearing on the right-hand side of an equation appears earlier on the left-hand side in that partial ordering.

Execution Model: The execution model of programs is architecture independent. A program may be executed as follows: (1) all instances of equations are ordered respecting the above-defined partial ordering; (2) the indexed variables are determined by successive evaluation of the equations.
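The variable classification above can be expressed as set arithmetic; this Python fragment is illustrative only (the `classify` helper is hypothetical, not part of the thesis's tool flow).

```python
# Illustrative sketch: classify the variables of a system of equations
# into defined, used, input, and output sets, as defined in the text.

def classify(equations):
    """equations: list of (lhs_variable, [rhs_variables]) pairs."""
    defined = {lhs for lhs, _ in equations}
    used = {v for _, rhs in equations for v in rhs}
    return {
        "defined": defined,
        "used": used,
        "input": used - defined,     # appear only on right-hand sides
        "output": defined - used,    # appear only on left-hand sides
    }
```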

The domains $\mathcal{I}_i$ are defined as follows:

Definition 2.1.2 (Linearly Bounded Lattice). A linearly bounded lattice denotes an index space of the form

$$\mathcal{I} = \{I \in \mathbb{Z}^n \mid I = M\kappa + c \ \wedge\ A\kappa \geq b\}$$

where $\kappa \in \mathbb{Z}^l$, $M \in \mathbb{Z}^{n \times l}$, $c \in \mathbb{Z}^n$, $A \in \mathbb{Z}^{m \times l}$, and $b \in \mathbb{Z}^m$. $\{\kappa \in \mathbb{Z}^l \mid A\kappa \geq b\}$ denotes the set of integral points within a convex polyhedron or, in case of boundedness, within a polytope in $\mathbb{Z}^l$. This set is affinely mapped onto iteration vectors $I$ using the affine transformation $I = M\kappa + c$.


2 Definition and Method

Throughout the thesis, we assume that the matrix M is square and of full rank and that c is the null vector. Then each vector κ is uniquely mapped to an index point I. Furthermore, we require that the index space is bounded. If M is an identity matrix, the index space is a convex polytope. The following example illustrates our program notation and semantics. The program notation is a subset of UNITY [CM88].

Example 2.1.1 The Jacobi method is an important algorithm for the solution of elliptic boundary-value partial differential equations. The kernel operation for the 1-D equation can be represented as follows:

〈 ‖ ∀i, j : 1 ≤ i < N ∧ 0 ≤ j < T ::

y[i, j] = a[0, j] · u[i − j + T/2, 0], if j = 0

y[i, j] = y[i, j − 1] + a[0, j] · u[i − j + T/2, 0], if j > 0 〉

The above program is a piecewise linear program. That it satisfies the conditions of single assignment and computability can be verified by expansion and ordering of the quantifications for all i = 1, . . . , N, i.e., y[i, T − 1] = y[i, T − 2] + a[0, T − 1] · u[i − (T − 1) + T/2, 0]; . . . ; y[i, 1] = y[i, 0] + a[0, 1] · u[i − 1 + T/2, 0]. The source polytope can be described as a convex polytope by a set of half-spaces as follows.

\[
\begin{pmatrix} -1 & 0 \\ 0 & -1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} i \\ j \end{pmatrix}
\le
\begin{pmatrix} -1 \\ 0 \\ N \\ T \end{pmatrix}
\]

This satisfies the definition of a linearly bounded lattice where M is an identity matrix. The upper part of the inequality system represents the lower bounds on the index variables and the lower part represents the upper bounds. The index space and the dependence graph are depicted in Figure 2.2. Each circular node represents a MAC operation, i.e., y[i] = y[i − 1] + a[j] · u[i − j + T/2].

Definition 2.1.3 (Block Pipelining Period). The block pipelining period of an allocated and scheduled piecewise regular algorithm is the time interval between the initiations of two successive problem instances and is denoted by β.

Definition 2.1.4 (Iteration Interval). The iteration interval δ of a scheduled regular algorithm is the distance in time steps between two successive executions of a node of a Global Dependence Graph (GDG).



    for (t = 0; t < 20; t++) {
        // update
        forall (i = 1; i < N; i++)
            forall (j = 0; j < T; j++) {
                y[i] += a[j] * u[i - j + T/2];
            }
        // copy
        forall (i = 1; i < N; i++)
            u[i] = y[i];
    }

Figure 2.2: Program and dependence graph.
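The program of Figure 2.2 can be executed directly under the execution model above. The following Python sketch runs the update/copy sweeps sequentially; the sizes N and T, the weights a, and the initial values of u are assumed illustrative values, not fixed by the thesis. The forall loops are independent in i, which is exactly the parallelism the space-time mapping later exploits.

```python
# N, T small; u must be defined over the extended range read by u[i - j + T//2].
N, T = 6, 4
lo = 1 - (T - 1) + T // 2          # smallest index read (i = 1, j = T - 1)
hi = (N - 1) + T // 2              # largest index read (i = N - 1, j = 0)
a = [1.0 / T] * T                  # hypothetical weights summing to one
u = {i: float(i) for i in range(lo, hi + 1)}

for t in range(3):                 # a few outer sweeps
    y = {}
    for i in range(1, N):          # 'forall' over i: iterations are independent
        y[i] = sum(a[j] * u[i - j + T // 2] for j in range(T))
    for i in range(1, N):          # copy phase writes the smoothed values back
        u[i] = y[i]
```

Since the weights form a convex combination, every y[i] stays within the range of the initial grid values, as expected of a Jacobi-style smoothing sweep.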

Definition 2.1.5 (Reduced Dependence Graph). The reduced dependence graph RDG = (V_D, E_D, D) associated to a co-partitioned regular algorithm in the sets J, K, and L is defined as follows: to each variable x_i[J, K, L], a node v_i ∈ V_D is associated. An edge (v_i, v_j) ∈ E_D with the distance vector d_ij = (d^J_ij  d^K_ij  d^L_ij)^T, d_ij ∈ Z^{3s×1}, appears if the variable x_i[J, K, L] directly depends on x_j[J − d^J_ij, K − d^K_ij, L − d^L_ij] via the dependence vector d_ij.

The terms global dependence graph and dependence graph are used in the same sense. An example of a dependence graph is shown in Figure 2.2.


2.2 Methodology

With this representation of equations and index spaces, several combinations of parallelizing transformations in the polytope model can be applied:

• Affine Transformations of iteration spaces have proven to be useful for the parallelization of algorithms and serve as a basis for the scheduling and assignment of operations to processor elements when mapping regular loop algorithms to processor arrays. Transformations like loop reversal, loop interchange, loop skewing, etc. can be expressed by an affine transformation. Furthermore, affine transformations can be used to embed variables into a common index space. The set of transformations in the PARO design trajectory can be classified as affine transformations of index spaces, affine projections of index spaces, intersections of index spaces, and convex hulls of unions of index spaces. A transformation applied to a linearly bounded lattice again gives a linearly bounded lattice. Therefore, affine transformations preserve the legality and behavior of the program [Tei93]. Affine transformations form a major bag of tricks in optimizing compilers for countable loops, i.e., loops that are not while loops. A discussion of loop restructuring methods can be found in [Wol96].
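As an illustration, loop skewing can be written as a unimodular affine map of index points; the skewing matrix below is the standard one, and the small index space is an assumed example.

```python
import numpy as np

# Loop skewing as an affine index-space transformation I' = T @ I + f.
T_skew = np.array([[1, 0],
                   [1, 1]])            # (i, j) -> (i, i + j): skew j by i
f = np.zeros(2, dtype=int)

original = [(i, j) for i in range(3) for j in range(3)]
skewed = [tuple(int(x) for x in T_skew @ np.array(p) + f) for p in original]

# T_skew is unimodular (det = 1), so the map is a bijection on index points:
# every original iteration has exactly one image, none are lost or merged.
assert len(set(skewed)) == len(original)
print(sorted(skewed)[:4])  # -> [(0, 0), (0, 1), (0, 2), (1, 1)]
```

In the skewed coordinates, a dependence (1, 0) of the original nest becomes (1, 1), which is how skewing exposes wavefront parallelism along the new hyperplanes.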

• Localization is a transformation which can be used to transform non-uniform affine data dependencies into uniform (regular) data dependencies. This is achieved by the propagation of variables from one index point to a neighboring index point. Local communication between processing elements is a required feature of array architectures for several reasons:

– Processing elements have only bounded fan-out/fan-in, or respectively, the number of available interconnects is limited.

– Regularity implies local neighborhood communication.

– Modularity requires interconnections of problem-size independent com-plexity.

The localization transformation is presented as a systematic procedure in [Tei93]. Figure 2.3a) shows the effect of the localization operation, where the global dependencies of Figure 2.2 are converted to local uniform dependencies. A dependency of a piecewise linear algorithm quantification is said to be a global dependency if P ≠ Q and a local dependency if P = Q (see Definition 2.1.1).


Figure 2.3: a) Localized dependence graph, b) Processor array after space-time mapping.

• Operator Splitting is a transformation which can be used to split a statement into operations with fewer operands. In practice, operator splitting is used to break down complex statements into, e.g., ALU operations with only two operands.

• Exploration of Space-Time Mappings. Linear transformations as in Eq. (2.2) are used as space-time mappings in order to assign a processor p (space) and a sequencing index t (time) to index vectors.

\[
\begin{pmatrix} p \\ t \end{pmatrix} = T \cdot I = \begin{pmatrix} Q \\ \lambda \end{pmatrix} I \qquad (2.2)
\]

In Eq. (2.2), Q ∈ Z^{(n−1)×n} and λ ∈ Z^{1×n}, and I is the index space. The main reason for using linear allocation and scheduling functions is that the data flow between PEs is local and regular, which is essential for low-power VLSI implementations. The interpretation of such a linear transformation is as follows: the set of operations defined at index points with λ · I = const. is scheduled at the same time step. In Figure 2.3a), this equation is represented by hyperplanes on which the index points are executed at the same time. The index space of allocated processing elements (processor space) is denoted by Q and is given by the set Q = {p | p = Q · I ∧ I ∈ I}. This set can also be obtained by choosing a projection of the dependence graph along a vector u ∈ Z^n, i.e., any coprime¹ vector u satisfying Q · u = 0 [Kuh80] describes the allocation equivalently.
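A minimal sketch of such a space-time mapping, assuming Q = (0 1) and λ = (1 1) — illustrative values in the spirit of Figure 2.3, not taken verbatim from it:

```python
import numpy as np

# Space-time mapping (p, t)^T = (Q; lambda) * I applied to a 2-D index space.
# Q = (0 1) allocates column j to processor p = j; lambda = (1 1) schedules
# t = i + j, i.e., anti-diagonal hyperplanes execute simultaneously.
Q = np.array([[0, 1]])
lam = np.array([[1, 1]])
T_map = np.vstack([Q, lam])

N, T = 8, 3
index_space = [(i, j) for i in range(1, N) for j in range(T)]
placement = {I: tuple(int(x) for x in T_map @ np.array(I)) for I in index_space}

# Validity check: no two operations on one processor share a time step.
assert len(set(placement.values())) == len(placement)
print(placement[(1, 0)], placement[(1, 2)])  # -> (0, 1) (2, 3)
```

With T = 3 this allocates three processing elements, matching the intuition of a column-wise projection of the dependence graph.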

The scheduling vector λ is obtained by formulating a latency minimization problem as a mixed integer linear program (MILP) [Thi95, TTZ97]. This well-known method is used here during exploration as a subroutine. In this MILP, the number of resources inside each processing element can be limited. An operation may also be mapped onto different resource types (module selection), and pipelining is possible as well.

Besides the consideration of latency as a measure of performance, area cost and energy consumption can also be considered during the exploration of space-time mappings [HT01, HT02, HT04].

• Partitioning. In order to match resource constraints such as a limited number of processing elements, partitioning techniques have to be applied. Most of these have in common that the index space of computations is tiled using hyperplanes, boxes, or parallelepiped shapes. Techniques are used either to execute tiles sequentially (LPGS – local parallel, global sequential partitioning) or to sequentialize the operations within a tile (LSGP – local sequential, global parallel partitioning), or intermediate schemes [TT93]. A combination of LPGS and LSGP, called co-partitioning, is used for balancing memory and I/O-bandwidth requirements while maintaining problem-size independence. So-called affine partitioning [TT02b] is used to reduce the number of local registers inside the PEs. The partitioning method allows the number of processors to be selected independently of constraints on the projection vector. Partitioning will be discussed in detail in Chapter 3.

• Control Generation. As the functionality of one processing element can change over time, a control mechanism is required. In the trajectory of stepwise refinement of program specifications, a step for control generation is needed that specifies the control units and the control signals of the processing elements [TT91, BHT01, BHT02]. Further control structures are necessary to control the internal schedule of a PE and the dataflow for partitioned arrays. The control generation problem is dealt with for the partitioned case in Chapter 4.

¹A vector x is said to be coprime if the absolute value of the greatest common divisor of its elements is one.

• HDL Generation & Synthesis. Finally, after all the refining transformations, a synthesizable description in a hardware description language like VHDL may be generated. This is done by generating one PE and repetitively generating the entire array according to the description of the output, a Piecewise Regular Processor Array.


3 Partitioning

Partitioning is a transformation which matches a regular algorithm to the hardware constraints in memory and number of processors. In other words, the usual process of selecting a projection and mapping cannot map arbitrary regular algorithms onto fixed-size arrays. Therefore, partitioning deals with the decomposition of the index space into tiles and the scheduling of the corresponding operations in order to obtain processor arrays of fixed, known size and dimension. Partitioning is known as loop blocking in compilers for uniprocessor machines, where it is used to improve cache behavior. Partitioning is also used for mapping large simulation problems onto supercomputers. The determination of partitioning techniques should take the following factors into consideration [Kun88]:

• Minimum overall computation time

• Minimum control overheads

• Balanced trade-off between external communication and local memory.

A discussion of partitioning with reference to the above points is carried on throughout the thesis. In code compilation, partitioning adds additional outer loops and changes the inner loop bounds. In this chapter, the partitioning schemes are studied with reference to the synthesis of massively parallel architectures from algorithm descriptions. New transformations relating to localization and partial localization for co-partitioning are also introduced. The following section gives a formal classification of the most-used partitioning schemes.
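The loop-blocking view of partitioning can be sketched in a few lines; the sizes are arbitrary illustrative values.

```python
# Loop blocking: partitioning a 1-D iteration space of size 12 into tiles of
# size 3 adds an outer (tile-origin) loop and rewrites the inner loop bounds.
N, TILE = 12, 3
flat = list(range(N))            # the original single loop

blocked = []
for k in range(0, N, TILE):      # added outer loop over tile origins
    for j in range(TILE):        # inner loop runs over one tile
        blocked.append(k + j)    # global index i = k + j

assert blocked == flat           # same iterations, only regrouped into tiles
print(len(blocked))  # -> 12
```

The regrouping is what lets the inner loop be assigned to a fixed-size processor array (or, on a uniprocessor, to the cache).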


3.1 Partitioning Techniques

3.1.1 Multiprojection

The space-time mapping discussed in Chapter 2 is applied to a given regular algorithm with an n-dimensional index space. This maps the algorithm onto an (n − 1)-dimensional processor space. This method is called projection. Obviously, a physical implementation of a processor array can have at most dimension 3. Multiprojection enables the generation of arrays of dimension n − s, where 2 ≤ s ≤ n. In principle, multiprojection is projection applied s times. The case s = n corresponds to sequential execution on a single processor. The linear space-time mapping for multiprojection can be represented as

\[
\begin{pmatrix} p \\ t \end{pmatrix} = T \cdot I = \begin{pmatrix} Q \\ \lambda \end{pmatrix} I \qquad (3.1)
\]

In Eq. (3.1), Q ∈ Z^{(n−s)×n}, λ ∈ Z^{s×n}, and I ∈ I, where I is the index space. The mapping Q · I gives the processor coordinates p, which form the processor space P = {p : p = Q · I ∧ I ∈ I}. The mapping λ · I gives the time coordinate t of the execution, which forms the time space T, i.e., T = {t : t = λ · I ∧ I ∈ I}. In multiprojection, dim(I) = dim(P) + dim(T) shows the validity of the linear mapping T describing the systolic array, since rank(T) = n. When s > 1, the multi-dimensional schedule is obtained from the solution of an integer linear program (ILP) [Fea92]. A discussion of preserving the precedence or causality constraint, i.e., Σ_{i=1}^{s} λ_i · d > 0 ∀ d, where d is a dependence vector, can be found in [Kun88].

3.1.2 Local Sequential (LS) Partitioning

In our taxonomy, LSGP stands equivalently for all local sequential (LS) partitioning schemes. In the LS scheme of partitioning, the index points in a tile are executed sequentially by the same processor. The different tiles are executed in parallel by the corresponding processors. The LS scheme of partitioning can be realized in three stages. The first stage is called Tiling, which divides the iteration space into congruent tiles. This stage is common to both the LS and GS partitioning schemes. Tiling for the mentioned partitioning schemes can be described formally as the decomposition of the iteration space I as follows:

I −→ J ⊕ K


where J is the set of all points within a tile¹ and K represents the set of origins of all tiles. The exact description of J, K is, however, determined by the tile shape and size. If I = J ⊕ K, it is called a perfect tiling, and a non-perfect tiling otherwise. The tiling parameters are contained in a matrix P_LSGP. The tiling operation determines J, K exactly by calculating their iteration spaces in the form of linearly bounded lattices. The expand operation takes care of the dependencies in light of the decomposition of the index space. Specific to the LSGP partitioning scheme is that all iteration points in J are executed by the same processor, while the tiles, represented by their origins K ∈ K, are executed in parallel by the respective processors. This is taken care of by the Reduce operation, which defines the allocation and scheduling as follows:

\[
\begin{pmatrix} p \\ t \end{pmatrix} =
\begin{pmatrix} 0 & E \\ \lambda_J & \lambda_K \end{pmatrix}
\begin{pmatrix} J \\ K \end{pmatrix}
\]

where E ∈ Z^{s×s} is an identity matrix, J ∈ J, K ∈ K, and (λ_J λ_K) is the scheduling vector. In the case s = 1 we obtain a linear array, and for s = 2 we obtain a two-dimensional processor array. The number of processors is equal to the number of elements in the set K. A discussion of the requisite transformations can be found in [Tei93]. The concept of the LS partitioning scheme is illustrated in Figure 3.1. The space-time mapping is given by

\[
\begin{pmatrix} p \\ t \end{pmatrix} =
\begin{pmatrix} 0 & 0 & 1 \\ 1 & 2 & 2 \end{pmatrix}
\begin{pmatrix} j_1 \\ j_2 \\ k_1 \end{pmatrix}
\]

The LS partitioning scheme reduces the processor count according to the fixed-size array. The disadvantage, however, is that one may need a lot of local memory.
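The LSGP mapping above can be checked mechanically. The matrix is the one given in this section; the tile bounds (j1 in [0, 2), j2 in [0, 3)) and the number of tiles (three) are assumptions consistent with a tiling matrix diag(2, 3).

```python
import numpy as np

# LSGP allocation and schedule: (p, t)^T = (0 0 1; 1 2 2) (j1, j2, k1)^T.
# p = k1 assigns each tile to one processor; t = j1 + 2*j2 + 2*k1 serializes
# the points inside a tile (the "local sequential" part of LSGP).
T_map = np.array([[0, 0, 1],
                  [1, 2, 2]])

points = [(j1, j2, k1) for j1 in range(2) for j2 in range(3) for k1 in range(3)]
pt = {P: tuple(int(x) for x in T_map @ np.array(P)) for P in points}

# Each processor p = k1 executes its 6 tile points at 6 distinct times.
for k1 in range(3):
    times = [t for (j1, j2, kk), (p, t) in pt.items() if kk == k1]
    assert len(set(times)) == 6
print(pt[(1, 2, 0)])  # -> (0, 5)
```

Note how different processors may compute at overlapping time windows, which is exactly the pipelined pattern visible in Figure 3.1b).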

3.1.3 Global Sequential (GS) Partitioning

In our taxonomy, Local Parallel and Global Sequential (LPGS) and Global Sequential (GS) partitioning schemes mean the same. In the LPGS scheme, the iteration space is divided into congruent tiles. The major difference is that, instead of the operations in a tile being sequentialized as in LSGP partitioning, the operations in a tile are executed in parallel by the corresponding processors, and the tiles are executed sequentially. Therefore, the index points assigned to the same processor do not form a convex set. The tiling and reduce operations are common to both LS and GS partitioning

¹J ⊕ K = {i = j + k : j ∈ J ∧ k ∈ K}


Figure 3.1: a) Partitioned dependence graph, b) LSGP processor array

schemes. The reduce operation takes care of the fact that all operations within a tile are executed on different processors in parallel, as expressed in the following equation.

\[
\begin{pmatrix} p \\ t \end{pmatrix} =
\begin{pmatrix} E & 0 \\ \lambda_J & \lambda_K \end{pmatrix}
\begin{pmatrix} J \\ K \end{pmatrix}
\]

where E ∈ Z^{s×s} is an identity matrix, J ∈ J, K ∈ K, and (λ_J λ_K) is the scheduling vector. The equation p = E · J shows that the processor coordinates are the same as the coordinates within a tile. The disadvantage of the LPGS scheme is that many wrap-around interconnections are needed. An intuitive understanding of both the LS and GS partitioning schemes can be gained from Figures 3.1 and 3.2. The tiling matrix for both Figure 3.1 and Figure 3.2 is

\[
P_{LSGP} = P_{LPGS} = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}.
\]

The space-time mapping for the example in Figure 3.2 is

\[
\begin{pmatrix} p \\ t \end{pmatrix} =
\begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 2 \end{pmatrix}
\begin{pmatrix} j_1 \\ j_2 \\ k_1 \end{pmatrix}
\]


Figure 3.2: a) Partitioned dependence graph, b) LPGS processor array

3.1.4 Co-partitioning

The co-partitioning scheme simultaneously combines the LSGP and LPGS partitioning schemes. The iteration space is divided into congruent LS tiles with tiling matrix P_LSGP. The tiled LS iteration space is then partitioned again, using the tiling matrix P_LPGS, into congruent GS tiles. The points within an LS tile are executed sequentially. Each LS tile within a single GS tile corresponds to a processor. Therefore, the total number of processors is equal to the number of LS tiles within a GS tile. In the LSGP scheme, if the tiling matrix is selected according to the constrained availability of local memory, the algorithm cannot be mapped onto a fixed-size processor array. However, the LSGP partitioning scheme is characterized by high data reuse within the processor array, thus reducing communication with the periphery. In the LPGS scheme, the tiling parameters are selected to match the tile to a fixed-size processor array. This requires minimal local memory but increases the communication between the processor array and the periphery. Co-partitioning balances the properties of both partitioning schemes. It uses an optimal amount of local memory, balances communication with peripheral memory, and maps the algorithm onto a fixed-size array. The selection of P_LSGP is done according to the availability of local memory; the selection of P_LPGS is done according to the size of the processor array. An intuitive example of co-partitioning can be seen in Figure 3.3. The LS tiles are determined by the tiling matrix

\[
P_{LSGP} = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}.
\]

The GS tiles are determined by the tiling matrix

\[
P_{LPGS} = \begin{pmatrix} 4 & 0 \\ 0 & 3 \end{pmatrix}.
\]

The allocation and scheduling are given by the space-time mapping

\[
\begin{pmatrix} p \\ t \end{pmatrix} =
\begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 2 & 2 & 6 \end{pmatrix}
\begin{pmatrix} j_1 \\ j_2 \\ k_1 \\ l_1 \end{pmatrix}
\]

The co-partitioning method consists of three stages:

• Tiling: This divides the iteration space into the corresponding GS and LS tiles. Tiling decomposes the initially given iteration space as I −→ J ⊕ K ⊕ L.

• Embedding: The embedding transformation introduces new quantifications into the PLA to take care of the dependencies in light of the decomposition of the iteration space.

• Reduce: This transformation implements the requisite conditions for the mapping of the processor space and the scheduling of operations for co-partitioning (see Definition 4.1.1).
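For diagonal tiling matrices, the decomposition of an index point into I = J + K + L reduces to elementwise integer division. The sketch below uses the tile sizes of Figure 3.3 and is an illustration only, not the general (non-diagonal) construction of Lemma 3.2.1.

```python
import numpy as np

# Co-partitioning splits each index point I into I = J + K + L:
# J is the offset inside an LS tile, K the LS-tile origin inside its GS tile,
# and L the GS-tile origin.  Diagonal tilings P_LSGP = diag(2, 3) and
# P_LPGS = diag(4, 3) make the split elementwise.
P_LS = np.array([2, 3])   # LS tile edge lengths
P_GS = np.array([4, 3])   # GS tile edge lengths (multiples of the LS lengths)

def co_partition(I):
    I = np.asarray(I)
    L = (I // P_GS) * P_GS          # GS-tile origin
    r = I - L                       # offset inside the GS tile
    K = (r // P_LS) * P_LS          # LS-tile origin inside the GS tile
    J = r - K                       # offset inside the LS tile
    return (tuple(int(x) for x in J),
            tuple(int(x) for x in K),
            tuple(int(x) for x in L))

J, K, L = co_partition((5, 4))
assert tuple(a + b + c for a, b, c in zip(J, K, L)) == (5, 4)
print(J, K, L)  # -> (1, 1) (0, 0) (4, 3)
```

Since each GS tile here contains two LS tiles (4/2 in the first dimension), this configuration yields a two-processor array, as in Figure 3.3b).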

Definition 3.1.1 A co-partitioning of I is represented as I ⊆ J ⊕ K ⊕ L, where J, K, L are linearly bounded lattices. I is an integral convex polyhedron; the n-dimensional index space I is split into congruent tiles L ∈ L using the tiling matrix P_LPGS. Then another tiling matrix P_LSGP, which is a full-rank matrix, is chosen to split the index space contained in the tiles L into congruent tiles K.

Definition 3.1.2 The embedding transformation w.r.t. co-partitioning changes the dimension of the index space to reflect the tiling, i.e., I ↦ I. Considering non-degenerate tiles, all variables are embedded in a 3n-dimensional index space, and

\[
I = \begin{pmatrix} J \\ K \\ L \end{pmatrix}
\]

is generated.


Figure 3.3: a) Partitioned dependence graph, b) Co-partitioned processor array

In the following sections, the transformations tiling and embedding are adapted to deal with co-partitioning. Also, the new design trajectory in which localization is done after partitioning, i.e., partial localization, is extended to handle co-partitioning.

3.2 Tiling for Co-partitioning

I is the index space to be covered by tiling. The tiling of I w.r.t. co-partitioning is given by I ⊆ J ⊕ K ⊕ L, where J, K, L are linearly bounded lattices. The tiling is called a perfect tiling if I = J ⊕ K ⊕ L. The following lemma gives the exact definition of the linearly bounded lattices given the tiling matrices P_GS (P_LPGS) and P_LS (P_LSGP).

Lemma 3.2.1 A decomposition J ⊕ K ⊕ L defined by

\[
\begin{aligned}
\mathcal{J} &= \{ J \in \mathbb{Z}^n : A_j J \ge b_j \} \\
\mathcal{K} &= \{ K \in \mathbb{Z}^n : K = P_{LS}\kappa + s \wedge A_k \kappa \ge b_k \wedge \kappa \in \mathbb{Z}^n \} \\
\mathcal{L} &= \{ L \in \mathbb{Z}^n : L = P_{GS}\,l + q \wedge A_l l \ge b_l \wedge l \in \mathbb{Z}^n \}
\end{aligned}
\]

where, assuming

\[
A_{k'} = \begin{pmatrix} \sigma_1 \operatorname{adj}(P_{GS}) \\ -\sigma_1 \operatorname{adj}(P_{GS}) \end{pmatrix}, \qquad
b_{k'} = \begin{pmatrix} 0 \\ (-\sigma_1 \det(P_{GS}) + 1) \cdot e \end{pmatrix}, \qquad e \in \mathbb{Z}^n,
\]

with e = (1 1 . . . 1)^T, adj(P_GS) = P_GS^{-1} · det(P_GS), σ_1 = |det(P_GS)| / det(P_GS), and

\[
A_l l \ge b_l \equiv \operatorname{Proj}_l\!\left( \begin{pmatrix} 0 & A \\ -A_{k'} P_{GS} & A_{k'} \end{pmatrix} \begin{pmatrix} l \\ I \end{pmatrix} \ge \begin{pmatrix} b \\ b_{k'} + A_{k'} q \end{pmatrix} \right)
\]

\[
A_k k \ge b_k \equiv \operatorname{Proj}_k\!\left( \begin{pmatrix} 0 & A_{k'} \\ -A_j P_{LS} & A_j \end{pmatrix} \begin{pmatrix} k \\ k' \end{pmatrix} \ge \begin{pmatrix} b_{k'} \\ b_j + A_j s \end{pmatrix} \right)
\]

\[
\mathcal{K}' = \{ k' : A_{k'} k' \ge b_{k'} \}, \qquad \sigma_2 = \frac{|\det(P_{LS})|}{\det(P_{LS})},
\]

\[
A_j = \begin{pmatrix} \sigma_2 \operatorname{adj}(P_{LS}) \\ -\sigma_2 \operatorname{adj}(P_{LS}) \end{pmatrix}, \qquad
b_j = \begin{pmatrix} 0 \\ (-\sigma_2 \det(P_{LS}) + 1) \cdot e \end{pmatrix}
\]

is a valid co-partition of the index space I = {I : A · I ≥ b}.

Proof 3.2.1 Co-partitioning refers to the simultaneous application of the LPGS and LSGP partitioning schemes one after another. The application of LPGS partitioning creates an intermediate partitioning which can be described by I ⊆ K′ ⊕ L, with

\[
\begin{aligned}
\mathcal{L} &= \{ L \in \mathbb{Z}^n : L = P_{GS}\,l + q \wedge A_l l \ge b_l \wedge l \in \mathbb{Z}^n \} \\
\mathcal{K}' &= \{ k' : A_{k'} k' \ge b_{k'} \}
\end{aligned}
\]

where

\[
A_{k'} = \begin{pmatrix} \sigma_1 \operatorname{adj}(P_{GS}) \\ -\sigma_1 \operatorname{adj}(P_{GS}) \end{pmatrix}, \qquad
b_{k'} = \begin{pmatrix} 0 \\ (-\sigma_1 \det(P_{GS}) + 1) \cdot e \end{pmatrix},
\]

with e = (1 1 . . . 1)^T, adj(P_GS) = P_GS^{-1} · det(P_GS), σ_1 = |det(P_GS)| / det(P_GS), and

\[
A_l l \ge b_l \equiv \operatorname{Proj}_l\!\left( \begin{pmatrix} 0 & A \\ -A_{k'} P_{GS} & A_{k'} \end{pmatrix} \begin{pmatrix} l \\ I \end{pmatrix} \ge \begin{pmatrix} b \\ b_{k'} + A_{k'} q \end{pmatrix} \right).
\]

The final step involves introducing an LSGP partitioning, i.e., K′ ⊆ J ⊕ K. Let s ∈ K′ be the offset vector so that K describes the origins of the tiles accounting for the LS partitioning. Then

\[
\begin{aligned}
\mathcal{J} &= \{ J \in \mathbb{Z}^n : J = P_{LS} k \wedge 0 \le k < e \wedge k \in \mathbb{R}^n \} \\
&= \{ J \in \mathbb{Z}^n : 0 \le \sigma_2 \operatorname{adj}(P_{LS}) \cdot J < \sigma_2 \det(P_{LS}) \cdot e \} \\
&= \{ J \in \mathbb{Z}^n : A_j J \ge b_j \}.
\end{aligned}
\]

Now J = K′ − K and K = P_LS k + s. Also A_j J ≥ b_j; therefore, {A_j (K′ − (P_LS k + s)) ≥ b_j ∧ K ∈ Z^n ∧ K′ ∈ Z^n} describes all points k. The polyhedron {k ∈ Z^n : A_k k ≥ b_k} is obtained by eliminating all the variables k′, projecting onto the subspace defined by the variables k. Therefore,

\[
A_k k \ge b_k \equiv \operatorname{Proj}_k\!\left( \begin{pmatrix} 0 & A_{k'} \\ -A_j P_{LS} & A_j \end{pmatrix} \begin{pmatrix} k \\ k' \end{pmatrix} \ge \begin{pmatrix} b_{k'} \\ b_j + A_j s \end{pmatrix} \right).
\]

The above lemma is illustrated in the following example of a digital FIR filter.

Example 3.2.1

⟨ ‖ ∀i, j : 0 ≤ i < N ∧ 0 ≤ j < T ::

y[i, j] = a[0][j] · u[0, j − i], if j = 0

y[i, j] = y[i, j − 1] + a[0][j] · u[0, j − i], if j > 0 ⟩

Using the tiling matrices

\[
P_{LPGS} = \begin{pmatrix} N & 0 \\ 0 & 4 \end{pmatrix} \quad \text{and} \quad
P_{LSGP} = \begin{pmatrix} N & 0 \\ 0 & 2 \end{pmatrix},
\]

we obtain the tiling of I = {i, j : 0 ≤ i < N ∧ 0 ≤ j < T} as

\[
\begin{aligned}
\mathcal{J} &= \left\{ \begin{pmatrix} j_1 \\ j_2 \end{pmatrix} : 0 \le j_1 < N \wedge 0 \le j_2 < 2 \right\} \\
\mathcal{K} &= \left\{ \begin{pmatrix} k_1 \\ k_2 \end{pmatrix} = \begin{pmatrix} N & 0 \\ 0 & 2 \end{pmatrix} \begin{pmatrix} \kappa_1 \\ \kappa_2 \end{pmatrix} \wedge 0 \le N\kappa_1 < N \wedge 0 \le 2\kappa_2 < 4 \right\} \\
\mathcal{L} &= \left\{ \begin{pmatrix} \lambda_1 \\ \lambda_2 \end{pmatrix} = \begin{pmatrix} N & 0 \\ 0 & 4 \end{pmatrix} \begin{pmatrix} l_1 \\ l_2 \end{pmatrix} \wedge 0 \le N l_1 < N \wedge 0 \le 4 l_2 < T \right\}
\end{aligned}
\]

⟨ ‖ ∀i, j : i = j_1 + Nκ_1 + N l_1 ∧ j = j_2 + 2κ_2 + 4 l_2 ∧ 0 ≤ j_1 < N ∧ 0 ≤ j_2 < 2 ∧ 0 ≤ Nκ_1 < N ∧ 0 ≤ 2κ_2 < 4 ∧ 0 ≤ N l_1 < N ∧ 0 ≤ 4 l_2 < T ::

y[i, j] = a[0][j] · u[0, j − i], if j = 0

y[i, j] = y[i, j − 1] + a[0][j] · u[0, j − i], if j > 0 ⟩


After the distribution of iteration and condition spaces, the redundant inequalities can be removed at this stage in order to obtain an embedded space of reduced dimension; i.e., k_1 and l_1 can be removed in the example.
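The redundancy of κ_1 and l_1 can be verified by brute force for small sizes; N = 4 and T = 8 below are assumed illustrative values.

```python
import itertools

# Enumerate the decomposition i = j1 + N*kap1 + N*l1, j = j2 + 2*kap2 + 4*l2
# of Example 3.2.1 under the stated bounds, and check that kap1 and l1 are
# forced to zero -- the redundant dimensions the text says can be removed.
N, T = 4, 8
solutions = {}
for j1, j2 in itertools.product(range(N), range(2)):
    for kap1, kap2 in itertools.product(range(1), range(2)):   # 0 <= N*kap1 < N, 0 <= 2*kap2 < 4
        for l1, l2 in itertools.product(range(1), range(T // 4)):  # 0 <= N*l1 < N, 0 <= 4*l2 < T
            i, j = j1 + N * kap1 + N * l1, j2 + 2 * kap2 + 4 * l2
            solutions.setdefault((i, j), []).append((kap1, l1))

# Every point of the N x T index space is covered exactly once,
# with kap1 = l1 = 0 in every decomposition.
assert len(solutions) == N * T
assert all(len(v) == 1 and v[0] == (0, 0) for v in solutions.values())
print(len(solutions))  # -> 32
```

The exactly-once coverage also confirms that this choice of tiling matrices yields a perfect tiling of the FIR index space.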

3.3 Localization

Localization is a transformation which converts non-uniform data dependencies into an equivalent algorithm with uniform data dependencies. This caters to efficient VLSI implementation characterized by local communication between processing elements, bounded fan-in, and bounded fan-out. The idea of localization as a transformation was first introduced in the work of [Tei93]. In this chapter, the concept of partial localization is extended to an extra class of input dependencies. The automatic derivation of partitioned program descriptions from localized non-partitioned program descriptions is extended to handle co-partitioning, i.e., the embedding transformation is modified to handle co-partitioning. The basic idea of localization is explained briefly in the following example.

Example 3.3.1 ⟨‖ ∀i, j : 0 ≤ i < 8 ∧ 0 ≤ j < 4 :: b[i, j] = a[i, 0]⟩ describes a simple regular algorithm. The dependence graph of the affine recurrence equation and of its equivalent algorithm obtained after localization and tiling are depicted in Figure 3.4. The affine recurrence equation of the regular algorithm after localization looks as follows.

⟨‖ ∀i, j : 0 ≤ i < 8 ∧ 0 ≤ j < 4 ::

b[i, j] = b[i, j − 1], if j > 0

b[i, j] = a[i, j], if j = 0⟩

Figure 3.4b) aptly shows the propagation of the value of a[i, 0] using the propagation vector (0 1)^T. After localization, linear scheduling and allocation are applied to derive the full-size array. The LSGP partitioning using the tiling matrix

\[
P_{LSGP} = \begin{pmatrix} 4 & 0 \\ 0 & 4 \end{pmatrix}
\]

decomposes the index space. The new affine recurrence equation of the algorithm in the tiled space is given as follows; the corresponding dependence graph is shown in Figure 3.4c).

⟨‖ ∀j_1, j_2, k_1 : 0 ≤ j_1 < 4 ∧ 0 ≤ j_2 < 4 ∧ 0 ≤ k_1 < 2 ::

b[j_1, j_2, k_1] = b[j_1, j_2 − 1, k_1], if j_2 > 0

b[j_1, j_2, k_1] = b[j_1, j_2 + 3, k_1 − 1], if j_2 = 0 ∧ k_1 > 0

b[j_1, j_2, k_1] = a[j_1, j_2, k_1], if j_2 = 0 ∧ k_1 = 0⟩

The projection vector (1 0)^T is chosen for the derivation of a processor array from the above program description. Figures 3.5a) and 3.5b) show the full-size array without localization and with localization, respectively. The obvious advantage is the replacement of global communication by short communication links. Figure 3.5c) shows the reduced array obtained by applying partitioning to the localized program and then scheduling and allocating the index points. The partitioning technique is undertaken to match resource constraints such as a limited number of processing elements, e.g., two processing elements in the given example.
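The equivalence of the localized program to the original b[i, j] = a[i, 0] can be checked by direct evaluation in the computability order; the input values below are arbitrary.

```python
# Localization of Example 3.3.1: the broadcast b[i, j] = a[i, 0] is replaced
# by nearest-neighbour propagation along the vector (0 1)^T.
I_MAX, J_MAX = 8, 4
a = [[10 * i for j in range(J_MAX)] for i in range(I_MAX)]  # arbitrary inputs

b = [[None] * J_MAX for _ in range(I_MAX)]
for i in range(I_MAX):
    for j in range(J_MAX):      # evaluate in the computability order (j ascending)
        b[i][j] = a[i][j] if j == 0 else b[i][j - 1]

# Every b[i, j] equals the originally broadcast value a[i, 0].
assert all(b[i][j] == a[i][0] for i in range(I_MAX) for j in range(J_MAX))
print(b[3][3])  # -> 30
```

The check mirrors the single-assignment and computability arguments of Section 2.1: each b[i, j] is defined exactly once, and the j-ascending order respects the partial ordering of the equations.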

Figure 3.4: a) Dependence graph of the non-localized program in Example 3.3.1, b) dependence graph of the localized program in Example 3.3.1, c) dependence graph after partitioning of the localized program


Figure 3.5: a) Processor array corresponding to Figure 3.4a), b) processor array corresponding to Figure 3.4b), c) reduced processor array corresponding to Figure 3.4c)

3.4 Embedding for Co-partitioning

The partitioning of a regular algorithm was first introduced as a transformation in [Tei93]. The transformation can deal with LSGP and LPGS partitioning and other partitioning techniques involving a single tiling matrix. In this section, the transformation is extended to handle hierarchical partitioning methods such as co-partitioning. The first step in partitioning an already localized regular algorithm is tiling. Here an n-dimensional index space is partitioned recursively into congruent tiles. In the case of co-partitioning, two tiling matrices P_LPGS and P_LSGP are selected. This decomposes the n-dimensional index space into a 3n-dimensional index space, i.e., I =⇒ J ⊕ K ⊕ L. In particular, the iteration vector I is replaced by (J K L)^T, where J ∈ J, K ∈ K, and L ∈ L. The second step concerns the embedding of all variables into this higher-dimensional index space according to the chosen tiling. This transformation introduces new quantifications with new dependence vectors if dependencies cross different tiles. This requires a modification of the dependence analysis for the embedding transformation. Without loss of generality, we assume that all equations of the piecewise regular program, i.e., S_l[I], are given in output-normalized form as below.

x[I] = F(x[I − d], . . .) ∀ I ∈ I


For simplification of the analysis of dependencies, input splitting leads to the following set of equivalent equations:

x[I] = F(z[I], . . .) ∀ I ∈ I (3.2)

z[I] = y[I − d] ∀ I ∈ I (3.3)

The variables in quantification (3.2) do not have dependencies crossing different tiles because d = 0. These equations are called motionless equations. The equations of quantification type (3.3) have to be taken care of explicitly to preserve the affine data dependencies crossing the tiles and are known as displacement equations.

Definition 3.4.1 Given a program prg containing equations of the above form.

Then embedding transformation gives program prg1 containing equations of the fol-

lowing forms, i.e.

prg1 = embedding(prg, P_LSGP, P_LPGS) =

〈∀ J ∈ J, K ∈ K, L ∈ L ::
x[J, K, L] = F(z[J, K, L], . . .) if J + K + L ∈ I〉 (3.4)

〈∀ J ∈ J, K ∈ K, L ∈ L ::
z[J, K, L] = y[J − d − P_LSGP λ1, K + P_LSGP λ1 − P_LPGS λ2, L + P_LPGS λ2]
if J − d − P_LSGP λ1 ∈ J ∧ K + P_LSGP λ1 − P_LPGS λ2 ∈ K ∧ L + P_LPGS λ2 ∈ L ∧ J + K + L ∈ I〉 (3.5)

where quantifications of form (3.5) are generated for all vectors λ1, λ2 ∈ Θ and

Θ = {λ1, λ2 ∈ Z^n : −e < λ1 + P_LSGP^−1 d < e ∧ −e < λ2 − P_LPGS^−1 P_LSGP λ1 < e},

where e = (1, 1, . . . , 1)^T.
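The set Θ can be enumerated by brute force: whenever d lies within a single tile, the two membership conditions confine λ1 and λ2 to the box {−1, 0, 1}^n. The following sketch (using NumPy; the function name `theta` and the brute-force strategy are our own illustration, instantiated below with the data of Example 3.4.1) enumerates all admissible pairs:

```python
import itertools
import numpy as np

def theta(d, P_lsgp, P_lpgs):
    """Enumerate Theta = {(l1, l2) : -e < l1 + P_lsgp^-1 d < e
    and -e < l2 - P_lpgs^-1 P_lsgp l1 < e} by brute force over the
    candidate box {-1, 0, 1}^n (valid when d lies inside one tile)."""
    n = len(d)
    inv_lsgp = np.linalg.inv(P_lsgp)
    inv_lpgs = np.linalg.inv(P_lpgs)
    cands = list(itertools.product((-1, 0, 1), repeat=n))
    result = []
    for l1 in cands:
        v1 = np.array(l1) + inv_lsgp @ d
        if not np.all((-1 < v1) & (v1 < 1)):
            continue
        for l2 in cands:
            v2 = np.array(l2) - inv_lpgs @ P_lsgp @ np.array(l1)
            if np.all((-1 < v2) & (v2 < 1)):
                result.append((l1, l2))
    return result

# instance of Example 3.4.1: d = (1, 0)^T, P_LSGP = diag(2, 8), P_LPGS = diag(4, 8)
print(theta(np.array([1, 0]), np.diag([2, 8]), np.diag([4, 8])))
# -> the three (lambda1, lambda2) pairs listed in Example 3.4.1
```

For the diagonal matrices of the example this reproduces exactly the three pairs derived analytically below.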

The correctness of the above transformation follows from a lemma in [Tei93], which is extended here to handle co-partitioning.

Lemma 3.4.1 Given a PRP prg containing quantifications of the form (3.2), (3.3) and a tiling J ⊕ K ⊕ L of its index space I, the program prg1 = embedding(prg, P_LSGP, P_LPGS) with equations of the form (3.4), (3.5) satisfies prg1 ≡ prg if the variables x and y have neither instances of input variables nor output variables in the expansion of prg.


3 Partitioning

Proof 3.4.1 The program containing quantifications of the above form is a PRP, as all indexing functions are of the form g(I) = I − d′, with d′ = 0 for the motionless equations and d′ = (d + P_LSGP λ1 − P_LSGP λ1 + P_LPGS λ2 − P_LPGS λ2)^T for the displacement equations. Hence, all dependence vectors are constant, and all condition spaces and iteration spaces are linearly bounded lattices. The properties of single assignment, computability, and equivalence are proved using the following results. The motionless equations have a unique substitution, i.e., each variable x[I], z[I] is replaced by the embedded variables x[J, K, L], z[J, K, L]; thus, the substitution of variables preserves the computability and single assignment properties. The displacement equations are not as easily embedded, since their dependencies cross different tiles. To this end, new quantifications are created to reroute the dependencies. Since J + K + L − d = (J − d − R1) + (K + R1 − R2) + (L + R2), we need to determine R1, R2 such that for given J ∈ J, K ∈ K, L ∈ L, we have J − d − R1 ∈ J, K + R1 − R2 ∈ K, and L + R2 ∈ L. Using the definitions of J, K, L in Definition 3.4.1, we obtain R1 = P_LSGP λ1 and R2 = P_LPGS λ2 for some vectors λ1, λ2 ∈ Z^n, with the conditions J − d − P_LSGP λ1 ∈ J, K + P_LSGP λ1 − P_LPGS λ2 ∈ K, L + P_LPGS λ2 ∈ L. Hence the unique substitution is

〈∀ J ∈ J, K ∈ K, L ∈ L ::
y[J + K + L − d] = y[J − d − P_LSGP λ1, K + P_LSGP λ1 − P_LPGS λ2, L + P_LPGS λ2]
if J − d − P_LSGP λ1 ∈ J, K + P_LSGP λ1 − P_LPGS λ2 ∈ K, L + P_LPGS λ2 ∈ L, and J + K + L ∈ I〉

with

Θ = {λ1, λ2 ∈ Z^n : 〈∃ J, K :: J − d − P_LSGP λ1 ∈ J ∧ K + P_LSGP λ1 − P_LPGS λ2 ∈ K〉};

consequently, for each λ1, λ2 a new quantification is introduced. The functional equivalence follows from the unique substitution of variables and from the assumption that the variables x and y have neither instances of input variables nor output variables in the expansion of prg. Θ is derived explicitly using the definitions of J, K, L. We have

Θ = {λ1, λ2 ∈ Z^n : 〈∃ κ1, κ2, κ3, κ4 : κ1, κ2, κ3, κ4 ∈ R^n :: 0 < κ1, κ2, κ3, κ4 < e ∧ P_LSGP κ1 − d − P_LSGP λ1 = P_LSGP κ2 ∧ P_LPGS κ3 + P_LSGP λ1 − P_LPGS λ2 = P_LPGS κ4〉}

This leads to the explicit expression of Θ given in Definition 3.4.1. The transformed condition spaces are also linearly bounded lattices.


Finally, the normalization step changes the linearly bounded lattices K, L into polytopes K̂, L̂. The affine transformations K̂ = P_LSGP^−1(K) and L̂ = P_LPGS^−1(L) lead to the final program containing quantifications of the following form.

〈∀ J ∈ J, K ∈ K̂, L ∈ L̂ ::
x[J, K, L] = F(z[J, K, L], . . .) if J + P_LSGP K + P_LPGS L ∈ I〉

〈∀ J ∈ J, K ∈ K̂, L ∈ L̂ ::
z[J, K, L] = y[J − d − P_LSGP λ1, K + λ1 − P_LSGP^−1 P_LPGS λ2, L + λ2]
if J − d − P_LSGP λ1 ∈ J, P_LSGP K + P_LSGP λ1 − P_LPGS λ2 ∈ K, P_LPGS L + P_LPGS λ2 ∈ L, and J + P_LSGP K + P_LPGS L ∈ I〉

The index spaces J, K̂, L̂ are given by

J = {J ∈ Z^n : A_J J ≥ b_J}, K̂ = {K ∈ Z^n : A_K K ≥ b_K}, L̂ = {L ∈ Z^n : A_L L ≥ b_L}

The embedding transformation for the case in which equations are of the form x[I] = y[QI − d] can be extended from results in [TT02a]. The following example illustrates the above transformations.

Example 3.4.1 Let the initial program prg be

〈‖ i1, i2 : 0 ≤ i1, i2 < N ::
a[i1, i2] = y[i1 − 1, i2] if i1 ≥ 1〉

Let the tiling matrices for the co-partitioning of the index space defined by the above program be P_LSGP = ( L1 0 ; 0 N ) and P_LPGS = ( L2 0 ; 0 N ). The transformed program tiling(prg, P_LPGS, P_LSGP) looks as follows:

〈‖ i1, i2 : i1 = j1 + L1 · κ1 + L2 · l1 ∧ i2 = j2 + N · κ2 + N · l2 ∧ 0 ≤ j1 < L1
∧ 0 ≤ j2 < N ∧ 0 ≤ L1 κ1 < L2 ∧ 0 ≤ N κ2 < N ∧ 0 ≤ L2 l1 < N ∧ 0 ≤ N l2 < N ::
a[i1, i2] = y[i1 − 1, i2] if i1 ≥ 1 ∧ 0 ≤ i1, i2 < N〉
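The tiling equation i1 = j1 + L1·κ1 + L2·l1 can be checked with a small sketch that recovers the tile coordinates from a global index. The function `decompose` and its argument names are our own illustration, not part of the transformation:

```python
def decompose(i1, i2, L1, L2, N):
    """Recover (J, K, L) coordinates of the tiling of Example 3.4.1 from a
    global index (i1, i2): i1 = j1 + L1*kappa1 + L2*l1 with 0 <= j1 < L1
    and 0 <= L1*kappa1 < L2; the second dimension is a single tile of size N."""
    j1, kappa1, l1 = i1 % L1, (i1 % L2) // L1, i1 // L2
    j2, kappa2, l2 = i2 % N, 0, 0
    # sanity check: the decomposition reproduces the global index
    assert i1 == j1 + L1 * kappa1 + L2 * l1 and i2 == j2 + N * kappa2 + N * l2
    return (j1, j2), (kappa1, kappa2), (l1, l2)

print(decompose(5, 3, L1=2, L2=4, N=8))  # -> ((1, 3), (0, 0), (1, 0))
```

With L1 = 2, L2 = 4, N = 8 (the instance used in Figure 3.6), index i1 = 5 falls into the second GS tile (l1 = 1), its first LS tile (κ1 = 0), at intra-tile offset j1 = 1.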

31

Page 50: Mapping of Hierarchically Partitioned Regular ... - Informatik … · INSTITUT FUR INFORMATIK 12 ... Prof. Dr. Ulrich Rude ... My sincere thanks to my guide Frank Hannig who trusted

3 Partitioning

For co-partitioning to be valid, L2 > L1. Then, on application of the embedding as in Definition 3.4.1,

Θ = {λ1, λ2 ∈ Z^n : −e < λ1 + P_LSGP^−1 d < e ∧ −e < λ2 − P_LPGS^−1 P_LSGP λ1 < e}
  = { ((0, 0)^T, (0, 0)^T), ((−1, 0)^T, (0, 0)^T), ((−1, 0)^T, (−1, 0)^T) }

The three elements of Θ have the corresponding condition spaces according to Definition 3.4.1:

I1((0, 0)^T, (0, 0)^T) = {(j1, j2)^T, (k1, k2)^T, (l1, l2)^T : j1 ≥ 1 ∧ j1 + k1 + l1 ≥ 1}

I1((−1, 0)^T, (0, 0)^T) = {(j1, j2)^T, (k1, k2)^T, (l1, l2)^T : j1 < 1 ∧ k1 ≥ 1 ∧ j1 + k1 + l1 ≥ 1}

I1((−1, 0)^T, (−1, 0)^T) = {(j1, j2)^T, (k1, k2)^T, (l1, l2)^T : j1 < 1 ∧ k1 < 1 ∧ j1 + k1 + l1 ≥ 1}

Therefore the transformed program according to Definition 3.4.1 is given as follows:

〈‖ j1, j2, k1, k2, l1, l2 : 0 ≤ j1 < L1 ∧ 0 ≤ j2 < N ∧ 0 ≤ L1 κ1 < L2 ∧ 0 ≤ N κ2 < N
∧ 0 ≤ L2 l1 < N ∧ 0 ≤ N l2 < N ::
a[j1, j2, k1, k2, l1, l2] = a[j1 − 1, j2, k1, k2, l1, l2]
  if j1 ≥ 1 ∧ j1 + k1 + l1 ≥ 1
a[j1, j2, k1, k2, l1, l2] = a[j1 − 1 + L1, j2, k1 − L1, k2, l1, l2]
  if j1 < 1 ∧ k1 ≥ 1 ∧ j1 + k1 + l1 ≥ 1
a[j1, j2, k1, k2, l1, l2] = a[j1 − 1 + L1, j2, k1 − 1 + L2, k2, l1 − L2, l2]
  if j1 < 1 ∧ k1 < 1 ∧ l1 ≥ 1 ∧ j1 + k1 + l1 ≥ 1〉

In the final step, the above linearly bounded lattice with respect to the variables K and L needs to be converted into a polytope. The last step of normalization therefore gives us the final program, on which the allocation and scheduling operations can be carried out.

〈‖ j1, j2, k1, k2, l1, l2 : 0 ≤ j1 < L1 ∧ 0 ≤ j2 < N ∧ 0 ≤ k1 < L2/L1 ∧ 0 ≤ k2 < 1
∧ 0 ≤ l1 < N/L2 ∧ 0 ≤ l2 < 1 ::
a[j1, j2, k1, k2, l1, l2] = a[j1 − 1, j2, k1, k2, l1, l2]
  if j1 ≥ 1 ∧ j1 + L1 k1 + L2 l1 ≥ 1
a[j1, j2, k1, k2, l1, l2] = a[j1 − 1 + L1, j2, k1 − 1, k2, l1, l2]
  if j1 < 1 ∧ k1 ≥ 1 ∧ j1 + L1 k1 + L2 l1 ≥ 1 (3.6)
a[j1, j2, k1, k2, l1, l2] = a[j1 − 1 + L1, j2, k1 − 1 + L2/L1, k2, l1 − 1, l2]
  if j1 < 1 ∧ k1 < 1 ∧ l1 ≥ 1 ∧ j1 + L1 k1 + L2 l1 ≥ 1〉
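The three cases of program (3.6) can be checked numerically: each displaced dependence must route back to the iteration whose global index is i1 − 1. A brute-force sketch over the instance N = 8, L1 = 2, L2 = 4 (the helper name `global_i1` is ours):

```python
def global_i1(j1, k1, l1, L1, L2):
    """Global index of dimension 1 in the normalized coordinates of (3.6)."""
    return j1 + L1 * k1 + L2 * l1

# For every iteration with i1 >= 1, apply the matching case of program (3.6)
# and verify that the source index corresponds to the global index i1 - 1.
N, L1, L2 = 8, 2, 4
for l1 in range(N // L2):
    for k1 in range(L2 // L1):
        for j1 in range(L1):
            i1 = global_i1(j1, k1, l1, L1, L2)
            if i1 < 1:
                continue  # predicate j1 + L1*k1 + L2*l1 >= 1 fails: boundary input
            if j1 >= 1:                 # case 1: intra-tile dependence
                src = (j1 - 1, k1, l1)
            elif k1 >= 1:               # case 2: crosses an LS tile border
                src = (j1 - 1 + L1, k1 - 1, l1)
            else:                       # case 3: crosses a GS tile border
                src = (j1 - 1 + L1, k1 - 1 + L2 // L1, l1 - 1)
            assert global_i1(*src, L1, L2) == i1 - 1
print("all displaced dependencies map to i1 - 1")
```

The same check succeeds for any L1 | L2 | N, since the case split merely re-expresses the constant dependence vector in the tiled coordinates.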

The final program above is illustrated in Figure 3.6, where the dependence graph is depicted for N = 8, L1 = 2, L2 = 4; the tiling matrices are thus P_LSGP = ( 2 0 ; 0 8 ) and P_LPGS = ( 4 0 ; 0 8 ). The quantifications in program (3.6) can be verified in Figure 3.6.

Figure 3.6: a) Localized regular program, b) Partitioned pre-localized program

To summarize, this subsection introduced the appropriate extensions of the embedding and normalization transformations for partitioning an already localized regular algorithm.

3.5 Partial Localization

A typical design flow for systolic array processing is shown in Figure 3.7. The parallelized loop program is localized to convert global data dependencies into local ones.


The localization process results in local nearest-neighbor processor communication. A space-time mapping is then carried out to obtain a full-size array. However, the number of available processing elements is usually limited by technological or architectural constraints. Therefore, partitioning is carried out to match the resource constraints, and the space-time mapping is then applied to obtain a reduced-size array. However, localization prior to partitioning introduces unnecessary copy operations and additionally restricts the optimal schedules available after partitioning [TT02a]. A new design flow that applies partitioning before the localization transformation eliminates the above-mentioned disadvantages. The concept of this partial localization entails localizing only the intra-tile dependencies in the case of LPGS partitioning, only the inter-tile dependencies in the case of LSGP partitioning [TT02a], and a partial localization for intermediate schemes. The new design flow is depicted in Figure 3.8. In this section, the concept of partial localization is extended to hierarchical partitioning techniques, co-partitioning in particular.

Figure 3.7: Conventional design flow for mapping algorithms to processor arrays

The following example taken from [TT02a] illustrates the new design flow.

Example 3.5.1 The example parallelized program is

〈‖ i, j : 0 ≤ i, j < N :: x[i, j] = y[0, 0]〉


Figure 3.8: New design flow for mapping algorithms to processor arrays

The corresponding dependence graph is shown in Figure 3.9a) with N = 6. The program after localization looks as follows, and the corresponding dependence graph is shown in Figure 3.9b). The value of y(0, 0) is propagated along the iteration index space using the propagation vectors (0 1)^T and (1 0)^T. The index space is partitioned into tiles using the partitioning matrix P = ( 3 0 ; 0 3 ), i.e., the tiles are squares of dimension M × M, M = 3.

〈‖ i, j : 0 ≤ i, j < N :: x[i, j] = z[i, j]
z[i, j] = z[i, j − 1] if i = 0 ∧ j ≥ 1
z[i, j] = z[i − 1, j] if i ≥ 1
z[i, j] = y[0, 0] if i = 0 ∧ j = 0〉

The space-time mapping in the case of LSGP partitioning replaces each tile by a processor. This leads to a reduced fixed-size array of size 2 × 2. The advantage of localization is shown in Figures 3.9c) and 3.9d), which depict the resulting processor arrays for the algorithms corresponding to the dependence graphs in Figures 3.9a) and 3.9b), respectively. The global communication lines are replaced by nearest-neighbor local communication. The white nodes in Figure 3.9b) correspond to the intermediate propagation variable z introduced by localization. The annotated numbers in Figure 3.9b) are
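The effect of localization can be simulated directly: once the propagation equations have run, every x[i, j] must equal y[0, 0]. A minimal sketch (NumPy, our own variable names):

```python
import numpy as np

# Simulate the localized program: y(0,0) is propagated with the vectors
# (0 1)^T and (1 0)^T, so every x[i, j] must end up equal to y[0, 0].
N, y00 = 6, 42.0
z = np.empty((N, N))
for i in range(N):
    for j in range(N):
        if i == 0 and j == 0:
            z[i, j] = y00            # injection point
        elif i == 0:
            z[i, j] = z[i, j - 1]    # propagate along j on the first row
        else:
            z[i, j] = z[i - 1, j]    # propagate along i elsewhere
x = z.copy()                          # x[i, j] = z[i, j]
assert np.all(x == y00)
print("localization propagates y[0,0] to every iteration point")
```

The nested loops follow a valid schedule (each z value is read only after it is written), mirroring the lexicographic order of the dependence graph in Figure 3.9b).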


the execution times of the iterations as determined by an optimal linear schedule. Localization prior to partitioning introduces 9 redundant copy operations; the local memory overhead is therefore O(M²) [TT02a]. The other disadvantage of localization before partitioning is the reduced freedom in the selection of schedules. This can be seen in Figure 3.9b), where the execution of tiles (1 0)^T and (0 1)^T can begin only after M and M × (M − 1) time steps, respectively. Better schedules are therefore possible if, in this case, only the dependencies between the tiles are localized.

Figure 3.9: a) Dependence graph of the equation x[i, j] = y[0, 0] after LSGP partitioning, b) Dependence graph after localization and partitioning, c) Processor array for a), d) Processor array for b) (taken from [TT02a])

In the new design flow, partitioning takes place before localization. The localization then uniformizes only those dependencies responsible for communication between processor elements. This means that in the case of LSGP partitioning the inter-tile dependencies are localized, whereas for LPGS the intra-tile dependencies are localized.


Example 3.5.2 The example program is first partitioned and then partially localized, both for the LSGP and the LPGS partitioning scheme. The corresponding dependence graphs are depicted in Figures 3.10a) and 3.10b), and the resulting processor arrays are shown in Figures 3.10c) and 3.10d), respectively. The advantages of partial localization are clearly seen in the better schedules: for LSGP partitioning, the new methodology reduces the total execution time from 18 to 10 time steps. Another advantage is the removal of redundant copy operations.

Figure 3.10: a) Localization of inter-tile dependencies for the LSGP scheme, b) Localization of intra-tile dependencies for the LPGS scheme, c) Processor array for a), d) Processor array for b) (taken from [TT02a])

The embedding transformation was modified to split a dependency into an inter-tile and an intra-tile dependency. This transformation is extended to co-partitioning. Since co-partitioning is LPGS partitioning applied to an already LSGP-partitioned index space, localizing the intra-tile dependencies of the LPGS tiles is the same as localizing the inter-tile dependencies of the LSGP tiles. This fact is illustrated by Figure 3.11. The LS tiles are the tiles within the green boundary, which itself is a single GS tile. The partial localization of the inter-tile dependencies of the LS tiles is undertaken, and the communication of data between GS tiles is also localized.

Figure 3.11: a) Partial localization for co-partitioning, b) Resulting processor array

The embedding transformation for an equation z[I] = x[QI − d] is modified for co-partitioning as follows:

z[J, K, L] = x[Q(J + K + L) − d, 0, 0] ∀ J + K + L ∈ I (3.7)

This means the variable is embedded in the subspace K = 0, L = 0. As the dependencies originate from a single tile, no case-dependent analysis of dependencies is required. A three-fold splitting of Equation (3.7) is then undertaken to transform it into the equivalent system of four equations:

z[J, K, L] = z1[QJ, K, L] ∀ J + K + L ∈ I
z1[QJ, K, L] = z2[QJ, QK, L] ∀ J + K + L ∈ I


z2[QJ, QK, L] = z3[QJ, QK, QL] ∀ J + K + L ∈ I
z3[QJ, QK, QL] = x[Q(J + K + L) − d, 0, 0] ∀ J + K + L ∈ I

In co-partitioning, only the dependencies between z1 and z2 (second equation) and between z2 and z3 (third equation) should be localized. The transformation is the same as proposed in [Tei93].

Example 3.5.3 The example equation x[i, j] = y[0, 0] is co-partitioned, and partial localization is then applied according to the modified embedding transformation. We obtain

x[J, K, L] = z1[0, K, L] ∀ J + K + L ∈ I
z1[0, K, L] = z2[0, 0, L] ∀ J + K + L ∈ I
z2[0, 0, L] = z3[0, 0, 0] ∀ J + K + L ∈ I
z3[0, 0, 0] = y[0, 0, 0] ∀ J + K + L ∈ I

Only the second and third equations are localized. On partial localization we therefore obtain

z1[0, K, L] = z1[0, K − (1 0)^T, L] if k1 > 0
            = z1[0, K − (0 1)^T, L] if k1 = 0 ∧ k2 > 0
            = z2[0, K, L] if k1 = 0 ∧ k2 = 0

z2[0, 0, L] = z2[0, 0, L − (1 0)^T] if l1 > 0
            = z2[0, 0, L − (0 1)^T] if l1 = 0 ∧ l2 > 0
            = y[0, 0, 0] if l1 = 0 ∧ l2 = 0

The variablez3 is not necessary so the third equation is replaced byz2[0, 0, L] =

y[0, 0, 0]. The dependence graph of above algorithm is shown in Figure 3.11a). Forthe examplek2 and l1 is always zero, so the equations are further simplified. Thetiling matrices are

PLSGP =

(3 0

0 3

)PLPGS =

(6 0

0 3

)

The following space-time mapping results in the processor array and schedule depicted in Figure 3.11:

(p, t)^T = ( 0 0 1 0 ; 1 3 1 9 ) · (j1, j2, k1, l2)^T


4 Control Transformation

This chapter deals with the systematic generation of control units (i.e., local and global controllers) for processor arrays in the case of multi-projection and of partitioning techniques such as LPGS, LSGP, and co-partitioning. In the PARO design flow (see Figure 2.1), the control generation step specifies the control units and control signals. As a side benefit, it also defines the address generation unit (AGU) of each processing element. The major work of this master thesis was to develop a complete methodology of control generation for the mapping of regular algorithms onto processor arrays. The systematic design of control units for regular processor arrays was introduced in [TT91]. Another procedure for the systematic definition of control signals for the class of conditional uniform recurrence equations (CUREs) was introduced in [Xue92]. The first method is characterized by local control flow, problem-size independence, and optimization of the number of required control variables; however, it is restricted to simple space-time mappings obtained by a projection. Xue's methodology is problem-size dependent and likewise restricted to simple projections. Darte et al. introduced a memory-efficient but computationally expensive method for the automatic generation of control code in the case of LSGP partitioning and linear scheduling [Dar02]; only rectangular tiles are considered in the partitioning of the index space, and the method is not efficient for implementation in fine-grained processor arrays. That methodology has been implemented in a commercial compiler for the automatic synthesis of processor arrays that can be used as complex instruction set extensions of a Very Long Instruction Word (VLIW) processor; the tool has been tested successfully on a number of examples from the fields of image processing, signal processing, etc. [Syn].

In this chapter, a new technique of control generation is introduced which allows the systematic design of control units using a transformative approach. This technique not only allows the consideration of different parallelepiped tile shapes but also encompasses different partitioning techniques such as multi-projection, LSGP, LPGS, and co-partitioning. The proposed methodology


is also shown to be more area efficient than existing methodologies.

4.1 Why Control Generation?

Example 4.1.1 Consider the following C loop nest

forall (i1 = 0; i1 < 8; i1++)
  forall (i2 = i1; i2 < i1 + 8; i2++) {
    if (i1 == 0)
      a[i1][i2] = A[i2];
    else
      a[i1][i2] = a[i1-1][i2-1];
    if (i1 - i2 == 0)
      b[i1][i2] = B[i1];
    else
      b[i1][i2] = a[i1][i2-1];
    if (i2 != 9)
      c[i1][i2] = a[i1][i2] * b[i1][i2];
    else
      c[i1][i2] = a[i1][i2] - b[i1][i2];
  }

Figure 4.1 shows the dependence graph of the C program after co-partitioning. The processor array implementation of the co-partitioned example is given by the following space-time mapping (satisfying Definition 4.1.1):

(p1, p2, t)^T = ( 0 0 1 0 0 0 ; 0 0 0 1 0 0 ; 1 1 3 2 8 4 ) · (J, K, L)^T + (0, 0, 1)^T

Figure 4.1 indicates the optimal schedule, given by the execution time of each iteration point, and the mapping onto PEs.


Figure 4.1: The dataflow graph of the example C-code

The co-partitioning is done using the matrices P_LS = ( 0 2 ; 2 2 ) and P_GS = ( 0 2 ; 2 2 ).

The black index points indicate the dynamic selection of input from memory. The green points correspond to the iteration points satisfying the condition if (i2 != 9). After allocation and scheduling, a synchronous parallel form is obtained which specifies the processor and time index of execution for each iteration point. The tiles (i.e., LS tiles) within the dashed tiles are executed in parallel by four processors whose origins are given by the set K. The origins of the dashed tiles (i.e., co-partitions) are represented by L, and the cluster coordinates within the solid tiles are given by J. The exact enumeration of all elements of J, K, L is given in Equation (4.1). The successful execution of the above C loop nest requires comparisons to compute predicates involving J, K, L, where J = (j1 j2)^T, K = (k1 k2)^T, and L = (l1 l2)^T:

J = { J ∈ Z² : ( 1 0 ; −1 0 ; −1 1 ; 1 −1 ) · (j1, j2)^T ≥ (0, −1, 0, −1)^T }

K = { K ∈ Z² : ( 1 0 ; −1 0 ; −1 1 ; 1 −1 ) · (k1, k2)^T ≥ (0, −1, 0, −1)^T }


and

L = { L ∈ Z² : ( 1 0 ; −1 0 ; −1 1 ; 1 −1 ) · (l1, l2)^T ≥ (0, −1, 0, −1)^T }

Hence

J = {(0, 0), (1, 1), (0, 1), (1, 2)}, K = {(0, 0), (1, 1), (0, 1), (1, 2)}, and L = {(0, 0), (1, 1), (0, 1), (1, 2)} (4.1)
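The enumeration in (4.1) can be reproduced by listing the integer points of the polyhedron {x ∈ Z² : A·x ≥ b} shared by J, K and L; a brute-force sketch over a small search box (the box bounds are our own, chosen generously):

```python
import itertools
import numpy as np

# Constraint system shared by J, K and L in the example: A x >= b.
A = np.array([[1, 0], [-1, 0], [-1, 1], [1, -1]])
b = np.array([0, -1, 0, -1])

# Enumerate integer points in a box that safely contains the polyhedron.
pts = [p for p in itertools.product(range(-2, 4), repeat=2)
       if np.all(A @ np.array(p) >= b)]
print(sorted(pts))  # -> [(0, 0), (0, 1), (1, 1), (1, 2)]
```

The four points agree with (4.1); in general such lattice-point enumeration is what tools use to expand small condition spaces explicitly.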

To compute the predicate if (i2 != 9), one needs the cluster coordinates (j1, j2), the processor coordinates (k1, k2), and (l1, l2), because the PRA describing the co-partitioned example contains the predicate in the form If (j2 + 2 · k2 + 4 · l2 != 9). The problem of control looks simple; however, the catch is in computing the coordinates (j2, k2, l2) using only the space-time coordinates (p1, p2, t) given by the space-time mapping, i.e., given the processor and time index, appropriate control signals need to be generated for the successful execution of all iteration points. In other words, we need to generate control signals so that at times 11, 12, 15, 16 in PE3 and at times 14, 15 in PE0, we execute the subtraction operation as defined by the IF conditional.

Therefore, given t = j1 + j2 + 3 · k1 + 2 · k2 + 8 · l1 + 4 · l2, one needs to compute the predicate of our example, i2 != 9, i.e., j2 + 2 · k2 + 4 · l2 != 9. In other words, given the time, one needs to extract j1, j2, k1, k2, l1, l2 from the linear Diophantine equation so as to compute the predicate.
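The extraction problem can be illustrated by brute force: given only (p1, p2, t), search the finite sets J, K, L of (4.1) for coordinates consistent with the space-time mapping and evaluate the predicate. This sketch ignores the affine offset of the mapping, and the function name `predicate_values` is our own; because the schedule is valid, each (p, t) determines at most one iteration point:

```python
import itertools

# J, K and L each consist of the four points enumerated in (4.1).
PTS = [(0, 0), (0, 1), (1, 1), (1, 2)]

def predicate_values(p1, p2, t):
    """All values of the predicate j2 + 2*k2 + 4*l2 != 9 consistent with
    the observed processor index (p1, p2) = K and time index t."""
    vals = set()
    for (j1, j2), (k1, k2), (l1, l2) in itertools.product(PTS, PTS, PTS):
        if (k1, k2) != (p1, p2):
            continue  # space mapping: p = K
        if j1 + j2 + 3 * k1 + 2 * k2 + 8 * l1 + 4 * l2 != t:
            continue  # time mapping (affine offset omitted)
        vals.add(j2 + 2 * k2 + 4 * l2 != 9)
    return vals

print(predicate_values(1, 2, 14))
```

A hardware control unit cannot afford this search at run time, which is exactly why the control transformation of this chapter generates the predicates structurally instead.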

Definition 4.1.1 (Space-time mapping for co-partitioning). A space-time mapping in the case of co-partitioning is an affine transformation of the form

(p, t)^T = ( 0 E 0 ; λJ λK λL ) · (J, K, L)^T (4.2)

where E ∈ Z^(n−s)×nK is the unity matrix, λJ ∈ Z^(n−s)×nJ, λK ∈ Z^(n−s)×nK, λL ∈ Z^(n−s)×nL, and nJ + nK + nL = 3 · n.

The rigidity of the control problem comes from the fact that in t = λJ · J + λK · K + λL · L, the time t is known while J, K, L are unknown. We need to compute predicates of the form AJ · J + AK · K + AL · L ≥ b or AJ · J ≥ bJ ∧ AK · K ≥ bK ∧ AL · L ≥ bL. The main reasons for control generation can be summarized as follows:


• Removal of iteration-dependent conditionals to preserve the single assignment property. The iteration-dependent conditionals arise from localization, partitioning, or the initial program definition.

• Memory addresses are given by array elements whose indices are affine functions of the coordinates of J, K, L. The "live-out" values to be read from or written to global memory are determined by the addresses generated by the control units.

4.2 Control Models

The control unit is responsible for the dynamic selection of input data, i.e., it generates the control signals for the selection of input and output ports. The control unit also needs to supervise the function of the processing element. Each processing element may be distinguished by the presence of a local control unit. Present-day coarse-grained architectures are characterized by the presence of event registers and logic units for their processing; an example is the PACT-XPP architecture [PAC03]. The bold lines represent the control signals in the PACT-XPP architecture in Figure 4.2.

Figure 4.2: PACT-XPP64 architecture

Therefore, a generalized processing element containing a control unit can be illustrated as in Figure 4.3.

Different models of control were introduced in [TT91]. Processor arrays can contain the following models of control, all of which are illustrated in Figure 4.4.

Figure 4.3: Processing element

• Global model: In the global model of control, a central control unit is responsible for the generation of the individual control signals of all processing elements. The PEs do not contain a control path, i.e., a local controller is absent. However, the size and I/O rate of the central control unit are problem-size dependent, so this model is not feasible for high-throughput applications.

• Local model: Each processing element is assigned its own control unit. The local model of control is further classified into the pre-stored local control model and the propagating control model. The pre-stored local control model is found in VLIW processing elements, as used in the only commercial ASIC compiler [Syn]. The run-time control of iteration-dependent conditionals is performed using comparators, adders, and modulo counters to compute operations on index vectors; the control unit is therefore not purely combinational and can be constituted by a set of state machines or programmed controllers using dedicated counters and comparators. The complexity of the logic blocks in the control path depends on the application, the partitioning scheme, etc.; an application will have more complex control with an increased degree of hierarchical partitioning. The second model, the propagating control flow model, is characterized by the propagation of control variables, with or without modification, through each PE. As the conditionals are replaced by predicates, the control unit is purely combinational and therefore cannot be implemented using dedicated counters.

• Intermediate model: Some processing elements have their own control path. This model is characterized by the co-existence of both global and local controllers, and it saves cost compared to a purely local model. The global controller generates control signals which are uniform over all PEs; therefore, given a schedule, the global control signals can be propagated with adequate delays. The advantage is that the local controller in each PE does not need to implement the global control predicates, thereby reducing area complexity. Certain PEs are responsible for propagating local control signals to neighboring PEs executing the same function; these neighboring PEs do not contain any control unit. Here, a common programmed controller can be decoded by the control path of certain PEs, which then pass the result to neighboring processing elements without a control path. In our proposed methodology, the intermediate model of control is used with success.

Figure 4.4: a) Global model, b) Pre-stored local model, c) Propagating local model, d) Intermediate model

4.3 The Problem of Control Generation

The main aim of control generation is the replacement of iteration-dependent conditionals by predicates in control variables. For the formal description of the problem of control for different partitioning techniques, some definitions are introduced in this subsection.

Definition 4.3.1 LSGP and LPGS are tiling techniques in which partitioning is applied once, so that the index space I is decomposed into the spaces J and K, i.e., I 7→ J ⊕ K. Co-partitioning can be intuitively explained as applying LPGS partitioning to an already LSGP-tiled space, so the index space I is decomposed into the spaces J, K, L, i.e., I 7→ J ⊕ K ⊕ L. Similarly, an n-hierarchical partitioning decomposes the index space I into n + 1 spaces. The introduced methodology of control generation is generic and applicable to all partitioning techniques; the formal descriptions, however, take co-partitioning into account.
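The n-hierarchical decomposition can be sketched for a one-dimensional index: with nested tile extents s1 | s2 | . . ., an index splits into n + 1 coordinates. The function `hier_decompose` is our own illustration, under the assumption that each extent divides the next:

```python
def hier_decompose(i, sizes):
    """Split a 1-D index i into n+1 hierarchical coordinates for tile
    extents sizes = [s1, s2, ...] (each dividing the next), such that
    i = c0 + s1*c1 + s2*c2 + ... with every coordinate in its valid range."""
    coords, prev = [], 1
    for s in sizes:
        coords.append((i % s) // prev)
        prev = s
    coords.append(i // prev)
    return coords

# co-partitioning (two tiling levels) of index i = 13 with extents 2 and 4:
print(hier_decompose(13, [2, 4]))  # -> [1, 0, 3], i.e. 13 = 1 + 2*0 + 4*3
```

For co-partitioning this yields the triple (J, K, L) per dimension; each additional hierarchy level simply appends one more coordinate.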

47

Page 66: Mapping of Hierarchically Partitioned Regular ... - Informatik … · INSTITUT FUR INFORMATIK 12 ... Prof. Dr. Ulrich Rude ... My sincere thanks to my guide Frank Hannig who trusted

4 Control Transformation

Definition 4.3.2 The control space Ic defines the set of index points, with respect to the iteration vector, for which the control variables will be defined. Let the iteration vector I of the PLA (see Definition 2.1.1) after space-time mapping be I = (p t)^T, where p denotes the processor index and t denotes the sequencing (time) index. The control space Ic is defined as

Ic = { I = (p, t)^T ∈ Z^(n−s+1) : p ∈ P ∧ t ∈ T } (4.3)

P = { p ∈ Z^(n−s) : AK · p ≥ bK }
T = { t ∈ Z : t = λJ · ηJ + λK · p + λL · ηL :
AJ · ηJ ≥ bJ ∧ AK · p ≥ bK ∧ AL · ηL ≥ bL }

where the processor space and time space are given for co-partitioning. The definition of the matrices AK, AJ, AL follows from Lemma 3.2.1. The variables ηJ, ηL are simply the renamed variables of the spaces J and L after the space-time mapping. The processor index and time space definitions follow from the space mapping, i.e., P = K (see Definition 4.1.1). Therefore, the control space Ic is a linearly bounded lattice such that

Ic = { I = (p, t)^T ∈ Z^(n−s+1) :

( AK 0 ; −λK 1 ; λK −1 ; 0 0 ; 0 0 ) · (p, t)^T + ( 0 0 ; −λJ −λL ; λJ λL ; AJ 0 ; 0 AL ) · (ηJ, ηL)^T ≥ (bK, 0, 0, bJ, bL)^T }

The above inequalities follow from the processor space and time space definitions.

Therefore, the aim of control generation is, given a PLA of the following form, to extract the processor element definitions and to generate control for them:

Definition 4.3.3 Given an input PRA prg with control space Ic and control definition space I′ = (p ηJ ηL)^T, where the PRA is of the following form:

    〈 ‖ I : I ∈ Ic ∧ I′ : I′ ∈ I′ ::
        x_1[I] = F_1^1(. . . , x_j[I − d_{1,1}], . . .)        if I′ ∈ I_1^{1′}
        ...
        x_1[I] = F_1^{W1}(. . . , x_j[I − d_{1,W1}], . . .)    if I′ ∈ I_1^{W1′}
        ...
        x_k[I] = F_k^1(. . . , x_j[I − d_{k,1}], . . .)        if I′ ∈ I_k^{1′}
        ...
        x_k[I] = F_k^{Wk}(. . . , x_j[I − d_{k,Wk}], . . .)    if I′ ∈ I_k^{Wk′}
    〉

The PRA has k quantifications and, for each quantification, Wk equations. Recall that we are formally describing the problem of control for co-partitioning.

Definition 4.3.4 Given a prg of the above form, we define a control transformation producing the program prg1 = controltransformation(prg) as output. The program prg1 is of the following form:

    〈 ‖ I : I ∈ Ic ∧ I′ : I′ ∈ I′ ::
        x_1[I] = F_1(. . .)
        ...
        x_k[I] = F_k(. . .)
    〉

where

    F_1(. . .) = IF( I′ ∈ I_1^{1′},   F_1^1(. . . , x_j[I − d_{1,1}], . . .),
                     I′ ∈ I_1^{2′},   F_1^2(. . . , x_j[I − d_{1,2}], . . .),
                     ...
                     I′ ∈ I_1^{W1′}, F_1^{W1}(. . . , x_j[I − d_{1,W1}], . . .) )
    ...
    F_k(. . .) = IF( I′ ∈ I_k^{1′},   F_k^1(. . . , x_j[I − d_{k,1}], . . .),
                     I′ ∈ I_k^{2′},   F_k^2(. . . , x_j[I − d_{k,2}], . . .),
                     ...
                     I′ ∈ I_k^{Wk′}, F_k^{Wk}(. . . , x_j[I − d_{k,Wk}], . . .) )

The property of single assignment requires the condition spaces of a quantification, say k, to be mutually exclusive, i.e., I_k^s ∩ I_k^t = { } for all s ≠ t. The semantics of the IF function is a list of pairs, the first element of each pair defining a predicate and the second element being the corresponding expression. Therefore, if the j-th predicate for the l-th variable is true, then the variable x_l is equal to the j-th expression.
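The IF semantics can be mimicked in ordinary code (an illustrative sketch only; the names and the toy condition spaces are made up, not taken from the thesis):

```python
def make_F(cases):
    """Model of the IF(...) construct: 'cases' is a list of
    (predicate, expression) pairs whose predicates are mutually
    exclusive, as single assignment requires."""
    def F(I):
        for predicate, expression in cases:
            if predicate(I):
                return expression(I)
        return None  # I lies in no condition space
    return F

# toy quantification with two mutually exclusive condition spaces
F1 = make_F([(lambda I: I < 0, lambda I: -I),
             (lambda I: I >= 0, lambda I: 2 * I)])
assert F1(-3) == 3 and F1(4) == 8
```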

The hardware interpretation of the formal definition is processing elements with local control, as seen in Figure 4.5.

[Figure: a 3×3 processor array (PE1–PE9); inside each PE, a control unit (CU) evaluates the predicates (p ηJ ηL) ∈ I_k^{1′}, . . . , (p ηJ ηL) ∈ I_k^{Wk′} and a multiplexer selects among F_1, . . . , F_k for the datapath (DP) output x_k.]

Figure 4.5: Hardware interpretation of prg1

The iteration-dependent conditionals are calculated inside the processing elements. Informally, the above formal description is equivalent to a local control model, i.e., there is no global controller. There are only two possible types of control conditions, given in equations (4.4) and (4.5) on page 54. The iteration conditionals depending only on J, L (the origins of the GS tiles), i.e., conditionals of the form AL · L ≥ bL, AJ · J ≥ bJ, or AJ · J + AL · L ≥ b, do not depend on K (= P, the processor space) and are therefore uniform over the processor indices. The control definition space is defined by p, ηJ, ηL. Conditionals of this form can be computed outside of the processor array and the resulting control signals propagated through it; that is, they can be executed by a global controller. Henceforth we consider an intermediate model of control in which local and global control co-exist. The following formal description therefore describes the optimal control transformation under this intermediate model. The obvious advantage in area and logic complexity is obtained because each such predicate is calculated only once in the global controller instead of in the local controller of every PE.


Definition 4.3.5 The optimal control transformation produces the program prg2 = optimalcontroltransformation(prg) of the following form:

    〈 ‖ I : I ∈ Ic ∧ I′ : I′ ∈ I′ ::
        x_1[I] = F_1(. . .)
        ...
        x_k[I] = F_k(. . .)
    〉

where

    F_1(. . .) = IF( I′ ∈ I_1^{1′} ∪ exp_1^1(ctrlvar_1^1),         F_1^1(. . . , x_j[I − d_{1,1}], . . .),
                     I′ ∈ I_1^{2′} ∪ exp_1^2(ctrlvar_1^2),         F_1^2(. . . , x_j[I − d_{1,2}], . . .),
                     ...
                     I′ ∈ I_1^{W1′} ∪ exp_1^{W1}(ctrlvar_1^{W1}),  F_1^{W1}(. . . , x_j[I − d_{1,W1}], . . .) )
    ...
    F_k(. . .) = IF( I′ ∈ I_k^{1′} ∪ exp_k^1(ctrlvar_k^1),         F_k^1(. . . , x_j[I − d_{k,1}], . . .),
                     I′ ∈ I_k^{2′} ∪ exp_k^2(ctrlvar_k^2),         F_k^2(. . . , x_j[I − d_{k,2}], . . .),
                     ...
                     I′ ∈ I_k^{Wk′} ∪ exp_k^{Wk}(ctrlvar_k^{Wk}),  F_k^{Wk}(. . . , x_j[I − d_{k,Wk}], . . .) )

Therefore, for each variable x_k we have a list of control variables (global control) ctrlvar_k = ∪_{i=1..Wk} ctrlvar_k^i that are initialized outside the processor space and then propagated within the processor space. The computation of predicates of the form I′ ∈ I_k^{1′} (say) takes place in the local controller. The main aim of control generation is therefore to determine the global and local predicates. A more intuitive understanding of the formal definition can be derived from Figure 4.6.

4.3.1 Why is Problem of Control Generation Intractable?

The main problem of control generation for partitioned piecewise linear algorithms is the following: given a space-time mapping as in Definition 4.1.1 and an index space, we need to find the conditionals, i.e.,

[Figure: a 3×3 processor array (PE1–PE9) with a global controller (GC) distributing ctrlvar to the PEs; inside each PE, a control unit (CU) and a multiplexer select among F_1, . . . , F_k for the datapath (DP) output x.]

Figure 4.6: Hardware interpretation of prg2

Given I ∈ Ic, we need to compute predicates of the form I′ ∈ I_k^{Wk′} (say); i.e., given (p t)^T, the vector (p ηJ ηL)^T needs to be calculated for the predicate computation. The linear Diophantine equation

    t − λK · p = λJ · ηJ + λL · ηL

can be solved for ηJ, ηL; the possible values of ηJ, ηL are bounded.

The basic questions to be resolved for solving a linear Diophantine equation are how one can tell whether it has a solution and, if solutions exist, how one can find one or all of them under a set of constraints. In 1861, the problem was solved in full generality by H. J. S. Smith; a detailed account of the method can be found in a standard textbook [Hal86].
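Since the η values are bounded, a bounded solution set can also be found by plain enumeration. The following sketch illustrates this (the λ values and bounds below are invented for a hypothetical 1-D co-partitioning, purely for illustration; this is not the thesis' actual method):

```python
def solve_bounded(t, p, lam_J=1, lam_K=3, lam_L=9, nJ=3, nL=2):
    """Enumerate all bounded (eta_J, eta_L) with
    t - lam_K * p == lam_J * eta_J + lam_L * eta_L
    for eta_J in [0, nJ) and eta_L in [0, nL)."""
    rhs = t - lam_K * p
    return [(eJ, eL) for eJ in range(nJ) for eL in range(nL)
            if lam_J * eJ + lam_L * eL == rhs]

assert solve_bounded(10, 0) == [(1, 1)]   # 1*1 + 9*1 = 10
assert solve_bounded(4, 1) == [(1, 0)]    # rhs = 4 - 3 = 1
assert solve_bounded(2, 1) == []          # rhs = -1: no solution
```

The table-based and run-time variants of exactly this computation are what the following discussion shows to be infeasible in hardware.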

This direct approach involving the solution of Diophantine equations has the following disadvantages:

• We would need to maintain a table of pre-calculated control variables for each processing element over all execution time steps. The values of the table are obtained by calculating the predicates, where the index variables are obtained by solving Diophantine equations. The size of the table is proportional to the total execution time. Therefore, the notion of pre-stored control is infeasible in terms of hardware costs.

• The other way is to implement the run-time solution of linear Diophantine equations in hardware. The hardware costs associated with solving Diophantine equations using linear algebra methods are very high. This is against the philosophy of regular processor arrays.

Example 4.3.1 The index space in Figure 4.7 shows the execution ordering for an iteration space (tile) J. The execution ordering is obtained from the chosen feasible sequencing index, i.e., t = (3 4) · (j1 j2)^T. The conditional IF (j1 + j2 == 2) needs to be executed. Therefore, given t = 3 · j1 + 4 · j2, the problem is to solve for j1, j2. The IF conditional can then be computed using the calculated j1, j2. The control variable corresponding to the predicate is assigned true if and only if j1 + j2 = 2. Hence, if c1 is the control variable corresponding to our defined predicate, then c1 needs to be true at time steps 8, 9, 10.
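The claimed time steps can be checked by brute force (a sketch; the tile inequalities are the ones used for this same tile in the scanning example later in this chapter):

```python
# integer points of the tile, enumerated over a safely large box
tile = [(j1, j2) for j1 in range(-10, 11) for j2 in range(-10, 11)
        if -6 * j1 + 3 * j2 >= 0 and 3 * j1 + 3 * j2 >= 0
        and 6 * j1 - 3 * j2 + 26 >= 0 and -3 * j1 - 3 * j2 + 26 >= 0]
assert len(tile) == 27  # the tile holds 27 index points

# times at which the predicate j1 + j2 == 2 holds, under t = 3*j1 + 4*j2
times = sorted(3 * j1 + 4 * j2 for (j1, j2) in tile if j1 + j2 == 2)
assert times == [8, 9, 10]  # c1 must be true exactly at these steps
```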

[Figure: the index points (j1, j2) of one tile, each annotated with its execution time t = 3 · j1 + 4 · j2 (the values 0, 1, 2, 4, 5, 6, 8, . . . , 32; the times 3, 7, 14, 18, 25, 29 do not occur).]

Figure 4.7: Scheduling of example tile

4.4 Methodology: Control Generation

The methodology introduced here is not based on the direct approach, but on scanning the index space given the space-time mapping. Current techniques of control generation in commercial ASIC compilers are restricted to rectangular partitioning of the polytope, linear scheduling, and LSGP partitioning. The control generation methodology proposed in this section encompasses all partitioning techniques and tile shapes. The only major assumption is the application of a linear affine schedule. This is not a disadvantage, as it has been shown that linear scheduling under certain restrictions is nearly optimal [Dar91]. Our approach consists of the following four steps and introduces an optimal control transformation which is more efficient than existing approaches.

1. Determination of PE type.

2. Scanning of the partitioned polytope with the execution order determined by the sequencing index.

3. Initialization of local and global control variables.

4. Propagation of control and iteration variables.

4.4.1 Determination of PE Type

The main aims of the determination of the PE type are to

• separate the predicates which can be executed by the global controller. Without determination of the PE type, the methodology would implement the local control model, i.e., the predicates would be implemented in the local controller of each PE.

• classify processor regions executing the same functions. The advantage lies in customizing the local control unit of each PE.

The iterative control conditions in the co-partitioned space after space-time mapping are generally of the form

AJ · ηJ ≥ bJ ∧ AK · p ≥ bK ∧ AL · ηL ≥ bL (4.4)

AJ · ηJ + AK · p + AL · ηL ≥ b (4.5)

That is, the iterative conditionals in the control space, as shown in Definition 4.3.3, are of the above forms. The processing elements are classified in accordance with their control space. The major purpose of the classification of PE types is to customize each processing element to the functions it needs to execute. For example, in systolic arrays only the border PEs are supposed to communicate with memory. This kind of processor-based functionality can be inferred from the control space definitions of each quantification. The classification of PEs is done by partitioning the processor space into Q convex polytopes. If a conditional is of the form (4.4), it can be split into a processor-based conditional (i.e., on p) and time-based conditionals (on ηJ, ηL). The time-based conditionals are accounted for by the generation of global control and enable signals; the processor-based conditionals are accounted for by code generation from the PRA description. The optimal usage of resources requires the categorization of the processor-based conditionals into PE types. A conditional of type (4.5) can be converted into a union of conditionals of type (4.4); however, this entails large hardware costs during implementation. Conditionals of type (4.5) have to be included in every processing element description, the reason being that the calculation of the predicate depends on the processor index; these conditionals form the local controller. In other words, we have a number of equations assigning different variables, each of which has a different definition domain. What we would like to do is to deduce the regions of the processor array where all processors have exactly the same behavior, leading to optimal resource utilization. These "regions" classify the PE types, with the statements defining each PE type. The formal definition of the problem and an efficient solution are given in this subsection.

Starting with one iteration-dependent conditional for any one of the indexed variables, of the form

    x_k[I] = F_k^1(. . . , x_j[I − d_{1,k}], A[j], . . .)    if I′ ∈ I_k^{1′}

where

    I_k^{1′} = { I′ = (p ηJ ηL)^T ∈ Z^{n−s+1} : AJ · ηJ ≥ bJ ∧ AK · p ≥ bK ∧ AL · ηL ≥ bL },

the processor-based conditional is singled out from the iteration conditional of form (4.4), over all quantifications (1 . . . k) and equations (1 . . . Wk). Here ηJ, ηL are the renamed variables of J and L after space-time mapping.

    A_K^{1,1} · p ≥ b_K^{1,1}      i.e.  S_{1,1} : P_{1,1}
    A_K^{2,1} · p ≥ b_K^{2,1}      i.e.  S_{2,1} : P_{2,1}
    ...
    A_K^{k,Wk} · p ≥ b_K^{k,Wk}    i.e.  S_{k,Wk} : P_{k,Wk}

S_{2,1} : P_{2,1} says that the statement S_{2,1}, defining quantification 2 for equation 1, is valid in the polyhedron defined by the processor-based conditional P_{2,1}. Then the Q PE types PE_1, PE_2, . . . , PE_Q are given by the non-null intersections of the above P convex polytopes, such that

    ∪_{k=1}^{Q} PE_k = ∪_{i=1}^{P} P_i,

where P_i is the convex polytope obtained by intersecting the corresponding polyhedron with the context C_P, and C_P defines the processor array. The piecewise regular design thus allows the partitioning of the control space into different control spaces for different sub-domains of the processor index.

    I_c^q = { I = (p t)^T ∈ Z^{n−s+1} : p ∈ PE_q ∧ t ∈ T }

A point to be repeated is that conditionals of the form (4.5) are included in all control spaces.

An efficient algorithm for the classification of PE types is proposed. The initial algorithm was proposed by Quilleré et al. in [Qui00] for the generation of efficient nested loops from polyhedra. It recursively decomposes a union of polyhedra into imperfectly nested loops, taking lexicographic scanning into account. We modify the algorithm to solve the problem of processing element (PE) type classification for the different partitioning techniques.

Algorithm

Given a set of polyhedra, each with its corresponding statement, a set of disjoint polyhedra is created. That is, given as input a set of polyhedra with corresponding statements S = { S_i : P_i, ∀ 1 ≤ i ≤ P } (statement S_i valid in polyhedron P_i), the algorithm computes as output a set of Q disjoint polyhedra, each identified with q valid statements, i.e., T = { S_1, . . . , S_q : PE_k, ∀ 1 ≤ k ≤ Q }. The condition to be satisfied is ∪_{k=1}^{Q} PE_k = ∪_{i=1}^{P} P_i. These disjoint polyhedra define the PE types.

Algorithm: PE classification

Step 1: Initialization: T = { } and ∀ i = 1, . . . , P: P_i = P_i ∩ C_p, where C_p is the polyhedron defining the processor array as given in the PRA description.
Step 2: Initialize i = 1.
Step 3: Consider the polyhedron S_i : P_i ∈ S; then ∀ S_1, . . . , S_q : PE_k ∈ T, add S_1, . . . , S_q : (PE_k − P_i) to T_new.
Step 4: ∀ S_1, . . . , S_q : PE_k ∈ T, add S_1, . . . , S_q, S_i : (PE_k ∩ P_i) to T_new.
Step 5: Add S_i : (P_i − ∪_k PE_k) to T_new.
Step 6: If i < P, then i = i + 1, T = T_new, and go to Step 3; else go to Step 7.
Step 7: Do a lexicographic scan of the processor space C_p and classify, for each index, the corresponding PE type and statements by checking against the final set of disjoint polyhedra in T = {{S_1, . . . , S_q : PE_1}, . . . , {S_1, . . . , S_q : PE_Q}}.
Step 8: Add the statements corresponding to the conditionals of type (4.5) to all processor elements and add the time-based conditionals to the corresponding statements. The output form is shown in equation (4.12) on page 68.
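The steps above can be sketched in a few lines of Python, using finite sets of processor indices as a stand-in for the POLYLIB polyhedra (set difference and intersection play the roles of the polyhedral operations; all names and the example regions are illustrative only):

```python
def pe_classify(S, Cp):
    """S: list of (statement, region) pairs; Cp: processor array.
    Returns disjoint (statement_set, region) pairs covering the union."""
    S = [(s, P & Cp) for (s, P) in S]            # Step 1: clip to Cp
    T = []
    for (s, P) in S:                             # Steps 2-6
        Tnew, covered = [], set()
        for (stmts, PE) in T:
            Tnew.append((stmts, PE - P))         # Step 3
            Tnew.append((stmts | {s}, PE & P))   # Step 4
            covered |= PE
        Tnew.append(({s}, P - covered))          # Step 5
        T = [(st, pe) for (st, pe) in Tnew if pe]  # drop empty regions
    return T

# Example 4.4.1-style input: two overlapping regions in a 3x3 array
Cp = {(x, y) for x in range(3) for y in range(3)}
P1 = {(x, y) for x in range(2) for y in range(3)}      # left two columns
P2 = {(x, y) for x in range(1, 3) for y in range(3)}   # right two columns
T = pe_classify([("S1", P1), ("S2", P2)], Cp)

regions = [pe for (_, pe) in T]
assert len(T) == 3                      # S1 only, S2 only, S1 and S2
assert set().union(*regions) == Cp      # regions cover the union
# the centre PE executes both statements
assert [stmts for (stmts, pe) in T if (1, 1) in pe] == [{"S1", "S2"}]
```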

Example 4.4.1 The gist of the above algorithm is illustrated in this example. Given polyhedra with corresponding statements S_1 : P_1 and S_2 : P_2, and the processor space C_p, applying the algorithm with input T = { } and S = {{S_1 : P_1}, {S_2 : P_2}} yields the final T = {{S_1 : (P_1 ∩ C_p − P_2 ∩ C_p)}, {S_2 : (P_2 ∩ C_p − P_1 ∩ C_p)}, {S_1, S_2 : (P_1 ∩ C_p) ∩ (P_2 ∩ C_p)}}. The scanning of the processor space C_p assigns PE types to all processing elements; e.g., the processor element in the center is assigned the type defined by {S_1, S_2 : (P_1 ∩ C_p) ∩ (P_2 ∩ C_p)}. Therefore it executes both statements S_1 and S_2.

[Figure: two overlapping regions S1 : P1 and S2 : P2 inside the processor array Cp, partitioning Cp into PE types.]

Figure 4.8: PE type classification

The operations of union, intersection, etc. are provided by the POLYLIB library developed by Doran Wilde at IRISA. The library uses the Chernikova algorithm, as the polyhedra are described in terms of their generators (rays and vertices) instead of a system of inequalities [Qui00]. Many methods for scanning a polyhedron, mostly based on Fourier-Motzkin elimination, have been proposed in recent years. We use the approach introduced in the next section for scanning the processor space. The identification of processor elements with their respective polyhedra in T is carried out by adding another loop testing against each element in T.

4.4.2 Determination of Scanning Code

The major problem to be solved is the generation of scanning code corresponding to the execution order implied by the scheduling vector. This subsection deals with the automatic generation of scanning code for a given partitioning and scheduling vector. In the earlier subsection, the infeasibility of a table-based method, where all states and transitions are enumerated in an HDL specification, was discussed. In comparison, a counter-based approach is more modular and offers reuse of data across multiple datapath control units. Figure 4.9 illustrates the counter-based control model. The proposed control model consists of a counter which generates the scanning code, i.e., the index vectors in the order of their execution. The index vectors are sent to a global controller for the iteration conditionals that do not depend on the processor index. They are also sent directly to the local control in case an iteration-based conditional depends on the processor index p. We discuss the automatic generation of scan code in this section; in later sections, the combinational logic of global and local control and the propagation of control signals are explained.

Example 4.3.1 on page 53 can be used to illustrate the problem. The correct scanning code corresponding to the given scheduling vector for that example, as shown in Figure 4.7, is given in Table 4.1.

In Table 4.1, x denotes stall states. This example was deliberately chosen to illustrate the fact that for parallelepiped tiles the iteration interval (see Definition 2.1.4) is not constant. In contrast, rectangular tiles have constant iteration intervals under linear scheduling, so the construction of scan codes for rectangular tiles is an easy affair; the same is not true for parallelepiped tiles. The counter is supposed to produce the index values j1, j2 given in Table 4.1 at the respective times.

The scan code can be generated from the schedule vector and the loop matrix. The choice of the loop matrix is an essential input for finding an optimal schedule in the case of LSGP partitioning [Zha96], [Tei93]. The introduced methodology automatically generates the scan code given the loop matrix.

[Figure: a 3×3 processor array (PE1–PE9) fed by a counter, a global controller, and local control in each PE.]

Figure 4.9: Counter based control model

Definition 4.4.1 Loop Matrix
A loop matrix P = (p_1, p_2, . . . , p_s) ∈ Z^{s×s} determines the ordering of index points J′ on time t. Index points in the direction of p_1 are mapped side by side onto t, index points in the direction of p_2 are separated by blocks of points in direction p_1, and so on. The ordering is similar to a sequential nested loop program where loop index i_k corresponds to iterations in the direction of p_k. The innermost loop index is i_1, and the outermost loop index is i_s.

The selection criteria for the loop matrix and the feasible affine transformation for finding a feasible schedule vector are discussed in detail in [Tei93]. Our problem now reduces to finding a scanning code given a loop matrix. The synchronization of the scanning code with respect to the space-time mapping, needed for the generation of correct control signals, is also discussed.

The polyhedral scanning problem has been an intensive area of research in code generation since Ancourt and Irigoin solved the problem in their seminal paper [CI91]. In our case, the scanning problem reduces to finding a set of counters which produce the scanning order, i.e., which visit each point of the polyhedron exactly once. Most of the proposed work deals with code generation for loop transformations.

Time j1 j2   Time j1 j2   Time j1 j2   Time j1 j2   Time j1 j2
  0   0  0     8   0  2    16   0  4    24   0  6    32   0  8
  1  -1  1     9  -1  3    17  -1  5    25   x  x
  2  -2  2    10  -2  4    18   x  x    26   2  5
  3   x  x    11   1  2    19   1  4    27   1  6
  4   0  1    12   0  3    20   0  5    28   0  7
  5  -1  2    13  -1  4    21  -1  6    29   x  x
  6  -2  3    14   x  x    22   2  4    30   2  6
  7   x  x    15   1  3    23   1  5    31   1  7

Table 4.1: Scan code for example 4.3.1

As a result, all of this work until now has considered only the lexicographic order. However, as can be seen in Figure 4.7, the execution order specified by our scheduling vector does not correspond to scanning in lexicographic order, so this approach is inherently too limited for our scanning. The only work in the field of code generation that deals with a scanning order other than the lexicographic order was introduced by C. Bastoul [Bas03]. The main idea of that paper is that by transforming into another domain and scanning there in lexicographic order, one obtains a new scanning order after transforming back to the original domain. This idea is very intuitive for problems in the mathematical domain, where coordinate transformations and changes of basis are a powerful tool for solving problems in a simpler manner. The algorithm proposed here works on a similar principle: given the polyhedron to be scanned, we introduce a transformation to the basis defined by the loop matrix. The concept is explained with the help of the following example and Example 4.3.1.

Example 4.4.2 Figure 4.10(a) shows the relationship between the chosen loop matrix P, the scheduling vector λ, and the transformation matrix T:

    P = ( −3  3 ),    t = (3 4) · (j1 j2)^T,  λ = (3 4),    T = ( −2  1 )
        (  3  6 )                                                (  1  1 )

Lexicographic scanning in j1, j2 cannot lead to the execution order defined by the loop matrix P = (s1, s2), where s1 and s2 are the columns of P. The transformation to the domain defined by p1, p2 is done using the transformation matrix T, i.e., (p1 p2)^T = T · (j1 j2)^T. Then a lexicographic scanning in p1, p2 is implemented. Finally, the reverse transformation gives the required scanning which corresponds to the execution order. In Figure 4.10(b), the scanning is done in the transformed domain.

The circular points in Figure 4.10(b) have no images in the original domain. This can be verified from (j1 j2)^T = T^{−1} · (p1 p2)^T: the matrix T^{−1} has rational elements, hence the circular integer points have rational preimages in the original domain; these circular points are also known as holes. The purpose of scanning is now to avoid these holes, i.e., to scan the rectangular points by jumping over the circular points. Avoiding holes has been addressed in many papers using the Hermite Normal Form [Sch98]; a discussion of such a method can be found in [Ram95]. The method used in this master thesis calculates the lower bounds by solving a system of inequalities using PIP, together with the standard method for calculating non-unit strides. The method is first illustrated for our example and then given as a transformation.
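The holes can be made explicit numerically (a small sketch using exact rational arithmetic; for this particular T, an integer point (p1, p2) has an integral preimage exactly when p1 ≡ p2 (mod 3)):

```python
from fractions import Fraction

# T^-1 for T = ((-2, 1), (1, 1))
Tinv = [[Fraction(-1, 3), Fraction(1, 3)],
        [Fraction(1, 3), Fraction(2, 3)]]

def preimage(p1, p2):
    """(j1, j2) = T^-1 * (p1, p2), computed exactly."""
    j1 = Tinv[0][0] * p1 + Tinv[0][1] * p2
    j2 = Tinv[1][0] * p1 + Tinv[1][1] * p2
    return j1, j2

holes = [(p1, p2) for p1 in range(9) for p2 in range(9)
         if any(v.denominator != 1 for v in preimage(p1, p2))]
# holes are exactly the points whose coordinates differ mod 3
assert holes == [(p1, p2) for p1 in range(9) for p2 in range(9)
                 if (p1 - p2) % 3 != 0]
```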

[Figure: a) the tile of Figure 4.7 in the original domain (j1, j2), with the execution times t = 3 · j1 + 4 · j2 at the index points; b) the same points in the transformed domain (p1, p2), where the circular points (holes) have no integer preimage.]

Figure 4.10: a) Dependence of execution order on loop matrix, b) Transformed domain

The usual transformation in methods using the Hermite Normal Form changes the basis of the original polyhedron to the target polyhedron defined by the loop matrix. Therefore, the original polyhedron is transformed as

    A · x ≥ b   ⇒   A · T^{−1} · y ≥ b        (4.6)


The other transformation policy, as proposed in [Bas03], is of the form

    (  E  −T )   ( y )      ( 0 )
    ( −E   T ) · ( x )  ≥   ( 0 )        (4.7)
    (  0   A )              ( b )

This polyhedron depends on the iteration variables of both the original and the transformed polyhedron. The scanning code requires finding lower bounds and strides, which remain fixed. Having obtained the lower bounds and strides, one can generate counters for scanning the transformed domain as defined by the loop matrix. The inverse of the transformation matrix, T^{−1}, then gives the iteration vectors for scanning our original polyhedron. The problem of finding the lower bounds and strides is dealt with by parametric integer programming [Fea88].

The bound on the outermost loop in the transformed domain is obtained from the inequalities of the original domain. The lower bound of each remaining variable depends on the outer loop variables, constants, and parameters. The strides can be found from the system of constraints given by the transformation defined in equation (4.7). Another way is to compute the Hermite Normal Form (HNF) H of the transformation matrix T; the strides are then given by the diagonal elements of the matrix H. We use the Hermite Normal Form (see Definition 4.4.2) to find the strides.

Definition 4.4.2 Hermite Normal Form

If U is an m×n integer matrix with rank(U) = m, then there exists an n×n unimodular matrix C (i.e., det(C) = ±1) such that:

• U · C = (H, 0), where H is in Hermite Normal Form.

• H^{−1} · U is an integer matrix.

(H, 0) is called the Hermite Normal Form of U. In our case, as the transformation matrix is a non-singular matrix of size n×n, its Hermite Normal Form H also has size n×n.

The Hermite Normal Form can be found using a set of elementary column operations in polynomial time; an overview of the methods for finding the HNF can be found in [Sch98]. The algorithm for generating the scanning code is illustrated in the following example, which uses the polyhedron of Figure 4.10a). The transformation is later described as an algorithm.


Example 4.4.3 The transformation matrix T between the original and transformed polyhedra is

    T = ( −2  1 )
        (  1  1 )

The HNF H of T is

    H = ( 3  0 ),    C^{−1} = ( −2/3  1/3 )
        ( 0  1 )              (   1    1  )

where T = H · C^{−1} and C^{−1} is a unimodular matrix. The strides of p1, p2 are the diagonal elements of H, i.e., 3 and 1 respectively. The bounds of the outermost variable p2 are obtained from the usual transformation of the original polyhedron, i.e.,

    ( −6   3 )                                   (   0 )
    (  3   3 )     ( −1/3  1/3 )     ( p1 )      (   0 )
    (  6  −3 )  ·  (  1/3  2/3 )  ·  ( p2 )  ≥   ( −26 )
    ( −3  −3 )                                   ( −26 )
        A              T^{−1}           y            b

    ⇒     (  3   0 )                (   0 )
          (  0   3 )     ( p1 )     (   0 )
          ( −3   0 )  ·  ( p2 ) ≥   ( −26 )
          (  0  −3 )                ( −26 )

Therefore we obtain the bounds on p2 as 0 ≤ 3 · p2 ≤ 26. The bounds on p2 are used as the context for finding the lower bound of p1. The context is determined by the contents of the loop matrix, more precisely by its last column. The lower bound is the lexicographic minimum value of p1 in the polyhedron defined by the transformation policy, i.e.,

−6 · j1 + 3 · j2 ≥ 0

3 · j1 + 3 · j2 ≥ 0

6 · j1 − 3 · j2 + 26 ≥ 0

−3 · j1 − 3 · j2 + 26 ≥ 0

p2 − j1 − j2 = 0

p1 + 2 · j1 − j2 = 0

under the context:0 ≤ 3 · p2 ≤ 26.

Using Parametric Integer Programming (PIP) [Fea88], one can find the lexicographic minimum or maximum of a finite set of linear inequalities in a set of variables and parameters restricted to positive values. Taking p1, j1, j2 as unknowns, we can compute the lower bounds from the PIP output. The exact meaning of the PIP output can be verified in the PIP manual [Fea88].
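The matrices of Example 4.4.3 can be checked with exact rational arithmetic (a verification sketch, not part of the thesis flow; it confirms T = H · C^{−1}, the strides from the diagonal of H, and the simplified bound system A · T^{−1}):

```python
from fractions import Fraction

def matmul(A, B):
    """Plain dense matrix product over exact numbers."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

T = [[-2, 1], [1, 1]]
H = [[3, 0], [0, 1]]
Cinv = [[Fraction(-2, 3), Fraction(1, 3)],
        [Fraction(1, 1), Fraction(1, 1)]]
assert matmul(H, Cinv) == T                       # T = H * C^-1

Tinv = [[Fraction(-1, 3), Fraction(1, 3)],
        [Fraction(1, 3), Fraction(2, 3)]]
assert matmul(T, Tinv) == [[1, 0], [0, 1]]        # Tinv really is T^-1

A = [[-6, 3], [3, 3], [6, -3], [-3, -3]]
# A * T^-1 gives the decoupled bounds 0 <= 3*p1 <= 26, 0 <= 3*p2 <= 26
assert matmul(A, Tinv) == [[3, 0], [0, 3], [-3, 0], [0, -3]]

strides = [H[i][i] for i in range(2)]
assert strides == [3, 1]                          # strides of p1, p2
```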


Version E.2 $Revision: 1.3 $

(

****************** Comment *************************

(unknowns p1 j1 j2 parameter p2)

(Inequality:-6*j1 + 3*j2 >= 0

3*j1 + 3*j2 >= 0

6*j1 - 3*j2 + 26 >= 0

-3*j1 - 3*j2 + 26 >= 0

p2-j1-j2=0

p1 + 2*j1 -j2=0

context: 0 <= 3*p2 <= 26

)

(#unknowns, #parameters, #domain ineq, #context ineq,

index of big param, true = integer solution)

(p1, j1,j2, const, p2)

******************************************************

3 )(newparm 1 (div #[ 2 0]

3)

)

(newparm 2 (div #[ 0 1 0]

2)

)

(list #[ 1 0 -3 0]

#[ 0 0 1 0]

#[ 1 0 -1 0]

)

)

cross : 997, alloc : 1, compa : 14

PIP output for example 4.4.3

The values in the list corresponding to the iteration variables of the original polyhedron, j1, j2, are not considered. From the above output, the pseudo-code in Figure 4.11 for scanning is generated, and the values of j1, j2 are calculated using the inverse transformation. The generation of a counter specification in hardware is obvious once we have a description in terms of for loops. The pseudo-code counts all the black points in Figure 4.10b). The values produced by the loop can be verified against Table 4.1 on page 60.

for (p2 = 0; p2 <= 8; p2 = p2 + 1)
{
    lower = p2 − 3 · (((2 · p2) ÷ 3) ÷ 2);
    for (p1 = lower; p1 <= 8; p1 = p1 + 3)
    {
        j1 = (−p1 + p2) ÷ 3;
        j2 = (p1 + 2 · p2) ÷ 3;
    }
}

Figure 4.11: Pseudo scanning code for example 4.4.3
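A direct Python transcription of the pseudo-code (÷ is integer division; a sketch) reproduces the 27 index vectors of Table 4.1 in execution order:

```python
points = []
for p2 in range(0, 9):                      # 0 <= 3*p2 <= 26
    lower = p2 - 3 * (((2 * p2) // 3) // 2)
    for p1 in range(lower, 9, 3):           # stride 3 from the HNF
        j1 = (-p1 + p2) // 3                # inverse transformation T^-1
        j2 = (p1 + 2 * p2) // 3
        points.append((j1, j2))

assert len(points) == 27                    # each tile point visited once
assert points[:3] == [(0, 0), (-1, 1), (-2, 2)]
times = [3 * j1 + 4 * j2 for (j1, j2) in points]
# the points come out in increasing schedule time, from t = 0 to t = 32
assert times == sorted(times) and times[0] == 0 and times[-1] == 32
```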

The questions that remain to be answered are:

• How is the scanning code synchronized with time as determined by the scheduling vector?

• How is the transformation matrix T determined?

The scanning code needs to be synchronized with the time as determined by the scheduling vector. In the example, this corresponds to generating no values (stall states) at certain times, as shown in Table 4.1. This is accomplished by providing an enable signal that stops the counter at the requisite times; the enable mechanism is not required if there are no stall states. The enable mechanism is shown in Figure 4.12. The scan counter outputs the values of the iteration variables within the tile. The other counter is a time counter which runs from 0 to the execution time of the last index point in the tile, T_Tile. Both counters are reset when this last point is reached. A simple test of whether an enable mechanism is required: none is needed if the total time taken to compute a tile is a multiple of the number of points within the tile, i.e.,

    (T_Tile + 1) / |det(P)| = c,    c ∈ N        (4.8)

Where P is the tiling matrix.Ttile can be calculatedλJ · Jmax, whereJmax is the thelast index point to be executed. The first index point to be executed is always taken asthe origin. FindingJmax is again a integer programming problem as following, which

65

Page 84: Mapping of Hierarchically Partitioned Regular ... - Informatik … · INSTITUT FUR INFORMATIK 12 ... Prof. Dr. Ulrich Rude ... My sincere thanks to my guide Frank Hannig who trusted

4 Control Transformation

Figure 4.12: Description of the enable mechanism (a scan counter and a time counter running from 0 to T_tile feed a conditional unit that tests whether the schedule time s = 3·j1 + 4·j2 of the current scan value equals the current time t; if so, enable is asserted, and reset is produced when the time counter reaches T_tile)

is solved by calling PIP library functions in our transformed domain and multiplying the output with T^−1.

max : J
given : A_J · J ≥ b

In [Fea88], a first approach to finding the lexicographic maximum as an integer programming problem was introduced; a nice example of the problem is also given there.

In the given example, T_tile = 32 and

|det( −3  3
       3  6 )| = 27.

Since 32 − 1 is not a multiple of 27, condition (4.8) is violated and we need an enable mechanism. This measure is valid for linear affine schedules.

The selection of the transformation matrix T depends on the inverse of the chosen tiling matrix P. The transformation matrix is obtained by multiplying the inverse of the tiling matrix with the common denominator, as in Equation (4.9). The reason is that if the transformation matrix had rational elements, the images of the original polyhedra would not necessarily be integral.

T = σ · adj(P) / gcd(P_{i,j})      (4.9)

where σ = |det(P)| / det(P) and gcd(P_{i,j}) is the greatest common divisor of all elements of the tiling matrix. Using Equation (4.9), we can find the transformation matrix for any given tiling. In LSGP partitioning, the method enables scanning of the index points within a tile, i.e., J. In LPGS partitioning, it enables scanning of the origins of


tiles, i.e., elements of K. In co-partitioning, it helps scanning of the index points within LS tiles, i.e., elements of J, and of the origins of the global sequential tiles, i.e., elements of L. The point to be repeated is that the iteration variables are needed for control generation. The elements of K, J, and K for LSGP, LPGS, and co-partitioning, respectively, are taken care of by the processor element indexes as specified in the allocation. The problem of scanning was avoided until now by using rectangular tiles, which lead to an intuitive generation of control; the approach introduced in this section is able to handle the scanning problem for parallelepiped tiles. The above discussion of scanning is summarized in the following algorithm, which generates the counter part of the model proposed in Figure 4.9. The algorithm has to be applied to each of the index spaces not dependent on the processor index. Therefore, for co-partitioning one obtains two counters, one responsible for the space J and the other for L. Similarly, for LPGS and LSGP there is only one counter, responsible for K and J, respectively.

Algorithm: Counter generation

Step 1: If Adj(P) ≠ P, the transformation matrix is determined using Equation (4.9); else T = E, where E is the identity matrix.

Step 2: Determine the generation of the scanning code, given the loop matrix and the transformation matrix.

Step 2.1: Determine the non-unit strides from the Hermite normal form of the transformation matrix T.

Step 2.2: Determine the bounds of the variables in the transformed domain, P_t.

Step 2.3: Determine the values of the counter variables by the inverse transformation, i.e., I = T^−1 · P_t.

Step 2.4: Write down the counter description in terms of for loops.

Step 3: If condition (4.8) is not satisfied, generate an enable mechanism to synchronize the scanning code with the execution order determined by the scheduling vector; else update directly.

Step 3.1: Calculate T_tile, the time needed to execute the tile.

Step 3.2: Generate the time counter as in Figure 4.12, with 0 as lower bound and T_tile as upper bound. The counter updates every clock cycle.

Step 3.3: The conditional unit of Figure 4.12 is configured to produce enable if and only if the time corresponding to the scan code produced by the scan counter matches the time specified by the time counter. The reset signal is produced when the time counter reaches the upper bound T_tile.


Otherwise, i.e., if condition (4.8) is satisfied, the scan counter is simply updated every δ cycles, where δ is the execution time of an index point in the global dependence graph.

4.4.3 Control Unit

Figure 4.9 illustrates the plan for this section. The previous sections dealt with the problems of counter generation and PE classification; in this section, the design of the global and local control units is addressed. The goal of designing control units in the framework of Definitions 4.3.4 and 4.3.5 is met by the use of combinational logic. Definition 4.3.4 defines a local control unit for each processor element of the array with no global control; Definition 4.3.5 defines a local control unit for each processor element of the array along with a global control. We obtain a control unit in accordance with Definition 4.3.4 if we do not apply step 1 of our methodology, i.e., PE type classification. To repeat an earlier fact: for co-partitioning, the iteration-based conditionals fall into only two types, i.e.,

A_J · η_J ≥ b_J ∧ A_K · p ≥ b_K ∧ A_L · η_L ≥ b_L      (4.10)

A_J · η_J + A_K · p + A_L · η_L ≥ b      (4.11)

η_J and η_L are just the renamed index variables with respect to the variables in the index spaces J and L after the space-time mapping defined by co-partitioning. Conditionals of the second type have to be dealt with by the local controller. In case A_K = 0, a conditional of the first type can be handled by the global controller; otherwise the conditional is handled by both global and local control signals. After classification of PE types, each PE definition is of the following form.

⟨ ‖ I : I ∈ I_c ∧ I' : I' ∈ I' ::

    x_1[I] = F_1^1(..., x_j[I − d_{1,1}], ...)          If I' ∈ I'_1^1
      ...
    x_k[I] = F_k^1(..., x_j[I − d_{1,k}], ...)          If I' ∈ I'_k^1
      ...
    x_k[I] = F_k^{W_k}(..., x_j[I − d_{W_k,k}], ...)    If I' ∈ I'_k^{W_k}

    Type: Local                                                  (4.12)


    S_1^1 : x_s[I] = F_s^t(..., x_j[I − d_{t,s}], ...)    PE1 : If I'' ∈ P_1
      ...
    S_q^1 : x_u[I] = F_u^v(..., x_j[I − d_{v,u}], ...)    PE1 : If I'' ∈ P_q

    Type: Global

The definition corresponds to a processor element of type PE1. The first few statements have iteration-based conditionals of type (4.11), and they are included in all PE types. The iteration-based conditionals of these statements have to be evaluated within each processor element, i.e., by local control. The second class of statements, declared by S, is distinguished by conditionals of type (4.10), however with the processor-based part taken out, as in step (8) of the algorithm for PE classification. The index space I'' = {I = (η_J η_L)^T : η_J ∈ η_J ∧ η_L ∈ η_L}. This shows that these conditionals are purely time-based and can be evaluated outside the array, as they are independent of any processor-based conditional. Therefore, the optimal control transformation generates the control signals of type (4.10) in a global controller, from which they are propagated to the respective processors with the requisite delays.

The generation of a local control unit is discussed first. A simplification in understanding can be obtained by considering only a single variable with an iteration-dependent conditional of the form given in type:local.

If I' ∈ I'_1^1 = {I' = (η_J p η_L)^T ∈ Z^n : G·I' ≥ g ⇔ A_J·η_J + A_K·p + A_L·η_L ≥ b}      (4.13)

A point to note here is that if A_K = 0, the statement is moved to the global control part. If rank(G) = m, the control needs to check whether the vector I' is inside the polyhedron defined by the m inequalities in (4.13). This is done by generating a predicate in control variables, introducing m boolean variables c_i, where c_i is true ('1') only if G_i·I' ≥ g_i and false at all points outside this halfspace.

⟨ ∀I' : I' ∈ I' :: ⟨ ∧i : 1 ≤ i ≤ m :: c_i[I'] ⟩ ⇔ ⟨ ∧i : 1 ≤ i ≤ m :: G_i·I' ≥ g_i ⟩ ⟩

The vector I' contains the variables defining the scan code. This is a change from control generation in systolic arrays, where the control conditionals are given in terms of processor and time. The control variables are generated for each quantification in every variable; therefore, the control variables are responsible for the selection of the inputs of the variables. The mutual exclusivity of the conditionals within a variable allows us to decode the control variables. The local control variables need not be propagated to neighbouring PEs.


The output of the application of local control is a program for each PE type, shown in the next program for a processor element, say PE1. An intuitive illustration of the program is given in Figure 4.13.

⟨ ‖ I : I ∈ I_c ∧ I' : I' ∈ I' ::

Given η_J ∈ η_J and η_L ∈ η_L (scanning code) and p ∈ P (processor index of PE1):

    lctr_{k,i}^l[I] = 1 if {∃ I' s.t. G_{k,i}^l · I' ≥ g_{k,i}^l}
                    = 0 otherwise

    lctr_{k,i} = lctr_{k,i}^1 ∧ ... ∧ lctr_{k,i}^m

    C_k[I] = 0          if lctr_{k,1} ∨ gctr_{k,1} = 1
    C_k[I] = 1          elseif lctr_{k,2} ∨ gctr_{k,2} = 1
      ...
    C_k[I] = W_k − 1    elseif lctr_{k,W_k} ∨ gctr_{k,W_k} = 1

    x_1[I] = F_1(...)
      ...
    x_k[I] = F_k(...)

where

    F_k(...) = IF(C_k[I] == 0, F_k^1(var_k^1), ..., C_k[I] == W_k − 1, F_k^{W_k}(var_k^{W_k}))

The scanning code is propagated among the processor elements. For now, we assume in our program definition that η_J and η_L are given. The processor co-ordinate associated with each PE is a fixed vector p, and var_k^1 is the variable list of F_k^1. The mutual exclusiveness guaranteed by the correct-program assumption allows verification of the single assignment property. The hardware interpretation of the above program corresponds to the computation of boolean expressions and conditionals in the processor element; the circuit of the local control can hence be derived from the program as a combination of the testing of conditionals, arithmetic operations, and the computation of boolean expressions, as in Figure 4.13. The program as given produces the control in a 'one-hot encoding'. The number of control bits can be optimized by using a minimal binary encoding, which reduces the number of control bits from α to log2 α.

The generation of a local control unit for each PE follows from the above program. The global control unit is common to all PEs; its construction is derived from the type:global part shown in Equation (4.12).

Figure 4.13: Hardware interpretation of local control program (conditional units testing A_J · η_J + A_K · p + A_L · η_L ≥ b feed encoders, one per variable x_1 ... x_k, which drive the control path of the PE; the global control enters from outside the data path)

The transformation for constructing the global control unit collects the type:global conditionals from all PE definitions in a single program, with only a single occurrence of conditionals that are common across different processing elements. The control signals are generated as interpreted from this program. The global controller sends the control from the border of the processor array, as shown in Figure 4.9. Instead of broadcasting the global control signals to each PE, they are propagated efficiently using appropriate delays. The computation of the global control pursues the same methodology of checking whether the index vector I' lies in the polyhedron defined by the inequalities. The global control is independent of the processor index, therefore I' = (η_J η_L)^T. The case of different PE types having a common conditional is optimized by a common control signal, as shown below by the computation of the global control for the s-th quantification of the k-th variable for PE2; the other equation combines all other control signals in an AND operation. Alternatively, instead of calculating individual control signals for each hyperplane G_{k,1}^l · I' ≥ g_{k,1}^l, one can generate a single control signal by checking whether the index vector lies in the polyhedron G_{k,1} · I' ≥ g_{k,1}. In this case, a more complex conditional unit is obtained in exchange for the benefit of doing away with the AND operations. The hardware interpretation of the program can be tuned to the architecture. Fine-grained architectures are characterized by plentiful routing resources; in this case, the computation of the control signals of all the different PEs can be combined in a single global controller. In the case of coarse-grained architectures, characterized by fewer routing resources, the program can be


interpreted as one global controller per PE type, assuming that all PEs of the same type lie in a convex space. In this way, q global controllers, one per PE type, are obtained. The hardware interpretation of the global controller for the fine-grained case is shown in Figure 4.14.

⟨ ‖ I : I ∈ I_c ∧ I' : I' ∈ I' ::

Given η_J ∈ η_J and η_L ∈ η_L (scanning code):

PE1 :
    gctr_{k,1}^l[I] = 1 if {∃ I' s.t. G_{k,1}^l · I' ≥ g_{k,1}^l ⇔ A_J · η_J ≥ b_J ∧ A_L · η_L ≥ b_L}
                    = 0 otherwise
      ...
    gctr_{k,i} = gctr_{k,i}^1 ∧ ... ∧ gctr_{k,i}^m

PE2 :
    gctr_{k,1}^l[I] = 1 if {∃ I' s.t. G_{k,i}^l · I' ≥ g_{k,i}^l ⇔ A_J · η_J ≥ b_J ∧ A_L · η_L ≥ b_L}
                    = 0 otherwise
      ...
    gctr_{k,1} = gctr_{k,1}^1 ∧ ... ∧ gctr_{k,1}^m

    gctr_{k,s} = (PE1 : gctr_{k,1})    (If I'' ∈ S_1^1, PE1 : P_1 ⇔ S_2^s, PE2 : P_s, where P_1 = P_s)
      ...

PE_q :
      ...

The interpretation of the global controller uses arithmetic and comparison operations to generate the control signals. Unlike the memory resources, whose usage depends on the partitioning parameters (i.e., the tiling matrix), the size of the control unit is mostly independent of the partitioning parameters; the only dependence comes from the logarithm of the number of index points within a tile, which determines the counter bit widths. The local and global control units are also problem-size independent, as the number of control variables is independent of the number of index points in any index space. The location and propagation of the global control signals are discussed in the next section.

4.4.4 Propagation

The systolic model of computation is associated with short communication paths between PEs for data and control transfer. This not only has benefits in terms of clock


Figure 4.14: Hardware interpretation of global control program (a counter feeds q banks of conditional units testing A_J · η_J ≥ b_J ∧ A_L · η_L ≥ b_L, which together form the global controller)

speed, but also in memory and routing resources. In this section, the propagation of the scan code produced by the counter and of the global control signals to the neighbouring PEs is described. The local control signals need not be propagated, as they are individual to each PE. The propagation of control signals mainly concerns the number of delay registers and the direction of the propagation vector required for communication among neighbouring PEs.

The global controller is located next to the processor element which starts the execution. The co-ordinates of this processor are assumed to be (0,0) in the case of a 2-dimensional array; in the case of a 1-dimensional array, the obvious propagation vector is the only direction of communication. The localization of the control signals follows from gctr(p1, p2, t) = gctr(0, 0, t − λ_d), where λ_d is the number of delay registers. The equation says that the global control signals communicated to the first processor element must reach processor element (p1, p2) after time λ_{p1,p2}. The propagation vector for a PE (p1, p2) is selected by looking at the start time of execution λ_s of each neighbouring processor: the processor element with λ_s less than λ(p1, p2), the difference being smallest, is selected, and the communication link is the propagation vector d_p. In case of a tie, the propagation vector is the same as that of the neighbouring PE if they are of the same PE type; otherwise the selection is done at random. This heuristic leads to a minimal number of delay registers and makes the circuit regular. Once the propagation vector d_p is found using the above heuristic, the


number of delay registers is found as λ_d = λ_K · d_p. The partitioning causes the formation of uniform tiles; hence the processor space P is not a linearly bounded lattice, and d_p is limited to {(1,0), (0,1), (1,1)}. The input is the PE definition with local control. The only difference is for the processor with the earliest start time: in this case, the scan code and the global control signals are taken directly from the counter and the global controller, respectively. The program for a PE after application of propagation looks as follows.

⟨ ‖ I : I ∈ I_c ∧ I' : I' ∈ I' ::

    η_J[p, t] = η_J[p − d_p, t − λ_d]
    η_L[p, t] = η_L[p − d_p, t − λ_d]

    gctr_{k,1}[p, t] = gctr_{k,1}[p − d_p, t − λ_d]
      ...
    gctr_{k,W_k}[p, t] = gctr_{k,W_k}[p − d_p, t − λ_d]

    lctr_{k,i}^l[I] = 1 if {∃ I' s.t. G_{k,i}^l · I' ≥ g_{k,i}^l}
                    = 0 otherwise

    lctr_{k,i} = lctr_{k,i}^1 ∧ ... ∧ lctr_{k,i}^m

    C_k[I] = 0          if lctr_{k,1} ∨ gctr_{k,1} = 1
    C_k[I] = 1          elseif lctr_{k,2} ∨ gctr_{k,2} = 1
      ...
    C_k[I] = W_k − 1    elseif lctr_{k,W_k} ∨ gctr_{k,W_k} = 1

    x_1[I] = F_1(...)
      ...
    x_k[I] = F_k(...)

where

    F_k(...) = IF(C_k[I] == 0, F_k^1(var_k^1), ..., C_k[I] == W_k − 1, F_k^{W_k}(var_k^{W_k}))

The algorithm below summarizes the determination of the propagation vector in processor space and the requisite number of delays; the delays are used for the propagation of the global control signals and the scan code. The algorithm is illustrated in Example 4.4.4.


Algorithm: delay and propagation determination

Step 1: For all processing elements p ∈ P:

Step 2: For all propagation vectors d_p, determine the start time λ(p − d_p) using the scheduling vector, i.e., λ(p − d_p) = λ_K · (p − d_p). λ_K is the part of the schedule vector associated with the processor space.

Step 3: Select d_p s.t. λ_{p−d_p} < λ_p and λ_{p−d_p} > λ_{p−s} ∀ λ_{p−s} < λ_p, where s ∈ d_p; in other words, select the neighbouring PE with the biggest start time less than the start time of the considered PE.

Step 4: In case of a tie: if p and p − d_p are of the same PE type, then the propagation vector of p is the same as the propagation vector of PE p − d_p; else d_p is selected randomly.

Example 4.4.4 Given a 3 × 3 processor array with scheduling vector λ = (λ_J λ_K λ_L)^T and λ_K = (1 3)^T. On applying the above algorithm, one can verify the number of delay registers and the propagation vectors from Figure 4.15. The PE types are shown in the figure.

4.5 An Example: FIR Filter

The FIR filter is taken as an example to explain our methodology of control generation. FIR filtering is widely used in digital signal processing to process input signals. The simple C code for the loop part of a FIR filter is as follows.

for (i = 0; i < T; i++)
    for (j = 0; j < N; j++)
        y[i] += a[j] * u[i-j];

T is the number of input samples and N is the number of weights associated with the filter mask. The arrays a and u contain the weights of the mask and the input signals, respectively. The piecewise regular algorithm for the FIR filter after localization of the data dependencies looks as follows in our notation.


Figure 4.15: Example 4.4.4 (the counter and the global controller feed a 3 × 3 array of processor elements of types PE1-PE4 through delay registers D)

⟨ i, j : 0 ≤ i < T ∧ 0 ≤ j < N ::

    a[i, j] = a[i − 1, j]                      If i > 0
            = A_j                              If i = 0

    u[i, j] = u[i − 1, j − 1]                  If i > 0 ∧ j > 0
            = U_{i−j}                          If i = 0 ∨ j = 0

    y[i, j] = y[i, j − 1] + a[i, j] · u[i, j] ⟩

The partitioning is carried out on the localized PRA using the following matrices.

P_LS = [ 2 0 ]        P_GS = [ 4 0 ]
       [ 0 N ]               [ 0 N ]


Hence we obtain the following PRA. We assume throughout that N = 4.

⟨ j1, j2, k1, l1 : 0 ≤ j1 < 2 ∧ 0 ≤ j2 < 4 ∧ 0 ≤ k1 < 2 ∧ 0 ≤ l1 < ⌈T/4⌉ ::

    a[j1, j2, k1, l1] = a[j1 − 1, j2, k1, l1]           if j1 > 0
                      = a[j1 + 1, j2, k1 − 1, l1]        if k1 > 0 ∧ j1 = 0
                      = a[j1 + 1, j2, k1 + 1, l1 − 1]    if l1 > 0 ∧ k1 = 0 ∧ j1 = 0
                      = A_{j2}                           if l1 = 0 ∧ k1 = 0 ∧ j1 = 0

    u[j1, j2, k1, l1] = u[j1 − 1, j2 − 1, k1, l1]        if j1 > 0 ∧ j2 > 0
                      = u[j1 + 1, j2 − 1, k1 − 1, l1]    if k1 > 0 ∧ j1 = 0 ∧ j2 > 0
                      = u[j1 + 1, j2 − 1, k1 + 1, l1 − 1] if l1 > 0 ∧ k1 = 0 ∧ j1 = 0 ∧ j2 > 0
                      = U_{j1 + 2·k1 + 4·l1 − j2}        if l1 = 0 ∧ k1 = 0 ∧ j1 = 0 ∧ j2 > 0

    z[j1, j2, k1, l1] = a[j1, j2, k1, l1] · b[j1, j2, k1, l1]

    y[j1, j2, k1, l1] = y[j1, j2 − 1, k1, l1] + z[j1, j2, k1, l1]   if j2 > 0
                      = z[j1, j2, k1, l1]                           if j2 = 0 ⟩

The following non-optimal schedule is used to illustrate the complete trajectory of the control generation methodology; a simple inspection of the dependence graph shows the existence of a better schedule.

[ p ]   [ 0 0 1  0 ]   [ j1 ]   [ 0 ]
[ t ] = [ 4 1 5 10 ] · [ j2 ] + [ 1 ]
                       [ k1 ]
                       [ l1 ]

The space-time mapping shows that k1 is the processor index. The other iteration variables j1, j2, l1 are renamed as η1, η2, η3. The piecewise regular processor array obtained by the space-time mapping can be written as

⟨ η1, η2, p, η3 : 0 ≤ η1 < 2 ∧ 0 ≤ η2 < 4 ∧ 0 ≤ p < 2 ∧ 0 ≤ η3 < ⌈T/4⌉ ::


    a[p, t] = a[p, t − 4]                  if η1 > 0
            = a[p − 1, t − 1]              if p > 0 ∧ η1 = 0
            = a[p + 1, t − 1]              if η3 > 0 ∧ p = 0 ∧ η1 = 0
            = A_{η2}                       if η3 = 0 ∧ p = 0 ∧ η1 = 0

    u[p, t] = u[p, t − 5]                  if η1 > 0 ∧ η2 > 0
            = u[p − 1, t − 2]              if p > 0 ∧ η1 = 0 ∧ η2 > 0
            = u[p + 1, t − 2]              if η3 > 0 ∧ p = 0 ∧ η1 = 0 ∧ η2 > 0
            = U_{η1 + 2·p + 4·η3 − η2}     if η3 = 0 ∧ p = 0 ∧ η1 = 0 ∧ η2 > 0

    z[p, t] = a[p, t] · b[p, t]

    y[p, t] = y[p, t − 1] + z[p, t]        if η2 > 0
            = z[p, t]                      if η2 = 0 ⟩

4.5.1 Control Generation

The first step involves the classification of PE types. The application of step 1 (i.e., PE classification) yields two types of processing element: PE0, i.e., p = 0, and PE1, i.e., p > 0.

The second step involves the determination of the scan code. The index points j1, j2 ∈ J are executed locally sequentially; l1 ∈ L is executed globally sequentially. Therefore, we obtain two counters: counter_j, which corresponds to the scan code for the execution of the index points within an LS tile, and counter_l, which corresponds to the global execution of the co-partitions. The rectangular tile implies the scan code for counter_j to be

for (p1 = 0; p1 <= 1; p1 = p1 + 1)
{
    for (p2 = 0; p2 <= 3; p2 = p2 + 1)
    {
        η1 = p1;
        η2 = p2;
    }
}


The scan code for counter_l is given by

for (p1 = 0; p1 <= T/8; p1 = p1 + 1)
{
    η3 = p1;
}

The construction of the counter for the global control needs an enable mechanism. For the LS tile, T_LS = |det(P_LS)|; this means we do not need any internal enable mechanism for controlling counter_j. However, there are idle cycles for the PEs because of the suboptimal schedule; therefore, one needs a global enable mechanism for controlling counter_j. Also, the execution of the co-partitions is done at uniform intervals; therefore, no internal enable mechanism for counter_l is required. The counter for the controller is depicted in Figure 4.16.

Figure 4.16: Counter for the controller (counter_j and counter_l produce the scan code j1, j2, l1; a time counter running from 0 to T_max asserts the enable signal whenever 4·j1 + j2 + 10·l1 equals the current time)

T_max is obtained as T_max = 4·j1_max + j2_max + 10·l1_max, where j1_max, j2_max, l1_max are given by the lexicographic maximum of the iteration variables in J and L.

The generation of the local and global control now follows from step 3 of the


algorithm. After PE classification, one can write the definition for PE0 (p = 0) as

    a[p, t] = a[p, t − 4]                  if η1 > 0
            = a[p + 1, t − 1]              if η1 = 0 ∧ η3 > 0
            = A_{η2}                       if η3 = 0 ∧ η1 = 0

    u[p, t] = u[p, t − 5]                  if η1 > 0 ∧ η2 > 0
            = u[p + 1, t − 2]              if η3 > 0 ∧ η1 = 0 ∧ η2 > 0
            = U_{η1 + 2·p + 4·η3 − η2}     if η3 = 0 ∧ η1 = 0 ∧ η2 > 0

    z[p, t] = a[p, t] · b[p, t]

    y[p, t] = y[p, t − 1] + z[p, t]        if η2 > 0
            = z[p, t]                      if η2 = 0

Similarly, for PE1 (p > 0), the processor description is obtained as

    a[p, t] = a[p, t − 4]                  if η1 > 0
            = a[p − 1, t − 1]              if η1 = 0

    u[p, t] = u[p, t − 5]                  if η1 > 0 ∧ η2 > 0
            = u[p − 1, t − 2]              if η1 = 0 ∧ η2 > 0

    z[p, t] = a[p, t] · b[p, t]

    y[p, t] = y[p, t − 1] + z[p, t]        if η2 > 0
            = z[p, t]                      if η2 = 0

From the above processor descriptions it can be seen that all iteration-based conditionals are of type:global. However, the implementation in ArchitectureComposer assumes only the condition depending on η3 to be global. Therefore, using our methodology there is no need for local control in this specific example. The description of the global


controller can be obtained as follows

⟨ Given η1, η2, η3 (scanning code):

PE0 :
    gctr_{a,1} = 1 if {η1 > 0},                     = 0 otherwise
    gctr_{a,2} = 1 if {η1 = 0 ∧ η3 > 0},            = 0 otherwise
    gctr_{a,3} = 1 if {η1 = 0 ∧ η3 = 0},            = 0 otherwise
    gctr_{u,1} = 1 if {η1 > 0 ∧ η2 > 0},            = 0 otherwise
    gctr_{u,2} = 1 if {η3 > 0 ∧ η1 = 0 ∧ η2 > 0},   = 0 otherwise
    gctr_{u,3} = 1 if {η3 = 0 ∧ η1 = 0 ∧ η2 > 0},   = 0 otherwise

PE1 :
      ...

Common :
    gctr_{y,1} = 1 if {η2 > 0},                     = 0 otherwise
    gctr_{y,2} = 1 if {η2 = 0},                     = 0 otherwise

⟩

Here, for the present example, we see the use of a global controller producing the control for all PEs. The control signals are passed to the relevant PEs using propagation vectors and delays. An important observation for a better implementation is the identification of basic control predicates: for the FIR filter this means that, instead of computing all control predicates individually, it is a better idea to implement η1 > 0, η2 > 0, η3 > 0 in the global controller and then reconstruct the requisite control signals within the PE array. The implementation in ArchitectureComposer has a global controller implementing only η3 > 0; the other predicates are computed and then compared within the local controller. According to our methodology, however, the final description of PE0 is


obtained as

⟨ η1[0, t] = η1(counter_j)
  η2[0, t] = η2(counter_j)
  η3[0, t] = η3(counter_l)

  gctr_{a,1}[0, t] = gctr_{a,1}(global controller)
    ...

  C_a = 0    if gctr_{a,1} = 1
  C_a = 1    elseif gctr_{a,2} = 1
  C_a = 2    elseif gctr_{a,3} = 1
    ...

  a[0, t] = F_a(...)

where

  F_a(...) = IF(C_a == 0, a[p, t − 4], C_a == 1, a[p + 1, t − 1], C_a == 2, A_{η2})
    ...

Similarly, a description of PE1 (p > 0) can be obtained. Here the propagation of the scan code and the global control signals is done using the propagation vector d_p = 1 and λ_d = 5. The number of delays for the inter-processor communication of the scan code is given by step 4. The implementation of the FIR filter in ArchitectureComposer is shown in Appendix A.


5 Memory Consumption and Address Generation

Partitioning was introduced as a transformation to match the algorithm to hardware resource constraints. The availability of different partitioning techniques and their parameters leads to the problem of creating optimal processor arrays based on the following criteria.

• Memory: The partitioning of a dependence graph introduces the requirement of local memory and FIFOs, for local communication and for communication between the periphery and the processor array, respectively. The selection of a partitioning technique and of the tile shape and size influences the size of the memory and the number of accesses. The memory interface usually forms the bottleneck of processor array implementations, as it is much slower than the arithmetic units.

• Energy dissipation: The power consumption of a PE array depends on a) the PE architecture, i.e., the local memory as determined by the partitioning parameters and the functional data path, b) the number of PEs implemented on the array, and c) the energy dissipation of the I/O operations, which involve energy dissipation in pins and memory. Reducing the number of I/O operations considerably reduces the power consumption. A detailed study of the power estimation of regular processor arrays depending on the tiling parameters can be found in [DR02].

• Control logic cost in terms of area, power, latency.

• Address generation unit: The calculation of addresses is undertaken by custom or incremental address generation units. The size of an address generation unit is a major design criterion.

Therefore, efficient memory access is the most important criterion for the design of processor arrays. In this chapter, the effect of the different co-partitioning parameters


on the communication rate and the local memory is discussed. The optimal design of address generation units is also discussed.

5.1 Effect of Co-partitioning on Memory Requirements

Irigoin and Triolet [IT88] introduced the supernode partitioning technique for multiprocessors to match resource constraints in the number of processors and memory accesses. The effect of the size and shape of tiles on communication for multiprocessor systems was first studied in [JR92]. The problem of selecting an optimal partitioning matrix with the objective of minimum communication, and of a minimum ratio of communication to computation, was treated as a combinatorial problem in [BDRR94]. The effect of co-partitioning on the communication rate was also examined in [Sie03]. The contribution of this master thesis is the estimation of the local memory, the FIFOs, and the communication rate from the tiling matrices P_LSGP and P_LPGS in co-partitioning. The following example illustrates our problem.

Example 5.1.1 The implementation of the multiplication of two matrices of size 128 × 128 on a 2 × 2 processor array is shown in Figure 5.1. The co-partitioning tiling matrices are

P_LSGP = [ 4 0 0 ]        P_LPGS = [ 16  0 0 ]
         [ 0 4 0 ]                 [  0 16 0 ]
         [ 0 0 1 ]                 [  0  0 1 ]

In this section, formulas are introduced for the estimation of the size of the local memory within a PE, of the FIFOs for inter-processor communication, and of the FIFOs for wrap-around communication.

The size of the local memory of each PE is an important factor and must be accounted for in the determination of the tiling matrix P_LSGP, as the amount of local memory available to each processing element is limited. The local memory required depends on the internal dependences, i.e., those dependence vectors whose both ends lie within the same LSGP tile, i.e., within J in our co-partition notation. After co-partitioning, i.e., the embedding transformation (see Definition 3.4.1), a quantification

x[I] = F(z[I], ...)    ∀ I ∈ I
z[I] = y[I − d]        ∀ I ∈ I


Figure 5.1: PE array architecture for the co-partitioned matrix multiplication implementation (PEs with multiplier, adder, local memories for A and B, and FIFOs)

changes to the following form due to the decomposition of the index space:

x[J, K, L] = F(z[J, K, L], \ldots)
z[J, K, L] = y[J - d_J, K - d_K, L - d_L] \quad \forall\, J - d_J \in \mathcal{J} \wedge K - d_K \in \mathcal{K} \wedge L - d_L \in \mathcal{L}

Therefore the dependence vector d is given by the possible values of d_J, d_K, d_L in the decomposed index space. The estimation of memory involves consideration of the following three cases of dependence vectors.

• Case 1: d_K = 0 ∧ d_L = 0. The dependence vector d_J represents communication within an LSGP tile, i.e., within a single PE. Therefore, for the estimation of local memory, the dependence vector d_J needs to be considered.

• Case 2: d_K ≠ 0 ∧ d_L = 0. The dependence vectors d_K and d_J together represent communication between different LSGP partitions within a co-partition, i.e., communication between different PEs. The estimation of the FIFO_reg required for inter-processor communication, represented by the black registers in Figure 5.1, requires consideration of this case.

• Case 3: d_L ≠ 0. The dependence vectors d_L, d_K, and d_J are responsible for


communication between different co-partitions. The calculation of the amount of FIFO_back requires the consideration of this case.

Therefore, given an optimal scheduling vector λ = (λ_J λ_K λ_L)^T, the above three cases lead to the following exact formulas for memory.

Memory_{Local} = \sum_{i=0}^{n} \lambda_J \cdot d_{J_i}

The local memory involves the dependence vectors falling into Case 1. The number of dependence vectors of the given case type, totaled over all quantifications, is assumed to be n. The estimation of the number of registers required for inter-processor communication requires consideration of Case 2, as given by the following formula. The analogous assumption of m dependence vectors of Case 2 is made.

FIFO_{reg} = \sum_{i=0}^{m} \left( \lambda_J \cdot d_{J_i} + \lambda_K \cdot d_{K_i} \right)

The exact formula for the estimation of FIFO_back is given by consideration of Case 3.

FIFO_{back} = \sum_{i=0}^{n} \left( \lambda_J \cdot d_{J_i} + \lambda_K \cdot d_{K_i} + \lambda_L \cdot d_{L_i} \right)
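As a quick sanity check, the three formulas can be evaluated directly. The following Python sketch (helper names are ours) uses the schedule λ = (4 1 | 12 4 | 17) and the dependence d_J = (1 −3)^T, d_K = (0 1)^T from the Figure 5.2 example discussed below.

```python
# Sketch of the exact memory formulas; the schedule and dependence
# vectors are taken from the Figure 5.2 example in this chapter.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def memory_local(lam_J, deps):            # Case 1: d_K = 0 and d_L = 0
    return sum(dot(lam_J, dJ) for dJ in deps)

def fifo_reg(lam_J, lam_K, deps):         # Case 2: d_K != 0 and d_L = 0
    return sum(dot(lam_J, dJ) + dot(lam_K, dK) for dJ, dK in deps)

def fifo_back(lam_J, lam_K, lam_L, deps): # Case 3: d_L != 0
    return sum(dot(lam_J, dJ) + dot(lam_K, dK) + dot(lam_L, dL)
               for dJ, dK, dL in deps)

lam_J, lam_K, lam_L = (4, 1), (12, 4), (17,)
# d_J = (1 -3)^T and d_K = (0 1)^T for the inter-processor quantification
print(fifo_reg(lam_J, lam_K, [((1, -3), (0, 1))]))  # 4 - 3 + 4 = 5
```

For the schedule λ_J = (3 4) of Example 5.1.2, memory_local((3, 4), [(0, 3)]) evaluates to 12, the value the old formula gives there.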

There are two problems with the above exact formulas, which stem from the fact that the memory they estimate is nothing other than the number of time steps required for data reuse from source to destination.

• In the new design flow, partitioning is undertaken before space-time mapping. Therefore the formulas do not aid in finding optimal tiling matrices P_{LSGP} and P_{LPGS}, as scheduling is done after partitioning.

• Consider a co-partitioned dependence graph as shown in Figure 5.2a), with the start times of the different iteration points given by the space-time mapping as in Figure 5.2c). The number of registers is calculated for the inter-processor communication of data for the dependence d = (1 1)^T. The dependences crossing the smaller squares cause inter-processor communication. One of the quantifications requiring inter-processor communication is

x[j_1, j_2, k_1, k_2, l_2] = x[j_1 - 1, j_2 + 3, k_1, k_2 - 1, l_2] \quad \text{if } k_2 > 0 \wedge j_2 = 0.

This falls under Case 2, as d_J = (1 −3)^T, d_K = (0 1)^T, d_L = (0 0)^T. Using the formulas we obtain FIFO_reg = 5. However, as can be seen in Figures 5.2b) and d), of the five registers only a maximum of two are busy at any time step.


Figure 5.2: a) Dependence graph, b) processor array, c) space-time mapping (p t)^T = \begin{pmatrix} 0 & 0 & 1 & 1 & 0 \\ 4 & 1 & 12 & 4 & 17 \end{pmatrix} (J\ K\ L)^T, d) memory for inter-processor communication, e) optimal memory for inter-processor communication

5.1.1 Estimation of Local Memory

The above-mentioned problems are dealt with by taking the loop matrices P_1 and P_2, which determine the scheduling vector. The loop matrices are closely related to the tiling matrices, as they are column permutations of the tiling matrices. The following work finds the exact requirement of memory for an optimized PE array. The following assumptions are made for the validity of the estimation formulas.

• The tile size defined by the LSGP tiling matrix is assumed to be larger than the magnitude of any dependence vector. In other words, the source and sink of a dependence vector lie within the same tile or in neighbouring tiles.


• The dependence matrix D is non-singular, or an n × m matrix of rank n.

• If P_{LPGS} and P_{LSGP} are the tiling matrices for co-partitioning, G = P_{LPGS}, H = P_{LSGP}, and D is the dependence matrix, then HD ≥ 0 and GD ≥ 0. This assures that two distinct tiles or co-partitions are not mutually dependent on each other.

• All tiles and co-partitions satisfy the convexity condition.

Let the tiling matrix P_{LSGP} = (S_{1\cdot} \ldots S_{n\cdot})^T be an n × n matrix. The loop matrix P_1 is a column permutation of the tiling matrix P_{LSGP}. Then s_i = \gcd(s_{i1}, s_{i2}, \ldots, s_{in}) \ \forall\, i = 1, \ldots, n. The matrix DS is obtained by dividing each row of P_{LSGP}, i.e., S_{i\cdot}, by the corresponding s_i. Therefore DS = \{DS_{ij} = S_{ij}/s_i : \forall\, 1 \leq i \leq n \wedge 1 \leq j \leq n\}. The columns of the matrix DS, i.e., DS_i, give the set of extreme dependence vectors. For example, if

P_{LSGP} = \begin{pmatrix} -3 & 3 \\ 3 & 6 \end{pmatrix}, \quad \text{then } s_1 = 3 \text{ and } s_2 = 3 \ \Rightarrow\ DS = \begin{pmatrix} -1 & 1 \\ 1 & 2 \end{pmatrix}.

Therefore the dependence vectors (−1 1)^T and (1 2)^T form the basis of dependence vectors. As tiles are non-empty parallelepipeds, any other dependence vector can be expressed as a linear combination of the extreme dependence vectors, i.e., d_i = \sum_{i=1}^{n} \alpha_i \cdot DS_i. If p_{ij} is a permutation matrix, then the loop matrix P_1 = P_{LSGP} \cdot P_r, where P_r = \prod p_{ij}. For example,

P_1 = \begin{pmatrix} 3 & -3 \\ 6 & 3 \end{pmatrix} = P_{LSGP} \cdot p_{12}, \quad \text{where} \quad p_{12} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.
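The construction of the s_i and of DS can be sketched in a few lines of Python (the function name is ours):

```python
from math import gcd

def extreme_dependence_vectors(P):
    """Divide each row of the tiling matrix by the gcd of its entries;
    the columns of the result are the extreme dependence vectors."""
    DS = []
    for row in P:
        g = 0
        for x in row:
            g = gcd(g, abs(x))       # gcd of the absolute row entries
        DS.append([x // g for x in row])
    return DS

P_LSGP = [[-3, 3], [3, 6]]
print(extreme_dependence_vectors(P_LSGP))  # [[-1, 1], [1, 2]]
```

The columns (−1 1)^T and (1 2)^T of the printed matrix are exactly the extreme dependence vectors of the example above.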

The new matrix DT is obtained by multiplying DS with the permutation matrix, i.e., DT = DS \cdot P_r. This also permutes the basis dependence vectors. Therefore the columns of DT, i.e., dt_1, \ldots, dt_n, are the basis dependence vectors in rearranged order. Again, any dependence vector d can be expressed as a linear combination of the dependence vectors dt_i as follows:

d = \alpha_1 \cdot dt_1 + \alpha_2 \cdot dt_2 + \ldots + \alpha_n \cdot dt_n. \qquad (5.1)

This forms a system of equations of full rank. Let S be a diagonal matrix containing the s_i as its diagonal entries. Then ST is obtained by multiplying S with the permutation matrix, so every column i of ST has only one non-zero entry st_i. Therefore the values of \alpha_1, \ldots, \alpha_n can be found by solving the system of linear equations. The optimal exact


amount of local memory required within a PE for a dependence vector d is given by the following formula:

Memory_{Local} = \left\lfloor \alpha_1 + \alpha_2 \cdot st_1 \cdot \|dt_2\|_1 + \ldots + \alpha_n \cdot \prod_{i=1}^{n-1} st_i \cdot \|dt_n\|_1 \right\rfloor \qquad (5.2)

where \|dt_i\|_1 is the 1-norm distance, also colorfully known as the taxicab norm or Manhattan distance, because it is the distance a car would drive in a city laid out in square blocks (if there are no one-way streets). The total local memory is calculated by summing the local memory obtained for each individual dependence vector. The derivation of the exact formulas is illustrated for Example 4.3.1.

Example 5.1.2 Figure 4.10a) shows the relation between the loop matrix P_1 and the resulting schedule vector for the index space J. In particular,

P_{LPGS} = \begin{pmatrix} 3 & -3 \\ 6 & 3 \end{pmatrix}, \quad P_1 = \begin{pmatrix} -3 & 3 \\ 3 & 6 \end{pmatrix}, \quad \text{and} \quad \lambda_J = (3\ 4).

For the dependence vector d = (0 3)^T, the old formula gives the amount of local memory required for a PE as 3 \cdot 0 + 4 \cdot 3 = 12. However, the optimal amount of local memory should be 10. Using the new formulas, one obtains the permutation matrix P, DS, DT, and ST as follows:

P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad DS = \begin{pmatrix} 1 & -1 \\ 2 & 1 \end{pmatrix}, \quad DT = \begin{pmatrix} -1 & 1 \\ 1 & 2 \end{pmatrix}, \quad ST = \begin{pmatrix} 0 & 3 \\ 3 & 0 \end{pmatrix}

Therefore the dependence vector d = (0 3)^T is expressed as a linear combination of the basis dependence vectors dt_1 = (−1 1)^T and dt_2 = (1 2)^T using equation (5.1). Solving the system of linear equations, we obtain \alpha_1 = 1 and \alpha_2 = 1. Therefore, on using formula (5.2), we obtain

Memory_{Local} = \lfloor 1 + 1 \cdot 3 \cdot 3 \rfloor = 10.
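For n = 2, the linear system (5.1) can be solved with Cramer's rule and the result plugged into formula (5.2). The following Python sketch (function name is ours) reproduces the result of this example:

```python
from math import floor

def local_memory_2d(dt1, dt2, st1, d):
    """Solve d = a1*dt1 + a2*dt2 by Cramer's rule, then apply formula (5.2)
    for the two-dimensional case."""
    det = dt1[0] * dt2[1] - dt2[0] * dt1[1]
    a1 = (d[0] * dt2[1] - dt2[0] * d[1]) / det
    a2 = (dt1[0] * d[1] - d[0] * dt1[1]) / det
    norm1_dt2 = abs(dt2[0]) + abs(dt2[1])   # Manhattan norm ||dt2||_1
    return floor(a1 + a2 * st1 * norm1_dt2)

# dt1 = (-1 1)^T, dt2 = (1 2)^T, st1 = 3, d = (0 3)^T as in Example 5.1.2
print(local_memory_2d((-1, 1), (1, 2), 3, (0, 3)))  # 10
```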

Note that the formulas deal with internal dependences, i.e., all dependences satisfying Case 1. The exact calculation of local memory is important, as the tiling matrix P_{LSGP} can then be selected according to the available local memory.

5.1.2 Estimation of Memory for Inter-processor Communication

FIFO_reg is the memory associated with inter-processor communication. It depends strongly on the tile shape and size. First we introduce a formula which estimates


the upper bound on FIFO_reg. The exact amount of registers required can be found using an efficient enumeration of the condition spaces along with consideration of the loop matrix P_1. The inter-processor communication can be calculated before localization. Suppose D is the dependence matrix containing m dependence vectors and satisfying our assumptions, and P_{LPGS} and P_{LSGP} are the given n × n tiling matrices. The idea behind the following formula is that the memory required for inter-processor communication is proportional to the number of dependence vectors going from a node inside one LSGP tile to a node in another LSGP tile. This is representative of the amount of data transferred to each processor. Let H = P_{LSGP}^{-1} and G = P_{LPGS}^{-1}.

S_1 = \left( \frac{1}{\det(H)} \sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{k=1}^{n} h_{i,k} \cdot d_{k,j} \right) \cdot \frac{|\det(P_{LPGS})|}{|\det(P_{LSGP})|}

S_2 = \frac{1}{\det(G)} \sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{k=1}^{n} g_{i,k} \cdot d_{k,j}

\text{Total } FIFO_{reg} \leq S_1 - S_2 \qquad (5.3)

The basic idea behind the derivation of the above formula is that |\det(P_{LSGP})| \, (h_1 \cdot d) approximates the number of index points of an LS tile which form the source of a dependence vector d crossing the tile through the face subtended by the vectors p_2, \ldots, p_n, which are columns of P_{LSGP}. This is multiplied by the total number of processors, i.e., |\det(P_{LPGS})| / |\det(P_{LSGP})|. However, the data represented by dependence vectors which cross co-partitions are already taken care of by Case 3, i.e., FIFO_back. Therefore S_2, which approximates this amount, is subtracted from S_1 to give an upper bound on the memory required for inter-processor communication. The above formula is a worst-case estimation of the required memory, as it is exact only in the case of non-overlapping execution of tiles. In the case of optimal schedules with overlapping execution of tiles, however, it is far from an accurate memory estimate.

Example 5.1.3 The optimal amount of registers required for inter-processor communication for the dependence graph and array implementation as seen in Figure 5.2 is 7. On applying equation (5.3), S_1 = 32 and S_2 = 16, therefore FIFO_reg ≤ 16. This illustrates that the approximate estimation can be far off the mark. The implementation without optimization takes 13 registers. The advantage of the upper bound is exhibited in the case of the following space-time mapping for the same example:

\begin{pmatrix} p \\ t \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 & 1 & 0 \\ 4 & 1 & 12 & 16 & 20 \end{pmatrix} \begin{pmatrix} J \\ K \\ L \end{pmatrix}


This would give an approximate amount of registers required for the inefficient implementation of 31. The upper bound can then suggest either to use the efficient implementation or to change the scheduling vector. The second advantage is that the upper bound can be normalized by dividing by |\det(P_{LPGS})|; this can be used to compare different tiling matrices in terms of the memory required for inter-processor communication. This figure gives the ratio of communication to computation, as will be shown in a FIR filter case study.

5.1.3 Estimation of FIFOs and Off-chip Memory

The calculation of FIFO_back is necessary because it does not need to be an on-chip memory; therefore it may be decisive in terms of memory access cost. FIFO_back can be treated as a hierarchical memory. Here we estimate the exact amount of memory and the number of accesses required. However, we differentiate on the following basis.

• Some co-partitions are neighbours in space and time. The dependence vectors crossing these co-partitions communicate data that can be stored in on-chip memory, as the neighbourhood in time and space determines the memory that can be accommodated on-chip, e.g., as distributed memory on an FPGA. This memory we denote by FIFO_b.

• The dependence vectors crossing co-partitions that are neighbours in space but not in time would require a lot of memory. Therefore the corresponding data can be stored in a hierarchical memory, like different levels of caches, according to the sequencing vector. This memory is denoted by Mem_cache.

The concept is explained in Figure 5.3. Co-partitions 1 and 2 are neighbours in space and time according to the given space-time mapping. However, co-partitions 1 and 3 are neighbours in space but not in time. A dependence vector which crosses the face between co-partitions 1 and 3 requires much more memory than one between co-partitions 1 and 2. In the case of a large index space, the memory requirements can be fulfilled by SRAMs on an FPGA or a hierarchical cache structure.

Let P_2 be the loop matrix which determines the sequencing index for the execution of co-partitions, and let D be the dependence matrix. Let F = P_2^{-1}, and let P_{LPGS} and G have the same definitions as earlier. If the loop matrix P_2 = (S_{1\cdot} \ldots S_{n\cdot})^T, then s_i = \gcd(s_{i1}, s_{i2}, \ldots, s_{in}) \ \forall\, i = 1, \ldots, n. Then one can accurately approximate the required memory using the following formulas. The derivation of the formulas is based


Figure 5.3: Example dependence graph (co-partitions numbered 1–4, under the space-time mapping of Figure 5.2c))

on the ideas introduced for the estimation of local memory and of memory for inter-processor communication.

FIFO_b \leq \det(P_{LPGS}) \cdot \sum_{i=1}^{n} \frac{1}{c_1} \, f_1 \cdot d_i

Here f_1 is the vector along which the co-partitions are neighbours in space and time. c_1 is an integer such that c_1 S_{1\cdot} = P_{LPGS(k)}; in other words, the product of an integer with the first row S_{1\cdot} of P_2 such that it equals the kth column of P_{LPGS}. The estimate is exact only in the case of non-overlapping execution of co-partitions, and

Mem_{cache} \simeq \det(P_{LPGS}) \cdot \sum_{i=2}^{m} \sum_{j=1}^{n} \left( \frac{1}{c_i} \, f_i \cdot d_j \right) \prod_{k=1}^{i-1} s_k


Example 5.1.4 The dependence graph in Figure 5.3 has the same processor implementation as in Figure 5.2b). The missing part is the FIFO structure as shown in Figure 5.1. Given are

P_{LPGS} = \begin{pmatrix} 8 & 0 \\ 0 & 8 \end{pmatrix} \quad \text{and the loop matrix} \quad P_2 = \begin{pmatrix} 0 & 2 \\ 2 & 0 \end{pmatrix}.

In other words, the loop matrix says the co-partitions are first executed in the l_2 direction and then in the l_1 direction. The execution of the co-partitions overlaps; therefore, for the dependence matrix

D = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} \quad \text{and} \quad P_2^{-1} = \begin{pmatrix} 0 & \frac{1}{2} \\ \frac{1}{2} & 0 \end{pmatrix},

we further obtain c_1 = 2 and c_2 = 2. Therefore, using the formulas, we obtain the estimates FIFO_b = 16 and Mem_cache = 16. The optimal exact memory would be FIFO_b = 3 and Mem_cache = 10.

The bigger the index space, the closer Mem_cache converges to the value given by the formula. The formula for FIFO_b is just an upper bound. An exact estimation would require enumeration of the condition spaces obtained on localization for each quantification and the corresponding dependence vectors satisfying Case 3.

5.1.4 An Example: FIR Filter

Example 5.1.5 The formulas discussed in the previous sections are applied to the co-partitioned implementation of a 128-tap FIR filter with 256 input signals, assuming only rectangular LSGP tiles, i.e.,

P_{LSGP} = \begin{pmatrix} s_x & 0 \\ 0 & s_y \end{pmatrix}.

The local memory depends only on the LSGP tiling parameters. The selection of an optimal sequencing vector leads to local memory requirements as depicted in Figure 5.4. The dependence matrix is

D = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \end{pmatrix}.

The following general facts can be observed from Figure 5.4.

• The larger the size of the LS tile, as determined by P_{LSGP}, the larger the local memory required.

• The selection of rectangular tiles and the symmetric dependence vectors of the FIR filter algorithm cause the local memory requirement to be symmetric along square tiles, as seen in Figure 5.4.

• Optimal tiling parameters can be selected under hardware constraints on the availability of local memory; e.g., given the availability of only 26 registers as local memory, the optimal tiling matrix P_{LSGP} under resource


constraints can be found from the values of s_x and s_y. One optimal tiling matrix can be

P_{LSGP} = \begin{pmatrix} 12 & 0 \\ 0 & 12 \end{pmatrix}.

The determination of the optimal scheduling vector is done after partitioning. However, given the dependence matrix and the loop matrix, one can find the exact amount of local memory required according to the tiling parameters. Previous approaches to finding optimal tile sizes and shapes were based on linear program formulations minimizing inter-processor communication. For massively parallel architectures, the balancing of local memory and communication with external memory is more important. The given formulas are also valid in the case of parallelepiped tiles; therefore one is not restricted to rectangular (hyperrectangular) tiles.

Figure 5.4: Local memory vs. the tiling parameters s_x and s_y

Figure 5.5 depicts the dependence of the memory required for inter-processor communication on the number of processors and on the size of the LS tile, i.e., the number of iterations executed by a processor in a co-partition. The following conclusions can be drawn from the figure.

• The increase in memory required for inter-processor communication corresponds to det(P_{LSGP}), i.e., the number of iteration points within an LS tile.


• An increase in the number of processors does not necessarily increase the memory for communication. This can be observed where a processor array of size 2 × 7 requires more inter-processor communication memory than a processor array of size 4 × 4.

Figure 5.5: Memory (shift registers) for inter-processor communication vs. the number of processors and det(P_{LSGP})

Let hardware constraints restrict the implementation to a 2 × 2 or 1 × 4 processor array, where each processor element has a maximum local memory of 25 registers. The aim of co-partitioning is maximum data reuse through the use of local memory and minimal communication of the processor array with the periphery. However, the memory required for inter-processor communication needs to be balanced against the communication with the periphery. In this particular case, given the size of the processor array, one can determine the maximum amount of memory required per processor for inter-processor communication. Conversely, the influence of the processor array on the memory required for communication can be calculated. Therefore, hardware constraints on the availability of memory for inter-processor communication can determine the size of the processor array and hence the tiling matrix P_{LPGS}.


5.2 Address Generation for Processor Arrays

Efficient address generation is a key problem for piecewise regular programs where the data is stored in a memory. Applications in signal and image processing, etc., are characterized by the need for efficient address generation to achieve real-time demands. The calculation of memory addresses involves linear expressions as given in the PRA. The address calculation needs to be carried out fast, within cycle constraints; otherwise it forms a bottleneck in the final implementation. The use of specialized programmable address generation units for signal processing has been discussed in [Kit91]. Custom hardware for address generation adds to design cost and complexity. DSPs also have dedicated address generators with special addressing modes. The alternative is to use many small address generators, one for each processor element in the processor array and each individual index expression. Address generation for distributed memory architectures was developed as a methodology at IMEC under ADOPT (ADdress equation OPTimization and generation environment) [MM96]. The address generation for mapping the index expressions of a PRA onto a processor array derives from the ADOPT methodology.

There exist two different strategies for address generation:

• Incremental/decremental address generation unit: This architecture style is useful for PRAs characterized by regular memory access. This is particularly the case when the counter responsible for the generation of iteration vectors increments in only one or a few variables.

• Custom address calculation unit: This architecture style maps address expressions onto cheap adders, subtracters, and barrel shifters. This is useful when many iteration variables change at once or the memory access is highly irregular, making the incremental style of address generation costly in terms of mapping logic.

Example 5.2.1 We take the part of the PRA after space-time mapping of the matrix multiplication (see Chapter 6) responsible for reading the matrix values from memory, i.e.,

a[p_1, p_2, t] = A_{2 \cdot p_1 + \eta_1 + 4 \cdot \eta_3,\, \eta_5} \quad \text{if } p_2 = 0 \wedge \eta_2 = 0 \wedge \eta_4 = 0

b[p_1, p_2, t] = B_{\eta_5,\, 2 \cdot p_2 + \eta_2 + 4 \cdot \eta_4} \quad \text{if } \eta_1 = 0 \wedge p_1 = 0 \wedge \eta_3 = 0


The variables η_1, η_2, η_3, η_4, η_5 are just renamings of the iteration variables j_1, j_2, l_1, l_2, l_3, respectively, after space-time mapping. p_1, p_2 are the processor indices of the processing elements. The equations signify fetching the variables A[2·p_1 + η_1 + 4·η_3][η_5] and B[η_5][2·p_2 + η_2 + 4·η_4] in terms of array expressions. The step of control generation generates a global counter which scans the index domain given the space-time mapping, i.e., the counter produces the variables which are sequentially executed, namely j_1, j_2, l_1, l_2, and l_3. In other words, the global counter produces the values η_1, η_2, η_3, η_4, η_5, respectively, which are then sent to the corresponding PEs with the requisite delays. The matrices A and B are assumed to be stored in row-major form in different memories. This means the index expressions 2·p_1 + η_1 + 4·η_3 + N·η_5 and η_5 + N·(2·p_2 + η_2 + 4·η_4) need to be calculated. The assumption here is that two N × N matrices are multiplied, with N = 8. The values p_1, p_2 are individual to each PE. This means we need an address generation unit corresponding to each PE. However, as the processor-dependent if-conditionals show, the memory accesses are only needed for processing elements on the border. Therefore, for PE(0,0) the address generation unit can be of either incremental or custom type, as illustrated in Figure 5.6a) and b). The number in brackets indicates the time step of generation of the corresponding address. At other time steps the address generation unit (AGU) produces no output.

Figure 5.6: Architecture styles for the AGU: a) incremental AGU, b) custom AGU. The generated address sequence is A[η_1 + 4·η_3 + 8·η_5] = 0(1), 1(3), 4(9), 5(11), 8(17), 9(19), ...

This can be verified against the dataflow graph in Figure 6.2. The events e_{η_1}, e_{η_3}, e_{η_5} indicate an incremental change in the variables η_1, η_3, η_5, respectively. The corresponding value to be added is stored in the look-up table. The enable signal is determined by the corresponding if-conditional of the matrix A.
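For PE(0,0), the address/time sequence of Figure 5.6 follows from the space-time mapping of the case study in Chapter 6: the address is η_1 + 4·η_3 + 8·η_5, generated at time t = 2·η_1 + 8·η_3 + 16·η_5 + 1. A small Python sketch (function name is ours) reproduces it:

```python
# Reproduce the AGU output of Figure 5.6 for PE(0,0): address of A is
# eta1 + 4*eta3 + 8*eta5, generated at t = 2*eta1 + 8*eta3 + 16*eta5 + 1
# (derived from the space-time mapping of the matrix multiplication).
def agu_sequence():
    seq = []
    for eta5 in range(8):            # l3 runs over 0..7
        for eta3 in range(2):        # l1 runs over 0..1
            for eta1 in range(2):    # j1 runs over 0..1
                addr = eta1 + 4 * eta3 + 8 * eta5
                t = 2 * eta1 + 8 * eta3 + 16 * eta5 + 1
                seq.append((addr, t))
    return seq

print(agu_sequence()[:6])  # [(0, 1), (1, 3), (4, 9), (5, 11), (8, 17), (9, 19)]
```

The first six (address, time) pairs match the sequence 0(1), 1(3), 4(9), 5(11), 8(17), 9(19) shown in the figure.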


The address expressions need to be evaluated in a single cycle. Therefore, custom AGUs are associated with high hardware costs.

Measures of code optimization, such as identifying address clusters, and algebraic transformations, such as dead-code elimination, for sharing common address generation units are explored in [MM96]. The placement of AGUs within each processing element, with addresses calculated only on receiving the scan code, can form a bottleneck. In the proposed address generation strategy, the AGU is therefore taken out of the PEs. Another point to be repeated is that the space-time mapping is usually selected so that there is only one memory access per cycle. In the case of several accesses per memory, multi-ported memories have to be used. The AGU is generally a combination of both incremental and custom architectural styles.


6 A Case Study: Matrix Multiplication

Matrix multiplication is taken as an example to illustrate our design flow methodology and the efficiency of resource utilization as implemented on an FPGA. The product C = A · B of two matrices A ∈ Z^{N×N} and B ∈ Z^{N×N} is defined as follows:

c_{ij} = \sum_{k=1}^{N} a_{ik} b_{kj} \quad \forall\ 1 \leq i \leq N \wedge 1 \leq j \leq N. \qquad (6.1)

Let the matrix product defined in Eq. (6.1) be given as a C program (Figure 6.1).

for (i = 0; i < N; i++)
{ for (j = 0; j < N; j++)
  { for (k = 0; k < N; k++)
    { c[i][j] = a[i][k] * b[k][j] + c[i][j]; /* c assumed zero-initialized */
    }
  }
}

Figure 6.1: Matrix multiplication algorithm, C code.

Matrix multiplication is also taken as an example to illustrate the control generation transformation. The dataflow graph for the partitioned, partially localized dependence graph and the piecewise regular algorithm for the co-partitioned matrix multiplication algorithm are shown in Figure 6.2. The dependencies observed in the PRA can be verified in the dataflow graph. Formally, the given index space I is decomposed into the direct sum of three linearly bounded lattices J, K, and L s.t. I ⊆ J ⊕ K ⊕ L. An intuitive explanation of co-partitioning can be seen in Figure 6.2. The iteration vector J = (j_1 j_2)^T ∈ J gives the local index vector for the points within a tile. The


iteration vector K = (k_1 k_2)^T ∈ K contains the set of origins of LS tiles, i.e., tiles marked with the same color. The iteration vector L = (l_1 l_2 l_3)^T ∈ L contains the set of origins of pairs of GS tiles, i.e., 2 × 2 pairs of tiles having different colors. The explanation with reference to colors is valid only for k = 0 in the figure. So the index point I = (5, 7, 0) is denoted by J = (1, 1), K = (0, 1) and L = (1, 1, 0). The tiling matrices used are

P_{LPGS} = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad P_{LSGP} = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{pmatrix}.

The co-partitioning methodology involves the subsequent application of LSGP and LPGS partitioning. The sequentialization of the operations within an LSGP tile implies P_J = 0, and the sequentialization of the computation of the different tiles leads to P_L = 0. The mapping of LS tiles to the processor space leads to P_K = E. P_J, P_K, P_L define the space mapping for co-partitioning (see Definition 4.1.1 on page 44).
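With these diagonal tiling matrices, the decomposition of an index point into (J, K, L) amounts to digit extraction per coordinate. A small Python sketch (function name is ours):

```python
def decompose(i, j, k):
    """Split I = (i, j, k) into local (J), processor (K) and co-partition (L)
    indices for P_LSGP = P_LPGS = diag(2, 2, 1); l3 enumerates the k axis."""
    J = (i % 2, j % 2)                 # position inside the LSGP tile
    K = ((i // 2) % 2, (j // 2) % 2)   # LSGP tile (= processor) inside a co-partition
    L = (i // 4, j // 4, k)            # origin of the co-partition
    return J, K, L

print(decompose(5, 7, 0))  # ((1, 1), (0, 1), (1, 1, 0))
```

This reproduces the decomposition of the index point I = (5, 7, 0) given in the text.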

a[j_1, j_2, k_1, k_2, l_1, l_2, l_3] =
\begin{cases}
a[j_1, j_2 - 1, k_1, k_2, l_1, l_2, l_3] & \text{if } j_2 > 0 \\
a[j_1, j_2, k_1, k_2 - 1, l_1, l_2, l_3] & \text{if } k_2 > 0 \wedge j_2 = 0 \\
a[j_1, j_2, k_1, k_2, l_1, l_2 - 1, l_3] & \text{if } l_2 > 0 \wedge k_2 = 0 \wedge j_2 = 0 \\
A_{j_1 + 2 \cdot k_1 + 4 \cdot l_1,\, l_3} & \text{if } l_2 = 0 \wedge k_2 = 0 \wedge j_2 = 0
\end{cases}

b[j_1, j_2, k_1, k_2, l_1, l_2, l_3] =
\begin{cases}
b[j_1 - 1, j_2, k_1, k_2, l_1, l_2, l_3] & \text{if } j_1 > 0 \\
b[j_1, j_2, k_1 - 1, k_2, l_1, l_2, l_3] & \text{if } k_1 > 0 \wedge j_1 = 0 \\
b[j_1, j_2, k_1, k_2, l_1 - 1, l_2, l_3] & \text{if } l_1 > 0 \wedge k_1 = 0 \wedge j_1 = 0 \\
B_{l_3,\, j_2 + 2 \cdot k_2 + 4 \cdot l_2} & \text{if } l_1 = 0 \wedge k_1 = 0 \wedge j_1 = 0
\end{cases}

z[j_1, j_2, k_1, k_2, l_1, l_2, l_3] = a[j_1, j_2, k_1, k_2, l_1, l_2, l_3] \cdot b[j_1, j_2, k_1, k_2, l_1, l_2, l_3]

c[j_1, j_2, k_1, k_2, l_1, l_2, l_3] =
\begin{cases}
c[j_1, j_2, k_1, k_2, l_1, l_2, l_3 - 1] + z[j_1, j_2, k_1, k_2, l_1, l_2, l_3] & \text{if } l_3 > 0 \\
z[j_1, j_2, k_1, k_2, l_1, l_2, l_3] & \text{if } l_3 = 0
\end{cases}

Co[j_1, j_2, k_1, k_2, l_1, l_2, l_3] = c[j_1, j_2, k_1, k_2, l_1, l_2, l_3] \quad \text{if } l_3 = 7


Figure 6.2: The dataflow graph of the co-partitioned matrix multiplication 8 × 8 example (node labels are the execution time steps; inputs A and B enter at the borders)

for all J ∈ J ∧ I ∈ I, with

\mathcal{J} = \left\{ J \in \mathbb{Z}^2 \;\middle|\; \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} j_1 \\ j_2 \end{pmatrix} \geq \begin{pmatrix} 0 \\ -1 \\ 0 \\ -1 \end{pmatrix} \right\},

\mathcal{K} = \left\{ K \in \mathbb{Z}^2 \;\middle|\; \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} k_1 \\ k_2 \end{pmatrix} \geq \begin{pmatrix} 0 \\ -1 \\ 0 \\ -1 \end{pmatrix} \right\},

and


\mathcal{L} = \left\{ L \in \mathbb{Z}^3 \;\middle|\; \begin{pmatrix} 1 & 0 & 0 \\ -1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & -1 \end{pmatrix} \begin{pmatrix} l_1 \\ l_2 \\ l_3 \end{pmatrix} \geq \begin{pmatrix} 0 \\ -1 \\ 0 \\ -1 \\ 0 \\ -7 \end{pmatrix} \right\}.

The problem of determining an optimal sequencing index, taking into account constraints on the timing of the processor array and the availability of resources, is solved by a Mixed Integer Linear Programming (MILP) formulation, which is not part of this work. One optimal space-time mapping in the framework of Definition 4.1.1 is

\begin{pmatrix} p_1 \\ p_2 \\ t \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 2 & 1 & 1 & 1 & 8 & 4 & 16 \end{pmatrix} \cdot \begin{pmatrix} J \\ K \\ L \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}

where J = (j_1 j_2)^T, K = (k_1 k_2)^T, and L = (l_1 l_2 l_3)^T. The number within the nodes of the dataflow graph represents the time of execution as determined by the scheduling. A 2 × 2 processor space is obtained which executes LS tiles, i.e., tiles with the same color, in parallel and GS tiles sequentially. This fact can be construed from the scheduling of the operations within tiles. The piecewise regular program obtained after space-time mapping is shown in the following. The processor array specification is interpreted from the PRP after control generation. The specification can be implemented with different interpretations, such as asynchronous or synchronous communication, coarse- or fine-grained architectures, and hardware or software modules in a Hardware Description Language (HDL). A VHDL implementation of the co-partitioned matrix multiplication was carried out.
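The space-time mapping above can be applied directly; a small Python sketch (function name is ours) shows, e.g., that the first iteration J = K = L = 0 starts at time t = 1:

```python
def space_time(J, K, L):
    """Apply the optimal space-time mapping of the case study:
    p = K, t = 2*j1 + j2 + k1 + k2 + 8*l1 + 4*l2 + 16*l3 + 1."""
    j1, j2 = J
    k1, k2 = K
    l1, l2, l3 = L
    p1, p2 = k1, k2
    t = 2 * j1 + j2 + k1 + k2 + 8 * l1 + 4 * l2 + 16 * l3 + 1
    return p1, p2, t

print(space_time((0, 0), (0, 0), (0, 0, 0)))  # (0, 0, 1)
print(space_time((1, 1), (0, 1), (1, 1, 0)))  # (0, 1, 17)
```

The second call maps the decomposed index point I = (5, 7, 0) from above to processor (0, 1) at time step 17.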


Implementation                 | Control   | Memory       | Datapath   | Clock
-------------------------------+-----------+--------------+------------+-------
[Gui03], multidimensional time | 65 slices | 2 RAM blocks | 26 slices  | 60 MHz
Co-partitioned MM              | 6 slices  | –            | 103 slices | 58 MHz

Table 6.1: Comparison of implementations

a[p1, p2, t] =
    a[p1, p2, t − 1]                   if η2 > 0
    a[p1, p2 − 1, t − 1]               if p2 > 0 ∧ η2 = 0
    a[p1, p2, t − 4]                   if p2 = 0 ∧ η2 = 0 ∧ η4 > 0
    A[2·p1 + η1 + 4·η3, η5]            if p2 = 0 ∧ η2 = 0 ∧ η4 = 0

b[p1, p2, t] =
    b[p1, p2, t − 2]                   if η1 > 0
    b[p1 − 1, p2, t − 1]               if η1 = 0 ∧ p1 > 0
    b[p1, p2, t − 8]                   if η1 = 0 ∧ p1 = 0 ∧ η3 > 0
    B[η5, 2·p2 + η2 + 4·η4]            if η1 = 0 ∧ p1 = 0 ∧ η3 = 0

z[p1, p2, t] = a[p1, p2, t] · b[p1, p2, t]

c[p1, p2, t] =
    c[p1, p2, t − 16] + z[p1, p2, t]   if η5 > 0
    z[p1, p2, t]                       if η5 = 0

Co[p1, p2, t] = c[p1, p2, t]           if η5 = 7
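A behavioral reading of these equations: PE (p1, p2) reads A-row 2·p1 + η1 + 4·η3 and B-column 2·p2 + η2 + 4·η4, and accumulates over η5. The sketch below checks this index decomposition against a plain matrix product, under our illustrative assumption that the control signals η1..η5 track the tile counters j1, j2, l1, l2, l3:

```python
# Behavioral check of the co-partitioned index functions for C = A * B (8x8).
# Identifying eta1..eta5 with the tile counters j1, j2, l1, l2, l3 is an
# assumption made here for illustration, not a statement from the thesis.
N = 8
A = [[(3 * i + k) % 11 for k in range(N)] for i in range(N)]
B = [[(i + 5 * j) % 13 for j in range(N)] for i in range(N)]

# Reference product.
C_ref = [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]
         for i in range(N)]

# Co-partitioned evaluation: LS-tile counters (j1, j2), processor
# coordinates (k1, k2), GS-tile counters (l1, l2), accumulation index l3.
C = [[0] * N for _ in range(N)]
for l1 in range(2):
    for l2 in range(2):
        for l3 in range(8):                  # eta5: accumulation dimension
            for k1 in range(2):              # p1
                for k2 in range(2):          # p2
                    for j1 in range(2):      # eta1
                        for j2 in range(2):  # eta2
                            row = 2 * k1 + j1 + 4 * l1  # A-row index
                            col = 2 * k2 + j2 + 4 * l2  # B-column index
                            C[row][col] += A[row][l3] * B[l3][col]

assert C == C_ref
```

The check passes because (k1, j1, l1) → 2·k1 + j1 + 4·l1 and (k2, j2, l2) → 2·k2 + j2 + 4·l2 are bijections onto {0, …, 7}, so every output element is visited exactly once per accumulation step.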

The counter, the global controller, and the local controllers were obtained by applying the control generation methodology introduced in Chapter 3. Synthesis of the matrix multiplication on a Xilinx Virtex XCV800 gave the following results. The area complexity is expressed in terms of slices; one CLB consists of two slices, each containing two look-up tables. The matrix coefficients were chosen to be 8-bit integers. The main result of applying multidimensional time and control generation in [Gui03] was that the area complexity of the control path is significant, approximately three times the size of the data path. Table 6.1 summarizes the comparison of the results from [Gui03] with our implementation of co-partitioned matrix multiplication.

The absence of RAM blocks is explained by the use of slices as registers to make up the local memory; this accounts for the large size of the data path. The global counter takes up 209 slices, as seen from the synthesis results in Appendix B. The high cost of the global counter and controller is, however, offset by an increasing number of processor


elements, as it is almost a constant cost. Co-partitioning as a technique also operates on an index space of larger dimension, and hence has more control conditions than multidimensional time. The advantage of the control methodology is therefore clearly visible. A more detailed analysis of the implementation can be found in Appendix B.
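The amortization argument can be made concrete with the slice counts from the synthesis reports in Appendix B (209 slices for the global counter, 6 for a local controller, 119 for one PE core); the simple cost model below is our illustrative assumption, not a formula from the thesis:

```python
# Area amortization sketch based on the Appendix B slice counts.
GLOBAL_COUNTER = 209   # slices, paid once per array
LOCAL_CONTROL = 6      # slices per PE
PE_CORE = 119          # slices per PE (datapath plus local memory)

def control_fraction(n_pes):
    """Fraction of total slices spent on control for an array of n_pes."""
    control = GLOBAL_COUNTER + n_pes * LOCAL_CONTROL
    total = control + n_pes * PE_CORE
    return control / total

for n in (4, 16, 64):
    print(n, round(control_fraction(n), 3))  # 4 -> 0.329, 16 -> 0.138, 64 -> 0.072
```

Under this model the control share drops from roughly a third of the array for 2 × 2 to well under a tenth for 8 × 8, which is the "almost constant cost" effect described above.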


7 Conclusion and Future Work

Trends in VLSI design indicate that future generations will be multiprocessor arrays in fine-, coarse-, and large-grained architectures. However, algorithm implementations have to take constraints on hardware resources into consideration. Partitioning is an indispensable transformation for the automatic synthesis of processor arrays, as it matches the algorithm implementation to the hardware resources. Moreover, control generation is important for obtaining a correct program implementation. This master thesis solves for the first time the automatic generation of local and global control for partitioned regular algorithms, irrespective of tile shapes and sizes. Furthermore, design space exploration requires an accurate pre-estimate of memory and, as a consequence, also of power consumption. New formulas have been introduced for memory estimation, which also help in limiting the range of possible optimal tiling parameters. In addition, an incremental style of address generation was shown to be better suited for our class of algorithms. The importance of localization as a transformation cannot be overstated, as it reduces the number of memory accesses. The transformations of localization and partitioning have been extended to handle hierarchical partitioning methods, in particular co-partitioning.

The algorithms for the generation and propagation of control signals need to be integrated into a VHDL code generator interfaced with PARO. Equally important is to investigate the effect of tile size and shape on communication for the different partitioning techniques, and to formulate this as an integer linear programming problem. The problem of finding an optimal scheduling vector for hierarchical partitioning schemes also remains to be solved. Besides, the transformation for co-partitioning needs to be implemented in PARO. Furthermore, computationally intensive algorithms from the fields of medical image processing, image processing, signal processing, and linear algebra need to be implemented as case studies in PARO to prove the viability of the approach.


Bibliography

[Bas03] C. Bastoul. Efficient Code Generation for Automatic Parallelization and Optimization. In ISPDC'2, IEEE International Symposium on Parallel and Distributed Computing, pages 23–30, October 2003.

[BDRR94] Pierre Boulet, Alain Darte, Tanguy Risset, and Yves Robert. (Pen)-ultimate Tiling? Integration, the VLSI Journal, 17(1):33–51, 1994.

[BHT01] Marcus Bednara, Frank Hannig, and Jürgen Teich. Boundary Control: A New Distributed Control Architecture for Space-Time Transformed (VLSI) Processor Arrays. In Proceedings 35th IEEE Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, California, USA, November 2001.

[BHT02] Marcus Bednara, Frank Hannig, and Jürgen Teich. Generation of Distributed Loop Control. In Ed F. Deprettere, Jürgen Teich, and Stamatis Vassiliadis, editors, Embedded Processor Design Challenges: Systems, Architectures, Modeling, and Simulation – SAMOS, volume 2268 of Lecture Notes in Computer Science (LNCS), pages 154–170, Heidelberg, Germany, 2002. Springer.

[CEL] CELOXICA. Handel-C. www.celoxica.com.

[CI91] C. Ancourt and F. Irigoin. Scanning Polyhedra with DO Loops. In 3rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 39–50, June 1991.

[CM88] K. Mani Chandy and Jayadev Misra. Parallel Program Design: A Foundation. Addison-Wesley, Reading, Massachusetts, 1988.

[Dar91] Alain Darte, Leonid Khachiyan, and Yves Robert. Linear Scheduling is Nearly Optimal. Parallel Processing Letters, 1(2):73–81, 1991.

[Dar02] Alain Darte, Robert Schreiber, B. Ramakrishna Rau, and Frédéric Vivien. Constructing and Exploiting Linear Schedules with Prescribed Parallelism. ACM Transactions on Design Automation of Electronic Systems, 7(1):159–172, 2002.

[DR02] Steven Derrien and Sanjay Rajopadhye. Energy/Power Estimation of Regular Processor Arrays. In ISSS'02, pages 50–55, 2002.

[Eck01] Uwe Eckhardt. Algorithmus-Architektur-Codesign für den Entwurf digitaler Systeme mit eingebettetem Prozessorarray und Speicherhierarchie. PhD thesis, Technische Universität Dresden, June 2001.

[EM99] U. Eckhardt and R. Merker. Hierarchical Algorithm Partitioning at System Level for an Improved Utilization of Memory Structures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 18(1):14–24, 1999.

[Fea88] Paul Feautrier. Parametric Integer Programming. RAIRO Recherche Opérationnelle, 22:243–268, 1988.

[Fea92] Paul Feautrier. Some Efficient Solutions to the Affine Scheduling Problem: I. One-dimensional Time. International Journal of Parallel Programming, 21(5):313–348, 1992.

[GDGN03] S. Gupta, N. D. Dutt, R. K. Gupta, and A. Nicolau. SPARK: A High-Level Synthesis Framework for Applying Parallelizing Compiler Transformations. In Proceedings of the International Conference on VLSI Design, January 2003.

[Gui03] A.-C. Guillou, P. Quinton, and T. Risset. Hardware Synthesis for Multi-dimensional Time. In Proceedings ASAP 2003. IEEE Computer Society, 2003.

[Hal86] M. Hall. Combinatorial Theory. Wiley-Interscience, 1986.

[HDT04] Frank Hannig, Hritam Dutta, and Jürgen Teich. Mapping a Class of Dependence Algorithms to Coarse-grained Reconfigurable Arrays – Architectural Parameters and Methodology. International Journal of Embedded Systems (IJES) (to appear), 2004.

[HT01] Frank Hannig and Jürgen Teich. Design Space Exploration for Massively Parallel Processor Arrays. In Victor Malyshkin, editor, Parallel Computing Technologies, 6th International Conference, PaCT 2001, Proceedings, volume 2127 of Lecture Notes in Computer Science (LNCS), pages 51–65, Novosibirsk, Russia, September 2001. Springer.

[HT02] Frank Hannig and Jürgen Teich. Energy Estimation of Nested Loop Programs. In Proceedings 14th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA 2002), Winnipeg, Manitoba, Canada, August 2002. ACM Press.

[HT04] Frank Hannig and Jürgen Teich. Domain-Specific Processors: Systems, Architectures, Modeling, and Simulation, chapter 6, Energy Estimation and Optimization for Piecewise Regular Processor Arrays, pages 107–126. Number 20 in Signal Processing and Communications. Marcel Dekker, New York, U.S.A., 2004.

[IT88] F. Irigoin and R. Triolet. Supernode Partitioning. In Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 319–329. ACM Press, 1988.

[JR92] J. Ramanujam and P. Sadayappan. Tiling Multidimensional Iteration Spaces for Multicomputers. Journal of Parallel and Distributed Computing, (16):108–120, 1992.

[Kit91] K. Kitagaki, T. Oto, T. Demura, Y. Araki, and T. Takada. A New Address Generation Unit Architecture for Video Signal Processing. Visual Communications and Image Processing '91: Image Processing, 1606:891–900, 1991.

[KMW67] R. M. Karp, R. E. Miller, and S. Winograd. The Organization of Computations for Uniform Recurrence Equations. Journal of the Association for Computing Machinery, 14(3):563–590, 1967.

[Kuh80] Robert H. Kuhn. Transforming Algorithms for Single-Stage and VLSI Architectures. In Workshop on Interconnection Networks for Parallel and Distributed Processing, pages 11–19, West Lafayette, IN, April 1980.

[Kun88] S. Y. Kung. VLSI Processor Arrays. Prentice-Hall, Inc., 1988.

[MM96] M. Miranda, F. Catthoor, M. Janssen, and H. De Man. Efficient Hardware Address Generation in Distributed Memory Architectures. In Proc. IEEE 9th International Symposium on System Synthesis, La Jolla, CA, pages 20–25, 1996.

[PAC03] PACT XPP Technologies. XPP64-A1 Reconfigurable Processor – Datasheet. Munich, Germany, 2003.

[Pro] PARO Design System Project. www12.informatik.uni-erlangen.de/research/paro.

[Qui00] F. Quilleré, S. Rajopadhye, and D. Wilde. Generation of Efficient Nested Loops from Polyhedra. International Journal of Parallel Programming, 28(5):469–498, 2000.

[Ram95] J. Ramanujam. Beyond Unimodular Transformations. The Journal of Supercomputing, 9(4):365–389, 1995.

[Rao85] S. K. Rao. Regular Iterative Algorithms and their Implementations on Processor Arrays. PhD thesis, Stanford University, 1985.

[Sch98] Alexander Schrijver. Theory of Linear and Integer Programming. A Wiley-Interscience Publication, 1998.

[Sie03] Sebastian Siegel. Untersuchung von modifizierten Abläufen für den Entwurf von Prozessorarrays. Master's thesis, Technische Universität Dresden, 2003.

[Syn] Synfora, Inc. www.synfora.com.

[Tei93] J. Teich. A Compiler for Application-Specific Processor Arrays. PhD thesis, Institut für Mikroelektronik, Universität des Saarlandes, Saarbrücken, Deutschland, September 1993.

[Thi92] L. Thiele. Computer Systems and Software Engineering: State-of-the-Art, chapter 4, Compiler Techniques for Massive Parallel Architectures, pages 101–151. Kluwer Academic Publishers, Boston, U.S.A., 1992.

[Thi95] Lothar Thiele. Resource Constrained Scheduling of Uniform Algorithms. Journal of VLSI Signal Processing, 10:295–310, 1995.

[TT91] Jürgen Teich and Lothar Thiele. Control Generation in the Design of Processor Arrays. Int. Journal on VLSI and Signal Processing, 3(2):77–92, 1991.

[TT93] Jürgen Teich and Lothar Thiele. Partitioning of Processor Arrays: A Piecewise Regular Approach. INTEGRATION: The VLSI Journal, 14(3):297–332, 1993.

[TT02a] J. Teich and L. Thiele. Exact Partitioning of Affine Dependence Algorithms. In E. F. Deprettere, J. Teich, and S. Vassiliadis, editors, Embedded Processor Design Challenges, volume 2268 of Lecture Notes in Computer Science (LNCS), pages 135–153, Springer, Berlin, March 2002.

[TT02b] Jürgen Teich and Lothar Thiele. Exact Partitioning of Affine Dependence Algorithms. In Ed F. Deprettere, Jürgen Teich, and Stamatis Vassiliadis, editors, Embedded Processor Design Challenges, volume 2268 of Lecture Notes in Computer Science (LNCS), pages 135–153, March 2002.

[TTZ97] Jürgen Teich, Lothar Thiele, and Li Zhang. Scheduling of Partitioned Regular Algorithms on Processor Arrays with Constrained Resources. Journal of VLSI Signal Processing, 17(1):5–20, September 1997.

[Wol96] Michael Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley Inc., 1996.

[WS94] D. Wilde and O. Sie. Regular Array Synthesis using Alpha. In Int. Conf. on Application Specific Array Processors, San Francisco, California, pages 200–211, August 1994.

[Xue92] J. Xue. The Formal Synthesis of Control Signals for Systolic Arrays. PhD thesis, University of Edinburgh, March 1992.

[Zha96] L. Zhang. Scheduling and Allocating with Integer Linear Programming. PhD thesis, Universität des Saarlandes, 1996.

A FIR Filter: ArchitectureComposer

Figure A.1: PE array for a FIR filter


The ArchitectureComposer is a tool developed at the Department of Hardware-Software-Co-Design which enables the design of systems on chip using drag-and-drop RTL blocks. It was used to develop and simulate a FIR filter in the course of this master thesis. Figure A.1 is the block description of the FIR filter implementation. The two blocks in the top left corner account for counter(J) and counter(L); another block on top accounts for the global controller. The big blocks in the middle are the PEs of the PE array. The processor array implementation uses only two PEs. Figure A.2 shows the RTL description of a single PE. The PE shown in

Figure A.2: First PE block description

the diagram is the first PE. The blocks on the left of the figure constitute the local controller, which controls the multiplexers for selecting the correct input signals.

Figure A.3: Counter for L

Figure A.3 shows the block description of one of the counters.


B Synthesis Report: Matrix Multiplication

The VHDL files and testbenches are provided on the attached CD. The chip view

Figure B.1: Chip view of co-partitioned matrix multiplication of Xilinx VirtexXCV800

of the application shows the irregular structure of the implementation. In the future, JBits or bounding boxes could be used to specify the position of the PEs in order to obtain a regular on-chip placement. The major conclusions to be drawn from the following synthesis reports are:


• The clock speed is determined by the data path; the control path plays no major role in determining it. On a Xilinx Virtex-II, the clock speed increased from 58 MHz to 170 MHz due to the presence of fast multipliers.

• The local control is much smaller than the corresponding data path; the global control, however, is costly due to the size of the counter. With an increasing size of the PE array, the contribution of the global controller to the area complexity diminishes, as it is almost a constant cost.

=========================================================================

* Final Report *

=========================================================================

Final Results

RTL Top Level Output File Name : mat3dim_array.ngr

Top Level Output File Name : mat3dim_array

Output Format : NGC

Optimization Goal : Speed

Keep Hierarchy : NO

Design Statistics

# IOs : 67

Macro Statistics :

# Registers : 122

# 1-bit register : 17

# 3-bit register : 8

# 32-bit register : 5

# 8-bit register : 92

# Multiplexers : 12

# 2-to-1 multiplexer : 12

# Adders/Subtractors : 9

# 16-bit adder : 4

# 32-bit adder : 5

# Multipliers : 4

# 8x8-bit multiplier : 4

# Comparators : 2

# 32-bit comparator greatequal: 1

# 32-bit comparator less : 1

Cell Usage :

# BELS : 1301

# BUF : 14


# GND : 1

# LUT1 : 166

# LUT2 : 10

# LUT2_D : 1

# LUT2_L : 64

# LUT3 : 112

# LUT3_D : 60

# LUT3_L : 8

# LUT4 : 132

# LUT4_D : 8

# LUT4_L : 76

# MULT_AND : 64

# MUXCY : 301

# VCC : 1

# XORCY : 283

# FlipFlops/Latches : 843

# FDCE : 682

# FDRE : 160

# FDSE : 1

# Clock Buffers : 1

# BUFGP : 1

# IO Buffers : 66

# IBUF : 34

# OBUF : 32

=========================================================================

Device utilization summary:

---------------------------

Selected Device : v800bg560-6

Number of Slices: 648 out of 9408 6%

Number of Slice Flip Flops: 843 out of 18816 4%

Number of 4 input LUTs: 637 out of 18816 3%

Number of bonded IOBs: 66 out of 408 16%

Number of GCLKs: 1 out of 4 25%

=========================================================================

TIMING REPORT

NOTE: THESE TIMING NUMBERS ARE ONLY A SYNTHESIS ESTIMATE.

FOR ACCURATE TIMING INFORMATION PLEASE REFER TO THE TRACE REPORT

GENERATED AFTER PLACE-and-ROUTE.


Clock Information:

------------------

-----------------------------------+------------------------+-------+

Clock Signal | Clock buffer(FF name) | Load |

-----------------------------------+------------------------+-------+

clk | BUFGP | 843 |

-----------------------------------+------------------------+-------+

Timing Summary:

---------------

Speed Grade: -6

Minimum period: 17.427ns (Maximum Frequency: 57.382MHz)

Minimum input arrival time before clock: 14.902ns

Maximum output required time after clock: 24.434ns

Maximum combinational path delay: No path found

-------------------------------Datapath----------------------------------

=========================================================================

* Final Report *

=========================================================================

Final Results

RTL Top Level Output File Name : mat3dim_pe_core.ngr

Top Level Output File Name : mat3dim_pe_core

Output Format : NGC

Optimization Goal : Speed

Keep Hierarchy : NO

Design Statistics

# IOs : 46

Macro Statistics :

# Registers : 23

# 8-bit register : 23

# Multiplexers : 3

# 2-to-1 multiplexer : 3

# Adders/Subtractors : 1

# 16-bit adder : 1

# Multipliers : 1

# 8x8-bit multiplier : 1

Cell Usage :

# BELS : 146


# BUF : 2

# GND : 1

# LUT2 : 8

# LUT2_L : 16

# LUT3 : 3

# LUT3_D : 15

# LUT3_L : 2

# LUT4 : 2

# LUT4_L : 18

# MULT_AND : 16

# MUXCY : 31

# XORCY : 32

# FlipFlops/Latches : 184

# FDCE : 184

# Clock Buffers : 1

# BUFGP : 1

# IO Buffers : 45

# IBUF : 21

# OBUF : 24

=========================================================================

Device utilization summary:

---------------------------

Selected Device : v800bg560-6

Number of Slices: 119 out of 9408 1%

Number of Slice Flip Flops: 184 out of 18816 0%

Number of 4 input LUTs: 64 out of 18816 0%

Number of bonded IOBs: 45 out of 408 11%

Number of GCLKs: 1 out of 4 25%

=========================================================================

TIMING REPORT

NOTE: THESE TIMING NUMBERS ARE ONLY A SYNTHESIS ESTIMATE.

FOR ACCURATE TIMING INFORMATION PLEASE REFER TO THE TRACE REPORT

GENERATED AFTER PLACE-and-ROUTE.

Clock Information:

------------------

-----------------------------------+------------------------+-------+

Clock Signal | Clock buffer(FF name) | Load |


-----------------------------------+------------------------+-------+

clk | BUFGP | 184 |

-----------------------------------+------------------------+-------+

Timing Summary:

---------------

Speed Grade: -6

Minimum period: 15.942ns (Maximum Frequency: 62.727MHz)

Minimum input arrival time before clock: 16.455ns

Maximum output required time after clock: 22.949ns

Maximum combinational path delay: 23.462ns

------------------------------Control Path-----------------------------

========================================================================

* Final Report *

========================================================================

Final Results

RTL Top Level Output File Name : mat3dim_pe_controller_pe1.ngr

Top Level Output File Name : mat3dim_pe_controller_pe1

Output Format : NGC

Optimization Goal : Speed

Keep Hierarchy : NO

Design Statistics

# IOs : 21

Macro Statistics :

# Registers : 6

# 1-bit register : 4

# 3-bit register : 2

Cell Usage :

# BELS : 1

# LUT3 : 1

# FlipFlops/Latches : 10

# FDCE : 10

# Clock Buffers : 1

# BUFGP : 1

# IO Buffers : 20

# IBUF : 7

# OBUF : 13

=========================================================================

Device utilization summary:


---------------------------

Selected Device : v800bg560-6

Number of Slices: 6 out of 9408 0%

Number of Slice Flip Flops: 10 out of 18816 0%

Number of 4 input LUTs: 1 out of 18816 0%

Number of bonded IOBs: 20 out of 408 4%

Number of GCLKs: 1 out of 4 25%

=========================================================================

TIMING REPORT

NOTE: THESE TIMING NUMBERS ARE ONLY A SYNTHESIS ESTIMATE.

FOR ACCURATE TIMING INFORMATION PLEASE REFER TO THE TRACE REPORT

GENERATED AFTER PLACE-and-ROUTE.

Clock Information:

------------------

-----------------------------------+------------------------+-------+

Clock Signal | Clock buffer(FF name) | Load |

-----------------------------------+------------------------+-------+

clk | BUFGP | 10 |

-----------------------------------+------------------------+-------+

Timing Summary:

---------------

Speed Grade: -6

Minimum period: No path found

Minimum input arrival time before clock: 3.484ns

Maximum output required time after clock: 6.887ns

Maximum combinational path delay: 8.495ns

-------------------------------Counter-----------------------------------

=========================================================================

* Final Report *

=========================================================================

Final Results

RTL Top Level Output File Name : counter_j.ngr

Top Level Output File Name : counter_j

Output Format : NGC

Optimization Goal : Speed

Keep Hierarchy : NO


Design Statistics

# IOs : 10

Macro Statistics :

# Registers : 6

# 1-bit register : 1

# 32-bit register : 5

# Adders/Subtractors : 5

# 32-bit adder : 5

# Comparators : 2

# 32-bit comparator greatequal: 1

# 32-bit comparator less : 1

Cell Usage :

# BELS : 716

# BUF : 1

# GND : 1

# LUT1 : 160

# LUT1_L : 6

# LUT2 : 4

# LUT2_D : 1

# LUT2_L : 2

# LUT3 : 75

# LUT3_D : 1

# LUT3_L : 24

# LUT4 : 48

# LUT4_D : 17

# LUT4_L : 43

# MUXCY : 177

# VCC : 1

# XORCY : 155

# FlipFlops/Latches : 161

# FDRE : 160

# FDSE : 1

# Clock Buffers : 1

# BUFGP : 1

# IO Buffers : 8

# IBUF : 1

# OBUF : 7

=========================================================================

Device utilization summary:

---------------------------


Selected Device : v800bg560-6

Number of Slices: 209 out of 9408 2%

Number of Slice Flip Flops: 161 out of 18816 0%

Number of 4 input LUTs: 381 out of 18816 2%

Number of bonded IOBs: 8 out of 408 1%

Number of GCLKs: 1 out of 4 25%

=========================================================================

TIMING REPORT

NOTE: THESE TIMING NUMBERS ARE ONLY A SYNTHESIS ESTIMATE.

FOR ACCURATE TIMING INFORMATION PLEASE REFER TO THE TRACE REPORT

GENERATED AFTER PLACE-and-ROUTE.

Clock Information:

------------------

-----------------------------------+------------------------+-------+

Clock Signal | Clock buffer(FF name) | Load |

-----------------------------------+------------------------+-------+

clk | BUFGP | 161 |

-----------------------------------+------------------------+-------+

Timing Summary:

---------------

Speed Grade: -6

Minimum period: 14.548ns (Maximum Frequency: 68.738MHz)

Minimum input arrival time before clock: 11.892ns

Maximum output required time after clock: 7.517ns

Maximum combinational path delay: No path found


C Contents of CD

The attached CD contains the VHDL sources, testbenches, LATEX sources, and ArchitectureComposer sheets along with their XASM descriptions. The directory structure is shown in Figure C.1.

Master_Thesis
+-- VHDL
+-- Arch_Comp
|   +-- CAS
|   +-- XASM
+-- Latex
    +-- figures
    +-- reports

Figure C.1: Directory tree of the CD

The directory VHDL contains all the VHDL files along with their testbenches for simulation. The directory Arch_Comp, standing for ArchitectureComposer, contains the directory CAS with the computer architecture specification files for the FIR filter, and the directory XASM with the XASM files for the simulation of the architectures. The Latex directory contains the LATEX files and the figures in reports/report1 and figures, respectively.
