
THE PROCESSING OF A DENSE MATRIX

MULTIPLIED BY A DENSE MATRIX ON THE STI CELL

by

N.A. Cumpson

D.A. Karunaratne

A.R. Schulz

C.J. Venantius

F. Zhao

COMPUTER SCIENCE 4ZP6 PROJECT

Supervised by Dr. C. Anand

Department of Computing and Software

and

Dr. W. Kahl

Department of Computing and Software

McMaster University

Hamilton, ON, L8S 4L8

2007


Acknowledgements

The thesis group would like to take this chance to thank the following people for helping us reach this point: Dr. Christopher Anand, for providing us with a challenging project and guidance; Dr. Wolfram Kahl, for his work on code graphs; Gordon Uszkay, for organizational guidance and for providing sanity during the year; and our friends, who had to put up with us during the course of the project.


Abstract

This project examines the possibilities of using the STI Cell, developed by Sony, Toshiba and IBM, to handle parallelized applications. The objective is to take a simple problem, matrix multiplication, and find an efficient solution that takes advantage of the underlying hardware. The report does not claim that the following implementation is the most efficient or an optimized solution; rather, its primary purpose is to provide a framework for approaching similar problems on the Cell or similar hardware. The original proposal was to implement a dense matrix multiplied by a dense matrix, using parallelization methods to increase performance, and then to extend the build to a limited set of sparse matrices.

By the end of the academic year, the original goals had been scaled back to account for the realistic scope of the project. The extension to sparse matrices was forgone, and the complete integration of the final implementation is left undone. The final build consists of four separate, working, well-documented parts, which together illustrate a dense matrix multiplied by a dense matrix. Although it is not integrated, and thus not a fully working solution, the objective of providing a framework for computational problems that can be parallelized was successfully completed. The project provides the necessary knowledge, a suggested approach, and enough implementation for the work to be carried to completion, with enough potential to motivate continuing research in this area.


Table of Contents

Acknowledgements
Abstract
I Requirement Analysis
  I.I Introduction
  I.II Purpose
  I.III Scope
  I.IV Definitions, Acronyms and Abbreviations
  I.V References
  I.VI Overview
  I.VII General Description
  I.VIII Product Perspective
  I.IX Product Functions
  I.X User Characteristics
  I.XI General Constraints
  I.XII Assumptions and Dependencies
  I.XIII Specific Requirements
    I.XIII.I Functional Requirements
    I.XIII.II External Interface Requirements
    I.XIII.III STI Cell Processor Hardware Constraints
    I.XIII.IV Attributes
    I.XIII.V Other Requirements
II Design
  II.I Proposed Problem
  II.II Software and Hardware
  II.III The Matrix Type
  II.IV Matrix Multiplication
    II.IV.I Definition of Matrix Multiplication
    II.IV.II Algorithms for Matrix Multiplication
  II.V Storage Format
    II.V.I Row Order
    II.V.II Entries Storage
  II.VI Direct Memory Access (DMA)
    II.VI.I Definition of DMA
    II.VI.II DMAs for Matrix Multiplication
    II.VI.III Reduction of DMAs for Inner Product Computation
  II.VII Cell Programming Models
    II.VII.I PPE Programming Models
    II.VII.II Small-Single SPE Models
    II.VII.III Large-Single SPE Models
    II.VII.IV Parallel Programming Models
  II.VIII Clustering
    II.VIII.I Definition
    II.VIII.II Motivation
    II.VIII.III Size of Cluster
    II.VIII.IV Block Size Implications
    II.VIII.V Block Size Choice
    II.VIII.VI Cluster Size Choice
  II.IX Parallelism Methods
    II.IX.I Data Parallelism
    II.IX.II Pipeline Parallelism
  II.X Methods of Implementation
    II.X.I CodeGraph
    II.X.II Haskell Directly
  II.XI Run Time System
    II.XI.I Functionality Overview
    II.XI.II Context Information
    II.XI.III Control Threads
    II.XI.IV Buffers
    II.XI.V Memory Transfers
    II.XI.VI Data Computation
    II.XI.VII Synchronization
  II.XII Algorithms for Kernel Matrix Multiplication
    II.XII.I Kernel Computation
    II.XII.II Row Column Algorithm
    II.XII.III Row-Row Algorithm
    II.XII.IV Other Algorithms
    II.XII.V Pipelining
    II.XII.VI Assembly Code Generation
    II.XII.VII Storage of Code
  II.XIII Issue Tracking of Design
    II.XIII.I Requirement Report
    II.XIII.II Schedule
    II.XIII.III Scope
    II.XIII.IV Design Report
  II.XIV Testing
    II.XIV.I Exploratory Approach
III Implementation
  III.I Run Time System
    III.I.I Introduction
    III.I.II Extra Material
    III.I.III Method of Explanation
    III.I.IV Constants
    III.I.V Run Time PPU
    III.I.VI Context Information
    III.I.VII Run Time SPU
    III.I.VIII Initialization
    III.I.IX Test
    III.I.X Signaling
    III.I.XI Display Matrix
    III.I.XII DMA
    III.I.XIII Compute
    III.I.XIV Display Data
  III.II Code Graphs
    III.II.I Multi-loop
    III.II.II First Layer
    III.II.III Loop Layer
    III.II.IV Additional Loop Layer
    III.II.V Mult
    III.II.VI MaddLayer
    III.II.VII MaddstoreLayer
    III.II.VIII Direction
  III.III Kernel Computation
    III.III.I Introduction
    III.III.II Method of Explanation
    III.III.III Function multcalc
    III.III.IV Function matrixmult
    III.III.V Function rowmult
    III.III.VI Function matrixmult1
    III.III.VII Function mul44'
  III.IV Techniques of Loop Optimization
    III.IV.I Software Pipelining
    III.IV.II Explicitly Staged Software Pipelining
    III.IV.III Implementation
    III.IV.IV Implementing the Loop Code
IV Verification
  IV.I Run Time System
    IV.I.I Motivation
    IV.I.II Process
    IV.I.III Algorithmic Test
    IV.I.IV Regression Test
    IV.I.V Stress Test
  IV.II Loop Optimization Theory Testing
  IV.III Code Graph Verification
  IV.IV Kernel Code Verification
V Bibliography
VI Appendix I: User Guide and Upkeep
  VI.I Run Time System User Manual
  VI.II Pitfalls of the SDK 2.0
  VI.III Run Time System Issue Tracking
  VI.IV Code Graph Issue Tracking
VII Appendix II: Code Snippets and Test Data
  VII.I Code Files for Run Time System
    VII.I.I runTimePPU.c
    VII.I.II contextInfo.h
    VII.I.III initialization.c
    VII.I.IV runTimeSPU.c
    VII.I.V signal.c
    VII.I.VI displayMatrix.h
    VII.I.VII compute.c
    VII.I.VIII dma.c
    VII.I.IX test.c
  VII.II Test Data for Run Time System
VIII Appendix III: Internal Report
  VIII.I Report on Matrix Algorithms
  VIII.II Report on Data Structures and Corresponding Algorithms
  VIII.III Report on a "Codegraph" to Illustrate Multiplication
  VIII.IV Report on Minimizing DMAs through "Codegraph"
  VIII.V Report on Methods of Parallelism
  VIII.VI Haskell Implementation
  VIII.VII Report on Low Level Matrix Calculations 1
  VIII.VIII Report on Low Level Matrix Calculations 2
  VIII.IX Tips for Using the STI Cell
  VIII.X Report on Translating Codegraph to STI Cell (Idea)
  VIII.XI Report on STI Cell DMA, MFC and Memory
  VIII.XII Real-Time System Overview 1.1
  VIII.XIII Dense-Dense Matrix Multiplication: Computation Kernel
  VIII.XIV SPE Local Store Stack Frame
  VIII.XV SPE Processor Affinity
  VIII.XVI Notes on the Library SPE Document
  VIII.XVII BLAS Overview
  VIII.XVIII Dense-Dense Matrix Multiplication CodeGraph Generator
  VIII.XIX LaTeX Advanced
  VIII.XX LaTeX Usage
  VIII.XXI Memory Mapping for a Dense Matrix
  VIII.XXII CABx Dense Parallel Algorithms With B Being a Vector
  VIII.XXIII CABx Dense Parallel Algorithms With B Being a Matrix


I Requirement Analysis

I.I Introduction

I.II Purpose

This Software Requirements Specification (SRS) document describes the functionality and operation of the Matrix Calculation Optimizer.

I.III Scope

The requirements describe the entire optimizer for the individuals or teams designing, implementing, or testing it. This document has been modified and updated from the original SRS according to feedback provided by the project supervisors.

I.IV Definitions, Acronyms and Abbreviations

Cell Processor: The computer processor designed by Sony, Toshiba and IBM, which includes a new PowerPC processor, eight Synergistic Processing Elements (SPEs) for vector calculations, and an architecture for optimal performance.

SPE: Synergistic Processing Element, a processor unit of the Cell processor that specializes in vector calculations and can run in parallel with every other SPE and the PPE.

PPE: PowerPC Processing Element, the main processor unit of the Cell, used for running the operating system, trivial and non-trivial calculations, and controlling the SPE units.

MMU: Memory Management Unit, the controller of memory for the SPE and PPE processors. The MMU is used by the PPE for accessing memory and for translating memory addresses via Effective-to-Real Address Translation buffers, the page table, segment lookaside buffers and translation lookaside buffers.

LPAC: A matrix algebra transformation technology for optimal matrix calculations.

CBE: Cell Broadband Engine, the Cell processor unit as a whole, encompassing the entire architecture and processing units. Designed by IBM, Sony and Toshiba. "CBE processor" and "Cell processor" are used interchangeably throughout this document.

EIB: Element Interconnect Bus, the network connecting the PPE, SPEs and other devices. The EIB transfers data at a rate of 25.6GB/s.

DMA: Direct Memory Access, the engine used for memory transfers that run concurrently with processing.


I.V References

Project Supervisors: Dr. Christopher Anand; Gordon J. Uszkay, M.Sc.

IEEE Guide to Software Requirements Specifications. New York, NY: Institute of Electrical and Electronics Engineers, 345 East 47th Street, New York, NY 10017, USA, 1984. pp. 1-24. Accessed 30 Sept. 2006.

Cell Broadband Engine Programming Handbook v1.0. International Business Machines Corporation, 2006. pp. 1-876. Accessed 01 Sept. 2006.

SPU C/C++ Language Extensions v2.1. International Business Machines Corporation, 2006. pp. 1-257. Accessed 15 Oct. 2006.

Synergistic Processor Unit Instruction Set Architecture v1.1. International Business Machines Corporation, 2006. pp. 1-257. Accessed 01 Oct. 2006.

Boehm, Barry W. "A Spiral Model of Software Development and Enhancement." IEEE Computer (1988): pp. 1-12. Accessed 20 Sept. 2006.

I.VI Overview

The SRS is organized into a general description of the project (I.VII), followed by the set of specific software requirements (I.XIII).

I.VII General Description

I.VIII Product Perspective

The optimizer will be standalone software that takes standard inputs and returns standard output. It must be run either on an Intel machine for simulation or on a computer system with the CBE processor. The only existing software required is the UNIX operating system.

I.IX Product Functions

The optimizer has one overall goal: to compute the product of two matrices. This operation can be broken into a set of functions for the optimizer:

1. Build the code graph for the optimal method of calculation for the matrices, based on a chosen algorithm.

2. Segment the matrices into matrix blocks for calculation on the CBE processor, with each SPE computing different block combinations.

3. Schedule and complete the calculations for the resulting matrix using the CBE processor.

I.X User Characteristics

The general user of the first version of the finished product should be an expert in parallel computing, low-level hardware programming, or code graph theory and algebraic methods of calculation. Such users are needed to understand the different sections of the finished project and to help with the alpha testing. The intended user of the final product, after testing and verification, is someone competent with matrix computation and linear algebra who also has some computing skills. Since this project should demonstrate its results to the IBM research team, the level of difficulty of the project should be presentable to the company.

I.XI General Constraints

- Development and use of the project software should only be available to the development thesis team, their respective supervisors, and associates of the supervisors and/or team, as agreed upon by both the supervisors and the development team.

- The software must be developed on an Intel machine running the simulator, or on a computer system containing the actual STI Cell processor. The simulation computer should have at least 512MB of memory for running the software.

- Calculations on the Cell processor should be done in parallel, with a scheduler for each of the SPEs.

I.XII Assumptions and Dependencies

The project is being developed on the assumption that the simulator will operate on an Intel UNIX system available for development. We also assume that a parser and type checker will be compatible with our code graph programming, and that a scheduler for parallel processing will be compatible with our CBE programming.

I.XIII Specific Requirements

I.XIII.I Functional Requirements

Matrix Inputs are a set of typed matrices

Introduction

This is an interface requirement that describes the input that must be accepted by the software for calculation. Inputs that are not within the correct range of acceptable data must be rejected, or they will not be calculated correctly. The software itself will simulate the inputs and is responsible for supplying appropriate data. Changing the source of inputs so that the user supplies them at run time is outside the scope of this project.

Inputs

- The inputs come from the software itself.

- The input must be in the form of integer or real numbers in a matrix formation.

- The inputs are created at run time, at the beginning of execution.

- The inputs must be real numbers in the range [-2E+129, 2E+129], with magnitudes representable within [1.17549435E-38, 6.80564694E+38].

Process

The input data must be validated to ensure it is within the given range and of the proper type, a real or integer number. The input data will be read in, checked for validity, and then accepted or rejected. If it is accepted, it will be sent to be parsed and segmented into its respective data types. If an abnormal input is used, such as a character array or values outside the range, the input must be rejected. Overflow will result in an error being returned, with an error code associated with the output giving more details about the overflow or underflow. Inputs should be produced by a separate module defined at run time, which will allow testing to be automated over numerous input matrices.

(Input ∈ [-2E+129, 2E+129] ⊂ R, with |Input| ∈ [1.17549435E-38, 6.80564694E+38])
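To make the validation step concrete, here is a minimal sketch in C; the function name, and the reading of the two ranges as a value range plus a magnitude range, are assumptions for illustration rather than the project's code:

```c
#include <math.h>

/* Bounds copied from the requirement above (illustrative constants). */
#define MIN_MAGNITUDE 1.17549435e-38
#define MAX_MAGNITUDE 6.80564694e+38

/* Accept zero outright; any other entry must have a magnitude inside
 * the representable range, so overflow and underflow are rejected. */
static int validate_entry(double x)
{
    double mag = fabs(x);
    if (x == 0.0)
        return 1;
    return mag >= MIN_MAGNITUDE && mag <= MAX_MAGNITUDE;
}
```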

Outputs

The output is the accepted matrix as it was entered; otherwise there is no output.

Matrix inputs are of variable sizes m and n determined at compile time

Introduction

This is an interface requirement. The input matrices can be any m by n matrix, but must be in the correct form at input. The user is responsible for supplying the correct matrices.

Inputs

The dimensions of the m by n matrix must be integer values and are determined when the user inputs the matrices at compile time. The maximum value of either m or n is 4096; thus the largest matrix is 4096x4096, which is the size of the matrix we benchmark.


Process

When the system creates the data matrices, the dimensions are determined and must be within the proper range. The input size will be validated to ensure that the matrix will not cause a memory overflow. With a maximum matrix size of 4096x4096, there is enough room for at least the three matrices needed for computing a binary matrix calculation.

Outputs

The output is the user's input matrices if the dimensions are in the specified range.

A layer of abstraction interfacing low-level programming to high-level programming must incorporate the operations for the underlying hardware

Introduction

This is an interface requirement. The interface between the compiler and computations must contain the

operations for executing the optimal calculations for the specific hardware processing the data.

Inputs

The input is the "opcode" parameter and the input data to be transformed. The input "opcode" must be derived from the set of instructions the system can process.

Process

A function must be called with the required parameters, selecting the proper set of execution operations based on the "opcode" parameter. If an "opcode" is passed that does not exist for this hardware, an error code must be returned and the process must terminate.
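A minimal sketch of such a dispatch in C follows; the opcode names, the stub kernels and the dispatch function are illustrative assumptions, not the project's actual interface:

```c
#include <stdio.h>

/* Hypothetical opcodes; the real instruction set is defined by the
 * code graph layer. */
enum opcode { OP_MATMUL, OP_MATADD };

#define ERR_BAD_OPCODE (-1)

/* Stub kernels standing in for the real computational routines. */
static int run_matmul(float *a, float *b, float *c, int n)
{ (void)a; (void)b; (void)c; (void)n; return 0; }
static int run_matadd(float *a, float *b, float *c, int n)
{ (void)a; (void)b; (void)c; (void)n; return 0; }

/* Select the execution sequence for the given "opcode"; an unknown
 * opcode yields an error code, as the requirement demands. */
static int dispatch(enum opcode op, float *a, float *b, float *c, int n)
{
    switch (op) {
    case OP_MATMUL:
        return run_matmul(a, b, c, n);
    case OP_MATADD:
        return run_matadd(a, b, c, n);
    default:
        fprintf(stderr, "unknown opcode %d\n", (int)op);
        return ERR_BAD_OPCODE;
    }
}
```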

Outputs

The output is the execution sequence for the computational algorithm. The execution series must be drawn from the algorithms corresponding to the "opcodes", and can be found in an Execution Table within the interface.

Calculations on the SPE must minimize the number of instruction calls and balance these instructions over the even and odd pipelines.

Introduction

This is a performance requirement. Using fewer instructions to compute the matrices saves cycles. By balancing the instructions over the even and odd pipelines, we can execute instructions concurrently to save additional time.


Inputs

The inputs are matrices A and B for computation. The user is responsible for inputting the correct

matrices.

Processing

Using the 128 registers and row-row matrix alignment, we can minimize the number of instructions used and balance the instruction load across the pipelines. Choosing a matrix size for computation that balances instructions is necessary. A 16x16 matrix size, used for computing sections of a 64x64 matrix, divides the computations of a matrix block evenly and balances the loads as well as possible.

Outputs

The output is the 16x16 single-computation size for the SPE, with matrices A and B in row-row form.

Matrix block size must be 64x64 based on DMA transfer limitations

Introduction

This is a hardware requirement. The Element Interconnect Bus (EIB) has 4 unidirectional channels, each 16B wide, in pairs for each direction. The DMA engine transfers data along the EIB in sizes of 1, 2, 4, 8 or n*16 bytes at a time, with a maximum of 16KB.

Inputs

The input is a matrix divided into blocks of size at most 64x64.

Processing

With a maximum DMA transfer of 16KB and floating-point numbers using 4 bytes each, we have 16KB/4B = 4096 entries per block. The square root of 4096 is 64, so we use square blocks of 64x64. Blocks must be square for optimal computation time in row-row matrix form, based on balancing operations between the two EIB directions and using as many of an SPE's registers as possible.
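The arithmetic behind the block size can be stated as a few constants; a small sketch in C (the names are illustrative, not the project's code):

```c
#include <assert.h>

#define DMA_MAX_BYTES (16 * 1024)                    /* largest single DMA */
#define FLOAT_BYTES   4                              /* one float entry    */
#define BLOCK_ENTRIES (DMA_MAX_BYTES / FLOAT_BYTES)  /* 4096 entries       */
#define BLOCK_DIM     64                             /* sqrt(4096)         */

int main(void)
{
    /* A 64x64 block of floats fills exactly one maximum-size DMA transfer. */
    assert(BLOCK_DIM * BLOCK_DIM == BLOCK_ENTRIES);
    assert(BLOCK_DIM * BLOCK_DIM * FLOAT_BYTES == DMA_MAX_BYTES);
    return 0;
}
```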

Outputs

The outputs are blocks of size 64x64 of the matrices.

Blocks should be clustered into a 2x2 matrix where each element of this matrix is a block

Introduction

This is a performance requirement. Based on the size of each SPE's local store, the cluster of blocks must be small enough that a cluster for matrix A, a cluster for matrix B and a result cluster for matrix C all fit at once. The cluster size should also be large enough that as much of the LS as possible is used.

Inputs

The inputs are the matrices A, B and C divided into blocks of size 64x64.

Processing

With a local store of 256KB, we allocate 16KB for the code segment and data such as local variables. That leaves 240KB for matrices, or fifteen 16KB blocks. This leaves room for 5 blocks of A, 5 of B and 5 of C, but that creates an awkward output distribution of blocks when we try to reform C, and hence the overhead of dynamic shape grouping, an unnecessary overhead. Also, the 5th block cannot start computing until at least three other blocks are present for each matrix. With a 2x2 cluster, however, we still use 80% of the available LS space, the shape of the clusters is static, and clusters can always be broken into smaller 16x16 matrices for processor computation without padding. This makes processing the larger arrays easy for the low-level instructions to handle, and spends processing time on computations only. The remaining space can be used at a later point.
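The local-store budget behind this choice can likewise be captured in constants, continuing the same style of illustrative C sketch:

```c
#define LS_BYTES        (256 * 1024)  /* SPE local store              */
#define CODE_DATA_BYTES (16 * 1024)   /* reserved for code and locals */
#define BLOCK_BYTES     (16 * 1024)   /* one 64x64 block of floats    */
#define CLUSTER_BLOCKS  4             /* a 2x2 cluster of blocks      */

/* Three 2x2 clusters (A, B and C) occupy 12 blocks, i.e. 192KB of the
 * 240KB available after the code/data reservation: 80% utilization.  */
enum {
    AVAIL_BYTES = LS_BYTES - CODE_DATA_BYTES,      /* 245760 = 240KB */
    USED_BYTES  = 3 * CLUSTER_BLOCKS * BLOCK_BYTES /* 196608 = 192KB */
};
```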

Outputs

Output is the cluster in a 2x2 matrix with each block in the cluster being 64x64.

A code graph will be constructed to model the dependencies of the

calculations' operations

Introduction

This is a system requirement. The matrix computations will be

modeled by a code graph which will illustrate the dependencies

between computations and the DMA transfers required.

Inputs

The inputs for the code graph are the matrices to be calculated, broken up into 64x64 blocks to fit a 16KB DMA transfer. Blocks from matrices A and B are sent to an SPE's LS, where the result blocks of matrix C are allocated space; here A, B and C come from A.B = C or A+B = C.


Process

Using the Haskell language, the input matrices will be separated into 64x64 blocks. The graph is modeled as a hypergraph, where each hyperedge is a result-matrix block computation or a DMA transfer of data to the LS. The nodes between the hyperedges describe the data stored in the LS of each SPE after a computation or DMA transfer; more or less, the state. The code graph is created in Haskell using the code graph library written by Dr. Wolfram Kahl. A Haskell program using this library must import the modules CodeGraph.lhs and CodeGraphOps.lhs. An additional file, CodeGraphDot.lhs, can be used to export the generated code graph into a dot file for viewing, a very useful tool for debugging code graphs.

Outputs

The outputs of the code graph are the result matrix blocks for each of the corresponding matrix computations. Combined, the code graph outputs form the result matrix C.

The code graph must be scheduled for computations

Introduction

This is a system requirement. The generated code graph, symbolizing all the operations needed to compute the matrix calculation, must be sent to the scheduler, which will distribute the associated parts of the graph to the SPEs for computation.

Inputs

The input for the scheduler is the code graph itself. The code graph is read by nodes and edges.

Processing

The scheduler has been implemented by Dr. Wolfram Kahl and will translate a code graph created from his code graph library into scheduled tasks.

Outputs

Outputs will be a set of scheduled tasks for the SPE processors to run.

A greedy algorithm is used for scheduling processor tasks

Introduction

This is a performance requirement. As a second method of computing the matrix products, a greedy algorithm must be used for scheduling tasks. This second method is implemented to test against the code graph method and discover which is more efficient.


Inputs

The inputs are the matrix blocks from A and B, where A and B are from A.B = C.

Processing

The greedy algorithm searches a list for the highest-priority job, finds it, and then computes it. Jobs are weighted by the number of jobs that depend on them: a job with more dependents has a higher rank. Once a job is assigned to an SPE, all its dependent jobs are computed on the same SPE so that results stay local to that LS.
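A minimal sketch of the selection loop in C; the job record and its fields are assumptions for illustration, not the project's scheduler:

```c
#include <stddef.h>

/* Illustrative job record: rank counts how many other jobs depend on
 * this one, so jobs with more dependents run first. */
struct job {
    int rank;   /* number of dependent jobs (the priority) */
    int done;   /* set once the job has been computed      */
    int spe;    /* SPE the job is assigned to              */
};

/* Greedy selection: scan the list and return the unfinished job with
 * the highest rank, or NULL when every job is done. */
static struct job *pick_job(struct job *jobs, size_t n)
{
    struct job *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (!jobs[i].done && (best == NULL || jobs[i].rank > best->rank))
            best = &jobs[i];
    }
    return best;
}
```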

Outputs

Output is a list of scheduled jobs for each of the SPEs on the Cell processor.

Synchronization of data across the processors must be implemented

Introduction

This is a system requirement. Data transferred from memory to an SPE LS, from LS to LS, or from LS to memory must be synchronized across the Cell processor, ensuring that tasks are uniquely scheduled so that data remains valid. If the same job can be processed by more than one processor, the results for that job will be unstable and data validity cannot be ensured.

Inputs

The inputs to be synchronized are the data associated with jobs to be executed.

Processing

To implement synchronization, we will use lockless synchronization: data tags will be used to verify data before processing, and signals will be used to control the sending of data requests. The idea is that no locking or forced waiting occurs unless it is absolutely necessary; under good running conditions these are avoided entirely.
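The tag-checking idea can be sketched in portable C; this is only a rough illustration under stated assumptions (the structure and busy-wait are invented for the sketch, memory-ordering details are omitted, and the real implementation would use the Cell's DMA tag groups and signal-notification facilities):

```c
#include <stdint.h>

/* A shared buffer carrying a version tag: the producer writes the data,
 * then bumps 'tag', so a consumer can verify freshness without a lock. */
struct tagged_block {
    volatile uint32_t tag;   /* version written by the producer */
    float data[64 * 64];     /* one 64x64 block                 */
};

/* Wait only if the expected tag has not arrived; in the good case the
 * data is already valid and no waiting happens at all. */
static const float *acquire_block(struct tagged_block *b, uint32_t expect)
{
    while (b->tag != expect)
        ;  /* forced wait occurs only when absolutely necessary */
    return b->data;
}
```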

Outputs

Output is the synchronizing mechanism for the tasks to be performed.

Algorithms must be implemented with parallelism as the goal

Introduction

This is a system requirement. The algorithms implemented in both the high- and low-level languages must be written in such a manner that parallel computation is supported.


Inputs

None.

Process

None.

Outputs

None.

Interfaces must be determined and must follow strict rules

Introduction

This is an interface requirement. The interfaces between all the modules and subsystems of the entire system must be determined early through exploratory prototyping, and the rules for interfacing must be strict so that modularization can continue between subsystems based on the inputs and outputs that need to be produced. Negotiation between subsystem development teams must be a part of this decision.

Inputs

Exploratory prototyping information.

Process

Exploratory prototyping for each subsection must discover the required inputs and outputs for successful and optimal execution. Negotiations between the subsystems must follow the prototyping to set a strict set of rules.

Outputs

The interface between the code graph layer and the scheduler will follow a strict set of instructions, including parameters in the form of memory pointers and dimension information, to handle the tasks of transferring and computing data. The interface between the scheduler and the run time system is the same as the code graph layer's, since the scheduler is an "invisible" transport for those instructions. The interface between the run time system and the computational units in assembly is a function call, with the name specifying the type of computation and the parameters providing memory pointers.

Resulting matrix must have data that is within an accuracy tolerance TBD.

Introduction

This is a system requirement. The data result must be within a specified tolerance TBD for the system to

be successful.


Inputs

None.

Process

Due to the choice of simplifying the project by using only single-precision floating-point numbers, a sizable amount of error will be introduced through rounding during calculations. To minimize this issue, the system will pick the matrix multiplication algorithm that requires the fewest calculations, and will use the multiply-add instructions supplied by the STI Cell, which further minimize error.

Outputs

The tolerance will not be pre-determined, since the project is restricted to whatever the calculations

produce.

Computations must be parallelized across the processors.

Introduction

This is a performance requirement. The computations of the executing algorithm must be computed in parallel across all the processors, or scheduled in the most efficient pattern possible, given that the job can be executed on the Cell without running into memory constraints.

Inputs

None.

Process

The operations and data being executed must be distributed across all available processing units, or in the manner that is most efficient. The operation sequence must be analyzed and scheduled for processing. Each processor must finish its computations and then deliver its results to the proper location in the resulting matrix.

Outputs

None.

Executing algorithms must be more efficient than the default algorithms

Introduction

This is a performance requirement. The algorithms used must be more efficient than the default algorithm.


Inputs

None.

Process

None.

Outputs

None.

Instructions should be pipelined so that execution time is minimized

Introduction

This is a performance requirement. Instructions should be pipelined from the system to the parallel SPE computing the operation set. There should also be a pipeline for determining whether an SPE is free for computation.

Inputs

None.

Process

A pipeline will execute cyclically to find free SPE processors. Once an SPE is found, the instruction set is pipelined to that SPE. The SPE finishes execution and returns the resulting data through the pipeline to be compiled into the resulting matrix.

Outputs

None.

Maximize the registers and memory transfers for each SPE

Introduction

This is a performance requirement. Register and Local Store (cache-like) memory usage should be maximized to limit the number of main-memory accesses required for the arithmetic computations.

Inputs

None.

Process

None.


Outputs

None.

Machine Requirements

Introduction

This is a hardware requirement. Testing the simulations of the program requires an Intel Pentium computer capable of running the Cell SDK, with the UNIX-based Fedora Core 5 operating system. A machine with a running CBE processor would also be an advantage for execution and testing purposes.

Inputs

None.

Process

None.

Outputs

None.

I.XIII.II External Interface Requirements

User Interface

The user interface will be a command line run from a UNIX terminal. Results will also be displayed on the command line.

Software Interface

Conforming to the SDK specifications needs to be a part of the interface for simulating the software on

the CBE processor.

I.XIII.III STI Cell Processor Hardware Constraints

- The project must be designed within the hardware and software constraints imposed by the CBE specifications. Limited time and resources are also factors that will constrain the amount of production possible.

- The main memory of the Cell processor is 256MB of XDRAM and cannot be increased.

- The local store of each SPE is 256KB, used for storing data for quick memory access, and addressed relative to the local store itself.

- Bottlenecks occur when data needs to be retrieved from slower memory and storage devices.

- Data is transferred between the LS and main memory via DMA, which has a maximum transfer size of 16KB and takes 2048 processor cycles to complete.

- Each SPE has an even and an odd pipeline, each capable of its own set of instructions. Balancing instruction loads across the pipelines is necessary for faster computations.

- Each SPE has 128 registers available to it.

- The Element Interconnect Bus (EIB) has four unidirectional transfer rings, two running in each direction, each 16 bytes wide.

I.XIII.IV Attributes

- The portability of this system will be limited, as certain hardware and software requirements must be met. In particular, the hardware required is the Cell processor, or an Intel processor for simulation; the software required is the UNIX operating system on the Cell, or Fedora Core 5 with the Cell SDK for simulation.

- Maintainability needs to be high, so all work will be well documented. Abstraction into multiple layers will keep the program separated into modules.

- Availability will not be a factor as long as access to the machine is possible. Remote connections may be made possible in the future; the simulator can be accessed via SSH, with multiple connections testing the simulator at once.

- Robustness will be sufficient for the functionality of this software. Error codes will help report specific errors, and a logging file will keep records of operations.

- Security should not be an issue: all documentation, code and related files will be submitted using Subversion and kept in the repository for authorized access. The software itself, as a thesis project, has very little need for security protocols.

I.XIII.V Other Requirements

Alternative methods of implementation must be well documented and benchmarked, or proven inefficient, so that our results on the Cell can be compared. When making decisions about the implementation of the software design, data must be provided to justify the choice made, with alternative choices outlined, compared and shown to be less efficient.


II Design

II.I Proposed Problem

The project explores the possibilities of multiplying dense matrices on the STI Cell. The task involved two sequential design stages: initial knowledge gathering, followed by detailed design. N. Cumpson and C. Venantius focused on the upper theory, classified as parallel programming approaches, algorithms for computation, and code graph modeling. F. Zhao researched potential methods of implementing loop optimizations. Lastly, D. Karunaratne and A. Schulz examined the hardware of the STI Cell. Once sufficient knowledge had been gathered, the design was split into four segments. The upper level, carried out by N. Cumpson, defines the code graph modelling of the problem into schedulable instructions. A middle level, performed by D. Karunaratne and C. Venantius, is the run time system that handles processing of those instructions and memory management on the hardware. A secondary middle level of potential loop optimizations was pursued by F. Zhao, and lastly the lower-level issues of computational kernels were developed by A. Schulz. Unfortunately, the individual segments were not integrated into a working build due to time constraints; however, the goal of examining potential solutions to parallelizable problems on the STI Cell is achieved in enough detail to motivate further work in this direction.

II.II Software and Hardware

A major necessity was an organizational and software backup tool. McMaster University, through the academic advisors of this project, provided us with Subversion access to a shared repository called coconut. The repository provided the version control required and contained files necessary to implement code graphs. In order to develop the run time system, a working simulator of the STI Cell was required; therefore, a laptop with a Fedora Core environment and SDK 2.0, a development kit for the STI Cell simulator, was provided by the university. Additionally, the Glasgow Haskell Compiler (GHC) was used to develop the computational kernels and the loop optimization possibilities. Microsoft Office 2007 and OmniGraffle were used to create reports and presentations. Finally, a working STI Cell was purchased in the form of Sony's Playstation 3 to give access to a non-simulated environment.

II.III The Matrix Type

The project is being built with the possibility of taking advantage of a type system produced by Gordon Uszkay for McMaster University. The system defines a matrix type, where the type of a matrix contains information on its dimensions and attributes, among other things (Uszkay, September 27 Meeting with Advisor). The project does not assume there is a working type system of this nature; however, it is


assumed that one will exist; therefore, the benefits of such a system are included. For implementation purposes the type system is boxed out: new data types are created to represent what the type system will provide. Data types for a matrix are therefore created, with fields to store the dimensions and attributes. Additionally, type checking of the inputs before computation is assumed to be handled through the type system.
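
To make the boxing out concrete, a minimal sketch of such a stand-in data type is shown below in C. The field names are our own illustration of what the type system would supply, not Uszkay's actual definitions.

    /* Boxed-out stand-in for the matrix type: carries the entries
       together with the dimension and attribute information that the
       real type system would provide. */
    typedef enum {
        ATTR_DENSE      /* further attributes added as needed */
    } matrix_attr;

    typedef struct {
        int rows, cols;        /* dimensions of the matrix        */
        matrix_attr attr;      /* attribute tag                   */
        float *entries;        /* block entries, stored row-major */
    } matrix_t;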

II.IV Matrix Multiplication

II.IV.I Definition of Matrix Multiplication

Throughout this report a matrix will be represented by a capital letter. Additionally, matrix multiplication will be shown as A.B, where the "." represents the (non-commutative) matrix product. Mathematically, given the formula C = A.B, matrix multiplication is the inner product of the various rows in A and columns in B, providing the solution entries in C. Therefore, if A is of dimension i by j, and B is of dimension j by k, then multiplication is defined entrywise as

    C[a][c] = sum over b of A[a][b] * B[b][c],   for a = 1..i, c = 1..k

(Weisstein, "Matrix Multiplication").

II.IV.II Algorithms for Matrix Multiplication

Row Column Approach

The row column approach computes the matrix multiplication directly from the definition. One computes the inner products between rows and columns to obtain the entries of the solution. The algorithm has O(n³) time complexity, because it performs three nested loops (Foster). See appendix 1 for more details. Figure 1 illustrates the process.
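
A minimal C sketch of the row column approach follows; the dimension names are illustrative, and the code is a direct statement of the definition rather than anything tuned for the Cell.

    /* C = A.B by the definition: A is m-by-n, B is n-by-p, all row-major. */
    void matmul_row_column(const float *A, const float *B, float *C,
                           int m, int n, int p)
    {
        for (int i = 0; i < m; i++)
            for (int k = 0; k < p; k++) {
                float sum = 0.0f;            /* inner product of row i and column k */
                for (int j = 0; j < n; j++)
                    sum += A[i*n + j] * B[j*p + k];
                C[i*p + k] = sum;            /* three nested loops: O(n^3) */
            }
    }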

The problem with this approach appears when it is applied to a practical machine. Realistically, the goal is to perform large matrix multiplications, where large means that the dimensions of the matrices are in the thousands. The above process can be logically broken down into a series of repeated inner product routines, so each computation involves the inner product of a large row and a large column. Holding a large row and column requires the system resources to retrieve them and compute with them. This is not realistic on any machine, since memory, be it cache or, in the STI Cell case, the local store (LS), will eventually reach its limit. A solution to this situation is to iterate over the individual multiplications of the inner products, producing partial results along the way. However, this method means each computation is a single floating point multiplication routine, which is impractical because of its simplistic nature. Therefore, another avenue has to be examined: dividing the rows and columns into blocks that can be computed within the memory restrictions.

Figure 1: row column matrix multiplication

Row Column Blocked Approach

The row column blocked approach performs the row column method on blocks instead of entries. A block is a square matrix that forms part of the composition of a larger matrix. Therefore, for A.B = C, the A and B matrices are broken into blocks, which are in fact smaller matrices that all share the same dimensions (Karypis). The matrix multiplication routine then becomes an inner product over blocks instead of an inner product over entries. Figure 2 shows this process.
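
A hedged C sketch of the blocked variant is given below. It assumes blocks are stored contiguously in row order (the storage format adopted later in this report) and that C starts zeroed; the names and the helper block_mac are our own.

    /* Multiply-accumulate one k-by-k block pair: c += a.b (row-major blocks). */
    static void block_mac(const float *a, const float *b, float *c, int k)
    {
        for (int i = 0; i < k; i++)
            for (int j = 0; j < k; j++) {
                float s = c[i*k + j];
                for (int t = 0; t < k; t++)
                    s += a[i*k + t] * b[t*k + j];
                c[i*k + j] = s;
            }
    }

    /* A is mb-by-nb blocks, B is nb-by-pb blocks, C is mb-by-pb blocks,
       each block k-by-k and stored one after another in row order. */
    void matmul_blocked(const float *A, const float *B, float *C,
                        int mb, int nb, int pb, int k)
    {
        for (int I = 0; I < mb; I++)
            for (int K = 0; K < pb; K++)
                for (int J = 0; J < nb; J++)      /* inner product over blocks */
                    block_mac(A + (I*nb + J)*k*k,
                              B + (J*pb + K)*k*k,
                              C + (I*pb + K)*k*k, k);
    }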

One immediately has to ask: what if the dimensions of the blocks do not map onto the matrix perfectly? That is, if A is of dimension m by n, and the block is of dimension k by k, we are in the situation where m/k or n/k produces a remainder. Figure 3 shows this situation.

This can be solved in various ways. One is to implement block padding, which refers to extending all matrices by columns and rows of zeros until they align with the block sizes. However, this is not preferable, since the added rows and columns have to be stored in memory, producing waste. The other option is to have different block sizes: special blocks would be defined to handle the situations where there is not enough data to fill the standard block. However, this causes the overhead of defining various blocks. Fortunately, the type system utilized provides a way of handling this situation. An attribute named regency can be created that defines a matrix consisting of a sub-matrix of data, with the remaining entries being zeros. Figure 4 shows a visual example of this.

Figure 2: row column block matrix multiplication
Figure 3: alignment issues on blocks
Figure 4: regent matrix
Figure 5: location of regency

Therefore, we can define the blocks that cover only a partial section of the matrix as regency blocks. Figure 5 shows a visual representation of where possible

regency blocks can occur. The figure shows how the rightmost block of each row of blocks, and the bottommost block of each column of blocks, run the risk of being regent matrices. However, the type system allows only the required data to be stored in memory, while the formulas treat the block like the others. The only difference is the attribute tag indicating it is a regent matrix, letting the computation know how it is stored in memory and how to proceed.

Therefore, this method is in a sense the same as the previous row column approach with iterations over the individual entries. Here, the entries are really blocks, and the system iterates over the matrix multiplication of two blocks to produce the partial inner product result.

Strassen Formula Approach

The above two approaches both use the basic algorithm from the definition of matrix multiplication. However, there exists another algorithm one could employ: the Strassen formula, named after its inventor Volker Strassen. If applied to C = A.B, it breaks the A, B and C matrices into four blocks each. The formula then defines seven intermediate Q matrices that are the result of operations on the A and B blocks. The Q matrices are then used to recover the C blocks, which compose the solution. The solving of each Q matrix involves another matrix multiplication, this time with the dimensions of the block size. The Strassen formula can be applied recursively to each of the Q computations, until only single entries are being considered (Weisstein, "Strassen Formulas"). The formulas are:

    Q1 = (A11 + A22)(B11 + B22)
    Q2 = (A21 + A22) B11
    Q3 = A11 (B12 - B22)
    Q4 = A22 (B21 - B11)
    Q5 = (A11 + A12) B22
    Q6 = (A21 - A11)(B11 + B12)
    Q7 = (A12 - A22)(B21 + B22)

    C11 = Q1 + Q4 - Q5 + Q7
    C12 = Q3 + Q5
    C21 = Q2 + Q4
    C22 = Q1 - Q2 + Q3 + Q6

(Weisstein, "Strassen Formulas")

The above algorithm reduces the complexity of computing a matrix multiplication for cases where the maximum dimension of the inputs is greater than 654 (Weisstein, "Strassen Formulas"). Since the project is mainly concerned with larger matrices, this seems like a likely implementation. However, the Strassen formula introduces more sources of round-off error because of its increased number of operations. This project concerns itself only with single precision floating point numbers; therefore, this decision requires that a constant focus on reducing rounding errors be maintained: see


appendix 1 for more details. It is for this reason that the Strassen approach will not be implemented, but it is important to note that with double precision floating point numbers it could be a potential efficiency increase.

Decision

The row column method provides a sure way of computing the answer, but it suffers for large matrices because of memory constraints. The Strassen approach can decrease the time complexity of the operation, but introduces instability issues. The block row column approach provides a method that works at a practical scale in terms of memory constraints and does not introduce instability. In conclusion, this is the matrix multiplication algorithm that will be employed for the project.

II.V Storage Format

II.V.I Row Order

The previous section already outlined that a matrix will be broken into blocks in order to help with the computations. The question to be asked, then, is how a matrix and its associated blocks are to be stored in memory. First, this question must be examined at the block level. A matrix is now broken into a number of blocks, and it makes sense to store them in memory one block after the next until there are no more. It is necessary to decide on a set way of storing these blocks, either in row or column order. The lower level, referring to the implementation on the STI Cell, shows an increase in efficiency if blocks and their entries are all stored in row order. An explanation of why this is can be found in the algorithms for kernel matrix multiplication section. Figure 6 illustrates the storage order.

Figure 6: storage format

II.V.II Entries Storage

The entries, which are single precision floating point values, can be stored in two major data structures recognized by the hardware. The entries can be stored as vectors, where each vector stores a row of entries in the corresponding block. Vectors have supported functionality within the hardware; therefore,

they provide an opportunity to take advantage of the provided tools (Cell Broadband Engine Programming Handbook). In contrast, the entries can be stored as a one dimensional array, where one array stores all the values of a block. Where one row ends and another begins is easily found, since the block is of matrix type with its dimensions known: the size of the column dimension of the block gives the spacing between rows in the array. Arrays also make it easier to create space-saving techniques when dealing with special matrices (Uszkay, October 4 Meeting with Advisor). For example, a diagonal matrix only needs the data along the diagonal to be stored in memory, and not the entries off the diagonal. Since the project will be using the type system's benefits, attributes can be tagged to matrices through their type, providing a way to save space without losing the structure. Therefore, both approaches provide a method to store the entries, but the vector method hinders the ability to use the attributes to save memory: see appendix 2 for more information. It is for this reason that the block entries are stored as a one dimensional array.
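
As a small illustration of this layout, entry (i, j) of a block stored as a one dimensional row-major array is located using the block's column dimension:

    /* Row-major access: rows are spaced cols entries apart in the array. */
    static inline float block_entry(const float *block, int cols, int i, int j)
    {
        return block[i*cols + j];
    }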

II.VI Direct Memory Access (DMA)

II.VI.I Definition of DMA

The DMA engine is the connection between the SPU, the PPU and other IO devices. DMA transfers are executed from DMA commands issued by each device. SPU elements have a Memory Flow Controller (MFC) unit that has read and write ports of 16B. These ports communicate with the DMA engine and issue the DMA commands needed for fetching and sending memory. The SPU issues MFC commands to control its associated MFC unit. When data is transferred across the DMA, an optional 5-bit tag can be appended to the transfer and used to signal different options for synchronization. There are three different ways that data is transferred using the DMA:

DMA Transfers: the tool to transfer data between main memory and the local stores, for the PPE and SPEs. SPEs use DMA to transfer data asynchronously in parallel, to hide memory latency.

Mailboxes: mailboxes store up to 16 32-bit messages. Each SPE has two mailboxes, one for sending messages and one for receiving messages. The mailboxes are used for communication between the PPE, SPEs and other devices.

Signal Notification: also called signaling, used to communicate between the PPE and other devices. Signaling uses 32-bit registers that can be configured for one-to-one or many-to-one signaling.

The DMA transfers have the following characteristics:

- Transfer sizes for the DMA have a maximum of 16 KB.
- Each transfer must be 1, 2, 4, 8 or n * 16 bytes.
- Memory access can use GETL, a stanza (DMA-list) scatter/gather command: data distributed in global memory is gathered and packed into the local store.
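
For reference, a single tagged transfer on the SPU side looks roughly as follows under the SDK's C intrinsics. The buffer name and tag value are illustrative; the real system defers the wait, as discussed later under synchronization.

    #include <spu_mfcio.h>

    static volatile float ls_block[64*64] __attribute__((aligned(128)));

    void fetch_block(unsigned long long ea)
    {
        unsigned int tag = 5;                   /* 5-bit tag, 0..31 */
        mfc_get(ls_block, ea, sizeof(ls_block), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);           /* select tags to wait on */
        mfc_read_tag_status_all();              /* block until the DMA completes */
    }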

II.VI.II DMAs for Matrix Multiplication

For the problem C = A.B, each inner product of blocks that computes an associated C block will be done in two stages. The first stage is the multiplication of the first A and B blocks of the inner product; this result is stored as a partial solution. The second stage is the repeated multiplication of the remaining A and B blocks to complete the inner product; these results are summed with the previous partial results. Therefore, if each of these block multiplications is assigned to a processing unit, the first stage has two DMAs as input to the computation and one DMA as output. The second stage consists of three DMAs as input to the computation and one DMA as output. Figure 7 illustrates this notion.

One can see right away that the second multiplication of the inner product is dependent on the first multiplication, the third is dependent on the first two, and so on. This occurs because the design implements the algorithm of matrix multiplication as one would logically process it. One might ask whether it is more efficient to perform some multiplications of the inner product ahead of others. Since the situation is an inner product of square matrix blocks of the same dimensions, the answer is no. If there is a quicker block multiplication routine, as in the case of a zero block times a dense block, the increase in efficiency for the entire inner product operation would be the same whether that multiplication was done first, last or somewhere in the middle: see appendix 3 for more information.

Figure 7: code graph notion of the processing


II.VI.III Reduction of DMAs for Inner Product Computation

The previous section makes the assumption that each processing unit does not care about the inner product computation as a whole, but only about its assigned matrix block multiplication.

However, if one forces the processing units to process certain flows based on the data, DMAs can be eliminated. For example, take the situation where a processing unit is provided a block from A and a block from B to begin the computation of an inner product. Once completed, the processor can DMA the C result as output but request the next multiplication that involves its current A block. A processor will therefore first compute all of the block multiplications required in C = A.B for an associated A block: see appendix 4 for more information. This eliminates all DMAs for the A block during the inner product computation. Figure 8 shows the process of applying this type of focus to a processor.

The issue with this approach, which is similar if one uses a B block focus, is the additional dependencies it creates in the algorithm. As shown in figure 7, there already exists a set of dependencies on the order of operations because of the modelling chosen. Choosing an A or B block focus adds more constraints, since we are now creating dependencies across an inner product computation, which is against the model. Therefore, it is recommended to explore a C focus approach, as shown in figure 9. The C focus approach allows the system to eliminate the DMAs of C data, representing a partial sum while computing an inner product, until the calculation is completed. This cuts down the number of DMAs by 1/3 for all second stage operations of the inner product. However, unlike the A or B focus approach it

does not create additional dependencies. The rationale is that a C focus approach does not conflict with the model of matrix multiplication chosen. Therefore, this is the approach the project will take for implementation.

Figure 8: A focus reduction

II.VII Cell Programming Models

II.VII.I PPE Programming Models:

The PPE is programmed traditionally and acts as the OS and hypervisor. The PPE establishes a run-time environment for the SPE elements by handling exceptions, memory management and other functions.

II.VII.II Small-Single SPE Models

An SPE can be programmed with small programs of up to 256KB that fit within the local store memory of the SPU. The MFC for the SPU fetches all the program code and data through the DMA, completes the task, and then returns the result back to memory. There is no need to translate local store addresses (LSA) to effective addresses (EA) for retrieving new data.

II.VII.III Large-Single SPE Models

These models run small tasks on an SPE using both the local store and global memory. There are a few different methods used for implementing this model: Streaming, Automatic Software-Managed Data Cache, Plug-in, Job Queues and the Cell Embedded SPE Object Format (CESOF) unique to the Cell. Details about the implementations of these methods can be found in appendix 11.

II.VII.IV Parallel Programming Models

These models connect a series of Large-Single SPE models where data shared between the memories of the SPU elements must be properly synchronized using some type of locking mechanism for data that resides in effective address space. The implementations of these models are: Streaming, Job Queue, Self-Multitasking of SPE elements, Message Passing and Pipelining. Details about the implementations of these models can be found in appendix 11.

Figure 9: C focus reduction


II.VIII Clustering

II.VIII.I Definition

It has been established that an SPE will carry an inner product computation for a C block from start to finish. Clustering is defined by this report as the notion that an SPE will process the computations of multiple C blocks, called a cluster, from start to finish through an interleaved method. The interleaved method means the SPE processes the first block multiplication of the inner products of all of its assigned C blocks, then continues to compute all the second block multiplications, the third, and so on, until the inner products of all C blocks are done.

II.VIII.II Motivation

The reason for pursuing clustering is data locality and the minimization of DMAs. Having the C blocks connected, in the sense of being from the same row or column or a combination, creates data locality: the inner product computations for each C block will be using shared blocks from the multiplicands. Examples of three ways to cluster four C blocks are shown in figure 10.

As one can see, there is a pattern relating the DMAs of required multiplicand blocks to the cluster size. In fact, the total number of DMAs for a block multiplication step in the inner product is the sum of the cluster's dimensions, where the dimensions are counted in C blocks. Therefore, if one labels a cluster's dimensions like a matrix, so m by n, the number of C blocks the SPE computes is m*n, and the number of DMAs per C block for each inner product multiplication step is (m + n) / (m*n) (Anand, Meeting: Examination of Clustering and Haskell Code). This ratio is always minimized when m and n are brought as close to equal as possible, indicating that clustering with square dimensions, m by m, is beneficial.
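
As a worked example over four C blocks: a 1x4 cluster needs (1 + 4) / (1*4) = 1.25 multiplicand DMAs per C block per inner product step, whereas a 2x2 cluster needs (2 + 2) / (2*2) = 1, so the square shape saves one DMA per step for every four C blocks.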

Figure 10: clustering reduction


II.VIII.III Size of Cluster

If the cluster has to have square dimensions, what dimension should be chosen? To answer that question, the implications of different cluster sizes have to be examined.

II.VIII.IV Block Size Implications

Each SPE has the memory constraint that its local store is only 256 KB (Cell Broadband Engine Programming Handbook). It is mandatory that 16 KB of this memory be set aside for the code segment that implements the hardware computations: more on this later in the report. Additionally, each C block computation requires that three blocks of matrix data be stored in the LS: the C partial result, the A block and the B block. It makes sense to break the remaining LS space into slots, where each slot can hold a block-sized amount of data. Therefore, if one increases the block size under a fixed memory limit, the number of slots decreases, and so does the number of C block computations a single SPE can process. Therefore, choices on the block size must be made.

II.VIII.V Block Size Choice

The hardware implementation must be able to process block multiplication. A restriction from the hardware is that block sizes be multiples of four: the STI Cell can read four single precision values at a time from the LS, so it makes sense to accommodate this (Cell Broadband Engine Programming Handbook). This leaves choices in the range between 4x4 and 128x128 that are square and multiples of four; any larger square matrix goes beyond the bounds of the LS for an SPE to compute a single C block solution. A second factor in the choice of block size is that one would want to use as many registers as are available to the SPE for a given computation. Below that value, unused registers exist, indicating a potential to speed up the computation; above it, the computation cannot be done efficiently, since the task must be broken down into sub-steps. Based on the STI Cell, the choice that maximizes register use is 64x64. Additionally, the maximum amount of data in a DMA from the memory unit to an SPE is 16 KB, which translates exactly to a 64x64 block (Cell Broadband Engine Programming Handbook). Therefore, this is the choice that will be used for the block size.
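
As a check of the arithmetic: a 64x64 block of single precision values occupies 64 * 64 * 4 bytes = 16 KB, exactly one maximum-size DMA, and the LS left after the code segment gives (256 KB - 16 KB) / 16 KB = 15 block slots, a figure used in the cluster size choice below.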

II.VIII.VI Cluster Size Choice

A block size of 64x64 corresponds to 16 KB of memory usage. Dividing the LS remaining after the code segment allocation creates 15 slots that can each store a block of data. This indicates that at most 5 C block computations can occur in the cluster. As indicated in the motivation for clustering, the goal is to make the dimensions of the cluster as close to equal as possible; in the case of 5 C blocks this results in a 1x5 cluster shape, which does not utilize the potential of saving DMAs through clustering as hoped. However, if 4 C blocks of 64x64 dimensions are computed, one has a clustering


shape of 2x2, the square that was hoped for. Although this route reduces the number of solutions an SPE processes at once, it provides three extra slots that can be used as buffers for the code segment if it grows above 16 KB.

II.IX Parallelism Methods

II.IX.I Data Parallelism

Data parallelism refers to breaking a problem into pieces to be computed concurrently by multiple handlers (Henri and Haines). In the project's case, the handlers are the multiple SPEs that will concurrently compute different C block clusters. This provides a logical division where one computation does not rely on another computation terminating. Data parallelism is an obvious method to use, because matrix multiplication lends itself to these logical breaks where tasks can be delegated concurrently. The motivation for utilizing this method on the STI Cell is that it allows all available SPEs to share in the computation, thereby increasing the overall efficiency.

II.IX.II Pipeline Parallelism

It has already been established that a given SPE will be computing a single cluster of four C blocks at a time. However, when dealing with large matrices it is very likely that the blocks used in the inner products will be required by more than one cluster computation. A flaw in the current design is that all blocks of data appear to be DMAs from XDR memory, the local hardware memory for the whole Cell unit (Cell Broadband Engine Programming Handbook). It would be more efficient to DMA the blocks that are used by other SPEs from the SPEs that currently hold the information. This defines a pipeline system, where data flows from one unit to another that requires it (Henri and Haines). Additionally, such a transfer uses the LS to LS bandwidth and not the memory access bandwidth, eliminating possible transfer clashes and avoiding the slower memory unit. The SPEs also have a mailbox system that provides support for handling the pipeline: each SPE contains two mailboxes, one set for receiving and one for sending, and each mailbox can hold a queue of 16 32-bit messages. The issue with implementing such a system is the difficulty of ensuring the data is where it needs to be when the DMAs are called, since this is not guaranteed, as it is in the case of XDR memory. One potential solution is to use the mailboxes to send messages that a particular block of data has been sent between two LS units. The receiving SPE can then check its mailbox for a message indicating the data is there before reading and computing with that particular block. More information on this will be provided when the report discusses the run time system. For now, what is important is the identification of the potential increase in efficiency through pipelining the data, and the potential pitfalls of synchronization.
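
A hedged receiver-side sketch of the mailbox check is shown below. The intrinsics are from the SDK's spu_mfcio.h; the convention that the message carries a block identifier is this project's own.

    #include <spu_mfcio.h>

    /* Wait for a message announcing that an LS-to-LS transfer has landed. */
    unsigned int wait_for_block_message(void)
    {
        while (spu_stat_in_mbox() == 0)
            ;                      /* nothing yet: the block may still be in flight */
        return spu_read_in_mbox(); /* block identifier sent by the upstream SPE */
    }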


II.X Methods of Implementation

II.X.I CodeGraph

The Overview

The codegraph method is the process of taking a series of input nodes, defining a set of dependencies, linking the nodes through hyper edges to create new graphs, and eventually producing output nodes. The objective is to use a method that takes a series of input nodes, where the input nodes contain the blocks of matrices A and B. It then defines the hyper edges to be the DMAs and instructions that must occur for the inner product computation. The dependency list holds the dependencies for each computation and DMA, as discussed in earlier sections. The newly created nodes are the changes in state of the various SPEs through the stages of computing the inner products. Finally, the output nodes are a series of nodes where each contains a solution block of data. The benefit of using the codegraph is that it models the problem of matrix multiplication in a clear fashion. Additionally, when handling special matrices, or when the project begins to take attributes into account, the codegraph method will account for these dynamically, as long as the hyper edges and rules for them are provided.

The implementation of the codegraphs will be done using the CodeGraph library provided by Dr. Kahl, found in the Coconut repository. The repository has a variety of available codegraph functions that demonstrate implementations of his codegraph theory. The CodeGraph library is written in the Haskell language and can output Dot files which, through the Graphviz program, can be rendered into graphical files for viewing. Codegraphs can be created in this library in two different manners: by joining graphs together to compose a larger graph, or by defining a mapping of nodes to edges. The latter is the route we have chosen to follow.

Loop Specification

Using the CodeGraph library, we need to build a codegraph that fits the requirements above, such that we have hyper edges for all the instruction calls and nodes representing the state of the local store memory on each of the SPEs. The codegraphs created are then scheduled for execution. However, simply creating codegraphs for all the matrix operations will not suffice, since the size of a codegraph can be very large when the input matrices are large. Specifically, if we have an mxn and an nxp matrix, then each result element needs O(C*n) edges, where C is the number of DMA transfers and instruction calls needed per pair of input elements and n is the shared dimension of the two input matrices. The result matrix is an mxp matrix whose entries each need O(C*n) edges, so we have O(C*n*m*p) edges in total. If m≤n and p≤n, then this is O(C*n³), or just O(n³), edges. A graph of this size could easily require more than the available memory capacity, so a better solution is required.


Figure 11: 2x2 dense-dense matrix multiplication

To improve the codegraph, we find a pattern in the codegraphs generated and put this pattern into a looping structure. The pattern we found is the repetition of DMA transfers and a single arithmetic instruction, such as a floating-point multiply-add, for a single cell's instruction channel. The instruction channel, being the sequence of instructions required for computing a cell's result, is now represented as a looping codegraph which takes up O(C) edges in memory. This is an improvement, but our codegraph will still have O(n²) edges in memory. This still requires an improvement, so another pattern is identified throughout the matrix operation: each cell requires the same instruction channel, with different inputs and outputs, to compute its result. Thus, we create a new looping structure for the result matrix, where each loop executes the looping structure for a single channel of instructions. This results in a memory complexity of O(C), which is manageable in memory.

Scheduler

Once the system has the codegraph, an existing codegraph scheduler will schedule the jobs to be computed by the SPEs, taking the dependencies into account. The existing scheduler was also created by Dr. Kahl and is found in the Coconut repository. Interfacing the codegraph with the scheduler such that it recognizes the number of iterations for each codegraph will be a challenge. Specifics on the interfacing will come from implementing the codegraph and discovering what inputs, outputs and graphs result from the CodeGraph library.

Figure 12: loop specification


II.X.II Haskell Directly

The other implementation option is to use the Haskell functional programming language to write a program that does everything required. Basically, the program needs functions to take a series of blocks for A and B and arrange them in an ordering that indicates which blocks belong to which cluster. After this, the program requires a function that takes each cluster and creates all of the jobs, consisting of DMAs and multiplication routines, to compute the inner products. Following this, the program allocates the clusters to SPEs in a manner that balances the jobs among the processors. Finally, the program assigns where in memory each block goes for each SPE, specifically the slot number in the LS. A Haskell implementation that provides all the functionality except for the assignment of memory slots is found in appendix 6. The output of this code, which is a sequence of instructions instead of a graph, can then be sent to a basic linear scheduler, which assigns the jobs without any parallelism optimizations. The goal was to provide a Haskell scheduler that incorporates the pipeline parallelism methods discussed earlier, giving the project something to contrast against the code graph approach. A major drawback already noticed with the Haskell method is that it does not model the system as concisely as the code graph approach, making it more difficult to impose the desired algorithms. Because of this drawback, and after further research, it has been decided to forgo the Haskell option.

II.XI Run Time System

II.XI.I Functionality Overview

The job of the run time system is to interpret the instructional calls defined in the system and inform the appropriate SPE to process the matching command. The instructional calls involve either computation of data or memory accessing. Memory accessing covers the operations of getting data from memory to an SPE's LS, sending data from an SPE's LS to memory, and sending data from one SPE's LS to another SPE's LS. The computations, at this point in time, refer to dense matrix block multiplication.

II.XI.II Context Information

In order for the run time system to operate, the SPEs in the system must know information about the overall system and each other, both to control the flow of processing data to a specific SPE and to enable SPE to SPE communication. This information will be set up by the PPE and passed to every SPE in the system. The context information is essentially a table listing every SPE's identification number given by the STI Cell hardware, a pointer to its LS base address, and a pointer to its problem state area. The pointer to the LS allows every SPE in the system to access the base address of the LS of every other SPE, so it can do LS to LS memory transfers. The pointer


to the problem state area gives access to the SPE mailbox and SPE signalling constructs, which can be employed for SPE communication (Cell Broadband Engine Programming Handbook).

The run time system can now control the flow of processing data by assigning different instructions to different SPEs based upon their SPE identification numbers. Unfortunately, affinity masks, which would allow the system more control over accessing specific SPEs when desired, and could therefore eliminate the need for this construct, have not been implemented in the current SDK (Greenberg).

II.XI.III Control Threads

In all programs written on the STI Cell that employ the SPEs, a control thread for each SPE is created in the system. The control thread is made through the PPE by using a program handle, which links the SPE to the particular program that it is to run. Once the control thread is created, the SPE immediately begins execution of the code in the program handle. This is necessary to cover in the design report, since it outlines the way the run time system will initiate the SPEs it utilizes (Cell Broadband Engine Programming Handbook).
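
A hedged PPE-side sketch of this creation step, using the libspe interface shipped with SDK 2.0, is given below; matmul_spu is a hypothetical embedded program handle and context is the table described above.

    #include <libspe.h>

    extern spe_program_handle_t matmul_spu;   /* hypothetical SPE program image */

    int start_spes(int n, void *context)
    {
        for (int i = 0; i < n; i++) {
            /* Each SPE begins executing matmul_spu the moment its
               control thread is created; context carries the table
               of SPE ids, LS base pointers and problem state areas. */
            speid_t id = spe_create_thread(0, &matmul_spu, context, NULL, -1, 0);
            if (id == NULL)
                return -1;
        }
        return 0;
    }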

II.XI.IV Buffers

Definition

In order to store the data each SPE needs to process in an organized manner, areas in each SPE's LS have to be set aside. The run time system therefore creates buffer constructs. A buffer is an array of set size, where each element of the array can hold a logical chunk of data. In this case, the size of a buffer element coincides with the size of a matrix block.

Data Buffer

A data buffer is a buffer in every SPE's LS that only stores blocks of data used in the computations. The size of this buffer, based on the assumption that the system employs the clustering idea, will be a minimum of four. However, when this report discusses buffering later on, it will show why it is better to have a data buffer size of at least eight, and possibly twelve.

Figure 13: general control direction


Solution Buffer

A solution buffer is a buffer in every SPE's LS that only stores the partially computed matrix blocks. Based on the clustering idea presented earlier in this report, the size of this buffer is four. The only purpose of having a solution buffer and a data buffer, instead of one larger buffer, is readability and organization in the implementation.

Buffering

Buffering is a process that helps eliminate potential waits on data through a computational process. It works on the principle that there is an associated amount of data required for a computation that is performed repeatedly. For N buffering, where N is an integer value of at least two, the data for N computations is first loaded into the system. When the system begins computing, it computes using the data for the first computation, then moves to the second, third, and so on. The point of buffering is that when the system is done using the data for computation x, it begins the transfer of data for computation (x + N), overwriting the information just used. Therefore, with the exception of start-up costs, if the system is computation bound, meaning it takes longer to compute the repeated task than to transfer the data for it, the system should never have to wait on data once it is initiated. The amount of buffering allows the system to "see" that far ahead in terms of data; double buffering, with an N of two, allows the system to see the data for the computation two steps ahead at any given time.
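
A minimal double-buffering sketch (N = 2) is shown below; the buffer names, the block stream layout and the compute hook are illustrative, not the project's actual code.

    #include <spu_mfcio.h>

    #define BLOCK_BYTES (64*64*sizeof(float))

    static volatile float buf[2][64*64] __attribute__((aligned(128)));
    extern void compute(volatile float *block);   /* placeholder computation */

    void process_stream(unsigned long long ea, int nblocks)
    {
        int cur = 0;
        mfc_get(buf[0], ea, BLOCK_BYTES, 0, 0, 0);    /* prime buffer 0, tag 0 */
        for (int x = 0; x < nblocks; x++) {
            int nxt = cur ^ 1;
            if (x + 1 < nblocks)                      /* start fetching block x+1 early */
                mfc_get(buf[nxt], ea + (unsigned long long)(x + 1)*BLOCK_BYTES,
                        BLOCK_BYTES, nxt, 0, 0);
            mfc_write_tag_mask(1 << cur);             /* wait only on the current buffer */
            mfc_read_tag_status_all();
            compute(buf[cur]);                        /* DMA for x+1 overlaps this work */
            cur = nxt;
        }
    }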

For testing purposes, the run time system currently employs double buffering, hard coded into the system in order to illustrate the notion of buffering when testing. The rationale for hard coding is that the amount of buffering to take place will ultimately be determined by information from the codegraph, the logical layer above in the overall system. Using the codegraph to explain: information on the latencies of every instruction is provided in the upper layer. Since the codegraph illustrates the dependencies of instructions, and now contains their associated latencies, it will be able to schedule the operations with as much buffering as the latencies for computations allow. However, there is a physical limitation on this in the LS data buffer. If the data buffer is set to a size of eight, then only a maximum of double buffering can be obtained; there is no more room in the LS memory to store steps further ahead than that. If the system keeps each buffer spot at the size of a matrix block, the maximum amount of buffering that can be done is three. The only way this can change is by logically changing what a computation means to the buffering. The above works on the assumption that a computation routine is the process of computing one inner product step for each of the solution blocks in the SPE, so four inner product steps are computed. However, there is the possibility of breaking this down by individual solution blocks, so that each computation requires only two buffers of data, some of which are already in the system because of clustering. Either way,


when integrated, the scheduler should make the buffering choices based only on the latencies.

II.XI.V Memory Transfers

Main Memory to SPE’s LS

The run time system must provide a way for an SPE to transfer data from memory to its LS. The codegraph will issue an instruction and provide the run time system with enough information to process it. The signature of the instruction is of the form:

Instruction name: inDMA
Parameters: address in memory (pointer), index (integer)

The first parameter is a pointer to an address holding the start address of the data to be transferred into the LS. The second parameter is an integer indexing the data buffer, indicating the location within the LS where the data is to be stored.

The actual instruction is implemented by making a DMA List call to transfer the data. The DMA List call, requiring minimal code setup, is a built-in command in the SPU library provided by the SDK. It takes as parameters the effective address of the source of the data, coming from the signature's first parameter; the relative address of the destination, namely the data buffer spot indexed by the second parameter; the size of the transfer, which is the size of a buffer spot; and a DMA tag used for the DMA channel. The transfer then begins placing transfer commands on the SPE's DMA channel to process the DMA List. The "list" refers to the fact that the transfer of the data is broken into sequential pieces and sent in a non-deterministic order. The call does not ensure the DMA is done processing; it only places a command to process the DMA on the SPE's DMA controller. Normally, this call would be followed by a waiting command that waits for the DMA with the matching tag. However, this thesis is going to explore a different method of synchronizing the data, discussed under synchronization.
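
A hedged sketch of inDMA is given below. For brevity it issues a single mfc_get where the project uses the DMA List call, and data_buffer is an illustrative name for the LS data buffer; note that, as described above, no wait is issued here.

    #include <spu_mfcio.h>

    #define BLOCK_BYTES (64*64*sizeof(float))
    static volatile char data_buffer[8][BLOCK_BYTES] __attribute__((aligned(128)));

    void inDMA(unsigned long long src_ea, unsigned int index)
    {
        /* Tag the transfer with the buffer slot number, so later
           instructions can wait on precisely this slot if needed. */
        mfc_get(data_buffer[index], src_ea, BLOCK_BYTES, index, 0, 0);
    }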

SPE’s LS to Main Memory

The run time system must be able to handle instructions that transfer data from an SPE's LS to main memory. The signature of the instruction is of the form:

Instruction name: outDMA
Parameters: address in memory (pointer), index (integer)

In this case, the pointer gives the address in memory where the data is to be stored. The index is used to index the solution buffer, where the data itself is found. As in the case of transfers into the LS, the


transfer is handled through a built-in command in the SDK. The parameters of the command are the effective address of the destination, which is the pointer to a spot in main memory; the relative address of the source, namely the solution buffer spot given by the index; the size of the transfer, which is the size of a buffer spot; and a DMA tag for the DMA channel.

SPE’s LS to SPE’s LS

Finally, in terms of memory transfers, the system must be able to handle instructions that transfer data from one SPE's LS to another SPE's LS. The motivation is that main memory access takes longer than LS memory access; the system can reduce the transfer time by taking advantage of direct SPE to SPE transfers. The signature of the instruction is of the form:

Instruction name: transferDMA
Parameters: source index, destination index, to SPE (all integers)

It is important to remember that the SPE sending the data is where the instruction is processed. The source index identifies which data buffer spot is to be transferred. The destination index indicates which data buffer spot in the destination SPE the data is to be written to. The SPE value determines which SPE to send the data to. Since the context information contains base pointers to every SPE's LS, the system can use the SPE integer parameter to get the base address of the destination LS. Then, using the data buffer's address in the sending SPE, which is a local address and is the same on all SPEs, it calculates the effective address of the destination of the data. It can therefore use the same sending command used in the other transfers, with the exception that the effective address does not point to main memory but to an SPE's LS.
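
A hedged sketch of transferDMA follows. ls_base is the context table of LS base effective addresses built by the PPE; because LS offsets are identical on every SPE, the destination effective address is the target's LS base plus the sender's own local buffer address. Names are illustrative.

    #include <spu_mfcio.h>
    #include <stdint.h>

    #define BLOCK_BYTES (64*64*sizeof(float))
    extern unsigned long long ls_base[8];   /* from the PPE-built context information */
    extern volatile char data_buffer[8][BLOCK_BYTES];

    void transferDMA(unsigned int src, unsigned int dst, unsigned int to_spe)
    {
        /* LS addresses are 32-bit; the local slot address doubles as the
           offset into the destination SPE's memory-mapped LS */
        unsigned long long ea = ls_base[to_spe]
                              + (unsigned int)(uintptr_t)data_buffer[dst];
        mfc_put(data_buffer[src], ea, BLOCK_BYTES, src, 0, 0);
    }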

II.XI.VI Data Computation

Codegraph to the Run Time System

The run time system has to be able to handle instructions that process the computation routine of block multiplication. The signature used for this computation between the codegraph layer and the run time system is:

Instruction name: FMBIP
Parameters: index 1, index 2, index 3, tag 1, tag 2 (all integers)


The command FMBIP stands for float matrix block inner-product, and refers to the logical piece of the multiplication algorithm assigned to a processor. The index 1 parameter identifies the first data buffer spot storing one of the two matrix blocks to multiply. The index 2 parameter identifies the second data buffer spot involved in the multiplication. The index 3 parameter identifies the solution buffer spot where the partial solution is to be stored. Finally, the two tags are used to ensure that the data in the buffers for multiplication is correct; more on this when the report discusses the synchronization method. The purpose of this instruction is to tell the run time system that an SPE must perform a computation at this point in time. This computation is done by calling a kernel routine that performs the multiplication.

Run Time System to Kernel

After the computation of two matrix blocks has been issued, the run time system must issue an instruction to run the kernel that performs the computation. How the kernel processes in detail is discussed in the next section. The signature used between the run time system and the kernel is:

Instruction name: FMBIPK
Parameters: address 1, address 2, address 3 (pointers)

The only differences from the codegraph-to-run-time-system computation instruction are the removal of the tag information and the change from indices to addresses. The kernel is passed the base addresses of the two data buffer spots being multiplied, address 1 and address 2, and the base address of the solution buffer spot whose current value the result is added to, address 3.
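
A hedged sketch of the kernel body appears below. It processes four columns at a time with the SPU's 128-bit registers via spu_madd; the real kernel would additionally unroll and software-pipeline to balance the even and odd pipelines.

    #include <spu_intrinsics.h>

    /* c += a.b for 64x64 row-major single precision blocks;
       all three pointers must be 16-byte aligned. */
    void FMBIPK(const float *a, const float *b, float *c)
    {
        const vector float *bv = (const vector float *)b;  /* 16 vectors per row */
        vector float *cv = (vector float *)c;
        for (int i = 0; i < 64; i++)
            for (int t = 0; t < 64; t++) {
                vector float av = spu_splats(a[i*64 + t]); /* broadcast a(i,t) */
                for (int j = 0; j < 16; j++)               /* 4 columns per step */
                    cv[i*16 + j] = spu_madd(av, bv[t*16 + j], cv[i*16 + j]);
            }
    }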

II.XI.VII Synchronization

Locks and Semaphores

Locks and semaphores provide a way to prevent computations of the inner products from continuing without the proper data values in the associated LS memory slots. A lock can be placed with each computation kernel. The computation does not occur until the lock is free, allowing the task to obtain it; however, the lock is never freed for the task until the associated dependent operations, namely the DMAs that bring in the data needed to compute the correct result, are completed first. A semaphore approach is the same idea, where each FMBIP operation can have a semaphore with an initial value of 2. When the DMAs occur that fill the slots necessary for this computation, the value of the semaphore is decremented. Therefore, when the computation of the FMBIP has received its two input blocks, the semaphore value is zero, allowing the computation to take place.


The problem with using locks and semaphores across complicated parallel processes is that it becomes increasingly difficult to ensure deadlocking does not occur. In fact, since the project plans to utilize the codegraph scheduler as part of the implementation of the codegraph approach, and since this scheduler is an external resource to this project, we have no guarantee that the scheduled tasks will not cause deadlocking. Therefore, it is necessary to explore a better synchronization solution, one that places the onus of guaranteeing functionality on the project members.

Message Passing Interface

MPI stands for Message Passing Interface and is currently the most widely used method for parallel computing. MPI is a set of low level routines that programmers can use in order to stretch a program over multiple processors. MPI uses the SPMD (single program multiple data) method of approaching parallelism.

MPI works by creating communication groups, which define a set of processes that may communicate with each other while being executed. Within a communication group each process is given a rank, which acts as a unique identifier for that process. This allows program control via conditional statements (if rank = 0 then). A process can be part of multiple communication groups, and this presents a dependency between communication groups which can affect performance. For example, if one communication group needs a process to be in state "a", but another needs it to be in state "b", one group must wait for the other before it can modify the current state of the process.

Communication between processors and processes is essential when working with parallelism. MPI uses two types of communication: Point-to-Point communication and collective communication. Point-to-Point is used to send a message from a specific source to a single destination. This type of communication would be used to send messages within a communication group; for example, sending data from one spot to another, or alerting a concurrent process that a computation has been completed. It is an individual communication between one source and its destination. This implementation can be a source of deadlock if there is insufficient storage at the receiving destination. A buffer is usually implemented to store waiting messages, but this is usually small and can easily be filled, resulting in the same problem; at that point, send must wait for the user to provide space.

MPI also supports collective broadcasting, which consists of three types of operations: Synchronization, Data Movement and Collective Computation. Synchronization uses a barrier command which stops all processes in a specific group until they all reach a specific "synchronization" point. This is inefficient, since multiple processors can sit idle and waiting. Synchronization must be used in the


case where the next set of computations relies on the results of the current processes; this is again because of dependencies. Data Movement is used to broadcast to all processors or to scatter/gather information from or to specific communication groups. Collective Computation is similar to Synchronization, except that one processor collects results or data from different communication groups in order to perform a calculation on all of the data. This again requires processors to sit idle waiting for results or other data.

The previous discussion has been on MPI implementation, and the limitations follow as a result. But one of the main inefficiencies of MPI is the approach itself: the SPMD paradigm follows a linear path where certain parts can be done on multiple processors, but the parts must then come together at a point before the completion of the calculation because of dependencies. This creates bottlenecks, which are the biggest drawback of this approach. The MPI method is the most commonly used method for parallelizing a program or set of processes. Using this method, there are only so many tricks to gain efficiency before a shift of thinking or a new approach is needed to capitalize on using multiple processors (Gropp) ("An Introduction to MPI") ("Message Passing Interface (MPI)").

Lock Free Synchronization

Lock-free synchronization is the method of choice for this project. Instead of using locks and semaphores, which run the risk of deadlocking, atomic primitives supported at the hardware level are used. In the project's case, an option would be to use compare-and-swap, known as the CAS primitive. CAS takes three inputs: a memory address, the old value and the new value. If the contents of the memory address do not equal the old value when processing, indicating that a concurrent process has changed it during the interval, the function fails and calls itself again with the current value as the old value (Goetz). This can be modified to synchronize the pipeline parallelism, where the system is trying to DMA data blocks from SPE to SPE. The memory address is the location of the data block in the LS of the SPE the transfer is coming from. The old value indicates the data block that should still be located at that slot. The new value refers to the memory slot in the destination SPE's LS where the data is going. If the data block at the memory address is replaced before the CAS call is processed, the old value will not match, resulting in a failed CAS. However, instead of recalling the primitive, and causing a possibly non-wait-free situation, the run time engine will DMA the data into the destination slot from main memory. This potential implementation allows the system to continue processing if the pipelining begins to fail. During benchmarking, this type of information will be recorded to see whether accumulating CAS failures indicate a failure in the pipeline model, or whether a lack of CAS failures indicates a successful model.
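
The generic CAS pattern being adapted is sketched below; this is the textbook form (here via a GCC builtin), not Cell-specific code, and the slot-tag word is our own illustration.

    #include <stdint.h>

    static volatile uint32_t slot_tag;  /* identifies which block occupies the LS slot */

    /* Returns 1 if the expected block was still in place and the slot
       was claimed; on failure the caller falls back to a DMA from main
       memory instead of retrying, keeping the scheme wait-free. */
    int claim_slot(uint32_t old_tag, uint32_t new_tag)
    {
        return __sync_bool_compare_and_swap(&slot_tag, old_tag, new_tag);
    }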


A second possibility is the implementation of memory barriers. Memory barriers ensure that certain assignments to memory locations are made visible to all processors prior to continuing with other assignments. In this project, a full-fence memory barrier can be utilized to ensure synchronization ("Synchronization and Multiprocessor Issues"). A fence can be placed before the first overwritten slot of data in each SPE's LS. This ensures that all blocks of data are pipelined across the appropriate SPEs' LS before rewriting takes place. After the rewriting of a slot, the next fence would be placed before the first rewritten slot gets rewritten again.
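As a rough illustration, the sketch below places a full fence before the first slot is overwritten, using the generic GCC builtin __sync_synchronize(), assuming a GCC-style toolchain; the buffer layout is hypothetical and stands in for the LS slots described above.

/* Full-fence sketch: all prior stores to the pipelined slots become
 * visible before slot 0 is rewritten.  The slot array is hypothetical. */
extern float slots[8][16 * 16];        /* stand-in for LS buffer slots */

void rewrite_first_slot(const float *incoming, int n)
{
    __sync_synchronize();              /* full memory barrier          */
    for (int i = 0; i < n; i++)
        slots[0][i] = incoming[i];     /* safe to overwrite only now   */
}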

Synchronization in Run Time System

As discussed earlier, memory transfer calls do not guarantee when the data will arrive. Therefore, there is a major risk of using a data buffer before it holds the correct information. This situation is handled through a verification process. At the instant before the required data is to be used by a SPE, the processor calls a verification function that checks that the data has arrived. The system will therefore only stall at the last instant, if necessary, to avoid wasted cycles. The method of checking is based on a tag system. Every block of data has a tag of information stored along with it. The tag contains enough information to uniquely identify the data in the algorithm. A quick comparison of the tag in the buffer spot against the expected tag is therefore performed before the data is used.
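A minimal sketch of this tag check appears below; the buffer layout and all names are hypothetical, chosen only to illustrate the stall-at-the-last-instant behaviour.

#include <stdint.h>

/* Hypothetical buffer spot: a tag identifying the block plus the data. */
typedef struct {
    volatile int32_t tag;       /* unique block identifier, -1 if empty */
    float data[16 * 16];        /* the data block itself                */
} buffer_spot_t;

/* Stall only if necessary: spin until the expected tag appears,
 * indicating the DMA has delivered the block into this spot. */
void verify_spot(buffer_spot_t *spot, int32_t expected_tag)
{
    while (spot->tag != expected_tag)
        ;                       /* busy-wait for the transfer to land */
}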

A second major risk is involved in the transfer of data from main memory to a SPE or from one SPE to another. It is possible to begin a transfer of either type into a buffer while that buffer spot is still needed, thereby overwriting valuable data. This issue is solved differently for the two types of transfers.

In the case of transferring data into a SPE from main memory, the risk is overwriting a data buffer that is still being transferred out of the SPE, due to the nondeterministic nature of DMA transfers. All other occurrences of overwriting risks are not an issue, because the SPE is in charge of initiating the transfer of data into itself from main memory, and it will not issue an instruction that conflicts with the order of instructions already in flight. To solve the problem, the run time system associates all DMA processes with the identifier tag that matches the buffer spot concerned. Therefore, before a transfer of data into the buffer from main memory is initiated, the system waits on all outstanding DMA calls that involve that buffer spot's tag.
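Assuming the convention that MFC tag group i is dedicated to buffer spot i, this wait can be sketched with the standard SDK tag-status calls; the spot-to-tag-group mapping is this project's convention, not an SDK requirement.

#include <spu_mfcio.h>

/* Wait on all outstanding DMAs associated with one buffer spot,
 * assuming tag group i is reserved for spot i. */
void wait_on_spot(unsigned int spot)
{
    mfc_write_tag_mask(1u << spot);   /* select that spot's tag group  */
    mfc_read_tag_status_all();        /* block until its DMAs complete */
}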

In the case of transferring data from one SPE to another, the risk involves transferring data before the destination SPE is ready. This problem is solved using the signal registers. Each SPE has two 32-bit signal-notification registers. The implementation uses one of these registers to bit-map the data buffer of a SPE. Due to the nature of the algorithm, the SPE-to-SPE transfers form a one-to-one relationship: SPE 0 transfers to SPE 1, SPE 1 to SPE 2, and so on up to SPE 7. The run time system uses the signal register to bit-map the data buffer of the associated SPE in the one-to-one relationship. Therefore, in the above situation, SPE 0's signal register maps the data buffer of SPE 1. Whenever SPE 1 has a free buffer spot, it indicates this in SPE 0's signal register by sending a signal that turns a bit from 0 to 1. A 1 bit indicates a free buffer spot, while a 0 bit represents a spot in use. The register maintains the information in its bits by being set in OR mode; therefore, every received signal is logically ORed with the current state. Whenever a SPE is about to transfer data to another SPE, it checks its signal register to ensure it is safe to do so, and if not it waits until the other SPE is ready.
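A sketch of the free-slot check on the sending side follows; spu_read_signal1() is the standard blocking read of signal-notification register 1, while the bit-per-buffer-spot assignment is this project's convention.

#include <spu_mfcio.h>

/* Accumulated signal bits: 1 = destination buffer spot is free. */
static unsigned int free_map = 0;

/* Before transferring into the destination's buffer spot, wait until
 * its bit has been raised, then claim the spot. */
void wait_for_free_spot(unsigned int spot)
{
    while (!(free_map & (1u << spot)))
        free_map |= spu_read_signal1();  /* blocks for the next signal */
    free_map &= ~(1u << spot);           /* spot is now in use again   */
}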

II.XII Algorithms for Kernel Matrix Multiplication

II.XII.I Kernel Computation

The kernel is a small computation unit running on the SPE, performing assembly instructions to handle the matrix multiplication operations. Multiple kernels will be written in order to accommodate the computation of various types of matrix multiplications. The kernels will be written in Haskell and then translated into assembly. This is discussed further in the Assembly Code Generation section below.

II.XII.II Row Column Algorithm

The row column algorithm for kernel matrix multiplication requires blocks A and B, where multiplication is defined by A.B. Block A is row ordered while block B is column ordered. In order to compute the first row of the solution C block, the first row of block A as well as the whole B block must be loaded into the registers of the SPE. Afterwards, all the floating-point multiplications (fm) are computed and stored in temporary registers. After the initial fm, any required floating-point multiply-adds (fma) are computed. At the end of these computations, all the data required to compute the result of a row is in temporary registers. The data in these registers needs to be shuffled (shufb) among each other to get it into a configuration that will provide the correct result. After this, pairs of registers are added together with the floating-point add (fa) operation. Once the addition is complete, the previous two steps of shuffling and adding continue until the desired result for the row is reached. When the desired result is achieved, the row is stored in the location where the registers of the row of block A are located. This overwriting method is used in order to minimize register allocation. This algorithm can be applied to the remaining rows of block A to produce the resultant C block, where block C is block A at the conclusion of the computation. For more information refer to Appendix III.VII and Figure 14, which illustrates the computation of the first row of a (4x4) . (4x4) case (Synergistic Processor Unit Instruction Set Architecture).

Figure 14: computation of first row

II.XII.III Row-Row Algorithm

The row-row algorithm for kernel matrix multiplication requires a block from A and B, where multiplication is defined by A.B. Both blocks A and B are row ordered. To compute the first row of the solution C block, the first row of block A as well as the whole B block must be loaded into the registers of the SPE. Afterwards, each value of the loaded row is copied into temporary registers by the shuffle bytes (shufb) operation. Each temporary register will have the same value repeated four times and is then multiplied (fm) by the first row of block B. After the initial multiplication, floating-point multiply-adds (fma) are applied with the remaining rows of block B. These last two steps of applying an fm followed by multiple fmas are repeated depending on the dimension of the blocks. After these computations have finished, the first row of the resultant C block is achieved. The resultant row is then stored in the location where the registers for the row of block A are located. This algorithm can be applied to the remaining rows of block A to produce the resultant C block. For more information refer to Appendix III.VIII and Figure 15, which illustrates the computation of the first row of a (4x4) . (4x4) case (Synergistic Processor Unit Instruction Set Architecture):

Figure 15: row-row algorithm
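For illustration, the first row of the (4x4) . (4x4) row-row case can be written with C SPU intrinsics roughly as follows; spu_splats stands in for the explicit shufb replication, and spu_mul/spu_madd correspond to fm/fma. This is a sketch of the pattern, not the project's generated assembly.

#include <spu_intrinsics.h>

/* Row-row pattern for one row of a (4x4) . (4x4) block. */
vector float row_row_4x4(vector float a_row, const vector float b[4])
{
    /* replicate a[0] four times and do the initial fm with B's row 0 */
    vector float c = spu_mul(spu_splats(spu_extract(a_row, 0)), b[0]);
    /* fold in the remaining rows of B with fma */
    c = spu_madd(spu_splats(spu_extract(a_row, 1)), b[1], c);
    c = spu_madd(spu_splats(spu_extract(a_row, 2)), b[2], c);
    c = spu_madd(spu_splats(spu_extract(a_row, 3)), b[3], c);
    return c;    /* the first row of the resultant C block */
}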


After considering the row column and row-row approaches to kernel matrix multiplication, the overall trend shows that the row-row algorithm is more efficient when considering the number of instructions required for the computation. Therefore the row-row algorithm is chosen as the desired method for kernel matrix multiplication. It is also determined that the ideal size for kernel matrix multiplication is a 16x16 block, since the top level will be sending blocks of dimension 64x64. A block of dimension 16x16 divides the 64x64 block into four easy computations. Blocks larger than 16x16 that divide 64x64 evenly are too large to compute on the SPE in a single instance due to a lack of registers. Tables 1 and 2 in section II.XII.V compare the row column and row-row algorithms with respect to instruction count.

II.XII.IV Other Algorithms

The other algorithms that were considered were the column row and column-column algorithms, where block A would be column ordered and block B would be row or column ordered respectively. In both cases, A.B defines the matrix multiplication. If either algorithm is computed directly, the floating-point multiplications would produce incorrectly arranged values, giving a wrong result. In order to avoid this situation, the shuffle (shufb) operation would need to be applied first to configure the matrices properly for computation. After the shuffles are done, algorithms similar to the row column or row-row methods described above would have to be applied to compute the resultant C block. Since shuffles have to be done before the computation algorithms can be applied, both of these algorithms would result in an increased number of instructions compared to the row column and row-row algorithms. Therefore, both the column row and column-column algorithms can be eliminated as feasible solutions for optimizing kernel matrix multiplication.

II.XII.V Pipelining

The STI Cell processor has an even (0) and an odd (1) pipeline for processing instructions. The row-row algorithm for kernel matrix multiplication utilizes the shufb, fm, fma, load and store instructions to complete its computation. The shufb, load and store instructions are all odd-pipeline instructions, while the fm and fma instructions are even-pipeline instructions. The goal with respect to pipelining is to balance the even and odd instructions on their respective pipelines to produce optimal results. Tables 1 and 2 below list the instruction counts on pipelines 0 and 1 for the row column and row-row algorithms respectively:

Table 1: Row, Column Instruction Count

Matrix Multiplication Format   Pipeline 0   Pipeline 1   Total
(4,4) x (4,4)                      28           32          60
(4,8) x (8,4)                      44           44          88
(4,12) x (12,4)                    60           52         112
(4,16) x (16,4)                    76           50         126
(4,20) x (20,4)                    92           68         160
(8,4) x (4,8)                     112          128         240
(8,8) x (8,8)                     176          144         320
(8,12) x (12,8)                   240          160         400
(8,16) x (16,8)                   304          176         480
(8,20) x (20,8)                   368          192         560
(12,4) x (4,12)                   252          276         528
(12,8) x (8,12)                   396          300         696
(12,12) x (12,12)                 540          324         864
(12,16) x (16,12)                 684          348        1032
(12,20) x (20,12)                 828          372        1200
(16,4) x (4,16)                   448          480         928
(16,8) x (8,16)                   704          512        1216
(16,12) x (12,16)                 960          544        1504
(16,16) x (16,16)                1216          576        1792
(16,20) x (20,16)                1472          608        2080
(20,4) x (4,20)                   700          740        1440
(20,8) x (8,20)                  1100          780        1880
(20,12) x (12,20)                1500          820        2320
(20,16) x (16,20)                1900          860        2760
(20,20) x (20,20)                2300          900        3200

Table 2: Row, Row Instruction Count

Matrix Multiplication Format   Pipeline 0   Pipeline 1   Total
(4,4) x (4,4)                      16           28          44
(4,8) x (8,4)                      32           52          84
(4,12) x (12,4)                    48           76         124
(4,16) x (16,4)                    64          100         164
(4,20) x (20,4)                    80           56         136
(8,4) x (4,8)                      64           64         128
(8,8) x (8,8)                     128          112         240
(8,12) x (12,8)                   192          160         352
(8,16) x (16,8)                   256          208         464
(8,20) x (20,8)                   320          256         576
(12,4) x (4,12)                   144          108         252
(12,8) x (8,12)                   288          180         468
(12,12) x (12,12)                 432          252         684
(12,16) x (16,12)                 576          324         900
(12,20) x (20,12)                 720          396        1116
(16,4) x (4,16)                   256          160         416
(16,8) x (8,16)                   512          256         768
(16,12) x (12,16)                 768          384        1152
(16,16) x (16,16)                1024          448        1472
(16,20) x (20,16)                1280          544        1824
(20,4) x (4,20)                   400          220         620
(20,8) x (8,20)                   800          340        1140
(20,12) x (12,20)                1200          460        1660
(20,16) x (16,20)                1600          580        2180
(20,20) x (20,20)                2000          700        2700

II.XII.VI Assembly Code Generation

The kernel matrix multiplication code written in Haskell is sent to an existing module that produces a code graph with loop optimization. This code graph is then sent to an existing scheduler that generates and schedules the assembly code for execution on the STI Cell processor.

II.XII.VII Storage of Code

The assembly code and the listing of the computations are stored in the local store (LS) of the SPE. 16KB of memory is set aside for this purpose. The SPE accesses the code blocks using pointer arithmetic.
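The pointer arithmetic amounts to indexing fixed-size blocks from a base address, roughly as sketched below; the base offset and block size here are hypothetical, not the project's actual layout constants.

#include <stdint.h>

#define CODE_BASE   ((uint8_t *)0x3C000)  /* hypothetical LS offset  */
#define BLOCK_BYTES 2048                  /* hypothetical block size */

/* Address of the n-th code block inside the reserved 16KB region. */
static inline void *code_block(unsigned int n)
{
    return CODE_BASE + (uintptr_t)n * BLOCK_BYTES;
}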

II.XIII Issue Tracking of Design

II.XIII.I Requirement Report

The initial requirement report did not contain enough detail or specification. In order to rectify this, the report was modified to include more detail as well as specifications. There were also some errors and ambiguities in the initial report, which were likewise rectified.

II.XIII.II Schedule

Due to the nature of an exploratory project, constant changes had to be made to the schedule. This included the rejection of old jobs that no longer fit the project and the inclusion of new jobs that reflected the current knowledge of the project.


II.XIII.III Scope

The scope of this project was defined too broadly. Therefore, it was reduced to encompass only matrix multiplication, more specifically the dense-dense case.

II.XIII.IV Design Report

The design report was modified from its initial version to include a complete section on code graphs, the

real-time system and the computation kernel.

II.XIV Testing

II.XIV.I Exploratory Approach

Exploratory testing is applied in cases where the functionality of the program is not completely understood or where the program shows instability issues (Robinson). Since this project takes an exploratory approach, the functionality and the stability are constantly unknown and questioned throughout code development, so this testing approach matches the situation. All that is known during development is the specific input ranges into certain functions or modules and the expected outputs. Therefore, in this particular case, expressive statements are used to define the language of computation: each module or function outputs a string of text that can be interpreted (Bach). The string is interpreted, at present, by a tester and compared with the expected results. If the results do not match, this indicates issues with instability (for example, computing a multiplication with garbage data in a slot) or with functionality (for example, the algorithm does not define matrix multiplication properly). An example of the typical expressions for the Haskell code is found below:

COMPUTATION OF A CIJ BLOCK

*Matrix>computeMultCij "A" "B" 2 2 1

[FM ("T0",2,1) ("A",2,0) ("B",0,1),FMA ("T1",2,1) ("A",2,1) ("B",1,1) ("T0",2,1)]

*Matrix>allComputeMult "A" "B" 3 2 3

[[[FM ("T0",0,0) ("A",0,0) ("B",0,0),FMA ("T1",0,0) ("A",0,1) ("B",1,0) ("T0",0,0)],

[FM ("T0",0,1) ("A",0,0) ("B",0,1),FMA ("T1",0,1) ("A",0,1) ("B",1,1) ("T0",0,1)],

[FM ("T0",0,2) ("A",0,0) ("B",0,2),FMA ("T1",0,2) ("A",0,1) ("B",1,2) ("T0",0,2)]],

[[FM ("T0",1,0) ("A",1,0) ("B",0,0),FMA ("T1",1,0) ("A",1,1) ("B",1,0) ("T0",1,0)],

[FM ("T0",1,1) ("A",1,0) ("B",0,1),FMA ("T1",1,1) ("A",1,1) ("B",1,1) ("T0",1,1)],

[FM ("T0",1,2) ("A",1,0) ("B",0,2),FMA ("T1",1,2) ("A",1,1) ("B",1,2) ("T0",1,2)]],

[[FM ("T0",2,0) ("A",2,0) ("B",0,0),FMA ("T1",2,0) ("A",2,1) ("B",1,0) ("T0",2,0)],

[FM ("T0",2,1) ("A",2,0) ("B",0,1),FMA ("T1",2,1) ("A",2,1) ("B",1,1) ("T0",2,1)],

[FM ("T0",2,2) ("A",2,0) ("B",0,2),FMA ("T1",2,2) ("A",2,1) ("B",1,2) ("T0",2,2)]]]


III Implementation

III.I Run Time System

III.I.I Introduction

The run time system was developed by D. Karunaratne and C. Venantius. Its purpose is to manage the instructions, the SPUs and PPU, and memory. The work is coded in the C programming language, using the SDK environment for execution and testing. The implementation is explained by exploring each code file, describing its usage and how it realizes the design.

III.I.II Extra Material

For a complete understanding of the run time system, it is recommended to examine appendix VII.I, which contains the code files in full. Additionally, appendix VI.II lists common pitfalls with the SDK environment that the project identified during development. Lastly, appendix VI.I contains a manual covering execution and testing, guidance on maintenance, and where to continue development of the run time system.

III.I.III Method of Explanation

The run time system is a complicated engine when first attempting to understand its functionality. The approach this report takes is to explain the code from a file point of view: it walks through how the engine executes one source file at a time, explaining the code along the way. It is hoped that this method of presentation, plus the supplied code in appendix IV, is sufficient for understanding.

III.I.IV Constants

Before describing the files that compose the run time system, it is necessary to take a short detour to outline the system constants. The constants are stored in the file contextInfo.h, a header that also contains information to set up the processors, which will be explained later. It is recommended to read over appendix VII.I.II to see the constants specified in the system. This is the preferred location to add and edit these values.

III.I.V Run Time PPU

The system is executed through the object file created from the C code file runTimePPU; see appendix VII.I.I for the source code. The object file is executed on the PPU processor. Its main objective is to provide control to the run time system. The first task is to create the source matrices for input. This is done to simulate the input data, rather than have it provided from an external source or the user. It uses an offset and a random number generator to set up the data. Additionally, it initializes the tag information for each block of data that composes the matrix. Recall that this tag is vital for the synchronization of the run time system, since it uniquely identifies blocks of data.

After creating the data, the PPU initializes the SPUs in the system. It accomplishes this by creating control threads for each SPU, passing as parameters the file address of the program to execute, the address of the context information (see section III.I.VI), and the values that set the signal register to OR mode (used to synchronize data transfers). After this call, one has to assume the SPUs have begun executing the assigned object file. When examining the code file for the SPU, it will be clear how the system handles the synchronization with the PPU.

After the SPUs are set to execute, the PPU creates the context information. This information is required for the SPUs to operate properly. It is an array in which each entry contains data on one SPU in the system. Therefore, if every SPU has a copy of this array, it has vital information on the other SPUs in the system, such as the base addresses of their local stores for SPU-to-SPU transfers.

The SPUs must wait for this information to be set up before executing the program. Therefore, the mailbox control is used for synchronization. When examining the runTimeSPU code, one of the first things a SPU does is wait for a mailbox message. This corresponds to the message that the PPU sends each SPU containing its identification number and the number of SPUs in the system. The PPU only sends this message after it has finished setting up the context information, ensuring the SPUs have access to the correct data before they begin to operate.
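The SPU side of this handshake can be sketched as below; spu_read_in_mbox() is the standard blocking inbound-mailbox read, while the packing of the SPU id and SPU count into one message word is assumed here for illustration.

#include <spu_mfcio.h>

/* Block until the PPU's go message arrives, then unpack it.  The
 * message layout is a hypothetical example. */
void wait_for_ppu_go(unsigned int *my_id, unsigned int *num_spus)
{
    unsigned int msg = spu_read_in_mbox();  /* blocks until PPU writes */
    *my_id    = msg & 0xFFFFu;
    *num_spus = msg >> 16;
    /* the context information is now guaranteed to be in main memory */
}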

After this is completed, the PPU waits for each SPU to finish executing. It then exits with a program-successful printout if everything worked without error.

III.I.VI Context Information

The context information in the source file contextInfo.h, found in appendix VII.I.II, not only holds the constants of the system but also vital information each SPU needs concerning the other SPUs. It contains the thread identification number that identifies the SPU control thread, the SPU identification number that identifies the SPU, a pointer to that SPU's LS (required for transferring data), a pointer to that SPU's signal register (required for signaling), and a pointer to the control area (required to access certain tools in that SPU).

III.I.VII Run Time SPU

The run time SPU refers to the code file runTimeSPU.c found in appendix VII.I.IV. The program is essentially what the SPUs execute when they get control threads from the PPU. The program calls two sequential functions, initialization and test. Initialization sets up the processor to handle the matrix computation, and test contains the instructions that carry out the computation. It is important to note that the test function is essentially a listing of the instructions to execute. When integrated, this would be replaced by the instructions scheduled by the scheduler from the code graph, executed in the order the scheduler dictates. The test file illustrates just one possible way the scheduler would set up the instruction order.

III.I.VIII Initialization

Initialization contains the first function call that the SPUs execute; the source code is found in appendix VII.I.III. The first thing initialization (and thus each SPU) does is set up the data buffers with a value of negative one in the tag section of each buffer spot and zero values for the data in the solution buffers. A negative-one value is an invalid tag value, since tags are restricted to non-negative integers. The solution buffers are zeroed out in order to remove the junk in memory and ensure proper data addition. The data buffers do not require zeroing, because incoming data overwrites them before use. It is necessary for the solution buffers, since the algorithm in the system assumes a zeroed solution buffer when beginning computation. This could be made more robust with the inclusion of a zero-out-buffer instruction.

After setting the buffers, the SPU continues by waiting for a mailbox message. The message is sent by the PPU and corresponds to the previously discussed synchronization in the run time PPU. The message provides the processor with its identification number and indicates that the context information is set up in main memory. After receiving it, the SPU transfers the context information from memory using the address provided as a parameter in the control thread. Since the processor explicitly waits for this transfer to complete, the SPU is then ready to start processing.

III.I.IX Test

The test file contains the second function that the SPUs run after they have finished initialization. It is a simple test case assigning instructions to each SPU: the SPU with the matching SPU ID number executes the series of instructions in the corresponding if-statement. The instructions follow the algorithm specified in the design, with explicit double buffering being done, as sketched below. The test function relies heavily on the constant values. This is a necessity to allow the run time system to be properly verified when testing: it allows a developer to modify constant values to change the size of the blocks of data, the source matrices, the number of SPUs, and so on.
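The double-buffering pattern in the test function has roughly the shape sketched below; fetch_block, verify_spot_ready and compute_block are hypothetical stand-ins for the project's inDMA, verification and compute calls.

/* Hypothetical stand-ins for the project's inDMA/verify/compute calls. */
extern void fetch_block(int spot, int block);
extern void verify_spot_ready(int spot, int block);
extern void compute_block(int spot);

/* While the block in the current buffer is consumed, the next block
 * is already in flight into the other buffer. */
void process_blocks(int num_blocks)
{
    int cur = 0;
    fetch_block(cur, 0);                     /* prime buffer 0        */
    for (int b = 0; b < num_blocks; b++) {
        int nxt = cur ^ 1;
        if (b + 1 < num_blocks)
            fetch_block(nxt, b + 1);         /* start the next DMA    */
        verify_spot_ready(cur, b);           /* stall only if needed  */
        compute_block(cur);                  /* consume current block */
        cur = nxt;
    }
}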


III.I.X Signaling

The signal file found in appendix VII.I.V sets up the sending and receiving of signals. Essentially, each time a SPU sends or receives a signal, it does so through these functions. There is nothing of significant interest to the project in this file; it mainly sets up a construct required by the system.

III.I.XI Display Matrix

The displayMatrix.h file found in appendix VII.I.VI is used to output data to the console, mainly to assist in testing the run time system. It contains functions to display the state of a matrix in its entirety. It is important to note that when printing information from a SPU, there is no indication of which SPU will print first: if, for example, 4 SPUs have print statements, since all print statements go through the PPU, a race condition determines the order in which the data is displayed.

III.I.XII DMA

The code for the DMA memory transfers is found in appendix VII.I.VIII. This section is the most difficult to understand, since many of the synchronization methods are embedded here. Therefore, please read this section carefully.

DMA List

A DMA list is a way to transfer an amount of data between two locations that is greater than a normal DMA call allows. It takes the data required to be transferred and breaks it into a series of manageable chunks. The reason it is used in the run time system is the tag information stored with the data blocks and the way DMAs are implemented. It is important to understand that DMA transfers are not atomic. In fact, if one specifies a large amount of data to be transferred through one DMA, there is no guarantee that the data will be transferred sequentially. Therefore, if the run time system did not use a different construct than the default one, there would be a risk of the system assuming data is present when only the tag has been transferred, since the tag is how it determines whether the transfer is complete. Using the DMA list construct allows fences to be placed on list elements. Therefore, the system can specify a block of data to transfer, knowing the tag is in the last list element, and provide a fence ensuring the last element with the tag is the last to be transferred. Essentially, the list construct provides the system with a solution to the lack of atomicity of the DMA engine.
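The construction has roughly the shape below, using the SDK's mfc_list_element_t and mfc_getl; the block and tag sizes are illustrative, and the fencing discipline on the final element is simplified away.

#include <stdint.h>
#include <spu_mfcio.h>

/* Sketch of a two-element DMA list: the data portion first, the
 * trailing tag word(s) last, so the tag is listed after the data.
 * Sizes are illustrative. */
void list_transfer_in(volatile void *ls, uint64_t ea, unsigned int tag_group)
{
    static mfc_list_element_t list[2] __attribute__((aligned(8)));
    uint32_t lo = (uint32_t)ea;            /* low 32 bits of the EA   */

    list[0].notify = 0;
    list[0].size   = 16 * 1024;            /* the data block          */
    list[0].eal    = lo;
    list[1].notify = 0;
    list[1].size   = 128;                  /* the trailing tag word   */
    list[1].eal    = lo + 16 * 1024;

    /* ea supplies the common high 32 bits of the element addresses */
    mfc_getl(ls, ea, list, sizeof(list), tag_group, 0, 0);
}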

DMA transfer from Memory to SPE

The function call to transfer memory from main memory to a SPE is inDMA. The function first uses the buffer spot number to check whether there are any outstanding DMAs to be processed using that buffer. This relates to the design decision to associate DMA transfer tags with their respective buffer positions: it eliminates the risk of overwriting data that is still required, or still in use by the nondeterministic DMA processing. After the wait is completed, the function computes the effective address of the data in main memory and the list size for the DMA list, and calls a built-in DMA list function provided by the SDK to initiate the transfer.

DMA transfer from SPE to Memory

The function call to transfer memory from a SPE's LS to main memory is outDMA. In the case of the matrix multiplication algorithm, this is done only after a block of the solution matrix is fully completed by a processor. The code is essentially the same as inDMA, with the exception that after the transfer is completed, the processor wipes the solution buffer to prepare for the next round of calculations.

Request SPE Transfer

The function call for a SPE to request data from another SPE is requestSPETransfer. Essentially, this is part of the synchronization solution that avoids data being overwritten before expected. The call comes from the destination end of the transfer, informing the source processor that the buffer spot is free, so it can start transmitting data if it wishes to. This is done through signaling and bit mapping using the signal register; the idea is explained in more detail in section II.XI.VII under synchronization in the run time system. It is important to understand that sending a signal is not a blocking call, so a SPE that requests a transfer continues to process instructions, and that the signal registers are in OR mode, so only the relevant bit is affected. If the processor reaches a point where it requires the data and the transfer has yet to complete, a verification function is used to synchronize the situation (explained explicitly in III.I.XIII).

Transfer DMA

The function call for a SPE to transfer data to another SPE is transferDMA. The first thing the function does before beginning the transfer is check whether the destination SPE's buffer spot can be overwritten. It does this by checking the bits in its own signal register that map the buffers of the destination SPE. If the bit indicates the buffer spot is still in use, the function continues to wait for new signals until the buffer spot is free. There is a potential risk of indefinite waiting; however, this case only occurs if there is a mistake in the algorithm or the scheduling order of instructions. It is assumed that the run time system will be provided with a correct algorithm and instructions to process.

Once the destination is ready, the SPE checks that the data is correct before sending. This is done through a verification function (detailed in the next section, where the function code lies). If verified, it transfers the block of data using the same DMA list structure and idea as all other transfers.

III.I.XIII Compute

The code for the computing routines and the verification of data buffers is found in the code file compute.c in appendix VII.I.VII. The first function worth examining is verifyBuffer(). Its purpose is to ensure that a buffer spot contains the appropriate data. It accomplishes this by checking whether the tag in the buffer matches the one expected. This function is used before performing a DMA transfer of data out of a SPE, and before using the data in a SPE. Therefore, if a processor does not call this function until the last possible moment, corresponding to the first usage of the data buffer, it gives the run time system the largest possible window to transfer the data into that buffer. If the data is not present, the processor waits until the data arrives.

The other function worth examining in this code file is fmbip(): float matrix block inner product. The current run time system does not have the computational kernels integrated; therefore, it must provide its own functionality to compute the algorithm. The function is a C implementation of the partial inner product of the blocks of data: a basic inner-product routine, with the only distinction being that it deals with blocks rather than vectors.
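In spirit, fmbip() computes C += A x B on blocks, along the lines of the sketch below; the exact signature and block layout in the project differ, so this is only illustrative.

/* Partial inner product of two row-major n x n blocks, accumulated
 * into the solution block: c += a * b.  Illustrative signature. */
void block_inner_product(float *c, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            float aik = a[i * n + k];
            for (int j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}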

III.I.XIV Display Data

As with section III.I.XI Display Matrix, this code file outputs information to the console for testing purposes. The two types of information this file displays are information concerning the SPE and information concerning the buffer spots of the calling SPE. The printing of the data still has race-condition issues, since it is called by the SPE through the PPU. However, it provides the developer with a useful debugging tool. This file is not necessary for the run time system to operate.

III.II Code Graphs

The CoConut repository has the library of Haskell code for the creation of code graphs, written by Dr. Kahl. Following the implementation of this library, there are three different ways to create a code graph: creating the code graph explicitly in its entirety, building a code graph by joining and composing smaller code graphs, and creating a code graph through a monad. Each method has a different result: some allow the nodes and edges to be labelled as you choose, while others label them by enumeration. The labels are unimportant to the use of the code graphs except when trying to identify a node or edge; thus each label must be unique for a distinct node or edge. Nodes or edges with the same label are considered to be the same node and will all share the same inputs and outputs.

The Code Graph library has a Dot graph creation tool which creates the resulting undirected graph based on the code graph itself. The output from the Dot graph gives a good representation of the shape of the code graph created, so that a visual inspection can be done. This is an important tool for verifying that the code graph created is in fact the desired one. When a code graph is manipulated, or created in whole by joining and composing smaller code graphs, it can be very difficult to verify that the resulting compositions are as expected without a visual representation; the code is difficult to follow well enough to understand the result.

With the libraries and tools available, implementing the code graphs in Haskell started with creating a first code graph that modelled a simple 2x2 matrix example (figure 16). Each edge and node is given a distinct name according to the computation channel it is involved in. For the cell at (1,1) in the output matrix, a channel is specified for that cell's computation, which is the pattern of multiplying then adding on a row and column of the input matrices, producing the dot product. This code graph mapped the operations needed to compute this matrix multiplication over multiple SPUs on the Cell processor. The matrices must load data from main memory using a DMA transfer call to store blocks of a matrix into a slot of an SPU's local store. When the memory is loaded into the local store, the SPE can compute the dot product by calling the mult and multadd functions. However, it is important to realize the levels of abstraction used in the code graphs compared to the actual instructions called on the Cell processor. A mult or multadd function takes two single floating-point numbers (or three for multadd) and computes a result for these inputs. In the code graph, the mult and multadd functions are an abstraction that takes two matrix blocks (or three for multadd), where the dot product is computed on the matrix blocks. Thus, a single mult or multadd instruction call at this level represents many other instructions which would be executed to compute the dot product for each row and column of the input blocks. The end of each computation channel is a DMA transfer to move the computed data out of the local store into main memory for the PPU.

Figure 16: illustrating 2 x 2 matrix multiplication using code graphs

The problems with this implementation avenue were discussed in the design above: a new code graph would have to be created for each matrix computation, and the size of this code graph grows enormously. However, we can recognize patterns in this implementation that recur in each computation channel, and we can also notice that each channel follows a pattern. From these patterns, we can reduce the size of our code graphs by looping through these patterns, with a small overhead associated with the loop.


III.II.I Multi-loop

From a discussion with Dr. Anand on January 31, 2007, the idea of the multi-loop was brought up. Dr. Anand recommended using a multi-loop, which would determine the operations needed at each iteration depending on the location of the blocks within their corresponding input matrices. We could assign a numbering scheme to a matrix that represents the type of operation to be executed. This removes the need to create a strict computational structure for this computation, allowing flexibility so that the implementation could be reused by changing only the type of operations to be executed and mapping a new numbering scheme to the matrices. It is feasible that this could be a separate module, where the inputs would be the instructions to be executed along with a matrix mapping for the inputs. When executed, the actual matrix block inputs would enter the multi-loop and execute according to the module inputs specifying the loop operation. This module was not implemented and is currently being developed by Dr. Anand in Haskell.

The multi-loop takes care of a single channel's computations, regardless of the type of operation to be executed, but the channels themselves also need to be looped. Additionally, since there are 8 SPUs for computing matrix instructions, it would be beneficial to unroll this loop into 8 separate channels, one for each SPU, with the DMA transfers from SPU to SPU being defined. This would not be created correctly by the code graph without the loop unrolling; each DMA transfer from SPU to SPU also needs to be a code graph joining the channel code graphs together. Creating the code graph for modelling this type of operation still needs to be explored and implemented correctly. As of right now, this is implemented as a single large code graph with a DMA transfer as another instruction, a hyper-edge, which encapsulates the request for a DMA transfer and the sending DMA transfer from the corresponding SPUs. On February 13th, 2007, Dr. Anand discussed a solution where the SPU-to-SPU DMA transfer should be modelled as a separate code graph that connects the SPUs together. This avenue could be explored further by keeping a single hyper-edge between the SPU channels and then expanding the single edge into a code graph via the CodeGraphExpand library of functions. This is an area that needs further exploration.

Looking more in depth at the code graph, we define two main data types for the code graph that are generic, can be built on, and could encapsulate data types for lower layers. The data types are created as a means of specifying information about the matrix type, specifically the dimensions and shape of a matrix, such as a 3x3 identity matrix. However, when expanding these types to make a more specific and suitable type, it is necessary to keep in mind that there is a type system that will be used in the future for identifying matrices as types. This type system was created by Gordon Uszkay, who would be responsible for working with the development team to incorporate this system into the entire CoConut project.

The first implementation of the code graph was written with node outputs, where the code graph was explicitly created using the mkCodeGraph function in the CodeGraph library. This method has each edge explicitly defined with a label and its input and output nodes. This approach proved to be a good method for defining the code graph and verifying that it was shaping into the desired result. After the new design was created, the code graphs were created using the same method, with mkCodeGraph. The difference in this code graph was the layering approach taken, where each layer abstracts details from the encapsulated layers. Each layer is created using the CodeGraph library, where the edges of the layer represent an encapsulated layer that is expanded into a larger code graph.

III.II.II First Layer

The first layer of abstraction is a simple layer which takes two inputs, labelled Matrix A and Matrix B, has an edge labelled MatrixMult for the matrix multiplication of Matrix A and Matrix B, and finally an output matrix labelled Matrix C. The MatrixMult edge is then expanded into the next layer, the Loop Layer.

Figure 17: The First Layer of the code graph

III.II.III Loop Layer

On this layer, we create an edge that handles the loop overhead for iterating the code graph. The iterator is used for calculating the pointer arithmetic of the input blocks, which is necessary for the DMA loads into the SPU local stores. The code graph here has two inputs, the matrices labelled Matrix A and Matrix B. Each of the edges for computations on the matrices (multLayer, maddLayer, maddstoreLayer) has outputs that loop back to the iterator edge for the looping patterns.


Additionally, at this layer we will be using the multi-loop discussed above. The iterator will also be responsible for determining which instruction, based on the algorithm for the case statement, will be executed from the multi-loop. With loop unrolling, this layer needs to consist of eight code graphs, one for each SPU. The connection between them still needs to be established to define the SPU-to-SPU transfers needed.

Figure 18: Loop Layer unrolled by two iterations

III.II.IV Additional Loop Layer

In addition to the Loop Layer, another layer was created that does a similar task, but is built slightly differently, with the same idea in mind but different syntax. This loop was created because the complete functionality of the code graphs has yet to be defined, or at least discovered, for this thesis specifically. This additional layer provides another code graph for compatibility with the requirements of additional tools and libraries in the repository, such as the LoopSpec library or the scheduler.

Alternatively to the method above, we can implement the same idea using the scheduler's LoopSpec module. The result is a "cleaner" choice that should also be more compatible with the scheduler. Here, there is a simple loop input and loop output, and the result block, which is used for storing the dot products of each iteration. We can replace our iterator node with a loopOverhead, which calculates the pointer arithmetic needed for these calculations. The D's needed for this are given a value of -1 to retrieve data from the previous iteration.

The difficulty with this will be defining the composition of these nodes such that the result block is the correct block from the operations multLayer, maddLayer and maddstoreLayer. These three operations are disjoint and only one can be executed per loop iteration, something that the multi-loop will have to deal with. Currently, this is not a part of the code graph and cannot be implemented. This tool has to be developed so that a decision on which operation to execute can be determined, and the result of this execution must then be the resulting output. Otherwise, it could be done so that each of the operations loops back to the input of the code graph. This would imply one input and three outputs, but because the multi-loop makes each iteration have disjoint operations, there would be only one input and one output. This is something that still needs to be explored and implemented further by someone in the CoConut project.

Figure 19: Alternative method for the Loop Layer. Note "ResultOut" is not connected due to complications with the library composition functions.

III.II.V Mult

This is the operation for transferring the input matrices' blocks into the local store; the dot product is then computed on the data in the blocks.

III.II.VI MaddLayer

This is the operation for transferring the input matrices' blocks into the local store; the dot product is computed on the data in the blocks and then added to the third input block, the result block.

III.II.VII MaddstoreLayer

This is the operation for transferring the input matrices' blocks into the local store; the dot product is computed on the data in the blocks and then added to the third input block, the result block. After the multadd operation, the result block is stored back to main memory with its dot product computations completed.

III.II.VIII Direction

The code graph has more work to be done in a few areas. Some of it relates to the CodeGraph library and some will be coded directly into this module. The first step will be to expand the code graph's layers into a single code graph. There is a monad function, cgExpand, which can help with this, but it will take some time to figure out how to make it work with this code graph. Additionally, a main component needed will be integrating the Loop Layer with the LoopSpec library. The LoopSpec library is what will have the scheduler generate loops in the resulting code. Generally, adding the LoopSpec to this code graph will not be a difficult task, but deciding on how the scheduler will work with the LoopSpec will be. Work with the scheduler will help in understanding how to conform to its requirements. One difficulty with code graphs on the Cell processor is dictating when the code in the code graph should be run on the PPU or an SPU. A mechanism for separating the locality of the nodes and edges of the code graph needs to be added to the library. It may be easiest to do this on each code graph layer as a whole, where a layer would be given a locality on the PPU or an SPU.

Difficulties with the code graphs will mostly come from conforming to and integrating with the other modules of the project. It is recommended that continued work on the code graph involve an intimate understanding of the entire project and of how this module will have to be integrated. The scheduler will be the largest problem; it was created before this code graph, so there will be some incompatibilities to resolve. Gordon Uszkay's type system should be integrated into this system, but the specifications for how it will be integrated still need to be developed fully. This will take exploratory work with the subject.

III.III Kernel Computation

III.III.I Introduction

The kernel multiplication code was developed by Adam Schulz and is to be used in conjunction with code already developed by Dr. Anand. The code for this multiplication is written in Haskell, and the final product has been developed to run through the scheduler developed by Dr. Kahl for eventual implementation on the STI Cell. The kernel code controls the matrix multiplication operations that will be computed on each individual SPU. The multiplication code has been implemented to handle any size of matrices that adhere to the multiplication requirement mxn * nxk. For the simplicity of this thesis, the specific case of an 8x8 * 8x8 has been implemented.

III.III.II Method of Explanation

As mentioned in the introduction, the implementation of the matrix multiplication operations is part of a larger piece of code, which does the loading and storing of data to and from the local store in each SPU. An explanation of that code is not included in this report, as it was developed by Dr. Anand. The implementation of the mathematical operations is outlined below, broken down by individual functions.

III.III.III Function multcalc

This function has parameters that accept a partially completed solution matrix C, as well as the A matrix and B matrix that will be used in the current calculation. It also takes in the dimensions of both the A matrix and the B matrix. It is assumed that the run time system has verified that the solution matrix is of the right dimensions to add to the multiplication of A and B. The C matrix is also passed throughout the code as a Maybe value, because there are some instances when there is no partial solution to add to the current result, as on the first iteration. The Maybe C allows for the case where C exists and also where C doesn't exist; we will see the implementation in a later function. MultCalc is used to create an offset value based on the number of columns of the B matrix. This offset value is used later to determine which registers of the B matrix need to be used to generate the right multiplications with the A matrix. MultCalc also takes the number of columns inputted (corresponding to the number of entries in a row) and divides it by 4 to obtain the number of registers used for a given matrix. For example, if a matrix has eight columns, each row is stored in two registers (four values per register), not eight. The remainder of the functions in the multiplication code are based on the number of registers used for each row, not the number of entries in each row.

III.III.IV Function matrixmult

MatrixMult has eight parameters. It accepts three matrices C, A and B, the dimensions of A and B (number of rows and number of columns divided by 4), and the offset calculated in MultCalc. MatrixMult generates a column of solutions on each of its iterations. For the 8x8 * 8x8 case, the first iteration generates solutions for matrix entries [1,5 1,6 1,7 1,8], which is one register in the column, down to [8,5 8,6 8,7 8,8]. Therefore one iteration generates eight registers with partial solutions in them. The next iteration does the rest of the computations, [1,1 1,2 1,3 1,4] to [8,1 8,2 8,3 8,4]. MatrixMult is a recursive function that has two base cases and one recursive step. The first base case is when the B matrix has only one column of entries; this case applies to any mx4 B matrix. It was built in for generality and is not actually used within the scope of the 8x8 * 8x8 specific case for this thesis. The other base case occurs when the offset value reaches 2; for the specific case of 8x8 * 8x8 this is the only case used. This case concatenates two function calls to rowmult, with the offset being 1 in the first call and 0 in the second. The general case is needed for any calculation that has a B matrix with more than 2 columns. It works by concatenating a function call to rowmult with the corresponding offset along with a recursive call back to MatrixMult, decrementing the offset counter.

III.III.V Function rowmult

The rowmult function also has eight parameters, of the same types as MatrixMult's. rowmult is used to traverse each row in matrix A. The base case is when there is only one row left to calculate in matrix A. For both the recursive step and the base case, the function passes the register that corresponds to the last four entries of that row. The registers are accessed based on the dimensions of the matrices as well as the current C matrix entries being calculated. This information is passed to the matrixmult1 function.


III.III.VI Function matrixmult1

matrixmult1 accepts a specific C matrix register (this register contains four entries) as well as the corresponding A matrix register. It also accepts the entire B matrix, the corresponding dimensions of both A and B, and the current offset value. The matrixmult1 function traverses each column in a row of matrix A. The base case is the last register, which holds entries [x,1 x,2 x,3 x,4] (where x is the current row being used for calculation). The last register contains the first values of the row because we are working recursively, starting at the end of the row. matrixmult1 calls the mul44' function with these specific registers for the final calculation.

III.III.VII Function mul44'

mul44' accepts a C matrix register, an A matrix register, the entire matrix B, the dimensions of only the B matrix, and the current offset. mul44' is the function that multiplies the individual entries within the registers. The algorithm used for the calculations is outlined in the design report and will not be explained here. mul44' is where the Maybe implementation is used. For the first multiplication of the A register with the corresponding B matrix, it must be known whether that solution is going to be added to an existing C register or whether it is the first calculation and will just be a multiplication. If the C register passed in is null, there is no partial solution and we just multiply B and A. If the C register is not null, we multiply A and B and add that product to the partial solution in C. In mul44' we use some register shuffling that corresponds to the algorithm used to compute the multiplication.

III.IV Techniques of loop optimization

There are many techniques for loop optimization, such as loop splitting, loop unrolling, loop parallelization and software pipelining. We now take a brief overview of these techniques.

Loop splitting/loop peeling: This is a compiler optimization technique. Loop splitting attempts to simplify a loop or eliminate dependencies by breaking it into multiple loops which have the same bodies but iterate over different contiguous portions of the index range. A useful special case is loop peeling, which can simplify a loop with a problematic first iteration by performing that iteration separately before entering the loop.

Loop unrolling: Duplicates the body of the loop multiple times, in order to decrease the number of executions of loop-overhead instructions and the number of loop jumps; this may reduce cache misses and reduce branching (see the sketch after this list). It speeds up the program if the loop's overhead instructions impair performance significantly. The major side effects of loop unrolling are: a) increased register usage in a single iteration to store temporary variables, which may hurt performance; and b) code size expansion after the unrolling, which is undesirable for embedded applications.

Loop parallelization: A special case of parallelization focusing on loops. Parallel computing is the simultaneous execution of the same task (split up and specially adapted) on multiple processors in order to obtain results faster. The idea is based on the fact that the process of solving a problem can usually be divided into smaller tasks, which may be carried out simultaneously with some coordination. Loop parallelization divides the loop body and restructures it such that the rebuilt loop can run efficiently on multiprocessor systems. It can be done automatically by compilers or manually.

Software pipelining: The technique of scheduling instructions across several iterations of a loop. It may reduce pipeline stalls (i.e. utilize instruction latencies) on sequential pipelined machines and exploit instruction-level parallelism. Intuitively, software pipelining makes iterations execute in an overlapped fashion, so that an iteration starts before the previous iteration has completed.
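As promised above, here is a minimal, generic C sketch of loop unrolling (the arrays and the unroll factor of four are invented for the illustration; this is not code from our kernel):

#include <stdio.h>

/* Original loop: for (i = 0; i < n; i++) y[i] += x[i]; */
static void add_unrolled(float *y, const float *x, int n)
{
    int i = 0;
    /* unrolled by 4: one loop-overhead check per four element operations */
    for (; i + 3 < n; i += 4) {
        y[i]   += x[i];
        y[i+1] += x[i+1];
        y[i+2] += x[i+2];
        y[i+3] += x[i+3];
    }
    /* remainder loop: handles n not divisible by the unroll factor */
    for (; i < n; i++)
        y[i] += x[i];
}

int main(void)
{
    float x[5] = {1, 1, 1, 1, 1}, y[5] = {0};
    add_unrolled(y, x, 5);
    printf("%g %g %g %g %g\n", y[0], y[1], y[2], y[3], y[4]);
    return 0;
}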

Our project implements dense matrix multiplication targeting the STI Cell Broadband Engine, which, like Intel's so-called multi-core CPUs, offers parallel hardware. If we can apply software pipelining techniques to our matrix multiplication to exploit the CBE's parallel computation capacity, this will increase the overall efficiency of our matrix multiplication to some extent.

There are many techniques for software pipelining in use and under research, including unrolling, kernel recognition, modulo scheduling, and decomposed software pipelining. Let us briefly look at software pipelining.

III.IV.I Software pipelining

Introduction

Utilizing parallelism at the instruction level is an important way to improve performance. Because the time spent executing loops dominates total execution time, a large body of optimizations focuses on decreasing the time spent executing each iteration. Software pipelining is a technique that reforms the loop so that a faster execution rate is realized.

Software pipelining is a method of instruction scheduling where the iterations of a loop are scheduled more efficiently by executing more than one iteration of the loop in parallel. If iterations can be executed in an overlapped fashion, rather than in the traditional sequential, one-by-one fashion, parallelism increases. Although the operations of a single iteration can be parallelized, more parallelism may be achieved if the entire loop is considered rather than a single iteration. Let {ABC}^n represent a loop containing operations A, B, C that is executed n times. The software pipelining transformation exploits the fact that the loop {ABC}^n is equivalent to A {BCA}^(n-1) BC. The operations contained in the loop body do not change, but the number and sequence of operations in the loop body may be rearranged so that the operations in the transformed loop body can be executed in parallel; i.e. different iterations of the original loop can be executed in parallel in the transformed loop, not one operation after another, and the transformed loop is equivalent to the original loop.

Implementation

Consider the following loop:

for i = 1 to n
    A(i)
    B(i)
    C(i)
end

Here, let A(i), B(i), C(i) be instructions, each operating on data i, and let them be dependent on each other: A(i) must complete before B(i) can start. For example, A could load data from memory into a register, B could perform some arithmetic operation on the data, and C could store the data back into memory. However, to keep things simple, let there be no dependence between operations of different iterations; in other words, A(2) can begin before A(1) finishes.

Without software pipelining, the operations will execute in the following sequence:

A(1) B(1) C(1) A(2) B(2) C(2) A(3) B(3) C(3) ...

Assume that each instruction takes 3 clock cycles to complete (ignoring for the moment the cost of the looping control flow). Also assume that an instruction can be dispatched every cycle, as long as it has no dependencies on an instruction that is already executing. In the unpipelined case, each iteration thus takes 7 cycles to complete (3 + 3 + 1, because we assumed no dependencies between different iterations, i.e. A(i+1) does not have to wait for C(i)).

Now suppose that we can implement some kind of software pipelining, and consider the following sequence of instructions (with software pipelining):

A(1) A(2) A(3) B(1) B(2) B(3) C(1) C(2) C(3) ...


It is easily verified that an instruction can be dispatched each cycle, which means that the same 3 iterations execute in a total of 9 cycles instead of the 21 cycles (7 * 3) of the original loop, giving an average of 3 cycles per iteration.

Software pipelining is often used in combination with loop unrolling, and this combination of techniques is often a far better optimization than loop unrolling alone. In the example above, we can combine loop unrolling and software pipelining so that the original loop executes in this pattern; the transformed loop code will be:

for i = 1 to n step 3
    A(i)
    A(i+1)
    A(i+2)
    B(i)
    B(i+1)
    B(i+2)
    C(i)
    C(i+1)
    C(i+2)
end

However, this is only a very simple example to illustrate the basic idea of software pipelining. Most of the time, matters are complicated by the fact that (as is usually the case) we cannot guarantee that the number of iterations will be divisible by the unroll factor.

Before solving this problem, we need to introduce the prolog and epilog. Simply speaking, the prolog and epilog are the code that handles iterations at the beginning and end of the loop after unrolling. In our case, the prolog is the code before the loop that handles the case of n not being divisible by the unroll factor; the epilog is the code after the loop that terminates the loop such that the unrolled loop is equivalent to the original loop.

Generally speaking, loop unrolling may not be the best way to implement software pipelining. For example, consider the following loop, which contains instructions with a high latency:

for i = 1 to n
    A(i) ; 3 cycle latency
    B(i) ; 3
    C(i) ; 12 (for example, a floating point operation)
    D(i) ; 3
    E(i) ; 3
    F(i) ; 3
end


If we simply do software pipelining with loop unrolling, 12 iterations of the loop would have to be unrolled to build a new software-pipelined iteration that avoids the bottleneck of instruction C. This means that the code size of the loop would increase by a factor of 12, which not only affects memory and register usage but can also affect cache performance. Even worse, the prolog will likely be even larger than the code for the loop itself, and very probably inefficient, because software pipelining cannot be used in that code. Furthermore, if n is expected to be moderate in size compared to the number of iterations unrolled, then the execution will spend most of its time in this inefficient prolog code, making the software pipelining optimization ineffective.

Here is an alternative implementation of software pipelining for our example:

prolog
for i = 1 to (n - 6)
    A(i+6)
    B(i+5)
    C(i+4)
    D(i+2) ; note that we skip i+3
    E(i+1)
    F(i)
end
epilog

Let us verify that this code does the same thing as the original for iterations in the middle of the loop. Specifically, consider iteration 7 of the original loop. Iteration 1 of the pipelined loop is the first that includes an instruction from iteration 7 of the original loop. The sequence of instructions is:

Iteration 1: A(7) B(6) C(5) D(3) E(2) F(1)

Iteration 2: A(8) B(7) C(6) D(4) E(3) F(2)

Iteration 3: A(9) B(8) C(7) D(5) E(4) F(3)

Iteration 4: A(10) B(9) C(8) D(6) E(5) F(4)

Iteration 5: A(11) B(10) C(9) D(7) E(6) F(5)

Iteration 6: A(12) B(11) C(10) D(8) E(7) F(6)

Iteration 7: A(13) B(12) C(11) D(9) E(8) F(7)

However, unlike the original loop, the pipelined version avoids the bottleneck at instruction C. Note that there are 12 instructions between C(7) and the dependent instruction D(7), which means that the latency cycles of instruction C(7) are used for other instructions instead of being wasted.

The prolog and epilog handle iterations at the beginning and end of the loop. Here is a possible prolog for

our example above:

; loop prolog (arranged on lines for clarity)

A(1)

A(2), B(1)

A(3), B(2), C(1)


A(4), B(3), C(2)

A(5), B(4), C(3), D(1)

A(6), B(5), C(4), D(2), E(1)

Each line above corresponds to an iteration of the pipelined loop, with instructions for iterations that have

not yet begun removed. The epilog would look similar.
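To make this structure concrete, here is a runnable C sketch of the whole prolog/kernel/epilog arrangement for this example (A through F are invented printing stubs, not real instructions, so running the program simply prints the pipelined schedule; the epilog drains the pipeline by mirroring the prolog):

#include <stdio.h>

/* Stub "instructions": each records which logical iteration it belongs to. */
static void A(int i) { printf("A(%d) ", i); }
static void B(int i) { printf("B(%d) ", i); }
static void C(int i) { printf("C(%d) ", i); }
static void D(int i) { printf("D(%d) ", i); }
static void E(int i) { printf("E(%d) ", i); }
static void F(int i) { printf("F(%d) ", i); }

int main(void)
{
    int n = 10;  /* assume n > 6 for this sketch */

    /* prolog: start iterations without completing them, as listed above */
    A(1);                               printf("\n");
    A(2); B(1);                         printf("\n");
    A(3); B(2); C(1);                   printf("\n");
    A(4); B(3); C(2);                   printf("\n");
    A(5); B(4); C(3); D(1);             printf("\n");
    A(6); B(5); C(4); D(2); E(1);       printf("\n");

    /* kernel: the pipelined loop from the text */
    for (int i = 1; i <= n - 6; i++) {
        A(i+6); B(i+5); C(i+4); D(i+2); E(i+1); F(i);
        printf("\n");
    }

    /* epilog: finish the iterations still in flight */
    B(n); C(n-1); D(n-3); E(n-4); F(n-5); printf("\n");
    C(n); D(n-2); E(n-3); F(n-4);         printf("\n");
    D(n-1); E(n-2); F(n-3);               printf("\n");
    D(n); E(n-1); F(n-2);                 printf("\n");
    E(n); F(n-1);                         printf("\n");
    F(n);                                 printf("\n");
    return 0;
}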

Conclusion

The requirement of a prolog and epilog is one of the major difficulties of implementing software pipelining. Software pipelining a loop is a trade-off between speed and memory usage. If the code size expansion after loop unrolling and software pipelining is too large, it will hurt speed anyway via a decrease in cache performance.

Another difficulty, which may make this implementation of software pipelining unusable, is that on many architectures most instructions use a register as an argument, and the specific register to use must be hard-coded into the instruction. In other words, on many architectures it is impossible to encode an instruction such as "multiply the contents of register X and register Y and put the result in register Z", where X, Y, and Z are numbers taken from other registers or memory.

III.IV.II Explicitly staged software pipelining

Introduction

Explicitly staged software pipelining is a new software pipelining technique developed by Dr. C. Anand, Dr. W. Kahl and W. Thaller. This method differs from other software pipelining approaches: it defines and works with an enriched data dependency graph for the loop body, called a loop specification, in which relationships between multiple logical iterations are expressible. This simplifies some concepts (modulo variable renaming) and facilitates a different trajectory through the scheduling problem. Rather than assigning cycles or partial orderings to instructions, instructions are first assigned to stages. This algorithm is called Explicitly-Staged Software Pipelining (ExSSP), in contrast to modulo scheduling, where stage assignment is not explicit.

Code Graphs

Before getting into explicitly staged software pipelining, we need to introduce the concept of a code graph.

A code graph is a hypergraph with a sequence of input nodes and a sequence of output nodes. Each node in the code graph is labelled with a type. The hyperedges of the graph are labelled with machine instructions and their immediate arguments, i.e., any constants that are directly encoded in the opcode, but no source or target registers. Each hyperedge has zero or more ordered input tentacles (connected to nodes representing the arguments consumed by the instruction) and one or more ordered output tentacles (connected to nodes representing the results of the instruction).
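Purely for intuition (the project's code graphs are implemented as a Haskell library, so this C rendering is invented for illustration only), a hyperedge as just described might be represented like this:

/* Invented sketch of a code-graph hyperedge: an instruction label plus
   ordered input and output tentacles referring to typed nodes by id. */
typedef struct {
    const char *instruction;  /* machine instruction and immediates, e.g. "ai 16" */
    const int  *inputs;       /* zero or more ordered input tentacles             */
    int         n_inputs;
    const int  *outputs;      /* one or more ordered output tentacles             */
    int         n_outputs;
} HyperEdge;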

For the formal definition of code graphs, please refer to "Control-flow semantics for assembly-level data-flow graphs" by W. Kahl, C. K. Anand, and J. Carette.

Loop Specifications

If a loop body can be represented by a code graph, then variables modified by this loop body are represented both as inputs and outputs of that code graph, and the input instances feed from the output instances of the previous iteration.

In a staged loop body, the situation can be more complex, since inputs of one stage may be connected to outputs of different stages, which then belong to different "logical" iterations of the unstaged loop body.

For the formal definition of loop specifications, please refer to "Explicitly Staged Software Pipelining" by Dr. C. Anand, Dr. W. Kahl and W. Thaller.

Once we have the loop represented by a code graph, we are able to schedule the loop with the scheduler. This new type of scheduling is based on staging: the loop body is broken up into several stages to reduce data dependencies between instructions, allowing more freedom in scheduling code graphs, and ultimately reducing the chance of pipeline stalling.

Theory of Staging

Software pipelining can be viewed as a transformation of the loop specification. The code graph is split into sequential parts, called stages, which are then composed again in parallel to form a new loop body, such that adding appropriate prologue and epilogue code to the loop yields a loop that is equivalent to the original loop.


(Figure: software pipelining with three stages. Left: loop-independent data dependences in a non-pipelined loop restrict parallelism. Right: the same dependences are now dependences between different iterations of the pipelined loop.)

Algorithm for Stage Splitting

Please see "Explicitly Staged Software Pipelining" by Dr. C. Anand, Dr. W. Kahl and W. Thaller.

Conclusion

The loop specification, represented by a code graph, is the key to guiding the code transformations necessary for software pipelining. The stage composition arises from first investigating the pipelining transformation at the level of composition of code graphs and loop specifications.

III.IV.III Implementation

Background

In order to generate efficient loop code targeting the IBM Cell processor, a new method of software pipelining, called "Explicitly Staged Software Pipelining", was developed and implemented by Dr. C. Anand, Dr. W. Kahl and W. Thaller. The key point is that a loop body may be represented by a code graph, so the scheduler is able to schedule it to achieve parallel computation performance. Scheduling was done for the problem of mapping a function over a large data set.


State register

When mapping a function over an array, some information is needed.

The addresses of the input and output arrays are necessary. How many addresses there are depends on the function itself, and this is known at compile time. At run time, the addresses are needed to make sure the function maps onto the correct data.

The number of iterations to be performed (i.e. the length of the arrays, the number of computations, etc.) is also known at compile time, and must be provided at run time.

The SPU registers are 128 bits wide and are composed of 4 words, so we may construct a state register which is used by the loop to keep the input/output array addresses and the counter.

Word 3    Word 2      Word 1       Word 0
Counter   Array load  Array store  Unused

(128-bit state register for a loop with one input array and one output array)

Word 3    Word 2       Word 1       Word 0
Counter   Array1 load  Array2 load  Array store

(128-bit state register for a loop with two input arrays and one output array)

The architecture of the SPU allows for some very unique optimizations in the way the loop is implemented. Since SPU instructions are SIMD, all of the state values (the input/output addresses and the iteration counter) can be updated simultaneously in a single instruction after each iteration. Because processing an array element by element causes the addresses to be incremented by the same amount, we can achieve this with a single "ai" (add immediate) instruction.
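As a small illustrative model (plain C, not SPU code; the struct and field names are invented), the single-instruction update amounts to adding 16 to all four words at once:

#include <stdint.h>

/* Invented model of the 128-bit state register for one input and one
   output array, matching the layout table above. */
typedef struct {
    uint32_t counter;      /* Word 3 */
    uint32_t array_load;   /* Word 2 */
    uint32_t array_store;  /* Word 1 */
    uint32_t unused;       /* Word 0 */
} StateRegister;

/* What the single "ai" (add immediate) instruction achieves per iteration:
   every word of the register advances by 16 simultaneously. */
static void ai_update(StateRegister *s)
{
    s->counter     += 16;
    s->array_load  += 16;
    s->array_store += 16;
    s->unused      += 16;
}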

Word 3 holds the iteration counter. It starts at a precalculated value and increases with each iteration. Once the "magic bit" 24 changes from a 0 to a 1, this triggers another register, the branch target register, to rotate to the left by 4 bytes, so the address in the branch register changes from the starting address of the loop to the address right after the loop exit address.


(Figure: word layout of the state register, Word 3 down to Word 0, with the counter word subdivided into Byte 3 down to Byte 0, alongside the branch target register.)

Terminating the loop is done using the branch target register. After each iteration, the branch register is rotated by the number of words specified in Byte 3 of the iteration counter (Word 3 of the state register). The value in Byte 3 of the iteration counter is always 0, except after the last iteration, when Byte 3 is updated to 1, which triggers the loop to exit.

Word 3 contains the address of the starting point of the loop, while Word 2 contains the address right after the end of the loop. Once an iteration of the loop is completed, the program counter jumps to the address contained in Word 3, which is always the beginning of the loop. The only exception is once the "magic bit" changes: the branch target register then rotates by 4 bytes, so that the program counter jumps out of the loop after the last iteration (to the address right after the end address of the loop).

(Figure callouts: the "magic bit", bit 24, triggers the loop to exit once it changes from 0 to 1; depending on the type of function being mapped, words 0 to 2 store the addresses of the input/output arrays, and these addresses are incremented by a set amount each time an iteration of the loop is executed.)


Word 3          Word 2     Word 1   Word 0
Starting addr   Exit addr  Unused   Unused

(128-bit branch target register)

Word 3     Word 2   Word 1   Word 0
Exit addr  Unused   Unused   Starting addr

(128-bit branch target register after it is triggered to rotate following the last iteration)

III.IV.IV Implementing the loop code

Since our project targets the IBM Cell processor, we take advantage of the Cell architecture to obtain as much performance as possible from the loop optimization. We therefore turn briefly to the Cell architecture, specifically the SPU pipelines and dual-issue rules.

SPU Pipelines and Dual-Issue Rules

The SPU has two pipelines, even (pipeline 0) and odd (pipeline 1), into which it can issue and complete up to two instructions per cycle, one in each of the pipelines. Whether an instruction goes to the even or odd pipeline depends on its instruction type, which is related to the execution unit that performs the function. Each execution unit is assigned to one of the two pipelines.

To obtain high performance from the pipeline:

Design for balanced pipeline use. Typically, the algorithm dictates the instruction mix. However, there may be multiple ways to achieve the same computational results, and choosing the one that achieves balanced pipeline use will often result in improved performance.

Unroll loops and interleave computation to hide latency (reduce dependency stalls) and improve dual-issue rates.

The following terms are relevant:

Pipeline: the pipeline an instruction is issued on.


Stalls: the number of additional cycles before another instruction of the same type can be issued. For example, double-precision floating-point operations have a 6-cycle stall; therefore, for back-to-back double-precision floating-point operations, the second operation will be issued at least 7 cycles after the first.

Latency: the number of cycles before the result is available.

The SPU issues all instructions in program order according to the pipeline assignment. Each instruction is part of a doubleword-aligned instruction pair called a fetch group. A fetch group can have one or two valid instructions: the first instruction in the fetch group is from an even word address, and the second instruction is from an odd word address. The SPU processes fetch groups one at a time, continuing to the next fetch group when the current fetch group becomes empty. An instruction becomes issueable when register dependencies are satisfied and there is no structural hazard (resource conflict) with prior instructions or LS contention due to DMA or ECC activity (see the Cell Broadband Engine Programming Handbook, Section 3.1.1.3, for LS access priorities).

Dual-issue occurs when a fetch group has two issueable instructions in which the first instruction can be executed on the even pipeline and the second instruction can be executed on the odd pipeline. If a fetch group cannot be dual-issued but the first instruction can be issued, the first instruction is issued to the proper execution pipeline and the second instruction is held until it can be issued. A new fetch group is loaded after both instructions of the current fetch group are issued.

Implementation constraint

Because of the Cell SPU pipelines and dual-issue rules, we must be careful in the selection of SPU instructions when implementing the software-pipelined loop optimization code. Since we are doing dense-by-dense matrix multiplication targeting the IBM Cell processor, the major instructions of the matrix calculation are FA (floating add), FM (floating multiply), FMA (floating multiply and add) and so on, all of which are pipeline 0 instructions. So, in order to balance the instructions between pipeline 0 and pipeline 1 and achieve as high performance as possible, we need to choose pipeline 1 instructions wherever we can when implementing the optimized loop code.

Implementation

After compilation, matrices A and B are vectorized and held in arrays A and B after DMA to the local store. Let registers 3 and 4 point to the starting addresses of these two arrays respectively. Recall that the Cell SPU registers are 128 bits wide and are composed of 4 words.


Construction of the state register

First of all, we need to construct a state register from these registers, which can then be updated after each iteration of the loop. Here we choose a "shufb" (shuffle bytes) instruction, which is a pipeline 1 instruction, to move the two array starting addresses into Word 2 and Word 1 of the state register, because only one word of each of the two source registers holds an array starting address; the other words are unused.

Before filling out Word 3, the counter field of the state register, we need to do some preparation. Fortunately, we know the number of iterations of the loop and the starting and exit addresses of the loop at compile time. Assume that register 5 holds the number of iterations. We can then use another "shufb" (shuffle bytes) instruction to move the iteration count into Word 3 of the state register. But before the "shufb" instruction, we need an extra instruction, "mpyi" (multiply immediate), to multiply the iteration count by -16, and then an "a" (add) instruction to add 2 to the power 24 to the result, since we need the 24th bit of Word 3 of the state register to become 1 in order to trigger the branch target register to rotate so that the loop terminates after its completion. By doing this, we ensure that all the words of the state register can be updated by 16 after each iteration (updating by 16 maps the function to the next element of the array), and the counter (Word 3) is updated simultaneously with the array addresses and yields the correct loop count.
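The counter arithmetic can be modelled in plain C (a sketch only, not SPU code; the value of n is invented, and the constants follow the description above):

#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint32_t n = 100;                        /* iteration count, known at compile time */
    uint32_t counter = (1u << 24) - 16u * n; /* mpyi by -16, then add 2^24 */

    for (uint32_t i = 0; i < n; i++)
        counter += 16;                       /* the per-iteration "ai" update */

    /* bit 24 flips from 0 to 1 exactly after the last iteration */
    assert(counter == (1u << 24));
    return 0;
}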

Then, after the state register is built from the loop iteration count and the two array starting addresses, we need to add the pre-calculated stage offset in order to generate the correct array addresses in the state register. This is implemented with a single "a" (add) instruction. The initial state register is now completely constructed with the correct values.

Construction of the branch register

Since we know the starting and exit addresses of the loop body at compile time, it is easy to construct the branch register: Word 3 and Word 2 of the branch register hold the starting and exit addresses of the loop body respectively.

Loop code

All the loop needs to do is update the state register by 16 after each iteration; this maps the function to the next value (the next element of the array), and simultaneously the pre-calculated counter value is also updated by 16. A single "ai" (add immediate) instruction implements this. At the same time, we use the "rotqby" (rotate quadword by bytes) instruction to rotate the branch target register by the "magic bit" trigger. The "magic bit" is always 0 except after the last iteration, when it is set to 1 and triggers the rotation. So the branch target register does not actually rotate at all, except after the last iteration, when it is rotated once to exit the loop.


To guarantee the rotation process, i.e. to guarantee that the "magic bit" is set to 1, we use an extra "andhi" (and halfword immediate) instruction to make sure that the "magic bit" is set to 1 after the last iteration.
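A plain-C model of the rotation itself may help (a sketch only, not SPU code; the addresses are invented, and the quadword is modelled as four 32-bit words with Word 3 leftmost, as in the tables above):

#include <stdint.h>
#include <string.h>

int main(void)
{
    /* branch target register before the last iteration:
       Word 3 = loop start, Word 2 = loop exit */
    uint32_t btr[4] = { 0x1000, 0x1200, 0, 0 };
    uint32_t rotated[4];

    /* "rotqby" by 4 bytes: each word moves one slot to the left, wrapping */
    memcpy(rotated, &btr[1], 3 * sizeof(uint32_t));
    rotated[3] = btr[0];
    /* rotated is now { exit, 0, 0, start }: the next branch leaves the loop */
    return 0;
}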


IV Verification

IV.I Run Time System

IV.I.I Motivation

The purpose of verifying the run time system is to ensure that memory is transferred and synchronized correctly based on the instructions being processed. It is assumed that the instructions are correct and executed in the proper order. This is a fair assumption, since the assignment of instructions, when integrated, is input from an external source (the scheduler), and the onus is on that component to verify the order of operations. See appendix VII.II for the output of the test results.

IV.I.II Process

The run time system is verified by providing sample matrices to the system and analyzing the output. Based on the structure of the run time system, if the system fails in memory management or processing, one of the following possibilities will occur:

1. The system stalls: one or more SPUs are in a hanging state, indicating a synchronization error with the blocking calls of either receive mailbox message or receive signal.

2. The system loops indefinitely: one or more SPUs loop through a construct with no possibility of exiting, indicating a wait in the verification routine for data that never arrives.

3. The system outputs incorrect results: the final output is wrong, indicating issues with the algorithm or with the transfers of data.

4. The system completes with correct output: the desired state.

Testing is done in three stages. The first is an algorithmic test to ensure that the general processing of matrix multiplication is performing correctly. The second is a regression test, where dense source matrices are computed and compared. The final is a stress test to ensure that possibilities 1 and 2 above do not occur if the system is actively running for long time periods.

IV.I.III Algorithmic Test

The algorithmic test is processed by creating two source matrices with entries of one or zero. Furthermore, initial tests were processed using source matrices with identifiable patterns, such as the identity or a single one in the top left corner of each block composing the matrices. Therefore, through visual inspection, it is possible to verify with a degree of certainty whether the system is processing the algorithm properly.

IV.I.IV Regression Test

The regression test involved inputting two dense, randomly generated source matrices into the system. The output was checked for correctness and used to verify the system. The purpose of this test is to see whether the system can handle random matrices and complicated computation.


IV.I.V Stress Test

The first two tests verify that the system is able to handle the task of a dense matrix multiplied by a dense matrix using the algorithm and processes specified. However, this does not verify the synchronization routines to their full extent, since it is restricted to a simplistic mathematical problem. Therefore, a test case consisting of a series of memory transfers and computational routines was created that would run for an estimated 8 hours. The output would in no fashion reflect matrix multiplication; rather, what was of interest is the state of the processors during the test. After 8 hours of executing, all processors in use were active, indicating that they were in a state of processing. Therefore, no processor was in a hung state that would indicate a synchronization issue.

IV.II Loop Optimization Theory Testing

We have written a test function to check that the implemented loop code works for a set of iteration counts, i.e. that after each iteration all the words of the state register are updated by 16 simultaneously, while the branch register stays the same except after the last iteration, when it is rotated to the left by 4 bytes.

IV.III Code Graph Verification

Verification of the code graphs consists primarily of a visual inspection of a dot graph output. The full functionality of the code graphs has not yet been implemented with the scheduler. To verify a code graph in full, it will need to be run through the scheduler and have the resulting assembly code analyzed for correctness, or simply executed on a Cell processor. Visual inspection of the dot graph shows nodes and edges labelled with the type of the node and the associated label, provided that these are instances of the Show class in Haskell.

The actual shape of the dot graph produced cannot be controlled by the code graph library, but the semantics of the graph remain. The output of a LoopSpec graph will also display the "d" values of the loops, as well as the types and labels of the original code graph.

IV.IV Kernel Code Verification

The kernel code can be verified using two methods. The first method generates a code graph of the loads and stores, as well as the fma's and fa's, that occur for the multiplication. This graph can be traced to ensure that the correct operations are being done on the correct information. Also, a separate piece of code was written to test only the multiplication, without any loads and stores. The test generates two matrices as strings, using entry positions in each matrix as values, and after the calculations have been completed outputs a solution matrix as a list of strings, one for each entry. This allows us to see visually which entries are being multiplied by which entries and which are being added.
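As a rough illustration of this symbolic test (a sketch only; the real test is not reproduced here, and the 2x2 size and naming scheme are invented for the example), "multiplying" string entries turns products into concatenated labels, so the output makes the combination of entries visible:

#include <stdio.h>
#include <string.h>

#define N 2  /* invented size for the illustration */

int main(void)
{
    char out[N][N][64];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            out[i][j][0] = '\0';
            for (int k = 0; k < N; k++) {
                char term[32];
                /* "multiply" entry a[i][k] by b[k][j] symbolically */
                snprintf(term, sizeof term, "%sa%d%d*b%d%d",
                         k ? " + " : "", i + 1, k + 1, k + 1, j + 1);
                strcat(out[i][j], term);
            }
            /* e.g. prints: c11 = a11*b11 + a12*b21 */
            printf("c%d%d = %s\n", i + 1, j + 1, out[i][j]);
        }
    return 0;
}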


V Bibliography

Microsoft Developer Network (MSDN). "Synchronization and Multiprocessor Issues." 2006. Accessed 26 Nov 2006. <http://msdn2.microsoft.com/en-gb/library/ms686355.apx>

International Business Machines Corporation. Synergistic Processor Unit Instruction Set Architecture. New York: IBM Systems and Technology Group, 2006.

International Business Machines Corporation. Cell Broadband Engine Programming Handbook. New York: IBM Systems and Technology Group, 2006.

Argonne National Laboratory. "An Introduction to MPI." Accessed 27 Nov 2006. <http://www-unix.mcs.anl.gov/mpi/tutorial/mpiintro/ppframe.htm>

Livermore Computing. "Message Passing Interface (MPI)." 7 Dec 2006. Accessed 28 Nov 2007. <http://www.llnl.gov/computing/tutorials/mpi/#What>

Anand, Dr. Christopher. Meeting: examination of clustering and Haskell code, with Christopher Venantius. 21 Nov 2006.

Anand, C. K., W. Kahl, and W. Thaller. "Explicitly Staged Software Pipelining." <http://www.cas.mcmaster.ca/~anand/papers/AnandKahlThaller2006.pdf>

Bach, James. "What is Exploratory Testing?" 29 Jan 2001. Accessed 28 Nov 2006. <http://www.stickyminds.com/sitewide.asp?ObjectId=2255&ObjectType=COL&Function=edetail>

Foster, Ian. "Case Study: Matrix Multiplication." Designing and Building Parallel Programs, 2005. Accessed 13 Oct 2006. <http://www.it.uom.gr/teaching/dbpp/text/node45.html>

Goetz, Brian. "Going Atomic." Java Theory and Practice, IBM developerWorks, 23 Nov 2004. Accessed 26 Nov 2006. <http://www-128.ibm.com/developerworks/java/library/i-jtp11234/>

Greenberg, Dan. "IBM: Explicit Mapping of Threads to SPEs." 23 Sep 2006. Accessed 14 Dec 2006. <http://www-128.ibm.com/developerworks/forums/dw_thread.jsp?message=13872797&cat=46&thread=136944&treeDisplayType=threadmode1&forum=739#13872797>

Gropp, William. "Learning More." 14 Oct 1998. Accessed 28 Nov 2006. <http://www-unix.mcs.anl.gov/mpi/tutorial/mpibasics/sld012.htm>

Bal, Henri E., and Matthew Haines. "Approaches for Integrating Task and Data Parallelism." IEEE Concurrency: Parallel, Distributed and Mobile Computing, vol. 6, no. 3, July-Sept 1998.

Karypis, George. "Introduction to Parallel Computing: Dense Matrix Algorithms." Minnesota.

Lam, M. "Software Pipelining: An Effective Scheduling Technique for VLIW Machines." SIGPLAN Notices, vol. 23, no. 7, pp. 318-328, 1988.

Robinson, Harry. "Exploratory Modelling." Accessed 28 Nov 2006. <http://www.testingcraft.com/exploratory-robinson.html>

Uszkay, Gordon. Meeting with advisor, with Nathan Cumpson and Christopher Venantius. 4 Oct 2006.

Uszkay, Gordon. Meeting with advisor, with Nathan Cumpson and Christopher Venantius. 27 Sep 2006.

Weisstein, Eric W. "Matrix Multiplication." MathWorld, 2006. Accessed 24 Nov 2006. <http://mathworld.wolfram.com/MatrixMultiplication.html>

Weisstein, Eric W. "Strassen Formulas." MathWorld, 2006. Accessed 13 Oct 2006. <http://www.mathworld.wolfram.com/StrassanFormulas.html>

Wikipedia. <http://en.wikipedia.org/>

Kahl, W., C. K. Anand, and J. Carette. "Control-flow Semantics for Assembly-level Data-flow Graphs." In 8th Intl. Seminar on Relational Methods in Computer Science (RelMiCS 8), Feb. 2005 (W. MacCaull, M. Winter, and I. Düntsch, eds.), vol. 3929 of LNCS, pp. 147-160. Springer-Verlag, 2006.


VI Appendix I: User Guide and Upkeep

VI.I Run Time System User Manual

Installation

The run time system itself does not require any installation. However, in order to operate the run time

system, the IBM STL CELL simulator is required. For information regarding the installation of IBM STI

CELL simulator, please contact Dr. Christopher Anand. Listed below are the login names and passwords

for the Apple MacBook Pro which is required in order to operate and install software.

Apple OS X:

Login: zzz Password: cell@r2007

Fedora Core (in the Parallels simulation environment):

Login: cell Password: cell2006

Login: root Password: cell@r

Working with the Run Time System

Boot into OS X on the Apple MacBook Pro and start up Parallels. Click the Play button in the upper right hand corner to start running Fedora Core. When you are prompted for a login and password, enter the following:

Login: cell Password: cell2006

Once you are logged in to Fedora Core, open a terminal and change to the root user by typing in:

su root <enter>

When you are prompted for a password, type in the following:

cell@r <enter>

To get to the directory with the run time system files, type in the following:

cd /opt/ibm/cell-sdk/prototype/coconut/samples/tutorial/runTimeSystem/

Executing the Run Time System

When in the run time system directory, type the following to compile the run time system:

make <enter>

In order to send the runtime system to the directory where the STI CELL simulator can reach it, type in

the following:


cp runTimePPU /opt/ibm/systemsim-cell/run/cell/linux/ <enter>

cp spu/runTimeSPU /opt/ibm/systemsim-cell/run/cell/linux/ <enter>

To run the STI CELL simulator, open up a terminal and switch to the root user and type in the following:

cd /opt/ibm/systemsim-cell/run/cell/linux/ <enter>

../run_gui <enter>

Once the GUI for the simulator is open, click Fast Mode followed by Go. Once the simulator has started, type the following into the xterm to import the run time system files and execute them:

callthru source runTimePPU > runTimePPU <enter>

callthru source runTimeSPU > runTimeSPU <enter>

chmod 777 runTimePPU

./runTimePPU


VI.II Pitfalls of the SDK 2.0

Installation

Unfortunately, there is no installation script provided with the STI CELL SDK. It is up to the user to write their own RPM files for the installation. Consult Dr. Christopher Anand with any questions regarding the installation of the STI CELL SDK.

Linking of Libraries

From version to version of the STI CELL SDK, IBM tends to reorganize the file structure. This means that if your software worked in a previous version of the SDK, it is not guaranteed to work in the current build without modification. Old library files may have to be manually imported into the newer version of the SDK in order for the software to execute appropriately.

Missing Features

According to the documentation provided with the STI CELL SDK, there should be full libraries of C functions for taking advantage of the STI CELL system architecture. Unfortunately, many features that are discussed in the documentation are not present in the current build of the SDK. One such feature is the affinity mask construct that allows the programmer to specify exactly which SPU is going to be used for execution. For up-to-date information on what is and is not available in the STI CELL SDK, consult the IBM forums.

DMA Lists

The DMA list function is implemented and fully functional in the current version of the STI CELL SDK. Unfortunately, the programmer cannot manually break the DMA list up into segments according to their preference, but has to let the SDK break it up automatically as it likes. If the programmer attempts to break it up to their preference, the DMA transfers will fail and result in only partial data being transferred.

Instability

The STI CELL SDK tends to be extremely unstable. When executing software in the simulator, it is recommended that the execution be repeated if the initial execution stalls the simulator. If the user executes a piece of software continuously, for example in a stress test, the simulator will stall. In addition, if a program runs in a continuous loop for an extended period of time, it is very likely that the simulator will stall.


VI.III Run Time System Issue Tracking

ISSUE TRACKING ON THE RUN TIME SYSTEM

Issue 1 (Owner: Christopher Venantius)
  Description: Synchronization issue with overwriting data before DMA transfers are done
  Start: 22 Jan 07   End: 25 Jan 07
  Solution: DMA events have tags associated with buffer spots

Issue 2 (Owner: Christopher Venantius)
  Description: Valid tables not being updated in SPUs, creating major stalls
  Start: 22 Jan 07   End: 03 Feb 07
  Solution: Using OR signal registers to map buffers

Issue 3 (Owner: Damith Karunaratne)
  Description: Synchronization with ID tags being placed at the end of data block
  Start: 08 Feb 07   End: 03 Mar 07
  Solution: A section of code was missing that should have been there

Issue 4 (Owner: Damith Karunaratne)
  Description: Time issue of translation of code into assembly
  Start: 10 Feb 07   End: 23 Feb 07
  Solution: Section of project is now out of scope of thesis (approved)

Issue 5 (Owner: Damith Karunaratne)
  Description: B source matrix not loaded properly
  Start: 09 Mar 07   End: 19 Mar 07
  Solution: DMA list structure was misaligned by one spot

Issue 6 (Owner: Christopher Venantius)
  Description: SPE transfers after solving issue 5 now stall
  Start: 19 Mar 07   End: 21 Mar 07
  Solution: Test case algorithm code was incorrect


VI.IV Code Graph Issue Tracking

Issue 1 (Owner: Nathan Cumpson)
  Description: Variable size in the code graph generator grows as n^3.
  Start: 12/01/2007   End: 15/01/2007
  Solution: The LoopSpec library can create loops in the code graph to reduce size.

Issue 2 (Owner: Nathan Cumpson)
  Description: Looping the entire graph would be infeasible; need to loop only a section.
  Start: 22/01/2007   End: 24/01/2007
  Solution: A method of layering code graphs can allow code graphs or LoopSpecs to be embedded in other code graphs.

Issue 3 (Owner: Nathan Cumpson)
  Description: Not sure how the scheduler will handle loops within loops.
  Start: 26/01/2007   End: 31/01/2007
  Solution: Dr. Anand has a solution using a Multiloop to use a single loop with decisions.

Issue 4 (Owner: Nathan Cumpson)
  Description: Code graph expansion should be distinct to a processor (SPU or PPU).
  Start: 16/02/2007   End: TBD
  Solution: A solution is required for this. It is confusing and non-deterministic which memory and processor code resides at in the code graph.

Issue 5 (Owner: Nathan Cumpson)
  Description: Compatibility issue may occur with the scheduler and the LoopSpec of the code graph.
  Start: 11/03/2007   End: TBD
  Solution: Creating the LoopSpec and the code graph was done using the node-output method, which has a different result than the method used in examples. Not sure if this will be compatible.

Issue 6 (Owner: Nathan Cumpson)
  Description: Types for an SPU and PPU may need to be different.
  Start: 20/02/2007   End: TBD
  Solution: Cannot use different types throughout Haskell. The PPU may need to deal with lists of matrix blocks for DMA transfers, where an SPU will deal with only a single block for computation.


VII Appendix II: Code Snippets and Test Data

VII.I Code Files for Run Time System

VII.I.I runTimePPU.c

/* runTimePPU.c

* Authors: Damith Karunaratne & Christopher Venantius

* Last Modified: March 22, 2007.

* Description: Creates and initializes the SPE threads

* Copyright 2007 McMaster University. All rights reserved.

*/

/*****included header files*****/

#include <stdlib.h>

#include <stdio.h>

#include <errno.h>

//library of spe functions

#include <libspe.h>

//contains the context info for threads

#include "spu/contextInfo.h"

//contains the function to print out matrices

#include "displayMatrix.h"

/*****external links*****/

//handle for the SPE program

extern spe_program_handle_t runTimeSPU;

/*****main: central program for the PPU*****/

/*

PARAMETERS:

argc and argv: input parameters

RETURNS(int):

0: on successful completion

-1: on failure

*/

int main(int argc, char **argv) {

//counters

int i,j,k,l;

//offset for randomness

float offset[NUM_MATX];

//each SPE control thread

speid_t speIDs[ACTIVESPE];

//for SPE completion --> waiting the PPE

int status = 0;

//space for matrix data in memory

float matrices[NUM_MATX][MATX_ROW_BLOCK_DIM][MATX_COL_BLOCK_DIM][BLOCK_SIZE] __attribute__

((aligned (128)));

//space for context information for each SPE control thread

struct contextInfo ci[ACTIVESPE] __attribute__ ((aligned (128)));

//void the passed in arguments

(void)argc;

(void)argv;

//setup offsets for generating matrices for testing
offset[0] = 9.9;
offset[1] = 4.9;
offset[2] = 0;

//initialize matrix

for (l = 0; l < NUM_MATX; l++) {

//for each row of blocks

for (i = 0; i < MATX_ROW_BLOCK_DIM; i++) {

//for each block in row

for (j = 0; j < MATX_COL_BLOCK_DIM; j++) {

//set the tag information

matrices[l][i][j][BLOCK_SIZE-3] = (float)l;

matrices[l][i][j][BLOCK_SIZE-2] = (float)i;


matrices[l][i][j][BLOCK_SIZE-1] = (float)j;

//for each data entry for the matrix

for (k = 0; k < BLOCK_SIZE-TAG_SIZE; k++) {

//sets all the entries of matrices to null

matrices[l][i][j][k] = 0;

}

}

}

}

//setup data for matrices

for (l = 0; l < NUM_MATX-1; l++) {

//for each row of blocks

for (i = 0; i < MATX_ROW_BLOCK_DIM; i++) {

//for each block in row

for (j = 0; j < MATX_COL_BLOCK_DIM; j++) {

//set the tag information

matrices[l][i][j][BLOCK_SIZE-3] = (float)l;

matrices[l][i][j][BLOCK_SIZE-2] = (float)i;

matrices[l][i][j][BLOCK_SIZE-1] = (float)j;

//for each data entry for the matrix

for (k = 0; k < BLOCK_SIZE-TAG_SIZE; k++) {

//sets random values up to offset value

matrices[l][i][j][k] = offset[l]

* (float) random()

/ (float) 0x7fffffff;

}

}

}

}

//for each control thread

for(i=0; i<ACTIVESPE; i++) {

//create spe program

speIDs[i] = spe_create_thread(0, &runTimeSPU, &ci[0], NULL, -1,

SPE_MAP_PS | SPE_CFG_SIGNOTIFY1_OR);

//check if thread was created

if (speIDs[i] == 0) {

fprintf(stderr, "Failed spu_create_thread(rc=%p, errno=%d)\n",

speIDs[i], errno);

exit(1);

}

}

//for each control thread

for (i = 0; i < ACTIVESPE; i++) {

//set the context information's thread ID to control thread value

ci[i].threadID = i;

//set the context information SPEID to SPE ID for associate SPE

ci[i].speID = (unsigned long)speIDs[i];

//for each matrix in memory

for (j = 0; j < NUM_MATX; j++) {

//set context information for matrices in system

ci[i].matrix[j]=&matrices[j];

}

//set the control area and check if done (for mailbox)

if ((ci[i].controlArea = spe_get_ps_area(speIDs[i],

SPE_CONTROL_AREA)) == NULL) {

printf("ERROR: spe_get_ps_area failed for (%d)\n", errno);

return -1;

}

//set the signal one area and check if done (for signals)

if ((ci[i].signalOne = spe_get_ps_area(speIDs[i],

SPE_SIG_NOTIFY_1_AREA)) == NULL) {

printf("ERROR: spe_get_ps_area failed for sig1(%d)\n", errno);

return -1;

}

//set the LS area and check if done (for memory)

if ((ci[i].ls = spe_get_ls(speIDs[i])) == NULL) {

fprintf(stderr, "ERROR: get ls return NULL");

return -1;


}

}

//for each control thread

for (i = 0; i < ACTIVESPE; i++) {

//write in the SPE mailbox the number of control threads and threadID

if (spe_write_in_mbox(speIDs[i], ACTIVESPE) < 0

|| spe_write_in_mbox(speIDs[i], ci[i].threadID) < 0) {

fprintf(stderr, "ERROR: writing messages to spe failed\n");

exit(-1);

}

}

//for each control thread

for (i=0; i<ACTIVESPE; i++) {

//wait for SPE execution to finish

(void)spe_wait(speIDs[i], &status, 0);

}

printf("\nThe program has successfully executed!\n");

return (0);

}


VII.I.II contextInfo.h

/* contextInfo.h

* Authors: Damith Karunaratne & Christopher Venantius

* Last Modified: March 23, 2007.

* Description: Contains the information regarding the SPEs

* Copyright 2007 McMaster University. All rights reserved.

*/

/*****sets structure to be preprocessed*****/

#ifndef _CONTEXTINFO_H_

#define _CONTEXTINFO_H_

/*****constant declarations for the real time system*****/

/*

Below is a series of constants used by the modules in the realtime system.
Since almost all modules need access to these values, they are kept in this

"central" location.

*/

/*

ACTIVESPE: Refers to the number of SPEs in the STI Cell hardware being

allocated to the realtime system.

*/

#define ACTIVESPE 4

/*

NUM_MATX: Refers to the number of matrices being manipulated in the

computation routines. The number refers to both data and solution matrices.

*/

#define NUM_MATX 4

/*

BLOCK_ROW_DIM: Refers to the row dimension of a block of data. In the case of

matrices, each matrix is broken down into a series of blocks that are composed

of the data entries in the matrix.

*/

#define BLOCK_ROW_DIM 2

/*

BLOCK_COL_DIM: Refers to the column dimension of a block of data. In the case

of matrices, each matrix is broken down into a series of blocks that are

composed of the data entries in the matrix.

*/

#define BLOCK_COL_DIM 2

/*

TAG_SIZE: Refers to a tag of data that is attached to each block that is used

to identify the contents.

*/

#define TAG_SIZE 4

/*

BUFF_SIZE: Refers to the amount of buffers in the LS for every SPE. The

buffers are used to memory map the LS for an SPE for particular chunks of

data.

*/

#define BUFF_SIZE 8

/*

SOL_BUFF_SIZE: Refers to the amount of buffers in the LS for every SPE. The

buffers are used to memory map the LS for an SPE for particular chunks of

computed data.

*/

#define SOL_BUFF_SIZE 4

/*

BLOCK_SIZE: Refers to the amount of data entries in a particular block. This

includes data entries from a matrix, plus tag information.


*/

#define BLOCK_SIZE (BLOCK_ROW_DIM*BLOCK_COL_DIM+TAG_SIZE)

/*

MATX_ROW_BLOCK_DIM: Refers to the row dimension with respect to blocks in a

given matrix.

*/

#define MATX_ROW_BLOCK_DIM 8

/*

MATX_COL_BLOCK_DIM: Refers to the column dimension with respect to blocks in

a given matrix.

*/

#define MATX_COL_BLOCK_DIM 8

/*****structure for the context information*****/

/*

Below is information that each SPE contains in order to function within

the realtime system

*/

struct contextInfo {

//identifies current SPE control thread

int threadID;

//identifies current SPE

unsigned long speID;

//pointer to current SPE's LS

void* ls;

//pointer to matrices in memory

void* matrix[NUM_MATX];

//pointer to signal notification area (signals)

void* signalOne;

//pointer to control area (mailbox)

void* controlArea;

//pad for alignment issues

char pad[12];

};

#endif


VII.I.III initialization.c

/* initialization.c

* Authors: Damith Karunaratne & Christopher Venantius

* Last Modified: March 23, 2007.

* Description: Initializes the run time system

* Copyright 2007 McMaster University. All rights reserved.

*/

/*****included header files*****/

//header with spu functions

#include <spu_intrinsics.h>

//library for memory flow controller

#include "/home/cell/cell-sdk-1.1/sysroot/usr/include/cbe_mfc.h"

//library for spu memory flow controller i/o

#include <spu_mfcio.h>

#include <stdio.h>

#include <stdlib.h>

//contains the context info for threads

#include "contextInfo.h"

/*****external links*****/

//link to receive signal code in signal.c

extern unsigned long receiveSignal(void);

//link to send signal code in signal.c

extern int sendSignal(int, unsigned long);

/*****global variables to the module*****/

//number of control threads (initialization)

int myThreadCount = -1;

//the current control thread (initialization)

int myThreadID = -1;

//effective address for the context information in memory

unsigned long myParm;

//context information

volatile struct contextInfo myContextInfo[ACTIVESPE] __attribute__ ((aligned (128)));

//data buffers in LS

volatile float dataBuf[BUFF_SIZE][BLOCK_SIZE] __attribute__ ((aligned (128)));

//solution buffers in LS

volatile float solnBuf[SOL_BUFF_SIZE][BLOCK_SIZE] __attribute__ ((aligned (128)));

/*****initialization: sets up the system*****/

/*

PARAMETERS:

none

RETURNS(int):

0: on successful completion

-1: on failure

*/

int initialization(void) {

//counters

int i,j;

//number of active SPEs in system --> control threads

int threadCount;

//current control thread --> this SPE

int threadID;

//tagID for DMAs

int tagID = 0;

//for each block in the data buffer

for (i = 0; i < BUFF_SIZE; i++) {

//for each tag entry initialize data buffer for data integrity

dataBuf[i][BLOCK_SIZE-3] = -1;

dataBuf[i][BLOCK_SIZE-2] = -1;

dataBuf[i][BLOCK_SIZE-1] = -1;

}

//for each block in the solution buffer


for (i = 0; i < SOL_BUFF_SIZE; i++) {

for(j = 0; j < BLOCK_SIZE; j++) {

solnBuf[i][j] = 0;

}

}

//read SPE mailbox --> number of control threads and active SPEs

threadCount = spu_read_in_mbox();

//check size value

if (threadCount <= 0 ) {

printf("ERROR: wrong thread count in the init\n");

return -1;

}

//set the thread count to instance variable myThreadCount

myThreadCount = threadCount;

//read SPE mailbox --> number of current control thread and active SPE

threadID = spu_read_in_mbox();

//check rank value

if (threadID < 0 ) {

printf("ERROR: wrong thread id in the init\n");

return -1;

}

//set the threadID to instance variable myThreadID

myThreadID = threadID;

//DMA the context information to the control thread

spu_mfcdma32(&myContextInfo[0], (unsigned long) myParm,

myThreadCount* sizeof(struct contextInfo), tagID, MFC_GET_CMD);

//wait for DMA on the MFC to complete

(void) spu_mfcstat(2);

//Synchronize all the SPUs

sendSignal((myThreadID+1)%ACTIVESPE, 1);

receiveSignal();

return 0;

}


VII.I.IV runTimeSPU.c

/* runTimeSPU.c
* Authors: Damith Karunaratne & Christopher Venantius
* Last Modified: March 23, 2007.
* Description: Deals with conducting the actual sending and

* receiving of data from SPE to SPE using

* lockless sync

* Copyright 2007 McMaster University. All rights reserved.

*/

/*****included header files*****/

#include <stdio.h>

//header with spu functions

#include <spu_intrinsics.h>

//library for memory flow controller

#include "/home/cell/cell-sdk-1.1/sysroot/usr/include/cbe_mfc.h"

//library for spu memory flow controller i/o

#include <spu_mfcio.h>

/*****external links*****/

//link to function initialization in dma.c (for initialization of system)

extern int initialization(void);

//link to function test in test.c

extern void test(void);

//link to myParm in dma.c (base address of context information in memory)

extern unsigned long myParm;

/*****main: central program for each SPU*****/

/*

PARAMETERS:

spuID: default parameter for spu thread --> ID for the SPU

parm: the address of the context information stored in memory

RETURNS(int):

0: on successful completion

*/

int main (unsigned long long spuID, unsigned long long parm) {

//type cast the inputs to void type

(void)spuID;

(void)parm;

//initializes the write channel to wait on all DMA tags

spu_writech(MFC_WrTagMask, -1);

//sets the address of context information for the SPE

myParm = (unsigned long) parm;

//calls initialize to setup the system

initialization();

//calls test to test the system

test();

return 0;

}


VII.I.V signal.c

/* signal.c

* Authors: Damith Karunaratne & Christopher Venantius

* Last Modified: March 23, 2007.

* Description: Deals with sending and receiving signals

* to and from SPEs

* Copyright 2007 McMaster University. All rights reserved.

*/

/*****included header files*****/

//header with spu functions

#include <spu_intrinsics.h>

//library for memory flow controller

#include "/home/cell/cell-sdk-1.1/sysroot/usr/include/cbe_mfc.h"

//library for spu memory flow controller i/o

#include <spu_mfcio.h>

#include <stdio.h>

#include <stdlib.h>

//contains the context info for threads

#include "contextInfo.h"

#include <string.h>

/*****external links*****/

//link to number of control threads in dma.c

extern int myThreadCount;

//link to current SPE control thread number in dma.c

extern int myThreadID;

//link to SPE's context information in dma.c

extern volatile struct contextInfo myContextInfo[ACTIVESPE];

/*****sendSignal: sends a signal to a SPE*****/

/*

PARAMETERS:

toSPE: The destination of the signal being sent.

sig: The signal message / id to be sent

RETURNS (int):

0: on successful completion

-1: on error

*/

int sendSignal(int toSPE, unsigned long sig) {

//setup the signal

int signal[4] __attribute__ ((aligned (128)));

//location of effective address

unsigned long effectiveAddr;

//pointer to LS

char* ls;

//tagID for the send-signal DMA

int tagID = 1;

//check if toSPE is in bounds

if ((toSPE >= myThreadCount) || (toSPE < 0)) {

printf("ERROR: destination (%d) out of system bounds", toSPE);

return -1;

}

//set message

signal[3] = sig;

//set effective address for message

effectiveAddr = (unsigned long)myContextInfo[toSPE].signalOne + 12;

//set ls address where message is

ls = ((char*)&signal[0])+ 12;

//send signal

mfc_sndsig(ls , effectiveAddr, tagID, 0,0);

//wait till completed

(void)spu_mfcstat(2);


return 0;

}
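
The 12-byte offset above is worth noting: signal-notification writes move a full 16-byte quadword, and the 32-bit payload must land in the last word of it. A minimal sketch of the layout, assuming the aligned four-word array used above:

/* Sketch (for illustration only): layout of the aligned signal quadword.
 * signal[0..2] are padding; signal[3] holds the 32-bit payload.
 *
 *   byte offset:  0        4        8        12       16
 *                 | sig[0] | sig[1] | sig[2] | sig[3] |
 *
 * Both the LS pointer and the effective address are advanced by 12 bytes
 * so the payload lands on the word backing the SPE's first
 * signal-notification register (read back via spu_read_signal1). */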

/*****receiveSignal: receives a signal from another SPE*****/

/*

PARAMETERS:

none

RETURNS(unsigned long):

signal: on successful completion

NOTE: this is a blocking call --> the SPE blocks until a message is received

*/

unsigned long receiveSignal(void) {

//message to receive

unsigned long sig;

//read the signal register (blocks until a signal arrives)

sig = spu_read_signal1();

return sig;

}


VII.I.VI displayMatrix.h

/* displayMatrix.h

* Authors: Damith Karunaratne & Christopher Venantius

* Last Modified: March 23, 2007.

* Description: Code for displaying the matrices

* Copyright 2007 McMaster University. All rights reserved.

*/

/*****displayMatrix: prints out the matrix called*****/

/*

PARAMETERS:

matrix: a matrix to display

RETURNS:

none

*/

void displayMatrix(float matrix[MATX_ROW_BLOCK_DIM][MATX_COL_BLOCK_DIM][BLOCK_SIZE]){

//counters

int i,j,k,l;

//for each row of blocks

fflush(stdout);

for(i=0;i<MATX_ROW_BLOCK_DIM;i++){

//for each row of matrix data within a row of blocks

for(j=0;j<BLOCK_ROW_DIM;j++){

//for each block along the row

for(k=0;k<MATX_COL_BLOCK_DIM;k++){

//for each entry along a row in a block

for(l=0;l<BLOCK_COL_DIM;l++){

printf("%.2f ",matrix[i][k][j*BLOCK_ROW_DIM+l]);

fflush(stdout);}}

printf("\n");

fflush(stdout);}}

fflush(stdout);

}


VII.I.VII compute.c

/* compute.c

* Authors: Damith Karunaratne & Christopher Venantius

* Last Modified: March 23, 2007.

* Description: Code for computations and verifying data

* Copyright 2007 McMaster University. All rights reserved.

*/

/*****included header files*****/

//header with spu functions

#include <spu_intrinsics.h>

//library for memory flow controller

#include "/home/cell/cell-sdk-1.1/sysroot/usr/include/cbe_mfc.h"

//library for spu memory flow controller i/o

#include <spu_mfcio.h>

#include <stdio.h>

#include <stdlib.h>

//contains the context info for threads

#include "contextInfo.h"

/*****external links*****/

//link to send signal code in signal.c

extern int sendSignal(int,unsigned long);

//link to receive signal code in signal.c

extern unsigned long receiveSignal(void);

//link to SPE's data buffer in initialization.c

extern volatile float dataBuf[BUFF_SIZE][BLOCK_SIZE];

//link to SPE's solution buffer in initialization.c

extern volatile float solnBuf[SOL_BUFF_SIZE][BLOCK_SIZE];

//link to SPE's thread id in initialization.c

extern int myThreadID;

//link to SPE's context information in initialization.c

extern volatile struct contextInfo myContextInfo[ACTIVESPE];

/*****verifyBuffer: checks to see if the buffer holds the data required*****/

/*

PARAMETERS:

matx: the matrix (A, B or C)

blockRow: the corresponding row of the matrix block

blockCol: the corresponding column of the matrix block

buffer: the buffer spot that the matrix is contained in

RETURNS(int):

0: on successful completion

*/

int verifyBuffer(float matx, float blockRow, float blockCol, int buffer) {

//boolean flag indicating whether the LS-to-LS transfer has completed

int flag = 0;

//spin until the transferred data is present

while (flag == 0) {

//checks to see if the right matrix is being used

if ((dataBuf[buffer][BLOCK_SIZE-3]) == matx){

//checks to see if the right block row is being used

if ((dataBuf[buffer][BLOCK_SIZE-2]) == blockRow){

//checks to see if the right block column is being used

if(((dataBuf)[buffer][BLOCK_SIZE-1]) == blockCol){

flag = 1;

}

}

}

}

return 0;

}
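
verifyBuffer works because every block carries its identity in its last three entries; a transfer only counts as complete once all three match. A hedged producer-side sketch (not part of the original listing) of how a sender stamps a block before shipping it:

/* Hypothetical producer-side stamping (illustration only): the last three
 * entries identify the block so the receiver's verifyBuffer spin loop can
 * detect its arrival. */
dataBuf[buffer][BLOCK_SIZE-3] = matx;     /* matrix identifier (A, B or C) */
dataBuf[buffer][BLOCK_SIZE-2] = blockRow; /* block row index */
dataBuf[buffer][BLOCK_SIZE-1] = blockCol; /* block column index */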

/*****innerProduct: computes the inner product of matrix blocks*****/

/*

PARAMETERS:


indexOne: the spot in the data buffer of Matrix A

row: the row being computed on in Matrix A

indexTwo: the spot in the data buffer of Matrix B

col: the column being used for the computation in Matrix B

RETURNS(float):

value: the result of the inner product

*/

float innerProduct(int indexOne, int row, int indexTwo, int col) {

//will contain the value of the inner product

float value;

//counter variable

int i;

//initialization of value

value = 0;

//goes through each value of the column

for (i = 0; i < BLOCK_COL_DIM; i++) {

//computes the temporary inner product values

value = value + dataBuf[indexOne][row*(BLOCK_COL_DIM)+i]

* dataBuf[indexTwo][col+i*(BLOCK_COL_DIM)];

}

//returns the inner product

return value;

}
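
In matrix terms, innerProduct computes a single entry of the block product: with both blocks stored row-major with stride BLOCK_COL_DIM, it evaluates

$$\mathrm{value} \;=\; \sum_{i=0}^{\mathrm{BLOCK\_COL\_DIM}-1} A_{\mathrm{row},\,i}\; B_{i,\,\mathrm{col}}$$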

/*****fmbip: computes the floating-point matrix block inner product*****/

/*

PARAMETERS:

dataBuffOne: data buffer which the matrix A is contained in

matxOne: matrix identifier (matrix A)

blockRowOne: the row index of the block for matrix A

blockColOne: the column index of the block for matrix A

dataBuffTwo: data buffer which the matrix B is contained in

matxTwo: matrix identifier (matrix B)

blockRowTwo: the row index of the block for matrix B

blockColTwo: the column index of the block for matrix B

buffIndex: spot for storage in solution buffer

RETURNS(void):

N/A

*/

void fmbip(int dataBuffOne, float matxOne, float blockRowOne,

float blockColOne, int dataBuffTwo, float matxTwo,

float blockRowTwo, float blockColTwo, int buffIndex){

//counter variables to traverse the block row and block column

int i,j;

//verify the correct data is in the LS

verifyBuffer(matxOne, blockRowOne, blockColOne, dataBuffOne);

verifyBuffer(matxTwo, blockRowTwo, blockColTwo, dataBuffTwo);

//performs the inner product calculation on blocks

for(i = 0; i < BLOCK_ROW_DIM; i++) {

for(j = 0; j < BLOCK_COL_DIM; j++) {

solnBuf[buffIndex][i*(BLOCK_COL_DIM)+j] = solnBuf[buffIndex][i*(BLOCK_COL_DIM)+j]

+ innerProduct(dataBuffOne, i, dataBuffTwo, j);

}

}

}
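
A minimal usage sketch (hypothetical; the AMATRIX/BMATRIX identifiers are the ones defined later in test.c): multiply the A block (i, k) resident in data buffer 0 by the B block (k, j) resident in data buffer 4, accumulating into solution buffer 0:

/* Sketch only: assumes both blocks have already been DMAed into the LS. */
fmbip(0, AMATRIX, i, k, 4, BMATRIX, k, j, 0);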


VII.I.VIII dma.c

/* dma.c

* Authors: Damith Karunaratne & Christopher Venantius

* Last Modified: March 23, 2007.

* Description: Initializes and does DMA from SPE to SPE

* Copyright 2007 McMaster University. All rights reserved.

*/

/* Note: Signalling currently maps only its adjacent SPU's memory for the

* `slots' in that SPU. The SPUs are currently being mapped randomly so we have

* no assurance that the ordering of the SPUs is the same as our enumeration.

* The SPUs may be ordered so that the enumerated SPU threads must loop around

* the entire token ring (EIB), inefficiently. This could be moved into

* assembly code, where the enumeration can be made accurate.

*/

/*****included header files*****/

//header with spu functions

#include <spu_intrinsics.h>

//library for memory flow controller

#include "/home/cell/cell-sdk-1.1/sysroot/usr/include/cbe_mfc.h"

//library for spu memory flow controller i/o

#include <spu_mfcio.h>

#include <stdio.h>

#include <stdlib.h>

//contains the context info for threads

#include "contextInfo.h"

/*****external links*****/

//link to SPE's thread count in initialization.c

extern int myThreadCount;

//link to SPE's thread id in initialization.c

extern int myThreadID;

//link to SPE's context information in initialization.c

extern volatile struct contextInfo myContextInfo[ACTIVESPE];

//link to SPE's data buffer in initialization.c

extern volatile float dataBuf[BUFF_SIZE][BLOCK_SIZE];

//link to SPE's solution buffer in initialization.c

extern volatile float solnBuf[SOL_BUFF_SIZE][BLOCK_SIZE];

//link to send signal code in signal.c

extern int sendSignal(int,unsigned long);

//link to receive signal code in signal.c

extern unsigned long receiveSignal(void);

//link to verify data is in a buffer from compute.c

extern int verifyBuffer(float, float, float, int);

/*****dmaListElement: the structure of an element in a DMA list*****/

/*

PARAMETERS:

N/A

RETURNS(N/A):

N/A

*/

struct dmaListElement {

union {

//the size of the element in the DMA list

unsigned int all32;

} size;

//the effective address of the data for the list element

unsigned int eaLow;

};

//Creates the dma list with the appropriate number of spots

struct dmaListElement list[(MATX_ROW_BLOCK_DIM/8)+1] __attribute__ ((aligned (8)));

/*****createDMAList: creates the DMA list*****/


/*

PARAMETERS:

ea: the base of the effective address where the matrix is stored

RETURNS(int):

i: the number of elements in the DMA list

*/

int createDMAList (volatile unsigned long ea) {

//counter for the elements in the dma list

int i = 0;

//holds the value for the size of a dma list element

unsigned int sz;

//the number of bytes required for the dma list

unsigned int nBytes = (BLOCK_SIZE)*sizeof(float);

//creates the whole dma list

while (nBytes > 0) {

//calculates the bytes of data required for the element

//maximum is 16384 bytes per transfer

sz = (nBytes < 16384) ? nBytes : 16384;

//assigns the size to the list

list[i].size.all32 = sz;

//assigns the effective address for the data into the list

list[i].eaLow = ea;

//decreases the number of bytes needed to be loaded

nBytes -= sz;

//increase the spot in the effective address for the next element

ea += sz;

//increment the counter for elements in the dma list

i++;

}

//returns the number of elements in the dma list

return(i);

}
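
A worked example of the chunking (sizes assumed for illustration, not from the listing):

/* If BLOCK_SIZE were 8192 floats, nBytes = 32768, so the loop emits
 *   list[0] = { size 16384, ea         }
 *   list[1] = { size 16384, ea + 16384 }
 * and returns 2. For the 2 x 2 test blocks (BLOCK_SIZE = 4, i.e. 16 bytes)
 * the list degenerates to a single element. */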

/*****requestSPETransfer: sends a signal to indicate data can be sent*****/

/*

PARAMETERS:

toSPE: the SPE the data will come from (the SPE the signal is sent to)

toBuffer: the local buffer spot being freed for the incoming transfer

RETURNS(int):

0: on successful completion

*/

int requestSPETransfer(int toSPE, int toBuffer) {

//set the tag mask to wait on this buffer's DMA tag

spu_writech(MFC_WrTagMask, 1 << toBuffer);

//wait for the masked tag's DMA to complete

spu_mfcstat(MFC_TAG_UPDATE_ALL);

//sends a signal to the SPE indicated

sendSignal(toSPE, 1 << toBuffer);

return 0;

}

/*****transferDMA: transfer a buffer of data from one spe to another spe*****/

/*

PARAMETERS:

fromBuffer: the buffer spot where the data is originating

toSPE: the SPE where the data is arriving

toBuffer: the buffer spot where the data is placed

matx: the matrix that is being transfered (A, B or C)

blockRow: index of the matrix block row location to be transfered

blockCol: index of the matrix block column location to be transfered

RETURNS(int):

0: on sucessful completion

-1: on failure

*/

int transferDMA(int fromBuffer, int toSPE, int toBuffer, float matx,

float blockRow, float blockCol) {

//holds the value for the size of the dma list

unsigned int listsize;


//effective address

volatile unsigned long ea;

//the signal register map

static int signalReg = 0;

//initialize the bitSignalReg value

int bitSignalReg = signalReg & (1<<(fromBuffer));

//wait for a signal from the SPE you are transferring to, to ensure this

//dma is not overwriting data in the destination SPE's LS

while (bitSignalReg == 0) {

//signalReg is logically ORed to value of the signal sent from the SPE

signalReg = signalReg | receiveSignal();

//the bitSignalReg is updated to represent the received signal

bitSignalReg = signalReg & (1<<(fromBuffer));

}

//guard against self DMA: SPE-to-SPE transfer needs at least two SPEs

if (myThreadCount < 2){

printf("ERROR: thread count (%d) too small for SPE-to-SPE DMA", myThreadCount);

return -1;

}

//verify that the data is there to transfer

verifyBuffer(matx, blockRow, blockCol, fromBuffer);

//calculate the base effective address of the matrix block

ea = (unsigned long)myContextInfo[toSPE].ls

+ (unsigned long)&dataBuf[toBuffer];

//calculates the size of the dma list

listsize = createDMAList(ea) * sizeof(struct dmaListElement);

//call the DMA using DMA lists with fences

spu_mfcdma32(&dataBuf[fromBuffer][0], (unsigned int) &list[0], listsize,

fromBuffer, MFC_PUTLF_CMD);

//clears this buffer's bit in signalReg (AND with the complement)

signalReg = signalReg & (~(1<<fromBuffer));

return 0;

}
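
transferDMA and requestSPETransfer are the two halves of the lockless handshake: the receiver frees a buffer slot and signals, and the sender spins on its signal register until that slot's bit arrives before issuing the fenced put. A sketch of the pairing as test.c uses it:

/* Handshake for buffer slot 0 (SPE n receives the A block from SPE n-1):
 *
 *   SPE n:   requestSPETransfer(n-1, 0);  // wait on tag 0, signal SPE n-1
 *   SPE n-1: transferDMA(0, n, 0, ...);   // block until bit 0 arrives,
 *                                         // then DMA its buffer 0 into
 *                                         // SPE n's buffer 0
 */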

/*****inDMA: transfer data from memory into active SPE's buffer*****/

/*

PARAMETERS:

toBuffer: which buffer the data is sent to

matxID: identifier of the matrix in memory

rowBlock: identifier of the row block dimension of the data

colBlock: identifier of the column block dimension of the data

RETURNS(int):

0: on successful completion

*/

int inDMA(int toBuffer, int matxID, int rowBlock, int colBlock) {

//holds the value for the size of the dma list

unsigned int listsize;

//effective address

volatile unsigned long ea;

//set the tag mask to wait on this buffer's DMA tag

spu_writech(MFC_WrTagMask, 1 << toBuffer);

//wait for the masked tag's DMA to complete

spu_mfcstat(MFC_TAG_UPDATE_ALL);

//compute effective address where the data is coming from

ea = ((unsigned long)(myContextInfo[myThreadID].matrix[matxID]

+rowBlock*MATX_COL_BLOCK_DIM*BLOCK_SIZE*sizeof(float)

+colBlock*BLOCK_SIZE*sizeof(float)));

//calculates the size of the dma list

listsize = createDMAList(ea) * sizeof(struct dmaListElement);


//call the DMA using DMA lists with fences

spu_mfcdma32(&dataBuf[toBuffer], (unsigned int) &list[0], listsize,

toBuffer, MFC_GETLF_CMD);

return 0;

}
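
A worked example of the effective-address arithmetic (test configuration assumed: BLOCK_SIZE = 4 floats, MATX_COL_BLOCK_DIM = 8):

/* Block (rowBlock = 1, colBlock = 2) starts at
 *   1*8*4*sizeof(float) + 2*4*sizeof(float) = 128 + 32 = 160 bytes
 * past the matrix base, since blocks are stored contiguously in
 * row-of-blocks order. */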

/*****outDMA: transfer data from active SPE's buffer into memory*****/

/*

PARAMETERS:

fromBuffer: which buffer the data is coming from

matxID: identifier of the matrix in memory

rowBlock: identifier of the row block dimension of the data

colBlock: identifier of the column block dimension of the data

RETURNS(int):

0: on successful completion

*/

int outDMA(int fromBuffer, int matxID, int rowBlock, int colBlock) {

//counter variable

int i;

//holds the value for the size of the dma list

unsigned int listsize;

//effective address

volatile unsigned long ea;

//compute effective address where the data is going

ea = ((unsigned long)(myContextInfo[myThreadID].matrix[matxID]

+rowBlock*MATX_COL_BLOCK_DIM*BLOCK_SIZE*sizeof(float)

+colBlock*BLOCK_SIZE*sizeof(float)));

//calculates the size of the dma list

listsize = createDMAList(ea) * sizeof(struct dmaListElement);

//call the DMA using DMA lists with fences

spu_mfcdma32(&solnBuf[fromBuffer][0], (unsigned int) &list[0], listsize, 1,

MFC_PUTLF_CMD);

//clears all the values in the current solution buffer

for(i = 0; i < BLOCK_SIZE; i++) {

solnBuf[fromBuffer][i] = 0;

}

return 0;

}


VII.I.IX test.c

/* test.c

* Authors: Damith Karunaratne & Christopher Venantius

* Last Modified: March 23, 2007.

* Description: Used to set up and test the run time system.

* Copyright 2007 McMaster University. All rights reserved.

*/

/*****included header files*****/

#include <stdlib.h>

#include <stdio.h>

#include <errno.h>

//contains the context info for threads

#include "contextInfo.h"

/*****external links*****/

//link to function transferDMA in dma.c (transfer data from SPE to SPE)

extern int transferDMA(int, int, int, float, float, float);

//link to function outDMA in dma.c (transfer data from SPE to memory)

extern int outDMA(int, int, int, int);

//link to function inDMA in dma.c (transfer data from memory to SPE)

extern int inDMA(int, int, int, int);

//link to function requestSPETransfer in dma.c

extern int requestSPETransfer(int, int);

//link to fmbip in compute.c

extern void fmbip(int, float, float, float, int, float, float, float, int);

//link to SPE's thread id in initialization.c

extern int myThreadID;

/*****constant declarations for runTimeSPU*****/

//constants to enumerate the matrices in the system

#define AMATRIX 0

#define BMATRIX 1

#define CMATRIX 2

/*****test: code used to test the run time system*****/

/*

PARAMETERS:

none

RETURNS(void):

none

SUMMARY: Below is a function that hardcodes the instructions each SPE

in a system of 4 active SPEs will receive in order to compute matrix

multiplication with double buffering and 8 x 8 block matrices.

*/

void test(void) {

//counters used to iterate over all blocks of a matrix

int i, j;

//for the iteration over the row block in matrix A

for (i = 0; i < MATX_ROW_BLOCK_DIM; i=i+2) {

//execute the run time system depending on what spe is in use

if (myThreadID == 0) {

//initial loading of A and B matrices

inDMA(0, AMATRIX, i, 0);

inDMA(4, BMATRIX, 0, (2*myThreadID) );

inDMA(1, AMATRIX, i+1, 0);

inDMA(5, BMATRIX, 0, (2*myThreadID)+1 );

//initialize j which is a counter for the middle loading,

//transferring and computation code

j = 1;

//iterate over all the blocks except the last one

while (j < MATX_ROW_BLOCK_DIM-1) {

//load appropriate blocks of the A and B matrices

inDMA(2, AMATRIX, i, j);

inDMA(6, BMATRIX, j, (2*myThreadID) );


inDMA(3, AMATRIX, i+1, j);

inDMA(7, BMATRIX, j, (2*myThreadID)+1 );

//transfer the appropriate A block to the next SPE

transferDMA(0, myThreadID+1, 0, AMATRIX, i, j-1);

//run computations on the appropriate A and B blocks

fmbip(0, AMATRIX, i, j-1, 4, BMATRIX, j-1, (2*myThreadID),0);

fmbip(0, AMATRIX, i, j-1, 5, BMATRIX, j-1, (2*myThreadID)+1,1);

//transfer the appropriate A block to the next SPE

transferDMA(1, myThreadID+1, 1, AMATRIX, i+1, j-1);

//run computations on the appropriate A and B blocks

fmbip(1, AMATRIX, i+1, j-1, 4, BMATRIX, j-1, (2*myThreadID),2);

fmbip(1, AMATRIX, i+1, j-1, 5, BMATRIX, j-1, (2*myThreadID)+1,3);

//load appropriate blocks of the A and B matrices

inDMA(0, AMATRIX, i, j+1);

inDMA(4, BMATRIX, j+1, (2*myThreadID) );

inDMA(1, AMATRIX, i+1, j+1);

inDMA(5, BMATRIX, j+1, (2*myThreadID)+1 );

//transfer the appropriate A block to the next SPE

transferDMA(2, myThreadID+1, 2, AMATRIX, i, j);

//run computations on the appropriate A and B blocks

fmbip(2, AMATRIX, i, j, 6, BMATRIX, j, (2*myThreadID),0);

fmbip(2, AMATRIX, i, j, 7, BMATRIX, j, (2*myThreadID)+1,1);

//transfer the appropriate A block to the next SPE

transferDMA(3, myThreadID+1, 3, AMATRIX, i+1, j);

//run computations on the appropriate A and B blocks

fmbip(3, AMATRIX, i+1, j, 6, BMATRIX, j, (2*myThreadID),2);

fmbip(3, AMATRIX, i+1, j, 7, BMATRIX, j, (2*myThreadID)+1,3);

//increment by 2 since using 4 SPEs w/ double buffering

j = j + 2;

}

//load the last blocks of the A and B matrices

inDMA(2, AMATRIX, i, j);

inDMA(6, BMATRIX, j, (2*myThreadID) );

inDMA(3, AMATRIX, i+1, j);

inDMA(7, BMATRIX, j, (2*myThreadID)+1 );

//transfer the appropriate A block to the next SPE

transferDMA(0, myThreadID+1, 0, AMATRIX, i, j-1);

//run computations on the appropriate A and B blocks

fmbip(0, AMATRIX, i, j-1, 4, BMATRIX, j-1, (2*myThreadID),0);

fmbip(0, AMATRIX, i, j-1, 5, BMATRIX, j-1, (2*myThreadID)+1,1);

//transfer the appropriate A block to the next SPE

transferDMA(1, myThreadID+1, 1, AMATRIX, i+1, j-1);

//run computations on the appropriate A and B blocks

fmbip(1, AMATRIX, i+1, j-1, 4, BMATRIX, j-1, (2*myThreadID),2);

fmbip(1, AMATRIX, i+1, j-1, 5, BMATRIX, j-1, (2*myThreadID)+1,3);

//transfer the appropriate A block to the next SPE

transferDMA(2, myThreadID+1, 2, AMATRIX, i, j);

//run computations on the appropriate A and B blocks

fmbip(2, AMATRIX, i, j, 6, BMATRIX, j, (2*myThreadID),0);

fmbip(2, AMATRIX, i, j, 7, BMATRIX, j, (2*myThreadID)+1 ,1);

//transfer the appropriate A block to the next SPE

transferDMA(3, myThreadID+1, 3, AMATRIX, i+1, j);


//run computations on the appropriate A and B blocks

fmbip(3, AMATRIX, i+1, j, 6, BMATRIX, j, (2*myThreadID) ,2);

fmbip(3, AMATRIX, i+1, j, 7, BMATRIX, j, (2*myThreadID)+1 ,3);

//save computed C matrix blocks into memory

outDMA(0, CMATRIX, i, (2*myThreadID));

outDMA(1, CMATRIX, i, (2*myThreadID)+1);

outDMA(2, CMATRIX, i+1, (2*myThreadID));

outDMA(3, CMATRIX, i+1, (2*myThreadID)+1);

} else if (myThreadID == (ACTIVESPE-1)) {

//load the appropriate B blocks and receive A blocks from the previous SPE

requestSPETransfer(myThreadID-1,0);

inDMA(4, BMATRIX, 0, (2*myThreadID));

requestSPETransfer(myThreadID-1,1);

inDMA(5, BMATRIX, 0, (2*myThreadID)+1);

//initialize j which is a counter for the middle loading,

//transferring and computation code

j = 1;

//iterate over all the blocks except the last one

while (j < MATX_ROW_BLOCK_DIM-1) {

//load the appropriate B blocks and receive A blocks from the previous SPE

requestSPETransfer(myThreadID-1,2);

inDMA(6, BMATRIX, j, (2*myThreadID));

requestSPETransfer(myThreadID-1,3);

inDMA(7, BMATRIX, j, (2*myThreadID)+1);

//run computations on the appropriate A and B blocks

fmbip(0, AMATRIX, i, j-1, 4, BMATRIX, j-1, (2*myThreadID),0);

fmbip(0, AMATRIX, i, j-1, 5, BMATRIX, j-1, (2*myThreadID)+1,1);

fmbip(1, AMATRIX, i+1, j-1, 4, BMATRIX, j-1, (2*myThreadID),2);

fmbip(1, AMATRIX, i+1, j-1, 5, BMATRIX, j-1, (2*myThreadID)+1,3);

//load the appropriate B blocks and receive A blocks from the previous SPE

requestSPETransfer(myThreadID-1,0);

inDMA(4, BMATRIX, j+1, (2*myThreadID));

requestSPETransfer(myThreadID-1,1);

inDMA(5, BMATRIX, j+1, (2*myThreadID)+1);

//run computations on the appropriate A and B blocks

fmbip(2, AMATRIX, i, j, 6, BMATRIX, j, (2*myThreadID),0);

fmbip(2, AMATRIX, i, j, 7, BMATRIX, j, (2*myThreadID)+1,1);

fmbip(3, AMATRIX, i+1, j, 6, BMATRIX, j, (2*myThreadID),2);

fmbip(3, AMATRIX, i+1, j, 7, BMATRIX, j, (2*myThreadID)+1,3);

//increment by 2 since using 4 SPEs w/ double buffering

j = j + 2;

}

//load the last B blocks and receive A blocks from the previous SPE

requestSPETransfer(myThreadID-1,2);

inDMA(6, BMATRIX, j, (2*myThreadID));

requestSPETransfer(myThreadID-1,3);

inDMA(7, BMATRIX, j, (2*myThreadID)+1);

//run computations on the appropriate A and B blocks

fmbip(0, AMATRIX, i, j-1, 4, BMATRIX, j-1, (2*myThreadID),0);

fmbip(0, AMATRIX, i, j-1, 5, BMATRIX, j-1, (2*myThreadID)+1,1);

fmbip(1, AMATRIX, i+1, j-1, 4, BMATRIX, j-1, (2*myThreadID),2);

fmbip(1, AMATRIX, i+1, j-1, 5, BMATRIX, j-1, (2*myThreadID)+1,3);

//run computations on the appropriate A and B blocks

fmbip(2, AMATRIX, i, j, 6, BMATRIX, j, (2*myThreadID),0);

fmbip(2, AMATRIX, i, j, 7, BMATRIX, j, (2*myThreadID)+1,1);

fmbip(3, AMATRIX, i+1, j, 6, BMATRIX, j, (2*myThreadID),2);

fmbip(3, AMATRIX, i+1, j, 7, BMATRIX, j, (2*myThreadID)+1,3);

//save computed C matrix blocks into memory

outDMA(0, CMATRIX, i, (2*myThreadID));

outDMA(1, CMATRIX, i, (2*myThreadID)+1);


outDMA(2, CMATRIX, i+1, (2*myThreadID));

outDMA(3, CMATRIX, i+1, (2*myThreadID)+1);

} else {

//load the appropriate B blocks and receive A blocks from the previous SPE

requestSPETransfer(myThreadID-1,0);

inDMA(4, BMATRIX, 0, (2*myThreadID));

requestSPETransfer(myThreadID-1,1);

inDMA(5, BMATRIX, 0, (2*myThreadID)+1);

//initialize j which is a counter for the middle loading,

//transferring and computation code

j = 1;

//iterate over all the blocks except the last one

while (j < MATX_ROW_BLOCK_DIM-1) {

//load the appropriate B blocks and receive A blocks from the previous SPE

requestSPETransfer(myThreadID-1,2);

inDMA(6, BMATRIX, j, (2*myThreadID));

requestSPETransfer(myThreadID-1,3);

inDMA(7, BMATRIX, j, (2*myThreadID)+1);

//transfer the appropriate A block to the next SPE

transferDMA(0, myThreadID+1, 0, AMATRIX, i, j-1);

//run computations on the appropriate A and B blocks

fmbip(0, AMATRIX, i, j-1, 4, BMATRIX, j-1, (2*myThreadID),0);

fmbip(0, AMATRIX, i, j-1, 5, BMATRIX, j-1, (2*myThreadID)+1,1);

//transfer the appropriate A block to the next SPE

transferDMA(1, myThreadID+1, 1, AMATRIX, i+1, j-1);

//run computations on the appropriate A and B blocks

fmbip(1, AMATRIX, i+1, j-1, 4, BMATRIX, j-1, (2*myThreadID),2);

fmbip(1, AMATRIX, i+1, j-1, 5, BMATRIX, j-1, (2*myThreadID)+1,3);

//load the appropriate B blocks and receive A blocks from the previous SPE

requestSPETransfer(myThreadID-1,0);

inDMA(4, BMATRIX, j+1, (2*myThreadID));

requestSPETransfer(myThreadID-1,1);

inDMA(5, BMATRIX, j+1, (2*myThreadID)+1);

//transfer the appropriate A block to the next SPE

transferDMA(2, myThreadID+1, 2, AMATRIX, i, j);

//run computations on the appropriate A and B blocks

fmbip(2, AMATRIX, i, j, 6, BMATRIX, j, (2*myThreadID),0);

fmbip(2, AMATRIX, i, j, 7, BMATRIX, j, (2*myThreadID)+1,1);

//transfer the appropriate A block to the next SPE

transferDMA(3, myThreadID+1, 3, AMATRIX, i+1, j);

//run computations on the appropriate A and B blocks

fmbip(3, AMATRIX, i+1, j, 6, BMATRIX, j, (2*myThreadID),2);

fmbip(3, AMATRIX, i+1, j, 7, BMATRIX, j, (2*myThreadID)+1,3);

//increment by 2 since using 4 SPEs w/ double buffering

j = j + 2;

}

//load the last B blocks and receive A blocks from the previous SPE

requestSPETransfer(myThreadID-1,2);

inDMA(6, BMATRIX, j, (2*myThreadID));

requestSPETransfer(myThreadID-1,3);

inDMA(7, BMATRIX, j, (2*myThreadID)+1);

//transfer the appropriate A block to the next SPE

transferDMA(0, myThreadID+1, 0, AMATRIX, i, j-1);

//run computations on the appropriate A and B blocks

fmbip(0, AMATRIX, i, j-1, 4, BMATRIX, j-1, (2*myThreadID),0);

fmbip(0, AMATRIX, i, j-1, 5, BMATRIX, j-1, (2*myThreadID)+1,1);


//transfer the appropriate A block to the next SPE

transferDMA(1, myThreadID+1, 1, AMATRIX, i+1, j-1);

//run computations on the appropriate A and B blocks

fmbip(1, AMATRIX, i+1, j-1, 4, BMATRIX, j-1, (2*myThreadID),2);

fmbip(1, AMATRIX, i+1, j-1, 5, BMATRIX, j-1, (2*myThreadID)+1,3);

//transfer the appropriate A block to the next SPE

transferDMA(2, myThreadID+1, 2, AMATRIX, i, j);

//run computations on the appropriate A and B blocks

fmbip(2, AMATRIX, i, j, 6, BMATRIX, j, (2*myThreadID),0);

fmbip(2, AMATRIX, i, j, 7, BMATRIX, j, (2*myThreadID)+1,1);

//transfer the appropriate A block to the next SPE

transferDMA(3, myThreadID+1, 3, AMATRIX, i+1, j);

//run computations on the appropriate A and B blocks

fmbip(3, AMATRIX, i+1, j, 6, BMATRIX, j, (2*myThreadID),2);

fmbip(3, AMATRIX, i+1, j, 7, BMATRIX, j, (2*myThreadID)+1,3);

//save computed C matrix blocks into memory

outDMA(0, CMATRIX, i, (2*myThreadID));

outDMA(1, CMATRIX, i, (2*myThreadID)+1);

outDMA(2, CMATRIX, i+1, (2*myThreadID));

outDMA(3, CMATRIX, i+1, (2*myThreadID)+1);

}

}

}
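
The work partition implied by the schedule above (4 active SPEs assumed): each SPE owns two block columns of C, A blocks enter at SPE 0 and are forwarded along the ring with transferDMA, and every SPE fetches its own B columns directly from memory with inDMA.

/* Column ownership in the hardcoded schedule:
 *   SPE t computes C block columns 2*t and 2*t+1,
 *   using B block columns 2*t and 2*t+1 (fetched via inDMA)
 *   and A blocks received from SPE t-1 (SPE 0 fetches A itself). */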


VII.II Test Data for Run Time System

/* testData

* Authors: Damith Karunaratne & Christopher Venantius

* Last Modified: March 23, 2007.

* Description: Contains the test results for our test cases for A x B = C,

* where A, B and C are matrices. All the tests use 8 x 8 block

* sizes with each block having 2 x 2 entries. Thus the matrices

* are of size 16 x 16 entries. All the tests utilize 4 SPUs.

* All computed results (matrix C) were verified using MATLAB.

* Copyright 2007 McMaster University. All rights reserved.

*/

/* Test Case: Matrix A = Identity Matrix

* Matrix B = Identity Matrix

*/

74433031171269: (30408943861): Matrix A:

74433031203199: (30408975723): 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433031302360: (30409074743): 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433031399285: (30409171531): 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433031496205: (30409268314): 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433031593130: (30409365102): 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433031690050: (30409461885): 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433031786975: (30409558673): 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433031883895: (30409655456): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433031980820: (30409752244): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433032077740: (30409849027): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00

74433032174665: (30409945815): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00

74433032271585: (30410042598): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00

74433032368510: (30410139386): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00

74433032465430: (30410236169): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00

74433032562355: (30410332957): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00

74433032659275: (30410429740): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00

74433032758349: (30410528667): Matrix B:

74433032766609: (30410536919): 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433032863533: (30410633706): 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433032960458: (30410730494): 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433033057378: (30410827277): 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433033154303: (30410924065): 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433033251223: (30411020848): 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433033348148: (30411117636): 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433033445068: (30411214419): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433033541993: (30411311207): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433033638913: (30411407990): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00

74433033735838: (30411504778): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00

74433033832758: (30411601561): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00

74433033929683: (30411698349): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00

74433034026603: (30411795132): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00

74433034123528: (30411891920): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00

74433034220448: (30411988703): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00

74433037049731: (30414930746): Matrix C:

74433037058236: (30414939240): 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433037156435: (30415037285): 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433037254635: (30415135331): 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433037352830: (30415233372): 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433037451030: (30415331418): 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433037549225: (30415429459): 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433037647425: (30415527505): 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433037745620: (30415625546): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433037843820: (30415723592): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

74433037942015: (30415821633): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00

74433038040215: (30415919679): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00

74433038138410: (30416017720): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00

74433038236610: (30416115766): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00

74433038334805: (30416213807): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00

74433038433005: (30416311853): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00

74433038531200: (30416409894): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00

/* Test Case: Matrix A = Identity Matrix

* Matrix B = Randomly Generated Matrix

*/

79416992792287: (32426392551): Matrix A:

79416992824236: (32426424432): 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

79416992923418: (32426523473): 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

79416993020343: (32426620261): 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

79416993117263: (32426717044): 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

79416993214188: (32426813832): 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00


79416993311108: (32426910615): 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

79416993408033: (32427007403): 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

79416993504953: (32427104186): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

79416993601878: (32427200974): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

79416993698798: (32427297757): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00

79416993795723: (32427394545): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00

79416993892643: (32427491328): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00

79416993989568: (32427588116): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00

79416994086488: (32427684899): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00

79416994183413: (32427781687): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00

79416994280333: (32427878470): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00

79416994379407: (32427977397): Matrix B:

79416994387682: (32427985664): 4.12 1.93 4.47 0.97 1.36 2.71 1.79 2.52 3.11 3.51 0.08 1.19 0.77 1.96 4.89 1.07

79416994486421: (32428084261): 3.84 3.91 1.64 3.76 2.34 3.08 4.67 4.49 0.69 2.97 0.67 3.94 0.64 0.53 2.51 4.11

79416994583946: (32428181649): 3.00 1.45 2.42 4.77 2.58 3.77 1.39 1.73 0.34 4.65 0.94 3.25 0.31 0.10 1.17 4.76

79416994681575: (32428279141): 3.12 2.57 1.43 3.78 1.96 4.37 3.96 4.50 2.58 0.42 4.36 1.71 2.24 0.31 4.42 4.17

79416994778863: (32428376292): 1.31 2.64 2.51 3.27 2.14 4.57 1.39 3.62 3.37 0.81 4.06 1.62 1.72 3.36 3.22 4.21

79416994875964: (32428473256): 1.84 3.73 2.60 0.19 4.56 3.53 3.14 1.73 2.16 4.31 1.12 4.38 4.69 2.88 2.15 4.53

79416994973074: (32428570229): 1.95 3.99 2.36 1.06 0.72 4.32 3.04 1.38 2.19 1.11 2.73 2.04 0.51 0.62 4.83 4.58

79416995070547: (32428667565): 3.35 4.46 4.66 4.51 3.14 2.12 3.85 1.51 0.92 1.35 0.83 4.44 2.43 3.73 3.35 1.88

79416995167816: (32428764697): 3.67 1.81 2.86 1.20 0.61 3.89 0.37 4.66 0.86 1.18 3.22 4.74 0.46 0.66 0.34 1.00

79416995265711: (32428862455): 1.44 1.14 0.75 3.59 0.80 3.65 0.26 2.56 3.91 3.59 3.13 3.72 2.55 0.38 2.26 4.02

79416995363302: (32428959909): 2.81 3.70 4.90 1.00 4.89 0.26 0.02 4.52 0.80 1.92 1.76 2.71 3.37 0.49 1.49 4.86

79416995461079: (32429057549): 0.25 0.77 4.36 0.61 4.27 0.35 2.91 0.88 4.47 4.02 2.84 2.22 2.60 3.71 2.83 4.30

79416995558616: (32429154949): 3.66 3.08 4.08 4.53 4.80 3.64 3.27 2.44 4.36 0.38 3.08 1.12 1.61 1.13 1.10 3.19

79416995655876: (32429252072): 0.17 3.66 4.28 4.07 4.43 4.82 0.80 4.07 3.18 1.22 3.43 1.55 0.36 3.10 2.50 4.76

79416995753102: (32429349161): 1.37 2.68 2.31 2.90 1.65 4.15 1.69 2.93 3.31 2.37 3.49 0.89 2.03 3.41 1.70 0.90

79416995850340: (32429446262): 3.52 0.56 4.63 2.21 2.13 0.02 4.08 1.15 2.36 1.49 3.05 0.20 3.30 3.12 2.98 3.07

79416998678907: (32432380296): Matrix C:

79416998687427: (32432388805): 4.12 1.93 4.47 0.97 1.36 2.71 1.79 2.52 3.11 3.51 0.08 1.19 0.77 1.96 4.89 1.07

79416998786166: (32432487389): 3.84 3.91 1.64 3.76 2.34 3.08 4.67 4.49 0.69 2.97 0.67 3.94 0.64 0.53 2.51 4.11

79416998884966: (32432586035): 3.00 1.45 2.42 4.77 2.58 3.77 1.39 1.73 0.34 4.65 0.94 3.25 0.31 0.10 1.17 4.76

79416998983870: (32432684785): 3.12 2.57 1.43 3.78 1.96 4.37 3.96 4.50 2.58 0.42 4.36 1.71 2.24 0.31 4.42 4.17

79416999082433: (32432783194): 1.31 2.64 2.51 3.27 2.14 4.57 1.39 3.62 3.37 0.81 4.06 1.62 1.72 3.36 3.22 4.21

79416999180809: (32432881416): 1.84 3.73 2.60 0.19 4.56 3.53 3.14 1.73 2.16 4.31 1.12 4.38 4.69 2.88 2.15 4.53

79416999279194: (32432979647): 1.95 3.99 2.36 1.06 0.72 4.32 3.04 1.38 2.19 1.11 2.73 2.04 0.51 0.62 4.83 4.58

79416999377942: (32433078241): 3.35 4.46 4.66 4.51 3.14 2.12 3.85 1.51 0.92 1.35 0.83 4.44 2.43 3.73 3.35 1.88

79416999476486: (32433176631): 3.67 1.81 2.86 1.20 0.61 3.89 0.37 4.66 0.86 1.18 3.22 4.74 0.46 0.66 0.34 1.00

79416999575656: (32433275647): 1.44 1.14 0.75 3.59 0.80 3.65 0.26 2.56 3.91 3.59 3.13 3.72 2.55 0.38 2.26 4.02

79416999674522: (32433374359): 2.81 3.70 4.90 1.00 4.89 0.26 0.02 4.52 0.80 1.92 1.76 2.71 3.37 0.49 1.49 4.86

79416999773574: (32433473257): 0.25 0.77 4.36 0.61 4.27 0.35 2.91 0.88 4.47 4.02 2.84 2.22 2.60 3.71 2.83 4.30

79416999872386: (32433571915): 3.66 3.08 4.08 4.53 4.80 3.64 3.27 2.44 4.36 0.38 3.08 1.12 1.61 1.13 1.10 3.19

79416999970921: (32433670296): 0.17 3.66 4.28 4.07 4.43 4.82 0.80 4.07 3.18 1.22 3.43 1.55 0.36 3.10 2.50 4.76

79417000069422: (32433768643): 1.37 2.68 2.31 2.90 1.65 4.15 1.69 2.93 3.31 2.37 3.49 0.89 2.03 3.41 1.70 0.90

79417000167935: (32433867002): 3.52 0.56 4.63 2.21 2.13 0.02 4.08 1.15 2.36 1.49 3.05 0.20 3.30 3.12 2.98 3.07

/* Test Case: Matrix A = Matrix of all 1's

* Matrix B = Matrix of all 1's

*/

82525242388312: (33681364980): Matrix A:

82525242420242: (33681396842): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525242519339: (33681495798): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525242616204: (33681592526): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525242713064: (33681689249): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525242809929: (33681785977): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525242906789: (33681882700): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525243003654: (33681979428): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525243100514: (33682076151): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525243197379: (33682172879): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525243294239: (33682269602): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525243391104: (33682366330): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525243487964: (33682463053): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525243584829: (33682559781): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525243681689: (33682656504): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525243778554: (33682753232): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525243875414: (33682849955): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525243974432: (33682948826): Matrix B:

82525243982692: (33682957078): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525244079552: (33683053801): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525244176417: (33683150529): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525244273277: (33683247252): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525244370142: (33683343980): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525244467002: (33683440703): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525244563867: (33683537431): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525244660727: (33683634154): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525244757592: (33683730882): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525244854452: (33683827605): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00


82525244951317: (33683924333): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525245048177: (33684021056): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525245145042: (33684117784): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525245241902: (33684214507): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525245338767: (33684311235): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525245435627: (33684407958): 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

82525248274190: (33687351558): Matrix C:

82525248286418: (33687363769): 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00

82525248408153: (33687485366): 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00

82525248529893: (33687606968): 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00

82525248651628: (33687728565): 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00

82525248773368: (33687850167): 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00

82525248895103: (33687971764): 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00

82525249016843: (33688093366): 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00

82525249138578: (33688214963): 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00

82525249260318: (33688336565): 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00

82525249382053: (33688458162): 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00

82525249503793: (33688579764): 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00

82525249625528: (33688701361): 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00

82525249747268: (33688822963): 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00

82525249869003: (33688944560): 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00

82525249990743: (33689066162): 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00

82525250112478: (33689187759): 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00

/* Test Case: Matrix A = Randomly generated matrix

* Matrix B = Randomly generated matrix

*/

84404128821532: (34445791704): Matrix A:

84404128856237: (34445826334): 8.32 3.90 9.03 1.96 2.75 5.48 3.61 5.08 6.29 7.10 0.16 2.40 1.55 3.97 9.89 2.16

84404128956091: (34445926047): 7.75 7.90 3.32 7.61 4.73 6.23 9.43 9.07 1.40 6.01 1.36 7.96 1.28 1.08 5.08 8.31

84404129053937: (34446023756): 6.07 2.93 4.89 9.63 5.21 7.62 2.80 3.49 0.69 9.40 1.90 6.57 0.64 0.20 2.36 9.61

84404129152150: (34446121832): 6.31 5.19 2.90 7.64 3.96 8.83 8.00 9.10 5.21 0.85 8.81 3.45 4.53 0.62 8.93 8.42

84404129250712: (34446220257): 2.64 5.34 5.07 6.61 4.33 9.23 2.81 7.31 6.81 1.64 8.21 3.27 3.47 6.80 6.51 8.50

84404129348432: (34446317840): 3.71 7.53 5.26 0.39 9.22 7.14 6.34 3.51 4.36 8.71 2.27 8.84 9.47 5.83 4.35 9.15

84404129446899: (34446416170): 3.94 8.07 4.78 2.14 1.46 8.72 6.13 2.78 4.43 2.24 5.51 4.12 1.02 1.25 9.75 9.26

84404129544911: (34446514045): 6.77 9.02 9.41 9.11 6.35 4.28 7.78 3.04 1.86 2.73 1.68 8.98 4.90 7.53 6.78 3.79

84404129642927: (34446611924): 7.42 3.65 5.79 2.42 1.24 7.86 0.74 9.41 1.74 2.38 6.50 9.58 0.93 1.34 0.69 2.03

84404129740951: (34446709811): 2.91 2.30 1.51 7.25 1.62 7.38 0.52 5.16 7.90 7.25 6.33 7.52 5.15 0.77 4.57 8.11

84404129838477: (34446807200): 5.68 7.48 9.90 2.02 9.88 0.54 0.04 9.14 1.62 3.88 3.56 5.47 6.81 0.99 3.01 9.82

84404129937172: (34446905758): 0.51 1.56 8.81 1.24 8.62 0.72 5.88 1.79 9.04 8.11 5.74 4.48 5.25 7.50 5.71 8.69

84404130035581: (34447004030): 7.40 6.23 8.25 9.16 9.70 7.36 6.60 4.92 8.80 0.76 6.23 2.27 3.25 2.29 2.21 6.45

84404130133811: (34447102123): 0.35 7.40 8.65 8.23 8.94 9.74 1.62 8.22 6.43 2.46 6.94 3.14 0.73 6.27 5.06 9.62

84404130232507: (34447200682): 2.77 5.41 4.67 5.87 3.33 8.39 3.41 5.92 6.69 4.78 7.05 1.81 4.10 6.89 3.44 1.83

84404130329770: (34447297808): 7.12 1.12 9.35 4.46 4.30 0.03 8.25 2.32 4.77 3.02 6.16 0.40 6.67 6.31 6.03 6.21

84404130429718: (34447397609): Matrix B:

84404130437976: (34447405859): 3.58 1.61 4.51 3.36 2.61 0.43 3.36 0.46 2.83 2.91 3.80 1.41 4.82 0.02 0.92 2.14

84404130535643: (34447503389): 3.63 0.99 3.20 1.26 1.28 4.30 0.55 1.77 3.27 1.42 1.62 0.93 4.05 1.62 4.70 4.50

84404130633127: (34447600736): 3.75 3.43 1.88 3.79 4.22 1.00 1.46 4.43 2.44 2.82 4.24 2.41 2.43 1.43 3.56 0.68

84404130730534: (34447698006): 0.59 3.36 4.62 4.49 3.89 2.69 4.46 4.28 0.80 1.34 2.27 4.16 0.88 3.35 2.96 2.41

84404130827836: (34447795171): 4.11 3.55 2.44 0.59 1.59 4.57 4.10 4.01 1.93 3.23 0.74 0.36 1.78 1.41 2.09 4.58

84404130925371: (34447892569): 0.87 1.09 0.68 1.77 4.45 3.05 2.43 1.64 2.98 1.27 0.53 3.17 1.62 0.45 2.86 1.30

84404131022807: (34447989868): 3.23 3.73 4.33 3.07 2.73 2.09 1.20 1.60 4.83 1.66 2.01 0.03 1.44 0.56 0.24 2.20

84404131120360: (34448087284): 2.39 0.77 2.54 1.02 4.07 1.93 3.57 3.13 4.40 0.67 3.84 3.79 4.24 3.53 4.83 3.47

84404131217639: (34448184426): 1.03 2.32 0.49 1.88 3.96 0.65 2.24 3.83 0.58 2.89 2.92 1.77 2.34 0.83 3.03 2.92

84404131315230: (34448281880): 4.24 0.46 1.48 3.22 0.25 0.26 3.39 2.17 2.84 2.60 1.49 4.35 2.99 2.58 1.14 4.07

84404131412810: (34448379323): 0.34 0.48 2.36 1.10 1.75 4.30 3.23 0.18 3.07 4.10 0.97 3.00 3.83 3.53 1.55 2.13

84404131510256: (34448476632): 4.53 0.83 4.05 1.43 1.69 3.99 1.26 3.81 1.51 1.08 0.54 3.31 0.98 1.97 1.13 1.89

84404131607709: (34448573948): 2.61 0.76 1.86 1.87 1.28 3.18 3.36 3.97 3.16 0.03 3.03 3.15 1.77 3.52 0.75 0.16

84404131705282: (34448671384): 2.72 0.07 1.50 3.61 2.71 4.51 3.42 1.53 2.61 4.14 2.54 1.96 3.93 3.32 0.31 3.36

84404131802780: (34448768745): 0.92 3.03 0.01 0.03 3.21 4.20 3.27 4.31 4.34 0.91 4.06 3.31 1.93 3.46 3.62 4.57

84404131900318: (34448866146): 3.43 2.78 1.50 1.28 0.89 1.67 3.20 1.53 0.77 2.47 4.43 0.94 4.26 2.68 1.14 4.54


84404134700519: (34451798626): Matrix C:

84404134710690: (34451808789): 197.72 145.29 150.80 161.99 209.21 165.46 198.85 207.65 209.06 153.72 207.23 181.01 205.73 139.72 181.71 212.28

84404134844183: (34451942160): 253.22 175.82 242.72 186.57 222.91 223.11 236.48 227.83 244.32 161.59 216.86 203.84 240.74 175.04 203.82 265.95

84404134977647: (34452075502): 210.11 148.41 190.52 171.84 178.71 171.80 216.91 195.84 170.98 149.98 173.29 191.25 192.84 150.03 158.78 216.67

84404135111083: (34452208816): 206.64 185.03 219.79 173.68 255.21 251.87 263.03 236.10 259.43 176.18 236.39 220.40 252.27 200.37 222.40 261.89

84404135244607: (34452342218): 198.51 160.45 184.43 173.40 244.80 251.28 253.22 230.16 226.45 189.90 220.90 215.51 247.46 201.16 213.55 254.45

84404135378055: (34452475544): 291.68 168.41 212.45 186.71 209.35 264.24 254.30 258.75 251.98 193.46 220.62 211.88 254.19 197.00 188.68 276.15

84404135511646: (34452609013): 184.03 151.19 158.40 135.06 197.21 207.26 188.35 185.77 209.49 149.60 192.51 166.76 209.78 149.66 184.00 226.83

84404135645238: (34452742483): 267.09 190.44 250.40 217.34 245.01 269.02 245.33 267.78 249.82 187.96 229.44 212.25 237.80 192.78 207.58 257.15

84404135778573: (34452875696): 170.85 91.25 167.91 129.71 178.95 163.71 166.32 155.70 172.85 127.19 144.85 173.31 185.07 126.17 157.31 155.93

84404135911710: (34453008711): 178.58 127.59 164.05 150.64 190.39 190.00 222.34 208.26 175.90 147.89 179.72 213.54 198.29 175.01 170.33 212.58

84404136045355: (34453142234): 249.78 153.59 191.31 147.58 181.99 208.78 223.81 230.21 203.16 162.96 218.72 178.20 238.79 178.29 196.60 236.32

84404136178867: (34453275624): 235.71 168.66 163.28 172.96 194.14 215.91 233.22 238.64 204.38 203.26 213.18 182.58 222.43 185.32 160.56 248.32

84404136312647: (34453409282): 228.39 200.95 229.38 200.92 257.00 235.33 256.26 248.09 223.60 203.67 224.52 193.16 244.16 170.55 217.60 251.14

84404136446077: (34453542590): 225.76 185.01 196.87 184.70 260.53 267.56 266.46 259.15 227.89 207.39 227.31 221.39 257.60 207.85 245.31 282.92

84404136579687: (34453676078): 173.36 125.53 164.32 166.78 212.38 208.92 215.22 196.30 204.66 166.60 175.31 194.31 208.89 166.86 180.81 207.04

84404136713027: (34453809296): 200.23 164.33 180.19 180.56 197.41 188.02 218.87 202.62 207.93 178.22 220.19 163.47 213.86 165.38 144.43 201.44

/* MATLAB computations of the same matrices

* MATLAB is more accurate with respect to rounding error

*/

197.69 145.28 150.87 162.07 209.16 165.46 198.82 207.64 209.10 153.78 207.27 180.87 205.62 139.74 181.53 212.24

253.26 175.81 242.82 186.71 223.00 223.16 236.53 227.81 244.49 161.75 216.90 203.87 240.63 175.05 203.73 265.98

210.14 148.39 190.59 171.92 178.73 171.82 216.92 195.79 171.07 150.06 173.34 191.28 192.78 150.10 158.70 216.67

206.56 184.97 219.89 173.75 255.27 251.82 262.99 236.11 259.52 176.27 236.40 220.37 252.08 200.32 222.36 261.84

198.39 160.36 184.49 173.37 244.82 251.31 253.20 230.11 226.45 190.00 220.93 215.45 247.32 201.14 213.46 254.41

291.74 168.42 212.55 186.74 209.45 264.31 254.34 258.77 252.06 193.65 220.70 211.91 254.14 197.04 188.59 276.23

184.01 151.13 158.50 135.14 197.22 207.27 188.38 185.76 209.56 149.72 192.58 166.76 209.70 149.70 183.97 226.85

267.11 190.44 250.44 217.37 245.10 269.13 245.38 267.76 249.90 188.06 229.46 212.25 237.62 192.76 207.52 257.11

170.93 91.26 167.99 129.83 179.05 163.75 166.34 155.70 172.97 127.30 144.98 173.32 185.01 126.32 157.27 155.99

178.45 127.51 164.09 150.67 190.40 189.94 222.23 208.22 175.90 147.92 179.73 213.47 198.14 174.99 170.26 212.53

249.88 153.60 191.36 147.58 182.12 208.82 223.87 230.19 203.27 163.13 218.83 178.25 238.75 178.30 196.49 236.37

235.69 168.60 163.34 172.89 194.13 215.96 233.20 238.64 204.39 203.37 213.19 182.54 222.43 185.32 160.40 248.34

228.35 200.92 229.40 200.90 256.93 235.41 256.22 248.03 223.57 203.77 224.53 193.07 244.01 170.48 217.50 251.09

225.75 184.99 197.00 184.71 260.60 267.68 266.53 259.14 227.95 207.56 227.44 221.43 257.54 207.89 245.26 282.95

173.32 125.52 164.40 166.80 212.40 209.02 215.23 196.37 204.68 166.74 175.38 194.25 208.80 166.89 180.79 207.06

200.18 164.27 180.22 180.50 197.38 188.02 218.87 202.60 207.99 178.30 220.18 163.41 213.79 165.37 144.32 201.43

/* Test Case: Matrix A = Matrix of randomly placed 1's

* Matrix B = Matrix of randomly placed 1's

*/

26420077585916: (10991401647): Matrix A:

26420077619750: (10991435411): 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00

26420077718920: (10991534440): 0.00 1.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00

26420077815829: (10991631212): 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00

26420077912741: (10991727987): 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 1.00

26420078009650: (10991824759): 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00

26420078106554: (10991921526): 1.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00

26420078203467: (10992018302): 1.00 0.00 0.00 1.00 1.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00

26420078300375: (10992115073): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 1.00

26420078397292: (10992211853): 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00

26420078494200: (10992308624): 1.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00

26420078591113: (10992405400): 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00 0.00

26420078688021: (10992502171): 1.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00

26420078784930: (10992598943): 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00

26420078881842: (10992695718): 0.00 0.00 0.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00

26420078978751: (10992792490): 1.00 0.00 1.00 0.00 0.00 1.00 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

26420079075659: (10992889261): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 1.00

26420079174725: (10992988180): Matrix B:

26420079182989: (10992996436): 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00

26420079279901: (10993093211): 0.00 0.00 1.00 0.00 0.00 0.00 0.00 1.00 1.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00

26420079376810: (10993189983): 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00

26420079473722: (10993286758): 0.00 0.00 0.00 1.00 0.00 1.00 1.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00


26420079570631: (10993383530): 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00 0.00

26420079667535: (10993480297): 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00

26420079764452: (10993577077): 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00

26420079861368: (10993673856): 0.00 0.00 1.00 0.00 0.00 1.00 1.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00

26420079958273: (10993770624): 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00

26420080055184: (10993867399): 0.00 0.00 0.00 1.00 1.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00

26420080152089: (10993964167): 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00

26420080249001: (10994060942): 0.00 0.00 0.00 1.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00

26420080345914: (10994157718): 0.00 1.00 1.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00 0.00

26420080442814: (10994254481): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00

26420080539731: (10994351261): 1.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00

26420080636643: (10994448036): 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 1.00 1.00 0.00 0.00 0.00

26420083451161: (10997366780): Matrix C:

26420083459666: (10997375274): 1.00 0.00 0.00 1.00 0.00 1.00 1.00 1.00 1.00 0.00 1.00 1.00 1.00 0.00 0.00 2.00

26420083557825: (10997473279): 1.00 0.00 2.00 1.00 1.00 1.00 1.00 4.00 2.00 2.00 1.00 2.00 2.00 1.00 0.00 3.00

26420083655973: (10997571273): 1.00 0.00 1.00 1.00 1.00 0.00 0.00 3.00 1.00 1.00 1.00 0.00 1.00 1.00 0.00 2.00

26420083754128: (10997669274): 1.00 2.00 0.00 0.00 0.00 0.00 0.00 2.00 1.00 2.00 0.00 2.00 1.00 1.00 0.00 3.00

26420083852296: (10997767288): 1.00 2.00 2.00 0.00 1.00 1.00 0.00 2.00 0.00 1.00 1.00 3.00 2.00 0.00 0.00 1.00

26420083950455: (10997865293): 0.00 3.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 2.00 0.00 1.00 1.00 1.00 1.00 1.00

26420084048619: (10997963303): 1.00 2.00 1.00 1.00 0.00 3.00 1.00 1.00 0.00 0.00 1.00 2.00 2.00 1.00 1.00 1.00

26420084146770: (10998061300): 0.00 1.00 1.00 1.00 2.00 0.00 0.00 3.00 0.00 1.00 0.00 2.00 2.00 1.00 0.00 1.00

26420084244934: (10998159310): 0.00 0.00 0.00 2.00 1.00 1.00 1.00 1.00 1.00 0.00 1.00 0.00 1.00 1.00 0.00 2.00

26420084343093: (10998257315): 0.00 2.00 1.00 1.00 1.00 2.00 0.00 1.00 1.00 1.00 0.00 2.00 2.00 2.00 1.00 0.00

26420084441245: (10998355313): 1.00 0.00 0.00 2.00 2.00 0.00 0.00 2.00 1.00 1.00 0.00 0.00 0.00 2.00 0.00 2.00

26420084539416: (10998453330): 0.00 1.00 1.00 1.00 0.00 3.00 2.00 1.00 2.00 1.00 1.00 2.00 2.00 0.00 1.00 2.00

26420084637568: (10998551328): 0.00 3.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 2.00 1.00 1.00 1.00 1.00 1.00 0.00

26420084735727: (10998649333): 1.00 0.00 1.00 1.00 0.00 2.00 1.00 1.00 1.00 0.00 1.00 2.00 2.00 0.00 0.00 2.00

26420084833887: (10998747339): 1.00 4.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 3.00 1.00 0.00 0.00 1.00 1.00 2.00

26420084932055: (10998845352): 0.00 0.00 0.00 1.00 1.00 0.00 0.00 1.00 2.00 1.00 0.00 1.00 1.00 1.00 0.00 1.00

/* MatLab computations of the same matrices

* MatLab is more accurate as far as rounding error is concerned

*/

1 0 0 1 0 1 1 1 1 0 1 1 1 0 0 2

1 0 2 1 1 1 1 4 2 2 1 2 2 1 0 3

1 0 1 1 1 0 0 3 1 1 1 0 1 1 0 2

1 2 0 0 0 0 0 2 1 2 0 2 1 1 0 3

1 2 2 0 1 1 0 2 0 1 1 3 2 0 0 1

0 3 0 0 0 1 0 1 0 2 0 1 1 1 1 1

1 2 1 1 0 3 1 1 0 0 1 2 2 1 1 1

0 1 1 1 2 0 0 3 0 1 0 2 2 1 0 1

0 0 0 2 1 1 1 1 1 0 1 0 1 1 0 2

0 2 1 1 1 2 0 1 1 1 0 2 2 2 1 0

1 0 0 2 2 0 0 2 1 1 0 0 0 2 0 2

0 1 1 1 0 3 2 1 2 1 1 2 2 0 1 2

0 3 0 0 0 1 0 1 0 2 1 1 1 1 1 0

1 0 1 1 0 2 1 1 1 0 1 2 2 0 0 2

1 4 0 0 0 1 0 1 0 3 1 0 0 1 1 2

0 0 0 1 1 0 0 1 2 1 0 1 1 1 0 1

/* Stress Test Case (1000 continuous iterations): Matrix A = Randomly generated matrix

* Matrix B = Randomly generated matrix

*/

The stress test failed due to instabilities in the STI CELL simulator.


VIII Appendix III: Internal Report

VIII.I Report on Matrix Algorithms REPORT: METHODS TO COMPUTE DENSE CABx OPERATIONS

Done By: Christopher Venantius: 13 October 2006

Summary: Outlines different approaches to compute CABx operations with dense matrices.

Sources:

Weisstein, Eric W. "Strassen Formulas." From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/StrassenFormulas.html

Foster, Ian. "Case Study: Matrix Multiplication." Designing and Building Parallel Programs. 2005. 13 October 2006

Information:

Basic algorithm, given A (m x n), B (n x s), C (m x s):

For i = 1 to m (for every row in A)
    For j = 1 to s (for every column in B)
        For k = 1 to n (for every combination)
            C(i,j) = C(i,j) + A(i,k)*B(k,j)
        end
    end
end

Algorithm is O(n^3) complexity (Foster)
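
As an illustration (not part of the original report's code), the loop above translates directly into C; the row-major 1D layout and the function name are assumptions of this sketch:

#include <stddef.h>

/* Naive O(m*n*s) dense multiply: C(m x s) = A(m x n) * B(n x s).
 * Matrices are assumed stored row-major in 1D arrays; names and layout
 * are illustrative, not the project's actual interface. */
void matmul_naive(const float *A, const float *B, float *C,
                  size_t m, size_t n, size_t s)
{
    for (size_t i = 0; i < m; i++) {          /* for every row in A    */
        for (size_t j = 0; j < s; j++) {      /* for every column in B */
            float acc = 0.0f;
            for (size_t k = 0; k < n; k++)    /* for every combination */
                acc += A[i*n + k] * B[k*s + j];
            C[i*s + j] = acc;
        }
    }
}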

Strassen Algorithm, given the same input as above (the other algorithm idea introduced by Foster):

Expand matrices A and B with zero rows and columns to make them t x t, where t is even and is based on the largest of (m, n, s); the answer matrix is redefined as t x t as well. Break matrices A, B and C into four sub-matrices of equal dimensions (t/2 x t/2):

A = [A11 A12]   B = [B11 B12]   C = [C11 C12]
    [A21 A22]       [B21 B22]       [C21 C22]

Then create the following Mi matrices for i = 1 to 7

M1 = (A11 + A22)(B11 + B22)

M2 = (A21 + A22)B11

M3 = A11(B12 - B22)

M4 = A22(B21 - B11)

M5 = (A11 + A12)B22

M6 = (A21 - A11)(B11 + B12)

M7 = (A12 - A22)(B21 + B22)

Note that the Strassen algorithm is now applied recursively to each Mi computation (because each involves a single multiplication) until we are left with single-entry multiplications.

Then, we can solve C as:

C11 = M1 + M4 - M5 + M7

C12 = M3 + M5

C21 = M2 + M4

C22 = M1 - M2 + M3 + M6

(Mathworld)
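
For concreteness, here is a C sketch of one level of the recursion (an illustration, not the report's code), written for 2x2 matrices of scalars so the formulas above can be checked by hand; in the real algorithm each entry would be a t/2 x t/2 block and each operation a block operation applied recursively:

/* One level of Strassen on 2x2 scalars: seven products M1..M7, then the
 * four C entries, exactly as in the formulas above. */
void strassen_2x2(const double a[2][2], const double b[2][2], double c[2][2])
{
    double m1 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1]);
    double m2 = (a[1][0] + a[1][1]) *  b[0][0];
    double m3 =  a[0][0] * (b[0][1] - b[1][1]);
    double m4 =  a[1][1] * (b[1][0] - b[0][0]);
    double m5 = (a[0][0] + a[0][1]) *  b[1][1];
    double m6 = (a[1][0] - a[0][0]) * (b[0][0] + b[0][1]);
    double m7 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1]);

    c[0][0] = m1 + m4 - m5 + m7;   /* C11 */
    c[0][1] = m3 + m5;             /* C12 */
    c[1][0] = m2 + m4;             /* C21 */
    c[1][1] = m1 - m2 + m3 + m6;   /* C22 */
}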

On the subject of complexity: the general algorithm uses O(n^(log2 8)) = O(n^3) multiplications. The algorithm above reduces the number of multiplications to O(n^(log2 7)) = O(n^2.807). It is argued that the cost of a matrix multiplication significantly outweighs the addition of 15 summation functions; therefore, the algorithm is faster than the general one. MathWorld states that this is the case only for matrices whose sizes are generally larger than the word size of the machine.

On the subject of stability: the algorithm is also noted to be numerically unstable. MathWorld describes its stability as weak, meaning the operation holds with respect to matrix norms, but if one considers the actual entries / values of the matrix, stability is not ensured.


Conclusion: Although the Strassen algorithm does appear to be a lot faster, the cost in numeric stability (of any amount) is too high for the purpose of the thesis. Our goal is to provide support for matrix multiplication with a focus on the efficiency of the task, HOWEVER, not at the sacrifice of accuracy. Additionally, the goal is to speed up matrix operations of larger size, where the numeric result would matter.


VIII.II Report on Data Structures and Corresponding Algorithms REPORT: DATA STRUCTURES

Done by: Christopher Venantius: Monday Oct 9th, 2006

NOTE: This places restrictions on the interface between hardware level and upper levels.

REFERENCES: Uszkay, Gordon. Personal Interview. 4 October 2006.

SUMMARY:

To go over the different data structures available to store the matrices and scalars through the more abstract layers of the implementation. The decisions were made to improve space complexity and time complexity for the computing algorithms, with the knowledge that the underlying system will utilize vectors and vector arithmetic.

OVERVIEW OF POSSIBLE DATA STRUCTURES:

Vector of Vectors: A vector-of-vectors structure will define a matrix where each vector represents a row or column of the matrix. The benefit of this structure is the smoother transition to an underlying architecture that supports vector arithmetic. The negative appears when attempting to use space-saving techniques to store upper triangular, lower triangular or symmetric matrices. The solution would be a vector of different-sized vectors (an awkward definition of the structure). Additionally, when defining symmetric matrices in such a way, 'recreating' the rows or columns of the matrix would involve jumping through multiple vectors.

One-dimensional array: A one-dimensional array stores the data either row by row or column by column. Since the dimensions and attributes of the matrix are stored in the type system, one can write algorithms that use the index of the array to traverse the matrix as normal, or to 'recreate' the rows and columns of a symmetric matrix. The benefit is that this allows an easier system for creating space-saving structures for the aforementioned matrices (Gordon, 4 Oct 2006). However, more complicated algorithms will have to be used to recreate the vectors or define the computation over the matrix. Additionally, we do not know if the time complexity of these algorithms would be much better than the vector-of-vectors approach. Also, the structure is a less smooth fit if we assume that, for simplicity, all operations on the underlying system will be done by vector arithmetic.

Conclusion: The vector-of-vectors approach is awkward when attempting to save memory, but is a smoother fit for the desired architecture. Therefore, whenever a matrix is not attempting to save space, this method will be utilized. Whenever a space-saving possibility exists, a one-dimensional array will be used instead. Algorithms will be created to recreate the vectors at the top level, allowing the underlying system to keep the same computation algorithms for addition and multiplication (the interface will still have a vector being passed). For an identity matrix, as a special case, nothing will be saved except the type of the second matrix, since that is all that is needed. NOTE: AN ENTIRELY ARRAY-BASED APPROACH MIGHT BE A BETTER POSSIBILITY BECAUSE IT ALLOWS THE GROUP TO APPLY MULTIPLE WAYS OF PARALLELISM (ROW, CHECKER, ETC.) WITHOUT DATA STRUCTURE CONSTRAINTS - ISSUE

Nate: Making use of a vector implementation does imply that we will be using C++ for our low-level coding, using the Vector datatype (I assume the CBE works on that datatype). This datatype allows us to store additional information and optimized functions for finding attributes of the data. The array method Chris proposed was to use the first 4 - 8 array entries, in a 1D array implementation, as the attributes of the matrix/row. I would like to request that the low-level programmers look into whether the vector implementation would increase memory complexity significantly, i.e. by a factor of n for O(n^2).
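
To make the 1D-array proposal concrete, the following hypothetical C sketch stores the attributes in the first few array entries and 'recreates' rows by index arithmetic; the 4-slot header, the field meanings and the function names are illustrative assumptions, not the project's interface:

/* Header layout (assumed): [0]=rows, [1]=cols, [2]=shape tag, [3]=reserved;
 * the matrix data follows row by row. */
#define HDR 4

/* dense access: normal row-major traversal past the header */
static inline float get_dense(const float *m, int cols, int i, int j)
{
    return m[HDR + i * cols + j];
}

/* symmetric access: only the lower triangle is stored, so a row is
 * 'recreated' by reflecting above-diagonal indices across the diagonal */
static inline float get_symmetric(const float *m, int i, int j)
{
    if (j > i) { int t = i; i = j; j = t; }   /* reflect index pair    */
    return m[HDR + i * (i + 1) / 2 + j];      /* packed lower triangle */
}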


VIII.III Report on a “Codegraph” to Illustrate Multiplication

REPORT ON BASIC DENSE DENSE MATRIX MULTIPLICATION CODEGRAPH

BY: Christopher Venantius

Introduction:

The purpose of this report is to outline a basic codegraph idea showing C = A.B where A and B

are dense matrices. The report is essentially used as a foundation to discuss optimizations that

can occur through examining the codegraph.

How the codegraph is created:

The codegraph is created by translating the basic algebraic computation of A.B using a block memory system. It assumes that the block sizes are stored in such a way that one does not have ill-conditioned operations with respect to mismatched dimensions. The example provided is specific to an A matrix of dimension 2x2 and a B matrix of dimension 2x3. However, the goal is to understand the process; therefore, one can generalize to matrices of any size. No optimizations are included with respect to minimizing DMA transfers; this is saved for a different report.

Question on associative Addition:

The question was posed whether one will see a difference in efficiency based on which summations of the partial vector dot product one does first, second, and so on. Below is an argument that concludes there would not be any change in efficiency:

Rationale:

A vector dot product is defined as follows: given vectors

x = [a1 a2 a3 ... an]

y = [b1 b2 b3 ... bn]

x*y = a1*b1 + a2*b2 +...+an*bn

We know that there is no efficiency change whether one does ai*bi or bi*ai, since a basic multiplication is being processed and the machine layer treats both as equal.

Let us now examine the summation of the multiplied results:

if all ai and bi are normal values (not zero or complex), then the order of summation does not matter.

if an ai or bi is zero, then we do not have to add that product, and one receives a large efficiency increase by skipping that accumulation. However, whether one skips it first, last or in the middle of the summation does not matter, since the same increase in efficiency is gained.

if an ai or bi is complex, then we treat the whole vector as a complex vector. Therefore, the problem becomes a summation of complex values and, as in the argument for the first case, the order of summation does not matter.

Conclusion: Therefore, we can say that with vector dot products the order of the summation does not change the efficiency of the program. Now, one can ask whether the above rationale can be extended to the order of summation of associated multiplied matrix blocks. I will provide an argument that it can, as follows:

Rationale:

Treat the blocks as an abstraction of a vector dot product. Therefore, we run into an analysis parallel to that of the vector dot product. First, if we have A11.B11 we cannot perform B11.A11, because matrix multiplication is not commutative; therefore, we can ignore this case.

Now if we examine the summation process:

if the A.B blocks are normal (not special matrices or complex), then the order of summation does not matter, since we are adding same-sized blocks.

if an A.B block is complex, then at least one of the A or B matrices is complex. Therefore, we have a summation of complex matrices, and the order does not matter because it follows from the first case.

If an A.B block is a specially defined matrix (diagonal, etc.), then we can gain efficiency when doing this summation. However, as with the case of a vector's ai or bi value being zero, whether we do this summation first, last or in the middle does not change the overall efficiency increase in the program.

This provides an argument for the question stated; therefore, the software solution will assume that efficiency is not hindered by associative addition, and will not explore the different ways of performing the summation.

Brief Explanation of the Codegraph Below:

The codegraph below only shows a portion or cut-out of the total graph. This is done to simplify the graph; however, the graph can still be examined, since two entire C block solutions are shown, so one can examine both downward and across / parallel flows.

The gray triangles represent inputs from the top level

The blue triangles represent outputs from the top level

The red triangles represent inputs found in the system

The green ovals represent mathematical computations

The purple ovals represent DMA transfers

The blue circles represent what is in memory (each LS corresponds to an SPU/processor)


VIII.IV Report on Minimizing DMAs through “Codegraph”

REPORT ON OPTIMIZING DENSE DENSE MATRIX MULTIPLICATION CODE

GRAPH

BY: Christopher Venantius

Introduction:

The purpose of this report is to show that there are two major ways one can optimize the dense-dense multiplication codegraph to increase overall efficiency. This takes only the codegraph structure into account. Additionally, the two ways shown below cannot be implemented together, for obvious reasons; therefore, it will be shown why one is better than the other.

Question on codegraph optimization:

The question is how we can examine the codegraph from the report on the dense-dense matrix multiplication codegraph with the end goal of increasing efficiency. First one has to ask where we can optimize, which leaves us with two areas (based on the hyper-edges of the graph):

Mathematical Operations: At this point in time, since we are not yet considering cases of special matrices, it is impossible to eliminate any of these hyper-edges. We need every result of the multiplied blocks, and no two blocks are multiplied together twice (the claim is based on the mathematical algorithm of matrix multiplication, and I am assuming the proof is not necessary). Therefore, we can argue that at this point we cannot increase efficiency here.

DMA transfers: This is the second hyper-edge and the other place we can look to increase efficiency. It is here that we can go two major, but actually three different, ways to make improvements. Let me explain: for every mathematical operation we require three pieces of data, x, y, z, referring to the operation x.y + z. Instead of having DMA transfers of all three inputs into the LS for every calculation, we propose an algorithm that prioritizes one of the three inputs, meaning we do not DMA that value out of the LS to another SPU, but rather continually DMA the other two inputs into the associated LS and continue processing. Prioritizing with x or y is essentially the same; therefore, for simplicity we only examine the x case.

How good is it? In both cases we eliminate one DMA transfer for every multiplication-add hyper-edge. Therefore, if we are dealing with a problem A.B, with A (m x n) and B (n x k) in blocks, we have n-1 associated multiplication-add operations to solve a C block of data. Therefore, for each C block of the solution, we eliminate (n-1) DMA transfers.
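
As a worked example of this saving (the numbers are assumed for illustration): with n = 4 blocks along the shared dimension and a 4 x 4 grid of C blocks (m = k = 4), each C block saves n-1 = 3 transfers, so 3 * 16 = 48 DMA transfers are eliminated in total.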

What's the difference? If both provide the same reduction, there has to be a rationale why one is better than the other. The claim we make is that prioritizing with z is the method of choice.

Logically Better: Prioritizing with z creates an algorithm in which an empty SPU will start calculating a solution block and continue with that row/column process until done. Therefore, it will perform the process the way one would logically perform the computation on paper. The second method prioritizes with the A blocks of data. Although this essentially reduces the same number of DMAs, it makes for a less comprehensible algorithm. The argument for keeping things simple can be made here.

A better reason: If prioritizing with z, the same DMA transfers are dependent on the same DMA transfers as before (no additional dependencies). However, if we prioritize with A or B blocks, we start creating more DMA dependencies when we move "horizontally" in the codegraph diagram on the same level. Therefore, we have added a new restriction on what can be processed first, and so on. The calculation or process to figure out which ones can be done first, second, and so on (on the same level) adds additional overhead and therefore reduces efficiency to a degree.

Brief Explanation of the Codegraph Below:

The codegraph below only shows a portion or cut-out of the total graph. This is done to simplify the graph; however, the graph can still be examined, since two entire C block solutions are shown, so one can examine both downward and across / parallel flows.

The gray triangles represent inputs from the top level

The blue triangles represent outputs from the top level

The red triangles represent inputs found in the system

The green ovals represent mathematical computations

The purple ovals represent DMA transfers

Bracketed numbers refer to DMA dependencies that are not directly related to the structure.

The blue circles represent what is in memory (each LS corresponds to an SPU/processor)

LSx and LSy refer to an SPU / LS that was allocated (through some means, not important right now) to these processes.


VIII.V Report on Methods of Parallelism REPORT: METHODS OF PARALLELISM - GENERAL METHODOLOGIES

Done By: Christopher Venantius: Mon, Oct 9th 2006

MUST BE LOOKED OVER: The Cell team is to ensure there are no restrictions, unknown at this time, in the hardware's ability to perform the functionality noted below. Therefore, this report must be updated at a later date.

References:

Henri E. Bal, Matthew Haines, "Approaches for Integrating Task and Data Parallelism,"

IEEE Concurrency: Parallel, Distributed and Mobile Computing, vol. 06, no. 3, pp. 74-84, Jul-Sept,

1998.

SUMMARY:

Refers to three main types of parallel processing methods. Below, I briefly describe each, and then conclude with the method employed in the thesis and the rationale.

INFORMATION:

METHODS OF PARALLELISM:

Task Parallelism: Breaking down a large task / goal into subgoals / subtasks that are independent of one another, then allowing the parallel processors to compute the subtasks individually. Issues can occur if subtasks have any dependencies or relations; careful mechanisms must be employed to prevent bad accesses or violations. Our concern here is with parts of the matrices used in a computation with the multiple SPE units, or with the solution matrix being defined by separate tasks.

Data Parallelism: Decomposing a large chunk of data that needs to be operated on by an independent operator. One way to think of this is fork/join routines. A problem exists here because data dependencies are usually impossible to get rid of; for example, in our case the composing of the solution matrix cannot occur until an entry is computed. A possible solution is to approach it like the consumer / producer model, the producers being the tasks that compute an entry and the consumer being the composition of the solution.

Pipeline Parallelism: Breaking a large task into smaller consecutive subtasks, one for each processor. Data will then flow from one task to the next (processor to processor) until done. The negatives are improper or unbalanced division leaving a processor idle, and, additionally, multiple transfers between processors.

Conclusion: For our thesis, all computations involve operations on matrices that result in a solution matrix. The operation on the matrix, being either addition or multiplication, can be seen as an independent operator over a row vector and a row vector when dealing with addition, and over a row vector and matrix B when dealing with multiplication (A*B = C). It makes sense to divide the task of computation with the data parallelism approach; this reduces the number of memory accesses, since the local store can hold the two vectors, or the vector and matrix. The computations have no dependencies, either in computation or in data. For the task of forming the solution matrix, I propose to use a form of pipeline parallelism, where one can think of one SPE for this second task and all other SPEs for the computation task. The rationale for this is that the C matrix can be kept in the local store the entire time until fully computed, and then sent back into memory (few memory accesses). The issue with this is that the computations might be done so quickly that there will be a queue waiting on this SPE to combine the results (decreasing efficiency). However, if we allow each SPE to do the combining task itself, we have many memory accesses and now a data dependency in the data parallelism. Additionally, a form of lock mechanism must be placed on access to C to ensure proper output, which leads to a similar queue wait for the combination process. The pipelining approach, although it does not remove the last noted issue, drastically decreases the memory accesses.


VIII.VI Haskell Implementation

module Matrix where

import List -- provides transpose, used in arrangeToCijClusters

--data for Block operation unassigned slots

data Bop a = FM a a a

| FMA a a a a

| FA a a a

deriving (Show)

data AssignedBop a b = AFM a a a b

| AFMA a a a a b

| AFA a a a b

deriving (Show)

--data for the operations an SPE needs to perform to compute an operation

data SPEcomputations a b = SPE a b

deriving (Show)

data Bdma slot block spe = FMDMA slot block spe --from memory DMA

| TMDMA slot block spe --to memory DMA

| FSPEDMA slot block spe spe --from SPE LS DMA

deriving (Show)

type SPE = (Int, Int, Int) --processor number, flag, counter

type Block = (String, Int, Int) --Matrix, i, j

type Slot = Int --location [1..15]

cijClusterSize = 2

--function that outputs a list of Bops to solve a Cij block for A.B = C

computeMultCij :: String->String->Int->Int->Int->[Bop Block]

computeMultCij a b size i j = (FM("T0",i,j) (a,i,0) (b,0,j))

: [FMA ("T"++show k,i,j) (a,i,k) (b,k,j) ("T"++show (k-1),i,j)

| k <- [1..(size-1)]]

--function that outputs a list of all Bops to solve C blocks for A.B = C

--A(n x size).B(size x m)

allComputeMult :: String->String->Int->Int->Int->[[[Bop Block]]]

allComputeMult a b n size m = [[computeMultCij a b size i j

| j <- [0..(m-1)]]

| i <- [0..(n-1)]]

--function that outputs a list (single element) of Bops to solve a Cij block for A+B

computeAddCij :: String->String->Int->Int->[Bop Block]

computeAddCij a b i j = [(FA ("T0", i, j) (a,i,j) (b,i,j))]

--function that outputs a list of all Bops to solve C blocks for A+B = C

allComputeAdd :: String->String->Int->Int->[[[Bop Block]]]

allComputeAdd a b n m = [[computeAddCij a b x y

| y <- [0..(m-1)]]

| x <- [0..(n-1)]]

--function that arranges the computations for C by cijClusterSize by cijClusterSize

arrangeToCijClusters :: [[t]]->[[t]]

arrangeToCijClusters = unpack . map transpose . pack

where

pack :: [[t]]->[[[[t]]]]

pack = split . map split

split :: [t]->[[t]]


split = chop cijClusterSize

unpack :: [[[[a]]]]->[[a]]

unpack = map concat . concat

--function that breaks a list in a series of lists by a given size

chop :: Int->[t]->[[t]]

chop n [] = []

chop n xs = take n xs : chop n (drop n xs)

--function that gathers all the Cij computations for a given SPE

--k is a counter starting at 1 when called for first SPE

computeSPE :: [[[Bop Block]]] -> Int -> Int -> Int -> [[[Bop Block]]]

computeSPE [] _ _ _ = []

computeSPE clusters speNum speTotal k = if k == speTotal then

if speNum == k then

(head clusters)

:computeSPE (tail clusters) speNum speTotal 1

else

computeSPE (tail clusters) speNum speTotal 1

else

if speNum == k then

(head clusters)

:computeSPE (tail clusters) speNum speTotal (k+1)

else

computeSPE (tail clusters) speNum speTotal (k+1)

--function to create a list of computes for each SPEs

allSPEcomputes clusters speTotal = [SPE (x,(-1),(-1)) (computeSPE clusters x speTotal 1)

| x <- [1..speTotal]]

--functions to determine type of Bop

isBopFM :: Bop a -> Bool

isBopFM (FM _ _ _) = True

isBopFM (_) = False

isBopFMA :: Bop a -> Bool

isBopFMA (FMA _ _ _ _) = True

isBopFMA (_) = False

isBopFA :: Bop a -> Bool

isBopFA (FA _ _ _) = True

isBopFA (_) = False


VIII.VII Report on Low Level Matrix Calculations 1

By: Damith Karunaratne

Introduction:

The purpose of this report is to compute A.B on the STI CELL at the low level using the Computation 1 approach. A and B are both dense matrices, where A is row-ordered and B is column-ordered. A.B represents matrix multiplication.

Computation 1:

The approach used to compute dense-dense matrix multiplication is defined as Computation 1. The method to calculate a row using Computation 1 requires the first row of matrix A as well as the entire matrix B to be loaded into registers. Afterwards, all the floating-point multiplications (fm) are computed and stored in temporary registers. Following this, if any floating-point multiply-adds (fma) are required, they are computed. Then the bytes in the registers are shuffled (shufb) to accommodate the next step, which is floating-point addition (fa) of the registers. The final operation required is to store the computed row. This method is then repeated for the rest of the rows in matrix A to compute the result matrix. Below is a diagram of this procedure for the first row of a (4,4) x (4,4) product:

Pipeline and Register Allocation:

Below is a table that shows the number of required instructions on Pipeline 0 (fm, fma, fa) and Pipeline 1 (load, store, shufb), the total number of calculations, and the number of registers required to compute dense-dense matrix multiplication:
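
As an illustration only, the data flow of Computation 1 can be modeled in plain C, with a float[4] standing in for each SPU register: the elementwise multiply models fm and the pairwise reduction models the shufb + fa steps. All names are assumptions of this sketch, not the project's SPU code.

typedef float reg[4];   /* one "register" of four floats */

/* model of the shufb + fa steps: horizontal sum of one register */
static float hsum(const reg r)
{
    return (r[0] + r[2]) + (r[1] + r[3]);
}

/* one row of C for a 4x4 * 4x4 product, A row-ordered, B column-ordered */
void compute1_row(const reg rowA, const reg Bcol[4], reg rowC)
{
    for (int j = 0; j < 4; j++) {
        reg t;
        for (int e = 0; e < 4; e++)        /* fm: elementwise products   */
            t[e] = rowA[e] * Bcol[j][e];
        rowC[j] = hsum(t);                 /* reduce to the entry C(i,j) */
    }
}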


VIII.VIII Report on Low Level Matrix Calculations 2 Report : Matrix Calculations 2

Done By : Adam Schulz

Summary: A method for computing the matrix multiplication A * B, where A and B meet the restrictions below.

Restrictions: Matrices must have dimensions that are multiples of 4 (the number of entries per register on the Cell). Matrices must have proper dimensions for multiplication: M x N * N x K. Matrices cannot be larger than 16 * 20, based on register limitations.

Information: The following will go step by step through the process of multiplying two matrices together. We will look at the specific case of a 4 x 8 * 8 x 4 product to make the process clear.

1) Matrix B is loaded into registers in row format.

see Figure 1

2) First 4 entries of Matrix A are loaded into a register.

see Figure 2

3) Copy the first entry of matrix A, A11, into a register (it will be in all 4 register slots).

see Figure 3

4) Using an fm instruction multiply A11 by B11 ... B1n. Store results in separate registers.

see Figure 4.

5) Copy the next entry of the A matrix over the previous one. The register with A11 now contains A12. This saves on register use.

see Figure 5

6) Use an fma to multiply A12 by B21 to B2n and add to the previous multiplication.

see Figure 6

7) Repeat steps 5 and 6 until each entry of the first row of A has been multiplied and added to the result. When you load a new A value into the register, move down one row in the B matrix. This will produce the first row of the result matrix. (If matrix A has more than 4 entries per row, a load will have to be done to bring A15 ... A18 over the current register with A11 ... A14; continue repeating the process.)

8) Store the completed result in memory.

9) Repeat from step 2 until each row of A has been multiplied. You will now have the result matrix C stored in memory.

A single matrix entry of the multiplication has the form A11B11 + A12B21 + ... + A1nBn1. This method computes a whole row of results simultaneously. See Figure 7 to verify that the process is correct for a single row of calculations. Knowing that the result for one row is correct, to calculate the other rows' results we simply load the next row of A and follow the same algorithm. Also, for larger matrices, since their dimensions are multiples of 4, we can simply add another set of calculations in a new register, computed simultaneously with the same calculations as the ones outlined here. A plain-C model of the procedure is sketched below, before the diagrams.
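
The following sketch models steps 1-9 for the 4 x 8 * 8 x 4 case in plain C; ordinary floats and loops stand in for the SPU registers and the splat, fm and fma instructions, so it illustrates the algorithm rather than the real register-level code (all names are assumptions):

/* one row of C = rowA (1x8) times B (8x4); B's rows model the registers
 * of Figure 1, rowC models the accumulator register */
void compute2_row(const float rowA[8], const float B[8][4], float rowC[4])
{
    for (int e = 0; e < 4; e++)
        rowC[e] = 0.0f;

    for (int k = 0; k < 8; k++) {       /* steps 5-7: walk across row A */
        float a = rowA[k];              /* step 3: splat A(1,k)         */
        for (int e = 0; e < 4; e++)     /* steps 4/6: fm / fma with the */
            rowC[e] += a * B[k][e];     /* k-th row of B                */
    }
    /* step 8: rowC now holds one full row of the 4x4 result */
}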

Diagrams

Figure 1


B11 B12 B13 B14

B21 B22 B23 B24

B31 B32 B33 B34

B41 B42 B43 B44

B51 B52 B53 B54

B61 B62 B63 B64

B71 B72 B73 B74

B81 B82 B83 B84

* Each row here represents a register. B11 .. B14 is stored in 1 register. *

Figure 2

A11 A12 A13 A14

*Matrix A will be stored in the LS and sections will be loaded into registers as needed. This represents 1 register.*

Figure 3

A11 A11 A11 A11

* This can be done using a Cell instruction. The purpose is to align multiplications within registers for easier computation.*

Figure 4

A11 A11 A11 A11

B11 B12 B13 B14

A11B11 A11B12 A11B13 A11B14

* The last row above is the result of multiplying the register containing A11 and the register with B11 ... B14. *

Figure 5

A12 A12 A12 A12

*This is the same register that contained A11 in the previous step (figure 3) *

Figure 6

A12 A12 A12 A12

B21 B22 B23 B24

A12B21 A12B22 A12B23 A12B24

*This is an intermediate step; the fma command computes it without using an extra register. It is shown here just for clarity.*


A11B11 + A12B21 A11B12 + A12B22 A11B13 + A12B23 A11B14 + A12B24

* This is the result of the fma instruction. *

Figure 7

*Shows the intermediate steps if we were not conserving registers: computing all the partial results and then adding them up.*

A11B11 A11B12 A11B13 A11B14

A12B21 A12B22 A12B23 A12B24

A13B31 A13B32 A13B33 A13B34

. . . .

. . . .

A1nBn1 A1nBn2 A1nBn3 A1nBn4

A11B11 + A12B21 + A13B31 + ... + A1nBn1

A11B12 + A12B22 + A13B32 + ... + A1nBn2

A11B13 + A12B23 + A13B33 + ... + A1nBn3

A11B14 + A12B24 + A13B34 + ... + A1nBn4


VIII.IX Tips for Using the STI Cell REPORT: Tips and Guidelines for Maximizing Efficiency on the Cell

Done By: Adam Schulz

References:

Maximizing the power of the Cell Broadband Engine processor: 25 tips to optimal

application performance Daniel A. Brokenshire, 27 Jun 2006

http://www-128.ibm.com/developerworks/power/library/pa-celltips1/

Summary: The article presents both implementation-level and theoretical concepts and ideas for creating efficient programs on the Cell. The article focuses on four individual aspects of the Cell. First is the general system design and taking advantage of the computational SPEs in terms of load balancing and data transfer. There is a brief summary of multi-threading on the PPE, in terms of when it is effective, as well as of managing cache transfers on the PPE. The bulk of the information presented is on the memory sub-systems and SPE programming practices.

Information: The following describes general pitfalls and concepts presented in the article and briefly explains some solutions. For a more detailed look, check the above link; the article is well laid out and it is easy to find the information. For efficiency, all data transfers within the SPE should be initiated by the programmers, so they understand the accesses being used. Memory transfers initiated by the SPE should come from system memory instead of the PPE's L2 cache: MFC transfers from system memory have high bandwidth and moderate latency, whereas transfers from the L2 have moderate bandwidth and low latency. Also, EIB (Element Interconnect Bus) overhead can be minimized by transfer sizes of 128 bytes. Data transfer between the PPE and an SPE should be done by having the SPE "pull" data from the PPE, using the SPE to initiate the DMAs. The reason for this is that there are 8 SPEs and one PPE; also, data retrieval is faster for the SPE than for the PPE.

All data transfers done within the SPE should be in quad-word format. It is more efficient to change a scalar (sub-quad) into a quad word before loads/stores are issued than to have the computer load/store a scalar. One other solution is to group scalars together into quad-word size and load/store them that way. This, however, requires retrieval of the information after the store; it is nonetheless still more efficient than storing scalars.

When programming on the Cell in a high-level language such as C/C++, the compiler must translate the code into machine language, and these translations are not always done in the most efficient way. Since programming in assembly can be very difficult for large programs, there are intrinsics, which are inline assembly with C function call syntax. "They provide the programmer explicit control of the instructions used, but (unlike assembly) eliminate many of the optimization tasks that compilers are good at. These include register coloring, instruction scheduling, data loads and stores, looping and branching, and literal vector construction." (taken directly from the article) The intrinsics are used closely with the pipelining available on the Cell, and choosing your instructions carefully to balance the load on both pipelines can increase efficiency.

Other practical solutions for increasing efficiency include loop unrolling, which allows a loop to be broken into multiple instructions that have no dependencies on each other and can therefore be processed simultaneously, increasing efficiency. Overlapping data movement with computation is possible with the multiple pipelines and again allows for things to be done simultaneously. Finally, there is reducing the number of branch predictions: these become a problem with mispredictions, which cause stalls that decrease efficiency. One solution around this problem is the idea of exploiting the select-bits instruction, which is similar in spirit to the unrolling of loops; a sketch follows.
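
As a sketch of the select-bits idea (plain C standing in for the Cell's selb instruction; the function names are illustrative assumptions):

/* branchy version: a misprediction here stalls the SPU */
float select_branchy(float a, float b, int cond)
{
    if (cond)
        return a;
    return b;
}

/* branchless version: compute both sides, then pick one with an
 * all-ones / all-zeros mask, the way selb does on the Cell */
float select_branchless(float a, float b, int cond)
{
    unsigned mask = (unsigned)-(cond != 0);          /* 0xFFFFFFFF or 0 */
    union { float f; unsigned u; } ua = { a }, ub = { b }, r;
    r.u = (ua.u & mask) | (ub.u & ~mask);
    return r.f;
}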

Conclusion: There were many good ideas and tricks for increasing efficiency on the Cell. Being able to design our implementation with these in mind will improve the efficiency of our calculations. Also, this article introduced a few techniques that will have to be looked into further and fully understood in order to be used effectively. Things to look at include algorithms for unrolling loops, the select-bits instruction for breaking down if statements, pipeline instructions and functionality, and the intrinsics instruction set.


VIII.X Report on translating Codegraph to STI Cell (idea)

REPORT ON TRANSLATING THE CODEGRAPH INTO JOBS FOR THE STI CELL

BY: Christopher Venantius

Introduction:

The purpose of this report is to explain how one gets from the codegraph to a series of jobs to be performed on the underlying hardware. The report will argue that a basic greedy algorithm can be applied that will complete the process, and it will make a claim of the algorithm's being efficient. Secondly, the report will examine the issue of DMA or other hyper-edge latencies causing stalls in the computation, and how to counter these.

The Stack / One-Dimensional Array of Jobs:

Before one can begin to assign the jobs, there needs to be a way to get not only what needs to be done, but that information together with the order and dependencies. Therefore, what is being proposed is a function that will step through the codegraph and create this data.

How to traverse?

All we need right now is a listing of jobs in an order respecting dependencies. Therefore, if we traverse the graph, viewed as a tree, left-most depth-first, we obtain the stack shown in Figure 1 on page 4. The back arrows in the stack show dependencies. This is important information one needs to record in this structure to enable the greedy algorithm to work (explained in the next section).

Assigning Jobs from the Stack (GREEDY ALGORITHM):

The algorithm that translates the stack into an assignment of jobs to processors is detailed in very rough pseudocode on the next page. It places an emphasis on a processor continuing to process a row/column completely before moving on to another job; it therefore works with the report on increasing efficiency by removing DMAs from the codegraph. Each loop of the overall function can be seen as a cycle of the processors, or a step forward in the latency of the processes. A minimal sketch of one such cycle is given below.
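
The following C sketch shows one cycle of the greedy idea, assuming hypothetical Job and stack types built from the codegraph; the issuing of DMAs and the row/column affinity described above are omitted for brevity:

typedef struct Job Job;
struct Job {
    Job *deps[4];                  /* back arrows recorded in the stack */
    int  ndeps;
    int  assigned, done;
};

static int ready(const Job *j)     /* all dependencies computed? */
{
    for (int i = 0; i < j->ndeps; i++)
        if (!j->deps[i]->done)
            return 0;
    return 1;
}

/* one cycle: every idle processor takes the first unassigned job on the
 * stack whose dependencies are satisfied (current[p] == -1 means idle) */
void assign_step(Job *stack[], int njobs, int current[], int nproc)
{
    for (int p = 0; p < nproc; p++) {
        if (current[p] >= 0)
            continue;                          /* processor p is busy */
        for (int s = 0; s < njobs; s++) {
            if (!stack[s]->assigned && ready(stack[s])) {
                stack[s]->assigned = 1;        /* hand job s to SPU p */
                current[p] = s;
                break;
            }
        }
    }
}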


Why Is This An Acceptable Implementation of Assignment:

Without doing a formal proof and comparing other algorithms / methods of doing the assignment (not in the scope of this thesis), I am going to argue why this method is acceptable. First, the algorithm works with our more efficient codegraph; therefore, we eliminate some DMA transfers to increase efficiency. Second, the algorithm only assigns DMAs when the processor is ready for that computation, which is what we want to happen. Additionally, it assigns the DMAs as soon as the processor is available, though this is at the end of the DMA transfer list; since it is impossible to parallelize the DMA transfers, this is the best we can do. Finally, it looks to run as many row-column sequences of operations at the same time as possible, by first searching for all the idle processors and then assigning as many new series of tasks as possible. The figure on page 5 shows the idea of assignment.

Latency Stalls:

There is the possibility of our DMA transfer times taking too long, or our hyper-edges


that involve computation being too complicated, such that their combined latency causes extra moments of idleness. The figure on page 6 shows an issue of DMA latencies causing extra stalls. This occurs when the computation is done before the DMA list is ready to accept the new DMAs that set up the next computation. However, this is not really an issue, because it only occurs when the DMA list is busy processing other DMAs for other processors. Therefore, the SPU in the worst case has to wait a latency of (number of processors - 1) * (latency of a DMA) * 2. If we attempt to fix this by reordering the DMA list so that this processor has its DMAs done earlier, we would be forcing another processor to wait even longer than the worst-case latency, since it is being bumped back in the line. To minimize this problem we can increase the latency of our other hyper-edges (more computation), so that the DMAs are not as long with respect to the computations. However, this is only beneficial if we reduce the number of computational hyper-edges; if not, we are only increasing the overall time of the computation. There is also the possibility of our computation times taking too long while our DMA transfer list is empty and waiting. This would only occur if all remaining DMAs are dependent on the computations that are being processed; therefore, the list cannot start a new one until one of the processors is done. The figure on page 7 illustrates this problem. The problem is not a bad one, considering that we are then in the case of all processors being active, which is what we want for large matrices. Additionally, we do not want to start DMA transfers until the processor is ready to handle the data. Therefore, I do not see having the DMA processing idle while all processors are active as a drop in efficiency.


VIII.XI Report on STI Cell DMA, MFC and Memory

Report: STI Cell DMA, MFC and Memory Management

Nathan Cumpson
0367642
November 28, 2006

Sources:

"Course Code: L3T2H1-39." IBM. 26 June 2006. 25 Nov. 2006 <http://www.power.org/resources/devcorner/cellcorner/cellworkshop0606/Day1 09-1 CourseCode L3T2H1-56 DevelopingCodeForCell-DMA.pdf>.

"Course Code: L2T1H1." IBM. 12 May 2006. 24 Nov. 2006 <http://publib.boulder.ibm.com/infocenter/ieduasst/stgv1r0/topic/com.ibm.iea.cbe/cbe/1.0/Introduction/L2T1H1 11 CellSoftwareModel.pdf>.

1 Cell's Primary Communication Mechanism

2 Memory Flow Control

The SPU can execute MFC commands by using channel instructions. The PPE and other devices can issue MFC commands by using MMIO. An MFC command that accesses memory is a DMA command. When DMA transfers are made, a 5-bit tag is appended to the data to signal different options. Tagging DMA transfers is optional, but is useful when using barriers or grouping DMAs.

3 Direct Memory Access


3.1 DMA Transfers

• transfer sizes for DMA have a maximum of 16KB.

• each transfer must be 1, 2, 4, 8 or n × 16 bytes.

• with single-precision floating-point numbers, each number is 4 bytes. For matrices of floating-point numbers, the maximum square matrix size is therefore 64 × 64. (Square matrices are used to break up larger matrices into a cluster of smaller matrices; this is covered in another report of ours.)

• memory access uses GETL = stanza scatter/gather = distributed in global and packed in local.
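
A small C check of the size rule in the first two bullets (the function name is an illustrative assumption); note that 64 × 64 single-precision floats exactly fill one maximum transfer:

#include <assert.h>
#include <stdbool.h>

/* legal DMA sizes: 1, 2, 4 or 8 bytes, or a multiple of 16 up to 16KB */
static bool dma_size_ok(unsigned bytes)
{
    if (bytes == 1 || bytes == 2 || bytes == 4 || bytes == 8)
        return true;
    return bytes != 0 && bytes % 16 == 0 && bytes <= 16 * 1024;
}

int main(void)
{
    assert(dma_size_ok(64 * 64 * 4));    /* 16384 bytes: the maximum */
    assert(!dma_size_ok(65 * 64 * 4));   /* one block row too many   */
    return 0;
}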

3.2 Cell Programming Models

3.2.1 PPE Programming Models

The PPE is programmed traditionally and acts as an OS or hypervisor. The OS provides services, such as I/O, to the SPEs and threads. The PPE establishes a run-time environment for the SPEs by handling exceptions, memory management and other functions.

3.2.2 Small Single SPE Models

These models are good for small tasks that can fit into a 256KB LS, including both code and data. They use LSAs (Local Store Addresses) for accessing memory, letting the MFC fetch the EAs (Effective Addresses).

3.2.3 Large Single SPE Models

Large Single SPE Models run code and data in both the LS and global memory. EA and LSA memory is used for this, with different models being used for communication between the XDRAM and the LS via the Memory Controller, DMA and MFCs.

Streaming: This method does a DMA fetch from system memory into the LS; the data is then processed by the SPE and finally written back by DMA to system memory.

Automatic Software-Managed Data Cache: This approach allows the data transfers to be controlled by the software cache framework libraries. This has the LS act much like the cache in a MIPS architecture that we would be more familiar with.

Plug-in: A manual system using the Plug-in framework library. This has manual loading of plug-ins into the code buffer.

Job Queue: The job queue model packs code and data together into a DMA transfer to be executed. An SPE kernel controls the DMA transfers from memory.

Hiding DMA latency: Ensuring that the DMA latencies are hidden is a key factor in improving overall SPE run-time; any stall on the SPE slows down the overall performance. A method for hiding this is double buffering: use a memory buffer for both input and output for each function (operation) of an SPE. A sketch is given below.
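
A minimal double-buffering sketch, assuming the Cell SDK's spu_mfcio.h interface (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all); CHUNK, process() and stream() are placeholders, not the project's code. While buffer b is being computed on, the next chunk is already in flight into buffer 1-b:

#include <spu_mfcio.h>

#define CHUNK 4096
volatile float buf[2][CHUNK / sizeof(float)] __attribute__((aligned(128)));

extern void process(volatile float *data);   /* placeholder computation */

void stream(unsigned long long ea_base, int nchunks)
{
    int b = 0;
    mfc_get(buf[b], ea_base, CHUNK, b, 0, 0);          /* prime buffer 0 */
    for (int i = 0; i < nchunks; i++) {
        if (i + 1 < nchunks)                           /* start next DMA */
            mfc_get(buf[1 - b], ea_base + (i + 1) * CHUNK,
                    CHUNK, 1 - b, 0, 0);
        mfc_write_tag_mask(1 << b);                    /* wait for ours  */
        mfc_read_tag_status_all();
        process(buf[b]);                               /* overlaps DMA   */
        b = 1 - b;
    }
}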

CESOF: The Cell Embedded SPE Object Format is used for transferring objects from global memory to the LS, where the object structure is used to construct the structure of the object in the LS.

3.2.4 Parallel Programming Model

Data in shared memory must have proper locking mechanisms for large SPE programs in the effective address space. Data is accessed through its address and is random-access by nature.

Streaming: An array of inputs is distributed among the SPEs. The shared job queue must be locked by an SPE to obtain its next job, so that no other SPE can obtain the same job. The job queue in this case is special because it will contain only data being transferred to the SPE. The output works the same as the input, except that it transfers the data from the LS to the global address space after the SPE has finished processing. Jobs are balanced among the SPEs for uneven workloads.

Job Queue: The PPE maintains a job queue and schedules the jobs to each of the SPEs. Each SPE has its own kernel that is responsible for fetching a job, executing the code and synchronizing its actions with the PPE.

Self-multitasking of SPEs: The kernel and scheduling are distributed across the SPEs. The SPEs each act like threads in a conventional operating system, where they are synchronized using mutexes or semaphores. The SPEs have a queue that stores a list of ready-to-run tasks, much like a semaphore lock would store threads that are ready to run.

Message Passing: LS-to-LS DMA transfers are optimized for data streaming through the pipeline model. Data access for this is sequential in nature, and the message connection is still built on the shared memory.

Pipelining: This method uses LS-to-LS DMA transfers to share data, which uses only the LS-to-LS DMA bandwidth and not the DMA memory access bandwidth. Balancing the loads is more difficult with pipelining.

3.3 Typical Development Flow

• Algorithm study

• Data layout/locality and data flow analysis

• Experimental partitioning and mapping of the algorithm and program structure to the architecture

• Develop PPE control, PPE scalar code

• Develop PPE control, partitioned SPE scalar code

– Communication, synchronization, latency handling

• Transform SPE scalar code to SPE SIMD code

• Re-balance the computation / data movement

• Other optimization considerations

– PPE SIMD, system bottlenecks, load balance

3.4 DMA Channels

An SPE has multiple channels that it uses for a DMA command. To have a DMA command execute, the SPU writes the command's parameters to the MFC channels using the wrch instruction, with the following channel names:

1. Write the LS address to the MFC LSA channel.

2. Write the EA-High (EAH) to the MFC EAH channel.

3. Write the EA-Low (EAL) to the MFC EAL channel.

4. Write the transfer size to the MFC Size channel.

5. Write the tag ID to the MFC TagID channel.

6. Write the class ID and command opcode to the MFC Cmd channel.

Figure 1: Taken from L3T2H1-56
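
For reference, the SDK's mfc_get convenience macro performs essentially this six-step sequence for a GET command; a hand-rolled version, assuming spu_mfcio.h's channel names and the MFC_GET_CMD opcode (both assumptions of this sketch), would look roughly like:

#include <spu_mfcio.h>

void dma_get(volatile void *ls, unsigned long long ea,
             unsigned size, unsigned tag)
{
    spu_writech(MFC_LSA,   (unsigned)ls);            /* 1. LS address        */
    spu_writech(MFC_EAH,   (unsigned)(ea >> 32));    /* 2. EA high word      */
    spu_writech(MFC_EAL,   (unsigned)ea);            /* 3. EA low word       */
    spu_writech(MFC_Size,  size);                    /* 4. transfer size     */
    spu_writech(MFC_TagID, tag);                     /* 5. tag ID            */
    spu_writech(MFC_Cmd,   MFC_GET_CMD);             /* 6. class (0) + GET   */
}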

Conclusion

This report offered research that covers how the DMA works with the SPE, PPE and other devices on the EIB. It failed to find the exact figures I was searching for: the time to transfer data Memory-to-SPE, SPE-to-SPE, PPE-to-SPE, PPE-to-IO Device and SPE-to-IO Device. Even though that goal wasn't achieved, many other goals that will be necessary in the very near future for planning how the algorithms will be engineered for the Cell were achieved. Additionally, this report has led me to a new path to follow toward my initial goal of determining the data transfer time of a single 16KB DMA transfer. I end this report with a simple quote that may give some foresight on research I will be doing. This quote is by IBM Senior Engineer David Krolak, the EIB lead designer:

A ring can start a new op every three cycles. Each transfer always takes eight beats. That was one of the simplifications we made; it's optimized for streaming a lot of data. If you do small ops, it doesn't work quite as well. If you think of eight-car trains running around this track, as long as the trains aren't running into each other, they can coexist on the track.


VIII.XII Real-Time System Overview 1.1

Real-Time System Overview 1.1

By: Christopher Venantius and Damith Karunaratne

Date: January 10 / 2007

REPORT SUMMARY:

To provide the reader with an understanding of the C code used to model the real time system for a 2006 /

07 McMaster University computer science thesis: PARALLELIZING MATRIX MULTIPLICATION ON THE STI

CELL. The report will begin by introducing the general objectives of the system, an overall schematic,

and then traverse each code module detailing functionality and how it satisfies the requirements.

Fundamental knowledge of the C programming language and the general ideas of the thesis is required for

the understanding of this report. For background knowledge of the thesis see: DESIGN REPORT 1.0:

4ZP6 PROJECT

OBJECTIVES

The real time system for the project essentially has to carry out the task of multiplying the matrices by interpreting provided instructions that perform the appropriate operations. The following is an outline of the objectives that must be satisfied:

Functionality for memory to local store transfers

The system must provide a means to transfer data from memory to an appropriate SPE's local store (LS). The data for computation is stored in memory directly accessible to the PPU; therefore, the SPE's processor must gain access to this data before it can perform any computation. The transfer is a DMA routine from main memory to an SPE's LS. The real time system handles this by adding a DMA call that places the request for transfer on the appropriate SPE's DMA channel. Memory protection and synchronization of the process are discussed in later objectives. For now, it is important to note that the real time system must provide a means to initiate the transfer of data from memory to an SPE's LS.

Functionality for local store to memory transfers

The system must provide a means to transfer data from an appropriate SPE‘s LS to memory. This

direction of transfer is required to store the computed results of an SPE‘s computation. The real

time system must take this into account and allocate main memory to store the results, whether

they be partial or complete solutions. As in the case of memory to LS transfers, the issues of

synchronization and protection will be discussed in a later objective.
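As a hedged sketch of what these two transfer directions look like on the SPE side (using the SDK's mfc_get/mfc_put macros; the buffer name, tag value, and the choice to block immediately on the tag are illustrative simplifications of the real time system's DMA wrapper):

#include <spu_mfcio.h>

#define TAG 1   /* hypothetical tag group for these transfers */

/* LS buffer for one 64 x 64 block of floats, aligned for DMA */
volatile float block[64 * 64] __attribute__((aligned(128)));

void fetch_compute_store(unsigned long long ea_in, unsigned long long ea_out)
{
    /* memory -> LS: pull one block from main memory into the LS buffer */
    mfc_get(block, ea_in, sizeof(block), TAG, 0, 0);
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();      /* wait until the get completes */

    /* ... compute on the block here ... */

    /* LS -> memory: push the (partial or complete) result back */
    mfc_put(block, ea_out, sizeof(block), TAG, 0, 0);
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();      /* wait until the put completes */
}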


Functionality for local store to local store data transfers

The system must provide a means to transfer data from an SPE's LS to another SPE's LS. Under the algorithm chosen by the thesis group, different SPEs will be utilizing the same data in order to complete partial computations. In addition, accessing main memory is considerably slower than accessing an SPE's LS. Therefore, it is reasonable to include the possibility of transferring data between two different SPEs' LS. This allows a potential reduction in the time to compute the final results by eliminating main memory accesses.
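A hedged sketch of how such a transfer can be arranged: libspe2 can map an SPE's LS into the effective address space on the PPE, and a peer SPE can then DMA against that address directly. The helper name and offset handling below are hypothetical; how the addresses are distributed to the SPEs is left to the real time system.

#include <libspe2.h>

/* PPE-side sketch (hypothetical helper): obtain the effective address of
   an SPE's local store so peers can use it as a DMA source or target. */
static unsigned long long ls_effective_addr(spe_context_ptr_t spe)
{
    void *ls = spe_ls_area_get(spe);   /* NULL on failure */
    return (unsigned long long)(unsigned long) ls;
}

/* SPE side: a peer's buffer is then reachable with an ordinary DMA,
 *     mfc_get(local_buf, peer_ls_ea + buf_offset, 16384, TAG, 0, 0);
 * which travels LS-to-LS over the EIB instead of through main memory. */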

Protection of LS memory

The real time system must provide a way to organize the temporary storing of data for

computations. The method chosen by the thesis group is to create a data buffer and a

solution buffer. The buffers act as a memory map of allotted memory in an SPE’s LS for

the given task. The data buffer will be an array of buffers that can be viewed as buckets

that hold transferred data for computations. The solution buffers will be an array of

buffers that can be viewed as buckets that hold partial solutions, before the computed

results are transferred back to memory. The size of the data buffers and the solution

buffers, for convenience, will correspond to the size of a block in a matrix. The size of a

block has been decided by the thesis group to be 64 x 64 floating point entries. It is important to note that the size of the data buffers and solution buffers is a physical limitation; therefore, it affects how much data can be transferred at a given time for a computation.
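A minimal sketch of this layout, assuming the 64 x 64 block size and the extra tag row described under the next objective (the names and bucket counts are hypothetical):

/* One LS "bucket": a 64 x 64 block of single-precision floats plus one
   block-row reserved for the tag. 128-byte alignment keeps DMA transfers
   on cache-line boundaries. */
typedef struct {
    float data[64][64];   /* one matrix block (16KB)                  */
    float tag[64];        /* tag row: source matrix + block location  */
} __attribute__((aligned(128))) block_buffer_t;

#define NUM_DATA_BUFS 4   /* hypothetical bucket counts */
#define NUM_SOLN_BUFS 2
static block_buffer_t data_buf[NUM_DATA_BUFS];
static block_buffer_t soln_buf[NUM_SOLN_BUFS];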

Verification of Data / Synchronization

As of now, the report has outlined the various ways data must be transferred in the system.

However, what has not been explained is any form of synchronization that informs the

system that the data has arrived, allowing it to continue processing. Synchronization is handled through a lockless mechanism: specifically, every piece of data transferred in the

system is transferred with a tag. The tag is a piece of information that is added to the data

block. Therefore, each buffer is extended from 64 x 64 floating point entries, to include the

storing of a tag. For now due to alignment issues, this corresponds to the row size of a

matrix block, which is 64 floating point entries. The tag, in the case of doing matrix

multiplication, will contain information of which matrix the block of data originates from,

as well as the block’s location within the matrix. Therefore, before any form of

manipulation of the data buffer, a verification that it is the correct data is completed. If unsuccessful, the real time system works on the assumption that the DMA transferring the data into the buffer has not yet been processed; that particular SPE will therefore spin and continue to check until the data matches. This runs the risk of the overall system being stuck in verification steps if the data never arrives. However, this should never happen, because the algorithm and the scheduler ensure a proper ordering. To avoid a catastrophic situation, the system will wait a reasonable, fixed number of cycles for the data before exiting on an error, instead of waiting forever.
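The verification step could look like the following hedged sketch, reusing the block_buffer_t bucket sketched earlier; the tag encoding (matrix ID and block location in the first two tag entries) and the cycle bound are hypothetical placeholders:

#define MAX_WAIT 1000000L   /* hypothetical bound on polling iterations */

/* Spin until the buffer's tag row matches the block we expect; give up
   after MAX_WAIT checks rather than hanging the whole system. */
static int wait_for_block(volatile block_buffer_t *buf,
                          float expected_matrix, float expected_block)
{
    for (long i = 0; i < MAX_WAIT; i++) {
        if (buf->tag[0] == expected_matrix && buf->tag[1] == expected_block)
            return 0;    /* correct data has arrived */
    }
    return -1;           /* exit on an error instead of waiting forever */
}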


Blocking out the top level / Simulating inputs

In order to fully test and verify that the real time system can handle matrix multiplication before integrating it with exterior input, one must simulate the data. In the current build, the PPU creates and stores the matrices for computation before executing or creating any SPE threads. The dimensions of the matrices and the dimensions of the blocks are determined by constant values in the code, which allows the system to test multiple combinations and sizes. The data entries themselves are generated randomly, allowing the system to handle different sorts of input. Simulating the data within the system is how the thesis group decided to test the system before integrating it with the other layers of the development.

Blocking out the bottom level / Simulating computations

The computations for the real time system involve dense matrix block multiplication and addition. These will eventually be performed through kernel code that is being developed to execute a series of statements carrying out the computation. For testing purposes, however, the multiplication and addition are done within the real time system through C functions. This will be a lot slower than the final build, since code implemented at this higher level cannot maximize register usage; however, it will be able to verify that the real time system is moving and computing data correctly in accordance with the algorithm. Therefore, one can check whether the randomly generated inputs match the computed multiplied output.

Double / N-multiple buffering technique

Buffering is a technique that minimizes the time that the processor stalls or waits for data in order to perform a task. The general idea, using double buffering as an example, is to transfer the data needed for both task i and task i+1 before starting. There is therefore a larger start-up cost, waiting for double the amount of data. The processor then works on task i, and when it moves on to task i+1, it requests the data for task i+2, replacing the data for task i in memory. The system is thus always transferring data for the next task while computing on the current task. In theory, this eliminates waiting for data whenever the computations take longer than transferring all the data required for the next step. N-multiple buffering uses the same methodology, except the system transfers the data for N-1 steps ahead. However, as the amount of buffering increases, the demand for enough memory to store that information increases as well; therefore, it is only logical to buffer ahead the minimal amount based on the computation size. In the final build of the system, the amount of buffering will be decided by the scheduler, which is exterior to the real time system. Latencies for tasks, such as DMA transfers and block computations, will be determined and provided in the early stages of the overall system, so the scheduler will have this information in order to issue as many data transfers as possible into an SPE's LS before a computation begins. For testing purposes, the real time system implements double buffering in the test code, in order to simulate the basic idea of buffering.
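A classic double-buffering skeleton on the SPE looks like the following hedged sketch; ea_of_task and compute are hypothetical placeholders for the scheduler-provided addresses and the block computation:

#include <spu_mfcio.h>

volatile float buf[2][64 * 64] __attribute__((aligned(128)));

extern unsigned long long ea_of_task(int i);   /* hypothetical */
extern void compute(volatile float *block);    /* hypothetical */

void process(int ntasks)
{
    int cur = 0;
    /* start-up cost: prefetch the data for task 0 */
    mfc_get(buf[cur], ea_of_task(0), sizeof(buf[0]), cur, 0, 0);

    for (int i = 0; i < ntasks; i++) {
        int nxt = cur ^ 1;
        /* request the data for task i+1 into the other buffer */
        if (i + 1 < ntasks)
            mfc_get(buf[nxt], ea_of_task(i + 1), sizeof(buf[0]), nxt, 0, 0);

        /* wait only for the current buffer's tag group, then compute task i */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        compute(buf[cur]);

        cur = nxt;   /* the buffers swap roles for the next iteration */
    }
}

If the computation on task i takes longer than the transfer for task i+1, the final tag-status read returns immediately and the processor never stalls.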

GENERAL OVERVIEW


The following section provides a general overview of the real time system, supplying the necessary background knowledge before traversing the implementation. In summary, the system begins with the PPU. The PPU creates the information that each SPE will need to function, including pointers to data in memory as well as information about the other SPEs in the system. The PPU then proceeds by creating the test data used in the computation. Finally, control threads are created that are essentially the SPE programs for each SPE. These threads utilize calls to initiate DMA transfers to move the data they require to and from the SPEs. Below is an image illustrating the general flow of control of the real time system. Please note that this is an oversimplification, but necessary to provide a framework before continuing through the code:


VIII.XIII Dense-Dense Matrix Multiplication: Computation Kernel

Report

Dense-Dense Matrix Multiplication

Computation Kernel

By: Adam Schulz

Date: January 16, 2007

Report Summary:

This report will provide those involved in the Cell SDK thesis project 4ZP6 with an understanding of both the design and implementation of the matrix multiplication kernel that will run on the individual SPEs. An understanding of the Haskell programming language is required to follow the implementation shown. The code below is a simulation of what will happen at the hardware level on the Cell SDK; therefore, it is important to understand the process of the multiplication rather than the exact implementation.

Dense-Dense Matrix Multiplication

Overview

The module discussed in this report is part of a larger project that has been divided into pieces. Since these pieces are not ready to be integrated yet, both the input and output of this module are being simulated by the module itself. Because of this, certain hardware limitations have been ignored for this module that will be taken into account when this code is modified for the Cell SDK specifically. The limitations that have not been taken into account are the finite number of registers available per SPE, the transfer of data from the LS (local store) to the registers where the arithmetic operations are done, and where the output is to be stored after it is computed. These transfers will be taken care of by an interface that will be outlined in a later report.

Registers on the Cell SDK have been simulated using a Haskell implementation. A register is represented by square brackets [] and can contain four single precision values. All of the computations simulate Cell SDK register arithmetic operations that are predefined for the Cell SDK. This will be discussed later in the report.

Code


The remainder of this report will cover the actual code implementations. It is broken into sections based on the recursive steps that have been used. Along with each implementation step, relevant parts that can be transferred to the Cell SDK will be highlighted, as well as changes that will need to be made. To begin, the common variables used throughout the code will be defined in order to avoid repetition.

Variables

Figure 1 outlines the variables used throughout the module; variable names, types and descriptions are defined.

{- type definitions
type Val = [String]    -- Val is a list of Strings
type Register = [Val]  -- Register is a list of Vals. This simulates a register
                       -- of 4 values. Each value will hold multiple strings
                       -- (matrix entries).
-}

{- variable definitions
matrixa -> input matrix 1. matrixa is of type [Val]
rowa    -> number of rows in matrixa. rowa is of type Int
cola    -> number of calculated columns (number of columns divided by 4, since one register consists of 4 values). cola is of type Int
matrixb -> input matrix 2. matrixb is of type [Val]
rowb    -> number of rows in matrixb. rowb is of type Int
colb    -> number of columns in matrixb (same definition as for cola). colb is of type Int
offset  -> current column of the result matrix being calculated. offset is of type Int
-}

Figure 1 – Type and Variable Definition

Input

The inputs for this module are two simulated matrices. For the actual Cell SDK implementation, the information will be retrieved from the local LS and placed into the appropriate registers for calculation. The code in figure 2 will therefore not be necessary for the actual implementation, but is shown to provide a more complete understanding of the whole module. There is a restriction on the input: the matrices must have both row and column dimensions divisible by four. This is due to the underlying Cell hardware working with registers that contain four values. Also, the input must be of the proper matrix multiplication form (m,n) * (n,k), where m, n, k represent matrix dimensions. The module does not verify this, since it will be handled by a different part of the project.


{- rmatrix creates a matrix based on a row and column input. rmatrix runs through
   the number of rows and cmatrix fills in the column number. printrow (defined
   elsewhere in the module) builds one register of four labelled entries. -}

rmatrix :: String -> Int -> Int -> [Val]
rmatrix name 1 col = cmatrix name 1 col
rmatrix name row col =
  concat [rmatrix name (row - 1) col, cmatrix name row col]

cmatrix :: String -> Int -> Int -> [Val]
cmatrix name row 4 = [printrow name row 4]
cmatrix name row col =
  concat [cmatrix name row (col - 4), [printrow name row col]]

Figure 2 – Code to generate input matrices

rmatrix takes three inputs, a name and the dimensions of the matrix to be created, and calls cmatrix once for each row in the matrix. cmatrix prints the name, row and column for each of the four entries in a register and then repeats the process for each column. This is done by a recursive concat call. The output is a matrix of type [Val]. The input and output are shown in figure 3.

Input

rmatrix "amatrix" 4 4

Output

[["amatrix(1,1)","amatrix(1,2)","amatrix(1,3)","amatrix(1,4)"],

["amatrix(2,1)","amatrix(2,2)","amatrix(2,3)","amatrix(2,4)"],

["amatrix(3,1)","amatrix(3,2)","amatrix(3,3)","amatrix(3,4)"],

["amatrix(4,1)","amatrix(4,2)","amatrix(4,3)","amatrix(4,4)"]]

Figure 3 – Input and Output for rmatrix

*Note* output formatted for clarity

Result Matrix

The remainder of the code goes through the process of calculating the result matrix. The code is broken into three sets of recursive calls and one calculation function. One helper function, traverse, is used by multiple functions; it takes a matrix and returns the appropriate register of four values from that matrix. This code will be examined next.

traverse function

The traverse function takes inputs of type [Val] and Int and produces a value of type Val, which is equivalent to a specific register, i.e., four values in a matrix. This function can be thought of as a simplified version of the interface between the LS and the registers, since it traverses a matrix and passes back the necessary parts for computation.

The code is shown in figure 4.

traverse :: [Val] -> Int -> Val
traverse matrix 0   = head matrix
traverse matrix col = traverse (tail matrix) (col - 1)

Figure 4 - traverse function

Since the matrix is stored as a list, calling traverse recursively on the tail of the matrix moves through the matrix one list entry at a time. The function is given a specific number of iterations to perform this call and returns one entry of the list (a register) to the function that called it.

matrixmult function

The matrixmult function iterates the calculations for the result matrix by column. For example, a result matrix with dimensions 8 by 8 would recurse twice, since the matrix has two register columns per row, each register holding four entries. Figure 5 gives a graphical representation of this.

[Figure 5 – a result matrix named "r" of size 8 by 8, entries r11 through r88; the first register column of each row is shown in red, the second in blue]

The first iteration of matrixmult would result in all of the red entries of matrix r being calculated, and the last iteration of matrixmult would result in all the blue entries being calculated. The code for matrixmult is shown in figure 6.


{- This is the recursion that takes care of each column of the result matrix.
   The function accepts a matrix of type [Val], the dimensions of that matrix,
   another matrix of the same type and its dimensions, as well as an offset value. -}

matrixmult :: [Val] -> Int -> Int -> [Val] -> Int -> Int -> Int -> [Val]
matrixmult matrixa rowa cola matrixb rowb 1 offset =
  rowmult matrixa rowa cola matrixb rowb 1 0
matrixmult matrixa rowa cola matrixb rowb colb 2 =
  concat [ rowmult matrixa rowa cola matrixb rowb colb 1
         , rowmult matrixa rowa cola matrixb rowb colb 0 ]
matrixmult matrixa rowa cola matrixb rowb colb offset =
  concat [ rowmult matrixa rowa cola matrixb rowb colb (offset - 1)
         , matrixmult matrixa rowa cola matrixb rowb colb (offset - 1) ]

Figure 6 – code for matrixmult function

This function is broken into three recursive calls based on the value of colb, which refers both to the column of matrixb being used and to the column of the result matrix being computed. These values are the same because whatever column of b is used for the multiplication, say column 8, the results of those calculations will be placed in column 8 of the result matrix, at a row determined by which row of a they were multiplied with.

The first function call matches a colb value of 1 and is for any case where the bmatrix, and therefore the result matrix, is only one register (four entries) wide; this call is only used in that scenario. The second function call is the terminating call for any matrix of column size eight or greater. It takes all the calculations for entries five to eight (one register) and concatenates them with the values for entries one to four, returning the result. The final function call concatenates the current column calculations with a recursive call to itself, decreasing the offset value by 1 until it is equal to 2.

rowmult function

The rowmult function traverses each row of amatrix, passing the correct offset for each row to another function for multiplication. A visual representation is shown in figure 7.

[Figure 7 – rowmult function representation: an 8 by 8 matrix with entries a11 through a88, each row shown in a different color]

Each color in the diagram represents a row in the matrix. If the matrix were 12 by 12, each color would represent 12 entries instead of 8. The code for the rowmult function is displayed in figure 8.

{- This is the recursion that goes through the calculation for each row of amatrix.
   rowmult accepts a matrix of type [Val], the dimensions of that matrix, another
   matrix of type [Val] and its dimensions, as well as an offset. -}

rowmult :: [Val] -> Int -> Int -> [Val] -> Int -> Int -> Int -> [Val]
rowmult matrixa 2 cola matrixb rowb colb offset =
  concat [ [matrixmult1 matrixa ((2 * cola) - 1) cola matrixb rowb colb offset]
         , [matrixmult1 matrixa (cola - 1) cola matrixb rowb colb offset] ]
rowmult matrixa rowa cola matrixb rowb colb offset =
  concat [ [matrixmult1 matrixa ((cola * rowa) - 1) cola matrixb rowb colb offset]
         , rowmult matrixa (rowa - 1) cola matrixb rowb colb offset ]

Figure 8 – code for rowmult function

This function has two recursive calls, selected by the value of rowa. The first is the terminating call, which happens when there are only two rows left in amatrix to be multiplied. This call concatenates the calculations of row two with row one and returns the output. The second recursive call concatenates the calculations for row n (where n is the current row being calculated) with a call to itself, decreasing rowa by one; this happens until the rowa value is two. The role of this function is to pass the offset for the current row to the next function. The offset is the number of iterations that the traverse function will need to go through to get to the last register in the row. We use the last register in the row since the matrices will be of variable size for any given computation; we must start at the end of the row and work backwards, otherwise there would be no terminating case for the recursion.

matrixmult1 function


The matrixmult1 function takes the offset passed to it by rowmult, which points to the last register in the row, and calls traverse, which returns the actual register corresponding to that offset. This is then passed to another function that computes the actual multiplication value for each individual register entry. This process is repeated for each register in the given row. Figure 9 shows this graphically.

[Figure 9 – matrixmult1 function representation: the last row, a81 through a88, broken into registers of four entries, each register shaded differently]

If we extend figure 7, we can see that row a8 has been passed to matrixmult1, which then breaks the row up even further into registers, represented by the different shading. Each of these registers will be passed on to another function separately. This process extends to matrices of larger sizes. The code for matrixmult1 is presented in figure 10.

{- The recursion for each individual register in the current row of calculations
   of matrixa. matrixmult1 accepts a matrix of type [Val] and its dimensions,
   another matrix of the same type and its dimensions, as well as an offset. -}

matrixmult1 :: [Val] -> Int -> Int -> [Val] -> Int -> Int -> Int -> Val
matrixmult1 matrixa rowa 1 matrixb rowb colb offset =
  mul44' (traverse matrixa rowa) matrixb 1 colb offset
matrixmult1 matrixa rowa 2 matrixb rowb colb offset =
  fa (mul44' (traverse matrixa rowa) matrixb 2 colb offset)
     (mul44' (traverse matrixa (rowa - 1)) matrixb 1 colb offset)
matrixmult1 matrixa rowa cola matrixb rowb colb offset =
  fa (mul44' (traverse matrixa rowa) matrixb cola colb offset)
     (matrixmult1 matrixa (rowa - 1) (cola - 1) matrixb rowb colb offset)

Figure 10 – matrixmult1 code


matrixmult1 is broken into three function calls based on the column size of amatrix. If amatrix has only one column (one register, four columns), the output is a single call to mul44', which calculates the value for those individual entries. If amatrix has two register columns left to calculate, the results of a mul44' call on register 1 and on register 2 are added together to get the final result. If cola (the number of register columns amatrix has) is greater than two, the result for the current cola is added to a recursive call with cola decreased by 1. The reason that the results are added together and not concatenated is that, to compute a single entry in the result matrix, the products of an entire row of amatrix with an entire column of bmatrix must be summed.

mul44' function

The mul44' function does all the actual arithmetic calculations for all the values in a given register. The function accepts a register from amatrix, the entire bmatrix, cola and colb for each matrix, and an offset. Figure 11 provides a visual aid for the multiplication process.

Figure 11 – mul44' function diagram 1

The function starts with one register from amatrix and the entire bmatrix, cola, colb and an offset value being passed in. The first entry from the amatrix register is copied into a new register four times.


This is multiplied by the corresponding row of bmatrix. The register for bmatrix is computed by the traverse function, which locates the bmatrix register based on the cola, colb and offset values passed to it from mul44'. The two registers are multiplied using a function defined to simulate the register arithmetic computation available on the Cell SDK. The result is placed in a new register. This process is repeated for the other values in the amatrix register, as shown in figures 12 and 13.

Figure 12 – mul44' function multiplication with a12

Figure 13 – mul44' function after multiplication is complete


The other register multiplications are added to the result register. Again, we have simulated the register arithmetic operation fma (multiply and add), which multiplies two registers and adds the product to another register. After this process is done for all four register values, we end up with the proper results for four entries, as shown in figure 13.

Simulated functions

There are several arithmetic register operations that were simulated in order to complete the

matrix multiplications. These Haskell functions will be shown in figures 14 and 15.

{- pre-defined register manipulations: multiply, add, and multiply-add -}

fm :: Val -> Val -> Val

fm = zipWith (\x y -> "(" ++ x ++ "*" ++ y ++ ")")

fa :: Val -> Val -> Val

fa = zipWith (\x y -> "(" ++ x ++ "+" ++ y ++ ")")

fma :: Val -> Val -> Val -> Val

fma = zipWith3 (\x y z -> "(" ++ x ++ "*" ++ y ++ " + " ++ z ++ ")")

Figure 14 – arithmetic register operations multiply, add and multiply add

The fm (multiply) function takes two registers and multiplies them together: entry one of each register is multiplied with entry one, entry two with entry two, and so forth. The result is placed in a new register. The fa (add) function also takes two registers and adds the individual entries of one register to the corresponding entries of the other. The fma (multiply-add) function combines the two previous functions: two registers are multiplied together and the result is then added to another register. Since the multiplication result is added to an existing register, this function does not require a new register for the solution. Diagrams for this procedure can be seen in the previous figures 11, 12 and 13.

{- returns a register with the correct values for multiplication in the a matrix. (see multiplication algorithm for more

clarification) -}

sfc1 :: Val -> Val

sfc1 [x1,x2,x3,x4] = [x1,x1,x1,x1]

sfc1 _ = error "sfc1"

sfc2 :: Val -> Val

sfc2 [x1,x2,x3,x4] = [x2,x2,x2,x2]

sfc2 _ = error "sfc2"

sfc3 :: Val -> Val

sfc3 [x1,x2,x3,x4] = [x3,x3,x3,x3]

sfc3 _ = error "sfc3"

sfc4 :: Val -> Val

sfc4 [x1,x2,x3,x4] = [x4,x4,x4,x4]


sfc4 _ = error "sfc4"

Figure 15 – register copy functions

Each of the functions in figure 15 simulates a copy instruction on the Cell SDK. Each function takes a register (the amatrix register) and returns a new register with one of the four entries copied four times. The process is again shown in figures 11, 12 and 13.

Conclusion

The code shown above outlines the process that will need to be executed at the hardware level of the Cell SDK in order for matrix multiplication to be done correctly and efficiently. Several methods for calculating matrix multiplication were examined for both number of instructions and register use, and this method was shown to be the most efficient; this is detailed in the 4ZP6 Cell SDK Design Report. The above code also assumes that both the entire bmatrix and the result matrix are stored in registers for the whole computation. This may not be the case for the actual implementation at the hardware level, which would allow larger matrices to be computed with fewer transfers from the LS to registers. The matrix algorithm works correctly assuming that the correct data is in the correct place at the right time; this will need to be carefully implemented by the LS and register interface in order for the multiplication to work correctly.


VIII.XIV SPE Local Store Stack Frame

SPE Local Store Stack Frame

By: Damith Karunaratne

Last Modified: 12/28/2006

Purpose:

The purpose of this report is to give an overview of the organization of the SPE Local Store (LS) Stack

Frame. This information will be utilized when developing the real-time engine.

Introduction:

This report on the SPE LS Stack Frame follows the Cell Broadband Engine Linux Reference Implementation Application Binary Interface Specification. Other implementations with respect to the initial values of the stack and its associated registers are possible, but due to a lack of documentation regarding this matter, they are disregarded here.

Stack Frame:

Register Initialization:

The following are the initial values required by the registers on initialization:

Register R1: Stores the stack pointer (SP)

Register R3: Stores the SPU identification (spuid)

Register R4: Stores the pointer to the array of program parameters (argp)

Register R5: Stores the pointer to the SPU task environment (envp)

Stack Initialization:

On initialization of the stack, the top of the stack is set to the topmost register (quadword) in the LS. For

calling functions, registers R3 through R79 can be used to store parameters. Below is the composition of

a standard application stack frame on the SPE:


References:

Cell Broadband Engine Programming Handbook, Version 1.0. New York: International Business Machines (IBM), 2006.


VIII.XV SPE Processor Affinity

REPORT

SPE PROCESSOR AFFINITY

OWNER(S): Damith Karunaratne and Christopher Venantius

DATE: Wednesday, December 20, 2006

PURPOSE: The goal is to outline the possibility of using the built-in masks to assign SPE affinity to threads. If possible, it will outline how one can use this to direct processing tasks from the code graph to unique SPE processors, thereby enabling the sharing of data between tasks in the local store (LS).

DEFINITION OF AFFINITY AND ITS RELATION TO THE PROJECT: Affinity refers to the binding that restricts a thread to particular system resources. In a multi-processor system such as the STI Cell, it refers to the association of particular SPE threads with a specific SPE processing unit. The choice of processing unit will be determined during an "upper level" of processing; this may be done at compile time, or through a greedy algorithm at runtime that associates processing threads with processors. The motivation is that through affinity one can direct a flow of processing threads to a single SPE unit, thereby taking advantage of a shared LS and possible overlapping of data resources.

METHODOLOGIES:

ISSUE WITH SMALL COMPUTATIONAL FUNCTIONS:

In order to see any work being done on an SPE unit for testing purposes, it is advisable to use a large looping construct of 100 iterations; anything smaller will not be visible in the tracking system of the GUI STI Cell simulator. Printing functions are run by the PPU, therefore they won't show up as SPU processing. ***Note: a loop around just a print will still use an SPU to process the looping.

AFFINITY IMPLEMENTATION:

Unfortunately, the current release of libspe2.0 does not support affinity masks directly, on the basis of avoiding commitments to the API until as late as possible. The options for solving our problem are to wait until the next iteration of the SDK comes out, hoping that affinity is implemented, or to find a possible workaround.

One workaround is to increase the workload of a single thread to encapsulate all of the data overlapping we wanted to exploit through sharing a processor, and therefore sharing the LS. However, this creates large threads, and requires that the code for the associated processes fit in the code segment.

A second workaround is to create our own affinity mask system. We are not sure what this actually involves, since the system is new to the group, and we would like to avoid this option because the implementation does not work directly towards our goal.

CURRENT OUTLOOK:

We are going to move on to research other areas, and assume that the "random" ordering can be worked into the system. The only constraint is that once the system chooses the associated SPU, we have a way to know which one was picked; SPE to SPE communication routines can then be employed to pass information and reduce the number of memory DMA accesses. It is believed this is possible through using the SPE identifiers that are supported in the system. We are continuing with research into SPE to SPE transfers and then a real time system, where all the above assumptions will be tested.

VIII.XVI Notes on the Library SPE Document

Notes on the Library SPE Document

Summary

• To provide the group with an understanding of SPE transfers and manipulation.

• To gain an understanding of the additions made in the current SDK release

Related Documents

C/C++ Language Extensions for STI Cell

• Worthwhile document on the C/C++ extensions for the STI Cell - implications when implementing the real time system

Overview

Application Control

• Applications have no direct control over physical SPE resources: these are managed by the OS
• Applications use software constructs called SPE contexts
• An SPE context is a logical representation of an SPE -> the OS schedules the contexts from running applications

Basic scheme for an application

• create a SPE context

• load SPE executable object into LS (code)

• run the context -> transfer control to OS

• destroy context

More complex scheme (using multiple contexts)

• create N SPE contexts

• load all SPE objects into the SPE LS (multiple SPEs)
• create N threads

• in each thread run one context

• terminate thread

• wait for N threads to terminate

• destroy all contexts

Advanced Controls

Support is provided for modified schemes:
• PPE functions that create/destroy SPE and gang contexts
• PPE functions to load SPE objects into the LS
• PPE functions to start execution of an SPE and obtain the stopping reasons for an SPE (if stopped)
• PPE functions to receive asynchronous events from an SPE

PPE functions to access the MFC, including:
• SPE signal notification
• mailbox facility
• MFC proxy command issue and proxy tag-group facility (not sure on relevance)
• PPE functions to enable direct access to the LS and problem areas


• Means to access PPE assisted library calls for a SPE program

SPE context

• holds all persistent information on an SPE LS

• used by application through a pointer

Gang context

• holds all persistent information on a group of SPE contexts that should be

treated together.

• used by application through a pointer

Main thread
• the application's main thread, which controls the multiple threads used to manipulate concurrently running SPEs

SPE thread
• a regular thread running on the PPE that accesses an SPE context

SPE event
• event mechanism for asynchronous notification
• used to indicate when an SPE has stopped executing, mailbox messages, and when PPE-initiated DMA operations have completed

Code examples

Overview

Examples

The following example shows how to load and run a simple SPE executable "hello":

Example 1: Run the simple SPE program "hello"

#include <stdlib.h>

#include <libspe2.h>

int main()

{

spe_context_ptr_t spe;

unsigned int createflags = 0;

unsigned int runflags = 0;

unsigned int entry = SPE_DEFAULT_ENTRY;

void * argp = NULL;

void * envp = NULL;

spe_program_handle_t * program;

program = spe_image_open("hello");

spe = spe_context_create(createflags, NULL);

spe_program_load(spe, program);

spe_context_run(spe, &entry, runflags, argp, envp, NULL);

spe_image_close(program);

spe_context_destroy(spe);

}

• PURPOSE: a simple program that illustrates the creation of an SPE context and the loading of a simple program (SPE Runtime Management Library, Version 2.0)

The following simple multi-threaded example shows how an application can run the SPE program "hello" on multiple SPEs concurrently:

Example 2: Simple multi-threaded example

#include <stdlib.h>

#include <pthread.h>

#include <libspe2.h>

#define N 4

struct thread_args {

struct spe_context * spe;

void * argp;

void * envp;

};

void my_spe_thread(struct thread_args * arg) {

unsigned int runflags = 0;

unsigned int entry = SPE_DEFAULT_ENTRY;

// run SPE context

spe_context_run(arg->spe, &entry, runflags, arg->argp, arg->envp, NULL);

// done - now exit thread

pthread_exit(NULL);

}

int main() {

pthread_t pts[N];

spe_context_ptr_t spe[N];

struct thread_args t_args[N];

int value[N];

int i;

// open SPE program

spe_program_handle_t * program;

program = spe_image_open("hello");

for ( i=0; i<N; i++ ) {

// create SPE context

spe[i] = spe_context_create(0, NULL);

// load SPE program

spe_program_load(spe[i], program);

// create pthread

t_args[i].spe = spe[i];

t_args[i].argp = &value[i];

t_args[i].envp = NULL;

pthread_create( &pts[i], NULL, (void *(*)(void *)) my_spe_thread, &t_args[i]);

}

// wait for all threads to finish

for ( i=0; i<N; i++ ) {


pthread_join (pts[i], NULL);

}

// close SPE program

spe_image_close(program);

// destroy SPE contexts

for ( i=0; i<N; i++ ) {

spe_context_destroy (spe[i]);

}

return 0;

}

SPE Runtime Management Library, Version 2.0

• PURPOSE: a simple multi-threaded program to run the same hello program on multiple SPEs

concurrently.

Interesting Note:

• Uses a for loop to join all threads sent for execution to force a wait for all threads to finish

• Examine how the pthread is created for each thread, using the my_spe_thread procedure as an argument to pthread_create

SPE Context Creation

Before using an SPE, the context has to be created and initialized:
• done through spe_context_create

Once it is no longer used, the context should be freed, including its memory:
• done through spe_context_destroy

An SPE gang has to be created and initialized before being utilized:
• done through spe_gang_context_create
• contexts are added through spe_context_create

An SPE gang can be deallocated to free up its resources:
• first deallocate all SPE contexts in the gang through spe_context_destroy
• then deallocate the SPE gang

IDEA:
• can we use an SPE gang to help with associating SPE - SPE DMA transfers? (see the sketch below)
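A minimal sketch of that idea, based only on the calls described in this document (error handling omitted; whether a gang actually helps SPE - SPE transfers is exactly the open question above):

#include <libspe2.h>
#define N 4

/* Create a gang and add N SPE contexts to it so that related contexts can
   be treated together; tear down in the required order (contexts first,
   then the gang). */
void gang_demo(void)
{
    spe_gang_context_ptr_t gang = spe_gang_context_create(0);
    spe_context_ptr_t spe[N];
    for (int i = 0; i < N; i++)
        spe[i] = spe_context_create(0, gang);   /* associated with the gang */

    /* ... load programs and run the contexts here ... */

    for (int i = 0; i < N; i++)
        spe_context_destroy(spe[i]);
    spe_gang_context_destroy(gang);
}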

Code explanation for spe_context_create


SPE Context Create Functions

spe_context_create

C Specification

#include <libspe2.h>

spe_context_ptr_t spe_context_create(unsigned int flags,

spe_gang_context_ptr_t gang)

Description

Create a new SPE context.

Parameters


flags A bit-wise OR of modifiers that are applied when the SPE context is

created.

The following values are accepted:

SPE_EVENTS_ENABLE Event handling shall be enabled on this SPE context

SPE_CFG_SIGNOTIFY1_OR Configure the SPU Signal Notification 1 Register1 to be in ―logical

OR‖ mode instead of the default ―Overwrite‖ mode.

SPE_CFG_SIGNOTIFY2_OR Configure the SPU Signal Notification 2 Register1 to be in ―logical

OR‖ mode instead of the default ―Overwrite‖ mode.

SPE_MAP_PS Request permission for memorymapped access to the SPE‘s problem state area(s) 2.

SPE_ISOLATE This context will execute on an SPU in isolation mode. The specified SPE program must be correctly formatted for isolated execution.

gang Associate the new SPE context with this gang context. If NULL is specified, the context is not associated with any gang.

1 See Cell Broadband Engine Architecture, SPU Signal Notification Facility
2 See Cell Broadband Engine Architecture, Problem State Memory-Mapped Registers

flags
• used as a bit-wise OR of modifiers
• SPE_EVENTS_ENABLE: event handling enabled on the SPE context
• ???SPE_CFG_SIGNOTIFY1_OR: configure the SPU signal notification 1 register to be in logical OR mode instead of Overwrite mode
• ???SPE_CFG_SIGNOTIFY2_OR: configure the SPU signal notification 2 register to be in logical OR mode instead of Overwrite mode
• SPE_MAP_PS: request permission for memory-mapped access to the SPE problem state area(s)
• SPE_ISOLATE: execute on an SPU in isolation mode; the program must be formatted for isolated execution

gang
• associates the SPE context with a gang -> if NULL, no gang is associated

return values
• on success a pointer to the context is returned; on error NULL is returned and errno indicates the error
• ENOMEM: the context could not be allocated due to lack of system resources
• EINVAL: the value passed for flags was invalid
• EPERM: the process does not have permission to add threads to the SPE gang context, or to use the SPE_MAP_PS setting
• ESRCH: gang context not found
• EFAULT: a runtime error in the underlying OS occurred

Code explanation for spe_context_destroy


spe_context_destroy

C Specification

#include <libspe2.h>


int spe_context_destroy (spe_context_ptr_t spe)

Description

Destroy the specified SPE context and free any associated resources.

Parameters

spe Specifies the SPE context to be destroyed.

Return Value

On success, 0 is returned. On failure, -1 is returned and errno is set appropriately.

Possible errors include: ESRCH The specified SPE context is invalid.

spe
• the SPE context to destroy

return values
• on success 0 is returned; on failure -1 is returned and errno is set

• ESRCH: the spe is invalid

• EAGAIN: the spe context cannot be destroyed because it is in use

• EFAULT: a runtime error in the underlying OS occurs

Code explanation for spe_gang_context_create


spe_gang_context_create

C Specification

#include <libspe2.h>

spe_gang_context_ptr_t spe_gang_context_create (unsigned int flags)

Description

Create a new SPE gang context.

Parameters

flags A bit-wise OR of modifiers that are applied when the SPE context is

created.

The following values are accepted:

<none>

Return Value

On success, a pointer to the newly created gang context is returned. On error, NULL is returned and

errno will be set to indicate the error. Possible errors include:

ENOMEM The gang context could not be allocated due to lack of system resources.

EINVAL The value passed for flags was invalid.

EFAULT A runtime error of the underlying OS service occurred.

See Also

spe_context_create; spe_gang_context_destroy;

flags


• currently not supported

return values
• on success a pointer to the gang is returned, else NULL is returned and errno is set

• ENOMEM: not enough system resources

• EINVAL: value for the flags are invalid

• EFAULT: underlying OS error

Code explanation for spe_gang_context_destroy


spe_gang_context_destroy

C Specification

#include <libspe2.h>

int spe_gang_context_destroy (spe_gang_context_ptr_t gang)

Description

Destroy the specified gang context and free any associated resources.

Before destroying a gang context, you must destroy all associated SPE contexts using

spe_context_destroy.

Parameters

gang Specifies the gang context to be destroyed.

Return Value

On success, 0 is returned. On failure, -1 is returned and errno is set appropriately.

Possible errors include:
ESRCH The specified gang context is invalid.
EAGAIN The specified gang context cannot be destroyed at this time since it is in use.
EFAULT A runtime error of the underlying OS service occurred.

See Also

spe_gang_context_create; spe_context_destroy;

gang
• specifies the gang context to be destroyed

return values
• on success 0 is returned; on failure -1 is returned and errno is set
• ESRCH: gang context is invalid
• EAGAIN: gang cannot be destroyed at this time because it is in use
• EFAULT: a runtime error due to the underlying OS

SPE Image Handling

• spe_program_load loads the program into the LS; if the file is an independent ELF image, it first needs to be loaded into memory by spe_image_open

SPE ELF = SPE Executable and Linking Format

• purpose is to define a simple, standard hierarchy file structure

• standard object format for many UNIX-based operating systems
• compilers generate these files and linkers link to ELF files in libraries
• systems can run ELF files
• refer to chapter 14 of the handbook for more detailed information

Code explanation for spe_image_open


SPE Image Functions

Page 168: THE PROCESSING OF A DENSE MATRIX MULTIPLIED BY A DENSE ...uszkaygj/4zp6/CellMatrixFinal2007.pdf · THE PROCESSING OF A DENSE MATRIX MULTIPLIED BY A DENSE MATRIX ON THE STI CELL by

168

spe_image_open

C Specification

#include <libspe2.h>

spe_program_handle_t * spe_image_open (const char *filename)

Description

spe_open_image opens an SPE ELF executable indicated by filename and maps it into system

memory. The result is a pointer to an SPE program handle which can then be used with

spe_program_load to load this SPE main program into the local store of an SPE before running it

with spe_context_run. The application needs "execute" access rights to the file with the SPE

executable. SPE ELF objects loaded using this function are not shared with other

applications/processes. It is sometime more convenient to embed SPE ELF objects directly within the

PPE executable using the linker and an "embed_spu" (or equivalent) tool (see toolchain

documentation). In this case, SPE ELF objects are converted to PPE static or shared libraries with

symbols which will point to the SPE ELF objects after these special libraries are loaded. These

libraries are then linked with the associated PPE code to provide a direct symbol reference to the SPE ELF object.

• used to open an SPE ELF executable by filename
• there are options to share an SPE ELF object among processes by embedding it directly within the PPE executable --> not sure we want to run it on a PPE and this is confusing

filename
• specifies the SPE ELF to be loaded

return values
• on success it returns the address at which the object is mapped, or NULL on failure with errno set
• EACCES: lacking the necessary permissions
• EFAULT: filename points to an address not contained in the calling process's address space
• other: other errno values can come out as well

Code explanation for spe_program_load
SPE Run Control

Code explanation for spe_image_close


spe_image_close

C Specification

#include <libspe2.h>

int spe_image_close (spe_program_handle_t *program)

Description

spe_image_close unmaps and closes an SPE ELF object that was previously opened and mapped using spe_image_open.

Parameters

program A valid address of a mapped SPE program.

Return Value


On success, 0 is returned. On failure, -1 is returned and errno is set appropriately. Possible errors include:
EINVAL The specified address of the SPE program is invalid.
other A number of other errno values could be returned by the munmap(2) or close(2) system calls, which may be utilized by the spe_image_open function.

See Also

spe_image_open;

• from now on the explanations are abbreviated
• used to close the image object
• after creating the SPE context and loading the SPE program into the LS, it can now be set to run
• the thread that executes the context is called an SPE thread
• the API function is a synchronous, blocking call
• while the program is executing, the associated thread blocks and will usually be put to sleep by the OS
• when the program stops, the context_run function returns
• usually, to run multiple SPEs concurrently, we need to create at least one thread for each SPE context we need; most likely n+1 threads, the additional one being a master thread to orchestrate the process
• it is convenient to use the asynchronous notification function spe_stop_info_read, which allows the main thread to find out why an SPE stopped when it does


spe_program_load

C Specification

#include <libspe2.h>

int spe_program_load (spe_context_ptr_t spe, spe_program_handle_t

*program)

Description

spe_program_load loads an SPE main program that has been mapped to memory at the address

pointed to by program into the local store of the SPE identified by the SPE context spe. This is

mandatory before running the SPE context with spe_context_run.

Parameters

spe A valid pointer to the SPE context for which an SPE program should be loaded. program A valid

address of a mapped SPE program.

Return Value

On success, 0 is returned. On failure, -1 is returned and errno is set appropriately. Possible errors

include: ESRCH The specified SPE context is invalid. EINVAL The specified address of the SPE

program is invalid.

See Also

spe_image_open; spe_context_run;

• loads the SPE main program that is mapped to the memory address pointed to by program into the LS of the SPE identified by context spe
• mandatory before running the SPE context

Code explanation for spe_context_run



SPE Run Functions

spe_context_run

C Specification

#include <libspe2.h>

int spe_context_run(spe_context_ptr_t spe, unsigned int *entry,

unsigned int runflags, void *argp, void *envp,

spe_stop_info_t *stopinfo)

Description

The function spe_context_run requests execution of an SPE context on a physical SPE resource of

the system. It is necessary that a SPE program has been loaded (using spe_program_load) before

running the SPE context. The thread calling spe_context_run will block and wait until the SPE

stops, either because of normal termination of the SPE program, an SPU stop and signal instruction,

or some error condition. When spe_context_run returns, the calling thread must take appropriate

actions depending on the application logic. spe_context_run returns information about the

termination of the SPE program in three ways. This allows applications to deal with termination

conditions on various levels. First, the most common usage for many applications is covered by the

return value of the function and the errno value being set appropriately. Second, the optional stopinfo

structure provides detailed information on the termination condition in a structured way that allows applications more fine-grained handling.

• requires spe_program_load to have already occurred; the function blocks and waits for termination:
• normal exiting
• SPU stop / signal instruction
• error conditions
• common usage is covered by the return value of the function and errno being set
• a second option is to use the stopinfo structure to get more information, to handle errors or special scenarios
• third, the stopinfo structure has a field spu_status that contains the CBEA SPU Status register; this can be used in conjunction with the SPE_NO_CALLBACKS flag for a more relaxed structure

Interesting Notes

• the spe_stop_info structure contains stop_reason values that are used to determine the type of stop

• arguments to an SPE program can be passed using argp, envp and the SPE_RUN_USER_REGS flag

• if the above flag is set then envp is ignored
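As a hedged sketch of acting on the stop condition (the stop_reason values and the result union follow the libspe2 specification quoted above; ctx is assumed to be a context with a program already loaded):

#include <libspe2.h>
#include <stdio.h>

/* run a loaded context and report why it stopped; a sketch only */
static int run_and_report(spe_context_ptr_t ctx)
{
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_stop_info_t stopinfo;

    if (spe_context_run(ctx, &entry, 0, NULL, NULL, &stopinfo) < 0) {
        perror("spe_context_run");
        return -1;
    }
    switch (stopinfo.stop_reason) {
    case SPE_EXIT:              /* normal termination */
        printf("SPE exited with code %d\n", stopinfo.result.spe_exit_code);
        break;
    case SPE_STOP_AND_SIGNAL:   /* SPU stop-and-signal instruction */
        printf("stop-and-signal 0x%x\n", stopinfo.result.spe_signal_code);
        break;
    default:                    /* runtime error / exception cases */
        printf("abnormal stop, reason %d\n", stopinfo.stop_reason);
    }
    return 0;
}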

Code explanation for spe_stop_info_read


spe_stop_info_read

C Specification #include <libspe2.h>

int spe_stop_info_read (spe_context_ptr_t spe, spe_stop_info_t *stopinfo)


Description

Read information about the exact conditions in which the SPE identified by spe stopped program

execution, corresponding to the last SPE_EVENT_STOPPED event. This function is intended for

usage in a multi-threaded environment. An SPE thread would run the SPE context using

spe_context_run. A main thread would be able to receive stop events, whenever the

spe_context_run call returns, that is the SPE stops, in the SPE thread. This is a non-blocking call. If

the information does not exist, for example, because the context has never been run, or has already

been read, for example, by another thread, the function will return an error with errno set to

EAGAIN. This function requires that the SPE context spe has been created with event support, that

is, the SPE_EVENTS_ENABLE flag has been set. Otherwise, it will return an error ENOTSUP.

Parameters

spe A valid pointer to the SPE context for which stop information is requested. stopinfo A pointer to

a structure of type spe_stop_info_t (specified in spe_context_run). The structure will be filled with

all information available as to the reason why the SPE program stopped execution.

Return Value On success, 0 is returned. On failure, -1 is returned and errno is set appropriately.

Possible errors include:

ESRCH The specified SPE context is invalid.

EAGAIN No data available.

ENOTSUP Event processing is not enabled for this SPE context.

See Also

spe_context_run;

• used to read the information on the stop condition

• the main thread receives stop events whenever an spe_context_run call returns; grabbing the information is a non-blocking call

• if the information does not exist or has already been read, an error is returned

• the above requires that the SPE_EVENTS_ENABLE flag has been set

SPE Event Handling

• the main thread sets up an event handler to receive notification about certain events from SPEs

• it uses an event loop to wait for events (using spe_event_wait)

• events can be a finished PPE-initiated DMA transfer, a mailbox read / write, or stopped execution.
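A minimal sketch of such an event loop, assuming every context was created with spe_context_create(SPE_EVENTS_ENABLE, NULL) so that event support is available:

#include <libspe2.h>

/* wait until all n SPE contexts have stopped, consuming stop events */
static void wait_for_stops(spe_context_ptr_t *ctx, int n)
{
    spe_event_handler_ptr_t handler = spe_event_handler_create();
    spe_event_unit_t event;
    spe_stop_info_t stopinfo;
    int i, stopped = 0;

    for (i = 0; i < n; i++) {
        event.events = SPE_EVENT_SPE_STOPPED;
        event.spe = ctx[i];
        spe_event_handler_register(handler, &event);
    }
    while (stopped < n) {
        /* block (timeout -1 = infinite) for one event at a time */
        if (spe_event_wait(handler, &event, 1, -1) > 0
                && (event.events & SPE_EVENT_SPE_STOPPED)) {
            spe_stop_info_read(event.spe, &stopinfo);
            stopped++;
        }
    }
    spe_event_handler_destroy(handler);
}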

Code explanation for spe_event_handler_create


SPE Event Functions

spe_event_handler_create

C Specification #include <libspe2.h>

spe_event_handler_ptr_t spe_event_handler_create(void)

Description

Create a SPE event handler and return a pointer to it.

Parameters


void none

Return Value

On success, a valid pointer to an SPE event handler is returned. On failure, NULL is returned and

errno is set appropriately. Possible errors include:

ENOMEM The SPE event handler could not be allocated due to lack of system

resources. EFAULT A runtime error of the underlying OS service occurred.

See Also

spe_event_handler_destroy;

• simple pointer creation, meaning there are no important associated parameters

Code explanation for spe_event_handler_destroy


spe_event_handler_destroy

C Specification #include <libspe2.h>

int spe_event_handler_destroy (spe_event_handler_ptr_t evhandler);

Description

Destroy a SPE event handler and free all resources associated with it.

Parameters

evhandler A valid pointer to the SPE event handler to be destroyed.

Return Value

On success, 0 is returned. On failure, -1 is returned and errno is set appropriately. Possible errors

include: ESRCH The specified SPE event handler is invalid. EAGAIN The specified SPE event

handler cannot be destroyed at this time since it is in use, that is an spe_event_wait call is currently

active waiting on this handler.

• simple pointer destruction

Code explanation for spe_event_handler_deregister


spe_event_handler_deregister

C Specification #include <libspe2.h>

int spe_event_handler_deregister(spe_event_handler_ptr_t evhandler,

spe_event_unit_t *event);

Description

Deregister the application's interest in SPE events of the specified nature as defined in the event

structure. It is no error to deregister interest in events that have not been registered before. Therefore,

all events on a specific evhandler and spe can be always deregistered with a single function call using

the SPE_EVENT_ALL_EVENTS mask. This function requires that the SPE context spe in event has


been created with event support, that is, the SPE_EVENTS_ENABLE flag has been set. Otherwise, it

will return an error ENOTSUP.

Parameters

• deregisters the interest for an event

Code explanation for spe_event_handler_register


spe_event_handler_register

C Specification #include <libspe2.h>

int spe_event_handler_register(spe_event_handler_ptr_t evhandler,

spe_event_unit_t *event);

Description

Register the application's interest in SPE events of the specified nature as defined in the event

structure. This function requires that the SPE context spe in event has been created with event

support, that is, the SPE_EVENTS_ENABLE flag has been set. Otherwise, it will return an error

ENOTSUP.

• used to register an application's interest in a particular event; the events are as follows:

• SPE_EVENT_OUT_INTR_MBOX: triggered when a message is sent outbound through the mailbox and at least one entry has been written

• SPE_EVENT_IN_MBOX: triggered when the inbound mailbox was full and, after at least one read of the mailbox by the SPU, the inbound mailbox can be written to again

• SPE_EVENT_TAG_GROUP: an SPU event tag group was signalled

• SPE_EVENT_SPE_STOPPED: program execution stopped

• SPE_EVENT_ALL_EVENTS: bitwise or of the flags above

Interesting Note:

• based on my understanding of the mailbox, I believe it is theoretically possible to use it for lockless transferring; more on this in a separate report / proposal

Code explanation for spe_event_wait


spe_event_wait

C Specification #include <libspe2.h>

int spe_event_wait(spe_event_handler_ptr_t evhandler, spe_event_unit_t

*events, int max_events, int timeout);

Description

Wait for SPE events.

Parameters

evhandler A valid pointer to the SPE event handler. events The pointer to the memory area where the

events will be stored. The 'events' member will contain the event bit field indicating the actual event


received, and the 'spe' member will contain pointer to the SPE context that generated the event. For

the specification of spe_event_unit_t, see

spe_event_handler_register.

max_events Maximum number of 'events' to receive. The call will return if at least one event has been
received, or if it times out. timeout Timeout in milliseconds. -1 means 'infinite'. 0 means that the

call should not wait but return immediately with as many events as are currently available up to a

maximum of max_events.

Return Value

On success, the number of SPE events received. If 0 is returned, no SPE event was received because

the request timed out. On failure, -1 is returned and errno is set appropriately. Possible errors include:

ESRCH The specified SPE event handler is invalid.

EINVAL Error in parameters.

EFAULT A runtime error of the underlying OS service occurred.

See Also

spe_event_handler_register; spe_event_handler_deregister; spe_out_intr_mbox_read;

spe_in_mbox_write; spe_mfcio_tag_status_read; spe_stop_info_read;

• waits for events to be gathered

• the timeout indicates how long to wait, a time of 0 grabs all current events available

SPE MFC Proxy Command Functions

• provide PPE-initiated DMA functionality

• commands are based on an SPE-centric viewpoint

Code explanation for spe_mfcio_put, spe_mfcio_putb, spe_mfcio_putf


spe_mfcio_put, spe_mfcio_putb, spe_mfcio_putf

C Specification #include <libspe2.h>

int spe_mfcio_put (spe_context_ptr_t spe, unsigned int lsa, void *ea,

unsigned int size, unsigned int tag, unsigned int tid,

unsigned int rid)

int spe_mfcio_putb (spe_context_ptr_t spe, unsigned int lsa, void *ea,

unsigned int size, unsigned int tag, unsigned int tid,

unsigned int rid)

int spe_mfcio_putf (spe_context_ptr_t spe, unsigned int lsa, void *ea,

unsigned int size, unsigned int tag, unsigned int tid,

unsigned int rid)

Description

The spe_mfc_put function places a put DMA command on the proxy command queue of the SPE

context specified by spe. The put command transfers size bytes of data starting at the local store

address specified by lsa to the effective address specified by ea. The DMA is identified by the tag id

specified by tag and performed according to the transfer class and replacement class specified by tid and

rid respectively. The spe_mfc_putb function is identical to spe_mfc_put except that it places a putb

(put with barrier) DMA command on the proxy command queue. The barrier form ensures that this


command and all sequence commands with the same tag identifier as this command are locally

ordered with respect to all previously issued commands with the same tag group and command

queue. The spe_mfc_putf function is identical to spe_mfc_put except that it places a putf (put with

fence) DMA command on the proxy command queue. The fence form ensures that this command is

locally ordered with respect to all previously issued commands with the same tag group and

command queue. The caller of these functions must ensure that the address alignments and transfer

size is in accordance with the limitation and restrictions of the Cell Broadband Engine Architecture.

• put: places a command on the queue of an SPE context

• the DMA is identified by the tag

• putb: places a command on the queue with a barrier; ensures that this command and all subsequent commands with the same tag are locally ordered with respect to all previously issued commands with the same tag group

• putf: places a command on the queue with a fence; ensures that this command is locally ordered with respect to all previously issued commands with the same tag group

• tid -> specifies the transfer class identifier of the DMA command

• rid -> specifies the replacement class identifier of the DMA command
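A small sketch tying put together with the tag-group completion read described further below (pull_block is a hypothetical helper name; the CBEA alignment and size limits still apply):

#include <libspe2.h>

/* move size bytes out of the SPE local store (at lsa) into main memory
   (at ea), then block until the tag group has drained */
static void pull_block(spe_context_ptr_t ctx, unsigned int lsa,
                       void *ea, unsigned int size)
{
    unsigned int tag = 1;     /* assumption: tag group 1 is otherwise unused */
    unsigned int status;

    spe_mfcio_put(ctx, lsa, ea, size, tag, 0, 0);
    spe_mfcio_tag_status_read(ctx, 1 << tag, SPE_TAG_ALL, &status);
    /* the symmetric spe_mfcio_get(ctx, lsa, ea, size, tag, 0, 0) would
       move data the other way, from ea into the local store */
}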

Code explanation for spe_mfcio_get, spe_mfcio_getb, spe_mfcio_getf


spe_mfcio_get, spe_mfcio_getb, spe_mfcio_getf

C Specification #include <libspe2.h>

int spe_mfcio_get (spe_context_ptr_t spe, unsigned int lsa, void *ea,

unsigned int size, unsigned int tag, unsigned int tid,

unsigned int rid)

int spe_mfcio_getb (spe_context_ptr_t spe, unsigned int lsa, void *ea,

unsigned int size, unsigned int tag, unsigned int tid,

unsigned int rid)

int spe_mfcio_getf (spe_context_ptr_t spe, unsigned int lsa, void *ea,

unsigned int size, unsigned int tag, unsigned int tid,

unsigned int rid)

Description

The spe_mfc_get function places a get DMA command on the proxy command queue of the SPE

context specified by spe. The get command transfers size bytes of data starting at the effective

address specified by ea to the local store address specified by lsa. The DMA is identified by the tag

id specified by tag and performed according to the transfer class and replacement class specified by tid and

rid respectively. The spe_mfc_getb function is identical to spe_mfc_get except that it places a getb

(get with barrier) DMA command on the proxy command queue. The barrier form ensures that this

command and all sequence commands with the same tag identifier as this command are locally

ordered with respect to all previously issued commands with the same tag group and command

queue. The spe_mfc_getf function is identical to spe_mfc_get except that it places a getf (get with

fence) DMA command on the proxy command queue. The fence form ensures that this command is

locally ordered with respect to all previously issued commands with the same tag group and

command queue. The caller of these functions must ensure that the address alignments and transfer

size is in accordance with the limitation and restrictions of the Cell Broadband Engine Architecture.


Parameters

spe Specifies the SPE context into whose proxy command queue the get command is to be placed.

lsa Specifies the starting local store destination address.

ea Specifies the starting effective address source address.

size Specifies the size, in bytes, to be transferred.

tag Specifies the tag id used to identify the DMA command. The range for

• get: places command on the queue of the spe context (transfer from ea to lsa)

• getb and getf work on the same principles as their counterparts in put

Code explanation for spe_mfcio_tag_status_read


SPE MFC Proxy Tag-Group Completion Functions

spe_mfcio_tag_status_read

C Specification #include <libspe2.h>

int spe_mfcio_tag_status_read(spe_context_ptr_t spe, unsigned int mask,

unsigned int behavior, unsigned int

*tag_status)

Description

The spe_mfc_tag_status_read function is used to check the completion of DMA requests associated

with the tag groups specified by the optional mask parameter. A mask of value '0' indicates that all

current DMA requests should be taken into account. The behavior field specifies whether all or any

of the specified tag groups have to be completed, or whether it just checks current completion status.

The non-blocking reading of the tag status by specifying SPE_TAG_IMMEDIATE is especially

advantageous when combining with SPE event handling. Note that after receiving a tag group

completion event, the tag status has to be read before another DMA is started on the same SPE.

Parameters

spe Specifies the SPE context for which DMA completion status is to be checked.

mask The mask parameter can be set to 0 indicating that all current DMA requests should be taken

into account. This will take into account only those DMAs started using libspe library calls, since the

library and operating system have no way to know about DMA initiated by applications using direct

problem state access. A non-zero value has to be specified according to the "Cell Broadband Engine Architecture, Version 1.0", section 8.4.3. Each of the bits 0:31 of this mask corresponds to a tag

group. These tag groups may include those used for DMA started using application direct problem

state access. behavior Specifies the behavior of the operation. The value can be one of:

SPE_TAG_ALL The function suspends execution until all DMA commands in the tag groups

enabled by the mask parameter have no outstanding DMAs in the proxy command queue of the SPE

context specified by spe. The masked tag status checks the status of the DMA requests associated with the tag group specified


• a mask value of 0 indicates that all DMA requests should be taken into account

IDEA: have tag groups for each mult / multadd routine, using SPE_TAG_ALL as the behavior

SPE Mailbox

• functions that allow the main thread to communicate with an SPE through its mailbox

• naming is based on an SPE-centric view -> spe_out... reflects a read from the mailbox

Code explanation for spe_out_mbox_read


spe_out_mbox_read

C Specification #include <libspe2.h>

int spe_out_mbox_read (spe_context_ptr_t spe, unsigned int *mbox_data, int

count)

Description

This function reads up to count available messages from the SPE outbound mailbox for the SPE

context spe. This is a non-blocking function call. If less than count mailbox entries are available, only

those will be read.

spe_out_mbox_status can be called to ensure that data is available prior to reading the outbound

mailbox.

Parameters

spe Specifies the SPE context for which the SPU outbound mailbox has to be read.

mbox_data A pointer to an array of unsigned integers of size count to receive the 32-bit mailbox

messages read by the call.

count The maximum number of mailbox entries to be read by this call.

Return Value

>0 the number of 32-bit mailbox messages read

0 no data read

-1 error condition and errno is set appropriately

Possible errors include:

ESRCH The specified SPE context is invalid.

• the function reads up to count messages in the outbound mailbox

• I believe it is read by the main thread

• spe_out_mbox_status can be called beforehand to ensure that data is there

• messages are 32-bit
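A minimal sketch of the polling pattern, combining spe_out_mbox_status (described just below) with the read:

#include <libspe2.h>

/* drain whatever 32-bit messages the SPE has posted, without blocking */
static void drain_out_mbox(spe_context_ptr_t ctx)
{
    unsigned int msg;

    while (spe_out_mbox_status(ctx) > 0) {        /* entries waiting? */
        if (spe_out_mbox_read(ctx, &msg, 1) == 1) {
            /* ... process the 32-bit message in msg ... */
        }
    }
}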

Code explanation for spe_out_mbox_status


spe_out_mbox_status

C Specification #include <libspe2.h>

int spe_out_mbox_status (spe_context_ptr_t spe)


Description

The spe_out_mbox_status function fetches the status of the SPU outbound mailbox for the SPE

context specified by the spe parameter. A 0 value is returned if the mailbox is empty. A non-zero value

specifies the number of 32-bit unread mailbox entries.

Parameters

spe Specifies the SPE context for which the SPU outbound mailbox has to be read.

Return Value

>0 the number of 32-bit mailbox messages available for read

0 no data available

-1 error condition and errno is set appropriately

Possible errors include:

• returns 0 to represent empty, or a number to represents how many messages

Code explanation for spe_in_mbox_write


spe_in_mbox_write

C Specification #include <libspe2.h>

int spe_in_mbox_write (spe_context_ptr_t spe, unsigned int *mbox_data, int

count, unsigned int behavior)

Description

This function writes up to count messages to the SPE inbound mailbox for the SPE context spe. This

call may be blocking or non-blocking, depending on behavior. The blocking version of this call is

particularly useful to send a sequence of mailbox messages to an SPE program without further need

for synchronization. The non-blocking version may be advantageous when using SPE events for

synchronization in a multi-threaded application.

spe_in_mbox_status can be called to ensure that data can be written prior to writing the SPU

inbound mailbox.

Parameters

• writes up to count messages to the SPE inbound mailbox; it can be blocking or non-blocking

• blocking will block until all count messages have been written; non-blocking is good for synchronization in a multi-threaded environment, so as not to stall your SPE
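A short sketch of the blocking form (the two payload words are made up for illustration):

#include <libspe2.h>

/* send a short parameter sequence to an SPE; SPE_MBOX_ALL_BLOCKING makes
   the call block until every message has been written, so no further
   synchronization is needed */
static void send_params(spe_context_ptr_t ctx)
{
    unsigned int msg[2] = { 0xC0DE, 42 };   /* hypothetical payload */

    spe_in_mbox_write(ctx, msg, 2, SPE_MBOX_ALL_BLOCKING);
}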

Code explanation for spe_in_mbox_status

spe_in_mbox_status

C Specification #include <libspe2.h>

int spe_in_mbox_status (spe_context_ptr_t spe)

Description


The spe_in_mbox_status function fetches the status of the SPU inbound mailbox for the SPE

context specified by the spe parameter. A 0 value is returned if the mailbox is full. A non-zero value

specifies the number of available (32-bit) mailbox entries.

Parameters

spe Specifies the SPE context for which the SPU inbound mailbox status has to be read.

Return Value

>0 the number of 32-bit mailbox messages that can be written

0 no data can be written (mailbox full)

-1 error condition and errno is set appropriately

Possible errors include:

ESRCH The specified SPE context is invalid.

EIO An I/O error occurred.

See Also

spe_in_mbox_write;

• a 0 return means the inbound mailbox is full, and a positive value represents the number of entries that can still be written

Code explanation for spe_signal_write


SPE SPU Signal Notification Functions

spe_signal_write

C Specification #include <libspe2.h>

int spe_signal_write (spe_context_ptr_t spe, unsigned int signal_reg,

unsigned int data)

Description

The spe_signal_write function writes data to the signal notification register specified by signal_reg

for the SPE context specified by the spe parameter.

Parameters

spe Specifies the SPE context whose signal register is to be written to.

signal_reg Specifies the signal notification register to be written. Valid signal

notification registers are:

SPE_SIG_NOTIFY_REG_1 SPE signal notification register 1

SPE_SIG_NOTIFY_REG_2 SPE signal notification register 2

data The 32-bit data to be written to the specified signal notification register.

Return Value

On success, 0 is returned. On failure, -1 is returned and errno is set appropriately.

Possible errors include:

ESRCH The specified SPE context is invalid.

EIO An I/O error occurred



• the function writes data to the signal notification register

• can write 32 bits of data

IDEA: another area where one can provide lock-free synchronization, but is there enough room?
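A minimal sketch of posting one word to register 1 (the SPU side is assumed to read the matching signal channel):

#include <libspe2.h>
#include <stdio.h>

static void kick_spe(spe_context_ptr_t ctx, unsigned int word)
{
    /* write 32 bits to SPE signal notification register 1 */
    if (spe_signal_write(ctx, SPE_SIG_NOTIFY_REG_1, word) < 0)
        perror("spe_signal_write");
}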

VIII.XVII BLAS Overview

REPORT: BLAS Overview

Done By: Christopher Venantius: 17 October 2006

References:

"BLAS Frequently Asked Questions (FAQ)." Netlib. 25 July 2005. 16 October 2006

<http://www.netlib.org/blas/faq.html#4>

"BLAS Quick Reference Guide." Netlib. 16 October 2006 <http://www.netlib.org/blas/blasqr.pdf>

"Intel® Math Kernel Library Quick Reference." 2005 Intel Corporation. 17 October 2006

<http://www.ualberta.ca/AICT/RESEARCH/LinuxClusters/doc/mkl/mklqref/index.htm>

Summary: The purpose of this report is to outline the basics of the BLAS package, and how it relates to our thesis.

Definitions:

T: refers to a transpose of a matrix

H: refers to a conjugate transpose of a matrix (i.e., the imaginary parts change sign when the matrix is flipped)

inc(vector): the stride for that vector, i.e. apply the operation to every inc(vector)-th value; e.g. incX = 2 means apply to every other value in x

***do not need to support this if we are using vector of vectors****

ld(matrix): refers to the leading dimension of the matrix (above all from Intel)

Information:

General: Provides methods to perform vector and matrix operations. It is broken into three sub levels, BLAS 1,

BLAS 2 and BLAS 3. This provides a way to layer the operations in terms of difficulty. (BLAS FAQ)

BLAS 1:

Provides basic scalar, vector and vector-vector operations. Operations we would want to support in our thesis are as follows (BLAS FAQ). For usage: scalars are a, b, c; the dimension is n; all other letters are vectors. Arguments are passed in a maximum 5-dimension array of data of the form (dimension, scalar, vector1, inc(vector1), vector2, inc(vector2), scalars) (INTEL).

Subroutines:

xSwap(n,,x,incx,y,incy,): x <-> y

xScal(n,a,x,incx,,,): x <- ax

xCopy(n,,x,incx,y,incy,): y <- x

xAXPY(n,a,x,incx,y,incy,): y <- ax + y

Functions:

xDot(n,,x,incx,y,incy,): dot = x^T y (normal dot product)

xDOTC(n,,x,incx,y,incy,): dot = x^H y (conjugated vector with normal vector)

(all functions from BLAS QUICK REF)
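As a minimal C sketch of the semantics of one of these routines, xAXPY with unit strides (incX = incY = 1); the name saxpy and the flat-array layout are illustrative assumptions, not the BLAS calling convention:

/* y <- a*x + y over n elements: the meaning of xAXPY with unit strides */
void saxpy(int n, float a, const float *x, float *y)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}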

BLAS 2:

Provides vector-matrix operations, done in the form (BLAS FAQ) (property, row dimension, column dimension,

subdiagonals, superdiagonals, scalar, matrix, ld(matrix), vector1, inc(vector1), scalar, vector2, inc(vector2))

Things we should support here:

Note: for matrix and vector multiplication, BLAS has already defined several different routines based on the properties of the matrix; only one case is shown here, since we are implementing our types of mult based on the attributes of the matrix.


Calculates a matrix times a vector operation:

gmv(prop of matrix, m, n, , , a, A, , x, incx, b, y, incy) = a*A*x + b*y

If prop is 'C' (complex): = a*conj(A)*x + b*y

We can define different gmv operations, like gtmv for triangular, gsmv for symmetric, etc. (from BLAS INTEL)

BLAS 3:

Provides routines to compute matrix-matrix operations where C = aAB, with A being m x k and B being k x n. The arguments are (property of transA, property of transB, row dimension of A, column dimension of B, column dimension of A, a, A, ldA, b, B, ldB, c, ldC):

gemm(transA,transB,m,n,k,a,A,ldA,b,B,ldB,c,ldC) = a*op(A)*op(B) + b*C

where op recreates the matrix from the storage system (based on the property of the matrix). Again, as in BLAS 2, we can make different gemm routines for different matrix multiplications. (From BLAS INTEL)
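As a minimal C sketch of the gemm semantics with no transposition (row-major flat arrays assumed; this is the naive triple loop spelling out a*op(A)*op(B) + b*C, not an optimized routine):

/* C <- a*A*B + b*C, with A m x k, B k x n, C m x n, all row-major */
void sgemm_nn(int m, int n, int k, float a, const float *A,
              const float *B, float b, float *C)
{
    int i, j, l;
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++) {
            float acc = 0.0f;
            for (l = 0; l < k; l++)
                acc += A[i*k + l] * B[l*n + j];
            C[i*n + j] = a * acc + b * C[i*n + j];
        }
}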

Conclusion: Basically, this outlines what we should at least cover in our thesis. It also gives us a framework on how to divide or layer the operations to build from one to another. Additionally, the ldA etc. with inc values gives us a quick way, and shows how others have implemented, finding a column or row if the matrix is stored in array format (if we go that route). As an interesting note, it assumes that matrix addition is already defined at a basic low level; this assumption may mean there is some low-level operation like array A + array B that just adds the corresponding elements together. If so, this would help a lot with the implementation. If we are going with the vector approach, then we must first guarantee that the low level has this kind of operation (issue here). The prop first-argument tag for level 2 and level 3 BLAS is a bit ambiguous to me right now, with the addition of new mult functions; however, I have only seen it used to identify conjugate examples and not major breaks.


VIII.XVIII Dense-Dense Matrix Multiplication CodeGraph Generator

Report

Dense-Dense Matrix Multiplication CodeGraph

Generator Nathan Cumpson

Note: The code for this was done using the Haskell programming language and Dr. Kahl's code graph library in the Coconut repository. This document is sensitive and may not be viewed by the public without the consent of the Coconut project supervisors.

References: Dr. Kahl, personal interview, Jan. 5th; Design Report 1.0: 4Z06 Project, Venantius, Karunaratne, Cumpson, Schulz, Fei; and CodeGraphs, Kahl.

Summary This report outlines the implementation of a dense-dense matrix multiplication code graph generator using the Coconut repository. The code graph is created as a hypergraph, with labelled nodes and hyper-edges. The nodes and

created as a hyper graph, with labelled nodes and hyper-edges. The nodes and

the edges of this code graph are a simple string type; however, in the actual

implementation, more complex node and edge types will be needed. The code

graph functionality will be briefly explained but the focus will be more on the

actual implementation.

Dense-Dense Matrix Multiplication

The first matrix operation we chose to implement is the dense-dense matrix multiplication case. This option was chosen because it is the most interesting case, since it would be considered the most computationally intensive operation. For this solution, multiple algorithms were investigated as possible implementations; however, the choice taken was the Block Row-Column multiplication, taken from the Design Report:

The row column method provides a sure way of computing the answer. However, it suffers when trying to implement large matrices, because of memory constraints. The Strassen approach can decrease the time complexity of the operation, but introduces instability issues. Therefore, the block row column approach provides a method that works on the practical scale in terms of memory constraints, and does not introduce instability. In conclusion, this is the matrix multiplication algorithm that will be employed for the project.
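As a minimal C sketch of the block row-column idea (serial semantics only, assuming square n x n matrices with n divisible by the block size; on the Cell, each block triple would instead be streamed through an SPU local store by DMA):

#define BS 64   /* assumption: block edge chosen so three blocks fit in a local store */

/* C += A*B computed one BS x BS block at a time */
void block_mult(int n, const float *A, const float *B, float *C)
{
    int bi, bj, bk, i, j, k;
    for (bi = 0; bi < n; bi += BS)
        for (bj = 0; bj < n; bj += BS)
            for (bk = 0; bk < n; bk += BS)
                /* multiply block (bi,bk) of A by block (bk,bj) of B */
                for (i = bi; i < bi + BS; i++)
                    for (j = bj; j < bj + BS; j++) {
                        float acc = C[i*n + j];
                        for (k = bk; k < bk + BS; k++)
                            acc += A[i*n + k] * B[k*n + j];
                        C[i*n + j] = acc;
                    }
}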

With this decision made, we needed a way to map out the instructions such

that they are scheduled efficiently across the Cell's SPU cores. To do this, we

create a code graph where the dependencies between instructions are represented

by the code graph itself. After the code graph is created, it must be passed to a

scheduler which will schedule the instructions of the code graph as they need to

be scheduled. For this process to be efficient, we need to create a module that

will generate a code graph based on the given matrices to be multiplied.

CodeGraphs

For constructing the code graph, we use Dr. Kahl's library for code graphs, which


also has documentation on the theory of his work (see CodeGraphDoc from the Coconut repository). I will limit the background knowledge needed for this report to the general theory behind a code graph (CodeGraphs, Kahl):

Term graphs are usually represented by graphs where nodes are labelled with function symbols and edges connect function calls with their arguments [Sleep, Plasmeijer+ 1993]. An alternative representation was introduced with the name of jungle by Hoffmann and Plump [Hoffmann, Plump 1988] for the purpose of efficient implementation of term rewriting systems (it is called "term graph" in [Plump 1999]).

A jungle is a directed hypergraph where nodes are only labelled with type information (if applicable), function names are hyperedge labels, each hyperedge has a sequence of input tentacles and exactly one output tentacle, and for each node, there is at most one hyperedge that has its output tentacle incident with that node.

For representing our declarative assembly code fragments, we use a generalisation of the jungle concept, corresponding to Ştefănescu's "flow graphs" [Gheorghe Ştefănescu 2000]:

Definition 1.1.1 A code graph G = (N, E, In, Out, src, trg, eLab) over an edge label set ELab consists of

• a set N of nodes and a set E of hyperedges (or edges),

• two node sequences In, Out ∈ N* containing the input nodes and output nodes of the code graph,

• two functions src, trg : E → N* assigning each hyperedge the sequence of its source nodes and target nodes respectively, and

• a function eLab : E → ELab assigning each hyperedge its edge label, where the label has to be compatible with the numbers of source and target nodes of the edge.

So we have that nodes can have one input and one output, but each node must connect to a hyperedge and every hyperedge must connect to a node. Thus, there are no node-to-node connections and no edge-to-edge connections. Although each node can only have one input and one output, the same is not the case for the edges. Edges can have multiple inputs from nodes and they may have multiple outputs to nodes. In the case of an edge having multiple tentacles (connections in or out), it is created from joining two code graphs together (pg 4, CodeGraphs, Kahl). The case where there are multiple out-tentacles is preferred, as this causes less stress on register allocation.

Building The Matrix Multiplication CodeGraph

Constructing this codegraph assumes that we have a set of instructions that need to be scheduled: DMA (to retrieve a block for an SPU), DMAOut (to return a block from an SPU local store), Multadd (the floating-point multiply-add instruction, fma) and Mult (the floating-point multiplication, fm). Each of these instructions is an edge of the codegraph. In addition to the instructions, we need states for SPUs such that memory for the operation is mapped into the local stores. These states are the nodes of the hypergraph and are labelled by LS acc.

MatrixOps

The first file created was a simple utility module with a set of different functions

used in relation to matrices. This module is primarily used for extracting the


dimensions of the input matrices.

Matrix Operations:

These operations are for matrices. The operations will be used in the cell simulator, specifically for creating code graphs for the input matrices. The matrix operations can only be applied to two-dimensional matrices. Perhaps this could

be extended to any dimension later.

module MatrixOps(matrixHeight, matrixWidth, matrixDimensions) where

import qualified Data.Map as Map

These are utility functions.

add :: Int -> Int -> Int

add x y = x + y

inc :: Int -> Int

inc x = add x 1

This code will find the n dimension of a n*m matrix

Example:

matrixHeight [[1,2], [2,1],[1,1]] -- a 3x2 matrix

3

matrixHeightIter :: Int -> [m] -> Int

matrixHeightIter count (hd:matrix) = matrixHeightIter (inc count) matrix

matrixHeightIter count [] = count

matrixHeight :: [m] -> Int

matrixHeight = matrixHeightIter 0

matrixRowRetrieve retrieves the first row of a matrix; it is used when finding the width of a matrix. We assume that the matrix is well formed – the input is correct.

Example:

matrixRowRetrieve [[2,3], [2,2], [3,2]]

[2,3]

matrixRowRetrieve :: [[m]] -> [m]

matrixRowRetrieve (hd:matrix) = hd

matrixWidthIter :: Int -> [m] -> Int

matrixWidthIter count (hd:row) = matrixWidthIter (inc count) row

matrixWidthIter count [] = count

matrixWidth matrix = matrixWidthIter 0 (matrixRowRetrieve matrix)

Now if we want to see both dimensions of the matrix, we can show the result

as a tuple.

Example:

matrixDimensions [[2,3], [2,2], [3,2]]

(3,2)

matrixDimensions matrix = (matrixHeight matrix, matrixWidth matrix)

Now that we can access the dimensions of a matrix we want to be able to

do matrix multiplication on 2 matrices.

matrixMult will do dense-dense matrix multiplication. This function isn't

used for any code graph creation, but more of a utility.

Example:

matrixMult [[1,0,0],[0,1,0],[0,0,1]] [[2,3],[2,2],[3,2]]

[[2,3],[2,2],[3,2]]

transpose :: [[Int]] -> [[Int]]

transpose [] = []

transpose ([]:xss) = transpose xss

transpose ((x:xs) : xss) = (x : [h | (h:t) <- xss]) :

transpose (xs : [t | (h:t) <- xss])

-- matrixMult takes two matrix inputs and if their dimensions are acceptable

-- for matrix multiplication, continue. Otherwise error


-- Note that matrix indexing starts at (1,1)

-- Example: matrixMult [[1,0],[0,1]] [[1,0],[0,1]] is [[1,0],[0,1]]

matrixMult :: [[Int]] -> [[Int]] -> [[Int]]

matrixMult matrixA matrixB =

if (matrixWidth matrixA) == (matrixHeight matrixB) then

matrixMultRow dimM dimM matrixA (transpose matrixB)

else error "matrixMult: Incorrect dimension sizes"

where

dimM = matrixHeight matrixA

matrixMultRow will multiply each row of matrixA by (transposed) matrixB. Input: counter for the rows (we cannot traverse to an empty list because we cannot build a list with a [[]] type), matrix width dimension (used for identifying cells in the code graph construction, i.e. A11, B32, etc.), matrixA and matrixB.

matrixMultRow :: Int -> Int -> [[Int]] -> [[Int]] -> [[Int]]

matrixMultRow 1 md (row:matrixA) matrixB =

(matrixMultCol row matrixB):[]

matrixMultRow m md (row:matrixA) matrixB = (matrixMultCol row matrixB):

(matrixMultRow (m-1) md matrixA matrixB)

matrixMultCol will multiply a row from matrixA by all the columns of matrixB. Input: row (from matrixA), matrixB

matrixMultCol :: [Int] -> [[Int]] -> [Int]

matrixMultCol rowA [] = []

matrixMultCol rowA (rowB:matrixB) = matrixMultCell rowA rowB:

(matrixMultCol rowA matrixB)

matrixMultCell does the computations for a single cell of the result matrix. Input: rowA, rowB

matrixMultCell :: [Int] -> [Int] -> Int

matrixMultCell rowA rowB = foldl (+) 0 . map (uncurry (*)) $ zip rowA rowB

MatrixCodeGraph

This module constructs the CodeGraph using Dr. Kahl's CodeGraph library, where the graph is mapped out by the edges, creating a node-oriented output graph. A looping structure is used to create a channel of instructions for each cell in the result matrix (referred to as the matrix C).

Building Code Graphs from parsing a Matrix

This module will 'parse' a matrix computation, such as dense-dense matrix

multiplication, and generate a code graph representing the computational op-

erations.

module MatrixCodeGraph where

import CodeGraph

import CodeGraphOps

import MatrixOps

import Char ( intToDigit, ord, chr )

import qualified Data.Map as Map

We want to build the graph based on the edges of the code graph. Implementation for this is similar to that of the CodeGraphExample module.

It starts with building the hyper-edges for the graph. We need a structure

for accessing all of the cells of a matrix. We want to build the graph so that

each cell has the necessary nodes and edges required to map out the procedure

of a dense-dense matrix multiplication. This uses each cells‘ indexes to label

the nodes and edges.

matrixMultGraph a b = mkCodeGraph () Map.empty

(Map.fromList $ zip (map Edge [1..]) $ matrixEdges dimM dimN dimM dimN)

inputs

outputs

where


dimM = matrixWidth a

dimN = matrixHeight b

inputs = mcgInputs a b

outputs = mcgOutputs dimM dimN

matrixEdges returns the codegraph required for one cell's computation. Inputs: the row index, the column index, the max row index from matrix A, and the max column index (row format) from matrix B.

matrixEdges :: Int -> Int -> Int -> Int -> [EdgeInfo String Op]

matrixEdges 1 1 dimM dimN = (cellTree 1 1 dimM dimN 1 dimN)

matrixEdges r 1 dimM dimN = (cellTree r 1 dimM dimN 1 dimN)++

(matrixEdges (r-1) dimN dimM dimN)

matrixEdges r c dimM dimN = (cellTree r c dimM dimN 1 dimN)++

(matrixEdges r (c-1) dimM dimN)

cellTree will build the edges in the codegraph with nodes as the output.

cellTree :: Int -> Int -> Int -> Int -> Int -> Int -> [EdgeInfo String Op]

cellTree r c dimM dimN index indLim =

if index == 1 then

(cellDMA r c index lblA):(cellDMA r c index lblB):

(cellMult r c index):(cellTree r c dimM dimN (index+1) indLim)

else if index <= indLim then

(cellDMA r c index lblA):(cellDMA r c index lblB):

(cellMultAdd r c index):(cellTree r c dimM dimN (index+1) indLim)

else

(cellDMAOut r c index):[]

where

lblA = "A"++(appendTuple(r,index))

lblB = "B"++(appendTuple(index,c))

cellMultAdd will build all the hyperedges for the multadds, for floating-point

calculations, needed for constructing the code graph. Inputs: row index, column

index, counter of ls/mult commands (for each cell).

cellMult is similar to cellMultAdd, except it will build the hyperedge for multiplication instructions for floating-point numbers.

cellMultAdd :: Int -> Int -> Int -> (EdgeInfo String Op)

cellMultAdd r c lsCount = edgeInfo ("MultAdd_forC"++(appendTuple(r,c))++"_"++tailCount)

["LS_acc_forC"++(appendTuple(r,c))++"_"++(tailCount)]

["LS_acc_forC"++(appendTuple(r,c))++"_"++(incTailCount)]

where

tailCount = map intToDigit (intToList lsCount)

incTailCount = map intToDigit (intToList (lsCount+1))

cellMult :: Int -> Int -> Int -> (EdgeInfo String Op)

cellMult r c lsCount = edgeInfo ("Mult_forC"++(appendTuple(r,c))++"_"++tailCount)

["LS_acc_forC"++(appendTuple(r,c))++"_"++(tailCount)]

["LS_acc_forC"++(appendTuple(r,c))++"_"++(incTailCount)]

where

tailCount = map intToDigit (intToList lsCount)

incTailCount = map intToDigit (intToList (lsCount+1))

Build a hyperedge for the DMA of a matrix cell with respect to the result

matrix cell. Inputs: the row index, the column index, the ls access counter (the

number of times local store is read/write accessed for each result cell), the cell

to be DMAed.

cellDMA :: Int -> Int -> Int -> String -> EdgeInfo String Op

cellDMA r c lsCount str = edgeInfo ("DMA_"++str++"forC"++(appendTuple(r,c)))

[str]

["LS_acc_forC"++(appendTuple(r,c))++"_"++(tailCount)]

where

tailCount = map intToDigit (intToList lsCount)


cellDMAOut :: Int -> Int -> Int -> EdgeInfo String Op

cellDMAOut r c indLim = edgeInfo ("DMAOut_forC"++(appendTuple(r,c)))

["LS_acc_forC"++(appendTuple(r,c))++"_"++tailCount]

["C"++appendTuple(r,c)]

where

tailCount = map intToDigit (intToList indLim)

Build the input nodes for the hypergraph with a unique string for the inputs.

mcgInputs :: [[Int]] -> [[Int]] -> [String]

mcgInputs a b = (mcgBuildNodes "A" (matrixHeight a) (matrixWidth a)) ++

(mcgBuildNodes "B" (matrixHeight b) (matrixWidth b))

An interface for building a set of nodes based on the dimensions of a matrix

and the iterator function for building the set of nodes.

mcgBuildNodes str r c = mcgBuildNodesIter str r c c

mcgBuildNodesIter :: String -> Int -> Int -> Int -> [String]

mcgBuildNodesIter str 1 1 mcol = (str++(appendTuple(1,1))):[]

mcgBuildNodesIter str r 1 mcol = (str++(appendTuple(r,1))):

(mcgBuildNodesIter str (r-1) mcol mcol)

mcgBuildNodesIter str r c mcol = (str++(appendTuple(r,c))):

(mcgBuildNodesIter str r (c-1) mcol)

Build the output nodes for the hypergraph as tuples with a unique number

and a unique string.

mcgOutputs :: Int -> Int -> [String]

mcgOutputs r c = mcgBuildNodes "C" r c

These are utility functions added for making labelling easier. We also use

the functions supplied by the Char module.

intToReversedList :: Int -> [Int]

intToReversedList 0 = []

intToReversedList x = (mod x 10):intToReversedList (div x 10)

intToList :: Int -> [Int]

intToList = reverse . intToReversedList

tupleToList :: (Int,Int) -> [Int]

tupleToList (x,y) = (intToList x)++(intToList y)

appendTuple :: (Int, Int) -> String

appendTuple (x,y) = map intToDigit (tupleToList (x,y))

newtype Op = Op String

deriving (Eq, Ord)

instance Show Op where showsPrec _ (Op s) = (s ++)

edgeInfo op = EdgeInfo (Op op)

MatrixCodeGraphExample

This module is used to produce a dot file that demonstrates a code graph. Given

two matrices, generate the codegraph for the multiplication of these matrices.

Matrix Multiplication Code Graph Examples

Examples of using the matrix code graph programs for building the code graphs

to represent a matrix multiplication.

import MatrixCodeGraph

import CodeGraphDot

Output two dot files to demonstrate the code graphs. (M1.dot and M2.dot)

m1 = matrixMultGraph [[1,0],[0,1]] [[2,3],[3,2]]

m2 = matrixMultGraph [[1,0,0],[0,1,0],[0,0,1]] [[2,3],[3,2],[2,2]]

test = do

dotCodeGraph1 "M1" m1

dotCodeGraph1 "M2" m2

CodeGraph M1 Output

Figure 1: The codegraph for a 2x2 matrix multiplication


From creating the dot file for the codegraph m1, the resulting output is

shown in figure 1 for the 2x2 dense-dense matrix multiplication. Even though

the first input matrix is the identity matrix, we still use our Block Row-Column

algorithm – no optimizations were made.

Conclusion

We can see from the generated code graph in Figure 1 that the operations for a matrix multiplication are mapped out with the dependencies between the instructions. The DMA instruction must be called to retrieve the data that will be used by the Mult and MultAdd instructions for each result matrix cell. However, we can notice that the size of this codegraph is bound only by the number of nodes and edges. Although this solution may build the codegraphs required, it is not optimal, as the size of the codegraph could cause a large bottleneck in computation time for the PPU to create and schedule this codegraph.

From here, we now must make a new codegraph that has a looping structure. We can notice that each of the channels for computing a result matrix cell has a pattern it follows. This pattern could be looped so that the codegraph size is reduced. Furthermore, we can also notice that each cell has the same pattern for each of its channels, leading to the need for another looping structure to iterate all the cells of the result matrix. So, for this codegraph, we need an outer looping structure for each cell in the result matrix and an inner looping structure for each instruction set within a channel.

Dr. Anand has pointed out Dr. Kahl's Loop Spec that could be useful for creating a looping structure for the code graph. The trick to making this looping structure will be to have the output of a cell's channel loop back to get the inputs for the next cell, continuously.


VIII.XIX LaTeX Advanced

REPORT: LaTeX Advanced

Done By: Nathan Cumpson: 19 November 2006

Note: To a beginner, this document may seem very long and too much to remember all at once. Try using LaTeX first, and when you are having difficulty compiling larger documents, you may want to come back to read this document. The next report on LaTeX will be the last report, outlining importing modules/files.

Summary: LaTeX usage for more advanced functionality that covers the limitations and practiced methods for

documenting. Outlines the use of all the components of a document.

Sources:

Roberts, Andrew. "Getting to Grips with Latex." Andy-Roberts. University of Leeds. 15 Oct. 2006

http://www.andy-roberts.net/misc/latex/.

Information:

Earlier I posted the first report on how to use LaTeX for beginners. It was an introduction to creating a LaTeX

template, explaining the structure of the document (preamble, body, etc.), and compiling a LaTeX file. After

extensive use with LaTeX, my experience has shown me that it is trickier to use than expected and is very picky and

finicky when it comes to compiling a document. The compiler can return many errors when there may only be one error or just a missing package; try not to fix everything at once, because some things may not need to be changed.

The document type: \documentclass{class}

If you are simply using the LaTeX file for reading later and are not too concerned with the appearance,

then the type of document you use isn't too important. When using the letter, report and article types, I didn't notice much difference besides slight margin differences and the title page for the report. The type of document is usually

referred to as its class. There are 5 classes to use:

1. Letter to write parts of a letter

2. Article to write shorter documents - most commonly used for normal documents

3. Report generates a title page for the document, good for something like this report.

4. Book to write books that makes use of the \section and \part commands (LaTeX Explained

report)

5. Thesis to write a thesis according to the graduate studies standards although I will be modifying

this for our thesis write up.

\documentclass[option, option, ... ]{class} is the command for the document class to be used, the first part of any

LaTeX document. The options can specify things from special pages (title page, abstracts), page size (letter,

a4paper), margins and even font size. Options are usually omitted for basic documents.

The document packages: \usepackage{packagename}

There are a set of included packages that already come with LaTeX, bundled in the download. If there is a package

you need and are not sure if you have it then you can search your file system for the package name and download

the missing package or you can compile your document and use the package download software that is used to find

packages online or saved somewhere on disk. The packages that are commonly used for mathematical documents

and documents that may contain program code are:

• amsmath for math commands to be used.

• times to use the PostScript Times type 1 font.

• float for giving program code a floating environment.

• graphicx for importing graphics/images into a document.

• array for creating matrices.

The document program code: \newfloat{name}{alignment}{alignment}

When trying to write code in the document, just typing in the code wouldn't work too well without creating a very awkward appearance, so to fix that we want to make the code show up just as if we typed it in a text editor. This can

be done by using the command \begin{verbatim} \end{verbatim}, which is much like the <pre> </pre> tag in

HTML. This command could be used in conjunction with the \newfloat command to give the code the look of a


floating figure. To do this, the environment must be created so it is recognized, in the preamble, similar to declaring

a function. This is done with 3 lines of code:

\floatstyle{ruled}

\newfloat{code}{thp}{lop}

\floatname{code}{Code}

\floatstyle will tell what style should be used for the figure - ruled, boxed and plain are the choices.

\newfloat will create the new environment name for the figure, in this case code. \floatname will

name the figure in the document, otherwise it will name it as the environment name. After this is written in the

preamble, all that is needed is to write the code in the document. Here is an example:

\begin{code}

\begin{verbatim}

int main(void){

printf("Hello World!");

}

\end{verbatim}

\caption{this is the hello world program}

\end{code}

The document images: \includegraphics[option]{filename}

This command is used for importing images into the document. The options that can be used are for sizing, rotating

and reflecting the image. Scaling is the most common option used, typed as [scale=0.5]. After options, the filename

is typed as you would for any file system. However, the images that are included must be in the PostScript image format, EPS. The file extension is .eps, and this image can be created using any Adobe imaging product or, for an open source solution, you can use GIMP. To include the image in a figure, you would use the following code:

\begin{figure}[h]

\begin{center}

\includegraphics[scale=0.4]{images/LRgraph}

\end{center}

\caption{LR Graph for Bottom-up parsing. Note that the order is incorrect. DOT ordering

is not working }

\end{figure}

The document tables: \begin{tabular}{ columns alignments }

The table command is used as an environment for the data. The table consists of rows and columns, creating

individual cells. When you are importing data into the table, insert the data into each cell individually --- kind of

obvious. The data is written across a row for each column, then goes on to the next row. Columns are separated by

the character & and rows are separated by the newline command

\\. An example is:

\begin{table}

\begin{tabular}{ l | r } % this will create a table with 2 columns aligned left and right

respectively. There is also a inner line separator between columns.

Age & Height \\

12 & 155 \\

14 & 167 \\

8 & 130

\end{tabular}

\caption{This is a table}

\end{table}

The environment used to surround the table is \begin{table} which is used to create a floating figure for the table.

You should note that images and tables are each numbered according to their appearance in the document, but each

have their own respective numbering. For example you might see: Image 1, Image 2, Table 1, Image 3, Table 2.

The documents items: \begin{itemize}, \begin{enumerate}

Creating a list of items or a list of ordered items is a function that is commonly used in documents and LaTeX has an

environment for the abilities as well. The command environments are \begin{itemize} ...

\end{itemize} and \begin{enumerate} ... \end{enumerate}. For both of these environments, when you want to add an

item to the list, use the command \item. Here is an example:

\begin{itemize}

\item milk


\item eggs

\item bread

\begin{itemize} % nesting a 2nd level for items

\item bagels

\item wraps

\item muffins

\end{itemize}

\end{itemize}

\begin{enumerate}

\item President

\item Vice-President

\item Chief of Production

\item Chief of Research

\end{enumerate}

These are two different ways to list items. There are also options available for lists to customize the type of listing you're using, such as dots, circles, diamonds, blocks, etc. You should also note there are only 4 levels of nesting allowed.

The documents math:

Matrices: \begin{array}{column alignment}

Matrices are a big part of what we do, so being able to add them to our documents is useful. To use matrices, include the package array in the preamble. Matrices are created the exact same way a table is created, except using the term array instead of tabular, and we can encapsulate the matrix in brackets to make it look better.

Here is some sample code:

\left[\begin{array}{ c c } % this will create a matrix with 2 columns and their data will

be centered

1 & 2 \\

3 & 4

\end{array}

\right]

Symbols: \symbolname

There are hundreds of different math symbols in LaTeX that can be used. These symbols are all treated as

commands and are named the way they are said. Here are some sample symbols:

\alpha

\beta

\omega_1 % omega with a subscript 1

\Sigma_{n=1}^{\infty} % Sigma with a subscript n=1 and a superscript infinity (\infty)

\rightarrow % creates an arrow pointing right -->

% note: there is no \Alpha or \Beta command; the uppercase letters are just the Latin

% letters A and B, while \alpha and \beta take subscripts and superscripts as usual

Equations: \begin{equation}, \begin{eqnarray}, \begin{displaymath}, \begin{math}, $$, $

There a number of ways to show math formulas. Anytime that you use a math symbol, you should tell the document

that it is math. To add math inline with the text of a document, you can use the environment \begin{math} ...

\end{math} or you can use $ ... $ which does the same thing. Here, the format of the text will look like an equation

but it will still be inline with the text. If you want to separate the equation form the text, you can use

\begin{displaymath} ... \end{displaymath} or you can use $$ ... $$ which does the same thing. This will center the

math and give it some padding from the rest of the document. However, this can only display the math on one line.

For a similar effect, you can use the environment \begin{equation} ... \end{equation}. If you need to write the math on multiple lines, then you do this using the environment \begin{eqnarray} ... \end{eqnarray}, which works exactly

like the table except without the alignment block. Here's some examples:

$ a + b = c $

$$ \alpha_1 \ldots \alpha_n \subset B $$

\begin{eqnarray*} % we put the * symbol in so that the equations wont be numbered

a & = & (x + y) * (z + y)\\

&= & (x + z) * y

\end{eqnarray*}


You should note that the first row should always have the proper number of columns, with none left blank. Blank cells in the first row should work as well, but they can sometimes give compile errors that need working around; we are still looking for a good solution for this. Blank cells in other rows are fine.

Conclusion: LaTeX can be very frustrating at first, especially since the compilers are very picky and there isn't a ton of documentation on the specifics of commands. If you have questions about why your code isn't compiling, check your syntax, check that all the packages you need are included, and if you still have problems, send me an email and I'll explain more to you.


VIII.XX LaTeX Usage Report

Done By: Nathan Cumpson

Note: Running LaTeX on a Windows OS requires the MiKTeX software, used for creating LaTeX documents on the 95/98/2000/XP platforms. LaTeX must be executed in the directory containing the .tex file. To compile a LaTeX document in DOS, enter: latex <filename>.tex OR latex <filename>.

This will compile your document into a .dvi file, which can be viewed using MiKTeX's previewer, Yap. pdfTeX is required for outputting a PDF file directly.

Reference:

Roberts, Andrew. "Getting to Grips with LaTeX." University of Leeds, 15 Oct. 2006. http://www.andy-roberts.net/misc/latex/.

Summary: LaTeX is very similar to other markup languages we may be familiar with. There is header information, the document body, and sections within the body. Structures like tables, bibliographies, captions, figures and formatting are all possible, as well as objects like images and mathematics.

Information:

MiKTeX, the LaTeX typesetter for Windows, can be found at http://www.miktex.org and is a free download; installation is essentially self-extracting. To compile any LaTeX document, you must first create a text document using your favourite text editor: Notepad (whooo!), TextPad, Crimson Editor, Notepad2, etc. Save your document as a .tex file. Using your DOS command prompt, navigate to the folder containing your .tex file and enter the command latex <filename> OR latex <filename>.tex. Your output will be a .dvi file. To convert your .dvi file into a PostScript file, use: dvips <filename>.dvi -o <filename>.ps. The -o says to save the output into a file. Now that you have your PostScript file, you can use: ps2pdf <filename>.ps <filename>.pdf. The result is your output PDF file.

Getting Started:

Now that you have your text editor ready, you can start writing a document. Like an HTML file, a .tex document has a header. The header contains information about the document class (article, journal, thesis, book, etc.) and the packages to use; a package is a set of macros stored together for reuse. For example (Roberts):

% simple.tex - A simple article to illustrate document structure.

\documentclass{article}

\usepackage{times}

\begin{document}

Here the first line starts with a % symbol, marking a comment. The next line sets the document class, which formats the output. The third line tells the compiler which packages to use, in this case the PostScript Times type 1 font. The last line begins the body of the document, where the content is located. The first content that you may come across in the document is the topmatter. There is no explicit \topmatter command; instead, a set of commands comprises the topmatter: \title, \author, \date, \maketitle, etc. Here is another example (Roberts):

\title{How to Structure a \LaTeX{} Document}

\author{Andrew Roberts\\

School of Computing,\\

University of Leeds,\\

Leeds,\\

United Kingdom,\\

LS2 1HE\\

\texttt{[email protected]}}

\date{\today}

\maketitle


The \title command is pretty self-explanatory. \author takes a string of text, using the command \\ as the newline. \texttt formats the email address in a monospace font, such as Courier. The \date{\today} command simply inserts today's date in a date format. The last command, \maketitle, finishes the topmatter and displays its output. Following the topmatter come the primary sections of the document. These sections are organized by commands, in sequence by level:

\part{ part }

\chapter{ chapter }

\section{ section }

\subsection{ subsection }

\subsubsection{ subsubsection }

\paragraph{ paragraph }

\subparagraph{ subparagraph }

note: to suppress numbering, use the starred form, e.g. \section*{ unnumbered section }

and the special abstract command,

\begin{abstract}

...

\end{abstract}

which must follow the topmatter and precede the content. Once you have finished writing, \end{document} closes the document.
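Putting the pieces together, here is a minimal sketch of a complete document (the title, author, and section names are placeholders, not from any real document):

% skeleton.tex - a minimal document skeleton

\documentclass{article}

\begin{document}

\title{A Placeholder Title}

\author{A. Student}

\date{\today}

\maketitle

\begin{abstract}

One or two sentences summarizing the document.

\end{abstract}

\section{Introduction}

Body text goes here.

\end{document}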

Conclusion: It may not be obvious at the moment that LaTeX has some major advantages, because it takes some experience before you can use the more advanced commands that make documents easier to produce: documents which are consistent, manageable and, possibly the best part, processed by an open-source application. Many theses at the university level make use of LaTeX, so it is only appropriate that our thesis use LaTeX as well, especially since we are producing documents as part of the Coconut project.


VIII.XXI Memory Mapping for a Dense Matrix

Done By: Christopher Venantius: November 12, 2006

Introduction:

The purpose of this report is to outline how our thesis is approaching the storing of a dense matrix. The issues handled in this report concern first the abstract notion of the storing order of the blocks, followed by the storing of the entries in memory. Additionally, the issues of alignment and padding are also discussed.

Looking at the Block Level:

The dense matrix is going to be broken into blocks of size Xn by Xn, basically the

dimensions of the block are multiples of X. The value of X is based on the underlying hardware,

and how much data can be read at a time. For the case of the STI cell, this will be a quadword,

therefore; our value of X is four. The value of n depends on the amount of fas t s torage each

processor in the underlying system has (for the STI this is dependent on the LS) . The las t block

of each row in the matrix and the last block of each column runs the risk of being partially full, if

the dimensions of the matrix are not multiples of X. If this situation arises, and realis tically it

will often, we have a few options to handle the case.

Option One: is to create blocks that handle the special cases with padding. We would save a block of partially useful data, and the rest would be zeroed out. This keeps the storage routine relatively simple; however, we end up wasting memory by storing data that is irrelevant.

Option Two: is to create special blocks to handle the special cases. We can define a block of size Xn by R for the blocks that make up the last block of each row (except for the last row), a block of size C by Xn for the blocks that make up the last block of each column (except for the last column), and a block of size C by R for the block that makes up the last block of the last column and row. In this case R < Xn and C < Xn. The issue with going this route is having to store the dimensions of the special blocks, since these values are not standard for every matrix; they vary as we increase X and n. For example, if X is four (as with the STI Cell) and n is 3, we have a block size of 12x12, so R and C can each take any of the 11 values in [1, 11]. However, if we treat each block as a matrix, and since our type system stores the dimensions of a matrix as part of its type, we are not creating more storage to save this information.

For a practical case, say we have a 15x15 matrix with 4x4 blocks. We have a memory location where the matrix is stored, and its type tells us it is 15x15; therefore, we know that this matrix is really stored as a series of blocks. Since each block is a matrix, we also know its dimensions. Additionally, since we are planning to store the blocks in row order, we know that memory holds, in sequence, the blocks of the first row, followed by the second, third, etc. In our specific example, each of the first (15/4) rounded down = 3 rows of blocks starts with (15/4) rounded down = 3 normal 4x4 blocks, followed by one special block of size 4 by [15 - ((15/4) rounded down) * 4] = 4x3. The last row of blocks starts with 3 special blocks of size [15 - ((15/4) rounded down) * 4] by 4 = 3x4, and lastly, the bottom-right corner block is a [15 - ((15/4) rounded down) * 4] by [15 - ((15/4) rounded down) * 4] = 3x3 block of data. The point of this is to show that through a simple calculation, with the only information needed being the overall dimensions of the matrix and the dimensions of the blocks, one can get the dimensions of all blocks, including special blocks. Therefore, one can extract the blocks, knowing where each one starts and ends.

The above is therefore a feasible solution; additionally, we might not have to do all of the above calculation. As discussed before, if we define a matrix that is larger than our block size as one that is composed of a series of sub-matrices, all we need to know is the sub-matrices. Using the 15x15 matrix example, we could just know that it is defined as 4 blocks by 4 blocks; the blocks themselves, being sub-matrices, would tell us their dimensions. More research and clarification on defining a matrix recursively has to be done before we can assume it will work this way.
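As a minimal sketch of the block-dimension calculation described above (the names here are illustrative, not from the project code), and assuming row-order block storage with square blocks of side bs = X*n:

/* For a rows x cols matrix split into bs x bs blocks, the block at
   block-coordinates (bi, bj) has dimensions block_dim(rows, bs, bi)
   by block_dim(cols, bs, bj); only the last block in each direction
   can be partial. */
#include <stdio.h>

static int block_dim(int total, int bs, int bi) {
    int full = total / bs;            /* number of full-size blocks */
    int rem  = total - full * bs;     /* size of the special block  */
    return (bi < full) ? bs : rem;    /* last block may be partial  */
}

int main(void) {
    int rows = 15, cols = 15, bs = 4; /* the 15x15 example with 4x4 blocks */
    int nbr = (rows + bs - 1) / bs;   /* rows of blocks: 4 */
    int nbc = (cols + bs - 1) / bs;   /* columns of blocks: 4 */
    for (int bi = 0; bi < nbr; bi++)
        for (int bj = 0; bj < nbc; bj++)
            printf("block (%d,%d) is %dx%d\n", bi, bj,
                   block_dim(rows, bs, bi), block_dim(cols, bs, bj));
    return 0;
}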


VIII.XXII CABx Dense Parallel Algorithms With B Being a Vector

Done By: Christopher Venantius: 15 October 2006

Summary: Outlines a 1D and a 2D approach to parallelizing the process, with data parallelism at the forefront.

References: Karypis, George. Introduction to Parallel Computation: Dense Matrix Algorithms. Presented at the

University of Minnesota.

Information:

First approach, 1D (break A into rows or columns; for our purposes we will stick to rows). Given A (m x n) and B an n-vector, break A into groups of m/p rows, where p is the number of processors (labelled P0 to Px), and break the vector B the same way:

A*B = [a11 a12 ... a1n] [b1]         [P0 ]
      [a21 ...        ] [b2]   -->   [P1 ]
      [...            ] [..]         [...]
      [am1 ...     amn] [bn]         [Px ]

where each processor holds all of the information across its "rows".

Since each processor needs all of vector B, we do an all-to-all broadcast. Picturing which piece of B each processor starts with and where the pieces travel, the broadcast looks like:

P0 [b1,   up,   up,   ... up]
P1 [down, b2,   up,   ... up]
P2 [down, down, up,   ... up]
...
Px [down, down, down, ... bn]

After the broadcast, each processor has all the information to compute its desired entries of C.
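As a minimal serial sketch of the local work in this 1D approach (the names are illustrative, not from the project code): once the broadcast has given every processor the full vector, each processor multiplies only its own rows.

/* Each "processor" owns rows_per_p consecutive rows of A (row-major)
   and, after the all-to-all broadcast, the full n-vector b. Its local
   work is then an ordinary dense matrix-vector product. */
void local_matvec(const float *A_local,  /* rows_per_p x n, row-major   */
                  const float *b,        /* full vector after broadcast */
                  float *c_local,        /* this processor's piece of C */
                  int rows_per_p, int n)
{
    for (int i = 0; i < rows_per_p; i++) {
        float sum = 0.0f;
        for (int j = 0; j < n; j++)
            sum += A_local[i * n + j] * b[j];
        c_local[i] = sum;
    }
}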

Second approach, 2D (break A into sub-matrices for each processor - the checkerboard approach).

Given: same inputs as above

Break A into sub-matrices according to the number of processors one has, with the emphasis that the sub-matrices in corresponding row breaks have the same number of columns for each processor split, and are square if possible. (So in the x = 4 example below, P0 and P2 have the same number of columns, and P1 and P3 have the same number of columns.)

If A = [a11 a12 ... a1n]   then   A = [P0            ...            P(sqrt(x)-1)]
       [...            ]              [P(sqrt(x))    P(sqrt(x)+1)   ...         ]
       [am1 ...    amn ]              [P(2*sqrt(x))  P(2*sqrt(x)+1) ...         ]

So, for example:

x = 4:  [P0 P1]     x = 8:  [P0 P1]     x = 9:  [P0 P1 P2]
        [P2 P3]             [P2 P3]             [P3 P4 P5]
                            [P4 P5]             [P6 P7 P8]
                            [P6 P7]

Now we distribute the B vector as follows: we break the B vector into sections based on the number of columns each corresponding row of processors has, diagonally align the vector breaks, and distribute each piece down its column. Using the x = 9 example:

B = [b1    up    up]
    [down  b2    up]
    [down  down  b3]

where b1 is associated with P0, P3, P6; b2 with P1, P4, P7; and b3 with P2, P5, P8.

We apply the operation to get a grid of partial results (again using the x = 9 example):

[Res0 Res1 Res2]
[Res3 Res4 Res5]
[Res6 Res7 Res8]

and apply an all-to-one reduction across each row to get the resulting answer, as sketched below:

C = [Res0 + Res1 + Res2]
    [Res3 + Res4 + Res5]
    [Res6 + Res7 + Res8]

(Karypis - whole concept)
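A minimal sketch of that all-to-one reduction for one row of the processor grid (illustrative names; in practice the partial vectors would arrive over the interconnect rather than through an array of pointers):

/* Sum q partial result vectors of length len into dst. Each partial
   comes from one processor in the same row of the processor grid. */
void reduce_row(float *dst, const float *partials[], int q, int len)
{
    for (int i = 0; i < len; i++) {
        float sum = 0.0f;
        for (int p = 0; p < q; p++)
            sum += partials[p][i];
        dst[i] = sum;
    }
}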


Conclusion:

This report is not going to make a claim about which of the two algorithms is better to implement; a later report on the complexity and memory usage of these algorithms would determine that. In fact, both methods could be coded, applied, and timed to see which one is more efficient. The purpose of this report is to familiarize one with the two major methods, and to serve as a stepping stone to understanding CABx dense where B is a matrix.


VIII.XXIII CABx Dense Parallel Algorithms With B Being a Matrix

Done By: Christopher Venantius: 15 October 2006

Summary: The following outlines 2D and 3D approaches to CABx dense computations. NOTE: read "CABx Dense Parallel Algorithms with B being a Vector" before reading the following article; it assumes that approach is understood.

References: Karypis, George. Introduction to Parallel Computation: Dense Matrix Algorithms. Presented at the

University of Minnesota.

Information:

2D approach:

Given A, B and C as nxn matrices. (It is easy to go through the algorithm with nxn matrices, and if necessary, matrices can be expanded to nxn by adding zeroed rows/columns. Note that if that were the case, space-saving algorithms could be used so as not to waste memory on the extra rows/columns, and a zero-matrix attribute could be added so that an operation involving a sub-matrix of zeros can "skip" the computation.) Break the matrices A, B and C the same way we broke matrix A in the 2D approach of the report "CABx dense parallel algorithms with B being a vector". Therefore, the matrices are of the form (assume we have x processors):

A/B/C = [P0           ...            P(sqrt(x)-1)]
        [P(sqrt(x))   P(sqrt(x))+1   ...         ]
        [...          ...            P(x-1)      ]

Now, assuming the example of x = 9, and identifying each sub-matrix as A00, A01, etc., we have the following sub-matrices:

A = [A00 A01 A02]   B = [B00 B01 B02]   C = [C00 C01 C02]
    [A10 A11 A12]       [B10 B11 B12]       [C10 C11 C12]
    [A20 A21 A22]       [B20 B21 B22]       [C20 C21 C22]

Now we apply the following transformation to initialize the alignment: shift row i of A left by i positions, and shift column j of B up by j positions, with i and j running from zero to sqrt(x)-1:

A = [A00 A01 A02]        B = [B00 B11 B22]
    [A11 A12 A10]            [B10 B21 B02]
    [A22 A20 A21]            [B20 B01 B12]

Now we compute the values of C through a series of summed products, where between passes each processor sends its block of A to the processor on its left and its block of B to the processor above it:

A = [left left left]   B = [up up up]
    [left left left]       [up up up]
    [left left left]       [up up up]

To clarify, I will run through two passes of the above example. Pass one does the following:

C = [C00 + A00*B00   C01 + A01*B11   C02 + A02*B22]
    [C10 + A11*B10   C11 + A12*B21   C12 + A10*B02]
    [C20 + A22*B20   C21 + A20*B01   C22 + A21*B12]

Pass two moves the values of A and B around as specified and accumulates the products again:

C = [C00 + A01*B10   C01 + A02*B21   C02 + A00*B02]
    [C10 + A12*B20   C11 + A10*B01   C12 + A11*B12]
    [C20 + A20*B00   C21 + A21*B11   C22 + A22*B22]

Therefore, each processor only has to do three passes (in general, sqrt(x)) of a matrix-times-matrix computation. Ideally, each processor would be working with single data elements rather than matrices, but most likely it will be matrices. This align-and-shift scheme is essentially Cannon's algorithm; a minimal sketch follows.
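A minimal serial sketch of the scheme on a Q x Q grid of blocks, using plain arrays in place of processors and a hypothetical 4x4 Block type (none of these names come from the project code; C must be zeroed by the caller):

#define Q 3                               /* sqrt(x) for x = 9 */

typedef struct { float e[4][4]; } Block;  /* illustrative block size */

static void mul_add(Block *c, const Block *a, const Block *b) {
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            for (int k = 0; k < 4; k++)
                c->e[i][j] += a->e[i][k] * b->e[k][j];
}

static void shift_row_left(Block row[Q], int s) {   /* rotate one row of blocks */
    Block tmp[Q];
    for (int j = 0; j < Q; j++) tmp[j] = row[(j + s) % Q];
    for (int j = 0; j < Q; j++) row[j] = tmp[j];
}

static void shift_col_up(Block M[Q][Q], int j, int s) { /* rotate one column */
    Block tmp[Q];
    for (int i = 0; i < Q; i++) tmp[i] = M[(i + s) % Q][j];
    for (int i = 0; i < Q; i++) M[i][j] = tmp[i];
}

void block_multiply(Block A[Q][Q], Block B[Q][Q], Block C[Q][Q]) {
    /* initial alignment: shift row i of A left by i, column j of B up by j */
    for (int i = 0; i < Q; i++) shift_row_left(A[i], i);
    for (int j = 0; j < Q; j++) shift_col_up(B, j, j);
    for (int pass = 0; pass < Q; pass++) {
        for (int i = 0; i < Q; i++)       /* every grid position multiplies */
            for (int j = 0; j < Q; j++)   /* its current pair of blocks     */
                mul_add(&C[i][j], &A[i][j], &B[i][j]);
        for (int i = 0; i < Q; i++) shift_row_left(A[i], 1);  /* A one left */
        for (int j = 0; j < Q; j++) shift_col_up(B, j, 1);    /* B one up   */
    }
}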

3D approach:


Basically, the same idea as above, but asking the question: is it possible to parallelize the block-product computations themselves? To compute C00 in pass one, we would break A00 dot B00 down further. Assuming A00 and B00 are themselves split into t x t sub-blocks:

A00 = [A00 A01 ... A0t]
      [A10 ...        ]
      [At0 ...     Att]

and the same breakdown for B00, we define our sub-tasks as:

D000 = A00*B00   D001 = A00*B01   ...   D00t = A00*B0t
D100 = A01*B10   D101 = A01*B11   ...   D10t = A01*B1t
...
Dt00 = A0t*Bt0   Dt01 = A0t*Bt1   ...   Dt0t = A0t*Btt

Therefore, C00 = the sum over k of Dk00, and similarly for the other entries.

(Whole concept from Karypis)

Basically, one can parallelize all of the dot-product computations. This could be a solution if the low level (hardware) does not support matrix multiplication: we can then feed the individual dot products as multiplications of elements and sum them after. Or we can pass the sub-matrix operations to a second layer that divides them into dot products of vectors and interfaces them with the hardware. A minimal sketch of the decomposition follows.
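A minimal sketch of the 3D decomposition for one target block C00, assuming the operand blocks are split into T sub-blocks along the shared dimension, each BS x BS (all names are illustrative, not from the project code):

/* Each partial product D[k] = A0k * Bk0 is independent, so each could
   be handed to a different processor; C00 is the sum of the partials. */
#include <string.h>

#define T  2
#define BS 4

typedef struct { float e[BS][BS]; } Sub;

static void mul_add(Sub *c, const Sub *a, const Sub *b) {
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                c->e[i][j] += a->e[i][k] * b->e[k][j];
}

/* A0 holds A00..A0(T-1), the first block row of A00's sub-blocks;
   B0 holds B00..B(T-1)0, the first block column of B00's sub-blocks. */
void compute_C00(Sub *C00, const Sub A0[T], const Sub B0[T]) {
    Sub D[T];
    memset(D, 0, sizeof D);
    for (int k = 0; k < T; k++)           /* independent partial products */
        mul_add(&D[k], &A0[k], &B0[k]);   /* (could run in parallel)      */
    for (int k = 0; k < T; k++)           /* reduction: sum the partials  */
        for (int i = 0; i < BS; i++)
            for (int j = 0; j < BS; j++)
                C00->e[i][j] += D[k].e[i][j];
}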

Conclusion: The algorithm outlined has the advantage of breaking the computation into blocks of data. This allows a processor to fetch only the blocks it needs, instead of a row implementation where it would need one whole row and one whole matrix to compute. Whether the 3D approach is worth adding would be determined when a complexity and memory analysis is done. Dividing the processes using the 3D approach obviously leads to less memory usage per processor, but more memory accesses or more computations for a processor to handle. This might not take advantage of the Cell architecture and the underlying LS.