Loughborough University Institutional Repository
The design of image processing algorithms on
parallel computers
This item was submitted to Loughborough University's Institutional Repository by the/an author.
Additional Information:
• A Doctoral Thesis. Submitted in partial fulfilment of the requirements for the award of Doctor of Philosophy at Loughborough University.
Metadata Record: https://dspace.lboro.ac.uk/2134/27573
Publisher: © Saad Ali Amin
Rights: This work is made available according to the conditions of the Creative Commons Attribution-NonCommercial-NoDerivatives 2.5 Generic (CC BY-NC-ND 2.5) licence. Full details of this licence are available at: http://creativecommons.org/licenses/by-nc-nd/2.5/
Please cite the published version.
THE DESIGN OF IMAGE PROCESSING
ALGORITHMS
ON PARALLEL COMPUTERS
By
Saad Ali Amin, MPhil., BSc.
A Doctoral Thesis submitted in partial fulfilment
of the requirements for the
Award of the Degree of
Doctor of Philosophy
of Loughborough University of Technology
April 1993
© Saad Ali Amin, 1993
CERTIFICATE OF ORIGINALITY
This is to certify that I am responsible for the work submitted in this thesis,
that the original work is my own except as specified in acknowledgements
or in footnotes, and that neither the thesis nor the original work contained
therein has been submitted to this or any other institution for a higher degree.
S.A.Amin
To the memory of my father Ali Amin.
To my mother, for her unfailing support.
To my wife, Maha, for her patience and love; the two things I needed most
in hard times.
To my daughter Shahad and my son Ali. And last, but by no means least,
to my brother and two sisters.
Acknowledgments
I wish to express my sincere thanks and gratitude to Professor D.J. Evans,
the director of the research, for his diligent guidance, useful suggestions and
advice throughout the course of the research and the preparation of this thesis,
and also for his kindness and support, without which this work would never
have been completed.
My thanks also go to my supervisor, Dr. A. Benson.
My gratitude is also extended to my friends Dr. M.A. Saeed and Dr.
W.S. Yosif for their helpful comments and suggestions during the preparation
of this thesis.
My deep appreciation also goes to my dear sister Sulaf and her husband
Fiasal for their endless support and love.
Finally, I wish to acknowledge the financial support offered by the
Scientific Research Council in Iraq.
ABSTRACT
This thesis is a study of the design of parallel algorithms for low-level digital
image processing on Very Large Scale Integration (VLSI) processor arrays which
are implemented on both a Sequent Balance (MIMD) via an Occam simulator and a
transputer network running the Transputer Development System (TDS). The Occam
programming language is used as a tool to simulate and map systolic arrays for the
image processing algorithms proposed.
In chapter 1 a general introduction to parallel processing is presented. This
chapter covers a wide selection of the principles of the significant parallel computer
architectures, and various classifications of parallel architectures are presented.
Chapter 2 starts with a brief review of the main contenders among VLSI-oriented
computing systems. The transputer architecture and the associated language
Occam are then described, together with the Transputer Development System (TDS),
followed by a detailed description of the hardware used for the research. The
chapter concludes with a description of the basic techniques for the design of
systolic arrays.
Chapter 3 presents the techniques for filtering digital images, for both
low-pass and high-pass filtering, and includes the basic mathematical theory involved,
followed by a discussion of the use of systolic arrays and the various methods for
implementing these algorithms on different types of parallel architectures.
To achieve greater efficiency in the processor arrays, systolic arrays for one-
dimensional and two-dimensional digital convolution are introduced and discussed
in chapter 4. The implementation of these systolic arrays on transputer networks
is also presented and their performance on transputer networks analysed.
In chapter 5 systolic array designs for the Laplacian and gradient operators are
presented, and the transputer implementation of these systolic arrays is evaluated with
different sizes of transputer networks. The systolic array designs for the
Prewitt and Sobel operators are also introduced.
The parallel implementation of the low-level image processing filters is
covered in chapter 6. A set of systolic systems for digital image filters, namely the
Sigma, inverse gradient, mean and weighted mean filters, is discussed, and the
implementation of each of these designs on transputer networks is examined.
Chapter 6 also presents a low-level image processing system software library,
which includes systolic designs for both low-pass and high-pass filters. The
motivation for the work is to develop a methodology for the implementation of an
image processing library on the Sequent Balance.
Further, the problem of solving banded Toeplitz systems occurs frequently in
filter design. In chapter 7 a new operator method for their solution is
introduced and shown to be applicable. Some systolic array designs to solve this
problem are discussed.
Some conclusions and a discussion of further research topics in image
processing and in systolic array algorithms are the subject of the last chapter.
The thesis concludes with a comprehensive reference list and Appendices,
together with a selection of systolic programs in Occam.
CONTENTS

ABSTRACT

1 INTRODUCTION TO PARALLEL COMPUTER ARCHITECTURES
1.1 INTRODUCTION
1.2 MAIN MOTIVATIONS
1.3 PIPELINED COMPUTERS
1.4 DATA-FLOW COMPUTERS
1.5 ARRAY PROCESSORS
1.6 DESIGN CLASSIFICATIONS
1.6.1 Flynn's Classification
1.6.2 Shore's Classification
1.6.3 Other Classification Approaches
1.7 MULTIPROCESSOR STRUCTURE: PROCESSING AND COMMUNICATION

2 THE VLSI TECHNOLOGY AND SYSTOLIC PARADIGM
2.1 INTRODUCTION
2.2 VLSI-ORIENTED ARCHITECTURES
2.2.1 The WARP Architecture
2.2.2 The Wavefront Array Processor (WAP)
2.2.3 The CHIP Architecture
2.3 INMOS TRANSPUTERS AND OCCAM
2.3.1 Transputer Architectures
2.3.2 OCCAM
2.3.3 Transputer Development System
2.3.4 Performance Measurements of a Transputer Network
2.3.5 The Transputer Network Used for this Research
2.4 THE SEQUENT BALANCE 8000 SYSTEM
2.5 SYSTOLIC SYSTEM FOR VLSI COMPUTING STRUCTURES
2.5.1 An Environment for the Development of the Systolic Approach
2.5.2 Systolic Algorithms, Constraints and Classification
2.5.3 Systolic Array Simulation

3 FUNDAMENTALS OF DIGITAL IMAGE PROCESSING
3.1 INTRODUCTION
3.2 LOW-LEVEL IMAGE PROCESSING ALGORITHMS
3.2.1 Parallel Paradigm
3.2.1.1 Image Parallelism
3.2.1.2 Task Parallelism
3.3 IMAGE FILTERING
3.3.1 Digital Approximations to the Gradient and Laplacian Operators
3.3.2 Low Pass and High Pass Filters
3.4 VLSI IMPLEMENTATION FOR LOW LEVEL IMAGE PROCESSING
3.4.1 Systolic Array Implementation
3.4.2 Pyramid Architecture

4 SYSTOLIC DESIGNS FOR DIGITAL CONVOLUTION
4.1 INTRODUCTION
4.2 ONE DIMENSIONAL CONVOLUTION DESIGN
4.2.1 Problem Definition
4.2.2 Systolic Design
4.3 TWO DIMENSIONAL CONVOLUTION
4.3.1 Problem Definition
4.3.2 Computation of 2D Convolution as 1D Convolution
4.3.3 Systolic Array Design for 2D Convolution
4.4 MULTIDIMENSIONAL CONVOLUTION
4.5 CONSTANT TIME OPERATION
4.6 TRANSPUTER NETWORK FOR ONE DIMENSIONAL CONVOLUTION
4.7 PERFORMANCE OF THE ONE DIMENSIONAL CONVOLUTION SYSTOLIC DESIGN ON THE TRANSPUTER NETWORK
4.8 TRANSPUTER NETWORK FOR TWO DIMENSIONAL CONVOLUTION
4.9 PERFORMANCE OF THE TWO DIMENSIONAL CONVOLUTION SYSTOLIC DESIGN ON THE TRANSPUTER NETWORK
4.10 ANALYSIS AND COMPARISON OF THE TWO-DIMENSIONAL SYSTOLIC ARRAY

5 PARALLEL IMPLEMENTATION OF THE LAPLACIAN AND GRADIENT OPERATORS IN COMPUTER VISION
5.1 INTRODUCTION
5.2 SYSTOLIC DESIGNS FOR DIGITAL IMAGE FILTERS
5.3 PARALLEL IMPLEMENTATION OF THE LAPLACIAN OPERATOR
5.3.1 Laplacian Operator Algorithms
5.3.2 Systolic Array for the Laplacian Operator
5.3.2.1 Plus-shaped Laplacian Operator Design
5.3.2.2 Systolic Array Design for Square Laplacian Operator
5.3.3 Designing Transputer Networks for the Laplacian Operator
5.3.4 Performance of the Laplacian Operator Systolic Design on Transputer Network
5.4 PARALLEL IMPLEMENTATION OF THE GRADIENT OPERATOR
5.4.1 Gradient Operator Algorithms
5.4.2 Systolic Array Design for the Gradient Operator
5.4.3 Transputer Network for Gradient Operator
5.4.4 Prewitt and Sobel Operator Algorithms
5.4.5 Systolic Array Design for Prewitt and Sobel Operators
5.4.6 Performance of the Gradient Operator Systolic Design on the Transputer Network

6 LOW-LEVEL IMAGE PROCESSING AND FILTER SOFTWARE LIBRARY DEVELOPMENT
6.1 INTRODUCTION
6.2 PARALLEL IMPLEMENTATION OF THE SIGMA FILTER
6.2.1 Sigma Filter Algorithm
6.2.2 Systolic Array Design for the Sigma Filter
6.2.3 Transputer Network for Sigma Filter
6.2.4 Performance of the Sigma Filter Systolic Array
6.3 PARALLEL IMPLEMENTATION OF THE INVERSE GRADIENT FILTER
6.3.1 Inverse Gradient Algorithm Transputer Network
6.3.2 Systolic Array Implementation for the Inverse Gradient Filter
6.3.3 Transputer Network for the Inverse Gradient Filter
6.3.4 Performance of the Inverse Gradient Filter Systolic Array on the Transputer Network
6.4 PARALLEL IMPLEMENTATION OF THE MEAN AND WEIGHTED MEAN FILTERS
6.4.1 Mean Filter Algorithms
6.4.2 Weighted Mean Filter Algorithm
6.4.3 Systolic Design for Mean Filter
6.4.4 Transputer Network for the Mean and Weighted Mean Filter
6.4.5 Performance of the Mean and Weighted Mean Filter Systolic Designs on the Transputer Networks
6.5 AN ENVIRONMENT FOR DEVELOPING LOW LEVEL IMAGE PROCESSING ON PARALLEL COMPUTERS
6.5.1 Introduction
6.5.2 Background
6.5.3 The Workbench
6.5.3.1 Software Structure
6.5.3.2 The User Interface
6.5.3.3 Execution Mode
6.5.4 Implementation
6.5.5 Workbench Facilities
6.5.6 Image Input, Output and Data Types
6.5.7 Types of Kernel
6.5.8 Contents of Library
6.5.9 Extending the Environment

7 SYSTOLIC ALGORITHM FOR THE SOLUTION OF TOEPLITZ MATRICES
7.1 INTRODUCTION
7.1.1 Digital Contour Smoothing
7.2 SOLUTION OF CERTAIN TOEPLITZ SYSTEMS
7.2.1 Tridiagonal Case
7.2.2 Quindiagonal Case
7.2.3 The General Case
7.3 SYSTOLIC ARRAY IMPLEMENTATION FOR TOEPLITZ MATRICES
7.3.1 Systolic Array Design for Tridiagonal Case
7.3.2 Systolic Array Design for the Quindiagonal Case
7.3.3 Systolic Array Design for General Banded Toeplitz Matrices
7.3.4 Numerical Test Example
7.4 PERFORMANCE OF THE SYSTOLIC ARRAY
7.4.1 Timing of the Systolic Array
7.4.2 Area of the Systolic Array

8 CONCLUSION AND DISCUSSION

REFERENCES
APPENDICES
APPENDIX A
APPENDIX B
APPENDIX C
APPENDIX D
APPENDIX E
CHAPTER 1
INTRODUCTION TO PARALLEL COMPUTER ARCHITECTURES
1.1 INTRODUCTION
Many scientific and engineering applications demand the greatest possible
speed, throughput, performance and flexibility, together with a high level of
availability and reliability, and many of them need to be solved in real time.
Amongst the huge number of computer applications, which range from simple
personal computer games to weather forecasting calculations, image processing
and satellite transmission programmes, there are many that require large
amounts of computational time. In an attempt to meet the challenging problem of
providing fast and economical computation, Large-Scale Parallel Computers were
developed. In fact, until recently, computational speed was derived only from the
development of faster electronic devices.
The current technology has gone a long way to increase the speed of operations
and the development continues. There is of course a natural limitation in
technology development; no signal can propagate faster than the speed of light.
In the late 1960s, Integrated Circuits (ICs) were used in computer design and
were followed by Large Scale Integrated (LSI) techniques. The Very Large-Scale
Integrated Circuits (VLSI), developed a few years ago, are currently being used in
the design of very high speed special and general purpose computer systems.
Until eight years ago, the state of electronic technology was such that
all factors affecting computational speed were almost minimised, and any further
increase in computational speed could only be achieved through both increased
switching speeds and increased circuit density. Hence, even if switching times are
almost instantaneous, distances between any two points in a circuit may not be
small enough to minimise the propagation delays and thus improve computational
speed. Therefore, the achievement of even faster computers is conditional on the use
of new approaches that do not depend on any breakthrough in device technology
but rather on imaginative applications of the skills of the designers of computer
architecture.
As computers were developed, more and more elementary operations were
performed concurrently, on a time-overlap basis. For instance, the fetch procedure
of a new instruction could be started before the previous one was completed. This
was called a prefetch operation. Obviously one approach to increasing speed is
through parallelism.
We can define the concept of parallel processing as a method of organization
of operations in a computing system where more than one operation is performed
concurrently or simultaneously [ Tabak 1989 ].
The parallel computer systems, or multiprocessors as they are commonly
known, not only increase the potential processing speed, but also increase the
overall throughput, flexibility and reliability, and provide tolerance for processor
failures.
Hockney and Jesshope [Hockney 1988] summarised the principal ways of
introducing parallelism at the hardware level of the computer architectures as :
1- The application of pipelining (assembly-line) techniques in order to
improve the performance of the arithmetic or control units. A processor is
decomposed into a certain number of elementary subprocesses, each of
which is capable of execution on dedicated autonomous units.
2- The provision of several independent units, operating in parallel, to
perform some basic fundamental functions such as logic, addition or
multiplications.
3- The provision of an array of processing elements performing
simultaneously the same instruction on a set of different data where the data
is stored in the processing elements (PE) private memory.
4- The provision of several independent processors, working in a
co-operative manner towards the solution of a single task by communicating
via a shared or common memory, each one of them being a complete
computer, obeying its own stored instructions.
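The fourth form, several independent processors co-operating on a single task through a shared memory, can be sketched in modern terms with threads standing in for the processors (an illustrative sketch, not from the thesis; the function name and worker count are invented):

```python
# A minimal sketch of Hockney and Jesshope's fourth form of parallelism:
# several independent workers cooperating on one task through shared memory.
# Python threads stand in for the processors; the shared list for the memory.
import threading

def parallel_sum(data, n_workers=4):
    partial = [0] * n_workers          # shared memory: one slot per worker

    def worker(idx):
        chunk = data[idx::n_workers]   # each worker takes a strided slice
        partial[idx] = sum(chunk)      # write its result to shared memory

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                       # wait for all workers to finish
    return sum(partial)                # combine the partial results
```

Each worker writes only to its own slot of the shared list, so no locking is needed for this particular decomposition.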
One area which has received considerable attention in recent years is the
design of real-time systems for the early processing of sensory data (i.e. low-level
image and signal processing). Most image processing algorithms need
massive amounts of band matrix operations. However, these algorithms contain
explicit parallelism which can be efficiently exploited by processor arrays. All sections
of the image have to be processed in exactly the same way, regardless of the
position of the image section within the image, or the value of the pixel data.
Low-level functions involve matrix-vector operations which are repeated at very
high speed. Such systems must handle large quantities of data (typical images have
512 x 512 pixels) at a high throughput.
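Every pixel receives exactly the same arithmetic regardless of its position, which is what makes these operations map so naturally onto processor arrays. As a minimal serial sketch (illustrative Python, not the thesis's Occam code), a 3x3 mean filter:

```python
# A small illustration of why low-level image processing parallelises well:
# a 3x3 mean filter applies exactly the same arithmetic at every interior
# pixel, independent of where that pixel lies in the image.
def mean_filter_3x3(image):
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):           # interior pixels only
        for j in range(1, cols - 1):
            total = sum(image[i + di][j + dj]
                        for di in (-1, 0, 1) for dj in (-1, 0, 1))
            out[i][j] = total // 9         # identical operation everywhere
    return out
```

Since no output pixel depends on any other output pixel, the two loops could be distributed over an array of processing elements with no inter-iteration communication.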
The main study of this thesis is the design of parallel algorithms for digital
image processing on Very Large Scale Integration (VLSI) processor arrays which
are implemented on both a Sequent Balance (MIMD) via an Occam simulator and a
transputer network running the Transputer Development System (TDS). The Occam
programming language is used as a tool to simulate and map systolic arrays for the
image processing algorithms proposed.
The following sections will cover a selection of the principal parallel
computer architectures (the pipeline, SIMD, MIMD and data-flow machines, and
VLSI systems, especially the Inmos Transputer system and systolic arrays),
which differ sufficiently from each other to illustrate alternative hardware and
software approaches to parallelism.
1.2 MAIN MOTIVATIONS
During the last decade the multiple processor approach has pursued a set of
long-sought motivating goals in order to satisfactorily meet many of the
challenging system design requirements. In reviewing some aspects of parallel
processing systems, one finds that while the hardware is improving at a fast rate,
the software tools to take advantage of the benefits provided are only slowly
emerging, a fact that affects the design motivations mentioned below.
Since the earliest multiple processing systems were developed, the system
characteristics that have motivated the continued development in this field have not
changed much. The most significant of these are increased throughput, improved
flexibility and reliability. Since none of these goals is numerically specified (i.e.
they are all qualitative goals), it is not surprising that the design of the future
"supercomputers" will also be motivated by the same objectives as today's parallel
computers. However, the improvement of some or all of these specifications must
ultimately result in an improved overall system performance, usually measured on
the basis of cost effectiveness.
The multiprocessing computing systems are composed of multiple processors,
interconnected to each other, and sharing the use of memory, input-output
peripherals and other resources. Each of the processors is capable of executing a
different part of the same program or a different program altogether. The multiple
processor approach is a cost-effective way to achieve high system throughput:
the use of several cooperating processing units can reach a level of throughput
that could not be matched by a uniprocessor system with enhanced logic circuitry.
Literally, flexibility means the ease of changing the system configuration to suit
new conditions, and the use of more than one processor has greatly increased the
system's potential flexibility since it offers the ability to expand the memory space,
the number of processing units and even the software facilities in order to meet
new demands. This flexibility may also be used to justify the increased reliability
of the system.
Broadly speaking, reliability is related to different system aspects required
by different applications. The first is system availability, defined by the
requirement that the system should remain available even in the case of a
malfunctioning unit; an example of this is the computer controlled telephone
switching board. The second is system integrity, defined by the requirement
that the information contained within should be "protected" against any
corruption or loss (e.g. in a secure banking system).
In conclusion, since all the system characteristics that have motivated the
development of parallel processor computers are not described quantitatively,
any new major system concept has been claimed by its proponents as the ultimate
solution to achieving these motivating goals. In fact, the same motives were behind
the follow-up to the parallel processing systems, the VLSI architectures.
1.3 PIPELINED COMPUTERS
The pipeline concept [Hayes 1988] has been implemented in practice since the
third computer generation (on the IBM 360/91 for instance). The computer
pipeline is analogous to the assembly line in industrial manufacturing
processes. It is one technique for embedding parallelism or concurrency in
a computer system. Although essentially sequential, this type of computer helps to
match the speeds of various subsystems without duplicating the cost of the entire
system involved. It also improves system availability and reliability by providing
several copies of dedicated subsystems.
The pipeline is particularly efficient for long sequences of operands or, in other
words, for high-dimensional vector operands. For this reason, pipelined
processors are also sometimes called Vector Processors.
Pipelined computers achieve an increase in computational speed by
decomposing every process into several sub-processes which can be executed by
special autonomous and concurrently operating hardware units. Furthermore,
pipelining can be introduced at more than one level in the design of computers.
Ramamoorthy [Ramamoorthy 1977] distinguished two pipeline levels, the system
level for the pipelining of the processing units and the subsystem level for
arithmetic pipelining. Handler [Handler 1982] introduced a third level
and distinguished them under the names: macro-pipelining for the program level,
instruction pipelining for the instruction level and arithmetic pipelining for the
word level. Other designers have distinguished instruction pipelining, depending
on the control structure in the system, as either strict or relaxed pipelining. A pipe
can be further distinguished by its design configuration and control strategies into
two forms; it can be either a static or a dynamic pipe. Sometimes, a pipelined
structure is dedicated to a single function, e.g. a pipelined adder or multiplier. In
Figure (1.1) A parallel processing system (main memory and control unit feeding stages 1 to N)
this case it is termed a unifunctional pipe with static configuration. On the other
hand, a pipelined module can serve several different functions. Such a pipe is
called a multifunctional pipe, which can be static or dynamic depending on the
number of active configurations. If only one configuration is active at any one
time, then the pipe is said to be static. In a dynamic multifunctional pipe, more than
one configuration can be active at any one time, thus permitting a synchronous
overlapping on different interconnections. The simplified model of a general
pipelined computer is shown in Fig. (1.1), where the processor unit is segmented
into N modules, each of which performs its part of the processing and the result
appears at the end of the Nth stage.
The pipelined concurrency, a main characteristic of the simplest pipelining, is
exemplified by the process of executing instructions. In fig. (1.2), we consider four
modules : Instruction Fetch (IF), Instruction Decode (ID), Operand Fetch (OF) and
Execution (E), obtained when segmenting the process of processing instructions.
Consequently, if the process is decomposed into four subprocesses and executed
on the four-module pipelined system as defined above, then four successive
instructions may execute in parallel and independently of each other but at different
execution stages : the first instruction is in the execution phase, the second one is in
the operand fetching stage, the third is in the instruction decoding phase and lastly,
the fourth instruction is in the fetching stage. The overlapping procedure among
these individual modules is depicted in fig. (1.4).
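The overlap just described can be expressed as a small simulation (an illustrative sketch, not thesis code; the names are invented): with N stages and k instructions, in each cycle every stage holds the instruction whose index trails the cycle number by the stage number, so k instructions complete in k + N - 1 cycles instead of k*N serially.

```python
# A sketch of the four-stage instruction pipeline described above (IF, ID,
# OF, E). Each cycle, instruction (cycle - s) occupies stage s; the list of
# (cycle, stage, instruction) triples reproduces the space-time diagram.
STAGES = ["IF", "ID", "OF", "E"]

def pipeline_schedule(n_instructions):
    schedule = []                        # (cycle, stage, instruction) triples
    n_cycles = n_instructions + len(STAGES) - 1
    for cycle in range(n_cycles):
        for s, name in enumerate(STAGES):
            instr = cycle - s            # instruction currently in this stage
            if 0 <= instr < n_instructions:
                schedule.append((cycle, name, instr))
    return schedule, n_cycles
```

For four instructions the schedule spans 4 + 4 - 1 = 7 cycles, against 16 cycles if each instruction had to drain through all four stages before the next entered.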
Buffering is essential to ensure a continuous, smooth flow of data through the
pipeline segments in cases where variable speeds occur; it is essentially the
process of storing the results of a segment temporarily before sending them to the
next segment. To this end, a sufficient storage space, or buffer, is included
between a segment and its successor; the former can then continue its
operation on other results and transfer them to the provided buffer until it is full.
In addition to the architectural features of the pipelined processor, the busing
structure is important in deciding the efficiency of an algorithm to be executed on
such a system. Pipelining in essence, refers to the concurrent processing of
independent instructions though they may be in different stages of execution due to
overlapping.
Another factor even more damaging to the pipeline than instruction
dependency is branching. The encounter of a conditional branch not only delays
further executions but affects the performance of the entire pipe, since the exact
sequence of instructions to be followed is hard to foretell until the deciding result
becomes available at the output. To alleviate the effects of branching, several
techniques have been employed to provide mechanisms through which processing
can resume safely even if an incorrect branch occurs, which may create a
discontinuous supply of instructions.
Figure (1.2) The modules of a pipelined processor (IF, ID, OF and E stages)

Figure (1.3) Space-Time diagram (No Pipelining)

Figure (1.4) Space-Time diagram (Pipelining)
A similar degrading effect to that of conditional branching is caused by interrupts,
which disrupt the continuity of the instruction stream through the pipeline.
Interrupts must be serviced before any action can be applied to the next
instruction. Provided that the cost of a recovery mechanism, which allows
processing to proceed after an unpredictable interrupt occurs (while instruction i is
the next one to enter the pipe), is not exceedingly substantial, sufficient
information must be saved for the eventual recovery. Otherwise these two
instructions, the interrupt instruction and instruction i, have to be executed
sequentially, which in fact is not allowed for in the pipelining principle.
Finally, one of the most beneficial applications of overlapped processing in
order to increase the total throughput has been the execution of arithmetic functions.
Especially, the advantages of pipelining are greatly enhanced when floating-point
operations on a vector are being considered, since they represent quite a lengthy
process; again, the full speed-up is not achieved unless all modules in the pipe are
kept continuously busy.
As an example of an arithmetic pipeline, we take the problem of adding two
floating-point vectors x_i and y_i (i = 1, 2, ..., n) to obtain the sum vector
z_i = x_i + y_i. The operation of adding any pair of the above elements
(x = e*2^r and y = f*2^s) may be divided into four suboperations. These are:
(1) compare exponents, i.e. form (r - s); (2) shift x with respect to y by (r - s)
places in order to line up the binary points; (3) add the mantissa of x to the
mantissa of y; and (4) normalise by shifting the result z to the left until the
leading non-zero digit is next to the binary point. In the sequential computer the
four suboperations must be completed on the first element pair x_1, y_1 to
produce the first result z_1 before the second element pair enters the arithmetic
unit. In the arithmetic pipeline, the four suboperations are executed
on the four-module pipelined system as defined before. The overlapping procedure
among the individual modules is shown in fig. (1.5).
Figure (1.5) Comparison of serial and pipelined computers (4 clocks per result serially, 1 clock per result pipelined)
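The four suboperations can be written out directly (a simplified sketch, not the thesis's Occam code: values are held as (mantissa, exponent) pairs with integer mantissas, and step (4) normalises by stripping factors of two rather than by true binary-point alignment):

```python
# The four suboperations of floating-point addition, as a sketch.
# A value (m, e) represents m * 2**e, with an integer mantissa m.
def fp_add(x, y):
    (e, r), (f, s) = x, y
    d = r - s                          # (1) compare exponents: form (r - s)
    if d >= 0:                         # (2) shift to line up the binary points:
        e, r = e << d, s               #     rewrite x at the smaller exponent s
    else:
        f, s = f << -d, r              #     or rewrite y at the smaller exponent r
    m = e + f                          # (3) add the mantissas (now aligned)
    exp = r                            # both operands share this exponent
    while m != 0 and m % 2 == 0:       # (4) simplified normalisation: strip
        m //= 2                        #     trailing zero bits of the mantissa
        exp += 1
    return (m, exp)
```

For example, (3, 2) + (1, 0), i.e. 12 + 1, aligns to exponent 0 and yields (13, 0); each of the four numbered steps could run in its own pipeline stage, one element pair per clock.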
1.4 DATA-FLOW COMPUTERS
A common feature of all the high-speed parallel computer architectures is
that, due to the basic linearity of the program, the use of implicit sequencing of the
instructions is possible. This is a von-Neumann characteristic, which means that the
order of execution of the instructions is determined by the order in which they are
stored in the memory with branches used to break this implicit sequencing at
selective points. An alternative form of instruction controlling is the explicit
sequencing which is basically the principal concept exploited by the data-flow (DF)
machines to provide the maximum possibilities for concurrency and speed-up.
However, this concept has a significant impact on the architecture of such
machines, the program representation, and the synchronisation overheads.
In a Data-flow (DF) computer, the course of computation is controlled by
the flow of data in the program. That is, an operation is performed as and when its
operands are available. The sequence of operations in the DF computer obeys the
precedence constraints imposed by the algorithm used rather than by the location of
the instructions in the memory. In a DF machine it is possible to carry out in
parallel as many instructions as the given computer can execute simultaneously.
After executing the instructions, the result is distributed to all subsequent
instructions which make use of this partial result as an operand. In this way, the
DF model of computation exploits in a simple manner the natural parallelism of
algorithms.
As an illustration of DF computation, the computation of the roots of a
quadratic equation is shown in fig. (1.6). Assuming that the a, b and c values are
available, (-b), (b²), (ac) and (2a) can be computed immediately, followed by the
computation of (4ac), (b² - 4ac) and √(b² - 4ac), in that order; after this,
(-b + √(b² - 4ac)) and (-b - √(b² - 4ac)) can be simultaneously computed, followed
Figure (1.6) A data-flow graph for the computation of the roots of a
quadratic equation.
by the simultaneous computation of the two roots. The only requirement is that the
operands be available before an operation can be invoked.
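This firing rule, an operation executes as soon as all its operands exist, can be captured in a few lines (an illustrative interpreter, not a real DF machine; all names below are invented). The graph is the one of fig. (1.6):

```python
# A minimal data-flow interpreter: each node fires as soon as all of its
# input values are present, regardless of the order the nodes are listed.
import math

def run_dataflow(graph, values):
    graph = dict(graph)                # nodes still waiting to fire
    while graph:
        ready = [name for name, (op, args) in graph.items()
                 if all(a in values for a in args)]
        for name in ready:             # fire every ready node this round
            op, args = graph.pop(name)
            values[name] = op(*[values[a] for a in args])
    return values

# The data-flow graph for the roots of a*x^2 + b*x + c = 0 (fig. 1.6).
quadratic = {
    "negb":  (lambda b: -b,                ["b"]),
    "b2":    (lambda b: b * b,             ["b"]),
    "ac":    (lambda a, c: a * c,          ["a", "c"]),
    "2a":    (lambda a: 2 * a,             ["a"]),
    "4ac":   (lambda ac: 4 * ac,           ["ac"]),
    "disc":  (lambda b2, f: b2 - f,        ["b2", "4ac"]),
    "sqrt":  (math.sqrt,                   ["disc"]),
    "root1": (lambda n, s, d: (n + s) / d, ["negb", "sqrt", "2a"]),
    "root2": (lambda n, s, d: (n - s) / d, ["negb", "sqrt", "2a"]),
}
```

Note that within one round every ready node fires, mirroring the simultaneous evaluation of (-b), (b²), (ac) and (2a) described in the text.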
The DF concept encounters some problems when the algorithm contains
loops or subroutine calls, in which case the same instruction is executed several
times. Basically, the implementation of the data-flow computers can be grouped
into two main classes, the static and dynamic structures, depending on how this
problem is tackled.
In the first class, the static one, the loops and subroutine calls are unfolded
at compile time so that each instruction is executed only once. Consequently, the
implementation of the sequencing control is made simple since it directly follows
that of the graph. On the other hand, in the dynamic case, the operands are labelled
so that a single copy of the same instruction can be used several times for different
instances of the loop (or subroutine). For this type of architecture, it is necessary to
match all the operands with the same label before issuing the single copy of the
instruction, so the implementation of the control is significantly more complex in
comparison with that of the previous class. However, the dynamic approach, which
allows a compact representation of large programs, can effectively exploit the
concurrency that appears during execution (for example, recursive calls or data
dependent loops).
An example of the static approach is the MIT Data-Flow machine (fig. 1.7),
which consists of the following main components: a store that contains the
instruction cells, or packets, having space for the operation, the operands and for
pointers to the successors, together with a set of operating units to perform the
operations. These two components are connected by two interconnection networks,
one to send ready-to-execute instruction packets to the operating units and another
to send results back from the operating units to the instructions that use them as
operands.
The system has to be carefully designed so as to prevent any bottle-neck from
occurring and to provide the means for the full exploitation of all the concurrency.
In such a system, the maximum throughput is determined by the speed and
number of the operating units, the memory bandwidth and by the interconnection
system. As in the other organisations, several degradation factors reduce the
effective throughput. The most significant are the degree of concurrency available
in the program, the memory access, the interconnection network conflicts and the
broadcasting of results, all of which except the last one are similar to the other
systems. Sometimes an instruction has several successors, so that the result has to
be sent, or broadcast, to all of them; this introduces significant overheads when the
number of destination pointers present in an instruction cell is limited.
Examples of the dynamic approach include the U-Interpreter machine
[Arvind 1982] and the Manchester Data-flow Machine [Gurd 1985]. The main
components of the latter (see fig. 1.8) are the token queue that stores computed
results, the token matching unit that combines the corresponding tokens into
instruction arguments, the instruction store that holds the ready-to-execute
instructions, the operating units, and the I/O switch for communication with the
host. Due to the above mentioned degradation factors, data-flow machines are only
attractive for cases in which the concurrency exhibited is of several hundred
instructions or more.
Another problem in the use of the data-flow approach is the lack of any data
structure definition; in fact, only scalar operations were first utilised in the attempt
to maximise the amount of concurrency, and this had significant limitations in terms
of the modularity of the programs. The inclusion of data structures in the graph
representation requires that the data-flow concept be extended and operations on
Figure (1.7) The static data-flow machine

Figure (1.8) The dynamic data-flow machine
them to be defined. From the operational point of view, the most straightforward
solution is to treat the data structure as an atomic operand, requiring the structure to
be sent as a whole to the operating units even though only a few elements are
operated on. This can be performed by sending to the operating unit a pointer to the
data structure instead of its value.
One of the most significant advantages of the DF machines, as claimed by
their proponents, is the exploitation of concurrency at a low level of the execution
hierarchy, since it allows the maximum utilisation of all the available concurrency.
However, some researchers have argued that the overheads incurred by this
unstructured low-level concurrency are too high and have proposed the use of a
hierarchical approach in which different types of concurrency can be exploited at
various levels.
1.5 ARRAY PROCESSORS
The early interest in the parallel processor area initially appeared in the
investigation of machines that were arrays of processors connected in a four
nearest-neighbour manner "N,E,S,W", such as von Neumann's Cellular
Automata and the Holland machine. Eventually, as a result of growing interest in
this form of computer, parallel processors with a central control mechanism that
controlled the entire array began to emerge.
Array processors can be defined as an array of interconnected identical
processing elements (PE's). The PE's are controlled by a single control unit. Each
PE consists of an arithmetic and logic unit (ALU) and a local memory. Two
essential reasons for building array processors are, firstly, economic, for it is
cheaper to build N processors with only a single control unit rather than N similar
computers. The second reason concerns interprocessor communication: the
communication bandwidth can be more fully utilised.
The PE's are synchronized to perform the same function at the same time.
The control unit decodes the instruction and broadcasts the instruction via control
lines to all PE's simultaneously. The control unit can access information in both the
control and local memories. Each PE has access to its local memory only. Thus, a
common instruction is executed by all PE's simultaneously, each using data from its
own local memory.
Different interconnection patterns between processors in array processors
are used to permit data transfers between processors. In order to maximize the
parallelism in an array processor, we must utilise as much of the available memory
and as many of the processors as possible. The array processor is eminently suitable
for computations involving linear algebra operations. For example, if an array
processor contains N (N=2^n) processing elements, an N×N array is stored by
columns in such a way that each element of a matrix column is stored in the
memory of the corresponding PE, and one memory fetch transfers one column of
the matrix into the vector of arithmetic units (PE's). An example of an array
processor is the ICL DAP computer. The general structure of the ICL DAP is
shown in fig. (1.9).
The operational speed of an array processor is supposed to increase linearly
as the number of processing elements (PE's) is increased. However, this is not true
in practice, due to interprocessor communication and data access overheads. Array
processors can only be fully effective (i.e. achieve maximum parallelism) if the
array is completely filled with operands.
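The column-per-PE storage scheme described above can be sketched as follows (an illustrative model, not DAP code; all names are hypothetical): each PE holds one matrix column in its local memory, one fetch supplies one element to every PE at once, and the broadcast instruction is then applied by all PE's simultaneously.

```python
# Column-wise storage for a 4x4 matrix: pe_memory[c] is the local
# memory of PE c, holding column c of the matrix.
N = 4
matrix = [[r * N + c for c in range(N)] for r in range(N)]
pe_memory = [[matrix[r][c] for r in range(N)] for c in range(N)]

def simd_step(row, op):
    """One memory fetch brings element `row` of every column (one
    element per PE) into the ALUs; all PE's then execute the same
    broadcast operation `op` simultaneously."""
    return [op(pe_memory[c][row]) for c in range(N)]

# Broadcast instruction: double every element of matrix row 1.
print(simd_step(1, lambda x: 2 * x))  # [8, 10, 12, 14]
```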
Figure (1.9) The structure of the ICL DAP
An associative store can be used to overcome the memory access bottle-neck
in enhancing the speed of conventional computers. An array processor using an
associative type store as its memory is called an associative array processor.
1.6 DESIGN CLASSIFICATIONS
Parallel processing is a very general term; it can represent many different
strategies and their implementations.
As a result of the introduction of various forms of parallelism which has
proved to be an effective approach for increasing computational speed, several
competitive computer architectures were constructed but there was little evidence as
to which design was superior, nor was there sufficient knowledge on which to
make a careful evaluation. Researchers have also aided the study of high-speed
parallel computers by attempting to classify all the proposed computer architectures,
or at least those which had already been well established.
Indeed, a number of classification approaches have been proposed in the
past, given by different researchers, especially by the two pioneers, Flynn [Flynn
1966] and Shore [Shore 1973]. However, Flynn's classification scheme is too
broad, since it combines all parallel computers except the multiprocessor into the
SIMD class and draws no distinction between the pipelined computer and the
processor array, which have entirely different architectures. These classifications
have been widely referenced and their corresponding terminology has greatly
contributed to the formation of the computer science vocabulary.
1.6.1 Flynn's Classification
Flynn's high-speed parallel computer classification is based on the dependence
relation between the instructions propagated by the computer and the data being
processed. Flynn explored theoretically some of the organisational possibilities for
large scientific computing machinery before attempting to classify them into four
broad classes.
For convenience, he defined the instruction stream as a sequence of
instructions to be processed by the computer and the data stream as a set of
operands, including input and partial or temporary results. Also two additional
useful concepts were adopted: bandwidth and latency. By bandwidth he expressed
the time-rate of occurrences, while latency expresses the total time between the
excitation and the response of a computing process on a particular data unit.
In particular, for the former notion, computational or execution bandwidth is the
number of instructions processed per second and storage bandwidth is the retrieval
rate of data and instructions from the store (i.e. memory words per second).
By using the two definitions, Flynn categorised the theoretically possible
computer organisations according to the multiplicity of the hardware provided to
service the instruction and data streams. The word "multiplicity", which was
intentionally used to avoid the ubiquitous and ambiguous term "parallelism", refers
to the maximum number of simultaneous instructions or data in the same phase of
execution at the most constrained component of the organisation.
The four basic types of systems recognised by Flynn's classification are :
a- Single Instruction Single Data (SISD) stream system (fig. 1.10). This is
the basic single-processor, or uniprocessor, system. It may represent a classical von
Neumann architecture computer with practically no parallelism (IBM 701).
However, it may also represent some more sophisticated systems, where certain
methods of parallelism have been implemented, such as multiple functional units
(IBM 360/91, CDC 6600, CYBER 205) or pipelining (VAX 8600), or both.
b- Single Instruction Multiple Data (SIMD) stream system (fig. 1.11). A
number of processors simultaneously execute the same instruction, transmitted by
the control unit CU in the instruction stream. Each instruction is executed on a
Figure (1.10) S.I.S.D. computers

Figure (1.11) S.I.M.D. computers

Figure (1.12) M.I.S.D. computers

Figure (1.13) M.I.M.D. computers
different set of data, transmitted to each processor Pr_i in the data stream D_i from a
local memory LM_i, i=1,2,...,n. The results are stored temporarily in the LM. There
exists a bidirectional bus interconnection between the main memory MM and the
local memories, under the control of the CU. A system of this type is also called an
Array Processor, because of the array of processors formed by the Pr_i,
i=1,2,...,n. Examples of SIMD systems are the ILLIAC IV, BSP and MPP.
c- Multiple Instruction Single Data (MISD) stream system (fig. 1.12). In
this system a sequence of data is transmitted to a sequence of processors, each of
which is controlled by a separate CU and executes a different instruction sequence.
The MISD structure has never been implemented, so no examples of any
well established organisation have yet been proposed. It very much resembles a
pipeline structure. However, the main difference is that a pipeline structure belongs
to the same CPU and is controlled by a single CU.
d- Multiple Instruction Multiple Data (MIMD) stream system (fig. 1.13). In
the MIMD system a set of n processors executing simultaneously is usually called a
multiprocessor. The Balance 8000 parallel computer system in the Parallel
Algorithms Research Centre (PARC) at Loughborough University of Technology is
an example of this class; this machine is described in chapter 2. Other examples
are the IBM 3090, the Cray 2, the Alliant FX/8 and the NCUBE.
The MIMD systems (the multiprocessors) constitute the most general type of
parallel processor. Any number of processors in the same computing system
execute different programs concurrently, using different sets of data. In general, no
particular operation selection constraints are attached. Each processor can work on
a different program at the same time; a program can be subdivided into
subprograms (such as processes, tasks, etc.) that can be run concurrently on a
number of processors.
1.6.2 Shore's Classification
A classification of parallel computer systems based on their constituent
hardware components was proposed by Shore [Shore 1973]. Accordingly, all
existing computer architectures were categorised into six different classes.
The first machine (I) [e.g. the CDC 7600, a pipelined scalar computer, and the
CRAY-1, a pipelined vector computer], which is the conventional serial von
Neumann type organisation, consists of an Instruction Memory (IM), a single
Control Unit (CU), a Processing Unit (PU) and a Data Memory (DM). The main
source of power increase comes from the processing unit, which may consist of
several functional units, pipelined or not; all bits of a single word are read in order
to be processed simultaneously (horizontal PU).
A second alternative machine (II) is obtained from the first by simply
changing the way the data is read from the data memory. Instead of reading all bits
of a single word as (I) does, machine (II) reads a bit from every word in the
memory, i.e. bit-serially, but word processing is parallel. In other words, if the
memory area is considered as a two-dimensional array of bits, with each word
occupying an individual row, then machine (I) reads horizontal slices whereas
machine (II) reads vertical slices.
A combination of the two above machines yields machine (III). This means
that machine (III) has two processing units, a horizontal and a vertical one, and is
capable of processing data in either of the two directions. The ICL DAP could have
been a favourable candidate for this class if only it had separate processing units to
offer this capability. An example of this organisation is the Sanders Associates
OMEN 60 series of computers [Higbie 1972].
Machine (IV) consists of a single control unit and many independent
processing elements, each of which has a processing unit and a data memory.
Communication between these components is restricted to take place only through
the control unit. A good example of this machine is the PEPE system.
If, however, additional limited communication is allowed to take place
among the processing elements in a nearest-neighbour fashion, then machine (V) is
conceived. Thus, the communication paths between the linearly connected
processors offer any processor in the array the possibility of accessing data from its
immediate neighbours' memories, as well as its own. An example of this machine
type is the ILLIAC IV, which in addition provides a short-cut communication path
to every eighth processing element.
The Logic-In-Memory-Array (LIMA) is Shore's last class of computer
organisation. The main difference between machine (VI) and the previous one is
that the processing unit and the data memory are no longer two individual hardware
components; instead they are constructed on the same IC board. Examples range
from simple associative memories to complex associative processors.
It is observed that, generally speaking, Shore's classification, compared
with Flynn's, does not offer anything new, but only a subcategorisation of the
obscure SIMD class given by Flynn, except for machine (I), which is an SISD-type
computer. Again, as with Flynn's categorisation, pipelined computers do not
belong to a well specified class that represents their hardware characteristics; on the
contrary, they are mixed up with unpipelined scalar computers.
1.6.3 Other Classification Approaches
This section gives a brief note on some other classification approaches, of
lesser importance compared to the former two, which are based mainly on the
concept of parallelism.
One of the taxonomies, based on the amount of parallelism involved in the
control unit, data streams and instruction units, was suggested by Hobbs et al.
[Hobbs 1970] in 1970. They divided parallel computers into multiprocessors,
associative processors, array processors and functional processors.
Another classification, due to Murtha and Beadles [Murtha 1964], was based
upon parallelism properties. It attempted to underline the main significant
differences between multiprocessors and highly parallel organisations. Three main
classes of parallel processor systems were identified: general-purpose network
computers, special-purpose network computers characterised by global parallelism
and, finally, non-global, semi-independent network computers with local
parallelism. Furthermore, all the classes but the last one were further
subcategorised into two subclasses each. The first class, the general-purpose one,
was subdivided into a subclass of general-purpose network computers with
centralised common control, and a subclass of general-purpose network computers
with many identical processors, each capable of operating independently from the
others, executing instructions from its own local storage. The second class
comprised the pattern processor and associative processor subclasses.
Hockney and Jesshope [Hockney 1988] formulated a taxonomy scheme for
both serial and parallel computers. The main subdivisions are shown in fig. (1.14).
Their taxonomy was more detailed than that of Flynn or Shore and took implicit
account of pipelined structures. Therefore, the Multiple Instruction class was not
considered for further categorisation, unlike the pipelined and array processor
computers. Nevertheless, this scheme, if coupled with that of Flynn, could well be
suited for a general classification of parallel computers.
Figure (1.14) Structural classification of computers: a single instruction unit
with single unpipelined execution units gives serial unicomputers; with pipelined
or multiple execution units, parallel unicomputers; a multiple instruction unit gives
multiple computers (multiprocessors), e.g. the Balance 8000.
A multiprocessor conforming to Enslow's definition is sometimes denoted
as a Tightly Coupled Multiprocessor [Hwang 1984] or a Loosely Coupled
Multiprocessor. In the case of tightly coupled processors, as shown in fig. (1.15)
(i.e. a large number of processors sharing a common parallel memory via a high
speed multiplexed bus), the processors operate under the strict control of the bus
assignment scheme, which is implemented in hardware at the bus/processor
interface. On the other hand, in a system with loosely coupled processors the
communication and interaction take place on the basis of information exchange;
fig. (1.16) shows a general architecture of a loosely coupled system, where each
processor has its own local memory. Comparing the above two classes of
multiprocessor systems, the main difference lies in the organisation of the memory
and the bandwidth of the interconnection network.
Figure (1.15) Tightly coupled multiprocessor

Figure (1.16) Loosely coupled multiprocessor
1.7 MULTIPROCESSOR STRUCTURE: PROCESSING AND COMMUNICATION
A multiprocessor is composed of many processors, memory modules, I/O
interface units, and a communication network interconnecting all of them. The
communication network is of the utmost importance in a multiprocessor design.
The overall performance of the multiprocessing system depends not only on the
individual speed and throughput of its processors, but also strongly on the quality
of its communication network [Stone 1987].
From the standpoint of the type of communication network, there are three
main categories of multiprocessor structures: bus-oriented systems, hypercube
systems and switch network systems.
A bus-oriented system contains one or more system buses (including data,
address and control lines) to which all of the system components are
interconnected. A single-bus system is the simplest and least expensive to
implement. It is used in a number of multiprocessors, such as the Sequent, Encore
and ELXSI. A single-bus system offers greater configuration flexibility to both the
user and the designer (fig. 1.17). Unfortunately, this type of system suffers from
some serious drawbacks. The most serious problem is the bus bottle-neck: only
two devices at a time can establish communication through the system bus. No
matter how fast a bus is implemented, this bottle-neck tends to slow down
considerably the overall throughput of a multiprocessor. Moreover, a bus failure is
catastrophic in a single-bus multiprocessor.
This can be alleviated if multiple buses are implemented (see fig. 1.18).
Certainly, a failure of a single bus does not have to be catastrophic for the whole
system. If we have q buses, then up to q simultaneous interconnections can be
achieved, alleviating the bottle-neck problem. Of course, other problems arise.
Figure (1.17) Single-bus system
Multiporting requires complicated and expensive extra logic on each device.
Therefore, the number q cannot be too high; a dual-bus system (q=2) can be
achieved at a reasonable cost. For instance, the Alliant has a dual system bus
between its cache and main memory, and also a concurrency bus connecting the
processors only.
Figure (1.18) Multiple-bus system
The hypercube multiprocessor structure is characterised by the presence of
N=2^n processors, interconnected as an n-dimensional binary cube [Seitz 1985].
Each processor forms a node, or vertex, of the cube. Each node has direct and
separate communication paths to n other nodes (its neighbours); these paths
correspond to the edges (channels) of the cube (fig. 1.19).
Figure (1.19) The hypercube topology.
The uniprocessor (SISD), represented by a single node, can be regarded as
a zero-cube. Two nodes, interconnected by a single path, form a one-cube. Four
nodes, interconnected as a square, form a two-cube, and so on. The hypercube
configuration is implemented commercially by NCUBE, Intel and Floating Point
Systems (FPS). In all commercial hypercube systems the memory is distributed
among the nodes; there is only one local memory on each node board. Since each
processor has direct connections to n other processors and each processor has its
own local memory, the bottle-neck problem seems to be less serious. However,
the interprocessor direct communication in the commercial hypercube systems is
serial, therefore limiting the overall bandwidth. The particular interconnection
structure of a hypercube makes it suited for some classes of problems, but may
prove to be very inefficient for others. The hypercube structure is certainly not
universally considered ideal.
Figure (1.20) Crossbar switch multiprocessor.
Much more general is the switch network system structure, which offers
numerous possibilities. One of the best known switch structures is the
crossbar switch (fig. 1.20). The crossbar switch permits establishing concurrent
communication links between all processors (P) and memory modules (M),
provided the number of memory modules is sufficient (n ≤ m) and each processor
attempts to access a different memory module. The same goes for I/O processors or
I/O interface units. The information is routed through Crosspoint Switches (CS),
which contain multiplexing and arbitration logic networks. The main advantage of
a crossbar network is its potential for high throughput via multiple, concurrent
communication paths. Its main disadvantage is an exceedingly high cost and
complex logic. The crossbar network was implemented in the Carnegie Mellon
University C.mmp multiprocessor system. The experimental C.mmp system was a
16x16 (n=16) network and each processor node was a DEC PDP-11 or LSI-11.
The multistage network, or the generalised cube network, is a more general
representation for multiprocessor switching networks [Siegel 1985]. Its basic
component is the two-input, two-output interchange box. The two inputs and the
two outputs are labelled 0 and 1. There are two control signals associated with the
interchange box, C0 and C1, which establish the interconnection between the
input and the output terminals. A general multistage network has N inputs and N
outputs. In a generalised cube network N=2^m, where m is the number of stages;
each stage has N/2 interchange boxes. An example of such a network for N=8,
m=3, is shown in fig. (1.21).
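The interchange box and one stage of the cube interconnection can be sketched as follows (a simplified illustrative model: each box here takes a single straight/exchange control bit rather than the C0/C1 pair, and all names are invented). In stage i, boxes pair the terminals whose labels differ in bit i.

```python
def interchange_box(in0, in1, swap):
    """Two-input, two-output interchange box: pass straight through
    (swap=False) or exchange the inputs (swap=True)."""
    return (in1, in0) if swap else (in0, in1)

def cube_stage(inputs, stage, swap_for_box):
    """One stage of a generalised cube network: stage `stage` pairs
    terminals whose labels differ in bit `stage`; `swap_for_box`
    maps the lower terminal label of each box to its control bit."""
    outputs = list(inputs)
    for t in range(len(inputs)):
        partner = t ^ (1 << stage)
        if t < partner:          # visit each interchange box once
            outputs[t], outputs[partner] = interchange_box(
                inputs[t], inputs[partner], swap_for_box[t])
    return outputs

# N=4: stage 0 pairs terminals (0,1) and (2,3); swap the first box.
print(cube_stage(['a', 'b', 'c', 'd'], 0, {0: True, 2: False}))
```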
In a multiprocessor system the inputs and outputs of a multistage network
can be connected to all the processors (one input and one output to each processor),
permitting direct intercommunication between them. Alternatively, the inputs of the
multistage network can be connected to the processors, and the outputs to the
memory modules. Considering the fact that in an actual system the communication
through the interchange boxes is serial, we can see that the multistage system is a
realistic and economical alternative to the crossbar switch. Of course, serial
communication decreases the overall speed.
The importance of the communication network in a multiprocessor system
cannot be over-emphasised. When a program is distributed among a number of
processors for execution, there is always a need for the transmission of intermediate
results between the tasks which are scheduled to run on different processors. In
fact, some processes will not be able to proceed without receiving the results from
other processes. This calls for synchronisation between different processes
belonging to the same program. Even if all processors run different programs, the
operating system has to schedule and distribute the programs among the
processors, maintaining a vigilant watch so that none of the processors remains
idle. The communication network plays a crucial part in all of the above events,
since it has to efficiently transmit all of the data, intermediate results, scheduling,
and synchronisation control signals.
Figure (1.21) Three-stage generalised cube network.
CHAPTER 2
THE VLSI TECHNOLOGY AND SYSTOLIC PARADIGM
2.1 INTRODUCTION
As a result of improvements in fabrication technology, Large Scale Integrated
electronic circuitry has become so dense that a single silicon LSI chip may contain
tens of thousands of transistors. Many LSI chips, such as microprocessors, now
consist of multiple complex subsystems, and thus are really integrated systems
rather than integrated circuits.
Achievable circuit density now increases greatly with each passing year or
two. Physical principles indicate that transistors can be scaled down to less than
1/1000th of their present area and still function as the switching elements with
which we can build digital systems. Following the rapid advances in LSI
technology, Very Large Scale Integration (VLSI) circuits have been developed with
which enormously complex digital electronic systems can be placed on a single chip
of silicon. In fact, it can be foreseen that the number of components that a VLSI
chip could accommodate will be increased by a factor of ten to one hundred in the
next decades [Mead 1980]. Devices which once required many complex
components can be built with just a few VLSI chips, reducing the problems in
reliability, performance and heat dissipation that arise from standard Small Scale
Integration (SSI) and Medium Scale Integration (MSI) components [Kung 1979].
VLSI electronics present a challenge, not only to those involved in the
development of fabrication technology, but also to computer scientists and
computer architects on how best to take advantage of the new technology. The
ways in which digital systems are structured, the procedures used to design them,
the trade-off between hardware and software, and the design of computational
algorithms will all be greatly affected by the coming changes in integrated
electronics.
The separation of the processor from its memory and the limited
opportunities for concurrent processing are the main difficulties in conventional
(von Neumann) computers. VLSI offers more flexibility than conventional
(von Neumann) computers to overcome these difficulties, because memory and
processing architectures can be implemented with the same technology and in close
proximity to each other. The potential power of VLSI has to come from the large
amounts of concurrency that it may support. The degree of concurrency in a VLSI
computing structure is largely determined by the redesign of the underlying
algorithms. Enormous, though ultimately limited, parallelism can be obtained by
introducing a high degree of pipelining and multiprocessing while redesigning the
algorithm. The requirements of parallel architectures for VLSI have been discussed
by many authors (among them Kung 1982 and Seitz 1985). The design should
contain many modules which are replicated many times (i.e. in a simple and
regular fashion), using both pipelining and multiprocessing principles. Finally, a
successful parallel algorithm for VLSI design will be one where the communication
is only between neighbouring processors.
The development of new manufacturing techniques for the fabrication of
small, dense and inexpensive semiconductor chips has created a unique
opportunity in the computer industry. With the use of VLSI circuits, the size of
processing elements and memory is considerably reduced and it becomes feasible
to combine the principles of Automata Theory with pipeline concepts. This
combination is especially attractive since device manufacturing costs have remained
constant relative to circuit complexity, with more time and money being invested in
the design and testing of new chips.
In relation to what has been said above, approaches to device design have
progressed significantly, to the point where hardware design now relies on software
techniques, i.e. special rules for circuit layout and high-level design languages (e.g.
geometry languages, hardware description languages (HDL), stick languages,
register languages, etc.) [Mead 1980]. In fact, some of these languages offer
powerful chip fabrication capabilities directly from the designs they express.
Illustrative of this trend is the term 'silicon compiler', which is utilised by
hardware designers to refer to computer-aided systems currently under
development. Analogous to a conventional software compiler, the silicon compiler
will convert linguistic representations of hardware components into machine code,
which can be stored and subsequently utilised in computer-assisted fabrication.
However, VLSI presents some problems: as the sizes of wires and transistors
approach the limits of photolithographic resolution, it becomes literally impossible
to achieve further miniaturisation, and actual circuit area (or chip area) becomes a
key issue. In addition, chip area is limited in order to maintain a high chip yield,
and the number of pins (through which the chip communicates with the outside
world) is limited by the finite size of the chip perimeter. These restrictions form the
basis of the VLSI paradigm.
For a newly developed technology or product to survive in a highly
competitive industry there must be sufficient demand for it. The emergence and
subsequent success of VLSI-oriented computing systems is due not only to
H. T. Kung's foresight but also to the timeliness of the state of the industry. At the
same time, Kung revealed the systolic concept as a means of introducing parallelism
into VLSI circuits. Further, the idea of using VLSI for signal processing became
the major focus of attention in governmental, industrial and university research
establishments.
2.2 VLSI-ORIENTED ARCHITECTURES
For large applications it may not be feasible to design a single chip
implementation of an array, especially when balance between flexibility, efficiency,
performance and implementation cost is essential. An alternative approach is to
implement basic cells at the board level using a set of 'off-the-shelf' components
which are widely available as chip packages or sets from various manufacturers.
The continuously widening applicability of the systolic approach, as well as
the diversification of problems to be solved, gave birth to a large number of systolic
algorithms. Except for a limited number of cases, where performance is very
critical, it has been accepted that, in general, mapping a systolic computation
directly onto silicon is less attractive than programming a special-purpose, or even
general-purpose, VLSI processor array. In this section, we shall briefly review the
main contenders among VLSI-oriented computing systems which have received
attention to date.
2.2.1 The WARP Architecture
The Warp architecture, developed at Carnegie Mellon University (CMU) by
H.T. Kung and his associates, is the most advanced VLSI-oriented system for
purely systolic algorithms. Its main areas of application are low-level signal and
image processing tasks (with special emphasis on computer vision), as well as
matrix computations and other compute-intensive numerical algorithms [Kung
1984]. It is a linearly interconnected (1-D) array of processors with data and
control flowing in one direction, with input at one end of the array and output at the
other. This design allows easy implementation, synchronization by a simple global
clock mechanism, minimum input/output requirements and the use of efficient fault
tolerance
techniques. The basic Warp cell is constructed from a collection of chips,
as illustrated in fig. (2.1), its main characteristics being the pipelining of data and
control. A Weitek 32-bit floating point multiplier (MPY) and ALU perform the
operations and can be used in pipeline mode to improve throughput by two-level
pipelining. The MPY and ALU registers use Weitek register file chips and can
compute approximate functions like the inverse square root using look-up facilities.
The cell has a significant amount of local memory (RAM), so that it is possible to
reduce the I/O requirements during the computation, and to simulate systolic
algorithms that have been designed for (2-D) systems. Each cell is programmable,
controlled by a microcode sequencer and with microcode storage. Finally, there are
input queues and multiplexers to implement programmable delays in the dataflow
and to relax the strictly pipelined dataflow.
As shown in fig. (2.1), the x, y and addr-files are also register files, but this
time they are used to implement delays for synchronising data paths. The crossbar
and input multiplexers (muxes) provide communication between the individual
elements and can be reconfigured by control signals. The muxes
permit two-directional data flow and ring set-ups (using wrap-around). A 10-cell
prototype has been built at CMU and tested on a number of example arrays
discussed in [Kung 1984].
An Algol-like language, called W2, is used for the high-level programming of
Warp; W2 is translated by a compiler to a lower-level language, Wl, which is the
assembly-type language of the system. The Warp project has shown the
significance of software support for the development of a systolic computer,
especially the design of a compiler, which provides feedback for the architecture
designer since it requires a thorough study of the functionality of the architecture.
Figure (2.1) Data paths for the WARP cell
2.2.2 The Wavefront Array Processor (WAP)
One problem with systolic arrays is that cell synchronization in very large
arrays requires long delays between clock signals due to the clock skew problem,
which increases with the size of the array. Also, the synchronization of data transfer
among large numbers of processors leads to large current surges as the cells are
simultaneously energized or change state.
A solution to the above mentioned problems, as suggested by S.Y.Kung
[Kung 1985], is to take advantage of the data and control flow locality, inherently
possessed by most algorithms. This permits a data-driven, self-timed approach to
array processing. Such an approach conceptually substitutes the requirement of
correct 'timing' by correct 'sequencing'. This concept is used extensively in data
flow computers and wavefront arrays.
Basically the derivation of a wavefront process consists of the three following
steps:
a. the algorithms are expressed in terms of a sequence of recursions;
b. each of the above recursions is mapped to a corresponding computation
wavefront; and,
c. the wavefronts are successively pipelined through the processor array.
Based on this approach, S.Y. Kung introduced the Wavefront Array
Processor (WAP), which consists of an N×N array of processing elements with a regular
connection structure, a program store and memory buffering modules, as illustrated
in Figure 2.2.
Figure (2.2) The Wavefront array processor.
Probably the main feature of the WAP is its asynchronous communication; i.e.
each PE communicates with its neighbours using a handshaking protocol, and
performs its computations as soon as all the operands and control required are
available. Thus, there is no need for a global clock mechanism for synchronization;
each cell is self-timed and the whole array operation is data-driven, according to the
concept of dataflow computing. The processor grid acts as a wave-propagating
medium, and an algorithm is executed by a series of wavefronts moving across the
grid. Processors are assumed to support the pipelining of waves, with the spacing of
waves determined by the availability of data and the execution of the basic
operation. The speed of the wavefront is equivalent to the data transfer time.
Summarising, the wavefront approach combines the advantages of data flow
machines with the localities of data flow and control flow inherent in a class of
algorithms. Since the burden of synchronising the entire array is avoided, a
wavefront array is architecturally 'scalable'.
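The data-driven, self-timed behaviour described above can be sketched in Python (a toy illustration of the idea only, not S.Y. Kung's design): each cell of a small array is a thread that blocks until its west and north operands have arrived over handshaking channels, so a computation wavefront propagates across the grid without any global clock.

```python
# Sketch of a 3x3 wavefront array: each cell waits for its operands and
# then passes its result east and south. Queues stand in for the
# handshaking channels; the cell operation (addition) is arbitrary.
import threading, queue

N = 3
# right[i][j]: channel from cell (i,j) to (i,j+1); down[i][j]: to (i+1,j)
right = [[queue.Queue(maxsize=1) for _ in range(N + 1)] for _ in range(N)]
down = [[queue.Queue(maxsize=1) for _ in range(N)] for _ in range(N + 1)]
result = [[0] * N for _ in range(N)]

def cell(i, j):
    a = right[i][j].get()               # wait for the operand from the west
    b = down[i][j].get()                # wait for the operand from the north
    result[i][j] = a + b                # the cell's basic operation
    right[i][j + 1].put(result[i][j])   # propagate the wavefront eastward...
    down[i + 1][j].put(result[i][j])    # ...and southward

threads = [threading.Thread(target=cell, args=(i, j))
           for i in range(N) for j in range(N)]
for t in threads:
    t.start()
# Inject the initial wavefront along the top and left edges of the array.
for i in range(N):
    right[i][0].put(1)
for j in range(N):
    down[0][j].put(1)
for t in threads:
    t.join()
print(result)
```

Note that no cell is told when to fire: correct 'sequencing' replaces correct 'timing', exactly as in the wavefront concept.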
2.2.3 The CHIP Architecture
In order to derive a more flexible VLSI-oriented computing system than
special-purpose designs, in which the same hardware can be used to solve several different
problems, Snyder suggested the design of the configurable, highly parallel
architecture 'CHIP' [Snyder 1982], based on configurability. Conceptually, the
CHIP represents a family of systems, each built out of three components: a set of
processing elements (PE's), a switch lattice and a controller. The lattice, the most
important component of a CHIP, is a 2-D structure of programmable switches
connected by data paths. The PE's are placed at regular intervals.
The processing elements are microprocessors, each coupled with several
kilobytes of RAM used as local storage. Data can be read or written through any of the
eight data paths or ports connected to the PE. Generally, the data transfer unit is a
word, though the physical data path may be narrower. The PE's operate
synchronously and systolically.
Each programmable switch contains a small amount (around 16 words) of
local RAM which is used to store instructions (one instruction per word) called
configuration settings. Each configuration setting specifies pairs of data paths to be
connected. When executed, each pair, which also works as a cross-over level,
establishes a direct, static connection across the switch that is independent of the
others. The data paths are bidirectional and fully duplex, i.e. data movements can
take place in either direction simultaneously. Executing a program causes the
specified connections to be established and to persist over time, e.g. over the
execution of an entire algorithm.
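The configuration-setting mechanism can be sketched as follows (a toy Python model of my own; the class and method names are invented and do not describe Snyder's actual hardware):

```python
class Switch:
    """Toy model of a CHIP programmable switch: its small local RAM
    holds configuration settings, each a list of pairs of data paths
    to be connected."""
    def __init__(self):
        self.ram = {}        # configuration-setting store (around 16 words)
        self.active = {}     # currently established connections

    def load(self, setting_id, pairs):
        self.ram[setting_id] = pairs

    def invoke(self, setting_id):
        # Each pair establishes a direct, static connection across the
        # switch, independent of the others; it persists until the
        # controller broadcasts a new setting.
        self.active = {}
        for a, b in self.ram[setting_id]:
            self.active[a] = b
            self.active[b] = a
        return self.active

s = Switch()
s.load("mesh", [("N", "S"), ("E", "W")])   # straight-through: a mesh phase
s.load("tree", [("N", "E")])               # a corner turn: a tree phase
print(s.invoke("mesh")["N"])
print(s.invoke("tree")["N"])
```

Invoking a different setting restructures the lattice in a single logical step, mirroring the phase changes described below.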
The processing elements can be connected to form a particular structure by
directly configuring the lattice. That is, the programmer sets each switch in such
a way that collectively they implement the desired processor interconnection graph.
In addition to the lattice, a controller is also provided, and is responsible for loading
programs and configuration settings into the PE and switch memories respectively.
This task is performed through an additional data path network, called 'skeleton'.
From the functional point of view, CHIP processing starts with the controller
broadcasting a command to all switches to invoke a particular configuration setting,
for example to implement a mesh pattern. The established configuration remains in effect
during the execution of a particular phase of an algorithm. When a new phase of
processing, requiring different configuration settings, is to begin, the controller
broadcasts a command to all switches so that they invoke the new configuration setting;
for example, a structure implementing a tree. With the lattice thus restructured, the
PE's resume processing having taken only a single logical step in reconfiguring the
structure.
In conclusion, the CHIP computer, which is a highly parallel computing system
providing a programmable interconnection structure integrated with the processor
elements, is well suited to VLSI implementation. Its main objective is to provide the
flexibility needed in order to solve general problems while retaining the benefits of
regularity and locality.
2.3 INMOS TRANSPUTERS AND OCCAM
Until the advent of the transputer, MIMD machines were limited to a relatively
small number of processors due to the difficulties in programming and the
synchronization mechanisms required to control the processors. The combination
of the transputer and Occam, which explicitly controls concurrency, was
designed to overcome these limitations [Harp 1989].
The Inmos transputer family is a range of system components, each of which
combines processing, memory and interconnect in a single VLSI chip. A
concurrent system can be constructed from a collection of transputers which operate
concurrently and communicate through serial communication links. Such systems
can be designed and programmed in Occam, a language based on communicating
sequential processes (CSP) (fig. 2.3). Transputers have been successfully used in application
areas ranging from embedded systems to supercomputers.
The power of the transputer is that it creates a new level of abstraction, in the
same way as the use of logic gates and Boolean algebra provides the design
methodology for present electronic systems. The term 'transputer' reflects this new
device's ability to be used as a system building block. The word is derived from
'transistor' and 'computer', since the transputer is both a computer on a chip and a
silicon component like a transistor. The architecture has been optimised to obtain
the maximum of functionality for the minimum of silicon [Inmos 1987].
The first member of the Inmos transputer family, the IMS T414 32-bit
transputer, which was introduced in 1985, has enabled concurrency to be applied in
a wide variety of applications such as simulation, robot control, image synthesis
and digital signal processing. Many computationally intensive applications can
exploit large arrays of transputers, the system performance depending on the
number of transputers, the speed of inter-transputer communication, and the
performance of the transputer processor.
Many important applications of transputers involve floating point arithmetic.
Another member of the Inmos transputer family, the IMS T800, can increase the
performance of such a system by offering greatly improved floating-point and
communications performance [May 1989].
The latest addition to the transputer family is the T9000 which provides a
balance between the computation and communication facilities of the transputer. It
provides high performance computation as well as high throughput communication.
2.3.1 Transputer Architectures
One important property of VLSI technology is that communication between
the devices is very much slower than communication within a device. In a
computer, almost every operation that the processor performs involves the use of
memory. For this reason a transputer includes both processor and memory in the
same integrated circuit device.
The speed of communication between electronic devices is optimized by the
Figure (2.3) Transputer Architecture.
use of one-directional signal wires, each connecting two devices. To provide
maximum speed with minimal wiring, the transputer uses point-to-point serial
communication links for direct connection to other transputers. Alternatively, if
many devices are connected by a shared bus, electrical problems of driving the bus
require that the speed is reduced. Also, additional control logic and wiring are
required to control the sharing of the bus.
The transputer is designed so that its external behaviour corresponds to the
formal model of a process. As a consequence, it is possible to program systems
containing multiple interconnected transputers in which each transputer implements
a set of processes [Inmos 1987]. The transputer has a conventional microcoded
processor and there is a small core of about 32 instructions which is used to
implement simple sequential programs.
Internally, the IMS T414, IMS T400 and IMS T425 consist of a memory,
processor and communications system connected via a 32-bit bus. The bus is also
connected to the external memory interface, enabling additional local memory to be
used. The processor, memory and communications system each occupy about 25%
of the total silicon area, the remainder being for power distribution, clock
generators and external connections.
The IMS T800 and IMS T805, with their 64-bit on-chip floating point units, are
only 20% larger in area than the IMS T414. The small size and high performance
come from a design which takes careful note of silicon economics. This contrasts
starkly with conventional co-processors, where the floating point unit typically
occupies more area than a complete microprocessor, and requires a second
supporting chip. The way in which the major blocks of the IMS T800 and IMS
T414 are interconnected is indicated in fig. (2.4).
The Central Processing Unit (CPU) of the transputer contains three registers,
A, B and C, which are used for integer and address arithmetic and form a hardware stack.
Loading a value into the stack pushes B into C, and A into B, before loading A.
Storing a value from A pops B into A and C into B.
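This push/pop behaviour can be sketched as follows (a Python model for illustration only; the class name is mine):

```python
class RegStack:
    """Sketch of the transputer's three-register evaluation stack."""
    def __init__(self):
        self.A = self.B = self.C = 0

    def load(self, value):
        # Loading pushes B into C and A into B before loading A.
        self.C = self.B
        self.B = self.A
        self.A = value

    def store(self):
        # Storing pops B into A and C into B, returning the old A.
        value = self.A
        self.A = self.B
        self.B = self.C
        return value

s = RegStack()
s.load(1); s.load(2); s.load(3)
assert (s.A, s.B, s.C) == (3, 2, 1)
assert s.store() == 3 and s.A == 2
```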
The floating point unit (FPU) operates concurrently with, and under the
control of, the CPU. It also contains a three-register floating-point evaluation stack.
Figure (2.4) Transputer internal architecture.
All data communication between memory and the floating point unit is done under
the control of the CPU. It was a design decision that the transputer should be
programmed in a high-level language. The instruction set has, therefore, been
designed for simple and efficient compilation. It contains a relatively small number
of instructions, all with the same format, chosen to give a compact representation of
the operations most frequently occurring in programs. The instruction set is
independent of the processor wordlength, allowing the same microcode to be used
for transputers with different wordlengths. The instruction format gives a more
compact representation of high-level language programs than more conventional
instruction sets do. Since a program requires less store to represent it, less memory
bandwidth is taken up with fetching instructions.
The processor provides efficient support for the Occam model of concurrency
and communication. It has a microcoded scheduler which enables any number of
concurrent processes to be executed together, sharing the processor time. This
removes the need for a software kernel. The processor does not need to support the
dynamic allocation of storage as the Occam compiler is able to perform the
allocation of space to concurrent processes.
At any time, a concurrent process may be active (i.e. being executed or on a
list waiting to be executed) or inactive (i.e. ready to input, ready to output, or
waiting for the timer). The scheduler operates in such a way that inactive processes
do not consume any processor time. The active processes waiting to be executed
are held on a list. This is a linked list of process workspaces, implemented by two
registers, one of which points to the first process on the list, the other to the last.
Thus in fig. (2.5), S is executing, and P, Q and R are active, awaiting execution. A
process is executed until it is unable to proceed because it is waiting to input or
output, or waiting for the timer. Whenever a process is unable to proceed, its
instruction pointer is saved in its workspace and the next process is taken from the list.
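The two-register linked list can be sketched as follows (a Python illustration; the names Workspace, front and back are mine, mirroring the Front and Back registers of fig. (2.5)):

```python
class Workspace:
    """Sketch of a process workspace holding a link to the next one."""
    def __init__(self, name):
        self.name = name
        self.next = None      # link to the next workspace on the list

class Scheduler:
    """Two registers point to the first and last active process."""
    def __init__(self):
        self.front = self.back = None

    def add(self, w):
        # A process that becomes ready is appended at the back.
        if self.back is None:
            self.front = self.back = w
        else:
            self.back.next = w
            self.back = w

    def dispatch(self):
        # The next process to execute is taken from the front.
        w = self.front
        self.front = w.next
        if self.front is None:
            self.back = None
        return w

sched = Scheduler()
for name in ("P", "Q", "R"):
    sched.add(Workspace(name))
print(sched.dispatch().name)   # P runs first
```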
Communication between processes is achieved by means of channels. Occam
communication is point-to-point, synchronized and unbuffered. As a result, a
channel needs no process queue, no message queue and no message buffer. A
channel between two processes executing on the same transputer is implemented by
a single word in memory; a channel between processes executing on different
transputers is implemented by point-to-point links.
As in the Occam model, communication takes place when both the inputting
and outputting processes are ready. Consequently, the process which first becomes
ready must wait until the second one is also ready.
At any time, an internal channel (a single word in memory) either holds the
identity of a process, or holds the special value 'empty'. The channel is initialized to
empty before it is used. When a message is passed using the channel, the identity of the
first process to become ready is stored in the channel, and the processor starts to
Figure (2.5) Linked process list
execute the next process from the scheduling list. When the second process to use
the channel becomes ready, the message is copied, the waiting process is added to
the scheduling list, and the channel is reset to its initial state. It does not matter
whether the inputting or the outputting process becomes ready first.
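This rendezvous protocol can be sketched as a small state machine (a Python illustration of my own, not Inmos microcode):

```python
# Sketch of the single-word internal channel: it holds either the
# special value 'empty' or the identity of the first process to become
# ready, whichever side that happens to be.
EMPTY = object()

class Channel:
    def __init__(self):
        self.word = EMPTY               # initialized to empty before use

    def ready(self, process, message=None):
        """Called when a process becomes ready to input or output.
        Returns None if the caller must wait, or the completed
        transfer (waiter, message) once both sides are ready."""
        if self.word is EMPTY:
            # First party: record its identity (and message, if it is
            # the outputter) in the channel word, then wait.
            self.word = (process, message)
            return None
        waiter, data = self.word
        self.word = EMPTY               # reset to the initial state
        # Copy the message (from whichever side supplied it) and
        # reschedule the waiting process.
        return (waiter, message if data is None else data)

ch = Channel()
assert ch.ready("writer", "hello") is None       # writer arrives first, waits
assert ch.ready("reader") == ("writer", "hello") # reader completes the transfer
assert ch.word is EMPTY
```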
When a message is passed via an external channel, the processor delegates to
an autonomous link interface the job of transferring the message, and deschedules
the process. When the message has been transferred, the link interface causes the
processor to reschedule the waiting process. This allows the processor to continue
the execution of other processes whilst the external message transfer is taking place.
A link between two transputers is implemented by connecting a link interface
on one transputer to a link interface on the other transputer by two one-directional
signal wires, along which data is transmitted serially. The two wires provide two
Occam channels, one in each direction. This requires a simple protocol to multiplex
data and control information. Messages are transmitted as a sequence of bytes, each
of which must be acknowledged before the next is transmitted.
The fast block move of the IMS T414 makes it suitable for use in graphics
applications using byte-per-pixel colour displays. The block move in the IMS T414
is designed to saturate the memory bandwidth, moving any number of bytes from
any byte boundary to any other byte boundary using the smallest possible number
of word read and write operations. The IMS T805 extends this capability by
the incorporation of a two-dimensional version of the block move (Move2d), which can
move windows around a screen at full memory bandwidth, and a conditional version
of the same block move which can be used to place templates and text into
windows. One of these operations (Draw2d) copies bytes from source to
destination, writing only non-zero bytes to the destination. A new object of any
shape can therefore be drawn on top of the current image.
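The conditional block move can be sketched as follows (my own Python rendering of the Draw2d idea, not the actual microcoded instruction):

```python
# Sketch of a conditional 2-D block move: copy a rectangular window
# byte by byte, writing only non-zero source bytes, so an object of
# any shape overlays the current image.
def draw2d(src, dst, dx, dy):
    for y, row in enumerate(src):
        for x, byte in enumerate(row):
            if byte != 0:                 # zero bytes leave dst untouched
                dst[dy + y][dx + x] = byte
    return dst

screen = [[9] * 5 for _ in range(4)]      # background filled with value 9
sprite = [[0, 7, 0],
          [7, 7, 7]]                      # a shaped (non-rectangular) object
draw2d(sprite, screen, dx=1, dy=1)
print(screen[1])   # -> [9, 9, 7, 9, 9]
```

An unconditional Move2d would instead copy every byte of the window, zeros included.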
2.3.2 OCCAM
Occam is a programming language which from the outset was designed to
support concurrent applications. Occam was designed specifically for the
transputer, so the programming model for transputers is defined by Occam.
Transputers can be programmed in Occam or other languages. Occam is based on
the notion of communicating sequential processes, and provides concurrency and
communication as fundamental features of the language.
Where it is required to exploit concurrency, but still to use standard sequential
languages such as C or FORTRAN, Occam can be used as a harness to link
modules written in these languages [Inmos 1988].
In Occam, processes are connected to form concurrent systems. Each process
can be regarded as a black box with internal state, which can communicate with
other processes using point-to-point communication channels. Processes can be used
to represent the behaviour of many things, for example, a logic gate, a
microprocessor, etc.
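This process model can be sketched in Python (an illustration only, with thread-safe queues standing in for Occam channels; the process and channel names are invented):

```python
# Sketch of two Occam-style processes connected point-to-point:
# each is a black box with internal behaviour, communicating only
# over its channels.
import threading, queue

def doubler(cin, cout):
    # Internal behaviour of the process: double each value received.
    while (v := cin.get()) is not None:   # None marks end of stream
        cout.put(v * 2)
    cout.put(None)

def collect(cin, out):
    while (v := cin.get()) is not None:
        out.append(v)

a, b = queue.Queue(), queue.Queue()       # the two channels
out = []
procs = [threading.Thread(target=doubler, args=(a, b)),
         threading.Thread(target=collect, args=(b, out))]
for p in procs:
    p.start()
for v in (1, 2, 3):
    a.put(v)
a.put(None)
for p in procs:
    p.join()
print(out)   # -> [2, 4, 6]
```

Because each process touches only its own state and its channels, the same pair could run on one processor or be distributed across two, as fig. (2.6) illustrates for Occam.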
The design of Occam was heavily influenced by the work of Hoare on his
theoretical model of Communicating Sequential Processes (CSP), which grew out of
a study of process synchronisation problems [Galletly 1990].
Every transputer implements the Occam concepts of concurrency and
communication. As a result, Occam can be used to program an individual transputer
or to program a network of transputers. When Occam is used to program an
individual transputer, the transputer shares its time between the concurrent
processes and channel communication is implemented by moving data within the
memory. When Occam is used to program a network of transputers, each transputer
executes the process allocated to it (fig. 2.6). Communication between Occam
processes on different transputers is implemented directly by transputer links. Thus
the same Occam program can be implemented on a variety of transputer
configurations, with one configuration optimized for cost, another for performance,
or another for an appropriate balance of cost and performance.
Figure (2.6) Mapping processes onto one or several transputers: three processes on one transputer, and the same processes distributed over three transputers.
All transputers include special instructions and hardware to provide maximum
performance and an optimal implementation of the Occam model of concurrency and
communication. Together, the transputer and Occam provide modular hardware and
software components of the type which is essential in the construction of highly
parallel computer systems.
However, its lack of powerful data structures and its closeness to the
hardware mean that Occam is likely to be the low-level language of fifth
generation systems, with applications possibly written in a more abstract language,
e.g. Ada.
2.3.3 Transputer Development System
The transputer development system (TDS) is an integrated development
system which can be used to develop Occam programs for a transputer network.
The TDS provides a complete programming environment for the generation of reliable,
well structured and efficient programs. It consists of a plug-in board for an IBM
PC, comprising an IMS T414 transputer with 2 Mbytes of RAM, and all the appropriate
development software (fig. 2.7).
Using the TDS, a programmer can edit, compile and run Occam programs
entirely within the development system. Occam programs can be developed on the
TDS and configured to run on a network of transputers, with the code being loaded
onto the network from the TDS. Alternatively, an operating system file can be
created which will boot a single transputer or network of transputers [Inmos 1988].
The TDS comes with all the necessary software tools and utilities to support
this kind of development. There are a variety of software routines to support
mathematical functions and input-output operations, for example.
Figure (2.7) Transputer Development System
The benefits of the TDS combine to provide design productivity, and increase
confidence in the timely and accurate implementation of highly concurrent and
real-time systems.
In the development of programs for transputer networks, as with other
microprocessor development systems, a distinction may be made between the 'host'
and 'target' environments. The program development tools are run on a host
computer, which includes a terminal and a filing system. The host computer may
include a transputer within the computer, on which the development tools are run,
with the host computer providing the development tools with access to its terminal
and filing system; in this case the transputer is known as the host transputer
[Wayman 1989].
Before the program under development is run on a transputer network, it may
be run on a single host transputer connected to the host computer, with access to the
terminal and filing system of the host computer. Much of the program testing,
debugging, and iterative development can be done in this environment. The program
may then be loaded into a network of transputers from the host; such a network is
known as a 'target' network.
The transputer network must be connected to the host via a link. The
network must also be connected together by transputer links. The topology of the
network must match the configuration description, otherwise the loading will fail.
As well as the link connections, Inmos boards also provide system control functions
to monitor and control the state of the transputer network. The system control
connections on the boards are chained together to allow the whole of the network to be
controlled from the host.
As a more substantial example of configuration, consider a four-transputer
network on an IMS B003 transputer evaluation board, loaded from the host
computer. Every transputer on the IMS B003 has two links available on the edge
connector (links 0 and 1), while the other two are preconnected in a square array
(links 2 and 3). The example includes two different processes, control and work.
Figure (2.8) The logical structure of the program: (a) allocation of processes on transputers; (b) program running on an IMS B003.
Fig. (2.8a) shows the logical structure of the program, as it is mapped onto
the four transputers. The procedure control has two channels, connecting it to and
from the host computer. In addition, there is one channel to the pipeline of work
processes, and one channel from the last process in the pipeline.
Once the process to run on each of the transputers has been specified, a
configuration description can be written, mapping processes to processors and links
to channels. The configuration for the IMS B003 must map control onto the root
transputer and work onto all three remaining transputers (fig. 2.8b). The four
channels that connect the four transputers on the B003 must be declared as an array
of channels.
2.3.4 Performance Measurements of a Transputer Network
The two measures of a parallel system's performance are the speed-up (S) and
efficiency (E). The speed-up of a transputer system has been defined as follows:

    S = T(1) / T(N),                                                  (2.1)

where T(N) is the total runtime consumed by an application program running on N
transputers and T(1) is the time on one transputer. The efficiency of a transputer
system has been defined as follows:

    E = T(1) / (N x T(N)).                                            (2.2)
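The two measures can be computed directly from equations (2.1) and (2.2); the sample runtimes below are invented purely for illustration.

```python
def speedup(t1, tn):
    """S = T(1) / T(N), from equation (2.1)."""
    return t1 / tn

def efficiency(t1, tn, n):
    """E = T(1) / (N * T(N)), from equation (2.2)."""
    return t1 / (n * tn)

t1, t4 = 100.0, 30.0       # hypothetical runtimes on 1 and 4 transputers
print(speedup(t1, t4))     # below 4: communication overhead costs time
print(efficiency(t1, t4, 4))
```

Note that E = S / N, so efficiency expresses how far the measured speed-up falls short of the ideal linear case.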
Transputer systems inevitably have their performance degraded by inter-
processor communications. There is a processor time overhead for each input and
output statement of about 1 µs. This overhead is not significantly dependent on the
size of the message communicated; therefore the use of fewer, longer messages is
an advantage.
Although transputer links are synchronized, data cannot always be available
when needed, owing to the low speed of the link. If the process runtime is smaller
than the time the input message takes to communicate through the link, the
processor will be idle for a period of time, even when all communications are run as
high-priority processes and the data is ready on the sender. This increases the
overhead time and consequently decreases the system performance. In such cases,
long messages may not be the best solution and a compromise has to be reached
between the number and size of the messages.
2.3.5 The Transputer Network Used for this Research
The PARC (Parallel Algorithms Research Centre) transputer hardware system
consists of a Sun SPARC workstation, a Volvox-liS interface board, a Tandon
Plus PC and an Inmos Transputer Evaluation Module (ITEM 400). A full description of
the ITEM box can be found in [Inmos 86]. The configuration consists of 2 Inmos
evaluation cards (IMS B003-2) and 1 IMS B012 eurocard TRAM motherboard.
The software systems used in connection with the hardware include the operating
system SunOS, the transputer development system (TDS3) and the motherboard
module software (MMS2).
The Sun workstation acts as the host computer system. By running a program
called a 'server' on the Sun workstation, the Sun system provides the transputer
development system with filing facilities. This allows users to input data to
transputers from the keyboard or files, and to output data from transputers to the screen
or files.
The Volvox interface board, which is plugged into the Sun system, provides the
communication between the host system (Sun) and the transputer (the mother
transputer) on the Volvox board. The communications between the mother
transputer and the transputer network are realised by links provided by the
transputers.
Alternatively, a Tandon Plus PC acts as the host computer,
providing terminal and file storage facilities. An IBM PC/AT version of the TDS
runs on the host computer. The TDS comprises an IMS T414 transputer on a B004
board and 2 Mbytes of DRAM.
Each of the IMS B003-2 boards contains 4 IMS T414B-G20S transputers, with 256
Kbytes of DRAM each, capable of 10 MIPS performance. The transputers on each
board are connected as a ring (fig. 2.9), leaving 2 uncommitted links per transputer.
The 8 uncommitted links are available on the edge connector, allowing a wide range
of system configurations to be achieved using link cables to connect between
boards.
Figure (2.9) IMS B003 Configuration.
The IMS B012 holds 16 IMS B411 TRAMs in its slots. Each TRAM
incorporates an IMS T800 transputer and 1 Mbyte of dynamic RAM. Links 1 and 2
of each of the transputers are used to connect them as a 16-stage pipeline. The
Figure (2.10) Transputer system.
pipeline can, however, be broken using the jumper blocks supplied, to allow other
combinations.
The IMS T800 incorporates a floating point unit capable of sustaining over
1.5 million floating point operations per second. Full details of the IMS T800
can be obtained from [Inmos 86].
Fig. (2.10) is an illustration of the transputer system. The 3 boards can be
connected together using link cables. Alternatively, networks comprising only
T414's or T800's can be configured. A link on the B004 transputer is used to connect
the transputer network with the TDS, which in turn connects with the host
transputer.
2.4 THE SEQUENT BALANCE 8000 SYSTEM
An example of a tightly-coupled MIMD architecture, or more precisely a bus
architecture, which we discuss in this section, is the Sequent computer
architecture.
Sequent Computer Systems Inc. has developed two families of parallel
computers: the Balance series and the Symmetry series. The two series are very
similar in their structure, configuration, operating system and user software.
The primary difference between them is the type of microprocessor used to build
the CPUs, which has led to a substantial difference between the two series at the
machine-language level. There are, of course, other differences, such as speed,
performance and memory size.
In this discussion we concentrate on the Balance series, and in particular on
the Balance 8000 model, because most of the early part of this research was
carried out using simulators running on this type of machine.
The Balance 8000 model can have up to twelve 32-bit processors connected in a
tightly-coupled manner (Figure 2.16).
The machine at Loughborough University's Parallel Algorithm Research
Centre (PARC) has 12 processors. These processors are connected via a high-speed
bus to all peripherals and to shared memory, and they concurrently execute a shared
copy of a Unix-based operating system. Any processor can execute any program,
which achieves dynamic load balancing, and multiple processors can work in parallel
on a single application. To minimize accesses to the system bus, each processor has
its own cache memory.

[Figure: processor boards, memory and the peripheral interface connected by the
data bus; peripherals include disks and terminals]

Figure (2.16) Sequent Balance 8000 architecture.
Each CPU unit has a Floating Point Unit (FPU), a Memory Management Unit
(MMU) and a System Link and Interrupt Controller (SLIC), whose task is to
manage the control of multiple processors.
The DYNIX (Dynamic Unix) operating system is an enhanced version of
Berkeley Unix 4.2bsd which can emulate Unix System V at the system-call and
command levels. To support the Balance multiprocessing architecture, the DYNIX
operating system kernel has been made completely shareable, so that multiple
CPUs can execute identical system calls and other kernel code simultaneously.
2.5 SYSTOLIC SYSTEM FOR VLSI COMPUTING
STRUCTURES
High-performance, special-purpose, VLSI-oriented computer systems are
typically built to meet specific applications, or to off-load intensive computations
that are especially taxing to general-purpose computers. However, since most of
these systems are built on an ad hoc basis for specific tasks, methodological
work in this area is rare. In an attempt to improve on this ad hoc approach,
some general design concepts are discussed here; the following paragraphs
introduce the particular concept of systolic array architectures, and a general
methodology for mapping high-level computational problems onto cellular
hardware structures.
The systolic approach to parallel processing evolved from its possible
applications, together with the appropriate technology and the background
knowledge for its realisation. The applications arose from the ever-increasing
demand for faster and more reliable computation, especially in areas like real-time
signal processing and large-scale scientific computation. The appropriate technology
was provided by the remarkable advances in VLSI and automated design tools.
In areas such as real-time signal processing and large-scale scientific
computation, the trade-off between generality and performance comes down on the
side of special-purpose devices, because of the stringent timing requirements.
Thus, a systolic engine can function as a peripheral device attached to a host
system.
The host system need not be a computer; in the case of real-time signal
processing, systolic systems are suitable for sensor devices, accepting a sampled
signal and then passing it on, after some processing, to other systems for further
processing [Dew 1986]. In the case of large-scale scientific computation, systolic
systems can be used as a 'hardware library' for certain numerical algorithms.
Alternatively, they can be utilized to 'matricialize' the internal arithmetic units of
more general-purpose supercomputers.
Some of the VLSI limitations are alleviated when systolic algorithms are
implemented on processor arrays. For example, the actual chip design is no longer
an issue, since the processor is programmable. Further, the interconnections
need not be strictly planar. However, in both cases, simplicity and regularity remain
factors of the utmost importance for an efficient systolic design: in the first case
because they ensure the design of cost-effective, special-purpose VLSI chips, and
in the second because of the promise of harnessing the programming
complexity of parallel computers with a large number of cooperating processors.
2.5.1 An Environment For The Development Of The Systolic
Approach.
The concept of systolic architectures, pioneered by H. T. Kung [Mead 1980],
which has been successfully shown to be suitable for VLSI implementation, is
basically a general methodology for directly mapping algorithms onto an array of
processing elements. It is particularly amenable to a special class of algorithms,
taking advantage of their regular, localised data flow.
The word 'systole' was borrowed from physiologists, who use it to describe
the rhythmically recurrent contraction of the heart and arteries which pulses blood
through the human body. By analogy, the function of a cell in a systolic computing
system is to ensure that data and control are pumped in and out at a regular
pulse, while performing some short computation [Kung 1978 and Dew 1986].
Systolic systems combine pipelining, array processing and multiprocessing to
produce a high-performance parallel computer system. This combination is
exemplified in fig. (2.12), which shows a typical arrangement of a systolic system. A
linear array (pipeline) of n processors (cells, in the systolic terminology) is
connected with the host system, via the boundary cells. The number of cells in the
array is determined by the maximum attainable I/O bandwidth of the host.
Operations are pumped through the array at a regular pulse. Everything is planned
in advance so that all inputs to a cell arrive at just the right time, before they are
consumed. Intermediate results are passed on immediately to become the inputs of
further cells. A steady stream flows in at one end of the array, which is said to
consume data and produce results 'on the fly'. The single operation common to all
algorithms considered in this section is the so-called inner product step (IPS), C =
C + A * B, which leads to a fundamental network capable of performing
computation-intensive algorithms such as digital filtering, matrix multiplication
and other related problems (see Table (2.1) for a more comprehensive list of
potential systolic applications).
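The inner product step and the pulse-by-pulse data movement described above can be illustrated with a minimal synchronous simulation of a linear array. The sketch below (Python, for clarity; it is not the Occam implementation used in this research, and all names are illustrative) performs matrix-vector multiplication: each cell holds one row's running sum C, and the elements of x are pumped rightwards one cell per clock pulse.

```python
# Minimal synchronous simulation of a linear systolic array computing
# y = W x via the inner product step C = C + A * B. Cell i accumulates
# row i of W; x values enter at the left and shift right one cell per pulse.

def systolic_matvec(W, x):
    n = len(W)              # number of cells, one per matrix row
    acc = [0] * n           # accumulator C held inside each cell
    pipe = [None] * n       # x value currently held by each cell
    step = [0] * n          # how many x elements each cell has consumed
    # Run enough pulses for the last x element to traverse the whole array.
    for t in range(2 * n - 1):
        # Communication: each cell passes its x value to its right neighbour.
        for i in range(n - 1, 0, -1):
            pipe[i] = pipe[i - 1]
        pipe[0] = x[t] if t < n else None
        # Computation: every cell holding data performs one inner product step.
        for i in range(n):
            if pipe[i] is not None:
                acc[i] += W[i][step[i]] * pipe[i]
                step[i] += 1
    return acc

print(systolic_matvec([[1, 2], [3, 4]], [5, 6]))  # prints [17, 39]
```

Note that results appear after 2n - 1 pulses rather than the n² steps of a sequential evaluation, which is the throughput gain the systolic arrangement is designed to deliver.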
[Figure: a memory unit feeding a linear array of PEs]

Figure (2.12) A systolic processor array.
Systolic array systems feature the important properties of modularity,
local interconnection, a high degree of pipelining and highly synchronised
multiprocessing. These features are particularly interesting in the
implementation of compute-bound algorithms, rather than input/output (I/O)
'SYSTOLIC' PROCESSOR ARRAY STRUCTURES

1- 1D linear arrays
Problem cases: FIR filter, convolution, Discrete Fourier Transform (DFT),
matrix-vector multiplication, recurrence evaluation, solution of triangular
linear systems.

2- 2D square arrays
Problem cases: dynamic programming for optimal parenthesization,
image processing, pattern matching, numerical relaxation.

3- 2D hexagonal arrays
Problem cases: matrix problems (matrix multiplication), LU decomposition by
Gaussian elimination without pivoting, QR factorization.

4- Trees
Problem cases: searching algorithms, recurrence evaluation.

5- Triangular arrays
Problem case: inversion of a triangular matrix.

Table (2.1) The potential utilization of 'systolic' array configurations
bound computations. In a compute-bound algorithm, the number of computing
operations is larger than the total number of I/O elements; otherwise the problem
is termed I/O-bound. The following matrix-matrix multiplication and addition
examples illustrate these concepts. The former, an ordinary matrix multiplication
algorithm, represents a compute-bound task, since every entry in one matrix is
multiplied by all the entries in the rows or columns of the other matrix, i.e. O(n³)
multiply-add steps, but only O(n²) I/O elements are required. The addition of two
matrices, however, is I/O-bound, since the total number of additions is no larger
than the total number of I/O operations, i.e. O(n²) add steps and O(n²) I/O
elements.
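The distinction can be made concrete by counting operations per I/O element for the two examples. The short sketch below (an illustrative calculation, not part of the original text) shows that for n x n matrix multiplication the ratio of computation to I/O grows with n, while for addition it stays constant, which is why only the former rewards a systolic implementation.

```python
# Compute-to-I/O ratios for n x n matrix multiplication versus addition.
# Multiplication: n^3 multiply-add steps; addition: n^2 adds; both move
# 3 n^2 elements (two operand matrices plus one result).

def op_vs_io(n):
    multiply_ops = n ** 3      # n^2 output entries, each an n-term inner product
    add_ops = n ** 2           # one addition per output entry
    io_elements = 3 * n ** 2   # two inputs plus one result
    return multiply_ops / io_elements, add_ops / io_elements

for n in (8, 64, 512):
    mul_ratio, add_ratio = op_vs_io(n)
    # mul_ratio = n / 3 grows with n; add_ratio = 1 / 3 is constant.
    print(n, round(mul_ratio, 1), round(add_ratio, 2))
```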
Speeding up I/O-bound computations requires an increase in memory
bandwidth. Memory bandwidth can be increased by using either fast components,
which may be quite expensive, or interleaved memories, which may create complex
memory management problems. Speeding up a compute-bound computation,
however, may often be accomplished by using systolic arrays.
The fundamental principle of a systolic architecture, a particular systolic array,
is illustrated in fig. (2.12). By replacing a single processing element with an array
of PEs, a higher computational throughput can be achieved without increasing
memory bandwidth. This is apparent if we assume that the clock period of each PE
is 100 ns: the conventional memory-processor organisation then has at most 5 MOPS
performance, while, at the same clock rate, the systolic array can achieve a
possible 35 MOPS.
Furthermore, systolic systems are algorithmically specialised, and can
therefore achieve a better balance between computation and communication,
since the communication geometry and the computation performed by each
processor are unique to the specific problem being solved. Thus, a systolic
algorithm must explicitly define not only the computation performed by each
of the processors in the system, but also the communication between these
processors. That is, a systolic algorithm must specify the processor interconnection
pattern and the flow of data and control throughout the system.
2.5.2 Systolic Algorithms, Constraints and Classification.
An algorithm that is designed with the systolic concepts in mind, in particular
the use of simple and regular data and control flow, extensive pipelining and a
high level of multiprocessing, is termed a systolic algorithm. Technologically
speaking, the design of systolic algorithms is in its early days and, as such, is
applicable to only a small subset of applications. However, it is forecast that
further developments in the near future could alleviate some (if not all) of the
restrictive constraints of VLSI design.
Recent developments in programming languages, along with chip
technology, have made it possible to classify systolic algorithms into broad classes
depending on their specific properties. For example, a systolic algorithm can be
characterised by many factors, e.g. ease of manufacture, its ability to be
represented as a planar graph, or the amount of silicon area required to implement it.
Two main classes of systolic algorithm have been identified [Bekakos 1986]: soft-systolic
algorithms and hard-systolic algorithms.
The soft-systolic paradigm is described as a framework for realising an
algorithm design and programming methodology for general-purpose, stand-alone
(not attached to a host), high-level-language parallel computers (more specifically,
the Fifth Generation Project computers) [Uchida 1983].
Soft-systolic algorithms were defined as a result of innovations in
concurrent programming languages, such as Occam and Concurrent Prolog. In this
class, planarity, broadcasting and area are no longer a major concern. Although
soft-systolic algorithms may intuitively not be suitable for direct mapping onto a
chip, they can still be executed on suitable parallel computers, such as transputers.
These algorithms must therefore be implemented in appropriate languages. Recent
developments in the transputer device, with its direct hardware support for the
Occam model, have made the transputer a favourable candidate system for running
algorithms of this class.
The second class, the hard-systolic algorithms, represents the traditional
algorithms designed with the physical chip-implementation restrictions in mind, so
that they are easily manufactured as chip systems. Examples include banded
matrix-vector and matrix-matrix multiplication chips [Mead 1980].
Perhaps one of the most significant constraints imposed on VLSI systems is
that it is a 2D technology (the planarity constraint), since chips are usually laid out
(or, more precisely, wafered, in fabrication jargon) on a board. This physical
constraint is reflected in hard-systolic design by considering only those graph-model
representations which feature the planarity characteristic. However, near-planar
representations are also allowed, in which the 2D constraint is violated only by
permitting two boards to be connected at the same places.
In addition, broadcasting is avoided in such algorithms, since each cell would
have to be connected to the broadcast channel, increasing the power requirement of
the system as a whole or decreasing its speed. In a 'purely' hard-systolic algorithm,
broadcasting to cells is totally avoided; if only a limited amount is allowed, the
algorithm is termed a 'semi' hard-systolic algorithm.
Soft-systolic algorithms observe the main principles of systolic algorithms.
However, they do not have to obey the restrictions that refer to the VLSI
implementation of systolic algorithms; thus they differ from hard-systolic
algorithms in the following ways:
* The network of processes need not be planar and static: non-planar
networks with multiple and complex interconnections, or even multidimensional
and/or time-varying systems, may be possible.
* Area is not a major consideration for optimization; however, it should be
noted that area represents processes, and thus processor and memory resources.
* They do not have to be fabricable, but they must be programmable in some
appropriate parallel processing language (e.g. Occam).
* Broadcasting, fan-in and small irregularities are not avoided, but there must
be a majority of pipelined structures.
It is clear that the set of hard-systolic algorithms forms a subset of the soft-systolic
class, and as such they can also be implemented with the same concurrent
programming languages, although this is not necessary. Furthermore, it is also
evident that some soft-systolic algorithms will be very close to the hard-systolic
ones but, under the strict definitions of hard-systolic, would not be classed
as such. Consequently, a third class, hybrid-systolic algorithms, was defined to
represent this state of transition from the soft class to the hard one. Only
technological improvements, which are likely to take place in the near future, will
achieve this hybrid-to-hard migration. Current research indicates that algorithms which
allow local broadcasting (not necessarily between nearest-neighbour cells), limited
non-planarity, or large amounts of non-planarity (but in a controlled manner) could
be considered as contenders for this class of algorithm.
Hence, for hybrid-systolic algorithms, area is not a major consideration, in
terms of optimizing the area of the functional units or of the array as a whole.
However, the restrictions of the machine must be taken into account, in terms of
the processors and memory available. They do not have to be fabricable, but must be
programmable in some special-purpose systolic programming language targeting a
special-purpose machine; usually they require significant amounts of memory and
control.
2.5.3 Systolic Array Simulation.
The term 'systolic array simulation' indicates the combination of several
approaches: initially, the simulation of hard-systolic algorithms on
conventional computers using some suitable language; further, the development of
hybrid-systolic algorithms and special-purpose systolic programming; and finally, a
design and development methodology for soft-systolic algorithms whose target
machines are general-purpose parallel processing computers.
Occam programs can be divorced from transputer configurations by using the
language as a simulation tool, as was done throughout the development of the
simulation system in this research. A summary of the Occam language was given in
a previous section, i.e. (2.3.2). The general structure of Occam programs which
simulate systolic arrays is shown in fig. (2.13), where branching indicates
parallel execution. The construction of these programs follows ideas developed by
G. M. Megson [Megson 1987]. Consequently, Occam programs simulate the formal
proofs by replacing I/O descriptions with actual results. Although simulation does
not guarantee correctness, it is nevertheless a less time-consuming approach which
does not result in unsolvable equations.
The Getdata and Putdata sections of fig. (2.13) are responsible for receiving
and sending data and other information from and to the host. Each routine contains
enough memory to store the initial array input data and the final output data,
corresponding to the global input and output sequences of the model. However, for
algorithms where the computation time is data-dependent, the Putdata routine can
run in parallel with the systolic system and immediately produce the output data.
Similar arrangements can be made for the Getdata routine. Notice that,
[Figure: GETDATA → SETUP → ALLOCATOR → (SOURCES | CELLS | SINKS | DEBUG)
→ DE-ALLOCATOR → PUTDATA]

Figure (2.13) Structure of an Occam program for simulating a systolic array.
given that Occam has no standard I/O routines, it is possible to define a
library of primitive I/O routines especially suitable for reading and writing
data and control streams, as required in systolic computation.
The Setup section computes system-dependent quantities. More specifically, it
performs the calculations whose values are needed to define the structure of the
array. These structural values become more important as the array becomes more
complex.
A system is eventually decomposed into Sources, Cells and Sinks. A Source is
loaded initially with vectors from Getdata representing input streams, together with
possible delays and other control information created in the Setup section. Sinks
are analogous to Sources, except that they work in reverse, placing real values
into data vectors which are then passed to Putdata. The Sources and Sinks of
subsystems are usually connected to the Sources and Sinks of the main system.
The cell procedures implement the computations performed by the processing
elements (cells) of the given systolic architecture. Generally, there is one procedure
for each type of cell, and the programming task is simplified for homogeneous
networks. The I/O sequences are represented by Occam channels appearing as
actual parameters in the procedure heading. Where cell definitions are only
marginally different, extra switches and flags can be added to a procedure heading
so it can set up the correct cell type. A cell definition is divided into three sections:
initialization, communication and computation. Initialization is performed only once
and allows cells to be cleared before use, or predetermined values to be set up. In
particular, initialization defines the neutral-element quantities which can be used in
communication before real data reaches the cell; this is essential to maintain
dataflow in Occam programs.
The other two sections of the cell, communication and computation, are
performed many times, executed sequentially one after the other inside an
iteration loop. All communication is performed in parallel, while computation is
mainly sequential. The allocator routine is called after Setup and is supplied with
parameters describing the array dimensions, synchronisation details such as the total
number of cycles in the algorithm (if a loop scheme is used), and the data sequence
sizes. The allocator is simply a set of parallel loops which specify and start up the
computational graph, connecting corresponding procedures using Occam
channels as arcs and allocating channels accordingly. The simpler the array, the
easier the mapping functions, and the result is an allocation similar to the VLSI
grid model. Once started, the sources and sinks control the computation, and the
allocator terminates only when all the graph cell procedures have terminated.
Termination of procedures is globally synchronised if a for-loop is used in the
cells, and asynchronous if while-loops are incorporated. As Occam is a
synchronous communication language, for-loops tend to be messy, requiring some
additional computation after the loop to clear all the channels and hence avoid
deadlock. While-loops are better suited to the model of concurrency and, when
augmented with systolic control sequences, can be used selectively to close down
cells and input and output channels. Consequently, array cells can be switched off
or de-allocated by a wavefront progression, or pipelined approach, from sources to
sinks. An additional procedure for debugging purposes can be added, which runs
in parallel with the graph network and is mainly a screen/file routine.
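The cell structure just described (initialise once, then loop over communicate-and-compute, with a control sequence to close the cell down) can be sketched outside Occam using threads and blocking queues in place of processes and channels. The sketch below is an analogy only, with illustrative names throughout: a None token on the data channel plays the role of the systolic control sequence that de-allocates cells in a wavefront from source to sink.

```python
# Two IPS cells wired into a pipeline by an "allocator"; queues stand in
# for Occam channels, threads for Occam processes.
import threading
import queue

def ips_cell(weight, x_in, x_out, y_in, y_out):
    # Initialisation section: nothing to clear in this trivial cell.
    while True:
        x = x_in.get()                    # communication: blocking receive
        if x is None:                     # control token: close the cell down
            x_out.put(None)               # pass the shutdown wavefront on
            break
        partial = y_in.get()
        y_out.put(partial + weight * x)   # computation: inner product step
        x_out.put(x)                      # forward x to the next cell

# "Allocator": connect two cells computing y = (2 + 3) * x.
x0, x1, x2 = queue.Queue(), queue.Queue(), queue.Queue()
y0, y1, y2 = queue.Queue(), queue.Queue(), queue.Queue()
cells = [threading.Thread(target=ips_cell, args=(2, x0, x1, y0, y1)),
         threading.Thread(target=ips_cell, args=(3, x1, x2, y1, y2))]
for c in cells:
    c.start()
x0.put(4); y0.put(0)   # "Source": one datum with a neutral partial sum
x0.put(None)           # shutdown token, de-allocating cells source-to-sink
for c in cells:
    c.join()
result = y2.get()      # "Sink" collects the final value
print(result)          # prints 20
```

The while-loop-plus-token scheme mirrors the asynchronous termination style discussed above: each cell decides locally to stop when the control token reaches it, so no global cycle count is needed.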
Brent [Brent 1983] used an extended version of Pascal. Ada also seems a
likely candidate, as the Ada rendezvous is very similar to channel communication.
The adoption of Occam, however, offers more direct hardware support for special-purpose
designs as well as common architectures.
CHAPTER 3
FUNDAMENTALS OF DIGITAL IMAGE PROCESSING
3.1 INTRODUCTION
Image processing and image understanding have been fast-growing research
fields in information technology for the last thirty years. Influences on their growth
and advancement have come from studies in artificial intelligence, psychophysics,
computer architecture and computer graphics. Application areas for image
processing include document processing, medicine and physiology, remote
sensing, industrial automation and surveillance, amongst many others.
Image processing involves the various operations which can be carried out on
image data. These operations include preprocessing, spatial filtering, image
enhancement, feature detection, image compression and image restoration,
though this list is not exhaustive. Image compression [Gonzalez 1992] is mainly used
for image transmission and storage. Image restoration involves smoothing
processes which restore a degraded image to something close to the 'ideal'.
Image processing operations are generally categorised into three levels: low-level
(image enhancement), medium-level (feature extraction) and high-level (scene
interpretation). As shown in fig. (3.1), the complexity of the operations and the
volume of data required at each level vary greatly. Image enhancement
comprises operations on pixel values, repeated over the entire image.
Feature extraction involves identifying important features within the image and
extracting useful information about them. Scene interpretation is when an intelligent
decision is made regarding the contents of the image, using the extracted
features. Computer vision involves techniques from image processing, pattern
recognition and artificial intelligence; the process attempts to recognise and locate
objects in the scene.
[Figure: two-dimensional image data flowing through image capture, low-level
operations and feature extraction into a feature database]

Figure (3.1) Image processing operations.
One area which has received considerable attention in recent years is the
design of real-time systems for the early processing of sensory data (i.e. low-level
image and signal processing). Such systems must handle large quantities of data
(typical images have 512 * 512 pixels) at a high throughput. In section (3.3) we
explore the different types of algorithm required for low-level image processing.
Many low-level vision algorithms are highly regular, data-independent and operate
on a spatially local data set. These characteristics indicate that the algorithms have
an inherent parallelism which can be exploited by mapping them onto arrays of
processing elements operating in parallel.
Many algorithms have been devised for machine vision tasks, but they have
been heavily influenced by the von Neumann architecture on which they were
developed. Some of these influences are [Hussain 1991]:
1) the way the problem and data can be represented;
2) the description of the problem to be solved;
3) the way the problem can be mapped to the architecture;
4) the data structures which can be used;
5) the data and control flow allowed by the system;
6) that the algorithm is optimised to run on the available architecture.
People involved in machine vision have long been aware of the parallelism
involved in the task and have been designing hardware for it for some time. Early (or
low-level) vision involves tasks such as filtering, segmentation, feature detection
and optic flow; all involve local computation on the image array, and these
operations are highly parallel.
There are different architectures for carrying out these varied operations. Many
groups have developed systolic arrays, which give a good rate of data throughput;
many have augmented mesh architectures by building pyramid architectures; others
have developed more general parallel machines based on shared memory, etc.
Section (3.4) explores the various methods of implementing low-level
algorithms on the various types of parallel architecture.
There are different ways in which both images and data can be represented.
Images are typically represented using a 2D array of picture elements (pixels) which
hold the intensity values (grey levels). The image itself can be of various
types: it may be the result of filtering, or the result of a transformation (such as an
FFT). To extract information from these images, different types of operation are
required; these may involve local operations such as convolution or morphological
operations.
This chapter concentrates on low-level image processing algorithms for
parallel computers, parallel hardware, the various methods of implementing
these algorithms on the various types of parallel architecture, and the image
processing techniques required for image filtering.
3.2 LOW-LEVEL IMAGE PROCESSING ALGORITHMS
The aim of computer vision systems is to analyse a scene in order to gain
some understanding of its contents. An image is an l x l array of pixels (picture
elements), each representing the light intensity at that point in the image. Images are
enhanced using low-level algorithms which perform image-to-image
transformations. These algorithms are used to eliminate noise, improve contrast and
detect certain low-level features such as edges.
The majority of image-to-image transformations have spatially localised
inputs: each pixel in the output image is some function of a small window of
neighbouring pixels from the input image. Such algorithms are referred to as local
windowing operations. The algorithms are regular and local, and for each window
in the image the computation is proportional to the size of the window. As the
window is moved across the image, the windowing operation is performed at each
pixel position. Each application of the windowing operation produces one pixel of
the output image.
Many low-level image processing operations, termed local operators, require
access to the four or eight neighbouring intensity values of a pixel when computing
the new value for that pixel. Each element in the image is replaced by some
function of itself and the neighbouring elements within a window centred on that
element. Common sizes of the neighbourhood window are 3 x 3 and 5 x 5.
A powerful method of image computation is to apply a system process P to an
input function F(x,y) and generate a transformed output function H(x,y) [Undrill
1992]:

H(x,y) = P[F(x,y)]     (3.1)

For a local windowing operation there are (k² - 1) additions and k² multiplications
per window (where k is the size of the window), and the window operation is
applied (l - k + 1)² times.
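These counts follow directly from a naive implementation of a local windowing operation. The sketch below (pure Python for clarity, not an optimised routine; the 4 x 4 test image is illustrative) accumulates a weighted sum over each k x k neighbourhood, costing k² multiplications and k² - 1 additions per window, and slides the window over the (l - k + 1)² valid positions of an l x l image.

```python
# Direct local windowing operation: each output pixel is the weighted sum
# of a k x k neighbourhood of the input (no border padding), so an l x l
# image yields an (l - k + 1) x (l - k + 1) output.

def window_op(image, kernel):
    l, k = len(image), len(kernel)
    out = []
    for r in range(l - k + 1):
        row = []
        for c in range(l - k + 1):
            acc = 0                      # k^2 multiplies, k^2 - 1 adds per window
            for i in range(k):
                for j in range(k):
                    acc += kernel[i][j] * image[r + i][c + j]
            row.append(acc)
        out.append(row)
    return out

# 3 x 3 box filter (all-ones kernel) over a 4 x 4 image gives a 2 x 2 output.
img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
box3 = [[1] * 3 for _ in range(3)]
print(window_op(img, box3))  # prints [[54, 63], [90, 99]]
```

Dividing each output by k² turns the box filter into the mean (smoothing) filter discussed below.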
Local neighbourhood operations are also useful in smoothing the image, to
reduce the effects of noise, clean up, or blur the image. This category includes the
most common image processing function: the application of both linear and non-linear
filters. Linear filters, while they remove noise, also blur edges. For this
reason, non-linear filters such as rank filters are used. In the next section we give
an introduction to the mathematical approach and algorithms for various filters.
Measures such as the point-wise standard deviation (SD) and signal-to-noise
ratio (SNR) give a quantitative assessment of the noise in the imaging system. A
measurement of the width of edges in an image gives a quantitative value for the
separation distance of two identical objects before they can be identified with any
certainty.
There are various measures of noise, which may be classified as global
or local. Global measures involve the entire image, for instance the root-mean-square
deviation between the image and the actual object, or between the image and
its ensemble average. Local measures are, for example, the local signal-to-noise
ratio (SNR).
Local windowing operations tend to exhibit a low degree of data dependency,
which means that all parts of the image are treated uniformly. Data-dependent image-to-image
transformations occur when values are adapted 'on the fly' for different
regions of the image, as in adaptive filtering and adaptive thresholding.
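Adaptive thresholding illustrates the data-dependent case: the value compared against each pixel is derived from that pixel's own neighbourhood rather than being fixed in advance. The sketch below is illustrative only (borders are simply skipped, and the window size and test image are assumptions), but it shows why such operations cannot be planned as a uniform, data-independent sweep.

```python
# Adaptive thresholding: each interior pixel is compared against the mean
# of its own k x k neighbourhood, so the effective threshold varies with
# the local image content.

def adaptive_threshold(image, k):
    l, h = len(image), k // 2
    out = [[0] * l for _ in range(l)]
    for r in range(h, l - h):
        for c in range(h, l - h):
            window = [image[r + i][c + j]
                      for i in range(-h, h + 1) for j in range(-h, h + 1)]
            local_mean = sum(window) / (k * k)
            out[r][c] = 1 if image[r][c] > local_mean else 0
    return out

# A single bright pixel on a dark background is picked out, because only
# that pixel exceeds the mean of its own neighbourhood.
img = [[10, 10, 10, 10],
       [10, 200, 10, 10],
       [10, 10, 10, 10],
       [10, 10, 10, 10]]
print(adaptive_threshold(img, 3))
# prints [[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
```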
3.2.1 Parallel Paradigm
The factors which affect the performance of an algorithm on a particular
architecture depend on the degree of parallelism and on the overheads incurred
in scheduling and synchronising the tasks.
The use of dedicated image processing hardware imposes two constraints on
the algorithms employed for image analysis [Giloi 1991]:
- the algorithms must be regular, in order to be performed by specific high-speed
processors;
- the algorithms must be sufficiently simple to ensure the cost-effective
realization of the dedicated processors.
An algorithm is regular if it is performed on the entire pixel matrix, or a
windowed region thereof, in a data-stream mode. The choice of algorithm to
solve a particular problem is strongly influenced by the hardware architecture and
the software support tools available.
Just as there is difficulty and confusion over architecture taxonomy, similar
difficulties exist in classifying algorithms. Often, algorithms have been
developed without reference to an architecture, use more than one model of
parallelism, and are therefore difficult to implement in practice [Hussain 1991].
The two appropriate types of multiprocessor system used for image
processing applications are those which provide SIMD processing and those
which provide MIMD processing [Flynn 1966] (see Chapter 2 for further details).
There are two types of parallelism for image processing algorithms: i) data or
image parallelism, and ii) task parallelism. Morrow and Perrott [Morrow 1987]
have discussed and compared these two kinds of image processing parallelism. In
the following sections a brief overview of each is given.
3.2.1.1 Image Parallelism
In the image-parallel (or data-parallel) paradigm of computation, the image is
initially divided and distributed over the available processors. The algorithm to be
applied to the image then executes on each processor, performing its operations on
the image segment local to that processor.
The image-parallel paradigm of computation is explicitly synchronized; it
maps to the SIMD model of programming where each of the processors executes
the same code on its local data simultaneously. The advantages of image-parallel
programming is that there is a very simple control flow, the data array changes from
one state to another.
If there are fewer processors than the number of pixels in the image, the picture will need to be partitioned amongst the available processors. An image of 512 x 512 pixels may be 'folded' so that there are 16 (128 x 128) square arrays; in processor farms or linear arrays, it is more likely that the image will be partitioned vertically or horizontally into m slices for the m available processors. This partitioning is much simpler for the host computer in both transmitting and collecting back the images and data. For neighbourhood operations involving k x k kernels, each slice will have to have k rows or columns added to the border of the segment, so that adjacent segments overlap [Manning 1988].
Many image-parallel algorithms can be implemented on transputer networks, where the image is partitioned over the available transputers. The algorithm to be applied to the image then executes on each transputer, performing its operations on the image segment local to that transputer. One of the transputers in the network is designated the 'master' and communicates with the host transputer. An image is obtained from the host via the master and distributed to the remaining transputers in the network, thus providing an image
segment for each processor. All results are sent back to the master which, in turn,
passes them to the host.
3.2.1.2 Task Parallelism
In contrast to image parallelism, where the configuration of the processing elements is the same for each algorithm, the processors are configured differently for each task-parallel algorithm.
With task parallelism the algorithm under consideration is sub-divided into relatively distinct sub-tasks. Task parallelism on a multiprocessor computer requires the following steps:
- the algorithm needs to be partitioned into sub-tasks;
- the sub-tasks and data need to be distributed amongst the processors; and
- the system must be set up to allow interprocessor communication and synchronisation.
These three steps must be carried out in the specified order because the requirement for communication and synchronisation cannot be determined in advance of the distribution of the algorithm amongst the processors. For new algorithms and architectures, the above processes have to be assessed anew each time.
Many low-level image processing algorithms can be implemented on systolic arrays; some of these implementations will be explored in section (3.4).
With the MIMD type of computers, tasks may need to be partitioned (and
grouped) so that communication is minimised. There may be an overhead involved
in initiating communication between processors.
3.3 IMAGE FILTERING
In this section we consider techniques for filtering digital images. This
includes both low pass (smoothing) and high pass (edge enhancement) filters. The
principle objective of the enhancement techniques is to process a given image so
that the result is more suitable than the original image for a specific apphcation
[Gonzalez 1992]. Image enhancement involves noise removal to deblur the edges of
objects and to highlight specified features. The enhancement filter attempts to
improve the quality of an image for human or machine interpretability, where
quality is measured subjectively [Niblack 1986].
Image filtering techniques can be subdivided mto two main categories: ones
which act on the whole or large-sections of the image and the others involving small
neighbourhood windows.
Global techniques include least-squares filtering [Rosenfeld 1982] and Kalman filters. Least-squares filter techniques require a statistical model of the signal and the noise. The filtered images produced by these techniques have blurred edges.
Local methods are generally computationally more efficient. However, their greatest advantage comes from their ability to process several windows in parallel.
If the image is corrupted by random impulse noise, then linear or non-linear local window operations may be employed. The simplest of the linear operations is equal-weighted averaging [Rosenfeld 1982]. While this method is efficient in removing noise it will blur edges, and blurring is more severe for larger windows.
The blurring effect can be reduced slightly by using a weighted averaging technique. The foundation of this technique comes from the convolution theorem (section 4.2). Let g(x,y) be the image formed by the convolution of an image f(x,y) with a kernel of weights w, that is

g(x,y) = Σ (i=-m to m) Σ (j=-n to n) w(i,j) f(x-i, y-j)        (3.2)
where the w(i,j) are normalised weights, often binomial coefficients. In physical systems the kernel w must always be non-negative, which results in some blurring or averaging of the image. In fact, by extending the basic idea of convolution, the weights of w may be varied over the image, and the size and the shape of the window varied. With this flexibility, a wide range of linear, non-linear, and adaptive filters may be implemented, for example for edge enhancement or selective smoothing.
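Equation (3.2) can be evaluated directly as a windowed sum of products. The sketch below is illustrative, not the thesis implementation: `convolve2d` is an invented name, images are lists of rows, and border pixels are simply copied through unfiltered.

```python
def convolve2d(f, w):
    """Direct evaluation of equation (3.2) for a (2m+1) x (2n+1) kernel w
    over an image f; border pixels are left unchanged in this sketch."""
    m, n = len(w) // 2, len(w[0]) // 2
    H, W = len(f), len(f[0])
    g = [row[:] for row in f]
    for x in range(m, H - m):
        for y in range(n, W - n):
            g[x][y] = sum(w[i + m][j + n] * f[x - i][y - j]
                          for i in range(-m, m + 1)
                          for j in range(-n, n + 1))
    return g

# equal-weighted 3 x 3 averaging of a constant image leaves it unchanged,
# since the weights are normalised to sum to one
w = [[1 / 9] * 3 for _ in range(3)]
f = [[9] * 5 for _ in range(5)]
g = convolve2d(f, w)
```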
Before proceeding to smoothing filters, we describe the gradient and Laplacian operators. These are two filters, in fact two classes of filters, that are often applied to digital images as convolutions.
3.3.1 Digital approximations to the Gradient and Laplacian
Operators
The gradient and Laplacian operators are related to the vector gradient and scalar Laplacian of calculus. These are defined for a continuous function f(x,y) of two variables as:

Gradient:   ∇f = (∂f/∂x) i + (∂f/∂y) j                  (3.3)
Laplacian:  ∇²f = ∂²f/∂x² + ∂²f/∂y²                     (3.4)

where i and j are unit vectors in the x and y directions.
The most common and historically earliest edge operator is the gradient operator [Ballard 1982]. The gradient operator applied to a continuous function produces a vector at each point whose direction gives the direction of maximum change of the function at that point, and whose magnitude gives the magnitude of this maximum change [Niblack 1986].
One digital gradient window gives the x component gx of the gradient, and the other gives the y component gy:

gx(i,j) = maskx * n(i,j)
gy(i,j) = masky * n(i,j)

where n(i,j) is some neighbourhood of (i,j) and * represents the sum of products of the corresponding terms.
For a digital image, analogously, we could use first differences, giving

gx(i,j) = r(i,j+1) - r(i,j)
gy(i,j) = r(i+1,j) - r(i,j)

Note that these are digital convolution operators which compute gx and gy with the simplest set of masks:

maskx = [-1  1]

masky = [-1]
        [ 1]
The maskx generates output values centred on the point (i,j+1/2) and masky generates output values centred on (i+1/2,j). To obtain values centred on (i,j), symmetric masks about (i,j) are most often used. We get

gx(i,j) = r(i,j+1) - r(i,j-1)                           (3.5)
gy(i,j) = r(i+1,j) - r(i-1,j)                           (3.6)

These operators measure the horizontal and vertical changes in gx and gy. Note that the set of masks is:

maskx = [-1  0  1]

masky = [-1]
        [ 0]
        [ 1]
Another set of masks, called Roberts operators, are not oriented along the x and y directions, but are nevertheless similar. They are defined on a 2 x 2 window as:

maska = [ 1  0]        maskb = [ 0  1]
        [ 0 -1]                [-1  0]

Whatever masks are used, the gradient operator produces a two-element vector at each pixel, and this is usually stored as two new images, one for each component.
Sometimes the gradient is wanted as a magnitude gv and a direction gd. These can be computed from gx and gy as

gv(i,j) = √( gx(i,j)² + gy(i,j)² )

or, more cheaply,

gv(i,j) = |gx(i,j)| + |gy(i,j)|

and

gd(i,j) = arctan( gy(i,j) / gx(i,j) )
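The central-difference gradient of equations (3.5) and (3.6), together with the magnitude and direction, can be sketched as follows. The function name `gradient` is invented; the sketch handles interior pixels only.

```python
import math

def gradient(r, i, j):
    """Central-difference gradient per equations (3.5) and (3.6) at an
    interior pixel of image r (a list of rows), with magnitude gv and
    direction gd as discussed above."""
    gx = r[i][j + 1] - r[i][j - 1]
    gy = r[i + 1][j] - r[i - 1][j]
    gv = math.hypot(gx, gy)   # magnitude sqrt(gx^2 + gy^2)
    gd = math.atan2(gy, gx)   # direction
    return gx, gy, gv, gd

# a vertical step edge: gx responds to the horizontal change, gy does not
img = [[0, 0, 10, 10]] * 3
gx, gy, gv, gd = gradient(img, 1, 2)
```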
As we saw in equation (3.4), the Laplacian operator, which in one dimension reduces to the second derivative, is also computed by convolving a mask with the image. One of the masks that is used may be derived by comparing the continuous and digital cases as follows [Niblack 1986]:

f(x)   ≈ r(i)
f'(x)  ≈ r'(i)  = r(i) - r(i-1)
f''(x) ≈ r''(i) = r'(i) - r'(i-1)
               = [r(i) - r(i-1)] - [r(i-1) - r(i-2)]
               = r(i-2) - 2r(i-1) + r(i)
               = (1 -2 1) · (r(i-2) r(i-1) r(i))
giving the convolution mask (1 -2 1). In this form, the Laplacian at i is computed from values centred about i-1. To keep the Laplacian symmetric, it is normally shifted and given at i as:

(1 -2 1) · (r(i-1) r(i) r(i+1))

Also, the sign is typically changed to give:

(-1 2 -1) · (r(i-1) r(i) r(i+1))

and this is a common form of the one dimensional digital Laplacian, although mathematically it is the negative of the Laplacian. Different choices are available when extending this mask to two dimensions. A plus-shaped standard mask is:

    -1
-1   4  -1
    -1

which is the negative of the mathematical Laplacian operator.
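Applying the plus-shaped mask above at a pixel can be sketched directly; `laplacian` is an invented name and only interior pixels are handled.

```python
def laplacian(r, i, j):
    """The plus-shaped negative-Laplacian mask above, applied at an
    interior pixel (i, j) of image r (a list of rows)."""
    return (4 * r[i][j]
            - r[i - 1][j] - r[i + 1][j]
            - r[i][j - 1] - r[i][j + 1])

# a flat region gives zero response; an isolated bright pixel responds
flat = [[7] * 3 for _ in range(3)]
spike = [[0, 0, 0], [0, 5, 0], [0, 0, 0]]
```

This illustrates the point made below: the mask responds only where the rate of change of gray level itself changes, and it responds strongly to isolated (noise-like) pixels.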
The digital Laplacian responds to the 'shoulders' at the top and bottom of a ramp, where there is a change in the rate of change of gray level [Rosenfeld 1982].
The window masks given here for the gradient and Laplacian operators are fairly standard, but many other operators have been defined, in many cases using larger windows, say 5 x 5. Also notice that the gradient gives both magnitude and direction information about the change in pixel values at a point, whereas the Laplacian is a scalar giving only magnitude. The digital Laplacian responds to noise as strongly as it does to edges; thus the gradient operator would ordinarily be a better edge detector than the Laplacian operator.
3.3.2 Low Pass and High Pass Filters
Low pass filters are smoothing filters designed to reduce the noise, detail, or 'busy-ness' in an image. If multiple copies of the image are available or can be obtained, they can be averaged pixel by pixel to improve the signal to noise ratio. However, in most cases only a single image is available. For this case, typical smoothing filters perform some form of moving window operation that may be a convolution or other local computation in the window [Niblack 1986, Ekstrom 1984, Gonzalez 1992].
It is easy to smooth out an image, but the basic problem of smoothing filters is how to do this without blurring out the interesting features. For this reason, much emphasis in smoothing is on 'edge-preserving smoothing'. Salt-and-pepper noise, created in images by bit errors, can be removed by use of low pass filters such as median filters [Rosenfeld 1982, Hussain 1991]. In a small window the pixels are nearly homogeneous; only a small portion of these pixels are noise pixels.
Edge enhancement (or image sharpening) techniques are useful primarily as enhancement tools for highlighting edges in an image. These filters are the opposite of smoothing filters: whereas smoothing filters are low pass filters, edge enhancement filters are high pass filters [Gonzalez 1992, Hussain 1991]. The term 'edge detector' is also used. This may mean a simple high pass filter, but sometimes may be more general, including a thresholding of the points into edge and non-edge categories, and even the linking up of edge pixels into connected boundaries in the image.
Below is a brief review of several image smoothing and sharpening filters:
1- Median filtering: Median filtering is a nonlinear process useful in reducing impulsive or salt-and-pepper noise. It is also useful in preserving edges in an image while reducing random noise. A pixel value is replaced by the median of its neighbours. The median of a set of numbers is the value such that 50% are above and 50% are below. For example, when the pixel values within a window are 5, 6, 35, 10, and 15, and the pixel being processed has a value of 35, its value is changed to 10, which is the median of the five values.
Conceptually simple, the median filter is somewhat awkward to implement because of the pixel value sorting required. However, it is one of the better edge-preserving smoothing filters [Danielsson 1981, Ekstrom 1984].
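A 3 x 3 median step can be sketched as below; `median_filter` is an invented name, the sorting approach is the naive one noted above, and only interior pixels are handled.

```python
def median_filter(r, i, j):
    """3 x 3 median at an interior pixel: replace r[i][j] by the median
    of its neighbourhood, via the explicit sort described above."""
    window = sorted(r[i + k][j + l] for k in (-1, 0, 1) for l in (-1, 0, 1))
    return window[len(window) // 2]  # middle of the 9 sorted values

# an impulse-noise pixel of 35 in a quiet neighbourhood is suppressed
img = [[5, 6, 5], [10, 35, 15], [6, 5, 10]]
```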
2- Mean: If the noise in an image appears random and uncorrelated, then the affected pixels can be replaced by a local average, or mean, to reduce the gray level variations. For an N x N window with pixel gray levels I(i,j), where i,j = 1,2,....,N, the average is

Ī = (1/N²) Σ (i=1 to N) Σ (j=1 to N) I(i,j)

The size and shape of the window over which the mean is computed can be selected. Figure (3.2) shows one approach for extracting neighbourhoods from an image array. The neighbourhood of a point is defined in this case by the set of points inside, or on the boundary of, a circle centred about the point in question. For example, the mean filter can use a square window or a plus-shaped window [Chin 1983, Gonzalez 1992].
3- Weighted mean: A weighted mean is often used, in which the weight for a pixel is related to its distance from the centre point. The size and shape of the window can be selected. The approach for extracting neighbourhoods from an image array shown in fig. (3.2) also applies to the weighted mean filter [Niblack 1986]. For 3 x 3 windows, the weights may be:
1/16  1/8  1/16              1/6
1/8   1/4  1/8         1/6   1/3   1/6
1/16  1/8  1/16              1/6

square window          plus-shaped window

The square weighted-mean window is separable. Let W be the 3 x 3 square window kernel above and let wv = wh = (1/4 1/2 1/4)ᵀ; then W is given by:

             [1/4]
W = wv whᵀ = [1/2] (1/4 1/2 1/4)
             [1/4]
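The separability identity W = wv whᵀ can be checked directly with an outer product; `outer` is an invented helper name.

```python
def outer(v, h):
    """Outer product v * h^T, showing that the square weighted-mean
    window above is the outer product of two 1D kernels."""
    return [[a * b for b in h] for a in v]

wv = wh = [1 / 4, 1 / 2, 1 / 4]
W = outer(wv, wh)
```

Separability matters in practice: a separable 3 x 3 convolution can be done as two 1D passes, reducing the work per pixel.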
Figure (3.2) Pixel neighbourhoods (9-point and 5-point).
4- k nearest neighbour averaging: This method is based on the fact that the gray levels of pixels belonging to the same population within an N x N window are highly correlated. The centre point I of an N x N neighbourhood is replaced by the average gray level of the neighbours of I whose gray levels are closest to that of I. A typical value of k is 6 for an N = 3 square window centred on I. This is another filter used in edge-preserving smoothing [Chin 1983].
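The k-nearest-neighbour step can be sketched as follows; `knn_average` is an invented name, the window is fixed at 3 x 3, and only interior pixels are handled.

```python
def knn_average(r, i, j, k=6):
    """k-nearest-neighbour averaging in a 3 x 3 window: replace the
    centre by the mean of the k neighbours whose gray levels are
    closest to the centre value, as described above."""
    centre = r[i][j]
    neighbours = [r[i + di][j + dj]
                  for di in (-1, 0, 1) for dj in (-1, 0, 1)
                  if (di, dj) != (0, 0)]
    closest = sorted(neighbours, key=lambda v: abs(v - centre))[:k]
    return sum(closest) / k

# the outlying neighbour 100 is never among the 6 closest, so the
# centre of a flat region stays flat
img = [[10, 10, 10], [10, 10, 10], [10, 10, 100]]
```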
5- Inverse gradient filter: This smoothing scheme is based on the observation that the variations of gray level inside a region are smaller than those between regions. In other words, the absolute value of the gradient at an edge is higher than within regions. The weighting coefficients are the normalized gradient inverses between the centre point and its neighbours.
For a pixel I(i,j) in an m x m image, where i,j = 1,2,....,m, the inverse of the absolute gradient at I(i,j) is defined as

r(i+k,j+l) = 1 / |I(i+k,j+l) - I(i,j)|

where k,l = -1,0,1, but k and l are not both zero at the same time. In other words, the r(i+k,j+l)'s are calculated for the eight immediate neighbours of I(i,j). The smoothed pixel O(i,j) is computed as:

O(i,j) = (1/2) I(i,j) + (1/2) Σ (k=-1 to 1) Σ (l=-1 to 1) w(i+k,j+l) I(i+k,j+l)        (3.7)

where the w(i+k,j+l) are the r(i+k,j+l) normalized to sum to one.
If I is in the immediate vicinity of an edge, those pixels outside the region will be weighted very lightly; thus details will not be significantly blurred [Wang 1981].
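Equation (3.7) can be sketched as below. `inverse_gradient` is an invented name, and one detail is an assumption not fixed by the text above: where a neighbour equals the centre the gradient inverse is undefined, and this sketch substitutes the value 2 there.

```python
def inverse_gradient(r, i, j):
    """Inverse-gradient smoothing per equation (3.7) at an interior pixel:
    weights are the normalised inverse absolute differences between the
    centre and its eight neighbours. ASSUMPTION: a neighbour equal to the
    centre gets inverse-gradient value 2, since 1/0 is undefined."""
    offs = [(k, l) for k in (-1, 0, 1) for l in (-1, 0, 1) if (k, l) != (0, 0)]
    inv = [2.0 if r[i + k][j + l] == r[i][j]
           else 1.0 / abs(r[i + k][j + l] - r[i][j]) for k, l in offs]
    total = sum(inv)
    return 0.5 * r[i][j] + 0.5 * sum(
        g / total * r[i + k][j + l] for g, (k, l) in zip(inv, offs))

# a constant region passes through unchanged
img = [[10] * 3 for _ in range(3)]
```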
6- Sigma filter: Set O(i,j) equal to the average of all pixels in the neighbourhood whose value is within t counts of the value of I(i,j). Here t is an adjustable parameter; the filter is called the Sigma filter because t may be derived from the sigma, or standard deviation, of the pixel value distribution. Let

r(i+k,j+l) = 1    if |I(i+k,j+l) - I(i,j)| < t
r(i+k,j+l) = 0    otherwise

where k,l = -1,0,1 for a 3 x 3 window. Then the smoothed pixel O(i,j) is computed as:

O(i,j) = ( Σ (k=-1 to 1) Σ (l=-1 to 1) w(i+k,j+l) ) / ( Σ (k=-1 to 1) Σ (l=-1 to 1) r(i+k,j+l) )        (3.8)

where

w(i+k,j+l) = r(i+k,j+l) I(i+k,j+l)

The sigma range is generally large enough to include most of the pixels from the same distribution in the window, yet in most cases it is small enough to exclude pixels representing high-contrast edges [Lee 1983].
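The sigma-filter step of equation (3.8) can be sketched directly; `sigma_filter` is an invented name and only interior pixels are handled.

```python
def sigma_filter(r, i, j, t):
    """Sigma filter per equation (3.8) at an interior pixel: average the
    3 x 3 window values (centre included) that lie within t counts of
    the centre value r[i][j]."""
    window = [r[i + k][j + l] for k in (-1, 0, 1) for l in (-1, 0, 1)]
    kept = [v for v in window if abs(v - r[i][j]) < t]
    return sum(kept) / len(kept)

# with t = 5 the high-contrast value 100 is excluded from the average
img = [[10, 11, 10], [9, 10, 100], [10, 11, 9]]
```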
7- Closest of minimum and maximum: A filter defined by computing the minimum and maximum of the pixels in n(i,j) and setting O(i,j) to whichever is closest to the value I(i,j) often produces good results by sharpening the boundaries between classes. The filter is typically iterated. It leaves isolated spikes, which may need to be removed by another filter, say a median filter, mixed into the iterations.
8- Gradient operators: A simple way of using gradient operators is to keep only the magnitude (as explained in the previous section). Other methods keep both the magnitude and the direction. The gradient in a given direction may also be computed: if the gradient at pixel I is considered as a vector (gx, gy), then the gradient in the direction of the vector d = (dx, dy) is (gx dx + gy dy) / |d|.
Moving across an edge, the gradient will start at zero, increase to a maximum, and then decrease back to zero; this produces a broad edge [Niblack 1986].
9- Laplacian operators: As described in the previous section.
10- Enhancement in the direction of the gradient: Initially compute the gradient at pixel I(i,j), and then apply another filter (such as the one dimensional Laplacian operator) in the direction of the gradient.
3.4 VLSI IMPLEMENTATION FOR LOW LEVEL IMAGE
PROCESSING.
The purpose of this section is to identify different computational models for implementing low level image processing algorithms on a programmable VLSI processor array, constructed from a systolic array, a pyramid architecture, and an Inmos transputer network. We shall consider the class of image processing algorithms that use local windowing operations (as shown in the previous section).
Before proceeding to these computational models, we describe briefly the CLIP systems. CLIP4 (Cellular Logic Image Processing) was the first large array assembled using custom-designed integrated circuits of two boolean processors. Each of the processors can be loaded with the same or different data, but the same function is performed by all the processors at the same time. The CLIP system is a SIMD computer. The array size limitation of CLIP4, gleaned from several successful image processing applications, led to the development of CLIP4S. But there still remained some severe limitations with memory and with the processing of gray scale images. CLIP4S's successor, CLIP7A, incorporates various levels of local control and an increased amount of memory per processor [Hussain 1991].
3.4.1 Systolic Array Implementation
Many of the early systolic algorithms for 1D convolution and matrix-vector multiplication used bidirectional dataflow, with the data and results flowing in opposite directions [Kung 1979, 1982]. Such a systolic design is shown in fig. (3.3). The array is composed of k identical cells (k is the size of the convolution window) and operates in a totally synchronous fashion. Each cell is capable of performing a multiplication followed by an addition; that is, each cell performs one step in the calculation of a scalar product [Quinton 1991].
There are two data streams in the array: the xi circulate from left to right, entering a new cell on each clock tick, and are not modified as they pass through the cells. The yi circulate in the opposite direction, at the same speed, and are computed by successive accumulation. Initially, their value is 0, and when they exit the array, they have been correctly computed according to equation (3.2). The array outputs a new yi on every two clock ticks. The primary weakness of the array is that its cells are not active on all clock ticks.
To remedy this inefficiency, the two data streams xi and yi must flow in the same direction. Such an array is shown in fig. (3.4).

Figure (3.3) Bidirectional systolic array.
The array is composed of k cells, and the internal registers are preloaded with the window weights. The yi flow at a speed double that of the xi; this is achieved by inserting additional delays into the path of the xi. The array generates a new result after every clock cycle [Kung 1984].
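The dataflow of the unidirectional array can be sketched in software. This is an illustrative model, not the hardware design: the extra delays mean that the partial sum launched into cell 0 at tick s meets input x[s-c] when it reaches cell c, so the code below evaluates the accumulation that this delay structure induces, keeping only fully formed outputs. Function names are invented, and the weights are loaded into the cells in reverse order so that the result matches the direct definition.

```python
def direct(x, w):
    """Direct sliding-window convolution, for checking the array model."""
    k = len(w)
    return [sum(w[i] * x[i + r] for i in range(k))
            for r in range(len(x) - k + 1)]

def systolic(x, w):
    """Dataflow sketch of the unidirectional array of fig. (3.4): cell c
    holds weight w[k-1-c]; a partial sum entering cell 0 at tick s picks
    up w[k-1-c] * x[s-c] at cell c. Timing details are abstracted into
    this index arithmetic."""
    k = len(w)
    out = []
    for s in range(len(x) + 2 * k):   # enough ticks to flush the pipeline
        acc = 0.0
        valid = True
        for c in range(k):
            idx = s - c               # the x value visible at cell c
            if 0 <= idx < len(x):
                acc += w[k - 1 - c] * x[idx]
            else:
                valid = False         # pipeline still filling or draining
        if valid:
            out.append(acc)
    return out

y = systolic([1, 2, 3, 4, 5], [1, 0, 2])
```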
Kung modified the above systolic array in different designs for 2D convolution [Kung 1980, 1982, 1984].
A number of systolic designs for 2D convolution have been implemented. Robert and Tchuente [Robert 1986] have proposed a divide-and-conquer systolic design for 2D convolution, and Megson introduced several designs for the same purpose [Megson 1992].

Figure (3.4) Unidirectional systolic array.
3.4.2 Pyramid Architecture
A pyramids approach to image processing is proposed by Hubel and Wiesel
[Hubel 1962]. The pyramid structure is effectively an inverted order four tree, with
additional communications links between the processors at each level, as shown in
fig. (3.5). Each horizontal level is a square array. Operations involving nearest
neighbour algorithms are presented. The pyramid structure is proposed as an
efficient configuration for image processing due to the inherent hierarchy which is
analogous to the different levels of complexity of operations used in an application.
A different type of processor pyramid is considered. With this SIMD
configuration, an N by N array of simple processor devices can be reconfigured
into any one layer of the overall pyramid. While the number of processor nodes is
Figure (3.5) Three-level pyramid configuration.
decreased on each layer towards the top, the number of processors in each layer is
the same. This means that for all layers excluding the bottom one, there will be
successively more processors at each node. Clearly this arrangement does mean that
only the layer that the processor array has been configured for can execute at any
one time, although if more than one array were used, several layers could execute
concurrently.
Another pyramid architecture currently being constructed is the Warwick Pyramid Machine [Nudd 88]. The WPM consists of three layers; the base layer is a 256 x 256 SIMD array. Above this are 256 processors, each acting as a controller for a 16 x 16 array of processors. The base layer acts on iconic data and the layer above converts this to a symbolic representation.
Transputers have been considered for use in a pyramid architecture for knowledge-based sonar image interpretation. The pyramid configuration proposed uses 21 T800 transputers in three hierarchical layers. The base layer has 16 processors arranged as a 4 x 4 array, each connecting to one of the four processors (in a 2 x 2 array) in the middle layer. The top layer is a single transputer, connecting to each of the four devices in the middle layer. It would appear that this structure is best suited to performing image analysis on repetitively captured images, so that while one layer performs its level of image processing, another layer can be performing its function on the previous image [Manning 1988].
CHAPTER 4
SYSTOLIC DESIGNS FOR DIGITAL CONVOLUTION
4.1 INTRODUCTION
One area which has received considerable attention in recent years is the design of real-time systems for the early processing of sensory data (i.e. low-level vision and signal processing). Multidimensional convolutions constitute some of the most compute-intensive tasks in signal and image processing, since each input data item is used many times, and many input values are needed for the computation of a single output. For example, a two dimensional (2D) convolution using a general 3 x 3 kernel requires 9 multiplications and 8 additions to generate each pixel in the output image. If the dimensionality is higher (typical images have 256 x 256 or 512 x 512 pixels) or the kernel is larger, many more arithmetic operations are required.
Some previously proposed systems were shown in chapter 3; however, these systems suffer from two drawbacks. Firstly, they do not take advantage of the possibility that the arithmetic units could themselves be pipelined. Secondly, they cannot be used to perform convolutions of arbitrary dimensionality; they can, for example, perform only 1D or 2D convolutions but not both.
The use of pipelined components for implementing cells of systolic arrays is especially attractive for applications requiring floating point operations. Commercially available floating point multiplier and adder chips can deliver high throughput; an example of such a chip is the Inmos T800 transputer, whose processor shares its time between any number of concurrent processes [Inmos 1987]. These components, when used to implement systolic cells, form a second level of pipelining: the first level is the global pipelining between array cells, while the additional level pipelines the computations within a cell, and this second level of pipelining can increase the system throughput.
This chapter describes a two-level pipelined systolic array design for 1D convolution which uses pipelined arithmetic units. The systolic array was first proposed by H.T. Kung [Kung and Lam 1984]. It is shown that the systolic array can be extended to handle a 2D convolution. This system can also be extended to handle convolution of any dimensionality.
The systolic array proposed here is a linear array for m-D convolution. It consists of two major building blocks: multiply-add processors and delays. The number of hardware components used is proved to be minimal when compared with an equivalent array of the same structure. Some modifications to the design of the systolic array are also analysed.
The next section presents a definition and system for 1D convolution. A description of the systolic array for 2D convolution is presented in section (4.3). Section (4.4) presents a systolic design for multidimensional convolution. Some improvements to these designs are shown in section (4.5). A description of the transputer networks for one dimensional and two dimensional convolution is presented in sections (4.6) and (4.8). The results and efficiency obtained for each transputer network are presented in sections (4.7) and (4.9), reflecting the performance of these designs on this level of vision. The proof that a design is optimal in terms of the amount of time and memory used is analysed in the last section, where the systolic array is compared with another known design in the literature.
4.2 ONE DIMENSIONAL CONVOLUTION DESIGN
4.2.1 Problem Definition
Given a vector of signals x = {xi}, i = 1,2,....,n, and a kernel of weights w = {wi}, i = 1,2,....,k, with k << n, then convolving the signal x with the kernel w is to compute the quantity

yr = Σ (i=1 to k) wi x(i+r-1)        for r = 1,2,....,n-k+1        (4.1)
Assuming the vector indices increase from left to right, the first result, y1, is obtained by aligning the leftmost element of w with the leftmost element of x, then computing the inner product of w and the overlapped section of the signal x with w.
The kernel slides one position rightward and the inner product calculation is again performed on the overlap to produce the second result, and so on for all the other results. The last result, y(n-k+1), is obtained when the rightmost element of w is aligned with the rightmost element of x.
From equation (4.1) we conclude that there are (k-1) additions and k multiplications per window, and the window operation is applied (n-k+1) times. This is the reason why convolution is considered to be a compute intensive operation.
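Equation (4.1) computed directly can be sketched as follows; `convolve_1d` is an invented name and the indices below are 0-based.

```python
def convolve_1d(x, w):
    """Equation (4.1) computed directly: y_r = sum_i w_i * x_(i+r-1) for
    r = 1 .. n-k+1 (0-based below). Each output costs k multiplications
    and k-1 additions, the operation count noted above."""
    n, k = len(x), len(w)
    return [sum(w[i] * x[i + r] for i in range(k))
            for r in range(n - k + 1)]

y = convolve_1d([1, 2, 3, 4, 5, 6], [1, 1, 1])
```

With n = 6 and k = 3 this produces the n-k+1 = 4 window sums.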
Let us look at the input data as an endless sequence of numbers sliding along the sequence of weights:

x10 x9 x8 x7 x6 x5 x4 x3 x2 x1
       w3 w2 w1

For each relative position of these two sequences, the sum of products of the overlapping weights and inputs gives us one of the convolution values. In the above example we have:

y7 = x9 w3 + x8 w2 + x7 w1

In the next step the relative position is

x10 x9 x8 x7 x6 x5 x4 x3 x2 x1
          w3 w2 w1

and we obtain:

y6 = x8 w3 + x7 w2 + x6 w1

The design of the convolution array described in the next section was developed on the basis of this simple observation.
4.2.2 Systolic Design
As described before the kernel should slide over the signal, but beside that
one can consider the kernel as being stationary in space and the signal as sliding
over the kernel. This view suggests a linearly connected array for performing
convolution so that each kernel element can be held in a single cell throughout the
computation and the number of cells is equal to the number of convolution weights
where the signals passes through the array from left to right
Host -
L.._. Cell! Cell2 Cell3 1---------- Cellk
Figure (4.1) 1D convolution systolic array.
Fig. (4.1) shows an array of cells connected to each other, with each cell holding a single kernel element, as described above. The array consists of k cells (the length of the weight vector). The signal data is pumped into the array by the host or interface unit in regular clock beats. Data and results are pumped through the array, and the final results and the original data are returned to the host (or another host). Only cells on the array boundaries are permitted to communicate with the host, and each of the cells communicates with its left and right neighbour cells only. It is assumed that there is a global clock synchronizing the computation of all components in the system, having a time cycle (step, unit) long enough to accommodate the most complex function performed by a cell, plus the data transfer. In each step, all cells simultaneously perform their I/O and execute their operations.
In each cell (a bifunctional cell) a signal element is held by a register, and the cell contains two subtasks: multiplication of an input x by a weight w, and addition of the result to y, where the multiplier may be pipelined to an arbitrary degree. Each cell produces its partial result one cycle earlier than the cell to its right. The skew can be accomplished by replacing in each cell the register which carries the signal stream with a multistage shift register. For the general case, the adder and the multiplier in each cell are multistage pipeline units as shown in fig. (4.2).
The number of stages of the shift register should be one greater than that of the adder, so if A is the number of stages of the shift register and B is the number of pipeline stages of the adder unit, then

A = B + 1
The Occam code running on this cell takes the following form:

... PROC delay
... PROC pass.data
... PROC multiply
... PROC add
PROC cell (CHAN .....)
  ... declaration of local channels
  PAR
    delay
    delay
    pass.data
    multiply
    add
    delay
It is possible to modify the cell design by reducing the number of pipeline stages for the multiplier unit, by adding another channel to the last stage of the shift register, as shown in fig. (4.3a). This channel is connected to the multiplier unit to broadcast the signal from the shift register to the multiplier unit. The new cell design is shown in fig. (4.3b).
Figure (4.2) Bifunctional cell design for 1D convolution.
We can see from the cell design that both data streams, i.e. the input, which passes through the shift register, and the output, which passes through the adder unit, move in the same direction. At the beginning of a cycle one input data item x enters the leftmost cell, and one element of the output stream y enters the same cell at the same time. The computation of the resulting coefficients is achieved by means of the cells' pipelined accumulation of the partial products.
For example, if k=4 (the length of the vector of weights), A=3 and B=2, then we need a four-cell systolic array. Fig. (4.4) shows snapshots of the execution as the xi and yi enter the array at time 0, where Pi stands for a partial result of yi. The computation of a resulting coefficient is achieved by means of pipelined accumulation of the partial products. For example, y4 is computed in four steps: first w4x4 is calculated in the first cell (processor) at time=6; this is then passed to the right-hand cell, where w3x3
Figure (4.3) (a) Register with two output channels. (b) Modified cell design for 1D convolution.
[Figure: snapshots of the array at Time=12, Time=15 and Time=16]
Figure (4.4) Snapshots of an execution of a systolic 1D convolution
array with k=4.
is calculated at time=8, and added to w4x4. The same procedure is carried on in the
next two cells, where w2x2 is calculated at the third cell at time=10, and w1x1 is
calculated in the last cell at time=12. The partial products are again summed to
produce y4 as output at time=15. With reference to fig. (4.5a), the configuration for
generating y4 is as follows:
y4 = w4x4 + w3x3 + w2x2 + w1x1
The next output y5 is shown in fig. (4.5b), where the kernel moves one
position to the right to give
y5 = w4x5 + w3x4 + w2x3 + w1x2
The snapshots of the systolic system for these output values are shown in fig.
(4.4).
[Figure: the kernel window w1 w2 w3 w4 placed over the sequence x1 x2 x3 x4 x5 x6 .......... x17 x18, at the positions producing (a) y4 and (b) y5]
Figure (4.5) A window of length 4 for convolving the output
values y4 and y5.
Only the first cell (leftmost cell) is permitted to communicate with the first
host, while only the last cell (rightmost cell) is permitted to communicate with the
second host.
The parallel algorithm executed at each of the four cells, for each result, is as
follows:
cell 1      read yi from host 1
            read xi from host 1
            calculate a partial result of yi
            send yi to cell 2
            send xi to cell 2

cell j+1    (j = 1 to 2)
            read yi from cell j
            read xi from cell j
            calculate a partial result of yi
            send yi to cell j+2
            send xi to cell j+2

cell 4      read yi from cell 3
            read xi from cell 3
            calculate the final partial result of yi
            send yi to host 2
            send xi to host 2
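Since Occam code is awkward to run outside a transputer toolset, the cell chain above can be sketched in Python; the function name is ours, and the inner loop sequentialises work that the real array performs concurrently, one cell per pipeline stage, with the two hosts omitted.

```python
def systolic_conv1d(x, w):
    """Sequentialised sketch of the k-cell pipeline: each 'cell' holds one
    weight, adds its partial product to the y value it receives, and passes
    x and y on to the next cell."""
    k = len(w)
    out = []
    for m in range(len(x) - k + 1):
        y = 0                      # y enters the leftmost cell initialised to zero
        for j in range(k):         # the chain of cells, left to right
            y += w[j] * x[m + j]   # cell j's partial-product accumulation
        out.append(y)              # the rightmost cell sends y to host 2
    return out
```

With w = (w1, ..., w4) the inner sum is exactly y4 = w4x4 + w3x3 + w2x2 + w1x1 at the first window position, then the window slides one position per output.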
The main procedure for the systolic array for 1D convolution is
... PROC host1
... PROC cell
... PROC host2
PROC 1D.system (CHAN ....)
  VAR float x[time], p[time], a[k]:
  SEQ
    PAR
      host1
      PAR i = [0 FOR k]
        cell
      host2:
-- (k is the size of the kernel)
Although, in the general case, some invalid results are generated, like y1,
y2 and y3, the fraction of the total results which are invalid is very small since
n >> k.
4.3 TWO DIMENSIONAL CONVOLUTION
4.3.1 Problem Definition
As shown in chapter 3, since the input data in image processing is two
dimensional (with two space indices), the convolution operations will also
be 2D. The main difference between the 1D and 2D operations is that the number of
indices of the former is doubled, as illustrated by the definition below:
Given a 2D image x = (xij)
where i=1,2,...,n1 and j=1,2,...,n2
Also we are given a 2D kernel w = (wij)
where i=1,2,...,k1 and j=1,2,...,k2
with k1 << n1 and k2 << n2
The 2D formula for convolving x with w is

    y(r1,r2) = Σ (i=1 to k1) Σ (j=1 to k2) w(i,j) x(i+r1-1, j+r2-1)        (4.2)

for r1 = 1,2,...,n1-k1+1
and r2 = 1,2,...,n2-k2+1.
The first result, y11, is computed by placing the kernel over the image, so
that w11 covers x11, w12 covers x12, etc., and multiplying the corresponding
elements of w and x, followed by the summation of these products. Let us suppose
the indices increase rightwards and downwards; then the kernel slides one position
to the right for the computation of the second result y12.
The kernel moves one position downward and back to the left edge of the
image after the first output row is computed. Similar steps are repeated for all
the rows; the last result is computed when w(k1,k2) covers x(n1,n2).
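The sliding-window computation just described can be written directly from equation (4.2); the Python sketch below (the function name is ours) uses 0-based indices, so y[0][0] corresponds to y11.

```python
def conv2d(x, w):
    """Direct 2D convolution per equation (4.2), 0-based:
    y[r1][r2] = sum over i, j of w[i][j] * x[r1 + i][r2 + j]."""
    n1, n2 = len(x), len(x[0])
    k1, k2 = len(w), len(w[0])
    return [[sum(w[i][j] * x[r1 + i][r2 + j]
                 for i in range(k1) for j in range(k2))
             for r2 in range(n2 - k2 + 1)]
            for r1 in range(n1 - k1 + 1)]
```

Note that the output has (n1-k1+1) by (n2-k2+1) valid entries, in agreement with the index ranges r1 and r2 given under equation (4.2).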
There are (k² − 1) additions and k² multiplications per window, and the
window operation is applied (n − k + 1)² times. As this indicates, 2D convolution is
even more compute-intensive than 1D convolution and, because of that, it requires
very careful design of both the algorithm and the hardware. In this chapter, we will
improve the two-level pipelined systolic array design to fit with definition (4.2).
4.3.2 Computation of 2D Convolution as ID Convolution
From a computational point of view, an efficient way of computing 2D
convolution is first to conven it into that of computing a lD convolution, and
therefore the 2D convolution can be performed on a systolic array for ID
convolutions.
The 2D convolution defined in equation (4.2) can be viewed as lD
convolution.
Each row of the image input is represented as xi
where
xi = xi1, xi2, ............, xin2
and
i = 1,2,....,n1
From that the image input is defined as
X = x1, x2, ............, xn1        (4.3)
The total length of the 2D image is n1 n2.
Fig. (4.6) illustrates a 2D convolution and its conversion to a 1D convolution,
as explained above.
The kernel can be converted in a similar manner, where each row is represented as wi
where
wi = wi1, wi2, ............, wik2        (i=1,2,......,k1)

[Figure: (a) the 5 x 5 image x (elements x11 ... x55) and the 3 x 3 kernel w (elements w11 ... w33); (b) the flattened sequences:]

X = x11, x12, x13, x14, x15, x21, x22, x23, x24, x25, x31, ........, x55
W = w11, w12, w13, 0, 0, w21, w22, w23, 0, 0, w31, w32, w33

Figure (4.6) An example of converting a 2D image and kernel
to a 1D image and kernel. (a) 2D image input x, and 2D kernel w. (b)
1D image input x, and 1D kernel w.
Thus the total kernel will be defined as:
W = w1, 0...0 (n2−k2 zeros), w2, 0...0 (n2−k2 zeros), ....., wk1        (4.4)
Equation (4.4) shows the concatenation of the rows of the 2D kernel, with a
vector of (n2−k2) zero elements inserted between each consecutive pair of rows.
The total length of the 1D kernel is therefore n2(k1−1)+k2. Fig. (4.6)
illustrates an example of converting a 2D kernel to a 1D kernel.
At this point, we note the constraint stipulating that the input sequence
(pixels) is formed by the rows of the input array entered one after another, without
any delays between consecutive rows.
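A quick way to check the conversion is to flatten a small example in software. The sketch below (function name ours) builds the 1D sequences of equations (4.3) and (4.4); an ordinary 1D window starting at offset r1·n2 + r2 (0-based) then reproduces the 2D result y(r1,r2).

```python
def to_1d(image, kernel):
    """Flatten per section 4.3.2: image rows are simply concatenated
    (eq. 4.3); kernel rows are concatenated with n2 - k2 zeros inserted
    between consecutive rows (eq. 4.4)."""
    n2, k2 = len(image[0]), len(kernel[0])
    x = [v for row in image for v in row]
    w = []
    for i, row in enumerate(kernel):
        w.extend(row)
        if i < len(kernel) - 1:
            w.extend([0] * (n2 - k2))   # pad each kernel row out to n2 elements
    return x, w
```

For a 3 x 3 image and a 2 x 2 kernel, the flattened kernel has length n2(k1−1)+k2 = 3·1+2 = 5, and the 1D window at offset 0 yields y(1,1).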
4.3.3 Systolic Array Design for 2D Convolution
From the illustration in the previous section, we know that a 2D
convolution operation can be computed as a sum of 1D convolutions. A 2D
convolution array consists, therefore, of 1D convolution arrays. We use k1 linear
1D arrays, as described in the previous section, as a design for 2D convolution.
As shown in fig. (4.7), we connect the 1D segments to produce a 2D array
for a 3 x 3 kernel and an input data row of 5 elements. Each of these arrays (or
segments) has k2 bifunctional cells, where k2 is the number of elements in each
kernel row. Also, there are (n2−k2) cells with zero kernel element in each segment,
except the last segment, where n2 is the number of pixels in each image row, so
the total number of cells in each segment is n2 cells, and the total number of cells in
the system is therefore n2(k1−1)+k2, as shown in the previous section. In fig. (4.7)
x stands for the input sequence.
It can be seen that a large number of cells would be needed in the array
when n1 and n2 are large, and the cells with zero weights would perform no useful
work (N.B., n >> k). We show here that a kernel containing a large number of
zero elements can be handled efficiently, because the number of
114
~ - r---- - -hostl Wll W12 W13 W:{) W:{)
~ - r---- - - -X
~
1-- I-- 1-- ~ ~ W:{) W:{) W23 W22 W21 ~ 1-- I-- 1-- ~
L...\ .... ~ -- W31 W32 W33 host2
' ~
Figure (4.7) Systolic array for 2D convolution where ki> k2=3 and n1>n2=S.
stages of the shift register in each individual cell is adjustable.
Let us consider the cell shown in fig. (4.3b) with a kernel element equal to
zero; the only effect of that cell is to delay the y stream by B cycles and the x stream
by A cycles, where A and B are as defined in section (4.2.2). It should therefore be
apparent that if this cell is replaced by a cell having zero cycles of delay for the output
stream and a single cycle of delay for the x stream, the same output stream would be
generated. This degenerate cell may be absorbed into the cell to its left by increasing
the number of shift register stages of that cell by one.
Since k2 is the number of nonzero elements, these elements are
loaded into consecutive cells of the systolic array. Thus, no more than k2 cells are
needed in the array.
Now let D be the number of zero kernel elements in each row; then
D = n2 − k2        (4.5)
[Figure: cell connected to host1, with adder (Yin to Yout), multiplier, and a shift register of Ak stages (Xin, R1 ... RAk, Xout)]
Figure (4.8) Cell design with added shift register stages (Ak).
[Figure: three segments of three cells each, holding the weights w33 w32 w31 (cell A, cell A, cell B), w21 w22 w23 (cell B, cell A, cell A) and w13 w12 w11 (cell A, cell A, cell A), ending at host2]
Figure (4.9) Systolic array for 2D convolution.
Let Ak be the number of stages of the shift register in cell B, which is
the last cell with a non-zero element in the first row.
Then the shift register length will be
Ak = A + D        (4.6)
where A is the original number of shift register stages.
If we take the example shown in fig. (4.7), then only three cells are needed
in each segment; the shift register in the first two cells is three and, from equation
(4.6), Ak=5 at the third cell. Fig. (4.8) shows a cell design with Ak shift register stages,
and the new systolic array design for that example is shown in fig. (4.9).
We conclude from the systolic array design described above that we have
k1(k2−1)+1 cells with A shift register stages and (k1−1) cells each with a shift register of
Ak stages.
In the general case the array consists of k1 segments each having k2 cells,
which adds up to k1k2 bifunctional cells.
Equation (4.6) can be generalised to any size of kernel and any number of
zero kernel elements. If R is the number of stages of the shift register, in general,
then
R = Ak
and
R = A + n2 − k2
Let L be the number of zero kernel elements between any two non-zero
kernel elements; then
R = A + n2 − (k2 + L)
and
R = A + (n2 × h) − (k2 + L)        (4.6a)
where h is the number of rows between a consecutive pair of kernel elements.
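Equation (4.6a) is easy to tabulate in software. The helper below (our naming) reproduces the worked values used in this chapter, with L = 0 and h = 1 as defaults for a kernel whose zero elements sit only at the end of each row.

```python
def shift_stages(A, n2, k2, L=0, h=1):
    """Equation (4.6a): R = A + (n2 * h) - (k2 + L), the number of shift
    register stages needed by a cell that absorbs the zero-weight cells
    between it and the next non-zero kernel element."""
    return A + (n2 * h) - (k2 + L)
```

For the fig. (4.7) example, shift_stages(3, 5, 3) gives Ak = 5, and for the 256-column, 5 x 5-kernel case of section 4.5, shift_stages(3, 256, 5) gives 254 stages.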
Fig. (4.10) illustrates a series of snapshots of the example shown in fig.
(4.7). Both x and y are initialised to zero, and they enter the array at the top
segment and move through the middle to the bottom segment. In our example,
two more delays were needed for each of the Ak shift registers (in the last cell of
each row except the last). As a result, the x sequence is slowed down and enters
each segment simultaneously with the appropriate elements of the y sequence.
There are 4 snapshots of the 2D convolution computation shown here; the indices
[Figure: four snapshots of the array, (a) Time=12, (b) Time=21, (c) Time=31, (d) Time=32]
Figure (4.10) Snapshots of the execution of a systolic 2D convolution
array with k1, k2 = 3 and n1, n2 = 5.
of the x and y sequences appear in each stage of the cell. We assume that the
computation starts at time zero.
At time three, x(1,1) meets y(1,1) at the first cell of the top segment, where
the kernel is w(3,3). Twelve clock cycles later x(1,1) moves to the first cell of the
middle segment to meet y(2,1), where the kernel is w(2,3). At the same time x(3,3)
meets y(3,3) at the first cell of the top segment, where this cell produces a partial
result of y(3,3), as illustrated in fig. (4.10a).
The second snapshot illustrates the state of all cells at time=21, where
x(1,1) meets y(2,3) at the end of the middle segment, where the kernel is w(2,1).
At the same time x(2,3) will be at the first cell of the middle segment to produce
another partial result of y(3,3). Also, at this time the first output y(1,1) is produced,
which is an invalid result; the other cells produce partial results of other output
values. At time 31, x(1,1) will be at the last cell of the array, to produce the last
partial result of y(3,3), as shown in fig. (4.10c).
The last snapshot in this figure illustrates the array at time=32, where the first
valid result y(3,3) is produced. Also at this time, the last partial result of the second
valid result, y(3,4), is produced.
Let us apply equation (4.2) to our example; the first valid result will be
y(3,3) = w(1,1) x(1,1) + w(1,2) x(1,2) + w(1,3) x(1,3) + w(2,1) x(2,1) +
         w(2,2) x(2,2) + w(2,3) x(2,3) + w(3,1) x(3,1) + w(3,2) x(3,2) +
         w(3,3) x(3,3)
But the partial results of this output y(3,3) are produced at different times
in different cells, where:
at cell (1,1), w(1,1) x(1,1) is produced at time=31
at cell (1,2), w(1,2) x(1,2) is produced at time=29
at cell (1,3), w(1,3) x(1,3) is produced at time=27
at cell (2,1), w(2,1) x(2,1) is produced at time=25
at cell (2,2), w(2,2) x(2,2) is produced at time=23
at cell (2,3), w(2,3) x(2,3) is produced at time=21
at cell (3,1), w(3,1) x(3,1) is produced at time=19
at cell (3,2), w(3,2) x(3,2) is produced at time=17
at cell (3,3), w(3,3) x(3,3) is produced at time=15
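The listed times follow a simple pattern, an observation on the schedule rather than part of the original derivation: taking the cells in raster order, each successive cell produces its term for y(3,3) two cycles earlier than the previous one, ending with the leftmost cell of the first segment at time 31. A small Python check (names ours):

```python
def partial_product_times(k1=3, k2=3, t_first=31, step=2):
    """Times at which cell (i, j) (1-based, raster order) produces its
    w(i,j)x(i,j) term for y(3,3): two cycles apart, cell (1,1) last."""
    return {(i + 1, j + 1): t_first - step * (i * k2 + j)
            for i in range(k1) for j in range(k2)}
```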
The same procedure is applied to all other outputs at every clock cycle, until
the final result y(5,5) is generated.
The main Occam procedure for the systolic array for 2D convolution is
shown below, for any image size (no.c * no.r) and for any kernel size
(no.kc * no.kr).
... PROC host1
... PROC host2
... PROC cell A
... PROC cell B
PROC 2D.system (CHAN ....)
  ... input image
  ... input kernel
  do := no.kc * (no.kr - 1)
  PAR
    host1
    PAR j = [0 FOR (no.kr - 1)]
      PAR i = [0 FOR (no.kc - 1)]
        cell A
      cell B
    PAR i = [do FOR no.kc]
      cell A
    host2:
4.4 MULTIDIMENSIONAL CONVOLUTION
The systolic array for 2D convolution described in the previous section can
be generalised to an mD convolution.
To illustrate the idea, let us consider a 3D image. The systolic array for 3D
convolution is built up of k arrays for 2D convolutions. In this respect, the design
is analogous to the 2D case, where we used k arrays for 1D convolution.
To illustrate the design of a 3D convolution array, let us assume
k1=3, k2=3 and k3=3 for the 3D kernel, and n1=5, n2=5 and n3=5 for the 3D image.
In this way, we obtain the 3D equivalent of the 2D array example from fig. (4.9).
The 3D array consists of three 2D subarrays. Each subarray is, in turn, made up of
three 1D segments, as shown in fig. (4.11). As in the 2D case, in order to complete
the design, we have to increase the number of shift register stages of the last cell
in each subarray (except the last one). This new cell, "cell C", is similar to
"cell B" (fig. (4.8)) with an increase in the number of shift register stages by
(n2 × n1) − (n2 × k1) + (n2 − k1)
Also, the method of converting a 2D problem into a 1D problem, which is
shown in section (4.3.2), can be generalised to converting an mD problem
into a 1D problem. Let us consider the above example. The 3D image is formed into
a 1D signal as follows:
X = x111, x112, x113, x114, x115, x121, x122, x123, x124, x125, x131, x132,
x133, x134, x135, ...., x155, x211, x212, x213, ....., x255, x311, x312, ......,
x355, x411, x412, ...., x555
And the 1D kernel is formed as follows:
W = w111, w112, w113, 0, 0, w121, w122, w123, 0, 0, ....., w133, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, w211, ....., w233, 0, 0, .., 0, w311, ...., w333
The 1D signal is formed by concatenating the rows of the first plane of the
[Figure: three 2D subarrays, each of three 1D segments; the cells hold the weights w111 ... w133, w211 ... w233 and w311 ... w333, in cells of types A, B and C]
Figure (4.11) Systolic array for 3D convolution with three different
cells, i.e. cell A, cell B, cell C.
3D signal, followed by the rows of the second plane, etc. The 1D kernel is formed
by concatenating the rows of the first plane of the 3D kernel, with zero vectors in
between, followed by a vector of zeros equal to n3-k3, then followed by the rows
of the second plane, etc.
4.5 CONSTANT TIME OPERATION
One of the main objectives of our design was to minimise the cell delay. The cell
delay is defined as the time delay between the input of data to the shift register and the
output from the shift register; in our example it is shown in fig. (4.8), where the
number of stages of the shift register of cell B is:
R = Ak = n2 = 5
This means that the number of delays (shift register stages) is directly
determined by the input data size (number of columns). For example, to perform
2D convolution on a 256 × 256 image would require 256 shift register stages in each cell
B. The number of delays in the system is determined by both the kernel size and the
input data size.
If the size of the kernel is 5 x 5 then the total number of delays is:
Rb for cell B = Ak = A + D = 3 + (256 − 5) = 254
Ra for cell A = 3
The total number of delays in cells B is:
4 × 254 = 1016
The total number of delays in cells A is:
21 × 3 = 63
Thus the total number of delays in all the cells is
R = Ra + Rb
R = 63 + 1016 = 1079
The percentage of the delay in cells B to the total delay is 94%, which is very
high and will increase the execution time of the system.
For improved system performance, the number of delays should be decreased
and/or the execution time of the shift register should be reduced.
The basic design of the systolic array for 2D and mD convolution, presented
in sections (4.3.3) and (4.4), can be modified to improve the performance of the
system. This can be achieved by implementing the shift register as a constant time
operation.
The shift register is a FIFO (First-In First-Out) list, so for a shift register of
size n we need to move n data items each time a new data item enters the register
and the first data item in the register is output.
[Figure: a circular array holding the queue, with Q.front and Q.rear pointers]
Figure (4.12) A circular implementation of the constant time operation.
To realise a constant time implementation of the delay, the shift register is
viewed differently. It is regarded as a circular array, where the first position
follows the last, as shown in fig. (4.12). The queue is formed around the circle in
consecutive positions, with the rear of the queue clockwise from the front.
To enqueue an element, we move the Q.rear pointer one position clockwise
and write the element in that position. To dequeue, we simply move Q.front one
position clockwise. Thus the queue migrates in a clockwise direction as we enqueue
and dequeue.
It can thus be seen that with this implementation only two data items are moved:
the incoming and the outgoing data.
The delay operation of the systolic system is therefore a constant time
operation, i.e., it is independent of the kernel size and input data size. This
decreases the execution time of the whole system, and improves the performance
(see section 4.10 of this chapter).
All the cells B shown in fig. (4.9) are replaced with cells CT (cell A with a
constant time process).
The Occam code for the constant time process is as follows:
PROC constant.time
  x := 0; b[k] := 0; j := 1
  SEQ i = [0 FOR time]    -- image size
    SEQ
      xin ? x
      bout ! b[j]
      b[j] := x
      IF
        j > (no.co - 1)
          j := 1
        TRUE
          j := j + 1:
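A Python rendering of the same circular-buffer delay may make the constant-time claim concrete; the function name is ours, and `n` plays the role of `no.co` above.

```python
def delay_line(samples, n):
    """n-stage delay via a circular buffer, mirroring PROC constant.time:
    each step touches only two data items (one in, one out), so the cost
    per sample is independent of n. The first n outputs are the zeros the
    buffer was initialised with."""
    buf = [0] * n
    j = 0
    out = []
    for x in samples:
        out.append(buf[j])   # dequeue the value that entered n steps ago
        buf[j] = x           # enqueue the new sample in the freed slot
        j = (j + 1) % n      # advance the circular pointer
    return out
```

Whatever the buffer length, each iteration performs one read, one write and one pointer update, which is exactly the constant-time property claimed above.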
4.6 TRANSPUTER NETWORK FOR ONE DIMENSIONAL
CONVOLUTION
The systolic design described in section (4.2.2) was chosen for the
implementation of the 1D convolution on the transputer network. This is because
the problem can easily be arranged such that each transputer is responsible for
representing one or more of the cells (depending on the size of the kernel), with
communication occurring between them, so that each kernel element (or more than
one element) can be held in a single transputer throughout the computation; the
maximum number of transputers in the design is equal to the number of convolution
kernel elements. The same cell design was adopted on the transputer, so we need the
same code to run on all transputers.
The first transputer is connected to the host, so it receives input data from the
host, while the last transputer is also connected to the host, so it passes on to the
host all the results from all the transputers on the network before it shuts down.
[Figure: the host feeding x, y into a chain of transputers T0 ... Tn, which return x, y to the host]
Figure (4.13) Transputer network configuration for 1D convolution.
When calculating new values at each transputer, communication has to occur to
obtain values from the neighbouring transputers. Each transputer collects input data
and results at the transputer boundaries, which are then sent down the Inmos links to
the next transputer to the right. It also receives the values of the results and the input
data from the transputer to its left through the Inmos link.
Parallel code must be executed in each transputer whilst all transputers execute
in parallel.
The transputer network configuration for the design is shown in fig. (4.13).
The algorithms described in section (4.2.2) required modification in order for
them to be executed on a network of transputers and to minimise the number of
channels used in each transputer.
The algorithm executed for each input datum on each transputer is as follows:

T0      receive input value xi from the host
        calculate a partial result of yi
        send yi to T1
        send xi to T1

Tj      (j = 1 to n-1)   -- (n is the number of transputers)
        receive input value xi from Tj-1
        receive a partial result yi from Tj-1
        calculate another partial result of yi
        send the updated yi to Tj+1
        send xi to Tj+1

Tn      receive input value xi from Tn-1
        receive a partial result yi from Tn-1
        calculate the final partial result of yi
        send yi to the host
        send xi to the host
4.7 PERFORMANCE OF THE ONE DIMENSIONAL
CONVOLUTION SYSTOLIC DESIGN ON THE
TRANSPUTER NETWORK
Several experiments were performed on the model problem in order to
measure the performance of the algorithm on a variety of network configurations.
The number of transputers on the network is not dependent on the size of the image.
The main requirement is for each transputer to have as nearly equal a number of cells
as possible. This is necessary for load balancing between transputers. The
network is comprised of only T800 transputers on the B012 board.
A timer on the host transputer is used to measure the processing time of the
1D convolution algorithm. This time is measured from the instant of starting to send
the first pixel to the instant of receiving the last pixel. The time lapse includes the
time spent transmitting the final result to the host processor from the network.
Several image sizes with two different window sizes have been tested. Tables (4.1
and 4.2) show the time of processing with respect to image size for both window
sizes for various network configurations. The tables also include the relative
speed-ups and efficiencies of the algorithm. The speed-up and efficiency are
calculated using equations (2.1 and 2.2) respectively.
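Equations (2.1) and (2.2) from chapter 2 are straightforward to apply to the tabulated times; this helper (our naming) recovers, for instance, the 256 x 256, 3-transputer row of table (4.1).

```python
def speedup_efficiency(t1, tp, p):
    """Relative speed-up S = T1 / Tp (eq. 2.1) and efficiency E = S / p
    (eq. 2.2), where T1 is the single-transputer time and Tp the time
    on p transputers."""
    s = t1 / tp
    return s, s / p
```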
An analysis of the results shows that the system's performance is improved
by increasing the number of transputers. Two sets of performance graphs are
shown in each of figures (4.14 and 4.15), in which two points can be noted. The
first is the effect of increasing the number of transputers on the performance. It can
be seen that the speed-up and the efficiency increase with an increasing number of
transputers. The second point is the effect of increasing the image size on the
Image Size   Network   Time Lapse   Relative   Efficiency
             Size      (seconds)    Speed-up   %

16 x 16      1         0.037        1.00       100.00
             2         0.025        1.48       74.09
             3         0.013        2.87       95.81

32 x 32      1         0.137        1.00       100.00
             2         0.092        1.49       74.38
             3         0.047        2.90       96.75

64 x 64      1         0.535        1.00       100.00
             2         0.360        1.49       74.42
             3         0.184        2.92       97.20

128 x 128    1         2.120        1.00       100.00
             2         1.424        1.49       74.46
             3         0.727        2.92       97.20

256 x 256    1         8.439        1.00       100.00
             2         5.670        1.49       74.43
             3         2.893        2.92       97.24

Table (4.1) Timing results for the 1D convolution algorithm (kernel size 3).
speed-up and the efficiency. It can be seen that both increase with the increase of
the image size.
Fig. (4.15b) shows efficiencies of over 77% for a three transputer
network, over 85% for a four transputer network and over 95% for a seven
transputer network. The main reason for the drop in the efficiency for each
transputer network is the number of cells in each transputer. When the number of
transputers decreases, the number of cells in each transputer increases. As
executing parallel Occam code on each transputer implies the transputer divides its
time between different processes, giving the illusion of concurrency, the best
efficiency we can get is when each transputer contains one cell only (7 transputers
for 7 cells). The same applies for fig. (4.14b), where the efficiency is 74% for a two
transputer network and over 95% for a three transputer network (3 transputers for
three cells).
In general, the 1D convolution algorithm will give a nearly linear speed-up
when the number of transputers on the network is increased.
Image Size   Network   Time Lapse   Relative   Efficiency
             Size      (seconds)    Speed-up   %

16 x 16      1         0.0856       1.00       100.00
             3         0.037        2.33       77.57
             4         0.025        3.44       85.99
             7         0.013        6.69       95.57

32 x 32      1         0.321        1.00       100.00
             3         0.137        2.34       77.87
             4         0.092        3.48       86.87
             7         0.047        6.77       96.74

64 x 64      1         1.254        1.00       100.00
             3         0.536        2.34       77.94
             4         0.360        3.48       87.06
             7         0.185        6.79       96.98

128 x 128    1         4.966        1.00       100.00
             3         2.123        2.34       77.98
             4         1.425        3.48       87.11
             7         0.731        6.79       97.03

256 x 256    1         19.78        1.00       100.00
             3         8.449        2.34       78.03
             4         5.675        3.49       87.14
             7         2.911        6.80       97.06

Table (4.2) Timing results for the 1D convolution algorithm (kernel size 7).
... = "' .. .. ... "-'
... " " ..
·;:; ;.:: ... r.l
3~---------------------------,
2
2 3
No. or Transputer
16 *16
- ·- -··· 32*32
- ----· 64. 64
4
128. 128 256.256
Figure (4.14a) Speedup graph for 1 D convolution (k=3)
10~----------------------------,
09
16 *16
--- --- 32 *32
--- --· 64*64 08 128. 128 256.256
07 2
No. of Transputer 4
Figure (4.14b) Efficiency graph for 1 D convolution (k=3)
132
L------------------------------------------------------
[Figure: speed-up vs. number of transputers for image sizes 16 x 16, 32 x 32, 64 x 64, 128 x 128 and 256 x 256]
Figure (4.15a) Speedup graph for 1D convolution (k=7).
[Figure: efficiency vs. number of transputers for the same image sizes]
Figure (4.15b) Efficiency graph for 1D convolution (k=7).
4.8 TRANSPUTER NETWORK FOR TWO DIMENSIONAL
CONVOLUTION
The systolic system described in section (4.3.3) required further
modification in order for the array to be implemented on a network of transputers.
The transputer network was connected to one host only. The first and the
final transputers are connected to the host; the first transputer receives all the input
data from the host and pumps it to the transputer network, while the final transputer
receives the results from the neighbouring transputer and sends them down to the host.
For simplicity each transputer is responsible for one or more cells of the systolic
design. Basically, inside each transputer equation (4.2) is applied to update the
partial results; each transputer collects values of both the input data and the partial
result through one Inmos link and sends them down the other Inmos link, as shown in
fig. (4.16).
[Figure: the host filing system feeding xin, yin into a ring of transputers T0, T1, T2, ..., Tn, which return xout, yout to the host]
Figure (4.16) Transputer network configuration for 2D convolution.
We need Occam code to run on different transputers to represent cell A
(shown in section (4.3)) and cell CT (shown in section 4.5). The maximum number
of transputers we need for this design is equal to the number of cells in the systolic
design, i.e. the number of kernel elements. The transputer network configuration for
such a design is shown in fig. (4.16). All the processes run in parallel inside the
transputers.
The Occam programs for the host and for the network of transputers are given in
Appendix A and B respectively.
4.9 PERFORMANCE OF THE TWO DIMENSIONAL
CONVOLUTION SYSTOLIC DESIGN ON THE
TRANSPUTER NETWORK
Experiments similar to those performed for the one dimensional convolution
were carried out to measure the algorithm's performance on a network of T800
transputers configured as an array. A summary of the timing results for the two
dimensional convolution algorithm is presented in table (4.3), with respect to the
image size, for a kernel size (3 x 3) and the various network configurations
implemented. The table also includes relative speed-ups and efficiencies of the
algorithm.
The overall results for this algorithm are very impressive. Table (4.3) indicates
'super' speed-up for all sizes of images when the network size is 3 transputers.
Their efficiencies are also extremely high (over 100%, falling gradually to above
91%).
This extraordinary behaviour is explained by the fact that each of the T800
transputers on the network is connected to an external memory which is much
slower than the on-chip RAM. When the program is executed on a single transputer,
therefore, some of the data stored in the cells' shift registers is stored in the external
memory, so that extra time is required in accessing the slow external memory. When
the same program is then run on a network of 3 transputers, the amount of storage
needed per transputer is one third and thus all the data can be stored on the fast on-
Image Size   Network   Time Lapse   Relative   Efficiency
             Size      (seconds)    Speed-up   %

16 x 16      1         0.093        1.00       100.00
             3         0.029        3.20       106.67
             5         0.020        4.59       91.86
             9         0.011        8.23       91.40

32 x 32      1         0.348        1.00       100.00
             3         0.108        3.23       107.79
             5         0.075        4.65       93.07
             9         0.0416       8.38       93.15

64 x 64      1         1.361        1.00       100.00
             3         0.420        3.24       108.19
             5         0.291        4.67       93.35
             9         0.162        8.43       93.80

128 x 128    1         5.389        1.00       100.00
             3         1.660        3.25       108.12
             5         1.154        4.67       93.42
             9         0.638        8.43       93.68

256 x 256    1         21.477       1.00       100.00
             3         6.649        3.23       107.66
             5         4.597        4.67       93.44
             9         2.543        8.45       93.84

Table (4.3) Timing results for the 2D convolution algorithm.
chip RAM. The gain in speed from the on-chip RAM offsets the new constraint
introduced by communication.
Fig. (4.17) shows that a near linear speed-up is obtained for various image
sizes. For small images (16 x 16; 32 x 32), there are decreasing gains as the
network increases, while this gain increases as the size of the image increases.
This result is quite useful, since the need for a parallel system is more vital for large
images, when the processing time is relatively high. Graphs of fig. (4.18) show
the effect of increasing the size of the transputer network on the performance of the
system. It can be seen that the efficiency increased with increasing size of the
network beyond 5 transputers.
[Figure: speed-up vs. number of transputers for image sizes 16 x 16, 32 x 32, 64 x 64, 128 x 128 and 256 x 256]
Figure (4.17) Speedup graphs for 2D convolution.
[Figure: efficiency vs. number of transputers for the same image sizes]
Figure (4.18) Efficiency graphs for 2D convolution.
4.10 ANALYSIS AND COMPARISON OF THE TWO
DIMENSIONAL SYSTOLIC ARRAY
We know from chapter 2 that systolic arrays are special purpose computing
devices. The research completed in the area of two-dimensional and
multidimensional systolic arrays is quite extensive. Several different designs have
been suggested in the literature. We will compare one of them to the design
presented in this section.
The size of the convolution kernel and the input data size are important
parameters in the design of any convolution array. For the design presented in this
chapter, the array is indirectly related to the size of the input image: the
number of delays in the shift register is equal to 3 in most of the array cells, and
some cells need a fixed or constant time process (CT) as shown in section (4.5).
However, our array requires buffering equal to the dimension of the image (i.e. the
number of columns) at cell CT. The number of processors in the array is equal to
the number of convolution kernel elements.
We will now compare our array to the two-level pipeline proposed by Kung
[Kung 1983].
The number of delays in the array proposed by Kung is very high: three
kinds of delay are needed, the first at the multiplier unit, the second at the
adder unit, and the third at the shift register. The total number of delays at some cells
of the array is roughly twice the dimension of the image, and this will increase
the execution time of the system.
The main reason for using a parallel processing system to implement image
processing algorithms is to reduce their execution times. Therefore it is important
that each processor is kept as active as possible. There are two reasons why this
may not happen in the array proposed by Kung:
1- The processors may not be allocated equal workloads, because some of
the cells have much more delay than others, and so those cells which finish
first must wait until the slower ones terminate; and
2- Communication of information will invariably be required between
cells. Since communication is tightly synchronised, if one processor is
trying to transmit information to another it must wait until the second processor is
ready to receive the information.
The processing time of our system running on the transputer network is
taken from table (4.3) and given here again in table (4.4). The table also includes the
processing time of the system proposed by Kung running on the same transputer
network. The speed-ups for both systems are also included in the table.
A comparison between the times for the two systems shows that the times for
our system are superior for all sizes of image and for all sizes of transputer
network, as shown in figs. (4.19 and 4.20). The time increases rapidly with the
increase of image size. This is because for large image sizes the number of delays is
very high, and that affects the performance of the system.
Figures (4.21 and 4.22) show a dramatic decrease in the speedup and efficiency
of the Kung system, especially for larger image sizes. The reason is as
before: the number of delays is very high.
Again, for small image sizes there is less gain as the network size increases.
The reason for the efficiency difference in the case of smaller images is the size of
the load on each transputer: the number of cells per transputer in the
smaller transputer networks is higher than in the larger ones. The
increase in the number of cells per transputer affects the efficiency
gain. This gain decreases as the image size increases, but the percentage
drop in efficiency gain is very small.
                       Constant time system       2-level pipeline system
Image      Network     Time Lapse    Relative     Time Lapse    Relative
Size       Size        (seconds)     Speed-up     (seconds)     Speed-up
16 x 16    1           0.093         1.00         0.184         1.00
           3           0.029         3.20         0.053         3.49
           5           0.020         4.59         0.040         4.56
           9           0.011         8.23         0.028         6.51
32 x 32    1           0.348         1.00         0.935         1.00
           3           0.107         3.23         0.320         2.92
           5           0.074         4.47         0.177         5.28
           9           0.0416        8.38         0.177         5.28
64 x 64    1           1.361         1.00         5.516         1.00
           3           0.420         3.24         2.213         2.49
           5           0.291         4.66         1.63          3.39
           9           0.162         8.43         1.681         3.28
128 x 128  1           5.389         1.00         36.487        1.00
           3           1.660         3.25         18.651        1.96
           5           1.154         4.67         13.546        2.69
           9           0.638         8.43         13.980        2.61
Table (4.4) Timing results for the 2D convolution algorithm for the 2-level
pipeline systolic system and the constant time systolic system.
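The speed-up and efficiency figures in this and the following tables follow the usual definitions: speed-up S(p) = T(1)/T(p), and efficiency E(p) = S(p)/p. A minimal Python sketch of the calculation (illustrative only, not part of the Occam system), using the 128 x 128 rows of the 2-level pipeline system from table (4.4):

```python
# Relative speed-up S(p) = T(1) / T(p); efficiency E(p) = S(p) / p.
# Times below are the 128 x 128 rows of the 2-level pipeline system.
times = {1: 36.487, 3: 18.651, 5: 13.546, 9: 13.980}

def speedup(times, p):
    """Single-transputer time divided by the p-transputer time."""
    return times[1] / times[p]

def efficiency(times, p):
    """Speed-up normalised by the number of transputers."""
    return speedup(times, p) / p

for p in (3, 5, 9):
    print(p, round(speedup(times, p), 2), round(100 * efficiency(times, p), 1))
```

Rounding these ratios to two decimal places reproduces the 1.96, 2.69 and 2.61 speed-ups quoted in the table.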
[Figure: time lapse (0 to 0.2 s) against number of transputers (2 to 10) for the constant time system and the 2-level pipeline system]
Figure ( 4.19) Time lapse graphs for 2D convolution for two
systems (Image size 16 x 16).
[Figure: time lapse (0 to 40 s) against number of transputers (2 to 10) for the constant time system and the 2-level pipeline system]
Figure ( 4.20) Time lapse graphs for 2D convolution for two
systems (Image size 128 x 128).
[Figure: speed-up (0 to 10) against number of transputers (2 to 10) for the constant time system and the 2-level pipeline system]
Figure ( 4.21) Speed up graphs for 2D convolution for two
systems (Image size 128 x 128).
[Figure: efficiency (0 to 1.2) against number of transputers (2 to 10) for the constant time system and the 2-level pipeline system]
Figure ( 4.22) Efficiency graphs for 2D convolution for two
systems (Image size 128 x 128).
CHAPTER 5
PARALLEL IMPLEMENTATION OF THE LAPLACIAN AND GRADIENT OPERATORS IN COMPUTER VISION
5.1 INTRODUCTION
The purpose of this chapter is to identify a set of designs suitable for
implementing low level image processing algorithms on VLSI processing arrays.
We consider techniques for filtering digital images, as described in chapter 3. This
includes both low pass (smoothing) and high pass (edge enhancement) filters.
Most image processing algorithms need massive amounts of band
matrix operations. However, these algorithms contain explicit parallelism which can
be efficiently exploited by processor arrays. All sections of the image have to be
processed in exactly the same way, regardless of the position of the image section
within the image, or the value of the pixel data.
Low level functions involve matrix vector operations, which are repeated at
very high speed. Typically, images must be processed in real time, at 25 images per
second. Therefore, with image sizes of 128 x 128, 256 x 256, and 512 x 512 or
greater, there is a large amount of data to be processed in a highly repetitive
process.
A different class of algorithms was implemented on a transputer network
using the systolic array design described in chapter 4. The reasons for choosing these
algorithms are that they are operations commonly used in digital image
filtering systems, and that a varying degree of communication between processors is
required for each of the algorithms.
Various modifications of the systolic design are analysed in this chapter; the
design modifications are to handle each of the filter algorithms. The number of cells
or processors in each filter design is dependent on the size of the kernel and the
nature of the algorithms (e.g. the number of sub-tasks). However, some applications
may dictate that only a small number of processors can be used, for example
missile systems, where the amount of space available may be small.
A brief definition of each algorithm is given at the start of the section in which it
is discussed, together with a full description of the systolic array design. The
results and efficiency obtained for each design on a transputer network are also
given, to reflect the performance of the design on this level of vision.
5.2 SYSTOLIC DESIGNS FOR DIGITAL IMAGE
FILTERS.
Smoothing filters are designed to reduce the noise, detail, or 'busyness' in an
image. Several types of low pass filter designs are considered. Typical filters are:
Mean, Weighted mean, Inverse gradient, and Sigma.
Edge enhancement filters are intended to enhance or boost the image edges.
Several types of enhancement filter designs are implemented. Typical filters are:
Gradient operators, Laplacian operators, the Laplacian added back, Sobel and
Prewitt.
For this purpose, we have developed several models of systolic arrays.
Various modifications and customisations of the design were presented in the
previous chapter for specific smoothing and edge enhancement algorithms.
Different kinds of cells are used, some of which are implemented in this chapter,
which has improved the hardware utilisation and throughput. The actual designs rely
strongly on the property of Occam which allows one to model in software several
concurrent processes on a single processing element or a network of processing
elements.
In this chapter we describe the designs for the Laplacian and Gradient
operators. These are two classes of filters that are frequently applied to the digital
image.
5.3 PARALLEL IMPLEMENTATION OF THE
LAPLACIAN OPERATOR.
5.3.1 Laplacian Operator Algorithms
The Laplacian operator is computed by convolving a mask with the image.
The Laplacian filter is applied in the same way as a typical smoothing filter: a
Laplacian mask is passed over the entire image, and the convolution operation is
performed on each pixel. Each pixel is replaced by the sum of the products of the
mask weightings and the appropriate neighbouring pixel values. Different choices
are available when using this mask in two dimensions.
A Laplacian filter using a 3 x 3 convolution mask may be used. This utilises a
mask or weighting matrix based on two standard masks, as shown in fig. (5.1).
 0  -1   0
-1   4  -1
 0  -1   0
(A) Plus shaped mask

-1  -1  -1
-1   8  -1
-1  -1  -1
(B) Square mask

Figure (5.1) Two Laplacian masks.
The values in the weighting matrix allow a simpler and faster version of the
algorithm than was obtained using the general convolution case. In the first mask
(fig. 5.1 A), as only values of -1 and 4 are used, it is possible either to subtract the
value of a pixel or to add 4 times its value, respectively; also there are four zero
values, which reduce the number of processors and the number of multiplications
and additions. In the second mask (fig. 5.1 B), where only values of -1 and
8 are used, it is likewise possible either to subtract the value of a pixel or to add 8
times its value, respectively.
The algorithm given here 'smooths' a grey level input image and is based on
one described in chapter 3. The algorithm has I as an input image and O as an
output image; both I and O contain M by M pixels, with P = M^2. Each point of I is
a value representing one of the possible grey levels. Each point in the smoothed
output image, O(i,j), is obtained by multiplying the pixel I(i,j) and its eight nearest
neighbouring pixel values, I(i-1,j-1), I(i-1,j), I(i-1,j+1), I(i,j+1), I(i,j-1),
I(i+1,j-1), I(i+1,j), and I(i+1,j+1), by the weighting Laplacian mask values,
where 1 <= i, j < M-1. Pixels on
the edges of I are not smoothed with this filter; they are simply copied to O.
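As a point of reference for the parallel designs that follow, the per-pixel operation just described can be written down sequentially. The sketch below is illustrative Python, not the systolic Occam implementation; it applies the plus-shaped mask of fig. (5.1 A) and copies the border pixels of I to O unchanged:

```python
def plus_laplacian(I):
    """Apply the plus-shaped Laplacian mask of fig. (5.1 A) to image I.

    Interior pixels get 4*I(i,j) minus the four 4-connected neighbours;
    border pixels are simply copied, as in the algorithm above.
    """
    M = len(I)
    O = [row[:] for row in I]          # border pixels copied unchanged
    for i in range(1, M - 1):
        for j in range(1, M - 1):
            O[i][j] = (4 * I[i][j]
                       - I[i - 1][j] - I[i + 1][j]
                       - I[i][j - 1] - I[i][j + 1])
    return O

image = [[1, 1, 1],
         [1, 5, 1],
         [1, 1, 1]]
print(plus_laplacian(image))   # centre pixel becomes 4*5 - (1+1+1+1) = 16
```

The systolic array splits exactly this weighted sum across its cells, one mask weight per cell.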
5.3.2 Systolic Array for the Laplacian Operator.
In chapter 4 we introduced two different designs for multidimensional
convolution. The second design includes two different cell designs, i.e., cell A and
cell CT (sections 4.2.2 and 4.5). Cell A is the original cell, holding a non-zero
kernel element value as shown in fig. (4.3 B), while cell CT with an (Ak) shift
register represents kernel elements equal to zero as shown in fig. (4.8).
This design can be improved and implemented for the Laplacian filter. We
consider how it could be implemented with N cells,
where each processor or cell stores an element of the Laplacian filter (N being
either 5 or 9 processors).
5.3.2.1 Plus shaped Laplacian Operator Design.
For the implementation of the Laplacian filter shown in fig. (5.1 A), we
should modify cell CT. This cell contains two kernel elements, the second and the
third. The first kernel element of the mask is a zero element and should
be ignored (as shown in fig. 5.2). The third kernel element is a zero element, and
may be absorbed into the cell by increasing the number of shift register stages of
that cell by one.
The last cell in the second segment also has the same number of shift
register stages.
Let (AL) be the number of stages of the shift register in the new cell (cell CT).
Then, from equation (4.6), the shift register length will be

AL = Ak + 1
   = A + (D + 1)                                        (5.1)
The Occam code running on this cell (cell CT) takes the form of fig. (5.3).
The Laplacian systolic array design, as shown in fig. (5.4), will consist of five cells
only; one cell is cell CT in the first segment. Three cells are needed in the second
[Figure: a 5 x 5 image grid (x11 ... x55) with the plus-shaped mask weights 0, -1, 4 overlaid on the pixels of the first 3 x 3 window; arrows mark the zero elements]
Figure (5.2) Illustrates the Laplacian filter convolving an image n x n
where n=5 (arrows denoting zero elements in the mask).
segment; the shift register in the first two cells has 3 stages (cell A as in fig. 4.3),
and the third cell is cell CT. One cell is needed in the third segment, which is cell A.
As shown in fig. (5.4), the image data is pumped into the array by the host.
The data and results are pumped through the array, and the final results and the
original data are collected by the second host.
For the parallel approach, parallel code must be executed in each cell, whilst
all the cells and the two hosts execute in parallel.
--- PROC ClP
--- PROC delay
--- PROC pass.data
--- PROC multiply
--- PROC add
PROC cell.CT (CHAN ...)
  --- declaration of local channels
  PAR
    delay
    pass.data
    ClP
    multiply
    delay
    add

Fig. (5.3) Procedure to run the cell CT.
[Figure: linear systolic array; host1 feeds x and y into the first-segment cell CT (w5 = -1); the second segment cells cell CT, cell A, cell A hold w4, w3, w2 = -1, 4, -1; the final cell A (w1 = -1) sends the results to host2]
Figure (5.4) Systolic array for the plus shape Laplacian operator.
Only the first cell (leftmost cell) is permitted to communicate with the first
host, and only the last cell (rightmost cell) is permitted to communicate with the
second host. Each of the other cells communicates only with its left and right
neighbour cells; each cell communicates input data and results to the neighbouring
cell on its right and obtains data and results from the cell on its left. The parallel
algorithm computed at each result for the five cells is shown in fig. (5.5).
cell 1
  read yi from host 1
  read xi from host 1
  calculate a partial result of yi
  send yi to cell 2
  send xi to cell 2

cell j+1 (j = 1 to 3)
  read yi from cell j
  read xi from cell j
  calculate a partial result of yi
  send yi to cell j+2
  send xi to cell j+2

cell 5
  read yi from cell 4
  read xi from cell 4
  calculate the final partial result of yi
  send yi to host 2
  send xi to host 2

Figure (5.5) Parallel Laplacian filter algorithm.
At the time the input data xi and the output stream yi enter the first cell in the
array, the computation of the resulting coefficient is performed, w5 xi+3 (if the
image is 3 x 3); this is then passed to the right-hand cell. The same process
carries on for the next four cells, with the partial products being summed in each
cell to produce yij. With reference to fig. (5.2) and fig. (5.4), a configuration for
generating yij is as follows (for an image n x n where n = 5):

yij = w1 xi-1,j + w2 xi,j-1 + w3 xi,j + w4 xi,j+1 + w5 xi+1,j        (5.2a)

for i,j = 1,2,.....,n

By applying the weighting mask shown in fig. (5.1A) to this equation, then,

yij = (-1) xi-1,j + (-1) xi,j-1 + 4 xi,j + (-1) xi,j+1 + (-1) xi+1,j  (5.2b)

The algorithm is repeated for every input data for each cell concurrently. The main
Occam code to run the array is as follows:
--- PROC host1
--- PROC cell CT
--- PROC cell A
--- PROC host2
PROC main.system (CHAN ...)
  --- declaration of local channels
  SEQ
    image := number.columns * number.rows
    SEQ i = [0 FOR image]
      input data
      PAR
        host1
        cell CT
        PAR i = [0 FOR 2]
          cell A
        cell CT
        cell A
        host2:
5.3.2.2 Systolic Array Design for Square Laplacian Operator.
The approach for the design of the Laplacian filter algorithm for the mask
shown in fig. (5.1B) is more or less similar to the one described for 2D
convolution in section 4.3.3.
The size of the mask shown in fig. (5.1B) is the 3 x 3 kernel. The 9 values
of the kernel are non-zero values. The array consists of 3 segments each having 3
cells, adding up to 9 cells. The first two cells of each of the first and second
segments are cell A and the third cell is cell CT (both cell designs are described in
chapter 4). The third segment consists of three cells (cell A). Each cell stores an
element of the square Laplacian filter, as shown in fig. (5.6).
The parallel algorithm computed at each result for the nine cells is as shown
in fig. (5.5). As the array size is greater than 5, all central cells execute the
cell j+1 code.
The partial products are summed in each cell to produce yi. With reference to
fig. (5.6), a configuration for yij is as follows:
[Figure: three-segment systolic array; host1 feeds x and y into the first segment (cell A, cell A, cell CT holding w9, w8, w7 = -1, -1, -1); the second segment (cell CT, cell A, cell A) holds w4, w5, w6 = -1, 8, -1; the third segment (cell A, cell A, cell A) holds w3, w2, w1 = -1, -1, -1 and delivers the results to host2]
Figure (5.6) Systolic array for the square Laplacian operator.
yij = w1 xi-1,j-1 + w2 xi-1,j + w3 xi-1,j+1 + w4 xi,j-1 + w5 xi,j
      + w6 xi,j+1 + w7 xi+1,j-1 + w8 xi+1,j + w9 xi+1,j+1           (5.3a)

for i,j = 1.....n.

By applying the values of the mask shown in fig. (5.1B) to this equation, then,

yij = 8 xi,j - [ xi-1,j-1 + xi-1,j + xi-1,j+1 + xi,j-1 + xi,j+1
      + xi+1,j-1 + xi+1,j + xi+1,j+1 ]                              (5.3b)

The main Occam code to run the array is as follows:

--- PROC host1
--- PROC cell CT
--- PROC cell A
--- PROC host2
PROC main.system (CHAN ...)
  --- declaration of local channels
  SEQ
    image := number.columns * number.rows
    SEQ i = [0 FOR image]
      input data
      PAR
        host1
        PAR j = [0 FOR 2]
          PAR i = [0 FOR 2]
            cell A
          cell CT
        PAR j = [0 FOR 3]
          cell A
        host2:
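Equation (5.3b) is just the square mask of fig. (5.1B) written out: eight times the centre pixel minus the sum of its eight neighbours. A quick illustrative Python check (not part of the Occam system) that this short form agrees with the full weighted sum of equation (5.3a):

```python
# Weights of the square Laplacian mask (fig. 5.1B), row by row.
W = [[-1, -1, -1],
     [-1,  8, -1],
     [-1, -1, -1]]

def full_sum(I, i, j):
    """Equation (5.3a): full 3x3 weighted sum around pixel (i, j)."""
    return sum(W[r + 1][c + 1] * I[i + r][j + c]
               for r in (-1, 0, 1) for c in (-1, 0, 1))

def short_form(I, i, j):
    """Equation (5.3b): 8*x(i,j) minus the eight neighbours."""
    neighbours = sum(I[i + r][j + c]
                     for r in (-1, 0, 1) for c in (-1, 0, 1)
                     if (r, c) != (0, 0))
    return 8 * I[i][j] - neighbours

I = [[2, 7, 1],
     [9, 4, 3],
     [6, 5, 8]]
assert full_sum(I, 1, 1) == short_form(I, 1, 1)
print(full_sum(I, 1, 1))   # 8*4 - 41 = -9
```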
5.3.3 Designing Transputer Networks for the Laplacian Operator.
The systolic design described in section (5.3.2.1) was chosen for the
implementation of the plus shaped Laplacian filter on the transputer network. This
is because the problem can easily be arranged such that each transputer is
responsible for representing one or more of the cells shown in fig. (5.4), with
communication occurring between them. Each kernel element can be held in a single
transputer throughout the computation, and the maximum number of transputers is
equal to the number of Laplacian filter weights. The same cell design was adopted
on the transputer, so we need two codes to run on different transputers to represent
cell A and cell CT as shown in section (5.3.2.1). The maximum number of
transputers we need for this design is five.
When calculating new values at each transputer, communication has to occur
to obtain values from the neighbouring transputers. Input data and results are
collected at the transputer boundary and then sent down the Inmos links to the next
transputer, which receives the values of the results from the Inmos link. Parallel code
must be executed in each transputer, whilst all transputers execute in parallel.
The first and the last transputers are connected to the host, so the first
transputer receives input data from the host, whilst the last transputer passes on to
the host all the results from all the transputers on the network before it shuts down.
The transputer network configuration for such a design is shown in fig. (5.7). The
same Occam code can be run on transputers T0 and T3 to represent the cell CT
design, while a different Occam code runs on the T1, T2, and T4 transputers.
The transputer network design for the square Laplacian filter is similar to that
shown in fig. (5.7). As the network size is greater than 5, the central transputers
are extended to 9 transputers. Two cell designs (cell A and cell CT) are needed
[Figure: host and filing system connected to transputer T0; the pipeline T0, T1, T2, ..., Tn carries xin/yin through the network and the last transputer returns xout/yout]
Figure (5.7) Transputer network configuration for the Laplacian operator.
to be implemented on different transputers. One Occam code runs on transputers
T0, T1, T3, T4, T6, T7, and T8 to implement the cell A design, while transputers
T2 and T5 execute another Occam code, that of cell CT. All the
processes run in parallel inside each transputer, whilst all transputers execute in
parallel.
The algorithms described in sections (5.3.2.1 and 5.3.2.2) require
modification in order for them to be executed on a network of transputers.
The algorithms at each input data for the 5 and 9 transputer networks are as
follows:
T0
  receive input value xi from the host
  calculate a partial result of yi
  send yi to T1
  send xi to T1

Tj (j = 1 to n-2)   --- (n is the number of transputers)
  receive input value xi from Tj-1
  receive a partial result yi from Tj-1
  calculate another partial result of yi
  send the added yi to Tj+1
  send xi to Tj+1

Tn-1
  receive input value xi from Tn-2
  receive a partial result yi from Tn-2
  calculate the final partial result of yi
  send yi to the host
  send xi to the host
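The three step types above (first, middle, last) all perform the same add-and-forward action on the partial result yi. A toy sequential model in Python (illustrative only; the real system runs the stages concurrently on separate transputers, and the sample values here are hypothetical):

```python
# Each stage of the pipeline holds one weight of the plus-shaped mask,
# adds its contribution to the partial result yi, and forwards it.
weights = [-1, -1, 4, -1, -1]      # the five mask weights of fig. (5.1 A)

def pipeline(samples):
    """samples[k] is the pixel value consumed by stage k for one output."""
    y = 0                          # T0 starts the partial result
    for w, x in zip(weights, samples):
        y = y + w * x              # 'calculate another partial result of yi'
    return y                       # Tn-1 sends the final yi to the host

# Illustrative pixel values seen by the five stages for one output:
print(pipeline([1, 2, 5, 2, 1]))   # 4*5 - (1+2+2+1) = 14
```

The load-balancing point made below follows directly: with 5 stages and fewer than 5 transputers, some transputer must run more than one stage per output.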
5.3.4 Performance of the Laplacian Operator Systolic Design on
Transputer Network.
The designs were applied and executed on transputer networks of varying
sizes. The network comprised 16 T800 transputers configured as an array on the
B012 board. The timing results for the 5 point (or plus) shaped Laplacian operator
are presented in table (5.1). The table also includes the relative speed-ups and
efficiencies of the algorithm. The associated speed-up and efficiency graphs are
given in figures (5.8) and (5.9) respectively.
The speed-up graphs for this design are nearly linear, with speed-up
increasing as the size of the transputer network increases. Also, the speed-up
increases as the size of the image increases, as shown in fig. (5.8). The graphs in
fig. (5.9) show a drop in the efficiencies when the network size is 3 transputers.
This is because the load balance is poor compared with that of the other networks.
The maximum efficiency is obtained when the load on each transputer is
about the same. As the total number of cells in the systolic system is 5, the best
Image Size  Network  Time Lapse  Relative  Efficiency
            Size     (seconds)   Speed-up  %
16 x 16     1        0.048       1.00      100.00
            2        0.029       1.66      83.07
            3        0.020       2.38      79.47
            5        0.011       4.32      86.45
32 x 32     1        0.180       1.00      100.00
            2        0.108       1.67      83.45
            3        0.075       2.41      80.30
            5        0.042       4.33      86.56
64 x 64     1        0.716       1.00      100.00
            2        0.420       1.70      85.23
            3        0.291       2.46      82.09
            5        0.162       4.43      88.64
128 x 128   1        2.989       1.00      100.00
            2        1.663       1.797     89.85
            3        1.151       2.60      86.58
            5        0.639       4.68      93.54
256 x 256   1        12.012      1.00      100.00
            2        6.859       1.75      87.56
            3        4.601       2.61      87.03
            5        2.550       4.71      94.22
Table (5.1) Timing results for the plus shape Laplacian operator.
[Figure: speed-up (2 to 5) against number of transputers (1 to 6), one curve per image size: 16 x 16, 32 x 32, 64 x 64, 128 x 128, 256 x 256]
Figure (5.8) Speedup graphs for plus shape Laplacian operator.
[Figure: efficiency (0.8 to 1.2) against number of transputers (1 to 6), one curve per image size: 16 x 16, 32 x 32, 64 x 64, 128 x 128, 256 x 256]
Figure (5.9) Efficiency graphs for plus shape Laplacian operator.
efficiency we can get is when the load is one cell per transputer. In other words,
the maximum efficiency is obtained when there are 5 transputers in the network.
The efficiency graphs for the larger image sizes generally show a higher efficiency
than those for the smaller images, especially for the 5 transputer network.
Table (5.2) presents a summary of the timing results for the 9 point (or
square) Laplacian operator. The relative speed-ups and efficiencies of the algorithm
are included in the table. Figures (5.10) and (5.11) show the relative speedup and
efficiency respectively. The 'super' speedup figures and the high efficiencies
shown in the graph for the 3 transputer network are due to the reason explained
earlier in section (4.9).
It can be seen from the graphs that the speedup and the efficiency increase
with increasing network size (for more than 5 transputers) and with
increasing image size. There is less efficiency gain when the size of the
network is 9. The reason for the efficiency gain difference in the case of
the larger network is the proportion of the time spent on communication
overheads, especially when the size of the network is relatively high. The
efficiency is over 85% for 4 transputers and over 91% for the 5 and 9 transputer
networks.
Image Size  Network  Time Lapse  Relative  Efficiency
            Size     (seconds)   Speed-up  %
16 x 16     1        0.093       1.00      100.00
            3        0.029       3.20      106.67
            5        0.020       4.59      91.86
            9        0.011       8.23      91.40
32 x 32     1        0.348       1.00      100.00
            3        0.107       3.23      107.79
            5        0.074       4.47      93.07
            9        0.0416      8.38      93.15
64 x 64     1        1.361       1.00      100.00
            3        0.420       3.24      108.11
            5        0.291       4.66      93.35
            9        0.162       8.43      93.68
128 x 128   1        5.389       1.00      100.00
            3        1.660       3.25      108.12
            5        1.154       4.67      93.42
            9        0.638       8.43      93.68
256 x 256   1        21.477      1.00      100.00
            3        6.649       3.23      107.66
            5        4.597       4.67      93.44
            9        2.543       8.45      93.84
Table (5.2) Timing results for the square Laplacian operator.
[Figure: speed-up (3 to 9) against number of transputers (2 to 10), one curve per image size: 16 x 16, 32 x 32, 64 x 64, 128 x 128, 256 x 256]
Figure (5.10) Speedup graphs for square Laplacian operator.
[Figure: efficiency (0.6 to 1.2) against number of transputers (2 to 10), one curve per image size: 16 x 16, 32 x 32, 64 x 64, 128 x 128, 256 x 256]
Figure (5.11) Efficiency graphs for square Laplacian operator.
5.4 PARALLEL IMPLEMENTATION OF THE
GRADIENT OPERATOR
5.4.1 Gradient Operator Algorithms.
The gradient operator applied to a continuous function produces a vector at
each point whose direction is aligned with the direction of maximal grey-level
change at that point, and whose magnitude describes the severity of this change.
A digital gradient may be computed by convolving two windows with an
image, one window giving the x component gx of the gradient operator, and the
other giving the y component gy.
Then, the magnitude of the gradient operator at a point is defined, as shown in
chapter 3, by

g = (gx^2 + gy^2)^1/2                                   (5.4)

and the direction at this point can be computed from

theta = tan^-1 (gy / gx)                                (5.5)
The standard masks for the gradient operator are shown in fig. (5.12). The
outputs generated by mask x and mask y are centred on (i,j).

mask x = [ -1  0  1 ]          (A) mask x

mask y = [ -1
            0
            1 ]                (B) mask y

Figure (5.12) Gradient operator masks.
Typical gradient masks for a larger window, such as Sobel operators and
Prewitt operators are explained in a later section.
As seen in fig. (5.12), only values of -1 and 1 are used, so it is possible to
subtract a pixel or add its value, respectively. Also there are zero values in
each mask, which reduce the number of processors and the number of operations.
The gradient operator produces a two element vector at each pixel, and this is
usually stored as two new images, one for the x component and the other for the y
component.
The algorithm described in chapter 3 has I as the input image and Ox and Oy as
output images; I as well as Ox and Oy contain M x M pixels, with P = M^2. At each
point in the output image, gx(i,j), we multiply the pixel a(i,j) and both its left and
right neighbouring pixel values, a(i,j-1) and a(i,j+1), by the weighting mask x
values. At each point in the output image, gy(i,j), we multiply the pixel a(i,j)
and both its upper and lower neighbouring pixel values, a(i-1,j) and a(i+1,j), by the
weighting mask y values.
From equation (3.5), a configuration for generating the x output component
gx is as follows:

gx(i,j) = w1 ai,j-1 + w2 ai,j + w3 ai,j+1               (5.6a)

By applying the weighting mask shown in fig. (5.12 A) to this equation, then
we have

gx(i,j) = ai,j+1 - ai,j-1                               (5.6b)

From equation (3.6), a configuration for generating the y output component
gy is as follows:

gy(i,j) = w1 ai-1,j + w2 ai,j + w3 ai+1,j               (5.7a)

By applying the weighting mask shown in fig. (5.12 B) to this equation, then
we have

gy(i,j) = ai+1,j - ai-1,j                               (5.7b)
163
- -------------------
Substitution of equations (5.6b) and (5.7b) in equations (5.4) and (5.5)
respectively yields the gradient values:

g(i,j) = [ (ai,j+1 - ai,j-1)^2 + (ai+1,j - ai-1,j)^2 ]^1/2    (5.8)

theta = tan^-1 [ (ai+1,j - ai-1,j) / (ai,j+1 - ai,j-1) ]      (5.9)

for i,j = 1.....n-1.

The algorithms are repeated for every pixel in the image.
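Per pixel, equations (5.8) and (5.9) reduce to two central differences. A small illustrative Python version (not the systolic implementation; math.atan2 is used so the vertical-edge case, where the denominator gx is zero, is handled safely):

```python
import math

def gradient(a, i, j):
    """Central-difference gradient of equations (5.6b)-(5.9) at (i, j)."""
    gx = a[i][j + 1] - a[i][j - 1]     # mask x of fig. (5.12 A)
    gy = a[i + 1][j] - a[i - 1][j]     # mask y of fig. (5.12 B)
    magnitude = math.hypot(gx, gy)     # equation (5.8)
    direction = math.atan2(gy, gx)     # equation (5.9), quadrant-safe
    return magnitude, direction

# A vertical ramp: grey level increases by 3 per row, constant per column.
a = [[0, 0, 0],
     [3, 3, 3],
     [6, 6, 6]]
mag, theta = gradient(a, 1, 1)
print(mag, theta)                      # 6.0 and pi/2: gradient points down the rows
```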
5.4.2 Systolic Array Design for the Gradient Operator
The design consists of a double pipeline systolic array; pipe one
accommodates the x component gx of the gradient operator, and pipe two
accommodates the y component gy of the gradient operator.
This design can be implemented in a straightforward manner as shown in
fig. (5.13). The first cell in the array delays, then makes a duplicate of, the input
data, and pumps both copies into the array through both pipes. The other function of
this cell is to delay the input stream for the x component pipe, with a multistage shift
[Figure: host1 feeds a delay cell which duplicates the input; the upper pipe (cell 1, cell 2) computes gy, the lower pipe (cell 3, cell 4) computes gx; the root cell (cell 5) combines both pipes and sends the results to host2]
Figure (5.13) Double pipeline systolic array for the gradient operator.
register. The number of stages of the shift register is equal to n, where n is the
number of image columns. Each of the two pipes consists of k bifunctional cells
(where k is the length of the weight vector); here each of the two pipes consists of 2
bifunctional cells, and each cell contains a kernel element value. The cell design is
shown in fig. (4.8).
From equations (4.5) and (4.6), the number of stages of the shift register in
each cell is as follows:

cell 1   (Ak + n) stages  (it holds the first element of the y component, which
                           is the last element in the first kernel row; the
                           additional n stages are needed because the second
                           kernel row contains only zero values)
cell 2   A stages         (it holds the final element of the y component)
cell 3   (A + 1) stages   (it holds the first element of the x component)
cell 4   A stages         (it holds the final element of the x component)
The third part of the array consists of one multifunctional cell, to compute
equations (5.8) and (5.9). There is no requirement for a shift register in this cell, as
shown in fig. (5.14).
The input data are pumped into the two pipes by the host, through the delay
cell, in regular clock beats, through different channels. The data and the results are
pumped through both pipes; the final results for each component of the gradient
operator and the original data are collected from both pipes by the last cell in the
array (cell 5), which then sends the final results to the second host. The algorithm is
repeated for every input data for each cell concurrently. The final outputs and the
input data are collected by the second host.
[Figure: the root cell receives d, gx and gy from the two pipes and outputs the combined gradient magnitude G and direction]
Figure (5.14) Cell root (multifunctional cell) layout.
The basic design of a double pipeline systolic array for the gradient operator,
presented previously, can be modified in many ways. The two main reasons for
adjusting the design are:
1- To reduce the total number of stages of shift register in the cells.
2- To reduce the number of input data streams to one stream (instead of
two in the previous design).
We start by redrawing the old design with the shift registers of the cells
replaced by a shift-register bus. The bus-oriented systolic array system contains
one system bus, to which all the system's cells are interconnected. The redrawn
system design is illustrated in fig. (5.15).
The number of stages of the shift register in the bus is equal to the maximum
number of stages between any two cells in the old design. Of course, each cell can
receive the input data from the bus, where each cell is connected to the bus at a
certain stage number, such that the total number of shift register stages between the
two cells of each pipe is the same as in the old design.
Now, if we compare the bus-oriented systolic array with the previous design,
we notice a change in the total number of stages of the shift register in the whole
system.
[Figure: host1 feeds the system bus, which distributes the input data to cells 1-4; cells 1 and 2 form the gy pipe and cells 3 and 4 the gx pipe; the root cell (cell 5) combines both pipes and sends G and the direction to host2]
Figure (5.15) Bus-oriented systolic array for the gradient operator.
The total number of shift register stages in the previous design for all cells
(SR1) is:

SR1 = (A + 1) + A + (Ak + n) + A
    = 3A + Ak + n + 1                                   (5.10)

while the total number of shift register stages in the bus (SR2) is equal to the
shift register stages for both cells in the y component pipeline:

SR2 = (Ak + n) + A
    = A + Ak + n

Combining this equation with equation (4.6) yields

SR2 = 2A + 2n - k                                       (5.11a)

i.e. there are (2A + 1) fewer stages in the new design than in the old design. For the
kernel shown in fig. (5.12), where k = 1 and A = 3, the number of stages of the shift
register of the bus is:

SR2 = 2(n + 1) + A                                      (5.11b)
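The stage counts above can be checked by arithmetic. In the sketch below (illustrative Python), Ak is taken as A + n - k, an assumption chosen so that equation (5.11a) follows from SR2 = A + Ak + n; equation (4.6) itself belongs to chapter 4 and is not reproduced here:

```python
# Numerical check of equations (5.10)-(5.11b). Ak = A + n - k is assumed
# so that SR2 = A + Ak + n reduces to 2A + 2n - k as in (5.11a).
def check(A, n, k):
    Ak = A + n - k
    SR1 = (A + 1) + A + (Ak + n) + A        # cell-by-cell sum, eq. (5.10)
    SR2 = (Ak + n) + A                      # bus stages (y-component pipe)
    assert SR1 == 3 * A + Ak + n + 1        # collected form of (5.10)
    assert SR2 == 2 * A + 2 * n - k         # equation (5.11a)
    assert SR1 - SR2 == 2 * A + 1           # stages saved by the bus design
    if k == 1:
        assert SR2 == 2 * (n + 1) + A       # equation (5.11b)
    return SR1, SR2

print(check(A=3, n=128, k=1))
```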
Also we can see that the new design has only one input data stream, and there
is no need for the delay cell.
As shown in fig. (5.15), the image data is pumped into the system bus by the
first host, which then distributes the values to their corresponding cells, while the
partial results are pumped through both pipes exactly as in the previous design.
The main Occam code to run the array is as follows:
---PROC hostl
--·PROC celll (blfuncuonal cell)
-··PROC cell2 (bifuncuonal cell)
---PROC cell3 (bifuncuonal cell)
---PROC cell4 {b1funcuonal cell)
---PROC cellS (multfuncuonal cell)
---PROC host2
PROC grachentmam (CHAN-----·-)
SEQ
SEQ 1 = [ 0 FOR Image]
input data
PAR
host!
celll
cell3
cell4
cell2
cellS
host2
5.4.3 Transputer Network for Gradient Operator
The bus-oriented systolic system described in the previous section requires further modification before the array can be implemented on a network of transputers.
The transputer network is connected to one host only. The first transputer is connected to the host, so it receives all the input data and pumps it to the transputer network through the network bus. Each transputer accommodates part of the shift register bus, so all the parts of the system bus are inside the transputers themselves, instead of outside the cells as shown previously in fig. (5.15).
[Figure: transputers T0-T4, with T0 connected to the host filing system.]
Figure (5.16) Transputer network configuration for the gradient operator.
The final transputer on the network collects the x-component and y-component values of the gradient operator through both pipes, as shown in fig. (5.16), calculates the final output and sends it down the Inmos link to the host.
Transputers T0 and T3 form the y-component pipeline, while T1 and T2 form the x-component pipeline. The following is the parallel algorithm computed for each input data item on all the transputers:

T0   read input value ai from the host
     read gxi, gyi from the host
     calculate partial result of gyi
     send ai to T1 and T3
     send gxi to T1
     send gyi to T3

T1   read ai, gxi from T0
     calculate partial result of gxi
     send ai, gxi to T2

T2   read ai, gxi from T1
     calculate partial result of gxi
     send ai, gxi to T4

T3   read ai from T0
     read gyi from T0
     calculate partial result of gyi
     send gyi to T4

T4   read gyi from T3
     read gxi from T2
     calculate the final results
     send ai, di, gxi and gyi to the host
The Occam program for a network of transputers is given in Appendix C.
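The per-transputer steps above form a five-stage pipeline: T1 and T2 accumulate the x-component, T0 and T3 the y-component, and T4 combines both. The sketch below (Python standing in for Occam, with a simple two-term first-order difference kernel chosen purely for illustration) mirrors that message flow sequentially:

```python
import math

# Illustrative two-term weights for a first-order difference kernel;
# the actual gx/gy coefficients are those of equations (5.6b)/(5.7b).
GX_WEIGHTS = (-1.0, 1.0)   # applied by T1 and T2
GY_WEIGHTS = (-1.0, 1.0)   # applied by T0 and T3

def run_pipeline(x_pair, y_pair):
    """Mimic the T0-T4 message flow for one output pixel."""
    gy = GY_WEIGHTS[0] * y_pair[0]       # T0: first y-component term
    gx = GX_WEIGHTS[0] * x_pair[0]       # T1: first x-component term
    gx += GX_WEIGHTS[1] * x_pair[1]      # T2: second x-component term
    gy += GY_WEIGHTS[1] * y_pair[1]      # T3: second y-component term
    magnitude = math.hypot(gx, gy)       # T4: combine both partial results
    direction = math.atan2(gy, gx)
    return gx, gy, magnitude, direction

gx, gy, mag, _ = run_pipeline((10.0, 14.0), (10.0, 13.0))
print(gx, gy, mag)   # 4.0 3.0 5.0
```

On the real network each stage runs concurrently on its own transputer; here the stages are simply called in order.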
5.4.4 Prewitt and Sobel Operator Algorithms.
The Prewitt and Sobel operators are modified gradient operators. They are 3 x 3 gradient operators, as shown in fig. (5.17).
Equations (5.6b) and (5.7b) represent only the first-order difference operator. To accommodate the Prewitt operator, equations (5.6b) and (5.7b) can be written as:

Px(i,j) = [ a(i-1,j+1) + a(i,j+1) + a(i+1,j+1) ] -
          [ a(i-1,j-1) + a(i,j-1) + a(i+1,j-1) ]        (5.12)

and,

Py(i,j) = [ a(i+1,j-1) + a(i+1,j) + a(i+1,j+1) ] -
          [ a(i-1,j-1) + a(i-1,j) + a(i-1,j+1) ]        (5.13)
From equation (5.4), the magnitude of the Prewitt operator at a point is defined by

P = sqrt( Px^2 + Py^2 )        (5.14)
-1  0  1        -1 -1 -1
-1  0  1         0  0  0
-1  0  1         1  1  1
(A) mask x      (B) mask y
Prewitt operators

-1  0  1        -1 -2 -1
-2  0  2         0  0  0
-1  0  1         1  2  1
(C) mask x      (D) mask y
Sobel operators

Figure (5.17) Prewitt and Sobel operator masks.
For the direction of this point, equation (5.5) can then be written as:

θp = tan-1 ( Py / Px )        (5.15)

In the case of the Sobel operator, equations (5.6b) and (5.7b) can be written as:

Sx(i,j) = [ a(i-1,j+1) + 2a(i,j+1) + a(i+1,j+1) ] -
          [ a(i-1,j-1) + 2a(i,j-1) + a(i+1,j-1) ]        (5.16)

and,

Sy(i,j) = [ a(i+1,j-1) + 2a(i+1,j) + a(i+1,j+1) ] -
          [ a(i-1,j-1) + 2a(i-1,j) + a(i-1,j+1) ]        (5.17)

Again, from equation (5.4), the magnitude of the Sobel operator at a point is defined by

S = sqrt( Sx^2 + Sy^2 )        (5.18)

For the direction of this point, equation (5.5) can then be written as:

θs = tan-1 ( Sy / Sx )        (5.19)
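Equations (5.12)-(5.19) translate directly into code. A sequential sketch (Python; the image is a plain list of rows, only interior pixels are processed, and boundary handling is left unspecified, as in the text):

```python
import math

PREWITT_X = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
PREWITT_Y = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def apply_masks(a, i, j, mask_x, mask_y):
    """Magnitude and direction at pixel (i, j), per eqs (5.14)/(5.15)."""
    gx = sum(mask_x[k + 1][l + 1] * a[i + k][j + l]
             for k in (-1, 0, 1) for l in (-1, 0, 1))
    gy = sum(mask_y[k + 1][l + 1] * a[i + k][j + l]
             for k in (-1, 0, 1) for l in (-1, 0, 1))
    return math.hypot(gx, gy), math.atan2(gy, gx)

# A vertical step edge: the x-masks respond, the y-masks do not.
img = [[0, 0, 9, 9],
       [0, 0, 9, 9],
       [0, 0, 9, 9]]
mag, _ = apply_masks(img, 1, 1, PREWITT_X, PREWITT_Y)
print(mag)   # 27.0  (gx = 3*9, gy = 0)
```

Swapping in SOBEL_X and SOBEL_Y gives the Sobel response at the same pixel.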
5.4.5 Systolic Array Design for Prewitt and Sobel Operators.
The bus-oriented systolic system presented in the previous section can be modified to accommodate the Prewitt or Sobel operators. The design can be extended to any size of mask.
The design for the Prewitt operator still consists of a double pipeline systolic array. Pipe one is for the x component Px and pipe two is for the y component Py. Each of the two pipes consists of k bifunctional cells (k is the length of the mask); for the Prewitt operator, k=6. Both pipes are interconnected to a shift register bus, with each cell connected to the bus at a certain level to retain synchronisation for the accumulated output, as shown in fig. (5.18).
[Figure: two pipelines of six cells each (cell1-cell6 and cell7-cell12) plus cell13, connected to the system bus between host1 and host2.]
Figure (5.18) Bus-oriented systolic array for the Prewitt operator.
The total number of shift register stages (SR) of the bus, for any size of mask, is:

SR = the total number of stages of the shift register of all cells in one of the pipelines.

Then the SR for the y-component pipeline is,

SR = h1 ( A(h2 - 1) + Ak ) + 3A + n        (5.20a)

where h1 is the number of mask segments (of non-zero mask values) and h2 is the number of mask elements in each segment.
By substituting equations (4.6) and (4.6a) into the SR equation we obtain the following in the case of the Prewitt operator, where h1 = 2 and h2 = 3:
SR = 2(3A + n - 3) + 3A + n
SR = 9A + 3n - 6        (5.20b)

The SR for the x-component pipeline has the same value as the y-component pipeline.
The total number of cells in the system is 2k+1 (k is the number of kernel elements); in the case of the Prewitt operator, the total number of cells is 13.
The input data and the partial results are pumped through the system in the same way as in the bus-oriented systolic system for the gradient operator presented in the previous section.
The bus-oriented systolic system for the Sobel operator masks, shown in fig. (5.17 c and d), operates in a similar way to the bus-oriented systolic system for the Prewitt operator. The design itself is general and can be used for any size of mask and image.
5.4.6 Performance of the Gradient Operator Systolic Design on the
Transputer Network.
Experiments were performed on the systolic design (fig. 5.18) to find the effect on speedup and efficiency of increasing the network size for various image sizes. The timing results from the experiments are presented in table (5.3) and their graphical interpretations are shown in figures (5.19) and (5.20).
[Figure omitted: speedup vs. number of transputers for image sizes 16x16 to 256x256.]
Figure (5.19) Speedup graphs for gradient operator.
It can be seen from fig. (5.19) that the relationship between the speedup and the image size is nearly fixed for each transputer network size. This means that for larger image sizes the speedup is likely to be similar to that for an image of smaller size, but it increases sharply when the network size is at its maximum, i.e., 5 transputers. It is clear from the graphs in fig. (5.20) that the efficiency decreases when the size of the network is 4 transputers. This is due to load imbalance: the maximum efficiency is obtained when the load on each transputer is nearly the same. The maximum efficiency is obtained for the 5-transputer network (the total number of cells is 5). It can be concluded that better performance results can be achieved if the load is balanced.
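The speed-up and efficiency columns of table (5.3) follow from the usual definitions, speedup = T1/Tp and efficiency = speedup/p. A quick check against the 256 x 256, 5-transputer row:

```python
def speedup(t1, tp):
    # Relative speed-up: single-transputer time over p-transputer time.
    return t1 / tp

def efficiency(t1, tp, p):
    # Efficiency as a percentage of ideal linear speed-up.
    return 100.0 * speedup(t1, tp) / p

# 256 x 256 image: 17.957 s on 1 transputer, 3.692 s on 5 (table 5.3).
print(round(speedup(17.957, 3.692), 2))        # 4.86
print(round(efficiency(17.957, 3.692, 5), 2))  # 97.28
```

The last digit differs slightly from the table's 97.27 only through rounding of intermediate values.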
[Figure omitted: efficiency vs. number of transputers for image sizes 16x16 to 256x256.]
Figure (5.20) Efficiency graphs for gradient operator.
Image Size   Network Size   Time Lapse (seconds)   Relative Speed-up   Efficiency %
16 x 16           1                0.078                 1.00             100.00
                  3                0.031                 2.52              83.92
                  4                0.025                 3.13              78.23
                  5                0.016                 4.79              95.74
32 x 32           1                0.291                 1.00             100.00
                  3                0.115                 2.54              84.64
                  4                0.092                 3.15              78.74
                  5                0.060                 4.84              96.83
64 x 64           1                1.136                 1.00             100.00
                  3                0.447                 2.54              84.73
                  4                0.361                 3.15              78.78
                  5                0.234                 4.85              97.06
128 x 128         1                4.502                 1.00             100.00
                  3                1.770                 2.54              84.79
                  4                1.428                 3.15              78.83
                  5                0.927                 4.86              97.17
256 x 256         1               17.957                 1.00             100.00
                  3                7.330                 2.45              81.66
                  4                5.690                 3.16              78.89
                  5                3.692                 4.86              97.27

Table (5.3) Timing results for the gradient operator.
CHAPTER 6
LOW-LEVEL IMAGE PROCESSING AND FILTER SOFTWARE LIBRARY DEVELOPMENT
6.1 INTRODUCTION
In the following sections we consider techniques and designs for filtering digital images. This includes both smoothing and edge enhancement filters. Typical filters perform some form of moving window operation that may be a convolution or another local computation in the window.
Various modifications of the systolic design are analysed in this chapter; the design modifications are to handle each of the filter algorithms. The number of cells or processors in each filter design depends on the size of the kernel and the nature of the algorithms.
The implementation of a variety of digital image filter algorithms within the
Sequent Balance and the transputer network was achieved. The aim of this is to
design and build a programming workbench for developing image processing
operations for low-level vision. The motivation for the work is to develop a
methodology for the implementation of an image processing library on the Sequent
Balance. The key to the workbench is to hold a library of precoded software
components in a generalised configuration-independent style. This digital image
processing filters library is discussed in this chapter.
A brief definition of each algorithm is given at the head of each section in which it is discussed, together with a full description of the systolic array design. The results and efficiency obtained for each design on a transputer network are also given, to reflect the performance of the design at this level of vision.
6.2 PARALLEL IMPLEMENTATION OF THE SIGMA
FILTER.
6.2.1 Sigma Filter Algorithm
As explained in chapter 3, this filter smooths the image noise by averaging only those neighbouring pixels whose intensities lie within a fixed sigma (S) range of the centre pixel.
For a 3 x 3 window, each point in the input image x(i,j), as shown in fig. (6.1), is set equal to the average of all eight pixels in its neighbourhood whose values are within S counts of the value of x(i,j). S is an adjustable parameter and may be derived from sigma, the standard deviation of the pixel value distribution, or a specified nonnegative threshold.
[Figure: centre pixel x and its 8 neighbours.]
Figure (6.1) Sigma filter with 8 neighbourhood pixels in a 3 x 3 window.
From equation (3.8), we form the output y(i,j) according to the following criterion:

y(i,j) = Σ(k=-1 to 1) Σ(l=-1 to 1) r(i+k,j+l) w(i+k,j+l)        (6.1)

where k,l = -1, 0, 1 for window size 3 x 3, w(i+k,j+l) denotes the pixel value x(i+k,j+l), and

r(i+k,j+l) = 1   if |x(i+k,j+l) - x(i,j)| < S
r(i+k,j+l) = 0   otherwise        (6.2)

Then,

y(i,j) = r(i-1,j-1) w(i-1,j-1) + r(i-1,j) w(i-1,j) + r(i-1,j+1) w(i-1,j+1) +
         r(i,j-1) w(i,j-1) + r(i,j+1) w(i,j+1) + r(i+1,j-1) w(i+1,j-1) +
         r(i+1,j) w(i+1,j) + r(i+1,j+1) w(i+1,j+1)        (6.3)

The output should be divided by m, where m is the accumulated value of r.
As the window is passed over the entire image and the sigma filter operation is performed on each pixel, the 8 r elements in equation (6.3) are updated for each pixel x(i,j) from the 8 nearest neighbours. The updated r is given by equation (6.2), where r is either 1 or 0.
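A direct sequential reading of equations (6.1)-(6.3), as a sketch in Python (interior pixels only; the centre pixel is excluded from the average, as in the 8-neighbour description above, and the fallback when no neighbour qualifies is an implementation choice not specified in the text):

```python
def sigma_filter_pixel(x, i, j, s):
    """Average the 8 neighbours of x[i][j] whose values lie within s of it."""
    total, m = 0, 0
    for k in (-1, 0, 1):
        for l in (-1, 0, 1):
            if k == 0 and l == 0:
                continue                  # centre pixel is not averaged
            v = x[i + k][j + l]
            if abs(v - x[i][j]) < s:      # r(i+k,j+l) = 1, eq. (6.2)
                total += v
                m += 1
    return total / m if m else x[i][j]    # divide by m, the accumulated r

img = [[10, 10, 90],
       [10, 12, 90],
       [10, 10, 90]]
print(sigma_filter_pixel(img, 1, 1, 5))   # 10.0 (the three 90s are rejected)
```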
6.2.2 Systolic Array Design for the Sigma Filter
The bifunctional array design discussed in the previous chapter should be modified to accommodate the sigma filter, as shown in fig. (6.2). The array consists of the following cells:
a- A duplicate cell; the main function of this cell is to make a duplicate of the input data and pump both copies into the array through different channels. The other function of the cell is to delay one of the input streams with a multistage shift register. The number of stages of the shift register is equal to (n+14), where n is the number of image columns.
[Figure: duplicate cell, multifunctional cells A, B and L, and a division cell, between host1 and host2.]
Figure (6.2) Systolic array for the Sigma filter.
b- Multifunctional cell; the array consists of k2 multifunctional cells (the size of the window), each of which produces its partial result. The main function of these cells is to compute equations (6.2) and (6.3). There are three processes in each cell. One process must be designed to collect two values each time, one from the delay cell to the left and the other from the shift register, then compute equation (6.2) and calculate the value of r.
Another process is required to calculate a partial product of y(i,j), as shown in equation (6.3). Once this has been done, it communicates the results to another process which accumulates the partial products of the output y(i,j). The cell design is shown in fig. (6.3).
Each cell contains a multistage shift register, with three different numbers of shift register stages in the three different multifunctional cells. The number of stages in each cell is calculated using equations (4.5), (4.6) and (5.1) respectively.
[Figure: multifunctional cell with shift register stages R1...RA and its internal processes.]
Figure (6.3) Multifunctional cell design.
The Occam code running on this cell takes the form of fig. (6.3A). All processes inside the cell run in parallel for each input data item, so that communication between processes is overlapped with computation.
--- PROC delay
--- PROC pass.data
--- PROC multiply
--- PROC add
PROC cell.multifunctional (CHAN ..........)
  --- declaration of local channels
  PAR
    PAR j = [0 FOR n]
      delay
    PROC compare
      SEQ
        SEQ i = [0 FOR time]
          IF
            |x(i+k,j+l) - x(i,j)| < S
              r := 1
            TRUE
              r := 0
    multiply
    add :

Figure (6.3A) Procedure to run the multifunctional cell for the Sigma filter.
c- A division cell; this cell is a unifunctional cell. Its main task is to calculate the final value of the output y(i,j) by dividing the result by m.
As shown in fig. (6.2), the image data is pumped into the array by the first host. As the delay cell collects the input data, it duplicates this data, delays one set, as discussed before, and then sends them down to the multifunctional cells. Each multifunctional cell requires 3 input and 3 output channels for communication with its left and right neighbour cells. The data and results are pumped through the array and the accumulated partial products are collected by the final cell (the division cell), which then sends the final results to the second host. The main Occam code to run the array is as follows:
--- PROC host1
--- PROC delay
--- PROC cell A
--- PROC cell B
--- PROC cell L
--- PROC cell.div
--- PROC host2
PROC main.system (CHAN .........)
  SEQ
    SEQ i = [0 FOR image]
      input data
      PAR
        host1
        delay
        PAR i = [1 FOR 2]
          cell A
          cell B
        cell L
        cell B
        PAR i = [1 FOR 3]
          cell A
        cell.div
        host2 :
6.2.3 Transputer Network for Sigma Filter
A parallel code is designed to run on each transputer while all transputers in the network execute in parallel. For simplicity, each transputer is responsible for a cell of the systolic design shown in fig. (6.2). Basically, inside each transputer (except the first and the last), equation (6.2) is applied to update the partial result. The first transputer receives the input data, duplicates it and sends it to the neighbouring transputer. The main task of the final transputer in the network is to calculate the final value of the output.
In order to run on a network of transputers, the number of channels between neighbouring transputers has been reduced to 2 (an input channel and an output channel). Fig. (6.4) shows the parallel algorithm computed for each input data item on all transputers.
T0    receive xi from the host
      receive yi from the host
      make a copy of xi
      delay the xi copy
      send yi to T1
      send xi to T1
      send the xi copy to T1

Tj    (j = 1 to n-2)
      receive yi from Tj-1
      receive xi and the xi copy from Tj-1
      calculate partial result of yi
      send yi to Tj+1
      send xi and the xi copy to Tj+1

      (n is the number of transputers)

Tn-1  receive input value xi and the xi copy from Tn-2
      receive the partial result yi from Tn-2
      calculate the final result of yi
      send yi to the host
      send xi to the host

Figure (6.4) The parallel algorithm computed for each input data item on all the transputers.
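The structure of fig. (6.4) is a linear pipeline: T0 duplicates the input, each middle transputer folds one neighbour's contribution into the partial result via equation (6.2), and the last transputer performs the division by m. A sequential stand-in for that flow (Python; feeding the neighbour values one per stage is an illustrative simplification):

```python
def middle_stage(state, neighbour, s):
    """Tj: apply equation (6.2) for one neighbour, updating (sum, m)."""
    total, m, centre = state
    if abs(neighbour - centre) < s:   # r = 1 for this neighbour
        total += neighbour
        m += 1
    return total, m, centre

def last_stage(state):
    """Tn-1: the final division by m (fall back to the centre if m = 0)."""
    total, m, centre = state
    return total / m if m else centre

# Centre pixel 12, its 8 neighbours fed one per pipeline stage, S = 5.
state = (0, 0, 12)
for nb in (10, 10, 90, 10, 90, 10, 10, 90):
    state = middle_stage(state, nb, 5)
print(last_stage(state))   # 10.0
```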
6.2.4 Performance of the Sigma Filter Systolic Array on the
Transputer Network
Experiments were performed on the design shown in fig. (6.3) to find the effect on speed and efficiency of increasing the network size for various image sizes. The timing results for the experiments are presented in table (6.1) and their graphical interpretations are shown in figures (6.5) and (6.6).
A considerable amount of time is saved by solving the problem on a network of transputers, even for a small image. The speedup increases as the size of the transputer network increases; the speedup also increases as the size of the image increases, as shown in fig. (6.5). The overall results for this algorithm are very impressive. Table (6.1) indicates 'super' speedup for all sizes of image when the network size is 2 transputers. The efficiencies are also extremely high (over 100%, falling gradually to above 89%).
This extraordinary behaviour is explained by the fact that each T800 transputer in the network is connected to external memory, which is much slower than the on-chip RAM. When the program is executed on a single
Image Size   Network Size   Time Lapse (seconds)   Relative Speed-up   Efficiency %
16 x 16           1                0.224                 1.00             100.00
                  2                0.102                 2.20             110.21
                  6                0.042                 5.35              89.21
                 10                0.025                 9.08              90.84
32 x 32           1                0.842                 1.00             100.00
                  2                0.379                 2.22             111.01
                  6                0.156                 5.41              90.20
                 10                0.091                 9.21              92.10
64 x 64           1                3.291                 1.00             100.00
                  2                1.479                 2.23             111.28
                  6                0.606                 5.43              90.48
                 10                0.355                 9.25              92.53
128 x 128         1               13.035                 1.00             100.00
                  2                5.854                 2.23             111.33
                  6                2.405                 5.42              90.34
                 10                1.407                 9.27              92.65

Table (6.1) Timing results for the Sigma filter.
transputer, some of the data stored in the cells' shift registers has to be held in the external memory, so that extra time is required in accessing the slow external memory. When the same program is run on a network of 2 transputers, the amount of storage needed per transputer is halved and thus all the data can be stored in the fast on-chip RAM. The gain in speed from the on-chip RAM offsets the new overhead introduced by communication.
The graphs of fig. (6.6) show a drop in the efficiencies when the network size is 6. This is due to the fact that the load balancing of the system is poor for this kind of network configuration. For all sizes of image the efficiency graph increases as the network grows from 6 to 10 transputers. The maximum efficiency is obtained when there is the maximum number of transputers (10) in the network. The efficiency is then over 90%.
[Figure omitted: speedup vs. number of transputers for image sizes 16x16 to 128x128.]
Figure (6.5) Speedup graphs for Sigma filter.
[Figure omitted: efficiency vs. number of transputers for image sizes 16x16 to 128x128.]
Figure (6.6) Efficiency graphs for Sigma filter.
6.3 PARALLEL IMPLEMENTATION OF THE INVERSE
GRADIENT FILTER
6.3.1 Inverse Gradient Algorithm
This smoothing scheme is based on the observation that the variations of grey levels inside a region are smaller than those between regions.
For a 3 x 3 window, each point in the input image x(i,j), as shown in fig. (6.1), is set equal to the inverse gradient of all eight pixels x(k,l), for k,l = 1,2,3, in its neighbourhood.
From equation (3.7), we form the output y(i,j) as:

y(i,j) = 1/2 ( x(i,j) + Σ(k,l) r(k,l) )        (6.4)

where r(k,l) is the inverse of the absolute gradient, multiplied by the relevant neighbour pixel, and is given by:

r(k,l) = x(i+k,j+l) / | x(i,j) - x(i+k,j+l) |        (6.5)

From these two equations (6.4) and (6.5), the configuration for generating the output y(i,j) is as follows:

y(i,j) = 1/2 ( x(i,j) +
    x(i-1,j-1)/|x(i,j) - x(i-1,j-1)| + x(i-1,j)/|x(i,j) - x(i-1,j)| +
    x(i-1,j+1)/|x(i,j) - x(i-1,j+1)| + x(i,j-1)/|x(i,j) - x(i,j-1)| +
    x(i,j+1)/|x(i,j) - x(i,j+1)| + x(i+1,j-1)/|x(i,j) - x(i+1,j-1)| +
    x(i+1,j)/|x(i,j) - x(i+1,j)| + x(i+1,j+1)/|x(i,j) - x(i+1,j+1)| )        (6.6)

As the window is passed over the entire image, the inverse gradient filter operation is performed on each pixel.
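Read sequentially, equations (6.4)-(6.6) give the sketch below. The text does not say what happens when a neighbour equals the centre pixel (a zero gradient), so the epsilon fallback here is purely an implementation assumption:

```python
def inverse_gradient_pixel(x, i, j, eps=1e-6):
    """Equations (6.4)-(6.6): half the centre pixel plus the sum of each
    neighbour divided by its absolute gradient from the centre."""
    centre = x[i][j]
    acc = 0.0
    for k in (-1, 0, 1):
        for l in (-1, 0, 1):
            if k == 0 and l == 0:
                continue
            nb = x[i + k][j + l]
            grad = abs(centre - nb)
            acc += nb / (grad if grad else eps)   # r(k,l) of eq. (6.5)
    return 0.5 * (centre + acc)

img = [[ 8, 12, 11],
       [ 9, 10, 13],
       [ 7, 14,  6]]
print(round(inverse_gradient_pixel(img, 1, 1), 3))   # 25.833
```

As printed, the neighbour sum in (6.6) is unnormalised; published inverse gradient filters usually normalise the weights, but the sketch follows the equations as given.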
6.3.2 Systolic Array Implementation for the Inverse Gradient Filter
The systolic array design is more or less similar to the sigma filter systolic array discussed in section (6.2.2). The array consists of the following cells:
a- A duplicate cell; the main function of this cell is to make a duplicate of the input data.
b- Multifunctional cell; the array consists of 8 multifunctional cells, each of which produces its partial results. The main function of these cells is to compute a partial product of y(i,j), as in equation (6.5). The number of stages of the shift register of the multifunctional cell is as shown in section (6.2.2). The Occam code running on this cell takes the following form:
--- PROC delay
--- PROC pass.data
--- PROC add
PROC cell.multifunctional (CHAN ..........)
  --- declaration of local channels
  PAR
    pass
    PAR j = [0 FOR n]
      delay
    PROC compute
      SEQ
        SEQ i = [0 FOR time]
          r(k,l) = x(k,l) / |x(i,j) - x(k,l)|
    add :
c- An addition cell; this cell is a unifunctional cell. Its main task is to calculate the final value of the output y(i,j) by applying equation (6.4).
The system behaves in a similar way to the systolic array for the sigma filter described in section (6.2.2).
Each multifunctional cell requires 3 input and 3 output channels for
communication with the left and right neighbouring cells.
6.3.3 Transputer Network for the Inverse Gradient Filter
This is similar to the transputer network for the sigma filter, described in section (6.2.3). Basically, inside each transputer (except the first and the last), equation (6.6) is applied to update the partial results. The first transputer receives the input data, duplicates it and sends it to the next neighbouring transputer, while the main task of the last transputer in the network is to calculate the final values of the output by applying equation (6.4) and to send them down the Inmos link to the host. The total number of transputers in the network is 10.
The parallel algorithm computed for each input data item on all the transputers is shown in fig. (6.4).
6.3.4 Performance of the Inverse Gradient Filter Systolic Array on
the Transputer Network
The inverse gradient systolic design was applied to the transputer network for various network sizes and image sizes. The network consisted of T800 transputers. The number of system cells in each transputer should be as equal as possible, to ensure load balancing.
Table (6.2) shows the timing results of the algorithm. An analysis of the results shows that the system performance is improved by increasing the number of computing transputers. The performance is also improved by increasing the image size. The super speedup and the very high efficiency shown in the table are due to the reason explained in section (6.2.4). Fig. (6.7) shows an increasing value of the speedup as the network size increases. The graph for a 256 x 256 image size shows
Image Size   Network Size   Time Lapse (seconds)   Relative Speed-up   Efficiency %
16 x 16           1                0.183                 1.00             100.00
                  2                0.085                 2.15             107.58
                  6                0.036                 5.08              84.63
                 10                0.021                 8.68              86.79
32 x 32           1                0.687                 1.00             100.00
                  2                0.317                 2.17             108.48
                  6                0.134                 5.13              85.50
                 10                0.078                 8.81              88.08
64 x 64           1                2.683                 1.00             100.00
                  2                1.234                 2.17             108.74
                  6                0.523                 5.14              85.68
                 10                0.303                 8.84              88.42
128 x 128         1               10.628                 1.00             100.00
                  2                4.884                 2.18             108.81
                  6                2.066                 5.14              85.72
                 10                1.200                 8.85              88.54
256 x 256         1               42.324                 1.00             100.00
                  2               19.461                 2.17             108.74
                  6                8.280                 5.11              85.20
                 10                4.782                 8.85              88.51

Table (6.2) Timing results for the inverse gradient filter.
a decrease in the speedup when the network size is 6 and then a rapid increase when the network is increased further; the graphs in fig. (6.8) also show a decrease in efficiency when the network size is 6 for all sizes of image. The reason for this is that the load balance of the system is poor: the number of cells in the system is 10, and the distribution of cells on the transputer network is unbalanced, as explained in section (6.2.4). For all sizes of image the efficiency graph increases from 7 to 10 transputers. The maximum efficiency is obtained when there are 10 transputers in the network, which is the maximum number of transputers we can use for this algorithm.
[Figure omitted: speedup vs. number of transputers for image sizes 16x16 to 256x256.]
Figure (6.7) Speedup graphs for inverse gradient filter.
[Figure omitted: efficiency vs. number of transputers for image sizes 16x16 to 256x256.]
Figure (6.8) Efficiency graphs for inverse gradient filter.
6.4 PARALLEL IMPLEMENTATION OF THE MEAN
AND WEIGHTED MEAN FILTERS
6.4.1 Mean Filter Algorithms
The mean filter is a straightforward spatial-domain technique for image smoothing. Given an M x M image I, the procedure is to generate a smoothed image whose grey level at point (i,j) is obtained by averaging the grey-level values of the pixels of I contained in a predefined neighbourhood of (i,j), as shown in chapter 3.
The size and shape of the window over which the mean is computed can be
selected. For a 3 x 3 window, the filter weights are shown in fig. (6.9).
1/9 1/9 1/9
1/9 1/9 1/9
1/9 1/9 1/9
(A) square window
1/5
1/5 1/5 1/5
1/5
(B) plus shape window
Figure (6.9) Two mean filter masks.
If a 3 x 3 mean filter is used, by applying the plus shaped window shown in fig. (6.9B) to equation (5.2a), we note that the former equation is a special case of the latter with wi = 1/5. From equation (5.2a), the output pixel is obtained by the relation,

y(i,j) = 1/5 [ x(i-1,j) + x(i,j-1) + x(i,j) + x(i,j+1) + x(i+1,j) ]        (6.7)

Also, if we apply the square window shown in fig. (6.9A) to equation (5.3b), the output pixel is obtained by the relation,

y(i,j) = 1/9 [ x(i-1,j-1) + x(i-1,j) + x(i-1,j+1) + x(i,j-1) + x(i,j)
             + x(i,j+1) + x(i+1,j-1) + x(i+1,j) + x(i+1,j+1) ]        (6.8)
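Both windows are plain averages over the masks of fig. (6.9), so a sketch is short (Python, interior pixels only):

```python
PLUS_OFFSETS = [(-1, 0), (0, -1), (0, 0), (0, 1), (1, 0)]

def mean_plus(x, i, j):
    # Equation (6.7): plus shaped window, w = 1/5.
    return sum(x[i + k][j + l] for k, l in PLUS_OFFSETS) / 5.0

def mean_square(x, i, j):
    # Equation (6.8): square window, w = 1/9.
    return sum(x[i + k][j + l]
               for k in (-1, 0, 1) for l in (-1, 0, 1)) / 9.0

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
print(mean_plus(img, 1, 1))    # 5.0  ((2+4+5+6+8)/5)
print(mean_square(img, 1, 1))  # 5.0  (45/9)
```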
6.4.2 Weighted Mean Filter Algorithm
This approach is similar to the mean filter algorithm described in the previous section (6.4.1). The difference is that, in this case, a weighted mean filter is used in which the weight for a pixel is related to its distance from the centre point. For a 3 x 3 window, the filter weights are shown in fig. (6.10). The neighbours that lie closer to the point (i,j) are weighted more heavily than the others.
1/16 1/8 1/16
1/8 1/4 1/8
1/16 1/8 1/16
(A) square window
1/6
1/6 1/3 1/6
1/6
(B) plus shape window
Figure (6.10) Two weighted mean filter masks.
By applying the plus shaped window shown in fig. (6.10B) to equation (5.2a), the output pixel is given by the following equation:

y(i,j) = 1/3 x(i,j) + 1/6 [ x(i-1,j) + x(i,j-1) + x(i,j+1) + x(i+1,j) ]        (6.9)

Similarly, by applying the square window shown in fig. (6.10A) to equation (5.3b), the output pixel is given by the following equation:

y(i,j) = 1/4 x(i,j) + 1/8 [ x(i-1,j) + x(i,j-1) + x(i,j+1) + x(i+1,j) ]
       + 1/16 [ x(i-1,j-1) + x(i-1,j+1) + x(i+1,j-1) + x(i+1,j+1) ]        (6.10)
6.4.3 Systolic Design for Mean Filter
The systolic design for a mean filter is similar to the Laplacian operator designs shown in figs. (5.4) and (5.6). The difference in this case is the value of the kernel weights.
For the plus shaped mean filter, we note by comparing equations (5.2) and (6.7) that the value of the weights of the latter is fixed, with wi = 1/5. If we replace the value of wi in each cell of fig. (5.4) with the new value wi = 1/5, then we have a systolic design for the plus shaped mean filter.
In a similar way a systolic design for the square mean filter can be implemented. We note by comparing equations (5.3) and (6.8) that the value of the weights of the latter is also fixed, with wi = 1/9. By replacing the value of wi in each cell of fig. (5.6) by the new value wi = 1/9, we obtain the new design.
A systolic array design for weighted mean filters can be implemented in a similar way to that in which the mean filters were implemented, as shown above. The value of wi in each cell of figs. (5.4) and (5.6) is replaced by the new values shown in equations (6.9) and (6.10) respectively.
An alternative design for the mean filter is implemented as shown in fig. (6.11). The new design, the unifunctional array, is similar to the previous ones; the number of square cells is the same as before, depending on the size of the mask. We modify the cells by reducing the number of operations inside each cell, so that each performs addition operations only, instead of a multiplication followed by an addition, with the design of the cell shown in fig. (6.12a). We need an additional round cell (at the extreme right end of the array); it performs a multiplication only, to compute the final values of the components of the filter. The design is outlined in
[Figure: host1, square cells 1 to n, a round cell, and host2.]
Figure (6.11) Systolic array for the mean filter.
fig. (6.12b). The image data is pumped into the array by the first host and accumulated as partial products P(i,j) by the square cells. The output at the last square cell for the square window is as follows:

P(i,j) = [ x(i-1,j-1) + x(i-1,j) + x(i-1,j+1) + x(i,j-1) + x(i,j)
         + x(i,j+1) + x(i+1,j-1) + x(i+1,j) + x(i+1,j+1) ]

For the plus shaped window, P(i,j) at the last square cell is as follows:

P(i,j) = [ x(i-1,j) + x(i,j-1) + x(i,j) + x(i,j+1) + x(i+1,j) ]

In the round cell, the final value y(i,j) is computed as

y(i,j) = w P(i,j)

where w is the filter weight. The final results are collected by the second host. The algorithm is repeated on every input data item by all the cells working concurrently.
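The division of labour just described, square cells that only add and a single round cell that multiplies by the weight, can be sketched as follows (the cell functions stand in for the designs of fig. (6.12), using the square 3 x 3 window with w = 1/9):

```python
SQUARE_OFFSETS = [(k, l) for k in (-1, 0, 1) for l in (-1, 0, 1)]

def square_cell(partial, value):
    # Square cell of fig. (6.12a): addition only.
    return partial + value

def round_cell(partial, w):
    # Round cell of fig. (6.12b): a single multiplication by the weight.
    return w * partial

def unifunctional_mean(x, i, j):
    p = 0
    for k, l in SQUARE_OFFSETS:        # one square cell per window element
        p = square_cell(p, x[i + k][j + l])
    return round_cell(p, 1.0 / 9.0)    # w = 1/9 for the square mean filter

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
print(round(unifunctional_mean(img, 1, 1), 6))   # 5.0
```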
We bear in mind that the number of stages of the shift register in the square cells is the same as that of the cells of the previous array.
When comparing the two designs, the old design and the unifunctional array design, each of them has advantages, as follows:
1- The number of cells in the unifunctional array is one more than the number of cells of the previous design.
2- The time cycle for the unifunctional array is larger by one time unit than that for the old array, so the output will be delayed by one time cycle. This is caused by the extra cell.
3- The computation time for each processor of the unifunctional array is less than that for the processor of the previous array. This is because in the former each processor carries out accumulation only (and the last cell only a multiplication), whereas in the latter each processor carries out both additions and multiplications.
[Figure: (a) square cell with adder and shift register; (b) round cell with multiplier.]
Figure (6.12) Cell designs for the array shown in fig. (6.11).
6.4.4 Transputer Network for the Mean and Weighted Mean Filter
These are similar to the transputer networks for both the plus shaped and square Laplacian operators, described in section (5.3.2). The difference in this case is the values of the kernel weights.
The systolic design for the plus shaped mean and weighted mean filters is implemented by replacing the value of wi in each transputer of fig. (5.7) with the new values. In a similar way, transputer networks for the square mean and weighted mean filters can be implemented by replacing the value of wi in each transputer of fig. (5.7) with the new values shown in figures (6.9A) and (6.10A).
6.4.5 Performance of the Mean and Weighted Mean Filter Systolic
Designs on the Transputer Networks
Experiments similar to those performed for the Laplacian operator were
carried out to measure the algorithms performance on a network of T800
transputers. Table (6.3) shows the timing results and associated speedup and
efficiency for the plus shaped main and plus shaped weighted mean filter systolic
design. The method was executed on a transputer network of varying sizes.
A good speedup was obtained for the 256 x 256 image size on a 5 transputers
networks. This result is quite useful, since the need for a parallel system is more
vital for larger images, where processing time is relatively high. Even when smaller
sizes of images are solved on a 5 transputer network, a speedup as higher as ( 4.32)
is obtained as shown m fig. (6.13). The graph for a 256 x 256Image size shows a
decrease in the speedup when the network size is 2. This situation is due to using
an external memory which is slower than on-chip RAM. When the design is
implemented on a network of size 2, the total size of the shift register buffer is
higher than on-chip RAM. The communication overheads of solving the problem
on a network are offset by the overheads of accessing the secondary memory.
Fig. (6.14) shows efficiencies of over 83% for a 2 transputer network, over
79% for a 3 transputer network and over 86% for a 5 transputer network. The major
reason for the drop in efficiency for the 3 transputer network is load imbalance:
some of the transputers contain two cells while one transputer has only one
cell. This unbalanced load increases the elapsed time.
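The speedup and efficiency figures quoted here and tabulated in tables (6.3) and (6.4) follow directly from the elapsed times; a minimal sketch of the definitions used:

```python
# Speedup is the single-transputer time divided by the network time;
# efficiency (in %) is the speedup divided by the number of transputers.

def speedup(t1, tp):
    return t1 / tp

def efficiency(t1, tp, p):
    return 100.0 * speedup(t1, tp) / p

if __name__ == "__main__":
    # 256 x 256 plus shaped mean filter timings from table (6.3).
    t1, t5 = 12.012, 2.550
    print(round(speedup(t1, t5), 2))        # 4.71
    print(round(efficiency(t1, t5, 5), 1))  # approx 94.2, as in table (6.3)
```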
Figure (6.13) Speedup graphs for the plus shaped mean and weighted mean filters.
Figure (6.14) Efficiency graphs for the plus shaped mean and weighted mean
filters.
Image Size    Network Size    Time Lapse (seconds)    Relative Speed-up    Efficiency %

16 x 16            1                0.048                   1.00              100.00
                   2                0.029                   1.66               83.07
                   3                0.020                   2.38               79.47
                   5                0.011                   4.32               86.45

32 x 32            1                0.180                   1.00              100.00
                   2                0.108                   1.67               83.45
                   3                0.075                   2.41               80.30
                   5                0.042                   4.33               86.56

64 x 64            1                0.716                   1.00              100.00
                   2                0.420                   1.70               85.23
                   3                0.291                   2.46               82.09
                   5                0.162                   4.43               88.64

128 x 128          1                2.989                   1.00              100.00
                   2                1.663                   1.797              89.85
                   3                1.151                   2.60               86.58
                   5                0.639                   4.68               93.54

256 x 256          1                12.012                  1.00              100.00
                   2                6.859                   1.75               87.56
                   3                4.601                   2.61               87.03
                   5                2.550                   4.71               94.22

Table (6.3) Timing results for the plus shaped mean and weighted mean filters.
The overall results for the square mean and square weighted mean algorithms
are very impressive. Table (6.4) shows that the speedups and related efficiencies
are high for all sizes of images when the network contains 3 transputers. This
is due to the external memory of the T800 transputer, as explained in section
(6.2.4). The speedup and efficiency graphs are presented in figures (6.15) and
(6.16) respectively.
The graphs show good speedups and efficiencies for the various sizes of image
and various sizes of transputer network.
For any size of image, there is less gain when the network is 9 transputers;
the gain increases as the network size decreases. The reason for the loss in gain is
the proportion of time spent on communication overhead, especially when
the size of the network is relatively high. The communication overhead in this system
increases as the size of the image increases.
Image Size    Network Size    Time Lapse (seconds)    Relative Speed-up    Efficiency %

16 x 16            1                0.093                   1.00              100.00
                   3                0.029                   3.2               106.67
                   5                0.020                   4.59               91.86
                   9                0.011                   8.23               91.40

32 x 32            1                0.348                   1.00              100.00
                   3                0.107                   3.23              107.79
                   5                0.074                   4.47               93.07
                   9                0.0416                  8.38               93.15

64 x 64            1                1.361                   1.00              100.00
                   3                0.420                   3.24              108.11
                   5                0.291                   4.66               93.35
                   9                0.162                   8.43               93.68

128 x 128          1                5.389                   1.00              100.00
                   3                1.660                   3.25              108.12
                   5                1.154                   4.67               93.42
                   9                0.638                   8.43               93.68

256 x 256          1                21.477                  1.00              100.00
                   3                6.649                   3.23              107.66
                   5                4.597                   4.67               93.44
                   9                2.543                   8.45               93.84

Table (6.4) Timing results for the square mean and weighted mean filters.
Figure (6.15) Speedup graphs for square mean and weighted mean filters.
Figure (6.16) Efficiency graphs for square mean and weighted mean filters.
6.5 AN ENVIRONMENT FOR DEVELOPING LOW
LEVEL IMAGE PROCESSING ON PARALLEL
COMPUTERS
6.5.1 Introduction
Developing image processing software systems can be a very time-consuming
process, since it involves a significant amount of experimentation with various
algorithms. Typically there will be several different algorithms available for the
same operation, and the programmer must choose the one which performs best in
that particular environment. For instance, if the required operation is to extract the
edges of an image, then the best way of doing this will depend on how clear the
edges are, on lighting conditions, and on other factors. Thus the programmer needs
to experiment interactively with different algorithms. Even once an algorithm has
been selected, there is a lot of scope for setting parameter values experimentally: for
instance, when thresholding an image, the best threshold value can often only be
chosen by trial and error. In providing a programming workbench for image
processing in which this experimentation can take place conveniently, we can
identify three main requirements:
1- To develop a library of image processing operations, coded to be
parameterised and scaleable.
2- To provide as simple a programming model as possible for users to add
new software components to the library.
3- Many image processing algorithms are computationally intensive, and
therefore significant processing power is required to allow the experimentation
process to be performed quickly. The faster the actual image processing can be
made, the better. One obvious solution to this is to use a parallel processing
machine.
The main aim of the research in this section was to design and build a
programming workbench for developing a low-level image processing software
library on a parallel computer. This has been designed to meet the above requirements.
The PARC-IPL (Parallel Algorithms Research Centre - Image Processing Library)
workbench runs on a workstation which front-ends the Sequent Balance 8000
system. The workbench was simulated on the Balance using the Occam high level
language.
The user can control the execution of the program from the workbench. At the
heart of PARC-IPL is a library of image processing routines and algorithms which
are coded in a generalized format which is not specific to one particular
configuration, either in size or topology.
Our implementation strategy has been to develop the implementation in two
stages. As a first step, a single processor implementation has been produced. This
is now operational, and users can execute programs written in Occam, inputting
images from file. The second implementation stage is to invoke the existing library
on a multiprocessor such as a transputer network, using the same Occam code.
Unfortunately, the second stage has not been finished due to special circumstances.
The workbench provides many facilities, however, which are equally useful to the
developer of image processing software on parallel computers.
In the following section we present some previous research which is related to
our current work. This is followed by an introduction to the workbench. An outline
of the implementation strategy used is also given. This is followed by an outline of
the contents of the library. Finally some possible modifications to the workbench
are presented.
6.5.2 Background
There is at present much interest in implementing image processing
applications on parallel computers, because of the high processing speeds which
can be achieved by parallel processors such as transputer networks.
At Queen's University, Belfast, the computer department has designed and
built a programming workbench for developing image processing software systems
on transputer networks [Crookes et al., 1990 and 1991]. A high-level programming
language for image processing, called IAL (Image Algebra Language), has also been
developed for this purpose.
The main aim of the Queen's University workbench implementation is to partition
the image and distribute one section of the image per transputer. Each transputer
then processes its own section independently and in parallel, i.e. "image
parallelism". However, since many image processing operations operate on
neighbourhoods, problems occur at section boundaries, where a neighbourhood is
physically split across two transputers [Crookes 1990]. They overcome this
problem by adding a special program to the library.
The main aim of implementing our algorithms in the (PARC-IPL) system is
"task parallelism" using systolic designs, as shown in chapters 4 and 5 and the
previous sections. In these systems there is no need to partition the image.
6.5.3 The Workbench
In this section we will present details of the design and construction of the
programming workbench, and describe each of the main components of the
workbench.
The overall workbench consists of three main parts. At the back end of the
workbench is a library of low-level image processing operations, running on a
Sequent Balance 8000 system.
Front-end workstations are connected to the Balance. The workstations
send the programs to the Balance as pseudo-code instructions. Between the user
workstations and the library components is a server which receives the p-code
instructions and makes (remote) calls upon the library. This is illustrated in fig.
(6.17).
Figure (6.17) Workbench overview: the user interface on the workstation
sends commands through the controller to the image processing library on the
Balance 8000.
The workbench can be used at the following two levels:
1- Building a low-level image processing system from previously defined
algorithms held in the library.
2- Implementing new designs for other algorithms and adding them to the
library, also enabling the user interface to incorporate these new components.
6.5.3.1 Software Structure
The library routines which are called by the controller are executed by the
Balance. To make the coding of the library components independent of the underlying
hardware, a three-layered software model was adopted, as shown in fig. (6.18).
These three layers are:
1- The actual library routines themselves (coded as Occam procedures as
shown in the previous sections).
2- A command distributor layer: as some commands from the controller to the
systolic design have to be broadcast across all cells in the array, such a layer
is needed. If the user wishes to apply a one-dimensional or two-dimensional
convolution, then this layer controls the number of cells needed for these routines
(i.e. dependent on the size of the kernel).
3- An interpreter layer, to convert the commands from the controller into actual
calls to the library routines, including the passing of parameters and other
housekeeping.
Figure (6.18) Layered software model (interpreter, distributor, library).
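The three layers can be sketched as plain functions. This is a Python stand-in, not the thesis's Occam code, and the routine names and command fields used below are illustrative assumptions:

```python
# A minimal sketch of the three-layer model: an interpreter turns each
# p-code command into a library call, and a distributor works out how many
# systolic cells the routine needs (here, one cell per kernel weight).

def cells_needed(rows, cols):
    return rows * cols

def distributor(command):
    n = cells_needed(command["kernel_rows"], command["kernel_cols"])
    return {"routine": command["routine"], "cells": n,
            "params": command["params"]}

def interpreter(pcode, library):
    """Convert each p-code instruction into an actual library call."""
    results = []
    for command in pcode:
        plan = distributor(command)
        results.append(library[plan["routine"]](plan))
    return results

if __name__ == "__main__":
    library = {"convolve2d": lambda plan: f"convolve2d on {plan['cells']} cells"}
    pcode = [{"routine": "convolve2d", "kernel_rows": 3, "kernel_cols": 3,
              "params": {"rows": 256, "cols": 256}}]
    print(interpreter(pcode, library))   # ['convolve2d on 9 cells']
```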
6.5.3.2 The User Interface
The workbench supports programs which can be represented as a set of
menus.
Programs are built by selecting algorithms from a menu of currently available
library operations. Fig. (6.19) shows the first page of the screen as soon as the user
logs in to the program.
The workbench provides a menu editor for selecting the routines, and an
execution environment which provides algorithm selection and parameter settings.
It is easy and quick to flick between the editing and execution modes.
This is an OCCAM Program Library Low-Level Image Processing
Filter library
N.B. To exit from the system enter 99
Have your input data at file name
[ image.input]
Figure (6.19) First screen of the menu program.
To use the library, the user first selects appropriate operations from the
library, by selecting the appropriate number from the menu bar, each number
representing a particular algorithm. Once a filter is chosen, the editor shows a few
boxes asking for the parameters of the image, such as the number of columns or the
size of the kernel (if the user chooses a convolution operation). If the editing is
successful then the execution mode of the workbench is entered. An example of a
typical program is shown in fig. (6.20), which shows a simple program for the
plus shaped Laplacian operator.
Type the filter's number = 2
Plus Shaped Laplacian operator No: 2
Give the total number of columns (min. 4)
= 256 Give the total number of rows
(min. 3)
= 256
Figure (6.20) Typical screen presentation for the environment.
6.5.3.3 Execution Mode
Once the editor program has completed, the programmer can then enter the
execution mode, from which the program can be run. It is in the execution mode
that algorithm selection and parameter setting take place.
Once the program run is completed, the name of the output file appears
on the screen. It may become apparent that the settings need to be changed, or the
user may wish to try the same settings with a different algorithm. These alterations
can be performed simply by answering a question, entering the new setting, and
running the same algorithm again (or a different algorithm). The final two pages of
the editor screen are shown in fig. (6.21).
The program is running
Wait for the output
Your output data filename [ image2.out]
Do you want to choose another filter
If yes type 1
If no type 0
Figure (6.21) Final screen presentation of the program.
6.5.4 Implementation
When the user's command is to execute the program, a set of p-code
instructions is produced. These instructions are passed to the controller in the
Balance, where the interpretation stage starts. The controller repeatedly fetches and
executes individual instructions. In most cases, when the instruction is a call to a
library operation, execution merely involves passing the instruction and its
parameters as a command to the systolic array. The cells in the array act in parallel,
as shown in the previous chapters and sections, using the parameters which come
from the controller via the interpreter and the distributor, as shown in fig. (6.22).
All images are held on the systolic array host, which pumps them through the array.
The Occam compiler automatically distributes the algorithms and the parameters
over the array cells. This means that the programmer does not need to be concerned
with the underlying parallelism of the design but can concentrate on the image
processing aspects. The underlying parallelism is effectively hidden from the user
while at the same time being efficiently exploited.
Figure (6.22) System operation: algorithms and parameters pass from the
workstation through the interpreter and distributor to the controller of the
systolic system.
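The host-pumping behaviour described above can be modelled functionally. The sketch below is a sequential Python toy, not the Occam implementation: each "cell" holds one kernel weight and accumulates a partial sum as pixels stream past, which is the essence of the systolic designs of the previous chapters.

```python
# Toy model of executing one library call: the host pumps a neighbourhood's
# pixels through a chain of cells, one kernel weight per cell.

def make_cell(weight):
    """A cell multiplies the passing pixel by its weight and adds the
    partial sum arriving from the previous cell."""
    def cell(pixel, partial):
        return partial + weight * pixel
    return cell

def pump(window, weights):
    """Pump one window of pixels through the cell chain."""
    cells = [make_cell(w) for w in weights]
    partial = 0
    for pixel, cell in zip(window, cells):
        partial = cell(pixel, partial)
    return partial

if __name__ == "__main__":
    # Plus shaped Laplacian weights: centre 4, four neighbours -1.
    weights = [-1, -1, 4, -1, -1]
    print(pump([5, 5, 6, 5, 5], weights))   # 4: response at an intensity spike
    print(pump([7, 7, 7, 7, 7], weights))   # 0: uniform region
```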
6.5.5 Workbench Facilities
The workbench provides a good mechanism for developing low-level image
processing systems on parallel computers without the need for much actual
programming on the user's part. It has been designed with the following factors in mind:
1- Ease of use; it is essential that the workbench is easy to use. The
workbench allows the user to select interactively which algorithm is to be used,
and to reselect a different algorithm for the same image or for a different image.
2- Flexibility; the library components are also set up so that certain of their
parameters can be 'tailored' or modified by the user. For instance, the
threshold value in the sigma filter is precoded in a way which enables the user to
supply a new threshold value interactively.
3- Hidden parallelism; to shield the user from the complexities of explicitly
controlling parallelism, it was desirable to hide as much of the underlying
parallelism and communication as possible.
4- Extendibility; for the workbench to be of significant use, mechanisms must
be provided to allow the library to be expanded by adding new algorithms.
6.5.6 Image Input, Output and Data Types
The workbench provides facilities for input and output of images. A basic
operation for image input and output is provided.
* Image input and output
A set of standard routines is provided, giving the user the ability to:
1- read an image from a particular file;
[image.input]
2- read an image from a particular area;
3- write an image to a particular file;
[write3.out]
4- write an image to a particular area.
* Data types
In the workbench there are three classes of data types: images, kernels and
scalars. Each of these is now outlined briefly.
Images: An image is a one-dimensional or two-dimensional structure. The
image size can be declared by declaring the number of columns and rows of the
image. If the number of columns is 256 and the number of rows is 256 then this
declares two images of size 256 x 256.
Kernels: Similarly, kernels are also 1D or 2D structures. For the invariant
kernel, the size of a kernel can be declared by declaring the number of columns and
rows of the kernel.
Thus, a user specifies the weights associated with a kernel simply by
enumerating their values in a one dimensional series.
Scalars: The declaration of scalar variables takes a similar form to that found
in Occam.
6.5.7 Types of Kernel
A kernel consists of two parts: a configuration, which defines the
neighbourhood over which it is defined, and the weights associated with each element
of the configuration neighbourhood. In addition, there are two types of kernel:
1- invariant kernels: the weights are invariant with the location in the image and
these kernels correspond to the usual concept of masks, and,
2- variant kernels: the weights can vary as a function of image position.
Both types of kernel are included in the workbench. For most of the library
filters, the kernels are already given inside the algorithms. Except for the convolution
operations, a kernel is defined by giving its values in an initial value definition
section. For instance, the square Laplacian kernel is defined in the square Laplacian
algorithm, and we can choose that algorithm directly from the menu. Alternatively,
the user can choose the 2D convolution operation, in which case the square
Laplacian kernel would be defined in the workbench as:

    -1  -1  -1
    -1   8  -1
    -1  -1  -1

The user can choose any size of kernel for the 1D and 2D convolution
operations.
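For example, supplying the square Laplacian to the 2D convolution operation amounts to enumerating its nine weights in a one-dimensional series and reshaping them by the declared kernel size. The sketch below is a Python illustration with hypothetical function names, not the workbench's Occam routines:

```python
# User-defined kernel: weights enumerated as a 1D series, reshaped using
# the declared kernel rows and columns, then applied at one pixel.

def reshape(weights, rows, cols):
    assert len(weights) == rows * cols
    return [weights[r * cols:(r + 1) * cols] for r in range(rows)]

def apply_at(image, kernel, y, x):
    """Apply the kernel centred on pixel (y, x); borders are not handled."""
    kh, kw = len(kernel), len(kernel[0])
    return sum(kernel[j][i] * image[y - kh // 2 + j][x - kw // 2 + i]
               for j in range(kh) for i in range(kw))

if __name__ == "__main__":
    laplacian = reshape([-1, -1, -1, -1, 8, -1, -1, -1, -1], 3, 3)
    flat = [[9] * 3 for _ in range(3)]
    spike = [[9, 9, 9], [9, 10, 9], [9, 9, 9]]
    print(apply_at(flat, laplacian, 1, 1))    # 0: no response on a flat patch
    print(apply_at(spike, laplacian, 1, 1))   # 8: responds to the spike
```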
6.5.8 Content of Library
The following is the range of operations provided in the library as it currently
stands:
1- Neighbourhood filters:
Various low pass, high pass, Laplacian and sigma filters.
2- Edge extraction:
Sobel, Prewitt and gradient operators.
3- Image algebra:
One-dimensional convolution,
Two-dimensional convolution.
The user can define the window configuration and weights.
6.5.9 Extending the Environment
The workbench has been constructed in such a way as to allow it to be
extended by a user with relative ease. If a new algorithm is to be added to the
environment then obviously a modification must be made to the workbench by
adding the new algorithm to the library.
If a new algorithm is to be added to the library, then the programmer must
adhere to the same conventions which have been used in the existing library
components, including the following:
1- The procedure heading must include the same set of system parameters and the
same structure of command packet as do the existing operations;
2- The algorithm code must be written in Occam;
3- The user interface must be modified to allow the new algorithm to be
selected when using the editor.
In addition to adding the new algorithm to the library, a menu description file
must also be created for the algorithm to be controlled from the interface. This
includes acquiring a unique numerical identifier for the routine, and stating its
parameters.
The main Occam code for the library is given in Appendix D.
CHAPTER 7
SYSTOLIC ALGORITHM FOR THE SOLUTION OF TOEPLITZ MATRICES
7.1 INTRODUCTION
Toeplitz matrices have become increasingly important with the rapid growth
of signal and image processing [Sloboda 1989].
It is very important that operations on Toeplitz matrices concerning the
applications mentioned above are performed in a cost effective, reliable and above
all rapid manner. Of course, the way to achieve these objectives is parallel
processing, involving the concurrent use of many simple and reliable processing
elements in the form of VLSI designs, i.e., systolic arrays.
In this chapter a systolic design is suggested for solving Toeplitz matrices.
A special type of Toeplitz matrix is the circulant matrix, which has the form:

          | a0    a1    a2   . . .  an-1 |
          | an-1  a0    a1   . . .  an-2 |
    A  =  |  .     .     .            .  |        (7.1)
          | a1    a2    a3   . . .  a0   |

For such matrices the sum of the elements in a row remains constant. Another
type of Toeplitz matrix is the cyclic banded matrix, which has the form:

          | a1  a2  ..  ar               ar  ..  a2 |
          | a2  a1  ..  ar-1  ar                    |
          |  .                    0               . |
    Ar =  | ar                                   ar |        (7.2)
          |           0                             |
          | a2  ..  ar               ar  ..  a2  a1 |

where the sum of the elements in a row also remains constant.
The inverse of a banded Toeplitz matrix, if it is invertible, can be computed
using the methods given by [Evans 1972] and [Gohberg and Semencul 1972] in
O(n) operations. Toeplitz systems of linear equations arise in many
scientific and engineering applications, and real time operation is often requested.
The well known algorithms of Trench [Trench 1964] and Bareiss [Bareiss 1969]
require O(n^2) operations. The inverse of a circulant Toeplitz matrix is required in the
restoration of images [Gonzales 1992]. An efficient algorithm for the factorisation
of a symmetric circulant banded Toeplitz matrix has been suggested by [Evans
1981].
The solution of Toeplitz systems, such as circulant and skew symmetric systems,
has received much attention from a systolic viewpoint. Kung and Hu [Kung and Hu
1983] and Brent and Luk [Brent and Luk 1984] have suggested parallel algorithms
for Toeplitz linear systems. Further parallel algorithms for solving Toeplitz linear
systems have been suggested by Evans [Evans 1986 and 1989] and Megson
[Megson 1985]. Each of these algorithms is implemented as a systolic design.
The theory of orthogonal polynomials has made it possible to develop
efficient algorithms for smoothing a function of one variable on an equidistant set
of points, which results in Toeplitz systems to be solved. Let us consider N+1
function values F(i) defined on the equidistant set of points 0, 1, ..., N. Let a
function f be approximated on each subset consisting of n+1 points by a polynomial
Pm of order m, where N >> n. Let m be odd and n be even.
Let us denote the running subset of n+1 points by i-n/2, ..., i-1, i, i+1, ...,
i+n/2. Then the smoothed value of f at the mid point i is defined by the value of
S(i) = Pm(i), as shown in Sloboda [Sloboda 1989].
Let m = 1.
Then, with n+1 = 3, we have

    S(i) = (1/3) (f(i-1) + f(i) + f(i+1))                                  (7.3)

With n+1 = 5, we have

    S(i) = (1/5) (f(i-2) + f(i-1) + f(i) + f(i+1) + f(i+2))                (7.4)

With n+1 = 7, we have

    S(i) = (1/7) (f(i-3) + f(i-2) + f(i-1) + f(i) + f(i+1) + f(i+2) + f(i+3))   (7.5)
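The formulas (7.3)-(7.5) are simple moving averages over n+1 points; a minimal sketch, with cyclic indexing so that it also applies to the closed contours discussed next:

```python
# S(i) for window size n+1 (n even), with indices taken mod N so the
# smoothing wraps around a closed sequence of samples.

def smooth(f, i, n):
    half = n // 2
    N = len(f)
    return sum(f[(i + k) % N] for k in range(-half, half + 1)) / (n + 1)

if __name__ == "__main__":
    f = [0, 3, 6, 3, 0, 3]
    print(smooth(f, 1, 2))   # (0 + 3 + 6) / 3 = 3.0
    print(smooth(f, 0, 2))   # wraps: (3 + 0 + 3) / 3 = 2.0
```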
7.1.1 Digital Contour Smoothing
As in image smoothing, the ultimate goal of digital contour smoothing is to
improve a given image in some sense. Digital contour smoothing belongs among the
most important procedures in image processing. This procedure allows us to smooth a
digital contour and to improve the stability of local and global invariants, such as
curvature, which is impossible to calculate without smoothing.
A digital image I is a finite rectangular array whose elements are pixels or
image elements. Each pixel p of I is defined by a pair of Cartesian coordinates (x,y),
which we may take to be integer values. An element or pixel p(x,y) in a digital
picture I has two types of neighbours, i.e.,
1- its four horizontal and vertical neighbours (h,k) such that
    |x - h| + |y - k| = 1
2- its diagonal neighbours (h,k) such that
    |x - h| = |y - k| = 1
A simple closed digital curve in image I is a path Γ = p0, p1, ..., pn
[Rosenfeld 1979] such that
    pi = pj iff i = j
and
    pi is a neighbour of pj iff i = j+1 (mod n+1)
Let
    x = x(t),   y = y(t)                                (7.6)
be a simple closed curve in the 2-dimensional Euclidean space V. Let this curve be
approximated by a set of N elements p1 = (x1, y1), p2 = (x2, y2), ..., pn = (xn, yn),
which are elements of a finite rectangular array I, and let these elements represent a
simple closed 4-connected digital curve for which
    |pi - pi-1| = |xi - xi-1| + |yi - yi-1| = 1         (7.7)
The discretized parametric equation of this digital closed curve has the form

          | x1  y1 |
    R  =  | x2  y2 |        (7.8)
          |  :   : |
          | xn  yn |
The least squares smoothing of a simple closed digital curve is then defined by
the linear operator (1/a) A which is applied to R, i.e. (1/a) A R,
where A is an N x N circulant Toeplitz matrix of the form (7.2) and a is the sum of
all elements in a row. For different values of m and n+1 we obtain, for example,
the following operator, which corresponds to equation (7.3):

                         | 1  1           1 |
                         | 1  1  1    0     |
    (1/a) A2  =  (1/a)   |    1  1  1       |        (7.9)
                         |    0  1  1  1    |
                         | 1           1  1 |
A subset of linear operators defined by an N x N circulant Toeplitz matrix A
which smooth digital closed contours in the least squares sense is suitable for digital
contour approximation, and these operators will be called feasible.
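A feasible operator can be checked numerically by building the circulant matrix from its first row and applying (1/a) A to one coordinate column of R. The sketch below uses the m = 1, n+1 = 3 operator of equation (7.9) with N = 6 and illustrative contour coordinates:

```python
# Build an N x N circulant matrix row by row from its first row, then
# apply (1/a) A to one coordinate column (the x's or the y's of R).

def circulant_row(first_row, shift):
    n = len(first_row)
    return [first_row[(j - shift) % n] for j in range(n)]

def apply_operator(first_row, a, coords):
    n = len(coords)
    rows = [circulant_row(first_row, i) for i in range(n)]
    return [sum(rows[i][j] * coords[j] for j in range(n)) / a
            for i in range(n)]

if __name__ == "__main__":
    first_row = [1, 1, 0, 0, 0, 1]      # tridiagonal circulant, as in (7.9)
    xs = [0, 1, 2, 2, 1, 0]             # x-coordinates of a closed contour
    print(apply_operator(first_row, 3, xs))
```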
In this chapter a factorisation of certain symmetric circulant banded linear
systems proposed by Evans [Evans 1981] is described. It can be shown that a
banded Toeplitz matrix can be factorised into the product of easily inverted
matrices. By using this factorisation, a solution can be derived for such matrices.
A full description of the systolic array design for the algorithm shown above
is given in a later section. The performance of the design is discussed in section
(7.4).
7.2 SOLUTION OF CERTAIN TOEPLITZ SYSTEMS
In the method described by D.J. Evans [Evans 1981] it is shown that the
special banded Toeplitz matrices Ar of semi-bandwidth r can be factorised into the
product of easily inverted matrices, the components of which are a cyclic matrix and
its transpose and a similar circulant banded matrix Ar-1 of order one less. By using
this factorisation, efficient algorithmic solution methods can be derived for the
related linear systems [Evans 1980].
We consider the solution of
    Ar x = d                                            (7.10)
where Ar is a cyclic banded Toeplitz matrix of the form shown in (7.2),
x = (x1, x2, ..., xn) is the unknown vector and d is the known right hand side
vector.
In the following we present the cases r=2 (tridiagonal), r=3 (quindiagonal),
and r >= 4. Also an iterative procedure is suggested and set up to produce a
reverse recursive strategy for the solution in the general case (i.e. Ar, Ar-1, ..., A3, A2).
7.2.1 Tridiagonal Case
We consider the solution of the system
    A2 x = d                                            (7.11)
where A2 is a circulant banded Toeplitz matrix of the following form:

          | a1  a2           a2 |
          | a2  a1  a2    0     |
    A2 =  |       .  .  .       |        (7.12)
          |    0    a2  a1  a2  |
          | a2          a2  a1  |

Now A2 can be factorised into
    A2 = B2 A1 B2^T                                     (7.13)
where B2^T is the transpose of the matrix B2, with

          | 1  b2            |
          |    1  b2    0    |
    B2 =  |       .  .       |        (7.14)
          |    0     1  b2   |
          | b2           1   |

and A1 is the following diagonal matrix:

          | c1              |
          |    c1     0     |
    A1 =  |       .         |        (7.15)
          |    0    .       |
          |            c1   |

Then by carrying out the required multiplications of the matrices (7.14) and
(7.15) and equating this to equation (7.13), we get

          | (c1+c1 b2^2)  (c1 b2)                       (c1 b2)      |
          | (c1 b2)  (c1+c1 b2^2)  (c1 b2)        0                  |
    A2 =  |                  .  .  .                                 |   (7.16)
          |      0          (c1 b2)  (c1+c1 b2^2)       (c1 b2)      |
          | (c1 b2)                  (c1 b2)       (c1+c1 b2^2)      |

Equating the terms of these matrices, we obtain the following relationships
between the elements a1, a2 of A2 and the elements c1, b2 of A1 and B2
respectively, i.e.
    a1 = c1 (1 + b2^2)                                  (7.17)
    a2 = b2 c1                                          (7.18)
Solving these two equations for b2 gives the quadratic equation
    b2^2 - (a1/a2) b2 + 1 = 0
which yields
    b2 = (h ± sqrt(h^2 - 4)) / 2,   where h = a1/a2     (7.19)
The element c1 is determined from equation (7.18), i.e.
    c1 = a2 / b2
Equation (7.13) can then be solved by the following solution process:
    A2 x = B2 A1 B2^T x = d                             (7.20)
which, by using the intermediate values v, z, can be carried out in 3 computational
steps, i.e.
    B2 v = d                                            (7.21)
    A1 z = v                                            (7.22)
    B2^T x = z                                          (7.23)
As we can see from equations (7.21 to 7.23) above, we have to solve three
different forms of linear systems. The method of solution of these systems will be
presented in the following paragraphs [ Evans 1980].
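Before deriving the recursions, the factorisation itself can be checked numerically. The sketch below is a Python illustration with assumed coefficients a1 = 4, a2 = 1 (chosen so that h^2 > 4 and a real root exists); it builds B2 and A1 from equations (7.18)-(7.19) and verifies that their product reproduces A2:

```python
import math

# Factorise the circulant tridiagonal A2 into B2 A1 B2^T, taking the
# quadratic root with |b2| < 1 for stability of the later recursions.

def factorise(a1, a2):
    h = a1 / a2
    b2 = (h - math.sqrt(h * h - 4)) / 2     # smaller root, |b2| < 1
    c1 = a2 / b2
    return b2, c1

def build_A2(a1, a2, n):
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = a1
        A[i][(i + 1) % n] = a2
        A[i][(i - 1) % n] = a2
    return A

def build_B2(b2, n):
    B = [[0.0] * n for _ in range(n)]
    for i in range(n):
        B[i][i] = 1.0
        B[i][(i + 1) % n] = b2              # wrap puts b2 in the corner
    return B

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

if __name__ == "__main__":
    a1, a2, n = 4.0, 1.0, 5
    b2, c1 = factorise(a1, a2)
    B2 = build_B2(b2, n)
    A1 = [[c1 if i == j else 0.0 for j in range(n)] for i in range(n)]
    B2T = [list(row) for row in zip(*B2)]
    P = matmul(matmul(B2, A1), B2T)
    A2 = build_A2(a1, a2, n)
    err = max(abs(P[i][j] - A2[i][j]) for i in range(n) for j in range(n))
    print(err < 1e-12)   # True: the factorisation reproduces A2
```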
From equation (7.21) we can write

    | 1  b2            | | v1   |   | d1   |
    |    1  b2    0    | | v2   |   | d2   |
    |       .  .       | |  :   | = |  :   |        (7.24)
    |    0     1  b2   | | vn-1 |   | dn-1 |
    | b2           1   | | vn   |   | dn   |

from which we can obtain an equation relating v1 and vn with dn:
    b2 v1 + vn = dn                                     (7.25)
Also from (7.24) we get a recursive sequence,
    vi = di - b2 vi+1        (i = n-1, ..., 1)          (7.26)
vn-1 can then be determined from equation (7.26) and, using equation (7.25),
we get
    vn = dn - b2 v1                                     (7.27)
A similar process applied to equation (7.26) in the same manner, commencing
on each equation in turn, yields a sequence of expressions. The value of v1 is
determined from the final expression in the form
    v1 = d1 - b2 d2 + b2^2 d3 - b2^3 d4 + ... + (-b2)^(n-1) dn + (-b2)^n v1
Then
    v1 = (d1 - b2 d2 + b2^2 d3 - b2^3 d4 + ... + (-b2)^(n-1) dn) / (1 - (-b2)^n)    (7.28)
A backward substitution process of equation (7.26) then yields the components of v.
The solution of equation (7.22) is straightforward since A1 is a diagonal
matrix.
Finally, the solution of equation (7.23) is similar to the solution of
equation (7.21). From the linear system (7.23) we have the following expressions
for x1, x2, ..., xn in the form
    x1 = z1 - b2 xn                                     (7.29)
and
    xi = zi - b2 xi-1        (i = 2, ..., n)            (7.30)
Then
    xi-1 = (zi - xi) / b2    (i = n, n-1, ..., 2)       (7.30a)
From equations (7.29) and (7.30) we can obtain an equation relating x2 and xn
with the zi:
    x2 = z2 - b2 z1 + b2^2 xn                           (7.31)
A similar process applied to equation (7.30) in the same manner, commencing
on the third, fourth, ..., and nth equations, creates a recursive sequence of
equations. Substitution of x1, x2, ..., xn-2 and xn-1 in terms of xn and the zi into the
final equation of (7.30) yields the final expression for xn as
    xn = zn - b2 zn-1 + b2^2 zn-2 - ... + (-b2)^(n-1) z1 + (-b2)^n xn
Then
    xn = (zn - b2 zn-1 + b2^2 zn-2 - ... + (-b2)^(n-1) z1) / (1 - (-b2)^n)          (7.32)
Finally the vector x is determined from the forward recursive sequence of
equations (7.30), or backward from equation (7.30a).
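The three steps (7.21)-(7.23), together with the closed forms (7.28) and (7.32), give the following sketch of the complete tridiagonal solver. The coefficients are illustrative; any choice with |a1| > 2|a2| keeps the quadratic root real:

```python
import math

# Solve A2 x = d via B2 v = d, A1 z = v, B2^T x = z, using the closed
# forms for the wrap-around unknowns v1 and xn and the recursions
# (7.26) and (7.30) for the remaining components.

def solve_A2(a1, a2, d):
    n = len(d)
    h = a1 / a2
    b2 = (h - math.sqrt(h * h - 4)) / 2
    c1 = a2 / b2
    # Step 1: B2 v = d.
    v = [0.0] * n
    v[0] = sum((-b2) ** i * d[i] for i in range(n)) / (1 - (-b2) ** n)
    v[n - 1] = d[n - 1] - b2 * v[0]                 # equation (7.25)
    for i in range(n - 2, 0, -1):                   # equation (7.26)
        v[i] = d[i] - b2 * v[i + 1]
    # Step 2: A1 z = v (diagonal).
    z = [vi / c1 for vi in v]
    # Step 3: B2^T x = z.
    x = [0.0] * n
    x[n - 1] = sum((-b2) ** i * z[n - 1 - i] for i in range(n)) / (1 - (-b2) ** n)
    x[0] = z[0] - b2 * x[n - 1]                     # equation (7.29)
    for i in range(1, n - 1):                       # equation (7.30)
        x[i] = z[i] - b2 * x[i - 1]
    return x

if __name__ == "__main__":
    # Row sums of A2 are a1 + 2 a2 = 6, so d = 6s gives x = 1s.
    x = solve_A2(4.0, 1.0, [6.0] * 5)
    print([round(xi, 6) for xi in x])   # [1.0, 1.0, 1.0, 1.0, 1.0]
```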
7.2.2 Quindiagonal Case
We consider a similar solution of the system
    A3 x = d
where

          | a1  a2  a3          a3  a2 |
          | a2  a1  a2  a3   0      a3 |
    A3 =  | a3  a2  a1  a2  a3         |        (7.33)
          |         .  .  .            |
          | a3      0       a2  a1  a2 |
          | a2  a3          a3  a2  a1 |

Similarly the factorisation of the above can be obtained in the form
    A3 = B3 A2 B3^T                                     (7.34)
where B3^T is the transpose of the matrix B3, and

          | 1  b3            |
          |    1  b3    0    |
    B3 =  |       .  .       |        (7.35a)
          |    0     1  b3   |
          | b3           1   |

and

          | c1  c2           c2 |
          | c2  c1  c2    0     |
    A2 =  |       .  .  .       |        (7.35b)
          |    0    c2  c1  c2  |
          | c2          c2  c1  |

Then by carrying out the required multiplications of the matrices (7.35a) and
(7.35b), the elements of the matrix A3 can be found by equating terms of the system
(7.34) to yield the relationships:
    a1 = (c1 + b3 c2) + (c2 + b3 c1) b3                 (7.36)
    a2 = c2 + b3 (c1 + b3 c2)                           (7.37)
    a3 = b3 c2                                          (7.38)
By using equation (7.38) and eliminating c2 from equation (7.37), we obtain
    a2 = a3/b3 + b3 ((a1 - 2a3)/(1 + b3^2) + a3)
Then
    a2 b3 (1 + b3^2) = a3 (1 + b3^2)^2 + b3^2 (a1 - 2a3)
and
    a3 b3^4 - a2 b3^3 + a1 b3^2 - a2 b3 + a3 = 0        (7.39)
which is a quartic equation for the determination of b3.
Once b3 is obtained, the values of c2 and c1 can be easily obtained from
equations (7.36), (7.37) and (7.38) as follows.
Multiplying equation (7.36) by (1 + b3^2) and equation (7.37) by (-2b3), we get
    a1 (1 + b3^2) = c1 (1 + b3^2)^2 + 2b3 c2 (1 + b3^2)
    -2b3 a2 = -2b3 c2 (1 + b3^2) - 2c1 b3^2
Adding the above two equations yields
    c1 = (a1 (1 + b3^2) - 2b3 a2) / (1 + b3^4)          (7.40a)
or, substituting equation (7.38) in equation (7.36), yields the following expression
for c1, i.e.
    c1 = (a1 - 2a3) / (1 + b3^2)                        (7.40b)
Then we can get c2 from equation (7.37):
    c2 = (a2 - b3 c1) / (1 + b3^2)                      (7.40c)
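In practice the quartic (7.39) is solved numerically. The sketch below uses Newton's method from a small positive starting guess (an assumption; the root with |b3| < 1 is the one retained for stable recursions), then recovers c1 and c2 from (7.40b) and (7.40c) and checks the relationships (7.36)-(7.38). The coefficients a1, a2, a3 are illustrative, constructed so that b3 = 0.2 is the exact root:

```python
# Newton's method on q(b) = a3 b^4 - a2 b^3 + a1 b^2 - a2 b + a3,
# followed by the back-substitutions (7.40b) and (7.40c).

def quindiagonal_factors(a1, a2, a3, b3=0.1, iters=60):
    for _ in range(iters):
        q = a3 * b3 ** 4 - a2 * b3 ** 3 + a1 * b3 ** 2 - a2 * b3 + a3
        dq = 4 * a3 * b3 ** 3 - 3 * a2 * b3 ** 2 + 2 * a1 * b3 - a2
        b3 -= q / dq
    c1 = (a1 - 2 * a3) / (1 + b3 ** 2)          # equation (7.40b)
    c2 = (a2 - b3 * c1) / (1 + b3 ** 2)         # equation (7.40c)
    return b3, c1, c2

if __name__ == "__main__":
    a1, a2, a3 = 4.56, 1.84, 0.2                # constructed so b3 = 0.2
    b3, c1, c2 = quindiagonal_factors(a1, a2, a3)
    print(round(b3, 6), round(c1, 6), round(c2, 6))   # 0.2 4.0 1.0
    # The relationships (7.36)-(7.38) are satisfied:
    print(abs((c1 + b3 * c2) + (c2 + b3 * c1) * b3 - a1) < 1e-9)
    print(abs(c2 + b3 * (c1 + b3 * c2) - a2) < 1e-9)
    print(abs(b3 * c2 - a3) < 1e-9)
```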
As b3, c2 and c1 are obtained, the linear system (7.34) can now be solved by
substituting equation (7.13) in equation (7.34); we obtain
    A3 x = B3 B2 A1 B2^T B3^T x = d
This equation can be solved by the following process, using intermediate
solution vectors:
    B3 u = d                                            (7.41a)
    B2 v = u                                            (7.41b)
    A1 y = v                                            (7.41c)
    B2^T z = y                                          (7.41d)
    B3^T x = z                                          (7.41e)
We determine u1 and v1 from the linear systems (7.41a) and (7.41b) in a
similar way to obtaining equation (7.28); then
    u1 = (d1 - b3 d2 + b3^2 d3 - ... + (-b3)^(n-1) dn) / (1 - (-b3)^n)      (7.42a)
    v1 = (u1 - b2 u2 + b2^2 u3 - ... + (-b2)^(n-1) un) / (1 - (-b2)^n)      (7.42b)
Then we determine the values of un and vn from
    un = dn - b3 u1,     vn = un - b2 v1
The values of ui and vi can be obtained from a backward substitution process
of the following equations:
    ui = di - b3 ui+1        (i = n-1, ..., 1)          (7.43a)
    vi = ui - b2 vi+1        (i = n-1, ..., 1)          (7.43b)
The solution of equation (7.41c) is straightforward since A1 is a diagonal
matrix.
Also we can determine zn and xn from the linear systems (7.41d) and (7.41e),
in a similar way to obtaining equation (7.32); then
    zn = (yn - b2 yn-1 + b2^2 yn-2 - ... + (-b2)^(n-1) y1) / (1 - (-b2)^n)  (7.44a)
    xn = (zn - b3 zn-1 + b3^2 zn-2 - ... + (-b3)^(n-1) z1) / (1 - (-b3)^n)  (7.44b)
Finally the values of zi and xi can be obtained from a forward substitution
process of the following equations:
    zi = yi - b2 zi-1        (i = 2, ..., n)            (7.45a)
    xi = zi - b3 xi-1        (i = 2, ..., n)            (7.45b)
or backward from
    zi-1 = (yi - zi) / b2    (i = n, n-1, ..., 2)       (7.45c)
    xi-1 = (zi - xi) / b3    (i = n, n-1, ..., 2)       (7.45d)
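The five steps (7.41a)-(7.41e) reduce to two forward and two backward bidiagonal sweeps around a diagonal scaling, each sweep using the closed form for its wrap-around unknown. The sketch below assumes the factor values b3, b2, c1 are already known (here taken from the illustrative factorisations above):

```python
import math

# Five-step solution of A3 x = d using generic circulant bidiagonal solvers.

def solve_upper(b, d):
    """Solve B v = d: 1s on the diagonal, b on the superdiagonal and in
    the bottom-left corner."""
    n = len(d)
    v = [0.0] * n
    v[0] = sum((-b) ** i * d[i] for i in range(n)) / (1 - (-b) ** n)
    v[n - 1] = d[n - 1] - b * v[0]
    for i in range(n - 2, 0, -1):
        v[i] = d[i] - b * v[i + 1]
    return v

def solve_lower(b, z):
    """Solve B^T x = z: 1s on the diagonal, b on the subdiagonal and in
    the top-right corner."""
    n = len(z)
    x = [0.0] * n
    x[n - 1] = sum((-b) ** i * z[n - 1 - i] for i in range(n)) / (1 - (-b) ** n)
    x[0] = z[0] - b * x[n - 1]
    for i in range(1, n - 1):
        x[i] = z[i] - b * x[i - 1]
    return x

def solve_A3(b3, b2, c1, d):
    u = solve_upper(b3, d)          # (7.41a)
    v = solve_upper(b2, u)          # (7.41b)
    y = [vi / c1 for vi in v]       # (7.41c), A1 is diagonal
    z = solve_lower(b2, y)          # (7.41d)
    return solve_lower(b3, z)       # (7.41e)

if __name__ == "__main__":
    # b3 = 0.2, with the A2 factors of the tridiagonal example
    # (a1 = 4, a2 = 1 gives b2 = 2 - sqrt(3), c1 = 1/b2).
    b2 = 2 - math.sqrt(3)
    x = solve_A3(0.2, b2, 1 / b2, [8.64] * 6)
    print([round(xi, 6) for xi in x])   # all 1.0: the A3 row sums are 8.64
```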
7.2.3 The General Case
So far we have solved the tridiagonal and quindiagonal cases for banded
Toeplitz matrices. For any semi-bandwidth r the layout of the banded Toeplitz
matrix remains the same:
    Ar x = d
with

          | a1  a2  ..  ar               ar  ..  a2 |
          | a2  a1  ..  ar-1  ar                    |
          |  .                    0               . |
    Ar =  | ar                                   ar |        (7.46)
          |           0                             |
          | a2  ..  ar               ar  ..  a2  a1 |

The factorisation of the above form is
    Ar = Br Ar-1 Br^T                                   (7.47)
where Br^T is the transpose of the matrix Br, and

          | 1  br            |
          |    1  br    0    |
    Br =  |       .  .       |        (7.48a)
          |    0     1  br   |
          | br           1   |
229
and Ar-1 is a matrix of the following form: the symmetric circulant banded
matrix of semi-bandwidth r-1 whose first row is
(c1, c2, ..., cr-1, 0, ..., 0, cr-1, ..., c2).    (7.48b)
By equating terms in the matrix multiplication, we can evaluate the unknown
terms in an algorithmic procedure with a pattern that can be determined from the
previous cases. Hence, we obtain the following relationships:
a1 = (c1 + br c2) + br (c2 + br c1)    (7.49a)
a2 = (c2 + br c1) + br (c3 + br c2)    (7.49b)
. . .
ar-3 = (cr-3 + br cr-4) + br (cr-2 + br cr-3)    (7.49c)
ar-2 = (cr-2 + br cr-3) + br (cr-1 + br cr-2)    (7.49d)
ar-1 = (cr-1 + br cr-2) + br^2 cr-1    (7.49e)
ar = br cr-1    (7.49f)
As an example of the general case for the Toeplitz matrix, we choose r = 4.
Then, equations (7.49a to 7.49f) can be written as,
a1 = (c1 + b4 c2) + b4 (c2 + b4 c1)    (7.50a)
a2 = (c2 + b4 c1) + b4 (c3 + b4 c2)    (7.50b)
a3 = (c3 + b4 c2) + b4^2 c3    (7.50c)
a4 = b4 c3    (7.50d)
The values of c3, c2 and c1 can be obtained from the above equations (7.50a -
7.50d) as follows.
Subtracting equation (7.50d) from equation (7.50b) yields,
(a2 - a4) = c2 (1 + b4^2) + b4 c1    (7.50e)
Now multiplying equation (7.50a) by (1 + b4^2) and equation (7.50e) by
(-2 b4), and adding them together, gives,
c1 = {a1 (1 + b4^2) - 2 b4 (a2 - a4)} / (1 + b4^4)    (7.51a)
Then we multiply equation (7.50e) by (1 + b4^2) and equation (7.50a) by
(-b4), and add them together, to get
c2 = {(a2 - a4) (1 + b4^2) - b4 a1} / (1 + b4^4)    (7.51b)
From equation (7.50c), we can obtain c3,
c3 = {a3 - b4 c2} / (1 + b4^2)    (7.51c)
Now these relationships appear too complicated to seek an exact solution, so
an alternative is obtained as follows.
First we guess a value of b4 (for stability we choose b4 < 1); then we determine
the values of c1, c2 and c3 from equations (7.51a, 7.51b and 7.51c) respectively,
and a new value of b4 is obtained from equation (7.50d). Then we check for
convergence of the value of b4: if the difference in b4 is greater than a specified
level of accuracy (i.e. > 0.000001), then we repeat the previous operation, using
the new value of b4. For any r the layout of the algorithm suggested remains the
same, it is only the number of steps that changes. In general, for a circular banded
matrix Ar, we need to obtain r-1 values of c from r equations.
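The guess-and-iterate scheme for r = 4 can be sketched as follows (ours, not the thesis's Occam). The b4 update is taken from the relation a4 = b4 c3 of equation (7.50d), since it is linear in b4; the starting guess, tolerance and test entries (generated from a known factorisation b4 = 0.25, c1 = 1.25, c2 = 0.5, c3 = 0.2) are illustrative assumptions.

```python
def factorise_r4(a1, a2, a3, a4, tol=1e-6, max_iter=200):
    """Iteratively determine b4, c1, c2, c3 for a circulant banded
    Toeplitz matrix of semi-bandwidth r = 4, equations (7.50)-(7.51)."""
    b4 = 0.1                                   # initial guess, b4 < 1
    for _ in range(max_iter):
        s, q = 1 + b4 ** 2, 1 + b4 ** 4
        c1 = (a1 * s - 2 * b4 * (a2 - a4)) / q      # (7.51a)
        c2 = ((a2 - a4) * s - b4 * a1) / q          # (7.51b)
        c3 = (a3 - b4 * c2) / s                     # (7.51c)
        b4_new = a4 / c3                            # update from (7.50d)
        if abs(b4_new - b4) < tol:                  # convergence check
            return b4_new, c1, c2, c3
        b4 = b4_new
    raise RuntimeError("b4 iteration did not converge")

# Entries built from the known factorisation b4=0.25, c1=1.25,
# c2=0.5, c3=0.2; the iteration should recover these values.
print(factorise_r4(1.578125, 0.89375, 0.3375, 0.05))
```

In this sketch the iteration converges linearly, each step roughly halving the error in b4, which matches the small fixed number of sweeps needed in practice.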
According to the factorisation method, the linear system (7.46) can now be
factorised recursively, using equation (7.47), as
Ar = Br Br-1 ... B2 A1 B2^T ... Br-1^T Br^T    (7.52)
so that the system to be solved becomes
Br Br-1 ... B2 A1 B2^T ... Br-1^T Br^T x = d    (7.53)
As shown in the previous sections, equation (7.53) can be solved by the
following process and by using a series of intermediate vectors, i.e.,
Br u = d,  Br-1 v = u,  ...,  B2 s = h    (7.54a)
A1 y = s    (7.54b)
B2^T z = y,  ...,  Br-1^T p = q,  Br^T x = p    (7.54c)
Determine the values of u1, v1, ....., s1 from the systems (7.54a), and
determine the values of zn, ....., pn, xn from the linear systems (7.54c), in a similar
way to obtaining equation (7.28).
Then we determine the values of un, vn, ....., sn from
un = dn - br u1,  .....,  sn = hn - b2 s1
The values of ui, vi, ....., si can be obtained from a backward substitution of the
following equations:
ui = di - br ui+1    (i = n-1, ...... ,1)
vi = ui - br-1 vi+1    (i = n-1, ...... ,1)    (7.55)
. . .
si = hi - b2 si+1    (i = n-1, ...... ,1)
The values of zi, ....., pi and xi can be obtained from a forward substitution
process of the following equations:
zi = yi - b2 zi-1    (i = 2, ..... ,n)
. . .
pi = qi - br-1 pi-1    (i = 2, ..... ,n)    (7.56a)
xi = pi - br xi-1    (i = 2, ..... ,n)
or, equivalently, working backwards,
zi-1 = (yi - zi) / b2    (i = n, n-1, ..... ,2)
. . .
pi-1 = (qi - pi) / br-1    (i = n, n-1, ..... ,2)    (7.56b)
xi-1 = (pi - xi) / br    (i = n, n-1, ..... ,2)
7.3 SYSTOLIC ARRAY IMPLEMENTATION FOR
TOEPLITZ MATRICES
There are several systolic arrays proposed for solving Toeplitz matrices
known in the literature. Kung and Hu [Kung and Hu 1983] and Brent and Luk
[Brent and Luk 1983] have suggested systolic designs for solving Toeplitz
systems. Further parallel algorithms for solving Toeplitz linear systems have been
suggested by Evans [Evans 1986 and 1989] and Megson [Megson 1985]. Each of
these algorithms is implemented as a systolic design using Occam.
This section introduces new systolic array designs for the implementation of
the Toeplitz matrix algorithms described in the previous section. First, a systolic
array design is implemented for the tridiagonal case, which is then extended to
solve the quindiagonal case. Finally, the design is further extended to handle the
general case of banded Toeplitz matrices. One of the main objectives of the design
is the minimisation of the number of hardware components. The original design and
the extensions are fully described in this section.
7.3.1 Systolic Array Design for Tridiagonal Case
The algorithm for the tridiagonal case of banded Toeplitz matrices described
in section (7.2.1) consists of a three-stage process, as shown in fig. (7.1). These
stages are as follows:
i- To solve the Toeplitz system B2 v = d in equation (7.21), which consists of
two substages:
a- To determine the value of v1 from equation (7.28).
b- The computation of vi from equation (7.26).
ii- The second stage is to solve the system (7.22) to get the values of zi.
Figure (7.1) Toeplitz matrix algorithm - structure of tridiagonal case:
input data; solve the Toeplitz system (7.21) (determine v1, then the vector vi);
solve the diagonal matrix A1 (7.22); solve the transposed Toeplitz matrix system
(7.23) (determine xn, then the vector xi); output.
iii- The final stage is the solution of the transposed Toeplitz system B2^T x = z
(7.23). In this stage we can determine the values of xi in the following two
substages.
a- Obtain the value of xn from equation (7.32).
b- Determine the values of the vector x from equation (7.30).
The successive blocks are themselves systolic arrays and data is piped
between them. In order to illustrate the property of the systolic design, the design is
discussed in more detail in the following way:
The systolic array for the general tridiagonal form given in fig.(7.2) is a
special connected structure taking advantage of the system (7.13) to reduce array
inputs.
The array consists of five cells, each cell representing one of the stages or
substages of the diagram shown in fig. (7.1). The first two cells compute the
solution of the linear system B2 v = d (7.21), and the last two cells compute the
solution of the linear system B2^T x = z (7.23), while the middle cell is used to
solve the system A1 z = v (7.22).
The value of b2 is initially stored in each of the cells, except the middle cell
where c1 is stored. Then, the values of di, i = n, n-1, ...., 1, are input from the host to
the first cell (in a backward sequence). The data and results are pumped through the
array and the final results are sent back to the host. Only cells on the array boundaries
are permitted to communicate with the host, and each of the cells communicates
with its left and right neighbouring cells only.
Starting with the first processing element, the value of v1 is computed by
applying equation (7.28); the input data di are pumped into a shift register, the
number of stages of which should be n (where n is the size of the input
vector d). The main purpose of the shift register is to delay the input data until the
Figure (7.2) Tridiagonal Toeplitz matrix systolic array design (double-sided):
the host pumps di into cell 1, data and results flow through cells 1 to 5, and xi is
returned to the host.
value of v1 is computed. Once v1 is ready, both v1 and dn are pumped to the next
cell at the same time, followed by sending the input vector d. The structure of cell 1
is shown in fig. (7.3a), and the Occam code running in this cell takes the following
form:
--- PROC delay
PROC cellA (CHAN ..........)
  --- declaration of local channels
  SEQ
    SEQ i = [0 FOR n]
      SEQ
        din ! d
        dout ? di
        v := v + (a * d)
        a := a / b
    v := v / (1 - (c * b)):

PROC main
  PAR
    cellA
    PAR i = [0 FOR 2]
      delay:
Cell 2 computes the backward recursive sequence of equation (7.26) to produce
vi, which is piped into the next cell at each cycle. Cell 3 also generates a sequence
of output data by applying equation (7.22), which is piped into cell 4.
Cells 2, 3 and 4 are overlapped in the computation process. Cell 4 continually
calculates the partial products of xn as given in equation (7.32). Although cell 4
works all the time, its results are only valid after n cycles, so cell 5 ignores the
results until they become valid. Then xn and zn are pumped to the final cell on the
right side, where xn-1 is calculated according to equation (7.30a). This backward
substitution is repeated n-1 times to produce the xi values, i = n-1, ....., 2, 1, and
thus the tridiagonal Toeplitz system (7.11) is solved.
Figure (7.3) Cell structures for the systolic design shown in fig. (7.2):
a- cell A represents cells 1 and 4; b- cell B represents cells 2 and 5.
The main structure of cell 4 is similar to the structure of cell 1, whilst the main
structures of cells 2, 3 and 5 are the same, as shown in fig. (7.3b). The Occam code
running on these cells (i.e. cells 2 and 5) takes the following form:
PROC cellB (CHAN .........)
  --- declaration of local channels
  SEQ
    vin ? vi
    SEQ i = [0 FOR n]
      SEQ
        din ? di
        -- calculate vi or zi or xi
        vout ! vi:
Fig. (7.4) illustrates a series of snapshots of an example where n = 6 (the size of
the vector). There are four snapshots of the tridiagonal system computation shown
here; the indices of the d, v, and x sequences appear in each relevant stage of the cell.
We assume that the computation starts at time zero.
At time zero, dn enters the array and v1 is calculated in cell 1. Then, at time 5
the last part of v1 is computed at cell 1 and pumped together with dn to the next cell,
as shown in fig. (7.4a). The second snapshot illustrates the state of all the cells at
time = 8, where d4 meets vi to produce v4 at cell 2. At the same time, v5 will be at
the middle cell to produce z5. At this time z6 is piped from cell 3, to produce
another partial result of xn. At time = 11, d1 will be at cell 2, v2 at cell 3 to produce
z2, and z3 at cell 4 to produce another partial result of xn, as shown in fig. (7.4c). At
time = 13 the final partial result of xn is calculated at cell 4. The last snapshot in this
figure illustrates the array at time = 15, where the first value of x (i.e. x6) is
produced. Also at this time, the next value of x (i.e. x5) is produced at cell 5.
Figure (7.4) Snapshots of the execution of the tridiagonal Toeplitz matrix
systolic array (a- time = 6; b- time = 8; c- time = 11; d- time = 15).
Let us apply equation (7.32) to our example; then the first valid result will be:
xn = (z6 - b2 z5 + b2^2 z4 - b2^3 z3 + b2^4 z2 - b2^5 z1) / (1 - b2^6)
while the other values of x are calculated from equation (7.30a) by the following:
x5 = (z6 - x6) / b2
x4 = (z5 - x5) / b2
x3 = (z4 - x4) / b2
x2 = (z3 - x3) / b2
x1 = (z2 - x2) / b2
The final value is computed at time = 20.
The main Occam code to run the tridiagonal matrix systolic system is as
follows:
--- PROC host
--- PROC cell A
--- PROC cell B
--- PROC cell C
PROC main.system (CHAN ..........)
  --- declaration of local channels
  PAR
    host
    cell A  -- cell 1
    cell B  -- cell 2
    cell C  -- cell 3
    cell A  -- cell 4
    cell B  -- cell 5:
The double-sided systolic array described previously can be improved and
implemented as a single-sided design, as shown in fig. (7.5). We modify the
design by reducing the number of cells to three cells only. The operations of cells 2
and 3 are similar to those in the previous design, with the third cell sending its output
back to cell 1; then cells 1 and 2 repeat these operations again to solve system (7.23).
The final value of xi is calculated at cell 2 and pumped to the host. The system runs
in two cycles: the first cycle is to solve systems (7.21 and 7.22), and the second
cycle is to solve system (7.23).
Figure (7.5) Single-sided systolic array design for tridiagonal Toeplitz
matrix.
In the second cycle a further problem arises, for in order to solve system
(7.23), a different set of equations to that used previously in solving system (7.21)
must be solved, due to the folding strategy applied. To overcome this problem we
suggest the following two solution strategies. We can,
1- Either make the cells generic in the sense that B2 and B2^T can be
computed by the same cells. Thus, each cell will have two sets of equations,
associated with a tag bit s such that
tag s = 0 when the system runs the first cycle
tag s = 1 when the system runs the second cycle
Now cells 1 and 2 check the tag bit: if it is 0, then they solve system (7.21);
otherwise, if the tag is 1, then the cell performs the solution of system (7.23). Cell
2 also needs a switch to know when to pump the output to cell 3 (i.e. in the first
cycle), or to pump the output back to the host in the second cycle. We can do that
by using the same tag s:
tag s = 0 pumps the output to cell 3
tag s = 1 pumps the output to the host
2- Or we can store the output data of cell 3 in an array inside cell 3 and then
pump them to cell 1 in reverse order, by using a LIFO (Last In First Out)
procedure. Then we can use the same equations in each cell for both cycles.
By choosing the second solution we increase the computation time by n
steps, while the first solution increases the cell area.
The choice of one of the two solutions above depends on whether we are
interested in computation speed inside or outside the cells.
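The first strategy can be illustrated with a small sketch (ours, not the thesis's Occam): a generic cell selects between the two equation sets on the tag bit, using the backward-substitution form of (7.26) in the first cycle and the division form of (7.30a) in the second.

```python
def generic_cell(tag, b, prev, cur):
    """Hypothetical generic cell for the single-sided array:
    tag 0 selects the form used when solving B2 v = d in the first
    cycle, tag 1 the form used when solving B2^T x = z in the second."""
    if tag == 0:
        return cur - b * prev        # v_i = d_i - b2 * v_(i+1)
    return (cur - prev) / b          # x_(i-1) = (z_i - x_i) / b2

print(generic_cell(0, 0.5, 1.0, 2.0))   # 1.5
print(generic_cell(1, 0.5, 1.0, 2.0))   # 2.0
```

The per-cell area cost of this strategy is exactly the second branch and the tag test.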
The main Occam code to run the single-sided systolic system is as follows:
--- PROC host
--- PROC cell A
--- PROC cell B
--- PROC cell C
PROC main.system (CHAN ........)
  --- declaration of local channels
  PAR
    host
    set tag
    PAR i = [0 FOR 2]
      cell A  -- cell 1
      cell B  -- cell 2
      cell C  -- cell 3:
7.3.2 Systolic Array Design for the Quindiagonal Case
From the algorithm described in section (7.2.2), we know that the solution
for the quindiagonal case can be computed as an extension of the tridiagonal case.
The algorithm for the quindiagonal case of banded Toeplitz matrices consists,
therefore, of a five-stage process, as shown in fig. (7.6). These stages are as
follows:
i- The solution of the Toeplitz system B3 u = d in equation (7.41a), which
consists of two substages:
a- Determining the value of u1 from equation (7.42a).
b- Followed by determining the components of ui from equation (7.43a).
ii- The second stage is to solve the system B2 v = u (7.41b), as explained
above, to get the value of v1.
iii- Solving system (7.41c) for yi.
iv- In this stage, we solve the transposed Toeplitz system B2^T z = y (7.41d).
The values of zi can be determined in the following two substages.
a- Obtain the value of zn from equation (7.44a).
b- Determine the remainder of the values of the vector z from equation (7.45c).
v- In a similar way to stage (iv), we can determine the values of xi of system
B3^T x = z (7.41e), by applying equations (7.44b and 7.45d).
The approach for the systolic array for the quindiagonal system to
accommodate the stages shown in fig. (7.6) is more or less similar to the
double-sided systolic array described in the previous section. The only difference
in the quindiagonal design is the flow of data.
Figure (7.6) Toeplitz matrix algorithm - structure of quindiagonal case:
input data; solve the Toeplitz matrix system (7.41a) (determine u1, then the
vector ui); solve the Toeplitz matrix system (7.41b) (determine v1, then the
vector vi); solve the diagonal matrix A1 (7.41c); solve the transposed matrix
system (7.41d) (determine zn, then the vector zi); solve the transposed matrix
system (7.41e) (determine xn, then the vector xi); output.
We note from section (7.2.2) that the first and second stages of fig. (7.6) have
similar algorithms; the fourth and fifth stages also have similar algorithms.
As shown in fig. (7.7), the systolic system operates in two cycles. In the first
cycle, the host pumps the input data di to the first cell, the value of u1 is
computed, and u1 and dn are pumped to cell 2 as explained in the previous section.
Cell 2 computes the backward recursive sequence to produce ui, and these values
are piped back to cell 1 to operate the second cycle for solving system (7.41b) in
the same manner as in the first cycle. As soon as cell 2 generates its output in the
second cycle, the output vector values are pumped to the middle cell, where
system (7.41c) is solved; this cell operates only once. Then the results are pumped
through cell 4 and cell 5 for solving system (7.41d). The final cell sends its results
back to cell 4. In the second cycle, the data are pumped through cell 4 and cell 5,
where xn and xi are calculated according to system (7.41e). The final results are
collected by the host from the final cell.
Values of b2 and b3 are stored in all the cells except the middle one. In the
first cycle, b3 is used in cell 1 and cell 2, while b2 is used in cell 4 and cell 5. In the
second cycle, in the opposite sense, b2 is used in cell 1 and cell 2, while b3 is used
in cell 4 and cell 5. c1 is stored at cell 3.
Figure (7.7) Double-sided systolic array design for quindiagonal system:
the host pumps di into cell 1 and collects xi from the final cell.
The single-sided systolic design shown in fig. (7.5) can also be modified in
order for the array to be implemented for quindiagonal systems. This design can
then be implemented in a similar way to that in which the tridiagonal system was
implemented.
We modify the design by increasing the number of cycles of operation.
Here, the system runs in four cycles: the first cycle is to solve system (7.41a), the
second cycle is to solve systems (7.41b and 7.41c), and the third and the fourth
cycles are to solve systems (7.41d and 7.41e) respectively.
In order to make the cells generic, in the sense that the B2^T, B3^T, B2 and B3
matrices can be computed by the same cells, we choose one of the two solutions
suggested in the previous section.
The tag bit solution is operated in the following manner:
tag s = 0 when the system runs the first and the second cycle
tag s = 1 when the system runs the third and the fourth cycle
7.3.3 Systolic Array Design for General Banded Toeplitz Matrices
So far we have described the proposed systolic array architecture for the
tridiagonal and quindiagonal Toeplitz matrices. Now, for semi-bandwidth r the
layout of the systolic scheme suggested earlier remains the same; it is only the
number of stages of the algorithm that is changed. In general, for a circular banded
Toeplitz matrix Ar of semi-bandwidth r, the number of main stages required to give
the solution of the system is 2r-1, as shown in fig. (7.8); these are as follows:
i- Stages (1 to r-1) to solve the Toeplitz systems (7.54a). Each main stage
consists of two substages, as shown in section (7.3.2).
Figure (7.8) Toeplitz matrix algorithm - structure of general case:
input data; solve the Toeplitz matrix systems (7.54a); solve the diagonal matrix
A1 (7.54b); solve the transposed matrix systems (7.54c); output.
ii- Stage r is for solving the diagonal system (7.54b).
iii- Stages (r+1 to 2r-1) are for solving the transposed Toeplitz systems
(7.54c). In this stage we can determine the values of xi as outlined in section
(7.3.2).
The double-sided systolic array system shown in fig. (7.7), and the
single-sided systolic array system shown in fig. (7.5), are implemented to
accommodate the stages shown in fig. (7.8) for the general case of Toeplitz
systems. These designs are implemented in a similar way to that in which the
quindiagonal system was implemented previously.
The double-sided systolic array for the general case operates for r-1 cycles.
The first two cells operate for r-1 cycles to solve the systems (7.54a). At the
(r-1)th cycle, cell 2 pumps the result to the middle cell, where system (7.54b) is
solved; this cell operates only once. The output is pumped on to the next two
cells, which also operate for r-1 cycles. At cycle r-1, the final cell sends the results
to the host, as shown in fig. (7.7).
The single-sided systolic array for the general case operates for 2r-2 cycles.
The system runs for r-1 cycles to solve systems (7.54a and 7.54b), then the system
runs for another r-1 cycles to solve system (7.54c).
The Occam program for the double-sided systolic array, described in this
section, is given in Appendix E.
7.3.4 Numerical Test Example
The systolic designs shown in the previous sections were tested on a variety
of Toeplitz matrices. The following banded Toeplitz system was used as an
example to show the validity of the method used in this chapter, for the case
n = 6, r = 3.
From system (7.13), we have A3 x = d, where A3 is the circulant quindiagonal
matrix

A3 = [ 1.5781  0.8437  0.1250  0.0     0.1250  0.8437 ]
     [ 0.8437  1.5781  0.8437  0.1250  0.0     0.1250 ]
     [ 0.1250  0.8437  1.5781  0.8437  0.1250  0.0    ]
     [ 0.0     0.1250  0.8437  1.5781  0.8437  0.1250 ]
     [ 0.1250  0.0     0.1250  0.8437  1.5781  0.8437 ]
     [ 0.8437  0.1250  0.0     0.1250  0.8437  1.5781 ]

x = (x1, x2, x3, x4, x5, x6)^T and

d = (9.3281, 7.7812, 10.5468, 14.0625, 16.8281, 15.2812)^T

From the algorithm shown in section (7.2.2), we determine the values of b2, b3
and c1: b2 = 0.5, b3 = 0.25 and c1 = 1, so that A1 is the identity matrix. The above
system can be solved by factorising the matrix A3 as

A3 = B3 B2 A1 B2^T B3^T

where B3 and B2 are the circulant bidiagonal factors with off-diagonal entries
0.25 and 0.5 respectively.

The above system was then input to the Occam program for the banded
quindiagonal Toeplitz matrix double-sided systolic array system. The program
results are as follows:
x1 = 1.0,  x2 = 2.0,  x3 = 3.0,  x4 = 4.0,  x5 = 5.0,  x6 = 6.0
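The result can be checked independently with a direct dense solve (a verification sketch of ours, not part of the thesis), building the circulant matrix from the first row as printed above:

```python
import numpy as np

# First row of the circulant quindiagonal test matrix A3
row = [1.5781, 0.8437, 0.1250, 0.0, 0.1250, 0.8437]
A3 = np.array([np.roll(row, k) for k in range(6)])   # cyclic shifts
d = np.array([9.3281, 7.7812, 10.5468, 14.0625, 16.8281, 15.2812])
x = np.linalg.solve(A3, d)
print(np.round(x, 2))   # close to [1. 2. 3. 4. 5. 6.]
```

The small residual error comes only from the four-decimal rounding of the printed matrix entries and right-hand side.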
7.4 PERFORMANCE OF THE SYSTOLIC ARRAY
7.4.1 Timing of the Systolic Array
The time required for the solution of the n x n banded Toeplitz system of the
form Ar x = d with semi-bandwidth r, computed on the double-sided systolic array
shown in section (7.3.3), is T = t1 + t2 + t3 + t4 + t5, where:
1- t1 = n time cycles to compute the first value (i.e. u1) of the first Toeplitz
system (7.54a) at cell 1; the remaining systems are overlapped with the next cell,
so each contributes only t1 = 1.
2- t2 = n time cycles to compute the values of each of the Toeplitz systems
(7.54a) at cell 2.
3- Cell 3 requires n time cycles to solve system (7.54b), but this is overlapped
with the computation of the previous cell, so t3 = 1.
4- Cell 4 requires n time cycles to compute the first value (i.e. zn, ..., xn) of each
of the transposed Toeplitz systems (7.54c). But this is also overlapped with the
computation of the previous cell, so t4 = 1.
5- Finally, t5 = n time cycles to compute the vector values of each of the
transposed Toeplitz systems (7.54c) at cell 5.
The total time of the entire pipeline is then

T = [n + (1 x (r-2))] + [n x (r-1)] + [1] + [1 x (r-1)] + [n x (r-1)]   for r > 2
T = [n] + [n x (r-1)] + [1] + [1 x (r-1)] + [n x (r-1)]                 for r = 2

i.e.,

T = [(2n + 1) (r-1)] + [n + (r-1)]
  = [(2n + 2) (r-1)] + n    for r >= 2
From the above equation we can determine the number of time cycles
required to solve the tridiagonal Toeplitz system, i.e.
T = [(2n + 1) (2-1)] + [n + (2-1)]
  = 3n + 2 time cycles,
while the number of time cycles for solving the quindiagonal Toeplitz system is
T = [(2n + 1) (3-1)] + [n + (3-1)]
  = 5n + 4 time cycles.
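The closed form and the two special cases can be checked with a few lines of arithmetic (a sketch of ours):

```python
def pipeline_cycles(n, r):
    # Total time T = (2n + 2)(r - 1) + n for the double-sided array, r >= 2
    return (2 * n + 2) * (r - 1) + n

# tridiagonal (r = 2) gives 3n + 2; quindiagonal (r = 3) gives 5n + 4
for n in (6, 100):
    assert pipeline_cycles(n, 2) == 3 * n + 2
    assert pipeline_cycles(n, 3) == 5 * n + 4
print(pipeline_cycles(6, 2))   # 20
```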
A similar number of time cycles is required to solve the Toeplitz systems
mentioned above when computed on the single-sided systolic array. By choosing a
LIFO procedure at cell 3, the computation time increases by n time cycles.
7.4.2 Area of the Systolic Array
The double-sided systolic array consists of the following cells:
1- Two cells with addition and multiplication operations.
2- Three cells with addition and multiplication or division operations.
The total length of the pipeline is five cells and 2n shift registers.
The number of cells in the systolic array is constant and therefore is not
related to the semi-bandwidth r or to the size of the problem n, whilst in the other
schemes proposed by Kung and Hu [Kung and Hu 1981] and Brent and Luk
[Brent and Luk 1983] the number of cells is related to n. In Megson's schemes
[Megson 1985], the number of cells is related to r, the semi-bandwidth.
The single-sided systolic array consists of only three cells and 2n shift
registers, thus reducing the design area to a minimum compared to the other
schemes. The only drawback is that some of the cells contain n shift registers.
CHAPTER 8
CONCLUSION AND DISCUSSION
The main study of this thesis is the design of parallel algorithms for digital
image processing on Very Large Scale Integration (VLSI) processor arrays, which
are implemented on both a Sequent Balance (MIMD) via an Occam simulator and a
transputer network running the Transputer Development System (TDS). The Occam
programming language is used as a tool to simulate and map systolic arrays for the
image processing algorithms proposed.
The algorithms considered in this thesis are drawn from the class of low-level
vision algorithms. In particular we consider the low-pass and high-pass filters. The
approach taken is to develop systolic array designs for these algorithms. Comments
and conclusions related to the implementation of the systolic arrays on transputer
networks are provided in the performance sections of the relevant chapters.
A general introduction to parallel processing is presented in chapter 1. This
chapter covers a wide selection of the principles of the significant parallel computer
architectures, and various classifications of parallel architectures are presented.
The systolic approach in parallel processing evolved from the appropriate
technology and the background knowledge for its realisation, together with possible
applications. The applications arise from the ever increasing demand for faster
and more reliable computations, especially in areas like real-time signal processing
and large-scale scientific computation. The appropriate technology was provided by
the remarkable advances in VLSI and automated design tools. Systolic array
systems feature the important properties of modularity, local interconnection, a high
degree of pipelining and highly synchronised multiprocessing. Systolic systems and
the hardware design of the transputer are discussed in detail in chapter 2, as well as
the associated parallel language Occam and the development system for running
Occam programs (TDS).
Transputers have a number of attractive features which are important for
building parallel systems. Probably the most important feature is the presence of
four high-speed serial links through which the transputer can be connected to other
transputers. One of the limiting factors in the use of transputers for image
processing applications is the size of the on-chip memory.
Chapter 3 considers the fundamentals of low-level image processing,
including parallel low-level image processing algorithms, parallel hardware, the
various methods for implementing these algorithms on various types of parallel
architecture, and the image processing techniques required for image filtering.
A systolic array design for one-dimensional convolution is described in
chapter 4. It is shown that this systolic array can be extended to handle a
two-dimensional convolution algorithm. The system performance was improved by
implementing the shift register as a constant time operation. The number of delays
of the systolic system is therefore a constant time operation, i.e., it is independent
of the kernel size and input data size. This decreases the execution time of the
whole system. This system can be extended to handle convolution of any
dimensionality.
The implementations of the systolic arrays for 1D and 2D convolution
algorithms on transputer networks are also presented in chapter 4 and their timing
results analysed. For the 1D convolution algorithm, the speed-up and the efficiency
increase with the increasing number of transputers and the increasing image size.
The best efficiency is obtained when each transputer contains one cell only of the
systolic design. The overall results for the 2D convolution algorithm are very
impressive; the timing indicates a very high speed-up for all sizes of images when
the network size increases to more than one transputer. The efficiency also
increases with increasing size of the transputer network. The systolic array designs
for 2D convolution are shown to be superior to those known in the literature.
Various modifications of the systolic design presented in chapter 4, to handle
a set of digital image filters, are analysed in chapters 5 and 6. These include both
low-pass and high-pass filters. These algorithms were implemented on a transputer
network, using a systolic array designed for each of them.
In chapter 5, systolic array designs for the Laplacian and gradient operators
are presented. The plus-shaped Laplacian algorithm gave near linear speed-up for
all sizes of image on a transputer network. The maximum efficiency is obtained
when the load on each transputer is about the same, whilst the square Laplacian
algorithm gives high speed-ups and efficiencies as the network size increases to
more than two transputers.
A new systolic system for the gradient operator is developed and shown in
chapter 5. The transputer implementation of the design is also discussed. The
relationship between the image size and the speed-up is nearly fixed for each
transputer network, but increases sharply when the network size is at its maximum.
It can be concluded that better performance results can be achieved if the load is
balanced. The systolic array designs for the Prewitt and Sobel operators are also
introduced.
Another set of systolic systems for digital image filters is discussed in
chapter 6. The chapter starts with an overview of the systolic array for the sigma
filter, followed by another systolic design for the inverse gradient filter. Another
section presents a systolic array design for the mean and weighted mean filters. The
implementation of each of these designs on transputer networks is also discussed.
The overall results for implementing the sigma and inverse gradient filters on
transputer networks give an efficiency which increases as the network grows from
7 to 10 transputers; the speed-up also increases as the size of the image increases.
The mean and weighted mean filters gave a good speed-up for large images.
This result is quite useful, since the need for a parallel system is more vital
for large images, where processing time is relatively high. The graphs show good
speed-up and efficiency for the various sizes of images and transputer networks.
However, there is a maximum number of transputers which can be used efficiently
in the systolic systems.
In chapter 6 the implementation of a variety of digital image filter algorithms
on the Sequent Balance and the transputer network was achieved. One of the aims
of this was to design and build a programming workbench for developing image
processing operations for low-level vision. The motivation for the work is to
develop a methodology for the implementation of an image processing library on
the Sequent Balance, i.e. PARC-IPL. The key to the workbench is to hold a library
of precoded software components in a generalized configuration-independent style.
The workbench provides a good mechanism for developing low-level image
processing systems on parallel computers without the need for much actual
programming on the user's part. The user can control the execution of the program
from the workbench. At the heart of PARC-IPL is a library of image processing
routines and algorithms, which are coded in a generalized format which is not
specific to one particular configuration, either by its size or topology. Further work
is still required to invoke the existing library on a multiprocessor such as a
transputer network, using the same Occam codes.
Furthermore, in the field of filter design the problem of solving banded
Toeplitz systems occurs frequently. Toeplitz matrices have become increasingly
important with the rapid growth of signal and image processing. The factorisation of
certain banded Toeplitz matrices proposed by D.J. Evans is described in chapter 7.
By using this factorisation a solution can be derived for such matrices. Descriptions
of two systolic array designs for the tridiagonal, quindiagonal and general Toeplitz
matrices are presented. One of the main objectives of the designs is to minimize the
number of hardware components, the maximum number of cells in these designs
being 5. The performance of the designs is also given.
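For reference, a tridiagonal Toeplitz system can also be solved directly by elimination. The Python sketch below uses the ordinary Thomas algorithm on a system with constant diagonals; it is an illustrative baseline only, not the Evans factorisation or the systolic designs described in chapter 7:

```python
# Solve T x = d where T is tridiagonal Toeplitz: constant sub-diagonal a,
# main diagonal b, super-diagonal c. Standard Thomas algorithm (forward
# elimination, back substitution); O(n) work, purely sequential.
def solve_tridiag_toeplitz(a, b, c, d):
    n = len(d)
    cp = [0.0] * n          # modified super-diagonal coefficients
    dp = [0.0] * n          # modified right-hand side
    cp[0] = c / b
    dp[0] = d[0] / b
    for i in range(1, n):
        m = b - (a * cp[i - 1])
        cp[i] = c / m
        dp[i] = (d[i] - (a * dp[i - 1])) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - (cp[i] * x[i + 1])
    return x

# T = tridiag(-1, 4, -1); d chosen so that the solution is all ones:
print(solve_tridiag_toeplitz(-1.0, 4.0, -1.0, [3.0, 2.0, 2.0, 3.0]))
# approximately [1.0, 1.0, 1.0, 1.0]
```

The contrast with this sequential recurrence is what motivates the systolic designs: they expose the elimination as a pipeline of identical cells.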
Further work is required to implement these systolic designs on a transputer
network, in order to measure the performance of the algorithm on a variety of
network configurations.
Finally, the use of task parallelism should be considered for parallel
implementations of image processing applications, especially those belonging to the
low-level image class. This is because of the large amount of data involved in the
highly homogeneous processing required. The systolic array designs described in
this thesis can be extended to cover image parallelism as well as task parallelism. In
other words, the image is partitioned over a number of similar systolic arrays, with
each systolic array executing the whole algorithm. Such designs will achieve the
advantages of both image parallelism and task parallelism.
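The partitioning idea can be sketched as follows. This Python fragment (an illustration of the data decomposition only, not the thesis's Occam implementation) splits the image into horizontal strips with a one-row halo for a 3x3 filter; each strip would be handed to its own systolic array, here stood in for by a sequential 3x3 mean filter:

```python
# Image parallelism by row-strip partitioning for a 3x3 neighbourhood
# operator. Each strip carries a one-row halo so its boundary rows can
# be filtered exactly as in the whole-image case.
def mean3x3(img):
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]          # border pixels left unchanged
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(img[y + dy][x + dx]
                            for dy in (-1, 0, 1)
                            for dx in (-1, 0, 1)) / 9.0
    return out

def partition_and_filter(img, strips):
    # Assumes the image height divides evenly into the number of strips.
    h = len(img)
    rows_per = h // strips
    result = []
    for s in range(strips):
        lo = max(0, (s * rows_per) - 1)          # one-row halo above
        hi = min(h, ((s + 1) * rows_per) + 1)    # one-row halo below
        filtered = mean3x3(img[lo:hi])           # the "systolic array" stand-in
        own_lo = (s * rows_per) - lo             # keep only the owned rows
        result.extend(filtered[own_lo:own_lo + rows_per])
    return result
```

With the halo rows, the partitioned result is identical to filtering the whole image, so each strip's systolic array can run the full algorithm independently, combining image parallelism with task parallelism.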
REFERENCES
Amin 88  Amin, S.A., Systolic Design for Lowpass Digital Image Filtering on a Transputer Network Using TDS, The SERC/DTI Initiative in the Engineering Application of Transputers, UK, 1988.

Arvind 82  Arvind and Gostelow, K.P., The U-Interpreter, IEEE Computer, pp. 42-49, Feb. 1982.

Ballard 82  Ballard, D.H. and Brown, C.M., Computer Vision, Prentice Hall, Inc., 1982.

Bareiss 69  Bareiss, E.H., Numerical Solution of Linear Equations with Toeplitz Matrices, Numerical Mathematics, 13, pp. 404-424, 1969.

Bekakos 86  Bekakos, M.P., A Study of Algorithms for Parallel Computers and VLSI Systolic Processor Arrays, Ph.D. Thesis, Dept. of Computer Studies, LUT, 1986.

Brent 83  Brent, R.P., Kung, H.T. and Luk, F., Some Linear Time Algorithms for Systolic Arrays, CMU-ROL-83, 1983.

Brent 84  Brent, R.P. and Luk, F.T., Systolic Arrays for the Linear Time Solution of Toeplitz Systems of Equations, J. of VLSI and Computer Systems, Vol. 1, No. 1, 1984.

Chin 83  Chin, R.T. and Yeh, C.L., Quantitative Evaluation of Some Edge-preserving Noise-Smoothing Techniques, Computer Vision, Graphics and Image Processing, 23, pp. 67-91, 1983.

Crookes 90 [1]  Crookes, D., Morrow, P.J. and Philip, G., The Development of a Transputer-Based Image Database, Proc. 2nd International Conference on Applications of Transputers, Southampton, pp. 189-195, July 1990.
Crookes 90  Crookes, D., Morrow, P.J. and McParland, P.J., IAL: a Parallel Image Processing Programming Language, IEE Proceedings, Vol. 137, No. 3, pp. 176-182, June 1990.

Crookes 91 [1]  Crookes, D., Morrow, P.J., McClatchey, I. and Rafferty, T., A Software Development Environment for Parallel Image Processing: Implementation Techniques and Issues, in "Occam and the Transputer: Current Developments", Edwards, J. (ed.), IOS Press, Netherlands, 1991.

Crookes 91  Crookes, D., Morrow, P.J. and McParland, P.J., Occam Implementation of an Algebra-Based Language for Low-level Image Processing, Computer Systems and Engineering, Vol. 6, No. 1, pp. 30-36, Jan. 1991.

Danielsson 81  Danielsson, P.E., Note on Getting the Median Faster, Computer Vision, Graphics and Image Processing, 17, 1981.

Dew 86  Dew, P.M., Manning, L.J. and McEvoy, K., A Tutorial on Systolic Array Architectures for High Performance Processors, 2nd Int. Electronic Image Week, Nice, 1986.

Doshi 87  Doshi, K. and Varman, P., A Modular Systolic Architecture for Image Processing, Computer Architecture Conf., 14th Int. Symp., USA, pp. 56-63, 1987.

Duff 83  Duff, M.J.B., Computing Structures for Image Processing, Academic Press, 1983.

Ekstrom 84  Ekstrom, M.P., Digital Image Processing Techniques, Academic Press, 1984.
Evans 72  Evans, D.J., An Algorithm for the Solution of Certain Tridiagonal Systems, The Computer Journal, 15, pp. 356-359, 1972.

Evans 80  Evans, D.J., On the Solution of Certain Toeplitz Tridiagonal Linear Systems, SIAM J. Numerical Anal., Vol. 17, No. 5, 1980.

Evans 81  Evans, D.J., On the Factorisation of Certain Symmetric Circulant Banded Linear Systems, in "Parallel Processing Techniques, Applied Information Technology edition", pp. 79-84, 1981.

Evans 86  Evans, D.J. and Megson, G.M., A Highly Pipelined Systolic Array for Solving Toeplitz Systems, Int. Rep., Computer Studies, No. 332, LUT, 1986.

Evans 89  Evans, D.J. and Megson, G.M., Fast Triangularization of a Symmetric Tridiagonal Matrix, J. of Parallel and Distributed Computing, 6, pp. 663-678, 1989.

Evans 91  Evans, D.J. and Gusev, M., New Linear Systolic Arrays for Digital Filters and Convolution, Int. Rep., Computer Studies, No. 660, LUT, 1991.

Flynn 66  Flynn, M.J., Very High-Speed Computing Systems, Proc. of the IEEE, Vol. 54, No. 12, pp. 1901-1909, Dec. 1966.

Galletly 90  Galletly, J., Occam 2, Pitman Publishing, GB, 1990.

Giloi 91  Giloi, W.K., Whither Image Analysis System Architecture?, in "From Pixels to Features II", Burkhardt, H., Neuvo, Y., Simon, J.C. (eds.), Elsevier Science Publishers, 1991.
Gohberg 72  Gohberg, I. and Semencul, A., On the Inversion of Finite Toeplitz Matrices and their Continuous Analogs, Mat. Issled 2, pp. 201-233, 1972.

Gonzalez 92  Gonzalez, R.C. and Woods, R.E., Digital Image Processing, Addison-Wesley Publishing Company, 1992.

Graham 90  Graham, I. and King, T., The Transputer Handbook, Prentice Hall, 1990.

Gurd 85  Gurd, J.R., Kirkham, C.C. and Watson, I., The Manchester Prototype Data-Flow Computer, Comm. ACM, No. 1, pp. 34-52, Jan. 1985.

Handler 82  Handler, W., Innovative Computer Architectures - How to Increase Parallelism But Not Complexity, in "Parallel Processing Systems", Evans, D.J. (ed.), Cambridge University Press, GB, 1982.

Haralick 80  Haralick, M.H. and Simon, J.C., Issues in Digital Image Processing, Sijthoff and Noordhoff, Netherlands, 1980.

Harp 89  Harp, G., Transputer Applications, Pitman Publishing, London, 1989.

Hays 88  Hays, J.P., Computer Architecture and Organization, McGraw-Hill, 1988.

Higbie 72  Higbie, L.C., The Omen Computers: Associative Array Processors, IEEE Comp. Conf., Digest, pp. 287-290, 1972.

Hobbs 70  Hobbs, L.C. and Theis, D.J., Survey of Parallel Processor Approaches and Techniques, in "Parallel Systems: Technology and Applications", Hobbs et al. (eds.), Spartan Books, New York, pp. 3-20, 1970.
Hockney 88  Hockney, R.W. and Jesshope, C.R., Parallel Computers 2: Architecture, Programming and Algorithms, Adam Hilger Ltd., Bristol, England, 1988.

Hubel 62  Hubel, D.H. and Wiesel, T.N., Receptive Fields, Binocular Interaction and Functional Architecture in the Cat's Visual Cortex, J. Physiol., pp. 106-154, 1962.

Hussain 91  Hussain, Z., Digital Image Processing: Practical Applications of Parallel Processing Techniques, Ellis Horwood, 1991.

Hwang 84  Hwang, K. and Briggs, F.A., Computer Architecture and Parallel Processing, McGraw-Hill, N.Y., 1984.

Inmos 86  Inmos Ltd., Product Information: ITEM 400 Inmos Transputer Evaluation Module, 1986.

Inmos 87  Inmos Ltd., The Transputer Family, Inmos Ltd., UK, 1987.

Inmos 88 [1]  Inmos Ltd., IMS B012 User Guide and Reference Manual, 1988.

Inmos 88 [2]  Inmos Ltd., Occam 2 Reference Manual, Prentice Hall, UK, 1988.

Inmos 90  Inmos Ltd., Transputer Development System, Prentice Hall, UK, 1990.

Inmos 92  Inmos Ltd., The Transputer Databook, Inmos Ltd., UK, 1992.

Kung 78  Kung, H.T. and Leiserson, C.E., Systolic Arrays (for VLSI), in Proc. Sparse Matrix Symp. (SIAM), pp. 256-282, 1978.

Kung 79  Kung, H.T., Let's Design Algorithms for VLSI Systems, Proc. Conf. Very Large Scale Integration: Architecture, Design, Fabrication, California Institute of Technology, pp. 65-90, Jan. 1979.

Kung 80  Kung, H.T., Special-Purpose Devices for Signal and Image Processing: an Opportunity in VLSI, Real-Time Signal Processing III, pp. 76-84, 1980.

Kung 82  Kung, H.T. and Song, S.W., A Systolic 2D Convolution Chip, in "Multiprocessors and Image Processing: Algorithms and Programs", Academic Press, 1982.

Kung 83  Kung, H.T., Ruane, L.M. and Yen, D.W., Two-Level Pipelined Systolic Array for Multidimensional Convolution, Image and Vision Computing, Vol. 1, No. 1, 1983.

Kung 84 [1]  Kung, H.T., Systolic Algorithms for the CMU Warp Processors, Dept. of Computer Science, CMU, 1984.

Kung 84 [2]  Kung, H.T., Systolic Algorithms for the CMU Warp Processors, CMUC-CSA-84-158 (7th Int. Conf.), 1984.

Kung 84  Kung, H.T. and Lam, M.S., Wafer-Scale Integration and Two-level Pipelined Implementations of Systolic Arrays, J. of Parallel and Distributed Computing, Vol. 1, pp. 32-63, 1984.

Kung 83  Kung, S.Y. and Hu, Y.H., A Highly Concurrent Algorithm and Pipelined Architecture for Solving Toeplitz Systems, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-31, No. 1, 1983.

Kung 85  Kung, S.Y., VLSI Array Processors, IEEE ASSP Magazine, pp. 5-22, July 1985.
Kwan 88  Kwan, H.K. and Okullo-Oballa, T.S., Two-Dimensional Systolic Arrays for Two-Dimensional Convolution, Proc. of the SPIE, Vol. 1001, pp. 724-731, 1988.

Lee 83  Lee, J.S., Note on Digital Image Smoothing and the Sigma Filter, Computer Vision, Graphics and Image Processing, Vol. 24, pp. 255-269, 1983.

Manning 88  Manning, L.J., Design and Analysis of Computational Models for Programmable VLSI Processor Arrays, Ph.D. Thesis, University of Leeds, 1988.

Manning 88 [1]  Manning, L.J., Dew, P.M. and Wang, H., Design and Analysis of Image Processing Algorithms for Programmable VLSI Array Processors, in "Parallel Architectures and Computer Vision", Page, I. (ed.), Oxford University Press, 1988.

May 89  May, D., The Transputer, in "Transputer Applications", Harp, G. (ed.), Pitman Publishing, London, 1989.

Mead 80  Mead, C.A. and Conway, L.A., Introduction to VLSI Systems, Addison-Wesley, Reading, Mass., 1980.

Megson 85  Megson, G.M. and Evans, D.J., Banded and Toeplitz Systems, Int. Rep., Computer Studies, No. 243, LUT, 1985.

Megson 86  Megson, G.M. and Evans, D.J., Soft-Systolic Pipelined Matrix Algorithms, in "Parallel Computing 85", Feilmeier, M. et al. (eds.), Elsevier Science Publishers, 1986.

Megson 87  Megson, G.M., Novel Algorithms for the Soft-Systolic Paradigm, Ph.D. Thesis, LUT, 1987.

Megson 92  Megson, G.M., An Introduction to Systolic Algorithm Design, Clarendon Press, Oxford, UK, 1992.
Moore 87  Moore, W., McCabe, A. and Urquhart, R., Systolic Arrays, Adam Hilger, 1987.

Morrow 87  Morrow, P.J. and Perrott, R.H., The Design of Low-Level Image Processing Algorithms on a Transputer Network, in "Parallel Architectures and Computer Vision", pp. 243-260, 1987.

Morrow 91  Morrow, P.J. and Crookes, D., Using a High Level Language: Issues in Image Processing on Transputers, in "From Pixels to Features II", Burkhardt, Y., et al. (eds.), Elsevier Science Publishers, Netherlands, pp. 313-326, 1991.

Morrow 92  Morrow, P.J. and Crookes, D., Parallel Languages for Transputer-Based Image Processing, in "Image Processing and Transputers", Webber, H.C. (ed.), IOS Press, Netherlands, pp. 27-46, 1992.

Murtha 64  Murtha, J. and Beadles, R., Survey of the Highly Parallel Information Processing Systems, Prepared by the Westinghouse Electronic Corp., Aerospace Division, ONR, Report No. 4755, Nov. 1964.

Nagao 79  Nagao, M. and Matsuyama, T., Note on Edge Preserving Smoothing, Computer Vision, Graphics and Image Processing, Vol. 9, pp. 394-407, 1979.

Niblack 86  Niblack, W., An Introduction to Digital Image Processing, Prentice Hall, 1986.

Nudd 88  Nudd, G.R. and Francis, N.D., Architectures for Image Analysis, in Third Int. Conf. on Image Processing and its Applications, pp. 445-451, IEE, England, 1988.
Offen 85  Offen, R.J., VLSI Image Processing, Collins, London, 1985.

Peli 82  Peli, T. and Malah, D., Study of Edge Detection Algorithms, Computer Vision, Graphics and Image Processing, Vol. 20, pp. 1-21, 1982.

Quinton 91  Quinton, P. and Robert, Y., Systolic Algorithms and Architectures, Prentice Hall, UK, 1991.

Ramamoorthy 77  Ramamoorthy, C.V. and Li, H.F., Pipeline Architecture, Computing Surveys, Vol. 9, No. 1, pp. 61-102, March 1977.

Robert 86  Robert, Y. and Tchuente, M., Efficient Systolic Arrays for the 1D Convolution Problem, J. of VLSI and Computer Systems, Vol. 1, pp. 398-407, 1986.
Rosenfeld 79  Rosenfeld, A., Picture Languages, Academic Press, New York, 1979.

Rosenfeld 82  Rosenfeld, A. and Kak, A.C., Digital Picture Processing, Vol. 2, Academic Press, 1982.

Seitz 85  Seitz, C.L., The Cosmic Cube, Comm. ACM, Vol. 28, No. 1, pp. 22-33, Jan. 1985.

Shore 73  Shore, J.E., Second Thoughts on Parallel Processing, Comput. Elec. Eng., pp. 95-109, 1973.

Siegel 85  Siegel, H.J., Interconnection Networks for Large-Scale Parallel Processing, Lexington Books, D.C. Heath and Co., Lexington, MA, 1985.

Sloboda 89  Sloboda, F., Toeplitz Matrices, Homothety and Least Squares Approximation, in "Parallel Computing: Methods, Algorithms and Applications", IOP Publishing Ltd., pp. 237-248, 1989.

Snyder 82  Snyder, L., Introduction to the Configurable Highly Parallel Computer, IEEE Computer, pp. 47-56, 1982.
Stone 87  Stone, H.S., High-Performance Computer Architecture, Addison-Wesley, Reading, MA, 1987.

Tabak 89  Tabak, D., Multiprocessors, Prentice-Hall International, 1989.

Trench 64  Trench, W., An Algorithm for the Inversion of Finite Toeplitz Matrices, J. Soc. Ind. Appl. Math., Vol. 12, pp. 515-522, 1964.

Undrill 92  Undrill, P.E., Digital Images: Processing and Application of the Transputer, in "Image Processing and Transputers", Webber, H.C. (ed.), IOS Press, 1992.

Wang 81  Wang, D.C.C. and Vagnucci, A.H., Gradient Inverse Weighted Smoothing Scheme and the Evaluation of its Performance, Computer Vision, Graphics and Image Processing, Vol. 15, pp. 167-181, 1981.

Wayman 89  Wayman, R., Transputer Development Systems, Pitman Publishing, London, 1989.
APPENDICES
APPENDIX A
Host Transputer Occam Program for 1D and 2D Convolution
-- Host program for 1D and 2D convolution.
-- Using host and n transputers.
-- One input channel and one output channel only for each transputer.
-- Image size 16*16.
#USE interf
#USE uservals
#USE streamio
#USE snglmath
#USE strings
#USE ssinterf
#USE userio
#USE linkaddr
-- time, m, n, mn should be modified for other image sizes.
PROTOCOL pair IS REAL32 ; REAL32:
CHAN OF pair xyin,xyout :
VAL time IS 286:  -- size of the image + No. of columns + 14
VAL m IS 16:      -- No. of rows
VAL n IS 17:      -- No. of columns + 1
VAL mn IS 272:    -- (no.co + 1) * no.rw
[5] REAL32 a:
[time] REAL32 x,y,xx,yy:
INT start,end,interval:
REAL32 intervall:
TIMER clock:
PLACE xyin AT link2.out:
PLACE xyout AT link3.in:
PROC output.result(CHAN OF ANY screen)
SEQ
SEQ i=O FOR time
SEQ
ss.write.int(screen, i, 6)
-- print results
ss.write.real32(screen, xx[i], 8, 10)
newline(screen)
ss.write.real32(screen, yy[i], 8, 10)
newline(screen)
newline(screen)
-- print time lapse
write.full.string(screen, " Timer in units = ")
ss.write.int(screen, interval, 10)
newline(screen)
write.full.string(screen, " Timer in seconds = ")
ss.write.real32(screen, intervall, 8, 10)
PROC write.to.file()
INT error:
SEQ
-- This procedure uses screen output with the option to file a copy
-- To file output, run it on an empty fold
INT kchar:
SEQ
ss.write.string(screen, "Do you want to file the output? ")
ks.read.echo.char (keyboard, screen, kchar)
ss.write.nl(screen)
VAL bchar IS BYTE (kchar /\ #5F):  -- mask off alphabetic case
IF
bchar= 'Y'
CHAN OF ANY fromprog, tofile:
INT foldnum:
PAR
SEQ
output.result(fromprog)
ss.write.endstream (fromprog)
SEQ
ss.scrstream.fan.out (fromprog, tofile, screen)
ss.write.endstream (tofile)
SEQ
ss.scrstream.to.fold (tofile, from.user.filer[0],
to.user.filer[0], "output.dat", foldnum, error)
IF
error=O
SKIP
TRUE
STOP
TRUE
output.result(screen)
ss.write.string(screen, "File output OK*c*n")
-- The main host sends an element from buffers x[] and y[] via the xyout channel,
-- and collects an element into buffers xx[] and yy[] via the xyin channel.
PROC host(CHAN OF pair xyout,xyin)
PROC host1 (CHAN OF pair xyout)
SEQ
clock ? start
SEQ i= 0 FOR time
xyout ! x[i] ; y[i]
PROC host2(CHAN OF pair xyin)
SEQ
SEQ i= 0 FOR time
xyin ? xx[i] ; yy[i]
clock ? end
SEQ
PAR
host1 (xyout)
host2(xyin)
interval := end MINUS start
intervall := (REAL32 ROUND interval)/15625.0 (REAL32)
PROC initialization()
SEQ
SEQ j= 0 FOR m
SEQ i= 0 FOR n
SEQ
x[(n*j)+i] := REAL32 ROUND i
y[(n*j)+i] := 0.0 (REAL32)
SEQ k= mn FOR 14
SEQ
x[k] := 0.0 (REAL32)
y[k] := 0.0 (REAL32)
SEQ
initialization()
host(xyin,xyout)
-- print output.
write.full.string(screen, " Running ")
write.full.string(screen, " Output ")
output.result(screen)
write.to.file()
INT any:
keyboard ? any
APPENDIX B
Occam Program for 2D Convolution
#USE linkaddr
PROTOCOL pair IS REAL32 ; REAL32:
CHAN OF pair xyin, xyout :
[12]CHAN OF pair xyc :
VAL time IS 286:  -- size of image + No. of columns + 14
-- Kernel elements values.
VAL REAL32 a0 IS -1.0(REAL32):
VAL REAL32 a1 IS -1.0(REAL32):
VAL REAL32 a2 IS -1.0(REAL32):
VAL REAL32 a3 IS -1.0(REAL32):
VAL REAL32 a4 IS 8.0(REAL32):
VAL REAL32 a5 IS -1.0(REAL32):
VAL REAL32 a6 IS -1.0(REAL32):
VAL REAL32 a7 IS -1.0(REAL32):
VAL REAL32 a8 IS -1.0(REAL32):
PROTOCOL pair IS REAL32 ; REAL32:
PROC system1(CHAN OF pair xyin,xyout,  -- cell A
VAL INT time, VAL REAL32 a0)
[3]CHAN OF pair xyc :
PROC cell1(CHAN OF pair xyin,xyout,
VAL REAL32 a)
[4]CHAN OF REAL32 xd:
[2]CHAN OF REAL32 yd:
CHAN OF REAL32 pd:
PROC malt ( CHAN OF REAL32 xin,yout)
#USE userio
REAL32 x,y:
SEQ
x := 0.0(REAL32)
y := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
xin ? x
yout ! y
y := x*a
PROC add ( CHAN OF pair xyout,
CHAN OF REAL32 pin,yin,xin)
[2]REAL32 p:
REAL32 y:
REAL32 x:
SEQ
p[0] := 0.0(REAL32)
p[1] := 0.0(REAL32)
y := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
pin ? p[0]
yin ? y
xin ? x
xyout ! x ; p[1]
p[1] := p[0] + y
PROC pass ( CHAN OF REAL32 zin,zout,xout)
[2]REAL32 z:
REAL32 x:
SEQ
z[0] := 0.0(REAL32)
z[1] := 0.0(REAL32)
x := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
zin ? z[0]
zout ! z[1]
xout ! x
SEQ
z[1] := z[0]
x := z[0]
PROC pass1 (CHAN OF pair xyin,
CHAN OF REAL32 yout,zout)
[2]REAL32 z:
REAL32 y:
SEQ
z[0] := 0.0(REAL32)
z[1] := 0.0(REAL32)
y := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
xyin ? z[0] ; y
zout ! z[1]
yout ! y
z[1] := z[0]
PROC delay ( CHAN OF REAL32 xin,xout)
[2]REAL32 x:
SEQ
x[0] := 0.0(REAL32)
x[1] := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
xin ? x[0]
xout ! x[1]
x[1] := x[0]
PAR
pass1 (xyin,xd[0],xd[1])
pass(xd[1],xd[2],xd[3])
malt(xd[2],yd[0])
delay(yd[0],yd[1])
add(xyout,xd[0],yd[1],xd[3])
SEQ
cell1 (xyin,xyout,a0)
PROTOCOL pair IS REAL32 ; REAL32:
PROC system2(CHAN OF pair xyin,xyout,  -- cell B
VAL INT time,
VAL REAL32 a0)
VAL no.co IS 16:
[3]CHAN OF pair xyc :
PROC cell2(CHAN OF pair xyin,xyout,
VAL REAL32 a)
[no.co+2]CHAN OF REAL32 xd:
[2]CHAN OF REAL32 yd:
CHAN OF REAL32 pd:
PROC malt ( CHAN OF REAL32 xin,yout)
-- Similar to Proc malt in cell 1
PROC add ( CHAN OF pair xyout,
CHAN OF REAL32 pin,yin,xin)
-- Similar to Proc add in cell 1
PROC pass ( CHAN OF REAL32 zin,zout,xout)
-- Similar to Proc pass in cell 1
PROC pass1 (CHAN OF pair xyin,
CHAN OF REAL32 yout,zout)
-- Similar to Proc pass1 in cell 1
PROC delay ( CHAN OF REAL32 xin,xout)
-- Similar to Proc delay in cell 1
PROC delayo ( CHAN OF REAL32 xin,bout)
-- Constant time delay proc.
REAL32 x:
[no.co] REAL32 b:
INT j:
SEQ
x := 0.0(REAL32)
SEQ k= 0 FOR no.co
b[k] := 0.0(REAL32)
j := 1
SEQ i= 0 FOR time
SEQ
SEQ
xin ? x
bout ! b[j]
b[j] := x
IF
(j > (no.co - 2))
j := 1
TRUE
j := j + 1
PAR
pass1 (xyin,xd[0],xd[1])
pass(xd[1],xd[2],xd[3])
delayo(xd[3],xd[4])
malt(xd[2],yd[0])
delay(yd[0],yd[1])
add(xyout,xd[0],yd[1],xd[4])
SEQ
cell2 (xyin,xyout,a0)
-- Network configuration (9 transputers).
PLACED PAR
PROCESSOR 0 T8
PLACE xyin AT link1.in:
PLACE xyc[0] AT link3.out:
system1 (xyin,xyc[0],time,a0)
PLACED PAR
PROCESSOR 1 T8
PLACE xyc[0] AT link0.in:
PLACE xyc[1] AT link3.out:
system1 (xyc[0],xyc[1],time,a1)
PLACED PAR
PROCESSOR 2 T8
PLACE xyc[1] AT link0.in:
PLACE xyc[2] AT link3.out:
system2 (xyc[1],xyc[2],time,a2)
PLACED PAR
PROCESSOR 3 T8
PLACE xyc[2] AT link0.in:
PLACE xyc[3] AT link2.out:
system1 (xyc[2],xyc[3],time,a3)
PLACED PAR
PROCESSOR 4 T8
PLACE xyc[3] AT link1.in:
PLACE xyc[4] AT link0.out:
system1 (xyc[3],xyc[4],time,a4)
PLACED PAR
PROCESSOR 5 T8
PLACE xyc[4] AT link3.in:
PLACE xyc[5] AT link0.out:
system2 (xyc[4],xyc[5],time,a5)
PLACED PAR
PROCESSOR 6 T8
PLACE xyc[5] AT link3.in:
PLACE xyc[6] AT link0.out:
system1 (xyc[5],xyc[6],time,a6)
PLACED PAR
PROCESSOR 7 T8
PLACE xyc[6] AT link3.in:
PLACE xyc[7] AT link2.out:
system1 (xyc[6],xyc[7],time,a7)
PLACED PAR
PROCESSOR 8 T8
PLACE xyc[7] AT link1.in:
PLACE xyout AT link0.out:
system1 (xyc[7],xyout,time,a8)
APPENDIX C
Occam Program for Gradient Operator
#USE linkaddr
PROTOCOL three IS REAL32 ; REAL32 ; REAL32 :
CHAN OF three xyin, xyout :
[12]CHAN OF three xyc :
VAL time IS 286:  -- size of image + no.co + 14
VAL REAL32 a0 IS -1.0(REAL32):
VAL REAL32 a1 IS -1.0(REAL32):
VAL REAL32 a2 IS 1.0(REAL32):
VAL REAL32 a3 IS 1.0(REAL32):
PROTOCOL three IS REAL32 ; REAL32 ; REAL32 :
PROC system1(CHAN OF three xyin,xyout,
VAL INT time,
VAL REAL32 a0)
VAL no.co IS 16:
PROC cell1(CHAN OF three xyin,xyout,
VAL REAL32 a)
[no.co+2]CHAN OF REAL32 xd:
[2]CHAN OF REAL32 yd:
CHAN OF REAL32 rd:
PROC malt ( CHAN OF REAL32 xin,yout)
#USE userio
REAL32 x,y:
SEQ
x := 0.0(REAL32)
y := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
xin ? x
yout ! y
y := x*a
PROC add ( CHAN OF three xyout,
CHAN OF REAL32 pin,yin,xin,rin)
[2]REAL32 p:
REAL32 y:
REAL32 x,r:
SEQ
p[0] := 0.0(REAL32)
p[1] := 0.0(REAL32)
x := 0.0(REAL32)
y := 0.0(REAL32)
r := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
PAR
pin ? p[0]
yin ? y
xin ? x
rin ? r
xyout ! x ; p[1] ; r
p[1] := p[0] + y
PROC pass ( CHAN OF REAL32 zin,zout,xout)
[2]REAL32 z:
REAL32 x:
SEQ
z[0] := 0.0(REAL32)
z[1] := 0.0(REAL32)
x := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
PAR
zin ? z[0]
zout ! z[1]
xout ! x
SEQ
z[1] := z[0]
x := z[0]
PROC pass1 (CHAN OF three xyin,
CHAN OF REAL32 yout,xout,rout)
[2]REAL32 x:
REAL32 y,r:
SEQ
x[0] := 0.0(REAL32)
x[1] := 0.0(REAL32)
y := 0.0(REAL32)
r := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
PAR
xyin ? x[0] ; y ; r
xout ! x[1]
yout ! y
rout ! r
x[1] := x[0]
PROC delay ( CHAN OF REAL32 xin,xout)
[2]REAL32 x:
SEQ
x[0] := 0.0(REAL32)
x[1] := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
xin ? x[0]
xout ! x[1]
x[1] := x[0]
PROC delayo ( CHAN OF REAL32 xin,bout)
-- Constant time delay proc.
REAL32 x:
[no.co+8] REAL32 b:
INT j:
SEQ
x := 0.0(REAL32)
SEQ k= 0 FOR no.co
b[k] := 0.0(REAL32)
j := 1
SEQ i= 0 FOR time
SEQ
SEQ
xin ? x
bout ! b[j]
b[j] := x
IF
(j > (no.co - 3))
j := 1
TRUE
j := j + 1
PAR
pass1 (xyin,xd[0],xd[1],rd)
pass(xd[1],xd[2],xd[3])
delayo(xd[3],xd[4])
malt(xd[2],yd[0])
delay(yd[0],yd[1])
add(xyout,xd[0],yd[1],xd[4],rd)
SEQ
cell1 (xyin,xyout,a0)
PROTOCOL three IS REAL32 ; REAL32 ; REAL32 :
PROC system2(CHAN OF three xyin,xyout,
VAL INT time,
VAL REAL32 a0)
PROC cell2(CHAN OF three xyin,xyout,
VAL REAL32 a)
[5]CHAN OF REAL32 xd:
[2]CHAN OF REAL32 yd:
CHAN OF REAL32 rd:
PROC malt ( CHAN OF REAL32 xin,yout)
-- Similar to Proc malt in cell 1
PROC add ( CHAN OF three xyout,
CHAN OF REAL32 pin,yin,xin,rin)
-- Similar to Proc add in cell 1
PROC pass ( CHAN OF REAL32 zin,zout,xout)
-- Similar to Proc pass in cell 1
PROC pass1 (CHAN OF three xyin,
CHAN OF REAL32 yout,xout,rout)
-- Similar to Proc pass1 in cell 1
PROC delay ( CHAN OF REAL32 xin,xout)
-- Similar to Proc delay in cell 1
SEQ
PAR
pass1 (xyin,xd[0],xd[1],rd)
pass(xd[1],xd[2],xd[3])
delay(xd[3],xd[4])
malt(xd[2],yd[0])
delay(yd[0],yd[1])
add(xyout,xd[0],yd[1],xd[4],rd)
SEQ
cell2 (xyin,xyout,a0)
PROTOCOL three IS REAL32 ; REAL32 ; REAL32 :
PROC system3(CHAN OF three xyin,xyout,
VAL INT time,
VAL REAL32 a0)
VAL no.co IS 16:
PROC cell3(CHAN OF three xyin,xyout,
VAL REAL32 a)
[no.co+2]CHAN OF REAL32 xd:
[2]CHAN OF REAL32 yd:
CHAN OF REAL32 rd:
PROC malt ( CHAN OF REAL32 xin,yout)
-- Similar to Proc malt in cell 1
PROC add ( CHAN OF three xyout,
CHAN OF REAL32 pin,yin,xin,rin)
-- Similar to Proc add in cell 1
PROC pass ( CHAN OF REAL32 zin,zout,xout)
-- Similar to Proc pass in cell 1
PROC pass1 (CHAN OF three xyin,
CHAN OF REAL32 yout,xout,rout)
-- Similar to Proc pass1 in cell 1
PROC delay ( CHAN OF REAL32 xin,xout)
-- Similar to Proc delay in cell 1
PROC delayo ( CHAN OF REAL32 xin,bout)
-- Constant time delay proc.
REAL32 x:
[no.co+8] REAL32 b:
INT j:
SEQ
x := 0.0(REAL32)
SEQ k= 0 FOR no.co
b[k] := 0.0(REAL32)
j := 1
SEQ i= 0 FOR time
SEQ
SEQ
xin ? x
bout ! b[j]
b[j] := x
IF
(j > (no.co - 1))
j := 1
TRUE
j := j + 1
PAR
pass1 (xyin,xd[0],xd[1],rd)
pass(xd[1],xd[2],xd[3])
delayo(xd[3],xd[4])
malt(xd[2],yd[0])
delay(yd[0],yd[1])
add(xyout,xd[0],yd[1],xd[4],rd)
SEQ
cell3 (xyin,xyout,a0)
PROTOCOL three IS REAL32 ; REAL32 ; REAL32 :
PROC system4(CHAN OF three xyin,xyout,
VAL INT time,
VAL REAL32 a0)
[2]CHAN OF three xyc :
PROC cell4(CHAN OF three xyin,xyout,
VAL REAL32 a)
[4]CHAN OF REAL32 xd:
[2]CHAN OF REAL32 yd:
CHAN OF REAL32 rd:
PROC malt ( CHAN OF REAL32 xin,yout)
-- Similar to Proc malt in cell 1
PROC add ( CHAN OF three xyout,
CHAN OF REAL32 pin,yin,xin,rin)
-- Similar to Proc add in cell 1
PROC pass ( CHAN OF REAL32 zin,zout,xout)
-- Similar to Proc pass in cell 1
PROC delay ( CHAN OF REAL32 xin,xout)
-- Similar to Proc delay in cell 1
PROC pass1 (CHAN OF three xyin,
CHAN OF REAL32 yout,xout,rout)
-- x[0] is x1 input
-- x[1] is x1 output
-- r is x2 output to malt proc.
-- y is main output to add proc.
[2]REAL32 x:
REAL32 y,r:
SEQ
x[0] := 0.0(REAL32)
x[1] := 0.0(REAL32)
y := 0.0(REAL32)
r := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
PAR
SEQ
xyin ? x[0] ; y ; r
xout ! x[1]
yout ! y
rout ! r
x[1] := x[0]
PAR
pass1 (xyin,xd[0],xd[1],rd)
pass(xd[1],xd[2],xd[3])
malt(xd[2],yd[0])
delay(yd[0],yd[1])
add(xyout,xd[0],yd[1],xd[3],rd)
PROC final(CHAN OF three zyin,zyout)  -- cell 5
[4]CHAN OF REAL32 rd:
CHAN OF REAL32 yd,xd:
PROC delay ( CHAN OF REAL32 xin,xout)
-- Similar to Proc delay in cell 1
PROC pass1 (CHAN OF three xyin,
CHAN OF REAL32 yout,zout,rout)
REAL32 z:
REAL32 y,r:
SEQ
z := 0.0(REAL32)
y := 0.0(REAL32)
r := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
PAR
xyin ? z ; y ; r
zout ! z
yout ! y
rout ! r
PROC sqrt ( CHAN OF three xyout,
CHAN OF REAL32 pin,xin,rin)
[2]REAL32 z:
REAL32 r,p:
REAL32 x:
SEQ
z[0] := 0.0(REAL32)
z[1] := 0.0(REAL32)
r := 0.0(REAL32)
x := 0.0(REAL32)
p := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
SEQ
PAR
pin ? p
xin ? x
rin ? r
xyout ! p ; r ; z[1]
z[0] := (p*p)+(r*r)
z[1] := SQRT(z[0])
PAR
pass1 (zyin,yd,xd,rd[0])
delay(rd[0],rd[1])
sqrt(zyout,yd,xd,rd[1])
SEQ
PAR
cell4 (xyin,xyc[0],a0)
final (xyc[0],xyout)
-- Network configuration (4 transputers).
PLACED PAR
PROCESSOR 0 T8
PLACE xyin AT link1.in:
PLACE xyc[0] AT link3.out:
system1 (xyin,xyc[0],time,a0)
PLACED PAR
PROCESSOR 1 T8
PLACE xyc[0] AT link0.in:
PLACE xyc[1] AT link3.out:
system2 (xyc[0],xyc[1],time,a1)
PLACED PAR
PROCESSOR 2 T8
PLACE xyc[1] AT link0.in:
PLACE xyc[2] AT link3.out:
system3 (xyc[1],xyc[2],time,a2)
PLACED PAR
PROCESSOR 3 T8
PLACE xyc[2] AT link0.in:
PLACE xyout AT link3.out:
system4 (xyc[2],xyout,time,a3)
APPENDIX D
Occam Program for the Filter Library (PARC-IPL)
EXTERNAL proc abort.program :
EXTERNAL proc open.file(value path.name[], access[], chan io.chan) :
EXTERNAL proc close.file(chan io.chan) :
EXTERNAL proc str.to.chan(chan c, value s[]) :
EXTERNAL proc fp.num.to.chan(chan c, value float f) :
EXTERNAL proc fp.num.from.chan(chan c, var float f) :
EXTERNAL proc num.to.chan(chan c, value n) :
EXTERNAL proc num.from.chan(chan c, var n) :
EXTERNAL proc str.to.screen(value s[]) :
EXTERNAL proc fp.num.to.screen(value float f) :
EXTERNAL proc num.to.screen(value n) :
EXTERNAL proc fp.num.from.keyboard(var float f) :
EXTERNAL proc num.from.keyboard(var n) :
EXTERNAL proc s11 :
EXTERNAL proc s14 :
EXTERNAL proc s15 :
EXTERNAL proc s16 :
EXTERNAL proc s17 :
EXTERNAL proc s18 :
PROC system = VAR xo.f, no.f, run, vv, co, go, tr :
SEQ
str.to.screen("*n ______________________________________ ")
str.to.screen("*n I THIS IS AN OCCAM PROGRAM LIBRARY I")
str.to.screen("*n I I")
str.to.screen("*n I IMAGE PROCESSING I")
str.to.screen("*n I FILTER LIBRARY I")
str.to.screen("*n I I")
str.to.screen("*n I N.B. To exit from the system enter 99 I")
str.to.screen("*n I______________________________________I")
str.to.screen("*n ")
str.to.screen("*n I Have your input data at file name I")
str.to.screen("*n I [ image.in ] I")
str.to.screen("*n ")
str.to.screen("*n If you want to use this library enter 1 = ")
str.to.screen("*n I N.B. To exit from the system enter 99 I")
str.to.screen("*n ")
str.to.screen("*n Type filter number = ")
num.from.keyboard(no.f)
num.to.screen(no.f)
if
no.f = 1
seq
str.to.screen("*n _________________ ")
str.to.screen("*n I Laplacian filter No : 1 I")
s11
vv := 0
str.to.screen("*n I Your output data in file I")
str.to.screen("*n I [image1.out] I")
no.f = 2
seq
str.to.screen("*n _________________ ")
str.to.screen("*n I Gradient filter No : 2 I")
s14
vv := 0
str.to.screen("*n I Your output data in file I")
str.to.screen("*n I [image2.out] I")
no.f = 3
seq
str.to.screen("*n _________________ ")
str.to.screen("*n I Mean filter No : 3 I")
s15
vv := 0
str.to.screen("*n I Your output data in file I")
str.to.screen("*n I [image3.out] I")
no.f = 4
seq
str.to.screen("*n _________________ ")
str.to.screen("*n I Weighted mean filter No : 4 I")
s16
vv := 0
str.to.screen("*n I Your output data in file I")
str.to.screen("*n I [image4.out] I")
no.f = 5
seq
str.to.screen("*n _________________ ")
str.to.screen("*n I Inverse Gradient filter No : 5 I")
s17
vv := 0
str.to.screen("*n I Your output data in file I")
str.to.screen("*n I [image5.out] I")
no.f = 6
seq
str.to.screen("*n _________________ ")
str.to.screen("*n I Sigma filter No : 6 I")
s18
vv := 0
str.to.screen("*n I Your output data in file I")
str.to.screen("*n I [image6.out] I")
no.f = 99
vv := 2
true
seq
str.to.screen("*n Sorry, no such filter ")
str.to.screen("*n If you want to try again enter 1 ")
str.to.screen("*n else enter 99 ")
num.from.keyboard(vv)
if
vv = 0
seq
str.to.screen("*n Do you want to choose another filter ")
str.to.screen("*n if yes type 1 ")
str.to.screen("*n if no type 0 ")
num.from.keyboard(co)
if
co = 0
seq
str.to.screen("*n I Filter library exits I")
run := FALSE
TRUE
run := TRUE
vv = 1
run := TRUE
TRUE
seq
str.to.screen("*n I Filter library exits I")
run := FALSE
TRUE
seq
str.to.screen("*n I Filter library exits I") :
SEQ
system
APPENDIX E
Occam Program for Toeplitz System
(Double-Sided Systolic Array)
EXTERNAL proc abort.program :
EXTERNAL proc open.file(value path.name[], access[], chan io.chan) :
EXTERNAL proc close.file(chan io.chan) :
EXTERNAL proc str.to.chan(chan c, value s[]) :
EXTERNAL proc fp.num.to.chan(chan c, value float f) :
EXTERNAL proc fp.num.from.chan(chan c, var float f) :
EXTERNAL proc num.to.chan(chan c, value n) :
EXTERNAL proc num.from.chan(chan c, var n) :
EXTERNAL proc str.to.screen(value s[]) :
EXTERNAL proc fp.num.to.screen(value float f) :
EXTERNAL proc num.to.screen(value n) :
EXTERNAL proc fp.num.from.keyboard(var float f) :
EXTERNAL proc num.from.keyboard(var n) :
PROC cell1 (CHAN gin, gout, zout,
VALUE m,n, value float b,c) =
VAR float g,a,z :
seq
--initialisation
z:=0.0
seq j=[0 for m]
seq
zout ! z
seq
a:=c
z:=0.0
seq i=[0 for n]
seq
par
gin ? g
par
gout ! g
z := z+(a*g)
a:=a/b
z := z/(1.0-(c*b)):
PROC cell2 (CHAN gin, pin, pout,
VALUE m,n, value float b) =
VAR float g,p :
seq
seq j=[0 for m]
seq
pin ? p
seq i=[0 for n]
seq
par
gin ? g
p := g+(b*p)
par
pout ! p:
PROC cell3 (CHAN gin, pout,
VALUE m,n, value float cc) =
VAR float g,p :
seq
seq j=[0 for m]
seq i=[0 for n]
seq
par
gin ? g
p := g/cc
par
pout ! p:
proc delay (chan xin, xout,
value m,n) =
var float x[2] :
seq
par i=[0 for 2]
x[i] := 0.0
seq i=[0 for m]
seq j=[0 for n]
seq
par
xin ? x[0]
xout ! x[1]
x[1] := x[0]:
PROC cellt (CHAN xin, xout,
VALUE m,n,
value float b,c) =
CHAN xd[n+1],yd:
SEQ
par
cell1 (xin,xd[0],yd,m,n,b,c)
par i=[0 for n]
delay (xd[i],xd[i+1],m,n)
cell2 (xd[n],yd,xout,m,n,b):
PROC host1 (CHAN gaout,
VALUE m,n,
VAR float g[],b[]) =
SEQ
SEQ i=[0 for m]
SEQ j=[0 for n]
SEQ
gaout ! g[(n*i)+(n-(j+1))]:
PROC host2 (CHAN yin, yout,
VALUE m,n,
VAR float y[],b[],cc) =
SEQ
SEQ i=[0 for m]
SEQ
SEQ j=[0 for n]
SEQ
yin ? y[(n-(j+1))]
SEQ j=[0 for n]
SEQ
y[j]:=y[j]/cc
yout ! y[j]:
PROC host3 (CHAN zin,
VALUE m,n,q,
VAR float z[]) =
CHAN io:
VAR float x[n+1],y[n+1],zz:
VAR no,nn,ss:
SEQ
nn:=(q*n)\6
no:=((q*n)+(6-nn))/6
ss:=(q*n)
open.file("ss1.out","w",io)
str.to.chan(io,"*c *n Output")
str.to.chan(io,"*c *n x(n) = ")
SEQ i=[0 for m-q]
SEQ j=[0 for n]
SEQ
PAR
zin ? z[j]
SEQ i=[0 for q]
SEQ
str.to.chan(io,"*c *n ")
SEQ j=[0 for n]
SEQ
PAR
zin ? z[(n-(j+1))]
SEQ j=[0 for n]
SEQ
fp.num.to.chan(io,z[j])
str.to.chan(io," ")
str.to.chan(io,"*c *n ")
close.file(io):
PROC sssystem (CHAN gc[],zc[],
VALUE time,m,n,q) =
CHAN zpc:
CHAN io:
VAR FLOAT g[time],z[time],y[time],b[q],c[q],cc[4],a1,l1,l2,l3:
SEQ
SEQ j=[0 for q]
SEQ
str.to.screen("*n Give b[j] = ")
fp.num.from.keyboard(b[j])
fp.num.to.screen(b[j])
str.to.screen("*n cc = ")
fp.num.from.keyboard(cc[0])
fp.num.to.screen(cc[0])
str.to.screen("*n g[j] = ")
g[0]:=9.328125
g[1]:=7.78125
g[2]:=10.546875
g[3]:=14.0625
g[4]:=16.828125
g[5]:=15.28125
g[6]:=9.328125
g[7]:=7.78125
g[8]:=10.546875
g[9]:=14.0625
g[10]:=16.828125
g[11]:=15.28125
g[12]:=9.328125
g[13]:=7.78125
g[14]:=10.546875
g[15]:=14.0625
g[16]:=16.828125
g[17]:=15.28125
seq j=[0 for n]
seq
str.to.screen(" ")
fp.num.to.screen(g[j])
seq j=[0 for q]
seq
c[j]:=b[j]
seq i=[0 for n-2]
c[j]:=c[j]*b[j]
PAR
host1 (gc[0],m,n,g,b)
par i=[0 for q]
cellt(gc[i],gc[i+1],m,n,b[i],c[i])
host2 (gc[q],gc[q+1],m,n,y,b,cc[0])
par i=[1 for q]
cellt(gc[q+i],gc[q+(i+1)],m,n,b[(q)-i],c[(q)-i])
host3 (gc[(2*q)+1],m,n,q,z):
PROC system (CHAN gc[],zc[])=
VAR time,rimn,m,n,q,r:
SEQ
str.to.screen("*n Give the total number of r.s.c = ")
num.from.keyboard(r)
num.to.screen(r)
str.to.screen("*n Give the total number of q = ")
num.from.keyboard(m)
num.to.screen(m)
str.to.screen("*n Give the total number of rows = ")
num.from.keyboard(n)
num.to.screen(n)
str.to.screen("*n ")
rimn:=(n+2)\5
q:=m-1
r:=r-1
m:=((m-1)*2)+(1+r)
time:=m*n
sssystem (gc,zc,time,m,n,q):
--main
CHAN gc[19],zc[9] :
SEQ
system (gc,zc)