Loughborough University Institutional Repository
The design of image processing algorithms on
parallel computers
This item was submitted to Loughborough University's Institutional Repository by the/an author.
Additional Information:
• A Doctoral Thesis. Submitted in partial fulfilment of the requirements for the award of Doctor of Philosophy at Loughborough University.
Metadata Record: https://dspace.lboro.ac.uk/2134/27573
Publisher: © Saad Ali Amin
Rights: This work is made available according to the conditions of the Creative Commons Attribution-NonCommercial-NoDerivatives 2.5 Generic (CC BY-NC-ND 2.5) licence. Full details of this licence are available at: http://creativecommons.org/licenses/by-nc-nd/2.5/
Please cite the published version.
THE DESIGN OF IMAGE PROCESSING
ALGORITHMS
ON PARALLEL COMPUTERS
By
Saad Ali Amin, MPhil., BSc.
A Doctoral Thesis submitted in partial fulfilment
of the requirements for the
Award of the Degree of
Doctor of Philosophy
of Loughborough University of Technology
April 1993
© Saad Ali Amin, 1993
CERTIFICATE OF ORIGINALITY
This is to certify that I am responsible for the work submitted in this thesis,
that the original work is my own except as specified in acknowledgements
or in footnotes, and that neither the thesis nor the original work contained
therein has been submitted to this or any other institution for a higher degree.
S.A.Amin
To the memory of my father Ali Amin.
To my mother, for her unfailing support.
To my wife, Maha, for her patience and love; the two things I needed most
in hard times.
To my daughter Shahad and my son Ali. And last, but by no means least,
to my brother and two sisters.
Acknowledgments
I wish to express my sincere thanks and gratitude to Professor D.J. Evans,
the director of the research, for his diligent guidance, useful suggestions and
advice throughout the course of the research and the preparation of this thesis,
and also for his kindness and support, without which this work would never
have been completed.
My thanks also go to my supervisor, Dr. A. Benson.
My gratitude is also extended to my friends Dr. M.A. Saeed and Dr.
W.S. Yosif for their helpful comments and suggestions during the preparation
of this thesis.
My deep appreciation also goes to my dear sister Sulaf and her husband
Fiasal for their endless support and love.
Finally, I wish to acknowledge the financial support offered by the
Scientific Research Council in Iraq.
ABSTRACT
This thesis is a study of the design of parallel algorithms for low-level digital
image processing on Very Large Scale Integration (VLSI) processor arrays which
are implemented on both a Sequent Balance (MIMD) via an Occam simulator and a
transputer network running the Transputer Development System (TDS). The Occam
programming language is used as a tool to simulate and map systolic arrays for the
image processing algorithms proposed.
In chapter 1 a general introduction to parallel processing is presented. This
chapter covers a wide selection of the principles of the significant parallel computer
architectures, and various classifications of parallel architectures are presented.
Chapter 2 starts with a brief review of the main contenders among VLSI-oriented
computing systems. The transputer architecture and the associated language
Occam are then described, together with the Transputer Development System (TDS),
followed by a detailed description of the hardware used for the research. The
chapter concludes with a description of the basic techniques for the design of
systolic arrays.
Chapter 3 presents the techniques for filtering digital images, for both
low-pass and high-pass filtering, and includes the basic mathematical theory involved,
followed by a discussion of the use of systolic arrays and the various methods for
implementing these algorithms on different types of parallel architectures.
To achieve greater efficiency in the processor arrays, systolic arrays for one-
dimensional and two-dimensional digital convolution are introduced and discussed
in chapter 4. The implementation of these systolic arrays on transputer networks
is also presented and their performance on transputer networks analysed.
In chapter 5 systolic array designs for the Laplacian and gradient operators are
presented, and the transputer implementation of these systolic arrays is evaluated with
different sizes of transputer networks. The systolic array designs for the
Prewitt and Sobel operators are also introduced.
The parallel implementation of the low-level image processing filters is
covered in chapter 6. A set of systolic systems for digital image filters, namely the
Sigma, inverse gradient, mean and weighted mean filters, is discussed, and the
implementation of each of these designs on transputer networks is examined.
Chapter 6 also presents a low-level image processing system software library,
which includes systolic designs for both low-pass and high-pass filters. The
motivation for the work is to develop a methodology for the implementation of an
image processing library on the Sequent Balance.
Further, the problem of solving banded Toeplitz systems occurs frequently in
filter design. In chapter 7 a new operator method for their solution is
introduced and shown to be applicable. Some systolic array designs to solve this
problem are discussed.
Some conclusions and a discussion of further research topics in image
processing and in systolic array algorithms are the subject of the last chapter.
The thesis concludes with a comprehensive reference list and Appendices,
together with a selection of systolic programs in Occam.
CONTENTS

ABSTRACT

1 INTRODUCTION TO PARALLEL COMPUTER ARCHITECTURES
1.1 INTRODUCTION
1.2 MAIN MOTIVATIONS
1.3 PIPELINED COMPUTERS
1.4 DATA-FLOW COMPUTERS
1.5 ARRAY PROCESSORS
1.6 DESIGN CLASSIFICATIONS
1.6.1 Flynn's Classification
1.6.2 Shore's Classification
1.6.3 Other Classification Approaches
1.7 MULTIPROCESSOR STRUCTURE: PROCESSING AND COMMUNICATION

2 THE VLSI TECHNOLOGY AND SYSTOLIC PARADIGM
2.1 INTRODUCTION
2.2 VLSI-ORIENTED ARCHITECTURES
2.2.1 The WARP Architecture
2.2.2 The Wavefront Array Processor (WAP)
2.2.3 The CHIP Architecture
2.3 INMOS TRANSPUTERS AND OCCAM
2.3.1 Transputer Architectures
2.3.2 OCCAM
2.3.3 Transputer Development System
2.3.4 Performance Measurements of a Transputer Network
2.3.5 The Transputer Network Used for this Research
2.4 THE SEQUENT BALANCE 8000 SYSTEM
2.5 SYSTOLIC SYSTEM FOR VLSI COMPUTING STRUCTURES
2.5.1 An Environment for the Development of the Systolic Approach
2.5.2 Systolic Algorithms, Constraints and Classification
2.5.3 Systolic Array Simulation

3 FUNDAMENTALS OF DIGITAL IMAGE PROCESSING
3.1 INTRODUCTION
3.2 LOW-LEVEL IMAGE PROCESSING ALGORITHMS
3.2.1 Parallel Paradigm
3.2.1.1 Image Parallelism
3.2.1.2 Task Parallelism
3.3 IMAGE FILTERING
3.3.1 Digital Approximations to the Gradient and Laplacian Operators
3.3.2 Low Pass and High Pass Filters
3.4 VLSI IMPLEMENTATION FOR LOW LEVEL IMAGE PROCESSING
3.4.1 Systolic Array Implementation
3.4.2 Pyramid Architecture

4 SYSTOLIC DESIGNS FOR DIGITAL CONVOLUTION
4.1 INTRODUCTION
4.2 ONE DIMENSIONAL CONVOLUTION DESIGN
4.2.1 Problem Definition
4.2.2 Systolic Design
4.3 TWO DIMENSIONAL CONVOLUTION
4.3.1 Problem Definition
4.3.2 Computation of 2D Convolution as 1D Convolution
4.3.3 Systolic Array Design for 2D Convolution
4.4 MULTIDIMENSIONAL CONVOLUTION
4.5 CONSTANT TIME OPERATION
4.6 TRANSPUTER NETWORK FOR ONE DIMENSIONAL CONVOLUTION
4.7 PERFORMANCE OF THE ONE DIMENSIONAL CONVOLUTION SYSTOLIC DESIGN ON THE TRANSPUTER NETWORK
4.8 TRANSPUTER NETWORK FOR TWO DIMENSIONAL CONVOLUTION
4.9 PERFORMANCE OF THE TWO DIMENSIONAL CONVOLUTION SYSTOLIC DESIGN ON THE TRANSPUTER NETWORK
4.10 ANALYSIS AND COMPARISON OF THE TWO-DIMENSIONAL SYSTOLIC ARRAY

5 PARALLEL IMPLEMENTATION OF THE LAPLACIAN AND GRADIENT OPERATORS IN COMPUTER VISION
5.1 INTRODUCTION
5.2 SYSTOLIC DESIGNS FOR DIGITAL IMAGE FILTERS
5.3 PARALLEL IMPLEMENTATION OF THE LAPLACIAN OPERATOR
5.3.1 Laplacian Operator Algorithms
5.3.2 Systolic Array for the Laplacian Operator
5.3.2.1 Plus-shaped Laplacian Operator Design
5.3.2.2 Systolic Array Design for Square Laplacian Operator
5.3.3 Designing Transputer Networks for the Laplacian Operator
5.3.4 Performance of the Laplacian Operator Systolic Design on Transputer Network
5.4 PARALLEL IMPLEMENTATION OF THE GRADIENT OPERATOR
5.4.1 Gradient Operator Algorithms
5.4.2 Systolic Array Design for the Gradient Operator
5.4.3 Transputer Network for Gradient Operator
5.4.4 Prewitt and Sobel Operator Algorithms
5.4.5 Systolic Array Design for Prewitt and Sobel Operators
5.4.6 Performance of the Gradient Operator Systolic Design on the Transputer Network

6 LOW-LEVEL IMAGE PROCESSING AND FILTER SOFTWARE LIBRARY DEVELOPMENT
6.1 INTRODUCTION
6.2 PARALLEL IMPLEMENTATION OF THE SIGMA FILTER
6.2.1 Sigma Filter Algorithm
6.2.2 Systolic Array Design for the Sigma Filter
6.2.3 Transputer Network for Sigma Filter
6.2.4 Performance of the Sigma Filter Systolic Array
6.3 PARALLEL IMPLEMENTATION OF THE INVERSE GRADIENT FILTER
6.3.1 Inverse Gradient Algorithm Transputer Network
6.3.2 Systolic Array Implementation for the Inverse Gradient Filter
6.3.3 Transputer Network for the Inverse Gradient Filter
6.3.4 Performance of the Inverse Gradient Filter Systolic Array on the Transputer Network
6.4 PARALLEL IMPLEMENTATION OF THE MEAN AND WEIGHTED MEAN FILTERS
6.4.1 Mean Filter Algorithms
6.4.2 Weighted Mean Filter Algorithm
6.4.3 Systolic Design for Mean Filter
6.4.4 Transputer Network for the Mean and Weighted Mean Filter
6.4.5 Performance of the Mean and Weighted Mean Filter Systolic Designs on the Transputer Networks
6.5 AN ENVIRONMENT FOR DEVELOPING LOW LEVEL IMAGE PROCESSING ON PARALLEL COMPUTERS
6.5.1 Introduction
6.5.2 Background
6.5.3 The Workbench
6.5.3.1 Software Structure
6.5.3.2 The User Interface
6.5.3.3 Execution Mode
6.5.4 Implementation
6.5.5 Workbench Facilities
6.5.6 Image Input, Output and Data Types
6.5.7 Types of Kernel
6.5.8 Contents of Library
6.5.9 Extending the Environment

7 SYSTOLIC ALGORITHM FOR THE SOLUTION OF TOEPLITZ MATRICES
7.1 INTRODUCTION
7.1.1 Digital Contour Smoothing
7.2 SOLUTION OF CERTAIN TOEPLITZ SYSTEMS
7.2.1 Tridiagonal Case
7.2.2 Quindiagonal Case
7.2.3 The General Case
7.3 SYSTOLIC ARRAY IMPLEMENTATION FOR TOEPLITZ MATRICES
7.3.1 Systolic Array Design for Tridiagonal Case
7.3.2 Systolic Array Design for the Quindiagonal Case
7.3.3 Systolic Array Design for General Banded Toeplitz Matrices
7.3.4 Numerical Test Example
7.4 PERFORMANCE OF THE SYSTOLIC ARRAY
7.4.1 Timing of the Systolic Array
7.4.2 Area of the Systolic Array

8 CONCLUSION AND DISCUSSION

REFERENCES
APPENDICES
APPENDIX A
APPENDIX B
APPENDIX C
APPENDIX D
APPENDIX E
CHAPTER 1
INTRODUCTION TO PARALLEL COMPUTER ARCHITECTURES
1.1 INTRODUCTION
Many scientific and engineering applications demand the greatest possible
speed, throughput, performance and flexibility, together with a high level of
availability and reliability, and many of them need to be solved in real time.
Amongst the huge number of computer applications, which range from simple
personal computer games to weather forecasting calculations, image processing
and satellite transmission programmes, there are many that require large
amounts of computational time. In an attempt to meet the challenging problem of
providing fast and economical computation, Large-Scale Parallel Computers were
developed. In fact, until recently, computational speed was derived only from the
development of faster electronic devices.
The current technology has gone a long way to increase the speed of operations
and the development continues. There is of course a natural limitation in
technology development; no signal can propagate faster than the speed of light.
In the late 1960s, Integrated Circuits (ICs) were used in computer design and
were followed by Large Scale Integrated (LSI) techniques. The Very Large-Scale
Integrated Circuits (VLSI), developed a few years ago, are currently being used in
the design of very high speed special and general purpose computer systems.
Until eight years ago, the state of electronic technology was such that
all factors affecting computational speed were almost minimised, and any further
increase in computational speed could only be achieved through both increased
switching speeds and increased circuit density. Hence, even if switching times are
almost instantaneous, distances between any two points in a circuit may not be
small enough to minimise the propagation delays and thus improve computational
speed. Therefore, the achievement of even faster computers is conditional on the use
of new approaches that do not depend on any breakthrough in device technology
but rather on imaginative applications of the skills of the designers of computer
architecture.
As computers were developed, more and more elementary operations were
performed concurrently, on a time-overlap basis. For instance, the fetch procedure
of a new instruction could be started before the previous one was completed. This
was called a prefetch operation. Obviously one approach to increasing speed is
through parallelism.
We can define the concept of parallel processing as a method of organization
of operations in a computing system where more than one operation is performed
concurrently or simultaneously [ Tabak 1989 ].
The parallel computer systems, or multiprocessors as they are commonly
known, not only increase the potential processing speed, but also increase the
overall throughput, flexibility and reliability, and provide tolerance for processor
failures.
Hockney and Jesshope [Hockney 1988] summarised the principal ways of
introducing parallelism at the hardware level of the computer architectures as :
1- The application of pipelining (assembly-line) techniques in order to
improve the performance of the arithmetic or control units. A processor is
decomposed into a certain number of elementary subprocesses, each of
which is capable of execution on dedicated autonomous units.
2- The provision of several independent units, operating in parallel, to
perform some basic fundamental functions such as logic, addition or
multiplications.
3- The provision of an array of processing elements performing
simultaneously the same instruction on a set of different data where the data
is stored in the processing elements (PE) private memory.
4- The provision of several independent processors, working in a
co-operative manner towards the solution of a single task by communicating
via a shared or common memory, each one of them being a complete
computer, obeying its own stored instructions.
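The fourth form, several independent processors co-operating on a single task through a shared memory, can be sketched in modern terms with threads standing in for the processors (an illustrative sketch, not from the thesis; the function name and worker count are invented):

```python
# A minimal sketch of Hockney and Jesshope's fourth form of parallelism:
# several independent workers cooperating on one task through shared memory.
# Python threads stand in for the processors; the shared list for the memory.
import threading

def parallel_sum(data, n_workers=4):
    partial = [0] * n_workers          # shared memory: one slot per worker

    def worker(idx):
        chunk = data[idx::n_workers]   # each worker takes a strided slice
        partial[idx] = sum(chunk)      # write its result to shared memory

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                       # wait for all workers to finish
    return sum(partial)                # combine the partial results
```

Each worker writes only to its own slot of the shared list, so no locking is needed for this particular decomposition.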
One area which has received considerable attention in recent years is the
design of real-time systems for the early processing of sensory data (i.e. low-level
image and signal processing). Most image processing algorithms need
massive amounts of band matrix operations. However, these algorithms contain
explicit parallelism which can be efficiently exploited by processor arrays. All sections
of the image have to be processed in exactly the same way, regardless of the
position of the image section within the image, or the value of the pixel data.
Low-level functions involve matrix-vector operations which are repeated at very
high speed. Such systems must handle large quantities of data (typical images have
512 x 512 pixels) at a high throughput.
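Every pixel receives exactly the same arithmetic regardless of its position, which is what makes these operations map so naturally onto processor arrays. As a minimal serial sketch (illustrative Python, not the thesis's Occam code), a 3x3 mean filter:

```python
# A small illustration of why low-level image processing parallelises well:
# a 3x3 mean filter applies exactly the same arithmetic at every interior
# pixel, independent of where that pixel lies in the image.
def mean_filter_3x3(image):
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):           # interior pixels only
        for j in range(1, cols - 1):
            total = sum(image[i + di][j + dj]
                        for di in (-1, 0, 1) for dj in (-1, 0, 1))
            out[i][j] = total // 9         # identical operation everywhere
    return out
```

Since no output pixel depends on any other output pixel, the two loops could be distributed over an array of processing elements with no inter-iteration communication.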
The main study of this thesis is the design of parallel algorithms for digital
image processing on Very Large Scale Integration (VLSI) processor arrays which
are implemented on both a Sequent Balance (MIMD) via an Occam simulator and a
transputer network running the Transputer Development System (TDS). The Occam
programming language is used as a tool to simulate and map systolic arrays for the
image processing algorithms proposed.
The following sections will cover a selection of the principal parallel
computer architectures (the pipeline, SIMD, MIMD and data-flow machines, and
VLSI systems, especially the Inmos Transputer system and systolic arrays),
which differ sufficiently from each other to illustrate alternative hardware and
software approaches to parallelism.
1.2 MAIN MOTIVATIONS
During the last decade the multiple processor approach has pursued a set of
long-sought motivating goals in order to satisfactorily meet many of the
challenging system design requirements. In reviewing some aspects of parallel
processing systems, one finds that while the hardware is improving at a fast rate,
the software tools to take advantage of the benefits provided are only slowly
emerging, a fact that affects the design motivations mentioned below.
Since the earliest multiple processing systems were developed, the system
characteristics that have motivated the continued development in this field have not
changed much. The most significant of these are increased throughput, improved
flexibility and reliability. Since none of these goals is numerically specified (i.e.
they are all qualitative goals), it is not surprising that the design of the future
"supercomputers" will also be motivated by the same objectives as today's parallel
computers. However, the improvement of some or all of these specifications must
ultimately result in an improved overall system performance, usually measured on
the basis of cost effectiveness.
The multiprocessing computing systems are composed of multiple processors,
interconnected to each other, and sharing the use of memory, input-output
peripherals and other resources. Each of the processors is capable of executing a
different part of the same program or a different program altogether. The multiple
processor approach is a cost-effective way to achieve high system throughput:
the use of several cooperating processing units can reach a level of throughput
that could not be matched by a uniprocessor system with enhanced logic circuitry.
Literally, flexibility means the ease of changing the system configuration to suit
new conditions, and the use of more than one processor has greatly increased the
system's potential flexibility since it offers the ability to expand the memory space,
the number of processing units and even the software facilities in order to meet
new demands. This flexibility may also be used to justify the increased reliability
of the system.
Broadly speaking, reliability is related to different system aspects required
by different applications. The first is system availability, defined by the
requirement that the system should remain available even in the case of a
malfunctioning unit; an example of this is the computer controlled telephone
switching board. The second is system integrity, defined by the requirement
that the information contained within should be "protected" against any
corruption or loss (e.g. in a secure banking system).
In conclusion, since all the system characteristics that have motivated the
development of parallel processor computers are not described quantitatively,
any new major system concept has been claimed by its proponents as the ultimate
solution to achieving these motivating goals. In fact, the same motives were behind
the follow-up to the parallel processing systems, the VLSI architectures.
1.3 PIPELINED COMPUTERS
The pipeline concept [Hayes 1988] has been implemented in practice since the
third computer generation (on the IBM 360/91 for instance). The computer
pipeline is analogous to the assembly line in industrial manufacturing
processes. It is one technique for embedding parallelism or concurrency in
a computer system. Although essentially sequential, this type of computer helps to
match the speeds of various subsystems without duplicating the cost of the entire
system involved. It also improves system availability and reliability by providing
several copies of dedicated subsystems.
The pipeline is particularly efficient for long sequences of operands or, in other
words, for high-dimensional vector operands. For this reason, pipelined
processors are also sometimes called Vector Processors.
Pipelined computers achieve an increase in computational speed by
decomposing every process into several sub-processes which can be executed by
special autonomous and concurrently operating hardware units. Furthermore,
pipelining can be introduced at more than one level in the design of computers.
Ramamoorthy [Ramamoorthy 1977] distinguished two pipeline levels, the system
level for the pipelining of the processing units and the subsystem level for
arithmetic pipelining. Handler [Handler 1982] introduced a third level
and distinguished them under the names: macro-pipelining for the program level,
instruction pipelining for the instruction level and arithmetic pipelining for the
word level. Other designers have distinguished instruction pipelining, depending
on the control structure in the system, as either strict or relaxed pipelining. A pipe
can be further distinguished by its design configuration and control strategies into
two forms; it can be either a static or a dynamic pipe. Sometimes, a pipelined
structure is dedicated to a single function, e.g. a pipelined adder or multiplier. In
Figure (1.1) A parallel processing system (main memory and control unit feeding stages 1 to N)
this case it is termed a unifunctional pipe with static configuration. On the other
hand, a pipelined module can serve several different functions. Such a pipe is
called a multifunctional pipe, which can be static or dynamic depending on the
number of active configurations. If only one configuration is active at any one
time, then the pipe is said to be static. In a dynamic multifunctional pipe, more than
one configuration can be active at any one time, thus permitting a synchronous
overlapping on different interconnections. The simplified model of a general
pipelined computer is shown in Fig. (1.1), where the processor unit is segmented
into N modules, each of which performs its part of the processing and the result
appears at the end of the Nth stage.
The pipelined concurrency, a main characteristic of the simplest pipelining, is
exemplified by the process of executing instructions. In fig. (1.2), we consider four
modules : Instruction Fetch (IF), Instruction Decode (ID), Operand Fetch (OF) and
Execution (E), obtained when segmenting the process of processing instructions.
Consequently, if the process is decomposed into four subprocesses and executed
on the four-module pipelined system as defined above, then four successive
instructions may execute in parallel and independently of each other but at different
execution stages : the first instruction is in the execution phase, the second one is in
the operand fetching stage, the third is in the instruction decoding phase and lastly,
the fourth instruction is in the fetching stage. The overlapping procedure among
these individual modules is depicted in fig. (1.4).
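The overlap just described can be expressed as a small simulation (an illustrative sketch, not thesis code; the names are invented): with N stages and k instructions, in each cycle every stage holds the instruction whose index trails the cycle number by the stage number, so k instructions complete in k + N - 1 cycles instead of k*N serially.

```python
# A sketch of the four-stage instruction pipeline described above (IF, ID,
# OF, E). Each cycle, instruction (cycle - s) occupies stage s; the list of
# (cycle, stage, instruction) triples reproduces the space-time diagram.
STAGES = ["IF", "ID", "OF", "E"]

def pipeline_schedule(n_instructions):
    schedule = []                        # (cycle, stage, instruction) triples
    n_cycles = n_instructions + len(STAGES) - 1
    for cycle in range(n_cycles):
        for s, name in enumerate(STAGES):
            instr = cycle - s            # instruction currently in this stage
            if 0 <= instr < n_instructions:
                schedule.append((cycle, name, instr))
    return schedule, n_cycles
```

For four instructions the schedule spans 4 + 4 - 1 = 7 cycles, against 16 cycles if each instruction had to drain through all four stages before the next entered.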
Buffering is essential to ensure a continuous, smooth flow of data through the
pipeline segments in cases where variable speeds occur; it is essentially the
process of storing the results of a segment temporarily before sending them to the
next segment. To this end, a sufficient storage space, or buffer, is included
between a segment and its successor; the former can then continue its
operation on other results and transfer them to the provided buffer until it is full.
In addition to the architectural features of the pipelined processor, the busing
structure is important in deciding the efficiency of an algorithm to be executed on
such a system. Pipelining in essence, refers to the concurrent processing of
independent instructions though they may be in different stages of execution due to
overlapping.
Another factor even more damaging to the pipeline than instruction
dependency is branching. The encounter of a conditional branch not only delays
further executions but affects the performance of the entire pipe, since the exact
sequence of instructions to be followed is hard to foretell until the deciding result
becomes available at the output. To alleviate the effects of branching, several
techniques have been employed to provide mechanisms through which processing
can resume safely even if an incorrect branch occurs, which may create a
discontinuous supply of instructions.
Figure (1.2) The modules of a pipelined processor (IF, ID, OF and E stages)

Figure (1.3) Space-Time diagram (No Pipelining)

Figure (1.4) Space-Time diagram (Pipelining)
A similar degrading effect to that of conditional branching is caused by interrupts,
which disrupt the continuity of the instruction stream through the pipeline.
Interrupts must be serviced before any action can be applied to the next
instruction. Provided that the cost of a recovery mechanism, which allows
processing to proceed after an unpredictable interrupt occurs (while instruction i is
the next one to enter the pipe), is not exceedingly substantial, sufficient
information must be saved for the eventual recovery. Otherwise these two
instructions, the interrupt instruction and instruction i, have to be executed
sequentially, which in fact is not allowed for in the pipelining principle.
Finally, one of the most beneficial applications of overlapped processing in
order to increase the total throughput has been the execution of arithmetic functions.
Especially, the advantages of pipelining are greatly enhanced when floating-point
operations on a vector are being considered, since they represent quite a lengthy
process; again, the full speed-up is not achieved unless all modules in the pipe are
kept continuously busy.
As an example of an arithmetic pipeline, we take the problem of adding two
floating-point vectors x_i and y_i (i = 1, 2, ..., n) to obtain the sum vector
z_i = x_i + y_i. The operation of adding any pair of the above elements
(x = e*2^r and y = f*2^s) may be divided into four suboperations. These are:
(1) compare exponents, i.e. form (r - s); (2) shift x with respect to y by (r - s)
places in order to line up the binary points; (3) add the mantissa of x to the
mantissa of y; and (4) normalise by shifting the result z to the left until the
leading non-zero digit is next to the binary point. In the sequential computer the
four suboperations must be completed on the first element pair x_1, y_1 to
produce the first result z_1 before the second element pair enters the arithmetic
unit. In the arithmetic pipeline, the four suboperations are executed
on the four-module pipelined system as defined before. The overlapping procedure
among the individual modules is shown in fig. (1.5).
Figure (1.5) Comparison of serial and pipelined computers (4 clocks per result serially, 1 clock per result pipelined)
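The four suboperations can be written out directly (a simplified sketch, not the thesis's Occam code: values are held as (mantissa, exponent) pairs with integer mantissas, and step (4) normalises by stripping factors of two rather than by true binary-point alignment):

```python
# The four suboperations of floating-point addition, as a sketch.
# A value (m, e) represents m * 2**e, with an integer mantissa m.
def fp_add(x, y):
    (e, r), (f, s) = x, y
    d = r - s                          # (1) compare exponents: form (r - s)
    if d >= 0:                         # (2) shift to line up the binary points:
        e, r = e << d, s               #     rewrite x at the smaller exponent s
    else:
        f, s = f << -d, r              #     or rewrite y at the smaller exponent r
    m = e + f                          # (3) add the mantissas (now aligned)
    exp = r                            # both operands share this exponent
    while m != 0 and m % 2 == 0:       # (4) simplified normalisation: strip
        m //= 2                        #     trailing zero bits of the mantissa
        exp += 1
    return (m, exp)
```

For example, (3, 2) + (1, 0), i.e. 12 + 1, aligns to exponent 0 and yields (13, 0); each of the four numbered steps could run in its own pipeline stage, one element pair per clock.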
1.4 DATA-FLOW COMPUTERS
A common feature of all the high-speed parallel computer architectures is
that, due to the basic linearity of the program, the use of implicit sequencing of the
instructions is possible. This is a von-Neumann characteristic, which means that the
order of execution of the instructions is determined by the order in which they are
stored in the memory with branches used to break this implicit sequencing at
selective points. An alternative form of instruction controlling is the explicit
sequencing which is basically the principal concept exploited by the data-flow (DF)
machines to provide the maximum possibilities for concurrency and speed-up.
However, this concept has a significant impact on the architecture of such
machines, the program representation, and the synchronisation overheads.
In a Data-flow (DF) computer, the course of computation is controlled by
the flow of data in the program. That is, an operation is performed as and when its
operands are available. The sequence of operations in the DF computer obeys the
precedence constraints imposed by the algorithm used rather than by the location of
the instructions in the memory. In a DF machine it is possible to carry out in
parallel as many instructions as the given computer can execute simultaneously.
After executing the instructions, the result is distributed to all subsequent
instructions which make use of this partial result as an operand. In this way, the
DF model of computation exploits in a simple manner the natural parallelism of
algorithms.
As an illustration of DF computation, the computation of the roots of a
quadratic equation is shown in fig. (1.6). Assuming that the a, b and c values are
available, (-b), (b²), (ac) and (2a) can be computed immediately, followed by the
computation of (4ac), (b² - 4ac) and √(b² - 4ac), in that order; after this,
(-b + √(b² - 4ac)) and (-b - √(b² - 4ac)) can be simultaneously computed, followed
Figure (1.6) A data-flow graph for the computation of the roots of a
quadratic equation.
by the simultaneous computation of the two roots. The only requirement is that the
operands be available before an operation can be invoked.
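This firing rule, an operation executes as soon as all its operands exist, can be captured in a few lines (an illustrative interpreter, not a real DF machine; all names below are invented). The graph is the one of fig. (1.6):

```python
# A minimal data-flow interpreter: each node fires as soon as all of its
# input values are present, regardless of the order the nodes are listed.
import math

def run_dataflow(graph, values):
    graph = dict(graph)                # nodes still waiting to fire
    while graph:
        ready = [name for name, (op, args) in graph.items()
                 if all(a in values for a in args)]
        for name in ready:             # fire every ready node this round
            op, args = graph.pop(name)
            values[name] = op(*[values[a] for a in args])
    return values

# The data-flow graph for the roots of a*x^2 + b*x + c = 0 (fig. 1.6).
quadratic = {
    "negb":  (lambda b: -b,                ["b"]),
    "b2":    (lambda b: b * b,             ["b"]),
    "ac":    (lambda a, c: a * c,          ["a", "c"]),
    "2a":    (lambda a: 2 * a,             ["a"]),
    "4ac":   (lambda ac: 4 * ac,           ["ac"]),
    "disc":  (lambda b2, f: b2 - f,        ["b2", "4ac"]),
    "sqrt":  (math.sqrt,                   ["disc"]),
    "root1": (lambda n, s, d: (n + s) / d, ["negb", "sqrt", "2a"]),
    "root2": (lambda n, s, d: (n - s) / d, ["negb", "sqrt", "2a"]),
}
```

Note that within one round every ready node fires, mirroring the simultaneous evaluation of (-b), (b²), (ac) and (2a) described in the text.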
The DF concept encounters some problems when the algorithm contains
loops or subroutine calls, in which case the same instruction is executed several
times. Basically, the implementation of the data-flow computers can be grouped
into two main classes, the static and dynamic structures, depending on how this
problem is tackled.
In the first class, the static one, the loops and subroutine calls are unfolded
at compile time so that each instruction is executed only once. Consequently, the
implementation of the sequencing control is made simple since it directly follows
that of the graph. On the other hand, in the dynamic case, the operands are labelled
so that a single copy of the same instruction can be used several times for different
instances of the loop (or subroutine). For this type of architecture, it is necessary to
match all the operands with the same label before issuing the single copy of the
instruction, so the implementation of the control is significantly more complex in
comparison with that of the previous class. However, the dynamic approach, which
allows a compact representation of large programs, can effectively exploit the
concurrency that appears during execution (for example, recursive calls or data
dependent loops).
An example of the static approach is the MIT Data-Flow machine (fig. 1.7),
which consists of the following main components: a store that contains the
instruction cells, or packets, having space for the operation, the operands and for
pointers to the successors, together with a set of operating units to perform the
operations. These two components are connected by two interconnection networks,
one to send ready-to-execute instruction packets to the operating units and another
to send results back from the operating units to the instructions that use them as
operands.
The system has to be carefully designed so as to prevent any bottle-neck from
occurring and to provide the means for the full exploitation of all the concurrency.
In such a system, the maximum throughput is determined by the speed and
number of the operating units, the memory bandwidth and by the interconnection
system. As in the other organisations, several degradation factors reduce the
effective throughput. The most significant are the degree of concurrency available
in the program, the memory access, the interconnection network conflicts and the
broadcasting of results, all of which except the last one are similar to the other
systems. Sometimes an instruction has several successors, so that the result has to
be sent, or broadcast, to all of them; this introduces significant overheads when the
number of destination pointers present in an instruction cell is limited.
Examples of the dynamic approach include the U-Interpreter machine
[Arvind 1982] and the Manchester Data-flow Machine [Gurd 1985]. The main
components of the latter (see fig. 1.8) are the token queue that stores computed
results, the token matching unit that combines the corresponding tokens into
instruction arguments, the instruction store that holds the ready-to-execute
instructions, the operating units, and the I/O switch for communication with the
host. Due to the above mentioned degradation factors, data-flow machines are only
attractive for cases in which the concurrency exhibited is of several hundred
instructions or more.
Another problem in the use of the data-flow approach is the lack of any data
structure definition; in fact, only scalar operations were first utilised in the attempt
to maximise the amount of concurrency, and this had significant limitations in terms
of the modularity of the programs. The inclusion of data structures in the graph
representation requires that the data-flow concept be extended and operations on
Figure (1.7) The static data-flow machine

Figure (1.8) The dynamic data-flow machine
them to be defined. From the operational point of view, the most straightforward
solution is to treat the data structure as an atomic operand, requiring the structure to
be sent as a whole to the operating units even though only a few elements are
operated on. This can be performed by sending to the operating unit a pointer to the
data structure instead of its value.
One of the most significant advantages of the DF machines, as claimed by
their proponents, is the exploitation of concurrency at a low level of the execution
hierarchy, since it allows the maximum utilisation of all the available concurrency.
However, some researchers have argued that the overheads incurred by this
unstructured low-level concurrency are too high and have proposed the use of a
hierarchical approach in which different types of concurrency can be exploited at
various levels.
1.5 ARRAY PROCESSORS
The early interest in the parallel processor area initially appeared in the
investigation of machines that were arrays of processors connected in a four
nearest-neighbour manner "N,E,S,W", such as von Neumann's Cellular
Automata and the Holland machine. Eventually, as a result of growing interest in
this form of computer, parallel processors with a central control mechanism that
controlled the entire array began to emerge.
Array processors can be defined as an array of interconnected identical
processing elements (PE's). The PE's are controlled by a single control unit. Each
PE consists of an arithmetic and logic unit (ALU) and a local memory. Two
essential reasons for building array processors are, firstly, economic, for it is
cheaper to build N processors with only a single control unit rather than N similar
computers. The second reason concerns interprocessor communication: the
communication bandwidth can be more fully utilised.
The PE's are synchronized to perform the same function at the same time.
The control unit decodes the instruction and broadcasts the instruction via control
lines to all PE's simultaneously. The control unit can access information in both the
control and local memories. Each PE has access to its local memory only. Thus, a
common instruction is executed by all PE's simultaneously, each using data from its
own local memory.
Different interconnection patterns between processors in array processors
are used to permit data transfers between processors. In order to maximize the
parallelism in an array processor, we must utilise as much of the available memory
and as many of the processors as possible. The array processor is eminently suitable
for computations involving linear algebra operations. For example, if an array
processor contains N (N=2^n) processing elements, an N×N array is stored by
columns in such a way that each element of a matrix column is stored in the
memory of the corresponding PE, and one memory fetch transfers one column of
the matrix into the vector of arithmetic units (PE's). An example of an array
processor is the ICL DAP computer. The general structure of the ICL DAP is
shown in fig. (1.9).
The operational speed of an array processor is supposed to increase linearly
as the number of processing elements (PE's) is increased. However, this is not true
in practice, due to interprocessor communication and data access overheads. Array
processors can only be fully effective (i.e. achieve maximum parallelism) if the
array is completely filled with operands.
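The column-per-PE storage scheme described above can be sketched as follows (an illustrative model, not DAP code; all names are hypothetical): each PE holds one matrix column in its local memory, one fetch supplies one element to every PE at once, and the broadcast instruction is then applied by all PE's simultaneously.

```python
# Column-wise storage for a 4x4 matrix: pe_memory[c] is the local
# memory of PE c, holding column c of the matrix.
N = 4
matrix = [[r * N + c for c in range(N)] for r in range(N)]
pe_memory = [[matrix[r][c] for r in range(N)] for c in range(N)]

def simd_step(row, op):
    """One memory fetch brings element `row` of every column (one
    element per PE) into the ALUs; all PE's then execute the same
    broadcast operation `op` simultaneously."""
    return [op(pe_memory[c][row]) for c in range(N)]

# Broadcast instruction: double every element of matrix row 1.
print(simd_step(1, lambda x: 2 * x))  # [8, 10, 12, 14]
```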
Figure (1.9) The structure of the ICL DAP
An associative store can be used to overcome the memory access bottle-neck
in enhancing the speed of conventional computers. An array processor using an
associative type store as its memory is called an associative array processor.
1.6 DESIGN CLASSIFICATIONS
Parallel processing is a very general term; it can represent many different
strategies and their implementations.
As a result of the introduction of various forms of parallelism which has
proved to be an effective approach for increasing computational speed, several
competitive computer architectures were constructed but there was little evidence as
to which design was superior, nor was there sufficient knowledge on which to
make a careful evaluation. Researchers have also aided the study of high-speed
parallel computers by attempting to classify all the proposed computer architectures,
or at least those which had already been well established.
Indeed, a number of classification approaches have been proposed in the
past, given by different researchers, especially by the two pioneers, Flynn [Flynn
1966] and Shore [Shore 1973]. However, Flynn's classification scheme is too
broad, since it combines all parallel computers except the multiprocessor into the
SIMD class and draws no distinction between the pipelined computer and the
processor array, which have entirely different architectures. These classifications
have been widely referenced and their corresponding terminology has greatly
contributed to the formation of the computer science vocabulary.
1.6.1 Flynn's Classification
Flynn's high-speed parallel computer classification is based on the dependence
relation between the instructions propagated by the computer and the data being
processed. Flynn explored theoretically some of the organisational possibilities for
large scientific computing machinery before attempting to classify them into four
broad classes.
For convenience, he defined the instruction stream as a sequence of
instructions to be processed by the computer and the data stream as a set of
operands, including input and partial or temporary results. Also two additional
useful concepts were adopted: bandwidth and latency. By bandwidth he expressed
the time-rate of occurrences, while latency expresses the total time between the
excitation and the response of a computing process on a particular data unit.
In particular, for the former notion, computational or execution bandwidth is the
number of instructions processed per second and storage bandwidth is the retrieval
rate of data and instructions from the store (i.e. memory words per second).
By using the two definitions, Flynn categorised the theoretically possible
computer organisations according to the multiplicity of the hardware provided to
service the instruction and data streams. The word "multiplicity", which was
intentionally used to avoid the ubiquitous and ambiguous term "parallelism", refers
to the maximum number of simultaneous instructions or data in the same phase of
execution at the most constrained component of the organisation.
The four basic types of systems recognised by Flynn's classification are :
a- Single Instruction Single Data (SISD) stream system (fig. 1.10). This is
the basic single-processor, or uniprocessor, system. It may represent a classical von
Neumann architecture computer with practically no parallelism (IBM 701).
However, it may also represent some more sophisticated systems, where certain
methods of parallelism have been implemented, such as multiple functional units
(IBM 360/91, CDC 6600, CYBER 205) or pipelining (VAX 8600), or both.
b- Single Instruction Multiple Data (SIMD) stream system (fig. 1.11). A
number of processors simultaneously execute the same instruction, transmitted by
the control unit CU in the instruction stream. Each instruction is executed on a
Figure (1.10) S.I.S.D. computers

Figure (1.11) S.I.M.D. computers

Figure (1.12) M.I.S.D. computers

Figure (1.13) M.I.M.D. computers
different set of data, transmitted to each processor Pr_i in the data stream D_i from a
local memory LM_i, i=1,2,...,n. The results are stored temporarily in the LM. There
exists a bidirectional bus interconnection between the main memory MM and the
local memories, under the control of the CU. A system of this type is also called an
Array Processor, because of the array of processors formed by the Pr_i,
i=1,2,...,n. Examples of SIMD systems are the ILLIAC IV, BSP and MPP.
c- Multiple Instruction Single Data (MISD) stream system (fig. 1.12). In
this system a sequence of data is transmitted to a sequence of processors, each of
which is controlled by a separate CU and executes a different instruction sequence.
The MISD structure has never been implemented, so no examples of any
well established organisation have yet been proposed. It very much resembles a
pipeline structure. However, the main difference is that a pipeline structure belongs
to the same CPU and is controlled by a single CU.
d- Multiple Instruction Multiple Data (MIMD) stream system (fig. 1.13). In
the MIMD system a set of n processors executing simultaneously is usually called a
multiprocessor. The Balance 8000 parallel computer system in the Parallel
Algorithms Research Centre (PARC) at Loughborough University of Technology is
an example of this class; this machine is described in chapter 2. Other examples
are the IBM 3090, the Cray 2, the Alliant FX/8 and the NCUBE.
The MIMD systems (the multiprocessors) constitute the most general type of
parallel processor. Any number of processors in the same computing system
execute different programs concurrently, using different sets of data. In general, no
particular operation selection constraints are attached. Each processor can work on
a different program at the same time; a program can be subdivided into
subprograms (such as processes, tasks, etc.) that can be run concurrently on a
number of processors.
1.6.2 Shore's Classification
A classification of parallel computer systems based on their constituent
hardware components was proposed by Shore [Shore 1973]. Accordingly, all
existing computer architectures were categorised into six different classes.
The first machine (I) [e.g. the CDC 7600, a pipelined scalar computer, and the
CRAY-1, a pipelined vector computer], which is the conventional serial von
Neumann type organisation, consists of an Instruction Memory (IM), a single
Control Unit (CU), a Processing Unit (PU) and a Data Memory (DM). The main
source of power increase comes from the processing unit, which may consist of
several functional units, pipelined or not; all bits of a single word are read in order
to be processed simultaneously (horizontal PU).
A second alternative machine (II) is obtained from the first by simply
changing the way the data is read from the data memory. Instead of reading all bits
of a single word as (I) does, machine (II) reads a bit from every word in the
memory, i.e. bit-serially, but word processing is parallel. In other words, if the
memory area is considered as a two-dimensional array of bits, with each word
occupying an individual row, then machine (I) reads horizontal slices whereas
machine (II) reads vertical slices.
A combination of the two above machines yields machine (III). This means
that machine (III) has two processing units, a horizontal and a vertical one, and is
capable of processing data in either of the two directions. The ICL DAP could have
been a favourable candidate for this class if only it had separate processing units to
offer this capability. An example of this organisation is the Sanders Associates
OMEN 60 series of computers [Higbie 1972].
Machine (IV) consists of a single control unit and many independent
processing elements, each of which has a processing unit and a data memory.
Communication between these components is restricted to take place only through
the control unit. A good example of this machine is the PEPE system.
If, however, additional limited communication is allowed to take place
among the processing elements in a nearest-neighbour fashion, then machine (V) is
conceived. Thus, the communication paths between the linearly connected
processors offer any processor in the array the possibility of accessing data from its
immediate neighbours' memories, as well as its own. An example of this machine
type is the ILLIAC IV, which in addition provides a short-cut communication path
to every eighth processing element.
The Logic-In-Memory-Array (LIMA) is Shore's last class of computer
organisation. The main difference between machine (VI) and the previous one is
that the processing unit and the data memory are no longer two individual hardware
components; instead they are constructed on the same IC board. Examples range
from simple associative memories to complex associative processors.
It is observed that, generally speaking, Shore's classification, compared
with Flynn's, does not offer anything new, but only a subcategorisation of the
obscure SIMD class given by Flynn, except for machine (I), which is an SISD-type
computer. Again, as with Flynn's categorisation, pipelined computers do not
belong to a well specified class that represents their hardware characteristics; on the
contrary, they are mixed up with unpipelined scalar computers.
1.6.3 Other Classification Approaches
This section gives a brief note on some other classification approaches, of
lesser importance compared to the former two, which are based mainly on the
concept of parallelism.
One of the taxonomies, based on the amount of parallelism involved in the
control unit, data streams and instruction units, was suggested by Hobbs et al.
[Hobbs 1970] in 1970. They divided parallel computers into multiprocessors,
associative processors, array processors and functional processors.
Another classification, due to Murtha and Beadles [Murtha 1964], was based
upon parallelism properties. It attempted to underline the main significant
differences between multiprocessors and highly parallel organisations. Three main
classes of parallel processor systems were identified: general-purpose network
computers, special-purpose network computers characterised by global parallelism
and, finally, non-global, semi-independent network computers with local
parallelism. Furthermore, all the classes but the last one were further
subcategorised into two subclasses each. The first class, the general-purpose one,
was subdivided into a subclass of general-purpose network computers with
centralised common control, and a subclass of general-purpose network computers
with many identical processors, each capable of operating independently from the
others, executing instructions from its own local storage. The second class
comprised the pattern processor and associative processor subclasses.
Hockney and Jesshope [Hockney 1988] formulated a taxonomy scheme for
both serial and parallel computers. The main subdivisions are shown in fig. (1.14).
Their taxonomy was more detailed than that of Flynn or Shore and took implicit
account of pipelined structures. Therefore, the Multiple Instruction class was not
considered for further categorisation, unlike the pipelined and array processor
computers. Nevertheless, this scheme, if coupled with that of Flynn, could well be
suited for a general classification of parallel computers.
Figure (1.14) Structural classification of computers: a single instruction unit
with single unpipelined execution units gives serial unicomputers; with pipelined
or multiple execution units, parallel unicomputers; a multiple instruction unit gives
multiple computers (multiprocessors), e.g. the Balance 8000.
A multiprocessor conforming to Enslow's definition is sometimes denoted
as a Tightly Coupled Multiprocessor [Hwang 1984] or a Loosely Coupled
Multiprocessor. In the case of tightly coupled processors, as shown in fig. (1.15)
(i.e. a large number of processors sharing a common parallel memory via a high
speed multiplexed bus), the processors operate under the strict control of the bus
assignment scheme, which is implemented in hardware at the bus/processor
interface. On the other hand, in a system with loosely coupled processors the
communication and interaction take place on the basis of information exchange;
fig. (1.16) shows a general architecture of a loosely coupled system, where each
processor has its own local memory. Comparing the above two classes of
multiprocessor systems, the main difference lies in the organisation of the memory
and the bandwidth of the interconnection network.
Figure (1.15) Tightly coupled multiprocessor

Figure (1.16) Loosely coupled multiprocessor
1.7 MULTIPROCESSOR STRUCTURE: PROCESSING AND COMMUNICATION
A multiprocessor is composed of many processors, memory modules, I/O
interface units, and a communication network interconnecting all of them. The
communication network is of the utmost importance in a multiprocessor design.
The overall performance of the multiprocessing system depends not only on the
individual speed and throughput of its processors, but also strongly on the quality
of its communication network [Stone 1987].
From the standpoint of the type of communication network, there are three
main categories of multiprocessor structures: bus-oriented systems, hypercube
systems and switch network systems.
A bus-oriented system contains one or more system buses (including data,
address and control lines) to which all of the system components are
interconnected. A single-bus system is the simplest and least expensive to
implement. It is used in a number of multiprocessors, such as the Sequent, Encore
and ELXSI. A single-bus system offers greater configuration flexibility to both the
user and the designer (fig. 1.17). Unfortunately, this type of system suffers from
some serious drawbacks. The most serious problem is the bus bottle-neck: only
two devices at a time can establish communication through the system bus. No
matter how fast a bus is implemented, this bottle-neck tends to slow down
considerably the overall throughput of a multiprocessor. Moreover, a bus failure is
catastrophic in a single-bus multiprocessor.
This can be alleviated if multiple buses are implemented (see fig. 1.18).
Certainly, a failure of a single bus does not have to be catastrophic for the whole
system. If we have q buses, then up to q simultaneous interconnections can be
achieved, alleviating the bottle-neck problem. Of course, other problems arise.
Figure (1.17) Single-bus system
Multiporting requires complicated and expensive extra logic on each device.
Therefore, the number q cannot be too high; a dual-bus system (q=2) can be
achieved at a reasonable cost. For instance, the Alliant has a dual system bus
between its cache and main memory, and also a concurrency bus connecting the
processors only.
Figure (1.18) Multiple-bus system
The hypercube multiprocessor structure is characterised by the presence of
N=2^n processors, interconnected as an n-dimensional binary cube [Seitz 1985].
Each processor forms a node, or vertex, of the cube. Each node has direct and
separate communication paths to n other nodes (its neighbours); these paths
correspond to the edges (channels) of the cube (fig. 1.19).
Figure (1.19) The hypercube topology.
The uniprocessor (SISD), represented by a single node, can be regarded as
a zero-cube. Two nodes, interconnected by a single path, form a one-cube. Four
nodes, interconnected as a square, form a two-cube, and so on. The hypercube
configuration is implemented commercially by NCUBE, Intel and Floating Point
Systems (FPS). In all commercial hypercube systems the memory is distributed
among the nodes; there is only one local memory on each node board. Since each
processor has direct connections to n other processors and each processor has its
own local memory, the bottle-neck problem seems to be less serious. However,
the interprocessor direct communication in the commercial hypercube systems is
serial, therefore limiting the overall bandwidth. The particular interconnection
structure of a hypercube makes it suited for some classes of problems, but may
prove to be very inefficient for others. The hypercube structure is certainly not
universally considered ideal.
Figure (1.20) Crossbar switch multiprocessor.
Much more general is the switch network system structure, which offers
numerous possibilities. One of the best known switch structures is the
crossbar switch (fig. 1.20). The crossbar switch permits establishing concurrent
communication links between all processors (P) and memory modules (M),
provided the number of memory modules is sufficient (n ≤ m) and each processor
attempts to access a different memory module. The same goes for I/O processors or
I/O interface units. The information is routed through Crosspoint Switches (CS),
which contain multiplexing and arbitration logic networks. The main advantage of
a crossbar network is its potential for high throughput via multiple, concurrent
communication paths. Its main disadvantage is an exceedingly high cost and
complex logic. The crossbar network was implemented in the Carnegie Mellon
University C.mmp multiprocessor system. The experimental C.mmp system was a
16x16 (n=16) network and each processor node was a DEC PDP-11 or LSI-11.
The multistage network, or the generalised cube network, is a more general
representation for multiprocessor switching networks [Siegel 1985]. Its basic
component is the two-input, two-output interchange box. The two inputs and the
two outputs are labelled 0 and 1. There are two control signals associated with the
interchange box, C0 and C1, which establish the interconnection between the
input and the output terminals. A general multistage network has N inputs and N
outputs. In a generalised cube network N=2^m, where m is the number of stages;
each stage has N/2 interchange boxes. An example of such a network for N=8,
m=3, is shown in fig. (1.21).
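The interchange box and one stage of the cube interconnection can be sketched as follows (a simplified illustrative model: each box here takes a single straight/exchange control bit rather than the C0/C1 pair, and all names are invented). In stage i, boxes pair the terminals whose labels differ in bit i.

```python
def interchange_box(in0, in1, swap):
    """Two-input, two-output interchange box: pass straight through
    (swap=False) or exchange the inputs (swap=True)."""
    return (in1, in0) if swap else (in0, in1)

def cube_stage(inputs, stage, swap_for_box):
    """One stage of a generalised cube network: stage `stage` pairs
    terminals whose labels differ in bit `stage`; `swap_for_box`
    maps the lower terminal label of each box to its control bit."""
    outputs = list(inputs)
    for t in range(len(inputs)):
        partner = t ^ (1 << stage)
        if t < partner:          # visit each interchange box once
            outputs[t], outputs[partner] = interchange_box(
                inputs[t], inputs[partner], swap_for_box[t])
    return outputs

# N=4: stage 0 pairs terminals (0,1) and (2,3); swap the first box.
print(cube_stage(['a', 'b', 'c', 'd'], 0, {0: True, 2: False}))
```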
In a multiprocessor system the inputs and outputs of a multistage network
can be connected to all the processors (one input and one output to each processor),
permitting direct intercommunication between them. Alternatively, the inputs of the
multistage network can be connected to the processors, and the outputs to the
memory modules. Considering the fact that in an actual system the communication
through the interchange boxes is serial, we can see that the multistage system is a
realistic and economical alternative to the crossbar switch. Of course, serial
communication decreases the overall speed.
The importance of the communication network in a multiprocessor system
cannot be over-emphasised. When a program is distributed among a number of
processors for execution, there is always a need for the transmission of intermediate
results between the tasks which are scheduled to run on different processors. In
fact, some processes will not be able to proceed without receiving the results from
other processes. This calls for synchronisation between different processes
belonging to the same program. Even if all processors run different programs, the
operating system has to schedule and distribute the programs among the
processors, maintaining a vigilant watch so that none of the processors remains
idle. The communication network plays a crucial part in all of the above events,
since it has to efficiently transmit all of the data, intermediate results, scheduling,
and synchronisation control signals.
Figure (1.21) Three-stage generalised cube network.
CHAPTER 2
THE VLSI TECHNOLOGY AND SYSTOLIC PARADIGM
2.1 INTRODUCTION
As a result of improvements in fabrication technology, Large Scale Integrated
electronic circuitry has become so dense that a single silicon LSI chip may contain
tens of thousands of transistors. Many LSI chips, such as microprocessors, now
consist of multiple complex subsystems, and thus are really integrated systems
rather than integrated circuits.
Achievable circuit density now increases greatly with each passing year or
two. Physical principles indicate that transistors can be scaled down to less than
1/1000th of their present area and still function as the switching elements with
which we can build digital systems. Following the rapid advances in LSI
technology, Very Large Scale Integration (VLSI) circuits have been developed with
which enormously complex digital electronic systems can be placed on a single chip
of silicon. In fact, it can be foreseen that the number of components that a VLSI
chip could accommodate will be increased by a factor of ten to one hundred in the
next decades [Mead 1980]. Devices which once required many complex
components can be built with just a few VLSI chips, reducing the problems in
reliability, performance and heat dissipation that arise from standard Small Scale
Integration (SSI) and Medium Scale Integration (MSI) components [Kung 1979].
VLSI electronics present a challenge, not only to those involved in the
development of fabrication technology, but also to computer scientists and
computer architects on how best to take advantage of the new technology. The
ways in which digital systems are structured, the procedures used to design them,
the trade-off between hardware and software, and the design of computational
algorithms will all be greatly affected by the coming changes in integrated
electronics.
The separation of the processor from its memory and the limited
opportunities for concurrent processing are the main difficulties in conventional
(von Neumann) computers. VLSI offers more flexibility than conventional
(von Neumann) computers to overcome these difficulties, because memory and
processing architectures can be implemented with the same technology and in close
proximity to each other. The potential power of VLSI has to come from the large
amounts of concurrency that it may support. The degree of concurrency in a VLSI
computing structure is largely determined by the redesign of the underlying
algorithms. Enormous, though ultimately limited, parallelism can be obtained by
introducing a high degree of pipelining and multiprocessing while redesigning the
algorithm. The requirements of parallel architectures for VLSI have been discussed
by many authors (among them Kung 1982 and Seitz 1985). The design should
contain many modules which are replicated many times (i.e. in a simple and
regular fashion), using both pipelining and multiprocessing principles. Finally, a
successful parallel algorithm for VLSI design will be one where the communication
is only between neighbouring processors.
The development of new manufacturing techniques for the fabrication of
small, dense and inexpensive semiconductor chips has created a unique
opportunity in the computer industry. With the use of VLSI circuits, the size of
processing elements and memory is considerably reduced and it becomes feasible
to combine the principles of Automata Theory with pipeline concepts. This
combination is especially attractive since device manufacturing costs have remained
constant relative to circuit complexity, with more time and money being invested in
the design and testing of new chips.
In relation to what has been said above, approaches to device design have
progressed significantly, to the point where hardware design now relies on software
techniques, i.e. special rules for circuit layout and high-level design languages (e.g.
geometry languages, hardware description languages (HDL), stick languages,
register languages, etc.) [Mead 1980]. In fact, some of these languages offer
powerful chip fabrication capabilities directly from the designs they express.
Illustrative of this trend is the term 'silicon compiler', which is utilised by
hardware designers to refer to computer-aided systems currently under
development. Analogous to a conventional software compiler, the silicon compiler
will convert linguistic representations of hardware components into machine code,
which can be stored and subsequently utilised in computer-assisted fabrication.
However, VLSI presents some problems: as the sizes of wires and transistors
approach the limits of photolithographic resolution, it becomes literally impossible
to achieve further miniaturisation, and actual circuit area (or chip area) becomes a
key issue. In addition, chip area is limited in order to maintain a high chip yield,
and the number of pins (through which the chip communicates with the outside
world) is limited by the finite size of the chip perimeter. These restrictions form the
basis of the VLSI paradigm.
For a newly developed technology or product to survive in a highly
competitive industry there must be sufficient demand for it. The emergence and
subsequent success of VLSI-oriented computing systems is due not only to
H. T. Kung's foresight but also to the timeliness of the state of the industry. At the
same time, Kung revealed the systolic concept as a means of introducing parallelism
into VLSI circuits. Further, the idea of using VLSI for signal processing became
the major focus of attention in governmental, industrial and university research
establishments.
2.2 VLSI-ORIENTED ARCHITECTURES
For large applications it may not be feasible to design a single chip
implementation of an array, especially when balance between flexibility, efficiency,
performance and implementation cost is essential. An alternative approach is to
implement basic cells at the board level using a set of 'off-the-shelf' components
which are widely available as chip packages or sets from various manufacturers.
The continuously widening applicability of the systolic approach, as well as
the diversification of problems to be solved, gave birth to a large number of systolic
algorithms. Except for a limited number of cases, where performance is very
critical, it has been accepted that, in general, mapping a systolic computation
directly onto silicon is less attractive than programming a special-purpose, or even
general-purpose, VLSI processor array. In this section, we shall briefly review the
main contenders among VLSI-oriented computing systems which have received
attention to date.
2.2.1 The WARP Architecture
The Warp architecture, developed at Carnegie Mellon University (CMU) by
H.T. Kung and his associates, is the most advanced VLSI-oriented system for
purely systolic algorithms. Its main areas of application are low-level signal and
image processing tasks (with special emphasis on computer vision), as well as
matrix computations and other compute-intensive numerical algorithms [Kung
1984]. It is a linearly interconnected (1-D) array of processors with data and
control flowing in one direction, with input at one end of the array and output at the
other. This design allows easy implementation, synchronization by a simple global
clock mechanism, minimum input/output requirements and the use of efficient fault
tolerance
techniques. The basic Warp cell is constructed from a collection of chips,
as illustrated in fig. (2.1), its main characteristics being the pipelining of data and
control. A Weitek 32-bit floating point multiplier (MPY) and ALU perform the
operations and can be used in pipeline mode to improve throughput by two-level
pipelining. The MPY and ALU registers use Weitek register file chips and can
compute approximate functions like the inverse square root using look-up facilities.
The cell has a significant amount of local memory (RAM), so that it is possible to
reduce the I/O requirements during the computation, and to simulate systolic
algorithms that have been designed for (2-D) systems. Each cell is programmable,
controlled by a microcode sequencer and with microcode storage. Finally, there are
input queues and multiplexers to implement programmable delays in the dataflow
and to relax the strictly pipelined dataflow.
As shown in fig. (2.1), the x, y and addr-files are also register files, but this
time they are used to implement delays for synchronising data paths. The crossbar
and input multiplexers (muxes) provide communication between the individual
elements and can be reconfigured by control signals. The muxes
permit two-directional data flow and ring set-ups (using wrap-around). A 10-cell
prototype has been built at CMU and tested on a number of example arrays
discussed in [Kung 1984].
An Algol-like language, called W2, is used for the high-level programming of
Warp; W2 is translated by a compiler to a lower-level language, Wl, which is the
assembly-type language of the system. The Warp project has shown the
significance of software support for the development of a systolic computer,
especially the design of a compiler, which provides feedback for the architecture
designer since it requires a thorough study of the functionality of the architecture.
Figure (2.1) Data paths for the WARP cell
2.2.2 The Wavefront Array Processor (WAP)
One problem with systolic arrays is that cell synchronization in very large
arrays requires long delays between clock signals due to the clock skew problem,
which increases with the size of the array. Also, the synchronization of data transfer
among large numbers of processors leads to large current surges as the cells are
simultaneously energized or change state.
A solution to the above mentioned problems, as suggested by S.Y.Kung
[Kung 1985], is to take advantage of the data and control flow locality, inherently
possessed by most algorithms. This permits a data-driven, self-timed approach to
array processing. Such an approach conceptually substitutes the requirement of
correct 'timing' by correct 'sequencing'. This concept is used extensively in data
flow computers and wavefront arrays.
Basically the derivation of a wavefront process consists of the three following
steps:
a. the algorithms are expressed in terms of a sequence of recursions;
b. each of the above recursions is mapped to a corresponding computation
wavefront; and,
c. the wavefronts are successively pipelined through the processor array.
Based on this approach, S.Y. Kung introduced the Wavefront Array
Processor (WAP), which consists of an N×N array of processing elements with a regular
connection structure, a program store and memory buffering modules, as illustrated
in Figure 2.2.
Figure (2.2) The Wavefront array processor.
Probably the main feature of the WAP is its asynchronous communication; i.e.
each PE communicates with its neighbours using a handshaking protocol, and
performs its computations as soon as all the operands and control required are
available. Thus, there is no need for a global clock mechanism for synchronization;
each cell is self-timed and the whole array operation is data-driven, according to the
concept of dataflow computing. The processor grid acts as a wave-propagating
medium, and an algorithm is executed by a series of wavefronts moving across the
grid. Processors are assumed to support the pipelining of waves, with the spacing of
waves determined by the availability of data and the execution of the basic
operation. The speed of the wavefront is equivalent to the data transfer time.
Summarising, the wavefront approach combines the advantages of data flow
machines with the localities of data flow and control flow inherent in a class of
algorithms. Since the burden of synchronising the entire array is avoided, a
wavefront array is architecturally 'scalable'.
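The data-driven, self-timed behaviour described above can be sketched in Python (a toy illustration of the idea only, not S.Y. Kung's design): each cell of a small array is a thread that blocks until its west and north operands have arrived over handshaking channels, so a computation wavefront propagates across the grid without any global clock.

```python
# Sketch of a 3x3 wavefront array: each cell waits for its operands and
# then passes its result east and south. Queues stand in for the
# handshaking channels; the cell operation (addition) is arbitrary.
import threading, queue

N = 3
# right[i][j]: channel from cell (i,j) to (i,j+1); down[i][j]: to (i+1,j)
right = [[queue.Queue(maxsize=1) for _ in range(N + 1)] for _ in range(N)]
down = [[queue.Queue(maxsize=1) for _ in range(N)] for _ in range(N + 1)]
result = [[0] * N for _ in range(N)]

def cell(i, j):
    a = right[i][j].get()               # wait for the operand from the west
    b = down[i][j].get()                # wait for the operand from the north
    result[i][j] = a + b                # the cell's basic operation
    right[i][j + 1].put(result[i][j])   # propagate the wavefront eastward...
    down[i + 1][j].put(result[i][j])    # ...and southward

threads = [threading.Thread(target=cell, args=(i, j))
           for i in range(N) for j in range(N)]
for t in threads:
    t.start()
# Inject the initial wavefront along the top and left edges of the array.
for i in range(N):
    right[i][0].put(1)
for j in range(N):
    down[0][j].put(1)
for t in threads:
    t.join()
print(result)
```

Note that no cell is told when to fire: correct 'sequencing' replaces correct 'timing', exactly as in the wavefront concept.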
2.2.3 The CHIP Architecture
In order to derive a more flexible VLSI-oriented computing system than
special-purpose designs, in which the same hardware can be used to solve several different
problems, Snyder suggested the design of the configurable, highly parallel
architecture 'CHIP' [Snyder 1982], based on configurability. Conceptually, the
CHIP represents a family of systems, each built out of three components: a set of
processing elements (PE's), a switch lattice and a controller. The lattice, the most
important component of a CHIP, is a 2-D structure of programmable switches
connected by data paths. The PE's are placed at regular intervals.
The processing elements are microprocessors, each coupled with several
kilobytes of RAM used as local storage. Data can be read or written through any of the
eight data paths or ports connected to the PE. Generally, the data transfer unit is a
word, though the physical data path may be narrower. The PE's operate
synchronously and systolically.
Each programmable switch contains a small amount (around 16 words) of
local RAM which is used to store instructions (one instruction per word) called
configuration settings. Each configuration setting specifies pairs of data paths to be
connected. When executed, each pair, which also works as a cross-over level,
establishes a direct, static connection across the switch that is independent of the
others. The data paths are bidirectional and fully duplex, i.e. data movements can
take place in either direction simultaneously. Executing a program causes the
specified connections to be established and to persist over time, e.g. over the
execution of an entire algorithm.
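The configuration-setting mechanism can be sketched as follows (a toy Python model of my own; the class and method names are invented and do not describe Snyder's actual hardware):

```python
class Switch:
    """Toy model of a CHIP programmable switch: its small local RAM
    holds configuration settings, each a list of pairs of data paths
    to be connected."""
    def __init__(self):
        self.ram = {}        # configuration-setting store (around 16 words)
        self.active = {}     # currently established connections

    def load(self, setting_id, pairs):
        self.ram[setting_id] = pairs

    def invoke(self, setting_id):
        # Each pair establishes a direct, static connection across the
        # switch, independent of the others; it persists until the
        # controller broadcasts a new setting.
        self.active = {}
        for a, b in self.ram[setting_id]:
            self.active[a] = b
            self.active[b] = a
        return self.active

s = Switch()
s.load("mesh", [("N", "S"), ("E", "W")])   # straight-through: a mesh phase
s.load("tree", [("N", "E")])               # a corner turn: a tree phase
print(s.invoke("mesh")["N"])
print(s.invoke("tree")["N"])
```

Invoking a different setting restructures the lattice in a single logical step, mirroring the phase changes described below.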
The processing elements can be connected to form a particular structure by
directly configuring the lattice. That is, the programmer sets each switch in such
a way that collectively they implement the desired processor interconnection graph.
In addition to the lattice, a controller is also provided, and is responsible for loading
programs and configuration settings into the PE and switch memories respectively.
This task is performed through an additional data path network, called 'skeleton'.
From the functional point of view, CHIP processing starts with the controller
broadcasting a command to all switches to invoke a particular configuration setting,
for example to implement a mesh pattern. The established configuration remains in effect
during the execution of a particular phase of an algorithm. When a new phase of
processing, requiring different configuration settings, is to begin, the controller
broadcasts a command to all switches so that they invoke the new configuration setting;
for example, a structure implementing a tree. With the lattice thus restructured, the
PE's resume processing having taken only a single logical step in reconfiguring the
structure.
In conclusion, the CHIP computer, which is a highly parallel computing system
providing a programmable interconnection structure integrated with the processor
elements, is well suited to VLSI implementation. Its main objective is to provide the
flexibility needed in order to solve general problems while retaining the benefits of
regularity and locality.
2.3 INMOS TRANSPUTERS AND OCCAM
Until the advent of the transputer, MIMD machines were limited to a relatively
small number of processors due to the difficulties in programming and the
synchronization mechanisms required to control the processors. The combination
of the transputer and Occam, which explicitly controls concurrency, was
designed to overcome these limitations [Harp 1989].
The Inmos transputer family is a range of system components, each of which
combines processing, memory and interconnect in a single VLSI chip. A
concurrent system can be constructed from a collection of transputers which operate
concurrently and communicate through serial communication links. Such systems
can be designed and programmed in Occam, a language based on communicating
sequential processes (CSP) (fig. 2.3). Transputers have been successfully used in application
areas ranging from embedded systems to supercomputers.
The power of the transputer is that it creates a new level of abstraction, in the
same way as the use of logic gates and Boolean algebra provides the design
methodology for present electronic systems. The term 'transputer' reflects this new
device's ability to be used as a system building block. The word is derived from
'transistor' and 'computer', since the transputer is both a computer on a chip and a
silicon component like a transistor. The architecture has been optimised to obtain
the maximum of functionality for the minimum of silicon [Inmos 1987].
The first member of the Inmos transputer family, the IMS T414 32-bit
transputer, which was introduced in 1985, has enabled concurrency to be applied in
a wide variety of applications such as simulation, robot control, image synthesis
and digital signal processing. Many computationally intensive applications can
exploit large arrays of transputers, the system performance depending on the
number of transputers, the speed of inter-transputer communication, and the
performance of the transputer processor.
Many important applications of transputers involve floating point arithmetic.
Another member of the Inmos transputer family, the IMS T800, can increase the
performance of such a system by offering greatly improved floating-point and
communications performance [May 1989].
The latest addition to the transputer family is the T9000 which provides a
balance between the computation and communication facilities of the transputer. It
provides high performance computation as well as high throughput communication.
2.3.1 Transputer Architectures
One important property of VLSI technology is that communication between
the devices is very much slower than communication within a device. In a
computer, almost every operation that the processor performs involves the use of
memory. For this reason a transputer includes both processor and memory in the
same integrated circuit device.
The speed of communication between electronic devices is optimized by the
Figure (2.3) Transputer Architecture.
use of one-directional signal wires, each connecting two devices. To provide
maximum speed with minimal wiring, the transputer uses point-to-point serial
communication links for direct connection to other transputers. Alternatively, if
many devices are connected by a shared bus, electrical problems of driving the bus
require that the speed is reduced. Also, additional control logic and wiring are
required to control the sharing of the bus.
The transputer is designed so that its external behaviour corresponds to the
formal model of a process. As a consequence, it is possible to program systems
containing multiple interconnected transputers in which each transputer implements
a set of processes [Inmos 1987]. The transputer has a conventional microcoded
processor and there is a small core of about 32 instructions which is used to
implement simple sequential programs.
Internally, the IMS T414, IMS T400 and IMS T425 consist of a memory,
processor and communications system connected via a 32-bit bus. The bus is also
connected to the external memory interface, enabling additional local memory to be
used. The processor, memory and communications system each occupy about 25%
of the total silicon area, the remainder being for power distribution, clock
generators and external connections.
The IMS T800 and IMS T805, with their 64-bit on-chip floating point units, are
only 20% larger in area than the IMS T414. The small size and high performance
come from a design which takes careful note of silicon economics. This contrasts
starkly with conventional co-processors, where the floating point unit typically
occupies more area than a complete microprocessor, and requires a second
supporting chip. The way in which the major blocks of the IMS T800 and IMS
T414 are interconnected is indicated in fig. (2.4).
The Central Processing Unit (CPU) of the transputer contains three registers,
A, B and C, which are used for integer and address arithmetic and form a hardware stack.
Loading a value into the stack pushes B into C, and A into B, before loading A.
Storing a value from A pops B into A and C into B.
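This push/pop behaviour can be sketched as follows (a Python model for illustration only; the class name is mine):

```python
class RegStack:
    """Sketch of the transputer's three-register evaluation stack."""
    def __init__(self):
        self.A = self.B = self.C = 0

    def load(self, value):
        # Loading pushes B into C and A into B before loading A.
        self.C = self.B
        self.B = self.A
        self.A = value

    def store(self):
        # Storing pops B into A and C into B, returning the old A.
        value = self.A
        self.A = self.B
        self.B = self.C
        return value

s = RegStack()
s.load(1); s.load(2); s.load(3)
assert (s.A, s.B, s.C) == (3, 2, 1)
assert s.store() == 3 and s.A == 2
```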
The floating point unit (FPU) operates concurrently with, and under the
control of, the CPU. It also contains a three-register floating-point evaluation stack.
Figure (2.4) Transputer internal architecture.
All data communication between memory and the floating point unit is done under
the control of the CPU. It was a design decision that the transputer should be
programmed in a high-level language. The instruction set has, therefore, been
designed for simple and efficient compilation. It contains a relatively small number
of instructions, all with the same format, chosen to give a compact representation of
the operations most frequently occurring in programs. The instruction set is
independent of the processor wordlength, allowing the same microcode to be used
for transputers with different wordlengths. The instruction format gives a more
compact representation of high-level language programs than more conventional
instruction sets do. Since a program requires less store to represent it, less memory
bandwidth is taken up with fetching instructions.
The processor provides efficient support for the Occam model of concurrency
and communication. It has a microcoded scheduler which enables any number of
concurrent processes to be executed together, sharing the processor time. This
removes the need for a software kernel. The processor does not need to support the
dynamic allocation of storage as the Occam compiler is able to perform the
allocation of space to concurrent processes.
At any time, a concurrent process may be active (i.e. being executed or on a
list waiting to be executed) or inactive (i.e. ready to input, ready to output, or
waiting for the timer). The scheduler operates in such a way that inactive processes
do not consume any processor time. The active processes waiting to be executed
are held on a list. This is a linked list of process workspaces, implemented by two
registers, one of which points to the first process on the list, the other to the last.
Thus in fig. (2.5), S is executing, and P, Q and R are active, awaiting execution. A
process is executed until it is unable to proceed because it is waiting to input or
output, or waiting for the timer. Whenever a process is unable to proceed, its
instruction pointer is saved in its workspace and the next process is taken from the list.
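The two-register linked list can be sketched as follows (a Python illustration; the names Workspace, front and back are mine, mirroring the Front and Back registers of fig. (2.5)):

```python
class Workspace:
    """Sketch of a process workspace holding a link to the next one."""
    def __init__(self, name):
        self.name = name
        self.next = None      # link to the next workspace on the list

class Scheduler:
    """Two registers point to the first and last active process."""
    def __init__(self):
        self.front = self.back = None

    def add(self, w):
        # A process that becomes ready is appended at the back.
        if self.back is None:
            self.front = self.back = w
        else:
            self.back.next = w
            self.back = w

    def dispatch(self):
        # The next process to execute is taken from the front.
        w = self.front
        self.front = w.next
        if self.front is None:
            self.back = None
        return w

sched = Scheduler()
for name in ("P", "Q", "R"):
    sched.add(Workspace(name))
print(sched.dispatch().name)   # P runs first
```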
Communication between processes is achieved by means of channels. Occam
communication is point-to-point, synchronized and unbuffered. As a result, a
channel needs no process queue, no message queue and no message buffer. A
channel between two processes executing on the same transputer is implemented by
a single word in memory; a channel between processes executing on different
transputers is implemented by point-to-point links.
As in the Occam model, communication takes place when both the inputting
and outputting processes are ready. Consequently, the process which first becomes
ready must wait until the second one is also ready.
At any time, an internal channel (a single word in memory) either holds the
identity of a process, or holds the special value 'empty'. The channel is initialized to
empty before it is used. When a message is passed using the channel, the identity of the
first process to become ready is stored in the channel, and the processor starts to
Figure (2.5) Linked process list
execute the next process from the scheduling list. When the second process to use
the channel becomes ready, the message is copied, the waiting process is added to
the scheduling list, and the channel is reset to its initial state. It does not matter
whether the inputting or the outputting process becomes ready first.
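This rendezvous protocol can be sketched as a small state machine (a Python illustration of my own, not Inmos microcode):

```python
# Sketch of the single-word internal channel: it holds either the
# special value 'empty' or the identity of the first process to become
# ready, whichever side that happens to be.
EMPTY = object()

class Channel:
    def __init__(self):
        self.word = EMPTY               # initialized to empty before use

    def ready(self, process, message=None):
        """Called when a process becomes ready to input or output.
        Returns None if the caller must wait, or the completed
        transfer (waiter, message) once both sides are ready."""
        if self.word is EMPTY:
            # First party: record its identity (and message, if it is
            # the outputter) in the channel word, then wait.
            self.word = (process, message)
            return None
        waiter, data = self.word
        self.word = EMPTY               # reset to the initial state
        # Copy the message (from whichever side supplied it) and
        # reschedule the waiting process.
        return (waiter, message if data is None else data)

ch = Channel()
assert ch.ready("writer", "hello") is None       # writer arrives first, waits
assert ch.ready("reader") == ("writer", "hello") # reader completes the transfer
assert ch.word is EMPTY
```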
When a message is passed via an external channel, the processor delegates to
an autonomous link interface the job of transferring the message, and deschedules
the process. When the message has been transferred, the link interface causes the
processor to reschedule the waiting process. This allows the processor to continue
the execution of other processes whilst the external message transfer is taking place.
A link between two transputers is implemented by connecting a link interface
on one transputer to a link interface on the other transputer by two one-directional
signal wires, along which data is transmitted serially. The two wires provide two
Occam channels, one in each direction. This requires a simple protocol to multiplex
data and control information. Messages are transmitted as a sequence of bytes, each
of which must be acknowledged before the next is transmitted.
The fast block move of the IMS T414 makes it suitable for use in graphics
applications using byte-per-pixel colour displays. The block move in the IMS T414
is designed to saturate the memory bandwidth, moving any number of bytes from
any byte boundary to any other byte boundary using the smallest possible number
of word read and write operations. The IMS T805 extends this capability by
the incorporation of a two-dimensional version of the block move (Move2d), which can
move windows around a screen at full memory bandwidth, and a conditional version
of the same block move which can be used to place templates and text into
windows. One of these operations (Draw2d) copies bytes from source to
destination, writing only non-zero bytes to the destination. A new object of any
shape can therefore be drawn on top of the current image.
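The conditional block move can be sketched as follows (my own Python rendering of the Draw2d idea, not the actual microcoded instruction):

```python
# Sketch of a conditional 2-D block move: copy a rectangular window
# byte by byte, writing only non-zero source bytes, so an object of
# any shape overlays the current image.
def draw2d(src, dst, dx, dy):
    for y, row in enumerate(src):
        for x, byte in enumerate(row):
            if byte != 0:                 # zero bytes leave dst untouched
                dst[dy + y][dx + x] = byte
    return dst

screen = [[9] * 5 for _ in range(4)]      # background filled with value 9
sprite = [[0, 7, 0],
          [7, 7, 7]]                      # a shaped (non-rectangular) object
draw2d(sprite, screen, dx=1, dy=1)
print(screen[1])   # -> [9, 9, 7, 9, 9]
```

An unconditional Move2d would instead copy every byte of the window, zeros included.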
2.3.2 OCCAM
Occam is a programming language which from the outset was designed to
support concurrent applications. Occam was designed specifically for the
transputer, so the programming model for transputers is defined by Occam.
Transputers can be programmed in Occam or other languages. Occam is based on
the notion of communicating sequential processes, and provides concurrency and
communication as fundamental features of the language.
Where it is required to exploit concurrency, but still to use standard sequential
languages such as C or FORTRAN, Occam can be used as a harness to link
modules written in these languages [Inmos 1988].
In Occam, processes are connected to form concurrent systems. Each process
can be regarded as a black box with internal state, which can communicate with
other processes using point-to-point communication channels. Processes can be used
to represent the behaviour of many things, for example, a logic gate, a
microprocessor, etc.
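This process model can be sketched in Python (an illustration only, with thread-safe queues standing in for Occam channels; the process and channel names are invented):

```python
# Sketch of two Occam-style processes connected point-to-point:
# each is a black box with internal behaviour, communicating only
# over its channels.
import threading, queue

def doubler(cin, cout):
    # Internal behaviour of the process: double each value received.
    while (v := cin.get()) is not None:   # None marks end of stream
        cout.put(v * 2)
    cout.put(None)

def collect(cin, out):
    while (v := cin.get()) is not None:
        out.append(v)

a, b = queue.Queue(), queue.Queue()       # the two channels
out = []
procs = [threading.Thread(target=doubler, args=(a, b)),
         threading.Thread(target=collect, args=(b, out))]
for p in procs:
    p.start()
for v in (1, 2, 3):
    a.put(v)
a.put(None)
for p in procs:
    p.join()
print(out)   # -> [2, 4, 6]
```

Because each process touches only its own state and its channels, the same pair could run on one processor or be distributed across two, as fig. (2.6) illustrates for Occam.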
The design of Occam was heavily influenced by the work of Hoare on his
theoretical model of Communicating Sequential Processes (CSP), which grew out of
a study of process synchronisation problems [Galletly 1990].
Every transputer implements the Occam concepts of concurrency and
communication. As a result, Occam can be used to program an individual transputer
or to program a network of transputers. When Occam is used to program an
individual transputer, the transputer shares its time between the concurrent
processes and channel communication is implemented by moving data within the
memory. When Occam is used to program a network of transputers, each transputer
executes the process allocated to it (fig. 2.6). Communication between Occam
processes on different transputers is implemented directly by transputer links. Thus
the same Occam program can be implemented on a variety of transputer
configurations, with one configuration optimized for cost, another for performance,
or another for an appropriate balance of cost and performance.
Figure (2.6) Mapping processes onto one or several transputers: three processes on one transputer, and the same processes distributed over three transputers.
All transputers include special instructions and hardware to provide maximum
performance and an optimal implementation of the Occam model of concurrency and
communication. Together, the transputer and Occam provide modular hardware and
software components of the type which is essential in the construction of highly
parallel computer systems.
However, its lack of powerful data structures and its closeness to the
hardware mean that Occam is likely to be the low-level language of fifth
generation systems, with applications possibly written in a more abstract language,
e.g. Ada.
2.3.3 Transputer Development System
The transputer development system (TDS) is an integrated development
system which can be used to develop Occam programs for a transputer network.
The TDS provides a complete programming environment for the generation of reliable,
well structured and efficient programs. It consists of a plug-in board for an IBM
PC, comprising an IMS T414 transputer with 2 Mbytes of RAM, and all the appropriate
development software (fig. 2.7).
Using the TDS, a programmer can edit, compile and run Occam programs
entirely within the development system. Occam programs can be developed on the
TDS and configured to run on a network of transputers, with the code being loaded
onto the network from the TDS. Alternatively, an operating system file can be
created which will boot a single transputer or network of transputers [Inmos 1988].
The TDS comes with all the necessary software tools and utilities to support
this kind of development. There are a variety of software routines to support
mathematical functions and input-output operations, for example.
Figure (2.7) Transputer Development System
The benefits of the TDS combine to provide design productivity, and increase
confidence in the timely and accurate implementation of highly concurrent and
real-time systems.
In the development of programs for transputer networks, as with other
microprocessor development systems, a distinction may be made between the 'host'
and 'target' environments. The program development tools are run on a host
computer, which includes a terminal and a filing system. The host computer may
include a transputer within the computer, on which the development tools are run,
with the host computer providing the development tools with access to its terminal
and filing system; in this case the transputer is known as the host transputer
[Wayman 1989].
Before the program under development is run on a transputer network, it may
be run on a single host transputer connected to the host computer, with access to the
terminal and filing system of the host computer. Much of the program testing,
debugging, and iterative development can be done in this environment. The program
may then be loaded into a network of transputers from the host; such a network is
known as a 'target' network.
The transputer network must be connected to the host via a link. The
network must also be connected together by transputer links. The topology of the
network must match the configuration description, otherwise the loading will fail.
As well as the link connections, Inmos boards also provide system control functions
to monitor and control the state of the transputer network. The system control
connections on the boards are chained together to allow the whole of the network to be
controlled from the host.
As a more substantial example of configuration, consider a four-transputer
network on an IMS B003 transputer evaluation board, loaded from the host
computer. Every transputer on the IMS B003 has two links available on the edge
connector (links 0 and 1), while the other two are preconnected in a square array
(links 2 and 3). The example includes two different processes, control and work.
Figure (2.8) The logical structure of the program: (a) allocation of processes on transputers; (b) program running on an IMS B003.
Fig. (2.8a) shows the logical structure of the program, as it is mapped onto
the four transputers. The procedure control has two channels, connecting it to and
from the host computer. In addition, there is one channel to the pipeline of work
processes, and one channel from the last process in the pipeline.
Once the process to run on each of the transputers has been specified, a
configuration description can be written, mapping processes to processors and links
to channels. The configuration for the IMS B003 must map control onto the root
transputer and work onto all three remaining transputers (fig. 2.8b). The four
channels that connect the four transputers on the B003 must be declared as an array
of channels.
2.3.4 Performance Measurements of a Transputer Network
The two measures of a parallel system's performance are the speed-up (S) and
efficiency (E). The speed-up of a transputer system has been defined as follows:

    S = T(1) / T(N),                                                  (2.1)

where T(N) is the total runtime consumed by an application program running on N
transputers and T(1) is the time on one transputer. The efficiency of a transputer
system has been defined as follows:

    E = T(1) / (N x T(N)).                                            (2.2)
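The two measures can be computed directly from equations (2.1) and (2.2); the sample runtimes below are invented purely for illustration.

```python
def speedup(t1, tn):
    """S = T(1) / T(N), from equation (2.1)."""
    return t1 / tn

def efficiency(t1, tn, n):
    """E = T(1) / (N * T(N)), from equation (2.2)."""
    return t1 / (n * tn)

t1, t4 = 100.0, 30.0       # hypothetical runtimes on 1 and 4 transputers
print(speedup(t1, t4))     # below 4: communication overhead costs time
print(efficiency(t1, t4, 4))
```

Note that E = S / N, so efficiency expresses how far the measured speed-up falls short of the ideal linear case.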
Transputer systems inevitably have their performance degraded by inter-
processor communications. There is a processor time overhead for each input and
output statement of about 1 µs. This overhead is not significantly dependent on the
size of the message communicated; therefore the use of fewer, longer messages is
an advantage.
Although transputer links are synchronized, data cannot always be available
when needed, owing to the low speed of the link. If the process runtime is smaller
than the time the input message takes to communicate through the link, the
processor will be idle for a period of time, even when all communications are run as
high-priority processes and the data is ready on the sender. This increases the
overhead time and consequently decreases the system performance. In such cases,
long messages may not be the best solution and a compromise has to be reached
between the number and size of the messages.
2.3.5 The Transputer Network Used for this Research
The PARC (Parallel Algorithms Research Centre) transputer hardware system
consists of a Sun SPARC workstation, a Volvox-liS interface board, a Tandon
Plus PC and an Inmos Transputer Evaluation Module (ITEM 400). A full description of
the ITEM box can be found in [Inmos 86]. The configuration consists of 2 Inmos
evaluation cards (IMS B003-2) and 1 IMS B012 eurocard TRAM motherboard.
The software systems used in connection with the hardware include the operating
system SunOS, the transputer development system (TDS3) and the motherboard
module software (MMS2).
The Sun workstation acts as the host computer system. By running a program
called a 'server' on the Sun workstation, the Sun system provides the transputer
development system with filing facilities. This allows users to input data to
transputers from the keyboard or files, and to output data from transputers to the screen
or files.
The Volvox interface board, which is plugged into the Sun system, provides the
communication between the host system (Sun) and the transputer (the mother
transputer) on the Volvox board. The communications between the mother
transputer and the transputer network are realised by links provided by the
transputers.
Alternatively, a Tandon Plus PC acts as the host computer,
providing terminal and file storage facilities. An IBM PC/AT version of the TDS
runs on the host computer. The TDS comprises an IMS T414 transputer on a B004
board and 2 Mbytes of DRAM.
Each of the IMS B003-2 boards contains 4 IMS T414B-G20S transputers, with 256
Kbytes of DRAM each, capable of 10 MIPS performance. The transputers on each
board are connected as a ring (fig. 2.9), leaving 2 uncommitted links per transputer.
The 8 uncommitted links are available on the edge connector, allowing a wide range
of system configurations to be achieved using link cables to connect between
boards.
Figure (2.9) IMS B003 Configuration.
The IMS B012 holds 16 IMS B411 TRAMs in its slots. Each TRAM
incorporates an IMS T800 transputer and 1 Mbyte of dynamic RAM. Links 1 and 2
of each of the transputers are used to connect them as a 16-stage pipeline. The
Figure (2.10) Transputer system.
pipeline can, however, be broken using the jumper blocks supplied, to allow other
combinations.
The IMS T800 incorporates a floating point unit capable of sustaining over
1.5 million floating point operations per second. Full details of the IMS T800
can be obtained from [Inmos 86].
Fig. (2.10) is an illustration of the transputer system. The 3 boards can be
connected together using link cables. Alternatively, networks comprising only
T414's or T800's can be configured. A link on the B004 transputer is used to connect
the transputer network with the TDS, which in turn connects with the host
transputer.
2.4 THE SEQUENT BALANCE 8000 SYSTEM
An example of a tightly-coupled MIMD architecture, or more precisely a bus
architecture, which we discuss in this section, is the Sequent computer
architecture.
Sequent Computer Systems Inc. has developed two families of parallel
computers: the Balance series and the Symmetry series. The two series are very
similar in their structure, configuration, operating system and user software.
The primary difference between them is the type of microprocessor used to build
the CPUs, which has led to a substantial difference between the two series at the
machine-language level. There are, of course, other differences, such as speed,
performance and memory size.
In this discussion we concentrate on the Balance series, and in particular on
the Balance 8000 model, because most of the early part of this research was
carried out using simulators running on this type of machine.
The Balance 8000 model can have up to twelve 32-bit processors connected in a
tightly-coupled manner (Figure 2.16).
The machine at Loughborough University's Parallel Algorithm Research
Centre (PARC) has 12 processors. These processors are connected via a high-speed
bus to all peripherals and to shared memory, and they concurrently execute a shared
copy of a Unix-based operating system. Any processor can execute any program,
which achieves dynamic load balancing, and multiple processors can work in parallel
on a single application. To minimize accesses to the system bus, each processor has
its own cache memory.

[Figure: processor boards, memory and the peripheral interface connected by the
data bus; peripherals include disks and terminals]

Figure (2.16) Sequent Balance 8000 architecture.
Each CPU unit has a Floating Point Unit (FPU), a Memory Management Unit
(MMU) and a System Link and Interrupt Controller (SLIC), whose task is to
manage the control of multiple processors.
The DYNIX (Dynamic Unix) operating system is an enhanced version of
Berkeley Unix 4.2bsd which can emulate Unix System V at the system-call and
command levels. To support the Balance multiprocessing architecture, the DYNIX
operating system kernel has been made completely shareable, so that multiple
CPUs can execute identical system calls and other kernel code simultaneously.
2.5 SYSTOLIC SYSTEM FOR VLSI COMPUTING
STRUCTURES
High-performance, special-purpose, VLSI-oriented computer systems are
typically built to meet specific applications, or to off-load intensive computations
that are especially taxing to general-purpose computers. However, since most of
these systems are built on an ad hoc basis for specific tasks, methodological
work in this area is rare. In an attempt to improve on this ad hoc approach,
some general design concepts are discussed here; the following paragraphs
introduce the particular concept of systolic array architectures, and a general
methodology for mapping high-level computational problems onto cellular
hardware structures.
The systolic approach to parallel processing evolved from its possible
applications, together with the appropriate technology and the background
knowledge for its realisation. The applications arose from the ever-increasing
demand for faster and more reliable computation, especially in areas like real-time
signal processing and large-scale scientific computation. The appropriate technology
was provided by the remarkable advances in VLSI and automated design tools.
In areas such as real-time signal processing and large-scale scientific
computation, the trade-off between generality and performance comes down on the
side of special-purpose devices, because of the stringent timing requirements.
Thus, a systolic engine can function as a peripheral device attached to a host
system.
The host system need not be a computer; in the case of real-time signal
processing, systolic systems are suitable for sensor devices, accepting a sampled
signal and then passing it on, after some processing, to other systems for further
processing [Dew 1986]. In the case of large-scale scientific computation, systolic
systems can be used as a 'hardware library' for certain numerical algorithms.
Alternatively, they can be utilized to 'matricialize' the internal arithmetic units of
more general-purpose supercomputers.
Some of the VLSI limitations are alleviated when systolic algorithms are
implemented on processor arrays. For example, the actual chip design is no longer
an issue, since the processor is programmable. Further, the interconnections
need not be strictly planar. However, in both cases, simplicity and regularity remain
factors of the utmost importance for an efficient systolic design: in the first case
because they ensure the design of cost-effective, special-purpose VLSI chips, and
in the second because of the promise of harnessing the programming
complexity of parallel computers with a large number of cooperating processors.
2.5.1 An Environment For The Development Of The Systolic
Approach.
The concept of systolic architectures, pioneered by H. T. Kung [Mead 1980],
which has been successfully shown to be suitable for VLSI implementation, is
basically a general methodology for directly mapping algorithms onto an array of
processing elements. It is particularly amenable to a special class of algorithms,
taking advantage of their regular, localised data flow.
The word 'systole' was borrowed from physiologists, who use it to describe
the rhythmically recurrent contraction of the heart and arteries which pulses blood
through the human body. By analogy, the function of a cell in a systolic computing
system is to ensure that data and control are pumped in and out at a regular
pulse, while performing some short computation [Kung 1978 and Dew 1986].
Systolic systems combine pipelining, array processing and multiprocessing to
produce a high-performance parallel computer system. This combination is
exemplified in fig. (2.12), which shows a typical arrangement of a systolic system. A
linear array (pipeline) of n processors (cells, in the systolic terminology) is
connected with the host system, via the boundary cells. The number of cells in the
array is determined by the maximum attainable I/O bandwidth of the host.
Operations are pumped through the array at a regular pulse. Everything is planned
in advance so that all inputs to a cell arrive at just the right time, before they are
consumed. Intermediate results are passed on immediately to become the inputs of
further cells. A steady stream flows in at one end of the array, which is said to
consume data and produce results 'on the fly'. The single operation common to all
algorithms considered in this section is the so-called inner product step (IPS), C =
C + A * B, which leads to a fundamental network capable of performing
computation-intensive algorithms such as digital filtering, matrix multiplication
and other related problems (see Table (2.1) for a more comprehensive list of
potential systolic applications).
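The inner product step and the pulse-by-pulse data movement described above can be illustrated with a minimal synchronous simulation of a linear array. The sketch below (Python, for clarity; it is not the Occam implementation used in this research, and all names are illustrative) performs matrix-vector multiplication: each cell holds one row's running sum C, and the elements of x are pumped rightwards one cell per clock pulse.

```python
# Minimal synchronous simulation of a linear systolic array computing
# y = W x via the inner product step C = C + A * B. Cell i accumulates
# row i of W; x values enter at the left and shift right one cell per pulse.

def systolic_matvec(W, x):
    n = len(W)              # number of cells, one per matrix row
    acc = [0] * n           # accumulator C held inside each cell
    pipe = [None] * n       # x value currently held by each cell
    step = [0] * n          # how many x elements each cell has consumed
    # Run enough pulses for the last x element to traverse the whole array.
    for t in range(2 * n - 1):
        # Communication: each cell passes its x value to its right neighbour.
        for i in range(n - 1, 0, -1):
            pipe[i] = pipe[i - 1]
        pipe[0] = x[t] if t < n else None
        # Computation: every cell holding data performs one inner product step.
        for i in range(n):
            if pipe[i] is not None:
                acc[i] += W[i][step[i]] * pipe[i]
                step[i] += 1
    return acc

print(systolic_matvec([[1, 2], [3, 4]], [5, 6]))  # prints [17, 39]
```

Note that results appear after 2n - 1 pulses rather than the n² steps of a sequential evaluation, which is the throughput gain the systolic arrangement is designed to deliver.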
[Figure: a memory unit feeding a linear array of PEs]

Figure (2.12) A systolic processor array.
Systolic array systems feature the important properties of modularity,
local interconnection, a high degree of pipelining and highly synchronised
multiprocessing. These features are particularly interesting in the
implementation of compute-bound algorithms, rather than input/output (I/O)
'SYSTOLIC' PROCESSOR ARRAY STRUCTURES

1- 1D linear arrays
Problem cases: FIR filter, convolution, Discrete Fourier Transform (DFT),
matrix-vector multiplication, recurrence evaluation, solution of triangular
linear systems.

2- 2D square arrays
Problem cases: dynamic programming for optimal parenthesization,
image processing, pattern matching, numerical relaxation.

3- 2D hexagonal arrays
Problem cases: matrix problems (matrix multiplication), LU decomposition by
Gaussian elimination without pivoting, QR factorization.

4- Trees
Problem cases: searching algorithms, recurrence evaluation.

5- Triangular arrays
Problem case: inversion of a triangular matrix.

Table (2.1) The potential utilization of 'systolic' array configurations
bound computations. In a compute-bound algorithm, the number of computing
operations is larger than the total number of I/O elements; otherwise the problem
is termed I/O-bound. The following matrix-matrix multiplication and addition
examples illustrate these concepts. The former, an ordinary matrix multiplication
algorithm, represents a compute-bound task, since every entry in one matrix is
multiplied by all the entries in the rows or columns of the other matrix, i.e. O(n³)
multiply-add steps, but only O(n²) I/O elements are required. The addition of two
matrices, however, is I/O-bound, since the total number of additions is no larger
than the total number of I/O operations, i.e. O(n²) add steps and O(n²) I/O
elements.
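The distinction can be made concrete by counting operations per I/O element for the two examples. The short sketch below (an illustrative calculation, not part of the original text) shows that for n x n matrix multiplication the ratio of computation to I/O grows with n, while for addition it stays constant, which is why only the former rewards a systolic implementation.

```python
# Compute-to-I/O ratios for n x n matrix multiplication versus addition.
# Multiplication: n^3 multiply-add steps; addition: n^2 adds; both move
# 3 n^2 elements (two operand matrices plus one result).

def op_vs_io(n):
    multiply_ops = n ** 3      # n^2 output entries, each an n-term inner product
    add_ops = n ** 2           # one addition per output entry
    io_elements = 3 * n ** 2   # two inputs plus one result
    return multiply_ops / io_elements, add_ops / io_elements

for n in (8, 64, 512):
    mul_ratio, add_ratio = op_vs_io(n)
    # mul_ratio = n / 3 grows with n; add_ratio = 1 / 3 is constant.
    print(n, round(mul_ratio, 1), round(add_ratio, 2))
```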
Speeding up I/O-bound computations requires an increase in memory
bandwidth. Memory bandwidth can be increased by using either fast components,
which may be quite expensive, or interleaved memories, which may create complex
memory management problems. Speeding up a compute-bound computation,
however, may often be accomplished by using systolic arrays.
The fundamental principle of a systolic architecture, a particular systolic array,
is illustrated in fig. (2.12). By replacing a single processing element with an array
of PEs, a higher computational throughput can be achieved without increasing
memory bandwidth. This is apparent if we assume that the clock period of each PE
is 100 ns: the conventional memory-processor organisation then has at most 5 MOPS
performance, while, at the same clock rate, the systolic array can achieve a
possible 35 MOPS.
Furthermore, systolic systems are algorithmically specialised, and can
therefore achieve a better balance between computation and communication,
since the communication geometry and the computation performed by each
processor are unique to the specific problem being solved. Thus, a systolic
algorithm must explicitly define not only the computation performed by each
of the processors in the system, but also the communication between these
processors. That is, a systolic algorithm must specify the processor interconnection
pattern and the flow of data and control throughout the system.
2.5.2 Systolic Algorithms, Constraints and Classification.
An algorithm that is designed with the systolic concepts in mind, in particular
the use of simple and regular data and control flow, extensive pipelining and a
high level of multiprocessing, is termed a systolic algorithm. Technologically
speaking, the design of systolic algorithms is in its early days and, as such, is
applicable to only a small subset of applications. However, it is forecast that
further developments in the near future could alleviate some (if not all) of the
restrictive constraints of VLSI design.
Recent developments in programming languages, along with chip
technology, have made it possible to classify systolic algorithms into broad classes
depending on their specific properties. For example, a systolic algorithm can be
characterised by many factors, e.g. ease of manufacture, its ability to be
represented as a planar graph, or the amount of silicon area required to implement it.
Two main classes of systolic algorithm have been identified [Bekakos 1986]: soft-systolic
algorithms and hard-systolic algorithms.
The soft-systolic paradigm is described as a framework for realising an
algorithm design and programming methodology for general-purpose, stand-alone
(not attached to a host), high-level-language parallel computers (more specifically,
the Fifth Generation Project computers) [Uchida 1983].
Soft-systolic algorithms were defined as a result of innovations in
concurrent programming languages, such as Occam and Concurrent Prolog. In this
class, planarity, broadcasting and area are no longer a major concern. Although
soft-systolic algorithms may intuitively not be suitable for direct mapping onto a
chip, they can still be executed on suitable parallel computers, such as transputers.
These algorithms must therefore be implemented in appropriate languages. Recent
developments in the transputer device, with its direct hardware support for the
Occam model, have made the transputer a favourable candidate system for running
algorithms of this class.
The second class, the hard-systolic algorithms, represents the traditional
algorithms designed with the physical chip-implementation restrictions in mind, so
that they are easily manufactured as chip systems. Examples include banded
matrix-vector and matrix-matrix multiplication chips [Mead 1980].
Perhaps one of the most significant constraints imposed on VLSI systems is
that it is a 2D technology (the planarity constraint), since chips are usually laid out
(or, more precisely, wafered, in fabrication jargon) on a board. This physical
constraint is reflected in hard-systolic design by considering only those graph-model
representations which feature the planarity characteristic. However, near-planar
representations are also allowed, in which the 2D constraint is violated only by
permitting two boards to be connected at the same places.
In addition, broadcasting is avoided in such algorithms, since each cell would
have to be connected to the broadcast channel, increasing the power requirement of
the system as a whole or decreasing its speed. In a 'purely' hard-systolic algorithm,
broadcasting to cells is totally avoided; if only a limited amount is allowed, the
algorithm is termed a 'semi' hard-systolic algorithm.
Soft-systolic algorithms observe the main principles of systolic algorithms.
However, they do not have to obey the restrictions that refer to the VLSI
implementation of systolic algorithms; thus they differ from hard-systolic
algorithms in the following ways:
* The network of processes need not be planar and static: non-planar
networks with multiple and complex interconnections, or even multidimensional
and/or time-varying systems, may be possible.
* Area is not a major consideration for optimization; however, it should be
noted that area represents processes, and thus processor and memory resources.
* They do not have to be fabricable, but they must be programmable in some
appropriate parallel processing language (e.g. Occam).
* Broadcasting, fan-in and small irregularities are not avoided, but there must
be a majority of pipelined structures.
It is clear that the set of hard-systolic algorithms forms a subset of the soft-systolic
class, and as such they can also be implemented with the same concurrent
programming languages, although this is not necessary. Furthermore, it is also
evident that some soft-systolic algorithms will be very close to the hard-systolic
ones but, under the strict definitions of hard-systolic, would not be classed
as such. Consequently, a third class, hybrid-systolic algorithms, was defined to
represent this state of transition from the soft class to the hard one. Only
technological improvements, which are likely to take place in the near future, will
achieve this hybrid-to-hard migration. Current research indicates that algorithms which
allow local broadcasting (not necessarily between nearest-neighbour cells), limited
non-planarity, or large amounts of non-planarity (but in a controlled manner) could
be considered as contenders for this class of algorithm.
Hence, for hybrid-systolic algorithms, area is not a major consideration, in
terms of optimizing the area of the functional units or of the array as a whole.
However, the restrictions of the machine must be taken into account, in terms of
the processors and memory available. They do not have to be fabricable, but must be
programmable in some special-purpose systolic programming language targeting a
special-purpose machine; usually they require significant amounts of memory and
control.
2.5.3 Systolic Array Simulation.
The term 'systolic array simulation' indicates the combination of several
approaches: initially, the simulation of hard-systolic algorithms on
conventional computers using some suitable language; further, the development of
hybrid-systolic algorithms and special-purpose systolic programming; and finally, a
design and development methodology for soft-systolic algorithms whose target
machines are general-purpose parallel processing computers.
Occam programs can be divorced from transputer configurations by using the
language as a simulation tool, as was done throughout the development of the
simulation system in this research. A summary of the Occam language was given in
a previous section, i.e. (2.3.2). The general structure of Occam programs which
simulate systolic arrays is shown in fig. (2.13), where branching indicates
parallel execution. The construction of these programs follows ideas developed by
G. M. Megson [Megson 1987]. Consequently, Occam programs simulate the formal
proofs by replacing I/O descriptions with actual results. Although simulation does
not guarantee correctness, it is nevertheless a less time-consuming approach which
does not result in unsolvable equations.
The Getdata and Putdata sections of fig. (2.13) are responsible for receiving
and sending data and other information from and to the host. Each routine contains
enough memory to store the initial array input data and the final output data,
corresponding to the global input and output sequences of the model. However, for
algorithms where the computation time is data-dependent, the Putdata routine can
run in parallel with the systolic system and immediately produce the output data.
Similar arrangements can be made for the Getdata routine. Notice that,
[Figure: GETDATA → SETUP → ALLOCATOR → (SOURCES | CELLS | SINKS | DEBUG)
→ DE-ALLOCATOR → PUTDATA]

Figure (2.13) Structure of an Occam program for simulating a systolic array.
given that Occam has no standard I/O routines, it is possible to define a
library of primitive I/O routines especially suitable for reading and writing
data and control streams, as required in systolic computation.
The Setup section computes system-dependent quantities. More specifically, it
performs the calculations whose values are needed to define the structure of the
array. These structural values become more important as the array becomes more
complex.
A system is eventually decomposed into Sources, Cells and Sinks. A Source is
loaded initially with vectors from Getdata representing input streams, together with
possible delays and other control information created in the Setup section. Sinks
are analogous to Sources, except that they work in reverse, placing real values
into data vectors which are then passed to Putdata. The Sources and Sinks of
subsystems are usually connected to the Sources and Sinks of the main system.
The cell procedures implement the computations performed by the processing
elements (cells) of the given systolic architecture. Generally, there is one procedure
for each type of cell, and the programming task is simplified for homogeneous
networks. The I/O sequences are represented by Occam channels appearing as
actual parameters in the procedure heading. Where cell definitions are only
marginally different, extra switches and flags can be added to a procedure heading
so it can set up the correct cell type. A cell definition is divided into three sections:
initialization, communication and computation. Initialization is performed only once
and allows cells to be cleared before use, or predetermined values to be set up. In
particular, initialization defines the neutral-element quantities which can be used in
communication before real data reaches the cell; this is essential to maintain
dataflow in Occam programs.
The other two sections of the cell, communication and computation, are
performed many times, executed sequentially one after the other inside an
iteration loop. All communication is performed in parallel, while computation is
mainly sequential. The allocator routine is called after Setup and is supplied with
parameters describing the array dimensions, synchronisation details such as the total
number of cycles in the algorithm (if a loop scheme is used), and the data sequence
sizes. The allocator is simply a set of parallel loops which specify and start up the
computational graph, connecting corresponding procedures using Occam
channels as arcs and allocating channels accordingly. The simpler the array, the
easier the mapping functions, and the result is an allocation similar to the VLSI
grid model. Once started, the sources and sinks control the computation, and the
allocator terminates only when all the graph cell procedures have terminated.
Termination of procedures is globally synchronised if a for-loop is used in the
cells, and asynchronous if while-loops are incorporated. As Occam is a
synchronous communication language, for-loops tend to be messy, requiring some
additional computation after the loop to clear all the channels and hence avoid
deadlock. While-loops are better suited to the model of concurrency and, when
augmented with systolic control sequences, can be used selectively to close down
cells and input and output channels. Consequently, array cells can be switched off
or de-allocated by a wavefront progression, or pipelined approach, from sources to
sinks. An additional procedure for debugging purposes can be added, which runs
in parallel with the graph network and is mainly a screen/file routine.
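The cell structure just described (initialise once, then loop over communicate-and-compute, with a control sequence to close the cell down) can be sketched outside Occam using threads and blocking queues in place of processes and channels. The sketch below is an analogy only, with illustrative names throughout: a None token on the data channel plays the role of the systolic control sequence that de-allocates cells in a wavefront from source to sink.

```python
# Two IPS cells wired into a pipeline by an "allocator"; queues stand in
# for Occam channels, threads for Occam processes.
import threading
import queue

def ips_cell(weight, x_in, x_out, y_in, y_out):
    # Initialisation section: nothing to clear in this trivial cell.
    while True:
        x = x_in.get()                    # communication: blocking receive
        if x is None:                     # control token: close the cell down
            x_out.put(None)               # pass the shutdown wavefront on
            break
        partial = y_in.get()
        y_out.put(partial + weight * x)   # computation: inner product step
        x_out.put(x)                      # forward x to the next cell

# "Allocator": connect two cells computing y = (2 + 3) * x.
x0, x1, x2 = queue.Queue(), queue.Queue(), queue.Queue()
y0, y1, y2 = queue.Queue(), queue.Queue(), queue.Queue()
cells = [threading.Thread(target=ips_cell, args=(2, x0, x1, y0, y1)),
         threading.Thread(target=ips_cell, args=(3, x1, x2, y1, y2))]
for c in cells:
    c.start()
x0.put(4); y0.put(0)   # "Source": one datum with a neutral partial sum
x0.put(None)           # shutdown token, de-allocating cells source-to-sink
for c in cells:
    c.join()
result = y2.get()      # "Sink" collects the final value
print(result)          # prints 20
```

The while-loop-plus-token scheme mirrors the asynchronous termination style discussed above: each cell decides locally to stop when the control token reaches it, so no global cycle count is needed.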
Brent [Brent 1983] used an extended version of Pascal. Ada also seems a
likely candidate, as the Ada rendezvous is very similar to channel communication.
The adoption of Occam, however, offers more direct hardware support for special-purpose
designs as well as common architectures.
CHAPTER 3
FUNDAMENTALS OF DIGITAL IMAGE PROCESSING
3.1 INTRODUCTION
Image processing and image understanding have been fast-growing research
fields in information technology for the last thirty years. Influences on their growth
and advancement have come from studies in artificial intelligence, psychophysics,
computer architecture and computer graphics. Application areas for image
processing include document processing, medicine and physiology, remote
sensing, industrial automation and surveillance, amongst many others.
Image processing involves the various operations which can be carried out on
image data. These operations include preprocessing, spatial filtering, image
enhancement, feature detection, image compression and image restoration,
though this list is not exhaustive. Image compression [Gonzalez 1992] is mainly used
for image transmission and storage. Image restoration involves smoothing
processes which restore a degraded image to something close to the 'ideal'.
Image processing operations are generally categorised into three levels: low-level
(image enhancement), medium-level (feature extraction) and high-level (scene
interpretation). As shown in fig. (3.1), the complexity of the operations and the
volume of data required at each level vary greatly. Image enhancement
comprises operations on pixel values, repeated over the entire image.
Feature extraction involves identifying important features within the image and
extracting useful information about them. Scene interpretation is when an intelligent
decision is made regarding the contents of the image, using the extracted
features. Computer vision involves techniques from image processing, pattern
recognition and artificial intelligence; the process attempts to recognise and locate
objects in the scene.
[Figure: two-dimensional image data flowing through image capture, low-level
operations and feature extraction into a feature database]

Figure (3.1) Image processing operations.
One area which has received considerable attention in recent years is the
design of real-time systems for the early processing of sensory data (i.e. low-level
image and signal processing). Such systems must handle large quantities of data
(typical images have 512 * 512 pixels) at a high throughput. In section (3.3) we
explore the different types of algorithm required for low-level image processing.
Many low-level vision algorithms are highly regular, data-independent and operate
on a spatially local data set. These characteristics indicate that the algorithms have
an inherent parallelism which can be exploited by mapping them onto arrays of
processing elements operating in parallel.
Many algorithms have been devised for machine vision tasks, but they have
been heavily influenced by the von Neumann architecture on which they were
developed. Some of these influences are [Hussain 1991]:
1) the way the problem and data can be represented;
2) the description of the problem to be solved;
3) the way the problem can be mapped to the architecture;
4) the data structures which can be used;
5) the data and control flow allowed by the system;
6) that the algorithm is optimised to run on the available architecture.
People involved in machine vision have long been aware of the parallelism
involved in the task and have been designing hardware for it for some time. Early (or
low-level) vision involves tasks such as filtering, segmentation, feature detection
and optic flow; all involve local computation on the image array, and these
operations are highly parallel.
There are different architectures for carrying out these varied operations. Many
groups have developed systolic arrays, which give a good rate of data throughput;
many have augmented mesh architectures by building pyramid architectures; others
have developed more general parallel machines based on shared memory, etc.
Section (3.4) explores the various methods of implementing low-level
algorithms on the various types of parallel architecture.
There are different ways in which both images and data can be represented.
Images are typically represented using a 2D array of picture elements (pixels) which
hold the intensity values (grey levels). The image itself can be of various
types: it may be the result of filtering, or the result of a transformation (such as an
FFT). To extract information from these images, different types of operation are
required; these may involve local operations such as convolution or morphological
operations.
This chapter concentrates on low-level image processing algorithms for
parallel computers, parallel hardware, the various methods of implementing
these algorithms on the various types of parallel architecture, and the image
processing techniques required for image filtering.
3.2 LOW-LEVEL IMAGE PROCESSING ALGORITHMS
The aim of computer vision systems is to analyse a scene in order to gain
some understanding of its contents. An image is an l x l array of pixels (picture
elements), each representing the light intensity at that point in the image. Images are
enhanced using low-level algorithms which perform image-to-image
transformations. These algorithms are used to eliminate noise, improve contrast and
detect certain low-level features such as edges.
The majority of image-to-image transformations have spatially localised
inputs: each pixel in the output image is some function of a small window of
neighbouring pixels from the input image. Such algorithms are referred to as local
windowing operations. The algorithms are regular and local, and for each window
in the image the computation is proportional to the size of the window. As the
window is moved across the image, the windowing operation is performed at each
pixel position. Each application of the windowing operation produces one pixel of
the output image.
Many low-level image processing operations, termed local operators, require
access to the four or eight neighbouring intensity values of a pixel when computing
the new value for that pixel. Each element in the image is replaced by some
function of itself and the neighbouring elements within a window centred on that
element. Common sizes of the neighbourhood window are 3 x 3 and 5 x 5.
A powerful method of image computation is to apply a system process P to an
input function F(x,y) and generate a transformed output function H(x,y) [Undrill
1992]:

H(x,y) = P[F(x,y)]     (3.1)

For a local windowing operation there are (k² - 1) additions and k² multiplications
per window (where k is the size of the window), and the window operation is
applied (l - k + 1)² times.
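These counts follow directly from a naive implementation of a local windowing operation. The sketch below (pure Python for clarity, not an optimised routine; the 4 x 4 test image is illustrative) accumulates a weighted sum over each k x k neighbourhood, costing k² multiplications and k² - 1 additions per window, and slides the window over the (l - k + 1)² valid positions of an l x l image.

```python
# Direct local windowing operation: each output pixel is the weighted sum
# of a k x k neighbourhood of the input (no border padding), so an l x l
# image yields an (l - k + 1) x (l - k + 1) output.

def window_op(image, kernel):
    l, k = len(image), len(kernel)
    out = []
    for r in range(l - k + 1):
        row = []
        for c in range(l - k + 1):
            acc = 0                      # k^2 multiplies, k^2 - 1 adds per window
            for i in range(k):
                for j in range(k):
                    acc += kernel[i][j] * image[r + i][c + j]
            row.append(acc)
        out.append(row)
    return out

# 3 x 3 box filter (all-ones kernel) over a 4 x 4 image gives a 2 x 2 output.
img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
box3 = [[1] * 3 for _ in range(3)]
print(window_op(img, box3))  # prints [[54, 63], [90, 99]]
```

Dividing each output by k² turns the box filter into the mean (smoothing) filter discussed below.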
Local neighbourhood operations are also useful in smoothing the image, to
reduce the effects of noise, clean up, or blur the image. This category includes the
most common image processing function: the application of both linear and non-linear
filters. Linear filters, while they remove noise, also blur edges. For this
reason, non-linear filters such as rank filters are used. In the next section we give
an introduction to the mathematical approach and algorithms for various filters.
Measures such as the point-wise standard deviation (SD) and signal-to-noise
ratio (SNR) give a quantitative assessment of the noise in the imaging system. A
measurement of the width of edges in an image gives a quantitative value for the
separation distance of two identical objects before they can be identified with any
certainty.
There are various measures of noise, which may be classified as global
or local. Global measures involve the entire image, for instance the root-mean-square
deviation between the image and the actual object, or between the image and
its ensemble average. Local measures are, for example, the local signal-to-noise
ratio (SNR).
Local windowing operations tend to exhibit a low degree of data dependency,
which means that all parts of the image are treated uniformly. Data-dependent image-to-image
transformations occur when values are adapted 'on the fly' for different
regions of the image, as in adaptive filtering and adaptive thresholding.
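Adaptive thresholding illustrates the data-dependent case: the value compared against each pixel is derived from that pixel's own neighbourhood rather than being fixed in advance. The sketch below is illustrative only (borders are simply skipped, and the window size and test image are assumptions), but it shows why such operations cannot be planned as a uniform, data-independent sweep.

```python
# Adaptive thresholding: each interior pixel is compared against the mean
# of its own k x k neighbourhood, so the effective threshold varies with
# the local image content.

def adaptive_threshold(image, k):
    l, h = len(image), k // 2
    out = [[0] * l for _ in range(l)]
    for r in range(h, l - h):
        for c in range(h, l - h):
            window = [image[r + i][c + j]
                      for i in range(-h, h + 1) for j in range(-h, h + 1)]
            local_mean = sum(window) / (k * k)
            out[r][c] = 1 if image[r][c] > local_mean else 0
    return out

# A single bright pixel on a dark background is picked out, because only
# that pixel exceeds the mean of its own neighbourhood.
img = [[10, 10, 10, 10],
       [10, 200, 10, 10],
       [10, 10, 10, 10],
       [10, 10, 10, 10]]
print(adaptive_threshold(img, 3))
# prints [[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
```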
3.2.1 Parallel Paradigm
The factors which affect the performance of an algorithm on a particular
architecture depend on the degree of parallelism and on the overheads incurred
in scheduling and synchronising the tasks.
The use of dedicated image processing hardware imposes two constraints on
the algorithms employed for image analysis [Giloi 1991]:
- the algorithms must be regular, in order to be performed by specific high-speed
processors;
- the algorithms must be sufficiently simple to ensure the cost-effective
realization of the dedicated processors.
An algorithm is regular if it is performed on the entire pixel matrix, or a
windowed region thereof, in a data-stream mode. The choice of algorithm to
solve a particular problem is strongly influenced by the hardware architecture and
the software support tools available.
Just as there is difficulty and confusion over architecture taxonomy, similar
difficulties exist in classifying algorithms. Often, algorithms have been
developed without reference to an architecture, use more than one model of
parallelism, and are therefore difficult to implement in practice [Hussain 1991].
The two appropriate types of multiprocessor system used for image
processing applications are those which provide SIMD processing and those
which provide MIMD processing [Flynn 1966] (see Chapter 2 for further details).
There are two types of parallelism for image processing algorithms: i) data or
image parallelism, and ii) task parallelism. Morrow and Perrott [Morrow 1987]
have discussed and compared these two kinds of image processing parallelism. In
the following sections a brief overview of each is given.
3.2.1.1 Image Parallelism
In the image-parallel (or data-parallel) paradigm of computation, the image is
initially divided and distributed over the available processors. The algorithm to be
applied to the image then executes on each processor, performing its operations on
the image segment local to that processor.
The image-parallel paradigm of computation is explicitly synchronized; it
maps to the SIMD model of programming where each of the processors executes
the same code on its local data simultaneously. The advantages of image-parallel
programming is that there is a very simple control flow, the data array changes from
one state to another.
If there are fewer processors than the number of pixels in the image, the picture will need to be partitioned amongst the available processors. An image of 512 x 512 pixels may be 'folded' so that there are 16 (128 x 128) square arrays; in processor farms or linear arrays, it is more likely that the image will be partitioned vertically or horizontally into m slices for the m available processors. This partitioning is much simpler for the host computer in both transmitting and collecting back the images and data. For neighbourhood operations involving k x k kernels, each slice will have to have k rows or columns added to the border of the segment, so that adjacent segments overlap [Manning 1988].
Many image-parallel algorithms can be implemented on transputer networks, where the image is partitioned over the available transputers. The algorithm to be applied to the image then executes on each transputer, performing its operations on the image segment local to that transputer. One of the transputers in the network is designated the 'master' and communicates with the host transputer. An image is obtained from the host via the master and distributed to the remaining transputers in the network, thus providing an image
segment for each processor. All results are sent back to the master which, in turn,
passes them to the host.
3.2.1.2 Task Parallelism
In contrast to image parallelism, where the configuration of the processing elements is the same for each algorithm, the processors are configured differently for each task-parallel algorithm.
With task parallelism the algorithm under consideration is sub-divided into relatively distinct sub-tasks. Task parallelism on a multiprocessor computer requires the following steps:
- the algorithm needs to be partitioned into sub-tasks;
- the sub-tasks and data need to be distributed amongst the processors; and
- the system must be set up to allow interprocessor communication and synchronisation.
These three steps must be carried out in the specified order because the requirement for communication and synchronisation cannot be determined in advance of the distribution of the algorithm amongst the processors. For new algorithms and architectures, the above processes have to be assessed anew each time.
Many low-level image processing algorithms can be implemented on systolic arrays; some of these implementations will be explored in section (3.4).
With the MIMD type of computers, tasks may need to be partitioned (and
grouped) so that communication is minimised. There may be an overhead involved
in initiating communication between processors.
3.3 IMAGE FILTERING
In this section we consider techniques for filtering digital images. This
includes both low pass (smoothing) and high pass (edge enhancement) filters. The
principle objective of the enhancement techniques is to process a given image so
that the result is more suitable than the original image for a specific apphcation
[Gonzalez 1992]. Image enhancement involves noise removal to deblur the edges of
objects and to highlight specified features. The enhancement filter attempts to
improve the quality of an image for human or machine interpretability, where
quality is measured subjectively [Niblack 1986].
Image filtering techniques can be subdivided mto two main categories: ones
which act on the whole or large-sections of the image and the others involving small
neighbourhood windows.
Global techniques include least-squares filtering [Rosenfeld 1982] and Kalman filters. Least-squares filter techniques require a statistical model of the signal and the noise. The filtered images produced by these techniques have blurred edges.
Local methods are generally computationally more efficient. However, their greatest advantage comes from their ability to process several windows in parallel.
If the image is corrupted by random impulse noise, then linear or non-linear local window operations may be employed. The simplest of the linear operations is equal-weighted averaging [Rosenfeld 1982]. While this method is efficient in removing noise it will blur edges, and blurring is more severe for larger windows.
The blurring effect can be reduced slightly by using a weighted averaging technique. The foundation of this technique comes from the convolution theorem (section 4.2). Let g(x,y) be the image formed by the convolution of an image f(x,y) with a kernel of weights w, that is

g(x,y) = Σ (i=-m to m) Σ (j=-n to n) w(i,j) f(x-i, y-j)        (3.2)
where the w(i,j) are normalised weights, often binomial coefficients. In physical systems the kernel w must always be non-negative, which results in some blurring or averaging of the image. In fact, by extending the basic idea of convolution, the weights of w may be varied over the image, and the size and the shape of the window varied. With this flexibility, a wide range of linear, non-linear, and adaptive filters may be implemented, for example for edge enhancement or selective smoothing.
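Equation (3.2) can be evaluated directly as a windowed sum of products. The sketch below is illustrative, not the thesis implementation: `convolve2d` is an invented name, images are lists of rows, and border pixels are simply copied through unfiltered.

```python
def convolve2d(f, w):
    """Direct evaluation of equation (3.2) for a (2m+1) x (2n+1) kernel w
    over an image f; border pixels are left unchanged in this sketch."""
    m, n = len(w) // 2, len(w[0]) // 2
    H, W = len(f), len(f[0])
    g = [row[:] for row in f]
    for x in range(m, H - m):
        for y in range(n, W - n):
            g[x][y] = sum(w[i + m][j + n] * f[x - i][y - j]
                          for i in range(-m, m + 1)
                          for j in range(-n, n + 1))
    return g

# equal-weighted 3 x 3 averaging of a constant image leaves it unchanged,
# since the weights are normalised to sum to one
w = [[1 / 9] * 3 for _ in range(3)]
f = [[9] * 5 for _ in range(5)]
g = convolve2d(f, w)
```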
Before proceeding to smoothing filters, we describe the gradient and Laplacian operators. These are two filters, in fact two classes of filters, that are often applied to digital images as convolutions.
3.3.1 Digital approximations to the Gradient and Laplacian
Operators
The gradient and Laplacian operators are related to the vector gradient and scalar Laplacian of calculus. These are defined for a continuous function f(x,y) of two variables as:

Gradient:   ∇f = (∂f/∂x) i + (∂f/∂y) j                  (3.3)
Laplacian:  ∇²f = ∂²f/∂x² + ∂²f/∂y²                     (3.4)

where i and j are unit vectors in the x and y directions.
The most common and historically earliest edge operator is the gradient operator [Ballard 1982]. The gradient operator applied to a continuous function produces a vector at each point whose direction gives the direction of maximum change of the function at that point, and whose magnitude gives the magnitude of this maximum change [Niblack 1986].
One digital gradient window gives the x component gx of the gradient, and the other gives the y component gy:

gx(i,j) = maskx * n(i,j)
gy(i,j) = masky * n(i,j)

where n(i,j) is some neighbourhood of (i,j) and * represents the sum of products of the corresponding terms.
For a digital image, analogously, we could use first differences, giving

gx(i,j) = r(i,j+1) - r(i,j)
gy(i,j) = r(i+1,j) - r(i,j)

Note that these are digital convolution operators which compute gx and gy with the simplest set of masks:

maskx = [-1  1]

masky = [-1]
        [ 1]
The maskx generates output values centred on the point (i,j+1/2) and masky generates output values centred on (i+1/2,j). To obtain values centred on (i,j), symmetric masks about (i,j) are most often used. We get

gx(i,j) = r(i,j+1) - r(i,j-1)                           (3.5)
gy(i,j) = r(i+1,j) - r(i-1,j)                           (3.6)

These operators measure the horizontal and vertical changes in gx and gy. Note that the set of masks is:

maskx = [-1  0  1]

masky = [-1]
        [ 0]
        [ 1]
Another set of masks, called Roberts operators, are not oriented along the x and y directions, but are nevertheless similar. They are defined on a 2 x 2 window as:

maska = [ 1  0]        maskb = [ 0  1]
        [ 0 -1]                [-1  0]

Whatever masks are used, the gradient operator produces a two-element vector at each pixel, and this is usually stored as two new images, one for each component.
Sometimes the gradient is wanted as a magnitude gv and a direction gd. These can be computed from gx and gy as

gv(i,j) = √( gx(i,j)² + gy(i,j)² )

or, more cheaply,

gv(i,j) = |gx(i,j)| + |gy(i,j)|

and

gd(i,j) = arctan( gy(i,j) / gx(i,j) )
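The central-difference gradient of equations (3.5) and (3.6), together with the magnitude and direction, can be sketched as follows. The function name `gradient` is invented; the sketch handles interior pixels only.

```python
import math

def gradient(r, i, j):
    """Central-difference gradient per equations (3.5) and (3.6) at an
    interior pixel of image r (a list of rows), with magnitude gv and
    direction gd as discussed above."""
    gx = r[i][j + 1] - r[i][j - 1]
    gy = r[i + 1][j] - r[i - 1][j]
    gv = math.hypot(gx, gy)   # magnitude sqrt(gx^2 + gy^2)
    gd = math.atan2(gy, gx)   # direction
    return gx, gy, gv, gd

# a vertical step edge: gx responds to the horizontal change, gy does not
img = [[0, 0, 10, 10]] * 3
gx, gy, gv, gd = gradient(img, 1, 2)
```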
As we saw in equation (3.4), the Laplacian operator, which in one dimension reduces to the second derivative, is also computed by convolving a mask with the image. One of the masks that is used may be derived by comparing the continuous and digital cases as follows [Niblack 1986]:

f(x)   ≈ r(i)
f'(x)  ≈ r'(i)  = r(i) - r(i-1)
f''(x) ≈ r''(i) = r'(i) - r'(i-1)
               = [r(i) - r(i-1)] - [r(i-1) - r(i-2)]
               = r(i-2) - 2r(i-1) + r(i)
               = (1 -2 1) · (r(i-2) r(i-1) r(i))
giving the convolution mask (1 -2 1). In this form, the Laplacian at i is computed from values centred about i-1. To keep the Laplacian symmetric, it is normally shifted and given at i as:

(1 -2 1) · (r(i-1) r(i) r(i+1))

Also, the sign is typically changed to give:

(-1 2 -1) · (r(i-1) r(i) r(i+1))

and this is a common form of the one dimensional digital Laplacian, although mathematically it is the negative of the Laplacian. Different choices are available when extending this mask to two dimensions. A plus-shaped standard mask is:

    -1
-1   4  -1
    -1

which is the negative of the mathematical Laplacian operator.
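Applying the plus-shaped mask above at a pixel can be sketched directly; `laplacian` is an invented name and only interior pixels are handled.

```python
def laplacian(r, i, j):
    """The plus-shaped negative-Laplacian mask above, applied at an
    interior pixel (i, j) of image r (a list of rows)."""
    return (4 * r[i][j]
            - r[i - 1][j] - r[i + 1][j]
            - r[i][j - 1] - r[i][j + 1])

# a flat region gives zero response; an isolated bright pixel responds
flat = [[7] * 3 for _ in range(3)]
spike = [[0, 0, 0], [0, 5, 0], [0, 0, 0]]
```

This illustrates the point made below: the mask responds only where the rate of change of gray level itself changes, and it responds strongly to isolated (noise-like) pixels.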
The digital Laplacian responds to the 'shoulders' at the top and bottom of a ramp, where there is a change in the rate of change of gray level [Rosenfeld 1982].
The window masks given here for the gradient and Laplacian operators are fairly standard, but many other operators have been defined, in many cases using larger windows, say 5 x 5. Also notice that the gradient gives both magnitude and direction information about the change in pixel values at a point, whereas the Laplacian is a scalar giving only magnitude. The digital Laplacian responds to noise as strongly as it does to edges; thus the gradient operator would ordinarily be a better edge detector than the Laplacian operator.
3.3.2 Low Pass and High Pass Filters
Low pass filters are smoothing filters designed to reduce the noise, detail, or 'busy-ness' in an image. If multiple copies of the image are available or can be obtained, they can be averaged pixel by pixel to improve the signal to noise ratio. However, in most cases only a single image is available. For this case, typical smoothing filters perform some form of moving window operation that may be a convolution or other local computation in the window [Niblack 1986, Ekstrom 1984, Gonzalez 1992].
It is easy to smooth out an image, but the basic problem of smoothing filters is how to do this without blurring out the interesting features. For this reason, much emphasis in smoothing is on 'edge-preserving smoothing'. Salt-and-pepper noise, created in images by bit errors, can be removed by use of low pass filters such as median filters [Rosenfeld 1982, Hussain 1991]. In a small window the pixels are nearly homogeneous; only a small portion of these pixels are noise pixels.
Edge enhancement (or image sharpening) techniques are useful primarily as enhancement tools for highlighting edges in an image. These filters are the opposite of smoothing filters: whereas smoothing filters are low pass filters, edge enhancement filters are high pass filters [Gonzalez 1992, Hussain 1991]. The term 'edge detector' is also used. This may mean a simple high pass filter, but sometimes may be more general, including a thresholding of the points into edge and non-edge categories, and even the linking up of edge pixels into connected boundaries in the image.
Below is a brief review of several image smoothing and sharpening filters:
1- Median filtering: Median filtering is a nonlinear process useful in reducing impulsive or salt-and-pepper noise. It is also useful in preserving edges in an image while reducing random noise. A pixel value is replaced by the median of its neighbours. The median of a set of numbers is the value such that 50% are above and 50% are below. For example, when the pixel values within a window are 5, 6, 35, 10, and 15, and the pixel being processed has a value of 35, its value is changed to 10, which is the median of the five values.
Conceptually simple, the median filter is somewhat awkward to implement because of the pixel value sorting required. However, it is one of the better edge-preserving smoothing filters [Danielsson 1981, Ekstrom 1984].
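A 3 x 3 median step can be sketched as below; `median_filter` is an invented name, the sorting approach is the naive one noted above, and only interior pixels are handled.

```python
def median_filter(r, i, j):
    """3 x 3 median at an interior pixel: replace r[i][j] by the median
    of its neighbourhood, via the explicit sort described above."""
    window = sorted(r[i + k][j + l] for k in (-1, 0, 1) for l in (-1, 0, 1))
    return window[len(window) // 2]  # middle of the 9 sorted values

# an impulse-noise pixel of 35 in a quiet neighbourhood is suppressed
img = [[5, 6, 5], [10, 35, 15], [6, 5, 10]]
```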
2- Mean: If the noise in an image appears random and uncorrelated, then the affected pixels can be replaced by a local average, or mean, to reduce the gray level variations. For an N x N window with pixel gray levels I(i,j), where i,j = 1,2,....,N, the average is

Ī = (1/N²) Σ (i=1 to N) Σ (j=1 to N) I(i,j)

The size and shape of the window over which the mean is computed can be selected. Figure (3.2) shows one approach for extracting neighbourhoods from an image array. The neighbourhood of a point is defined in this case by the set of points inside, or on the boundary of, a circle centred about the point in question. For example, the mean filter can use a square window or a plus-shaped window [Chin 1983, Gonzalez 1992].
3- Weighted mean: A weighted mean is often used, in which the weight for a pixel is related to its distance from the centre point. The size and shape of the window can be selected. The approach for extracting neighbourhoods from an image array shown in fig. (3.2) also applies to the weighted mean filter [Niblack 1986]. For 3 x 3 windows, the weights may be:
1/16  1/8  1/16              1/6
1/8   1/4  1/8         1/6   1/3   1/6
1/16  1/8  1/16              1/6

square window          plus-shaped window

The square weighted-mean window is separable. Let W be the 3 x 3 square window kernel above and let wv = wh = (1/4 1/2 1/4)ᵀ; then W is given by:

             [1/4]
W = wv whᵀ = [1/2] (1/4 1/2 1/4)
             [1/4]
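The separability identity W = wv whᵀ can be checked directly with an outer product; `outer` is an invented helper name.

```python
def outer(v, h):
    """Outer product v * h^T, showing that the square weighted-mean
    window above is the outer product of two 1D kernels."""
    return [[a * b for b in h] for a in v]

wv = wh = [1 / 4, 1 / 2, 1 / 4]
W = outer(wv, wh)
```

Separability matters in practice: a separable 3 x 3 convolution can be done as two 1D passes, reducing the work per pixel.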
Figure (3.2) Pixel neighbourhoods (9-point and 5-point).
4- k nearest neighbour averaging: This method is based on the fact that the gray levels of pixels belonging to the same population within an N x N window are highly correlated. The centre point I of an N x N neighbourhood is replaced by the average gray level of the neighbours of I whose gray levels are closest to that of I. A typical value of k is 6 for an N = 3 square window centred on I. This is another filter used in edge-preserving smoothing [Chin 1983].
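The k-nearest-neighbour step can be sketched as follows; `knn_average` is an invented name, the window is fixed at 3 x 3, and only interior pixels are handled.

```python
def knn_average(r, i, j, k=6):
    """k-nearest-neighbour averaging in a 3 x 3 window: replace the
    centre by the mean of the k neighbours whose gray levels are
    closest to the centre value, as described above."""
    centre = r[i][j]
    neighbours = [r[i + di][j + dj]
                  for di in (-1, 0, 1) for dj in (-1, 0, 1)
                  if (di, dj) != (0, 0)]
    closest = sorted(neighbours, key=lambda v: abs(v - centre))[:k]
    return sum(closest) / k

# the outlying neighbour 100 is never among the 6 closest, so the
# centre of a flat region stays flat
img = [[10, 10, 10], [10, 10, 10], [10, 10, 100]]
```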
5- Inverse gradient filter: This smoothing scheme is based on the observation that the variations of gray level inside a region are smaller than those between regions. In other words, the absolute value of the gradient at an edge is higher than within regions. The weighting coefficients are the normalized gradient inverses between the centre point and its neighbours.
For a pixel I(i,j) in an m x m image, where i,j = 1,2,....,m, the inverse of the absolute gradient at I(i,j) is defined as

r(i+k,j+l) = 1 / |I(i+k,j+l) - I(i,j)|

where k,l = -1,0,1, but k and l are not both zero at the same time. In other words, the r(i+k,j+l)'s are calculated for the eight immediate neighbours of I(i,j). The smoothed pixel O(i,j) is computed as:

O(i,j) = (1/2) I(i,j) + (1/2) Σ (k=-1 to 1) Σ (l=-1 to 1) w(i+k,j+l) I(i+k,j+l)        (3.7)

where the w(i+k,j+l) are the r(i+k,j+l) normalized to sum to one.
If I is in the immediate vicinity of an edge, those pixels outside the region will be weighted very lightly; thus details will not be significantly blurred [Wang 1981].
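Equation (3.7) can be sketched as below. `inverse_gradient` is an invented name, and one detail is an assumption not fixed by the text above: where a neighbour equals the centre the gradient inverse is undefined, and this sketch substitutes the value 2 there.

```python
def inverse_gradient(r, i, j):
    """Inverse-gradient smoothing per equation (3.7) at an interior pixel:
    weights are the normalised inverse absolute differences between the
    centre and its eight neighbours. ASSUMPTION: a neighbour equal to the
    centre gets inverse-gradient value 2, since 1/0 is undefined."""
    offs = [(k, l) for k in (-1, 0, 1) for l in (-1, 0, 1) if (k, l) != (0, 0)]
    inv = [2.0 if r[i + k][j + l] == r[i][j]
           else 1.0 / abs(r[i + k][j + l] - r[i][j]) for k, l in offs]
    total = sum(inv)
    return 0.5 * r[i][j] + 0.5 * sum(
        g / total * r[i + k][j + l] for g, (k, l) in zip(inv, offs))

# a constant region passes through unchanged
img = [[10] * 3 for _ in range(3)]
```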
6- Sigma filter: Set O(i,j) equal to the average of all pixels in the neighbourhood whose value is within t counts of the value of I(i,j). Here t is an adjustable parameter; the filter is called the Sigma filter because t may be derived from the sigma, or standard deviation, of the pixel value distribution. Let

r(i+k,j+l) = 1    if |I(i+k,j+l) - I(i,j)| < t
r(i+k,j+l) = 0    otherwise

where k,l = -1,0,1 for a 3 x 3 window. Then the smoothed pixel O(i,j) is computed as:

O(i,j) = ( Σ (k=-1 to 1) Σ (l=-1 to 1) w(i+k,j+l) ) / ( Σ (k=-1 to 1) Σ (l=-1 to 1) r(i+k,j+l) )        (3.8)

where

w(i+k,j+l) = r(i+k,j+l) I(i+k,j+l)

The sigma range is generally large enough to include most of the pixels from the same distribution in the window, yet in most cases it is small enough to exclude pixels representing high-contrast edges [Lee 1983].
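The sigma-filter step of equation (3.8) can be sketched directly; `sigma_filter` is an invented name and only interior pixels are handled.

```python
def sigma_filter(r, i, j, t):
    """Sigma filter per equation (3.8) at an interior pixel: average the
    3 x 3 window values (centre included) that lie within t counts of
    the centre value r[i][j]."""
    window = [r[i + k][j + l] for k in (-1, 0, 1) for l in (-1, 0, 1)]
    kept = [v for v in window if abs(v - r[i][j]) < t]
    return sum(kept) / len(kept)

# with t = 5 the high-contrast value 100 is excluded from the average
img = [[10, 11, 10], [9, 10, 100], [10, 11, 9]]
```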
7- Closest of minimum and maximum: A filter defined by computing the minimum and maximum of the pixels in n(i,j) and setting O(i,j) to whichever is closest to the value I(i,j) often produces good results by sharpening the boundaries between classes. The filter is typically iterated. It leaves isolated spikes, which may need to be removed by another filter, say a median filter, mixed into the iterations.
8- Gradient operators: A simple way of using gradient operators is to keep only the magnitude (as explained in the previous section). Other methods keep both the magnitude and the direction. The gradient in a given direction may also be computed: if the gradient at pixel I is considered as a vector (gx, gy), then the gradient in the direction of the vector d = (dx, dy) is (gx dx + gy dy) / |d|.
Moving across an edge, the gradient will start at zero, increase to a maximum, and then decrease back to zero; this produces a broad edge [Niblack 1986].
9- Laplacian operators: As described in the previous section.
10- Enhancement in the direction of the gradient: Initially compute the gradient at pixel I(i,j), and then apply another filter (such as the one dimensional Laplacian operator) in the direction of the gradient.
3.4 VLSI IMPLEMENTATION FOR LOW LEVEL IMAGE
PROCESSING.
The purpose of this section is to identify different computational models for implementing low level image processing algorithms on a programmable VLSI processor array, constructed from a systolic array, a pyramid architecture, and an Inmos transputer network. We shall consider the class of image processing algorithms that use local windowing operations (as shown in the previous section).
Before proceeding to these computational models, we describe briefly the CLIP systems. CLIP4 (Cellular Logic Image Processing) was the first large array assembled using custom-designed integrated circuits of two boolean processors. Each of the processors can be loaded with the same or different data, but the same function is performed by all the processors at the same time. The CLIP system is a SIMD computer. The array size limitation of CLIP4, gleaned from several successful image processing applications, led to the development of CLIP4S. But there still remained some severe limitations with memory and with the processing of gray scale images. CLIP4S's successor, CLIP7A, incorporates various levels of local control and an increased amount of memory per processor [Hussain 1991].
3.4.1 Systolic Array Implementation
Many of the early systolic algorithms for 1D convolution and matrix-vector multiplication used bidirectional dataflow, with the data and results flowing in opposite directions [Kung 1979, 1982]. Such a systolic design is shown in fig. (3.3). The array is composed of k identical cells (k is the size of the convolution window) and operates in a totally synchronous fashion. Each cell is capable of performing a multiplication followed by an addition; that is, each cell performs one step in the calculation of a scalar product [Quinton 1991].
There are two data streams in the array: the xi circulate from left to right, entering a new cell on each clock tick, and are not modified as they pass through the cells. The yi circulate in the opposite direction, at the same speed, and are computed by successive accumulation. Initially, their value is 0, and when they exit the array, they have been correctly computed according to equation (3.2). The array outputs a new yi on every two clock ticks. The primary weakness of the array is that its cells are not active on all clock ticks.
To remedy this inefficiency, the two data streams xi and yi must flow in the same direction. Such an array is shown in fig. (3.4).

Figure (3.3) Bidirectional systolic array.
The array is composed of k cells, and the internal registers are preloaded with the window weights. The yi flow at a speed double that of the xi; this is achieved by inserting additional delays into the path of the xi. The array generates a new result after every clock cycle [Kung 1984].
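The dataflow of the unidirectional array can be sketched in software. This is an illustrative model, not the hardware design: the extra delays mean that the partial sum launched into cell 0 at tick s meets input x[s-c] when it reaches cell c, so the code below evaluates the accumulation that this delay structure induces, keeping only fully formed outputs. Function names are invented, and the weights are loaded into the cells in reverse order so that the result matches the direct definition.

```python
def direct(x, w):
    """Direct sliding-window convolution, for checking the array model."""
    k = len(w)
    return [sum(w[i] * x[i + r] for i in range(k))
            for r in range(len(x) - k + 1)]

def systolic(x, w):
    """Dataflow sketch of the unidirectional array of fig. (3.4): cell c
    holds weight w[k-1-c]; a partial sum entering cell 0 at tick s picks
    up w[k-1-c] * x[s-c] at cell c. Timing details are abstracted into
    this index arithmetic."""
    k = len(w)
    out = []
    for s in range(len(x) + 2 * k):   # enough ticks to flush the pipeline
        acc = 0.0
        valid = True
        for c in range(k):
            idx = s - c               # the x value visible at cell c
            if 0 <= idx < len(x):
                acc += w[k - 1 - c] * x[idx]
            else:
                valid = False         # pipeline still filling or draining
        if valid:
            out.append(acc)
    return out

y = systolic([1, 2, 3, 4, 5], [1, 0, 2])
```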
Kung modified the above systolic array in different designs for 2D convolution [Kung 1980, 1982, 1984].
A number of systolic designs for 2D convolution have been implemented. Robert and Tchuente [Robert 1986] have proposed a divide-and-conquer systolic design for 2D convolution, and Megson introduced several designs for the same purpose [Megson 1992].

Figure (3.4) Unidirectional systolic array.
3.4.2 Pyramid Architecture
A pyramids approach to image processing is proposed by Hubel and Wiesel
[Hubel 1962]. The pyramid structure is effectively an inverted order four tree, with
additional communications links between the processors at each level, as shown in
fig. (3.5). Each horizontal level is a square array. Operations involving nearest
neighbour algorithms are presented. The pyramid structure is proposed as an
efficient configuration for image processing due to the inherent hierarchy which is
analogous to the different levels of complexity of operations used in an application.
A different type of processor pyramid is considered. With this SIMD
configuration, an N by N array of simple processor devices can be reconfigured
into any one layer of the overall pyramid. While the number of processor nodes is
Figure (3.5) Three-level pyramid configuration.
decreased on each layer towards the top, the number of processors in each layer is
the same. This means that for all layers excluding the bottom one, there will be
successively more processors at each node. Clearly this arrangement does mean that
only the layer that the processor array has been configured for can execute at any
one time, although if more than one array were used, several layers could execute
concurrently.
Another pyramid architecture currently being constructed is the Warwick Pyramid Machine [Nudd 88]. The WPM consists of three layers; the base layer is a 256 x 256 SIMD array. Above this are 256 processors, each acting as a controller for a 16 x 16 array of processors. The base layer acts on iconic data and the layer above converts this to a symbolic representation.
Transputers have been considered for use in a pyramid architecture for knowledge-based sonar image interpretation. The pyramid configuration proposed uses 21 T800 transputers in three hierarchical layers. The base layer has 16 processors arranged as a 4 x 4 array, each connecting to one of the four processors (in a 2 x 2 array) in the middle layer. The top layer is a single transputer, connecting to each of the four devices in the middle layer. It would appear that this structure is best suited to performing image analysis on repetitively captured images, so that while one layer performs its level of image processing, another layer can be performing its function on the previous image [Manning 1988].
CHAPTER 4
SYSTOLIC DESIGNS FOR DIGITAL CONVOLUTION
4.1 INTRODUCTION
One area which has received considerable attention in recent years is the design of real-time systems for the early processing of sensory data (i.e. low-level vision and signal processing). Multidimensional convolutions constitute some of the most compute-intensive tasks in signal and image processing, since each input data item is used many times, and many input values are needed for the computation of a single output. For example, a two dimensional (2D) convolution using a general 3 x 3 kernel requires 9 multiplications and 8 additions to generate each pixel in the output image. If the dimensionality is higher (typical images have 256 x 256 or 512 x 512 pixels) or the kernel is larger, many more arithmetic operations are required.
Some previously proposed systems were shown in chapter 3; however, these systems suffer from two drawbacks. Firstly, they do not take advantage of the possibility that the arithmetic units could themselves be pipelined. Secondly, they cannot be used to perform convolutions of arbitrary dimensionality; they can, for example, perform only 1D or 2D convolutions but not both.
The use of pipelined components for implementing cells of systolic arrays is especially attractive for applications requiring floating point operations. Commercially available floating point multiplier and adder chips can deliver high throughput; an example of such a chip is the Inmos T800 transputer, whose processor shares its time between any number of concurrent processes [Inmos 1987]. These components, when used to implement systolic cells, form a second level of pipelining: the first level is the global pipelining between array cells, while the additional level pipelines the computations within a cell, and this second level of pipelining can increase the system throughput.
This chapter describes a two-level pipelined systolic array design for 1D convolution which uses pipelined arithmetic units. The systolic array was first proposed by H.T. Kung [Kung and Lam 1984]. It is shown that the systolic array can be extended to handle a 2D convolution. This system can also be extended to handle convolution of any dimensionality.
The systolic array proposed here is a linear array for m-D convolution. It consists of two major building blocks: multiply-add processors and delays. The number of hardware components used is proved to be minimal when compared with an equivalent array of the same structure. Some modifications to the design of the systolic array are also analysed.
The next section presents a definition and system for 1D convolution. A description of the systolic array for 2D convolution is presented in section (4.3). Section (4.4) presents a systolic design for multidimensional convolution. Some improvements to these designs are shown in section (4.5). A description of the transputer networks for one dimensional and two dimensional convolution is presented in sections (4.6) and (4.8). The results and efficiency obtained for each transputer network are presented in sections (4.7) and (4.9), reflecting the performance of these designs on this level of vision. The proof that a design is optimal in terms of the amount of time and memory used is analysed in the last section, where the systolic array is compared with another known design in the literature.
4.2 ONE DIMENSIONAL CONVOLUTION DESIGN
4.2.1 Problem Definition
Given a vector of signals x = {xi}, i = 1,2,....,n, and a kernel of weights w = {wi}, i = 1,2,....,k, with k << n, then convolving the signal x with the kernel w is to compute the quantity

yr = Σ (i=1 to k) wi x(i+r-1)        for r = 1,2,....,n-k+1        (4.1)
Assuming the vector indices increase from left to right, the first result, y1, is obtained by aligning the leftmost element of w with the leftmost element of x, then computing the inner product of w and the overlapped section of the signal x with w.
The kernel slides one position rightward and the inner product calculation is again performed on the overlap to produce the second result, and so on for all the other results. The last result, y(n-k+1), is obtained when the rightmost element of w is aligned with the rightmost element of x.
From equation (4.1) we conclude that there are (k-1) additions and k multiplications per window, and the window operation is applied (n-k+1) times. This is the reason why convolution is considered to be a compute intensive operation.
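Equation (4.1) computed directly can be sketched as follows; `convolve_1d` is an invented name and the indices below are 0-based.

```python
def convolve_1d(x, w):
    """Equation (4.1) computed directly: y_r = sum_i w_i * x_(i+r-1) for
    r = 1 .. n-k+1 (0-based below). Each output costs k multiplications
    and k-1 additions, the operation count noted above."""
    n, k = len(x), len(w)
    return [sum(w[i] * x[i + r] for i in range(k))
            for r in range(n - k + 1)]

y = convolve_1d([1, 2, 3, 4, 5, 6], [1, 1, 1])
```

With n = 6 and k = 3 this produces the n-k+1 = 4 window sums.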
Let us look at the input data as an endless sequence of numbers sliding along the sequence of weights:

x10 x9 x8 x7 x6 x5 x4 x3 x2 x1
       w3 w2 w1

For each relative position of these two sequences, the sum of products of the overlapping weights and inputs gives us one of the convolution values. In the above example we have:

y7 = x9 w3 + x8 w2 + x7 w1

In the next step the relative position is

x10 x9 x8 x7 x6 x5 x4 x3 x2 x1
          w3 w2 w1

and we obtain:

y6 = x8 w3 + x7 w2 + x6 w1

The design of the convolution array described in the next section was developed on the basis of this simple observation.
4.2.2 Systolic Design
As described before the kernel should slide over the signal, but beside that
one can consider the kernel as being stationary in space and the signal as sliding
over the kernel. This view suggests a linearly connected array for performing
convolution so that each kernel element can be held in a single cell throughout the
computation and the number of cells is equal to the number of convolution weights
where the signals passes through the array from left to right
Host -
L.._. Cell! Cell2 Cell3 1---------- Cellk
Figure (4.1) 1D convolution systolic array.
Fig. (4.1) shows an array of cells connected to each other, with each cell holding a single kernel element, as described above. The array consists of k cells (the length of the weight vector). The signal data is pumped into the array by the host or interface unit in regular clock beats. Data and results are pumped through the array, and the final results and the original data are returned to the host (or another host). Only cells on the array boundaries are permitted to communicate with the host, and each of the cells communicates with its left and right neighbour cells only. It is assumed that there is a global clock synchronizing the computation of all components in the system, having a time cycle (step, unit) long enough to accommodate the most complex function performed by a cell, plus the data transfer. In each step, all cells simultaneously perform their I/O and execute their operations.
In each cell (a bifunctional cell) a signal element is held by a register, and the cell contains two subtasks: multiplication of an input x by a weight w, and addition of the result to y, where the multiplier may be pipelined to an arbitrary degree. Each cell produces its partial result one cycle earlier than the cell to its right. The skew can be accomplished by replacing in each cell the register which carries the signal stream with a multistage shift register. For the general case, the adder and the multiplier in each cell are multistage pipeline units as shown in fig. (4.2).
The number of stages of the shift register should be one greater than that of the adder, so if A is the number of stages of the shift register and B is the number of pipeline stages of the adder unit, then

A = B + 1
The Occam code running on this cell takes the following form:

... PROC delay
... PROC pass.data
... PROC multiply
... PROC add
PROC cell (CHAN .....)
  ... declaration of local channels
  PAR
    delay
    delay
    pass.data
    multiply
    add
    delay
It is possible to modify the cell design by reducing the number of pipeline stages for the multiplier unit, by adding another channel to the last stage of the shift register, as shown in fig. (4.3a). This channel is connected to the multiplier unit to broadcast the signal from the shift register to the multiplier unit. The new cell design is shown in fig. (4.3b).
Figure (4.2) Bifunctional cell design for 1D convolution.
We can see from the cell design that both data streams, i.e. the input, which passes through the shift register, and the output, which passes through the adder unit, move in the same direction. At the beginning of a cycle one input data item x enters the leftmost cell, and one element of the output stream y enters the same cell at the same time. The computation of the resulting coefficients is achieved by means of the cells' pipelined accumulation of the partial products.
For example, if k=4 (the length of the vector of weights), A=3 and B=2, then we need a four-cell systolic array. Fig. (4.4) shows snapshots of the execution as the xi and yi enter the array at time 0, where Pi stands for a partial result of yi. The computation of a resulting coefficient is achieved by means of pipelined accumulation of the partial products. For example, y4 is computed in four steps: first w4x4 is calculated in the first cell (processor) at time=6; this is then passed to the right-hand cell, where w3x3
Figure (4.3) (a) Register with two output channels. (b) Modified cell design for 1D convolution.
[Figure: snapshots of the array at Time=12, Time=15 and Time=16]
Figure (4.4) Snapshots of an execution of a systolic 1D convolution
array with k=4.
is calculated at time=8, and added to w4x4. The same procedure is carried on in the
next two cells, where w2x2 is calculated at the third cell at time=10, and w1x1 is
calculated in the last cell at time=12. The partial products are again summed to
produce y4 as output at time=15. With reference to fig. (4.5a), the configuration for
generating y4 is as follows:
y4 = w4x4 + w3x3 + w2x2 + w1x1
The next output y5 is shown in fig. (4.5b), where the kernel moves one
position to the right to give
y5 = w4x5 + w3x4 + w2x3 + w1x2
The snapshots of the systolic system for these output values are shown in fig.
(4.4).
[Figure: the kernel window w1 w2 w3 w4 placed over the sequence x1 x2 x3 x4 x5 x6 .......... x17 x18, at the positions producing (a) y4 and (b) y5]
Figure (4.5) A window of length 4 for convolving the output
values y4 and y5.
Only the first cell (leftmost cell) is permitted to communicate with the first
host, while only the last cell (rightmost cell) is permitted to communicate with the
second host.
The parallel algorithm executed at each of the four cells, for each result, is as
follows:
cell 1      read yi from host 1
            read xi from host 1
            calculate a partial result of yi
            send yi to cell 2
            send xi to cell 2

cell j+1    (j = 1 to 2)
            read yi from cell j
            read xi from cell j
            calculate a partial result of yi
            send yi to cell j+2
            send xi to cell j+2

cell 4      read yi from cell 3
            read xi from cell 3
            calculate the final partial result of yi
            send yi to host 2
            send xi to host 2
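Since Occam code is awkward to run outside a transputer toolset, the cell chain above can be sketched in Python; the function name is ours, and the inner loop sequentialises work that the real array performs concurrently, one cell per pipeline stage, with the two hosts omitted.

```python
def systolic_conv1d(x, w):
    """Sequentialised sketch of the k-cell pipeline: each 'cell' holds one
    weight, adds its partial product to the y value it receives, and passes
    x and y on to the next cell."""
    k = len(w)
    out = []
    for m in range(len(x) - k + 1):
        y = 0                      # y enters the leftmost cell initialised to zero
        for j in range(k):         # the chain of cells, left to right
            y += w[j] * x[m + j]   # cell j's partial-product accumulation
        out.append(y)              # the rightmost cell sends y to host 2
    return out
```

With w = (w1, ..., w4) the inner sum is exactly y4 = w4x4 + w3x3 + w2x2 + w1x1 at the first window position, then the window slides one position per output.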
The main procedure for the systolic array for 1D convolution is
... PROC host1
... PROC cell
... PROC host2
PROC 1D.system (CHAN ....)
  VAR float x[time], p[time], a[k]:
  SEQ
    PAR
      host1
      PAR i = [0 FOR k]
        cell
      host2:
-- (k is the size of the kernel)
Although, in the general case, some invalid results are generated, like y1,
y2 and y3, the fraction of the total results which are invalid is very small since
n >> k.
4.3 TWO DIMENSIONAL CONVOLUTION
4.3.1 Problem Definition
As shown in chapter 3, since the input data in image processing is two
dimensional (with two space indices), the convolution operations will also
be 2D. The main difference between the 1D and 2D operations is that the number of
indices of the former is doubled, as illustrated by the definition below:
Given a 2D image x = (xij)
where i=1,2,...,n1 and j=1,2,...,n2
Also we are given a 2D kernel w = (wij)
where i=1,2,...,k1 and j=1,2,...,k2
with k1 << n1 and k2 << n2
The 2D formula for convolving x with w is

    y(r1,r2) = Σ (i=1 to k1) Σ (j=1 to k2) w(i,j) x(i+r1-1, j+r2-1)        (4.2)

for r1 = 1,2,...,n1-k1+1
and r2 = 1,2,...,n2-k2+1.
The first result, y11, is computed by placing the kernel over the image, so
that w11 covers x11, w12 covers x12, etc., and multiplying the corresponding
elements of w and x, followed by the summation of these products. Let us suppose
the indices increase rightwards and downwards; then the kernel slides one position
to the right for the computation of the second result y12.
The kernel moves one position downward and back to the left edge of the
image after the first output row is computed. Similar steps are repeated for all
the rows; the last result is computed when w(k1,k2) covers x(n1,n2).
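The sliding-window computation just described can be written directly from equation (4.2); the Python sketch below (the function name is ours) uses 0-based indices, so y[0][0] corresponds to y11.

```python
def conv2d(x, w):
    """Direct 2D convolution per equation (4.2), 0-based:
    y[r1][r2] = sum over i, j of w[i][j] * x[r1 + i][r2 + j]."""
    n1, n2 = len(x), len(x[0])
    k1, k2 = len(w), len(w[0])
    return [[sum(w[i][j] * x[r1 + i][r2 + j]
                 for i in range(k1) for j in range(k2))
             for r2 in range(n2 - k2 + 1)]
            for r1 in range(n1 - k1 + 1)]
```

Note that the output has (n1-k1+1) by (n2-k2+1) valid entries, in agreement with the index ranges r1 and r2 given under equation (4.2).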
There are (k² − 1) additions and k² multiplications per window, and the
window operation is applied (n − k + 1)² times. As this indicates, 2D convolution is
even more compute-intensive than 1D convolution and, because of that, it requires
very careful design of both the algorithm and the hardware. In this chapter, we will
improve the two-level pipelined systolic array design to fit with definition (4.2).
4.3.2 Computation of 2D Convolution as ID Convolution
From a computational point of view, an efficient way of computing 2D
convolution is first to conven it into that of computing a lD convolution, and
therefore the 2D convolution can be performed on a systolic array for ID
convolutions.
The 2D convolution defined in equation (4.2) can be viewed as lD
convolution.
Each row of the image input is represented as xi
where
xi = xi1, xi2, ............, xin2
and
i = 1,2,....,n1
From that the image input is defined as
X = x1, x2, ............, xn1        (4.3)
The total length of the 2D image is n1 n2.
Fig. (4.6) illustrates a 2D convolution and its conversion to a 1D convolution,
as explained above.
The kernel can be converted in a similar manner, where each row is represented as wi
where
wi = wi1, wi2, ............, wik2        (i=1,2,......,k1)

[Figure: (a) the 5 x 5 image x (elements x11 ... x55) and the 3 x 3 kernel w (elements w11 ... w33); (b) the flattened sequences:]

X = x11, x12, x13, x14, x15, x21, x22, x23, x24, x25, x31, ........, x55
W = w11, w12, w13, 0, 0, w21, w22, w23, 0, 0, w31, w32, w33

Figure (4.6) An example of converting a 2D image and kernel
to a 1D image and kernel. (a) 2D image input x, and 2D kernel w. (b)
1D image input x, and 1D kernel w.
Thus the total kernel will be defined as:
W = w1, 0...0 (n2−k2 zeros), w2, 0...0 (n2−k2 zeros), ....., wk1        (4.4)
Equation (4.4) shows the concatenation of the rows of the 2D kernel, with a
vector of (n2−k2) zero elements inserted between each consecutive pair of rows.
The total length of the 1D kernel is therefore n2(k1−1)+k2. Fig. (4.6)
illustrates an example of converting a 2D kernel to a 1D kernel.
At this point, we note the constraint stipulating that the input sequence
(pixels) is formed by the rows of the input array entered one after another, without
any delays between consecutive rows.
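A quick way to check the conversion is to flatten a small example in software. The sketch below (function name ours) builds the 1D sequences of equations (4.3) and (4.4); an ordinary 1D window starting at offset r1·n2 + r2 (0-based) then reproduces the 2D result y(r1,r2).

```python
def to_1d(image, kernel):
    """Flatten per section 4.3.2: image rows are simply concatenated
    (eq. 4.3); kernel rows are concatenated with n2 - k2 zeros inserted
    between consecutive rows (eq. 4.4)."""
    n2, k2 = len(image[0]), len(kernel[0])
    x = [v for row in image for v in row]
    w = []
    for i, row in enumerate(kernel):
        w.extend(row)
        if i < len(kernel) - 1:
            w.extend([0] * (n2 - k2))   # pad each kernel row out to n2 elements
    return x, w
```

For a 3 x 3 image and a 2 x 2 kernel, the flattened kernel has length n2(k1−1)+k2 = 3·1+2 = 5, and the 1D window at offset 0 yields y(1,1).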
4.3.3 Systolic Array Design for 2D Convolution
From the illustration in the previous section, we know that a 2D
convolution operation can be computed as a sum of 1D convolutions. A 2D
convolution array consists, therefore, of 1D convolution arrays. We use k1 linear
1D arrays, as described in the previous section, as a design for 2D convolution.
As shown in fig. (4.7), we connect the 1D segments to produce a 2D array
for a 3 x 3 kernel and an input data row of 5 elements. Each of these arrays (or
segments) has k2 bifunctional cells, where k2 is the number of elements in each
kernel row. Also, there are (n2−k2) cells with zero kernel element in each segment,
except the last segment, where n2 is the number of pixels in each image row, so
the total number of cells in each segment is n2 cells, and the total number of cells in
the system is therefore n2(k1−1)+k2, as shown in the previous section. In fig. (4.7)
x stands for the input sequence.
It can be seen that a large number of cells would be needed in the array
when n1 and n2 are large, and the cells with zero weights would perform no useful
work (N.B., n >> k). We show here that a kernel containing a large number of
zero elements can be handled efficiently, because the number of
114
~ - r---- - -hostl Wll W12 W13 W:{) W:{)
~ - r---- - - -X
~
1-- I-- 1-- ~ ~ W:{) W:{) W23 W22 W21 ~ 1-- I-- 1-- ~
L...\ .... ~ -- W31 W32 W33 host2
' ~
Figure (4.7) Systolic array for 2D convolution where ki> k2=3 and n1>n2=S.
stages of the shift register in each individual cell is adjustable.
Let us consider the cell shown in fig. (4.3b) with a kernel element equal to
zero; the only effect of that cell is to delay the y stream by B cycles and the x stream
by A cycles, where A and B are as defined in section (4.2.2). It should therefore be
apparent that if this cell is replaced by a cell having zero cycles of delay for the output
stream and a single cycle of delay for the x stream, the same output stream would be
generated. This degenerate cell may be absorbed into the cell to its left by increasing
the number of shift register stages of that cell by one.
Since k2 is the number of nonzero elements, these elements are
loaded into consecutive cells of the systolic array. Thus, no more than k2 cells are
needed in the array.
Now let D be the number of zero kernel elements in each row; then
D = n2 − k2        (4.5)
[Figure: cell connected to host1, with adder (Yin to Yout), multiplier, and a shift register of Ak stages (Xin, R1 ... RAk, Xout)]
Figure (4.8) Cell design with added shift register stages (Ak).
[Figure: three segments of three cells each, holding the weights w33 w32 w31 (cell A, cell A, cell B), w21 w22 w23 (cell B, cell A, cell A) and w13 w12 w11 (cell A, cell A, cell A), ending at host2]
Figure (4.9) Systolic array for 2D convolution.
Let Ak be the number of stages of the shift register in cell B, which is
the last cell with a non-zero element in the first row.
Then the shift register length will be
Ak = A + D        (4.6)
where A is the original number of shift register stages.
If we take the example shown in fig. (4.7), then only three cells are needed
in each segment; the shift register in the first two cells is three and, from equation
(4.6), Ak=5 at the third cell. Fig. (4.8) shows a cell design with Ak shift register stages,
and the new systolic array design for that example is shown in fig. (4.9).
We conclude from the systolic array design described above that we have
k1(k2−1)+1 cells with A shift register stages and (k1−1) cells each with a shift register of
Ak stages.
In the general case the array consists of k1 segments each having k2 cells,
which adds up to k1k2 bifunctional cells.
Equation (4.6) can be generalised to any size of kernel and any number of
zero kernel elements. If R is the number of stages of the shift register, in general,
then
R = Ak
and
R = A + n2 − k2
Let L be the number of zero kernel elements between any two non-zero
kernel elements; then
R = A + n2 − (k2 + L)
and
R = A + (n2 × h) − (k2 + L)        (4.6a)
where h is the number of rows between a consecutive pair of kernel elements.
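Equation (4.6a) is easy to tabulate in software. The helper below (our naming) reproduces the worked values used in this chapter, with L = 0 and h = 1 as defaults for a kernel whose zero elements sit only at the end of each row.

```python
def shift_stages(A, n2, k2, L=0, h=1):
    """Equation (4.6a): R = A + (n2 * h) - (k2 + L), the number of shift
    register stages needed by a cell that absorbs the zero-weight cells
    between it and the next non-zero kernel element."""
    return A + (n2 * h) - (k2 + L)
```

For the fig. (4.7) example, shift_stages(3, 5, 3) gives Ak = 5, and for the 256-column, 5 x 5-kernel case of section 4.5, shift_stages(3, 256, 5) gives 254 stages.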
Fig. (4.10) illustrates a series of snapshots of the example shown in fig.
(4.7). Both x and y are initialised to zero, and they enter the array at the top
segment and move through the middle to the bottom segment. In our example,
two more delays were needed for each of the Ak shift registers (in the last cell of
each row except the last). As a result, the x sequence is slowed down and enters
each segment simultaneously with the appropriate elements of the y sequence.
There are 4 snapshots of the 2D convolution computation shown here; the indices
[Figure: four snapshots of the array, (a) Time=12, (b) Time=21, (c) Time=31, (d) Time=32]
Figure (4.10) Snapshots of the execution of a systolic 2D convolution
array with k1, k2 = 3 and n1, n2 = 5.
of the x and y sequences appear in each stage of the cell. We assume that the
computation starts at time zero.
At time three, x(1,1) meets y(1,1) at the first cell of the top segment, where
the kernel is w(3,3). Twelve clock cycles later x(1,1) moves to the first cell of the
middle segment to meet y(2,1), where the kernel is w(2,3). At the same time x(3,3)
meets y(3,3) at the first cell of the top segment, where this cell produces a partial
result of y(3,3), as illustrated in fig. (4.10a).
The second snapshot illustrates the state of all cells at time=21, where
x(1,1) meets y(2,3) at the end of the middle segment, where the kernel is w(2,1).
At the same time x(2,3) will be at the first cell of the middle segment to produce
another partial result of y(3,3). Also, at this time the first output y(1,1) is produced,
which is an invalid result; the other cells produce partial results of other output
values. At time 31, x(1,1) will be at the last cell of the array, to produce the last
partial result of y(3,3), as shown in fig. (4.10c).
The last snapshot in this figure illustrates the array at time=32, where the first
valid result y(3,3) is produced. Also at this time, the last partial result of the second
valid result, y(3,4), is produced.
Let us apply equation (4.2) to our example; the first valid result will be
y(3,3) = w(1,1) x(1,1) + w(1,2) x(1,2) + w(1,3) x(1,3) + w(2,1) x(2,1) +
         w(2,2) x(2,2) + w(2,3) x(2,3) + w(3,1) x(3,1) + w(3,2) x(3,2) +
         w(3,3) x(3,3)
But the partial results of this output y(3,3) are produced at different times
in different cells, where:
at cell (1,1), w(1,1) x(1,1) is produced at time=31
at cell (1,2), w(1,2) x(1,2) is produced at time=29
at cell (1,3), w(1,3) x(1,3) is produced at time=27
at cell (2,1), w(2,1) x(2,1) is produced at time=25
at cell (2,2), w(2,2) x(2,2) is produced at time=23
at cell (2,3), w(2,3) x(2,3) is produced at time=21
at cell (3,1), w(3,1) x(3,1) is produced at time=19
at cell (3,2), w(3,2) x(3,2) is produced at time=17
at cell (3,3), w(3,3) x(3,3) is produced at time=15
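The listed times follow a simple pattern, an observation on the schedule rather than part of the original derivation: taking the cells in raster order, each successive cell produces its term for y(3,3) two cycles earlier than the previous one, ending with the leftmost cell of the first segment at time 31. A small Python check (names ours):

```python
def partial_product_times(k1=3, k2=3, t_first=31, step=2):
    """Times at which cell (i, j) (1-based, raster order) produces its
    w(i,j)x(i,j) term for y(3,3): two cycles apart, cell (1,1) last."""
    return {(i + 1, j + 1): t_first - step * (i * k2 + j)
            for i in range(k1) for j in range(k2)}
```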
The same procedure is applied to all other outputs at every clock cycle, until
the final result y(5,5) is generated.
The main Occam procedure for the systolic array for 2D convolution is
shown below, for any image size (no.c * no.r) and for any kernel size
(no.kc * no.kr).
... PROC host1
... PROC host2
... PROC cell A
... PROC cell B
PROC 2D.system (CHAN ....)
  ... input image
  ... input kernel
  do := no.kc * (no.kr - 1)
  PAR
    host1
    PAR j = [0 FOR (no.kr - 1)]
      PAR i = [0 FOR (no.kc - 1)]
        cell A
      cell B
    PAR i = [do FOR no.kc]
      cell A
    host2:
4.4 MULTIDIMENSIONAL CONVOLUTION
The systolic array for 2D convolution described in the previous section can
be generalised to an mD convolution.
To illustrate the idea, let us consider a 3D image. The systolic array for 3D
convolution is built up of k arrays for 2D convolutions. In this respect, the design
is analogous to the 2D case, where we used k arrays for 1D convolution.
To illustrate the design of a 3D convolution array, let us assume
k1=3, k2=3 and k3=3 for the 3D kernel, and n1=5, n2=5 and n3=5 for the 3D image.
In this way, we obtain the 3D equivalent of the 2D array example from fig. (4.9).
The 3D array consists of three 2D subarrays. Each subarray is, in turn, made up of
three 1D segments, as shown in fig. (4.11). As in the 2D case, in order to complete
the design, we have to increase the number of shift register stages of the last cell
in each subarray (except the last one). This new cell, "cell C", is similar to
"cell B" (fig. (4.8)) with an increase in the number of shift register stages by
(n2 × n1) − (n2 × k1) + (n2 − k1)
Also, the method of converting a 2D problem into a 1D problem, which is
shown in section (4.3.2), can be generalised to converting an mD problem
into a 1D problem. Let us consider the above example. The 3D image is formed into
a 1D signal as follows:
X = x111, x112, x113, x114, x115, x121, x122, x123, x124, x125, x131, x132,
x133, x134, x135, ...., x155, x211, x212, x213, ....., x255, x311, x312, ......,
x355, x411, x412, ...., x555
And the 1D kernel is formed as follows:
W = w111, w112, w113, 0, 0, w121, w122, w123, 0, 0, ....., w133, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, w211, ....., w233, 0, 0, .., 0, w311, ...., w333
The 1D signal is formed by concatenating the rows of the first plane of the
[Figure: three 2D subarrays, each of three 1D segments; the cells hold the weights w111 ... w133, w211 ... w233 and w311 ... w333, in cells of types A, B and C]
Figure (4.11) Systolic array for 3D convolution with three different
cells, i.e. cell A, cell B, cell C.
3D signal, followed by the rows of the second plane, etc. The 1D kernel is formed
by concatenating the rows of the first plane of the 3D kernel, with zero vectors in
between, followed by a vector of zeros equal to n3-k3, then followed by the rows
of the second plane, etc.
4.5 CONSTANT TIME OPERATION
One of the main objectives of our design was to minimise the cell delay. The cell
delay is defined as the time delay between the input of data to the shift register and the
output from the shift register; in our example it is shown in fig. (4.8), where the
number of stages of the shift register of cell B is:
R = Ak = n2 = 5
This means that the number of delays (shift register stages) is directly
determined by the input data size (number of columns). For example, to perform
2D convolution on a 256 × 256 image would require 256 shift register stages in each cell
B. The number of delays in the system is determined by both the kernel size and the
input data size.
If the size of the kernel is 5 x 5 then the total number of delays is:
Rb for cell B = Ak = A + D = 3 + (256 − 5) = 254
Ra for cell A = 3
The total number of delays in cells B is:
4 × 254 = 1016
The total number of delays in cells A is:
21 × 3 = 63
Thus the total number of delays in all the cells is
R = Ra + Rb
R = 63 + 1016 = 1079
The percentage of the delay in cells B to the total delay is 94%, which is very
high and will increase the execution time of the system.
For improved system performance, the number of delays should be decreased
and/or the execution time of the shift register should be reduced.
The basic design of the systolic array for 2D and mD convolution, presented
in sections (4.3.3) and (4.4), can be modified to improve the performance of the
system. This can be achieved by implementing the shift register as a constant time
operation.
The shift register is a FIFO (First-In First-Out) list, so for a shift register of
size n we need to move n data items each time a new data item enters the register
and the first data item in the register is output.
[Figure: a circular array holding the queue, with Q.front and Q.rear pointers]
Figure (4.12) A circular implementation of the constant time operation.
To realise a constant time implementation of the delay, the shift register is
viewed differently. It is regarded as a circular array, where the first position
follows the last, as shown in fig. (4.12). The queue is formed around the circle in
consecutive positions, with the rear of the queue clockwise from the front.
To enqueue an element, we move the Q.rear pointer one position clockwise
and write the element in that position. To dequeue, we simply move Q.front one
position clockwise. Thus the queue migrates in a clockwise direction as we enqueue
and dequeue.
It can thus be seen that with this implementation only two data items are moved:
the incoming and the outgoing data.
The delay operation of the systolic system is therefore a constant time
operation, i.e., it is independent of the kernel size and input data size. This
decreases the execution time of the whole system, and improves the performance
(see section 4.10 of this chapter).
All the cells B shown in fig. (4.9) are replaced with cells CT (cell A with a
constant time process).
The Occam code for the constant time process is as follows:
PROC constant.time
  x := 0; b[k] := 0; j := 1
  SEQ i = [0 FOR time]    -- image size
    SEQ
      xin ? x
      bout ! b[j]
      b[j] := x
      IF
        j > (no.co - 1)
          j := 1
        TRUE
          j := j + 1:
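A Python rendering of the same circular-buffer delay may make the constant-time claim concrete; the function name is ours, and `n` plays the role of `no.co` above.

```python
def delay_line(samples, n):
    """n-stage delay via a circular buffer, mirroring PROC constant.time:
    each step touches only two data items (one in, one out), so the cost
    per sample is independent of n. The first n outputs are the zeros the
    buffer was initialised with."""
    buf = [0] * n
    j = 0
    out = []
    for x in samples:
        out.append(buf[j])   # dequeue the value that entered n steps ago
        buf[j] = x           # enqueue the new sample in the freed slot
        j = (j + 1) % n      # advance the circular pointer
    return out
```

Whatever the buffer length, each iteration performs one read, one write and one pointer update, which is exactly the constant-time property claimed above.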
4.6 TRANSPUTER NETWORK FOR ONE DIMENSIONAL
CONVOLUTION
The systolic design described in section (4.2.2) was chosen for the
implementation of the 1D convolution on the transputer network. This is because
the problem can easily be arranged such that each transputer is responsible for
representing one or more of the cells (depending on the size of the kernel), with
communication occurring between them, so that each kernel element (or more than
one element) can be held in a single transputer throughout the computation; the
maximum number of transputers in the design is equal to the number of convolution
kernel elements. The same cell design was adopted on the transputer, so we need the
same code to run on all transputers.
The first transputer is connected to the host, so it receives input data from the
host, while the last transputer is also connected to the host, so it passes on to the
host all the results from all the transputers on the network before it shuts down.
[Figure: the host feeding x, y into a chain of transputers T0 ... Tn, which return x, y to the host]
Figure (4.13) Transputer network configuration for 1D convolution.
When calculating new values at each transputer, communication has to occur to
obtain values from the neighbouring transputers. Each transputer collects input data
and results at the transputer boundaries, which are then sent down the Inmos links to
the next transputer to the right. It also receives the values of the results and the input
data from the transputer to its left through the Inmos link.
Parallel code must be executed in each transputer whilst all transputers execute
in parallel.
The transputer network configuration for the design is shown in fig. (4.13).
The algorithms described in section (4.2.2) required modification in order for
them to be executed on a network of transputers and to minimise the number of
channels used in each transputer.
The algorithm executed for each input datum on each transputer is as follows:

T0      receive input value xi from the host
        calculate a partial result of yi
        send yi to T1
        send xi to T1

Tj      (j = 1 to n-1)   -- (n is the number of transputers)
        receive input value xi from Tj-1
        receive a partial result yi from Tj-1
        calculate another partial result of yi
        send the updated yi to Tj+1
        send xi to Tj+1

Tn      receive input value xi from Tn-1
        receive a partial result yi from Tn-1
        calculate the final partial result of yi
        send yi to the host
        send xi to the host
4.7 PERFORMANCE OF THE ONE DIMENSIONAL
CONVOLUTION SYSTOLIC DESIGN ON THE
TRANSPUTER NETWORK
Several experiments were performed on the model problem in order to
measure the performance of the algorithm on a variety of network configurations.
The number of transputers on the network is not dependent on the size of the image.
The main requirement is for each transputer to have as nearly equal a number of cells
as possible. This is necessary for load balancing between transputers. The
network is comprised of only T800 transputers on the B012 board.
A timer on the host transputer is used to measure the processing time of the
1D convolution algorithm. This time is measured from the instant of starting to send
the first pixel to the instant of receiving the last pixel. The time lapse includes the
time spent transmitting the final result to the host processor from the network.
Several image sizes with two different window sizes have been tested. Tables (4.1
and 4.2) show the time of processing with respect to image size for both window
sizes for various network configurations. The tables also include the relative
speed-ups and efficiencies of the algorithm. The speed-up and efficiency are
calculated using equations (2.1 and 2.2) respectively.
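Equations (2.1) and (2.2) from chapter 2 are straightforward to apply to the tabulated times; this helper (our naming) recovers, for instance, the 256 x 256, 3-transputer row of table (4.1).

```python
def speedup_efficiency(t1, tp, p):
    """Relative speed-up S = T1 / Tp (eq. 2.1) and efficiency E = S / p
    (eq. 2.2), where T1 is the single-transputer time and Tp the time
    on p transputers."""
    s = t1 / tp
    return s, s / p
```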
An analysis of the results shows that the system's performance is improved
by increasing the number of transputers. Two sets of performance graphs are
shown in each of figures (4.14 and 4.15), in which two points can be noted. The
first is the effect of increasing the number of transputers on the performance. It can
be seen that the speed-up and the efficiency increase with an increasing number of
transputers. The second point is the effect of increasing the image size on the
Image Size   Network   Time Lapse   Relative   Efficiency
             Size      (seconds)    Speed-up   %

16 x 16      1         0.037        1.00       100.00
             2         0.025        1.48       74.09
             3         0.013        2.87       95.81

32 x 32      1         0.137        1.00       100.00
             2         0.092        1.49       74.38
             3         0.047        2.90       96.75

64 x 64      1         0.535        1.00       100.00
             2         0.360        1.49       74.42
             3         0.184        2.92       97.20

128 x 128    1         2.120        1.00       100.00
             2         1.424        1.49       74.46
             3         0.727        2.92       97.20

256 x 256    1         8.439        1.00       100.00
             2         5.670        1.49       74.43
             3         2.893        2.92       97.24

Table (4.1) Timing results for the 1D convolution algorithm (kernel size 3).
speed-up and the efficiency. It can be seen that both increase with the increase of
the image size.
Fig. (4.15b) shows efficiencies of over 77% for a three transputer
network, over 85% for a four transputer network and over 95% for a seven
transputer network. The main reason for the drop in the efficiency for each
transputer network is the number of cells in each transputer. When the number of
transputers decreases, the number of cells in each transputer increases. As
executing parallel Occam code on each transputer implies the transputer divides its
time between different processes, giving the illusion of concurrency, the best
efficiency we can get is when each transputer contains one cell only (7 transputers
for 7 cells). The same applies for fig. (4.14b), where the efficiency is 74% for a two
transputer network and over 95% for a three transputer network (3 transputers for
three cells).
In general, the 1D convolution algorithm will give a nearly linear speed-up
when the number of transputers on the network is increased.
Image Size   Network   Time Lapse   Relative   Efficiency
             Size      (seconds)    Speed-up   %

16 x 16      1         0.0856       1.00       100.00
             3         0.037        2.33       77.57
             4         0.025        3.44       85.99
             7         0.013        6.69       95.57

32 x 32      1         0.321        1.00       100.00
             3         0.137        2.34       77.87
             4         0.092        3.48       86.87
             7         0.047        6.77       96.74

64 x 64      1         1.254        1.00       100.00
             3         0.536        2.34       77.94
             4         0.360        3.48       87.06
             7         0.185        6.79       96.98

128 x 128    1         4.966        1.00       100.00
             3         2.123        2.34       77.98
             4         1.425        3.48       87.11
             7         0.731        6.79       97.03

256 x 256    1         19.78        1.00       100.00
             3         8.449        2.34       78.03
             4         5.675        3.49       87.14
             7         2.911        6.80       97.06

Table (4.2) Timing results for the 1D convolution algorithm (kernel size 7).
... = "' .. .. ... "-'
... " " ..
·;:; ;.:: ... r.l
3~---------------------------,
2
2 3
No. or Transputer
16 *16
- ·- -··· 32*32
- ----· 64. 64
4
128. 128 256.256
Figure (4.14a) Speedup graph for 1 D convolution (k=3)
10~----------------------------,
09
16 *16
--- --- 32 *32
--- --· 64*64 08 128. 128 256.256
07 2
No. of Transputer 4
Figure (4.14b) Efficiency graph for 1 D convolution (k=3)
132
L------------------------------------------------------
[Figure: speed-up vs. number of transputers for image sizes 16 x 16, 32 x 32, 64 x 64, 128 x 128 and 256 x 256]
Figure (4.15a) Speedup graph for 1D convolution (k=7).
[Figure: efficiency vs. number of transputers for the same image sizes]
Figure (4.15b) Efficiency graph for 1D convolution (k=7).
4.8 TRANSPUTER NETWORK FOR TWO DIMENSIONAL
CONVOLUTION
The systolic system described in section (4.3.3) required further
modification in order for the array to be implemented on a network of transputers.
The transputer network was connected to one host only. The first and the
final transputers are connected to the host; the first transputer receives all the input
data from the host and pumps it to the transputer network, while the final transputer
receives the results from the neighbouring transputer and sends them down to the host.
For simplicity each transputer is responsible for one or more cells of the systolic
design. Basically, inside each transputer equation (4.2) is applied to update the
partial results; each transputer collects values of both the input data and the partial
result through one Inmos link and sends them down the other Inmos link, as shown in
fig. (4.16).
[Figure: the host filing system feeding xin, yin into a ring of transputers T0, T1, T2, ..., Tn, which return xout, yout to the host]
Figure (4.16) Transputer network configuration for 2D convolution.
We need Occam code to run on different transputers to represent cell A
(shown in section (4.3)) and cell CT (shown in section 4.5). The maximum number
of transputers we need for this design is equal to the number of cells in the systolic
design, i.e. the number of kernel elements. The transputer network configuration for
such a design is shown in fig. (4.16). All the processes run in parallel inside the
transputers.
The Occam programs for the host and for the network of transputers are given in
Appendix A and B respectively.
4.9 PERFORMANCE OF THE TWO DIMENSIONAL
CONVOLUTION SYSTOLIC DESIGN ON THE
TRANSPUTER NETWORK
Experiments similar to those performed for the one dimensional convolution
were carried out to measure the algorithm's performance on a network of T800
transputers configured as an array. A summary of the timing results for the two
dimensional convolution algorithm is presented in table (4.3), with respect to the
image size, for a kernel size (3 x 3) and the various network configurations
implemented. The table also includes relative speed-ups and efficiencies of the
algorithm.
The overall results for this algorithm are very impressive. Table (4.3) indicates
'super' speed-up for all sizes of images when the network size is 3 transputers.
Their efficiencies are also extremely high (over 100%, falling gradually to above
91%).
This extraordinary behaviour is explained by the fact that each of the T800
transputers on the network is connected to an external memory which is much
slower than the on-chip RAM. When the program is executed on a single transputer,
therefore, some of the data stored in the cells' shift registers is stored in the external
memory, so that extra time is required in accessing the slow external memory. When
the same program is then run on a network of 3 transputers, the amount of storage
needed per transputer is one third and thus all the data can be stored on the fast on-
Image Size   Network   Time Lapse   Relative   Efficiency
             Size      (seconds)    Speed-up   %

16 x 16      1         0.093        1.00       100.00
             3         0.029        3.20       106.67
             5         0.020        4.59       91.86
             9         0.011        8.23       91.40

32 x 32      1         0.348        1.00       100.00
             3         0.108        3.23       107.79
             5         0.075        4.65       93.07
             9         0.0416       8.38       93.15

64 x 64      1         1.361        1.00       100.00
             3         0.420        3.24       108.19
             5         0.291        4.67       93.35
             9         0.162        8.43       93.80

128 x 128    1         5.389        1.00       100.00
             3         1.660        3.25       108.12
             5         1.154        4.67       93.42
             9         0.638        8.43       93.68

256 x 256    1         21.477       1.00       100.00
             3         6.649        3.23       107.66
             5         4.597        4.67       93.44
             9         2.543        8.45       93.84

Table (4.3) Timing results for the 2D convolution algorithm.
chip RAM. The gain in speed from the on-chip RAM offsets the new constraint
introduced by communication.
Fig. (4.17) shows that a near linear speed-up is obtained for various image
sizes. For small images (16 x 16; 32 x 32), there are decreasing gains as the
network increases, while this gain increases as the size of the image increases.
This result is quite useful, since the need for a parallel system is more vital for large
images, when the processing time is relatively high. Graphs of fig. (4.18) show
the effect of increasing the size of the transputer network on the performance of the
system. It can be seen that the efficiency increased with increasing size of the
network beyond 5 transputers.
[Figure: speed-up vs. number of transputers for image sizes 16 x 16, 32 x 32, 64 x 64, 128 x 128 and 256 x 256]
Figure (4.17) Speedup graphs for 2D convolution.
[Figure: efficiency vs. number of transputers for the same image sizes]
Figure (4.18) Efficiency graphs for 2D convolution.
4.10 ANALYSIS AND COMPARISON OF THE TWO
DIMENSIONAL SYSTOLIC ARRAY
We know from chapter 2 that systolic arrays are special purpose computing
devices. The research completed in the area of two-dimensional and
multidimensional systolic arrays is quite extensive. Several different designs have
been suggested in the literature. We will compare one of them to the design
presented in this section.
The size of the convolution kernel and the input data size are important
parameters in the design of any convolution array. For the design presented in this
chapter, the array is indirectly related to the size of the input image: the
number of delays in the shift register is equal to 3 in most of the array cells, and
some cells need a fixed or constant time process (CT) as shown in section (4.5).
However, our array requires buffering equal to the dimension of the image (i.e. the
number of columns) at cell CT. The number of processors in the array is equal to
the number of convolution kernel elements.
We will now compare our array to the two-level pipeline proposed by Kung
[Kung 1983].
The number of delays in the array proposed by Kung is very high: three
kinds of delay are needed, the first at the multiplier unit, the second at the
adder unit, and the third at the shift register. The total number of delays at some cells
of the array is roughly twice the dimension of the image, and this will increase
the execution time of the system.
The main reason for using a parallel processing system to implement image
processing algorithms is to reduce their execution times. Therefore it is important
that each processor is kept as active as possible. There are two reasons why this
may not happen in the array proposed by Kung:
1- The processors may not be allocated equal workloads, because some of
the cells have much more delay than others, and so those cells which finish
first must wait until the slower ones terminate; and
2- Communication of information will invariably be required between
cells. Since communication is tightly synchronised, if one processor is
trying to transmit information to another it must wait until the second processor is
ready to receive the information.
The processing time of our system running on the transputer network is
taken from table (4.3) and given here again in table (4.4). The table also includes the
processing time of the system proposed by Kung running on the same transputer
network. The speed-ups for both systems are also included in the table.
A comparison between the times for the two systems shows that the times for
our system are superior for all sizes of image and for all sizes of transputer
network, as shown in figs. (4.19 and 4.20). The time increases rapidly with the
increase of image size. This is because for large image sizes the number of delays is
very high, and that affects the performance of the system.
Figures (4.21 and 4.22) show a dramatic decrease in the speedup and efficiency
of the Kung system, especially for larger image sizes. The reason is as
before: the number of delays is very high.
Again, for small image sizes there is less gain as the network size increases.
The reason for the efficiency difference in the case of smaller images is the size of
the load on each transputer: the number of cells per transputer in the
smaller transputer networks is higher than in the larger ones. The
increase in the number of cells per transputer affects the efficiency
gain. This gain decreases as the image size increases, but the percentage
drop in efficiency gain is very small.
                       Constant time system       2-level pipeline system
Image      Network     Time Lapse    Relative     Time Lapse    Relative
Size       Size        (seconds)     Speed-up     (seconds)     Speed-up
16 x 16    1           0.093         1.00         0.184         1.00
           3           0.029         3.20         0.053         3.49
           5           0.020         4.59         0.040         4.56
           9           0.011         8.23         0.028         6.51
32 x 32    1           0.348         1.00         0.935         1.00
           3           0.107         3.23         0.320         2.92
           5           0.074         4.47         0.177         5.28
           9           0.0416        8.38         0.177         5.28
64 x 64    1           1.361         1.00         5.516         1.00
           3           0.420         3.24         2.213         2.49
           5           0.291         4.66         1.63          3.39
           9           0.162         8.43         1.681         3.28
128 x 128  1           5.389         1.00         36.487        1.00
           3           1.660         3.25         18.651        1.96
           5           1.154         4.67         13.546        2.69
           9           0.638         8.43         13.980        2.61
Table (4.4) Timing results for the 2D convolution algorithm for the 2-level
pipeline systolic system and the constant time systolic system.
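The speed-up and efficiency figures in this and the following tables follow the usual definitions: speed-up S(p) = T(1)/T(p), and efficiency E(p) = S(p)/p. A minimal Python sketch of the calculation (illustrative only, not part of the Occam system), using the 128 x 128 rows of the 2-level pipeline system from table (4.4):

```python
# Relative speed-up S(p) = T(1) / T(p); efficiency E(p) = S(p) / p.
# Times below are the 128 x 128 rows of the 2-level pipeline system.
times = {1: 36.487, 3: 18.651, 5: 13.546, 9: 13.980}

def speedup(times, p):
    """Single-transputer time divided by the p-transputer time."""
    return times[1] / times[p]

def efficiency(times, p):
    """Speed-up normalised by the number of transputers."""
    return speedup(times, p) / p

for p in (3, 5, 9):
    print(p, round(speedup(times, p), 2), round(100 * efficiency(times, p), 1))
```

Rounding these ratios to two decimal places reproduces the 1.96, 2.69 and 2.61 speed-ups quoted in the table.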
[Figure: time lapse (0 to 0.2 s) against number of transputers (2 to 10) for the constant time system and the 2-level pipeline system]
Figure ( 4.19) Time lapse graphs for 2D convolution for two
systems (Image size 16 x 16).
[Figure: time lapse (0 to 40 s) against number of transputers (2 to 10) for the constant time system and the 2-level pipeline system]
Figure ( 4.20) Time lapse graphs for 2D convolution for two
systems (Image size 128 x 128).
[Figure: speed-up (0 to 10) against number of transputers (2 to 10) for the constant time system and the 2-level pipeline system]
Figure ( 4.21) Speed up graphs for 2D convolution for two
systems (Image size 128 x 128).
[Figure: efficiency (0 to 1.2) against number of transputers (2 to 10) for the constant time system and the 2-level pipeline system]
Figure ( 4.22) Efficiency graphs for 2D convolution for two
systems (Image size 128 x 128).
CHAPTER 5
PARALLEL IMPLEMENTATION OF THE LAPLACIAN AND GRADIENT OPERATORS IN COMPUTER VISION
5.1 INTRODUCTION
The purpose of this chapter is to identify a set of designs suitable for
implementing low level image processing algorithms on VLSI processing arrays.
We consider techniques for filtering digital images, as described in chapter 3. This
includes both low pass (smoothing) and high pass (edge enhancement) filters.
Most image processing algorithms need massive amounts of band
matrix operations. However, these algorithms contain explicit parallelism which can
be efficiently exploited by processor arrays. All sections of the image have to be
processed in exactly the same way, regardless of the position of the image section
within the image, or the value of the pixel data.
Low level functions involve matrix vector operations, which are repeated at
very high speed. Typically, images must be processed in real time, at 25 images per
second. Therefore, with image sizes of 128 x 128, 256 x 256, and 512 x 512 or
greater, there is a large amount of data to be processed in a highly repetitive
process.
A different class of algorithms was implemented on a transputer network
using the systolic array design described in chapter 4. The reasons for choosing these
algorithms are that they are operations commonly used in digital image
filtering systems, and that a varying degree of communication between processors is
required for each of the algorithms.
Various modifications of the systolic design are analysed in this chapter; the
design modifications are to handle each of the filter algorithms. The number of cells
or processors in each filter design is dependent on the size of the kernel and the
nature of the algorithms (e.g. the number of sub-tasks). However, some applications
may dictate that only a small number of processors can be used, for example
missile systems, where the amount of space available may be small.
A brief definition of each algorithm is given at the start of the section in which it
is discussed, together with a full description of the systolic array design. The
results and efficiency obtained for each design on a transputer network are also
given, to reflect the performance of the design on this level of vision.
5.2 SYSTOLIC DESIGNS FOR DIGITAL IMAGE
FILTERS.
Smoothing filters are designed to reduce the noise, detail, or 'busyness' in an
image. Several types of low pass filter designs are considered. Typical filters are:
Mean, Weighted mean, Inverse gradient, and Sigma.
Edge enhancement filters are intended to enhance or boost the image edges.
Several types of enhancement filter designs are implemented. Typical filters are:
Gradient operators, Laplacian operators, the Laplacian added back, Sobel and
Prewitt.
For this purpose, we have developed several models of systolic arrays.
Various modifications and customisations of the design were presented in the
previous chapter for specific smoothing and edge enhancement algorithms.
Different kinds of cells are used, some of which are implemented in this chapter,
which has improved the hardware utilisation and throughput. The actual designs rely
strongly on the property of Occam which allows one to model in software several
concurrent processes on a single processing element or a network of processing
elements.
In this chapter we describe the designs for the Laplacian and Gradient
operators. These are two classes of filters that are frequently applied to the digital
image.
5.3 PARALLEL IMPLEMENTATION OF THE
LAPLACIAN OPERATOR.
5.3.1 Laplacian Operator Algorithms
The Laplacian operator is computed by convolving a mask with the image.
The Laplacian filter is applied in the same way as a typical smoothing filter: a
Laplacian mask is passed over the entire image, and the convolution operation is
performed on each pixel. Each pixel is replaced by the sum of the products of the
mask weightings and the appropriate neighbouring pixel values. Different choices
are available when using this mask in two dimensions.
A Laplacian filter using a 3 x 3 convolution mask may be used. This utilises a
mask or weighting matrix based on two standard masks, as shown in fig. (5.1).
 0  -1   0
-1   4  -1
 0  -1   0
(A) Plus shaped mask

-1  -1  -1
-1   8  -1
-1  -1  -1
(B) Square mask

Figure (5.1) Two Laplacian masks.
The values in the weighting matrix allow a simpler and faster version of the
algorithm than was obtained using the general convolution case. In the first mask
(fig. 5.1 A), as only values of -1 and 4 are used, it is possible either to subtract the
value of a pixel or to add 4 times its value, respectively; also there are four zero
values, which reduce the number of processors and the number of multiplications
and additions. In the second mask (fig. 5.1 B), where only values of -1 and
8 are used, it is likewise possible either to subtract the value of a pixel or to add 8
times its value, respectively.
The algorithm given here 'smooths' a grey level input image and is based on
one described in chapter 3. The algorithm has I as an input image and O as an
output image; both I and O contain M by M pixels, with P = M^2. Each point of I is
a value representing one of the possible grey levels. Each point in the smoothed
output image, O(i,j), is obtained by multiplying the pixel I(i,j) and its eight nearest
neighbouring pixel values, I(i-1,j-1), I(i-1,j), I(i-1,j+1), I(i,j+1), I(i,j-1),
I(i+1,j-1), I(i+1,j), and I(i+1,j+1), by the weighting Laplacian mask values,
where 1 <= i, j < M-1. Pixels on
the edges of I are not smoothed with this filter; they are simply copied to O.
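As a point of reference for the parallel designs that follow, the per-pixel operation just described can be written down sequentially. The sketch below is illustrative Python, not the systolic Occam implementation; it applies the plus-shaped mask of fig. (5.1 A) and copies the border pixels of I to O unchanged:

```python
def plus_laplacian(I):
    """Apply the plus-shaped Laplacian mask of fig. (5.1 A) to image I.

    Interior pixels get 4*I(i,j) minus the four 4-connected neighbours;
    border pixels are simply copied, as in the algorithm above.
    """
    M = len(I)
    O = [row[:] for row in I]          # border pixels copied unchanged
    for i in range(1, M - 1):
        for j in range(1, M - 1):
            O[i][j] = (4 * I[i][j]
                       - I[i - 1][j] - I[i + 1][j]
                       - I[i][j - 1] - I[i][j + 1])
    return O

image = [[1, 1, 1],
         [1, 5, 1],
         [1, 1, 1]]
print(plus_laplacian(image))   # centre pixel becomes 4*5 - (1+1+1+1) = 16
```

The systolic array splits exactly this weighted sum across its cells, one mask weight per cell.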
5.3.2 Systolic Array for the Laplacian Operator.
In chapter 4 we introduced two different designs for multidimensional
convolution. The second design includes two different cell designs, i.e., cell A and
cell CT (sections 4.2.2 and 4.5). Cell A is the original cell, holding a non-zero
kernel element value as shown in fig. (4.3 B), while cell CT with an (Ak) shift
register represents kernel elements equal to zero as shown in fig. (4.8).
This design can be improved and implemented for the Laplacian filter. We
consider how it could be implemented with N cells,
where each processor or cell stores an element of the Laplacian filter (N being
either 5 or 9 processors).
5.3.2.1 Plus shaped Laplacian Operator Design.
For the implementation of the Laplacian filter shown in fig. (5.1 A), we
should modify cell CT. This cell contains two kernel elements, the second and the
third. The first kernel element of the mask is a zero element and should
be ignored (as shown in fig. 5.2). The third kernel element is a zero element, and
may be absorbed into the cell by increasing the number of shift register stages of
that cell by one.
The last cell in the second segment also has the same number of shift
register stages.
Let (AL) be the number of stages of the shift register in the new cell (cell CT).
Then, from equation (4.6), the shift register length will be

AL = Ak + 1
   = A + (D + 1)                                        (5.1)
The Occam code running on this cell (cell CT) takes the form of fig. (5.3).
The Laplacian systolic array design, as shown in fig. (5.4), will consist of five cells
only; one cell is cell CT in the first segment. Three cells are needed in the second
[Figure: a 5 x 5 image grid (x11 ... x55) with the plus-shaped mask weights 0, -1, 4 overlaid on the pixels of the first 3 x 3 window; arrows mark the zero elements]
Figure (5.2) Illustrates the Laplacian filter convolving an image n x n
where n=5 (arrows denoting zero elements in the mask).
segment; the shift register in the first two cells has 3 stages (cell A as in fig. 4.3),
and the third cell is cell CT. One cell is needed in the third segment, which is cell A.
As shown in fig. (5.4), the image data is pumped into the array by the host.
The data and results are pumped through the array, and the final results and the
original data are collected by the second host.
For the parallel approach, parallel code must be executed in each cell, whilst
all the cells and the two hosts execute in parallel.
--- PROC ClP
--- PROC delay
--- PROC pass.data
--- PROC multiply
--- PROC add
PROC cell.CT (CHAN ...)
  --- declaration of local channels
  PAR
    delay
    pass.data
    ClP
    multiply
    delay
    add

Fig. (5.3) Procedure to run the cell CT.
[Figure: linear systolic array; host1 feeds x and y into the first-segment cell CT (w5 = -1); the second segment cells cell CT, cell A, cell A hold w4, w3, w2 = -1, 4, -1; the final cell A (w1 = -1) sends the results to host2]
Figure (5.4) Systolic array for the plus shape Laplacian operator.
Only the first cell (leftmost cell) is permitted to communicate with the first
host, and only the last cell (rightmost cell) is permitted to communicate with the
second host. Each of the other cells communicates only with its left and right
neighbour cells; each cell communicates input data and results to the neighbouring
cell on its right and obtains data and results from the cell on its left. The parallel
algorithm computed at each result for the five cells is shown in fig. (5.5).
cell 1
  read yi from host 1
  read xi from host 1
  calculate a partial result of yi
  send yi to cell 2
  send xi to cell 2

cell j+1 (j = 1 to 3)
  read yi from cell j
  read xi from cell j
  calculate a partial result of yi
  send yi to cell j+2
  send xi to cell j+2

cell 5
  read yi from cell 4
  read xi from cell 4
  calculate the final partial result of yi
  send yi to host 2
  send xi to host 2

Figure (5.5) Parallel Laplacian filter algorithm.
At the time the input data xi and the output stream yi enter the first cell in the
array, the computation of the resulting coefficient is performed, w5 xi+3 (if the
image is 3 x 3); this is then passed to the right-hand cell. The same process
carries on for the next four cells, with the partial products being summed in each
cell to produce yij. With reference to fig. (5.2) and fig. (5.4), a configuration for
generating yij is as follows (for an image n x n where n = 5):

yij = w1 xi-1,j + w2 xi,j-1 + w3 xi,j + w4 xi,j+1 + w5 xi+1,j        (5.2a)

for i,j = 1,2,.....,n

By applying the weighting mask shown in fig. (5.1A) to this equation, then,

yij = (-1) xi-1,j + (-1) xi,j-1 + 4 xi,j + (-1) xi,j+1 + (-1) xi+1,j  (5.2b)

The algorithm is repeated for every input data for each cell concurrently. The main
Occam code to run the array is as follows:
--- PROC host1
--- PROC cell CT
--- PROC cell A
--- PROC host2
PROC main.system (CHAN ...)
  --- declaration of local channels
  SEQ
    image := number.columns * number.rows
    SEQ i = [0 FOR image]
      input data
      PAR
        host1
        cell CT
        PAR i = [0 FOR 2]
          cell A
        cell CT
        cell A
        host2:
5.3.2.2 Systolic Array Design for Square Laplacian Operator.
The approach for the design of the Laplacian filter algorithm for the mask
shown in fig. (5.1B) is more or less similar to the one described for 2D
convolution in section 4.3.3.
The size of the mask shown in fig. (5.1B) is the 3 x 3 kernel. The 9 values
of the kernel are non-zero values. The array consists of 3 segments each having 3
cells, adding up to 9 cells. The first two cells of each of the first and second
segments are cell A and the third cell is cell CT (both cell designs are described in
chapter 4). The third segment consists of three cells (cell A). Each cell stores an
element of the square Laplacian filter, as shown in fig. (5.6).
The parallel algorithm computed at each result for the nine cells is as shown
in fig. (5.5). As the array size is greater than 5, all central cells execute the
cell j+1 code.
The partial products are summed in each cell to produce yi. With reference to
fig. (5.6), a configuration for yij is as follows:
[Figure: three-segment systolic array; host1 feeds x and y into the first segment (cell A, cell A, cell CT holding w9, w8, w7 = -1, -1, -1); the second segment (cell CT, cell A, cell A) holds w4, w5, w6 = -1, 8, -1; the third segment (cell A, cell A, cell A) holds w3, w2, w1 = -1, -1, -1 and delivers the results to host2]
Figure (5.6) Systolic array for the square Laplacian operator.
yij = w1 xi-1,j-1 + w2 xi-1,j + w3 xi-1,j+1 + w4 xi,j-1 + w5 xi,j
      + w6 xi,j+1 + w7 xi+1,j-1 + w8 xi+1,j + w9 xi+1,j+1           (5.3a)

for i,j = 1.....n.

By applying the values of the mask shown in fig. (5.1B) to this equation, then,

yij = 8 xi,j - [ xi-1,j-1 + xi-1,j + xi-1,j+1 + xi,j-1 + xi,j+1
      + xi+1,j-1 + xi+1,j + xi+1,j+1 ]                              (5.3b)

The main Occam code to run the array is as follows:

--- PROC host1
--- PROC cell CT
--- PROC cell A
--- PROC host2
PROC main.system (CHAN ...)
  --- declaration of local channels
  SEQ
    image := number.columns * number.rows
    SEQ i = [0 FOR image]
      input data
      PAR
        host1
        PAR j = [0 FOR 2]
          PAR i = [0 FOR 2]
            cell A
          cell CT
        PAR j = [0 FOR 3]
          cell A
        host2:
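Equation (5.3b) is just the square mask of fig. (5.1B) written out: eight times the centre pixel minus the sum of its eight neighbours. A quick illustrative Python check (not part of the Occam system) that this short form agrees with the full weighted sum of equation (5.3a):

```python
# Weights of the square Laplacian mask (fig. 5.1B), row by row.
W = [[-1, -1, -1],
     [-1,  8, -1],
     [-1, -1, -1]]

def full_sum(I, i, j):
    """Equation (5.3a): full 3x3 weighted sum around pixel (i, j)."""
    return sum(W[r + 1][c + 1] * I[i + r][j + c]
               for r in (-1, 0, 1) for c in (-1, 0, 1))

def short_form(I, i, j):
    """Equation (5.3b): 8*x(i,j) minus the eight neighbours."""
    neighbours = sum(I[i + r][j + c]
                     for r in (-1, 0, 1) for c in (-1, 0, 1)
                     if (r, c) != (0, 0))
    return 8 * I[i][j] - neighbours

I = [[2, 7, 1],
     [9, 4, 3],
     [6, 5, 8]]
assert full_sum(I, 1, 1) == short_form(I, 1, 1)
print(full_sum(I, 1, 1))   # 8*4 - 41 = -9
```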
5.3.3 Designing Transputer Networks for the Laplacian Operator.
The systolic design described in section (5.3.2.1) was chosen for the
implementation of the plus shaped Laplacian filter on the transputer network. This
is because the problem can easily be arranged such that each transputer is
responsible for representing one or more of the cells shown in fig. (5.4), with
communication occurring between them. Each kernel element can be held in a single
transputer throughout the computation, and the maximum number of transputers is
equal to the number of Laplacian filter weights. The same cell design was adopted
on the transputer, so we need two codes to run on different transputers to represent
cell A and cell CT as shown in section (5.3.2.1). The maximum number of
transputers we need for this design is five.
When calculating new values at each transputer, communication has to occur
to obtain values from the neighbouring transputers. Input data and results are
collected at the transputer boundary and then sent down the Inmos links to the next
transputer, which receives the values of the results from the Inmos link. Parallel code
must be executed in each transputer, whilst all transputers execute in parallel.
The first and the last transputers are connected to the host, so the first
transputer receives input data from the host, whilst the last transputer passes on to
the host all the results from all the transputers on the network before it shuts down.
The transputer network configuration for such a design is shown in fig. (5.7). The
same Occam code can be run on transputers T0 and T3 to represent the cell CT
design, while a different Occam code runs on the T1, T2, and T4 transputers.
The transputer network design for the square Laplacian filter is similar to that
shown in fig. (5.7). As the network size is greater than 5, the central transputers
are extended to 9 transputers. Two cell designs (cell A and cell CT) are needed
[Figure: host and filing system connected to transputer T0; the pipeline T0, T1, T2, ..., Tn carries xin/yin through the network and the last transputer returns xout/yout]
Figure (5.7) Transputer network configuration for the Laplacian operator.
to be implemented on different transputers. One Occam code runs on transputers
T0, T1, T3, T4, T6, T7, and T8 to implement the cell A design, while transputers
T2 and T5 execute another Occam code, that of cell CT. All the
processes run in parallel inside each transputer, whilst all transputers execute in
parallel.
The algorithms described in sections (5.3.2.1 and 5.3.2.2) require
modification in order for them to be executed on a network of transputers.
The algorithms at each input data for the 5 and 9 transputer networks are as
follows:
T0
  receive input value xi from the host
  calculate a partial result of yi
  send yi to T1
  send xi to T1

Tj (j = 1 to n-2)   --- (n is the number of transputers)
  receive input value xi from Tj-1
  receive a partial result yi from Tj-1
  calculate another partial result of yi
  send the added yi to Tj+1
  send xi to Tj+1

Tn-1
  receive input value xi from Tn-2
  receive a partial result yi from Tn-2
  calculate the final partial result of yi
  send yi to the host
  send xi to the host
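The three step types above (first, middle, last) all perform the same add-and-forward action on the partial result yi. A toy sequential model in Python (illustrative only; the real system runs the stages concurrently on separate transputers, and the sample values here are hypothetical):

```python
# Each stage of the pipeline holds one weight of the plus-shaped mask,
# adds its contribution to the partial result yi, and forwards it.
weights = [-1, -1, 4, -1, -1]      # the five mask weights of fig. (5.1 A)

def pipeline(samples):
    """samples[k] is the pixel value consumed by stage k for one output."""
    y = 0                          # T0 starts the partial result
    for w, x in zip(weights, samples):
        y = y + w * x              # 'calculate another partial result of yi'
    return y                       # Tn-1 sends the final yi to the host

# Illustrative pixel values seen by the five stages for one output:
print(pipeline([1, 2, 5, 2, 1]))   # 4*5 - (1+2+2+1) = 14
```

The load-balancing point made below follows directly: with 5 stages and fewer than 5 transputers, some transputer must run more than one stage per output.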
5.3.4 Performance of the Laplacian Operator Systolic Design on
Transputer Network.
The designs were applied and executed on transputer networks of varying
sizes. The network comprised 16 T800 transputers configured as an array on the
B012 board. The timing results for the 5 point (or plus) shaped Laplacian operator
are presented in table (5.1). The table also includes the relative speed-ups and
efficiencies of the algorithm. The associated speed-up and efficiency graphs are
given in figures (5.8) and (5.9) respectively.
The speed-up graphs for this design are nearly linear, with speed-up
increasing as the size of the transputer network increases. Also, the speed-up
increases as the size of the image increases, as shown in fig. (5.8). The graphs in
fig. (5.9) show a drop in the efficiencies when the network size is 3 transputers.
This is because the load balance is poor compared with that of the other networks.
The maximum efficiency is obtained when the load on each transputer is
about the same. As the total number of cells in the systolic system is 5, the best
Image Size  Network  Time Lapse  Relative  Efficiency
            Size     (seconds)   Speed-up  %
16 x 16     1        0.048       1.00      100.00
            2        0.029       1.66      83.07
            3        0.020       2.38      79.47
            5        0.011       4.32      86.45
32 x 32     1        0.180       1.00      100.00
            2        0.108       1.67      83.45
            3        0.075       2.41      80.30
            5        0.042       4.33      86.56
64 x 64     1        0.716       1.00      100.00
            2        0.420       1.70      85.23
            3        0.291       2.46      82.09
            5        0.162       4.43      88.64
128 x 128   1        2.989       1.00      100.00
            2        1.663       1.797     89.85
            3        1.151       2.60      86.58
            5        0.639       4.68      93.54
256 x 256   1        12.012      1.00      100.00
            2        6.859       1.75      87.56
            3        4.601       2.61      87.03
            5        2.550       4.71      94.22
Table (5.1) Timing results for the plus shape Laplacian operator.
[Figure: speed-up (2 to 5) against number of transputers (1 to 6), one curve per image size: 16 x 16, 32 x 32, 64 x 64, 128 x 128, 256 x 256]
Figure (5.8) Speedup graphs for plus shape Laplacian operator.
[Figure: efficiency (0.8 to 1.2) against number of transputers (1 to 6), one curve per image size: 16 x 16, 32 x 32, 64 x 64, 128 x 128, 256 x 256]
Figure (5.9) Efficiency graphs for plus shape Laplacian operator.
efficiency we can get is when the load is one cell per transputer. In other words,
the maximum efficiency is obtained when there are 5 transputers in the network.
The efficiency graphs for the larger image sizes generally show a higher efficiency
than those for the smaller images, especially for the 5 transputer network.
Table (5.2) presents a summary of the timing results for the 9 point (or
square) Laplacian operator. The relative speed-ups and efficiencies of the algorithm
are included in the table. Figures (5.10) and (5.11) show the relative speedup and
efficiency respectively. The 'super' speedup figures and the high efficiencies
shown in the graph for the 3 transputer network are due to the reason explained
earlier in section (4.9).
It can be seen from the graphs that the speedup and the efficiency increase
with increasing network size (for more than 5 transputers) and with
increasing image size. There is less efficiency gain when the size of the
network is 9. The reason for the efficiency gain difference in the case of
the larger network is the proportion of the time spent on communication
overheads, especially when the size of the network is relatively high. The
efficiency is over 85% for 4 transputers and over 91% for the 5 and 9 transputer
networks.
Image Size  Network  Time Lapse  Relative  Efficiency
            Size     (seconds)   Speed-up  %
16 x 16     1        0.093       1.00      100.00
            3        0.029       3.20      106.67
            5        0.020       4.59      91.86
            9        0.011       8.23      91.40
32 x 32     1        0.348       1.00      100.00
            3        0.107       3.23      107.79
            5        0.074       4.47      93.07
            9        0.0416      8.38      93.15
64 x 64     1        1.361       1.00      100.00
            3        0.420       3.24      108.11
            5        0.291       4.66      93.35
            9        0.162       8.43      93.68
128 x 128   1        5.389       1.00      100.00
            3        1.660       3.25      108.12
            5        1.154       4.67      93.42
            9        0.638       8.43      93.68
256 x 256   1        21.477      1.00      100.00
            3        6.649       3.23      107.66
            5        4.597       4.67      93.44
            9        2.543       8.45      93.84
Table (5.2) Timing results for the square Laplacian operator.
[Figure: speed-up (3 to 9) against number of transputers (2 to 10), one curve per image size: 16 x 16, 32 x 32, 64 x 64, 128 x 128, 256 x 256]
Figure (5.10) Speedup graphs for square Laplacian operator.
[Figure: efficiency (0.6 to 1.2) against number of transputers (2 to 10), one curve per image size: 16 x 16, 32 x 32, 64 x 64, 128 x 128, 256 x 256]
Figure (5.11) Efficiency graphs for square Laplacian operator.
5.4 PARALLEL IMPLEMENTATION OF THE
GRADIENT OPERATOR
5.4.1 Gradient Operator Algorithms.
The gradient operator applied to a continuous function produces a vector at
each point whose direction is aligned with the direction of maximal grey-level
change at that point, and whose magnitude describes the severity of this change.
A digital gradient may be computed by convolving two windows with an
image, one window giving the x component gx of the gradient operator, and the
other giving the y component gy.
Then, the magnitude of the gradient operator at a point is defined, as shown in
chapter 3, by

g = (gx^2 + gy^2)^1/2                                   (5.4)

and the direction at this point can be computed from

theta = tan^-1 (gy / gx)                                (5.5)
The standard masks for the gradient operator are shown in fig. (5.12). The
outputs generated by mask x and mask y are centred on (i,j).

mask x = [ -1  0  1 ]          (A) mask x

mask y = [ -1
            0
            1 ]                (B) mask y

Figure (5.12) Gradient operator masks.
Typical gradient masks for a larger window, such as Sobel operators and
Prewitt operators are explained in a later section.
As seen in fig. (5.12), only values of -1 and 1 are used, so it is possible to
subtract a pixel or add its value, respectively. Also there are zero values in
each mask, which reduce the number of processors and the number of operations.
The gradient operator produces a two element vector at each pixel, and this is
usually stored as two new images, one for the x component and the other for the y
component.
The algorithm described in chapter 3 has I as the input image and Ox and Oy as
output images; I as well as Ox and Oy contain M x M pixels, with P = M^2. At each
point in the output image, gx(i,j), we multiply the pixel a(i,j) and both its left and
right neighbouring pixel values, a(i,j-1) and a(i,j+1), by the weighting mask x
values. At each point in the output image, gy(i,j), we multiply the pixel a(i,j)
and both its upper and lower neighbouring pixel values, a(i-1,j) and a(i+1,j), by the
weighting mask y values.
From equation (3.5), a configuration for generating the x output component
gx is as follows:

gx(i,j) = w1 ai,j-1 + w2 ai,j + w3 ai,j+1               (5.6a)

By applying the weighting mask shown in fig. (5.12 A) to this equation, then
we have

gx(i,j) = ai,j+1 - ai,j-1                               (5.6b)

From equation (3.6), a configuration for generating the y output component
gy is as follows:

gy(i,j) = w1 ai-1,j + w2 ai,j + w3 ai+1,j               (5.7a)

By applying the weighting mask shown in fig. (5.12 B) to this equation, then
we have

gy(i,j) = ai+1,j - ai-1,j                               (5.7b)
163
- -------------------
Substitution of equations (5.6b) and (5.7b) in equations (5.4) and (5.5)
respectively yields the gradient values:

g(i,j) = [ (ai,j+1 - ai,j-1)^2 + (ai+1,j - ai-1,j)^2 ]^1/2    (5.8)

theta = tan^-1 [ (ai+1,j - ai-1,j) / (ai,j+1 - ai,j-1) ]      (5.9)

for i,j = 1.....n-1.

The algorithms are repeated for every pixel in the image.
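Per pixel, equations (5.8) and (5.9) reduce to two central differences. A small illustrative Python version (not the systolic implementation; math.atan2 is used so the vertical-edge case, where the denominator gx is zero, is handled safely):

```python
import math

def gradient(a, i, j):
    """Central-difference gradient of equations (5.6b)-(5.9) at (i, j)."""
    gx = a[i][j + 1] - a[i][j - 1]     # mask x of fig. (5.12 A)
    gy = a[i + 1][j] - a[i - 1][j]     # mask y of fig. (5.12 B)
    magnitude = math.hypot(gx, gy)     # equation (5.8)
    direction = math.atan2(gy, gx)     # equation (5.9), quadrant-safe
    return magnitude, direction

# A vertical ramp: grey level increases by 3 per row, constant per column.
a = [[0, 0, 0],
     [3, 3, 3],
     [6, 6, 6]]
mag, theta = gradient(a, 1, 1)
print(mag, theta)                      # 6.0 and pi/2: gradient points down the rows
```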
5.4.2 Systolic Array Design for the Gradient Operator
The design consists of a double pipeline systolic array; pipe one
accommodates the x component gx of the gradient operator, and pipe two
accommodates the y component gy of the gradient operator.
This design can be implemented in a straightforward manner as shown in
fig. (5.13). The first cell in the array delays, then makes a duplicate of, the input
data, and pumps both copies into the array through both pipes. The other function of
this cell is to delay the input stream for the x component pipe, with a multistage shift
[Figure: host1 feeds a delay cell which duplicates the input; the upper pipe (cell 1, cell 2) computes gy, the lower pipe (cell 3, cell 4) computes gx; the root cell (cell 5) combines both pipes and sends the results to host2]
Figure (5.13) Double pipeline systolic array for the gradient operator.
register. The number of stages of the shift register is equal to n, where n is the
number of image columns. Each of the two pipes consists of k bifunctional cells
(where k is the length of the weight vector); here each of the two pipes consists of 2
bifunctional cells, and each cell contains a kernel element value. The cell design is
shown in fig. (4.8).
From equations (4.5) and (4.6), the number of stages of the shift register in
each cell is as follows:

cell 1   (Ak + n) stages  (it holds the first element of the y component, which
                           is the last element in the first kernel row; the
                           additional n stages are needed because the second
                           kernel row contains only zero values)
cell 2   A stages         (it holds the final element of the y component)
cell 3   (A + 1) stages   (it holds the first element of the x component)
cell 4   A stages         (it holds the final element of the x component)
The third part of the array consists of one multifunctional cell, to compute
equations (5.8) and (5.9). There is no requirement for a shift register in this cell, as
shown in fig. (5.14).
The input data are pumped into the two pipes by the host, through the delay
cell, in regular clock beats, through different channels. The data and the results are
pumped through both pipes; the final results for each component of the gradient
operator and the original data are collected from both pipes by the last cell in the
array (cell 5), which then sends the final results to the second host. The algorithm is
repeated for every input data for each cell concurrently. The final outputs and the
input data are collected by the second host.
[Figure: the root cell receives d, gx and gy from the two pipes and outputs the combined gradient magnitude G and direction]
Figure (5.14) Cell root (multifunctional cell) layout.
The basic design of a double pipeline systolic array for the gradient operator,
presented previously, can be modified in many ways. The two main reasons for
adjusting the design are:
1- To reduce the total number of stages of shift register in the cells.
2- To reduce the number of input data streams to one stream (instead of
two in the previous design).
We start by redrawing the old design with the shift registers of the cells
replaced by a shift-register bus. The bus-oriented systolic array system contains
one system bus, to which all the system's cells are interconnected. The redrawn
system design is illustrated in fig. (5.15).
The number of stages of the shift register in the bus is equal to the maximum
number of stages between any two cells in the old design. Of course, each cell can
receive the input data from the bus, where each cell is connected to the bus at a
certain stage number, such that the total number of shift register stages between the
two cells of each pipe is the same as in the old design.
Now, if we compare the bus-oriented systolic array with the previous design,
we notice a change in the total number of stages of the shift register in the whole
system.
[Figure: host1 feeds the system bus, which distributes the input data to cells 1-4; cells 1 and 2 form the gy pipe and cells 3 and 4 the gx pipe; the root cell (cell 5) combines both pipes and sends G and the direction to host2]
Figure (5.15) Bus-oriented systolic array for the gradient operator.
The total number of shift register stages in the previous design for all cells
(SR1) is:

SR1 = (A + 1) + A + (Ak + n) + A
    = 3A + Ak + n + 1                                   (5.10)

while the total number of shift register stages in the bus (SR2) is equal to the
shift register stages for both cells in the y component pipeline:

SR2 = (Ak + n) + A
    = A + Ak + n

Combining this equation with equation (4.6) yields

SR2 = 2A + 2n - k                                       (5.11a)

i.e. there are (2A + 1) fewer stages in the new design than in the old design. For the
kernel shown in fig. (5.12), where k = 1 and A = 3, the number of stages of the shift
register of the bus is:

SR2 = 2(n + 1) + A                                      (5.11b)
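The stage counts above can be checked by arithmetic. In the sketch below (illustrative Python), Ak is taken as A + n - k, an assumption chosen so that equation (5.11a) follows from SR2 = A + Ak + n; equation (4.6) itself belongs to chapter 4 and is not reproduced here:

```python
# Numerical check of equations (5.10)-(5.11b). Ak = A + n - k is assumed
# so that SR2 = A + Ak + n reduces to 2A + 2n - k as in (5.11a).
def check(A, n, k):
    Ak = A + n - k
    SR1 = (A + 1) + A + (Ak + n) + A        # cell-by-cell sum, eq. (5.10)
    SR2 = (Ak + n) + A                      # bus stages (y-component pipe)
    assert SR1 == 3 * A + Ak + n + 1        # collected form of (5.10)
    assert SR2 == 2 * A + 2 * n - k         # equation (5.11a)
    assert SR1 - SR2 == 2 * A + 1           # stages saved by the bus design
    if k == 1:
        assert SR2 == 2 * (n + 1) + A       # equation (5.11b)
    return SR1, SR2

print(check(A=3, n=128, k=1))
```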
Also we can see that the new design has only one input data stream, and there
is no need for the delay cell.
As shown in fig. (5.15), the image data is pumped into the system bus by the
first host, which then distributes the values to their corresponding cells, while the
partial results are pumped through both pipes exactly as in the previous design.
The main Occam code to run the array is as follows:
---PROC hostl
--·PROC celll (blfuncuonal cell)
-··PROC cell2 (bifuncuonal cell)
---PROC cell3 (bifuncuonal cell)
---PROC cell4 {b1funcuonal cell)
---PROC cellS (multfuncuonal cell)
---PROC host2
PROC grachentmam (CHAN-----·-)
SEQ
SEQ 1 = [ 0 FOR Image]
input data
PAR
host!
celll
cell3
cell4
cell2
cellS
host2
5.4.3 Transputer Network for Gradient Operator
The bus-oriented systolic system described in the previous section requires further modification before the array can be implemented on a network of transputers.
The transputer network is connected to one host only. The first transputer is connected to the host, so it receives all the input data and pumps it to the transputer network through the network bus. Each transputer accommodates part of the shift register bus, so all the parts of the system bus are inside the transputers themselves, instead of outside the cells as shown previously in fig. (5.15).
[Figure: transputers T0-T4, with T0 connected to the host filing system.]
Figure (5.16) Transputer network configuration for the gradient operator.
The final transputer on the network collects the x-component and y-component values of the gradient operator through both pipes, as shown in fig. (5.16), calculates the final output and sends it down the Inmos link to the host.
Transputers T0 and T3 form the y-component pipeline, while T1 and T2 form the x-component pipeline. The following is the parallel algorithm computed for each input data item on all the transputers:

T0   read input value ai from the host
     read gxi, gyi from the host
     calculate partial result of gyi
     send ai to T1 and T3
     send gxi to T1
     send gyi to T3

T1   read ai, gxi from T0
     calculate partial result of gxi
     send ai, gxi to T2

T2   read ai, gxi from T1
     calculate partial result of gxi
     send ai, gxi to T4

T3   read ai from T0
     read gyi from T0
     calculate partial result of gyi
     send gyi to T4

T4   read gyi from T3
     read gxi from T2
     calculate the final results
     send ai, di, gxi and gyi to the host
The Occam program for a network of transputers is given in Appendix C.
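The per-transputer steps above form a five-stage pipeline: T1 and T2 accumulate the x-component, T0 and T3 the y-component, and T4 combines both. The sketch below (Python standing in for Occam, with a simple two-term first-order difference kernel chosen purely for illustration) mirrors that message flow sequentially:

```python
import math

# Illustrative two-term weights for a first-order difference kernel;
# the actual gx/gy coefficients are those of equations (5.6b)/(5.7b).
GX_WEIGHTS = (-1.0, 1.0)   # applied by T1 and T2
GY_WEIGHTS = (-1.0, 1.0)   # applied by T0 and T3

def run_pipeline(x_pair, y_pair):
    """Mimic the T0-T4 message flow for one output pixel."""
    gy = GY_WEIGHTS[0] * y_pair[0]       # T0: first y-component term
    gx = GX_WEIGHTS[0] * x_pair[0]       # T1: first x-component term
    gx += GX_WEIGHTS[1] * x_pair[1]      # T2: second x-component term
    gy += GY_WEIGHTS[1] * y_pair[1]      # T3: second y-component term
    magnitude = math.hypot(gx, gy)       # T4: combine both partial results
    direction = math.atan2(gy, gx)
    return gx, gy, magnitude, direction

gx, gy, mag, _ = run_pipeline((10.0, 14.0), (10.0, 13.0))
print(gx, gy, mag)   # 4.0 3.0 5.0
```

On the real network each stage runs concurrently on its own transputer; here the stages are simply called in order.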
5.4.4 Prewitt and Sobel Operator Algorithms.
The Prewitt and Sobel operators are modified gradient operators. They are 3 x 3 gradient operators, as shown in fig. (5.17).
Equations (5.6b) and (5.7b) represent only the first-order difference operator. To accommodate the Prewitt operator, equations (5.6b) and (5.7b) can be written as:

Px(i,j) = [ a(i-1,j+1) + a(i,j+1) + a(i+1,j+1) ] -
          [ a(i-1,j-1) + a(i,j-1) + a(i+1,j-1) ]        (5.12)

and,

Py(i,j) = [ a(i+1,j-1) + a(i+1,j) + a(i+1,j+1) ] -
          [ a(i-1,j-1) + a(i-1,j) + a(i-1,j+1) ]        (5.13)
From equation (5.4), the magnitude of the Prewitt operator at a point is defined by

P = sqrt( Px^2 + Py^2 )        (5.14)
-1  0  1        -1 -1 -1
-1  0  1         0  0  0
-1  0  1         1  1  1
(A) mask x      (B) mask y
Prewitt operators

-1  0  1        -1 -2 -1
-2  0  2         0  0  0
-1  0  1         1  2  1
(C) mask x      (D) mask y
Sobel operators

Figure (5.17) Prewitt and Sobel operator masks.
For the direction of this point, equation (5.5) can then be written as:

θp = tan-1 ( Py / Px )        (5.15)

In the case of the Sobel operator, equations (5.6b) and (5.7b) can be written as:

Sx(i,j) = [ a(i-1,j+1) + 2a(i,j+1) + a(i+1,j+1) ] -
          [ a(i-1,j-1) + 2a(i,j-1) + a(i+1,j-1) ]        (5.16)

and,

Sy(i,j) = [ a(i+1,j-1) + 2a(i+1,j) + a(i+1,j+1) ] -
          [ a(i-1,j-1) + 2a(i-1,j) + a(i-1,j+1) ]        (5.17)

Again, from equation (5.4), the magnitude of the Sobel operator at a point is defined by

S = sqrt( Sx^2 + Sy^2 )        (5.18)

For the direction of this point, equation (5.5) can then be written as:

θs = tan-1 ( Sy / Sx )        (5.19)
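Equations (5.12)-(5.19) translate directly into code. A sequential sketch (Python; the image is a plain list of rows, only interior pixels are processed, and boundary handling is left unspecified, as in the text):

```python
import math

PREWITT_X = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
PREWITT_Y = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def apply_masks(a, i, j, mask_x, mask_y):
    """Magnitude and direction at pixel (i, j), per eqs (5.14)/(5.15)."""
    gx = sum(mask_x[k + 1][l + 1] * a[i + k][j + l]
             for k in (-1, 0, 1) for l in (-1, 0, 1))
    gy = sum(mask_y[k + 1][l + 1] * a[i + k][j + l]
             for k in (-1, 0, 1) for l in (-1, 0, 1))
    return math.hypot(gx, gy), math.atan2(gy, gx)

# A vertical step edge: the x-masks respond, the y-masks do not.
img = [[0, 0, 9, 9],
       [0, 0, 9, 9],
       [0, 0, 9, 9]]
mag, _ = apply_masks(img, 1, 1, PREWITT_X, PREWITT_Y)
print(mag)   # 27.0  (gx = 3*9, gy = 0)
```

Swapping in SOBEL_X and SOBEL_Y gives the Sobel response at the same pixel.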
5.4.5 Systolic Array Design for Prewitt and Sobel Operators.
The bus-oriented systolic system presented in the previous section can be modified to accommodate the Prewitt or Sobel operators. The design can be extended to any size of mask.
The design for the Prewitt operator still consists of a double pipeline systolic array. Pipe one is for the x component Px and pipe two is for the y component Py. Each of the two pipes consists of k bifunctional cells (k is the length of the mask); for the Prewitt operator, k=6. Both pipes are interconnected to a shift register bus, with each cell connected to the bus at a certain level to retain synchronisation for the accumulated output, as shown in fig. (5.18).
[Figure: two pipelines of six cells each (cell1-cell6 and cell7-cell12) plus cell13, connected to the system bus between host1 and host2.]
Figure (5.18) Bus-oriented systolic array for the Prewitt operator.
The total number of shift register stages (SR) of the bus, for any size of mask, is:

SR = the total number of stages of the shift register of all cells in one of the pipelines.

Then the SR for the y-component pipeline is,

SR = h1 ( A(h2 - 1) + Ak ) + 3A + n        (5.20a)

where h1 is the number of mask segments (of non-zero mask values) and h2 is the number of mask elements in each segment.
By substituting equations (4.6) and (4.6a) into the SR equation we obtain the following in the case of the Prewitt operator, where h1 = 2 and h2 = 3:
SR = 2(3A + n - 3) + 3A + n
SR = 9A + 3n - 6        (5.20b)

The SR for the x-component pipeline has the same value as the y-component pipeline.
The total number of cells in the system is 2k+1 (k is the number of kernel elements); in the case of the Prewitt operator, the total number of cells is 13.
The input data and the partial results are pumped through the system in the same way as in the bus-oriented systolic system for the gradient operator presented in the previous section.
The bus-oriented systolic system for the Sobel operator masks, shown in fig. (5.17 c and d), operates in a similar way to the bus-oriented systolic system for the Prewitt operator. The design itself is general and can be used for any size of mask and image.
5.4.6 Performance of the Gradient Operator Systolic Design on the
Transputer Network.
Experiments were performed on the systolic design (fig. 5.18) to find the effect on speedup and efficiency of increasing the network size for various image sizes. The timing results from the experiments are presented in table (5.3) and their graphical interpretations are shown in figures (5.19) and (5.20).
[Figure omitted: speedup vs. number of transputers for image sizes 16x16 to 256x256.]
Figure (5.19) Speedup graphs for gradient operator.
It can be seen from fig. (5.19) that the relationship between the speedup and the image size is nearly fixed for each transputer network size. This means that for larger image sizes the speedup is likely to be similar to that for an image of smaller size, but it increases sharply when the network size is at its maximum, i.e., 5 transputers. It is clear from the graphs in fig. (5.20) that the efficiency decreases when the size of the network is 4 transputers. This is due to load imbalance: the maximum efficiency is obtained when the load on each transputer is nearly the same. The maximum efficiency is obtained for the 5-transputer network (the total number of cells is 5). It can be concluded that better performance results can be achieved if the load is balanced.
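The speed-up and efficiency columns of table (5.3) follow from the usual definitions, speedup = T1/Tp and efficiency = speedup/p. A quick check against the 256 x 256, 5-transputer row:

```python
def speedup(t1, tp):
    # Relative speed-up: single-transputer time over p-transputer time.
    return t1 / tp

def efficiency(t1, tp, p):
    # Efficiency as a percentage of ideal linear speed-up.
    return 100.0 * speedup(t1, tp) / p

# 256 x 256 image: 17.957 s on 1 transputer, 3.692 s on 5 (table 5.3).
print(round(speedup(17.957, 3.692), 2))        # 4.86
print(round(efficiency(17.957, 3.692, 5), 2))  # 97.28
```

The last digit differs slightly from the table's 97.27 only through rounding of intermediate values.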
[Figure omitted: efficiency vs. number of transputers for image sizes 16x16 to 256x256.]
Figure (5.20) Efficiency graphs for gradient operator.
Image Size   Network Size   Time Lapse (seconds)   Relative Speed-up   Efficiency %
16 x 16           1                0.078                 1.00             100.00
                  3                0.031                 2.52              83.92
                  4                0.025                 3.13              78.23
                  5                0.016                 4.79              95.74
32 x 32           1                0.291                 1.00             100.00
                  3                0.115                 2.54              84.64
                  4                0.092                 3.15              78.74
                  5                0.060                 4.84              96.83
64 x 64           1                1.136                 1.00             100.00
                  3                0.447                 2.54              84.73
                  4                0.361                 3.15              78.78
                  5                0.234                 4.85              97.06
128 x 128         1                4.502                 1.00             100.00
                  3                1.770                 2.54              84.79
                  4                1.428                 3.15              78.83
                  5                0.927                 4.86              97.17
256 x 256         1               17.957                 1.00             100.00
                  3                7.330                 2.45              81.66
                  4                5.690                 3.16              78.89
                  5                3.692                 4.86              97.27

Table (5.3) Timing results for the gradient operator.
CHAPTER 6
LOW-LEVEL IMAGE PROCESSING AND FILTER SOFTWARE LIBRARY DEVELOPMENT
6.1 INTRODUCTION
In the following sections we consider techniques and designs for filtering digital images. This includes both smoothing and edge enhancement filters. Typical filters perform some form of moving window operation that may be a convolution or another local computation in the window.
Various modifications of the systolic design are analysed in this chapter; the design modifications are to handle each of the filter algorithms. The number of cells or processors in each filter design depends on the size of the kernel and the nature of the algorithms.
The implementation of a variety of digital image filter algorithms within the
Sequent Balance and the transputer network was achieved. The aim of this is to
design and build a programming workbench for developing image processing
operations for low-level vision. The motivation for the work is to develop a
methodology for the implementation of an image processing library on the Sequent
Balance. The key to the workbench is to hold a library of precoded software
components in a generalised configuration-independent style. This digital image
processing filters library is discussed in this chapter.
A brief definition of each algorithm is given at the head of each section in which it is discussed, together with a full description of the systolic array design. The results and efficiency obtained for each design on a transputer network are also given, to reflect the performance of the design at this level of vision.
6.2 PARALLEL IMPLEMENTATION OF THE SIGMA
FILTER.
6.2.1 Sigma Filter Algorithm
As explained in chapter 3, this filter smooths the image noise by averaging only those neighbouring pixels whose intensities lie within a fixed sigma (S) range of the centre pixel.
For a 3 x 3 window, each point in the input image x(i,j), as shown in fig. (6.1), is set equal to the average of all eight pixels in its neighbourhood whose values are within S counts of the value of x(i,j). S is an adjustable parameter and may be derived from sigma, the standard deviation of the pixel value distribution, or a specified nonnegative threshold.
[Figure: centre pixel x and its 8 neighbours.]
Figure (6.1) Sigma filter with 8 neighbourhood pixels in a 3 x 3 window.
From equation (3.8), we form the output y(i,j) according to the following criterion:

y(i,j) = Σ(k=-1 to 1) Σ(l=-1 to 1) r(i+k,j+l) w(i+k,j+l)        (6.1)

where k,l = -1, 0, 1 for window size 3 x 3, w(i+k,j+l) denotes the pixel value x(i+k,j+l), and

r(i+k,j+l) = 1   if |x(i+k,j+l) - x(i,j)| < S
r(i+k,j+l) = 0   otherwise        (6.2)

Then,

y(i,j) = r(i-1,j-1) w(i-1,j-1) + r(i-1,j) w(i-1,j) + r(i-1,j+1) w(i-1,j+1) +
         r(i,j-1) w(i,j-1) + r(i,j+1) w(i,j+1) + r(i+1,j-1) w(i+1,j-1) +
         r(i+1,j) w(i+1,j) + r(i+1,j+1) w(i+1,j+1)        (6.3)

The output should be divided by m, where m is the accumulated value of r.
As the window is passed over the entire image and the sigma filter operation is performed on each pixel, the 8 r elements in equation (6.3) are updated for each pixel x(i,j) from the 8 nearest neighbours. The updated r is given by equation (6.2), where r is either 1 or 0.
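A direct sequential reading of equations (6.1)-(6.3), as a sketch in Python (interior pixels only; the centre pixel is excluded from the average, as in the 8-neighbour description above, and the fallback when no neighbour qualifies is an implementation choice not specified in the text):

```python
def sigma_filter_pixel(x, i, j, s):
    """Average the 8 neighbours of x[i][j] whose values lie within s of it."""
    total, m = 0, 0
    for k in (-1, 0, 1):
        for l in (-1, 0, 1):
            if k == 0 and l == 0:
                continue                  # centre pixel is not averaged
            v = x[i + k][j + l]
            if abs(v - x[i][j]) < s:      # r(i+k,j+l) = 1, eq. (6.2)
                total += v
                m += 1
    return total / m if m else x[i][j]    # divide by m, the accumulated r

img = [[10, 10, 90],
       [10, 12, 90],
       [10, 10, 90]]
print(sigma_filter_pixel(img, 1, 1, 5))   # 10.0 (the three 90s are rejected)
```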
6.2.2 Systolic Array Design for the Sigma Filter
The bifunctional array design discussed in the previous chapter should be modified to accommodate the sigma filter, as shown in fig. (6.2). The array consists of the following cells:
a- A duplicate cell; the main function of this cell is to make a duplicate of the input data and pump both copies into the array through different channels. The other function of the cell is to delay one of the input streams with a multistage shift register. The number of stages of the shift register is equal to (n+14), where n is the number of image columns.
[Figure: duplicate cell, multifunctional cells A, B and L, and a division cell, between host1 and host2.]
Figure (6.2) Systolic array for the Sigma filter.
b- Multifunctional cell; the array consists of k2 multifunctional cells (the size of the window), each of which produces its partial result. The main function of these cells is to compute equations (6.2) and (6.3). There are three processes in each cell. One process must be designed to collect two values each time, one from the delay cell to the left and the other from the shift register, then compute equation (6.2) and calculate the value of r.
Another process is required to calculate a partial product of y(i,j), as shown in equation (6.3). Once this has been done, it communicates the results to another process which accumulates the partial products of the output y(i,j). The cell design is shown in fig. (6.3).
Each cell contains a multistage shift register, with three different numbers of shift register stages in the three different multifunctional cells. The number of stages in each cell is calculated using equations (4.5), (4.6) and (5.1) respectively.
[Figure: multifunctional cell with shift register stages R1...RA and its internal processes.]
Figure (6.3) Multifunctional cell design.
The Occam code running on this cell takes the form of fig. (6.3A). All processes inside the cell run in parallel for each input data item, so that communication between processes is overlapped with computation.
--- PROC delay
--- PROC pass.data
--- PROC multiply
--- PROC add
PROC cell.multifunctional (CHAN ..........)
  --- declaration of local channels
  PAR
    PAR j = [0 FOR n]
      delay
    PROC compare
      SEQ
        SEQ i = [0 FOR time]
          IF
            |x(i+k,j+l) - x(i,j)| < S
              r := 1
            TRUE
              r := 0
    multiply
    add :

Figure (6.3A) Procedure to run the multifunctional cell for the Sigma filter.
c- A division cell; this cell is a unifunctional cell. Its main task is to calculate the final value of the output y(i,j) by dividing the result by m.
As shown in fig. (6.2), the image data is pumped into the array by the first host. As the delay cell collects the input data, it duplicates this data, delays one set, as discussed before, and then sends them down to the multifunctional cells. Each multifunctional cell requires 3 input and 3 output channels for communication with its left and right neighbour cells. The data and results are pumped through the array and the accumulated partial products are collected by the final cell (the division cell), which then sends the final results to the second host. The main Occam code to run the array is as follows:
--- PROC host1
--- PROC delay
--- PROC cell A
--- PROC cell B
--- PROC cell L
--- PROC cell.div
--- PROC host2
PROC main.system (CHAN .........)
  SEQ
    SEQ i = [0 FOR image]
      input data
      PAR
        host1
        delay
        PAR i = [1 FOR 2]
          cell A
          cell B
        cell L
        cell B
        PAR i = [1 FOR 3]
          cell A
        cell.div
        host2 :
6.2.3 Transputer Network for Sigma Filter
A parallel code is designed to run on each transputer while all transputers in the network execute in parallel. For simplicity, each transputer is responsible for a cell of the systolic design shown in fig. (6.2). Basically, inside each transputer (except the first and the last), equation (6.2) is applied to update the partial result. The first transputer receives the input data, duplicates it and sends it to the neighbouring transputer. The main task of the final transputer in the network is to calculate the final value of the output.
In order to run on a network of transputers, the number of channels between neighbouring transputers has been reduced to 2 (an input channel and an output channel). Fig. (6.4) shows the parallel algorithm computed for each input data item on all transputers.
T0    receive xi from the host
      receive yi from the host
      make a copy of xi
      delay the xi copy
      send yi to T1
      send xi to T1
      send the xi copy to T1

Tj    (j = 1 to n-2)
      receive yi from Tj-1
      receive xi and the xi copy from Tj-1
      calculate partial result of yi
      send yi to Tj+1
      send xi and the xi copy to Tj+1

      (n is the number of transputers)

Tn-1  receive input value xi and the xi copy from Tn-2
      receive the partial result yi from Tn-2
      calculate the final result of yi
      send yi to the host
      send xi to the host

Figure (6.4) The parallel algorithm computed for each input data item on all the transputers.
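The structure of fig. (6.4) is a linear pipeline: T0 duplicates the input, each middle transputer folds one neighbour's contribution into the partial result via equation (6.2), and the last transputer performs the division by m. A sequential stand-in for that flow (Python; feeding the neighbour values one per stage is an illustrative simplification):

```python
def middle_stage(state, neighbour, s):
    """Tj: apply equation (6.2) for one neighbour, updating (sum, m)."""
    total, m, centre = state
    if abs(neighbour - centre) < s:   # r = 1 for this neighbour
        total += neighbour
        m += 1
    return total, m, centre

def last_stage(state):
    """Tn-1: the final division by m (fall back to the centre if m = 0)."""
    total, m, centre = state
    return total / m if m else centre

# Centre pixel 12, its 8 neighbours fed one per pipeline stage, S = 5.
state = (0, 0, 12)
for nb in (10, 10, 90, 10, 90, 10, 10, 90):
    state = middle_stage(state, nb, 5)
print(last_stage(state))   # 10.0
```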
6.2.4 Performance of the Sigma Filter Systolic Array on the
Transputer Network
Experiments were performed on the design shown in fig. (6.3) to find the effect on speed and efficiency of increasing the network size for various image sizes. The timing results for the experiments are presented in table (6.1) and their graphical interpretations are shown in figures (6.5) and (6.6).
A considerable amount of time is saved by solving the problem on a network of transputers, even for a small image. The speedup increases as the size of the transputer network increases; the speedup also increases as the size of the image increases, as shown in fig. (6.5). The overall results for this algorithm are very impressive. Table (6.1) indicates 'super' speedup for all sizes of image when the network size is 2 transputers. The efficiencies are also extremely high (over 100%, falling gradually to above 89%).
This extraordinary behaviour is explained by the fact that each T800 transputer in the network is connected to external memory, which is much slower than the on-chip RAM. When the program is executed on a single
Image Size   Network Size   Time Lapse (seconds)   Relative Speed-up   Efficiency %
16 x 16           1                0.224                 1.00             100.00
                  2                0.102                 2.20             110.21
                  6                0.042                 5.35              89.21
                 10                0.025                 9.08              90.84
32 x 32           1                0.842                 1.00             100.00
                  2                0.379                 2.22             111.01
                  6                0.156                 5.41              90.20
                 10                0.091                 9.21              92.10
64 x 64           1                3.291                 1.00             100.00
                  2                1.479                 2.23             111.28
                  6                0.606                 5.43              90.48
                 10                0.355                 9.25              92.53
128 x 128         1               13.035                 1.00             100.00
                  2                5.854                 2.23             111.33
                  6                2.405                 5.42              90.34
                 10                1.407                 9.27              92.65

Table (6.1) Timing results for the Sigma filter.
transputer, some of the data stored in the cells' shift registers has to be held in the external memory, so that extra time is required in accessing the slow external memory. When the same program is run on a network of 2 transputers, the amount of storage needed per transputer is halved and thus all the data can be stored in the fast on-chip RAM. The gain in speed from the on-chip RAM offsets the new overhead introduced by communication.
The graphs of fig. (6.6) show a drop in the efficiencies when the network size is 6. This is due to the fact that the load balancing of the system is poor for this kind of network configuration. For all sizes of image the efficiency graph increases as the network grows from 6 to 10 transputers. The maximum efficiency is obtained when there is the maximum number of transputers (10) in the network. The efficiency is then over 90%.
[Figure omitted: speedup vs. number of transputers for image sizes 16x16 to 128x128.]
Figure (6.5) Speedup graphs for Sigma filter.
[Figure omitted: efficiency vs. number of transputers for image sizes 16x16 to 128x128.]
Figure (6.6) Efficiency graphs for Sigma filter.
6.3 PARALLEL IMPLEMENTATION OF THE INVERSE
GRADIENT FILTER
6.3.1 Inverse Gradient Algorithm
This smoothing scheme is based on the observation that the variations of grey levels inside a region are smaller than those between regions.
For a 3 x 3 window, each point in the input image x(i,j), as shown in fig. (6.1), is set equal to the inverse gradient of all eight pixels x(k,l), for k,l = 1,2,3, in its neighbourhood.
From equation (3.7), we form the output y(i,j) as:

y(i,j) = 1/2 ( x(i,j) + Σ(k,l) r(k,l) )        (6.4)

where r(k,l) is the inverse of the absolute gradient, multiplied by the relevant neighbour pixel, and is given by:

r(k,l) = x(i+k,j+l) / | x(i,j) - x(i+k,j+l) |        (6.5)

From these two equations (6.4) and (6.5), the configuration for generating the output y(i,j) is as follows:

y(i,j) = 1/2 ( x(i,j) +
    x(i-1,j-1)/|x(i,j) - x(i-1,j-1)| + x(i-1,j)/|x(i,j) - x(i-1,j)| +
    x(i-1,j+1)/|x(i,j) - x(i-1,j+1)| + x(i,j-1)/|x(i,j) - x(i,j-1)| +
    x(i,j+1)/|x(i,j) - x(i,j+1)| + x(i+1,j-1)/|x(i,j) - x(i+1,j-1)| +
    x(i+1,j)/|x(i,j) - x(i+1,j)| + x(i+1,j+1)/|x(i,j) - x(i+1,j+1)| )        (6.6)

As the window is passed over the entire image, the inverse gradient filter operation is performed on each pixel.
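Read sequentially, equations (6.4)-(6.6) give the sketch below. The text does not say what happens when a neighbour equals the centre pixel (a zero gradient), so the epsilon fallback here is purely an implementation assumption:

```python
def inverse_gradient_pixel(x, i, j, eps=1e-6):
    """Equations (6.4)-(6.6): half the centre pixel plus the sum of each
    neighbour divided by its absolute gradient from the centre."""
    centre = x[i][j]
    acc = 0.0
    for k in (-1, 0, 1):
        for l in (-1, 0, 1):
            if k == 0 and l == 0:
                continue
            nb = x[i + k][j + l]
            grad = abs(centre - nb)
            acc += nb / (grad if grad else eps)   # r(k,l) of eq. (6.5)
    return 0.5 * (centre + acc)

img = [[ 8, 12, 11],
       [ 9, 10, 13],
       [ 7, 14,  6]]
print(round(inverse_gradient_pixel(img, 1, 1), 3))   # 25.833
```

As printed, the neighbour sum in (6.6) is unnormalised; published inverse gradient filters usually normalise the weights, but the sketch follows the equations as given.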
6.3.2 Systolic Array Implementation for the Inverse Gradient Filter
The systolic array design is more or less similar to the sigma filter systolic array discussed in section (6.2.2). The array consists of the following cells:
a- A duplicate cell; the main function of this cell is to make a duplicate of the input data.
b- Multifunctional cell; the array consists of 8 multifunctional cells, each of which produces its partial results. The main function of these cells is to compute a partial product of y(i,j), as in equation (6.5). The number of stages of the shift register of the multifunctional cell is as shown in section (6.2.2). The Occam code running on this cell takes the following form:
--- PROC delay
--- PROC pass.data
--- PROC add
PROC cell.multifunctional (CHAN ..........)
  --- declaration of local channels
  PAR
    pass
    PAR j = [0 FOR n]
      delay
    PROC compute
      SEQ
        SEQ i = [0 FOR time]
          r(k,l) = x(k,l) / |x(i,j) - x(k,l)|
    add :
c- An addition cell; this cell is a unifunctional cell. Its main task is to calculate the final value of the output y(i,j) by applying equation (6.4).
The system behaves in a similar way to the systolic array for the sigma filter described in section (6.2.2).
Each multifunctional cell requires 3 input and 3 output channels for
communication with the left and right neighbouring cells.
6.3.3 Transputer Network for the Inverse Gradient Filter
This is similar to the transputer network for the sigma filter, described in section (6.2.3). Basically, inside each transputer (except the first and the last), equation (6.6) is applied to update the partial results. The first transputer receives the input data, duplicates it and sends it to the next neighbouring transputer, while the main task of the last transputer in the network is to calculate the final values of the output by applying equation (6.4) and to send them down the Inmos link to the host. The total number of transputers in the network is 10.
The parallel algorithm computed for each input data item on all the transputers is shown in fig. (6.4).
6.3.4 Performance of the Inverse Gradient Filter Systolic Array on
the Transputer Network
The inverse gradient systolic design was applied to the transputer network for various network sizes and image sizes. The network consisted of T800 transputers. The number of system cells in each transputer should be as equal as possible, to ensure load balancing.
Table (6.2) shows the timing results of the algorithm. An analysis of the results shows that the system performance is improved by increasing the number of computing transputers. The performance is also improved by increasing the image size. The super speedup and the very high efficiency shown in the table are due to the reason explained in section (6.2.4). Fig. (6.7) shows an increasing value of the speedup as the network size increases. The graph for a 256 x 256 image size shows
Image Size   Network Size   Time Lapse (seconds)   Relative Speed-up   Efficiency %
16 x 16           1                0.183                 1.00             100.00
                  2                0.085                 2.15             107.58
                  6                0.036                 5.08              84.63
                 10                0.021                 8.68              86.79
32 x 32           1                0.687                 1.00             100.00
                  2                0.317                 2.17             108.48
                  6                0.134                 5.13              85.50
                 10                0.078                 8.81              88.08
64 x 64           1                2.683                 1.00             100.00
                  2                1.234                 2.17             108.74
                  6                0.523                 5.14              85.68
                 10                0.303                 8.84              88.42
128 x 128         1               10.628                 1.00             100.00
                  2                4.884                 2.18             108.81
                  6                2.066                 5.14              85.72
                 10                1.200                 8.85              88.54
256 x 256         1               42.324                 1.00             100.00
                  2               19.461                 2.17             108.74
                  6                8.280                 5.11              85.20
                 10                4.782                 8.85              88.51

Table (6.2) Timing results for the inverse gradient filter.
a decrease in the speedup when the network size is 6 and then a rapid increase when the network is increased further; the graphs in fig. (6.8) also show a decrease in efficiency when the network size is 6 for all sizes of image. The reason for this is that the load balance of the system is poor: the number of cells in the system is 10, and the distribution of cells on the transputer network is unbalanced, as explained in section (6.2.4). For all sizes of image the efficiency graph increases from 7 to 10 transputers. The maximum efficiency is obtained when there are 10 transputers in the network, which is the maximum number of transputers we can use for this algorithm.
[Figure omitted: speedup vs. number of transputers for image sizes 16x16 to 256x256.]
Figure (6.7) Speedup graphs for inverse gradient filter.
[Figure omitted: efficiency vs. number of transputers for image sizes 16x16 to 256x256.]
Figure (6.8) Efficiency graphs for inverse gradient filter.
6.4 PARALLEL IMPLEMENTATION OF THE MEAN
AND WEIGHTED MEAN FILTERS
6.4.1 Mean Filter Algorithms
The mean filter is a straightforward spatial-domain technique for image smoothing. Given an M x M image I, the procedure is to generate a smoothed image whose grey level at point (i,j) is obtained by averaging the grey-level values of the pixels of I contained in a predefined neighbourhood of (i,j), as shown in chapter 3.
The size and shape of the window over which the mean is computed can be
selected. For a 3 x 3 window, the filter weights are shown in fig. (6.9).
1/9 1/9 1/9
1/9 1/9 1/9
1/9 1/9 1/9
(A) square window
1/5
1/5 1/5 1/5
1/5
(B) plus shape window
Figure (6.9) Two mean filter masks.
If a 3 x 3 mean filter is used, by applying the plus shaped window shown in fig. (6.9B) to equation (5.2a), we note that the former equation is a special case of the latter with wi = 1/5. From equation (5.2a), the output pixel is obtained by the relation,

y(i,j) = 1/5 [ x(i-1,j) + x(i,j-1) + x(i,j) + x(i,j+1) + x(i+1,j) ]        (6.7)

Also, if we apply the square window shown in fig. (6.9A) to equation (5.3b), the output pixel is obtained by the relation,

y(i,j) = 1/9 [ x(i-1,j-1) + x(i-1,j) + x(i-1,j+1) + x(i,j-1) + x(i,j)
             + x(i,j+1) + x(i+1,j-1) + x(i+1,j) + x(i+1,j+1) ]        (6.8)
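Both windows are plain averages over the masks of fig. (6.9), so a sketch is short (Python, interior pixels only):

```python
PLUS_OFFSETS = [(-1, 0), (0, -1), (0, 0), (0, 1), (1, 0)]

def mean_plus(x, i, j):
    # Equation (6.7): plus shaped window, w = 1/5.
    return sum(x[i + k][j + l] for k, l in PLUS_OFFSETS) / 5.0

def mean_square(x, i, j):
    # Equation (6.8): square window, w = 1/9.
    return sum(x[i + k][j + l]
               for k in (-1, 0, 1) for l in (-1, 0, 1)) / 9.0

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
print(mean_plus(img, 1, 1))    # 5.0  ((2+4+5+6+8)/5)
print(mean_square(img, 1, 1))  # 5.0  (45/9)
```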
6.4.2 Weighted Mean Filter Algorithm
This approach is similar to the mean filter algorithm described in the previous section (6.4.1). The difference is that, in this case, a weighted mean filter is used in which the weight for a pixel is related to its distance from the centre point. For a 3 x 3 window, the filter weights are shown in fig. (6.10). The neighbours that lie closer to the point (i,j) are weighted more heavily than the others.
1/16 1/8 1/16
1/8 1/4 1/8
1/16 1/8 1/16
(A) square window
1/6
1/6 1/3 1/6
1/6
(B) plus shape window
Figure (6.10) Two weighted mean filter masks.
By applying the plus shaped window shown in fig. (6.10B) to equation (5.2a), the output pixel is given by the following equation:

y(i,j) = 1/3 x(i,j) + 1/6 [ x(i-1,j) + x(i,j-1) + x(i,j+1) + x(i+1,j) ]        (6.9)

Similarly, by applying the square window shown in fig. (6.10A) to equation (5.3b), the output pixel is given by the following equation:

y(i,j) = 1/4 x(i,j) + 1/8 [ x(i-1,j) + x(i,j-1) + x(i,j+1) + x(i+1,j) ]
       + 1/16 [ x(i-1,j-1) + x(i-1,j+1) + x(i+1,j-1) + x(i+1,j+1) ]        (6.10)
6.4.3 Systolic Design for Mean Filter
The systolic design for a mean filter is similar to the Laplacian operator designs shown in figs. (5.4) and (5.6). The difference in this case is the value of the kernel weights.
For the plus shaped mean filter, we note by comparing equations (5.2) and (6.7) that the value of the weights of the latter is fixed, with wi = 1/5. If we replace the value of wi in each cell of fig. (5.4) with the new value wi = 1/5, then we have a systolic design for the plus shaped mean filter.
In a similar way a systolic design for the square mean filter can be implemented. We note by comparing equations (5.3) and (6.8) that the value of the weights of the latter is also fixed, with wi = 1/9. By replacing the value of wi in each cell of fig. (5.6) by the new value wi = 1/9, we obtain the new design.
A systolic array design for weighted mean filters can be implemented in a similar way to that in which the mean filters were implemented, as shown above. The value of wi in each cell of figs. (5.4) and (5.6) is replaced by the new values shown in equations (6.9) and (6.10) respectively.
An alternative design for the mean filter is implemented as shown in fig. (6.11). The new design, the unifunctional array, is similar to the previous ones; the number of square cells is the same as before, depending on the size of the mask. We modify the cells by reducing the number of operations inside each cell, so that each performs addition operations only, instead of a multiplication followed by an addition, with the design of the cell shown in fig. (6.12a). We need an additional round cell (at the extreme right end of the array); it performs a multiplication only, to compute the final values of the components of the filter. The design is outlined in
[Figure: host1, square cells 1 to n, a round cell, and host2.]
Figure (6.11) Systolic array for the mean filter.
fig. (6.12b). The image data is pumped into the array by the first host and accumulated as partial products P(i,j) by the square cells. The output at the last square cell for the square window is as follows:

P(i,j) = [ x(i-1,j-1) + x(i-1,j) + x(i-1,j+1) + x(i,j-1) + x(i,j)
         + x(i,j+1) + x(i+1,j-1) + x(i+1,j) + x(i+1,j+1) ]

For the plus shaped window, P(i,j) at the last square cell is as follows:

P(i,j) = [ x(i-1,j) + x(i,j-1) + x(i,j) + x(i,j+1) + x(i+1,j) ]

In the round cell, the final value y(i,j) is computed as

y(i,j) = w P(i,j)

where w is the filter weight. The final results are collected by the second host. The algorithm is repeated on every input data item by all the cells working concurrently.
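The division of labour just described, square cells that only add and a single round cell that multiplies by the weight, can be sketched as follows (the cell functions stand in for the designs of fig. (6.12), using the square 3 x 3 window with w = 1/9):

```python
SQUARE_OFFSETS = [(k, l) for k in (-1, 0, 1) for l in (-1, 0, 1)]

def square_cell(partial, value):
    # Square cell of fig. (6.12a): addition only.
    return partial + value

def round_cell(partial, w):
    # Round cell of fig. (6.12b): a single multiplication by the weight.
    return w * partial

def unifunctional_mean(x, i, j):
    p = 0
    for k, l in SQUARE_OFFSETS:        # one square cell per window element
        p = square_cell(p, x[i + k][j + l])
    return round_cell(p, 1.0 / 9.0)    # w = 1/9 for the square mean filter

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
print(round(unifunctional_mean(img, 1, 1), 6))   # 5.0
```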
We bear in mind that the number of stages of the shift register in the square cells is the same as that of the cells of the previous array.
When comparing the two designs, the old design and the unifunctional array design, each of them has advantages, as follows:
1- The number of cells in the unifunctional array is one more than the number of cells of the previous design.
2- The time cycle for the unifunctional array is larger by one time unit than that for the old array, so the output will be delayed by one time cycle. This is caused by the extra cell.
3- The computation time for each processor of the unifunctional array is less than that for the processor of the previous array. This is because in the former each processor carries out accumulation only (and the last cell only a multiplication), whereas in the latter each processor carries out both additions and multiplications.
[Figure: (a) square cell with adder and shift register; (b) round cell with multiplier.]
Figure (6.12) Cell designs for the array shown in fig. (6.11).
6.4.4 Transputer Network for the Mean and Weighted Mean Filter
These are similar to the transputer networks for both the plus shaped and square Laplacian operators, described in section (5.3.2). The difference in this case is the values of the kernel weights.
The systolic design for the plus shaped mean and weighted mean filters is implemented by replacing the value of wi in each transputer of fig. (5.7) with the new values. In a similar way, transputer networks for the square mean and weighted mean filters can be implemented by replacing the value of wi in each transputer of fig. (5.7) with the new values shown in figures (6.9A) and (6.10A).
6.4.5 Performance of the Mean and Weighted Mean Filter Systolic
Designs on the Transputer Networks
Experiments similar to those performed for the Laplacian operator were
carried out to measure the algorithms performance on a network of T800
transputers. Table (6.3) shows the timing results and associated speedup and
efficiency for the plus shaped main and plus shaped weighted mean filter systolic
design. The method was executed on a transputer network of varying sizes.
A good speedup was obtained for the 256 x 256 image size on a 5 transputers
networks. This result is quite useful, since the need for a parallel system is more
vital for larger images, where processing time is relatively high. Even when smaller
sizes of images are solved on a 5 transputer network, a speedup as higher as ( 4.32)
is obtained as shown m fig. (6.13). The graph for a 256 x 256Image size shows a
decrease in the speedup when the network size is 2. This situation is due to using
an external memory which is slower than on-chip RAM. When the design is
implemented on a network of size 2, the total size of the shift register buffer is
higher than on-chip RAM. The communication overheads of solving the problem
on a network are offset by the overheads of accessing the secondary memory.
Fig. (6.14) shows efficiencies of over 83% for a 2 transputer network, over
79% for a 3 transputer network and over 86% for a 5 transputer network. The major
reason for the drop in efficiency for the 3 transputer network is load imbalance:
some of the transputers contain two cells while one transputer has only one
cell. This unbalanced load increases the elapsed time.
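The speedup and efficiency figures quoted here and tabulated in tables (6.3) and (6.4) follow directly from the elapsed times; a minimal sketch of the definitions used:

```python
# Speedup is the single-transputer time divided by the network time;
# efficiency (in %) is the speedup divided by the number of transputers.

def speedup(t1, tp):
    return t1 / tp

def efficiency(t1, tp, p):
    return 100.0 * speedup(t1, tp) / p

if __name__ == "__main__":
    # 256 x 256 plus shaped mean filter timings from table (6.3).
    t1, t5 = 12.012, 2.550
    print(round(speedup(t1, t5), 2))        # 4.71
    print(round(efficiency(t1, t5, 5), 1))  # approx 94.2, as in table (6.3)
```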
Figure (6.13) Speedup graphs for the plus shaped mean and weighted mean filters.
Figure (6.14) Efficiency graphs for the plus shaped mean and weighted mean
filters.
Image Size    Network Size    Time Lapse (seconds)    Relative Speed-up    Efficiency %

16 x 16            1                0.048                   1.00              100.00
                   2                0.029                   1.66               83.07
                   3                0.020                   2.38               79.47
                   5                0.011                   4.32               86.45

32 x 32            1                0.180                   1.00              100.00
                   2                0.108                   1.67               83.45
                   3                0.075                   2.41               80.30
                   5                0.042                   4.33               86.56

64 x 64            1                0.716                   1.00              100.00
                   2                0.420                   1.70               85.23
                   3                0.291                   2.46               82.09
                   5                0.162                   4.43               88.64

128 x 128          1                2.989                   1.00              100.00
                   2                1.663                   1.797              89.85
                   3                1.151                   2.60               86.58
                   5                0.639                   4.68               93.54

256 x 256          1                12.012                  1.00              100.00
                   2                6.859                   1.75               87.56
                   3                4.601                   2.61               87.03
                   5                2.550                   4.71               94.22

Table (6.3) Timing results for the plus shaped mean and weighted mean filters.
The overall results for the square mean and square weighted mean algorithms
are very impressive. Table (6.4) shows that the speedups and related efficiencies
are high for all sizes of images when the network contains 3 transputers. This
is due to the external memory of the T800 transputer, as explained in section
(6.2.4). The speedup and efficiency graphs are presented in figures (6.15) and
(6.16) respectively.
The graphs show good speedups and efficiencies for the various sizes of image
and various sizes of transputer network.
For any size of image, there is less gain when the network is 9 transputers;
the gain increases as the network size decreases. The reason for the loss in gain is
the proportion of time spent on communication overhead, especially when
the size of the network is relatively high. The communication overhead in this system
increases as the size of the image increases.
Image Size    Network Size    Time Lapse (seconds)    Relative Speed-up    Efficiency %

16 x 16            1                0.093                   1.00              100.00
                   3                0.029                   3.2               106.67
                   5                0.020                   4.59               91.86
                   9                0.011                   8.23               91.40

32 x 32            1                0.348                   1.00              100.00
                   3                0.107                   3.23              107.79
                   5                0.074                   4.47               93.07
                   9                0.0416                  8.38               93.15

64 x 64            1                1.361                   1.00              100.00
                   3                0.420                   3.24              108.11
                   5                0.291                   4.66               93.35
                   9                0.162                   8.43               93.68

128 x 128          1                5.389                   1.00              100.00
                   3                1.660                   3.25              108.12
                   5                1.154                   4.67               93.42
                   9                0.638                   8.43               93.68

256 x 256          1                21.477                  1.00              100.00
                   3                6.649                   3.23              107.66
                   5                4.597                   4.67               93.44
                   9                2.543                   8.45               93.84

Table (6.4) Timing results for the square mean and weighted mean filters.
Figure (6.15) Speedup graphs for square mean and weighted mean filters.
Figure (6.16) Efficiency graphs for square mean and weighted mean filters.
6.5 AN ENVIRONMENT FOR DEVELOPING LOW
LEVEL IMAGE PROCESSING ON PARALLEL
COMPUTERS
6.5.1 Introduction
Developing image processing software systems can be a very time-consuming
process, since it involves a significant amount of experimentation with various
algorithms. Typically there will be several different algorithms available for the
same operation, and the programmer must choose the one which performs best in
that particular environment. For instance, if the required operation is to extract the
edges of an image, then the best way of doing this will depend on how clear the
edges are, on lighting conditions, and on other factors. Thus the programmer needs
to experiment interactively with different algorithms. Even once an algorithm has
been selected, there is a lot of scope for setting parameter values experimentally: for
instance, when thresholding an image, the best threshold value can often only be
chosen by trial and error. In providing a programming workbench for image
processing in which this experimentation can take place conveniently, we can
identify three main requirements:
1- To develop a library of image processing operations, coded to be
parameterised and scaleable.
2- To provide as simple a programming model as possible for users to add
new software components to the library.
3- Many image processing algorithms are computationally intensive, and
therefore significant processing power is required to allow the experimentation
process to be performed quickly. The faster the actual image processing can be
made, the better. One obvious solution to this is to use a parallel processing
machine.
The main aim of the research in this section was to design and build a
programming workbench for developing a low-level image processing software
library on a parallel computer. This has been designed to meet the above requirements.
The PARC-IPL (Parallel Algorithms Research Centre - Image Processing Library)
workbench runs on a workstation which front-ends the Sequent Balance 8000
system. The workbench was simulated on the Balance using the Occam high level
language.
The user can control the execution of the program from the workbench. At the
heart of PARC-IPL is a library of image processing routines and algorithms which
are coded in a generalized format which is not specific to one particular
configuration, either in size or topology.
Our implementation strategy has been to develop the implementation in two
stages. As a first step, a single processor implementation has been produced. This
is now operational, and users can execute programs written in Occam, inputting
images from file. The second implementation stage is to invoke the existing library
on a multiprocessor such as a transputer network, using the same Occam code.
Unfortunately, the second stage has not been finished due to special circumstances.
The workbench provides many facilities, however, which are equally useful to the
developer of image processing software on parallel computers.
In the following section we present some previous research which is related to
our current work. This is followed by an introduction to the workbench. An outline
of the implementation strategy used is also given. This is followed by an outline of
the contents of the library. Finally some possible modifications to the workbench
are presented.
6.5.2 Background
There is at present much interest in implementing image processing
applications on parallel computers, because of the high processing speeds which
can be achieved by parallel processors such as transputer networks.
At Queen's University, Belfast, the computer department has designed and
built a programming workbench for developing image processing software systems
on transputer networks [Crookes et al., 1990 and 1991]. A high-level programming
language for image processing, called IAL (Image Algebra Language), has also been
developed for this purpose.
The main aim of the Queen's University workbench implementation is to partition
the image and distribute one section of the image per transputer. Each transputer
then processes its own section independently and in parallel, i.e. "image
parallelism". However, since many image processing operations operate on
neighbourhoods, problems occur at section boundaries, where a neighbourhood is
physically split across two transputers [Crookes 1990]. They overcome this
problem by adding a special program to the library.
The main aim of implementing our algorithms in the (PARC-IPL) system is
"task parallelism" using systolic designs, as shown in chapters 4 and 5 and the
previous sections. In these systems there is no need to partition the image.
6.5.3 The Workbench
In this section we will present details of the design and construction of the
programming workbench, and describe each of the main components of the
workbench.
The overall workbench consists of three main parts. At the back end of the
workbench is a library of low-level image processing operations, running on a
Sequent Balance 8000 system.
Front-end workstations are connected to the Balance. The workstations
send the programs to the Balance as pseudo-code instructions. Between the user
workstations and the library components is a server which receives the p-code
instructions and makes (remote) calls upon the library. This is illustrated in fig.
(6.17).
Figure (6.17) Workbench overview: the user interface on the workstation
sends commands through the controller to the image processing library on the
Balance 8000.
The workbench can be used at the following two levels:
1- Building a low-level image processing system from previously defined
algorithms held in the library.
2- Implementing new designs for other algorithms and adding them to the
library, also enabling the user interface to incorporate these new components.
6.5.3.1 Software Structure
The library routines which are called by the controller are executed by the
Balance. To make the coding of the library components independent of the underlying
hardware, a three-layered software model was adopted, as shown in fig. (6.18).
These three layers are:
1- The actual library routines themselves (coded as Occam procedures as
shown in the previous sections).
2- A command distributor layer: as some commands from the controller to the
systolic design have to be broadcast across all cells in the array, such a layer
is needed. If the user wishes to apply a one-dimensional or two-dimensional
convolution, then this layer controls the number of cells needed for these routines
(i.e. dependent on the size of the kernel).
3- An interpreter layer, to convert the commands from the controller into actual
calls to the library routines, including the passing of parameters and other
housekeeping.
Figure (6.18) Layered software model (interpreter, distributor, library).
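The three layers can be sketched as plain functions. This is a Python stand-in, not the thesis's Occam code, and the routine names and command fields used below are illustrative assumptions:

```python
# A minimal sketch of the three-layer model: an interpreter turns each
# p-code command into a library call, and a distributor works out how many
# systolic cells the routine needs (here, one cell per kernel weight).

def cells_needed(rows, cols):
    return rows * cols

def distributor(command):
    n = cells_needed(command["kernel_rows"], command["kernel_cols"])
    return {"routine": command["routine"], "cells": n,
            "params": command["params"]}

def interpreter(pcode, library):
    """Convert each p-code instruction into an actual library call."""
    results = []
    for command in pcode:
        plan = distributor(command)
        results.append(library[plan["routine"]](plan))
    return results

if __name__ == "__main__":
    library = {"convolve2d": lambda plan: f"convolve2d on {plan['cells']} cells"}
    pcode = [{"routine": "convolve2d", "kernel_rows": 3, "kernel_cols": 3,
              "params": {"rows": 256, "cols": 256}}]
    print(interpreter(pcode, library))   # ['convolve2d on 9 cells']
```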
6.5.3.2 The User Interface
The workbench supports programs which can be represented as a set of
menus.
Programs are built by selecting algorithms from a menu of currently available
library operations. Fig. (6.19) shows the first page of the screen as soon as the user
logs in to the program.
The workbench provides a menu editor for selecting the routines, and an
execution environment which provides algorithm selection and parameter settings.
It is easy and quick to flick between the editing and execution modes.
This is an OCCAM Program Library Low-Level Image Processing
Filter library
N.B. To exit from the system enter 99
Have your input data at file name
[ image.input]
Figure (6.19) First screen of the menu program.
To use the library, the user first selects appropriate operations from the
library, by selecting the appropriate number from the menu bar, each number
representing a particular algorithm. Once a filter is chosen, the editor shows a few
boxes asking for the parameters of the image, such as the number of columns or the
size of the kernel (if the user chooses a convolution operation). If the editing is
successful then the execution mode of the workbench is entered. An example of a
typical program is shown in fig. (6.20), which shows a simple program for the
plus shaped Laplacian operator.
Type the filter's number = 2
Plus Shaped Laplacian operator No: 2
Give the total number of columns (min. 4)
= 256 Give the total number of rows
(min. 3)
= 256
Figure (6.20) Typical screen presentation for the environment.
6.5.3.3 Execution Mode
Once the editor program has completed, the programmer can then enter the
execution mode, from which the program can be run. It is in the execution mode
that algorithm selection and parameter setting take place.
Once the program run is completed, the name of the output file appears
on the screen. It may become apparent that the settings need to be changed, or the
user may wish to try the same settings with a different algorithm. These alterations
can be performed simply by answering a question, entering the new setting, and
running the same algorithm again (or a different algorithm). The final two pages of
the editor screen are shown in fig. (6.21).
The program is running
Wait for the output
Your output data filename [ image2.out]
Do you want to choose another filter
If yes type 1
If no type 0
Figure (6.21) Final screen presentation of the program.
6.5.4 Implementation
When the user's command is to execute the program, a set of p-code
instructions is produced. These instructions are passed to the controller in the
Balance, where the interpretation stage starts. The controller repeatedly fetches and
executes individual instructions. In most cases, when the instruction is a call to a
library operation, execution merely involves passing the instruction and its
parameters as a command to the systolic array. The cells in the array act in parallel,
as shown in the previous chapters and sections, using the parameters which come
from the controller via the interpreter and the distributor, as shown in fig. (6.22).
All images are held on the systolic array host, which pumps them through the array.
The Occam compiler automatically distributes the algorithms and the parameters
over the array cells. This means that the programmer does not need to be concerned
with the underlying parallelism of the design but can concentrate on the image
processing aspects. The underlying parallelism is effectively hidden from the user
while at the same time being efficiently exploited.
Figure (6.22) System operation: algorithms and parameters pass from the
workstation through the interpreter and distributor to the controller of the
systolic system.
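The host-pumping behaviour described above can be modelled functionally. The sketch below is a sequential Python toy, not the Occam implementation: each "cell" holds one kernel weight and accumulates a partial sum as pixels stream past, which is the essence of the systolic designs of the previous chapters.

```python
# Toy model of executing one library call: the host pumps a neighbourhood's
# pixels through a chain of cells, one kernel weight per cell.

def make_cell(weight):
    """A cell multiplies the passing pixel by its weight and adds the
    partial sum arriving from the previous cell."""
    def cell(pixel, partial):
        return partial + weight * pixel
    return cell

def pump(window, weights):
    """Pump one window of pixels through the cell chain."""
    cells = [make_cell(w) for w in weights]
    partial = 0
    for pixel, cell in zip(window, cells):
        partial = cell(pixel, partial)
    return partial

if __name__ == "__main__":
    # Plus shaped Laplacian weights: centre 4, four neighbours -1.
    weights = [-1, -1, 4, -1, -1]
    print(pump([5, 5, 6, 5, 5], weights))   # 4: response at an intensity spike
    print(pump([7, 7, 7, 7, 7], weights))   # 0: uniform region
```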
6.5.5 Workbench Facilities
The workbench provides a good mechanism for developing low-level image
processing systems on parallel computers without the need for much actual
programming on the user's part. It has been designed with the following factors in mind:
1- Ease of use; it is essential that the workbench is easy to use. The
workbench allows the user to select interactively which algorithm is to be used,
and to reselect a different algorithm for the same image or for a different image.
2- Flexibility; the library components are also set up so that certain of their
parameters can be 'tailored' or modified by the user. For instance, the
threshold value in the sigma filter is precoded in a way which enables the user to
supply a new threshold value interactively.
3- Hidden parallelism; to shield the user from the complexities of explicitly
controlling parallelism, it was desirable to hide as much of the underlying
parallelism and communication as possible.
4- Extendibility; for the workbench to be of significant use, mechanisms must
be provided to allow the library to be expanded by adding new algorithms.
6.5.6 Image Input, Output and Data Types
The workbench provides facilities for input and output of images. A basic
operation for image input and output is provided.
* Image input and output
A set of standard routines is provided, giving the user the ability to:
1- read an image from a particular file;
[image.input]
2- read an image from a particular area;
3- write an image to a particular file;
[write3.out]
4- write an image to a particular area.
* Data types
In the workbench there are three classes of data types: images, kernels and
scalars. Each of these is now outlined briefly.
Images: An image is a one-dimensional or two-dimensional structure. The
image size can be declared by declaring the number of columns and rows of the
image. If the number of columns is 256 and the number of rows is 256 then this
declares two images of size 256 x 256.
Kernels: Similarly, kernels are also 1D or 2D structures. For the invariant
kernel, the size of a kernel can be declared by declaring the number of columns and
rows of the kernel.
Thus, a user specifies the weights associated with a kernel simply by
enumerating their values in a one dimensional series.
Scalars: The declaration of scalar variables takes a similar form to that found
in Occam.
6.5.7 Types of Kernel
A kernel consists of two parts: a configuration, which defines the
neighbourhood over which it is defined, and the weights associated with each element
of the configuration neighbourhood. In addition, there are two types of kernel:
1- invariant kernels: the weights are invariant with the location in the image and
these kernels correspond to the usual concept of masks, and,
2- variant kernels: the weights can vary as a function of image position.
Both types of kernel are included in the workbench. For most of the library
filters, the kernels are already given inside the algorithms. Except for the convolution
operations, a kernel is defined by giving its values in an initial value definition
section. For instance, the square Laplacian kernel is defined in the square Laplacian
algorithm, and we can choose that algorithm directly from the menu. Alternatively,
the user can choose the 2D convolution operation, in which case the square
Laplacian kernel would be defined in the workbench as:

    -1  -1  -1
    -1   8  -1
    -1  -1  -1

The user can choose any size of kernel for the 1D and 2D convolution
operations.
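For example, supplying the square Laplacian to the 2D convolution operation amounts to enumerating its nine weights in a one-dimensional series and reshaping them by the declared kernel size. The sketch below is a Python illustration with hypothetical function names, not the workbench's Occam routines:

```python
# User-defined kernel: weights enumerated as a 1D series, reshaped using
# the declared kernel rows and columns, then applied at one pixel.

def reshape(weights, rows, cols):
    assert len(weights) == rows * cols
    return [weights[r * cols:(r + 1) * cols] for r in range(rows)]

def apply_at(image, kernel, y, x):
    """Apply the kernel centred on pixel (y, x); borders are not handled."""
    kh, kw = len(kernel), len(kernel[0])
    return sum(kernel[j][i] * image[y - kh // 2 + j][x - kw // 2 + i]
               for j in range(kh) for i in range(kw))

if __name__ == "__main__":
    laplacian = reshape([-1, -1, -1, -1, 8, -1, -1, -1, -1], 3, 3)
    flat = [[9] * 3 for _ in range(3)]
    spike = [[9, 9, 9], [9, 10, 9], [9, 9, 9]]
    print(apply_at(flat, laplacian, 1, 1))    # 0: no response on a flat patch
    print(apply_at(spike, laplacian, 1, 1))   # 8: responds to the spike
```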
6.5.8 Content of Library
The following is the range of operations provided in the library as it currently
stands:
1- Neighbourhood filters:
Various low pass, high pass, Laplacian and sigma filters.
2- Edge extraction:
Sobel, Prewitt and gradient operators.
3- Image algebra:
One-dimensional convolution,
Two-dimensional convolution.
The user can define the window configuration and weights.
6.5.9 Extending the Environment
The workbench has been constructed in such a way as to allow it to be
extended by a user with relative ease. If a new algorithm is to be added to the
environment then obviously a modification must be made to the workbench by
adding the new algorithm to the library.
If a new algorithm is to be added to the library, then the programmer must
adhere to the same conventions which have been used in the existing library
components, including the following:
1- The procedure heading must include the same set of system parameters and the
same structure of command packet as do the existing operations;
2- The algorithm code must be written in Occam;
3- The user interface must be modified to allow the new algorithm to be
selected when using the editor.
In addition to adding the new algorithm to the library, a menu description file
must also be created for the algorithm to be controlled from the interface. This
includes acquiring a unique numerical identifier for the routine, and stating its
parameters.
The main Occam code for the library is given in Appendix D.
CHAPTER 7
SYSTOLIC ALGORITHM FOR THE SOLUTION OF TOEPLITZ MATRICES
7.1 INTRODUCTION
Toeplitz matrices have become increasingly important with the rapid growth
of signal and image processing [Sloboda 1989].
It is very important that operations on Toeplitz matrices concerning the
applications mentioned above are performed in a cost effective, reliable and above
all rapid manner. Of course, the way to achieve these objectives is parallel
processing, involving the concurrent use of many simple and reliable processing
elements in the form of VLSI designs, i.e., systolic arrays.
In this chapter a systolic design is suggested for solving Toeplitz matrices.
A special type of Toeplitz matrix is the circulant matrix, which has the form:

          | a0    a1    a2   . . .  an-1 |
          | an-1  a0    a1   . . .  an-2 |
    A  =  |  .     .     .            .  |        (7.1)
          | a1    a2    a3   . . .  a0   |

For such matrices the sum of the elements in a row remains constant. Another
type of Toeplitz matrix is the cyclic banded matrix, which has the form:

          | a1  a2  ..  ar               ar  ..  a2 |
          | a2  a1  ..  ar-1  ar                    |
          |  .                    0               . |
    Ar =  | ar                                   ar |        (7.2)
          |           0                             |
          | a2  ..  ar               ar  ..  a2  a1 |

where the sum of the elements in a row also remains constant.
The inverse of a banded Toeplitz matrix, if it is invertible, can be computed
using the methods given by [Evans 1972] and [Gohberg and Semencul 1972] in
O(n) operations. Toeplitz systems of linear equations arise in many
scientific and engineering applications, and real time operation is often requested.
The well known algorithms of Trench [Trench 1964] and Bareiss [Bareiss 1969]
require O(n^2) operations. The inverse of a circulant Toeplitz matrix is required in the
restoration of images [Gonzales 1992]. An efficient algorithm for the factorisation
of a symmetric circulant banded Toeplitz matrix has been suggested by [Evans
1981].
The solution of Toeplitz systems, such as circulant and skew symmetric systems,
has received much attention from a systolic viewpoint. Kung and Hu [Kung and Hu
1983] and Brent and Luk [Brent and Luk 1984] have suggested parallel algorithms
for Toeplitz linear systems. Further parallel algorithms for solving Toeplitz linear
systems have been suggested by Evans [Evans 1986 and 1989] and Megson
[Megson 1985]. Each of these algorithms is implemented as a systolic design.
The theory of orthogonal polynomials has made it possible to develop
efficient algorithms for smoothing a function of one variable on an equidistant set
of points, which results in Toeplitz systems to be solved. Let us consider N+1
function values F(i) defined on the equidistant set of points 0, 1, ..., N. Let a
function f be approximated on each subset consisting of n+1 points by a polynomial
Pm of order m, where N >> n. Let m be odd and n be even.
Let us denote the running subset of n+1 points by i-n/2, ..., i-1, i, i+1, ...,
i+n/2. Then the smoothed value of f at the mid point i is defined by the value of
S(i) = Pm(i), as shown in Sloboda [Sloboda 1989].
Let m = 1.
Then, with n+1 = 3, we have

    S(i) = (1/3) (f(i-1) + f(i) + f(i+1))                                  (7.3)

With n+1 = 5, we have

    S(i) = (1/5) (f(i-2) + f(i-1) + f(i) + f(i+1) + f(i+2))                (7.4)

With n+1 = 7, we have

    S(i) = (1/7) (f(i-3) + f(i-2) + f(i-1) + f(i) + f(i+1) + f(i+2) + f(i+3))   (7.5)
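The formulas (7.3)-(7.5) are simple moving averages over n+1 points; a minimal sketch, with cyclic indexing so that it also applies to the closed contours discussed next:

```python
# S(i) for window size n+1 (n even), with indices taken mod N so the
# smoothing wraps around a closed sequence of samples.

def smooth(f, i, n):
    half = n // 2
    N = len(f)
    return sum(f[(i + k) % N] for k in range(-half, half + 1)) / (n + 1)

if __name__ == "__main__":
    f = [0, 3, 6, 3, 0, 3]
    print(smooth(f, 1, 2))   # (0 + 3 + 6) / 3 = 3.0
    print(smooth(f, 0, 2))   # wraps: (3 + 0 + 3) / 3 = 2.0
```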
7.1.1 Digital Contour Smoothing
As in image smoothing, the ultimate goal of digital contour smoothing is to
improve a given image in some sense. Digital contour smoothing belongs among the
most important procedures in image processing. This procedure allows us to smooth a
digital contour and to improve the stability of local and global invariants, such as
curvature, which is impossible to calculate without smoothing.
A digital image I is a finite rectangular array whose elements are pixels or
image elements. Each pixel p of I is defined by a pair of Cartesian coordinates (x,y),
which we may take to be integer values. An element or pixel p(x,y) in a digital
picture I has two types of neighbours, i.e.,
1- its four horizontal and vertical neighbours (h,k) such that
    |x - h| + |y - k| = 1
2- its diagonal neighbours (h,k) such that
    |x - h| = |y - k| = 1
A simple closed digital curve in image I is a path Γ = p0, p1, ..., pn
[Rosenfeld 1979] such that
    pi = pj iff i = j
and
    pi is a neighbour of pj iff i = j+1 (mod n+1)
Let
    x = x(t),   y = y(t)                                (7.6)
be a simple closed curve in the 2-dimensional Euclidean space V. Let this curve be
approximated by a set of N elements p1 = (x1, y1), p2 = (x2, y2), ..., pn = (xn, yn),
which are elements of a finite rectangular array I, and let these elements represent a
simple closed 4-connected digital curve for which
    |pi - pi-1| = |xi - xi-1| + |yi - yi-1| = 1         (7.7)
The discretized parametric equation of this digital closed curve has the form

          | x1  y1 |
    R  =  | x2  y2 |        (7.8)
          |  :   : |
          | xn  yn |
The least squares smoothing of a simple closed digital curve is then defined by
the linear operator (1/a) A which is applied to R, i.e. (1/a) A R,
where A is an N x N circulant Toeplitz matrix of the form (7.2) and a is the sum of
all elements in a row. For different values of m and n+1 we obtain, for example,
the following operator, which corresponds to equation (7.3):

                         | 1  1           1 |
                         | 1  1  1    0     |
    (1/a) A2  =  (1/a)   |    1  1  1       |        (7.9)
                         |    0  1  1  1    |
                         | 1           1  1 |
A subset of linear operators defined by an N x N circulant Toeplitz matrix A
which smooth digital closed contours in the least squares sense is suitable for digital
contour approximation, and these operators will be called feasible.
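A feasible operator can be checked numerically by building the circulant matrix from its first row and applying (1/a) A to one coordinate column of R. The sketch below uses the m = 1, n+1 = 3 operator of equation (7.9) with N = 6 and illustrative contour coordinates:

```python
# Build an N x N circulant matrix row by row from its first row, then
# apply (1/a) A to one coordinate column (the x's or the y's of R).

def circulant_row(first_row, shift):
    n = len(first_row)
    return [first_row[(j - shift) % n] for j in range(n)]

def apply_operator(first_row, a, coords):
    n = len(coords)
    rows = [circulant_row(first_row, i) for i in range(n)]
    return [sum(rows[i][j] * coords[j] for j in range(n)) / a
            for i in range(n)]

if __name__ == "__main__":
    first_row = [1, 1, 0, 0, 0, 1]      # tridiagonal circulant, as in (7.9)
    xs = [0, 1, 2, 2, 1, 0]             # x-coordinates of a closed contour
    print(apply_operator(first_row, 3, xs))
```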
In this chapter a factorisation of certain symmetric circulant banded linear
systems proposed by Evans [Evans 1981] is described. It can be shown that a
banded Toeplitz matrix can be factorised into the product of easily inverted
matrices. By using this factorisation, a solution can be derived for such matrices.
A full description of the systolic array design for the algorithm shown above
is given in a later section. The performance of the design is discussed in section
(7.4).
7.2 SOLUTION OF CERTAIN TOEPLITZ SYSTEMS
In the method described by D.J. Evans [Evans 1981] it is shown that the
special banded Toeplitz matrices Ar of semi-bandwidth r can be factorised into the
product of easily inverted matrices, the components of which are a cyclic matrix and
its transpose and a similar circulant banded matrix Ar-1 of order one less. By using
this factorisation, efficient algorithmic solution methods can be derived for the
related linear systems [Evans 1980].
We consider the solution of
    Ar x = d                                            (7.10)
where Ar is a cyclic banded Toeplitz matrix of the form shown in (7.2),
x = (x1, x2, ..., xn) is the unknown vector and d is the known right hand side
vector.
In the following we present the cases r=2 (tridiagonal), r=3 (quindiagonal),
and r >= 4. Also an iterative procedure is suggested and set up to produce a
reverse recursive strategy for the solution in the general case (i.e. Ar, Ar-1, ..., A3, A2).
7.2.1 Tridiagonal Case
We consider the solution of the system
    A2 x = d                                            (7.11)
where A2 is a circulant banded Toeplitz matrix of the following form:

          | a1  a2           a2 |
          | a2  a1  a2    0     |
    A2 =  |       .  .  .       |        (7.12)
          |    0    a2  a1  a2  |
          | a2          a2  a1  |

Now A2 can be factorised into
    A2 = B2 A1 B2^T                                     (7.13)
where B2^T is the transpose of the matrix B2, with

          | 1  b2            |
          |    1  b2    0    |
    B2 =  |       .  .       |        (7.14)
          |    0     1  b2   |
          | b2           1   |

and A1 is the following diagonal matrix:

          | c1              |
          |    c1     0     |
    A1 =  |       .         |        (7.15)
          |    0    .       |
          |            c1   |

Then by carrying out the required multiplications of the matrices (7.14) and
(7.15) and equating this to equation (7.13), we get

          | (c1+c1 b2^2)  (c1 b2)                       (c1 b2)      |
          | (c1 b2)  (c1+c1 b2^2)  (c1 b2)        0                  |
    A2 =  |                  .  .  .                                 |   (7.16)
          |      0          (c1 b2)  (c1+c1 b2^2)       (c1 b2)      |
          | (c1 b2)                  (c1 b2)       (c1+c1 b2^2)      |

Equating the terms of these matrices, we obtain the following relationships
between the elements a1, a2 of A2 and the elements c1, b2 of A1 and B2
respectively, i.e.
    a1 = c1 (1 + b2^2)                                  (7.17)
    a2 = b2 c1                                          (7.18)
Solving these two equations for b2 gives the quadratic equation
    b2^2 - (a1/a2) b2 + 1 = 0
which yields
    b2 = (h ± sqrt(h^2 - 4)) / 2,   where h = a1/a2     (7.19)
The element c1 is determined from equation (7.18), i.e.
    c1 = a2 / b2
Equation (7.13) can then be solved by the following solution process:
    A2 x = B2 A1 B2^T x = d                             (7.20)
which, by using the intermediate values v, z, can be carried out in 3 computational
steps, i.e.
    B2 v = d                                            (7.21)
    A1 z = v                                            (7.22)
    B2^T x = z                                          (7.23)
As we can see from equations (7.21 to 7.23) above, we have to solve three
different forms of linear systems. The method of solution of these systems will be
presented in the following paragraphs [ Evans 1980].
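Before deriving the recursions, the factorisation itself can be checked numerically. The sketch below is a Python illustration with assumed coefficients a1 = 4, a2 = 1 (chosen so that h^2 > 4 and a real root exists); it builds B2 and A1 from equations (7.18)-(7.19) and verifies that their product reproduces A2:

```python
import math

# Factorise the circulant tridiagonal A2 into B2 A1 B2^T, taking the
# quadratic root with |b2| < 1 for stability of the later recursions.

def factorise(a1, a2):
    h = a1 / a2
    b2 = (h - math.sqrt(h * h - 4)) / 2     # smaller root, |b2| < 1
    c1 = a2 / b2
    return b2, c1

def build_A2(a1, a2, n):
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = a1
        A[i][(i + 1) % n] = a2
        A[i][(i - 1) % n] = a2
    return A

def build_B2(b2, n):
    B = [[0.0] * n for _ in range(n)]
    for i in range(n):
        B[i][i] = 1.0
        B[i][(i + 1) % n] = b2              # wrap puts b2 in the corner
    return B

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

if __name__ == "__main__":
    a1, a2, n = 4.0, 1.0, 5
    b2, c1 = factorise(a1, a2)
    B2 = build_B2(b2, n)
    A1 = [[c1 if i == j else 0.0 for j in range(n)] for i in range(n)]
    B2T = [list(row) for row in zip(*B2)]
    P = matmul(matmul(B2, A1), B2T)
    A2 = build_A2(a1, a2, n)
    err = max(abs(P[i][j] - A2[i][j]) for i in range(n) for j in range(n))
    print(err < 1e-12)   # True: the factorisation reproduces A2
```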
From equation (7.21) we can write

    | 1  b2            | | v1   |   | d1   |
    |    1  b2    0    | | v2   |   | d2   |
    |       .  .       | |  :   | = |  :   |        (7.24)
    |    0     1  b2   | | vn-1 |   | dn-1 |
    | b2           1   | | vn   |   | dn   |

from which we can obtain an equation relating v1 and vn with dn:
    b2 v1 + vn = dn                                     (7.25)
Also from (7.24) we get a recursive sequence,
    vi = di - b2 vi+1        (i = n-1, ..., 1)          (7.26)
vn-1 can then be determined from equation (7.26) and, using equation (7.25),
we get
    vn = dn - b2 v1                                     (7.27)
A similar process applied to equation (7.26) in the same manner, commencing
on each equation in turn, yields a sequence of expressions. The value of v1 is
determined from the final expression in the form
    v1 = d1 - b2 d2 + b2^2 d3 - b2^3 d4 + ... + (-b2)^(n-1) dn + (-b2)^n v1
Then
    v1 = (d1 - b2 d2 + b2^2 d3 - b2^3 d4 + ... + (-b2)^(n-1) dn) / (1 - (-b2)^n)    (7.28)
A backward substitution process of equation (7.26) then yields the components of v.
The solution of equation (7.22) is straightforward since A1 is a diagonal
matrix.
Finally, the solution of equation (7.23) is similar to the solution of
equation (7.21). From the linear system (7.23) we have the following expressions
for x1, x2, ..., xn in the form
    x1 = z1 - b2 xn                                     (7.29)
and
    xi = zi - b2 xi-1        (i = 2, ..., n)            (7.30)
Then
    xi-1 = (zi - xi) / b2    (i = n, n-1, ..., 2)       (7.30a)
From equations (7.29) and (7.30) we can obtain an equation relating x2 and xn
with the zi:
    x2 = z2 - b2 z1 + b2^2 xn                           (7.31)
A similar process applied to equation (7.30) in the same manner, commencing
on the third, fourth, ..., and nth equations, creates a recursive sequence of
equations. Substitution of x1, x2, ..., xn-2 and xn-1 in terms of xn and the zi into the
final equation of (7.30) yields the final expression for xn as
    xn = zn - b2 zn-1 + b2^2 zn-2 - ... + (-b2)^(n-1) z1 + (-b2)^n xn
Then
    xn = (zn - b2 zn-1 + b2^2 zn-2 - ... + (-b2)^(n-1) z1) / (1 - (-b2)^n)          (7.32)
Finally the vector x is determined from the forward recursive sequence of
equations (7.30), or backward from equation (7.30a).
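The three steps (7.21)-(7.23), together with the closed forms (7.28) and (7.32), give the following sketch of the complete tridiagonal solver. The coefficients are illustrative; any choice with |a1| > 2|a2| keeps the quadratic root real:

```python
import math

# Solve A2 x = d via B2 v = d, A1 z = v, B2^T x = z, using the closed
# forms for the wrap-around unknowns v1 and xn and the recursions
# (7.26) and (7.30) for the remaining components.

def solve_A2(a1, a2, d):
    n = len(d)
    h = a1 / a2
    b2 = (h - math.sqrt(h * h - 4)) / 2
    c1 = a2 / b2
    # Step 1: B2 v = d.
    v = [0.0] * n
    v[0] = sum((-b2) ** i * d[i] for i in range(n)) / (1 - (-b2) ** n)
    v[n - 1] = d[n - 1] - b2 * v[0]                 # equation (7.25)
    for i in range(n - 2, 0, -1):                   # equation (7.26)
        v[i] = d[i] - b2 * v[i + 1]
    # Step 2: A1 z = v (diagonal).
    z = [vi / c1 for vi in v]
    # Step 3: B2^T x = z.
    x = [0.0] * n
    x[n - 1] = sum((-b2) ** i * z[n - 1 - i] for i in range(n)) / (1 - (-b2) ** n)
    x[0] = z[0] - b2 * x[n - 1]                     # equation (7.29)
    for i in range(1, n - 1):                       # equation (7.30)
        x[i] = z[i] - b2 * x[i - 1]
    return x

if __name__ == "__main__":
    # Row sums of A2 are a1 + 2 a2 = 6, so d = 6s gives x = 1s.
    x = solve_A2(4.0, 1.0, [6.0] * 5)
    print([round(xi, 6) for xi in x])   # [1.0, 1.0, 1.0, 1.0, 1.0]
```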
7.2.2 Quindiagonal Case
We consider a similar solution of the system
    A3 x = d
where

          | a1  a2  a3          a3  a2 |
          | a2  a1  a2  a3   0      a3 |
    A3 =  | a3  a2  a1  a2  a3         |        (7.33)
          |         .  .  .            |
          | a3      0       a2  a1  a2 |
          | a2  a3          a3  a2  a1 |

Similarly the factorisation of the above can be obtained in the form
    A3 = B3 A2 B3^T                                     (7.34)
where B3^T is the transpose of the matrix B3, and

          | 1  b3            |
          |    1  b3    0    |
    B3 =  |       .  .       |        (7.35a)
          |    0     1  b3   |
          | b3           1   |

and

          | c1  c2           c2 |
          | c2  c1  c2    0     |
    A2 =  |       .  .  .       |        (7.35b)
          |    0    c2  c1  c2  |
          | c2          c2  c1  |

Then by carrying out the required multiplications of the matrices (7.35a) and
(7.35b), the elements of the matrix A3 can be found by equating terms of the system
(7.34) to yield the relationships:
    a1 = (c1 + b3 c2) + (c2 + b3 c1) b3                 (7.36)
    a2 = c2 + b3 (c1 + b3 c2)                           (7.37)
    a3 = b3 c2                                          (7.38)
By using equation (7.38) and eliminating c2 from equation (7.37), we obtain
    a2 = a3/b3 + b3 ((a1 - 2a3)/(1 + b3^2) + a3)
Then
    a2 b3 (1 + b3^2) = a3 (1 + b3^2)^2 + b3^2 (a1 - 2a3)
and
    a3 b3^4 - a2 b3^3 + a1 b3^2 - a2 b3 + a3 = 0        (7.39)
which is a quartic equation for the determination of b3.
Once b3 is obtained, the values of c2 and c1 can be easily obtained from
equations (7.36), (7.37) and (7.38) as follows.
Multiplying equation (7.36) by (1 + b3^2) and equation (7.37) by (-2b3), we get
    a1 (1 + b3^2) = c1 (1 + b3^2)^2 + 2b3 c2 (1 + b3^2)
    -2b3 a2 = -2b3 c2 (1 + b3^2) - 2c1 b3^2
Adding the above two equations yields
    c1 = (a1 (1 + b3^2) - 2b3 a2) / (1 + b3^4)          (7.40a)
or, substituting equation (7.38) in equation (7.36), yields the following expression
for c1, i.e.
    c1 = (a1 - 2a3) / (1 + b3^2)                        (7.40b)
Then we can get c2 from equation (7.37):
    c2 = (a2 - b3 c1) / (1 + b3^2)                      (7.40c)
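In practice the quartic (7.39) is solved numerically. The sketch below uses Newton's method from a small positive starting guess (an assumption; the root with |b3| < 1 is the one retained for stable recursions), then recovers c1 and c2 from (7.40b) and (7.40c) and checks the relationships (7.36)-(7.38). The coefficients a1, a2, a3 are illustrative, constructed so that b3 = 0.2 is the exact root:

```python
# Newton's method on q(b) = a3 b^4 - a2 b^3 + a1 b^2 - a2 b + a3,
# followed by the back-substitutions (7.40b) and (7.40c).

def quindiagonal_factors(a1, a2, a3, b3=0.1, iters=60):
    for _ in range(iters):
        q = a3 * b3 ** 4 - a2 * b3 ** 3 + a1 * b3 ** 2 - a2 * b3 + a3
        dq = 4 * a3 * b3 ** 3 - 3 * a2 * b3 ** 2 + 2 * a1 * b3 - a2
        b3 -= q / dq
    c1 = (a1 - 2 * a3) / (1 + b3 ** 2)          # equation (7.40b)
    c2 = (a2 - b3 * c1) / (1 + b3 ** 2)         # equation (7.40c)
    return b3, c1, c2

if __name__ == "__main__":
    a1, a2, a3 = 4.56, 1.84, 0.2                # constructed so b3 = 0.2
    b3, c1, c2 = quindiagonal_factors(a1, a2, a3)
    print(round(b3, 6), round(c1, 6), round(c2, 6))   # 0.2 4.0 1.0
    # The relationships (7.36)-(7.38) are satisfied:
    print(abs((c1 + b3 * c2) + (c2 + b3 * c1) * b3 - a1) < 1e-9)
    print(abs(c2 + b3 * (c1 + b3 * c2) - a2) < 1e-9)
    print(abs(b3 * c2 - a3) < 1e-9)
```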
As b3, c2 and c1 are obtained, the linear system (7.34) can now be solved by
substituting equation (7.13) in equation (7.34); we obtain
    A3 x = B3 B2 A1 B2^T B3^T x = d
This equation can be solved by the following process, using intermediate
solution vectors:
    B3 u = d                                            (7.41a)
    B2 v = u                                            (7.41b)
    A1 y = v                                            (7.41c)
    B2^T z = y                                          (7.41d)
    B3^T x = z                                          (7.41e)
We determine u1 and v1 from the linear systems (7.41a) and (7.41b) in a
similar way to obtaining equation (7.28); then
    u1 = (d1 - b3 d2 + b3^2 d3 - ... + (-b3)^(n-1) dn) / (1 - (-b3)^n)      (7.42a)
    v1 = (u1 - b2 u2 + b2^2 u3 - ... + (-b2)^(n-1) un) / (1 - (-b2)^n)      (7.42b)
Then we determine the values of un and vn from
    un = dn - b3 u1,     vn = un - b2 v1
The values of ui and vi can be obtained from a backward substitution process
of the following equations:
    ui = di - b3 ui+1        (i = n-1, ..., 1)          (7.43a)
    vi = ui - b2 vi+1        (i = n-1, ..., 1)          (7.43b)
The solution of equation (7.41c) is straightforward since A1 is a diagonal
matrix.
Also we can determine zn and xn from the linear systems (7.41d) and (7.41e),
in a similar way to obtaining equation (7.32); then
    zn = (yn - b2 yn-1 + b2^2 yn-2 - ... + (-b2)^(n-1) y1) / (1 - (-b2)^n)  (7.44a)
    xn = (zn - b3 zn-1 + b3^2 zn-2 - ... + (-b3)^(n-1) z1) / (1 - (-b3)^n)  (7.44b)
Finally the values of zi and xi can be obtained from a forward substitution
process of the following equations:
    zi = yi - b2 zi-1        (i = 2, ..., n)            (7.45a)
    xi = zi - b3 xi-1        (i = 2, ..., n)            (7.45b)
or backward from
    zi-1 = (yi - zi) / b2    (i = n, n-1, ..., 2)       (7.45c)
    xi-1 = (zi - xi) / b3    (i = n, n-1, ..., 2)       (7.45d)
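The five steps (7.41a)-(7.41e) reduce to two forward and two backward bidiagonal sweeps around a diagonal scaling, each sweep using the closed form for its wrap-around unknown. The sketch below assumes the factor values b3, b2, c1 are already known (here taken from the illustrative factorisations above):

```python
import math

# Five-step solution of A3 x = d using generic circulant bidiagonal solvers.

def solve_upper(b, d):
    """Solve B v = d: 1s on the diagonal, b on the superdiagonal and in
    the bottom-left corner."""
    n = len(d)
    v = [0.0] * n
    v[0] = sum((-b) ** i * d[i] for i in range(n)) / (1 - (-b) ** n)
    v[n - 1] = d[n - 1] - b * v[0]
    for i in range(n - 2, 0, -1):
        v[i] = d[i] - b * v[i + 1]
    return v

def solve_lower(b, z):
    """Solve B^T x = z: 1s on the diagonal, b on the subdiagonal and in
    the top-right corner."""
    n = len(z)
    x = [0.0] * n
    x[n - 1] = sum((-b) ** i * z[n - 1 - i] for i in range(n)) / (1 - (-b) ** n)
    x[0] = z[0] - b * x[n - 1]
    for i in range(1, n - 1):
        x[i] = z[i] - b * x[i - 1]
    return x

def solve_A3(b3, b2, c1, d):
    u = solve_upper(b3, d)          # (7.41a)
    v = solve_upper(b2, u)          # (7.41b)
    y = [vi / c1 for vi in v]       # (7.41c), A1 is diagonal
    z = solve_lower(b2, y)          # (7.41d)
    return solve_lower(b3, z)       # (7.41e)

if __name__ == "__main__":
    # b3 = 0.2, with the A2 factors of the tridiagonal example
    # (a1 = 4, a2 = 1 gives b2 = 2 - sqrt(3), c1 = 1/b2).
    b2 = 2 - math.sqrt(3)
    x = solve_A3(0.2, b2, 1 / b2, [8.64] * 6)
    print([round(xi, 6) for xi in x])   # all 1.0: the A3 row sums are 8.64
```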
7.2.3 The General Case
So far we have solved the tridiagonal and quindiagonal cases for banded
Toeplitz matrices. For any semi-bandwidth r the layout of the banded Toeplitz
matrix remains the same:
    Ar x = d
with

          | a1  a2  ..  ar               ar  ..  a2 |
          | a2  a1  ..  ar-1  ar                    |
          |  .                    0               . |
    Ar =  | ar                                   ar |        (7.46)
          |           0                             |
          | a2  ..  ar               ar  ..  a2  a1 |

The factorisation of the above form is
    Ar = Br Ar-1 Br^T                                   (7.47)
where Br^T is the transpose of the matrix Br, and

          | 1  br            |
          |    1  br    0    |
    Br =  |       .  .       |        (7.48a)
          |    0     1  br   |
          | br           1   |
229
and Ar-1 is a matrix of the following form: the symmetric circulant banded
matrix of semi-bandwidth r-1 whose first row is
(c1, c2, ..., cr-1, 0, ..., 0, cr-1, ..., c2).    (7.48b)
By equating terms in the matrix multiplication, we can evaluate the unknown
terms in an algorithmic procedure with a pattern that can be determined from the
previous cases. Hence, we obtain the following relationships:
a1 = (c1 + br c2) + br (c2 + br c1)    (7.49a)
a2 = (c2 + br c1) + br (c3 + br c2)    (7.49b)
. . .
ar-3 = (cr-3 + br cr-4) + br (cr-2 + br cr-3)    (7.49c)
ar-2 = (cr-2 + br cr-3) + br (cr-1 + br cr-2)    (7.49d)
ar-1 = (cr-1 + br cr-2) + br^2 cr-1    (7.49e)
ar = br cr-1    (7.49f)
As an example of the general case for the Toeplitz matrix, we choose r = 4.
Then, equations (7.49a to 7.49f) can be written as,
a1 = (c1 + b4 c2) + b4 (c2 + b4 c1)    (7.50a)
a2 = (c2 + b4 c1) + b4 (c3 + b4 c2)    (7.50b)
a3 = (c3 + b4 c2) + b4^2 c3    (7.50c)
a4 = b4 c3    (7.50d)
The values of c3, c2 and c1 can be obtained from the above equations (7.50a -
7.50d) as follows.
Subtracting equation (7.50d) from equation (7.50b) yields,
(a2 - a4) = c2 (1 + b4^2) + b4 c1    (7.50e)
Now multiplying equation (7.50a) by (1 + b4^2) and equation (7.50e) by
(-2 b4), and adding them together, gives,
c1 = {a1 (1 + b4^2) - 2 b4 (a2 - a4)} / (1 + b4^4)    (7.51a)
Then we multiply equation (7.50e) by (1 + b4^2) and equation (7.50a) by
(-b4), and add them together, to get
c2 = {(a2 - a4) (1 + b4^2) - b4 a1} / (1 + b4^4)    (7.51b)
From equation (7.50c), we can obtain c3,
c3 = {a3 - b4 c2} / (1 + b4^2)    (7.51c)
Now these relationships appear too complicated to seek an exact solution, so
an alternative is obtained as follows.
First we guess a value of b4 (for stability we choose b4 < 1); then we determine
the values of c1, c2 and c3 from equations (7.51a, 7.51b and 7.51c) respectively,
and a new value of b4 is obtained from equation (7.50d). Then we check for
convergence of the value of b4: if the difference in b4 is greater than a specified
level of accuracy (i.e. > 0.000001), then we repeat the previous operation, using
the new value of b4. For any r the layout of the algorithm suggested remains the
same, it is only the number of steps that changes. In general, for a circular banded
matrix Ar, we need to obtain r-1 values of c from r equations.
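The guess-and-iterate scheme for r = 4 can be sketched as follows (ours, not the thesis's Occam). The b4 update is taken from the relation a4 = b4 c3 of equation (7.50d), since it is linear in b4; the starting guess, tolerance and test entries (generated from a known factorisation b4 = 0.25, c1 = 1.25, c2 = 0.5, c3 = 0.2) are illustrative assumptions.

```python
def factorise_r4(a1, a2, a3, a4, tol=1e-6, max_iter=200):
    """Iteratively determine b4, c1, c2, c3 for a circulant banded
    Toeplitz matrix of semi-bandwidth r = 4, equations (7.50)-(7.51)."""
    b4 = 0.1                                   # initial guess, b4 < 1
    for _ in range(max_iter):
        s, q = 1 + b4 ** 2, 1 + b4 ** 4
        c1 = (a1 * s - 2 * b4 * (a2 - a4)) / q      # (7.51a)
        c2 = ((a2 - a4) * s - b4 * a1) / q          # (7.51b)
        c3 = (a3 - b4 * c2) / s                     # (7.51c)
        b4_new = a4 / c3                            # update from (7.50d)
        if abs(b4_new - b4) < tol:                  # convergence check
            return b4_new, c1, c2, c3
        b4 = b4_new
    raise RuntimeError("b4 iteration did not converge")

# Entries built from the known factorisation b4=0.25, c1=1.25,
# c2=0.5, c3=0.2; the iteration should recover these values.
print(factorise_r4(1.578125, 0.89375, 0.3375, 0.05))
```

In this sketch the iteration converges linearly, each step roughly halving the error in b4, which matches the small fixed number of sweeps needed in practice.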
According to the factorisation method, the linear system (7.46) can now be
factorised recursively, using equation (7.47), as
Ar = Br Br-1 ... B2 A1 B2^T ... Br-1^T Br^T    (7.52)
so that the system to be solved becomes
Br Br-1 ... B2 A1 B2^T ... Br-1^T Br^T x = d    (7.53)
As shown in the previous sections, equation (7.53) can be solved by the
following process and by using a series of intermediate vectors, i.e.,
Br u = d,  Br-1 v = u,  ...,  B2 s = h    (7.54a)
A1 y = s    (7.54b)
B2^T z = y,  ...,  Br-1^T p = q,  Br^T x = p    (7.54c)
Determine the values of u1, v1, ....., s1 from the systems (7.54a), and
determine the values of zn, ....., pn, xn from the linear systems (7.54c), in a similar
way to obtaining equation (7.28).
Then we determine the values of un, vn, ....., sn from
un = dn - br u1,  .....,  sn = hn - b2 s1
The values of ui, vi, ....., si can be obtained from a backward substitution of the
following equations:
ui = di - br ui+1    (i = n-1, ...... ,1)
vi = ui - br-1 vi+1    (i = n-1, ...... ,1)    (7.55)
. . .
si = hi - b2 si+1    (i = n-1, ...... ,1)
The values of zi, ....., pi and xi can be obtained from a forward substitution
process of the following equations:
zi = yi - b2 zi-1    (i = 2, ..... ,n)
. . .
pi = qi - br-1 pi-1    (i = 2, ..... ,n)    (7.56a)
xi = pi - br xi-1    (i = 2, ..... ,n)
or, equivalently, working backwards,
zi-1 = (yi - zi) / b2    (i = n, n-1, ..... ,2)
. . .
pi-1 = (qi - pi) / br-1    (i = n, n-1, ..... ,2)    (7.56b)
xi-1 = (pi - xi) / br    (i = n, n-1, ..... ,2)
7.3 SYSTOLIC ARRAY IMPLEMENTATION FOR
TOEPLITZ MATRICES
There are several systolic arrays proposed for solving Toeplitz matrices
known in the literature. Kung and Hu [Kung and Hu 1983] and Brent and Luk
[Brent and Luk 1983] have suggested systolic designs for solving Toeplitz
systems. Further parallel algorithms for solving Toeplitz linear systems have been
suggested by Evans [Evans 1986 and 1989] and Megson [Megson 1985]. Each of
these algorithms is implemented as a systolic design using Occam.
This section introduces new systolic array designs for the implementation of
the Toeplitz matrix algorithms described in the previous section. First, a systolic
array design is implemented for the tridiagonal case, which is then extended to
solve the quindiagonal case. Finally, the design is further extended to handle the
general case of banded Toeplitz matrices. One of the main objectives of the design
is the minimisation of the number of hardware components. The original design and
the extensions are fully described in this section.
7.3.1 Systolic Array Design for Tridiagonal Case
The algorithm for the tridiagonal case of banded Toeplitz matrices described
in section (7.2.1) consists of a three-stage process, as shown in fig. (7.1). These
stages are as follows:
i- To solve the Toeplitz system B2 v = d in equation (7.21), which consists of
two substages:
a- To determine the value of v1 from equation (7.28).
b- The computation of vi from equation (7.26).
ii- The second stage is to solve the system (7.22) to get the values of zi.
Figure (7.1) Toeplitz matrix algorithm - structure of tridiagonal case:
input data; solve the Toeplitz system (7.21) (determine v1, then the vector vi);
solve the diagonal matrix A1 (7.22); solve the transposed Toeplitz matrix system
(7.23) (determine xn, then the vector xi); output.
iii- The final stage is the solution of the transposed Toeplitz system B2^T x = z
(7.23). In this stage we can determine the values of xi in the following two
substages.
a- Obtain the value of xn from equation (7.32).
b- Determine the values of the vector x from equation (7.30).
The successive blocks are themselves systolic arrays and data is piped
between them. In order to illustrate the property of the systolic design, the design is
discussed in more detail in the following way:
The systolic array for the general tridiagonal form given in fig.(7.2) is a
special connected structure taking advantage of the system (7.13) to reduce array
inputs.
The array consists of five cells, each cell representing one of the stages or
substages of the diagram shown in fig. (7.1). The first two cells compute the
solution of the linear system B2 v = d (7.21), and the last two cells compute the
solution of the linear system B2^T x = z (7.23), while the middle cell is used to
solve the system A1 z = v (7.22).
The value of b2 is initially stored in each of the cells, except the middle cell
where c1 is stored. Then, the values of di, i = n, n-1, ...., 1, are input from the host to
the first cell (in a backward sequence). The data and results are pumped through the
array and the final results are sent back to the host. Only cells on the array boundaries
are permitted to communicate with the host, and each of the cells communicates
with its left and right neighbouring cells only.
Starting with the first processing element, the value of v1 is computed by
applying equation (7.28); the input data di are pumped into a shift register, the
number of stages of which should be n (where n is the size of the input
vector d). The main purpose of the shift register is to delay the input data until the
Figure (7.2) Tridiagonal Toeplitz matrix systolic array design (double-sided):
the host pumps di into cell 1, data and results flow through cells 1 to 5, and xi is
returned to the host.
value of v1 is computed. Once v1 is ready, both v1 and dn are pumped to the next
cell at the same time, followed by sending the input vector d. The structure of cell 1
is shown in fig. (7.3a), and the Occam code running in this cell takes the following
form:
--- PROC delay
PROC cellA (CHAN ..........)
  --- declaration of local channels
  SEQ
    SEQ i = [0 FOR n]
      SEQ
        din ! d
        dout ? di
        v := v + (a * d)
        a := a / b
    v := v / (1 - (c * b)):

PROC main
  PAR
    cellA
    PAR i = [0 FOR 2]
      delay:
Cell 2 computes the backward recursive sequence of equation (7.26) to produce
vi, which is piped into the next cell at each cycle. Cell 3 also generates a sequence
of output data by applying equation (7.22), which is piped into cell 4.
Cells 2, 3 and 4 are overlapped in the computation process. Cell 4 continually
calculates the partial products of xn as given in equation (7.32). Although cell 4
works all the time, its results are only valid after n cycles, so cell 5 ignores the
results until they become valid. Then xn and zn are pumped to the final cell on the
right side, where xn-1 is calculated according to equation (7.30a). This backward
substitution is repeated n-1 times to produce the xi values, i = n-1, ....., 2, 1, and
thus the tridiagonal Toeplitz system (7.11) is solved.
Figure (7.3) Cell structures for the systolic design shown in fig. (7.2):
a- cell A represents cells 1 and 4; b- cell B represents cells 2 and 5.
The main structure of cell 4 is similar to the structure of cell 1, whilst the main
structures of cells 2, 3 and 5 are the same, as shown in fig. (7.3b). The Occam code
running on these cells (i.e. cells 2 and 5) takes the following form:
PROC cellB (CHAN .........)
  --- declaration of local channels
  SEQ
    vin ? vi
    SEQ i = [0 FOR n]
      SEQ
        din ? di
        -- calculate vi or zi or xi
        vout ! vi:
Fig. (7.4) illustrates a series of snapshots of an example where n = 6 (the size of
the vector). There are four snapshots of the tridiagonal system computation shown
here; the indices of the d, v, and x sequences appear in each relevant stage of the cell.
We assume that the computation starts at time zero.
At time zero, dn enters the array and v1 is calculated in cell 1. Then, at time 5
the last part of v1 is computed at cell 1 and pumped together with dn to the next cell,
as shown in fig. (7.4a). The second snapshot illustrates the state of all the cells at
time = 8, where d4 meets vi to produce v4 at cell 2. At the same time, v5 will be at
the middle cell to produce z5. At this time z6 is piped from cell 3, to produce
another partial result of xn. At time = 11, d1 will be at cell 2, v2 at cell 3 to produce
z2, and z3 at cell 4 to produce another partial result of xn, as shown in fig. (7.4c). At
time = 13 the final partial result of xn is calculated at cell 4. The last snapshot in this
figure illustrates the array at time = 15, where the first value of x (i.e. x6) is
produced. Also at this time, the next value of x (i.e. x5) is produced at cell 5.
Figure (7.4) Snapshots of the execution of the tridiagonal Toeplitz matrix
systolic array (a- time = 6; b- time = 8; c- time = 11; d- time = 15).
Let us apply equation (7.32) to our example; then the first valid result will be:
xn = (z6 - b2 z5 + b2^2 z4 - b2^3 z3 + b2^4 z2 - b2^5 z1) / (1 - b2^6)
while the other values of x are calculated from equation (7.30a) by the following:
x5 = (z6 - x6) / b2
x4 = (z5 - x5) / b2
x3 = (z4 - x4) / b2
x2 = (z3 - x3) / b2
x1 = (z2 - x2) / b2
The final value is computed at time = 20.
The main Occam code to run the tridiagonal matrix systolic system is as
follows:
--- PROC host
--- PROC cell A
--- PROC cell B
--- PROC cell C
PROC main.system (CHAN ..........)
  --- declaration of local channels
  PAR
    host
    cell A  -- cell 1
    cell B  -- cell 2
    cell C  -- cell 3
    cell A  -- cell 4
    cell B  -- cell 5:
The double-sided systolic array described previously can be improved and
implemented as a single-sided design, as shown in fig. (7.5). We modify the
design by reducing the number of cells to three cells only. The operations of cells 2
and 3 are similar to those in the previous design, with the third cell sending its output
back to cell 1; then cells 1 and 2 repeat these operations again to solve system (7.23).
The final value of xi is calculated at cell 2 and pumped to the host. The system runs
in two cycles: the first cycle is to solve systems (7.21 and 7.22), and the second
cycle is to solve system (7.23).
Figure (7.5) Single-sided systolic array design for tridiagonal Toeplitz
matrix.
In the second cycle a further problem arises, for in order to solve system
(7.23), a different set of equations to that used previously in solving system (7.21)
must be solved, due to the folding strategy applied. To overcome this problem we
suggest the following two solution strategies. We can,
1- Either make the cells generic in the sense that B2 and B2^T can be
computed by the same cells. Thus, each cell will have two sets of equations,
associated with a tag bit s such that
tag s = 0 when the system runs the first cycle
tag s = 1 when the system runs the second cycle
Now cells 1 and 2 check the tag bit: if it is 0, then they solve system (7.21);
otherwise, if the tag is 1, then the cell performs the solution of system (7.23). Cell
2 also needs a switch to know when to pump the output to cell 3 (i.e. in the first
cycle), or to pump the output back to the host in the second cycle. We can do that
by using the same tag s:
tag s = 0 pumps the output to cell 3
tag s = 1 pumps the output to the host
2- Or we can store the output data of cell 3 in an array inside cell 3 and then
pump them to cell 1 in reverse order, by using a LIFO (Last In First Out)
procedure. Then we can use the same equations in each cell for both cycles.
By choosing the second solution we increase the computation time by n
steps, while the first solution increases the cell area.
The choice of one of the two solutions above depends on whether we are
interested in computation speed inside or outside the cells.
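The first strategy can be illustrated with a small sketch (ours, not the thesis's Occam): a generic cell selects between the two equation sets on the tag bit, using the backward-substitution form of (7.26) in the first cycle and the division form of (7.30a) in the second.

```python
def generic_cell(tag, b, prev, cur):
    """Hypothetical generic cell for the single-sided array:
    tag 0 selects the form used when solving B2 v = d in the first
    cycle, tag 1 the form used when solving B2^T x = z in the second."""
    if tag == 0:
        return cur - b * prev        # v_i = d_i - b2 * v_(i+1)
    return (cur - prev) / b          # x_(i-1) = (z_i - x_i) / b2

print(generic_cell(0, 0.5, 1.0, 2.0))   # 1.5
print(generic_cell(1, 0.5, 1.0, 2.0))   # 2.0
```

The per-cell area cost of this strategy is exactly the second branch and the tag test.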
The main Occam code to run the single-sided systolic system is as follows:
--- PROC host
--- PROC cell A
--- PROC cell B
--- PROC cell C
PROC main.system (CHAN ........)
  --- declaration of local channels
  PAR
    host
    set tag
    PAR i = [0 FOR 2]
      cell A  -- cell 1
      cell B  -- cell 2
      cell C  -- cell 3:
7.3.2 Systolic Array Design for the Quindiagonal Case
From the algorithm described in section (7.2.2), we know that the solution
for the quindiagonal case can be computed as an extension of the tridiagonal case.
The algorithm for the quindiagonal case of banded Toeplitz matrices consists,
therefore, of a five-stage process, as shown in fig. (7.6). These stages are as
follows:
i- The solution of the Toeplitz system B3 u = d in equation (7.41a), which
consists of two substages:
a- Determining the value of u1 from equation (7.42a).
b- Followed by determining the components of ui from equation (7.43a).
ii- The second stage is to solve the system B2 v = u (7.41b), as explained
above, to get the value of v1.
iii- Solving system (7.41c) for yi.
iv- In this stage, we solve the transposed Toeplitz system B2^T z = y (7.41d).
The values of zi can be determined in the following two substages.
a- Obtain the value of zn from equation (7.44a).
b- Determine the remainder of the values of the vector z from equation (7.45c).
v- In a similar way to stage (iv), we can determine the values of xi of system
B3^T x = z (7.41e), by applying equations (7.44b and 7.45d).
The approach for the systolic array for the quindiagonal system to
accommodate the stages shown in fig. (7.6) is more or less similar to the
double-sided systolic array described in the previous section. The only difference
in the quindiagonal design is the flow of data.
Figure (7.6) Toeplitz matrix algorithm - structure of quindiagonal case:
input data; solve the Toeplitz matrix system (7.41a) (determine u1, then the
vector ui); solve the Toeplitz matrix system (7.41b) (determine v1, then the
vector vi); solve the diagonal matrix A1 (7.41c); solve the transposed matrix
system (7.41d) (determine zn, then the vector zi); solve the transposed matrix
system (7.41e) (determine xn, then the vector xi); output.
We note from section (7.2.2) that the first and second stages of fig. (7.6) have
similar algorithms; the fourth and fifth stages also have similar algorithms.
As shown in fig. (7.7), the systolic system operates in two cycles. In the first
cycle, the host pumps the input data di to the first cell, the value of u1 is
computed, and u1 and dn are pumped to cell 2 as explained in the previous section.
Cell 2 computes the backward recursive sequence to produce ui, and these values
are piped back to cell 1 to operate the second cycle for solving system (7.41b) in
the same manner as in the first cycle. As soon as cell 2 generates its output in the
second cycle, the output vector values are pumped to the middle cell, where
system (7.41c) is solved; this cell operates only once. Then the results are pumped
through cell 4 and cell 5 for solving system (7.41d). The final cell sends its results
back to cell 4. In the second cycle, the data are pumped through cell 4 and cell 5,
where xn and xi are calculated according to system (7.41e). The final results are
collected by the host from the final cell.
Values of b2 and b3 are stored in all the cells except the middle one. In the
first cycle, b3 is used in cell 1 and cell 2, while b2 is used in cell 4 and cell 5. In the
second cycle, in the opposite sense, b2 is used in cell 1 and cell 2, while b3 is used
in cell 4 and cell 5. c1 is stored at cell 3.
Figure (7.7) Double-sided systolic array design for quindiagonal system:
the host pumps di into cell 1 and collects xi from the final cell.
The single-sided systolic design shown in fig. (7.5) can also be modified in
order for the array to be implemented for quindiagonal systems. This design can
then be implemented in a similar way to that in which the tridiagonal system was
implemented.
We modify the design by increasing the number of cycles of operation.
Here, the system runs in four cycles: the first cycle is to solve system (7.41a), the
second cycle is to solve systems (7.41b and 7.41c), and the third and the fourth
cycles are to solve systems (7.41d and 7.41e) respectively.
In order to make the cells generic, in the sense that the B2^T, B3^T, B2 and B3
matrices can be computed by the same cells, we choose one of the two solutions
suggested in the previous section.
The tag bit solution is operated in the following manner:
tag s = 0 when the system runs the first and the second cycle
tag s = 1 when the system runs the third and the fourth cycle
7.3.3 Systolic Array Design for General Banded Toeplitz Matrices
So far we have described the proposed systolic array architecture for the
tridiagonal and quindiagonal Toeplitz matrices. Now, for semi-bandwidth r the
layout of the systolic scheme suggested earlier remains the same; it is only the
number of stages of the algorithm that is changed. In general, for a circular banded
Toeplitz matrix Ar of semi-bandwidth r, the number of main stages required to give
the solution of the system is 2r-1, as shown in fig. (7.8); these are as follows:
i- Stages (1 to r-1) to solve the Toeplitz systems (7.54a). Each main stage
consists of two substages, as shown in section (7.3.2).
Figure (7.8) Toeplitz matrix algorithm - structure of general case:
input data; solve the Toeplitz matrix systems (7.54a); solve the diagonal matrix
A1 (7.54b); solve the transposed matrix systems (7.54c); output.
ii- Stage r is for solving the diagonal system (7.54b).
iii- Stages (r+1 to 2r-1) are for solving the transposed Toeplitz systems
(7.54c). In this stage we can determine the values of xi as outlined in section
(7.3.2).
The double-sided systolic array system shown in fig. (7.7), and the
single-sided systolic array system shown in fig. (7.5), are implemented to
accommodate the stages shown in fig. (7.8) for the general case of Toeplitz
systems. These designs are implemented in a similar way to that in which the
quindiagonal system was implemented previously.
The double-sided systolic array for the general case operates for r-1 cycles.
The first two cells operate for r-1 cycles to solve the systems (7.54a). At the
(r-1)th cycle, cell 2 pumps the result to the middle cell, where system (7.54b) is
solved; this cell operates only once. The output is pumped on to the next two
cells, which also operate for r-1 cycles. At cycle r-1, the final cell sends the results
to the host, as shown in fig. (7.7).
The single-sided systolic array for the general case operates for 2r-2 cycles.
The system runs for r-1 cycles to solve systems (7.54a and 7.54b), then the system
runs for another r-1 cycles to solve system (7.54c).
The Occam program for the double-sided systolic array, described in this
section, is given in Appendix E.
7.3.4 Numerical Test Example
The systolic designs shown in the previous sections were tested on a variety
of Toeplitz matrices. The following banded Toeplitz system was used as an
example to show the validity of the method used in this chapter, for the case
n = 6, r = 3.
From system (7.13), we have A3 x = d, where A3 is the circulant quindiagonal
matrix

A3 = [ 1.5781  0.8437  0.1250  0.0     0.1250  0.8437 ]
     [ 0.8437  1.5781  0.8437  0.1250  0.0     0.1250 ]
     [ 0.1250  0.8437  1.5781  0.8437  0.1250  0.0    ]
     [ 0.0     0.1250  0.8437  1.5781  0.8437  0.1250 ]
     [ 0.1250  0.0     0.1250  0.8437  1.5781  0.8437 ]
     [ 0.8437  0.1250  0.0     0.1250  0.8437  1.5781 ]

x = (x1, x2, x3, x4, x5, x6)^T and

d = (9.3281, 7.7812, 10.5468, 14.0625, 16.8281, 15.2812)^T

From the algorithm shown in section (7.2.2), we determine the values of b2, b3
and c1: b2 = 0.5, b3 = 0.25 and c1 = 1, so that A1 is the identity matrix. The above
system can be solved by factorising the matrix A3 as

A3 = B3 B2 A1 B2^T B3^T

where B3 and B2 are the circulant bidiagonal factors with off-diagonal entries
0.25 and 0.5 respectively.

The above system was then input to the Occam program for the banded
quindiagonal Toeplitz matrix double-sided systolic array system. The program
results are as follows:
x1 = 1.0,  x2 = 2.0,  x3 = 3.0,  x4 = 4.0,  x5 = 5.0,  x6 = 6.0
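The result can be checked independently with a direct dense solve (a verification sketch of ours, not part of the thesis), building the circulant matrix from the first row as printed above:

```python
import numpy as np

# First row of the circulant quindiagonal test matrix A3
row = [1.5781, 0.8437, 0.1250, 0.0, 0.1250, 0.8437]
A3 = np.array([np.roll(row, k) for k in range(6)])   # cyclic shifts
d = np.array([9.3281, 7.7812, 10.5468, 14.0625, 16.8281, 15.2812])
x = np.linalg.solve(A3, d)
print(np.round(x, 2))   # close to [1. 2. 3. 4. 5. 6.]
```

The small residual error comes only from the four-decimal rounding of the printed matrix entries and right-hand side.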
7.4 PERFORMANCE OF THE SYSTOLIC ARRAY
7.4.1 Timing of the Systolic Array
The time required for the solution of the n x n banded Toeplitz system of the
form Ar x = d with semi-bandwidth r, computed on the double-sided systolic array
shown in section (7.3.3), is T = t1 + t2 + t3 + t4 + t5, where:
1- t1 = n time cycles to compute the first value (i.e. u1) of the first Toeplitz
system (7.54a) at cell 1; the remaining systems are overlapped with the next cell,
so each contributes only t1 = 1.
2- t2 = n time cycles to compute the values of each of the Toeplitz systems
(7.54a) at cell 2.
3- Cell 3 requires n time cycles to solve system (7.54b), but this is overlapped
with the computation of the previous cell, so t3 = 1.
4- Cell 4 requires n time cycles to compute the first value (i.e. zn, ..., xn) of each
of the transposed Toeplitz systems (7.54c). But this is also overlapped with the
computation of the previous cell, so t4 = 1.
5- Finally, t5 = n time cycles to compute the vector values of each of the
transposed Toeplitz systems (7.54c) at cell 5.
The total time of the entire pipeline is then

T = [n + (1 x (r-2))] + [n x (r-1)] + [1] + [1 x (r-1)] + [n x (r-1)]   for r > 2
T = [n] + [n x (r-1)] + [1] + [1 x (r-1)] + [n x (r-1)]                 for r = 2

i.e.,

T = [(2n + 1) (r-1)] + [n + (r-1)]
  = [(2n + 2) (r-1)] + n    for r >= 2
From the above equation we can determine the number of time cycles
required to solve the tridiagonal Toeplitz system, i.e.
T = [(2n + 1) (2-1)] + [n + (2-1)]
  = 3n + 2 time cycles,
while the number of time cycles for solving the quindiagonal Toeplitz system is
T = [(2n + 1) (3-1)] + [n + (3-1)]
  = 5n + 4 time cycles.
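The closed form and the two special cases can be checked with a few lines of arithmetic (a sketch of ours):

```python
def pipeline_cycles(n, r):
    # Total time T = (2n + 2)(r - 1) + n for the double-sided array, r >= 2
    return (2 * n + 2) * (r - 1) + n

# tridiagonal (r = 2) gives 3n + 2; quindiagonal (r = 3) gives 5n + 4
for n in (6, 100):
    assert pipeline_cycles(n, 2) == 3 * n + 2
    assert pipeline_cycles(n, 3) == 5 * n + 4
print(pipeline_cycles(6, 2))   # 20
```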
A similar number of time cycles is required to solve the Toeplitz systems
mentioned above when computed on the single-sided systolic array. By choosing a
LIFO procedure at cell 3, the computation time increases by n time cycles.
7.4.2 Area of the Systolic Array
The double-sided systolic array consists of the following cells:
1- Two cells with addition and multiplication operations.
2- Three cells with addition and multiplication or division operations.
The total length of the pipeline is five cells and 2n shift registers.
The number of cells in the systolic array is constant and therefore is not
related to the semi-bandwidth r or to the size of the problem n, whilst in the other
schemes proposed by Kung and Hu [Kung and Hu 1981] and Brent and Luk
[Brent and Luk 1983] the number of cells is related to n. In Megson's schemes
[Megson 1985], the number of cells is related to r, the semi-bandwidth.
The single-sided systolic array consists of only three cells and 2n shift
registers, thus reducing the design area to a minimum compared to the other
schemes. The only drawback is that some of the cells contain n shift registers.
CHAPTER 8
CONCLUSION AND DISCUSSION
The main study of this thesis is the design of parallel algorithms for digital
image processing on Very Large Scale Integration (VLSI) processor arrays, which
are implemented on both a Sequent Balance (MIMD) via an Occam simulator and a
transputer network running the Transputer Development System (TDS). The Occam
programming language is used as a tool to simulate and map systolic arrays for the
image processing algorithms proposed.
The algorithms considered in this thesis are drawn from the class of low-level
vision algorithms. In particular we consider the low-pass and high-pass filters. The
approach taken is to develop systolic array designs for these algorithms. Comments
and conclusions related to the implementation of the systolic arrays on transputer
networks are provided in the performance sections of the relevant chapters.
A general introduction to parallel processing is presented in chapter 1. This
chapter covers a wide selection of the principles of the significant parallel computer
architectures, and various classifications of parallel architectures are presented.
The systolic approach in parallel processing evolved from the appropriate
technology and the background knowledge for its realisation, together with possible
applications. The applications arise from the ever increasing demand for faster
and more reliable computations, especially in areas like real-time signal processing
and large-scale scientific computation. The appropriate technology was provided by
the remarkable advances in VLSI and automated design tools. Systolic array
systems feature the important properties of modularity, local interconnection, a high
degree of pipelining and highly synchronised multiprocessing. Systolic systems and
the hardware design of the transputer are discussed in detail in chapter 2, as well as
the associated parallel language Occam and the development system for running
Occam programs (TDS).
Transputers have a number of attractive features which are important for
building parallel systems. Probably the most important feature is the presence of
four high-speed serial links through which the transputer can be connected to other
transputers. One of the limiting factors in the use of transputers for image
processing applications is the size of the on-chip memory.
Chapter 3 considers the fundamentals of low-level image processing,
including parallel low-level image processing algorithms, parallel hardware, the
various methods for implementing these algorithms on various types of parallel
architecture, and the image processing techniques required for image filtering.
A systolic array design for one-dimensional convolution is described in
chapter 4. It is shown that this systolic array can be extended to handle a
two-dimensional convolution algorithm. The system performance was improved by
implementing the shift register as a constant time operation. The number of delays
of the systolic system is therefore a constant time operation, i.e., it is independent
of the kernel size and input data size. This decreases the execution time of the
whole system. This system can be extended to handle convolution of any
dimensionality.
The implementations of the systolic arrays for 1D and 2D convolution
algorithms on transputer networks are also presented in chapter 4 and their timing
results analysed. For the 1D convolution algorithm, the speed-up and the efficiency
increase with the increasing number of transputers and the increasing image size.
The best efficiency is obtained when each transputer contains one cell only of the
systolic design. The overall results for the 2D convolution algorithm are very
impressive; the timing indicates a very high speed-up for all sizes of images when
the network size increases to more than one transputer. The efficiency also
increases with increasing size of the transputer network. The systolic array designs
for 2D convolution are shown to be superior to those known in the literature.
Various modifications of the systolic design presented in chapter 4, to handle
a set of digital image filters, are analysed in chapters 5 and 6. These include both
low-pass and high-pass filters. These algorithms were implemented on a transputer
network, using a systolic array designed for each of them.
In chapter 5, systolic array designs for the Laplacian and gradient operators
are presented. The plus-shaped Laplacian algorithm gave near linear speed-up for
all sizes of image on a transputer network. The maximum efficiency is obtained
when the load on each transputer is about the same, whilst the square Laplacian
algorithm gives high speed-ups and efficiencies as the network size increases to
more than two transputers.
A new systolic system for the gradient operator is developed and shown in
chapter 5. The transputer implementation of the design is also discussed. The
relationship between the image size and the speed-up is nearly fixed for each
transputer network, but increases sharply when the network size is at its maximum.
It can be concluded that better performance results can be achieved if the load is
balanced. The systolic array designs for the Prewitt and Sobel operators are also
introduced.
Another set of systolic systems for digital image filters is discussed in
chapter 6. The chapter starts with an overview of the systolic array for the sigma
filter, followed by another systolic design for the inverse gradient filter. Another
section presents a systolic array design for the mean and weighted mean filters. The
implementation of each of these designs on transputer networks is also discussed.
The overall results for implementing the sigma and inverse gradient filters on
transputer networks give an efficiency which increases as the network grows from
7 to 10 transputers; the speed-up also increases as the size of the image increases.
The mean and weighted mean filters gave a good speed-up for large images.
This result is quite useful, since the need for a parallel system is more vital
for large images, where processing time is relatively high. The graphs show good
speed-up and efficiency for the various sizes of images and transputer networks.
However, there is a maximum number of transputers which can be used efficiently
in the systolic systems.
In chapter 6 the implementation of a variety of digital image filter algorithms
on the Sequent Balance and the transputer network was achieved. One of the aims
of this was to design and build a programming workbench for developing image
processing operations for low-level vision. The motivation for the work is to
develop a methodology for the implementation of an image processing library on
the Sequent Balance, i.e. PARC-IPL. The key to the workbench is to hold a library
of precoded software components in a generalized configuration-independent style.
The workbench provides a good mechanism for developing low-level image
processing systems on parallel computers without the need for much actual
programming on the user's part. The user can control the execution of the program
from the workbench. At the heart of PARC-IPL is a library of image processing
routines and algorithms, which are coded in a generalized format which is not
specific to one particular configuration, either by its size or topology. Further work
is still required to invoke the existing library on a multiprocessor such as a
transputer network, using the same Occam codes.
Furthermore, in the field of filter design the problem of solving banded
Toeplitz systems occurs frequently. Toeplitz matrices have become increasingly
important with the rapid growth of signal and image processing. The factorisation of
certain banded Toeplitz matrices proposed by D.J. Evans is described in chapter 7.
By using this factorisation a solution can be derived for such matrices. Descriptions
of two systolic array designs for the tridiagonal, quindiagonal and general Toeplitz
matrices are presented. One of the main objectives of the designs is to minimize the
number of hardware components, the maximum number of cells in these designs
being 5. The performance of the designs is also given.
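For reference, a tridiagonal Toeplitz system can also be solved directly by elimination. The Python sketch below uses the ordinary Thomas algorithm on a system with constant diagonals; it is an illustrative baseline only, not the Evans factorisation or the systolic designs described in chapter 7:

```python
# Solve T x = d where T is tridiagonal Toeplitz: constant sub-diagonal a,
# main diagonal b, super-diagonal c. Standard Thomas algorithm (forward
# elimination, back substitution); O(n) work, purely sequential.
def solve_tridiag_toeplitz(a, b, c, d):
    n = len(d)
    cp = [0.0] * n          # modified super-diagonal coefficients
    dp = [0.0] * n          # modified right-hand side
    cp[0] = c / b
    dp[0] = d[0] / b
    for i in range(1, n):
        m = b - (a * cp[i - 1])
        cp[i] = c / m
        dp[i] = (d[i] - (a * dp[i - 1])) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - (cp[i] * x[i + 1])
    return x

# T = tridiag(-1, 4, -1); d chosen so that the solution is all ones:
print(solve_tridiag_toeplitz(-1.0, 4.0, -1.0, [3.0, 2.0, 2.0, 3.0]))
# approximately [1.0, 1.0, 1.0, 1.0]
```

The contrast with this sequential recurrence is what motivates the systolic designs: they expose the elimination as a pipeline of identical cells.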
Further work is required to implement these systolic designs on a transputer
network, in order to measure the performance of the algorithm on a variety of
network configurations.
Finally, the use of task parallelism should be considered for parallel
implementations of image processing applications, especially those belonging to the
low-level image class. This is because of the large amount of data involved in the
highly homogeneous processing required. The systolic array designs described in
this thesis can be extended to cover image parallelism as well as task parallelism. In
other words, the image is partitioned over a number of similar systolic arrays, with
each systolic array executing the whole algorithm. Such designs will achieve the
advantages of both image parallelism and task parallelism.
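The partitioning idea can be sketched as follows. This Python fragment (an illustration of the data decomposition only, not the thesis's Occam implementation) splits the image into horizontal strips with a one-row halo for a 3x3 filter; each strip would be handed to its own systolic array, here stood in for by a sequential 3x3 mean filter:

```python
# Image parallelism by row-strip partitioning for a 3x3 neighbourhood
# operator. Each strip carries a one-row halo so its boundary rows can
# be filtered exactly as in the whole-image case.
def mean3x3(img):
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]          # border pixels left unchanged
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(img[y + dy][x + dx]
                            for dy in (-1, 0, 1)
                            for dx in (-1, 0, 1)) / 9.0
    return out

def partition_and_filter(img, strips):
    # Assumes the image height divides evenly into the number of strips.
    h = len(img)
    rows_per = h // strips
    result = []
    for s in range(strips):
        lo = max(0, (s * rows_per) - 1)          # one-row halo above
        hi = min(h, ((s + 1) * rows_per) + 1)    # one-row halo below
        filtered = mean3x3(img[lo:hi])           # the "systolic array" stand-in
        own_lo = (s * rows_per) - lo             # keep only the owned rows
        result.extend(filtered[own_lo:own_lo + rows_per])
    return result
```

With the halo rows, the partitioned result is identical to filtering the whole image, so each strip's systolic array can run the full algorithm independently, combining image parallelism with task parallelism.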
REFERENCES
Amin 88  Amin, S.A., Systolic Design for Lowpass Digital Image Filtering on a Transputer Network Using TDS, The SERC/DTI Initiative in the Engineering Application of Transputers, UK, 1988.

Arvind 82  Arvind and Gostelow, K.P., The U-Interpreter, IEEE Computer, pp. 42-49, Feb. 1982.

Ballard 82  Ballard, D.H. and Brown, C.M., Computer Vision, Prentice Hall, Inc., 1982.

Bareiss 69  Bareiss, E.H., Numerical Solution of Linear Equations with Toeplitz Matrices, Numerical Mathematics, 13, pp. 404-424, 1969.

Bekakos 86  Bekakos, M.P., A Study of Algorithms for Parallel Computers and VLSI Systolic Processor Arrays, Ph.D. Thesis, Dept. of Computer Studies, LUT, 1986.

Brent 83  Brent, R.P., Kung, H.T. and Luk, F., Some Linear Time Algorithms for Systolic Arrays, CMU-ROL-83, 1983.

Brent 84  Brent, R.P. and Luk, F.T., Systolic Arrays for the Linear Time Solution of Toeplitz Systems of Equations, J. of VLSI and Computer Systems, Vol. 1, No. 1, 1984.

Chin 83  Chin, R.T. and Yeh, C.L., Quantitative Evaluation of Some Edge-preserving Noise-Smoothing Techniques, Computer Vision, Graphics and Image Processing, 23, pp. 67-91, 1983.

Crookes 90 [1]  Crookes, D., Morrow, P.J. and Philip, G., The Development of a Transputer-Based Image Database, Proc. 2nd International Conference on Applications of Transputers, Southampton, pp. 189-195, July 1990.
Crookes 90  Crookes, D., Morrow, P.J. and McParland, P.J., IAL: a Parallel Image Processing Programming Language, IEE Proceedings, Vol. 137, No. 3, pp. 176-182, June 1990.

Crookes 91 [1]  Crookes, D., Morrow, P.J., McClatchey, I. and Rafferty, T., A Software Development Environment for Parallel Image Processing: Implementation Techniques and Issues, in "Occam and the Transputer: Current Developments", Edwards, J. (ed.), IOS Press, Netherlands, 1991.

Crookes 91  Crookes, D., Morrow, P.J. and McParland, P.J., Occam Implementation of an Algebra-Based Language for Low-level Image Processing, Computer Systems and Engineering, Vol. 6, No. 1, pp. 30-36, Jan. 1991.

Danielsson 81  Danielsson, P.E., Note on Getting the Median Faster, Computer Vision, Graphics and Image Processing, 17, 1981.

Dew 86  Dew, P.M., Manning, L.J. and McEvoy, K., A Tutorial on Systolic Array Architectures for High Performance Processors, 2nd Int. Electronic Image Week, Nice, 1986.

Doshi 87  Doshi, K. and Varman, P., A Modular Systolic Architecture for Image Processing, Computer Architecture Conf., 14th Int. Symp., USA, pp. 56-63, 1987.

Duff 83  Duff, M.J.B., Computing Structures for Image Processing, Academic Press, 1983.

Ekstrom 84  Ekstrom, M.P., Digital Image Processing Techniques, Academic Press, 1984.
Evans 72  Evans, D.J., An Algorithm for the Solution of Certain Tridiagonal Systems, The Computer Journal, 15, pp. 356-359, 1972.

Evans 80  Evans, D.J., On the Solution of Certain Toeplitz Tridiagonal Linear Systems, SIAM J. Numerical Anal., Vol. 17, No. 5, 1980.

Evans 81  Evans, D.J., On the Factorisation of Certain Symmetric Circulant Banded Linear Systems, in "Parallel Processing Techniques, Applied Information Technology edition", pp. 79-84, 1981.

Evans 86  Evans, D.J. and Megson, G.M., A Highly Pipelined Systolic Array for Solving Toeplitz Systems, Int. Rep., Computer Studies, No. 332, LUT, 1986.

Evans 89  Evans, D.J. and Megson, G.M., Fast Triangularization of a Symmetric Tridiagonal Matrix, J. of Parallel and Distributed Computing, 6, pp. 663-678, 1989.

Evans 91  Evans, D.J. and Gusev, M., New Linear Systolic Arrays for Digital Filters and Convolution, Int. Rep., Computer Studies, No. 660, LUT, 1991.

Flynn 66  Flynn, M.J., Very High-Speed Computing Systems, Proc. of the IEEE, Vol. 54, No. 12, pp. 1901-1909, Dec. 1966.

Galletly 90  Galletly, J., Occam 2, Pitman Publishing, GB, 1990.

Giloi 91  Giloi, W.K., Whither Image Analysis System Architecture?, in "From Pixels to Features II", Burkhardt, H., Neuvo, Y., Simon, J.C. (eds.), Elsevier Science Publishers, 1991.
Gohberg 72  Gohberg, I. and Semencul, A., On the Inversion of Finite Toeplitz Matrices and their Continuous Analogs, Mat. Issled 2, pp. 201-233, 1972.

Gonzalez 92  Gonzalez, R.C. and Woods, R.E., Digital Image Processing, Addison-Wesley Publishing Company, 1992.

Graham 90  Graham, I. and King, T., The Transputer Handbook, Prentice Hall, 1990.

Gurd 85  Gurd, J.R., Kirkham, C.C. and Watson, I., The Manchester Prototype Data-Flow Computer, Comm. ACM, No. 1, pp. 34-52, Jan. 1985.

Handler 82  Handler, W., Innovative Computer Architectures - How to Increase Parallelism But Not Complexity, in "Parallel Processing Systems", Evans, D.J. (ed.), Cambridge University Press, GB, 1982.

Haralick 80  Haralick, M.H. and Simon, J.C., Issues in Digital Image Processing, Sijthoff and Noordhoff, Netherlands, 1980.

Harp 89  Harp, G., Transputer Applications, Pitman Publishing, London, 1989.

Hays 88  Hays, J.P., Computer Architecture and Organization, McGraw-Hill, 1988.

Higbie 72  Higbie, L.C., The Omen Computers: Associative Array Processors, IEEE Comp. Conf., Digest, pp. 287-290, 1972.

Hobbs 70  Hobbs, L.C. and Theis, D.J., Survey of Parallel Processor Approaches and Techniques, in "Parallel Systems: Technology and Applications", Hobbs et al. (eds.), Spartan Books, New York, pp. 3-20, 1970.
Hockney 88  Hockney, R.W. and Jesshope, C.R., Parallel Computers 2: Architecture, Programming and Algorithms, Adam Hilger Ltd., Bristol, England, 1988.

Hubel 62  Hubel, D.H. and Wiesel, T.N., Receptive Fields, Binocular Interaction and Functional Architecture in the Cat's Visual Cortex, J. Physiol., pp. 106-154, 1962.

Hussain 91  Hussain, Z., Digital Image Processing: Practical Applications of Parallel Processing Techniques, Ellis Horwood, 1991.

Hwang 84  Hwang, K. and Briggs, F.A., Computer Architecture and Parallel Processing, McGraw-Hill, N.Y., 1984.

Inmos 86  Inmos Ltd., Product Information: ITEM 400 Inmos Transputer Evaluation Module, 1986.

Inmos 87  Inmos Ltd., The Transputer Family, Inmos Ltd., UK, 1987.

Inmos 88 [1]  Inmos Ltd., IMS B012 User Guide and Reference Manual, 1988.

Inmos 88 [2]  Inmos Ltd., Occam 2 Reference Manual, Prentice Hall, UK, 1988.

Inmos 90  Inmos Ltd., Transputer Development System, Prentice Hall, UK, 1990.

Inmos 92  Inmos Ltd., The Transputer Databook, Inmos Ltd., UK, 1992.

Kung 78  Kung, H.T. and Leiserson, C.E., Systolic Arrays (for VLSI), in Proc. Sparse Matrix Symp. (SIAM), pp. 256-282, 1978.

Kung 79  Kung, H.T., Let's Design Algorithms for VLSI Systems, Proc. Conf. Very Large Scale Integration: Architecture, Design, Fabrication, California Institute of Technology, pp. 65-90, Jan. 1979.

Kung 80  Kung, H.T., Special-Purpose Devices for Signal and Image Processing: an Opportunity in VLSI, Real-Time Signal Processing III, pp. 76-84, 1980.

Kung 82  Kung, H.T. and Song, S.W., A Systolic 2D Convolution Chip, in "Multiprocessors and Image Processing: Algorithms and Programs", Academic Press, 1982.

Kung 83  Kung, H.T., Ruane, L.M. and Yen, D.W., Two-Level Pipelined Systolic Array for Multidimensional Convolution, Image and Vision Computing, Vol. 1, No. 1, 1983.

Kung 84 [1]  Kung, H.T., Systolic Algorithms for the CMU Warp Processors, Dept. of Computer Science, CMU, 1984.

Kung 84 [2]  Kung, H.T., Systolic Algorithms for the CMU Warp Processors, CMUC-CSA-84-158 (7th Int. Conf.), 1984.

Kung 84  Kung, H.T. and Lam, M.S., Wafer-Scale Integration and Two-level Pipelined Implementations of Systolic Arrays, J. of Parallel and Distributed Computing, Vol. 1, pp. 32-63, 1984.

Kung 83  Kung, S.Y. and Hu, Y.H., A Highly Concurrent Algorithm and Pipelined Architecture for Solving Toeplitz Systems, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-31, No. 1, 1983.

Kung 85  Kung, S.Y., VLSI Array Processors, IEEE ASSP Magazine, pp. 5-22, July 1985.
Kwan 88  Kwan, H.K. and Okullo-Oballa, T.S., Two-Dimensional Systolic Arrays for Two-Dimensional Convolution, Proc. of the SPIE, Vol. 1001, pp. 724-731, 1988.

Lee 83  Lee, J.S., Note on Digital Image Smoothing and the Sigma Filter, Computer Vision, Graphics and Image Processing, Vol. 24, pp. 255-269, 1983.

Manning 88  Manning, L.J., Design and Analysis of Computational Models for Programmable VLSI Processor Arrays, Ph.D. Thesis, University of Leeds, 1988.

Manning 88 [1]  Manning, L.J., Dew, P.M. and Wang, H., Design and Analysis of Image Processing Algorithms for Programmable VLSI Array Processors, in "Parallel Architectures and Computer Vision", Page, I. (ed.), Oxford University Press, 1988.

May 89  May, D., The Transputer, in "Transputer Applications", Harp, G. (ed.), Pitman Publishing, London, 1989.

Mead 80  Mead, C.A. and Conway, L.A., Introduction to VLSI Systems, Addison-Wesley, Reading, Mass., 1980.

Megson 85  Megson, G.M. and Evans, D.J., Banded and Toeplitz Systems, Int. Rep., Computer Studies, No. 243, LUT, 1985.

Megson 86  Megson, G.M. and Evans, D.J., Soft-Systolic Pipelined Matrix Algorithms, in "Parallel Computing 85", Feilmeier, M. et al. (eds.), Elsevier Science Publishers, 1986.

Megson 87  Megson, G.M., Novel Algorithms for the Soft-Systolic Paradigm, Ph.D. Thesis, LUT, 1987.

Megson 92  Megson, G.M., An Introduction to Systolic Algorithm Design, Clarendon Press, Oxford, UK, 1992.
Moore 87  Moore, W., McCabe, A. and Urquhart, R., Systolic Arrays, Adam Hilger, 1987.

Morrow 87  Morrow, P.J. and Perrott, R.H., The Design of Low-Level Image Processing Algorithms on a Transputer Network, in "Parallel Architectures and Computer Vision", pp. 243-260, 1987.

Morrow 91  Morrow, P.J. and Crookes, D., Using a High Level Language: Issues in Image Processing on Transputers, in "From Pixels to Features II", Burkhardt, Y., et al. (eds.), Elsevier Science Publishers, Netherlands, pp. 313-326, 1991.

Morrow 92  Morrow, P.J. and Crookes, D., Parallel Languages for Transputer-Based Image Processing, in "Image Processing and Transputers", Webber, H.C. (ed.), IOS Press, Netherlands, pp. 27-46, 1992.

Murtha 64  Murtha, J. and Beadles, R., Survey of the Highly Parallel Information Processing Systems, Prepared by the Westinghouse Electronic Corp., Aerospace Division, ONR, Report No. 4755, Nov. 1964.

Nagao 79  Nagao, M. and Matsuyama, T., Note on Edge Preserving Smoothing, Computer Vision, Graphics and Image Processing, Vol. 9, pp. 394-407, 1979.

Niblack 86  Niblack, W., An Introduction to Digital Image Processing, Prentice Hall, 1986.

Nudd 88  Nudd, G.R. and Francis, N.D., Architectures for Image Analysis, in Third Int. Conf. on Image Processing and its Applications, pp. 445-451, IEE, England, 1988.
Offen 85  Offen, R.J., VLSI Image Processing, Collins, London, 1985.

Peli 82  Peli, T. and Malah, D., Study of Edge Detection Algorithms, Computer Vision, Graphics and Image Processing, Vol. 20, pp. 1-21, 1982.

Quinton 91  Quinton, P. and Robert, Y., Systolic Algorithms and Architectures, Prentice Hall, UK, 1991.

Ramamoorthy 77  Ramamoorthy, C.V. and Li, H.F., Pipeline Architecture, Computing Surveys, Vol. 9, No. 1, pp. 61-102, March 1977.

Robert 86  Robert, Y. and Tchuente, M., Efficient Systolic Arrays for the 1D Convolution Problem, J. of VLSI and Computer Systems, Vol. 1, pp. 398-407, 1986.
Rosenfeld 79  Rosenfeld, A., Picture Languages, Academic Press, New York, 1979.

Rosenfeld 82  Rosenfeld, A. and Kak, A.C., Digital Picture Processing, Vol. 2, Academic Press, 1982.

Seitz 85  Seitz, C.L., The Cosmic Cube, Comm. ACM, Vol. 28, No. 1, pp. 22-33, Jan. 1985.

Shore 73  Shore, J.E., Second Thoughts on Parallel Processing, Comput. Elec. Eng., pp. 95-109, 1973.

Siegel 85  Siegel, H.J., Interconnection Networks for Large-Scale Parallel Processing, Lexington Books, D.C. Heath and Co., Lexington, MA, 1985.

Sloboda 89  Sloboda, F., Toeplitz Matrices, Homothety and Least Squares Approximation, in "Parallel Computing: Methods, Algorithms and Applications", IOP Publishing Ltd., pp. 237-248, 1989.

Snyder 82  Snyder, L., Introduction to the Configurable Highly Parallel Computer, IEEE Computer, pp. 47-56, 1982.
Stone 87  Stone, H.S., High-Performance Computer Architecture, Addison-Wesley, Reading, MA, 1987.

Tabak 89  Tabak, D., Multiprocessors, Prentice-Hall International, 1989.

Trench 64  Trench, W., An Algorithm for the Inversion of Finite Toeplitz Matrices, J. Soc. Ind. Appl. Math., Vol. 12, pp. 515-522, 1964.

Undrill 92  Undrill, P.E., Digital Images: Processing and Application of the Transputer, in "Image Processing and Transputers", Webber, H.C. (ed.), IOS Press, 1992.

Wang 81  Wang, D.C.C. and Vagnucci, A.H., Gradient Inverse Weighted Smoothing Scheme and the Evaluation of its Performance, Computer Vision, Graphics and Image Processing, Vol. 15, pp. 167-181, 1981.

Wayman 89  Wayman, R., Transputer Development Systems, Pitman Publishing, London, 1989.
APPENDICES
APPENDIX A
Host Transputer Occam Program for 1D and 2D Convolution
-- Host program for 1D and 2D convolution.
-- Using host and n transputers.
-- One input channel and one output channel only for each transputer.
-- Image size 16*16.
#USE interf
#USE uservals
#USE streamio
#USE snglmath
#USE strings
#USE ssinterf
#USE userio
#USE linkaddr
-- time, m, n, mn should be modified for other image sizes.
PROTOCOL pair IS REAL32 ; REAL32:
CHAN OF pair xyin,xyout :
VAL time IS 286:  -- size of the image + No. of columns + 14
VAL m IS 16:      -- No. of rows
VAL n IS 17:      -- No. of columns + 1
VAL mn IS 272:    -- (no.co + 1) * no.rw
[5] REAL32 a:
[time] REAL32 x,y,xx,yy:
INT start,end,interval:
REAL32 intervall:
TIMER clock:
PLACE xyin AT link2.out:
PLACE xyout AT link3.in:
PROC output.result(CHAN OF ANY screen)
SEQ
SEQ i=O FOR time
SEQ
ss.write.int(screen, i, 6)
-- print results
ss.write.real32(screen, xx[i], 8, 10)
newline(screen)
ss.write.real32(screen, yy[i], 8, 10)
newline(screen)
newline(screen)
-- print time lapse
write.full.string(screen, " Timer in units = ")
ss.write.int(screen, interval, 10)
newline(screen)
write.full.string(screen, " Timer in seconds = ")
ss.write.real32(screen, intervall, 8, 10)
PROC write.to.file()
INT error:
SEQ
-- This procedure uses screen output with the option to file a copy
-- To file output, run it on an empty fold
INT kchar:
SEQ
ss.write.string(screen, "Do you want to file the output? ")
ks.read.echo.char (keyboard, screen, kchar)
ss.write.nl(screen)
VAL bchar IS BYTE (kchar /\ #5F):  -- mask off alphabetic case
IF
bchar= 'Y'
CHAN OF ANY fromprog, tofile:
INT foldnum:
PAR
SEQ
output.result(fromprog)
ss.write.endstream (fromprog)
SEQ
ss.scrstream.fan.out (fromprog, tofile, screen)
ss.write.endstream (tofile)
SEQ
ss.scrstream.to.fold (tofile, from.user.filer[0],
to.user.filer[0], "output.dat", foldnum, error)
IF
error=O
SKIP
TRUE
STOP
TRUE
output.result(screen)
ss.write.string(screen, "File output OK*c*n")
-- The main host sends an element from buffers x[] and y[] via the xyout channel,
-- and collects an element into buffers xx[] and yy[] via the xyin channel.
PROC host(CHAN OF pair xyout,xyin)
PROC host1 (CHAN OF pair xyout)
SEQ
clock ? start
SEQ i= 0 FOR time
xyout ! x[i] ; y[i]
PROC host2(CHAN OF pair xyin)
SEQ
SEQ i= 0 FOR time
xyin ? xx[i] ; yy[i]
clock ? end
SEQ
PAR
host1 (xyout)
host2(xyin)
interval := end MINUS start
intervall := (REAL32 ROUND interval)/15625.0 (REAL32)
PROC initialization()
SEQ
SEQ j= 0 FOR m
SEQ i= 0 FOR n
SEQ
x[(n*j)+i] := REAL32 ROUND i
y[(n*j)+i] := 0.0 (REAL32)
SEQ k= mn FOR 14
SEQ
x[k] := 0.0 (REAL32)
y[k] := 0.0 (REAL32)
SEQ
initialization()
host(xyin,xyout)
-- print output.
write.full.string(screen, " Running ")
write.full.string(screen, " Output ")
output.result(screen)
write.to.file()
INT any:
keyboard ? any
APPENDIX B
Occam Program for 2D Convolution
#USE linkaddr
PROTOCOL pair IS REAL32 ; REAL32:
CHAN OF pair xyin, xyout :
[12]CHAN OF pair xyc :
VAL time IS 286:  -- size of image + No. of columns + 14
-- Kernel elements values.
VAL REAL32 a0 IS -1.0(REAL32):
VAL REAL32 a1 IS -1.0(REAL32):
VAL REAL32 a2 IS -1.0(REAL32):
VAL REAL32 a3 IS -1.0(REAL32):
VAL REAL32 a4 IS 8.0(REAL32):
VAL REAL32 a5 IS -1.0(REAL32):
VAL REAL32 a6 IS -1.0(REAL32):
VAL REAL32 a7 IS -1.0(REAL32):
VAL REAL32 a8 IS -1.0(REAL32):
PROTOCOL pair IS REAL32 ; REAL32:
PROC system1(CHAN OF pair xyin,xyout,  -- cell A
VAL INT time, VAL REAL32 a0)
[3]CHAN OF pair xyc :
PROC cell1(CHAN OF pair xyin,xyout,
VAL REAL32 a)
[4]CHAN OF REAL32 xd:
[2]CHAN OF REAL32 yd:
CHAN OF REAL32 pd:
PROC malt ( CHAN OF REAL32 xin,yout)
#USE userio
REAL32 x,y:
SEQ
x := 0.0(REAL32)
y := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
xin ? x
yout ! y
y := x*a
PROC add ( CHAN OF pair xyout,
CHAN OF REAL32 pin,yin,xin)
[2]REAL32 p:
REAL32 y:
REAL32 x:
SEQ
p[0] := 0.0(REAL32)
p[1] := 0.0(REAL32)
y := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
pin ? p[0]
yin ? y
xin ? x
xyout ! x ; p[1]
p[1] := p[0] + y
PROC pass ( CHAN OF REAL32 zin,zout,xout)
[2]REAL32 z:
REAL32 x:
SEQ
z[0] := 0.0(REAL32)
z[1] := 0.0(REAL32)
x := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
zin ? z[0]
zout ! z[1]
xout ! x
SEQ
z[1] := z[0]
x := z[0]
PROC pass1 (CHAN OF pair xyin,
CHAN OF REAL32 yout,zout)
[2]REAL32 z:
REAL32 y:
SEQ
z[0] := 0.0(REAL32)
z[1] := 0.0(REAL32)
y := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
xyin ? z[0] ; y
zout ! z[1]
yout ! y
z[1] := z[0]
PROC delay ( CHAN OF REAL32 xin,xout)
[2]REAL32 x:
SEQ
x[0] := 0.0(REAL32)
x[1] := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
xin ? x[0]
xout ! x[1]
x[1] := x[0]
PAR
pass1 (xyin,xd[0],xd[1])
pass(xd[1],xd[2],xd[3])
malt(xd[2],yd[0])
delay(yd[0],yd[1])
add(xyout,xd[0],yd[1],xd[3])
SEQ
cell1 (xyin,xyout,a0)
PROTOCOL pair IS REAL32 ; REAL32:
PROC system2(CHAN OF pair xyin,xyout,  -- cell B
VAL INT time,
VAL REAL32 a0)
VAL no.co IS 16:
[3]CHAN OF pair xyc :
PROC cell2(CHAN OF pair xyin,xyout,
VAL REAL32 a)
[no.co+2]CHAN OF REAL32 xd:
[2]CHAN OF REAL32 yd:
CHAN OF REAL32 pd:
PROC malt ( CHAN OF REAL32 xin,yout)
-- Similar to Proc malt in cell 1
PROC add ( CHAN OF pair xyout,
CHAN OF REAL32 pin,yin,xin)
-- Similar to Proc add in cell 1
PROC pass ( CHAN OF REAL32 zin,zout,xout)
-- Similar to Proc pass in cell 1
PROC pass1 (CHAN OF pair xyin,
CHAN OF REAL32 yout,zout)
-- Similar to Proc pass1 in cell 1
PROC delay ( CHAN OF REAL32 xin,xout)
-- Similar to Proc delay in cell 1
PROC delayo ( CHAN OF REAL32 xin,bout)
-- Constant time delay proc.
REAL32 x:
[no.co] REAL32 b:
INT j:
SEQ
x := 0.0(REAL32)
SEQ k= 0 FOR no.co
b[k] := 0.0(REAL32)
j := 1
SEQ i= 0 FOR time
SEQ
SEQ
xin ? x
bout ! b[j]
b[j] := x
IF
(j > (no.co - 2))
j := 1
TRUE
j := j + 1
PAR
pass1 (xyin,xd[0],xd[1])
pass(xd[1],xd[2],xd[3])
delayo(xd[3],xd[4])
malt(xd[2],yd[0])
delay(yd[0],yd[1])
add(xyout,xd[0],yd[1],xd[4])
SEQ
cell2 (xyin,xyout,a0)
-- Network configuration (9 transputers).
PLACED PAR
PROCESSOR 0 T8
PLACE xyin AT link1.in:
PLACE xyc[0] AT link3.out:
system1 (xyin,xyc[0],time,a0)
PLACED PAR
PROCESSOR 1 T8
PLACE xyc[0] AT link0.in:
PLACE xyc[1] AT link3.out:
system1 (xyc[0],xyc[1],time,a1)
PLACED PAR
PROCESSOR 2 T8
PLACE xyc[1] AT link0.in:
PLACE xyc[2] AT link3.out:
system2 (xyc[1],xyc[2],time,a2)
PLACED PAR
PROCESSOR 3 T8
PLACE xyc[2] AT link0.in:
PLACE xyc[3] AT link2.out:
system1 (xyc[2],xyc[3],time,a3)
PLACED PAR
PROCESSOR 4 T8
PLACE xyc[3] AT link1.in:
PLACE xyc[4] AT link0.out:
system1 (xyc[3],xyc[4],time,a4)
PLACED PAR
PROCESSOR 5 T8
PLACE xyc[4] AT link3.in:
PLACE xyc[5] AT link0.out:
system2 (xyc[4],xyc[5],time,a5)
PLACED PAR
PROCESSOR 6 T8
PLACE xyc[5] AT link3.in:
PLACE xyc[6] AT link0.out:
system1 (xyc[5],xyc[6],time,a6)
PLACED PAR
PROCESSOR 7 T8
PLACE xyc[6] AT link3.in:
PLACE xyc[7] AT link2.out:
system1 (xyc[6],xyc[7],time,a7)
PLACED PAR
PROCESSOR 8 T8
PLACE xyc[7] AT link1.in:
PLACE xyout AT link0.out:
system1 (xyc[7],xyout,time,a8)
APPENDIX C
Occam Program for Gradient Operator
#USE linkaddr
PROTOCOL three IS REAL32 ; REAL32 ; REAL32 :
CHAN OF three xyin, xyout :
[12]CHAN OF three xyc :
VAL time IS 286:  -- size of image + no.co + 14
VAL REAL32 a0 IS -1.0(REAL32):
VAL REAL32 a1 IS -1.0(REAL32):
VAL REAL32 a2 IS 1.0(REAL32):
VAL REAL32 a3 IS 1.0(REAL32):
PROTOCOL three IS REAL32 ; REAL32 ; REAL32 :
PROC system1(CHAN OF three xyin,xyout,
VAL INT time,
VAL REAL32 a0)
VAL no.co IS 16:
PROC cell1(CHAN OF three xyin,xyout,
VAL REAL32 a)
[no.co+2]CHAN OF REAL32 xd:
[2]CHAN OF REAL32 yd:
CHAN OF REAL32 rd:
PROC malt ( CHAN OF REAL32 xin,yout)
#USE userio
REAL32 x,y:
SEQ
x := 0.0(REAL32)
y := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
xin ? x
yout ! y
y := x*a
PROC add ( CHAN OF three xyout,
CHAN OF REAL32 pin,yin,xin,rin)
[2]REAL32 p:
REAL32 y:
REAL32 x,r:
SEQ
p[0] := 0.0(REAL32)
p[1] := 0.0(REAL32)
x := 0.0(REAL32)
y := 0.0(REAL32)
r := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
PAR
pin ? p[0]
yin ? y
xin ? x
rin ? r
xyout ! x ; p[1] ; r
p[1] := p[0] + y
PROC pass ( CHAN OF REAL32 zin,zout,xout)
[2]REAL32 z:
REAL32 x:
SEQ
z[0] := 0.0(REAL32)
z[1] := 0.0(REAL32)
x := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
PAR
zin ? z[0]
zout ! z[1]
xout ! x
SEQ
z[1] := z[0]
x := z[0]
PROC pass1 (CHAN OF three xyin,
CHAN OF REAL32 yout,xout,rout)
[2]REAL32 x:
REAL32 y,r:
SEQ
x[0] := 0.0(REAL32)
x[1] := 0.0(REAL32)
y := 0.0(REAL32)
r := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
PAR
xyin ? x[0] ; y ; r
xout ! x[1]
yout ! y
rout ! r
x[1] := x[0]
PROC delay ( CHAN OF REAL32 xin,xout)
[2]REAL32 x:
SEQ
x[0] := 0.0(REAL32)
x[1] := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
xin ? x[0]
xout ! x[1]
x[1] := x[0]
PROC delayo ( CHAN OF REAL32 xin,bout)
-- Constant time delay proc.
REAL32 x:
[no.co+8] REAL32 b:
INT j:
SEQ
x := 0.0(REAL32)
SEQ k= 0 FOR no.co
b[k] := 0.0(REAL32)
j := 1
SEQ i= 0 FOR time
SEQ
SEQ
xin ? x
bout ! b[j]
b[j] := x
IF
(j > (no.co - 3))
j := 1
TRUE
j := j + 1
PAR
pass1 (xyin,xd[0],xd[1],rd)
pass(xd[1],xd[2],xd[3])
delayo(xd[3],xd[4])
malt(xd[2],yd[0])
delay(yd[0],yd[1])
add(xyout,xd[0],yd[1],xd[4],rd)
SEQ
cell1 (xyin,xyout,a0)
PROTOCOL three IS REAL32 ; REAL32 ; REAL32 :
PROC system2(CHAN OF three xyin,xyout,
VAL INT time,
VAL REAL32 a0)
PROC cell2(CHAN OF three xyin,xyout,
VAL REAL32 a)
[5]CHAN OF REAL32 xd:
[2]CHAN OF REAL32 yd:
CHAN OF REAL32 rd:
PROC malt ( CHAN OF REAL32 xin,yout)
-- Similar to Proc malt in cell 1
PROC add ( CHAN OF three xyout,
CHAN OF REAL32 pin,yin,xin,rin)
-- Similar to Proc add in cell 1
PROC pass ( CHAN OF REAL32 zin,zout,xout)
-- Similar to Proc pass in cell 1
PROC pass1 (CHAN OF three xyin,
CHAN OF REAL32 yout,xout,rout)
-- Similar to Proc pass1 in cell 1
PROC delay ( CHAN OF REAL32 xin,xout)
-- Similar to Proc delay in cell 1
SEQ
PAR
pass1 (xyin,xd[0],xd[1],rd)
pass(xd[1],xd[2],xd[3])
delay(xd[3],xd[4])
malt(xd[2],yd[0])
delay(yd[0],yd[1])
add(xyout,xd[0],yd[1],xd[4],rd)
SEQ
cell2 (xyin,xyout,a0)
PROTOCOL three IS REAL32 ; REAL32 ; REAL32 :
PROC system3(CHAN OF three xyin,xyout,
VAL INT time,
VAL REAL32 a0)
VAL no.co IS 16:
PROC cell3(CHAN OF three xyin,xyout,
VAL REAL32 a)
[no.co+2]CHAN OF REAL32 xd:
[2]CHAN OF REAL32 yd:
CHAN OF REAL32 rd:
PROC malt ( CHAN OF REAL32 xin,yout)
-- Similar to Proc malt in cell 1
PROC add ( CHAN OF three xyout,
CHAN OF REAL32 pin,yin,xin,rin)
-- Similar to Proc add in cell 1
PROC pass ( CHAN OF REAL32 zin,zout,xout)
-- Similar to Proc pass in cell 1
PROC pass1 (CHAN OF three xyin,
CHAN OF REAL32 yout,xout,rout)
-- Similar to Proc pass1 in cell 1
PROC delay ( CHAN OF REAL32 xin,xout)
-- Similar to Proc delay in cell 1
PROC delayo ( CHAN OF REAL32 xin,bout)
-- Constant time delay proc.
REAL32 x:
[no.co+8] REAL32 b:
INT j:
SEQ
x := 0.0(REAL32)
SEQ k= 0 FOR no.co
b[k] := 0.0(REAL32)
j := 1
SEQ i= 0 FOR time
SEQ
SEQ
xin ? x
bout ! b[j]
b[j] := x
IF
(j > (no.co - 1))
j := 1
TRUE
j := j + 1
PAR
pass1 (xyin,xd[0],xd[1],rd)
pass(xd[1],xd[2],xd[3])
delayo(xd[3],xd[4])
malt(xd[2],yd[0])
delay(yd[0],yd[1])
add(xyout,xd[0],yd[1],xd[4],rd)
SEQ
cell3 (xyin,xyout,a0)
PROTOCOL three IS REAL32 ; REAL32 ; REAL32 :
PROC system4(CHAN OF three xyin,xyout,
VAL INT time,
VAL REAL32 a0)
[2]CHAN OF three xyc :
PROC cell4(CHAN OF three xyin,xyout,
VAL REAL32 a)
[4]CHAN OF REAL32 xd:
[2]CHAN OF REAL32 yd:
CHAN OF REAL32 rd:
PROC malt ( CHAN OF REAL32 xin,yout)
-- Similar to Proc malt in cell 1
PROC add ( CHAN OF three xyout,
CHAN OF REAL32 pin,yin,xin,rin)
-- Similar to Proc add in cell 1
PROC pass ( CHAN OF REAL32 zin,zout,xout)
-- Similar to Proc pass in cell 1
PROC delay ( CHAN OF REAL32 xin,xout)
-- Similar to Proc delay in cell 1
PROC pass1 (CHAN OF three xyin,
CHAN OF REAL32 yout,xout,rout)
-- x[0] is x1 input
-- x[1] is x1 output
-- r is x2 output to malt proc.
-- y is main output to add proc.
[2]REAL32 x:
REAL32 y,r:
SEQ
x[0] := 0.0(REAL32)
x[1] := 0.0(REAL32)
y := 0.0(REAL32)
r := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
PAR
SEQ
xyin ? x[0] ; y ; r
xout ! x[1]
yout ! y
rout ! r
x[1] := x[0]
PAR
pass1 (xyin,xd[0],xd[1],rd)
pass(xd[1],xd[2],xd[3])
malt(xd[2],yd[0])
delay(yd[0],yd[1])
add(xyout,xd[0],yd[1],xd[3],rd)
PROC final(CHAN OF three zyin,zyout)  -- cell 5
[4]CHAN OF REAL32 rd:
CHAN OF REAL32 yd,xd:
PROC delay ( CHAN OF REAL32 xin,xout)
-- Similar to Proc delay in cell 1
PROC pass1 (CHAN OF three xyin,
CHAN OF REAL32 yout,zout,rout)
REAL32 z:
REAL32 y,r:
SEQ
z := 0.0(REAL32)
y := 0.0(REAL32)
r := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
PAR
xyin ? z ; y ; r
zout ! z
yout ! y
rout ! r
PROC sqrt ( CHAN OF three xyout,
CHAN OF REAL32 pin,xin,rin)
[2]REAL32 z:
REAL32 r,p:
REAL32 x:
SEQ
z[0] := 0.0(REAL32)
z[1] := 0.0(REAL32)
r := 0.0(REAL32)
x := 0.0(REAL32)
p := 0.0(REAL32)
SEQ i= 0 FOR time
SEQ
SEQ
PAR
pin ? p
xin ? x
rin ? r
xyout ! p ; r ; z[1]
z[0] := (p*p)+(r*r)
z[1] := SQRT(z[0])
PAR
pass1 (zyin,yd,xd,rd[0])
delay(rd[0],rd[1])
sqrt(zyout,yd,xd,rd[1])
SEQ
PAR
cell4 (xyin,xyc[0],a0)
final (xyc[0],xyout)
-- Network configuration (4 transputers).
PLACED PAR
PROCESSOR 0 T8
PLACE xyin AT link1.in:
PLACE xyc[0] AT link3.out:
system1 (xyin,xyc[0],time,a0)
PLACED PAR
PROCESSOR 1 T8
PLACE xyc[0] AT link0.in:
PLACE xyc[1] AT link3.out:
system2 (xyc[0],xyc[1],time,a1)
PLACED PAR
PROCESSOR 2 T8
PLACE xyc[1] AT link0.in:
PLACE xyc[2] AT link3.out:
system3 (xyc[1],xyc[2],time,a2)
PLACED PAR
PROCESSOR 3 T8
PLACE xyc[2] AT link0.in:
PLACE xyout AT link3.out:
system4 (xyc[2],xyout,time,a3)
APPENDIX D
Occam Program for the Filter Library (PARC-IPL)
EXTERNAL proc abort.program :
EXTERNAL proc open.file(value path.name[], access[], chan io.chan) :
EXTERNAL proc close.file(chan io.chan) :
EXTERNAL proc str.to.chan(chan c, value s[]) :
EXTERNAL proc fp.num.to.chan(chan c, value float f) :
EXTERNAL proc fp.num.from.chan(chan c, var float f) :
EXTERNAL proc num.to.chan(chan c, value n) :
EXTERNAL proc num.from.chan(chan c, var n) :
EXTERNAL proc str.to.screen(value s[]) :
EXTERNAL proc fp.num.to.screen(value float f) :
EXTERNAL proc num.to.screen(value n) :
EXTERNAL proc fp.num.from.keyboard(var float f) :
EXTERNAL proc num.from.keyboard(var n) :
EXTERNAL proc s11 :
EXTERNAL proc s14 :
EXTERNAL proc s15 :
EXTERNAL proc s16 :
EXTERNAL proc s17 :
EXTERNAL proc s18 :
PROC system = VAR xo.f, no.f, run, vv, co, go, tr :
SEQ
str.to.screen("*n ______________________________________ ")
str.to.screen("*n I THIS IS AN OCCAM PROGRAM LIBRARY I")
str.to.screen("*n I I")
str.to.screen("*n I IMAGE PROCESSING I")
str.to.screen("*n I FILTER LIBRARY I")
str.to.screen("*n I I")
str.to.screen("*n I N.B. To exit from the system enter 99 I")
str.to.screen("*n I______________________________________I")
str.to.screen("*n ")
str.to.screen("*n I Have your input data at file name I")
str.to.screen("*n I [ image.in ] I")
str.to.screen("*n ")
str.to.screen("*n If you want to use this library enter 1 = ")
str.to.screen("*n I N.B. To exit from the system enter 99 I")
str.to.screen("*n ")
str.to.screen("*n Type filter number = ")
num.from.keyboard(no.f)
num.to.screen(no.f)
if
no.f = 1
seq
str.to.screen("*n _________________ ")
str.to.screen("*n I Laplacian filter No : 1 I")
s11
vv := 0
str.to.screen("*n I Your output data in file I")
str.to.screen("*n I [image1.out] I")
no.f = 2
seq
str.to.screen("*n _________________ ")
str.to.screen("*n I Gradient filter No : 2 I")
s14
vv := 0
str.to.screen("*n I Your output data in file I")
str.to.screen("*n I [image2.out] I")
no.f = 3
seq
str.to.screen("*n _________________ ")
str.to.screen("*n I Mean filter No : 3 I")
s15
vv := 0
str.to.screen("*n I Your output data in file I")
str.to.screen("*n I [image3.out] I")
no.f = 4
seq
str.to.screen("*n _________________ ")
str.to.screen("*n I Weighted mean filter No : 4 I")
s16
vv := 0
str.to.screen("*n I Your output data in file I")
str.to.screen("*n I [image4.out] I")
no.f = 5
seq
str.to.screen("*n _________________ ")
str.to.screen("*n I Inverse Gradient filter No : 5 I")
s17
vv := 0
str.to.screen("*n I Your output data in file I")
str.to.screen("*n I [image5.out] I")
no.f = 6
seq
str.to.screen("*n _________________ ")
str.to.screen("*n I Sigma filter No : 6 I")
s18
vv := 0
str.to.screen("*n I Your output data in file I")
str.to.screen("*n I [image6.out] I")
no.f = 99
vv := 2
true
seq
str.to.screen("*n Sorry, no such filter ")
str.to.screen("*n If you want to try again enter 1 ")
str.to.screen("*n else enter 99 ")
num.from.keyboard(vv)
if
vv = 0
seq
str.to.screen("*n Do you want to choose another filter ")
str.to.screen("*n if yes type 1 ")
str.to.screen("*n if no type 0 ")
num.from.keyboard(co)
if
co = 0
seq
str.to.screen("*n I Filter library exits I")
run := FALSE
TRUE
run := TRUE
vv = 1
run := TRUE
TRUE
seq
str.to.screen("*n I Filter library exits I")
run := FALSE
TRUE
seq
str.to.screen("*n I Filter library exits I") :
SEQ
system
APPENDIX E
Occam Program for Toeplitz System
(Double-Sided Systolic Array)
EXTERNAL proc abort.program :
EXTERNAL proc open.file(value path.name[], access[], chan io.chan) :
EXTERNAL proc close.file(chan io.chan) :
EXTERNAL proc str.to.chan(chan c, value s[]) :
EXTERNAL proc fp.num.to.chan(chan c, value float f) :
EXTERNAL proc fp.num.from.chan(chan c, var float f) :
EXTERNAL proc num.to.chan(chan c, value n) :
EXTERNAL proc num.from.chan(chan c, var n) :
EXTERNAL proc str.to.screen(value s[]) :
EXTERNAL proc fp.num.to.screen(value float f) :
EXTERNAL proc num.to.screen(value n) :
EXTERNAL proc fp.num.from.keyboard(var float f) :
EXTERNAL proc num.from.keyboard(var n) :
PROC cell1 (CHAN gin, gout, zout,
VALUE m,n, value float b,c) =
VAR float g,a,z :
seq
--initialisation
z:=0.0
seq j=[0 for m]
seq
zout ! z
seq
a:=c
z:=0.0
seq i=[0 for n]
seq
par
gin ? g
par
gout ! g
z := z+(a*g)
a:=a/b
z := z/(1.0-(c*b)):
PROC cell2 (CHAN gin, pin, pout,
VALUE m,n, value float b) =
VAR float g,p :
seq
seq j=[0 for m]
seq
pin ? p
seq i=[0 for n]
seq
par
gin ? g
p := g+(b*p)
par
pout ! p:
PROC cell3 (CHAN gin, pout,
VALUE m,n, value float cc) =
VAR float g,p :
seq
seq j=[0 for m]
seq i=[0 for n]
seq
par
gin ? g
p := g/cc
par
pout ! p:
proc delay (chan xin, xout,
value m,n) =
var float x[2] :
seq
par i=[0 for 2]
x[i] := 0.0
seq i=[0 for m]
seq j=[0 for n]
seq
par
xin ? x[0]
xout ! x[1]
x[1] := x[0]:
PROC cellt (CHAN xin, xout,
VALUE m,n,
value float b,c) =
CHAN xd[n+1],yd:
SEQ
par
cell1 (xin,xd[0],yd,m,n,b,c)
par i=[0 for n]
delay (xd[i],xd[i+1],m,n)
cell2 (xd[n],yd,xout,m,n,b):
PROC host1 (CHAN gaout,
VALUE m,n,
VAR float g[],b[]) =
SEQ
SEQ i=[0 for m]
SEQ j=[0 for n]
SEQ
gaout ! g[(n*i)+(n-(j+1))]:
PROC host2 (CHAN yin, yout,
VALUE m,n,
VAR float y[],b[],cc) =
SEQ
SEQ i=[0 for m]
SEQ
SEQ j=[0 for n]
SEQ
yin ? y[(n-(j+1))]
SEQ j=[0 for n]
SEQ
y[j]:=y[j]/cc
yout ! y[j]:
PROC host3 (CHAN zin,
VALUE m,n,q,
VAR float z[]) =
CHAN io:
VAR float x[n+1],y[n+1],zz:
VAR no,nn,ss:
SEQ
nn:=(q*n)\6
no:=((q*n)+(6-nn))/6
ss:=(q*n)
open.file("ss1.out","w",io)
str.to.chan(io,"*c *n Output")
str.to.chan(io,"*c *n x(n) = ")
SEQ i=[0 for m-q]
SEQ j=[0 for n]
SEQ
PAR
zin ? z[j]
SEQ i=[0 for q]
SEQ
str.to.chan(io,"*c *n ")
SEQ j=[0 for n]
SEQ
PAR
zin ? z[(n-(j+1))]
SEQ j=[0 for n]
SEQ
fp.num.to.chan(io,z[j])
str.to.chan(io," ")
str.to.chan(io,"*c *n ")
close.file(io):
PROC sssystem (CHAN gc[],zc[],
VALUE time,m,n,q) =
CHAN zpc:
CHAN io:
VAR FLOAT g[time],z[time],y[time],b[q],c[q],cc[4],a1,l1,l2,l3:
SEQ
SEQ j=[0 for q]
SEQ
str.to.screen("*n Give b[j] = ")
fp.num.from.keyboard(b[j])
fp.num.to.screen(b[j])
str.to.screen("*n cc = ")
fp.num.from.keyboard(cc[0])
fp.num.to.screen(cc[0])
str.to.screen("*n g[j] = ")
g[0]:=9.328125
g[1]:=7.78125
g[2]:=10.546875
g[3]:=14.0625
g[4]:=16.828125
g[5]:=15.28125
g[6]:=9.328125
g[7]:=7.78125
g[8]:=10.546875
g[9]:=14.0625
g[10]:=16.828125
g[11]:=15.28125
g[12]:=9.328125
g[13]:=7.78125
g[14]:=10.546875
g[15]:=14.0625
g[16]:=16.828125
g[17]:=15.28125
seq j=[0 for n]
seq
str.to.screen(" ")
fp.num.to.screen(g[j])
seq j=[0 for q]
seq
c[j]:=b[j]
seq i=[0 for n-2]
c[j]:=c[j]*b[j]
PAR
host1 (gc[0],m,n,g,b)
par i=[0 for q]
cellt(gc[i],gc[i+1],m,n,b[i],c[i])
host2 (gc[q],gc[q+1],m,n,y,b,cc[0])
par i=[1 for q]
cellt(gc[q+i],gc[q+(i+1)],m,n,b[(q)-i],c[(q)-i])
host3 (gc[(2*q)+1],m,n,q,z):
PROC system (CHAN gc[],zc[])=
VAR time,rimn,m,n,q,r:
SEQ
str.to.screen("*n Give the total number of r.s.c = ")
num.from.keyboard(r)
num.to.screen(r)
str.to.screen("*n Give the total number of q = ")
num.from.keyboard(m)
num.to.screen(m)
str.to.screen("*n Give the total number of rows = ")
num.from.keyboard(n)
num.to.screen(n)
str.to.screen("*n ")
rimn:=(n+2)\5
q:=m-1
r:=r-1
m:=((m-1)*2)+(1+r)
time:=m*n
sssystem (gc,zc,time,m,n,q):
--main
CHAN gc[19],zc[9] :
SEQ
system (gc,zc)