Towards a General Framework for FPGA Based Image Processing

using Hardware Skeletons

K Benkrid, D Crookes and A Benkrid

School of Computer Science, The Queen’s University of Belfast, Belfast BT7 1NN, UK

(K.Benkrid, D.Crookes, A.Benkrid)@qub.ac.uk

Abstract

In this paper, we present our approach to developing a general framework for FPGA based

Image Processing. This framework is based on a library of Hardware Skeletons. A hardware

skeleton is a parameterised description of a task-specific architecture. A skeleton’s

implementation will apply optimisations specific to the target hardware. The library

normally contains a range of alternative skeletons for the same task, perhaps tailored for

different data representations. The library also contains high level skeletons for compound

operations, whose implementation can apply appropriate optimisations. Given a complete

algorithm description in terms of skeletons, an efficient hardware configuration is generated

automatically. We have developed a library of hardware skeletons for common image

processing tasks, with optimised implementations specifically for Xilinx XC4000 FPGAs.

This paper presents and illustrates our hardware skeleton approach in the context of some

common image processing tasks. It demonstrates our approach to the broader problem of

achieving optimised hardware configurations while retaining the convenience and rapid

development cycle of an application-oriented, high level programming model.

Keywords: FPGA, Coprocessor, Hardware Skeletons, Image Processing, High Level Programming.

1. Introduction

Many modern image processing applications (such as processing video and very large

images) are so computationally demanding that special purpose hardware solutions need to be

considered. Reconfigurable hardware in the form of FPGAs can offer the performance

advantages of a custom hardware solution, while their reprogrammability makes them

multi-purpose and reusable. However, a big disadvantage is the low level, hardware-oriented

programming model needed to get the most from the FPGA’s potential performance.

Despite the great amount of research done on FPGAs, many FPGA-based applications

have been algorithm specific [1][2][3]. An environment for developing applications needs

more than just a library of static FPGA configurations, perhaps parameterisable (e.g. in terms

of input data wordlength), since it should allow the user to experiment with alternative

algorithms and develop their own. There is a need to bridge the gap between

high level application-oriented software and low level FPGA hardware. Many behavioural

synthesis tools [4][5][6] have been developed to satisfy this requirement. These tools allow

the user to program FPGAs at a very high level (e.g. in a C-like syntax) without having to

deal with low level hardware details (e.g. scheduling, allocation, pipelining etc.). However,

although behavioural synthesis tools have developed enormously [7][8], structural design

techniques often still result in circuits that are substantially smaller and faster than those

developed using only behavioural synthesis tools [9][10].

The aim of this work is to provide a framework for developing efficient hardware

solutions specifically for image processing applications. This framework gives the benefits of

an application-oriented, high level programming model without significantly sacrificing

the performance of the solution. Our approach is to use a concept which has proved

relatively successful in developing parallel software, namely skeletons [11][12][13].

Skeletons are reusable, parameterised fragments or frameworks to which the user can supply

components (e.g. functions). It is common for skeletons to include functions as parameters

which are applied by the skeleton to a data set. The implementation of a skeleton is normally

optimised for a specific target machine.

In this paper we introduce the concept of hardware skeletons. A hardware skeleton is

a parameterised description of a task-specific architecture, to which the user can supply

parameters such as values, functions (parameterised functional blocks) or even other

skeletons. In this sense, a skeleton is like a class, from which specific instances can be

created. Certain combinations of basic skeletons can form the basis of additional, higher level

skeletons. The concept grew out of our experience in image processing (IP), where we have

noticed that many IP operations can be assembled from common arrangements of basic

image operations to which known optimisations can be applied. Hardware skeletons are

conceptually similar to the software skeletons of Cole and others [11][12][13], although the intricacies

of hardware implementation are inherently different from those in software (e.g. buffer

sharing, synchronisation of operations with different word lengths etc.). Considerable work

has been done on problems associated with multiple FPGA implementations (e.g. the ArMen

project [14]). However, with current FPGA chip densities crossing the 10 million gate

barrier, it is increasingly possible to implement very sophisticated algorithms on one FPGA

chip. The work presented in this paper targets a single-chip FPGA machine. Other

researchers have addressed the issue of dynamic reconfiguration (e.g. ARDOISE project

[15]). Our current system targets Xilinx XC4000 FPGAs, which do not allow for dynamic

reconfiguration. From a compilation point of view, our approach is different from any other

work we are aware of. Indeed, the use of the notion of hardware skeletons, specific to the

application domain in hand, is novel to the hardware domain. Also, the use of a rule-based

language (Prolog) to apply task-specific optimisations as well as target-hardware-specific

optimisations is novel. To illustrate this, the paper first identifies a suitable application-

oriented model for describing image processing operations. The common basic tasks which

we identify will form the basis of a library of core skeletons. Next, we outline the strategy

which the system employs to generate efficient FPGA configurations from a given operation

description. The implementation of the hardware skeleton library will then be presented. A

practical example will then be given to demonstrate our approach.

2. An application oriented description model for IP operations

Many image processing operations can be described in terms of a Directed Acyclic Graph

(DAG), where vertices represent IP tasks, and the directed edges represent the data flow (see

Figure 1).

Nodes are typically simple tasks such as adding two input images, or an image

convolution. Common IP tasks can be classified in terms of the locality of their data access

requirements into three categories:

• Point operations: The same operation is applied to each individual pixel of one or many

source images to produce a corresponding result pixel in the new image. These include:

relational operations (e.g. ‘≥’, ‘≤’, ‘=’), arithmetic operations (e.g. ‘+’, ‘-’, ‘*’, ‘÷’),

logical operations (e.g. ‘AND’, ‘OR’ ) and Look-Up tables. The operation could either be

between two images or between an image and a scalar value.

• Neighbourhood operations: In neighbourhood operations, a new pixel value is

calculated using only the pixel values in the neighbourhood of the original pixel and the

weights in a window (e.g. convolution). This is done for all image pixels, and results in a

new image. A neighbourhood operation is completely defined by a window (with given shape

and coefficients) and a two-stage operation: first a local operation between corresponding

pixels and window values (e.g. multiplication), then a global operation (e.g. accumulation)

which reduces the window of intermediate results to a single result pixel [16].

• Global operations: These operations operate globally on the whole image. We can

distinguish two types:

- Reduction to Scalar (RS): These operate on the whole image to produce a scalar as a

result. Examples include count, global maximum, global minimum and global

accumulation (Σ).

- Reduction to Vector (RV): This operation operates on the whole image to produce a

vector as a result. These include histogramming and cumulative histogramming.
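As an illustration only, the point and global categories above can be modelled in software as follows. This is a minimal Python sketch of the behaviour, not of any hardware architecture; the function names `point_op`, `reduce_to_scalar` and `histogram` are hypothetical and do not come from our library.

```python
def point_op(op, a, b):
    """Point operation: apply op pixel-wise.

    b is either an image of the same shape as a, or a scalar value.
    """
    if isinstance(b, list):
        return [[op(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
    return [[op(x, b) for x in row] for row in a]


def reduce_to_scalar(op, image):
    """Reduction to Scalar (RS): fold every pixel into a single value."""
    pixels = [p for row in image for p in row]
    result = pixels[0]
    for p in pixels[1:]:
        result = op(result, p)
    return result


def histogram(image, levels):
    """Reduction to Vector (RV): count the occurrences of each grey level."""
    counts = [0] * levels
    for row in image:
        for p in row:
            counts[p] += 1
    return counts


image = [[1, 2], [3, 2]]
thresholded = point_op(lambda p, t: p >= t, image, 2)   # image-scalar relational op
doubled = point_op(lambda x, y: x + y, image, image)    # image-image arithmetic op
total = reduce_to_scalar(lambda x, y: x + y, image)     # global accumulation
peak = reduce_to_scalar(max, image)                     # global maximum
hist = histogram(image, 4)
```

Note that the relational point operation returns a boolean image, whereas the arithmetic one returns an integer image, which matches the two pixel types considered in this work.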

The properties of an item of data (represented by an edge in the DAG) are of two kinds:

• Data type

This is defined by two properties:

- Structure: could be an image, a vector or a scalar.

- Pixel type: which, for the purpose of this work, could be either an integer or a boolean.

• Data representation

A particular data representation is defined by three properties:

- The data could be in bit serial, in bit parallel with an associated word size, or in digit

serial representation with particular digit and word sizes.

- If data is in bit serial (or digit serial), it can then be processed either MSB (or MSD) First

or LSB (or LSD) First.

- Number System which, for the purpose of this work, could be one of unsigned integer, 2’s

complement, or Signed Digit (SD) number representation [17][18][19].

Note that binary representation corresponds to bit parallel with a word size of one (denoted as

parallel(1)). Online arithmetic is digit serial SD MSD first.
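The edge properties listed above can be captured in a small record. The following Python sketch is one possible encoding, given purely for illustration; the class names, field names and string values are assumptions of this sketch and are not the notation used inside our system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataType:
    structure: str    # "image", "vector" or "scalar"
    pixel_type: str   # "integer" or "boolean"


@dataclass(frozen=True)
class Representation:
    form: str                          # "bit_serial", "bit_parallel" or "digit_serial"
    number_system: str                 # "unsigned", "twos_complement" or "signed_digit"
    word_size: Optional[int] = None
    digit_size: Optional[int] = None   # digit serial only
    order: Optional[str] = None        # "ms_first" or "ls_first", serial forms only


def is_online(rep: Representation) -> bool:
    """Online arithmetic: digit serial, Signed Digit, most significant digit first."""
    return (rep.form == "digit_serial"
            and rep.number_system == "signed_digit"
            and rep.order == "ms_first")


# parallel(1): binary representation. Treating it as unsigned is an
# assumption of this sketch.
BINARY = Representation("bit_parallel", "unsigned", word_size=1)

# A radix-2 online representation; the 13-bit word length is an example value.
ONLINE = Representation("digit_serial", "signed_digit", word_size=13,
                        digit_size=1, order="ms_first")
```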

A node with a particular set of logical Inputs/Outputs could be implemented by a

range of different possible implementations as illustrated for the ‘Absolute value’ operation

in Figure 2. It is normal (but not compulsory) for the input and output representations to be

the same.

The Hardware Skeleton Library will contain parameterised descriptions of architectures for

the full range of basic operations (nodes), possibly with different versions for

different data representation combinations.

3. Implementation strategy

The user’s first task will be to represent the algorithm in terms of a DAG, without initially

being concerned with data type or data representation considerations (see Figure 3). Once

this is done, an analysis of the properties of the input and output data formats of the nodes

will identify a range of possible implementations of each node. For instance, the result of an

N-bit integer image comparison operation could be either an N-bit integer image or a (1-bit)

binary image. The choice will depend on subsequent processing of the result image, and on

what skeletons are available. As a first step, the user should consider the set of all possible

implementations. The library of Hardware Skeletons (e.g. neighbourhood

operations, point operations, etc.), in which each component has a set of different

implementations (e.g. bit serial, bit parallel), is the basis of this phase. The implementations

of the library components are optimised for specific target architectures (e.g. bit parallel

adder units based on dedicated fast carry logic on Xilinx 4000). The range of possible

implementations generated for a particular IP algorithm depends on the extent of this library.

To select the optimum skeleton from the set of possible choices, the cost of each

candidate skeleton needs to be found. The system can estimate or calculate area costs

(in terms of CLBs) and latency costs (in cycles) for all operations. However, accurate speed

information can only come from the Xilinx tools after generating the FPGA configuration for

each option including the application of the optimisations associated with each skeleton. The

subsequent choice given these costs is currently made manually. This cost based analysis

enables the user to settle on a final DAG with all attributes (data type and representation)

defined. The corresponding FPGA implementation is finally generated, in the form of an EDIF

netlist [20], for the chosen solution. This is performed by a Prolog based Hardware

Description Environment, called HIDE4k, developed at Queen’s University [10][21][22]. The

latter enables highly scaleable and parameterised component descriptions to be written, and

generates pre-placed configurations in EDIF format for Xilinx XC4000 series [23]. The

resulting EDIF file is finally fed to Xilinx Placement And Routing (PAR) tools to generate

the FPGA configuration bitstream. The use of a rule-based generator (written in Prolog)

allows for the application of task-specific optimisations. It also allows for the application of

optimisations specific to the target hardware. Hence the dual requirement of high level

description and efficiency can be met.

Note that during the process of implementing a DAG, the following issues arise:

• Data representation conversion

Since many data representations might be used within the DAG, data representation

converters may be needed to convert between different representations (e.g. from bit serial to

bit parallel, or from Signed Digit to two’s complement etc.).

• Data synchronisation

When there are two or more inputs to a DAG node (vertex), any branch that arrives earlier

than the others should be forced to wait for the slowest branches by adding appropriate delays

to the fastest branches. This is performed automatically by our system so that the user does

not have to deal with low level data synchronisation issues.
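The data synchronisation step can be sketched as a single pass over the DAG in topological order. The Python fragment below is a minimal illustration of the idea under stated assumptions, not the code used by our system: the function name `balance_delays`, the node names and the latency values are all hypothetical.

```python
def balance_delays(nodes, edges, latency):
    """Compute the extra delay needed on each edge of a DAG.

    nodes   -- node names in topological order
    edges   -- list of (source, destination) pairs
    latency -- dict mapping node name to its latency in clock cycles
    Returns (ready, extra): the cycle at which each node's output is ready,
    and the number of delay stages to insert on each edge.
    """
    ready = {}
    extra = {}
    for n in nodes:
        incoming = [e for e in edges if e[1] == n]
        arrival = {e: ready[e[0]] for e in incoming}
        latest = max(arrival.values(), default=0)
        for e, t in arrival.items():
            extra[e] = latest - t          # early branches wait for the slowest
        ready[n] = latest + latency[n]
    return ready, extra


# Two branches with different (made-up) latencies meeting at an adder.
nodes = ["in", "conv_a", "conv_b", "add"]
edges = [("in", "conv_a"), ("in", "conv_b"), ("conv_a", "add"), ("conv_b", "add")]
latency = {"in": 0, "conv_a": 5, "conv_b": 6, "add": 1}
ready, extra = balance_delays(nodes, edges, latency)
```

In this example the branch through `conv_a` arrives one cycle early, so one delay stage is inserted on the edge from `conv_a` to `add`.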

As a result, the user’s programming model is merely the set of hardware skeletons

provided by the Hardware Skeleton Library. These skeletons can be accessed either textually

(header) or even more conveniently by interacting with a GUI.

4. Implementing the Hardware Skeleton Library

We implemented our Hardware Skeleton Library as a hierarchy of three levels of hardware

blocks. At the bottom level lies the arithmetic cores library (see Figure 4). This provides

arithmetic units (e.g. adders, multipliers) parameterised for different number representations

(e.g. bit serial, bit parallel, 2’s complement, unsigned etc.). Immediately above this

level, we find the basic image operations library. The latter provides implementations for the

basic image operations presented in section 2 above (e.g. basic neighbourhood operations).

Finally, the top level provides implementations for high level (compound) skeletons.

Users supply the desired parameters (e.g. arithmetic type, window coefficients, pixel

word length etc.) in a query, and the search of the library is performed by Prolog’s pattern

matching mechanism. The following sections present each of these three levels in more detail.

4.1 Arithmetic cores library

This library provides the basic building blocks required for image processing operations (and

signal processing in general). It includes adders, multipliers, dividers, shifts and delays. Note

that the basic functions required for nearly any signal processing operation include

addition/subtraction, shifts and delays. These blocks can then be used to construct the more

complicated structures such as multipliers, dividers and maximum/minimum selectors.

Versions of these cores are provided for different number representations. At the time of

writing, the following number representations are supported:

• Bit parallel (N bits), 2’s complement

• Bit serial, 2’s complement, Most Significant Bit (MSB) First

• Bit serial, 2’s complement, Least Significant Bit (LSB) First

• Bit serial, Signed Digit, MSB First

The implementation of these cores is optimised for a specific target architecture (XC4000

FPGAs for our particular case study). This should take advantage of the particular features of

the target architecture (e.g. 4 input LUTs, synchronous RAMs, dedicated fast carry logic for

XC4000). The core descriptions are held in HIDE4k with rules for core-specific

optimisations as part of the core. For instance, a constant coefficient multiplication will apply

Canonical Signed Digit (CSD) coding of the multiplier coefficient to reduce the hardware consumed [24][25].

Such optimisations are often not performed by behavioural synthesis tools.
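To make the CSD optimisation concrete, the following Python sketch recodes an integer coefficient into canonical signed digit form (digits in {-1, 0, 1} with no two adjacent non-zero digits). It is the standard textbook recoding, shown for illustration; it is not extracted from the HIDE4k rules, and the function names are hypothetical. Each non-zero digit corresponds to one shifted copy of the input that must be added or subtracted, so fewer non-zero digits means fewer adders.

```python
def csd_digits(n):
    """Return the canonical signed digit form of integer n.

    Digits are in {-1, 0, 1}, least significant digit first.
    """
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)   # +1 if n mod 4 == 1, -1 if n mod 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def nonzero_count(digits):
    """Number of add/subtract terms a shift-and-add multiplier would need."""
    return sum(1 for d in digits if d != 0)


binary_ones = bin(119).count("1")           # plain binary: 1110111
csd_terms = nonzero_count(csd_digits(119))  # CSD: 128 - 8 - 1
```

For the coefficient 119 the plain binary form has six non-zero bits, whereas the CSD form has three non-zero digits.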

4.2 Basic image operations library

This library provides implementations of the basic image operations presented in section 2.

Consider the case of basic neighbourhood operations. As mentioned in section 2, a

neighbourhood operation is completely defined by a window, a local operation and a global

operation. Local operations include multiplication and addition. Global operations include

accumulation, maximum and minimum. These combine to form the five basic neighbourhood

operations of Image Algebra, as shown in Table 1 [16].

Figure 5 gives the architecture of a generic PxQ neighbourhood operation with a local

operation L and a global one G. This architecture is parameterisable or scaleable in terms of

[26]:

- The window size (PxQ)

- The window coefficients

- The image size (line buffer size δLB)

- The pixel wordlength

- The local and global operations (L and G)

- The number representation (arithmetic type)

A generic description of a neighbourhood operation would then be given by:

neighbourhood_op(Arithmetic_type, Local_op, Global_op, Window, pixel_wordlength, Image_Size)

Our HIDE4k system is capable of generating pre-placed FPGA architectures in EDIF format

from such a generic description. A ~30K line EDIF description is generated in 1-2 seconds. The

resulting architectures are tailored to the particular neighbourhood operation in hand. Their

performance (speed and area) rivals that obtained with a careful hand design [10].
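The behaviour that such a generic description denotes can be expressed as a short software reference model. The Python sketch below is illustrative only: it mirrors the Local_op, Global_op and Window parameters, but it ignores the arithmetic type and wordlength, computes results only where the window fits entirely inside the image, and uses `None` for "don't care" window positions. These choices, and the function name, are assumptions of the sketch rather than properties of the generated hardware.

```python
def neighbourhood_op(local_op, global_op, window, image):
    """Software reference model of a generic PxQ neighbourhood operation."""
    P, Q = len(window), len(window[0])
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - P + 1):
        out_row = []
        for c in range(cols - Q + 1):
            # Local stage: combine each pixel with its window coefficient.
            terms = [local_op(image[r + i][c + j], window[i][j])
                     for i in range(P) for j in range(Q)
                     if window[i][j] is not None]
            # Global stage: reduce the intermediate results to one pixel.
            acc = terms[0]
            for t in terms[1:]:
                acc = global_op(acc, t)
            out_row.append(acc)
        out.append(out_row)
    return out


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
mul = lambda p, w: p * w
add = lambda a, b: a + b

convolution = neighbourhood_op(mul, add, [[1, 0], [0, 1]], image)    # (*, accumulate)
additive_max = neighbourhood_op(add, max, [[0, 0], [0, 0]], image)   # (+, Max)
```

Swapping the local and global operators in this way yields the other Image Algebra neighbourhood operations without changing the surrounding structure, which is the property the hardware skeleton exploits.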

4.3 High level (compound) skeletons library

This library contains efficient implementations of a set of compound skeletons. These

compound skeletons result from the process of identifying, by experience, common ways of

assembling primitive operations and providing optimised implementations of these. To

demonstrate this concept, we present one example of such a compound skeleton. More

examples, together with the complete content of the Hardware Skeleton Library, are

provided in [10].

High level skeleton example: parallel neighbourhood operations

A number of common image processing algorithms comprise several concurrent

neighbourhood operations which share the same input image, and whose templates have the

same size and shape (see Figure 6). The Sobel, Prewitt, Roberts and Kirsch edge detectors [27]

are examples of such operations.

The result images are typically combined in some way (e.g. by adding, or finding the

maximum of corresponding result pixels). In this case, instead of allocating separate line

buffers for each neighbourhood operation to synchronise the supply of pixels for all

operations, only one set of line buffers is needed. This is because all neighbourhood

operations are applied to the same image. This reduces area, though potentially at the expense

of maximum speed. The parallel neighbourhood operations can then be replaced by one

compound neighbourhood operation as shown in Figure 7 for the case of two parallel

neighbourhood operations, where Li and Gi (i = 1, 2) are the local and global operations

respectively, and Ai,j and Bi,j are the window coefficients of the first and second operations respectively.

Note that an extra pipeline stage (δthru = 1) has been added to the second neighbourhood to

speed up the FPGA implementation. This skew will be compensated at a subsequent

operation, if necessary, as discussed under ‘data synchronisation’ in section 3.

This skeleton can be found in the Hardware Skeleton Library, where different

implementations are available in the form of bit serial two’s complement LSBF, online

arithmetic and bit parallel based implementations [10].
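The area benefit of sharing line buffers can be estimated with simple arithmetic. The Python sketch below assumes that a P-row window needs P-1 line buffers and that each line buffer stores roughly one image line of pixels at the processing word length; it ignores the pixel delays held inside the processing elements, so the figures it produces are rough storage estimates and not measured CLB counts. The function name is hypothetical.

```python
def line_buffer_bits(window_rows, image_width, word_length, num_ops=1, shared=True):
    """Approximate line buffer storage, in bits, for num_ops parallel
    neighbourhood operations that use windows of window_rows rows."""
    sets = 1 if shared else num_ops
    return sets * (window_rows - 1) * image_width * word_length


# Two parallel 3x3 operations on a 256-pixel-wide image, 13-bit words.
separate = line_buffer_bits(3, 256, 13, num_ops=2, shared=False)
shared = line_buffer_bits(3, 256, 13, num_ops=2, shared=True)
```

Under these assumptions, two operations sharing one set of line buffers need half the buffer storage of two separately buffered operations.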

5. Implementation strategy illustration: Sobel edge detection

The Sobel edge detection algorithm is one of the most commonly used techniques for edge

detection [27]. It can be performed (approximately) by adding the absolute results of two

separate convolutions (for horizontal and vertical edge strengths) as shown in Figure 8.
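As a point of reference, the approximate Sobel operation just described can be written as a short software model. The Python sketch below is illustrative and is not generated by our system; it writes the "don't care" window positions as zero coefficients, computes results only where the 3x3 window fits inside the image, and reads each pixel of the window once for both convolutions, in the spirit of the compound skeleton of section 4.3.

```python
SOBEL_H = [[-1, -2, -1],
           [ 0,  0,  0],
           [ 1,  2,  1]]
SOBEL_V = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]


def sobel(image):
    """Approximate Sobel edge strength: |horizontal| + |vertical|."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - 2):
        out_row = []
        for c in range(cols - 2):
            gh = gv = 0
            for i in range(3):
                for j in range(3):
                    p = image[r + i][c + j]   # one pixel read feeds both windows
                    gh += p * SOBEL_H[i][j]
                    gv += p * SOBEL_V[i][j]
            out_row.append(abs(gh) + abs(gv))
        out.append(out_row)
    return out


step_edge = [[0, 0, 255, 255] for _ in range(4)]   # vertical step edge
flat = [[7, 7, 7] for _ in range(3)]               # constant image
edges = sobel(step_edge)
```

With 8-bit pixels, each convolution result lies between -1020 and 1020, since the positive and negative coefficients of each window sum to 4 and -4 respectively. This range fits in an 11-bit two's complement word.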

In the following, we will present two possible FPGA implementations of the Sobel

operation, both based on bit serial arithmetic. The first implementation is

based on online arithmetic using Radix-2 Signed Digit number representation. This choice of

arithmetic is motivated by the fact that an Absolute operation is needed after a convolution

operation. This operation is naturally performed MSB first. Hence the choice of Most

Significant Bit First arithmetic. Further, since a convolution is also involved, we need to use

carry free arithmetic to be able to perform addition MSB first. Hence the choice of Signed

Digit representation. The second implementation is based on two’s complement arithmetic

LSBF for performing the two convolutions. Absolute operations will be performed in bit

parallel using dedicated carry logic. A serial to parallel converter is hence needed.
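The need for a serial to parallel converter in the second implementation can be seen from a small behavioural model. In LSB-first two's complement, the sign bit is the last bit to arrive, so the absolute value cannot be started until the whole word has been received. The Python sketch below illustrates this; it is a behavioural model with hypothetical function names, not a description of the converter circuit.

```python
def to_lsbf(value, n):
    """Serialise a signed integer as n two's complement bits, LSB first."""
    return [(value >> i) & 1 for i in range(n)]


def lsbf_to_parallel(bits):
    """Assemble an LSB-first two's complement bit stream into a signed integer."""
    value = 0
    for i, b in enumerate(bits):
        value |= b << i
    if bits[-1]:                 # the sign bit is the last one to arrive
        value -= 1 << len(bits)
    return value


def parallel_abs(bits):
    """Absolute value taken after the full word is available in parallel."""
    return abs(lsbf_to_parallel(bits))


stream = to_lsbf(-3, 11)         # an 11-bit word, as used for the Sobel results
magnitude = parallel_abs(stream)
```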

In both cases, the circuits are assembled by selecting the appropriate skeletons from

the library. The corresponding FPGA configurations, with placement information, are

generated automatically by the HIDE4k system.

• Online arithmetic based implementation

In this case, the Sobel circuit is assembled by selecting skeletons which use online arithmetic.

For 8-bit input pixels, the minimum processing word length is 13 bits (because of the latency

of the online adder). A floorplan of the resulting architecture for a 256x256 image of 8-bit

pixels on an XC4036EX-2 (36x36 CLBs) is presented in Figure 9. The circuit occupies 475

CLBs. Timing simulation shows that the circuit can run at 75MHz, which leads to a

theoretical frame rate of 88 frames per second. From experience, we note that a speed of

75MHz on an XC4036EX-2 for this sort of operation is a very good figure.
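The reason an absolute value is natural in most-significant-digit-first Signed Digit arithmetic can be shown with a behavioural model: the sign of a radix-2 SD number is the sign of its first non-zero digit, so the digits can be passed through or negated as they arrive. The Python sketch below illustrates this property only; it is an assumption-laden model with hypothetical function names and is not a description of the absolute value unit in our library.

```python
def sd_value(digits):
    """Value of a radix-2 signed digit number, most significant digit first."""
    v = 0
    for d in digits:
        v = 2 * v + d
    return v


def sd_abs_msd_first(digits):
    """Absolute value computed digit by digit, most significant digit first.

    digits -- sequence of digits in {-1, 0, 1}
    """
    sign = 0
    out = []
    for d in digits:
        if sign == 0 and d != 0:
            sign = d              # first non-zero digit fixes the overall sign
        out.append(d * sign if sign else 0)
    return out


x = [0, -1, 1, 0, 1]              # -8 + 4 + 1 = -3
abs_x = sd_abs_msd_first(x)
```

No digit has to be buffered before the output can start, which is what makes this operation a good fit for online arithmetic.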

• Two's complement LSBF based implementation

In this case, the Sobel circuit is assembled using skeletons which are implemented using 2's

complement LSBF arithmetic. For 8-bit input pixels, the precision required in this case is

only 11 bits. As stated above, in order to perform the ‘absolute’ operation in bit parallel, the

two serial LSBF convolution outputs need first to be converted into bit parallel. The final

addition is also performed in bit parallel. The latter is based on dedicated fast carry logic. A

floorplan of the resulting architecture for a 256x256 image of 8-bit pixels on an XC4036EX-2 is

presented in Figure 10. The circuit occupies 369 CLBs, over 100 CLBs fewer than

the online arithmetic based implementation. This is because of the reduced line buffer space,

since the required precision is just 11 bits (instead of 13 bits for the online based

implementation). Timing simulation shows that the circuit can run at 75MHz,

which leads to a theoretical frame rate of 104 frames per second. Again, we note that such a

speed is a very good figure on an XC4036EX-2. Clearly, this solution is more efficient, in both

area and frame rate, than the online arithmetic based solution.
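The two theoretical frame rates can be reproduced with a short calculation, assuming that a bit serial circuit consumes one bit per clock cycle, so that each pixel takes as many cycles as the processing word length. This assumption and the function name are ours for the purpose of the sketch; the clock rate, word lengths, image size and CLB counts are the figures quoted above.

```python
def frame_rate(clock_hz, word_length, width, height):
    """Theoretical frames per second for a bit serial circuit that
    processes one bit per clock cycle."""
    return clock_hz / (word_length * width * height)


online = frame_rate(75e6, 13, 256, 256)   # online arithmetic, 13-bit words
lsbf = frame_rate(75e6, 11, 256, 256)     # 2's complement LSBF, 11-bit words
clb_saving = 475 - 369                    # area difference in CLBs
```

At the same 75MHz clock, the shorter 11-bit word accounts for the whole difference in frame rate between the two implementations.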

6. Summary

In this paper, we have presented a framework for FPGA based Image Processing. Central to

this framework is the Hardware Skeleton Library which contains a set of high level

descriptions of task-specific architectures specifically optimised for Xilinx XC4000 FPGAs.

The library also contains high level skeletons for compound operations, whose

implementations include task-specific optimisations. Skeletons are parameterisable, and

different skeletons for the same operation can be provided, for instance for different

arithmetic representations. This gives the user a range of implementation choices. This in

turn supports experimentation with different implementations and choosing the most suitable

one for the particular constraints in hand (e.g. speed and area). We are investigating the

possibility of doing some of this experimentation automatically, but for now we do it

manually. Given a complete algorithm description in terms of skeletons, an efficient

hardware configuration is generated automatically by our system. The paper shows that

hardware skeletons are a promising approach to satisfy the dual requirement of achieving

very efficient hardware configurations while retaining the convenience and rapid

development cycle of an application-oriented, high level programming model.

Future directions include upgrading the system to handle other FPGA series (particularly

Xilinx Virtex chips). The extension of the hardware skeleton library, both in supporting more

arithmetic types and providing other skeletons for more sophisticated image processing

operations (wavelet transform in particular), is being investigated. The automation of the

process of selecting the appropriate implementation among different alternative solutions is

also the subject of future work.

[Figure 1 diagram. Labels: Input1 (e.g. image) and Input2 (e.g. image); IP tasks (e.g. two Convolution nodes followed by a ‘+’ node, linked by image edges); Output (e.g. image, histogram etc.).]

Figure 1. A hypothetical image processing algorithm modelled as a DAG

[Figure 2 diagram. (a) ‘Absolute value’ node with Int input and Int output; (b) ‘Absolute value’ with bit serial SD, MSDF input and output; (c) ‘Absolute value’ with bit serial 2’s complement, MSBF input and output; (d) ‘Absolute value’ with bit parallel 2’s complement input and output.]

Figure 2. A DAG node (a) with several possible implementations (b), (c) and (d)

[Figure 3 diagram. Labels: DAG with logical data types; Solution generation; A DAG set of available implementations; Optimisation; A DAG set of optimised implementations; Cost Based Analysis; DAG with specific data representation choices; Code Generator; EDIF; Xilinx PAR tools; Bitstream; Xilinx XC4000 FPGA; Hardware Skeleton Library; HIDE4k System.]

Figure 3. Overall view of our implementation strategy

[Figure 4 diagram. Three levels: Arithmetic Cores Library; Basic Image Operations Library (e.g. neighbourhood operations); High Level (compound) Skeletons Library, leading to the Image Processing Application Developer.]

Figure 4. Hierarchical implementation of the Hardware Skeleton Library

Neighbourhood Operation     Local Op.    Global Op.

Convolution                 *            Σ

Multiplicative maximum      *            Max

Multiplicative minimum      *            Min

Additive maximum            +            Max

Additive minimum            +            Min

Table 1. Image Algebra core operation set

[Figure 5 diagram. Labels: Line Buffer 1 to Line Buffer P-1 (each of delay δLB); pixel delays (δ); Processing Elements PE 1 to PE Q*P, each containing a local operation L and a global operation G.]

Figure 5. Architecture of a generic PxQ neighbourhood operation using P.Q Processing

Elements (PEs)

[Figure 6 diagram. Neighbourhood operations Nop-1, Nop-2, ..., Nop-N.]

Figure 6. Parallel neighbourhood operations sharing the same input image

[Figure 7 diagram. Labels: Line Buffer 1 to Line Buffer P-1; Compound Processing Elements CPE 1 to CPE Q*P, each containing L1 and G1 with coefficient Ai,j, L2 and G2 with coefficient Bi,j, and delays δCPE, δCPE + δW and δthru; δW: pixel delay.]

Figure 7. Architecture of a generic 2D, compound PxQ neighbourhood operation using P.Q

Compound Processing Elements (CPEs)

[Figure 8 diagram. Two convolutions with the windows

-1 -2 -1        -1  ~  1

 ~  ~  ~        -2  ~  2

 1  2  1        -1  ~  1

each followed by an Absolute operation, then an Image-Image addition. The two convolutions are also shown grouped as a single skeleton (see Figure 6).]

Figure 8. Sobel edge detection algorithm

[Figure 9 floorplan. Labelled regions: Input; Line Buffers; 9 Compound Processing Elements; Extra delay for data synchronisation; two Absolute value units; Adder unit; SDNR to binary converter; Output.]

Figure 9. Physical configuration of ‘Sobel’ on XC4036EX-2, using online arithmetic

[Figure 10 floorplan. Labelled regions: Line Buffers; 9 Compound Processing Elements; two Serial to Parallel converters; two Parallel Absolute value units; Parallel Adder.]

Figure 10. Physical configuration for ‘Sobel’ on XC4036EX-2, using 2’s complement LSBF

7. References

[1] Kean T, New B and Slous B, ‘A 800 MPixel/sec Reconfigurable Image Correlator on

XC6216’, Proceedings of the 7th International Workshop on Field Programmable Logic

and its Applications, FPL’ 97, pp 382-391.

[2] Heron J, Trainor D, and Woods R, 'Implementation of the 2D DCT using a Xilinx

XC6264 FPGA', IEEE Proceedings on the Workshop on Signal Processing systems,

SiPS'97, IEEE Press, pp541-550.

[3] Reza Ali M, Turney Robert D, ‘FPGA implementation of 2D wavelet transform’,

Proceedings of the Asilomar Conference on Signals, Systems and Computers, Vol. 1,

pp 584-588, 1999.

[4] Synopsys Inc., ‘Behavioural Compiler’, Software documentation, 1998.

http://www.synopsys.com/products/beh_syn/

[5] C Level Design Inc, ‘C/C++ Synthesis System Compiler’, Product overview, 1998

http://www.cleveldesign.com/products/

[6] The Embedded Solutions Limited, ‘Handel C information sheets’, 1999

http://www.embeddedsol.com

[7] Scott Smith and David Black, ‘Pushing the Limits with Behavioral Compiler’,

Synopsys Inc, 1999.

http://www.synopsys.com/products/beh_syn/bc_compaq_wp.pdf

[8] Zahir Jaffer and Bryan Piotto, ‘Experiences Using Behavioral Synthesis on an ATM

Traffic & Queue Management ASIC’, Synopsys Inc, 1999

http://www.synopsys.com/products/beh_syn/bc_nortel_wp.pdf

[9] Hutchings B, Bellows P, Hawkins J, Hemmert S, Nelson B and Rytting M, ‘A CAD

suite for High-Performance FPGA design’, FCCM’99, Preliminary Proceedings.

[10] Benkrid K, ‘Design and Implementation of a High Level FPGA Based Coprocessor for

Image and Video Processing’, PhD Thesis, Department of Computer Science, The

Queen's University of Belfast, 2000.

http://www.cs.qub.ac.uk/~K.Benkrid/MyThesis.html

[11] Cole M, ‘Algorithmic Skeletons: structured management of parallel computation’, MIT

Press, 1989.

[12] Darlington J, Ghanem M, and To H W, 'Structured Parallel Programming', In

Programming Models for Massively Parallel Programming Computers, IEEE Computer

Society Press, pp. 160-169, Sept 1993.

[13] Michaelson G J, Scaife N R, and Wallace A M, 'Prototyping parallel algorithms in

Standard ML', Proceedings of British Vision Conference, Sep 1995.

ftp://ftp.cee.hw.ac.uk/pub/funcprog/msw.bmvc95.ps.Z

[14] P. Dhaussy, J.-M. Filloque, B. Pottier, and S. Rubini. ArMen: an FPGA-based parallel

architecture. In H.J. Siegel, editor, International Parallel Processing Symposium

(Parallel System Fair), Cancún, Mexico, April 1994.

http://ubolib.univ-brest.fr/~armen/armen1-eng.html

[15] The ARDOISE project, “Architecture Reconfigurable Dynamiquement Orientée Image

et Signal Embarquable”.

http://www-etis.ensea.fr/Francais/AxesArchi/ai/dardoise/ardoise11.html

[16] Ritter G X, Wilson J N and Davidson J L, ‘Image Algebra: an overview’, Computer

Vision, Graphics and Image Processing, No 49, pp 297-331, 1990.

[17] Avizienis A, ‘Signed Digit Number Representation for Fast Parallel Arithmetic’, IRE

Transactions on Electronic Computer, Vol. 10, pp 389-400, 1961.

[18] Moran J, Rios I and Meneses J, ‘Signed Digit Arithmetic on FPGAs’, More FPGAs, W

Moore and W Luk (editors), Abington, EE&CS Books, pp 250, 1994.

[19] Ercegovac M D, ‘On-line arithmetic: an overview’, SPIE Vol. 495, Real time signal

processing VII, pp 86-93, 1984.

[20] Crawford J D, ‘EDIF: A Mechanism for the Exchange of Design Information’, IEEE

Design and Test of Computers, Vol. 2, No. 1, pp 63-69, 1984.

[21] Crookes D, Alotaibi K, Bouridane A, Donachy P and Benkrid A, ‘An Environment for

Generating FPGA Architectures for Image Algebra-based Algorithms’, ICIP98, Vol.3,

pp. 990-994, 1998.

[22] Benkrid K, Crookes D, Bouridane A, Corr P and Alotaibi K, ‘A High Level Software

Environment for FPGA Based Image Processing’, Proc. IPA'99, IEE Seventh

International Conference on Image Processing and its Applications, Manchester, pp.

112-116, 1999.

[23] Xilinx Ltd, XC4000E and XC4000X Series Field Programmable Gate Arrays -Product

Specification, 1999.

http://www.xilinx.com/partinfo/4000.pdf

[24] Hwang K, ‘Computer Arithmetic: Principles, Architecture, and Design’, Wiley, 1979.

[25] Koren I, ‘Computer arithmetic algorithms’, Prentice-Hall, Inc, pp. 99-126, 1993.

[26] Crookes D, Benkrid K, Bouridane A, Alotaibi K and Benkrid A, ‘Design and

Implementation of a High Level Programming Environment for FPGA Based Image

Processing’, IEE proceedings: Vision, Image and Signal Processing, Vol. 147, No. 7,

pp. 377-384.

[27] Castleman K R, ‘Digital Image processing’, Prentice Hall, 1995, ISBN: 0132114674.
