the implementation of multidimensional discrete transforms for digital signal processing
Recommended Citation: Wu, Hong Ren, The implementation of multidimensional discrete transforms for digital signal processing, Doctor of Philosophy thesis, Department of Electrical and Computer Engineering, University of Wollongong, 1990. http://ro.uow.edu.au/theses/1353
THE IMPLEMENTATION OF MULTIDIMENSIONAL DISCRETE
TRANSFORMS
FOR
DIGITAL SIGNAL PROCESSING
A thesis submitted in fulfilment of the requirements for the award of the degree
DOCTOR OF PHILOSOPHY
from
THE UNIVERSITY OF WOLLONGONG
by
WU, HONG REN, B.E., M.E.
THE DEPARTMENT OF ELECTRICAL
AND COMPUTER ENGINEERING.
FEBRUARY 1990.
"Entertaining someone with fish, you could only serve him once, but if you teach him the art of fishing, it will serve him for a lifetime."
—An ancient Chinese wise man and philosopher.
CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF ACRONYMS AND SYMBOLS

CHAPTER ONE: INTRODUCTION
1-1 Introduction to Multidimensional Digital Signal Processing
1-2 Applications
1-3 History and New Achievements in Fast Signal Processing Algorithms
1-4 Objectives
1-5 Thesis Review and Contributions
1-6 Publications, Submitted Papers and Internal Technical Reports

PART I. MULTIDIMENSIONAL DISCRETE FOURIER TRANSFORMS

CHAPTER TWO: 1-D DISCRETE FOURIER TRANSFORM AND FAST FOURIER TRANSFORM ALGORITHMS
2-1 Definitions
2-2 Matrix Representations for 1-D Cooley-Tukey FFT Algorithms
2-3 Computational Considerations
2-4 Summary

CHAPTER THREE: 2-D DFT AND 2-D FFT ALGORITHMS
3-1 Introduction to 2-D Discrete Fourier Transforms
3-2 Definitions
3-3 Row-Column FFT Algorithms
3-4 Vector Radix FFT Algorithms
3-5 Matrix Representations for 2-D Vector Radix FFT Algorithms
3-6 Structure Theorems
3-7 Structural Approach via Logic Diagrams
3-8 2-D Vector Split-Radix FFT Algorithms
3-9 Comparisons of Various 2-D Vector Radix FFT Algorithms
3-10 Vector Radix FFT Using FDP™ A41102
3-11 Summary

CHAPTER FOUR: A PERSPECTIVE ON VECTOR RADIX FFT ALGORITHMS OF HIGHER DIMENSIONS
4-1 Definitions
4-2 Matrix Representations and Structure Theorems
4-3 Diagrammatical Presentations
4-4 Computing Power Limitations

PART II. MULTIDIMENSIONAL DISCRETE COSINE TRANSFORMS

CHAPTER FIVE: INTRODUCTION TO MULTIDIMENSIONAL DISCRETE COSINE TRANSFORMS
5-1 Definitions of 1-D DCT and Its Inverse DCT
5-2 Definitions of 2-D DCT and Its Inverse DCT
5-3 Applications of 2-D DCTs in Image Compression
5-4 2-D Indirect Fast DCT Algorithms

CHAPTER SIX: 2-D DIRECT FAST DCT ALGORITHMS
6-1 2-D Direct Fast DCT Algorithm Based on Lee's Method
6-1-1 1-D Lee's algorithm in matrix form
6-1-2 Derivation of 2-D fast DCT algorithm from Lee's algorithm
6-2 2-D Direct Fast DCT Algorithm Based on Hou's Method
6-2-1 1-D Hou's algorithm in matrix form
6-2-2 Derivation of 2-D fast DCT algorithm from Hou's algorithm
6-3 Comparison of Arithmetic Complexity of Various DCT Algorithms
6-4 Comparison of Computation Structures of 2-D Direct VR DCTs and VR FFTs
6-5 Summary

CHAPTER SEVEN: HARDWARE IMPLEMENTATION OF 2-D DCTS FOR REAL-TIME IMAGE CODING SYSTEMS
7-1 Description of Hardware Implementation of Modified 2-D Makhoul DCT Algorithm Using FDP™ A41102
7-2 Discussion of 2-D DCT Image Coding Systems Using VLSI Digital Signal Processors

CHAPTER EIGHT: THE EFFECTS OF FINITE-WORD-LENGTH COMPUTATION FOR FAST DCT ALGORITHMS
8-1 Introduction
8-2 Simulation Design
8-2-1 Structure of the simulation program
8-2-2 Error model for the basic computation structure
8-2-3 DCT in infinite-word-length
8-2-4 Data collection
8-3 Simulation Results
8-3-1 Floating-point computation of 1-D DCTs
8-3-2 Floating-point computation of 2-D DCTs
8-4 Summary

CHAPTER NINE: CONCLUSIONS
9-1 Conclusions
9-2 Suggestions for Future Research

BIBLIOGRAPHY

APPENDIX A: PRELIMINARY BACKGROUND ON THE TENSOR (KRONECKER) PRODUCT AND THE LOGIC DIAGRAM
APPENDIX B: PROOF OF STRUCTURE THEOREMS
APPENDIX C: THE COMBINED FACTOR METHOD
APPENDIX D: DERIVATION OF VECTOR RADIX 2-D FAST DCT BASED ON LEE'S ALGORITHM
APPENDIX E: ARITHMETIC COMPLEXITY OF THE VECTOR SPLIT-RADIX DIF FFT ALGORITHM
ACKNOWLEDGEMENTS
The author wishes to express his deepest appreciation to his Supervisor, Dr. F.J.
Paoloni, Associate Professor of the Department of Electrical and Computer Engineering,
The University of Wollongong, for his guidance, support and encouragement and also for
his understanding and confidence in the author throughout this research. His professional
and optimistic attitude towards the research have made this research challenging,
interesting, productive and enjoyable.
The author wishes to thank Professor Huang, Ruji, of the Department of Industrial
Automation, University of Science and Technology, Beijing (formerly Beijing University
of Iron and Steel Technology), who, as his Master's Supervisor, had a great influence on
shaping the author's research skills and abilities as an independent as well as cooperative
researcher.
Sincere thanks are also extended to Professor B.H. Smith who introduced the
author to this Institution and made this study possible in the first place.
The author wishes to thank the following people for their generous help, patience
and useful discussions at various stages of this program: Dr. G.W. Trott and Dr. T.S.
Ng, Department of Electrical and Computer Engineering; Mr. I.C. Piper, Computer
Services; Mr. J.K. Giblin, formerly with Computer Services and now with Network
Technical Services, B.H.P. Steel International Group; Mr. G. Andersson, Computer
Services; Dr. N. Smyth and Dr. K.G. Russell, Department of Mathematics, The
University of Wollongong; Professor J.H. McClellan, School of Electrical Engineering,
Georgia Institute of Technology, formerly with Schlumberger Well Services; Dr. J.D.
O'Sullivan, Dr. D.J. McLean, Dr. C.E. Jacka and Mr. K.T. Hwa, Division of Radio
Physics, CSIRO in Epping, New South Wales; Dr. M.J. Biggar and Dr. W.B.S. Tan,
Telecom Research Laboratories (Australia); Professor K.R. Rao, Department of Electrical
Engineering, The University of Texas at Arlington; Professor M. Vetterli, Department of
Electrical Engineering, Columbia University; Mr. P. Single, Austek Microsystems
(Australia); Dr. M.A. Magdy, Mr. J.F. Chicharo and Mrs. C. Quinn, Department of
vii
Electrical and Computer Engineering, The University of Wollongong; Mr. P.J. Costigan
and all technical staff in the Department.
The author is deeply grateful to his friend and English teacher Mrs. B. S. Perry for
her generous help and professional assistance in the author's understanding of English
and Australian culture, and her and her husband's, Mr. E.J.W. Perry, understanding,
friendship and encouragement which made the author's stay in Wollongong worthwhile,
most pleasant and enjoyable.
The assistance from Miss M.J. Fryer, of the Department of Electrical and
Computer Engineering, The University of Wollongong, throughout this research and
particularly in reading the final manuscript of this thesis is warmly appreciated.
Financial support received from the Department of Electrical and Computer
Engineering and the Committee of Post-Graduate Study, The University of Wollongong,
by means of the Departmental Teaching Fellowship and the Post-Graduate Research
Scholarship respectively, which made this research possible, is sincerely acknowledged.
Financial support from the Australian Telecommunication and Electronics Research Board
and from Telecom Research Laboratories (Australia) through R&D contract No. 7066 is
also acknowledged.
Finally, the author wishes to express his deepest gratitude to Mei Mei, his best
friend, colleague and wife, without whose patience, understanding, appreciation and
continuous support, encouragement and inspiration, this work would not have been
accomplished. The continuous support and understanding from his parents, from whom
he has been separated for the cause is also greatly appreciated.
ABSTRACT
A structural approach to the construction of multidimensional vector radix fast
Discrete Fourier Transforms (DFTs) and fast direct Discrete Cosine Transforms (DCTs)
is presented in this thesis. The approach features the use of matrix representation of one-
dimensional (1-D) and two-dimensional (2-D) FFT and fast DCT algorithms along with
the tensor product, and the use of logic diagram and rules for modifications.
In the first part of the thesis, the structural approach is applied to construct 2-D
Decimation-In-Time (DIT), Decimation-In-Frequency (DIF) and mixed (DIT & DIF)
vector radix FFT algorithms from corresponding 1-D FFT algorithms by the Cooley-
Tukey method. The results are summarized in theorems as well as examples using logic
diagrams. It has been shown that the logic diagram (or signal flow graph), as well as
being a form of representation and interpretation of fast algorithm equations, is a
stand-alone engineering tool for the construction of fast algorithms. The concept of "vector
signal processing" is adapted into the logic diagram representation which reveals the
structural features of multidimensional vector radix FFTs and explains the relationships
and differences between the row-column FFT, the vector radix FFT reported previously
and the approach presented in this thesis. The introduction of the structural approach
makes the formulation of a multidimensional vector radix FFT algorithm of high radix and
dimension easy to evaluate and implement by both software and hardware.
The hardware implementation of 2-D DFTs is discussed in the light of vector radix
FFTs using the Frequency Domain Processor (FDP™) A41102, which has shown an
improvement in reducing system complexity over the traditional row-column method.
With the help of the structural approach, the vector split-radix DIF FFT algorithm,
mixed (DIT & DIF) vector radix FFT and Combined Factor (CF) vector radix FFT
algorithms are presented whereby a comparison study is made in terms of arithmetic
complexity. The approach is then generalized to vector radix FFTs of higher dimensions.
Two vector radix DCT algorithms are presented in the second part of the thesis.
Although the one based on Lee's approach was reported by Haque using a direct matrix
derivation method, it is derived independently by the author using the structural approach.
The other vector radix DCT algorithm is based on Hou's method. The arithmetic
complexities of these two algorithms are considered as well as various other known row-
column DCT algorithms. The computation structures of 2-D vector radix direct fast DCT
algorithms are discussed in comparison with those of 2-D vector radix FFT algorithms.
A correction to the system description of Hou's DIT fast DCT algorithm is presented as
a result of the analysis of the algorithm's computation structure.
The system design of the 2-D modified Makhoul algorithm using the FDP A41102
provides yet another solution to the real-time 2-D DCT image coding problem. The
effects of finite-word-length computation of DCT using various direct fast algorithms are
studied by computer simulation for the purpose of transform coding of images. The
results are also presented in the thesis.
LIST OF ACRONYMS AND SYMBOLS

ASSP: Acoustics, Speech, and Signal Processing
AUSTEK: Austek Microsystems Proprietary Inc. and Austek Microsystems Proprietary Ltd.
BF: ButterFly computational structure of fast transform algorithms
CCITT: International Telegraph and Telephone Consultative Committee
CF: Combined Factor method
CSIRO: the Commonwealth Scientific and Industrial Research Organization
DCT: Discrete Cosine Transform
DCTd: the DCT output sequence with double-precision (64-bit floating-point) computation
DCTf: the DCT output sequence with finite-word-length (32-bit floating-point or fixed-point) computation
DFT: Discrete Fourier Transform
DIF: Decimation-In-Frequency
DIT: Decimation-In-Time
DSP: Digital Signal Processor, or Digital Signal Processing
FDP: Frequency Domain Processor
FFT: Fast discrete Fourier Transform algorithm(s)
FIR: Finite-extent Impulse Response
HDTV: High Definition Television
IIR: Infinite-extent Impulse Response
ISDN: Integrated Services Digital Networks
inmos: a part of the SGS THOMSON Microelectronics Group
m-D: multi-Dimensional
M/A: Multiplier/Accumulator, or Multiply/Accumulate
MIT: Massachusetts Institute of Technology
Ms/s: Million samples per second
NMR: Nuclear Magnetic Resonance
RMFFT: Reduced Multiplications Fast discrete Fourier Transform algorithm(s)
SGS THOMSON: SGS THOMSON Microelectronics Group
SNR: Signal to Noise Ratio
TM: Twiddling Multiplications of fast transform algorithms
TRW: TRW LSI Products Inc.
VLSI: Very Large Scale Integrated circuits
VR: Vector Radix
VSP: Vector Signal Processor
VSR: Vector Split-Radix
WFTA: Winograd Fourier Transform Algorithm(s)
Zoran: Zoran Corporation

α, β, ...: small Greek letters are used for transform coefficients throughout the thesis
B: matrix of the butterfly computation structure outlining the Cooley-Tukey FFT
B̂: butterfly matrix of Lee's fast DCT algorithm
B̃: butterfly matrix of the vector radix fast DCT algorithm based on Lee's method
C_2N^(2n+1)k: cos[(2n+1)kπ/(2N)]
C(k1,k2): 2-D DCT sequence in the 2-D indirect fast DCT algorithm
C: 1-D DCT matrix
C^-1: inverse 1-D DCT matrix
Ĉ: denormalized 1-D DCT matrix
Ĉ^T: transpose of the denormalized 1-D DCT matrix
e_n, e_c, e_o: roundoff errors
F: matrix for the 1-D radix-r twiddling multiplications of a length-N DIT FFT
F̃: matrix for the 2-D vector radix-r1*r2 twiddling multiplication structure of a length-N1*N2 VR DIT FFT
matrix for the 1-D radix-r twiddling multiplications of a length-N DIF FFT
matrix for the 2-D vector radix-r1*r2 twiddling multiplication structure of a length-N1*N2 VR DIF FFT
matrix for the 1-D radix-r butterfly structure of a length-N DIT FFT
matrix for the 2-D vector radix-r1*r2 butterfly structure of a length-N1*N2 VR DIT FFT
matrix for the 1-D radix-r butterfly structure of a length-N DIF FFT
matrix for the 2-D vector radix-r1*r2 butterfly structure of a length-N1*N2 VR DIF FFT
j: imaginary unit
N: length of the transform
diag.[·, ·, ..., ·]
multiplication matrix of Lee's fast DCT algorithm
multiplication matrix of the vector radix fast DCT algorithm based on Lee's method
number of addition operations required by the transform •
number of multiplication operations required by the transform •
pre- or post-calculation matrix of Lee's fast DCT algorithm
pre- or post-calculation matrix of the vector radix fast DCT algorithm based on Lee's method
diag.[1/2, 1, ..., 1]
sample variance
recursive denormalized DCT matrix used in Hou's fast DCT algorithm
diag.[1/2, 1/2, ..., 1/2]
T: matrix of twiddle multiplications outlining the Cooley-Tukey FFT
X_: vector of a multi-dimensional transform sequence
x_: vector of a multi-dimensional data sequence
X(k): transform sequence
x(m): data sequence
X: vector of the 1-D transform sequence, [X(0), X(1), ..., X(N-1)]
x: vector of the 1-D data sequence, [x(0), x(1), ..., x(N-1)]
X̃(k) and X̂(k): denormalized 1-D DCT sequences
X̃ and X̂: vectors of the denormalized 1-D DCT sequence
X̃_ and X̂_: vectors of the denormalized 2-D DCT sequence
X̄_i: sample mean of a random variable
v(n1,n2): rearranged 2-D sequence for the indirect DCT
V(k1,k2): 2-D discrete Fourier transform of v(n1,n2)
W_N^km: exp(-2πjkm/N)
W_N: 1-D discrete Fourier transform matrix
W_N^-1: 1-D inverse discrete Fourier transform matrix
⊗: tensor (or Kronecker) product
*: multiply
+: add
CHAPTER ONE: INTRODUCTION
1-1 Introduction to Multidimensional Digital Signal Processing
Only after the advent of the modern electronic computer has multidimensional (m-D)
signal processing become a reality. It has attracted more and more research interest as
integrated circuits have become faster, cheaper and more compact [1]. It covers a large
research area including image processing, computer-aided tomography, image
compression and image coding, multidimensional Finite-extent Impulse Response (FIR)
filtering, multidimensional Infinite-extent Impulse Response (IIR) filtering, beamforming,
multidimensional spectrum analysis and estimation, radar detection, seismic signal
processing, biomedical signal processing, etc. While multidimensional signal processing,
as its name implies, deals with all signals whose dimensionality is equal to or greater
than two, at present two-dimensional and three-dimensional problems are of practical
concern [2].
Although multidimensional signal processing is an extension of one-dimensional
signal processing, it brings its own problems associated with the huge amount of
data involved, which makes implementation a difficult issue. More complicated
mathematics is required, which can be more arduous to comprehend. It also offers a
greater degree of freedom, providing a variety of solutions to a single problem.
These difficulties make multidimensional signal processing a very complicated task, and
also motivate research in mathematics, algorithms and implementation. Practical solutions
to these problems are based on the development of modern technology (particularly
computer technology) and raise the future requirements on the technology front.
On the whole, as in one dimensional signal processing, there are two basic
approaches to multidimensional signal processing problems. One is the spatial (or
original) domain approach, and the other is the frequency (or transform) domain approach
[1, 3-8]. They are two mathematical representations of the same natural world.
Although they are equally powerful, one can be more appealing than the other in certain
applications. This thesis focuses on the transform domain approach, the implementation
of multidimensional Discrete Fourier Transforms (DFTs) and Discrete Cosine Transforms
(DCTs) in particular.
1-2 Applications
The Fourier transform theory has played an important role in multidimensional
signal processing [1, 7, 8] and will continue to be a topic of interest in theoretical, as well
as applied, work in this field [9-12]. Mathematical fundamentals of multidimensional
Fourier transforms have been thoroughly examined, and many fast algorithms have been
proposed. The introduction of vector processors [61], VLSI vector signal processors
[13], VLSI FFT processors [14, 15, 142], systolic array processors [140, 141, 145,
147], Single Instruction Multiple Data (SIMD) [130] and Very Long Instruction Word
(VLIW) supercomputers [129] makes the implementation of multidimensional Fourier
transforms more practical than ever before.
The multidimensional Fourier transform finds its application in the 2-D context,
such as image enhancement (smoothing, edge detecting), image restoration, image
compression and encoding, image description [4, 8], radar detection [134, 137], 2-D FIR
filter implementation and design [1] and invariant object recognition [10]. The 3-D
Fourier transform is required in nuclear magnetic resonance imaging algorithms [9], 3-D
tomo-synthesis [146] and in the construction of 3-D microscopic-scale objects to remove
out-of-focus noise [16]. Multidimensional Fourier transforms used for simultaneous
time-spatial or spatial-frequency representation in computer vision and pattern analysis provide
better tools for pattern analysis and a better understanding of dynamic patterns in the
visual system [2, 10, 11, 12].
Almost a decade after the introduction of the Cooley-Tukey Fast Fourier Transform
(FFT) algorithm, the Discrete Cosine Transform was first introduced into digital signal
processing for the purposes of pattern recognition and Wiener filtering in 1974 [17].
It soon led to a vast range of engineering applications. In the multidimensional
context, the two-dimensional (2-D) DCT is used for image compression and transform
coding of images [3, 4] in telecommunications such as video-conferencing, video
telephony, video image compression for High Definition Television (HDTV), block
structure/distortion in image coding, activity classification in transform coding, surface
texture analysis, tomographic classification, photovideotex, pattern recognition,
progressive image transmission, printed image coding and applications in fast packet
switching networks [18, 19, 157]. The DCTs can be implemented by fast algorithms
with either software or hardware, and render almost optimal performance that is virtually
indistinguishable from that of the Karhunen-Loeve Transform [3, 17] in terms of energy
packing ability and decorrelation efficiency. Various VLSI DCT processors have also
been reported and demonstrated recently for video coding applications [74, 75, 87-89,
111, 143].
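The energy-packing property mentioned above is easy to observe numerically. The sketch below is an illustration in NumPy, not code from the thesis: it builds the orthonormal 1-D DCT-II matrix, applies it to a smooth 8*8 block in the separable row-column form C·x·C^T, and measures how much of the block's energy lands in the lowest-order coefficients.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal 1-D DCT-II matrix C, so that X = C @ x and C @ C.T = I."""
    k = np.arange(n).reshape(-1, 1)   # frequency index
    m = np.arange(n).reshape(1, -1)   # sample index
    c = np.cos((2 * m + 1) * k * np.pi / (2 * n))
    c[0, :] /= np.sqrt(2)             # DC-row normalization
    return c * np.sqrt(2.0 / n)

# A smooth 8x8 image block: a constant level plus a gentle horizontal ramp.
block = 100.0 + np.tile(np.arange(8.0), (8, 1))

# 2-D DCT via the separable row-column form.
C = dct_matrix(8)
coeffs = C @ block @ C.T

# Energy packing: for smooth blocks almost all energy lands in a few
# low-order coefficients, which is what makes DCT transform coding work.
energy = coeffs ** 2
packing = energy[:2, :2].sum() / energy.sum()
print(packing)   # very close to 1.0 for a smooth block
```

For highly textured blocks the ratio drops, which is consistent with transform coders spending more bits on such blocks.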
1-3 History and New Achievements of Fast Signal Processing
Algorithms
Reviewing the history of a research and study area provides a perspective which
generally benefits future research and study. A review of the study of FFT algorithms in
digital signal processing has particular significance.
Great engineering power has its deep roots in mathematics. Applications of
research achievements rely on the development of relevant technology. The development
of technology motivates further research in conveying more mathematical wonders into
application. The gap has to be bridged by a proper approach and a form of representation
which are attractive to the engineering society.
The history of FFT algorithms did not begin with Good, or Thomas, or Danielson, or
Lanczos, or even Runge [20]. It can be traced back to the great German mathematician
Carl Friedrich Gauss (1777-1855) [21]. But it has only become an important engineering
concern since the advent of the modern electronic digital computer, through the
fundamental work laid by Cooley and Tukey [22] and by those who have helped to give this
mathematical curiosity an engineering interpretation and eventually to convert it into an
engineering power [5, 23, 24]. It has been said that the rediscovery of the FFT algorithm
was one of the saviours of the predecessor of the IEEE Acoustics, Speech, and Signal
Processing Society [23] and marked the beginning of modern digital signal processing [6,
31, 135]. The Cooley-Tukey FFT algorithm, in addition to being widely used because it
came first, owes much to its simple structure, a structure which is appealing to the
engineering community. The representation of the FFT by the so-called butterfly signal flow
graph [24, 25] fits nicely into the newly released VLSI FFT processor, the Frequency
Domain Processor (FDP™) A41102 [14, 15, 26-29]. Some problems can be timeless
and solutions to them can be discovered, and rediscovered, again and again.
Representation also is of vital importance for each step of the conversion from research
achievement to engineering application. One of the tasks required of scientific researchers
is to demystify and clarify, not mystify.
FFT algorithms also have their roots in Abelian semi-simple algebras, by which the
mathematical structure of the FFT is revealed. These algebras provide an explanation
of how various FFT algorithms are devised. Many attempts have been made to convey
this mathematical result to the engineering community [29-34]. When the mathematical
structure of a process is well understood, many fast algorithms for it can be constructed
systematically. Starting from their 1-D counterparts, the tensor (or Kronecker) product
has been used successfully to generate multidimensional Winograd Fourier transform
algorithms [35] and prime factor FFT algorithms [32].
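To make the role of the tensor product concrete, the small sketch below (a NumPy illustration, not code from the thesis) checks numerically that the Kronecker product of two 1-D DFT matrices is exactly the 2-D DFT operator acting on row-major-flattened data, which is the identity underlying such constructions.

```python
import numpy as np

def dft_matrix(n):
    """1-D DFT matrix W_N with entries exp(-2*pi*j*k*m/N)."""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n)

N1, N2 = 4, 8
x = np.random.default_rng(0).standard_normal((N1, N2))

# 2-D DFT computed the usual separable way (transform rows, then columns).
X_rc = dft_matrix(N1) @ x @ dft_matrix(N2).T

# The same transform as one matrix: the tensor (Kronecker) product of the
# two 1-D DFT matrices applied to the row-major-flattened data.
W_2d = np.kron(dft_matrix(N1), dft_matrix(N2))
X_kron = (W_2d @ x.reshape(-1)).reshape(N1, N2)

print(np.allclose(X_rc, X_kron))   # True
```

Factoring each 1-D matrix into its fast (butterfly) form inside the Kronecker product is what yields structured m-D fast algorithms rather than the dense matrix used here for checking.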
In [31], a matrix form is introduced to represent the vector radix-2*2 and -4*4
Decimation-In-Time (DIT) FFT algorithms. However, the tensor product is used as a
form of representation for the 2-D VR-4*4 FFT algorithm rather than as a tool for the
construction of the VR FFT algorithm from its 1-D counterpart [31]. Many
multidimensional fast algorithms are constructed using a direct derivation method [36-38].
It is worth noting that in most of the published literature, multidimensional algorithms are
still described in a 1-D diagrammatical representation form by the traditional butterfly
signal flow graph, which can become over-complicated in the multidimensional case. When
problems are extended, new representation forms have to be found to dispel the
mystery behind mathematical structures, which are sometimes quite complicated (or
abstract).
In the history of fast signal processing algorithms, the basic issues, which are
associated with evaluating the effectiveness of an algorithm from the outset, have been:
(1) reduction of arithmetic complexity;
(2) reduction of round-off errors and errors due to the quantization of the
coefficients;
(3) in-place computation; and,
(4) possession of a regular computation structure.
Three of the above four points (point 2 excluded) are associated with processing
speed, which is a major engineering concern. An algorithm which does not possess
in-place computation or a regular computational structure will require more bookkeeping
and indexing operations, which affect the processing speed.
In the early years, multiplications were more time-consuming than additions and
other types of operations (data transfer, for instance) on general purpose computers.
Reducing the number of multiplications became the focal point of the evaluation of fast
algorithms. As a result, a group of FFT algorithms, called the reduced multiplications
FFT (RMFFT) algorithms, was introduced [39], including the prime factor algorithm,
Winograd Fourier Transform Algorithm (WFTA) and polynomial transforms. Many of
these were obtained at the expense of more additions and loss of regular computation
structure. However, the introduction of Digital Signal Processors (DSPs) and the
development of VLSI technology, Application Specific Integrated Circuits (ASIC)
technology in particular, have changed this tradition dramatically; now an addition (or
even a load of data) takes about the same time to complete as a multiplication on some
processors [40]. The issue is no longer just the reduction of the number of multiplications
but of the total number of operations. Fast algorithms which do not possess in-place
computation or do not have a regular structure will be at a disadvantage, as they have to
pay a severe cost in loading, storing and copying data and in other indexing tasks [39, 41]. In
systolic array implementations of DFT and FFTs, emphasis has been on modularity,
pipelining and parallelism, and simple, regular and local communication
structures [140, 141, 145, 147] (apart from the area*time^2 criterion commonly used for
VLSI designs).
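The shift of emphasis described above, from multiplication counts alone to the total operation count, can be illustrated with the textbook figures for a naive length-N DFT (N^2 complex multiplications, N(N-1) additions) and a radix-2 Cooley-Tukey FFT ((N/2)log2 N twiddle multiplications and N log2 N additions, counting trivial twiddles). A rough sketch, for illustration only:

```python
import math

def dft_ops(n):
    """Naive length-n DFT: n^2 complex multiplications, n*(n-1) additions."""
    return n * n, n * (n - 1)

def radix2_fft_ops(n):
    """Radix-2 Cooley-Tukey FFT (n a power of two): (n/2)*log2(n) twiddle
    multiplications and n*log2(n) additions, trivial twiddles included."""
    stages = int(math.log2(n))
    return (n // 2) * stages, n * stages

for n in (64, 1024, 4096):
    dm, da = dft_ops(n)
    fm, fa = radix2_fft_ops(n)
    print(f"N={n:5d}  DFT total ops: {dm + da:9d}   FFT total ops: {fm + fa:7d}")
```

On a processor where a multiplication and an addition cost the same, the totals (not the multiplication counts alone) are what predict the running time, which is the point made above.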
An algorithm is only fast when the hardware can take advantage of it [31]. A
theoretically fast algorithm may be even less effective than a "slow" algorithm on certain
processors. Some features of many fast algorithms, such as parallelism and pipelining
structure, still remain to be fully exploited [134, 137]. These algorithms will be many
times faster than they are now only when computer technology resolves the problems
which are associated with them. For example, VLSI implementation of FFT algorithms
is not limited to radix-2 or radix-4 butterflies. Full-length (up to 256 complex point)
CMOS and HMOS FFT processors (FDPs) have been reported, demonstrated [14, 15]
and are now commercially available, as mentioned previously. An FFT processor that
computes a 4096-point complex DFT in 102.4 μs with 22-bit floating-point arithmetic was
also reported in [142]. When the computation structure of the Cooley-Tukey algorithm
described by butterflies is made into a VLSI pipelining architecture, a 256 complex point
DFT can be achieved in about 102.4 μs (200 μs for the HMOS chip) on FDPs. Another
feature of the FDP A41102 is that an 8*8- or 16*16-point 2-D DFT can be accomplished
in one pass, although the row-column approach is used. This means a reduction in the
time for sweeping data. This places many digital signal processing applications using the
FFT into the real-time or pseudo-real-time processing category.
1-4 Objectives
As explained in the previous section, because of the nature of multidimensional
Digital Signal Processing (DSP), there are various multidimensional DSP algorithms from
which to choose, and the structures of these algorithms appear to be more complex than
those of their 1-D counterparts. Without an appropriate method, the construction, the
evaluation of the performance and the implementation of m-D algorithms would be a very
difficult task indeed. This thesis attempts to seek a structural approach to m-D fast DSP
algorithms to make the task simpler. Instead of deriving m-D fast algorithms from a
defined m-D DSP problem directly, and evaluating and implementing algorithms according to
equations so derived, the approach suggests that the construction, evaluation and
implementation of m-D fast algorithms be based on our knowledge of, and experience
with, the corresponding 1-D algorithms, if possible. For example, 1-D FFT algorithms
based on the Cooley-Tukey method are extensively studied and well documented.
Computer programs of 1-D FFTs can be found in the published literature and in computer
software mathematics libraries. Many DSP manufacturers provide their version of FFT
programs. The VLSI integration of 1-D FFT algorithms has also broadened our
knowledge. All the above knowledge can be made useful for the development of m-D
FFT algorithms. The simplest case would be the row-column approach. The row-column
m-D FFT algorithm, for instance, is obtained by repeatedly applying the 1-D FFT
algorithm along each dimension. All the knowledge and experience of 1-D FFTs,
including the programs and hardware, are thus directly made use of in this m-D
method, which is constructed and built on the 1-D FFT by deriving the relation
between the m-D DFT and the DFT along each of its dimensions. In this case, it is
not necessary to worry about the structure of 1-D FFTs, nor how they are constructed.
One simply makes use of what is available at the 1-D level. However, when the number
of dimensions of DFTs increases, the computational saving of m-D fast FFTs, of which
the vector radix FFT is one, over the row-column FFT will become substantial in terms of
the number of multiplications or the total number of numerical operations, as will be
shown in Chapter Four of this thesis [44, 45]. When the m-D vector radix FFT is to be
constructed, the structure of 1-D FFT algorithms and the structural relationship between
the m-D FFT and 1-D FFTs have to be studied and understood in order to generate
systematically the required m-D FFT from the knowledge (algorithm, software and
hardware) possessed of corresponding 1-D FFTs. The above described approach is
hereby called a structural approach. This kind of approach will not only help the
construction, software and hardware implementation of m-D algorithms, but also assist
the study of VLSI integration of m-D algorithms, which possess a greater degree of
complexity than the 1-D case. Its function as a tool for software development of m-D
discrete transform algorithms has been successfully demonstrated during this research.
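The row-column idea described above amounts to reusing a stock 1-D FFT along each dimension of the array. A minimal reconstruction in NumPy, for illustration only (the recursive radix-2 routine below stands in for whatever 1-D FFT program or hardware is available):

```python
import numpy as np

def fft1d(x):
    """Recursive radix-2 DIT Cooley-Tukey FFT for power-of-two lengths."""
    n = len(x)
    if n == 1:
        return x.copy()
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2]) * np.exp(-2j * np.pi * np.arange(n // 2) / n)
    return np.concatenate([even + odd, even - odd])

def fft2d_row_column(x):
    """2-D DFT by applying the 1-D FFT to every row, then to every column."""
    rows = np.array([fft1d(r) for r in x])
    return np.array([fft1d(c) for c in rows.T]).T

x = np.random.default_rng(1).standard_normal((8, 8)).astype(complex)
print(np.allclose(fft2d_row_column(x), np.fft.fft2(x)))   # True
```

A vector radix algorithm, by contrast, cannot be built this way from a black-box 1-D routine; it needs the internal butterfly structure of the 1-D FFT, which is exactly why the structural approach studies that structure.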
This thesis is mainly concerned with the implementation of the multidimensional
vector radix FFT algorithms [42-44], based on the Cooley-Tukey method [22], and that
of 2-D direct vector radix fast DCT algorithms. The mathematical structures of these
algorithms are to be examined and a graphical representation (logic diagram) is to be
introduced to accommodate the concept of vector signal processing in graphical form.
Various issues associated with the software and hardware implementations of m-D DFTs
and DCTs are also to be investigated using some state-of-the-art digital signal processors.
It will be shown that the algorithms under study are highly structured and have a
close link to their 1-D counterparts. They provide more efficient processing in terms of
computational complexity and will be fast if their parallel and pipeline structures can be
fully exploited.
1-5 Thesis Review and Contributions
A structural approach to the construction of multidimensional vector radix fast
Discrete Fourier Transforms (DFTs) and fast direct Discrete Cosine Transforms (DCTs)
is presented in this thesis. Rigorous mathematical derivation is presented by representing
the 1-D and 2-D FFT and fast DCT algorithms in matrix form together with the tensor
product. However, the same algorithms are also derived more simply by examination of the
structure of logic diagrams with given rules for modification. The structural approach is
applied to construct 2-D Decimation-In-Time (DIT), Decimation-In-Frequency (DIF) and
mixed (DIT & DIF) vector radix FFT algorithms from corresponding 1-D FFT algorithms
using the Cooley-Tukey approach. The whole procedure is summarized in theorems.
The results are then generalized to vector radix FFTs of higher dimensions and vector
radix DCT algorithms. It has been shown that the logic diagram (or signal flow graph) is,
in addition to being a form of representation and interpretation of fast algorithm equations,
a stand-alone engineering tool for the construction of fast algorithms. The concept of
"vector processing" is adapted into the logic diagram representation. This reveals the
structural features of multidimensional vector radix FFTs and explains the relationships
and differences between the row-column FFT, the vector radix FFT in [43, 44] and the
approach presented in this thesis. Introduction of the structural approach makes the
multidimensional vector radix FFT algorithms of high radix and high dimension easy to
evaluate and implement in both software and hardware.
The hardware implementation of 2-D DFT is discussed in the light of vector radix
FFTs using the Frequency Domain Processor (FDP™) A41102, which has shown
improvement in reducing the system complexity over the traditional row-column method.
With the help of the structural approach, the vector split-radix DIF FFT algorithm,
mixed (DIT & DIF) vector radix FFT and Combined Factor (CF) vector radix FFT
algorithms are presented whereby a comparison study is made in terms of arithmetic
complexity.
Two vector radix DCT algorithms are presented in the second part of the thesis.
Although the one based on Lee's approach was reported by Haque using a direct matrix
derivation method, it is here derived independently, using the structural approach. The
other vector radix DCT algorithm is based on Hou's method.
The system design of the 2-D modified Makhoul algorithm using the FDP A41102
provides yet another solution to the real-time 2-D image coding problem. The effects of
finite-word-length computation of DCT using various direct fast algorithms are studied
by computer simulation and results are presented.
Chapter One presents an introduction to multidimensional digital signal
processing, with emphasis given to the transform method, multidimensional discrete
Fourier transforms and discrete cosine transforms in particular. The development and
new achievements of fast digital signal processing algorithms are reviewed, providing
insight into the research area.
Two basic representations, namely the matrix form and the logic diagram, for 1-D
DFT and FFT algorithms are presented in Chapter Two, which lays the foundation for the
presentation of the structural approach to the construction of multidimensional vector
radix FFT algorithms. It has also been shown that the logic diagram is a form of
representation for FFT algorithms and a tool to derive or construct FFT algorithms as
well.
Chapter Three forms one of the major chapters of the thesis. After the introduction
of general matrix representations for the first stage 2-D DIT, DIF and mixed decimation
vector radix algorithms, structure theorems are presented along with diagrammatical
representation, which bear the essential message for the structural approach towards the
construction of various vector radix FFT algorithms. The applications of theorems and
the logic diagram are demonstrated by various examples, including the 2-D vector split-
radix DIF FFT algorithm. As well, comparative studies of vector radix FFTs and
hardware implementation of vector radix FFTs using the FDP A41102 are presented in
this chapter.
The structural approach is extended to multidimensional vector radix FFT
algorithms of higher dimension in Chapter Four. A recursive symbol system, which
makes the derivation of multidimensional vector radix FFTs from 1-D FFTs a systematic,
straightforward and error-free procedure, is presented for the logic diagram
representation of vector radix FFTs.
The second part of this thesis consists of study results on the fast computation of 2-
D discrete cosine transforms, its application to the transform coding of real-time images,
and error analysis of various direct fast DCT algorithms for image coding purposes using
floating-point computation.
A brief introduction to multidimensional DCTs is presented in Chapter Five. Two
vector radix direct fast DCT algorithms are constructed using the structural approach and
presented in Chapter Six. The arithmetic complexity of various direct fast DCT
algorithms is also discussed in this chapter. In Chapter Seven, hardware implementations
of 2-D DCTs for real-time image coding are discussed using dedicated VLSI DCT
processors, digital signal processors, fast multiplier/accumulators and the newly released
FDP A41102. The effects of finite-word-length computation for fast DCT algorithms are
studied using floating-point arithmetic in comparison with the direct matrix
multiplication method, and simulation results are presented in Chapter Eight. In conclusion,
Chapter Nine summarizes the main approach taken, the contribution made by this thesis
and future aspects of research.
Preliminary material on the tensor (or Kronecker) product and logic diagrams is
presented in the appendices. A short proof of the structure theorems, the vector radix direct
fast DCT algorithm based on Lee's method and derivations of various combined vector
radix FFT algorithms are also presented in the appendices.
1-6 Publications, Submitted Papers and Internal Technical
Reports
[1-6.1] H.R. Wu and F.J. Paoloni, "The Structure of Vector Radix Multidimensional
Fast Fourier Transforms", ISSPA 87: Signal Processing, Theories,
Implementations and Applications, pp.89-92, August 1987.
[1-6.2] H.R. Wu and F.J. Paoloni, "The Structure of Vector Radix Fast Fourier
Transforms", IEEE Transactions on Acoustics, Speech, and Signal
Processing, Vol.37, pp.1415-1424, September 1989.
[1-6.3] H.R. Wu and F.J. Paoloni, "On the Two Dimensional Vector Split-Radix
FFT Algorithm", IEEE Transactions on Acoustics, Speech, and Signal
Processing, Vol.37, pp.1302-1304, August 1989.
[1-6.4] H.R. Wu and F.J. Paoloni, "Structured Vector Radix FFT Algorithms and
Hardware Implementation", Journal of Electrical and Electronics
Engineering, Australia, September 1990.
[1-6.5] H.R. Wu and F.J. Paoloni, "A Two Dimensional Fast Cosine Transform
Algorithm—A Structural Approach", Proceedings of IEEE International
Conference on Image Processing, pp.50-54, Singapore, September 1989.
[1-6.6] H.R. Wu and F.J. Paoloni, "A 2-D Fast Cosine Transform Algorithm Based
on Hou's Approach", IEEE Trans. on Acoust., Speech, and Signal
Processing, to appear in June 1991.
[1-6.7] H.R. Wu, F.J. Paoloni and W. Tan, "Implementation of 2-D DCT for Image
Coding Using FDP™ A41102", Proceedings of the Conference on Image
Processing and the Impact of New Technologies, pp.35-38, Canberra,
December 1989.
[1-6.8] H.R. Wu and F.J. Paoloni, "A Structural Approach to Two Dimensional
Direct Fast Discrete Cosine Transform Algorithms", Proceedings of
International Symposium on Computer Architecture & Digital Signal
Processing, pp.358-362, Hong Kong, October 1989.
[1-6.9] H.R. Wu and F.J. Paoloni, "The Impact of the VLSI Technology on the Fast
Computation of Discrete Cosine Transforms for Image Coding", to be
submitted.
[1-6.10] H.R. Wu and F.J. Paoloni, "A Perspective on Vector Radix FFT Algorithms
of Higher Dimensions", Proc. of the IASTED Int. Symp. on Signal
Processing & Digital Filtering, June 1990.
[1-6.11] H.R. Wu and F.J. Paoloni, "Implementation of 2-D Vector Radix FFT
Algorithms Using the Frequency Domain Processor A41102", Proc. of the
IASTED Int. Symp. on Signal Processing & Digital Filtering, June 1990.
(Internal Technical Reports)
[1-6.12] H.R. Wu and F.J. Paoloni, "Comparison Study on Software and Hardware
Implementation of Various Fast Discrete Cosine Transform Algorithms",
Technical Report-1, the University of Wollongong-Telecom Research
Laboratories (Australia) R&D Contract for the Study of Fast Implementations
of Discrete Cosine Transform Coding Systems, under No.7066, June 1989.
[1-6.13] H.R. Wu and F.J. Paoloni, "Simulation Study on the Effects of Finite-Word-
Length Calculations for Fast DCT Algorithms", Technical Report-2, the
University of Wollongong-Telecom Research Laboratories (Australia) R&D
Contract for the Study of Fast Implementations of Discrete Cosine Transform
Coding Systems, under No.7066, October 1989.
[1-6.14] H.R. Wu and F.J. Paoloni, "Comparison Study on Software and Hardware
Implementation of Various Fast Discrete Cosine Transform Algorithms",
Addendum of Technical Report-1, the University of Wollongong-Telecom
Research Laboratories (Australia) R&D Contract for the Study of Fast
Implementations of Discrete Cosine Transform Coding Systems, under
No.7066, November 1989.
PART I.
MULTIDIMENSIONAL DISCRETE FOURIER TRANSFORMS
CHAPTER TWO: 1-D DISCRETE FOURIER TRANSFORM AND
FAST FOURIER TRANSFORM ALGORITHMS
In this thesis, multidimensional vector radix FFT algorithms [42-44] based on the
Cooley-Tukey method [22] are considered in detail. Although the 1-D Cooley-Tukey
FFT algorithm and many others have been well studied and understood, they have been
included here for the purpose of understanding the structure of m-D VR FFT, the
evolution of VR FFTs from 1-D FFT, and even the 1-D FFT itself. The more that is
understood about 1-D FFT, the more easily the knowledge of m-D VR FFT algorithms
can be expanded. The matrix and logic diagram representations of the 1-D FFT
algorithm form the foundation on which the m-D VR FFTs are built.
After defining the 1-D discrete Fourier transform, the matrix forms for the 1-D FFT
are introduced. An examination, from several viewpoints, of why FFTs achieve better
computational efficiency is then presented.
2-1 Definitions
The Discrete Fourier Transform (DFT), X(k), of a vector x(m) of length N is
defined [22, 47] as follows:
X(k) = Σ_{m=0}^{N-1} x(m) W_N^{km}   (2-1-1)

where W_N = exp(-j2π/N), j = √-1 and k = 0, 1, ..., N-1.

The inverse DFT (IDFT) is given by:

x(m) = (1/N) Σ_{k=0}^{N-1} X(k) W_N^{-km}   (2-1-2)

where m = 0, 1, ..., N-1.

The derivation or development of the DFT from its corresponding continuous
Fourier Transform (FT) can be found in [39, 47].
In their matrix forms, the DFT and IDFT are defined by the following equations:
X = W_N x   (2-1-3)

and:

x = N̄ W_N^{-1} X   (2-1-4)

where X = [X(0), X(1), ..., X(N-1)]^T, x = [x(0), x(1), ..., x(N-1)]^T, W_N is an N×N
matrix with W_N(k,m) = W_N^{km} for 0 ≤ k,m ≤ N-1; N̄ = diag[1/N, ..., 1/N]; and
W_N^{-1}(m,k) = W_N^{-mk} for 0 ≤ k,m ≤ N-1. All matrices are of size N×N. W_N is
often called the 1-D DFT matrix.
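The definitions above translate directly into code. The following Python sketch (function names are illustrative, not from the thesis) builds the DFT matrix W_N and checks that the IDFT of Equation (2-1-4) inverts the DFT of Equation (2-1-3):

```python
import numpy as np

def dft_matrix(N):
    """The N x N 1-D DFT matrix W_N, with entries W_N^(k*m)."""
    k = np.arange(N).reshape(-1, 1)
    m = np.arange(N).reshape(1, -1)
    return np.exp(-2j * np.pi * k * m / N)

def dft(x):
    """Equation (2-1-3): X = W_N x."""
    return dft_matrix(len(x)) @ np.asarray(x, dtype=complex)

def idft(X):
    """Equation (2-1-4): x = (1/N) W_N^(-1) X, where W_N^(-1) is the
    element-wise conjugate of W_N."""
    N = len(X)
    return np.conj(dft_matrix(N)) @ np.asarray(X, dtype=complex) / N

x = np.random.rand(8)
assert np.allclose(idft(dft(x)), x)        # the IDFT inverts the DFT
assert np.allclose(dft(x), np.fft.fft(x))  # agrees with a library FFT
```

This direct matrix product is the O(N^2) baseline against which the fast algorithms of the next section are measured.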
2-2 Matrix Representations for 1-D Cooley-Tukey FFT
Algorithms
If the direct matrix operation is used, the computation of a 1-D DFT as defined by
Equation (2-1-3) needs N^2 complex multiplications and N^2 - N complex additions,
provided the input sequence is complex. Thus, fast algorithms need to be introduced to
reduce the arithmetic complexity and the computation time.
In this section, the 1-D Cooley-Tukey FFT algorithm is presented in the
traditional manner, followed by the customary graphical and matrix representations
of the algorithm and its computation structure. The matrix and diagrammatical forms,
which are used throughout the thesis, for both DIT and DIF FFT algorithms are then
presented, with their differences and relationships explained.
Using the Cooley-Tukey method or the Decimation-In-Time FFT algorithm [22,24,
47]:
k = k_1 N' + k_0;   m = m_1 r + m_0   (2-2-1)

is set, where N' = N/r, k_1, m_0 = 0, 1, ..., r-1 and k_0, m_1 = 0, 1, ..., N'-1.

Then Equation (2-2-2) is derived from Equation (2-1-1):

X(k_1, k_0) = Σ_{m_0=0}^{r-1} Σ_{m_1=0}^{N'-1} x(m_1, m_0) W_{N'}^{k_0 m_1} W_N^{k_0 m_0} W_r^{k_1 m_0}   (2-2-2)
The four steps of the algorithm to calculate Equation (2-2-2) are:

Step 1: The second butterfly (BF2), or the shorter length DFTs:

x'_1(k_0, m_0) = Σ_{m_1=0}^{N'-1} x(m_1, m_0) W_{N'}^{k_0 m_1}   (2-2-3a)

Step 2: The twiddling multiplications (TM):

x_1(k_0, m_0) = x'_1(k_0, m_0) W_N^{k_0 m_0}   (2-2-3b)

Step 3: The first butterfly (BF1) of radix-r:

x_2(k_0, k_1) = Σ_{m_0=0}^{r-1} x_1(k_0, m_0) W_r^{k_1 m_0}   (2-2-3c)

Step 4: The unscrambling:

X(k_1, k_0) = x_2(k_0, k_1)   (2-2-3d)
The algorithm shows that the original 1-D DFT of length N can be calculated by r
DFTs of length N' (shorter than N), whereby computational savings can be
achieved.
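The four steps of Equation (2-2-3) can be sketched as follows in Python (names are illustrative; the r shorter DFTs of length N' are delegated to a library FFT here, though a full implementation would decimate them recursively):

```python
import numpy as np

def cooley_tukey_stage(x, r):
    """One decimation-in-time Cooley-Tukey stage, following the four
    steps of Equation (2-2-3) for a length N = r*N' input."""
    N = len(x)
    Np = N // r                                # N' = N/r
    # index map m = m1*r + m0: element [m1, m0] holds x(m1, m0)
    xm = np.asarray(x, dtype=complex).reshape(Np, r)
    # Step 1 (BF2): N'-point DFTs over m1 for each m0 -> x'_1(k0, m0)
    x1p = np.fft.fft(xm, axis=0)
    # Step 2 (TM): twiddle factors W_N^(k0*m0)
    k0 = np.arange(Np).reshape(-1, 1)
    m0 = np.arange(r).reshape(1, -1)
    x1 = x1p * np.exp(-2j * np.pi * k0 * m0 / N)
    # Step 3 (BF1): r-point DFTs over m0 for each k0 -> x_2(k0, k1)
    x2 = np.fft.fft(x1, axis=1)
    # Step 4 (unscrambling): X(k1*N' + k0) = x_2(k0, k1)
    return x2.T.reshape(N)

x = np.random.rand(12) + 1j * np.random.rand(12)
assert np.allclose(cooley_tukey_stage(x, 3), np.fft.fft(x))
```

The test with N = 12 and r = 3 illustrates that the decomposition is not restricted to powers of 2, as noted later in this chapter.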
There have been many attempts to interpret the FFT by finding both mathematical
and graphical expressions for the algorithm [24, 25, 31, 39, 47]. The Mason flow graph
[6] is one such attempt and its introduction has been of great help in interpreting and
presenting the FFT. Figure-1 shows a Mason signal flow graph representing the 8-point
DIT Cooley-Tukey FFT algorithm, where N = 8, the input is in natural numerical order and
the output in bit-reversed order. By bit-reversed order, it is meant that if (n2 n1 n0)b is the
binary representation of the natural decimal index (n)d, its decimal index (nr)d in bit-
reversed order will be (n0 n1 n2)b. For example, if (n)d = (4)d = (100)b, then (nr)d =
(001)b = (1)d. The bit-reversed order of the sequence {0, 1, 2, 3}, with binary indices
{00, 01, 10, 11}, is {00, 10, 01, 11}, i.e., {0, 2, 1, 3}.

[Figure-1: Mason signal flow graph of the 8-point DIT Cooley-Tukey FFT algorithm, with
input in natural order and output in bit-reversed order.]
It is assumed that the decimal number is used for indexing unless it is indicated
explicitly otherwise.
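A bit-reversal routine of the kind implied above might look as follows (an illustrative Python sketch):

```python
def bit_reverse(n, bits):
    """Reverse the 'bits'-bit binary representation of index n,
    e.g. (4)d = (100)b -> (001)b = (1)d for bits = 3."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (n & 1)  # shift the lowest bit of n into r
        n >>= 1
    return r

# Bit-reversed ordering of an 8-point sequence (3 bits):
order = [bit_reverse(n, 3) for n in range(8)]
# order == [0, 4, 2, 6, 1, 5, 3, 7]
```

This is the unscrambling permutation applied at the output (or input) of the radix-2 FFT.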
In Figure-1, the signal flow graph shows vividly the construction of the algorithm,
including the butterfly (BF), twiddles (TM) and the unscrambling, and the number of
operations saved, noting that W_8^0 = 1 and W_8^2 = -j. The relationships between the
long length DFT and shorter length DFTs when the algorithm is applied can also be
described [25].
An alternative representation uses matrices to give the mathematical interpretation
and explanation of the algorithm. The FFT algorithm can be constructed by matrix
decomposition (or factorization) [39, 47], and its roots lie in algebra [30, 33]. In [30],
the matrix decomposition which underlies the Cooley-Tukey algorithm is shown to be:

W_N = B_1 T_1 B_2 T_2 ... B_{m-1} T_{m-1} B_m   (2-2-4)

assuming the length of the DFT is N = 2^m. B_i (i = 1, ..., m) represents a butterfly stage
which evolves from 2-point DFTs [30]. It is shown that each B_i only needs N complex
additions to calculate. T_i (i = 1, ..., m-1) is a diagonal matrix (representing the twiddling
multiplications) with half of its elements being 1 (or trivial), i.e., only N/2 complex
multiplications are needed to work out each T_i. This matrix form describes the
Cooley-Tukey algorithm both precisely and concisely. As has been shown, the
computational efficiency is also very easy to evaluate.
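The factorization of Equation (2-2-4) corresponds to the familiar stage-by-stage FFT loop. The following Python sketch (an illustrative reconstruction; the exact factor ordering in [30] may differ) performs each butterfly stage B_i with N complex additions and each twiddle stage T_i with at most N/2 complex multiplications:

```python
import numpy as np

def fft_stages(x):
    """Iterative radix-2 DIT FFT organized as the stage product of
    Equation (2-2-4): each butterfly stage B_i costs N complex
    additions, and each twiddle stage T_i at most N/2 complex
    multiplications (half of its diagonal entries are trivial)."""
    a = np.asarray(x, dtype=complex).copy()
    N = len(a)
    m = N.bit_length() - 1                    # N = 2**m assumed
    # permute the input into bit-reversed order
    a = a[[int(format(n, f'0{m}b')[::-1], 2) for n in range(N)]]
    size = 2
    while size <= N:
        half = size // 2
        w = np.exp(-2j * np.pi * np.arange(half) / size)  # w[0] = 1
        for start in range(0, N, size):
            u = a[start:start + half].copy()
            t = a[start + half:start + size] * w          # T_i products
            a[start:start + half] = u + t                 # B_i additions
            a[start + half:start + size] = u - t
        size *= 2
    return a

x = np.random.rand(16) + 1j * np.random.rand(16)
assert np.allclose(fft_stages(x), np.fft.fft(x))
```

Counting operations in this loop directly reproduces the N additions per B_i and N/2 products per T_i quoted above.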
The matrix form adopted in this thesis is virtually the same as that of Blahut [31],
and its indexing scheme follows the traditional Cooley-Tukey presentation [22, 47]. The
matrix equations for the 1-D radix-2 DIT FFT algorithm on the 8-point 1-D DFT (i.e., r =
2, N = 8 and N' = 4) are presented as follows:
(BF2:)

[x'_1(0,m_0)]   [1  1  1  1] [x(0,m_0)]
[x'_1(2,m_0)] = [1  1 -1 -1] [x(2,m_0)]   (2-2-5a)
[x'_1(1,m_0)]   [1 -1 -j  j] [x(1,m_0)]
[x'_1(3,m_0)]   [1 -1  j -j] [x(3,m_0)]

(TM:)

[x_1(k_0,0)]   [1  0        ] [x'_1(k_0,0)]
[x_1(k_0,1)] = [0  W_N^{k_0}] [x'_1(k_0,1)]   (2-2-5b)

(BF1:)

[x_2(k_0,0)]   [1  1] [x_1(k_0,0)]
[x_2(k_0,1)] = [1 -1] [x_1(k_0,1)]   (2-2-5c)

(Unscrambling:)

[X(0,k_0)]   [x_2(k_0,0)]
[X(1,k_0)] = [x_2(k_0,1)]   (2-2-5d)

In Equation (2-2-5a), there are two (for m_0 = 0, 1, respectively) length-four DFTs
which can be further decimated using this recursive algorithm.

The algorithm is described by a logic diagram in Figure-2 which is similar to the
Mason signal flow graph in Figure-1. There is a correspondence between the matrix
form of the butterfly and twiddles and its signal flow graph. As shown in Figure-2, the
logic diagram consists of one stage of radix-4 DIT FFT butterfly (BF), a stage of radix-2
BF and a Twiddling Multiplication (TM) stage between the radix-4 and radix-2 BF stages,
represented by DIT TM-4*2, where a = exp(-jπ/4). Other symbols are defined in
Appendix A. It can be seen that there are two radix-4 butterflies in the radix-4 BF stage
and each radix-4 butterfly can be implemented with no multiplication and only eight
additions. The difference between Equation (2-2-4) and Equation (2-2-5) is that the
former, which is a form that underlies the Cooley-Tukey algorithm [30], is a top-down
overall matrix decomposition method; and the latter is a representation of the algorithm
itself. From the matrix form point of view, it is a bottom-up approach.

[Figure-2: Logic diagram of the 8-point radix-2 DIT FFT algorithm, showing the radix-4
BF stage, the DIT TM-4*2 twiddling stage and the radix-2 BF stage.]

Equation (2-2-4)
can be seen as a mathematical representation of the Cooley-Tukey algorithm. It explains
"what" and "why". But when it comes to "how", i.e., how to derive an algorithm, the
"decimation" method has been predominantly preferred [5, 6, 22, 24, 25, 36, 37, 43,
68]. Equation (2-2-5) is a concise representation of the Cooley-Tukey algorithm. Further
decimation can proceed on Equation (2-2-5a): as m_0 takes each fixed value, the equation
itself is a half-length DFT. It is easy to see that the matrix form used in this thesis is
similar to that used by Blahut but with a reordered data sequence and a different indexing
scheme.
Similarly, a matrix form of equations can be introduced for the Decimation-In-
Frequency (DIF) Cooley-Tukey FFT algorithm.
Assuming N = r*N', set

k = k_1 r + k_0;   m = m_1 N' + m_0

where k_0, m_1 = 0, 1, ..., r-1 and k_1, m_0 = 0, 1, ..., N'-1.

Given N = 8, N = 2*N' and N' = 4, the matrix equations for the 1-D radix-2
DIF FFT algorithm on the 8-point DFT are presented as follows:
(BF1:)

[x_1(0,m_0)]   [1  1] [x(0,m_0)]
[x_1(1,m_0)] = [1 -1] [x(1,m_0)]   (2-2-6a)

(TM:)

[x'_1(0,m_0)]   [1  0        ] [x_1(0,m_0)]
[x'_1(1,m_0)] = [0  W_N^{m_0}] [x_1(1,m_0)]   (2-2-6b)

(BF2:)

[x_2(k_0,0)]   [1  1  1  1] [x'_1(k_0,0)]
[x_2(k_0,2)] = [1  1 -1 -1] [x'_1(k_0,2)]   (2-2-6c)
[x_2(k_0,1)]   [1 -1 -j  j] [x'_1(k_0,1)]
[x_2(k_0,3)]   [1 -1  j -j] [x'_1(k_0,3)]

(Unscrambling:)

[X(0,k_0)]   [x_2(k_0,0)]
[X(2,k_0)] = [x_2(k_0,2)]   (2-2-6d)
[X(1,k_0)]   [x_2(k_0,1)]
[X(3,k_0)]   [x_2(k_0,3)]
A logic diagram used to perform this 8-point DFT is shown in Figure-3.

Because of the binary organization of digital computers, algorithms for N not a
power of 2 have received less attention, although the Cooley-Tukey algorithm can be
applied to DFTs of length N, where N can be any composite number [21, 24].
Thereafter, in this thesis, most of the discussion will be on DFTs whose length is a
power of 2.
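The DIF decomposition of Equation (2-2-6), generalized to any radix r dividing N, can be sketched as follows in Python (names are illustrative; the N'-point DFTs are delegated to a library FFT rather than decimated recursively):

```python
import numpy as np

def dif_stage(x, r):
    """One decimation-in-frequency Cooley-Tukey stage (cf. Equation
    (2-2-6)): first the radix-r butterflies (BF1), then the twiddles
    (TM), then the N'-point DFTs (BF2)."""
    N = len(x)
    Np = N // r                                # N' = N/r
    # index map m = m1*N' + m0: element [m1, m0] holds x(m1, m0)
    xm = np.asarray(x, dtype=complex).reshape(r, Np)
    # BF1: r-point DFTs over m1 for each m0 -> x_1(k0, m0)
    x1 = np.fft.fft(xm, axis=0)
    # TM: twiddle factors W_N^(k0*m0)
    k0 = np.arange(r).reshape(-1, 1)
    m0 = np.arange(Np).reshape(1, -1)
    x1p = x1 * np.exp(-2j * np.pi * k0 * m0 / N)
    # BF2: N'-point DFTs over m0 for each k0 -> x_2(k0, k1)
    x2 = np.fft.fft(x1p, axis=1)
    # unscrambling: X(k1*r + k0) = x_2(k0, k1)
    return x2.T.reshape(N)

x = np.random.rand(8) + 1j * np.random.rand(8)
assert np.allclose(dif_stage(x, 2), np.fft.fft(x))
```

Comparing this with the DIT sketch earlier in the chapter makes the mirror-image relationship between the two decimations explicit: the twiddles sit after the short DFTs in DIT and before them in DIF.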
2-3 Computational Considerations
The computational consideration is a sophisticated problem. The majority of the
work undertaken in search of fast algorithms for DFTs has gone into reducing the
computational complexity [34, 39, 48, 49]. On most general purpose computers,
multiplications are more expensive than additions. As a result, especially in the early
years of research on FFT algorithms, much work was carried out on reducing the number
of multiplications, some of which was accomplished at the expense of additions, in-place
computation, and most of all, regular computing structure. A second consideration was
reducing roundoff errors [25], and research in this direction has remained very much alive
and up to date [50, 51]. In-place computation has also drawn a lot of attention from
researchers [39, 52]. Regular structure is yet another factor which is important to both
software and hardware implementation of FFT algorithms [14, 34, 41, 53, 54]. A fast
algorithm may lose its initial momentum due to its lack of in-place computation or regular
computing structure which dramatically increases the bookkeeping task, and it may be
placed in a disadvantageous position after all [39, 41, 154]. All the above should be
considered and balanced to devise or choose an FFT algorithm for a specific application.
[Figure-3: Logic diagram of the 8-point radix-2 DIF FFT algorithm.]

Another point to be considered is that the advantage of an improved algorithm can be
wasted if the computer hardware (or the digital signal processor) cannot take advantage of
it [31]. This aspect shall be discussed in the second part of the thesis, where the hardware
implementation of 2-D DCTs is considered.
The computational complexity in terms of operations is usually evaluated in two
ways, i.e., by examining the mathematical equations or by looking at the logic diagrams
(the signal flow graphs). There are many interpretations as to why the FFT algorithm can
be fast. Mathematically speaking, as the 1-D DFT is a summation of weighted inputs
with the weight being a periodic function, it can be evaluated by a clever insertion of
parentheses to reduce the number of additions and multiplications, thus becoming an
algebraic exercise. The number of multiplications can be further reduced by locating the
trivial multiplications such as ±1 and ±j in regular positions [31, 45]. This can also be
explained using the logic diagram, which gives an engineering interpretation. Examine a
4-point DFT, for example. The algorithm represented by Figure-4-(a) is equivalent to the
direct matrix operation. As a result, 12 additions and 16 multiplications are needed to
complete the transform. By close examination, it is found that use can be made of the
periodicity of the weighting function, as shown in Figure-4-(b), to group those inputs with
the same weights and reduce the number of additions to 8 in Figure-4-(c). It can be noted
that W_4^0 = 1, which does not require multiplication, and W_4 can be moved to the right
of the summation symbol, which further reduces the number of multiplications by 1. Using
conventional presentation, the algorithm represented by the logic diagram is given by
Figure-4-(d). Similarly, W_4^1 = -j, which can be calculated without multiplication. The
final fast algorithm for the 4-point DFT needs only 8 additions, as shown in Figure-4-(e).
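The resulting fast 4-point DFT can be written out directly. The sketch below uses 8 additions/subtractions and no general multiplications, since multiplication by -j merely swaps and negates real and imaginary parts:

```python
def dft4(x):
    """Fast 4-point DFT of Figure-4-(e): 8 additions/subtractions and
    no general multiplications (the factor -j is trivial)."""
    t0 = x[0] + x[2]                 # }
    t1 = x[0] - x[2]                 # } first butterfly stage:
    t2 = x[1] + x[3]                 # } 4 additions
    t3 = x[1] - x[3]                 # }
    return [t0 + t2,                 # X(0)  } second stage:
            t1 - 1j * t3,            # X(1)  } 4 more additions
            t0 - t2,                 # X(2)  } (the -j is trivial)
            t1 + 1j * t3]            # X(3)

# dft4([1, 2, 3, 4]) -> [10, -2+2j, -2, -2-2j], matching the direct DFT
```

This is exactly the operation count (8 additions, 0 multiplications) arrived at graphically above.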
Logic diagrams can also be used to explain why one algorithm can be faster than the
other, in terms of the number of multiplications, when different radices are used. Take a
16-point 1-D DFT, for example. In Figure-5, a mixed radix-2 and radix-8 FFT algorithm
is used so that 10 multiplications are required to complete the transform. If the twiddles
between the radix-2 and radix-8 butterflies are moved to the right and are combined with
the twiddles inside the radix-8 butterfly, a new group of twiddles is formed as shown in
[Figure-4-(a): Logic diagram of the direct 4-point DFT (16 multiplications, 12 additions).]
[Figure-4-(b), (c): Grouping inputs with equal weights, reducing the additions to 8.]
[Figure-4-(d): Conventional logic diagram of the resulting algorithm.]
[Figure-4-(e): Final fast 4-point DFT algorithm, requiring 8 additions only.]
[Figure-5: Logic diagram of a 16-point mixed radix-2 and radix-8 DIT FFT algorithm.]
[Figure-6: Logic diagram of the 16-point radix-4 FFT algorithm obtained by relocating and
combining the twiddle factors.]
Figure-6. As a result, only 8 multiplications are needed to perform the same 16-point
DFT. As a matter of fact, Figure-6 represents the radix-4 FFT algorithm. According to
Richard [55], the best choice of algorithm depends strongly on whether the execution
speed is dominated by: (1) multiply time; (2) multiply and addition time equally; or (3)
butterfly time. From the logic diagram it can be seen that, when multiply time dominates,
optimizing an FFT algorithm amounts to finding the best locations for the twiddle factors
so that the number of multiplications is minimal.
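A small bookkeeping sketch makes the comparison concrete. Assuming a pure radix-r DIT organization and counting a twiddle factor as nontrivial unless it is ±1 or ±j (the counting convention here is illustrative and does not cover the mixed-radix case of Figure-5), the radix-4 count of 8 for N = 16 matches the figure quoted above:

```python
def nontrivial_twiddles(N, radix):
    """Count twiddle-factor multiplications of a pure radix-'radix'
    DIT FFT of length N (a power of the radix), skipping the trivial
    factors +1, -1, +j, -j.  Between a stage of length-'size' sub-DFTs
    and the next stage, the twiddles are W_Np^(k0*m0) with
    Np = size*radix, repeated for each of the N//Np blocks."""
    count = 0
    size = radix
    while size < N:
        Np = size * radix
        for k0 in range(size):
            for m0 in range(radix):
                # W_Np^(k0*m0) lies in {1, -1, j, -j} iff its 4th power is 1
                if (4 * k0 * m0) % Np != 0:
                    count += N // Np
        size *= radix
    return count

print(nontrivial_twiddles(16, 2))   # radix-2: 10 nontrivial twiddles
print(nontrivial_twiddles(16, 4))   # radix-4: 8, as in Figure-6
```

Relocating twiddles so that more of them fall on the trivial values is precisely the optimization illustrated by the move from Figure-5 to Figure-6.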
2-4 Summary
In this chapter, both 1-D Decimation-In-Time and Decimation-In-Frequency FFT
algorithms have been examined using two forms, namely, the matrix representation and
the logic diagram. These two forms will be used throughout this thesis as a basis for the
derivation of multidimensional vector radix FFT algorithms. The logic diagram is
equivalent to the traditional signal flow graph, but it can be generalized into a
multidimensional form without complication or misperception. It is also
demonstrated that alternative algorithms can be derived using logic diagrams alone. In
other words, the logic diagram is not only a form of representation of algorithms but can
also be used to derive new algorithms. It can be used as a stand-alone engineering tool.
CHAPTER THREE: 2-D DFT AND 2-D FFT ALGORITHMS
3-1 Introduction to 2-D Discrete Fourier Transforms
It is well known that when the sample length of the convolution for image filtering
or the correlation for template matching is long, the FFT approach is faster than the
original domain approach [8]. In most multidimensional applications, this is usually the
case. Although there are many approximation methods in the original domain which
perform multidimensional processing in real-time and have achieved reasonably good
results, it is believed that multidimensional FFT will still have a place in the field, in
theoretical analysis as well as in practical applications.
A 2-D DFT problem can arise from 2-D signal processing applications or can result
from mathematical manipulations [1, 57]. The study of fast algorithms for 2-D DFTs is
one of the practical concerns in multidimensional (m-D) digital signal processing based on
contemporary technology, especially computer technology, and is also the first step toward
studying multidimensional fast Fourier transform algorithms. Many properties of and
applications for 2-D Fourier transforms can be extended directly to computing
multidimensional Fourier transforms [31].
At present, however, it must be admitted that the requirements of most real-time
multidimensional DFT applications cannot be satisfied using ordinary methods [1, 8, 31,
39, 47, 63] with even the most advanced digital signal processors [13, 26, 40] or
supercomputers [129, 130], due to the large amount of data needed to be processed. To
improve the situation, good multidimensional FFT algorithms and greater computing
power provided by the development of application specific VLSI technology in the
computer industry are needed. This has motivated research in both theory and
technology. On the theoretical front, apart from the traditional row-column approach,
there are reports of many multidimensional FFT methods: the vector radix FFT
algorithms [31, 42-45], multidimensional DFTs by polynomial transforms [58],
multidimensional prime factor FFT algorithms [32], multidimensional Winograd Fourier
transforms [35], multidimensional Number Theoretic Transform algorithm [59], vector
split-radix FFT algorithms [36, 37, 60] and, recently, vector FFT algorithms developed
for vector computers [61] and supercomputers [129, 130]. A good mathematical
explanation of the different multidimensional FFT algorithms can be found in [1, 30, 31,
33]. On the technology front, apart from the different Digital Signal Processors (DSPs),
especially the Zoran Vector Signal Processor (VSP) [13], which can perform FFTs, the
successful fabrication of radix-2 and radix-4 butterflies has been reported [56], as well as
full length FFT processors [142]. Recently, a full length FFT processor which performs
up to a 256-point complex DFT in 102.4 μs has been fabricated and demonstrated [14, 15],
and is now commercially available [26, 27]. Many proposals have been made to implement
DFTs and FFTs using systolic array processors [140, 141, 145, 147] and VLIW [129] or
SIMD [130] supercomputers. The neural net implementation of FFTs is still at an early
stage [138]. A VLSI architecture has been proposed and designed using GE 3-μm
CMOS technology and the vector radix-2*2 FFT algorithm for rasterizing the 2-D DFT
of size N*N at video speed [134].
From the beginning, research on the fast computation of DFTs has followed criteria
to evaluate the effectiveness of an FFT algorithm, i.e., an effective FFT algorithm should
be computationally efficient in terms of the number of operations, it should reduce roundoff
errors, and it should possess in-place computation and a regular structure. Ignoring the
last two points will cause an increase in the bookkeeping burden and bring the
disadvantages which the bookkeeping task may cause. According to the above criteria,
although algorithms based on the Cooley-Tukey method usually need more
multiplications than many reduced-multiplication FFT algorithms [39, 41, 58], they have
obvious advantages over the rest by the last three criteria. This is also true of the
multidimensional vector radix FFT algorithms [50, 51].
In this part of the thesis, 2-D Vector Radix (VR) FFT algorithms, which are
multidimensional extensions of the 1-D Cooley-Tukey algorithms, are considered. The
Cooley-Tukey FFT algorithm is of historical importance in modern digital signal
processing. It is still one of the most widely used algorithms, both in software and
hardware, including VLSI implementations of DFTs, because of its regular structure and
many other computational advantages. The fact that an algorithm has a regular structure
is very crucial in VLSI implementation. The CSIRO-designed AUSTEK FDP™ A41101
and A41102 FFT processors exploit the good structure provided by the Cooley-Tukey
algorithm to form a pipeline architecture and to achieve one of the fastest FFT processing
speeds on record [14, 15, 142].
The vector radix FFT algorithm was first conceived by Rivard [42], further
developed by Harris, McClellan, Chan and Schuessler [43] and Arambepola [44], and
unified by Mersereau and Speake [62]. The vector radix FFT algorithm is a
straightforward extension of the 1-D Cooley-Tukey algorithm, and it is more efficient in
terms of the number of multiplications than its row-column 1-D counterpart. A VLSI
architecture using the vector radix-2*2 FFT algorithm has been proposed and patented
[134, 137] to show many of its advantages over traditional row-column implementations.
Nevertheless, it is less well known to electrical and computer engineers, and more often
than not, misunderstood and treated as a mathematically complicated and involved
process. Many think that it is not worth the effort to use VR FFTs in real applications.
This is only natural when all the struggle and strife of two decades ago in understanding
and interpreting the Cooley-Tukey FFT algorithm [20, 23, 25, 47, 63, 64] is recalled.
On the other hand, there have been new reports lately on the vector split-radix (VSR) FFT
algorithms by Pei and Wu [36], and Mou and Duhamel [37]. The derivation of these
"strangely" split vector radix algorithms is rather complicated and the final results are
difficult to appreciate due to the direct derivation approach used.
In order to extend different 1-D FFT algorithms to higher dimensions whilst avoiding confusion and tedious derivation, the structural features of the multidimensional FFT algorithms have to be examined. The structural features of the multidimensional Winograd Fourier transform algorithms have been studied extensively [30, 35], as have those of the prime factor algorithm [32]. Efficient as they are, the Winograd Fourier transform algorithm demands that the length of the DFT be a prime, and the prime factor fast Fourier transform algorithm requires that the lengths of the short DFTs be mutually prime [45, 154]. In essence, they tend to reduce the number of multiplications at the expense of the number of additions and, especially, of the regular computation structure. Furthermore, these structural features are described exclusively by matrix decomposition (or factorization), which is somewhat less attractive to engineers than a graphical representation such as the signal flow graph ("butterflies"). On the other hand, the structural features of the vector radix FFT algorithms have not been well examined, understood, or exploited.
In the history of FFT algorithms, matrix decomposition has served as a method of interpreting fast algorithms [47] and as an alternative representation of them [39]. Algorithms are usually derived by decimation of the indices of the transform function and finally described by signal flow graphs, on which computer programs or VLSI architecture designs are based. Matrix representation has also been used as a tool for the construction of some FFT algorithms. It has been found that the construction of fast algorithms and algebra, of which matrix theory is one branch, are deeply related, although they are not the same subject [30].
When the dimensions increase, problems become more complex and are often difficult to comprehend, as explained in the previous chapter. Solutions become more flexible and there are many alternative approaches. When a new algorithm is derived, it is not always certain that the formulas are error-free, and correcting them is not easy. More often than not, the effort spent in verifying a new algorithm equals the time taken to derive another new one. This is why the study of a systematic and structural approach to the m-D FFT algorithms is justified.
Whenever a mathematical result is used for engineering applications, it results in a new algorithm. It may be further converted into software or hardware implementations, in which case the representation becomes important, so much so that it can make whatever it represents either a technological and industrial wave or a sinking leaf in the sea of research papers. There is no need to stress the significant role played by a group of researchers at MIT in providing electrical engineers with an engineering interpretation of the Cooley-Tukey FFT algorithm [23]: the so-called "butterfly" signal flow graph is a major consideration in all publications on FFT algorithms based on the Cooley-Tukey method, and it has been used to represent many other fast transform algorithms as well [39]. Unfortunately, in most publications on m-D FFT algorithms the graphical presentation is still in a 1-D form, which can, more often than not, be overcomplicated. Also missing from the 1-D graphical presentation of m-D algorithms is the connection between the m-D algorithm and its corresponding row-column 1-D algorithm. The graphical presentation which is going to be introduced in this chapter adopts the vector signal processing concept. It will become clear that this is a better form and that it can be used to explain the relationships between different m-D algorithms as well.
It is the purpose of this part of the thesis to establish general matrix representation forms for vector radix FFT algorithms. These general forms provide a structural approach to the construction of various VR FFT algorithms whilst preserving the simple and regular structure possessed by their 1-D counterparts, in addition to the computational improvement. The form that has been chosen is a combination of matrix representation and logic diagram, the latter being a multidimensional extension of the signal flow graph ("butterflies") for 1-D FFT algorithms. Matrices are used as a concise representation of algorithms and of any particular structure of an algorithm, as well as a tool to derive multidimensional VR FFTs from their corresponding 1-D algorithms. Logic diagrams independently provide another approach towards the derivation of m-D VR FFT algorithms and are used for computational considerations, software programming and hardware implementation. The relationships between the row-column FFT and the various VR FFTs can also be vividly interpreted or explained in these forms. The use of this structural approach makes the derivation and implementation of various VR FFT algorithms straightforward and structured; otherwise it could be a tedious, untidy and potentially erroneous procedure. With a 1-D FFT algorithm based on Cooley-Tukey's concept, it is possible to obtain its multidimensional VR FFT systematically. The properties, such as in-place calculation, symmetry and data order, and structures such as the butterfly, twiddling multiplications and unscrambling, are all preserved in the multidimensional context. This approach can also be extended to some other fast transform algorithms, and has been so extended to the 2-D fast cosine transform, as explored in the second part of this thesis.
Another feature of multidimensional VR FFT algorithms worth mentioning is that VR FFTs have better fixed-point and floating-point error characteristics than both the row-column FFTs and the polynomial transform FFTs [50, 51]. Naturally, when the number of operations is reduced, the number of error sources is also reduced.
3-2 Definitions
The general $N_1{\times}N_2$-point 2-D DFT and its inverse are defined as [1]:
$$X(k,l) = \sum_{m=0}^{N_1-1}\sum_{n=0}^{N_2-1} x(m,n)\, W_{N_1}^{mk} W_{N_2}^{nl} \qquad (3\text{-}2\text{-}1a)$$
and,
$$x(m,n) = \frac{1}{N_1 N_2}\sum_{k=0}^{N_1-1}\sum_{l=0}^{N_2-1} X(k,l)\, W_{N_1}^{-mk} W_{N_2}^{-nl} \qquad (3\text{-}2\text{-}1b)$$
where $W_N = \exp(-j2\pi/N)$, $k, m = 0,1,\ldots,N_1-1$ and $l, n = 0,1,\ldots,N_2-1$. In the following discussion it is assumed that $N_1$ and $N_2$ are powers of 2 to simplify the presentation.
The matrix forms of the 2-D DFT and its inverse are given as follows:
$$\mathbf{X} = W^2 \mathbf{x} \qquad (3\text{-}2\text{-}2a)$$
and,
$$\mathbf{x} = \frac{1}{N_1 N_2}\, W^{-2} \mathbf{X} \qquad (3\text{-}2\text{-}2b)$$
where $\mathbf{X}$ is an $N_1 N_2$ column vector formed by stacking the transposed row vectors of the 2-D output array, $\mathbf{x}$ is the corresponding $N_1 N_2$ column vector formed from the 2-D input array, $W^2 = W_{N_1} \otimes W_{N_2}$, $W^{-2} = W_{N_1}^{*} \otimes W_{N_2}^{*}$, and $\otimes$ stands for the tensor (or Kronecker) product. Another matrix form for the definition of the 2-D DFT for general periodically sampled signals is given by Mersereau and Speake [62]:
$$X(\mathbf{k}) = \sum_{\mathbf{n}\in I_N} x(\mathbf{n})\, \exp[-j\mathbf{k}^T(2\pi N^{-1})\mathbf{n}], \qquad \mathbf{k} \in J_N \qquad (3\text{-}2\text{-}3a)$$
and,
$$x(\mathbf{n}) = \frac{1}{|\det N|}\sum_{\mathbf{k}\in J_N} X(\mathbf{k})\, \exp[j\mathbf{k}^T(2\pi N^{-1})\mathbf{n}], \qquad \mathbf{n} \in I_N \qquad (3\text{-}2\text{-}3b)$$
where $N$ is a periodicity matrix, and $I_N$ and $J_N$ are the regions on which $x(\mathbf{n})$ and $X(\mathbf{k})$ are supported, respectively [1, 62]. A special case of these two regions is the rectangular one, which is the most commonly used.
Yet another definition can be given to the 2-D DFT in the form of matrix row and column operations [3]. In this form the input and its DFT are both 2-D matrices. Although row and column matrix operations are the most familiar, this form seems difficult to extend beyond the 2-D DFT. When it comes to the derivation of 2-D FFT algorithms, such as the vector radix FFT algorithms, this form is not as convenient to use as the others.
The definition given by Equation (3-2-3) is a very concise mathematical
representation of the 2-D (m-D) DFT. Based on this definition, FFT algorithms for
rectangularly or hexagonally sampled signals or signals which are sampled on an arbitrary
periodic grid in either the spatial or Fourier domain are devised. The relationships
between the existing m-D FFT algorithms based on the Cooley-Tukey scheme are also
well explained in this form. However, this form helps little to show how to derive vector
radix FFT algorithms given that the corresponding 1-D FFTs are known.
Definitions given by Equations (3-2-1) and (3-2-2) are by far the most commonly used representations for m-D DFTs [2, 9, 30, 31, 35, 37, 65, 66], Equation (3-2-2) being a direct matrix representation of Equation (3-2-1).
3-3 Row-Column FFT Algorithms
The relationship between the multidimensional DFT and the 1-D DFTs can be expressed by the Kronecker product in matrix form [30, 35], i.e., the multidimensional DFT matrix $W^2$ is given by
$$W^2 = W_{N_1}^{1} \otimes W_{N_2}^{1} \qquad (3\text{-}3\text{-}1)$$
where $W_{N_i}^{1}$, $i = 1,2$, represents the $N_i$-point 1-D DFT matrix.
The first implication of Equation (3-3-1) for the 2-D DFT problem is the well-known row-column approach. If $N_1 = N_2 = 16$, the row-column radix-4 FFT can be used to calculate the 2-D DFT, as shown in Figure-7. In Figure-7, the vectors x0 to x15 consist of the row elements of the 2-D input array, and X0 to X15 represent rows of the output array with elements in bit-reversed order. Each heavy line represents sixteen data lines, each of which carries an element of $x_i$. The block inscribed R-16 FFT represents the 16-point DFT using the radix-4 FFT given in Figure-6, and performs the row FFTs in the diagram. The addition block stands for vector addition and operates on elements from the same column of the two input vectors. Likewise, the $-1$ block and the $-j$ block perform the corresponding operations on every element of the input vector. The part of Figure-7 to the right of the R-16 FFT blocks forms a radix-4 FFT structure; however, this structure operates on columns of the input array only, i.e., it performs the column FFTs. Using the logic diagram shown in Figure-7, the computation structure of the 2-D DFT is exceedingly clear.
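The row-column computation just described can be sketched in a few lines of Python (an illustrative sketch, not part of the thesis; the function names `dft`, `dft2_row_column` and `dft2_direct` are chosen here). Direct 1-D DFTs stand in for the radix-4 FFT blocks to keep the sketch short; the row-column result is checked against the definition of Equation (3-2-1a).

```python
import cmath

def dft(seq):
    """Direct 1-D DFT: X[k] = sum_m x[m] * W_N^(m*k), with W_N = exp(-j*2*pi/N)."""
    N = len(seq)
    return [sum(seq[m] * cmath.exp(-2j * cmath.pi * m * k / N) for m in range(N))
            for k in range(N)]

def dft2_row_column(x):
    """Row-column 2-D DFT of an N1 x N2 array: 1-D DFTs on rows, then on columns."""
    N1, N2 = len(x), len(x[0])
    rows = [dft(row) for row in x]                        # N1 row transforms (length N2)
    cols = [dft([rows[i][j] for i in range(N1)])          # N2 column transforms (length N1)
            for j in range(N2)]
    return [[cols[l][k] for l in range(N2)] for k in range(N1)]

def dft2_direct(x):
    """Direct 2-D DFT from the definition in Equation (3-2-1a)."""
    N1, N2 = len(x), len(x[0])
    return [[sum(x[m][n] * cmath.exp(-2j * cmath.pi * (m * k / N1 + n * l / N2))
                 for m in range(N1) for n in range(N2))
             for l in range(N2)] for k in range(N1)]
```

The two routines agree to rounding error on any input array, which is exactly the content of Equation (3-3-1): the 2-D DFT matrix factors into row and column 1-D DFT matrices.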
3-4 Vector Radix FFT Algorithms
Instead of proceeding with decimation operations on each dimension separately (one after another) as the row-column method does, the vector radix FFT algorithm suggests that decimation be performed on all indices (or dimensions) simultaneously [42-44].
In the case where decimation-in-time is used on both indices of the 2-D DFT, assuming that $N_1 = r_1 N_1'$ and $N_2 = r_2 N_2'$, set:
$$k = k_1 N_1' + k_0;\quad m = m_1 r_1 + m_0;\quad l = l_1 N_2' + l_0;\quad n = n_1 r_2 + n_0;$$
where $k_1, m_0 = 0,1,\ldots,r_1-1$; $k_0, m_1 = 0,1,\ldots,N_1'-1$; $l_1, n_0 = 0,1,\ldots,r_2-1$; and $l_0, n_1 = 0,1,\ldots,N_2'-1$. From Equation (3-2-1), Equation (3-4-1) is derived:
[Figure-7. Logic diagram of the 16*16-point 2-D DFT computed by the row-column radix-4 FFT.]
$$X(k_1,k_0;l_1,l_0) = \sum_{m_0=0}^{r_1-1}\sum_{n_0=0}^{r_2-1} W_{r_1}^{m_0 k_1} W_{r_2}^{n_0 l_1}\, W_{N_1}^{m_0 k_0} W_{N_2}^{n_0 l_0} \sum_{m_1=0}^{N_1'-1}\sum_{n_1=0}^{N_2'-1} x(m_1,m_0;n_1,n_0)\, W_{N_1'}^{m_1 k_0} W_{N_2'}^{n_1 l_0} \qquad (3\text{-}4\text{-}1)$$
When decimation-in-frequency is applied along both indices, set:
$$k = k_1 r_1 + k_0;\quad m = m_1 N_1' + m_0;\quad l = l_1 r_2 + l_0;\quad n = n_1 N_2' + n_0;$$
where $N_1 = r_1 N_1'$, $N_2 = r_2 N_2'$; $k_1, m_0 = 0,1,\ldots,N_1'-1$; $k_0, m_1 = 0,1,\ldots,r_1-1$; $l_1, n_0 = 0,1,\ldots,N_2'-1$; and $l_0, n_1 = 0,1,\ldots,r_2-1$.
Then from Equation (3-2-1), Equation (3-4-2) is derived:
$$X(k_1,k_0;l_1,l_0) = \sum_{m_0=0}^{N_1'-1}\sum_{n_0=0}^{N_2'-1} W_{N_1'}^{m_0 k_1} W_{N_2'}^{n_0 l_1}\, W_{N_1}^{m_0 k_0} W_{N_2}^{n_0 l_0} \sum_{m_1=0}^{r_1-1}\sum_{n_1=0}^{r_2-1} x(m_1,m_0;n_1,n_0)\, W_{r_1}^{m_1 k_0} W_{r_2}^{n_1 l_0} \qquad (3\text{-}4\text{-}2)$$
Since more than one dimension is decimated, different decimation schemes can be applied to different dimensions, which leads to a mixed decimation vector radix FFT (mixed VR FFT for short [44, 67]). For instance, DIF can be used on the row index and DIT on the column index by setting:
$$k = k_1 r_1 + k_0;\quad m = m_1 N_1' + m_0;\quad l = l_1 N_2' + l_0;\quad n = n_1 r_2 + n_0;$$
where $N_1 = r_1 N_1'$, $N_2 = r_2 N_2'$; $k_1, m_0 = 0,1,\ldots,N_1'-1$; $k_0, m_1 = 0,1,\ldots,r_1-1$; $l_1, n_0 = 0,1,\ldots,r_2-1$; and $l_0, n_1 = 0,1,\ldots,N_2'-1$.
$$X(k_1,k_0;l_1,l_0) = \sum_{m_0=0}^{N_1'-1}\sum_{n_0=0}^{r_2-1} W_{N_1'}^{m_0 k_1} W_{r_2}^{n_0 l_1}\, W_{N_1}^{m_0 k_0} W_{N_2}^{n_0 l_0} \sum_{m_1=0}^{r_1-1}\sum_{n_1=0}^{N_2'-1} x(m_1,m_0;n_1,n_0)\, W_{r_1}^{m_1 k_0} W_{N_2'}^{n_1 l_0} \qquad (3\text{-}4\text{-}3)$$
From Equation (3-4-1), in each stage of the FFT operation, the row twiddles both inside ($W_{r_1}^{m_0 k_1}$) and outside ($W_{N_1}^{m_0 k_0}$) the butterfly structure can be combined with the column twiddles ($W_{r_2}^{n_0 l_1}$ and $W_{N_2}^{n_0 l_0}$, respectively). Intuitively, this explains why vector radix FFT algorithms require fewer multiplications than their row-column counterparts.
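Equation (3-4-1) can be realized directly in a few lines. The following Python sketch (illustrative, not the thesis's implementation) is a minimal recursive VR-2*2 DIT FFT, assuming $r_1 = r_2 = 2$ and $N_1 = N_2 = N$ a power of two: each output is formed from the four decimated sub-transforms through the $(-1)^{m_0 k_1 + n_0 l_1}$ butterfly and a single merged twiddle $W_N^{m_0 k_0} W_N^{n_0 l_0}$ per term.

```python
import cmath

def vr_fft2(x):
    """2-D vector radix-2x2 DIT FFT per Eq. (3-4-1) with r1 = r2 = 2.
    x is an N x N list of lists, N a power of two."""
    N = len(x)
    if N == 1:
        return [[x[0][0]]]
    h = N // 2
    # Four decimated subarrays x(2*m1 + m0, 2*n1 + n0), each transformed recursively.
    S = {(m0, n0): vr_fft2([[x[2*m1 + m0][2*n1 + n0] for n1 in range(h)]
                            for m1 in range(h)])
         for m0 in (0, 1) for n0 in (0, 1)}
    W = lambda p: cmath.exp(-2j * cmath.pi * p / N)
    X = [[0j] * N for _ in range(N)]
    for k1 in (0, 1):
        for l1 in (0, 1):
            for k0 in range(h):
                for l0 in range(h):
                    # Butterfly (-1)^(m0*k1 + n0*l1), then one merged row/column twiddle.
                    X[k1*h + k0][l1*h + l0] = sum(
                        (-1) ** (m0*k1 + n0*l1) * W(m0*k0) * W(n0*l0)
                        * S[(m0, n0)][k0][l0]
                        for m0 in (0, 1) for n0 in (0, 1))
    return X
```

Note that the product $W_N^{m_0 k_0} W_N^{n_0 l_0}$ is a single complex constant per term, whereas a row-column implementation applies the row and column twiddles in separate passes.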
The point is that although this original VR FFT presentation is mathematically and computationally simple and clear, it does little to eliminate the complicated and tedious derivation procedure for the various VR FFT algorithms required by specific applications. When the mixed radix FFT method [54] or the split-radix method [68] is invoked for each dimension to obtain VR FFT algorithms, further complications in the derivation procedure are to be expected. No simple solutions to this problem have appeared in the literature. The computational complexity can also be calculated on the wrong basis; for instance, $W_N^2$ counts as one complex multiplication just as $W_N^3$ does. Another point to be made here is that the mixed vector radix FFT algorithm has more variety than the 1-D mixed radix FFT algorithm [47, 54], which has not been addressed properly in the published literature [1, 31, 42-44, 62], if it was addressed at all. This will be discussed further through examples.
3-5 Matrix Representations for 2-D Vector Radix FFT Algorithms
In order to present the structural approach, a matrix form is introduced for 2-D VR FFTs. Its indexing scheme follows the traditional Cooley-Tukey presentation, which has been widely used in the literature and adopted in both software and hardware implementations; otherwise it is a generalized form of that presented in [31].
A matrix form for DIT VR FFTs given by Equation (3-4-1) can be written as the following three steps:
(BF:)
$$[X(k_1,k_0;l_1,l_0)] = I^1\,[x_1'(k_0,m_0;l_0,n_0)] \qquad (3\text{-}5\text{-}1a)$$
(TM:)
$$[x_1'(k_0,m_0;l_0,n_0)] = E^1\,[x_1(k_0,m_0;l_0,n_0)] \qquad (3\text{-}5\text{-}1b)$$
(Remaining short-length 2-D DFTs:)
$$[x_1(k_0,m_0;l_0,n_0)] = \sum_{m_1=0}^{N_1'-1}\sum_{n_1=0}^{N_2'-1} W_{N_1'}^{m_1 k_0} W_{N_2'}^{n_1 l_0}\,[x(m_1,m_0;n_1,n_0)] \qquad (3\text{-}5\text{-}1c)$$
where $[X(k_1,k_0;l_1,l_0)]$, $[x_1'(k_0,m_0;l_0,n_0)]$, $[x_1(k_0,m_0;l_0,n_0)]$ and $[x(m_1,m_0;n_1,n_0)]$ are $r_1 r_2$ column vectors, with $k_1, l_1$ and $m_0, n_0$ varying in bit-reversed order; $E^1$ is the twiddle factor matrix, an $r_1 r_2 \times r_1 r_2$ diagonal matrix with the element value $F^1(i,i)$ ($i = 1,2,\ldots,r_1 r_2$) equal to $W_{N_1}^{m_0 k_0} W_{N_2}^{n_0 l_0}$ accordingly; and $I^1$ is the matrix for the 2-D vector radix-$r_1{\times}r_2$ BF structure, also an $r_1 r_2 \times r_1 r_2$ matrix, with the element value $I^1(i,j)$ ($i,j = 1,2,\ldots,r_1 r_2$) equal to $W_{r_1}^{m_0 k_1} W_{r_2}^{n_0 l_1}$ correspondingly. Equation (3-5-1c) contains $r_1 r_2$ $N_1'{\times}N_2'$-point 2-D DFTs, which can be further decimated.
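The tensor-product structure of $E^1$ and $I^1$ in Equation (3-5-1) can be checked numerically. The sketch below (illustrative; `kron` is a small helper written for this purpose, not a thesis routine) builds the 2-D vector radix-2*2 butterfly and twiddle matrices from their 1-D radix-2 factors and confirms that the diagonal of the resulting $E^1$ carries the merged factors $W_N^{m_0 k_0} W_N^{n_0 l_0}$.

```python
import cmath

def kron(A, B):
    """Kronecker (tensor) product of two matrices stored as lists of lists."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

N, k0, l0 = 8, 1, 3                      # example indices, N1 = N2 = N
W = lambda p: cmath.exp(-2j * cmath.pi * p / N)

I2 = [[1, 1], [1, -1]]                   # 1-D radix-2 BF structure matrix
F_row = [[1, 0], [0, W(k0)]]             # 1-D row twiddle diag(1, W_N^k0)
F_col = [[1, 0], [0, W(l0)]]             # 1-D column twiddle diag(1, W_N^l0)

I_2d = kron(I2, I2)                      # VR-2x2 BF: entries W_2^(k1 m0) W_2^(l1 n0)
E_2d = kron(F_row, F_col)                # VR-2x2 twiddle: diagonal, merged factors
```

The diagonal of `E_2d`, read with $(m_0, n_0)$ in lexicographic order, is exactly $W_N^{m_0 k_0} W_N^{n_0 l_0}$, i.e., each 2-D twiddle is one complex constant.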
Example-1: Given an $N_1{\times}N_2$-point 2-D DFT where $N_1 = N_2 = N = 16$, the VR-4*4 DIT FFT algorithm in matrix form can be presented as follows:
(BF:)
$$\begin{bmatrix} X_0 \\ X_2 \\ X_1 \\ X_3 \end{bmatrix} = \left( I^{t4} \otimes I^{t4} \right) \begin{bmatrix} x_{1,0}' \\ x_{1,2}' \\ x_{1,1}' \\ x_{1,3}' \end{bmatrix} \qquad (3\text{-}5\text{-}2a)$$
(TM:)
$$\begin{bmatrix} x_{1,0}' \\ x_{1,2}' \\ x_{1,1}' \\ x_{1,3}' \end{bmatrix} = \left( \mathrm{diag}[\,1,\ W_N^{2k_0},\ W_N^{k_0},\ W_N^{3k_0}\,] \otimes F^{t4} \right) \begin{bmatrix} x_{1,0} \\ x_{1,2} \\ x_{1,1} \\ x_{1,3} \end{bmatrix} \qquad (3\text{-}5\text{-}2b)$$
(Remaining short-length 2-D DFTs:)
$$\begin{bmatrix} x_{1,0} \\ x_{1,2} \\ x_{1,1} \\ x_{1,3} \end{bmatrix} = \sum_{m_1=0}^{N_1'-1}\sum_{n_1=0}^{N_2'-1} W_{N_1'}^{m_1 k_0} W_{N_2'}^{n_1 l_0} \begin{bmatrix} x_0 \\ x_2 \\ x_1 \\ x_3 \end{bmatrix} \qquad (3\text{-}5\text{-}2c)$$
where, for $i = 0,1,\ldots,3$:
$X_i = [X(i,k_0;0,l_0),\ X(i,k_0;2,l_0),\ X(i,k_0;1,l_0),\ X(i,k_0;3,l_0)]^T$;
$x_{1,i}' = [x_1'(k_0,i;l_0,0),\ x_1'(k_0,i;l_0,2),\ x_1'(k_0,i;l_0,1),\ x_1'(k_0,i;l_0,3)]^T$;
$x_{1,i} = [x_1(k_0,i;l_0,0),\ x_1(k_0,i;l_0,2),\ x_1(k_0,i;l_0,1),\ x_1(k_0,i;l_0,3)]^T$;
$x_i = [x(m_1,i;n_1,0),\ x(m_1,i;n_1,2),\ x(m_1,i;n_1,1),\ x(m_1,i;n_1,3)]^T$;
and
$$I^{t4} = \begin{bmatrix} 1&1&1&1 \\ 1&1&-1&-1 \\ 1&-1&-j&j \\ 1&-1&j&-j \end{bmatrix}, \qquad F^{t4} = \mathrm{diag}[\,1,\ W_N^{2l_0},\ W_N^{l_0},\ W_N^{3l_0}\,].$$
Written out in block form,
$$I^1 = I^{t4} \otimes I^{t4} = \begin{bmatrix} I^{t4}&I^{t4}&I^{t4}&I^{t4} \\ I^{t4}&I^{t4}&-I^{t4}&-I^{t4} \\ I^{t4}&-I^{t4}&-jI^{t4}&jI^{t4} \\ I^{t4}&-I^{t4}&jI^{t4}&-jI^{t4} \end{bmatrix}, \qquad E^1 = \mathrm{diag}[\,F^{t4},\ W_N^{2k_0}F^{t4},\ W_N^{k_0}F^{t4},\ W_N^{3k_0}F^{t4}\,].$$
The tensor product in Equation (3-5-2) is used here merely as a concise form of presentation. However, it does indicate an important fact which will be discussed in the next section.
A matrix form can also be written for the first stage of DIF VR FFTs presented by Equation (3-4-2) as the following three steps:
(BF:)
$$[x_1(k_0,m_0;l_0,n_0)] = I^f\,[x(m_1,m_0;n_1,n_0)] \qquad (3\text{-}5\text{-}3a)$$
(TM:)
$$[x_1'(k_0,m_0;l_0,n_0)] = E^f\,[x_1(k_0,m_0;l_0,n_0)] \qquad (3\text{-}5\text{-}3b)$$
(Remaining short-length 2-D DFTs:)
$$[X(k_1,k_0;l_1,l_0)] = \sum_{m_0=0}^{N_1'-1}\sum_{n_0=0}^{N_2'-1} W_{N_1'}^{m_0 k_1} W_{N_2'}^{n_0 l_1}\,[x_1'(k_0,m_0;l_0,n_0)] \qquad (3\text{-}5\text{-}3c)$$
where $[X(k_1,k_0;l_1,l_0)]$, $[x_1(k_0,m_0;l_0,n_0)]$, $[x_1'(k_0,m_0;l_0,n_0)]$ and $[x(m_1,m_0;n_1,n_0)]$ are $r_1 r_2$ column vectors with $k_0, l_0$ and $m_1, n_1$ varying in bit-reversed order; $E^f$ is the twiddle factor matrix, an $r_1 r_2 \times r_1 r_2$ diagonal matrix with the element value $F^f(i,i)$ ($i = 1,2,\ldots,r_1 r_2$) equal to $W_{N_1}^{m_0 k_0} W_{N_2}^{n_0 l_0}$ correspondingly; and $I^f$ is the matrix for the 2-D vector radix-$r_1{\times}r_2$ BF structure, also an $r_1 r_2 \times r_1 r_2$ matrix, with the element value $I^f(i,j)$ ($i,j = 1,2,\ldots,r_1 r_2$) equal to $W_{r_1}^{m_1 k_0} W_{r_2}^{n_1 l_0}$. The product of $E^f$, $I^f$ and $[x(m_1,m_0;n_1,n_0)]$ is again a column vector, so that further decimation can proceed on Equation (3-5-3c).
The matrix form for the 2-D mixed vector radix FFT algorithm given by Equation (3-4-3) is as follows:
(BF1:)
$$[x_1(k_0,m_0;l_0,n_0)] = I^m_{1BF}\,[x(m_1,m_0;n_1,n_0)] \qquad (3\text{-}5\text{-}4a)$$
(TM:)
$$[x_1'(k_0,m_0;l_0,n_0)] = E^m_{TM}\,[x_1(k_0,m_0;l_0,n_0)] \qquad (3\text{-}5\text{-}4b)$$
(BF2:)
$$[X(k_1,k_0;l_1,l_0)] = I^m_{2BF}\,[x_1'(k_0,m_0;l_0,n_0)] \qquad (3\text{-}5\text{-}4c)$$
where $[x(m_1,m_0;n_1,n_0)]$ and $[x_1(k_0,m_0;l_0,n_0)]$ in Equation (3-5-4a) are $r_1 N_2'$ column vectors with $k_0, l_0$ and $m_1, n_1$ varying in bit-reversed order; $[x_1(k_0,m_0;l_0,n_0)]$ and $[x_1'(k_0,m_0;l_0,n_0)]$ in Equation (3-5-4b) are $r_1 r_2$ column vectors with $k_0$ and $n_0$ varying in bit-reversed order; and $[x_1'(k_0,m_0;l_0,n_0)]$ and $[X(k_1,k_0;l_1,l_0)]$ in (3-5-4c) are $N_1' r_2$ column vectors with $k_1, l_1$ and $m_0, n_0$ varying in bit-reversed order. $E^m_{TM}$ is the twiddle factor matrix, an $r_1 r_2 \times r_1 r_2$ diagonal matrix with the element value $E^m_{TM}(i,i)$ ($i = 1,2,\ldots,r_1 r_2$) equal to $W_{N_1}^{m_0 k_0} W_{N_2}^{n_0 l_0}$ correspondingly; $I^m_{1BF}$ is the matrix for the 2-D vector radix-$r_1{\times}N_2'$ BF structure, an $r_1 N_2' \times r_1 N_2'$ matrix with the element value $I^m_{1BF}(i,j)$ equal to $W_{r_1}^{m_1 k_0} W_{N_2'}^{n_1 l_0}$; and $I^m_{2BF}$ is the matrix for the 2-D vector radix-$N_1'{\times}r_2$ BF structure, an $N_1' r_2 \times N_1' r_2$ matrix with the element value $I^m_{2BF}(i,j)$ equal to $W_{N_1'}^{m_0 k_1} W_{r_2}^{n_0 l_1}$. The superscript $m$ stands for the mixed decimation vector radix FFT. Further decimation can proceed on both Equation (3-5-4a) and Equation (3-5-4c).
3-6 Structure Theorems
By using the structural features of the multidimensional vector radix DIT FFT stated in the following theorem, the straightforward but often tedious derivation can be bypassed.
[Structure Theorem 1: Decimation-In-Time FFT]
If a 2-D DFT is defined by Equation (3-2-1a), $N_1 = r_1 N_1'$ and $N_2 = r_2 N_2'$, the vector radix-$r_1{\times}r_2$ decimation-in-time FFT is used, and the matrix representations of the corresponding 1-D FFT equations are given as follows:
$$[X(k_1,k_0)] = I_{N_1}^{r_1}\,[x_1'(k_0,m_0)] \qquad (3\text{-}6\text{-}1a)$$
$$[x_1'(k_0,m_0)] = F_{N_1}^{r_1}\,[x_1(k_0,m_0)] \qquad (3\text{-}6\text{-}1b)$$
$$[x_1(k_0,m_0)] = \sum_{m_1=0}^{N_1'-1} W_{N_1'}^{m_1 k_0}\,[x(m_1,m_0)] \qquad (3\text{-}6\text{-}1c)$$
and,
$$[X(l_1,l_0)] = I_{N_2}^{r_2}\,[x_1'(l_0,n_0)] \qquad (3\text{-}6\text{-}2a)$$
$$[x_1'(l_0,n_0)] = F_{N_2}^{r_2}\,[x_1(l_0,n_0)] \qquad (3\text{-}6\text{-}2b)$$
$$[x_1(l_0,n_0)] = \sum_{n_1=0}^{N_2'-1} W_{N_2'}^{n_1 l_0}\,[x(n_1,n_0)] \qquad (3\text{-}6\text{-}2c)$$
where $k_1, m_0 = 0,1,\ldots,r_1-1$; $k_0, m_1 = 0,1,\ldots,N_1'-1$; $l_1, n_0 = 0,1,\ldots,r_2-1$; $l_0, n_1 = 0,1,\ldots,N_2'-1$; $F_{N_1}^{r_1}$ ($F_{N_2}^{r_2}$ respectively) is the twiddle factor matrix of the 1-D radix-$r_1$ (radix-$r_2$ respectively) DIT FFT and $I_{N_1}^{r_1}$ ($I_{N_2}^{r_2}$ respectively) is the BF structure matrix of the 1-D radix-$r_1$ (radix-$r_2$ respectively) DIT FFT, then the matrix equation for the 2-D vector radix-$r_1{\times}r_2$ DIT FFT algorithm is presented by Equation (3-5-1), where $E^1 = F_{N_1}^{r_1} \otimes F_{N_2}^{r_2}$ and $I^1 = I_{N_1}^{r_1} \otimes I_{N_2}^{r_2}$, with the symbol $\otimes$ standing for the tensor (or Kronecker) product [30, 31, 69]. In other words, $E^1$ can be obtained by replacing the element $F_{N_1}^{r_1}(i,i)$ of matrix $F_{N_1}^{r_1}$ with $F_{N_1}^{r_1}(i,i)\,F_{N_2}^{r_2}$, and $I^1$ by replacing $I_{N_1}^{r_1}(i,j)$ of $I_{N_1}^{r_1}$ with $I_{N_1}^{r_1}(i,j)\,I_{N_2}^{r_2}$.
The structure theorem can be readily proved using matrix theory once all equations have been expressed in the above matrix form (see Appendix B). It can be verified that the result is correct by referring to Equation (3-4-1). The complete equations for a specific DIT VR FFT can be obtained by applying the theorem to the remaining short-length 2-D DFTs repeatedly. The application of the structure theorem will be demonstrated in the examples at the end of this subsection.
The relationship between 1-D radix-$2^i$ FFTs and the corresponding vector radix FFTs is clearly explained by the structure theorem, and thus the derivation of higher order vector radix FFT algorithms becomes simpler. Since FFTs based on [22] and [43] are the issue, not surprisingly, the statements cover the processing stages of both BF and TM. The unscrambling stage of a complete vector radix FFT equation is also governed by this rule, i.e., once the unscrambling matrices for the corresponding 1-D FFTs are known, that of the 2-D vector radix FFT algorithm is the tensor product of the two [153].
Similarly, the following theorems for the DIF VR FFTs and the mixed VR FFTs are also true.
[Structure Theorem 2: Decimation-In-Frequency FFT]
Suppose that the $N_1{\times}N_2$ 2-D DFT is defined by Equation (3-2-1a), where $N_1 = r_1 N_1'$, $N_2 = r_2 N_2'$, decimation-in-frequency is used, and the matrix representations of the corresponding 1-D FFT equations are given as follows:
$$[x_1(k_0,m_0)] = I_{N_1}^{r_1}\,[x(m_1,m_0)] \qquad (3\text{-}6\text{-}3a)$$
$$[x_1'(k_0,m_0)] = F_{N_1}^{r_1}\,[x_1(k_0,m_0)] \qquad (3\text{-}6\text{-}3b)$$
$$[X(k_1,k_0)] = \sum_{m_0=0}^{N_1'-1} W_{N_1'}^{m_0 k_1}\,[x_1'(k_0,m_0)] \qquad (3\text{-}6\text{-}3c)$$
and,
$$[x_1(l_0,n_0)] = I_{N_2}^{r_2}\,[x(n_1,n_0)] \qquad (3\text{-}6\text{-}4a)$$
$$[x_1'(l_0,n_0)] = F_{N_2}^{r_2}\,[x_1(l_0,n_0)] \qquad (3\text{-}6\text{-}4b)$$
$$[X(l_1,l_0)] = \sum_{n_0=0}^{N_2'-1} W_{N_2'}^{n_0 l_1}\,[x_1'(l_0,n_0)] \qquad (3\text{-}6\text{-}4c)$$
where $k_0, m_1 = 0,1,\ldots,r_1-1$; $k_1, m_0 = 0,1,\ldots,N_1'-1$; $l_0, n_1 = 0,1,\ldots,r_2-1$; $l_1, n_0 = 0,1,\ldots,N_2'-1$; $F_{N_1}^{r_1}$ ($F_{N_2}^{r_2}$ respectively) is the twiddle factor matrix of the 1-D radix-$r_1$ (radix-$r_2$ respectively) decimation-in-frequency FFT and $I_{N_1}^{r_1}$ ($I_{N_2}^{r_2}$ respectively) is the BF structure matrix of the 1-D radix-$r_1$ (radix-$r_2$ respectively) decimation-in-frequency FFT. The matrix equation for the 2-D vector radix-$r_1{\times}r_2$ DIF FFT algorithm is given by Equation (3-5-3), where $E^f = F_{N_1}^{r_1} \otimes F_{N_2}^{r_2}$ and $I^f = I_{N_1}^{r_1} \otimes I_{N_2}^{r_2}$, with the symbol $\otimes$ standing for the tensor (or Kronecker) product [30, 31, 69].
[Structure Theorem 3: Mixed VR FFT]
For a given $N_1{\times}N_2$ 2-D DFT as shown in Equation (3-2-1a), if $N_1 = r_1 N_1'$, $N_2 = r_2 N_2'$, and the matrix representations of the 1-D DIF FFT and the 1-D DIT FFT algorithms are presented as follows:
$$[x_1(k_0,m_0)] = I_{N_1}^{r_1}\,[x(m_1,m_0)] \qquad (3\text{-}6\text{-}5a)$$
$$[x_1'(k_0,m_0)] = F_{N_1}^{r_1}\,[x_1(k_0,m_0)] \qquad (3\text{-}6\text{-}5b)$$
$$[X(k_1,k_0)] = \sum_{m_0=0}^{N_1'-1} W_{N_1'}^{m_0 k_1}\,[x_1'(k_0,m_0)] \qquad (3\text{-}6\text{-}5c)$$
and,
$$[X(l_1,l_0)] = I_{N_2}^{r_2}\,[x_1'(l_0,n_0)] \qquad (3\text{-}6\text{-}6a)$$
$$[x_1'(l_0,n_0)] = F_{N_2}^{r_2}\,[x_1(l_0,n_0)] \qquad (3\text{-}6\text{-}6b)$$
$$[x_1(l_0,n_0)] = \sum_{n_1=0}^{N_2'-1} W_{N_2'}^{n_1 l_0}\,[x(n_1,n_0)] \qquad (3\text{-}6\text{-}6c)$$
where $k_0, m_1 = 0,1,\ldots,r_1-1$; $k_1, m_0 = 0,1,\ldots,N_1'-1$; $l_1, n_0 = 0,1,\ldots,r_2-1$; $l_0, n_1 = 0,1,\ldots,N_2'-1$; $F_{N_1}^{r_1}$ is the twiddle factor matrix of the 1-D radix-$r_1$ DIF FFT, $I_{N_1}^{r_1}$ is the BF structure matrix of the 1-D radix-$r_1$ DIF FFT, $F_{N_2}^{r_2}$ is the twiddle factor matrix of the 1-D radix-$r_2$ DIT FFT and $I_{N_2}^{r_2}$ is the BF structure matrix of the 1-D radix-$r_2$ DIT FFT, then the matrix equation for the 2-D mixed vector radix-$r_1{\times}r_2$ FFT algorithm is given by Equation (3-5-4), where $I^m_{1BF} = I_{N_1}^{r_1} \otimes W_{N_2'}^{1}$, $E^m_{TM} = F_{N_1}^{r_1} \otimes F_{N_2}^{r_2}$, and $I^m_{2BF} = W_{N_1'}^{1} \otimes I_{N_2}^{r_2}$, with the superscript $m$ for the mixed vector radix FFT. Here $W_{N'}^{1}$ denotes the $N'$-point 1-D DFT matrix with its rows and columns ordered as in Equations (3-6-5c) and (3-6-6c).
The application of the above theorems can be shown by the following examples.
Example-2:
Deriving the 1-D radix-8 FFT algorithm used to be a significant task [64]. However, compared with generating the 2-D vector radix-8*8 FFT directly, it is relatively simple. For many, writing out the corresponding 1-D algorithm (or even deriving it from scratch) or drawing its logic diagram is a good starting point for generating the required 2-D VR FFT, and it is simple enough. By applying the structure theorem, the vector radix FFT formula is then obtained with little extra effort.
Consider a 2-D DFT defined by Equation (3-2-1a) where $N_1 = N_2 = N = 8^{\mu}$, $\mu$ a positive integer, so that the VR-8*8 DIF FFT can be applied.
Begin by writing the butterfly structure and twiddling multiplications of the 1-D radix-8 DIF FFT algorithm in the matrix form presented by Equation (3-6-3), where $r = 8$, $N' = N/8$, $k_1, m_0 = 0,1,\ldots,N'-1$, and
$[X(k_1,k_0)] = [X(k_1,0),\ X(k_1,4),\ X(k_1,2),\ X(k_1,6),\ X(k_1,1),\ X(k_1,5),\ X(k_1,3),\ X(k_1,7)]^T$;
$[x_1'(k_0,m_0)] = [x_1'(0,m_0),\ x_1'(4,m_0),\ x_1'(2,m_0),\ x_1'(6,m_0),\ x_1'(1,m_0),\ x_1'(5,m_0),\ x_1'(3,m_0),\ x_1'(7,m_0)]^T$;
$[x_1(k_0,m_0)] = [x_1(0,m_0),\ x_1(4,m_0),\ x_1(2,m_0),\ x_1(6,m_0),\ x_1(1,m_0),\ x_1(5,m_0),\ x_1(3,m_0),\ x_1(7,m_0)]^T$;
$[x(m_1,m_0)] = [x(0,m_0),\ x(4,m_0),\ x(2,m_0),\ x(6,m_0),\ x(1,m_0),\ x(5,m_0),\ x(3,m_0),\ x(7,m_0)]^T$;
$$I_N^{f8} = \begin{bmatrix}
1&1&1&1&1&1&1&1 \\
1&1&1&1&-1&-1&-1&-1 \\
1&1&-1&-1&-j&-j&j&j \\
1&1&-1&-1&j&j&-j&-j \\
1&-1&-j&j&a&-a&-ja&ja \\
1&-1&-j&j&-a&a&ja&-ja \\
1&-1&j&-j&-ja&ja&a&-a \\
1&-1&j&-j&ja&-ja&-a&a
\end{bmatrix} \qquad (3\text{-}6\text{-}7)$$
where $a = W_8 = \exp(-j\pi/4)$, and
$$F_N^{f8} = \mathrm{diag}[\,1,\ W_N^{4m_0},\ W_N^{2m_0},\ W_N^{6m_0},\ W_N^{m_0},\ W_N^{5m_0},\ W_N^{3m_0},\ W_N^{7m_0}\,].$$
The logic diagram shown in Figure-3 performs the R-8 DIF FFT BF, in which there are only two complex multiplications (caused by $a$); the TM stage, where there are seven non-trivial complex multiplications because of the twiddles, can be added to the BF [45].
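The radix-8 butterfly matrix of Equation (3-6-7) can be generated rather than typed out by hand. The Python sketch below (illustrative; the names are not from the thesis) builds $I_N^{f8}$ from $W_8^{m_1 k_0}$ with rows and columns in bit-reversed order, and confirms that the only non-trivial multipliers among its entries are $\pm a$ and $\pm ja$, with $a = W_8 = e^{-j\pi/4}$.

```python
import cmath

def bitrev3(i):
    """Reverse the three bits of i, mapping 0..7 to 0,4,2,6,1,5,3,7."""
    return ((i & 1) << 2) | (i & 2) | (i >> 2)

W8 = lambda p: cmath.exp(-2j * cmath.pi * p / 8)
order = [bitrev3(i) for i in range(8)]

# Radix-8 DIF BF matrix: entry W_8^(m1*k0), rows k0 and columns m1 bit-reversed.
I_f8 = [[W8(m1 * k0) for m1 in order] for k0 in order]
```

Entries equal to $\pm 1$ or $\pm j$ cost no real multiplications, which is why the R-8 BF needs only the two complex multiplications attributed to $a$ above (each $\pm ja$ entry reuses an $a$-product followed by a trivial rotation).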
According to Structure Theorem 2, the first stage of the 2-D VR-8*8 DIF FFT matrix representation is given by Equation (3-5-3), where $r_1 = r_2 = 8$; $k_1, m_0, l_1, n_0 = 0,1,\ldots,N'-1$; and $N' = N/8$.
$[x(m_1,m_0;n_1,n_0)] = [x_0,\ x_4,\ x_2,\ x_6,\ x_1,\ x_5,\ x_3,\ x_7]^T$ with
$x_i = [x(i,m_0;0,n_0),\ x(i,m_0;4,n_0),\ x(i,m_0;2,n_0),\ x(i,m_0;6,n_0),\ x(i,m_0;1,n_0),\ x(i,m_0;5,n_0),\ x(i,m_0;3,n_0),\ x(i,m_0;7,n_0)]^T$, $i = 0,1,\ldots,7$;
$[x_1(k_0,m_0;l_0,n_0)] = [x_{1,0},\ x_{1,4},\ x_{1,2},\ x_{1,6},\ x_{1,1},\ x_{1,5},\ x_{1,3},\ x_{1,7}]^T$ with
$x_{1,i} = [x_1(i,m_0;0,n_0),\ x_1(i,m_0;4,n_0),\ x_1(i,m_0;2,n_0),\ x_1(i,m_0;6,n_0),\ x_1(i,m_0;1,n_0),\ x_1(i,m_0;5,n_0),\ x_1(i,m_0;3,n_0),\ x_1(i,m_0;7,n_0)]^T$, $i = 0,1,\ldots,7$;
$[x_1'(k_0,m_0;l_0,n_0)] = [x_{1,0}',\ x_{1,4}',\ x_{1,2}',\ x_{1,6}',\ x_{1,1}',\ x_{1,5}',\ x_{1,3}',\ x_{1,7}']^T$ with the sub-vectors $x_{1,i}'$ ordered in the same way;
$[X(k_1,k_0;l_1,l_0)] = [X_0,\ X_4,\ X_2,\ X_6,\ X_1,\ X_5,\ X_3,\ X_7]^T$ with
$X_i = [X(k_1,i;l_1,0),\ X(k_1,i;l_1,4),\ X(k_1,i;l_1,2),\ X(k_1,i;l_1,6),\ X(k_1,i;l_1,1),\ X(k_1,i;l_1,5),\ X(k_1,i;l_1,3),\ X(k_1,i;l_1,7)]^T$, $i = 0,1,\ldots,7$;
$$E^f = \mathrm{diag}[\,F_N^{f8},\ W_N^{4m_0}F_N^{f8},\ W_N^{2m_0}F_N^{f8},\ W_N^{6m_0}F_N^{f8},\ W_N^{m_0}F_N^{f8},\ W_N^{5m_0}F_N^{f8},\ W_N^{3m_0}F_N^{f8},\ W_N^{7m_0}F_N^{f8}\,]$$
where in this 2-D context
$$F_N^{f8} = \mathrm{diag}[\,1,\ W_N^{4n_0},\ W_N^{2n_0},\ W_N^{6n_0},\ W_N^{n_0},\ W_N^{5n_0},\ W_N^{3n_0},\ W_N^{7n_0}\,].$$
From Equation (3-6-7) we have:
$$I^f = I_N^{f8} \otimes I_N^{f8} \qquad (3\text{-}6\text{-}8)$$
i.e., the $64{\times}64$ block matrix obtained by replacing each element of $I_N^{f8}$ with that element multiplying a copy of $I_N^{f8}$; for instance, its first block row is $[\,I_N^{f8}\ \ I_N^{f8}\ \cdots\ I_N^{f8}\,]$ and its fifth block row is $[\,I_N^{f8}\ \ {-I_N^{f8}}\ \ {-jI_N^{f8}}\ \ jI_N^{f8}\ \ aI_N^{f8}\ \ {-aI_N^{f8}}\ \ {-jaI_N^{f8}}\ \ jaI_N^{f8}\,]$.
The complete equations of the VR-8*8 FFT for a specific 2-D DFT application can be obtained by applying the structure theorem recursively. Another point to be made is that Equation (3-6-8) is the matrix presentation of the VR-8*8 FFT butterfly structure, which is equivalent to an 8*8-point DFT and can itself be calculated by further invoking the vector radix approach. In mathematical terms, this implies further application of the properties of the tensor product to Equation (3-6-8). Computing the VR-8*8 FFT BF as it stands would commonly mean invoking the row-column method, and as a result this VR-8*8 FFT would be inferior to the VR-4*4 FFT in terms of arithmetic complexity. However, if the vector radix approach is used to perform this VR-8*8 BF (by the method indicated in [43], by the Combined Factor (CF) method in [45, Appendix C], or by the mixed VR method, as will be shown by the following example), the performance of the VR-8*8 FFT is better than that of VR-4*4 FFTs [44, 45, 67].
Example-3:
Given $N_1 = N_2 = N = 8$, $N = 2N'$, $N' = 4$, the matrix equations for the 1-D radix-2 DIF FFT algorithm on the 8-point DFT are presented as follows:
(BF1:)
$$\begin{bmatrix} x_1(0,m_0) \\ x_1(1,m_0) \end{bmatrix} = \begin{bmatrix} 1&1 \\ 1&-1 \end{bmatrix} \begin{bmatrix} x(0,m_0) \\ x(1,m_0) \end{bmatrix} \qquad (3\text{-}6\text{-}9a)$$
(TM:)
$$\begin{bmatrix} x_1'(0,m_0) \\ x_1'(1,m_0) \end{bmatrix} = \begin{bmatrix} 1&0 \\ 0&W_N^{m_0} \end{bmatrix} \begin{bmatrix} x_1(0,m_0) \\ x_1(1,m_0) \end{bmatrix} \qquad (3\text{-}6\text{-}9b)$$
(BF2:)
$$\begin{bmatrix} X(0,k_0) \\ X(2,k_0) \\ X(1,k_0) \\ X(3,k_0) \end{bmatrix} = \begin{bmatrix} 1&1&1&1 \\ 1&1&-1&-1 \\ 1&-1&-j&j \\ 1&-1&j&-j \end{bmatrix} \begin{bmatrix} x_1'(k_0,0) \\ x_1'(k_0,2) \\ x_1'(k_0,1) \\ x_1'(k_0,3) \end{bmatrix} \qquad (3\text{-}6\text{-}9c)$$
The matrix equations for the 1-D radix-2 DIT FFT algorithm on the 8-point DFT are presented as follows:
(BF1:)
$$\begin{bmatrix} X(0,l_0) \\ X(1,l_0) \end{bmatrix} = \begin{bmatrix} 1&1 \\ 1&-1 \end{bmatrix} \begin{bmatrix} x_1'(l_0,0) \\ x_1'(l_0,1) \end{bmatrix} \qquad (3\text{-}6\text{-}10a)$$
(TM:)
$$\begin{bmatrix} x_1'(l_0,0) \\ x_1'(l_0,1) \end{bmatrix} = \begin{bmatrix} 1&0 \\ 0&W_N^{l_0} \end{bmatrix} \begin{bmatrix} x_1(l_0,0) \\ x_1(l_0,1) \end{bmatrix} \qquad (3\text{-}6\text{-}10b)$$
(BF2:)
$$\begin{bmatrix} x_1(0,n_0) \\ x_1(2,n_0) \\ x_1(1,n_0) \\ x_1(3,n_0) \end{bmatrix} = \begin{bmatrix} 1&1&1&1 \\ 1&1&-1&-1 \\ 1&-1&-j&j \\ 1&-1&j&-j \end{bmatrix} \begin{bmatrix} x(0,n_0) \\ x(2,n_0) \\ x(1,n_0) \\ x(3,n_0) \end{bmatrix} \qquad (3\text{-}6\text{-}10c)$$
Using Structure Theorem 3, from Equations (3-6-9) and (3-6-10) the matrix form for the mixed DIF & DIT vector radix FFT algorithm is derived:
(BF1:)
$$\begin{bmatrix} x_1(0,m_0;0,n_0) \\ x_1(0,m_0;2,n_0) \\ x_1(0,m_0;1,n_0) \\ x_1(0,m_0;3,n_0) \\ x_1(1,m_0;0,n_0) \\ x_1(1,m_0;2,n_0) \\ x_1(1,m_0;1,n_0) \\ x_1(1,m_0;3,n_0) \end{bmatrix} = \left( \begin{bmatrix} 1&1 \\ 1&-1 \end{bmatrix} \otimes \begin{bmatrix} 1&1&1&1 \\ 1&1&-1&-1 \\ 1&-1&-j&j \\ 1&-1&j&-j \end{bmatrix} \right) \begin{bmatrix} x(0,m_0;0,n_0) \\ x(0,m_0;2,n_0) \\ x(0,m_0;1,n_0) \\ x(0,m_0;3,n_0) \\ x(1,m_0;0,n_0) \\ x(1,m_0;2,n_0) \\ x(1,m_0;1,n_0) \\ x(1,m_0;3,n_0) \end{bmatrix} \qquad (3\text{-}6\text{-}11a)$$
(TM:)
$$\begin{bmatrix} x_1'(0,m_0;l_0,0) \\ x_1'(0,m_0;l_0,1) \\ x_1'(1,m_0;l_0,0) \\ x_1'(1,m_0;l_0,1) \end{bmatrix} = \left( F_{N_1}^{f2} \otimes F_{N_2}^{t2} \right) \begin{bmatrix} x_1(0,m_0;l_0,0) \\ x_1(0,m_0;l_0,1) \\ x_1(1,m_0;l_0,0) \\ x_1(1,m_0;l_0,1) \end{bmatrix} \qquad (3\text{-}6\text{-}11b)$$
(BF2:)
$$\begin{bmatrix} X(0,k_0;0,l_0) \\ X(0,k_0;1,l_0) \\ X(2,k_0;0,l_0) \\ X(2,k_0;1,l_0) \\ X(1,k_0;0,l_0) \\ X(1,k_0;1,l_0) \\ X(3,k_0;0,l_0) \\ X(3,k_0;1,l_0) \end{bmatrix} = \left( \begin{bmatrix} 1&1&1&1 \\ 1&1&-1&-1 \\ 1&-1&-j&j \\ 1&-1&j&-j \end{bmatrix} \otimes \begin{bmatrix} 1&1 \\ 1&-1 \end{bmatrix} \right) \begin{bmatrix} x_1'(k_0,0;l_0,0) \\ x_1'(k_0,0;l_0,1) \\ x_1'(k_0,2;l_0,0) \\ x_1'(k_0,2;l_0,1) \\ x_1'(k_0,1;l_0,0) \\ x_1'(k_0,1;l_0,1) \\ x_1'(k_0,3;l_0,0) \\ x_1'(k_0,3;l_0,1) \end{bmatrix} \qquad (3\text{-}6\text{-}11c)$$
where:
$$F_{N_1}^{f2} \otimes F_{N_2}^{t2} = \begin{bmatrix} 1&0&0&0 \\ 0&W_N^{l_0}&0&0 \\ 0&0&W_N^{m_0}&0 \\ 0&0&0&W_N^{m_0+l_0} \end{bmatrix}$$
The above theorems provide very simple construction tools for various vector radix FFT algorithms. With a knowledge of different 1-D FFT algorithms, the 2-D VR FFT for a required application can readily be achieved. Since the theorems state clearly what the 2-D BF or TM stage should look like, checking a new variation of VR FFTs becomes a simple and straightforward procedure. Once the complete equations of the 1-D FFT algorithms are available, it is a matter of interweaving the corresponding BF and TM structures of the 1-D algorithms to form the 2-D (m-D) BF and TM structures. Although not discussed in the theorems, the output sequences of 2-D VR FFTs also obey the properties of the tensor product with respect to the 1-D FFT output sequences [153].
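The tensor-product assembly stated by the theorems can be spot-checked numerically. The short Python sketch below (illustrative, not from the thesis; `kron_diag` is a helper written here) verifies the diagonal identity used in the twiddle stage (3-6-11b): $\mathrm{diag}(1, W_N^{m_0}) \otimes \mathrm{diag}(1, W_N^{l_0}) = \mathrm{diag}(1,\ W_N^{l_0},\ W_N^{m_0},\ W_N^{m_0+l_0})$.

```python
import cmath

N = 8
W = lambda p: cmath.exp(-2j * cmath.pi * p / N)

def kron_diag(d1, d2):
    """Kronecker product of two diagonal matrices, kept as diagonals."""
    return [a * b for a in d1 for b in d2]

m0, l0 = 2, 3                                  # example index values
lhs = kron_diag([1, W(m0)], [1, W(l0)])        # F^(f2) tensor F^(t2)
rhs = [1, W(l0), W(m0), W(m0 + l0)]            # diagonal claimed in (3-6-11b)
```

The agreement of `lhs` and `rhs` rests on $W_N^{m_0} W_N^{l_0} = W_N^{m_0+l_0}$, the exponent-addition property that underlies all twiddle merging in this chapter.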
3-7 Structural Approach via Logic Diagrams
The diagrammatical interpretation of structure theorems can be expressed both at
stage-by-stage level [45] and as a complete form for a specific application [67]. Obtaining
the logic diagram of a 2-D FFT from those of 1-D FFTs requires the following procedure:
drawing 1-D FFT logic diagram(s); generating the logic diagram using the row-column
FFT; and finally, modifying the logic diagram using the row-column FFT into various 2-
D vector radix FFTs. Modification of the logic diagram follows the simple rules as
shown by the following equations.
In Figure-8(a), A x0 ± A x1 = A(x0 ± x1).
In Figure-8(b), αA x = A(αx).
where x, x0 and x1 are column vectors; x0 and x1 are of the same dimension; A is an
operator and α is a scalar.
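These two rules are simply the linearity of the operator A, and a quick check confirms both; the following is a hedged numpy sketch in which a random matrix stands in for a butterfly or twiddle stage (the names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))  # an arbitrary linear operator
x0 = rng.standard_normal(4) + 1j * rng.standard_normal(4)
x1 = rng.standard_normal(4) + 1j * rng.standard_normal(4)
alpha = 0.5 - 2.0j  # a scalar, e.g. a twiddle factor

# Figure-8(a): two copies of A feeding a sum/difference collapse into one copy of A.
assert np.allclose(A @ x0 + A @ x1, A @ (x0 + x1))
assert np.allclose(A @ x0 - A @ x1, A @ (x0 - x1))

# Figure-8(b): a scalar multiplier may be moved across the operator.
assert np.allclose(alpha * (A @ x0), A @ (alpha * x0))
```

It is exactly these identities that allow twiddle factors to be pushed through butterfly blocks when a row-column diagram is reshaped into a vector radix one.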
For long-length DFTs using high radices [45, 70], the logic diagram of the FFT at
the stage-by-stage level is the more useful form, because a complete final drawing would be difficult to
accommodate on one sheet of paper; nor is it necessary, although it is not unachievable.
For small size DFTs, deriving a complete logic diagram is always preferable.
Example-4:
In this first example, the VR-4*4 FFT algorithm for a 16*16-point DFT will be
derived using the logic diagram. As most 1-D FFT algorithms are well documented, it is
simple to start by drawing a 1-D logic diagram. In this case, the logic diagram of
a 16-point DFT using the radix-4 FFT algorithm is presented in Figure-6. Even if
no 1-D logic diagram were available, drawing a 1-D diagram from equations is much simpler
than drawing a 2-D vector radix FFT diagram from its equations. For this reason, it is
preferable that fast algorithms be presented as logic diagrams (or, equally, flow graphs)
whenever feasible. From personal experience, more often than not, one can judge whether
a 1-D fast transform algorithm is worth generalizing to its
multidimensional counterpart, and whether a saving in computational
complexity could be made, just by looking at the structure of the algorithm's logic diagram.
After the logic diagram is drawn for the 1-D radix-4 FFT as shown in Figure-6, the
figure is partitioned into three parts according to the stages of the FFT procedure, as
included in Figure-6. The logic diagram of the 2-D 16*16-point DFT using the
row-column radix-4 FFT is then given in Figure-7. Replacing all blocks inscribed
R-16 FFT in Figure-7 by Figure-6 yields Figure-9. Figure-9 can then be modified
into Figure-10, the logic diagram of the vector radix-4*4 DIT FFT
algorithm for a 16*16-point DFT. The twiddle factors of the row FFT are combined with
those of the column FFT to reduce the number of multiplications; this is the reason
why the vector radix approach is less expensive, in terms of computational operations,
than the row-column approach. In Figure-10, the VR-4*4 FFT BF is actually
implemented by the row-column approach. An alternative is to apply the VR-2*2
FFT to the VR-4*4 FFT BF, but this brings no further saving in the number of non-trivial
multiplications.
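The row-column baseline that Figure-7 represents is easy to check numerically. The following sketch is an assumption-light illustration using numpy's library FFT in place of the radix-4 diagrams; it confirms that 1-D transforms applied along the rows and then along the columns reproduce the direct 2-D 16*16-point DFT:

```python
import numpy as np

N = 16
x = np.random.default_rng(1).standard_normal((N, N))

# Row-column method: 1-D DFTs along every row, then along every column.
rc = np.fft.fft(np.fft.fft(x, axis=1), axis=0)

# Direct 2-D DFT for reference.
direct = np.fft.fft2(x)
assert np.allclose(rc, direct)
```

The vector radix algorithm computes the same result; its saving lies only in how many nontrivial twiddle multiplications are performed along the way.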
The matrix form of the vector radix DIT FFT algorithm and the structure theorem can be
readily extended to higher dimensions, and so can the logic diagrams. Figure-10 can be
used as the VR-16*16 FFT BF in the vector radix-16*16 FFT algorithm, and so forth
[70].
This example not only shows the evolution of the VR-4*4 FFT algorithm from its
1-D counterpart, but also indicates that, to perform a 16*16-point 2-D DFT in a pipelined
computation, only one complex multiplier is required in a VLSI design [14, 15, 134,
137].
As this technique imposes no requirement on the radices, nor any
knowledge of how the decimation (DIT or DIF) procedure is undertaken to obtain the 1-D
FFTs, it is not surprising that mixed radix FFT algorithms can be derived by this approach
as well.
Example-5:
In this example, the mixed DIF and DIT vector radix FFT algorithm is derived to
compute an 8*8-point DFT; it is equivalent to that presented in Equation (3-6-11).
The 8*8-point 2-D DFT is first calculated using the row-column FFT algorithm with
different decimation techniques, as shown in Figure-11, where 32
nontrivial multiplications are involved. In Figure-11, row transforms are performed
using 1-D DIT FFTs as shown in Figure-2, and column transforms are computed by DIF
FFTs as shown in Figure-3. When the mixed vector radix FFT algorithm is applied to the
same problem (its logic diagram is shown in Figure-12), the number of
nontrivial multiplications is reduced to 24 after combining the row and column
twiddles. This example once again demonstrates that different multidimensional vector
radix FFT algorithms can be developed systematically using the structure theorem.
Even if the complete Equation (3-6-11) for the mixed vector radix FFT looks
somewhat complicated, its diagrammatical presentation is extremely clear and straightforward.
[Figure-11: The 8*8-point DFT computed by the row-column method, with 1-D DIT FFTs on the rows and DIF FFTs on the columns; 32 nontrivial multiplications.]

[Figure-12: The logic diagram of the mixed vector radix FFT algorithm for the 8*8-point DFT; 24 nontrivial multiplications after combining the row and column twiddles.]
Before considering the 2-D vector split-radix FFT algorithm and the comparative
study of various vector radix FFT algorithms, two points have to be made. One is that
this structural approach, in both its matrix form and its diagrammatical form, can be
extended to multidimensional cases with little difficulty. The other is that the 2-D direct
vector radix DCT algorithms were devised by examination of the logic diagrams of the
corresponding 1-D algorithms and were later verified by mathematical analysis.
The discussion of the combined factor vector radix-8*8 and vector radix-16*16
FFT algorithms is included in Appendix C.
3-8 2-D Vector Split-Radix FFT Algorithms
Another successful application of the structural approach is to generate the complete
equations for the DIF vector split-radix FFT [60]. The idea behind the split-radix
approach is quite simple. In one-dimensional Discrete Fourier Transform computation,
the 1-D split-radix approach [68] divides a length-N DFT into two DFTs of length N/2
when a radix-2 FFT is applied at the first stage. One of the resulting N/2 DFTs, which
involves the odd-indexed terms, is further decimated using radix-2. Thus the original DFT is
implemented by an N/2 DFT together with two N/4 DFTs, and an algorithm can be
devised to reduce the number of operations required to complete the transform. The
trouble is that when this very approach is applied to 2-D DFTs using the traditional
mathematical representations [36, 37], the final equation for the algorithm contains so
many terms that, without an understanding of its structural features, the derivation,
verification and implementation of the algorithm would be difficult
indeed [37, 71], not to mention its generalization to even higher dimensions.
Recently, the split-radix FFT algorithm has been extended to two
dimensions using Decimation-In-Frequency (DIF) [36] and Decimation-In-Time [37]. In
this section, the complete equations for the first stage of the vector split-radix DIF FFT
algorithm are derived using the structural approach [45, 60]. The structure theorem is used
initially to obtain the VR-2*2 and VR-4*4 DIF FFT equations; the split-radix idea is then
applied to compute the outputs with both indices even in a vector radix-2*2 step and
the rest in a vector radix-4*4 step. The algorithm is the two-dimensional counterpart of
the 1-D split-radix DIF FFT algorithm [68], and differs from the split vector radix 2-D
FFT [36] in the way in which the vector radices are divided.
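The 1-D decomposition described above can be written as a short recursion. The following is an illustrative time-decimated sketch of the split-radix idea (one N/2 DFT of the even-indexed samples plus two N/4 DFTs of the odd-indexed samples), not the thesis's own code; it is checked against a library FFT:

```python
import numpy as np

def split_radix_fft(x):
    """Split-radix FFT of a length-N sequence, N a power of two:
    an N/2 DFT (even samples) plus two N/4 DFTs (samples 4m+1 and 4m+3)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 1:
        return x
    if N == 2:
        return np.array([x[0] + x[1], x[0] - x[1]])
    U = split_radix_fft(x[0::2])    # N/2-point DFT of the even samples
    Z1 = split_radix_fft(x[1::4])   # N/4-point DFT of samples 4m+1
    Z3 = split_radix_fft(x[3::4])   # N/4-point DFT of samples 4m+3
    k = np.arange(N // 4)
    W1 = np.exp(-2j * np.pi * k / N)      # twiddles W_N^k
    W3 = np.exp(-2j * np.pi * 3 * k / N)  # twiddles W_N^{3k}
    s = W1 * Z1 + W3 * Z3
    d = W1 * Z1 - W3 * Z3
    X = np.empty(N, dtype=complex)
    X[k] = U[k] + s
    X[k + N // 2] = U[k] - s
    X[k + N // 4] = U[k + N // 4] - 1j * d
    X[k + 3 * N // 4] = U[k + N // 4] + 1j * d
    return X

x = np.random.default_rng(2).standard_normal(16)
assert np.allclose(split_radix_fft(x), np.fft.fft(x))
```

The "L-shaped" output recombination in the last four lines is what makes the split-radix butterfly irregular compared with a pure radix-2 or radix-4 stage.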
Using the structural approach [45], the vector radix-2*2 and vector radix-4*4
DIF FFT algorithms can be derived easily from the corresponding 1-D algorithms. The
matrix form of the 2-D vector radix-2*2 DIF FFT is given by the following equations,
assuming N1 = N2 = N.
\begin{bmatrix} x1(0,m0;0,n0) \\ x1(0,m0;1,n0) \\ x1(1,m0;0,n0) \\ x1(1,m0;1,n0) \end{bmatrix}
= \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix}
\begin{bmatrix} x(0,m0;0,n0) \\ x(0,m0;1,n0) \\ x(1,m0;0,n0) \\ x(1,m0;1,n0) \end{bmatrix}        (3-8-1a)

\begin{bmatrix} x1'(0,m0;0,n0) \\ x1'(0,m0;1,n0) \\ x1'(1,m0;0,n0) \\ x1'(1,m0;1,n0) \end{bmatrix}
= diag( 1, W_N^{n0}, W_N^{m0}, W_N^{m0+n0} )
\begin{bmatrix} x1(0,m0;0,n0) \\ x1(0,m0;1,n0) \\ x1(1,m0;0,n0) \\ x1(1,m0;1,n0) \end{bmatrix}        (3-8-1b)

\begin{bmatrix} X(k1,0;l1,0) \\ X(k1,0;l1,1) \\ X(k1,1;l1,0) \\ X(k1,1;l1,1) \end{bmatrix}
= \sum_{m0=0}^{N'-1} \sum_{n0=0}^{N'-1} W_{N'}^{m0 k1} W_{N'}^{n0 l1}
\begin{bmatrix} x1'(0,m0;0,n0) \\ x1'(0,m0;1,n0) \\ x1'(1,m0;0,n0) \\ x1'(1,m0;1,n0) \end{bmatrix}        (3-8-1c)

where k1, l1 = 0,1,...,N'-1, and N' = N/2.
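As an illustration of Equations (3-8-1a) to (3-8-1c), the following numpy sketch carries out one VR-2*2 DIF stage — butterfly, twiddling, then four N/2 * N/2 2-D DFTs — and checks it against a direct 2-D DFT. It is a hedged reading of the equations, with the remaining sub-DFTs computed by a library FFT rather than decimated further:

```python
import numpy as np

N = 8
Np = N // 2  # N' = N/2
rng = np.random.default_rng(3)
x = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))

# Quadrants x(m1,m0; n1,n0) = x[m1*N' + m0, n1*N' + n0], with m1, n1 in {0, 1}.
x00, x01 = x[:Np, :Np], x[:Np, Np:]
x10, x11 = x[Np:, :Np], x[Np:, Np:]

# Butterfly stage, Equation (3-8-1a).
x1 = {(0, 0): x00 + x01 + x10 + x11,
      (0, 1): x00 - x01 + x10 - x11,
      (1, 0): x00 + x01 - x10 - x11,
      (1, 1): x00 - x01 - x10 + x11}

# Twiddling stage, Equation (3-8-1b): diag(1, W^n0, W^m0, W^{m0+n0}).
m0 = np.arange(Np).reshape(-1, 1)
n0 = np.arange(Np).reshape(1, -1)
W = np.exp(-2j * np.pi / N)
tw = {(0, 0): 1, (0, 1): W ** n0, (1, 0): W ** m0, (1, 1): W ** (m0 + n0)}

# Remaining N' * N' 2-D DFTs, Equation (3-8-1c); output index k = 2*k1 + k0.
X = np.empty((N, N), dtype=complex)
for (k0, l0), block in x1.items():
    X[k0::2, l0::2] = np.fft.fft2(tw[(k0, l0)] * block)

assert np.allclose(X, np.fft.fft2(x))
```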
The vector radix-4*4 DIF FFT is described by the following equations:

[x1(k0,m0; l0,n0)] = I_f [x(m1,m0; n1,n0)]        (3-8-2a)

[x1'(k0,m0; l0,n0)] = E_f [x1(k0,m0; l0,n0)]        (3-8-2b)

[X(k1,k0; l1,l0)] = \sum_{m0=0}^{N''-1} \sum_{n0=0}^{N''-1} W_{N''}^{m0 k1} W_{N''}^{n0 l1} [x1'(k0,m0; l0,n0)]        (3-8-2c)

where k1, l1 = 0,1,...,N''-1, and N'' = N/4;
[X(k1,k0; l1,l0)] = [X(k1,0;l1,0), X(k1,0;l1,2), X(k1,0;l1,1), X(k1,0;l1,3),
                     X(k1,2;l1,0), X(k1,2;l1,2), X(k1,2;l1,1), X(k1,2;l1,3),
                     X(k1,1;l1,0), X(k1,1;l1,2), X(k1,1;l1,1), X(k1,1;l1,3),
                     X(k1,3;l1,0), X(k1,3;l1,2), X(k1,3;l1,1), X(k1,3;l1,3)]^T;

[x(m1,m0; n1,n0)] = [x(0,m0;0,n0), x(0,m0;2,n0), x(0,m0;1,n0), x(0,m0;3,n0),
                     x(2,m0;0,n0), x(2,m0;2,n0), x(2,m0;1,n0), x(2,m0;3,n0),
                     x(1,m0;0,n0), x(1,m0;2,n0), x(1,m0;1,n0), x(1,m0;3,n0),
                     x(3,m0;0,n0), x(3,m0;2,n0), x(3,m0;1,n0), x(3,m0;3,n0)]^T;

E_f = diag[ 1, W_N^{2n0}, W_N^{n0}, W_N^{3n0},
            W_N^{2m0}, W_N^{2m0+2n0}, W_N^{2m0+n0}, W_N^{2m0+3n0},
            W_N^{m0}, W_N^{m0+2n0}, W_N^{m0+n0}, W_N^{m0+3n0},
            W_N^{3m0}, W_N^{3m0+2n0}, W_N^{3m0+n0}, W_N^{3m0+3n0} ];

I_f = I_f4 ⊗ I_f4 = \begin{bmatrix} I_f4 & I_f4 & I_f4 & I_f4 \\ I_f4 & -I_f4 & I_f4 & -I_f4 \\ I_f4 & -j I_f4 & -I_f4 & j I_f4 \\ I_f4 & j I_f4 & -I_f4 & -j I_f4 \end{bmatrix},
\qquad
I_f4 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & -j & -1 & j \\ 1 & j & -1 & -j \end{bmatrix}.
The basic approach of the vector split-radix algorithm is to compute the outputs
with both indices even in a vector radix-2*2 step and the rest in a vector radix-4*4
step. Both indices are even in the first line of each of Equations (3-8-1a) to (3-8-1c).
Thus, in the 4*4 process, only twelve of the sixteen equations in Equation (3-8-2)
are required, since X(k1,0;l1,0), X(k1,0;l1,2), X(k1,2;l1,0) and X(k1,2;l1,2) have already been
obtained in the vector radix-2*2 step. This gives the first stage of the vector split-radix DIF
FFT decomposition, as shown below:
X(k1,0;l1,0) = \sum_{m0=0}^{N*-1} \sum_{n0=0}^{N*-1} W_{N*}^{m0 k1} W_{N*}^{n0 l1} x1(0,m0;0,n0)        (3-8-3a)

x1(0,m0;0,n0) = [1  1  1  1] [x(0,m0;0,n0), x(0,m0;1,n0), x(1,m0;0,n0), x(1,m0;1,n0)]^T        (3-8-3b)

where N* = N/2, k1, l1 = 0,1,...,N*-1; and,
66
[x1(k0,m0; l0,n0)]_m = I_fm [x(m1,m0; n1,n0)]        (3-8-3c)

[x1'(k0,m0; l0,n0)]_m = E_fm [x1(k0,m0; l0,n0)]_m        (3-8-3d)

[X(k1,k0; l1,l0)]_m = \sum_{m0=0}^{N''-1} \sum_{n0=0}^{N''-1} W_{N''}^{m0 k1} W_{N''}^{n0 l1} [x1'(k0,m0; l0,n0)]_m        (3-8-3e)

where k1, l1 = 0,1,...,N''-1, and N'' = N/4;

[X(k1,k0; l1,l0)]_m = [X(k1,0;l1,1), X(k1,0;l1,3),
                       X(k1,2;l1,1), X(k1,2;l1,3),
                       X(k1,1;l1,0), X(k1,1;l1,2), X(k1,1;l1,1), X(k1,1;l1,3),
                       X(k1,3;l1,0), X(k1,3;l1,2), X(k1,3;l1,1), X(k1,3;l1,3)]^T;

[x1(k0,m0; l0,n0)]_m = [x1(0,m0;1,n0), x1(0,m0;3,n0),
                        x1(2,m0;1,n0), x1(2,m0;3,n0),
                        x1(1,m0;0,n0), x1(1,m0;2,n0), x1(1,m0;1,n0), x1(1,m0;3,n0),
                        x1(3,m0;0,n0), x1(3,m0;2,n0), x1(3,m0;1,n0), x1(3,m0;3,n0)]^T;

E_fm = diag[ W_N^{n0}, W_N^{3n0}, W_N^{2m0+n0}, W_N^{2m0+3n0},
             W_N^{m0}, W_N^{m0+2n0}, W_N^{m0+n0}, W_N^{m0+3n0},
             W_N^{3m0}, W_N^{3m0+2n0}, W_N^{3m0+n0}, W_N^{3m0+3n0} ];

I_fm is the 12*16 matrix obtained from I_f = I_f4 ⊗ I_f4 of Equation (3-8-2) by deleting its first, second, fifth and sixth rows;

and [x(m1,m0; n1,n0)] is defined as in Equation (3-8-2). The first, second, fifth and sixth
rows of [X(k1,k0; l1,l0)], [x1(k0,m0; l0,n0)], E_f and I_f have been omitted to obtain
[X(k1,k0; l1,l0)]_m, [x1(k0,m0; l0,n0)]_m, E_fm and I_fm. All indices are the same as those
in Equation (3-8-2), but a long and tedious direct derivation has been avoided. The logic
diagram of the vector split-radix DIF FFT can be obtained by modifying the
corresponding logic diagrams of the VR-2*2 and VR-4*4 DIF FFT algorithms, which is a
simple procedure. Complete equations for the first-stage 2-D vector split-radix DIT FFT
algorithm can also be constructed by this simple approach [37].
3-9 Comparisons of Various 2-D Vector Radix FFT Algorithms
The comparison of vector radix FFT algorithms in this section mainly follows the
traditional criteria, i.e., arithmetic complexity, error analysis, in-place computation
and regularity of the computation structure, as mentioned in the previous chapter. Since
the analysis of arithmetic complexity in the early work on vector radix FFTs [42, 43],
there have been many other reports on the issue for different vector radix
algorithms [1, 36, 37, 44, 45, 60, 62]. The arithmetic complexity, in terms of
multiplications, of various vector radix FFTs is listed in Table-1 in comparison with row-column
FFTs, assuming complex inputs. N = 4096 is chosen because all
the vector radix FFT algorithms considered can be applied. It is worth noting that although
the split-radix method requires fewer multiplications than the other Cooley-Tukey-based
FFTs in 1-D DFT computations [68], its applications in the 2-D case [36, 37, 60]
are less effective than the Combined Factor (CF) VR-16*16 FFT in terms of multiply
operations [45, 70]. Besides, since vector radix FFTs preserve the regular computation
structure inherited from the 1-D Cooley-Tukey algorithms, they are bound to have advantages
in software and hardware DFT implementations [154]. They carry out an in-place
computation, and their numerical features are also superior to those of the row-column method.
Vector radix FFT algorithms consist of VR-2*2 BFs, regular twiddling multiplication
stages and regular index formation. These features, along with the pipelined and
parallel structure inherited from their 1-D counterparts, facilitate both software and
hardware implementation of fast 2-D DFT computation [134, 137].
To give a brief idea of the reduction in complex multiplications: for a 4096*4096 2-D
DFT problem, the number of complex multiplications required by the vector split-radix
DIF FFT [Appendix E] is only about 37% of that required by the radix-2 row-column
FFT algorithm [45]; about 65% of that required by the row-column FFT using the
1-D split-radix FFT [68]; 49% of that needed by the vector radix-2*2 FFT [43]; 66% of that required by
Table-1 Arithmetic complexity of FFT algorithms for 4096*4096 2-D DFTs,
in terms of multiplications

2-D FFT Algorithm | BF multiplications | TM multiplications | Total multiplications | Percentage (total)
RC R-2            | 0                  | 2*92,274,688       | 184,549,376           | 100.00%
RC R-4            | 0                  | 2*62,914,560       | 125,829,120           |  68.18%
RC R-8            | 2*16,777,216       | 2*44,040,192       | 121,634,816           |  65.91%
RC R-16           | 2*25,165,824       | 2*31,457,280       | 113,246,208           |  61.36%
RC SR FFT [68]    | N/A                | N/A                | 104,398,848           |  56.57%
VR-2*2            | 0                  | 138,412,032        | 138,412,032           |  75.00%
VR-4*4            | 0                  | 78,643,200         | 78,643,200            |  42.61%
VR-8*8            | 33,554,432         | 49,545,216         | 83,099,648            |  45.03%
VR-16*16          | 50,331,648         | 33,423,360         | 83,755,008            |  45.38%
CF VR-8*8         | 25,165,824         | 49,545,216         | 74,711,040            |  40.48%
CF VR-16*16       | 33,030,144         | 33,423,360         | 66,453,504            |  36.01%
VSR-1 [36]        | N/A                | N/A                | 102,676,560           |  55.64%
VSR-2 [60]        | N/A                | N/A                | 67,746,504            |  36.71%

NOTE:
RC R-i: the row-column 1-D radix-i FFT algorithm;
VR:     the vector radix 2-D FFT algorithm;
CF:     Combined Factor method applied;
N/A:    Not Applicable;
BF:     ButterFly computation structure;
TM:     Twiddling Multiplications;
SR:     Split-Radix FFT algorithm;
VSR:    Vector Split-Radix 2-D FFT algorithm.
a different vector split-radix DIF FFT approach [36]; and it is slightly (2%) inferior to the
combined factor vector radix-16*16 FFT algorithm [45, 70]. This algorithm needs
slightly more complex additions than the vector radix-2*2 FFT algorithm. Further
discussion of the issue can be found in [37].
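Two entries of Table-1 can be reproduced from first principles. The sketch below assumes the usual counting conventions (only nontrivial twiddle multiplications are counted, so the twiddle-free final stage drops out); this is an assumption about the bookkeeping, not a statement of the thesis's exact derivation:

```python
# Complex-multiplication counts for an N*N 2-D DFT, N = 4096 = 2**12.
N = 4096
log2N = 12

# Row-column radix-2: 2N passes of a 1-D radix-2 FFT, each pass costing
# (N/2)*(log2(N) - 1) nontrivial twiddle multiplications.
rc_r2 = 2 * N * (N // 2) * (log2N - 1)

# Vector radix-2*2: (3/4)*N*N nontrivial twiddles per stage, log2(N)-1 twiddled stages.
vr_22 = (3 * N * N // 4) * (log2N - 1)

assert rc_r2 == 184_549_376  # Table-1, RC R-2
assert vr_22 == 138_412_032  # Table-1, VR-2*2
print(f"VR-2*2 uses {100 * vr_22 / rc_r2:.2f}% of the RC R-2 multiplications")  # 75.00%
```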
3-10 Vector Radix FFT Using FDP™ A41102
The AUSTEK Frequency Domain Processor (FDP™) A41102, designed by the Australian
CSIRO, is a high-performance CMOS VLSI device providing a complete hardware
solution for implementing FFTs [14, 15]. Its main features include performing DFTs of up to 256
complex points within 102.4 µs, and 2-D 8*8-point or 16*16-point DFTs
in a single-processor configuration with a throughput of 2.5 Ms/s [28]. In [28], 2-D
512*512-point and 1024*1024-point DFTs are implemented using FDPs by the row-column
method. Although there are many publications in which multidimensional vector
radix FFT algorithms are shown by simulation to have computational advantages over the
row-column method, there have been very few reports on hardware implementation [134,
137]. In this section, it is demonstrated that when the vector radix method is used,
fewer FDPs are required to obtain the same 2-D FFT processing throughput.
The vector radix-8*8 FFT algorithm can be used to calculate 512*512-point DFTs.
The complete operation is divided into three vector radix-8*8 ButterFly (BF) stages and two
Twiddling Multiplication (TM) stages [70]. Since the VR-8*8 butterfly computation
structure is a 2-D 8*8-point DFT in its own right, it matters little whether it is implemented
by the row-column or the vector radix approach, so long as the most efficient computation
is achieved. In fact, the 2-D 8*8-point DFT is calculated by the row-column FFT on
the FDP A41102. Using the VR-8*8 FFT algorithm to perform 512*512-point DFTs, a
multi-FDP system design is described in Figure-13; it consists of three FDPs with
auxiliary discrete circuits and renders a processing rate of 2.5 Ms/s, compared with the four
FDPs required by the row-column procedure [28]. In this configuration, the VR-8*8 BFs
are calculated by the 2-D 8*8-point FFT function provided on the FDP A41102s, and the two TM
stages are performed using the two available uncommitted complex multipliers. Using a
[Figure-13: A three-FDP system design performing 512*512-point DFTs with the VR-8*8 FFT algorithm at 2.5 Ms/s.]

[Figure-14: Alternative three-FDP configurations performing 1024*1024-point DFTs with the mixed VR-16*16 and VR-8*8 FFT algorithm at 2.5 Ms/s.]
mixed VR-16*16 and VR-8*8 FFT algorithm, 1024*1024-point DFTs can also be
implemented using three FDPs, rendering a throughput of 2.5 Ms/s, with alternative
configurations as shown in Figure-14. In real-time image processing, a multi-processor
system has to be used, and a reduction in the number of processors means a decrease in
system complexity. These are but a few examples of what can be done using
the vector radix approach.
3-11 Summary
In this chapter, the structural approach to the construction of 2-D VR FFT
algorithms has been presented in both mathematical equations and diagrammatical forms.
This method helps in understanding the structures of various VR FFTs and, most
importantly, also eases the burden of the implementation task for electrical engineers.
Using the diagrammatical representation, modifying a VR FFT algorithm to fit
special design requirements becomes a simple task. The comparative study of various VR
FFTs summarizes their arithmetic complexities and also their merits in the
context of error analysis, in-place computation and regularity of the computation
structure.
The introduction of the FDP A41102 demonstrates a complete VLSI hardware
solution to the DFT computation. It has been shown that if the vector radix method were
used, the number of complex multipliers on the processor needed to perform either a 2-D 8*8- or
16*16-point DFT could be reduced to one. Even if the FDP is used in its current form,
incorporating the vector radix method in the application of 2-D DFTs for real-time image
processing will reduce the number of processors required to achieve the same
performance as with the row-column FFT. This means a reduction in
system complexity.
CHAPTER FOUR: A PERSPECTIVE ON VECTOR RADIX FFT
ALGORITHMS OF HIGHER DIMENSIONS
As discussed in the introduction, multidimensional (m-D) Discrete Fourier
Transforms (DFTs) of dimension three or greater have been used in the
reconstruction of 3-D microscopic-scale objects to remove out-of-focus noise [16], in Nuclear
Magnetic Resonance (NMR) imaging algorithms [9], and in computer vision and pattern analysis
to provide a better understanding of the dynamics of the visual system [10-12]. When the
dimension of the problem increases, the computational burden becomes heavy, so the
saving in computation time from efficient fast algorithms is of even greater
significance [45]. Because of the complexity of the problem involved, a systematic
approach is required for the comprehension, derivation, construction and effective
implementation of multidimensional fast algorithms. The structural approach
introduced in Chapter Three is at least one technique capable of being developed to
higher dimensions, as it has been successfully demonstrated in the construction
of 2-D vector radix FFT algorithms [45, 60] and of the 2-D direct vector radix fast Discrete
Cosine Transform (DCT) algorithms to be discussed shortly after this chapter
[80]. The approach can also assist in developing computer programs for
multidimensional vector radix algorithms, especially when the computer programs for the
corresponding 1-D fast algorithms are available.
In this chapter, the structural approach for the construction of m-D (m >= 3) fast
vector radix FFT algorithms is closely examined using both matrix and diagrammatical
forms. From the definitions of the multidimensional DFT and its inverse, equations which
represent multidimensional vector radix Decimation-In-Time and Decimation-In-Frequency
FFTs are derived. A structural approach based on the matrix representation is
described and used to construct multidimensional vector radix FFTs. A recursive
logic diagram symbol system is then presented to show how an m-D (m >= 3) vector radix
FFT algorithm can be derived and represented in graphical form. An example is also
given to demonstrate the simple procedure required to construct a vector radix-4*4*4 FFT
algorithm for a 16*16*16-point 3-D DFT using the symbol system. Since the
diagrammatical approach imposes no restrictions on how
the decimations (DIT or DIF) are applied to each dimension, various vector radix
FFT algorithms can be constructed by this method. Although not discussed in this thesis,
the material presented in this chapter can be extended to m-D (m >= 3) fast vector radix
DCT algorithms as well.
4-1 Definitions
As mentioned in the previous chapter, the multidimensional DFT of dimension m is
defined as:

X(k1,k2,...,km) = \sum_{n1=0}^{N1-1} \sum_{n2=0}^{N2-1} ... \sum_{nm=0}^{Nm-1} x(n1,n2,...,nm) W_{N1}^{n1 k1} W_{N2}^{n2 k2} ... W_{Nm}^{nm km}        (4-1-1a)

and its inverse is defined as:

x(n1,n2,...,nm) = (1/(N1 N2 ... Nm)) \sum_{k1=0}^{N1-1} \sum_{k2=0}^{N2-1} ... \sum_{km=0}^{Nm-1} X(k1,k2,...,km) W_{N1}^{-n1 k1} W_{N2}^{-n2 k2} ... W_{Nm}^{-nm km}        (4-1-1b)

where ki, ni = 0,1,...,Ni-1; i = 1,2,...,m. In their matrix forms,

X = W^m x        (4-1-2a)

and

x = (1/(N1 N2 ... Nm)) W^{-m} X        (4-1-2b)

where W^m = W_{N1} ⊗ W_{N2} ⊗ ... ⊗ W_{Nm}, with W_{Ni}, i = 1,...,m, representing the Ni-point
1-D DFT matrix; W^{-m} = W_{N1}^{-} ⊗ W_{N2}^{-} ⊗ ... ⊗ W_{Nm}^{-}, with W_{Ni}^{-}, i = 1,...,m,
representing the Ni-point 1-D inverse DFT matrix (without the scale factor); X and x are N1 N2 ... Nm-component column vectors
of the output and input sequences respectively (also see Example-6).
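The tensor-product form of Equation (4-1-2a) can be verified directly for a small 2-D case. This hedged numpy sketch builds W^m as the Kronecker product of two 1-D DFT matrices and compares it with a direct 2-D DFT (the function name is illustrative, not the thesis's notation):

```python
import numpy as np

def dft_matrix(n):
    """The n-point 1-D DFT matrix, entries W_n^{jk} = exp(-2*pi*i*j*k/n)."""
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-2j * np.pi * j * k / n)

N1, N2 = 4, 8
x = np.random.default_rng(4).standard_normal((N1, N2))

# W^m = W_{N1} (x) W_{N2}, acting on x flattened in row-major order.
Wm = np.kron(dft_matrix(N1), dft_matrix(N2))
X = (Wm @ x.reshape(-1)).reshape(N1, N2)

assert np.allclose(X, np.fft.fft2(x))
```

The same construction extends to any m by chaining further Kronecker factors, which is exactly what the structure theorems below exploit stage by stage.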
If DIT is used on all indices of the m-D DFT, assuming Ni = ri * Ni', i = 1,...,m,
set:

ki = ki1*Ni' + ki0;    ni = ni1*ri + ni0;

where ki1, ni0 = 0,1,...,ri-1; ki0, ni1 = 0,1,...,Ni'-1. Then:

X(k11,k10; k21,k20; ...; km1,km0) =
    \sum_{n10=0}^{r1-1} \sum_{n20=0}^{r2-1} ... \sum_{nm0=0}^{rm-1}
    [ \sum_{n11=0}^{N1'-1} \sum_{n21=0}^{N2'-1} ... \sum_{nm1=0}^{Nm'-1}
      x(n11,n10; n21,n20; ...; nm1,nm0) W_{N1'}^{n11 k10} W_{N2'}^{n21 k20} ... W_{Nm'}^{nm1 km0} ]
    * W_{N1}^{n10 k10} W_{N2}^{n20 k20} ... W_{Nm}^{nm0 km0}
    * W_{r1}^{n10 k11} W_{r2}^{n20 k21} ... W_{rm}^{nm0 km1}        (4-1-3)

Accordingly, the m-D DIF VR FFT and mixed VR FFT equations can be derived.
If DIF is used on all indices of the m-D DFT, assuming Ni = ri * Ni', i = 1,...,m, set:

ki = ki1*ri + ki0;    ni = ni1*Ni' + ni0;

where ki1, ni0 = 0,1,...,Ni'-1; ki0, ni1 = 0,1,...,ri-1. Then:

X(k11,k10; k21,k20; ...; km1,km0) =
    \sum_{n10=0}^{N1'-1} \sum_{n20=0}^{N2'-1} ... \sum_{nm0=0}^{Nm'-1}
    W_{N1'}^{n10 k11} W_{N2'}^{n20 k21} ... W_{Nm'}^{nm0 km1}
    [ \sum_{n11=0}^{r1-1} \sum_{n21=0}^{r2-1} ... \sum_{nm1=0}^{rm-1}
      x(n11,n10; n21,n20; ...; nm1,nm0) W_{r1}^{n11 k10} W_{r2}^{n21 k20} ... W_{rm}^{nm1 km0} ]
    * W_{N1}^{n10 k10} W_{N2}^{n20 k20} ... W_{Nm}^{nm0 km0}        (4-1-4)
Since there is more than one dimension, different decimations can be applied to the
indices of different dimensions, so there are further variations of the vector radix FFT
algorithm. A unified form for the mixed VR FFT algorithms is, however, difficult to present.
4-2 Matrix Representations and Structure Theorems
A matrix form for the DIT vector radix FFT algorithm presented by Equation (4-1-3)
can be given as follows:

[X(k11,k10; k21,k20; ...; km1,km0)] = I_t [x1'(k10,n10; k20,n20; ...; km0,nm0)]        (4-2-1a)

[x1'(k10,n10; k20,n20; ...; km0,nm0)] = F_t [x1(k10,n10; k20,n20; ...; km0,nm0)]        (4-2-1b)

[x1(k10,n10; k20,n20; ...; km0,nm0)] =
    \sum_{n11=0}^{N1'-1} \sum_{n21=0}^{N2'-1} ... \sum_{nm1=0}^{Nm'-1}
    W_{N1'}^{n11 k10} W_{N2'}^{n21 k20} ... W_{Nm'}^{nm1 km0} [x(n11,n10; n21,n20; ...; nm1,nm0)]        (4-2-1c)

where [x(n11,n10; ...; nm1,nm0)], [x1(k10,n10; ...; km0,nm0)], [x1'(k10,n10; ...; km0,nm0)] and
[X(k11,k10; ...; km1,km0)] are r1*r2*...*rm-component column vectors with ki1 and ni0 (1 <= i <= m)
varying in bit-reversed order; F_t is the twiddle factor matrix, an r1r2...rm * r1r2...rm diagonal
matrix whose element F_t(i,i) (1 <= i <= r1r2...rm) equals W_{N1}^{n10 k10} W_{N2}^{n20 k20} ... W_{Nm}^{nm0 km0}
accordingly; and I_t is the matrix of the m-D vector radix-r1*r2*...*rm butterfly
structure, also an r1r2...rm * r1r2...rm matrix, whose element I_t(i,j) (1 <= i,j <= r1r2...rm)
equals W_{r1}^{n10 k11} W_{r2}^{n20 k21} ... W_{rm}^{nm0 km1} correspondingly. Equation (4-2-1c)
contains r1*r2*...*rm N1'*N2'*...*Nm'-point m-D DFTs which can be further
decimated.

The generalization of the structure theorem for the m-D DIT case is stated as
follows:

If Ni = ri * Ni' in an m-D DFT defined by Equation (4-1-1), and the 1-D DIT FFT
algorithms are given by:

[X(ki1,ki0)] = I_t^{ri} [x1'(ki0,ni0)]        (4-2-2a)

[x1'(ki0,ni0)] = F_t^{Ni} [x1(ki0,ni0)]        (4-2-2b)

[x1(ki0,ni0)] = \sum_{ni1=0}^{Ni'-1} W_{Ni'}^{ni1 ki0} [x(ni1,ni0)]        (4-2-2c)

where 1 <= i <= m; 0 <= ki1, ni0 <= ri-1; 0 <= ki0, ni1 <= Ni'-1; then the DIT m-D vector
radix-r1*r2*...*rm FFT algorithm is given by Equation (4-2-1), where:

F_t = F_t^{N1} ⊗ F_t^{N2} ⊗ ... ⊗ F_t^{Nm};        (4-2-3a)

I_t = I_t^{r1} ⊗ I_t^{r2} ⊗ ... ⊗ I_t^{rm}.        (4-2-3b)
Similarly, a matrix form for the DIF vector radix FFT algorithm given by Equation
(4-1-4) can be presented as follows:

[x1(k10,n10; k20,n20; ...; km0,nm0)] = I_f [x(n11,n10; n21,n20; ...; nm1,nm0)]        (4-2-4a)

[x1'(k10,n10; k20,n20; ...; km0,nm0)] = F_f [x1(k10,n10; k20,n20; ...; km0,nm0)]        (4-2-4b)

[X(k11,k10; k21,k20; ...; km1,km0)] =
    \sum_{n10=0}^{N1'-1} \sum_{n20=0}^{N2'-1} ... \sum_{nm0=0}^{Nm'-1}
    W_{N1'}^{n10 k11} W_{N2'}^{n20 k21} ... W_{Nm'}^{nm0 km1} [x1'(k10,n10; k20,n20; ...; km0,nm0)]        (4-2-4c)

where [x(n11,n10; ...; nm1,nm0)], [x1(k10,n10; ...; km0,nm0)], [x1'(k10,n10; ...; km0,nm0)] and
[X(k11,k10; ...; km1,km0)] are r1*r2*...*rm-component column vectors with ki0 and ni1 (1 <= i <= m)
varying in bit-reversed order; F_f is the twiddle factor matrix, an r1r2...rm * r1r2...rm diagonal
matrix whose element F_f(i,i) (1 <= i <= r1r2...rm) equals W_{N1}^{n10 k10} W_{N2}^{n20 k20} ... W_{Nm}^{nm0 km0}
accordingly; and I_f is the matrix of the m-D vector radix-r1*r2*...*rm butterfly structure,
also an r1r2...rm * r1r2...rm matrix, whose element I_f(i,j) (1 <= i,j <= r1r2...rm)
equals W_{r1}^{n11 k10} W_{r2}^{n21 k20} ... W_{rm}^{nm1 km0} correspondingly. Equation (4-2-4c)
contains r1*r2*...*rm N1'*N2'*...*Nm'-point m-D DFTs which can be further
decimated.

The generalization of the structure theorem for the m-D DIF case is stated as
follows:

If Ni = ri * Ni' in an m-D DFT defined by Equation (4-1-1), and the 1-D DIF FFT
algorithms are given by:

[x1(ki0,ni0)] = I_f^{ri} [x(ni1,ni0)]        (4-2-5a)

[x1'(ki0,ni0)] = F_f^{Ni} [x1(ki0,ni0)]        (4-2-5b)

[X(ki1,ki0)] = \sum_{ni0=0}^{Ni'-1} W_{Ni'}^{ni0 ki1} [x1'(ki0,ni0)]        (4-2-5c)

where 1 <= i <= m; 0 <= ki1, ni0 <= Ni'-1; 0 <= ki0, ni1 <= ri-1; then the DIF m-D vector
radix-r1*r2*...*rm FFT algorithm is given by Equation (4-2-4), where:

F_f = F_f^{N1} ⊗ F_f^{N2} ⊗ ... ⊗ F_f^{Nm};        (4-2-6a)

I_f = I_f^{r1} ⊗ I_f^{r2} ⊗ ... ⊗ I_f^{rm}.        (4-2-6b)
To obtain the complete equations for an m-D DIT or DIF vector radix FFT, one simply
applies the theorem to the remaining short-length m-D DFTs repeatedly; this makes the
derivation simpler and the programming easier, especially when the corresponding 1-D
algorithm(s) or program(s) are available.
4-3 Diagrammatical Presentations
The logic diagram for an m-D vector radix FFT is much simpler than its matrix
representation. Because the representation for m-D, m >= 3, is the same as that
for 2-D except for the definitions of the symbols, a recursive symbol system can be
developed. Consider the procedure for developing a vector radix-r1*r2*...*ri (1 <= i <= m,
where m is the dimension of the DFT) butterfly structure along with its twiddle
multiplication stage.

In Figure-15-(a), xj (0 <= j <= ri-1) is a vector of dimension r1*r2*...*r(i-1),
the elements of which are in natural order. The symbol labelled VR-r1*r2*...*r(i-1) BF
is the (i-1)-dimensional vector radix-r1*r2*...*r(i-1) butterfly computation
structure, and that labelled VR-r1*r2*...*r(i-1) TM is the (i-1)-dimensional vector
radix-r1*r2*...*r(i-1) twiddling multiplication. x'j (0 <= j <= ri-1) is a vector of r1*r2*...*r(i-1)
dimensions with elements in bit-reversed order. The symbol labelled R-ri BF on
Dimension-i is a 1-dimensional radix-ri butterfly computation structure which works on
dimension i. x1j (0 <= j <= ri-1) is a vector of r1*r2*...*r(i-1) dimensions. The elements of
x1j are in bit-reversed order, as are the output vectors x1j (0 <= j <= ri-1) themselves.
[Figure-15: The recursive symbol system: (a) the (i-1)-dimensional VR-r1*r2*...*r(i-1) BF and TM symbols feeding 1-D radix-ri stages; (b) the diagram after applying the basic modification rules; (c) the combined i-dimensional VR-r1*r2*...*ri BF and TM stage.]
Using the basic modification rules of logic diagrams, the logic diagram shown in
Figure-15-(b) is derived. Combining the (i-1)-dimensional VR-r1*r2*...*r(i-1) BF
with the 1-dimensional radix-ri BF results in an i-dimensional VR-r1*r2*...*
r(i-1)*ri BF computational structure. When the VR-r1*r2*...*r(i-1) TM is combined with the radix-ri
TM, an i-dimensional twiddling multiplication stage is formed in this
symbol system, as shown in Figure-15-(c). It is thus possible to build an m-dimensional
vector radix FFT algorithm by starting from the 1-D FFT algorithm(s) and repeating this
construction until the required algorithm is achieved.
Example-6:
The procedure to derive a vector radix-4*4*4 F F T algorithm using the logic
diagram to compute a 16* 16* 16-point 3-D DFT, given a 1-D radix-4 FFT algorithm for a
16-point 1-D DFT, is as follows:
(a) draw the logic diagram using radix-4 FFT for the 1-D 16-point D F T as shown
in Figure-6;
(b) determine the logic diagram using the "row-column" radix-4 FFT method on
the 16* 16* 16-point 3-D D F T (not shown);
(c) use the vector radix-4*4 FFT to replace the 2-D row-column FFT algorithm
on 2-D data vectors as shown in Figure-16; blocks inscribed by VR-4*4 BFa,
VR-4*4 T M and VR-4*4 BFb are defined in Figure-10;
(d) modify Figure-16 to Figure-17 and combine twiddle factors to obtain the
vector radix-4*4*4 FFT algorithm. The major difference between Figure-10
and Figure-17 is that all the symbols of Figure-10 represent and operate on
vectors of size 16 whilst those of Figure-17, of size 256 (or 16*16) where xi
= [x(i,0,0), x(i,0,l), .... x(i,0,15), x(i,l,0), .... x(i,l,15), ..., x(i,15,0), ...,
x(i,15,15)], i = 0 15.
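The row-column principle which step (b) starts from can be checked numerically. The following sketch (illustrative only, not part of the thesis implementation) applies 1-D FFT passes along each axis of a 16*16*16 block and confirms that the result equals the direct 3-D DFT:

```python
import numpy as np

# Row-column evaluation of a 3-D DFT: one 1-D FFT pass per dimension.
# The 16*16*16 size and the random test data are illustrative.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16, 16)) + 1j * rng.standard_normal((16, 16, 16))

X = x.copy()
for axis in range(3):          # successive 1-D transforms along each axis
    X = np.fft.fft(X, axis=axis)

assert np.allclose(X, np.fft.fftn(x))   # matches the direct 3-D DFT
```

The vector radix construction replaces groups of these 1-D stages with multidimensional butterflies, but the overall input-output map remains this 3-D DFT.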
82
[Figure-16 and Figure-17: Logic diagrams deriving the vector radix-4*4*4 FFT algorithm for the 16*16*16-point 3-D DFT (figures not reproduced).]
84
4-4 Computing Power Limitations

When the dimension of a DFT increases, the number of operations required for its computation increases dramatically. At the current stage of VLSI technology, only m-D (m > 3) DFTs of relatively small size can be processed at real-time speed [13, 28, 129, 130]. The difference between the computation times of an addition and a multiplication has been reduced, in some cases to nothing [40], so that the total number of numerical operations becomes the key issue [129]. The time used for data transfers also becomes significant, so that in-place computation and a regular computing structure will certainly be crucial in m-D DFT calculations [129, 130]. So far the implementation of m-D DFTs by and large uses the "row-column" method, whether on VLSI [2, 13-15], on Very Long Instruction Word (VLIW) architecture supercomputers [129], or on distributed-memory multiprocessor supercomputers [130]. Amongst the few reports that use m-D fast vector radix FFT algorithms is the one by Liu and Hughes [134]. Although [134] discusses only the implementation of the vector radix-2*2 FFT, many of its advantages over the row-column method have already been shown. The savings of vector radix FFT algorithms over the row-column FFT become substantial as the dimension of the DFT increases and/or higher radices are used, as indicated by Table-2 [1, 43-45].
Three very active areas associated with the hardware implementation of DFTs are ASICs [14-15, 134, 137, 142], systolic array designs [135, 136, 140, 141, 145, 147] and neural networks [138]. Still, even the latest successful implementations can only cater for 1-D DFTs, or for very small 2-D or 3-D DFTs, at real-time speed, with the neural network approach in its early stage.

Complete hardware solutions to m-D (m > 3) DFT problems are dependent upon future development of VLSI technology, on understanding different m-D algorithms, and on the ability to construct them in a systematic way. It has been shown by many [134, 141, 147] that, beyond arithmetic complexity and maximal use of pipelining and parallelism, fast algorithms chosen for VLSI implementation should above all possess a regular computation structure to enable
85
Table-2
Arithmetic Complexity of FFT Algorithms for 64*64*64 3-D DFTs

3-D FFT        Number of BF      Number of TM      Total number of    Percentage
algorithm      multiplications   multiplications   multiplications    (total mults.)
R-2                     0           3*655,360         1,966,080         100.00%
R-4                     0           3*393,216         1,179,648          60.00%
R-8             3*131,072           3*229,376         1,081,344          55.00%
VR-2*2*2                0           1,146,880         1,146,880          58.33%
VR-4*4*4                0             516,096           516,096          26.25%
VR-8*8*8          393,216             261,632           654,848          33.31%
CF VR-8*8*8       229,376             261,632           491,008          24.97%

NOTE:
R-i: the "row-column" 1-D radix-i FFT algorithm;
VR: the vector radix FFT algorithm;
CF: Combined Factor method applied;
BF: butterfly computation structure;
TM: twiddling multiplications.
86
systematic VLSI integration. The understanding of such algorithms and their implementation would be greatly assisted by an understanding of their computational structures.
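The percentage column of Table-2 is each algorithm's total multiplication count relative to the row-column radix-2 reference, which can be verified with a few lines of arithmetic (totals copied from the table):

```python
# Verify the percentage column of Table-2: total multiplications of each
# 3-D FFT algorithm relative to the row-column radix-2 reference.
totals = {
    "R-2": 1_966_080, "R-4": 1_179_648, "R-8": 1_081_344,
    "VR-2*2*2": 1_146_880, "VR-4*4*4": 516_096,
    "VR-8*8*8": 654_848, "CF VR-8*8*8": 491_008,
}
ref = totals["R-2"]
pct = {name: round(100 * t / ref, 2) for name, t in totals.items()}

assert pct["R-4"] == 60.00 and pct["R-8"] == 55.00
assert pct["VR-4*4*4"] == 26.25 and pct["CF VR-8*8*8"] == 24.97
```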
87
PART II.
MULTIDIMENSIONAL DISCRETE COSINE TRANSFORMS
88
CHAPTER FIVE: INTRODUCTION TO MULTIDIMENSIONAL
DISCRETE COSINE TRANSFORMS
The Discrete Cosine Transform (DCT) was first introduced into digital signal processing for the purposes of pattern recognition and Wiener filtering [17]. The two-dimensional (2-D) DCT is used for transform coding of images in telecommunications such as video-conferencing, video telephony, video image compression for HDTV and applications in fast packet switching networks [3, 18, 19, 157]. Its performance is virtually indistinguishable from that of the optimal Karhunen-Loeve transform [3, 17] in terms of energy packing ability, decorrelation efficiency and least mean-square error. Many fast DCT algorithms require only real-number operations and possess fairly regular computational structures similar to those of FFTs and vector radix FFTs, which substantially facilitates software and hardware implementations. The DCT has, by now, become the standard decorrelation transform for compression of 1-D and 2-D signals [72, 73].
To implement the 2-D DCT, many fast algorithms are available, and these algorithms are basically divided into two groups:

Direct fast algorithms, which are based on matrix factorization of the DCT matrix or on computation of a long-length DCT by shorter-length DCTs;

Indirect fast algorithms, which compute the DCT through an FFT of the same size [34, 57, 82] or through other fast algorithms [79, 113, 124, 156].

In each group there are two approaches: the row-column approach, where the 2-D DCT is generated by repeated application of a 1-D DCT, and the 2-D fast algorithm approach. In each group there are many fast algorithms, as shown in Figure-18, which is by no means exhaustive.

In 2-D image transform coding, the original image is usually divided into 8*8 or 16*16 blocks, and these blocks are cosine transformed. For real-time video coding, it is assumed that the transmission rate is 30 frames/second with a frame size of 288*352 pixels, so a processing rate of about 3.04 Msamples/s is required. This means that an 8*8-point DCT
89
[Figure-18: A family tree of fast DCT algorithms, grouped into direct and indirect methods with row-column and 2-D variants (figure not reproduced).]
90
has to be carried out in about 21 µs, or a 16*16-point DCT has to be calculated within about 84 µs.
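These timing budgets follow directly from the frame parameters; a short check (illustrative):

```python
# Recompute the processing-rate figures quoted above: 288*352 pixels per
# frame at 30 frames/s, and the time budget per 8*8 or 16*16 block.
rate = 288 * 352 * 30          # samples per second, about 3.04 Msamples/s
t8 = 64 / rate                 # time budget for one 8*8-point DCT
t16 = 256 / rate               # time budget for one 16*16-point DCT

assert abs(rate - 3.04e6) < 0.01e6
assert abs(t8 - 21e-6) < 1e-6 and abs(t16 - 84e-6) < 1e-6
```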
Over the last couple of years, VLSI fabrications of DCT processors which provide real-time image coding throughput have been reported. Currently, these DCT processors can only perform fixed-point calculation. The length of the input data format varies from 9 to 16 bits, as does that of the output of the DCT processors. Usually, these are adequate for coding applications. Examples of DCT processors are the IMS A121 DCT processor (inmos) [88], the STV3200 DCT processor (SGS-Thomson) [87], and the TMC2311 (TRW) [89]. Because the transform size used in image coding is relatively small, being 8*8 or 16*16, the performance of the above DCT processors is very close in terms of processing speed. Their processing rate varies from 13 MHz to 27 MHz, which caters for real-time image coding. For the same reason, various row-column algorithms, including the direct matrix multiplication method, have been used in VLSI DCT processors without showing a great deal of difference in speed performance for image coding. New developments in this area can be found in [158] and [159]. A comparative study of the error performance of these DCT processors remains to be undertaken.
Of the two classes of 1-D fast DCT algorithms, the indirect approach shows little advantage over the direct approach in terms of arithmetic complexity, and it usually does not have a regular computation structure and involves an excessive number of additions [72]. These drawbacks carry over to their 2-D extensions, although 2-D indirect methods have been reported to need fewer multiplications [72]. Amongst the direct algorithms, the Lee algorithm is by far the most efficient in terms of the number of multiplications (and the total number of numerical operations) [76, 78], and it has a regular, systematic and simple computation structure. However, the algorithm requires inversion or division of the cosine coefficients, which has been claimed to cause numerical instabilities because of roundoff errors in finite-length registers [38, 72, 76, 77]. This problem will be examined in the next chapter in comparison with other methods. Hou introduced a new fast DCT algorithm [77] which uses bit-shifting and data shuffling for better numerical performance. Hou's algorithm is as efficient as Lee's in terms of the number of multiplications and additions and also has a simple, regular structure. When these direct 1-D algorithms are extended into 2-D applications, their structural features are preserved, as indicated previously.
In the context of fast computation of 2-D DCTs, there are several reports on 2-D indirect algorithms [57, 72, 79], whilst the direct method up to now is dominated by row-column 1-D algorithms. The 2-D direct fast DCT algorithm [38], though more efficient, remains less well known [65, 72, 77]. Besides, not all 1-D direct fast DCT algorithms can be expanded into 2-D fast algorithms effectively. The only one which has been reported is the 2-D fast algorithm by Haque based on Lee's method [38]. In [38], the direct matrix decomposition method is used to expand Lee's algorithm, and the improvement of the new algorithm over several other known algorithms is demonstrated in terms of the number of operations. It has also been shown that the roundoff errors in Lee's algorithm do not cause serious problems for small sizes such as 8*8- and 16*16-point 2-D data-block DCTs, which are commonly used in image coding applications.

We use a structured approach on Lee's algorithm directly to generate a 2-D fast DCT algorithm, reproducing the Haque algorithm. A 2-D logic diagram is also used to represent the algorithm, so that 8*8- and 16*16-point 2-D DCTs using the new 2-D fast DCT algorithm are readily devised from the 1-D Lee algorithm and easily implemented [46, 80, 81]. To avoid the roundoff errors that Lee's algorithm may cause, a new two-dimensional fast DCT algorithm has been devised based on Hou's algorithm [80, 81, 108] using the same technique. Both algorithms are equally efficient.
5-1 Definitions of 1-D DCT and Its Inverse DCT

The definitions of the N-point 1-D discrete cosine transform and its inverse are given by the following equations [3, 4, 17, 38, 76, 77, 82]:

X(k) = (2/N) e(k) \sum_{n=0}^{N-1} x(n) C_{2N}^{(2n+1)k}    (5-1-1a)

and,

x(n) = \sum_{k=0}^{N-1} e(k) X(k) C_{2N}^{(2n+1)k}    (5-1-1b)

where n, k = 0, 1, ..., N-1; C_{2N}^{(2n+1)k} = cos[π(2n+1)k/(2N)]; and

e(k) = 1/√2 if k = 0; 1 otherwise.

In its matrix form, Equation (5-1-1) can be written as:

X = C^O x    (5-1-2a)

and,

x = C^O_I X    (5-1-2b)

where C^O = E T C; C^O_I = C_I E; E = diag[1/√2, 1, ..., 1]; T = diag[2/N, ..., 2/N]; X is the DCT vector; x the data vector; C(i,j) = C_{2N}^{(2j+1)i}; C_I = (C)^T; and the superscript T stands for the transpose operation of a matrix.

In order to derive fast algorithms, define

X̂(k) = [N/(2 e(k))] X(k)    (5-1-3a)

and,

X̃(k) = e(k) X(k)    (5-1-3b)

resulting in the "denormalized" DCT and IDCT as shown in Equation (5-1-4):

X̂(k) = \sum_{n=0}^{N-1} x(n) C_{2N}^{(2n+1)k}    (5-1-4a)

and,

x(n) = \sum_{k=0}^{N-1} X̃(k) C_{2N}^{(2n+1)k}    (5-1-4b)

where n, k = 0, 1, ..., N-1.

In matrix form:

X̂ = C x    (5-1-5a)

and,

x = C_I X̃ = (C)^T X̃    (5-1-5b)

where T stands for the transpose operation.

It is from Equation (5-1-4) or Equation (5-1-5) that fast algorithms are derived. Operations involving E and T can be applied either before or after the denormalized DCT or IDCT is performed.
Example-7: For a 4-point DCT,

C^O = diag[1/√2, 1, 1, 1] diag[1/2, 1/2, 1/2, 1/2]
      [ C_8^0   C_8^0   C_8^0   C_8^0 ]
      [ C_8^1   C_8^3  -C_8^3  -C_8^1 ]
      [ C_4^1  -C_4^1  -C_4^1   C_4^1 ]
      [ C_8^3  -C_8^1   C_8^1  -C_8^3 ]    (5-1-6)

and the 4-point IDCT matrix is given by

C^O_I = [ C_8^0   C_8^1   C_4^1   C_8^3 ]
        [ C_8^0   C_8^3  -C_4^1  -C_8^1 ]
        [ C_8^0  -C_8^3  -C_4^1   C_8^1 ]
        [ C_8^0  -C_8^1   C_4^1  -C_8^3 ] diag[1/√2, 1, 1, 1]    (5-1-7)
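The normalization in Equation (5-1-2) makes the IDCT matrix of Example-7 the exact inverse of the DCT matrix, which can be confirmed numerically (an illustrative sketch, not part of the thesis software):

```python
import numpy as np

# Numerical check of Example-7: with C(i,j) = cos(pi*(2j+1)*i/(2N)),
# the matrices C^O = E*T*C and C^O_I = C^T*E from Equation (5-1-2)
# satisfy C^O_I @ C^O = I for N = 4.
N = 4
i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
C = np.cos(np.pi * (2 * j + 1) * i / (2 * N))
E = np.diag([1 / np.sqrt(2), 1, 1, 1])
T = np.diag([2 / N] * N)

C_O = E @ T @ C        # forward DCT matrix, Equation (5-1-6)
C_O_I = C.T @ E        # inverse DCT matrix, Equation (5-1-7)
assert np.allclose(C_O_I @ C_O, np.eye(N))
```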
5-2 Definitions of 2-D DCT and Its Inverse DCT

The N*N-point 2-D DCT and its inverse (IDCT) are given by Equations (5-2-1a) and (5-2-1b) [4, 57, 72, 80]:

X(k,l) = (4/N^2) e(k) e(l) \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} x(n,m) C_{2N}^{(2n+1)k} C_{2N}^{(2m+1)l}    (5-2-1a)

and,

x(n,m) = \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} e(k) e(l) X(k,l) C_{2N}^{(2n+1)k} C_{2N}^{(2m+1)l}    (5-2-1b)

In its matrix form, Equation (5-2-1) can be written as:

X = C^O x    (5-2-2a)

and,

x = C^O_I X    (5-2-2b)

where x and X are formed by stacking the transposed row vectors of the input and output 2-D arrays respectively; C^O = (E ⊗ E)(T ⊗ T)(C ⊗ C); C^O_I = (C_I ⊗ C_I)(E ⊗ E); E = diag[1/√2, 1, ..., 1]; T = diag[2/N, ..., 2/N]; C(i,j) = C_{2N}^{(2j+1)i}; and C_I = (C)^T. The symbol ⊗ stands for the tensor product.

Defining

X̂(k,l) = [N^2/(4 e(k) e(l))] X(k,l)    (5-2-3a)

and,

X̃(k,l) = e(k) e(l) X(k,l)    (5-2-3b)

results in the denormalized 2-D DCT and IDCT as shown in Equation (5-2-4) [46, 80]:

X̂(k,l) = \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} x(n,m) C_{2N}^{(2n+1)k} C_{2N}^{(2m+1)l}    (5-2-4a)

and,

x(n,m) = \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} X̃(k,l) C_{2N}^{(2n+1)k} C_{2N}^{(2m+1)l}    (5-2-4b)

In matrix form:

X̂ = (C ⊗ C) x    (5-2-5a)

and,

x = (C_I ⊗ C_I) X̃ = (C ⊗ C)^T X̃    (5-2-5b)
Fast algorithms are usually derived either from Equation (5-2-4) or from its matrix form as presented by Equation (5-2-5). Although the definitions for the 2-D DCT and its inverse are slightly different from those given in [72], the denormalized forms are the same. These are the basis for the derivation of various fast algorithms.

Generally, the mathematical derivation of an algorithm is quite involved and it is difficult to see the computational structure. For this reason, a logic diagram is used to present the computational structure of each algorithm.
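The tensor-product structure in Equation (5-2-5) is equivalent to the row-column evaluation, which the following sketch checks (block size and test data illustrative):

```python
import numpy as np

# The Kronecker form of Equation (5-2-5a): applying the 1-D kernel C to
# rows and columns of a 2-D block equals applying C (x) C to the
# row-stacked data vector.
N = 8
n = np.arange(N)
C = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
rng = np.random.default_rng(1)
x = rng.standard_normal((N, N))

X_rc = C @ x @ C.T                                    # row-column form
X_kron = (np.kron(C, C) @ x.ravel()).reshape(N, N)    # tensor-product form
assert np.allclose(X_rc, X_kron)
```

Note that row-stacking (as defined after Equation (5-2-2)) corresponds to NumPy's row-major `ravel`, which is why `kron(C, C)` reproduces the row-column result exactly.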
95
5-3 Applications of 2-D DCTs in Image Compression

Image coding (compression) is a typical application of the 2-D DCT. It has been made an international standard by the CCITT for video coding applications [72, 73]. Various 2-D DCT algorithms have been developed into computer programs for purposes of the simulation study of video coding [81, 96]. In the following example, image compression is demonstrated using the row-column Lee fast DCT algorithm on a 256*256-pixel frame of an image.
Example-8:

In this example, the Series 151 Image Processor by Image Technology has been used to acquire and store images. It is hosted by a PC-AT which performs the 2-D DCT calculation. A 512*512-pixel image is snapped and stored in the frame grabber of the Series 151 Image Processor. The frame is divided into four quadrants, each of which consists of 256*256 pixels. The upper-right quadrant is used to display the original image, the upper-left quadrant the scaled DCT coefficients, the bottom-left quadrant the reconstructed image after applying different filtering on the DCT coefficients, and the bottom-right quadrant the difference image between the original and reconstructed images. The 2-D DCT is applied on 8*8-pixel blocks. DCT coefficients are scaled using a block size of 64*64 pixels. The difference image can also be scaled so that the error signal can be seen. A signed 9-bit integer is used for the DCT coefficients. The system setting is shown in Figure-19.

Two types of filter masks are used in this example, the 2-D ideal low-pass filter and the zigzag filter, as shown in Figure-20 (a) and (b), where n is the length of the filter. The filter mask is used to eliminate selected DCT coefficients. In Figure-21, an ideal low-pass filter is used with n = 4, so that a compression ratio of (8 bits * 64 pixels)/(9 bits * 16 pixels) ≈ 3.56:1 is obtained, using a signed 9-bit integer for the DCT coefficients.
The difference image is magnified twenty times. The effects, shown as stripes, are caused by two-dimensional noise introduced in the imaging system. This has been detected by analyzing the 2-D Fourier spectrum of the image [Appendix C]. A zigzag filter of n = 5 is used in Figure-22 to achieve a compression ratio of (8 bits * 64 pixels)/(9 bits * 15 pixels) ≈ 3.79:1. Although a higher compression ratio is used than that of Figure-21, the improvement in the reconstructed image quality is obvious, especially in the areas of the English characters of the poster in the background, the face and the shoulder. When different bit allocation or adaptive schemes are applied, higher compression ratios can be obtained [3, 105].
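The two masks and the resulting compression ratios can be reproduced as follows (a sketch assuming, as in the experiment above, 8-bit input pixels and signed 9-bit DCT coefficients on 8*8 blocks; the function names are illustrative):

```python
import numpy as np

def rect_mask(n, size=8):
    """Ideal low-pass mask of Figure-20(a): keep the n*n corner."""
    m = np.zeros((size, size), dtype=bool)
    m[:n, :n] = True
    return m

def zigzag_mask(n, size=8):
    """Zigzag mask of Figure-20(b): keep the anti-diagonal band i + j < n."""
    i, j = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    return i + j < n

def ratio(mask, in_bits=8, coef_bits=9, size=8):
    """Compression ratio: input bits per block over retained coefficient bits."""
    return (in_bits * size * size) / (coef_bits * int(mask.sum()))

assert rect_mask(4).sum() == 16 and zigzag_mask(5).sum() == 15
assert round(ratio(rect_mask(4)), 2) == 3.56     # Figure-21
assert round(ratio(zigzag_mask(5)), 2) == 3.79   # Figure-22
```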
System setting for image compression experiment.
Figure-19
97
[Diagram not reproduced: the retained n*n corner of coefficients, marked X.]
The two dimensional rectangular filter of size n.
Figure-20 (a)
[Diagram not reproduced: the retained anti-diagonal band of coefficients, marked X.]
The two dimensional zigzag filter of size n.
Figure-20 (b)
98
An example of DCT compression of a 256*256-pixel image: an ideal low-pass filter with n = 4 and signed 9-bit DCT coefficients.
Figure-21
99
An example of DCT compression of a 256*256-pixel image: a zigzag filter with n = 5 and signed 9-bit DCT coefficients.
Figure-22
100
5-4 2-D Indirect Fast DCT Algorithms

Of the two categories of 2-D fast DCT algorithms, the indirect approach obtains a 2-D DCT from a 2-D DFT of the same size [57]. One can use row-column FFT algorithms, the WFTA, etc., to calculate the real-valued DFTs as discussed previously. The arithmetic complexity is fairly low [72]. The computational kernel of this method has a simple structure, hence it is well suited to VLSI implementation. If 2-D FFT algorithms are invoked to calculate the 2-D DFT, the arithmetic complexity can be further reduced, particularly if vector radix FFT algorithms are used [37, 43-45]. The structure of the algorithm is kept fairly simple, and roundoff errors are also reduced compared with the row-column approach [50, 51]. A polynomial transform for 2-D DFT computation has lower computational requirements but a more complex computation structure [58]. Whether it is justified for fast computation of the 2-D DCT remains to be seen [72]. The same can be said for 2-D indirect fast DCT methods using other reduced-multiplication fast Fourier transform algorithms [32, 35].

There are several reports on 2-D indirect fast DCTs which map DCTs into DFTs [57, 72, 79, 84], and in [57] complete formulas for both the forward and inverse DCT are given. These are used in this thesis.
According to Makhoul [57], a 2-D DCT can be converted to and computed by a 2-D DFT following the steps given below.

Step 1: 2-D N*N-point data rearrangement

v(n1,n2) = x(2n1, 2n2)               if 0 <= n1 <= [(N-1)/2]; 0 <= n2 <= [(N-1)/2]
           x(2N-2n1-1, 2n2)          if [(N+1)/2] <= n1 <= N-1; 0 <= n2 <= [(N-1)/2]
           x(2n1, 2N-2n2-1)          if 0 <= n1 <= [(N-1)/2]; [(N+1)/2] <= n2 <= N-1
           x(2N-2n1-1, 2N-2n2-1)     if [(N+1)/2] <= n1 <= N-1; [(N+1)/2] <= n2 <= N-1
    (5-4-1)

Step 2: 2-D N*N-point DFT on v(n1,n2)

V(k1,k2) = \sum_{n1=0}^{N-1} \sum_{n2=0}^{N-1} v(n1,n2) W_N^{n1 k1} W_N^{n2 k2}    (5-4-2)

where k1 = 0, 1, ..., N-1, k2 = 0, 1, ..., N-1 and W_N = e^{-j2π/N}.

Step 3: Obtain the 2-D DCT from the output of the 2-D FFT by one of two expressions:

C(k1,k2) = 2 Re{ W_{4N}^{k1} [ W_{4N}^{k2} V(k1,k2) + W_{4N}^{-k2} V(k1, N-k2) ] }    (5-4-3a)

or:

C(k1,k2) = 2 Re{ W_{4N}^{k2} [ W_{4N}^{k1} V(k1,k2) + W_{4N}^{-k1} V(N-k1, k2) ] }    (5-4-3b)

that is, the 2-D DCT so computed is:

C(k1,k2) = 4 \sum_{n1=0}^{N-1} \sum_{n2=0}^{N-1} v(n1,n2) cos[π(4n1+1)k1/(2N)] cos[π(4n2+1)k2/(2N)]    (5-4-3c)

Different 2-D fast discrete Fourier transform algorithms may be used (e.g., the vector radix FFT [37, 43, 45, 60], the 2-D WFTA [35], the 2-D polynomial transform [58], etc. [32, 65]) to calculate the 2-D DFT in Step 2 for the 8*8- or 16*16-point FFT. The inverse DCT can be computed by the following steps:

Step 1': Generate the 2-D DFT from the 2-D DCT

V(k1,k2) = (1/4) W_{4N}^{-k1} W_{4N}^{-k2} { [C(k1,k2) - C(N-k1, N-k2)] - j [C(N-k1, k2) + C(k1, N-k2)] }    (5-4-4)

Step 2': 2-D IDFT

v(n1,n2) = (1/N^2) \sum_{k1=0}^{N-1} \sum_{k2=0}^{N-1} V(k1,k2) W_N^{-n1 k1} W_N^{-n2 k2}    (5-4-5)

where n1, n2 = 0, 1, ..., N-1.

Step 3': Recover the sequence x(n1,n2) by inverting the rearrangement of Step 1 of the forward DCT.
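The forward steps above can be sketched and checked against a direct evaluation of Equation (5-4-3c); the block size and test data below are illustrative:

```python
import numpy as np

N = 8
rng = np.random.default_rng(2)
x = rng.standard_normal((N, N))

# Step 1: rearrangement of Equation (5-4-1), applied separably per axis:
# v(n) = x(2n) for the first half, v(N-1-n) = x(2n+1) for the second.
idx = np.empty(N, dtype=int)
idx[: (N + 1) // 2] = 2 * np.arange((N + 1) // 2)
idx[(N + 1) // 2 :] = 2 * (N - 1 - np.arange((N + 1) // 2, N)) + 1
v = x[np.ix_(idx, idx)]

# Step 2: 2-D DFT.
V = np.fft.fft2(v)

# Step 3: twiddle per Equation (5-4-3a).
k1, k2 = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
W = np.exp(-2j * np.pi / (4 * N))
C_ind = 2 * np.real(W**k1 * (W**k2 * V + W**(-k2) * V[k1, (N - k2) % N]))

# Reference: the unnormalized 2-D DCT with the scaling of Equation (5-4-3c).
n = np.arange(N)
Cm = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
assert np.allclose(C_ind, 4 * Cm @ x @ Cm.T)
```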
In [72], [79] and [84], no formulas are given. Instead, the 2-D inverse DCT is generated on the flow graph using the transposition theorem of orthogonal transforms.

A 2-D indirect DCT method using a convolution algorithm has been mentioned in [72], which claims a dramatic reduction in multiplications. A detailed study remains to be undertaken.
102
The arithmetic complexity of indirect DCT algorithms depends on the FFT algorithm used. Since FFT algorithms are well documented, the arithmetic complexity of indirect DCT algorithms can be readily obtained.
103
CHAPTER SIX: 2-D DIRECT FAST DCT ALGORITHMS

It is known that 2-D fast transform algorithms are often more efficient than the row-column 1-D algorithm in terms of computational operations, i.e., they need fewer multiplications and additions than the row-column method to compute the same transform. Many 2-D algorithms also possess in-place computation, a regular structure and small roundoff errors [37, 45, 50, 51], which all provide advantages.

The 2-D direct fast DCT algorithms discussed in the following sections are generated from the 1-D Lee and Hou algorithms respectively. They require fewer multiplications than the row-column method and provide a systematic computation structure featuring 2-D BFs and TM stages as well. The in-place computation possessed by these algorithms is obvious. The computational complexity of various DCT algorithms is considered, and the computation structure of 2-D direct DCT algorithms is analyzed in comparison with that of 2-D vector radix FFTs.

In the next two sections, the matrix and diagrammatic representations of 1-D direct DCT algorithms are discussed as the bases for the derivation of 2-D direct vector radix DCT algorithms. The 2-D DCT algorithms are then introduced using the structural approach, in a manner similar to that in which VR FFTs are constructed.
6-1 2-D Direct Fast DCT Algorithm Based on Lee's Method

6-1-1 1-D Lee's algorithm in matrix form

Lee's algorithm is a direct fast DCT algorithm. For an N-point forward DCT, Equation (5-1-4a) can be decomposed into two N/2-point DCTs by the following steps in matrix form [46, 76, 80]:

[g'1(n)]   [1  1] [x(n)    ]
[g'2(n)] = [1 -1] [x(N-1-n)]    (6-1-1a)

[g1(n)]   [1  0                    ] [g'1(n)]
[g2(n)] = [0  1/(2 C_{2N}^{(2n+1)})] [g'2(n)]    (6-1-1b)

[G1(k)]   \sum_{n=0}^{N/2-1}                      [g1(n)]
[G2(k)] =                    C_{2(N/2)}^{(2n+1)k} [g2(n)]    (6-1-1c)

[X(2k)  ]   [1 0 0] [G1(k)  ]
[X(2k+1)] = [0 1 1] [G2(k)  ]
                    [G2(k+1)]    (6-1-1d)

where k, n = 0, 1, ..., N/2-1, and G2(k+1)|_{k=N/2-1} = 0. Define the post- or pre-calculation matrix P, the butterfly matrix B, and the multiplication matrix M as follows:

P = [1 0 0]    B = [1  1]    M = [1  0                    ]
    [0 1 1]        [1 -1]        [0  1/(2 C_{2N}^{(2n+1)})]
The N-point IDCT in Equation (5-1-4b) can also be decomposed into two N/2-point IDCTs by the following steps in matrix form [46, 80]:

[H1(k)]       [X(2k)  ]
[H2(k)] = P   [X(2k+1)]
              [X(2k-1)]    (6-1-2a)

[h1(n)]   \sum_{k=0}^{N/2-1}                      [H1(k)]
[h2(n)] =                    C_{2(N/2)}^{(2n+1)k} [H2(k)]    (6-1-2b)

[h'1(n)]       [h1(n)]
[h'2(n)] = M   [h2(n)]    (6-1-2c)

[x(n)    ]       [h'1(n)]
[x(N-1-n)] = B   [h'2(n)]    (6-1-2d)
where X(2k-1) = 0 if k = 0. The above one-dimensional fast DCT algorithm is described in Lee's paper [76] except for the matrix representations [46]. The matrix representations used here are very useful when a new 2-D fast DCT algorithm is devised [45, 46, 60, 70, 80, 83].
105
Notice that:

[x(n)    ]   [1         ]
[x(N-1-n)] = [s^{N-1-2n}] x(n)    (6-1-3)

and,

[X(2k)  ]   [1  0     ]
[X(2k+1)] = [0  1     ] [X(2k)  ]
[X(2k-1)]   [0  s^{-1}] [X(2k+1)]    (6-1-4)

[G1(k)  ]   [1  0]
[G2(k)  ] = [0  1] [G1(k)]
[G2(k+1)]   [0  s] [G2(k)]    (6-1-5)

where the delay operator s is defined by x(n+1) = s x(n) and x(n-1) = s^{-1} x(n).
The logic diagrams for the 1-D 8-point and 16-point denormalized IDCT are shown in Figure-23 and Figure-24 respectively. Note that in these figures the input sequence is in bit-reversed order, whilst the output sequence is generated by starting with the set (0,1), forming a new set by adding the prefix "0" to each element, and then obtaining the rest of the elements by complementing the existing ones. Therefore the sets corresponding to the 2-, 4-, 8- and 16-point output sequences are: (0,1), (00, 01, 11, 10), (000, 001, 011, 010, 111, 110, 100, 101) and (0000, 0001, 0011, 0010, 0111, 0110, 0100, 0101, 1111, 1110, 1100, 1101, 1000, 1001, 1011, 1010). The corresponding 1-D fast DCT algorithms can be obtained easily by interchanging the input and output, reversing the direction of data flow, and changing addition blocks to branches and branches to addition blocks in the IDCT logic diagrams [76, 84].

As the logic diagram shows, Lee's algorithm achieves a good performance in terms of the number of multiplications and additions and also has a regular structure.
For an N-point DCT, N = 2^m, the numbers of multiplications and additions required for the calculation are as follows:

0_M[DCT(2^m)] = m * 2^{m-1};

0_A[DCT(2^m)] = m * 2^{m-1} + \sum_{i=0}^{m-1} 2^i (2^{m-i} - 1), for m > 1.
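The multiplication count can be confirmed by running the recursion of Equations (6-1-1a) to (6-1-1d) and counting the divisions by the cosine coefficients; the following is a sketch of the 1-D algorithm (variable names illustrative), checked against the denormalized DCT of Equation (5-1-4a):

```python
import numpy as np

def lee_dct(x, count):
    """One full recursion of Lee's algorithm; count[0] accumulates the
    multiplications (the divisions by 2*cos in the matrix M)."""
    N = len(x)
    if N == 1:
        return x.copy()
    n = np.arange(N // 2)
    g1 = x[n] + x[N - 1 - n]                                   # butterfly B
    g2 = (x[n] - x[N - 1 - n]) / (2 * np.cos(np.pi * (2 * n + 1) / (2 * N)))
    count[0] += N // 2                                         # matrix M
    G1, G2 = lee_dct(g1, count), lee_dct(g2, count)            # two N/2 DCTs
    X = np.empty(N)
    X[0::2] = G1                                               # X(2k) = G1(k)
    X[1::2] = G2 + np.append(G2[1:], 0.0)   # X(2k+1) = G2(k)+G2(k+1), G2(N/2)=0
    return X

N = 16                                   # m = 4
rng = np.random.default_rng(3)
x = rng.standard_normal(N)
cnt = [0]
X = lee_dct(x, cnt)

n = np.arange(N)
Cm = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
assert np.allclose(X, Cm @ x)            # denormalized DCT, Equation (5-1-4a)
assert cnt[0] == 4 * 2**3                # m * 2^(m-1) = 32
```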
106
[Figure-23: Logic diagram of the 8-point 1-D denormalized IDCT based on Lee's algorithm (figure not reproduced).]
107
[Figure-24: Logic diagram of the 16-point 1-D denormalized IDCT based on Lee's algorithm (figure not reproduced).]
108
Since the publication of Lee's paper, the algorithm has been criticized for the roundoff errors produced by the required division by the cosine coefficients in the matrix M. Haque [38], however, has shown that the roundoff errors are not serious for small-size DCTs, although this fact has not been widely recognized [65, 72, 77].
6-1-2 Derivation of the 2-D fast DCT algorithm from Lee's algorithm

Although this method was first introduced by Haque [38] in 1985, immediately after the publication of Lee's algorithm, it was derived independently by Wu and Paoloni [46] using a structural approach. The latter technique makes the 2-D fast algorithm a simple and systematic extension of Lee's algorithm and will be presented here.

The N*N-point 2-D DCT and its inverse (IDCT) after denormalization are given by Equations (5-2-4a) and (5-2-4b). Equation (5-2-4a) can be decomposed into (N/2)*(N/2)-point 2-D DCTs in the same way as was done for the 1-D DCT in the last sub-section. Matrix forms of the four-step algorithm are given by Equations (6-1-2-1a) to (6-1-2-1d) [46]:
g' = B_{2D} x    (6-1-2-1a)

g = M_{2D} g'    (6-1-2-1b)

G = \sum_{n=0}^{N/2-1} \sum_{m=0}^{N/2-1} C_{2(N/2)}^{(2n+1)k} C_{2(N/2)}^{(2m+1)l} g    (6-1-2-1c)

X = P_{2D} ( [1 0; 0 1; 0 s] ⊗ [1 0; 0 1; 0 z] ) G    (6-1-2-1d)

where

x = ( [1; s^{N-1-2n}] ⊗ [1; z^{N-1-2m}] ) x(n,m);

g' = [g'1(n,m), g'2(n,m), g'3(n,m), g'4(n,m)]^T;    g = [g1(n,m), g2(n,m), g3(n,m), g4(n,m)]^T;

G = [G1(k,l), G2(k,l), G3(k,l), G4(k,l)]^T;    X = [X(2k,2l), X(2k,2l+1), X(2k+1,2l), X(2k+1,2l+1)]^T;

P_{2D} = P ⊗ P;    M_{2D} = M ⊗ M';    B_{2D} = B ⊗ B;

and k, l, n, m = 0, 1, ..., N/2-1, with Gi(k+1,.)|_{k=N/2-1} = Gi(.,l+1)|_{l=N/2-1} = 0 for i = 2, 3, 4. The matrices P, B and M are defined as in the last sub-section, and M' is derived by substituting m for n in M; the symbol ⊗ stands for the tensor (Kronecker) product; and the delay operators s and z operate on the two different indices.
Equation (6-1-2-1c) represents four (N/2)*(N/2)-point 2-D DCTs, which can be further decomposed into even shorter-length 2-D DCTs, and so forth. In essence, the relationship between 2-D direct DCT algorithms and their 1-D counterparts is governed by the properties of the tensor product. Equation (6-1-2-1) can be proved by direct derivation (Appendix D).
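The tensor-product lifting of the 1-D stages can be illustrated on the butterfly matrix B alone (a minimal sketch; the 2*2 block stands in for the paired samples the 2-D butterfly acts on):

```python
import numpy as np

# B (x) B applied to the row-stacked block equals the 1-D butterfly B
# applied along each dimension in turn, which is how the structural
# approach lifts Lee's 1-D stages to 2-D.
B = np.array([[1.0, 1.0], [1.0, -1.0]])
rng = np.random.default_rng(4)
x = rng.standard_normal((2, 2))

two_d = (np.kron(B, B) @ x.ravel()).reshape(2, 2)
row_then_col = B @ x @ B.T
assert np.allclose(two_d, row_then_col)
```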
The 2-D fast IDCT algorithm derived from Equation (6-1-2) can be given in matrix form in a similar way [46, 80]. Following the structured approach, the 2-D fast IDCT algorithm based on Lee's method is presented as Equation (6-1-2-2):
H = P_{2D} X    (6-1-2-2a)

h = \sum_{k=0}^{N/2-1} \sum_{l=0}^{N/2-1} C_{2(N/2)}^{(2n+1)k} C_{2(N/2)}^{(2m+1)l} H    (6-1-2-2b)

h' = M_{2D} h    (6-1-2-2c)

x = B_{2D} h'    (6-1-2-2d)
110
where

X = ( [1; s; s^{-1}] ⊗ [1; z; z^{-1}] ) X(2k,2l);

H = [H1(k,l), H2(k,l), H3(k,l), H4(k,l)]^T;    h = [h1(n,m), h2(n,m), h3(n,m), h4(n,m)]^T;

x = ( [1; s^{N-1-2n}] ⊗ [1; z^{N-1-2m}] ) x(n,m);

P_{2D} = P ⊗ P;    M_{2D} = M ⊗ M';    B_{2D} = B ⊗ B;

and k, l, m, n = 0, 1, ..., N/2-1; s and z are the two delay operators which operate on the two different dimensions, and X(2k-1,.)|_{k=0} = X(.,2l-1)|_{l=0} = 0.
The mathematical structure becomes even simpler when a logic diagram is used [45, 46, 80]. For instance, the logic diagram of an 8-point 1-D IDCT is given in Figure-23 [46]. The row-column IDCT is applied to an 8*8-point 2-D IDCT in Figure-25, where the row transforms are implemented by the 1-D IDCT blocks at the input and the column transforms by the remainder. The number of multiplications required for an N*N-point 2-D IDCT computed this way is N^2 log2 N. Figure-26 shows the evolution of the row-column approach into the two-dimensional algorithm. The 1-D IDCT row operations of Figure-25 are now distributed throughout the logic diagram, and thus the number of multiplications remains, as yet, unchanged. However, it is possible to combine adjacent factors to reduce the number of multiplications further. For example, in the block 2D-M3 of Figure-26, the factor α is moved into the block M3, which has the effect of changing the internal multiplier values of M3 [45, 46]. The procedure is repeated for blocks 2D-M2 and 2D-M1. The total number of multiplications required to perform an N*N-point 2-D IDCT is now (3/4) N^2 log2 N, whilst the number of additions remains unchanged. The logic diagram for a 16*16-point
111
[Figure-25: The row-column approach applied to an 8*8-point 2-D IDCT (figure not reproduced).]
112
[Figure-26: Evolution of the row-column approach into the 2-D fast IDCT algorithm, with blocks 2D-M1, 2D-M2 and 2D-M3 (figure not reproduced).]
113
[Figure-27: Logic diagram of the 16*16-point 2-D IDCT based on Lee's algorithm (figure not reproduced).]
114
IDCT using the above 2-D fast algorithm can also be constructed in the same manner: start by drawing a 16-point 1-D IDCT diagram using Lee's algorithm, then the 2-D diagram can be constructed immediately using the simple rules of the logic diagram [81], as shown in Figure-27. The forward DCT can be obtained by reversing the direction of the arrows in the logic diagram of the IDCT, since the DCT is an orthogonal transform [76, 84].

Using the above logic diagrams, both software and hardware implementations of 2-D DCTs can proceed.
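The two multiplication counts quoted in this section can be tabulated for the block sizes of interest (the function names below are illustrative):

```python
from math import log2

# Multiplication counts for an N*N-point 2-D IDCT derived from Lee's
# algorithm: N^2*log2(N) for the row-column form, reduced to
# (3/4)*N^2*log2(N) after combining adjacent twiddle factors.
def rc_mults(N):
    return int(N * N * log2(N))

def combined_mults(N):
    return int(3 * N * N * log2(N) // 4)

assert rc_mults(8) == 192 and combined_mults(8) == 144
assert rc_mults(16) == 1024 and combined_mults(16) == 768
```

For the 8*8 and 16*16 blocks used in image coding, the combined-factor form therefore saves a quarter of the multiplications at no cost in additions.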
6-2 2-D Direct Fast DCT Algorithm Based on Hou's Method

6-2-1 1-D Hou's algorithm in matrix form

Hou introduced a recursive fast DCT algorithm in 1987 which achieves computational efficiency equal to that of Lee's algorithm and provides better numerical performance. As a tradeoff, however, shifting and multiplexing operations are required in this method. Although Hou classifies his algorithm differently, it is still a direct fast DCT algorithm, and it is recursive because the higher-order DCT matrices are generated directly from the lower-order DCT matrices [77].
The 1-D N-point DCT definition used to generate the algorithm is derived from Equation (5-1-4a) [82] and is given by:

X(k) = \sum_{n=0}^{N-1} x̂(n) cos(θ_k + 2πkn/N)    (6-2-1-1)

where θ_k = πk/(2N) and, for n = 0, 1, ..., (N/2)-1,

x̂(n) = x(2n),    x̂(N-1-n) = x(2n+1).    (6-2-1-2)

The Decimation-In-Frequency recursive Hou algorithm for the 1-D DCT generated from Equation (6-2-1-1) is:

[ẑ_e]
[ẑ_o] = T(N) x̂    (6-2-1-3)

where

T(N) = [ T(N/2)          T(N/2)      ]
       [ K T(N/2) Q   -K T(N/2) Q ],

T(1) = 1,    T(2) = [1  1; a  -a],    a = cos(π/4);

ẑ_e = R z_e, where z_e is the vector consisting of the even terms of X(k) in natural order;

ẑ_o = R z_o, where z_o is the vector consisting of the odd terms of X(k) in natural order;

R is the permutation matrix performing bit reversal; for example,

R_2 = I_2 = [1 0; 0 1],    R_4 = [1 0 0 0; 0 0 1 0; 0 1 0 0; 0 0 0 1],    etc.;

K = R L R,    L = [  1  0  0  0  ...  0 ]
                  [ -1  2  0  0  ...  0 ]
                  [  1 -2  2  0  ...  0 ]
                  [ -1  2 -2  2  ...  0 ]
                  [  .  .  .  .  ...  . ]
                  [ -1  2 -2  2  ...  2 ];

Q = diag[cos Φ_m], m = 0, 1, 2, ..., (N/2)-1, with Φ_m = (m + 1/4)(2π/N).

Note that K_2 = L_2, and K is the result of bit reversal of the row and column indices of L.
Equation (6-2-1-3) can be used recursively to form the complete formula. The logic
diagrams for 8-point and 16-point DCTs using the DIF Hou algorithm are presented in
Figures 28 and 29. The input sequence is ordered as x(0), x(2), ..., x(N-2), x(N-1), x(N-
3), ..., x(1), where N is the length of the DCT. The output sequence is in bit-reversed
order.
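The input ordering just described is exactly the permutation of Equation (6-2-1-2); a small hedged C sketch (illustrative naming, not the thesis code):

```c
#include <assert.h>

/* Reorder an N-point input as required by Equation (6-2-1-2):
   the first half holds the even-indexed samples in ascending order,
   the second half holds the odd-indexed samples in descending order,
   i.e. x(0), x(2), ..., x(N-2), x(N-1), x(N-3), ..., x(1). */
void hou_input_reorder(const double *x, double *xr, int n)
{
    for (int i = 0; i < n / 2; i++) {
        xr[i]         = x[2 * i];      /* x^(n)     = x(2n)   */
        xr[n - 1 - i] = x[2 * i + 1];  /* x^(N-1-n) = x(2n+1) */
    }
}
```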
In [77], another algorithm, called the Decimation-In-Time algorithm for the DCT, is
devised as a dual method to the DIF algorithm. It is obtained simply by switching the
indices between the input and the output and taking the transpose of the DCT matrix T̂(N).
For the inverse transform, substitute Equation (6-2-1-2) into Equation (5-1-4b), to
give

x̂(n) = Σ_{k=0}^{N-1} X(k) cos(θ_k + 2πkn/N)   (6-2-1-4)
116

[Figure-28: logic diagram of the 8-point 1-D DCT using the DIF Hou algorithm (graphic not reproduced in transcript)]

117
[Figure-29: logic diagram of the 16-point 1-D DCT using the DIF Hou algorithm (graphic not reproduced in transcript)]

118
The IDCT matrix will be exactly the transpose of the DCT matrix defined by Equation (6-
2-1-1). Following Hou's indexing scheme, the fast IDCT algorithm is given below.

[ x̂_f ]               [ Ẑ_e ]
[ x̂_r ]  =  T̂ᵀ(N) ·  [ Ẑ_o ]   (6-2-1-5)

where

T̂ᵀ(N) = [ T̂ᵀ(N/2)     Q T̂ᵀ(N/2) Kᵀ
           T̂ᵀ(N/2)    -Q T̂ᵀ(N/2) Kᵀ ],   Kᵀ = R Lᵀ R.

This means that the DIT fast DCT algorithm given by Hou is equivalent to the
inverse fast algorithm. The IDCT matrix can be factorized into the form shown by
Equation (6-2-1-6), where each 2*2 block matrix is written with its rows separated by
semicolons:

T̂ᵀ(N) = [I I; I -I] · [I 0; 0 Q] · [T̂ᵀ(N/2) 0; 0 T̂ᵀ(N/2)] · [I 0; 0 Kᵀ]   (6-2-1-6)

where all matrices I, Kᵀ, Q and T̂ᵀ(N/2) are of dimension (N/2)*(N/2). According to
Equation (6-2-1-6), Figure-3 in [77] should be the one shown in Figure-30.
The number of multiplications and additions required to perform an N-point DCT is
equal to that of Lee's algorithm, plus an additional Σ_{i=0}^{m-1} 2^i (2^{m-i-1} - 1)
shift operations, for m > 1, where N = 2^m. In software programming, the multiplexing can be hidden
so that there is no extra operational cost; in other words, there is no extra operation due to
this multiplexing compared with a program using Lee's algorithm [81, 96].
6-2-2 Derivation of the 2-D fast DCT algorithm from Hou's
algorithm
Hou's algorithm can be extended into a 2-D fast algorithm in very much the same
way as Lee's algorithm. In order to derive a new two-dimensional recursive fast DCT
algorithm based on Hou's approach, the DCT matrix in Equation (6-2-1-3) is rewritten as
follows:
119
[Figure-30: corrected system configuration for the 1-D DIT Hou fast DCT algorithm (graphic not reproduced in transcript)]

120
Writing each 2*2 block matrix with its rows separated by semicolons,

T̂(N) = [I 0; 0 K] · [T̂(N/2) 0; 0 T̂(N/2)] · [I 0; 0 Q] · [I I; I -I]   (6-2-2-1)

where all matrices I, K, Q and T̂(N/2) are of dimension (N/2)*(N/2).
Set the following equations [57]:

x̂(n,m)         = x(2n,2m)
x̂(n,N-1-m)     = x(2n,2m+1)
x̂(N-1-n,m)     = x(2n+1,2m)
x̂(N-1-n,N-1-m) = x(2n+1,2m+1),

0 ≤ n ≤ N/2 - 1,   0 ≤ m ≤ N/2 - 1.   (6-2-2-2)
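Equation (6-2-2-2) is a two-dimensional even/odd shuffle; as a hedged illustration (the function name and the fixed block size are assumptions, not from the thesis):

```c
#include <assert.h>

#define N 4  /* illustrative block size */

/* 2-D input reordering of Equation (6-2-2-2): even/odd decimation in
   both dimensions, with the odd-indexed halves stored in reverse. */
void reorder_2d(double x[N][N], double xr[N][N])
{
    for (int n = 0; n < N / 2; n++)
        for (int m = 0; m < N / 2; m++) {
            xr[n][m]             = x[2 * n][2 * m];
            xr[n][N - 1 - m]     = x[2 * n][2 * m + 1];
            xr[N - 1 - n][m]     = x[2 * n + 1][2 * m];
            xr[N - 1 - n][N - 1 - m] = x[2 * n + 1][2 * m + 1];
        }
}
```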
Substituting Equation (6-2-2-2) into Equation (5-2-4a), a modified version of the 2-
D DCT is derived.

X(k,l) = Σ_{n=0}^{N-1} Σ_{m=0}^{N-1} x̂(n,m) cos(θ_k + 2πkn/N) cos(θ_l + 2πlm/N)   (6-2-2-3)

The matrix form for the denormalized 2-D DCT defined by Equation (6-2-2-3), after
reordering the input and output sequences, will be:
[ Ẑ_ee ]                         [ x̂_pp ]
[ Ẑ_eo ]  =  ( T̂(N) ⊗ T̂(N) ) · [ x̂_pr ]   (6-2-2-4)
[ Ẑ_oe ]                         [ x̂_rp ]
[ Ẑ_oo ]                         [ x̂_rr ]

where

Ẑ_e = (R ⊗ R) X_e,   Ẑ_e = [ Ẑ_ee; Ẑ_eo ],   X_e = [ X_ee; X_eo ],
Ẑ_o = (R ⊗ R) X_o,   Ẑ_o = [ Ẑ_oe; Ẑ_oo ],   X_o = [ X_oe; X_oo ],

x̂ = (P ⊗ P) x,   P is the permutation matrix which results in Equation (6-2-1-2),

121

X_e and X_o are in natural order.
Substituting Equation (6-2-2-1) into (6-2-2-4), a new 2-D fast DCT algorithm is
derived.

( T̂(N) ⊗ T̂(N) )
  = { [I 0; 0 K] [T̂(N/2) 0; 0 T̂(N/2)] [I 0; 0 Q] [I I; I -I] }
    ⊗ { [I 0; 0 K] [T̂(N/2) 0; 0 T̂(N/2)] [I 0; 0 Q] [I I; I -I] }
  = { [I 0; 0 K] ⊗ [I 0; 0 K] }
    { [T̂(N/2) 0; 0 T̂(N/2)] ⊗ [T̂(N/2) 0; 0 T̂(N/2)] }
    { [I 0; 0 Q] ⊗ [I 0; 0 Q] }
    { [I I; I -I] ⊗ [I I; I -I] }   (6-2-2-5)
So an N*N-point 2-D DCT is decomposed into four shorter-length DCTs at the cost
of an increased number of multiplications, represented by the term which contains the factor
Q. After combining the coefficients, the new algorithm uses 25% fewer
multiplications than the row-column Hou algorithm.
Although its mathematical derivation is quite involved, the logic diagrams for the 8*8-
and 16*16-point 2-D DCTs using the new fast algorithm are quite simple [81] and are
shown in Figures 31 and 32. They are derived from Figures 27 and 28 respectively, with
the symbols defined accordingly in those figures. Again, in the 2-D algorithm additional
shift operations are traded for better numerical performance. The elements of the input
vectors are ordered as x_i = [x(i,0), x(i,2), ..., x(i,N-2), x(i,N-1), x(i,N-3), ...,
x(i,1)] and the elements of the output vectors are in bit-reversed order.
The 2-D IDCT fast algorithm can be derived from Equation (6-2-1-6) and the logic
diagram can be obtained using the same method.
[Figure-31: logic diagram of the 8*8-point 2-D DCT using the new vector radix fast algorithm (graphic not reproduced in transcript)]

123

[Figure-32: logic diagram of the 16*16-point 2-D DCT using the new vector radix fast algorithm (graphic not reproduced in transcript)]

124
6-3 Comparison of Arithmetic Complexity of Various DCT
Algorithms [72, 156]
Listed below are the arithmetic complexities of direct fast algorithms for a 1-D DCT
of length N = 2^m, including the number of real multiplications O_M[DCT(2^m)] and the
number of real additions O_A[DCT(2^m)].

Chen [78]:      O_M[DCT(2^m)] = N·log2(N) - 3N/2 + 4,  N > 4;
                O_A[DCT(2^m)] = (3N/2)·(log2(N) - 1) + 2;

Lee [76]:       O_M[DCT(2^m)] = (N/2)·log2(N);
                O_A[DCT(2^m)] = (3N/2)·log2(N) - N + 1;

Hou [77]:       O_M[DCT(2^m)] = (N/2)·log2(N);
                O_A[DCT(2^m)] = (3N/2)·log2(N) - N + 1;

Ma-Yin [65]:    O_M[DCT(2^m)] = m·2^(m-1),  N = 2^m;
                O_A[DCT(2^m)] = (3m - 2)·2^(m-1) + 1;

Vetterli et al. [34]:  O_M[DCT(2^m)] = (N/2)·log2(N);
                O_A[DCT(2^m)] = (N/2)·(3·log2(N) - 2) + 1.
125
The general formulas given can be derived either from the decomposition equations or
from the logic diagrams provided in the thesis using an induction method.
The number of multiplications or additions for the 2-D row-column DCT methods is
obtained by multiplying the number used in the 1-D fast DCT algorithm of the same size by
2*N. The arithmetic complexity of the 2-D vector radix DCT algorithms is then easily
obtained by noting that the number of multiplications is reduced to three quarters of that
used by the row-column fast DCT algorithm while the number of additions remains
unchanged. Further discussion of the arithmetic complexity of DCTs can be found in
[72] and [156].
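The counts above, together with the row-column and vector radix rules just stated, can be captured in a few illustrative C helpers (Lee's formulas are used; the function names are assumptions, not from the thesis software):

```c
#include <assert.h>

/* Operation counts for a 1-D DCT of length N = 2^m (Section 6-3). */
int lee_mults(int N, int m) { return (N / 2) * m; }
int lee_adds(int N, int m)  { return (3 * N / 2) * m - N + 1; }

/* Row-column 2-D method: 2*N one-dimensional DCTs of length N. */
int rc_mults(int N, int m)  { return 2 * N * lee_mults(N, m); }

/* Vector radix 2-D method: multiplications fall to three quarters of
   the row-column count; additions are unchanged. */
int vr_mults(int N, int m)  { return 3 * rc_mults(N, m) / 4; }
```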
6-4 Comparison of Computation Structures of 2-D Direct VR
DCTs and VR FFTs
So far, independent VR FFT and VR DCT algorithms have been presented which
show some similarities in their computation structures. Further comparison will reveal
the basic computation structures common to both VR FFTs and VR DCTs, as well as the major
differences. The reason why the vector radix approach can be applied to FFTs
based on the Cooley-Tukey method and to the direct fast DCTs by Lee and Hou will soon
become clear. This exercise will certainly be beneficial to the software and hardware
implementation, including VLSI implementation, of vector radix fast algorithms.
Apart from the DFT being a complex-valued transform and the DCT a real-valued
one, there are some obvious differences in the computation structures of the 1-D Cooley-
Tukey FFT and 1-D direct fast DCTs. Take, for example, the 1-D 8-point DIF FFT shown in Figure-3
and the 1-D 8-point direct fast DCT by Hou shown in Figure-28. It can be
seen that the input sequence of the FFT is in natural order and the output in bit-reversed
order (or vice versa), whilst the input of Hou's DCT is in a different shuffled order
(refer to Section 6-2-1) and the output in bit-reversed order. As a result, the DCT
algorithm requires that both input and output sequences be re-ordered whilst the FFT only
rearranges one of them. While the DCT algorithm needs a post-calculation stage, the FFT
does not. The FFT algorithms often have trivial twiddling multiplication stages, such as
126
the one inside the Radix-4 DIF FFT butterfly of Figure-3, whereas Hou's fast DCT does not
have a trivial twiddling multiplication stage.
On the other hand, there are many important features common to Figure-3
and Figure-28. The basic computation structures of both algorithms are 2-point
butterflies and separable twiddling multiplication stages. They both perform in-place
computations at every stage, which is quite different from the WFTA [31] or Chen's
fast DCT algorithm [78]. The post-calculation of Hou's algorithm also has distinct
stages. The fact that Cooley-Tukey FFTs and the fast DCTs by Lee and Hou have in-place
computation and separable twiddling multiplication stages makes it feasible to
extend them to multidimensional fast algorithms. It makes the modification rules of the logic
diagram applicable and the combination of twiddle factors of different dimensions
possible. Not surprisingly, taking Figures 10 and 32 for example, the 2-D VR FFTs and 2-D
VR DCT algorithms derived from Lee's and Hou's methods have the vector radix-2*2
butterfly and the combined twiddle-factor stage as their common computation
structures. Since VR FFTs based on the Cooley-Tukey method have trivial twiddling
stages, 2-D butterflies with higher vector radices are allowed in 2-D VR FFT algorithms.
6-5 Summary
In this chapter, two vector radix fast discrete cosine transform algorithms have been
introduced using both the matrix representation and the logic diagram. These two
algorithms show an arithmetic advantage over the row-column method because the
number of multiplications is reduced by one quarter. The computational structure of the 2-D
vector radix algorithms is regular, is characterized by 2-D butterflies, twiddling multiplications
and post- or pre-calculation structures, and can be systematically generated from the
corresponding 1-D algorithms. Computer programs using these algorithms have been
developed, and the use of the structural approach has assisted in the program development
procedure [96]. The arithmetic complexity of various DCT algorithms has been considered. A
comparative study of the computation structures of vector radix FFTs and vector radix
127
direct DCT algorithms has been carried out, and the correct system configuration for the 1-D DIT
Hou algorithm has also been presented in this chapter.
128
CHAPTER SEVEN: HARDWARE IMPLEMENTATION OF 2-D DCTS
FOR REAL-TIME IMAGE CODING SYSTEMS
Research on fast digital signal processing algorithms has followed the development
of computer technology, especially VLSI technology, since the foundational work laid by
Cooley and Tukey in 1965 [22]. In their well known paper they reduced the computational
burden of computing a length-N Discrete Fourier Transform (DFT) from the original
order of N^2 to the order of N·log2(N), and the same reduction can be realized for DCTs.
Since then, many fast algorithms have been published, and the evaluation of various fast
algorithms has been based on the following theoretical judgements, namely:
(a) the number of numerical operations (multiplications/additions);
(b) round-off errors;
(c) in-place computation; and
(d) the computation structure.
Of the above criteria, the computational complexity in terms of the number of
multiplications and additions has been the focal point in the development of fast
algorithms. In the early years particularly, research concentrated on reduced-multiplication
algorithms [39], as the time spent on a multiplication was far greater than that for an addition
on general purpose computers. This has also been true of DCT computations until
recently [72]. In implementations of 2-D DCTs for real-time image coding, as in any other
real-time application, special hardware, instead of general purpose computers, has to be
employed. This special hardware often depends on the leading edge of VLSI
technology. As a result, the development of VLSI technology led to a re-consideration of
the criteria on which new algorithms were devised and to a re-assessment of the
effectiveness of various fast algorithms. In other words, the devising and evaluation of
(new) fast algorithms have to be made relevant to VLSI technology or, simply, the
specific hardware installation.
129
In this chapter, a single processor system to implement the modified Makhoul 2-D
indirect algorithm is first described using the newly released CMOS VLSI FFT
processor, the A41102. Various VLSI DCT processors for 2-D image coding are then
reviewed. Different algorithms for the 2-D DCT computation are re-assessed in the
light of hardware implementation using different Digital Signal Processors (DSPs),
compared with the direct row-column matrix multiplication algorithm using
Multiplier/Accumulator processors [81].
7-1 Description of Hardware Implementation of the Modified 2-D
Makhoul DCT Algorithm Using the FDP™ A41102
The 2-D indirect fast DCT algorithms may not be as efficient as the 2-D direct
methods in terms of arithmetic complexity, and they may not have a regular computation
structure or in-place computation, but if a VLSI FFT processor is used, this method
shows its advantages in processing speed and overall system simplicity [74, 86].
As mentioned previously, the Austek A41102 Frequency Domain Processor (FDP)
is an FFT chip which provides a continuous sampling rate of up to 2.5 Ms/s and has a
selectable 16-, 20- or 24-bit word length [26-28, 85]. More importantly, 8*8- and
16*16-point DFTs can be performed in a single pass within 25.6 μs and 102.4 μs
respectively. When the modified 2-D indirect DCT algorithm is used [57, 86, 126], the
configuration using the FDP will give a fairly large DCT processing throughput. For
convenience, Equation (5-4-3a) is repeated as follows:
C(k1,k2) = 2Re{ W_4N^k1 [ W_4N^k2 V(k1,k2) + W_4N^-k2 V(k1,N-k2) ] }   (7-1-1)

From Equation (7-1-1), the 2-D DCT can be obtained by adding two terms derived from the
DFT:

W_4N^k1 W_4N^k2 V(k1,k2)   and   W_4N^k1 W_4N^-k2 V(k1,N-k2).

Define k2' = N - k2 in the second term, which results in:

130

W_4N^k1 W_4N^-k2 V(k1,N-k2) = W_4N^k1 W_4N^(k2'-N) V(k1,k2')
                            = j·W_4N^k1 W_4N^k2' V(k1,k2').   (7-1-2)
Equation (7-1-2) states a very important fact, namely, that all the elements
represented by the second term in Equation (7-1-1) can be obtained by multiplying the
corresponding elements of the first term by j, which is nothing but interchanging the real and
imaginary parts of the element. There is an uncommitted complex multiplier on the FDP
A41102 which can be used either before or after the FFT operation is completed. This
uncommitted complex multiplier can be employed in conjunction with a ROM to generate
all the elements in Equation (7-1-1). Therefore, the 2-D DCT can be calculated by the
system described diagrammatically in Figure-33. Since the use of the uncommitted
complex multiplier does not slow down the process, a processing rate of 2.5 Ms/s can be
obtained by this single-FDP system. This system configuration provides a comparatively
simple hardware solution over that of the row-column method [86]. The processing
speed can be improved by introducing a multi-processor configuration [86]. The above
modified Makhoul algorithm has been used to calculate 2-D DCTs using polynomial
transforms [126].
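A hedged C sketch of the j-multiplication described above (the struct and function names are illustrative): with the convention W_4N = e^(-j2π/4N), so that W_4N^-N = j, multiplying by j maps a + jb to -b + ja, i.e. a swap of the real and imaginary parts together with one sign change.

```c
#include <assert.h>

typedef struct { double re, im; } cplx;

/* Multiply a complex element by j: j*(a + jb) = -b + ja.
   This is the real/imaginary interchange exploited in Equation (7-1-2). */
cplx mul_by_j(cplx v)
{
    cplx r = { -v.im, v.re };
    return r;
}
```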
131

[Figure-33: block diagram of the single-FDP system for computing the 2-D DCT by the modified Makhoul algorithm (graphic not reproduced in transcript)]

132
7-2 Discussion of 2-D DCT Image Coding Systems Using VLSI
Digital Signal Processors
For the fast computation of 2-D DCTs in real-time image coding, the fastest and
simplest system configuration would use dedicated VLSI DCT processors [74, 75,
87, 89]. For example, in [111], a hardware architecture is reported using the row-column
fast DCT algorithm by Chen et al. [78] on 8*8-point blocks. The processor accepts 8-
bit video input digital signals, uses 12-bit internal precision and provides 12-bit DCT
output. A 16*16-point DCT VLSI processor is demonstrated in [139] using a direct
matrix multiplication method and a concurrent architecture [75, 143]. The processor
accepts 9-bit 2's complement data, maintains 12-bit precision after the column DCTs and
produces 14-bit DCT coefficients at a 14.3 MHz sample rate. In a recent report [74], a 27
MHz DCT chip which performs 8*8-point DCTs has been demonstrated using the Duhamel-
H'Mida fast cyclic convolution algorithm [113]. The SGS-Thomson Microelectronics Group
has been marketing its VLSI Discrete Cosine Transformer, the STV3200 [87]. The STV3200
DCT processor accepts 9-bit 2's complement input data, uses 16-bit internal precision
and produces 12-bit 2's complement DCT coefficients. It can perform 4*4- up to 16*16-
point DCTs at a rate expected to be 13.5 MHz. The IMS A121 of Inmos, which is now
part of the SGS-Thomson Microelectronics Group, is yet another VLSI DCT processor [88].
The IMS A121 can perform an 8*8-point DCT in 3.2 μs (20 MHz pixel rate) using the
direct matrix multiplication method. It accepts 9-bit signed input, uses 14-bit signed
integers for the cosine function Look Up Table (LUT) and 16-bit precision after the first
matrix multiplication, and renders 12-bit output for the DCT coefficients. TRW's TMC2311
[89] is another fast DCT processor, which calculates an 8*8-point DCT in 4.48 μs (14.3 MHz
pixel rate). The TMC2311 accepts 12- or 14-bit input data and produces optional 12-, 14-
or 16-bit output. The row-column method has been used in all the above DCT
processors, and so has fixed-point computation. Since the length of the DCTs under
consideration is comparatively small, various algorithms have been used in VLSI
integration of DCT processors without showing a great deal of difference in speed
133
performance for image coding. A comparative study of the error performance of these
DCT processors remains to be undertaken.
Where DCT processors are not available, the use of various DSPs, the FDP and
Multiplier/Accumulators (M/As) provides many options.
For fixed-point DCT computation, a single M/A processor IDT7210 with a
25 ns multiply/accumulate cycle [90] would render a throughput of about 2.5 Ms/s for an
8*8-point DCT, or 1.23 Ms/s for a 16*16-point DCT, using the row-column matrix
multiplication method [81]. A single processor system using AT&T's WE DSP16A [91],
which is an M/A-based DSP, would give a processing rate of 1.02 Ms/s for an 8*8-point
DCT [92] and about 0.95 Ms/s for a 16*16-point DCT [81]. Taking advantage of very
fast M/A processors, the direct matrix multiplication method out-performs many fast
algorithms using other digital signal processors in terms of speed and system
simplicity.
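As a sanity check on the quoted rates, a row-column N*N DCT by direct matrix multiplication takes 2·N^3 multiply/accumulate cycles per block; a hedged C sketch of this count (illustrative, ignoring I/O and control overhead):

```c
#include <assert.h>
#include <math.h>

/* Estimated throughput (samples/s) of a row-column N*N DCT on a single
   multiplier/accumulator with the given cycle time, ignoring I/O and
   control overhead. */
double rc_dct_throughput(int N, double mac_cycle_s)
{
    double macs = 2.0 * N * N * N;          /* 2*N length-N dot-product passes */
    double block_time = macs * mac_cycle_s; /* seconds per N*N block */
    return (double)(N * N) / block_time;    /* samples per second */
}
```

With N = 8 and a 25 ns cycle this gives 2.5 Ms/s, matching the figure above; with N = 16 it gives 1.25 Ms/s, slightly above the quoted 1.23 Ms/s, the difference presumably being overhead.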
Using the Austek FDP A41102 discussed previously would also provide a fairly
large throughput and a simple system solution.
Using the TMS320C30 [93], DCTs can be calculated in floating-point, which, as shall
be shown in the next chapter, has a much higher signal to noise ratio than the integer
computation.
The TMS320C30, a floating-point digital signal processor, is the third
generation device in the TMS320 family. Multiplication, memory access,
addition, shift and all other ALU operations can be executed within one clock cycle (60 ns).
Algorithms can be further optimized using the parallel instructions that the TMS320C30
provides. The speed at which a particular algorithm can be implemented depends upon
how compatible it is with the hardware.
From previous studies, if the efficiency of an algorithm is judged by the number of
additions, Chen's algorithm is the best. Lee's algorithm is better than Chen's if the
number of multiplications, or even the total number of numerical operations (including
additions and multiplications), is used as the criterion. But if the TMS320C30 is used to
implement an 8-point 1-D DCT, the total number of clock cycles used to
134
complete the process will be the main issue. Since there are only a limited number of
registers on the TMS320C30, not every one of which can be used in the parallel processing
instructions, algorithms which do not have a regular structure or in-place computation
tend to introduce more data handling operations, resulting in a relatively slow
implementation, although they may have the same arithmetic complexity as others [81]. On
one occasion, the implementation of a 2-D 8*8-point DCT using Chen's algorithm on the
TMS320C30 required about 60 cycles more than that using Lee's or Hou's algorithm
under similar programming conditions. Although the indirect DCT algorithm using the
WFTA has the same arithmetic complexity as Lee's and Hou's algorithms, it
also requires about 60 cycles more than the two when it comes to implementation on
the TMS320C30. The difference between algorithms in terms of the exact number of
cycles may vary with the programmer's experience, but the fact remains the same. This
problem becomes worse as the length of the DCT increases. Another observation is that
although the vector radix DCT algorithms have relatively low arithmetic complexity as
well as in-place and regular computation structure, they are out-performed by the row-
column DCTs due to the current arrangement of DSP architectures and the limited
number of registers provided [81]. In other words, the pipelined and parallel structure of
vector radix DCTs cannot be fully employed by current DSPs.
Because DCT processing using floating-point computation is at present
considerably slower than fixed-point computation, more TMS320C30s are
required to provide real-time image coding speed, which means an increase in system
complexity.
Unless VLSI DCT processors are used, a multi-processor system is required to
render a real-time image coding speed for an image of 288*352 pixels, or equivalently a video
signal rate of 3.04128 Ms/s. Since the DCT process, together with the quantization, decides
the overall performance of an image coding system [128], using floating-point
computation for DCTs also remains to be justified.
From the above discussion, it is concluded that:
135
(1) two fast algorithms which have equal computational complexity may not
have the same efficiency in hardware implementation, as the limited
resources on DSPs often impose different restrictions on them;
(2) the direct matrix multiplication method using very high speed
multiplier/accumulators may out-perform many fast algorithms in certain
applications and provides a simple system solution;
(3) a fast algorithm which possesses a regular computation structure will not only
facilitate future VLSI implementation but also provide better performance
using available DSPs than those which do not; and
(4) the pipeline and parallel computation structure of many multidimensional fast
algorithms has yet to be fully exploited in VLSI system design [81].
136
CHAPTER EIGHT: THE EFFECTS OF FINITE-WORD-LENGTH
COMPUTATION FOR FAST DCT ALGORITHMS
8-1 Introduction
In [81], various fast Discrete Cosine Transform (DCT) algorithms have been
examined and compared in terms of computational efficiency (or arithmetic complexity)
from both the software and hardware implementation points of view. In this chapter,
fast DCT algorithms are compared further in order to analyze the
effects of finite-word-length computation on the DCT process.
Generally speaking, the imposition of finite-word-length computation produces
overflow and roundoff errors [5, 6, 24]. Overflow occurs when the magnitude of an
operation exceeds the value that the finite-word-length register can represent. Roundoff is
required when a b-bit data sample is multiplied by a b-bit coefficient, resulting in a
product that is 2b bits long. To maintain a certain word length in a computation
procedure, truncation or rounding has to be applied, which causes errors usually referred
to as roundoff noise or roundoff errors. The use of quantized coefficients will also
introduce errors in the finite-word-length calculation [5, 6, 24]. So far, there have been
few reports comparing different fast DCT algorithms on this issue [132].
The main concern of this chapter is to investigate the roundoff errors produced in
various direct fast DCT algorithms when finite-word-length arithmetic is used and when the
cosine multiplicands are quantized. Results are generated by computer simulation [94,
95]. The infinite precision calculation of a DCT is implemented using double-
precision floating-point arithmetic, which is considered to be the benchmark. The
roundoff error performance is measured by the Signal to Noise
Ratio (SNR), which is defined as follows:

SNR = 10·log10 [ Σ (DCT_d)^2 / Σ (DCT_d - DCT_f)^2 ]   (8-1-1)

where DCT_d is the DCT output with double precision, and DCT_f is the DCT output with
finite word length, which could be 32-bit floating-point, or integer with finite length.
137
For the investigation of roundoff errors caused by using 32-bit floating-point
operations, Chen's [78], Lee's [76] and Hou's [77] algorithms will be considered for 1-D
DCT implementations, as well as the direct matrix multiplication method [81, 96]. In the
2-D DCT simulation with 32-bit floating-point, the row-column method using direct
matrix multiplication and Chen's, Lee's and Hou's algorithms, together with the 2-D vector radix direct
DCT algorithms [46, 80], will be studied.
For the analysis of roundoff errors caused by using integer calculation, some
simulation results have been reported in [81, 96].
According to simulation theory [94], the mean of a random variable can be
estimated by the sample mean x̄_I of I observed values, and its variance can be
approximated by the sample variance s_x^2 of the I independent samples. The formulas
for calculating x̄_I and s_x^2 are given by the following equations:

x̄_I = (1/I) Σ_{i=1}^{I} x_i   (8-1-2)

s_x^2 = [ Σ_{i=1}^{I} x_i^2 - I·x̄_I^2 ] / (I - 1)   (8-1-3)

where x_i is the i-th observed value of the sample sequence. To improve the reliability of the
simulation output, the replications method [94] has been used in this study. The formulas
for the sample mean X̄_I and variance S_X^2 are presented by the following equations:

X̄_I = (1/I) Σ_{i=1}^{I} X_i   (8-1-4)

S_X^2 = (1/(I-1)) Σ_{i=1}^{I} (X_i - X̄_I)^2
      = (1/(I-1)) [ Σ_{i=1}^{I} X_i^2 - (1/I)(Σ_{i=1}^{I} X_i)^2 ]   (8-1-5)

where I is the number of runs and X_i is the sample mean on run i.
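Equations (8-1-4) and (8-1-5) can be sketched as follows (illustrative naming; X[i] holds the per-run sample means):

```c
#include <assert.h>
#include <math.h>

/* Sample mean and variance over I replication runs, Equations
   (8-1-4) and (8-1-5); X[i] is the sample mean obtained on run i. */
void replications_stats(const double *X, int I, double *mean, double *var)
{
    double s = 0.0, s2 = 0.0;
    for (int i = 0; i < I; i++) {
        s  += X[i];
        s2 += X[i] * X[i];
    }
    *mean = s / I;
    *var  = (s2 - s * s / I) / (I - 1);
}
```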
The input data to the DCT is produced by a random number generator with a
Gaussian distribution. The Gaussian input data y_i is obtained from a uniformly distributed
138
sequence x_i on the interval (0,1), which is provided in the run-time library, using the
Central Limit Theorem:

y_i = ( Σ_{j=1}^{n} x_j - n/2 ) / sqrt(n/12)   (8-1-6)

When n = 12, the equation becomes:

y_i = Σ_{j=1}^{12} x_j - 6.   (8-1-7)
Equation (8-1-7) has been used in the simulation to generate Gaussian random data as the
input to the DCTs. A test of the Gaussian input generating program on one million samples
has shown satisfactory results.
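Equation (8-1-7) can be sketched in C as below; the use of rand() from the standard run-time library is an assumption standing in for whatever uniform generator the original programs used:

```c
#include <stdlib.h>
#include <math.h>
#include <assert.h>

/* Approximate standard Gaussian sample via the Central Limit Theorem,
   Equation (8-1-7): the sum of 12 uniform (0,1) samples minus 6. */
double gauss_clt(void)
{
    double s = 0.0;
    for (int i = 0; i < 12; i++)
        s += (rand() + 1.0) / ((double)RAND_MAX + 2.0); /* uniform in (0,1) */
    return s - 6.0;
}
```

By construction every sample lies strictly inside (-6, 6), so the tails of the true Gaussian are truncated; the thesis reports this approximation as satisfactory over one million samples.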
The simulation programs have been written in the C language, compiled on a PC-AT
and run on PC computers.
8-2 Simulation Design
In this section, the structure of the simulation program, error models, benchmarks
for the DCT computations and data collection are described.
8-2-1 Structure of the simulation program
The simulation program consists of five parts:
simulation requirement input;
initialization;
generation of the input for DCTs;
computation of the DCTs; and
simulation data collection.
[Flowchart (graphic not reproduced in transcript): simulation input (a. the length of the DCTs, n; b. the number of block samples in each simulation run, bl; c. the word length for the LUT, nb1; d. the word length for roundoff, nb2 — c and d optional for integer) → initialization (a. generation of the LUT; b. clearing of all data collection variables) → the random number generator is seeded → for BL = 1 to bl: input data generation; computation of the DCTs (a. double-precision DCT; b. finite-word-length DCT); computation of the SNR (a. SNR of the current block; b. sum of SNRs for run I); increment BL by 1 → calculation for run I (a. mean SNR of run I; b. sum of mean SNRs for each run; c. sum of squared mean SNRs); increment I by 1 → data collection (a. sample mean of SNR; b. sample variance of SNR; c. confidence interval) → END.]

Structure of simulation program for error analysis
Figure-34

140
This can be described by the flowchart shown in Figure-34. The details of some of
the blocks may vary from one fast DCT algorithm to another according to the simulation
requirements.
The input data is integer with a specified word length, selectable as signed 8-bit,
unsigned 8-bit or signed 9-bit. The data is processed in blocks of size
4*4, 8*8, 16*16 or 32*32 points, and the length of the DCT equals the block size. The
number of blocks is the number of two-dimensional DCTs in each simulation run. For
the row-column implementation, the number of 1-D DCTs required is double the block
size.
The initialization is used to set up the Look Up Tables (LUTs) where the cosine or sine
multiplicands are pre-calculated and stored, for each DCT program.
Note that the input and output of the DCT process are referred to as "data" and
"coefficients", whilst the values of the cosine functions in the Look Up Table are referred to as
"multiplicands".
After each DCT block calculation, the signal to noise ratio is calculated and
accumulated to find the sample mean. When the number of blocks is reached, the mean
value of the signal to noise ratio on the current run is computed, and this sample mean is
again accumulated, as is its mean square. This process is repeated eight times before
the final sample mean and the sample variance of the signal to noise ratio are calculated
according to Equations (8-1-4) and (8-1-5). The confidence interval has also been used to
render a provisional guide for the simulation.
8-2-2 Error model for the basic computation structure
The basic computational structure of fast D C T algorithms is the butterfly as shown
in Figure-3 and consideration needs to be given to the roundoff errors produced in this
stage. It is known that the error model of the floating-point calculation is different from
that of the integer operation because both floating-point multiplications and additions will
introduce roundoff errors whilst only multiplications using integer calculation will cause
141
roundoff errors. Although the simulation method is used in this study instead of the
theoretical approach where the error model is a necessity, understanding of the model
assists in the simulation design, especially for the integer calculation.
There are two additions and one multiplication in a butterfly structure, and roundoff
errors will usually be introduced at three locations, represented by ef1, ef2 and eo as
shown in Figure-35. In essence, the accumulated roundoff errors depend heavily on the
total number of multiplications and additions required by each algorithm for the DCT.
Since a' is a finite-word-length representation of the exact multiplicand a, it also introduces
computation noise. The error information computed from the simulation includes all
of the above effects.
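The three noise injection points can be made explicit in a small sketch. The rounding step, word length and butterfly form here are illustrative assumptions, not the exact structures used by the algorithms under test.

```python
def quantize(x, frac_bits):
    """Round x to frac_bits fractional bits, modelling the roundoff
    introduced at one point of a finite-word-length computation."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

def butterfly(x, y, a, frac_bits):
    """A DCT-style butterfly: two additions and one multiplication,
    with roundoff entering at ef1, ef2 and eo, and with the exact
    multiplicand a replaced by its quantized version a'."""
    a_q = quantize(a, frac_bits)      # a': finite-word-length multiplicand
    s = quantize(x + y, frac_bits)    # ef1: first adder output
    d = quantize(x - y, frac_bits)    # ef2: second adder output
    p = quantize(d * a_q, frac_bits)  # eo: multiplier output
    return s, p

s, p = butterfly(1.0, 0.5, 0.7071, 8)
```

Summed over all the butterflies of an algorithm, these small roundings produce the accumulated noise measured by the simulation.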
8-2-3 DCT in infinite-word-length
It is assumed that the DCT outputs calculated in infinite-word-length are
independent of the individual algorithm used, and that 64-bit double-precision is
considered "infinite" compared with the 32-bit floating-point or 16-bit integer data
formats. In the simulation, the roundoff noise is calculated by subtracting the DCT
coefficients in finite-word-length obtained by an algorithm from those of the same DCT
algorithm using double-precision. The signal to noise ratio is evaluated using Equation
(8-1-1).
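As a sketch, this evaluation step might look as follows, assuming Equation (8-1-1) is the usual ratio of signal power to noise power in decibels (its exact form is given earlier in the chapter); the coefficient values are hypothetical.

```python
import math

def snr_db(reference, finite):
    """SNR in dB: the double-precision DCT coefficients serve as the
    'infinite'-word-length reference, and the noise is the difference
    between them and the finite-word-length coefficients."""
    signal = sum(r * r for r in reference)
    noise = sum((r - f) ** 2 for r, f in zip(reference, finite))
    return 10.0 * math.log10(signal / noise)

ref = [100.0, -50.0, 25.0, -12.5]          # hypothetical double-precision output
fin = [100.001, -49.999, 25.001, -12.499]  # hypothetical 32-bit output
x = snr_db(ref, fin)
```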
8-2-4 Data collection
The Gaussian random data is mapped into signed 8-bit, unsigned 8-bit or signed
9-bit integers as the input to the DCT process. The amount of input data is about the same
as that contained in a frame of image 288*352 pixels in size. For each run, a new seed is
chosen for the random number generator to ensure that different simulation runs are
independent of each other. Double-precision is used throughout the calculation of the
sample mean and variance to keep the error caused by the data collection to a minimum.
The replications method is used to reduce the sample variance [94,96].
[Figure-35: Error model of the butterfly computation structure, showing the roundoff noise sources ef1, ef2 and eo.]
8-3 Simulation Results
The fast DCT algorithms under evaluation include those by Chen [78], Lee [76] and
Hou [77], in comparison with the direct matrix multiplication method. These one-dimensional
algorithms are used to implement the 2-D DCT using a row-column
operation. In addition, fast two-dimensional algorithms based on Lee's and Hou's approaches
have been developed and evaluated [46, 80]. All the simulation results are plotted to
present a meaningful comparison of the above-mentioned algorithms in terms of the error
performance using the finite-word-length calculation.
That the infinite-word-length computation is independent of the DCT algorithm has
been demonstrated by comparing the difference between Lee's and Hou's algorithms
using 64-bit double-precision arithmetic. The signal to difference ratio is in excess of 250
dB, independent of block size. Thus, as expected, the output is essentially independent
of the algorithm when the precision is effectively infinite.
In the floating-point calculation of the DCT, 32-bit floating-point arithmetic is used
throughout the DCT computation, and the multiplicands in the LUTs are also represented in
the 32-bit floating-point format.
8-3-1 Floating-point computation of 1-D DCTs
Figures 36, 37 and 38 show the signal to noise ratios of Chen's, Lee's and Hou's
algorithms, in comparison with those of the Direct Matrix Multiplication (DMM) method,
for 1-D DCT lengths of 4, 8, 16 and 32. The form of the 1-D
input data varies between signed 8-bit, unsigned 8-bit and signed 9-bit integers. It can be
seen that as the length of the DCT increases, the signal to noise ratio decreases for all
algorithms, but that of Chen's algorithm shows the least degradation. All the signal to
noise ratios are greater than 134 dB for all the fast algorithms under all the input data
conditions using the floating-point calculation, and at least 10 dB better than that of the
direct matrix multiplication method. The difference between the best and the worst SNR
for the same DCT length is less than 9 dB for the fast algorithms. The error performances of
[Figure-36: Signal to Noise Ratio, dB, versus the length of the 1-D DCT for Chen's, Lee's and Hou's algorithms and the DMM method.]
[Figure-37: Signal to Noise Ratio, dB, versus the length of the 1-D DCT for Chen's, Lee's and Hou's algorithms and the DMM method.]
[Figure-38: Signal to Noise Ratio, dB, versus the length of the 1-D DCT for Chen's, Lee's and Hou's algorithms and the DMM method.]
Lee's and Hou's algorithms are very close. Interestingly, the error
performance depends on the form of the input data. For example, for an 8- or 16-point
DCT, Chen's algorithm provides a better signal to noise ratio than both Lee's and
Hou's when the input data is a signed 8- or 9-bit integer, whilst the reverse is true when
the input data is an unsigned 8-bit integer. The rate of SNR degradation for all the
algorithms with unsigned 8-bit integer input is much lower than with signed 8- or
9-bit integer input.
8-3-2 Floating-point computation of 2-D DCTs
The signal to noise ratios of the row-column Chen's, Lee's and Hou's algorithms
are plotted in Figures 39, 40 and 41, along with those of the 2-D Vector Radix (VR) DCT
algorithms based on Lee's and Hou's approaches, in comparison with that of the row-column
direct matrix multiplication method. It is interesting to note that when the input
data is a signed 8- or 9-bit integer, the row-column Chen's algorithm gives the best error
performance, whilst the difference between the SNRs of all the fast algorithms of length 4 is
marginal. However, when the input data is an unsigned 8-bit
integer, the vector radix DCT algorithms provide better performance for the DCT lengths
(8 and 16) used in practical image coding. Again, it can be seen that as the length of
the DCT increases the signal to noise ratio decreases, with the degradation rate of
Chen's algorithm being the least and those of the vector radix algorithms being the greatest
(about a 23 dB drop from length 4 to length 32 when unsigned 8-bit integers are used as
input).
Since in the floating-point computation of DCTs the signal to noise ratio of each fast
algorithm considered is greater than 121 dB, the differences between the fast algorithms are
relatively marginal. It is clear that the performance of the fast algorithms is superior to
that of the direct matrix multiplication method.
[Figure-39: Signal to Noise Ratio, dB, versus the length of the 2-D DCT for the row-column Chen, Lee, Hou and DMM methods and the 2-D VR Lee and 2-D VR Hou algorithms.]
[Figure-40: Signal to Noise Ratio, dB, versus the length of the 2-D DCT for the row-column Chen, Lee, Hou and DMM methods and the 2-D VR Lee and 2-D VR Hou algorithms.]
[Figure-41: Signal to Noise Ratio, dB, versus the length of the 2-D DCT for the row-column Chen, Lee, Hou and DMM methods and the 2-D VR Lee and 2-D VR Hou algorithms.]
8-4 Summary
The errors caused by the use of finite-word-length (32-bit floating-point)
computation in the process of computing discrete cosine transforms for coding purposes have been
studied.
In the floating-point computation, the signal to noise ratios of all the fast algorithms
are fairly close and above 120 dB for both the 1-D and 2-D DCT computations. They are
also superior to that of the direct matrix multiplication method. For
one-dimensional 4- to 32-point DCTs, Chen's algorithm shows better performance than
both Lee's and Hou's if the input data is a signed 8- or 9-bit integer, whilst the reverse is
true when the input data is an unsigned 8-bit integer. For two-dimensional 4*4- to 32*32-point
DCTs, the row-column Chen's algorithm is still superior if the input data is a signed
8- or 9-bit integer, whilst the vector radix DCT algorithms produce larger errors.
However, if the input data is an unsigned 8-bit integer, the vector radix algorithms perform
better than the others for 4*4- to 16*16-point DCTs; they are only inferior to the row-column
methods when the length of the 2-D DCT is 32.
It has also been found that, for both floating-point and integer computations
[96], the performance of fast DCT algorithms in terms of the signal to noise ratio
depends on the form of the input data, which is Gaussian noise mapped into
signed 8- or 9-bit integers or unsigned 8-bit integers. A similar study is being undertaken
using fixed-point arithmetic (integer computation), and the results will be reported
elsewhere.
CHAPTER NINE: CONCLUSIONS
9-1 Conclusions
In an attempt to ease the burden caused by the construction and implementation of
multidimensional fast transform algorithms, a structural approach is introduced which is
described by two representations: the matrix representation with the tensor product, and
logic diagrams with a set of modification rules. Using this structural approach, various
vector radix FFT algorithms, including the vector split-radix FFT and mixed vector radix
FFT algorithms, and vector radix direct fast DCT algorithms are derived and implemented
systematically from their 1-D counterparts. The relationship between vector radix
algorithms and the corresponding 1-D fast algorithms is clearly explained, particularly by
the diagrammatical representation. The derivation of vector radix algorithms becomes much
simpler using the logic diagrams, and implementation in both software and hardware can
build on pre-knowledge of the corresponding 1-D algorithms. The structural
approach is described by theorems and a recursive diagrammatical symbol system which
are successively applied to both multidimensional vector radix FFT and vector radix fast
DCT algorithms. The development of computer programs using vector radix fast
algorithms, including the combined factor vector radix-8*8 FFT and the vector radix DCTs based
on Lee's and Hou's methods, has demonstrated the effectiveness of this approach,
especially when a program using the 1-D algorithm is available. Further discussion of
the hardware implementation of vector radix FFTs has shown that in a pipelined VLSI
design to compute a 16*16-point DFT, only one complex multiplier is needed, whilst the
traditional row-column method requires two. Further, in the implementation of 2-D,
say 512*512-point, DFTs using the FDP A41102, the number of FDPs can be reduced if
the vector radix method is applied, thereby reducing the system complexity.
Consequently, the structural approach has been extended to vector radix FFTs of
higher dimensions. Although not discussed in the thesis, the approach can be applied to
m-D (m > 3) vector radix direct DCT algorithms as well.
It has been demonstrated that the logic diagram is a useful and very effective
presentation form in expanding knowledge of multidimensional transform algorithms.
The fact that the 2-D vector radix fast DCT algorithms were derived first using logic
diagrams, with their matrix representations then established for the general case, is a good example.
The computation structures of the 2-D vector radix DCT algorithms are discussed in
comparison with those of the 2-D vector radix FFT algorithms, to show both the basic computation
structures common to vector radix algorithms and the major differences.
In analyzing the structure of Hou's DIT fast DCT algorithm, the correct system
description is presented together with the 2-D vector radix DCT algorithm.
A single-processor 2-D DCT coding system using the FDP A41102 is presented,
rendering a processing rate of 2.5 Ms/s. Where VLSI DCT processors are not
available, it provides an option for the hardware implementation of the transform coding
problem.
Different aspects of the hardware implementation of DCTs for image coding
applications are also discussed using various VLSI DCT processors, DSPs,
multiplier/accumulators and the FDP. It is pointed out that the design of fast algorithms and
the design of VLSI processors are closely related, with the computation structure being a very
important issue.
The error performance of various fast DCT algorithms is evaluated using computer
simulation for the floating-point calculation. When random numbers with a Gaussian
distribution are used as input, it has been found that the performance of the algorithms
depends on the form of the input, namely whether it is a signed 8- or 9-bit or an unsigned 8-bit
integer. The lengths of the DCTs considered are chosen for image coding, varying
from 4 to 32. The performance of the fast DCT algorithms is also compared with that of the
direct matrix multiplication method, in both the 1-D and 2-D cases, to show that the former is
better than the latter when the floating-point calculation is used. The performance of the fast
algorithms using integer computation is still under evaluation.
Finally, it is appropriate to point out some remaining problems and
to make suggestions for further research.
9-2 Suggestions for Future Research
So far, an extensive study has been made of the theoretical aspects of vector radix
algorithms for both DFTs and DCTs, and computer programs have been developed to
show the validity of the proposed approach. System configurations have been described
using vector radix FFT algorithms and the FDP A41102. The following projects are
suggested for future work.
(a) Hardware implementation of a pipelined vector radix FFT for 512*512-point DFT
computation using FDP A41102s, as described in Chapter Three. Since the FDP
uses fixed-point or block floating-point arithmetic, the error analysis of this
system can be conducted based on information presented in [26-28, 85], or by
simulation using the software provided by Austek [114]. Different aspects of this
multi-processor system can be evaluated and compared with other system
configurations.
(b) The feasibility, performance, advantages and disadvantages of VLSI integration of
vector radix FFT algorithms can be closely examined to enhance published results
[2, 134, 137]. Many advantages of using the vector radix FFT in the VLSI
implementation of 2-D DFTs, compared with the row-column FFT algorithms,
have been shown in [134] and [137]. However, the performance of VR FFTs in terms
of area*time^2 [2] has yet to be evaluated. Since only the vector radix-2*2 FFT is
considered in [134] and [137], the number of multiplier stages is shown to be
log2N - 1, where N is the length of the 2-D N1*N2-point DFT assuming N1 = N2 =
N. It has been demonstrated in this thesis that when higher radices are used, the
number of multiplier stages can be reduced. Thus, it is expected that the area*time^2
performance of the VLSI implementation will be improved using vector radix FFT
algorithms.
(c) Extension of the structural approach to other multidimensional fast digital signal
processing algorithms should also be studied.
(d) Hardware implementation of the 2-D DCT for image coding can be carried out using
the FDP A41102, as described in Chapter Seven, for application to video-telephony
or video-conferencing.
(e) An interactive study of DCT computation and quantization can be carried out so that
an evaluation of the overall DCT coding system can be reached [128] before a DCT
codec is implemented for telecommunication purposes.
(f) The feasibility of VLSI integration of vector radix fast DCT algorithms for coding
systems can also be explored.
BIBLIOGRAPHY
[1] D.E. Dudgeon and R.M. Mersereau, Multidimensional Digital Signal Processing,
Prentice-Hall Inc., Englewood Cliffs, N.J., 1984.
[2] I. Gertner and M. Shamash, "VLSI Architectures for Multidimensional Fourier
Transform Processing", IEEE Transactions on Computers, Vol.C-36, pp. 1265-1274,
November 1987.
[3] R.J. Clarke, Transform Coding of Images, Academic Press, 1985.
[4] W.K. Pratt, Digital Image Processing, John Wiley & Sons, Inc., 1978.
[5] L.R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing,
Prentice-Hall, 1975.
[6] A.V. Oppenheim and R.W. Schafer, Digital Signal Processing, Prentice-Hall
International Inc., 1975.
[7] K.R. Castleman, Digital Image Processing, Prentice-Hall Inc., Englewood Cliffs,
New Jersey, 1979.
[8] R. C. Gonzales and P. Wintz, Digital Image Processing, Addison-Wesley Publishing
Company Inc., 1977.
[9] W.S. Hinshaw and A.H. Lent, "An Introduction to NMR Imaging: From the Bloch
Equation to the Imaging Equation", Proceedings IEEE, Vol.71, No.3, March 1983.
[10] L. Jacobson and H. Wechsler, "A Theory for Invariant Object Recognition in the
Frontoparallel Plane", IEEE Trans. Pattern Anal. Machine Intell., Vol.PAMI-6,
pp.325-331, May 1984.
[11] H. Gafni and Y.Y. Zeevi, "A Model for Separation of Spatial and Temporal
Information in the Visual System", Biol. Cybern., Vol.28, pp.73-82, 1977.
[12] H. Gafni and Y.Y. Zeevi, "A Model for Processing of Movement in the Visual
System", Biol. Cybern., Vol.32, pp.165-173, 1979.
[13] The Last Word in DSP, Zoran Digital Signal Processors Data Book, ZORAN
Corporation, 1987.
[14] J.D. O'Sullivan, D.R. Brown, K.T. Hua and C.E. Jacka, "A VLSI Chip for Fast
Fourier Transforms", Digest of Papers, IREECON'87, p. 142, 1987.
[15] D.R. Brown, K.T. Hua , J.D. O'Sullivan, C.E. Jacka and P.E. Single, "A VLSI
Chip for Fast Fourier Transforms", ASSPA 89, Signal Processing, Theories,
Implementations and Applications, pp. 164-168, April 1989.
[16] Fernando Macias-Garza, A.C. Bovik, K.R. Diller, S.J. Aggarwal and J.K.
Aggarwal, "Digital Reconstruction of Three-Dimensional Serially Sectioned Optical
Images", IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol.ASSP-36,
pp.1067-1075, July 1988.
[17] N. Ahmed, T. Natarajan and K.R. Rao, "Discrete Cosine Transform", IEEE
Transactions on Computers, Vol.C-23, pp.90-93, January 1974.
[18] K.R. Rao and P. Yip, Discrete Cosine Transform, Academic Press, Orlando, FL,
1990.
[19] A. Uzum, A.W. Seeto, D. Rosenfeld, D. Skellen and A. Maheswaran, "Video
Coding: A Survey", Workshop on Telecommunication Services Based on Video and
Images, Sydney, September 1988.
[20] J.W. Cooley, P.A.W. Lewis and P.D. Welch, "Historical Notes on the Fast Fourier
Transform", Proceedings of the IEEE, Vol.55, pp. 1675-1677, October 1967.
[21] M.T. Heideman, D.H. Johnson and C.S. Burrus, "Gauss and the History of the Fast
Fourier Transform", IEEE ASSP Magazine, pp.14-21, 1984.
[22] J.W. Cooley and J.W. Tukey, "An Algorithm for the Machine Calculation of
Complex Fourier Series", Math. Comput., Vol.19, No.90, pp.297-301, 1965.
[23] L.R. Rabiner, "The Acoustics, Speech, and Signal Processing Society—A Historical
Perspective", IEEE ASSP Magazine, pp.4-10, January 1984.
[24] B. Gold and C.M. Rader, Digital Processing of Signals, McGraw-Hill Book Co.,
1969.
[25] W.T. Cochran, J.W. Cooley, D.L. Favin, H.D. Helms, R.A. Kaenel, W.W. Lang,
G.C. Maling, Jr., D.E. Nelson, C.M. Rader and P.D. Welch, "What is the Fast
Fourier Transform?", Proceedings of the IEEE, Vol.55, pp.1664-1674, October
1967.
[26] A41102 Frequency Domain Processor, Austek Microsystems Proprietary, Inc. and
Austek Microsystems Pty. Ltd., 1988.
[27] Frequency Domain Processor (FDP™), Austek Microsystems Proprietary, Inc. and
Austek Microsystems Pty. Ltd., 1988.
[28] A41102 Frequency Domain Processor, Austek Microsystems Proprietary, Inc. and
Austek Microsystems Pty. Ltd., 1988.
[29] M. Bellanger, Digital Processing of Signals—Theory and Practice, John Wiley &
Sons Ltd., 1985.
[30] L. Auslander, E. Feig and S. Winograd, "Abelian Semi-Simple Algebras and
Algorithms for the Discrete Fourier Transform", Advances in Applied Mathematics,
No.5, pp.31-55, 1984.
[31] R.E. Blahut, Fast Algorithms for Digital Signal Processing, Addison-Wesley
Publishing, Inc., 1985.
[32] A. Guessoum and R.M. Mersereau, "Fast Algorithms for the Multidimensional
Discrete Fourier Transform", IEEE Transactions on Acoustics, Speech, and Signal
Processing, Vol.ASSP-34, No.4, pp.937-943, August 1986.
[33] L. Auslander and R. Tolimieri, "Ring Structure and the Fourier Transform", The
Mathematical Intelligencer, Vol.7, No.3, pp.49-52, 54, 1985.
[34] M. Vetterli and H.J. Nussbaumer, "Simple FFT and D C T Algorithms with Reduced
Number of Operations", Signal Processing, August 1984.
[35] L. Auslander, E. Feig and S. Winograd, "New Algorithms for the Multidimensional
Discrete Fourier Transform", IEEE Transactions on Acoustics, Speech, and Signal
Processing, Vol.ASSP-31, No.2, pp.388-403, April 1983.
[36] Soo-Chang Pei and Ja-Ling Wu, "Split Vector Radix 2-D Fast Fourier Transform",
IEEE Transactions on Circuits and Systems, Vol.CAS-34, pp.978-980, August
1987.
[37] Zhi-Jian Mou and P. Duhamel, "In-Place Butterfly-Style FFT of 2-D Real
Sequences", IEEE Transactions on Acoustics, Speech, and Signal Processing,
Vol.ASSP-36, pp.1642-1650, October 1988.
[38] M.A. Haque, "A Two-Dimensional Fast Cosine Transform", IEEE Transactions on
Acoustics, Speech, and Signal Processing, Vol.ASSP-33, pp.1532-1539, 1985.
[39] D.F.Elliott and K.R. Rao, Fast Transforms: Algorithms, Analyses, Applications,
Academic Press, 1982.
[40] Third-Generation T M S 3 2 0 User's Guide, SPRU031, Texas Instruments
Incorporated, 1988.
[41] L.R. Morris, "Comparative Study of Time Efficient FFT and WFTA Programs for
General Purpose Computers", IEEE Trans. on Acoustics, Speech, and Signal
Processing, Vol.ASSP-26, pp.141-150, April 1978.
[42] G.E.Rivard, "Direct Fast Fourier Transform of Bivariate Functions", IEEE
Transactions on Acoustics, Speech, and Signal Processing, Vol.ASSP-25, pp.250-
252, June 1977.
[43] D.B. Harris, J.H. McClellan, D.S.K. Chan and H.W. Schuessler, "Vector Radix
Fast Fourier Transform", 1977 IEEE Int. Conf. Acoust., Speech, Signal Processing
Rec., pp.548-551, May 1977.
[44] B. Arambepola, "Fast Computation of Multidimensional Discrete Fourier
Transforms", IEE Proceedings, Vol.127, Pt.F, No.1, February 1980.
[45] H.R. Wu and F.J. Paoloni, "The Structure of Vector Radix Fast Fourier
Transforms", IEEE Transactions on Acoustics, Speech, and Signal Processing,
Vol.37, pp.1415-1424, September 1989.
[46] H.R. Wu and F.J. Paoloni, "A Two Dimensional Fast Cosine Transform
Algorithm—A Structural Approach", Proceedings of IEEE International Conference
on Image Processing, Singapore, pp.50-54, September 1989.
[47] E.O. Brigham, The Fast Fourier Transform, Prentice-Hall Inc., Englewood Cliffs,
N.J., 1974.
[48] S. Winograd, "On Computing the Discrete Fourier Transform", Mathematics of
Computation, Vol.32, No.141, pp.175-199, January 1978.
[49] D.W. Tufts and G. Sadasiv, "The Arithmetic Fourier Transform", IEEE ASSP
Magazine, pp. 13-17, January 1988.
[50] S. Prakash and V.V. Rao, "Vector Radix FFT Error Analysis", IEEE Transactions on
Acoustics, Speech, and Signal Processing, Vol.ASSP-30, pp.808-811, October
1982.
[51] I. Pitas and M.G. Strintzis, "Floating Point Error Analysis of Two-Dimensional Fast
Fourier Transform Algorithms", IEEE Trans, on Circuits and Systems, Vol.35,
pp. 112-115, January 1988.
[52] C.S. Burrus and P.W. Eschenbacher, "An In-Place, In-Order Prime Factor FFT
Algorithm", IEEE Transactions on Acoustics, Speech, and Signal Processing,
Vol.ASSP-29, August 1981.
[53] H.W. Johnson and C.S. Burrus, "On the Structure of Efficient DFT Algorithms",
IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol.ASSP-33,
pp.248-254, February 1985.
[54] Kenji Nakayama, "An Improved Fast Fourier Transform Algorithm Using Mixed
Frequency and Time Decimations", IEEE Transactions on Acoustics, Speech, and
Signal Processing, Vol.ASSP-36, pp.290-292, February 1988.
[55] M.A. Richard, "On the Efficient Implementation of the Split-Radix FFT",
Proceedings ofICASSP-86, pp. 1801-1804, 1986.
[56] R.W. Linderman et al., "CUSP: A 2-μm CMOS Digital Signal Processor", IEEE
Journal of Solid-State Circuits, Vol.SC-20, pp.761-769, June 1985.
[57] J. Makhoul, "A Fast Cosine Transform in One and Two Dimensions", IEEE
Transactions on Acoustics, Speech, and Signal Processing, Vol.ASSP-28, pp.27-34,
1980.
[58] H.J. Nussbaumer and P. Quandalle, "Fast Computation of Discrete Fourier
Transforms Using Polynomial Transforms", IEEE Transactions on Acoustics,
Speech, and Signal Processing, Vol.ASSP-27, pp. 169-181, April 1979.
[59] O.R. Hinton and R.A. Salch, "Two-Dimensional Discrete Fourier Transform with
Small Multiplicative Complexity Using Number Theoretic Transforms", IEE
Proceedings, Vol.131, Pt.G, No.6, December 1984.
[60] H.R. Wu and F.J. Paoloni, "On the Two Dimensional Vector Split-Radix FFT
Algorithm", IEEE Transactions on Acoustics, Speech, and Signal Processing,
August 1989.
[61] R.C. Agarwal and J.W. Cooley, "An Efficient Vector Implementation of the FFT
Algorithm on IBM 3090VF", ICASSP, pp.249-252, 1986.
[62] R.M. Mersereau and T.C. Speake, "A Unified Treatment of Cooley-Tukey
Algorithms for the Evaluation of the Multidimensional DFT", IEEE Transactions on
Acoustics, Speech, and Signal Processing, Vol.ASSP-29, pp.1011-1018, October
1981.
[63] C.S. Burrus and T.W. Parks, Discrete Fourier Transform/Fast Fourier Transform
and Convolution Algorithms, A Wiley-Interscience Publication, John Wiley & Sons,
1985.
[64] G.D. Bergland, "A Fast Fourier Transform Algorithm Using Base Eight Iterations",
Math Computation, Vol.22, pp.275-279, April 1968.
[65] Weizhen Ma and Ruixiang Yin, "New Recursive Factorization Algorithms to
Compute DFT(2m) and DCT(2m)", IEEE Asian Electronics Conference, Hong Kong,
1987.
[66] Weizhen Ma and Dekun Yang, "New Fast Algorithm for Two-Dimensional Discrete
Fourier Transform DFT(2n,2)", Electronics Letters, Vol.25, No.1, pp.21-22,
January 1989.
[67] H.R. Wu and F.J. Paoloni, "Structured Vector Radix FFT Algorithms and Hardware
Implementation", submitted to Journal of Electrical and Electronics Engineering,
Australia, for publication, 1989.
[68] P. Duhamel, "Implementation of 'Split-Radix' FFT Algorithms for Complex, Real,
and Real-Symmetric Data", IEEE Transactions on Acoustics, Speech, and Signal
Processing, Vol.ASSP-34, pp.285-295, April 1986.
[69] P.R. Halmos, Finite-Dimensional Vector Spaces, D.Van Nostrand Company, Inc.,
1958.
[70] H.R. Wu and F.J. Paoloni, "The Structure of Vector Radix Multidimensional Fast
Fourier Transforms", Technical Report No.1, Department of Electrical and Computer
Engineering, The University of Wollongong, 1986.
[71] Zhi-Jian Mou and P. Duhamel, "Corrections to 'In-Place Butterfly-Style FFT of 2-D
Real Sequences'", IEEE Transactions on Acoustics, Speech, and Signal Processing,
Vol.ASSP-37, September 1989.
[72] M. Vetterli, P. Duhamel and C. Guillemot, "Trade-Offs in the Computation of
Mono- and Multi-Dimensional DCTs", Proceedings of IEEE International
Conference on Acoustics, Speech, and Signal Processing, pp.999-1002, 1989.
[73] S. Okubo, R. Nicol, B. Haskell and S. Sabri, "Progress of CCITT Standardization
on n*384 kbit/s Video Codec", IEEE Globecom'87, pp.36-39, 1987.
[74] J.C. Carlach, P. Penard and J.L. Sicre, "TCAD: A 27 MHz 8*8 Discrete Cosine
Transform Chip", Proc. ICASSP'89, 1989.
[75] M.T. Sun, T.C. Chen, A. Gottlieb, L. Wu and M.L. Liou, "A 16*16 Discrete
Cosine Transform Chip", Proc. of SPIE'87 Symp. Visual Commun. Image Proc.,
Vol.845, pp.13-18, Oct. 1987.
[76] B.G. Lee, "A New Algorithm to Compute the Discrete Cosine Transform", IEEE
Trans. on Acoust., Speech, Signal Processing, Vol.ASSP-32, pp.1243-1245,
December 1984.
[77] H.S. Hou, "A Fast Recursive Algorithm For Computing the Discrete Cosine
Transform", IEEE Trans, on Acoustics, Speech, and Signal Processing, Vol.ASSP-
35, pp.1455-1461, 1987.
[78] Wen-Hsiung Chen, C. Harrison Smith and S.C. Fralick, "A Fast Computational
Algorithm for The Discrete Cosine Transform", IEEE Transactions on
Communications, Vol.COM-25, No.9, pp.1004-1009, September 1977.
[79] M. Vetterli, "Fast 2-D Discrete Cosine Transform", IEEE ASSP Conf., pp.1538-1541,
1985.
[80] H.R. Wu and F.J. Paoloni, "A Structural Approach to Two Dimensional Direct Fast
Discrete Cosine Transform Algorithms", Proceedings of International Symposium on
Computer Architecture & Digital Signal Processing, Hong Kong, October 1989.
[81] H.R. Wu and F.J. Paoloni, "Comparison Study on Software and Hardware
Implementation of Various Fast Discrete Cosine Transform Algorithms", Technical
Report-1, The University of Wollongong-Telecom Research Laboratories (Australia)
R&D Contract for the Study of Fast Implementations of Discrete Cosine Transform
Coding Systems, under No.7066, June 1989.
[82] M.J. Narasimha and A.M. Peterson, "On the Computation of the Discrete Cosine
Transform", IEEE Transactions on Communications, Vol.COM-26, pp.934-936,
1978.
[83] H.R. Wu and F.J. Paoloni, "The Structure of Vector Radix Multidimensional Fast
Fourier Transforms", ISSPA 87, Signal Processing, Theories, Implementations and
Applications, pp.89-92, August 1987.
[84] M. Vetterli, "Trade-Offs in the Computation of Mono- and Multi-dimensional
DCTs", Technical Report CU/CTR/TR-090-88-18, Center for Telecommunications
Research, 1988.
[85] Austek Microsystems Proprietary, Inc. and Austek Microsystems Pty. Ltd., A User
Guide for the A41102, 1988.
[86] H.R. Wu, F.J. Paoloni and W. Tan, "Implementation of 2-D DCT for Image Coding
Using FDP™ A41102", Proceedings of the Conference on Image Processing and the
Impact of New Technologies, Canberra, December 1989.
[87] Real Time Discrete Cosine Transformer—Advanced Specifications #2, SGS
Thomson Microelectronics, March 1987.
[88] IMS A121 2-D Discrete Cosine Transform Processor—Advance Information, inmos,
April 1989.
[89] TMC2311-CMOS Fast Cosine Transform Processor—Advance Information, TRW
LSI Products Inc., 1989.
[90] High Performance CMOS Data Book, Integrated Device Technology, 1988.
[91] WE® DSP16A Digital Signal Processor—Advance Data Sheet, AT&T, 1988.
[92] D.M. Blaker, "Using the DSP16/DSP16A for Image Compression", DSP Review,
AT&T, Vol.2, Issue 1, pp.4-5, 1989.
[93] TEXAS INSTRUMENTS, Third-Generation TMS320 User's Guide, 1988.
[94] A. Alan B. Pritsker, Introduction to Simulation and SLAM II, 3rd ed. Systems
Publishing Corporation, Halsted Press, 1986.
[95] Byron J.T. Morgan, Elements of Simulation, Chapman and Hall Ltd, 1984.
[96] H.R. Wu and F.J. Paoloni, "Simulation Study on the Effects of Finite-Word-Length
Calculations for Fast DCT Algorithms", Technical Report-2, The University of
Wollongong-Telecom Research Laboratories (Australia) R&D Contract for the Study
of Fast Implementations of Discrete Cosine Transform Coding Systems, under
No.7066, October 1989.
[97] R. Yavne, "An Economical Method for Calculating the Discrete Fourier Transform",
National Computer Conference and Exposition Proceedings, Vol.33, pp.115-125,
1968.
[98] P. Duhamel, B. Piron and J.M. Etcheto, "On Computing the Inverse DFT", IEEE
Trans, on Acoustics, Speech, and Signal Processing, Vol.ASSP-36, pp.285-286,
February 1988.
[99] M.T. Heideman and C.S. Burrus, "On the Number of Multiplications Necessary to
Compute a Length-2n DFT", IEEE Trans. on Acoustics, Speech, and Signal
Processing, Vol.ASSP-34, pp.91-95, February 1986.
[100] T.S. Huang, "How the Fast Fourier Transform Got Its Name", Computer, Vol.4,
No.3, p. 15, May-June 1971.
[101] Yoiti Suzuki, Toshio Sone and Ken'iti Kido, "A New FFT Algorithm of Radix 3, 6,
and 12", IEEE Trans, on Acoustics, Speech, and Signal Processing, Vol.ASSP-34,
pp.380-383, April 1986.
[102] W.A. Perera and P.J.W. Rayner, "Optimal Design of Multiplierless DFTs and
FFTs", ICASSP'86, pp.245-248, 1986.
[103] W.M. Gentleman and G. Sande, "Fast Fourier Transforms—For Fun and Profit",
Proceedings, Fall Joint Computer Conference, pp.563-578, 1966.
[104] M.R. Schroeder, "The Unreasonable Effectiveness of Number Theory in Science and
Communication (1987 Rayleigh Lecture)", IEEE ASSP Magazine, pp.5-12, January
1988.
[105] K.N. Ngan, K.S. Leong and H. Singh, "Adaptive Cosine Transform Coding of
Images in Perceptual Domain", IEEE Transactions on Acoustics, Speech, and Signal
Processing, Vol.ASSP-37, pp. 1743-1750, November 1989
[106] H. Kitajima, "A Symmetric Cosine Transform", IEEE Trans, on Computers, Vol.C-
29, pp.317-323, 1980.
[107] Byeong Gi Lee, "FCT - A Fast Cosine Transform", IEEE ASSP Conf.,
pp.28A.3.1-28A.3.4, 1984.
[108] H.R. Wu and F.J. Paoloni, "A 2-D Fast Cosine Transform Algorithm Based on
Hou's Approach", submitted to IEEE Transactions on Acoustics, Speech, and Signal
Processing, for publication, 1989.
[109] H.R. Wu and F.J. Paoloni, "The Impact of the VLSI Technology on the Fast
Computation of Discrete Cosine Transforms for Image Coding", to be submitted.
[110] H.J. Nussbaumer, Fast Fourier Transform and Convolution Algorithms, Springer-
Verlag, Berlin Heidelberg, 1982.
[111] E. Arnould and J.P. Dugre, "Real Time Discrete Cosine Transform - An Original
Architecture", IEEE ASSP Conf., pp.48.6.1-48.6.4, 1984.
[112] Naoki Suehiro and Mitsutoshi Hatori, "Fast Algorithms for the DFT and Other
Sinusoidal Transforms", IEEE Transactions on Acoustics, Speech, and Signal
Processing, Vol.ASSP-34, No. 3, pp. 642-644, June 1986.
[113] Pierre Duhamel and Hedi H'Mida, "New 2^n DCT Algorithms Suitable for VLSI
Implementation", IEEE ASSP Conf., pp.1805-1808, 1987.
[114] Austek Microsystems Proprietary, Inc. and Austek Microsystems Pty. Ltd., FDPSIM
USERS GUIDE, 1985.
[115] WE® DSP32C Digital Signal Processor—Advance Information Data Sheet, AT&T.
[116] WE® DSP32C Digital Signal Processor—Information Manual, AT&T, December
1988.
[117] C.S. Burrus, "Bit Reverse Unscrambling for a Radix-2^M FFT", Proc. ICASSP,
pp.1809-1810, 1987.
[118] H. Nawab and J.H. McClellan, "Bounds on the Minimum Number of Data Transfers
in W F T A and FFT Programs", IEEE Transactions on Acoustics, Speech, and Signal
Processing, Vol.ASSP-27, No.4, pp.394-398, August 1979.
[119] Z. Wang, "On Computing the Discrete Fourier and Cosine Transforms", IEEE
Transactions on Acoustics, Speech, and Signal Processing, Vol.ASSP-33, No.4,
pp.1341-1344, October 1985.
[120] Z. Wang and B.R. Hunt, "Comparative Performance of Two Different Versions of
the Discrete Cosine Transform", IEEE Transactions on Acoustics, Speech, and Signal
Processing, Vol.ASSP-32, No.2, pp.450-453, April 1984.
[121] P. Yip and K.R. Rao, "On the Shift Property of DCTs and DSTs", IEEE
Transactions on Acoustics, Speech, and Signal Processing, Vol.ASSP-35, No.3,
pp.404-406, March 1987.
[122] K.N. Ngan, "Image Display Techniques Using the Cosine Transform", IEEE
Transactions on Acoustics, Speech, and Signal Processing, Vol.ASSP-32, No.1,
pp.173-177, February 1984.
[123] O. Ersoy, "On Relating Discrete Fourier, Sine, and Symmetric Cosine Transforms",
IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol.ASSP-33,
No.1, pp.219-222, February 1985.
[124] H.S. Malvar, "Fast Computation of the Discrete Cosine Transform and the Discrete
Hartley Transform", IEEE Transactions on Acoustics, Speech, and Signal Processing,
Vol.ASSP-35, No.10, pp.1484-1485, October 1987.
[125] V. Nagesha, "Comments on 'Fast Computation of the Discrete Cosine Transform and
the Discrete Hartley Transform'", IEEE Transactions on Acoustics, Speech, and
Signal Processing, Vol.ASSP-37, No.3, pp.439-440, March 1989.
[126] N. Nasrabadi and R. King, "Computationally Efficient Discrete Cosine Transform
Algorithm", Electronics Letters, Vol.19, January 1983.
[127] H.R. Wu and F.J. Paoloni, "Comparison Study on Software and Hardware
Implementation of Various Fast Discrete Cosine Transform Algorithms", Addendum
of Technical Report-1, The University of Wollongong-Telecom Research
Laboratories (Australia) R & D Contract for the Study of Fast Implementations of
Discrete Cosine Transform Coding Systems, under No.7066, November 1989.
[128] D.J. Bailey and N. Birch, "Image Compression Using a Discrete Cosine Transform
Image Processor", Electronic Engineering, July 1989.
[129] P.K. Rodman, "High Performance FFTs for a VLIW Architecture", Proceedings of
International Symposium on Computer Architecture & Digital Signal Processing,
Hong Kong, October 1989.
[130] R.K. Asbury, "2D and 3D FFTs on the Intel iPSC/2—A Distributed Memory,
Multi-Processor Supercomputer", Proceedings of International Symposium on
Computer Architecture & Digital Signal Processing, Hong Kong, October 1989.
[131] S.Y. Kung, "From VLSI Arrays to Neural Networks", Proceedings of International
Symposium on Computer Architecture & Digital Signal Processing, Hong Kong,
October 1989.
[132] Y. He and Z. Wang, "Fixed-Point Error Analysis for the Fast Cosine Transform",
Proceedings of International Symposium on Computer Architecture & Digital Signal
Processing, Hong Kong, October 1989.
[133] W. Ma and D. Yang, "On Computing 2-D DFT", Proceedings of International
Symposium on Computer Architecture & Digital Signal Processing, Hong Kong,
October 1989.
[134] W. Liu, T. Hughes and W.T. Krakow, "A Rasterization of Two-Dimensional Fast
Fourier Transform", in VLSI Signal Processing, II, ed. by S.Y. Kung, R.E. Owen
and J.G. Nash, pp. 281-292, IEEE Press, 1986.
[135] S.Y. Kung, H.J. Whitehouse and T. Kailath, ed., VLSI and Modern Signal
Processing, Prentice-Hall, Inc., 1985.
[136] S.Y. Kung, VLSI Array Processors, Prentice-Hall, Inc., 1988.
[137] W. Liu and D.E. Atkins, "VLSI Pipelined Architectures for Two Dimensional Fast
Fourier Transform with Raster-Scan Input Device", International Conference on
Computer Design: VLSI in Computer, pp.370-375, 1984.
[138] A.D. Culhane, M.C. Peckerar and C.R.K. Marrian, "A Neural Net Approach to
Discrete Hartley and Fourier Transforms", IEEE Transactions on Circuits and
Systems, Vol.CAS-36, pp.695-703, 1989.
[139] M.-T. Sun, T.-C. Chen and A.M. Gottlieb, "VLSI Implementation of a 16*16
Discrete Cosine Transform", IEEE Transactions on Circuits and Systems, Vol.CAS-
36, pp.610-617, 1989.
[140] J.A. Beraldin, T. Aboulnasr and W. Steenaart, "Efficient One-Dimensional Systolic
Array Realization of the Discrete Fourier Transform", IEEE Transactions on Circuits
and Systems, Vol.CAS-36, pp.95-100, 1989.
[141] T. Willey, R. Chapman, H. Yoho, T.S. Durrani and D. Preis, "Systolic
Implementations for Deconvolution, DFT and FFT", IEE Proceedings, Vol.132,
Pt.F, 1985.
[142] E.E. Swartzlander, Jr. and G. Hallnor, "Fast Transform Processor Implementation",
Proceedings of ICASSP 84, pp.25A.5.1-25A.5.4, 1984.
[143] M.T. Sun, L. Wu and M.L. Liou, "A Concurrent Architecture for VLSI
Implementation of Discrete Cosine Transform", IEEE Transactions on Circuits and
Systems, Vol.CAS-34, pp.992-994, 1987.
[144] C.D. Thompson, "Fourier Transforms in VLSI", IEEE Transactions on Computers,
Vol.C-32, pp.1047-1057, 1983.
[145] H. Mori, H. Ouchi and S. Mori, "A WSI Oriented Two Dimensional Systolic Array
for FFT", Proceedings of ICASSP 86, pp.2155-2158, 1986.
[146] A. Iwata, I. Horiba, N. Suzumura and N. Takagi, "3-Dimensional Reconstructing
Algorithm for Digital Tomo-Synthesis", Proceedings of ICASSP 86, pp. 1741-1744,
1986.
[147] K.J. Jones, "2D Systolic Solution to Discrete Fourier Transform", IEE Proceedings,
Vol.136, Pt.F, pp.211-216, 1989.
[148] H. Schmid, Decimal Computation, John Wiley & Sons, Inc., 1974.
[149] IEEE Micro, (Special Issue on Digital Signal Processors,) Vol.6, No.6, December
1986.
[150] IEEE Micro, (Special Issue on Digital Signal Processors,) Vol.8, No.6, December
1988.
[151] K. Hwang, Computer Arithmetic, John Wiley & Sons, Inc. 1979.
[152] E.E. Swartzlander, Jr., VLSI Signal Processing Systems, Kluwer Academic
Publishers, 1986.
[153] H.R. Wu and F.J. Paoloni, "The Structure of Vector Radix Multidimensional Fast
Fourier Transforms—Part II", Technical Report No.2, Department of Electrical and
Computer Engineering, The University of Wollongong, 1986.
[154] M. Vulis, "The Weighted Redundancy Transform", IEEE Transactions on Acoustics,
Speech, and Signal Processing, Vol.ASSP-37, pp.1687-1692, November 1989.
[155] J.H. McClellan and C.M. Rader, Number Theory in Digital Signal Processing,
Prentice-Hall Inc., Englewood Cliffs, N.J., 1979.
[156] P. Duhamel and C. Guillemot, "Polynomial Transform Computation of the 2-D
DCT", to be presented at ICASSP-90, April 1990.
[157] J. Suzuki, M. Nomura and S. Ono, "Comparative Study of Transform Coding for
Super High Definition Images", to be presented at ICASSP-90, April 1990.
[158] U. Totzek, F. Matthiesen, S. Wohlleben and T.G. Noll, "CMOS VLSI
Implementation of the 2D-DCT with Linear Processor Arrays", to be presented at
ICASSP-90, April 1990.
[159] M. Yan, J.V. McCanny and Y. Hu, "VLSI Architectures for Digital Image Coding",
to be presented at ICASSP-90, April 1990.
APPENDIX A: PRELIMINARY BACKGROUND ON THE TENSOR
(KRONECKER) PRODUCT AND THE LOGIC
DIAGRAM
In this appendix, a brief introduction is presented to the two basic tools which are used
frequently throughout the thesis, namely, the tensor product, with its properties, and the
logic diagram. The definition and the properties of the tensor product are included to
make the thesis self-contained. The purpose of introducing the logic diagram instead of
the conventional signal flowgraph (the Mason flowgraph) will soon become clear.
Definition—the Tensor (or Kronecker) Product [31]:

Let A = [a_mk] be an M by K matrix, and B = [b_nl] be an N by L matrix. The tensor
product of A and B, denoted by A⊗B, is a matrix with MN rows and KL columns whose
entry in row (m-1)N + n and column (k-1)L + l is given by c_{mn,kl} = a_mk * b_nl.

The tensor product A⊗B is an M by K array of N by L blocks, with the (m,k)th
such block being a_mk B. It is apparent from the definition that the tensor product is not
commutative but is associative, i.e.,

    A⊗B ≠ B⊗A                                                    (A-1)
    (A⊗B)⊗C = A⊗(B⊗C)                                            (A-2)

The following equalities can also be proven to be true [31][29]:

    (A⊗B)(C⊗D) = (AC)⊗(BD)                                       (A-3)
    P_rN (I_r⊗A) M_rN = A⊗I_r                                    (A-4)

where A, B, C, and D are N*N matrices; I_r is of r*r dimension; P_rN is a permutation
matrix which is defined, when r = 2, by

    P_2N (x_0, x_1, x_2, ..., x_{N-1}) = (x_0, x_{N/2}, x_1, x_{N/2+1}, ..., x_{N-1});   (A-5)

and M_rN is the inverse matrix of P_rN, i.e., M_rN = [P_rN]^{-1}.
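As a numerical illustration of identity (A-3), the following sketch (Python, with a hand-rolled Kronecker product so that no external library is assumed; the matrices A, B, C and D are arbitrary illustrative choices) checks the mixed-product property, and also that the product is not commutative:

```python
def matmul(A, B):
    # Plain list-of-lists matrix product.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def kron(A, B):
    # Tensor (Kronecker) product: an M x K array of N x L blocks a_mk * B.
    M, K, N, L = len(A), len(A[0]), len(B), len(B[0])
    return [[A[i // N][j // L] * B[i % N][j % L]
             for j in range(K * L)] for i in range(M * N)]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 1]]
C = [[2, 0], [1, 1]]
D = [[1, 1], [0, 2]]

# Mixed-product property (A-3): (A x B)(C x D) equals (AC) x (BD).
lhs = matmul(kron(A, B), kron(C, D))
rhs = kron(matmul(A, C), matmul(B, D))
assert lhs == rhs

# Non-commutativity (A-1): the two tensor products differ in general.
assert kron(A, B) != kron(B, A)
```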
If we define a time delay operator s which gives

    x(n+1) = s*x(n)  or  x(n-1) = s^{-1}*x(n),                   (A-6)

we shall have the following properties of the tensor product:

    [ x_1(n)   ]   [ 1   0  ] [ x_1(n) ]
    [ x_2(n+1) ] = [ 0  s_1 ] [ x_2(n) ]                         (A-7)

    [ x_1(m,n)     ]   ( [ 1   0  ]   [ 1   0  ] ) [ x_1(m,n) ]
    [ x_2(m,n+1)   ] = ( [ 0  s_2 ] ⊗ [ 0  s_1 ] ) [ x_2(m,n) ]
    [ x_3(m+1,n)   ]                                [ x_3(m,n) ]
    [ x_4(m+1,n+1) ]                                [ x_4(m,n) ]   (A-8)
The logic diagram which we shall introduce consists of basic elements such as the line,
the heavy line (or vector line), the addition operation block, the vector addition operation
block, the scalar product block, and the vector scalar product block. The definitions are
shown in Figure-A-1. The logic diagram of a one dimensional algorithm is equivalent to
the Mason signal flowgraph, and the logic diagram of a multidimensional algorithm is a
direct extension of the one dimensional case which introduces the vector operation
concept into the graphical form. The logic diagram has its own rules which can be readily
used to derive equivalent logic diagrams, and these make the modification of algorithms
and the derivation of multidimensional fast algorithms a comparatively simple procedure.
The inverse of an orthogonal transform is equal to its transpose. The transpose operation
on a logic diagram is to change each addition block to a branch node and each branch
node to an addition block, as well as to reverse the direction of the input and output data
flow.
In the thesis, all the multidimensional fast algorithms can be derived either by using
the properties of the tensor product—a mathematical approach, or by using the logic
diagrams—an engineering approach.
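The transpose rule above relies on the stated fact that the inverse of an orthogonal transform equals its transpose. A minimal numerical check of that fact follows (Python; the 2 x 2 rotation matrix is an illustrative choice, not one of the transform matrices of the thesis):

```python
import math

# An orthogonal 2 x 2 transform (a rotation); its transpose is its inverse.
t = math.pi / 5
Q = [[math.cos(t), -math.sin(t)],
     [math.sin(t),  math.cos(t)]]

QT = [[Q[j][i] for j in range(2)] for i in range(2)]        # transpose
P = [[sum(Q[i][k] * QT[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]                                     # Q * Q^T

# Q * Q^T must be the identity matrix (to rounding error).
for i in range(2):
    for j in range(2):
        assert abs(P[i][j] - (1.0 if i == j else 0.0)) < 1e-12
```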
[Figure-A-1: Definitions of the basic elements of the logic diagram—the line, the heavy
(vector) line, the addition block, the vector addition block, the scalar product block and
the vector scalar product block.]
APPENDIX B: PROOF OF STRUCTURE THEOREMS
In order to prove Structure Theorem 1 of Chapter Three, we assume that
B_T = B_{N1} ⊗ B_{N2}. To prove that the 2-D butterfly matrix B equals B_T, it is only
necessary to show that B(k_1,l_1; m_0,n_0) = B_T(k_1,l_1; m_0,n_0), as both matrices
are of size r_1 r_2 * r_1 r_2. According to the definitions of the 1-D and 2-D butterfly
matrices, the following equations are obvious:

    B_T(k_1,l_1; m_0,n_0) = B_{N1}(k_1, m_0) * B_{N2}(l_1, n_0)
                          = W_{N1}^{k_1 m_0} * W_{N2}^{l_1 n_0}
                          = B(k_1,l_1; m_0,n_0),

that is,

    B = B_T.

The second part of the theorem, as well as the other structure theorems, can be proved
using the same approach.
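The elementwise factorisation used in the proof can also be checked numerically. The sketch below (Python; the sizes r1 = 2, r2 = 4 and the row-major index mapping are illustrative assumptions) builds two 1-D DFT matrices and verifies that their tensor product carries the separable 2-D kernel W_{N1}^{k1*m0} * W_{N2}^{l1*n0} at row (k1,l1) and column (m0,n0):

```python
import cmath

def dft_matrix(N):
    # 1-D DFT matrix: entry (k, n) is W_N^{kn} = exp(-j 2 pi k n / N).
    return [[cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N)]
            for k in range(N)]

def kron(A, B):
    # Tensor (Kronecker) product of two list-of-lists matrices.
    M, K, N, L = len(A), len(A[0]), len(B), len(B[0])
    return [[A[i // N][j // L] * B[i % N][j % L]
             for j in range(K * L)] for i in range(M * N)]

r1, r2 = 2, 4                    # illustrative radices N1, N2
F1, F2 = dft_matrix(r1), dft_matrix(r2)
T = kron(F1, F2)                 # candidate 2-D butterfly matrix

# Row (k1, l1) and column (m0, n0), in row-major order, must carry
# the separable 2-D kernel W_{N1}^{k1 m0} * W_{N2}^{l1 n0}.
for k1 in range(r1):
    for l1 in range(r2):
        for m0 in range(r1):
            for n0 in range(r2):
                expected = F1[k1][m0] * F2[l1][n0]
                got = T[k1 * r2 + l1][m0 * r2 + n0]
                assert abs(got - expected) < 1e-12
```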
APPENDIX C: THE COMBINED FACTOR METHOD
To obtain the combined factor vector radix-8*8 DIF FFT using Equation (3-6-8), one
simply combines the factor α with the twiddle factor matrix, using the fact that α^2 = -j.
Diagrammatically, the method attempts to reduce the number of multiplications contained
in the vector radix-8*8 butterfly structure by combining the row twiddles with the
column twiddles. For some algorithms there may be more than one way of carrying out
this task, but a minimum number of multiplications, with the combined factors (or
twiddles) at regular places, is often preferred; the example in [45] is but one. The
combination of the row twiddles with the column twiddles needs more explanation. If the
row twiddle is W_{N1}^α and the column twiddle is W_{N2}^β, combining them results
in W_{N1}^α * W_{N2}^β; when N1 = N2 = N, W_N^α * W_N^β = W_N^{α+β} = W_N^γ,
where γ = α+β. If a Look Up Table (LUT) is used in the program, as is the case in this
study, the combined factors are pre-calculated and stored in the LUT, from which they
are fetched when needed. This practice increases the DFT processing speed considerably.
Figure-C-1 shows a 2-D 64*64-point DFT calculated using the CF VR-8*8 FFT [45]
with the 2-D input data shown in Figure-C-2. Figure-C-3 is another 2-D DFT generated
by the same program, using Figure-C-4 as the input.
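A minimal sketch of such a pre-calculated LUT of combined factors follows (Python; the table layout and function name are illustrative assumptions, not the actual data structures of the thesis program). One table fetch at index γ = <α+β>_N replaces a run-time complex multiplication of the two twiddles:

```python
import cmath

N = 64
# Pre-calculated LUT of all N twiddle factors W_N^g = exp(-j 2 pi g / N).
LUT = [cmath.exp(-2j * cmath.pi * g / N) for g in range(N)]

def combined_twiddle(alpha, beta):
    # One LUT fetch replaces the run-time product W_N^alpha * W_N^beta.
    return LUT[(alpha + beta) % N]

# Sanity check: the fetched combined factor equals the explicit product.
for alpha, beta in [(0, 0), (5, 11), (40, 50), (63, 63)]:
    assert abs(combined_twiddle(alpha, beta) - LUT[alpha] * LUT[beta]) < 1e-12
```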
The vector radix-16*16 FFT algorithm can be constructed in the same manner. The
vector radix-16*16 butterfly computational structure can be calculated according to
Figure-9, and the vector radix-16*16 twiddle factors can be generated, using the
structure theorem, from the corresponding twiddles of the radix-16 FFT algorithm.
Figure-C-1: A 2-D 64 x 64-point DFT calculated using the CF VR-8 x 8 FFT algorithm.
Figure-C-2: The 2-D input data used for Figure-C-1.
Figure-C-3: A 2-D 64 x 64-point DFT calculated using the CF VR-8 x 8 FFT algorithm.
Figure-C-4: The 2-D input data used for Figure-C-3.
APPENDIX D: DERIVATION OF VECTOR RADIX 2-D FAST DCT
BASED ON LEE'S ALGORITHM
In Equation (5-2-4a), set k = 2k' + k" and l = 2l' + l"; k', l' = 0,1,...,N/2-1 and
k", l" = 0,1. Then the following four equations are obtained:

    X(2k',2l')     = Σ_{n=0}^{N-1} Σ_{m=0}^{N-1} x(n,m) C_{2N}^{(2n+1)(2k')} C_{2N}^{(2m+1)(2l')}       (D-1)

    X(2k',2l'+1)   = Σ_{n=0}^{N-1} Σ_{m=0}^{N-1} x(n,m) C_{2N}^{(2n+1)(2k')} C_{2N}^{(2m+1)(2l'+1)}     (D-2)

    X(2k'+1,2l')   = Σ_{n=0}^{N-1} Σ_{m=0}^{N-1} x(n,m) C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l')}     (D-3)

    X(2k'+1,2l'+1) = Σ_{n=0}^{N-1} Σ_{m=0}^{N-1} x(n,m) C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}   (D-4)

From Equation (D-1),

    X(2k',2l') = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} [x(n,m) + x(N-1-n,m)
                 + x(n,N-1-m) + x(N-1-n,N-1-m)] C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}             (D-5)

Note that C_{2N}^{(2(N-1-n)+1)(2k')} = C_{2(N/2)}^{(2n+1)k'}. Using the same method:
    X(2k',2l'+1) = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} [x(n,m) + x(N-1-n,m)
                   - x(n,N-1-m) - x(N-1-n,N-1-m)] C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}          (D-6)

    X(2k'+1,2l') = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} [x(n,m) - x(N-1-n,m)
                   + x(n,N-1-m) - x(N-1-n,N-1-m)] C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}          (D-7)

and,

    X(2k'+1,2l'+1) = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} [x(n,m) - x(N-1-n,m)
                     - x(n,N-1-m) + x(N-1-n,N-1-m)] C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}       (D-8)

Note that C_{2N}^{(2(N-1-m)+1)(2l')} = C_{2(N/2)}^{(2m+1)l'},
C_{2N}^{(2(N-1-n)+1)(2k'+1)} = -C_{2N}^{(2n+1)(2k'+1)}
and C_{2N}^{(2(N-1-m)+1)(2l'+1)} = -C_{2N}^{(2m+1)(2l'+1)}.
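The decimated equations can be checked numerically. The sketch below (Python) verifies the even/even case (D-5) for a small N, using the unnormalized kernel C_{2N}^{p} = cos(p*pi/(2N)) implied by Equation (5-2-4a); the other three cases can be checked the same way:

```python
import math
import random

def C(twoN, p):
    # DCT kernel C_{2N}^{p} = cos(p * pi / (2N)); twoN is the subscript 2N.
    return math.cos(p * math.pi / twoN)

N = 4
random.seed(1)
x = [[random.random() for _ in range(N)] for _ in range(N)]

def X(k, l):
    # Direct (unnormalized) 2-D DCT, as in Equation (5-2-4a).
    return sum(x[n][m] * C(2 * N, (2 * n + 1) * k) * C(2 * N, (2 * m + 1) * l)
               for n in range(N) for m in range(N))

def X_even_even(kp, lp):
    # Right-hand side of (D-5): fold the input, use the half-size kernel.
    return sum((x[n][m] + x[N - 1 - n][m] + x[n][N - 1 - m]
                + x[N - 1 - n][N - 1 - m])
               * C(N, (2 * n + 1) * kp) * C(N, (2 * m + 1) * lp)
               for n in range(N // 2) for m in range(N // 2))

for kp in range(N // 2):
    for lp in range(N // 2):
        assert abs(X(2 * kp, 2 * lp) - X_even_even(kp, lp)) < 1e-9
```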
Define:

    g1(n,m) = x(n,m) + x(N-1-n,m) + x(n,N-1-m) + x(N-1-n,N-1-m)

    g2(n,m) = [1/(2 C_{2N}^{(2m+1)})] [x(n,m) + x(N-1-n,m) - x(n,N-1-m) - x(N-1-n,N-1-m)]

    g3(n,m) = [1/(2 C_{2N}^{(2n+1)})] [x(n,m) - x(N-1-n,m) + x(n,N-1-m) - x(N-1-n,N-1-m)]

    g4(n,m) = [1/(2 C_{2N}^{(2n+1)} 2 C_{2N}^{(2m+1)})] [x(n,m) - x(N-1-n,m) - x(n,N-1-m) + x(N-1-n,N-1-m)]
Then:

    X(2k',2l'+1) = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g2(n,m) 2 C_{2N}^{(2m+1)} C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}
                 = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g2(n,m) C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}
                   + Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g2(n,m) C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)(l'+1)}
                 = G2(k',l') + G2(k',l'+1)                                                               (D-9)

where G2(k',l') = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g2(n,m) C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}, noticing that

    2 C_{2N}^{(2m+1)} C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}
        = C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'} + C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)(l'+1)}.

Similarly,

    X(2k'+1,2l') = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g3(n,m) 2 C_{2N}^{(2n+1)} C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}
                 = G3(k',l') + G3(k'+1,l')                                                               (D-10)

where G3(k',l') = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g3(n,m) C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}, noticing that

    2 C_{2N}^{(2n+1)} C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}
        = C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'} + C_{2(N/2)}^{(2n+1)(k'+1)} C_{2(N/2)}^{(2m+1)l'}.

Accordingly,

    X(2k'+1,2l'+1) = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g4(n,m) 2 C_{2N}^{(2n+1)} 2 C_{2N}^{(2m+1)} C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}
                   = G4(k',l') + G4(k',l'+1) + G4(k'+1,l') + G4(k'+1,l'+1)                               (D-11)

where G4(k',l') = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g4(n,m) C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}.

After defining G1(k',l') = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g1(n,m) C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}, the matrix
form of the forward algorithm can be obtained.
    [g'1(n,m)]           [x(n,m)        ]
    [g'2(n,m)] = (B⊗B)   [x(n,N-1-m)    ]
    [g'3(n,m)]           [x(N-1-n,m)    ]
    [g'4(n,m)]           [x(N-1-n,N-1-m)]                        (D-12a)

    [g1(n,m)]             [g'1(n,m)]
    [g2(n,m)] = (M⊗M')    [g'2(n,m)]
    [g3(n,m)]             [g'3(n,m)]
    [g4(n,m)]             [g'4(n,m)]                             (D-12b)

    [G1(k,l)]   N/2-1 N/2-1                                        [g1(n,m)]
    [G2(k,l)] =   Σ     Σ   C_{2(N/2)}^{(2n+1)k} C_{2(N/2)}^{(2m+1)l} [g2(n,m)]
    [G3(k,l)]    n=0   m=0                                         [g3(n,m)]
    [G4(k,l)]                                                      [g4(n,m)]   (D-12c)

    [X(2k,2l)    ]           [G1(k,l)    ]
    [X(2k,2l+1)  ]           [G2(k,l)    ]
    [X(2k+1,2l)  ] = (P⊗P)   [G2(k,l+1)  ]
    [X(2k+1,2l+1)]           [G3(k,l)    ]
                             [G4(k,l)    ]
                             [G4(k,l+1)  ]
                             [G3(k+1,l)  ]
                             [G4(k+1,l)  ]
                             [G4(k+1,l+1)]                       (D-12d)

where k,l,n,m = 0,1,...,N/2-1, and Gi(k+1,l)|_{k=N/2-1} = Gi(k,l+1)|_{l=N/2-1} = 0, for i =
2,3,4.
For the inverse 2-D DCT defined by Equation (5-2-4b), the same decimation can be
applied to obtain the following equations:

    x(n,m) = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} [X(2k',2l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}
             + X(2k',2l'+1) C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}
             + X(2k'+1,2l') C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}
             + X(2k'+1,2l'+1) C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}]                             (D-13)

    x(n,N-1-m) = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} [X(2k',2l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}
                 - X(2k',2l'+1) C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}
                 + X(2k'+1,2l') C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}
                 - X(2k'+1,2l'+1) C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}]                         (D-14)

    x(N-1-n,m) = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} [X(2k',2l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}
                 + X(2k',2l'+1) C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}
                 - X(2k'+1,2l') C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}
                 - X(2k'+1,2l'+1) C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}]                         (D-15)

    x(N-1-n,N-1-m) = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} [X(2k',2l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}
                     - X(2k',2l'+1) C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}
                     - X(2k'+1,2l') C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}
                     + X(2k'+1,2l'+1) C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}]                     (D-16)

where n,m = 0,1,...,N/2-1. Define X(*,2l'-1)|_{l'=0} = 0, X(2k'-1,*)|_{k'=0} = 0, and
    h1(n,m) = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} X(2k',2l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}

    h2(n,m) = 2 C_{2N}^{(2m+1)} Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} X(2k',2l'+1) C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}
            = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} [X(2k',2l'+1) + X(2k',2l'-1)] C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}
            = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} H2(k',l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}

    h3(n,m) = 2 C_{2N}^{(2n+1)} Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} X(2k'+1,2l') C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}
            = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} H3(k',l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}

    h4(n,m) = 2 C_{2N}^{(2n+1)} 2 C_{2N}^{(2m+1)} Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} X(2k'+1,2l'+1) C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}
            = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} H4(k',l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}

where n,m = 0,1,...,N/2-1;

    H2(k',l') = X(2k',2l'+1) + X(2k',2l'-1);
    H3(k',l') = X(2k'+1,2l') + X(2k'-1,2l');
    H4(k',l') = X(2k'+1,2l'+1) + X(2k'+1,2l'-1) + X(2k'-1,2l'+1) + X(2k'-1,2l'-1).
Therefore,

    x(n,m) = h1(n,m) + [1/(2 C_{2N}^{(2m+1)})] h2(n,m) + [1/(2 C_{2N}^{(2n+1)})] h3(n,m)
             + [1/(2 C_{2N}^{(2n+1)} 2 C_{2N}^{(2m+1)})] h4(n,m)                                         (D-17)

    x(n,N-1-m) = h1(n,m) - [1/(2 C_{2N}^{(2m+1)})] h2(n,m) + [1/(2 C_{2N}^{(2n+1)})] h3(n,m)
                 - [1/(2 C_{2N}^{(2n+1)} 2 C_{2N}^{(2m+1)})] h4(n,m)                                     (D-18)

    x(N-1-n,m) = h1(n,m) + [1/(2 C_{2N}^{(2m+1)})] h2(n,m) - [1/(2 C_{2N}^{(2n+1)})] h3(n,m)
                 - [1/(2 C_{2N}^{(2n+1)} 2 C_{2N}^{(2m+1)})] h4(n,m)                                     (D-19)

    x(N-1-n,N-1-m) = h1(n,m) - [1/(2 C_{2N}^{(2m+1)})] h2(n,m) - [1/(2 C_{2N}^{(2n+1)})] h3(n,m)
                     + [1/(2 C_{2N}^{(2n+1)} 2 C_{2N}^{(2m+1)})] h4(n,m)                                 (D-20)
For k,l,n,m = 0,1,...,N/2-1, and X(2k-1,*)|_{k=0} = X(*,2l-1)|_{l=0} = 0, the matrix
form of the 2-D IDCT algorithm is presented as follows:

    [H1(k,l)]           [X(2k,2l)    ]
    [H2(k,l)] = (P⊗P)   [X(2k,2l+1)  ]
    [H3(k,l)]           [X(2k,2l-1)  ]
    [H4(k,l)]           [X(2k+1,2l)  ]
                        [X(2k+1,2l+1)]
                        [X(2k+1,2l-1)]
                        [X(2k-1,2l)  ]
                        [X(2k-1,2l+1)]
                        [X(2k-1,2l-1)]                           (D-21a)

    [h1(n,m)]   N/2-1 N/2-1                                        [H1(k,l)]
    [h2(n,m)] =   Σ     Σ   C_{2(N/2)}^{(2n+1)k} C_{2(N/2)}^{(2m+1)l} [H2(k,l)]
    [h3(n,m)]    k=0   l=0                                         [H3(k,l)]
    [h4(n,m)]                                                      [H4(k,l)]   (D-21b)

    [h'1(n,m)]             [h1(n,m)]
    [h'2(n,m)] = (M⊗M')    [h2(n,m)]
    [h'3(n,m)]             [h3(n,m)]
    [h'4(n,m)]             [h4(n,m)]                             (D-21c)

    [x(n,m)        ]           [h'1(n,m)]
    [x(n,N-1-m)    ] = (B⊗B)   [h'2(n,m)]
    [x(N-1-n,m)    ]           [h'3(n,m)]
    [x(N-1-n,N-1-m)]           [h'4(n,m)]                        (D-21d)
APPENDIX E: ARITHMETIC COMPLEXITY OF THE VECTOR
SPLIT-RADIX DIF FFT ALGORITHM
According to Equation (3-8-3), the multiplications in the 2-D vector split-radix
DIF FFT are all contained in the twiddle factor matrix F_m. The N*N DFT can be calculated
using one (N/2)*(N/2) DFT and twelve (N/4)*(N/4) DFTs. The number of extra
multiplications required at each stage by this approach is caused by F_m. The total
number of complex multiplications Mn needed for a 2-D N*N DFT, where N = 2^n, is
given by

    Mn = Mn-1 + 12*Mn-2 + Mextra                                 (E-1)

with M2 = M1 = 0, and it can be shown that Mextra = 12*((N/4)^2 - N/4) for n >= 3.

The total number of complex additions An is

    An = An-1 + 12*An-2 + Aextra                                 (E-2)

where Aextra = 3*(N/2)^2 + 48*(N/4)^2, A0 = 0, and A1 = 8.
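The recurrence (E-1), together with the expression for Mextra, is easy to tabulate. The short sketch below (Python) computes Mn for N = 2^n and reproduces the entries of Table-E-1:

```python
def multiplications(n):
    # Complex multiplications for the 2-D vector split-radix FFT, N = 2^n.
    # M1 = M2 = 0; Mn = M(n-1) + 12*M(n-2) + Mextra, Equation (E-1),
    # with Mextra = 12*((N/4)^2 - N/4), Equation (E-5).
    M = {1: 0, 2: 0}
    for i in range(3, n + 1):
        Ni = 2 ** i
        M_extra = 12 * ((Ni // 4) ** 2 - Ni // 4)
        M[i] = M[i - 1] + 12 * M[i - 2] + M_extra
    return M[n]

# A few entries of Table-E-1.
assert multiplications(3) == 24        # N = 8
assert multiplications(6) == 6024      # N = 64
assert multiplications(12) == 67746504 # N = 4096
```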
To prove that Mextra = 12*((N/4)^2 - N/4) for n >= 3, it is observed that amongst F_m
there are three groups of factors:

    (1) W_N^n, W_N^{3n}, W_N^m, W_N^{3m};
    (2) W_N^{m+n}, W_N^{3m+3n}; and
    (3) W_N^{2m+n}, W_N^{2m+3n}, W_N^{m+2n}, W_N^{m+3n}, W_N^{3m+2n}, W_N^{3m+n};

within which it is necessary to determine the number of trivial multiplications. The term
"trivial multiplication" means that the value of the twiddle factor is ±1 or ±j. No
multiplications are needed for 2*2 and 4*4 point DFTs, and thus M1 = M2 = 0.

In the first group, when n (resp. m) = 0 there are N/4 trivial multiplications as m (resp.
n) varies from 0 to N/4-1.
In the second group, W_N^{m+n} is considered first. According to the properties of the
transformation <a>_b, which finds the residue of a modulo b [155], for each
m = 1,2,...,N/4-1 there exists an n such that m+n = N/4, i.e. <m+n>_{N/4} = 0, in addition
to the pair m = 0 and n = 0. There are therefore N/4 trivial multiplications for the factor
W_N^{m+n}. Since <3(m+n)>_{N/4} = <<3>_{N/4}<m+n>_{N/4}>_{N/4}, we have
<3(m+n)>_{N/4} = 0 whenever <m+n>_{N/4} = 0, and so it is true that amongst the
(N/4)^2 multiplications, N/4 are trivial for each of the twiddle factors in this group.
In the last group, only W_N^{2m+n}, W_N^{3m+n} and W_N^{2n+3m} need be
considered, as the rest can be proved accordingly.

Considering the factor W_N^{2m+n}, we have

    <2m+n>_{N/4} = <<2m>_{N/4} + <n>_{N/4}>_{N/4}
                 = <<2m>_{N/4} + n>_{N/4}
                 = <m' + n>_{N/4}                                (E-3)

where m' = <2m>_{N/4}. For each m = 0,1,...,N/4-1 there exists an n such that
<m' + n>_{N/4} = 0, and at those points the multiplications become trivial, giving N/4
trivial multiplications over the grid. It can be shown that the same is true for
W_N^{3m+n} as well.
W_N^{2n+3m} also contains N/4 trivial multiplications; however, this is not so simple
to prove. To start, it is required to prove that <3m>_{N/4} is a one-to-one and onto
mapping whose domain is A = {0,1,...,N/4-1}, i.e., for every m in A, <3m>_{N/4} is in
A, and if m1 ≠ m2, with both m1 and m2 in A, then <3m1>_{N/4} ≠ <3m2>_{N/4}. This
can be achieved by invoking the theorem [8] which states that, for n = 0,1,2,...,M-1,
<an>_M takes on all the M possible residues if (a,M) = 1.

In the problem considered here, a = 3 and M = N/4 = 2^n/4, n >= 3. Since N/4 is a
power of 2, a and N/4 are mutually prime. According to the above theorem, <3m>_{N/4}
is a one-to-one and onto mapping: for every m in A there exists an m' = <3m>_{N/4} in
A, and m1 ≠ m2 implies m'1 ≠ m'2. Since

    <2n+3m>_{N/4} = <<2n>_{N/4} + <3m>_{N/4}>_{N/4}
                  = <<2n>_{N/4} + m'>_{N/4},                     (E-4)

in Equation (E-4) m' is in A and, in accordance with m, can take any value in A. So, for
each n in A there exists an m', and hence an m, such that Equation (E-4) is zero.
Therefore W_N^{2n+3m} contains N/4 trivial multiplications, which completes the
derivation.
From the above discussion, it is concluded that

    Mextra = 12*((N/4)^2 - N/4)                                  (E-5)

for n >= 3.
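The count underlying (E-5) can also be verified exhaustively for small N. The sketch below (Python) scans all twelve twiddle-factor families over the (N/4) x (N/4) grid of (m,n) pairs and confirms that each family contains exactly N/4 trivial entries, i.e. entries whose exponent is divisible by N/4, so that the factor is ±1 or ±j:

```python
# Exponent patterns a*m + b*n of the twelve twiddle-factor families.
FAMILIES = [(1, 0), (3, 0), (0, 1), (0, 3),    # group (1)
            (1, 1), (3, 3),                    # group (2)
            (2, 1), (2, 3), (1, 2), (1, 3),    # group (3)
            (3, 2), (3, 1)]

def trivial_count(N, a, b):
    # W_N^e is trivial (+1, -1, +j or -j) iff e is a multiple of N/4.
    q = N // 4
    return sum((a * m + b * n) % q == 0
               for m in range(q) for n in range(q))

for N in (8, 16, 32, 64):
    for a, b in FAMILIES:
        assert trivial_count(N, a, b) == N // 4
    # Hence Mextra = 12 * ((N/4)^2 - N/4), Equation (E-5).
```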
The number of complex multiplications needed for the 2-D vector split-radix FFT to
perform N*N-point complex DFTs is listed in Table-E-1.
Table-E-1: The number of complex multiplications required for the 2-D vector
split-radix FFT to perform N x N-point complex DFTs.

    N        Mn-2        Mn-1       Mextra          Mn
    8           0           0           24          24
    16          0          24          144         168
    32         24         168          672        1128
    64        168        1128         2880        6024
    128      1128        6024        11904       31464
    256      6024       31464        48384      152136
    512     31464      152136       195072      724776
    1024   152136      724776       783360     3333768
    2048   724776     3333768      3139584    15170664
    4096  3333768    15170664     12570624    67746504