the implementation of multidimensional discrete transforms for digital signal processing
Recommended Citation: Wu, Hong Ren, The implementation of multidimensional discrete transforms for digital signal processing, Doctor of Philosophy thesis, Department of Electrical and Computer Engineering, University of Wollongong, 1990. http://ro.uow.edu.au/theses/1353
THE IMPLEMENTATION OF MULTIDIMENSIONAL DISCRETE
TRANSFORMS
FOR
DIGITAL SIGNAL PROCESSING
A thesis submitted in fulfilment of the requirements for the award of the degree
DOCTOR OF PHILOSOPHY
from
THE UNIVERSITY OF WOLLONGONG
by
WU, HONG REN, B.E., M.E.
THE DEPARTMENT OF ELECTRICAL
AND COMPUTER ENGINEERING.
FEBRUARY 1990.
"Entertaining someone with fish, you could only serve him once, but if you teach him the art of fishing, it will serve him for a lifetime."
—An ancient Chinese wise man and philosopher.
CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF ACRONYMS AND SYMBOLS

CHAPTER ONE: INTRODUCTION
1-1 Introduction to Multidimensional Digital Signal Processing
1-2 Applications
1-3 History and New Achievements in Fast Signal Processing Algorithms
1-4 Objectives
1-5 Thesis Review and Contributions
1-6 Publications, Submitted Papers and Internal Technical Reports

PART I. MULTIDIMENSIONAL DISCRETE FOURIER TRANSFORMS

CHAPTER TWO: 1-D DISCRETE FOURIER TRANSFORM AND FAST FOURIER TRANSFORM ALGORITHMS
2-1 Definitions
2-2 Matrix Representations for 1-D Cooley-Tukey FFT Algorithms
2-3 Computational Considerations
2-4 Summary

CHAPTER THREE: 2-D DFT AND 2-D FFT ALGORITHMS
3-1 Introduction to 2-D Discrete Fourier Transforms
3-2 Definitions
3-3 Row-Column FFT Algorithms
3-4 Vector Radix FFT Algorithms
3-5 Matrix Representations for 2-D Vector Radix FFT Algorithms
3-6 Structure Theorems
3-7 Structural Approach via Logic Diagrams
3-8 2-D Vector Split-Radix FFT Algorithms
3-9 Comparisons of Various 2-D Vector Radix FFT Algorithms
3-10 Vector Radix FFT Using FDP™ A41102
3-11 Summary

CHAPTER FOUR: A PERSPECTIVE ON VECTOR RADIX FFT ALGORITHMS OF HIGHER DIMENSIONS
4-1 Definitions
4-2 Matrix Representations and Structure Theorems
4-3 Diagrammatical Presentations
4-4 Computing Power Limitations

PART II. MULTIDIMENSIONAL DISCRETE COSINE TRANSFORMS

CHAPTER FIVE: INTRODUCTION TO MULTIDIMENSIONAL DISCRETE COSINE TRANSFORMS
5-1 Definitions of 1-D DCT and Its Inverse DCT
5-2 Definitions of 2-D DCT and Its Inverse DCT
5-3 Applications of 2-D DCTs in Image Compression
5-4 2-D Indirect Fast DCT Algorithms

CHAPTER SIX: 2-D DIRECT FAST DCT ALGORITHMS
6-1 2-D Direct Fast DCT Algorithm Based on Lee's Method
6-1-1 1-D Lee's algorithm in matrix form
6-1-2 Derivation of 2-D fast DCT algorithm from Lee's algorithm
6-2 2-D Direct Fast DCT Algorithm Based on Hou's Method
6-2-1 1-D Hou's algorithm in matrix form
6-2-2 Derivation of 2-D fast DCT algorithm from Hou's algorithm
6-3 Comparison of Arithmetic Complexity of Various DCT Algorithms
6-4 Comparison of Computation Structures of 2-D Direct VR DCTs and VR FFTs
6-5 Summary

CHAPTER SEVEN: HARDWARE IMPLEMENTATION OF 2-D DCTS FOR REAL-TIME IMAGE CODING SYSTEMS
7-1 Description of Hardware Implementation of Modified 2-D Makhoul DCT Algorithm Using FDP™ A41102
7-2 Discussion of 2-D DCT Image Coding Systems Using VLSI Digital Signal Processors

CHAPTER EIGHT: THE EFFECTS OF FINITE-WORD-LENGTH COMPUTATION FOR FAST DCT ALGORITHMS
8-1 Introduction
8-2 Simulation Design
8-2-1 Structure of the simulation program
8-2-2 Error model for the basic computation structure
8-2-3 DCT in infinite-word-length
8-2-4 Data collection
8-3 Simulation Results
8-3-1 Floating-point computation of 1-D DCTs
8-3-2 Floating-point computation of 2-D DCTs
8-4 Summary

CHAPTER NINE: CONCLUSIONS
9-1 Conclusions
9-2 Suggestions for Future Research

BIBLIOGRAPHY

APPENDIX A: PRELIMINARY BACKGROUND ON THE TENSOR (KRONECKER) PRODUCT AND THE LOGIC DIAGRAM
APPENDIX B: PROOF OF STRUCTURE THEOREMS
APPENDIX C: THE COMBINED FACTOR METHOD
APPENDIX D: DERIVATION OF VECTOR RADIX 2-D FAST DCT BASED ON LEE'S ALGORITHM
APPENDIX E: ARITHMETIC COMPLEXITY OF THE VECTOR SPLIT-RADIX DIF FFT ALGORITHM
ACKNOWLEDGEMENTS
The author wishes to express his deepest appreciation to his Supervisor, Dr. F.J.
Paoloni, Associate Professor of the Department of Electrical and Computer Engineering,
The University of Wollongong, for his guidance, support and encouragement and also for
his understanding and confidence in the author throughout this research. His professional
and optimistic attitude towards the research have made this research challenging,
interesting, productive and enjoyable.
The author wishes to thank Professor Huang, Ruji, of the Department of Industrial
Automation, University of Science and Technology, Beijing (formerly Beijing University
of Iron and Steel Technology), who, as his Master's Supervisor, had a great influence on
shaping the author's research skills and abilities as an independent as well as cooperative
researcher.
Sincere thanks are also extended to Professor B.H. Smith who introduced the
author to this Institution and made this study possible in the first place.
The author wishes to thank the following people for their generous help, patience
and useful discussions at various stages of this program: Dr. G.W. Trott and Dr. T.S.
Ng, Department of Electrical and Computer Engineering; Mr. I.C. Piper, Computer
Services; Mr. J.K. Giblin, formerly with Computer Services and now with Network
Technical Services, B.H.P. Steel International Group; Mr. G. Andersson, Computer
Services; Dr. N. Smyth and Dr. K.G. Russell, Department of Mathematics, The
University of Wollongong; Professor J.H. McClellan, School of Electrical Engineering,
Georgia Institute of Technology, formerly with Schlumberger Well Services; Dr. J.D.
O'Sullivan, Dr. D.J. McLean, Dr. C.E. Jacka and Mr. K.T. Hwa, Division of Radio
Physics, CSIRO in Epping, New South Wales; Dr. M.J. Biggar and Dr. W.B.S. Tan,
Telecom Research Laboratories (Australia); Professor K.R. Rao, Department of Electrical
Engineering, The University of Texas at Arlington; Professor M. Vetterli, Department of
Electrical Engineering, Columbia University; Mr. P. Single, Austek Microsystems
(Australia); Dr. M.A. Magdy, Mr. J.F. Chicharo and Mrs. C. Quinn, Department of
vii
Electrical and Computer Engineering, The University of Wollongong; Mr. P.J. Costigan
and all technical staff in the Department.
The author is deeply grateful to his friend and English teacher Mrs. B. S. Perry for
her generous help and professional assistance in the author's understanding of English
and Australian culture, and her and her husband's, Mr. E.J.W. Perry, understanding,
friendship and encouragement which made the author's stay in Wollongong worthwhile,
most pleasant and enjoyable.
The assistance from Miss M.J. Fryer, of the Department of Electrical and
Computer Engineering, The University of Wollongong, throughout this research and
particularly in reading the final manuscript of this thesis is warmly appreciated.
Financial support received from the Department of Electrical and Computer
Engineering and the Committee of Post-Graduate Study, The University of Wollongong,
by means of the Departmental Teaching Fellowship and the Post-Graduate Research
Scholarship respectively, which made this research possible, is sincerely acknowledged.
Financial support from the Australian Telecommunication and Electronics Research Board
and from Telecom Research Laboratories (Australia) through R&D contract No. 7066 is
also acknowledged.
Finally, the author wishes to express his deepest gratitude to Mei Mei, his best
friend, colleague and wife, without whose patience, understanding, appreciation and
continuous support, encouragement and inspiration, this work would not have been
accomplished. The continuous support and understanding from his parents, from whom
he has been separated for the cause is also greatly appreciated.
ABSTRACT
A structural approach to the construction of multidimensional vector radix fast
Discrete Fourier Transforms (DFTs) and fast direct Discrete Cosine Transforms (DCTs)
is presented in this thesis. The approach features the use of matrix representation of one-
dimensional (1-D) and two-dimensional (2-D) FFT and fast DCT algorithms along with
the tensor product, and the use of logic diagram and rules for modifications.
In the first part of the thesis, the structural approach is applied to construct 2-D
Decimation-In-Time (DIT), Decimation-In-Frequency (DIF) and mixed (DIT & DIF)
vector radix FFT algorithms from corresponding 1-D FFT algorithms by the Cooley-
Tukey method. The results are summarized in theorems as well as examples using logic
diagrams. It has been shown that the logic diagram (or signal flow graph), as well as
being a form of representation and interpretation of fast algorithm equations, is a
stand-alone engineering tool for the construction of fast algorithms. The concept of "vector
signal processing" is adapted into the logic diagram representation which reveals the
structural features of multidimensional vector radix FFTs and explains the relationships
and differences between the row-column FFT, the vector radix FFT reported previously
and the approach presented in this thesis. The introduction of the structural approach
makes the formulation of a multidimensional vector radix FFT algorithm of high radix and
dimension easy to evaluate and implement by both software and hardware.
The hardware implementation of 2-D DFTs is discussed in the light of vector radix
FFTs using the Frequency Domain Processor (FDP™) A41102, which has shown an
improvement in reducing system complexity over the traditional row-column method.
With the help of the structural approach, the vector split-radix DIF FFT algorithm,
mixed (DIT & DIF) vector radix FFT and Combined Factor (CF) vector radix FFT
algorithms are presented whereby a comparison study is made in terms of arithmetic
complexity. The approach is then generalized to vector radix FFTs of higher dimensions.
Two vector radix DCT algorithms are presented in the second part of the thesis.
Although the one based on Lee's approach was reported by Haque using a direct matrix
derivation method, it is derived independently by the author using the structural approach.
The other vector radix DCT algorithm is based on Hou's method. The arithmetic
complexities of these two algorithms are considered as well as various other known row-
column DCT algorithms. The computation structures of 2-D vector radix direct fast DCT
algorithms are discussed in comparison with those of 2-D vector radix FFT algorithms.
A correction to the system description of Hou's DIT fast DCT algorithm is presented as
a result of the analysis of the algorithm's computation structure.
The system design of the 2-D modified Makhoul algorithm using the FDP A41102
provides yet another solution to the real-time 2-D DCT image coding problem. The
effects of finite-word-length computation of DCT using various direct fast algorithms are
studied by computer simulation for the purpose of transform coding of images. The
results are also presented in the thesis.
LIST OF ACRONYMS AND SYMBOLS

ASSP: Acoustics, Speech, and Signal Processing
AUSTEK: Austek Microsystems Proprietary Inc. and Austek Microsystems Proprietary Ltd.
BF: ButterFly computational structure of fast transform algorithms
CCITT: International Telegraph and Telephone Consultative Committee
CF: Combined Factor method
CSIRO: the Commonwealth Scientific and Industrial Research Organization
DCT: Discrete Cosine Transform
DCTd: the DCT output sequence with double-precision (64-bit floating-point) computation
DCTf: the DCT output sequence with finite-word-length (32-bit floating-point or fixed-point) computation
DFT: Discrete Fourier Transform
DIF: Decimation-In-Frequency
DIT: Decimation-In-Time
DSP: Digital Signal Processor, or Digital Signal Processing
FDP: Frequency Domain Processor
FFT: Fast discrete Fourier Transform algorithm(s)
FIR: Finite-extent Impulse Response
HDTV: High Definition Television
IIR: Infinite-extent Impulse Response
ISDN: Integrated Services Digital Networks
inmos: a part of the SGS THOMSON Microelectronics Group
m-D: multi-Dimensional
M/A: Multiplier/Accumulator, or Multiply/Accumulate
MIT: Massachusetts Institute of Technology
Ms/s: Million samples per second
NMR: Nuclear Magnetic Resonance
RMFFT: Reduced Multiplications Fast discrete Fourier Transform algorithm(s)
SGS THOMSON: SGS THOMSON Microelectronics Group
SNR: Signal to Noise Ratio
TM: Twiddling Multiplications of fast transform algorithms
TRW: TRW LSI Products Inc.
VLSI: Very Large Scale Integrated circuits
VR: Vector Radix
VSP: Vector Signal Processor
VSR: Vector Split-Radix
WFTA: Winograd Fourier Transform Algorithm(s)
Zoran: Zoran Corporation

α, β, ...: small Greek letters are used for transform coefficients throughout the thesis
B: matrix of the butterfly computation structure outlining the Cooley-Tukey FFT
B̂: butterfly matrix of Lee's fast DCT algorithm
B̃: butterfly matrix of the vector radix fast DCT algorithm based on Lee's method
C_2N^(2n+1)k: cos[(2n+1)kπ/(2N)]
C(k1,k2): 2-D DCT sequence in the 2-D indirect fast DCT algorithm
C: 1-D DCT matrix
C^-1: inverse 1-D DCT matrix
Ĉ: denormalized 1-D DCT matrix
Ĉ^T: transpose of the denormalized 1-D DCT matrix
e_n, e_c, e_o: roundoff errors
F: matrix for the 1-D radix-r twiddling multiplications of a length-N DIT FFT
F̃: matrix for the 2-D vector radix-r1*r2 twiddling multiplication structure of a length-N1*N2 VR DIT FFT
matrix for the 1-D radix-r twiddling multiplications of a length-N DIF FFT
matrix for the 2-D vector radix-r1*r2 twiddling multiplication structure of a length-N1*N2 VR DIF FFT
matrix for the 1-D radix-r butterfly structure of a length-N DIT FFT
matrix for the 2-D vector radix-r1*r2 butterfly structure of a length-N1*N2 VR DIT FFT
matrix for the 1-D radix-r butterfly structure of a length-N DIF FFT
matrix for the 2-D vector radix-r1*r2 butterfly structure of a length-N1*N2 VR DIF FFT
j: imaginary unit
N: length of the transform
diag.[·, ·, ..., ·]
multiplication matrix of Lee's fast DCT algorithm
multiplication matrix of the vector radix fast DCT algorithm based on Lee's method
number of addition operations required by the transform •
number of multiplication operations required by the transform •
pre- or post-calculation matrix of Lee's fast DCT algorithm
pre- or post-calculation matrix of the vector radix fast DCT algorithm based on Lee's method
diag.[1/2, 1, ..., 1]
sample variance
recursive denormalized DCT matrix used in Hou's fast DCT algorithm
diag.[1/2, 1/2, ..., 1/2]
T: matrix of twiddle multiplications outlining the Cooley-Tukey FFT
X_: vector of a multi-dimensional transform sequence
x_: vector of a multi-dimensional data sequence
X(k): transform sequence
x(m): data sequence
X: vector of the 1-D transform sequence, [X(0), X(1), ..., X(N-1)]
x: vector of the 1-D data sequence, [x(0), x(1), ..., x(N-1)]
X̃(k) and X̂(k): denormalized 1-D DCT sequences
X̃ and X̂: vectors of the denormalized 1-D DCT sequence
X̃_ and X̂_: vectors of the denormalized 2-D DCT sequence
X̄_i: sample mean of a random variable
v(n1,n2): rearranged 2-D sequence for the indirect DCT
V(k1,k2): 2-D discrete Fourier transform of v(n1,n2)
W_N^km: exp(-2πjkm/N)
W_N: 1-D discrete Fourier transform matrix
W_N^-1: 1-D inverse discrete Fourier transform matrix
⊗: tensor (or Kronecker) product
*: multiply
+: add
CHAPTER ONE: INTRODUCTION
1-1 Introduction to Multidimensional Digital Signal Processing
Only after the advent of the modern electronic computer has multidimensional (m-D)
signal processing become a reality. It has attracted more and more research interest as
integrated circuits have become faster, cheaper and more compact [1]. It covers a large
research area including image processing, computer-aided tomography, image
compression and image coding, multidimensional Finite-extent Impulse Response (FIR)
filtering, multidimensional Infinite-extent Impulse Response (IIR) filtering, beamforming,
multidimensional spectrum analysis and estimation, radar detection, seismic signal
processing, biomedical signal processing, etc. While multidimensional signal processing,
as its name implies, deals with all signals whose dimensionality is equal to or greater
than two, at present two-dimensional and three-dimensional problems are of practical
concern [2].
Although multidimensional signal processing is an extension of one-dimensional
signal processing, it brings its own problems associated with the huge amount of
data involved, which makes implementation a difficult issue. More complicated
mathematics is required, which can be more arduous to comprehend. It also offers a
greater degree of freedom, providing a variety of solutions to a single problem.
These difficulties make multidimensional signal processing a very complicated task, and
also motivate research in mathematics, algorithms and implementation. Practical solutions
to these problems are based on the development of modern technology (particularly
computer technology) and raise the future requirements on the technology front.
On the whole, as in one dimensional signal processing, there are two basic
approaches to multidimensional signal processing problems. One is the spatial (or
original) domain approach, and the other is the frequency (or transform) domain approach
[1, 3-8]. They are two mathematical representations of the same natural world.
Although they are equally powerful, one can be more appealing than the other in certain
applications. This thesis focuses on the transform domain approach, the implementation
of multidimensional Discrete Fourier Transforms (DFTs) and Discrete Cosine Transforms
(DCTs) in particular.
1-2 Applications
The Fourier transform theory has played an important role in multidimensional
signal processing [1, 7, 8] and will continue to be a topic of interest in theoretical, as well
as applied, work in this field [9-12]. Mathematical fundamentals of multidimensional
Fourier transforms have been thoroughly examined, and many fast algorithms have been
proposed. The introduction of vector processors [61], VLSI vector signal processors
[13], VLSI FFT processors [14, 15, 142], systolic array processors [140, 141, 145,
147], Single Instruction Multiple Data (SIMD) [130] and Very Long Instruction Word
(VLIW) supercomputers [129] makes the implementation of multidimensional Fourier
transforms more practical than ever before.
The multidimensional Fourier transform finds its application in the 2-D context,
such as image enhancement (smoothing, edge detecting), image restoration, image
compression and encoding, image description [4, 8], radar detection [134, 137], 2-D FIR
filter implementation and design [1] and invariant object recognition [10]. The 3-D
Fourier transform is required in nuclear magnetic resonance imaging algorithms [9], 3-D
tomo-synthesis [146] and in the construction of 3-D microscopic-scale objects to remove
out-of-focus noise [16]. Multidimensional Fourier transforms used for simultaneous
time-spatial or spatial-frequency representation in computer vision and pattern analysis provide
better tools for pattern analysis and a better understanding of dynamic patterns in the
visual system [2, 10, 11, 12].
Almost a decade after the introduction of the Cooley-Tukey Fast Fourier Transform
(FFT) algorithm, the Discrete Cosine Transform was first introduced into digital signal
processing for the purposes of pattern recognition and Wiener filtering in 1974 [17].
It soon led to a vast range of engineering applications. In the multidimensional
context, the two-dimensional (2-D) DCT is used for image compression and transform
coding of images [3, 4] in telecommunications such as video-conferencing, video
telephony, video image compression for High Definition Television (HDTV), block
structure/distortion in image coding, activity classification in transform coding, surface
texture analysis, tomographic classification, photovideotex, pattern recognition,
progressive image transmission, printed image coding and applications in fast packet
switching networks [18, 19, 157]. The DCTs can be implemented by fast algorithms
with either software or hardware, and render almost optimal performance that is virtually
indistinguishable from that of the Karhunen-Loeve Transform [3, 17] in terms of energy
packing ability and decorrelation efficiency. Various VLSI DCT processors have also
been reported and demonstrated recently for video coding applications [74, 75, 87-89,
111, 143].
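The energy-packing property mentioned above is easy to observe numerically. The sketch below is an illustration in NumPy, not code from the thesis: it builds the orthonormal 1-D DCT-II matrix, applies it to a smooth 8*8 block in the separable row-column form C·x·C^T, and measures how much of the block's energy lands in the lowest-order coefficients.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal 1-D DCT-II matrix C, so that X = C @ x and C @ C.T = I."""
    k = np.arange(n).reshape(-1, 1)   # frequency index
    m = np.arange(n).reshape(1, -1)   # sample index
    c = np.cos((2 * m + 1) * k * np.pi / (2 * n))
    c[0, :] /= np.sqrt(2)             # DC-row normalization
    return c * np.sqrt(2.0 / n)

# A smooth 8x8 image block: a constant level plus a gentle horizontal ramp.
block = 100.0 + np.tile(np.arange(8.0), (8, 1))

# 2-D DCT via the separable row-column form.
C = dct_matrix(8)
coeffs = C @ block @ C.T

# Energy packing: for smooth blocks almost all energy lands in a few
# low-order coefficients, which is what makes DCT transform coding work.
energy = coeffs ** 2
packing = energy[:2, :2].sum() / energy.sum()
print(packing)   # very close to 1.0 for a smooth block
```

For highly textured blocks the ratio drops, which is consistent with transform coders spending more bits on such blocks.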
1-3 History and New Achievements of Fast Signal Processing
Algorithms
Reviewing the history of a research and study area provides a perspective which
generally benefits future research and study. A review of the study of FFT algorithms in
digital signal processing has particular significance.
Great engineering power has its deep roots in mathematics. Applications of
research achievements rely on the development of relevant technology. The development
of technology motivates further research in conveying more mathematical wonders into
application. The gap has to be bridged by a proper approach and a form of representation
which are attractive to the engineering society.
The history of FFT algorithms did not begin with Good, or Thomas, or Danielson, or
Lanczos, or even Runge [20]. It can be traced back to the great German mathematician
Carl Friedrich Gauss (1777-1855) [21]. But it has only become an important engineering
concern since the advent of the modern electronic digital computer, through the
fundamental work laid by Cooley and Tukey [22] and by those who have helped to give this
mathematical curiosity an engineering interpretation and eventually to convert it into an
engineering power [5, 23, 24]. It has been said that the rediscovery of the FFT algorithm
was one of the saviours of the predecessor of the IEEE Acoustics, Speech, and Signal
Processing Society [23] and marked the beginning of modern digital signal processing [6,
31, 135]. The Cooley-Tukey FFT algorithm, in addition to being widely used because it
came first, owes much to its simple structure, a structure which is appealing to the
engineering community. The representation of the FFT by the so-called butterfly signal flow
graph [24, 25] fits nicely into the newly released VLSI FFT processor, the Frequency
Domain Processor (FDP™) A41102 [14, 15, 26-29]. Some problems can be timeless
and solutions to them can be discovered, and rediscovered, again and again.
Representation also is of vital importance for each step of the conversion from research
achievement to engineering application. One of the tasks required of scientific researchers
is to demystify and clarify, not mystify.
FFT algorithms also have their roots in Abelian semi-simple algebras, by which the
mathematical structure of the FFT is revealed. These algebras provide an explanation
of how various FFT algorithms are devised. Many attempts have been made to convey
this mathematical result to the engineering community [29-34]. When the mathematical
structure of a process is well understood, many fast algorithms for it can be constructed
systematically. Starting from their 1-D counterparts, the tensor (or Kronecker) product
has been used successfully to generate multidimensional Winograd Fourier transform
algorithms [35] and prime factor FFT algorithms [32].
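To make the role of the tensor product concrete, the small sketch below (a NumPy illustration, not code from the thesis) checks numerically that the Kronecker product of two 1-D DFT matrices is exactly the 2-D DFT operator acting on row-major-flattened data, which is the identity underlying such constructions.

```python
import numpy as np

def dft_matrix(n):
    """1-D DFT matrix W_N with entries exp(-2*pi*j*k*m/N)."""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n)

N1, N2 = 4, 8
x = np.random.default_rng(0).standard_normal((N1, N2))

# 2-D DFT computed the usual separable way (transform rows, then columns).
X_rc = dft_matrix(N1) @ x @ dft_matrix(N2).T

# The same transform as one matrix: the tensor (Kronecker) product of the
# two 1-D DFT matrices applied to the row-major-flattened data.
W_2d = np.kron(dft_matrix(N1), dft_matrix(N2))
X_kron = (W_2d @ x.reshape(-1)).reshape(N1, N2)

print(np.allclose(X_rc, X_kron))   # True
```

Factoring each 1-D matrix into its fast (butterfly) form inside the Kronecker product is what yields structured m-D fast algorithms rather than the dense matrix used here for checking.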
In [31], a matrix form is introduced to represent the vector radix-2*2 and -4*4
Decimation-In-Time (DIT) FFT algorithms. However, the tensor product is used as a
form of representation for the 2-D VR-4*4 FFT algorithm rather than as a tool for the
construction of the VR FFT algorithm from its 1-D counterpart [31]. Many
multidimensional fast algorithms are constructed using a direct derivation method [36-38].
It is worth noting that in most of the published literature, multidimensional algorithms are
still described in a 1-D diagrammatical representation form by the traditional butterfly
signal flow graph, which can become over-complicated in the multidimensional case. When
problems are extended, new representation forms have to be found to dispel the
mystery behind mathematical structures, which are sometimes quite complicated (or
abstract).
In the history of fast signal processing algorithms, the basic issues, which are
associated with evaluating the effectiveness of an algorithm from the outset, have been:
(1) reduction of arithmetic complexity;
(2) reduction of round-off errors and errors due to the quantization of the
coefficients;
(3) in-place computation; and,
(4) possession of a regular computation structure.
Three of the above four points (point 2 excluded) are associated with processing
speed, which is a major engineering concern. An algorithm which does not possess
in-place computation or a regular computational structure will require more bookkeeping
and indexing operations, which affect the processing speed.
In the early years, multiplications were more time-consuming than additions and
other types of operations (data transfer, for instance) on general purpose computers.
Reducing the number of multiplications became the focal point of the evaluation of fast
algorithms. As a result, a group of FFT algorithms, called the reduced multiplications
FFT (RMFFT) algorithms, was introduced [39], including the prime factor algorithm,
Winograd Fourier Transform Algorithm (WFTA) and polynomial transforms. Many of
these were obtained at the expense of more additions and loss of regular computation
structure. However, the introduction of Digital Signal Processors (DSPs) and the
development of VLSI technology, Application Specific Integrated Circuits (ASIC)
technology in particular, have changed this tradition dramatically; now an addition (or
even a load of data) takes about the same time to complete as a multiplication on some
processors [40]. The issue is no longer just the reduction of the number of multiplications
but of the total number of operations. Fast algorithms which do not possess in-place
computation or do not have a regular structure will be at a disadvantage, as they have to
pay a severe cost in loading, storing and copying data and in other indexing tasks [39, 41]. In
systolic array implementations of DFT and FFTs, emphasis has been on modularity,
pipelining and parallelism, and simple, regular and local communication
structures [140, 141, 145, 147] (apart from the area*time^2 criterion commonly used for
VLSI designs).
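The shift of emphasis described above, from multiplication counts alone to the total operation count, can be illustrated with the textbook figures for a naive length-N DFT (N^2 complex multiplications, N(N-1) additions) and a radix-2 Cooley-Tukey FFT ((N/2)log2 N twiddle multiplications and N log2 N additions, counting trivial twiddles). A rough sketch, for illustration only:

```python
import math

def dft_ops(n):
    """Naive length-n DFT: n^2 complex multiplications, n*(n-1) additions."""
    return n * n, n * (n - 1)

def radix2_fft_ops(n):
    """Radix-2 Cooley-Tukey FFT (n a power of two): (n/2)*log2(n) twiddle
    multiplications and n*log2(n) additions, trivial twiddles included."""
    stages = int(math.log2(n))
    return (n // 2) * stages, n * stages

for n in (64, 1024, 4096):
    dm, da = dft_ops(n)
    fm, fa = radix2_fft_ops(n)
    print(f"N={n:5d}  DFT total ops: {dm + da:9d}   FFT total ops: {fm + fa:7d}")
```

On a processor where a multiplication and an addition cost the same, the totals (not the multiplication counts alone) are what predict the running time, which is the point made above.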
An algorithm is only fast when the hardware can take advantage of it [31]. A
theoretically fast algorithm may be even less effective than a "slow" algorithm on certain
processors. Some features of many fast algorithms, such as parallelism and pipelining
structure, still remain to be fully exploited [134, 137]. These algorithms will be many
times faster than they are now only when computer technology resolves the problems
which are associated with them. For example, VLSI implementation of FFT algorithms
is not limited to radix-2 or radix-4 butterflies. Full-length (up to 256 complex point)
CMOS and HMOS FFT processors (FDPs) have been reported, demonstrated [14, 15]
and are now commercially available, as mentioned previously. An FFT processor that
computes a 4096-point complex DFT in 102.4 μs with 22-bit floating-point arithmetic was
also reported in [142]. When the computation structure of the Cooley-Tukey algorithm
described by butterflies is made into a VLSI pipelining architecture, a 256 complex point
DFT can be achieved in about 102.4 μs (200 μs for the HMOS chip) on FDPs. Another
feature of the FDP A41102 is that an 8*8- or 16*16-point 2-D DFT can be accomplished
in one pass, although the row-column approach is used. This means a reduction in the
time for sweeping data. This places many digital signal processing applications using the
FFT into the real-time or pseudo-real-time processing category.
1-4 Objectives
As explained in the previous section, because of the nature of multidimensional
Digital Signal Processing (DSP), there are various multidimensional DSP algorithms from
which to choose, and the structures of these algorithms appear to be more complex than
those of their 1-D counterparts. Without an appropriate method, the construction, the
evaluation of the performance and the implementation of m-D algorithms would be a very
difficult task indeed. This thesis attempts to seek a structural approach to m-D fast DSP
algorithms to make the task simpler. Instead of deriving m-D fast algorithms from a
defined m-D DSP problem directly, and evaluating and implementing algorithms according to
equations so derived, the approach suggests that the construction, evaluation and
implementation of m-D fast algorithms be based on our knowledge of, and experience
with, the corresponding 1-D algorithms, if possible. For example, 1-D FFT algorithms
based on the Cooley-Tukey method are extensively studied and well documented.
Computer programs of 1-D FFTs can be found in the published literature and in computer
software mathematics libraries. Many DSP manufacturers provide their version of FFT
programs. The VLSI integration of 1-D FFT algorithms has also broadened our
knowledge. All the above knowledge can be made useful for the development of m-D
FFT algorithms. The simplest case would be the row-column approach. The row-column
m-D FFT algorithm, for instance, is obtained by repeatedly applying the 1-D FFT
algorithm along each dimension. All the knowledge and experience of 1-D FFTs,
including the programs and hardware, are thus directly made use of in this m-D
method, which is constructed and built on the 1-D FFT by deriving the relation
between the m-D DFT and the DFT along each of its dimensions. In this case, it is
not necessary to worry about the structure of 1-D FFTs, nor how they are constructed.
One simply makes use of what is available at the 1-D level. However, when the number
of dimensions of DFTs increases, the computational saving of m-D fast FFTs, of which
the vector radix FFT is one, over the row-column FFT will become substantial in terms of
the number of multiplications or the total number of numerical operations, as will be
shown in Chapter Four of this thesis [44, 45]. When the m-D vector radix FFT is to be
constructed, the structure of 1-D FFT algorithms and the structural relationship between
the m-D FFT and 1-D FFTs have to be studied and understood in order to generate
systematically the required m-D FFT from the knowledge (algorithm, software and
hardware) possessed of corresponding 1-D FFTs. The above described approach is
hereby called a structural approach. This kind of approach will not only help the
construction, software and hardware implementation of m-D algorithms, but also assist
the study of VLSI integration of m-D algorithms, which possess a greater degree of
complexity than the 1-D case. Its function as a tool for software development of m-D
discrete transform algorithms has been successfully demonstrated during this research.
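The row-column idea described above amounts to reusing a stock 1-D FFT along each dimension of the array. A minimal reconstruction in NumPy, for illustration only (the recursive radix-2 routine below stands in for whatever 1-D FFT program or hardware is available):

```python
import numpy as np

def fft1d(x):
    """Recursive radix-2 DIT Cooley-Tukey FFT for power-of-two lengths."""
    n = len(x)
    if n == 1:
        return x.copy()
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2]) * np.exp(-2j * np.pi * np.arange(n // 2) / n)
    return np.concatenate([even + odd, even - odd])

def fft2d_row_column(x):
    """2-D DFT by applying the 1-D FFT to every row, then to every column."""
    rows = np.array([fft1d(r) for r in x])
    return np.array([fft1d(c) for c in rows.T]).T

x = np.random.default_rng(1).standard_normal((8, 8)).astype(complex)
print(np.allclose(fft2d_row_column(x), np.fft.fft2(x)))   # True
```

A vector radix algorithm, by contrast, cannot be built this way from a black-box 1-D routine; it needs the internal butterfly structure of the 1-D FFT, which is exactly why the structural approach studies that structure.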
This thesis is mainly concerned with the implementation of the multidimensional
vector radix FFT algorithms [42-44], based on the Cooley-Tukey method [22], and that
of 2-D direct vector radix fast DCT algorithms. The mathematical structures of these
algorithms are to be examined and a graphical representation (logic diagram) is to be
introduced to accommodate the concept of vector signal processing in graphical form.
Various issues associated with the software and hardware implementations of m-D DFTs
and DCTs are also to be investigated using some state-of-the-art digital signal processors.
It will be shown that the algorithms under study are highly structured and have a
close link to their 1-D counterparts. They provide more efficient processing in terms of
computational complexity and will be fast if their parallel and pipeline structures can be
fully exploited.
1-5 Thesis Review and Contributions
A structural approach to the construction of multidimensional vector radix fast
Discrete Fourier Transforms (DFTs) and fast direct Discrete Cosine Transforms (DCTs)
is presented in this thesis. Rigorous mathematical derivation is presented by representing
the 1-D and 2-D FFT and fast DCT algorithms in matrix form together with the tensor
product. However, the same algorithms are also derived more simply by examination of the
structure of logic diagrams with given rules for modification. The structural approach is
applied to construct 2-D Decimation-In-Time (DIT), Decimation-In-Frequency (DIF) and
mixed (DIT & DIF) vector radix FFT algorithms from corresponding 1-D FFT algorithms
using the Cooley-Tukey approach. The whole procedure is summarized in theorems.
The results are then generalized to vector radix FFTs of higher dimensions and vector
radix DCT algorithms. It has been shown that the logic diagram (or signal flow graph) is,
in addition to being a form of representation and interpretation of fast algorithm equations,
a stand-alone engineering tool for the construction of fast algorithms. The concept of
"vector processing" is adapted into the logic diagram representation. This reveals the
structural features of multidimensional vector radix FFTs and explains the relationships
and differences between the row-column FFT, the vector radix FFT in [43, 44] and the
approach presented in this thesis. Introduction of the structural approach makes the
multidimensional vector radix FFT algorithms of high radix and high dimension easy to
evaluate and implement in both software and hardware.
The hardware implementation of 2-D DFT is discussed in the light of vector radix
FFTs using the Frequency Domain Processor (FDP™) A41102, which has shown
improvement in reducing the system complexity over the traditional row-column method.
With the help of the structural approach, the vector split-radix DIF FFT algorithm,
mixed (DIT & DIF) vector radix FFT and Combined Factor (CF) vector radix FFT
algorithms are presented whereby a comparison study is made in terms of arithmetic
complexity.
Two vector radix DCT algorithms are presented in the second part of the thesis.
Although the one based on Lee's approach was reported by Haque using a direct matrix
derivation method, it is here derived independently, using the structural approach. The
other vector radix DCT algorithm is based on Hou's method.
The system design of the 2-D modified Makhoul algorithm using the FDP A41102
provides yet another solution to the real-time 2-D image coding problem. The effects of
finite-word-length computation of DCT using various direct fast algorithms are studied
by computer simulation and results are presented.
Chapter One presents an introduction to multidimensional digital signal
processing, with emphasis given to the transform method, multidimensional discrete
Fourier transforms and discrete cosine transforms in particular. The development and
new achievements of fast digital signal processing algorithms are reviewed, providing
insight into the research area.
Two basic representations, namely the matrix form and the logic diagram, for 1-D
DFT and FFT algorithms are presented in Chapter Two, which lays the foundation for the
presentation of the structural approach to the construction of multidimensional vector
radix FFT algorithms. It has also been shown that the logic diagram is a form of
representation for FFT algorithms and a tool to derive or construct FFT algorithms as
well.
Chapter Three forms one of the major chapters of the thesis. After the introduction
of general matrix representations for the first stage 2-D DIT, DIF and mixed decimation
vector radix algorithms, structure theorems are presented along with diagrammatical
representation, which bear the essential message for the structural approach towards the
construction of various vector radix FFT algorithms. The applications of theorems and
the logic diagram are demonstrated by various examples, including the 2-D vector split-
radix DIF FFT algorithm. As well, comparative studies of vector radix FFTs and
hardware implementation of vector radix FFTs using the FDP A41102 are presented in
this chapter.
The structural approach is extended to multidimensional vector radix FFT
algorithms of higher dimension in Chapter Four. A recursive symbol system, which
makes the derivation of multidimensional vector radix FFTs from 1-D FFTs a systematic,
straightforward and error-free procedure, is presented for the logic diagram
representation of vector radix FFTs.
The second part of this thesis consists of study results on the fast computation of 2-
D discrete cosine transforms, its application to the transform coding of real-time images,
and error analysis of various direct fast DCT algorithms for image coding purposes using
floating-point computation.
A brief introduction to multidimensional DCTs is presented in Chapter Five. Two
vector radix direct fast DCT algorithms are constructed using the structural approach and
presented in Chapter Six. The arithmetic complexity of various direct fast DCT
algorithms is also discussed in this chapter. In Chapter Seven, hardware implementations
of 2-D DCTs for real-time image coding are discussed using dedicated VLSI DCT
processors, digital signal processors, fast multiplier/accumulators and the newly released
FDP A41102. The effects of finite-word-length computation for fast DCT algorithms are
studied using floating-point arithmetic in comparison with the direct matrix
multiplication method, and simulation results are presented in Chapter Eight. In conclusion,
Chapter Nine summarizes the main approach taken, the contribution made by this thesis
and future aspects of research.
Preliminary material on the tensor (or Kronecker) product and logic diagrams is
presented in the appendices. A short proof of the structure theorems, the vector radix direct
fast DCT algorithm based on Lee's method and derivations of various combined vector
radix FFT algorithms are also presented in the appendices.
1-6 Publications, Submitted Papers and Internal Technical
Reports
[1-6.1] H.R. Wu and F.J. Paoloni, "The Structure of Vector Radix Multidimensional
Fast Fourier Transforms", ISSPA 87: Signal Processing, Theories,
Implementations and Applications, pp.89-92, August 1987.
[1-6.2] H.R. Wu and F.J. Paoloni, "The Structure of Vector Radix Fast Fourier
Transforms", IEEE Transactions on Acoustics, Speech, and Signal
Processing, Vol.37, pp.1415-1424, September 1989.
[1-6.3] H.R. Wu and F.J. Paoloni, "On the Two Dimensional Vector Split-Radix
FFT Algorithm", IEEE Transactions on Acoustics, Speech, and Signal
Processing, Vol.37, pp.1302-1304, August 1989.
[1-6.4] H.R. Wu and F.J. Paoloni, "Structured Vector Radix FFT Algorithms and
Hardware Implementation", Journal of Electrical and Electronics
Engineering, Australia, September 1990.
[1-6.5] H.R. Wu and F.J. Paoloni, "A Two Dimensional Fast Cosine Transform
Algorithm—A Structural Approach", Proceedings of IEEE International
Conference on Image Processing, pp.50-54, Singapore, September 1989.
[1-6.6] H.R. Wu and F.J. Paoloni, "A 2-D Fast Cosine Transform Algorithm Based
on Hou's Approach", IEEE Trans. on Acoust., Speech, and Signal
Processing, to appear in June 1991.
[1-6.7] H.R. Wu, F.J. Paoloni and W. Tan, "Implementation of 2-D DCT for Image
Coding Using FDP™ A41102", Proceedings of the Conference on Image
Processing and the Impact of New Technologies, pp.35-38, Canberra,
December 1989.
[1-6.8] H.R. Wu and F.J. Paoloni, "A Structural Approach to Two Dimensional
Direct Fast Discrete Cosine Transform Algorithms", Proceedings of
International Symposium on Computer Architecture & Digital Signal
Processing, pp.358-362, Hong Kong, October 1989.
[1-6.9] H.R. Wu and F.J. Paoloni, "The Impact of the VLSI Technology on the Fast
Computation of Discrete Cosine Transforms for Image Coding", to be
submitted.
[1-6.10] H.R. Wu and F.J. Paoloni, "A Perspective on Vector Radix FFT Algorithms
of Higher Dimensions", Proc. of the IASTED Int. Symp. on Signal
Processing & Digital Filtering, June 1990.
[1-6.11] H.R. Wu and F.J. Paoloni, "Implementation of 2-D Vector Radix FFT
Algorithms Using the Frequency Domain Processor A41102", Proc. of the
IASTED Int. Symp. on Signal Processing & Digital Filtering, June 1990.
(Internal Technical Reports)
[1-6.12] H.R. Wu and F.J. Paoloni, "Comparison Study on Software and Hardware
Implementation of Various Fast Discrete Cosine Transform Algorithms",
Technical Report-1, the University of Wollongong-Telecom Research
Laboratories (Australia) R&D Contract for the Study of Fast Implementations
of Discrete Cosine Transform Coding Systems, under No.7066, June 1989.
[1-6.13] H.R. Wu and F.J. Paoloni, "Simulation Study on the Effects of Finite-Word-
Length Calculations for Fast DCT Algorithms", Technical Report-2, the
University of Wollongong-Telecom Research Laboratories (Australia) R&D
Contract for the Study of Fast Implementations of Discrete Cosine Transform
Coding Systems, under No.7066, October 1989.
[1-6.14] H.R. Wu and F.J. Paoloni, "Comparison Study on Software and Hardware
Implementation of Various Fast Discrete Cosine Transform Algorithms",
Addendum of Technical Report-1, the University of Wollongong-Telecom
Research Laboratories (Australia) R&D Contract for the Study of Fast
Implementations of Discrete Cosine Transform Coding Systems, under
No.7066, November 1989.
PART I.
MULTIDIMENSIONAL DISCRETE FOURIER TRANSFORMS
CHAPTER TWO: 1-D DISCRETE FOURIER TRANSFORM AND
FAST FOURIER TRANSFORM ALGORITHMS
In this thesis, multidimensional vector radix FFT algorithms [42-44] based on the
Cooley-Tukey method [22] are considered in detail. Although the 1-D Cooley-Tukey
FFT algorithm and many others have been well studied and understood, they have been
included here for the purpose of understanding the structure of m-D VR FFT, the
evolution of VR FFTs from 1-D FFT, and even the 1-D FFT itself. The more that is
understood about 1-D FFT, the more easily the knowledge of m-D VR FFT algorithms
can be expanded. The matrix and logic diagram representations of the 1-D FFT
algorithm form the foundation on which the m-D VR FFTs are built.
After defining the 1-D discrete Fourier transform, the matrix forms for the 1-D FFT
are introduced. An examination, from several viewpoints, of why FFTs achieve better
computational efficiency is then presented.
2-1 Definitions
The Discrete Fourier Transform (DFT), X(k), of a vector x(m) of length N is
defined [22, 47] as follows:
X(k) = Σ_{m=0}^{N-1} x(m) W_N^{km}   (2-1-1)

where W_N = exp(-j2π/N), j = √-1 and k = 0, 1, ..., N-1.

The inverse DFT (IDFT) is given by:

x(m) = (1/N) Σ_{k=0}^{N-1} X(k) W_N^{-km}   (2-1-2)

where m = 0, 1, ..., N-1.

The derivation or development of the DFT from its corresponding continuous
Fourier Transform (FT) can be found in [39, 47].
In their matrix forms, the DFT and IDFT are defined by the following equations:
X = W_N x   (2-1-3)

and:

x = N̄ W_N^{-1} X   (2-1-4)

where X = [X(0), X(1), ..., X(N-1)]^T, x = [x(0), x(1), ..., x(N-1)]^T, W_N is an N×N
matrix with W_N(k,m) = W_N^{km} for 0 ≤ k,m ≤ N-1; N̄ = diag[1/N, ..., 1/N]; and
W_N^{-1}(m,k) = W_N^{-mk} for 0 ≤ k,m ≤ N-1. All matrices are of size N×N. W_N is
often called the 1-D DFT matrix.
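The definitions above translate directly into code. The following Python sketch (function names are illustrative, not from the thesis) builds the DFT matrix W_N and checks that the IDFT of Equation (2-1-4) inverts the DFT of Equation (2-1-3):

```python
import numpy as np

def dft_matrix(N):
    """The N x N 1-D DFT matrix W_N, with entries W_N^(k*m)."""
    k = np.arange(N).reshape(-1, 1)
    m = np.arange(N).reshape(1, -1)
    return np.exp(-2j * np.pi * k * m / N)

def dft(x):
    """Equation (2-1-3): X = W_N x."""
    return dft_matrix(len(x)) @ np.asarray(x, dtype=complex)

def idft(X):
    """Equation (2-1-4): x = (1/N) W_N^(-1) X, where W_N^(-1) is the
    element-wise conjugate of W_N."""
    N = len(X)
    return np.conj(dft_matrix(N)) @ np.asarray(X, dtype=complex) / N

x = np.random.rand(8)
assert np.allclose(idft(dft(x)), x)        # the IDFT inverts the DFT
assert np.allclose(dft(x), np.fft.fft(x))  # agrees with a library FFT
```

This direct matrix product is the O(N^2) baseline against which the fast algorithms of the next section are measured.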
2-2 Matrix Representations for 1-D Cooley-Tukey FFT
Algorithms
If the direct matrix operation is used, the computation of a 1-D DFT as defined by
Equation (2-1-3) needs N^2 complex multiplications and N^2 - N complex additions,
provided the input sequence is complex. Thus, fast algorithms need to be introduced to
reduce the arithmetic complexity and the computation time.
In this section, the 1-D Cooley-Tukey FFT algorithm is presented in the
traditional manner, followed by the customary graphical and matrix representations
of the algorithm and its computation structure. The matrix and diagrammatical forms,
which are used throughout the thesis, for both DIT and DIF FFT algorithms are then
presented, with their differences and relationships explained.
Using the Cooley-Tukey method or the Decimation-In-Time FFT algorithm [22,24,
47]:
k = k_1 N' + k_0;   m = m_1 r + m_0   (2-2-1)

is set, where N' = N/r, k_1, m_0 = 0, 1, ..., r-1 and k_0, m_1 = 0, 1, ..., N'-1.

Then Equation (2-2-2) is derived from Equation (2-1-1):

X(k_1, k_0) = Σ_{m_0=0}^{r-1} Σ_{m_1=0}^{N'-1} x(m_1, m_0) W_{N'}^{k_0 m_1} W_N^{k_0 m_0} W_r^{k_1 m_0}   (2-2-2)
The four steps of the algorithm to calculate Equation (2-2-2) are:

Step 1: The second butterfly (BF2), or the shorter length DFTs:

x'_1(k_0, m_0) = Σ_{m_1=0}^{N'-1} x(m_1, m_0) W_{N'}^{k_0 m_1}   (2-2-3a)

Step 2: The twiddling multiplications (TM):

x_1(k_0, m_0) = x'_1(k_0, m_0) W_N^{k_0 m_0}   (2-2-3b)

Step 3: The first butterfly (BF1) of radix-r:

x_2(k_0, k_1) = Σ_{m_0=0}^{r-1} x_1(k_0, m_0) W_r^{k_1 m_0}   (2-2-3c)

Step 4: The unscrambling:

X(k_1, k_0) = x_2(k_0, k_1)   (2-2-3d)
The algorithm shows that the original 1-D DFT of length N can be calculated by r
DFTs of length N' (shorter than N), whereby computational savings can be
achieved.
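The four steps of Equation (2-2-3) can be sketched as follows in Python (names are illustrative; the r shorter DFTs of length N' are delegated to a library FFT here, though a full implementation would decimate them recursively):

```python
import numpy as np

def cooley_tukey_stage(x, r):
    """One decimation-in-time Cooley-Tukey stage, following the four
    steps of Equation (2-2-3) for a length N = r*N' input."""
    N = len(x)
    Np = N // r                                # N' = N/r
    # index map m = m1*r + m0: element [m1, m0] holds x(m1, m0)
    xm = np.asarray(x, dtype=complex).reshape(Np, r)
    # Step 1 (BF2): N'-point DFTs over m1 for each m0 -> x'_1(k0, m0)
    x1p = np.fft.fft(xm, axis=0)
    # Step 2 (TM): twiddle factors W_N^(k0*m0)
    k0 = np.arange(Np).reshape(-1, 1)
    m0 = np.arange(r).reshape(1, -1)
    x1 = x1p * np.exp(-2j * np.pi * k0 * m0 / N)
    # Step 3 (BF1): r-point DFTs over m0 for each k0 -> x_2(k0, k1)
    x2 = np.fft.fft(x1, axis=1)
    # Step 4 (unscrambling): X(k1*N' + k0) = x_2(k0, k1)
    return x2.T.reshape(N)

x = np.random.rand(12) + 1j * np.random.rand(12)
assert np.allclose(cooley_tukey_stage(x, 3), np.fft.fft(x))
```

The test with N = 12 and r = 3 illustrates that the decomposition is not restricted to powers of 2, as noted later in this chapter.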
There have been many attempts to interpret the FFT by finding both mathematical
and graphical expressions for the algorithm [24, 25, 31, 39, 47]. The Mason flow graph
[6] is one such attempt and its introduction has been of great help in interpreting and
presenting the FFT. Figure-1 shows a Mason signal flow graph representing the 8-point
DIT Cooley-Tukey FFT algorithm, where N = 8, the input is in natural numerical order and
the output in bit-reversed order. By bit-reversed order, it is meant that if (n2 n1 n0)b is the
binary representation of the natural decimal index (n)d, its decimal index (nr)d in bit-
reversed order will be (n0 n1 n2)b. For example, if (n)d = (4)d = (100)b, then (nr)d =
(001)b = (1)d. The bit-reversed order of the sequence {0, 1, 2, 3}, with binary indices
{00, 01, 10, 11}, is {00, 10, 01, 11}, i.e., {0, 2, 1, 3}.

[Figure-1: Mason signal flow graph of the 8-point DIT Cooley-Tukey FFT algorithm, with
input in natural order and output in bit-reversed order.]
It is assumed that the decimal number is used for indexing unless it is indicated
explicitly otherwise.
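A bit-reversal routine of the kind implied above might look as follows (an illustrative Python sketch):

```python
def bit_reverse(n, bits):
    """Reverse the 'bits'-bit binary representation of index n,
    e.g. (4)d = (100)b -> (001)b = (1)d for bits = 3."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (n & 1)  # shift the lowest bit of n into r
        n >>= 1
    return r

# Bit-reversed ordering of an 8-point sequence (3 bits):
order = [bit_reverse(n, 3) for n in range(8)]
# order == [0, 4, 2, 6, 1, 5, 3, 7]
```

This is the unscrambling permutation applied at the output (or input) of the radix-2 FFT.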
In Figure-1, the signal flow graph shows vividly the construction of the algorithm,
including the butterfly (BF), twiddles (TM) and the unscrambling, and the number of
operations saved, noting that W_8^0 = 1 and W_8^2 = -j. The relationships between the
long length DFT and shorter length DFTs when the algorithm is applied can also be
described [25].
An alternative representation uses matrices to give the mathematical interpretation
and explanation of the algorithm. The FFT algorithm can be constructed by matrix
decomposition (or factorization) [39, 47], and its roots lie in algebra [30, 33]. In [30],
the matrix decomposition which underlies the Cooley-Tukey algorithm is shown to be:

W_N = B_1 T_1 B_2 T_2 ... B_{m-1} T_{m-1} B_m   (2-2-4)

assuming the length of the DFT is N = 2^m. B_i (i = 1, ..., m) represents a butterfly stage
which evolves from 2-point DFTs [30]. It is shown that each B_i only needs N complex
additions to calculate. T_i (i = 1, ..., m-1) is a diagonal matrix (representing the twiddling
multiplications) with half of its elements being 1 (or trivial), i.e., only N/2 complex
multiplications are needed to work out each T_i. This matrix form describes the
Cooley-Tukey algorithm both precisely and concisely. As has been shown, the
computational efficiency is also very easy to evaluate.
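The factorization of Equation (2-2-4) corresponds to the familiar stage-by-stage FFT loop. The following Python sketch (an illustrative reconstruction; the exact factor ordering in [30] may differ) performs each butterfly stage B_i with N complex additions and each twiddle stage T_i with at most N/2 complex multiplications:

```python
import numpy as np

def fft_stages(x):
    """Iterative radix-2 DIT FFT organized as the stage product of
    Equation (2-2-4): each butterfly stage B_i costs N complex
    additions, and each twiddle stage T_i at most N/2 complex
    multiplications (half of its diagonal entries are trivial)."""
    a = np.asarray(x, dtype=complex).copy()
    N = len(a)
    m = N.bit_length() - 1                    # N = 2**m assumed
    # permute the input into bit-reversed order
    a = a[[int(format(n, f'0{m}b')[::-1], 2) for n in range(N)]]
    size = 2
    while size <= N:
        half = size // 2
        w = np.exp(-2j * np.pi * np.arange(half) / size)  # w[0] = 1
        for start in range(0, N, size):
            u = a[start:start + half].copy()
            t = a[start + half:start + size] * w          # T_i products
            a[start:start + half] = u + t                 # B_i additions
            a[start + half:start + size] = u - t
        size *= 2
    return a

x = np.random.rand(16) + 1j * np.random.rand(16)
assert np.allclose(fft_stages(x), np.fft.fft(x))
```

Counting operations in this loop directly reproduces the N additions per B_i and N/2 products per T_i quoted above.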
The matrix form adopted in this thesis is virtually the same as that of Blahut [31],
and its indexing scheme follows the traditional Cooley-Tukey presentation [22, 47]. The
matrix equations for the 1-D radix-2 DIT FFT algorithm on the 8-point 1-D DFT (i.e., r =
2, N = 8 and N' = 4) are presented as follows:
(BF2:)

[x'_1(0,m_0)]   [1  1  1  1] [x(0,m_0)]
[x'_1(2,m_0)] = [1  1 -1 -1] [x(2,m_0)]   (2-2-5a)
[x'_1(1,m_0)]   [1 -1 -j  j] [x(1,m_0)]
[x'_1(3,m_0)]   [1 -1  j -j] [x(3,m_0)]

(TM:)

[x_1(k_0,0)]   [1  0        ] [x'_1(k_0,0)]
[x_1(k_0,1)] = [0  W_N^{k_0}] [x'_1(k_0,1)]   (2-2-5b)

(BF1:)

[x_2(k_0,0)]   [1  1] [x_1(k_0,0)]
[x_2(k_0,1)] = [1 -1] [x_1(k_0,1)]   (2-2-5c)

(Unscrambling:)

[X(0,k_0)]   [x_2(k_0,0)]
[X(1,k_0)] = [x_2(k_0,1)]   (2-2-5d)

In Equation (2-2-5a), there are two (for m_0 = 0, 1, respectively) length-four DFTs
which can be further decimated using this recursive algorithm.

The algorithm is described by a logic diagram in Figure-2 which is similar to the
Mason signal flow graph in Figure-1. There is a correspondence between the matrix
form of the butterfly and twiddles and its signal flow graph. As shown in Figure-2, the
logic diagram consists of one stage of radix-4 DIT FFT butterfly (BF), a stage of radix-2
BF and a Twiddling Multiplication (TM) stage between the radix-4 and radix-2 BF stages,
represented by DIT TM-4*2, where a = exp(-jπ/4). Other symbols are defined in
Appendix A. It can be seen that there are two radix-4 butterflies in the radix-4 BF stage
and each radix-4 butterfly can be implemented with no multiplication and only eight
additions. The difference between Equation (2-2-4) and Equation (2-2-5) is that the
former, which is a form that underlies the Cooley-Tukey algorithm [30], is a top-down
overall matrix decomposition method; and the latter is a representation of the algorithm
itself. From the matrix form point of view, it is a bottom-up approach.

[Figure-2: Logic diagram of the 8-point radix-2 DIT FFT algorithm, showing the radix-4
BF stage, the DIT TM-4*2 twiddling stage and the radix-2 BF stage.]

Equation (2-2-4)
can be seen as a mathematical representation of the Cooley-Tukey algorithm. It explains
"what" and "why". But when it comes to "how", i.e., how to derive an algorithm, the
"decimation" method has been predominantly preferred [5, 6, 22, 24, 25, 36, 37, 43,
68]. Equation (2-2-5) is a concise representation of the Cooley-Tukey algorithm. Further
decimation can proceed on Equation (2-2-5a): as m_0 takes each fixed value, the equation
itself is a half-length DFT. It is easy to see that the matrix form used in this thesis is
similar to that used by Blahut but with a reordered data sequence and a different indexing
scheme.
Similarly, a matrix form of equations can be introduced for the Decimation-In-
Frequency (DIF) Cooley-Tukey FFT algorithm.
Assuming N = r*N', set

k = k_1 r + k_0;   m = m_1 N' + m_0

where k_0, m_1 = 0, 1, ..., r-1 and k_1, m_0 = 0, 1, ..., N'-1.

Given N = 8, N = 2*N' and N' = 4, the matrix equations for the 1-D radix-2
DIF FFT algorithm on the 8-point DFT are presented as follows:
(BF1:)

[x_1(0,m_0)]   [1  1] [x(0,m_0)]
[x_1(1,m_0)] = [1 -1] [x(1,m_0)]   (2-2-6a)

(TM:)

[x'_1(0,m_0)]   [1  0        ] [x_1(0,m_0)]
[x'_1(1,m_0)] = [0  W_N^{m_0}] [x_1(1,m_0)]   (2-2-6b)

(BF2:)

[x_2(k_0,0)]   [1  1  1  1] [x'_1(k_0,0)]
[x_2(k_0,2)] = [1  1 -1 -1] [x'_1(k_0,2)]   (2-2-6c)
[x_2(k_0,1)]   [1 -1 -j  j] [x'_1(k_0,1)]
[x_2(k_0,3)]   [1 -1  j -j] [x'_1(k_0,3)]

(Unscrambling:)

[X(0,k_0)]   [x_2(k_0,0)]
[X(2,k_0)] = [x_2(k_0,2)]   (2-2-6d)
[X(1,k_0)]   [x_2(k_0,1)]
[X(3,k_0)]   [x_2(k_0,3)]
A logic diagram used to perform this 8-point DFT is shown in Figure-3.

Because of the binary organization of digital computers, algorithms for N not a
power of 2 have received less attention, although the Cooley-Tukey algorithm can be
applied to DFTs of length N, where N can be any composite number [21, 24].
Thereafter, in this thesis, most of the discussion will be on DFTs whose length is a
power of 2.
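The DIF decomposition of Equation (2-2-6), generalized to any radix r dividing N, can be sketched as follows in Python (names are illustrative; the N'-point DFTs are delegated to a library FFT rather than decimated recursively):

```python
import numpy as np

def dif_stage(x, r):
    """One decimation-in-frequency Cooley-Tukey stage (cf. Equation
    (2-2-6)): first the radix-r butterflies (BF1), then the twiddles
    (TM), then the N'-point DFTs (BF2)."""
    N = len(x)
    Np = N // r                                # N' = N/r
    # index map m = m1*N' + m0: element [m1, m0] holds x(m1, m0)
    xm = np.asarray(x, dtype=complex).reshape(r, Np)
    # BF1: r-point DFTs over m1 for each m0 -> x_1(k0, m0)
    x1 = np.fft.fft(xm, axis=0)
    # TM: twiddle factors W_N^(k0*m0)
    k0 = np.arange(r).reshape(-1, 1)
    m0 = np.arange(Np).reshape(1, -1)
    x1p = x1 * np.exp(-2j * np.pi * k0 * m0 / N)
    # BF2: N'-point DFTs over m0 for each k0 -> x_2(k0, k1)
    x2 = np.fft.fft(x1p, axis=1)
    # unscrambling: X(k1*r + k0) = x_2(k0, k1)
    return x2.T.reshape(N)

x = np.random.rand(8) + 1j * np.random.rand(8)
assert np.allclose(dif_stage(x, 2), np.fft.fft(x))
```

Comparing this with the DIT sketch earlier in the chapter makes the mirror-image relationship between the two decimations explicit: the twiddles sit after the short DFTs in DIT and before them in DIF.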
2-3 Computational Considerations
The computational consideration is a sophisticated problem. The majority of the
work undertaken in search of fast algorithms for DFTs has gone into reducing the
computational complexity [34, 39, 48, 49]. On most general purpose computers,
multiplications are more expensive than additions. As a result, especially in the early
years of research on FFT algorithms, much work was carried out on reducing the number
of multiplications, some of which was accomplished at the expense of additions, in-place
computation, and most of all, regular computing structure. A second consideration was
reducing roundoff errors [25], and research in this direction has remained very much alive
and up to date [50, 51]. In-place computation has also drawn a lot of attention from
researchers [39, 52]. Regular structure is yet another factor which is important to both
software and hardware implementation of FFT algorithms [14, 34, 41, 53, 54]. A fast
algorithm may lose its initial momentum due to its lack of in-place computation or regular
computing structure which dramatically increases the bookkeeping task, and it may be
placed in a disadvantageous position after all [39, 41, 154]. All the above should be
considered and balanced to devise or choose an FFT algorithm for a specific application.
[Figure-3: Logic diagram of the 8-point radix-2 DIF FFT algorithm.]

Another point to be considered is that the advantage of an improved algorithm can be
wasted if the computer hardware (or the digital signal processor) cannot take advantage of
it [31]. This aspect shall be discussed in the second part of the thesis, where the hardware
implementation of 2-D DCTs is considered.
The computational complexity in terms of operations is usually evaluated in two
ways, i.e., by examining the mathematical equations or by looking at the logic diagrams
(the signal flow graphs). There are many interpretations as to why the FFT algorithm can
be fast. Mathematically speaking, as the 1-D DFT is a summation of weighted inputs
with the weight being a periodic function, it can be evaluated by a clever insertion of
parentheses to reduce the number of additions and multiplications, thus becoming an
algebraic exercise. The number of multiplications can be further reduced by locating the
trivial multiplications such as ±1 and ±j in regular positions [31, 45]. This can also be
explained using the logic diagram, which gives an engineering interpretation. Examine a
4-point DFT, for example. The algorithm represented by Figure-4-(a) is equivalent to the
direct matrix operation. As a result, 12 additions and 16 multiplications are needed to
complete the transform. By close examination, it is found that use can be made of the
periodicity of the weighting function, as shown in Figure-4-(b), to group those inputs with
the same weights and reduce the number of additions to 8 in Figure-4-(c). It can be noted
that W_4^0 = 1, which does not require multiplication, and W_4 can be moved to the right
of the summation symbol, which further reduces the number of multiplications by 1. Using
conventional presentation, the algorithm represented by the logic diagram is given by
Figure-4-(d). Similarly, W_4^1 = -j, which can be calculated without multiplication. The
final fast algorithm for the 4-point DFT needs only 8 additions, as shown in Figure-4-(e).
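The resulting fast 4-point DFT can be written out directly. The sketch below uses 8 additions/subtractions and no general multiplications, since multiplication by -j merely swaps and negates real and imaginary parts:

```python
def dft4(x):
    """Fast 4-point DFT of Figure-4-(e): 8 additions/subtractions and
    no general multiplications (the factor -j is trivial)."""
    t0 = x[0] + x[2]                 # }
    t1 = x[0] - x[2]                 # } first butterfly stage:
    t2 = x[1] + x[3]                 # } 4 additions
    t3 = x[1] - x[3]                 # }
    return [t0 + t2,                 # X(0)  } second stage:
            t1 - 1j * t3,            # X(1)  } 4 more additions
            t0 - t2,                 # X(2)  } (the -j is trivial)
            t1 + 1j * t3]            # X(3)

# dft4([1, 2, 3, 4]) -> [10, -2+2j, -2, -2-2j], matching the direct DFT
```

This is exactly the operation count (8 additions, 0 multiplications) arrived at graphically above.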
Logic diagrams can also be used to explain why one algorithm can be faster than the
other, in terms of the number of multiplications, when different radices are used. Take a
16-point 1-D DFT, for example. In Figure-5, a mixed radix-2 and radix-8 FFT algorithm
is used so that 10 multiplications are required to complete the transform. If the twiddles
between the radix-2 and radix-8 butterflies are moved to the right and are combined with
the twiddles inside the radix-8 butterfly, a new group of twiddles is formed as shown in
[Figure-4-(a): Logic diagram of the direct 4-point DFT (16 multiplications, 12 additions).]
[Figure-4-(b), (c): Grouping inputs with equal weights, reducing the additions to 8.]
[Figure-4-(d): Conventional logic diagram of the resulting algorithm.]
[Figure-4-(e): Final fast 4-point DFT algorithm, requiring 8 additions only.]
[Figure-5: Logic diagram of a 16-point mixed radix-2 and radix-8 DIT FFT algorithm.]
[Figure-6: Logic diagram of the 16-point radix-4 FFT algorithm obtained by relocating and
combining the twiddle factors.]
Figure-6. As a result, only 8 multiplications are needed to perform the same 16-point
DFT. As a matter of fact, Figure-6 represents the radix-4 FFT algorithm. According to
Richard [55], the best choice of algorithm depends strongly on whether the execution
speed is dominated by: (1) multiply time; (2) multiply and addition time equally; or (3)
butterfly time. From the logic diagram it can be seen that, when multiply time dominates,
optimizing an FFT algorithm amounts to finding the best locations for the twiddle factors
so that the number of multiplications is minimal.
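A small bookkeeping sketch makes the comparison concrete. Assuming a pure radix-r DIT organization and counting a twiddle factor as nontrivial unless it is ±1 or ±j (the counting convention here is illustrative and does not cover the mixed-radix case of Figure-5), the radix-4 count of 8 for N = 16 matches the figure quoted above:

```python
def nontrivial_twiddles(N, radix):
    """Count twiddle-factor multiplications of a pure radix-'radix'
    DIT FFT of length N (a power of the radix), skipping the trivial
    factors +1, -1, +j, -j.  Between a stage of length-'size' sub-DFTs
    and the next stage, the twiddles are W_Np^(k0*m0) with
    Np = size*radix, repeated for each of the N//Np blocks."""
    count = 0
    size = radix
    while size < N:
        Np = size * radix
        for k0 in range(size):
            for m0 in range(radix):
                # W_Np^(k0*m0) lies in {1, -1, j, -j} iff its 4th power is 1
                if (4 * k0 * m0) % Np != 0:
                    count += N // Np
        size *= radix
    return count

print(nontrivial_twiddles(16, 2))   # radix-2: 10 nontrivial twiddles
print(nontrivial_twiddles(16, 4))   # radix-4: 8, as in Figure-6
```

Relocating twiddles so that more of them fall on the trivial values is precisely the optimization illustrated by the move from Figure-5 to Figure-6.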
2-4 Summary
In this chapter, both 1-D Decimation-In-Time and Decimation-In-Frequency FFT
algorithms have been examined using two forms, namely, the matrix representation and
the logic diagram. These two forms will be used throughout this thesis as a basis for the
derivation of multidimensional vector radix FFT algorithms. The logic diagram is
equivalent to the traditional signal flow graph, but it can be generalized into a
multidimensional form without complication or misperception. It is also
demonstrated that alternative algorithms can be derived using logic diagrams alone. In
other words, the logic diagram is not only a form of representation of algorithms but can
also be used to derive new algorithms. It can be used as a stand-alone engineering tool.
CHAPTER THREE: 2-D DFT AND 2-D FFT ALGORITHMS
3-1 Introduction to 2-D Discrete Fourier Transforms
It is well known that when the sample length of the convolution for image filtering
or the correlation for template matching is long, the FFT approach is faster than the
original domain approach [8]. In most multidimensional applications, this is usually the
case. Although there are many approximation methods in the original domain which
perform multidimensional processing in real-time and have achieved reasonably good
results, it is believed that multidimensional FFT will still have a place in the field, in
theoretical analysis as well as in practical applications.
A 2-D DFT problem can arise from 2-D signal processing applications or can result
from mathematical manipulations [1, 57]. The study of fast algorithms for 2-D DFTs is
one of the practical concerns in multidimensional (m-D) digital signal processing based on
contemporary technology, especially computer technology, and is also the first step toward
studying multidimensional fast Fourier transform algorithms. Many properties of and
applications for 2-D Fourier transforms can be extended directly to computing
multidimensional Fourier transforms [31].
At present, however, it must be admitted that the requirements of most real-time
multidimensional DFT applications cannot be satisfied using ordinary methods [1, 8, 31,
39, 47, 63] with even the most advanced digital signal processors [13, 26, 40] or
supercomputers [129, 130], due to the large amount of data needed to be processed. To
improve the situation, good multidimensional FFT algorithms and greater computing
power provided by the development of application specific VLSI technology in the
computer industry are needed. This has motivated research in both theory and
technology. On the theoretical front, apart from the traditional row-column approach,
there are reports of many multidimensional FFT methods: the vector radix FFT
algorithms [31, 42-45], multidimensional DFTs by polynomial transforms [58],
multidimensional prime factor FFT algorithms [32], multidimensional Winograd Fourier
transforms [35], multidimensional Number Theoretic Transform algorithm [59], vector
split-radix FFT algorithms [36, 37, 60] and, recently, vector FFT algorithms developed
for vector computers [61] and supercomputers [129, 130]. A good mathematical
explanation of the different multidimensional FFT algorithms can be found in [1, 30, 31,
33]. On the technology front, apart from the different Digital Signal Processors (DSPs),
especially the Zoran Vector Signal Processor (VSP) [13], which can perform FFTs, the
successful fabrication of radix-2 and radix-4 butterflies has been reported [56], as well as
full length FFT processors [142]. Recently, a full length FFT processor which performs
up to a 256-point complex DFT in 102.4 μs has been fabricated and demonstrated [14, 15],
and is now commercially available [26, 27]. Many proposals have been made to implement
DFTs and FFTs using systolic array processors [140, 141, 145, 147] and VLIW [129] or
SIMD [130] supercomputers. The neural net implementation of FFTs is still at an early
stage [138]. A VLSI architecture has been proposed and designed using GE 3-μm
CMOS technology and the vector radix-2*2 FFT algorithm for rasterizing the 2-D DFT
of size N*N at video speed [134].
From the beginning, research on the fast computation of DFTs has followed criteria
to evaluate the effectiveness of an FFT algorithm, i.e., an effective FFT algorithm should
be computationally efficient in terms of the number of operations, it should reduce roundoff
errors, and it should possess in-place computation and a regular structure. Ignoring the
last two points will cause an increase in the bookkeeping burden and bring the
disadvantages which the bookkeeping task may cause. According to the above criteria,
although algorithms based on the Cooley-Tukey method usually need more
multiplications than many reduced-multiplication FFT algorithms [39, 41, 58], they have
obvious advantages over the rest by the last three criteria. This is also true of the
multidimensional vector radix FFT algorithms [50, 51].
In this part of the thesis, 2-D Vector Radix (VR) FFT algorithms, which are
multidimensional extensions of the 1-D Cooley-Tukey algorithms, are considered. The
Cooley-Tukey FFT algorithm is of historical importance in modern digital signal
processing. It is still one of the most widely used algorithms, both in software and
hardware, including VLSI implementations of DFTs, because of its regular structure and
many other computational advantages. The fact that an algorithm has a regular structure
is very crucial in VLSI implementation. The CSIRO-designed AUSTEK FDP™ A41101
and A41102 FFT processors exploit the good structure provided by the Cooley-Tukey
algorithm to form a pipeline architecture and to achieve one of the fastest FFT processing
speeds on record [14, 15, 142].
The vector radix FFT algorithm was first conceived by Rivard [42], further
developed by Harris, McClellan, Chan and Schuessler [43] and Arambepola [44], and
unified by Mersereau and Speake [62]. The vector radix FFT algorithm is a
straightforward extension of the 1-D Cooley-Tukey algorithm, and it is more efficient in
terms of the number of multiplications than its row-column 1-D counterpart. A VLSI
architecture using the vector radix-2*2 FFT algorithm has been proposed and patented
[134, 137] to show many of its advantages over traditional row-column implementations.
Nevertheless, it is less well known to electrical and computer engineers, and more often
than not, misunderstood and treated as a mathematically complicated and involved
process. Many think that it is not worth the effort to use VR FFTs in real applications.
This is only natural when all the struggle and strife of two decades ago in understanding
and interpreting the Cooley-Tukey FFT algorithm [20, 23, 25, 47, 63, 64] is recalled.
On the other hand, there have been new reports lately on the vector split-radix (VSR) FFT
algorithms by Pei and Wu [36], and Mou and Duhamel [37]. The derivation of these
"strangely" split vector radix algorithms is rather complicated and the final results are
difficult to appreciate due to the direct derivation approach used.
In order to extend different 1-D FFT algorithms to higher dimensions whilst avoiding confusion and tedious derivation, the structural features of the multidimensional FFT algorithms have to be examined. The structural features of the multidimensional Winograd Fourier transform algorithms have been studied extensively [30, 35], as have those of the prime factor algorithm [32]. Efficient as they are, the Winograd Fourier transform algorithm demands that the length of the DFT be a prime, and the prime factor fast Fourier transform algorithm requires that the lengths of the short DFTs be mutually prime [45, 154]. In essence, they tend to reduce the number of multiplications at the expense of the number of additions and, especially, of the regular computation structure. Furthermore, these structural features are described exclusively by matrix decomposition (or factorization), which is somewhat less attractive to engineers than a graphical representation such as the signal flow graph ("butterflies"). On the other hand, the structural features of the vector radix FFT algorithms have not been well examined, understood, or exploited.
In the history of FFT algorithms, matrix decomposition has served as a method of interpreting fast algorithms [47] and as an alternative representation of them [39]. Algorithms are usually derived by decimation of the indices of the transform function and finally described by signal flow graphs, on which computer programs or VLSI architecture designs are based. Matrix representation has also been used as a tool for the construction of some FFT algorithms. It has been found that the construction of fast algorithms and algebra, of which matrix theory is one branch, are deeply related, although they are not the same subject [30].
When the dimensions increase, problems become more complex and are often difficult to comprehend, as explained in the previous chapter. Solutions become more flexible and there are many alternative approaches. When a new algorithm is derived, it is not always certain that the formulas are error-free, and correcting them is not easy. More often than not, the effort spent in verifying a new algorithm equals the time taken to derive another new one. This is why the study of a systematic and structural approach to the m-D FFT algorithms is justified.
Whenever a mathematical result is used for engineering applications, it results in a new algorithm. It may be further converted into software or hardware implementations, in which case the representation becomes important, so much so that it can make whatever it represents either a technological and industrial wave or a sinking leaf in the sea of research papers. There is no need to stress the significant role played by a group of researchers at MIT in providing electrical engineers with an engineering interpretation of the Cooley-Tukey FFT algorithm [23]: the so-called "butterfly" signal flow graph is a major consideration in all publications on FFT algorithms based on the Cooley-Tukey method, and it has been used to represent many other fast transform algorithms as well [39]. Unfortunately, in most publications on m-D FFT algorithms the graphical presentation is still in a 1-D form, which can, more often than not, be overcomplicated. Also missing from the 1-D graphical presentation of m-D algorithms is the connection between the m-D algorithm and its corresponding row-column 1-D algorithm. The graphical presentation which is going to be introduced in this chapter adopts the vector signal processing concept. It will become clear that this is a better form and that it can be used to explain the relationships between different m-D algorithms as well.
It is the purpose of this part of the thesis to establish general matrix representation forms for vector radix FFT algorithms. These general forms provide a structural approach to the construction of various VR FFT algorithms whilst preserving the simple and regular structure possessed by their 1-D counterparts, in addition to the computational improvement. The form that has been chosen is a combination of matrix representation and logic diagram, the latter being a multidimensional extension of the signal flow graph ("butterflies") for 1-D FFT algorithms. Matrices are used as a concise representation of algorithms and of any particular structure of an algorithm, as well as a tool to derive multidimensional VR FFTs from their corresponding 1-D algorithms. Logic diagrams independently provide another approach towards the derivation of m-D VR FFT algorithms and are used for computational considerations, software programming and hardware implementation. The relationships between the row-column FFT and the various VR FFTs can also be vividly interpreted or explained in these forms. The use of this structural approach makes the derivation and implementation of various VR FFT algorithms straightforward and structured; otherwise it could be a tedious, untidy and potentially erroneous procedure. With a 1-D FFT algorithm based on Cooley-Tukey's concept, it is possible to obtain its multidimensional VR FFT systematically. The properties, such as in-place calculation, symmetry and data order, and structures such as the butterfly, twiddling multiplications and unscrambling, are all preserved in the multidimensional context. This approach can also be extended to some other fast transform algorithms, and has been so extended to the 2-D fast cosine transform, as explored in the second part of this thesis.
Another feature of multidimensional VR FFT algorithms worth mentioning is that VR FFTs have better fixed-point and floating-point error characteristics than both the row-column FFTs and the polynomial transform FFTs [50, 51]. Naturally, when the number of operations is reduced, the number of error sources is also reduced.
3-2 Definitions
The general $N_1{\times}N_2$-point 2-D DFT and its inverse are defined as [1]:
$$X(k,l) = \sum_{m=0}^{N_1-1}\sum_{n=0}^{N_2-1} x(m,n)\, W_{N_1}^{mk} W_{N_2}^{nl} \qquad (3\text{-}2\text{-}1a)$$
and,
$$x(m,n) = \frac{1}{N_1 N_2}\sum_{k=0}^{N_1-1}\sum_{l=0}^{N_2-1} X(k,l)\, W_{N_1}^{-mk} W_{N_2}^{-nl} \qquad (3\text{-}2\text{-}1b)$$
where $W_N = \exp(-j2\pi/N)$, $k, m = 0,1,\ldots,N_1-1$ and $l, n = 0,1,\ldots,N_2-1$. In the following discussion it is assumed that $N_1$ and $N_2$ are powers of 2 to simplify the presentation.
The matrix forms of the 2-D DFT and its inverse are given as follows:
$$\mathbf{X} = W^2 \mathbf{x} \qquad (3\text{-}2\text{-}2a)$$
and,
$$\mathbf{x} = \frac{1}{N_1 N_2}\, W^{-2} \mathbf{X} \qquad (3\text{-}2\text{-}2b)$$
where $\mathbf{X}$ is an $N_1 N_2$ column vector formed by stacking the transposed row vectors of the 2-D output array, $\mathbf{x}$ is the corresponding $N_1 N_2$ column vector formed from the 2-D input array, $W^2 = W_{N_1} \otimes W_{N_2}$, $W^{-2} = W_{N_1}^{*} \otimes W_{N_2}^{*}$, and $\otimes$ stands for the tensor (or Kronecker) product. Another matrix form for the definition of the 2-D DFT for general periodically sampled signals is given by Mersereau and Speake [62]:
$$X(\mathbf{k}) = \sum_{\mathbf{n}\in I_N} x(\mathbf{n})\, \exp[-j\mathbf{k}^T(2\pi N^{-1})\mathbf{n}], \qquad \mathbf{k} \in J_N \qquad (3\text{-}2\text{-}3a)$$
and,
$$x(\mathbf{n}) = \frac{1}{|\det N|}\sum_{\mathbf{k}\in J_N} X(\mathbf{k})\, \exp[j\mathbf{k}^T(2\pi N^{-1})\mathbf{n}], \qquad \mathbf{n} \in I_N \qquad (3\text{-}2\text{-}3b)$$
where $N$ is a periodicity matrix, and $I_N$ and $J_N$ are the regions on which $x(\mathbf{n})$ and $X(\mathbf{k})$ are supported, respectively [1, 62]. A special case of these two regions is the rectangular one, which is the most commonly used.
Yet another definition can be given to the 2-D DFT in the form of matrix row and column operations [3]. In this form the input and its DFT are both 2-D matrices. Although row and column matrix operations are the most familiar, this form seems difficult to extend beyond the 2-D DFT. When it comes to the derivation of 2-D FFT algorithms, such as the vector radix FFT algorithms, this form is not as convenient to use as the others.
The definition given by Equation (3-2-3) is a very concise mathematical
representation of the 2-D (m-D) DFT. Based on this definition, FFT algorithms for
rectangularly or hexagonally sampled signals or signals which are sampled on an arbitrary
periodic grid in either the spatial or Fourier domain are devised. The relationships
between the existing m-D FFT algorithms based on the Cooley-Tukey scheme are also
well explained in this form. However, this form helps little to show how to derive vector
radix FFT algorithms given that the corresponding 1-D FFTs are known.
Definitions given by Equations (3-2-1) and (3-2-2) are by far the most commonly used representations for m-D DFTs [2, 9, 30, 31, 35, 37, 65, 66], Equation (3-2-2) being a direct matrix representation of Equation (3-2-1).
3-3 Row-Column FFT Algorithms
The relationship between the multidimensional DFT and the 1-D DFTs can be expressed by the Kronecker product in matrix form [30, 35], i.e., the multidimensional DFT matrix $W^2$ is given by
$$W^2 = W_{N_1}^{1} \otimes W_{N_2}^{1} \qquad (3\text{-}3\text{-}1)$$
where $W_{N_i}^{1}$, $i = 1,2$, represents the $N_i$-point 1-D DFT matrix.
The first implication of Equation (3-3-1) for the 2-D DFT problem is the well-known row-column approach. If $N_1 = N_2 = 16$, the row-column radix-4 FFT can be used to calculate the 2-D DFT, as shown in Figure-7. In Figure-7, the vectors x0 to x15 consist of the row elements of the 2-D input array, and X0 to X15 represent rows of the output array with elements in bit-reversed order. Each heavy line represents sixteen data lines, each of which carries an element of $x_i$. The block inscribed R-16 FFT represents the 16-point DFT using the radix-4 FFT given in Figure-6, and performs the row FFTs in the diagram. The addition block stands for vector addition and operates on elements from the same column of the two input vectors. Likewise, the $-1$ block and the $-j$ block perform the corresponding operations on every element of the input vector. The part of Figure-7 to the right of the R-16 FFT blocks forms a radix-4 FFT structure; however, this structure operates on columns of the input array only, i.e., it performs the column FFTs. Using the logic diagram shown in Figure-7, the computation structure of the 2-D DFT is exceedingly clear.
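The row-column computation just described can be sketched in a few lines of Python (an illustrative sketch, not part of the thesis; the function names `dft`, `dft2_row_column` and `dft2_direct` are chosen here). Direct 1-D DFTs stand in for the radix-4 FFT blocks to keep the sketch short; the row-column result is checked against the definition of Equation (3-2-1a).

```python
import cmath

def dft(seq):
    """Direct 1-D DFT: X[k] = sum_m x[m] * W_N^(m*k), with W_N = exp(-j*2*pi/N)."""
    N = len(seq)
    return [sum(seq[m] * cmath.exp(-2j * cmath.pi * m * k / N) for m in range(N))
            for k in range(N)]

def dft2_row_column(x):
    """Row-column 2-D DFT of an N1 x N2 array: 1-D DFTs on rows, then on columns."""
    N1, N2 = len(x), len(x[0])
    rows = [dft(row) for row in x]                        # N1 row transforms (length N2)
    cols = [dft([rows[i][j] for i in range(N1)])          # N2 column transforms (length N1)
            for j in range(N2)]
    return [[cols[l][k] for l in range(N2)] for k in range(N1)]

def dft2_direct(x):
    """Direct 2-D DFT from the definition in Equation (3-2-1a)."""
    N1, N2 = len(x), len(x[0])
    return [[sum(x[m][n] * cmath.exp(-2j * cmath.pi * (m * k / N1 + n * l / N2))
                 for m in range(N1) for n in range(N2))
             for l in range(N2)] for k in range(N1)]
```

The two routines agree to rounding error on any input array, which is exactly the content of Equation (3-3-1): the 2-D DFT matrix factors into row and column 1-D DFT matrices.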
3-4 Vector Radix FFT Algorithms
Instead of proceeding with decimation operations on each dimension separately (one after another) as the row-column method does, the vector radix FFT algorithm suggests that decimation be performed on all indices (or dimensions) simultaneously [42-44].
In the case where decimation-in-time is used on both indices of the 2-D DFT, assuming that $N_1 = r_1 N_1'$ and $N_2 = r_2 N_2'$, set:
$$k = k_1 N_1' + k_0;\quad m = m_1 r_1 + m_0;\quad l = l_1 N_2' + l_0;\quad n = n_1 r_2 + n_0;$$
where $k_1, m_0 = 0,1,\ldots,r_1-1$; $k_0, m_1 = 0,1,\ldots,N_1'-1$; $l_1, n_0 = 0,1,\ldots,r_2-1$; and $l_0, n_1 = 0,1,\ldots,N_2'-1$. From Equation (3-2-1), Equation (3-4-1) is derived:
[Figure-7. Logic diagram of the 16*16-point 2-D DFT computed by the row-column radix-4 FFT.]
$$X(k_1,k_0;l_1,l_0) = \sum_{m_0=0}^{r_1-1}\sum_{n_0=0}^{r_2-1} W_{r_1}^{m_0 k_1} W_{r_2}^{n_0 l_1}\, W_{N_1}^{m_0 k_0} W_{N_2}^{n_0 l_0} \sum_{m_1=0}^{N_1'-1}\sum_{n_1=0}^{N_2'-1} x(m_1,m_0;n_1,n_0)\, W_{N_1'}^{m_1 k_0} W_{N_2'}^{n_1 l_0} \qquad (3\text{-}4\text{-}1)$$
When decimation-in-frequency is applied along both indices, set:
$$k = k_1 r_1 + k_0;\quad m = m_1 N_1' + m_0;\quad l = l_1 r_2 + l_0;\quad n = n_1 N_2' + n_0;$$
where $N_1 = r_1 N_1'$, $N_2 = r_2 N_2'$; $k_1, m_0 = 0,1,\ldots,N_1'-1$; $k_0, m_1 = 0,1,\ldots,r_1-1$; $l_1, n_0 = 0,1,\ldots,N_2'-1$; and $l_0, n_1 = 0,1,\ldots,r_2-1$.
Then from Equation (3-2-1), Equation (3-4-2) is derived:
$$X(k_1,k_0;l_1,l_0) = \sum_{m_0=0}^{N_1'-1}\sum_{n_0=0}^{N_2'-1} W_{N_1'}^{m_0 k_1} W_{N_2'}^{n_0 l_1}\, W_{N_1}^{m_0 k_0} W_{N_2}^{n_0 l_0} \sum_{m_1=0}^{r_1-1}\sum_{n_1=0}^{r_2-1} x(m_1,m_0;n_1,n_0)\, W_{r_1}^{m_1 k_0} W_{r_2}^{n_1 l_0} \qquad (3\text{-}4\text{-}2)$$
Since more than one dimension is decimated, different decimation schemes can be applied to different dimensions, which leads to a mixed decimation vector radix FFT (mixed VR FFT for short [44, 67]). For instance, DIF can be used on the row index and DIT on the column index by setting:
$$k = k_1 r_1 + k_0;\quad m = m_1 N_1' + m_0;\quad l = l_1 N_2' + l_0;\quad n = n_1 r_2 + n_0;$$
where $N_1 = r_1 N_1'$, $N_2 = r_2 N_2'$; $k_1, m_0 = 0,1,\ldots,N_1'-1$; $k_0, m_1 = 0,1,\ldots,r_1-1$; $l_1, n_0 = 0,1,\ldots,r_2-1$; and $l_0, n_1 = 0,1,\ldots,N_2'-1$.
$$X(k_1,k_0;l_1,l_0) = \sum_{m_0=0}^{N_1'-1}\sum_{n_0=0}^{r_2-1} W_{N_1'}^{m_0 k_1} W_{r_2}^{n_0 l_1}\, W_{N_1}^{m_0 k_0} W_{N_2}^{n_0 l_0} \sum_{m_1=0}^{r_1-1}\sum_{n_1=0}^{N_2'-1} x(m_1,m_0;n_1,n_0)\, W_{r_1}^{m_1 k_0} W_{N_2'}^{n_1 l_0} \qquad (3\text{-}4\text{-}3)$$
From Equation (3-4-1), in each stage of the FFT operation, the row twiddles both inside ($W_{r_1}^{m_0 k_1}$) and outside ($W_{N_1}^{m_0 k_0}$) the butterfly structure can be combined with the column twiddles ($W_{r_2}^{n_0 l_1}$ and $W_{N_2}^{n_0 l_0}$, respectively). Intuitively, this explains why vector radix FFT algorithms require fewer multiplications than their row-column counterparts.
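Equation (3-4-1) can be realized directly in a few lines. The following Python sketch (illustrative, not the thesis's implementation) is a minimal recursive VR-2*2 DIT FFT, assuming $r_1 = r_2 = 2$ and $N_1 = N_2 = N$ a power of two: each output is formed from the four decimated sub-transforms through the $(-1)^{m_0 k_1 + n_0 l_1}$ butterfly and a single merged twiddle $W_N^{m_0 k_0} W_N^{n_0 l_0}$ per term.

```python
import cmath

def vr_fft2(x):
    """2-D vector radix-2x2 DIT FFT per Eq. (3-4-1) with r1 = r2 = 2.
    x is an N x N list of lists, N a power of two."""
    N = len(x)
    if N == 1:
        return [[x[0][0]]]
    h = N // 2
    # Four decimated subarrays x(2*m1 + m0, 2*n1 + n0), each transformed recursively.
    S = {(m0, n0): vr_fft2([[x[2*m1 + m0][2*n1 + n0] for n1 in range(h)]
                            for m1 in range(h)])
         for m0 in (0, 1) for n0 in (0, 1)}
    W = lambda p: cmath.exp(-2j * cmath.pi * p / N)
    X = [[0j] * N for _ in range(N)]
    for k1 in (0, 1):
        for l1 in (0, 1):
            for k0 in range(h):
                for l0 in range(h):
                    # Butterfly (-1)^(m0*k1 + n0*l1), then one merged row/column twiddle.
                    X[k1*h + k0][l1*h + l0] = sum(
                        (-1) ** (m0*k1 + n0*l1) * W(m0*k0) * W(n0*l0)
                        * S[(m0, n0)][k0][l0]
                        for m0 in (0, 1) for n0 in (0, 1))
    return X
```

Note that the product $W_N^{m_0 k_0} W_N^{n_0 l_0}$ is a single complex constant per term, whereas a row-column implementation applies the row and column twiddles in separate passes.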
The point is that although this original VR FFT presentation is mathematically and computationally simple and clear, it does little to eliminate the complicated and tedious derivation procedure for the various VR FFT algorithms required by specific applications. When the mixed radix FFT method [54] or the split-radix method [68] is invoked for each dimension to obtain VR FFT algorithms, further complications in the derivation procedure are to be expected. No simple solutions to this problem have appeared in the literature. The computational complexity can also be calculated on the wrong basis; for instance, $W_N^2$ counts as one complex multiplication just as $W_N^3$ does. Another point to be made here is that the mixed vector radix FFT algorithm has more variety than the 1-D mixed radix FFT algorithm [47, 54], which has not been addressed properly in the published literature [1, 31, 42-44, 62], if it was addressed at all. This will be discussed further through examples.
3-5 Matrix Representations for 2-D Vector Radix FFT Algorithms
In order to present the structural approach, a matrix form is introduced for 2-D VR FFTs. Its indexing scheme follows the traditional Cooley-Tukey presentation, which has been widely used in the literature and adopted in both software and hardware implementations; otherwise it is a generalized form of that presented in [31].
A matrix form for DIT VR FFTs given by Equation (3-4-1) can be written as the following three steps:
(BF:)
$$[X(k_1,k_0;l_1,l_0)] = I^1\,[x_1'(k_0,m_0;l_0,n_0)] \qquad (3\text{-}5\text{-}1a)$$
(TM:)
$$[x_1'(k_0,m_0;l_0,n_0)] = E^1\,[x_1(k_0,m_0;l_0,n_0)] \qquad (3\text{-}5\text{-}1b)$$
(Remaining short-length 2-D DFTs:)
$$[x_1(k_0,m_0;l_0,n_0)] = \sum_{m_1=0}^{N_1'-1}\sum_{n_1=0}^{N_2'-1} W_{N_1'}^{m_1 k_0} W_{N_2'}^{n_1 l_0}\,[x(m_1,m_0;n_1,n_0)] \qquad (3\text{-}5\text{-}1c)$$
where $[X(k_1,k_0;l_1,l_0)]$, $[x_1'(k_0,m_0;l_0,n_0)]$, $[x_1(k_0,m_0;l_0,n_0)]$ and $[x(m_1,m_0;n_1,n_0)]$ are $r_1 r_2$ column vectors, with $k_1, l_1$ and $m_0, n_0$ varying in bit-reversed order; $E^1$ is the twiddle factor matrix, an $r_1 r_2 \times r_1 r_2$ diagonal matrix with the element value $F^1(i,i)$ ($i = 1,2,\ldots,r_1 r_2$) equal to $W_{N_1}^{m_0 k_0} W_{N_2}^{n_0 l_0}$ accordingly; and $I^1$ is the matrix for the 2-D vector radix-$r_1{\times}r_2$ BF structure, also an $r_1 r_2 \times r_1 r_2$ matrix, with the element value $I^1(i,j)$ ($i,j = 1,2,\ldots,r_1 r_2$) equal to $W_{r_1}^{m_0 k_1} W_{r_2}^{n_0 l_1}$ correspondingly. Equation (3-5-1c) contains $r_1 r_2$ $N_1'{\times}N_2'$-point 2-D DFTs, which can be further decimated.
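The tensor-product structure of $E^1$ and $I^1$ in Equation (3-5-1) can be checked numerically. The sketch below (illustrative; `kron` is a small helper written for this purpose, not a thesis routine) builds the 2-D vector radix-2*2 butterfly and twiddle matrices from their 1-D radix-2 factors and confirms that the diagonal of the resulting $E^1$ carries the merged factors $W_N^{m_0 k_0} W_N^{n_0 l_0}$.

```python
import cmath

def kron(A, B):
    """Kronecker (tensor) product of two matrices stored as lists of lists."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

N, k0, l0 = 8, 1, 3                      # example indices, N1 = N2 = N
W = lambda p: cmath.exp(-2j * cmath.pi * p / N)

I2 = [[1, 1], [1, -1]]                   # 1-D radix-2 BF structure matrix
F_row = [[1, 0], [0, W(k0)]]             # 1-D row twiddle diag(1, W_N^k0)
F_col = [[1, 0], [0, W(l0)]]             # 1-D column twiddle diag(1, W_N^l0)

I_2d = kron(I2, I2)                      # VR-2x2 BF: entries W_2^(k1 m0) W_2^(l1 n0)
E_2d = kron(F_row, F_col)                # VR-2x2 twiddle: diagonal, merged factors
```

The diagonal of `E_2d`, read with $(m_0, n_0)$ in lexicographic order, is exactly $W_N^{m_0 k_0} W_N^{n_0 l_0}$, i.e., each 2-D twiddle is one complex constant.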
Example-1: Given an $N_1{\times}N_2$-point 2-D DFT where $N_1 = N_2 = N = 16$, the VR-4*4 DIT FFT algorithm in matrix form can be presented as follows:
(BF:)
$$\begin{bmatrix} X_0 \\ X_2 \\ X_1 \\ X_3 \end{bmatrix} = \left( I^{t4} \otimes I^{t4} \right) \begin{bmatrix} x_{1,0}' \\ x_{1,2}' \\ x_{1,1}' \\ x_{1,3}' \end{bmatrix} \qquad (3\text{-}5\text{-}2a)$$
(TM:)
$$\begin{bmatrix} x_{1,0}' \\ x_{1,2}' \\ x_{1,1}' \\ x_{1,3}' \end{bmatrix} = \left( \mathrm{diag}[\,1,\ W_N^{2k_0},\ W_N^{k_0},\ W_N^{3k_0}\,] \otimes F^{t4} \right) \begin{bmatrix} x_{1,0} \\ x_{1,2} \\ x_{1,1} \\ x_{1,3} \end{bmatrix} \qquad (3\text{-}5\text{-}2b)$$
(Remaining short-length 2-D DFTs:)
$$\begin{bmatrix} x_{1,0} \\ x_{1,2} \\ x_{1,1} \\ x_{1,3} \end{bmatrix} = \sum_{m_1=0}^{N_1'-1}\sum_{n_1=0}^{N_2'-1} W_{N_1'}^{m_1 k_0} W_{N_2'}^{n_1 l_0} \begin{bmatrix} x_0 \\ x_2 \\ x_1 \\ x_3 \end{bmatrix} \qquad (3\text{-}5\text{-}2c)$$
where, for $i = 0,1,\ldots,3$:
$X_i = [X(i,k_0;0,l_0),\ X(i,k_0;2,l_0),\ X(i,k_0;1,l_0),\ X(i,k_0;3,l_0)]^T$;
$x_{1,i}' = [x_1'(k_0,i;l_0,0),\ x_1'(k_0,i;l_0,2),\ x_1'(k_0,i;l_0,1),\ x_1'(k_0,i;l_0,3)]^T$;
$x_{1,i} = [x_1(k_0,i;l_0,0),\ x_1(k_0,i;l_0,2),\ x_1(k_0,i;l_0,1),\ x_1(k_0,i;l_0,3)]^T$;
$x_i = [x(m_1,i;n_1,0),\ x(m_1,i;n_1,2),\ x(m_1,i;n_1,1),\ x(m_1,i;n_1,3)]^T$;
and
$$I^{t4} = \begin{bmatrix} 1&1&1&1 \\ 1&1&-1&-1 \\ 1&-1&-j&j \\ 1&-1&j&-j \end{bmatrix}, \qquad F^{t4} = \mathrm{diag}[\,1,\ W_N^{2l_0},\ W_N^{l_0},\ W_N^{3l_0}\,].$$
Written out in block form,
$$I^1 = I^{t4} \otimes I^{t4} = \begin{bmatrix} I^{t4}&I^{t4}&I^{t4}&I^{t4} \\ I^{t4}&I^{t4}&-I^{t4}&-I^{t4} \\ I^{t4}&-I^{t4}&-jI^{t4}&jI^{t4} \\ I^{t4}&-I^{t4}&jI^{t4}&-jI^{t4} \end{bmatrix}, \qquad E^1 = \mathrm{diag}[\,F^{t4},\ W_N^{2k_0}F^{t4},\ W_N^{k_0}F^{t4},\ W_N^{3k_0}F^{t4}\,].$$
The tensor product in Equation (3-5-2) is used here merely as a concise form of presentation. However, it does indicate an important fact which will be discussed in the next section.
A matrix form can also be written for the first stage of DIF VR FFTs presented by Equation (3-4-2) as the following three steps:
(BF:)
$$[x_1(k_0,m_0;l_0,n_0)] = I^f\,[x(m_1,m_0;n_1,n_0)] \qquad (3\text{-}5\text{-}3a)$$
(TM:)
$$[x_1'(k_0,m_0;l_0,n_0)] = E^f\,[x_1(k_0,m_0;l_0,n_0)] \qquad (3\text{-}5\text{-}3b)$$
(Remaining short-length 2-D DFTs:)
$$[X(k_1,k_0;l_1,l_0)] = \sum_{m_0=0}^{N_1'-1}\sum_{n_0=0}^{N_2'-1} W_{N_1'}^{m_0 k_1} W_{N_2'}^{n_0 l_1}\,[x_1'(k_0,m_0;l_0,n_0)] \qquad (3\text{-}5\text{-}3c)$$
where $[X(k_1,k_0;l_1,l_0)]$, $[x_1(k_0,m_0;l_0,n_0)]$, $[x_1'(k_0,m_0;l_0,n_0)]$ and $[x(m_1,m_0;n_1,n_0)]$ are $r_1 r_2$ column vectors with $k_0, l_0$ and $m_1, n_1$ varying in bit-reversed order; $E^f$ is the twiddle factor matrix, an $r_1 r_2 \times r_1 r_2$ diagonal matrix with the element value $F^f(i,i)$ ($i = 1,2,\ldots,r_1 r_2$) equal to $W_{N_1}^{m_0 k_0} W_{N_2}^{n_0 l_0}$ correspondingly; and $I^f$ is the matrix for the 2-D vector radix-$r_1{\times}r_2$ BF structure, also an $r_1 r_2 \times r_1 r_2$ matrix, with the element value $I^f(i,j)$ ($i,j = 1,2,\ldots,r_1 r_2$) equal to $W_{r_1}^{m_1 k_0} W_{r_2}^{n_1 l_0}$. The product of $E^f$, $I^f$ and $[x(m_1,m_0;n_1,n_0)]$ is again a column vector, so that further decimation can proceed on Equation (3-5-3c).
The matrix form for the 2-D mixed vector radix FFT algorithm given by Equation (3-4-3) is as follows:
(BF1:)
$$[x_1(k_0,m_0;l_0,n_0)] = I^m_{1BF}\,[x(m_1,m_0;n_1,n_0)] \qquad (3\text{-}5\text{-}4a)$$
(TM:)
$$[x_1'(k_0,m_0;l_0,n_0)] = E^m_{TM}\,[x_1(k_0,m_0;l_0,n_0)] \qquad (3\text{-}5\text{-}4b)$$
(BF2:)
$$[X(k_1,k_0;l_1,l_0)] = I^m_{2BF}\,[x_1'(k_0,m_0;l_0,n_0)] \qquad (3\text{-}5\text{-}4c)$$
where $[x(m_1,m_0;n_1,n_0)]$ and $[x_1(k_0,m_0;l_0,n_0)]$ in Equation (3-5-4a) are $r_1 N_2'$ column vectors with $k_0, l_0$ and $m_1, n_1$ varying in bit-reversed order; $[x_1(k_0,m_0;l_0,n_0)]$ and $[x_1'(k_0,m_0;l_0,n_0)]$ in Equation (3-5-4b) are $r_1 r_2$ column vectors with $k_0$ and $n_0$ varying in bit-reversed order; and $[x_1'(k_0,m_0;l_0,n_0)]$ and $[X(k_1,k_0;l_1,l_0)]$ in (3-5-4c) are $N_1' r_2$ column vectors with $k_1, l_1$ and $m_0, n_0$ varying in bit-reversed order. $E^m_{TM}$ is the twiddle factor matrix, an $r_1 r_2 \times r_1 r_2$ diagonal matrix with the element value $E^m_{TM}(i,i)$ ($i = 1,2,\ldots,r_1 r_2$) equal to $W_{N_1}^{m_0 k_0} W_{N_2}^{n_0 l_0}$ correspondingly; $I^m_{1BF}$ is the matrix for the 2-D vector radix-$r_1{\times}N_2'$ BF structure, an $r_1 N_2' \times r_1 N_2'$ matrix with the element value $I^m_{1BF}(i,j)$ equal to $W_{r_1}^{m_1 k_0} W_{N_2'}^{n_1 l_0}$; and $I^m_{2BF}$ is the matrix for the 2-D vector radix-$N_1'{\times}r_2$ BF structure, an $N_1' r_2 \times N_1' r_2$ matrix with the element value $I^m_{2BF}(i,j)$ equal to $W_{N_1'}^{m_0 k_1} W_{r_2}^{n_0 l_1}$. The superscript $m$ stands for the mixed decimation vector radix FFT. Further decimation can proceed on both Equation (3-5-4a) and Equation (3-5-4c).
3-6 Structure Theorems
By using the structural features of the multidimensional vector radix DIT FFT stated in the following theorem, the straightforward but often tedious derivation can be bypassed.
[Structure Theorem 1: Decimation-In-Time FFT]
If a 2-D DFT is defined by Equation (3-2-1a), $N_1 = r_1 N_1'$ and $N_2 = r_2 N_2'$, the vector radix-$r_1{\times}r_2$ decimation-in-time FFT is used, and the matrix representations of the corresponding 1-D FFT equations are given as follows:
$$[X(k_1,k_0)] = I_{N_1}^{r_1}\,[x_1'(k_0,m_0)] \qquad (3\text{-}6\text{-}1a)$$
$$[x_1'(k_0,m_0)] = F_{N_1}^{r_1}\,[x_1(k_0,m_0)] \qquad (3\text{-}6\text{-}1b)$$
$$[x_1(k_0,m_0)] = \sum_{m_1=0}^{N_1'-1} W_{N_1'}^{m_1 k_0}\,[x(m_1,m_0)] \qquad (3\text{-}6\text{-}1c)$$
and,
$$[X(l_1,l_0)] = I_{N_2}^{r_2}\,[x_1'(l_0,n_0)] \qquad (3\text{-}6\text{-}2a)$$
$$[x_1'(l_0,n_0)] = F_{N_2}^{r_2}\,[x_1(l_0,n_0)] \qquad (3\text{-}6\text{-}2b)$$
$$[x_1(l_0,n_0)] = \sum_{n_1=0}^{N_2'-1} W_{N_2'}^{n_1 l_0}\,[x(n_1,n_0)] \qquad (3\text{-}6\text{-}2c)$$
where $k_1, m_0 = 0,1,\ldots,r_1-1$; $k_0, m_1 = 0,1,\ldots,N_1'-1$; $l_1, n_0 = 0,1,\ldots,r_2-1$; $l_0, n_1 = 0,1,\ldots,N_2'-1$; $F_{N_1}^{r_1}$ ($F_{N_2}^{r_2}$ respectively) is the twiddle factor matrix of the 1-D radix-$r_1$ (radix-$r_2$ respectively) DIT FFT and $I_{N_1}^{r_1}$ ($I_{N_2}^{r_2}$ respectively) is the BF structure matrix of the 1-D radix-$r_1$ (radix-$r_2$ respectively) DIT FFT, then the matrix equation for the 2-D vector radix-$r_1{\times}r_2$ DIT FFT algorithm is presented by Equation (3-5-1), where $E^1 = F_{N_1}^{r_1} \otimes F_{N_2}^{r_2}$ and $I^1 = I_{N_1}^{r_1} \otimes I_{N_2}^{r_2}$, with the symbol $\otimes$ standing for the tensor (or Kronecker) product [30, 31, 69]. In other words, $E^1$ can be obtained by replacing the element $F_{N_1}^{r_1}(i,i)$ of matrix $F_{N_1}^{r_1}$ with $F_{N_1}^{r_1}(i,i)\,F_{N_2}^{r_2}$, and $I^1$ by replacing $I_{N_1}^{r_1}(i,j)$ of $I_{N_1}^{r_1}$ with $I_{N_1}^{r_1}(i,j)\,I_{N_2}^{r_2}$.
The structure theorem can be readily proved using matrix theory once all equations have been expressed in the above matrix form (see Appendix B). It can be verified that the result is correct by referring to Equation (3-4-1). The complete equations for a specific DIT VR FFT can be obtained by applying the theorem to the remaining short-length 2-D DFTs repeatedly. The application of the structure theorem will be demonstrated in the examples at the end of this subsection.
The relationship between 1-D radix-$2^i$ FFTs and the corresponding vector radix FFTs is clearly explained by the structure theorem, and thus the derivation of higher order vector radix FFT algorithms becomes simpler. Since FFTs based on [22] and [43] are the issue, not surprisingly, the statements cover the processing stages of both BF and TM. The unscrambling stage of a complete vector radix FFT equation is also governed by this rule, i.e., once the unscrambling matrices for the corresponding 1-D FFTs are known, that of the 2-D vector radix FFT algorithm is the tensor product of the two [153].
Similarly, the following theorems for the DIF VR FFTs and the mixed VR FFTs are also true.
[Structure Theorem 2: Decimation-In-Frequency FFT]
Suppose that the $N_1{\times}N_2$ 2-D DFT is defined by Equation (3-2-1a), where $N_1 = r_1 N_1'$, $N_2 = r_2 N_2'$, decimation-in-frequency is used, and the matrix representations of the corresponding 1-D FFT equations are given as follows:
$$[x_1(k_0,m_0)] = I_{N_1}^{r_1}\,[x(m_1,m_0)] \qquad (3\text{-}6\text{-}3a)$$
$$[x_1'(k_0,m_0)] = F_{N_1}^{r_1}\,[x_1(k_0,m_0)] \qquad (3\text{-}6\text{-}3b)$$
$$[X(k_1,k_0)] = \sum_{m_0=0}^{N_1'-1} W_{N_1'}^{m_0 k_1}\,[x_1'(k_0,m_0)] \qquad (3\text{-}6\text{-}3c)$$
and,
$$[x_1(l_0,n_0)] = I_{N_2}^{r_2}\,[x(n_1,n_0)] \qquad (3\text{-}6\text{-}4a)$$
$$[x_1'(l_0,n_0)] = F_{N_2}^{r_2}\,[x_1(l_0,n_0)] \qquad (3\text{-}6\text{-}4b)$$
$$[X(l_1,l_0)] = \sum_{n_0=0}^{N_2'-1} W_{N_2'}^{n_0 l_1}\,[x_1'(l_0,n_0)] \qquad (3\text{-}6\text{-}4c)$$
where $k_0, m_1 = 0,1,\ldots,r_1-1$; $k_1, m_0 = 0,1,\ldots,N_1'-1$; $l_0, n_1 = 0,1,\ldots,r_2-1$; $l_1, n_0 = 0,1,\ldots,N_2'-1$; $F_{N_1}^{r_1}$ ($F_{N_2}^{r_2}$ respectively) is the twiddle factor matrix of the 1-D radix-$r_1$ (radix-$r_2$ respectively) decimation-in-frequency FFT and $I_{N_1}^{r_1}$ ($I_{N_2}^{r_2}$ respectively) is the BF structure matrix of the 1-D radix-$r_1$ (radix-$r_2$ respectively) decimation-in-frequency FFT. The matrix equation for the 2-D vector radix-$r_1{\times}r_2$ DIF FFT algorithm is given by Equation (3-5-3), where $E^f = F_{N_1}^{r_1} \otimes F_{N_2}^{r_2}$ and $I^f = I_{N_1}^{r_1} \otimes I_{N_2}^{r_2}$, with the symbol $\otimes$ standing for the tensor (or Kronecker) product [30, 31, 69].
[Structure Theorem 3: Mixed VR FFT]
For a given $N_1{\times}N_2$ 2-D DFT as shown in Equation (3-2-1a), if $N_1 = r_1 N_1'$, $N_2 = r_2 N_2'$, and the matrix representations of the 1-D DIF FFT and the 1-D DIT FFT algorithms are presented as follows:
$$[x_1(k_0,m_0)] = I_{N_1}^{r_1}\,[x(m_1,m_0)] \qquad (3\text{-}6\text{-}5a)$$
$$[x_1'(k_0,m_0)] = F_{N_1}^{r_1}\,[x_1(k_0,m_0)] \qquad (3\text{-}6\text{-}5b)$$
$$[X(k_1,k_0)] = \sum_{m_0=0}^{N_1'-1} W_{N_1'}^{m_0 k_1}\,[x_1'(k_0,m_0)] \qquad (3\text{-}6\text{-}5c)$$
and,
$$[X(l_1,l_0)] = I_{N_2}^{r_2}\,[x_1'(l_0,n_0)] \qquad (3\text{-}6\text{-}6a)$$
$$[x_1'(l_0,n_0)] = F_{N_2}^{r_2}\,[x_1(l_0,n_0)] \qquad (3\text{-}6\text{-}6b)$$
$$[x_1(l_0,n_0)] = \sum_{n_1=0}^{N_2'-1} W_{N_2'}^{n_1 l_0}\,[x(n_1,n_0)] \qquad (3\text{-}6\text{-}6c)$$
where $k_0, m_1 = 0,1,\ldots,r_1-1$; $k_1, m_0 = 0,1,\ldots,N_1'-1$; $l_1, n_0 = 0,1,\ldots,r_2-1$; $l_0, n_1 = 0,1,\ldots,N_2'-1$; $F_{N_1}^{r_1}$ is the twiddle factor matrix of the 1-D radix-$r_1$ DIF FFT, $I_{N_1}^{r_1}$ is the BF structure matrix of the 1-D radix-$r_1$ DIF FFT, $F_{N_2}^{r_2}$ is the twiddle factor matrix of the 1-D radix-$r_2$ DIT FFT and $I_{N_2}^{r_2}$ is the BF structure matrix of the 1-D radix-$r_2$ DIT FFT, then the matrix equation for the 2-D mixed vector radix-$r_1{\times}r_2$ FFT algorithm is given by Equation (3-5-4), where $I^m_{1BF} = I_{N_1}^{r_1} \otimes W_{N_2'}^{1}$, $E^m_{TM} = F_{N_1}^{r_1} \otimes F_{N_2}^{r_2}$, and $I^m_{2BF} = W_{N_1'}^{1} \otimes I_{N_2}^{r_2}$, with the superscript $m$ for the mixed vector radix FFT. Here $W_{N'}^{1}$ denotes the $N'$-point 1-D DFT matrix with its rows and columns ordered as in Equations (3-6-5c) and (3-6-6c).
The application of the above theorems can be shown by the following examples.
Example-2:
Deriving the 1-D radix-8 FFT algorithm used to be a significant task [64]. However, compared with generating the 2-D vector radix-8*8 FFT directly, it is relatively simple. For many, writing out the corresponding 1-D algorithm (or even deriving it from scratch) or drawing its logic diagram is a good starting point for generating the required 2-D VR FFT, and it is simple enough. By applying the structure theorem, the vector radix FFT formula is then obtained with little extra effort.
Consider a 2-D DFT defined by Equation (3-2-1a) where $N_1 = N_2 = N = 8^{\mu}$, $\mu$ a positive integer, so that the VR-8*8 DIF FFT can be applied.
Begin by writing the butterfly structure and twiddling multiplications of the 1-D radix-8 DIF FFT algorithm in the matrix form presented by Equation (3-6-3), where $r = 8$, $N' = N/8$, $k_1, m_0 = 0,1,\ldots,N'-1$, and
$[X(k_1,k_0)] = [X(k_1,0),\ X(k_1,4),\ X(k_1,2),\ X(k_1,6),\ X(k_1,1),\ X(k_1,5),\ X(k_1,3),\ X(k_1,7)]^T$;
$[x_1'(k_0,m_0)] = [x_1'(0,m_0),\ x_1'(4,m_0),\ x_1'(2,m_0),\ x_1'(6,m_0),\ x_1'(1,m_0),\ x_1'(5,m_0),\ x_1'(3,m_0),\ x_1'(7,m_0)]^T$;
$[x_1(k_0,m_0)] = [x_1(0,m_0),\ x_1(4,m_0),\ x_1(2,m_0),\ x_1(6,m_0),\ x_1(1,m_0),\ x_1(5,m_0),\ x_1(3,m_0),\ x_1(7,m_0)]^T$;
$[x(m_1,m_0)] = [x(0,m_0),\ x(4,m_0),\ x(2,m_0),\ x(6,m_0),\ x(1,m_0),\ x(5,m_0),\ x(3,m_0),\ x(7,m_0)]^T$;
$$I_N^{f8} = \begin{bmatrix}
1&1&1&1&1&1&1&1 \\
1&1&1&1&-1&-1&-1&-1 \\
1&1&-1&-1&-j&-j&j&j \\
1&1&-1&-1&j&j&-j&-j \\
1&-1&-j&j&a&-a&-ja&ja \\
1&-1&-j&j&-a&a&ja&-ja \\
1&-1&j&-j&-ja&ja&a&-a \\
1&-1&j&-j&ja&-ja&-a&a
\end{bmatrix} \qquad (3\text{-}6\text{-}7)$$
where $a = W_8 = \exp(-j\pi/4)$, and
$$F_N^{f8} = \mathrm{diag}[\,1,\ W_N^{4m_0},\ W_N^{2m_0},\ W_N^{6m_0},\ W_N^{m_0},\ W_N^{5m_0},\ W_N^{3m_0},\ W_N^{7m_0}\,].$$
The logic diagram shown in Figure-3 performs the R-8 DIF FFT BF, in which there are only two complex multiplications (caused by $a$); the TM stage, where there are seven non-trivial complex multiplications because of the twiddles, can be added to the BF [45].
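The radix-8 butterfly matrix of Equation (3-6-7) can be generated rather than typed out by hand. The Python sketch below (illustrative; the names are not from the thesis) builds $I_N^{f8}$ from $W_8^{m_1 k_0}$ with rows and columns in bit-reversed order, and confirms that the only non-trivial multipliers among its entries are $\pm a$ and $\pm ja$, with $a = W_8 = e^{-j\pi/4}$.

```python
import cmath

def bitrev3(i):
    """Reverse the three bits of i, mapping 0..7 to 0,4,2,6,1,5,3,7."""
    return ((i & 1) << 2) | (i & 2) | (i >> 2)

W8 = lambda p: cmath.exp(-2j * cmath.pi * p / 8)
order = [bitrev3(i) for i in range(8)]

# Radix-8 DIF BF matrix: entry W_8^(m1*k0), rows k0 and columns m1 bit-reversed.
I_f8 = [[W8(m1 * k0) for m1 in order] for k0 in order]
```

Entries equal to $\pm 1$ or $\pm j$ cost no real multiplications, which is why the R-8 BF needs only the two complex multiplications attributed to $a$ above (each $\pm ja$ entry reuses an $a$-product followed by a trivial rotation).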
According to Structure Theorem 2, the first stage of the 2-D VR-8*8 DIF FFT matrix representation is given by Equation (3-5-3), where $r_1 = r_2 = 8$; $k_1, m_0, l_1, n_0 = 0,1,\ldots,N'-1$; and $N' = N/8$.
$[x(m_1,m_0;n_1,n_0)] = [x_0,\ x_4,\ x_2,\ x_6,\ x_1,\ x_5,\ x_3,\ x_7]^T$ with
$x_i = [x(i,m_0;0,n_0),\ x(i,m_0;4,n_0),\ x(i,m_0;2,n_0),\ x(i,m_0;6,n_0),\ x(i,m_0;1,n_0),\ x(i,m_0;5,n_0),\ x(i,m_0;3,n_0),\ x(i,m_0;7,n_0)]^T$, $i = 0,1,\ldots,7$;
$[x_1(k_0,m_0;l_0,n_0)] = [x_{1,0},\ x_{1,4},\ x_{1,2},\ x_{1,6},\ x_{1,1},\ x_{1,5},\ x_{1,3},\ x_{1,7}]^T$ with
$x_{1,i} = [x_1(i,m_0;0,n_0),\ x_1(i,m_0;4,n_0),\ x_1(i,m_0;2,n_0),\ x_1(i,m_0;6,n_0),\ x_1(i,m_0;1,n_0),\ x_1(i,m_0;5,n_0),\ x_1(i,m_0;3,n_0),\ x_1(i,m_0;7,n_0)]^T$, $i = 0,1,\ldots,7$;
$[x_1'(k_0,m_0;l_0,n_0)] = [x_{1,0}',\ x_{1,4}',\ x_{1,2}',\ x_{1,6}',\ x_{1,1}',\ x_{1,5}',\ x_{1,3}',\ x_{1,7}']^T$ with the sub-vectors $x_{1,i}'$ ordered in the same way;
$[X(k_1,k_0;l_1,l_0)] = [X_0,\ X_4,\ X_2,\ X_6,\ X_1,\ X_5,\ X_3,\ X_7]^T$ with
$X_i = [X(k_1,i;l_1,0),\ X(k_1,i;l_1,4),\ X(k_1,i;l_1,2),\ X(k_1,i;l_1,6),\ X(k_1,i;l_1,1),\ X(k_1,i;l_1,5),\ X(k_1,i;l_1,3),\ X(k_1,i;l_1,7)]^T$, $i = 0,1,\ldots,7$;
$$E^f = \mathrm{diag}[\,F_N^{f8},\ W_N^{4m_0}F_N^{f8},\ W_N^{2m_0}F_N^{f8},\ W_N^{6m_0}F_N^{f8},\ W_N^{m_0}F_N^{f8},\ W_N^{5m_0}F_N^{f8},\ W_N^{3m_0}F_N^{f8},\ W_N^{7m_0}F_N^{f8}\,]$$
where in this 2-D context
$$F_N^{f8} = \mathrm{diag}[\,1,\ W_N^{4n_0},\ W_N^{2n_0},\ W_N^{6n_0},\ W_N^{n_0},\ W_N^{5n_0},\ W_N^{3n_0},\ W_N^{7n_0}\,].$$
From Equation (3-6-7) we have:
$$I^f = I_N^{f8} \otimes I_N^{f8} \qquad (3\text{-}6\text{-}8)$$
i.e., the $64{\times}64$ block matrix obtained by replacing each element of $I_N^{f8}$ with that element multiplying a copy of $I_N^{f8}$; for instance, its first block row is $[\,I_N^{f8}\ \ I_N^{f8}\ \cdots\ I_N^{f8}\,]$ and its fifth block row is $[\,I_N^{f8}\ \ {-I_N^{f8}}\ \ {-jI_N^{f8}}\ \ jI_N^{f8}\ \ aI_N^{f8}\ \ {-aI_N^{f8}}\ \ {-jaI_N^{f8}}\ \ jaI_N^{f8}\,]$.
The complete equations of the VR-8*8 FFT for a specific 2-D DFT application can be obtained by applying the structure theorem recursively. Another point to be made is that Equation (3-6-8) is the matrix presentation of the VR-8*8 FFT butterfly structure, which is equivalent to an 8*8-point DFT and can itself be calculated by further invoking the vector radix approach. In mathematical terms, this implies further application of the properties of the tensor product to Equation (3-6-8). Computing the VR-8*8 FFT BF as it stands would commonly mean invoking the row-column method, and as a result this VR-8*8 FFT would be inferior to the VR-4*4 FFT in terms of arithmetic complexity. However, if the vector radix approach is used to perform this VR-8*8 BF (by the method indicated in [43], by the Combined Factor (CF) method in [45, Appendix C], or by the mixed VR method, as will be shown by the following example), the performance of the VR-8*8 FFT is better than that of VR-4*4 FFTs [44, 45, 67].
Example-3:
Given $N_1 = N_2 = N = 8$, $N = 2N'$, $N' = 4$, the matrix equations for the 1-D radix-2 DIF FFT algorithm on the 8-point DFT are presented as follows:
(BF1:)
$$\begin{bmatrix} x_1(0,m_0) \\ x_1(1,m_0) \end{bmatrix} = \begin{bmatrix} 1&1 \\ 1&-1 \end{bmatrix} \begin{bmatrix} x(0,m_0) \\ x(1,m_0) \end{bmatrix} \qquad (3\text{-}6\text{-}9a)$$
(TM:)
$$\begin{bmatrix} x_1'(0,m_0) \\ x_1'(1,m_0) \end{bmatrix} = \begin{bmatrix} 1&0 \\ 0&W_N^{m_0} \end{bmatrix} \begin{bmatrix} x_1(0,m_0) \\ x_1(1,m_0) \end{bmatrix} \qquad (3\text{-}6\text{-}9b)$$
(BF2:)
$$\begin{bmatrix} X(0,k_0) \\ X(2,k_0) \\ X(1,k_0) \\ X(3,k_0) \end{bmatrix} = \begin{bmatrix} 1&1&1&1 \\ 1&1&-1&-1 \\ 1&-1&-j&j \\ 1&-1&j&-j \end{bmatrix} \begin{bmatrix} x_1'(k_0,0) \\ x_1'(k_0,2) \\ x_1'(k_0,1) \\ x_1'(k_0,3) \end{bmatrix} \qquad (3\text{-}6\text{-}9c)$$
The matrix equations for the 1-D radix-2 DIT FFT algorithm on the 8-point DFT are presented as follows:
(BF1:)
$$\begin{bmatrix} X(0,l_0) \\ X(1,l_0) \end{bmatrix} = \begin{bmatrix} 1&1 \\ 1&-1 \end{bmatrix} \begin{bmatrix} x_1'(l_0,0) \\ x_1'(l_0,1) \end{bmatrix} \qquad (3\text{-}6\text{-}10a)$$
(TM:)
$$\begin{bmatrix} x_1'(l_0,0) \\ x_1'(l_0,1) \end{bmatrix} = \begin{bmatrix} 1&0 \\ 0&W_N^{l_0} \end{bmatrix} \begin{bmatrix} x_1(l_0,0) \\ x_1(l_0,1) \end{bmatrix} \qquad (3\text{-}6\text{-}10b)$$
(BF2:)
$$\begin{bmatrix} x_1(0,n_0) \\ x_1(2,n_0) \\ x_1(1,n_0) \\ x_1(3,n_0) \end{bmatrix} = \begin{bmatrix} 1&1&1&1 \\ 1&1&-1&-1 \\ 1&-1&-j&j \\ 1&-1&j&-j \end{bmatrix} \begin{bmatrix} x(0,n_0) \\ x(2,n_0) \\ x(1,n_0) \\ x(3,n_0) \end{bmatrix} \qquad (3\text{-}6\text{-}10c)$$
Using Structure Theorem 3, from Equations (3-6-9) and (3-6-10) the matrix form for the mixed DIF & DIT vector radix FFT algorithm is derived:
(BF1:)
$$\begin{bmatrix} x_1(0,m_0;0,n_0) \\ x_1(0,m_0;2,n_0) \\ x_1(0,m_0;1,n_0) \\ x_1(0,m_0;3,n_0) \\ x_1(1,m_0;0,n_0) \\ x_1(1,m_0;2,n_0) \\ x_1(1,m_0;1,n_0) \\ x_1(1,m_0;3,n_0) \end{bmatrix} = \left( \begin{bmatrix} 1&1 \\ 1&-1 \end{bmatrix} \otimes \begin{bmatrix} 1&1&1&1 \\ 1&1&-1&-1 \\ 1&-1&-j&j \\ 1&-1&j&-j \end{bmatrix} \right) \begin{bmatrix} x(0,m_0;0,n_0) \\ x(0,m_0;2,n_0) \\ x(0,m_0;1,n_0) \\ x(0,m_0;3,n_0) \\ x(1,m_0;0,n_0) \\ x(1,m_0;2,n_0) \\ x(1,m_0;1,n_0) \\ x(1,m_0;3,n_0) \end{bmatrix} \qquad (3\text{-}6\text{-}11a)$$
(TM:)
$$\begin{bmatrix} x_1'(0,m_0;l_0,0) \\ x_1'(0,m_0;l_0,1) \\ x_1'(1,m_0;l_0,0) \\ x_1'(1,m_0;l_0,1) \end{bmatrix} = \left( F_{N_1}^{f2} \otimes F_{N_2}^{t2} \right) \begin{bmatrix} x_1(0,m_0;l_0,0) \\ x_1(0,m_0;l_0,1) \\ x_1(1,m_0;l_0,0) \\ x_1(1,m_0;l_0,1) \end{bmatrix} \qquad (3\text{-}6\text{-}11b)$$
(BF2:)
$$\begin{bmatrix} X(0,k_0;0,l_0) \\ X(0,k_0;1,l_0) \\ X(2,k_0;0,l_0) \\ X(2,k_0;1,l_0) \\ X(1,k_0;0,l_0) \\ X(1,k_0;1,l_0) \\ X(3,k_0;0,l_0) \\ X(3,k_0;1,l_0) \end{bmatrix} = \left( \begin{bmatrix} 1&1&1&1 \\ 1&1&-1&-1 \\ 1&-1&-j&j \\ 1&-1&j&-j \end{bmatrix} \otimes \begin{bmatrix} 1&1 \\ 1&-1 \end{bmatrix} \right) \begin{bmatrix} x_1'(k_0,0;l_0,0) \\ x_1'(k_0,0;l_0,1) \\ x_1'(k_0,2;l_0,0) \\ x_1'(k_0,2;l_0,1) \\ x_1'(k_0,1;l_0,0) \\ x_1'(k_0,1;l_0,1) \\ x_1'(k_0,3;l_0,0) \\ x_1'(k_0,3;l_0,1) \end{bmatrix} \qquad (3\text{-}6\text{-}11c)$$
where:
$$F_{N_1}^{f2} \otimes F_{N_2}^{t2} = \begin{bmatrix} 1&0&0&0 \\ 0&W_N^{l_0}&0&0 \\ 0&0&W_N^{m_0}&0 \\ 0&0&0&W_N^{m_0+l_0} \end{bmatrix}$$
The above theorems provide very simple construction tools for various vector radix FFT algorithms. With a knowledge of different 1-D FFT algorithms, the 2-D VR FFT for a required application can readily be achieved. Since the theorems state clearly what the 2-D BF or TM stage should look like, checking a new variation of VR FFTs becomes a simple and straightforward procedure. Once the complete equations of the 1-D FFT algorithms are available, it is a matter of interweaving the corresponding BF and TM structures of the 1-D algorithms to form the 2-D (m-D) BF and TM structures. Although not discussed in the theorems, the output sequences of 2-D VR FFTs also obey the properties of the tensor product with respect to the 1-D FFT output sequences [153].
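The tensor-product assembly stated by the theorems can be spot-checked numerically. The short Python sketch below (illustrative, not from the thesis; `kron_diag` is a helper written here) verifies the diagonal identity used in the twiddle stage (3-6-11b): $\mathrm{diag}(1, W_N^{m_0}) \otimes \mathrm{diag}(1, W_N^{l_0}) = \mathrm{diag}(1,\ W_N^{l_0},\ W_N^{m_0},\ W_N^{m_0+l_0})$.

```python
import cmath

N = 8
W = lambda p: cmath.exp(-2j * cmath.pi * p / N)

def kron_diag(d1, d2):
    """Kronecker product of two diagonal matrices, kept as diagonals."""
    return [a * b for a in d1 for b in d2]

m0, l0 = 2, 3                                  # example index values
lhs = kron_diag([1, W(m0)], [1, W(l0)])        # F^(f2) tensor F^(t2)
rhs = [1, W(l0), W(m0), W(m0 + l0)]            # diagonal claimed in (3-6-11b)
```

The agreement of `lhs` and `rhs` rests on $W_N^{m_0} W_N^{l_0} = W_N^{m_0+l_0}$, the exponent-addition property that underlies all twiddle merging in this chapter.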
3-7 Structural Approach via Logic Diagrams
The diagrammatical interpretation of structure theorems can be expressed both at
stage-by-stage level [45] and as a complete form for a specific application [67]. Obtaining
the logic diagram of a 2-D FFT from those of 1-D FFTs requires the following procedure:
drawing 1-D FFT logic diagram(s); generating the logic diagram using the row-column
FFT; and finally, modifying the logic diagram using the row-column FFT into various 2-
D vector radix FFTs. Modification of the logic diagram follows the simple rules as
shown by the following equations.
In Figure-8(a), A x0 ± A x1 = A(x0 ± x1).
In Figure-8(b), αA x = A(αx).
where x, x0 and x1 are column vectors; x0 and x1 are of the same dimension; A is an
operator and α is a scalar.
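These two rules are simply the linearity of the operator A, and a quick check confirms both; the following is a hedged numpy sketch in which a random matrix stands in for a butterfly or twiddle stage (the names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))  # an arbitrary linear operator
x0 = rng.standard_normal(4) + 1j * rng.standard_normal(4)
x1 = rng.standard_normal(4) + 1j * rng.standard_normal(4)
alpha = 0.5 - 2.0j  # a scalar, e.g. a twiddle factor

# Figure-8(a): two copies of A feeding a sum/difference collapse into one copy of A.
assert np.allclose(A @ x0 + A @ x1, A @ (x0 + x1))
assert np.allclose(A @ x0 - A @ x1, A @ (x0 - x1))

# Figure-8(b): a scalar multiplier may be moved across the operator.
assert np.allclose(alpha * (A @ x0), A @ (alpha * x0))
```

It is exactly these identities that allow twiddle factors to be pushed through butterfly blocks when a row-column diagram is reshaped into a vector radix one.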
For long-length DFTs using high radices [45, 70], the logic diagram of the FFT at
the stage-by-stage level is the more useful form, because a complete final drawing would be difficult to
accommodate on one sheet of paper; nor is it necessary, although it is not unachievable.
For small size DFTs, deriving a complete logic diagram is always preferable.
Example-4:
In this first example, the VR-4*4 FFT algorithm for a 16*16-point DFT will be
derived using the logic diagram. As most 1-D FFT algorithms are well documented, it is
simple to start by drawing a 1-D logic diagram. In this case, the logic diagram of
a 16-point DFT using the radix-4 FFT algorithm is presented in Figure-6. Even if
no 1-D logic diagram were available, drawing a 1-D diagram from equations is much simpler
than drawing a 2-D vector radix FFT diagram from its equations. For this reason, it is
preferable that fast algorithms be presented as logic diagrams (or, equally, flow graphs)
whenever feasible. From personal experience, more often than not, one can judge whether
a 1-D fast transform algorithm is worth generalizing to its
multidimensional counterpart, and whether a saving in computational
complexity could be made, just by looking at the structure of the algorithm's logic diagram.
After the logic diagram is drawn for the 1-D radix-4 FFT as shown in Figure-6, the
figure is partitioned into three parts according to the stages of the FFT procedure, as
included in Figure-6. The logic diagram of the 2-D 16*16-point DFT using the
row-column radix-4 FFT is then given in Figure-7. Replacing all blocks inscribed
R-16 FFT in Figure-7 by Figure-6 yields Figure-9. Figure-9 can then be modified
into Figure-10, the logic diagram of the vector radix-4*4 DIT FFT
algorithm for a 16*16-point DFT. The twiddle factors of the row FFT are combined with
those of the column FFT to reduce the number of multiplications; this is the reason
why the vector radix approach is less expensive, in terms of computational operations,
than the row-column approach. In Figure-10, the VR-4*4 FFT BF is actually
implemented by the row-column approach. An alternative is to apply the VR-2*2
FFT to the VR-4*4 FFT BF, but this brings no further saving in the number of non-trivial
multiplications.
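The row-column baseline that Figure-7 represents is easy to check numerically. The following sketch is an assumption-light illustration using numpy's library FFT in place of the radix-4 diagrams; it confirms that 1-D transforms applied along the rows and then along the columns reproduce the direct 2-D 16*16-point DFT:

```python
import numpy as np

N = 16
x = np.random.default_rng(1).standard_normal((N, N))

# Row-column method: 1-D DFTs along every row, then along every column.
rc = np.fft.fft(np.fft.fft(x, axis=1), axis=0)

# Direct 2-D DFT for reference.
direct = np.fft.fft2(x)
assert np.allclose(rc, direct)
```

The vector radix algorithm computes the same result; its saving lies only in how many nontrivial twiddle multiplications are performed along the way.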
The matrix form of the vector radix DIT FFT algorithm and the structure theorem can be
readily extended to higher dimensions, and so can the logic diagrams. Figure-10 can be
used as the VR-16*16 FFT BF in the vector radix-16*16 FFT algorithm, and so forth
[70].
This example not only shows the evolution of the VR-4*4 FFT algorithm from its
1-D counterpart, but also indicates that, to perform a 16*16-point 2-D DFT in a pipelined
computation, only one complex multiplier is required in a VLSI design [14, 15, 134,
137].
As this technique imposes no requirement on the radices, nor any
knowledge of how the decimation (DIT or DIF) procedure is undertaken to obtain the 1-D
FFTs, it is not surprising that mixed radix FFT algorithms can be derived by this approach
as well.
Example-5:
In this example, the mixed DIF and DIT vector radix FFT algorithm is derived to
compute an 8*8-point DFT; it is equivalent to that presented in Equation (3-6-11).
The 8*8-point 2-D DFT is first calculated using the row-column FFT algorithm with
different decimation techniques, as shown in Figure-11, where 32
nontrivial multiplications are involved. In Figure-11, row transforms are performed
using 1-D DIT FFTs as shown in Figure-2, and column transforms are computed by DIF
FFTs as shown in Figure-3. When the mixed vector radix FFT algorithm is applied to the
same problem (its logic diagram is shown in Figure-12), the number of
nontrivial multiplications is reduced to 24 after combining the row and column
twiddles. This example once again demonstrates that different multidimensional vector
radix FFT algorithms can be developed systematically using the structure theorem.
Even if the complete Equation (3-6-11) for the mixed vector radix FFT looks
somewhat complicated, its diagrammatical presentation is extremely clear and straightforward.
[Figure-11: The 8*8-point DFT computed by the row-column method, with 1-D DIT FFTs on the rows and DIF FFTs on the columns; 32 nontrivial multiplications.]

[Figure-12: The logic diagram of the mixed vector radix FFT algorithm for the 8*8-point DFT; 24 nontrivial multiplications after combining the row and column twiddles.]
Before considering the 2-D vector split-radix FFT algorithm and the comparative
study of various vector radix FFT algorithms, two points have to be made. One is that
this structural approach, in both its matrix form and its diagrammatical form, can be
extended to multidimensional cases with little difficulty. The other is that the 2-D direct
vector radix DCT algorithms were devised by examination of the logic diagrams of the
corresponding 1-D algorithms and were later verified by mathematical analysis.
The discussion of the combined factor vector radix-8*8 and vector radix-16*16
FFT algorithms is included in Appendix C.
3-8 2-D Vector Split-Radix FFT Algorithms
Another successful application of the structural approach is to generate the complete
equations for the DIF vector split-radix FFT [60]. The idea behind the split-radix
approach is quite simple. In one-dimensional Discrete Fourier Transform computation,
the 1-D split-radix approach [68] divides a length-N DFT into two DFTs of length N/2
when a radix-2 FFT is applied at the first stage. One of the resulting N/2 DFTs, which
involves the odd-indexed terms, is further decimated using radix-2. Thus the original DFT is
implemented by an N/2 DFT together with two N/4 DFTs, and an algorithm can be
devised to reduce the number of operations required to complete the transform. The
trouble is that when this very approach is applied to 2-D DFTs using the traditional
mathematical representations [36, 37], the final equation for the algorithm contains so
many terms that, without an understanding of its structural features, the derivation,
verification and implementation of the algorithm would be difficult
indeed [37, 71], not to mention its generalization to even higher dimensions.
Recently, the split-radix FFT algorithm has been extended to two
dimensions using Decimation-In-Frequency (DIF) [36] and Decimation-In-Time [37]. In
this section, the complete equations for the first stage of the vector split-radix DIF FFT
algorithm are derived using the structural approach [45, 60]. The structure theorem is used
initially to obtain the VR-2*2 and VR-4*4 DIF FFT equations; the split-radix idea is then
applied to compute the outputs with both indices even in a vector radix-2*2 step and
the rest in a vector radix-4*4 step. The algorithm is the two-dimensional counterpart of
the 1-D split-radix DIF FFT algorithm [68], and differs from the split vector radix 2-D
FFT [36] in the way in which the vector radices are divided.
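The 1-D decomposition described above can be written as a short recursion. The following is an illustrative time-decimated sketch of the split-radix idea (one N/2 DFT of the even-indexed samples plus two N/4 DFTs of the odd-indexed samples), not the thesis's own code; it is checked against a library FFT:

```python
import numpy as np

def split_radix_fft(x):
    """Split-radix FFT of a length-N sequence, N a power of two:
    an N/2 DFT (even samples) plus two N/4 DFTs (samples 4m+1 and 4m+3)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 1:
        return x
    if N == 2:
        return np.array([x[0] + x[1], x[0] - x[1]])
    U = split_radix_fft(x[0::2])    # N/2-point DFT of the even samples
    Z1 = split_radix_fft(x[1::4])   # N/4-point DFT of samples 4m+1
    Z3 = split_radix_fft(x[3::4])   # N/4-point DFT of samples 4m+3
    k = np.arange(N // 4)
    W1 = np.exp(-2j * np.pi * k / N)      # twiddles W_N^k
    W3 = np.exp(-2j * np.pi * 3 * k / N)  # twiddles W_N^{3k}
    s = W1 * Z1 + W3 * Z3
    d = W1 * Z1 - W3 * Z3
    X = np.empty(N, dtype=complex)
    X[k] = U[k] + s
    X[k + N // 2] = U[k] - s
    X[k + N // 4] = U[k + N // 4] - 1j * d
    X[k + 3 * N // 4] = U[k + N // 4] + 1j * d
    return X

x = np.random.default_rng(2).standard_normal(16)
assert np.allclose(split_radix_fft(x), np.fft.fft(x))
```

The "L-shaped" output recombination in the last four lines is what makes the split-radix butterfly irregular compared with a pure radix-2 or radix-4 stage.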
Using the structural approach [45], the vector radix-2*2 and vector radix-4*4
DIF FFT algorithms can be derived easily from the corresponding 1-D algorithms. The
matrix form of the 2-D vector radix-2*2 DIF FFT is given by the following equations,
assuming N1 = N2 = N.
\begin{bmatrix} x1(0,m0;0,n0) \\ x1(0,m0;1,n0) \\ x1(1,m0;0,n0) \\ x1(1,m0;1,n0) \end{bmatrix}
= \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix}
\begin{bmatrix} x(0,m0;0,n0) \\ x(0,m0;1,n0) \\ x(1,m0;0,n0) \\ x(1,m0;1,n0) \end{bmatrix}        (3-8-1a)

\begin{bmatrix} x1'(0,m0;0,n0) \\ x1'(0,m0;1,n0) \\ x1'(1,m0;0,n0) \\ x1'(1,m0;1,n0) \end{bmatrix}
= diag( 1, W_N^{n0}, W_N^{m0}, W_N^{m0+n0} )
\begin{bmatrix} x1(0,m0;0,n0) \\ x1(0,m0;1,n0) \\ x1(1,m0;0,n0) \\ x1(1,m0;1,n0) \end{bmatrix}        (3-8-1b)

\begin{bmatrix} X(k1,0;l1,0) \\ X(k1,0;l1,1) \\ X(k1,1;l1,0) \\ X(k1,1;l1,1) \end{bmatrix}
= \sum_{m0=0}^{N'-1} \sum_{n0=0}^{N'-1} W_{N'}^{m0 k1} W_{N'}^{n0 l1}
\begin{bmatrix} x1'(0,m0;0,n0) \\ x1'(0,m0;1,n0) \\ x1'(1,m0;0,n0) \\ x1'(1,m0;1,n0) \end{bmatrix}        (3-8-1c)

where k1, l1 = 0,1,...,N'-1, and N' = N/2.
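As an illustration of Equations (3-8-1a) to (3-8-1c), the following numpy sketch carries out one VR-2*2 DIF stage — butterfly, twiddling, then four N/2 * N/2 2-D DFTs — and checks it against a direct 2-D DFT. It is a hedged reading of the equations, with the remaining sub-DFTs computed by a library FFT rather than decimated further:

```python
import numpy as np

N = 8
Np = N // 2  # N' = N/2
rng = np.random.default_rng(3)
x = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))

# Quadrants x(m1,m0; n1,n0) = x[m1*N' + m0, n1*N' + n0], with m1, n1 in {0, 1}.
x00, x01 = x[:Np, :Np], x[:Np, Np:]
x10, x11 = x[Np:, :Np], x[Np:, Np:]

# Butterfly stage, Equation (3-8-1a).
x1 = {(0, 0): x00 + x01 + x10 + x11,
      (0, 1): x00 - x01 + x10 - x11,
      (1, 0): x00 + x01 - x10 - x11,
      (1, 1): x00 - x01 - x10 + x11}

# Twiddling stage, Equation (3-8-1b): diag(1, W^n0, W^m0, W^{m0+n0}).
m0 = np.arange(Np).reshape(-1, 1)
n0 = np.arange(Np).reshape(1, -1)
W = np.exp(-2j * np.pi / N)
tw = {(0, 0): 1, (0, 1): W ** n0, (1, 0): W ** m0, (1, 1): W ** (m0 + n0)}

# Remaining N' * N' 2-D DFTs, Equation (3-8-1c); output index k = 2*k1 + k0.
X = np.empty((N, N), dtype=complex)
for (k0, l0), block in x1.items():
    X[k0::2, l0::2] = np.fft.fft2(tw[(k0, l0)] * block)

assert np.allclose(X, np.fft.fft2(x))
```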
The vector radix-4*4 DIF FFT is described by the following equations:

[x1(k0,m0; l0,n0)] = I_f [x(m1,m0; n1,n0)]        (3-8-2a)

[x1'(k0,m0; l0,n0)] = E_f [x1(k0,m0; l0,n0)]        (3-8-2b)

[X(k1,k0; l1,l0)] = \sum_{m0=0}^{N''-1} \sum_{n0=0}^{N''-1} W_{N''}^{m0 k1} W_{N''}^{n0 l1} [x1'(k0,m0; l0,n0)]        (3-8-2c)

where k1, l1 = 0,1,...,N''-1, and N'' = N/4;
[X(k1,k0; l1,l0)] = [X(k1,0;l1,0), X(k1,0;l1,2), X(k1,0;l1,1), X(k1,0;l1,3),
                     X(k1,2;l1,0), X(k1,2;l1,2), X(k1,2;l1,1), X(k1,2;l1,3),
                     X(k1,1;l1,0), X(k1,1;l1,2), X(k1,1;l1,1), X(k1,1;l1,3),
                     X(k1,3;l1,0), X(k1,3;l1,2), X(k1,3;l1,1), X(k1,3;l1,3)]^T;

[x(m1,m0; n1,n0)] = [x(0,m0;0,n0), x(0,m0;2,n0), x(0,m0;1,n0), x(0,m0;3,n0),
                     x(2,m0;0,n0), x(2,m0;2,n0), x(2,m0;1,n0), x(2,m0;3,n0),
                     x(1,m0;0,n0), x(1,m0;2,n0), x(1,m0;1,n0), x(1,m0;3,n0),
                     x(3,m0;0,n0), x(3,m0;2,n0), x(3,m0;1,n0), x(3,m0;3,n0)]^T;

E_f = diag[ 1, W_N^{2n0}, W_N^{n0}, W_N^{3n0},
            W_N^{2m0}, W_N^{2m0+2n0}, W_N^{2m0+n0}, W_N^{2m0+3n0},
            W_N^{m0}, W_N^{m0+2n0}, W_N^{m0+n0}, W_N^{m0+3n0},
            W_N^{3m0}, W_N^{3m0+2n0}, W_N^{3m0+n0}, W_N^{3m0+3n0} ];

I_f = I_f4 ⊗ I_f4 = \begin{bmatrix} I_f4 & I_f4 & I_f4 & I_f4 \\ I_f4 & -I_f4 & I_f4 & -I_f4 \\ I_f4 & -j I_f4 & -I_f4 & j I_f4 \\ I_f4 & j I_f4 & -I_f4 & -j I_f4 \end{bmatrix},
\qquad
I_f4 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & -j & -1 & j \\ 1 & j & -1 & -j \end{bmatrix}.
The basic approach of the vector split-radix algorithm is to compute the outputs
with both indices even in a vector radix-2*2 step and the rest in a vector radix-4*4
step. Both indices are even in the first line of each of Equations (3-8-1a) to (3-8-1c).
Thus, in the 4*4 process, only twelve of the sixteen equations in Equation (3-8-2)
are required, since X(k1,0;l1,0), X(k1,0;l1,2), X(k1,2;l1,0) and X(k1,2;l1,2) have already been
obtained in the vector radix-2*2 step. This gives the first stage of the vector split-radix DIF
FFT decomposition, as shown below:
X(k1,0;l1,0) = \sum_{m0=0}^{N*-1} \sum_{n0=0}^{N*-1} W_{N*}^{m0 k1} W_{N*}^{n0 l1} x1(0,m0;0,n0)        (3-8-3a)

x1(0,m0;0,n0) = [1  1  1  1] [x(0,m0;0,n0), x(0,m0;1,n0), x(1,m0;0,n0), x(1,m0;1,n0)]^T        (3-8-3b)

where N* = N/2, k1, l1 = 0,1,...,N*-1; and,
66
[x1(k0,m0; l0,n0)]_m = I_fm [x(m1,m0; n1,n0)]        (3-8-3c)

[x1'(k0,m0; l0,n0)]_m = E_fm [x1(k0,m0; l0,n0)]_m        (3-8-3d)

[X(k1,k0; l1,l0)]_m = \sum_{m0=0}^{N''-1} \sum_{n0=0}^{N''-1} W_{N''}^{m0 k1} W_{N''}^{n0 l1} [x1'(k0,m0; l0,n0)]_m        (3-8-3e)

where k1, l1 = 0,1,...,N''-1, and N'' = N/4;

[X(k1,k0; l1,l0)]_m = [X(k1,0;l1,1), X(k1,0;l1,3),
                       X(k1,2;l1,1), X(k1,2;l1,3),
                       X(k1,1;l1,0), X(k1,1;l1,2), X(k1,1;l1,1), X(k1,1;l1,3),
                       X(k1,3;l1,0), X(k1,3;l1,2), X(k1,3;l1,1), X(k1,3;l1,3)]^T;

[x1(k0,m0; l0,n0)]_m = [x1(0,m0;1,n0), x1(0,m0;3,n0),
                        x1(2,m0;1,n0), x1(2,m0;3,n0),
                        x1(1,m0;0,n0), x1(1,m0;2,n0), x1(1,m0;1,n0), x1(1,m0;3,n0),
                        x1(3,m0;0,n0), x1(3,m0;2,n0), x1(3,m0;1,n0), x1(3,m0;3,n0)]^T;

E_fm = diag[ W_N^{n0}, W_N^{3n0}, W_N^{2m0+n0}, W_N^{2m0+3n0},
             W_N^{m0}, W_N^{m0+2n0}, W_N^{m0+n0}, W_N^{m0+3n0},
             W_N^{3m0}, W_N^{3m0+2n0}, W_N^{3m0+n0}, W_N^{3m0+3n0} ];

I_fm is the 12*16 matrix obtained from I_f = I_f4 ⊗ I_f4 of Equation (3-8-2) by deleting its first, second, fifth and sixth rows;

and [x(m1,m0; n1,n0)] is defined as in Equation (3-8-2). The first, second, fifth and sixth
rows of [X(k1,k0; l1,l0)], [x1(k0,m0; l0,n0)], E_f and I_f have been omitted to obtain
[X(k1,k0; l1,l0)]_m, [x1(k0,m0; l0,n0)]_m, E_fm and I_fm. All indices are the same as those
in Equation (3-8-2), but a long and tedious direct derivation has been avoided. The logic
diagram of the vector split-radix DIF FFT can be obtained by modifying the
corresponding logic diagrams of the VR-2*2 and VR-4*4 DIF FFT algorithms, which is a
simple procedure. Complete equations for the first-stage 2-D vector split-radix DIT FFT
algorithm can also be constructed by this simple approach [37].
3-9 Comparisons of Various 2-D Vector Radix FFT Algorithms
The comparison of vector radix FFT algorithms in this section mainly follows the
traditional criteria, i.e., arithmetic complexity, error analysis, in-place computation
and regularity of the computation structure, as mentioned in the previous chapter. Since
the analysis of arithmetic complexity in the early work on vector radix FFTs [42, 43],
there have been many other reports on the issue for different vector radix
algorithms [1, 36, 37, 44, 45, 60, 62]. The arithmetic complexity, in terms of
multiplications, of various vector radix FFTs is listed in Table-1 in comparison with row-column
FFTs, assuming complex inputs. N = 4096 is chosen because all
the vector radix FFT algorithms considered can be applied. It is worth noting that although
the split-radix method requires fewer multiplications than the other Cooley-Tukey-based
FFTs in 1-D DFT computations [68], its applications in the 2-D case [36, 37, 60]
are less effective than the Combined Factor (CF) VR-16*16 FFT in terms of multiply
operations [45, 70]. Besides, since vector radix FFTs preserve the regular computation
structure inherited from the 1-D Cooley-Tukey algorithms, they are bound to have advantages
in software and hardware DFT implementations [154]. They carry out an in-place
computation, and their numerical features are also superior to those of the row-column method.
Vector radix FFT algorithms consist of VR-2*2 BFs, regular twiddling multiplication
stages and regular index formation. These features, along with the pipelined and
parallel structure inherited from their 1-D counterparts, facilitate both software and
hardware implementation of fast 2-D DFT computation [134, 137].
To give a brief idea of the reduction in complex multiplications: for a 4096*4096 2-D
DFT problem, the number of complex multiplications required by the vector split-radix
DIF FFT [Appendix E] is only about 37% of that required by the radix-2 row-column
FFT algorithm [45]; about 65% of that required by the row-column FFT using the
1-D split-radix FFT [68]; 49% of that needed by the vector radix-2*2 FFT [43]; 66% of that required by
Table-1 Arithmetic complexity of FFT algorithms for 4096*4096 2-D DFTs,
in terms of multiplications

2-D FFT Algorithm | BF multiplications | TM multiplications | Total multiplications | Percentage (total)
RC R-2            | 0                  | 2*92,274,688       | 184,549,376           | 100.00%
RC R-4            | 0                  | 2*62,914,560       | 125,829,120           |  68.18%
RC R-8            | 2*16,777,216       | 2*44,040,192       | 121,634,816           |  65.91%
RC R-16           | 2*25,165,824       | 2*31,457,280       | 113,246,208           |  61.36%
RC SR FFT [68]    | N/A                | N/A                | 104,398,848           |  56.57%
VR-2*2            | 0                  | 138,412,032        | 138,412,032           |  75.00%
VR-4*4            | 0                  | 78,643,200         | 78,643,200            |  42.61%
VR-8*8            | 33,554,432         | 49,545,216         | 83,099,648            |  45.03%
VR-16*16          | 50,331,648         | 33,423,360         | 83,755,008            |  45.38%
CF VR-8*8         | 25,165,824         | 49,545,216         | 74,711,040            |  40.48%
CF VR-16*16       | 33,030,144         | 33,423,360         | 66,453,504            |  36.01%
VSR-1 [36]        | N/A                | N/A                | 102,676,560           |  55.64%
VSR-2 [60]        | N/A                | N/A                | 67,746,504            |  36.71%

NOTE:
RC R-i: the row-column 1-D radix-i FFT algorithm;
VR:     the vector radix 2-D FFT algorithm;
CF:     Combined Factor method applied;
N/A:    Not Applicable;
BF:     ButterFly computation structure;
TM:     Twiddling Multiplications;
SR:     Split-Radix FFT algorithm;
VSR:    Vector Split-Radix 2-D FFT algorithm.
a different vector split-radix DIF FFT approach [36]; and it is slightly (2%) inferior to the
combined factor vector radix-16*16 FFT algorithm [45, 70]. This algorithm needs
slightly more complex additions than the vector radix-2*2 FFT algorithm. Further
discussion of the issue can be found in [37].
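Two entries of Table-1 can be reproduced from first principles. The sketch below assumes the usual counting conventions (only nontrivial twiddle multiplications are counted, so the twiddle-free final stage drops out); this is an assumption about the bookkeeping, not a statement of the thesis's exact derivation:

```python
# Complex-multiplication counts for an N*N 2-D DFT, N = 4096 = 2**12.
N = 4096
log2N = 12

# Row-column radix-2: 2N passes of a 1-D radix-2 FFT, each pass costing
# (N/2)*(log2(N) - 1) nontrivial twiddle multiplications.
rc_r2 = 2 * N * (N // 2) * (log2N - 1)

# Vector radix-2*2: (3/4)*N*N nontrivial twiddles per stage, log2(N)-1 twiddled stages.
vr_22 = (3 * N * N // 4) * (log2N - 1)

assert rc_r2 == 184_549_376  # Table-1, RC R-2
assert vr_22 == 138_412_032  # Table-1, VR-2*2
print(f"VR-2*2 uses {100 * vr_22 / rc_r2:.2f}% of the RC R-2 multiplications")  # 75.00%
```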
3-10 Vector Radix FFT Using FDP™ A41102
The AUSTEK Frequency Domain Processor (FDP™) A41102, designed by the Australian
CSIRO, is a high-performance CMOS VLSI device providing a complete hardware
solution for implementing FFTs [14, 15]. Its main features include performing DFTs of up to 256
complex points within 102.4 µs, and 2-D 8*8-point or 16*16-point DFTs
in a single-processor configuration with a throughput of 2.5 Ms/s [28]. In [28], 2-D
512*512-point and 1024*1024-point DFTs are implemented using FDPs by the row-column
method. Although there are many publications in which multidimensional vector
radix FFT algorithms are shown by simulation to have computational advantages over the
row-column method, there have been very few reports on hardware implementation [134,
137]. In this section, it is demonstrated that when the vector radix method is used,
fewer FDPs are required to obtain the same 2-D FFT processing throughput.
The vector radix-8*8 FFT algorithm can be used to calculate 512*512-point DFTs.
The complete operation is divided into three vector radix-8*8 ButterFly (BF) stages and two
Twiddling Multiplication (TM) stages [70]. Since the VR-8*8 butterfly computation
structure is a 2-D 8*8-point DFT in its own right, it matters little whether it is implemented
by the row-column or the vector radix approach, so long as the most efficient computation
is achieved. In fact, the 2-D 8*8-point DFT is calculated by the row-column FFT on
the FDP A41102. Using the VR-8*8 FFT algorithm to perform 512*512-point DFTs, a
multi-FDP system design is described in Figure-13; it consists of three FDPs with
auxiliary discrete circuits and renders a processing rate of 2.5 Ms/s, compared with the four
FDPs required by the row-column procedure [28]. In this configuration, the VR-8*8 BFs
are calculated by the 2-D 8*8-point FFT function provided on the FDP A41102s, and the two TM
stages are performed using the two available uncommitted complex multipliers. Using a
[Figure-13: A three-FDP system design performing 512*512-point DFTs with the VR-8*8 FFT algorithm at 2.5 Ms/s.]

[Figure-14: Alternative three-FDP configurations performing 1024*1024-point DFTs with the mixed VR-16*16 and VR-8*8 FFT algorithm at 2.5 Ms/s.]
mixed VR-16*16 and VR-8*8 FFT algorithm, 1024*1024-point DFTs can also be
implemented using three FDPs, rendering a throughput of 2.5 Ms/s, with alternative
configurations as shown in Figure-14. In real-time image processing, a multi-processor
system has to be used, and a reduction in the number of processors means a decrease in
system complexity. These are but a few examples of what can be done using
the vector radix approach.
3-11 Summary
In this chapter, the structural approach to the construction of 2-D VR FFT
algorithms has been presented in both mathematical equations and diagrammatical forms.
This method helps in understanding the structures of various VR FFTs and, most
importantly, also eases the burden of the implementation task for electrical engineers.
Using the diagrammatical representation, modifying a VR FFT algorithm to fit
special design requirements becomes a simple task. The comparative study of various VR
FFTs summarizes their arithmetic complexities and also their merits in the
context of error analysis, in-place computation and regularity of the computation
structure.
The introduction of the FDP A41102 demonstrates a complete VLSI hardware
solution to the DFT computation. It has been shown that if the vector radix method were
used, the number of complex multipliers on the processor needed to perform either a 2-D 8*8- or
16*16-point DFT could be reduced to one. Even if the FDP is used in its current form,
incorporating the vector radix method in the application of 2-D DFTs for real-time image
processing will reduce the number of processors required to achieve the same
performance as with the row-column FFT. This means a reduction in
system complexity.
CHAPTER FOUR: A PERSPECTIVE ON VECTOR RADIX FFT
ALGORITHMS OF HIGHER DIMENSIONS
As discussed in the introduction, multidimensional (m-D) Discrete Fourier
Transforms (DFTs) of dimension three or greater have been used in the
reconstruction of 3-D microscopic-scale objects to remove out-of-focus noise [16], in Nuclear
Magnetic Resonance (NMR) imaging algorithms [9], and in computer vision and pattern analysis
to provide a better understanding of the dynamics of the visual system [10-12]. When the
dimension of the problem increases, the computational burden becomes heavy, so the
saving in computation time from efficient fast algorithms is of even greater
significance [45]. Because of the complexity of the problem involved, a systematic
approach is required for the comprehension, derivation, construction and effective
implementation of multidimensional fast algorithms. The structural approach
introduced in Chapter Three is at least one technique capable of being developed to
higher dimensions, as it has been successfully demonstrated in the construction
of 2-D vector radix FFT algorithms [45, 60] and of the 2-D direct vector radix fast Discrete
Cosine Transform (DCT) algorithms to be discussed shortly after this chapter
[80]. The approach can also assist in developing computer programs for
multidimensional vector radix algorithms, especially when the computer programs for the
corresponding 1-D fast algorithms are available.
In this chapter, the structural approach for the construction of m-D (m >= 3) fast
vector radix FFT algorithms is closely examined using both matrix and diagrammatical
forms. From the definitions of the multidimensional DFT and its inverse, equations which
represent multidimensional vector radix Decimation-In-Time and Decimation-In-Frequency
FFTs are derived. A structural approach based on the matrix representation is
described and used to construct multidimensional vector radix FFTs. A recursive
logic diagram symbol system is then presented to show how an m-D (m >= 3) vector radix
FFT algorithm can be derived and represented in graphical form. An example is also
given to demonstrate the simple procedure required to construct a vector radix-4*4*4 FFT
algorithm for a 16*16*16-point 3-D DFT using the symbol system. Since the
diagrammatical approach imposes no restrictions on how
the decimations (DIT or DIF) are applied to each dimension, various vector radix
FFT algorithms can be constructed by this method. Although not discussed in this thesis,
the material presented in this chapter can be extended to m-D (m >= 3) fast vector radix
DCT algorithms as well.
4-1 Definitions
As mentioned in the previous chapter, the multidimensional DFT of dimension m is
defined as:

X(k1,k2,...,km) = \sum_{n1=0}^{N1-1} \sum_{n2=0}^{N2-1} ... \sum_{nm=0}^{Nm-1} x(n1,n2,...,nm) W_{N1}^{n1 k1} W_{N2}^{n2 k2} ... W_{Nm}^{nm km}        (4-1-1a)

and its inverse is defined as:

x(n1,n2,...,nm) = (1/(N1 N2 ... Nm)) \sum_{k1=0}^{N1-1} \sum_{k2=0}^{N2-1} ... \sum_{km=0}^{Nm-1} X(k1,k2,...,km) W_{N1}^{-n1 k1} W_{N2}^{-n2 k2} ... W_{Nm}^{-nm km}        (4-1-1b)

where ki, ni = 0,1,...,Ni-1; i = 1,2,...,m. In their matrix forms,

X = W^m x        (4-1-2a)

and

x = (1/(N1 N2 ... Nm)) W^{-m} X        (4-1-2b)

where W^m = W_{N1} ⊗ W_{N2} ⊗ ... ⊗ W_{Nm}, with W_{Ni}, i = 1,...,m, representing the Ni-point
1-D DFT matrix; W^{-m} = W_{N1}^{-} ⊗ W_{N2}^{-} ⊗ ... ⊗ W_{Nm}^{-}, with W_{Ni}^{-}, i = 1,...,m,
representing the Ni-point 1-D inverse DFT matrix (without the scale factor); X and x are N1 N2 ... Nm-component column vectors
of the output and input sequences respectively (also see Example-6).
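The tensor-product form of Equation (4-1-2a) can be verified directly for a small 2-D case. This hedged numpy sketch builds W^m as the Kronecker product of two 1-D DFT matrices and compares it with a direct 2-D DFT (the function name is illustrative, not the thesis's notation):

```python
import numpy as np

def dft_matrix(n):
    """The n-point 1-D DFT matrix, entries W_n^{jk} = exp(-2*pi*i*j*k/n)."""
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-2j * np.pi * j * k / n)

N1, N2 = 4, 8
x = np.random.default_rng(4).standard_normal((N1, N2))

# W^m = W_{N1} (x) W_{N2}, acting on x flattened in row-major order.
Wm = np.kron(dft_matrix(N1), dft_matrix(N2))
X = (Wm @ x.reshape(-1)).reshape(N1, N2)

assert np.allclose(X, np.fft.fft2(x))
```

The same construction extends to any m by chaining further Kronecker factors, which is exactly what the structure theorems below exploit stage by stage.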
If DIT is used on all indices of the m-D DFT, assuming Ni = ri * Ni', i = 1,...,m,
set:

ki = ki1*Ni' + ki0;    ni = ni1*ri + ni0;

where ki1, ni0 = 0,1,...,ri-1; ki0, ni1 = 0,1,...,Ni'-1. Then:

X(k11,k10; k21,k20; ...; km1,km0) =
    \sum_{n10=0}^{r1-1} \sum_{n20=0}^{r2-1} ... \sum_{nm0=0}^{rm-1}
    [ \sum_{n11=0}^{N1'-1} \sum_{n21=0}^{N2'-1} ... \sum_{nm1=0}^{Nm'-1}
      x(n11,n10; n21,n20; ...; nm1,nm0) W_{N1'}^{n11 k10} W_{N2'}^{n21 k20} ... W_{Nm'}^{nm1 km0} ]
    * W_{N1}^{n10 k10} W_{N2}^{n20 k20} ... W_{Nm}^{nm0 km0}
    * W_{r1}^{n10 k11} W_{r2}^{n20 k21} ... W_{rm}^{nm0 km1}        (4-1-3)

Accordingly, the m-D DIF VR FFT and mixed VR FFT equations can be derived.
If DIF is used on all indices of the m-D DFT, assuming Ni = ri * Ni', i = 1,...,m, set:

ki = ki1*ri + ki0;    ni = ni1*Ni' + ni0;

where ki1, ni0 = 0,1,...,Ni'-1; ki0, ni1 = 0,1,...,ri-1. Then:

X(k11,k10; k21,k20; ...; km1,km0) =
    \sum_{n10=0}^{N1'-1} \sum_{n20=0}^{N2'-1} ... \sum_{nm0=0}^{Nm'-1}
    W_{N1'}^{n10 k11} W_{N2'}^{n20 k21} ... W_{Nm'}^{nm0 km1}
    [ \sum_{n11=0}^{r1-1} \sum_{n21=0}^{r2-1} ... \sum_{nm1=0}^{rm-1}
      x(n11,n10; n21,n20; ...; nm1,nm0) W_{r1}^{n11 k10} W_{r2}^{n21 k20} ... W_{rm}^{nm1 km0} ]
    * W_{N1}^{n10 k10} W_{N2}^{n20 k20} ... W_{Nm}^{nm0 km0}        (4-1-4)
Since there is more than one dimension, different decimations can be applied to the
indices of different dimensions, so there are further variations of the vector radix FFT
algorithm. A unified form for the mixed VR FFT algorithms is, however, difficult to present.
4-2 Matrix Representations and Structure Theorems
A matrix form for the DIT vector radix FFT algorithm presented by Equation (4-1-3)
can be given as follows:

[X(k11,k10; k21,k20; ...; km1,km0)] = I_t [x1'(k10,n10; k20,n20; ...; km0,nm0)]        (4-2-1a)

[x1'(k10,n10; k20,n20; ...; km0,nm0)] = F_t [x1(k10,n10; k20,n20; ...; km0,nm0)]        (4-2-1b)

[x1(k10,n10; k20,n20; ...; km0,nm0)] =
    \sum_{n11=0}^{N1'-1} \sum_{n21=0}^{N2'-1} ... \sum_{nm1=0}^{Nm'-1}
    W_{N1'}^{n11 k10} W_{N2'}^{n21 k20} ... W_{Nm'}^{nm1 km0} [x(n11,n10; n21,n20; ...; nm1,nm0)]        (4-2-1c)

where [x(n11,n10; ...; nm1,nm0)], [x1(k10,n10; ...; km0,nm0)], [x1'(k10,n10; ...; km0,nm0)] and
[X(k11,k10; ...; km1,km0)] are r1*r2*...*rm-component column vectors with ki1 and ni0 (1 <= i <= m)
varying in bit-reversed order; F_t is the twiddle factor matrix, an r1r2...rm * r1r2...rm diagonal
matrix whose element F_t(i,i) (1 <= i <= r1r2...rm) equals W_{N1}^{n10 k10} W_{N2}^{n20 k20} ... W_{Nm}^{nm0 km0}
accordingly; and I_t is the matrix of the m-D vector radix-r1*r2*...*rm butterfly
structure, also an r1r2...rm * r1r2...rm matrix, whose element I_t(i,j) (1 <= i,j <= r1r2...rm)
equals W_{r1}^{n10 k11} W_{r2}^{n20 k21} ... W_{rm}^{nm0 km1} correspondingly. Equation (4-2-1c)
contains r1*r2*...*rm N1'*N2'*...*Nm'-point m-D DFTs which can be further
decimated.

The generalization of the structure theorem for the m-D DIT case is stated as
follows:

If Ni = ri * Ni' in an m-D DFT defined by Equation (4-1-1), and the 1-D DIT FFT
algorithms are given by:

[X(ki1,ki0)] = I_t^{ri} [x1'(ki0,ni0)]        (4-2-2a)

[x1'(ki0,ni0)] = F_t^{Ni} [x1(ki0,ni0)]        (4-2-2b)

[x1(ki0,ni0)] = \sum_{ni1=0}^{Ni'-1} W_{Ni'}^{ni1 ki0} [x(ni1,ni0)]        (4-2-2c)

where 1 <= i <= m; 0 <= ki1, ni0 <= ri-1; 0 <= ki0, ni1 <= Ni'-1; then the DIT m-D vector
radix-r1*r2*...*rm FFT algorithm is given by Equation (4-2-1), where:

F_t = F_t^{N1} ⊗ F_t^{N2} ⊗ ... ⊗ F_t^{Nm};        (4-2-3a)

I_t = I_t^{r1} ⊗ I_t^{r2} ⊗ ... ⊗ I_t^{rm}.        (4-2-3b)
Similarly, a matrix form for the DIF vector radix FFT algorithm given by Equation
(4-1-4) can be presented as follows:

[x1(k10,n10; k20,n20; ...; km0,nm0)] = I_f [x(n11,n10; n21,n20; ...; nm1,nm0)]        (4-2-4a)

[x1'(k10,n10; k20,n20; ...; km0,nm0)] = F_f [x1(k10,n10; k20,n20; ...; km0,nm0)]        (4-2-4b)

[X(k11,k10; k21,k20; ...; km1,km0)] =
    \sum_{n10=0}^{N1'-1} \sum_{n20=0}^{N2'-1} ... \sum_{nm0=0}^{Nm'-1}
    W_{N1'}^{n10 k11} W_{N2'}^{n20 k21} ... W_{Nm'}^{nm0 km1} [x1'(k10,n10; k20,n20; ...; km0,nm0)]        (4-2-4c)

where [x(n11,n10; ...; nm1,nm0)], [x1(k10,n10; ...; km0,nm0)], [x1'(k10,n10; ...; km0,nm0)] and
[X(k11,k10; ...; km1,km0)] are r1*r2*...*rm-component column vectors with ki0 and ni1 (1 <= i <= m)
varying in bit-reversed order; F_f is the twiddle factor matrix, an r1r2...rm * r1r2...rm diagonal
matrix whose element F_f(i,i) (1 <= i <= r1r2...rm) equals W_{N1}^{n10 k10} W_{N2}^{n20 k20} ... W_{Nm}^{nm0 km0}
accordingly; and I_f is the matrix of the m-D vector radix-r1*r2*...*rm butterfly structure,
also an r1r2...rm * r1r2...rm matrix, whose element I_f(i,j) (1 <= i,j <= r1r2...rm)
equals W_{r1}^{n11 k10} W_{r2}^{n21 k20} ... W_{rm}^{nm1 km0} correspondingly. Equation (4-2-4c)
contains r1*r2*...*rm N1'*N2'*...*Nm'-point m-D DFTs which can be further
decimated.

The generalization of the structure theorem for the m-D DIF case is stated as
follows:

If Ni = ri * Ni' in an m-D DFT defined by Equation (4-1-1), and the 1-D DIF FFT
algorithms are given by:

[x1(ki0,ni0)] = I_f^{ri} [x(ni1,ni0)]        (4-2-5a)

[x1'(ki0,ni0)] = F_f^{Ni} [x1(ki0,ni0)]        (4-2-5b)

[X(ki1,ki0)] = \sum_{ni0=0}^{Ni'-1} W_{Ni'}^{ni0 ki1} [x1'(ki0,ni0)]        (4-2-5c)

where 1 <= i <= m; 0 <= ki1, ni0 <= Ni'-1; 0 <= ki0, ni1 <= ri-1; then the DIF m-D vector
radix-r1*r2*...*rm FFT algorithm is given by Equation (4-2-4), where:

F_f = F_f^{N1} ⊗ F_f^{N2} ⊗ ... ⊗ F_f^{Nm};        (4-2-6a)

I_f = I_f^{r1} ⊗ I_f^{r2} ⊗ ... ⊗ I_f^{rm}.        (4-2-6b)
To obtain the complete equations for an m-D DIT or DIF vector radix FFT, one simply
applies the theorem to the remaining short-length m-D DFTs repeatedly; this makes the
derivation simpler and the programming easier, especially when the corresponding 1-D
algorithm(s) or program(s) are available.
4-3 Diagrammatical Presentations
The logic diagram for an m-D vector radix FFT is much simpler than its matrix
representation. Because the representation for m-D, m >= 3, is the same as that
for 2-D except for the definitions of the symbols, a recursive symbol system can be
developed. Consider the procedure for developing a vector radix-r1*r2*...*ri (1 <= i <= m,
where m is the dimension of the DFT) butterfly structure along with its twiddle
multiplication stage.

In Figure-15-(a), xj (0 <= j <= ri-1) is a vector of dimension r1*r2*...*r(i-1),
the elements of which are in natural order. The symbol labelled VR-r1*r2*...*r(i-1) BF
is the (i-1)-dimensional vector radix-r1*r2*...*r(i-1) butterfly computation
structure, and that labelled VR-r1*r2*...*r(i-1) TM is the (i-1)-dimensional vector
radix-r1*r2*...*r(i-1) twiddling multiplication. x'j (0 <= j <= ri-1) is a vector of r1*r2*...*r(i-1)
dimensions with elements in bit-reversed order. The symbol labelled R-ri BF on
Dimension-i is a 1-dimensional radix-ri butterfly computation structure which works on
dimension i. x1j (0 <= j <= ri-1) is a vector of r1*r2*...*r(i-1) dimensions. The elements of
x1j are in bit-reversed order, as are the output vectors x1j (0 <= j <= ri-1) themselves.
[Figure-15: The recursive symbol system: (a) the (i-1)-dimensional VR-r1*r2*...*r(i-1) BF and TM symbols feeding 1-D radix-ri stages; (b) the diagram after applying the basic modification rules; (c) the combined i-dimensional VR-r1*r2*...*ri BF and TM stage.]
Using the basic modification rules of logic diagrams, the logic diagram shown in
Figure-15-(b) is derived. Combining the (i-1)-dimensional VR-r1*r2*...*r(i-1) BF
with the 1-dimensional radix-ri BF results in an i-dimensional VR-r1*r2*...*
r(i-1)*ri BF computational structure. When the VR-r1*r2*...*r(i-1) TM is combined with the radix-ri
TM, an i-dimensional twiddling multiplication stage is formed in this
symbol system, as shown in Figure-15-(c). It is thus possible to build an m-dimensional
vector radix FFT algorithm by starting from the 1-D FFT algorithm(s) and repeating this
construction until the required algorithm is achieved.
Example-6:
The procedure to derive a vector radix-4*4*4 F F T algorithm using the logic
diagram to compute a 16* 16* 16-point 3-D DFT, given a 1-D radix-4 FFT algorithm for a
16-point 1-D DFT, is as follows:
(a) draw the logic diagram using radix-4 FFT for the 1-D 16-point D F T as shown
in Figure-6;
(b) determine the logic diagram using the "row-column" radix-4 FFT method on
the 16* 16* 16-point 3-D D F T (not shown);
(c) use the vector radix-4*4 FFT to replace the 2-D row-column FFT algorithm
on 2-D data vectors as shown in Figure-16; blocks inscribed by VR-4*4 BFa,
VR-4*4 T M and VR-4*4 BFb are defined in Figure-10;
(d) modify Figure-16 to Figure-17 and combine twiddle factors to obtain the
vector radix-4*4*4 FFT algorithm. The major difference between Figure-10
and Figure-17 is that all the symbols of Figure-10 represent and operate on
vectors of size 16 whilst those of Figure-17, of size 256 (or 16*16) where xi
= [x(i,0,0), x(i,0,l), .... x(i,0,15), x(i,l,0), .... x(i,l,15), ..., x(i,15,0), ...,
x(i,15,15)], i = 0 15.
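The row-column principle which step (b) starts from can be checked numerically. The following sketch (illustrative only, not part of the thesis implementation) applies 1-D FFT passes along each axis of a 16*16*16 block and confirms that the result equals the direct 3-D DFT:

```python
import numpy as np

# Row-column evaluation of a 3-D DFT: one 1-D FFT pass per dimension.
# The 16*16*16 size and the random test data are illustrative.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16, 16)) + 1j * rng.standard_normal((16, 16, 16))

X = x.copy()
for axis in range(3):          # successive 1-D transforms along each axis
    X = np.fft.fft(X, axis=axis)

assert np.allclose(X, np.fft.fftn(x))   # matches the direct 3-D DFT
```

The vector radix construction replaces groups of these 1-D stages with multidimensional butterflies, but the overall input-output map remains this 3-D DFT.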
82
[Figure-16 and Figure-17: Logic diagrams deriving the vector radix-4*4*4 FFT algorithm for the 16*16*16-point 3-D DFT (figures not reproduced).]
84
4-4 Computing Power Limitations

When the dimension of a DFT increases, the number of operations required for its computation increases dramatically. At the current stage of VLSI technology, only m-D (m > 3) DFTs of relatively small size can be processed at real-time speed [13, 28, 129, 130]. The difference between the computation times of an addition and a multiplication has been reduced, in some cases to nothing [40], so that the total number of numerical operations becomes the key issue [129]. The time used for data transfers also becomes significant, so that in-place computation and a regular computing structure will certainly be crucial in m-D DFT calculations [129, 130]. So far the implementation of m-D DFTs by and large uses the "row-column" method, whether on VLSI [2, 13-15], on Very Long Instruction Word (VLIW) architecture supercomputers [129], or on distributed-memory multiprocessor supercomputers [130]. Amongst the few reports that use m-D fast vector radix FFT algorithms is the one by Liu and Hughes [134]. Although [134] discusses only the implementation of the vector radix-2*2 FFT, many of its advantages over the row-column method have already been shown. The savings of vector radix FFT algorithms over the row-column FFT become substantial as the dimension of the DFT increases and/or higher radices are used, as indicated by Table-2 [1, 43-45].
Three very active areas associated with the hardware implementation of DFTs are ASICs [14-15, 134, 137, 142], systolic array designs [135, 136, 140, 141, 145, 147] and neural networks [138]. Still, even the latest successful implementations can only cater for 1-D DFTs, or for very small 2-D or 3-D DFTs, at real-time speed, with the neural network approach in its early stage.

Complete hardware solutions to m-D (m > 3) DFT problems are dependent upon future development of VLSI technology, on understanding different m-D algorithms, and on the ability to construct them in a systematic way. It has been shown by many [134, 141, 147] that, beyond arithmetic complexity and maximal use of pipelining and parallelism, fast algorithms chosen for VLSI implementation should above all possess a regular computation structure to enable
85
Table-2
Arithmetic Complexity of FFT Algorithms for 64*64*64 3-D DFTs

3-D FFT        Number of BF      Number of TM      Total number of    Percentage
algorithm      multiplications   multiplications   multiplications    (total mults.)
R-2                     0           3*655,360         1,966,080         100.00%
R-4                     0           3*393,216         1,179,648          60.00%
R-8             3*131,072           3*229,376         1,081,344          55.00%
VR-2*2*2                0           1,146,880         1,146,880          58.33%
VR-4*4*4                0             516,096           516,096          26.25%
VR-8*8*8          393,216             261,632           654,848          33.31%
CF VR-8*8*8       229,376             261,632           491,008          24.97%

NOTE:
R-i: the "row-column" 1-D radix-i FFT algorithm;
VR: the vector radix FFT algorithm;
CF: Combined Factor method applied;
BF: butterfly computation structure;
TM: twiddling multiplications.
86
systematic VLSI integration. The understanding of such algorithms and their implementation would be greatly assisted by an understanding of their computational structures.
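The percentage column of Table-2 is each algorithm's total multiplication count relative to the row-column radix-2 reference, which can be verified with a few lines of arithmetic (totals copied from the table):

```python
# Verify the percentage column of Table-2: total multiplications of each
# 3-D FFT algorithm relative to the row-column radix-2 reference.
totals = {
    "R-2": 1_966_080, "R-4": 1_179_648, "R-8": 1_081_344,
    "VR-2*2*2": 1_146_880, "VR-4*4*4": 516_096,
    "VR-8*8*8": 654_848, "CF VR-8*8*8": 491_008,
}
ref = totals["R-2"]
pct = {name: round(100 * t / ref, 2) for name, t in totals.items()}

assert pct["R-4"] == 60.00 and pct["R-8"] == 55.00
assert pct["VR-4*4*4"] == 26.25 and pct["CF VR-8*8*8"] == 24.97
```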
87
PART II.
MULTIDIMENSIONAL DISCRETE COSINE TRANSFORMS
88
CHAPTER FIVE: INTRODUCTION TO MULTIDIMENSIONAL
DISCRETE COSINE TRANSFORMS
The Discrete Cosine Transform (DCT) was first introduced into digital signal processing for the purposes of pattern recognition and Wiener filtering [17]. The two-dimensional (2-D) DCT is used for transform coding of images in telecommunications such as video-conferencing, video telephony, video image compression for HDTV and applications in fast packet switching networks [3, 18, 19, 157]. Its performance is virtually indistinguishable from that of the optimal Karhunen-Loeve transform [3, 17] in terms of energy packing ability, decorrelation efficiency and least mean-square error. Many fast DCT algorithms require only real-number operations and possess fairly regular computational structures similar to those of FFTs and vector radix FFTs, which substantially facilitates software and hardware implementations. The DCT has, by now, become the standard decorrelation transform for compression of 1-D and 2-D signals [72, 73].
To implement the 2-D DCT, many fast algorithms are available, and these algorithms are basically divided into two groups:

Direct fast algorithms, which are based on matrix factorization of the DCT matrix or on computation of a long-length DCT by shorter-length DCTs;

Indirect fast algorithms, which compute the DCT through an FFT of the same size [34, 57, 82] or through other fast algorithms [79, 113, 124, 156].

In each group there are two approaches: the row-column approach, where the 2-D DCT is generated by repeated application of a 1-D DCT, and the 2-D fast algorithm approach. In each group there are many fast algorithms, as shown in Figure-18, which is by no means exhaustive.

In 2-D image transform coding, the original image is usually divided into 8*8 or 16*16 blocks, and these blocks are cosine transformed. For real-time video coding, it is assumed that the transmission rate is 30 frames/second with a frame size of 288*352 pixels, so a processing rate of about 3.04 Msamples/s is required. This means that an 8*8-point DCT
89
[Figure-18: A family tree of fast DCT algorithms, grouped into direct and indirect methods with row-column and 2-D variants (figure not reproduced).]
90
has to be carried out in about 21 µs, or a 16*16-point DCT has to be calculated within about 84 µs.
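These timing budgets follow directly from the frame parameters; a short check (illustrative):

```python
# Recompute the processing-rate figures quoted above: 288*352 pixels per
# frame at 30 frames/s, and the time budget per 8*8 or 16*16 block.
rate = 288 * 352 * 30          # samples per second, about 3.04 Msamples/s
t8 = 64 / rate                 # time budget for one 8*8-point DCT
t16 = 256 / rate               # time budget for one 16*16-point DCT

assert abs(rate - 3.04e6) < 0.01e6
assert abs(t8 - 21e-6) < 1e-6 and abs(t16 - 84e-6) < 1e-6
```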
Over the last couple of years, VLSI fabrications of DCT processors which provide real-time image coding throughput have been reported. Currently, these DCT processors can only perform fixed-point calculation. The length of the input data format varies from 9 to 16 bits, as does that of the output of the DCT processors. Usually, these are adequate for coding applications. Examples of DCT processors are the IMS A121 DCT processor (inmos) [88], the STV3200 DCT processor (SGS-Thomson) [87], and the TMC2311 (TRW) [89]. Because the transform size used in image coding is relatively small, being 8*8 or 16*16, the performance of the above DCT processors is very close in terms of processing speed. Their processing rate varies from 13 MHz to 27 MHz, which caters for real-time image coding. For the same reason, various row-column algorithms, including the direct matrix multiplication method, have been used in VLSI DCT processors without showing a great deal of difference in speed performance for image coding. New developments in this area can be found in [158] and [159]. A comparative study of the error performance of these DCT processors remains to be undertaken.
Of the two classes of 1-D fast DCT algorithms, the indirect approach shows little advantage over the direct approach in terms of arithmetic complexity, and it usually does not have a regular computation structure and involves an excessive number of additions [72]. These drawbacks carry over to their 2-D extensions, although 2-D indirect methods have been reported to need fewer multiplications [72]. Amongst the direct algorithms, the Lee algorithm is by far the most efficient in terms of the number of multiplications (and the total number of numerical operations) [76, 78], and it has a regular, systematic and simple computation structure. However, the algorithm requires inversion or division of the cosine coefficients, which has been claimed to cause numerical instabilities because of roundoff errors in finite-length registers [38, 72, 76, 77]. This problem will be examined in the next chapter in comparison with other methods. Hou introduced a new fast DCT algorithm [77] which uses bit-shifting and data shuffling for better numerical performance. Hou's algorithm is as efficient as Lee's in terms of the number of multiplications and additions and also has a simple, regular structure. When these direct 1-D algorithms are extended into 2-D applications, their structural features are preserved, as indicated previously.
In the context of fast computation of 2-D DCTs, there are several reports on 2-D indirect algorithms [57, 72, 79], whilst the direct method up to now is dominated by row-column 1-D algorithms. The 2-D direct fast DCT algorithm [38], though more efficient, remains less well known [65, 72, 77]. Besides, not all 1-D direct fast DCT algorithms can be expanded into 2-D fast algorithms effectively. The only one which has been reported is the 2-D fast algorithm by Haque based on Lee's method [38]. In [38], the direct matrix decomposition method is used to expand Lee's algorithm, and the improvement of the new algorithm over several other known algorithms is demonstrated in terms of the number of operations. It has also been shown that the roundoff errors in Lee's algorithm do not cause serious problems for small sizes such as 8*8- and 16*16-point 2-D data-block DCTs, which are commonly used in image coding applications.

We use a structured approach on Lee's algorithm directly to generate a 2-D fast DCT algorithm, reproducing the Haque algorithm. A 2-D logic diagram is also used to represent the algorithm, so that 8*8- and 16*16-point 2-D DCTs using the new 2-D fast DCT algorithm are readily devised from the 1-D Lee algorithm and easily implemented [46, 80, 81]. To avoid the roundoff errors that Lee's algorithm may cause, a new two-dimensional fast DCT algorithm has been devised based on Hou's algorithm [80, 81, 108] using the same technique. Both algorithms are equally efficient.
5-1 Definitions of 1-D DCT and Its Inverse DCT

The definitions of the N-point 1-D discrete cosine transform and its inverse are given by the following equations [3, 4, 17, 38, 76, 77, 82]:

X(k) = (2/N) e(k) \sum_{n=0}^{N-1} x(n) C_{2N}^{(2n+1)k}    (5-1-1a)

and,

x(n) = \sum_{k=0}^{N-1} e(k) X(k) C_{2N}^{(2n+1)k}    (5-1-1b)

where n, k = 0, 1, ..., N-1; C_{2N}^{(2n+1)k} = cos[π(2n+1)k/(2N)]; and

e(k) = 1/√2 if k = 0; 1 otherwise.

In its matrix form, Equation (5-1-1) can be written as:

X = C^O x    (5-1-2a)

and,

x = C^O_I X    (5-1-2b)

where C^O = E T C; C^O_I = C_I E; E = diag[1/√2, 1, ..., 1]; T = diag[2/N, ..., 2/N]; X is the DCT vector; x the data vector; C(i,j) = C_{2N}^{(2j+1)i}; C_I = (C)^T; and the superscript T stands for the transpose operation of a matrix.

In order to derive fast algorithms, define

X̂(k) = [N/(2 e(k))] X(k)    (5-1-3a)

and,

X̃(k) = e(k) X(k)    (5-1-3b)

resulting in the "denormalized" DCT and IDCT as shown in Equation (5-1-4):

X̂(k) = \sum_{n=0}^{N-1} x(n) C_{2N}^{(2n+1)k}    (5-1-4a)

and,

x(n) = \sum_{k=0}^{N-1} X̃(k) C_{2N}^{(2n+1)k}    (5-1-4b)

where n, k = 0, 1, ..., N-1.

In matrix form:

X̂ = C x    (5-1-5a)

and,

x = C_I X̃ = (C)^T X̃    (5-1-5b)

where T stands for the transpose operation.

It is from Equation (5-1-4) or Equation (5-1-5) that fast algorithms are derived. Operations involving E and T can be applied either before or after the denormalized DCT or IDCT is performed.
Example-7: For a 4-point DCT,

C^O = diag[1/√2, 1, 1, 1] diag[1/2, 1/2, 1/2, 1/2]
      [ C_8^0   C_8^0   C_8^0   C_8^0 ]
      [ C_8^1   C_8^3  -C_8^3  -C_8^1 ]
      [ C_4^1  -C_4^1  -C_4^1   C_4^1 ]
      [ C_8^3  -C_8^1   C_8^1  -C_8^3 ]    (5-1-6)

and the 4-point IDCT matrix is given by

C^O_I = [ C_8^0   C_8^1   C_4^1   C_8^3 ]
        [ C_8^0   C_8^3  -C_4^1  -C_8^1 ]
        [ C_8^0  -C_8^3  -C_4^1   C_8^1 ]
        [ C_8^0  -C_8^1   C_4^1  -C_8^3 ] diag[1/√2, 1, 1, 1]    (5-1-7)
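The normalization in Equation (5-1-2) makes the IDCT matrix of Example-7 the exact inverse of the DCT matrix, which can be confirmed numerically (an illustrative sketch, not part of the thesis software):

```python
import numpy as np

# Numerical check of Example-7: with C(i,j) = cos(pi*(2j+1)*i/(2N)),
# the matrices C^O = E*T*C and C^O_I = C^T*E from Equation (5-1-2)
# satisfy C^O_I @ C^O = I for N = 4.
N = 4
i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
C = np.cos(np.pi * (2 * j + 1) * i / (2 * N))
E = np.diag([1 / np.sqrt(2), 1, 1, 1])
T = np.diag([2 / N] * N)

C_O = E @ T @ C        # forward DCT matrix, Equation (5-1-6)
C_O_I = C.T @ E        # inverse DCT matrix, Equation (5-1-7)
assert np.allclose(C_O_I @ C_O, np.eye(N))
```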
5-2 Definitions of 2-D DCT and Its Inverse DCT

The N*N-point 2-D DCT and its inverse (IDCT) are given by Equations (5-2-1a) and (5-2-1b) [4, 57, 72, 80]:

X(k,l) = (4/N^2) e(k) e(l) \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} x(n,m) C_{2N}^{(2n+1)k} C_{2N}^{(2m+1)l}    (5-2-1a)

and,

x(n,m) = \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} e(k) e(l) X(k,l) C_{2N}^{(2n+1)k} C_{2N}^{(2m+1)l}    (5-2-1b)

In its matrix form, Equation (5-2-1) can be written as:

X = C^O x    (5-2-2a)

and,

x = C^O_I X    (5-2-2b)

where x and X are formed by stacking the transposed row vectors of the input and output 2-D arrays respectively; C^O = (E ⊗ E)(T ⊗ T)(C ⊗ C); C^O_I = (C_I ⊗ C_I)(E ⊗ E); E = diag[1/√2, 1, ..., 1]; T = diag[2/N, ..., 2/N]; C(i,j) = C_{2N}^{(2j+1)i}; and C_I = (C)^T. The symbol ⊗ stands for the tensor product.

Defining

X̂(k,l) = [N^2/(4 e(k) e(l))] X(k,l)    (5-2-3a)

and,

X̃(k,l) = e(k) e(l) X(k,l)    (5-2-3b)

results in the denormalized 2-D DCT and IDCT as shown in Equation (5-2-4) [46, 80]:

X̂(k,l) = \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} x(n,m) C_{2N}^{(2n+1)k} C_{2N}^{(2m+1)l}    (5-2-4a)

and,

x(n,m) = \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} X̃(k,l) C_{2N}^{(2n+1)k} C_{2N}^{(2m+1)l}    (5-2-4b)

In matrix form:

X̂ = (C ⊗ C) x    (5-2-5a)

and,

x = (C_I ⊗ C_I) X̃ = (C ⊗ C)^T X̃    (5-2-5b)
Fast algorithms are usually derived either from Equation (5-2-4) or from its matrix form as presented by Equation (5-2-5). Although the definitions for the 2-D DCT and its inverse are slightly different from those given in [72], the denormalized forms are the same. These are the basis for the derivation of various fast algorithms.

Generally, the mathematical derivation of an algorithm is quite involved and it is difficult to see the computational structure. For this reason, a logic diagram is used to present the computational structure of each algorithm.
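The tensor-product structure in Equation (5-2-5) is equivalent to the row-column evaluation, which the following sketch checks (block size and test data illustrative):

```python
import numpy as np

# The Kronecker form of Equation (5-2-5a): applying the 1-D kernel C to
# rows and columns of a 2-D block equals applying C (x) C to the
# row-stacked data vector.
N = 8
n = np.arange(N)
C = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
rng = np.random.default_rng(1)
x = rng.standard_normal((N, N))

X_rc = C @ x @ C.T                                    # row-column form
X_kron = (np.kron(C, C) @ x.ravel()).reshape(N, N)    # tensor-product form
assert np.allclose(X_rc, X_kron)
```

Note that row-stacking (as defined after Equation (5-2-2)) corresponds to NumPy's row-major `ravel`, which is why `kron(C, C)` reproduces the row-column result exactly.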
95
5-3 Applications of 2-D DCTs in Image Compression

Image coding (compression) is a typical application of the 2-D DCT. It has been made an international standard by the CCITT for video coding applications [72, 73]. Various 2-D DCT algorithms have been developed into computer programs for purposes of the simulation study of video coding [81, 96]. In the following example, image compression is demonstrated using the row-column Lee fast DCT algorithm on a 256*256-pixel frame of an image.
Example-8:

In this example, the Series 151 Image Processor by Image Technology has been used to acquire and store images. It is hosted by a PC-AT which performs the 2-D DCT calculation. A 512*512-pixel image is snapped and stored in the frame grabber of the Series 151 Image Processor. The frame is divided into four quadrants, each of which consists of 256*256 pixels. The upper-right quadrant is used to display the original image, the upper-left quadrant the scaled DCT coefficients, the bottom-left quadrant the reconstructed image after applying different filtering on the DCT coefficients, and the bottom-right quadrant the difference image between the original and reconstructed images. The 2-D DCT is applied on 8*8-pixel blocks. DCT coefficients are scaled using a block size of 64*64 pixels. The difference image can also be scaled so that the error signal can be seen. A signed 9-bit integer is used for the DCT coefficients. The system setting is shown in Figure-19.

Two types of filter masks are used in this example, the 2-D ideal low-pass filter and the zigzag filter, as shown in Figure-20 (a) and (b), where n is the length of the filter. The filter mask is used to eliminate selected DCT coefficients. In Figure-21, an ideal low-pass filter is used with n = 4, so that a compression ratio of (8 bits * 64 pixels)/(9 bits * 16 pixels) ≈ 3.56:1 is obtained, using a signed 9-bit integer for the DCT coefficients.
The difference image is magnified twenty times. The effects, shown as stripes, are caused by two-dimensional noise introduced in the imaging system. This has been detected by analyzing the 2-D Fourier spectrum of the image [Appendix C]. A zigzag filter of n = 5 is used in Figure-22 to achieve a compression ratio of (8 bits * 64 pixels)/(9 bits * 15 pixels) ≈ 3.79:1. Although a higher compression ratio is used than that of Figure-21, the improvement in the reconstructed image quality is obvious, especially in the areas of the English characters of the poster in the background, the face and the shoulder. When different bit allocation or adaptive schemes are applied, higher compression ratios can be obtained [3, 105].
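The two masks and the resulting compression ratios can be reproduced as follows (a sketch assuming, as in the experiment above, 8-bit input pixels and signed 9-bit DCT coefficients on 8*8 blocks; the function names are illustrative):

```python
import numpy as np

def rect_mask(n, size=8):
    """Ideal low-pass mask of Figure-20(a): keep the n*n corner."""
    m = np.zeros((size, size), dtype=bool)
    m[:n, :n] = True
    return m

def zigzag_mask(n, size=8):
    """Zigzag mask of Figure-20(b): keep the anti-diagonal band i + j < n."""
    i, j = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    return i + j < n

def ratio(mask, in_bits=8, coef_bits=9, size=8):
    """Compression ratio: input bits per block over retained coefficient bits."""
    return (in_bits * size * size) / (coef_bits * int(mask.sum()))

assert rect_mask(4).sum() == 16 and zigzag_mask(5).sum() == 15
assert round(ratio(rect_mask(4)), 2) == 3.56     # Figure-21
assert round(ratio(zigzag_mask(5)), 2) == 3.79   # Figure-22
```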
System setting for image compression experiment.
Figure-19
97
[Diagram not reproduced: the retained n*n corner of coefficients, marked X.]
The two dimensional rectangular filter of size n.
Figure-20 (a)
[Diagram not reproduced: the retained anti-diagonal band of coefficients, marked X.]
The two dimensional zigzag filter of size n.
Figure-20 (b)
98
An example of DCT compression of a 256*256-pixel image: an ideal low-pass filter with n = 4 and signed 9-bit DCT coefficients.
Figure-21
99
An example of DCT compression of a 256*256-pixel image: a zigzag filter with n = 5 and signed 9-bit DCT coefficients.
Figure-22
100
5-4 2-D Indirect Fast DCT Algorithms

Of the two categories of 2-D fast DCT algorithms, the indirect approach obtains a 2-D DCT from a 2-D DFT of the same size [57]. One can use row-column FFT algorithms, the WFTA, etc., to calculate the real-valued DFTs as discussed previously. The arithmetic complexity is fairly low [72]. The computational kernel of this method has a simple structure, hence it is well suited to VLSI implementation. If 2-D FFT algorithms are invoked to calculate the 2-D DFT, the arithmetic complexity can be further reduced, particularly if vector radix FFT algorithms are used [37, 43-45]. The structure of the algorithm is kept fairly simple, and roundoff errors are also reduced compared with the row-column approach [50, 51]. A polynomial transform for 2-D DFT computation has lower computational requirements but a more complex computation structure [58]. Whether it is justified for fast computation of the 2-D DCT remains to be seen [72]. The same can be said for 2-D indirect fast DCT methods using other reduced-multiplication fast Fourier transform algorithms [32, 35].

There are several reports on 2-D indirect fast DCTs which map DCTs into DFTs [57, 72, 79, 84], and in [57] complete formulas for both the forward and inverse DCT are given. These are used in this thesis.
According to Makhoul [57], a 2-D DCT can be converted to and computed by a 2-D DFT following the steps given below.

Step 1: 2-D N*N-point data rearrangement

v(n1,n2) = x(2n1, 2n2)               if 0 <= n1 <= [(N-1)/2]; 0 <= n2 <= [(N-1)/2]
           x(2N-2n1-1, 2n2)          if [(N+1)/2] <= n1 <= N-1; 0 <= n2 <= [(N-1)/2]
           x(2n1, 2N-2n2-1)          if 0 <= n1 <= [(N-1)/2]; [(N+1)/2] <= n2 <= N-1
           x(2N-2n1-1, 2N-2n2-1)     if [(N+1)/2] <= n1 <= N-1; [(N+1)/2] <= n2 <= N-1
    (5-4-1)

Step 2: 2-D N*N-point DFT on v(n1,n2)

V(k1,k2) = \sum_{n1=0}^{N-1} \sum_{n2=0}^{N-1} v(n1,n2) W_N^{n1 k1} W_N^{n2 k2}    (5-4-2)

where k1 = 0, 1, ..., N-1, k2 = 0, 1, ..., N-1 and W_N = e^{-j2π/N}.

Step 3: Obtain the 2-D DCT from the output of the 2-D FFT by one of two expressions:

C(k1,k2) = 2 Re{ W_{4N}^{k1} [ W_{4N}^{k2} V(k1,k2) + W_{4N}^{-k2} V(k1, N-k2) ] }    (5-4-3a)

or:

C(k1,k2) = 2 Re{ W_{4N}^{k2} [ W_{4N}^{k1} V(k1,k2) + W_{4N}^{-k1} V(N-k1, k2) ] }    (5-4-3b)

that is, the 2-D DCT so computed is:

C(k1,k2) = 4 \sum_{n1=0}^{N-1} \sum_{n2=0}^{N-1} v(n1,n2) cos[π(4n1+1)k1/(2N)] cos[π(4n2+1)k2/(2N)]    (5-4-3c)

Different 2-D fast discrete Fourier transform algorithms may be used (e.g., the vector radix FFT [37, 43, 45, 60], the 2-D WFTA [35], the 2-D polynomial transform [58], etc. [32, 65]) to calculate the 2-D DFT in Step 2 for the 8*8- or 16*16-point FFT. The inverse DCT can be computed by the following steps:

Step 1': Generate the 2-D DFT from the 2-D DCT

V(k1,k2) = (1/4) W_{4N}^{-k1} W_{4N}^{-k2} { [C(k1,k2) - C(N-k1, N-k2)] - j [C(N-k1, k2) + C(k1, N-k2)] }    (5-4-4)

Step 2': 2-D IDFT

v(n1,n2) = (1/N^2) \sum_{k1=0}^{N-1} \sum_{k2=0}^{N-1} V(k1,k2) W_N^{-n1 k1} W_N^{-n2 k2}    (5-4-5)

where n1, n2 = 0, 1, ..., N-1.

Step 3': Recover the sequence x(n1,n2) by inverting the rearrangement of Step 1 of the forward DCT.
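The forward steps above can be sketched and checked against a direct evaluation of Equation (5-4-3c); the block size and test data below are illustrative:

```python
import numpy as np

N = 8
rng = np.random.default_rng(2)
x = rng.standard_normal((N, N))

# Step 1: rearrangement of Equation (5-4-1), applied separably per axis:
# v(n) = x(2n) for the first half, v(N-1-n) = x(2n+1) for the second.
idx = np.empty(N, dtype=int)
idx[: (N + 1) // 2] = 2 * np.arange((N + 1) // 2)
idx[(N + 1) // 2 :] = 2 * (N - 1 - np.arange((N + 1) // 2, N)) + 1
v = x[np.ix_(idx, idx)]

# Step 2: 2-D DFT.
V = np.fft.fft2(v)

# Step 3: twiddle per Equation (5-4-3a).
k1, k2 = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
W = np.exp(-2j * np.pi / (4 * N))
C_ind = 2 * np.real(W**k1 * (W**k2 * V + W**(-k2) * V[k1, (N - k2) % N]))

# Reference: the unnormalized 2-D DCT with the scaling of Equation (5-4-3c).
n = np.arange(N)
Cm = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
assert np.allclose(C_ind, 4 * Cm @ x @ Cm.T)
```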
In [72], [79] and [84], no formulas are given. Instead, the 2-D inverse DCT is generated on the flow graph using the transposition theorem of orthogonal transforms.

A 2-D indirect DCT method using a convolution algorithm has been mentioned in [72], which claims a dramatic reduction in multiplications. A detailed study remains to be undertaken.
102
The arithmetic complexity of indirect DCT algorithms depends on the FFT algorithm used. Since FFT algorithms are well documented, the arithmetic complexity of indirect DCT algorithms can be readily obtained.
103
CHAPTER SIX: 2-D DIRECT FAST DCT ALGORITHMS

It is known that 2-D fast transform algorithms are often more efficient than the row-column 1-D algorithm in terms of computational operations, i.e., they need fewer multiplications and additions than the row-column method to compute the same transform. Many 2-D algorithms also possess in-place computation, a regular structure and small roundoff errors [37, 45, 50, 51], which all provide advantages.

The 2-D direct fast DCT algorithms discussed in the following sections are generated from the 1-D Lee and Hou algorithms respectively. They require fewer multiplications than the row-column method and provide a systematic computation structure featuring 2-D BFs and TM stages as well. The in-place computation possessed by these algorithms is obvious. The computational complexity of various DCT algorithms is considered, and the computation structure of 2-D direct DCT algorithms is analyzed in comparison with that of 2-D vector radix FFTs.

In the next two sections, the matrix and diagrammatic representations of 1-D direct DCT algorithms are discussed as the bases for the derivation of 2-D direct vector radix DCT algorithms. The 2-D DCT algorithms are then introduced using the structural approach, in a manner similar to that in which VR FFTs are constructed.
6-1 2-D Direct Fast DCT Algorithm Based on Lee's Method

6-1-1 1-D Lee's algorithm in matrix form

Lee's algorithm is a direct fast DCT algorithm. For an N-point forward DCT, Equation (5-1-4a) can be decomposed into two N/2-point DCTs by the following steps in matrix form [46, 76, 80]:

[g'1(n)]   [1  1] [x(n)    ]
[g'2(n)] = [1 -1] [x(N-1-n)]    (6-1-1a)

[g1(n)]   [1  0                    ] [g'1(n)]
[g2(n)] = [0  1/(2 C_{2N}^{(2n+1)})] [g'2(n)]    (6-1-1b)

[G1(k)]   \sum_{n=0}^{N/2-1}                      [g1(n)]
[G2(k)] =                    C_{2(N/2)}^{(2n+1)k} [g2(n)]    (6-1-1c)

[X(2k)  ]   [1 0 0] [G1(k)  ]
[X(2k+1)] = [0 1 1] [G2(k)  ]
                    [G2(k+1)]    (6-1-1d)

where k, n = 0, 1, ..., N/2-1, and G2(k+1)|_{k=N/2-1} = 0. Define the post- or pre-calculation matrix P, the butterfly matrix B, and the multiplication matrix M as follows:

P = [1 0 0]    B = [1  1]    M = [1  0                    ]
    [0 1 1]        [1 -1]        [0  1/(2 C_{2N}^{(2n+1)})]
The N-point IDCT in Equation (5-1-4b) can also be decomposed into two N/2-point IDCTs by the following steps in matrix form [46, 80]:

[H1(k)]       [X(2k)  ]
[H2(k)] = P   [X(2k+1)]
              [X(2k-1)]    (6-1-2a)

[h1(n)]   \sum_{k=0}^{N/2-1}                      [H1(k)]
[h2(n)] =                    C_{2(N/2)}^{(2n+1)k} [H2(k)]    (6-1-2b)

[h'1(n)]       [h1(n)]
[h'2(n)] = M   [h2(n)]    (6-1-2c)

[x(n)    ]       [h'1(n)]
[x(N-1-n)] = B   [h'2(n)]    (6-1-2d)
where X(2k-1) = 0 if k = 0. The above one-dimensional fast DCT algorithm is described in Lee's paper [76] except for the matrix representations [46]. The matrix representations used here are very useful when a new 2-D fast DCT algorithm is devised [45, 46, 60, 70, 80, 83].
105
Notice that:

[x(n)    ]   [1         ]
[x(N-1-n)] = [s^{N-1-2n}] x(n)    (6-1-3)

and,

[X(2k)  ]   [1  0     ]
[X(2k+1)] = [0  1     ] [X(2k)  ]
[X(2k-1)]   [0  s^{-1}] [X(2k+1)]    (6-1-4)

[G1(k)  ]   [1  0]
[G2(k)  ] = [0  1] [G1(k)]
[G2(k+1)]   [0  s] [G2(k)]    (6-1-5)

where the delay operator s is defined by x(n+1) = s x(n) and x(n-1) = s^{-1} x(n).
The logic diagrams for the 1-D 8-point and 16-point denormalized IDCT are shown in Figure-23 and Figure-24 respectively. Note that in these figures the input sequence is in bit-reversed order, whilst the output sequence is generated by starting with the set (0,1), forming a new set by adding the prefix "0" to each element, and then obtaining the rest of the elements by complementing the existing ones. Therefore the sets corresponding to the 2-, 4-, 8- and 16-point output sequences are: (0,1), (00, 01, 11, 10), (000, 001, 011, 010, 111, 110, 100, 101) and (0000, 0001, 0011, 0010, 0111, 0110, 0100, 0101, 1111, 1110, 1100, 1101, 1000, 1001, 1011, 1010). The corresponding 1-D fast DCT algorithms can be obtained easily by interchanging the input and output, reversing the direction of data flow, and changing addition blocks to branches and branches to addition blocks in the IDCT logic diagrams [76, 84].

As the logic diagram shows, Lee's algorithm achieves a good performance in terms of the number of multiplications and additions and also has a regular structure.
For an N-point DCT, N = 2^m, the numbers of multiplications and additions required for the calculation are as follows:

0_M[DCT(2^m)] = m * 2^{m-1};

0_A[DCT(2^m)] = m * 2^{m-1} + \sum_{i=0}^{m-1} 2^i (2^{m-i} - 1), for m > 1.
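The multiplication count can be confirmed by running the recursion of Equations (6-1-1a) to (6-1-1d) and counting the divisions by the cosine coefficients; the following is a sketch of the 1-D algorithm (variable names illustrative), checked against the denormalized DCT of Equation (5-1-4a):

```python
import numpy as np

def lee_dct(x, count):
    """One full recursion of Lee's algorithm; count[0] accumulates the
    multiplications (the divisions by 2*cos in the matrix M)."""
    N = len(x)
    if N == 1:
        return x.copy()
    n = np.arange(N // 2)
    g1 = x[n] + x[N - 1 - n]                                   # butterfly B
    g2 = (x[n] - x[N - 1 - n]) / (2 * np.cos(np.pi * (2 * n + 1) / (2 * N)))
    count[0] += N // 2                                         # matrix M
    G1, G2 = lee_dct(g1, count), lee_dct(g2, count)            # two N/2 DCTs
    X = np.empty(N)
    X[0::2] = G1                                               # X(2k) = G1(k)
    X[1::2] = G2 + np.append(G2[1:], 0.0)   # X(2k+1) = G2(k)+G2(k+1), G2(N/2)=0
    return X

N = 16                                   # m = 4
rng = np.random.default_rng(3)
x = rng.standard_normal(N)
cnt = [0]
X = lee_dct(x, cnt)

n = np.arange(N)
Cm = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
assert np.allclose(X, Cm @ x)            # denormalized DCT, Equation (5-1-4a)
assert cnt[0] == 4 * 2**3                # m * 2^(m-1) = 32
```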
106
[Figure-23: Logic diagram of the 8-point 1-D denormalized IDCT based on Lee's algorithm (figure not reproduced).]
107
[Figure-24: Logic diagram of the 16-point 1-D denormalized IDCT based on Lee's algorithm (figure not reproduced).]
108
Since the publication of Lee's paper, the algorithm has been criticized for the roundoff errors produced by the required division by the cosine coefficients in the matrix M. Haque [38], however, has shown that the roundoff errors are not serious for small-size DCTs, although this fact has not been widely recognized [65, 72, 77].
6-1-2 Derivation of the 2-D fast DCT algorithm from Lee's algorithm

Although this method was first introduced by Haque [38] in 1985, immediately after the publication of Lee's algorithm, it was derived independently by Wu and Paoloni [46] using a structural approach. The latter technique makes the 2-D fast algorithm a simple and systematic extension of Lee's algorithm and will be presented here.

The N*N-point 2-D DCT and its inverse (IDCT) after denormalization are given by Equations (5-2-4a) and (5-2-4b). Equation (5-2-4a) can be decomposed into (N/2)*(N/2)-point 2-D DCTs in the same way as was done for the 1-D DCT in the last sub-section. Matrix forms of the four-step algorithm are given by Equations (6-1-2-1a) to (6-1-2-1d) [46]:
g' = B_{2D} x    (6-1-2-1a)

g = M_{2D} g'    (6-1-2-1b)

G = \sum_{n=0}^{N/2-1} \sum_{m=0}^{N/2-1} C_{2(N/2)}^{(2n+1)k} C_{2(N/2)}^{(2m+1)l} g    (6-1-2-1c)

X = P_{2D} ( [1 0; 0 1; 0 s] ⊗ [1 0; 0 1; 0 z] ) G    (6-1-2-1d)

where

x = ( [1; s^{N-1-2n}] ⊗ [1; z^{N-1-2m}] ) x(n,m);

g' = [g'1(n,m), g'2(n,m), g'3(n,m), g'4(n,m)]^T;    g = [g1(n,m), g2(n,m), g3(n,m), g4(n,m)]^T;

G = [G1(k,l), G2(k,l), G3(k,l), G4(k,l)]^T;    X = [X(2k,2l), X(2k,2l+1), X(2k+1,2l), X(2k+1,2l+1)]^T;

P_{2D} = P ⊗ P;    M_{2D} = M ⊗ M';    B_{2D} = B ⊗ B;

and k, l, n, m = 0, 1, ..., N/2-1, with Gi(k+1,.)|_{k=N/2-1} = Gi(.,l+1)|_{l=N/2-1} = 0 for i = 2, 3, 4. The matrices P, B and M are defined as in the last sub-section, and M' is derived by substituting m for n in M; the symbol ⊗ stands for the tensor (Kronecker) product; and the delay operators s and z operate on the two different indices.
Equation (6-1-2-1c) represents four (N/2)*(N/2)-point 2-D DCTs, which can be further decomposed into even shorter-length 2-D DCTs, and so forth. In essence, the relationship between 2-D direct DCT algorithms and their 1-D counterparts is governed by the properties of the tensor product. Equation (6-1-2-1) can be proved by direct derivation (Appendix D).
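The tensor-product lifting of the 1-D stages can be illustrated on the butterfly matrix B alone (a minimal sketch; the 2*2 block stands in for the paired samples the 2-D butterfly acts on):

```python
import numpy as np

# B (x) B applied to the row-stacked block equals the 1-D butterfly B
# applied along each dimension in turn, which is how the structural
# approach lifts Lee's 1-D stages to 2-D.
B = np.array([[1.0, 1.0], [1.0, -1.0]])
rng = np.random.default_rng(4)
x = rng.standard_normal((2, 2))

two_d = (np.kron(B, B) @ x.ravel()).reshape(2, 2)
row_then_col = B @ x @ B.T
assert np.allclose(two_d, row_then_col)
```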
The 2-D fast IDCT algorithm derived from Equation (6-1-2) can be given in matrix form in a similar way [46, 80]. Following the structured approach, the 2-D fast IDCT algorithm based on Lee's method is presented as Equation (6-1-2-2):
H = P_{2D} X    (6-1-2-2a)

h = \sum_{k=0}^{N/2-1} \sum_{l=0}^{N/2-1} C_{2(N/2)}^{(2n+1)k} C_{2(N/2)}^{(2m+1)l} H    (6-1-2-2b)

h' = M_{2D} h    (6-1-2-2c)

x = B_{2D} h'    (6-1-2-2d)
110
where

X = ( [1; s; s^{-1}] ⊗ [1; z; z^{-1}] ) X(2k,2l);

H = [H1(k,l), H2(k,l), H3(k,l), H4(k,l)]^T;    h = [h1(n,m), h2(n,m), h3(n,m), h4(n,m)]^T;

x = ( [1; s^{N-1-2n}] ⊗ [1; z^{N-1-2m}] ) x(n,m);

P_{2D} = P ⊗ P;    M_{2D} = M ⊗ M';    B_{2D} = B ⊗ B;

and k, l, m, n = 0, 1, ..., N/2-1; s and z are the two delay operators which operate on the two different dimensions, and X(2k-1,.)|_{k=0} = X(.,2l-1)|_{l=0} = 0.
The mathematical structure becomes even simpler when a logic diagram is used [45, 46, 80]. For instance, the logic diagram of an 8-point 1-D IDCT is given in Figure-23 [46]. The row-column IDCT is applied to an 8*8-point 2-D IDCT in Figure-25, where the row transforms are implemented by the 1-D IDCT blocks at the input and the column transforms by the remainder. The number of multiplications required for an N*N-point 2-D IDCT computed this way is N^2 log2 N. Figure-26 shows the evolution of the row-column approach into the two-dimensional algorithm. The 1-D IDCT row operations of Figure-25 are now distributed throughout the logic diagram, and thus the number of multiplications remains, as yet, unchanged. However, it is possible to combine adjacent factors to reduce the number of multiplications further. For example, in the block 2D-M3 of Figure-26, the factor α is moved into the block M3, which has the effect of changing the internal multiplier values of M3 [45, 46]. The procedure is repeated for blocks 2D-M2 and 2D-M1. The total number of multiplications required to perform an N*N-point 2-D IDCT is now (3/4) N^2 log2 N, whilst the number of additions remains unchanged. The logic diagram for a 16*16-point
111
[Figure-25: The row-column approach applied to an 8*8-point 2-D IDCT (figure not reproduced).]
112
[Figure-26: Evolution of the row-column approach into the 2-D fast IDCT algorithm, with blocks 2D-M1, 2D-M2 and 2D-M3 (figure not reproduced).]
113
[Figure-27: Logic diagram of the 16*16-point 2-D IDCT based on Lee's algorithm (figure not reproduced).]
114
IDCT using the above 2-D fast algorithm can also be constructed in the same manner: start by drawing a 16-point 1-D IDCT diagram using Lee's algorithm, then the 2-D diagram can be constructed immediately using the simple rules of the logic diagram [81], as shown in Figure-27. The forward DCT can be obtained by reversing the direction of the arrows in the logic diagram of the IDCT, since the DCT is an orthogonal transform [76, 84].

Using the above logic diagrams, both software and hardware implementations of 2-D DCTs can proceed.
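The two multiplication counts quoted in this section can be tabulated for the block sizes of interest (the function names below are illustrative):

```python
from math import log2

# Multiplication counts for an N*N-point 2-D IDCT derived from Lee's
# algorithm: N^2*log2(N) for the row-column form, reduced to
# (3/4)*N^2*log2(N) after combining adjacent twiddle factors.
def rc_mults(N):
    return int(N * N * log2(N))

def combined_mults(N):
    return int(3 * N * N * log2(N) // 4)

assert rc_mults(8) == 192 and combined_mults(8) == 144
assert rc_mults(16) == 1024 and combined_mults(16) == 768
```

For the 8*8 and 16*16 blocks used in image coding, the combined-factor form therefore saves a quarter of the multiplications at no cost in additions.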
6-2 2-D Direct Fast DCT Algorithm Based on Hou's Method

6-2-1 1-D Hou's algorithm in matrix form

Hou introduced a recursive fast DCT algorithm in 1987 which achieves computational efficiency equal to that of Lee's algorithm and provides better numerical performance. As a tradeoff, however, shifting and multiplexing operations are required in this method. Although Hou classifies his algorithm differently, it is still a direct fast DCT algorithm, and it is recursive because the higher-order DCT matrices are generated directly from the lower-order DCT matrices [77].
The 1-D N-point DCT definition used to generate the algorithm is derived from Equation (5-1-4a) [82] and is given by:

X(k) = \sum_{n=0}^{N-1} x̂(n) cos(θ_k + 2πkn/N)    (6-2-1-1)

where θ_k = πk/(2N) and, for n = 0, 1, ..., (N/2)-1,

x̂(n) = x(2n),    x̂(N-1-n) = x(2n+1).    (6-2-1-2)

The Decimation-In-Frequency recursive Hou algorithm for the 1-D DCT generated from Equation (6-2-1-1) is:

[ẑ_e]
[ẑ_o] = T(N) x̂    (6-2-1-3)

where

T(N) = [ T(N/2)          T(N/2)      ]
       [ K T(N/2) Q   -K T(N/2) Q ],

T(1) = 1,    T(2) = [1  1; a  -a],    a = cos(π/4);

ẑ_e = R z_e, where z_e is the vector consisting of the even terms of X(k) in natural order;

ẑ_o = R z_o, where z_o is the vector consisting of the odd terms of X(k) in natural order;

R is the permutation matrix performing bit reversal; for example,

R_2 = I_2 = [1 0; 0 1],    R_4 = [1 0 0 0; 0 0 1 0; 0 1 0 0; 0 0 0 1],    etc.;

K = R L R,    L = [  1  0  0  0  ...  0 ]
                  [ -1  2  0  0  ...  0 ]
                  [  1 -2  2  0  ...  0 ]
                  [ -1  2 -2  2  ...  0 ]
                  [  .  .  .  .  ...  . ]
                  [ -1  2 -2  2  ...  2 ];

Q = diag[cos Φ_m], m = 0, 1, 2, ..., (N/2)-1, with Φ_m = (m + 1/4)(2π/N).

Note that K_2 = L_2, and K is the result of bit reversal of the row and column indices of L.
Equation (6-2-1-3) can be used recursively to form the complete formula. The logic
diagrams for 8-point and 16-point DCTs using the DIF Hou algorithm are presented in
Figures 28 and 29. The input sequence is ordered as x(0), x(2), ..., x(N-2), x(N-1), x(N-
3), ..., x(1), where N is the length of the DCT. The output sequence is in bit-reversed
order.
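The input ordering just described is exactly the permutation of Equation (6-2-1-2); a small hedged C sketch (illustrative naming, not the thesis code):

```c
#include <assert.h>

/* Reorder an N-point input as required by Equation (6-2-1-2):
   the first half holds the even-indexed samples in ascending order,
   the second half holds the odd-indexed samples in descending order,
   i.e. x(0), x(2), ..., x(N-2), x(N-1), x(N-3), ..., x(1). */
void hou_input_reorder(const double *x, double *xr, int n)
{
    for (int i = 0; i < n / 2; i++) {
        xr[i]         = x[2 * i];      /* x^(n)     = x(2n)   */
        xr[n - 1 - i] = x[2 * i + 1];  /* x^(N-1-n) = x(2n+1) */
    }
}
```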
In [77], another algorithm, called the Decimation-In-Time algorithm for the DCT, is
devised as a dual method to the DIF algorithm. It is obtained simply by switching the
indices between the input and the output and taking the transpose of the DCT matrix T̂(N).
For the inverse transform, substitute Equation (6-2-1-2) into Equation (5-1-4b), to
give

x̂(n) = Σ_{k=0}^{N-1} X(k) cos(θ_k + 2πkn/N)   (6-2-1-4)
116

[Figure-28: logic diagram of the 8-point 1-D DCT using the DIF Hou algorithm (graphic not reproduced in transcript)]

117
[Figure-29: logic diagram of the 16-point 1-D DCT using the DIF Hou algorithm (graphic not reproduced in transcript)]

118
The IDCT matrix will be exactly the transpose of the DCT matrix defined by Equation (6-
2-1-1). Following Hou's indexing scheme, the fast IDCT algorithm is given below.

[ x̂_f ]               [ Ẑ_e ]
[ x̂_r ]  =  T̂ᵀ(N) ·  [ Ẑ_o ]   (6-2-1-5)

where

T̂ᵀ(N) = [ T̂ᵀ(N/2)     Q T̂ᵀ(N/2) Kᵀ
           T̂ᵀ(N/2)    -Q T̂ᵀ(N/2) Kᵀ ],   Kᵀ = R Lᵀ R.

This means that the DIT fast DCT algorithm given by Hou is equivalent to the
inverse fast algorithm. The IDCT matrix can be factorized into the form shown by
Equation (6-2-1-6), where each 2*2 block matrix is written with its rows separated by
semicolons:

T̂ᵀ(N) = [I I; I -I] · [I 0; 0 Q] · [T̂ᵀ(N/2) 0; 0 T̂ᵀ(N/2)] · [I 0; 0 Kᵀ]   (6-2-1-6)

where all matrices I, Kᵀ, Q and T̂ᵀ(N/2) are of dimension (N/2)*(N/2). According to
Equation (6-2-1-6), Figure-3 in [77] should be the one shown in Figure-30.
The number of multiplications and additions required to perform an N-point DCT is
equal to that of Lee's algorithm, plus an additional Σ_{i=0}^{m-1} 2^i (2^{m-i-1} - 1)
shift operations, for m > 1, where N = 2^m. In software programming, the multiplexing can be hidden
so that there is no extra operational cost; in other words, there is no extra operation due to
this multiplexing compared with a program using Lee's algorithm [81, 96].
6-2-2 Derivation of the 2-D fast DCT algorithm from Hou's
algorithm
Hou's algorithm can be extended into a 2-D fast algorithm in very much the same
way as Lee's algorithm. In order to derive a new two-dimensional recursive fast DCT
algorithm based on Hou's approach, the DCT matrix in Equation (6-2-1-3) is rewritten as
follows:
119
[Figure-30: corrected system configuration for the 1-D DIT Hou fast DCT algorithm (graphic not reproduced in transcript)]

120
Writing each 2*2 block matrix with its rows separated by semicolons,

T̂(N) = [I 0; 0 K] · [T̂(N/2) 0; 0 T̂(N/2)] · [I 0; 0 Q] · [I I; I -I]   (6-2-2-1)

where all matrices I, K, Q and T̂(N/2) are of dimension (N/2)*(N/2).
Set the following equations [57]:

x̂(n,m)         = x(2n,2m)
x̂(n,N-1-m)     = x(2n,2m+1)
x̂(N-1-n,m)     = x(2n+1,2m)
x̂(N-1-n,N-1-m) = x(2n+1,2m+1),

0 ≤ n ≤ N/2 - 1,   0 ≤ m ≤ N/2 - 1.   (6-2-2-2)
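Equation (6-2-2-2) is a two-dimensional even/odd shuffle; as a hedged illustration (the function name and the fixed block size are assumptions, not from the thesis):

```c
#include <assert.h>

#define N 4  /* illustrative block size */

/* 2-D input reordering of Equation (6-2-2-2): even/odd decimation in
   both dimensions, with the odd-indexed halves stored in reverse. */
void reorder_2d(double x[N][N], double xr[N][N])
{
    for (int n = 0; n < N / 2; n++)
        for (int m = 0; m < N / 2; m++) {
            xr[n][m]             = x[2 * n][2 * m];
            xr[n][N - 1 - m]     = x[2 * n][2 * m + 1];
            xr[N - 1 - n][m]     = x[2 * n + 1][2 * m];
            xr[N - 1 - n][N - 1 - m] = x[2 * n + 1][2 * m + 1];
        }
}
```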
Substituting Equation (6-2-2-2) into Equation (5-2-4a), a modified version of the 2-
D DCT is derived.

X(k,l) = Σ_{n=0}^{N-1} Σ_{m=0}^{N-1} x̂(n,m) cos(θ_k + 2πkn/N) cos(θ_l + 2πlm/N)   (6-2-2-3)

The matrix form for the denormalized 2-D DCT defined by Equation (6-2-2-3), after
reordering the input and output sequences, will be:
[ Ẑ_ee ]                         [ x̂_pp ]
[ Ẑ_eo ]  =  ( T̂(N) ⊗ T̂(N) ) · [ x̂_pr ]   (6-2-2-4)
[ Ẑ_oe ]                         [ x̂_rp ]
[ Ẑ_oo ]                         [ x̂_rr ]

where

Ẑ_e = (R ⊗ R) X_e,   Ẑ_e = [ Ẑ_ee; Ẑ_eo ],   X_e = [ X_ee; X_eo ],
Ẑ_o = (R ⊗ R) X_o,   Ẑ_o = [ Ẑ_oe; Ẑ_oo ],   X_o = [ X_oe; X_oo ],

x̂ = (P ⊗ P) x,   P is the permutation matrix which results in Equation (6-2-1-2),

121

X_e and X_o are in natural order.
Substituting Equation (6-2-2-1) into (6-2-2-4), a new 2-D fast DCT algorithm is
derived.

( T̂(N) ⊗ T̂(N) )
  = { [I 0; 0 K] [T̂(N/2) 0; 0 T̂(N/2)] [I 0; 0 Q] [I I; I -I] }
    ⊗ { [I 0; 0 K] [T̂(N/2) 0; 0 T̂(N/2)] [I 0; 0 Q] [I I; I -I] }
  = { [I 0; 0 K] ⊗ [I 0; 0 K] }
    { [T̂(N/2) 0; 0 T̂(N/2)] ⊗ [T̂(N/2) 0; 0 T̂(N/2)] }
    { [I 0; 0 Q] ⊗ [I 0; 0 Q] }
    { [I I; I -I] ⊗ [I I; I -I] }   (6-2-2-5)
So an N*N-point 2-D DCT is decomposed into four shorter-length DCTs at the cost
of an increased number of multiplications, represented by the term which contains the factor
Q. After combining the coefficients, the new algorithm uses 25% fewer
multiplications than the row-column Hou algorithm.
Although its mathematical derivation is quite involved, the logic diagrams for the 8*8-
and 16*16-point 2-D DCTs using the new fast algorithm are quite simple [81] and are
shown in Figures 31 and 32. They are derived from Figures 27 and 28 respectively, with
the symbols defined accordingly in those figures. Again, in the 2-D algorithm additional
shift operations are traded for better numerical performance. The elements of the input
vectors are ordered as x_i = [x(i,0), x(i,2), ..., x(i,N-2), x(i,N-1), x(i,N-3), ...,
x(i,1)] and the elements of the output vectors are in bit-reversed order.
The 2-D IDCT fast algorithm can be derived from Equation (6-2-1-6) and the logic
diagram can be obtained using the same method.
[Figure-31: logic diagram of the 8*8-point 2-D DCT using the new vector radix fast algorithm (graphic not reproduced in transcript)]

123

[Figure-32: logic diagram of the 16*16-point 2-D DCT using the new vector radix fast algorithm (graphic not reproduced in transcript)]

124
6-3 Comparison of Arithmetic Complexity of Various DCT
Algorithms [72, 156]
Listed below are the arithmetic complexities of direct fast algorithms for a 1-D DCT
of length N = 2^m, including the number of real multiplications O_M[DCT(2^m)] and the
number of real additions O_A[DCT(2^m)].

Chen [78]:      O_M[DCT(2^m)] = N·log2(N) - 3N/2 + 4,  N > 4;
                O_A[DCT(2^m)] = (3N/2)·(log2(N) - 1) + 2;

Lee [76]:       O_M[DCT(2^m)] = (N/2)·log2(N);
                O_A[DCT(2^m)] = (3N/2)·log2(N) - N + 1;

Hou [77]:       O_M[DCT(2^m)] = (N/2)·log2(N);
                O_A[DCT(2^m)] = (3N/2)·log2(N) - N + 1;

Ma-Yin [65]:    O_M[DCT(2^m)] = m·2^(m-1),  N = 2^m;
                O_A[DCT(2^m)] = (3m - 2)·2^(m-1) + 1;

Vetterli et al. [34]:  O_M[DCT(2^m)] = (N/2)·log2(N);
                O_A[DCT(2^m)] = (N/2)·(3·log2(N) - 2) + 1.
125
The general formulas given can be derived either from the decomposition equations or
from the logic diagrams provided in the thesis using an induction method.
The number of multiplications or additions for the 2-D row-column DCT methods is
obtained by multiplying the number used in the 1-D fast DCT algorithm of the same size by
2*N. The arithmetic complexity of the 2-D vector radix DCT algorithms is then easily
obtained by noting that the number of multiplications is reduced to three quarters of that
used by the row-column fast DCT algorithm while the number of additions remains
unchanged. Further discussion of the arithmetic complexity of DCTs can be found in
[72] and [156].
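The counts above, together with the row-column and vector radix rules just stated, can be captured in a few illustrative C helpers (Lee's formulas are used; the function names are assumptions, not from the thesis software):

```c
#include <assert.h>

/* Operation counts for a 1-D DCT of length N = 2^m (Section 6-3). */
int lee_mults(int N, int m) { return (N / 2) * m; }
int lee_adds(int N, int m)  { return (3 * N / 2) * m - N + 1; }

/* Row-column 2-D method: 2*N one-dimensional DCTs of length N. */
int rc_mults(int N, int m)  { return 2 * N * lee_mults(N, m); }

/* Vector radix 2-D method: multiplications fall to three quarters of
   the row-column count; additions are unchanged. */
int vr_mults(int N, int m)  { return 3 * rc_mults(N, m) / 4; }
```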
6-4 Comparison of Computation Structures of 2-D Direct VR
DCTs and VR FFTs
So far, independent VR FFT and VR DCT algorithms have been presented which
show some similarities in their computation structures. Further comparison will reveal
the basic computation structures common to both VR FFTs and VR DCTs, as well as the major
differences. The reason why the vector radix approach can be applied to FFTs
based on the Cooley-Tukey method and to the direct fast DCTs by Lee and Hou will soon
become clear. This exercise will certainly be beneficial to the software and hardware
implementation, including VLSI implementation, of vector radix fast algorithms.
Apart from the DFT being a complex-valued transform and the DCT a real-valued
one, there are some obvious differences in the computation structures of the 1-D Cooley-
Tukey FFT and 1-D direct fast DCTs. Take, for example, the 1-D 8-point DIF FFT shown in Figure-3
and the 1-D 8-point direct fast DCT by Hou shown in Figure-28. It can be
seen that the input sequence of the FFT is in natural order and the output in bit-reversed
order (or vice versa), whilst the input of Hou's DCT is in a different shuffled order
(refer to Section 6-2-1) and the output in bit-reversed order. As a result, the DCT
algorithm requires that both input and output sequences be re-ordered whilst the FFT only
rearranges one of them. While the DCT algorithm needs a post-calculation stage, the FFT
does not. The FFT algorithms often have trivial twiddling multiplication stages, such as
126
the one inside the Radix-4 DIF FFT butterfly of Figure-3, whereas Hou's fast DCT does not
have a trivial twiddling multiplication stage.
On the other hand, there are many important features common to Figure-3
and Figure-28. The basic computation structures of both algorithms are 2-point
butterflies and separable twiddling multiplication stages. They both perform in-place
computations at every stage, which is quite different from the WFTA [31] or Chen's
fast DCT algorithm [78]. The post-calculation of Hou's algorithm also has distinct
stages. The fact that Cooley-Tukey FFTs and the fast DCTs by Lee and Hou have in-place
computation and separable twiddling multiplication stages makes it feasible to
extend them to multidimensional fast algorithms. It makes the modification rules of the logic
diagram applicable and the combination of twiddle factors of different dimensions
possible. Not surprisingly, taking Figures 10 and 32 for example, the 2-D VR FFTs and 2-D
VR DCT algorithms derived from Lee's and Hou's methods have the vector radix-2*2
butterfly and the combined twiddle-factor stage as their common computation
structures. Since VR FFTs based on the Cooley-Tukey method have trivial twiddling
stages, 2-D butterflies with higher vector radices are allowed in 2-D VR FFT algorithms.
6-5 Summary
In this chapter, two vector radix fast discrete cosine transform algorithms have been
introduced using both the matrix representation and the logic diagram. These two
algorithms show an arithmetic advantage over the row-column method because the
number of multiplications is reduced by one quarter. The computational structure of the 2-D
vector radix algorithms is regular, is characterized by 2-D butterflies, twiddling multiplications
and post- or pre-calculation structures, and can be systematically generated from the
corresponding 1-D algorithms. Computer programs using these algorithms have been
developed, and the use of the structural approach has assisted in the program development
procedure [96]. The arithmetic complexity of various DCT algorithms has been considered. A
comparative study of the computation structures of vector radix FFTs and vector radix
127
direct DCT algorithms has been carried out, and the correct system configuration for the 1-D DIT
Hou algorithm has also been presented in this chapter.
128
CHAPTER SEVEN: HARDWARE IMPLEMENTATION OF 2-D DCTS
FOR REAL-TIME IMAGE CODING SYSTEMS
Research on fast digital signal processing algorithms has followed the development
of computer technology, especially VLSI technology, since the foundational work laid by
Cooley and Tukey in 1965 [22]. In their well known paper they reduced the computational
burden of computing a length-N Discrete Fourier Transform (DFT) from the original
order of N^2 to the order of N·log2(N), and the same reduction can be realized for DCTs.
Since then, many fast algorithms have been published, and the evaluation of various fast
algorithms has been based on the following theoretical judgements, namely:
(a) the number of numerical operations (multiplications/additions);
(b) round-off errors;
(c) in-place computation; and
(d) the computation structure.
Of the above criteria, the computational complexity in terms of the number of
multiplications and additions has been the focal point in the development of fast
algorithms. In the early years particularly, research concentrated on reduced-multiplication
algorithms [39], as the time spent on a multiplication was far greater than that for an addition
on general purpose computers. This has also been true of DCT computations until
recently [72]. In implementations of 2-D DCTs for real-time image coding, as in any other
real-time application, special hardware, instead of general purpose computers, has to be
employed. This special hardware often depends on the leading edge of VLSI
technology. As a result, the development of VLSI technology led to a re-consideration of
the criteria on which new algorithms were devised and to a re-assessment of the
effectiveness of various fast algorithms. In other words, the devising and evaluation of
(new) fast algorithms have to be made relevant to VLSI technology or, simply, the
specific hardware installation.
129
In this chapter, a single processor system to implement the modified Makhoul 2-D
indirect algorithm is first described using the newly released CMOS VLSI FFT
processor, the A41102. Various VLSI DCT processors for 2-D image coding are then
reviewed. Different algorithms for the 2-D DCT computation are re-assessed in the
light of hardware implementation using different Digital Signal Processors (DSPs),
compared with the direct row-column matrix multiplication algorithm using
Multiplier/Accumulator processors [81].
7-1 Description of Hardware Implementation of the Modified 2-D
Makhoul DCT Algorithm Using the FDP™ A41102
The 2-D indirect fast DCT algorithms may not be as efficient as the 2-D direct
methods in terms of arithmetic complexity, and they may not have a regular computation
structure or in-place computation, but if a VLSI FFT processor is used, this method
shows its advantages in processing speed and overall system simplicity [74, 86].
As mentioned previously, the Austek A41102 Frequency Domain Processor (FDP)
is an FFT chip which provides a continuous sampling rate of up to 2.5 Ms/s and has a
selectable 16-, 20- or 24-bit word length [26-28, 85]. More importantly, 8*8- and
16*16-point DFTs can be performed in a single pass within 25.6 μs and 102.4 μs
respectively. When the modified 2-D indirect DCT algorithm is used [57, 86, 126], the
configuration using the FDP will give a fairly large DCT processing throughput. For
convenience, Equation (5-4-3a) is repeated as follows:
C(k1,k2) = 2Re{ W_4N^k1 [ W_4N^k2 V(k1,k2) + W_4N^-k2 V(k1,N-k2) ] }   (7-1-1)

From Equation (7-1-1), the 2-D DCT can be obtained by adding two terms derived from the
DFT:

W_4N^k1 W_4N^k2 V(k1,k2)   and   W_4N^k1 W_4N^-k2 V(k1,N-k2).

Define k2' = N - k2 in the second term, which results in:

130

W_4N^k1 W_4N^-k2 V(k1,N-k2) = W_4N^k1 W_4N^(k2'-N) V(k1,k2')
                            = j·W_4N^k1 W_4N^k2' V(k1,k2').   (7-1-2)
Equation (7-1-2) states a very important fact, namely, that all the elements
represented by the second term in Equation (7-1-1) can be obtained by multiplying the
corresponding elements of the first term by j, which is nothing but interchanging the real and
imaginary parts of the element. There is an uncommitted complex multiplier on the FDP
A41102 which can be used either before or after the FFT operation is completed. This
uncommitted complex multiplier can be employed in conjunction with a ROM to generate
all the elements in Equation (7-1-1). Therefore, the 2-D DCT can be calculated by the
system described diagrammatically in Figure-33. Since the use of the uncommitted
complex multiplier does not slow down the process, a processing rate of 2.5 Ms/s can be
obtained by this single-FDP system. This system configuration provides a comparatively
simple hardware solution over that of the row-column method [86]. The processing
speed can be improved by introducing a multi-processor configuration [86]. The above
modified Makhoul algorithm has been used to calculate 2-D DCTs using polynomial
transforms [126].
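A hedged C sketch of the j-multiplication described above (the struct and function names are illustrative): with the convention W_4N = e^(-j2π/4N), so that W_4N^-N = j, multiplying by j maps a + jb to -b + ja, i.e. a swap of the real and imaginary parts together with one sign change.

```c
#include <assert.h>

typedef struct { double re, im; } cplx;

/* Multiply a complex element by j: j*(a + jb) = -b + ja.
   This is the real/imaginary interchange exploited in Equation (7-1-2). */
cplx mul_by_j(cplx v)
{
    cplx r = { -v.im, v.re };
    return r;
}
```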
131

[Figure-33: block diagram of the single-FDP system for computing the 2-D DCT by the modified Makhoul algorithm (graphic not reproduced in transcript)]

132
7-2 Discussion of 2-D DCT Image Coding Systems Using VLSI
Digital Signal Processors
For the fast computation of 2-D DCTs in real-time image coding, the fastest and
simplest system configuration would use dedicated VLSI DCT processors [74, 75,
87, 89]. For example, in [111], a hardware architecture is reported using the row-column
fast DCT algorithm by Chen et al. [78] on 8*8-point blocks. The processor accepts 8-
bit video input digital signals, uses 12-bit internal precision and provides 12-bit DCT
output. A 16*16-point DCT VLSI processor is demonstrated in [139] using a direct
matrix multiplication method and a concurrent architecture [75, 143]. The processor
accepts 9-bit 2's complement data, maintains 12-bit precision after the column DCTs and
produces 14-bit DCT coefficients at a 14.3 MHz sample rate. In a recent report [74], a 27
MHz DCT chip which performs 8*8-point DCTs has been demonstrated using the Duhamel-
H'Mida fast cyclic convolution algorithm [113]. The SGS-Thomson Microelectronics Group
has been marketing its VLSI Discrete Cosine Transformer, the STV3200 [87]. The STV3200
DCT processor accepts 9-bit 2's complement input data, uses 16-bit internal precision
and produces 12-bit 2's complement DCT coefficients. It can perform 4*4- up to 16*16-
point DCTs at a rate expected to be 13.5 MHz. The IMS A121 of Inmos, which is now
part of the SGS-Thomson Microelectronics Group, is yet another VLSI DCT processor [88].
The IMS A121 can perform an 8*8-point DCT in 3.2 μs (20 MHz pixel rate) using the
direct matrix multiplication method. It accepts 9-bit signed input, uses 14-bit signed
integers for the cosine function Look Up Table (LUT) and 16-bit precision after the first
matrix multiplication, and renders 12-bit output for the DCT coefficients. TRW's TMC2311
[89] is another fast DCT processor, which calculates an 8*8-point DCT in 4.48 μs (14.3 MHz
pixel rate). The TMC2311 accepts 12- or 14-bit input data and produces optional 12-, 14-
or 16-bit output. The row-column method has been used in all the above DCT
processors, and so has fixed-point computation. Since the length of the DCTs under
consideration is comparatively small, various algorithms have been used in VLSI
integration of DCT processors without showing a great deal of difference in speed
133
performance for image coding. A comparative study of the error performance of these
DCT processors remains to be undertaken.
Where DCT processors are not available, the use of various DSPs, the FDP and
Multiplier/Accumulators (M/As) provides many options.
For fixed-point DCT computation, a single M/A processor IDT7210 with a
25 ns multiply/accumulate cycle [90] would render a throughput of about 2.5 Ms/s for an
8*8-point DCT, or 1.23 Ms/s for a 16*16-point DCT, using the row-column matrix
multiplication method [81]. A single processor system using AT&T's WE DSP16A [91],
which is an M/A-based DSP, would give a processing rate of 1.02 Ms/s for an 8*8-point
DCT [92] and about 0.95 Ms/s for a 16*16-point DCT [81]. Taking advantage of very
fast M/A processors, the direct matrix multiplication method out-performs many fast
algorithms using other digital signal processors in terms of speed and system
simplicity.
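As a sanity check on the quoted rates, a row-column N*N DCT by direct matrix multiplication takes 2·N^3 multiply/accumulate cycles per block; a hedged C sketch of this count (illustrative, ignoring I/O and control overhead):

```c
#include <assert.h>
#include <math.h>

/* Estimated throughput (samples/s) of a row-column N*N DCT on a single
   multiplier/accumulator with the given cycle time, ignoring I/O and
   control overhead. */
double rc_dct_throughput(int N, double mac_cycle_s)
{
    double macs = 2.0 * N * N * N;          /* 2*N length-N dot-product passes */
    double block_time = macs * mac_cycle_s; /* seconds per N*N block */
    return (double)(N * N) / block_time;    /* samples per second */
}
```

With N = 8 and a 25 ns cycle this gives 2.5 Ms/s, matching the figure above; with N = 16 it gives 1.25 Ms/s, slightly above the quoted 1.23 Ms/s, the difference presumably being overhead.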
Using the Austek FDP A41102 discussed previously would also provide a fairly
large throughput and a simple system solution.
Using the TMS320C30 [93], DCTs can be calculated in floating-point, which, as shall
be shown in the next chapter, has a much higher signal to noise ratio than the integer
computation.
The TMS320C30, a floating-point digital signal processor, is the third
generation device in the TMS320 family. Multiplication, memory access,
addition, shift and all other ALU operations can be executed within one clock cycle (60 ns).
Algorithms can be further optimized using the parallel instructions that the TMS320C30
provides. The speed at which a particular algorithm can be implemented depends upon
how compatible it is with the hardware.
From previous studies, if the efficiency of an algorithm is judged by the number of
additions, Chen's algorithm is the best. Lee's algorithm is better than Chen's if the
number of multiplications, or even the total number of numerical operations (including
additions and multiplications), is used as the criterion. But if the TMS320C30 is used to
implement an 8-point 1-D DCT, the total number of clock cycles used to
134
complete the process will be the main issue. Since there are only a limited number of
registers on the TMS320C30, not every one of which can be used in the parallel processing
instructions, algorithms which do not have a regular structure or in-place computation
tend to introduce more data handling operations, resulting in a relatively slow
implementation, although they may have the same arithmetic complexity as others [81]. On
one occasion, the implementation of a 2-D 8*8-point DCT using Chen's algorithm on the
TMS320C30 required about 60 cycles more than that using Lee's or Hou's algorithm
under similar programming conditions. Although the indirect DCT algorithm using the
WFTA has the same arithmetic complexity as Lee's and Hou's algorithms, it
also requires about 60 cycles more than the two when it comes to implementation on
the TMS320C30. The difference between algorithms in terms of the exact number of
cycles may vary with the programmer's experience, but the fact remains the same. This
problem becomes worse as the length of the DCT increases. Another observation is that
although the vector radix DCT algorithms have relatively low arithmetic complexity as
well as in-place and regular computation structure, they are out-performed by the row-
column DCTs due to the current arrangement of DSP architectures and the limited
number of registers provided [81]. In other words, the pipelined and parallel structure of
vector radix DCTs cannot be fully employed by current DSPs.
Because DCT processing using floating-point computation is at present
considerably slower than fixed-point computation, more TMS320C30s are
required to provide real-time image coding speed, which means an increase in system
complexity.
Unless VLSI DCT processors are used, a multi-processor system is required to
render a real-time image coding speed for an image of 288*352 pixels, or equivalently a video
signal rate of 3.04128 Ms/s. Since the DCT process, together with the quantization, decides
the overall performance of an image coding system [128], using floating-point
computation for DCTs also remains to be justified.
From the above discussion, it is concluded that:
135
(1) two fast algorithms which have equal computational complexity may not
have the same efficiency in hardware implementation, as the limited
resources on DSPs often impose different restrictions on them;
(2) the direct matrix multiplication method using very high speed
multiplier/accumulators may out-perform many fast algorithms in certain
applications and provides a simple system solution;
(3) a fast algorithm which possesses a regular computation structure will not only
facilitate future VLSI implementation but also provide better performance
using available DSPs than those which do not; and
(4) the pipeline and parallel computation structure of many multidimensional fast
algorithms has yet to be fully exploited in VLSI system design [81].
136
CHAPTER EIGHT: THE EFFECTS OF FINITE-WORD-LENGTH
COMPUTATION FOR FAST DCT ALGORITHMS
8-1 Introduction
In [81], various fast Discrete Cosine Transform (DCT) algorithms have been
examined and compared in terms of computational efficiency (or arithmetic complexity)
from both the software and hardware implementation points of view. In this chapter,
fast DCT algorithms are compared further in order to analyze the
effects of finite-word-length computation on the DCT process.
Generally speaking, the imposition of finite-word-length computation produces
overflow and roundoff errors [5, 6, 24]. Overflow occurs when the magnitude of an
operation exceeds the value that the finite-word-length register can represent. Roundoff is
required when a b-bit data sample is multiplied by a b-bit coefficient, resulting in a
product that is 2b bits long. To maintain a certain word length in a computation
procedure, truncation or rounding has to be applied, which causes errors usually referred
to as roundoff noise or roundoff errors. The use of quantized coefficients will also
introduce errors in the finite-word-length calculation [5, 6, 24]. So far, there have been
few reports comparing different fast DCT algorithms on this issue [132].
The main concern of this chapter is to investigate the roundoff errors produced in
various direct fast DCT algorithms when finite-word-length arithmetic is used and when the
cosine multiplicands are quantized. Results are generated by computer simulation [94,
95]. The infinite precision calculation of a DCT is implemented using double-
precision floating-point arithmetic, which is considered to be the benchmark. The
roundoff error performance is measured by the Signal to Noise
Ratio (SNR), which is defined as follows:

SNR = 10·log10 [ Σ (DCT_d)^2 / Σ (DCT_d - DCT_f)^2 ]   (8-1-1)

where DCT_d is the DCT output with double precision, and DCT_f is the DCT output with
finite word length, which could be 32-bit floating-point, or integer with finite length.
137
For the investigation of roundoff errors caused by using 32-bit floating-point
operations, Chen's [78], Lee's [76] and Hou's [77] algorithms will be considered for 1-D
DCT implementations, as well as the direct matrix multiplication method [81, 96]. In the
2-D DCT simulation with 32-bit floating-point, the row-column method using direct
matrix multiplication and Chen's, Lee's and Hou's algorithms, together with the 2-D vector radix direct
DCT algorithms [46, 80], will be studied.
For the analysis of roundoff errors caused by using integer calculation, some
simulation results have been reported in [81, 96].
According to simulation theory [94], the mean of a random variable can be
estimated by the sample mean x̄_I of I observed values, and its variance can be
approximated by the sample variance s_x^2 of the I independent samples. The formulas
for calculating x̄_I and s_x^2 are given by the following equations:

x̄_I = (1/I) Σ_{i=1}^{I} x_i   (8-1-2)

s_x^2 = [ Σ_{i=1}^{I} x_i^2 - I·x̄_I^2 ] / (I - 1)   (8-1-3)

where x_i is the i-th observed value of the sample sequence. To improve the reliability of the
simulation output, the replications method [94] has been used in this study. The formulas
for the sample mean X̄_I and variance S_X^2 are presented by the following equations:

X̄_I = (1/I) Σ_{i=1}^{I} X_i   (8-1-4)

S_X^2 = (1/(I-1)) Σ_{i=1}^{I} (X_i - X̄_I)^2
      = (1/(I-1)) [ Σ_{i=1}^{I} X_i^2 - (1/I)(Σ_{i=1}^{I} X_i)^2 ]   (8-1-5)

where I is the number of runs and X_i is the sample mean on run i.
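Equations (8-1-4) and (8-1-5) can be sketched as follows (illustrative naming; X[i] holds the per-run sample means):

```c
#include <assert.h>
#include <math.h>

/* Sample mean and variance over I replication runs, Equations
   (8-1-4) and (8-1-5); X[i] is the sample mean obtained on run i. */
void replications_stats(const double *X, int I, double *mean, double *var)
{
    double s = 0.0, s2 = 0.0;
    for (int i = 0; i < I; i++) {
        s  += X[i];
        s2 += X[i] * X[i];
    }
    *mean = s / I;
    *var  = (s2 - s * s / I) / (I - 1);
}
```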
The input data to the DCT is produced by a random number generator with a
Gaussian distribution. The Gaussian input data y_i is obtained from a uniformly distributed
138
sequence x_i on the interval (0,1), which is provided in the run-time library, using the
Central Limit Theorem:

y_i = ( Σ_{j=1}^{n} x_j - n/2 ) / sqrt(n/12)   (8-1-6)

When n = 12, the equation becomes:

y_i = Σ_{j=1}^{12} x_j - 6.   (8-1-7)
Equation (8-1-7) has been used in the simulation to generate Gaussian random data as the
input to the DCTs. A test of the Gaussian input generating program on one million samples
has shown satisfactory results.
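Equation (8-1-7) can be sketched in C as below; the use of rand() from the standard run-time library is an assumption standing in for whatever uniform generator the original programs used:

```c
#include <stdlib.h>
#include <math.h>
#include <assert.h>

/* Approximate standard Gaussian sample via the Central Limit Theorem,
   Equation (8-1-7): the sum of 12 uniform (0,1) samples minus 6. */
double gauss_clt(void)
{
    double s = 0.0;
    for (int i = 0; i < 12; i++)
        s += (rand() + 1.0) / ((double)RAND_MAX + 2.0); /* uniform in (0,1) */
    return s - 6.0;
}
```

By construction every sample lies strictly inside (-6, 6), so the tails of the true Gaussian are truncated; the thesis reports this approximation as satisfactory over one million samples.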
The simulation programs have been written in the C language, compiled on a PC-AT
and run on PC computers.
8-2 Simulation Design
In this section, the structure of the simulation program, error models, benchmarks
for the DCT computations and data collection are described.
8-2-1 Structure of the simulation program
The simulation program consists of five parts:
simulation requirement input;
initialization;
generation of the input for DCTs;
computation of the DCTs; and
simulation data collection.
[Flowchart (graphic not reproduced in transcript): simulation input (a. the length of the DCTs, n; b. the number of block samples in each simulation run, bl; c. the word length for the LUT, nb1; d. the word length for roundoff, nb2 — c and d optional for integer) → initialization (a. generation of the LUT; b. clearing of all data collection variables) → the random number generator is seeded → for BL = 1 to bl: input data generation; computation of the DCTs (a. double-precision DCT; b. finite-word-length DCT); computation of the SNR (a. SNR of the current block; b. sum of SNRs for run I); increment BL by 1 → calculation for run I (a. mean SNR of run I; b. sum of mean SNRs for each run; c. sum of squared mean SNRs); increment I by 1 → data collection (a. sample mean of SNR; b. sample variance of SNR; c. confidence interval) → END.]

Structure of simulation program for error analysis
Figure-34

140
This can be described by the flowchart shown in Figure-34. The details of some of
the blocks may vary from one fast DCT algorithm to another according to the simulation
requirements.
The input data is integer with a specified word length, selectable as signed 8-bit,
unsigned 8-bit or signed 9-bit. The data is processed in blocks of size
4*4, 8*8, 16*16 or 32*32 points, and the length of the DCT equals the block size. The
number of blocks is the number of two-dimensional DCTs in each simulation run. For
the row-column implementation, the number of 1-D DCTs required is double the block
size.
The initialization is used to set up the Look Up Tables (LUTs) where the cosine or sine
multiplicands are pre-calculated and stored, for each DCT program.
Note that the input and output of the DCT process are referred to as "data" and
"coefficients", whilst the values of the cosine functions in the Look Up Table are referred to as
"multiplicands".
After each DCT block calculation, the signal to noise ratio is calculated and
accumulated to find the sample mean. When the number of blocks is reached, the mean
value of the signal to noise ratio on the current run is computed, and this sample mean is
again accumulated, as is its mean square. This process is repeated eight times before
the final sample mean and the sample variance of the signal to noise ratio are calculated
according to Equations (8-1-4) and (8-1-5). The confidence interval has also been used to
render a provisional guide for the simulation.
8-2-2 Error model for the basic computation structure
The basic computational structure of fast D C T algorithms is the butterfly as shown
in Figure-3 and consideration needs to be given to the roundoff errors produced in this
stage. It is known that the error model of the floating-point calculation is different from
that of the integer operation because both floating-point multiplications and additions will
introduce roundoff errors whilst only multiplications using integer calculation will cause
141
roundoff errors. Although the simulation method is used in this study instead of the
theoretical approach where the error model is a necessity, understanding of the model
assists in the simulation design, especially for the integer calculation.
There are two additions and one multiplication in a butterfly structure, and roundoff
errors will usually be introduced at three locations, represented by ef1, ef2 and eo as
shown in Figure-35. In essence, the accumulated roundoff errors depend heavily on the
total number of multiplications and additions required by each algorithm for the DCT.
Since a' is a finite-word-length representation of the exact multiplicand a, it also introduces
computation noise. The error information computed from the simulation includes all
of the above effects.
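The three noise injection points can be made explicit in a small sketch. The rounding step, word length and butterfly form here are illustrative assumptions, not the exact structures used by the algorithms under test.

```python
def quantize(x, frac_bits):
    """Round x to frac_bits fractional bits, modelling the roundoff
    introduced at one point of a finite-word-length computation."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

def butterfly(x, y, a, frac_bits):
    """A DCT-style butterfly: two additions and one multiplication,
    with roundoff entering at ef1, ef2 and eo, and with the exact
    multiplicand a replaced by its quantized version a'."""
    a_q = quantize(a, frac_bits)      # a': finite-word-length multiplicand
    s = quantize(x + y, frac_bits)    # ef1: first adder output
    d = quantize(x - y, frac_bits)    # ef2: second adder output
    p = quantize(d * a_q, frac_bits)  # eo: multiplier output
    return s, p

s, p = butterfly(1.0, 0.5, 0.7071, 8)
```

Summed over all the butterflies of an algorithm, these small roundings produce the accumulated noise measured by the simulation.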
8-2-3 DCT in infinite-word-length
It is assumed that the DCT outputs calculated in infinite-word-length are
independent of the individual algorithm used, and that 64-bit double-precision is
considered "infinite" compared with the 32-bit floating-point or 16-bit integer data
formats. In the simulation, the roundoff noise is calculated by subtracting the DCT
coefficients in finite-word-length obtained by an algorithm from those of the same DCT
algorithm using double-precision. The signal to noise ratio is evaluated using Equation
(8-1-1).
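As a sketch, this evaluation step might look as follows, assuming Equation (8-1-1) is the usual ratio of signal power to noise power in decibels (its exact form is given earlier in the chapter); the coefficient values are hypothetical.

```python
import math

def snr_db(reference, finite):
    """SNR in dB: the double-precision DCT coefficients serve as the
    'infinite'-word-length reference, and the noise is the difference
    between them and the finite-word-length coefficients."""
    signal = sum(r * r for r in reference)
    noise = sum((r - f) ** 2 for r, f in zip(reference, finite))
    return 10.0 * math.log10(signal / noise)

ref = [100.0, -50.0, 25.0, -12.5]          # hypothetical double-precision output
fin = [100.001, -49.999, 25.001, -12.499]  # hypothetical 32-bit output
x = snr_db(ref, fin)
```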
8-2-4 Data collection
The Gaussian random data is mapped into signed 8-bit, unsigned 8-bit or signed
9-bit integers as the input to the DCT process. The amount of input data is about the same
as that contained in a frame of image 288*352 pixels in size. For each run, a new seed is
chosen for the random number generator to ensure that different simulation runs are
independent of each other. Double-precision is used throughout the calculation of the
sample mean and variance to keep the error caused by the data collection to a minimum.
The replications method is used to reduce the sample variance [94,96].
[Figure-35: Error model of the butterfly computation structure, showing the roundoff noise sources ef1, ef2 and eo.]
8-3 Simulation Results
The fast DCT algorithms under evaluation include those by Chen [78], Lee [76] and
Hou [77], in comparison with the direct matrix multiplication method. These one-dimensional
algorithms are used to implement the 2-D DCT using a row-column
operation. In addition, fast two-dimensional algorithms based on Lee's and Hou's approaches
have been developed and evaluated [46, 80]. All the simulation results are plotted to
present a meaningful comparison of the above-mentioned algorithms in terms of the error
performance using the finite-word-length calculation.
That the infinite-word-length computation is independent of the DCT algorithm has
been demonstrated by comparing the difference between Lee's and Hou's algorithms
using 64-bit double-precision arithmetic. The signal to difference ratio is in excess of 250
dB, independent of block size. Thus, as expected, the output is essentially independent
of the algorithm when the precision is effectively infinite.
In the floating-point calculation of the DCT, 32-bit floating-point arithmetic is used
throughout the DCT computation, and the multiplicands in the LUTs are also represented in
the 32-bit floating-point format.
8-3-1 Floating-point computation of 1-D DCTs
Figures 36, 37 and 38 show the signal to noise ratios of Chen's, Lee's and Hou's
algorithms, in comparison with those of the Direct Matrix Multiplication (DMM) method,
for 1-D DCT lengths of 4, 8, 16 and 32. The form of the 1-D
input data varies between signed 8-bit, unsigned 8-bit and signed 9-bit integers. It can be
seen that as the length of the DCT increases, the signal to noise ratio decreases for all
algorithms, but that of Chen's algorithm shows the least degradation. All the signal to
noise ratios are greater than 134 dB for all the fast algorithms under all the input data
conditions using the floating-point calculation, and at least 10 dB better than that of the
direct matrix multiplication method. The difference between the best and the worst SNR
for the same DCT length is less than 9 dB for the fast algorithms. The error performances of
[Figure-36: Signal to Noise Ratio, dB, versus the length of the 1-D DCT for Chen's, Lee's and Hou's algorithms and the DMM method.]
[Figure-37: Signal to Noise Ratio, dB, versus the length of the 1-D DCT for Chen's, Lee's and Hou's algorithms and the DMM method.]
[Figure-38: Signal to Noise Ratio, dB, versus the length of the 1-D DCT for Chen's, Lee's and Hou's algorithms and the DMM method.]
Lee's and Hou's algorithms are very close. Interestingly, the error
performance depends on the form of the input data. For example, for an 8- or 16-point
DCT, Chen's algorithm provides a better signal to noise ratio than both Lee's and
Hou's when the input data is a signed 8- or 9-bit integer, whilst the reverse is true when
the input data is an unsigned 8-bit integer. The rate of SNR degradation for all the
algorithms with unsigned 8-bit integer input is much lower than with signed 8- or
9-bit integer input.
8-3-2 Floating-point computation of 2-D DCTs
The signal to noise ratios of the row-column Chen's, Lee's and Hou's algorithms
are plotted in Figures 39, 40 and 41, along with those of the 2-D Vector Radix (VR) DCT
algorithms based on Lee's and Hou's approaches, in comparison with that of the row-column
direct matrix multiplication method. It is interesting to note that when the input
data is a signed 8- or 9-bit integer, the row-column Chen's algorithm gives the best error
performance, whilst the difference between the SNRs of all the fast algorithms of length 4 is
marginal. However, when the input data is an unsigned 8-bit
integer, the vector radix DCT algorithms provide better performance for the DCT lengths
(8 and 16) used in practical image coding. Again, it can be seen that as the length of
the DCT increases the signal to noise ratio decreases, with the degradation rate of
Chen's algorithm being the least and those of the vector radix algorithms being the greatest
(about a 23 dB drop from length 4 to length 32 when unsigned 8-bit integers are used as
input).
Since in the floating-point computation of DCTs the signal to noise ratio of each fast
algorithm considered is greater than 121 dB, the differences between the fast algorithms are
relatively marginal. It is clear that the performance of the fast algorithms is superior to
that of the direct matrix multiplication method.
[Figure-39: Signal to Noise Ratio, dB, versus the length of the 2-D DCT for the row-column Chen, Lee, Hou and DMM methods and the 2-D VR Lee and 2-D VR Hou algorithms.]
[Figure-40: Signal to Noise Ratio, dB, versus the length of the 2-D DCT for the row-column Chen, Lee, Hou and DMM methods and the 2-D VR Lee and 2-D VR Hou algorithms.]
[Figure-41: Signal to Noise Ratio, dB, versus the length of the 2-D DCT for the row-column Chen, Lee, Hou and DMM methods and the 2-D VR Lee and 2-D VR Hou algorithms.]
8-4 Summary
The errors caused by the use of finite-word-length (32-bit floating-point)
computation in the process of computing discrete cosine transforms for coding purposes have been
studied.
In the floating-point computation, the signal to noise ratios of all the fast algorithms
are fairly close and above 120 dB for both the 1-D and 2-D DCT computations. They are
also superior to that of the direct matrix multiplication method. For
one-dimensional 4- to 32-point DCTs, Chen's algorithm shows better performance than
both Lee's and Hou's if the input data is a signed 8- or 9-bit integer, whilst the reverse is
true when the input data is an unsigned 8-bit integer. For two-dimensional 4*4- to 32*32-point
DCTs, the row-column Chen's algorithm is still superior if the input data is a signed
8- or 9-bit integer, whilst the vector radix DCT algorithms produce larger errors.
However, if the input data is an unsigned 8-bit integer, the vector radix algorithms perform
better than the others for 4*4- to 16*16-point DCTs; they are only inferior to the row-column
methods when the length of the 2-D DCT is 32.
It has also been found that, for both floating-point and integer computations
[96], the performance of fast DCT algorithms in terms of the signal to noise ratio
depends on the form of the input data, which is Gaussian noise mapped into
signed 8- or 9-bit integers or unsigned 8-bit integers. A similar study is being undertaken
using fixed-point arithmetic (integer computation), and the results will be reported
elsewhere.
CHAPTER NINE: CONCLUSIONS
9-1 Conclusions
In an attempt to ease the burden caused by the construction and implementation of
multidimensional fast transform algorithms, a structural approach is introduced which is
described by two representations: the matrix representation with the tensor product, and
logic diagrams with a set of modification rules. Using this structural approach, various
vector radix FFT algorithms, including the vector split-radix FFT and mixed vector radix
FFT algorithms, and vector radix direct fast DCT algorithms are derived and implemented
systematically from their 1-D counterparts. The relationship between vector radix
algorithms and the corresponding 1-D fast algorithms is clearly explained, particularly by
the diagrammatical representation. The derivation of vector radix algorithms becomes much
simpler using the logic diagrams, and implementation in both software and hardware can
build on pre-knowledge of the corresponding 1-D algorithms. The structural
approach is described by theorems and a recursive diagrammatical symbol system which
are successively applied to both multidimensional vector radix FFT and vector radix fast
DCT algorithms. The development of computer programs using vector radix fast
algorithms, including the combined factor vector radix-8*8 FFT and the vector radix DCTs based
on Lee's and Hou's methods, has demonstrated the effectiveness of this approach,
especially when a program using the 1-D algorithm is available. Further discussion of
the hardware implementation of vector radix FFTs has shown that in a pipelined VLSI
design to compute a 16*16-point DFT, only one complex multiplier is needed, whilst the
traditional row-column method requires two. Further, in the implementation of 2-D,
say 512*512-point, DFTs using the FDP A41102, the number of FDPs can be reduced if
the vector radix method is applied, thereby reducing the system complexity.
Consequently, the structural approach has been extended to vector radix FFTs of
higher dimensions. Although not discussed in the thesis, the approach can be applied to
m-D (m > 3) vector radix direct DCT algorithms as well.
It has been demonstrated that the logic diagram is a useful and very effective
presentation form in expanding knowledge of multidimensional transform algorithms.
The fact that the 2-D vector radix fast DCT algorithms were derived first using logic
diagrams, with their matrix representations then established for the general case, is a good example.
The computation structures of the 2-D vector radix DCT algorithms are discussed in
comparison with those of the 2-D vector radix FFT algorithms, to show both the basic computation
structures common to vector radix algorithms and the major differences.
In analyzing the structure of Hou's DIT fast DCT algorithm, the correct system
description is presented together with the 2-D vector radix DCT algorithm.
A single-processor 2-D DCT coding system using the FDP A41102 is presented,
rendering a processing rate of 2.5 Ms/s. Where VLSI DCT processors are not
available, it provides an option for the hardware implementation of the transform coding
problem.
Different aspects of the hardware implementation of DCTs for image coding
applications are also discussed using various VLSI DCT processors, DSPs,
multiplier/accumulators and the FDP. It is pointed out that the design of fast algorithms and
the design of VLSI processors are closely related, with the computation structure being a very
important issue.
The error performance of various fast DCT algorithms is evaluated using computer
simulation for the floating-point calculation. When random numbers with a Gaussian
distribution are used as input, it has been found that the performance of the algorithms
depends on the form of the input, namely whether it is a signed 8- or 9-bit or an unsigned 8-bit
integer. The lengths of the DCTs considered are chosen for image coding, varying
from 4 to 32. The performance of the fast DCT algorithms is also compared with that of the
direct matrix multiplication method, in both the 1-D and 2-D cases, to show that the former is
better than the latter when the floating-point calculation is used. The performance of the fast
algorithms using integer computation is still under evaluation.
Finally, it is appropriate to point out some remaining problems and
to make suggestions for further research.
9-2 Suggestions for Future Research
So far, an extensive study has been made of the theoretical aspects of vector radix
algorithms for both DFTs and DCTs, and computer programs have been developed to
show the validity of the proposed approach. System configurations have been described
using vector radix FFT algorithms and the FDP A41102. The following projects are
suggested for future work.
(a) Hardware implementation of a pipelined vector radix FFT for 512*512-point DFT
computation using FDP A41102s, as described in Chapter Three. Since the FDP
uses fixed-point or block floating-point arithmetic, the error analysis of this
system can be conducted based on information presented in [26-28, 85], or by
simulation using the software provided by Austek [114]. Different aspects of this
multi-processor system can be evaluated and compared with other system
configurations.
(b) The feasibility, performance, advantages and disadvantages of VLSI integration of
vector radix FFT algorithms can be closely examined to enhance published results
[2, 134, 137]. Many advantages of using the vector radix FFT in the VLSI
implementation of 2-D DFTs, compared with the row-column FFT algorithms,
have been shown in [134] and [137]. However, the performance of VR FFTs in terms
of area*time^2 [2] has yet to be evaluated. Since only the vector radix-2*2 FFT is
considered in [134] and [137], the number of multiplier stages is shown to be
log2N - 1, where N is the length of the 2-D N1*N2-point DFT assuming N1 = N2 =
N. It has been demonstrated in this thesis that when higher radices are used, the
number of multiplier stages can be reduced. Thus, it is expected that the area*time^2
performance of the VLSI implementation will be improved using vector radix FFT
algorithms.
(c) Extension of the structural approach to other multidimensional fast digital signal
processing algorithms should also be studied.
(d) Hardware implementation of the 2-D DCT for image coding can be carried out using
the FDP A41102, as described in Chapter Seven, for application to video-telephony
or video-conferencing.
(e) An interactive study of DCT computation and quantization can be carried out so that
an evaluation of the overall DCT coding system can be reached [128] before a DCT
codec is implemented for telecommunication purposes.
(f) The feasibility of VLSI integration of vector radix fast DCT algorithms for coding
systems can also be explored.
BIBLIOGRAPHY
[1] D.E. Dudgeon and R.M. Mersereau, Multidimensional Digital Signal Processing,
Prentice-Hall Inc., Englewood Cliffs, N.J., 1984.
[2] I. Gertner and M. Shamash, "VLSI Architectures for Multidimensional Fourier
Transform Processing", IEEE Transactions on Computers, Vol.C-36, pp. 1265-1274,
November 1987.
[3] R.J. Clarke, Transform Coding of Images, Academic Press, 1985.
[4] W.K. Pratt, Digital Image Processing, John Wiley & Sons, Inc., 1978.
[5] L.R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing,
Prentice-Hall, 1975.
[6] A.V. Oppenheim and R.W. Schafer, Digital Signal Processing, Prentice-Hall
International Inc., 1975.
[7] K.R. Castleman, Digital Image Processing, Prentice-Hall Inc., Englewood Cliffs,
New Jersey, 1979.
[8] R. C. Gonzales and P. Wintz, Digital Image Processing, Addison-Wesley Publishing
Company Inc., 1977.
[9] W.S. Hinshaw and A.H. Lent, "An Introduction to NMR Imaging: From the Bloch
Equation to the Imaging Equation", Proceedings IEEE, Vol.71, No.3, March 1983.
[10] L. Jacobson and H. Wechsler, "A Theory for Invariant Object Recognition in the
Frontoparallel Plane", IEEE Trans. Pattern Anal. Machine Intell., Vol.PAMI-6,
pp.325-331, May 1984.
[11] H. Gafni and Y.Y. Zeevi, "A Model for Separation of Spatial and Temporal
Information in the Visual System", Biol. Cybern., Vol.28, pp.73-82, 1977.
[12] H. Gafni and Y.Y. Zeevi, "A Model for Processing of Movement in the Visual
System", Biol. Cybern., Vol.32, pp.165-173, 1979.
[13] The Last Word in DSP, Zoran Digital Signal Processors Data Book, ZORAN
Corporation, 1987.
[14] J.D. O'Sullivan, D.R. Brown, K.T. Hua and C.E. Jacka, "A VLSI Chip for Fast
Fourier Transforms", Digest of Papers, IREECON'87, p. 142, 1987.
[15] D.R. Brown, K.T. Hua , J.D. O'Sullivan, C.E. Jacka and P.E. Single, "A VLSI
Chip for Fast Fourier Transforms", ASSPA 89, Signal Processing, Theories,
Implementations and Applications, pp. 164-168, April 1989.
[16] Fernando Macias-Garza, A.C. Bovik, K.R. Diller, S.J. Aggarwal and J.K.
Aggarwal, "Digital Reconstruction of Three-Dimensional Serially Sectioned Optical
Images", IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol.ASSP-36,
pp.1067-1075, July 1988.
[17] N. Ahmed, T. Natarajan and K.R. Rao, "Discrete Cosine Transform", IEEE
Transactions on Computers, Vol.C-23, pp.90-93, January 1974.
[18] K.R. Rao and P. Yip, Discrete Cosine Transform, Academic Press, Orlando, FL,
1990.
[19] A. Uzum, A.W. Seeto, D. Rosenfeld, D. Skellen and A. Maheswaran, "Video
Coding: A Survey", Workshop on Telecommunication Services Based on Video and
Images, Sydney, September 1988.
[20] J.W. Cooley, P.A.W. Lewis and P.D. Welch, "Historical Notes on the Fast Fourier
Transform", Proceedings of the IEEE, Vol.55, pp. 1675-1677, October 1967.
[21] M.T. Heideman, D.H. Johnson and C.S. Burrus, "Gauss and the History of the Fast
Fourier Transform", IEEE ASSP Magazine, pp.14-21, 1984.
[22] J.W. Cooley and J.W. Tukey, "An Algorithm for the Machine Calculation of
Complex Fourier Series", Math. Comput., Vol.19, No.90, pp.297-301, 1965.
[23] L.R. Rabiner, "The Acoustics, Speech, and Signal Processing Society—A Historical
Perspective", IEEE ASSP Magazine, pp.4-10, January 1984.
[24] B. Gold and C.M. Rader, Digital Processing of Signals, McGraw-Hill Book Co.,
1969.
[25] W.T. Cochran, J.W. Cooley, D.L. Favin, H.D. Helms, R.A. Kaenel, W.W. Lang,
G.C. Maling, Jr., D.E. Nelson, C.M. Rader and P.D. Welch, "What is the Fast
Fourier Transform?", Proceedings of the IEEE, Vol.55, pp.1664-1674, October
1967.
[26] A41102 Frequency Domain Processor, Austek Microsystems Proprietary, Inc. and
Austek Microsystems Pty. Ltd., 1988.
[27] Frequency Domain Processor (FDP™), Austek Microsystems Proprietary, Inc. and
Austek Microsystems Pty. Ltd., 1988.
[28] A41102 Frequency Domain Processor, Austek Microsystems Proprietary, Inc. and
Austek Microsystems Pty. Ltd., 1988.
[29] M. Bellanger, Digital Processing of Signals—Theory and Practice, John Wiley &
Sons Ltd., 1985.
[30] L. Auslander, E. Feig and S. Winograd, "Abelian Semi-Simple Algebras and
Algorithms for the Discrete Fourier Transform", Advances in Applied Mathematics,
No.5, pp.31-55, 1984.
[31] R.E. Blahut, Fast Algorithms for Digital Signal Processing, Addison-Wesley
Publishing, Inc., 1985.
[32] A. Guessoum and R.M. Mersereau, "Fast Algorithms for the Multidimensional
Discrete Fourier Transform", IEEE Transactions on Acoustics, Speech, and Signal
Processing, Vol.ASSP-34, No.4, pp.937-943, August 1986.
[33] L. Auslander and R. Tolimieri, "Ring Structure and the Fourier Transform", The
Mathematical Intelligencer, Vol.7, No.3, pp.49-52, 54, 1985.
[34] M. Vetterli and H.J. Nussbaumer, "Simple FFT and D C T Algorithms with Reduced
Number of Operations", Signal Processing, August 1984.
[35] L. Auslander, E. Feig and S. Winograd, "New Algorithms for the Multidimensional
Discrete Fourier Transform", IEEE Transactions on Acoustics, Speech, and Signal
Processing, Vol.ASSP-31, No.2, pp.388-403, April 1983.
[36] Soo-Chang Pei and Ja-Ling Wu, "Split Vector Radix 2-D Fast Fourier Transform",
IEEE Transactions on Circuits and Systems, Vol.CAS-34, pp.978-980, August
1987.
[37] Zhi-Jian Mou and P. Duhamel, "In-Place Butterfly-Style FFT of 2-D Real
Sequences", IEEE Transactions on Acoustics, Speech, and Signal Processing,
Vol.ASSP-36, pp.1642-1650, October 1988.
[38] M.A. Haque, "A Two-Dimensional Fast Cosine Transform", IEEE Transactions on
Acoustics, Speech, and Signal Processing, Vol.ASSP-33, pp.1532-1539, 1985.
[39] D.F.Elliott and K.R. Rao, Fast Transforms: Algorithms, Analyses, Applications,
Academic Press, 1982.
[40] Third-Generation T M S 3 2 0 User's Guide, SPRU031, Texas Instruments
Incorporated, 1988.
[41] L.R. Morris, "Comparative Study of Time Efficient FFT and WFTA Programs for
General Purpose Computers", IEEE Trans. on Acoustics, Speech, and Signal
Processing, Vol.ASSP-26, pp.141-150, April 1978.
[42] G.E.Rivard, "Direct Fast Fourier Transform of Bivariate Functions", IEEE
Transactions on Acoustics, Speech, and Signal Processing, Vol.ASSP-25, pp.250-
252, June 1977.
[43] D.B. Harris, J.H. McClellan, D.S.K. Chan and H.W. Schuessler, "Vector Radix
Fast Fourier Transform", 1977 IEEE Int. Conf. Acoust., Speech, Signal Processing
Rec., pp.548-551, May 1977.
[44] B. Arambepola, "Fast Computation of Multidimensional Discrete Fourier
Transforms", IEE Proceedings, Vol.127, Pt.F, No.1, February 1980.
[45] H.R. Wu and F.J. Paoloni, "The Structure of Vector Radix Fast Fourier
Transforms", IEEE Transactions on Acoustics, Speech, and Signal Processing,
Vol.37, pp.1415-1424, September 1989.
[46] H.R. Wu and F.J. Paoloni, "A Two Dimensional Fast Cosine Transform
Algorithm—A Structural Approach", Proceedings of IEEE International Conference
on Image Processing, Singapore, pp.50-54, September 1989.
[47] E.O. Brigham, The Fast Fourier Transform, Prentice-Hall Inc., Englewood Cliffs,
N.J., 1974.
[48] S. Winograd, "On Computing the Discrete Fourier Transform", Mathematics of
Computation, Vol.32, No.141, pp.175-199, January 1978.
[49] D.W. Tufts and G. Sadasiv, "The Arithmetic Fourier Transform", IEEE ASSP
Magazine, pp. 13-17, January 1988.
[50] S. Prakash and V.V. Rao, "Vector Radix FFT Error Analysis", IEEE Transactions on
Acoustics, Speech, and Signal Processing, Vol.ASSP-30, pp.808-811, October
1982.
[51] I. Pitas and M.G. Strintzis, "Floating Point Error Analysis of Two-Dimensional Fast
Fourier Transform Algorithms", IEEE Trans, on Circuits and Systems, Vol.35,
pp. 112-115, January 1988.
[52] C.S. Burrus and P.W. Eschenbacher, "An In-Place, In-Order Prime Factor FFT
Algorithm", IEEE Transactions on Acoustics, Speech, and Signal Processing,
Vol.ASSP-29, August 1981.
[53] H.W. Johnson and C.S. Burrus, "On the Structure of Efficient DFT Algorithms",
IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol.ASSP-33,
pp.248-254, February 1985.
[54] Kenji Nakayama, "An Improved Fast Fourier Transform Algorithm Using Mixed
Frequency and Time Decimations", IEEE Transactions on Acoustics, Speech, and
Signal Processing, Vol.ASSP-36, pp.290-292, February 1988.
[55] M.A. Richard, "On the Efficient Implementation of the Split-Radix FFT",
Proceedings ofICASSP-86, pp. 1801-1804, 1986.
[56] R.W. Linderman et al., "CUSP: A 2-μm CMOS Digital Signal Processor", IEEE
Journal of Solid-State Circuits, Vol.SC-20, pp.761-769, June 1985.
[57] J. Makhoul, "A Fast Cosine Transform in One and Two Dimensions", IEEE
Transactions on Acoustics, Speech, and Signal Processing, Vol.ASSP-28, pp.27-34,
1980.
[58] H.J. Nussbaumer and P. Quandalle, "Fast Computation of Discrete Fourier
Transforms Using Polynomial Transforms", IEEE Transactions on Acoustics,
Speech, and Signal Processing, Vol.ASSP-27, pp. 169-181, April 1979.
[59] O.R. Hinton and R.A. Salch, "Two-Dimensional Discrete Fourier Transform with
Small Multiplicative Complexity Using Number Theoretic Transforms", IEE
Proceedings, Vol.131, Pt.G, No.6, December 1984.
[60] H.R. Wu and F.J. Paoloni, "On the Two Dimensional Vector Split-Radix FFT
Algorithm", IEEE Transactions on Acoustics, Speech, and Signal Processing,
August 1989.
[61] R.C. Agarwal and J.W. Cooley, "An Efficient Vector Implementation of the FFT
Algorithm on IBM 3090VF", ICASSP, pp.249-252, 1986.
[62] R.M. Mersereau and T.C. Speake, "A Unified Treatment of Cooley-Tukey
Algorithms for the Evaluation of the Multidimensional DFT", IEEE Transactions on
Acoustics, Speech, and Signal Processing, Vol.ASSP-29, pp.1011-1018, October
1981.
[63] C.S. Burrus and T.W. Parks, Discrete Fourier Transform/Fast Fourier Transform
and Convolution Algorithms, A Wiley-Interscience Publication, John Wiley & Sons,
1985.
[64] G.D. Bergland, "A Fast Fourier Transform Algorithm Using Base Eight Iterations",
Math Computation, Vol.22, pp.275-279, April 1968.
[65] Weizhen Ma and Ruixiang Yin, "New Recursive Factorization Algorithms to
Compute DFT(2m) and DCT(2m)", IEEE Asian Electronics Conference, Hong Kong,
1987.
[66] Weizhen Ma and Dekun Yang, "New Fast Algorithm for Two-Dimensional Discrete
Fourier Transform DFT(2n,2)", Electronics Letters, Vol.25, No.1, pp.21-22,
January 1989.
[67] H.R. Wu and F.J. Paoloni, "Structured Vector Radix FFT Algorithms and Hardware
Implementation", submitted to Journal of Electrical and Electronics Engineering,
Australia, for publication, 1989.
[68] P. Duhamel, "Implementation of 'Split-Radix' FFT Algorithms for Complex, Real,
and Real-Symmetric Data", IEEE Transactions on Acoustics, Speech, and Signal
Processing, Vol.ASSP-34, pp.285-295, April 1986.
[69] P.R. Halmos, Finite-Dimensional Vector Spaces, D.Van Nostrand Company, Inc.,
1958.
[70] H.R. Wu and F.J. Paoloni, "The Structure of Vector Radix Multidimensional Fast
Fourier Transforms", Technical Report No.1, Department of Electrical and Computer
Engineering, The University of Wollongong, 1986.
[71] Zhi-Jian Mou and P. Duhamel, "Corrections to 'In-Place Butterfly-Style FFT of 2-D
Real Sequences'", IEEE Transactions on Acoustics, Speech, and Signal Processing,
Vol.ASSP-37, September 1989.
[72] M. Vetterli, P. Duhamel and C. Guillemot, "Trade-Offs in the Computation of
Mono- and Multi-Dimensional DCTs", Proceedings of IEEE International
Conference on Acoustics, Speech, and Signal Processing, pp.999-1002, 1989.
[73] S. Okubo, R. Nicol, B. Haskell and S. Sabri, "Progress of CCITT Standardization
on n*384 kbit/s Video Codec", IEEE Globecom'87, pp.36-39, 1987.
[74] J.C. Carlach, P. Penard and J.L. Sicre, "TCAD: A 27 MHz 8*8 Discrete Cosine
Transform Chip", Proc. ICASSP'89, 1989.
[75] M.T. Sun, T.C. Chen, A. Gottlieb, L. Wu and M.L. Liou, "A 16*16 Discrete
Cosine Transform Chip", Proc. of SPIE'87 Symp. Visual Commun. Image Proc.,
Vol.845, pp.13-18, Oct. 1987.
[76] B.G. Lee, "A New Algorithm to Compute the Discrete Cosine Transform", IEEE
Trans. on Acoust., Speech, Signal Processing, Vol.ASSP-32, pp.1243-1245,
December 1984.
[77] H.S. Hou, "A Fast Recursive Algorithm For Computing the Discrete Cosine
Transform", IEEE Trans, on Acoustics, Speech, and Signal Processing, Vol.ASSP-
35, pp.1455-1461, 1987.
[78] Wen-Hsiung Chen, C. Harrison Smith and S.C. Fralick, "A Fast Computational
Algorithm for The Discrete Cosine Transform", IEEE Transactions on
Communications, Vol.COM-25, No.9, pp.1004-1009, September 1977.
[79] M. Vetterli, "Fast 2-D Discrete Cosine Transform", IEEE ASSP Conf., pp.1538-1541,
1985.
[80] H.R. Wu and F.J. Paoloni, "A Structural Approach to Two Dimensional Direct Fast
Discrete Cosine Transform Algorithms", Proceedings of International Symposium on
Computer Architecture & Digital Signal Processing, Hong Kong, October 1989.
[81] H.R. Wu and F.J. Paoloni, "Comparison Study on Software and Hardware
Implementation of Various Fast Discrete Cosine Transform Algorithms", Technical
Report-1, The University of Wollongong-Telecom Research Laboratories (Australia)
R&D Contract for the Study of Fast Implementations of Discrete Cosine Transform
Coding Systems, under No.7066, June 1989.
[82] M.J. Narasimha and A.M. Peterson, "On the Computation of the Discrete Cosine
Transform", IEEE Transactions on Communications, Vol.COM-26, pp.934-936,
1978.
[83] H.R. Wu and F.J. Paoloni, "The Structure of Vector Radix Multidimensional Fast
Fourier Transforms", ISSPA 87, Signal Processing, Theories, Implementations and
Applications, pp.89-92, August 1987.
[84] M. Vetterli, "Trade-Offs in the Computation of Mono- and Multi-dimensional
DCTs", Technical Report CU/CTR/TR-090-88-18, Center for Telecommunications
Research, 1988.
[85] Austek Microsystems Proprietary, Inc. and Austek Microsystems Pty. Ltd., A User
Guide for the A41102, 1988.
[86] H.R. Wu, F.J. Paoloni and W. Tan, "Implementation of 2-D DCT for Image Coding
Using FDP™ A41102", Proceedings of the Conference on Image Processing and the
Impact of New Technologies, Canberra, December 1989.
[87] Real Time Discrete Cosine Transformer—Advanced Specifications #2, SGS
Thomson Microelectronics, March 1987.
[88] IMS A121 2-D Discrete Cosine Transform Processor—Advance Information, inmos,
April 1989.
[89] TMC2311-CMOS Fast Cosine Transform Processor—Advance Information, TRW
LSI Products Inc., 1989.
[90] High Performance CMOS Data Book, Integrated Device Technology, 1988.
[91] WE® DSP16A Digital Signal Processor—Advance Data Sheet, AT&T, 1988.
[92] D.M. Blaker, "Using the DSP16/DSP16A for Image Compression", DSP Review,
AT&T, Vol.2, Issue 1, pp.4-5, 1989.
[93] TEXAS INSTRUMENTS, Third-Generation TMS320 User's Guide, 1988.
[94] A. Alan B. Pritsker, Introduction to Simulation and SLAM II, 3rd ed. Systems
Publishing Corporation, Halsted Press, 1986.
[95] Byron J.T. Morgan, Elements of Simulation, Chapman and Hall Ltd, 1984.
[96] H.R. Wu and F.J. Paoloni, "Simulation Study on the Effects of Finite-Word-Length
Calculations for Fast DCT Algorithms", Technical Report-2, The University of
Wollongong-Telecom Research Laboratories (Australia) R&D Contract for the Study
of Fast Implementations of Discrete Cosine Transform Coding Systems, under
No.7066, October 1989.
[97] R. Yavne, "An Economical Method for Calculating the Discrete Fourier Transform",
National Computer Conference and Exposition Proceedings, Vol.33, pp.115-125,
1968.
[98] P. Duhamel, B. Piron and J.M. Etcheto, "On Computing the Inverse DFT", IEEE
Trans, on Acoustics, Speech, and Signal Processing, Vol.ASSP-36, pp.285-286,
February 1988.
[99] M.T. Heideman and C.S. Burrus, "On the Number of Multiplications Necessary to
Compute a Length-2n DFT", IEEE Trans. on Acoustics, Speech, and Signal
Processing, Vol.ASSP-34, pp.91-95, February 1986.
[100] T.S. Huang, "How the Fast Fourier Transform Got Its Name", Computer, Vol.4,
No.3, p. 15, May-June 1971.
[101] Yoiti Suzuki, Toshio Sone and Ken'iti Kido, "A New FFT Algorithm of Radix 3, 6,
and 12", IEEE Trans, on Acoustics, Speech, and Signal Processing, Vol.ASSP-34,
pp.380-383, April 1986.
[102] W.A. Perera and P.J.W. Rayner, "Optimal Design of Multiplierless DFTs and
FFTs", ICASSP'86, pp.245-248, 1986.
[103] W.M. Gentleman and G. Sande, "Fast Fourier Transforms—For Fun and Profit",
Proceedings, Fall Joint Computer Conference, pp.563-578, 1966.
[104] M.R. Schroeder, "The Unreasonable Effectiveness of Number Theory in Science and
Communication (1987 Rayleigh Lecture)", IEEE ASSP Magazine, pp.5-12, January
1988.
[105] K.N. Ngan, K.S. Leong and H. Singh, "Adaptive Cosine Transform Coding of
Images in Perceptual Domain", IEEE Transactions on Acoustics, Speech, and Signal
Processing, Vol.ASSP-37, pp. 1743-1750, November 1989
[106] H. Kitajima, "A Symmetric Cosine Transform", IEEE Trans, on Computers, Vol.C-
29, pp.317-323, 1980.
[107] Byeong Gi Lee, "FCT - A Fast Cosine Transform", IEEE ASSP Conf.,
pp.28A.3.1-28A.3.4, 1984.
[108] H.R. Wu and F.J. Paoloni, "A 2-D Fast Cosine Transform Algorithm Based on
Hou's Approach", submitted to IEEE Transactions on Acoustics, Speech, and Signal
Processing, for publication, 1989.
[109] H.R. Wu and F.J. Paoloni, "The Impact of the VLSI Technology on the Fast
Computation of Discrete Cosine Transforms for Image Coding", to be submitted.
[110] H.J. Nussbaumer, Fast Fourier Transform and Convolution Algorithms, Springer-
Verlag, Berlin Heidelberg, 1982.
[111] E. Arnould and J.P. Dugre, "Real Time Discrete Cosine Transform - An Original
Architecture", IEEE ASSP Conf., pp.48.6.1-48.6.4, 1984.
[112] Naoki Suehiro and Mitsutoshi Hatori, "Fast Algorithms for the DFT and Other
Sinusoidal Transforms", IEEE Transactions on Acoustics, Speech, and Signal
Processing, Vol.ASSP-34, No. 3, pp. 642-644, June 1986.
[113] Pierre Duhamel and Hedi H'Mida, "New 2^n DCT Algorithms Suitable for VLSI
Implementation", IEEE ASSP Conf., pp.1805-1808, 1987.
[114] Austek Microsystems Proprietary, Inc. and Austek Microsystems Pty. Ltd., FDPSIM
USERS GUIDE, 1985.
[115] WE® DSP32C Digital Signal Processor—Advance Information Data Sheet, AT&T.
[116] WE® DSP32C Digital Signal Processor—Information Manual, AT&T, December
1988.
[117] C.S. Burrus, "Bit Reverse Unscrambling for a Radix-2^M FFT", Proc. ICASSP,
pp.1809-1810, 1987.
[118] H. Nawab and J.H. McClellan, "Bounds on the Minimum Number of Data Transfers
in W F T A and FFT Programs", IEEE Transactions on Acoustics, Speech, and Signal
Processing, Vol.ASSP-27, No.4, pp.394-398, August 1979.
[119] Z. Wang, "On Computing the Discrete Fourier and Cosine Transforms", IEEE
Transactions on Acoustics, Speech, and Signal Processing, Vol.ASSP-33, No.4,
pp.1341-1344, October 1985.
[120] Z. Wang and B.R. Hunt, "Comparative Performance of Two Different Versions of
the Discrete Cosine Transform", IEEE Transactions on Acoustics, Speech, and Signal
Processing, Vol.ASSP-32, No.2, pp.450-453, April 1984.
[121] P. Yip and K.R. Rao, "On the Shift Property of DCTs and DSTs", IEEE
Transactions on Acoustics, Speech, and Signal Processing, Vol.ASSP-35, No.3,
pp.404-406, March 1987.
[122] K.N. Ngan, "Image Display Techniques Using the Cosine Transform", IEEE
Transactions on Acoustics, Speech, and Signal Processing, Vol.ASSP-32, No.1,
pp.173-177, February 1984.
[123] O. Ersoy, "On Relating Discrete Fourier, Sine, and Symmetric Cosine Transforms",
IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol.ASSP-33,
No.1, pp.219-222, February 1985.
[124] H.S. Malvar, "Fast Computation of the Discrete Cosine Transform and the Discrete
Hartley Transform", IEEE Transactions on Acoustics, Speech, and Signal Processing,
Vol.ASSP-35, No.10, pp.1484-1485, October 1987.
[125] V. Nagesha, "Comments on 'Fast Computation of the Discrete Cosine Transform and
the Discrete Hartley Transform'", IEEE Transactions on Acoustics, Speech, and
Signal Processing, Vol.ASSP-37, No.3, pp.439-440, March 1989.
[126] N. Nasrabadi and R. King, "Computationally Efficient Discrete Cosine Transform
Algorithm", Electronics Letters, Vol.19, January 1983.
[127] H.R. Wu and F.J. Paoloni, "Comparison Study on Software and Hardware
Implementation of Various Fast Discrete Cosine Transform Algorithms", Addendum
of Technical Report-1, The University of Wollongong-Telecom Research
Laboratories (Australia) R & D Contract for the Study of Fast Implementations of
Discrete Cosine Transform Coding Systems, under No.7066, November 1989.
[128] D.J. Bailey and N. Birch, "Image Compression Using a Discrete Cosine Transform
Image Processor", Electronic Engineering, July 1989.
[129] P.K. Rodman, "High Performance FFTs for a VLIW Architecture", Proceedings of
International Symposium on Computer Architecture & Digital Signal Processing,
Hong Kong, October 1989.
[130] R.K. Asbury, "2D and 3D FFTs on the Intel iPSC/2—A Distributed Memory,
Multi-Processor Supercomputer", Proceedings of International Symposium on
Computer Architecture & Digital Signal Processing, Hong Kong, October 1989.
[131] S.Y. Kung, "From VLSI Arrays to Neural Networks", Proceedings of International
Symposium on Computer Architecture & Digital Signal Processing, Hong Kong,
October 1989.
[132] Y. He and Z. Wang, "Fixed-Point Error Analysis for the Fast Cosine Transform",
Proceedings of International Symposium on Computer Architecture & Digital Signal
Processing, Hong Kong, October 1989.
[133] W. Ma and D. Yang, "On Computing 2-D DFT", Proceedings of International
Symposium on Computer Architecture & Digital Signal Processing, Hong Kong,
October 1989.
[134] W. Liu, T. Hughes and W.T. Krakow, "A Rasterization of Two-Dimensional Fast
Fourier Transform", in VLSI Signal Processing, II, ed. by S.Y. Kung, R.E. Owen
and J.G. Nash, pp. 281-292, IEEE Press, 1986.
[135] S.Y. Kung, H.J. Whitehouse and T. Kailath, ed., VLSI and Modern Signal
Processing, Prentice-Hall, Inc., 1985.
[136] S.Y. Kung, VLSI Array Processors, Prentice-Hall, Inc., 1988.
[137] W. Liu and D.E. Atkins, "VLSI Pipelined Architectures for Two Dimensional Fast
Fourier Transform with Raster-Scan Input Device", International Conference on
Computer Design: VLSI in Computer, pp.370-375, 1984.
[138] A.D. Culhane, M.C. Peckerar and C.R.K. Marrian, "A Neural Net Approach to
Discrete Hartley and Fourier Transforms", IEEE Transactions on Circuits and
Systems, Vol.CAS-36, pp.695-703, 1989.
[139] M.-T. Sun, T.-C. Chen and A.M. Gottlieb, "VLSI Implementation of a 16*16
Discrete Cosine Transform", IEEE Transactions on Circuits and Systems, Vol.CAS-
36, pp.610-617, 1989.
[140] J.A. Beraldin, T. Aboulnasr and W. Steenaart, "Efficient One-Dimensional Systolic
Array Realization of the Discrete Fourier Transform", IEEE Transactions on Circuits
and Systems, Vol.CAS-36, pp.95-100, 1989.
[141] T. Willey, R. Chapman, H. Yoho, T.S. Durrani and D. Preis, "Systolic
Implementations for Deconvolution, DFT and FFT", IEE Proceedings, Vol.132,
Pt.F, 1985.
[142] E.E. Swartzlander, Jr. and G. Hallnor, "Fast Transform Processor Implementation",
Proceedings of ICASSP 84, pp.25A.5.1-25A.5.4, 1984.
[143] M.T. Sun, L. Wu and M.L. Liou, "A Concurrent Architecture for VLSI
Implementation of Discrete Cosine Transform", IEEE Transactions on Circuits and
Systems, Vol.CAS-34, pp.992-994, 1987.
[144] C.D. Thompson, "Fourier Transforms in VLSI", IEEE Transactions on Computers,
Vol.C-32, pp.1047-1057, 1983.
[145] H. Mori, H. Ouchi and S. Mori, "A WSI Oriented Two Dimensional Systolic Array
for FFT", Proceedings of ICASSP 86, pp.2155-2158, 1986.
[146] A. Iwata, I. Horiba, N. Suzumura and N. Takagi, "3-Dimensional Reconstructing
Algorithm for Digital Tomo-Synthesis", Proceedings of ICASSP 86, pp. 1741-1744,
1986.
[147] K.J. Jones, "2D Systolic Solution to Discrete Fourier Transform", IEE Proceedings,
Vol.136, Pt.F, pp.211-216, 1989.
[148] H. Schmid, Decimal Computation, John Wiley & Sons, Inc., 1974.
[149] IEEE Micro, (Special Issue on Digital Signal Processors,) Vol.6, No.6, December
1986.
[150] IEEE Micro, (Special Issue on Digital Signal Processors,) Vol.8, No.6, December
1988.
[151] K. Hwang, Computer Arithmetic, John Wiley & Sons, Inc. 1979.
[152] E.E. Swartzlander, Jr., VLSI Signal Processing Systems, Kluwer Academic
Publishers, 1986.
[153] H.R. Wu and F.J. Paoloni, "The Structure of Vector Radix Multidimensional Fast
Fourier Transforms—Part II", Technical Report No.2, Department of Electrical and
Computer Engineering, The University of Wollongong, 1986.
[154] M. Vulis, "The Weighted Redundancy Transform", IEEE Transactions on Acoustics,
Speech, and Signal Processing, Vol.ASSP-37, pp.1687-1692, November 1989.
[155] J.H. McClellan and C.M. Rader, Number Theory in Digital Signal Processing,
Prentice-Hall Inc., Englewood Cliffs, N.J., 1979.
[156] P. Duhamel and C. Guillemot, "Polynomial Transform Computation of the 2-D
DCT", to be presented at ICASSP-90, April 1990.
[157] J. Suzuki, M. Nomura and S. Ono, "Comparative Study of Transform Coding for
Super High Definition Images", to be presented at ICASSP-90, April 1990.
[158] U. Totzek, F. Matthiesen, S. Wohlleben and T.G. Noll, "CMOS VLSI
Implementation of the 2D-DCT with Linear Processor Arrays", to be presented at
ICASSP-90, April 1990.
[159] M. Yan, J.V. McCanny and Y. Hu, "VLSI Architectures for Digital Image Coding",
to be presented at ICASSP-90, April 1990.
APPENDIX A: PRELIMINARY BACKGROUND ON THE TENSOR
(KRONECKER) PRODUCT AND THE LOGIC
DIAGRAM
In this appendix, a brief introduction is presented to the two basic tools which are used
frequently throughout the thesis, namely, the tensor product, with its properties, and the
logic diagram. The definition and the properties of the tensor product are included to
make the thesis self-contained. The purpose of introducing the logic diagram instead of
the conventional signal flowgraph (the Mason flowgraph) will soon become clear.
Definition—the Tensor (or Kronecker) Product [31]:

Let A = [a_mk] be an M by K matrix, and B = [b_nl] be an N by L matrix. The tensor
product of A and B, denoted by A⊗B, is a matrix with MN rows and KL columns whose
entry in row (m-1)N + n and column (k-1)L + l is given by c_{mn,kl} = a_mk * b_nl.

The tensor product A⊗B is an M by K array of N by L blocks, with the (m,k)th
such block being a_mk B. It is apparent from the definition that the tensor product is not
commutative but is associative, i.e.,

    A⊗B ≠ B⊗A                                                    (A-1)
    (A⊗B)⊗C = A⊗(B⊗C)                                            (A-2)

The following equalities can also be proven to be true [31][29]:

    (A⊗B)(C⊗D) = (AC)⊗(BD)                                       (A-3)
    P_rN (I_r⊗A) M_rN = A⊗I_r                                    (A-4)

where A, B, C, and D are N*N matrices; I_r is of r*r dimension; P_rN is a permutation
matrix which is defined, when r = 2, by

    P_2N (x_0, x_1, x_2, ..., x_{N-1}) = (x_0, x_{N/2}, x_1, x_{N/2+1}, ..., x_{N-1});   (A-5)

and M_rN is the inverse matrix of P_rN, i.e., M_rN = [P_rN]^{-1}.
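As a numerical illustration of identity (A-3), the following sketch (Python, with a hand-rolled Kronecker product so that no external library is assumed; the matrices A, B, C and D are arbitrary illustrative choices) checks the mixed-product property, and also that the product is not commutative:

```python
def matmul(A, B):
    # Plain list-of-lists matrix product.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def kron(A, B):
    # Tensor (Kronecker) product: an M x K array of N x L blocks a_mk * B.
    M, K, N, L = len(A), len(A[0]), len(B), len(B[0])
    return [[A[i // N][j // L] * B[i % N][j % L]
             for j in range(K * L)] for i in range(M * N)]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 1]]
C = [[2, 0], [1, 1]]
D = [[1, 1], [0, 2]]

# Mixed-product property (A-3): (A x B)(C x D) equals (AC) x (BD).
lhs = matmul(kron(A, B), kron(C, D))
rhs = kron(matmul(A, C), matmul(B, D))
assert lhs == rhs

# Non-commutativity (A-1): the two tensor products differ in general.
assert kron(A, B) != kron(B, A)
```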
If we define a time delay operator s which gives

    x(n+1) = s*x(n)  or  x(n-1) = s^{-1}*x(n),                   (A-6)

we shall have the following properties of the tensor product:

    [ x_1(n)   ]   [ 1   0  ] [ x_1(n) ]
    [ x_2(n+1) ] = [ 0  s_1 ] [ x_2(n) ]                         (A-7)

    [ x_1(m,n)     ]   ( [ 1   0  ]   [ 1   0  ] ) [ x_1(m,n) ]
    [ x_2(m,n+1)   ] = ( [ 0  s_2 ] ⊗ [ 0  s_1 ] ) [ x_2(m,n) ]
    [ x_3(m+1,n)   ]                                [ x_3(m,n) ]
    [ x_4(m+1,n+1) ]                                [ x_4(m,n) ]   (A-8)
The logic diagram which we shall introduce consists of basic elements such as the line,
the heavy line (or vector line), the addition operation block, the vector addition operation
block, the scalar product block, and the vector scalar product block. The definitions are
shown in Figure-A-1. The logic diagram of a one dimensional algorithm is equivalent to
the Mason signal flowgraph, and the logic diagram of a multidimensional algorithm is a
direct extension of the one dimensional case which introduces the vector operation
concept into the graphical form. The logic diagram has its own rules which can be readily
used to derive equivalent logic diagrams, and these make the modification of algorithms
and the derivation of multidimensional fast algorithms a comparatively simple procedure.
The inverse of an orthogonal transform is equal to its transpose. The transpose operation
on a logic diagram is to change each addition block to a branch node and each branch
node to an addition block, as well as to reverse the direction of the input and output data
flow.
In the thesis, all the multidimensional fast algorithms can be derived either by using
the properties of the tensor product—a mathematical approach, or by using the logic
diagrams—an engineering approach.
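The transpose rule above relies on the stated fact that the inverse of an orthogonal transform equals its transpose. A minimal numerical check of that fact follows (Python; the 2 x 2 rotation matrix is an illustrative choice, not one of the transform matrices of the thesis):

```python
import math

# An orthogonal 2 x 2 transform (a rotation); its transpose is its inverse.
t = math.pi / 5
Q = [[math.cos(t), -math.sin(t)],
     [math.sin(t),  math.cos(t)]]

QT = [[Q[j][i] for j in range(2)] for i in range(2)]        # transpose
P = [[sum(Q[i][k] * QT[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]                                     # Q * Q^T

# Q * Q^T must be the identity matrix (to rounding error).
for i in range(2):
    for j in range(2):
        assert abs(P[i][j] - (1.0 if i == j else 0.0)) < 1e-12
```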
[Figure-A-1: Definitions of the basic elements of the logic diagram—the line, the heavy
(vector) line, the addition block, the vector addition block, the scalar product block and
the vector scalar product block.]
APPENDIX B: PROOF OF STRUCTURE THEOREMS
In order to prove Structure Theorem 1 of Chapter Three, we assume that
B_T = B_{N1} ⊗ B_{N2}. To prove that the 2-D butterfly matrix B equals B_T, it is only
necessary to show that B(k_1,l_1; m_0,n_0) = B_T(k_1,l_1; m_0,n_0), as both matrices
are of size r_1 r_2 * r_1 r_2. According to the definitions of the 1-D and 2-D butterfly
matrices, the following equations are obvious:

    B_T(k_1,l_1; m_0,n_0) = B_{N1}(k_1, m_0) * B_{N2}(l_1, n_0)
                          = W_{N1}^{k_1 m_0} * W_{N2}^{l_1 n_0}
                          = B(k_1,l_1; m_0,n_0),

that is,

    B = B_T.

The second part of the theorem, as well as the other structure theorems, can be proved
using the same approach.
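The elementwise factorisation used in the proof can also be checked numerically. The sketch below (Python; the sizes r1 = 2, r2 = 4 and the row-major index mapping are illustrative assumptions) builds two 1-D DFT matrices and verifies that their tensor product carries the separable 2-D kernel W_{N1}^{k1*m0} * W_{N2}^{l1*n0} at row (k1,l1) and column (m0,n0):

```python
import cmath

def dft_matrix(N):
    # 1-D DFT matrix: entry (k, n) is W_N^{kn} = exp(-j 2 pi k n / N).
    return [[cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N)]
            for k in range(N)]

def kron(A, B):
    # Tensor (Kronecker) product of two list-of-lists matrices.
    M, K, N, L = len(A), len(A[0]), len(B), len(B[0])
    return [[A[i // N][j // L] * B[i % N][j % L]
             for j in range(K * L)] for i in range(M * N)]

r1, r2 = 2, 4                    # illustrative radices N1, N2
F1, F2 = dft_matrix(r1), dft_matrix(r2)
T = kron(F1, F2)                 # candidate 2-D butterfly matrix

# Row (k1, l1) and column (m0, n0), in row-major order, must carry
# the separable 2-D kernel W_{N1}^{k1 m0} * W_{N2}^{l1 n0}.
for k1 in range(r1):
    for l1 in range(r2):
        for m0 in range(r1):
            for n0 in range(r2):
                expected = F1[k1][m0] * F2[l1][n0]
                got = T[k1 * r2 + l1][m0 * r2 + n0]
                assert abs(got - expected) < 1e-12
```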
APPENDIX C: THE COMBINED FACTOR METHOD
To obtain the combined factor vector radix-8*8 DIF FFT using Equation (3-6-8), one
simply combines the factor α with the twiddle factor matrix, using the fact that α^2 = -j.
Diagrammatically, the method attempts to reduce the number of multiplications contained
in the vector radix-8*8 butterfly structure by combining the row twiddles with the
column twiddles. For some algorithms there may be more than one way of carrying out
this task, but a minimum number of multiplications, with the combined factors (or
twiddles) at regular places, is often preferred; the example in [45] is but one. The
combination of the row twiddles with the column twiddles needs more explanation. If the
row twiddle is W_{N1}^α and the column twiddle is W_{N2}^β, combining them results
in W_{N1}^α * W_{N2}^β; when N1 = N2 = N, W_N^α * W_N^β = W_N^{α+β} = W_N^γ,
where γ = α+β. If a Look Up Table (LUT) is used in the program, as is the case in this
study, the combined factors are pre-calculated and stored in the LUT, from which they
are fetched when needed. This practice increases the DFT processing speed considerably.
Figure-C-1 shows a 2-D 64*64-point DFT calculated using the CF VR-8*8 FFT [45]
with the 2-D input data shown in Figure-C-2. Figure-C-3 is another 2-D DFT generated
by the same program, using Figure-C-4 as the input.
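A minimal sketch of such a pre-calculated LUT of combined factors follows (Python; the table layout and function name are illustrative assumptions, not the actual data structures of the thesis program). One table fetch at index γ = <α+β>_N replaces a run-time complex multiplication of the two twiddles:

```python
import cmath

N = 64
# Pre-calculated LUT of all N twiddle factors W_N^g = exp(-j 2 pi g / N).
LUT = [cmath.exp(-2j * cmath.pi * g / N) for g in range(N)]

def combined_twiddle(alpha, beta):
    # One LUT fetch replaces the run-time product W_N^alpha * W_N^beta.
    return LUT[(alpha + beta) % N]

# Sanity check: the fetched combined factor equals the explicit product.
for alpha, beta in [(0, 0), (5, 11), (40, 50), (63, 63)]:
    assert abs(combined_twiddle(alpha, beta) - LUT[alpha] * LUT[beta]) < 1e-12
```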
The vector radix-16*16 FFT algorithm can be constructed in the same manner. The
vector radix-16*16 butterfly computational structure can be calculated according to
Figure-9, and the vector radix-16*16 twiddle factors can be generated, using the
structure theorem, from the corresponding twiddles of the radix-16 FFT algorithm.
Figure-C-1: A 2-D 64 x 64-point DFT calculated using the CF VR-8 x 8 FFT algorithm.
Figure-C-2: The 2-D input data used for Figure-C-1.
Figure-C-3: A 2-D 64 x 64-point DFT calculated using the CF VR-8 x 8 FFT algorithm.
Figure-C-4: The 2-D input data used for Figure-C-3.
APPENDIX D: DERIVATION OF VECTOR RADIX 2-D FAST DCT
BASED ON LEE'S ALGORITHM
In Equation (5-2-4a), set k = 2k' + k" and l = 2l' + l"; k', l' = 0,1,...,N/2-1 and
k", l" = 0,1. Then the following four equations are obtained:

    X(2k',2l')     = Σ_{n=0}^{N-1} Σ_{m=0}^{N-1} x(n,m) C_{2N}^{(2n+1)(2k')} C_{2N}^{(2m+1)(2l')}       (D-1)

    X(2k',2l'+1)   = Σ_{n=0}^{N-1} Σ_{m=0}^{N-1} x(n,m) C_{2N}^{(2n+1)(2k')} C_{2N}^{(2m+1)(2l'+1)}     (D-2)

    X(2k'+1,2l')   = Σ_{n=0}^{N-1} Σ_{m=0}^{N-1} x(n,m) C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l')}     (D-3)

    X(2k'+1,2l'+1) = Σ_{n=0}^{N-1} Σ_{m=0}^{N-1} x(n,m) C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}   (D-4)

From Equation (D-1),

    X(2k',2l') = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} [x(n,m) + x(N-1-n,m)
                 + x(n,N-1-m) + x(N-1-n,N-1-m)] C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}             (D-5)

Note that C_{2N}^{(2(N-1-n)+1)(2k')} = C_{2(N/2)}^{(2n+1)k'}. Using the same method:
    X(2k',2l'+1) = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} [x(n,m) + x(N-1-n,m)
                   - x(n,N-1-m) - x(N-1-n,N-1-m)] C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}          (D-6)

    X(2k'+1,2l') = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} [x(n,m) - x(N-1-n,m)
                   + x(n,N-1-m) - x(N-1-n,N-1-m)] C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}          (D-7)

and,

    X(2k'+1,2l'+1) = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} [x(n,m) - x(N-1-n,m)
                     - x(n,N-1-m) + x(N-1-n,N-1-m)] C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}       (D-8)

Note that C_{2N}^{(2(N-1-m)+1)(2l')} = C_{2(N/2)}^{(2m+1)l'},
C_{2N}^{(2(N-1-n)+1)(2k'+1)} = -C_{2N}^{(2n+1)(2k'+1)}
and C_{2N}^{(2(N-1-m)+1)(2l'+1)} = -C_{2N}^{(2m+1)(2l'+1)}.
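The decimated equations can be checked numerically. The sketch below (Python) verifies the even/even case (D-5) for a small N, using the unnormalized kernel C_{2N}^{p} = cos(p*pi/(2N)) implied by Equation (5-2-4a); the other three cases can be checked the same way:

```python
import math
import random

def C(twoN, p):
    # DCT kernel C_{2N}^{p} = cos(p * pi / (2N)); twoN is the subscript 2N.
    return math.cos(p * math.pi / twoN)

N = 4
random.seed(1)
x = [[random.random() for _ in range(N)] for _ in range(N)]

def X(k, l):
    # Direct (unnormalized) 2-D DCT, as in Equation (5-2-4a).
    return sum(x[n][m] * C(2 * N, (2 * n + 1) * k) * C(2 * N, (2 * m + 1) * l)
               for n in range(N) for m in range(N))

def X_even_even(kp, lp):
    # Right-hand side of (D-5): fold the input, use the half-size kernel.
    return sum((x[n][m] + x[N - 1 - n][m] + x[n][N - 1 - m]
                + x[N - 1 - n][N - 1 - m])
               * C(N, (2 * n + 1) * kp) * C(N, (2 * m + 1) * lp)
               for n in range(N // 2) for m in range(N // 2))

for kp in range(N // 2):
    for lp in range(N // 2):
        assert abs(X(2 * kp, 2 * lp) - X_even_even(kp, lp)) < 1e-9
```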
Define:

    g1(n,m) = x(n,m) + x(N-1-n,m) + x(n,N-1-m) + x(N-1-n,N-1-m)

    g2(n,m) = [1/(2 C_{2N}^{(2m+1)})] [x(n,m) + x(N-1-n,m) - x(n,N-1-m) - x(N-1-n,N-1-m)]

    g3(n,m) = [1/(2 C_{2N}^{(2n+1)})] [x(n,m) - x(N-1-n,m) + x(n,N-1-m) - x(N-1-n,N-1-m)]

    g4(n,m) = [1/(2 C_{2N}^{(2n+1)} 2 C_{2N}^{(2m+1)})] [x(n,m) - x(N-1-n,m) - x(n,N-1-m) + x(N-1-n,N-1-m)]
Then:

    X(2k',2l'+1) = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g2(n,m) 2 C_{2N}^{(2m+1)} C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}
                 = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g2(n,m) C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}
                   + Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g2(n,m) C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)(l'+1)}
                 = G2(k',l') + G2(k',l'+1)                                                               (D-9)

where G2(k',l') = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g2(n,m) C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}, noticing that

    2 C_{2N}^{(2m+1)} C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}
        = C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'} + C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)(l'+1)}.

Similarly,

    X(2k'+1,2l') = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g3(n,m) 2 C_{2N}^{(2n+1)} C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}
                 = G3(k',l') + G3(k'+1,l')                                                               (D-10)

where G3(k',l') = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g3(n,m) C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}, noticing that

    2 C_{2N}^{(2n+1)} C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}
        = C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'} + C_{2(N/2)}^{(2n+1)(k'+1)} C_{2(N/2)}^{(2m+1)l'}.

Accordingly,

    X(2k'+1,2l'+1) = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g4(n,m) 2 C_{2N}^{(2n+1)} 2 C_{2N}^{(2m+1)} C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}
                   = G4(k',l') + G4(k',l'+1) + G4(k'+1,l') + G4(k'+1,l'+1)                               (D-11)

where G4(k',l') = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g4(n,m) C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}.

After defining G1(k',l') = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g1(n,m) C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}, the matrix
form of the forward algorithm can be obtained.
    [g'1(n,m)]           [x(n,m)        ]
    [g'2(n,m)] = (B⊗B)   [x(n,N-1-m)    ]
    [g'3(n,m)]           [x(N-1-n,m)    ]
    [g'4(n,m)]           [x(N-1-n,N-1-m)]                        (D-12a)

    [g1(n,m)]             [g'1(n,m)]
    [g2(n,m)] = (M⊗M')    [g'2(n,m)]
    [g3(n,m)]             [g'3(n,m)]
    [g4(n,m)]             [g'4(n,m)]                             (D-12b)

    [G1(k,l)]   N/2-1 N/2-1                                        [g1(n,m)]
    [G2(k,l)] =   Σ     Σ   C_{2(N/2)}^{(2n+1)k} C_{2(N/2)}^{(2m+1)l} [g2(n,m)]
    [G3(k,l)]    n=0   m=0                                         [g3(n,m)]
    [G4(k,l)]                                                      [g4(n,m)]   (D-12c)

    [X(2k,2l)    ]           [G1(k,l)    ]
    [X(2k,2l+1)  ]           [G2(k,l)    ]
    [X(2k+1,2l)  ] = (P⊗P)   [G2(k,l+1)  ]
    [X(2k+1,2l+1)]           [G3(k,l)    ]
                             [G4(k,l)    ]
                             [G4(k,l+1)  ]
                             [G3(k+1,l)  ]
                             [G4(k+1,l)  ]
                             [G4(k+1,l+1)]                       (D-12d)

where k,l,n,m = 0,1,...,N/2-1, and Gi(k+1,l)|_{k=N/2-1} = Gi(k,l+1)|_{l=N/2-1} = 0, for i =
2,3,4.
For the inverse 2-D DCT defined by Equation (5-2-4b), the same decimation can be
applied to obtain the following equations:

    x(n,m) = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} [X(2k',2l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}
             + X(2k',2l'+1) C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}
             + X(2k'+1,2l') C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}
             + X(2k'+1,2l'+1) C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}]                             (D-13)

    x(n,N-1-m) = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} [X(2k',2l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}
                 - X(2k',2l'+1) C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}
                 + X(2k'+1,2l') C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}
                 - X(2k'+1,2l'+1) C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}]                         (D-14)

    x(N-1-n,m) = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} [X(2k',2l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}
                 + X(2k',2l'+1) C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}
                 - X(2k'+1,2l') C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}
                 - X(2k'+1,2l'+1) C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}]                         (D-15)

    x(N-1-n,N-1-m) = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} [X(2k',2l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}
                     - X(2k',2l'+1) C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}
                     - X(2k'+1,2l') C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}
                     + X(2k'+1,2l'+1) C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}]                     (D-16)

where n,m = 0,1,...,N/2-1. Define X(*,2l'-1)|_{l'=0} = 0, X(2k'-1,*)|_{k'=0} = 0, and
    h1(n,m) = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} X(2k',2l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}

    h2(n,m) = 2 C_{2N}^{(2m+1)} Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} X(2k',2l'+1) C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}
            = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} [X(2k',2l'+1) + X(2k',2l'-1)] C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}
            = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} H2(k',l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}

    h3(n,m) = 2 C_{2N}^{(2n+1)} Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} X(2k'+1,2l') C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}
            = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} H3(k',l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}

    h4(n,m) = 2 C_{2N}^{(2n+1)} 2 C_{2N}^{(2m+1)} Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} X(2k'+1,2l'+1) C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}
            = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} H4(k',l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}

where n,m = 0,1,...,N/2-1;

    H2(k',l') = X(2k',2l'+1) + X(2k',2l'-1);
    H3(k',l') = X(2k'+1,2l') + X(2k'-1,2l');
    H4(k',l') = X(2k'+1,2l'+1) + X(2k'+1,2l'-1) + X(2k'-1,2l'+1) + X(2k'-1,2l'-1).
Therefore,

    x(n,m) = h1(n,m) + [1/(2 C_{2N}^{(2m+1)})] h2(n,m) + [1/(2 C_{2N}^{(2n+1)})] h3(n,m)
             + [1/(2 C_{2N}^{(2n+1)} 2 C_{2N}^{(2m+1)})] h4(n,m)                                         (D-17)

    x(n,N-1-m) = h1(n,m) - [1/(2 C_{2N}^{(2m+1)})] h2(n,m) + [1/(2 C_{2N}^{(2n+1)})] h3(n,m)
                 - [1/(2 C_{2N}^{(2n+1)} 2 C_{2N}^{(2m+1)})] h4(n,m)                                     (D-18)

    x(N-1-n,m) = h1(n,m) + [1/(2 C_{2N}^{(2m+1)})] h2(n,m) - [1/(2 C_{2N}^{(2n+1)})] h3(n,m)
                 - [1/(2 C_{2N}^{(2n+1)} 2 C_{2N}^{(2m+1)})] h4(n,m)                                     (D-19)

    x(N-1-n,N-1-m) = h1(n,m) - [1/(2 C_{2N}^{(2m+1)})] h2(n,m) - [1/(2 C_{2N}^{(2n+1)})] h3(n,m)
                     + [1/(2 C_{2N}^{(2n+1)} 2 C_{2N}^{(2m+1)})] h4(n,m)                                 (D-20)
For k,l,n,m = 0,1,...,N/2-1, and X(2k-1,*)|_{k=0} = X(*,2l-1)|_{l=0} = 0, the matrix
form of the 2-D IDCT algorithm is presented as follows:

    [H1(k,l)]           [X(2k,2l)    ]
    [H2(k,l)] = (P⊗P)   [X(2k,2l+1)  ]
    [H3(k,l)]           [X(2k,2l-1)  ]
    [H4(k,l)]           [X(2k+1,2l)  ]
                        [X(2k+1,2l+1)]
                        [X(2k+1,2l-1)]
                        [X(2k-1,2l)  ]
                        [X(2k-1,2l+1)]
                        [X(2k-1,2l-1)]                           (D-21a)

    [h1(n,m)]   N/2-1 N/2-1                                        [H1(k,l)]
    [h2(n,m)] =   Σ     Σ   C_{2(N/2)}^{(2n+1)k} C_{2(N/2)}^{(2m+1)l} [H2(k,l)]
    [h3(n,m)]    k=0   l=0                                         [H3(k,l)]
    [h4(n,m)]                                                      [H4(k,l)]   (D-21b)

    [h'1(n,m)]             [h1(n,m)]
    [h'2(n,m)] = (M⊗M')    [h2(n,m)]
    [h'3(n,m)]             [h3(n,m)]
    [h'4(n,m)]             [h4(n,m)]                             (D-21c)

    [x(n,m)        ]           [h'1(n,m)]
    [x(n,N-1-m)    ] = (B⊗B)   [h'2(n,m)]
    [x(N-1-n,m)    ]           [h'3(n,m)]
    [x(N-1-n,N-1-m)]           [h'4(n,m)]                        (D-21d)
APPENDIX E: ARITHMETIC COMPLEXITY OF THE VECTOR
SPLIT-RADIX DIF FFT ALGORITHM
According to Equation (3-8-3), the multiplications in the 2-D vector split-radix
DIF FFT are all contained in the twiddle factor matrix F_m. The N*N DFT can be calculated
using one (N/2)*(N/2) DFT and twelve (N/4)*(N/4) DFTs. The number of extra
multiplications required at each stage by this approach is caused by F_m. The total
number of complex multiplications Mn needed for a 2-D N*N DFT, where N = 2^n, is
given by

    Mn = Mn-1 + 12*Mn-2 + Mextra                                 (E-1)

with M2 = M1 = 0, and it can be shown that Mextra = 12*((N/4)^2 - N/4) for n >= 3.

The total number of complex additions An is

    An = An-1 + 12*An-2 + Aextra                                 (E-2)

where Aextra = 3*(N/2)^2 + 48*(N/4)^2, A0 = 0, and A1 = 8.
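The recurrence (E-1), together with the expression for Mextra, is easy to tabulate. The short sketch below (Python) computes Mn for N = 2^n and reproduces the entries of Table-E-1:

```python
def multiplications(n):
    # Complex multiplications for the 2-D vector split-radix FFT, N = 2^n.
    # M1 = M2 = 0; Mn = M(n-1) + 12*M(n-2) + Mextra, Equation (E-1),
    # with Mextra = 12*((N/4)^2 - N/4), Equation (E-5).
    M = {1: 0, 2: 0}
    for i in range(3, n + 1):
        Ni = 2 ** i
        M_extra = 12 * ((Ni // 4) ** 2 - Ni // 4)
        M[i] = M[i - 1] + 12 * M[i - 2] + M_extra
    return M[n]

# A few entries of Table-E-1.
assert multiplications(3) == 24        # N = 8
assert multiplications(6) == 6024      # N = 64
assert multiplications(12) == 67746504 # N = 4096
```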
To prove that Mextra = 12*((N/4)^2 - N/4) for n >= 3, it is observed that amongst F_m
there are three groups of factors:

    (1) W_N^n, W_N^{3n}, W_N^m, W_N^{3m};
    (2) W_N^{m+n}, W_N^{3m+3n}; and
    (3) W_N^{2m+n}, W_N^{2m+3n}, W_N^{m+2n}, W_N^{m+3n}, W_N^{3m+2n}, W_N^{3m+n};

within which it is necessary to determine the number of trivial multiplications. The term
"trivial multiplication" means that the value of the twiddle factor is ±1 or ±j. No
multiplications are needed for 2*2 and 4*4 point DFTs, and thus M1 = M2 = 0.

In the first group, when n (resp. m) = 0 there are N/4 trivial multiplications as m (resp.
n) varies from 0 to N/4-1.
In the second group, W_N^{m+n} is considered first. According to the properties of the
transformation <a>_b, which finds the residue of a modulo b [155], for each
m = 1,2,...,N/4-1 there exists an n such that m+n = N/4, i.e. <m+n>_{N/4} = 0, in addition
to the pair m = 0 and n = 0. There are therefore N/4 trivial multiplications for the factor
W_N^{m+n}. Since <3(m+n)>_{N/4} = <<3>_{N/4}<m+n>_{N/4}>_{N/4}, we have
<3(m+n)>_{N/4} = 0 whenever <m+n>_{N/4} = 0, and so it is true that amongst the
(N/4)^2 multiplications, N/4 are trivial for each of the twiddle factors in this group.
In the last group, only W_N^{2m+n}, W_N^{3m+n} and W_N^{2n+3m} need be
considered, as the rest can be proved accordingly.

Considering the factor W_N^{2m+n}, we have

    <2m+n>_{N/4} = <<2m>_{N/4} + <n>_{N/4}>_{N/4}
                 = <<2m>_{N/4} + n>_{N/4}
                 = <m' + n>_{N/4}                                (E-3)

where m' = <2m>_{N/4}. For each m = 0,1,...,N/4-1 there exists an n such that
<m' + n>_{N/4} = 0, and at those points the multiplications become trivial, giving N/4
trivial multiplications over the grid. It can be shown that the same is true for
W_N^{3m+n} as well.
W_N^{2n+3m} also contains N/4 trivial multiplications; however, this is not so simple
to prove. To start, it is required to prove that <3m>_{N/4} is a one-to-one and onto
mapping whose domain is A = {0,1,...,N/4-1}, i.e., for every m in A, <3m>_{N/4} is in
A, and if m1 ≠ m2, with both m1 and m2 in A, then <3m1>_{N/4} ≠ <3m2>_{N/4}. This
can be achieved by invoking the theorem [8] which states that, for n = 0,1,2,...,M-1,
<an>_M takes on all the M possible residues if (a,M) = 1.

In the problem considered here, a = 3 and M = N/4 = 2^n/4, n >= 3. Since N/4 is a
power of 2, a and N/4 are mutually prime. According to the above theorem, <3m>_{N/4}
is a one-to-one and onto mapping: for every m in A there exists an m' = <3m>_{N/4} in
A, and m1 ≠ m2 implies m'1 ≠ m'2. Since

    <2n+3m>_{N/4} = <<2n>_{N/4} + <3m>_{N/4}>_{N/4}
                  = <<2n>_{N/4} + m'>_{N/4},                     (E-4)

in Equation (E-4) m' is in A and, in accordance with m, can take any value in A. So, for
each n in A there exists an m', and hence an m, such that Equation (E-4) is zero.
Therefore W_N^{2n+3m} contains N/4 trivial multiplications, which completes the
derivation.
From the above discussion, it is concluded that

    Mextra = 12*((N/4)^2 - N/4)                                  (E-5)

for n >= 3.
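The count underlying (E-5) can also be verified exhaustively for small N. The sketch below (Python) scans all twelve twiddle-factor families over the (N/4) x (N/4) grid of (m,n) pairs and confirms that each family contains exactly N/4 trivial entries, i.e. entries whose exponent is divisible by N/4, so that the factor is ±1 or ±j:

```python
# Exponent patterns a*m + b*n of the twelve twiddle-factor families.
FAMILIES = [(1, 0), (3, 0), (0, 1), (0, 3),    # group (1)
            (1, 1), (3, 3),                    # group (2)
            (2, 1), (2, 3), (1, 2), (1, 3),    # group (3)
            (3, 2), (3, 1)]

def trivial_count(N, a, b):
    # W_N^e is trivial (+1, -1, +j or -j) iff e is a multiple of N/4.
    q = N // 4
    return sum((a * m + b * n) % q == 0
               for m in range(q) for n in range(q))

for N in (8, 16, 32, 64):
    for a, b in FAMILIES:
        assert trivial_count(N, a, b) == N // 4
    # Hence Mextra = 12 * ((N/4)^2 - N/4), Equation (E-5).
```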
The number of complex multiplications needed for the 2-D vector split-radix FFT to
perform N*N-point complex DFTs is listed in Table-E-1.
Table-E-1: The number of complex multiplications required for the 2-D vector
split-radix FFT to perform N x N-point complex DFTs.

    N        Mn-2        Mn-1       Mextra          Mn
    8           0           0           24          24
    16          0          24          144         168
    32         24         168          672        1128
    64        168        1128         2880        6024
    128      1128        6024        11904       31464
    256      6024       31464        48384      152136
    512     31464      152136       195072      724776
    1024   152136      724776       783360     3333768
    2048   724776     3333768      3139584    15170664
    4096  3333768    15170664     12570624    67746504