
The Regularized Fast Hartley Transform: Optimal Formulation of Real-Data Fast Fourier Transform for Silicon-Based Implementation in Resource-Constrained Environments

SIGNALS AND COMMUNICATION TECHNOLOGY

For other titles published in this series, go to http://www.springer.com/series/4748


Keith Jones

The Regularized Fast Hartley Transform

Optimal Formulation of Real-Data Fast Fourier Transform for Silicon-Based Implementation in Resource-Constrained Environments



Dr. Keith Jones
L-3 Communications TRL Technology
Shannon Way, Ashchurch, Tewkesbury
Gloucestershire, GL20 8ND, U.K.

ISBN 978-90-481-3916-3
e-ISBN 978-90-481-3917-0
DOI 10.1007/978-90-481-3917-0
Springer Dordrecht Heidelberg London New York

Library of Congress Control Number: 2009944070

© Springer Science+Business Media B.V. 2010
No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.

Cover design: WMXDesign GmbH

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


Preface

Most real-world spectrum analysis problems involve the computation of the real-data discrete Fourier transform (DFT), a unitary transform that maps elements of the linear space of real-valued N-tuples, R^N, to elements of its complex-valued counterpart, C^N. When carried out in hardware, this is conventionally achieved via a real-from-complex strategy using a complex-data version of the fast Fourier transform (FFT), the generic name given to the class of fast algorithms used for the efficient computation of the DFT. Such algorithms are typically derived by exploiting the property of symmetry, whether it exists just in the transform kernel or, in certain circumstances, in the input data and/or output data as well. In order to make effective use of a complex-data FFT via the chosen real-from-complex strategy, however, the input data to the DFT must first be converted from elements of R^N to elements of C^N.
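The real-from-complex strategy referred to above can be illustrated with the familiar "two-for-one" packing trick, in which two real-valued sequences are carried through a single complex-data FFT and then separated via the Hermitian symmetry of real-data spectra. The sketch below uses NumPy's FFT purely for illustration; it is not the formulation developed in this monograph.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.standard_normal(32)
x2 = rng.standard_normal(32)

# Pack two real sequences into one complex sequence and transform once
Z = np.fft.fft(x1 + 1j * x2)

# conj(Z[(-k) mod N]): reverse and rotate so index k holds Z[(N-k) mod N]
Zneg = np.conj(np.roll(Z[::-1], 1))

# Hermitian-symmetry separation of the two interleaved spectra
X1 = 0.5 * (Z + Zneg)
X2 = -0.5j * (Z - Zneg)

assert np.allclose(X1, np.fft.fft(x1))
assert np.allclose(X2, np.fft.fft(x2))
```

The separation step costs O(N) extra operations, which is why this route is attractive whenever two real-data transforms of the same length are needed at once.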

The reason for choosing the computational domain of real-data problems such as this to be C^N, rather than R^N, is due in part to the fact that computing equipment manufacturers have invested so heavily in producing digital signal processing (DSP) devices built around the design of the complex-data fast multiplier and accumulator (MAC), an arithmetic unit ideally suited to the implementation of the complex-data radix-2 butterfly, the computational unit used by the familiar class of recursive radix-2 FFT algorithms. The net result is that the problem of the real-data DFT is effectively being modified so as to match an existing complex-data solution rather than a solution being sought that matches the actual problem. The increasingly powerful field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) technologies are now giving DSP design engineers far greater control, however, over the type of algorithm that may be used in the building of high-performance DSP systems, so that more appropriate algorithmically-specialized hardware solutions to the real-data DFT may be actively sought and exploited to some advantage with these technologies.

The first part of this monograph thus concerns itself with the design of a new and highly-parallel formulation of the fast Hartley transform (FHT), which is to be used, in turn, for the efficient computation of the DFT. The FHT is the generic name given to the class of fast algorithms used for the efficient computation of the discrete Hartley transform (DHT) – a unitary (and, in fact, orthogonal) transform and close relative of the DFT possessing many of the same properties – which, for the processing of real-valued data, has attractions over the complex-data FFT in terms of reduced arithmetic complexity and reduced memory requirement. Its bilateral or reversal property also means that it may be straightforwardly applied to the transformation from Hartley space to data space as well as from data space to Hartley space, making it thus equally applicable to the computation of both the DFT and its inverse. A drawback, however, of conventional FHT algorithms lies in the loss of regularity (as relates to the algorithm structure) arising from the need for two sizes – and thus two separate designs – of butterfly for efficient fixed-radix formulations, where the regularity equates to the amount of repetition and symmetry present in the design. A generic version of the double butterfly, referred to as the "GD-BFLY" for economy of words, is therefore developed for the radix-4 FHT that overcomes the problem in an elegant fashion. The resulting single-design solution, referred to as the regularized radix-4 FHT and abbreviated to "R2⁴ FHT", lends itself naturally to parallelization and to mapping onto a regular computational structure for implementation with parallel computing technology.
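The DHT-to-DFT relationship mentioned above can be made concrete with a short sketch: a direct O(N²) DHT built on the cas kernel, cas θ = cos θ + sin θ, from which the DFT of real-valued data is recovered, together with a check of the bilateral (self-inverse up to a factor of N) property. This is illustrative only, and bears no relation to the fast, regularized formulation derived later in the monograph.

```python
import numpy as np

def dht(x):
    """Direct O(N^2) DHT via the cas kernel (illustrative only)."""
    N = len(x)
    n = np.arange(N)
    arg = 2 * np.pi * np.outer(n, n) / N
    return (np.cos(arg) + np.sin(arg)) @ x

rng = np.random.default_rng(0)
x = rng.standard_normal(16)

H = dht(x)
Hneg = np.roll(H[::-1], 1)            # H[(N-k) mod N]

# DFT from DHT: Re F[k] = (H[k]+H[-k])/2, Im F[k] = -(H[k]-H[-k])/2
F = 0.5 * (H + Hneg) - 0.5j * (H - Hneg)
assert np.allclose(F, np.fft.fft(x))

# Bilateral (reversal) property: applying the DHT twice
# recovers the input up to a factor of N
assert np.allclose(dht(H) / len(x), x)
```

Because the forward and inverse transforms share the same real-valued kernel, a single FHT engine serves both directions, which is precisely the property exploited later for computing the DFT and its inverse.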

A partitioned-memory architecture for the parallel computation of the GD-BFLY and the resulting R2⁴ FHT is next developed and discussed in some detail, this exploiting a single locally-pipelined high-performance processing element (PE) that yields an attractive solution, particularly when implemented with parallel computing technology, that is both area-efficient and scalable in terms of transform length. High performance is achieved by having the PE able to process the input/output data sets to the GD-BFLY in parallel, this in turn implying the need to be able to access simultaneously, and without conflict, both multiple data and multiple twiddle factors, or trigonometric coefficients, from their respective memories.

A number of pipelined versions of the PE are described using both fast fixed-point multipliers and phase rotators – where the phase rotation operation is carried out in optimal fashion with hardware-efficient Co-Ordinate Rotation DIgital Computer (CORDIC) arithmetic – which enable arithmetic complexity to be traded off against memory requirement. The result is a set of scalable designs based upon the partitioned-memory single-PE computing architecture which each yield a hardware-efficient solution with universal application, such that each new application necessitates minimal re-design cost, as well as solutions amenable to efficient implementation with the silicon-based technologies. The resulting area-efficient and scalable single-PE architecture is shown to yield solutions to the real-data radix-4 FFT that are capable of achieving the computational density – that is, the throughput per unit area of silicon – of the most advanced commercially-available complex-data solutions for just a fraction of the silicon resources.
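As a rough illustration of the hardware-efficient phase rotation mentioned above, the following sketch implements the classical CORDIC circular-rotation iteration in floating point. A real PE would use fixed-point shift-and-add stages with a pre-computed scale constant, so this is only a behavioural model of the idea, not the design described in the monograph.

```python
import math

def cordic_rotate(x, y, angle, iterations=24):
    """Rotate the vector (x, y) by `angle` radians using CORDIC
    circular rotation: each stage uses only a shift-like scaling
    by 2^-i and additions, with a single final gain correction."""
    # Pre-computed arctangents of 2^-i (stored in a small LUT in hardware)
    atans = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Aggregate CORDIC gain correction K = prod(1/sqrt(1 + 2^-2i))
    gain = 1.0
    for i in range(iterations):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    z = angle  # residual angle driven towards zero
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * atans[i]
    return gain * x, gain * y
```

For example, rotating the unit vector (1, 0) by π/4 yields (cos π/4, sin π/4) to within the precision set by the iteration count; convergence holds for rotation angles up to roughly ±1.74 radians, beyond which a quadrant pre-rotation is applied first.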

Consideration is given to the fact that when producing electronic equipment, whether for commercial or military use, great emphasis is inevitably placed upon minimizing the unit cost, so that one is seldom blessed with the option of using the latest state-of-the-art device technology. The most common situation encountered is one where the expectation is to use the smallest (and thus the least expensive) device that is capable of yielding solutions able to meet the performance objectives, which often means using devices that are one, two or even three generations behind the latest specification. As a result, there are situations where there would be great merit


in having designs that are not totally reliant on the availability of the increasingly large quantities of expensive embedded resources, such as fast multipliers and fast memory, as provided by the manufacturers of the latest silicon-based devices, but are sufficiently flexible to lend themselves to implementation in silicon even when constrained by the limited availability of embedded resources.

The designs are thus required to be able to cater for a range of resource-constrained environments where the particular resources being consumed and traded off, one against another, include the programmable logic, the power and the time (update time or latency), as well as the embedded resources already discussed. The choice of which particular FPGA device to use throughout the monograph for comparative analysis of the various designs is not considered to be of relevance to the results obtained, as the intention is that the attractions of the solutions developed should be valid regardless of the specific device onto which they are mapped – that is, a "good" design should be device-independent. The author is well aware, however, that the intellectual investment made in achieving such a design may seem to fly in the face of current wisdom, whereby the need for good engineering design and practice is avoided through the adoption of ever more powerful (and power-consuming) computing devices – no apologies offered.

The monograph, which is based on the fruits of 3 years of applied industrial research in the U.K., is aimed at both practicing DSP engineers with an interest in the efficient hardware implementation of the real-data FFT and academics/researchers/students from engineering, computer science and mathematics backgrounds with an interest in the design and implementation of sequential and parallel FFT algorithms. It is intended to provide the reader with the tools necessary both to understand the new formulation and to implement simple design variations that offer clear implementational advantages, both theoretical and practical, over more conventional complex-data solutions to the problem. The highly-parallel formulation of the real-data FFT described in the monograph will be shown to lead to scalable and device-independent solutions to the latency-constrained version of the problem which are able to optimize the use of the available silicon resources, and thus to maximize the achievable computational density, thereby making the solution a genuine advance in the design and implementation of high-performance parallel FFT algorithms.

L-3 Communications TRL Technology,
Shannon Way, Ashchurch, Tewkesbury,
Gloucestershire, GL20 8ND, U.K.

Dr. Keith Jones


Acknowledgements

Firstly, and most importantly, the author wishes to thank his wife and partner in crime, Deborah, for her continued support for the project, which has occupied most of his free time over the past 12 months or so, time that would otherwise have been spent together doing more enjoyable things.

Secondly, given his own background as an industrial mathematician, the author gratefully acknowledges the assistance of Andy Beard of TRL Technology, who has painstakingly gone through the manuscript clarifying those technology-based aspects of the research least familiar to the author, namely those relating to the ever-changing world of the FPGA, thereby enabling the author to provide a more comprehensible interpretation of certain aspects of the results.

Finally, the author wishes to thank Mark de Jongh, the Senior Publishing Editor in Electrical Engineering at Springer, together with his management colleagues at Springer, for seeing the potential merit in the research and providing the opportunity of sharing the results with you in this monograph.


Contents

1 Background to Research
   1.1 Introduction
   1.2 The DFT and Its Efficient Computation
   1.3 Twentieth Century Developments of the FFT
   1.4 The DHT and Its Relation to the DFT
   1.5 Attractions of Computing the Real-Data DFT via the FHT
   1.6 Modern Hardware-Based Parallel Computing Technologies
   1.7 Hardware-Based Arithmetic Units
   1.8 Performance Metrics
   1.9 Basic Definitions
   1.10 Organization of the Monograph
   References

2 Fast Solutions to Real-Data Discrete Fourier Transform
   2.1 Introduction
   2.2 Real-Data FFT Algorithms
      2.2.1 The Bergland Algorithm
      2.2.2 The Bruun Algorithm
   2.3 Real-From-Complex Strategies
      2.3.1 Computing One Real-Data DFT via One Full-Length Complex-Data FFT
      2.3.2 Computing Two Real-Data DFTs via One Full-Length Complex-Data FFT
      2.3.3 Computing One Real-Data DFT via One Half-Length Complex-Data FFT
   2.4 Data Re-ordering
   2.5 Discussion
   References

3 The Discrete Hartley Transform
   3.1 Introduction
   3.2 Normalization of DHT Outputs
   3.3 Decomposition into Even and Odd Components
   3.4 Connecting Relations Between DFT and DHT
      3.4.1 Real-Data DFT
      3.4.2 Complex-Data DFT
   3.5 Fundamental Theorems for DFT and DHT
      3.5.1 Reversal Theorem
      3.5.2 Addition Theorem
      3.5.3 Shift Theorem
      3.5.4 Convolution Theorem
      3.5.5 Product Theorem
      3.5.6 Autocorrelation Theorem
      3.5.7 First Derivative Theorem
      3.5.8 Second Derivative Theorem
      3.5.9 Summary of Theorems
   3.6 Fast Solutions to DHT
   3.7 Accuracy Considerations
   3.8 Discussion
   References

4 Derivation of the Regularized Fast Hartley Transform
   4.1 Introduction
   4.2 Derivation of the Conventional Radix-4 Butterfly Equations
   4.3 Single-to-Double Conversion of the Radix-4 Butterfly Equations
   4.4 Radix-4 Factorization of the FHT
   4.5 Closed-Form Expression for Generic Radix-4 Double Butterfly
      4.5.1 Twelve-Multiplier Version of Generic Double Butterfly
      4.5.2 Nine-Multiplier Version of Generic Double Butterfly
   4.6 Trigonometric Coefficient Storage, Accession and Generation
      4.6.1 Minimum-Arithmetic Addressing Scheme
      4.6.2 Minimum-Memory Addressing Scheme
      4.6.3 Trigonometric Coefficient Generation via Trigonometric Identities
   4.7 Comparative Complexity Analysis with Existing FFT Designs
   4.8 Scaling Considerations for Fixed-Point Implementation
   4.9 Discussion
   References

5 Algorithm Design for Hardware-Based Computing Technologies
   5.1 Introduction
   5.2 The Fundamental Properties of FPGA and ASIC Devices
   5.3 Low-Power Design Techniques
      5.3.1 Clock Frequency
      5.3.2 Silicon Area
      5.3.3 Switching Frequency
   5.4 Proposed Hardware Design Strategy
      5.4.1 Scalability of Design
      5.4.2 Partitioned-Memory Processing
      5.4.3 Flexibility of Design
   5.5 Constraints on Available Resources
   5.6 Assessing the Resource Requirements
   5.7 Discussion
   References

6 Derivation of Area-Efficient and Scalable Parallel Architecture
   6.1 Introduction
   6.2 Single-PE Versus Multi-PE Architectures
   6.3 Conflict-Free Parallel Memory Addressing Schemes
      6.3.1 Data Storage and Accession
      6.3.2 Trigonometric Coefficient Storage, Accession and Generation
   6.4 Design of Pipelined PE for Single-PE Architecture
      6.4.1 Internal Pipelining of Generic Double Butterfly
      6.4.2 Space Complexity Considerations
      6.4.3 Time Complexity Considerations
   6.5 Performance and Requirements Analysis of FPGA Implementation
   6.6 Constraining Latency Versus Minimizing Update-Time
   6.7 Discussion
   References

7 Design of Arithmetic Unit for Resource-Constrained Solution
   7.1 Introduction
   7.2 Accuracy Considerations
   7.3 Fast Multiplier Approach
   7.4 CORDIC Approach
      7.4.1 CORDIC Formulation of Complex Multiplier
      7.4.2 Parallel Formulation of CORDIC-Based PE
      7.4.3 Discussion of CORDIC-Based Solution
      7.4.4 Logic Requirement of CORDIC-Based PE
   7.5 Comparative Analysis of PE Designs
   7.6 Discussion
   References

8 Computation of 2^n-Point Real-Data Discrete Fourier Transform
   8.1 Introduction
   8.2 Computing One DFT via Two Half-Length Regularized FHTs
      8.2.1 Derivation of 2^n-Point Real-Data FFT Algorithm
      8.2.2 Implementational Considerations
   8.3 Computing One DFT via One Double-Length Regularized FHT
      8.3.1 Derivation of 2^n-Point Real-Data FFT Algorithm
      8.3.2 Implementational Considerations
   8.4 Discussion
   References

9 Applications of Regularized Fast Hartley Transform
   9.1 Introduction
   9.2 Fast Transform-Space Convolution and Correlation
   9.3 Up-Sampling and Differentiation of Real-Valued Signal
      9.3.1 Up-Sampling via Hartley Space
      9.3.2 Differentiation via Hartley Space
      9.3.3 Combined Up-Sampling and Differentiation
   9.4 Correlation of Two Arbitrary Signals
      9.4.1 Computation of Complex-Data Correlation via Real-Data Correlation
      9.4.2 Cross-Correlation of Two Finite-Length Data Sets
      9.4.3 Auto-Correlation: Finite-Length Against Infinite-Length Data Sets
      9.4.4 Cross-Correlation: Infinite-Length Against Infinite-Length Data Sets
      9.4.5 Combining Functions in Hartley Space
   9.5 Channelization of Real-Valued Signal
      9.5.1 Single Channel: Fast Hartley-Space Convolution
      9.5.2 Multiple Channels: Conventional Polyphase DFT Filter Bank
   9.6 Discussion
   References

10 Summary and Conclusions
   10.1 Outline of Problem Addressed
   10.2 Summary of Results
   10.3 Conclusions

Appendix A Computer Program for Regularized Fast Hartley Transform
   A.1 Introduction
   A.2 Description of Functions
      A.2.1 Control Routine
      A.2.2 Generic Double Butterfly Routines
      A.2.3 Address Generation and Data Re-ordering Routines
      A.2.4 Data Memory Accession and Updating Routines
      A.2.5 Trigonometric Coefficient Generation Routines
      A.2.6 Look-Up-Table Generation Routines
      A.2.7 FHT-to-FFT Conversion Routines
   A.3 Brief Guide to Running the Program
   A.4 Available Scaling Strategies

Appendix B Source Code Listings for Regularized Fast Hartley Transform
   B.1 Listings for Main Program and Signal Generation Routine
   B.2 Listings for Pre-processing Functions
   B.3 Listings for Processing Functions

Glossary

Index


Biography

Keith John Jones is a Chartered Mathematician (C.Math.) and Fellow of the Institute of Mathematics & Its Applications (F.I.M.A.), UK, having obtained a B.Sc. Honours degree in Mathematics from the University of London in 1974 as an external student, an M.Sc. in Applicable Mathematics from Cranfield Institute of Technology in 1977, and a Ph.D. in Computer Science from Birkbeck College, University of London, in 1992, again as an external student. The Ph.D. was awarded primarily for research into the design of novel systolic processor array architectures for the parallel computation of the DFT.

Dr. Jones currently runs a mathematical/software consultancy in Weymouth, Dorset, with his wife Deborah, as well as being employed as a part-time consultant with TRL Technology in Tewkesbury, Gloucestershire, where he is engaged in the design and implementation of high-performance digital signal processing algorithms and systems for wireless communications. Dr. Jones has published widely in the signal processing and sensor array processing fields, having a particular interest in the application of number theory, algebra, and nonstandard arithmetic techniques to the design of low-complexity algorithms and circuits for efficient implementation with suitably defined parallel computing architectures. Dr. Jones also holds a number of patents in these fields.

Dr. Jones has been named in both “Who’s Who in Science and Engineering” andthe “Dictionary of International Biography” since 2008.


Chapter 1
Background to Research

Abstract This chapter provides the background to the research results discussed in the monograph that relate to the design and implementation of the regularized FHT. Following a short historical account of the role of the DFT in modern science, a case is made for the need for highly-parallel FFT algorithms geared specifically to the processing of real-valued data for use in the type of resource-constrained (both silicon and power) environments encountered in mobile communications. The relation of the DHT to the DFT is given and the possible benefits of using a highly-parallel formulation of the FHT for solving the real-data DFT problem are discussed. This is followed by an account of the parallel computing technologies now available via the FPGA and the ASIC with which such a formulation of the problem might be efficiently implemented. A hardware-efficient arithmetic unit is also discussed which can yield a flexible-precision solution whilst minimizing the memory requirement. A discussion of performance metrics for various computing architectures and technologies is then given, followed by an outline of the organization of the monograph.

1.1 Introduction

The subject of spectrum or harmonic analysis started in earnest with the work of Joseph Fourier (1768–1830), who asserted and proved that an arbitrary function could be represented via a suitable transformation as a sum of trigonometric functions [6]. It seems likely, however, that such ideas were already common knowledge amongst European mathematicians by the time Fourier appeared on the scene, mainly through the earlier work of Joseph Louis Lagrange (1736–1813) and Leonhard Euler (1707–1783), with the first appearance of the discrete version of this transformation, the discrete Fourier transform (DFT) [36, 39], dating back to Euler's investigations of sound propagation in elastic media in 1750 and to the astronomical work of Alexis Claude Clairaut (1713–1765) in 1754 [24]. The DFT is now widely used in many branches of science, playing in particular a central role in the field of digital signal processing (DSP) [36, 39], enabling digital signals – namely those that have been both sampled and quantized – to be viewed in the frequency domain

K. Jones, The Regularized Fast Hartley Transform, Signals and Communications Technology, DOI 10.1007/978-90-481-3917-0_1, © Springer Science+Business Media B.V. 2010


where, compared to the time domain, the information contained in the signal may often be more easily extracted and/or displayed, or where many common DSP functions, such as that of the finite impulse response (FIR) filter or the matched filter [36, 39], may be more easily or efficiently carried out.

The monograph is essentially concerned with the problem of computing the DFT, via the application of various factorization techniques, using silicon-based parallel computing equipment – as typified by field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) technologies [31] – bearing in mind the size and power constraints relevant to the particular field of interest, namely that of mobile communications, where a small battery may be the only source of power supply for long periods of time. The monograph looks also to exploit the fact that the measurement data, as with many real-world problems, is real valued in nature, with each sample of data thus belonging to R, the field of real numbers [4], although the restriction to fixed-point implementations limits the range of interest still further to that of Z, the commutative ring of integers [4].

1.2 The DFT and Its Efficient Computation

Turning firstly to its definition, the DFT is a unitary transform [17] which, for the case of N input/output samples, may be expressed in normalized form via the equation

$$X^{(F)}[k] = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x[n]\,W_N^{nk}, \qquad k = 0, 1, \ldots, N-1 \tag{1.1}$$

where the input/output data vectors belong to C^N, the linear space of complex-valued N-tuples [4], and the transform kernel – also known as the Fourier Matrix and which is, as one would expect, a function of both the input and output data indices – derives from the term

$$W_N = \exp(-i 2\pi / N), \qquad i = \sqrt{-1}, \tag{1.2}$$

the primitive Nth complex root of unity [4, 32, 34]. The unitary nature of the DFT means that the inverse of the Fourier Matrix is equal to its conjugate-transpose, whilst its columns form an orthogonal basis [6, 7, 17] – similarly, a transform is said to be orthogonal when the inverse of the transform matrix is equal simply to its transpose, as is the case with any real-valued kernel. Note that the multiplication of any power of the term W_N by any number belonging to C, the field of complex numbers [4], simply results in a phase shift of that number – the amplitude or magnitude remains unchanged.

The direct computation of the N-point DFT as defined above involves O(N²) arithmetic operations, so that many of the early scientific problems involving the DFT could not be seriously attacked without access to fast algorithms for its efficient solution, where the key to the design of such algorithms is the identification and


exploitation of the property of symmetry, whether it exists just in the transform kernel or, in certain circumstances, in the input data and/or output data as well.
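As an illustration of the cost being described, definition (1.1) can be evaluated directly in a few lines. The following Python sketch is for exposition only (the function name is ours, not the monograph's); it performs the O(N²) computation exactly as written, with the 1/√N unitary normalization:

```python
import cmath
import math

def dft_direct(x):
    """Direct O(N^2) evaluation of the normalized DFT of Eq. (1.1)."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)  # primitive Nth root of unity, Eq. (1.2)
    return [sum(x[n] * W ** (n * k) for n in range(N)) / math.sqrt(N)
            for k in range(N)]

# A unit impulse transforms to a flat spectrum of amplitude 1/sqrt(N),
# and, the transform being unitary, the vector norm is preserved.
X = dft_direct([1.0, 0.0, 0.0, 0.0])
assert all(abs(v - 0.5) < 1e-12 for v in X)
```

The nested loop structure makes the quadratic operation count explicit: N output samples, each requiring N kernel multiplications.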

One early area of activity with such transforms involved astronomical calculations, and in the early part of the nineteenth century the great Carl Friedrich Gauss (1777–1855) used the DFT for the interpolation of asteroidal orbits from a finite set of equally-spaced observations [24]. He developed a fast two-factor algorithm for its computation that was identical to that described in 1965 by James Cooley and John Tukey [12] – as with many of Gauss's greatest ideas, however, the algorithm was never published outside of his collected works and only then in an obscure Latin form. This algorithm, which for a transform length of N = N₁ × N₂ involves just O((N₁ + N₂) × N) arithmetic operations, was probably the first member of the class of algorithms now commonly referred to as the fast Fourier transform (FFT) [5, 6, 9, 12, 17, 35], which is unquestionably the most ubiquitous algorithm in use today for the analysis or manipulation of digital data. In fact, Gauss is known to have first used the above-mentioned two-factor FFT algorithm for the solution of the DFT as far back as 1805, the same year that Admiral Nelson routed the French fleet at the Battle of Trafalgar – interestingly, Fourier served in Napoleon Bonaparte's army from 1798 to 1801, during its invasion of Egypt, acting as scientific advisor.

Although the DFT, as defined above, allows for both the input and output data sets to be complex valued (possessing both amplitude and phase), many real-world spectrum analysis problems, including those addressed by Gauss, involve only real-valued (possessing amplitude only) input data, so that there is a genuine need for the identification of a subset of the class of FFTs that are able to exploit this fact – bearing in mind that real-valued data leads to a Hermitian-symmetric (or conjugate-symmetric) frequency spectrum:

complex-data FFT ⇒ exploitation of kernel symmetry;

whilst

real-data FFT ⇒ exploitation of kernel & spectral symmetries;

with the exploitation of symmetry in the transform kernel being typically achieved by invoking the property of periodicity and of the Shift Theorem, as will be discussed later in the monograph.

There is a requirement, in particular, for the development of real-data FFT algorithms which retain the regularity – as relates to the algorithm structure – of their complex-data counterparts, as regular algorithms lend themselves more naturally to an efficient implementation. Regularity, which equates to the amount of repetition and symmetry present in the design, is most straightforwardly achieved through the adoption of fixed-radix formulations, such as with the familiar radix-2 and radix-4 algorithms [9, 11], as this essentially reduces the FFT design to that of a single fixed-radix butterfly, the computational engine used for carrying out the repetitive arithmetic operations. Note that with such a formulation the radix actually corresponds to the size of the resulting butterfly, although in Chapter 8 it is seen how a DFT, whose length is a power of two but not a power of four, may be solved by means of a highly-optimized radix-4 butterfly.


An additional attraction of fixed-radix FFT formulations, which for an arbitrary radix “R” decompose an N-point DFT into log_R N temporal stages each comprising N/R radix-R butterflies, is that they lend themselves naturally to a parallel solution.

Such decompositions may be defined over the dimensions of either space – facilitating its mapping onto a single-instruction multiple-data (SIMD) architecture [1] – or time – facilitating its mapping, via the technique of pipelining, onto a systolic architecture [1, 29] – that enables them to be efficiently mapped onto one of the increasingly more accessible/affordable parallel computing technologies. With the systolic solution, each stage of the computational pipeline – referred to hereafter as a computational stage (CS) – corresponds to that of a single temporal stage. A parallel solution may also be defined over both space and time dimensions which would involve a computational pipeline where each CS of the pipeline involves the parallel execution of the associated butterflies via SIMD-based processing – such an architecture being often referred to in the computing literature as parallel-pipelined.

1.3 Twentieth Century Developments of the FFT

As far as modern-day developments in FFT design are concerned it is the names of Cooley and Tukey that are always mentioned in any historical account, but this does not really do justice to the many contributors from the first half of the twentieth century whose work was simply not picked up on, or appreciated, at the time of publication. The prime reason for such a situation was the lack of a suitable technology for their efficient implementation, this remaining the case until the advent of the semiconductor technology of the 1960s.

Early pioneering work was carried out by the German mathematician Carl Runge [40], who in 1903 recognized that the periodicity of the DFT kernel could be exploited to enable the computation of a 2N-point DFT to be expressed in terms of the computation of two N-point DFTs, this factorization technique being subsequently referred to as the “doubling algorithm”. The Cooley–Tukey algorithm, which does not rely on any specific factorization of the transform length, may thus be viewed as a simple generalization of this algorithm, as successive application of the doubling algorithm leads straightforwardly to the radix-2 version of the Cooley–Tukey algorithm. Runge's influential work was subsequently picked up and popularized in publications by Karl Stumpff [45] in 1939 and Gordon Danielson and Cornelius Lanczos [13] in 1942, each in turn making contributions of their own to the subject. Danielson and Lanczos, for example, produced reduced-complexity solutions to the DFT through the exploitation of symmetries in the transform kernel, whilst Stumpff discussed versions of both the “doubling algorithm” and the “tripling algorithm”.
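Runge's doubling step can be sketched as follows (an illustrative Python fragment, using the normalized DFT convention of Eq. (1.1); the function names are ours). A 2N-point DFT is assembled from the N-point DFTs of the even- and odd-indexed samples, the two halves of the output differing only in the sign of the twiddled odd-sample term:

```python
import cmath
import math

def dft(x):
    # normalized direct DFT, as in Eq. (1.1)
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            / math.sqrt(N) for k in range(N)]

def dft_doubling(x):
    # Runge's "doubling algorithm": a 2N-point DFT from two N-point DFTs of
    # the even- and odd-indexed samples; the unitary normalization supplies
    # the extra 1/sqrt(2) factor.
    M = len(x)                         # M = 2N, assumed even
    E, O = dft(x[0::2]), dft(x[1::2])
    X = [0j] * M
    for k in range(M // 2):
        t = cmath.exp(-2j * cmath.pi * k / M) * O[k]  # twiddle factor
        X[k] = (E[k] + t) / math.sqrt(2)
        X[k + M // 2] = (E[k] - t) / math.sqrt(2)
    return X
```

Applying the same split recursively to each half-length DFT yields the radix-2 Cooley–Tukey algorithm described in the text.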

All of the techniques developed, including those of more recent origin such as the “nesting algorithm” of Schmuel Winograd [49] and the “split-radix algorithm” of Pierre Duhamel [15], rely upon the divide-and-conquer [28] principle, whereby the computation of a composite length DFT is broken down into that of a number of smaller DFTs where the small-DFT lengths correspond to the factors


of the original transform length. Depending upon the particular factorization of the transform length, this process may be repeated in a recursive fashion on the increasingly smaller DFTs.

When the lengths of the small DFTs have common factors, as encountered with the familiar fixed-radix formulations, then between successive stages of small DFTs there will be a need for the intermediate results to be modified by elements of the Fourier Matrix, these terms being commonly referred to in the FFT literature as twiddle factors. When the algorithm in question is a fixed-radix algorithm of the decimation-in-time (DIT) type [9], whereby the sequence of data space samples is decomposed into successively smaller sub-sequences, then the twiddle factors are applied to the inputs to the butterflies, whereas when the fixed-radix algorithm is of the decimation-in-frequency (DIF) type [9], whereby the sequence of transform space samples is decomposed into successively smaller sub-sequences, then the twiddle factors are applied to the outputs of the butterflies. Note, however, that when the lengths of the small DFTs have no common factors at all – that is, when they are relatively prime [4, 32] – then the need for the twiddle factor application disappears as each factor becomes equal to one. This particular result was made possible through the development of a new number-theoretic data re-ordering scheme in 1958 by the statistician Jack Good [20], the scheme being based upon the ubiquitous Chinese Remainder Theorem (CRT) [32, 34, 35] – which, for the interest of those readers of a more mathematical disposition, provides a means of obtaining a unique solution to a set of simultaneous linear congruences – whose origins supposedly date back to the first century A.D. [14].
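The CRT-based re-indexing underlying Good's scheme can be illustrated as follows (a Python sketch under our own naming, showing one standard form of the reconstruction; Python 3.8+ is assumed for the modular-inverse form of pow). For N = N₁N₂ with N₁ and N₂ relatively prime, every index n is recovered uniquely from its residue pair (n mod N₁, n mod N₂), which is what allows the length-N DFT to be re-ordered into independent short DFTs with no twiddle factors between them:

```python
import math

def crt_map(N1, N2):
    # CRT index reconstruction for N = N1 * N2 with gcd(N1, N2) = 1:
    # maps each residue pair (n1, n2) back to the unique n in [0, N).
    assert math.gcd(N1, N2) == 1
    N = N1 * N2
    inv_N2 = pow(N2, -1, N1)   # multiplicative inverse of N2 modulo N1
    inv_N1 = pow(N1, -1, N2)   # multiplicative inverse of N1 modulo N2
    return {(n1, n2): (n1 * N2 * inv_N2 + n2 * N1 * inv_N1) % N
            for n1 in range(N1) for n2 in range(N2)}

m = crt_map(3, 5)
assert sorted(m.values()) == list(range(15))   # the re-indexing is a bijection
```

The bijection check confirms that no index is lost or duplicated, which is precisely the property the PFA data re-ordering relies upon.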

Note also that in the FFT literature, the class of fast algorithms based upon the decomposition of a composite length DFT into smaller DFTs whose lengths have common factors – such as the Cooley–Tukey algorithm – is often referred to as the Common Factor Algorithm (CFA), whereas the class of fast algorithms based upon the decomposition of a composite length DFT into smaller DFTs whose lengths are relatively prime is often referred to as the Prime Factor Algorithm (PFA).

Before moving on from this brief historical discussion, it is worth returning to the last name mentioned, namely that of Jack Good, as his background is a particularly interesting one for anyone with an interest in the history of computing. During World War Two Good served at Bletchley Park in Buckinghamshire, England, working alongside Alan Turing [25] on, amongst other things, the decryption of messages produced by the Enigma machine [19] – as used by the German armed forces. At the same time, and on the same site, a team of engineers under the leadership of Tom Flowers [19] – all seconded from the Post Office Research Establishment at Dollis Hill in North London – were, unbeknown to the outside world, developing the world's first electronic computer, the Colossus [19], under the supervision of Turing and Cambridge mathematician Max Newman. The Colossus was built primarily to automate various essential code breaking tasks such as the cracking of the Lorenz code used by Adolf Hitler to communicate with his generals and was the first serious device – albeit a very large and a very specialized one – on the path towards our current state of technology whereby entire signal processing systems may be mapped onto a single silicon chip.


1.4 The DHT and Its Relation to the DFT

A close relative of the Fourier Transform is that of the Hartley Transform, as introduced by Ralph Hartley (1890–1970) in 1942 for the analysis of transient and steady state transmission problems [23]. The discrete version of this unitary transform is referred to as the discrete Hartley transform (DHT) [8] which, for the case of N input/output samples, may be expressed in normalized form via the equation

$$X^{(H)}[k] = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x[n]\,\mathrm{cas}(2\pi nk/N), \qquad k = 0, 1, \ldots, N-1 \tag{1.3}$$

where the input/output data vectors belong to R^N, the linear space of real-valued N-tuples, and the transform kernel – also known as the Hartley Matrix and which is, as one would expect, a function of both the input and output data indices – is as given by

$$\mathrm{cas}(2\pi nk/N) \equiv \cos(2\pi nk/N) + \sin(2\pi nk/N). \tag{1.4}$$

Note that as the elements of the Hartley Matrix – as given by the “cas” function – are all real valued, the DHT is orthogonal, as well as unitary, with the columns of the matrix forming an orthogonal basis.
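A direct evaluation of Eqs. (1.3) and (1.4) makes these properties easy to verify numerically. The Python sketch below (illustrative only, with our own function name) also demonstrates a consequence of the real, symmetric, orthogonal kernel: with the 1/√N normalization shown, applying the transform twice recovers the input, the bilateral property returned to later in this chapter:

```python
import math

def dht_direct(x):
    # Direct evaluation of the normalized DHT of Eq. (1.3), using the
    # real-valued "cas" kernel of Eq. (1.4).
    N = len(x)
    cas = lambda t: math.cos(t) + math.sin(t)
    return [sum(x[n] * cas(2 * math.pi * n * k / N) for n in range(N))
            / math.sqrt(N) for k in range(N)]

# The normalized DHT is its own inverse: transforming twice returns the data.
x = [1.0, 2.0, 0.5, -1.0]
y = dht_direct(dht_direct(x))
assert all(abs(a - b) < 1e-9 for a, b in zip(x, y))
```

All arithmetic here is real valued, in contrast to the complex kernel of Eq. (1.1).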

Unlike the DFT, the DHT has no natural interpretation as a frequency spectrum, its most natural use being as a means for computing the DFT and as such, fast solutions to the DHT, which are referred to generically as the fast Hartley transform (FHT) [7, 8, 43], have become increasingly popular as an alternative to the FFT for the efficient computation of the DFT. The FHT is particularly attractive for the case of real-valued data, its applicability being made possible by the fact that all of the familiar properties associated with the DFT, such as the Circular Convolution Theorem and the Shift Theorem, are also applicable to the DHT, and that the complex-valued DFT output set and real-valued DHT output set may each be simply obtained, one from the other. To see the truth of this, note that the equality

$$\mathrm{cas}(2\pi nk/N) = \mathrm{Re}\left(W_N^{nk}\right) - \mathrm{Im}\left(W_N^{nk}\right) \tag{1.5}$$

(where “Re” stands for the real component and “Im” for the imaginary component) relates the kernels of the two transformations, both of which are periodic with period 2π. As a result

$$X^{(H)}[k] = \mathrm{Re}\left(X^{(F)}[k]\right) - \mathrm{Im}\left(X^{(F)}[k]\right), \tag{1.6}$$

which expresses the DHT output in terms of the DFT output, whilst

$$\mathrm{Re}\left(X^{(F)}[k]\right) = \frac{1}{2}\left(X^{(H)}[N-k] + X^{(H)}[k]\right) \tag{1.7}$$


and

$$\mathrm{Im}\left(X^{(F)}[k]\right) = \frac{1}{2}\left(X^{(H)}[N-k] - X^{(H)}[k]\right), \tag{1.8}$$

which express the real and imaginary components of the DFT output, respectively, in terms of the DHT output.
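Equations (1.7) and (1.8) translate directly into code. In the illustrative Python sketch below (function names are ours), the real-data DFT is assembled from a single DHT, with the index N − k taken modulo N so that the k = 0 term wraps correctly:

```python
import cmath
import math

def dht(x):
    # normalized DHT, Eqs. (1.3)-(1.4)
    N = len(x)
    return [sum(x[n] * (math.cos(2 * math.pi * n * k / N)
                        + math.sin(2 * math.pi * n * k / N))
                for n in range(N)) / math.sqrt(N) for k in range(N)]

def dft_via_dht(x):
    # Re X[k] = (H[N-k] + H[k])/2, Im X[k] = (H[N-k] - H[k])/2, per
    # Eqs. (1.7) and (1.8); H[-k % N] handles the k = 0 wrap-around.
    H = dht(x)
    N = len(x)
    return [complex((H[-k % N] + H[k]) / 2, (H[-k % N] - H[k]) / 2)
            for k in range(N)]

# Agrees with the normalized DFT of Eq. (1.1) for real-valued input.
x = [0.5, 2.0, -1.0, 3.0]
X = [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / 4) for n in range(4)) / 2.0
     for k in range(4)]
assert all(abs(a - b) < 1e-12 for a, b in zip(dft_via_dht(x), X))
```

Only real arithmetic is needed up to the final unpacking step, which is the source of the FHT's advantage for real-valued data discussed in the next section.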

1.5 Attractions of Computing the Real-Data DFT via the FHT

Although applicable to the computation of the DFT for both real-valued and complex-valued data, the major computational advantage of the FHT over the FFT, as implied above, lies in the processing of real-valued data. As most real-world spectrum analysis problems involve only real-valued data, significant performance gains may be obtained by using the FHT without any great loss of generality. This is evidenced by the fact that if one computes the complex-data FFT of an N-point real-valued data sequence, the result will be 2N real-valued (or N complex-valued) samples, one half of which are redundant. The FHT, on the other hand, will produce just N real-valued outputs, thereby requiring only one half as many arithmetic operations and one half the memory requirement for storage of the input/output data. The reduced memory requirement is particularly relevant when the transform length is large and the available resources are limited, as may well be the case with the application area of interest, namely that of mobile communications.

The traditional approach to the DFT problem has been to use a complex-data solution, regardless of the nature of the data, this often entailing the initial conversion of real-valued data to complex-valued data via a wideband digital down conversion (DDC) process or through the adoption of a real-from-complex strategy whereby two real-data FFTs are computed simultaneously via one full-length complex-data FFT [42] or where one real-data FFT is computed via one half-length complex-data FFT [11, 42]. Each of the real-from-complex solutions, however, involves a computational overhead when compared to the more direct approach of a real-data FFT in terms of increased memory, increased processing delay to allow for the possible acquisition/processing of pairs of data sets, and additional packing/unpacking complexity. With the DDC approach, the information content of short-duration signals may also be compromised through the introduction of unnecessary filtering operations.
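The first real-from-complex strategy mentioned above, two real-data transforms from one complex-data transform, exploits the Hermitian symmetry of real-data spectra and may be sketched as follows (illustrative Python under our own naming, with a direct unnormalized DFT standing in for the FFT):

```python
import cmath

def dft(z):
    # unnormalized complex-data DFT (direct form, for illustration only)
    N = len(z)
    return [sum(z[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def two_real_dfts(x1, x2):
    # "Two-for-one" real-from-complex trick: pack two real sequences into one
    # complex sequence, take a single complex DFT, then unpack via the
    # Hermitian symmetry X[N-k] = conj(X[k]) of real-data spectra.
    N = len(x1)
    Z = dft([complex(a, b) for a, b in zip(x1, x2)])
    X1 = [(Z[k] + Z[-k % N].conjugate()) / 2 for k in range(N)]
    X2 = [(Z[k] - Z[-k % N].conjugate()) / 2j for k in range(N)]
    return X1, X2
```

The packing and unpacking stages visible here are exactly the overheads, extra buffering, paired data sets, and unpacking arithmetic, that the text identifies as the cost of this strategy relative to a true real-data transform.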

The reason for such a situation is due in part to the fact that computing equipment manufacturers have invested so heavily in producing DSP devices built around the fast multiplier and accumulator (MAC), an arithmetic unit ideally suited to the implementation of the complex-data radix-2 butterfly, the computational unit used by the familiar class of recursive radix-2 FFT algorithms. The net result is that the problem of the real-data DFT is effectively being modified so as to match an existing complex-data solution rather than a solution being sought that matches the actual problem.


It should be noted that specialized FFT algorithms [2, 10, 15, 16, 18, 30, 33, 44, 46] do however exist for dealing with the case of real-valued data. Such algorithms compare favourably, in terms of arithmetic complexity and memory requirement, with those of the FHT, but suffer in terms of a loss of regularity and reduced flexibility in that different algorithms are typically required for the computation of the DFT and its inverse. Clearly, in applications requiring transform-space processing followed by a return to the data space, this could prove something of a disadvantage, particularly when compared to the adoption of a bilateral transform, such as the DHT, which may be straightforwardly applied to the transformation from Hartley space to data space as well as from data space to Hartley space, making it thus equally applicable to the computation of both the DFT and its inverse – this is as a result of the fact that its definitions for the two directions, up to a scaling factor, are identical.

A drawback of conventional FHT algorithms [7, 8, 43], however, lies in the need for two sizes – and thus two separate designs – of butterfly for fixed-radix formulations, where a single-sized radix-R butterfly produces R outputs from R inputs and a double-sized radix-R butterfly produces 2R outputs from 2R inputs. A generic version of the double-sized butterfly, referred to as the generic double butterfly and abbreviated hereafter to “GD-BFLY”, is therefore developed in this monograph for the radix-4 FHT which overcomes the problem in an elegant fashion. The resulting single-design radix-4 solution, referred to as the regularized FHT [26] and abbreviated hereafter to “R24 FHT”, lends itself naturally to parallelization [1, 3, 21] and to mapping onto a regular computational structure for implementation with parallel computing technology.

1.6 Modern Hardware-Based Parallel Computing Technologies

The type of high-performance parallel computing equipment referred to above is typified by the increasingly attractive FPGA and ASIC technologies which now give design engineers far greater flexibility and control over the type of algorithm that may be used in the building of high-performance DSP systems, so that more appropriate hardware solutions to the real-data FFT may be actively sought and exploited to some advantage with these silicon-based technologies. With such technologies, however, it is no longer adequate to view the complexity of the FFT purely in terms of arithmetic operation counts, as has conventionally been done, as there is now the facility to use both multiple arithmetic units – particularly fast multipliers – and multiple blocks or banks of fast memory in order to enhance the FFT performance via its parallel computation.

As a result, a whole new set of constraints has arisen relating to the design of efficient FFT algorithms. With the recent and explosive growth of wireless technology, and in particular that of mobile communications, algorithms are now being designed subject to new and often conflicting performance criteria where the ideal is either to maximize the throughput (that is, to minimize the update time) or satisfy some constraint on the latency, whilst at the same time minimizing the required


silicon resources (and thereby minimizing the cost of implementation) as well as keeping the power consumption to within the available budget. Note, however, that the throughput is also constrained by the input–output (I/O) speed, as the algorithm cannot process the data faster than it can access it. Such trade-offs are considered in some considerable detail for the hardware solution to the R24 FHT, with the aim, bearing in mind the target application area of mobile communications, of achieving a power-efficient solution.

The adoption of the FHT for wireless communications technology seems particularly apt, given the contribution made by the originator of the Hartley Transform (albeit the continuous version) to the foundation of information theory, where the Shannon–Hartley Theorem [37] helped to establish Shannon's idea of channel capacity [37, 41]. The theorem simply states that if the amount of digital data or information transmitted over a communication channel is less than the channel capacity, then error-free communication may be achieved, whereas if it exceeds that capacity, then errors in transmission will always occur no matter how well the equipment is designed.
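For the additive white Gaussian noise channel the theorem takes the familiar quantitative form (a standard statement, included here for reference rather than taken from the monograph):

```latex
C = B \log_2\!\left(1 + \frac{S}{N}\right)
```

where C is the channel capacity in bits per second, B the channel bandwidth in hertz, and S/N the ratio of received signal power to noise power; error-free transmission is possible, in principle, at any rate below C.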

1.7 Hardware-Based Arithmetic Units

When producing electronic equipment, whether for commercial or military use, great emphasis is inevitably placed upon minimizing the unit cost so that one is seldom blessed with the option of using the latest state-of-the-art device technology. The most common situation encountered is one where the expectation is to use the smallest (and thus the least expensive) device that is capable of yielding solutions able to meet the performance objectives, which often means using devices that are one, two or even three generations behind the latest specification. As a result, there are situations where there would be great merit in having designs that are not totally reliant on the availability of the increasingly large quantities of expensive embedded resources, such as fast multipliers and fast memory, as provided by the manufacturers of the latest silicon-based devices, but are sufficiently flexible to lend themselves to implementation in silicon even when constrained by the limited availability of embedded resources.

One way of achieving such flexibility with the R24 FHT would be through the design of a processing element (PE) that minimizes or perhaps even avoids the need for fast multipliers, or fast memory, or both, according to the availability of the resources on the target device. Despite the increased use of the hardware-based computing technologies, however, there is still a strong reliance upon the use of software-based techniques for the design of the arithmetic unit. These techniques, as typified by the familiar fast multiplier, are relatively inflexible in terms of the precision they offer and, although increasingly more power efficient, tend to be expensive in terms of silicon resources.

There are a number of hardware-based arithmetic techniques available, however, such as the shift-and-add techniques, as typified by the Co-Ordinate Rotation


DIgital Computer (CORDIC) arithmetic [47] unit, and the look-up table (LUT) techniques, as typified by the Distributed Arithmetic (DA) arithmetic [48] unit, that date back to the DSP revolution of the mid-twentieth century but nevertheless still offer great attractions for use with the new hardware-based technologies. The CORDIC arithmetic unit, for example, which may be used to carry out in an optimal fashion the operation of phase rotation – the key operation for the computation of the DFT – may be implemented by means of a computational structure whose form may range from fully-sequential to fully-parallel, with the latency of the CORDIC operation increasing linearly with increasing parallelism.

The application of the CORDIC technique to the computation of the R24 FHT is considered in this monograph for its ability to both minimize the memory requirement and to yield a flexible-precision solution to the real-data DFT problem.

1.8 Performance Metrics

Having introduced and defined the algorithms of interest in this introductory chapter, namely the DFT and its close relation the DHT, as well as discussing very briefly the various types of computing architecture and technology available for the implementation of their fast solutions, via the FFT and the FHT, respectively, it is now worth devoting a little time to considering the type of performance metrics most appropriate to each. For the mapping of such algorithms onto a uni-processor computing device, for example, the performance would typically be assessed according to the following:

Performance Metric for Uni-processor Computing Device:

An operation-efficient solution to a discrete unitary or orthogonal transform, when executed on a (Von Neumann) uni-processor sequential computing device, is one which exploits the symmetry of the transform kernel such that the transform may be computed with fewer operations than by direct implementation of its definition.

As is clear from this definition, the idea of identifying and exploiting the property of symmetry, whether it exists just in the transform kernel or, in certain circumstances, in the input data and/or output data as well, is central to the problem of designing fast algorithms for the efficient computation of discrete unitary or orthogonal transforms. For the mapping of such algorithms onto a multi-processor computing device, on the other hand, the performance would typically be assessed according to the following:

Performance Metric for Multi-processor Computing Device:

A time-efficient solution to a discrete unitary or orthogonal transform, when executed on a multi-processor parallel computing device, is one which facilitates the execution of many of the operations simultaneously, or in parallel, such that the transform may be computed in less time than by its sequential implementation.

With the availability of multiple processors the idealized objective of a parallel solution is to obtain a linear speed-up in performance which is directly proportional to


the number of processors used, although in reality, with most multi-processor applications, being able to obtain such a speed-up is rarely achievable. The main problem relates to the communication complexity arising from the need to move potentially large quantities of data between the processors. Finally, for the mapping of such algorithms onto a silicon-based parallel computing device, the performance would typically be assessed according to the following:

Performance Metric for Silicon-Based Parallel Computing Device:

A hardware-efficient solution to a discrete unitary or orthogonal transform, when executed on a silicon-based parallel computing device, is one which facilitates the execution of many of the operations simultaneously, or in parallel, such that the transform throughput per unit area of silicon is maximized.

Although other metrics could of course be used for this definition, this particular metric of throughput per unit area of silicon – referred to hereafter as the computational density – is targeted specifically at the type of power-constrained environment that one would expect to encounter with mobile communications, as it is assumed that a solution that yields a high computational density will be attractive in terms of both power efficiency and hardware efficiency, given the known influence of silicon area – to be discussed in Chapter 5 – on the power consumption.

1.9 Basic Definitions

To clarify the use of a few basic terms, note that the input data to unitary transforms, such as the DFT and the DHT, may be said to belong to the data space which, as already stated, is C^N for the case of the DFT and R^N for the case of the DHT. Analogously, the output data from such transforms may be said to belong to the transform space which for the case of the DFT is referred to as Fourier space and for the case of the DHT is referred to as Hartley space. As already implied, all vectors with an attached superscript of "(F)" will be assumed to reside within Fourier space whilst all those with an attached superscript of "(H)" will be assumed to reside within Hartley space. These definitions will be used throughout the monograph, where appropriate, in order to simplify or clarify the exposition.

Also, it has already been stated that for the case of a fixed-radix FFT, the trigonometric elements of the Fourier Matrix, as applied to the appropriate butterfly inputs/outputs, are generally referred to as twiddle factors. However, for consistency, the elements of both the Fourier and Hartley Matrices, as required for the butterflies of their respective decompositions, will be referred to hereafter as the trigonometric coefficients – for the fast solution to both transform types the elements are generally decomposed into pairs of real numbers for efficient application.

Finally, note that the curly brackets "{.}" will be used throughout the monograph to denote a finite set or sequence of digital samples, as required, for example, for expressing the input–output relationship for both the DFT and the DHT. Also, the indexing convention generally adopted when using such sequences is that the

Page 26: The Regularized Fast Hartley Transform: Optimal Formulation of Real-Data Fast Fourier Transform for Silicon-Based Implementation in Resource-Constrained Environments

12 1 Background to Research

elements of a sequence in data space – typically denoted with a lower-case character – are indexed by means of the letter "n", whereas the elements of a sequence in transform space – typically denoted with an upper-case character – are indexed by means of the letter "k".

1.10 Organization of the Monograph

The first section of the monograph provides the background information necessary for a better understanding of both the problem being addressed, namely that of the real-data DFT, and of the resulting solution described in the research results that follow. This involves, in Chapter 1, an outline of the problem set in a historical context, followed in Chapter 2 by an account of the real-data DFT and of the fast algorithms and techniques conventionally used for its solution, and in Chapter 3 by a detailed account of the DHT and the class of FHT algorithms used for its fast solution, and of those properties of the DHT that make the FHT of particular interest with regard to the fast solution of the real-data DFT.

The middle section of the monograph deals with the novel solution proposed for dealing with the real-data DFT problem. This involves, in Chapter 4, a detailed account of the design and efficient computation of a solution to the DHT based upon the GD-BFLY, namely the regularized FHT or R2^4 FHT, which lends itself naturally to parallelization and to mapping onto a regular computational structure for implementation with silicon-based parallel computing technology. Design constraints and considerations for such technologies are then discussed in Chapter 5 prior to the consideration, in Chapter 6, of different architectures for the mapping of the R2^4 FHT onto such hardware. A partitioned-memory architecture exploiting a single high-performance PE is identified for the parallel computation of the GD-BFLY and of the resulting R2^4 FHT [27], whereby both the data and the trigonometric coefficients are partitioned or distributed across multiple banks of memory, referred to hereafter as the data memory (DM) and the trigonometric coefficient memory (CM), respectively. Following this, in Chapter 7, it is seen how the fast multipliers used by the GD-BFLY might in certain circumstances be beneficially replaced by a hardware-based parallel arithmetic unit – based upon CORDIC arithmetic – that is able to yield a flexible-precision solution, without need of trigonometric coefficient memory, when implemented with the proposed hardware-based technology.

The final section of the monograph deals with applications of the resulting solution to the real-data DFT problem. This involves, in Chapter 8, an account of how the application of the R2^4 FHT may be extended to the efficient parallel computation of the real-data DFT whose length is a power of two, but not a power of four, this being followed by its application, in Chapter 9, to the computation of some of the more familiar and computationally-intensive DSP-based functions, such as those of correlation – both auto-correlation and cross-correlation – and of the wideband channelization of real-valued radio frequency (RF) data via the polyphase DFT filter bank [22]. With each such function, which might typically be encountered in that increasingly important area of wireless communications relating to the geolocation [38] of signal emitters, the adoption of the R2^4 FHT may potentially result in both conceptually and computationally simplified solutions.

The monograph concludes with two appendices which provide both a detailed description and a listing of computer source code, written in the "C" programming language, for all the functions of the proposed partitioned-memory single-PE solution to the R2^4 FHT, this code being used for proving the mathematical/logical correctness of its operation. The computer program provides the user with various choices of PE design and of storage/accession scheme for the trigonometric coefficients, helping the user to identify how the algorithm might be efficiently mapped onto suitable parallel computing equipment following translation of the sequential "C" code to parallel code as produced by a suitably chosen hardware description language (HDL). The computer code for the complete solution is also to be found on the compact disc (CD) accompanying the monograph.

Finally, note that pseudo-code, loosely based on the "C" programming language, will be used throughout the monograph, where appropriate, to illustrate the operation of the R2^4 FHT and of the individual functions of which the R2^4 FHT is comprised.

References

1. S.G. Akl, The Design and Analysis of Parallel Algorithms (Prentice Hall, Upper Saddle River, NJ, 1989)
2. G.D. Bergland, A fast Fourier transform algorithm for real-valued series. Comm. ACM 11(10) (1968)
3. A.W. Biermann, Great Ideas in Computer Science (MIT Press, Cambridge, MA, 1995)
4. G. Birkhoff, S. MacLane, A Survey of Modern Algebra (Macmillan, New York, 1977)
5. R. Blahut, Fast Algorithms for Digital Signal Processing (Addison Wesley, Boston, MA, 1985)
6. R.N. Bracewell, The Fourier Transform and Its Applications (McGraw Hill, New York, 1978)
7. R.N. Bracewell, The fast Hartley transform. Proc. IEEE 72(8) (1984)
8. R.N. Bracewell, The Hartley Transform (Oxford University Press, New York, 1986)
9. E.O. Brigham, The Fast Fourier Transform and Its Applications (Prentice Hall, Englewood Cliffs, NJ, 1988)
10. G. Bruun, Z-transform DFT filters and FFTs. IEEE Trans. ASSP 26(1) (1978)
11. J.W. Cooley, P.A.W. Lewis, P.D. Welch, The fast Fourier transform algorithm and its applications. Technical Report RC-1743, IBM (1967)
12. J.W. Cooley, J.W. Tukey, An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19(4), 297–301 (1965)
13. G.C. Danielson, C. Lanczos, Some improvements in practical Fourier series and their application to x-ray scattering from liquids. J. Franklin Inst. 233, 365–380, 435–452 (1942)
14. C. Ding, D. Pei, A. Salomaa, Chinese Remainder Theorem: Applications in Computing, Coding, Cryptography (World Scientific, 1996)
15. P. Duhamel, Implementations of split-radix FFT algorithms for complex, real and real-symmetric data. IEEE Trans. ASSP 34(2), 285–295 (1986)
16. P. Duhamel, M. Vetterli, Improved Fourier and Hartley transform algorithms: Application to cyclic convolution of real data. IEEE Trans. ASSP 35(6), 818–824 (1987)
17. D.F. Elliott, K. Ramamohan Rao, Fast Transforms: Algorithms, Analyses, Applications (Academic, New York, 1982)
18. O. Ersoy, Real discrete Fourier transform. IEEE Trans. ASSP 33(4) (1985)
19. P. Gannon, Colossus: Bletchley Park's Greatest Secret (Atlantic Books, London, 2006)
20. I.J. Good, The interaction algorithm and practical Fourier series. J. Roy. Stat. Soc. Ser. B 20, 361–372 (1958)
21. D. Harel, Algorithmics: The Spirit of Computing (Addison Wesley, Reading, MA, 1997)
22. F.J. Harris, Multirate Signal Processing for Communication Systems (Prentice Hall, Upper Saddle River, NJ, 2004)
23. R.V.L. Hartley, A more symmetrical Fourier analysis applied to transmission problems. Proc. IRE 30 (1942)
24. M.T. Heideman, D.H. Johnson, C.S. Burrus, Gauss and the history of the fast Fourier transform. IEEE ASSP Mag. 1, 14–21 (1984)
25. A. Hodges, Alan Turing: The Enigma (Vintage, London, 1992)
26. K.J. Jones, Design and parallel computation of regularised fast Hartley transform. IEE Proc. Vision Image Signal Process. 153(1), 70–78 (February 2006)
27. K.J. Jones, R. Coster, Area-efficient and scalable solution to real-data fast Fourier transform via regularised fast Hartley transform. IET Signal Process. 1(3), 128–138 (2007)
28. L. Kronsjo, Computational Complexity of Sequential and Parallel Algorithms (Wiley, New York, 1985)
29. S.Y. Kung, VLSI Array Processors (Prentice Hall, Englewood Cliffs, NJ, 1988)
30. J.B. Marten, Discrete Fourier transform algorithms for real valued sequences. IEEE Trans. ASSP 32(2) (1984)
31. C. Maxfield, The Design Warrior's Guide to FPGAs (Newnes (Elsevier), Burlington, MA, 2004)
32. J.H. McClellan, C.M. Rader, Number Theory in Digital Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1979)
33. H. Murakami, Real-valued fast discrete Fourier transform and decimation-in-frequency algorithms. IEEE Trans. Circuits Syst. II: Analog Dig. Signal Process. 41(12), 808–816 (1994)
34. I. Niven, H.S. Zuckerman, An Introduction to the Theory of Numbers (Wiley, New York, 1980)
35. H.J. Nussbaumer, Fast Fourier Transform and Convolution Algorithms (Springer, Berlin, 1981)
36. A.V. Oppenheim, R.W. Schafer, Discrete-Time Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1989)
37. J.R. Pierce, An Introduction to Information Theory: Symbols, Signals and Noise (Dover, New York, 1980)
38. R.A. Poisel, Electronic Warfare: Target Location Methods (Artech House, Boston, MA, 2005)
39. L.R. Rabiner, B. Gold, Theory and Application of Digital Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1975)
40. C. Runge, Über die Zerlegung empirisch periodischer Funktionen in Sinus-Wellen. Zeit. für Math. und Physik 48, 443–456 (1903)
41. C.E. Shannon, A mathematical theory of communication. BSTJ 27, 379–423, 623–657 (1948)
42. G.R.L. Sohie, W. Chen, Implementation of Fast Fourier Transforms on Motorola's Digital Signal Processors. Downloadable document from website: www.Motorola.com
43. H.V. Sorensen, D.L. Jones, C.S. Burrus, M.T. Heideman, On computing the discrete Hartley transform. IEEE Trans. ASSP 33, 1231–1238 (1985)
44. H.V. Sorensen, D.L. Jones, M.T. Heideman, C.S. Burrus, Real-valued fast Fourier transform algorithms. IEEE Trans. ASSP 35(6), 849–863 (1987)
45. K. Stumpff, Tafeln und Aufgaben zur Harmonischen Analyse und Periodogrammrechnung (Julius Springer, Berlin, 1939)
46. P.R. Uniyal, Transforming real-valued sequences: Fast Fourier versus fast Hartley transform algorithms. IEEE Signal Process. 42(11) (1994)
47. J.E. Volder, The CORDIC trigonometric computing technique. IRE Trans. Elect. Comput. EC-8(3), 330–334 (1959)
48. S.A. White, Application of distributed arithmetic to digital signal processing: A tutorial review. IEEE ASSP Mag. 6(3), 4–19 (1989)
49. S. Winograd, Arithmetic Complexity of Computations (SIAM, Philadelphia, PA, 1980)


Chapter 2
Fast Solutions to Real-Data Discrete Fourier Transform

Abstract This chapter discusses the two approaches conventionally adopted for dealing with the real-data DFT problem. The first approach involves the design of specialized fast algorithms, such as those due to Bergland and Bruun, which are geared specifically to addressing real-data applications and therefore able to exploit, in a direct way, the real-valued nature of the data – which is known to result in a Hermitian-symmetric frequency spectrum. The second approach, which is the most commonly adopted, particularly for applications requiring a hardware solution, involves re-structuring the data so as to use an existing complex-data FFT algorithm, possibly coupled with pre-FFT and/or post-FFT stages, to produce the DFT of either one or two (produced simultaneously) real-valued data sets – such solutions thus said to be obtained via a "real-from-complex" strategy. A discussion is finally provided relating to the results obtained in the chapter.

2.1 Introduction

Since the original developments of spectrum analysis in the eighteenth century, the vast majority of real-world applications have been concerned with the processing of real-valued data, where the data generally corresponds to amplitude measurements of some signal of interest. As a result, there has always been a genuine practical need for fast solutions to the real-data DFT with two quite distinct approaches evolving over this period to address the problem. The first and more intellectually challenging approach involves trying to design specialized algorithms which are geared specifically to real-data applications and therefore able to exploit, in a direct way, the real-valued nature of the data which is known to result in a Hermitian-symmetric frequency spectrum, whereby for the case of an N-point transform

Re{X^(F)[k]} = Re{X^(F)[N − k]}    (2.1)

and

Im{X^(F)[k]} = −Im{X^(F)[N − k]},    (2.2)


so that one half of the DFT outputs are in fact redundant. Such solutions, as typified by the Bergland [1] and Bruun [3, 13] algorithms, only need therefore to produce one half of the DFT outputs. The second and less demanding approach – but also the most commonly adopted, particularly for applications requiring a hardware solution – involves re-structuring the data so as to use an existing complex-data FFT algorithm, possibly coupled with pre-FFT and/or post-FFT stages, to produce the DFT of either one or two (produced simultaneously) real-valued data sets – such solutions thus said to be obtained via a real-from-complex strategy [17]. Both of these approaches are now discussed in some detail prior to a summary of their relative merits and drawbacks.

2.2 Real-Data FFT Algorithms

Since the re-emergence of computationally-efficient FFT algorithms, as initiated by the published work of James Cooley and John Tukey in the mid-1960s [5], a number of attempts have been made [1, 3, 6, 7, 9, 10, 12, 18, 19] at producing fast algorithms that are able to directly exploit the spectral symmetry that arises from the processing of real-valued data. Two such algorithms are those due to Glenn Bergland (1968) and Georg Bruun (1978) and these are now briefly discussed so as to give a flavour of the type of algorithmic structures that can result from pursuing such an approach. The Bergland algorithm effectively modifies the DIT version of the familiar Cooley–Tukey radix-2 algorithm [2] to account for the fact that only one half of the DFT outputs need to be computed, whilst the Bruun algorithm adopts an unusual recursive polynomial-factorization approach – note that the DIF version of the Cooley–Tukey fixed-radix algorithm, referred to as the Sande–Tukey algorithm [2], may also be expressed in such a form – which involves only real-valued polynomial coefficients until the last stage of the computation, making it thus particularly suited to the computation of the real-data DFT. Examples of the signal flow graphs (SFGs) for both DIT and DIF versions of the radix-2 FFT algorithm are as given below in Figs. 2.1 and 2.2, respectively.

2.2.1 The Bergland Algorithm

The Bergland algorithm is a real-data FFT algorithm based upon the observation that the frequency spectrum arising from the processing of real-valued data is Hermitian-symmetric, so that only one half of the DFT outputs needs to be computed. Starting with the DIT version of the familiar complex-data Cooley–Tukey radix-2 FFT algorithm, if the input data is real valued, then at each of the log₂N temporal stages of the algorithm the computation involves the repeated combination of two transforms to yield one longer double-length transform. From this, Bergland observed that the property of Hermitian symmetry may actually be exploited at each of the


Fig. 2.1 Signal flow graph for DIT decomposition of eight-point DFT

Fig. 2.2 Signal flow graph for DIF decomposition of eight-point DFT

log₂N stages of the algorithm. Thus, as all the odd-addressed output samples for each such double-length transform form the second half of the frequency spectrum, which can in turn be straightforwardly obtained from the property of spectral symmetry, Bergland's algorithm instead uses those memory locations for storing the imaginary components of the data.

Thus, with Bergland's algorithm, given that the input data sequence is real valued, all the intermediate results may be stored in just N memory locations – each location thus corresponding to just one word of memory. The computation can also be carried out in an in-place fashion – whereby the outputs from each butterfly are stored in the same set of memory locations as used by the inputs – although the indices of the set of butterfly outputs are not in bit-reversed order, as they are with the Cooley–Tukey algorithm, being instead ordered according to the Bergland


ordering [1], as also are the indices of the twiddle factors or trigonometric coefficients. However, the normal ordering of the twiddle factors may, with due care, be converted to the Bergland ordering and the Bergland ordering of the FFT outputs subsequently converted to the normal ordering, as required for an efficient in-place solution [1, 17].

Thus, the result of the above modifications is an FFT algorithm with an arithmetic complexity of O(N·log₂N) which yields a factor-of-two saving, compared to the conventional zero-padded complex-data FFT solution – to be described in Section 2.3.1 – in terms of both arithmetic complexity and memory requirement.

2.2.2 The Bruun Algorithm

The Bruun algorithm is a real-data FFT algorithm based upon an unusual recursive polynomial-factorization approach, proposed initially for the case of N input samples where N is a power of two, but subsequently generalized to arbitrary even-number transform sizes by Hideo Murakami in 1996 [12].

Recall firstly, from Chapter 1, that the N-point DFT can be written in normalized form as

X^(F)[k] = (1/√N) Σ_{n=0}^{N−1} x[n]·W_N^{nk},   k = 0, 1, …, N − 1    (2.3)

where the transform kernel is derived from the term

W_N = exp(−i2π/N),    (2.4)

the primitive Nth complex root of unity. Then, by defining the polynomial x(z) whose coefficients are those elements of the sequence {x[n]}, such that

x(z) = (1/√N) Σ_{n=0}^{N−1} x[n]·z^n,    (2.5)

it is possible to view the DFT as a reduction of this polynomial [11], so that

X^(F)[k] = x(W_N^k) = x(z) mod (z − W_N^k)    (2.6)

where "mod" stands for the modulo operation [11] which denotes the polynomial remainder upon division of x(z) by (z − W_N^k) [11]. The key to fast execution of the Bruun algorithm stems from being able to perform this set of N polynomial remainder operations in a recursive fashion.

To compute the DFT involves evaluating the remainder of x(z) modulo some polynomial of degree one, more commonly referred to as a monomial, a total of N times, as suggested by Equations 2.5 and 2.6. To do this efficiently, one can combine the remainders recursively in the following way. Suppose it is required to evaluate


x(z) modulo U(z) as well as x(z) modulo V(z). Then, by first evaluating x(z) modulo the polynomial product, U(z)·V(z), the degree of the polynomial x(z) is reduced, thereby making subsequent modulo operations less computationally expensive.

Now the product of all of the monomials, (z − W_N^k), for values of k from 0 up to N − 1, is simply (z^N − 1), whose roots are clearly the N complex roots of unity. A recursive factorization of (z^N − 1) is therefore required which breaks it down into polynomials of smaller and smaller degree, with each possessing as few non-zero coefficients as possible. To compute the DFT, one takes x(z) modulo each level of this factorization in turn, recursively, until one arrives at the monomials and the final result. If each level of the factorization splits every polynomial into an O(1) number of smaller polynomials, each with an O(1) number of non-zero coefficients, then the modulo operations for that level will take O(N) arithmetic operations, thus leading to a total arithmetic complexity, for all log₂N levels, of O(N·log₂N), as obtained with the standard Cooley–Tukey radix-2 algorithm.

Note that for N a power of two, the Bruun algorithm factorizes the polynomial (z^N − 1) recursively via the rules

z^{2M} − 1 = (z^M − 1)(z^M + 1)    (2.7)

and

z^{4M} + a·z^{2M} + 1 = (z^{2M} + √(2 − a)·z^M + 1)(z^{2M} − √(2 − a)·z^M + 1),    (2.8)

where "a" is a constant such that |a| ≤ 2. On completion of the recursion, when M = 1, there remain polynomials of degree two that can each be evaluated modulo two roots of the form (z − W_N^k) for each polynomial. Thus, at each recursive stage, all of the polynomials may be factorized into two parts, each of half the degree and possessing at most three non-zero coefficients, leading to an FFT algorithm with an O(N·log₂N) arithmetic complexity. Moreover, since all the polynomials have purely real coefficients, at least until the last stage, they quite naturally exploit the special case where the input data is real valued, thereby yielding a factor-of-two saving, compared to the conventional zero-padded complex-data FFT solution to be discussed in Section 2.3.1, in terms of both arithmetic complexity and memory requirement.

2.3 Real-From-Complex Strategies

By far the most common approach to solving the real-data DFT problem is that based upon the use of an existing complex-data FFT algorithm as it simplifies the problem, at worst, to one of designing pre-FFT and/or post-FFT stages for the packing of the real-valued data into the correct format required for input to the FFT algorithm and for the subsequent unpacking of the FFT output data to obtain the spectrum (or spectra) of the original real-valued data set (or sets). Note that any fast


algorithm may be used for carrying out the complex-data FFT, so that both DIT and DIF versions of fixed-radix FFTs, as already discussed, as well as more sophisticated FFT designs such as those corresponding to the mixed-radix, split-radix, prime factor, prime-length and Winograd's nested algorithms [8, 11], for example, might be used.

2.3.1 Computing One Real-Data DFT via One Full-Length Complex-Data FFT

The most straightforward approach to the problem involves first packing the real-valued data into the real component of a complex-valued data sequence, padding the imaginary component with zeros – this action more commonly referred to as zero padding – and then feeding the resulting complex-valued data set into a complex-data FFT. The arithmetic complexity of such an approach is clearly identical to that obtained when a standard complex-data FFT is applied to genuine complex-valued data, so that no computational benefits stemming from the nature of the data are achieved with such an approach. On the contrary, computational resources are wasted with such an approach, as excessive arithmetic operations are performed for the computation of the required outputs and twice the required amount of memory used for the storage of the input/output data.

2.3.2 Computing Two Real-Data DFTs via One Full-Length Complex-Data FFT

The next approach to the problem involves computing two N-point real-data DFTs, simultaneously, by means of one N-point complex-data FFT. This is achieved by packing the first real-valued data sequence into the real component of a complex-valued data sequence and the second real-valued data sequence into its imaginary component. Thus, given two real-valued data sequences, {g[n]} and {h[n]}, a complex-valued data sequence, {x[n]}, may be simply obtained by setting

x[n] = g[n] + i·h[n],    (2.9)

with the kth DFT output of the resulting data sequence being written in normalized form, in terms of the DFTs of {g[n]} and {h[n]}, as

X^(F)[k] = (1/√N) Σ_{n=0}^{N−1} x[n]·W_N^{nk}

= (1/√N) Σ_{n=0}^{N−1} g[n]·W_N^{nk} + i·(1/√N) Σ_{n=0}^{N−1} h[n]·W_N^{nk}

= G[k] + i·H[k]

= (G_R[k] − H_I[k]) + i·(G_I[k] + H_R[k]),    (2.10)

where G_R[k] and G_I[k] are the real and imaginary components, respectively, of G[k] – the same applying to H_R[k] and H_I[k] with respect to H[k]. Similarly, the (N − k)th DFT output may be written in normalized form as

X^(F)[N − k] = (1/√N) Σ_{n=0}^{N−1} x[n]·W_N^{n(N−k)}

= (1/√N) Σ_{n=0}^{N−1} g[n]·W_N^{−nk} + i·(1/√N) Σ_{n=0}^{N−1} h[n]·W_N^{−nk}

= G*[k] + i·H*[k]

= (G_R[k] + H_I[k]) + i·(−G_I[k] + H_R[k]),    (2.11)

where the superscript "*" stands for the operation of complex conjugation, so that upon combining the expressions of Equations 2.10 and 2.11, the DFT outputs G[k] and H[k] may be written, in terms of the DFT outputs X^(F)[k] and X^(F)[N − k], as

G[k] = G_R[k] + i·G_I[k]

= ½(Re[X^(F)[k] + X^(F)[N − k]] + i·Im[X^(F)[k] − X^(F)[N − k]])    (2.12)

and

H[k] = H_R[k] + i·H_I[k]

= ½(Im[X^(F)[k] + X^(F)[N − k]] − i·Re[X^(F)[k] − X^(F)[N − k]])    (2.13)

where the terms Re[X^(F)[k]] and Im[X^(F)[k]] denote the real and imaginary components, respectively, of X^(F)[k].

Thus, it is evident that the DFT of the two real-valued data sequences, {g[n]} and {h[n]}, may be computed simultaneously, via one full-length complex-data FFT algorithm, with the DFT of the sequence {g[n]} being as given by Equation 2.12 and that of the sequence {h[n]} by Equation 2.13. The pre-FFT data packing stage is quite straightforward in that it simply involves the assignment of one real-valued data sequence to the real component of the complex-valued data array and one real-valued data sequence to its imaginary component. The post-FFT data unpacking stage simply involves separating out the two spectra from the complex-valued FFT output data, this involving two real additions/subtractions for each real-data DFT output together with two scaling operations each by a factor of 2 (which in fixed-point hardware reduces to that of a simple right shift operation).


2.3.3 Computing One Real-Data DFT via One Half-Length Complex-Data FFT

Finally, the last approach to the problem involves showing how an N-point complex-data FFT may be used to carry out the computation of one 2N-point real-data DFT. The kth DFT output of the 2N-point real-valued data sequence {x[n]} may be written in normalized form as

X^(F)[k] = (1/√(2N)) Σ_{n=0}^{2N−1} x[n]·W_{2N}^{nk},   k = 0, 1, …, N − 1

= (1/√(2N)) Σ_{n=0}^{N−1} x[2n]·W_N^{nk} + W_{2N}^k·(1/√(2N)) Σ_{n=0}^{N−1} x[2n + 1]·W_N^{nk}    (2.14)

which, upon setting g[n] = x[2n] and h[n] = x[2n + 1], becomes

X^(F)[k] = (1/√(2N)) Σ_{n=0}^{N−1} g[n]·W_N^{nk} + W_{2N}^k·(1/√(2N)) Σ_{n=0}^{N−1} h[n]·W_N^{nk},   k = 0, 1, …, N − 1

= G[k] + W_{2N}^k·H[k].    (2.15)

Therefore, by setting y[n] = g[n] + i·h[n] and exploiting the combined expressions of Equations 2.10 and 2.11, the DFT output Y[k] may be written as

Y[k] = (G_R[k] − H_I[k]) + i·(G_I[k] + H_R[k])    (2.16)

and that for Y[N − k] as

Y[N − k] = (G_R[k] + H_I[k]) + i·(−G_I[k] + H_R[k]).    (2.17)

Then, by combining the expressions of Equations 2.15–2.17, the real component of X^(F)[k] may be written as

X_R^(F)[k] = ½·Re(Y[k] + Y[N − k]) + ½·cos(kπ/N)·Im(Y[k] + Y[N − k]) − ½·sin(kπ/N)·Re(Y[k] − Y[N − k])    (2.18)

and the imaginary component as

X_I^(F)[k] = ½·Im(Y[k] − Y[N − k]) − ½·sin(kπ/N)·Im(Y[k] + Y[N − k]) − ½·cos(kπ/N)·Re(Y[k] − Y[N − k]).    (2.19)

Thus, it is evident that the DFT of one real-valued data sequence, {x[n]}, of length 2N, may be computed via one N-point complex-data FFT algorithm, with the real component of the DFT output being as given by Equation 2.18 and the imaginary component of the DFT output as given by Equation 2.19. The pre-FFT data packing stage is conceptually simple, but nonetheless burdensome, in that it involves the assignment of the even-addressed samples of the real-valued data sequence to the real component of the complex-valued data sequence and the odd-addressed samples to its imaginary component. The post-FFT data unpacking stage is considerably more complex than that required for the approach of Section 2.3.2, requiring the application of eight real additions/subtractions for each DFT output, together with two scaling operations, each by a factor of 2, and four real multiplications by pre-computed trigonometric coefficients.

2.4 Data Re-ordering

All of the fixed-radix formulations of the FFT, at least for the case where the transform length is a power of two, require that either the inputs to or the outputs from the transform be permuted according to the digit reversal mapping [4]. In fact, it is possible to place the data re-ordering either before or after the transform for both the DIT and DIF formulations [4]. For the case of a radix-2 algorithm the data re-ordering is more commonly known as the bit reversal mapping, being based upon the exchanging of single bits, whilst for the radix-4 case it is known as the di-bit reversal mapping, being based instead upon the exchanging of pairs of bits.

Such data re-ordering, when mapped onto a uni-processor sequential computing device, may be carried out via the use of either:

1. An LUT, at the expense of additional memory; or
2. A fast algorithm using just shifts, additions/subtractions and memory exchanges; or
3. A fast algorithm that also makes use of a small LUT – containing the reflected bursts of ones that change on the lower end with incrementing – to try and optimize the speed at the cost of a slight increase in memory,

with the optimum choice being dependent upon the available resources and the time constraints of the application.

Alternatively, when the digit-reversal mapping is appropriately parallelized, it may be mapped onto a multi-processor parallel computing device, such as an FPGA, possessing multiple banks of fast memory, thus enabling the time-complexity to be greatly reduced – see the recent work of Ren et al. [14] and Seguel and Bollman [16].


The optimum approach to digit-reversal is dictated very much by the operation of the FFT, namely whether the FFT is of burst or streaming type, as discussed in Chapter 6.

2.5 Discussion

The aim of this chapter has been to highlight both the advantages and the disadvantages of the conventional approaches to the solution of the real-data DFT problem. As is evident from the examples discussed in Section 2.2, namely the Bergland and Bruun algorithms, the adoption of specialized real-data FFT algorithms may well yield solutions possessing attractive performance metrics in terms of arithmetic complexity and memory requirement, but generally this is only achieved at the expense of a more complex algorithmic structure when compared to those of the highly-regular fixed-radix designs. As a result, such algorithms would not seem to lend themselves particularly well to being mapped onto parallel computing equipment.

Similarly, from the examples of Section 2.3, namely the real-from-complex strategies, the regularity of the conventional fixed-radix designs may only be exploited at the expense of introducing additional processing modules, namely the pre-FFT and/or post-FFT stages for the packing of the real-valued data into the correct format required for input to the FFT algorithm and for the subsequent unpacking of the FFT output data to obtain the spectrum (or spectra) of the original real-valued data set (or sets). An additional set of problems associated with the real-from-complex strategies, at least when compared to the more direct approach of a real-data FFT, relate to the need for increased memory and increased processing delay to allow for the possible acquisition/processing of pairs of data sets.

It is worth noting that an alternative DSP-based approach to those discussed above is to first convert the real-valued data to complex-valued data by means of a wideband DDC process, this followed by the application of a conventional complex-data FFT. Such an approach, however, introduces an additional function to be performed – typically an FIR filter with length dependent upon the performance requirements of the application – which also introduces an additional processing delay prior to the execution of the FFT. Drawing upon a philosophical analogy, namely the maxim of the fourteenth century Franciscan scholar, William of Occam, commonly known as "Occam's razor" [15]: why use two functions when just one will suffice? A related and potentially serious problem arises when there is limited information available on the signal under analysis, as such information might well be compromised via the filtering operation, particularly when the duration of the signal is short relative to that of the transient response of the filter – as might be encountered, for example, with problems relating to the detection of extremely short duration dual-tone multi-frequency (DTMF) signals.

Thus, there are clear drawbacks to all such approaches, particularly when the application requires a solution in hardware using parallel computing equipment, so


that the investment of searching for alternative solutions to the fast computation of the real-data DFT is still well merited. More specifically, solutions are required possessing both highly regular designs that lend themselves naturally to mapping onto parallel computing equipment and attractive performance metrics, in terms of both arithmetic complexity and memory requirement, but without requiring excessive packing/unpacking requirements and without incurring the latency problems (as arising from the increased processing delay) associated with the adoption of certain of the real-from-complex strategies.

References

1. G.D. Bergland, A fast Fourier transform algorithm for real-valued series. Comm. ACM 11(10) (1968)
2. E.O. Brigham, The Fast Fourier Transform and Its Applications (Prentice Hall, Englewood Cliffs, NJ, 1988)
3. G. Bruun, z-Transform DFT filters and FFTs. IEEE Trans. ASSP 26(1) (1978)
4. E. Chu, A. George, Inside the FFT Black Box (CRC Press, Boca Raton, FL, 2000)
5. J.W. Cooley, J.W. Tukey, An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19(4), 297–301 (1965)
6. P. Duhamel, Implementations of split-radix FFT algorithms for complex, real and real-symmetric data. IEEE Trans. ASSP 34(2), 285–295 (1986)
7. P. Duhamel, M. Vetterli, Improved Fourier and Hartley transform algorithms: Application to cyclic convolution of real data. IEEE Trans. ASSP 35(6), 818–824 (1987)
8. P. Duhamel, M. Vetterli, Fast Fourier transforms: A tutorial review and a state of the art. Signal Process. 19, 259–299 (1990)
9. O. Ersoy, Real discrete Fourier transform. IEEE Trans. ASSP 33(4) (1985)
10. J.B. Marten, Discrete Fourier transform algorithms for real valued sequences. IEEE Trans. ASSP 32(2) (1984)
11. J.H. McClellan, C.M. Rader, Number Theory in Digital Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1979)
12. H. Murakami, Real-valued fast discrete Fourier transform and cyclic convolution algorithms of highly composite even length. Proc. ICASSP 3, 1311–1314 (1996)
13. H.J. Nussbaumer, Fast Fourier Transform and Convolution Algorithms (Springer, Berlin, 1981)
14. G. Ren, P. Wu, D. Padua, Optimizing Data Permutations for SIMD Devices (PDL'06, Ottawa, Ontario, Canada, 2006)
15. B. Russell, History of Western Philosophy (George Allen & Unwin, London, 1961)
16. J. Seguel, D. Bollman, A framework for the design and implementation of FFT permutation algorithms. IEEE Trans. Parallel Distrib. Syst. 11(7), 625–634 (2000)
17. G.R.L. Sohie, W. Chen, Implementation of Fast Fourier Transforms on Motorola's Digital Signal Processors. Downloadable document from website: www.Motorola.com
18. H.V. Sorensen, D.L. Jones, M.T. Heideman, C.S. Burrus, Real-valued fast Fourier transform algorithms. IEEE Trans. ASSP 35(6), 849–863 (1987)
19. P.R. Uniyal, Transforming real-valued sequences: Fast Fourier versus fast Hartley transform algorithms. IEEE Trans. Signal Process. 42(11) (1994)


Chapter 3
The Discrete Hartley Transform

Abstract This chapter introduces the DHT and discusses those aspects of its solution, as obtained via the FHT, which make it an attractive choice for applying to the real-data DFT problem. This involves first showing how the DFT may be obtained from the DHT, and vice versa, followed by a discussion of those fundamental theorems, common to both the DFT and DHT algorithms, which enable the input data sets to be similarly related to their respective transforms and thus enable the DHT to be used for solving those DSP-based problems commonly addressed via the DFT, and vice versa. The limitations of existing FHT algorithms are then discussed bearing in mind the ultimate objective of mapping any subsequent solution onto silicon-based parallel computing equipment. A discussion is finally provided relating to the results obtained in the chapter.

3.1 Introduction

An algorithm that would appear to satisfy most if not all of the requirements laid down in Section 2.5 of the previous chapter is that of the DHT, as introduced in Chapter 1, a discrete unitary transform [6] that involves only real arithmetic (thus making it also orthogonal) and that is intimately related to the DFT, satisfying all those properties required of a unitary transform as well as possessing fast algorithms for its solution. Before delving into the details, however, it is perhaps worth re-stating the definition, namely that for the case of N input/output samples, the DHT may be expressed in normalized form via the equation

$$X^{(H)}[k] = \frac{1}{\sqrt{N}}\sum_{n=0}^{N-1} x[n]\,\mathrm{cas}(2\pi nk/N), \quad k = 0, 1, \ldots, N-1 \tag{3.1}$$

where the input/output data vectors belong to R^N, the linear space of real-valued N-tuples, and the transform kernel is given by

$$\mathrm{cas}(2\pi nk/N) = \cos(2\pi nk/N) + \sin(2\pi nk/N), \tag{3.2}$$

K. Jones, The Regularized Fast Hartley Transform, Signals and Communication Technology, DOI 10.1007/978-90-481-3917-0_3, © Springer Science+Business Media B.V. 2010


a periodic function with period 2π and possessing (amongst others) the following set of useful properties:

$$\mathrm{cas}(A+B) = \cos A\,.\,\mathrm{cas}B + \sin A\,.\,\mathrm{cas}(-B)$$
$$\mathrm{cas}(A-B) = \cos A\,.\,\mathrm{cas}(-B) + \sin A\,.\,\mathrm{cas}B$$
$$\mathrm{cas}A\,.\,\mathrm{cas}B = \cos(A-B) + \sin(A+B)$$
$$\mathrm{cas}A + \mathrm{cas}B = 2\,.\,\mathrm{cas}\left(\tfrac{1}{2}(A+B)\right)\cos\left(\tfrac{1}{2}(A-B)\right)$$
$$\mathrm{cas}A - \mathrm{cas}B = 2\,.\,\mathrm{cas}\left(-\tfrac{1}{2}(A+B)\right)\sin\left(\tfrac{1}{2}(A-B)\right) \tag{3.3}$$

as will be exploited later for derivation of the FHT algorithm.

3.2 Normalization of DHT Outputs

Suppose now that the DHT operation is applied twice, in succession, the first time to a real-valued data sequence, {x[n]}, and the second time to the output of the first operation. Then given that the DHT is bilateral and, like the DFT, a unitary transform, the output from the second operation, {y[n]}, can be expressed as

$$\{y[n]\} = \mathrm{DHT}\left(\mathrm{DHT}\left(\{x[n]\}\right)\right) \equiv \{x[n]\}, \tag{3.4}$$

so that the output of the second operation is actually equivalent to the input of the first operation.

However, it should be noted that without the presence of the scaling factor, 1/√N, that has been included in the current definition of the DHT, as given by Equation 3.1 above, the magnitudes of the outputs of the second DHT would actually be equal to N times those of the inputs of the first DHT, so that the role of the scaling factor is to ensure that the magnitudes are preserved. It should be borne in mind, however, that the presence of a coherent signal in the input data will result in most of the growth in magnitude occurring in the forward transform, so that any future scaling strategy – as discussed in the following chapter – must reflect this fact.

Note that a scaling factor of 1/N is often used for the forward definition of both the DFT and the DHT, the value of 1/√N being used instead here purely for mathematical elegance, as it reduces the definitions of the DHT for both the forward and the reverse directions to an identical form. The fundamental theorems discussed in Section 3.5 for both the DFT and the DHT, however, are valid regardless of the scaling factor used.


3.3 Decomposition into Even and Odd Components

The close relationship between the DFT and the DHT hinges upon symmetry considerations which may be best explained by considering the decomposition of the DHT into its "even" and "odd" components [2], denoted E[k] and O[k], respectively, and written as

$$X^{(H)}[k] = E[k] + O[k] \tag{3.5}$$

where, for an N-point transform, E[k] is such that

$$E[N-k] = E[k] \tag{3.6}$$

and O[k] is such that

$$O[N-k] = -O[k]. \tag{3.7}$$

As a result, the even and odd components may each be expressed in terms of the DHT outputs via the expressions

$$E[k] = \tfrac{1}{2}\left(X^{(H)}[k] + X^{(H)}[N-k]\right) \tag{3.8}$$

and

$$O[k] = \tfrac{1}{2}\left(X^{(H)}[k] - X^{(H)}[N-k]\right), \tag{3.9}$$

respectively, from which the relationship between the DFT and DHT outputs may be straightforwardly obtained.

3.4 Connecting Relations Between DFT and DHT

Firstly, from the equality

$$\mathrm{cas}(2\pi nk/N) = \mathrm{Re}\left(W_N^{nk}\right) - \mathrm{Im}\left(W_N^{nk}\right), \tag{3.10}$$

which relates the kernels of the two transformations, the DFT outputs may be expressed as

$$X^{(F)}[k] = E[k] - i\,O[k], \tag{3.11}$$

so that

$$\mathrm{Re}\left(X^{(F)}[k]\right) = \tfrac{1}{2}\left(X^{(H)}[N-k] + X^{(H)}[k]\right) \tag{3.12}$$

and

$$\mathrm{Im}\left(X^{(F)}[k]\right) = \tfrac{1}{2}\left(X^{(H)}[N-k] - X^{(H)}[k]\right), \tag{3.13}$$


whilst

$$X^{(H)}[k] = \mathrm{Re}\left(X^{(F)}[k]\right) - \mathrm{Im}\left(X^{(F)}[k]\right). \tag{3.14}$$

3.4.1 Real-Data DFT

Thus, from Equations 3.12 to 3.14, the complex-valued DFT output set and the real-valued DHT output set may now be simply obtained, one from the other, so that a fast algorithm for the solution of the DFT may also be used for the computation of the DHT whilst a fast algorithm for the solution of the DHT may similarly be used for the computation of the DFT. Note from the above equations that pairs of real-valued DHT outputs combine to give individual complex-valued DFT outputs, such that

$$X^{(H)}[k]\ \&\ X^{(H)}[N-k] \leftrightarrow X^{(F)}[k] \tag{3.15}$$

for k = 1, 2, …, N/2−1, whilst the remaining two terms are such that

$$X^{(H)}[0] \leftrightarrow X^{(F)}[0] \tag{3.16}$$

and

$$X^{(H)}[N/2] \leftrightarrow X^{(F)}[N/2]. \tag{3.17}$$

With regard to the two trivial mappings provided above by Equations 3.16 and 3.17, it may also be noted from Equation 3.10 that when k = 0, we have

$$\mathrm{cas}(2\pi nk/N) = W_N^{nk} = 1, \tag{3.18}$$

so that the zero-address component in Hartley space maps to the zero-address (or zero-frequency) component in Fourier space, and vice versa, as implied by Equation 3.16, whilst when k = N/2, we have

$$\mathrm{cas}(2\pi nk/N) = W_N^{nk} = (-1)^n, \tag{3.19}$$

so that the Nyquist-address component in Hartley space similarly maps to the Nyquist-address (or Nyquist-frequency) component in Fourier space, and vice versa, as implied by Equation 3.17.

3.4.2 Complex-Data DFT

Now, having defined the relationship between the Fourier-space and Hartley-space representations of a real-valued data sequence it is a simple task to extend the results to the case of a complex-valued data sequence. Given the linearity of the DFT – this property follows from the Addition Theorem to be discussed in the following section – the DFT of a complex-valued data sequence, {x_R[n] + i.x_I[n]}, can be written as the sum of the DFTs of the individual real and imaginary components, so that

$$\mathrm{DFT}\left(\{x_R[n] + i\,x_I[n]\}\right) = \mathrm{DFT}\left(\{x_R[n]\}\right) + i\,\mathrm{DFT}\left(\{x_I[n]\}\right). \tag{3.20}$$

Therefore, by first taking the DHT of the individual real and imaginary components of the complex-valued data sequence and then deriving the DFT of each such component by means of Equations 3.12 and 3.13, the real and imaginary components of the DFT of the complex-valued data sequence may be written in terms of the two DHTs as

$$\mathrm{Re}\left(X^{(F)}[k]\right) = \tfrac{1}{2}\left(X_R^{(H)}[N-k] + X_R^{(H)}[k]\right) - \tfrac{1}{2}\left(X_I^{(H)}[N-k] - X_I^{(H)}[k]\right) \tag{3.21}$$

and

$$\mathrm{Im}\left(X^{(F)}[k]\right) = \tfrac{1}{2}\left(X_R^{(H)}[N-k] - X_R^{(H)}[k]\right) + \tfrac{1}{2}\left(X_I^{(H)}[N-k] + X_I^{(H)}[k]\right), \tag{3.22}$$

respectively, so that it is now possible to compute the DFT of both real-valued and complex-valued data sequences by means of the DHT – pseudo-code is provided for both the real-valued data and complex-valued data cases in Figs. 3.1 and 3.2, respectively.

The significance of the complex-to-real decomposition described here for the complex-data DFT is that it introduces an additional level of parallelism to the problem as the resulting DHTs are independent and thus able to be computed simultaneously, or in parallel, when implemented with parallel computing technology – a subject to be introduced in Chapter 5. This is particularly relevant when the transform is long and the throughput requirement high.
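A minimal sketch of this decomposition (assuming NumPy is available; the helper names are hypothetical, with the DHT computed naively), combining the two independent DHTs via Equations 3.21 and 3.22:

```python
import numpy as np

def dht(x):
    """Normalized DHT of Equation 3.1, computed naively as a matrix product."""
    N = len(x)
    arg = 2.0 * np.pi * np.outer(np.arange(N), np.arange(N)) / N
    return (np.cos(arg) + np.sin(arg)) @ x / np.sqrt(N)

def complex_dft_via_two_dhts(z):
    """Complex-data DFT from two independent DHTs (Equations 3.21 and 3.22)."""
    hr, hi = dht(z.real), dht(z.imag)            # independent: may run in parallel
    hr_rev = np.roll(hr[::-1], 1)                # X_R^(H)[N-k]
    hi_rev = np.roll(hi[::-1], 1)                # X_I^(H)[N-k]
    re = 0.5 * (hr_rev + hr) - 0.5 * (hi_rev - hi)
    im = 0.5 * (hr_rev - hr) + 0.5 * (hi_rev + hi)
    return re + 1j * im

rng = np.random.default_rng(3)
z = rng.standard_normal(16) + 1j * rng.standard_normal(16)
assert np.allclose(complex_dft_via_two_dhts(z), np.fft.fft(z) / np.sqrt(len(z)))
```

Since the two calls to the DHT involve no shared state, a parallel implementation may launch them on separate processing elements and apply the cheap combining stage afterwards.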

3.5 Fundamental Theorems for DFT and DHT

As has already been stated, if the DFT and DHT algorithms are to be used interchangeably, for solving certain types of signal processing problem, then it is essential that there are corresponding theorems [2] for the two transforms which enable the input data sequences to be similarly related to their respective transforms.

Suppose firstly that the sequences {x[n]} and {X^(F)[k]} are related via the expression

$$\mathrm{DFT}\left(\{x[n]\}\right) = \left\{X^{(F)}[k]\right\}, \tag{3.23}$$

so that {x[n]} is the input data sequence to the DFT and {X^(F)[k]} the corresponding transform-space output, thus belonging to Fourier space, and that {x[n]} and {X^(H)[k]} are similarly related via the expression


Description:

The real and imaginary components of the real-data N-point DFT outputs are optimally stored in the following way:

XRdata[0]      – zeroth frequency output
XRdata[1]      – real component of 1st frequency output
XRdata[N–1]    – imaginary component of 1st frequency output
XRdata[2]      – real component of 2nd frequency output
XRdata[N–2]    – imaginary component of 2nd frequency output
  ...
XRdata[N/2–1]  – real component of (N/2–1)th frequency output
XRdata[N/2+1]  – imaginary component of (N/2–1)th frequency output
XRdata[N/2]    – (N/2)th frequency output

Note: The components XRdata[0] and XRdata[N/2] do not need to be modified to yield the zeroth and (N/2)th frequency outputs.

Pseudo-Code for DHT-to-DFT Conversion:

k = N – 1;
for (j = 1; j < (N/2); j = j + 1)
{
  store     = XRdata[k] + XRdata[j];
  XRdata[k] = XRdata[k] – XRdata[j];
  XRdata[j] = store;
  XRdata[j] = XRdata[j] / 2;
  XRdata[k] = XRdata[k] / 2;
  k = k – 1;
}

Fig. 3.1 Pseudo-code for computing real-data DFT from DHT outputs

$$\mathrm{DHT}\left(\{x[n]\}\right) = \left\{X^{(H)}[k]\right\}, \tag{3.24}$$

so that {x[n]} is now the input data sequence to the DHT and {X^(H)[k]} the corresponding transform-space output, thus belonging to Hartley space. Then using the normalized definition of the DHT as given by Equation 3.1 – with a similar scaling strategy assumed for the definition of the DFT, as given by Equation 1.1, and its inverse – the following commonly encountered theorems may be derived, each one carrying over from one transform space to the other. Note that the data sequence is assumed, in each case, to be of length N.

3.5.1 Reversal Theorem

The DFT-based relationship is given by

$$\mathrm{DFT}\left(\{x[N-n]\}\right) = \left\{X^{(F)}[N-k]\right\}, \tag{3.25}$$


Description:

The complex-data N-point DFT outputs are optimally stored with array "XRdata" holding the real component of both the input and output data, whilst the array "XIdata" holds the imaginary component of both the input and output data.

Note: The components XRdata[0] and XRdata[N/2] do not need to be modified to yield the zeroth and (N/2)th frequency outputs.

Pseudo-Code for DHT-to-DFT Conversion:

k = N – 1;
for (j = 1; j < (N/2); j = j + 1)
{
  // Real Data Channel.
  store     = XRdata[k] + XRdata[j];
  XRdata[k] = XRdata[k] – XRdata[j];
  XRdata[j] = store;
  XRdata[j] = XRdata[j] / 2;
  XRdata[k] = XRdata[k] / 2;
  // Imaginary Data Channel.
  store     = XIdata[k] + XIdata[j];
  XIdata[k] = XIdata[k] – XIdata[j];
  XIdata[j] = store;
  XIdata[j] = XIdata[j] / 2;
  XIdata[k] = XIdata[k] / 2;
  // Combine Outputs for Real and Imaginary Data Channels.
  store1    = XRdata[j] + XIdata[k];
  store2    = XRdata[j] – XIdata[k];
  store3    = XIdata[j] + XRdata[k];
  XIdata[k] = XIdata[j] – XRdata[k];
  XRdata[j] = store2;
  XRdata[k] = store1;
  XIdata[j] = store3;
  k = k – 1;
}

Fig. 3.2 Pseudo-code for computing complex-data DFT from DHT outputs

with the corresponding DHT-based relationship given by

$$\mathrm{DHT}\left(\{x[N-n]\}\right) = \left\{X^{(H)}[N-k]\right\}. \tag{3.26}$$

3.5.2 Addition Theorem

The DFT-based relationship is given by

$$\mathrm{DFT}\left(\{x_1[n] + x_2[n]\}\right) = \mathrm{DFT}\left(\{x_1[n]\}\right) + \mathrm{DFT}\left(\{x_2[n]\}\right) = \left\{X_1^{(F)}[k]\right\} + \left\{X_2^{(F)}[k]\right\}, \tag{3.27}$$


with the corresponding DHT-based relationship given by

$$\mathrm{DHT}\left(\{x_1[n] + x_2[n]\}\right) = \mathrm{DHT}\left(\{x_1[n]\}\right) + \mathrm{DHT}\left(\{x_2[n]\}\right) = \left\{X_1^{(H)}[k]\right\} + \left\{X_2^{(H)}[k]\right\}. \tag{3.28}$$

3.5.3 Shift Theorem

The DFT-based relationship is given by

$$\mathrm{DFT}\left(\{x[n-n_0]\}\right) = \left\{e^{-i2\pi n_0 k/N}\,X^{(F)}[k]\right\}, \tag{3.29}$$

with the corresponding DHT-based relationship given by

$$\mathrm{DHT}\left(\{x[n-n_0]\}\right) = \left\{\cos(2\pi n_0 k/N)\,X^{(H)}[k]\right\} + \left\{\sin(2\pi n_0 k/N)\,X^{(H)}[N-k]\right\}. \tag{3.30}$$

3.5.4 Convolution Theorem

Denoting the operation of circular or cyclic convolution by means of the symbol "⊛", the DFT-based relationship is given by

$$\mathrm{DFT}\left(\{x_1[n]\} \circledast \{x_2[n]\}\right) = \left\{X_1^{(F)}[k]\,.\,X_2^{(F)}[k]\right\}, \tag{3.31}$$

with the corresponding DHT-based relationship given by

$$\mathrm{DHT}\left(\{x_1[n]\} \circledast \{x_2[n]\}\right) = \left\{\tfrac{1}{2}\left(X_1^{(H)}[k]\,X_2^{(H)}[k] - X_1^{(H)}[N-k]\,X_2^{(H)}[N-k] + X_1^{(H)}[k]\,X_2^{(H)}[N-k] + X_1^{(H)}[N-k]\,X_2^{(H)}[k]\right)\right\}. \tag{3.32}$$


3.5.5 Product Theorem

The DFT-based relationship is given by

$$\mathrm{DFT}\left(\{x_1[n]\,.\,x_2[n]\}\right) = \left\{X_1^{(F)}[k]\right\} \circledast \left\{X_2^{(F)}[k]\right\}, \tag{3.33}$$

with the corresponding DHT-based relationship given by

$$\mathrm{DHT}\left(\{x_1[n]\,.\,x_2[n]\}\right) = \left\{\tfrac{1}{2}\left(X_1^{(H)}[k] \circledast X_2^{(H)}[k] - X_1^{(H)}[N-k] \circledast X_2^{(H)}[N-k] + X_1^{(H)}[k] \circledast X_2^{(H)}[N-k] + X_1^{(H)}[N-k] \circledast X_2^{(H)}[k]\right)\right\}. \tag{3.34}$$

3.5.6 Autocorrelation Theorem

Denoting the operation of circular or cyclic correlation by means of the symbol "⊗", the DFT-based relationship is given by

$$\mathrm{DFT}\left(\{x[n]\} \otimes \{x[n]\}\right) = \left\{\left|X^{(F)}[k]\right|^2\right\}, \tag{3.35}$$

with the corresponding DHT-based relationship given by

$$\mathrm{DHT}\left(\{x[n]\} \otimes \{x[n]\}\right) = \left\{\tfrac{1}{2}\left(\left|X^{(H)}[k]\right|^2 + \left|X^{(H)}[N-k]\right|^2\right)\right\}. \tag{3.36}$$

3.5.7 First Derivative Theorem

The DFT-based relationship is given by

$$\mathrm{DFT}\left(\{x'[n]\}\right) = \left\{i2\pi k\,X^{(F)}[k]\right\}, \tag{3.37}$$

with the corresponding DHT-based relationship given by

$$\mathrm{DHT}\left(\{x'[n]\}\right) = \left\{-2\pi k\,X^{(H)}[N-k]\right\}. \tag{3.38}$$


3.5.8 Second Derivative Theorem

The DFT-based relationship is given by

$$\mathrm{DFT}\left(\{x''[n]\}\right) = \left\{-4\pi^2 k^2\,X^{(F)}[k]\right\}, \tag{3.39}$$

with the corresponding DHT-based relationship given by

$$\mathrm{DHT}\left(\{x''[n]\}\right) = \left\{-4\pi^2 k^2\,X^{(H)}[k]\right\}. \tag{3.40}$$

3.5.9 Summary of Theorems

This section simply highlights the fact that for every fundamental theorem associated with the DFT, there is an analogous theorem for the DHT, which may be applied, in a straightforward fashion, so that the DHT may be used to address the same type of signal processing problems as the DFT, and vice versa. An important example is that of the digital filtering of an effectively infinite-length data sequence with a fixed-length FIR filter, more commonly referred to as continuous convolution, where the associated linear convolution is carried out via the piecewise application of the Circular Convolution Theorem using either the overlap-add or the overlap-save technique [3]. The role of the DHT, in this respect, is much like that of the number-theoretic transforms (NTTs) [7] – as typified by the Fermat number transform (FNT) and the Mersenne number transform (MNT) – which gained considerable popularity back in the 1970s amongst the academic community. These transforms, which are defined over finite or Galois fields [7] via the use of residue number arithmetic [7], exist purely for their ability to satisfy the Circular Convolution Theorem.

An additional and important result, arising from the Product Theorem of Equations 3.33 and 3.34, is that when the real-valued data sequences {x_1[n]} and {x_2[n]} are identical, we obtain Parseval's Theorem [3], as given by the equation

$$\sum_{n=0}^{N-1} |x[n]|^2 = \sum_{k=0}^{N-1} \left|X^{(F)}[k]\right|^2 = \sum_{k=0}^{N-1} \left|X^{(H)}[k]\right|^2, \tag{3.41}$$

which simply states that the energy in the signal is preserved under both the DFT and the DHT (and, in fact, under any discrete unitary or orthogonal transformation), so that the energy measured in the data space is equal to that measured in the transform space. This theorem will be used later in Chapter 8, where it will be invoked to enable a fast radix-4 FHT algorithm to be applied to the fast computation of the real-data DFT whose transform length is a power of two, but not a power of four.
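A short numerical confirmation of Equation 3.41 (assuming NumPy is available, with the DHT computed naively as a matrix product; the helper name is hypothetical):

```python
import numpy as np

def dht(x):
    """Normalized DHT of Equation 3.1, computed naively as a matrix product."""
    N = len(x)
    arg = 2.0 * np.pi * np.outer(np.arange(N), np.arange(N)) / N
    return (np.cos(arg) + np.sin(arg)) @ x / np.sqrt(N)

x = np.random.default_rng(6).standard_normal(32)

energy_data = np.sum(np.abs(x) ** 2)
energy_fourier = np.sum(np.abs(np.fft.fft(x) / np.sqrt(len(x))) ** 2)
energy_hartley = np.sum(np.abs(dht(x)) ** 2)

# Equation 3.41: energy is preserved under both unitary transforms.
assert np.allclose(energy_data, energy_fourier)
assert np.allclose(energy_data, energy_hartley)
```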

Finally, note that whenever theorems involve dual Hartley-space terms in their expression – such as the terms X^(H)[k] and X^(H)[N−k], for example, in the convolution and correlation theorems – it is necessary that care be taken to treat the zero-address and Nyquist-address terms separately, as neither term possesses a dual.

3.6 Fast Solutions to DHT

Knowledge that the DHT is in possession of many of the same properties as the DFT is all very well, but to be of practical significance, it is also necessary that the DHT, like the DFT, possesses fast algorithms for its efficient computation. The first widely published work in this field is thought to be that due to Ronald Bracewell [1, 2], who produced both radix-2 and radix-4 versions of the DIT fixed-radix FHT algorithm. His work in this field was summarized in a short monograph [2] which has formed the inspiration for the work discussed here.

The solutions produced by Bracewell are attractive in that they achieve the desired performance metrics in terms of both arithmetic complexity and memory requirement – that is, compared to a conventional complex-data FFT, they require one half of the arithmetic operations and one half the memory requirement – but suffer from the fact that they need two sizes – and thus two separate designs – of butterfly for efficient fixed-radix formulations. For the radix-4 algorithm, for example, a single-sized butterfly produces four outputs from four inputs, as shown in Fig. 3.3, whilst a double-sized butterfly produces eight outputs from eight inputs, as shown in Fig. 3.4, both of which will be developed in some detail from first principles in the following chapter. This lack of regularity makes an in-place solution somewhat difficult to achieve, necessitating the use of additional memory between the temporal stages, as well as making an efficient mapping onto parallel computing equipment less than straightforward.

Although other algorithmic variations for the efficient solution to the DHT have subsequently appeared [4, 5, 10], they all suffer, to varying extents, in terms of their lack of regularity, so that alternative solutions to the DHT are still sought that possess the regularity associated with the conventional complex-data fixed-radix FFT algorithms but without sacrificing the benefits of the existing FHT algorithms in terms of their reduced arithmetic complexity, reduced memory requirement and optimal latency. Various FHT designs could be studied, including versions of the popular radix-2 and split-radix [4] algorithms, but when transform lengths allow for comparison, the radix-4 FHT is more computationally efficient than the radix-2 FHT, its design more regular than that of the split-radix FHT, and it has the potential for an eightfold speed-up with parallel computing equipment over that achievable via a purely sequential solution, making it a good candidate to pursue for potential hardware implementation. The radix-4 version of the FHT has therefore been selected as the algorithm of choice in this monograph.

Page 51: The Regularized Fast Hartley Transform: Optimal Formulation of Real-Data Fast Fourier Transform for Silicon-Based Implementation in Resource-Constrained Environments

38 3 The Discrete Hartley Transform

[Signal flow graphs not reproduced: a "Zero-address version of single-sized butterfly" and a "Nyquist-address version of single-sized butterfly", each mapping inputs X[0]–X[3] to outputs X[0]–X[3].]

Fig. 3.3 Signal flow graphs for single-sized butterfly for radix-4 FHT algorithm

[Signal flow graph not reproduced: the double-sized butterfly, involving trigonometric coefficients, mapping inputs X[0]–X[7] to outputs X[0]–X[7].]

Fig. 3.4 Signal flow graph for double-sized butterfly for radix-4 FHT algorithm


3.7 Accuracy Considerations

When compared to a full-length FFT solution based upon one of the real-from-complex strategies, as discussed in Section 2.3 of the previous chapter, the FHT approach will involve approximately the same number of arithmetic operations (when the complex arithmetic operations of the FFT are reduced to equivalent real arithmetic operations) in order to obtain each real-data DFT output. The associated numerical errors may be due to both rounding, as introduced via the discarding of the lower order bits from the fixed-point multiplier outputs, and truncation, as introduced via the discarding of the least significant bit from the adder outputs after an overflow has occurred. The underlying characteristics of such errors for the two approaches will also be very similar, however, due to the similarity of their butterfly structures, so that when compared to FFT-based solutions possessing comparable arithmetic complexity the errors will inevitably be very similar [8, 11].

This feature of the FHT will be particularly relevant when dealing with a fixed-point implementation, as is implied with any solution that is to be mapped onto an FPGA or ASIC device, where the combined effects of both truncation errors [9] and rounding errors [9] will need to be properly assessed and catered for through the optimum choice of word length and scaling strategy.

3.8 Discussion

When the DHT is applied to the computation of the DFT, as discussed in Section 3.4, a conversion routine is required to map the transform outputs from Hartley space to Fourier space. For the real-data case, as outlined in Fig. 3.1, the conversion process involves two real additions/subtractions for each DFT output together with two scaling operations, whilst for the complex-data case, as outlined in Fig. 3.2, this increases to four real additions/subtractions for each DFT output together with two scaling operations. All the scaling operations, however, are by a factor of 2, which in fixed-point arithmetic reduces to that of a simple right-shift operation.

Note that if the requirement is to use an FHT algorithm to compute the power spectral density (PSD) [3, 6], which is typically obtained from the squared magnitudes of the DFT outputs, then there is no need for the Hartley-space outputs to be first transformed to Fourier space, as the PSD may be computed directly from the Hartley-space outputs – an examination of Equations 3.12–3.14 should convince one of this. Also, it should be noted that many of the specialized real-data FFT algorithms, apart from their lack of regularity, suffer from the fact that different algorithms are generally required for the fast computation of the DFT and its inverse. Clearly, in applications requiring transform-space processing followed by a return to the data space, as encountered for example with matched filtering, this could prove something of a disadvantage, particularly when compared to the adoption of a bilateral transform, such as the DHT, where the definitions of the transform and its inverse are, up to a scaling factor, identical.
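As a minimal numerical check of this claim, the sketch below uses naive O(N²) transforms (illustrative only) and the standard Hartley-to-Fourier relation to confirm that (H[k]² + H[(N−k) mod N]²)/2 reproduces the squared DFT magnitudes directly from Hartley-space outputs.

```python
import cmath
import math

def dht(x):
    # Naive DHT: H[k] = sum_n x[n]*cas(2*pi*n*k/N), with cas(t) = cos(t) + sin(t).
    N = len(x)
    return [sum(x[n] * (math.cos(2 * math.pi * n * k / N)
                        + math.sin(2 * math.pi * n * k / N))
                for n in range(N)) for k in range(N)]

def dft(x):
    # Naive DFT, for reference only.
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * n * k / N) for n in range(N))
            for k in range(N)]

x = [1.0, 2.0, -0.5, 3.0, 0.0, -1.0, 2.5, 0.5]   # arbitrary real test data
H, F = dht(x), dft(x)
N = len(x)
for k in range(N):
    # PSD obtained directly from the Hartley-space outputs.
    psd_hartley = 0.5 * (H[k] ** 2 + H[(N - k) % N] ** 2)
    assert abs(psd_hartley - abs(F[k]) ** 2) < 1e-9
```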


References

1. R.N. Bracewell, The fast Hartley transform. Proc. IEEE 72(8) (1984)
2. R.N. Bracewell, The Hartley Transform (Oxford University Press, New York, 1986)
3. E.O. Brigham, The Fast Fourier Transform and Its Applications (Prentice Hall, Englewood Cliffs, NJ, 1988)
4. P. Duhamel, Implementations of split-radix FFT algorithms for complex, real and real-symmetric data. IEEE Trans. ASSP 34(2), 285–295 (1986)
5. P. Duhamel, M. Vetterli, Improved Fourier and Hartley transform algorithms: Application to cyclic convolution of real data. IEEE Trans. ASSP 35(6), 818–824 (1987)
6. D.F. Elliott, K. Ramamohan Rao, Fast Transforms: Algorithms, Analyses, Applications (Academic, New York, 1982)
7. J.H. McClellan, C.M. Rader, Number Theory in Digital Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1979)
8. J.B. Nitschke, G.A. Miller, Digital filtering in EEG/ERP analysis: Some technical and empirical comparisons. Behav. Res. Methods Instrum. Comput. 30(1), 54–67 (1998)
9. L.R. Rabiner, B. Gold, Theory and Application of Digital Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1975)
10. H.V. Sorensen, D.L. Jones, C.S. Burrus, M.T. Heideman, On computing the discrete Hartley transform. IEEE Trans. ASSP 33, 1231–1238 (1985)
11. A. Zakhor, A.V. Oppenheim, Quantization errors in the computation of the discrete Hartley transform. IEEE Trans. ASSP 35(11), 1592–1602 (1987)


Chapter 4
Derivation of the Regularized Fast Hartley Transform

Abstract This chapter discusses a new formulation of the FHT, referred to as the regularized FHT, which overcomes the limitations of existing FHT algorithms given the ultimate objective of mapping the solution onto silicon-based parallel computing equipment. A generic version of the double-sized butterfly, the GD-BFLY, is described which dispenses with the need for two sizes – and thus two separate designs – of butterfly as required via conventional fixed-radix formulations. Efficient schemes are also described for the storage, accession and generation of the trigonometric coefficients using suitably defined LUTs. A brief complexity analysis is then given in relation to existing FFT and FHT approaches to both the real-data and complex-data DFT problems. A discussion is finally provided relating to the results obtained in the chapter.

4.1 Introduction

A drawback of conventional FHT algorithms, as highlighted in the previous chapter, lies in the need for two sizes – and thus two separate designs – of butterfly for efficient fixed-radix formulations. For the case of the radix-4 FHT to be discussed here, a single-sized butterfly, producing four outputs from four inputs, is required for both the zero-address and the Nyquist-address iterations of the relevant temporal stages, whilst a double-sized butterfly, producing eight outputs from eight inputs, is required for each of the remaining iterations. We look now at how this lack of regularity might be overcome, bearing in mind the desire, ultimately, to map the resulting algorithmic structure onto suitably defined parallel computing equipment:

Statement of Performance Objective No 1:

The aim is to produce a design for a generic double-sized butterfly for use by a radix-4 version of the FHT which lends itself naturally to parallelization and to mapping onto a regular computational structure for implementation with parallel computing technology.

Note that the attraction of the radix-4 solution, rather than that of the more familiar radix-2 case, is its greater computational efficiency – in terms of both reduced arithmetic complexity and reduced memory access – and the potential for exploiting greater parallelism, at the arithmetic level, via the larger sized butterfly, thereby offering the possibility of achieving a higher computational density when implemented in silicon – to be discussed in Chapter 6.

K. Jones, The Regularized Fast Hartley Transform, Signals and Communication Technology, DOI 10.1007/978-90-481-3917-0_4, © Springer Science+Business Media B.V. 2010

4.2 Derivation of the Conventional Radix-4 Butterfly Equations

The first step towards achieving this goal concerns the derivations of the two different-sized butterflies – the single and the double – as required for efficient implementation of the radix-4 FHT. A DIT version is to be adopted given that the DIT algorithm is known to yield a slightly better signal-to-noise ratio (SNR) than the DIF algorithm when fixed-point processing is used [7, 10]. In fact, the noise variance of the DIF algorithm can be shown to be twice that of the DIT algorithm [10], so that the DIT algorithm offers the possibility of using shorter word lengths and ultimately less silicon for a given level of performance. The data re-ordering, in addition, is assumed to take place prior to the execution of the transform so that the data may be efficiently generated and stored in memory in the required di-bit reversed order directly from the output of the analog-to-digital conversion (ADC) unit at minimal expense.
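The di-bit reversed ordering mentioned above can be sketched as follows; the function name is a hypothetical helper, and a base-4 digit ("di-bit") reversal is assumed for a transform of length N = 4^α.

```python
def dibit_reverse(n: int, alpha: int) -> int:
    # Reverse the alpha base-4 digits (pairs of bits, or "di-bits") of n,
    # for a transform of length N = 4**alpha.
    r = 0
    for _ in range(alpha):
        r = (r << 2) | (n & 3)   # append the lowest di-bit of n to r
        n >>= 2
    return r

# Re-ordering for N = 16 (alpha = 2): input sample n is stored at address
# dibit_reverse(n, 2).
order = [dibit_reverse(n, 2) for n in range(16)]
assert order == [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
```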

Let us first decompose the basic DHT expression as given by Equation 3.1 from the previous chapter – although in this instance without the scaling factor and with output vector X^(H) now replaced simply by X for ease of exposition – into four partial summations, such that

X[k] = Σ_{n=0}^{N/4−1} x[4n]·cas(2π(4n)k/N)
     + Σ_{n=0}^{N/4−1} x[4n+1]·cas(2π(4n+1)k/N)
     + Σ_{n=0}^{N/4−1} x[4n+2]·cas(2π(4n+2)k/N)
     + Σ_{n=0}^{N/4−1} x[4n+3]·cas(2π(4n+3)k/N).   (4.1)

Suppose now that

x1[n] = x[4n],  x2[n] = x[4n+1],  x3[n] = x[4n+2]  &  x4[n] = x[4n+3]   (4.2)

and note from Equation 3.3 of the previous chapter that

cas(2π(4n+r)k/N) = cas(2πnk/(N/4) + 2πrk/N)
                 = cos(2πrk/N)·cas(2πnk/(N/4))
                   + sin(2πrk/N)·cas(−2πnk/(N/4))   (4.3)


and

cas(−2πnk/N) = cas(2πn(N−k)/N).   (4.4)
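The two identities above are easy to confirm numerically; a short illustrative sketch using Python's math module:

```python
import math

def cas(t: float) -> float:
    # cas(t) = cos(t) + sin(t), the Hartley kernel.
    return math.cos(t) + math.sin(t)

# Check the expansion of Equation 4.3 and the reflection property of
# Equation 4.4 at a few arbitrary index combinations.
N = 64
for n in (1, 3, 7):
    for k in (2, 5, 11):
        for r in (1, 2, 3):
            lhs = cas(2 * math.pi * (4 * n + r) * k / N)
            rhs = (math.cos(2 * math.pi * r * k / N) * cas(2 * math.pi * n * k / (N // 4))
                   + math.sin(2 * math.pi * r * k / N) * cas(-2 * math.pi * n * k / (N // 4)))
            assert abs(lhs - rhs) < 1e-9
        assert abs(cas(-2 * math.pi * n * k / N)
                   - cas(2 * math.pi * n * (N - k) / N)) < 1e-9
```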

Then if the partial summations of Equation 4.1 are written as

X1[k] = Σ_{n=0}^{N/4−1} x1[n]·cas(2πnk/(N/4))   (4.5)

X2[k] = Σ_{n=0}^{N/4−1} x2[n]·cas(2πnk/(N/4))   (4.6)

X3[k] = Σ_{n=0}^{N/4−1} x3[n]·cas(2πnk/(N/4))   (4.7)

X4[k] = Σ_{n=0}^{N/4−1} x4[n]·cas(2πnk/(N/4)),   (4.8)

it enables the equation to be re-written as

X[k] = X1[k] + cos(2πk/N)·X2[k] + sin(2πk/N)·X2[N/4−k]
     + cos(4πk/N)·X3[k] + sin(4πk/N)·X3[N/4−k]
     + cos(6πk/N)·X4[k] + sin(6πk/N)·X4[N/4−k],   (4.9)

the first of the double-sized butterfly equations.

Now, by exploiting the properties of Equations 4.3 and 4.4, the remaining double-sized butterfly equations may be written as

X[N/4−k] = X1[N/4−k]
         + sin(2πk/N)·X2[N/4−k] + cos(2πk/N)·X2[k]
         − cos(4πk/N)·X3[N/4−k] + sin(4πk/N)·X3[k]
         − sin(6πk/N)·X4[N/4−k] − cos(6πk/N)·X4[k]   (4.10)

X[k+N/4] = X1[k]
         − sin(2πk/N)·X2[k] + cos(2πk/N)·X2[N/4−k]
         − cos(4πk/N)·X3[k] − sin(4πk/N)·X3[N/4−k]
         + sin(6πk/N)·X4[k] − cos(6πk/N)·X4[N/4−k]   (4.11)

X[N/2−k] = X1[N/4−k]
         − cos(2πk/N)·X2[N/4−k] + sin(2πk/N)·X2[k]
         + cos(4πk/N)·X3[N/4−k] − sin(4πk/N)·X3[k]
         − cos(6πk/N)·X4[N/4−k] + sin(6πk/N)·X4[k]   (4.12)

X[k+N/2] = X1[k]
         − cos(2πk/N)·X2[k] − sin(2πk/N)·X2[N/4−k]
         + cos(4πk/N)·X3[k] + sin(4πk/N)·X3[N/4−k]
         − cos(6πk/N)·X4[k] − sin(6πk/N)·X4[N/4−k]   (4.13)

X[3N/4−k] = X1[N/4−k]
          − sin(2πk/N)·X2[N/4−k] − cos(2πk/N)·X2[k]
          − cos(4πk/N)·X3[N/4−k] + sin(4πk/N)·X3[k]
          + sin(6πk/N)·X4[N/4−k] + cos(6πk/N)·X4[k]   (4.14)

X[k+3N/4] = X1[k]
          + sin(2πk/N)·X2[k] − cos(2πk/N)·X2[N/4−k]
          − cos(4πk/N)·X3[k] − sin(4πk/N)·X3[N/4−k]
          − sin(6πk/N)·X4[k] + cos(6πk/N)·X4[N/4−k]   (4.15)

X[N−k] = X1[N/4−k]
       + cos(2πk/N)·X2[N/4−k] − sin(2πk/N)·X2[k]
       + cos(4πk/N)·X3[N/4−k] − sin(4πk/N)·X3[k]
       + cos(6πk/N)·X4[N/4−k] − sin(6πk/N)·X4[k],   (4.16)

where N/4 is the length of the DHT output sub-sequences, {X1[k]}, {X2[k]}, {X3[k]} and {X4[k]}, and the parameter "k" varies from 1 up to N/8−1.

When k = 0, which corresponds to the zero-address case, we obtain the single-sized butterfly equations

X[0]    = X1[0] + X2[0] + X3[0] + X4[0]   (4.17)

X[N/4]  = X1[0] + X2[0] − X3[0] − X4[0]   (4.18)

X[N/2]  = X1[0] − X2[0] + X3[0] − X4[0]   (4.19)

X[3N/4] = X1[0] − X2[0] − X3[0] + X4[0],   (4.20)

and when k = N/8, which corresponds to the Nyquist-address case, we obtain the single-sized butterfly equations

X[N/8]  = X1[N/8] + √2·X2[N/8] + X3[N/8]   (4.21)

X[3N/8] = X1[N/8] − X3[N/8] + √2·X4[N/8]   (4.22)

X[5N/8] = X1[N/8] − √2·X2[N/8] + X3[N/8]   (4.23)

X[7N/8] = X1[N/8] − X3[N/8] − √2·X4[N/8].   (4.24)

Thus, two different-sized butterflies are required for efficient computation of the DIT version of the radix-4 FHT, their SFGs being as given in Figs. 3.3 and 3.4 of the previous chapter. For the single-sized butterfly equations, the computation of each output involves the addition of at most four terms, whereas for the double-sized butterfly equations, the computation of each output involves the addition of seven terms. The resulting lack of regularity therefore makes an attractive hardware implementation very difficult to achieve without suitable reformulation of the associated equations.
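The double-sized butterfly equations can be verified against a direct DHT computation; the sketch below (naive O(N²) transforms, arbitrary test data, illustrative only) checks Equations 4.9 and 4.13 for N = 16.

```python
import math

def dht(x):
    # Naive DHT: H[k] = sum_n x[n]*cas(2*pi*n*k/N), cas(t) = cos(t) + sin(t).
    N = len(x)
    return [sum(x[n] * (math.cos(2 * math.pi * n * k / N)
                        + math.sin(2 * math.pi * n * k / N))
                for n in range(N)) for k in range(N)]

N = 16
x = [float((7 * n + 3) % 11) for n in range(N)]        # arbitrary real data
X = dht(x)
# Quarter-length sub-transforms of Equations 4.5-4.8.
X1, X2 = dht(x[0::4]), dht(x[1::4])
X3, X4 = dht(x[2::4]), dht(x[3::4])

for k in range(1, N // 8):                              # k = 1 .. N/8 - 1
    c1, s1 = math.cos(2 * math.pi * k / N), math.sin(2 * math.pi * k / N)
    c2, s2 = math.cos(4 * math.pi * k / N), math.sin(4 * math.pi * k / N)
    c3, s3 = math.cos(6 * math.pi * k / N), math.sin(6 * math.pi * k / N)
    km = N // 4 - k
    # Equation 4.9:
    rec = (X1[k] + c1 * X2[k] + s1 * X2[km] + c2 * X3[k] + s2 * X3[km]
           + c3 * X4[k] + s3 * X4[km])
    assert abs(rec - X[k]) < 1e-9
    # Equation 4.13 (output at address k + N/2):
    rec2 = (X1[k] - c1 * X2[k] - s1 * X2[km] + c2 * X3[k] + s2 * X3[km]
            - c3 * X4[k] - s3 * X4[km])
    assert abs(rec2 - X[k + N // 2]) < 1e-9
```

The remaining six equations may be checked in exactly the same fashion by substituting the corresponding sign patterns.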

4.3 Single-to-Double Conversion of the Radix-4 Butterfly Equations

In order to derive a computationally-efficient single-design solution to the radix-4 FHT, it is therefore necessary to regularize the algorithm structure by replacing the single and double sized butterflies with a generic version of the double-sized butterfly. Before this can be achieved, however, it is first necessary to show how the single-sized butterfly equations may be converted to the same form as that of the double-sized butterfly.

When just the zero-address equations need to be carried out, the conversion may be achieved via the interleaving of two sets of four equations, one set involving the consecutive samples {X1[0], X2[0], X3[0], X4[0]}, say, and the other set involving the consecutive samples {Y1[0], Y2[0], Y3[0], Y4[0]}, say. This yields the modified butterfly equations

X[0]    = X1[0] + X2[0] + X3[0] + X4[0]   (4.25)

Y[0]    = Y1[0] + Y2[0] + Y3[0] + Y4[0]   (4.26)

X[N/4]  = X1[0] + X2[0] − X3[0] − X4[0]   (4.27)

Y[N/4]  = Y1[0] + Y2[0] − Y3[0] − Y4[0]   (4.28)

X[N/2]  = X1[0] − X2[0] + X3[0] − X4[0]   (4.29)

Y[N/2]  = Y1[0] − Y2[0] + Y3[0] − Y4[0]   (4.30)

X[3N/4] = X1[0] − X2[0] − X3[0] + X4[0]   (4.31)

Y[3N/4] = Y1[0] − Y2[0] − Y3[0] + Y4[0],   (4.32)

with the associated double-sized butterfly being referred to as the “Type-I” butterfly.


Similarly, when both the zero-address and Nyquist-address equations need to be carried out – which is always true when the Nyquist-address equations are required – they may be combined in the same fashion to yield the butterfly equations

X[0]    = X1[0] + X2[0] + X3[0] + X4[0]   (4.33)

X[N/8]  = X1[N/8] + √2·X2[N/8] + X3[N/8]   (4.34)

X[N/4]  = X1[0] + X2[0] − X3[0] − X4[0]   (4.35)

X[3N/8] = X1[N/8] − X3[N/8] + √2·X4[N/8]   (4.36)

X[N/2]  = X1[0] − X2[0] + X3[0] − X4[0]   (4.37)

X[5N/8] = X1[N/8] − √2·X2[N/8] + X3[N/8]   (4.38)

X[3N/4] = X1[0] − X2[0] − X3[0] + X4[0]   (4.39)

X[7N/8] = X1[N/8] − X3[N/8] − √2·X4[N/8],   (4.40)

with the associated double-sized butterfly being referred to as the "Type-II" butterfly. With the indexing assumed to start from zero, rather than one, the even-indexed equations thus correspond to the zero-address butterfly and the odd-indexed equations to the Nyquist-address butterfly.

Thus, the sets of single-sized butterfly equations may be reformulated in such a way that the resulting composite butterflies now accept eight inputs and produce eight outputs, the same as the standard radix-4 double-sized butterfly, referred to as the "Type-III" butterfly. The result is that the radix-4 FHT, instead of requiring both single and double sized butterflies, may now be carried out with three simple variations of the double-sized butterfly.

4.4 Radix-4 Factorization of the FHT

A radix-4 factorization of the FHT may be obtained in a straightforward fashion in terms of the double-sized butterfly equations through application of the familiar divide-and-conquer principle [6], as used in the derivation of other fast discrete unitary and orthogonal transforms [4], such as the FFT. This factorization leads to the algorithm described by the pseudo-code of Fig. 4.1, where all instructions within the scope of the outermost "for" loop constitute a single iteration in the temporal domain and all instructions within the scope of the innermost "for" loop constitute a single iteration in the spatial domain. Thus, each iteration in the temporal domain, more commonly referred to as a "stage", comprises N/8 iterations in the spatial domain, where each iteration corresponds to the execution of a single set of double-sized butterfly equations.


Fig. 4.1 Pseudo-code for radix-4 factorization of FHT algorithm:

// Set up transform length.
N = 4^α;
// Di-bit reverse input data addresses.
X_N^(in) = P_Φ0 · x_N;
// Loop through α = log4(N) temporal stages.
offset = 1;
for (i = 0; i < α; i = i+1)
{
    M = 8 × offset;
    // Loop through N/8 spatial iterations.
    for (j = 0; j < N; j = j+M)
    {
        for (k = 0; k < offset; k = k+1)
        {
            // Carry out radix-4 double butterfly equations.
            // Double Butterfly Routine: computes 8 outputs from 8 inputs.
            X_N^(out) = f(X_N^(in), C_{n,k}^M, S_{n,k}^M)   (n = 0, 1, 2, 3)
        }
    }
    offset = 2^(2(i+1));
}

The implication of the above definitions is that for the processing of a single data set a given stage may only be executed after its predecessor and before its successor, whereas every iteration of a given stage may in theory be executed simultaneously. Thus, each stage is time dependent and may only be executed sequentially, whereas if the data is available then the iterations within each stage may be executed in parallel.

Note from the pseudo-code of Fig. 4.1 that "Φ0" is the bijective mapping or permutation – with "P_Φ0" the associated permutation matrix – corresponding to the di-bit reversal mapping of the FHT input data addresses, whilst the double-sized butterfly section referred to in the pseudo-code makes use of cosinusoidal and sinusoidal terms, as given by

C_{n,k}^M = cos(2πnk/M),  n = 0, 1, 2, 3   (4.41)

and

S_{n,k}^M = sin(2πnk/M),  n = 0, 1, 2, 3,   (4.42)

respectively, the trigonometric coefficients defined in Chapter 1, which are each a function of the indices of the innermost and outermost loops.

For the FHT factorization described here, the double-sized butterfly routine referred to in the pseudo-code implements either the Type-I butterfly of Equations 4.25–4.32, the Type-II butterfly of Equations 4.33–4.40, or the Type-III butterfly of Equations 4.9–4.16. As a result, the FHT appears to require a different SFG for each "Type" of double butterfly and so appears to lack at this stage the regularity necessary for an efficient mapping onto a single regular computational structure, as will be required for an efficient hardware implementation with parallel computing equipment.

4.5 Closed-Form Expression for Generic Radix-4 Double Butterfly

The first step towards addressing this problem is to reformulate the double-sized butterfly equations so that they may be expressed in a recursive closed-form fashion, as once this is achieved it will then be a simple task to show how the same SFG can be used to describe the operation of each of the Type-I, Type-II and Type-III double-sized butterflies. This first step is achieved through the introduction of the address permutations "Φ1", "Φ2", "Φ3" and "Φ4", as defined in Table 4.1, and through the introduction of arithmetic redundancy into the processing via the use of the trigonometric coefficients "E_{n,k}^M" (used in forming the even-addressed outputs) and "O_{n,k}^M" (used in forming the odd-addressed outputs), as defined in Table 4.2, where the cosinusoidal and sinusoidal terms referred to in the table, "C_{n,k}^M" and "S_{n,k}^M", are as given by Equations 4.41 and 4.42, respectively.

Table 4.1 Address permutations for generic double butterfly

  Input address           0  1  2  3  4  5  6  7
  Φ1: Type = I, II        0  1  2  6  4  5  3  7
  Φ1: Type = III          0  1  2  3  4  5  6  7
  Φ2: Type = I, II        0  4  3  2  1  5  6  7
  Φ2: Type = III          0  4  2  6  1  5  3  7
  Φ3: Type = I, II        0  4  1  5  2  6  3  7
  Φ3: Type = III          0  4  1  3  2  6  7  5
  Φ4: Type = I, II, III   0  4  1  5  6  2  3  7

Table 4.2 Trigonometric coefficients for generic double butterfly

  Index m                  0    1    2          3          4          5          6          7
  E_{m,k}^M: Type = I      1    0    1          0          1          0          1          0
  E_{m,k}^M: Type = II     1    0    1          0          1          0          1/√2       1/√2
  E_{m,k}^M: Type = III    1    0    C_{1,k}^M  S_{1,k}^M  C_{2,k}^M  S_{2,k}^M  C_{3,k}^M  S_{3,k}^M
  O_{m,k}^M: Type = I      0   −1    0         −1          0         −1          0         −1
  O_{m,k}^M: Type = II     0   −1    0         −1          0         −1          1/√2       1/√2
  O_{m,k}^M: Type = III    0   −1    S_{1,k}^M  C_{1,k}^M  S_{2,k}^M  C_{2,k}^M  S_{3,k}^M  C_{3,k}^M

Through the use of such operators and terms, it can be shown how the same set of arithmetic operations may be carried out upon the input data set for every instance of the double-sized butterfly, despite the fact that for certain of the Type-I and Type-II cases the values of the set of trigonometric coefficients suggest that the multiplications are trivial and thus avoidable – that is, that one or more of the trigonometric coefficients belong to the set {−1, 0, +1}.

The even-valued and odd-valued indices for the addressing of the input data to the double-sized butterfly are both arithmetic sequences and are consequently generated very simply via the pseudo-code of Fig. 4.2, with the associated double-sized butterfly – referred to hereafter as the generic double butterfly or "GD-BFLY" – being expressed via the pseudo-code of Fig. 4.3.

Fig. 4.2 Pseudo-code for generation of data indices and address permutations:

if (i == 0)
{
    // Set up 1st even and odd data indices for Type-I double butterfly.
    twice_offset = offset & index_even[0] = j & index_odd[0] = j + 4;
    // Set up address permutations for Type-I double butterfly.
    Φn = Φn^(I,II),  n = 1, 2, 3, 4;
}
else
{
    twice_offset = 2 × offset;
    if (k == 0)
    {
        // Set up 1st even and odd data indices for Type-II double butterfly.
        index_even[0] = j & index_odd[0] = j + offset;
        // Set up address permutations for Type-II double butterfly.
        Φn = Φn^(I,II),  n = 1, 2, 3, 4;
    }
    else
    {
        // Set up 1st even and odd data indices for Type-III double butterfly.
        index_even[0] = j + k & index_odd[0] = j + twice_offset − k;
        // Set up address permutations for Type-III double butterfly.
        Φn = Φn^(III),  n = 1, 2, 3, 4;
    }
}
// Set up remaining even and odd data indices for double butterfly.
for (n = 1; n < 4; n = n+1)
{
    index_even[n] = index_even[n−1] + twice_offset;
    index_odd[n] = index_odd[n−1] + twice_offset;
}

Fig. 4.3 Pseudo-code for carrying out generic double butterfly:

// Set up input data vector.
for (n = 0; n < 4; n = n+1)
{
    X[2n] = X^(in)[index_even[n]] & X[2n+1] = X^(in)[index_odd[n]];
}
// Apply 1st address permutation.
Y = P_Φ1^T · X;
// Apply trigonometric coefficients and 1st set of additions/subtractions.
for (n = 1; n < 4; n = n+1)
{
    store = E_{2n,k}^M × Y[2n] + E_{2n+1,k}^M × Y[2n+1];
    Y[2n+1] = O_{2n,k}^M × Y[2n] − O_{2n+1,k}^M × Y[2n+1];
    Y[2n] = store;
}
// Apply 2nd address permutation.
X = P_Φ2^T · Y;
// Apply 2nd set of additions/subtractions.
for (n = 0; n < 4; n = n+1)
{
    store = X[2n] + X[2n+1] & X[2n+1] = X[2n] − X[2n+1] & X[2n] = store;
}
// Apply 3rd address permutation.
Y = P_Φ3^T · X;
// Apply 3rd set of additions/subtractions.
for (n = 0; n < 4; n = n+1)
{
    store = Y[2n] + Y[2n+1] & Y[2n+1] = Y[2n] − Y[2n+1] & Y[2n] = store;
}
// Apply 4th address permutation.
X = P_Φ4^T · Y;
// Set up output data vector.
for (n = 0; n < 4; n = n+1)
{
    X^(out)[index_even[n]] = X[2n] & X^(out)[index_odd[n]] = X[2n+1];
}

The address permutations are dependent only upon the "Type" of GD-BFLY being executed, with just two slightly different versions being required for each of the first three permutations, and only one for the last permutation. The two versions of Φ1 differ in just two (of the eight possible) exchanges, whilst the two versions of Φ2 and Φ3 each differ in just three (of the eight possible) exchanges, as evidenced from the contents of Table 4.1. The trigonometric coefficients, which as stated above include the trivial constants belonging to the set {−1, 0, +1}, are also dependent upon the value of the parameter "k" corresponding to the innermost loop of the pseudo-code of Fig. 4.1.

An elegant and informative way of representing the four permutation mappings may be achieved by noting from the group-theoretic properties of the symmetric group [1] – which for order N is the set of all permutations of N objects – that any permutation can be expressed as a product of cyclic permutations [1] and that each such permutation can also be simply expressed as a product of transpositions [1]. As shorthand for describing a permutation, a cyclic notation is first introduced in order to describe how the factorization of a given permutation is achieved. With this notation, each element within parentheses is replaced by the element to its right, with the last element being replaced by the first element in the set; any element that replaces itself is omitted. Thus, the two versions of Φ1 may be expressed as

Φ1 = (3, 6)   (4.43)

and

Φ1 = (·),   (4.44)

the second version being the length eight identity mapping, the two versions of Φ2 as

Φ2 = (1, 4)(2, 3) = (1, 4)(3, 2)   (4.45)

and

Φ2 = (1, 4)(3, 6),   (4.46)

the two versions of Φ3 as

Φ3 = (1, 4, 2)(3, 5, 6) = (2, 1, 4)(5, 6, 3)
   = (2, 1)(2, 4)(5, 6)(5, 3)   (4.47)

and

Φ3 = (1, 4, 2)(5, 6, 7) = (2, 1, 4)(5, 6, 7)
   = (2, 1)(2, 4)(5, 6)(5, 7),   (4.48)

and finally the single version of Φ4 as

Φ4 = (1, 4, 6, 3, 5, 2) = (3, 5, 2, 1, 4, 6)
   = (3, 5)(3, 2)(3, 1)(3, 4)(3, 6).   (4.49)

From these compact representations – which are equivalent to those given in tabular form in Table 4.1 – both the commonalities and the differences between the two versions of each permutation are straightforwardly visualized, with each pair being distinguished by means of a single transposition whilst the common component (whether in terms of cyclic permutations or transpositions) is fixed and thus amenable to hard-wiring. The ordering of the transpositions has been adjusted in the above expressions so as to minimize the associated communication lengths involved in the exchanges. For Φ1 the first version involves the application of a single transposition, involving addresses "3" and "6", whilst for Φ2 the two versions differ only in the final transposition involving the exchange of address "3" with either address "2" or address "6", and for Φ3 they differ only in terms of the final transposition involving the exchange of new address "5" (original address "6") with either address "3" or address "7".
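The cyclic notation is easy to check mechanically; the helper below (an illustrative sketch) applies a list of cycles under the stated convention and reproduces the corresponding rows of Table 4.1.

```python
def apply_cycles(cycles, n=8):
    # Build the permutation table: each element in a cycle is replaced by the
    # element to its right, the last by the first (the convention stated above).
    p = list(range(n))
    for cyc in cycles:
        for i, a in enumerate(cyc):
            p[a] = cyc[(i + 1) % len(cyc)]
    return p

# Phi_3 (Type I, II) = (1,4,2)(3,5,6)  ->  row of Table 4.1.
assert apply_cycles([(1, 4, 2), (3, 5, 6)]) == [0, 4, 1, 5, 2, 6, 3, 7]
# Phi_4 (all Types) = (1,4,6,3,5,2).
assert apply_cycles([(1, 4, 6, 3, 5, 2)]) == [0, 4, 1, 5, 6, 2, 3, 7]
```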

Notice that the combined effect of the first four trigonometric coefficients – corresponding to indices m = 0 and m = 1 in Table 4.2 – for every instance of the GD-BFLY is simply for the first two inputs to the GD-BFLY to pass directly through to the second permutation. The first four multiplications and the associated pair of additions may therefore simply be removed from the SFG of Fig. 3.4 shown in the previous chapter, to yield the SFG shown below in Fig. 4.4, this being obtained at the cost of slightly reduced regularity, at the arithmetic level, within the GD-BFLY. This results in the need for just 12 real multiplications for the GD-BFLY, rather than 16, whose trigonometric coefficient multiplicands may be obtained, through symmetry relations, from just six stored trigonometric coefficients: two each – both cosinusoidal and sinusoidal – for the single-angle, double-angle and triple-angle cases. Also, the number of additions required prior to the second permutation reduces from eight to just six.

Thus, the three "Types" of GD-BFLY each map efficiently onto the same regular computational structure, this structure being represented by a SFG consisting of three stages of additive recursion, the first being preceded by a point-wise multiplication stage involving the trigonometric coefficients.

[Figure: the input data vector passes through address permutation Φ1, the trigonometric-coefficient multipliers, and three stages of adders separated by the address permutations Φ2, Φ3 and Φ4, before emerging as the output data vector.]

Fig. 4.4 Signal flow graph for twelve-multiplier version of generic double butterfly

Denoting the input and output data vectors by X^(in) and X^(out), respectively, the operation of the GD-BFLY may thus be represented in a closed-form fashion by means of a multi-stage recursion, as given by the expression

X^(out) = P_Φ4^T·(A3·(P_Φ3^T·(A2·(P_Φ2^T·(A1·(M1·(P_Φ1^T·X^(in))))))))   (4.50)

where "P_Φ1", "P_Φ2", "P_Φ3" and "P_Φ4" are the butterfly-dependent permutation matrices [1] associated with the address permutations "Φ1", "Φ2", "Φ3" and "Φ4", respectively. Being orthogonal [1], whereby P_Φ·P_Φ^T = I8 – the matrix version of the length eight identity mapping – they may each be applied to either side of an equation, such that

Y = P_Φ^T·X  ⇔  P_Φ·Y = X,   (4.51)

where the superscript "T" denotes the transpose operator. The composite matrix "A1·M1" is a butterfly-dependent 2 × 2 block diagonal matrix [1] containing the trigonometric coefficients (as defined from the contents of Table 4.2, with the first two terms fixed and equal to one), whilst "A2" and "A3" are fixed addition blocks, also expressed as 2 × 2 block diagonal matrices, such that

A1·M1 = diag( [ +1   0 ]   [ +E2  +E3 ]   [ +E4  +E5 ]   [ +E6  +E7 ]
              [  0  +1 ],  [ +O2  −O3 ],  [ +O4  −O5 ],  [ +O6  −O7 ] )   (4.52)

and

A2 = A3 = diag( [ +1  +1 ]   [ +1  +1 ]   [ +1  +1 ]   [ +1  +1 ]
                [ +1  −1 ],  [ +1  −1 ],  [ +1  −1 ],  [ +1  −1 ] ),   (4.53)

where En and On are shorthand for the Type-dependent coefficients E_{n,k}^M and O_{n,k}^M of Table 4.2.
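The orthogonality property underpinning Equation 4.51 can be confirmed directly for, say, Φ4 of Table 4.1; a minimal sketch using plain lists in place of a matrix library:

```python
# Permutation Phi_4 from Table 4.1 (the single version, common to all Types).
phi4 = [0, 4, 1, 5, 6, 2, 3, 7]

# Build the 8x8 permutation matrix P: row i has a single one in column phi4[i].
P = [[1 if phi4[i] == j else 0 for j in range(8)] for i in range(8)]

# Compute P * P^T and compare against the length-eight identity matrix I8.
PPt = [[sum(P[i][k] * P[j][k] for k in range(8)) for j in range(8)]
       for i in range(8)]
I8 = [[1 if i == j else 0 for j in range(8)] for i in range(8)]
assert PPt == I8   # orthogonality: P * P^T = I8, as used in Equation 4.51
```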

Note that as long as each data set for the GD-BFLY is accompanied by an appropriately set "Type" flag – indicating whether the current instance of the GD-BFLY is of Type-I, Type-II or Type-III – then the correct versions of the first three permutators may be applied for any given instance of the GD-BFLY. The reformulated equations, which were obtained through the introduction of arithmetic redundancy into the processing, thus correspond to a double butterfly which overcomes, in an elegant fashion, the loss of regularity associated with more conventional fixed-radix formulations of the FHT. The resulting radix-4 algorithm is henceforth referred to as the regularized FHT or "R24 FHT" [5], where the "R24" part of the expression is short for "Regularized Radix-4".

4.5.1 Twelve-Multiplier Version of Generic Double Butterfly

As evidenced from the SFG of Fig. 4.4, the GD-BFLY described above requires a total of 12 real multiplications and 22 real additions, whilst the effect of the permutators for a parallel solution is to reduce the communication topology to that of nearest neighbour for input to both the adders and the multipliers, with the data entering/leaving the arithmetic components in consecutive pairs. The only change to the operation of the GD-BFLY, from one instance to another, is in terms of the definitions of the first three address permutations, with one of two slightly different versions being appropriately selected for each such permutation according to the particular "Type" of the GD-BFLY being executed – see the permutation definitions of Table 4.1.

As a consequence, each instance of the twelve-multiplier version of the GD-BFLY may be carried out using precisely the same components and represented by means of precisely the same SFG.

4.5.2 Nine-Multiplier Version of Generic Double Butterfly

A lower-complexity version of the above GD-BFLY may be achieved by noting that each block of four multipliers and its associated two adders corresponds to the solution of a pair of bilinear forms [9], which can be optimally solved, in terms of multiplications, with just three multipliers – see the corresponding section of the SFG for the standard Type-III GD-BFLY in Fig. 4.5. This complexity reduction is achieved at the expense of three extra adders for the GD-BFLY and six extra adders for the generation of the trigonometric coefficients. The complete SFG for the resulting reduced-complexity solution is as shown in Fig. 4.6, from which it can be seen that the GD-BFLY now requires a total of nine real multiplications and 25 real additions.

As with the twelve-multiplier version, there are minor changes to the operation of the GD-BFLY, from one instance to another, in terms of the definitions of the first three address permutations, with one of two slightly different versions being appropriately selected for each such permutation according to the particular "Type" of the GD-BFLY being executed – see the permutation definitions of Table 4.1. Additional but minor changes are also required, however, to the operation of the stage of adders directly following the multipliers and to the ordering of the outputs from the resulting operations.


[Figure: the multiplication–addition block of the standard Type-III double butterfly, in which the inputs a and b are combined, via three multipliers and two adders, to produce the rotated outputs ã and b̃ using the multiplicative constants c1 = cos θ + sin θ, c2 = cos θ and c3 = cos θ − sin θ.]

Fig. 4.5 Reduced-complexity arithmetic block for set of bilinear forms

[Figure: the input data vector passes through a sequence of four address permutations interleaved with stages of adders and multipliers – the latter fed by the trigonometric coefficients Φ1, Φ2, Φ3 and Φ4 – before emerging as the output data vector.]

Fig. 4.6 Signal flow graph for nine-multiplier version of generic double butterfly

For the first of the three sets of three multipliers, if the GD-BFLY is of Type-I or Type-II then each of the two adders performs addition on its two inputs and the ordering of the two outputs is the same as that of the two inputs, whilst if the GD-BFLY is of Type-III then each of the two adders performs subtraction on its two inputs and the ordering of the two outputs is the reverse of that of the two inputs.


Similarly, for the second of the three sets of three multipliers, if the GD-BFLY is of Type-I or Type-II then each of the two adders performs addition on its two inputs and the ordering of the two outputs is the same as that of the two inputs, whilst if the GD-BFLY is of Type-III then each of the two adders performs subtraction on its two inputs and the ordering of the two outputs is the reverse of that of the two inputs. Finally, for the last of the three sets of three multipliers, if the GD-BFLY is of Type-I then each of the two adders performs addition on its two inputs and the ordering of the two outputs is the same as that of the two inputs, whilst if the GD-BFLY is of Type-II or Type-III then each of the two adders performs subtraction on its two inputs and the ordering of the two outputs is the reverse of that of the two inputs. Note that the reversal of each pair of outputs is straightforwardly achieved, as shown in Fig. 4.6, by means of a simple switch.

As a consequence, each instance of the nine-multiplier version of the GD-BFLY may be carried out using precisely the same components and represented by means of precisely the same SFG.
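The type-dependent control of the post-multiplier adder stage described above can be captured in a small table. The following sketch (function name and labels hypothetical) simply encodes, for each of the three adder pairs, whether it adds or subtracts and whether its output pair is emitted in reversed order:

```python
def adder_controls(bfly_type):
    """For each of the three post-multiplier adder pairs, return the
    operation performed ('add' or 'sub') and whether the output pair is
    emitted in reversed order, per the rules given in the text."""
    table = {
        # Sets 1 and 2 subtract (with reversal) only for Type-III;
        # set 3 subtracts (with reversal) for both Type-II and Type-III.
        "I":   [("add", False), ("add", False), ("add", False)],
        "II":  [("add", False), ("add", False), ("sub", True)],
        "III": [("sub", True),  ("sub", True),  ("sub", True)],
    }
    return table[bfly_type]
```

Because only these control bits differ between butterfly types, the same arithmetic components – and hence the same SFG – serve every instance.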

4.6 Trigonometric Coefficient Storage, Accession and Generation

An efficient implementation of the R24 FHT invariably requires an efficient mechanism for the storage and accession of the trigonometric coefficients required for feeding into each instance of the GD-BFLY. The requirement, more exactly, is that six non-trivial coefficients be either accessed from the CM or suitably generated on-the-fly in order to be able to carry out the necessary processing for any given data set. Referring to the definitions for the non-trivial cosinusoidal and sinusoidal terms, as given by Equations 4.41 and 4.42, respectively, if we put β = N/M, where the parameters "M" and "N" are as defined in the pseudo-code of Fig. 4.1, then

C^M_{n,k} = cos(2πnkβ/N) = C^N_{n,kβ}   for n = 1, 2, 3   (4.54)

and

S^M_{n,k} = sin(2πnkβ/N) = S^N_{n,kβ},   for n = 1, 2, 3,   (4.55)

enabling the terms to be straightforwardly addressed from suitably constructed LUTs via the parameters "n", "k" and "β".

The total size requirement of the LUT can be minimized by exploiting the relationship between the cosinusoidal and sinusoidal functions, as given by the expression

cos(x) = sin(x + π/2),   (4.56)

as well as the periodic nature of each, as given by the expressions

sin(x + 2π) = sin(x)   (4.57)


and

sin(x + π) = −sin(x).   (4.58)

Two schemes are now outlined which enable a simple trade-off to be made between memory size and addressing complexity – as measured in terms of the number of arithmetic/logic operations required for computing the necessary addresses. These particular schemes will be later exploited, in Chapter 6, by the conflict-free and (for the data) in-place parallel memory addressing schemes developed for the efficient parallel computation of the R24 FHT.

4.6.1 Minimum-Arithmetic Addressing Scheme

As already stated, the trigonometric coefficient set comprises both cosinusoidal and sinusoidal terms for single-angle, double-angle and triple-angle cases. To minimize the arithmetic/logic requirement for the generation of the addresses, the LUT may be sized according to a single-quadrant addressing scheme, whereby the trigonometric coefficients are read from a sampled version of the sinusoidal function with argument defined from 0 up to π/2 radians. Thus, for the case of an N-point R24 FHT, it is required that the LUT be of size N/4 words, yielding a total CM requirement, denoted CA^opt_MEM, of

CA^opt_MEM = N/4   (4.59)

words.

This scheme would seem to offer, therefore, a reasonable compromise between the CM requirement and the addressing complexity, using more than the theoretical minimum amount of memory required for the storage of the trigonometric coefficients so as to keep the arithmetic/logic requirement of the addressing as simple as possible.
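The single-quadrant scheme can be sketched as follows, with a quarter-wave sine LUT of N/4 words addressed through the symmetries of Equations 4.56–4.58. The transform length N = 64 and the function names are purely illustrative:

```python
import math

N = 64                       # transform length (illustrative)
QUAD = N // 4                # single-quadrant LUT size: N/4 words
LUT = [math.sin(2 * math.pi * m / N) for m in range(QUAD)]

def lut_sin(m):
    """sin(2*pi*m/N) read from the quarter-wave LUT."""
    m %= N                           # sin(x + 2*pi) = sin(x), Eq. 4.57
    if m >= N // 2:
        return -lut_sin(m - N // 2)  # sin(x + pi) = -sin(x), Eq. 4.58
    if m > QUAD:
        m = N // 2 - m               # fold second quadrant onto first
    return 1.0 if m == QUAD else LUT[m]

def lut_cos(m):
    """cos(x) = sin(x + pi/2), Eq. 4.56."""
    return lut_sin(m + QUAD)
```

The address computation reduces to a comparison, a subtraction and a sign flip, which is why the scheme keeps the arithmetic/logic requirement simple at the price of an N/4-word CM.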

4.6.2 Minimum-Memory Addressing Scheme

Another approach to this problem is to adopt a two-level LUT, this comprising one coarse-resolution region of N/4L words for the sinusoidal function, covering 0 up to π/2 radians, and one fine-resolution region of L words for each of the cosinusoidal and sinusoidal functions, covering 0 up to π/2L radians. The required trigonometric coefficients may then be obtained from the contents of the two-level LUT through the application of one or other of the standard trigonometric identities

cos(θ + φ) = cos(θ)·cos(φ) − sin(θ)·sin(φ)   (4.60)

and

sin(θ + φ) = sin(θ)·cos(φ) + cos(θ)·sin(φ),   (4.61)


where "θ" corresponds to the angle defined over the coarse-resolution region and "φ" to the angle defined over the fine-resolution region.

By expressing the combined size of the two-level LUT for the sinusoidal function as

f(L) = N/4L + L   (4.62)

words, it can be seen that the optimum LUT size is obtained when

df/dL = 1 − N/4L²   (4.63)

is set to zero, giving L = √N/2 and resulting in a total CM requirement, denoted CM^opt_MEM, of

CM^opt_MEM = (3/2)√N   (4.64)

words – √N/2 for the coarse-resolution region and √N/2 for each of the two fine-resolution regions.

This scheme therefore yields the theoretical minimum memory requirement for the storage of the trigonometric coefficients at the expense of an increased arithmetic/logic requirement for the associated addressing. The two-level LUT will actually be regarded hereafter as consisting of three separate complementary-angle LUTs, each of size √N/2 words, rather than as a single LUT, as all three may need to be accessed simultaneously if an efficient parallel solution to the R24 FHT is to be achieved.

4.6.3 Trigonometric Coefficient Generation via Trigonometric Identities

With both of the storage schemes discussed above, after deriving the single-angle trigonometric coefficients from the respective LUT(s), the double-angle and triple-angle trigonometric coefficients may then be straightforwardly obtained from the single-angle trigonometric coefficients through the application of the standard trigonometric identities

cos(2θ) = 2·cos²(θ) − 1   (4.65)

sin(2θ) = 2·sin(θ)·cos(θ)   (4.66)

and

cos(3θ) = (2·cos(2θ) − 1)·cos(θ)   (4.67)

sin(3θ) = (2·cos(2θ) + 1)·sin(θ),   (4.68)


respectively, or alternatively, through the replication of the respective LUT(s) for each of the double-angle and triple-angle cases. This question will be discussed further in Chapter 6 in relation to the conflict-free and (for the data) in-place parallel memory addressing schemes.
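The identity-based generation can be sketched and verified against direct evaluation (the function name is illustrative):

```python
import math

def multi_angle(c1, s1):
    """Derive the double- and triple-angle coefficients from the
    single-angle pair (cos(theta), sin(theta)) via Equations 4.65-4.68."""
    c2 = 2.0 * c1 * c1 - 1.0        # cos(2*theta), Eq. 4.65
    s2 = 2.0 * s1 * c1              # sin(2*theta), Eq. 4.66
    c3 = (2.0 * c2 - 1.0) * c1      # cos(3*theta), Eq. 4.67
    s3 = (2.0 * c2 + 1.0) * s1      # sin(3*theta), Eq. 4.68
    return c2, s2, c3, s3
```

A few extra multiplications and additions per butterfly are thus traded against replicating the LUT(s) for the double-angle and triple-angle cases.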

4.7 Comparative Complexity Analysis with Existing FFT Designs

This chapter has concerned itself with the detailed derivation of a regularized version of the DIT radix-4 FHT, referred to as the R24 FHT, the intention being to use the resulting algorithm for the efficient parallel computation of the real-data DFT.

For most applications, the real-data DFT is still generally solved with a real-from-complex strategy, as discussed in some detail in Chapter 2, whereby an N-point complex-data FFT simultaneously computes the outputs of two N-point real-data DFTs, or where the output of an N-point real-data DFT is obtained from the output of one N/2-point complex-data FFT. Such approaches, however, are adopted at the possible expense of increased memory, increased processing delay to allow for the acquisition/processing of pairs of data sets, and additional packing/unpacking complexity. The class of specialized real-data FFTs discussed in Chapter 2 is also commonly used and although these algorithms compare favourably, in terms of operation counts and memory requirement, with those of the FHT, they suffer in terms of a loss of regularity and reduced flexibility in that different algorithms are required for the computation of the DFT and its inverse.

The performance of the R24 FHT is therefore compared very briefly with those of the complex-data and real-data FFTs, as described in Chapter 2, together with the conventional non-regularized FHT [2, 3]. Performance is evaluated for the computation of both real-data and complex-data DFTs, where the application of the FHT to complex-valued data is achieved very simply by processing separately the real and imaginary components of the data and additively combining the outputs to yield the complex-data DFT output – this was discussed in some detail in Section 3.4 of the previous chapter. The results are summarized in Table 4.3, where a single-PE architecture is assumed for each solution such that the PE is able to produce all the outputs for a single instance of the respective butterfly (there are two types for the standard non-regularized FHT) simultaneously via the exploitation of fine-grained parallelism at the arithmetic level – such architectural considerations are to be discussed in some depth in future chapters of the monograph. Such a performance may prove difficult (if not impossible) to attain for some approaches, however, as the third row of the table suggests that neither the N-point real-data FFT nor the standard non-regularized FHT lend themselves particularly well to parallelization.

However, as can be seen from the table, the regularity/simplicity of the design and the bilateral nature of the algorithm make the R24 FHT an attractive solution compared to the class of real-data FFTs, whilst the reduced processing delay (for the real-data case) and reduced data memory/pin count requirement (for both the


Table 4.3 Algorithmic comparison for real-data and complex-data FFT designs

Algorithm                                       Complex-data    Real-data       Standard        Regularized
                                                N-point FFT     N-point FFT     N-point FHT     N-point FHT
Design regularity                               High            Low             Low             High
No. of butterfly designs                        1               1               2               1
Parallelization                                 High            Low             Low             High
Arithmetic domain                               Complex field   Complex field   Real field      Real field
Arithmetic complexity                           O(N·log4 N)     O(N·log4 N)     O(N·log4 N)     O(N·log4 N)
Time complexity                                 O(N·log4 N)     O(N·log4 N)     O(N·log4 N)     O(N·log4 N)
Data memory for N-point real-data DFT           2×N             N               N               N
Data memory for N-point complex-data DFT        2×N             –               N               N
Pin count for N-point real-data DFT             2×2×N           2×N             2×N             2×N
Pin count for N-point complex-data DFT          2×2×N           –               2×N             2×N
Processing delay for N-point real-data DFT      2×D             D               D               D
Applicable to forward & inverse DFTs            Yes             No              Yes             Yes
Additive complexity for unpacking of
  N-point real-data DFT                         N               –               N               N
Additive complexity for unpacking of
  N-point complex-data DFT                      –               –               4×N             4×N


real-data and complex-data cases) offer additional advantages over the conventional complex-data FFT approach. The low memory requirement of the R24 FHT approach is particularly relevant for applications involving large transform lengths, as is the case with many wide bandwidth channelization problems, for example.

Summarizing the results, the regularity of the design, combined with the ease of parallelization, nearest-neighbour communication topology at the arithmetic component level (as effected by the permutators) for a parallel solution, simplicity of the arithmetic components, optimum processing delay, low pin count and memory requirements, make the R24 FHT an extremely attractive candidate to pursue for possible realization in hardware with parallel computing equipment. The time and arithmetic complexities are shown to be of the same order for each solution considered, with the arithmetic complexity of the GD-BFLY being actually equivalent to that achievable for the butterfly of an optimally designed complex-data radix-4 FFT algorithm [8], widely considered the most computationally attractive of all fixed-radix butterflies.

4.8 Scaling Considerations for Fixed-Point Implementation

For a fixed-point implementation of the R24 FHT, as is the case of interest in this monograph, the registers available for holding the trigonometric coefficients and the data are of fixed length, whilst the register used for holding the outputs from the arithmetic operations (namely the accumulator), although of fixed length, is generally longer than those used for holding the trigonometric coefficients and the data. This additional length for the accumulator is to prevent the unnecessary loss of accuracy from rounding of the results following the arithmetic operations, as the multiplication of a K-bit word and an L-bit word yields a (K + L)-bit result, whilst the addition of two L-bit words yields an (L + 1)-bit result. When the trigonometric coefficients are each less than or equal to one, however, as they are for the R24 FHT, each multiplication will introduce no word growth, whereas the addition of any two terms following the multiplication stage may produce word growth of one bit.

The maximum growth in magnitude through the GD-BFLY occurs when all the input samples possess equal magnitude and the rotation angle associated with the trigonometric coefficients is π/4, the magnitude then growing by a factor of up to 1 + 3√2 ≈ 5.243. If the data register is fully occupied this will result in three bits of overflow. To prevent this, an unconditional scaling strategy could be applied whereby the data are right shifted by three bits prior to each stage of GD-BFLYs. However, apart from reducing the dynamic range of the data, such scaling introduces truncation error if the discarded bits are non-zero. The possibility of overflow would therefore be eliminated at the cost of unnecessary shifting of the data and a potentially large loss of accuracy.

A more accurate approach would be to adopt a conditional scaling strategy, namely the block floating-point technique [8], whereby the data are shifted only when overflow occurs. The block floating-point mechanism comprises two parts.


The output part calculates the maximum magnitude of the output data for the current stage of GD-BFLYs, from which a scaling factor is derived as a reference value for the input scaling of the next stage of GD-BFLYs. The input part receives the scaling factor generated by the previous stage of GD-BFLYs, so that the number of bits to be right shifted for the current input data set will be based on the scaling factor provided. Therefore, the data overflow and the precision of the integer operations are controlled automatically by the block floating-point mechanism, which provides information not only for the word growth of the current stage of GD-BFLYs but also for the word growth of all the previous stages. Such scaling, however, is far more complex to implement than that of unconditional scaling.
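A behavioural sketch of the two parts might look as follows. This is an illustration of the idea only – the word length, the truncating shift and the function names are assumptions, not the hardware realization:

```python
def output_scaling_factor(block, word_length=16):
    """Output part: from the maximum magnitude of the current stage's
    output data, derive the right-shift count for the next stage."""
    peak = max(abs(x) for x in block)
    limit = 1 << (word_length - 1)    # signed fixed-point magnitude limit
    shifts = 0
    while (peak >> shifts) >= limit:  # one shift per bit of overflow
        shifts += 1
    return shifts

def input_rescale(block, shifts):
    """Input part: right-shift the incoming data set by the scaling
    factor provided by the previous stage (discarded bits truncated)."""
    return [x >> shifts if x >= 0 else -((-x) >> shifts) for x in block]
```

Because the shift is applied only when the measured peak demands it, dynamic range is preserved relative to the unconditional three-bit shift per stage.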

An alternative to the above two approaches is to allow the data registers to possess a limited number of guard bits to cater for some or all of the word growth, such that the scaling strategy need only cater for limited word growth, rather than for the worst case. Such a scheme, however, as with that of unconditional scaling, will always yield a sub-optimal performance – in terms of accuracy and dynamic range – when compared to that achievable by the conditional block floating-point scheme.

4.9 Discussion

To summarize the situation so far, a new formulation of the radix-4 FHT has been derived, referred to as the regularized FHT or R24 FHT, whereby the major limitation of existing fixed-radix FHT designs, namely the lack of regularity arising from the need for two sizes – and thus two separate designs – of butterfly, has been overcome. It remains now to see how easily the resulting structure lends itself to mapping onto parallel computing equipment, bearing in mind that the ultimate requirement is to derive an area-efficient solution for power-constrained applications, such as mobile communications, where parallelism will need to be fully and efficiently exploited in order that the required throughput rates are attained. There is reason to be optimistic in the endeavour in that the large size of the GD-BFLY, which results in it being able to produce eight outputs from eight inputs, offers the promise of an eightfold speed-up with parallel computing equipment over that achievable via a purely sequential solution, whilst the arithmetic requirements of the GD-BFLY as indicated from its SFG suggest that it could well lend itself to internal pipelining, with each CS of the pipeline being made up from various combinations of the arithmetic components (adders and multipliers) and permutators of which the GD-BFLY is composed.

Note that the radix-4 butterfly used for the standard formulation of the radix-4 FFT is sometimes referred to in the technical literature as a dragonfly, rather than a butterfly, due to its resemblance to the said insect – a radix-8 butterfly may also be referred to as a spider for the same reason.

Finally, it should be noted that the property of symmetry has been exploited not only to minimize the number of arithmetic operations required by both FFT and FHT algorithms, through the regular nature of the respective decompositions, but


also to minimize the memory requirement through the nature of the fundamental function from which the associated transform kernels are derived, namely the sinusoidal function. The basic properties of this function – together with that of its complementary function the cosinusoid – are as described by Equations 4.56–4.58 given earlier in the chapter, with the sinusoid being an even-symmetric function relative to any odd-integer multiple of the argument π/2 and an odd-symmetric function relative to any even-integer multiple of π/2, whilst the cosinusoid is an even-symmetric function relative to any even-integer multiple of the argument π/2 and an odd-symmetric function relative to any odd-integer multiple of π/2. That is, they are each either even-symmetric or odd-symmetric according to whether the axis of symmetry is an appropriately chosen multiple of π/2.

References

1. G. Birkhoff, S. MacLane, A Survey of Modern Algebra (Macmillan, New York, 1977)
2. R.N. Bracewell, The fast Hartley transform. Proc. IEEE 72(8) (1984)
3. R.N. Bracewell, The Hartley Transform (Oxford University Press, New York, 1986)
4. D.F. Elliott, K. Ramamohan Rao, Fast Transforms: Algorithms, Analyses, Applications (Academic, New York, 1982)
5. K.J. Jones, Design and parallel computation of regularised fast Hartley transform. IEE Proc. Vision, Image Signal Process. 153(1), 70–78 (2006)
6. L. Kronsjo, Computational Complexity of Sequential and Parallel Algorithms (Wiley, New York, 1985)
7. Y. Li, Z. Wang, J. Ruan, K. Dai, A low-power globally synchronous locally asynchronous FFT processor. HPCC 2007, LNCS 4782, 168–179 (2007)
8. L.R. Rabiner, B. Gold, Theory and Application of Digital Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1975)
9. S. Winograd, Arithmetic Complexity of Computations (SIAM, Philadelphia, PA, 1980)
10. A. Zakhor, A.V. Oppenheim, Quantization errors in the computation of the discrete Hartley transform. IEEE Trans. ASSP 35(11), 1592–1602 (1987)


Chapter 5
Algorithm Design for Hardware-Based Computing Technologies

Abstract This chapter first provides a brief discussion of the fundamental properties of both FPGA and ASIC devices, together with their relative merits, before analyzing the various design techniques and parameters – namely those relating to the clock frequency, silicon area and switching frequency – and the constraints and trade-offs that need to be made between them when trying to design a low-power solution to the regularized FHT for implementation with such technologies. The benefits of incorporating scalability, partitioned-memory processing and flexibility into the design of the proposed solution are considered as well as the design options available for silicon-based implementation when constrained by the limited availability of embedded resources. A discussion is finally provided relating to the results obtained in the chapter.

5.1 Introduction

The type of high-performance parallel computing equipment typified by the increasingly powerful silicon-based FPGA and ASIC technologies [2] now gives design engineers far greater flexibility and control over the type of algorithm that may be used in the building of high-performance DSP systems, so that more appropriate hardware solutions to the problem of solving the real-data DFT may be actively sought and exploited to some advantage with these silicon-based technologies. With such technologies, however, it is no longer adequate to view the FFT complexity purely in terms of arithmetic operation counts, as has conventionally been done, as there is now the facility to use both multiple arithmetic units – adders and fast multipliers – and multiple banks of fast random access memory (RAM) in order to enhance the FFT performance via its parallel computation.

As a result, a whole new set of constraints has arisen relating to the design of efficient FFT algorithms. With the recent and explosive growth of wireless technology, and in particular that of mobile communications, where a small battery may be the only source of power supply for long periods of time, algorithms are now being designed subject to new and often conflicting performance criteria where the ideal is either to maximize the throughput (that is, to minimize the update time) or satisfy

K. Jones, The Regularized Fast Hartley Transform, Signals and Communication Technology, DOI 10.1007/978-90-481-3917-0_5, © Springer Science+Business Media B.V. 2010



some constraint on the latency, whilst at the same time minimizing the required silicon resources (and thereby minimizing the cost of implementation) as well as keeping the power consumption to within the available budget. Note, however, that the throughput is also constrained by the I/O speed, as the algorithm cannot process the data faster than it can access it.

To be able to produce such a solution, however, it is first necessary to identify the relevant parameters [2] involved in the design process and then to outline the constraints and trade-offs that need to be made between them. This chapter first looks therefore into those design techniques and parameters – namely those that relate to the clock frequency, silicon area and switching frequency – that need to be considered for the design of a low-power solution to the R24 FHT. The aim, bearing in mind the target application area of mobile communications, is to obtain a solution that optimizes the use of the available silicon resources on the target device whilst keeping the associated power consumption to within the available budget and, in so doing, to maximize the achievable computational density – as defined in Chapter 1. The particular benefits of incorporating scalability, partitioned-memory processing and flexibility into the design of the solution are then discussed as well as the design options available for silicon-based implementation when constrained by the limited availability of embedded resources.

5.2 The Fundamental Properties of FPGA and ASIC Devices

An FPGA device [2] is an integrated circuit that contains configurable or programmable blocks of logic along with configurable interconnections between the blocks. DSP design engineers are able to configure or program such devices to perform a wide variety of signal processing tasks, with most modern devices offering the facility for repeated re-programming. An ASIC device [2], on the other hand, is custom-designed to address a specific application and as such is able to offer the ultimate solution in terms of size (the number of transistors), complexity and performance, where performance is typically measured in terms of computational density. Designing and building an ASIC is an extremely time-consuming and expensive process, however, with the added disadvantage that the final design is frozen in silicon [2] and cannot be modified without creating a new version of the device.

The ASIC is often referred to as being fine-grained because ultimately it is implemented at the level of the primitive logic gates, whereas the FPGA is often referred to as being coarse-grained because it is physically realized using higher-level blocks of programmable logic. Therefore, in order to enhance the capabilities and the competitiveness of the FPGA, manufacturers are now providing embedded resources, such as fast multipliers and banks of fast RAM with dedicated arithmetic routing, which are considerably smaller, faster and more power efficient than when implemented in programmable logic by the user. These features, when coupled with the massive parallelism on offer, enable the FPGA to outperform the fastest of the conventional uni-processor DSP devices by two or even three orders of magnitude.


The system-level attractions of the FPGA are its flexibility and cost-effectiveness for low-volume price-sensitive applications, whilst the additional circuit-level benefits of reduced delay, area and power consumption are known to be even more pronounced with ASIC technology which, as stated above, will always yield optimum performance when that performance is to be measured in terms of computational density.

The cost of an FPGA, as one would expect, is much lower than that of an ASIC. At the same time, implementing design changes is also much easier, with the time-to-market for such designs being considerably shorter. This means that the FPGA allows the design engineer to realize software and hardware concepts on an FPGA-based test platform without having to incur the enormous costs associated with ASIC designs. Therefore, high-performance FFT designs, even when ultimately targeted at an ASIC implementation, will generally for reasons of ease, time and cost be developed and tested on an FPGA.

For the analysis carried out in this and future chapters, emphasis is placed on the implementation of arithmetic units with FPGA technology where the target device family is the popular Virtex-II Pro as produced by Xilinx Inc. [6] of the USA. Although this device family may be somewhat old, its use is only intended to facilitate comparison between the different types of PE or computing architecture proposed for the parallel computation of the R24 FHT – with real-world applications it is not always possible, for various practical/financial reasons, to have access to the latest device technologies. The FPGA is actually made up of a number of configurable logic blocks (CLBs), which provide one with both logic and storage. Each CLB is made up of a number of "slices", two for the case of a Virtex-II Pro device, with each logic slice containing two LUTs. Each LUT can in turn be configured as a 16 by one-bit synchronous RAM or read only memory (ROM), which is more commonly referred to as distributed RAM.

5.3 Low-Power Design Techniques

Over the past decade or so, power consumption has grown from a secondary to a major constraint in the design of hardware-based DSP solutions. In portable applications, such as mobile communications, low power consumption has long been the main design constraint, due in part to the increasing cost of cooling and packaging, but also to the resulting rise in on-chip temperature, which in turn results in reduced reliability. The result is that the identification and application of low-power techniques, at both arithmetic and algorithmic levels, are crucial to the specification of an achievable design in silicon that meets with the required power-related performance objectives.

The power consumption associated with the silicon-based implementation of a high-performance DSP algorithm, such as the R24 FHT, comprises both "static" and "dynamic" components. The dynamic component has until recently dominated the total power consumption, although as the devices become ever bigger and

Page 80: The Regularized Fast Hartley Transform: Optimal Formulation of Real-Data Fast Fourier Transform for Silicon-Based Implementation in Resource-Constrained Environments

68 5 Algorithm Design for Hardware-Based Computing Technologies

ever more powerful the contribution of the static (or acquiescent) component tothe total power consumption is becoming increasingly more significant. Given ourhardware-efficient objectives, however, we restrict our attention here to the dynamiccomponent, denoted PD, which may be expressed as

PD D C � V2 � f (5.1)

where “C” is the capacitance of the node switching, “V” is the supply voltage and“f” the switching frequency. This component is primarily driven by the clock fre-quency of the device, the silicon area required for its implementation – which isdetermined by the size of the arithmetic unit, the total memory requirement and thedata routing – and the average switching rate of the individual circuits in each clockcycle. These items are now discussed in more detail in order that a suitable designstrategy might be identified.
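As a rough numerical illustration of Equation 5.1 (the component values below are invented purely for the example and are not taken from any device datasheet), the quadratic dependence on the supply voltage is what makes pipelined parallelism at a reduced clock rate attractive:

```python
def dynamic_power(c_farads, v_volts, f_hertz):
    """Dynamic power component PD = C * V^2 * f (Equation 5.1)."""
    return c_farads * v_volts ** 2 * f_hertz

# Single fast processor: one switched node driven at a high clock rate.
p_fast = dynamic_power(c_farads=1e-9, v_volts=1.5, f_hertz=400e6)

# Pipelined alternative: four concurrent stages (4x the switched capacitance)
# at one quarter of the clock rate -- same notional throughput, but the lower
# clock also permits a lower supply voltage, which enters quadratically.
p_pipe = dynamic_power(c_farads=4e-9, v_volts=1.2, f_hertz=100e6)

print(p_fast, p_pipe, p_pipe < p_fast)
```

The comparison is only indicative, but it shows why trading clock frequency for parallelism, as discussed below, can reduce the dynamic power even when the total switched capacitance increases.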

5.3.1 Clock Frequency

To achieve high throughput with a hardware-based solution to the DSP-based problem of interest, the clock frequency is typically traded off against parallelism, with the choice of solution ranging from that based upon the use of a single processor, driven at a potentially very high clock frequency, to that based upon the use of multiple processors, typically operating concurrently via pipelining of the algorithms, which combine to achieve the required performance but with a potentially much reduced clock frequency. For the particular problem of interest in this monograph, the parallelism can be exploited at both the arithmetic level, in terms of the fine-grain parallelism of the GD-BFLY, and the algorithmic level, in terms of the coarse-grain parallelism of the resulting R2⁴ FHT, with pipelining techniques being the most power-efficient means of achieving parallelism due to the associated nearest-neighbour communication requirement.

Given the strong dependence of power consumption on clock frequency, there is clearly great attraction in being able to keep the clock frequency as low as possible for the implementation of the R2⁴ FHT, provided the resulting solution is able to meet the required performance objectives relating to throughput. To achieve this, however, it is necessary that an appropriate parallelization scheme be defined, such a scheme being typically based upon one of the two pipelining schemes outlined above, which will additionally impact upon the silicon area requirement, as now discussed.

5.3.2 Silicon Area

Suppose that the R2⁴ FHT is of length N, where

N = 4^α (5.2)

with "α", the radix exponent corresponding to N, thus representing the number of temporal stages required by the algorithm. High-performance solutions may be obtained through coarse-grain algorithmic parallelization by adopting an α-stage computational pipeline, as shown in Fig. 5.1, where each computational stage is assigned its own PE and double-buffered memory. But this means that the amount of silicon required by the R2⁴ FHT will be both dependent upon and proportional to the size of the transform to be computed, as is the case with most commercially available intellectual property (IP) core designs. A solution based upon a globally pipelined multi-PE architecture such as this achieves O(N) time complexity at the cost of O(log₄ N) space complexity, where space complexity refers loosely to the total silicon area requirement.
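A minimal numerical sketch of the relationship N = 4^α (the function name and sample lengths are mine, chosen for illustration):

```python
import math

def num_stages(n):
    """Radix exponent alpha such that n == 4**alpha (Equation 5.2)."""
    alpha = round(math.log(n, 4))
    if 4 ** alpha != n:
        raise ValueError("transform length must be a power of 4")
    return alpha

# The multi-PE pipeline of Fig. 5.1 needs alpha PEs, i.e. O(log4 N) area:
for n in (64, 1024, 65536):
    print(n, num_stages(n))
```

The point of the sketch is simply that the stage count, and hence the silicon area of the multi-PE solution, grows only logarithmically with the transform length, whereas it grows not at all for the single-PE solution discussed next.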

Alternatively, with a single-PE architecture, as shown in Fig. 5.2, high performance may be achieved for the R2⁴ FHT through fine-grain PE-level arithmetic parallelization based upon internal or local pipelining of the PE. However, the success of this scheme relies heavily, if adequate throughput is to be achieved, upon the appropriate partitioning and storage of the data and trigonometric coefficients

[Figure: an α-stage computational pipeline, input data entering PE No 1 and output data leaving PE No α, with each stage comprising one PE executing N/8 radix-4 GD-BFLYs together with its own data memory (DM) and trigonometric coefficient memory (CM).]

Fig. 5.1 Multi-PE architecture for radix-4 version of regularized FHT

[Figure: a single parallel PE comprising the radix-4 generic double butterfly, connected to the trigonometric coefficient memory and the data memory, with a feedback loop through which the α × N/8 radix-4 GD-BFLYs are executed; input data enters and output data leaves via the data memory.]

Fig. 5.2 Single-PE architecture for radix-4 version of regularized FHT


in partitioned memory so that multiple samples/coefficients may be accessed and (for the data) updated in parallel from their respective memory banks. Optimal efficiency also requires that the processing for each instance of the GD-BFLY be carried out in an in-place fashion so that the memory requirement may be kept to a minimum. When such a solution is possible, the result is both area-efficient and scalable in terms of transform length, with space complexity – apart from the memory requirement – being independent of the size of the transform to be computed. Such a solution achieves O(N log₄ N) time complexity which, when the I/O requires N clock cycles, ensures continuous real-time operation for α ≤ 8, and thus for N ≤ 64K, at the cost of O(1) space complexity.

The greater the area efficiency, therefore, the lower the achievable throughput, as one would expect, so that the ultimate choice of solution will be very much dependent upon the timing constraint, if any, to be imposed upon the problem, as will be discussed in the following chapter. Note that the word "scalable", which has already been used a few times in this monograph and may mean different things in different contexts, simply refers to the ease with which the sought-after solution may be modified in order to accommodate increasing or decreasing transform sizes – this is discussed further in Section 5.4.1.
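The continuous-operation condition described above can be checked numerically (a sketch only, with helper names of my own choosing): the single-PE solution executes (N/8)·log₄ N GD-BFLYs per transform, one per clock cycle, while the I/O delivers N samples in N clock cycles.

```python
import math

def butterfly_cycles(n):
    """GD-BFLY executions per length-n transform: (n/8) * log4(n)."""
    return (n // 8) * round(math.log(n, 4))

def is_real_time(n):
    """Continuous operation requires (n/8)*log4(n) <= n, i.e. n <= 4**8."""
    return butterfly_cycles(n) <= n

for n in (4096, 65536, 262144):
    print(n, butterfly_cycles(n), is_real_time(n))
```

The bound is tight at N = 64K, where the butterfly count exactly equals the N available clock cycles; beyond that a single PE can no longer keep up with the I/O, as the text states.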

5.3.3 Switching Frequency

Another important factor affecting power consumption is the switching power, which relates to the number of times that a gate makes a logic transition, 0 → 1 or 1 → 0, in each clock cycle. Within the arithmetic unit, for example, when one of the inputs is constant, as with the case of the pre-computed trigonometric coefficients, it is possible to use the pre-computed values to reduce the number of logic transitions involved, when compared to a conventional fast multiplier solution, and thus to reduce the associated switching power. With a parallel DA arithmetic unit [4], for example, it is possible to reduce both switching power and silicon area for the implementation of the arithmetic components at the expense of increased memory for the storage of pre-computed sums or inner products [1], whereas with a parallel CORDIC arithmetic unit [3] it is possible to eliminate the CM requirement and the associated power-hungry memory accesses, which also involve switching activity, at the minimal expense of increased arithmetic and control logic for the on-the-fly generation of the trigonometric coefficients within each stage of the CORDIC pipeline.
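As a hedged illustration of the DA idea referred to above (the coefficients, word length and helper names are invented for the example and are not the book's design), a fixed-coefficient inner product can be formed without multipliers from a LUT of pre-computed coefficient sums, addressed one bit-plane of the inputs at a time:

```python
from itertools import product

COEFFS = (3, -5, 7, 2)   # fixed "trigonometric" coefficients (example values)
BITS = 8                 # unsigned input word length (example value)

# Pre-compute the 2**len(COEFFS) possible partial sums once.
LUT = {addr: sum(c for c, bit in zip(COEFFS, addr) if bit)
       for addr in product((0, 1), repeat=len(COEFFS))}

def da_inner_product(xs):
    """Inner product sum(c*x) formed shift-and-add style from LUT look-ups."""
    acc = 0
    for b in range(BITS):
        addr = tuple((x >> b) & 1 for x in xs)   # b-th bit of every input
        acc += LUT[addr] << b                     # one LUT access per bit-plane
    return acc

xs = (10, 20, 30, 40)
assert da_inner_product(xs) == sum(c * x for c, x in zip(COEFFS, xs))
print(da_inner_product(xs))
```

The design choice being illustrated is exactly the trade mentioned in the text: the multiplier logic (and its switching activity) disappears, at the cost of a memory holding the pre-computed sums.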

5.4 Proposed Hardware Design Strategy

Having discussed very briefly the key parameters relevant to the production of a low-power silicon-based solution to the R2⁴ FHT, a design strategy is now outlined to assist in the achieving of such a solution. One should bear in mind, however, that in using the proposed solution for real-world applications, where the available silicon resources may vary considerably from one application to another, it would be advantageous to be able to define a small number of variations of the basic PE design whereby the appropriate choice of design would enable one to optimize the use of the available silicon resources on the target device so as to obtain a solution that maximizes the achievable computational density.

5.4.1 Scalability of Design

The first property of the sought-after solution to be considered is that of scalability, as referred to above in the discussion on silicon area. A desirable feature of our solution is that it should be easily adapted, for new applications, at minimal redesign effort and cost. This may in part be achieved by making the solution scalable, in terms of transform length, such that the same single-PE computing architecture may be used for each new application with the hardware requirements remaining essentially unaltered as the transform length is increased or decreased – other than the varying memory requirement necessary to cater for the varying amounts of data and trigonometric coefficients – such an approach, in turn, playing a key role in keeping the power consumption to within the available budget.

The consequence of using such a strategy is that as the transform length N is increased, the silicon area is kept essentially constant at the expense of an increased update time (the elapsed time between the production of each new real-data FFT output data set) and increased latency (the elapsed time involved in the production of a real-data FFT output data set from its associated input data set), where the latency increases according to the number of times the GD-BFLY is executed per transform, namely (N/8)·log₄ N. However, if the required performance dictates simply that the latency satisfy the timing constraint imposed by the I/O requirement – namely the processing of N samples in N clock cycles – then the property of scalability looks to be an extremely attractive mechanism for achieving an area-efficient solution to the R2⁴ FHT, particularly when implemented with silicon-based parallel computing equipment.

Note that if the requirement was to be able to keep the update time constant, then it would be necessary to increase the clock frequency and/or to increase the hardware requirements, in line with the increasing transform size, using a multi-PE architecture – either way, this would in turn result in a significant increase in both cost and power consumption.

5.4.2 Partitioned-Memory Processing

An additional requirement arising from the property of scalability, as already indicated, is that relating to the need for the data and the trigonometric coefficients to be appropriately partitioned and stored in partitioned memory so that multiple samples/coefficients may be accessed and (for the data) updated in parallel from their respective memory banks. The resulting combination of scalability of design and partitioned-memory processing, if it could be achieved, would yield a solution that was both area-efficient and able to yield high throughput, and which would be able, for all transform lengths of interest (except for pathologically large cases), to satisfy the latency constraint arising from the I/O requirement.

An additional attraction of such processing is that the adoption of partitioned memory, rather than that of a single global memory, results in a further reduction in power consumption [5].

5.4.3 Flexibility of Design

The final property of the solution to be considered is that of flexibility, whereby the best possible use might be made of the available silicon resources when the solution is applied to new applications. This is achieved with the provision of a few variations of the basic PE design, each exploiting the same computing architecture, where the variations enable one to select a specific design according to the particular silicon resources available on the target device. Such flexibility has already been implied in the results of Sections 4.5 and 4.6 of the previous chapter, where both nine-multiplier and twelve-multiplier versions of the GD-BFLY were considered together with different CM addressing schemes, one of which minimized the arithmetic complexity at the cost of an increased CM requirement, with the other minimizing the CM requirement at the cost of increased arithmetic complexity.

A different type of flexibility relates to the available arithmetic precision, as provided by the arithmetic unit. Different signal processing applications involving the use of an FFT may require very different processing functions in order to carry out the necessary tasks, and often different levels of precision for each such function. The FFT may well be fed directly by means of an ADC unit, for example, so that the word length of the data into and out of the FFT will be dictated both by the capability of the ADC and by the dynamic range requirements of the processing functions into which the FFT feeds. For the design to have truly universal application, therefore, it would be beneficial that the arithmetic unit should be easily adapted to cater for arbitrary arithmetic precision processing, including those applications where the requirements are not adequately addressed through the use of embedded resources, so that different levels of accuracy may be achieved for different applications without having to alter the basic design of the PE.

Such flexibility in terms of the arithmetic precision may be achieved via the use of a pipelined CORDIC arithmetic unit, as discussed in some depth in Chapter 7, where increased precision may be obtained by simply increasing the length of the associated computational pipeline – noting that the CORDIC stages are identical – at the expense of a proportionate increase in latency.
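A floating-point sketch of this precision/pipeline-length trade-off (an illustrative circular-rotation CORDIC of my own construction, not the fixed-point unit of Chapter 7): each extra stage is one identical micro-rotation, and contributes roughly one extra bit of precision to the generated cos/sin pair.

```python
import math

def cordic_cos_sin(theta, iterations):
    """Approximate (cos(theta), sin(theta)), |theta| <= pi/2, via micro-rotations."""
    # Scale-compensation constant K = prod(1/sqrt(1 + 2**-2i)).
    k = 1.0
    for i in range(iterations):
        k *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = k, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotation direction
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)         # residual angle
    return x, y

for n in (8, 16, 24):
    c, s = cordic_cos_sin(math.pi / 5, n)
    print(n, abs(c - math.cos(math.pi / 5)), abs(s - math.sin(math.pi / 5)))
```

Running the loop shows the approximation error shrinking as the iteration count (i.e. the pipeline length) grows, which is the mechanism by which the latency of a longer pipeline buys additional arithmetic precision.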


5.5 Constraints on Available Resources

As already discussed in Section 1.7 of Chapter 1, when producing electronic equipment, whether for commercial or military use, one is seldom blessed with the option of using the latest state-of-the-art device technology. As a result, there are situations where there would be great merit in having designs that are not totally reliant on the availability of the increasingly large quantities of expensive embedded resources, such as fast multipliers and fast memory, as provided by the manufacturers of the latest silicon-based devices, but are sufficiently flexible to lend themselves to implementation in silicon even when constrained by the limited availability of embedded resources.

A problem may arise in practice, for example, when the length of the transform to be computed is very large compared to the capability of the target device, such that there are insufficient embedded resources to enable a successful mapping of the transform onto the device. In such a situation, where the use of a larger and more powerful device is simply not an option, it is thus required that some means be found of facilitating a successful mapping onto the available device, and one way of achieving this is through the design of a more appropriate arithmetic unit, namely one which does not rely too heavily upon the use of embedded resources.

As with the requirement for flexible-precision processing, this may be achieved via the use of a pipelined CORDIC arithmetic unit, to be discussed in Chapter 7, which can be shown to effectively eliminate the requirement for both fast fixed-point multipliers and fast RAM for the trigonometric coefficients.

5.6 Assessing the Resource Requirements

Given the device-independent nature of the R2⁴ FHT design(s) sought in this monograph, a somewhat theoretical approach has been adopted for assessing the resource requirements for its implementation in silicon, this assessment being based on the determination of the individual requirements, measured in logic slices, for addressing both the arithmetic complexity and the memory requirement. Such an approach can only tell part of the story, however, as the amount of logic required for controlling the operation and interaction of the various components of the design (which ideally are manufacturer-supplied embedded components for optimal size and power efficiency) is rather more difficult (if not impossible) to assess if considered in isolation from the actual hardware design process, this being due in part to the automated and somewhat unpredictable nature of that process, as outlined below.

Typically, after designing and implementing the hardware design in an HDL, there is a multi-stage process to go through before the design is ready for use in an FPGA. The first stage is synthesis, which takes the HDL code and translates it into a "netlist", which is simply a textual description of a circuit diagram or schematic. This is followed by a simulation which verifies that the design specified in the netlist functions correctly. Once verified, the netlist is translated into a binary format, the


components and connections that it defines then being mapped to CLBs, before the design is finally placed and routed to fit onto the target device. A second simulation is then performed to help establish how well the design has been placed and routed, before a "configuration" file is generated to enable the design to be loaded onto the FPGA.

The reality, after this process has been gone through, is that the actual logic requirement will invariably be somewhat greater than predicted by theory, due to the inefficient and unpredictable use made of the available resources in meeting the various design constraints. This situation is true for any design considered, however, so that in carrying out a comparative analysis of different FHT or FFT designs the same inefficiencies will inevitably apply to each. Such an overhead in the logic requirement needs to be borne in mind, therefore, when actually assessing whether a particular device has sufficient resources to meet the given task.

5.7 Discussion

This chapter has looked very briefly into those design techniques and parameters that need to be considered and traded off in order to achieve a low-power solution to the R2⁴ FHT, where the sought-after solution is required to be able to optimize the use of the available silicon resources on the target device so as to obtain a solution that maximizes the achievable computational density. This involved a discussion of the benefits of incorporating scalability, partitioned-memory processing and flexibility into the design of the solution, and of the design options available for silicon-based implementation when constrained by the limited availability of embedded resources. Clearly, if silicon-based designs can be produced that minimize the requirement for such embedded resources, then smaller, lower-complexity devices might be used, rather than those at the top end of the device range, as is commonly the case, thus minimizing the financial cost of implementation. The less the reliance on the use of embedded resources, the greater the flexibility in the choice of target hardware.

It remains now to see how a suitable computing architecture might be defined which enables the attractions of the hardware-based technologies discussed in this chapter to be effectively exploited for the parallel computation of the R2⁴ FHT, the derivation of which was discussed in some considerable detail in the previous chapter. In doing so, it would be advantageous to offer a choice of PE designs which range from providing optimality in terms of the arithmetic complexity to optimality in terms of the memory requirement, as this would provide the user with the ability to optimize the design of the PE for each new application according to the resources available on the target device.

The single-PE architecture discussed in this chapter would be particularly attractive for the case of the R2⁴ FHT given that the associated computing engine, the GD-BFLY, produces eight outputs from eight inputs, so that a parallel solution would offer a theoretical eightfold speed-up over a purely sequential solution. This would necessitate the data and trigonometric coefficients being appropriately partitioned and stored in partitioned memory so that multiple samples/coefficients may be accessed and (for the data) updated in parallel from their respective memory banks – this would in turn result in a further decrease of the power consumption. Being able to place the memory close to where the processing is actually taking place would in addition eliminate the need for long on-chip communication paths, which can result in long processing delays and increased power consumption.

References

1. G. Birkhoff, S. MacLane, A Survey of Modern Algebra (Macmillan, New York, 1977)
2. C. Maxfield, The Design Warrior's Guide to FPGAs (Newnes (Elsevier), 2004)
3. J.E. Volder, The CORDIC trigonometric computing technique. IRE Trans. Elect. Comput. EC-8(3), 330–334 (1959)
4. S.A. White, Application of distributed arithmetic to digital signal processing: A tutorial review. IEEE ASSP Mag. 4–19 (1989)
5. T. Widhe, J. Melander, L. Wanhammar, Design of efficient radix-8 butterfly PE for VLSI. Proc. IEEE Int. Symp. Circuits Syst., Hong Kong (1997)
6. Xilinx Inc., company and product information available at company web site: www.xilinx.com

Chapter 6
Derivation of Area-Efficient and Scalable Parallel Architecture

Abstract This chapter discusses a partitioned-memory single-PE computing architecture for the parallel computation of the regularized FHT which seeks to maximize the computational density – that is, the throughput per unit area of silicon – when implemented with a silicon-based parallel computing device. A pipelined implementation of the GD-BFLY is discussed together with conflict-free and (for the data) in-place parallel memory addressing schemes for both the data and the trigonometric coefficients which enable the outputs from each instance of the GD-BFLY to be produced within a single clock cycle. Four versions of the solution are discussed which enable trade-offs to be made of arithmetic complexity against memory requirement according to the resources available on the target device. An FPGA implementation of the regularized FHT is then discussed and its performance compared with two commercially available solutions. A discussion is finally provided relating to the results obtained in the chapter.

6.1 Introduction

A point has now been reached whereby an attractive formulation of the FHT algorithm has been produced, namely the R2⁴ FHT, whilst those properties required of such an algorithm and of its associated computing architecture for the achievement of an optimal mapping onto silicon-based parallel computing equipment – as typified by the FPGA and the ASIC – have also been outlined. The question now to be addressed is whether such a mapping can be found, bearing in mind that the ultimate objective appears to be demanding a "squaring of the circle", namely that of maximizing the computational throughput whilst at the same time minimizing the required quantities of silicon resources so as to reduce both power consumption and cost – see the definition "Performance Metric for Silicon-Based Parallel Computing Device" in Section 1.8 of Chapter 1. To make the objective more concrete, therefore, it is perhaps worth putting it into the following words:

K. Jones, The Regularized Fast Hartley Transform, Signals and Communication Technology, DOI 10.1007/978-90-481-3917-0_6, © Springer Science+Business Media B.V. 2010


Statement of Performance Objective No 2:

The aim is to produce a scalable solution to the regularized FHT – and hence to the real-data FFT – in silicon that leads to the execution of a length-N transform, where N caters for all transform lengths of practical interest, in N clock cycles or less, this performance being subject to the constraint that the least amount of silicon resources be used.

Although other metrics could of course be used for this definition, this particular metric – which for the latency-constrained version of the real-data DFT problem looks for that solution incurring the lowest possible silicon cost, which thus equates to maximizing the computational density – is targeted specifically at the type of power-constrained environment that one would expect to encounter with mobile communications, as it is assumed that a solution that yields a high computational density will be attractive in terms of both power consumption and hardware efficiency, given the known influence of silicon area on the power consumption – as discussed in the previous chapter. The restriction to its execution being completed in N clock cycles or less is to ensure continuous operation (whereby the processing rate of the R2⁴ FHT is able to keep up with the speed of the I/O over each block of data) and is valid for transform lengths up to and including 64K, provided that the outputs from each instance of the GD-BFLY can be produced within a single clock cycle. For transform lengths longer than this, therefore, it would not be possible to sustain continuous operation with a single PE, so that a multi-PE solution with at least two PEs and a suitable architecture would be required – this point is taken up again later in the chapter.

Also, by stating the objective in this way, it is possible to ensure that any parallel solution to the R2⁴ FHT, if found, will possess those properties necessary for an attractive hardware implementation, although it will also be necessary that a proper comparison be made both of the time complexity, as given by the latency, and of the required silicon resources of any such solution with those of existing commercially available industry-standard FFT devices – although, as already stated, most if not all such commercially available solutions will almost invariably involve the computation of the conventional complex-data radix-2 version of the FFT.

6.2 Single-PE Versus Multi-PE Architectures

Two types of parallel computing architecture were briefly discussed in Section 5.3.2 of the previous chapter, one based upon the adoption of multiple PEs and the other upon the adoption of a single PE, where the multi-PE architecture achieves the required computational throughput via algorithm-level pipelining and the other via arithmetic-level pipelining within the PE itself. The globally pipelined multi-PE architecture thus lends itself naturally to streaming operation – which generally takes naturally ordered input data and produces digit-reversed output data – whereby the data samples are processed as soon as they arrive at the first PE in the pipeline, whilst the locally pipelined single-PE architecture lends itself more naturally to block-based or burst operation – which generally takes digit-reversed input data


and produces naturally ordered output data – whereby all the data samples must first be generated and stored before they can be processed. The single-PE architecture certainly looks to offer the most promise for the problem under consideration, bearing in mind the power-constrained environment associated with the target application area of mobile communications, but in order to achieve the required computational throughput it will be necessary that the memory, for both the data and the trigonometric coefficients, be suitably organized.

The memory structure should be such that the data and trigonometric coefficients required for the execution of the GD-BFLY may be accessed simultaneously, without conflict, thereby enabling the outputs from each instance of the GD-BFLY to be produced within a single clock cycle. In order to achieve this it is necessary that the memory be organized according to the partitioned-memory architecture of Fig. 6.1, where the topology of the data routing network is shown in the form of an H-tree so as to keep the communication paths between the PE and each memory bank of equal length – although in reality, when mapping such designs onto an FPGA, one no longer has any control over such matters. The input/output data is partitioned or distributed over eight memory banks and the trigonometric coefficients over three memory banks, so that suitable parallel addressing schemes now need to be defined which ideally enable one data sample to be read from (and written to) each DM bank every clock cycle, in an in-place fashion and without conflict, and two trigonometric coefficients to be read from each CM bank every clock cycle [3], again without conflict. Such addressing schemes, for both the data and the trigonometric coefficients, are now discussed in some detail.

[Figure: the generic radix-4 double butterfly connected, via an H-tree routing network, to eight data memory banks M1D, …, M8D and, through the trigonometric coefficient generator, to three trigonometric coefficient memory banks M1C, M2C, M3C, together with an address generator; MnD denotes the nth data memory bank and MnC the nth trigonometric coefficient memory bank.]

Fig. 6.1 Partitioned-memory single-PE architecture for regularized FHT


6.3 Conflict-Free Parallel Memory Addressing Schemes

The partitioned-memory addressing schemes described here for the R2⁴ FHT are based upon the assumption that the memories are dual-port. Such memory is assumed to have four data ports, two for the data inputs and two for the data outputs, although there is only one address input for each input/output data pair. As a result, each memory bank is able to cater for either two simultaneous reads, as required for the case of the CM, two simultaneous writes, or one simultaneous read and write using separate read and write addresses. These read/write options will be shown to be sufficient for the addressing requirements of both the CM, which requires two simultaneous reads per clock cycle, and the DM, which for the implementation discussed in this monograph will be shown to need all three options. With regard to the DM, the addressing scheme is also to be regarded as being in-place, as the outputs of each instance of the GD-BFLY are to be ultimately written back to the same memory locations from which the GD-BFLY inputs were accessed.

6.3.1 Data Storage and Accession

The GD-BFLY, for the Type-I, Type-II and Type-III cases, as described in Chapter 4, requires that eight data samples be read from and written to the DM, in an in-place fashion, in order to be able to carry out the processing for a given data set. One way for this to be achieved is if the eight samples to be processed by the GD-BFLY are stored with one sample in each DM bank, so that all eight DM banks are used for each instance of the GD-BFLY. Another way, given the availability of dual-port memory, would be to have two samples in each of four DM banks, with alternate sets of DM banks being used on alternate sets of data. The problem is addressed here by adopting suitably modified versions of the rotation-based radix-4 memory mapping, "Ψ₄", as given by:

Definition of Mapping for Data Memory Addressing:

Ψ₄(n, α) = [ ( Σ_{k=1}^{α} ((n mod 4^k) >> 2(k−1)) ) mod 4 ] << 1    (6.1)

so that Ψ₄(n, α) ∈ {0, 2, 4, 6},

where the parameter n ∈ {0, 1, ..., N−1} corresponds to the sample address after di-bit reversal and α is the radix exponent corresponding to the transform length N, i.e. where α = log₄N.
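To make the behaviour of the mapping concrete, the following Python sketch (illustrative only – the monograph targets programmable logic, and the function name `psi4` is ours) evaluates Equation 6.1 directly:

```python
def psi4(n, alpha):
    """Rotation-based radix-4 mapping of Equation 6.1.

    Sums the base-4 digits of n (each digit extracted with a modulo and a
    binary right-shift), reduces the sum modulo 4, then left-shifts by one
    bit so that the result is one of the even bank addresses {0, 2, 4, 6}.
    """
    digit_sum = sum(((n % 4 ** k) >> (2 * (k - 1))) % 4
                    for k in range(1, alpha + 1))
    return (digit_sum % 4) << 1

# For N = 64 (alpha = 3), grouping the mapped values four to a row
# reproduces the rotated rows of Table 6.1: row 0 gives (0, 2, 4, 6),
# row 1 gives (2, 4, 6, 0), and so on.
rows = [tuple(psi4(4 * r + c, 3) for c in range(4)) for r in range(16)]
```

Each row of four consecutive post-reversal addresses is a cyclic rotation of {0, 2, 4, 6} – exactly the property that the conflict-free schemes below rely upon.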

The symbols ">>" and "<<" correspond to the binary right-shift and left-shift operations, respectively, which together with the familiar modulo operation, abbreviated here to "mod", may be straightforwardly and cost-effectively implemented in programmable logic. Introducing now the function "Γ" for representing the DM bank addresses, the initial/final data to/from the R24 FHT may be written/read to/from the DM banks according to:


Definition of Mapping for Pre-FHT and Post-FHT Addressing:

Γ₁(n, α) = Ψ₄(n, α) + (n mod 2)    (6.2)

so that Γ₁(n, α) ∈ {0, 1, ..., 7}.

Note that this mapping also holds true for the DM accesses made from within the R24 FHT for the execution of the first stage of GD-BFLYs, where all eight of the DM banks are utilized, whilst those for the remaining stages are carried out according to:

Definition of Mapping for Double Butterfly Addressing:

Γ₂(k, n, α) = Ψ₄(n, α) + (k mod 2)    (6.3)

so that Γ₂(k, n, α) ∈ {0, 1, ..., 7},

where the parameter k ∈ {0, 1, ..., (N/8)−1} corresponds to the GD-BFLY execution number for the current temporal stage.

Thus, from Equation 6.3 above, if the execution number k for the current temporal stage is an even-valued integer, then the DM banks required for that particular instance of the GD-BFLY will be the four even-addressed banks, whilst if the execution number is an odd-valued integer, the DM banks required will be the four odd-addressed banks. Having determined the DM bank to which a particular sample belongs, its location within that DM bank may then be straightforwardly obtained, via the function "Φ", according to:

Definition of Mapping for Address Offset:

Φ(n) = n >> 3    (6.4)

so that Φ(n) ∈ {0, 1, ..., (N/8)−1},

where the parameter n ∈ {0, 1, ..., N−1} corresponds to the sample address after di-bit reversal.
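A short Python sketch ties the three mappings together (the function names `gamma1`, `gamma2` and `phi` are ours, standing for Γ₁, Γ₂ and Φ). For N = 64 it checks that the pre-FHT mapping places each group of eight consecutive post-reversal addresses into eight distinct banks, and that the (bank, offset) pair identifies every sample uniquely:

```python
def psi4(n, alpha):
    # Equation 6.1: base-4 digit sum of n, modulo 4, doubled.
    s = sum(((n % 4 ** k) >> (2 * (k - 1))) % 4 for k in range(1, alpha + 1))
    return (s % 4) << 1

def gamma1(n, alpha):
    # Equation 6.2: DM bank address for pre-FHT/post-FHT (and first-stage) accesses.
    return psi4(n, alpha) + (n % 2)

def gamma2(k, n, alpha):
    # Equation 6.3: DM bank address for double-butterfly accesses in later stages.
    return psi4(n, alpha) + (k % 2)

def phi(n):
    # Equation 6.4: word offset of sample n within its DM bank.
    return n >> 3

N, ALPHA = 64, 3
# Each first-stage GD-BFLY reads samples 8k .. 8k+7; under gamma1 these land
# in eight distinct banks, one sample per bank.
first_stage_banks = [
    sorted(gamma1(8 * k + j, ALPHA) for j in range(8)) for k in range(N // 8)
]
```

The parity of `gamma2` also follows immediately from Equation 6.3: since Ψ₄ only takes even values, an even execution number selects even-addressed banks and an odd execution number selects odd-addressed banks.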

To better understand the workings of these rotation-based memory mappings for the storage/accession of the data, it is best to first visualize the data as being stored in a two-dimensional array of four columns and N/4 rows, where the data is stored on a row-by-row basis, with four samples to a row. The effect of the generic address mapping, Ψ₄, as shown in the example given in Table 6.1 below, is to apply a left-sense rotation to each row of data, where the amount of rotation is dependent upon the particular (N/4)×4 sub-array to which it belongs, as well as the particular (N/16)×4 sub-array within that sub-array, as well as the particular (N/64)×4 sub-array within that sub-array, etc., until all the relevant partitions have been accounted for – there are log₄N of these. As a result, there is a cyclic rotation being applied to the data over each such sub-array – the cyclic nature of the mapping means that within each sub-array the amount of rotation to be applied to a given row of data is one position greater than that for the preceding row. This property, as will later be seen, may be beneficially exploited by the GD-BFLY through the way in which it stores/accesses


Table 6.1 Structure of generic address mapping Ψ₄ for case of length-64 data set

Row | Value of generic address mapping Ψ₄
 0  | 0 2 4 6
 1  | 2 4 6 0
 2  | 4 6 0 2
 3  | 6 0 2 4
 4  | 2 4 6 0
 5  | 4 6 0 2
 6  | 6 0 2 4
 7  | 0 2 4 6
 8  | 4 6 0 2
 9  | 6 0 2 4
10  | 0 2 4 6
11  | 2 4 6 0
12  | 6 0 2 4
13  | 0 2 4 6
14  | 2 4 6 0
15  | 4 6 0 2

Table 6.2 Structure of address mapping Γ₁ for case of length-64 data set

Row | Value of address mapping Γ₁
 0  | 0 3 4 7
 1  | 2 5 6 1
 2  | 4 7 0 3
 3  | 6 1 2 5
 4  | 2 5 6 1
 5  | 4 7 0 3
 6  | 6 1 2 5
 7  | 0 3 4 7
 8  | 4 7 0 3
 9  | 6 1 2 5
10  | 0 3 4 7
11  | 2 5 6 1
12  | 6 1 2 5
13  | 0 3 4 7
14  | 2 5 6 1
15  | 4 7 0 3

the elements of the input/output data sets, for both individual instances of the GD-BFLY, via the address mapping Γ₁, as well as for consecutive pairs of instances, via the address mapping Γ₂, over all eight memory banks. Examples of the address mappings Γ₁ and Γ₂ are given in Tables 6.2 and 6.3, respectively, where each pair of consecutive rows of bank addresses corresponds to the locations of a complete GD-BFLY input/output data set.

Suppose now, for ease of exposition, that the arithmetic within the GD-BFLY can be assumed to be carried out fast enough to allow for the data sets processed by the GD-BFLY to be both read from and written back to DM within a single clock cycle – this is not of course actually achievable, and a more realistic scenario is to be discussed later in Section 6.4 when the concept of internal pipelining within the PE is introduced.


Table 6.3 Structure of address mapping Γ₂ for case of length-64 data set

Row | Value of address mapping Γ₂
 0  | 0 2 4 6
 1  | 2 4 6 0
 2  | 5 7 1 3
 3  | 7 1 3 5
 4  | 2 4 6 0
 5  | 4 6 0 2
 6  | 7 1 3 5
 7  | 1 3 5 7
 8  | 4 6 0 2
 9  | 6 0 2 4
10  | 1 3 5 7
11  | 3 5 7 1
12  | 6 0 2 4
13  | 0 2 4 6
14  | 3 5 7 1
15  | 5 7 1 3

The input/output data set to/from the GD-BFLY comprises four even-address samples and four odd-address samples, where for a given instance of the GD-BFLY for the first temporal stage, each of the eight DM banks will contain just one sample, as required, whilst for a given instance of the GD-BFLY for the remaining α−1 temporal stages, four of the eight DM banks will each contain one even-address sample and one odd-address sample, with the remaining four DM banks being unused. As a result, it is generally not possible to carry out all eight reads/writes for the same data set using all eight DM banks in a single clock cycle. However, if for all but the first temporal stage we consider any pair of consecutive instances of the GD-BFLY, then it may be shown that the sample addresses of the second instance will occupy the four DM banks not utilized by the first, so that every two clock cycles the eight even-address samples and the eight odd-address samples required by the pair of consecutive instances of the GD-BFLY may be both read from and written to DM, as required for conflict-free and in-place memory addressing – see Fig. 6.2 below.
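This bank-occupancy pattern can be checked directly from Equation 6.3. The sketch below (helper names ours, N = 64) reproduces Table 6.3 row pair by row pair and confirms that an even-numbered butterfly occupies each of the four even-addressed banks exactly twice, whilst its odd-numbered successor occupies the four odd-addressed banks:

```python
def psi4(n, alpha):
    # Equation 6.1 rotation mapping: base-4 digit sum mod 4, doubled.
    s = sum(((n % 4 ** k) >> (2 * (k - 1))) % 4 for k in range(1, alpha + 1))
    return (s % 4) << 1

def gamma2(k, n, alpha):
    # Equation 6.3 bank mapping for stages other than the first.
    return psi4(n, alpha) + (k % 2)

ALPHA = 3  # i.e. N = 64

def butterfly_banks(k):
    # Bank addresses of the eight samples n = 8k .. 8k+7 of butterfly k,
    # i.e. rows 2k and 2k+1 of Table 6.3.
    return [gamma2(k, 8 * k + j, ALPHA) for j in range(8)]

# Butterfly 0 -> banks (0, 2, 4, 6, 2, 4, 6, 0): rows 0 and 1 of Table 6.3.
# Butterfly 1 -> banks (5, 7, 1, 3, 7, 1, 3, 5): rows 2 and 3 of Table 6.3.
```

Together, a pair of consecutive butterflies therefore touches all eight banks, two samples per bank, which is what permits one complete pair of input/output data sets to be transferred every two clock cycles.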

Thus, based upon our simplistic assumption, all eight DM banks for the first temporal stage may be both read from and written to within a single clock cycle, whilst for the remaining α−1 temporal stages it can be shown that in any one clock cycle all the input samples for one instance of the GD-BFLY may be both read from DM and processed by the GD-BFLY, whilst all those output samples produced by its predecessor may be written back to DM. As a result, the solution based upon the single-PE R24 FHT architecture will be able to yield complete GD-BFLY output sets at the rate of one set per clock cycle, as required.

An alternative way of handling the pipelining for the last α−1 temporal stages would be to read just four samples for the first clock cycle, with one sample from each of the four even-addressed memory banks. This would be followed by eight samples for each succeeding clock cycle apart from the last, with four samples for the current instance of the GD-BFLY being read from the four


Fig. 6.2 Addressing of hypothetical pair of consecutive generic double butterflies for all stages other than first – the first butterfly of the pair occupies one set of four alternate memory banks and the second butterfly of the pair the other four, with two samples (one ES, one OS) per bank across banks 0–7 (ES – even-address sample; OS – odd-address sample)

even-addressed/odd-addressed memory banks and four samples for the succeeding instance of the GD-BFLY being read from the remaining four odd-addressed/even-addressed memory banks. The processing would be completed by reading just four samples for the last clock cycle, with one sample from each of the four odd-addressed memory banks. In this way, for each clock cycle apart from the first and the last, eight samples could be read/written from/to all eight memory banks, one sample per memory bank, with one complete set of eight GD-BFLY outputs being thus produced and another partly produced, to be completed on the succeeding clock cycle. Note, however, that a temporary buffer would be needed to hold one complete GD-BFLY output set, as the samples written back to memory would also need to come from consecutive GD-BFLY output sets, rather than from a single GD-BFLY output set, due to the dual-port nature of the memory. For the last clock cycle, the remaining set of eight GD-BFLY outputs could also be written out to all eight memory banks, again one sample per memory bank.

The choice of how best to carry out the pipelining is really down to the individual HDL programmer, but for the purposes of consistency within the current monograph, it will be assumed that all the samples required for a given instance of the GD-BFLY are to be read from the DM within the same clock cycle, two samples per even-addressed/odd-addressed memory bank as originally described, so that all the input samples for one instance of the GD-BFLY may be both read from DM and processed by the GD-BFLY, whilst all those output samples produced by its predecessor are written back to DM.

6.3.2 Trigonometric Coefficient Storage, Accession and Generation

Turning now to the trigonometric coefficients, the GD-BFLY, as described in Chapter 4, requires that six non-trivial trigonometric coefficients be either accessed


from CM or efficiently generated in order to be able to carry out the GD-BFLY processing for a given data set. Two schemes are now outlined for performing this task, whereby all six trigonometric coefficients may be accessed simultaneously, within a single clock cycle, these schemes offering a straightforward trade-off of memory requirement against addressing complexity – as measured in terms of the number of arithmetic/logic operations required for computing the necessary addresses. The two schemes considered cater for those extremes whereby the requirement is either to minimize the arithmetic complexity or to minimize the CM requirement. Clearly, other options that fall between these two extremes are also possible, but these may be easily defined and developed given an understanding of the techniques discussed here and in Section 4.6 of Chapter 4.

6.3.2.1 Minimum-Arithmetic Addressing Scheme

The trigonometric coefficient set comprises cosinusoidal and sinusoidal terms for the single-angle, double-angle and triple-angle cases. Therefore, in order for all six trigonometric coefficients to be obtained simultaneously, three LUTs may be exploited, with the two single-angle coefficients being read from the first LUT, the two double-angle coefficients from the second LUT, and the two triple-angle coefficients from the third LUT. To keep the arithmetic complexity of the addressing to a minimum, each LUT may be defined as in Section 4.6.1 of Chapter 4, being sized according to the single-quadrant addressing scheme, whereby the trigonometric coefficients are read from a sampled version of the sinusoidal function with argument defined from 0 up to π/2 radians. Thus, for the case of an N-point R24 FHT, it is required that each of the three single-quadrant LUTs be of size N/4 words, yielding a total CM requirement, denoted C_MEM^(Aopt), of

C_MEM^(Aopt) = (3/4)N    (6.5)

words.

This scheme would seem to offer a reasonable compromise, therefore, between the CM requirement and the addressing complexity, using more memory than is theoretically necessary, in terms of replicated LUTs, in order to keep the arithmetic/logic requirement of the addressing as simple as possible – namely, a zero arithmetic complexity when using the twelve-multiplier version of the GD-BFLY or six additions when using the nine-multiplier version.

6.3.2.2 Minimum-Memory Addressing Scheme

Another approach to the problem is to adopt a two-level LUT for the first of the three angles, where the associated complementary-angle LUTs are as defined in Section 4.6.2 of Chapter 4, comprising one coarse-resolution region of (1/2)√N words


for the sinusoidal function and one fine-resolution region of (1/2)√N words for each of the cosinusoidal and sinusoidal functions. To keep the CM requirement to a minimum, the double-angle and triple-angle trigonometric coefficients are then obtained straightforwardly through the application of standard trigonometric identities, as given by Equations 4.65–4.68 of Chapter 4, so that the solution requires that three complementary-angle LUTs be used for just the single-angle trigonometric coefficient case, each LUT of size (1/2)√N words, yielding a total CM requirement, denoted C_MEM^(Mopt), of

C_MEM^(Mopt) = (3/2)√N    (6.6)

words.

The double-angle and triple-angle trigonometric coefficients could also be obtained by assigning a two-level LUT to the storage of each, but the associated arithmetic complexity involved in generating the addresses turns out to be identical to that obtained when the trigonometric coefficients are obtained through the direct application of standard trigonometric identities, so that in this instance the replication of the two-level LUT provides us with three times the memory requirement but with no arithmetic advantage as compensation.

With the proposed technique, therefore, the CM requirement, as given by Equation 6.6, is minimized at the expense of additional arithmetic/logic for the addressing – namely, an arithmetic complexity of seven multiplications and eight additions when using the twelve-multiplier version of the GD-BFLY or seven multiplications and 14 additions when using the nine-multiplier version.
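The identity-based generation step can be sketched in Python as follows. Equations 4.65–4.68 belong to Chapter 4 and are not reproduced here; the sketch simply applies the standard double-angle and triple-angle identities, so it should be read as an illustration of the idea rather than as the monograph's exact formulation:

```python
import math

def expand_single_angle(s1, c1):
    """Given s1 = sin(a) and c1 = cos(a) read from the complementary-angle
    LUTs, generate the double-angle and triple-angle coefficients with the
    standard trigonometric identities, avoiding any further LUT accesses."""
    s2 = 2.0 * s1 * c1               # sin(2a) = 2.sin(a).cos(a)
    c2 = 1.0 - 2.0 * s1 * s1         # cos(2a) = 1 - 2.sin^2(a)
    s3 = s1 * (3.0 - 4.0 * s1 * s1)  # sin(3a) = 3.sin(a) - 4.sin^3(a)
    c3 = c1 * (4.0 * c1 * c1 - 3.0)  # cos(3a) = 4.cos^3(a) - 3.cos(a)
    return s2, c2, s3, c3
```

Counting operations in the sketch gives a feel for where the extra multiplications of Versions III and IV come from, although the precise counts quoted above also include the two-level (coarse/fine-resolution) angle reconstruction.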

6.3.2.3 Summary of Addressing Schemes

The results of this section are summarized in Table 6.4 below, where the CM requirement and arithmetic complexity for each of the conflict-free parallel addressing schemes are given. A trade-off has clearly to be made between CM requirement and arithmetic complexity, with the choice being ultimately made according to the resources available on the target hardware. Versions I and II of the solution to the R24 FHT correspond to the adoption of the minimum-arithmetic addressing scheme for the twelve-multiplier and nine-multiplier PEs, respectively, whilst Versions III and IV correspond to the adoption of the minimum-memory addressing scheme for the twelve-multiplier and nine-multiplier PEs, respectively.

The trigonometric coefficient accession/generation schemes required for Versions I to IV of the above solution are illustrated via Figs. 6.3–6.6, respectively, with the associated arithmetic complexity for the addressing given by zero when using Version I of the R24 FHT solution, six additions when using Version II, seven multiplications and eight additions when using Version III, and seven multiplications and 14 additions when using Version IV.


Table 6.4 Performance/resource comparison for fast multiplier versions of N-point regularized FHT

Version of | Arithmetic complexity        | Memory requirement (words)               | Time complexity (clock cycles)
solution   | Processing    | Coefficient  | Data           | Coefficients            | Update time  | Latency
           | element       | generator    |                |                         |              |
           | Mults | Adders| Mults | Adders|               |                         |              |
I          | 12    | 22    | 0     | 0    | 8 × (1/8)N = N | 3 × (1/4)N = (3/4)N     | (1/8)N·log₄N | (1/8)N·log₄N
II         | 9     | 25    | 0     | 6    | 8 × (1/8)N = N | 3 × (1/4)N = (3/4)N     | (1/8)N·log₄N | (1/8)N·log₄N
III        | 12    | 22    | 7     | 8    | 8 × (1/8)N = N | 3 × (1/2)√N = (3/2)√N   | (1/8)N·log₄N | (1/8)N·log₄N
IV         | 9     | 25    | 7     | 14   | 8 × (1/8)N = N | 3 × (1/2)√N = (3/2)√N   | (1/8)N·log₄N | (1/8)N·log₄N


Fig. 6.3 Resources required for trigonometric coefficient accession/generation for Version I of solution with one-level LUTs – the coefficients D1–D9 are obtained from the six values S1–S3 and C1–C3, where Sn = sin(nθ) and Cn = cos(nθ) are read from LUT[n], Dn denotes the nth coefficient and each LUT[n] is of size (1/4)N words

Fig. 6.4 Resources required for trigonometric coefficient accession/generation for Version II of solution with one-level LUTs – the coefficients D1–D9 are obtained from the six values S1–S3 and C1–C3, where Sn = sin(nθ) and Cn = cos(nθ) are read from LUT[n], Dn denotes the nth coefficient and each LUT[n] is of size (1/4)N words

Note that with the minimum-memory addressing scheme of Figs. 6.5 and 6.6, pipelining will certainly need to be introduced so as to ensure that a complete new set of trigonometric coefficients is available for input to the GD-BFLY for each new clock cycle.


Fig. 6.5 Resources required for trigonometric coefficient accession/generation for Version III of solution with two-level LUT – pipelining (delay stages) required to maintain computational throughput; S1 = sin(α) and C1 = cos(α) are obtained via LUT[1], S2 = sin(β) via LUT[2] and C2 = cos(β) via LUT[3], from which the coefficient groups D1–D3, D4–D6 and D7–D9 are generated, Dn denoting the nth coefficient and each LUT being of size (1/2)√N words

Fig. 6.6 Resources required for trigonometric coefficient accession/generation for Version IV of solution with two-level LUT – pipelining (delay stages) required to maintain computational throughput; S1 = sin(α) and C1 = cos(α) are obtained via LUT[1], S2 = sin(β) via LUT[2] and C2 = cos(β) via LUT[3], from which the coefficient groups D1–D3, D4–D6 and D7–D9 are generated, Dn denoting the nth coefficient and each LUT being of size (1/2)√N words

6.4 Design of Pipelined PE for Single-PE Architecture

To exploit the multi-bank memories and LUTs, together with the associated conflict-free and (for the data) in-place parallel memory addressing schemes, the PE needs now to be able to produce one complete GD-BFLY output set per clock cycle, as discussed in Section 6.3.1, bearing in mind that although, for the first temporal stage,


all eight DM banks can be both read from and written to within the same clock cycle, for the remaining temporal stages, only those four DM banks not currently being read from may be written to (and vice versa).

6.4.1 Internal Pipelining of Generic Double Butterfly

The above constraint suggests that a suitable PE design may be achieved if the GD-BFLY is carried out by means of a β-stage computational pipeline, as shown in the simple example of Fig. 6.7, where β is an odd-valued integer and where each CS of the pipeline contains its own set of storage registers for holding the current set of processed samples. In this way, if a start-up delay of D_CG clock cycles is required for a pipelined version of the trigonometric coefficient generator and D_PE clock cycles for a pipelined version of the PE, where

D_PE = β − 1,    (6.7)

then after a total start-up delay of D_SU clock cycles for the first temporal stage of the processing, where

D_SU = D_CG + D_PE,    (6.8)

the PE will be able to read in eight samples and write out eight samples every clock cycle, thereby enabling the first temporal stage to be completed in D_SU + N/8 clock cycles, and subsequent temporal stages to be completed in N/8 clock cycles. Note that the pipeline delay D_PE must account not only for the sets of adders and permutators, but also for the fixed-point multipliers, which are themselves typically implemented as pipelines, possibly requiring as many as five CSs according to the required precision. As a result, it is likely that at least nine CSs might be required

Fig. 6.7 Parallel solution for PE using five-stage computational pipeline (PE0–PE4) – for stage 0, both even-addressed (EB) and odd-addressed (OB) memory banks are read from and written to at the same time, one sample per memory bank; for stages 1 to α−1, when even-addressed memory banks are read from, odd-addressed memory banks are written to, and vice versa, two samples per memory bank


Fig. 6.8 Memory structure and interconnections for internally-pipelined partitioned-memory PE – the computational stages CS0, CS1, ..., CSβ−1 are fed, via the address generation logic, by six reads from the CM banks CM0–CM2 and eight reads from the DM banks DM0–DM7, with eight writes back to the DM banks (CM – coefficient memory; CS – computational stage; DM – data memory)

for implementation of the computational pipeline, with each temporal stage of the R24 FHT requiring the PE to execute the pipeline a total of N/8 times.

A description of the pipelined PE, including the structure of the memory for both the data and the trigonometric coefficients, together with its associated interconnections, is given in Fig. 6.8.

Note, however, that depending upon the relative lengths of the computational pipeline, β, and the transform, N, an additional delay may need to be applied for every temporal stage, not just the first, in order to ensure that sample sets are not updated in one temporal stage before they have been processed and written back to DM in the preceding temporal stage, as this would result in the production of invalid outputs. If the transform length is sufficiently greater than the pipeline delay, however, this problem may be avoided – these points are discussed further in Section 6.4.3.

6.4.2 Space Complexity Considerations

The space complexity is determined by the combined requirements of the multi-bank dual-port memory and the arithmetic/logic components. Adopting the minimum-arithmetic addressing scheme of Versions I and II of the R24 FHT solution (as detailed in Table 6.4), the worst-case total memory requirement for the partitioned-memory single-PE architecture, denoted M_FHT^(W), is given by


M_FHT^(W) = 8 × ((1/8)N) + 3 × ((1/4)N) = (7/4)N    (6.9)

words, where N words are required by the eight-bank DM and 3N/4 words by the three single-quadrant LUTs required for the CM. In comparison, by adopting the minimum-memory addressing scheme of Versions III and IV of the R24 FHT solution (as detailed in Table 6.4), the best-case total memory requirement for the partitioned-memory single-PE architecture, denoted M_FHT^(B), is given by

M_FHT^(B) = 8 × ((1/8)N) + 3 × ((1/2)√N) = N + (3/2)√N    (6.10)

words, where N words are required by the eight-bank DM and (3/2)√N words by the three complementary-angle LUTs required for the CM.
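Equations 6.9 and 6.10 are easily tabulated; the short Python sketch below (helper names ours) evaluates both memory budgets for a power-of-four transform length:

```python
def memory_words_worst(n):
    # Equation 6.9: eight DM banks of N/8 words plus three
    # single-quadrant LUTs of N/4 words each.
    return 8 * (n // 8) + 3 * (n // 4)

def memory_words_best(n):
    # Equation 6.10: eight DM banks of N/8 words plus three
    # complementary-angle LUTs of sqrt(N)/2 words each
    # (N a power of four, so sqrt(N) is an exact integer).
    root = int(round(n ** 0.5))
    return 8 * (n // 8) + 3 * (root // 2)

# For N = 4096: worst case (7/4)N = 7168 words,
# best case N + (3/2)sqrt(N) = 4096 + 96 = 4192 words.
```

The gap between the two grows linearly with N, since the coefficient storage of the minimum-memory scheme grows only with √N.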

The arithmetic/logic requirement is dominated by the presence of the dedicated fast fixed-point multipliers, with a total of nine or 12 being required by the GD-BFLY and up to seven for the memory addressing, depending upon the chosen addressing scheme.

6.4.3 Time Complexity Considerations

The partitioned-memory single-PE architecture, based upon the internally-pipelined PE described in Section 6.4.1, enables a new GD-BFLY output set to be produced every clock cycle. Therefore, the first temporal stage will be completed in D_SU + N/8 clock cycles and subsequent temporal stages in either N/8 clock cycles or D_SM + N/8 clock cycles, where the additional delay D_SM provides the necessary safety margin to ensure that the outputs produced from each stage are valid. The delay depends upon the relative lengths of the computational pipeline and the transform and may range from zero to as large as D_PE. As a result, the N-point R24 FHT, where N is as given by Equation 5.2, has a worst-case time complexity, denoted T_FHT^(W), of

T_FHT^(W) = (D_SU + (1/8)N) + (α − 1) × (D_SM + (1/8)N)
          = (D_SU + (α − 1)D_SM) + (1/8)N·log₄N    (6.11)

clock cycles, and a best-case or standard time complexity, denoted T_FHT^(B), for when the safety margin delay is not required, of


T_FHT^(B) = (D_SU + (1/8)N) + (α − 1) × (1/8)N
          = D_SU + (1/8)N·log₄N    (6.12)

clock cycles, given that α = log₄N. More generally, for any given combination of pipeline length, β, and transform length, N, it should be a straightforward task to calculate the exact safety margin delay, D_SM, required after each temporal stage in order to guarantee the generation of valid outputs, although for most parameter combinations of practical interest it will almost certainly be set to zero, so that the time complexity for each instance of the transform will be as given by Equation 6.12.
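Equations 6.11 and 6.12 translate directly into a few lines of Python (illustrative helper names; the start-up delay D_SU and safety margin D_SM are taken as given inputs):

```python
def log4(n):
    # alpha = log4(N), for N a power of four.
    alpha = 0
    while n > 1:
        n //= 4
        alpha += 1
    return alpha

def cycles_worst(n, d_su, d_sm):
    # Equation 6.11: start-up delay, a safety margin after each of the
    # alpha - 1 later stages, and N/8 butterfly executions per stage.
    return d_su + (log4(n) - 1) * d_sm + (n // 8) * log4(n)

def cycles_best(n, d_su):
    # Equation 6.12: the safety margin set to zero.
    return cycles_worst(n, d_su, 0)
```

For example, a 4,096-point transform with a start-up delay of D_SU = 20 cycles needs 20 + 512 × 6 = 3,092 clock cycles in the best case.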

Note that a multi-PE R24 FHT architecture, based upon the adoption of an α-stage computational pipeline, could only yield this level of performance by exploiting up to α times as much silicon as the single-PE R24 FHT architecture, assuming that the PEs in the pipeline are working in sequential fashion with the data and trigonometric coefficients stored in global memory – that is, with the reads/writes being performed at a rate of one per clock cycle. Each stage of a pipelined multi-PE R24 FHT architecture requires the reading/writing of all N samples, so that α − 1 double-buffered memories – each holding up to 2N samples to cater for both inputs and outputs of the PE – are typically required for connecting the PEs in the pipeline.

6.5 Performance and Requirements Analysis of FPGA Implementation

The theoretical complexity requirements discussed above have been proven in silicon by TRL Technology (a member of L3 Communications Corporation, U.S.A.) in the U.K., who have produced a generic real-data radix-4 FFT implementation – based upon the R24 FHT – on a Xilinx Virtex-II Pro 100 FPGA [8], running at close to 200 MHz, for use in various wireless communication systems. A simple comparison with the state-of-the-art performances of the RFEL QuadSpeed FFT [4] and Roke Manor Research FFT solutions [5] (both multi-PE designs from the U.K. whereby a complex-data FFT may be used to process simultaneously two real-valued data sets – packing/unpacking of the input/output data sets therefore needs to be accounted for) is given in Table 6.5 for the case of 4K-point and 16K-point real-data FFTs (where 1K ≡ 1,024), where the RFEL and Roke Virtex-II Pro 100 results are extrapolated from company data sheets and where the Version II solution of the R24 FHT described in Section 6.3.2.3 – using the minimum-arithmetic addressing scheme together with a nine-multiplier PE – is assumed for the TRL solution. Clearly, many alternatives to these two commercially-available devices could have been used for the purposes of this comparison, but at the time the comparison was made, these devices were both considered to be viable options with performances that were (and still are) quite representative of this particular class of multi-PE streaming FFT solution.


Table 6.5 Performance and resource utilization for 4K-point and 16K-point real-data radix-4 FFTs (clock frequency: 200 MHz)

Solution | FFT    | Input word | 1K×18 RAMs            | 18×18           | Logic slices     | I/O speed       | Update time per    | Latency per
         | length | length     | (with double          | multipliers     |                  | (samples/cycle) | real-data FFT (μs) | real-data FFT (μs)
         |        |            | buffering)            |                 |                  |                 |                    |
TRL(a)   | 4K     | 18         | 11 (2.5% capacity)    | 9 (2.0%)        | ≈5,000 (5.0%)    | 1 (×1 channel)  | ≈15                | ≈15
RFEL(b)  | 4K     | 12         | 33 (7.5% capacity)    | 30 (6.8%)       | ≈5,000 (5.0%)    | 4 (×2 channels) | ≈10                | ≈21
ROKE(b)  | 4K     | 14         | 42 (9.5% capacity)    | 48 (10.8%)      | ≈3,800 (3.8%)    | 4 (×2 channels) | ≈10                | ≈21
TRL(a)   | 16K    | 18         | 44 (9.9% capacity)    | 9 (2.0%)        | ≈5,000 (5.0%)    | 1 (×1 channel)  | ≈72                | ≈72
RFEL(b)  | 16K    | 12         | 107 (24.1% capacity)  | 37 (8.3%)       | ≈6,500 (6.5%)    | 4 (×2 channels) | ≈41                | ≈83
ROKE(b)  | 16K    | 10         | 124 (28.0% capacity)  | 55 (12.4%)      | ≈5,800 (5.8%)    | 4 (×2 channels) | ≈41                | ≈83

(a) DHT-to-DFT conversion not accounted for in figures
(b) Packing/unpacking requirement not accounted for in figures


Note that the particular choice of the real-from-complex strategy for the two commercially-available solutions has been made to ensure that we compare like with like, or as close as we can make it, as the adoption of the DDC-based approach would introduce additional filtering operations to complicate the issue, together with an accompanying processing delay. As a matter of interest, for an efficient implementation with the particular device used here, the Virtex-II Pro 100, a complex DDC with 84 dB of spurious-free dynamic range (SFDR) has been shown to require approximately 1,700 slices of programmable logic [1].

Although the performances, in terms of the update time and latency figures, are similar for the solutions described, it is clear from the respective I/O requirements that the RFEL and Roke performance figures are achieved at the expense of having to process twice as much data at a time (two channels yielding two output sets instead of one) as the TRL solution and (for the case of an N-point transform) having to execute N/2 radix-2 butterflies every N/2 clock cycles, so that the pipeline needs to be fed with data generated by the ADC unit(s) at the rate of N complex-valued (or 2N real-valued) samples every N/2 clock cycles. This means generating the samples at four times the speed (four samples per clock cycle instead of just one) of the TRL solution, which might in turn involve the use of multiple ADC units. The results highlight the fact that although the computational densities of the three solutions are not that dissimilar, the TRL solution is considerably more area-efficient, requiring a small fraction of the memory and fast multiplier requirements of the other two solutions in order to satisfy the latency constraint, whilst the logic requirement – as required for controlling the operation and interaction of the various components of the FPGA implementation – which increases significantly with transform length for the RFEL and Roke solutions, remains relatively constant.

The scalable nature of the TRL solution means that only the memory requirement needs substantially changing from one transform length to another in order to reflect the increased/decreased quantity of data to be processed, making the cost of adapting the solution for new applications negligible. For longer transforms, better use of the resources could probably be achieved by trading off memory against fast multiplier requirement through the choice of a more memory-efficient addressing scheme – as discussed above in Section 6.3. Note that double buffering is assumed for the sizing of the TRL solution in order to support continuous processing, whereby the I/O is limited to N clock cycles, this resulting in a doubling of the DM requirement.

6.6 Constraining Latency Versus Minimizing Update-Time

An important point to note is that most, if not all, of the commercially-available FFT solutions are multi-PE solutions geared to streaming operation where the requirement relates to the minimization of the update time – so as to maximize the throughput – rather than satisfying some constraint on the latency, as has been addressed in this monograph with the design of the R2⁴ FHT. In fact, the point should perhaps be re-made here that in the "Statement of Performance Objective No. 2" made at the beginning of the chapter, the requirement was simply that the N-point transform be executed within N clock cycles. Now from Equation 6.12, this is clearly achievable for all transform lengths up to and including 64K, so that for transforms larger than this it would be necessary to increase the throughput rate by an appropriate amount in order that continuous operation be maintained – note that two PEs would maintain continuous operation for N ≤ 4¹⁶.
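The continuous-operation condition can be checked numerically. The sketch below is a hypothetical cycle-count model, assuming that each of the log₄N stages of the R2⁴ FHT contains N/8 generic double butterflies shared across P PEs, with one butterfly issued per PE per clock cycle – a model consistent with the 64K single-PE limit quoted above, but an assumption nonetheless:

```c
#include <assert.h>

/* Hypothetical cycle-count model: log4(N) stages, each of N/8 generic
 * double butterflies, shared across 'pes' processing elements with one
 * butterfly launched per PE per clock cycle.                           */
static unsigned long fht_cycles(unsigned long n, unsigned long pes)
{
    unsigned long stages = 0, m = n;
    while (m > 1) { m >>= 2; stages++; }   /* stages = log4(N) */
    return (n / (8UL * pes)) * stages;
}

/* Continuous operation: the transform must finish inside the N-cycle
 * window in which the next data set is being loaded.                   */
static int continuous(unsigned long n, unsigned long pes)
{
    return fht_cycles(n, pes) <= n;
}
```

Under this model a single PE just meets the N-cycle window at N = 64K, whilst two PEs extend continuous operation to far longer transforms.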

To clarify the situation, whereas with a pipelined FFT approach, as adopted by the multi-PE commercially-available solutions, one is able to attain a high throughput rate by effectively minimizing the update time, with the R2⁴ FHT it is possible to increase the throughput rate by adopting an SIMD-type approach, either via a "multi-R2⁴ FHT" solution, whereby multiple R2⁴ FHTs are used to facilitate the simultaneous processing of multiple data sets, or via a "multi-PE" solution, whereby multiple PEs are used to facilitate the parallel processing of a single data set by means of a multi-PE version of the R2⁴ FHT. The multi-PE solution could thus be used to maintain continuous operation for the case of extremely large transform lengths, whereas the multi-R2⁴ FHT solution could be used to deal with those computationally-demanding applications where the throughput rate for the generation of each new N-point real-data FFT output data set needs to be greater than one set every N clock cycles.

With the multi-R2⁴ FHT approach, the attraction is that it is possible to share both the control logic and the CM between the R2⁴ FHTs, given that the LUTs contain precisely the same information and need to be accessed in precisely the same order for each R2⁴ FHT. Such an approach could also be used to some advantage, for example, when applied to the computation of the complex-data DFT, as discussed in Section 3.4.2 of Chapter 3, where one R2⁴ FHT is applied to the computation of the real component of the data and one R2⁴ FHT is applied to the computation of the imaginary component. A highly-parallel dual-R2⁴ FHT solution such as this would be able to attain, for the case of complex data, the eightfold speed-up already achieved for the real-data case over a purely sequential solution (now processing eight complex-valued samples per clock cycle rather than eight real-valued samples), yet for minimum additional resources.

With the multi-PE approach – restricting ourselves here to the simple case of two PEs – it needs firstly to be noted that as dual-port memory is necessary for the operation of the single-PE solution, so quad-port memory would be necessary for the operation of a dual-PE solution, so as to facilitate the reading/writing of two samples from/to each of the eight memory banks for each clock cycle, as well as the reading of four (rather than two) trigonometric coefficients from each of the LUTs, as shared by the PEs. Alternate instances of the GD-BFLY could be straightforwardly assigned to alternate PEs with all eight GD-BFLY inputs/outputs for each PE being read/written from/to memory simultaneously, so that conflict-free and in-place parallel memory addressing would be maintained for each PE.

At present, genuine quad-port memory is not available from the major FPGA manufacturers, so that the obtaining of such a facility may only be achieved through the modification of existing dual-port memory at the effective cost of a doubling of the memory requirement. A simple alternative may be obtained, however, by noting that with current FPGA technology there is typically an approximate factor of 2 difference between the dual-port memory read/write access time and the update time for the fast multipliers – and thus the update time of the GD-BFLY computational pipeline. As a result, by doubling the speed at which the reads/writes are performed, a pseudo quad-port memory capability may be achieved whereby the data is read/written from/to the dual-port memory at twice the rate of the computational pipeline and thus at a sufficient rate to sustain the operation of the pipeline.
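The pseudo quad-port scheme can be modelled in a few lines: a dual-port memory clocked at twice the pipeline rate services four accesses per pipeline cycle, two in each memory half-cycle. A toy behavioural model, with the bank size and word type chosen purely for illustration:

```c
#include <assert.h>

#define BANK_WORDS 16

typedef struct { int word[BANK_WORDS]; } bank_t;

/* One pipeline cycle of the pseudo quad-port scheme: the dual-port RAM
 * runs two half-cycles per pipeline cycle, each serving its two ports,
 * so four reads complete within the one pipeline clock.                */
static void quad_read(const bank_t *b, const int addr[4], int out[4])
{
    for (int half = 0; half < 2; half++) {           /* two memory clocks */
        out[2 * half + 0] = b->word[addr[2 * half + 0]];  /* port A */
        out[2 * half + 1] = b->word[addr[2 * half + 1]];  /* port B */
    }
}
```

The same time-multiplexing applies symmetrically to writes, which is what allows a dual-PE solution to keep its eight banks conflict-free without physically doubling the memory.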

The ideas considered in this section involving the use of multi-PE and multi-R2⁴ FHT solutions would seem to suggest that the throughput rate of the most advanced commercially-available solutions could be achieved for reduced quantities of silicon, so that the GD-BFLY-based PE could thus be used as a building block to define real-data FFT solutions to a range of problems according to whether the particular design objective involves the satisfying of some constraint on the latency, as addressed by this monograph, or the maximization of the throughput rate.

6.7 Discussion

The outcome of this chapter is the specification of a partitioned-memory single-PE computing architecture for the parallel computation of the R2⁴ FHT, together with the specification of conflict-free and (for the data) in-place parallel memory addressing schemes for both the data and the trigonometric coefficients, which enable the outputs from each instance of the GD-BFLY to be produced via this computing architecture within a single clock cycle. Four versions of the PE have been described – all based upon the use of a fixed-point fast multiplier and referred to as Versions I, II, III and IV of the solution – which provide the user with the ability to trade off arithmetic complexity, in terms of both adders and multipliers, against the memory requirement, with a theoretical performance and resource comparison of the four solutions being provided in tabular form. The mathematical/logical correctness of the operation of all four versions of the solution has been proven in software via a computer program written in the "C" programming language.

Silicon implementations of both 4K-point and 16K-point transforms have been studied, each using Version II of the R2⁴ FHT solution – which uses the minimum-arithmetic addressing scheme together with a nine-multiplier version of the PE – and the Xilinx Virtex-II Pro 100 device running at a clock frequency of close to 200 MHz. The R2⁴ FHT results were seen to compare very favourably with those of two commercially-available industry-standard multi-PE solutions, with both the 4K-point and 16K-point transforms achieving the stated performance objective whilst requiring greatly reduced silicon resources compared to their commercial complex-data counterparts. Note that although the target device family may be somewhat old, it was more than adequate for purpose, which was simply to facilitate comparison of the relative merits of the single-PE and multi-PE architectures. As already stated, with real-world applications it is not always possible, for various practical/financial reasons, to have access to the latest device technologies. Such a situation does tend to focus the mind, however, as one is then forced to work to within whatever silicon budget one happens to have been dealt.

Note that a number of scalable single-PE designs for the fixed-radix FFT [2, 6, 7], along the lines of that discussed in this chapter for the R2⁴ FHT, have already appeared in the technical literature over the past 10–15 years for the more straightforward complex-data case, each such solution using a simplified version of the memory addressing scheme discussed here whereby multi-bank memory is again used to facilitate the adoption of partitioned-memory processing.

Another important property of the proposed set of R2⁴ FHT designs discussed here is that they are able, via the application of the block floating-point scaling technique, to optimize the achievable dynamic range of the Hartley-space (and thus Fourier-space) outputs and therefore to outperform the more conventional streaming FFT solutions which, given the need to process the data as and when it arrives, are restricted to the use of various fixed scaling strategies in order to address the fixed-point overflow problem. With fully-optimized streaming operation, the application of block floating-point scaling would involve having to stall the optimal flow of data through the computational pipeline, as the entire set of outputs from each stage of butterflies needs to be passed through the "maximum" function in order that the required common exponent may be found. As a result, the block-based nature of the single-PE R2⁴ FHT operation means that it is also able to produce higher-accuracy transform-space outputs than is achievable by its multi-PE FFT counterparts.
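The block floating-point mechanism itself is simple to sketch: the whole block is passed through the "maximum" function to find a common exponent, then shifted so that headroom exists before the next stage of butterflies. A minimal fixed-point sketch, in which the 16-bit word length and caller-chosen headroom are illustrative assumptions rather than figures from the text:

```c
#include <assert.h>

/* Block floating-point scaling: one exponent is shared by the whole
 * block, chosen from the block maximum so that 'headroom_bits' bits of
 * growth can be absorbed by the next stage of butterflies.  A 16-bit
 * word length is assumed purely for illustration.                      */
static int bfp_scale(int *x, int n, int headroom_bits)
{
    int max = 0, shift = 0;
    for (int i = 0; i < n; i++) {
        int m = x[i] < 0 ? -x[i] : x[i];   /* the "maximum" function    */
        if (m > max) max = m;
    }
    while ((max >> shift) >= (1 << (15 - headroom_bits)))
        shift++;                           /* common block exponent     */
    for (int i = 0; i < n; i++)
        x[i] >>= shift;                    /* arithmetic shift assumed  */
    return shift;
}
```

Because the whole block must be inspected before the shift is known, this is naturally a block-based operation, which is precisely why it sits comfortably within the single-PE design but would stall a fully streaming pipeline.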

Finally, it should be noted that the data re-ordering – carried out here by means of the di-bit reversal mapping – to be applied to the input data to the transform can be comfortably carried out in less than N clock cycles, for a length-N transform, so that performance may be maintained through the use of double buffering, whereby one data set is being re-ordered and written to one set of DM banks whilst another data set – its predecessor – is being read/written from/to another set of DM banks by the R2⁴ FHT. The functions of the two sets of DM banks are then interchanged after the completion of each R2⁴ FHT. Thus, we may set up what is essentially a two-stage pipeline, where the first stage of the pipeline carries out the task of data re-ordering and the second carries out the R2⁴ FHT on the re-ordered data. The data re-ordering may be carried out in various ways, as already outlined in Section 2.4 of Chapter 2.

References

1. R. Hosking, New FPGAs tackle real-time DSP tasks for defense applications, Boards & Solutions Magazine (November 2006)

2. L.G. Johnson, Conflict-free memory addressing for dedicated FFT hardware. IEEE Trans. Circuits Syst. II: Analog Dig. Signal Proc. 39(5), 312–316 (1992)

3. K.J. Jones, R. Coster, Area-efficient and scalable solution to real-data fast Fourier transform via regularised fast Hartley transform. IET Signal Process. 1(3), 128–138 (2007)

4. RF Engines Ltd., IP Cores – Xilinx FFT Library, product information sheet available at companyweb site: www.rfel.com


5. Roke Manor Research Ltd., Ultra High Speed Pipeline FFT Core, product information sheet available at company web site: www.roke.com

6. B.S. Son, B.G. Jo, M.H. Sunwoo, Y.S. Kim, A high-speed FFT processor for OFDM systems. Proc. IEEE Int. Symp. Circuits Syst. 3, 281–284 (2002)

7. C.H. Sung, K.B. Lee, C.W. Jen, Design and implementation of a scalable fast Fourier transform core. Proc. of 2002 IEEE Asia-Pacific Conference on ASICs, 295–298 (2002)

8. Xilinx Inc., company and product information available at company web site: www.xilinx.com


Chapter 7
Design of Arithmetic Unit for Resource-Constrained Solution

Abstract This chapter discusses a solution to the regularized FHT where the pipelined fixed-point multipliers involving the trigonometric coefficients are now replaced by pipelined CORDIC phase rotators which eliminate the need for the trigonometric coefficient memory and lead to the specification of a flexible-precision solution. The design is targeted, in particular, at those applications where one is constrained by the limited availability of embedded resources. Theoretical performance figures for a silicon-based implementation of the CORDIC-based solution are derived and the results compared with those for the previously discussed solutions based upon the use of the fast fixed-point multipliers for various combinations of transform length and word length. A discussion is finally provided relating to the results obtained in the chapter.

7.1 Introduction

The last two chapters have provided us with a detailed account of how the R2⁴ FHT is able to be mapped onto a partitioned-memory single-PE computing architecture so as to effectively exploit the computational power of the silicon-based parallel computing technologies. Four versions of this highly-parallel R2⁴ FHT solution have been produced with PE designs which range from providing optimality in terms of the arithmetic complexity to optimality in terms of the memory requirement, although the common feature of all four versions is that they each involve the use of a fast fixed-point multiplier. No consideration has been given, as yet, as to whether an arithmetic unit based upon the fast multiplier is always the most appropriate to adopt or, when such an arithmetic unit is used, how the fast multiplier might best be implemented. With the use of FPGA technology, however, the fast multiplier is typically available to the user as an embedded resource which, although expensive in terms of silicon resources, is becoming increasingly more power efficient and therefore the logical solution to adopt.

A problem may arise in practice, however, when the length of the transform to be computed is very large compared to the capability of the target device, such that there are insufficient embedded resources – in terms of fast multipliers, fast RAM, or both – to enable a successful mapping of the transform onto the device to take place. In such a situation, where the use of a larger and more powerful device is simply not an option, it is thus required that some means be found of facilitating a successful mapping onto the available device, and one way of achieving this is through the design of a more appropriate arithmetic unit, namely one which does not rely too heavily upon the use of embedded resources.

The choice of which type of arithmetic unit to adopt for the proposed resource-constrained solution has been made in favour of the CORDIC unit, rather than the DA unit, as the well documented optimality of CORDIC arithmetic for the operation of phase rotation [1] – as is shown to be the required operation here – combined with the ability to generate the rotation angles that correspond to the trigonometric coefficients very efficiently on-the-fly, with trivial memory requirement, make it the obvious candidate to pursue – the DA unit would inevitably involve a considerably larger memory requirement due to the storage of the pre-computed sums or inner products. A number of attractive CORDIC-based FPGA solutions to the FFT have appeared in the technical literature in recent years, albeit for the more straightforward complex-data case, with two such solutions as discussed in references [2, 11].

Note that the sizing to be carried out in this chapter for the various R2⁴ FHT solutions, including those based upon both the fast fixed-point multiplier and the CORDIC phase rotator, is to be performed for hypothetical implementations exploiting only programmable logic in order to facilitate their comparison.

7.2 Accuracy Considerations

To obtain L-bit accuracy in the GD-BFLY outputs it will be necessary to retain sufficient bits out of the multipliers as well as to use sufficient guard bits in order to protect both the least significant bit (LSB) and the most significant bit (MSB). This is due to the fact that with fixed-point processing the accuracy may be degraded through the possible word growth of one bit with each stage of adders. For the MSB, the guard bits correspond to those higher-order (initially unoccupied) bits, appended to the left of the L most significant data bits out of the multipliers, that could in theory, after completion of the stages of GD-BFLY adders, contain the MSB of the output data. For the LSB, the guard bits correspond to those lower-order (initially occupied) bits, appearing to the right of the L most significant data bits out of the multipliers, which could in theory, after completion of the stages of GD-BFLY adders, affect or contribute to the LSB of the output data. Thus, the possible occurrence of truncation errors due to the three stages of adders is accounted for by varying the lengths of the registers as the data progresses across the PE.

Allowing for word growth in this fashion permits the application of block floating-point scaling [9] – as discussed in Section 4.8 of Chapter 4 – prior to each stage of GD-BFLYs, thereby enabling the dynamic range of any signals present in the data to be maximized at the output of the R2⁴ FHT.


7.3 Fast Multiplier Approach

Apart from the potentially large CM requirement associated with the four PE designs discussed in the previous chapter, an additional limitation relates to their relative inflexibility, in terms of the arithmetic precision offered, due to their reliance on the fast fixed-point multiplier. For example, when the word length, "L", of one or more of the multiplicands exceeds the word length capability, "K", of the embedded multiplier, it would typically be necessary to use four embedded multipliers and two 2K-bit adders to carry out each L × L multiplication (assuming that K < L ≤ 2K). When implemented on an FPGA in programmable logic, it is to be assumed that one L × L pipelined multiplier will require of order 5L²/8 slices [3, 4] in order to produce a new output each clock cycle, whilst one L-bit adder will require L/2 slices [4]. The CM will require L-bit RAM, with the single-port version involving L/2 slices and the dual-port version involving L slices [8]. These logic-based complexity figures will be used later in the chapter for carrying out sizing comparisons of the PE designs discussed in this and the previous chapters.

To obtain L-bit accuracy in the outputs of the twelve-multiplier version of the GD-BFLY, which involves three stages of adders, it is necessary that L + 3 bits be retained from the multipliers, each of size L × L, in order to guard the LSB, whilst the first stage of adders is carried out to (L + 4)-bit precision, the second stage to (L + 5)-bit precision and the third stage to (L + 6)-bit precision, in order to guard the MSB, at which point the data is scaled to yield the L-bit results. Similarly, to obtain L-bit accuracy in the outputs of the nine-multiplier version of the GD-BFLY, which involves four stages of adders, it is necessary that the first stage of adders (preceding the multipliers) be carried out to (L + 1)-bit precision, with L + 4 bits being retained from the multipliers, each of size (L + 1) × (L + 1), whilst the second stage of adders is carried out to (L + 5)-bit precision, the third stage to (L + 6)-bit precision and the fourth stage to (L + 7)-bit precision, at which point the data is scaled to yield the L-bit results.

Thus, given that the twelve-multiplier version of the GD-BFLY involves a total of 12 pipelined multipliers, six stage-one adders, eight stage-two adders and eight stage-three adders, the PE can be constructed with an arithmetic-based logic requirement, denoted $L^{A}_{M12}$, of

$$L^{A}_{M12} \approx \tfrac{1}{2}\left(15L^{2} + 22L + 112\right) \qquad (7.1)$$

slices, whilst the nine-multiplier version of the GD-BFLY, which involves a total of three stage-one adders, nine pipelined multipliers, six stage-two adders, eight stage-three adders and eight stage-four adders, requires an arithmetic-based logic requirement, denoted $L^{A}_{M9}$, of

$$L^{A}_{M9} \approx \tfrac{1}{8}\left(45L^{2} + 190L + 548\right) \qquad (7.2)$$

slices.
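For sizing purposes, Equations 7.1 and 7.2 are easily evaluated in code; the helpers below do so, rounding up to whole slices (the rounding direction is a choice made here for illustration, not taken from the text):

```c
#include <assert.h>

/* Slice estimates of Equations 7.1 and 7.2: the arithmetic-based logic
 * of the twelve- and nine-multiplier GD-BFLY versions for L-bit data,
 * rounded up to whole slices.                                          */
static unsigned slices_m12(unsigned L) { return (15*L*L +  22*L + 112 + 1) / 2; }
static unsigned slices_m9 (unsigned L) { return (45*L*L + 190*L + 548 + 7) / 8; }
```

For example, at the representative word length L = 16 the two expressions give roughly 2,152 and 1,889 slices, respectively, so the nine-multiplier version saves arithmetic logic at the cost of its extra adder stage.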


These figures, together with the CM requirement – as discussed in Section 6.3 of the previous chapter and given by Equations 6.5 and 6.6 – will combine to form the benchmarks with which to assess the merits of the hardware-based arithmetic unit now discussed.

7.4 CORDIC Approach

The CORDIC algorithm [12] is an arithmetic technique used for carrying out two-dimensional vector rotations. Its relevance here is in its ability to carry out the phase rotation of a complex number, as this will be seen to be the underlying operation required by the GD-BFLY. The vector rotation, which is a convergent linear process, is performed very simply as a sequence of elementary rotations with an ever-decreasing elementary rotation angle, where each elementary rotation can be carried out using just shift and add–subtract operations.

7.4.1 CORDIC Formulation of Complex Multiplier

For carrying out the particular operation of phase rotation, a vector (X, Y) is rotated by an angle "θ" to obtain the new vector (X′, Y′). For the nth elementary rotation, the fixed elementary rotation angle, arctan(2⁻ⁿ), which is stored in a ROM, is subtracted/added from/to the angle remainder, "θₙ", so that the angle remainder approaches zero with increasing "n". The mathematical relations for the conventional non-redundant CORDIC rotation operation [1] are as given below via the four sets of equations:

(a) Phase Rotation Operation:

$$X' = \cos(\theta)\cdot X - \sin(\theta)\cdot Y$$
$$Y' = \cos(\theta)\cdot Y + \sin(\theta)\cdot X$$
$$\theta' = 0 \qquad (7.3)$$

(b) Phase Rotation Operation as Sequence of Elementary Rotations:

$$X' = \prod_{n=0}^{K-1}\cos\left(\arctan\left(2^{-n}\right)\right)\left(X_{n} - \sigma_{n} Y_{n} 2^{-n}\right)$$
$$Y' = \prod_{n=0}^{K-1}\cos\left(\arctan\left(2^{-n}\right)\right)\left(Y_{n} + \sigma_{n} X_{n} 2^{-n}\right)$$
$$\theta' = \theta - \sum_{n=0}^{K-1}\sigma_{n}\arctan\left(2^{-n}\right) \qquad (7.4)$$


(c) Expression for nth Elementary Rotation:

$$X_{n+1} = X_{n} - \sigma_{n} 2^{-n} Y_{n}$$
$$Y_{n+1} = Y_{n} + \sigma_{n} 2^{-n} X_{n}$$
$$\theta_{n+1} = \theta_{n} - \sigma_{n}\arctan\left(2^{-n}\right) \qquad (7.5)$$

where "σₙ" is either +1 or −1, for non-redundant CORDIC, depending upon the sign of the angle remainder term, denoted here as "θₙ".

(d) Expression for CORDIC Magnification Factor:

$$M = \prod_{n=0}^{K-1}\frac{1}{\cos\left(\arctan\left(2^{-n}\right)\right)} = \prod_{n=0}^{K-1}\sqrt{1 + 2^{-2n}} \approx 1.647 \ \text{for large } K \qquad (7.6)$$

which may need to be scaled out of the rotated output in order to preserve the correct amplitude of the phase-rotated complex number.

The choice of non-redundant CORDIC, rather than a redundant version whereby the term "σₙ" is allowed to be either +1, −1 or 0, ensures that the value of the magnification factor, which is a function of the number of iterations, is independent of the rotation angle being applied and therefore fixed for every instance of the GD-BFLY, whether it is of Type-I, Type-II or Type-III – for the definitions see Section 4.3 of Chapter 4.

7.4.2 Parallel Formulation of CORDIC-Based PE

From Equation 7.5, the CORDIC algorithm requires one pair of shift/add–subtract operations and one add–subtract operation for each bit of accuracy. When implemented sequentially [1], therefore, the CORDIC unit implements these elementary operations, one after another, using a single CS and feeding back the output as the input to the next iteration. A sequential CORDIC unit with L-bit output has a latency of L clock cycles and produces a new output every L clock cycles. On the other hand, when implemented in parallel form [1], the CORDIC unit implements these elementary operations as a computational pipeline – see Fig. 7.1 – using an array of identical CSs. A parallel CORDIC unit with L-bit output has a latency of L clock cycles but produces a new output every clock cycle.

An attraction of the fully parallel pipelined architecture is that the shifters in each CS involve a fixed right shift, so that they may be implemented very efficiently in the wiring. Also, the elementary rotation angles may be distributed as constants to each CS so that they may also be hardwired. As a result, the entire CORDIC rotator may be reduced to an array of interconnected add–subtract units. Pipelining is achieved by inserting registers between the add–subtract units, although with most FPGA architectures there are already registers present in each logic cell, so that the addition of the pipeline registers involves no additional hardware cost.


Fig. 7.1 Pipeline architecture for CORDIC rotator. [Figure: a cascade of identical computational stages; stage n applies fixed right shifts of n places to the X and Y channels and adds/subtracts under control of sign(Zₙ), whilst the angle remainder Z is updated with the hardwired elementary angle αₙ.]

7.4.3 Discussion of CORDIC-Based Solution

The twelve-multiplier version of the GD-BFLY produces eight outputs from eight inputs, these samples denoted by (X1, Y1) through to (X4, Y4), with the multiplication stage of the GD-BFLY comprising 12 real multiplications which, together with the accompanying set of additions/subtractions, may be expressed for the case of the standard Type-III GD-BFLY via the three sets of equations

$$\begin{bmatrix} X_{2} \\ Y_{2} \end{bmatrix} = \begin{bmatrix} \cos(\theta) & \sin(\theta) \\ \sin(\theta) & -\cos(\theta) \end{bmatrix} \begin{bmatrix} X_{2} \\ Y_{2} \end{bmatrix} \qquad (7.7)$$

$$\begin{bmatrix} X_{3} \\ Y_{3} \end{bmatrix} = \begin{bmatrix} \cos(2\theta) & \sin(2\theta) \\ \sin(2\theta) & -\cos(2\theta) \end{bmatrix} \begin{bmatrix} X_{3} \\ Y_{3} \end{bmatrix} \qquad (7.8)$$

$$\begin{bmatrix} X_{4} \\ Y_{4} \end{bmatrix} = \begin{bmatrix} \cos(3\theta) & \sin(3\theta) \\ \sin(3\theta) & -\cos(3\theta) \end{bmatrix} \begin{bmatrix} X_{4} \\ Y_{4} \end{bmatrix} \qquad (7.9)$$

where "θ" is the single-angle, "2θ" the double-angle and "3θ" the triple-angle rotation angles. These sets of equations are equivalent to what would be obtained if we multiplied the complex number interpretations of (X2, Y2) by e⁻ⁱθ, (X3, Y3) by e⁻ⁱ²θ and (X4, Y4) by e⁻ⁱ³θ, followed for the case of the standard Type-III GD-BFLY by negation of the components Y2, Y3 and Y4.

As with the nine-multiplier and twelve-multiplier versions of the GD-BFLY, there are minor changes to the operation of the GD-BFLY, from one instance to another, in terms of the definitions of the first three address permutations, with one of two slightly different versions being appropriately selected for each according to the particular "Type" of GD-BFLY being executed – see Table 4.1 of Chapter 4. In addition, however, there are also minor changes required to the outputs of the CORDIC units in that if the GD-BFLY is of Type-I then the components Y2, Y3 and Y4 do not need to be negated, whereas if the GD-BFLY is of Type-II then only component Y4 needs to be negated, and if the GD-BFLY is of Type-III, as discussed in the previous paragraph, then all three components need to be negated.

Note, however, that the outputs will have grown due to the CORDIC magnification factor, "M", of Equation 7.6, so that this growth needs to be adequately accounted for within the GD-BFLY. The most efficient way of achieving this would be to allow the growth to remain within components (X2, Y2) through to (X4, Y4) and the components (X1, Y1) to be scaled multiplicatively by the term "M", this being achieved with just two constant-coefficient multipliers – see Fig. 7.2. This would result in a growth of approximately 1.647 in all the eight inputs to the second address permutation "Φ2". Note that scaling by such a constant differs from the operation of a standard fast multiplier in that Booth encoding/decoding circuits are no longer required, whilst efficient recoding methods [5] can be used to further reduce the logic requirement of the simplified operation to approximately one third that of the standard fast fixed-point multiplier.

An obvious attraction of the CORDIC-based approach is that the GD-BFLY only requires knowledge of the single-angle, double-angle and triple-angle rotation angles, so that there is no longer any need to construct, maintain and access the potentially large LUTs required for the storage of the trigonometric coefficients – that is, for the storage of sampled versions of the sinusoidal function with argument defined from 0 up to π/2 radians. As a result, the radix-4 factorization of the CORDIC-based FHT may be expressed very simply, with the updating of the rotation angles for the execution of each instance of the GD-BFLY being performed on-the-fly and involving only additions and subtractions.
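The on-the-fly angle update can be sketched as pure accumulation: with the rotation angles held as binary phase words, the single-, double- and triple-angle values for the next butterfly follow from the current ones by addition alone. The 32-bit phase-word representation below (with 2³² corresponding to 2π) is an illustrative assumption, not the word length of the actual design:

```c
#include <assert.h>

/* On-the-fly update of the single-, double- and triple-angle rotation
 * angles between consecutive butterflies.  Angles are held as 32-bit
 * binary phase words (2^32 <-> 2*pi), so each new angle set follows by
 * addition alone, with wrap-around handled by the unsigned arithmetic. */
typedef struct { unsigned single, dbl, triple; } angles_t;

static void next_angles(angles_t *a, unsigned delta)
{
    a->single += delta;                  /* theta  += delta   */
    a->dbl    += delta << 1;             /* 2theta += 2*delta */
    a->triple += delta + (delta << 1);   /* 3theta += 3*delta */
}
```

No trigonometric LUT is consulted at any point: the shifts and adds above replace the coefficient memory accesses of the fast-multiplier designs.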


Fig. 7.2 Signal flow graph for CORDIC-based version of generic double butterfly. [Figure: the input data vector passes through address permutations Φ1–Φ4; components (X1, Y1) pass through a fixed scaler whilst the remaining component pairs pass through three un-scaled CORDIC rotators, driven by the negated rotation angles, each followed by a negation stage, to form the output data vector.]

Fig. 7.3 Computational stage of pipeline for CORDIC rotator with scalar inputs. [Figure: nine add–subtract units apply the fixed right shifts of n places and the fixed elementary angle αₙ simultaneously to the single-angle (S), double-angle (D) and triple-angle (T) channels, each steered by the sign of its own angle remainder Z.]

The optimum throughput for the GD-BFLY is achieved with the fully-parallel hardwired solution of Fig. 7.3, whereby each CS of the pipeline uses nine add–subtract units to carry out simultaneously the three elementary phase rotations – note that in the figure the superscripts "S", "D" and "T" stand for "single angle", "double angle" and "triple angle", respectively.

Due to the decomposition of the original rotation angle into "K" elementary rotation angles, it is clear that the phase rotation operation can only be approximated, with the accuracy of the outputs being limited by the magnitude of the last elementary rotation angle applied. Thus, if L-bit accuracy is required of the rotated output, one would expect the number of iterations, "K", to be chosen so that K = L, as the right shifts carried out in the Kth (and last) iteration would be of length L − 1. This, in turn, necessitates two guard bits on the MSB and log₂L guard bits on the LSB. The MSB guard bits cater for the magnification factor of Equation 7.6 and the maximum possible range extension of √2, whilst the LSB guard bits cater for the accumulated rounding error from the "L" iterations.

Note also, from the definition of the elementary rotation angles,

tan(θₙ) = ±2⁻ⁿ,    (7.10)

that the CORDIC algorithm is known to converge over the range −π/2 ≤ θ ≤ +π/2, so that in order to cater for rotation angles between ±π an additional rotation angle of ±π/2 may need to be applied prior to the elementary rotation angles in order to ensure that the algorithm converges, thus increasing the number of iterations from K = L to K = L + 1. This may be very simply achieved, however, via application of the equations:

X′ = −σ·Y

Y′ = +σ·X

θ′ = θ + σ·(π/2)    (7.11)

where

σ = +1 if θ < 0; −1 otherwise,    (7.12)

whenever the rotation angle lies outside the range of convergence, with the above equations being carried out via precisely the same components, and represented by means of precisely the same SFG, as those equations – namely Equations 7.3–7.6 – corresponding to the elementary rotation angles.
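To make the iteration scheme concrete, the rotation described above can be sketched in floating point. This is an illustrative model only, not the book's fixed-point datapath: the function name is ours, the ±π/2 range extension is folded in by testing the sign of the residual angle, and the accumulated magnification factor of Equation 7.6 is removed by a single final scaling, mirroring the fixed scaler of Fig. 7.2.

```python
import math

def cordic_rotate(x, y, theta, K=18):
    """Rotate (x, y) by theta radians using K CORDIC iterations.

    Converges for |theta| <= pi/2; larger angles (up to +/-pi) are first
    handled by an exact +/-pi/2 pre-rotation, costing one extra step."""
    if abs(theta) > math.pi / 2:
        sigma = 1.0 if theta > 0.0 else -1.0
        x, y = -sigma * y, sigma * x          # exact rotation by sigma*pi/2
        theta -= sigma * math.pi / 2
    gain = 1.0
    for n in range(K):
        d = 1.0 if theta >= 0.0 else -1.0     # drive residual angle to zero
        x, y = x - d * y * 2.0**-n, y + d * x * 2.0**-n
        theta -= d * math.atan(2.0**-n)       # elementary angle atan(2^-n)
        gain *= math.sqrt(1.0 + 4.0**-n)      # per-iteration magnification
    return x / gain, y / gain                 # fixed scaling removes the gain

xr, yr = cordic_rotate(1.0, 0.0, 3 * math.pi / 4)
```

With K = 18 the result agrees with (cos 3π/4, sin 3π/4) to better than 10⁻⁴, illustrating why roughly one iteration per output bit is required.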

7.4.4 Logic Requirement of CORDIC-Based PE

Referring back to the SFG of Fig. 7.2, it is to be assumed that the GD-BFLY outputs are to be computed to L-bit accuracy. Therefore, because of the two stages of adders following the CORDIC rotators, it will be necessary for the CORDIC rotators to adopt L + 3 iterations in order to produce data to (L + 2)-bit accuracy for input to the first stage of adders. This in turn requires that each CORDIC rotator adopt L + 4 + log₂(L + 2) bits for the registers, this including log₂(L + 2) guard bits for the LSB and two guard bits for the MSB. Following their operation, the data will have been magnified by one bit, so that just the top MSB guard bit needs to be removed, together with the lowest log₂(L + 2) + 1 bits, to leave the required L + 2 bits for input to the adders. The first stage of adders is then carried out to (L + 3)-bit precision and the second stage to (L + 4)-bit precision, at which point the data is scaled to yield the final L-bit result. The outputs from the two fixed-coefficient multipliers – note that in the time it takes for the CORDIC operation to be executed the same fixed-coefficient multiplier could be used to carry out the scaling operation for both of the first two inputs – are retained to (L + 2)-bit precision in order to ensure consistency with the precision of the outputs from the CORDIC rotators.
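The word-length bookkeeping above can be captured in a small helper. This is a sketch only: the function and key names are ours, and the ceil() rounding of log₂(L + 2) is an assumption, the text leaving the rounding implicit.

```python
import math

def cordic_wordlengths(L):
    """Precision plan of Section 7.4.4 for L-bit GD-BFLY outputs."""
    lsb_guard = math.ceil(math.log2(L + 2))    # LSB guard bits
    return {
        "iterations": L + 3,                   # rotator yields (L+2)-bit data
        "register_bits": L + 4 + lsb_guard,    # includes 2 MSB guard bits
        "stage_one_adder_bits": L + 3,
        "stage_two_adder_bits": L + 4,
    }

plan = cordic_wordlengths(16)
```

For L = 16 this gives 19 rotator iterations and 25-bit rotator registers, consistent with the counts used in the derivation of Equation 7.13 below.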

Thus, the CORDIC-based version of the GD-BFLY involves three (L + 3)-stage pipelined CORDIC rotators, eight (L + 3)-bit stage-one adders, eight (L + 4)-bit stage-two adders and one shared fixed-coefficient multiplier using an (L + 2)-bit coefficient, so that the PE may be constructed with a total arithmetic-based logic requirement, denoted L_C^A, of

L_C^A ≈ 1/2 (10L² + 83L + 9(L + 3)·log₂(L + 2) + 168)    (7.13)

slices.

Note that the single-angle, double-angle and triple-angle rotation angles are fed directly to the GD-BFLY, so that the only memory requirement is for the storage of three generation angles for each stage of the transform, from which the rotation angles may then be recursively derived via simple addition. Thus, assuming single-port memory, the memory-based logic requirement, denoted L_C^M, is given by just

L_C^M ≈ 3/2 αL    (7.14)

slices, with the required single-angle, double-angle and triple-angle rotation angles being computed on-the-fly as and when they are required.
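Equations 7.13 and 7.14 are straightforward to evaluate; the helper below is an illustrative sketch (the function name is ours, and α is left as a caller-supplied parameter, here taken as log₄N to match the memory entry for Version V of Table 7.1):

```python
import math

def cordic_pe_logic(L, alpha):
    """Slice estimates for the CORDIC-based PE: arithmetic (Eq. 7.13)
    and memory (Eq. 7.14)."""
    arith = 0.5 * (10 * L**2 + 83 * L
                   + 9 * (L + 3) * math.log2(L + 2) + 168)
    memory = 1.5 * alpha * L
    return arith, memory

arith16, mem16 = cordic_pe_logic(16, 5)   # alpha = log4(1024) = 5
```

For L = 16 the arithmetic term comes to roughly 2,400 slices, dwarfing the memory term of 120 slices – the key property exploited in the comparison of Section 7.5.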

7.5 Comparative Analysis of PE Designs

This section provides a very brief theoretical comparison of the silicon resources required for all five types of PE so far considered – four corresponding to the use of a pipelined fixed-point multiplier, as discussed in the previous chapter, and one corresponding to the use of the pipelined CORDIC arithmetic unit – where the sizing is to be based upon the logic-based complexity figures discussed in Sections 7.3 and 7.4. An FPGA implementation [6] would of course be able to exploit the available embedded resources, whether using the fast fixed-point multiplier or the CORDIC arithmetic unit, as most FPGA manufacturers now provide their own version of the CORDIC unit, in addition to the fast multipliers and RAM, as an embedded resource to be exploited by the user. A pipelined version of the CORDIC arithmetic unit may even be obtained as an IP core [10] and subsequently used as a building block for constructing larger DSP systems. The assumption here is that any relative advantages obtained from implementation in programmable logic will carry over when the PEs are implemented using such optimized silicon resources.

The arithmetic-based and memory-based logic requirements for all five versions of the R²₄ FHT solution – catering for both the arithmetic complexity and the CM requirement – are as summarized in Table 7.1 below, from which the attraction


Table 7.1 Logic resources required for different versions of PE and trigonometric coefficient generator, assuming N-point regularized FHT and L-bit accuracy

Version | PE type         | Arithmetic-based logic for double butterfly (slices) | Memory-based logic for coefficients (slices) | Arithmetic-based logic for coefficient generator (slices)
I       | Fast multiplier | 1/2 (15L² + 22L + 112)                               | 3/4 L·N                                      | 0
II      | Fast multiplier | 1/8 (45L² + 190L + 548)                              | 3/4 L·N                                      | 3L
III     | Fast multiplier | 1/2 (15L² + 22L + 112)                               | 3/2 L·√N                                     | 1/8 (35L² + 162L + 277)
IV      | Fast multiplier | 1/8 (45L² + 190L + 548)                              | 3/2 L·√N                                     | 1/8 (35L² + 246L + 831)
V       | CORDIC unit     | 1/2 (10L² + 83L + 9(L + 3)·log₂(L + 2) + 168)        | 3/2 L·log₄N                                  | 0


of Version V, the CORDIC-based solution, is evident. The benefits stem basically from the fact that there is no longer any need to construct, maintain and access potentially large LUTs required for the storage of the trigonometric coefficients. The same word lengths, denoted “L”, are assumed for both the input/output data, to/from the GD-BFLY, and the trigonometric coefficients.

The control-based logic requirements – for controlling the operation and interaction of the various components of the design – as discussed in Section 5.6 of Chapter 5, are not included in the results, as they are rather more difficult (if not impossible) to assess if considered in isolation from the actual hardware design process, due in part to the automated and somewhat unpredictable nature of that process. It seems clear, however, that the gains achieved by the CORDIC-based R²₄ FHT solution in not having to access the CM will be somewhat counter-balanced by the need to control a potentially large number of adders rather than just a few fast fixed-point multipliers. Also, the two versions of the R²₄ FHT solution based upon the minimum-memory addressing (Versions III and IV) will involve greater control complexity than those versions based upon the minimum-arithmetic addressing (Versions I and II), as evidenced by the discussions of Section 6.3.2 of the previous chapter. For each of the five versions, however, the control-based logic requirement will vary little with transform length or word length, as indicated in the results of Section 6.5 of Chapter 6, due to the scalable nature of the designs.

Estimates for the logic requirements due to both the arithmetic complexity and the CM, for various combinations of transform length and data/coefficient word length for all the solutions considered, are as given in Table 7.2, with the results reinforcing the attraction of the CORDIC-based solution for those parameter sets typically encountered in high-performance DSP applications. It is evident from the results displayed in Tables 7.1 and 7.2 that, as the transform length increases, the associated memory-based logic requirement makes all those solutions based upon the fast fixed-point multiplier increasingly less attractive, as the silicon requirement for such solutions is clearly dominated by the increasing memory requirement. The only significant change to the CORDIC-based solution as the transform length varies relates to the memory allocation for storage of the input/output data.
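The entries of Table 7.2 can be reproduced, to within rounding, by totalling the per-version formulas of Table 7.1 and expressing the result in units of 1K (= 1,024) slices. The sketch below performs that cross-check (function and key names are ours; the log₄N memory term for Version V follows the corresponding table row):

```python
import math

def total_logic_slices(N, L):
    """Arithmetic + coefficient-memory (+ coefficient-generator) logic
    for the five PE versions, per the formulas of Table 7.1 (slices)."""
    bfly_a = 0.5 * (15 * L**2 + 22 * L + 112)     # Versions I, III
    bfly_b = (45 * L**2 + 190 * L + 548) / 8      # Versions II, IV
    return {
        "I":   bfly_a + 0.75 * L * N,
        "II":  bfly_b + 0.75 * L * N + 3 * L,
        "III": bfly_a + 1.5 * L * math.sqrt(N)
                      + (35 * L**2 + 162 * L + 277) / 8,
        "IV":  bfly_b + 1.5 * L * math.sqrt(N)
                      + (35 * L**2 + 246 * L + 831) / 8,
        "V":   0.5 * (10 * L**2 + 83 * L
                      + 9 * (L + 3) * math.log2(L + 2) + 168)
                      + 1.5 * L * (math.log2(N) / 2),
    }

est = total_logic_slices(1024, 16)   # compare with the first column of Table 7.2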

7.6 Discussion

The primary question addressed in this chapter concerned the optimal choice of arithmetic unit given the requirement for a resource-constrained solution to the R²₄ FHT. This involved the replacement of the fast fixed-point multipliers used by the GD-BFLY with a hardware-based parallel arithmetic unit which minimized the need for the use of embedded resources – at least in the shape of fast fixed-point multipliers and fast RAM for the trigonometric coefficients.

The particular design investigated was based upon the use of CORDIC arithmetic, as this is known to be computationally optimal for the operation of phase rotation – most FPGA manufacturers now provide their own version of the CORDIC


Table 7.2 Logic resources required for combinations of transform length N and word length L (approximate sizing in slices × 1K)

                             N = 1,024           N = 4,096           N = 16,384
Version | PE type         | L=16  L=20  L=24  | L=16  L=20  L=24  | L=16  L=20  L=24
I       | Fast multiplier |  14    18    23   |  50    63    77   |  194   243   293
II      | Fast multiplier |  14    18    22   |  50    63    76   |  194   243   292
III     | Fast multiplier |   4     6     9   |   5     7    10   |    7     9    12
IV      | Fast multiplier |   4     6     8   |   5     7     9   |    7     9    11
V       | CORDIC unit     |   3     4     5   |   3     4     5   |    3     4     5


unit, in addition to the fast multipliers and RAM, as an embedded resource to be exploited by the user. The result of using such a PE is the identification of a solution to the real-data DFT offering the promise of greatly reduced quantities of silicon resources – at least for the arithmetic complexity and memory requirement – and, when implemented with FPGA technology, the possibility of adopting a lower-complexity and lower-cost device compared to that based upon the use of the fast fixed-point multiplier. The mathematical/logical correctness of the operation of the resulting CORDIC-based version of the R²₄ FHT solution, as with those versions based upon the use of the fast fixed-point multiplier, has been proven in software via a computer program written in the “C” programming language.

The comparative benefits of the various designs, as suggested by the complexity figures derived for a hypothetical implementation with programmable logic, should also carry over, not only when exploiting embedded resources, but also when implemented with ASIC technology, where the high regularity of the CORDIC-based design could prove particularly attractive. In fact, a recent study [7] has shown that, compared to an implementation using a standard-cell ASIC, the FPGA area required to implement a typical DSP algorithm – such as the R²₄ FHT – is on average 40 times larger, whilst the achievable speed, which relates to the critical path delay and hence the maximum allowable clock frequency, is on average one third of that for the ASIC. As a result, it is possible to hypothesize a dynamic power consumption for an FPGA implementation which is on average nine times greater than that for the ASIC when embedded features are used [7], increasing to 12 times when only programmable logic is used [7].

Note that the design constraint on the PE discussed in Section 6.4 of the previous chapter, concerning the total number of CSs in the computational pipeline, is applicable for the CORDIC-based solution as well as for those based upon the fast fixed-point multiplier: the total number of CSs – including those corresponding to the CORDIC iterations – needs to be an odd-valued integer, so as to avoid any possible conflict problems with regard to the reading/writing of the input/output data sets from/to the eight DM banks for each new clock cycle.

Finally, it should be noted that the benefits of adopting the CORDIC-based design, rather than one of the more conventional designs based upon the use of the fast fixed-point multiplier, may only be achieved at the expense of incurring greater latency, given that the delay associated with a pipelined fixed-point multiplier might typically be of order log₂L clock cycles whereas that for the pipelined CORDIC arithmetic unit is of order L clock cycles. As a result, for the processing of 16-bit to 24-bit data – as discussed in Table 7.2 – whereas the pipelined fixed-point multiplier PE design might typically involve a total pipeline delay of nine clock cycles, say, that based upon the CORDIC arithmetic unit might involve a total pipeline delay of two to three times that size, which might in turn (at least for smaller transform sizes) necessitate the adoption of a safety-margin delay with each stage of GD-BFLYs.
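That latency trade-off can be made concrete with a toy estimate (illustrative constants only, following the order-of-magnitude figures quoted above; the function name is ours and this is not a timing model):

```python
import math

def pe_pipeline_delay(L, pe):
    """Rough PE pipeline delay in clock cycles: about log2(L) stages for
    a pipelined fixed-point multiplier, about L (plus the range-extension
    step) for a pipelined CORDIC arithmetic unit."""
    return math.ceil(math.log2(L)) if pe == "multiplier" else L + 1

mult_delay = pe_pipeline_delay(16, "multiplier")
cordic_delay = pe_pipeline_delay(16, "cordic")
```

For 16-bit data the CORDIC pipeline is several times deeper than the multiplier pipeline, which is the source of the safety-margin delay mentioned above.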


References

1. R. Andraka, A Survey of CORDIC Algorithms for FPGA Based Computers. Proceedings of ACM/SIGDA 6th International Symposium on FPGAs (Monterey, CA, 1998), pp. 191–200
2. A. Banerjee, S.D. Anindya, S. Banerjee, FPGA realization of a CORDIC-based FFT processor for biomedical signal processing. Microprocessors Microsyst. (Elsevier) 25(3), 131–142 (2001)
3. M. Becvar, P. Stukjunger, Fixed-point arithmetic in FPGA. Acta Polytech. 45(2), 67–72 (2005)
4. C.H. Dick, FPGAs: the high-end alternative for DSP applications. J. DSP Eng. (Spring 2000)
5. K. Hwang, Computer Arithmetic: Principles, Architectures and Design (Wiley, New York, 1979)
6. K.J. Jones, R. Coster, Area-efficient and scalable solution to real-data fast Fourier transform via regularised fast Hartley transform. IET Signal Process. 1(3), 128–138 (2007)
7. I. Kuon, J. Rose, Measuring the Gap Between FPGAs and ASICs (FPGA ’06, Monterey, CA, 2006)
8. C. Maxfield, The Design Warrior’s Guide to FPGAs (Newnes (Elsevier), 2004)
9. L.R. Rabiner, B. Gold, Theory and Application of Digital Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1975)
10. RFEL: rfel.com/products/Products Cordic.asp
11. T. Sansaloni, A. Perez-Pascual, J. Valls, Area-efficient FPGA-based FFT processor. Electron. Lett. 39(19), 1369–1370 (2003)
12. J.E. Volder, The CORDIC trigonometric computing technique. IRE Trans. Electron. Comput. 8(3), 330–334 (1959)


Chapter 8
Computation of 2ⁿ-Point Real-Data Discrete Fourier Transform

Abstract This chapter describes two solutions to the problem of the real-data DFT whereby the GD-BFLY, which has been designed for a radix-4 version of the FHT, is now used for the computation of the 2ⁿ-point DFT where the transform length is a power of two, but not a power of four. This enables it to be applied, potentially, to a great many more problems, including those that might not necessarily be best solved through the direct application of a 4ⁿ-point transform. The first approach is referred to as the “double-resolution” approach, as it involves FHT-based processing at double the required transform-space resolution via two half-length regularized FHTs, whilst the second approach is referred to as the “half-resolution” approach, as it involves FHT-based processing at one half the required transform-space resolution via one double-length regularized FHT. A discussion is finally provided relating to the results obtained in the chapter.

8.1 Introduction

The results discussed so far in this monograph have been concerned solely with the application of the R²₄ FHT to the computation of the real-data DFT where the transform length is a power of four. Given the amount of effort and resources devoted to the design of the GD-BFLY and the associated R²₄ FHT, however, there would be great attraction in being able to extend its range of applicability to that of the 2ⁿ-point real-data DFT where the transform length is a power of two, but not a power of four – note that 2ⁿ is a power of four whenever “n” is an even-valued integer. A radix-2 version of the regularized FHT could of course be developed, but this would yield just fourfold parallelism, at best, rather than the eightfold parallelism of the radix-4 solution, whilst the time-complexity would increase by a factor of 2 log₂N / log₄N.

If the applicability of the R²₄ FHT could be generalized, therefore, without significantly compromising performance, it would result in a very flexible solution to the real-data DFT problem that would be able to address a great many more problems, including those that might not necessarily be best solved through the

K. Jones, The Regularized Fast Hartley Transform, Signals and Communications Technology, DOI 10.1007/978-90-481-3917-0_8, © Springer Science+Business Media B.V. 2010

direct application of a 4ⁿ-point transform [1]. Two approaches to this problem are now addressed:

1. The first involves the exploitation of two half-length versions of the R²₄ FHT, with one transform being applied to the even-addressed samples of the data sequence and the second transform to the odd-addressed samples.

2. The second involves the exploitation of one double-length version of the R²₄ FHT, this being applied to a zero-padded version of the data sequence.

Thus, with the first approach the results are produced in Fourier space, whilst with the second approach the results are produced in Hartley space, so that conversion from Hartley space to Fourier space is still required. The first approach will be referred to as the “double-resolution” approach, as it involves R²₄ FHT-based processing at double the required transform-space resolution – thus corresponding to a halving of the required resolving capability – whilst the second approach will be referred to as the “half-resolution” approach, as it involves R²₄ FHT-based processing at one half the required transform-space resolution – thus corresponding to a doubling of the required resolving capability. The required resolution will in turn be referred to as the full resolution, as it will correspond exactly to the resolving capability of the sought-after solution.

Note, however, that the effect of zero-padding the input data sequence is to produce interpolated results in the transform space, so that although greater accuracy may be achieved in locating tonal signals within that space, the resolving capability – that is, the ability to distinguish closely-spaced transform-space components of the signal – will not be improved at all.
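That remark is easy to demonstrate numerically (using numpy's FFT purely for illustration): zero-padding doubles the number of transform-space samples, yet every second bin of the padded transform coincides exactly with a bin of the unpadded one – the same spectrum is simply sampled more densely, with no new resolving capability.

```python
import numpy as np

N = 64
n = np.arange(N)
x = np.cos(2 * np.pi * 10 * n / N)                      # a single tone
X = np.fft.rfft(x)                                      # N-point spectrum
X_pad = np.fft.rfft(np.concatenate([x, np.zeros(N)]))   # zero-padded to 2N

# X_pad interpolates between the bins of X; the original bins re-appear
# unchanged at the even-indexed positions of the padded transform.
```

The odd-indexed bins of X_pad are the interpolated values; they sharpen the apparent location of a tone without separating two tones that the N-point transform could not already separate.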

8.2 Computing One DFT via Two Half-Length Regularized FHTs

4 FHTs, a regular and highly-parallel“R4FHT-to-R2FFT” conversion routine, or converter, is required which enables theHartley-space data to be suitably combined and transformed to Fourier space. Theconverter exploits the following properties:

(a) The outputs of a real-data DFT of length 2N, where N is a power of four, maybe obtained from the outputs of a complex-data DFT of length N (as discussedin Section 2.3.3 of Chapter 2).

(b) The real and imaginary components of the complex-data DFT outputs may eachbe independently obtained via a R2

4 FHT of length N.

The resulting FFT algorithm, which thus exploits two N-point R24 FHTs and one

R4FHT-to-R2FFT converter, produces outputs in Fourier space rather than Hartley

Page 128: The Regularized Fast Hartley Transform: Optimal Formulation of Real-Data Fast Fourier Transform for Silicon-Based Implementation in Resource-Constrained Environments

8.2 Computing One DFT via Two Half-Length Regularized FHTs 119

space and so may be regarded as belonging to the same class of specialized real-dataFFT algorithms as those discussed earlier in Section 2.2 of Chapter 2.

8.2.1 Derivation of 2ⁿ-Point Real-Data FFT Algorithm

Let us start by denoting the real-valued data sequence by {x[n]}, with the even-addressed sub-sequence given by {x_E[n]} and the odd-addressed sub-sequence by {x_O[n]}. After processing each sub-sequence by means of an N-point R²₄ FHT, let the R²₄ FHT outputs from the processing of the even-addressed samples be denoted by {X^(H)_E[k]} and those obtained from the processing of the odd-addressed samples by {X^(H)_O[k]}. The R²₄ FHT outputs may then be converted to DFT outputs by means of the expressions

X^(F)_R,E[k] = 1/2 (X^(H)_E[k] + X^(H)_E[N − k])    (8.1)

X^(F)_I,E[k] = 1/2 (X^(H)_E[N − k] − X^(H)_E[k])    (8.2)

for the even-addressed terms, and

X^(F)_R,O[k] = 1/2 (X^(H)_O[k] + X^(H)_O[N − k])    (8.3)

X^(F)_I,O[k] = 1/2 (X^(H)_O[N − k] − X^(H)_O[k])    (8.4)

for the odd-addressed terms, where “X^(F)_R,E/O” denotes the real component of the DFT output and “X^(F)_I,E/O” the imaginary component.

Suppose that the sequences {Y_R[k]} and {Y_I[k]} are now introduced via the

expressions

Y_R[k] = X^(F)_R,E[k] − X^(F)_I,O[k]    (8.5)

Y_I[k] = X^(F)_I,E[k] + X^(F)_R,O[k].    (8.6)

Then

Y_R[N − k] = X^(F)_R,E[k] + X^(F)_I,O[k]    (8.7)

Y_I[N − k] = −X^(F)_I,E[k] + X^(F)_R,O[k]    (8.8)


and the 2N-point real-data DFT outputs, denoted {X^(F)_R[k]} for the real component and {X^(F)_I[k]} for the imaginary component, may be written as

X^(F)_R[k] = 1/2 [Y_R[k] + Y_R[N − k] + cos(2πk/2N)·(Y_I[k] + Y_I[N − k]) − sin(2πk/2N)·(Y_R[k] − Y_R[N − k])]    (8.9)

X^(F)_I[k] = 1/2 [Y_I[k] − Y_I[N − k] − sin(2πk/2N)·(Y_I[k] + Y_I[N − k]) − cos(2πk/2N)·(Y_R[k] − Y_R[N − k])],    (8.10)

where the DFT outputs that correspond to the required non-negative half of the frequency spectrum are addressed by means of the index “k” ∈ {0, 1, …, N − 1}.
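Equations 8.1–8.10 can be prototyped end-to-end in floating point. This is a sketch for checking the algebra, not the parallel fixed-point architecture: the function names are ours, and the direct O(N²) Hartley transform below merely stands in for an N-point regularized FHT.

```python
import numpy as np

def dht(x):
    """Direct discrete Hartley transform with the cas = cos + sin kernel."""
    N = len(x)
    n = np.arange(N)
    arg = 2.0 * np.pi * np.outer(n, n) / N
    return (np.cos(arg) + np.sin(arg)) @ x

def real_dft_via_two_dhts(x):
    """2N-point real-data DFT via two N-point DHTs plus the Hartley-to-
    Fourier conversion of Eqs. 8.1-8.10; returns bins k = 0 .. N-1."""
    N = len(x) // 2
    XE, XO = dht(x[0::2]), dht(x[1::2])
    k = np.arange(N)
    r = (N - k) % N                        # the index N - k, taken mod N
    XRE = 0.5 * (XE[k] + XE[r])            # Eq. 8.1
    XIE = 0.5 * (XE[r] - XE[k])            # Eq. 8.2
    XRO = 0.5 * (XO[k] + XO[r])            # Eq. 8.3
    XIO = 0.5 * (XO[r] - XO[k])            # Eq. 8.4
    YR, YI = XRE - XIO, XIE + XRO          # Eqs. 8.5-8.6
    YRr, YIr = XRE + XIO, XRO - XIE        # Eqs. 8.7-8.8 (values at N - k)
    c, s = np.cos(np.pi * k / N), np.sin(np.pi * k / N)
    XR = 0.5 * (YR + YRr + c * (YI + YIr) - s * (YR - YRr))   # Eq. 8.9
    XI = 0.5 * (YI - YIr - s * (YI + YIr) - c * (YR - YRr))   # Eq. 8.10
    return XR + 1j * XI

x = np.sin(0.7 * np.arange(32)) + 0.3      # arbitrary real test sequence
X = real_dft_via_two_dhts(x)               # agrees with np.fft.fft(x)[:16]
```

The agreement with a standard complex FFT of the full sequence confirms that the converter's output is the non-negative half of the 2N-point spectrum.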

The even-symmetric nature of the sinusoidal function relative to the argument π/2 (which corresponds here to the index k = N/2), together with its periodicity – see Equations 4.56–4.58 of Chapter 4 – enables four real outputs and four imaginary outputs to be produced from the application of each pair of half-resolution trigonometric function values. Thus, four complex-valued Fourier-space samples may be efficiently computed from two sets of four real-valued Hartley-space samples by means of the R4FHT-to-R2FFT converter, as shown in the SFG of Fig. 8.1, where the “cos” and “sin” trigonometric function values are referred to via the parameters “WR” and “WI”, respectively. The DFT addresses for each set of four complex-valued outputs are expressed via the indices

m1 ∈ {0, 1, …, N/4 − 1},  m2 = N − m1 − 1,  m3 = N/2 − m1 − 1  &  m4 = N/2 + m1    (8.11)

so that each of the two memories containing the R²₄ FHT output sets, which are already physically partitioned column-wise into eight memory banks, now needs to be conceptually partitioned row-wise, with each being divided into four quadrants, with the address “m1” corresponding to locations in the first quadrant, “m2” to locations in the second quadrant, “m3” to locations in the third quadrant and “m4” to locations in the fourth quadrant – as shown in Fig. 8.2.
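The quadrant indexing of Equation 8.11 is easily enumerated (a sketch with names of our choosing; note that the N/4 index quadruples together cover each of the N output addresses exactly once, one address per quadrant):

```python
def converter_addresses(N):
    """(m1, m2, m3, m4) address sets of Eq. 8.11, one per converter pass."""
    return [(m1, N - m1 - 1, N // 2 - m1 - 1, N // 2 + m1)
            for m1 in range(N // 4)]

quads = converter_addresses(16)   # first pass touches addresses 0, 15, 7 and 8
```

This covering property is what allows the converter to write its four complex outputs back in place, one to each quadrant, without ever touching the same address twice.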

The R²₄ FHT outputs of the even-addressed samples, {X^(H)_E[k]}, which are stored in the same memory as the original even-addressed data set, are subsequently overwritten by the real components of the DFT outputs, {X^(F)_R[k]}, this memory being thus referred to as the even-real data memory (DM_ER). Similarly, the R²₄ FHT outputs of the odd-addressed samples, {X^(H)_O[k]}, which are stored in the same memory as the original odd-addressed data set, are subsequently overwritten by the imaginary components of the DFT outputs, {X^(F)_I[k]}, this memory being thus referred to as the odd-imaginary data memory (DM_OI). Note that the two R²₄ FHTs may be computed in either sequential or parallel mode – as discussed later – but however this is done, the computation of the two R²₄ FHTs precedes that of the converter, so


[Fig. 8.1 residue: SFG combining the Hartley-space samples addressed by m1–m4 into real outputs X_R[m1..m4] and imaginary outputs X_I[m1..m4], using the trigonometric parameters WR and WI, with a ÷2 scaling on each output path]

Fig. 8.1 Signal flow graph for R4FHT-to-R2FFT converter

that the same memories – the DM_ER and the DM_OI – may be used for holding the input/output data to/from both the R²₄ FHTs and the converter.

Thus, in order to obtain a solution that meets the latency constraint discussed in Chapter 6, it is required that, after the computation of the two R²₄ FHTs has been

in Chapter 6, it is required that after the computation of the two R24 FHTs has been

completed, the R4FHT-to-R2FFT converter combines and transforms the two sets ofHartley-space outputs to Fourier space, doing so in the same highly-parallel fashionas is done with the individual R2

4 FHTs. This means that a partitioned-memory com-puting architecture is required which enables conflict-free and (for the data) in-placeparallel memory addressing of both the data and the trigonometric coefficients forthe operation of both the R2

4 FHTs and the R4FHT-to-R2FFT converter.


[Fig. 8.2 residue: each memory is divided into four quadrants of N/4 rows × 8 columns, indexed by m1, m2, m3 and m4. The “even-real memory” (DM_ER) holds the regularized FHT output for the even-addressed input samples and then the real component of the real-data FFT output; the “odd-imaginary memory” (DM_OI) holds the regularized FHT output for the odd-addressed input samples and then the imaginary component of the real-data FFT output.]

Fig. 8.2 Memory structure for in-place parallel addressing by R4FHT-to-R2FFT converter

8.2.2 Implementational Considerations

Now in order for the R4FHT-to-R2FFT conversion routine – which from the SFG of Fig. 8.1 requires a total of eight real multiplications and 20 real additions – to be carried out in such a manner, it is necessary that the input data set for each instance of the converter (which is obtained by taking one sample from each quadrant of both the DM_ER and the DM_OI) is such that, of the four samples obtained from each memory, no more than two possess the same memory bank address, given the dual-port nature of the memory. Unfortunately, however, this is not the case, as it is quite possible for all four samples from each memory to appear in precisely the same memory bank.

To address this problem, it is first necessary to introduce a small intermediate data memory (DM_I), of two-dimensional form, partitioned into eight rows of eight columns, where each memory bank is capable of holding a single sample. As a result, the entire memory is capable of holding eight complete sets of GD-BFLY outputs – that is, a total of 64 samples – four even-addressed sets of outputs and four odd-addressed sets. Each GD-BFLY output set, which is stored in a single row of memory (either in the DM_ER or the DM_OI), is now mapped to a single row (or column) of the DM_I, so that once the DM_I is full, the samples may then be used to provide eight complete sets of input data for feeding to the converter, where the eight samples of each set now come from eight distinct single-sample memory banks. The


R4FHT-to-R2FFT converter is now able to update the contents of the DM_I with its own outputs, eight samples at a time, after which the contents may be written back to the DM_ER and the DM_OI, again eight samples at a time.

Thus, if a row of samples is read from each quadrant of the DM_ER and the DM_OI, eight samples at a time, and subsequently written to a row (or column) of the DM_I, eight samples at a time, in the appropriate order, then when any set of eight samples required by the converter is accessed from the DM_I, the samples are obtained from eight distinct single-sample memory banks, so that the data is read/written from/to the DM_I without conflict. This, in turn, enables the eight reads/writes to be carried out simultaneously in just one clock cycle, so that a suitably pipelined implementation of the R4FHT-to-R2FFT converter would be able to produce all eight outputs in a single clock cycle, as required.

With the introduction of a second DM_I, identical in form to the first, it is now possible for eight complete sets of eight R²₄ FHT outputs to be built up and stored in one DM_I, whilst the data in the other DM_I is being processed to yield eight complete sets of DFT outputs – both the real and the imaginary components – to be written back, eight samples at a time, to the DM_ER and the DM_OI. This processing scheme involves a start-up delay of eight clock cycles to allow for the first DM_I to be filled, first time around, and a completion delay of eight clock cycles to allow for the second DM_I to be emptied, last time around, whilst in between the functions of the DM_I alternate every eight clock cycles, with the contents of one DM_I being updated with data from the DM_ER and DM_OI whilst the contents of the other DM_I is being processed by the R4FHT-to-R2FFT converter. To achieve this, both the data memory and one of the intermediate memories need to be updated every eight clock cycles, so that simultaneous reads/writes are required for each memory type, where in each case care needs to be taken to ensure that memory is not updated before it has first been used – see the three consecutive updates given by the scheduling scheme of Fig. 8.3.

Thus, it is possible for a partitioned-memory computing architecture to be defined which enables the processing for both the two N-point R24 FHTs and the R4FHT-to-R2FFT converter to be efficiently carried out, in an in-place fashion, where the basic components of the solution are as shown in the scheme of Fig. 8.4. Note that the two intermediate memories required of such a scheme are best built with programmable logic, so as not to waste potentially large quantities of fast and expensive embedded RAM in their construction, as embedded memory normally comes with a minimum size of some several thousands of bits, rather than just a few tens of bits, as required for each of the 64 banks of each DMI.

[Figure: three consecutive updates, n−1, n and n+1, with the two intermediate memories DMI1 and DMI2 alternating roles – one being processed by the converter whilst the other exchanges its contents with the DM.]

Fig. 8.3 Scheduling of memory-based operations


124 8 Computation of 2n-Point Real-Data Discrete Fourier Transform

[Figure: two R24 FHTs, each with its own data memory and trigonometric coefficient memory, feeding the R4FHT-to-R2FFT converter with its own trigonometric coefficient memory; the input/output data is exchanged via the even and odd addresses of the data memories.]

Fig. 8.4 Scheme for 2N-point real-data FFT using N-point regularized FHT

8.2.2.1 Solution Exploiting One PE for Computation of Regularized FHTs

With regard to the sequential version of the solution, whereby a single PE is assigned to the computation of the two R24 FHTs so that they must be executed sequentially, one after another, the time-complexity for the 2N-point real-data DFT using the double-resolution approach, denoted T_SDR, is given by

  T_SDR = (1/4)N(log4 N + 1) + 16    (8.12)

clock cycles, which includes the start-up and completion delays for the DMI, with the associated arithmetic complexity for both the PE and the converter given by either 20 multiplications and 42 additions, when using Versions I or III of the R24 FHT solution, or 17 multiplications and 45 additions when using Versions II or IV. Note, however, that this figure excludes any contributions for the pipeline start-up delays of both the R24 FHTs and the R4FHT-to-R2FFT converter.

The worst-case memory requirement for the 2N-point real-data DFT using the sequential version of the double-resolution approach involves two sets of eight DM banks for storage of the data, with each bank holding N/8 samples, and four LUTs for storage of the trigonometric coefficients – one set of three single-quadrant LUTs with each LUT holding N/4 double-resolution trigonometric coefficients for minimum-arithmetic addressing by the R24 FHT and one single-quadrant LUT holding N/2 full-resolution trigonometric coefficients for minimum-arithmetic addressing by the R4FHT-to-R2FFT converter. This results in a total memory requirement, denoted M(W)_SDR, of

  M(W)_SDR = 2N + (5/4)N + 128 = (13/4)N + 128    (8.13)

words, which includes the requirement for the DMI, with the associated arithmetic complexity for the memory addressing given by zero when using Version I of the R24 FHT solution or six additions when using Version II.

In comparison, the best-case memory requirement for the 2N-point real-data DFT using the sequential version of the double-resolution approach involves two sets of eight DM banks for storage of the data, with each bank holding N/8 samples, and six complementary-angle LUTs (that is, two two-level LUTs) for storage of the trigonometric coefficients – one set of three complementary-angle LUTs with each LUT holding √N/2 double-resolution trigonometric coefficients for minimum-memory addressing by the R24 FHT and one set of three complementary-angle LUTs with each LUT holding √(2N)/2 full-resolution trigonometric coefficients for minimum-memory addressing by the R4FHT-to-R2FFT converter. This results in a total memory requirement, denoted M(B)_SDR, of

  M(B)_SDR = 2N + (3/2)(√N + √(2N)) + 128    (8.14)

words, which includes the requirement for the DMI, with the associated arithmetic complexity for the addressing given by seven multiplications and eight additions when using Version III of the R24 FHT solution or seven multiplications and 14 additions when using Version IV.

8.2.2.2 Solution Exploiting Two PEs for Computation of Regularized FHTs

With regard to the parallel version of the solution, whereby a separate PE is assigned to the computation of each of the R24 FHTs so that they may be executed simultaneously, or in parallel, the time-complexity for the 2N-point real-data DFT using the double-resolution approach, denoted T_PDR, is given by

  T_PDR = (1/8)N(log4 N + 2) + 16    (8.15)

clock cycles, which includes the start-up and completion delays for the DMI, with the associated arithmetic complexity for both the two PEs and the converter given by either 32 multiplications and 64 additions, when using Versions I or III of the R24 FHT solution, or 26 multiplications and 70 additions when using Versions II or IV. Note, however, that this figure excludes any contributions for the pipeline start-up delays of both the R24 FHTs and the R4FHT-to-R2FFT converter.

The worst-case memory requirement for the 2N-point real-data DFT using the parallel version of the double-resolution approach involves two sets of eight DM banks for storage of the data, with each bank holding N/8 samples, and seven single-quadrant LUTs for storage of the trigonometric coefficients – two sets of three single-quadrant LUTs with each LUT holding N/4 double-resolution trigonometric coefficients for minimum-arithmetic addressing by the two R24 FHTs and one single-quadrant LUT holding N/2 full-resolution trigonometric coefficients for minimum-arithmetic addressing by the R4FHT-to-R2FFT converter. This results in a total memory requirement, denoted M(W)_PDR, of

  M(W)_PDR = 2N + 2N + 128 = 4N + 128    (8.16)

words, which includes the requirement for the DMI, with the associated arithmetic complexity for the addressing given by zero when using Version I of the R24 FHT solution or six additions when using Version II.

In comparison, the best-case memory requirement for the 2N-point real-data DFT using the parallel version of the double-resolution approach involves two sets of eight DM banks for storage of the data, with each bank holding N/8 samples, and nine complementary-angle LUTs (that is, three two-level LUTs) for storage of the trigonometric coefficients – two sets of three complementary-angle LUTs with each LUT holding √N/2 double-resolution trigonometric coefficients for minimum-memory addressing by the two R24 FHTs and one set of three complementary-angle LUTs with each LUT holding √(2N)/2 full-resolution trigonometric coefficients for minimum-memory addressing by the R4FHT-to-R2FFT converter. This results in a total memory requirement, denoted M(B)_PDR, of

  M(B)_PDR = 2N + 3(√N + (1/2)√(2N)) + 128    (8.17)

words, which includes the requirement for the DMI, with the associated arithmetic complexity for the addressing given by seven multiplications and eight additions when using Version III of the R24 FHT solution or seven multiplications and 14 additions when using Version IV.

Further memory reductions to those given above by Equations 8.16 and 8.17 may be achieved, however, by simply sharing the LUTs containing the double-resolution trigonometric coefficients between the two R24 FHTs, as they both contain precisely the same information and need to be accessed in precisely the same order for each R24 FHT. This memory reduction – which results in the same memory requirement as for the sequential solution – yields worst-case and best-case figures of

  M(W)_PDR = (13/4)N + 128    (8.18)

and

  M(B)_PDR = 2N + (3/2)(√N + √(2N)) + 128    (8.19)


words, respectively, which includes the requirement for the DMI and could be achieved at a minimal cost of a slightly more complex memory addressing scheme.

8.2.2.3 Summary of Latency-Constrained Solutions

The theoretical performance and resource utilization figures for the latency-constrained computation of the 2N-point real-data DFT, where N is a power of four, by means of the R24 FHT are summarized in Table 8.1 below, where “S” refers to the sequential solution with one PE being assigned to the computation of both R24 FHTs, and “P” to the parallel solution with one PE being assigned to the computation of each R24 FHT. The results highlight the achievable computational density of the parallel solution, when compared to the sequential version, as the resulting throughput is nearly doubled at the minimal expense of an additional 12 fast fixed-point multipliers and 22 adders for Versions I or III of the R24 FHT solution or just nine fast fixed-point multipliers and 25 adders for Versions II or IV (and, of course, increased programmable logic).

8.2.2.4 Reduced-Complexity Solution for Increasing Throughput Rate

An alternative hardware-efficient solution to the 2N-point real-data DFT problem may be obtained, again using the double-resolution approach, by setting up a two-stage computational pipeline with the first stage of the pipeline performing sequentially the computation of the two N-point R24 FHTs and the second stage performing the computation of the R4FHT-to-R2FFT converter. With such an approach the DM requirement would need to be increased by 2N words, however, so that the outputs from the R24 FHTs could be double buffered, thereby enabling one pair of R24 FHT output sets to be processed by the converter whilst another pair of output sets was being produced.

With such an approach, given that the first CS of the pipeline has 2N clock cycles within which to complete the computation of the two N-point R24 FHTs, the second CS would also have 2N clock cycles within which to complete its own task, namely the computation of the R4FHT-to-R2FFT converter. However, when the converter is carried out in a highly parallel fashion, as previously described in this section, the second CS would require only N/4 + 16 clock cycles (ignoring the pipeline delay) to complete its task, so that the time complexities of the two stages will differ by a significant factor, whereas optimum utilization of resources requires that they should be comparable. By adopting a much simpler sequential solution to the converter, however, with outputs being produced at the rate of just one or maybe two per clock cycle, rather than eight, comparable time complexities may be achieved and at greatly reduced silicon cost due to the resulting reduction in processing complexity and the associated simplicity of the control logic.
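The imbalance between the two stages is easily quantified from the cycle counts quoted above; the sketch below simply restates that arithmetic (pipeline start-up delays ignored, as in the text):

```python
# Sketch of the stage-balance argument for the two-stage pipeline, using
# the cycle counts quoted in the text (pipeline start-up delays ignored).

def parallel_converter_cycles(N: int) -> int:
    # Highly parallel converter: eight outputs per clock cycle, plus the
    # 16-cycle start-up/completion overhead of the DMI double buffer.
    return N // 4 + 16

def sequential_converter_cycles(N: int) -> int:
    # Simple sequential converter: one output per clock cycle for the
    # 2N outputs of the 2N-point transform.
    return 2 * N

for N in (256, 1024, 4096):
    budget = 2 * N  # cycles available whilst the two N-point FHTs run
    assert parallel_converter_cycles(N) < budget      # stage under-utilized
    assert sequential_converter_cycles(N) == budget   # stages balanced
```

The sequential converter thus matches its 2N-cycle budget exactly, which is why it yields the better utilization of silicon resources.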

Thus, by doubling the latency and increasing the DM requirement to allow for double buffering, a much simpler solution requiring considerably less control


Table 8.1 Theoretical performance analysis for 2N-point real-data FFT where N is a power of four

                                 Arithmetic complexity
  Type of solution          Multipliers  Adders  Memory requirement (words)      Time complexity (clock cycles)
  Version I, Mode 'S'       20           42      (13/4)N + 128                   (1/4)N(log4 N + 1) + 16
  Version I, Mode 'P'       32           64      (13/4)N + 128                   (1/8)N(log4 N + 2) + 16
  Version II, Mode 'S'      17           51      (13/4)N + 128                   (1/4)N(log4 N + 1) + 16
  Version II, Mode 'P'      26           76      (13/4)N + 128                   (1/8)N(log4 N + 2) + 16
  Version III, Mode 'S'     27           50      2N + (3/2)(√N + √(2N)) + 128    (1/4)N(log4 N + 1) + 16
  Version III, Mode 'P'     39           72      2N + (3/2)(√N + √(2N)) + 128    (1/8)N(log4 N + 2) + 16
  Version IV, Mode 'S'      24           59      2N + (3/2)(√N + √(2N)) + 128    (1/4)N(log4 N + 1) + 16
  Version IV, Mode 'P'      33           84      2N + (3/2)(√N + √(2N)) + 128    (1/8)N(log4 N + 2) + 16


logic may be achieved which, although not able to meet the latency constraint, is nevertheless able to produce a new set of outputs for the 2N-point real-data DFT every 2N clock cycles.

8.3 Computing One DFT via One Double-Length Regularized FHT

This section discusses the second of the two approaches and is concerned with the computation of the N-point real-data DFT, where “2N” is a power of four. To see how this may be achieved, using one 2N-point R24 FHT, let us first turn to an important result from Section 3.5 of Chapter 3, namely that of Parseval’s Theorem, which states that the energy in a signal is preserved under a unitary or orthogonal transformation, such as with the DFT or DHT, this being expressed as

  Σ_{n=0}^{N−1} |x[n]|² = Σ_{k=0}^{N−1} |X(F)[k]|² = Σ_{k=0}^{N−1} |X(H)[k]|²,    (8.20)

so that the energy measured in the data space is equal to that measured in the transform space.
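The energy-preservation property of Equation 8.20 is easily checked numerically; the sketch below applies the normalized DHT, of the form used throughout this chapter, to an arbitrary real-valued sequence:

```python
import math

def dht(x):
    """Normalized DHT: X[k] = (1/sqrt(N)) * sum_n x[n]*cas(2*pi*n*k/N),
    where cas(t) = cos(t) + sin(t)."""
    N = len(x)
    return [sum(x[n] * (math.cos(2 * math.pi * n * k / N) +
                        math.sin(2 * math.pi * n * k / N))
                for n in range(N)) / math.sqrt(N)
            for k in range(N)]

x = [1.0, -2.0, 3.5, 0.25, -1.0, 4.0, 0.0, 2.0]
X = dht(x)

# Energy in data space equals energy in Hartley space (Equation 8.20).
assert abs(sum(v * v for v in x) - sum(v * v for v in X)) < 1e-9
```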

This result, combined with the familiar DFT-based technique of obtaining an interpolated frequency spectrum by performing the DFT on a zero-padded input data set, is now exploited to obtain a simple algorithm for obtaining the Hartley-space outputs for a 2n-point FHT by means of the R24 FHT and thus, after Hartley space to Fourier space conversion, the Fourier-space outputs for a 2n-point real-data DFT.

8.3.1 Derivation of 2n-Point Real-Data FFT Algorithm

Let us start by applying the DHT to a length N data sequence so that the output, denoted {X(H)_N[k]}, is given by

  X(H)_N[k] = (1/√N) Σ_{n=0}^{N−1} x[n]·cas(2πnk/N)    (8.21)

and then apply the DHT to a length 2N data sequence obtained by appending N zero-valued samples to the same N samples as used above, so that the output, denoted {X(H)_2N[k]}, is given by

  X(H)_2N[k] = (1/√(2N)) Σ_{n=0}^{2N−1} x[n]·cas(2πnk/2N).    (8.22)


Then by considering only the even-addressed outputs of Equation 8.22 we have that

  X(H)_2N[2k] = (1/√(2N)) Σ_{n=0}^{2N−1} x[n]·cas(2πn(2k)/2N)
              = (1/√(2N)) Σ_{n=0}^{N−1} x[n]·cas(2πnk/N),    (8.23)

so that

  X(H)_2N[2k] = (1/√2) X(H)_N[k],    (8.24)

meaning that the signal energy measured at index “k” using the N-point transform is equal to twice that obtained when it is measured at the corresponding index (that is, “2k”) using the 2N-point transform, as with the longer transform the energy is being spread over twice as many outputs. In fact, from Parseval’s Theorem, we have that

  Σ_{n=0}^{N−1} |x[n]|² = Σ_{k=0}^{N−1} |X(H)_N[k]|²
                        = Σ_{k=0}^{N−1} ( |X(H)_2N[2k]|² + |X(H)_2N[2k+1]|² )
                        = Σ_{k=0}^{N−1} ( (1/2)|X(H)_N[k]|² + |X(H)_2N[2k+1]|² )    (8.25)

so that one half of the signal energy is contained in the even-addressed outputs and the other half in the odd-addressed outputs.

The Hartley-space outputs of interest correspond to the even-addressed outputs, so that although the solution to the 2N-point DHT – as carried out by the R24 FHT – produces all 2N outputs, both even-addressed and odd-addressed, it is only the even-addressed outputs that need to be subsequently converted from Hartley space to Fourier space.
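The relationship of Equation 8.24 may be verified numerically, as in the sketch below, where the DHT of a zero-padded data set is computed directly from the definition of Equation 8.21:

```python
import math

def dht(x):
    """Normalized DHT: X[k] = (1/sqrt(N)) * sum_n x[n]*cas(2*pi*n*k/N),
    where cas(t) = cos(t) + sin(t)."""
    N = len(x)
    return [sum(x[n] * (math.cos(2 * math.pi * n * k / N) +
                        math.sin(2 * math.pi * n * k / N))
                for n in range(N)) / math.sqrt(N)
            for k in range(N)]

x = [1.0, -2.0, 3.5, 0.25, -1.0, 4.0, 0.0, 2.0]   # N = 8 samples
X_N = dht(x)
X_2N = dht(x + [0.0] * len(x))                    # zero-padded to length 2N

# Even-addressed outputs of the 2N-point DHT reproduce the N-point DHT
# scaled by 1/sqrt(2), as per Equation 8.24.
for k in range(len(x)):
    assert abs(X_2N[2 * k] - X_N[k] / math.sqrt(2)) < 1e-9
```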

8.3.2 Implementational Considerations

Although the data need only be generated N samples at a time, the on-chip memory – in the form of the eight memory banks for the storage of the data and three LUTs for the storage of the trigonometric coefficients, as discussed in Chapter 6 – needs to cater for twice that amount of data and up to twice the corresponding number of trigonometric coefficients due to the fact that half-resolution processing is being used to derive the required DFT outputs. As a result, each DM bank needs to be


able to hold N/4 data samples and each of the three LUTs needs to be able to hold either N/2 trigonometric coefficients, for Versions I or II of the R24 FHT solution, or √(2N)/2 trigonometric coefficients for Versions III or IV. Thus, disregarding the Hartley space to Fourier space conversion requirement, which from Section 3.4 of Chapter 3 is trivial, the time-complexity for the N-point real-data DFT using the half-resolution approach, denoted T_HR, is given by

  T_HR = (N/4)·log4(2N)    (8.26)

clock cycles, which excludes any contribution for the pipeline delay of the R24 FHT, whilst the worst-case total memory requirement, denoted M(W)_HR, is given by

  M(W)_HR = 2N + (3/2)N = (7/2)N    (8.27)

words, and the best-case total memory requirement, denoted M(B)_HR, is given by

  M(B)_HR = 2N + (3/2)√(2N)    (8.28)

words.

Thus, it is evident from the time-complexity figure of Equation 8.26, that in order to produce a new Hartley-space output set of length 2N every N clock cycles, as required, it will be necessary to set up a new input data set every N clock cycles – each set comprising N new samples and N zero-valued samples – and for values of “N” such that 2N > 256 (from the time-complexity figures of Equations 6.11 and 6.12 in Chapter 6) to effectively double the throughput of the standard R24 FHT-based approach.

One way of achieving this is to process alternate data sets on separate R24 FHTs running in parallel and offset by N clock cycles relative to each other – see Fig. 8.5. In this way, the latency of each R24 FHT would be bounded above by 2N clock cycles, for those transform sizes of interest, whilst for the same transform sizes the update time of the dual-R24 FHT system would be bounded above by just N clock cycles.

[Figure: timeline in which two regularized FHTs run in parallel, offset by N clock cycles, with a new transform starting every N clock cycles over the interval t = 0 to t = 7N.]

Fig. 8.5 Dual-R24 FHT approach to half-resolution processing scheme


Therefore, given that a single 2N-point R24 FHT requires twice the DM requirement and up to twice the CM requirement (depending upon the addressing scheme used) of a single N-point R24 FHT – albeit with the same arithmetic complexity – the required update time, achieved via the use of two 2N-point R24 FHTs, would involve up to four times the memory requirement and twice the arithmetic complexity.

An alternative approach to that described above would be to adopt a single 2N-point R24 FHT, rather than two, but to assign two PEs to its computation, as described in Section 6.6 of Chapter 6, thus doubling the throughput of the R24 FHT and enabling the processing to keep up with the I/O over each block of data. The feasibility of a dual-PE solution such as this would clearly be determined, however, by the viability of using either the more complex quad-port memory or a doubled read/write access rate to the dual-port memory, for both the DM and the CM, as it will be necessary to read/write two samples from/to each of the eight memory banks for each clock cycle, as well as to read four (rather than two) trigonometric coefficients from each of the LUTs. Thus, achieving the required update time via the use of a dual-PE R24 FHT such as this would involve twice the arithmetic complexity of a single N-point R24 FHT solution, together with either the replacement of all dual-port memory by quad-port memory or a doubling of the read/write access rate to the dual-port memory.

As a result, the achievable computational density for a solution to the real-data DFT based upon the half-resolution approach that achieves the required timing constraint may be said to lie between one quarter and one half of that achievable for a 4n-point real-data DFT via the conventional use of the R24 FHT, the exact fraction being dependent upon the length of the transform – the longer the transform the larger the relative memory requirement and the lower the relative computational density – and the chosen approach. The practicality of such a solution is therefore very much dependent upon the implementational efficiency of the R24 FHT compared to that of other commercially-available solutions. The results of Chapters 6 and 7, however, would seem to suggest the adoption of the R24 FHT for both 2n-point and 4n-point cases to be a perfectly viable option.

8.4 Discussion

The first solution discussed in Section 8.2 has shown how the highly-parallel GD-BFLY may be effectively exploited for the computation of the 2N-point real-data DFT, where “N” is a power of four. The solution was obtained by means of a “double-resolution” approach involving FHT-based processing at double the required transform-space resolution via the application of two half-length regularized FHTs. The R4FHT-to-R2FFT converter uses a conflict-free and in-place parallel memory addressing scheme to enable the computation for the 2n-point case to be carried out in the same highly-parallel fashion as for the 4n-point case.

The solution has some other interesting properties, even when the complexity is viewed purely in terms of sequential arithmetic operation counts, as the computation of the 2N-point real-data DFT – when N is a power of four – requires a total of


  Cmply_FFT = 2N·log2(2N)    (8.29)

real multiplications when obtained via one of the real-from-complex strategies discussed in Chapter 2, using the standard complex-data Cooley–Tukey algorithm, but only

  Cmply_FHT = N(3·log4 N + 2)    (8.30)

real multiplications when obtained via the combined use of the R24 FHT and the R4FHT-to-R2FFT converter. Thus, for the computation of a 2K-point real-data DFT, for example, this means 22,528 real multiplications via the complex-data radix-2 FFT or 15,360 real multiplications via the R24 FHT, implying a reduction of nearly one-third by using the solution outlined here. The split-radix algorithm could be used instead of the Cooley–Tukey algorithm to further reduce the multiplication count of the radix-2 FFT but only at the expense of a loss of regularity in the FFT design.

The second solution discussed in Section 8.3 has shown how the highly-parallel GD-BFLY may be effectively exploited for the computation of the N-point real-data DFT, where “2N” is a power of four. The solution was obtained by means of a “half-resolution” approach involving FHT-based processing at one half the required transform-space resolution via the application of one double-length regularized FHT.

A point worth noting is that if

  DHT({x[0], x[1], x[2], x[3]}) = {X(H)[0], X(H)[1], X(H)[2], X(H)[3]},    (8.31)

say, then it is also true, via a theorem applicable to both the DFT and the DHT, namely the Stretch or Repeat Theorem [2], that

  DHT({x[0], x[1], x[2], x[3], x[0], x[1], x[2], x[3]})
    = {2X(H)[0], 0, 2X(H)[1], 0, 2X(H)[2], 0, 2X(H)[3], 0},    (8.32)

this result being true, not just for the four-point sequence shown, but for a data sequence of any length. As a result, an alternative to the zero-padding approach, which instead involves the idea of transforming a repeated or replicated data set, could be used to extract the required FHT outputs from those of a double-length R24 FHT. Note, however, that the magnitudes of the required even-addressed output samples are twice what they should be so that scaling may be necessary – namely division by two which in fixed-point hardware reduces to that of a simple right shift operation – in order to achieve the correct magnitudes, this being applied either to the input samples or to the output samples.
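The Stretch (or Repeat) Theorem of Equation 8.32 may be checked directly, as in the sketch below; note that the un-normalized form of the DHT is assumed here, this being the convention under which the even-addressed outputs are exactly doubled:

```python
import math

def dht_unnorm(x):
    """Un-normalized DHT: X[k] = sum_n x[n]*cas(2*pi*n*k/N), as assumed
    by the statement of the Stretch/Repeat Theorem in Equation 8.32."""
    N = len(x)
    return [sum(x[n] * (math.cos(2 * math.pi * n * k / N) +
                        math.sin(2 * math.pi * n * k / N))
                for n in range(N))
            for k in range(N)]

x = [1.0, -2.0, 3.5, 0.25]
X = dht_unnorm(x)
Y = dht_unnorm(x + x)        # transform of the repeated data set

for k in range(len(x)):
    assert abs(Y[2 * k] - 2 * X[k]) < 1e-9   # even outputs: doubled
    assert abs(Y[2 * k + 1]) < 1e-9          # odd outputs: zero
```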

The two solutions discussed, based upon both double-resolution and half-resolution approaches – and for which the mathematical/logical correctness of their operation has been proven both in software, via a computer program written in the “C” programming language, and in silicon with a non-optimized Virtex-II Pro 100 FPGA implementation – thus enable the R24 FHT to be applied, potentially, to a great many more problems, including those that might not necessarily be best solved through the direct application of a 4n-point transform.

References

1. K.J. Jones, R. Coster, Area-efficient and scalable solution to real-data fast Fourier transform via regularised fast Hartley transform. IET Signal Process. 1(3), 128–138 (2007)

2. J.O. Smith III, Mathematics of the Discrete Fourier Transform (DFT) with Audio Applications (W3K Publishing, Stanford, CA, 2007)


Chapter 9
Applications of Regularized Fast Hartley Transform

Abstract This chapter discusses the application of the regularized FHT to a number of computationally-intensive DSP-based functions that may benefit from the adoption of a transform-space solution, and in particular, where the data in question is real valued so that the processing may be efficiently carried out in Hartley space. The functions discussed are those of up-sampling, differentiation, correlation – both auto-correlation and cross-correlation – and channelization. Efficient channelization, for the case of a single channel (or small number of channels), may be achieved by means of a DDC process where the filtering is performed via fast Hartley-space convolution, whilst for the case of multiple channels, efficiency may be achieved via the application of the polyphase DFT filter bank. Each such function might typically be encountered in that increasingly important area of wireless communications relating to the geolocation of signal emitters, with each potentially able to yield both conceptually and computationally simplified solutions when solved via the regularized FHT. A discussion is finally provided relating to the results obtained in the chapter.

9.1 Introduction

Having now seen how the R24 FHT might be used for the efficient parallel computation of an N-point DFT where N may be a power of either two or four – although for optimal computational density it should be a power of four – the monograph concludes with the description of a number of DSP-based functions where the adoption of Hartley space, rather than Fourier space, as the chosen transform space within which to carry out the processing, may potentially lead to conceptually and computationally simplified solutions. Three particular sets of functions common to many modern DSP systems are discussed, namely:

1. The up-sampling and differentiation – for the case of both first and second derivatives – of a real-valued signal either individually or in combination.

2. The correlation function of two real-valued or complex-valued signals where the signals may both be of infinite duration, as encountered with cross-correlation,


or where one signal is of finite duration and the other of infinite duration, as encountered with auto-correlation.

3. The channelization of a real-valued signal which, for the case of a single channel (or small number of channels), may be achieved by means of a DDC process where the filtering is carried out via fast Hartley-space convolution, whilst for the case of multiple channels, may be achieved via the application of the polyphase DFT filter bank.

One important area of wireless communications where all three sets of functions might typically be encountered is that relating to the geolocation [8] of signal emitters, where there is a requirement to produce accurate timing measurements from the data gathered at a number of sensors, these measurements being generally obtained from the up-sampled outputs of a correlator. When the signal under analysis is of sufficiently wide bandwidth, however, the data would first have to be partitioned in frequency before such measurements could be made so as to optimize the SNR of the signal for specific frequency bands of interest prior to the correlation process. For the case of a single channel (or small number of channels) the associated filtering operation may, depending upon the parameters, be most efficiently carried out by means of fast transform-space convolution, whilst when there is a sufficiently large number of equi-spaced and equi-bandwidth channels, this process – which is generally referred to in the technical literature as channelization – is best carried out by means of a polyphase DFT filter bank [1, 4, 12].

The adoption of the transform-space approach in signal processing makes particular sense when a significant amount of the processing is able to be efficiently carried out in the transform space, so that several distinct tasks might be beneficially performed there before the resulting signal is transformed back to data space. A multi-sensor digital signal conditioner [5] has been defined, for example, which exploits the transform-space approach to carry out in a highly efficient manner, in Fourier space, the various tasks of sample-rate conversion, spectral shaping or filtering and malfunctioning sensor detection and compensation.

9.2 Fast Transform-Space Convolution and Correlation

Given the emphasis placed on the transform-space approach in this chapter it is perhaps worth illustrating firstly its importance by considering the simple case of the filtering of a real-valued signal by means of a length N FIR filter. A linear system [9, 10] such as this is characterized by means of an output signal that is obtained from the convolution of the system input signal with the system impulse response – as represented by a finite set of filter coefficients. A direct data-space formulation of the problem may be written, in un-normalized complex-data form, as

Rconvh;x Œk� D

N�1X

nD0

h�Œn�:xŒk � n�; (9.1)

where the superscript "*" refers to the operation of complex conjugation, so that each filter output requires N multiplications – this yields an O(N²) arithmetic complexity for the production of N filter outputs. Alternatively, a fast Hartley-space convolution approach – see Section 3.5 of Chapter 3 – combined with the familiar overlap-save or overlap-add technique [2] associated with conventional FFT-based linear convolution [2] (where the FHT of the filter coefficient set is fixed and pre-computed), might typically involve the application of two 2N-point FHTs and one transform-space product of length 2N in order to produce N filter outputs – this yields an O(N log N) arithmetic complexity. Thus, with a suitably chosen FHT algorithm, clear computational gains are achievable via fast Hartley-space convolution for relatively small values of N.
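The fast Hartley-space route can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy and computing the DHT via the FFT identity H[k] = Re(F[k]) − Im(F[k]) (no dedicated FHT routine is assumed, and the function names are purely illustrative); the circular Hartley-space convolution of two real sequences, built from each bin k and its dual N − k, is checked against the direct O(N²) sum:

```python
import numpy as np

def dht(x):
    # Discrete Hartley transform via the FFT: H[k] = Re(F[k]) - Im(F[k])
    F = np.fft.fft(x)
    return F.real - F.imag

def hartley_circ_convolve(x, y):
    # Circular convolution of two real sequences via one Hartley-space
    # product formed from each bin k and its dual N-k.
    N = len(x)
    idx = (-np.arange(N)) % N            # index map k -> N-k (mod N)
    X, Y = dht(x), dht(y)
    Z = 0.5 * Y * (X + X[idx]) + 0.5 * Y[idx] * (X - X[idx])
    return dht(Z) / N                    # inverse DHT = forward DHT / N

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
y = rng.standard_normal(16)
z = hartley_circ_convolve(x, y)
direct = np.array([sum(x[n] * y[(k - n) % 16] for n in range(16))
                   for k in range(16)])
assert np.allclose(z, direct)
```

For linear (FIR) filtering the same product would be used inside an overlap-save or overlap-add segmentation, as described above.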

The correlation function is generally defined as measuring the degree of correlation or similarity between a given signal and a shifted replica of that signal. From this, the basic data-space formulation for the cross-correlation function of two arbitrary complex-valued signals may be written, in un-normalized form and with arbitrary upper and lower limits, as

$$R^{\mathrm{corr}}_{h,x}[k] = \sum_{n=\mathrm{lower}}^{\mathrm{upper}} h^{*}[n] \cdot x[k+n], \qquad (9.2)$$

which is similar in form to that for the convolution function of Equation 9.1 except that there is no need to apply the folding operation [2] to one of the two functions to be correlated. In fact, if either of the two functions to be correlated is an even function, then the operations of convolution and correlation are equivalent. The above expression is such that: (1) when both sequences are of finite length, it corresponds to the cross-correlation function of two finite-duration signals – to be discussed in Section 9.4.2; (2) when one sequence is of infinite length and the other a finite-length stored reference it corresponds to the auto-correlation function – to be discussed in Section 9.4.3; or (3) when both sequences are of infinite length it corresponds to the cross-correlation function of two continuous data streams – to be discussed in Section 9.4.4.

As evidenced from the discussion above relating to the convolution-based filtering problem, the larger the correlation problem the greater the potential benefits to be gained from the adoption of a transform-space approach, particularly when the correlation operation is carried out by means of a fast unitary/orthogonal transform such as the FFT or the FHT.

9.3 Up-Sampling and Differentiation of Real-Valued Signal

This section looks briefly at how two basic DSP-based functions, namely those of up-sampling and differentiation, might be efficiently carried out by first transforming the real-valued signal from data space to Hartley space, via the application of a DHT, then modifying in some way the resulting Hartley-space data, before returning to the data space via the application of a second DHT to obtain the data corresponding to an appropriately modified version of the original real-valued signal.

9.3.1 Up-Sampling via Hartley Space

The first function considered is that of up-sampling where the requirement is to increase the sampling rate of the signal without introducing additional frequency components to the signal outside of its frequency range or band of definition – this function being also referred to as band-limited interpolation. Suppose that the signal is initially represented by means of "N" real-valued samples and that it is required to increase or interpolate this by a factor of "L". To achieve this, the real-valued data is first transformed from data space to Hartley space, via the application of a DHT of length N, with zero-valued samples being then inserted between the samples of the Hartley-space data according to the following rule [11]:

$$Y[k] = \begin{cases} L \cdot X[k] & \text{for } k \in [0, N/2-1] \\ \tfrac{1}{2} L \cdot X[N/2] & \text{for } k = N/2 \\ 0 & \text{for } k \in [N/2+1, M-N/2-1] \\ \tfrac{1}{2} L \cdot X[N/2] & \text{for } k = M-N/2 \\ L \cdot X[k-M+N] & \text{for } k \in [M-N/2+1, M-1] \end{cases} \qquad (9.3)$$

where M = L × N, before returning to the data space via the application of a second DHT, this time of length M, to obtain the resulting up-sampled signal, as required – see Fig. 9.1. Note that the non-zero terms in the above expression have been magnified by a factor of "L" so as to ensure, upon return to the data space, that the magnitudes of the original samples are preserved.
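As a concrete check of the zero-padding rule of Equation 9.3, the following Python sketch (a minimal illustration, assuming NumPy, an even value of N, and a DHT computed via the FFT; the function names are illustrative) up-samples a short real-valued signal and confirms that every L-th output sample reproduces the original:

```python
import numpy as np

def dht(x):
    # Discrete Hartley transform via the FFT: H[k] = Re(F[k]) - Im(F[k])
    F = np.fft.fft(x)
    return F.real - F.imag

def upsample_dht(x, L):
    # Band-limited interpolation: transform, zero-pad the centre of the
    # Hartley spectrum with the Nyquist term split in half (Equation 9.3),
    # then return to data space with a length-M inverse DHT.  N assumed even.
    N = len(x)
    M = L * N
    X = dht(x)
    Y = np.zeros(M)
    Y[:N//2] = L * X[:N//2]
    Y[N//2] = 0.5 * L * X[N//2]
    Y[M - N//2] = 0.5 * L * X[N//2]
    Y[M - N//2 + 1:] = L * X[N//2 + 1:]
    return dht(Y) / M                    # inverse DHT = forward DHT / M

x = np.cos(2 * np.pi * 3 * np.arange(16) / 16)
y = upsample_dht(x, 4)
# the original samples are preserved at every L-th output point
assert np.allclose(y[::4], x)
```

The magnification by "L" inside the rule is what makes the original sample values reappear exactly after the final scaling by 1/M.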

Note that the above technique, which has been defined for the up-sampling of a single segment of signal data, may be straightforwardly applied to the case of a continuous signal through the piecing together of multiple data-space signal segments

Fig. 9.1 Scheme for up-sampling of signal using DHT: {x[n]} → DHT → {X(H)[k]} → zero-pad centre of spectrum (see Equation 9.3) → {Y(H)[k]} → DHT → {y[n]}

Page 148: The Regularized Fast Hartley Transform: Optimal Formulation of Real-Data Fast Fourier Transform for Silicon-Based Implementation in Resource-Constrained Environments

9.3 Up-Sampling and Differentiation of Real-Valued Signal 139

via a suitably adapted reconstruction technique [3] which combines the use of the overlap-save technique, as associated with conventional FFT-based linear convolution, with that of temporal windowing [7], in order to keep the root-mean-square (RMS) interpolation error to an acceptable level. Without taking such precautions, the interpolation error may well prove to be unacceptably high due to the inclusion of error maxima near the segment boundaries – this problem being referred to in the technical literature as the end or boundary effect [2].

9.3.2 Differentiation via Hartley Space

The second function considered is that of differentiation and from the First and Second Derivative Theorems of Section 3.5 in Chapter 3 it was stated, for the case of an N-point data set, that

$$\mathrm{DHT}\left(\{x'[n]\}\right) = \left\{2\pi k\, X^{(H)}[N-k]\right\} \qquad (9.4)$$

and

$$\mathrm{DHT}\left(\{x''[n]\}\right) = \left\{-4\pi^{2} k^{2}\, X^{(H)}[k]\right\}, \qquad (9.5)$$

respectively, so that by transforming the real-valued signal from data space to Hartley space, via the application of a DHT of length N, then modifying the resulting Hartley-space samples according to Equation 9.4 or 9.5 above, before returning to the data space via the application of a second DHT, also of length N, it is possible to obtain the first or second derived function corresponding to the original real-valued signal, as required – see Fig. 9.2.
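The first-derivative rule can be sketched as follows. This is a minimal illustration, not the book's implementation: it assumes NumPy, computes the DHT via the FFT, uses the wrapped (signed) DFT frequency convention so that signs and scaling are handled explicitly, and zeroes the Nyquist bin so that the output remains real; the function names are illustrative.

```python
import numpy as np

def dht(x):
    # Discrete Hartley transform via the FFT: H[k] = Re(F[k]) - Im(F[k])
    F = np.fft.fft(x)
    return F.real - F.imag

def derivative_dht(x):
    # First derivative of a periodic, band-limited signal via Hartley space.
    # Each bin k is paired with its dual N-k (cf. the First Derivative
    # Theorem); w holds the wrapped angular frequency per sample and the
    # Nyquist bin is zeroed to keep the result real.
    N = len(x)
    X = dht(x)
    w = 2 * np.pi * np.fft.fftfreq(N)
    Y = -w * X[(-np.arange(N)) % N]      # Y[k] = -w[k] * X[N-k]
    if N % 2 == 0:
        Y[N // 2] = 0.0
    return dht(Y) / N                    # inverse DHT

n = np.arange(32)
x = np.sin(2 * np.pi * 2 * n / 32)
dx = derivative_dht(x)
# analytic derivative of sin(2*pi*2*n/32) with respect to n
assert np.allclose(dx, (2 * np.pi * 2 / 32) * np.cos(2 * np.pi * 2 * n / 32))
```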

9.3.3 Combined Up-Sampling and Differentiation

Note from the results of the above two sections that it is a straightforward task to carry out both the up-sampling and the differentiation of the real-valued signal by

Fig. 9.2 Scheme for differentiation of signal using DHT: {x[n]} → DHT → {X(H)[k]} → modify: Y(H)[k] = 2πk × X(H)[N − k] → {Y(H)[k]} → DHT → {y[n]}

Page 149: The Regularized Fast Hartley Transform: Optimal Formulation of Real-Data Fast Fourier Transform for Silicon-Based Implementation in Resource-Constrained Environments

140 9 Applications of Regularized Fast Hartley Transform

Fig. 9.3 Scheme for combined up-sampling and differentiation of signal using DHT: {x[n]} → DHT → {X(H)[k]} → modify: Z(H)[k] = 2πk × X(H)[N − k] → {Z(H)[k]} → zero-pad centre of spectrum (see Equation 9.3) → {Y(H)[k]} → DHT → {y[n]}

simply applying both sets of modifications to the same set of Hartley-space samples before returning to the data space. Thus, after modifying the Hartley-space samples according to Equation 9.4 or 9.5 of Section 9.3.2, the resulting samples are then zero-padded according to Equation 9.3 of Section 9.3.1, before being returned to the data space via the application of a second DHT to yield an up-sampled version of the first or second derived function of the original real-valued signal, as required (see Fig. 9.3).
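The two modifications can indeed be chained on one set of Hartley-space samples, with only the final inverse DHT changing length. The sketch below (again a NumPy illustration with the DHT computed via the FFT, the wrapped-frequency form of the derivative rule, the Nyquist bin zeroed, and illustrative function names) differentiates and then up-samples in a single pass:

```python
import numpy as np

def dht(v):
    # Discrete Hartley transform via the FFT
    F = np.fft.fft(v)
    return F.real - F.imag

def upsample_derivative(x, L):
    # Apply the derivative rule to the length-N Hartley-space samples,
    # then the zero-padding rule of Equation 9.3, then one length-M
    # inverse DHT.  N assumed even.
    N = len(x)
    M = L * N
    X = dht(x)
    w = 2 * np.pi * np.fft.fftfreq(N)
    D = -w * X[(-np.arange(N)) % N]      # derivative in Hartley space
    D[N // 2] = 0.0                      # Nyquist bin zeroed
    Y = np.zeros(M)
    Y[:N//2] = L * D[:N//2]
    Y[M - N//2 + 1:] = L * D[N//2 + 1:]  # Nyquist term is zero here
    return dht(Y) / M

n = np.arange(16)
w0 = 2 * np.pi * 2 / 16
x = np.sin(w0 * n)
y = upsample_derivative(x, 4)
m = np.arange(64)
# up-sampled first derivative of sin(w0*n): w0*cos evaluated at t = m/4
assert np.allclose(y, w0 * np.cos(w0 * m / 4))
```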

9.4 Correlation of Two Arbitrary Signals

Having covered very briefly the problems of up-sampling and differentiation, the computationally more intensive problem of correlation, as introduced in Section 9.2, is now addressed in some detail.

As evidenced from the discussions of Section 9.2 relating to fast transform-space convolution and correlation, when the correlation operation is performed upon two finite segments of signal, each comprising "N" samples, a direct data-space implementation will yield an O(N²) arithmetic complexity, whereas a transform-space implementation involving two forward-direction transforms, one transform-space product and one reverse-direction transform will yield an O(N log N) arithmetic complexity, via the application of a fast unitary/orthogonal transform, which suggests that the larger the correlation problem the greater the potential benefits to be gained from the adoption of a transform-space approach.

A key ingredient for the success and the generality of the transform-space approach is in being able to carry out a linear correlation by means of one or more circular correlations, so that by invoking the Circular Correlation Theorem [2] – which is analogous to the more familiar Circular Convolution Theorem [2] – it is possible to move the processing from the data space to the transform space where a fast algorithm may be exploited. Thus, when the data in question is complex-valued,


the processing may be carried out in Fourier space via the use of an FFT, whereas when the data is real-valued, it may be carried out in Hartley space via the use of an FHT.

Note that with the problem of geolocation, it is possible for either cross-correlation or auto-correlation to be encountered: if the sensors operate in passive mode, then each operation will be assumed to be that of cross-correlation and thus to be performed on signals from two different sensors to provide time-difference-of-arrival (TDOA) or equivalent relative range measurements, whereas if the sensors operate in active mode, then each operation will be assumed to be that of auto-correlation (so that one of the two signals is simply a stored reference of the other) to provide time-of-arrival (TOA) or equivalent range measurements. The essential difference, in terms of processing requirement, between the two modes of operation, is that with auto-correlation, one of the two signals is of finite duration and the other of infinite duration, whilst with cross-correlation, both of the signals are of infinite duration.

The signal of interest is typically in the form of a sampled pulse or pulse train, for both active and passive systems, so that the received signal, although often regarded as being of infinite duration for the purposes of correlator implementation, is actually a succession of temporally-spaced finite-duration segments.

9.4.1 Computation of Complex-Data Correlation via Real-Data Correlation

Although all of the techniques discussed in this chapter are geared to the processing of real-valued signals, it is worth pointing out that as the operation of correlation, denoted by means of the symbol "⊗", is a linear process – and thereby satisfying the property of additivity – the correlation of two complex-valued signals – as encountered, for example, when the signal processing is carried out at base-band [4, 9, 10, 12] – may be decomposed into the summation of four correlations each operating upon two real-valued signals, so that

$$\{X_R[n] + i \cdot X_I[n]\} \otimes \{Y_R[n] + i \cdot Y_I[n]\} \equiv \left(\{X_R[n]\} \otimes \{Y_R[n]\} + \{X_I[n]\} \otimes \{Y_I[n]\}\right) + i \cdot \left(\{X_R[n]\} \otimes \{Y_I[n]\} - \{X_I[n]\} \otimes \{Y_R[n]\}\right), \qquad (9.6)$$

this expression taking into account the operation of complex conjugation to be performed upon one of the two input signals – as shown in Equation 9.2.
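The decomposition of Equation 9.6 is easily verified numerically. In the sketch below (NumPy assumed; the real correlations are computed directly rather than via an FHT, purely to check the algebra, and the helper name is illustrative), the four real-data circular correlations are combined and compared against the conjugated complex-data correlation:

```python
import numpy as np

def real_circ_corr(a, b):
    # r[k] = sum_n a[n] * b[(n+k) % N], both sequences real
    N = len(a)
    return np.array([np.dot(a, np.roll(b, -k)) for k in range(N)])

rng = np.random.default_rng(3)
XR, XI = rng.standard_normal(8), rng.standard_normal(8)
YR, YI = rng.standard_normal(8), rng.standard_normal(8)

# four independent real-data correlations, combined per Equation 9.6
ZR = real_circ_corr(XR, YR) + real_circ_corr(XI, YI)
ZI = real_circ_corr(XR, YI) - real_circ_corr(XI, YR)

# reference: complex correlation with conjugation of the first signal
X = XR + 1j * XI
Y = YR + 1j * YI
ref = np.array([np.vdot(X, np.roll(Y, -k)) for k in range(8)])
assert np.allclose(ZR + 1j * ZI, ref)
```

Because the four real correlations share no data dependencies, they may be dispatched to four parallel correlator instances, which is precisely the parallelism discussed next.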

The attraction of the complex-to-real decomposition described here for the complex-data correlation operation is that it introduces an additional level of parallelism to the problem as the resulting real-data correlations are independent and thus able to be computed simultaneously, or in parallel, as shown in Fig. 9.4. This is particularly relevant when the quantities of data to be correlated are large and the


Fig. 9.4 Scheme for complex-data correlation via real-data correlation: the four real-data correlations of {XR[n]} and {XI[n]} with {YR[n]} and {YI[n]} are computed in parallel and their outputs added to form {ZR[n]} and {ZI[n]}

throughput requirement high as a transform-space approach may then be the only viable approach to adopt, leaving the conventional complex-data approach to rely upon the parallelization of the complex-data FFT and its inverse as the only logical means of achieving the required performance. With the complex-to-real decomposition, however, the required performance may be more easily obtained by running in parallel multiple versions of the R2⁴ FHT in both forward and reverse directions. The transformation from data space to Hartley space, for example, may be carried out by running in parallel two (when using a stored reference) or four (when cross-correlating two arbitrary signals) R2⁴ FHTs, this followed by the computation of four sets of transform-space products, again in parallel, with each transform-space product taking the form of

$$Z[k] = \frac{1}{2} X^{(H)}[k] \left( Y^{(H)}[k] + Y^{(H)}[N-k] \right) + \frac{1}{2} X^{(H)}[N-k] \left( Y^{(H)}[N-k] - Y^{(H)}[k] \right). \qquad (9.7)$$

The results of the four transform-space products may then be additively combined prior to the results being transformed back to the data space by running in parallel two R2⁴ FHTs to yield the required data-space correlation results – see Fig. 9.5. Thus, compared to a solution based upon the use of a complex-data FFT, this approach results in a potential doubling of the parallelism (in addition to that achievable via the efficient implementation of the R2⁴ FHT, as discussed in Chapter 6) with which to increase the throughput of the complex-data correlation operation.

9.4.2 Cross-Correlation of Two Finite-Length Data Sets

Before moving on to the two important cases of auto-correlation and cross-correlation where at least one of the two signals is of infinite duration, the simple


Fig. 9.5 Scheme for complex-data correlation using DHT: {XR[n]}, {XI[n]}, {YR[n]} and {YI[n]} are each passed through a DHT, the Hartley-space outputs combined as ½X(H)[k](Y(H)[k] + Y(H)[N − k]) + ½X(H)[N − k](Y(H)[N − k] − Y(H)[k]), and the results returned through two further DHTs to give {ZR[n]} and {ZI[n]}

problem of cross-correlating two finite-duration signals by means of the DHT is considered. To achieve this, if one of the two signal segments is represented by "N1" samples and the other signal segment by "N2" samples, then the length "N" of the DHT is first chosen so that

$$N \geq N_1 + N_2 - 1. \qquad (9.8)$$

One segment is then pre-zero-padded out to a length of "N" samples and the other segment post-zero-padded also out to a length of "N" samples. Following this, each zero-padded segment is passed through the N-point DHT, their transforms then multiplied, sample-by-sample, before the transform-space product is transformed back to the data space by means of another N-point DHT, to yield the required cross-correlator output. There will, however, be a deterministic shift of length

$$S = N - (N_1 + N_2 - 1) \qquad (9.9)$$

samples, which needs to be accounted for when interpreting the output, as the resulting data set out of the final DHT comprises "N" samples whereas the correlation of the two segments is known to be only of length N1 + N2 − 1. This procedure is outlined in Fig. 9.6.
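The whole finite-segment procedure, including the deterministic output shift, can be checked with a short Python sketch. As before this is an illustration only: NumPy is assumed, the DHT is computed via the FFT, the transform-space product follows Equation 9.7, and all function names are illustrative.

```python
import numpy as np

def dht(v):
    F = np.fft.fft(v)
    return F.real - F.imag

def hartley_correlate(h, x):
    # Circular cross-correlation r[k] = sum_n h[n] * x[(n+k) % N] of two
    # real sequences via the Hartley-space product of Equation 9.7.
    N = len(h)
    idx = (-np.arange(N)) % N            # dual-term index map k -> N-k
    Y, X = dht(h), dht(x)
    Z = 0.5 * X * (Y + Y[idx]) + 0.5 * X[idx] * (Y[idx] - Y)
    return dht(Z) / N                    # inverse DHT

rng = np.random.default_rng(0)
a = rng.standard_normal(5)               # N1 = 5 samples
b = rng.standard_normal(7)               # N2 = 7 samples
N = 16                                   # any N >= N1 + N2 - 1 = 11
A = np.concatenate([a, np.zeros(N - 5)])         # post-zero-padded
B = np.concatenate([np.zeros(N - 7), b])         # pre-zero-padded
c = hartley_correlate(A, B)

# the linear correlation appears at a deterministic offset of N - N2
for j in range(-4, 7):
    lin = sum(a[n] * b[n + j] for n in range(5) if 0 <= n + j < 7)
    assert np.isclose(c[(N - 7 + j) % N], lin)
```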

9.4.3 Auto-Correlation: Finite-Length Against Infinite-Length Data Sets

The next type of problem considered relates to that of auto-correlation whereby a finite-duration signal segment – in the form of a stored reference – is being


Fig. 9.6 Scheme for correlation of two signal segments: {x[n]} is post-zero-padded and {y[n]} pre-zero-padded, each passed through a DHT, the Hartley-space outputs {X(H)[k]} and {Y(H)[k]} combined as ½X(H)[k](Y(H)[k] + Y(H)[N − k]) + ½X(H)[N − k](Y(H)[N − k] − Y(H)[k]), and the product {Z(H)[k]} returned through a final DHT to give {z[n]}

correlated against a continuous or infinite-duration signal. The stored reference correlator is commonly referred to in the technical literature as a matched filter, where the output of a detector based upon the application of such a filter is known to optimize the peak received SNR in the presence of additive white Gaussian noise (AWGN). The output is also known to correspond – at least for the case of idealized distortion-free and multipath-free propagation – to the auto-correlation function of the stored signal.

This type of problem is best tackled by viewing it as a segmented correlation, a task most simply solved by means of the familiar overlap-save or overlap-add technique associated with conventional FFT-based linear convolution. The approach involves decomposing the infinite-duration received signal into segments and computing the correlation of the stored reference and the received signal as a number of smaller circular correlations. With the overlap-save technique, for example, suitable zero-padding of the stored reference combined with the selection of an appropriate segment length enables the required correlation outputs to be obtained from the segmented circular correlation outputs without the need for further arithmetic. With the overlap-add technique, on the other hand, the received signal segments need also to be zero-padded with the required correlation outputs being obtained through appropriate combination – although only via addition – of the segmented circular correlation outputs.

A solution based upon the adoption of the overlap-save technique is as outlined in Fig. 9.7, where the stored reference comprises "N1" samples, the DHT is of length "N", where

$$N \geq 2N_1, \qquad (9.10)$$

and the number of valid samples produced from each length-N signal segment out of the correlator is given by "N2", where


Fig. 9.7 Scheme for auto-correlation using DHT: the post-zero-padded stored reference {x[n]} is passed through a DHT once, each successive overlapped segment of {y[n]} is passed through a DHT, the Hartley-space outputs combined as ½X(H)[k](Y(H)[k] + Y(H)[N − k]) + ½X(H)[N − k](Y(H)[N − k] − Y(H)[k]), and the product {Z(H)[k]} returned through a final DHT to give {z[n]}, with the invalid outputs discarded

$$N_2 = N - N_1 + 1, \qquad (9.11)$$

these samples appearing at the beginning of each new output segment with the last N1 − 1 samples of each such segment being invalid and thus discarded. To achieve this, consecutive length-N segments of signal are overlapped by N1 − 1 samples, with the first such segment being pre-zero-padded by N1 − 1 samples to account for the lack of a predecessor.

The optimal choice of segment length is dependent very much upon the length of the stored reference, with a sensible lower limit being given by twice the length of the stored reference – as given by Equation 9.10. Clearly, the shorter the segment length the smaller the memory requirement but the lower the computational efficiency of the solution, whereas the larger the segment length the higher the computational efficiency but the larger the memory requirement of the solution. Thus, there is once again a direct trade-off to be made of arithmetic complexity against memory requirement, according to how long one makes the signal segment.
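An overlap-save matched filter of this kind can be sketched as follows. This is a minimal NumPy illustration with illustrative function names, the DHT computed via the FFT and the Hartley-space product of Equation 9.7; for simplicity the outputs are aligned so that no pre-zero-padded first segment is required, which differs from the alignment described above only by a fixed output offset.

```python
import numpy as np

def dht(v):
    F = np.fft.fft(v)
    return F.real - F.imag

def hartley_correlate(h, x):
    # circular r[k] = sum_n h[n] * x[(n+k) % N], via Equation 9.7
    N = len(h)
    idx = (-np.arange(N)) % N
    Y, X = dht(h), dht(x)
    Z = 0.5 * X * (Y + Y[idx]) + 0.5 * X[idx] * (Y[idx] - Y)
    return dht(Z) / N

def overlap_save_correlate(h, x, N):
    # Stored reference h (length N1) against a long stream x: N-point
    # segments overlapped by N1-1 samples; the first N - N1 + 1 outputs
    # of each circular correlation are valid, the rest are discarded.
    N1 = len(h)
    N2 = N - N1 + 1
    H = np.concatenate([h, np.zeros(N - N1)])    # post-zero-padded reference
    out = []
    for p in range(0, len(x) - N1 + 1, N2):
        seg = x[p:p + N]
        if len(seg) < N:                          # pad the final short segment
            seg = np.concatenate([seg, np.zeros(N - len(seg))])
        out.extend(hartley_correlate(H, seg)[:N2])
    return np.array(out)[:len(x) - N1 + 1]

rng = np.random.default_rng(1)
h = rng.standard_normal(8)
x = rng.standard_normal(100)
r = overlap_save_correlate(h, x, N=32)
direct = np.array([np.dot(h, x[k:k + 8]) for k in range(93)])
assert np.allclose(r, direct)
```

The choice N = 32 here is simply twice the reference length rounded up to a power of two, in line with the lower limit of Equation 9.10.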

9.4.4 Cross-Correlation: Infinite-Length Against Infinite-Length Data Sets

The final type of problem considered relates to that of cross-correlation whereby a continuous or infinite-duration signal is being correlated against another signal of


similar type. This type of problem, as with that for the auto-correlation problem of the previous section, is best tackled by viewing it as a segmented correlation, albeit one requiring a rather more complex solution.

With the cross-correlation of two infinite-duration signals, each region of signal that carries information will be of finite duration, so that if 50% overlapped signal segments are generated from the data acquired at each sensor, where the segment length corresponds to twice the anticipated duration of each signal region of interest added to twice the maximum possible propagation delay arising from the separation of the sensors, then for some given acquisition period the current signal region of interest is guaranteed to appear in the corresponding segment of both sensors. Thus, the cross-correlation breaks down into the successive computation of a number of overlapped cross-correlations of finite-duration signals, one of which corresponds to the cross-correlation of the current signal region of interest. If the length of the segment is short enough to facilitate its direct computation – that is, there is adequate memory to hold the sensor data and cross-correlator outputs – then the overlapped cross-correlation of each two signal segments may be carried out by means of the technique described in Section 9.4.2. If this is not the case, however, then it is likely that the number of cross-correlator outputs of actual significance – that is, that correspond to the temporal region containing the dominant peaks – will be considerably smaller than the number of samples in the segment so that computational advantage could be made of this fact.

To see how this may be achieved [6], each segment needs first to be broken down into a number of smaller sub-segments with the cross-correlation of the original two segments being subsequently obtained from the cross-correlation of the sub-segments in the following way. Suppose that we regard each long signal segment as being comprised of "K" samples, with the number of samples in each sub-segment being denoted by "N", where

$$K = M \times N, \qquad (9.12)$$

for some integer "M". Then, denoting the sub-segment index by "m", we carry out the following steps:

1. Segment each set of K samples to give:

$$x_m[n] = \begin{cases} x[n + (m-1)N] & n = 0, 1, \ldots, N-1 \\ 0 & n = N, N+1, \ldots, 2N-1 \end{cases} \quad \text{for } m = 0, 1, \ldots, M-2, \text{ and}$$

$$y_m[n] = y[n + (m-1)N], \quad n = 0, 1, \ldots, 2N-1, \quad \text{for } m = 0, 1, \ldots, M-2, \text{ and} \qquad (9.13)$$

$$y_m[n] = \begin{cases} y[n + (m-1)N] & n = 0, 1, \ldots, N-1 \\ 0 & n = N, N+1, \ldots, 2N-1 \end{cases} \quad \text{for } m = M-1.$$


2. Carry out the 2N-point DHT of each sub-segment to give:

$$\left\{X_m^{(H)}[k]\right\} = \mathrm{DHT}\left(\{x_m[n]\}\right) \quad \text{and} \quad \left\{Y_m^{(H)}[k]\right\} = \mathrm{DHT}\left(\{y_m[n]\}\right), \qquad (9.14)$$

for m = 0, 1, …, M − 1.

3. Multiply the two Hartley-space output sets, sample-by-sample, to give:

$$Z_m[k] = \frac{1}{2} X_m^{(H)}[k] \left( Y_m^{(H)}[k] + Y_m^{(H)}[2N-k] \right) + \frac{1}{2} X_m^{(H)}[2N-k] \left( Y_m^{(H)}[2N-k] - Y_m^{(H)}[k] \right), \qquad (9.15)$$

for k = 0, 1, …, 2N − 1 and m = 0, 1, …, M − 1.

4. Sum the transform-space products over all M sets to give:

$$Z^{(H)}[k] = \sum_{m=0}^{M-1} Z_m[k], \quad k = 0, 1, \ldots, 2N-1. \qquad (9.16)$$

5. Carry out the 2N-point DHT of the resulting summed product to give:

$$\{z[n]\} = \mathrm{DHT}\left(\left\{Z^{(H)}[k]\right\}\right). \qquad (9.17)$$

The above sequence of steps, which illustrates how to carry out the required segmented cross-correlation operation, is also given in diagrammatic form in Fig. 9.8 below.

Note that if the sampled data is complex-valued rather than real-valued then the above sequence of steps may be straightforwardly modified to account for the four real-data combinations required by the complex-to-real parallel decomposition discussed in Section 9.4.1. Also, as for each of the correlation schemes discussed in this section, if the length of the correlation operations is chosen to be a power of four, then the regularized FHT may be beneficially used to enable the function to be carried out in a computationally-efficient manner.
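The five steps above can be sketched and verified numerically. In this NumPy illustration (DHT via the FFT, illustrative function names) the sub-segments are indexed from m = 0 with offset mN, a convention that differs from the (m − 1)N indexing used above only in where the segment count starts; the Hartley-space products of Equation 9.15 are accumulated per Equation 9.16 and a single final inverse DHT recovers lags j = 0, …, N − 1:

```python
import numpy as np

def dht(v):
    F = np.fft.fft(v)
    return F.real - F.imag

def segmented_xcorr(x, y, N):
    # Steps 1-5: break K = M*N samples into sub-segments, form the
    # 2N-point Hartley-space products (Eq. 9.15), sum over m (Eq. 9.16),
    # then one inverse DHT (Eq. 9.17) returns the first N lags.
    K = len(x)
    M = K // N
    idx = (-np.arange(2 * N)) % (2 * N)  # dual-term index map
    Zsum = np.zeros(2 * N)
    for m in range(M):
        xm = np.concatenate([x[m*N:(m+1)*N], np.zeros(N)])   # zero-padded
        ym = y[m*N:m*N + 2*N]                                # overlapped
        ym = np.concatenate([ym, np.zeros(2*N - len(ym))])   # last one padded
        X, Y = dht(xm), dht(ym)
        Zsum += 0.5 * Y * (X + X[idx]) + 0.5 * Y[idx] * (X[idx] - X)
    return dht(Zsum) / (2 * N)

rng = np.random.default_rng(2)
x = rng.standard_normal(64)
y = rng.standard_normal(64)
r = segmented_xcorr(x, y, N=16)
direct = [sum(x[n] * y[n + j] for n in range(64 - j)) for j in range(16)]
assert np.allclose(r[:16], direct)
```

Only one inverse transform is needed however large M becomes, which is where the computational advantage over M separate full-length correlations arises.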

9.4.5 Combining Functions in Hartley Space

Having shown in the previous sections how different functions, such as those of up-sampling and differentiation, may be efficiently carried out, either individually or in combination, via transformation to Hartley space, it is easy to visualize – through straightforward manipulation of the Hartley-space data – how such functions may also be combined with that of correlation to enable the output signal from


Fig. 9.8 Scheme for cross-correlation using DHT: each successive post-zero-padded sub-segment {xm[n]} and each overlapped sub-segment {ym[n]} are passed through 2N-point DHTs, the Hartley-space outputs combined and summed as in Equations 9.15 and 9.16, and the result {Z(H)[k]} returned through a final DHT to give {z[n]}

the correlator to be produced in up-sampled form, or as a derived function of the standard correlator output signal or, upon combining of the two ideas, as an up-sampled version of a derived function of the standard correlator output signal.

The adoption of the first derived function of the standard correlator output signal, for example, enables one to replace peak detection by zero detection for the estimation of either TOA or TDOA. The utility of such an idea is particularly evident in the seemingly intractable problem of trying to find the TOA corresponding to the direct path component of a multi-path signal, given that the largest peak of the standard correlator output signal does not necessarily correspond to the location of the direct path component. With the first derived function, for example, it can be shown that the position of the peak of the direct path signal corresponds to the point at which the value of the first derived function first starts to decrease, whilst with the second derived function, it can be shown that the position of the peak of the direct path signal corresponds to the point at which the first negative peak of the second derived function appears. Thus, both first and second derived functions of the standard correlator output signal may be used to attack the problem.

Finally, note that with all the correlation-based expressions given in this section that involve the use of dual Hartley-space terms, such as the terms X(H)[k] and X(H)[N − k], care must be taken to treat the zero-address and Nyquist-address terms separately, as neither term possesses a dual.


9.5 Channelization of Real-Valued Signal

The function of a digital multi-channel receiver [9, 10] is to simultaneously down-convert a set of frequency division multiplexed (FDM) channels residing in a single sampled data stream. The traditional approach to solving this problem has been to use a bank of DDC units, with each channel being produced individually via a DDC unit which digitally down-converts the signal to base-band, constrains the bandwidth with a digital filter, and then reduces the sampling rate by an amount commensurate with the reduction in bandwidth. The problem with the DDC approach, however, is one of cost in that multiple channels are produced via replication of the DDC unit, so that there is no commonality of processing and therefore no possibility of computational savings being made. This is particularly relevant when the bandwidth of the signal under analysis dictates that a large number of channels be produced, as the DDC unit required for each channel typically requires the use of two FIR low-pass filters and one stored version of the period of a complex sinusoid sampled at the input rate.

Two cases are now considered, the first corresponding to the efficient production of a single channel (or small number of channels) by means of a DDC process where the filtering is carried out via fast Hartley-space convolution, the second corresponding to the production of multiple channels via the application of the polyphase DFT filter bank.

9.5.1 Single Channel: Fast Hartley-Space Convolution

For the simple example of a single channel, after the real-valued signal has been frequency-shifted to base-band, the remaining task of the DDC process is to filter the resulting two channels of data so as to constrain the bandwidth of the signal and thus enable the sampling rate to be reduced by an amount commensurate with the reduction in bandwidth. Each filtering operation may be viewed as a convolution-type problem, where the impulse response function of the digital filter is being convolved with a continuous or infinite-duration signal.

As already stated, this convolution-based problem may be solved with either a data-space or a transform-space approach, the optimum choice being very much dependent upon the achievable down-sampling rate out of the two FIR filters – one filter for the in-phase channel and another for the quadrature channel. Clearly, if the down-sampling rate is sufficiently large and/or the length of the impulse response of each filter sufficiently short, then the computational efficiency of the data-space approach may well be difficult to improve upon.

For the case of the transform-space approach, however, this type of problem is best tackled by viewing it as a segmented convolution, a task most simply solved by means of the familiar overlap-save or overlap-add technique, as discussed already in relation to the analogous problem of segmented correlation. The approach involves decomposing the infinite-duration received signal into segments and computing the


Fig. 9.9 Scheme for filtering complex-valued signal using DHT: the post-zero-padded filter coefficients {x[n]} are passed through a DHT once, each successive overlapped segment of the in-phase and quadrature channels {yI[n]} and {yQ[n]} is passed through a DHT, the Hartley-space outputs combined as ½X(H)[k](Y(H)[k] + Y(H)[N − k]) + ½X(H)[N − k](Y(H)[N − k] − Y(H)[k]), and the products {ZI(H)[k], ZQ(H)[k]} returned through final DHTs to give {zI[n], zQ[n]}, with the invalid outputs discarded

convolution of the impulse response function of the filter and the received signal as a number of smaller circular convolutions. With the overlap-save technique, for example, suitable zero-padding of the impulse response function combined with the selection of an appropriate segment length enables the required convolution outputs to be obtained from the segmented circular convolution outputs without the need for further arithmetic.

A solution based upon the adoption of the overlap-save technique is as outlined in Fig. 9.9, where the impulse response function, {x[n]}, comprises "N1" samples or coefficients, the DHT is of length "N", where

$$N \geq 2N_1, \qquad (9.18)$$

and the number of valid samples produced from each length-N signal segment out of the convolver is given by "N2", where

$$N_2 = N - N_1 + 1, \qquad (9.19)$$

these samples appearing at the end of each new output segment with the first N1 − 1 samples of each such segment being invalid and thus discarded. To achieve this, consecutive length-N segments of the in-phase and quadrature components of the signal are overlapped by N1 − 1 samples, with the first such segment being pre-zero-padded by N1 − 1 samples to account for the lack of a predecessor. The transform-space product associated with each of the small circular convolutions takes the form of

Z[k] = (1/2)·X(H)[k]·(Y(H)[k] + Y(H)[N − k]) + (1/2)·X(H)[N − k]·(Y(H)[N − k] − Y(H)[k]),   (9.20)


with the in-phase and quadrature components of the final filtered output denoted by {zI[n]} and {zQ[n]}, respectively.

The optimum choice of segment length is dependent very much upon the length of the impulse response function of the filter, with a sensible lower limit being given by twice the length of the impulse response function – as given by Equation 9.18. Clearly, as with the case of segmented correlation, the shorter the segment length the smaller the memory requirement but the lower the computational efficiency of the solution, whereas the larger the segment length the higher the computational efficiency but the larger the memory requirement of the solution. Thus, there is once again a direct trade-off to be made of arithmetic complexity against memory requirement, according to how long one makes the signal segment.
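The overlap-save bookkeeping described above can be sketched as follows, with a direct circular convolution standing in for the DHT-based transform-space route of Fig. 9.9; the function names are illustrative (they are not taken from the accompanying program) and the fixed local buffers limit the sketch to segment lengths N ≤ 64.

```c
#include <assert.h>
#include <stddef.h>

/* Direct circular convolution of two length-n sequences: a stand-in for
 * the DHT -> transform-space combine -> inverse-DHT route of Fig. 9.9. */
static void circular_conv(const double *x, const double *y, double *z, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        double acc = 0.0;
        for (size_t m = 0; m < n; m++)
            acc += x[m] * y[(k + n - m) % n];
        z[k] = acc;
    }
}

/* Overlap-save filtering: the filter h has n1 taps and is post-zero-padded
 * to the segment length n, consecutive signal segments overlap by n1 - 1
 * samples (the first segment is pre-zero-padded), and the first n1 - 1
 * outputs of each segment are discarded.  Writes len valid samples to out. */
static void overlap_save(const double *h, size_t n1,
                         const double *sig, size_t len,
                         double *out, size_t n)
{
    size_t n2 = n - n1 + 1;             /* valid samples per segment (9.19) */
    double hp[64] = {0}, seg[64], res[64];

    for (size_t i = 0; i < n1; i++)
        hp[i] = h[i];                   /* post-zero-padded filter */

    for (size_t start = 0; start < len; start += n2) {
        /* n1 - 1 overlapped (or zero) samples, then up to n2 new ones */
        for (size_t i = 0; i < n; i++) {
            long idx = (long)start - (long)(n1 - 1) + (long)i;
            seg[i] = (idx >= 0 && (size_t)idx < len) ? sig[idx] : 0.0;
        }
        circular_conv(hp, seg, res, n);
        for (size_t i = 0; i < n2 && start + i < len; i++)
            out[start + i] = res[n1 - 1 + i];   /* discard invalid outputs */
    }
}
```

For h = {1, 1} and sig = {1, 2, 3, 4} with n = 4, this reproduces the running sum {1, 3, 5, 7} obtained by direct linear convolution.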

9.5.2 Multiple Channels: Conventional Polyphase DFT Filter Bank

A common situation, of particular interest, is where multiple channels – possibly even thousands of channels – are to be produced which are of equal spacing and of equal bandwidth, as a polyphase decomposition may be beneficially used to enable the bank of DDC processes to be simply transformed into an alternative filter bank structure, namely the polyphase DFT filter bank, as described in Fig. 9.10 for the most general case of a complex-valued signal, whereby large numbers of channels may be simultaneously produced at computationally acceptable levels.

For a brief mathematical justification of this decomposition, it should be first noted that a set of N filters, {Hk(z)}, is said to be a uniform DFT filter bank [1, 4, 12] if the filters are expressible as

Hk(z) ≡ H0(z·WN^k),   (9.21)

where

H0(z) = 1 + z^(−1) + ... + z^(−(N−1)),   (9.22)

with z^(−1) corresponding to the unit delay and WN to the primitive Nth complex root of unity, as given by Equation 1.2 in Chapter 1.

Two additional ideas of particular importance are those conveyed by the Equivalency Theorem [1, 4, 12] and the Noble Identity [1, 4, 12], where the invoking of the Equivalency Theorem enables the operations of down-conversion followed by low-pass filtering to be replaced by those of band-pass filtering followed by down-conversion, whilst that of the Noble Identity enables the ordering of the operations of filtering followed by down-sampling to be straightforwardly reversed. With these two key ideas in mind, assume that the prototype filter, denoted P(z), is expressible in polyphase form as

P(z) = Σ_{n=0}^{N−1} z^(−n)·Hn(z^N),   (9.23)


Fig. 9.10 Scheme for polyphase DFT channelization of complex-valued signal. [Block diagram: the band-pass complex-valued input signal {x[n]} passes along a chain of unit delays z^(−1); each tap is down-sampled by N and filtered by one of the polyphase branch filters H0(z), H1(z), ..., HN−1(z), whose instantaneous output sets feed an N-point complex-data discrete Fourier transform to produce the low-pass complex-valued output signals {y0[m]}, {y1[m]}, ..., {yk[m]}, ..., {yN−1[m]}.]

for the case of an N-branch system, so that the filter corresponding to the kth branch, Pk(z), may thus be written as

Pk(z) = P(z·WN^k) = Σ_{n=0}^{N−1} (z^(−1)·WN^(−k))^n · Hn(z^N),   (9.24)

with the output of Pk(z), denoted Yk(z), given by

Yk(z) = Σ_{n=0}^{N−1} WN^(−nk) (z^(−n)·Hn(z^N)·X(z)),   (9.25)

which corresponds to the polyphase structure shown above in Fig. 9.10.
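The polyphase decomposition of Equation 9.23 amounts to a simple coefficient mapping, with branch n taking every Nth coefficient of the prototype filter, h_n[j] = p[j·N + n]. A minimal sketch, where the function name and the branch/tap counts are illustrative assumptions:

```c
#include <assert.h>
#include <stddef.h>

#define NBRANCH 4   /* number of polyphase branches, N */
#define TAPS    16  /* prototype filter length, a multiple of NBRANCH */

/* Polyphase split of Equation 9.23: branch n of the N-branch system is
 * obtained by delaying and sub-sampling the prototype impulse response,
 * i.e. h_n[j] = p[j*N + n]. */
static void polyphase_split(const double p[TAPS],
                            double h[NBRANCH][TAPS / NBRANCH])
{
    for (size_t n = 0; n < NBRANCH; n++)
        for (size_t j = 0; j < TAPS / NBRANCH; j++)
            h[n][j] = p[j * NBRANCH + n];
}
```

Feeding the instantaneous outputs of these branch filters to an N-point DFT then yields the full filter bank of Fig. 9.10.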


With this structure, therefore, the required band-pass filters are obtained by adopting a polyphase filter bank, with each filter branch being obtained by delaying and sub-sampling the impulse response of a single prototype FIR low-pass filter, followed by the application of a DFT to the instantaneous output sets produced by the polyphase filter bank. The effect of the polyphase filtering is to isolate and down-sample the individual channels, whilst the DFT is used to convert each channel to base-band. In this way, the same polyphase filter bank is used to generate all the channels, with additional complexity reduction being made possible by computing the DFT with an appropriately chosen FFT algorithm. When the sampled data is complex valued, the feeding of N samples into an N-branch polyphase DFT filter bank will result in the production of N independent channels via the use of a complex-data FFT, whereas when the sampled data is real valued, the feeding of N samples into the N-branch polyphase DFT filter bank will result in the production of just N/2 independent channels via the use of a real-data FFT.

For the efficient computation of the polyphase DFT filter bank, as for that of the standard DFT, the traditional approach to the problem has been to use a complex-data solution, regardless of the nature of the data, this often entailing the initial conversion of the real-valued data to complex-valued data via a wideband DDC process, or through the adoption of a real-from-complex strategy whereby two real-valued data sets are built up from the polyphase filter bank outputs to enable two real-data FFTs to be computed simultaneously via one full-length complex-data FFT, or where one real-data FFT is performed on the polyphase filter bank outputs via one half-length complex-data FFT. The most commonly adopted approach is probably to apply the polyphase DFT filter bank after the real-valued data has first been converted to base-band via the wideband DDC process, which means that the data has to undergo two separate stages of filtering – one stage following the frequency shifting and another for the polyphase filter bank – before it is in the required form.

The same drawbacks are therefore equally valid for the computation of the real-data polyphase DFT filter bank as they are for that of the real-data DFT, these drawbacks having already been comprehensively discussed in Chapter 2.

A typical channelization problem involves a real-valued wide bandwidth RF signal, sampled at an intermediate frequency (IF) with a potentially high sampling rate, and a significant number of channels, so that the associated computational demands of a solution based upon the use of the polyphase DFT filter bank would typically be met through the mapping of the polyphase filter bank and the associated real-data FFT placed at its output onto appropriately chosen parallel computing equipment, as might be provided by a sufficiently powerful FPGA device. As a result, if the number of polyphase filter branches is a power of four, then the real-data DFT placed at the output of the polyphase filter bank may be efficiently carried out by means of the R²₄ FHT without recourse to the use of a complex-data FFT.


9.5.2.1 Alias-Free Formulation

An important problem associated with the polyphase DFT filter bank is that of adjacent channel interference, which arises through the nature of the sampling process – namely the fact that with the conventional formulation of the polyphase DFT filter bank all the channels are critically sampled at the Nyquist rate – as this results in overlapping of the channel frequency responses and hence aliasing of a signal in the transition region of one channel into the transition region of one of its neighbours.

To overcome this problem, the presence of aliased signals arising from the poor filtering performance near the channel boundaries may be reduced or eliminated by over-sampling the individual channels to above the Nyquist rate.

This over-sampling may be most simply achieved, with a rational factor, by overlapping the segments of sampled data into the polyphase filter bank, using simple memory shifts/exchanges, and then removing the resulting frequency-dependent phase shifts at the output of the polyphase filter bank by applying circular time shifts to the filtered data, this being achieved by re-ordering the data with simple memory shifts/exchanges [4]. The effect of over-sampling is to create redundant spectral regions between the desired channel boundaries and thus to prevent the overlapping of adjacent channel frequency responses. For a channel bandwidth of “W”, suppose that an over-sampling ratio of 4/3 is used – equating to an overlap of 25% of the sampled data segments – and that the pass-band and stop-band edges are symmetrically placed at (3/4)·(W/2) and (5/4)·(W/2), respectively, relative to the channel boundary. This results in the creation of a spectral band (in the centre of the redundant region), of width “B”, where

B = 2·((4/3)·(W/2) − (5/4)·(W/2)) = W/12,   (9.26)

which separates adjacent channel stop-band edges and thus prevents possible aliasing problems, so that the redundant regions may be easily identified and removed upon spectrum analysis of the individual channels of interest.

Clearly, by suitably adjusting the position of the stop-band edge – that is, by setting it to exactly R·(W/2), where “R” is the over-sampling ratio – it is possible to completely eliminate this spectral safety region such that the locations of the stop-band edges of adjacent channels actually coincide.

If the signal is real valued and the number of channels to be produced is equal to N/2 – and hence the length of the sampled data segments as well as the number of branches used by the polyphase filter is equal to N – then an over-sampling ratio of “R” will require an overlap, “O”, of

O = N·(1 − 1/R)   (9.27)

samples for the data segments. As with the computation of any DSP-based function, there is a direct trade-off to be made between complexity and performance in that the larger the over-sampling ratio the larger the arithmetic complexity but the easier the task of the polyphase filtering process. This results in a reduction in the number of taps required by each of the small filters of the polyphase filter bank, which in turn leads to a reduced latency and a reduced-duration transient response. A realistic value for the over-sampling ratio, as commonly adopted in many channelization problems, is given by two, whereby the requirement is thus for a 50% overlap of the sampled data segments.
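The overlap arithmetic of Equation 9.27 can be checked with a few lines of integer code; for a rational over-sampling ratio R = p/q the overlap becomes O = N(p − q)/p. The function name here is illustrative, not from the accompanying program:

```c
#include <assert.h>

/* Overlap of Equation 9.27, O = N(1 - 1/R), evaluated in integer
 * arithmetic for a rational over-sampling ratio R = p/q, so that
 * O = N(p - q)/p; n is the segment length (= number of branches). */
static unsigned overlap_samples(unsigned n, unsigned p, unsigned q)
{
    return n * (p - q) / p;
}
```

For N = 1024, the ratio R = 2 gives the 50% overlap of 512 samples, whilst R = 4/3 gives the 25% overlap of 256 samples.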

9.5.2.2 Implementational Considerations

With a simplified formulation (albeit not a particularly useful one) of the polyphase DFT filter bank, which takes no account of the aliasing problem, if “N” real-valued samples are fed into an N-branch polyphase filter bank, then the solution to the associated real-data DFT problem will equate to the execution of one N-point real-data FFT every N clock cycles, which is the very problem that has already been addressed in this monograph through the development of the R²₄ FHT. For the more interesting and relevant situation, however, where an over-sampling ratio of two is adopted to address the aliasing problem, the solution to the associated real-data DFT problem will equate to the execution of one N-point real-data FFT every N/2 clock cycles, so that it will be necessary, for when N > 256 (from the time-complexity figures of Equations 6.11 and 6.12 in Chapter 6), to double the throughput of the standard R²₄ FHT and hence of the real-data FFT. This may be achieved either by using a dual-PE version of the R²₄ FHT, as discussed in Section 6.6 of Chapter 6, or by computing two R²₄ FHTs simultaneously, or in parallel, on consecutive overlapped sets of polyphase filter outputs – along the lines of the dual-R²₄ FHT scheme described in Section 8.3 of the previous chapter.

When the over-sampling ratio is reduced to 4/3, however, the real-data DFT problem simplifies to the execution of one N-point real-data FFT every 3N/4 clock cycles, so that for those situations where N ≤ 4K, a single R²₄ FHT may well suffice, as evidenced from the time-complexity figures given by Equations 6.11 and 6.12 in Chapter 6.

9.6 Discussion

This chapter has focused on the application of the DHT to a number of computationally-intensive DSP-based functions which may benefit from the adoption of transform-space processing. The particular application area of geolocation was discussed in some detail as it is a potential vehicle for all of the functions considered.

With most geolocation systems there is typically a requirement to produce up-sampled correlator outputs from which the TOA or TDOA timing measurements may subsequently be derived. The TOA measurement forms the basis of those geolocation systems based upon the exploitation of multiple range estimates whilst the TDOA measurement forms the basis of those geolocation systems based upon the exploitation of multiple relative range estimates. The up-sampling, differentiation and correlation functions, as was shown, may all be efficiently performed, in various combinations, when the processing is carried out via Hartley space, with the linearity of the complex-data correlation operation also leading to its decomposition into four parallel real-data correlation operations. This parallel decomposition is particularly useful when the quantities of data to be correlated are large and the throughput requirement high, as it enables the correlation to be efficiently computed by running in parallel multiple versions of the R²₄ FHT.

With regard to the channelization problem, it was suggested that the computational complexity involved in the production of a single channel (or small number of channels) by means of a DDC process may, depending upon the parameters, be considerably reduced compared to that of the direct data-space approach, by carrying out the filtering via fast Hartley-space convolution. For the case of multiple channels, it was seen that the channelization of a real-valued signal by means of the polyphase DFT filter bank may also be considerably simplified through the adoption of an FHT for carrying out the associated real-data DFT. In fact, with most RF channelization problems, where the number of channels is large enough to make the question of implementational complexity a serious issue, the sampled IF data is naturally real valued, so that advantage may be made of this fact in trying to reduce the complexity to manageable levels. This can be done in two ways: firstly, by replacing each pair of short FIR filters – applied to the in-phase and quadrature channels – required by the standard solution for each polyphase branch, with a single short FIR filter, as the data remains real valued right the way through the polyphase filtering process; and secondly, by replacing the complex-data DFT at the output of the standard polyphase filter bank by a real-data DFT which for a suitably chosen number of channels may be efficiently computed by means of the R²₄ FHT.

Note that to be able to carry out the real-data DFT component of the polyphase DFT filter bank with a dual-PE solution to the R²₄ FHT, rather than a single-PE solution, as suggested in Section 9.5.2.2, it would be necessary to use either quad-port memory or a doubled read/write access rate to the dual-port memory, for both the DM and the CM, so as to ensure conflict-free and (for the data) in-place parallel memory addressing of both the data and the trigonometric coefficients for each PE – as discussed in Section 6.6 of Chapter 6 – with all eight GD-BFLY inputs/outputs for each PE being read/written simultaneously from/to memory.
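The decomposition of the complex-data correlation into four parallel real-data correlations, as mentioned above, can be sketched as follows. Direct time-domain circular correlations stand in here for the Hartley-space computation that the R²₄ FHT would perform, and the conjugate-reference convention and function names are illustrative assumptions:

```c
#include <assert.h>
#include <stddef.h>

/* Real circular cross-correlation r[k] = sum_n a[n]*b[n+k mod len]: the
 * real-data primitive that would be computed via Hartley space. */
static void real_corr(const double *a, const double *b, double *r, size_t len)
{
    for (size_t k = 0; k < len; k++) {
        double acc = 0.0;
        for (size_t n = 0; n < len; n++)
            acc += a[n] * b[(n + k) % len];
        r[k] = acc;
    }
}

/* Complex correlation z[k] = sum_n conj(x[n]) * y[n+k] decomposed into
 * four real correlations (conjugate-reference convention assumed):
 *   zI = corr(xI,yI) + corr(xQ,yQ),  zQ = corr(xI,yQ) - corr(xQ,yI).
 * Sketch only: len is limited to 64 by the local buffers. */
static void complex_corr(const double *xi, const double *xq,
                         const double *yi, const double *yq,
                         double *zi, double *zq, size_t len)
{
    double a[64], b[64], c[64], d[64];
    real_corr(xi, yi, a, len);
    real_corr(xq, yq, b, len);
    real_corr(xi, yq, c, len);
    real_corr(xq, yi, d, len);
    for (size_t k = 0; k < len; k++) {
        zi[k] = a[k] + b[k];
        zq[k] = c[k] - d[k];
    }
}
```

Since the four real correlations are independent, they may be computed by running multiple FHT-based correlators in parallel, exactly as suggested in the discussion above.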

References

1. A.N. Akansu, R.A. Haddad, Multiresolution Signal Decomposition: Transforms – Subbands – Wavelets (Academic Press, San Diego, CA, 2001).
2. E.O. Brigham, The Fast Fourier Transform and Its Applications (Prentice Hall, Englewood Cliffs, NJ, 1988).
3. D. Fraser, Interpolation by the FFT revisited – an experimental investigation. IEEE Trans. ASSP 37(5), 665–675 (1989).
4. F.J. Harris, Multirate Signal Processing for Communication Systems (Prentice Hall, Upper Saddle River, NJ, 2004).
5. K.J. Jones, Digital Signal Conditioning for Sensor Arrays, G.B. Patent Application No: 0112415 (5 May 2001).
6. R. Nielson, Sonar Signal Processing (Artech House, Boston, MA, 1991).
7. A.V. Oppenheim, R.W. Schafer, Discrete-Time Signal Processing (Prentice Hall, Upper Saddle River, NJ, 1989).
8. R.A. Poisel, Electronic Warfare: Target Location Methods (Artech House, Boston, MA, 2005).
9. J.G. Proakis, Digital Communications (McGraw Hill, New York, 2001).
10. B. Sklar, Digital Communications: Fundamentals and Applications (Prentice Hall, Englewood Cliffs, NJ, 2002).
11. C.C. Tseng, S.L. Lee, Design of FIR Digital Differentiator Using Discrete Hartley Transform and Backward Difference (European Signal Processing Conference (EUSIPCO), Lausanne, 2008).
12. P.P. Vaidyanathan, Multirate Systems and Filter Banks (Prentice Hall, Englewood Cliffs, NJ, 1993).


Chapter 10
Summary and Conclusions

Abstract This chapter first outlines the background to the problem addressed by the preceding chapters, namely the computation using silicon-based hardware of the real-data DFT, including the specific objectives that were to be met by the research, this being followed with a further discussion of the results obtained from the research and finally of the conclusions to be drawn.

10.1 Outline of Problem Addressed

The problem addressed in this monograph has been concerned with the parallel computation of the real-data DFT, targeted at implementation with silicon-based parallel computing equipment, where the application area of interest is that of wireless communications, and in particular that of mobile communications, so that resource-constrained (both silicon and power) solutions based upon the highly regular fixed-radix FFT design have been actively sought. With the computing power now available via the silicon-based parallel computing technologies, however, it is no longer adequate to view the FFT complexity purely in terms of arithmetic operation counts, as has conventionally been done, as there is now the facility to use both multiple arithmetic units – adders, fast multipliers and CORDIC phase rotators – and multiple banks of fast RAM in order to enhance the FFT performance via its parallel computation.

As a result, a whole new set of constraints has arisen relating to the design of efficient FFT algorithms for silicon-based implementation. With the environment encountered in mobile communications, where a small battery may be the only source of power supply for long periods of time, algorithms are now being designed subject to new and often conflicting performance criteria, where the ideal is either to maximize the throughput (that is, to minimize the update time) or satisfy some constraint on the latency, whilst at the same time minimizing the required silicon resources (and thereby minimizing the cost of implementation) as well as keeping the power consumption to within the available budget.

K. Jones, The Regularized Fast Hartley Transform, Signals and Communication Technology, DOI 10.1007/978-90-481-3917-0_10, © Springer Science+Business Media B.V. 2010

The traditional approach to the DFT problem has been to use a complex-data solution, regardless of the nature of the data, this often entailing the initial conversion of real-valued data to complex-valued data via a wideband DDC process or through the adoption of a real-from-complex strategy whereby two real-data FFTs are computed simultaneously via one full-length complex-data FFT or where one real-data FFT is computed via one half-length complex-data FFT. Each such solution, however, involves a computational overhead when compared to the more direct approach of a real-data FFT in terms of increased memory, increased processing delay to allow for the possible acquisition/processing of pairs of data sets, and additional packing/unpacking complexity. With the DDC approach, where two functions are used instead of just one, the information content of short-duration signals may also be compromised through the introduction of the additional and unnecessary filtering operation.

Thus, the traditional approach to the problem of the real-data DFT has effectively been to modify the problem so as to match an existing complex-data solution – the aim of the research carried out in this monograph has been to seek a solution that matches the actual problem.

The DHT, which is an orthogonal real-data transform and close relative to the DFT that possesses many of the same properties, was identified as an attractive algorithm for attacking the real-data DFT problem as the outputs from a real-data DFT may be straightforwardly obtained from the outputs of the DHT, and vice versa, whilst fast algorithms for its solution – referred to generically as the FHT – are now commonly encountered in the technical literature. A drawback of conventional FHTs, however, lies in the lack of regularity arising from the need for two sizes – and thus two separate designs – of butterfly for fixed-radix formulations, where a single-sized radix-R butterfly produces R outputs from R inputs and a double-sized radix-R butterfly produces 2R outputs from 2R inputs.

10.2 Summary of Results

To address the above situation, a generic version of the double-sized butterfly, referred to as the generic double butterfly and abbreviated to “GD-BFLY”, was developed for the radix-4 version of the FHT which overcame the problem in an elegant fashion. The resulting single-design solution, referred to as the regularized FHT and abbreviated to “R²₄ FHT”, lends itself naturally to parallelization and to mapping onto a regular computational structure for implementation with one of the silicon-based parallel computing technologies.

A partitioned-memory architecture was identified and developed for the parallel computation of the GD-BFLY and of the resulting R²₄ FHT, whereby both the data and the trigonometric coefficients were partitioned or distributed across multiple banks of memory. The approach exploited a single locally-pipelined high-performance PE that yielded an attractive solution which was both area-efficient and scalable in terms of transform length. High performance was achieved by having the PE able to process the input/output data sets to the GD-BFLY in parallel, this in turn implying the need to be able to access simultaneously, and without conflict, both multiple data and multiple trigonometric coefficients, from their respective memories. The arithmetic and permutation operations performed on the GD-BFLY data within the PE were mapped onto a computational pipeline where, for the implementation considered, it was required that the total number of CSs in the pipeline was an odd-valued integer so as to avoid any possible conflict problems with the reading/writing of input/output data from/to the DM banks for each new clock cycle.

A number of pipelined versions of the PE were thus described using both fast fixed-point multipliers and CORDIC phase rotators which enabled the arithmetic complexity to be traded off against memory requirement. The result was a set of designs based upon the partitioned-memory single-PE computing architecture which each yield a hardware-efficient solution with universal application, such that each new application necessitates minimal re-design cost. The resulting solutions were shown to be amenable to efficient implementation with the silicon-based technologies and capable of achieving the computational density – that is, the throughput per unit area of silicon – of the most advanced commercially-available complex-data solutions for just a fraction of the silicon resources.

The area-efficiency makes each design particularly attractive for those applications where the real-data transform is sufficiently long as to make the associated memory requirement a serious issue for more conventional multi-PE solutions, whilst the block-based nature of their operation means that they are also able, via the block floating-point scaling strategy, to produce higher accuracy transform-domain outputs when using fixed-point arithmetic than is achievable by their streaming FFT counterparts.

Finally, it was seen how the applicability of the R²₄ FHT – which is a radix-4 algorithm – could be generalized, without significantly compromising performance, to the efficient parallel computation of the real-data DFT whose length is a power of two, but not a power of four. This enables it to be applied, potentially, to a great many more problems, including those that might not necessarily be best solved through the direct application of a 4^n-point transform. This was followed by its application to the computation of some of the more familiar and computationally-intensive DSP-based functions, such as those of correlation – both auto-correlation and cross-correlation – and of the wideband channelization of RF data via the polyphase DFT filter bank. With each such function, which might typically be encountered in that increasingly important area of wireless communications relating to the geolocation of signal emitters, the adoption of the R²₄ FHT may potentially result in both conceptually and computationally simplified solutions.

Note that the mathematical/logical correctness of the operation of the various functions used by the partitioned-memory single-PE solution to the R²₄ FHT has been proven in software with a computer program written in the “C” programming language. This code provides the user with various choices of PE design and of storage/accession scheme for the trigonometric coefficients, helping the user to identify how the algorithm might be efficiently mapped onto suitable parallel computing equipment following translation of the sequential “C” code to the parallel code produced by a suitably chosen HDL.


10.3 Conclusions

The aims of this research, as described above, have been successfully achieved with a highly-parallel formulation of the real-data FFT being defined without recourse to the use of a complex-data FFT and, in so doing, a solution that yields clear implementational advantages, both theoretical and practical, over the more conventional complex-data solutions to the problem. The highly-parallel formulation of the real-data FFT described in the monograph has been shown to lead to scalable and device-independent solutions to the latency-constrained version of the problem which are able to optimize the use of the available silicon resources, and thus to maximize the achievable computational density, thereby making the solution a genuine advance in the design and implementation of high-performance parallel FFT algorithms.


Appendix A
Computer Program for Regularized Fast Hartley Transform

Abstract This appendix outlines the various functions of which the regularized FHT is comprised and provides a detailed description of the computer code, written in the “C” programming language, for executing the said functions, where integer-only arithmetic is used to model the fixed-point nature of the associated arithmetic operations. The computer source code for the complete solution, which is listed in Appendix B, is to be found on the CD accompanying the monograph.

A.1 Introduction

The processing functions required for a fixed-point implementation of the R²₄ FHT break down into two quite distinct categories, namely those pre-processing functions that need to be carried out in advance of the real-time processing for performing the following tasks:

– Setting up of LUTs for trigonometric coefficients
– Setting up of permutation mappings for GD-BFLY

and those processing functions that need to be carried out as part of the real-time solution:

– Di-bit reversal
– DM reads and writes
– CM reads and trigonometric coefficient generation
– GD-BFLY computation
– FHT-to-FFT conversion

The individual modules – written in the “C” programming language with the Microsoft Visual C++ compiler under their Visual Studio computing environment – that have been developed to implement these particular pre-processing and processing functions are now outlined, where integer-only arithmetic has been used to model the fixed-point nature of the associated arithmetic operations. This is followed by a brief guide on how to run the program and of the scaling strategies available to the user. Please note, however, that the program has not been exhaustively tested so it is quite conceivable that various bugs may still be present in the current version of the code. The notification of any such bugs, if identified, would be greatly welcomed by the author. The computer code for the complete solution, which is listed in Appendix B, is to be found on the CD accompanying the monograph.

A.2 Description of Functions

Before the R²₄ FHT can be executed it is first necessary that a main module or program be produced:

“RFHT4 Computer Program.c”

which carries out all the pre-processing functions, as required for providing the necessary inputs to the R²₄ FHT, as well as setting up the input data to the R²₄ FHT through the calling of a separate module:

“SignalGeneration.c”

such that the data – real valued or complex valued – may be either accessed from an existing binary or text file or generated by the signal generation module.

A.2.1 Control Routine

Once all the pre-processing functions have been carried out and the input data made ready for feeding to the R²₄ FHT, a control module:

“RFHT4 Control.c”

called from within the main program then carries out in the required order all the processing functions that make up the real-time solution, this starting with the di-bit reversal of the input data, followed by the execution of the R²₄ FHT, and finishing with the conversion of the output data, should it be required, from Hartley space to Fourier space.

A.2.2 Generic Double Butterfly Routines

Three versions of the GD-BFLY have been produced, as discussed in Chapters 6 and 7, with the first version, involving 12 fast fixed-point multipliers, being carried out by means of the module:


“Butterfly_V12M.c”,

the second version, involving nine fast fixed-point multipliers, by means of the module:

“Butterfly_V09M.c”

and the third version, involving three CORDIC rotation units, by means of the module:

“Butterfly_Cordic.c”.

The last version makes use of a separate module:

“Rotation.c”

for carrying out the individual phase rotations.

A.2.3 Address Generation and Data Re-ordering Routines

The generation of the four permutation mappings used by the GD-BFLY, as discussed in Chapter 4, is carried out by means of the module:

“ButterflyMappings.c”,

whilst the di-bit reversal of the input data to the R2⁴ FHT is carried out with the module:

“DibitReversal.c”

and the addresses of the eight-sample data sets required for input to the GD-BFLY are obtained by means of the module:

“DataIndices.c”.

Note that for optimal efficiency the four permutation mappings used by the GD-BFLY only store information relating to the non-trivial exchanges.
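Di-bit reversal reorders the input by reversing the 2α-bit sample address one radix-4 digit (two bits) at a time, the radix-4 analogue of the bit reversal used by radix-2 FFTs. A minimal sketch of the address computation (illustrative only — the book's DibitReversal.c additionally maps the reordered data into its two-dimensional, eight-bank form) is:

```c
/* Sketch of di-bit (radix-4 digit) reversal of a 2*alpha-bit address. */
static unsigned dibit_reverse(unsigned addr, int alpha)
{
    unsigned r = 0;
    for (int i = 0; i < alpha; i++) {
        r = (r << 2) | (addr & 3u);   /* peel lowest dibit, push onto result */
        addr >>= 2;
    }
    return r;
}
```

Like bit reversal, the mapping is its own inverse, so applying it twice returns the original address.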

A.2.4 Data Memory Accession and Updating Routines

The reading/writing of multiple samples of data from/to DM, which requires the application of the memory address mappings discussed in Chapter 6, is carried out by means of the module:


“MemoryBankAddresses.c”

which, given the address of a single di-bit reversed sample, produces both the memory bank address and the address offset within that particular memory bank. For optimal efficiency, this piece of code should be tailored to the particular transform length under consideration, although a mapping that is implemented for one particular transform length will also be valid for every transform length shorter than it, albeit somewhat wasteful in terms of unnecessary arithmetic/logic.
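The bank/offset interface can be pictured with the following sketch. Note that the mapping shown is a simple illustrative modulo-8 split and NOT the conflict-free Chapter 6 mapping that the book's module implements; only the shape of the interface (one sample address in, bank number and within-bank offset out) is being mirrored.

```c
/* Illustrative only -- NOT the Chapter 6 mapping: a simple split of a
   sample address across eight banks of N/8 words each, mirroring the
   bank/offset interface of the memory addressing module. */
static void bank_and_offset(unsigned addr, unsigned *bank, unsigned *offset)
{
    *bank   = addr & 7u;   /* low three address bits select the bank   */
    *offset = addr >> 3;   /* remaining bits give the offset within it */
}
```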

A.2.5 Trigonometric Coefficient Generation Routines

The trigonometric coefficient sets accessed from CM which are required for the execution of the GD-BFLY are dependent upon the particular version of the GD-BFLY used, namely whether it involves 12 or nine fast fixed-point multipliers, as well as the type of addressing scheme used, namely whether the storage of the trigonometric coefficients is based upon the adoption of one-level or two-level LUTs.

For the combination of a twelve-multiplier version of the GD-BFLY and the adoption of three one-level LUTs, the trigonometric coefficients are generated via the module:

“Coefficients_V12M_1Level.c”,

whilst for the combination of a nine-multiplier version of the GD-BFLY and the adoption of three one-level LUTs, the trigonometric coefficients are generated via the module:

“Coefficients_V09M_1Level.c”,

for the combination of a twelve-multiplier version of the GD-BFLY and the adoption of three two-level LUTs, the trigonometric coefficients are generated via the module:

“Coefficients_V12M_2Level.c”,

and for the combination of a nine-multiplier version of the GD-BFLY and the adoption of three two-level LUTs, the trigonometric coefficients are generated via the module:

“Coefficients_V09M_2Level.c”.

All four versions produce sets of nine trigonometric coefficients which are required for the execution of the GD-BFLY.


A.2.6 Look-Up-Table Generation Routines

The generation of the LUTs required for the storage of the trigonometric coefficients is carried out by means of the module:

“LookUpTable_1Level.c”,

for the case of the one-level LUTs, or the module:

“LookUpTable_2Level.c”

for the case of the two-level LUTs.

A.2.7 FHT-to-FFT Conversion Routines

Upon completion of the R2⁴ FHT, the outputs may be converted from Hartley space to Fourier space, if required, this being carried out by means of the module:

“Conversion.c”.

The routine is able to operate with FHT outputs obtained from the processing of either real-data inputs or complex-data inputs.
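For real input data the Hartley-to-Fourier conversion rests on the standard identities Re F(k) = (H(k) + H(N−k))/2 and Im F(k) = (H(N−k) − H(k))/2. The floating-point sketch below illustrates the identity only; the book's Conversion.c operates on fixed-point data and handles the complex-data combination as well.

```c
/* Sketch of the real-data Hartley-to-Fourier identity:
       Re F(k) = (H(k) + H(N-k)) / 2
       Im F(k) = (H(N-k) - H(k)) / 2
   Floating point for clarity; illustrative, not the book's routine. */
static void hartley_to_fourier(const double *h, int n, int k,
                               double *re, double *im)
{
    int nk = (n - k) % n;               /* index N-k, with H(N) -> H(0) */
    *re = 0.5 * (h[k] + h[nk]);
    *im = 0.5 * (h[nk] - h[k]);
}
```

For instance, the 4-point FHT of {1, 2, 3, 4} is {10, −4, −2, 0}, from which the conversion recovers the DFT value F(1) = −2 + 2i.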

A.3 Brief Guide to Running the Program

The parameters that define the operation of the R2⁴ FHT are listed as constants at the top of the main program, “RFHT4_Computer_Program.c”, these constants enabling the various versions of the GD-BFLY to be selected, as required, as well as the transform length, word lengths (for both data and trigonometric coefficients), input/output data formats, scaling strategy, etc., to be set up by the user at run time. The complete list of parameters is reproduced here, as shown in Fig. A.1, this including a typical set of parameter values and an accompanying description of each parameter. The input data set used for testing the various double butterfly and memory addressing combinations may be either read from a binary or text file (real data or complex data), with the appropriate file name being as specified in the signal generation routine, “SignalGeneration.c”, or mathematically generated to model a signal in the form of a single tone (real data or complex data versions) where the address of the excited FHT/FFT bin is as specified on the last line of Fig. A.1. For a real-valued input data set the program is able to produce transform outputs in either Hartley space or Fourier space, whilst when the input data set is complex valued the program will automatically produce the outputs in Fourier space.


// SYSTEM PARAMETERS:

#define FHT_length 1024         // transform length: must be a power of 4
#define data_type 1             // 1 => real-valued data, 2 => complex-valued data
#define FHT_FFT_flag 1          // 1 => FHT outputs, 2 => FFT outputs
#define BFLY_type 3             // Bfly type: 1 => 12 mplys, 2 => 9 mplys, 3 => 3 Cordics
#define MEM_type 1              // Memory type: 1 => one-level LUT, 2 => two-level LUT
#define scaling 2               // 1 => FIXED, 2 => BFP

// REGISTER-LENGTH PARAMETERS:

#define no_of_bits_data 18      // no of bits representing input data
#define no_of_bits_coeffs 24    // no of bits representing trigonometric coefficients

// CORDIC BUTTERFLY PARAMETERS:

#define no_of_iterations 18     // no of Cordic iterations = output accuracy (bits)
#define no_of_bits_angle 27     // no of bits representing Cordic rotation angle
#define LSB_guard_bits 5        // no of guard bits for LSB: ~ log2(no_of_iterations)

// FILE PARAMETERS:

#define input_file_format 2     // 1 => HEX, 2 => DEC
#define output_file_format 2    // 1 => HEX, 2 => DEC

// FIXED SCALING PARAMETERS - ONE FACTOR PER FHT STAGE:

#define scale_factor_0 2        // bits to shift for stage = 0
#define scale_factor_1 2        // bits to shift for stage = 1
#define scale_factor_2 2        // bits to shift for stage = 2
#define scale_factor_3 2        // bits to shift for stage = 3
#define scale_factor_4 2        // bits to shift for stage = 4 - last stage for 1K FHT
#define scale_factor_5 2        // bits to shift for stage = 5 - last stage for 4K FHT
#define scale_factor_6 1        // bits to shift for stage = 6 - last stage for 16K FHT
#define scale_factor_7 0        // bits to shift for stage = 7 - last stage for 64K FHT

// SYNTHETIC DATA PARAMETERS:

#define data_input 1            // 0 => read data from file, 1 => generate data
#define dft_bin_excited 117     // tone excited: between 0 and FHT_length/2-1

Fig. A.1 Typical parameter set for regularized FHT program

Note that when writing the outputs of an N-point FHT to file, the program stores one sample to a line; when writing the outputs of an N-point real-data FFT to file, it stores the zero-frequency term on the first line followed by the positive frequency terms on the next N/2 – 1 lines, with the real and imaginary components of each term appearing on the same line; and finally, when writing the outputs of an N-point complex-data FFT to file, it stores the zero-frequency term on the first line followed by the positive and then negative frequency terms on the next N – 1 lines, with the real and imaginary components of each term appearing on the same line – although the Nyquist-frequency term, like the zero-frequency term, possesses only a real component. Bear in mind that for the case of the real-data FFT, the magnitude of a zero-frequency tone (or Nyquist-frequency tone, if computed), if measured in the frequency domain, will be twice that of a comparable positive frequency tone (i.e. having the same signal amplitude) which shares its energy equally with its negative-frequency counterpart.

A.4 Available Scaling Strategies

With regard to the fixed-point scaling strategies, note that when the scaling of the intermediate results is carried out via the conditional block floating-point technique, it is applied at the input to each stage of GD-BFLYs. As a result, any possible magnification incurred during the last stage of GD-BFLYs is not scaled out of the results, so that up to three bits of growth will still need to be accounted for in the R2⁴ FHT outputs according to the particular post-FHT processing requirements. Examples of block floating-point scaling for both the twelve-multiplier and the nine-multiplier versions of the GD-BFLY are given in Figs. A.2 and A.3, respectively, each geared to the use of an 18-bit fast multiplier – the scaling for the CORDIC version is essentially the same as that for the twelve-multiplier version. The program provides the user with specific information relating to the chosen parameter set, printing to the screen the amount of scaling required, if any, for each stage of GD-BFLYs required by the transform.

Fig. A.2 Block floating-point scaling for use with twelve-multiplier and CORDIC versions of generic double butterfly (input data: 18 bits + zero growth; output data: 18 bits + growth, where growth ∈ {0, 1, 2, 3}; register details: PE internal = 21 (min) & 24 (max) bits, PE external = 21 bits)


Fig. A.3 Block floating-point scaling for use with nine-multiplier version of generic double butterfly (input data: 17 bits + zero growth; output data: 17 bits + growth, where growth ∈ {0, 1, 2, 3}; register details: PE internal = 20 (min) & 23 (max) bits, PE external = 20 bits; theory => 23 bits maximum)

For the case of the unconditional fixed scaling technique – the individual scale factors to be applied for each stage of GD-BFLYs are as specified by the set of constants given in Fig. A.1 – a small segment of code has been included within the generic double butterfly routines which prints to the screen an error message whenever the register for holding the input data to either the fast multiplier or the CORDIC arithmetic unit overflows. For the accurate simulation of a given hardware device this segment of code needs to be replaced by a routine that mimics the “actual” behaviour of the device in response to such an overflow – such a response being dependent upon the particular device used. When the nine-multiplier version of the GD-BFLY is adopted the presence of the stage of adders prior to that of the fast fixed-point multipliers is far more likely to result in an overflow unless additional scaling is applied immediately after this stage of adders has been completed, as is performed by the computer program, or alternatively, unless the data word-length into the GD-BFLY is constrained to be one bit shorter than that for the twelve-multiplier version. Clearly, in order to prevent fixed-point overflow, the settings for the individual scale factors will need to take into account both the transform length and the particular version of the GD-BFLY chosen, with experience invariably dictating when an optimum selection of scale factors has been made. Bear in mind, however, that with the CORDIC-based version of the GD-BFLY there is an associated magnification of the data magnitudes by approximately 1.647 with each temporal stage of GD-BFLYs which needs to be accounted for by the scale factors.

Finally, note that when the CORDIC-based GD-BFLY is selected, regardless of the scaling strategy adopted, the program will also print to the screen exactly how many non-trivial shifts/additions are required for carrying out the two fixed-coefficient multipliers for the chosen parameter set. For the case of an 18-stage CORDIC arithmetic unit, for example, a total of nine such non-trivial shifts/additions are required.


Appendix B
Source Code Listings for Regularized Fast Hartley Transform

Abstract This appendix lists the source code, written in the “C” programming language, for the various functions of which the regularized FHT is comprised. The actual computer source code is to be found on the CD accompanying the monograph.

B.1 Listings for Main Program and Signal Generation Routine

#include "stdafx.h"
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// D E F I N E   P A R A M E T E R S
// - - - - - - - - - - - - - - - - -
// SYSTEM PARAMETERS:

#define FHT_length 1024         // transform length: must be a power of 4
#define data_type 1             // 1 => real-valued data, 2 => complex-valued data
#define FHT_FFT_flag 1          // 1 => FHT outputs, 2 => FFT outputs
#define BFLY_type 3             // Bfly type: 1 => 12 mplys, 2 => 9 mplys, 3 => 3 Cordics
#define MEM_type 1              // Memory type: 1 => one-level LUT, 2 => two-level LUT
#define scaling 2               // 1 => FIXED, 2 => BFP

// REGISTER-LENGTH PARAMETERS:

#define no_of_bits_data 18      // no of bits representing input data
#define no_of_bits_coeffs 24    // no of bits representing trigonometric coefficients

// CORDIC BUTTERFLY PARAMETERS:

#define no_of_iterations 18     // no of Cordic iterations = output accuracy (bits)
#define no_of_bits_angle 27     // no of bits representing Cordic rotation angle
#define LSB_guard_bits 5        // no of guard bits for LSB: ~ log2(no_of_iterations)

// FILE PARAMETERS:

#define input_file_format 2     // 1 => HEX, 2 => DEC
#define output_file_format 2    // 1 => HEX, 2 => DEC

// FIXED SCALING PARAMETERS - ONE FACTOR PER FHT STAGE:

#define scale_factor_0 2        // bits to shift for stage = 0
#define scale_factor_1 2        // bits to shift for stage = 1
#define scale_factor_2 2        // bits to shift for stage = 2
#define scale_factor_3 2        // bits to shift for stage = 3
#define scale_factor_4 2        // bits to shift for stage = 4 - last stage for 1K FHT
#define scale_factor_5 2        // bits to shift for stage = 5 - last stage for 4K FHT
#define scale_factor_6 2        // bits to shift for stage = 6 - last stage for 16K FHT
#define scale_factor_7 2        // bits to shift for stage = 7 - last stage for 64K FHT

// SYNTHETIC DATA PARAMETERS:

#define data_input 1            // 0 => read data from file, 1 => generate data
#define dft_bin_excited 256     // tone excited: between 0 and FHT_length/2-1

void main ()

{
// R E G U L A R I Z E D   F A S T   H A R T L E Y   T R A N S F O R M   A L G O R I T H M
// - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
// Author: Dr. Keith John Jones, June 14th 2009
// F I X E D - P O I N T   F H T   I M P L E M E N T A T I O N   F O R   F P G A
//   - D A T A   &   C O E F F I C I E N T S   Q U A N T I Z E D
// U T I L I Z E S   O N E   D O U B L E - S I Z E D   B U T T E R F L Y
//   - T Y P E = 12 fast multipliers & 22 adders
//            or  9 fast multipliers & 25 adders
//            or  3 Cordic arithmetic units & 2 fixed multipliers & 16 adders
// U T I L I Z E S   E I G H T   D A T A   M E M O R Y   B A N K S
//   - S I Z E = N / 8 words per bank
// U T I L I Z E S   T H R E E   C O E F F I C I E N T   M E M O R Y   B A N K S
//   - S I Z E = N / 4 words or sqrt(N) / 2 words or zero words per bank

Page 182: The Regularized Fast Hartley Transform: Optimal Formulation of Real-Data Fast Fourier Transform for Silicon-Based Implementation in Resource-Constrained Environments

B.1 Listings for Main Program and Signal Generation Routine 175

// Description:
// - - - - - - -
// This program carries out the FHT using a generic radix-4 double-sized butterfly. The
// solution performs 8 simultaneous reads/writes using 8 memory banks, each of length N/8
// words. Three LUTs, each of length N/4 words or sqrt(N)/2 words, may also be used for
// holding the trigonometric coefficients, enabling all six coefficients to be accessed
// simultaneously - these LUTs are not required, however, when the arithmetic is performed
// with the Cordic unit. Three types of double-sized butterfly are available for use by
// the FHT: one involves the use of 12 fast fixed-point multipliers and 22 adders, another
// involves the use of 9 fast fixed-point multipliers and 25 adders, whilst a third
// involves the use of 3 Cordic arithmetic units, 2 fixed multipliers and 16 adders. Two
// coefficient memory addressing schemes are also available for use by the FHT: one
// involves the use of 3 LUTs, each of length N/4 words, whilst another involves the use
// of 3 LUTs, each of length sqrt(N)/2 words. The following combinations of arithmetic
// and memory are thus possible:
//
// 1) for a 12-multiplier double-sized butterfly & N/4 word LUTs, the coefficient
//    generation involves no arithmetic operations;
// 2) for a 12-multiplier double-sized butterfly & sqrt(N)/2 word LUTs, the coefficient
//    generation involves 7 multiplications and 8 additions;
// 3) for a 9-multiplier double-sized butterfly & N/4 word LUTs, the coefficient
//    generation involves just additions;
// 4) for a 9-multiplier double-sized butterfly & sqrt(N)/2 word LUTs, the coefficient
//    generation involves 7 multiplications and 14 additions; whilst
// 5) for a Cordic double-sized butterfly, the coefficients are efficiently generated
//    on-the-fly.
//
// Scaling may be carried out within the regularized FHT to prevent overflow in the data
// registers - this may be carried out with either fixed scaling coefficients after each
// temporal stage, or by means of a block floating-point scheme in order to optimize the
// dynamic range out of the FHT. The program may produce either FHT or FFT output, where
// the input data may be either real valued or complex valued. For the case of
// complex-valued data, the FHT is simply applied to the real and imaginary components of
// the data separately before being appropriately combined via the FHT-to-FFT conversion
// routine. The inputs/outputs may be read/written from/to file with either decimal or
// hexadecimal formats.

// Files Used:
// - - - - - -
// For input/output data memory:
//
//   input_data_read.txt     - input file from which data is read.
//   output_data_fht_fft.txt - FHT/FFT output data file.
//
// For one-level trigonometric coefficient memory:
//
//   LUT_A1.txt              - LUT for single-angle argument.
//   LUT_A2.txt              - LUT for double-angle argument.
//   LUT_A3.txt              - LUT for triple-angle argument.
//
// For two-level trigonometric coefficient memory:
//
//   LUT_Sin_Coarse.txt      - coarse resolution sin LUT for single-angle argument.
//   LUT_Sin_Fine.txt        - fine resolution sin LUT for single-angle argument.
//   LUT_Cos_Fine.txt        - fine resolution cos LUT for single-angle argument.


// Functions Used:
// - - - - - - - -
//   FHT_Computer_Program     - main program.
//   SignalGeneration         - signal generation routine.
//   RFHT4_Control            - regularized FHT control routine.
//   LookUpTable_1Level       - one-level LUT generation routine.
//   LookUpTable_2Level       - two-level LUT generation routine.
//   ButterflyMappings        - address permutation generation routine.
//   DibitReversal            - sequential di-bit reversal routine & 1-D to 2-D conversion.
//   Butterfly_V12M           - double butterfly calculation routine: 12-multiply version.
//   Butterfly_V09M           - double butterfly calculation routine: 9-multiply version.
//   Butterfly_Cordic         - double butterfly calculation routine: Cordic version.
//   Coefficients_V12M_1Level - one-level coefficient generation: 12-multiply version.
//   Coefficients_V09M_1Level - one-level coefficient generation: 9-multiply version.
//   Coefficients_V12M_2Level - two-level coefficient generation: 12-multiply version.
//   Coefficients_V09M_2Level - two-level coefficient generation: 9-multiply version.
//   DataIndices              - data address generation routine.
//   Conversion               - DHT-to-DFT conversion routine.
//   MemoryBankAddress        - memory bank address/offset calculation routine.
//   Rotation                 - Cordic phase rotation routine.

// Externs:
// - - - - -

void RFHT4_Control (int**, int*, int*, int*, int*, int*, int*, int*, int*, int*, int*, int*,
    int, int, int, int, int, int, int, int, int*, int*, int, int*, int*, int*, int, int,
    int, int, int*, int, int, int, int, int, int, int);
void SignalGeneration (int*, int*, int, int, int, int, int, int);
void LookUpTable_1Level (int, int, int*, int*, int*, int);
void LookUpTable_2Level (int, int, int*, int*, int*, int);
void ButterflyMappings (int*, int*, int*, int*);
void DibitReversal (int, int, int*, int, int*, int**);
void Conversion (int, int, int, int*, int*);
void MemoryBankAddress (int, int, int, int, int*, int*);

// Declarations:
// - - - - - - -
// Integers:
// - - - - -

int wordsize, m, M, n, n1, n2, N, N2, N4, N8, no_of_bits, data_levels, coef_levels;
int zero = 0, count, RootN, RootNd2, max_magnitude, real_type = 1, imag_type = 2;
int fft_length, offset, halfpi, growth, growth_copy, angle_levels, minusquarterpi;
int Root_FHT_length, alpha, lower, upper;

// Integer Arrays:
// - - - - - - - -

int index1[4], index2[16], index3[16], index4[8];
int scale_factors[8], power_of_two_A[15], power_of_two_B[8];
int beta1[8], beta2[8], beta3[8], growth_binary[32], arctans[32];


// Floats:
// - - - -

double pi, halfpi_float, quarterpi_float, twopi, angle, growth_float;

// Pointer Variables:
// - - - - - - - - - -

int *XRdata, *XIdata;
int *bank1, *offset1, *bank2, *offset2, *scale_total;
int *Look_Up_Sin_A1, *Look_Up_Sin_A2, *Look_Up_Sin_A3;
int *Look_Up_Sin_Coarse, *Look_Up_Cos_Fine, *Look_Up_Sin_Fine;
int **XRdata_2D = new int*[8];

// Files:
// - - - -

FILE *myinfile, *output;

// ***********************************************************************
// ## R E G U L A R I S E D   F H T   I N I T I A L I S A T I O N.
// Set up transform parameters.

Root_FHT_length = (int) (sqrt(FHT_length+0.5));
for (n = 3; n < 9; n++)
{
    if (FHT_length == (int) (pow(4,n)))
    {
        alpha = n;
    }
}

// Set up standard angles.
pi = atan(1.0)*4.0; halfpi_float = atan(1.0)*2.0; twopi = atan(1.0)*8.0;
quarterpi_float = atan(1.0);
wordsize = sizeof (int);
memset (&index1[0], 0, wordsize << 2); memset (&index2[0], 0, wordsize << 4);
memset (&index3[0], 0, wordsize << 4); memset (&index4[0], 0, wordsize << 3);

// Set up scale factors for butterfly stages.
scale_factors[0] = scale_factor_0; scale_factors[1] = scale_factor_1;
scale_factors[2] = scale_factor_2; scale_factors[3] = scale_factor_3;
scale_factors[4] = scale_factor_4; scale_factors[5] = scale_factor_5;
scale_factors[6] = scale_factor_6; scale_factors[7] = scale_factor_7;

// Set up dynamic memory.
for (n = 0; n < 8; n++)
{
    XRdata_2D[n] = new int [FHT_length/8];
}
XRdata = new int [FHT_length];
XIdata = new int [FHT_length];
if (MEM_type == 1)
{
    Look_Up_Sin_A1 = new int [FHT_length/4];
    Look_Up_Sin_A2 = new int [FHT_length/4];
    Look_Up_Sin_A3 = new int [FHT_length/4];
    Look_Up_Sin_Coarse = new int [1];
    Look_Up_Cos_Fine = new int [1];
    Look_Up_Sin_Fine = new int [1];
}
else
{
    Look_Up_Sin_A1 = new int [1];
    Look_Up_Sin_A2 = new int [1];
    Look_Up_Sin_A3 = new int [1];
    Look_Up_Sin_Coarse = new int [Root_FHT_length/2+1];
    Look_Up_Cos_Fine = new int [Root_FHT_length/2];
    Look_Up_Sin_Fine = new int [Root_FHT_length/2];
}
bank1 = new int [1]; bank1[0] = 0;
bank2 = new int [1]; bank2[0] = 0;
offset1 = new int [1]; offset1[0] = 0;
offset2 = new int [1]; offset2[0] = 0;
scale_total = new int [1];
myinfile = stdin;

// Set up write-only file for holding FHT/FFT output data.
if ((output = fopen("output_data_fht_fft.txt", "w")) == NULL)
    printf ("\n Error opening output data file");

// Set up transform length.
N = FHT_length; N2 = (N >> 1); N4 = (N2 >> 1); N8 = (N4 >> 1);
RootN = Root_FHT_length; RootNd2 = RootN / 2;
if (data_type == 1)
{
    fft_length = N2;
}
else
{
    fft_length = N;
}

// Set up number of quantisation levels for data.
data_levels = (int) (pow(2,(no_of_bits_data-1))-1);

// Set up number of quantisation levels for coefficients.
coef_levels = (int) (pow(2,(no_of_bits_coeffs-1))-1);

// Set up number of quantisation levels for Cordic rotation angles.
angle_levels = (int) (pow(2,(no_of_bits_angle-1))-1);

// Set up maximum allowable data magnitude into double butterfly.
max_magnitude = (int) (pow(2,(no_of_bits_data-1)));

// Set up register overflow bounds for use with unconditional fixed scaling strategy.
lower = -(data_levels+1); upper = data_levels;

// Set up power-of-two array.
no_of_bits = alpha << 1;
for (n = 0; n <= no_of_bits; n++) power_of_two_A[n] = (int) pow(2,n);

// Set up modified power-of-two array.
for (n = 0; n <= alpha; n++) power_of_two_B[n] = (int) pow(2,(2*n+1));


// Set up Cordic initial rotation angles for each temporal stage.
offset = 1;
for (n = 0; n < alpha; n++)
{
    M = offset << 3; offset = power_of_two_B[n];
    if (n == 0)
    {
        beta1[0] = 0; beta2[0] = 0; beta3[0] = 0;
    }
    else
    {
        angle = -(twopi/M);     beta1[n] = (int) ((angle/pi)*angle_levels);
        angle = -2.0*(twopi/M); beta2[n] = (int) ((angle/pi)*angle_levels);
        angle = -3.0*(twopi/M); beta3[n] = (int) ((angle/pi)*angle_levels);
    }
}

// Set up Cordic magnification factor.
growth_float = 1.0;
for (n = 0; n < no_of_iterations; n++)
{
    growth_float *= sqrt(1+pow(2.0,-2*n));
}
growth = (int) (growth_float*angle_levels);

// Calculate binary representation of magnification factor.
n = 0; count = 0; growth_copy = growth;
while ((growth_copy >= 0) && (n < no_of_bits_angle))
{
    growth_binary[n] = growth_copy % 2;
    growth_copy = (growth_copy-growth_binary[n]) / 2;
    if (growth_binary[n] == 1) count ++;
    n++;
}
if (BFLY_type == 3)
{
    printf ("\n No of additions required by fixed multiplier = %d", count-1);
}

// Set up Cordic micro-rotation angles.
for (n = 0; n < no_of_iterations; n++)
{
    angle = atan(pow(2.0,-n));
    arctans[n] = (int) ((angle/pi)*angle_levels);
}

// Calculate integer form of trigonometric terms.
halfpi = (int) ((halfpi_float/pi)*angle_levels);
minusquarterpi = (int) ((-quarterpi_float/pi)*angle_levels);

// Print program information to screen.
printf ("\n\n Regularized Fast Hartley Transform\n");
printf (" - - - - - - - - - - - - - - -\n\n");


if (BFLY_type == 1)
{
    printf ("Butterfly Type = Twelve-Multiplier\n\n");
}
else
{
    if (BFLY_type == 2)
    {
        printf ("Butterfly Type = Nine-Multiplier\n\n");
    }
    else
    {
        printf ("Butterfly Type = Cordic\n\n");
        printf ("LUT Type = Not Relevant\n\n");
    }
}
if (BFLY_type < 3)
{
    if (MEM_type == 1)
    {
        printf ("LUT Type = One-Level\n\n");
    }
    else
    {
        printf ("LUT Type = Two-Level\n\n");
    }
}
if (data_input == 0)
{
    printf ("Data Type = Real\n\n");
}
else
{
    printf ("Data Type = Synthetic\n\n");
}
printf ("Transform Length = %d\n\n", FHT_length);
if (scaling == 1)
{
    printf ("Scaling Strategy = Fixed");
}
else
{
    printf ("Scaling Strategy = Block Floating-Point");
}
if (BFLY_type == 3)
{
    printf ("\n\n No of shifts/additions required by fixed multiplier = %d", count-1);
}


// *********************************************************************
// ## S I G N A L   G E N E R A T I O N.

SignalGeneration (XRdata, XIdata, N, data_type, dft_bin_excited, data_input, data_levels,
    input_file_format);

// *********************************************************************
// ## R E G U L A R I S E D   F H T   P R E - P R O C E S S I N G.
// Set up look-up table of multiplicative constants.

if (MEM_type == 1)
{
    // Standard memory solution.
    LookUpTable_1Level (N, N4, Look_Up_Sin_A1, Look_Up_Sin_A2,
        Look_Up_Sin_A3, coef_levels);
}
else
{
    if (MEM_type == 2)
    {
        // Reduced memory solution.
        LookUpTable_2Level (N, RootNd2, Look_Up_Sin_Coarse, Look_Up_Cos_Fine,
            Look_Up_Sin_Fine, coef_levels);
    }
}

// Set up address permutations.
ButterflyMappings (index1, index2, index3, index4);

// *********************************************************************

// ## R E G U L A R I S E D   F H T   P R O C E S S I N G.
// Process "R E A L" component of data - may be real-valued or complex-valued data.

scale_total[0] = 0;

// Di-bit reverse addresses of data & store in 2-D form.
DibitReversal (N8, no_of_bits, power_of_two_A, alpha, XRdata, XRdata_2D);

// Regularized FHT routine.
RFHT4_Control (XRdata_2D, index1, index2, index3, index4, Look_Up_Sin_A1,
    Look_Up_Sin_A2, Look_Up_Sin_A3, Look_Up_Sin_Coarse, Look_Up_Cos_Fine,
    Look_Up_Sin_Fine, power_of_two_B, alpha, N, N2, N4, RootNd2, coef_levels,
    no_of_bits_coeffs, scaling, scale_factors, scale_total, max_magnitude,
    beta1, beta2, beta3, angle_levels, halfpi, minusquarterpi, growth, arctans,
    no_of_iterations, no_of_bits_angle, LSB_guard_bits, lower, upper,
    BFLY_type, MEM_type);

// Store output data in 1-D form.
n1 = 0; n2 = 1;
for (m = 0; m < N8; m++)
{
    for (n = 0; n < 4; n++)
    {
        MemoryBankAddress (n1, 0, 0, alpha, bank1, offset1);
        MemoryBankAddress (n2, 1, 0, alpha, bank2, offset2);
        XRdata[n1] = XRdata_2D[*bank1][m]; n1 += 2;
        XRdata[n2] = XRdata_2D[*bank2][m]; n2 += 2;
    }
}


   if (data_type == 2)
   {
      // Process "I M A G I N A R Y" component of complex-valued data.
      scale_total[0] = 0;

      // Di-bit reverse addresses of data & store in 2-D form.
      DibitReversal (N8, no_of_bits, power_of_two_A, alpha, XIdata, XRdata_2D);

      // Regularized FHT routine.
      RFHT4_Control (XRdata_2D, index1, index2, index3, index4, Look_Up_Sin_A1,
                     Look_Up_Sin_A2, Look_Up_Sin_A3, Look_Up_Sin_Coarse,
                     Look_Up_Cos_Fine, Look_Up_Sin_Fine, power_of_two_B, alpha,
                     N, N2, N4, RootNd2, coef_levels, no_of_bits_coeffs, scaling,
                     scale_factors, scale_total, max_magnitude, beta1, beta2,
                     beta3, angle_levels, halfpi, minusquarterpi, growth, arctans,
                     no_of_iterations, no_of_bits_angle, LSB_guard_bits, lower,
                     upper, BFLY_type, MEM_type);

      // Store output data in 1-D form.
      n1 = 0; n2 = 1;
      for (m = 0; m < N8; m++)
      {
         for (n = 0; n < 4; n++)
         {
            MemoryBankAddress (n1, 0, 0, alpha, bank1, offset1);
            MemoryBankAddress (n2, 1, 0, alpha, bank2, offset2);
            XIdata[n1] = XRdata_2D[*bank1][m]; n1 += 2;
            XIdata[n2] = XRdata_2D[*bank2][m]; n2 += 2;
         }
      }
   }

   if ((FHT_FFT_flag > 1) || (data_type == 2))
   {
      // ## F H T - T O - F F T   C O N V E R S I O N.
      Conversion (real_type, N, N2, XRdata, XIdata);
      if (data_type == 2)
      {
         Conversion (imag_type, N, N2, XRdata, XIdata);
      }
   }

// *********************************************************************

// ## W R I T I N G   O F   F H T / F F T   O U T P U T   D A T A   T O   F I L E.
   if (output_file_format == 1)
   {
      // "H E X" file format.
      if (FHT_FFT_flag == 1)
      {
         // FHT outputs - real-valued input & real-valued output.
         for (n = 0; n < N; n++) fprintf (output, "%x\n", XRdata[n]);
      }


      else
      {
         if (data_type == 1)
         {
            // FFT outputs - real-valued input & complex-valued output.
            fprintf (output, "%x %x\n", XRdata[0], zero);
            for (n = 1; n < N2; n++) fprintf (output, "%x %x\n", XRdata[n],
                                              XRdata[N-n]);
         }
         else
         {
            // FFT outputs - complex-valued input & complex-valued output.
            for (n = 0; n < N; n++) fprintf (output, "%x %x\n", XRdata[n],
                                             XIdata[n]);
         }
      }
   }
   else
   {
      // "D E C" file format.
      if (FHT_FFT_flag == 1)
      {
         // FHT outputs - real-valued input & real-valued output.
         for (n = 0; n < N; n++) fprintf (output, "%10d\n", XRdata[n]);
      }
      else
      {
         if (data_type == 1)
         {
            // FFT outputs - real-valued input & complex-valued output.
            fprintf (output, "%10d %10d\n", XRdata[0], zero);
            for (n = 1; n < N2; n++) fprintf (output, "%10d %10d\n", XRdata[n],
                                              XRdata[N-n]);
         }
         else
         {
            // FFT outputs - complex-valued input & complex-valued output.
            for (n = 0; n < N; n++) fprintf (output, "%10d %10d\n", XRdata[n],
                                             XIdata[n]);
         }
      }
   }

// *********************************************************************

// ## F I L E   C L O S U R E   &   M E M O R Y   D E L E T I O N.
   fclose (output);

// Delete dynamic memory.
   delete [] XRdata; delete [] XIdata; delete [] XRdata_2D;
   delete [] bank1; delete [] bank2; delete [] offset1; delete [] offset2;
   delete [] scale_total;
   delete [] Look_Up_Sin_A1; delete [] Look_Up_Sin_A2; delete [] Look_Up_Sin_A3;
   delete [] Look_Up_Sin_Coarse; delete [] Look_Up_Cos_Fine; delete [] Look_Up_Sin_Fine;


// End of program.
   printf ("\n\n Processing Completed\n\n");
}

#include "stdafx.h"
#include <stdio.h>
#include <math.h>

void SignalGeneration (int *XRdata, int *XIdata, int N, int data_type,
                       int dft_bin_excited, int data_input, int data_levels,
                       int input_file_format)
{
// Description:
// - - - - - - - -
// Routine to generate the signal data required for input to the Regularized FHT.

// Parameters:
// - - - - - - - -
// XRdata            = real component of 1-D data.
// XIdata            = imaginary component of 1-D data.
// N                 = transform length.
// data_type         = data type: 1 => real-valued data, 2 => complex-valued data.
// dft_bin_excited   = integer representing DFT bin excited.
// data_input        = data source: 0 => read data from file, 1 => generate data.
// data_levels       = no of quantized data levels.
// input_file_format = input file format: 1 => HEX, 2 => DEC.

// Note:
// - - -
// Complex data is stored in the data file in the form of alternating real and
// imaginary components.

// Declarations:
// - - - - - - - -
// Integers:
// - - - - -
   int n;

// Floats:
// - - - -
   double twopi, argument;

// *********************************************************************

// ## T E S T   D A T A   G E N E R A T I O N.
   if (data_input == 0)
   {
      // Read in FHT input data from file.
      FILE *input;
      if ((input = fopen("input_data_fht.txt", "r")) == NULL)
         printf ("\n Error opening input data file to read from");
      if (input_file_format == 1)
      {


         // "H E X" file format.
         if (data_type == 1)
         {
            for (n = 0; n < N; n++) fscanf (input, "%x", &XRdata[n]);
         }
         else
         {
            for (n = 0; n < N; n++) fscanf (input, "%x %x", &XRdata[n], &XIdata[n]);
         }
      }
      else
      {
         // "D E C" file format.
         if (data_type == 1)
         {
            for (n = 0; n < N; n++) fscanf (input, "%d", &XRdata[n]);
         }
         else
         {
            for (n = 0; n < N; n++) fscanf (input, "%d %d", &XRdata[n], &XIdata[n]);
         }
      }

      // Close file.
      fclose (input);
   }
   else
   {
      // Generate single-tone signal for FHT input data.
      twopi = 8 * atan(1.0);
      for (n = 0; n < N; n++)
      {
         argument = (twopi * n * dft_bin_excited) / N;
         XRdata[n] = (int) (cos(argument) * data_levels);
         if (data_type == 2)
         {
            XIdata[n] = (int) (sin(argument) * data_levels);
         }
      }
   }

// End of function.
}

B.2 Listings for Pre-processing Functions

#include "stdafx.h"
#include <math.h>
#include <stdio.h>

void LookUpTable_1Level (int N, int N4, int *Look_Up_Sin_A1,
                         int *Look_Up_Sin_A2, int *Look_Up_Sin_A3,
                         int coef_levels)


{
// Description:
// - - - - - - - -
// Routine to set up the one-level LUTs containing the trigonometric coefficients.

// Parameters:
// - - - - - - - -
// N              = transform length.
// N4             = N / 4.
// Look_Up_Sin_A1 = look-up table for single-angle argument.
// Look_Up_Sin_A2 = look-up table for double-angle argument.
// Look_Up_Sin_A3 = look-up table for triple-angle argument.
// coef_levels    = number of trigonometric coefficient quantisation levels.

// Declarations:
// - - - - - - - -
// Integers:
// - - - - -
   int i;

// Floats:
// - - - -
   double angle, twopi, rotation;

// *********************************************************************

// Set up output files for holding LUT contents.
   FILE *output1;
   if ((output1 = fopen("LUT_A1.txt", "w")) == NULL) printf ("\n Error opening 1st LUT file");
   FILE *output2;
   if ((output2 = fopen("LUT_A2.txt", "w")) == NULL) printf ("\n Error opening 2nd LUT file");
   FILE *output3;
   if ((output3 = fopen("LUT_A3.txt", "w")) == NULL) printf ("\n Error opening 3rd LUT file");

   twopi = (double) (atan(1.0) * 8.0);
   rotation = (double) (twopi / N);

// Set up size N/4 LUT for single-angle argument.
   angle = (double) 0.0;
   for (i = 0; i < N4; i++)
   {
      Look_Up_Sin_A1[i] = (int) (sin(angle) * coef_levels);
      angle += (double) rotation;
      fprintf (output1, "%x\n", Look_Up_Sin_A1[i]);
   }

// Set up size N/4 LUT for double-angle argument.
   angle = (double) 0.0;
   for (i = 0; i < N4; i++)
   {
      Look_Up_Sin_A2[i] = (int) (sin(angle) * coef_levels);
      angle += (double) rotation;
      fprintf (output2, "%x\n", Look_Up_Sin_A2[i]);
   }


// Set up size N/4 LUT for triple-angle argument.
   angle = (double) 0.0;
   for (i = 0; i < N4; i++)
   {
      Look_Up_Sin_A3[i] = (int) (sin(angle) * coef_levels);
      angle += (double) rotation;
      fprintf (output3, "%x\n", Look_Up_Sin_A3[i]);
   }

// Close files.
   fclose (output1); fclose (output2); fclose (output3);

// End of function.
}

#include "stdafx.h"
#include <stdio.h>
#include <math.h>

void LookUpTable_2Level (int N, int RootNd2, int *Look_Up_Sin_Coarse,
                         int *Look_Up_Cos_Fine, int *Look_Up_Sin_Fine,
                         int coef_levels)
{
// Description:
// - - - - - - - -
// Routine to set up the two-level LUTs containing the trigonometric coefficients.

// Parameters:
// - - - - - - - -
// N                  = transform length.
// RootNd2            = sqrt(N) / 2.
// Look_Up_Sin_Coarse = coarse resolution sin LUT for single-angle argument.
// Look_Up_Cos_Fine   = fine resolution cos LUT for single-angle argument.
// Look_Up_Sin_Fine   = fine resolution sin LUT for single-angle argument.
// coef_levels        = number of trigonometric coefficient quantisation levels.

// Declarations:
// - - - - - - - -
// Integers:
// - - - - -
   int i;

// Floats:
// - - - -
   double angle_coarse, angle_fine, twopi, rotation_coarse, rotation_fine;

// *********************************************************************

// Set up output files for holding LUT contents.
   FILE *output1;
   if ((output1 = fopen("LUT_Sin_Coarse.txt", "w")) == NULL) printf ("\n Error opening 1st LUT file");
   FILE *output2;
   if ((output2 = fopen("LUT_Cos_Fine.txt", "w")) == NULL) printf ("\n Error opening 2nd LUT file");
   FILE *output3;
   if ((output3 = fopen("LUT_Sin_Fine.txt", "w")) == NULL) printf ("\n Error opening 3rd LUT file");

   twopi = (double) (atan(1.0) * 8.0);


   rotation_coarse = (double) (twopi / (2 * sqrt((float) N)));
   rotation_fine   = (double) (twopi / N);

// Set up size sqrt(N)/2 LUTs for single-angle argument.
   angle_coarse = (double) 0.0; angle_fine = (double) 0.0;
   for (i = 0; i < RootNd2; i++)
   {
      Look_Up_Sin_Coarse[i] = (int) (sin(angle_coarse) * coef_levels);
      Look_Up_Cos_Fine[i]   = (int) (cos(angle_fine) * coef_levels);
      Look_Up_Sin_Fine[i]   = (int) (sin(angle_fine) * coef_levels);
      fprintf (output1, "%x\n", Look_Up_Sin_Coarse[i]);
      fprintf (output2, "%x\n", Look_Up_Cos_Fine[i]);
      fprintf (output3, "%x\n", Look_Up_Sin_Fine[i]);
      angle_coarse += (double) rotation_coarse; angle_fine += (double) rotation_fine;
   }
   Look_Up_Sin_Coarse[RootNd2] = coef_levels;

// Close files.
   fclose (output1); fclose (output2); fclose (output3);

// End of function.
}

#include "stdafx.h"

void ButterflyMappings (int *index1, int *index2, int *index3, int *index4)
{
// Description:
// - - - - - - - -
// Routine to set up the address permutations for the generic double butterfly.

// Parameters:
// - - - - - - - -
// index1 = 1st address permutation.
// index2 = 2nd address permutation.
// index3 = 3rd address permutation.
// index4 = 4th address permutation.

// *********************************************************************

// 1st address permutation for Type-I and Type-II generic double butterflies.
   index1[0] = 6; index1[1] = 3;

// 1st address permutation for Type-III generic double butterfly.
   index1[2] = 3; index1[3] = 6;

// 2nd address permutation for Type-I and Type-II generic double butterflies.
   index2[0] = 0; index2[1] = 4; index2[2] = 3; index2[3] = 2;
   index2[4] = 1; index2[5] = 5; index2[6] = 6; index2[7] = 7;

// 2nd address permutation for Type-III generic double butterfly.
   index2[8]  = 0; index2[9]  = 4; index2[10] = 2; index2[11] = 6;
   index2[12] = 1; index2[13] = 5; index2[14] = 3; index2[15] = 7;

// 3rd address permutation for Type-I and Type-II generic double butterflies.
   index3[0] = 0; index3[1] = 4; index3[2] = 1; index3[3] = 5;
   index3[4] = 2; index3[5] = 6; index3[6] = 3; index3[7] = 7;

// 3rd address permutation for Type-III generic double butterfly.
   index3[8]  = 0; index3[9]  = 4; index3[10] = 1; index3[11] = 3;
   index3[12] = 2; index3[13] = 6; index3[14] = 7; index3[15] = 5;


// 4th address permutation for Type-I, Type-II and Type-III generic double butterflies.
   index4[0] = 0; index4[1] = 4; index4[2] = 1; index4[3] = 5;
   index4[4] = 6; index4[5] = 2; index4[6] = 3; index4[7] = 7;

// End of function.
}

B.3 Listings for Processing Functions

#include "stdafx.h"
#include <stdio.h>
#include <math.h>

void RFHT4_Control (int **Xdata_2D, int *index1, int *index2, int *index3,
        int *index4, int *Look_Up_Sin_A1, int *Look_Up_Sin_A2, int *Look_Up_Sin_A3,
        int *Look_Up_Sin_Coarse, int *Look_Up_Cos_Fine, int *Look_Up_Sin_Fine,
        int *power_of_two, int alpha, int N, int N2, int N4, int RootNd2,
        int coef_levels, int no_of_bits_coeffs, int scaling, int *scale_factors,
        int *scale_total, int max_magnitude, int *beta1, int *beta2, int *beta3,
        int angle_levels, int halfpi, int minusquarterpi, int growth, int *arctans,
        int no_of_iterations, int no_of_bits_angle, int LSB_guard_bits, int lower,
        int upper, int BFLY_type, int MEM_type)
{
// Description:
// - - - - - - - -
// Routine to carry out the regularized FHT algorithm, with options to use either
// twelve-multiplier, nine-multiplier or Cordic versions of the generic double
// butterfly and N/4 word, sqrt(N)/2 word or zero word LUTs for the storage of
// the trigonometric coefficients.

// Externs:
// - - - - -
   void Butterfly_V12M (int, int, int, int*, int*, int*, int*, int*, int*, int*,
                        int, int, int, int*, int, int, int);
   void Butterfly_V09M (int, int, int, int*, int*, int*, int*, int*, int*, int*,
                        int, int, int, int*, int, int, int, int);
   void Butterfly_Cordic (int*, int*, int*, int*, int*, int*, int*, int, int,
                          int, int*, int, int, int, int, int*, int, int, int, int);
   void Coefficients_V12M_1Level (int, int, int, int, int, int*, int*, int*, int*, int);
   void Coefficients_V09M_1Level (int, int, int, int, int, int*, int*, int*, int*, int);
   void Coefficients_V12M_2Level (int, int, int, int, int, int, int, int*, int*,
                                  int*, int*, int, int);
   void Coefficients_V09M_2Level (int, int, int, int, int, int, int, int*, int*,
                                  int*, int*, int, int);
   void DataIndices (int, int, int, int, int*, int[2][4], int[2][4], int, int);

// Parameters:
// - - - - - - - -
// Xdata_2D = 2-D data.
// index1   = 1st address permutation.
// index2   = 2nd address permutation.
// index3   = 3rd address permutation.
// index4   = 4th address permutation.


// Look_Up_Sin_A1     = LUT for single-angle argument.
// Look_Up_Sin_A2     = LUT for double-angle argument.
// Look_Up_Sin_A3     = LUT for triple-angle argument.
// Look_Up_Sin_Coarse = coarse resolution sin LUT for single-angle argument.
// Look_Up_Cos_Fine   = fine resolution cos LUT for single-angle argument.
// Look_Up_Sin_Fine   = fine resolution sin LUT for single-angle argument.
// power_of_two       = array containing powers of 2.
// alpha              = no of temporal stages for transform.
// N                  = transform length.
// N2                 = N / 2.
// N4                 = N / 4.
// RootNd2            = sqrt(N) / 2.
// coef_levels        = number of trigonometric coefficient quantisation levels.
// no_of_bits_coeffs  = number of bits representing trigonometric coefficients.
// scaling            = scaling flag: 1 => FIXED, 2 => BFP.
// scale_factors      = bits to shift for double butterfly stages.
// scale_total        = total number of BFP scaling bits.
// max_magnitude      = maximum magnitude of data into double butterfly.
// beta1              = initial single-angle Cordic rotation angles.
// beta2              = initial double-angle Cordic rotation angles.
// beta3              = initial triple-angle Cordic rotation angles.
// angle_levels       = number of Cordic rotation angle quantisation levels.
// halfpi             = integer value of +(pi/2).
// minusquarterpi     = integer value of -(pi/4).
// growth             = integer value of Cordic magnification factor.
// arctans            = Cordic micro-rotation angles.
// no_of_iterations   = no of Cordic iterations.
// no_of_bits_angle   = no of bits representing Cordic rotation angle.
// LSB_guard_bits     = no of bits for guarding LSB.
// lower              = lower bound for register overflow with unconditional scaling.
// upper              = upper bound for register overflow with unconditional scaling.
// BFLY_type          = BFLY type: 1 => 12 multipliers, 2 => 9 multipliers,
//                      3 => 3 Cordic units.
// MEM_type           = MEM type: 1 => LUT = one-level, 2 => LUT = two-level.

// Declarations:
// - - - - - - - -
// Integers:
// - - - - -
   int i, j, k, n, n2, offset, M, beta, bfly_count, Type, negate_flag, shift;

// Integer Arrays:
// - - - - - - - -
   int X[9], kk[4], kbeta[3], Data_Max[1], coeffs[9], threshold[3];
   int index_even_2D[2][4], index_odd_2D[2][4];

// *********************************************************************

// Set up offset for address permutations.
   kk[3] = 0;

// Set up block floating-point thresholds.


   threshold[0] = max_magnitude;
   threshold[1] = max_magnitude << 1;
   threshold[2] = max_magnitude << 2;

// Loop through log4 temporal stages.
   offset = 1; Data_Max[0] = 0; shift = 0;
   for (i = 0; i < alpha; i++)
   {
      // Set up look-up table index and address offsets.
      M = (int) (offset << 3); beta = (int) (N / M); bfly_count = 0;
      if ((scaling == 2) && (i > 0))
      {
         // Calculate shift to be applied to data so that MSB occupies optimum position.
         shift = 0;
         for (n = 0; n < 3; n++)
         {
            if ((Data_Max[0] < -threshold[n]) || (Data_Max[0] >= +threshold[n]))
               shift ++;
         }

         // Increase total number of BFP scaling bits.
         scale_total[0] += shift;
         printf ("\n\n Maximum data magnitude from stage %d", i-1);
         printf (" = %d [threshold=", Data_Max[0]); printf ("%d]", max_magnitude);
         printf ("\n Shift to be applied to data for stage %d", i);
         printf (" = %d", shift);
         if (i == (alpha-1))
         {
            printf ("\n\n Total shift applied to data = %d", scale_total[0]);
         }

         // Initialise maximum data magnitude for this stage.
         Data_Max[0] = 0;
      }

      // Loop through spatial iterations.
      for (j = 0; j < N; j += M)
      {
         // Initialise address offsets and double butterfly type.
         kbeta[0] = 0; kbeta[1] = 0; kbeta[2] = 0; Type = 2;
         for (k = 0; k < offset; k++)
         {
            if (i == 0)
            {
               // ## S T A G E = 0.
               negate_flag = 0;

               // Set up data indices for double butterfly.
               DataIndices (i, j, k, offset, kk, index_even_2D, index_odd_2D,
                            bfly_count, alpha);
               bfly_count ++;

               // Set up trigonometric coefficients for double butterfly.
               if (BFLY_type == 1)


               {
                  // Butterfly is twelve-multiplier version.
                  if (MEM_type == 1)
                  {
                     // Standard arithmetic & standard memory solution.
                     Coefficients_V12M_1Level (i, k, N2, N4, kbeta[0],
                        Look_Up_Sin_A1, Look_Up_Sin_A2, Look_Up_Sin_A3,
                        coeffs, coef_levels);
                  }
                  else
                  {
                     // Standard arithmetic & reduced memory solution.
                     Coefficients_V12M_2Level (i, k, N2, N4, RootNd2, alpha,
                        kbeta[0], Look_Up_Sin_Coarse, Look_Up_Cos_Fine,
                        Look_Up_Sin_Fine, coeffs, coef_levels, no_of_bits_coeffs);
                  }

                  // Increment address offset.
                  kbeta[0] += beta;
               }
               else
               {
                  // Butterfly is nine-multiplier version.
                  if (BFLY_type == 2)
                  {
                     if (MEM_type == 1)
                     {
                        // Reduced arithmetic & standard memory solution.
                        Coefficients_V09M_1Level (i, k, N2, N4, kbeta[0],
                           Look_Up_Sin_A1, Look_Up_Sin_A2, Look_Up_Sin_A3,
                           coeffs, coef_levels);
                     }
                     else
                     {
                        // Reduced arithmetic & reduced memory solution.
                        Coefficients_V09M_2Level (i, k, N2, N4, RootNd2, alpha,
                           kbeta[0], Look_Up_Sin_Coarse, Look_Up_Cos_Fine,
                           Look_Up_Sin_Fine, coeffs, coef_levels, no_of_bits_coeffs);
                     }
                  }

                  // Increment address offset.
                  kbeta[0] += beta;
               }

               // R E A D S - Set up input data vector for double butterfly.
               for (n = 0; n < 4; n++)
               {
                  n2 = (n << 1);
                  X[n2]   = Xdata_2D[index_even_2D[0][n]][index_even_2D[1][n]];
                  X[n2+1] = Xdata_2D[index_odd_2D[0][n]][index_odd_2D[1][n]];
               }


               // Carry out set of double butterfly equations.
               if (BFLY_type == 1)
               {
                  // Standard arithmetic solution - twelve-multiplier butterfly.
                  Butterfly_V12M (i, j, k, X, coeffs, kk, index1, index2, index3,
                     index4, coef_levels, no_of_bits_coeffs, scaling, Data_Max,
                     shift, lower, upper);
               }
               else
               {
                  if (BFLY_type == 2)
                  {
                     // Reduced arithmetic solution - nine-multiplier butterfly.
                     Butterfly_V09M (i, j, k, X, coeffs, kk, index1, index2,
                        index3, index4, coef_levels, no_of_bits_coeffs, scaling,
                        Data_Max, shift, 1, lower, upper);
                  }
                  else
                  {
                     // Cordic arithmetic solution.
                     Butterfly_Cordic (X, kbeta, kk, index1, index2, index3,
                        index4, halfpi, minusquarterpi, growth, arctans,
                        no_of_iterations, no_of_bits_angle, negate_flag, scaling,
                        Data_Max, shift, LSB_guard_bits, lower, upper);

                     // Increment address offsets.
                     kbeta[0] += beta1[i];
                     kbeta[1] += beta2[i];
                     kbeta[2] += beta3[i];
                  }
               }
               if (scaling == 1)
               {
                  // F I X E D   S C A L I N G - scale output data according to stage number.
                  for (n = 0; n < 8; n++) X[n] = (X[n] >> scale_factors[i]);
               }

               // W R I T E S - Set up output data vector for double butterfly.
               for (n = 0; n < 4; n++)
               {
                  n2 = (n << 1);
                  Xdata_2D[index_even_2D[0][n]][index_even_2D[1][n]] = X[n2];
                  Xdata_2D[index_odd_2D[0][n]][index_odd_2D[1][n]]  = X[n2+1];
               }
            }
            else
            {
               // ## S T A G E > 0.

               // Set up data indices for double butterfly.
               DataIndices (i, j, k, offset, kk, index_even_2D, index_odd_2D,
                            bfly_count, alpha); bfly_count ++;

               // Set up trigonometric coefficients for double butterfly.
               if (BFLY_type == 1)


               {
                  // Butterfly is twelve-multiplier version.
                  if (MEM_type == 1)
                  {
                     // Standard arithmetic & standard memory solution.
                     Coefficients_V12M_1Level (i, k, N2, N4, kbeta[0],
                        Look_Up_Sin_A1, Look_Up_Sin_A2, Look_Up_Sin_A3,
                        coeffs, coef_levels);
                  }
                  else
                  {
                     // Standard arithmetic & reduced memory solution.
                     Coefficients_V12M_2Level (i, k, N2, N4, RootNd2, alpha,
                        kbeta[0], Look_Up_Sin_Coarse, Look_Up_Cos_Fine,
                        Look_Up_Sin_Fine, coeffs, coef_levels, no_of_bits_coeffs);
                  }

                  // Increment address offset.
                  kbeta[0] += beta;
               }
               else
               {
                  // Butterfly is nine-multiplier version.
                  if (BFLY_type == 2)
                  {
                     if (MEM_type == 1)
                     {
                        // Reduced arithmetic & standard memory solution.
                        Coefficients_V09M_1Level (i, k, N2, N4, kbeta[0],
                           Look_Up_Sin_A1, Look_Up_Sin_A2, Look_Up_Sin_A3,
                           coeffs, coef_levels);
                     }
                     else
                     {
                        // Reduced arithmetic & reduced memory solution.
                        Coefficients_V09M_2Level (i, k, N2, N4, RootNd2, alpha,
                           kbeta[0], Look_Up_Sin_Coarse, Look_Up_Cos_Fine,
                           Look_Up_Sin_Fine, coeffs, coef_levels, no_of_bits_coeffs);
                     }
                  }

                  // Increment address offset.
                  kbeta[0] += beta;
               }

               // R E A D S - Set up input data vector for double butterfly.
               for (n = 0; n < 4; n++)
               {
                  n2 = (n << 1);
                  X[n2]   = Xdata_2D[index_even_2D[0][n]][index_even_2D[1][n]];
                  X[n2+1] = Xdata_2D[index_odd_2D[0][n]][index_odd_2D[1][n]];
               }

               // Carry out set of double butterfly equations.
               if (BFLY_type == 1)


               {
                  // Standard arithmetic solution - twelve-multiplier butterfly.
                  Butterfly_V12M (i, j, k, X, coeffs, kk, index1, index2, index3,
                     index4, coef_levels, no_of_bits_coeffs, scaling, Data_Max,
                     shift, lower, upper);
               }
               else
               {
                  if (BFLY_type == 2)
                  {
                     // Reduced arithmetic solution - nine-multiplier butterfly.
                     Butterfly_V09M (i, j, k, X, coeffs, kk, index1, index2,
                        index3, index4, coef_levels, no_of_bits_coeffs, scaling,
                        Data_Max, shift, Type, lower, upper);
                  }
                  else
                  {
                     // Cordic arithmetic solution.
                     negate_flag = k+1;
                     Butterfly_Cordic (X, kbeta, kk, index1, index2, index3,
                        index4, halfpi, minusquarterpi, growth, arctans,
                        no_of_iterations, no_of_bits_angle, negate_flag, scaling,
                        Data_Max, shift, LSB_guard_bits, lower, upper);

                     // Increment address offsets.
                     kbeta[0] += beta1[i];
                     kbeta[1] += beta2[i];
                     kbeta[2] += beta3[i];
                  }
               }
               if (scaling == 1)
               {
                  // F I X E D   S C A L I N G - scale output data according to stage number.
                  for (n = 0; n < 8; n++) X[n] = (X[n] >> scale_factors[i]);
               }

               // W R I T E S - Set up output data vector for double butterfly.
               for (n = 0; n < 4; n++)
               {
                  n2 = (n << 1);
                  Xdata_2D[index_even_2D[0][n]][index_even_2D[1][n]] = X[n2];
                  Xdata_2D[index_odd_2D[0][n]][index_odd_2D[1][n]]  = X[n2+1];
               }
            }
            Type = 3;
         }
      }
      offset = power_of_two[i];
   }

// End of function.
}


#include "stdafx.h"
#include <stdlib.h>

void DibitReversal (int N8, int no_of_bits, int *power_of_two, int alpha,
                    int *Xdata, int **Xdata_2D)
{
// Description:
// - - - - - - - -
// Routine to carry out in sequential fashion the in-place di-bit reversal mapping
// of the input data and to store the data in 2-D form.

// Parameters:
// - - - - - - - -
// N8           = N / 8.
// no_of_bits   = number of bits corresponding to N.
// power_of_two = array containing powers of 2.
// alpha        = no of temporal stages for transform.
// Xdata        = 1-D data.
// Xdata_2D     = 2-D data.

// Externs:
// - - - - -
   void MemoryBankAddress (int, int, int, int, int*, int*);

// Declarations:
// - - - - - - - -
// Integers:
// - - - - -
   int i1, i2, i3, j1, j2, k, n, store;

// Pointer Variables:
// - - - - - - - - - -
   int *bank, *offset;

// *********************************************************************

// Set up dynamic memory.
   bank = new int [1]; bank[0] = 0;
   offset = new int [1]; offset[0] = 0;

// Re-order data.
   i3 = 0;
   for (i1 = 0; i1 < N8; i1++)
   {
      for (i2 = 0; i2 < 8; i2++)
      {
         j1 = 0; j2 = (i3 % 2);
         for (k = 0; k < no_of_bits; k += 2)
         {
            n = no_of_bits - k;
            if (i3 & power_of_two[k])   j1 += power_of_two[n-2];
            if (i3 & power_of_two[k+1]) j1 += power_of_two[n-1];
         }
         if (j1 > i3)
         {
            store = Xdata[i3]; Xdata[i3] = Xdata[j1]; Xdata[j1] = store;
         }


         // Convert to 2-D form.
         MemoryBankAddress (i3, j2, 0, alpha, bank, offset);
         Xdata_2D[*bank][i1] = Xdata[i3];
         i3 ++;
      }
   }

// Delete dynamic memory.
   delete [] bank; delete [] offset;

// End of function.
}

#include "stdafx.h"

void Conversion (int channel_type, int N, int N2, int *XRdata, int *XIdata)
{
// Description:
// - - - - - - - -
// Routine to convert DHT coefficients to DFT coefficients. If the FHT is to be
// used for the computation of the real-data FFT, as opposed to being used for
// the computation of the complex-data FFT, the complex-valued DFT coefficients
// are optimally stored in the following way:
//
// XRdata[0]     = zero'th frequency component
// XRdata[1]     = real component of 1st frequency component
// XRdata[N-1]   = imag component of 1st frequency component
// XRdata[2]     = real component of 2nd frequency component
// XRdata[N-2]   = imag component of 2nd frequency component
//   - - -
// XRdata[N/2-1] = real component of (N/2-1)th frequency component
// XRdata[N/2+1] = imag component of (N/2-1)th frequency component
// XRdata[N/2]   = (N/2)th frequency component
//
// For the case of the complex-valued FFT, however, the array "XRdata" stores
// the real component of both the input and output data, whilst the array
// "XIdata" stores the imaginary component of both the input and output data.

// Parameters:
// - - - - - - - -
// channel_type = 1 => real input channel, 2 => imaginary input channel.
// N            = transform length.
// N2           = N / 2.
// XRdata       = on input: FHT output for real input channel;
//                on output: as in "description" above.
// XIdata       = on input: FHT output for imaginary input channel;
//                on output: as in "description" above.

// Declarations:
// - - - - - - - -
// Integers:
// - - - - -
   int j, k, store, store1, store2, store3;


// *********************************************************************
   if (channel_type == 1)
   {
      // R E A L   D A T A   C H A N N E L.
      k = N - 1;

      // Produce DFT output for this channel.
      for (j = 1; j < N2; j++)
      {
         store = XRdata[k] + XRdata[j];
         XRdata[k] = XRdata[k] - XRdata[j]; XRdata[j] = store;
         XRdata[j] /= 2; XRdata[k] /= 2;
         k --;
      }
   }
   else
   {
      // I M A G I N A R Y   D A T A   C H A N N E L.
      k = N - 1;

      // Produce DFT output for this channel.
      for (j = 1; j < N2; j++)
      {
         store = XIdata[k] + XIdata[j];
         XIdata[k] = XIdata[k] - XIdata[j]; XIdata[j] = store;
         XIdata[j] /= 2; XIdata[k] /= 2;

         // Produce DFT output for complex data.
         store1 = XRdata[j] + XIdata[k]; store2 = XRdata[j] - XIdata[k];
         store3 = XIdata[j] + XRdata[k]; XIdata[k] = XIdata[j] - XRdata[k];
         XRdata[j] = store2; XRdata[k] = store1; XIdata[j] = store3;
         k --;
      }
   }

// End of function.
}

#include "stdafx.h"
#include <math.h>

void Coefficients_V09M_1Level (int i, int k, int N2, int N4, int kbeta,
        int *Look_Up_Sin_A1, int *Look_Up_Sin_A2, int *Look_Up_Sin_A3,
        int *coeffs, int coef_levels)
{
// Description:
// - - - - - - - -
// Routine to set up the trigonometric coefficients for use by the nine-multiplier
// version of the generic double butterfly where one-level LUTs are exploited.

// Parameters:
// - - - - - - - -
// i  = temporal addressing index.
// k  = spatial addressing index.
// N2 = N / 2.
// N4 = N / 4.


// kbeta          = temporal/spatial index.
// Look_Up_Sin_A1 = look-up table for single-angle argument.
// Look_Up_Sin_A2 = look-up table for double-angle argument.
// Look_Up_Sin_A3 = look-up table for triple-angle argument.
// coeffs         = current set of trigonometric coefficients.
// coef_levels    = number of trigonometric coefficient quantisation levels.

// Declarations:
// - - - - - - - -
// Integers:
// - - - - -
   int m, n, n3, store_00, store_01;
   static int startup, coeff_00, coeff_01, coeff_02, coeff_03, coeff_04;

// *********************************************************************
   if (startup == 0)
   {
      // Set up trivial trigonometric coefficients - valid for each type of double butterfly.
      coeff_00 = +coef_levels;
      coeff_01 = 0;
      coeff_02 = -coef_levels;

      // Set up additional constant trigonometric coefficient for Type-II double butterfly.
      coeff_03 = (int) ((sqrt(2.0) / 2) * coef_levels);
      coeff_04 = coeff_03 + coeff_03;
      startup = 1;
   }
   if (i == 0)
   {
      // Set up trigonometric coefficients for Type-I double butterfly.
      n3 = 0;
      for (n = 0; n < 3; n++)
      {
         coeffs[n3++] = coeff_00;
         coeffs[n3++] = coeff_01;
         coeffs[n3++] = coeff_00;
      }
   }
   else
   {
      if (k == 0)
      {
         // Set up trigonometric coefficients for Type-II double butterfly.
         n3 = 0;
         for (n = 0; n < 2; n++)
         {
            coeffs[n3++] = coeff_00;
            coeffs[n3++] = coeff_01;
            coeffs[n3++] = coeff_00;
         }


coeffs[6] = coeff_04;
coeffs[7] = coeff_03;
coeffs[8] = 0;
}
else
{
// Set up trigonometric coefficients for Type-III double butterfly.
m = kbeta;

// Set up single-angle sinusoidal & cosinusoidal terms.
store_00 = Look_Up_Sin_A1[N4-m];
store_01 = Look_Up_Sin_A1[m];
coeffs[0] = store_00 + store_01;
coeffs[1] = store_00;
coeffs[2] = store_00 - store_01;

// Set up double-angle sinusoidal & cosinusoidal terms.
m <<= 1;
store_00 = Look_Up_Sin_A2[N4-m];
store_01 = Look_Up_Sin_A2[m];
coeffs[3] = store_00 + store_01;
coeffs[4] = store_00;
coeffs[5] = store_00 - store_01;

// Set up triple-angle sinusoidal & cosinusoidal terms.
m += kbeta;
if (m < N4)
{
store_00 = Look_Up_Sin_A3[N4-m];
store_01 = Look_Up_Sin_A3[m];
}
else
{
store_00 = -Look_Up_Sin_A3[m-N4];
store_01 = Look_Up_Sin_A3[N2-m];
}
coeffs[6] = store_00 + store_01;
coeffs[7] = store_00;
coeffs[8] = store_00 - store_01;
}
}

// End of function.
}

#include "stdafx.h"
#include <math.h>

void Coefficients_V09M_2Level (int i, int k, int N2, int N4, int RootNd2, int alpha,
int kbeta, int *Look_Up_Sin_Coarse, int *Look_Up_Cos_Fine,
int *Look_Up_Sin_Fine, int *coeffs, int coef_levels, int no_of_bits_coef)
{
// Description:
// - - - - - - - - -
// Routine to set up the trigonometric coefficients for use by the nine-multiplier version of
// the generic double butterfly where two-level LUTs are exploited.
// Parameters:
// - - - - - - - - -
// i = temporal index.
// k = spatial index.
// N2 = N / 2.
// N4 = N / 4.
// alpha = number of FHT temporal stages.
// kbeta = temporal/spatial index.
// Look_Up_Sin_Coarse = coarse resolution sin LUT for single-angle argument.
// Look_Up_Cos_Fine = fine resolution cos LUT for single-angle argument.
// Look_Up_Sin_Fine = fine resolution sin LUT for single-angle argument.
// coeffs = current set of trigonometric coefficients.
// coef_levels = number of trigonometric coefficient quantisation levels.
// no_of_bits_coef = number of bits representing trigonometric coefficients.

// Declarations:
// - - - - - - - - - -
// Integers:
// - - - - - - -

int m, n, n3, sa1, sca2, ca1, sv1, sv2, cv1, cv2, sum1, sum2, sum3;
int store_00, store_01, store_02, store_03;
__int64 store1, store2, store3, store_04, store_05;

static int startup, alpham1, bits_to_shift, bits_to_shift_m1;
static int coeff_00, coeff_01, coeff_02, coeff_03, coeff_04;

// ***********************************************************************
if (startup == 0)
{
// Set up trivial trigonometric coefficients - valid for each type of double butterfly.
coeff_00 = +coef_levels;
coeff_01 = 0;
coeff_02 = -coef_levels;

// Set up additional constant trigonometric coefficient for Type-II double butterfly.
coeff_03 = (int) ((sqrt(2.0) / 2) * coef_levels);
coeff_04 = coeff_03 << 1;

// Set up scaling factor for multiplication stage.
bits_to_shift = no_of_bits_coef - 1;
bits_to_shift_m1 = bits_to_shift - 1;

// Set up scaling factor for address calculation.
alpham1 = alpha - 1;
startup = 1;
}


if (i == 0)
{
// Set up trigonometric coefficients for Type-I double butterfly.
n3 = 0;
for (n = 0; n < 3; n++)
{
coeffs[n3++] = coeff_00;
coeffs[n3++] = coeff_01;
coeffs[n3++] = coeff_00;
}
}
else
{
if (k == 0)
{
// Set up trigonometric coefficients for Type-II double butterfly.
n3 = 0;
for (n = 0; n < 2; n++)
{
coeffs[n3++] = coeff_00;
coeffs[n3++] = coeff_01;
coeffs[n3++] = coeff_00;
}
coeffs[6] = coeff_04;
coeffs[7] = coeff_03;
coeffs[8] = 0;
}
else
{
// Set up trigonometric coefficients for Type-III double butterfly.
m = kbeta;

// Set up single-angle sinusoidal & cosinusoidal terms.
sa1 = m >> alpham1; ca1 = RootNd2 - sa1; sca2 = m % RootNd2;
cv1 = Look_Up_Sin_Coarse[ca1]; sv1 = Look_Up_Sin_Coarse[sa1];
cv2 = Look_Up_Cos_Fine[sca2]; sv2 = Look_Up_Sin_Fine[sca2];
sum1 = cv1 + sv1; sum2 = cv2 + sv2; sum3 = cv2 - sv2;
store1 = ((__int64)sum1*cv2) >> bits_to_shift;
store2 = ((__int64)sum2*sv1) >> bits_to_shift;
store3 = ((__int64)sum3*cv1) >> bits_to_shift;
store_00 = (int) (store1 - store2);
store_01 = (int) (store1 - store3);
coeffs[0] = store_00 + store_01;
coeffs[1] = store_00;
coeffs[2] = store_00 - store_01;

// Set up double-angle sinusoidal & cosinusoidal terms.
store1 = ((__int64)store_00*store_00) >> bits_to_shift_m1;
store2 = ((__int64)store_00*store_01) >> bits_to_shift_m1;


store_02 = (int) (store1 - coef_levels);
store_03 = (int) store2;
coeffs[3] = store_02 + store_03;
coeffs[4] = store_02;
coeffs[5] = store_02 - store_03;

// Set up triple-angle sinusoidal & cosinusoidal terms.
store1 = ((__int64)store_02*store_00) >> bits_to_shift_m1;
store2 = ((__int64)store_02*store_01) >> bits_to_shift_m1;
store_04 = (int) (store1 - store_00);
store_05 = (int) (store2 + store_01);
coeffs[6] = (int) (store_04 + store_05);
coeffs[7] = (int) (store_04);
coeffs[8] = (int) (store_04 - store_05);
}
}

// End of function.
}

#include "stdafx.h"
#include <math.h>

void Coefficients_V12M_1Level (int i, int k, int N2, int N4, int kbeta,
int *Look_Up_Sin_A1, int *Look_Up_Sin_A2, int *Look_Up_Sin_A3, int *coeffs,
int coef_levels)
{
// Description:
// - - - - - - - - -
// Routine to set up the trigonometric coefficients for use by the twelve-multiplier
// version of the generic double butterfly where one-level LUTs are exploited.
// Parameters:
// - - - - - - - - -
// i = temporal addressing index.
// k = spatial addressing index.
// N2 = N / 2.
// N4 = N / 4.
// kbeta = temporal/spatial index.
// Look_Up_Sin_A1 = look-up table for single-angle argument.
// Look_Up_Sin_A2 = look-up table for double-angle argument.
// Look_Up_Sin_A3 = look-up table for triple-angle argument.
// coeffs = current set of trigonometric coefficients.
// coef_levels = number of trigonometric coefficient quantisation levels.

// Declarations:
// - - - - - - - - - -
// Integers:
// - - - - - - -

int m, n, n3;
static int startup, coeff_00, coeff_01, coeff_02, coeff_03;


// ***********************************************************************
if (startup == 0)
{
// Set up trivial trigonometric coefficients - valid for each type of double butterfly.
coeff_00 = +coef_levels;
coeff_01 = 0;
coeff_02 = -coef_levels;

// Set up additional constant trigonometric coefficient for Type-II double butterfly.
coeff_03 = (int) ((sqrt(2.0) / 2) * coef_levels);
startup = 1;
}
if (i == 0)
{
// Set up trigonometric coefficients for Type-I double butterfly.
n3 = 0;
for (n = 0; n < 3; n++)
{
coeffs[n3++] = coeff_00;
coeffs[n3++] = coeff_01;
coeffs[n3++] = coeff_02;
}
}
else
{
if (k == 0)
{
// Set up trigonometric coefficients for Type-II double butterfly.
n3 = 0;
for (n = 0; n < 2; n++)
{
coeffs[n3++] = coeff_00;
coeffs[n3++] = coeff_01;
coeffs[n3++] = coeff_02;
}
for (n = 6; n < 9; n++)
{
coeffs[n] = coeff_03;
}
}
else
{
// Set up trigonometric coefficients for Type-III double butterfly.
m = kbeta;

// Set up single-angle sinusoidal & cosinusoidal terms.
coeffs[0] = Look_Up_Sin_A1[N4-m];
coeffs[1] = Look_Up_Sin_A1[m];


// Set up double-angle sinusoidal & cosinusoidal terms.
m <<= 1;
coeffs[3] = Look_Up_Sin_A2[N4-m];
coeffs[4] = Look_Up_Sin_A2[m];

// Set up triple-angle sinusoidal & cosinusoidal terms.
m += kbeta;
if (m < N4)
{
coeffs[6] = Look_Up_Sin_A3[N4-m];
coeffs[7] = Look_Up_Sin_A3[m];
}
else
{
coeffs[6] = -Look_Up_Sin_A3[m-N4];
coeffs[7] = Look_Up_Sin_A3[N2-m];
}

// Set up remaining trigonometric coefficients through symmetry.
coeffs[2] = coeffs[0];
coeffs[5] = coeffs[3];
coeffs[8] = coeffs[6];
}
}

// End of function.
}

#include "stdafx.h"
#include <math.h>

void Coefficients_V12M_2Level (int i, int k, int N2, int N4, int RootNd2, int alpha,
int kbeta, int *Look_Up_Sin_Coarse, int *Look_Up_Cos_Fine,
int *Look_Up_Sin_Fine, int *coeffs, int coef_levels, int no_of_bits_coef)
{
// Description:
// - - - - - - - - -
// Routine to set up the trigonometric coefficients for use by the twelve-multiplier
// version of the generic double butterfly where two-level LUTs are exploited.
// Parameters:
// - - - - - - - - -
// i = temporal index.
// k = spatial index.
// N2 = N / 2.
// N4 = N / 4.
// alpha = number of FHT temporal stages.
// kbeta = temporal/spatial index.
// Look_Up_Sin_Coarse = coarse resolution sin LUT for single-angle argument.
// Look_Up_Cos_Fine = fine resolution cos LUT for single-angle argument.
// Look_Up_Sin_Fine = fine resolution sin LUT for single-angle argument.
// coeffs = current set of trigonometric coefficients.
// coef_levels = number of trigonometric coefficient quantisation levels.
// no_of_bits_coef = number of bits representing trigonometric coefficients.


// Declarations:
// - - - - - - - - - -
// Integers:
// - - - - - - -

int m, n, n3, sa1, sca2, ca1, sv1, sv2, cv1, cv2, sum1, sum2, sum3;
__int64 store1, store2, store3;

static int startup, alpham1, bits_to_shift, bits_to_shift_m1;
static int coeff_00, coeff_01, coeff_02, coeff_03;

// ***********************************************************************
if (startup == 0)
{
// Set up trivial trigonometric coefficients - valid for each type of double butterfly.
coeff_00 = +coef_levels;
coeff_01 = 0;
coeff_02 = -coef_levels;

// Set up additional constant trigonometric coefficient for Type-II double butterfly.
coeff_03 = (int) ((sqrt(2.0) / 2) * coef_levels);

// Set up scaling factor for multiplication stage.
bits_to_shift = no_of_bits_coef - 1;
bits_to_shift_m1 = bits_to_shift - 1;

// Set up scaling factor for address calculation.
alpham1 = alpha - 1;
startup = 1;
}
if (i == 0)
{
// Set up trigonometric coefficients for Type-I double butterfly.
n3 = 0;
for (n = 0; n < 3; n++)
{
coeffs[n3++] = coeff_00;
coeffs[n3++] = coeff_01;
coeffs[n3++] = coeff_02;
}
}
else
{
if (k == 0)
{
// Set up trigonometric coefficients for Type-II double butterfly.
n3 = 0;
for (n = 0; n < 2; n++)
{
coeffs[n3++] = coeff_00;
coeffs[n3++] = coeff_01;
coeffs[n3++] = coeff_02;
}


for (n = 6; n < 9; n++)
{
coeffs[n] = coeff_03;
}
}
else
{
// Set up trigonometric coefficients for Type-III double butterfly.
m = kbeta;

// Set up single-angle sinusoidal & cosinusoidal terms.
sa1 = m >> alpham1; ca1 = RootNd2 - sa1; sca2 = m % RootNd2;
cv1 = Look_Up_Sin_Coarse[ca1]; sv1 = Look_Up_Sin_Coarse[sa1];
cv2 = Look_Up_Cos_Fine[sca2]; sv2 = Look_Up_Sin_Fine[sca2];
sum1 = cv1 + sv1; sum2 = cv2 + sv2; sum3 = cv2 - sv2;
store1 = ((__int64)sum1*cv2) >> bits_to_shift;
store2 = ((__int64)sum2*sv1) >> bits_to_shift;
store3 = ((__int64)sum3*cv1) >> bits_to_shift;
coeffs[0] = (int) (store1 - store2);
coeffs[1] = (int) (store1 - store3);

// Set up double-angle sinusoidal & cosinusoidal terms.
cv1 = coeffs[0]; sv1 = coeffs[1];
store1 = ((__int64)cv1*cv1) >> bits_to_shift_m1;
store2 = ((__int64)cv1*sv1) >> bits_to_shift_m1;
coeffs[3] = (int) (store1 - coef_levels);
coeffs[4] = (int) store2;

// Set up triple-angle sinusoidal & cosinusoidal terms.
cv2 = coeffs[3];
store1 = ((__int64)cv1*cv2) >> bits_to_shift_m1;
store2 = ((__int64)sv1*cv2) >> bits_to_shift_m1;
coeffs[6] = (int) (store1 - cv1);
coeffs[7] = (int) (store2 + sv1);

// Set up remaining trigonometric coefficients through symmetry.
coeffs[2] = coeffs[0];
coeffs[5] = coeffs[3];
coeffs[8] = coeffs[6];
}
}

// End of function.
}

#include "stdafx.h"
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

void Butterfly_V12M (int i, int j, int k, int *X, int *coeffs, int *kk, int *index1,
int *index2, int *index3, int *index4, int coef_levels, int no_of_bits_coeffs,
int scaling, int *Data_Max, int shift, int lower, int upper)


{
// Description:
// - - - - - - - - -
// Routine to carry out the generic double butterfly computation using twelve
// fixed-point fast multipliers.
// Parameters:
// - - - - - - - - -
// i = index for temporal loop.
// j = index for outer spatial loop.
// k = index for inner spatial loop.
// X = 1-D data array.
// coeffs = current set of trigonometric coefficients.
// kk = offsets for address permutations.
// index1 = 1st address permutation.
// index2 = 2nd address permutation.
// index3 = 3rd address permutation.
// index4 = 4th address permutation.
// coef_levels = number of trigonometric coefficient quantisation levels.
// no_of_bits_coeffs = number of bits representing trigonometric coefficients.
// scaling = scaling flag: 1 => FIXED, 2 => BFP.
// Data_Max = maximum magnitude of output data set.
// shift = no of bits for input data to be shifted.
// lower = lower bound for register overflow with unconditional scaling.
// upper = upper bound for register overflow with unconditional scaling.

// Declarations:
// - - - - - - - - - -
// Integers:
// - - - - - - -

int m, n, n2, n2p1, n3, n3p1, store, bits_to_shift1, bits_to_shift2;
// Long Integers:
// - - - - - - - - - -

__int64 m1, m2, m3, m4;
// Integer Arrays:
// - - - - - - - - -

int Y[8];
// ***********************************************************************
// Apply 1st address permutation - comprising one data exchange.

m = kk[0];
store = X[index1[m++]]; X[6] = X[index1[m]]; X[3] = store;

// Set up scaling factor for multiplication stage.
bits_to_shift2 = no_of_bits_coeffs - 1;
if (scaling == 1)
{
Y[0] = X[0]; Y[1] = X[1];


// ### Check for register overflow & flag when overflow arises.
for (n = 0; n < 8; n++)
{
if ((X[n] < lower) || (X[n] > upper))
{
printf ("\n\n Overflow occurred on input register");
}
}
// ### Check for register overflow completed.
}
else
{
// Set up scaling factor for first two samples of input data set.
bits_to_shift1 = 3 - shift;

// Shift data so that MSB occupies optimum position.
Y[0] = X[0] << bits_to_shift1; Y[1] = X[1] << bits_to_shift1;
for (n = 2; n < 8; n++)
{
X[n] = X[n] >> shift;
}

// Build in three guard bits for LSB.
bits_to_shift2 -= 3;
}

// Apply trigonometric coefficients and 1st set of additions/subtractions.
n3 = 0;
for (n = 1; n < 4; n++)
{
n2 = (n << 1); n2p1 = n2 + 1; n3p1 = n3 + 1;

// Truncate contents of registers to required levels.
m1 = ((__int64)coeffs[n3]*X[n2]) >> bits_to_shift2;
m2 = ((__int64)coeffs[n3p1]*X[n2p1]) >> bits_to_shift2;
Y[n2] = (int) (m1 + m2);

// Truncate contents of registers to required levels.
m3 = ((__int64)coeffs[n3p1]*X[n2]) >> bits_to_shift2;
m4 = ((__int64)coeffs[n3+2]*X[n2p1]) >> bits_to_shift2;
Y[n2p1] = (int) (m3 - m4);
n3 += 3;
}

// Apply 2nd address permutation.
m = kk[1];
for (n = 0; n < 8; n++)
{
X[index2[m++]] = Y[n];
}

// Apply 2nd set of additions/subtractions.
for (n = 0; n < 4; n++)
{
n2 = (n << 1); n2p1 = n2 + 1;
store = X[n2] + X[n2p1]; X[n2p1] = X[n2] - X[n2p1]; X[n2] = store;
}


// Apply 3rd address permutation.
m = kk[2];
for (n = 0; n < 8; n++)
{
Y[index3[m++]] = X[n];
}

// Apply 3rd set of additions/subtractions.
for (n = 0; n < 4; n++)
{
n2 = (n << 1); n2p1 = n2 + 1;
store = Y[n2] + Y[n2p1]; Y[n2p1] = Y[n2] - Y[n2p1]; Y[n2] = store;
}
m = kk[3];
for (n = 0; n < 8; n++)
{
if (scaling == 2)
{
// Remove three LSB guard bits - MSB may be magnified by up to three bits.
Y[n] = (Y[n] >> 3);

// Update maximum magnitude of output data set.
if (abs(Y[n]) > abs(Data_Max[0])) Data_Max[0] = Y[n];
}

// Apply 4th address permutation.
X[index4[m++]] = Y[n];
}

// End of function.
}

#include "stdafx.h"
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

void Butterfly_V09M (int i, int j, int k, int *X, int *coeffs, int *kk, int *index1,
int *index2, int *index3, int *index4, int coef_levels, int no_of_bits_coeffs,
int scaling, int *Data_Max, int shift, int Type, int lower, int upper)
{
// Description:
// - - - - - - - - -
// Routine to carry out the generic double butterfly computation using nine fixed-point
// fast multipliers.
// Parameters:
// - - - - - - - - -
// i = index for temporal loop.
// j = index for outer spatial loop.
// k = index for inner spatial loop.
// X = 1-D data array.
// coeffs = current set of trigonometric coefficients.
// kk = offsets for address permutations.
// index1 = 1st address permutation.


// index2 = 2nd address permutation.
// index3 = 3rd address permutation.
// index4 = 4th address permutation.
// coef_levels = number of trigonometric coefficient quantisation levels.
// no_of_bits_coeffs = number of bits representing trigonometric coefficients.
// scaling = scaling flag: 1 => FIXED, 2 => BFP.
// Data_Max = maximum magnitude of output data set.
// shift = no of bits for input data to be shifted.
// Type = butterfly type indicator: I, II or III.
// lower = lower bound for register overflow with unconditional scaling.
// upper = upper bound for register overflow with unconditional scaling.

// Note:
// - - - -
// Dimension array X[n] from 0 to 8 in calling routine RFHT4_Control.

// Declarations:
// - - - - - - - - - -
// Integers:
// - - - - - - -

int m, n, n2, n2p1, store, bits_to_shift1, bits_to_shift2;
// Long Integers:
// - - - - - - -

__int64 product;
// Integer Arrays:
// - - - - - - - -

int Y[11];
// ***********************************************************************
// Apply 1st address permutation - comprising one data exchange.

m = kk[0];
store = X[index1[m++]]; X[6] = X[index1[m]]; X[3] = store;

// Set up scaling factor for multiplication stage.
bits_to_shift2 = no_of_bits_coeffs - 1;
if (scaling == 2)
{
// Set up scaling factor for first two samples of input data set.
bits_to_shift1 = 3 - shift;

// Shift data so that MSB occupies optimum position.
X[0] = X[0] << bits_to_shift1; X[1] = X[1] << bits_to_shift1;
for (n = 2; n < 8; n++)
{
X[n] = X[n] >> shift;
}

// Build in three guard bits for LSB.
bits_to_shift2 -= 3;
}

// Apply 1st set of additions/subtractions.
Y[0] = X[0]; Y[1] = X[1];
Y[2] = X[2]; Y[3] = X[2] + X[3]; Y[4] = X[3];


Y[5] = X[4]; Y[6] = X[4] + X[5]; Y[7] = X[5];
Y[8] = X[6]; Y[9] = X[6] + X[7]; Y[10] = X[7];
if (scaling == 1)
{
// Scale outputs of 1st set of additions/subtractions.
for (n = 0; n < 11; n++) Y[n] = (Y[n] >> 1);

// ### Check for register overflow & flag when overflow arises.
for (n = 0; n < 11; n++)
{
if ((Y[n] < lower) || (Y[n] > upper))
{
printf ("\n\n Overflow occurred on input register");
}
}
// ### Check for register overflow completed.
}

// Apply trigonometric coefficients.
for (n = 0; n < 9; n++)
{
product = ((__int64)coeffs[n]*Y[n+2]) >> bits_to_shift2;
X[n] = (int) product;
}

// Apply 2nd set of additions/subtractions.
if (Type < 3)
{
Y[2] = X[0] + X[1]; Y[3] = X[1] + X[2];
Y[4] = X[3] + X[4]; Y[5] = X[4] + X[5];
}
else
{
Y[2] = X[1] - X[2]; Y[3] = X[0] - X[1];
Y[4] = X[4] - X[5]; Y[5] = X[3] - X[4];
}
if (Type < 2)
{
Y[6] = X[6] + X[7]; Y[7] = X[7] + X[8];
}
else
{
Y[6] = X[7] - X[8]; Y[7] = X[6] - X[7];
}

// Apply 2nd address permutation.
m = kk[1];
for (n = 0; n < 8; n++)
{
X[index2[m++]] = Y[n];
}


// Apply 3rd set of additions/subtractions.
for (n = 0; n < 4; n++)
{
n2 = (n << 1); n2p1 = n2 + 1;
store = X[n2] + X[n2p1]; X[n2p1] = X[n2] - X[n2p1]; X[n2] = store;
}

// Apply 3rd address permutation.
m = kk[2];
for (n = 0; n < 8; n++)
{
Y[index3[m++]] = X[n];
}

// Apply 4th set of additions/subtractions.
for (n = 0; n < 4; n++)
{
n2 = (n << 1); n2p1 = n2 + 1;
store = Y[n2] + Y[n2p1]; Y[n2p1] = Y[n2] - Y[n2p1]; Y[n2] = store;
}

// Apply 4th address permutation.
m = kk[3];
for (n = 0; n < 8; n++)
{
if (scaling == 2)
{
// Remove three LSB guard bits - MSB may be magnified by up to three bits.
Y[n] = (Y[n] >> 3);

// Update maximum magnitude of output data set.
if (abs(Y[n]) > abs(Data_Max[0])) Data_Max[0] = Y[n];
}
X[index4[m++]] = Y[n];
}

// End of function.
}

#include "stdafx.h"
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

void Butterfly_Cordic (int *X, int *kbeta, int *kk, int *index1, int *index2, int *index3,
int *index4, int halfpi, int minusquarterpi, int growth, int *arctans,
int no_of_iterations, int no_of_bits_angle, int negate_flag, int scaling, int *Data_Max,
int shift, int LSB_guard_bits, int lower, int upper)
{
// Description:
// - - - - - - - - -
// Routine to carry out the generic double butterfly computation using three Cordic
// arithmetic units.
// Externs:
// - - - - - -

void Rotation (int*, int*, int*, int, int, int*);


// Parameters:
// - - - - - - - - -
// X = data.
// kbeta = current set of rotation angles.
// kk = offsets for address permutations.
// index1 = 1st address permutation.
// index2 = 2nd address permutation.
// index3 = 3rd address permutation.
// index4 = 4th address permutation.
// halfpi = integer version of +(pi/2).
// minusquarterpi = integer version of -(pi/4).
// growth = integer version of Cordic magnification factor.
// arctans = micro-rotation angles.
// no_of_iterations = no of Cordic iterations.
// no_of_bits_angle = no of bits to represent Cordic rotation angle.
// negate_flag = negation flag for Cordic output.
// scaling = scaling flag: 1 => FIXED, 2 => BFP.
// Data_Max = maximum magnitude of output data set.
// shift = no of bits for input data to be shifted.
// LSB_guard_bits = no of bits for guarding LSB.
// lower = lower bound for register overflow with unconditional scaling.
// upper = upper bound for register overflow with unconditional scaling.

// Declarations:
// - - - - - - - - - -
// Integers:
// - - - - - - -

int m, n, n2, n2p1, store, bits_to_shift1, bits_to_shift2;
// Integer Arrays:
// - - - - - - - - - -

int Y[8], xs[3], ys[3], zs[3];
// *****************************************************************
// Apply 1st address permutation - comprising one data exchange.

m = kk[0];
store = X[index1[m++]]; X[6] = X[index1[m]]; X[3] = store;

// Set up scaling factor for multiplication stage.
bits_to_shift1 = no_of_bits_angle - 1;
if (scaling == 1)
{
// ### Check for register overflow & flag when overflow arises.
for (n = 0; n < 8; n++)
{
if ((X[n] < lower) || (X[n] > upper))
{
printf ("\n\n Overflow occurred on input register");
}
}
// ### Check for register overflow completed.
}
else


{
// Set up scaling factor for first two samples of input data set.
bits_to_shift2 = LSB_guard_bits - shift + 2;

// Shift data so that MSB occupies optimum position.
X[0] = X[0] >> shift; X[1] = X[1] >> shift;
for (n = 2; n < 8; n++)
{
X[n] = X[n] << bits_to_shift2;
}

// Build in two additional guard bits for LSB.
bits_to_shift1 -= 2;
}

// Scale first two permuted inputs with Cordic magnification factor.
Y[0] = (int) (((__int64)growth*X[0]) >> bits_to_shift1);
Y[1] = (int) (((__int64)growth*X[1]) >> bits_to_shift1);

// Set up inputs to Cordic phase rotations of remaining permuted inputs.
xs[0] = X[2]; xs[1] = X[4]; xs[2] = X[6];
ys[0] = X[3]; ys[1] = X[5]; ys[2] = X[7];
zs[0] = kbeta[0]; zs[1] = kbeta[1]; zs[2] = kbeta[2];
if (negate_flag == 1) zs[2] = minusquarterpi;

// Carry out Cordic phase rotations of remaining permuted inputs.
Rotation (xs, ys, zs, halfpi, no_of_iterations, arctans);

// Set up outputs from Cordic phase rotations of remaining permuted inputs.
Y[2] = xs[0]; Y[4] = xs[1]; Y[6] = xs[2];
Y[3] = ys[0]; Y[5] = ys[1]; Y[7] = ys[2];
if (scaling == 2)
{
// Scale Cordic outputs to remove LSB guard bits.
for (n = 2; n < 8; n++)
{
Y[n] = Y[n] >> LSB_guard_bits;
}
}

// Negate, where appropriate, phase rotated outputs.
if (negate_flag > 0)
{
Y[7] = -Y[7];
if (negate_flag > 1)
{
Y[3] = -Y[3]; Y[5] = -Y[5];
}
}

// Apply 2nd address permutation.
m = kk[1];
for (n = 0; n < 8; n++)
{
X[index2[m++]] = Y[n];
}


// Apply 1st set of additions/subtractions.
for (n = 0; n < 4; n++)
{
n2 = (n << 1); n2p1 = n2 + 1;
store = X[n2] + X[n2p1]; X[n2p1] = X[n2] - X[n2p1]; X[n2] = store;
}

// Apply 3rd address permutation.
m = kk[2];
for (n = 0; n < 8; n++)
{
Y[index3[m++]] = X[n];
}

// Apply 2nd set of additions/subtractions.
for (n = 0; n < 4; n++)
{
n2 = (n << 1); n2p1 = n2 + 1;
store = Y[n2] + Y[n2p1]; Y[n2p1] = Y[n2] - Y[n2p1]; Y[n2] = store;
}
m = kk[3];
for (n = 0; n < 8; n++)
{
if (scaling == 2)
{
// Remove two LSB guard bits - MSB may be magnified by up to two bits.
Y[n] = (Y[n] >> 2);

// Update maximum magnitude of output data set.
if (abs(Y[n]) > abs(Data_Max[0])) Data_Max[0] = Y[n];
}

// Apply 4th address permutation.
X[index4[m++]] = Y[n];
}

// End of function.
}

#include "stdafx.h"

void Rotation (int *xs, int *ys, int *zs, int halfpi, int no_of_iterations, int *arctans)
{
// Description:
// - - - - - - - - -
// Routine to carry out the phase rotations required by the Cordic arithmetic unit for the
// single angle, double angle and triple angle cases.
// Parameters:
// - - - - - - - - -
// xs = X coordinates.
// ys = Y coordinates.
// zs = rotation angles.
// halfpi = +(pi/2).
// no_of_iterations = no of Cordic iterations.
// arctans = set of micro-rotation angles.


// Declarations:
// - - - - - - - - - -
// Integers:
// - - - - - - -

int k, n;
// Integer Arrays:
// - - - - - - - - - - -

int temp[3];
// ***********************************************************************
// P H A S E   R O T A T I O N   R O U T I N E.
// Reduce three rotation angles to region of convergence: [-pi/2,+pi/2].

for (n = 0; n < 3; n++)
{
if (zs[n] < -halfpi)
{
temp[n] = +ys[n]; ys[n] = -xs[n]; xs[n] = temp[n]; zs[n] += halfpi;
}
else if (zs[n] > +halfpi)
{
temp[n] = -ys[n]; ys[n] = +xs[n]; xs[n] = temp[n]; zs[n] -= halfpi;
}
}

// Loop through Cordic iterations.
for (k = 0; k < no_of_iterations; k++)
{
// Carry out phase micro-rotation of three complex data samples.
for (n = 0; n < 3; n++)
{
if (zs[n] < 0)
{
temp[n] = xs[n] + (ys[n] >> k);
ys[n] -= (xs[n] >> k); xs[n] = temp[n]; zs[n] += arctans[k];
}
else
{
temp[n] = xs[n] - (ys[n] >> k);
ys[n] += (xs[n] >> k); xs[n] = temp[n]; zs[n] -= arctans[k];
}
}
}

// End of function.
}

#include "stdafx.h"
#include <stdlib.h>

void DataIndices (int i, int j, int k, int offset, int *kk, int index_even_2D[2][4],
int index_odd_2D[2][4], int bfly_count, int alpha)


{
// Description:
// - - - - - - - - - -
// Routine to set up the data indices for accessing the input data for the generic double
// butterfly.
// Parameters:
// - - - - - - - - -
// i = index for temporal loop.
// j = index for outer spatial loop.
// k = index for inner spatial loop.
// offset = element of power-of-two array.
// kk = offsets for address permutations.
// index_even_2D = even data address indices.
// index_odd_2D = odd data address indices.
// bfly_count = double butterfly address for stage.
// alpha = no of temporal stages for transform.

// Externs:
// - - - - - -

void MemoryBankAddress (int, int, int, int, int*, int*);

// Declarations:
// - - - - - - - - - -
// Integers:
// - - - - - - -

int n, n1, n2, twice_offset;
// Pointer Variables:
// - - - - - - - - - - - - -

int *bank1, *offset1, *bank2, *offset2;
// ***********************************************************************
// Set up dynamic memory.

bank1 = new int [1]; bank1[0] = 0;
bank2 = new int [1]; bank2[0] = 0;
offset1 = new int [1]; offset1[0] = 0;
offset2 = new int [1]; offset2[0] = 0;

// Calculate data indices.
if (i == 0)
{
// S T A G E = 0.
twice_offset = offset;

// Set up even and odd data indices for Type-I double butterfly.
n1 = j - twice_offset; n2 = n1 + 4;
for (n = 0; n < 4; n++)
{
n1 += twice_offset; n2 += twice_offset;
MemoryBankAddress (n1, n, 1, alpha, bank1, offset1);
index_even_2D[0][n] = *bank1; index_even_2D[1][n] = *offset1;
MemoryBankAddress (n2, n, 1, alpha, bank2, offset2);
index_odd_2D[0][n] = *bank2; index_odd_2D[1][n] = *offset2;
}


      // Set up offsets for address permutations.
      kk[0] = 0; kk[1] = 0; kk[2] = 0;
   }
   else
   {
      // S T A G E > 0.
      twice_offset = (offset << 1);
      if (k == 0)
      {
         // Set up even and odd data indices for Type-II double butterfly.
         n1 = j - twice_offset; n2 = n1 + offset;
         for (n = 0; n < 4; n++)
         {
            n1 += twice_offset; n2 += twice_offset;
            MemoryBankAddress (n1, bfly_count, 1, alpha, bank1, offset1);
            index_even_2D[0][n] = *bank1; index_even_2D[1][n] = *offset1;
            MemoryBankAddress (n2, bfly_count, 1, alpha, bank2, offset2);
            index_odd_2D[0][n] = *bank2; index_odd_2D[1][n] = *offset2;
         }
         // Set up offsets for address permutations.
         kk[0] = 0; kk[1] = 0; kk[2] = 0;
      }
      else
      {
         // Set up even and odd data indices for Type-III double butterfly.
         n1 = j + k - twice_offset; n2 = j - k;
         for (n = 0; n < 4; n++)
         {
            n1 += twice_offset; n2 += twice_offset;
            MemoryBankAddress (n1, bfly_count, 1, alpha, bank1, offset1);
            index_even_2D[0][n] = *bank1; index_even_2D[1][n] = *offset1;
            MemoryBankAddress (n2, bfly_count, 1, alpha, bank2, offset2);
            index_odd_2D[0][n] = *bank2; index_odd_2D[1][n] = *offset2;
         }
         // Set up offsets for address permutations.
         kk[0] = 2; kk[1] = 8; kk[2] = 8;
      }
   }
// Delete dynamic memory - each array must be deleted individually, as a
// comma-separated operand list would release only the first array.
   delete [] bank1; delete [] bank2; delete [] offset1; delete [] offset2;
// End of function.
}
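Note how the Type-I indexing pre-increments n1 and n2 before they are used, so the four even-sample addresses step through j, j+t, j+2t, j+3t (where t is twice_offset) with each odd-sample address a fixed distance of 4 above its even partner. The minimal standalone sketch below re-creates just that loop; the function name and the sample values of j and twice_offset are illustrative, not taken from the book's listings.

```cpp
#include <vector>

// Illustrative re-creation of the stage-0 (Type-I) index generation: the
// addresses are pre-incremented, so the even indices are j, j+t, j+2t, j+3t
// and each odd index sits a fixed distance of 4 above its even partner.
void typeIIndices (int j, int twice_offset,
                   std::vector<int> &even, std::vector<int> &odd)
{
   int n1 = j - twice_offset, n2 = n1 + 4;
   for (int n = 0; n < 4; n++)
   {
      n1 += twice_offset; n2 += twice_offset;
      even.push_back (n1); odd.push_back (n2);
   }
}
```

With j = 8 and twice_offset = 8, for example, the even indices come out as 8, 16, 24, 32 and the odd indices as 12, 20, 28, 36.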


220 B Source Code Listings for Regularized Fast Hartley Transform

#include "stdafx.h"

void MemoryBankAddress (int address, int butterfly, int startup, int alpha, int *bank,
                        int *offset)
{
// Description:
// - - - - - - - - -
//    Routine to calculate the memory bank address and offset.
// Parameters:
// - - - - - - - - -
//    address   = sample address: [0,1,...,N-1].
//    butterfly = butterfly address for stage: [0,1,...,N/8-1].
//    startup   = initialisation flag: 0 => start up, 1 => butterfly.
//    alpha     = no of temporal stages for transform.
//    bank      = memory bank address of sample: [0,1,2,3,4,5,6,7].
//    offset    = address offset within memory bank: [0,...,N/8-1].
// Note:
// - - - -
//    For optimum arithmetic efficiency, comment out coding options not relevant to the
//    current application.
// Declarations:
// - - - - - - - - - -
// Integers:
// - - - - - - -
   int k1, k2, sub_block_size, mapping;
// ***********************************************************************
// Calculate memory bank address for N up to and including 1K.
//    bank[0] = ((((address%4)+((address%16)>>2)+((address%64)>>4)+
//              ((address%256)>>6)+(address>>8)) % 4) << 1) + (butterfly%2);
// Calculate memory bank address for N up to and including 4K.
//    bank[0] = ((((address%4)+((address%16)>>2)+((address%64)>>4)+((address%256)
//              >>6)+((address%1024)>>8)+(address>>10)) % 4) << 1) + (butterfly%2);
// Calculate memory bank address for N up to and including 16K.
//    bank[0] = ((((address%4)+((address%16)>>2)+((address%64)>>4)+
//              ((address%256)>>6)+((address%1024)>>8)+((address%4096)>>10)+
//              (address>>12)) % 4) << 1) + (butterfly%2);
// Calculate memory bank address using generic version of address mapping.
   sub_block_size = 1; mapping = 0;
   for (k1 = 0; k1 < alpha; k1++)
   {
      k2 = k1 << 1;
      sub_block_size <<= 2;
      mapping += ((address % sub_block_size) >> k2);
   }
   bank[0] = ((mapping % 4) << 1) + (butterfly % 2);
// Calculate address offset within memory bank.
   if (startup > 0) offset[0] = address >> 3;
// End of function.
}
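As a quick sanity check on the generic address mapping, note that with alpha = 5 (i.e. N = 4^5 = 1024) the loop reduces term by term to the hard-coded 1K option given in the comments above: the final term (address % 1024) >> 8 equals address >> 8 for any address below 1024. The standalone sketch below makes this concrete; the two function names are illustrative, not from the book's listings.

```cpp
// Generic memory bank mapping from MemoryBankAddress above: a base-4
// digit-sum of the sample address, taken modulo 4, doubled, plus the
// butterfly parity, selecting one of 8 memory banks.
int genericBank (int address, int butterfly, int alpha)
{
   int sub_block_size = 1, mapping = 0;
   for (int k1 = 0; k1 < alpha; k1++)
   {
      int k2 = k1 << 1;
      sub_block_size <<= 2;
      mapping += ((address % sub_block_size) >> k2);
   }
   return ((mapping % 4) << 1) + (butterfly % 2);
}

// Hard-coded variant for N up to and including 1K, taken from the
// commented-out coding option in the listing above.
int bank1K (int address, int butterfly)
{
   return ((((address % 4) + ((address % 16) >> 2) + ((address % 64) >> 4) +
             ((address % 256) >> 6) + (address >> 8)) % 4) << 1) + (butterfly % 2);
}
```

Looping genericBank (a, b, 5) over every address a in [0, 1023] and both butterfly parities b reproduces bank1K (a, b) exactly, which is why the specialised options can be commented out in favour of the generic loop when arithmetic efficiency is not critical.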


Glossary

ADC – analog-to-digital conversion
ASIC – application-specific integrated circuit
AWGN – additive white Gaussian noise
CD – compact disc
CFA – Common Factor Algorithm
CLB – configurable logic block
CM – trigonometric coefficient memory
CN – linear space of complex-valued N-tuples
CORDIC – Co-Ordinate Rotation DIgital Computer
CRT – Chinese Remainder Theorem
CS – computational stage
DA – distributed arithmetic
DDC – digital down conversion
DFT – discrete Fourier transform
DHT – discrete Hartley transform
DIF – decimation-in-frequency
DIT – decimation-in-time
DM – data memory
DMER – even-real data memory
DMI – intermediate data memory
DMOI – odd-imaginary data memory
DSP – digital signal processing
DTMF – dual-tone multi-frequency
FDM – frequency division multiplexed
FFT – fast Fourier transform
FHT – fast Hartley transform
FNT – Fermat number transform
FPGA – field-programmable gate array
GD-BFLY – generic double butterfly
HDL – hardware description language
IF – intermediate frequency
I/O – input–output
IP – intellectual property
LSB – least significant bit
LUT – look-up table
MAC – multiplier and accumulator
MNT – Mersenne number transform
MSB – most significant bit
NTT – number-theoretic transform


O – order
PE – processing element
PFA – Prime Factor Algorithm
PSD – power spectral density
RAM – random access memory
R24 FHT – regularized radix-4 fast Hartley transform
RF – radio frequency
RN – linear space of real-valued N-tuples
ROM – read only memory
SFDR – spurious-free dynamic range
SFG – signal flow graph
SIMD – single-instruction multiple-data
SNR – signal-to-noise ratio
TDOA – time-difference-of-arrival
TOA – time-of-arrival


Index

A
Alias-free formulation, 154–155
Analog-to-digital conversion (ADC), 42, 72, 95
Application-specific integrated circuit (ASIC), 1, 2, 8, 39, 65–67, 77, 114
Area efficient, 62, 70–72, 77–98, 160
Arithmetic complexity, 1, 8, 18–20, 25, 37, 39, 61, 72–74, 77, 85, 86, 97, 110, 112, 114, 124, 125, 132, 137, 140, 145, 151, 155
Auto-correlation, 12, 135–137, 141–146, 161

B
Bergland algorithm, 16–18
Bit reversal mapping, 23, 47, 98
Bruun algorithms, 16, 18–19, 24
Butterfly, 3, 7, 8, 11, 17, 37–39, 41–56, 59, 61, 62, 81, 84, 90–91, 108, 160

C
Channelization, 12, 61, 135, 136, 149–156, 161
Chinese remainder theorem (CRT), 5
Circular convolution, 36, 140, 150
Circular correlation, 140, 144
Clock frequency, 66, 68, 71, 97, 114
Coarse grain parallelism, 68
Common factor algorithm (CFA), 5
Complementary angle LUTs, 85, 86, 92, 125, 126
Computational density, 11, 42, 66, 71, 74, 77, 78, 127, 132, 135, 161, 162
Computational stage (CS), 4, 62, 90, 105, 108, 127
Configurable logic block (CLB), 67
Convolution, 34, 36, 135–137, 140, 144, 149–151, 156
Cooley-Tukey algorithm, 4, 5, 17, 133
Co-Ordinate Rotation Digital Computer (CORDIC), 10, 12, 70, 72, 73, 101, 102, 104–114, 159, 161, 165, 168–171, 173–176, 178–180, 189, 190, 193, 195, 213–215, 217
Correlation, 12, 35, 37, 135–137, 140–149, 151, 156, 161
Cross-correlation, 12, 135, 137, 141–143, 145–148

D
Data memory (DM), 12, 59, 79–84, 90–92, 95, 98, 114, 120, 122, 124–127, 130, 132, 156
Data space, 5, 8, 11, 12, 36, 39, 129, 136–140, 142, 143, 149, 156
Decimation-in-frequency (DIF), 5, 16, 17, 20, 23, 42
Decimation-in-time (DIT), 5, 16, 17, 20, 23, 37, 42, 45, 59
Di-bit reversal mapping, 47, 98
Differentiation, 135, 137–140, 147, 156
Digital down conversion (DDC), 7, 24, 95, 135, 136, 149, 151, 153, 156, 160
Digit reversal mapping, 23
Discrete Fourier transform (DFT), 1–8, 10–12, 15–25, 27–37, 39, 41, 59, 65, 78, 96, 114, 117–136, 149, 151–156, 159–161
Discrete Hartley transform (DHT), 1, 6–8, 10–12, 27–39, 42, 44, 129, 130, 133, 138–140, 143–145, 147, 148, 150, 155, 160
Distributed arithmetic (DA), 10, 70, 102
Divide-and-conquer, 4, 46
Double buffering, 95, 98, 127
Double-resolution approach, 117, 118, 124–127, 132
Dragonfly, 62


Dual-port memory, 91, 96, 97, 132, 156

E
Equivalency theorem, 151

F
Fast Fourier transform (FFT), 1, 3–8, 10, 11, 15–24, 37, 39, 41, 46, 59–62, 65, 67, 71, 72, 74, 78, 93–98, 102, 118–122, 124, 128–130, 133, 137, 139, 141, 142, 144, 153, 155, 159–162
Fast Hartley transform (FHT), 1, 6–10, 12, 27–39, 41–63, 65, 69, 74, 77–79, 87, 101, 107, 111, 117–133, 135–156, 160
Field-programmable gate array (FPGA), 1, 2, 8, 23, 39, 65–67, 73, 74, 77, 79, 93–97, 101–103, 105, 110, 112, 114, 134, 153
Fine grain parallelism, 68
Fixed point, 2, 39, 42, 61–62, 90, 92, 97, 101–103, 107, 110, 112, 114, 127, 133, 161
Fourier matrix, 2, 5
Fourier space, 11, 30, 31, 39, 98, 118, 120, 121, 129–131, 135, 136, 141
Frequency division multiplexed (FDM), 149

G
Generic double butterfly (GD-BFLY), 8, 48, 50, 52, 54–56, 90–91, 108, 160
Global pipelining, 69, 78

H
Half-resolution approach, 117, 118, 131–133
Hardware description language (HDL), 13, 73, 84, 161
Hartley matrix, 6
Hartley space, 8, 11, 30, 32, 36, 39, 98, 118, 120, 121, 129–131, 135–142, 147–151, 156

I
In-place processing, 80, 123, 132
Input-output (I/O), 9, 11, 66, 70–72, 78, 95, 132

K
Kernels, 2–4, 6, 10, 18, 27, 29, 63

L
Latency, 8, 10, 25, 37, 66, 71, 72, 78, 95–97, 105, 114, 121, 127, 129, 131, 155, 159, 162
Linear convolution, 36, 137, 139, 144
Linear correlation, 140
Local pipelining, 69

M
Matched filter, 2, 39, 144
Memory requirement, 7, 8, 10, 18, 24, 25, 37, 58, 59, 61, 63, 68, 70, 71, 73, 74, 86, 91, 92, 95, 97, 101, 102, 110, 112, 114, 124–126, 131, 132, 145, 151, 161
Minimum arithmetic addressing, 57, 85, 86, 91, 93, 112, 124, 126
Minimum memory addressing, 57–58, 85–86, 88, 92, 112, 125, 126
Mobile communications, 2, 7–9, 11, 65–67, 79, 159
Multiplier and accumulator (MAC), 7

N
Nearest-neighbour communication, 61, 68
Noble identity, 151

O
Orthogonal, 2, 6, 10, 11, 27, 36, 46, 53, 129, 137, 140, 160
Overlap-add technique, 137, 144, 149
Overlap-save technique, 36, 139, 144, 150

P
Parseval's theorem, 36, 129, 130
Partitioned-memory processing, 66, 71–72, 74, 98
Pipeline delay, 90, 114, 127, 131
Pipelining, 4, 62, 68, 69, 78, 82–84, 88, 90–91, 105
Polyphase DFT filter bank, 135, 136, 151, 153, 154, 156
Power spectral density (PSD), 39
Prime factor algorithm (PFA), 5
Processing element (PE), 9

Q
Quad-port memory, 96, 97, 132, 156

Page 232: The Regularized Fast Hartley Transform: Optimal Formulation of Real-Data Fast Fourier Transform for Silicon-Based Implementation in Resource-Constrained Environments

Index 225

R
Radix, 3–5, 7, 8, 11, 16, 19, 20, 23, 24, 37, 41–56, 61, 62, 69, 78, 80, 93–95, 98, 117, 133, 159, 160
Random access memory (RAM), 65–67, 73, 101, 103, 110, 112, 114, 123, 159
Read only memory (ROM), 67, 104
Real-from-complex strategy, 7, 15, 16, 59, 95, 153, 160
Regularity, 3, 8, 24, 37, 39, 41, 45, 47, 52, 54, 59, 61, 62, 114, 133, 160
Regularized radix-4 fast Hartley transform (R24 FHT), 8–10, 12, 13, 36–38, 41, 42, 45–48, 54, 56–59, 61, 62, 66–71, 73, 74, 77, 78, 80, 81, 83, 85, 86, 91–93, 95–98, 101, 102, 107, 110, 112, 114, 117–121, 123–127, 129–135, 142, 153, 155, 156, 160, 161, 163–165, 167, 169, 175

S
Scalable, 70, 71, 77–98, 112, 160, 162
Scaling strategy, 28, 39, 61, 62, 98, 161, 163, 167, 169–171, 178, 180
Shannon-Hartley theorem, 9
Signal flow graph (SFG), 16, 17, 38, 45, 47, 48, 52, 54–56, 62, 108, 109, 120–122
Signal-to-noise ratio (SNR), 42, 136, 144
Silicon area, 11, 66, 68–71, 78
Single-instruction multiple-data (SIMD), 4, 96
Single-port memory, 110
Single-quadrant addressing scheme, 57, 85
Space complexity, 69, 70, 91–92
Spider, 62
Spurious-free dynamic range (SFDR), 95
Start-up delay, 90, 123–125
Switching frequency, 66, 68, 70
Symmetry, 3, 10, 16, 29, 52, 62, 63, 205, 207
Systolic, 4

T
Time complexity, 23, 70, 78, 92–93, 117, 124, 125, 127, 131, 155
Time-difference-of-arrival (TDOA), 141, 148, 155, 156
Time-of-arrival (TOA), 141, 148, 155
Transform space, 8, 12, 31, 32, 39, 98, 118, 132, 133, 135–137, 140, 142, 143, 147, 149, 150, 155
Trigonometric coefficient memory (CM), 12, 56–58, 69, 70, 72, 79, 80, 85, 86, 91, 92, 96, 103, 104, 110, 112, 132, 156, 163, 166

U
Unitary, 2, 6, 10, 11, 27, 28, 36, 46, 129, 137, 140
Update-time, 8, 65, 71, 95–97, 132, 159
Up-sampling, 135, 137–140, 147, 156

W
Wireless communications, 9, 13, 135, 136, 159, 161

Z
Zero-padding, 118, 133, 144, 150