pipelined fft pptfft

ELEC692 VLSI Signal Processing Architecture
Lecture 8: Architecture for Fourier Transform

Usage of FFT
- Frequency transformation
- Applications
  - OFDM wireless systems
  - Speech/multimedia data processing
  - Satellite wireless transmission
  - DTV and DAB broadcasting using OFDM
- Real-time requirements call for special hardware, e.g. COFDM for DTV:
  - Signal bandwidth: 7.5 MHz
  - Useful symbol duration: 1 ms
  - Number of parallel subcarriers = 7.5 MHz * 1 ms = 7500
  - Needs an 8K-point complex FFT
  - One 8K-point complex FFT must be computed every 1 ms, i.e. about 8M complex points transformed per second
- This is neither efficient nor practical to implement in software; special hardware is needed for the FFT
- Quite a few off-the-shelf FFT processors are available on the market, but it is better to integrate the hardware within your own chip
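The sizing arithmetic of the COFDM example can be checked with a few lines of Python. This is a minimal sketch: the function name `ofdm_fft_size` is hypothetical, and rounding up to the next power of two is an assumption about why 7500 subcarriers lead to an 8K-point FFT.

```python
def ofdm_fft_size(bandwidth_hz, symbol_duration_s):
    """Return (number of subcarriers, smallest power-of-two FFT size that fits them)."""
    subcarriers = round(bandwidth_hz * symbol_duration_s)
    # Assumption: the FFT length is the next power of two at or above the subcarrier count.
    fft_size = 1 << (subcarriers - 1).bit_length()
    return subcarriers, fft_size


subcarriers, fft_size = ofdm_fft_size(7.5e6, 1e-3)
points_per_second = fft_size / 1e-3  # one full FFT per useful symbol duration
print(subcarriers, fft_size, points_per_second)
```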

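DFT review
The N-point discrete Fourier transform X(k) of an N-point sequence x(n) (and the inverse DFT) is given by

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \qquad W_N = e^{-j 2\pi / N}$$

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-nk}, \quad n = 0, 1, \ldots, N-1$$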

Direct Implementation of DFT
- Product of a matrix (W) and a vector (x)
- An 8-point FFT example
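The matrix-vector view of the direct DFT can be sketched in plain Python as follows. The names `dft_matrix` and `direct_dft` are illustrative only, and the sign convention W_N = exp(-j*2*pi/N) is the usual one, assumed here.

```python
import cmath


def dft_matrix(n):
    """Build the n-by-n matrix W whose entry (k, i) is W_n^(k*i)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (k * i) for i in range(n)] for k in range(n)]


def direct_dft(x):
    """Compute X = W x directly: n*n complex multiply-accumulate operations."""
    n = len(x)
    W = dft_matrix(n)
    return [sum(W[k][i] * x[i] for i in range(n)) for k in range(n)]


# 8-point example: a unit impulse transforms to a flat spectrum.
spectrum = direct_dft([1, 0, 0, 0, 0, 0, 0, 0])
```

Each of the N outputs needs N multiply-accumulates, which is where the quadratic cost of the direct method comes from.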

1D array for DFT for N=8

Complex multiplications

Fast DFT
- Fast DFT (Discrete Fourier Transform) algorithm
  - Cooley-Tukey decomposition (1965)
  - Radix-2 decimation-in-time (DIT) or decimation-in-frequency (DIF)
- Divide the problem into two interleaved halves at each recursive stage
- Radix-2 decomposition first computes the transform of the even-indexed samples x_0, x_2, ..., x_{n-2}, then that of the odd-indexed samples x_1, x_3, ..., x_{n-1}, and then combines the two results.
- The sequence can be decomposed recursively to reduce the overall runtime to O(n log n)
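The even/odd split described above can be written as a short recursive routine. This is a minimal sketch of radix-2 decimation-in-time, assuming the input length is a power of two; `fft_dit` is an illustrative name.

```python
import cmath


def fft_dit(x):
    """Recursive radix-2 decimation-in-time FFT (len(x) must be a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_dit(x[0::2])  # transform of x_0, x_2, ..., x_{n-2}
    odd = fft_dit(x[1::2])   # transform of x_1, x_3, ..., x_{n-1}
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle W_n^k
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

Each level of recursion does O(n) work and there are log2(n) levels, which gives the O(n log n) runtime.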

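Radix-2 DIF DFT
Split the sum into the first and second halves of the input:

$$X(k) = \sum_{n=0}^{N/2-1} x(n)\, W_N^{nk} + W_N^{(N/2)k} \sum_{n=0}^{N/2-1} x(n + N/2)\, W_N^{nk}$$

Since $W_N^{N/2}$ corresponds to a rotation of 180°, i.e. $W_N^{N/2} = -1$, the factor of the second sum reduces to $(-1)^k$. We have

$$X(k) = \sum_{n=0}^{N/2-1} \left[ x(n) + (-1)^k\, x(n + N/2) \right] W_N^{nk}$$

The division of k into even and odd values leads to the following:

$$X(2r) = \sum_{n=0}^{N/2-1} \left[ x(n) + x(n + N/2) \right] W_{N/2}^{nr}$$

$$X(2r+1) = \sum_{n=0}^{N/2-1} \left[ x(n) - x(n + N/2) \right] W_N^{n}\, W_{N/2}^{nr}, \quad r = 0, 1, \ldots, N/2 - 1$$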

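Radix-2 decomposition of 8-point FFT
[Signal flow graph: inputs x(0) ... x(7) in natural order; outputs in bit-reversed order y(0), y(4), y(2), y(6), y(1), y(5), y(3), y(7); twiddle factors W0, W1, W2, W3 in the first stage, W0 and W2 in the second stage, W0 in the third stage; the lower branch of each butterfly carries a factor of -1.]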
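The 8-point flow graph can be mirrored by an in-place radix-2 DIF loop whose result comes out in bit-reversed order. This is a sketch under the usual convention W_N = exp(-j*2*pi/N); the function names are illustrative.

```python
import cmath


def fft_dif_bit_reversed(x):
    """In-place radix-2 DIF FFT; position p of the result holds X(bit_reverse(p))."""
    a = list(x)
    n = len(a)
    span = n // 2  # distance between the two inputs of a butterfly
    while span >= 1:
        for start in range(0, n, 2 * span):
            for i in range(span):
                w = cmath.exp(-2j * cmath.pi * i / (2 * span))
                top, bottom = a[start + i], a[start + i + span]
                a[start + i] = top + bottom               # upper output: sum
                a[start + i + span] = (top - bottom) * w  # lower output: difference times twiddle
        span //= 2
    return a


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


output_order = [bit_reverse(p, 3) for p in range(8)]
```

For N = 8, `output_order` lists which frequency index appears at each output position of the flow graph.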

Implementation of Radix-2 FFT
- Two extreme methods
  - Reuse a single butterfly
    - Slower
    - Smaller area
    - More complicated control
  - Fully multi-stage straight implementation
    - Faster
    - Larger area
    - More regular control
- Trade-off between the two ends based on speed, area and power

Comparison of calculation hardware

         DFT        |              FFT
   MUL       ADD    |   MUL                      ADD
   (N-2)^2   N^2    |   (N/2)log2(N) - (N-1)     N log2(N)
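The FFT column can be cross-checked by walking the radix-2 DIF flow graph and counting operations. This is a sketch that assumes a multiplication is counted only when the twiddle factor is not W^0, which is the convention that reproduces the (N/2)log2(N) - (N-1) figure; the function name is illustrative.

```python
def radix2_fft_operation_counts(n):
    """Count (non-trivial complex multiplications, complex additions) in a radix-2 DIF FFT."""
    multiplies = 0
    additions = 0
    span = n // 2
    while span >= 1:
        blocks = n // (2 * span)
        # Each block uses twiddle exponents 0 .. span-1; exponent 0 is a trivial multiply by 1.
        multiplies += blocks * (span - 1)
        # Every butterfly performs one addition and one subtraction.
        additions += 2 * blocks * span
        span //= 2
    return multiplies, additions


print(radix2_fft_operation_counts(8))     # (5, 24)
print(radix2_fft_operation_counts(1024))  # (4097, 10240)
```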

Data transport
- One problem with the FFT is its less regular data transport.
- If the butterfly PEs are arranged so that the PEs with lower exponents of W come first in each stage, the communication networks between the stages become identical (perfect shuffle).
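The perfect shuffle itself is a simple interleaving of the two halves of a sequence. The sketch below (with the illustrative name `perfect_shuffle`) shows the permutation and the fact that applying it log2(N) times restores the original order, which is why one identical network can serve every stage.

```python
def perfect_shuffle(seq):
    """Interleave the first and second halves: a0, b0, a1, b1, ..."""
    half = len(seq) // 2
    out = []
    for a, b in zip(seq[:half], seq[half:]):
        out.extend((a, b))
    return out


order = list(range(8))
once = perfect_shuffle(order)  # [0, 4, 1, 5, 2, 6, 3, 7]
three_times = perfect_shuffle(perfect_shuffle(once))
```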

Conventional single-butterfly FFT implementation
- Strong speed limitation
- Large storage area needed for intermediate results (N complex words)
- If the memory is not partitioned, the number of R/W accesses needed to perform the FFT creates a bottleneck
- An N-point FFT requires (N/r) log_r(N) radix-r butterfly computations and 2N log_r(N) R/W RAM accesses
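To get a feel for the bottleneck, the two formulas can be evaluated for concrete sizes. This is a small sketch; `single_butterfly_cost` is a hypothetical helper name, and N is assumed to be an exact power of the radix.

```python
def single_butterfly_cost(n, radix):
    """Return (butterfly computations, RAM read/write accesses) for an n-point radix-r FFT."""
    stages = 0
    size = 1
    while size < n:
        size *= radix
        stages += 1
    if size != n:
        raise ValueError("n must be a power of the radix")
    butterflies = (n // radix) * stages  # (N/r) * log_r(N)
    ram_accesses = 2 * n * stages        # 2N * log_r(N): read and write every word per stage
    return butterflies, ram_accesses


print(single_butterfly_cost(1024, 2))  # (5120, 20480)
print(single_butterfly_cost(1024, 4))  # (1280, 10240)
```

Going from radix 2 to radix 4 halves the number of stages, and with it the number of memory accesses.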

Single-stage (1-D) implementation - horizontal projection
- Horizontal projection: provide PEs for a single stage only
- Uses only N/2 PEs, i.e. one stage
- Throughput is reduced by a factor of log2(N) compared with a 2-D array
- The complex communication structure needs care
- PEs do not have fixed coefficients; the coefficients change after each cycle, and the global communication network is a disadvantage

Single-stage (1-D) implementation - horizontal projection
- Pipelining within the PEs does not directly increase throughput for this architecture, since the results of the current processing step are required for the next one.
- However, sequential data blocks of length N can be processed independently of one another, so several data blocks can be processed by interleaving.
- This requires an increase in the number of registers.

Single-stage (1-D) implementation - horizontal projection
- If N is large, we cannot implement all of the PEs.
- Project the N/2 butterfly PEs onto M PEs, where M is also a power of 2 and M < N/2.
- Special registers for input data, intermediate results and result data are required.
- The registers cyclically read and write a particular sequence of 2M complex data.

Single-stage (1-D) implementation - vertical projection
- Vertical projection: one PE for each stage (log2(N) PEs in total)
- Circuitry is needed between the PEs to prepare the correct data input
- From stage to stage, the length of the sequence to which the FFT is applied is halved.
- Given that the previous stage led to a DFT of length 2n, then in accordance with the perfect shuffle the sequence of length 2n must be halved, and the 1st and (n+1)th values must be fed to the following PE. Then the 2nd and (n+2)th values are fed to it, and so on.
- Hence the sequence must be delayed by n clock cycles in accordance with the position of the midpoint.

Data formatting/sorting for vertical projection
- The block u_{n-1}, ..., u_0 must be delayed by n clock cycles.
- When u_n is available, the values from the stream u must be fed to the new lower stream v. The values of u are input in parallel into the next butterfly stage for n clock cycles.
- So the values of v are fed in parallel to the next butterfly PE for n clock cycles; v_{n-1}, ..., v_0 are delayed by 2n cycles and v_{2n-1}, ..., v_n are delayed by n cycles.

Data formatting/sorting for vertical projection
- A special circuit is necessary for the data input of the 1st stage.
- The incoming data stream of N data is divided into 2 parts of N/2 data each. The clock rate is hence halved.
- We need a demultiplexer followed by a FIFO register.
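The first-stage input formatting (demultiplexer plus FIFO) can be modelled behaviourally in Python. This is only a sketch of the data movement, not of the hardware: the function name `pair_input_stream` is hypothetical, and the FIFO is modelled with `collections.deque`.

```python
from collections import deque


def pair_input_stream(samples, n):
    """Turn a serial stream of n-sample blocks into butterfly input pairs (x(i), x(i + n/2))."""
    fifo = deque()
    pairs = []
    for t, sample in enumerate(samples):
        if t % n < n // 2:
            fifo.append(sample)                      # demux position 0: first half goes into the FIFO
        else:
            pairs.append((fifo.popleft(), sample))   # demux position 1: pair with the delayed sample
    return pairs


pairs = pair_input_stream(range(8), 8)  # [(0, 4), (1, 5), (2, 6), (3, 7)]
```

N input samples yield N/2 pairs, which matches the halved clock rate on the parallel side.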

Overall architecture of linear FFT array based on butterfly PEs and delay commutators
- Consists of N PEs, with delay commutators located between the PEs.

Due to the continued halving, control signals are extracted using frequency dividers

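Higher radix FFT
Radix-4 DIF algorithm. We have

$$X(k) = \sum_{n=0}^{N/4-1} \left[ x(n) + (-j)^{k}\, x(n + N/4) + (-1)^{k}\, x(n + N/2) + (j)^{k}\, x(n + 3N/4) \right] W_N^{nk}$$

Thus, for q = 0, 1, 2, 3 and r = 0, 1, ..., N/4 - 1,

$$X(4r + q) = \sum_{n=0}^{N/4-1} \left[ x(n) + (-j)^{q}\, x(n + N/4) + (-1)^{q}\, x(n + N/2) + (j)^{q}\, x(n + 3N/4) \right] W_N^{nq}\, W_{N/4}^{nr}$$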

Radix-4 DIF algorithm
Butterfly of Radix-4 Algorithm
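A radix-4 DIF step can be sketched recursively: one radix-4 butterfly per group of four inputs spaced N/4 apart, a twiddle multiplication, then four N/4-point transforms. The code below is an illustration under the assumption that N is a power of 4; the names `ROT` and `fft_radix4_dif` are not from the slides.

```python
import cmath

# Powers of -j: the only factors needed inside a radix-4 butterfly (no real multiplier).
ROT = [1, -1j, -1, 1j]


def fft_radix4_dif(x):
    """Recursive radix-4 DIF FFT returning outputs in natural order (len(x) a power of 4)."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 4:
        raise ValueError("length must be a power of 4")
    q = n // 4
    subs = [[], [], [], []]  # one sub-sequence per output residue k mod 4
    for i in range(q):
        taps = (x[i], x[i + q], x[i + 2 * q], x[i + 3 * q])
        for r in range(4):
            # Radix-4 butterfly output r: sum of taps weighted by (-j)^(l*r).
            s = sum(ROT[(l * r) % 4] * taps[l] for l in range(4))
            subs[r].append(s * cmath.exp(-2j * cmath.pi * i * r / n))  # twiddle W_n^(i*r)
    out = [0j] * n
    for r in range(4):
        out[r::4] = fft_radix4_dif(subs[r])
    return out
```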

Radix-4 Signal flow graph

Higher radix FFT
Radix-8 algorithm

Some pipeline FFT processor architectures
- Assume the input sequence is in normal order and the output is allowed to be in digit-reversed (radix-2 or radix-4) order.
- Assume the DIF type of decomposition.
- Here we assume the additive butterfly has been separated from the multiplier, to show the hardware requirements distinctly.

Radix-2 Multi-path Delay Commutator (R2MDC), N=16
- The input sequence is broken into 2 parallel data streams flowing forward, with the correct distance between the data elements entering the butterfly scheduled by proper delays.
- # of multipliers: log2(N) - 2
- # of butterflies: log2(N)
- # of registers: (3/2)N - 2

Radix-2 Single-path Delay Feedback (R2SDF), N=16
- The butterfly outputs are stored in feedback shift registers. A single data stream goes through the multiplier at every stage.
- # of multipliers: log2(N) - 2
- # of butterflies: log2(N)
- # of registers: N - 1
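The delay-feedback idea can be illustrated with a cycle-by-cycle behavioural model in Python. This is a sketch, not the architecture from the slides: the class and function names are hypothetical, each stage is assumed to have no extra pipeline registers, and the multiplier is modelled as a plain complex multiply even where the twiddle is trivial.

```python
import cmath
from collections import deque


class R2SDFStage:
    """One R2SDF stage: a butterfly plus a feedback shift register of length `delay`."""

    def __init__(self, delay):
        self.delay = delay
        self.fifo = deque([0j] * delay)
        self.count = 0

    def clock(self, sample):
        phase = self.count % (2 * self.delay)
        stored = self.fifo.popleft()
        if phase < self.delay:
            # Fill phase: store the input, emit the earlier difference times its twiddle.
            self.fifo.append(sample)
            out = stored * cmath.exp(-2j * cmath.pi * phase / (2 * self.delay))
        else:
            # Butterfly phase: emit the sum now, feed the difference back into the register.
            self.fifo.append(stored - sample)
            out = stored + sample
        self.count += 1
        return out


def r2sdf_fft(x):
    """Stream one block through log2(N) stages; returns X in bit-reversed order."""
    n = len(x)
    stages = []
    delay = n // 2
    while delay >= 1:
        stages.append(R2SDFStage(delay))
        delay //= 2
    outputs = []
    for sample in list(x) + [0j] * (n - 1):  # n - 1 extra cycles flush the pipeline
        for stage in stages:
            sample = stage.clock(sample)
        outputs.append(sample)
    return outputs[n - 1:]  # total latency equals the register count, N - 1
```

The shift registers have lengths N/2, N/4, ..., 1, which add up to the N - 1 registers quoted for this architecture.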

Radix-4 Single-path Delay Feedback (R4SDF), N=256
- Uses radix-4 and CORDIC iterations.
- Utilization of the multipliers is increased to 75% due to the storage of 3 out of the 4 radix-4 butterfly outputs.
- Utilization of the radix-4 butterfly (which is more complicated than the radix-2 butterfly, containing at least 8 complex adders) drops to 25%.
- # of multipliers: log4(N) - 1
- # of butterflies: log4(N)
- # of registers: N - 1

Radix-4 Multi-path Delay Commutator (R4MDC), N=256
- Utilization rate: butterflies 25%, multipliers 25%
- # of multipliers: 3 log4(N)
- # of butterflies: log4(N)
- # of registers: (5/2)N - 4
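The resource figures for the four architectures can be tabulated for a given N. This is a sketch built only from the counts quoted on these slides; it assumes the multiplier counts for R2MDC, R2SDF and R4SDF read "log N minus a constant" (the minus signs are an assumption), and it takes the R4MDC multiplier count as printed. The function name is hypothetical.

```python
def pipeline_fft_resources(n):
    """Return multiplier, butterfly and register counts per architecture (n a power of 4)."""
    log2n = n.bit_length() - 1
    if n < 4 or (1 << log2n) != n or log2n % 2:
        raise ValueError("n must be a power of 4")
    log4n = log2n // 2
    return {
        # Assumption: multiplier counts are log2(N) - 2 and log4(N) - 1.
        "R2MDC": {"multipliers": log2n - 2, "butterflies": log2n, "registers": 3 * n // 2 - 2},
        "R2SDF": {"multipliers": log2n - 2, "butterflies": log2n, "registers": n - 1},
        "R4SDF": {"multipliers": log4n - 1, "butterflies": log4n, "registers": n - 1},
        # R4MDC multiplier count taken as printed on the slide.
        "R4MDC": {"multipliers": 3 * log4n, "butterflies": log4n, "registers": 5 * n // 2 - 4},
    }


for name, counts in pipeline_fft_resources(256).items():
    print(name, counts)
```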

Some observations
- Delay-feedback architectures are more efficient than the corresponding delay-commutator architectures in terms of memory utilization, since the stored butterfly outputs can be used directly by the multipliers.
- Single-path architectures based on the radix-4 algorithm have higher multiplier utilization, but radix-2 architectures have simpler butterflies, which are better utilized.

Comparison
- Control scheme: simple <-> complex
- Processing ability per unit: low <-> high
- Radix / speed: low <-> high
- Combine the advantages: further decompose the high-radix PE

Radix-2^2 DIF FFT
- Optimal hardware
  - The same number of non-trivial multiplications, at the same positions in the SFG, as radix-4 algorithms
  - The same butterfly structure as radix-2 algorithms
- Radix-2^2 DIF FFT: S. He and M. Torkelson, "A New Approach to Pipeline FFT Processor," in Proceedings of IPPS, 1996, pp. 766-780.

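Radix-2^2 DIF FFT
- Apply a 3-dimensional linear index map:

$$n = \left\langle \tfrac{N}{2} n_1 + \tfrac{N}{4} n_2 + n_3 \right\rangle_N, \qquad k = \left\langle k_1 + 2 k_2 + 4 k_3 \right\rangle_N$$

- The common factor algorithm (CFA) has the form of

$$X(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{N/4-1} \sum_{n_2=0}^{1} \sum_{n_1=0}^{1} x\!\left(\tfrac{N}{2} n_1 + \tfrac{N}{4} n_2 + n_3\right) W_N^{\left(\frac{N}{2} n_1 + \frac{N}{4} n_2 + n_3\right)(k_1 + 2k_2 + 4k_3)}$$

- Summation over n_1 gives

$$X(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{N/4-1} \sum_{n_2=0}^{1} B_{N/2}^{k_1}\!\left(\tfrac{N}{4} n_2 + n_3\right) W_N^{\left(\frac{N}{4} n_2 + n_3\right)(k_1 + 2k_2 + 4k_3)}, \qquad B_{N/2}^{k_1}(m) = x(m) + (-1)^{k_1}\, x\!\left(m + \tfrac{N}{2}\right)$$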

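Radix-2^2 DIF FFT
- Proceed with the second step of decomposition on the remaining DFT coefficients, including the twiddle factor, to exploit the exceptional values in multiplication before the next butterfly is constructed.
- After substitution and simplification, we have

$$X(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{N/4-1} \left[ H(k_1, k_2, n_3)\, W_N^{n_3 (k_1 + 2k_2)} \right] W_{N/4}^{n_3 k_3}$$

$$H(k_1, k_2, n_3) = \underbrace{\left[ x(n_3) + (-1)^{k_1} x\!\left(n_3 + \tfrac{N}{2}\right) \right]}_{\text{BF I}} + (-j)^{(k_1 + 2k_2)} \underbrace{\left[ x\!\left(n_3 + \tfrac{N}{4}\right) + (-1)^{k_1} x\!\left(n_3 + \tfrac{3N}{4}\right) \right]}_{\text{BF I}}$$

  The two bracketed terms are BF I; the complete expression for H is BF II.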

Butterfly with decomposed twiddle factors
- Full multipliers are required to compute the product of the decomposed twiddle factor.
- The order of the twiddle factors is different from that of the radix-4 algorithm.

Complete Radix-2^2 DIF FFT
- Apply the CFA recursively to the remaining DFTs of length N/4.
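The recursive application of the CFA can be sketched in Python. This is an illustration of the radix-2^2 decomposition as commonly stated (two cascaded radix-2 butterflies, a trivial -j rotation between them, and one full twiddle multiplication per group of four outputs); the names `ROT` and `fft_radix22_dif` are hypothetical, and a plain radix-2 butterfly is assumed as the last step when N is not a power of 4.

```python
import cmath

ROT = [1, -1j, -1, 1j]  # powers of -j: trivial twiddles, no real multiplier needed


def fft_radix22_dif(x):
    """Recursive radix-2^2 DIF FFT, natural-order output (len(x) a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]  # final radix-2 stage when n is not a power of 4
    q = n // 4
    out = [0j] * n
    for k1 in (0, 1):
        for k2 in (0, 1):
            sign = -1 if k1 else 1
            sub = []
            for n3 in range(q):
                # BF I: two radix-2 butterflies on samples n/2 apart.
                upper = x[n3] + sign * x[n3 + 2 * q]
                lower = x[n3 + q] + sign * x[n3 + 3 * q]
                # BF II: combine with the trivial twiddle (-j)^(k1 + 2*k2).
                h = upper + ROT[(k1 + 2 * k2) % 4] * lower
                # Full multiplier: twiddle W_n^(n3 * (k1 + 2*k2)).
                sub.append(h * cmath.exp(-2j * cmath.pi * n3 * (k1 + 2 * k2) / n))
            # Remaining DFT of length n/4 produces X(k1 + 2*k2 + 4*k3).
            out[k1 + 2 * k2::4] = fft_radix22_dif(sub)
    return out
```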

Radix-2^2 Single-path Delay Feedback (R2^2SDF)
- Two types of butterflies: one identical to that of R2SDF; the other also contains the logic to implement the trivial twiddle factor multiplication
- A log2(N)-bit binary counter serves two purposes:
  - Synchronization controller
  - Address generation counter for reading the twiddle factors in each stage

Radix-2^2 Single-path Delay Feedback (R2^2SDF)
- Structure of BF2I and BF2II
  [Figure: BF2I and BF2II]
- Operation scheduling
  - First N/2 cycles: the 2-to-1 muxes in BF2I switch to 0 and the butterfly is idle. Input data is directed to the shift registers until they are filled.
  - Next N/2 cycles: the muxes turn to 1 and the butterfly computes a 2-point DFT with the incoming data and the data stored in the shift registers.
