
VLSI design investigation for low-cost, low-power FFT/IFFT processing in advanced VDSL transceivers

S. Saponara a,1, L. Fanucci b,*

a Department of Information Engineering, University of Pisa, Via Diotisalvi 2, I-56122 Pisa, Italy

b IEIIT, National Research Council, Via Diotisalvi 2, I-56122 Pisa, Italy

Received 23 May 2002; accepted 11 October 2002

Abstract

The problem of an efficient very large scale integration (VLSI) realization of the direct/inverse fast Fourier transform (FFT/IFFT) for digital subscriber line (DSL) applications is addressed in this paper. The design of scalable, very high-rate DSL (VDSL) modems calls for large, high-throughput complex FFT computations, while massive and fast deployment of the xDSL family makes low cost and low power key constraints. Throughout the paper we explore the design space at different levels (algorithm, arithmetic accuracy, architecture, technology) to achieve the best trade-off between processing performance, hardware complexity and power consumption. A programmable VLSI processor based on an FFT/IFFT cascade architecture plus pre/post-processing stages is discussed and characterized from the high-level choices down to the gate-level synthesis. Furthermore, low-power design techniques, based on clock gating and data-driven switching activity reduction, are used to decrease the power consumption by exploiting the correlation of the FFT/IFFT coefficients and the statistics of the input signals. To this aim both frequency-division and time-division duplex schemes have been considered. The effects of supply voltage scaling and its consequences on circuit performance are examined in detail, as well as the use of different target technologies. Synthesis results for a 0.18 µm CMOS standard-cells technology show that the processor is suitable for real-time modulation and demodulation in scalable full-rate VDSL modems (64–4096 complex FFT, 20 Msample/s) with a power consumption of a few tens of mW. These performances are very interesting when compared to state-of-the-art software implementations and custom VLSI ones. © 2002 Elsevier Science Ltd. All rights reserved.

Keywords: Very large scale integration architectures; Low-power circuits; Fast Fourier transform; Digital subscriber lines; Multicarrier modem

1. Introduction

High data rate communication is increasingly becoming desirable to provide fast internet access and interactive multimedia services for both business and residential customers. To this aim Digital Subscriber Line (DSL) technologies can deliver data at multi Mbits/s over the unshielded twisted pairs of the wired Public Switched Telephone Network [18]. Depending on the type of technology, the possible bit-rates may be between 0.5 and 8 Mbits/s for distances of several kilometers (Asymmetric DSL, hereafter referred to as ADSL) and more than 50 Mbits/s for distances of a few hundred meters (Very high-rate DSL, hereafter referred to as VDSL).

To achieve an efficient and reliable data transmission, multi-carrier or discrete multi-tone (DMT) modulation has been selected by the American National Standards Institute (ANSI) and by the European Telecommunications Standards Institute (ETSI) for ADSL and it is one candidate for VDSL [19]. The DMT symbol is the sum of independent quadrature amplitude modulated (QAM) subcarriers spread over the selected transmission bandwidth. In Fig. 1 the spectral allocation for VDSL and ADSL services is shown. Note that data are loaded differently on the subcarriers depending on the spectral shaping of the channel [1,9–11]. Fig. 2 presents the basic scheme of a DMT modem.

According to this scheme, multi-carrier modulation and demodulation are managed by the inverse and direct discrete Fourier transforms (IDFT and DFT), respectively. Different DFT/IDFT sizes correspond to different bandwidths and hence to different achievable bit-rates and target loop lengths. As an example, Table 1 summarizes the above parameters for a scalable VDSL system [8] considering a subcarrier spacing $\Delta f = 4.3125$ kHz, a 26 American wire gauge cable and a −140 dBm/Hz thermal noise. Obviously, the reported values depend on the selected xDSL scheme and on the channel conditions [1,8,11–15].

0026-2692/03/$ - see front matter © 2002 Elsevier Science Ltd. All rights reserved. PII: S0026-2692(02)00142-8

Microelectronics Journal 34 (2003) 133–148, www.elsevier.com/locate/mejo

1 Tel.: +39-050-568-557; fax: +39-050-568-522.

* Corresponding author. Tel.: +39-050-568-668; fax: +39-050-568-522. E-mail addresses: [email protected] (L. Fanucci), sergio.[email protected] (S. Saponara).

Such long DFT computations are rather time and power consuming, while massive and fast deployment of xDSL technologies requires low-cost, low-power and highly integrated modems. Thus the design of special-purpose processors for DFT and IDFT is a key issue for the success of xDSL. Table 2 details the target parameters for the complex-DFT/IDFT processor presented in this paper (with reference to Table 1, the DFT-length range has been extended down to 64 to be compliant with multi-carrier applications such as ADSL upstream [1] and wireless high data-rate transmission [11,15]).

1.1. Previous works

Both software implementations, based on digital signal processors (DSP), and custom implementations, based on very large scale integration (VLSI) architectures, have been proposed in literature [11,15–20] for DFT/IDFT processing in xDSL applications.

The best implementation for getting the highest flexibility is a complete software one but, as proved in Ref. [16] for a TMS320C62x, a DSP with a computational power of about 2000 MIPS (million instructions per second) does not meet the requirements of a VDSL modem (more than 5000 MIPS), where roughly 3200 MIPS are used by the DFT and IDFT blocks. Note that a maximum length of 2048 was considered in Ref. [16]. Moreover, a multi-DSP implementation is not suitable for low complexity and low power. For instance, at the cost of 0.5 mW/MIPS [16], the real-time DFT and IDFT processing on the C62x architecture entails a power consumption of about 1.6 W (3200 MIPS × 0.5 mW/MIPS). Therefore a VLSI design approach is mandatory.

Several dedicated chips for Time Division Duplex (TDD) and Frequency Division Duplex (FDD) multi-carrier modems have been proposed in the last years [15,17–20]. Although they integrate VLSI macrocells for DFT and IDFT processing, the maximum considered length is 2048 for a maximum throughput of 8.8 Msample/s, without exploiting the full capability of the VDSL scheme (see Table 1 and Fig. 1).

1.2. Paper outline

In this paper we explore the VLSI design space at different levels (algorithm, arithmetic, architecture, technology) to determine the circuit configuration which achieves the best power-area trade-off while meeting the requirements of advanced xDSL schemes. Then we further reduce the chip power consumption by adopting clock-gating strategies that, based on input signal statistics, turn off some portions of the circuit and reduce the switching activity to the minimal level required for real-time computation. To this aim both TDD and FDD application cases are considered.

Fig. 1. Power spectral density vs. frequency for ADSL and VDSL services.

Fig. 2. Scheme of a DMT modem.

After this introduction, Section 2 shows how an $N$-subcarrier DMT signal can be generated/received by a complex $2N$-IDFT/DFT (double-size approach, hereafter referred to as DS) or by a complex $N$-IDFT/DFT plus proper pre- and post-processing stages (single-size approach, hereafter referred to as SS). Section 3 presents the design of an intellectual property (IP) VLSI macrocell for fast direct/inverse Fourier transform (FFT/IFFT), exploring the design space at different levels. Starting from this macrocell and according to a bottom-up design strategy, in Section 4 we detail both DS and SS schemes for DMT and the relevant performances when implemented with today's CMOS technology. Section 5 deals with the characterization and optimization of the proposed architectures considering TDD and FDD xDSL modems. After a comparison with the state of the art in Section 6, some conclusions are drawn in Section 7.

2. DMT symbol analysis

2.1. Double-size approach

The transmitted DMT symbol can be modeled as

$$x(t) = \mathrm{Re}\left[\sum_{k=0}^{N-1} X_k\, e^{j2\pi k \Delta f t}\right]\mathrm{rect}_T(t) = \frac{1}{2}\left[\sum_{k=0}^{2N-1} C_k\, e^{j2\pi k \Delta f t}\right]\mathrm{rect}_T(t) \tag{1}$$

where

$$\mathrm{rect}_T(t) = \begin{cases} 1 & \text{when } t \in [0, T] \\ 0 & \text{otherwise} \end{cases}$$

$N$ being the total number of subcarriers, $\Delta f$ the subcarrier spacing, $T$ the symbol duration and $X_k$ the sequence of $N$ complex data produced by the QAM mapper. The symbol duration is given by $T = T_g + 1/\Delta f$, where $T_g$ takes into account the cyclic retransmission [1,11–14] of part of the IFFT output, i.e. the cyclic suffix and prefix in Fig. 2, which is adopted in multi-carrier modems to maintain the orthogonality of the subcarriers in case of a dispersive channel and to avoid self-echo problems. The sequence $C_k$ is obtained from $X_k$ according to the following expression (hereafter an overline denotes complex conjugation):

$$C_k = \begin{cases} X_k & \text{for } 1 \le k \le N-1 \\ \overline{X}_{2N-k} & \text{for } N+1 \le k \le 2N-1 \\ 0 & \text{for } k = 0, N \ \text{(i.e. pilot and DC carriers are not bit-loaded)} \end{cases} \tag{2}$$

By sampling $x(t)$ with a frequency $f_s = 1/T_s = 2N\Delta f$ we obtain the numerical sequence

$$x(q) = \frac{1}{2}\sum_{k=0}^{2N-1} C_k\, e^{j2\pi kq/2N} \quad \text{with } q = 0, \ldots, 2N-1. \tag{3}$$

The general expression of an $M$-point IDFT is

$$s_p = \frac{1}{M}\sum_{i=0}^{M-1} S_i\, W_M^{pi} \quad \text{with } p = 0, 1, \ldots, M-1 \tag{4}$$

where the twiddle factor $W_M$ is given by $e^{j2\pi/M}$. Comparing Eqs. (3) and (4) it is clear that the DMT symbol $x(q)$ ($2N$ real values) can be generated by means of a $2N$-complex IDFT of the sequence $C_k$. It is straightforward from Eq. (2) that $C_k = \overline{C}_{2N-k}$ (i.e. the coefficients exhibit a Hermitian symmetry), which guarantees that the output of the complex IDFT is a real sequence. The operations involved in the DS approach are summarized in Fig. 3.
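The double-size construction can be checked numerically with a minimal sketch. The code below is illustrative only: the helper names (`idft`, `hermitian_extend`, `dmt_symbol_ds`) and the toy QAM values are assumptions, and a direct $O(M^2)$ IDFT is used instead of an FFT for brevity. Comparing Eqs. (3) and (4) with $M = 2N$ gives the scale factor $x(q) = N \cdot \mathrm{IDFT}_{2N}\{C_k\}$ used here.

```python
import cmath

def idft(S):
    """M-point IDFT of Eq. (4): s_p = (1/M) * sum_i S_i * exp(j*2*pi*p*i/M)."""
    M = len(S)
    return [sum(S[i] * cmath.exp(2j * cmath.pi * p * i / M) for i in range(M)) / M
            for p in range(M)]

def hermitian_extend(X):
    """Build the 2N-point sequence C_k of Eq. (2) from N QAM symbols X_k.
    X[0] is ignored: carriers k = 0 and k = N are not bit-loaded."""
    N = len(X)
    C = [0j] * (2 * N)
    for k in range(1, N):
        C[k] = X[k]
        C[2 * N - k] = X[k].conjugate()
    return C

def dmt_symbol_ds(X):
    """Double-size (DS) generation: x(q) = N * IDFT_2N{C_k}, q = 0..2N-1."""
    N = len(X)
    return [N * v for v in idft(hermitian_extend(X))]

# Toy example with N = 4 subcarriers (hypothetical QAM values).
X = [0j, 1 + 1j, -1 + 3j, 3 - 1j]
x = dmt_symbol_ds(X)
print(max(abs(v.imag) for v in x))  # numerically zero: the 2N outputs are real
```

Half of the $2N$-point input is redundant (it is the conjugate mirror of the other half), which is the inefficiency the single-size approach removes.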

Table 1
Properties of a scalable VDSL system

| DFT length | Max. bandwidth (MHz) | Target bit-rate (Mbits/s) | Max. asymmetry upstream/downstream (Mbits/s) | Target length (kft) |
|---|---|---|---|---|
| 256 | 1.1 | 1.5 | 0.3/1.2 | 10 |
| 512 | 2.2 | 12 | 5/7 | 6 |
| 1024 | 4.4 | 25 | 5/20 | 4 |
| 2048 | 8.8 | 40 | 10/30 | 3 |
| 4096 | 17.6 | 70 | 25/45 | 2 |

Table 2
Typical DFT/IDFT parameters for a DMT modem

| Programmability range | Target throughput | I/O data precision |
|---|---|---|
| 64–4096 DFT-length | 20 Msample/s | 12–16 bits |

Fig. 3. DS DMT symbol generation.

2.2. Single-size approach

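In Eq. (3) the even-indexed values can be rewritten with $q = 2p$, $p = 0, 1, \ldots, N-1$:

$$\begin{aligned}
x(2p) &= \frac{1}{2}\sum_{k=0}^{2N-1} C_k\, e^{j2\pi k(2p)/2N} \\
&= \frac{1}{2}\left[\sum_{k=0}^{N-1} C_k\, e^{j2\pi kp/N} + \sum_{k=0}^{N-1} C_{k+N}\, e^{j2\pi kp/N} e^{j2\pi p}\right] \\
&= \frac{1}{2}\sum_{k=0}^{N-1} \left(C_k + C_{k+N}\right) e^{j2\pi kp/N} \\
&= \frac{N}{2}\,\mathrm{IDFT}\{C_k + C_{k+N}\}
\end{aligned} \tag{5}$$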

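while the odd-indexed values can be rewritten with $q = 2p+1$, $p = 0, 1, \ldots, N-1$:

$$\begin{aligned}
x(2p+1) &= \frac{1}{2}\sum_{k=0}^{2N-1} C_k\, e^{j2\pi k(2p+1)/2N} \\
&= \frac{1}{2}\left[\sum_{k=0}^{N-1} C_k\, e^{j2\pi kp/N} e^{j\pi k/N} + \sum_{k=0}^{N-1} C_{k+N}\, e^{j2\pi kp/N} e^{j2\pi p} e^{j\pi k/N} e^{j\pi}\right] \\
&= \frac{1}{2}\sum_{k=0}^{N-1} \left(C_k - C_{k+N}\right) e^{j2\pi kp/N} e^{j\pi k/N} \\
&= \frac{N}{2}\,\mathrm{IDFT}\{(C_k - C_{k+N})\, e^{j\pi k/N}\}
\end{aligned} \tag{6}$$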

Combining Eqs. (5) and (6) in Eq. (7) we can see how the $2N$ real values can be obtained by means of an $N$-complex IDFT plus proper pre-processing of the input sequence $C_k$ and a final multiplexing of the complex output to extract odd and even values:

$$\begin{aligned}
x(2p) + j\,x(2p+1) &= \frac{N}{2}\,\mathrm{IDFT}\{C_k + C_{k+N}\} + j\,\frac{N}{2}\,\mathrm{IDFT}\{(C_k - C_{k+N})\, e^{j\pi k/N}\} \\
&= \frac{N}{2}\,\mathrm{IDFT}\{(C_k + C_{k+N}) + (C_k - C_{k+N})\, e^{j\pi k/N} e^{j\pi/2}\} \\
&= \frac{N}{2}\,\mathrm{IDFT}\{y_k\}
\end{aligned} \tag{7}$$

The operations involved in the SS approach are summarized in Fig. 4.
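As a sanity check of the single-size derivation, the sketch below pre-processes a Hermitian sequence $C_k$ into $y_k$, runs one $N$-point IDFT, and de-multiplexes real and imaginary parts into even and odd samples. It then compares the result with the $2N$-point IDFT of the double-size approach. Function names and the toy input are hypothetical, and a direct $O(N^2)$ IDFT stands in for the FFT hardware.

```python
import cmath

def idft(S):
    """M-point IDFT of Eq. (4) with W_M = exp(j*2*pi/M)."""
    M = len(S)
    return [sum(S[i] * cmath.exp(2j * cmath.pi * p * i / M) for i in range(M)) / M
            for p in range(M)]

def dmt_symbol_ds(C):
    """Double-size reference: x(q) = N * IDFT_2N{C_k} (real for Hermitian C)."""
    N = len(C) // 2
    return [(N * v).real for v in idft(C)]

def dmt_symbol_ss(C):
    """Single-size generation following Eq. (7)."""
    N = len(C) // 2
    # Pre-processing: y_k = (C_k + C_{k+N}) + j*(C_k - C_{k+N})*exp(j*pi*k/N)
    y = [(C[k] + C[k + N]) + 1j * (C[k] - C[k + N]) * cmath.exp(1j * cmath.pi * k / N)
         for k in range(N)]
    z = [(N / 2) * v for v in idft(y)]      # z_p = x(2p) + j*x(2p+1)
    x = [0.0] * (2 * N)
    for p in range(N):                      # output multiplexing
        x[2 * p] = z[p].real
        x[2 * p + 1] = z[p].imag
    return x

# Hermitian toy sequence for N = 4: C_k = conj(C_{2N-k}), C_0 = C_N = 0.
C = [0j, 1 + 1j, -1 + 3j, 3 - 1j, 0j, 3 + 1j, -1 - 3j, 1 - 1j]
err = max(abs(a - b) for a, b in zip(dmt_symbol_ss(C), dmt_symbol_ds(C)))
print(err)  # numerically zero: both approaches give the same 2N real samples
```

The real/imaginary split only works because $x(2p)$ and $x(2p+1)$ are themselves real, which follows from the Hermitian symmetry of $C_k$.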

The above analysis can be repeated for the reception of the DMT symbol, achieving similar results. In this case the $M$-point IDFT must be replaced by a DFT computation whose general expression is:

$$S_i = \sum_{p=0}^{M-1} s_p\, W_M^{pi} \quad \text{with } i = 0, 1, \ldots, M-1 \tag{8}$$

$W_M = e^{-j2\pi/M}$ being the twiddle factor.

3. FFT/IFFT VLSI architecture

3.1. Algorithm and flow graph

Many FFT algorithms [21] have been proposed since Cooley and Tukey in 1965 [22] to reduce the number of operations (multiplications and sums) of the direct DFT implementation, whose complexity is in the order of $N^2$. Among all the proposed FFT algorithms, radix algorithms appear the most suitable for VLSI implementation due to the reduced number of operations and the high regularity of the structure.

The basic concept of radix FFT algorithms is that, if $N$ is not prime (i.e. $N = r_1 r_2$, where $r_1$ and $r_2$ are integers greater than 1), it is possible to divide the global $N$-point DFT into $r_1$ DFTs, each one $r_2$ points long, and $r_2$ DFTs, each one $r_1$ points long (multiplications by twiddle factors are also required). When $r_1$ and $r_2$ are not prime, this step can be repeated recursively $n$ times to obtain $N = r_1 r_2 r_3 \cdots r_{n+1}$. When $r_1 = r_2 = \cdots = r_{n+1} = r$ the algorithm is called radix-$r$, otherwise mixed-radix.
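The decomposition can be sketched as a short recursive routine. This is a generic textbook-style Cooley-Tukey split, not the flow graph adopted later in the paper: with $N = r_1 r_2$ it computes $r_1$ DFTs of $r_2$ points, applies the twiddle factors, and finishes with $r_2$ DFTs of $r_1$ points. The radix choice (4 when possible, otherwise 2) and all function names are assumptions made for illustration.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT with W_N = exp(-j*2*pi/N); used as reference and as butterfly."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def fft_mixed_radix(x):
    """Recursive mixed radix-2/4 FFT for N a power of two."""
    N = len(x)
    if N == 0 or N & (N - 1):
        raise ValueError("N must be a power of two")
    if N == 1:
        return list(x)
    r1 = 4 if N % 4 == 0 else 2          # prefer radix 4, fall back to radix 2
    r2 = N // r1
    # r1 DFTs, each r2 points long, on the decimated sequences x[n1], x[n1+r1], ...
    inner = [fft_mixed_radix(x[n1::r1]) for n1 in range(r1)]
    X = [0j] * N
    for k2 in range(r2):
        # twiddle multiplication W_N^(n1*k2)
        col = [inner[n1][k2] * cmath.exp(-2j * cmath.pi * n1 * k2 / N)
               for n1 in range(r1)]
        # r2 DFTs, each r1 points long (the radix-r1 butterfly)
        out = dft(col)
        for k1 in range(r1):
            X[k2 + r2 * k1] = out[k1]
    return X

x = [complex(n % 5, -(n % 3)) for n in range(32)]
err = max(abs(a - b) for a, b in zip(fft_mixed_radix(x), dft(x)))
print(err)  # numerically zero
```

For $N = 32$ the recursion uses two radix-4 levels and one radix-2 level, which is a mixed-radix factorization in the sense defined above.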

Typically radix-2, radix-4 or mixed radix-2/4 solutions are preferred to higher radices (such as radix 8, 16 and 32) because they lead to simpler and more regular VLSI architectures [1,11,19–21,26–35,40].

Based on radix factorization, different data flow graphs and relevant VLSI architectures can be considered for FFT computation [19,20,23–34,40]. Basically they can be grouped in four main classes: Full-Array [24,25], One-Column [26], Cascade [23,29–34] and Recursive [19,20,27,28,40]. Depending on the parallelization level, i.e. the sharing of the hardware resources, they reach a different trade-off between processing time, circuit complexity and power consumption.

The Full-Array architectures carry out all the required operations at the same time, which means that all the computational units are directly mapped on the chip. The corresponding circuit complexity becomes unacceptable for large $N$ values in today's CMOS technology. To reduce this complexity it is possible to exploit hardware resource multiplexing by realizing on-chip only one column of radix cells (One-Column architecture), only one row (Cascade architecture), or only a basic cell (Recursive architecture).

Table 3 compares the performance of different FFT/IFFT VLSI processors proposed in literature in the last years and suitable for ADSL and/or VDSL applications. Performances are expressed in terms of circuit complexity, processing speed, power consumption and maximum complex FFT/IFFT size. Note that the data have been collected from literature and thus some values are not available (NA) or are expressed in different ways: e.g. in full-custom design the circuit complexity is typically expressed as number of transistors, while in semi-custom design it is expressed as number of gates for the logic (1 gate ≈ 4 transistors) plus memory size. The reported values give an idea of the different trade-offs achieved.



Fig. 4. (a) SS DMT symbol generation; (b) pre-processing scheme.

Table 3
State of the art of FFT/IFFT processors suitable for xDSL applications

| Design | System word length (bit) | Tech. (µm) | Supply (V) | Processing speed | Circuit complexity | Power (mW) | Complex FFT size |
|---|---|---|---|---|---|---|---|
| Recursive [20] | 18 | 0.35 | 3.3 | 1024 FFT in 80 µs at 100 MHz (12.8 Msample/s) | 150 Kgates + 360 Kbits memory | 3000 | 1024 |
| Cascade [32] | 10 | 0.5 | 3.3 | 8192 FFT in 400 µs at 20 MHz (20 Msample/s) | 1.5 Mtransistors | 600 | 8192 |
| Recursive [27] | 16 | 0.35 | NA | 22.2 Msample/s at 100 MHz | 90 Kgates + 8192 bits ROM + 56 960 bits DPRAM | NA | 512 |
| Recursive [40] | 24 | 0.6 | NA | 4 Msample/s at 40 MHz | 96 Kgates + 11 270 bits ROM + 36 864 bits DPRAM | NA | 512 |
| Recursive [19] | 17 | 0.35 | 3.3 | 256 FFT in 20 µs at 44 MHz (12.8 Msample/s) | NA | NA | 256 |
| One-column [26] | 24 | 0.75 | 3.3 | 1024 FFT in 9.25 µs at 40 MHz for a 16-chip array (110 Msample/s) | 1.2 Mtransistors (19.2 Mtransistors for a 16-chip array) | 7700 [34] | 64 (1024 with a 16-chip array) |
| DSP C62X [19] | 32 | 0.18 | 1.8 | 8.8 Msample/s at 200 MHz | NA | 1600 | 2048 |
| Cascade [34] | 20 | 0.6 | 1.1 | 1024 FFT in 330 µs at 16 MHz (3 Msample/s) | 460 Ktransistors | 10 | 1024 |
| Cascade [34] | 20 | 0.6 | 3.3 | 1024 FFT in 30 µs at 173 MHz (33 Msample/s) | 460 Ktransistors | 845 | 1024 |

With reference to the xDSL specification and today's submicron technology, the One-Column approach still features a high circuit complexity (1024 complex processing units for a radix-4 4096-FFT) and poor length flexibility. For instance, the one-column single-chip implementation proposed in Ref. [26] is targeted for a maximum size of 64 complex points. The recursive approach achieves low hardware complexity (just 1 [27,28,40] or 2 [19,20] complex computation units) but it pays a price in processing speed and power consumption, since a large FFT requires a great number of iterations and memory read/write operations. Moreover, high clock frequency values do not permit the use of voltage scaling techniques for power saving. Up to now, recursive architectures proposed in literature reach a maximum length of 1024 [20] or 512 [19,27,40] complex points.

On the contrary, the cascade approach offers a good area-time trade-off and a remarkable length flexibility: FFT solutions ranging from 64 to 8192 complex points have already been proposed in literature targeting different application fields [29–34]. In particular, for our design we have selected the Bi and Jones [28] flow graph, properly modified to meet the mixed-radix requirements. Fig. 5 sketches the flow graph for the case example of a radix-4 16-FFT computation. The relevant circuit architecture can be derived by the projection of the signal flow graph onto a linear array of computational processors, each made up of a butterfly (BUTT) and a complex multiplier, concatenated with proper commutator (COM) blocks for data reshuffling between successive stages. As proved in Refs. [29,32], with respect to the classic pipeline approach [30] the selected flow graph features the following advantages: (i) saving in the number of adders ($3\log_4 N$ instead of $8\log_4 N$) and multipliers ($\log_4 N - 1$ instead of $3\log_4 N - 3$); (ii) increase of computational efficiency when the processor is interfaced to a continuous word-serial stream; (iii) reduction in the number of delay units ($2N$ instead of $2.5N$).

Fig. 5. Bi and Jones pipeline data flow.
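To get a feel for the savings listed in points (i) and (iii), the resource formulas can be evaluated for a concrete transform size. The helper below simply tabulates the expressions quoted from Refs. [29,32]; the function names and the dictionary layout are illustrative assumptions, and $N$ is restricted to powers of four so that $\log_4 N$ is an integer.

```python
def log4(N):
    """Integer log base 4; N must be a power of four."""
    e = N.bit_length() - 1
    if N < 4 or N & (N - 1) or e % 2:
        raise ValueError("N must be a power of four")
    return e // 2

def cascade_resources(N):
    """Return (selected flow graph, classic pipeline) counts for each resource."""
    s = log4(N)
    return {
        "adders": (3 * s, 8 * s),
        "multipliers": (s - 1, 3 * s - 3),
        "delay_units": (2 * N, 2.5 * N),
    }

print(cascade_resources(4096))
# {'adders': (18, 48), 'multipliers': (5, 15), 'delay_units': (8192, 10240.0)}
```

For a 4096-point transform this corresponds to 5 complex multipliers instead of 15, which matters because the multipliers later turn out to dominate the power budget.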

3.2. VLSI architecture design

Starting from the pipeline flow graph proposed in Section 3.1, the specification of a 4096 size (8192 in case of the DS approach) can be achieved by a cascade of 6 (7) stages: 5 (6) radix-4 processing units and a final mixed radix-2/radix-4 one. Between radix-2 and radix-4 factorization, the latter is preferable since it performs better in terms of output precision and requires fewer additions/multiplications [35,36]. Furthermore, the analysis proposed in Ref. [23] for pipeline architectures shows that higher radix values are preferable for low-power design in case of large FFT sizes. As a drawback, radix-4 algorithms are applicable only for $N$ equal to a power of four; so to implement an FFT with $N$ equal to a power of two, a mixed radix-2/4 algorithm has to be used.
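The stage count can be illustrated with a small planning helper. This is a sketch under an assumption not spelled out in the text: the final mixed stage runs in radix-4 mode when $\log_2 N$ is even and in radix-2 mode when it is odd, and shorter transforms leave the leading radix-4 stages unused. The function name and the returned keys are hypothetical.

```python
def stage_plan(N, total_stages=6):
    """Plan a cascade of radix-4 stages plus one final mixed radix-2/4 stage."""
    if N < 2 or N & (N - 1):
        raise ValueError("N must be a power of two")
    log2n = N.bit_length() - 1
    last_radix = 4 if log2n % 2 == 0 else 2       # mode of the final mixed stage
    bits_in_last = 2 if last_radix == 4 else 1
    radix4_stages = (log2n - bits_in_last) // 2   # each radix-4 stage resolves 2 bits
    active = radix4_stages + 1
    if active > total_stages:
        raise ValueError("transform too long for this cascade")
    return {
        "radix4_stages": radix4_stages,
        "last_stage_radix": last_radix,
        "unused_stages": total_stages - active,
    }

print(stage_plan(4096))                  # 5 radix-4 stages, last stage in radix-4 mode
print(stage_plan(8192, total_stages=7))  # 6 radix-4 stages, last stage in radix-2 mode
print(stage_plan(64))                    # 2 radix-4 stages, 3 stages unused
```

The two longest cases reproduce the 6-stage and 7-stage cascades quoted above for the SS and DS sizes.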

The block diagram of the proposed cascade architecture is sketched in Fig. 6 for the case of six stages. Thanks to an extensive use of pipelining, the overall throughput amounts to 1 complex sample/cycle and so an $N$-point FFT can be processed within $N$ clock cycles. Moreover, each stage can be dynamically by-passed, thus realizing all the required transform lengths from 64 to 4096. To further reduce energy consumption a clock-gating strategy is applied to the by-passed stages.

Note that, following an IP design reuse approach, the VHDL (very-high speed integrated circuits hardware description language) architecture description is fully parametric in terms of number of stages, input data word length (DWL), output word length (OWL), twiddle factor word length (TWL) and data path word length of each stage (SWL) (see footnote 2). Thus, before logic synthesis on the target CMOS technology, the IP user can select the desired trade-off between processing accuracy, circuit complexity and FFT size. In order to meet the xDSL system specifications, a detailed analysis of the above parameters has been carried out before silicon integration (see Section 3.3).

According to the data flow detailed in Fig. 5, the processing stages in Fig. 6 are made up of a butterfly, a complex multiplier and a commutator for data reshuffling between successive stages.

The generic butterfly (see Fig. 7, where $X_i$ are the input complex samples and $b_i$ are proper control signals with $i = 0, 1, 2, 3$) consists of adder/subtractor blocks and switches for internal shuffling. Fig. 8 details the structure of the commutator, which exhibits six delay blocks whose size $N_d$ varies from one stage to the other according to the rule $N_d = N/4^t$, $t$ being the index of the current stage (C0, C1, C2 in Fig. 8 are control signals).

The commutator and the computational units of the last stage are slightly different from the others (i) to allow for radix-2 or radix-4 according to the selected length (additional switching blocks are added to the schemes of Figs. 7 and 8); (ii) since no multipliers are required according to the flow graph in Section 3.1. Moreover, the last stage features a rounding unit for proper scaling of the final internal word length (SWL6 in Fig. 6) to the OWL. The inner structure of the complex multiplier is presented in Fig. 9. It is made up of four real Booth multipliers [37], one real adder, one real subtractor and four units for data rounding. According to a data-driven power saving approach detailed in Section 4.3, it also comprises hardware resources to avoid multiplications for twiddle factors equal to 1.

Fig. 6. VLSI architecture for a six stages case example.

Fig. 7. Radix-4 butterfly.

2 These parameters refer to real data and must be doubled for complex data.

3.2.1. Storage optimizations for data shuffling and twiddle coefficients

To implement the commutator delay units, dual-port RAMs (DPRAM) have been used instead of D-type edge-triggered flip-flops (DFF) for their area-power advantage (DFFs are used only for the last stage, where the required delay amounts to 1 clock cycle). It is worth mentioning that this approach is possible in the selected Bi and Jones flow graph due to the remarkable size of the commutator delay blocks. On the contrary, other pipeline architectures, such as the one presented by Ding et al. [31], feature deep memory fragmentation and so are not suitable for RAM technologies. This is paid in terms of area overhead and power consumption, which becomes particularly critical for long FFT implementations. This is evident from Table 4, where memory requirements for our architecture and for the Ding architecture are reported together with the relevant area and power consumption. The values in Table 4 are for the 64-point FFT case adopted in Ref. [31] for a 0.6 µm, 3.3 V standard-cells CMOS technology. Following the approach in Ref. [31], a switching activity of 0.4 has been considered for power consumption analysis. The proposed approach yields an area saving of 34% and a power saving of about 87% as far as data-shuffle storage is concerned. Note that this area-power saving is achieved despite the greater number of storage units required by our architecture with respect to Ref. [31].

As concerns twiddle coefficients, in each radix stage they are stored in a ROM whose size is $2\,\mathrm{TWL} \times M$, $M = N/4^{t-1}$ being the number of cells for the generic stage $t$ in the $N$-point cascade architecture. For the architecture sketched in Fig. 6, 5456 cells are required. This hardware cost can be reduced by exploiting the symmetry properties of the twiddle coefficients. As depicted in Fig. 10, only the values in region A of the complex plane need to be stored, since the rest of the coefficients can be generated simply by inversion and transposition of the stored values. Hence the amount of ROM for twiddle coefficient storage can be reduced by a factor of 8 according to the circuit of Fig. 11.
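Both the cell count and the symmetry trick can be sketched in a few lines. The first helper sums $M = N/4^{t-1}$ over the stages that contain a multiplier, which reproduces the 5456 cells quoted for the six-stage case. The second part stores only one octant of the unit circle and rebuilds every other twiddle by swapping and negating real and imaginary parts. Treating "region A" as the first octant, keeping $M/8 + 1$ entries (both octant end points), and all function names are assumptions of this sketch, not details taken from Figs. 10 and 11.

```python
import cmath
import math

def rom_cells(N, multiplier_stages):
    """Twiddle cells before optimization: sum of N / 4**(t-1) for t = 1..multiplier_stages."""
    return sum(N // 4 ** (t - 1) for t in range(1, multiplier_stages + 1))

def build_octant_rom(M):
    """Store (cos, sin) of 2*pi*i/M for the first octant only, i = 0..M/8."""
    return [(math.cos(2 * math.pi * i / M), math.sin(2 * math.pi * i / M))
            for i in range(M // 8 + 1)]

def twiddle(k, M, rom):
    """Rebuild exp(j*2*pi*k/M) from the octant ROM (M must be a multiple of 8)."""
    q, r = divmod(k % M, M // 4)      # quadrant index and offset inside the quadrant
    if r <= M // 8:
        c, s = rom[r]
    else:
        s, c = rom[M // 4 - r]        # transposition: mirror about 45 degrees
    # inversion: rotate by q * 90 degrees using sign changes and swaps only
    c, s = [(c, s), (-s, c), (-c, -s), (s, -c)][q]
    return complex(c, s)

print(rom_cells(4096, 5))             # 5456
rom = build_octant_rom(64)
err = max(abs(twiddle(k, 64, rom) - cmath.exp(2j * cmath.pi * k / 64)) for k in range(64))
print(err)                            # numerically zero
```

The DFT twiddles $e^{-j2\pi k/M}$ of Eq. (8) follow from the same table by conjugation, i.e. one more sign inversion.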

Fig. 8. Radix-4 commutator.

Fig. 9. Complex multiplier.




3.3. Arithmetic accuracy design

A detailed study of the machine number representation and scaling techniques is of primary importance in FFT VLSI design to cope with the growing size of partial results while limiting the word size of internal memories/data paths (SWL and TWL) and the loss of processing accuracy.

As already proved in Ref. [32], working with a full-accuracy floating-point representation implies too large an internal word length and is not suitable for single-chip implementation of large transform sizes. Thus, three main arithmetic approaches are considered in this section: fixed-point, block-floating point (BFP) and convergent block-floating point (CBFP).

The relevant analysis is based on a C program which evaluates the performance of each approach with respect to a 64-bit floating-point arithmetic. The analysis has been carried out in terms of signal-to-quantization-noise ratio (SQNR), mean square error (MSE) and absolute maximum error (MAE), taking into account different kinds of input signals and different values of the hardware parameters DWL, OWL, SWL, TWL and FFT length.

In the fixed-point arithmetic without scaling, a growth by a factor of 4 (2 bits in the SWL) has to be considered for each radix-4 stage. This way, for a 16-bit input and a 4096-transform size, the SWL6 of the last radix-4 stage becomes 28 bits (16 + 6 × 2). However, as detailed in Table 5, the same processing accuracy (defined as number of output true bits) can be achieved with a reduced word length by adopting proper scaling techniques.

The second approach features a scaling of data after every stage of the pipelined architecture discussed in Section 3.2. This method corresponds to a sort of BFP arithmetic where data are characterized by a single SWL value for the whole cascade architecture and by an exponent that is augmented at every computational block stage moving from the input towards the output.

On the contrary, in the CBFP arithmetic data are adaptively scaled depending on the maximum amplitude inside the computational stages. To maximize throughput performance, in CBFP we adopt one exponent for each block of data instead of one exponent for all data in each stage. As is evident from Fig. 5, the computation of the first $N/4$ outputs of the second stage depends only on the first $N/4$ outputs of the first stage. Thus, as soon as the first $N/4$ results of the first stage are computed, the evaluation of the maximum amplitude is performed and an exponent is associated with this block of data. Then, the computation of the first $N/4$ results of the second stage can start without waiting for the processing of all the $N$ results of the first stage. The same approach is followed for the other blocks of data in the first stage and is iterated from stage to stage with data blocks of smaller lengths ($N/4$, $N/16$ and so on). By avoiding the scaling of data when it is not necessary, the CBFP approach achieves the same performance as the BFP one with lower SWL and TWL values (i.e. a greater precision with the same values).
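The difference between the two scaling policies can be illustrated on a single block of stage outputs. The sketch below is a simplified software model, not the hardware described in the paper: samples are integer (re, im) pairs, scaling is a plain arithmetic right shift (truncation), the BFP policy always shifts by the 2 bits a radix-4 stage can add, and the CBFP policy shifts only as far as the block's peak requires. All names and the 8-bit word length are illustrative assumptions.

```python
def bfp_scale(block, growth_bits=2):
    """Unconditional scaling: shift every block by the worst-case radix-4 growth."""
    return [(re >> growth_bits, im >> growth_bits) for re, im in block], growth_bits

def cbfp_scale(block, swl):
    """Adaptive scaling: smallest shift that makes the block fit signed swl-bit words.
    Returns the scaled block and the exponent shared by the whole block."""
    limit = (1 << (swl - 1)) - 1
    peak = max(max(abs(re), abs(im)) for re, im in block)
    shift = 0
    while (peak >> shift) > limit:
        shift += 1
    return [(re >> shift, im >> shift) for re, im in block], shift

small = [(100, -20), (-60, 90)]   # already fits in 8 bits
large = [(400, -20), (-60, 90)]   # peak exceeds the 8-bit range

print(bfp_scale(small))       # ([(25, -5), (-15, 22)], 2): two bits lost needlessly
print(cbfp_scale(small, 8))   # ([(100, -20), (-60, 90)], 0): no scaling applied
print(cbfp_scale(large, 8))   # ([(100, -5), (-15, 22)], 2): scaled only when required
```

Because each block carries its own exponent, a later stage has to align exponents before combining data from different blocks; that bookkeeping is omitted here.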

As an example, some results in terms of MSE and MAE for different SWL and TWL values are presented in Figs. 12–14. In these figures a random input signal with a peak-to-average ratio (PAR) of roughly 15 dB (typical for VDSL systems) is considered. Error results refer to a normalized I/O data range [−1, 1]. Note that CBFP is more efficient than BFP when the same SWL and TWL values are considered. However, for high values of SWL the processing accuracy of the two arithmetic approaches saturates at the same level (Figs. 13 and 14). This effect is mainly due to the error which characterizes the transition of the output data representation from SWL bits to OWL bits. For instance, by considering in Fig. 14 the case example with OWL = 17, the MSE saturation level decreases with respect to the OWL = 16 case, thus allowing for a greater processing accuracy. Reported data are relevant to the FFT case but similar findings resulted for the IFFT case.

Table 4
Memory requirements for data reshuffling

| Architectures (64-point 24-bit FFT) | Storage units (bit) | Storage element | Area (mm²) | Power at 50 MHz (mW) |
|---|---|---|---|---|
| Proposed | 6048 | DPRAM/DFF | 2.72 | 50 |
| Ref. [31] | 5376 | DFF | 4.1 | 390 |

Fig. 10. Twiddle storage optimization by a factor 8.

Fig. 11. Circuit for $M$-point twiddle factor storage based on an $M/8$-word ROM.

Note that for proper scaling from SWL to OWL either truncation or rounding can be chosen. As sketched in Fig. 6, the latter approach has been adopted in our architecture. Indeed, as proved by computer simulations, the use of the rounding function improves arithmetic accuracy by up to 50% for an increase of a few percent in circuit complexity and power consumption (for instance see the power budget in Fig. 16).

Table 5 reports the sizing of the FFT/IFFT processor for the three considered arithmetics to achieve the maximum accuracy permitted by the saturation levels fixed by OWL = 16. In such a case the processor allows for a 16-bit output precision with a MAE of roughly $16.5 \times 10^{-6}$, that is to say in the worst case the error amounts to 0.54 LSB (see footnote 3). By using the square root of the MSE as a measure of the average error, the accuracy loss is less than 0.4 LSB. These error figures are well suited for xDSL applications: according to the FFT accuracy measure proposed in literature [38] they lead to 16 true-bits output for a dynamic range of 96 dB. If required, a greater arithmetic precision can be reached by properly setting the VHDL parameters before silicon integration. As an example, a precision of 17 bits with an average error of roughly 0.4 LSB can be obtained with SWL = 18 and OWL = 17 (CBFP arithmetic in Fig. 13).
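The conversion between absolute error, LSB units and dynamic range can be reproduced with a short calculation. The sketch assumes the LSB weight $2^{-(n-1)}$ of an $n$-bit two's complement representation of the normalized range [−1, 1] (as stated in the paper's footnote 3) and the usual $20\log_{10}(2^n)$ rule for dynamic range. The function names are illustrative.

```python
import math

def lsb_weight(n_bits):
    """LSB of an n-bit two's complement representation of the range [-1, 1]."""
    return 2.0 ** (-(n_bits - 1))

def error_in_lsb(abs_error, n_bits):
    """Express an absolute error on normalized data in units of LSB."""
    return abs_error / lsb_weight(n_bits)

def dynamic_range_db(n_bits):
    """Dynamic range of an n-bit word, about 6.02 dB per bit."""
    return 20 * math.log10(2 ** n_bits)

print(round(error_in_lsb(16.5e-6, 16), 2))   # 0.54
print(round(dynamic_range_db(16), 1))        # 96.3
```

With OWL = 16 the LSB is about $30.5 \times 10^{-6}$, so a worst-case error of $16.5 \times 10^{-6}$ stays close to half an LSB.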

From Table 5 it clearly emerges that CBFP requires the minimum values for SWL and TWL. However, the selection of the arithmetic is not straightforward. Indeed, this greater accuracy is achieved at the expense of a greater complexity of each computational stage, due to the need for an on-chip amplitude estimation unit and a greater delay in each commutator block to guarantee data synchronization. Thus the VLSI macrocell has been implemented in a 0.18 µm 6-metal-level standard-cells CMOS technology for the three considered arithmetic approaches. The CBFP ASIC also integrates a module for fast estimation of complex number amplitude, which has been detailed in Ref. [39]. Proper sizing of the VHDL parameters before logic synthesis has been accomplished according to the results of Table 5. Table 6 details the performance of the three ASICs in terms of gate complexity, ROM and DPRAM.

From the results of Tables 5 and 6 it is evident that the CBFP approach achieves the best trade-off between processing accuracy and hardware complexity for the target xDSL applications. Therefore CBFP has been selected for the implementation of the DS and SS DMT processors in Sections 4 and 5.

4. Single-size and double-size processors

4.1. CMOS implementation results

Starting from the FFT/IFFT macrocell detailed in Section 3 and according to a bottom-up design strategy, we have implemented both DS and SS schemes for DMT symbol generation/reception (the architectures of DS and SS have already been sketched in Figs. 3 and 4). Since the new hardware resources introduce quantization/rounding errors, we reapplied to the whole DS and SS schemes the procedure for optimal data path/memory word length sizing addressed in Section 3.3. For the case of OWL = DWL = 16 and full programmability from 64 to 4096 subcarriers, this resulted in SWL = 18 and TWL = 14 for both DS and SS cases.

After VHDL parameter definition, the two VLSI macrocells have been implemented in the 0.18 µm CMOS technology considered above. Table 7 details the relevant performance in terms of circuit complexity, computational performance and power consumption when the processors are used for modulation (IFFT plus pre-processing). Similar results have been achieved when the processors are used for demodulation (FFT plus post-processing). Power figures have been extracted by gate-level simulations (within the Synopsys Power Compiler environment) using a random sequence with 15 dB PAR as input stimuli and a power supply of 1.6 V. The clock frequency has been set to 20 MHz for the SS processor and 40 MHz for the DS one in order to attain the same target throughput of 20 complex Msample/s (to elaborate an $N$-subcarrier DMT symbol the SS chip processes $N$ complex values while the DS chip processes $2N$ complex values).

These results demonstrate that the SS approach, made up of a pre/post-processor plus a 4096-IFFT/FFT core (five radix-4 stages and a final mixed radix one), is more efficient than the DS approach, which is made up of an 8192-IFFT/FFT core (six radix-4 stages and a final mixed radix one). This can be explained considering that (i) the pre/post-processor (one complex multiplier plus two adders and one subtractor)

Table 5
Word lengths sizing for 16 bits output precision, DWL = OWL = 16, N = 4096

        CBFP    BFP     Fixed-point
SWL     17      21      28 (max)
TWL     12      13      12

3 In the 2's complement n-bit representation of the range [-1, 1] the least significant bit (LSB) amounts to 2^(-n-1).




Fig. 12. MSE performance vs. TWL, SWL = 19, DWL = OWL = 16, N = 4096.

Fig. 13. MSE performance vs. SWL, TWL = 14, DWL = 16 and 17, N = 4096.

Fig. 14. MAE performance vs. SWL, TWL = 14, DWL = OWL = 16, N = 4096.




set to 20 MHz and a random sequence with 15 dB PAR has been used as input signal.

From the results of Table 8 it emerges that the processor can cover a wide range of applications. By using a 1.95 V supply voltage and the DHS library, the SS processor achieves a computational power greater than 200 Msample/s. For the target xDSL application (20 Msample/s) the best solution is the DLL library with a 1.2 V supply voltage, which reduces power consumption to roughly 27 mW. Fig. 16 details the power consumption budget of the 4096-carriers SS DMT processor.
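The choice of operating point can be illustrated with a short Python sketch built on the Table 8 figures. The selection rule (lowest power among the points whose maximum frequency covers the target clock) and the quadratic voltage-scaling estimate are assumptions made for illustration; the names `TABLE8`, `best_operating_point` and `quadratic_estimate` are hypothetical. The power values are taken as measured at a 20 MHz clock, so ranking them for higher targets additionally assumes that power scales with frequency in the same way for every point.

```python
# Table 8: (library, supply voltage in V) -> (power in mW, max frequency in MHz)
TABLE8 = {
    ("DHS", 1.2): (28.65, 53), ("DHS", 1.6): (61.6, 75), ("DHS", 1.95): (100.98, 213),
    ("DLL", 1.2): (26.35, 46), ("DLL", 1.6): (49.58, 66), ("DLL", 1.95): (73.32, 190),
}


def best_operating_point(target_mhz):
    """Return the (library, voltage) with the lowest power that reaches target_mhz."""
    feasible = {key: val for key, val in TABLE8.items() if val[1] >= target_mhz}
    if not feasible:
        return None  # no library/voltage pair is fast enough
    return min(feasible, key=lambda key: feasible[key][0])


def quadratic_estimate(library, voltage, v_ref=1.2):
    """Estimate power by pure V^2 scaling from the v_ref entry (dynamic power only)."""
    p_ref = TABLE8[(library, v_ref)][0]
    return p_ref * (voltage / v_ref) ** 2


print(best_operating_point(20))    # target xDSL clock
print(best_operating_point(200))   # high-throughput corner
print(round(quadratic_estimate("DLL", 1.6), 2), TABLE8[("DLL", 1.6)][0])
```

For the 20 MHz target the rule picks the DLL library at 1.2 V, in line with the text. The quadratic estimate for DLL at 1.6 V falls slightly below the tabulated value, and the gap is larger for DHS, so pure V^2 scaling should be read only as a first-order guide.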

From the analysis in Fig. 16 it is clear that 50% of the power consumption is due to the activity of the complex multipliers. However, in the considered algorithm a large part of the overall multiplier activity can be saved. Fig. 17 shows that, among the M twiddle coefficients of the generic radix stage (16 in the example reported in the figure), 0.25M + 3 (7 in the example reported in the figure) are equal to 1. Thus, in a 64-point transform up to 40% of the twiddle factors are 1, while in a 4096-point transform this percentage is 30%. Therefore a considerable power saving (roughly 15% for VDSL applications) can be achieved by gating the clock of the complex multipliers whenever a (1,0) twiddle coefficient has to be processed. In such a case the signal C in Fig. 9 is set to 1, thus propagating the input data towards the output. The additional buffer unit guarantees the same delay as the multiplier for the synchronization with successive stages. Power simulation results demonstrate that the final SS chip with the multiplier clock-gating strategy dissipates about 23 mW for VDSL applications (input signal with 20 Msample/s, 15 dB PAR, 4096 carriers, 16 bit I/O) in the considered 0.18 µm, 1.2 V CMOS technology.
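The count of unity twiddle coefficients can be reproduced with a small Python sketch. The index model is an assumption chosen because it matches the 0.25M + 3 count: a radix-4 stage of size M uses the coefficients W_M^(k·n) with k = 0..3 and n = 0..M/4 − 1. The per-transform share further assumes multiplier stages of size N, N/4, ..., 16, with no twiddle multiplier after the last butterfly stage. The function names are hypothetical.

```python
def unity_twiddles(m):
    """Count twiddles W_m^(k*n) equal to 1 in one radix-4 stage of size m.

    Index model (an assumption): k = 0..3, n = 0..m/4 - 1, so the stage
    holds m coefficients and W_m^(k*n) = 1 exactly when k*n is a multiple of m.
    """
    return sum(1 for k in range(4) for n in range(m // 4) if (k * n) % m == 0)


def unity_share(n_points):
    """Share of unity twiddles over all multiplier stages of a radix-4 FFT.

    Stages of size n_points, n_points/4, ..., 16 each multiply n_points
    values; the last butterfly stage needs no twiddle multiplier.
    """
    ones = total = 0
    m = n_points
    while m >= 16:
        ones += (n_points // m) * unity_twiddles(m)
        total += n_points
        m //= 4
    return ones / total


print(unity_twiddles(16))                 # 0.25 * 16 + 3
print(round(100 * unity_share(64), 1))    # percentage for a 64-point transform
print(round(100 * unity_share(4096), 1))  # percentage for a 4096-point transform
```

Under these assumptions the sketch gives 7 unity coefficients for M = 16, about 37% for a 64-point transform and about 30% for a 4096-point transform, which agrees with the figures quoted in the text.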

5. FFT/IFFT processing in TDD and FDD modem

The duplex method determines how the overall throughput is shared between the downstream and upstream directions. In TDD schemes upstream and downstream are partitioned in time and the entire frequency band can be used in both directions in separate time epochs. The asymmetry ratio is defined by the ratio of time used for upstream and downstream transmissions. In FDD the available spectrum is divided into distinct frequency bands, where each band is used uniquely for either upstream or downstream. The asymmetry ratio is determined by the width and location of the spectrum bands allocated for each direction. Both TDD and FDD xDSL schemes have been proposed in the literature [1,13–19].

In TDD schemes a single FFT/IFFT processor can be used for DMT symbol generation/receipt, while in FDD schemes two separate processors are required for FFT and IFFT since upstream and downstream directions are allowed at the same time. The final circuit complexity is nearly twice that for TDD. However, a further power saving approach can be pursued by exploiting the fact that, for typical xDSL applications, a lot of carriers in both directions are zero.

For instance, if we consider the frequency plan foreseen by the ETSI standard (Table 9) [6,7] with a 4096-carriers DMT and a tone spacing of 4.3125 kHz, then about 41% of the carriers can be used for upstream, 26% for downstream and more than 30% is not exploited. For a 1024-carriers DMT in the same conditions, 13% of the carriers can be used for

Table 8
Power consumption and maximum frequency for different supply voltages and technology library versions

                        DHS                         DLL
                        1.2 V   1.6 V   1.95 V      1.2 V   1.6 V   1.95 V
Power (mW)              28.65   61.6    100.98      26.35   49.58   73.32
Max frequency (MHz)     53      75      213         46      66      190

Fig. 16. Power consumption budget of the SS DMT processor.

Fig. 17. Twiddle coefficients equal to 1 in the generic computational stage.




upstream, 69% for downstream and more than 18% is not exploited. The reported data also take into account the amateur radio bands (Fig. 1). This means that for the two example cases 59–87% of the IFFT inputs (upstream) are zero and 31–74% of the FFT inputs (downstream) are zero.
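The zero-input ranges follow directly from the usable-carrier shares quoted for the two example cases, as the short Python sketch below shows. The dictionary `USED` simply restates the percentages from the text; the names are hypothetical.

```python
# Usable-carrier shares quoted in the text (percent of all carriers).
USED = {
    4096: {"upstream": 41, "downstream": 26},
    1024: {"upstream": 13, "downstream": 69},
}


def zero_input_share(n_carriers, direction):
    """Percentage of transform inputs that are zero for one direction."""
    return 100 - USED[n_carriers][direction]


# IFFT inputs carry the upstream tones, FFT inputs the downstream ones.
ifft_zero = sorted(zero_input_share(n, "upstream") for n in USED)
fft_zero = sorted(zero_input_share(n, "downstream") for n in USED)
print(ifft_zero, fft_zero)
```

The two lists give the end points of the ranges: 59% and 87% zero inputs for the IFFT, 31% and 74% for the FFT.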

Since CMOS circuits do not dissipate dynamic power when they are not switching, power saving can be achieved by a clock-gating strategy which, exploiting the large number of zero inputs, turns off some portions of the processors and reduces the switching activity to the minimum level required for the computation.

For the IFFT processor we have added a unit that checks for blocks of null data at the input of the pre-processing and of the first radix-4 stage. If the group of data to process is null, the clock of the computational units (butterfly plus multiplier) is gated and a zero is propagated towards the output. The butterfly-style operation absorbs the null coefficients early in the signal path, and hence more non-zero coefficients are present on average in the successive radix stages. For these stages the clock-gating approach is not applied. As a matter of fact, the implementation of the clock-gating technique involves an overhead in terms of silicon area and capacitive load, so it has to be applied only where a significant reduction in power consumption can be obtained.

For the FFT processor the same approach has been applied to the first and second radix-4 stages. As proved by gate-level simulations for the above example cases, this approach allows for a power saving of up to 23% for the IFFT and up to 10% for the FFT with a negligible circuit overhead.
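The way zero inputs are absorbed by the first stages can be illustrated with a toy Python model of the zero/non-zero pattern. This is not the gating unit of the chip: it assumes a decimation-in-frequency ordering in which butterfly i of a block of size B works on positions i, i + B/4, i + B/2 and i + 3B/4, it assumes the worst case in which any non-zero input makes all four outputs non-zero, and the function name `radix4_stage` is hypothetical.

```python
def radix4_stage(nonzero, block):
    """Propagate a zero/non-zero mask through one radix-4 stage.

    `nonzero` is a list of booleans, `block` the transform size handled by
    this stage. Butterfly i of a block reads and writes the four positions
    i, i + block/4, i + block/2, i + 3*block/4 (DIF ordering, an assumption).
    Returns (number of all-zero butterflies, mask after the stage).
    """
    n = len(nonzero)
    quarter = block // 4
    out = [False] * n
    gated = 0
    for base in range(0, n, block):
        for i in range(quarter):
            idx = [base + i + q * quarter for q in range(4)]
            if any(nonzero[j] for j in idx):
                for j in idx:          # worst case: every output is non-zero
                    out[j] = True
            else:
                gated += 1             # clock of butterfly + multiplier gated
    return gated, out


# Toy case: 64 carriers, only the first 8 loaded (87.5% zero inputs).
mask = [k < 8 for k in range(64)]
gated1, mask = radix4_stage(mask, 64)
gated2, mask = radix4_stage(mask, 16)
print(gated1, gated2)  # gateable butterflies in stages 1 and 2 (16 per stage)
```

In this toy case half of the first-stage butterflies see only zeros, while none do in the second stage. This illustrates why the gating logic pays off only in the earliest stages; the actual savings depend on the carrier allocation and on the data ordering of the real architecture.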

6. Comparison vs. state-of-art

Comparing the performance of the SS processor (presented in Sections 3–5 and summarized in Tables 7 and 8) with the state of the art of FFT/IFFT VLSI architectures for xDSL applications (reviewed in Section 3.1 and Table 3), the following considerations can be made:

1. The proposed architecture allows for full-rate VDSL applications (4096 carriers, 20 MHz bandwidth) with a good trade-off between power consumption and circuit complexity, which makes it suitable for low-cost and low-power implementation in a single-chip modem. On the contrary, most known solutions support maximum FFT sizes equal to or lower than 1024 [19,20,26–28,33,34,40] and/or achieve a poor computational power [19,34,40], being compliant with ADSL but not exploiting the capability of advanced VDSL schemes. Moreover, our VLSI architecture is not only an FFT/IFFT macrocell but also comprises hardware resources (logic and memory) for proper pre- and post-processing of the DMT symbols.

2. We exploit the large number of zero carriers which characterizes FDD xDSL systems to further reduce the switching power consumption according to a data-driven clock-gating approach.

3. With respect to Ref. [32], which attains similar performance in terms of size and throughput and exhibits similar architectural/algorithmic solutions, the proposed approach saves power by an order of magnitude. This is due not only to the use of a more recent CMOS technology but also to optimization strategies such as voltage scaling and clock gating. In particular, clock gating has been adopted at three different levels: (i) to power down radix stages not used when sizes lower than 4096 are required; (ii) to reduce by 30% the switching activity of the multipliers, discarding operations with twiddle factors equal to 1; (iii) to reduce the switching activity of the first two stages in FDD applications, exploiting the large number of zero carriers. Moreover, Ref. [32] adopts a CBFP arithmetic with an internal word length of 10 bits which, as proved in Section 3, is not suitable for the accuracy requirements of advanced xDSL systems. As already addressed above, our architecture contains hardware resources for DMT symbol processing which are not present in Ref. [32].

4. Finally, the circuit has been designed as a parametric IP cell customizable by the user to achieve the desired trade-off between processing performance and hardware complexity. Therefore a greater flexibility, in different application cases and implementation technologies, is achieved with respect to full-custom VLSI designs.

7. Conclusions

The design of a flexible, low-cost and low-power FFT/IFFT processor for scalable and high-rate xDSL transceivers has been addressed in this paper. To this aim the VLSI design space has been explored at different levels (algorithm, arithmetic accuracy, architecture, technology) throughout the paper. Different possible solutions have been considered for symbol modulation/demodulation (SS and DS schemes), data flow graph (full-array, one-column, cascade, recursive) and internal processing arithmetic (fixed-point, BFP, CBFP). Logic synthesis results prove that the best solution is an SS programmable processor based

Table 9
ETSI frequency plan and relevant amateur radio bands

ETSI frequency plan for VDSL                Amateur radio bands (MHz)
Downstream (MHz)    Upstream (MHz)
0.138–3.000         3.00–5.10               1.81–2.00
5.100–7.050         7.05–12.00              3.50–4.00
                                            7.00–7.30
                                            10.10–10.15




on a radix-2/4 FFT/IFFT cascade architecture with CBFP arithmetic plus proper pre/post-processing stages. Implemented in a 0.18 µm standard-cells CMOS technology, it allows for 64–4096 FFT size and 16 bits output precision with a maximum throughput of 75 complex Msample/s. Furthermore, low-power design techniques, based on clock gating and data-driven switching activity reduction, are used to decrease the power consumption, exploiting the correlation of the twiddle coefficients and the large number of zero carriers in FDD transmission schemes. The effects of supply voltage scaling and its consequence on circuit performance are examined in detail, as well as the use of different target technologies (low-leakage and high-speed). Final implementation results in the 0.18 µm CMOS technology prove that the SS processor allows for full-rate VDSL applications (4096 carriers, 20 complex Msample/s) with a good trade-off between circuit complexity and power consumption, which amounts to a few tens of mW. On the contrary, most known solutions, based on DSP or dedicated custom architectures, support maximum FFT sizes equal to or lower than 1024 and/or achieve a poor computational power, being compliant with ADSL but not exploiting the capability of advanced VDSL schemes. Finally, the circuit has been designed as a parametric IP cell customizable by the user to achieve the desired trade-off between processing performance and hardware complexity. This way a greater flexibility in different application cases and implementation technologies is achieved with respect to full-custom VLSI designs.

Acknowledgements

This work has been carried out in the framework of the MEDEA project INCA: Integrated Network Copper Access. Discussions with C. Del Toso and C. Trang of STMicroelectronics, France and L. Serafini of Pisa University are gratefully acknowledged.

References

[1] J. Bingham, ADSL, VDSL and Multicarrier Modulation, Wiley, New York, 2000.

[2] ANSI standard T1.413, issue 2, Asymmetric Digital Subscriber Line (ADSL), 1998.

[3] VDSL Alliance SDMT VDSL draft standard proposal, ETSI STC/TM6, April 1998.

[4] ANSI T1E1.4, Very-high-speed Digital Subscriber Line (VDSL) metallic interface: functional requirements and common specifications, May 1998.

[5] ANSI T1E1.4/00-011, Draft specification, Very-high-speed Digital Subscriber Line (VDSL) metallic interface. Part 2. Technical specification of multi-carrier modulation (MCM) transceiver.

[6] ETSI TM6, Transmission and multiplexing; Access transmission systems on metallic access cables; Very-high-speed Digital Subscriber Line (VDSL). Part 1. Functional requirements, 1999.

[7] ETSI TM6, Transmission and multiplexing; Access transmission systems on metallic access cables; Very-high-speed Digital Subscriber Line (VDSL). Part 2. Transceiver specification, 2000.

[8] C. Del-Toso, B. Rezvani, M. Beck, J. Chow, G. Jin, S. Oelcer, J. Cioffi, J. Gustafsson, Scalable multi-mode VDSL (DMT option) for EFM-Cu. IEEE 802.3ah Ethernet in the First Mile Task Force, November 2001, web site: http://grouper.ieee.org/groups/802/3/efm/

[9] J. Bingham, Multicarrier modulation for data transmission: an idea whose time has come, IEEE Commun. Mag. (1990) 5–14.

[10] P. Chow, J. Cioffi, J. Bingham, A practical discrete multitone transceiver loading algorithm for data transmission over spectrally shaped channel, IEEE Trans. Commun. 43 (6) (1995) 773–775.

[11] N. Weste, D. Skellern, VLSI for OFDM, IEEE Commun. Mag. (1998) 127–131.

[12] K. Sistanizadeh, P. Chow, J. Cioffi, Multi-tone transmission for asymmetric digital subscriber lines (ADSL), Proc. IEEE Int. Conf. Commun. 2 (1993) 756–760.

[13] F. Sjoberg, M. Isaksson, R. Nilsson, P. Oding, S. Wilson, P. Borjesson, Zipper: a duplex method for VDSL based on DMT, IEEE Trans. Commun. 47 (8) (1999) 1245–1252.

[14] F. Sjoberg, R. Nilsson, M. Isaksson, P. Deutgen, P. Oding, P. Borjesson, Performance evaluation of the Zipper duplex method, Proc. IEEE Int. Conf. Commun. 2 (1998) 1035–1039.

[15] W. Eberle, V. Derudder, G. Vanwijnsberghe, M. Vergara, L. Deniere, L. Van der Perre, M. Engels, I. Bolsen, H. De Man, 80-Mb/s QPSK and 72-Mb/s 64-QAM flexible and scalable digital OFDM transceiver ASICs for Wireless Local Area Networks in the 5-GHz band, IEEE J. Solid State Circuits 36 (11) (2001) 1829–1838.

[16] B. Wiese, J. Chow, Programmable implementations of xDSL transceiver systems, IEEE Commun. Mag. (2000) 114–119.

[17] L. Kiss, E. Hanssens, K. Adriaensen, M. Huysmans, C. Gendarme, F. Van Beylen, H. Van de Weghe, A customizable DSP for DMT-based ADSL modem, Proc. IEEE 24th Eur. Solid State Circuits Conf. (1998) 349–353.

[18] D. Mestdagh, R. Nilsson, M. Isaksson, P. Oding, Zipper VDSL: a solution for robust duplex communication over telephone lines, IEEE Commun. Mag. (2000) 90–96.

[19] D. Veithen, P. Spruyf, T. Pollet, M. Peeters, S. Braet, O. Van de Wiel, H. Van de Weghe, A 70 Mb/s variable-rate DMT-based modem for VDSL, Proc. IEEE Int. Solid State Circuits Conf. (1999) 248–249.

[20] M. Rudberg, M. Sanberg, K. Ekholm, Design and implementation of an FFT processor for VDSL, Proc. IEEE Asia–Pacific Conf. Circuits Syst. (1998) 611–614.

[21] P. Duhamel, M. Vetterli, Fast Fourier transform: a tutorial review and a state of the art, Signal Process. 19 (1990) 259–299.

[22] J. Cooley, J. Tukey, An algorithm for machine computation of complex Fourier series, Math. Comput. 19 (1965) 297–301.

[23] S. Hong, S. Kim, M. Papaefthymiou, W. Stark, Power-complexity analysis of pipelined VLSI FFT architectures for low energy wireless communication applications, Proc. IEEE MWSCAS (1999) 313–316.

[24] N. Murphy, M. Swamy, On the real-time computation of DFT and DCT through systolic architecture, IEEE Trans. Signal Process. 42 (1994) 988–991.

[25] L. Chang, M.-Y. Wu, A new systolic array for discrete Fourier transform, IEEE Trans. Acoust. Speech Signal Process. 36 (1988) 1165–1167.

[26] T. Chen, G. Sunada, J. Jin, COBRA: a 100-MOPS single-chip programmable and expandable FFT, IEEE Trans. VLSI Syst. 7 (2) (1999) 174–182.

[27] C. Wang, C. Chang, A new memory based FFT processor for VDSL transceivers, Proc. IEEE Int. Symp. Circuits Syst., ISCAS 4 (2001) 670–673.

[28] E. Cetin, R. Morling, I. Kale, An integrated 256-point complex FFT processor for real-time spectrum analysis and measurement, Proc. IEEE Conf. Instrum. Meas. Technol. 1 (1997) 96–101.

[29] G. Bi, E.V. Jones, A pipelined FFT processor for word-sequential data, IEEE Trans. Acoust. Speech Signal Process. 37 (12) (1989) 1982–1985.




[30] L. Rabiner, B. Gold, Theory and Application of Digital Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 1975, chapter 10.

[31] T. Ding, J. McCanny, Rapid design of application specific FFT cores, IEEE Trans. Signal Process. 47 (5) (1999) 1371–1381.

[32] E. Bidet, D. Castelain, C. Joanblanq, P. Senn, A fast single-chip implementation of 8192 complex point FFT, IEEE J. Solid State Circuits 30 (3) (1995) 300–304.

[33] S. He, M. Torkelson, Designing pipeline FFT processor for OFDM (de)modulation, Proc. URSI Int. Symp. Signals Syst. Electron. (1998) 257–262.

[34] B.M. Baas, A 9.5 mW 330 µs 1024-point FFT processor, Proc. IEEE Custom Integr. Circuits Conf. (1998) 127–130.

[35] V. Prakash, V. Rao, Fixed point error analysis of radix-4 FFT, Signal Process. 3 (1981) 123–133.

[36] T. Thong, B. Liu, A fixed-point fast Fourier transform error analysis, IEEE Trans. Audio Electroacoust. 17 (1969) 151–157.

[37] N. Weste, K. Eshraghian, Principles of CMOS VLSI Design, Addison-Wesley, Reading, MA, 1985.

[38] W. Hui, T. Ding, J. McCanny, R. Woods, Error analysis of FFT architectures for digital video applications, Proc. IEEE ICECS (1996) 820–823.

[39] L. Fanucci, M. Rovini, A low-complexity and high-resolution algorithm for the magnitude approximation of complex numbers, IEICE Trans. Fundam. E85-A (7) (2002) 651–654.

[40] C. Chang, C. Wang, Y. Chang, Efficient VLSI architectures for fast computation of the discrete Fourier transform and its inverse, IEEE Trans. Signal Process. 48 (11) (2000) 3206–3215.

