
Area-Delay-Power Efficient Fixed-Point LMS Adaptive Filter with Low Adaptation-Delay

Pramod Kumar Meher, Senior Member, IEEE, and Sang Yoon Park, Member, IEEE

The authors are with the Institute for Infocomm Research, Singapore 138632 (e-mail: [email protected]; [email protected]). Sang Yoon Park is the corresponding author. A preliminary version of this paper was presented at the IEEE International Midwest Symposium on Circuits and Systems (MWSCAS 2011).

Abstract—In this paper, we present an efficient architecture for the implementation of the delayed least mean square (DLMS) adaptive filter. To achieve a lower adaptation-delay and an area-delay-power efficient implementation, we use a novel partial product generator and propose a strategy for optimized, balanced pipelining across the time-consuming combinational blocks of the structure. From synthesis results, we find that the proposed design offers nearly 17% less area-delay-product (ADP) and nearly 14% less energy-delay-product (EDP) than the best of the existing systolic structures, on average, for filter lengths N = 8, 16, and 32. We also propose an efficient fixed-point implementation scheme of the proposed architecture and derive the expression for the steady-state error. We show that the steady-state mean squared error obtained from the analytical result matches the simulation result. Moreover, we propose a bit-level pruning of the architecture, which provides nearly 20% saving in ADP and 9% saving in EDP over the structure before pruning, without noticeable degradation of the steady-state error performance.

Index Terms—Adaptive filters, least mean square algorithms, fixed-point arithmetic, circuit optimization.

I. INTRODUCTION

Adaptive digital filters find wide application in several digital signal processing (DSP) tasks, e.g., noise and echo cancellation, system identification, and channel equalization. The finite impulse response (FIR) filter whose weights are updated by the famous Widrow-Hoff least mean square (LMS) algorithm [1] may be considered the simplest known adaptive filter. The LMS adaptive filter is the most popular one, not only due to its simplicity but also due to its satisfactory convergence performance [2]. The direct-form LMS adaptive filter involves a long critical path due to the inner-product computation needed to obtain the filter output. The critical path must be reduced by pipelined implementation when it exceeds the desired sample period. Since the conventional LMS algorithm does not support pipelined implementation, owing to its recursive behavior, it is modified to a form called the delayed LMS (DLMS) algorithm [3]-[5], which allows pipelined implementation of the filter.

A lot of work has been done to implement the DLMS algorithm in systolic architectures to increase the maximum usable frequency [3], [6], [7], which can support very high sampling frequency due to pipelining after every multiply-accumulate (MAC) operation. However, these structures involve an adaptation-delay of ∼N cycles for a filter of length N, which is quite high for large-order filters.


Since the convergence performance degrades considerably for large adaptation-delay, Ramanathan and Visvanathan [8] have proposed a modified systolic architecture to reduce the adaptation-delay. A transpose-form LMS adaptive filter is suggested in [9], where the filter output at any instant depends on delayed versions of the weights, and the number of delays in the weights varies from 1 to N. Van and Feng [10] have proposed a systolic architecture in which relatively large processing elements (PEs) are used to achieve a lower adaptation-delay with a critical path of one MAC operation. Ting et al. [11] have proposed a fine-grained pipelined design that limits the critical path to a maximum of one addition-time, which supports high sampling frequency but involves a lot of area overhead for pipelining and higher power consumption than [10] due to its large number of pipeline latches. A further effort was made by Meher and Maheshwari [12] to reduce the number of adaptation-delays. Meher and Park have proposed a 2-bit multiplication cell and used it with an efficient adder-tree for pipelined inner-product computation to minimize the critical path and silicon area without increasing the number of adaptation-delays [13], [14].

The existing work on the DLMS adaptive filter does not discuss fixed-point implementation issues, e.g., the location of the radix point, the choice of word-length, and quantization at various stages of computation, although these directly affect the convergence performance, particularly due to the recursive behavior of the LMS algorithm. Therefore, fixed-point implementation issues are given adequate emphasis in this paper. Besides, we present here an optimization of our previously reported design [13], [14] to reduce the number of pipeline delays along with the area, sampling period, and energy consumption. The proposed design is found to be more efficient in terms of area-delay product and energy-delay product compared with the existing structures. The novel contributions of this paper are as follows:

i) A novel design strategy for reducing the adaptation delay, which strongly affects the convergence speed.

ii) A fixed-point implementation scheme of the modified DLMS algorithm.

iii) Bit-level optimization of the computing blocks without adversely affecting the convergence performance.

In the next section, we review the DLMS algorithm, and in Section III we describe the proposed optimized architecture for its implementation. Section IV deals with fixed-point implementation considerations and simulation studies of the convergence of the algorithm. In Section V, we discuss the synthesis of the proposed architecture and its comparison with the existing architectures, followed by conclusions in Section VI.



Fig. 1. Structure of conventional delayed LMS adaptive filter.

II. REVIEW OF DELAYED LMS ALGORITHM

The weights of an LMS adaptive filter during the nth iteration are updated according to the following equations [2]:

w_{n+1} = w_n + µ · e_n · x_n    (1a)

where

e_n = d_n − y_n and y_n = w_n^T · x_n    (1b)

and where the input vector x_n and the weight vector w_n at the nth iteration are, respectively, given by

x_n = [x_n, x_{n−1}, · · · , x_{n−N+1}]^T
w_n = [w_n(0), w_n(1), · · · , w_n(N − 1)]^T

d_n is the desired response, y_n is the filter output, and e_n denotes the error computed during the nth iteration; µ is the step-size, and N is the number of weights used in the LMS adaptive filter.

In pipelined designs with m pipeline stages, the error e_n becomes available only after m cycles, where m is called the "adaptation-delay." The DLMS algorithm therefore uses the delayed error e_{n−m}, i.e., the error corresponding to the (n−m)th iteration, for updating the current weights instead of the most recent error. The weight-update equation of the DLMS adaptive filter is given by

w_{n+1} = w_n + µ · e_{n−m} · x_{n−m}    (2)

The block diagram of the DLMS adaptive filter is depicted in Fig. 1, where the adaptation-delay of m cycles amounts to the delay introduced by the whole adaptive filter structure, consisting of the FIR filtering and the weight-update process.
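As an algorithmic illustration, the DLMS recursion of (1b) and (2) can be sketched in a few lines of behavioral C++. This is a floating-point model of the equations only, not the paper's fixed-point hardware, and the container names are our own.

#include <deque>
#include <utility>
#include <vector>

// Behavioral sketch of DLMS: y_n = w^T x_n, e_n = d_n - y_n, and
// w_{n+1} = w_n + mu * e_{n-m} * x_{n-m}. The m-cycle adaptation
// delay is modeled with a FIFO of (error, input-vector) pairs.
struct DLMS {
    std::vector<double> w;   // N filter weights
    double mu;               // step-size
    std::size_t m;           // adaptation delay in cycles
    std::deque<std::pair<double, std::vector<double>>> fifo;

    double step(const std::vector<double>& x, double d) {
        double y = 0.0;                         // y_n = w^T x_n
        for (std::size_t i = 0; i < w.size(); ++i) y += w[i] * x[i];
        double e = d - y;                       // e_n = d_n - y_n
        fifo.emplace_back(e, x);
        if (fifo.size() > m) {                  // e_{n-m}, x_{n-m} available
            auto [em, xm] = fifo.front();
            for (std::size_t i = 0; i < w.size(); ++i)
                w[i] += mu * em * xm[i];        // weight update of (2)
            fifo.pop_front();
        }
        return e;
    }
};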

It is shown in [12] that the adaptation-delay of the conventional LMS can be decomposed into two parts: one part is the delay introduced by the pipeline stages in the FIR filtering, and the other part is due to the delay involved in pipelining the weight-update process. Based on such a decomposition of the delay, the DLMS adaptive filter can be implemented by the structure shown in Fig. 2. Assuming that the latency of the error computation is n1 cycles, the error computed by the structure at the nth cycle is e_{n−n1}, which is used with the input samples delayed by n1 cycles to generate the weight-increment term. The weight-update equation of the modified DLMS algorithm is given by

w_{n+1} = w_n + µ · e_{n−n1} · x_{n−n1}    (3a)

where

e_{n−n1} = d_{n−n1} − y_{n−n1}    (3b)

and

y_n = w_{n−n2}^T · x_n    (3c)


Fig. 2. Structure of modified delayed LMS adaptive filter.


Fig. 3. The convergence performance of system identification with the LMS and modified DLMS adaptive filters.


We can notice that during the weight update the error delayed by n1 cycles is used, while the filtering unit uses the weights delayed by n2 cycles. The modified DLMS algorithm decouples the computations of the error-computation block and the weight-update block, and allows optimal pipelining of both these sections separately, by feed-forward cut-set retiming, to minimize the number of pipeline stages and the adaptation delay.
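The decoupled recursion (3a)-(3c) can likewise be captured in a behavioral sketch: the filtering path reads weights that are n2 cycles old, while the update path consumes the error and input that are n1 cycles old. This is a software model of the equations under our own naming, not of the pipelined datapath itself.

#include <deque>
#include <utility>
#include <vector>
using Vec = std::vector<double>;

// Modified DLMS of (3a)-(3c): two independent delay lines model the
// n1-cycle error latency and the n2-cycle weight latency.
struct ModifiedDLMS {
    Vec w; double mu; std::size_t n1, n2;
    std::deque<Vec> wq;                      // weights, delayed by n2
    std::deque<std::pair<double, Vec>> ed;   // (error, input), delayed by n1

    double step(const Vec& x, double d) {
        wq.push_back(w);
        const Vec& wd = wq.size() > n2 ? wq.front() : wq.back(); // w_{n-n2}
        double y = 0.0;                      // (3c): y_n = w_{n-n2}^T x_n
        for (std::size_t i = 0; i < w.size(); ++i) y += wd[i] * x[i];
        if (wq.size() > n2) wq.pop_front();
        double e = d - y;                    // (3b), consumed n1 cycles later
        ed.emplace_back(e, x);
        if (ed.size() > n1) {                // (3a): w += mu e_{n-n1} x_{n-n1}
            auto [eo, xo] = ed.front();
            for (std::size_t i = 0; i < w.size(); ++i) w[i] += mu * eo * xo[i];
            ed.pop_front();
        }
        return e;
    }
};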

Adaptive filters with different n1 and n2 are simulated for a system-identification problem. The 10-tap bandpass filter with impulse response

h_n = sin(w_H(n − 4.5)) / (π(n − 4.5)) − sin(w_L(n − 4.5)) / (π(n − 4.5))    (4)

for n = 0, 1, 2, . . . , 9, and h_n = 0 otherwise,

is used as the unknown system, as in [10]. w_H and w_L represent the high and low cutoff frequencies of the passband and are set to w_H = 0.7π and w_L = 0.3π, respectively. The step-size µ is set to 0.4. A 16-tap adaptive filter identifies the unknown system with a Gaussian random input x_n of zero mean and unit variance. In all cases, the output of the known system is of unity power and is contaminated with white Gaussian noise of −70 dB strength. Fig. 3 shows the learning curve of the MSE of the error signal e_n, averaged over 20 runs, for the conventional LMS adaptive filter (n1 = 0, n2 = 0) and the DLMS adaptive filters with (n1 = 5, n2 = 1) and (n1 = 7, n2 = 2). It can be seen that as the total number of delays increases, the convergence slows down, while the steady-state MSE remains almost the same in all cases. In this example, the MSE difference between the cases (n1 = 5, n2 = 1) and (n1 = 7, n2 = 2) after 2000 iterations is less than 1 dB on average.
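For reference, the unknown-system impulse response of (4) with the settings above can be generated as follows. This is a sketch of the experimental setup only (the function name is ours); the denominator never vanishes since n − 4.5 is nonzero for integer n.

#include <cmath>
#include <vector>

// 10-tap bandpass impulse response of (4), wH = 0.7*pi, wL = 0.3*pi.
std::vector<double> bandpass_h() {
    const double pi = 3.14159265358979323846;
    const double wH = 0.7 * pi, wL = 0.3 * pi;
    std::vector<double> h(10);
    for (int n = 0; n < 10; ++n) {
        double t = n - 4.5;   // never zero for integer n
        h[n] = std::sin(wH * t) / (pi * t) - std::sin(wL * t) / (pi * t);
    }
    return h;   // h_n = 0 outside n = 0..9
}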



Fig. 4. Proposed structure of the error-computation block.

III. THE PROPOSED ARCHITECTURE

As shown in Fig. 2, there are two main computing blocks in the adaptive filter architecture: (i) the error-computation block and (ii) the weight-update block. In this section, we discuss the design strategy of the proposed structure for minimizing the adaptation-delay in the error-computation block, followed by the weight-update block.

A. Pipelined Structure of Error-Computation Block

The proposed structure for the error-computation unit of an N-tap DLMS adaptive filter is shown in Fig. 4. It consists of N 2-bit partial product generators (PPGs) corresponding to the N multipliers, and a cluster of L/2 binary adder-trees followed by a single shift-add tree. Each sub-block is described in detail below.

1) Structure of the Partial Product Generator: The structure of each PPG is shown in Fig. 5. It involves L/2 2-to-3 decoders and an equal number of AND/OR cells (AOCs)¹. Each 2-to-3 decoder takes a 2-bit digit (u1 u0) as input and produces three outputs b0 = u̅1 · u0, b1 = u1 · u̅0, and b2 = u1 · u0, such that b0 = 1 for (u1 u0) = 1, b1 = 1 for (u1 u0) = 2, and b2 = 1 for (u1 u0) = 3. The decoder outputs b0, b1, and b2, along with w, 2w, and 3w, are fed to an AOC, where w, 2w, and 3w are in two's-complement representation and sign-extended to (W + 2) bits each. To take care of the sign of the input samples while computing the partial product corresponding to the most significant digit (MSD), i.e., (u_{L−1} u_{L−2}) of the input sample, AOC-(L/2−1) is fed with w, −2w, and −w as inputs, since (u_{L−1} u_{L−2}) can take the four possible values 0, 1, −2, and −1.

2) Structure of the AND/OR Cells: The structure and function of an AOC are depicted in Fig. 6. Each AOC consists of three AND cells and two OR cells, whose structure and function are shown in Fig. 6(b) and Fig. 6(c), respectively. Each AND cell takes an n-bit input D and a single-bit input b, and consists of n AND gates. It distributes all n bits of input D to its n AND gates as one of their inputs; the other input of each of the n AND gates is the single-bit input b. As shown in Fig. 6(c), each OR cell similarly takes a pair of n-bit input words and has n OR gates; the pair of bits in the same bit position of B and D is fed to the same OR gate.

¹We have assumed the word-length of the input, L, to be even, which is valid in most practical cases.

The output of an AOC is w, 2w, or 3w, corresponding to the decimal value 1, 2, or 3 of the 2-bit input (u1 u0), respectively. The decoder along with the AOC thus performs a multiplication of the input operand w by a 2-bit digit (u1 u0), so that the PPG of Fig. 5 performs L/2 parallel multiplications of the input word w by 2-bit digits to produce the L/2 partial products of the product word w·u.
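The combined decoder/AOC behavior can be checked bit-exactly in software: each of the three lower 2-bit digits selects 0, w, 2w, or 3w, the MSD selects 0, w, −2w, or −w, and the shifted selections sum to w·u. The following self-checking sketch (our own model, not RTL) verifies this exhaustively for L = 8:

#include <cassert>
#include <cstdint>

// Software model of the PPG of Fig. 5 for L = 8. The bits of the MSD
// (u7 u6) carry two's-complement weights (-2, +1) at place value 4^3,
// hence the selection {0, w, -2w, -w} for AOC-(L/2-1).
int32_t ppg_product(int32_t w, int8_t u) {
    static const int lsd_sel[4] = {0, 1, 2, 3};
    static const int msd_sel[4] = {0, 1, -2, -1};
    int32_t sum = 0;
    for (int j = 0; j < 4; ++j) {
        int digit = (static_cast<uint8_t>(u) >> (2 * j)) & 3; // (u_{2j+1} u_{2j})
        int sel = (j == 3) ? msd_sel[digit] : lsd_sel[digit];
        sum += sel * w * (1 << (2 * j));       // partial product at place 4^j
    }
    return sum;
}

int main() {
    for (int w = -128; w < 128; ++w)           // exhaustive 8-bit check
        for (int u = -128; u < 128; ++u)
            assert(ppg_product(w, static_cast<int8_t>(u)) == w * u);
    return 0;
}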

3) Structure of the Adder-Tree: Conventionally, we would perform the shift-add operation on the partial products of each PPG separately to obtain each product value, and then add all the N product values to compute the desired inner product. However, the shift-add operation to obtain each product value increases the word-length, and consequently increases the adder size of the N − 1 additions of the product values. To avoid such an increase in the word-size of the adders, we add all the N partial products of the same place value from all the N PPGs by a single adder-tree.
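In software terms, the scheme first reduces each column of same-place-value partial products with its own adder-tree (the q_j of Fig. 7) and only then applies the radix-4 place values in the shift-add tree, so that the word-length grows in one place instead of N. A sketch for N = 4 and L = 8, with illustrative names:

#include <array>
#include <cstdint>

// Inner-product scheme of Figs. 4 and 7 for N = 4, L = 8. p[k][j] is
// the j-th partial product of the k-th PPG. Each column j is reduced
// by a log2(N)-stage adder-tree into q[j]; the shift-add tree then
// combines the four column sums with shifts of 2 and 4 bit positions.
int64_t inner_product(const std::array<std::array<int32_t, 4>, 4>& p) {
    std::array<int64_t, 4> q{};
    for (int j = 0; j < 4; ++j)
        for (int k = 0; k < 4; ++k)
            q[j] += p[k][j];                 // adder-tree for column j
    int64_t s01 = q[0] + q[1] * 4;           // shift-add tree, stage 1 (<<2)
    int64_t s23 = q[2] + q[3] * 4;           // (<<2)
    return s01 + s23 * 16;                   // stage 2 (<<4)
}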

All the L/2 partial products generated by each of the N PPGs are thus added by L/2 binary adder-trees. The outputs of the L/2 adder-trees are then added by a shift-add tree according to their place values. Each of the binary adder-trees requires log2 N stages of adders to add N partial products, and the shift-add tree requires log2 L − 1 stages of adders to add the L/2 outputs of the L/2 binary adder-trees². The addition scheme of the error-computation block for a 4-tap filter and input word-size L = 8 is shown in Fig. 7. For N = 4 and L = 8, the adder network requires four binary adder-trees of two stages each and a two-stage shift-add tree. In this figure, we show all possible locations of pipeline latches by dashed lines, which would reduce the critical path to one addition-time.

²When L is not a power of 2, log2 L should be replaced by ⌈log2 L⌉.

Page 4: Area Delay Power Efficient Fixed Point Lms Adaptive Filter With Low Adaptation Delay

4 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS


Fig. 5. Proposed structure of partial product generator (PPG). AOC stands for AND/OR cell.


Fig. 6. Structure and function of the AND/OR cell. The binary operators · and + in (b) and (c) are implemented using AND and OR gates, respectively.

If we were to introduce pipeline latches after every addition, it would require L(N − 1)/2 + L/2 − 1 latches in log2 N + log2 L − 1 stages, which would lead to a high adaptation-delay and introduce a large overhead of area and power consumption for large values of N and L. On the other hand, some of those pipeline latches are redundant in the sense that they are not required to maintain a critical path of one addition-time. The final adder in the shift-add tree contributes the maximum delay to the critical path. Based on that observation, we have identified the pipeline latches that do not contribute significantly to the critical path and excluded them without any noticeable increase of the critical path. The locations of the pipeline latches for filter lengths N = 8, 16, and 32 and input size L = 8 are shown in Table I. The pipelining is performed by a feed-forward cut-set retiming of the error-computation block [15].

B. Pipelined Structure of Weight-Update Block

The proposed structure for the weight-update block is shown in Fig. 8. It performs N multiply-accumulate operations of the form (µ × e) × xi + wi to update the N filter weights. The step-size µ is taken as a negative power of two, so that the multiplication with the most recently available error is realized by a shift operation only. Each of the MAC units therefore performs the multiplication of the shifted value of the error with the delayed input samples xi, followed by addition with the corresponding old weight value wi. All the N multiplications for the MAC operations are performed by N PPGs, followed by N shift-add trees. Each of the PPGs generates L/2 partial products corresponding to the product of the recent shifted error value µ × e with the L/2 2-bit digits of the input word xi, where the subexpression 3µ × e is shared within the multiplier. Since the scaled error (µ × e) is multiplied with all the N delayed input values in the weight-update block, this subexpression can be shared across all the multipliers as well, which leads to a substantial saving of adder complexity. The final outputs of the MAC units constitute the desired updated weights, which are used as inputs to the error-computation block as well as to the weight-update block for the next iteration.
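A behavioral sketch of one weight-update step follows: µe is obtained by an arithmetic shift (µ being a power of two), 3µe is formed once and shared by all N PPG-style multipliers, and each weight is incremented by (µe)·xi. This is an integer model with our own names; the word-length bookkeeping of Section IV is omitted.

#include <cstdint>
#include <vector>

// Weight-update block of Fig. 8, modeled in software for L = 8:
// mue = mu * e via an arithmetic shift, mue3 = 3 * mue shared across
// all multipliers; the MSD digit of x[i] selects {0, +1, -2, -1}.
void weight_update(std::vector<int32_t>& w, const std::vector<int8_t>& x,
                   int32_t e, int mu_shift /* mu = 2^-mu_shift */) {
    const int32_t mue  = e >> mu_shift;   // multiplication by mu as a shift
    const int32_t mue3 = 3 * mue;         // shared subexpression
    for (std::size_t i = 0; i < w.size(); ++i) {
        int32_t prod = 0;
        for (int j = 0; j < 4; ++j) {
            int d = (static_cast<uint8_t>(x[i]) >> (2 * j)) & 3;
            int32_t pp;
            if (j == 3)                   // MSD digit weights {0, +1, -2, -1}
                pp = (d == 0) ? 0 : (d == 1) ? mue : (d == 2) ? -2 * mue : -mue;
            else                          // selects {0, mue, 2*mue, 3*mue}
                pp = (d == 3) ? mue3 : mue * d;
            prod += pp * (1 << (2 * j));
        }
        w[i] += prod;                     // (mu*e)*x_i + w_i
    }
}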

C. Adaptation Delay

As shown in Fig. 2, the adaptation-delay is decomposed into n1 and n2. The error-computation block generates the error delayed by n1 − 1 cycles, as shown in Fig. 4, which is fed to the weight-update block of Fig. 8 after scaling by µ; the input is then delayed by one cycle before the PPG, so that the total delay introduced by the FIR filtering is n1. In Fig. 8, the weight-update block generates w_{n−1−n2}, i.e., the weights delayed by n2 + 1 cycles. However, note that the one-cycle delay due to the latch before the PPG is included in the delay of the error-computation block, i.e., in n1; therefore, the delay generated in the weight-update block is n2.

TABLE I
LOCATION OF PIPELINE LATCHES FOR L = 8 AND N = 8, 16, AND 32.

        Error-Computation Block            Weight-Update Block
  N     Adder-Tree    Shift-Add-Tree       Shift-Add-Tree
  8     stage-2       stages 1 and 2       stage-1
  16    stage-3       stages 1 and 2       stage-1
  32    stage-3       stages 1 and 2       stage-2



Fig. 7. The adder-structure of the filtering unit for N = 4 and L = 8.


Fig. 8. Proposed structure of weight-update block.

If the locations of the pipeline latches are decided as in Table I, n1 becomes 5, where three latches are in the error-computation block, one latch is after the subtraction in Fig. 4, and the other latch is before the PPG in Fig. 8. Also, n2 is set to 1, from the latch in the shift-add tree of the weight-update block.

IV. FIXED-POINT IMPLEMENTATION, OPTIMIZATION, SIMULATION, AND ANALYSIS

In this section, we discuss the fixed-point implementation and optimization of the proposed DLMS adaptive filter. A bit-level pruning of the adder-tree is also proposed to reduce the hardware complexity without noticeable degradation of the steady-state MSE.


Fig. 9. Fixed-point representation of a binary number (Xi: integer word-length, Xf: fractional word-length).

A. Fixed-Point Design Considerations

For fixed-point implementation, the word-lengths and radix-point locations of the input samples, the weights, and the internal signals need to be decided. Fig. 9 shows the fixed-point representation of a binary number. Let (X, Xi) denote the fixed-point representation of a binary number, where X is the word-length and Xi is the integer length.


The word-length and the location of the radix point of xn and wn in Fig. 4 need to be predetermined by the hardware designer, taking into consideration design constraints such as the desired accuracy and the hardware complexity. Assuming (L, Li) and (W, Wi), respectively, as the representations of the input signals and the filter weights, all the other signals in Fig. 4 and Fig. 8 can be decided as shown in Table II. The signal pij, which is the output of the PPG block (shown in Fig. 4), has at most three times the value of the input coefficients; thus, we can add two more bits to the word-length and to the integer length of the coefficients to avoid overflow. The output of each stage in the adder-tree in Fig. 7 is one bit wider than its inputs, so that the fixed-point representation of the output of the adder-tree with log2 N stages becomes (W + 2 + log2 N, Wi + 2 + log2 N). Accordingly, the output of the shift-add tree would be of the form (W + L + log2 N, Wi + Li + log2 N), assuming that no truncation of any least significant bits (LSBs) is performed in the adder-tree or the shift-add tree. However, the output of the shift-add tree is designed to have only W bits: the most significant W bits of the (W + L + log2 N)-bit output are retained, which results in the fixed-point representation (W, Wi + Li + log2 N) for y, as shown in Table II. We let the representation of the desired signal d be the same as that of y, even though its quantization is usually given as an input specification; specific scaling/sign-extension and truncation/zero-padding are required for this purpose. Since the LMS algorithm performs learning so that y has the same sign as d, the error signal e can also be set to have the same representation as y without overflow after the subtraction.
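To make the (X, Xi) notation concrete: a signal is stored as an X-bit two's-complement integer whose real value is that integer scaled by 2^−(X−Xi). The helper below is our own sketch of what "truncation to (X, Xi)" means in the following analysis; the saturation at the range limits is an assumption of the software model (the paper simply assumes no overflow).

#include <algorithm>
#include <cmath>
#include <cstdint>

// Quantize a real value to the (X, Xi) format of Fig. 9: X total bits,
// Xi integer bits, X - Xi fractional bits. Truncation (floor) models
// the hardware's discarding of LSBs; clamping is our own assumption.
int64_t quantize(double v, int X, int Xi) {
    int64_t raw = static_cast<int64_t>(std::floor(std::ldexp(v, X - Xi)));
    const int64_t lo = -(int64_t{1} << (X - 1));
    const int64_t hi =  (int64_t{1} << (X - 1)) - 1;
    return std::clamp(raw, lo, hi);
}

// Recover the real value represented by a raw (X, Xi) integer.
double to_double(int64_t raw, int X, int Xi) {
    return std::ldexp(static_cast<double>(raw), -(X - Xi));
}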

It is shown in [4] that the convergence of an N-tap DLMS adaptive filter with n1 adaptation-delay is assured if

0 < µ < 2 / [(σ_x²(N − 2) + 2n1 − 2) σ_x²]    (5)

where σ_x² is the average power of the input samples. Furthermore, if the value of µ is chosen to be a power of two, 2^−n, where n ≤ Wi + Li + log2 N, the multiplication with µ is equivalent to a change of the location of the radix point. Since the multiplication with µ does not need any arithmetic operation, it does not introduce any truncation error.

TABLE II
FIXED-POINT REPRESENTATION OF THE SIGNALS OF THE PROPOSED DLMS ADAPTIVE FILTER (µ = 2^−(Li+log2 N)).

  Signal Name   Fixed-Point Representation
  x             (L, Li)
  w             (W, Wi)
  p             (W + 2, Wi + 2)
  q             (W + 2 + log2 N, Wi + 2 + log2 N)
  y, d, e       (W, Wi + Li + log2 N)
  µe            (W, Wi)
  r             (W + 2, Wi + 2)
  s             (W, Wi)

x, w, p, q, y, d, and e appear in the error-computation block of Fig. 4; µe, r, and s are defined in the weight-update block of Fig. 8. All subscripts and time indices of the signals are omitted for simplicity of notation.

If we need to use a smaller step-size, i.e., n > Wi + Li + log2 N, some of the LSBs of en need to be truncated. If we assume that n = Li + log2 N, i.e., µ = 2^−(Li+log2 N) as in Table II, the representation of µen is (W, Wi) without any truncation. The weight-increment term s (shown in Fig. 8), which is equivalent to µ·en·xn, would require the fixed-point representation (W + L, Wi + Li). However, only Wi integer bits are retained in the computation of the shift-add tree of the weight-update circuit, while the Li more significant bits of the MSB end are discarded. This is in accordance with the assumption that, as the weights converge toward the optimal value, the weight-increment terms become smaller and the MSB end of the error term contains more zeros. Also, in our design, the L − Li LSBs of the weight-increment terms are truncated, so that the weight-increment terms have the same fixed-point representation as the weight values. We further assume that no overflow occurs during the addition of the weight update; otherwise, the word-length of the weights would need to be increased at every iteration, which is not desirable. This assumption is valid since the weight-increment terms are small when the weights have converged; moreover, if overflow occurs during the training period, the weight update is not appropriate anyway and will lead to additional iterations to reach convergence. Accordingly, the updated weight can be computed in the truncated form (W, Wi) and fed into the error-computation block.

B. Computer Simulation of the Proposed DLMS Filter

The proposed fixed-point DLMS adaptive filter is used for the system-identification problem of Section II. µ is set to 0.5, 0.25, and 0.125 for filter lengths 8, 16, and 32, respectively, so that the multiplication with µ does not require any additional circuitry. For the fixed-point simulation, the word-lengths and radix points of the input and the coefficients are set to L = 16, Li = 2, W = 16, Wi = 0, and the Gaussian random input xn of zero mean and unit variance is scaled down to fit the representation (16, 2). The fixed-point data types of all the other signals are obtained from Table II. Each learning curve is averaged over 50 runs to obtain a clean curve. The proposed design was coded in C++ using the SystemC fixed-point library for the different filter orders N = 8, 16, and 32. The corresponding convergence behaviors are shown in Fig. 10. It is found that as the filter order increases, not only does the convergence become slower, but the steady-state MSE also increases.

C. Steady-State Error Estimation

In this section, the MSE of the output of the proposed DLMS adaptive filter due to fixed-point quantization is analyzed. Based on the models introduced in [16] and [17], the steady-state MSE of the output is derived in terms of the parameters listed in Table II. Let primed symbols denote quantities truncated due to the fixed-point representation, so that the input and the desired signals can be written as

x′_n = x_n + α_n    (6)
d′_n = d_n + β_n    (7)



Fig. 10. Mean squared error of the fixed-point DLMS filter output for the system identification for N = 8, N = 16, and N = 32.

where α_n and β_n are the input quantization-noise vector and the quantization noise of the desired signal, respectively. The weight vector can be written as

w′_n = w_n + ρ_n    (8)

where ρ_n is the error vector of the current weights due to finite precision. The output signal y′_n and the weight-update equation can accordingly be modified, respectively, to the forms

y′_n = w′_n^T x′_n + η_n    (9)
w′_{n+1} = w′_n + µ e′_n x′_n + γ_n    (10)

where η_n and γ_n are the errors due to the truncation of the output of the shift-add tree in the error-computation block and in the weight-update block, respectively. The steady-state MSE in the fixed-point representation can be expressed as

E|d_n − y′_n|² = E|e_n|² + E|α_n^T w_n|² + E|η_n|² + E|ρ_n^T x_n|²    (11)

where E|·| is the mathematical-expectation operator, and the terms e_n, α_n^T w_n, η_n, and ρ_n^T x_n are assumed to be uncorrelated.

The first term E|e_n|², where e_n = d_n − y_n, is the excess MSE of infinite-precision computation, whereas the other three terms are due to finite-precision arithmetic.

The second term can be calculated as

E|α_n^T w_n|² = |w*_n|² (m²_{αn} + σ²_{αn})    (12)

where w*_n is the optimal Wiener vector, and m_{αn} and σ²_{αn} are defined as the mean and variance of α_n when x_n is truncated to the fixed-point type (L, Li) listed in Table II. α_n can be modeled as uniformly distributed with mean and variance

m_{αn} = 2^{−(L−Li)}/2    (13a)
σ²_{αn} = 2^{−2(L−Li)}/12    (13b)

For the calculation of the third term E|η_n|² in (11), we use the fact that the output of the shift-add tree in the error-computation block is of type (W, Wi + Li + log2 N) after the final truncation. Therefore,

E|η_n|² = m²_{ηn} + σ²_{ηn}    (14)

where

m_{ηn} = 2^{−(W−(Wi+Li+log2 N))}/2    (15a)
σ²_{ηn} = 2^{−2(W−(Wi+Li+log2 N))}/12    (15b)

The last term E|ρ_n^T x_n|² in (11) can be obtained by using the derivation proposed in [17] as

E|ρ_n^T x_n|² = (m²_{γn} Σ_i Σ_k (R^{−1})_{ki}) / µ² + N(σ²_{γn} − m²_{γn}) / (2µ)    (16)

where (R^{−1})_{ki} represents the (k, i)th entry of the inverse of the matrix R = E(x_n x_n^T). For the weight update in (10), the first operation is the multiplication of e′_n by µ, which only moves the location of the radix point and therefore introduces no truncation error. Only the truncation after the multiplication of µe′_n with x_n needs to be considered in order to evaluate γ_n. Then we have

m_{γn} = 2^{−(W−Wi)}/2    (17a)
σ²_{γn} = 2^{−2(W−Wi)}/12    (17b)

For a large µ, the truncation error η_n from the error-computation block becomes the dominant error source, and (11) can be approximated by E|η_n|². The MSE values are estimated from the analytical expressions as well as from the simulation results by averaging over 50 experiments. Table III shows that the steady-state MSE computed from the analytical expressions matches that obtained from simulation of the proposed architecture for different values of N and µ.
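As a numeric cross-check, the dominant term E|η_n|² = m²_{ηn} + σ²_{ηn} of (14)-(15) can be evaluated directly for the settings of Table III (L = W = 16, Li = 2, Wi = 0). Neglecting the remaining terms of (11) is our simplification here, justified by the large-µ argument above; the result reproduces the Analysis column of Table III to within a few hundredths of a dB.

#include <cmath>
#include <cstdio>

// Dominant-term steady-state MSE estimate from (14)-(15): the output
// keeps W - (Wi + Li + log2 N) fractional bits, so the truncation error
// is uniform over one LSB, with mean LSB/2 and variance LSB^2/12.
int main() {
    const int W = 16, Wi = 0, Li = 2;
    const int Ns[] = {8, 16, 32};
    for (int N : Ns) {
        int frac = W - (Wi + Li + static_cast<int>(std::lround(std::log2(N))));
        double lsb = std::ldexp(1.0, -frac);                // 2^-frac
        double mse = 0.25 * lsb * lsb + lsb * lsb / 12.0;   // m^2 + sigma^2
        std::printf("N = %2d : %6.2f dB\n", N, 10.0 * std::log10(mse));
    }
    return 0;   // prints about -71.0, -65.0, -59.0 dB, cf. Table III
}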

TABLE III
ESTIMATED AND SIMULATED STEADY-STATE MSES OF THE FIXED-POINT DLMS ADAPTIVE FILTER (L = W = 16).

  Filter Length   Step-Size (µ)   Simulation   Analysis
  N = 8           2^−1            −71.01 dB    −70.97 dB
  N = 16          2^−2            −64.84 dB    −64.97 dB
  N = 32          2^−3            −58.72 dB    −58.95 dB

D. Adder-Tree Optimization

The adder-tree and the shift-add tree for the computation of yn can be pruned for further optimization of the area, delay, and power complexity. To illustrate the proposed pruning optimization of the adder-tree and shift-add tree for the computation of the filter output, we take the simple example of a filter length N = 4, with word-lengths L and W both equal to 8. The dot-diagram of the adder-tree is shown in Fig. 11. Each row of the dot-diagram contains 10 dots, which represent the partial products generated by a PPG unit for W = 8. There are four sets of partial products, corresponding to the four partial products of each multiplier, since L = 8; each set of partial products of the same place value contains four terms, since N = 4. The final sum without truncation would be 18 bits wide. However, we use only 8 bits of the final sum, and the remaining 10 bits are eventually discarded. To reduce the computational complexity, some of the LSBs of the inputs of the adder-tree can be truncated, while some guard bits can be used to minimize the impact of the truncation on the error performance of the adaptive filter.



Fig. 11. The dot-diagram for optimization of the adder-tree in the case of N = 4, L = 8, and W = 8.

In Fig. 11, four bits are taken as guard bits and the remaining six LSBs are truncated. To obtain a larger hardware saving, the bits to be truncated are not generated by the PPGs at all, so that the complexity of the PPGs is also reduced. η_n defined in (9) increases if we prune the adder-tree, and the worst-case error occurs when all the truncated bits equal one. For the calculation of the sum of the truncated values in the worst case, let k1 denote the bit location of the MSB of the truncated bits and N·k2 the number of rows affected by the truncation. In the example of Fig. 11, k1 and k2 are 5 and 3, respectively, since the bit positions from 0 to 5 are truncated and a total of 12 rows are affected by the truncation for N = 4. Also, k2 can be derived from k1 as

k2 = ⌊k1/2⌋ + 1 for k1 < W, otherwise k2 = W/2    (18)

since the number of truncated bits is reduced by two for every group of N rows, as shown in Fig. 11. Using k1, k2, and N, the sum of the truncated values in the worst case can be formulated as

b_worst = N Σ_{j=0}^{k2−1} Σ_{i=2j}^{k1} 2^i = N (k2 · 2^{k1+1} − (4^{k2} − 1)/3)    (19)

In the example of Fig. 11, b_worst amounts to 684, while the LSB weight of the output of the adder-tree after the final truncation is 2^10 = 1024. Therefore, in more than 50% of cases there may be a one-bit difference due to the pruning. The truncation error from each row (twelve rows in total, from row p00 to row p32 in Fig. 11) has a uniform distribution, and if the individual errors are assumed to be independent of one another, the mean and variance of the total error can be calculated as the sums of the means and variances of the individual random variables.
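Both forms of (19) are easy to sanity-check in code; for the example values N = 4, k1 = 5, and k2 = 3 used above, the double sum and the closed form each evaluate to 684.

#include <cassert>
#include <cstdint>

// Worst-case sum of truncated bits, eq. (19): group j of N rows loses
// bit positions 2j through k1, all taken to be 1 in the worst case.
int64_t bworst_sum(int N, int k1, int k2) {
    int64_t t = 0;
    for (int j = 0; j < k2; ++j)
        for (int i = 2 * j; i <= k1; ++i)
            t += int64_t{1} << i;
    return N * t;
}

// Closed form of (19): N * (k2 * 2^(k1+1) - (4^k2 - 1) / 3).
int64_t bworst_closed(int N, int k1, int k2) {
    int64_t pow4 = int64_t{1} << (2 * k2);   // 4^k2
    return N * (k2 * (int64_t{1} << (k1 + 1)) - (pow4 - 1) / 3);
}

int main() {
    assert(bworst_sum(4, 5, 3) == 684);
    assert(bworst_closed(4, 5, 3) == 684);
    return 0;
}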


Fig. 12. MSE at steady-state vs. k1 for N = 8, 16, and 32 (L = W = 16).

However, the outputs of the same PPG are unlikely to be uncorrelated, since they are generated from the same input sample, so it is not straightforward to estimate the distribution of the error due to the pruning. Nevertheless, as the value of b_worst approaches or exceeds the LSB weight of the output after the final truncation, the pruning affects the overall error more strongly. Fig. 12 illustrates the steady-state MSE as a function of k1 for N = 8, 16, and 32 with L = W = 16, to show how much the pruning affects the output MSE. When k1 is less than 10 for N = 8, the MSE deterioration is less than 1 dB compared with the case where no pruning is applied.

V. COMPLEXITY CONSIDERATIONS

The hardware and time complexities of the proposed design, the structure of [11], and the best of the systolic structures [10] are listed in Table IV. The original DLMS structure proposed in [4] is also listed in this table. It is found that the proposed design has a short critical path of one addition-time, like [11], and a lower adaptation-delay than the others. If we consider each multiplier to consist of (L − 1) adders, then the existing designs involve 16N adders, while the proposed one involves 10N + 2 adders for L = 8. Similarly, the proposed design involves fewer delay registers than the others.

We have coded the proposed designs in VHDL and synthesized them with the Synopsys Design Compiler using a 65-nm CMOS library for different filter orders. The word-lengths of the input samples and the weights are both chosen to be 8, i.e., L = W = 8. The step-size µ is chosen to be 2^−(Li+log2 N) to realize its multiplication without any additional circuitry. The word-lengths of all the other signals are determined from the types listed in Table II. We have also coded the structures proposed in [10] and [11] in VHDL and synthesized them using the same library and synthesis options in the Design Compiler for a fair comparison. In Table V, we show the synthesis results of the proposed and existing designs in terms of data-arrival-time (DAT), area, energy-per-sample (EPS), area-delay-product (ADP), and energy-delay-product (EDP) obtained for filter lengths N = 8, 16, and 32. The proposed design-I, before the pruning of the adder-tree, has the same DAT as the design of [11], since the critical paths of both designs are the same (TA), as shown in Table IV.


TABLE IV
COMPARISON OF HARDWARE AND TIME COMPLEXITIES OF DIFFERENT ARCHITECTURES FOR L = 8.

  Design              Critical-Path   n1           n2   # adders   # multipliers   # registers
  Long et al. [4]     TM + TA         log2 N + 1   0    2N         2N              3N + 2 log2 N + 1
  Ting et al. [11]    TA              log2 N + 5   3    2N         2N              10N + 8
  Van and Feng [10]   TM + TA         N/4 + 3      0    2N         2N              5N + 3
  Proposed Design     TA              5            1    10N + 2    0               2N + 14 + E†

†E = 24, 40, and 48 for N = 8, 16, and 32, respectively. Besides, the proposed design needs an additional 24N AND cells and 16N OR cells. The two's-complement operator in Figs. 5 and 8 is counted as one adder, and it is assumed that the multiplication with the step-size does not need a multiplier in any of the structures.

TABLE V
PERFORMANCE COMPARISON OF DLMS ADAPTIVE FILTERS BASED ON SYNTHESIS RESULTS USING A 65-NM CMOS LIBRARY.

  Design               N    DAT    Latency    Area      Leakage      EPS       ADP          EDP        ADP        EDP
                            (ns)   (cycles)   (sq.um)   power (mW)   (mW×ns)   (sq.um×ns)   (mW×ns²)   reduction  reduction
  Ting et al. [11]     8    1.14   8          24204     0.13         18.49     26867        21.08      −          −
                       16   1.19   9          48049     0.27         36.43     55737        43.35      −          −
                       32   1.25   10         95693     0.54         72.37     116745       90.47      −          −
  Van and Feng [10]    8    1.35   5          13796     0.08         7.29      18349        9.84       −          −
                       16   1.35   7          27739     0.16         14.29     36893        19.30      −          −
                       32   1.35   11         55638     0.32         27.64     73998        37.31      −          −
  Proposed Design-I    8    1.14   5          14029     0.07         8.53      15572        9.72       15.13%     1.14%
                       16   1.19   5          26660     0.14         14.58     31192        17.34      15.45%     10.10%
                       32   1.25   5          48217     0.27         21.00     58824        26.25      20.50%     29.64%
  Proposed Design-II   8    0.99   5          12765     0.06         8.42      12382        8.33       32.51%     15.22%
                       16   1.15   5          24360     0.13         14.75     27526        16.96      25.38%     12.07%
                       32   1.15   5          43233     0.24         21.23     48853        24.41      33.98%     34.55%

DAT: data-arrival-time; ADP: area-delay-product; EPS: energy per sample; EDP: energy-delay-product. The ADP and EDP reductions in the last two columns are the improvements of the proposed designs over [10] in percentage. Proposed Design-I: without optimization; Proposed Design-II: after optimization of the adder-tree with k1 = 5.

The design of [10], on the other hand, has a longer DAT, equivalent to TA + TM, while the proposed design-II, after the pruning of the adder-tree, has a slightly smaller DAT than the existing designs. Also, the proposed designs reduce the area, compared with the existing designs, by using a PPG based on common-subexpression sharing. As shown in Table V, the reduction in area is more significant in the case of N = 32, since more sharing can be obtained for large-order filters. The proposed designs also achieve lower area and larger power reduction than [11] by removing the redundant pipeline latches that are not required to maintain a critical path of one addition-time. It is found that the proposed design-I involves ∼17% less ADP and ∼14% less EDP than the best prior work of [10], on average, for filter lengths N = 8, 16, and 32. The proposed design-II similarly involves ∼31% less ADP and ∼21% less EDP than the structure of [10] for the same filter lengths. The optimization of the adder-tree of the proposed structure with k1 = 5 offers ∼20% less ADP and ∼9% less EDP over the structure before the optimization of the adder-tree.

The proposed designs have also been implemented on the field-programmable gate array (FPGA) platform of Xilinx devices. The number of slices (NOS) and the maximum usable frequency (MUF) using two different devices, Spartan-3A DSP (XC3SD1800A-4FG676) and Virtex-4 (XC4VSX35-10FF668), are listed in Table VI. The proposed design-II, after the pruning, offers nearly 11.86% less slice-delay-product (calculated as NOS/MUF) on average, over N = 8, 16, and 32 and the two devices.

VI. CONCLUSIONS

We have proposed an area-delay-power efficient, low adaptation-delay architecture for the fixed-point implementation of the LMS adaptive filter. We have used a novel partial product generator for the efficient implementation of general multiplications and inner-product computation by common-subexpression sharing. For an L-bit input width, the proposed partial product generator performs L/2 multiplications of one of the operands with 2-bit digits of the other in parallel, where the simple subexpressions ±w, ±2w, and 3w are shared across all the 2-bit multiplications. We have proposed an efficient addition scheme for the inner-product computation to reduce the adaptation-delay significantly, in order to achieve faster convergence, and to reduce the critical path so as to support a high input sampling rate. Besides, we have proposed a strategy for optimized, balanced pipelining across the time-consuming blocks of the structure to reduce the adaptation-delay and the power consumption as well. The proposed structure involves significantly less adaptation-delay and provides significant savings in area-delay-product and energy-delay-product compared with the existing structures. We have also proposed a fixed-point implementation of the architecture and derived the expression for the steady-state error; the steady-state MSE obtained from the analytical result matches the simulation result. We have further discussed a pruning scheme that provides nearly 20% saving in area-delay-product and 9% saving in energy-delay-product over the proposed structure before pruning, without noticeable degradation of the steady-state error performance.


TABLE VI
FPGA IMPLEMENTATION OF THE PROPOSED DESIGNS FOR L = 8 AND N = 8, 16, AND 32.

            Proposed Design-I    Proposed Design-II
  Design    NOS      MUF         NOS      MUF

  Xilinx Virtex-4 (XC4VSX35-10FF668)
  N = 8     1024     148.7       931      151.6
  N = 16    2036     121.2       1881     124.6
  N = 32    4036     121.2       3673     124.6

  Xilinx Spartan-3A DSP (XC3SD1800A-4FG676)
  N = 8     1025     87.2        966      93.8
  N = 16    2049     70.3        1915     74.8
  N = 32    4060     70.3        3750     75.7

NOS stands for the number of slices; MUF stands for the maximum usable frequency in MHz.

The highest sampling rate that can be supported by the ASIC implementation of the proposed design ranges from about 870 to 1010 MHz for filter orders 8 to 32. When the adaptive filter is required to operate at a lower sampling rate, one can run the proposed design with a clock slower than the maximum usable frequency and use a lower operating voltage to reduce the power consumption further.

REFERENCES

[1] B. Widrow and S. D. Stearns, Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1985.

[2] S. Haykin and B. Widrow, Least-Mean-Square Adaptive Filters. Hoboken, NJ: Wiley-Interscience, 2003.

[3] M. D. Meyer and D. P. Agrawal, "A modular pipelined implementation of a delayed LMS transversal adaptive filter," in Proc. IEEE International Symposium on Circuits and Systems (ISCAS), May 1990, pp. 1943-1946.

[4] G. Long, F. Ling, and J. G. Proakis, "The LMS algorithm with delayed coefficient adaptation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 9, pp. 1397-1405, Sep. 1989.

[5] ——, "Corrections to 'The LMS algorithm with delayed coefficient adaptation'," IEEE Transactions on Signal Processing, vol. 40, no. 1, pp. 230-232, Jan. 1992.

[6] H. Herzberg, R. Haimi-Cohen, and Y. Be'ery, "A systolic array realization of an LMS adaptive filter and the effects of delayed adaptation," IEEE Transactions on Signal Processing, vol. 40, no. 11, pp. 2799-2803, Nov. 1992.

[7] M. D. Meyer and D. P. Agrawal, "A high sampling rate delayed LMS filter architecture," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 40, no. 11, pp. 727-729, Nov. 1993.

[8] S. Ramanathan and V. Visvanathan, "A systolic architecture for LMS adaptive filtering with minimal adaptation delay," in Proc. International Conference on VLSI Design (VLSID), Jan. 1996, pp. 286-289.

[9] Y. Yi, R. Woods, L.-K. Ting, and C. F. N. Cowan, "High speed FPGA-based implementations of delayed-LMS filters," Journal of VLSI Signal Processing, vol. 39, no. 1-2, pp. 113-131, Jan. 2005.

[10] L. D. Van and W. S. Feng, "An efficient systolic architecture for the DLMS adaptive filter and its applications," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 48, no. 4, pp. 359-366, Apr. 2001.

[11] L.-K. Ting, R. Woods, and C. F. N. Cowan, "Virtex FPGA implementation of a pipelined adaptive LMS predictor for electronic support measures receivers," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, no. 1, pp. 86-99, Jan. 2005.

[12] P. K. Meher and M. Maheshwari, "A high-speed FIR adaptive filter architecture using a modified delayed LMS algorithm," in Proc. IEEE International Symposium on Circuits and Systems (ISCAS), Rio de Janeiro, Brazil, May 2011.

[13] P. K. Meher and S. Y. Park, "Low adaptation-delay LMS adaptive filter part-I: Introducing a novel multiplication cell," in Proc. IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), Aug. 2011.

[14] ——, "Low adaptation-delay LMS adaptive filter part-II: An optimized architecture," in Proc. IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), Aug. 2011.

[15] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York: John Wiley & Sons, Inc., 1999.

[16] C. Caraiscos and B. Liu, "A roundoff error analysis of the LMS adaptive algorithm," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 1, pp. 34-41, Feb. 1984.

[17] R. Rocher, D. Menard, O. Sentieys, and P. Scalart, "Accuracy evaluation of fixed-point LMS algorithm," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2004, pp. 237-240.

Pramod Kumar Meher (SM'03) received the BSc (Honours) and MSc degrees in physics and the PhD degree in science from Sambalpur University, Sambalpur, India, in 1976, 1978, and 1996, respectively.

Currently, he is a Senior Scientist with the Institute for Infocomm Research, Singapore. Prior to this assignment he was a visiting faculty member with the School of Computer Engineering, Nanyang Technological University, Singapore. He was a Professor of Computer Applications with Utkal University, Bhubaneswar, India, from 1997 to 2002, a Reader in Electronics with Berhampur University, Berhampur, India, from 1993 to 1997, and a Lecturer in Physics with various government colleges in India from 1981 to 1993. His research interests include the design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal, image, and video processing, communication, bioinformatics, and intelligent computing. He has contributed more than 180 technical papers to various reputed journals and conference proceedings.

Dr. Meher is a Fellow of the Institution of Electronics and Telecommunication Engineers, India, and a Senior Member of the IEEE. He served as an Associate Editor for the IEEE Transactions on Circuits and Systems II: Express Briefs during 2008-2011. Currently, he is serving as a speaker for the Distinguished Lecturer Program (DLP) of the IEEE Circuits and Systems Society, and as an Associate Editor for the IEEE Transactions on Circuits and Systems I: Regular Papers, the IEEE Transactions on Very Large Scale Integration (VLSI) Systems, and the Journal of Circuits, Systems, and Signal Processing. He was the recipient of the Samanta Chandrasekhar Award for excellence in research in engineering and technology for the year 1999.

Sang Yoon Park (S'03-M'11) received the B.S., M.S., and Ph.D. degrees from the Department of Electrical Engineering and Computer Science, Seoul National University, Seoul, Korea, in 2000, 2002, and 2006, respectively. He joined the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, as a Research Fellow in 2007. Since 2008, he has been with the Institute for Infocomm Research, Singapore, where he is currently a Research Scientist. His research interests include dedicated/reconfigurable architectures and algorithms for low-power/low-area/high-performance digital signal processing and communication systems.