2326 ieee transactions on circuits and systems—i:...

14
2326 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 59, NO. 10, OCTOBER 2012 Analytical Approach for Numerical Accuracy Estimation of Fixed-Point Systems Based on Smooth Operations Romuald Rocher, Daniel Menard, Member, IEEE, Pascal Scalart, Member, IEEE, and Olivier Sentieys, Member, IEEE Abstract—In embedded systems using xed-point arithmetic, converting applications into xed-point representations requires a fast and efcient accuracy evaluation. This paper presents a new analytical approach to determine an estimation of the numerical accuracy of a xed-point system, which is accurate and valid for all systems formulated with smooth operations (e.g., additions, subtractions, multiplications and divisions). The mathematical expression of the system output noise power is determined using matrices to obtain more compact expressions. The proposed approach is based on the determination of the time-varying impulse-response of the system. To speedup computation of the expressions, the impulse response is modelled using a linear prediction approach. The approach is illustrated in the general case of time-varying recursive systems by the Least Mean Square (LMS) algorithm example. Experiments on various and repre- sentative applications show the xed-point accuracy estimation quality of the proposed approach. Moreover, the approach using the linear-prediction approximation is very fast even for recursive systems. A signicant speed-up compared to the best known accuracy evaluation approaches is measured even for the most complex benchmarks. Index Terms—Accuracy evaluation, adaptive lters, xed-point arithmetic, quantization noises. I. INTRODUCTION D IGITAL image and signal processing applications are mostly implemented in embedded systems with xed-point arithmetic to satisfy energy consumption and cost constraints [14], [16], [33], [15]. Fixed-point architectures manipulate data with relatively small word-lengths, which could offer the advantages of a signicantly lower area, latency, and power consumption. Moreover, oating-point operators are more complex in their processing of exponent and mantissa and, therefore, chip area and latency are more important than for xed-point operators. Nevertheless, applications in their earliest design cycle are usually designed, described and simulated with oating-point data types. Consequently, application must be converted into a xed-point specication where implementation is considered. To reduce application time-to-market, tools to automate xed- point conversion are needed. Indeed, some experiments [7] have shown that the manual xed-point conversion process can rep- resent from 25% to 50% of the total development time. Like- Manuscript received December 07, 2011; revised January 20, 2012; accepted January 31, 2012. Date of publication April 05, 2012; date of current version September 25, 2012. This paper was recommended by Associate Editor Mrityunjoy Chakraborty. The authors are with the University of Rennes 1-INRIA/IRISA, 22305 Lan- nion, France (e-mail: [email protected]). Color versions of one or more of the gures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identier 10.1109/TCSI.2012.2188938 wise, in a survey conducted by a design tool provider [19], the xed-point conversion was identied by a majority of respon- dents as one of the most difcult aspects during FPGA imple- mentation of a DSP application. The aim of xed-point conver- sion tools is to determine the optimized xed-point specication that minimizes the implementation cost and provides a sufcient computation accuracy to maintain the application performance. The xed-point conversion process is divided into two steps to determine the number of bits for both integer and fractional parts. The rst step is the determination of the binary-point position from the data dynamic range [36]. The integer part word-length of all data inside the application is determined in order to avoid overow. The second step is the determination of the fractional part. The limitation of the number of bits for the fractional part leads to an unavoidable error resulting from the difference between the expected value and the real nite preci- sion value. Obviously, numerical accuracy effects must be lim- ited as much as possible to ensure algorithm integrity and appli- cation performance. Therefore, this stage of the xed-point con- version ow corresponds to an optimization process in which the implementation cost is minimized under an accuracy con- straint. The fractional part word-length is optimized with an it- erative process requiring to evaluated the numerical accuracy numerous times. Therefore precise and fast numerical accuracy estimator is a key element of any efcient xed-point conver- sion method. The numerical accuracy of an application can be obtained using two approaches: xed-point simulations and analytical approaches. Accuracy evaluation based on xed-point simula- tions [2], [20] is very time consuming because a new simulation is needed as soon as a xed-point format changes in the system resulting in prohibitive optimization time for the xed-point conversion process. Analytical approaches attempt to determine a mathematical expression of an estimation of a numerical ac- curacy metric, which leads to short evaluation times compared to xed-point simulation based approaches. Consequently, the rest of the paper focuses on improving the quality and the ap- plication range of analytical approaches for numerical accuracy evaluation. Numerical accuracy can be evaluated by determining one of the following metrics: quantization error bounds [13], [1], number of signicant bits [6] or quantization noise power [27], [31], [5]. The error bound metric (or the number of signicant bits) is used for critical systems where the quantization effects could have catastrophic consequences [21], [3]. The error has to be kept small enough to ensure the system behavior within acceptable bounds. The quantization error power is preferred when a slight degradation of application performance due to quantization effects is acceptable. This design approach con- cerns a great number of digital signal processing applications. Moreover, the quantization error can be linked to the degrada- 1549-8328/$31.00 © 2012 IEEE

Upload: others

Post on 17-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 2326 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …people.rennes.inria.fr/Olivier.Sentieys/... · ROCHER et al.: ANALYTICAL APPROACH FOR NUMERICAL ACCURACY ESTIMATION OF FIXED-POINT

2326 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 59, NO. 10, OCTOBER 2012

Analytical Approach for Numerical AccuracyEstimation of Fixed-Point Systems Based

on Smooth OperationsRomuald Rocher, Daniel Menard, Member, IEEE, Pascal Scalart, Member, IEEE, and

Olivier Sentieys, Member, IEEE

Abstract—In embedded systems using fixed-point arithmetic,converting applications into fixed-point representations requires afast and efficient accuracy evaluation. This paper presents a newanalytical approach to determine an estimation of the numericalaccuracy of a fixed-point system, which is accurate and valid forall systems formulated with smooth operations (e.g., additions,subtractions, multiplications and divisions). The mathematicalexpression of the system output noise power is determined usingmatrices to obtain more compact expressions. The proposedapproach is based on the determination of the time-varyingimpulse-response of the system. To speedup computation of theexpressions, the impulse response is modelled using a linearprediction approach. The approach is illustrated in the generalcase of time-varying recursive systems by the Least Mean Square(LMS) algorithm example. Experiments on various and repre-sentative applications show the fixed-point accuracy estimationquality of the proposed approach. Moreover, the approach usingthe linear-prediction approximation is very fast even for recursivesystems. A significant speed-up compared to the best knownaccuracy evaluation approaches is measured even for the mostcomplex benchmarks.

Index Terms—Accuracy evaluation, adaptive filters, fixed-pointarithmetic, quantization noises.

I. INTRODUCTION

D IGITAL image and signal processing applicationsare mostly implemented in embedded systems with

fixed-point arithmetic to satisfy energy consumption and costconstraints [14], [16], [33], [15]. Fixed-point architecturesmanipulate data with relatively small word-lengths, whichcould offer the advantages of a significantly lower area, latency,and power consumption. Moreover, floating-point operatorsare more complex in their processing of exponent and mantissaand, therefore, chip area and latency are more important thanfor fixed-point operators.Nevertheless, applications in their earliest design cycle are

usually designed, described and simulated with floating-pointdata types. Consequently, application must be converted into afixed-point specification where implementation is considered.To reduce application time-to-market, tools to automate fixed-point conversion are needed. Indeed, some experiments [7] haveshown that the manual fixed-point conversion process can rep-resent from 25% to 50% of the total development time. Like-

Manuscript received December 07, 2011; revised January 20, 2012; acceptedJanuary 31, 2012. Date of publication April 05, 2012; date of current versionSeptember 25, 2012. This paper was recommended by Associate EditorMrityunjoy Chakraborty.The authors are with the University of Rennes 1-INRIA/IRISA, 22305 Lan-

nion, France (e-mail: [email protected]).Color versions of one or more of the figures in this paper are available online

at http://ieeexplore.ieee.org.Digital Object Identifier 10.1109/TCSI.2012.2188938

wise, in a survey conducted by a design tool provider [19], thefixed-point conversion was identified by a majority of respon-dents as one of the most difficult aspects during FPGA imple-mentation of a DSP application. The aim of fixed-point conver-sion tools is to determine the optimized fixed-point specificationthat minimizes the implementation cost and provides a sufficientcomputation accuracy to maintain the application performance.The fixed-point conversion process is divided into two steps

to determine the number of bits for both integer and fractionalparts. The first step is the determination of the binary-pointposition from the data dynamic range [36]. The integer partword-length of all data inside the application is determined inorder to avoid overflow. The second step is the determination ofthe fractional part. The limitation of the number of bits for thefractional part leads to an unavoidable error resulting from thedifference between the expected value and the real finite preci-sion value. Obviously, numerical accuracy effects must be lim-ited as much as possible to ensure algorithm integrity and appli-cation performance. Therefore, this stage of the fixed-point con-version flow corresponds to an optimization process in whichthe implementation cost is minimized under an accuracy con-straint. The fractional part word-length is optimized with an it-erative process requiring to evaluated the numerical accuracynumerous times. Therefore precise and fast numerical accuracyestimator is a key element of any efficient fixed-point conver-sion method.The numerical accuracy of an application can be obtained

using two approaches: fixed-point simulations and analyticalapproaches. Accuracy evaluation based on fixed-point simula-tions [2], [20] is very time consuming because a new simulationis needed as soon as a fixed-point format changes in the systemresulting in prohibitive optimization time for the fixed-pointconversion process. Analytical approaches attempt to determinea mathematical expression of an estimation of a numerical ac-curacy metric, which leads to short evaluation times comparedto fixed-point simulation based approaches. Consequently, therest of the paper focuses on improving the quality and the ap-plication range of analytical approaches for numerical accuracyevaluation.Numerical accuracy can be evaluated by determining one

of the following metrics: quantization error bounds [13], [1],number of significant bits [6] or quantization noise power [27],[31], [5]. The error bound metric (or the number of significantbits) is used for critical systems where the quantization effectscould have catastrophic consequences [21], [3]. The error hasto be kept small enough to ensure the system behavior withinacceptable bounds. The quantization error power is preferredwhen a slight degradation of application performance due toquantization effects is acceptable. This design approach con-cerns a great number of digital signal processing applications.Moreover, the quantization error can be linked to the degrada-

1549-8328/$31.00 © 2012 IEEE

Page 2: 2326 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …people.rennes.inria.fr/Olivier.Sentieys/... · ROCHER et al.: ANALYTICAL APPROACH FOR NUMERICAL ACCURACY ESTIMATION OF FIXED-POINT

ROCHER et al.: ANALYTICAL APPROACH FOR NUMERICAL ACCURACY ESTIMATION OF FIXED-POINT SYSTEMS BASED ON SMOOTH OPERATIONS 2327

tion of the application performance using for example the tech-nique presented in [26]. In this paper, the quantization noisepower is used as the numerical accuracy metric.This paper proposes an efficient and accurate analytical ap-

proach to estimate the numerical accuracy of all types of fixed-point systems based on smooth operations (an operation is con-sidered to be smooth if the output is a continuous and differen-tiable function of its inputs, as in the case of arithmetic opera-tions). From the signal flow graph of the application, the associ-ated tool automatically generates the source code implementingthe analytical expression of the output quantization noise power.In this expression, the gains between the different noise sourcesand the output are computed from the impulse response (IR),which can be time-varying. This expression is unbiased but in-troduces infinite sums. To reduce the computation time of thegains, an approach based on linear prediction is also proposed.The approach is based on a matrix model which leads to a com-pact expression in the case of algorithms expressed with vec-tors and matrices, such as the Fast Fourier Transform (FFT) andthe Discrete Cosine Transform (DCT). Compared to approachesbased on fixed-point simulations, the time required to evaluatethe fixed-point accuracy is also dramatically reduced. Existinganalytical approaches are only valid for linear and time invariant(LTI) systems [22], [23], or for non-LTI and non recursive sys-tems [24]. In [4], non-LTI systems with feedbacks are handled.But, as mentioned in [4], the execution time can be quite longfor systems containing feedbacks. Indeed, to obtain an accu-rate estimation, this execution time depends on the length ofthe system impulse response. Compared to the last-mentionedapproach, a significant speed-up is obtained with the proposedapproach thanks to the proposed linear prediction approach.This paper is organized as follows. In Section II, existing

methods to evaluate the numerical accuracy of fixed-point ap-plications are detailed. The comparison of simulation-based andanalytical approaches highlights the benefits of analytical ap-proaches. In Section III, the proposed approach for numericalaccuracy evaluation is defined. The quantization noise propa-gation model is presented and the output noise power expres-sion is expressed. A technique based on linear prediction is alsopresented to reduce the time to obtain the power expression. InSection IV, the approach is tested on several LTI and non LTIsystems. The quality of the estimation and the time to obtainthe power expression are evaluated on various applications andbenchmarks. Finally, Section V concludes the paper.

II. EXISTING APPROACHES FOR NUMERICALACCURACY EVALUATION

In the fixed-point conversion process, the numerical accuracyevaluation is the most critical part, mainly because this evalua-tion is successively performed in the iterative word-length op-timization process. The numerical accuracy can be evaluatedusing analytical or simulation-based approaches.

A. Accuracy Evaluation Based on Fixed-Point Simulations

In fixed-point simulation-based approaches, the applica-tion performance and the numerical accuracy are directlydetermined from fixed-point (bit-true) simulations. To obtainaccurate results, the simulations must use a high number ofsamples . Let denote the number of operations in

the algorithm description, , the number of times the accu-racy is evaluated and , the simulation time for one sample.Therefore, the total system simulation time is

(1)

To reduce word-length optimization time, approaches basedon fixed-point simulations aim at reducing the values of someparameters of (1). First, fixed-point simulators have to minimizethe computing time by using efficient data types to simulatefixed-point arithmetic. Most of them use object oriented lan-guage and operator overloading [20] to implement fixed-pointarithmetic. However, the fixed-point execution time of an algo-rithm is significantly longer than the one of a floating-point sim-ulation [12]. To reduce this overhead, new data types have beenintroduced in [20]. They use more efficiently the floating-pointunits of general-purpose processors.With this new data type, thefixed-point simulation time is still 7.5 times longer than the oneobtained with floating-point data types for a simple applicationlike a fourth-order Infinite Impulse Response (IIR) filter [20].However, the optimization time to accelerate the fixed-pointsimulations is not given in [20], which could actually annihilatethe obtained gain for the fixed-point simulation. The approachproposed in [30] was developed for MATLAB code but its sim-ulation time is still significantly higher.Finally, some methods [2] attempt to reduce , the number

of iteration of the optimization process by limiting the searchspace, and thus leading to sub-optimal solutions.

B. Analytical Accuracy Evaluation

The general idea of analytical approaches is to determine amathematical expression that provides estimations of the outputquantization noise power. The computation of this analytical ex-pression could be time-consuming, but it is executed only once.Then, the numerical accuracy is determined very quickly byevaluating the mathematical expression for a given word-lengthof each data. Existing analytical approaches for numerical accu-racy evaluation are based on perturbation theory [8], [34]. Thequantization of a signal generates a noise considered as asignal perturbation. Let and be respectively the inputsand the output of an operation , and and their pertur-bations. The perturbation of the output value of a differentiableoperation is given by

(2)

Then, the expression of the noise power corresponding tothe second-order moment of the output quantization noise canbe expressed as

(3)

where is the number of quantization noises, andare respectively the mean value and the variance of the noiseand the terms and depend on system characteristics.

The expressions of and differ according to the consid-ered approach and will be presented in the following subsection.Determining the terms and is crucial for the analyticalaccuracy evaluation process.1) Simulation-Based Computation: Approaches based

on hybrid techniques [31], [10], [17] have been proposed

Page 3: 2326 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …people.rennes.inria.fr/Olivier.Sentieys/... · ROCHER et al.: ANALYTICAL APPROACH FOR NUMERICAL ACCURACY ESTIMATION OF FIXED-POINT

2328 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 59, NO. 10, OCTOBER 2012

to compute the coefficients and of (3) from a set ofsimulations. In [31], these coefficients areobtained by solving a linear system in which and are thevariables. The other elements of (3) are determined by carryingout fixed-point simulations. The word-lengths of the data areset to control each quantizer and to analyze its influence onthe output. For each simulation, the output noise power isevaluated. The statistical parameters (mean and variance) ofeach noise source are extracted from the data word-length.At least fixed-point simulations are required tosolve the system of linear equations. A similar approach to [31]is used in [17] to obtain the coefficients by simulation.2) Affine Arithmetic Based Simulations: In approach based

on affine arithmetic [4], the coefficients and are com-puted with an affine arithmetic based simulation. This approachwas proposed for LTI in [22], [23] and recently, for non-LTI sys-tems in [4].In affine arithmetic, a data is defined by a linear relation

(4)

where is an uncertainty term included in the intervaland such that is independent of . Each noise sourceis defined by an affine form whose behavior is controlled by themean and the variance of the noise source. Then, noise sourceexpressions are propagated throughout the system thanks to anaffine arithmetic based simulation. The values of andare extracted from the affine form of the output noise. The ap-proach proposed in [4] is general and can handle systems basedon smooth operations. This estimation is accurate and the exe-cution time is low for non-recursive systems. Nevertheless, inthe case of recursive systems, several iterations for the affinearithmetic based simulation are required to converge to stablevalues. Thus, the execution time depends on the length of theimpulse response and on the number of noise sources. As men-tioned in [4], the execution time can be quite long for systemscontaining feedback loops.3) Approaches Based on System Characteristics: The ap-

proaches presented in [11], [25] provide an approximation ofthe mathematical expression of the output quantization noisepower in the case of Linear and Time Invariant (LTI) systems.The principle of [25] is based on the system transfer functionsand the expressions of the coefficients and are

(5)

(6)

where is the transfer function between the white noisesource and the system output. In [25], the authors proposed atechnique to automatically obtain the transfer function from theapplication description.For non-LTI and non-recursive systems, an extended ap-

proach was presented in [24]. Each noise source ispropagated through operations , which leads to thecalculation of the output noise . The noise is theproduct of each noise source and of different signalsassociated to operations included in noise sourcepropagation. So, and terms are equal to

(7)

(8)

It is noteworthy that this approach is adequate for bothnon-LTI and non-recursive systems, but not to non-LTI andrecursive systems such as the clan of adaptive filters.

C. Summary

While approaches based on fixed-point simulations lead tovery long computing time and thus require to limit the searchspace during word-length optimization, analytical approachesare significantly less time-consuming. Unfortunately, they arelimited to a reduced class of systems and are not always efficientin terms of execution time. To cope with the lack of efficientand general approaches, an analytical approach for all types ofsystems based on smooth operations is presented in the nextsection.

III. PROPOSED ACCURACY EVALUATION APPROACH

In this section, a new approach for evaluating the quantizationnoise at the output of any system based on smooth operations(addition, subtraction, multiplication and division) is detailed.First, the models associated with the quantization process andthe quantization noise propagation model are introduced. Then,the modelling of the system is presented and the output noisepower expression is detailed.

A. Quantization Noise Models

1) Quantization Noise Source: The quantization process canbe modelled by the sum of the original signal and of a uniformlydistributed white noise [35], [32]. This quantization noise is un-correlated with the signal and the other noise sources. Such amodel is valid provided that the signal has a sufficiently largedynamic range compared to the quantization step. Accordingto the quantization mode, the noise Probability Density Func-tion (PDF) will differ. Three quantization modes are usuallyconsidered: truncation, conventional rounding, and convergentrounding. In the truncation mode, the Least Significant Bits(LSB) are directly eliminated. The resulting number is alwayssmaller than or equal to the number obtained before quantiza-tion, and, therefore, the quantization noise is always positive.Consequently, the mean of the quantization noise is not equal tozero. To reduce the bias due to truncation, the rounding quanti-zation mode is often used. In conventional rounding, the data arerounded to the nearest value representable in the reduced-accu-racy format. For numbers located at the midpoint between twoconsecutive representable values, the data are rounded-up al-ways to the larger output value. This technique leads to a (small)bias for the quantization noise. To eliminate this quantizationnoise bias, the convergent rounding can be used as well. In thiscase, the numbers located at the midpoint between two consec-utive representable values are, with equal probability, roundedto the higher or lower output value.Let denote the number of bits for the fractional part after

the quantization process and , the number of bits eliminatedduring the quantization. The quantization step after the quan-tization is equal to . The quantization noise meanand variance are given in Table I for the three considered quan-tization modes [9], [28].

Page 4: 2326 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …people.rennes.inria.fr/Olivier.Sentieys/... · ROCHER et al.: ANALYTICAL APPROACH FOR NUMERICAL ACCURACY ESTIMATION OF FIXED-POINT

ROCHER et al.: ANALYTICAL APPROACH FOR NUMERICAL ACCURACY ESTIMATION OF FIXED-POINT SYSTEMS BASED ON SMOOTH OPERATIONS 2329

TABLE IFIRST- AND SECOND-ORDER MOMENTS FOR THE

THREE CONSIDERED QUANTIZATION MODES

TABLE IIVALUES AND OF (9) FOR ARITHMETIC OPERATIONS

2) Quantization Noise Propagation: The aim of this part isto define the model for the propagation of the quantization noisefor smooth operations. We specifically concentrate on an oper-ation with two inputs and and on the calculation of itsoutput . The finite precision values and arerespectively the sum of infinite precision values and withtheir associated quantization noises and . The propaga-tion model is then defined such as is a linear combination ofand

(9)

and are obtained from a first order Taylor development atthe output of the differentiable operator defined as

(10)

Thus, the expressions of and are

(11)

The values of and are summarized in Table II for thearithmetic operations.For non-scalar noise sources (vectors or matrix), the model

of (9) is not valid anymore. For example, in the case of themultiplication of two matrices and associated with noisematrices and , the result is equal to . Then,the output noise matrix is given by .In the rest of the paper, upper-case letters in bold will repre-

sent matrices, lower-case letters in bold will represent vectors,and italic will represent scalars.Since commutativity is not valid for non-scalar multiplica-

tions and since we seek to obtain a general model, each noisesource on the operation input is multiplied by a signal term onthe left or on the right. Therefore, in the general case, the noisesource propagation is expressed as the multiplication of each

TABLE IIIVALUES AND OF (12) FOR OPERATIONS

noise input by two matrices ( on the left and on the right)as

(12)

Table III defines values of and according to the typeof operation. For the mathematical expressions of the generalapproach, and will be considered as matrices. However,when applied to different examples, depending on the applica-tion context, these terms will be noted and for matrices,and for vectors and and for scalars.To illustrate the proposed approach, the example of a FIR (Fi-

nite Impulse Response) filter is firstly considered. In this case,is an -size input vector, is its associated noise,

and is the filter coefficient vector. So, the output noisedue to input noise propagation is equal to

(13)

The terms and are therefore given by

(14)

(15)

To illustrate the need of the two matrices and , the adap-tive filter based on the Affine Projection Algorithm (APA) isthen considered. APA has been proposed in [29] to have a fasterconvergence compared to the Least Mean Square (LMS) algo-rithm and to reduce complexity compared to Recursive LeastSquare (RLS) algorithm. Let denote an matrixmade up of the last observation vectors of the input,being the coefficient vector and the error vector. The up-dated coefficient equation is given by

(16)

The noise matrix associated to the inversion propaga-tion matrix is then multiplied by two terms

(17)

(18)

B. System Model

Each noise source propagates throughout the system and con-tributes to the global output noise. In this section, the systemcharacterizing each noise propagation is defined to compute af-terward the expression of the system output noise power. Let

denote the number of noise sources inside the system. Asshown in Table II, the noise expression does not contain crossednoise terms. So, each noise source at time contributesto the output noise by a noise that is independent of allother noise terms. The system output noise is the sum ofall contributions as presented in Fig. 1 and expressed by

(19)

Page 5: 2326 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …people.rennes.inria.fr/Olivier.Sentieys/... · ROCHER et al.: ANALYTICAL APPROACH FOR NUMERICAL ACCURACY ESTIMATION OF FIXED-POINT

2330 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 59, NO. 10, OCTOBER 2012

Fig. 1. System noise model for input noises and output noise .

Each noise contribution is obtained from the propaga-tion of noise sources through the system . Thus, to de-termine the complete expression of the output noise , eachsubsystem must be characterized analytically.1) System Characterization: The contribution of the

noise source depends on the previous samplesof the noise source with . If the systemis recursive, the noise depends on its previous samples

with . Thus, the general expressionof the quantization noise is

(20)

where represents the contribution of noise source attime to the output noise and is the contribu-tion of noise at time . The terms and maybe time-varying and depend on the system implementation. ForLTI systems, the terms and are constant and correspond tofilter coefficients.Nevertheless, to compute the output noise power expression

from the statistical parameters of the noise , (20) must bedeveloped to express contribution with only input noiseterms . This development introduces the time-varying impulseresponse associated with the system .2) Time-Varying Impulse Response: In this part, the time-

varying impulse response of the system is determined. Bydeveloping the recurrence in (20), the noise term becomes

(21)

where is the time-varying impulse response of thesystem which formulates the noise only from thesamples of the noise source . The term representsthe contribution of the noise source at timeto generate at time . For the sake of clarity, isnoted by since output time is always considered at time. This impulse response is obtained from the terms andwith

(22)

Equation (22) was obtained in the case of scalar noisesources. The model can be extended to vector (or matrix) noisesources by using vector (or matrix) terms for the time-varyingimpulse response. In this case, the time-varying impulse re-sponse is equivalent to the multiplication of the noise source

(to be more general, noises are considered as matrices

in the following) by two terms and . In case ofnon-scalar noise source, (21) can be rewritten as

(23)

The output noise is finally the sum of all noise sourcecontributions and can be expressed as

(24)

The output noise and input noises have formaldimensions of and . So, andare x and matrices. Their numerical values de-pend on the system characteristics and can be computed from aanalysis of the considered application.

C. Output Noise Power Expression

1) Noise Power: The output noise power is defined as thesecond order moment of (24) and can be expressed as

(25)

which is equivalent to

(26)

Equation (26) can be split into three parts by separating thenoise from the other noises when and the terms

from when .Theorem 1: Each term of is less or equal than

its trace value.Proof: The matrix is real, symmetric and

positive and therefore diagonalizable. Thus, eigenvalues ofare positive, and each of them is less than the trace

value. Indeed, this matrix can be approximatedby its trace . In the case where the noises arevectors, then is a scalar and is equal to its trace.Moreover, by applying Theorem 1, (26) can be reformulated

and leads to

(27)

Following the method described in [5], the expected value ofthe noise and signal terms are computed separately. For the twomatrices and made-up of signal terms and for a noisenon-correlated with these two signal terms, the following rela-tion is used: . Moreover, the different

Page 6: 2326 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …people.rennes.inria.fr/Olivier.Sentieys/... · ROCHER et al.: ANALYTICAL APPROACH FOR NUMERICAL ACCURACY ESTIMATION OF FIXED-POINT

ROCHER et al.: ANALYTICAL APPROACH FOR NUMERICAL ACCURACY ESTIMATION OF FIXED-POINT SYSTEMS BASED ON SMOOTH OPERATIONS 2331

quantization noises are assumed to be white and independentso the intercorrelation between two noises and is equal to

(28)

where and represent mean and variance of each elementof input noises is the identity matrix and is the-size matrix composed of 1. Thus, the output noise power

can be summarized as

(29)

By introducing the terms and , (29) is reformulated as

(30)

represents the gain on the variance for the noise sourceand is equal to

(31)

represents the gain on the mean for the couple of noisesources and and is equal to

(32)

The expressions of and are obtained by a singlefloating-point simulation since they only involve signal terms.All the terms included in these expressions are stored fromthis single floating-point simulation and the terms and

are then computed. These terms are independent fromnoise sources and lead to constants in the output noise powerexpression. Noise statistics and depend on fixed-pointformats and are the variables describing the output noise powerexpression. They are computed from the data word-length ofeach data and operation in the application graph.It should be noted that (29) is unbiased since no restrictive

assumption was made about the system. and are definedby infinite sums. However, different cases must be taken intoaccount. If the system is LTI, the matrices and are con-stant, (31) and (32) can be simplified, and it leads to the resultsobtained in (5). If the system is not recursive, the length of thetime-varying impulse responses is finite and the sums in (29)are finite and can be computed directly.For the recursive case, these infinite sums are approximated

by computing the first elements of the sum, and therefore (31)and (32) are computed from to with depending on thesignal correlation inside the terms and . Nevertheless, ac-cording to the different experimentations that have been carriedout, a number equal to or less than 500 leads to very realisticand accurate results. Moreover, to compute (29), the ergodic as-sumption is used so that the statistical mean is replaced by a timeaverage over samples to get a realistic result. In practice, a

number equal to 100 leads to accurate modelling properties,as it will be shown in Section IV. The complexity of the pro-posed method has been analyzed and leads to a number of mul-tiplications and additions equal to

(33)

(34)

where corresponds to the output noise matrixsize and those of input noise . In the next subsec-tion, an approach to reduce the complexity of the infinite sumcomputation without a significant loss of accuracy is presented.2) Linear Prediction Approach: To reduce complexity, a

linear prediction approach is proposed to replace the infinitesums. (22) has shown that impulse response terms aredefined by a recurrent equation where the terms correspondsto the maximal delay for the recursive part. For LTI systems,terms and are directly equal to transfer function coeffi-cients. Therefore, the rest of the section will focus on non-LTIrecursive systems.As the impulse response is time-varying, the relation (22) is

non linear. The aim is then to linearize this expression with pre-diction coefficients to obtain an estimate of the impulseresponse equal to

(35)

Computing the coefficient vector gives amodel of the impulse response term with the aimto minimize the mean square error betweenand its estimate value defined as

(36)

For the sake of clarity on , a term isintroduced and the index is removed. Thus, the mean squareerror is defined as

(37)

The derivative of the mean square error, with respect to thecoefficients , is

(38)

These prediction coefficients can be determined by set-ting (38) to zero, leading to the solution defined as

(39)

with

...

(40)

Page 7: 2326 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …people.rennes.inria.fr/Olivier.Sentieys/... · ROCHER et al.: ANALYTICAL APPROACH FOR NUMERICAL ACCURACY ESTIMATION OF FIXED-POINT

2332 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 59, NO. 10, OCTOBER 2012

and

(41)

To determine and , the first values of the impulse responsemust be known. Therefore, the approach to determine the pre-diction coefficients is as follows:1) The first terms of the time-varying impulseresponse are determined using (22), with and repre-senting respectively the maximal number of delays for thenon-recursive and the recursive part of the system .

2) The two matrices and are determined.3) The prediction coefficients are deter-mined by .

The coefficients are determined according toAppendices A and B and the infinite sums can thereforebe written as

(42)

(43)

where

...(44)

.... . .

. . .... (45)

(46)

(47)

and with matrix being the solution of

(48)

In practice, this model is applied to matrices and foreach noise source. The complexity of this method is computedand leads to a number of multiplications and additions

equal to

(49)

(50)

Fig. 2. Synoptic of the accuracy evaluation tool.

D. Accuracy Evaluation Tool

The proposed approach has been implemented in a softwaretool to automate this process, its synoptic is given in Fig. 2. Ournumerical accuracy evaluation tool generates the analytical ex-pression of the output quantization noise from the signal flowgraph (SFG) of the application. This analytical expression is im-plemented through a C function having the fractional partof all data word-lengths as input parameters. This C code can becompiled and dynamically linked to the fixed-point conversiontool for the optimization process. The accuracy evaluation tooluses a technique similar to the one proposed in [25]. The firststep transforms the application SFG into an SFG rep-resenting the application at the quantization noise level. All thepotential noise sources are included in the graph. The expres-sion of the variance and the mean of each noise sourceaccording to are generated in order to be included in the

output C function. The expression of the number of bits elimi-nated between two operations can be found in [25]. Then, eachoperation is replaced by its propagation noise model, as pre-sented in Section III-A.2. The aim of the second step is to de-termine the expression of the system impulse response betweenthe output and the different noise sources. To handle recursivesystems, cycles in the graph are detected and then dismantled.The third step computes the value of the gains and givenin (31) and (32). The linear prediction approach presented inSection III-C.2 is used to approximate the infinite sums. Thevalues of the different terms of the matrices and are col-lected during a single floating-point simulation. The referencesimulation and the third step of the tool are performed with theMatlab tool.

IV. RESULTS

In this section, experimentations are carried out to validatethe proposed approach on LTI and non-LTI systems. Results aregiven on various applications: vector normalization illustratesnon-LTI non-recursive systems, adaptive filters such as LMS,

Page 8: 2326 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …people.rennes.inria.fr/Olivier.Sentieys/... · ROCHER et al.: ANALYTICAL APPROACH FOR NUMERICAL ACCURACY ESTIMATION OF FIXED-POINT

ROCHER et al.: ANALYTICAL APPROACH FOR NUMERICAL ACCURACY ESTIMATION OF FIXED-POINT SYSTEMS BASED ON SMOOTH OPERATIONS 2333

NLMS and APA illustrate non-LTI recursive systems, and var-ious benchmarks illustrate all kinds of systems. For these appli-cations the Signal-to-Quantization Noise Ratio (SQNR) valuesare chosen from 30 dB to 90 dB to ensure that the assumptionsof the Widrow model are satisfied (the noise variance is signif-icantly smaller than the signal power). Moreover, the optimiza-tion time of the proposed approach is compared with approachesbased on fixed-point simulations.

A. A Non-LTI Non-Recursive System: Vector Normalization

The first example considers an -size vector normal-ized according to the relation

(51)

Here to simplify the final expression of the power, only twonoises are considered: the noise due to the quantizationof and the noise generated by the normalizationcomputation. By using the noise model propagation defined inSection III-A, the output noise vector is equal to

(52)

which can be simplified as

(53)

with the -size identity matrix. This example has no delayin the graph so only one impulse term has to be deter-mined. The other terms are then equal to zero forand is equal to one.By applying (29) to the output noise expression, its power

is

(54)

where and are the means of and andtheir variances.For this example, is set to 4, and the truncation mode is

considered. The fractional part word-length of the input andfor the output are respectively and . So, using (30)and by introducing values of mean and variance of the two con-sidered noises, (54) can be written as

(55)

where represents the number of bits eliminated at the outputof the normalization. The expression of the gain and aregiven in Table IV.The chosen synthetic input signal is a first-order autore-

gressive signal defined by

(56)

Fig. 3. Fixed-point LMS algorithm and noise model for input data ,estimate data and coefficients .

TABLE IVEXPRESSION AND VALUES OF THE GAINS AND IN THE CASE OF

THE VECTOR NORMALIZATION EXAMPLE

with and a zero-mean white noise of variance0.01.The numerical values of the terms and are given in

Table IV. As shown by the results, the terms and dependon signal characteristics and correlation between them.

B. A Non-LTI Recursive System: The Least Mean Square(LMS) Adaptive Filter

1) Fixed-Point LMS Algorithm: To illustrate the proposedmodel and to verify its accuracy on a non-LTI and recursivesystem, the Least Mean Square (LMS) example is detailed.The LMS adaptive algorithm [18] addresses the problemof estimating a sequence of scalars from an -lengthvector . The linearestimate of is where is an -lengthvector which converges to the optimal vector in themean-square error (MSE) sense. This optimal vector is equalto where corresponds to the correlationmatrix of the input signal equal to and whererepresents the intercorrelation vector between andequal to . The vector is updated according to

(57)

where is a positive constant representing the adaptationstep. In a fixed-point implementation of the LMS, four equiva-lent noise sources have to be considered as presented in Fig. 3.

and are noises associated respectively with theinput data quantization and the desired signal quan-tization. The vector is the noise vector due to the quan-tization of the products between and the error

. Finally, groups together all noises

Page 9: 2326 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …people.rennes.inria.fr/Olivier.Sentieys/... · ROCHER et al.: ANALYTICAL APPROACH FOR NUMERICAL ACCURACY ESTIMATION OF FIXED-POINT

2334 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 59, NO. 10, OCTOBER 2012

Fig. 4. Propagation through the LMS algorithm of the noise source dueto filter coefficient computing.

generated inside the filtering part during the computation of theinner product . The terms and represent themean and variance of each noise source .2) Noise Model for Adaptive Coefficient Computation: In

this section, the output noise power expression is determinedfrom the four noise sources: and .For each one, the impulse response between the source and theoutput is determined. Then, the output noise power can be di-rectly expressed using (30) where in that case. Also, theoutput noise power can be determined using the linear predic-tion approach (as presented in Section III-C.2). Each impulseresponse is modelled using the prediction coefficients, leadingto a simpler expression of the output noise power.These two approaches are presented on the LMS example in

the following. Only the propagation of the noise is de-tailed. The contribution of the other noises can be obtained inthe same way. The propagation of noise term is shownon Fig. 4 and is considered for validating the accuracy of theproposed approach and detailed hereafter. As shown on Fig. 3,the noise is first delayed with a first recursion. Then, thenoise goes through the FIR filter where it is multiplied by thesignal . After, the noise propagates through the second re-cursion to the system output and comes back to the multiplica-tion by .Noise propagation can be modelled using the following set of

recurrent equations:

(58)

(59)

where is the noise after the delay. By developing thesetwo previous expressions, the scalar output noise iswritten using its time-varying impulse response as

(60)

The time-varying impulse response is defined through twoequivalent matrices as defined in Section III. For this noise, thematrix corresponds to an -size vector . This vector isintroduced in (61) and is defined in (62).

(61)

(62)

with

(63)

The expression of can be modelled using linear predictioncoefficient as presented in Section III-C.2. The first impulse re-sponse terms of are

(64)

The LMS algorithm is a first-order recursive system. Thus,only one prediction coefficient has to be determined using theapproach defined before. The matrix , defined by (40), is ascalar in that case and is equal to

(65)

The term , defined by (41), is equal to

(66)

So, from (39), the prediction coefficient is expressed as

(67)

Thus, with this coefficient, the impulse response term canbe expressed as

(68)

for since . The infinite sum is obtainedusing (42) as

(69)

The contributions of the three other terms can be obtainedwith the same method. The time-varying impulse responseassociated with these three noises are given in the followingexpressions:

(70)

where is the Kronecker operator. It is noteworthy that, asa consequence, the noises and are scalar. Then, theterms modelling their propagation through the system are alsoscalars, which let us write for input noises

and . The output noise power is finally computedusing (30) and leads to (71):

(71)

Page 10: 2326 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …people.rennes.inria.fr/Olivier.Sentieys/... · ROCHER et al.: ANALYTICAL APPROACH FOR NUMERICAL ACCURACY ESTIMATION OF FIXED-POINT

ROCHER et al.: ANALYTICAL APPROACH FOR NUMERICAL ACCURACY ESTIMATION OF FIXED-POINT SYSTEMS BASED ON SMOOTH OPERATIONS 2335

The noise mean term is equal to

where the mean values and are -size vectors andand are scalars. The complexity of the computation of thegains and of (71) are obtained from (33) and (34). Inthis example, input noises are size vectors for and(so and ) and scalar for and

, the output noise is a scalar .Moreover, the number is set to 500 and . The numberof multiplications is equal to

(72)

So the complexity is in . The expression of the outputnoise power can be modelled using prediction coefficients thatcan be computed for noises and using thesame method as the one presented before for . Usingthe prediction coefficients, the expression (71) is simplified andleads to (73) given at page 10 in which refers to scalar vari-ance value of each noise.

(73)

with

(74)

By applying (49), the number ofmultiplications is then equal to

(75)

The complexity is divided by a factor 375. Thus, the outputnoise power can be computed by two different ways: using theimpulse response terms of ((71) derived from (30)) or usingprediction coefficients using (73). The quality of these two ap-proaches is compared in the next section.3) Quality of the Estimation: To analyze the quality of the

proposed approach, experiments have been performed. Thisquality has been evaluated through the measurement of the rel-ative error between our analytical estimation and the es-timation based on fixed-point simulations which is con-sidered to be the reference. The relative error is defined as

(76)

The reference value is obtained by simulations from the dif-ference between a floating-point and a fixed-point simulation.The floating-point simulation is used as the high-precision refer-ence, since floating-point arithmetic generates negligible noisescompared to fixed-point arithmetic.For these experiments, the chosen synthetic input signal

is a first-order autoregressive signal defined by (56). Fig. 5shows the relative error of the proposed approach on a 32-tapLMS adaptive filter. The results are presented according to thenumber of elements chosen to compute the infinite sums andthe correlation coefficient of the input signal . The inputsignal can be a white noise , a fairly correlated noise

or a very correlated noise . As increases,the relative error decreases. Indeed, when increases, more

Fig. 5. Relative error for the 32-tap LMS according to the number of samplesin the computation of the infinite sums.

terms are included in the computation of the sums, whichleads to a better estimation quality. Moreover, the relativeerror convergence speed depends on input data correlation.For non-correlated input data, the relative error convergenceis slower than the one for very-correlated input data. In fact,the relative error is less than 20% after 300 samples for verycorrelated input data, after 350 samples for fairly correlateddata and after 550 samples for uncorrelated input data.To explain this result, we recall that the terms , representing

the noise propagation through the system, include the termsdefined as

(77)

For very correlated input data, the termdecreases faster than for uncorre-

lated data. Thus, the number of elements required to computethe infinite sums depends on the input data correlation.Nevertheless, with experimentation presented for , therelative error is less than 25% in all cases, which represents adifference equivalent to less than 1 bit between the noise powerobtained with the proposed approach and the real noise power.For the linear prediction approach, the obtained relative erroris equal to 21%. For other sizes of the LMS algorithm, samekinds of results are obtained, thus indicating that the proposedapproach is accurate for the LMS algorithm.Experiments have also been conducted on other adaptive al-

gorithms. The results are presented in Table V for the previousLMS, a 32-tap Normalized LMS (NLMS) and an Affine Projec-tion Algorithm (APA) of size 32 12. The expression of the co-efficient update for the APA algorithm is given in (16). The APAapplication is suitable to show the case where the general modelpresented in Section III include the two matrices and . BothNLMS and APA algorithms moreover include division opera-tions for the normalization module. The floating-point simula-tion is executed on samples where is the number ofsamples to compute the statistical mean and is the number ofterms to compute the sums. In these experiments, is set to100 and different values of are tested.The approach based on linear prediction leads to a relative

error of around 20% (equal to 1 dB). However, the method usingthe computation of the sums can be really accurate since therelative error is less than 17.65% in mean with andless than 6.24% with . The accuracy of the estimation

Page 11: 2326 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …people.rennes.inria.fr/Olivier.Sentieys/... · ROCHER et al.: ANALYTICAL APPROACH FOR NUMERICAL ACCURACY ESTIMATION OF FIXED-POINT

2336 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 59, NO. 10, OCTOBER 2012

TABLE VRELATIVE ERROR OBTAINED WITH THE PROPOSEDAPPROACHES FOR DIFFERENT ADAPTIVE FILTERS

TABLE VIRELATIVE ERROR FOR LTI SYSTEMS

can be controlled by adjusting the number of elements used tocompute these sums. We can therefore see the term as a wayto trade-off between the accuracy of the estimation and the timerequired to compute this estimation.

C. Experiments on LTI and Non-LTI Non-Recursive Systems

In this section, the proposed approach is first evaluated on LTIsystems such as FIR and IIR filters, an MP3 coder/decoder andan FFT. Then, the approach is applied to non-LTI and non-re-cursive systems (square, Volterra filter, correlator). The averageandmaximum relative errors between the noise power estimatedwith the proposed approach and the noise power obtained byfixed-point simulations are provided.1) LTI Systems: First, non-recursive systems, such as FIR

filter, MP3 coder/decoder and FFT, are evaluated for differentfixed-point specifications. The results are presented in Table VI.For the 32-tap FIR filter, rounding and truncation have beentested. The average relative error is 2.49% for rounding and1.14% for truncating. Moreover, the maximum relative errorsare very low (about 8%).For the MP3 coder/decoder, the average error is equal to

5.24% for the polyphase filter and to 8.42% for the DCT. Themaximum errors are 20.57% for the polyphase filter and 24.8%for the DCT. These values represent a difference of 2 dB (lessthan 1 bit) between the real value of the noise power and theone estimated by the proposed approach. For the FFT themaximal relative error is less than 7.89%, which also illustratesthe accuracy of the proposed approach on relevant applicationssuch as the FFT. To test the method on a recursive LTI system,

TABLE VIIRELATIVE ERROR FOR NON-LTI AND NON-RECURSIVE SYSTEMS

Fig. 6. Optimization time (s) for the LMS32 example of our two approaches(linear prediction, infinite sums with ) and the approach based on fixed-point simulations.

we have considered an 8-order IIR filter divided into four2nd-order cells. The relative error is less than 3.3% with anaverage value of 1.68%.2) Non-LTI and Non-Recursive Systems: For the case of

non-LTI systems we have tested three applications: a 2nd-orderpolynomial evaluation, a 2nd-order Volterra filter and a corre-lator. The results are presented in Table VII. For the 2nd-orderpolynomial, the estimation is highly accurate, since the error islower than 0.8%. For the Volterra filter and the correlator, the av-erage relative error is equal to 1.79% and 1.35% with the max-imum values of 3.22% and 5.78% respectively. For these threeapplications, the error is less than 5.78%, thus confirming theaccuracy of the proposed approach.In summary, for all types of systems based on arithmetic oper-

ations, the presented results tend to indicate that the method hasan overall relative error that is less than 25% even in the worstcases, which can be translated to less than 1 bit of precision.

D. Fixed-Point Specification Optimization Time

The evaluation of the numerical accuracy is used in theword-length optimization based on an iterative process. Theoptimization time obtained with the proposed approach iscompared with those obtained with fixed-point simulations. Forthe two approaches, the optimization times are given in Fig. 6for the 32-tap LMS algorithm. A computer with Intel Core 2Duo T8300 at 2.4 GHz and 2 GB of RAM was used for theexperiments. For the approach based on fixed-point simulation,a new simulation is required as long as a fixed-point formatis modified. So, the optimization time depends linearly on thenumber of optimization iterations. For approaches based onfixed-point simulations, experimentations were conducted onMatlab to evaluate their computing time. For the proposedapproach, the analytical expression of needs to be computedfirst. The determination of this analytical expression is the mosttime-consuming part and its computation takes 46 seconds for

Page 12: 2326 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …people.rennes.inria.fr/Olivier.Sentieys/... · ROCHER et al.: ANALYTICAL APPROACH FOR NUMERICAL ACCURACY ESTIMATION OF FIXED-POINT

ROCHER et al.: ANALYTICAL APPROACH FOR NUMERICAL ACCURACY ESTIMATION OF FIXED-POINT SYSTEMS BASED ON SMOOTH OPERATIONS 2337

TABLE VIIICOMPARISON OF THE PROPOSED APPROACHES WITH THE ONE PROPOSED IN

[4] IN TERMS OF EXECUTION TIME (SECONDS) ANDRELATIVE ERROR FOR THE ESTIMATION

the approach based on infinite sum approximation and 4 sec-onds for the linear prediction approach. Then, each iteration ofthis optimization process corresponds only to the noise powerexpression evaluation, whose computing time is negligible. Forthe LMS algorithm, the proposed approach leads to a gain afteronly 100 iterations, which represents an execution time equalto 46 seconds. As an example, for an optimization processwith about 30 variables, between and iterations arerequired. In that case the proposed approach would requirebetween 10 and less time to perform the optimizationprocess. The approach based on linear prediction leads to timegain after only 10 iterations compared to approaches basedon fixed-point simulations, thus leading to an execution timeequal to 4 seconds. These results clearly show that the proposedapproach is faster than the best simulation-based techniques forthe process of fixed-point optimization.A comparison with methods from [4] and [31] has also been

carried-out. As they are also analytical, only the time to deter-mine the coefficients and is given. For the approach de-veloped in [31], no result on the quality and the execution timewere given. However, for an application with noise sources,

fixed-point simulations are required. Thus, the executiontime grows quadratically with the complexity of the applica-tion. For the 32-tap LMS algorithm presented above, 66 noisesources are considered and thus the approaches proposed in [31]would require 4356 fixed-point simulations. Therefore, the pro-posed method leads to significantly lower optimization time, es-pecially when the application becomes complex.In the approach based on affine arithmetic simulations pro-

posed in [4], the relative error for the estimation and thetime required to determine the coefficients andare given in Table VIII for a second-order IIR filter (IIR2), an8-order IDCT, and for an LMS filter with 2 and 5 taps. For thecase of the IIR2 application, the time to execute the third stepcorresponding to the determination of the noise power expres-sion (computation of and ) corresponds to 70% of theglobal execution time for the approach based on infinite sumsand to 50% for the approach based on linear-prediction. Thetime to execute the second step corresponding to the determi-nation of system expression (computation of and ) rep-resents to 26% of the global execution time for the approachbased on infinite sums. For this approach, the rest of the time(4%) is dedicated to the other steps (front-end, determinationof system noise model and expression generation). For the pro-posed method based on infinite sums with , the quality

of the estimation is close to the one obtained with the techniqueof [4]. For our linear prediction based-approach, the quality isslightly lower, but the execution time is reduced. For our two ap-proaches, the execution time is significantly reduced comparedto the approach proposed in [4] and a reduction of several orderof magnitude is observed.The results show the benefit of the proposed approaches to

reduce the development time of fixed-point systems. We pro-pose two different approaches with different characteristics, asshown in Table VIII. It is seen that the infinite sums approach isreally accurate and fast, whereas, the approach based on linearprediction is faster but a little less accurate. So, the proposedframework allows the user to choose between two different ap-proaches with different trade-offs between speed and accuracy.

V. CONCLUSION

In this paper, a fast general and accurate approach to deter-mine the accuracy of a fixed-point system is presented and vali-dated. The method is analytic and provides a closed-form ex-pression of the output noise power for systems composed ofsmooth operations. The quantization noises due to fixed-pointcomputation are considered through their statistical properties(mean, variance, etc.) and their propagation through arithmeticoperations (addition, subtraction, multiplication, and division)is detailed. The system is characterized by its time-varying im-pulse response and the expression of the output noise power isobtained. The approach is valid for all types of system made-upof smooth operations and for most common rounding modes(truncation, conventional rounding, convergent rounding). Theestimation is unbiased and leads to infinite sums inside the ex-pression of the output noise power which have to be approxi-mated by a finite number of terms. To reduce the complexitydue to infinite sums, an approach based on linear prediction isalso proposed.The proposed approach was applied to the LMS algorithm,

a recursive and non-LTI system, to verify its validity and itsquality. Moreover, several LTI and non-LTI systems have beentested to show the accuracy of the proposed approach on var-ious types of applications. The maximal relative errors are lessthan 25% which corresponds to less than 1 bit. Finally, the ap-proach has been compared to approaches based on fixed-pointsimulations in terms of execution time. After only 10 iterations,the proposed approach is less time-consuming than other ap-proaches based on fixed-point simulations. Thus, it was shownthat the proposed approach leads to accurate estimation andallows a significant reduction of the optimization time in thefixed-point conversion process.

APPENDIX A

In this Appendix, infinite sums are modelled for impulse re-sponse terms . The vector is defined as

...(78)

The first terms of impulse response are directlydetermined. Thus, terms to are computedwith (22). The term is present in the matrix andin the following matrices. This term can also be computed

Page 13: 2326 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …people.rennes.inria.fr/Olivier.Sentieys/... · ROCHER et al.: ANALYTICAL APPROACH FOR NUMERICAL ACCURACY ESTIMATION OF FIXED-POINT

2338 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 59, NO. 10, OCTOBER 2012

with prediction coefficient . The -size matrix is definedby

.... . .

. . .... (79)

and the following recurrence relation is obtained:

(80)

Theorem 2: The matrix has a limit equal to 0 when.

Proof: The characteristic polynomialof the matrix is equal to

.... . .

. . . (81)

To compute the determinant, we develop according to the firstcolumn, which leads to the recurrence expression

As for and by using theprevious recursion, is equal to

(82)

The roots of the previous equation are the eigenvalues of .Moreover, these prediction coefficients define a transfer func-tion. Indeed, they linearize the time-varying transfer function.The poles of the transfer function satisfy the equality

(83)

Thus, comparing solutions of (82) with (83), eigenvalues ofare the poles of the transfer function, and can be written as

.... . .

. . .... (84)

where is an invertible matrix and can be real or complex.However, since the system is stable, the modulus of is lessthan one so converges to 0 when . So, is equal to

.... . .

. . .... (85)

(85) proves that tends to zero when .

Therefore:

(86)

with the -size identity matrix.The vector is also equal to

...(87)

So, the vector is introduced where each element is de-fined by

(88)

contains the first impulse response terms which arecomputedmanually. Thus, each line of is equal to

. The sum of impulse response terms is then equalto

(89)

This expression is developed for impulse terms . In prac-tice, the impulse terms are defined by two matrices andas shown in (Section III). The previous method is then appliedto where is a matrix containing ones (and replacing thenoise ).

APPENDIX B

In this Appendix, the method modelling the second orderinfinite sums is presented. Using the same approach as inAppendix A, the following recurrence relation is obtained:

(90)

Matrix has a limit equal to zero when as demon-strated previously, which gives

(91)

Yet, the next equality can be determined:

(92)

This equality is a Lyapunov equation. So, is the solutionof the Lyapunov equation and can be obtained using a classicalsolver. As in Appendix A, we introduce the diagonal matrixwhere each diagonal element is equal to

(93)

Page 14: 2326 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …people.rennes.inria.fr/Olivier.Sentieys/... · ROCHER et al.: ANALYTICAL APPROACH FOR NUMERICAL ACCURACY ESTIMATION OF FIXED-POINT

ROCHER et al.: ANALYTICAL APPROACH FOR NUMERICAL ACCURACY ESTIMATION OF FIXED-POINT SYSTEMS BASED ON SMOOTH OPERATIONS 2339

Each element of the diagonal of is equalto the sum of square impulse terms . So, it leadsto

(94)

REFERENCES[1] G. Alefeld and J. Herzberger, Introduction to Interval Computations.

New York: Academic Press, 1983.[2] P. Belanovic and M. Rupp, “Automated floating-point to fixed-point

conversion with the fixify environment,” in Proc. IEEE Rapid SystemPrototyping (RSP’05), Montreal, Canada, 2005, pp. 172–178.

[3] M. Blair, S. Obenski, and P. Bridickas, “Patriot Missile Defense:Software Problem Led to System Failure at Dhahran, Saudi Arabia,”United States Department of Defense, General Accounting Office,Washington, DC, Technical Report GAO/IMTEC-92-26, 1992.

[4] G. Caffarena, J. A. Lopez, C. Carreras, and A. Fernandez,“SQNR estimation of fixed-point DSP algorithms,” EURASIPJ. Adv. Signal Process., 2010, Article ID 171027, 12 pages,doi:10.1155/2010/171027.

[5] C. Caraiscos and B. Liu, “A roundoff error analysis of the LMS adap-tive algorithm,” IEEE Trans. Acoust., Speech, Signal Process., vol.ASSP-32, no. 1, pp. 34–41, Feb. 1984.

[6] J. M. Chesneaux, L. S. Didier, and F. Rico, “The fixed Cadna library,”in Proc. RNC’03, Sep. 2003, pp. 215–229.

[7] M. Clark, M. Mulligan, D. Jackson, and D. Linebarger, “Acceleratingfixed-point design forMB-OFDMUWB systems,”CommsDesign, Jan.26, 2005, Article ID 57703818.

[8] G. Constantinides, “Perturbation analysis for word-length optimiza-tion,” in Proc. IEEE FCCM’03, Napa, CA, Apr. 2003, pp. 81–90.

[9] G. Constantinides, P. Cheung, and W. Luk, “Truncation noise in fixed-point SFGs,” IEEE Electron. Lett., vol. 35, no. 23, pp. 2012–2014, Nov.1999.

[10] G. A. Constantinides, “Word-length optimization for differentiablenonlinear systems,” ACM Trans. Des. Autom. Electron. Syst., vol. 26,no. 1, pp. 26–43, 2006.

[11] G. A. Constantinides, P. Y. K. Cheung, and W. Luk, “Wordlength op-timization for linear digital signal processing,” IEEE Trans. Comput.-Aid. Des. Integr. Circuits Syst., vol. 22, pp. 1432–1442, 2003.

[12] M. Coors, H. Keding, O. Luthje, and H. Meyr, “Integer code gener-ation for the TI TMS320C62x,” in Proc. ICASSP’01, May 2001, pp.1133–1136.

[13] L. H. de Figueiredo and J. Stolfi, “Affine arithmetic: Concepts and ap-plications,” Numerical Algorithms, pp. 1–13, 2003.

[14] B. Evans, “Modem Design, Implementation, and Testing Using NI’sLabVIEW,” Natl. Instrum. Acad. Day, Beirut, Lebanon, Jun. 2005.

[15] J. Eyre and J. Bier, “DSPs court the consumer,” IEEE Spectrum, vol.36, no. 3, pp. 47–53, 1999.

[16] J. Eyre and J. Bier, “The evolution of DSP processors,” IEEE SignalProcess. Mag., vol. 17, no. 2, pp. 43–51, Mar. 2000.

[17] P. D. Fiore, “Efficient approximate wordlength optimization,” IEEETrans. Computers, vol. 57, no. 11, pp. 1561–1570, 2008.

[18] S. Haykin, Adaptive Filter Theory, 2nd ed. Englewood Cliffs, NJ:Prentice Hall, 1991.

[19] T. Hill, “ACCELDSP synthesis tool floating-point to fixed-point con-version of MATLAB algorithms targeting FPGAs,” Xilinx, White Pa-pers, Apr. 2006.

[20] S. Kim and W. Sung, “Fixed-point error analysis and word length opti-mization of 8x8 IDCT architectures,” IEEE Trans. Circuits Syst. VideoTechnol., vol. 8, no. 8, pp. 935–940, Dec. 1998.

[21] J. L. Lions, L. Lubeck, J. L. Fauquembergue, G. Kahn, W. Kubbat,S. Levedag, L. Mazzini, D. Merle, and C. O. Halloran, “ARIANE 5,Flight 501 Failure,” European Space Agency, Paris, France, Technicalreport, Jul. 1996.

[22] J. A. Lopez, G. Caffarena, C. Carreras, and O. Nieto-Taladriz, “Fastand accurate computation of the roundoff noise of linear time-invariantsystems,” IET Circuits, Devices Syst., vol. 2, no. 4, pp. 393–408, 2008.

[23] J. A. Lopez, C. Carreras, and O. Nieto-Taladriz, “Improved in-terval-based characterisation of fixed-point LTI systems with feed-backs loops,” IEEE Trans. Comput.-Aid. Des. Integr. Circuits Syst.,vol. 26, pp. 1923–1933, 2007.

[24] D. Menard, R. Rocher, P. Scalart, and O. Sentieys, “SQNR determi-nation in non-linear and non-recursive fixed-point systems,” in Proc.EUSIPCO’04, Vienna, Austria, Sep. 2004, pp. 1349–1352.

[25] D. Menard, R. Rocher, and O. Sentieys, “Analytical fixed-point accu-racy evaluation in linear time-invariant systems,” IEEE Trans. CircuitsSyst. I, vol. 55, no. 10, pp. 3197–3208, 2008.

[26] D. Menard, R. Rocher, O. Sentieys, and O. Serizel, “Accuracy con-straint determination in fixed-point system design,” EURASIP J. Em-bedded Syst., p. 13, 2008, Article ID 23197.

[27] D. Menard and O. Sentieys, “Automatic evaluation of the accuracy offixed-point algorithms,” in Proc. DATE’02, Mar. 2002, pp. 529–535.

[28] D. Menard, D. Novo, R. Rocher, F. Catthoor, and O. Sentieys, “Quanti-zation mode opportunities in fixed-point system design,” in Proc. EU-SIPCO’2010, Aalborg, Denmark, 2010, pp. 542–546.

[29] K. Ozeki and T. Umeda, “An adaptative filtering algorithm using anorthogonal projection to an affine subspace and its properties,” Elec-tron. Commun. Jpn., vol. 67, pp. 19–27, 1984.

[30] S. Roy and P. Banerjee, “An algorithm for trading off quantization errorwith hardware resources for Matlab-based FPGA design,” IEEE Trans.Computers, vol. 54, pp. 886–896, 2005.

[31] C. Shi and R. W. Bodersen, “A perturbation theory on statistical quan-tization effects in fixed-point DSP with non-stationary inputs,” in Proc.ISCAS’04, Vancouver, Canada, May 2004, pp. 373–376.

[32] A. Sripad and D. L. Snyder, “A necessary and sufficient conditionfor quantization error to be uniform and white,” IEEE Trans. Acoust.,Speech, Signal Process., vol. 25, no. 5, pp. 442–448, Oct. 1977.

[33] W. Strauss, “DSP chips take on many forms,” DSP-FPGA.com Mag.,Mar. 2006.

[34] S. Wadekar and A. Parker, “Accuracy sensitive word-length selectionfor algorithm optimization,” in Proc. ICCAD’98, 1998, pp. 54–61.

[35] B. Widrow, I. Kollár, and M.-C. Liu, “Statistical theory of quantiza-tion,” IEEE Trans. Instrumen. Meas., vol. 45, no. 2, pp. 353–361, Apr.1996.

[36] B. Wu, J. Zhu, and F. N. Najm, “Dynamic-range estimation,” IEEETrans. Comput.-Aid. Des. Integr. Circuits Syst., vol. 25, pp. 1618–1636,2006.

Romuald Rocher received the Engineering degreeand M.Sc. degree in electronics and signal pro-cessing engineering from ENSSAT, University ofRennes, France, in 2003, and the Ph.D. degree insignal processing and telecommunications from theUniversity of Rennes in 2006.Since 2008, he is an Associate Professor at the Uni-

versity of Rennes (IUT Lannion) and amember of theCAIRN research team at the IRISA/INRIA Labora-tory. His research interests include floating-to-fixed-point conversion and adaptive filters.

Daniel Menard (M’03) received the Engineeringdegree and M.Sc. degree in electronics and signalprocessing engineering from the University ofNantes Polytechnic School in 1996, the Ph.D. degreein signal processing and telecommunications fromthe University of Rennes in 2002, and the Habilita-tion à Diriger des Recherches degree in 2011.He is currently an Associate Professor of electrical

engineering at the University of Rennes (ENSSAT)and a member of the CAIRN research team at theIRISA/INRIA Laboratory. His research interests in-

clude implementation of signal processing and mobile communication appli-cations in embedded systems, fixed-point conversion, embedded software, lowpower architectures and arithmetic operator design.

Pascal Scalart (M’06) received the M.S. degree insignal processing engineering and the Ph.D. degree insignal processing and telecommunications from theUniversity of Rennes in 1989 and 1992, respectively.In 2001, he joined the Department of Electrical

Engineering at the University of Rennes, ENSSAT,France, where he became a Professor in 2003. Heis a member of the CAIRN research team at theIRISA/INRIA Laboratory. His current researchinterests are in the area of signal processing forspeech processing and for digital communications.

Olivier Sentieys (A’95–M’03) received the M.Sc.and Ph.D. degrees in electrical engineering (signalprocessing) from the University of Rennes, France,in 1990 and 1993, respectively.After completing his Habilitation thesis in 1999,

he joined the University of Rennes (ENSSAT) as afull Professor in 2002. He leads the CAIRN ResearchTeam at the IRISA/INRIA Laboratory. His researchactivities are in the two complementary fields of em-bedded systems and signal processing with researchinterests in finite arithmetic effects, low-power and

reconfigurable system-on-chip, design of wireless communication systems andcooperation in mobile systems. He is the author or coauthor of more than 150journal publications or peer-reviewed conference papers and holds 5 patents.