
Cotransformation Provides Area and Accuracy Improvement in an HDL Library for LNS Subtraction

Panagiotis D. Vouzis
Computer Science & Eng.
Lehigh University
Bethlehem, PA 18015
[email protected]

Sylvain Collange
École Normale Supérieure de Lyon
46 Allée d'Italie
69364 Lyon Cedex 07, France
[email protected]

Mark G. Arnold
Computer Science & Eng.
Lehigh University
Bethlehem, PA 18015
[email protected]

Abstract

The reduction of the cumbersome operations of multiplication, division, and powering to addition, subtraction, and multiplication is what makes the Logarithmic Number System (LNS) attractive. Addition and subtraction, though, are the bottleneck of every LNS circuit, and their implementation techniques trade off area, latency, and accuracy. This paper reviews the methods of interpolation, multipartite tables, and cotransformation for LNS addition and subtraction, with special focus on a novel version of cotransformation, for which a new special case is identified. Synthesis results compare an already published Hardware Description Language (HDL) library for LNS arithmetic that uses only multipartite tables or 2nd-order interpolation against a variation of the same library combined with cotransformation. Exhaustive simulation and a graphics example illustrate that the proposed library has smaller area requirements and is more accurate than the earlier library, at the cost of an increase in the latency of the hardware.

Track: Programmable/re-configurable architectures (Computer Arithmetic)
Keywords: Logarithmic Number System, Multipartite Tables, Interpolation, Cotransformation, Hardware Description Languages.

1. Introduction

Keeping logarithmic representations of real values inside digital circuits over a sequence of arithmetic operations was termed the Signed Logarithmic Number System by Swartzlander et al. [22] in 1975. In the three decades since, the Logarithmic Number System (LNS) has been the subject of hundreds of academic papers but only a few dozen practical applications, the most notable of which is perhaps the GRAPE supercomputer [18]. As defined by Swartzlander et al., LNS can be considered as an alternative to Floating-Point (FP) arithmetic, since it represents a real number, X, by the logarithm of its absolute value, x = logb(|X|), and an additional bit denoting its sign. Hence, an LNS number can be seen as an FP number whose exponent has a fractional part and whose mantissa is always 1.0. In this paper a number in the real domain is represented by a capital letter, and its logarithm is represented by the same small letter. Swartzlander et al. [22] assumed the logarithm is represented by a biased representation, analogous to the IEEE-754 floating-point standard; however, any method of signed fixed-point representation may be used for the logarithm, x. The signed fixed-point representation is required even if only positive real values X > 0 are represented in LNS, since x < 0 for |X| < 1. The most common choice is two's-complement representation, although signed-digit representations of LNS (SDLNS) [21] have been proposed.

Although LNS lacks wide industry acceptance, mainly due to the lack of a standard, there are numerous studies comparing LNS to floating-point and fixed-point arithmetic, which demonstrate its advantages and disadvantages for certain applications. A detailed comparison [13] of area and performance between LNS and FP for Field Programmable Gate Arrays (FPGAs) suggests that LNS is sometimes preferable to FP, depending on the ratio of multiplications and divisions over additions in an algorithm. Similarly, [23] shows that an LNS implementation of the Fast Fourier Transform (FFT) uses a smaller word size than a comparable fixed-point system, while achieving the same error performance.

What makes LNS attractive is the accurate and inexpensive implementation of multiplication and division in the real domain, which are reduced to addition and subtraction in the logarithmic domain. This gain is counterbalanced by the more complicated operations of addition and subtraction in the real domain, which require the calculation of two functions. The first function is the addition logarithm,

sb(z) = logb(1 + b^z),   (1)

which is used when forming the logarithm of the sum of two LNS values. The constant b is the base of the logarithms, typically two. The sb function can be thought of as a logarithmic form of incrementation: sb converts a logarithmic representation to the equivalent real value, increments that real value, and converts the incremented value back to a logarithmic representation. The other function needed for a full implementation of LNS is the subtraction logarithm,

db(z) = logb(|1 − b^z|),   (2)

which is used when forming the logarithm of the difference of two LNS values. Having sb and db, addition and subtraction can be calculated either by the couple

logb(|X| + |Y|) = max(x, y) + sb(−|x − y|),   (3)

logb||X| − |Y|| = max(x, y) + db(−|x − y|),   (4)

or,

logb(|X| + |Y|) = min(x, y) + sb(|x − y|),   (5)

logb||X| − |Y|| = min(x, y) + db(|x − y|).   (6)

Most implementations use the first couple because for z < 0, sb(z) ≤ 1, i.e., no integer bits need to be stored for sb, and because db(z) < 0, i.e., no sign bit needs to be stored for db. Overall, the choice of the first couple over the second can offer a memory reduction of 10 to 20 percent.
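Equations (1)-(4) can be exercised with a short double-precision sketch (plain Python with base b = 2; an illustration of the identities, not of any hardware unit):

```python
import math

def sb(z):          # addition logarithm, Eq. (1): log2(1 + 2^z)
    return math.log2(1.0 + 2.0 ** z)

def db(z):          # subtraction logarithm, Eq. (2): log2(|1 - 2^z|)
    return math.log2(abs(1.0 - 2.0 ** z))

def lns_add(x, y):  # Eq. (3): log2(|X| + |Y|)
    return max(x, y) + sb(-abs(x - y))

def lns_sub(x, y):  # Eq. (4): log2(||X| - |Y||)
    return max(x, y) + db(-abs(x - y))

x, y = math.log2(6.0), math.log2(2.5)            # X = 6.0, Y = 2.5
assert abs(2.0 ** lns_add(x, y) - 8.5) < 1e-9    # 6.0 + 2.5
assert abs(2.0 ** lns_sub(x, y) - 3.5) < 1e-9    # 6.0 - 2.5
```

The same sketch with Eqs. (5)-(6) would use min(x, y) and positive arguments, at the cost of the wider sb and db ranges noted above.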

The LookUp Tables (LUTs) used to tabulate the sb and db functions are the main bottleneck of an LNS circuit. Their size grows exponentially with respect to the fractional bits of the logarithmic representation, and recently the techniques of interpolation [2], multipartite tables [8] and cotransformation [3] have been proposed to mitigate this problem. Multipartite tables represent an alternative to interpolating the two functions, by eliminating the required multiplier and using multiple tables and adders. With cotransformation, the db function is evaluated by using the sb function, two smaller tables and some extra circuitry which, overall, offer considerable memory savings compared to a pure tabulation.

The purposes of this paper are, first, to give an overview of the state-of-the-art techniques for LNS addition/subtraction; second, to give a complementary analysis that covers a subtle aspect of cotransformation not covered in [3]; and finally, to compare cotransformation with multipartite tables or interpolation alone for logarithmic subtraction. Section 2 describes interpolation, and Section 3 describes multipartite methods. Section 4 describes prior cotransformations, while Section 5 describes and analyzes novel aspects of the “Improved Cotransformation”. Section 6 compares synthesis results from a modified VHDL library using cotransformation against those that use only multipartite tables [10]. Section 7 gives the error performance of different LNS architectures. Section 8 motivates the accuracy advantages of the cotransformation library over the earlier multipartite-table one with a graphics example taken from [7], and, finally, conclusions are drawn in the last section.

2. Interpolation

A common technique used to reduce the memory requirements of tabulating the sb and db functions is interpolation. Since 0 < sb′(z) < 1, linear interpolation gives satisfactory accuracy with reasonable cost for the sb(z) function. However, db(0) = db′(0) = db′′(0) = −∞, which means that interpolation of the db(z) function becomes expensive close to zero because of this singularity. For example, Lewis [17] uses over ten times as much storage to interpolate db as to interpolate sb.

Interpolation can be partitioned [16] or unpartitioned [2]. An unpartitioned interpolator splits the two's-complement fixed-point z into two parts: zH, the high (k + n) bits of z, and zL, the low (f − n) bits of z (for a two's-complement representation with k integer and f fractional bits). A linear interpolator approximates

sb(zH + zL) ≈ sb(zH) + sb′(zH + ε) · zL,   (7)

which requires one multiplier. The slope can either be stored in a separate ROM [2], or computed from function values [15]. “Unpartitioned” means zH is a multiple of a constant (∆ = 2^−n) and 0 ≤ zL < ∆. The variable ε can take any value in the interval [0, ∆], but the best accuracy occurs with ε ≈ ∆/2. To achieve faithful results (rounding to either the nearest or next nearest) with an unpartitioned sb linear interpolator [2] accepting inputs with f-bit precision and a k ≈ ⌈log2(f)⌉-bit integer portion requires a table with a (k + n)-bit address bus, where n = ⌈f/2⌉ − 2 is the number of bits used to address the portion of the table for −2 ≤ z < −1. Since the function sb approaches zero asymptotically, it needs to be tabulated only for the values of z for which it is non-zero. The point at which sb becomes zero is called the “essential zero”, and it is equal to esb = logb(b^{2^−f} − 1) ≈ −f, i.e., only f · 2^n words are actually used for sb. In contrast, a partitioned linear sb interpolator [1, 15] needs 6 · 2^n words. Partitioning costs address-generation hardware. For f ≤ 12, unpartitioned interpolation is probably a better choice. For single precision, f = 23, partitioning cuts the number of words to a third of that required by the unpartitioned sb method. More economically, a partitioned quadratic sb interpolator [5] with k = 5 needs roughly an f/3-bit address bus at the added cost of two multipliers and related circuits.
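The unpartitioned linear interpolator of Eq. (7) can be sketched in software (Python, b = 2; the spacing ∆ = 2^−n and the midpoint choice ε = ∆/2 follow the text, while n = 6 is an arbitrary illustrative value):

```python
import math

def sb(z):                           # addition logarithm, b = 2
    return math.log2(1.0 + 2.0 ** z)

def sb_slope(z):                     # derivative: 2^z / (1 + 2^z)
    return 2.0 ** z / (1.0 + 2.0 ** z)

n = 6                                # illustrative resolution parameter
delta = 2.0 ** -n                    # spacing of the zH grid

def sb_interp(z):                    # Eq. (7) with eps = delta/2
    zH = math.floor(z / delta) * delta
    zL = z - zH                      # 0 <= zL < delta
    return sb(zH) + sb_slope(zH + delta / 2.0) * zL

# sweep negative arguments and record the worst interpolation error
err = max(abs(sb_interp(-8.0 + i * 1e-3) - sb(-8.0 + i * 1e-3))
          for i in range(8000))
assert err < delta ** 2              # linear interpolation error is O(delta^2)
```

Because sb′′ is bounded, halving ∆ roughly quarters the worst-case error, which is the tradeoff the table-sizing rules above quantify.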

The db function cannot be interpolated near its singularity, i.e., the region close to zero where db becomes −∞, unless partitioning is used. Otherwise either the introduced error is unacceptable, or the memory requirements become excessive. The most common method is to partition the table at powers of two as z approaches zero [15]. In other words, 2^n words are used for −2 ≤ z < −1, 2^n words are used for −1 ≤ z < −0.5, 2^n words are used for −0.5 ≤ z < −0.25, and so on. If one allows “weak” accuracy as z approaches zero [16], the number of words required decreases [10]; i.e., there is a tradeoff between number of words and accuracy. The graphics example in Section 8 of this paper illustrates the problem with the weak-accuracy implementation of LNS subtraction, which can be overcome by the novel methods in this paper.

3. Multipartite Tables

The multipartite-tables approach [9, 14] is an implementation of linear interpolation in which the multiplier circuit that forms the product of the slope with the low bits of z is replaced with a ROM that contains precomputed multiples of the slopes. The simplest version is called bipartite, from the fact that it requires only two ROMs. A bipartite circuit splits the input into three parts [19], denoted as z2, z1 and z0, whose bit lengths are n2, n1 and n0, respectively. The bipartite approximation is

f(z) ≈ a21(z2, z1) + a20(z2, z0),   (8)

which requires 2^{n2+n1} + 2^{n2+n0} words of memory, which is more than required by interpolation. In the context of LNS implementation, the relationship between (7) and (8) is that zH = z2 + z1 and zL = z0, where z1 is the “middle” bit field, not used for the precomputed slope ROM. This makes

a21(z2, z1) = sb(zH),   (9)

and

a20(z2, z0) ≈ sb′(zH + ε) · zL.   (10)

Not all of zH is available as an address to a20, and the same slope is used for several different zH. In consequence, (10) is only an approximation to the correct product. Multipartite tables are a generalization of this idea, in which additional ROMs are included, each of which deals with a different subset of bits. Multipartite tables have memory requirements that approach those of interpolation at the cost of extra adders that approach the cost of the missing multiplier. The multipartite method has the same problem with the db singularity as interpolation.
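A software model of the bipartite idea makes the roles of (8)-(10) concrete. The sketch below approximates g(u) = sb(−u) on [0, 1) with illustrative field widths n2 = n1 = n0 = 3; the a21 table holds exact samples per (9), while a20 holds slope multiples addressed without the middle field, per (10):

```python
import math

def g(u):                      # s_2(-u) for u in [0, 1), the tabulated function
    return math.log2(1.0 + 2.0 ** -u)

def g_slope(u):                # derivative of g
    return -(2.0 ** -u) / (1.0 + 2.0 ** -u)

n2, n1, n0 = 3, 3, 3           # illustrative field widths
f = n2 + n1 + n0

def bipartite(u):              # Eq. (8): a21(u2, u1) + a20(u2, u0)
    i = int(u * 2 ** f)
    u2 = (i >> (n1 + n0)) / 2 ** n2            # top n2 bits
    u1 = ((i >> n0) & (2 ** n1 - 1)) / 2 ** (n2 + n1)
    u0 = (i & (2 ** n0 - 1)) / 2 ** f          # low n0 bits
    a21 = g(u2 + u1)                           # Eq. (9): exact samples
    eps = 2.0 ** -(n2 + 1)                     # center of the span hidden from a20
    a20 = g_slope(u2 + eps) * u0               # Eq. (10): precomputed slope multiple
    return a21 + a20

err = max(abs(bipartite(i / 2 ** f) - g(i / 2 ** f)) for i in range(2 ** f))
assert err < 2.0 ** -f         # below one ulp of the f-bit output grid
```

Replacing a20 by a true multiplier sb′·zL addressed with all of zH recovers plain linear interpolation; the bipartite trade is the loss of the middle field in the slope address against the saved multiplier.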

4. Cotransformation Analysis

The cost of interpolating db near the singularity can be overcome by using cotransformation. The idea of cotransformation is to convert max(x, y) (or min(x, y)) of Eq. (4) (or Eq. (6)) and the argument of db into modified values that are guaranteed to avoid the singularity of db.

To simplify the analysis of cotransformation, let z = z1 + z2, Z1 = b^{z1}, Z2 = b^{z2} and Z = b^z = Z1 · Z2. With this value-domain notation (Z, Z1 and Z2), we can see the algebraic reason cotransformation works without considering logarithms. Since the real logarithm function is not defined for negative arguments, the analysis of cotransformation involves several terms in absolute-value signs. Although there are several cotransformation variations in the literature [3, 4, 6, 12], all derive from a common idea: that |Z − 1| can be expressed in terms of Z1 and Z2.

There are two possible cases (sign(z1) = sign(z2) and sign(z1) ≠ sign(z2)) that alter the way the absolute-value signs are treated. The former is called “Arnold's Cotransformation”; the latter is called “Coleman's Cotransformation”.

4.1. Arnold’s Cotransformation

This describes the case sign(z1) = sign(z2), which occurs either when Z1 > 1 and Z2 > 1 or when Z1 < 1 and Z2 < 1. The former subcase is of interest because it is appropriate when (5) and (6) are used with a two's-complement representation [4]. Although it has never been considered before, the latter subcase is of interest for implementation of cotransformation with SDLNS. In this case, we start with the simple algebraic observation

Z − 1 = Z1 · Z2 − 1 = Z1 − 1 + Z1 · Z2 − Z1
      = (Z1 − 1) · (1 + Z1 · (Z2 − 1) / (Z1 − 1)).   (11)

Because in this case |Z2 − 1|/|Z1 − 1| = |(Z2 − 1)/(Z1 − 1)|, the entire computation can be rewritten with absolute-value signs at each step:

|Z − 1| = |Z1 − 1| · |1 + Z1 · |Z2 − 1| / |Z1 − 1||.   (12)

Taking logarithms of both sides and substituting with z, z1 and z2 yields:

db(z) = db(z1) + sb(z1 + db(z2) − db(z1)). (13)

In a practical system, the terms db(z1) and db(z2) can come from small lookup tables, but the sb evaluation is performed by an approximation method like interpolation or multipartite tables. In other words, the more difficult evaluation of the db function can always be turned into an evaluation of the sb function of a transformed argument plus an extra term obtained from a small table. The name “cotransformation” was proposed for this technique in [4] because LNS addition uses (5), but LNS subtraction uses the cotransformed logb||X| − |Y|| = min(x, y) + db(z1) + sb(z1 + db(z2) − db(z1)). In theory, there are special cases for z1 = 0 and z2 = 0, but in practice, these can be implemented by storing sufficiently negative but finite values to approximate −∞. An advantage of this form of cotransformation is that all subtractions can be turned into additions (that avoid the singularity), although, if desired, the cotransformation may be limited only to values of z close enough to the singularity to be of concern. Also, an advantage is that the original argument and the transformed argument of sb are both positive (i.e., z > 0 means z1 + db(z2) − db(z1) > 0). The disadvantage, as mentioned above, is that this case moderately increases the bit width of the sb table since z > 0.
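Identity (13) can be verified numerically in double precision (b = 2, with both z1 and z2 positive so that Z1 > 1 and Z2 > 1; the sample points are arbitrary):

```python
import math

def sb(z): return math.log2(1.0 + 2.0 ** z)
def db(z): return math.log2(abs(1.0 - 2.0 ** z))

# Arnold's case: sign(z1) = sign(z2); here both positive (Z1 > 1, Z2 > 1)
for z1, z2 in [(0.75, 0.125), (1.5, 0.25), (2.0, 1.0)]:
    lhs = db(z1 + z2)                              # direct evaluation
    rhs = db(z1) + sb(z1 + db(z2) - db(z1))        # Eq. (13)
    assert abs(lhs - rhs) < 1e-12
```

In hardware only the sb term on the right needs an approximation method; db(z1) and db(z2) come from the two small tables mentioned above.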

4.2. Coleman’s Cotransformation

An alternative cotransformation [6] holds in the case sign(z1) ≠ sign(z2):

1 − Z = 1 − Z1 · Z2 = 1 − Z1 − Z1 · Z2 + Z1
      = |1 − Z1| · |1 − Z1 · |Z2 − 1| / |1 − Z1||, or   (14)

db(z) = db(z1) + db(z1 + db(z2) − db(z1)).   (15)

In other words, evaluation of the db function can be turned into an evaluation of the db function at a transformed argument which is far enough away from the singularity. A disadvantage is that this form of cotransformation cannot eliminate db altogether. This form does have the advantage that it naturally supports z < 0, because in two's-complement notation this typically means z1 < 0 and z2 > 0. Here, LNS addition uses the unmodified (3), and LNS subtraction the cotransformed logb||X| − |Y|| = max(x, y) + db(z1) + db(z1 + db(z2) − db(z1)).
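Identity (15) admits the same kind of numeric check for the case sign(z1) ≠ sign(z2):

```python
import math

def db(z): return math.log2(abs(1.0 - 2.0 ** z))

# Coleman's case: sign(z1) != sign(z2); for two's-complement z < 0 this
# typically means z1 < 0 (upper bits) and z2 > 0 (lower bits)
for z1, z2 in [(-1.5, 0.25), (-2.0, 0.75), (-0.75, 0.5)]:
    lhs = db(z1 + z2)                              # direct evaluation
    rhs = db(z1) + db(z1 + db(z2) - db(z1))        # Eq. (15)
    assert abs(lhs - rhs) < 1e-12
```

Unlike (13), the inner evaluation on the right is still db, but its transformed argument is pushed away from the singularity at zero.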

5. Improved Cotransformation

In [3] an improved cotransformation is proposed which, compared to the one proposed in [6], offers better accuracy at no extra cost for the case z < 0, which allows this improved cotransformation to operate with the smaller-width sb table. For the rest of this section, assume z is a negative two's-complement number having k integer and f fractional bits. This cotransformation requires z to be bit-partitioned into zl, which represents the last j bits of z, and zh, which represents the remaining k + (f − j) bits. Thus, z can be reconstructed by concatenating zh and zl, or algebraically, z = zh + zl. In the earlier cotransformations, this bit partitioning was simply assumed to be zh = z1 and zl = z2; however, the algebra given earlier does not require this. Introducing δh = 2^{j−f}, which is the smallest power of two used for zh, the “Improved Cotransformation” for the db(z) function given in [3] uses

db(z) = z + F1(zh) + sb(F2(zl) − z − F1(zh)),   zh ≠ −δh ∧ z ≠ −2δh   (16)

db(z) = F2(zl),   zh = −δh   (17)

db(z) = db(−2δh),   z = −2δh,   (18)

where F1(zh) = db(−zh − δh) and F2(zl) = db(zl − δh). If zh = −δh were not a special case, the argument to sb would be infinite. If z = −2δh were not a special case, the argument to sb would be positive, and this would require interpolating sb beyond zero, increasing the hardware cost. In terms of hardware implementation, “Coleman's” and “Arnold's Cotransformation” are equivalent, but “Arnold's Cotransformation” is superior in terms of error behavior. This was proven in [3] via simulation, where it was shown that, for the 16-bit and 23-bit cases, the standard deviation of the error introduced by “Arnold's Cotransformation” is 52% and 60% smaller, respectively.
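Equations (16)-(18) can be exercised in double precision, with F1 and F2 evaluated on the fly rather than read from tables (b = 2; the parameters f = 4, j = 2 are illustrative, and this software sb accepts positive arguments, which a narrow hardware table does not):

```python
import math

def sb(z): return math.log2(1.0 + 2.0 ** z)
def db(z): return math.log2(abs(1.0 - 2.0 ** z))

f, j = 4, 2                        # illustrative precision and split
dh = 2.0 ** (j - f)                # delta_h = 2^(j-f) = 0.25

def F1(zh): return db(-zh - dh)    # F1(zh) = db(-zh - delta_h)
def F2(zl): return db(zl - dh)     # F2(zl) = db(zl - delta_h)

def db_cotr(z):                    # Eqs. (16)-(18) for negative z
    zl = z % dh                    # low j bits: 0 <= zl < delta_h
    zh = z - zl                    # high bits: a multiple of delta_h
    if zh == -dh:
        return F2(zl)              # Eq. (17)
    if z == -2.0 * dh:
        return db(z)               # Eq. (18): one directly stored value
    # Eq. (16); a software sb tolerates positive arguments, unlike the
    # restricted hardware table -- the issue analyzed in the next subsection
    return z + F1(zh) + sb(F2(zl) - z - F1(zh))

for i in range(1, 64):             # sweep z over multiples of 2^-f in (-4, 0)
    z = -i * 2.0 ** -f
    assert abs(db_cotr(z) - db(z)) < 1e-9
```

All grid values here are exact binary fractions, so the bit partitioning via the modulo is exact in floating point.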

5.1. Novel Special Case

The special case (18) covers the case when the argument of sb becomes positive at z = −2δh; however, for values of j closer to f there are more cases where the argument is positive. This fact is illustrated in Fig. 1(a), where the argument of sb in (16) is plotted for z ∈ [−6, 0.5), and for b = 2, f = 4 and j = 3. It is apparent that the argument of sb becomes positive not only for the value z = −2δh = −1.0, but over an interval, which in this case is [−1.0, −0.8706]. For b = 2, f = 4 and j = 4 there are multiple diminishing intervals for which the argument of sb is positive, as depicted in Fig. 1(b). Thus, for a given b and f, the choice of j determines where the argument of sb is positive and where it remains negative. This graphical evidence shows that the cotransformation presented in [3] needs to be reexamined in order to cover these novel special cases, not stated before, and this is done analytically below.

These novel special cases can be found by solving the inequality

F2(zl) − z − F1(zh) > 0 ⇒ logb(|1 − b^{zl−δh}| / |b^z − b^{zl−δh}|) > 0 ⇒ |1 − b^{zl−δh}| / |b^z − b^{zl−δh}| > 1 for b > 1.   (19)

For each interval [−nδh, −(n − 1)δh], n ∈ Z, it holds that z = −nδh + zl; thus b^z − b^{zl−δh} < 0, and since zl − δh < 0, 1 − b^{zl−δh} > 0. Taking the absolute values out of (19) and

[Figure 1 plots the argument of sb in (16) against z ∈ [−6, 0.5) for two settings: (a) b = 2, f = 4 and j = 3; (b) b = 2, f = 4 and j = 4.]

Figure 1. The argument of sb for two different values of j.

substituting zl = z + nδh, we have

(1 − b^{z+(n−1)δh}) / (−b^z + b^{z+(n−1)δh}) > 1 ⇒ b^z(1 − 2b^{(n−1)δh}) > −1.   (20)

Since 1 − 2b^{(n−1)δh} < 0,

b^z < 1 / (2b^{(n−1)δh} − 1) ⇒ z < −logb(2b^{(n−1)δh} − 1).   (21)

To summarize, for z ∈ [−nδh, −(n − 1)δh] the argument of sb in (16) is positive for z ∈ [−nδh, −logb(2b^{(n−1)δh} − 1)]. If −logb(2b^{(n−1)δh} − 1) < −nδh, the argument remains negative.

This analysis leads to the restatement of the cotransformation as

db(z) = z + F1(zh) + sb(F2(zl) − z − F1(zh)),   zh ≠ −δh ∧ z ∉ [−nδh, −logb(2b^{(n−1)δh} − 1)]   (22)

db(z) = F2(zl),   zh = −δh   (23)

db(z) = db(−nδh + zl),   z ∈ [−nδh, −logb(2b^{(n−1)δh} − 1)].   (24)

This cotransformation has the added cost of detecting the special case, as well as additional memory requirements, which are analyzed in the next subsection.
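The interval bound in (21) is easy to evaluate; for b = 2, f = 4 and j = 3 the sketch below reproduces the single positive-argument interval [−1.0, −0.8706] seen in Fig. 1(a):

```python
import math

b, f, j = 2.0, 4, 3
dh = 2.0 ** (j - f)                     # delta_h = 0.5

def upper_bound(n):                     # Eq. (21): -log_b(2 b^((n-1) dh) - 1)
    return -math.log(2.0 * b ** ((n - 1) * dh) - 1.0, b)

# the argument of sb in (16) is positive on [-n*dh, upper_bound(n)]
# whenever that interval is non-empty
intervals = [(-n * dh, upper_bound(n))
             for n in range(2, 12)
             if upper_bound(n) > -n * dh]

assert len(intervals) == 1              # a single interval for j = 3
lo, hi = intervals[0]
assert lo == -1.0 and abs(hi + 0.8706) < 1e-3   # [-1.0, -0.8706], Fig. 1(a)
```

Rerunning with j = 4 (so δh = 1.0) yields several diminishing intervals, matching Fig. 1(b).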

Knowing the intervals where the argument of sb is positive, we can calculate the number of words, wNovelSp, required to tabulate the special case (24) as follows:

wNovelSp(b, k, f) = Σ_{n=0}^{2^{k−1}/δh} max(⌈(nδh − logb(2b^{(n−1)δh} − 1)) / 2^{−f}⌉, 0).   (25)

The function max(·) returns 0 if the numerator of the fraction is negative (i.e., the argument of sb is negative); otherwise it gives the number of arguments for which the argument of sb is positive.

5.2. Memory Requirement

Overall, the memory requirements for the cotransformation are the sum of the words to store F1 and F2, plus wNovelSp(b, k, f). F1(zh) approaches (−zh − δh) asymptotically as z → −∞; i.e., after a specific value of z, which is called the essential zero, the function becomes a tautology since it is always equal to (−zh − δh). A similar property exists for F2(zl), which after a specific point becomes smaller than 2^{−f}, i.e., essentially zero [20]. The essential zero for F1 is equal to eF1(b, f, j) = logb(1 − b^{−2^{−f}}) − δh; beyond this point the function does not need to be tabulated because it always returns (−zh − δh). Knowing that, we can calculate the number of words required to tabulate F1 as wF1(b, f, j) = ⌈−eF1(b, f, j)/δh⌉. Analogously, eF2(b, f, j) = logb(1 − b^{−2^{−f}}) + δh. However, eF2(b, f, j) < −δh for all b, f, j, and since 0 ≤ zl < δh, zl needs to span all the values in this interval, requiring wF2(j) = 2^j. Hence, the total memory requirement for the cotransformation is

wCtr(b, k, f, j) = wNovelSp(b, k, f) + wF1(b, f, j) + wF2(j).   (26)
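Equations (25)-(26) can be turned into a small sizing script. The sketch below is one illustrative reading of the formulas (the n = 0, 1 terms of (25) fall in the zh = −δh region already covered by F2, so the sweep starts at n = 2), and its endpoint bookkeeping may differ by a few words from the exact counts in Table 1:

```python
import math

def words(b, k, f, j):
    """Word-count estimate following Eqs. (25)-(26)."""
    dh = 2.0 ** (j - f)
    # Eq. (25): count grid points where the argument of sb is positive
    w_novel, n = 0, 2
    while n * dh <= 2.0 ** (k - 1):
        bound = -math.log(2.0 * b ** ((n - 1) * dh) - 1.0, b)
        w_novel += max(math.ceil((n * dh + bound) / 2.0 ** -f), 0)
        n += 1
    # essential zero of F1; past it, F1(zh) is simply (-zh - delta_h)
    e_f1 = math.log(1.0 - b ** -(2.0 ** -f), b) - dh
    w_f1 = math.ceil(-e_f1 / dh)
    w_f2 = 2 ** j
    return w_novel + w_f1 + w_f2       # Eq. (26)

counts = {j: words(2.0, 6, 8, j) for j in range(3, 8)}
assert counts[5] < counts[3] and counts[5] < counts[7]
assert min(counts, key=counts.get) in (5, 6)   # minimum near j = 5, as in Table 1
```

The estimate reproduces the qualitative shape of the Words row of Table 1: the count falls to a minimum near j = 5 and rises again as either F1 (small j) or F2 (large j) dominates.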

For a particular accuracy f, the above analytical formula can give the value of j that results in the smallest memory requirements for cotransformation. Since the memory comprises the biggest part of an LNS unit, a parametric search over j can lead to the optimum LNS circuit in terms of area requirements. This is illustrated in Table 1, where synthesis results from Leonardo on the Application-Specific Integrated Circuit (ASIC) Taiwan Semiconductor Manufacturing Company (TSMC) 0.25µm library are presented for the case of b = 2, k = 6, and f = 8. The value that gives the smallest number of words is j = 5, while the case of j = 6 gives approximately the same number of words, with the advantage of F1 and F2 having the same-size address busses. The actual synthesis of the addition/subtraction circuits that correspond to these two cases can give exact values for area and delay. The cases j = 3, 4, 7 are presented to illustrate that when the theoretical number of words given by Eq. (26) is suboptimal, the corresponding circuits exhibit correspondingly increased area. A designer interested in the smallest area can explore the design space starting from the optimal value of j given by Eq. (26), and then consider the area and delay of neighboring values.

Table 1. Effect of j on area and delay for b = 2, k = 6, and f = 8.

j           3       4       5       6       7
Words       208     153     103     108     179
Gates       45445   35489   32674   31621   33522
Delay (ns)  17.72   17.49   17.66   20.98   23.24

6. Synthesis Results for FPGA

Interpolation, multipartite tables and cotransformation are different choices for LNS implementation that trade off latency and area. Figs. 2(a) and 2(b) show the area and delay reported by Xilinx Webpack for Virtex-IV FPGA synthesis when sb and db are implemented with interpolation or multipartite tables alone. The multipartite-table and the Single-Multiplication Second-Order (SMSO) [11] libraries used are the ones presented in [10] and offered publicly. These libraries include multipartite-table implementations for f = 6 to 13, and SMSO implementations for f = 10, 11, 13. The multipartite tables have the smallest latency and largest area. The SMSO occupies less area compared to the multipartite method at the expense of increased delay. The db singularity causes increased memory requirements, and the VHDL LNS library [10] we started from relaxes accuracy in this region in order to achieve reasonable size for both interpolation and multipartite methods. Cotransformation evaluates the db(z) function via smaller tables, eliminating the need for the weak-error model used in [10]. Either multipartite tables or interpolation can be used for sb(z), since this is treated as a black box. The choice of cotransformation decreases the area for both the multipartite method and the SMSO, as depicted in Fig. 2(a). This gain is counterbalanced by an increase in latency, as shown in Fig. 2(b). Moreover, cotransformation increases the accuracy of the circuit, as presented in the next section.

The increased accuracy and reduced area offered by cotransformation, versus the reduced latency offered by interpolation or multipartite methods, suggests combining cotransformation with one of these db methods to achieve smaller area or latency. Fig. 3(a) shows the area requirements for hybrid methods where the db(z) function from the essential zero to some midpoint is computed by interpolation or multipartite tables, and after that by cotransformation. The hybrid method that uses the least area implements db(z) by the SMSO for −16 < z ≤ −8 and by cotransformation for −8 < z < 0. This reduction in area comes with a modest increase in the latency of the circuit, as depicted in Fig. 3(b). These ideas can also be applied when using multipartite tables instead of the SMSO. The same figures present the tradeoff between area and delay, and all the methods follow the same pattern, i.e., the smaller the area of a design, the bigger its delay. When the multipartite method computes sb, the best area/speed tradeoff appears to use cotransformation exclusively for db.

7. Error Analysis

The multipartite-table method has the biggest area requirements, as shown in Section 6, and the situation would have been worse if the accuracy of the LNS subtractor were consistent throughout the whole domain of z < 0. Fig. 4(a) presents the error induced by the db function for k = 5 and f = 10. When db is implemented by the multipartite-table method in [10], it is apparent that the accuracy close to zero is relaxed in order to achieve reasonable area, a compromise that tends to make LNS appear less accurate than FP for the same wordlengths [7]. For the LNS libraries in [10], the worst-case error of 0.01 is up to an order of magnitude larger than the desired value of 2^{−10}, which is achieved for z away from zero.

In contrast, Fig. 4(b) illustrates that the error of LNS subtraction stays in the desired interval [0, 2^{−10}] for f = 10 when sb is implemented by the multipartite method in [10] and db by the cotransformation analyzed in Section 5. The substitution of the db function with cotransformation essentially moves the calculation of the subtraction algorithm away from the singularity, so the accuracy can be kept consistent to 2^{−f} without excessive memory requirements.

The use of cotransformation has an impact on the area, the delay, and the error performance of a hardware implementation of an LNS circuit. These three aspects have been studied in this work, and it has been shown that cotransformation offers an advantage in terms of area and accuracy when used for LNS subtraction, with either interpolation or multipartite tables. The improvement in accuracy is visualized in the next section with a graphics application. However, cotransformation increases the latency of a circuit because it uses multiple tables apart from the ones required by interpolation or the multipartite method. Clearly, there is a tradeoff of area plus accuracy against latency, which has to be balanced depending on the specifications of a particular application, such as available hardware resources, through-

[Figure 2: panels (a) Area (slices) and (b) Latency (ns), plotted against f = 6 to 13 for four designs: Multipartite, Cotransformation + Multipartite, 2nd order, and Cotransformation + 2nd order.]

Figure 2. Area and latency of the different methods as a function of precision, f.

0

200

400

600

800

1000

1200

6 7 8 9 10 11 12 13

Are

a(s

lices

)

f

Multipartite db, −16.0 ≤ z < −8.0Multipartite db, −16.0 ≤ z < −4.0Multipartite db, −16.0 ≤ z < −2.0Cotransformation only2nd order db, −16.0 ≤ z < −8.02nd order db, −16.0 ≤ z < −4.02nd order db, −16.0 ≤ z < −2.0Cotransformation only

(a) Area.

18

20

22

24

26

28

30

32

34

36

38

6 7 8 9 10 11 12 13

Lat

ency

(ns)

f

Multipartite db, −16.0 ≤ z < −8.0Multipartite db, −16.0 ≤ z < −4.0Multipartite db, −16.0 ≤ z < −2.0Cotransformation only2nd order db, −16.0 ≤ z < −8.02nd order db, −16.0 ≤ z < −4.02nd order db, −16.0 ≤ z < −2.0Cotransformation only

(b) Latency.

Figure 3. Area and latency of the hybrid methods as a function of precision, f .

−16 −12 −8 −4 00

0.002

0.004

0.006

0.008

0.01

z

Rel

ativ

e er

ror

(a) db with multipartite tables.

−16 −12 −8 −4 00

0.002

0.004

0.006

0.008

0.01

z

Rel

ativ

e er

ror

(b) db with “Improved Cotransformation” (j = 5) and sb with multi-partite tables.

Figure 4. Error performance of the db function for different implementation choices for k = 5 andf = 10.

(a) Cotransformation without special cases. (b) “Improved Cotransformation.” (c) Multipartite-table implementation.

Figure 5. Graphics example comparing cotransformation and multipartite tables for k = 5, f = 8,j = 5.

put requirements, and desired accuracy.

8. Graphics Example

As an example application we use a 3D-transformation pipeline that was used in [7] to compare the relative performance of LNS and FP. The original design from [7] was adapted for a Virtex-IV LX25-based ML401 development board, in order to compare the original multipartite method from [10] against cotransformation for LNS subtraction.

An image obtained using the original cotransformation [3], without taking into account the novel special cases derived in this work, is presented in Fig. 5(a) (k = 5, f = 8, j = 5). For some inputs, the value indexing the sb table is positive, causing an incorrect value to be returned. Even though there are only two such special cases for the considered parameters, the induced error is very visible in the left tire of the truck.

On the other hand, Fig. 5(b) depicts the resulting image when a third table is used to handle the special cases, based on the same parameters. We observe that, although handling the new special case is cheap (only a table of two values for f = 8 and j = 5), its omission can result in major errors in a series of computations. For reference, an image [7] obtained using the multipartite method alone is given in Fig. 5(c), still based on the same parameters (k = 5 and f = 8). We observe a significant improvement in perceived accuracy when using cotransformation for subtractions. Similar improvements in visual quality are observed at higher precisions, such as f = 10. On average, cotransformation appears to offer a level of accuracy similar to floating-point [7] at the same word length, thereby refuting the claim in [7] that such visual artifacts reflect inherent limitations of LNS compared to FP. Instead, our results suggest that the artifacts can result from implementation choices, and careful LNS designers should consider the accuracy and memory savings of cotransformation.

9. Conclusions

This work presented the tradeoffs in area, latency, and accuracy of multipartite tables and interpolation [10], and of cotransformation [3]. New special cases for cotransformation are presented which correct a small but important, previously unrecognized, inaccuracy. We incorporated this new cotransformation into an existing HDL library [10] for LNS. Synthesis results show that using cotransformation for subtraction decreases the memory size compared to using pure interpolation or multipartite tables, but this is counterbalanced by an increase in delay. Additionally, cotransformation makes LNS subtraction much more accurate near the singularity of db, as has been illustrated by exhaustive error simulation and by a graphics application. This work unifies the most effective techniques for designing LNS units and gives a more complete practical study of the design space than any previous paper.

Acknowledgments

The authors would like to thank Nicolas Frantzen and Jesus Garcia for their contributions in the early stages of this work.

References

[1] M. Arnold, T. Bailey, and J. Cowles. Comments on ‘An Architecture for Addition and Subtraction of Long Word Length Numbers in the Logarithmic Number System’. IEEE Transactions on Computers, 41(6):786–788, June 1992.

[2] M. G. Arnold. A Pipelined LNS ALU. In Proceedings of the IEEE Workshop on VLSI, pages 155–161, Orlando, Florida, 19–20 April 2001.

[3] M. G. Arnold. An Improved Cotransformation for Logarithmic Subtraction. In Proceedings of the International Symposium on Circuits and Systems (ISCAS’02), pages 752–755, Scottsdale, Arizona, 26–29 May 2002.

[4] M. G. Arnold, T. A. Bailey, J. R. Cowles, and M. D. Winkel. Arithmetic Co-transformations in the Real and Complex Logarithmic Number Systems. IEEE Transactions on Computers, 47(7):777–786, July 1998.

[5] M. G. Arnold and C. Walter. Unrestricted Faithful Rounding is Good Enough for Some LNS Applications. In Proceedings of the 15th International Symposium on Computer Arithmetic, pages 237–246, Vail, CO, 11–13 June 2001.

[6] J. N. Coleman. Simplification of Table Structure in Logarithmic Arithmetic. IEE Electronics Letters, 31(22):1905–1906, 26 Oct. 1995.

[7] S. Collange, F. de Dinechin, and J. Detrey. Floating Point or LNS: Choosing the Right Arithmetic on an Application Basis. In Proceedings of the 9th EuroMicro Digital System Design (DSD 2006), pages 197–203, Dubrovnik, Croatia, 30 Aug.–1 Sept. 2006.

[8] F. de Dinechin and A. Tisserand. Some Improvements on Multipartite Table Methods. In Proceedings of the 15th Symposium on Computer Arithmetic, pages 128–135, Vail, Colorado, 11–13 June 2001.

[9] F. de Dinechin and A. Tisserand. Multipartite Table Methods. IEEE Transactions on Computers, 54(3):319–330, March 2005.

[10] J. Detrey and F. de Dinechin. A VHDL Library of LNS Operations. In 37th Asilomar Conference on Signals, Systems, and Computers, volume 2, pages 2227–2231, Pacific Grove, CA, 9–12 Nov. 2003.

[11] J. Detrey and F. de Dinechin. Second Order Function Approximation Using a Single Multiplication on FPGAs. In Proceedings of the 14th International Conference on Field-Programmable Logic and Applications (FPL 2004), number 3203 in Lecture Notes in Computer Science, pages 221–230, Antwerp, Belgium, Sept. 2004.

[12] J. Garcia, L. Bleris, M. G. Arnold, and M. V. Kothare. LNS Architectures for Embedded Model Predictive Control Processors. In Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES’04), pages 79–84, Washington, DC, 22–25 Sept. 2004.

[13] M. Haselman, M. Beauchamp, A. Wood, S. Hauck, K. Underwood, and K. S. Hemmert. A Comparison of Floating Point and Logarithmic Number Systems for FPGAs. In Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’05), pages 181–190, Washington, DC, 17–20 April 2005.

[14] H. Hassler and N. Takagi. Function Evaluation by Table Look-up and Addition. In Proceedings of the 12th Symposium on Computer Arithmetic, pages 10–16, Bath, England, 19–21 July 1995.

[15] D. M. Lewis. An Architecture for Addition and Subtraction of Long Word Length Numbers in the Logarithmic Number System. IEEE Transactions on Computers, 39(11):1325–1336, Nov. 1990.

[16] D. M. Lewis. Interleaved Memory Function Interpolators with Application to an Accurate LNS Arithmetic Unit. IEEE Transactions on Computers, 43(8):974–982, Aug. 1994.

[17] D. M. Lewis. 114 MFLOPS Logarithmic Number System Arithmetic Unit for DSP Applications. IEEE Journal of Solid-State Circuits, 30(12):1547–1553, Dec. 1995.

[18] J. Makino and M. Taiji. Scientific Simulations with Special-Purpose Computers: The GRAPE Systems. John Wiley & Sons Ltd., Feb. 1998.

[19] M. J. Schulte and J. E. Stine. Symmetric Bipartite Tables for Accurate Function Approximation. In Proceedings of the 13th IEEE Symposium on Computer Arithmetic, pages 175–183, Asilomar, CA, 6–9 July 1997.

[20] T. Stouraitis. Logarithmic Number System Theory, Analysis, and Design. PhD thesis, Univ. of Florida, Gainesville, Florida, 1986.

[21] T. Stouraitis and C. Chen. Hybrid Signed Digit Logarithmic Number System Processor. In IEE Proceedings of Computers and Digital Techniques, volume 140, pages 205–210, 1993.

[22] E. E. Swartzlander and A. G. Alexopoulos. The Sign/Logarithm Number System. IEEE Transactions on Computers, 24(12):1238–1242, Dec. 1975.

[23] E. E. Swartzlander, D. Chandra, T. Nagle, and S. A. Starks. Sign/Logarithm Arithmetic for FFT Implementation. IEEE Transactions on Computers, C-32:526–534, 1983.