Hindawi Publishing Corporation
International Journal of Reconfigurable Computing
Volume 2013, Article ID 453173, 16 pages
http://dx.doi.org/10.1155/2013/453173

Research Article
Analysis of Fast Radix-10 Digit Recurrence Algorithms for Fixed-Point and Floating-Point Dividers on FPGAs

Malte Baesler and Sven-Ole Voigt

Institute for Reliable Computing, Hamburg University of Technology, Schwarzenbergstraße 95, 21073 Hamburg, Germany

Correspondence should be addressed to Malte Baesler; malte.baesler@tu-harburg.de

Received 3 May 2012; Accepted 17 September 2012

Academic Editor: René Cumplido

Copyright © 2013 M. Baesler and S.-O. Voigt. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Decimal floating-point operations are important for applications that cannot tolerate errors from conversions between binary and decimal formats, for instance, commercial, financial, and insurance applications. In this paper we present five different radix-10 digit recurrence dividers for FPGA architectures. The first one implements a simple restoring shift-and-subtract algorithm, whereas each of the other four implementations performs a nonrestoring digit recurrence algorithm with signed-digit redundant quotient calculation and carry-save representation of the residuals. More precisely, the quotient digit selection function of the second divider is implemented fully by means of a ROM, the quotient digit selection functions of the third and fourth dividers are based on carry-propagate adders, and the fifth divider decomposes each digit into three components and requires neither a ROM nor a multiplexer. Furthermore, the fixed-point divider is extended to support IEEE 754-2008 compliant decimal floating-point division for the decimal64 data format. Finally, the algorithms have been synthesized on a Xilinx Virtex-5 FPGA, and implementation results are given.



1. Introduction

Many applications, particularly commercial and financial applications, require decimal floating-point operations to avoid errors from conversions between binary and decimal formats. This paper presents five different decimal fixed-point dividers and analyzes their performance and resource requirements on FPGA platforms. All five architectures apply a radix-10 digit recurrence algorithm but differ in the quotient digit selection (QDS) function.

The first fixed-point divider (type1) implements a simple shift-and-subtract algorithm. It is characterized by an unsigned and nonredundant quotient digit calculation. Nine divisor multiples are precomputed, and in each iteration step nine carry-propagate subtractions are performed on the residual. Finally, the smallest nonnegative difference is selected by a large fan-in multiplexer. This type1 implementation is characterized by a high area use.

The second divider (type2) uses a signed-digit quotient calculation with a redundancy of $\rho = 8/9$ and operand scaling to obtain a normalized divisor in the range $0.4 \le \text{divisor} < 1.0$. The quotient digit selection (QDS) function can be implemented fully by a ROM because it depends only on the two most significant digits (MSDs) of the residual and of the divisor. The residual uses a redundant carry-save representation but, for performance reasons, the two MSDs are implemented in a nonredundant radix-2 representation.

The quotient digit selection (QDS) functions of the third and fourth dividers (type3.a and type3.b) are based on comparators for the two most significant digits. The comparators consist of short binary carry-propagate adders (CPAs), which can be implemented very efficiently in the FPGA's slice structure. The corresponding comparative values depend on the divisor's value and are precomputed and stored in a small ROM. The redundancy is $\rho = 8/9$; thus, 17 binary CPAs are required. Similar to the type2 divider, the type3.a and type3.b dividers use prescaling of the divisor ($0.4 \le \text{divisor} < 1.0$), a redundant carry-save representation of the residual, and a nonredundant radix-2 representation of the two MSDs. The dividers type3.a and type3.b differ only in the implementation of the large fan-in multiplexers in the digit recurrence step. The type3.a divider implements multiplexers that minimize the LUT usage but have long latencies, whereas the type3.b divider implements a faster dedicated multiplexer that exploits the FPGA's internal carry chains.

The quotient digit selection function of the last divider (type4) requires neither a ROM nor a multiplexer. It is characterized by divisor scaling ($0.4 \le \text{divisor} < 0.8$) and a signed-digit redundant quotient calculation with a redundancy of $\rho = 8/9$. The quotient digit is decomposed into three components with values in $\{-5, 0, 5\}$, $\{-2, 0, 2\}$, and $\{-1, 0, 1\}$. The components are computed one by one, whereby the digit selection function is constant; that is, the selection constants do not depend on the divisor's value. Similar to the type2, type3.a, and type3.b dividers, the type4 divider uses a carry-save representation for the residual with a nonredundant radix-2 representation of the two MSDs.

The fixed-point division algorithms are implemented and analyzed on a Virtex-5 FPGA. Finally, the type2 divider, which shows the best tradeoff in area and delay, is extended to a floating-point divider that is fully IEEE 754-2008 compliant for the decimal64 data format, including gradual underflow handling and all required rounding modes.

The architectures of the type1, type2, and type4 dividers have already been published in [1]. However, this paper gives a more detailed description of the previous research and introduces two new dividers that fill the design gap between the type1 and type2 dividers. The latter two are based on two extreme examples of algorithms: the type1 divider implements a restoring quotient digit selection (QDS) function that requires nine decimal carry-propagate adders (CPAs) of full precision, whereas the nonrestoring QDS function of the type2 divider is implemented fully by means of a ROM with limited precision. In comparison, the new dividers implement nonrestoring QDS functions that are based on fast binary CPAs with limited precision.

The outline of this paper is as follows. Section 2 motivates the use of decimal floating-point arithmetic and its advantages compared to binary floating-point arithmetic. The underlying decimal floating-point standard IEEE 754-2008 is introduced in Section 3. The digit recurrence division algorithms, as well as the five different architectures of fixed-point dividers, are presented in Section 4. These fixed-point dividers are extended to a decimal floating-point divider in Section 5. Post-place-and-route results are presented in Section 6, and finally, in Section 7, the main contributions of this paper are summarized.

2. Decimal Arithmetic

Since its approval in 1985, the binary floating-point standard IEEE 754-1985 [2] has been the most widely used implementation of floating-point arithmetic and the dominant floating-point standard for all computers. In contrast to binary arithmetic, decimal units are more complex, require more area, and are more expensive, and the simple binary coded decimal (BCD) data format has a storage overhead of approximately 20%. Thus, at the time of approval, the use of binary in preference to decimal floating-point arithmetic was justified by its better efficiency.

Most people in the world think in decimal arithmetic. These decimal numbers must be converted to binary numbers when using a computer. However, some common finite numbers can only be approximated by binary floating-point numbers. The decimal number 0.1, for example, has a periodic binary expansion, $0.1_{10} = 0.000\overline{1100}_2$. It cannot be represented exactly in binary floating-point arithmetic with finite precision, and the conversion causes rounding errors.
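The effect can be reproduced with a few lines of Python. The following is only an illustrative sketch: the helper `binary_digits` is a name introduced here, and Python's `float` (an IEEE 754 binary64 value) and its `decimal` module stand in for binary and decimal arithmetic in general.

```python
from decimal import Decimal
from fractions import Fraction


def binary_digits(value: Fraction, count: int) -> str:
    """Return the first `count` fractional binary digits of 0 <= value < 1."""
    digits = []
    for _ in range(count):
        value *= 2
        bit = int(value >= 1)      # next binary digit
        digits.append(str(bit))
        value -= bit
    return "".join(digits)


tenth = Fraction(1, 10)
bits = binary_digits(tenth, 19)    # the block 1100 repeats after the first three zeros

# The binary double closest to 0.1 is not exactly one tenth ...
double_is_exact = Fraction(0.1) == tenth
# ... so binary sums drift, whereas decimal arithmetic stays exact.
binary_sum_exact = (0.1 + 0.1 + 0.1 == 0.3)
decimal_sum_exact = (Decimal("0.1") * 3 == Decimal("0.3"))

print(bits, double_is_exact, binary_sum_exact, decimal_sum_exact)
```

The exact rational arithmetic of `Fraction` is used for the digit expansion so that the demonstration itself does not suffer from the rounding it is meant to show.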

As a consequence, binary floating-point arithmetic cannot be used for any calculations that do not tolerate conversion errors between decimal and binary numbers. These are, for instance, financial and business applications that even require decimal arithmetic by law [3]. Therefore, commercial applications often use nonstandardized software to perform decimal floating-point arithmetic. However, these software implementations are usually 100 to 1000 times slower than equivalent binary floating-point operations in hardware [3].

Because of this increasing importance, specifications for decimal floating-point arithmetic have been added to the IEEE 754-2008 standard for floating-point arithmetic [4], which was approved in 2008 and offers a more profound specification than the former radix-independent floating-point standard IEEE 854-1987 [5]. Therefore, new efficient algorithms have to be investigated, and providing hardware support for decimal arithmetic is becoming more and more a topic of interest.

IBM has responded to this market demand and integrates decimal floating-point arithmetic in recent processor architectures such as z9 [6], z10 [7], Power6 [8], and Power7 [9]. The Power6 is the first microprocessor that implements the IEEE 754-2008 decimal floating-point format fully in hardware, while the earlier released z9 already supports decimal floating-point operations but implements them mainly in millicode. Nevertheless, the Power6 decimal floating-point unit is as small as possible and is optimized for low cost. It reuses registers from the binary floating-point unit, and the computing unit mainly consists of a wide decimal adder. Thus, its performance is rather low. Other floating-point operations such as multiplication and division are based on this adder and are performed sequentially. The decimal floating-point units of z10 and Power7 are designed similarly to that of the Power6 [7, 9].

3. IEEE 754-2008

The floating-point standard IEEE 754-2008 [4] has revised and merged the IEEE 754-1985 standard for binary floating-point arithmetic [2] and the IEEE 854-1987 standard for radix-independent floating-point arithmetic [5]. As a consequence, the choice of radices has been focused on two formats: binary and decimal. In this paper we consider only the decimal data format. A decimal number is defined by the triple consisting of the sign ($s$), significand ($c$), and exponent ($q$):

$$x = (-1)^s \cdot c \cdot 10^q \tag{1}$$

with $s \in \{0, 1\}$, $c \in [0, 10^p - 1] \cap \mathbb{N}$, and $q \in [q_{\min}, q_{\max}] \cap \mathbb{Z}$. IEEE 754-2008 uses two different designators for the exponent ($q$ and $e$) as well as for the significand ($c$ and $m$). The exponent $e$ is applied when the significand is regarded as an integer digit and fraction field, denoted by $m$. The exponent $q$ is applied when the significand is regarded as an integer, denoted by $c$. The relation is given through

$$e = q + p - 1, \qquad m = c \cdot b^{-p+1}. \tag{2}$$

In this paper we exclusively use the integer representation $c$ with the exponent $q$.

[Figure 1: Decimal interchange formats [4]. The encoding consists of the sign field (1 bit), the combination field ($w + 5$ bits), and the trailing significand field ($t = J \cdot 10$ bits).]

Table 1: Interchange format parameters [4].

Parameter                               dec32    dec64    dec128
k, storage width (bits)                 32       64       128
p, precision (digits)                   7        16       34
q_min, minimum exponent                 -101     -398     -6176
q_max, maximum exponent                 90       369      6111
Bias = E - q                            101      398      6176
s, sign bit                             1        1        1
w + 5, combination field (bits)         11       13       17
t, trailing significand field (bits)    20       50       110

Unlike the binary floating-point format, a decimal floating-point number is not necessarily normalized. This leads to a redundancy, and a decimal number might have multiple representations. The set of representations is called the floating-point number's cohort. For example, the numbers $123 \cdot 10^{0}$, $1230 \cdot 10^{-1}$, and $12300 \cdot 10^{-2}$ are all members of the same cohort. More precisely, if a number has $n \le p$ significant digits, the number of representations is $p - n + 1$.
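The cohort of a value can be enumerated directly from this definition. The sketch below is illustrative only: the function `cohort` is a hypothetical helper, it handles nonzero values, and it deliberately ignores the exponent limits $q_{\min}$ and $q_{\max}$, which would shorten the list near the ends of the exponent range.

```python
def cohort(coefficient: int, exponent: int, p: int):
    """List all (c, q) pairs with c * 10**q equal to the given value and c < 10**p.

    Assumption of this sketch: the exponent range [q_min, q_max] is not checked.
    """
    if coefficient <= 0:
        raise ValueError("this sketch handles positive coefficients only")
    # Strip trailing zeros to reach the member with the fewest significant digits.
    while coefficient % 10 == 0:
        coefficient //= 10
        exponent += 1
    members = []
    # Append one zero at a time until the coefficient no longer fits into p digits.
    while coefficient < 10**p:
        members.append((coefficient, exponent))
        coefficient *= 10
        exponent -= 1
    return members


members_dec32 = cohort(123, 0, p=7)    # n = 3 significant digits, p = 7
members_dec64 = cohort(1230, -1, p=16)
print(len(members_dec32), len(members_dec64))
```

For the three-digit example from the text, the list has $7 - 3 + 1 = 5$ members for $p = 7$ and $16 - 3 + 1 = 14$ members for $p = 16$.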

IEEE 754-2008 defines three decimal interchange formats (decimal32, decimal64, and decimal128) of fixed widths 32, 64, and 128 bits. As depicted in Figure 1, a floating-point number is encoded by three fields: the sign bit $s$, the combination field $G$, and the trailing significand field. The combination field encodes whether the number is finite, infinite, or not a number (NaN). Furthermore, in the case of finite numbers, the combination field comprises the biased exponent ($E = q + \text{bias}$) and the most significant digit (MSD) of the significand. The remaining $3 \cdot J$ digits are encoded in the trailing significand field of width $t = J \cdot 10$. The trailing significand can be implemented either as a binary integer or as a densely packed decimal (DPD) number [4]. Binary encoding makes software implementations easier, whereas DPD encoding is favored by hardware implementations, as is the case in this paper.

The encoding parameters for the three fixed-width interchange formats are summarized in Table 1. This paper focuses on the data format decimal64 with a DPD-coded significand. DPD encodes three decimal digits (four bits each) into a declet (10 bits) and vice versa [4]. It results in a storage overhead of only 0.343% per digit.
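The entries of Table 1 and the storage overheads quoted in the text can be cross-checked numerically. The sketch below makes two assumptions that are not stated in this paper: the closed-form relations between the storage width $k$ and the other parameters are recalled from IEEE 754-2008 and should be verified against the standard, and both overheads are interpreted as percentages relative to the minimum of $\log_2 10$ bits per decimal digit. The function name is hypothetical.

```python
import math


def decimal_format_parameters(k: int) -> dict:
    """Derive decimal interchange format parameters from the storage width k.

    Assumption: closed-form relations as recalled from IEEE 754-2008,
    valid for k being a multiple of 32.
    """
    p = 9 * k // 32 - 2                  # precision in digits
    emax = 3 * 2 ** (k // 16 + 3)        # maximum exponent e
    return {
        "p": p,
        "q_min": (1 - emax) - p + 1,     # q = e - p + 1 with e_min = 1 - e_max
        "q_max": emax - p + 1,
        "bias": emax + p - 2,
        "combination_bits": k // 16 + 9,
        "trailing_bits": 15 * k // 16 - 10,
    }


BITS_PER_DIGIT = math.log2(10)                        # information content of one digit
bcd_overhead = 4 / BITS_PER_DIGIT - 1                 # BCD: 4 bits per digit
dpd_overhead = (10 / 3) / BITS_PER_DIGIT - 1          # DPD: 10 bits per 3 digits

for k in (32, 64, 128):
    print(k, decimal_format_parameters(k))
print(f"BCD: {bcd_overhead:.1%}, DPD: {dpd_overhead:.3%}")
```

Under these assumptions the three parameter sets agree with Table 1, BCD comes out at roughly 20%, and DPD at roughly 0.343%.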

IEEE 754-2008 defines five rounding modes. These are two modes to the nearest (round ties to even and round ties to away) and three directed rounding modes (round toward positive, round toward negative, and round toward zero) [4]. As floating-point operations are obtained by first performing the exact operation in the set of real numbers and then mapping the exact result onto a floating-point number, rounding is required whenever all significant digits cannot be placed in a single word of length $p$. Moreover, inexact, underflow, or overflow exceptions are signaled when necessary.
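The behavior of the five rounding modes can be illustrated with Python's `decimal` module. The mapping below between the IEEE 754-2008 mode names and the module's rounding constants is an assumption of this sketch, and `round_all_modes` is a helper introduced only for illustration.

```python
from decimal import (Decimal, ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR,
                     ROUND_HALF_EVEN, ROUND_HALF_UP)

# Assumed correspondence between IEEE 754-2008 modes and decimal-module constants.
IEEE_TO_PYTHON = {
    "roundTiesToEven": ROUND_HALF_EVEN,
    "roundTiesToAway": ROUND_HALF_UP,        # ties go away from zero
    "roundTowardPositive": ROUND_CEILING,
    "roundTowardNegative": ROUND_FLOOR,
    "roundTowardZero": ROUND_DOWN,
}


def round_all_modes(value: str, quantum: str = "1") -> dict:
    """Round the exact decimal `value` to the given quantum in every mode."""
    exact = Decimal(value)
    return {name: exact.quantize(Decimal(quantum), rounding=mode)
            for name, mode in IEEE_TO_PYTHON.items()}


positive = round_all_modes("2.5")
negative = round_all_modes("-2.5")
for name in IEEE_TO_PYTHON:
    print(f"{name:20s} {positive[name]:>3} {negative[name]:>3}")
```

A tie such as 2.5 is a convenient test value because it separates the two nearest modes, while its negation separates the three directed modes.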

4. Decimal Fixed-Point Division

Oberman and Flynn [10] distinguish five different classes of division algorithms: digit recurrence, functional iteration, very high radix, table look-up, and variable latency, whereby many practical algorithms are combinations of multiple classes. Compared to binary arithmetic, decimal division is more complex. Currently, there are only a few publications concerning radix-10 division.

Wang and Schulte [11] describe a decimal divider based on the Newton-Raphson approximation of the reciprocal. The latency of a decimal Newton-Raphson approximation directly depends on the latency of the decimal multiplier. A pipelined multiplier has the advantage that more than one division operation can be processed in parallel; otherwise, the efficiency is poor. However, the algorithm lacks remainder calculation, and the rounding is more complex. The first FPGA-based decimal Newton-Raphson dividers are presented in [12]. The dividers, as well as the underlying multipliers, are sequential; hence, these dividers have a high latency.

Digit recurrence division is the most widely implemented class of division algorithms. It is an iterative algorithm with linear convergence; that is, a fixed number of quotient digits is retired in every iteration step. Compared to radix-2 arithmetic, radix-10 digit recurrence division is more complex because, on the one hand, decimal logic is less efficient by itself and, on the other hand, the range of the quotient digit selection (QDS) function comprises a larger digit set. Therefore, the performance of a digit recurrence divider depends on the choice of the QDS function and the implementation of decimal logic.

Nikmehr et al. [13] select quotient digits by comparing the truncated residual with limited-precision multiples of the divisor. Lang and Nannarelli [14] replace the divisor's multiples by comparative values obtained from a look-up table and decompose the quotient digit into a radix-2 digit and a radix-5 digit in such a way that only five and two times the divisor are required. Vazquez et al. [15] take a different approach: the selection constants in the QDS function are obtained from truncated multiples of the divisor, avoiding look-up tables. Therefore, the multiples are computed on the fly. Moreover, the digit recurrence iteration implements a slow carry-propagate adder, but an estimation of the residual is computed to make the determination of the quotient digits independent of this carry-propagate adder. The decimal divider of the Power6 microprocessor [16] uses extensive prescaling to bound the divisor to be greater than or equal to 1.0 but lower than 1.1112 in order to simplify the digit selection function. However, this prescaling is very costly: it requires a 2-digit multiply and needs six cycles overall on each operand. Furthermore, the digit selection function still requires a look-up table.

The first decimal fixed-point divider designed for FPGAs applies a digit recurrence algorithm and was proposed by Ercegovac and McIlhenny [17, 18]. However, it has a poor cycle time because it does not fit well onto the slice structure of FPGAs, although it would probably fit well into ASIC designs with dedicated routing.

4.1. Digit Recurrence Division

In the following, we use different decimal number representations. A decimal number $B$ is called BCD-$\beta_0\beta_1\beta_2\beta_3$ coded when $B$ can be expressed by

$$B = \sum_{i=0}^{p-1} B_i \cdot 10^i \quad \text{with} \quad B_i = \sum_{k=0}^{3} B_{ik} \cdot \beta_k, \quad B_{ik} \in \{0, 1\}. \tag{3}$$

A number representation is called redundant if one or more digits have multiple representations.

Moreover, we consider a division $z = x/d$ in which the decimal dividend $x$ and the decimal divisor $d$ are positive and normalized fractional numbers of precision $p$, that is, $0.1 \le x, d < 1$. The quotient $z$ and the remainder rem are calculated as follows:

$$x = z \cdot d + \text{rem}, \qquad \text{rem} \le d \cdot \text{ulp}, \qquad \text{ulp} = 10^{-p}. \tag{4}$$

Then the radix-10 digit recurrence is implemented by

$$w[j+1] = 10\, w[j] - z_{j+1} \cdot d, \qquad j = 0, 1, \ldots, p, \tag{5}$$

with the initial residual $w[0] = x/10$ and with the quotient digit calculated by the selection function

$$z_{j+1} = \mathrm{SEL}(10\, w[j], d) \quad \text{with} \quad z = z_1.z_2 z_3 \cdots z_p = \sum_{j=0}^{p-1} z_{j+1} \cdot 10^{-j}. \tag{6}$$

Digit recurrence division is subdivided into the classes of restoring and nonrestoring algorithms, which differ in the quotient digit selection (QDS) function and the dynamic range of the residual. A restoring divider selects the next nonnegative quotient digit $0 \le z_{j+1} \le 9$ such that the next partial residual is as small as possible but still nonnegative. The IBM z900 architecture, for instance, implements such a decimal restoring divider [19]. By contrast, the QDS function of a decimal nonrestoring divider uses a digit set that is positive as well as negative, $-a \le z_{j+1} \le a$, and the partial residual $w[j+1]$ might also be negative.

One advantage of nonrestoring division is the enhanced performance, since estimates of limited precision ($\hat{w}[j] \approx w[j]$ and $\hat{d} \approx d$) might be used in the QDS function $\mathrm{SEL}(10\hat{w}[j], \hat{d})$. This performance gain is achieved by using a redundant digit set $-a \le z_j \le a$, which defines a redundancy factor

$$\rho = \frac{a}{10 - 1} = \frac{a}{9}. \tag{7}$$

Then, in each iteration step, the quotient digit $z_{j+1}$ should be selected such that the next residual $w[j+1]$ is bounded by

$$-\rho d \le w[j+1] \le \rho d, \tag{8}$$

which is called the convergence condition [20].

This paper presents five different decimal fixed-point dividers, which are described in the following. The first divider (type1) is a restoring divider, whereas the four others (type2, type3.a, type3.b, and type4) are nonrestoring dividers.

4.2. Type1 QDS Function

The QDS function of the type1 algorithm is very simple. Nine multiples of the divisor ($1d, \ldots, 9d$) are subtracted in parallel from the residual by carry-propagate adders (CPAs), and the smallest nonnegative result is selected. The CPAs exploit the FPGA's internal fast carry logic, as described in [21].

The nine multiples are precomputed and are composed of the multiples $1d$, $2d$, $5d$, $10d$, and their negatives (10's complement), which requires at most one additional CPA per multiple. The multiples $2d$ and $5d$ can be easily computed by digit recoding and constant shift operations:

$$(X)_{\text{BCD-5421}} \ll 1 \equiv (X \cdot 2)_{\text{BCD-8421}}, \tag{9}$$

$$(X)_{\text{BCD-8421}} \ll 3 \equiv (X \cdot 5)_{\text{BCD-5421}}, \tag{10}$$

where (9) is read as follows: a BCD-5421-coded number $X$ left-shifted by one bit is equivalent to the corresponding BCD-8421-coded number multiplied by two. In a similar fashion, we obtain a multiplication by five using (10).
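Both shift identities can be checked exhaustively in software. The following sketch packs each decimal digit into four bits with the respective weights; the helper names (`encode`, `decode`, `times2`, `times5`) and the greedy choice of the BCD-5421 bit pattern are assumptions made for this illustration, and one extra digit position is reserved for the bits shifted out of the most significant digit.

```python
W8421 = (8, 4, 2, 1)   # bit weights of a BCD-8421 digit, MSB first
W5421 = (5, 4, 2, 1)   # bit weights of a BCD-5421 digit, MSB first


def encode(value: int, weights, digits: int) -> int:
    """Pack a nonnegative integer into 4-bit digits using the given weights."""
    word = 0
    for i in range(digits):
        digit = (value // 10**i) % 10
        nibble = 0
        for bit, weight in zip((3, 2, 1, 0), weights):   # greedy, MSB first
            if digit >= weight:
                nibble |= 1 << bit
                digit -= weight
        word |= nibble << (4 * i)
    return word


def decode(word: int, weights, digits: int) -> int:
    """Interpret a packed word as decimal digits with the given bit weights."""
    value = 0
    for i in range(digits):
        nibble = (word >> (4 * i)) & 0xF
        digit = sum(w for bit, w in zip((3, 2, 1, 0), weights)
                    if (nibble >> bit) & 1)
        value += digit * 10**i
    return value


def times2(x: int, digits: int) -> int:
    """Equation (9): BCD-5421 shifted left by one bit, read as BCD-8421."""
    return decode(encode(x, W5421, digits) << 1, W8421, digits + 1)


def times5(x: int, digits: int) -> int:
    """Equation (10): BCD-8421 shifted left by three bits, read as BCD-5421."""
    return decode(encode(x, W8421, digits) << 3, W5421, digits + 1)


print(times2(789, 3), times5(789, 3))
```

The shifts work because the bit that leaves a digit position lands in the next digit with exactly the weight it needs: in (9) the weight-5 bit becomes a carry of 10, and in (10) the weight-1 bit becomes the weight-5 bit of the same digit.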

The implementation of the type1 digit recurrence divider is characterized by a high area use due to the utilization of nine parallel CPAs. The corresponding algorithm of the type1 digit recurrence is shown in Algorithm 1.
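A behavioral model of the restoring recurrence helps to see how the digits and the residual evolve. The sketch below is a software illustration only, not the hardware architecture: it uses exact rational arithmetic instead of BCD carry-propagate adders, evaluates the nine differences in a loop rather than in parallel, and `type1_divide` is a name introduced here.

```python
from fractions import Fraction


def type1_divide(x: Fraction, d: Fraction, steps: int):
    """Restoring radix-10 digit recurrence for 0.1 <= x, d < 1."""
    multiples = [k * d for k in range(1, 10)]       # 1d ... 9d, precomputed once
    w = x / 10                                      # initial residual w[0] = x/10
    digits = []
    for _ in range(steps):
        shifted = 10 * w
        deltas = [shifted - m for m in multiples]   # nine subtractions per step
        z, w = 0, shifted                           # default: digit 0, residual kept
        for k, delta in enumerate(deltas, start=1):
            if delta < 0:
                break                               # first negative difference ends the search
            z, w = k, delta                         # largest multiple that still fits
        digits.append(z)
    quotient = sum(Fraction(z, 10**j) for j, z in enumerate(digits))
    return digits, quotient, w


digits, quotient, residual = type1_divide(Fraction(1, 2), Fraction(4, 5), steps=4)
print(digits, quotient, residual)
```

For $x = 0.5$ and $d = 0.8$ the model yields the digits 0, 6, 2, 5, that is, $z = 0.625$ with a zero final residual. In general, the residual stays in $[0, d)$, so after $n$ steps $x - z \cdot d = 10^{1-n} w[n]$ is nonnegative and smaller than $d \cdot 10^{1-n}$.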

4.3. Type2 QDS Function

The approach of the type2 division algorithm is based on the implementation of the QDS function fully by a ROM. This ROM is addressed by estimates of the residual and divisor. The use of estimates is feasible because a signed digit set together with a redundancy greater than 1/2 is used. The estimates $10\hat{w}[j]$ and $\hat{d}$ are obtained by truncation; that is, only a limited number of MSDs of the residual $10w[j]$ and the divisor $d$ are regarded. Furthermore, the residual is implemented in BCD-4221 carry-save representation such that the maximum error introduced by an estimation with precision $t$ (one integer digit and $t - 1$ fractional digits) is bounded by $\epsilon = 2 \cdot 10^{-t+1}$. Negative residuals are represented by their 10's complement.

The divisor is subdivided into subranges $[d_i, d_{i+1})$ of equal width $d_{i+1} - d_i = 10^{-\delta}$. For each subrange $i$, the QDS function $z_{j+1} = \mathrm{SEL}_{\mathrm{ROM}}(10\hat{w}[j], \hat{d})$ is defined by selection constants $m_k(i)$:

$$z_{j+1} = k \iff m_k(i) \le 10\hat{w}[j] < m_{k+1}(i). \tag{11}$$


(1) Compute
  $\delta_1 = 10\, w[j] - d$
  ...
  $\delta_9 = 10\, w[j] - 9d$
(2) if $\delta_1 < 0$ then
  $z_{j+1} = 0$ and $w[j+1] = 10 \cdot w[j]$
 else if $\delta_2 < 0$ then
  $z_{j+1} = 1$ and $w[j+1] = \delta_1$
 ...
 else if $\delta_9 < 0$ then
  $z_{j+1} = 8$ and $w[j+1] = \delta_8$
 else
  $z_{j+1} = 9$ and $w[j+1] = \delta_9$

Algorithm 1: Pseudocode for type1 digit recurrence division.

These selection constants are bounded by selection intervals, $m_k(i) \in [L_k, U_{k-1}]$, with the containment condition [20]

$$L_k(d_i) = (k - \rho)\, d_i, \tag{12a}$$

$$U_k(d_i) = (k + \rho)\, d_i. \tag{12b}$$

Furthermore, the continuity condition states that every value $10w[j]$ must belong to at least one selection interval [20]. Considering the maximum error due to truncation, the continuity condition can be expressed by

$$U_{k-1}(d_i) - L_k(d_{i+1}) \ge \epsilon = 2 \cdot 10^{-t+1}. \tag{13}$$

If we suppose the subrange width to be constant ($d_{i+1} - d_i = 10^{-\delta}$) and consider that the minimum overlap occurs for minimum $d_{\min}$ and maximum $k_{\max}$, then we obtain the condition

$$(k_{\max} - 1 + \rho)\, d_{\min} - (k_{\max} - \rho)\,(d_{\min} + 10^{-\delta}) \ge 2 \cdot 10^{-t+1}. \tag{14}$$

We choose $\rho = 8/9$ ($k_{\max} = 8$) and $d_{\min} = 0.4$, which leads to $\delta = 2$ and $t = 2$. Thus, the divisor's estimate $\hat{d}$ requires an accuracy of two fractional digits, and the residual's estimate $10\hat{w}$ requires an accuracy of one integer digit and one fractional digit. Furthermore, the divisor must be prescaled such that $0.4 \le d < 1.0$.
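The continuity condition can be evaluated numerically for the chosen parameters. The following sketch uses exact fractions and restricts itself to the positive digits $k = 1, \ldots, 8$; the function names are introduced here for illustration only.

```python
from fractions import Fraction


def overlap(k: int, d_i: Fraction, rho: Fraction, delta: int) -> Fraction:
    """Left-hand side of (13): U_{k-1}(d_i) - L_k(d_i + 10**-delta)."""
    upper_prev = (k - 1 + rho) * d_i                          # U_{k-1}(d_i), see (12b)
    lower_next = (k - rho) * (d_i + Fraction(1, 10**delta))   # L_k(d_{i+1}), see (12a)
    return upper_prev - lower_next


def continuity_holds(rho, d_min, delta, t, k_max) -> bool:
    """Check (14) for the worst case d_min, k_max."""
    eps = 2 * Fraction(1, 10**(t - 1))                        # eps = 2 * 10**(-t+1)
    return overlap(k_max, d_min, rho, delta) >= eps


RHO = Fraction(8, 9)
D_MIN = Fraction(4, 10)

worst = overlap(8, D_MIN, RHO, delta=2)
# Minimum overlap over all divisor subranges 0.40 ... 0.99 and digits 1 ... 8.
grid_minimum = min(overlap(k, Fraction(n, 100), RHO, 2)
                   for n in range(40, 100) for k in range(1, 9))

print(worst, continuity_holds(RHO, D_MIN, delta=2, t=2, k_max=8))
print(continuity_holds(RHO, D_MIN, delta=1, t=2, k_max=8))   # coarser divisor estimate
print(continuity_holds(RHO, D_MIN, delta=2, t=1, k_max=8))   # coarser residual estimate
```

With $\delta = 2$ and $t = 2$ the worst-case overlap is $0.24$, which exceeds $\epsilon = 0.2$, whereas reducing either estimate by one digit violates the condition.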

The P-D diagram is a visualization technique for designing a quotient digit selection function and for computing the decision boundaries $m_k(i)$. It plots the shifted residual versus the divisor. The P-D diagram for the type2 quotient digit selection function with the selection constants $m_k(i)$ is shown in Figure 2.

[Figure 2: P-D diagram for the type2 QDS function (shifted residual $10 \cdot w$ versus divisor $d$).]

The two MSDs of the divisor and the residual address the ROM in order to obtain the signed quotient digit, which comprises 5 bits. In order to reduce the ROM's size, the two MSDs of the divisor and the residual are implemented using a nonredundant radix-2 representation. This leads to an address that is composed of 15 bits: seven bits from the unsigned radix-2 representation of the divisor's two MSDs and eight bits from the signed radix-2 representation of the residual's two MSDs. Hence, the size of the ROM is $2^{15} \times 5$ bits. Unfortunately, the use of a radix-2 representation complicates the multiplication by 10, which is required according to (5). Therefore, an additional binary adder is needed:

$$10 \cdot w[j] = 8 \cdot w[j] + 2 \cdot w[j] = (w[j] \ll 3) + (w[j] \ll 1). \tag{15}$$

(1) Compute $L_k(d_i)$ and $U_k(d_i)$ for each $d_i \in \{0.40, 0.41, \ldots, 0.99\}$ and $k \in \{-8, \ldots, +8\}$ according to (12a) and (12b).
(2) For all $10 \cdot w_n \in \{-8.8, -8.7, \ldots, +8.8\}$ with $|w| \le \rho \cdot d_{\max} = 0.88$, compute the quotient digits $z$ as follows:
 if $10 \cdot w_n \ge U_7 - \epsilon = U_7 - 0.2$ then $z = 8$
 else if $10 \cdot w_n \ge U_6 - 0.2$ then $z = 7$
 ...
 else if $10 \cdot w_n \ge U_0 - 0.2$ then $z = 1$
 else if $10 \cdot w_n \ge L_0$ then $z = 0$
 ...
 else if $10 \cdot w_n \ge L_{-8}$ then $z = -8$

Algorithm 2: Algorithm for the calculation of the type2 ROM entries.

The corresponding algorithm for the calculation of the ROM entries can be derived from the P-D diagram and is depicted in Algorithm 2. It should be noted that the radix-2 representations of $\hat{d}$ and $\hat{w}$ are truncations of $d$ and $w$, that is, $\hat{d} \le d$ and $\hat{w} \le w$.
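The calculation of the ROM entries can be sketched in software as follows. Several details are assumptions of this sketch rather than statements from the paper: the address layout (divisor code in the upper seven bits, two's complement residual code in the lower eight bits), the treatment of the last branch as an unconditional default for $z = -8$, and the extension of the residual range down to $-8.9$ so that truncated negative values are covered.

```python
from fractions import Fraction

RHO = Fraction(8, 9)     # redundancy factor
EPS = Fraction(2, 10)    # estimation error bound for t = 2


def select_digit(shifted_hat: Fraction, d_hat: Fraction) -> int:
    """Quotient digit for the truncated estimates of 10*w and d."""
    for k in range(8, 0, -1):                           # z = 8 ... 1
        if shifted_hat >= (k - 1 + RHO) * d_hat - EPS:  # U_{k-1}(d_hat) - eps
            return k
    for k in range(0, -8, -1):                          # z = 0 ... -7
        if shifted_hat >= (k - RHO) * d_hat:            # L_k(d_hat)
            return k
    return -8                                           # assumed default branch


def rom_address(d_code: int, w_code: int) -> int:
    """Hypothetical 15-bit address: 7-bit divisor code, 8-bit signed residual code."""
    return (d_code << 8) | (w_code & 0xFF)


def build_rom() -> dict:
    """Map every address to its quotient digit (d in hundredths, 10*w in tenths)."""
    rom = {}
    for d_code in range(40, 100):          # d_hat = 0.40 ... 0.99
        for w_code in range(-89, 89):      # 10*w_hat = -8.9 ... +8.8
            rom[rom_address(d_code, w_code)] = select_digit(
                Fraction(w_code, 10), Fraction(d_code, 100))
    return rom


ROM = build_rom()
print(len(ROM), ROM[rom_address(80, 50)], ROM[rom_address(40, -30)])
```

Only 10680 of the $2^{15}$ addresses are populated in this model; the remaining addresses correspond to estimates that cannot occur for a prescaled divisor and a bounded residual.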

Figure 3: Block diagram of type2 digit recurrence.

(1) Select $z_{j+1} = \mathrm{SEL}_{\mathrm{ROM}}(10 w[j], d)$
(2) Compute $(w_s[j+1] + w_c[j+1]) = 10\,(w_s[j] + w_c[j]) - z_{j+1} \cdot d$

Algorithm 3: Pseudocode for the type2 digit recurrence.

Once the quotient digit $z_{j+1}$ has been determined, the multiple $z_{j+1} \cdot d$ is subtracted from the current residual to compute the next residual, as stated in (5). The subtracter is a fast redundant BCD-4221 carry-save adder (CSA) as described in [21], except for the two MSDs, in which we apply radix-2 CPAs. Thus, the multiple $z_{j+1} \cdot d$ is composed of two summands,

$$z_{j+1} \cdot d = d^1_{j+1} + d^2_{j+1},$$ (16)

where $d^{1,2} \in \{0, \pm 1d, \pm 2d, \pm 5d, \pm 10d\}$ are precomputed. The multiples $2d$ and $5d$ can be computed easily and quickly by digit recoding and constant shift operations, as shown in (9) and (10). The 10's and 2's complements are applied for $d^{1,2} < 0$. For both the radix-2 and the BCD-4221 radix-10 representations, the complements can be computed by inverting each bit and adding one. In summary, the subtraction uses a (3:1) radix-2 CPA for the two MSDs and a redundant radix-10 (4:2) CSA for the remaining digits. The digit recurrence algorithm of the type2 divider is summarized in Algorithm 3, and its block diagram is depicted in Figure 3.
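Bit inversion yields the complement in BCD-4221 because the bit weights 4, 2, 2, and 1 sum to 9: inverting all four bits of a digit gives its nine's complement, and adding one to the whole number gives the ten's complement. The following Python sketch illustrates this property; the helper names are hypothetical, and the digit list format (most significant digit first) is an assumption made for the example.

```python
WEIGHTS = (4, 2, 2, 1)  # bit weights of one BCD-4221 digit, MSB first


def decode_4221(bits: int) -> int:
    """Digit value (0..9) of a 4-bit BCD-4221 pattern."""
    return sum(w for pos, w in enumerate(WEIGHTS) if (bits >> (3 - pos)) & 1)


def invert_4221(bits: int) -> int:
    """Bitwise inversion of one 4-bit digit."""
    return bits ^ 0b1111


def tens_complement(digit_bits: list[int]) -> int:
    """Value of (all digits inverted) + 1; most significant digit first."""
    nines = 0
    for bits in digit_bits:
        nines = 10 * nines + decode_4221(invert_4221(bits))
    return nines + 1
```

For instance, the two-digit number 14, encoded as the patterns `0b0001` and `0b0110`, is mapped to 100 - 14 by `tens_complement`.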

4.4. Type3 QDS Function. The frequency-limiting component of the type2 divider is the digit recurrence step, which comprises the ROM (BRAM) to calculate the next quotient digit. This BRAM is slower than common FPGA logic. Hence, it appears beneficial to remove the slow BRAM from the critical path and use fast FPGA logic instead. To analyze the impact of the BRAM delays, we implement two dividers (type3.a and type3.b), which can be realized without BRAM in the critical path.

Lang and Nannarelli [14] propose a divider that implements a quotient digit selection function based on CPAs and sign detection. These CPAs subtract fixed values from the current residual with limited precision. These fixed values depend on the estimates of the divisor and are precomputed and stored in a ROM. Furthermore, the quotient digit is divided into two parts, which further simplifies the quotient digit selection function and reduces the critical path. Unfortunately, this divider cannot be implemented efficiently on FPGAs because the required carry-save adder with small error estimation shows poor performance on the FPGA's slice structure. Nevertheless, to investigate the impact of BRAM delays in the type2 divider, we implement another two dividers (type3.a and type3.b) that have no BRAM in the critical path but a CPA-based quotient digit selection function instead.

The type3 dividers are modifications of the type2 divider. The common architectural features are

(i) the multiples of the divisor are composed of two components, $z_{j+1} \cdot d = d^1_{j+1} + d^2_{j+1}$, where $d^{1,2} \in \{0, \pm 1d, \pm 2d, \pm 5d, \pm 10d\}$,
(ii) the fast redundant BCD-4221 carry-save adders for the digit recurrence step as stated in (5),
(iii) the radix-2 implementation of the two MSDs, and
(iv) the 10's and 2's complements for $d^{1,2} < 0$.

Furthermore, since the architectures of the type2 and type3 dividers are similar with the exception of the quotient digit selection function, most of the dividers' characteristics are also identical, including

(i) the P-D diagram,
(ii) the redundancy factor $\rho = 8/9$, $k_{\max} = 8$,
(iii) the need for prescaling the divisor, $0.4 \le d < 1.0$,
(iv) the accuracy of the divisor's estimation, $\delta = 2$, and
(v) the accuracy of the residual's estimation, $t = 2$.

Contrary to the type2 divider, the QDS functions of the type3 dividers are implemented by 16 carry-propagate adders with sign detection. These adders subtract fixed selection constants from the estimation of the current residual. The selection constants depend on the current divisor and are stored in a ROM, which is then no longer part of the critical path. The selection constants are computed according to Algorithm 4. The common digit recurrence algorithm of the type3.a and type3.b dividers is shown in Algorithm 5, and the corresponding block diagram is depicted in Figure 4.

The sign signals of the binary carry-propagate adders are used to determine the corresponding quotient digit and to select the multiples of the divisor in the digit recurrence


Figure 4: Block diagram of type3 digit recurrence.

(1) compute $L_k(d_i)$ and $U_k(d_i)$ for each $d_i \in \{0.40, 0.41, \ldots, 0.99\}$ and $k \in \{-8, \ldots, +8\}$ according to (12a) and (12b)
(2) compute the selection constants $m_k(d_i)$:
    $m_8(d_i) = U_7(d_i) - \epsilon = U_7(d_i) - 0.2$
    $m_7(d_i) = U_6(d_i) - 0.2$
    $\vdots$
    $m_1(d_i) = U_0(d_i) - 0.2$
    $m_0(d_i) = L_0(d_i)$
    $\vdots$
    $m_{-7}(d_i) = L_{-7}(d_i)$

Algorithm 4: Algorithm for the calculation of the type3 selection constants.

(1) if $m_8(d_i) - 10 w[j] < 0$ then $z_{j+1} = 8$
    else if $m_7(d_i) - 10 w[j] < 0$ then $z_{j+1} = 7$
    $\vdots$
    else if $m_{-7}(d_i) - 10 w[j] < 0$ then $z_{j+1} = -7$
    else $z_{j+1} = -8$
(2) Compute $(w_s[j+1] + w_c[j+1]) = 10\,(w_s[j] + w_c[j]) - z_{j+1} \cdot d$

Algorithm 5: Pseudocode for the type3 digit recurrence.
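Step (1) of Algorithm 5 can be modeled in software as 16 sign computations followed by a priority chain. The sketch below is only a behavioral illustration: the function name `select_digit` is hypothetical, and the constants in `M_DEMO` are placeholders chosen for the example, not the values produced by Algorithm 4.

```python
from fractions import Fraction


def select_digit(w10_est, m):
    """Return z_{j+1} in -8..8 for the residual estimate w10_est = 10*w[j].

    m maps k = 8, 7, ..., -7 to the selection constant m_k(d_i).
    """
    # One comparator per constant: the sign bit is set when m_k - 10*w < 0.
    sign = {k: (m[k] - w10_est) < 0 for k in range(8, -8, -1)}
    # Priority chain of Algorithm 5: the largest k with a set sign bit wins.
    for k in range(8, -8, -1):
        if sign[k]:
            return k
    return -8


# Placeholder constants (NOT the values produced by Algorithm 4).
M_DEMO = {k: Fraction(2 * k - 1, 2) for k in range(8, -8, -1)}
```

With monotonically increasing constants, the 16 sign bits form a thermometer code, which is what allows them to drive the multiplexer select lines directly.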

iteration step. The advantage of the type3 QDS function is its short critical path, which comprises only one 8-bit binary carry-propagate adder. However, this short critical path comes at a high price because the complexity is moved to the selection of the divisor's multiples. Hence, large fan-in multiplexers with poor latencies are required. In order to minimize the impact of these multiplexers, we designed a new dedicated (17:1) multiplexer that exploits the FPGA's fast carry chains and has a delay of only one LUT instance. In the following, we name the divider with improved fast multiplexers type3.b and the divider with traditional multiplexers type3.a. The new dedicated (17:1) multiplexer uses the 16 sign bits as select lines and is coded as follows:

if sel = '1111111111111111' then $y = x(16)$
if sel = '0111111111111111' then $y = x(15)$
$\vdots$
if sel = '0000000000000001' then $y = x(1)$
if sel = '0000000000000000' then $y = x(0)$
(17)

with sel(−1) = 1 and sel(16) = 0. In other words, bit $x(i)$ is selected when sel($i$) = 0 and sel($i-1$) = 1. The selection of two input signals is implemented in one (5:1) LUT, and all signals are combined by a long OR gate that is implemented by exploiting the FPGA's fast carry chain. Therefore, one (17:1) multiplexer requires nine LUTs, and the type3.b divider with fast multiplexers uses many more LUTs than the type3.a divider (see Section 6). The block diagram of such a (17:1) multiplexer is depicted in Figure 5.
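The selection rule of (17) can be expressed as a small behavioral model in Python. This is a sketch of the logic only, not of the LUT and carry-chain mapping; the function name `fast_mux` and the list-based argument format are assumptions made for the example.

```python
def fast_mux(x, sel):
    """Select one of 17 inputs x[0..16] using 16 sign bits sel[0..15]."""
    if len(x) != 17 or len(sel) != 16:
        raise ValueError("expected 17 data inputs and 16 select bits")

    def s(i):
        # Boundary conditions from the text: sel(-1) = 1 and sel(16) = 0.
        if i == -1:
            return 1
        if i == 16:
            return 0
        return sel[i]

    y = 0
    for i in range(17):
        if s(i) == 0 and s(i - 1) == 1:  # x(i) is enabled
            y |= x[i]                    # long OR over all enabled inputs
    return y
```

For a thermometer-coded select vector exactly one input is enabled, so the OR simply forwards that input.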

Figure 5: Fast multiplexer. [Diagram of chained LUT(5:1) instances, each with INIT = F5F7FDFF, driven by the select bits sel(0) to sel(15) and the inputs x(0) to x(16), producing the output Y.]

4.5. Type4 QDS Function. The type4 divider applies a new algorithm that we proposed in a previous paper [22]. It is based on the decomposition of the signed quotient digit into three components having values $\{-5, 0, 5\}$, $\{-2, 0, 2\}$, and $\{-1, 0, 1\}$, as well as a fast constant digit selection function. In this implementation, neither a ROM for the QDS function nor a multiplexer to select the multiple $z_j \cdot d$ in the digit recurrence iteration step is required. The divider is intended to utilize fewer resources than the other implementations. Like the type2, type3.a, and type3.b dividers, the type4 divider uses signed-digit redundant quotient calculation, carry-save representation of the residual, fast BCD-4221 CSAs for the digit recurrence, and a radix-2 implementation of the MSDs.

Quotient digits are decomposed into three components, $z_{j+1} = 5 \cdot z^3_{j+1} + 2 \cdot z^2_{j+1} + z^1_{j+1}$, and each component $z^i$ is computed by a distinct selection function. These components can hold three values, $z^i \in \{-1, 0, 1\}$, so that only two comparators per component are required to distinguish between these values. Since $-8 \le z_{j+1} \le 8$, we get a redundancy factor of $\rho = 8/9$. Furthermore, the selection functions become very simple due to prescaling of the divisor ($0.4 \le d < 0.8$); that is, they are constant functions $\mathrm{SEL}^i$ that do not depend on the divisor anymore:

$$z^3_{j+1} = \mathrm{SEL}^3(10 w[j]),$$ (18a)
$$z^2_{j+1} = \mathrm{SEL}^2(v^1[j]),$$ (18b)
$$z^1_{j+1} = \mathrm{SEL}^1(v^2[j]).$$ (18c)

The recurrence is then defined as follows:

$$v^1[j] = 10 w[j] - z^3_{j+1} \cdot 5d,$$ (19a)
$$v^2[j] = v^1[j] - z^2_{j+1} \cdot 2d,$$ (19b)
$$w[j+1] = v^2[j] - z^1_{j+1} \cdot d.$$ (19c)

The multiples $2d$ and $5d$ can be computed easily and quickly according to (9) and (10). Each selection function requires two comparators implemented as carry-save adders that subtract constant values from the residuals' estimations ($\hat{w}[j]$, $\hat{v}^1[j]$, and $\hat{v}^2[j]$). As we will show in the following, these estimations require only the two most significant digits (MSDs) of the corresponding exact values.

First, the selection intervals $[L^i_k, U^i_k]$ for $z^i_{j+1} = k$ with $k \in \{-1, 0, 1\}$ and $i \in \{1, 2, 3\}$ have to be determined, where $L^i_k$ is the smallest and $U^i_k$ is the greatest value of the selection constant $m_k(i)$ such that the next residual is still bounded. Applying the convergence condition (8), the digit recurrence (19a), (19b), and (19c), and the redundancy factor $\rho = 8/9$, we obtain

$$|w[j]| \le \rho d = \frac{8}{9} d,$$ (20a)
$$|v^2[j]| \le (\rho + 1)\, d = \frac{17}{9} d,$$ (20b)
$$|v^1[j]| \le (\rho + 1 + 2)\, d = \frac{35}{9} d.$$ (20c)

From the recurrence $v^1[j] = 10 w[j] - z^3_{j+1} \cdot 5d$, the bound $v^1[j] \le (35/9)\, d$, $z^3_{j+1} \in \{-1, 0, 1\}$, and replacing $10 w[j]$ by the upper limit $U^3_k$, we get

$$U^3_k = (5k + \rho + 3)\, d = \left(5k + \frac{35}{9}\right) d.$$ (21a)

Similarly, we obtain the upper limits $U^2_k$ and $U^1_k$ from the recurrences (19b) and (19c), respectively:

$$U^2_k = (2k + \rho + 1)\, d = \left(2k + \frac{17}{9}\right) d,$$ (21b)
$$U^1_k = (k + \rho)\, d = \left(k + \frac{8}{9}\right) d.$$ (21c)

Likewise, the lower limits $L^i_k$ can be computed:

$$L^3_k = (5k - \rho - 3)\, d = \left(5k - \frac{35}{9}\right) d,$$ (22a)
$$L^2_k = (2k - \rho - 1)\, d = \left(2k - \frac{17}{9}\right) d,$$ (22b)
$$L^1_k = (k - \rho)\, d = \left(k - \frac{8}{9}\right) d.$$ (22c)


Figure 6: P-D diagram for the type4 QDS function. [Vertical axis: residual $10 \cdot w$; the diagram marks the selection constants $m^3_{\mathrm{pos}} = 0.9$ and $m^3_{\mathrm{neg}} = -1.5$.]

Due to the redundancy factor $\rho = 8/9 > 1/2$, we have overlapping regions $[L^i_{k+1}, U^i_k]$ where more than one quotient digit component $z^i_{j+1}$ may be selected. The decision boundary of a selection function should lie inside these overlapping regions. Figure 6 shows the P-D diagram for $z^3$. The figures for $z^2$ and $z^1$ look similar. As $z^3$ can hold three different values, the selection function requires two decision boundaries. Generally, it is implemented by a staircase function (see Figure 2), but for the bounded divisor $d \in [0.4, 0.8)$ it is independent of the divisor and is reduced to a constant function (see Figure 6). Fortunately, the selection functions for $z^2$ and $z^1$ are also independent of the divisor and can be implemented by constant functions in a similar fashion.

We now determine the required number of fractional digits for the estimates $10\hat{w}$, $\hat{v}^1$, and $\hat{v}^2$. Moreover, we choose suitable selection constants $m^i_{\mathrm{pos}}$ and $m^i_{\mathrm{neg}}$. These constants depend on the maximum error due to the estimates and on the minimum overlapping width, which is given by $U^i_0(0.4) - L^i_1(0.8)$ as well as $U^i_{-1}(0.8) - L^i_0(0.4)$. Since we use a BCD-4221 carry-save redundant digit representation, for a precision of $t$ digits (one integer digit and $t - 1$ fractional digits) we have a maximum error of $\epsilon = 2 \cdot 10^{-(t-1)}$. Moreover, the estimations $\hat{w}$ are computed by truncations ($w - \hat{w} \le \epsilon$). This means that for given positive and negative selection constants $m^i_{\mathrm{pos}} > 0$ and $m^i_{\mathrm{neg}} < 0$, the following expressions must be true:

$$U^i_0(0.4) - m^i_{\mathrm{pos}} \ge \epsilon, \qquad L^i_1(0.8) \le m^i_{\mathrm{pos}},$$ (23)
$$L^i_0(0.4) \le m^i_{\mathrm{neg}}, \qquad U^i_{-1}(0.8) - m^i_{\mathrm{neg}} \ge \epsilon.$$ (24)

If we choose the selection constants as listed in Table 2, all conditions are fulfilled for a precision of $t = 2$ digits, and each selection function is reduced to two simple 2-digit comparators.
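Conditions (23) and (24) can be verified numerically for the constants of Table 2 with exact rational arithmetic. The sketch below evaluates the limits from (21a)-(21c) and (22a)-(22c) with $\rho = 8/9$ and $\epsilon = 0.2$ (that is, $t = 2$); the names `COMPONENTS`, `upper`, `lower`, and `check` are hypothetical helpers introduced for this illustration.

```python
from fractions import Fraction as F

RHO = F(8, 9)
EPS = F(2, 10)  # epsilon = 2 * 10^-(t-1) for t = 2

# i: (weight n, additive term, m_pos, m_neg); limits are (n*k +/- (RHO + term)) * d
COMPONENTS = {
    3: (5, 3, F(9, 10), F(-15, 10)),
    2: (2, 1, F(1, 10), F(-7, 10)),
    1: (1, 0, F(1, 10), F(-3, 10)),
}


def upper(n, term, k, d):
    """U^i_k(d) according to (21a)-(21c)."""
    return (n * k + RHO + term) * d


def lower(n, term, k, d):
    """L^i_k(d) according to (22a)-(22c)."""
    return (n * k - RHO - term) * d


def check(i):
    """True if the Table 2 constants of component i satisfy (23) and (24)."""
    n, term, m_pos, m_neg = COMPONENTS[i]
    d_min, d_max = F(4, 10), F(8, 10)
    return (upper(n, term, 0, d_min) - m_pos >= EPS
            and lower(n, term, 1, d_max) <= m_pos
            and lower(n, term, 0, d_min) <= m_neg
            and upper(n, term, -1, d_max) - m_neg >= EPS)
```

Calling `check(i)` for `i` in 1, 2, 3 confirms the conditions for all three components; the tightest margin is the one of $z^1$ on the negative side.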

As soon as one component $z^i$ of the quotient digit is computed, the corresponding multiple of the divisor must be subtracted from the partial residual. Each component $z^i$ can hold three different values, $z^i \in \{-1, 0, +1\}$, that are multiplied with the corresponding weighted divisor $n \cdot d$ with $n \in \{5, 2, 1\}$. In other words, $n \cdot d$ is either passed, negated, or reset. Passing and resetting the multiple of the divisor is performed at no extra cost. Negation is accomplished by bit inversion and adding +1 through the carry inputs of the carry-save adders in (19a), (19b), and (19c). In summary, this is a very fast operation and requires only two LUTs per digit.

Table 2: Constants for type4 quotient digit selection function.

| Component | $m^i_{\mathrm{pos}}$ | $m^i_{\mathrm{neg}}$ | $U^i_0(0.4) - m^i_{\mathrm{pos}}$ | $U^i_{-1}(0.8) - m^i_{\mathrm{neg}}$ |
| $z^3$ | 0.9 | −1.5 | 0.656 | 0.611 |
| $z^2$ | 0.1 | −0.7 | 0.656 | 0.611 |
| $z^1$ | 0.1 | −0.3 | 0.255 | 0.211 |

Similar to the type2 and type3 division, we implement the two MSDs of the type4 residual by using a nonredundant radix-2 representation, which requires fewer LUTs and results in a faster quotient digit selection function. The pseudocode for the digit recurrence iteration is shown in Algorithm 6, and the corresponding block diagram is depicted in Figure 7.
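The type4 recurrence (19a)-(19c) with the constants of Table 2 can be simulated in a few lines of Python. This is a behavioral sketch under simplifying assumptions: it compares exact rational residuals instead of truncated two-digit estimates, it ignores the carry-save representation, and the function names are hypothetical.

```python
from fractions import Fraction as F


def sel(v, m_pos, m_neg):
    """Constant selection function: returns +1, 0, or -1."""
    if v >= m_pos:
        return 1
    if v >= m_neg:
        return 0
    return -1


def type4_step(w, d):
    """One radix-10 iteration; returns (z_{j+1}, w[j+1])."""
    z3 = sel(10 * w, F(9, 10), F(-15, 10))
    v1 = 10 * w - z3 * 5 * d          # (19a)
    z2 = sel(v1, F(1, 10), F(-7, 10))
    v2 = v1 - z2 * 2 * d              # (19b)
    z1 = sel(v2, F(1, 10), F(-3, 10))
    return 5 * z3 + 2 * z2 + z1, v2 - z1 * d  # (19c)


def divide(w0, d, n):
    """Run n iterations starting from residual w0; d must lie in [0.4, 0.8)."""
    digits, w = [], w0
    for _ in range(n):
        z, w = type4_step(w, d)
        digits.append(z)
    return digits, w
```

After `n` steps, the signed digits `z_1 ... z_n` satisfy `w0 = d * sum(z_j * 10**-j) + w[n] * 10**-n`, and the final residual stays within the bound (20a).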

4.6. Proposed Decimal Fixed-Point Divider. For each type of division algorithm presented in the preceding sections, we have implemented a corresponding fixed-point divider, which is described in the following. For all fixed-point dividers, we expect the input operands to be normalized.

The block diagram of the type1 divider is depicted in Figure 8. First, the divisor multiples $2d, \ldots, 9d$ are precomputed. In the following cycles, $p = 16$ quotient digits (16 + 1 if the first digit is zero) are computed one by one, followed by an additional rounding digit. The quotient (Z) and the quotient +1 (ZP1) are computed on-the-fly by using two registers that are updated every cycle. The algorithm has the advantage that no additional slow decimal CPA is required to compute the incremented quotient. It is described more precisely in Section 5. The normalization of the result is also performed on-the-fly by locking the conversion when 16 + 1 quotient digits for the significand and the rounding digit have been computed; that is, in the worst case, when the first quotient digit is zero, 16 + 1 + 1 = 18 cycles are required. Furthermore, the sticky bit, which is required for rounding, is calculated. It is set to one whenever the final remainder is unequal to zero.

The architectures of the type2, type3.a, type3.b, and type4 dividers are similar in their structure but differ in their digit recurrence algorithm and scaling. The common block diagram is shown in Figure 9. First, the dividend and the divisor are rescaled,

$$d \in [0.1, 1.0) \Longrightarrow d' \in \begin{cases} [0.4, 1.0) & \text{for type2, type3.a/b,} \\ [0.4, 0.8) & \text{for type4,} \end{cases}$$ (25)

and five as well as two times the divisor are precomputed according to (9) and (10). Then the operands are recoded because the digit recurrence iteration uses a redundant digit representation with BCD-4221 carry-save adders.

The digit recurrence retires one signed quotient digit $z_k$ in each iteration. The on-the-fly conversion algorithm converts this signed-digit representation to the BCD-8421-coded


Figure 7: Block diagram of type4 digit recurrence.

Figure 8: Block diagram of the normalized type1 division. [Stages shown: normalized dividend and divisor; computation of the divisor multiples $2 \cdot d, 3 \cdot d, \ldots, 9 \cdot d$; 16 + 1 + 1 digit recurrence steps producing $z_k$ and $w[k]$; on-the-fly quotient computation and sticky bit determination; outputs Z, ZP1, and sb.]

(1) select
    $z^3_{j+1} = 1$ if $0.9 \le 10 w[j]$
    $z^3_{j+1} = 0$ if $-1.5 \le 10 w[j] < 0.9$
    $z^3_{j+1} = -1$ if $10 w[j] < -1.5$
    and compute $(v^1_s[j] + v^1_c[j]) = 10\,(w_s[j] + w_c[j]) - z^3_{j+1} \cdot 5d$
(2) select
    $z^2_{j+1} = 1$ if $0.1 \le v^1[j]$
    $z^2_{j+1} = 0$ if $-0.7 \le v^1[j] < 0.1$
    $z^2_{j+1} = -1$ if $v^1[j] < -0.7$
    and compute $(v^2_s[j] + v^2_c[j]) = (v^1_s[j] + v^1_c[j]) - z^2_{j+1} \cdot 2d$
(3) select
    $z^1_{j+1} = 1$ if $0.1 \le v^2[j]$
    $z^1_{j+1} = 0$ if $-0.3 \le v^2[j] < 0.1$
    $z^1_{j+1} = -1$ if $v^2[j] < -0.3$
    and compute $(w_s[j+1] + w_c[j+1]) = (v^2_s[j] + v^2_c[j]) - z^1_{j+1} \cdot d$

Algorithm 6: Pseudocode for the type4 digit recurrence.

Figure 9: Block diagram of the normalized type2, type3.a, type3.b, and type4 divisions. [Stages shown: normalized dividend and divisor; divisor scaling $d \in [0.1, 1) \rightarrow d' \in [0.4, 1)$ or $[0.4, 0.8)$; computation of the divisor multiples $2 \cdot d'$ and $5 \cdot d'$; BCD-8421 to BCD-4221 recoding; 16 + 1 + 1 digit recurrence steps producing $z_k$ and $w[k]$; on-the-fly quotient computation (ZM1′, Z′, ZP1′); determination of the sticky bit and the sign of the remainder; multiplexer; outputs Z, ZP1, R, and sb.]

result [20] and computes the quotient (Z), the quotient +1 (ZP1), and the quotient −1 (ZM1). This conversion is accomplished in every iteration step and does not need a slow CPA. The incremented and decremented quotients are required for rounding. Moreover, the quotient is also normalized on-the-fly by locking the conversion when 16 + 1 quotient digits (including the rounding digit) have been computed. The algorithm is described explicitly in Section 5 because it is accomplished together with the gradual underflow handling of the floating-point divider.


if ($\overline{\mathrm{qNaN}_X}$ and $\overline{\mathrm{sNaN}_X}$) and ($\mathrm{qNaN}_D$ or $\mathrm{sNaN}_D$) then
    $C'_X = C_D$, $C'_D = 0 \ldots 01$
    $q'_X = q_D$, $q'_D = 0 + \mathrm{bias} = 398$
else if ($\mathrm{qNaN}_X$ or $\mathrm{sNaN}_X$) then
    $C'_X = C_X$, $C'_D = 0 \ldots 01$
    $q'_X = q_X$, $q'_D = 0 + \mathrm{bias} = 398$
else (no changes)
    $C'_X = C_X$, $C'_D = C_D$
    $q'_X = q_X$, $q'_D = q_D$

Algorithm 7: Algorithm of NaN handling.

Furthermore, the sticky bit, which is required for rounding, is calculated. It is set to one whenever the final remainder is unequal to zero. Moreover, when the final remainder is negative, the quotient of the type2, type3.a, type3.b, and type4 dividers has to be adjusted by subtracting one LSD. This subtraction does not need another slow CPA because the quotient −1 (ZM1) has already been computed on-the-fly. The calculations of both the sticky bit and the sign bit require the reduction of the redundant remainder by a CPA. This CPA might be subdivided into multiple smaller CPAs to keep the latency low.

5. IEEE 754-2008 Floating-Point Division

The IEEE 754-2008 compliant decimal floating-point divider is an extension of the normalized fixed-point divider. The divider presented in this paper supports the interchange format IEEE 754-2008 decimal64 with DPD encoding, but it can be easily adapted to any other precision and exponent range.

The block diagram of the divider is depicted in Figure 10. The DFP division begins with the decoding of the dividend X and the divisor D and the extraction of their signs, significands, and exponents. In the following, the significands of the operands X ($c_X$) and D ($c_D$) are regarded as integers. Therefore, the corresponding exponents are $q_X$ and $q_D$.

According to [4], if one of the operands is a signaling NaN (sNaN) or a quiet NaN (qNaN), then the result is also a quiet NaN with the payload of the original NaN. Hence, in order to preserve the payload while using an unmodified divider, the NaN handling unit sets the NaN-holding operand as the dividend and resets the divisor to $1.0 \cdot 10^0$, as depicted in Algorithm 7.
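The operand swap performed by the NaN handling unit can be sketched in Python as follows. The `Operand` record and the function name `nan_preprocess` are hypothetical; the sketch merges the qNaN and sNaN flags into one `is_nan` field and only models the significand and exponent updates described above.

```python
from dataclasses import dataclass

BIAS = 398  # IEEE 754-2008 decimal64 exponent bias


@dataclass
class Operand:
    c: int                 # integer significand (or NaN payload)
    q: int                 # biased exponent
    is_nan: bool = False   # set for a quiet or signaling NaN


def nan_preprocess(x: Operand, d: Operand):
    """Return (dividend, divisor) after NaN handling."""
    one = Operand(c=1, q=0 + BIAS)  # divisor reset to 1.0 * 10^0
    if d.is_nan and not x.is_nan:
        # The NaN-holding divisor becomes the dividend.
        return Operand(d.c, d.q, True), one
    if x.is_nan:
        return x, one
    return x, d  # no changes
```

Dividing the NaN payload by one leaves the payload untouched, which is why the unmodified divider can be reused.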

The fixed-point dividers presented in Section 4 require normalized operands. Therefore, the number of leading zeros is counted for both operands, and the significands are normalized by barrel shifters. For performance reasons, the leading zeros counters exploit the FPGA's fast carry chains, as proposed in [23].

The exponent of a floating-point division is first estimated by the normalized quotient in fractional representation (QNF) and is updated iteratively in each digit recurrence step. Since most of the decimal floating-point numbers specified by IEEE 754-2008 have multiple representations of the same value, the result of a division might also have more than one correct representation, differing in the exponent. However, to obtain a unique and reproducible result in the case of such an ambiguity, IEEE 754-2008 defines the exponent of the result, which is called the preferred exponent (QP).

Both exponents QNF and QP are positive integers and are determined as follows:

$$\Delta q = (q_X - q_D), \qquad \Delta \mathrm{LZ} = (\mathrm{LZ}_X - \mathrm{LZ}_D),$$ (26)
$$\mathrm{QNF} = \Delta q - \Delta \mathrm{LZ} + \mathrm{offset1}, \qquad \mathrm{QP} = \Delta q + \mathrm{offset1}.$$ (27)

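The offset1 is a bias that ensures that both exponents are positive, since unsigned integer calculation is used in this paper. The bias is composed of

$$\begin{aligned} \mathrm{offset1} &= 783 \\ &= 398 \quad \text{(IEEE 754-2008 bias)} \\ &\quad + 369 \quad (q_{\max}) \\ &\quad + 15 \quad \text{(conversion fractional to integer)} \\ &\quad + 1 \quad \text{(normalization of the result)}. \end{aligned}$$ (28)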
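Equations (26)-(28) amount to a few integer additions, as the following Python sketch shows. The function name `exponents` and its argument names are hypothetical, and the sketch does not decide whether $q_X$ and $q_D$ are passed in biased or unbiased form, since only their difference enters the computation.

```python
BIAS = 398                        # IEEE 754-2008 decimal64 exponent bias
OFFSET1 = BIAS + 369 + 15 + 1     # composition according to (28)


def exponents(q_x: int, q_d: int, lz_x: int, lz_d: int):
    """Return (QNF, QP) according to (26) and (27)."""
    delta_q = q_x - q_d
    delta_lz = lz_x - lz_d
    qnf = delta_q - delta_lz + OFFSET1
    qp = delta_q + OFFSET1
    return qnf, qp
```

For equal exponents and equal leading-zero counts, both QNF and QP reduce to offset1.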

The normalized fixed-point division unit then computes $p = 16$ significand digits and one additional rounding digit. Additionally, the divider encompasses the on-the-fly conversion unit, which also detects and handles gradual underflow with zero delay overhead. One signed quotient digit $z_j$ is retired in each cycle, and the on-the-fly conversion algorithm converts this signed-digit representation into the BCD-8421-coded partial result [20]. The conversion is accomplished in every iteration step and does not need a slow CPA (see Algorithm 8). The partial quotient is stored in a register Z′. Moreover, two additional registers ZM1′ and ZP1′ are provided: ZM1′ is the partial quotient decremented by one least significant digit (LSD), and ZP1′ is the partial quotient incremented by one LSD. ZM1′ is selected when the final residual in the last iteration is negative, and ZP1′ is required by the following rounding operation. Furthermore, the exponent for the normalized significand in integer representation (QNI) is calculated (see (29)), and gradual underflow handling is performed at no extra cost (see Algorithm 8). The exponent QNI is computed by decrementing the exponent of the normalized significand in fractional representation (QNF) by one in each iteration step. The recurrence iteration terminates earlier in the case of gradual underflow. The gradual underflow signal (GUF) is asserted high when QNI = $q_{\min}$ and the calculated integer quotient is less than $10^p$:

$$\mathrm{QNI} = \begin{cases} q_{\min} & \text{if } \mathrm{GUF} = 1, \\ \mathrm{QNF} - (p - 1) & \text{if } z_1 > 0, \\ \mathrm{QNF} - p & \text{if } z_1 = 0. \end{cases}$$ (29)

The extended quotient (composed of the calculated quotient and the rounding digit) must be decremented by one if the final remainder is less than zero. The sticky bit (sb) is used for rounding and is asserted high whenever the remainder is unequal to zero.

Figure 10: Block diagram of the floating-point divider. [Blocks shown: DPD decoders; NaN handling; leading zeros count and QNF and QP calculations; dividend and divisor normalization; normalized fixed-point division; on-the-fly conversion with gradual underflow handling; exponent calculation; rounding unit; exception unit; DPD encoder.]

Moreover, the number of trailing zeros (TZ) is computed on-the-fly. This number indicates by how many digits the computed quotient might be shifted to the right without losing accuracy. The number of trailing zeros has to be counted only for Z′ because, if ZM1′ is selected, the residual will be unequal to zero and TZ = 0 anyway. The number of trailing zeros is used for the selection between the preferred exponent (QP) and the normalized exponent for integer representation (QNI), as described in the following paragraph.

The exponent calculation unit selects either QP or QNI. The preferred exponent should be selected whenever possible. This is only feasible when QNI < QP ≤ QNI + TZ, that is, when there is a sufficient number of trailing zeros to shift the significand to the right. Furthermore, the final exponent must satisfy the minimum and maximum exponent range. The algorithm of the exponent calculation is illustrated in Algorithm 9, where the value

$$\mathrm{offset2} = 783 - \mathrm{bias} = 783 - 398 = 385$$ (30)

is used to add the IEEE 754-2008 bias and to remove offset1 again (which was introduced in (27)).

QNI′ = QNF + 1
Z′ = ZM1′ = ZP1′ = 0
TZ′ = 0, $j = 0$
while (Z′ < $10^p$) and (QNI′ ≥ $q_{\min}$)
    Z′ = $10 \cdot$ Z′ $+\, z_{j+1}$ if $z_{j+1} \ge 0$
    Z′ = $10 \cdot$ ZM1′ $+\, (z_{j+1} + 10)$ if $z_{j+1} < 0$
    ZM1′ = $10 \cdot$ Z′ $+\, (z_{j+1} - 1)$ if $z_{j+1} > 0$
    ZM1′ = $10 \cdot$ ZM1′ $+\, (z_{j+1} + 9)$ if $z_{j+1} \le 0$
    ZP1′ = $10 \cdot$ Z′ $+\, (z_{j+1} + 1)$ if $z_{j+1} \ge -1$
    ZP1′ = $10 \cdot$ ZM1′ $+\, (z_{j+1} + 11)$ if $z_{j+1} < -1$
    TZ′ = TZ′ + 1 if $z_{j+1} = 0$
    TZ′ = 0 else
    QNI′ = QNI′ − 1
    $j = j + 1$
QNI = QNI′ + 1 (remove the impact of the rounding digit)
$R' = z_{j+1} - 1$ if remainder < 0
$R' = z_{j+1}$ else
ZM1″ = $\lfloor$ZM1′ $\cdot\, 10^{-1}\rfloor$
Z″ = $\lfloor$Z′ $\cdot\, 10^{-1}\rfloor$
ZP1″ = $\lfloor$ZP1′ $\cdot\, 10^{-1}\rfloor$
(Z, ZP1, $R$) = (ZM1″, Z″, $R' + 10$) if $R' < 0$
(Z, ZP1, $R$) = (Z″, ZP1″, $R'$) else
sb = 0 if remainder = 0
sb = 1 else
GUF = 1 if QNI = $q_{\min}$ and Z′ < $10^p$ and (sb = 1 or $R = 0$)
GUF = 0 else
if GUF = 1 then underflow
TZ = 0 if sb = 1
TZ = TZ′ else

Algorithm 8: On-the-fly conversion with gradual underflow handling.
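The register updates in the loop body of Algorithm 8 can be illustrated with a reduced Python model. The sketch covers only the three update rules for Z′, ZM1′, and ZP1′; it omits the exponent tracking, the trailing-zeros counter, the loop termination, and the final rounding-digit handling. The function name `on_the_fly` is hypothetical, and the sketch assumes that the first nonzero quotient digit is positive.

```python
def on_the_fly(digits):
    """Convert signed digits (each in -8..8) into (Z, Z - 1, Z + 1).

    Each register is updated by appending one nonnegative digit to either
    the old Z or the old ZM1, so no carry ever propagates.
    """
    z_reg = zm1 = zp1 = 0
    for z in digits:
        new_z = 10 * z_reg + z if z >= 0 else 10 * zm1 + (z + 10)
        new_zm1 = 10 * z_reg + (z - 1) if z > 0 else 10 * zm1 + (z + 9)
        new_zp1 = 10 * z_reg + (z + 1) if z >= -1 else 10 * zm1 + (z + 11)
        z_reg, zm1, zp1 = new_z, new_zm1, new_zp1
    return z_reg, zm1, zp1
```

In every branch the appended digit lies between 0 and 9, which is why the conversion needs only digit concatenation instead of a carry-propagate addition.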

Rounding is required when the number of significant digits of the quotient exceeds the length $p$ of a decimal word. The divider presented in this paper provides the five rounding modes as requested by [4]. Rounding either selects the integer quotient $Z_e$ or the incremented quotient $\mathrm{ZP1}_e = Z_e + 1$. As proven by Theorem A.1 in the appendix, rounding overflow cannot occur in DFP division. Rounding overflow would occur if $Z_e + 1$ overflows due to rounding. The selection of the quotient depends on the rounding mode, the rounding digit (R), the sticky bit (sb), and the least significant digit (LSD) of the quotient. The calculation of the round-up detection is summarized in Table 3.
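The round-up decision can be sketched as a small Python function. The sketch follows the usual IEEE 754-2008 semantics of the five rounding modes; the mode strings, the function name `round_up`, and the argument names are hypothetical, and `lsd_odd` stands for the least significant bit of the quotient's LSD.

```python
def round_up(mode: str, r: int, sb: bool, lsd_odd: bool, negative: bool) -> bool:
    """Return True if the incremented quotient has to be selected.

    r is the rounding digit (0..9), sb the sticky bit, lsd_odd tells whether
    the quotient's least significant digit is odd, negative is the result sign.
    """
    inexact = r > 0 or sb
    if mode == "ties_to_even":
        # Above the midpoint, or exactly at the midpoint with an odd LSD.
        return bool(r > 5 or (r == 5 and (sb or lsd_odd)))
    if mode == "ties_to_away":
        return r >= 5
    if mode == "toward_positive":
        return bool((not negative) and inexact)
    if mode == "toward_negative":
        return bool(negative and inexact)
    if mode == "toward_zero":
        return False
    raise ValueError(f"unknown rounding mode: {mode}")
```

Since the result is stored as sign and magnitude, rounding toward positive increments the magnitude only for positive results, and rounding toward negative only for negative ones.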

Finally, the result is encoded again, and the exception unit might assert six exception signals. These are division by zero, invalid operation, result is infinite, inexact, overflow, and underflow. Division by zero is asserted when the divisor equals zero but the dividend is unequal to zero. The operation is called invalid when both operands are zero, both operands are infinity, or any of the operands is either a signaling or a quiet NaN. The inexact flag is asserted when the rounding digit or the sticky bit is unequal to zero. Overflow and underflow are computed as described in the exponent calculation unit and are listed in Algorithms 8 and 9. The infinity exception is asserted when the result is greater than the largest number, that is, when either the overflow or the division by zero exception is asserted, or when the dividend is infinite while the divisor is a finite number.

Table 3: Round-up detection.

| Rounding mode | Round-up detection |
| Round Ties To Even | $(R > 5) + (R \equiv 5) \cdot \mathrm{sb} + (R \equiv 5) \cdot \overline{\mathrm{sb}} \cdot \mathrm{LSD}(0)$ |
| Round Ties To Away | $(R \ge 5)$ |
| Round Toward Positive | $\overline{\mathrm{sign}} \cdot ((R > 0) + \mathrm{sb})$ |
| Round Toward Negative | $\mathrm{sign} \cdot ((R > 0) + \mathrm{sb})$ |
| Round Toward Zero | 0 |

Legend: "+": logical OR; "·": logical AND.

offset2 = 783 − 398 = 385
if (QP ≤ QNI) or (QNI + TZ < QP) or (QP > $q_{\max}$) then
    (a) QNI > $q_{\max}$ ⇒ overflow
    (b) $q_{\min}$ ≤ QNI ≤ $q_{\max}$ ⇒
        $q_Z$ = QNI − offset2
        $Z_e$ = Z
        $\mathrm{ZP1}_e$ = ZP1
    (c) QNI < $q_{\min}$ ⇒ underflow
else
    (a) $q_{\min}$ ≤ QP ≤ $q_{\max}$ ⇒
        $q_Z$ = QP − offset2
        $Z_e$ = Z ≫ (QP − QNI)
        $\mathrm{ZP1}_e$ = ZP1 ≫ (QP − QNI)
    (b) QNI < $q_{\min}$ ⇒ underflow

Algorithm 9: Exponent calculation.

6. Implementation Results

All dividers are modeled using VHDL and are implemented for Xilinx Virtex-5 devices with speed grade −2 using Xilinx ISE 10.1. The post-place-and-route results of the fixed-point dividers are listed in Table 4.

All five types of fixed-point dividers require 19 cycles to perform a division. This includes 17 cycles to determine the quotient digits and the rounding digit, one additional cycle to normalize the result (when the first quotient digit is zero), and one cycle of latency to perform the on-the-fly conversion and the sticky bit determination. As expected, the type1 divider uses the most resources in terms of look-up tables (LUTs) and flip-flops (FFs). The type2, type3.a, and type3.b dividers consume fewer LUTs but require an additional five (type2) or two (type3.a and type3.b) BRAMs. However, comparing the delay of the three dividers leads to an unexpected result: contrary to decimal divider implementations in ASIC designs, on FPGA platforms the shift-and-subtract algorithm is the fastest. The reason is that the signal propagation on the FPGA's internal fast carry chains is faster than the interconnections between slices over the FPGA's general routing matrix. The type1 divider exploits these fast carry chains, whereas the type2, type3.a, type3.b, and type4 dividers only use the normal, slow slice interconnection resources. In ASIC implementations, for instance, the longest path of the type4 divider would be much shorter than the longest path of the type1 divider, but on FPGA architectures the situation is the opposite.

One of the fastest decimal fixed-point dividers on ASICs is the design of Lang and Nannarelli [14]. They minimize the critical path in the digit recurrence by using a fast quotient digit selection function based on binary carry-propagate adders that subtract fixed values from the estimation of the current residual. These fixed values depend on estimations of the divisor and are precomputed and stored in a ROM. Therefore, the latency of the ROM does not contribute to the critical path of the digit recurrence. The implementation of such a divider on FPGAs would perform poorly because the required carry-save adder with small error estimation cannot be implemented efficiently on the FPGA's slice structure. However, the concept of removing the ROM from the critical path is applied to the type3.a and type3.b dividers, and the corresponding postplace and route results are shown in Table 4. These results show that the type3.a divider utilizes a similar number of LUTs as the type2 divider, but its cycle time is much higher. The increased cycle time can be explained by the greater complexity of the multiplexers that select the divisor's multiples in the digit recurrence step. These multiplexers have a poor propagation delay because they are implemented as a tree with slow slice interconnections. By contrast, the type3.b divider applies dedicated fast multiplexers, which have a propagation delay of only one LUT instance and eight fast carry chains. This dedicated fast multiplexer speeds up the divider and reduces the maximum cycle time by one nanosecond. Unfortunately, the number of used LUTs increases dramatically, by approximately 1000 LUTs.

The comparison of the type2, type3.a, type3.b, and type4 dividers in terms of speed and area (number of occupied LUTs and FFs combined) shows that the type2 algorithm is the fastest and requires the least number of slices. Hence, there is no benefit in removing the ROM's latency from the critical path because the complexity is moved to the selection of the divisor's multiples. The type1 divider is only 5% faster than the type2 divider, but the speed is bought at a high price because the number of occupied slices is 75% higher. Therefore, the type2 divider shows a good tradeoff in terms of area and latency and is used for the floating-point divider presented in this paper. In the following paragraph, we compare the type2 implementation with other published fixed-point dividers.
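The derived rows of Table 4 follow from the cycle times and resource counts by simple arithmetic. The following sketch recomputes them; the dictionary and function names are illustrative assumptions, while the numbers are the values reported in Table 4.

```python
# Values as reported in Table 4; all names below are illustrative only.
CYCLES_PER_DIVISION = 19  # 17 digit cycles + 1 normalization + 1 conversion

cycle_time_ns = {"type1": 8.1, "type2": 8.5, "type3.a": 10.1,
                 "type3.b": 9.1, "type4": 12.1}
luts_and_ffs = {"type1": 3868, "type2": 2210, "type3.a": 2304,
                "type3.b": 3240, "type4": 2203}


def latency_ns(divider):
    """Overall latency of one division: cycle count times cycle time."""
    return CYCLES_PER_DIVISION * cycle_time_ns[divider]


def max_frequency_mhz(divider):
    """Maximum clock frequency implied by the cycle time."""
    return 1000.0 / cycle_time_ns[divider]


def relative_to_type2(table, divider):
    """Normalize a table entry to the corresponding type2 entry."""
    return table[divider] / table["type2"]


if __name__ == "__main__":
    for name in cycle_time_ns:
        print(name,
              round(latency_ns(name), 1),
              round(max_frequency_mhz(name)),
              round(relative_to_type2(luts_and_ffs, name), 2))
```

The speed advantage of type1 over type2 is the ratio of the cycle times (8.5/8.1, about 1.05), whereas the area ratio is 3868/2210, about 1.75, which is the tradeoff discussed above.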

Table 4: Results for normalized decimal fixed-point dividers, p = 16 + 1.

| | Type1 | Type2 | Type3.a | Type3.b | Type4 |
|---|---|---|---|---|---|
| Number of LUTs | 3595 | 1704 | 1749 | 2751 | 1846 |
| Number of FFs | 1234 | 1126 | 1262 | 1262 | 1031 |
| Number of LUTs and FFs combined | 3868 | 2210 | 2304 | 3240 | 2203 |
| Number of 36 k BRAMs | 0 | 5 | 2 | 2 | 0 |
| Number of LUTs, normalized to the type2 divider | 2.11 | 1.00 | 1.03 | 1.61 | 1.08 |
| Number of FFs, normalized to the type2 divider | 1.10 | 1.00 | 1.12 | 1.12 | 0.92 |
| Number of LUTs and FFs combined, normalized to the type2 divider | 1.75 | 1.00 | 1.04 | 1.47 | 1.00 |
| Cycle time (ns) | 8.1 | 8.5 | 10.1 | 9.1 | 12.1 |
| Overall latency (ns) (19 × cycle time) | 154 | 162 | 192 | 173 | 230 |
| Max. frequency (MHz) | 123 | 118 | 99 | 110 | 83 |
| Max. frequency, normalized to the type2 divider | 1.05 | 1.00 | 0.84 | 0.93 | 0.70 |

Four other FPGA-based dividers are presented by Ercegovac and McIlhenny [17, 18], Deschamps and Sutter [24], Zhang et al. [25], and Vestias and Neto [12]. These implementations are compared to our type2 divider in the following. The divider presented in [17, 18] is based on a digit recurrence algorithm that only requires limited-precision multipliers, adders, and LUTs. Furthermore, a compensation term is computed in the digit recurrence that compensates the error caused by this limited precision. The design is optimized for Virtex-5 FPGAs and has a good area characteristic (a 14-digit divider requires 1263 LUTs) but suffers from a high cycle time of 13.1 ns, which is more than 50% higher than that of the type1 or type2 design proposed in this paper. The reason for that high cycle time is (similar to the type4 divider) the long critical paths across many LUTs that cannot exploit the FPGA's fast carry chains. Therefore, the design would probably fit better on ASICs with dedicated routing. The dividers implemented in [24] apply two different digit recurrence algorithms with a redundancy of ρ = 1. The quotient digit selection function is very complex because it requires three MSDs of the divisor and four MSDs of the residual. The cycle times are high, but it must be taken into account that these designs are optimized for Virtex-4 FPGAs. Nevertheless, the number of used LUTs is very high. The divider presented in [25] applies a radix-100 digit recurrence algorithm with comprehensive prescaling of the dividend and divisor. It is optimized for Virtex-II Pro devices; hence, it is hard to compare it with the dividers presented in this paper. However, the critical path includes a decimal carry-save adder tree, two carry-propagate adder stages, and three multiplexer stages. It can therefore be assumed that this algorithm would also perform poorly on Virtex-5 devices, although two quotient digits are retired in each iteration step. Vestias and Neto [12] propose four different decimal dividers optimized for different speed and area tradeoffs. Contrary to the circuits presented in this paper, the dividers apply the Newton-Raphson algorithm. The multiplier used is sequential; therefore, many cycles per division are required. However, each divider has a higher delay and a greater LUT usage compared to our type2 implementation.

Table 5: Performance comparison of fixed-point dividers.

| Divider | Occupied area | Cycle time (ns) | Latency (ns) |
|---|---|---|---|
| Ercegovac and McIlhenny [17] (p = 14, Virtex-5) | 1263 (6, 2) LUTs | 13.1 | 197 |
| Type2 divider, equalized to p = 14 | 1692 (6, 2) LUTs | 8.5 | 136 |
| Deschamps and Sutter [24] (p = 16, Virtex-4): | | | |
| (i) Nonrestoring algorithm | 2974 (4, 1) LUTs | 21.4 | 386 |
| (ii) SRT-like algorithm | 3799 (4, 1) LUTs | 16.6 | 300 |
| Zhang et al. [25] (p = 16, Virtex-II Pro) | 3976 slices | 20.0 | 420 |
| Vestias and Neto [12] (p = 16, Virtex-4, Newton-Raphson): | | | |
| (i) Type A3 (0 DSPs, 118 cycles) | 2756 (4, 1) LUTs | 3.4 | 394 |
| (ii) Type A4 (0 DSPs, 90 cycles) | 3768 (4, 1) LUTs | 3.4 | 306 |
| (iii) Type A5 (7 DSPs, 112 cycles) | 2091 (4, 1) LUTs | 3.4 | 380 |
| (iv) Type A6 (10 DSPs, 86 cycles) | 2718 (4, 1) LUTs | 3.4 | 292 |
| Type2 divider, equalized to p = 16 | 1704 (6, 2) LUTs | 8.5 | 153 |

Table 5 lists the results of the dividers presented in [17, 24, 25] together with the results of the type2 divider. However, for a fair performance comparison, the precision of the type2 divider is equalized to match the number of calculated quotient digits of the corresponding divider. Furthermore, it should be noted that the dividers presented in [12, 17, 24, 25] do not provide rounding support.

For the floating-point implementation, the fixed-point divider of type2 is used because type1 and type3.b consume too many resources and type3.a and type4 are too slow. However, five additional BRAMs are required, but these are available in sufficient quantities on Virtex-5 devices. The corresponding postplace and route results for two configurations with different numbers of serialization stages are listed in Table 6.

Table 6: Postplace and route results of the floating-point divider based on type2.

| | Config. 1 | Config. 2 |
|---|---|---|
| Number of cycles per float division | 23 | 21 |
| Number of LUTs | 3231 | 3205 |
| Number of FFs | 1630 | 1377 |
| Number of LUTs and FFs combined | 3571 | 3566 |
| Cycle time (ns) | 8.5 | 9.1 |
| Overall latency (ns) | 195.5 | 191 |
| Max. frequency (MHz) | 118 | 110 |

A comparison with other implementations is complicated because no other FPGA-based implementations of decimal floating-point dividers have been published yet. Therefore, we can only compare the decimal divider with binary dividers implemented on the same FPGA. Table 7 lists postplace and route results of binary floating-point dividers for the double data format on a Virtex-5, provided by the Xilinx Core Generator [26]. The dividers also provide the exception signals overflow, underflow, invalid operation, and divide-by-zero. Unfortunately, the binary floating-point divider does not support gradual underflow because this feature increases the complexity. The divider with one cycle per division is an unpipelined and fully parallel design, and the divider that requires 55 cycles per division is a fully sequential design. The binary floating-point divider with 20 cycles per division is best suited for a comparison with the decimal floating-point divider because it requires a comparable number of cycles for one operation. Obviously, decimal arithmetic has a great overhead regarding the total number of used LUTs, but the cycle time and latency are similar. This resource overhead can be explained, on the one hand, by the inefficient representation of decimal values in binary logic and, on the other hand, by the more complex specification of the decimal standard IEEE 754-2008. For instance, decimal floating-point division requires additional encoders, decoders, and a normalization stage.

Table 7: Postplace and route results of a 64-bit binary floating-point divider (Core Gen).

| Cycles per division | 1 | 10 | 20 | 55 |
|---|---|---|---|---|
| No. of LUTs | 3161 | 592 | 425 | 394 |
| No. of FFs | 196 | 518 | 469 | 542 |
| No. of LUTs and FFs combined | 3232 | 917 | 675 | 675 |
| Cycle time (ns) | 153.7 | 20.8 | 8.9 | 6.8 |
| Overall latency (ns) | 153.7 | 208 | 178 | 374 |
| Max. frequency (MHz) | 6.5 | 48 | 112 | 147 |

7. Conclusion

In this paper we have presented five different radix-10 digit recurrence division algorithms. The type1 divider implements the simple shift-and-subtract algorithm. The type2 divider is based on a nonrestoring algorithm with a ROM-based quotient digit selection function. The type3.a and type3.b dividers are both similar to type2 but use fast binary carry-propagate adders in the quotient digit selection function. However, the type3.a divider uses LUT-based multiplexers, whereas the type3.b divider uses fast carry chain-based multiplexers. The type4 divider applies a new algorithm with constant digit selection functions. This type4 divider requires neither a ROM nor multiplexers to select multiples of the divisor.

We have shown that the shift-and-subtract algorithm, which has the worst latency in ASIC architectures, is the fastest design on FPGAs but uses the most FPGA resources in terms of LUTs. A good tradeoff between latency and area is the type2 architecture with a redundant carry-save representation of the residual, a radix-2 representation of the two MSDs, and a quotient digit selection function implemented in ROM. Furthermore, we have extended this fixed-point divider to a fully IEEE 754-2008 compliant decimal floating-point divider for the decimal64 data format, and finally we have shown implementation results of all dividers.

Appendix

Proofs

Theorem A.1 (rounding overflow). Let us consider a decimal floating-point division $Q = X/D$ with the signs $s_i$, the significands $c_i$, the exponents $q_i$, and the precision $p$:

$$(-1)^{s_Q} \cdot c_Q \cdot 10^{q_Q} = \frac{(-1)^{s_X} \cdot c_X \cdot 10^{q_X}}{(-1)^{s_D} \cdot c_D \cdot 10^{q_D}}. \tag{A.1}$$

Then rounding overflow, that is, an overflow that arises due to adding +1 to the least significant digit of the final result, cannot occur.

Proof. Assume to the contrary that there are a floating-point dividend $X$, a floating-point divisor $D$, and a proper rounding mode for which the decimal floating-point division produces a rounding overflow. Without loss of generality, we consider in the following only positive operands ($s_X = s_D = 0$), normalized significands ($c_X, c_D \ge 10^{p-1}$), and the exponents to be zero ($q_X = q_D = 0$). In the case of rounding overflow, the $p$ digits of the quotient must be all nines and the remainder must be unequal to zero, that is,

$$\underbrace{c_X}_{p\ \text{digits}} = c_Q \cdot c_D + \mathrm{rem} = \underbrace{9.9\cdots9}_{p\ \text{digits}} \cdot \underbrace{c_D}_{p\ \text{digits}} + \underbrace{\mathrm{rem}}_{p\ \text{digits}} \tag{A.2}$$

with

$$0 < \mathrm{rem} < c_D \cdot 10^{-p+1}. \tag{A.3}$$

Condition (A.2) is fulfilled for

$$c_D = 10^{p-1}, \qquad c_X = \underbrace{99\cdots9}_{p\ \text{digits}}, \tag{A.4}$$

but this violates condition (A.3) because the remainder is zero ($\mathrm{rem} = 0$); that is, the result is exact, no round-up is applied, and hence no rounding overflow occurs.


Otherwise, if the divisor is greater than $10^{p-1}$, it follows that

$$\begin{aligned}
\underbrace{c_X}_{p\ \text{digits}} &= (9.9\cdots9) \cdot c_D + \mathrm{rem} \\
&= (10 - 10^{-p+1}) \cdot c_D + \mathrm{rem} \\
&= 10 \cdot c_D + \underbrace{(-10^{-p+1} \cdot c_D + \mathrm{rem})}_{<0} \\
&= \underbrace{10 \cdot c_D + \mu}_{\ge p+1\ \text{digits}}
\end{aligned} \tag{A.5}$$

with $-10 < \mu < 0$. The left term of the equation has, by definition, a precision of $p$ digits, while the right term of the equation has a precision of more than $p$ digits. This contradiction finally proves that no rounding overflow can occur.
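For small precisions, the statement of Theorem A.1 can also be checked exhaustively. The following is a minimal sketch, assuming normalized integer significands and zero exponents as in the proof; the function name is a hypothetical helper and not part of the described hardware.

```python
def rounding_overflow_cases(p):
    """Return all (c_X, c_D) pairs whose p-digit truncated quotient is all
    nines while the remainder is nonzero, i.e. the precondition for a
    rounding overflow. Theorem A.1 claims that no such pair exists."""
    lo, hi = 10 ** (p - 1), 10 ** p       # normalized p-digit significands
    all_nines = hi - 1
    cases = []
    for c_d in range(lo, hi):
        for c_x in range(lo, hi):
            # Scale the dividend so that the integer quotient has exactly
            # p digits: the quotient lies in [1, 10) if c_x >= c_d and in
            # (0.1, 1) otherwise.
            scale = 10 ** (p - 1) if c_x >= c_d else 10 ** p
            digits, rem = divmod(c_x * scale, c_d)
            if digits == all_nines and rem != 0:
                cases.append((c_x, c_d))
    return cases


if __name__ == "__main__":
    for p in (1, 2, 3):
        print(p, rounding_overflow_cases(p))
```

The only operand pair that produces an all-nines quotient is the exact case (A.4), which the sketch excludes because its remainder is zero.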

References

[1] M. Baesler, S. Voigt, and T. Teufel, "FPGA implementations of radix-10 digit recurrence fixed-point and floating-point dividers," in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig '11), pp. 13–19, IEEE Computer Society, Los Alamitos, CA, USA, December 2011.

[2] IEEE Task P754, ANSI/IEEE 754-1985, Standard for Binary Floating-Point Arithmetic, New York, NY, USA, August 1985.

[3] M. F. Cowlishaw, "Decimal floating-point: algorism for computers," in Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH '03), pp. 104–111, IEEE Computer Society, Washington, DC, USA, June 2003.

[4] IEEE Task P754, IEEE 754-2008, Standard for Floating-Point Arithmetic, New York, NY, USA, August 2008.

[5] ANSI/IEEE, ANSI/IEEE Std 854-1987: An American National Standard, IEEE Standard for Radix-Independent Floating-Point Arithmetic, New York, NY, USA, October 1987.

[6] A. Y. Duale, M. H. Decker, H. G. Zipperer, M. Aharoni, and T. J. Bohizic, "Decimal floating-point in z9: an implementation and testing perspective," IBM Journal of Research and Development, vol. 51, no. 1-2, pp. 217–227, 2007.

[7] C. F. Webb, "IBM z10: the next-generation mainframe microprocessor," IEEE Micro, vol. 28, no. 2, pp. 19–29, 2008.

[8] L. Eisen, J. W. Ward, H. W. Tast et al., "IBM POWER6 accelerators: VMX and DFU," IBM Journal of Research and Development, vol. 51, no. 6, pp. 663–683, 2007.

[9] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd, "Power7: IBM's next-generation server processor," IEEE Micro, vol. 30, no. 2, pp. 7–15, 2010.

[10] S. F. Oberman and M. J. Flynn, "Division algorithms and implementations," IEEE Transactions on Computers, vol. 46, no. 8, pp. 833–854, 1997.

[11] L. K. Wang and M. J. Schulte, "A decimal floating-point divider using Newton-Raphson iteration," Journal of VLSI Signal Processing Systems, vol. 49, no. 1, pp. 3–18, 2007.

[12] M. Vestias and H. Neto, "Revisiting the Newton-Raphson iterative method for decimal division," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '11), pp. 138–143, IEEE Computer Society Press, September 2011.

[13] H. Nikmehr, B. Phillips, and C. C. Lim, "Fast decimal floating-point division," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 9, pp. 951–961, 2006.

[14] T. Lang and A. Nannarelli, "A radix-10 digit-recurrence division unit: algorithm and architecture," IEEE Transactions on Computers, vol. 56, no. 6, pp. 727–739, 2007.

[15] A. Vazquez, E. Antelo, and P. Montuschi, "A radix-10 SRT divider based on alternative BCD codings," in Proceedings of the 25th IEEE International Conference on Computer Design (ICCD '07), pp. 280–287, IEEE Computer Society Press, Los Alamitos, CA, USA, October 2007.

[16] E. Schwarz and S. Carlough, "Power6 decimal divide," in Proceedings of the 18th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '07), pp. 128–133, IEEE Computer Society, July 2007.

[17] M. D. Ercegovac and R. McIlhenny, "Design and FPGA implementation of radix-10 algorithm for division with limited precision primitives," in Proceedings of the 42nd Asilomar Conference on Signals, Systems and Computers (ASILOMAR '08), pp. 762–766, IEEE Computer Society, Pacific Grove, Calif, USA, October 2008.

[18] M. D. Ercegovac and R. McIlhenny, "Design and FPGA implementation of radix-10 combined division/square root algorithm with limited precision primitives," in Proceedings of the 44th Asilomar Conference on Signals, Systems and Computers (Asilomar '10), pp. 87–91, IEEE Computer Society, Pacific Grove, Calif, USA, November 2010.

[19] F. Y. Busaba, C. A. Krygowski, W. H. Li, E. M. Schwarz, and S. R. Carlough, "The IBM z900 decimal arithmetic unit," in Proceedings of the 35th Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1335–1339, IEEE Computer Society, November 2001.

[20] M. Ercegovac and T. Lang, Division and Square Root: Digit-Recurrence Algorithms and Implementations, Kluwer Academic Publishers, Norwell, Mass, USA, 1994.

[21] M. Baesler, S. O. Voigt, and T. Teufel, "A decimal floating-point accurate scalar product unit with a parallel fixed-point multiplier on a Virtex-5 FPGA," International Journal of Reconfigurable Computing, vol. 2010, Article ID 357839, 13 pages, 2010.

[22] M. Baesler, S. O. Voigt, and T. Teufel, "A radix-10 digit recurrence division unit with a constant digit selection function," in Proceedings of the 28th IEEE International Conference on Computer Design (ICCD '10), pp. 241–246, IEEE Computer Society, Los Alamitos, CA, USA, October 2010.

[23] M. Baesler, S. O. Voigt, and T. Teufel, "An IEEE 754-2008 decimal parallel and pipelined FPGA floating-point multiplier," in Proceedings of the 20th International Conference on Field Programmable Logic and Applications (FPL '10), pp. 489–495, IEEE Computer Society, Washington, DC, USA, September 2010.

[24] J. P. Deschamps and G. Sutter, "Decimal division: algorithms and FPGA implementations," in Proceedings of the 6th Southern Programmable Logic Conference (SPL '10), pp. 67–72, IEEE Computer Society, March 2010.

[25] Y. Zhang, D. Chen, L. Chen et al., "Design and implementation of a radix-100 decimal division," in Proceedings of the IEEE Symposium on Circuit and System (ISCAS '09), Taipei, Taiwan, May 2009.

[26] Xilinx, Xilinx LogiCORE Floating-Point Operator v4.0, April 2008.


the type3.b divider implements a faster dedicated multiplexer that exploits the FPGA's internal carry chains.

The quotient digit selection function of the last divider (type4) requires neither a ROM nor a multiplexer. It is characterized by divisor scaling (0.4 ≤ divisor < 0.8) and a signed-digit redundant quotient calculation with a redundancy of ρ = 8/9. The quotient digit is decomposed into three components having the values {−5, 0, 5}, {−2, 0, 2}, and {−1, 0, 1}. The components are computed one by one, whereby the digit selection function is constant; that is, the selection constants do not depend on the divisor's value. Similar to the type2, type3.a, and type3.b dividers, the type4 divider uses a carry-save representation for the residual with a nonredundant radix-2 representation of the two MSDs.
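That the three component sets suffice can be illustrated by enumeration. The sketch below assumes that a redundancy of ρ = 8/9 corresponds to the signed digit set {−8, …, 8} (ρ = a/(r − 1) with a = 8 and radix r = 10); the names are illustrative and not taken from the design.

```python
from itertools import product

# Component value sets of the type4 quotient digit decomposition.
COMPONENT_SETS = [(-5, 0, 5), (-2, 0, 2), (-1, 0, 1)]


def decompositions():
    """Map every reachable quotient digit to the component triples
    (five-part, two-part, one-part) that sum to it."""
    table = {}
    for combo in product(*COMPONENT_SETS):
        table.setdefault(sum(combo), []).append(combo)
    return table


if __name__ == "__main__":
    for digit, combos in sorted(decompositions().items()):
        print(digit, combos)
```

Every digit from −8 to 8 is reachable, and some digits, such as 3 = 5 − 2 + 0 = 0 + 2 + 1, have more than one decomposition.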

The fixed-point division algorithms are implemented and analyzed on a Virtex-5 FPGA. Finally, the type2 divider, which shows the best tradeoff in area and delay, is extended to a floating-point divider that is fully IEEE 754-2008 compliant for the decimal64 data format, including gradual underflow handling and all required rounding modes.

The architectures of the type1, type2, and type4 dividers have already been published in [1]. However, this paper gives a more detailed description of the previous research and introduces two new dividers, which fill the design gap between the type1 and type2 dividers. These two dividers represent two extremes: the type1 divider implements a restoring quotient digit selection (QDS) function that requires nine decimal carry-propagate adders (CPAs) of full precision, whereas the nonrestoring QDS function of the type2 divider is implemented fully by means of a ROM with limited precision. In comparison, the new dividers implement nonrestoring QDS functions that are based on fast binary CPAs with limited precision.

The outline of this paper is as follows. Section 2 motivates the use of decimal floating-point arithmetic and its advantage compared to binary floating-point arithmetic. The underlying decimal floating-point standard IEEE 754-2008 is introduced in Section 3. The digit recurrence division algorithms as well as the five different architectures of fixed-point dividers are presented in Section 4. These fixed-point dividers are extended to a decimal floating-point divider in Section 5. Postplace and route results are presented in Section 6, and finally, in Section 7, the main contributions of this paper are summarized.

2. Decimal Arithmetic

Since its approval in 1985, the binary floating-point standard IEEE 754-1985 [2] has been the most widely used implementation of floating-point arithmetic and the dominant floating-point standard for all computers. In contrast to binary arithmetic, decimal units are more complex, require more area, and are more expensive, and the simple binary coded decimal (BCD) data format has a storage overhead of approximately 20%. Thus, at the time of approval, the use of binary in preference to decimal floating-point arithmetic was justified by the better efficiency.

Most people in the world think in decimal arithmetic. These decimal numbers must be converted to binary numbers when using a computer. However, some common finite numbers can only be approximated by binary floating-point numbers. The decimal number 0.1, for example, has a periodic binary expansion, $0.1_{10} = 0.000\overline{1100}_2$. It cannot be represented exactly in a binary floating-point arithmetic with finite precision, and the conversion causes rounding errors.
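The conversion error can be reproduced in software. The following sketch uses Python's standard `fractions` and `decimal` modules; the helper `binary_fraction_bits` is an illustrative assumption, not part of the paper.

```python
from decimal import Decimal
from fractions import Fraction


def binary_fraction_bits(value, n):
    """Return the first n binary digits after the point of 0 <= value < 1."""
    value = Fraction(value)
    bits = []
    for _ in range(n):
        value *= 2
        bit = int(value >= 1)
        bits.append(bit)
        value -= bit
    return bits


if __name__ == "__main__":
    # The expansion of 1/10 repeats the pattern 1100 after three zeros.
    print(binary_fraction_bits(Fraction(1, 10), 12))
    # The binary64 number nearest to 0.1 is not exactly 1/10.
    print(Fraction(0.1) == Fraction(1, 10))
    print(Decimal(0.1))  # exact decimal value of the stored binary number
    # The classic symptom, and its absence in decimal arithmetic.
    print(0.1 + 0.2 == 0.3)
    print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))
```

In binary64 arithmetic the sum 0.1 + 0.2 differs from 0.3, whereas the decimal computation is exact.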

As a consequence, binary floating-point arithmetic cannot be used for any calculations that do not tolerate conversion errors between decimal and binary numbers. These are, for instance, financial and business applications, some of which are even required by law to use decimal arithmetic [3]. Therefore, commercial applications often use nonstandardized software to perform decimal floating-point arithmetic. However, these software implementations are usually 100 to 1000 times slower than equivalent binary floating-point operations in hardware [3].

Because of their increasing importance, specifications for decimal floating-point arithmetic have been added to the IEEE 754-2008 standard for floating-point arithmetic [4], which was approved in 2008 and offers a more profound specification than the former standard for radix-independent floating-point arithmetic, IEEE 854-1987 [5]. Therefore, new efficient algorithms have to be investigated, and providing hardware support for decimal arithmetic is becoming more and more a topic of interest.

IBM has responded to this market demand and integrates decimal floating-point arithmetic in recent processor architectures such as z9 [6], z10 [7], Power6 [8], and Power7 [9]. The Power6 is the first microprocessor that implements the IEEE 754-2008 decimal floating-point format fully in hardware, while the earlier released z9 already supports decimal floating-point operations but implements them mainly in millicode. Nevertheless, the Power6 decimal floating-point unit is as small as possible and is optimized for low cost. It reuses registers from the binary floating-point unit, and the computing unit mainly consists of a wide decimal adder. Thus, its performance is rather low. Other floating-point operations such as multiplication and division are based on this adder and are performed sequentially. The decimal floating-point units of the z10 and Power7 are designed similarly to that of the Power6 [7, 9].

3. IEEE 754-2008

The floating-point standard IEEE 754-2008 [4] has revised and merged the IEEE 754-1985 standard for binary floating-point arithmetic [2] and the IEEE 854-1987 standard for radix-independent floating-point arithmetic [5]. As a consequence, the choice of radices has been focused on two formats: binary and decimal. In this paper we consider only the decimal data format. A decimal number is defined by the triple consisting of the sign ($s$), significand ($c$), and exponent ($q$):

$$x = (-1)^s \cdot c \cdot 10^q \tag{1}$$

with $s \in \{0, 1\}$, $c \in [0, 10^p - 1] \cap \mathbb{N}$, and $q \in [q_{\min}, q_{\max}] \cap \mathbb{Z}$. IEEE 754-2008 uses two different designators for the exponent ($q$ and $e$) as well as for the significand ($c$ and $m$). The exponent $e$ is applied when the significand is regarded as an integer digit and fraction field, denoted by $m$. The exponent $q$ is applied when the significand is regarded as an integer, denoted by $c$. The relation is given through

$$e = q + p - 1, \qquad m = c \cdot b^{-p+1}, \tag{2}$$

where $b$ is the radix ($b = 10$ for the decimal formats). In this paper we exclusively use the integer representation $c$ with the exponent $q$.

Figure 1: Decimal interchange formats [4]. The encoding consists of the sign $s$ (1 bit), the combination field $G$ ($w + 5$ bits), and the trailing significand field ($t = J \cdot 10$ bits).

Table 1: Interchange format parameters [4].

| Parameter | dec32 | dec64 | dec128 |
|---|---|---|---|
| $k$: storage width (bits) | 32 | 64 | 128 |
| $p$: precision (digits) | 7 | 16 | 34 |
| $q_{\min}$: min. exponent | −101 | −398 | −6176 |
| $q_{\max}$: max. exponent | 90 | 369 | 6111 |
| Bias = $E - q$ | 101 | 398 | 6176 |
| $s$: sign bit | 1 | 1 | 1 |
| $w + 5$: combination field (bits) | 11 | 13 | 17 |
| $t$: trailing significand field (bits) | 20 | 50 | 110 |

Unlike the binary floating-point format, a decimal floating-point number is not necessarily normalized. This leads to redundancy, and a decimal number might have multiple representations. The set of representations is called the floating-point number's cohort. For example, the numbers $123 \cdot 10^{0}$, $1230 \cdot 10^{-1}$, and $12300 \cdot 10^{-2}$ are all members of the same cohort. More precisely, if a number has $n \le p$ significant digits, the number of representations is $p - n + 1$.
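The size of a cohort can be illustrated with a few lines of code. The sketch below is an assumption-laden illustration: it ignores the exponent limits $q_{\min}$ and $q_{\max}$, excludes zero, and uses the hypothetical function name `cohort`.

```python
def cohort(c, q, p):
    """List all (significand, exponent) pairs that represent c * 10**q
    with at most p significand digits (exponent range not checked)."""
    if c <= 0:
        raise ValueError("this sketch only handles positive significands")
    # Strip trailing zeros to obtain the member with the largest exponent.
    while c % 10 == 0:
        c //= 10
        q += 1
    members = []
    # Append zeros until the significand would exceed p digits.
    while c < 10 ** p:
        members.append((c, q))
        c *= 10
        q -= 1
    return members


if __name__ == "__main__":
    # decimal32 has p = 7; the value 123 has n = 3 significant digits.
    for member in cohort(123, 0, 7):
        print(member)
```

For p = 7 and n = 3 the sketch yields p − n + 1 = 5 representations, from 123 · 10^0 down to 1230000 · 10^−4.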

IEEE 754-2008 defines three decimal interchange formats (decimal32, decimal64, and decimal128) of fixed width 32, 64, and 128 bits. As depicted in Figure 1, a floating-point number is encoded by three fields: the sign bit $s$, the combination field $G$, and the trailing significand field. The combination field encodes whether the number is finite, infinite, or not a number (NaN). Furthermore, in the case of finite numbers, the combination field comprises the biased exponent ($E = q + \text{bias}$) and the most significant digit (MSD) of the significand. The remaining $3 \cdot J$ digits are encoded in the trailing significand field of width $t = J \cdot 10$. The trailing significand can be implemented either as a binary integer or as a densely packed decimal (DPD) number [4]. Binary encoding makes software implementations easier, whereas DPD encoding is favored by hardware implementations, as is the case in this paper.

The encoding parameters for the three fixed-width interchange formats are summarized in Table 1. This paper focuses on the data format decimal64 with a DPD-coded significand. DPD encodes three decimal digits (four bits each) into a declet (10 bits) and vice versa [4]. It results in a storage overhead of only 0.343% per digit.
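Both overhead figures quoted in this paper (about 20% for BCD and 0.343% for DPD) follow from comparing the bits spent per digit with the information content of a decimal digit, log2(10) ≈ 3.32 bits. A minimal sketch, with an illustrative function name:

```python
import math

# Information content of one decimal digit in bits.
BITS_PER_DIGIT_IDEAL = math.log2(10)


def storage_overhead_percent(bits, digits):
    """Overhead of storing `digits` decimal digits in `bits` bits,
    relative to the information-theoretic minimum."""
    return (bits / digits / BITS_PER_DIGIT_IDEAL - 1.0) * 100.0


if __name__ == "__main__":
    print("BCD:", round(storage_overhead_percent(4, 1), 1), "%")
    print("DPD:", round(storage_overhead_percent(10, 3), 3), "%")
```

BCD spends 4 bits on one digit, whereas a DPD declet spends 10 bits on three digits, which explains the large difference.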

IEEE 754-2008 defines five rounding modes. These are two modes to the nearest (round ties to even and round ties to away) and three directed rounding modes (round toward positive, round toward negative, and round toward zero) [4]. As floating-point operations are obtained by first performing the exact operation in the set of real numbers and then mapping the exact result onto a floating-point number, rounding is required whenever the significant digits cannot all be placed in a single word of length $p$. Moreover, inexact, underflow, or overflow exceptions are signaled when necessary.
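The behavior of the five rounding modes can be illustrated with Python's standard `decimal` module. The mapping of the IEEE 754-2008 mode names to the module's constants below is our reading of the `decimal` documentation, and the helper name is an illustrative assumption.

```python
from decimal import (Decimal, ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR,
                     ROUND_HALF_EVEN, ROUND_HALF_UP)

# Assumed correspondence between IEEE 754-2008 modes and decimal constants.
MODES = {
    "roundTiesToEven": ROUND_HALF_EVEN,
    "roundTiesToAway": ROUND_HALF_UP,      # ties go away from zero
    "roundTowardPositive": ROUND_CEILING,
    "roundTowardNegative": ROUND_FLOOR,
    "roundTowardZero": ROUND_DOWN,
}


def round_to_integer(value, mode_name):
    """Round a decimal string to an integer using the named mode."""
    return Decimal(value).quantize(Decimal("1"), rounding=MODES[mode_name])


if __name__ == "__main__":
    for name in MODES:
        print(name, round_to_integer("2.5", name), round_to_integer("-2.5", name))
```

The tie cases 2.5 and −2.5 distinguish all five modes: the two nearest modes differ only on ties, and the directed modes differ in how they treat the sign.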

4. Decimal Fixed-Point Division

Oberman and Flynn [10] distinguish five different classes of division algorithms: digit recurrence, functional iteration, very high radix, table look-up, and variable latency, whereby many practical algorithms are combinations of multiple classes. Compared to binary arithmetic, decimal division is more complex. Currently, there are only a few publications concerning radix-10 division.

Wang and Schulte [11] describe a decimal divider based on the Newton-Raphson approximation of the reciprocal. The latency of a decimal Newton-Raphson approximation directly depends on the latency of the decimal multiplier. A pipelined multiplier has the advantage that more than one division operation can be processed in parallel; otherwise, the efficiency is poor. However, the algorithm lacks remainder calculation, and the rounding is more complex. The first FPGA-based decimal Newton-Raphson dividers are presented in [12]. The dividers as well as the underlying multipliers are sequential; hence, these dividers have a high latency.

Digit recurrence division is the most widely implemented class of division algorithms. It is an iterative algorithm with linear convergence; that is, a fixed number of quotient digits is retired in every iteration step. Compared to radix-2 arithmetic, radix-10 digit recurrence division is more complex because, on the one hand, decimal logic is less efficient by itself and, on the other hand, the range of the quotient digit selection (QDS) function comprises a larger digit set. Therefore, the performance of the digit recurrence divider depends on the choice of the QDS function and the implementation of decimal logic.

Nikmehr et al. [13] select quotient digits by comparing the truncated residual with limited-precision multiples of the divisor. Lang and Nannarelli [14] replace the divisor's multiples by comparative values obtained from a look-up table and decompose the quotient digit into a radix-2 digit and a radix-5 digit in such a way that only five and two times the divisor are required. Vazquez et al. [15] take a different approach: the selection constants in the QDS function are obtained from truncated multiples of the divisor, avoiding look-up tables. Therefore, the multiples are computed on-the-fly. Moreover, the digit recurrence iteration implements a slow carry-propagate adder, but an estimation of the residual is computed to make the determination of the quotient digits independent of this carry-propagate adder. The decimal divider of the Power6 microprocessor [16] uses extensive prescaling to bound the divisor to be greater than or equal to 1.0 but lower than 1.1112 in order to simplify the digit selection function. However, this prescaling is very costly: it requires a 2-digit multiply, and it needs six cycles overall on each operand. Furthermore, the digit selection function still requires a look-up table.

The first decimal fixed-point divider designed for FPGAs applies a digit recurrence algorithm and is proposed by Ercegovac and McIlhenny [17, 18]. However, it has a poor cycle time because it does not fit well on the slice structure of FPGAs, but it would probably fit well into ASIC designs with dedicated routing.

4.1. Digit Recurrence Division. In the following, we use different decimal number representations. A decimal number $B$ is called BCD-$\beta_0\beta_1\beta_2\beta_3$ coded when $B$ can be expressed by

$$B = \sum_{i=0}^{p-1} B_i \cdot 10^i \quad \text{with} \quad B_i = \sum_{k=0}^{3} B_{i,k} \cdot \beta_k, \quad B_{i,k} \in \{0, 1\}. \tag{3}$$

A number representation is called redundant if one or more digits have multiple representations.
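The weighted coding in (3) and the notion of redundancy can be illustrated with a minimal sketch. The helper names `digit_value` and `codes_per_digit` are hypothetical, and the weight tuples are written with the leftmost bit first, which is an assumption about bit ordering rather than something fixed by the text.

```python
from itertools import product

def digit_value(bits, weights):
    """Value of one 4-bit digit: the sum of B_{i,k} * beta_k from (3)."""
    return sum(b * w for b, w in zip(bits, weights))

def codes_per_digit(weights):
    """Map every decimal digit 0..9 to all 4-bit patterns that encode it."""
    table = {value: [] for value in range(10)}
    for bits in product((0, 1), repeat=4):
        value = digit_value(bits, weights)
        if value in table:          # patterns above 9 are not decimal digits
            table[value].append(bits)
    return table

BCD_8421 = (8, 4, 2, 1)   # conventional BCD weights
BCD_4221 = (4, 2, 2, 1)   # weights of the coding used for the carry-save residual

codes_8421 = codes_per_digit(BCD_8421)
codes_4221 = codes_per_digit(BCD_4221)
```

Under these assumptions every digit has exactly one BCD-8421 pattern, whereas several digits (for example 2, which is either 0100 or 0010) have two BCD-4221 patterns, so BCD-4221 is redundant in the sense defined above.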

Moreover, we consider a division $z = x/d$ in which the decimal dividend $x$ and the decimal divisor $d$ are positive and normalized fractional numbers of precision $p$, that is, $0.1 \le x, d < 1$. The quotient $z$ and the remainder rem are calculated as follows:

$$x = z \cdot d + \mathrm{rem}, \quad \mathrm{rem} \le d \cdot \mathrm{ulp}, \quad \mathrm{ulp} = 10^{-p}. \tag{4}$$

Then the radix-10 digit recurrence is implemented by

$$w[j+1] = 10\,w[j] - z_{j+1} \cdot d, \quad j = 0, 1, \ldots, p, \tag{5}$$

with the initial residual $w[0] = x/10$ and with the quotient digit calculated by the selection function

$$z_{j+1} = \mathrm{SEL}(10\,w[j], d) \quad \text{with} \quad z = z_1 z_2 z_3 \cdots z_p = \sum_{j=0}^{p-1} z_{j+1} \cdot 10^{-j}. \tag{6}$$

Digit recurrence division is subdivided into the classes of restoring and nonrestoring algorithms, which differ in the quotient digit selection (QDS) function and the dynamic range of the residual. A restoring divider selects the next nonnegative quotient digit $0 \le z_{j+1} \le 9$ such that the next partial residual is as small as possible but still nonnegative. The IBM z900 architecture, for instance, implements such a decimal restoring divider [19]. By contrast, the QDS function of a decimal nonrestoring divider uses a digit set that is positive as well as negative, $-a \le z_{j+1} \le a$, and the partial residual $w[j+1]$ might also be negative.

One advantage of nonrestoring division is the enhanced performance, since estimates of limited precision ($\hat{w}[j] \approx w[j]$ and $\hat{d} \approx d$) might be used in the QDS function $\mathrm{SEL}(10\,\hat{w}[j], \hat{d})$. This performance gain is achieved by using a redundant digit set $-a \le z_j \le a$, which defines a redundancy factor

$$\rho = \frac{a}{10 - 1} = \frac{a}{9}. \tag{7}$$

Then, in each iteration step, the quotient digit $z_{j+1}$ should be selected such that the next residual $w[j+1]$ is bounded by

$$-\rho d \le w[j+1] \le \rho d, \tag{8}$$

which is called the convergence condition [20].

This paper presents five different decimal fixed-point dividers that are described in the following. The first divider (type1) is a restoring divider, whereas the four others (type2, type3.a, type3.b, and type4) are nonrestoring dividers.

4.2. Type1 QDS Function. The QDS function of the type1 algorithm is very simple. Nine multiples of the divisor ($1d, \ldots, 9d$) are subtracted in parallel from the residual by carry-propagate adders (CPAs), and the smallest nonnegative result is selected. The CPAs exploit the FPGA's internal fast carry logic, as described in [21].

The nine multiples are precomputed and are composed of the multiples $1d$, $2d$, $5d$, $10d$ and their negatives (10's complement), which requires at most one additional CPA per multiple. The multiples $2d$ and $5d$ can be easily computed by digit recoding and constant shift operations:

$$(X)_{\text{BCD-5421}} \ll 1 \equiv (X \cdot 2)_{\text{BCD-8421}}, \tag{9}$$

$$(X)_{\text{BCD-8421}} \ll 3 \equiv (X \cdot 5)_{\text{BCD-5421}}, \tag{10}$$

where (9) is read as follows: a BCD-5421-coded number $X$ left-shifted by one bit is equivalent to the corresponding BCD-8421-coded number multiplied by two. In a similar fashion, we obtain a multiplication by five using (10).

The implementation of the type1 digit recurrence divider is characterized by a high area use due to the utilization of nine parallel CPAs. The corresponding algorithm of the type1 digit recurrence is shown in Algorithm 1.
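The shift identities (9) and (10) can be checked in software with a minimal sketch. The helpers `encode` and `decode` are hypothetical; they pack one decimal digit per 4-bit nibble (least significant digit in the lowest nibble) and use the greedy code for each weight set, which is an assumption about the canonical BCD-5421 code.

```python
W8421 = (8, 4, 2, 1)
W5421 = (5, 4, 2, 1)

def encode(n, weights):
    """Pack the decimal digits of n into 4-bit nibbles using the given weights."""
    word, pos = 0, 0
    while True:
        digit, nibble = n % 10, 0
        for w in weights:               # greedy choice, leftmost weight first
            nibble <<= 1
            if digit >= w:
                nibble |= 1
                digit -= w
        word |= nibble << (4 * pos)
        n //= 10
        pos += 1
        if n == 0:
            return word

def decode(word, weights):
    """Interpret every nibble of word as one weighted decimal digit."""
    total, scale = 0, 1
    while word:
        nibble = word & 0xF
        digit = sum(w for k, w in enumerate(weights) if (nibble >> (3 - k)) & 1)
        total += digit * scale
        scale *= 10
        word >>= 4
    return total

def times2(n):
    """Equation (9): a BCD-5421 word shifted left by one bit, read as BCD-8421."""
    return decode(encode(n, W5421) << 1, W8421)

def times5(n):
    """Equation (10): a BCD-8421 word shifted left by three bits, read as BCD-5421."""
    return decode(encode(n, W8421) << 3, W5421)
```

For instance, 7 is 1010 in BCD-5421; shifting left by one bit gives the nibbles 0001 0100, which read as 14 in BCD-8421. No carry propagation is involved, which is why the hardware needs only recoding and wiring for these two multiples.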

4.3. Type2 QDS Function. The approach of the type2 division algorithm is based on the implementation of the QDS function fully by a ROM. This ROM is addressed by estimates of the residual and divisor. The use of estimates is feasible because a signed digit set together with a redundancy greater than 1/2 is used. The estimates $10\,\hat{w}[j]$ and $\hat{d}$ are obtained by truncation; that is, only a limited number of MSDs of the residual $10\,w[j]$ and divisor $d$ are regarded. Furthermore, the residual is implemented in BCD-4221 carry-save representation such that the maximum error introduced by an estimation with precision $t$ (one integer digit and $t - 1$ fractional digits) is bounded by $\epsilon = 2 \cdot 10^{-t+1}$. Negative residuals are represented by their 10's complement.

The divisor is subdivided into subranges $[d_i, d_{i+1})$ of equal width $d_{i+1} - d_i = 10^{-\delta}$. For each subrange $i$, the QDS function $z_{j+1} = \mathrm{SEL}_{\mathrm{ROM}}(10\,\hat{w}[j], \hat{d})$ is defined by selection constants $m_k(i)$:

$$z_{j+1} = k \iff m_k(i) \le 10\,\hat{w}[j] < m_{k+1}(i). \tag{11}$$


(1) Compute $\delta_1 = 10\,w[j] - d$, $\ldots$, $\delta_9 = 10\,w[j] - 9d$
(2) if $\delta_1 < 0$ then $z_{j+1} = 0$ and $w[j+1] = 10 \cdot w[j]$
    else if $\delta_2 < 0$ then $z_{j+1} = 1$ and $w[j+1] = \delta_1$
    $\vdots$
    else if $\delta_9 < 0$ then $z_{j+1} = 8$ and $w[j+1] = \delta_8$
    else $z_{j+1} = 9$ and $w[j+1] = \delta_9$

Algorithm 1: Pseudocode for type1 digit recurrence division.
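Algorithm 1 can be mirrored in software by a minimal sketch that works on scaled integers instead of BCD words. The function name `type1_divide` and the integer scaling (operands given as $p$-digit integers $x \cdot 10^p$ and $d \cdot 10^p$) are assumptions made for illustration; the nine subtractions that run in parallel in hardware are simply evaluated in a list here.

```python
def type1_divide(x_int, d_int, steps):
    """Restoring radix-10 digit recurrence on scaled integers.

    x_int and d_int represent x and d multiplied by the same power of ten,
    with 0.1 <= x, d < 1. Returns the quotient digits z_1 ... z_steps and the
    final residual (in the same scaling as the operands).
    """
    shifted = x_int                 # 10 * w[0] = x, since w[0] = x / 10
    digits, residual = [], None
    for _ in range(steps):
        # Step (1): delta_k = 10 * w[j] - k * d for k = 1 ... 9
        deltas = [shifted - k * d_int for k in range(1, 10)]
        # Step (2): keep the smallest nonnegative difference
        z, residual = 0, shifted
        for k, delta in enumerate(deltas, start=1):
            if delta < 0:
                break
            z, residual = k, delta
        digits.append(z)
        shifted = 10 * residual     # prepare 10 * w[j + 1]
    return digits, residual

digits, residual = type1_divide(123456, 456789, 8)
quotient_int = int("".join(str(z) for z in digits))
```

In this scaling the concatenated digits equal the truncated integer quotient of $x \cdot 10^{n-1}$ by $d$ after $n$ steps, and the final residual is the matching nonnegative remainder.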

These selection constants are bounded by selection intervals $m_k(i) \in [L_k, U_{k-1}]$ with the containment condition [20]:

$$L_k(d_i) = (k - \rho)\, d_i, \tag{12a}$$

$$U_k(d_i) = (k + \rho)\, d_i. \tag{12b}$$

Furthermore, the continuity condition states that every value $10\,w[j]$ must belong to at least one selection interval [20]. Considering the maximum error due to truncation, the continuity condition can be expressed by

$$U_{k-1}(d_i) - L_k(d_{i+1}) \ge \epsilon = 2 \cdot 10^{-t+1}. \tag{13}$$

If we suppose the subrange width to be constant ($d_{i+1} - d_i = 10^{-\delta}$) and consider that the minimum overlapping occurs for the minimum divisor $d_{\min}$ and the maximum digit $k_{\max}$, then we obtain the condition

$$(k_{\max} - 1 + \rho)\, d_{\min} - (k_{\max} - \rho)\,(d_{\min} + 10^{-\delta}) \ge 2 \cdot 10^{-t+1}. \tag{14}$$

We choose $\rho = 8/9$ ($k_{\max} = 8$) and $d_{\min} = 0.4$, which leads to $\delta = 2$ and $t = 2$. Thus, the divisor's estimate $\hat{d}$ requires an accuracy of two fractional digits, and the residual's estimate $10\,\hat{w}$ requires an accuracy of one integer and one fractional digit. Furthermore, the divisor must be prescaled such that $0.4 \le d < 1.0$.
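The choice $\delta = 2$ and $t = 2$ can be reproduced with exact rational arithmetic. The following is a minimal sketch; the function names `overlap` and `feasible` are hypothetical, and the code only evaluates condition (14) as stated above.

```python
from fractions import Fraction

def overlap(rho, k_max, d_min, delta):
    """Left-hand side of (14): the narrowest overlap of adjacent selection intervals."""
    step = Fraction(1, 10**delta)                  # subrange width 10^(-delta)
    return (k_max - 1 + rho) * d_min - (k_max - rho) * (d_min + step)

def feasible(rho, k_max, d_min, delta, t):
    """True when the overlap covers the truncation error 2 * 10^(-t+1)."""
    epsilon = Fraction(2, 10**(t - 1))
    return overlap(rho, k_max, d_min, delta) >= epsilon

RHO = Fraction(8, 9)
D_MIN = Fraction(4, 10)

margin = overlap(RHO, 8, D_MIN, 2)                 # 6/25 = 0.24 for delta = 2
```

With $\rho = 8/9$, $k_{\max} = 8$, and $d_{\min} = 0.4$, the overlap for $\delta = 2$ is $0.24$, which just covers $\epsilon = 0.2$ for $t = 2$. With $\delta = 1$ the overlap becomes negative, and with $t = 1$ the error bound grows to $2$, so neither parameter can be reduced.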

The P-D diagram is a visualization technique for designing a quotient digit selection function and for computing the decision boundaries $m_k(i)$. It plots the shifted residual versus the divisor. The P-D diagram for the type2 quotient digit selection function with the selection constants $m_k(i)$ is shown in Figure 2.

The two MSDs of the divisor and residual address the ROM in order to obtain the signed quotient digit, which comprises 5 bits. In order to reduce the ROM's size, the two MSDs of the divisor and the residual are implemented by using a nonredundant radix-2 representation. This leads to an address that is composed of 15 bits: seven bits from the unsigned radix-2 representation of the divisor's two MSDs and eight bits from the signed radix-2 representation of the residual's two MSDs. Hence, the size of the ROM is $2^{15} \times 5$ bits. Unfortunately, the use of radix-2 representation complicates the multiplication by 10, which is required according to (5). Therefore, an additional binary adder is needed:

$$10 \cdot w[j] = 8 \cdot w[j] + 2 \cdot w[j] = (w[j] \ll 3) + (w[j] \ll 1). \tag{15}$$

Figure 2: P-D diagram for the type2 QDS function (shifted residual $10 \cdot w$ versus divisor $d$).

(1) compute $L_k(d_i)$ and $U_k(d_i)$ for each $d_i \in \{0.40, 0.41, \ldots, 0.99\}$ and $k \in \{-8, \ldots, +8\}$ according to (12a) and (12b)
(2) for all $10 \cdot w_n \in \{-8.8, -8.7, \ldots, +8.8\}$ with $|w| \le \rho \cdot d_{\max} = 0.88$, compute the quotient digits $z$ as follows:
    if $10 \cdot w_n \ge U_7 - \epsilon = U_7 - 0.2$ then $z = 8$
    else if $10 \cdot w_n \ge U_6 - 0.2$ then $z = 7$
    $\vdots$
    else if $10 \cdot w_n \ge U_0 - 0.2$ then $z = 1$
    else if $10 \cdot w_n \ge L_0$ then $z = 0$
    $\vdots$
    else if $10 \cdot w_n \ge L_{-8}$ then $z = -8$

Algorithm 2: Algorithm for the calculation of the type2 ROM entries.

The corresponding algorithm for the calculation of the ROM entries can be derived from the P-D diagram and is depicted in Algorithm 2. It should be noted that the radix-2 representations of $d$ and $w$ are truncations of $d$ and $w$, that is, $\hat{d} \le d$ and $\hat{w} \le w$.
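The multiplication by 10 in (15) and the 15-bit ROM address can be sketched as follows. The helper names and, in particular, the ordering of the two address fields are assumptions; the text only fixes the field widths (seven unsigned bits for the divisor estimate and eight signed bits for the residual estimate).

```python
def times10(w):
    """10 * w computed as (w << 3) + (w << 1), following (15)."""
    return (w << 3) + (w << 1)

def rom_address(d_msd, w_msd):
    """Pack the two estimates into a 15-bit address.

    d_msd: the divisor's two MSDs as an unsigned integer (40 ... 99 after prescaling).
    w_msd: the shifted residual's two MSDs as a signed integer (-88 ... +88).
    The divisor field is placed in the upper bits here (assumed ordering).
    """
    assert 0 <= d_msd < 2**7
    assert -2**7 <= w_msd < 2**7
    return (d_msd << 8) | (w_msd & 0xFF)   # 8-bit two's complement residual field

used_addresses = {rom_address(d, w) for d in range(40, 100) for w in range(-88, 89)}
```

Only $60 \cdot 177 = 10620$ of the $2^{15}$ addresses are reachable under these assumptions; the remaining entries are never read.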

Once the quotient digit $z_{j+1}$ has been determined, the multiple $z_{j+1} \cdot d$ is subtracted from the current residual to compute the next residual as stated in (5). The subtracter is a fast redundant BCD-4221 carry-save adder (CSA) as described in [21], except for the two MSDs, in which we apply radix-2 CPAs. Thus, the multiple $z_{j+1} \cdot d$ is composed of two summands:

$$z_{j+1} \cdot d = d^1_{j+1} + d^2_{j+1}, \tag{16}$$

where $d^{1,2} \in \{0, \pm 1d, \pm 2d, \pm 5d, \pm 10d\}$ are precomputed. The multiples $2d$ and $5d$ can be computed easily and quickly by digit recoding and constant shift operations, as shown in (9) and (10). The 10's and 2's complements are applied for $d^{1,2} < 0$. For both the radix-2 and the BCD-4221 radix-10 representation, the complements can be computed by inverting each bit and adding one. In summary, the subtraction uses a (3:1) radix-2 CPA for the two MSDs and a redundant radix-10 (4:2) CSA for the remaining digits. The digit recurrence algorithm of the type2 divider is summarized in Algorithm 3, and its block diagram is depicted in Figure 3.

Figure 3: Block diagram of type2 digit recurrence.

(1) Select $z_{j+1} = \mathrm{SEL}_{\mathrm{ROM}}(10\,\hat{w}[j], \hat{d})$
(2) Compute $(w_s[j+1] + w_c[j+1]) = 10\,(w_s[j] + w_c[j]) - z_{j+1} \cdot d$

Algorithm 3: Pseudocode for the type2 digit recurrence.
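The interplay of the ROM-based selection (Algorithm 2) and the recurrence (Algorithm 3) can be simulated with exact rationals. This is a minimal behavioral sketch, not the hardware data path: the carry-save residual is modeled by a single exact value, the estimates are obtained by truncating toward minus infinity on a decimal grid, and the fallback to $-8$ for estimates below $L_{-8}$ is an assumption. The names `sel_rom`, `truncate`, and `type2_divide` are hypothetical.

```python
from fractions import Fraction
from math import floor

RHO = Fraction(8, 9)       # redundancy factor
EPS = Fraction(2, 10)      # truncation error bound for t = 2

def sel_rom(w_est, d_est):
    """Quotient digit for the estimates (10 * w_hat, d_hat), following Algorithm 2."""
    for k in range(8, 0, -1):                       # z = 8 ... 1
        if w_est >= (k - 1 + RHO) * d_est - EPS:    # U_{k-1}(d_i) - epsilon
            return k
    for k in range(0, -9, -1):                      # z = 0 ... -8
        if w_est >= (k - RHO) * d_est:              # L_k(d_i)
            return k
    return -8                                       # assumed entry below L_{-8}

def truncate(value, frac_digits):
    """Truncate toward minus infinity to the given number of fractional digits."""
    scale = 10**frac_digits
    return Fraction(floor(value * scale), scale)

def type2_divide(x, d, steps):
    """Nonrestoring recurrence (5) with 0.1 <= x < 1 and 0.4 <= d < 1."""
    w = x / 10                                      # w[0] = x / 10
    digits = []
    for _ in range(steps):
        z = sel_rom(truncate(10 * w, 1), truncate(d, 2))
        w = 10 * w - z * d
        assert abs(w) <= RHO * d                    # convergence condition (8)
        digits.append(z)
    quotient = sum(z * Fraction(1, 10**j) for j, z in enumerate(digits))
    return digits, quotient, w

digits, quotient, w_final = type2_divide(Fraction(1, 10), Fraction(4, 10), 17)
```

For $x = 0.1$ and $d = 0.4$ the digit sequence starts with $0, 3, -5$, which sums to $0.3 - 0.05 = 0.25$. The signed digits still have to be converted to a conventional representation, which the on-the-fly conversion of the divider takes care of.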

4.4. Type3 QDS Function. The frequency-limiting component of the type2 divider is the digit recurrence step, which comprises the ROM (BRAM) to calculate the next quotient digit. This BRAM is slow compared to common FPGA logic. Hence, it appears to be beneficial to remove the slow BRAM from the critical path and use fast FPGA logic instead. To analyze the impact of the BRAM delays, we implement two dividers (type3.a and type3.b) that are realized without BRAM in the critical path.

Lang and Nannarelli [14] propose a divider that implements a quotient digit selection function based on CPAs and sign detection. These CPAs subtract fixed values from the current residual with limited precision. These fixed values depend on the estimates of the divisor and are precomputed and stored in a ROM. Furthermore, the quotient digit is divided into two parts, which further simplifies the quotient digit selection function and reduces the critical path. Unfortunately, this divider cannot be implemented efficiently on FPGAs because the required carry-save adder with small error estimation shows a poor performance on the FPGA's slice structure. Nevertheless, to investigate the impact of BRAM delays in the type2 divider, we implement another two dividers (type3.a and type3.b) that have no BRAM in the critical path but a CPA-based quotient digit selection function instead.

The type3 dividers are modifications of the type2 divider. The common architectural features are

(i) the multiples of the divisor, which are composed of two components, $z_{j+1} \cdot d = d^1_{j+1} + d^2_{j+1}$, where $d^{1,2} \in \{0, \pm 1d, \pm 2d, \pm 5d, \pm 10d\}$,
(ii) the fast redundant BCD-4221 carry-save adders for the digit recurrence step as stated in (5),
(iii) the radix-2 implementation of the two MSDs, and
(iv) the 10's and 2's complements for $d^{1,2} < 0$.

Furthermore, since the architectures of the type2 and type3 dividers are similar with the exception of the quotient digit selection function, most of the dividers' characteristics are also identical, including

(i) the P-D diagram,
(ii) the redundancy factor $\rho = 8/9$, $k_{\max} = 8$,
(iii) the need for prescaling the divisor, $0.4 \le d < 1.0$,
(iv) the accuracy of the divisor's estimation, $\delta = 2$, and
(v) the accuracy of the residual's estimation, $t = 2$.

Contrary to the type2 divider, the QDS functions of the type3 dividers are implemented by 16 carry-propagate adders with sign detection. These adders subtract fixed selection constants from the estimation of the current residual. The selection constants depend on the current divisor and are stored in a ROM, which is then no longer part of the critical path. The selection constants are computed according to Algorithm 4. The common digit recurrence algorithm of the type3.a and type3.b dividers is shown in Algorithm 5, and the corresponding block diagram is depicted in Figure 4.
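The comparator-based selection can be sketched in software as follows. The sketch follows Algorithms 4 and 5 for a single divisor estimate; the function names are hypothetical, and the 16 comparisons that run in parallel in hardware are evaluated in a list.

```python
from fractions import Fraction

RHO = Fraction(8, 9)
EPS = Fraction(2, 10)
DIGITS = list(range(8, -8, -1))            # k = 8, 7, ..., -7 (16 comparators)

def selection_constants(d_est):
    """Constants m_8 ... m_{-7} for one divisor estimate d_i (Algorithm 4)."""
    consts = {}
    for k in range(8, 0, -1):
        consts[k] = (k - 1 + RHO) * d_est - EPS     # m_k = U_{k-1}(d_i) - epsilon
    for k in range(0, -8, -1):
        consts[k] = (k - RHO) * d_est               # m_k = L_k(d_i)
    return consts

def sign_bits(w_est, consts):
    """Sign of m_k - 10 * w_hat for every comparator, ordered k = 8 ... -7."""
    return [consts[k] - w_est < 0 for k in DIGITS]

def sel_compare(w_est, consts):
    """Priority selection of Algorithm 5: the first negative difference wins."""
    for k, negative in zip(DIGITS, sign_bits(w_est, consts)):
        if negative:
            return k
    return -8

consts_040 = selection_constants(Fraction(40, 100))
```

Because the constants decrease monotonically with $k$, the 16 sign bits form a thermometer code, and the selected digit equals $-8$ plus the number of negative differences. This is the property that the dedicated multiplexer of the type3.b divider relies on.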

Figure 4: Block diagram of type3 digit recurrence.

(1) compute $L_k(d_i)$ and $U_k(d_i)$ for each $d_i \in \{0.40, 0.41, \ldots, 0.99\}$ and $k \in \{-8, \ldots, +8\}$ according to (12a) and (12b)
(2) compute the selection constants $m_k(d_i)$:
    $m_8(d_i) = U_7(d_i) - \epsilon = U_7(d_i) - 0.2$
    $m_7(d_i) = U_6(d_i) - 0.2$
    $\vdots$
    $m_1(d_i) = U_0(d_i) - 0.2$
    $m_0(d_i) = L_0(d_i)$
    $\vdots$
    $m_{-7}(d_i) = L_{-7}(d_i)$

Algorithm 4: Algorithm for the calculation of the type3 selection constants.

(1) if $m_8(d_i) - 10\,\hat{w}[j] < 0$ then $z_{j+1} = 8$
    else if $m_7(d_i) - 10\,\hat{w}[j] < 0$ then $z_{j+1} = 7$
    $\vdots$
    else if $m_{-7}(d_i) - 10\,\hat{w}[j] < 0$ then $z_{j+1} = -7$
    else $z_{j+1} = -8$
(2) Compute $(w_s[j+1] + w_c[j+1]) = 10\,(w_s[j] + w_c[j]) - z_{j+1} \cdot d$

Algorithm 5: Pseudocode for the type3 digit recurrence.

The sign signals of the binary carry-propagate adders are used to determine the corresponding quotient digit and to select the multiples of the divisor in the digit recurrence iteration step. The advantage of the type3 QDS function is its short critical path, which comprises only one 8-bit binary carry-propagate adder. However, this short critical path comes at a high price because the complexity is moved to the selection of the divisor's multiples. Hence, large fan-in multiplexers with poor latencies are required. In order to minimize the impact of these multiplexers, we designed a new dedicated (17:1) multiplexer that exploits the FPGA's fast carry chains and has a delay of only one LUT instance. In the following, we name the divider with the improved fast multiplexers type3.b and the divider with traditional multiplexers type3.a. The new dedicated (17:1) multiplexer uses the 16 sign bits as select lines and is coded as follows:

    if sel = '1111111111111111' then y = x(16)
    if sel = '0111111111111111' then y = x(15)
    ...
    if sel = '0000000000000001' then y = x(1)
    if sel = '0000000000000000' then y = x(0)        (17)

with sel(−1) = 1 and sel(16) = 0. In other words, bit x(i) is selected when sel(i) = 0 and sel(i − 1) = 1. The selection of two input signals is implemented in one (5:1) LUT, and all signals are combined by a long OR gate that is implemented by exploiting the FPGA's fast carry chain. Therefore, one (17:1) multiplexer requires nine LUTs, and the type3.b divider with fast multiplexers uses considerably more LUTs than the type3.a divider (see Section 6). The block diagram of such a (17:1) multiplexer is depicted in Figure 5.
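The selection rule in (17) can be modeled by a short behavioral sketch. The function name `fast_mux` and the list-based bit representation (index $i$ holds sel($i$) or x($i$)) are assumptions; the OR reduction in the loop stands in for the carry chain, and the LUT pairing is not modeled.

```python
def fast_mux(sel, x):
    """Behavioral model of the (17:1) multiplexer for thermometer-coded select lines.

    sel: 16 bits with sel[i] = sel(i); x: 17 data bits with x[i] = x(i).
    Bit x(i) passes when sel(i) = 0 and sel(i - 1) = 1, where sel(-1) = 1
    and sel(16) = 0 are fixed boundary values.
    """
    assert len(sel) == 16 and len(x) == 17

    def ext(i):
        if i == -1:
            return 1
        if i == 16:
            return 0
        return sel[i]

    y = 0
    for i in range(17):
        y |= x[i] & (1 - ext(i)) & ext(i - 1)   # wide OR of the gated inputs
    return y

def thermometer(n):
    """Select word with the n lowest bits set, as produced by the sign detection."""
    return [1] * n + [0] * (16 - n)
```

For a thermometer code with $n$ ones, exactly one term of the OR is enabled, so the output equals x($n$). Select words that are not thermometer coded do not occur here because the selection constants are ordered.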

Figure 5: Fast multiplexer.

4.5. Type4 QDS Function. The type4 divider applies a new algorithm as proposed by us in a previous paper [22]. It is based on the decomposition of the signed quotient digit into three components having values $\{-5, 0, 5\}$, $\{-2, 0, 2\}$, and $\{-1, 0, 1\}$ as well as a fast constant digit selection function. In this implementation, neither a ROM for the QDS function nor a multiplexer to select the multiple $z_j \cdot d$ in the digit recurrence iteration step is required. The divider is intended to utilize fewer resources than the other implementations. Like the type2, type3.a, and type3.b dividers, the type4 divider uses signed-digit redundant quotient calculation, carry-save representation of the residual, fast BCD-4221 CSAs for the digit recurrence, and a radix-2 implementation of the MSDs.

Quotient digits are decomposed into three components, $z_{j+1} = 5 \cdot z^3_{j+1} + 2 \cdot z^2_{j+1} + z^1_{j+1}$, and each component $z^i$ is computed by a distinct selection function. These components can hold three values, $z^i \in \{-1, 0, 1\}$, so that only two comparators per component are required to distinguish between these values. Since $-8 \le z_{j+1} \le 8$, we get a redundancy factor of $\rho = 8/9$. Furthermore, the selection functions become very simple due to prescaling of the divisor ($0.4 \le d < 0.8$); that is, they are constant functions $\mathrm{SEL}^i$ that do not depend on the divisor anymore:

$$z^3_{j+1} = \mathrm{SEL}^3(10\,\hat{w}[j]), \tag{18a}$$

$$z^2_{j+1} = \mathrm{SEL}^2(\hat{v}^1[j]), \tag{18b}$$

$$z^1_{j+1} = \mathrm{SEL}^1(\hat{v}^2[j]). \tag{18c}$$

The recurrence is then defined as follows:

$$v^1[j] = 10\,w[j] - z^3_{j+1} \cdot 5d, \tag{19a}$$

$$v^2[j] = v^1[j] - z^2_{j+1} \cdot 2d, \tag{19b}$$

$$w[j+1] = v^2[j] - z^1_{j+1} \cdot d. \tag{19c}$$

The multiples $2d$ and $5d$ can be computed easily and quickly according to (9) and (10). Each selection function requires two comparators, implemented as carry-save adders, that subtract constant values from the residuals' estimations ($\hat{w}[j]$, $\hat{v}^1[j]$, and $\hat{v}^2[j]$). As we will show in the following, these estimations require only the two most significant digits (MSDs) of the corresponding exact values.

First, the selection intervals $[L^i_k, U^i_k]$ for $z^i_{j+1} = k$ with $k \in \{-1, 0, 1\}$ and $i \in \{1, 2, 3\}$ have to be determined, where $L^i_k$ is the smallest and $U^i_k$ is the greatest value of the selection constant $m_k(i)$ such that the next residual is still bounded. Applying the convergence condition (8), the digit recurrence (19a), (19b), and (19c), and the redundancy factor $\rho = 8/9$, we obtain

$$|w[j]| \le \rho d = \tfrac{8}{9}\, d, \tag{20a}$$

$$|v^2[j]| \le (\rho + 1)\, d = \tfrac{17}{9}\, d, \tag{20b}$$

$$|v^1[j]| \le (\rho + 1 + 2)\, d = \tfrac{35}{9}\, d. \tag{20c}$$

From the recurrence $v^1[j] = 10\,w[j] - z^3_{j+1} \cdot 5d$, the bound $v^1[j] \le (35/9)\, d$, $z^3_{j+1} \in \{-1, 0, 1\}$, and replacing $10\,w[j]$ by the upper limit $U^3_k$, we get

$$U^3_k = (5k + \rho + 3)\, d = \left(5k + \tfrac{35}{9}\right) d. \tag{21a}$$

Similarly, we obtain the upper limits $U^2_k$ and $U^1_k$ from the recurrences (19b) and (19c), respectively:

$$U^2_k = (2k + \rho + 1)\, d = \left(2k + \tfrac{17}{9}\right) d, \tag{21b}$$

$$U^1_k = (k + \rho)\, d = \left(k + \tfrac{8}{9}\right) d. \tag{21c}$$

Likewise, the lower limits $L^i_k$ can be computed:

$$L^3_k = (5k - \rho - 3)\, d = \left(5k - \tfrac{35}{9}\right) d, \tag{22a}$$

$$L^2_k = (2k - \rho - 1)\, d = \left(2k - \tfrac{17}{9}\right) d, \tag{22b}$$

$$L^1_k = (k - \rho)\, d = \left(k - \tfrac{8}{9}\right) d. \tag{22c}$$


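Figure 6: P-D diagram for the type4 QDS function (shifted residual $10 \cdot w$ versus divisor, with the constant decision boundaries $m^3_{\mathrm{pos}} = 0.9$ and $m^3_{\mathrm{neg}} = -1.5$ separating the regions $z^3 = 1$, $z^3 = 0$, and $z^3 = -1$).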

Due to the redundancy factor $\rho = 8/9 > 1/2$, we have overlapping regions $[L^i_{k+1}, U^i_k]$ where more than one quotient digit component $z^i_{j+1}$ may be selected. The decision boundary of a selection function should lie inside these overlapping regions. Figure 6 shows the P-D diagram for $z^3$. The figures for $z^2$ and $z^1$ look similar. As $z^3$ can hold three different values, the selection function requires two decision boundaries. Generally, it is implemented by a staircase function (see Figure 2), but for the bounded divisor $d \in [0.4, 0.8)$ it is independent of the divisor and is reduced to a constant function (see Figure 6). Fortunately, the selection functions for $z^2$ and $z^1$ are also independent of the divisor and can also be implemented by constant functions in a similar fashion.

We now determine the required number of fractional digits for the estimates $10\,\hat{w}$, $\hat{v}^1$, and $\hat{v}^2$. Moreover, we choose suitable selection constants $m^i_{\mathrm{pos}}$ and $m^i_{\mathrm{neg}}$. These constants depend on the maximum error due to the estimates and on the minimum overlapping width, which is given by $U^i_0(0.4) - L^i_1(0.8)$ as well as $U^i_{-1}(0.8) - L^i_0(0.4)$. Since we use the BCD-4221 carry-save redundant digit representation, for a precision of $t$ digits (one integer digit and $t - 1$ fractional digits) we have a maximum error of $\epsilon = 2 \cdot 10^{-(t-1)}$. Moreover, the estimations $\hat{w}$ are computed by truncation ($w - \hat{w} \le \epsilon$). This means that, for given positive and negative selection constants $m^i_{\mathrm{pos}} > 0$ and $m^i_{\mathrm{neg}} < 0$, the following expressions must be true:

$$U^i_0(0.4) - m^i_{\mathrm{pos}} \ge \epsilon, \quad L^i_1(0.8) \le m^i_{\mathrm{pos}}, \tag{23}$$

$$L^i_0(0.4) \le m^i_{\mathrm{neg}}, \quad U^i_{-1}(0.8) - m^i_{\mathrm{neg}} \ge \epsilon. \tag{24}$$

If we choose the selection constants as listed in Table 2, all conditions are fulfilled for a precision of $t = 2$ digits, and each selection function is reduced to two simple 2-digit comparators.
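Conditions (23) and (24) can be evaluated exactly for the constants of Table 2. The following is a minimal sketch; the dictionary layout and the function names are hypothetical, and the interval limits are taken from (21a) to (22c).

```python
from fractions import Fraction

RHO = Fraction(8, 9)
EPS = Fraction(2, 10)                              # epsilon for t = 2
WEIGHT = {3: 5, 2: 2, 1: 1}                        # divisor multiple of component i
SLACK = {3: RHO + 3, 2: RHO + 1, 1: RHO}           # bounds from (20a)-(20c)
CONSTANTS = {                                      # (m_pos, m_neg) from Table 2
    3: (Fraction(9, 10), Fraction(-15, 10)),
    2: (Fraction(1, 10), Fraction(-7, 10)),
    1: (Fraction(1, 10), Fraction(-3, 10)),
}
D_MIN, D_MAX = Fraction(4, 10), Fraction(8, 10)

def upper(i, k, d):
    """U^i_k(d) according to (21a)-(21c)."""
    return (WEIGHT[i] * k + SLACK[i]) * d

def lower(i, k, d):
    """L^i_k(d) according to (22a)-(22c)."""
    return (WEIGHT[i] * k - SLACK[i]) * d

def margins(i):
    """The two error margins listed in the last two columns of Table 2."""
    m_pos, m_neg = CONSTANTS[i]
    return upper(i, 0, D_MIN) - m_pos, upper(i, -1, D_MAX) - m_neg

def conditions_hold(i):
    """Check (23) and (24) for component i."""
    m_pos, m_neg = CONSTANTS[i]
    pos_margin, neg_margin = margins(i)
    return (pos_margin >= EPS and lower(i, 1, D_MAX) <= m_pos
            and lower(i, 0, D_MIN) <= m_neg and neg_margin >= EPS)
```

The margins evaluate to $59/90 \approx 0.656$ and $11/18 \approx 0.611$ for $z^3$ and $z^2$, and to $23/90 \approx 0.256$ and $19/90 \approx 0.211$ for $z^1$, so every margin exceeds $\epsilon = 0.2$, although the margin for $z^1$ is small.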

Table 2: Constants for the type4 quotient digit selection function.

         $m^i_{\mathrm{pos}}$   $m^i_{\mathrm{neg}}$   $U^i_0(0.4) - m^i_{\mathrm{pos}}$   $U^i_{-1}(0.8) - m^i_{\mathrm{neg}}$
$z^3$    0.9       −1.5      0.656      0.611
$z^2$    0.1       −0.7      0.656      0.611
$z^1$    0.1       −0.3      0.255      0.211

As soon as one component $z^i$ of the quotient digit is computed, the corresponding multiple of the divisor must be subtracted from the partial residual. Each component $z^i$ can hold three different values, $z^i \in \{-1, 0, +1\}$, that are multiplied with the corresponding weighted divisor $n \cdot d$ with $n \in \{5, 2, 1\}$. In other words, $n \cdot d$ is either passed, negated, or reset. Passing and resetting the multiple of the divisor are performed at no extra cost. Negation is accomplished by bit inversion and adding +1 through the carry inputs of the carry-save adders in (19a), (19b), and (19c). In summary, this is a very fast operation and requires only two LUTs per digit.

Similar to the type2 and type3 division, we implement the two MSDs of the type4 residual by using a nonredundant radix-2 representation, which requires fewer LUTs and results in a faster quotient digit selection function. The pseudocode for the digit recurrence iteration is shown in Algorithm 6, and the corresponding block diagram is depicted in Figure 7.
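The three-stage recurrence with constant selection functions can be simulated with exact rationals. This is a minimal behavioral sketch under stated assumptions: the redundant residuals are modeled by single exact values, the two-MSD estimates are obtained by truncating toward minus infinity to one fractional digit, and the names `sel_const`, `type4_step`, and `type4_divide` are hypothetical. The constants are those of Table 2.

```python
from fractions import Fraction
from math import floor

RHO = Fraction(8, 9)
M3 = (Fraction(9, 10), Fraction(-15, 10))   # (m_pos, m_neg) for z^3
M2 = (Fraction(1, 10), Fraction(-7, 10))    # (m_pos, m_neg) for z^2
M1 = (Fraction(1, 10), Fraction(-3, 10))    # (m_pos, m_neg) for z^1

def estimate(value):
    """Two-MSD estimate: truncation to one integer and one fractional digit."""
    return Fraction(floor(value * 10), 10)

def sel_const(est, m_pos, m_neg):
    """Constant selection function with two comparators."""
    if est >= m_pos:
        return 1
    if est >= m_neg:
        return 0
    return -1

def type4_step(w, d):
    """One iteration following (19a)-(19c); returns (z_{j+1}, w[j+1])."""
    z3 = sel_const(estimate(10 * w), *M3)
    v1 = 10 * w - z3 * 5 * d
    z2 = sel_const(estimate(v1), *M2)
    v2 = v1 - z2 * 2 * d
    z1 = sel_const(estimate(v2), *M1)
    return 5 * z3 + 2 * z2 + z1, v2 - z1 * d

def type4_divide(x, d, steps):
    """Recurrence for 0.1 <= x < 1 and a prescaled divisor 0.4 <= d < 0.8."""
    w = x / 10
    digits = []
    for _ in range(steps):
        z, w = type4_step(w, d)
        assert abs(w) <= RHO * d            # convergence condition (8)
        digits.append(z)
    quotient = sum(z * Fraction(1, 10**j) for j, z in enumerate(digits))
    return digits, quotient, w

digits, quotient, w_final = type4_divide(Fraction(1, 10), Fraction(4, 10), 17)
```

For $x = 0.1$ and $d = 0.4$ the first three digits are $1, -7, -5$, that is, $1 - 0.7 - 0.05 = 0.25$. The selection never consults the divisor, which is the point of the constant selection functions.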

4.6. Proposed Decimal Fixed-Point Divider. For each type of division algorithm presented in the preceding sections, we have implemented a corresponding fixed-point divider, which is described in the following. For all fixed-point dividers, we expect the input operands to be normalized.

The block diagram of the type1 divider is depicted in Figure 8. First, the divisor multiples $2d, \ldots, 9d$ are precomputed. In the following cycles, $p = 16$ quotient digits ($16 + 1$ if the first digit is zero) are computed one by one, followed by an additional rounding digit. The quotient (Z) and the quotient +1 (ZP1) are computed on-the-fly by using two registers that are updated every cycle. The algorithm has the advantage that no additional slow decimal CPA is required to compute the incremented quotient. It is described more precisely in Section 5. The normalization of the result is also performed on-the-fly by locking the conversion when $16 + 1$ quotient digits (for the significand and the rounding digit) have been computed; that is, in the worst case, when the first quotient digit is zero, $16 + 1 + 1 = 18$ cycles are required. Furthermore, the sticky bit is calculated, which is required for rounding. It is set to one whenever the final remainder is unequal to zero.

The architectures of the type2, type3.a, type3.b, and type4 dividers are similar in their structure but differ in their digit recurrence algorithm and scaling. The common block diagram is shown in Figure 9. First, the dividend and the divisor are rescaled,

$$d \in [0.1, 1.0) \Longrightarrow d' \in \begin{cases} [0.4, 1.0) & \text{for type2, type3.a/b}, \\ [0.4, 0.8) & \text{for type4}, \end{cases} \tag{25}$$

and five as well as two times the divisor are precomputed according to (9) and (10). Then, the operands are recoded because the digit recurrence iteration uses a redundant digit representation with BCD-4221 carry-save adders.

Figure 7: Block diagram of type4 digit recurrence.

(1) select
    $$z^3_{j+1} = \begin{cases} 1 & \text{if } 0.9 \le 10\,\hat{w}[j], \\ 0 & \text{if } -1.5 \le 10\,\hat{w}[j] < 0.9, \\ -1 & \text{if } 10\,\hat{w}[j] < -1.5, \end{cases}$$
    and compute $(v^1_s[j] + v^1_c[j]) = 10\,(w_s[j] + w_c[j]) - z^3_{j+1} \cdot 5d$
(2) select
    $$z^2_{j+1} = \begin{cases} 1 & \text{if } 0.1 \le \hat{v}^1[j], \\ 0 & \text{if } -0.7 \le \hat{v}^1[j] < 0.1, \\ -1 & \text{if } \hat{v}^1[j] < -0.7, \end{cases}$$
    and compute $(v^2_s[j] + v^2_c[j]) = (v^1_s[j] + v^1_c[j]) - z^2_{j+1} \cdot 2d$
(3) select
    $$z^1_{j+1} = \begin{cases} 1 & \text{if } 0.1 \le \hat{v}^2[j], \\ 0 & \text{if } -0.3 \le \hat{v}^2[j] < 0.1, \\ -1 & \text{if } \hat{v}^2[j] < -0.3, \end{cases}$$
    and compute $(w_s[j+1] + w_c[j+1]) = (v^2_s[j] + v^2_c[j]) - z^1_{j+1} \cdot d$

Algorithm 6: Pseudocode for the type4 digit recurrence.

Figure 8: Block diagram of the normalized type1 division.

Figure 9: Block diagram of the normalized type2, type3.a, type3.b, and type4 divisions.

The digit recurrence retires one signed quotient digit $z_k$ in each iteration. The on-the-fly conversion algorithm converts this signed-digit representation to the BCD-8421-coded result [20] and computes the quotient (Z), the quotient +1 (ZP1), and the quotient −1 (ZM1). This conversion is accomplished in every iteration step and does not need a slow CPA. The incremented and decremented quotients are required for rounding. Moreover, the quotient is also normalized on-the-fly by locking the conversion when $16 + 1$ quotient digits (including the rounding digit) have been computed. The algorithm is described explicitly in Section 5 because it is accomplished together with the gradual underflow handling of the floating-point divider.
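The principle of on-the-fly conversion can be sketched as follows. This is the generic textbook scheme for a signed digit set $\{-8, \ldots, 8\}$, extended by an incremented register; it is not necessarily the exact variant used in the dividers, which additionally handles normalization and gradual underflow. The three registers are modeled as Python integers, and appending a digit is modeled as `10 * register + digit`.

```python
def on_the_fly(signed_digits):
    """Convert signed radix-10 digits (most significant first) to Z, ZM1, ZP1.

    Every step only selects one of the previous registers and appends one
    conventional digit 0..9, so no carry has to propagate through the word.
    """
    z, zm1, zp1 = 0, -1, 1                  # integer models: Z, Z - 1, Z + 1
    for digit in signed_digits:
        assert -8 <= digit <= 8
        if digit >= 0:
            new_z = 10 * z + digit
        else:
            new_z = 10 * zm1 + (10 + digit)
        if digit > 0:
            new_zm1 = 10 * z + (digit - 1)
        else:
            new_zm1 = 10 * zm1 + (9 + digit)
        if digit >= -1:
            new_zp1 = 10 * z + (digit + 1)
        else:
            new_zp1 = 10 * zm1 + (11 + digit)
        z, zm1, zp1 = new_z, new_zm1, new_zp1
    return z, zm1, zp1

z, zm1, zp1 = on_the_fly([1, -3, 8, -8, 0, 5])
```

The signed digits $1, -3, 8, -8, 0, 5$ represent $100000 - 30000 + 8000 - 800 + 5 = 77205$; every appended digit lies between 0 and 9 because the digit set excludes $\pm 9$.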


if ($\overline{\mathrm{qNaN}_X}$ and $\overline{\mathrm{sNaN}_X}$) and ($\mathrm{qNaN}_D$ or $\mathrm{sNaN}_D$) then
    $C'_X = C_D$, $C'_D = 0 \ldots 01$
    $q'_X = q_D$, $q'_D = 0 + \mathrm{bias} = 398$
else if ($\mathrm{qNaN}_X$ or $\mathrm{sNaN}_X$) then
    $C'_X = C_X$, $C'_D = 0 \ldots 01$
    $q'_X = q_X$, $q'_D = 0 + \mathrm{bias} = 398$
else (no changes)
    $C'_X = C_X$, $C'_D = C_D$
    $q'_X = q_X$, $q'_D = q_D$

Algorithm 7: Algorithm of NaN handling.
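The operand swap of Algorithm 7 can be illustrated with a short Python sketch. The `Operand` class and the function name `nan_handling` are hypothetical and introduced only for illustration; the sketch assumes that the first branch applies when only the divisor holds a NaN, and it does not distinguish quiet from signaling NaNs.

```python
from dataclasses import dataclass

BIAS = 398  # decimal64 exponent bias (Table 1)


@dataclass
class Operand:
    c: int                # significand (carries the payload for a NaN)
    q: int                # biased exponent
    is_nan: bool = False  # True for a quiet or signaling NaN


def nan_handling(x: Operand, d: Operand):
    """Return (dividend, divisor) so that a NaN payload passes through an unmodified divider."""
    unit_divisor = Operand(c=1, q=0 + BIAS)  # represents 1.0 * 10^0
    if d.is_nan and not x.is_nan:
        # Only the divisor is a NaN: it becomes the dividend.
        return Operand(d.c, d.q, True), unit_divisor
    if x.is_nan:
        # The dividend is a NaN: keep it and neutralize the divisor.
        return x, unit_divisor
    return x, d  # no changes


dividend, divisor = nan_handling(Operand(7, BIAS), Operand(1234, BIAS, True))
print(dividend.c, divisor.c)  # 1234 1
```

Because the divisor is forced to one, the significand of the NaN operand reaches the quotient unchanged, which is why no separate payload path is needed.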

Furthermore, the sticky bit is calculated, which is required for rounding. It is set to one whenever the final remainder is unequal to zero. Moreover, when the final remainder is negative, the quotient of the type2, type3.a, type3.b, and type4 dividers has to be adjusted by subtracting one LSD. This subtraction does not need another slow CPA because the quotient −1 (ZM1) has already been computed on-the-fly. The calculations of both the sticky bit and the sign bit require the reduction of the redundant remainder by a CPA. This CPA might be subdivided into multiple smaller CPAs to keep the latency low.

5. IEEE 754-2008 Floating-Point Division

The IEEE 754-2008 compliant decimal floating-point divider is an extension of the normalized fixed-point divider. The divider presented in this paper supports the interchange format IEEE 754-2008 decimal64 with DPD encoding, but it can be easily adapted to any other precision and exponent range.

The block diagram of the divider is depicted in Figure 10. The DFP division begins with the decoding of the dividend X and the divisor D and the extraction of their signs, significands, and exponents. In the following, the significands of the operands X ($c_X$) and D ($c_D$) are regarded as integers. Therefore, the corresponding exponents are $q_X$ and $q_D$.

According to [4], if one of the operands is a signaling NaN (sNaN) or a quiet NaN (qNaN), then the result is also a quiet NaN with the payload of the original NaN. Hence, in order to preserve the payload while using an unmodified divider, the NaN handling unit sets the NaN-holding operand as the dividend and resets the divisor to $1.0 \cdot 10^{0}$, as depicted in Algorithm 7.

The fixed-point dividers presented in Section 4 require normalized operands. Therefore, the number of leading zeros of both operands is counted, and the significands are normalized by barrel shifters. For performance reasons, the leading zeros counters exploit the FPGA's fast carry chains, as proposed in [23].

The exponent of a floating-point division is first estimated by the exponent of the normalized quotient in fractional representation (QNF) and is updated iteratively in each digit recurrence step. Since most of the decimal floating-point numbers specified by IEEE 754-2008 have multiple representations of the same value, the result of a division might also have more than one correct representation, and these representations differ in the exponent. However, to obtain a unique and reproducible result in the case of such an ambiguity, IEEE 754-2008 defines the exponent of the result, which is called the preferred exponent (QP).

Both exponents QNF and QP are positive integers and are determined as follows:

\[ \Delta q = (q_X - q_D), \qquad \Delta \mathrm{LZ} = (\mathrm{LZ}_X - \mathrm{LZ}_D), \tag{26} \]

\[ \mathrm{QNF} = \Delta q - \Delta \mathrm{LZ} + \mathrm{offset1}, \qquad \mathrm{QP} = \Delta q + \mathrm{offset1}. \tag{27} \]

The offset1 is a bias that ensures that QNF and QP are positive, since unsigned integer calculation is used in this paper. The bias is composed of

\[ \begin{aligned} \mathrm{offset1} = 783 ={} & 398 && \text{(IEEE 754-2008 bias)} \\ {}+{} & 369 && (q_{\max}) \\ {}+{} & 15 && \text{(conversion fractional to integer)} \\ {}+{} & 1 && \text{(normalization of the result).} \end{aligned} \tag{28} \]
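As a worked example of (26) to (28), the following Python sketch computes QNF and QP for two decimal64 operands. The function names are hypothetical, and the leading zeros are counted on the decimal string of the integer significand, which is an illustrative shortcut rather than the carry-chain based counter of the hardware.

```python
P = 16                                  # decimal64 precision in digits
BIAS, Q_MAX = 398, 369                  # decimal64 parameters (Table 1)
OFFSET1 = BIAS + Q_MAX + (P - 1) + 1    # 398 + 369 + 15 + 1 = 783, see (28)


def leading_zeros(c: int, p: int = P) -> int:
    """Number of leading zero digits of a nonzero p-digit significand."""
    return p - len(str(c))


def estimate_exponents(q_x: int, c_x: int, q_d: int, c_d: int):
    """Return (QNF, QP) according to (26) and (27)."""
    delta_q = q_x - q_d
    delta_lz = leading_zeros(c_x) - leading_zeros(c_d)
    qnf = delta_q - delta_lz + OFFSET1
    qp = delta_q + OFFSET1
    return qnf, qp


# 123 * 10^0 divided by 4 * 10^0: LZ_X = 13, LZ_D = 15, so delta_lz = -2.
print(estimate_exponents(0, 123, 0, 4))  # (785, 783)
```

Only the difference of the two exponents enters the formulas, so it does not matter here whether the inputs are biased or unbiased as long as both use the same convention.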

The normalized fixed-point division unit then computes $p = 16$ significand digits and one additional rounding digit. Additionally, the divider encompasses the on-the-fly conversion unit that also detects and handles gradual underflow with zero delay overhead. One signed quotient digit $z_j$ is retired in each cycle, and the on-the-fly conversion algorithm converts this signed-digit representation into the BCD-8421-coded partial result [20]. The conversion is accomplished in every iteration step and does not need a slow CPA (see Algorithm 8). The partial quotient is stored in a register Z′. Moreover, two additional registers ZM1′ and ZP1′ are provided: ZM1′ is the partial quotient decremented by one least significant digit (LSD), and ZP1′ is the partial quotient incremented by one LSD. ZM1′ is selected when the final residual in the last iteration is negative, and ZP1′ is required by the following rounding operation. Furthermore, the exponent for the normalized significand in integer representation (QNI) is calculated (see (29)), and gradual underflow handling is performed at no extra cost (see Algorithm 8). The exponent QNI is computed by decrementing the exponent of the normalized significand in fractional representation (QNF) by one in each iteration step. The recurrence iteration terminates earlier in case of gradual underflow. The gradual underflow signal (GUF) is asserted high when $\mathrm{QNI} = q_{\min}$ and the calculated integer quotient is less than $10^{p}$:

\[ \mathrm{QNI} = \begin{cases} q_{\min} & \text{if } \mathrm{GUF} = 1, \\ \mathrm{QNF} - (p - 1) & \text{if } z_{1} > 0, \\ \mathrm{QNF} - p & \text{if } z_{1} = 0. \end{cases} \tag{29} \]

The extended quotient (composed of the calculated quotient and the rounding digit) must be decremented by one if the final remainder is less than zero. The sticky bit (sb) is used for rounding and is asserted high whenever the remainder is unequal to zero.

Figure 10: Block diagram of the floating-point divider. [Blocks, from inputs to outputs: DPD decoders for the dividend and the divisor; NaN handling; leading zeros count with QNF and QP calculations; dividend and divisor normalization; normalized fixed-point division; on-the-fly conversion with gradual underflow handling; exponent calculation; rounding unit; exception unit; DPD encoder.]

Moreover, the number of trailing zeros (TZ) is computed on-the-fly. This number indicates by how many digits the computed quotient might be shifted to the right without losing accuracy. The number of trailing zeros has to be counted only for Z′ because, if ZM1′ is selected, the residual will be unequal to zero and TZ = 0 anyway. The number of trailing zeros is used for the selection between the preferred exponent (QP) and the normalized exponent for integer representation (QNI), as described in the following paragraph.

The exponent calculation unit selects either QP or QNI. The preferred exponent should be selected whenever possible. This is only feasible when $\mathrm{QNI} < \mathrm{QP} \le \mathrm{QNI} + \mathrm{TZ}$, that is, when there is a sufficient number of trailing zeros to shift the significand to the right. Furthermore, the final exponent must satisfy the minimum and maximum exponent range. The algorithm of the exponent calculation is illustrated in Algorithm 9, where the value

\[ \mathrm{offset2} = 783 - \mathrm{bias} = 783 - 398 = 385 \tag{30} \]

is used to add the IEEE 754-2008 bias and to remove offset1 again (which was introduced in (27)).

$\mathrm{QNI}' = \mathrm{QNF} + 1$; $Z' = \mathrm{ZM1}' = \mathrm{ZP1}' = 0$; $\mathrm{TZ}' = 0$; $j = 0$

while ($Z' < 10^{p}$) and ($\mathrm{QNI}' \ge q_{\min}$)

\[ Z' = \begin{cases} 10 \cdot Z' + z_{j+1} & \text{if } z_{j+1} \ge 0, \\ 10 \cdot \mathrm{ZM1}' + (z_{j+1} + 10) & \text{if } z_{j+1} < 0 \end{cases} \]

\[ \mathrm{ZM1}' = \begin{cases} 10 \cdot Z' + (z_{j+1} - 1) & \text{if } z_{j+1} > 0, \\ 10 \cdot \mathrm{ZM1}' + (z_{j+1} + 9) & \text{if } z_{j+1} \le 0 \end{cases} \]

\[ \mathrm{ZP1}' = \begin{cases} 10 \cdot Z' + (z_{j+1} + 1) & \text{if } z_{j+1} \ge -1, \\ 10 \cdot \mathrm{ZM1}' + (z_{j+1} + 11) & \text{if } z_{j+1} < -1 \end{cases} \]

\[ \mathrm{TZ}' = \begin{cases} \mathrm{TZ}' + 1 & \text{if } z_{j+1} = 0, \\ 0 & \text{else} \end{cases} \]

    $\mathrm{QNI}' = \mathrm{QNI}' - 1$
    $j = j + 1$

$\mathrm{QNI} = \mathrm{QNI}' + 1$ (remove the impact of the rounding digit)

\[ R' = \begin{cases} z_{j+1} - 1 & \text{if remainder} < 0, \\ z_{j+1} & \text{else} \end{cases} \]

$\mathrm{ZM1}'' = \lfloor \mathrm{ZM1}' \cdot 10^{-1} \rfloor$; $Z'' = \lfloor Z' \cdot 10^{-1} \rfloor$; $\mathrm{ZP1}'' = \lfloor \mathrm{ZP1}' \cdot 10^{-1} \rfloor$

\[ (Z, \mathrm{ZP1}, R) = \begin{cases} (\mathrm{ZM1}'', Z'', R' + 10) & \text{if } R' < 0, \\ (Z'', \mathrm{ZP1}'', R') & \text{else} \end{cases} \]

\[ \mathrm{sb} = \begin{cases} 0 & \text{if remainder} = 0, \\ 1 & \text{else} \end{cases} \]

\[ \mathrm{GUF} = \begin{cases} 1 & \text{if } \mathrm{QNI} = q_{\min} \text{ and } Z' < 10^{p} \text{ and } (\mathrm{sb} = 1 \text{ or } R \ne 0), \\ 0 & \text{else} \end{cases} \]

if GUF = 1 then underflow

\[ \mathrm{TZ} = \begin{cases} 0 & \text{if } \mathrm{sb} = 1, \\ \mathrm{TZ}' & \text{else} \end{cases} \]

Algorithm 8: On-the-fly conversion with gradual underflow handling.
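The core register update of the on-the-fly conversion can be checked with a few lines of Python. This is a minimal sketch under simplifying assumptions: it models only the loop body for Z′, ZM1′, ZP1′, and TZ′, it omits the exponent tracking and the gradual underflow termination, and the function name `on_the_fly` is hypothetical. It also assumes that the first nonzero quotient digit is positive, as it is for a positive quotient.

```python
def on_the_fly(digits):
    """Convert signed radix-10 digits (most significant first) into (Z, Z-1, Z+1, TZ)."""
    z = zm1 = zp1 = tz = 0
    for d in digits:
        # All registers are computed from the old values of z and zm1,
        # which mirrors a parallel register update in hardware.
        new_z = 10 * z + d if d >= 0 else 10 * zm1 + (d + 10)
        new_zm1 = 10 * z + (d - 1) if d > 0 else 10 * zm1 + (d + 9)
        new_zp1 = 10 * z + (d + 1) if d >= -1 else 10 * zm1 + (d + 11)
        tz = tz + 1 if d == 0 else 0
        z, zm1, zp1 = new_z, new_zm1, new_zp1
    return z, zm1, zp1, tz


# 3*10^4 - 7*10^3 + 0*10^2 + 8*10 - 2 = 23078
print(on_the_fly([3, -7, 0, 8, -2]))  # (23078, 23077, 23079, 0)
```

Each step only appends one digit to either the old Z′ or the old ZM1′, so no carry has to propagate through the partial quotient.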

Rounding is required when the number of significant digits of the quotient exceeds the length $p$ of a decimal word. The divider presented in this paper provides the five rounding modes as requested by [4]. Rounding selects either the integer quotient $Z_e$ or the incremented quotient $\mathrm{ZP1}_e = Z_e + 1$. As proven by Theorem A.1 in the appendix, rounding overflow cannot occur in DFP division. Rounding overflow would occur if $Z_e + 1$ overflowed due to rounding. The selection of the quotient depends on the rounding mode, the rounding digit (R), the sticky bit (sb), and the least significant digit (LSD) of the quotient. The calculation of the round-up detection is summarized in Table 3.
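The round-up detection summarized in Table 3 can be sketched in Python as follows. The mode names and the function `round_up` are hypothetical. The sketch assumes that LSD(0) denotes the lowest bit of the least significant digit (set for an odd digit) and that the ties-to-even and round-toward-positive entries use the negated sticky bit and the negated sign, as the definitions of these rounding modes imply.

```python
def round_up(mode: str, sign: int, r: int, sb: int, lsd: int) -> bool:
    """Return True if the incremented quotient ZP1_e must be selected.

    sign: 1 for a negative result, r: rounding digit (0..9),
    sb: sticky bit, lsd: least significant digit of the quotient.
    """
    lsd_odd = bool(lsd & 1)  # LSD(0)
    if mode == "ties_to_even":
        return r > 5 or (r == 5 and bool(sb)) or (r == 5 and not sb and lsd_odd)
    if mode == "ties_to_away":
        return r >= 5
    if mode == "toward_positive":
        return (not sign) and (r > 0 or bool(sb))
    if mode == "toward_negative":
        return bool(sign) and (r > 0 or bool(sb))
    if mode == "toward_zero":
        return False
    raise ValueError(f"unknown rounding mode: {mode}")


# Exact tie (R = 5, sb = 0): round up only if the quotient is odd.
print(round_up("ties_to_even", 0, 5, 0, 2))  # False
print(round_up("ties_to_even", 0, 5, 0, 3))  # True
```

Because the quotient is handled as a magnitude, rounding toward positive increments only results with a positive sign, and rounding toward negative increments only results with a negative sign.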

Finally, the result is encoded again, and the exception unit might assert six exception signals. These are division by zero, invalid operation, result is infinite, inexact, overflow, and underflow. Division by zero is asserted when the divisor equals zero but the dividend is unequal to zero. The operation is called invalid when both operands are zero, both operands are infinity, or any of the operands is either a signaling or a quiet NaN. The inexact flag is asserted when the rounding digit or the sticky bit is unequal to zero. Overflow and underflow are computed as described in the exponent calculation unit and are listed in Algorithms 8 and 9. The infinity exception is asserted when the result is greater than the largest number, that is, when either the overflow or the division by zero exception is asserted, or when the dividend is infinite while the divisor is a finite number.

Table 3: Round-up detection.

Rounding mode | Round-up detection
Round Ties To Even | $(R > 5) + (R \equiv 5) \cdot \mathrm{sb} + (R \equiv 5) \cdot \overline{\mathrm{sb}} \cdot \mathrm{LSD}(0)$
Round Ties To Away | $(R \ge 5)$
Round Toward Positive | $\overline{\mathrm{sign}} \cdot ((R > 0) + \mathrm{sb})$
Round Toward Negative | $\mathrm{sign} \cdot ((R > 0) + \mathrm{sb})$
Round Toward Zero | $0$
Legend: "+" denotes logical OR; "·" denotes logical AND.

$\mathrm{offset2} = 783 - 398 = 385$
if ($\mathrm{QP} \le \mathrm{QNI}$) or ($\mathrm{QNI} + \mathrm{TZ} < \mathrm{QP}$) or ($\mathrm{QP} > q_{\max}$) then
    (a) $\mathrm{QNI} > q_{\max} \Rightarrow$ overflow
    (b) $q_{\min} \le \mathrm{QNI} \le q_{\max} \Rightarrow$
        $q_Z = \mathrm{QNI} - \mathrm{offset2}$
        $Z_e = Z$
        $\mathrm{ZP1}_e = \mathrm{ZP1}$
    (c) $\mathrm{QNI} < q_{\min} \Rightarrow$ underflow
else
    (a) $q_{\min} \le \mathrm{QP} \le q_{\max} \Rightarrow$
        $q_Z = \mathrm{QP} - \mathrm{offset2}$
        $Z_e = Z \gg (\mathrm{QP} - \mathrm{QNI})$
        $\mathrm{ZP1}_e = \mathrm{ZP1} \gg (\mathrm{QP} - \mathrm{QNI})$
    (b) $\mathrm{QNI} < q_{\min} \Rightarrow$ underflow

Algorithm 9: Exponent calculation.
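The selection between the preferred exponent QP and the exponent QNI (Algorithm 9) can be sketched in Python as shown below. The function name, the returned dictionary, and the numeric values of `q_min` and `q_max` in the usage example are hypothetical. The sketch assumes that "≫" denotes a right shift by decimal digits and that `q_min` and `q_max` are expressed in the same offset domain as QNI and QP.

```python
OFFSET2 = 783 - 398  # removes offset1 and adds the IEEE 754-2008 bias, see (30)


def select_exponent(qp, qni, tz, z, zp1, q_min, q_max):
    """Choose the result exponent and the matching quotient significands."""
    if qp <= qni or qni + tz < qp or qp > q_max:
        # The preferred exponent is not reachable: keep the normalized result.
        if qni > q_max:
            return {"status": "overflow"}
        if qni < q_min:
            return {"status": "underflow"}
        return {"status": "ok", "q_z": qni - OFFSET2, "z_e": z, "zp1_e": zp1}
    if q_min <= qp <= q_max:
        # Enough trailing zeros: shift right by (QP - QNI) decimal digits.
        shift = 10 ** (qp - qni)
        return {"status": "ok", "q_z": qp - OFFSET2,
                "z_e": z // shift, "zp1_e": zp1 // shift}
    return {"status": "underflow"}


# Placeholder exponent limits, chosen only to exercise both branches.
print(select_exponent(802, 800, 3, 1234500, 1234501, q_min=0, q_max=1000))
# {'status': 'ok', 'q_z': 417, 'z_e': 12345, 'zp1_e': 12345}
```

When the quotient has too few trailing zeros, for example `tz=1` in the call above, the first branch applies and the normalized significand is kept with the exponent derived from QNI.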

6. Implementation Results

All dividers are modeled using VHDL and are implemented for Xilinx Virtex-5 devices with speed grade −2 using Xilinx ISE 10.1. The post-place and route results of the fixed-point dividers are listed in Table 4.

All five types of fixed-point dividers require 19 cycles to perform a division. This includes 17 cycles to determine the quotient digits and the rounding digit, one additional cycle to normalize the result (when the first quotient digit is zero), and one cycle of latency to perform the on-the-fly conversion and the sticky bit determination. As expected, the type1 divider uses the most resources in terms of look-up tables (LUTs) and flip-flops (FFs). The type2, type3.a, and type3.b dividers consume fewer LUTs but require five (type2) or two (type3.a and type3.b) additional BRAMs. However, comparing the delays of the dividers leads to an unexpected result. Contrary to decimal divider implementations in ASIC designs, on FPGA platforms the shift-and-subtract algorithm is the fastest. The reason is that the signal propagation on the FPGA's internal fast carry chains is faster than interconnections between slices over the FPGA's general routing matrix. The type1 divider exploits these fast carry chains, whereas the type2, type3.a, type3.b, and type4 dividers only use the normal slow slice interconnection resources. In ASIC implementations, for instance, the longest paths of the type4 divider would be much shorter than the longest path of the type1 divider, but on FPGA architectures the situation is the opposite.

One of the fastest decimal fixed-point dividers on ASICs is the design of Lang and Nannarelli [14]. They minimize the critical path in the digit recurrence by using a fast quotient digit selection function based on binary carry-propagate adders that subtract fixed values from the estimation of the current residual. These fixed values depend on estimations of the divisor and are precomputed and stored in a ROM. Therefore, the latency of the ROM does not contribute to the critical path of the digit recurrence. The implementation of such a divider on FPGAs would have a poor performance because the required carry-save adder with small error estimation cannot be implemented efficiently on the FPGA's slice structure. However, the concept of removing the ROM from the critical path is applied to the type3.a and type3.b dividers, and the corresponding post-place and route results are shown in Table 4. These results show that the type3.a divider utilizes a similar number of LUTs as the type2 divider, but its cycle time is much higher. The increased cycle time can be explained by the raised complexity of the multiplexers for the selection of the divisor's multiples in the digit recurrence step. These multiplexers show a poor performance in terms of propagation delay because they are implemented as a tree with slow slice interconnections. In contrast, the type3.b divider applies dedicated fast multiplexers, which have a propagation delay of only one LUT instance and eight fast carry chains. This dedicated fast multiplexer speeds up the divider and reduces the maximum cycle time by one nanosecond. Unfortunately, the number of used LUTs increases dramatically, by approximately 1000 LUTs.

The comparison of the type2, type3.a, type3.b, and type4 dividers in terms of speed and area (number of occupied LUTs and FFs combined) shows that the type2 algorithm is the fastest and requires the least number of slices. Hence, there is no benefit in removing the ROM's latency from the critical path because the complexity is moved to the selection of the divisor's multiples. The type1 divider is only 5% faster than the type2 divider, but the speed is bought at a high price because the number of occupied slices is 75% higher. Therefore, the type2 divider shows a good tradeoff in terms of area and latency and is used for the floating-point divider presented in this paper. In the following paragraph, we compare the type2 implementation with other published fixed-point dividers.
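The derived rows of Table 4 follow directly from the cycle times and resource counts, which the following Python sketch recomputes. It assumes that the cycle times read 8.1, 8.5, 10.1, 9.1, and 12.1 ns; the dictionary layout and variable names are hypothetical.

```python
import math

CYCLES = 19  # cycles per fixed-point division
cycle_ns = {"type1": 8.1, "type2": 8.5, "type3.a": 10.1, "type3.b": 9.1, "type4": 12.1}
luts_ffs = {"type1": 3868, "type2": 2210, "type3.a": 2304, "type3.b": 3240, "type4": 2203}


def nearest(x: float) -> int:
    """Round to the nearest integer, with halves rounded up."""
    return math.floor(x + 0.5)


latency_ns = {k: nearest(CYCLES * t) for k, t in cycle_ns.items()}
fmax_mhz = {k: nearest(1000.0 / t) for k, t in cycle_ns.items()}
area_rel = {k: round(a / luts_ffs["type2"], 2) for k, a in luts_ffs.items()}
speed_rel = {k: round(cycle_ns["type2"] / t, 2) for k, t in cycle_ns.items()}

print(latency_ns)  # {'type1': 154, 'type2': 162, 'type3.a': 192, 'type3.b': 173, 'type4': 230}
print(fmax_mhz)    # {'type1': 123, 'type2': 118, 'type3.a': 99, 'type3.b': 110, 'type4': 83}
print(area_rel["type1"], speed_rel["type1"])  # 1.75 1.05
```

The last line reproduces the statement above: the type1 divider is about 5% faster than the type2 divider but occupies about 75% more LUTs and FFs combined.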

Four other FPGA-based dividers are presented by Ercegovac and McIlhenny [17, 18], Deschamps and Sutter [24], Zhang et al. [25], and Vestias and Neto [12]. These implementations are compared to our type2 divider in the following. The divider presented in [17, 18] is based on a digit recurrence algorithm that only requires limited-precision multipliers, adders, and LUTs. Furthermore, a compensation term is computed in the digit recurrence that compensates the error caused by this limited precision. The design is optimized for Virtex-5 FPGAs and has a good area characteristic (a 14-digit divider requires 1263 LUTs) but suffers from a high cycle time of 13.1 ns, which is more than 50% higher compared to the type1 or type2 design proposed in this paper. The reason for that high cycle time is (similar to the type4 divider) the long critical paths across many LUTs that cannot exploit the FPGA's fast carry chains. Therefore, the design would probably fit better on ASICs with dedicated routing. The dividers implemented in [24] apply two different digit recurrence algorithms with a redundancy of $\rho = 1$. The quotient digit selection function is very complex because it requires three MSDs of the divisor and four MSDs of the residual. The cycle times are high, but it must be taken into account that these designs are optimized for Virtex-4 FPGAs. Nevertheless, the number of used LUTs is very high. The divider presented in [25] applies a radix-100 digit recurrence algorithm with comprehensive prescaling of the dividend and divisor. It is optimized for Virtex-II Pro devices; hence, it is hard to compare it with the dividers presented in this paper. However, the critical path includes a decimal carry-save adder tree, two carry-propagate adder stages, and three multiplexer stages. It can therefore be assumed that this algorithm would also have a poor performance on Virtex-5 devices, although two quotient digits are retired in each iteration step. Vestias and Neto [12] propose four different decimal dividers optimized for different speed and area tradeoffs. Contrary to the circuits presented in this paper, the dividers apply the Newton-Raphson algorithm. The used multiplier is sequential; therefore, many cycles per division are required. However, each divider has a higher delay and a greater LUT usage compared to our type2 implementation.

Table 5 lists the results of the dividers presented in [12, 17, 24, 25] together with the results of the type2 divider. For a fair performance comparison, the precision of the type2 divider is equalized to match the number of calculated quotient digits of the corresponding divider. Furthermore, it should be noted that the dividers presented in [12, 17, 24, 25] do not provide rounding support.

Table 4: Results for normalized decimal fixed-point dividers, $p = 16 + 1$.

Metric | Type1 | Type2 | Type3.a | Type3.b | Type4
Number of LUTs | 3595 | 1704 | 1749 | 2751 | 1846
Number of FFs | 1234 | 1126 | 1262 | 1262 | 1031
Number of LUTs and FFs combined | 3868 | 2210 | 2304 | 3240 | 2203
Number of 36 k BRAM | 0 | 5 | 2 | 2 | 0
Number of LUTs normalized to the type2 divider | 2.11 | 1.00 | 1.03 | 1.61 | 1.08
Number of FFs normalized to the type2 divider | 1.10 | 1.00 | 1.12 | 1.12 | 0.92
Number of LUTs and FFs combined normalized to the type2 divider | 1.75 | 1.00 | 1.04 | 1.47 | 1.00
Cycle time (ns) | 8.1 | 8.5 | 10.1 | 9.1 | 12.1
Overall latency (ns) (19 × cycle time) | 154 | 162 | 192 | 173 | 230
Max. frequency (MHz) | 123 | 118 | 99 | 110 | 83
Max. frequency normalized to the type2 divider | 1.05 | 1.00 | 0.84 | 0.93 | 0.70

Table 5: Performance comparison of fixed-point dividers.

Design | Occupied area | Cycle time (ns) | Latency (ns)
Ercegovac and McIlhenny [17] ($p = 14$, Virtex-5) | 1263 (6, 2) LUTs | 13.1 | 197
Type2 divider equalized to $p = 14$ | 1692 (6, 2) LUTs | 8.5 | 136
Deschamps and Sutter [24] ($p = 16$, Virtex-4): | | |
(i) Nonrestoring algorithm | 2974 (4, 1) LUTs | 21.4 | 386
(ii) SRT-like algorithm | 3799 (4, 1) LUTs | 16.6 | 300
Zhang et al. [25] ($p = 16$, Virtex-II Pro) | 3976 slices | 20.0 | 420
Vestias and Neto [12] ($p = 16$, Virtex-4, Newton-Raphson): | | |
(i) Type A3 (0 DSPs, 118 cycles) | 2756 (4, 1) LUTs | 3.4 | 394
(ii) Type A4 (0 DSPs, 90 cycles) | 3768 (4, 1) LUTs | 3.4 | 306
(iii) Type A5 (7 DSPs, 112 cycles) | 2091 (4, 1) LUTs | 3.4 | 380
(iv) Type A6 (10 DSPs, 86 cycles) | 2718 (4, 1) LUTs | 3.4 | 292
Type2 divider equalized to $p = 16$ | 1704 (6, 2) LUTs | 8.5 | 153

For the floating-point implementation, the fixed-point divider of type2 is used because type1 and type3.b consume too many resources and type3.a and type4 are too slow. This choice requires five additional BRAMs, but these are available in sufficient quantities on Virtex-5 devices. The corresponding post-place and route results for two configurations with different numbers of serialization stages are listed in Table 6. A comparison with other implementations is complicated because there are no other FPGA-based implementations of decimal floating-point dividers published yet. Therefore, we can only compare the decimal divider with binary dividers implemented on the same FPGA. In Table 7, post-place and route results of binary floating-point dividers for the double data format on a Virtex-5, provided by the Xilinx Core Generator [26], are listed. The dividers also provide the exception signals overflow, underflow, invalid operation, and divide-by-zero. Unfortunately, the binary floating-point divider does not support gradual underflow because this feature increases the complexity. The divider with one cycle per division is an unpipelined and fully parallel design, and the divider that requires 55 cycles per division is a fully sequential design. The binary floating-point divider with 20 cycles per division is best suited for a comparison with the decimal floating-point divider because it requires a comparable number of cycles for one operation. Obviously, decimal arithmetic has a great overhead regarding the total number of used LUTs, but the cycle time and latency are similar. This resource overhead can be explained, on the one hand, by the inefficient representation of decimal values on binary logic and, on the other hand, by the more complex specification of the decimal standard IEEE 754-2008. For instance, decimal floating-point division requires additional encoders, decoders, and a normalization stage.

Table 6: Post-place and route results of the floating-point divider based on type2.

Metric | Config. 1 | Config. 2
Number of cycles per float division | 23 | 21
Number of LUTs | 3231 | 3205
Number of FFs | 1630 | 1377
Number of LUTs and FFs combined | 3571 | 3566
Cycle time (ns) | 8.5 | 9.1
Overall latency (ns) | 195.5 | 191
Max. frequency (MHz) | 118 | 110

Table 7: Post-place and route results of a 64-bit binary floating-point divider (Core Gen).

Cycles per division | 1 | 10 | 20 | 55
No. of LUTs | 3161 | 592 | 425 | 394
No. of FFs | 196 | 518 | 469 | 542
No. of LUTs and FFs comb. | 3232 | 917 | 675 | 675
Cycle time (ns) | 153.7 | 20.8 | 8.9 | 6.8
Overall latency (ns) | 153.7 | 208 | 178 | 374
Max. frequency (MHz) | 6.5 | 48 | 112 | 147

7. Conclusion

In this paper, we have presented five different radix-10 digit recurrence division algorithms. The type1 divider implements the simple shift-and-subtract algorithm. The type2 divider is based on a nonrestoring algorithm with a ROM-based quotient digit selection function. The type3.a and type3.b dividers are both similar to type2 but use fast binary carry-propagate adders in the quotient digit selection function. However, the type3.a divider uses LUT-based multiplexers, whereas the type3.b divider uses fast carry chain-based multiplexers. The type4 divider applies a new algorithm with constant digit selection functions. This type4 divider requires neither a ROM nor multiplexers to select multiples of the divisor.

We have shown that the shift-and-subtract algorithm, which has the worst latency in ASIC architectures, is the fastest design on FPGAs but uses the most FPGA resources in terms of LUTs. A good tradeoff between latency and area is the type2 architecture with a redundant carry-save representation of the residual, a radix-2 representation of the two MSDs, and a quotient digit selection function implemented in ROM. Furthermore, we have extended this fixed-point divider to a fully IEEE 754-2008 compliant decimal floating-point divider for the decimal64 data format, and finally we have shown implementation results of all dividers.

Appendix

Proofs

Theorem A.1 (rounding overflow). Let us consider a decimal floating-point division $Q = X/D$ with the signs $s_i$, the significands $c_i$, the exponents $q_i$, and the precision $p$:

\[ (-1)^{s_Q} \cdot c_Q \cdot 10^{q_Q} = \frac{(-1)^{s_X} \cdot c_X \cdot 10^{q_X}}{(-1)^{s_D} \cdot c_D \cdot 10^{q_D}}. \tag{A.1} \]

Then rounding overflow, that is, an overflow that arises due to adding +1 to the least significant digit of the final result, cannot occur.

Proof. Assume to the contrary that there are a floating-point dividend X, a floating-point divisor D, and a proper rounding mode where the decimal floating-point division produces a rounding overflow. Without loss of generality, we consider in the following only positive operands ($s_X = s_D = 0$), normalized significands ($c_X, c_D \ge 10^{p-1}$), and the exponents to be zero ($q_X = q_D = 0$). In the case of rounding overflow, the $p$ digits of the quotient must be all nines and the remainder must be unequal to zero, that is,

\[ \underbrace{c_X}_{p \text{ digits}} = c_Q \cdot c_D + \mathrm{rem} = \underbrace{9.9 \cdots 9}_{p \text{ digits}} \cdot \underbrace{c_D}_{p \text{ digits}} + \underbrace{\mathrm{rem}}_{p \text{ digits}}, \tag{A.2} \]

\[ \text{with } 0 < \mathrm{rem} < c_D \cdot 10^{-p+1}. \tag{A.3} \]

Condition (A.2) is fulfilled for

\[ c_D = 10^{p-1}, \qquad c_X = \underbrace{99 \cdots 9}_{p \text{ digits}}, \tag{A.4} \]

but this violates condition (A.3) because the remainder is zero ($\mathrm{rem} = 0$), that is, the result is exact, no round-up is applied, and hence no rounding overflow occurs.


Otherwise, if the divisor is greater than $10^{p-1}$, it follows that

\[ \begin{aligned} \underbrace{c_X}_{p \text{ digits}} &= (9.9 \cdots 9) \cdot c_D + \mathrm{rem} \\ &= (10 - 10^{-p+1}) \cdot c_D + \mathrm{rem} \\ &= 10 \cdot c_D + \underbrace{(-10^{-p+1} \cdot c_D + \mathrm{rem})}_{< 0} \\ &= \underbrace{10 \cdot c_D + \mu}_{\ge p + 1 \text{ digits}} \end{aligned} \tag{A.5} \]

with $-10 < \mu < 0$. The left term of the equation has, by definition, a precision of $p$ digits, while the right term of the equation has a precision of more than $p$ digits. This contradiction finally proves that no rounding overflow can occur.
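For small precisions, Theorem A.1 can also be checked exhaustively. The following Python sketch enumerates all pairs of normalized $p$-digit significands and searches for an inexact quotient whose truncated $p$-digit significand consists only of nines; the function name is hypothetical, and the brute-force search is an illustration, not part of the proof.

```python
def rounding_overflow_cases(p: int):
    """List all (c_X, c_D) whose truncated p-digit quotient is all nines and inexact."""
    lo, hi = 10 ** (p - 1), 10 ** p
    cases = []
    for c_d in range(lo, hi):
        for c_x in range(lo, hi):
            # Scale the dividend so that the integer quotient has exactly p digits:
            # the quotient lies in [1, 10) if c_x >= c_d, otherwise in (0.1, 1).
            scale = 10 ** (p - 1) if c_x >= c_d else 10 ** p
            sig, rem = divmod(c_x * scale, c_d)
            if sig == hi - 1 and rem != 0:
                cases.append((c_x, c_d))
    return cases


for p in (1, 2, 3):
    print(p, rounding_overflow_cases(p))  # an empty list for every p
```

The only all-nines quotient is the exact case $c_X = 99 \cdots 9$, $c_D = 10^{p-1}$ from (A.4), which is excluded because its remainder is zero.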

References

[1] M. Baesler, S. Voigt, and T. Teufel, "FPGA implementations of radix-10 digit recurrence fixed-point and floating-point dividers," in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig '11), pp. 13–19, IEEE Computer Society, Los Alamitos, CA, USA, December 2011.

[2] IEEE Task P754, ANSI/IEEE 754-1985, Standard for Binary Floating-Point Arithmetic, New York, NY, USA, August 1985.

[3] M. F. Cowlishaw, "Decimal floating-point: algorism for computers," in Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH '03), pp. 104–111, IEEE Computer Society, Washington, DC, USA, June 2003.

[4] IEEE Task P754, IEEE 754-2008, Standard for Floating-Point Arithmetic, New York, NY, USA, August 2008.

[5] ANSI/IEEE, ANSI/IEEE Std 854-1987: An American National Standard, IEEE Standard for Radix-Independent Floating-Point Arithmetic, New York, NY, USA, October 1987.

[6] A. Y. Duale, M. H. Decker, H. G. Zipperer, M. Aharoni, and T. J. Bohizic, "Decimal floating-point in z9: an implementation and testing perspective," IBM Journal of Research and Development, vol. 51, no. 1-2, pp. 217–227, 2007.

[7] C. F. Webb, "IBM z10: the next-generation mainframe microprocessor," IEEE Micro, vol. 28, no. 2, pp. 19–29, 2008.

[8] L. Eisen, J. W. Ward, H. W. Tast, et al., "IBM POWER6 accelerators: VMX and DFU," IBM Journal of Research and Development, vol. 51, no. 6, pp. 663–683, 2007.

[9] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd, "Power7: IBM's next-generation server processor," IEEE Micro, vol. 30, no. 2, pp. 7–15, 2010.

[10] S. F. Oberman and M. J. Flynn, "Division algorithms and implementations," IEEE Transactions on Computers, vol. 46, no. 8, pp. 833–854, 1997.

[11] L. K. Wang and M. J. Schulte, "A decimal floating-point divider using Newton-Raphson iteration," Journal of VLSI Signal Processing Systems, vol. 49, no. 1, pp. 3–18, 2007.

[12] M. Vestias and H. Neto, "Revisiting the Newton-Raphson iterative method for decimal division," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '11), pp. 138–143, IEEE Computer Society Press, September 2011.

[13] H. Nikmehr, B. Phillips, and C. C. Lim, "Fast decimal floating-point division," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 9, pp. 951–961, 2006.

[14] T. Lang and A. Nannarelli, "A radix-10 digit-recurrence division unit: algorithm and architecture," IEEE Transactions on Computers, vol. 56, no. 6, pp. 727–739, 2007.

[15] A. Vazquez, E. Antelo, and P. Montuschi, "A radix-10 SRT divider based on alternative BCD codings," in Proceedings of the 25th IEEE International Conference on Computer Design (ICCD '07), pp. 280–287, IEEE Computer Society Press, Los Alamitos, CA, USA, October 2007.

[16] E. Schwarz and S. Carlough, "Power6 decimal divide," in Proceedings of the 18th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '07), pp. 128–133, IEEE Computer Society, July 2007.

[17] M. D. Ercegovac and R. McIlhenny, "Design and FPGA implementation of radix-10 algorithm for division with limited precision primitives," in Proceedings of the 42nd Asilomar Conference on Signals, Systems and Computers (ASILOMAR '08), pp. 762–766, IEEE Computer Society, Pacific Grove, Calif, USA, October 2008.

[18] M. D. Ercegovac and R. McIlhenny, "Design and FPGA implementation of radix-10 combined division/square root algorithm with limited precision primitives," in Proceedings of the 44th Asilomar Conference on Signals, Systems and Computers (Asilomar '10), pp. 87–91, IEEE Computer Society, Pacific Grove, Calif, USA, November 2010.

[19] F. Y. Busaba, C. A. Krygowski, W. H. Li, E. M. Schwarz, and S. R. Carlough, "The IBM z900 decimal arithmetic unit," in Proceedings of the 35th Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1335–1339, IEEE Computer Society, November 2001.

[20] M. Ercegovac and T. Lang, Division and Square Root: Digit-Recurrence Algorithms and Implementations, Kluwer Academic Publishers, Norwell, Mass, USA, 1994.

[21] M. Baesler, S. O. Voigt, and T. Teufel, "A decimal floating-point accurate scalar product unit with a parallel fixed-point multiplier on a Virtex-5 FPGA," International Journal of Reconfigurable Computing, vol. 2010, Article ID 357839, 13 pages, 2010.

[22] M. Baesler, S. O. Voigt, and T. Teufel, "A radix-10 digit recurrence division unit with a constant digit selection function," in Proceedings of the 28th IEEE International Conference on Computer Design (ICCD '10), pp. 241–246, IEEE Computer Society, Los Alamitos, CA, USA, October 2010.

[23] M. Baesler, S. O. Voigt, and T. Teufel, "An IEEE 754-2008 decimal parallel and pipelined FPGA floating-point multiplier," in Proceedings of the 20th International Conference on Field Programmable Logic and Applications (FPL '10), pp. 489–495, IEEE Computer Society, Washington, DC, USA, September 2010.

[24] J. P. Deschamps and G. Sutter, "Decimal division: algorithms and FPGA implementations," in Proceedings of the 6th Southern Programmable Logic Conference (SPL '10), pp. 67–72, IEEE Computer Society, March 2010.

[25] Y. Zhang, D. Chen, L. Chen, et al., "Design and implementation of a radix-100 decimal division," in Proceedings of IEEE Symposium on Circuit and System (ISCAS '09), Taipei, Taiwan, May 2009.

[26] Xilinx, Xilinx LogiCORE Floating-Point Operator v4.0, April 2008.


Page 3: Research Article Analysis of Fast Radix-10 Digit …downloads.hindawi.com/journals/ijrc/2013/453173.pdfnumbers. e decimal number 0.1 , for example, has a peri-odical continued fraction

International Journal of Reconfigurable Computing 3

Figure 1: Decimal interchange formats [4]. A number is encoded in three fields: the sign $s$ (1 bit), the combination field $G$ ($w + 5$ bits), and the trailing significand field ($t = J \cdot 10$ bits).

Table 1: Interchange format parameters [4].

Parameter                                   dec32    dec64    dec128
$k$, storage width (bits)                   32       64       128
$p$, precision (digits)                     7        16       34
$q_{\min}$, min. exponent                   -101     -398     -6176
$q_{\max}$, max. exponent                   90       369      6111
Bias $= E - q$                              101      398      6176
$s$, sign bit                               1        1        1
$w + 5$, combination field (bits)           11       13       17
$t$, trailing significand field (bits)      20       50       110

integer digit and fraction field, denoted by $m$. The exponent $q$ is applied when the significand is regarded as an integer, denoted by $c$. The relation is given through

$$e = q + p - 1, \qquad m = c \cdot b^{-p+1}. \tag{2}$$

In this paper we exclusively use the integer representation $c$ with the exponent $q$.

Unlike the binary floating-point format, a decimal floating-point number is not necessarily normalized. This leads to redundancy, and a decimal number might have multiple representations. The set of representations is called the floating-point number's cohort. For example, the numbers $123 \cdot 10^{0}$, $1230 \cdot 10^{-1}$, and $12300 \cdot 10^{-2}$ are all members of the same cohort. More precisely, if a number has $n \le p$ significant digits, the number of representations is $p - n + 1$.
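The cohort size $p - n + 1$ can be illustrated with a few lines of Python. This is a minimal sketch and not part of the presented hardware: the function name `cohort` is hypothetical, only positive coefficients are handled, and the exponent limits $q_{\min}$ and $q_{\max}$ of Table 1 are deliberately ignored.

```python
def cohort(coefficient: int, exponent: int, p: int):
    """Return all (c, q) pairs with c * 10**q equal to coefficient * 10**exponent
    and 0 < c < 10**p. The exponent range is ignored (assumption of this sketch)."""
    if coefficient <= 0:
        raise ValueError("sketch handles positive coefficients only")
    # Strip trailing zeros to obtain the shortest coefficient (n significant digits).
    while coefficient % 10 == 0:
        coefficient //= 10
        exponent += 1
    members = []
    # Append one zero at a time while the coefficient still fits into p digits.
    while coefficient < 10 ** p:
        members.append((coefficient, exponent))
        coefficient *= 10
        exponent -= 1
    return members


# decimal32 precision p = 7 and n = 3 significant digits: p - n + 1 = 5 members
members = cohort(123, 0, p=7)
```

For the example from the text, the first three members are $(123, 0)$, $(1230, -1)$, and $(12300, -2)$.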

IEEE 754-2008 defines three decimal interchange formats (decimal32, decimal64, and decimal128) of fixed width 32, 64, and 128 bits. As depicted in Figure 1, a floating-point number is encoded by three fields: the sign bit $s$, the combination field $G$, and the trailing significand field. The combination field encodes whether the number is finite, infinite, or not a number (NaN). Furthermore, in case of finite numbers, the combination field comprises the biased exponent ($E = q + \text{bias}$) and the most significant digit (MSD) of the significand. The remaining $3 \cdot J$ digits are encoded in the trailing significand field of width $t = J \cdot 10$. The trailing significand can be implemented either as a binary integer or as a densely packed decimal (DPD) number [4]. Binary encoding makes software implementations easier, whereas DPD encoding is favored by hardware implementations, as is the case in this paper.

The encoding parameters for the three fixed-width interchange formats are summarized in Table 1. This paper focuses on the data format decimal64 with DPD-coded significand. DPD encodes three decimal digits (four bits each) into a declet (10 bits) and vice versa [4]. It results in a storage overhead of only 0.343% per digit (10 bits compared with the $3 \log_2 10 \approx 9.966$ bits of information in three decimal digits).

IEEE 754-2008 defines five rounding modes: two modes to the nearest (round ties to even and round ties to away) and three directed rounding modes (round toward positive, round toward negative, and round toward zero) [4]. As floating-point operations are obtained by first performing the exact operation in the set of real numbers and then mapping the exact result onto a floating-point number, rounding is required whenever the significant digits cannot all be placed in a single word of length $p$. Moreover, inexact, underflow, or overflow exceptions are signaled when necessary.
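The effect of the five rounding modes on a division result can be reproduced in software with Python's standard `decimal` module, which offers a matching rounding constant for each mode. The sketch below is only a software reference and not the hardware rounding logic of this paper; the dictionary `MODES` and the helper `divide` are hypothetical names, and the precision of 16 digits mirrors decimal64.

```python
from decimal import (Decimal, Context, ROUND_HALF_EVEN, ROUND_HALF_UP,
                     ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN)

# IEEE 754-2008 rounding modes mapped onto the constants of the decimal module
MODES = {
    "roundTiesToEven": ROUND_HALF_EVEN,
    "roundTiesToAway": ROUND_HALF_UP,      # ties are rounded away from zero
    "roundTowardPositive": ROUND_CEILING,
    "roundTowardNegative": ROUND_FLOOR,
    "roundTowardZero": ROUND_DOWN,
}


def divide(x: str, d: str, mode: str, precision: int = 16) -> Decimal:
    """Divide x by d and round the exact quotient to `precision` digits."""
    ctx = Context(prec=precision, rounding=MODES[mode])
    return ctx.divide(Decimal(x), Decimal(d))


# 2/3 is inexact, so the directed modes differ in the last digit
results = {mode: divide("2", "3", mode) for mode in MODES}

# 1/8 = 0.125 is an exact tie at two digits of precision
tie_even = divide("1", "8", "roundTiesToEven", precision=2)
tie_away = divide("1", "8", "roundTiesToAway", precision=2)
```

The tie case shows the only difference between the two round-to-nearest modes: 0.125 becomes 0.12 under round ties to even and 0.13 under round ties to away.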

4. Decimal Fixed-Point Division

Oberman and Flynn [10] distinguish five different classes of division algorithms: digit recurrence, functional iteration, very high radix, table look-up, and variable latency, whereby many practical algorithms are combinations of multiple classes. Compared to binary arithmetic, decimal division is more complex. Currently, there are only a few publications concerning radix-10 division.

Wang and Schulte [11] describe a decimal divider based on the Newton-Raphson approximation of the reciprocal. The latency of a decimal Newton-Raphson approximation directly depends on the latency of the decimal multiplier. A pipelined multiplier has the advantage that more than one division operation can be processed in parallel; otherwise the efficiency is poor. However, the algorithm lacks remainder calculation, and the rounding is more complex. The first FPGA-based decimal Newton-Raphson dividers are presented in [12]. The dividers as well as the underlying multipliers are sequential; hence these dividers have a high latency.

Digit recurrence division is the most widely implemented class of division algorithms. It is an iterative algorithm with linear convergence; that is, a fixed number of quotient digits is retired in every iteration step. Compared to radix-2 arithmetic, radix-10 digit recurrence division is more complex because, on the one hand, decimal logic is less efficient by itself and, on the other hand, the range of the quotient digit selection (QDS) function comprises a larger digit set. Therefore, the performance of the digit recurrence divider depends on the choice of the QDS function and the implementation of decimal logic.

Nikmehr et al. [13] select quotient digits by comparing the truncated residual with limited-precision multiples of the divisor. Lang and Nannarelli [14] replace the divisor's multiples by comparative values obtained by a look-up table and decompose the quotient digit into a radix-2 digit and a radix-5 digit in such a way that only five and two times the divisor are required. Vazquez et al. [15] take a different approach: the selection constants in the QDS function are obtained from truncated multiples of the divisor, avoiding look-up tables. Therefore, the multiples are computed on-the-fly. Moreover, the digit recurrence iteration implements a slow carry-propagate adder, but an estimation of the residual is computed to make the determination of the quotient digits independent of this carry-propagate adder. The decimal divider of the Power6 microprocessor [16] uses extensive prescaling to bound the divisor to be greater than or equal to 1.0 but lower than 1.1112 in order to simplify the digit selection function. However, this prescaling is very costly: it requires a 2-digit multiply, and it needs overall six cycles on each operand. Furthermore, the digit selection function still requires a look-up table.

The first decimal fixed-point divider designed for FPGAs applies a digit recurrence algorithm and is proposed by Ercegovac and McIlhenny [17, 18]. However, it has a poor cycle time because it does not fit well on the slice structure of FPGAs, but it would probably fit well into ASIC designs with dedicated routing.

4.1. Digit Recurrence Division. In the following we use different decimal number representations. A decimal number $B$ is called BCD-$\beta_0\beta_1\beta_2\beta_3$ coded when $B$ can be expressed by

$$B = \sum_{i=0}^{p-1} B_i \cdot 10^{i} \quad \text{with} \quad B_i = \sum_{k=0}^{3} B_{ik} \cdot \beta_k, \quad B_{ik} \in \{0, 1\}. \tag{3}$$

A number representation is called redundant if one or more digits have multiple representations.

Moreover, we consider a division $z = x/d$ in which the decimal dividend $x$ and the decimal divisor $d$ are positive and normalized fractional numbers of precision $p$, that is, $0.1 \le x, d < 1$. The quotient $z$ and the remainder $\mathrm{rem}$ are calculated as follows:

$$x = z \cdot d + \mathrm{rem}, \qquad \mathrm{rem} \le d \cdot \mathrm{ulp}, \qquad \mathrm{ulp} = 10^{-p}. \tag{4}$$

Then the radix-10 digit recurrence is implemented by

$$w[j+1] = 10\,w[j] - z_{j+1} \cdot d, \qquad j = 0, 1, \ldots, p, \tag{5}$$

with the initial residual $w[0] = x/10$ and with the quotient digit calculated by the selection function

$$z_{j+1} = \mathrm{SEL}(10\,w[j], d) \quad \text{with} \quad z = z_1 z_2 z_3 \cdots z_p = \sum_{j=0}^{p-1} z_{j+1} \cdot 10^{-j}. \tag{6}$$

Digit recurrence division is subdivided into the classes of restoring and nonrestoring algorithms, which differ in the quotient digit selection (QDS) function and the dynamical range of the residual. A restoring divider selects the next positive quotient digit $0 \le z_{j+1} \le 9$ such that the next partial residual is as small as possible but still positive. The IBM z900 architecture, for instance, implements such a decimal restoring divider [19]. By contrast, the QDS function of a decimal nonrestoring divider uses a digit set that is positive as well as negative, $-a \le z_{j+1} \le a$, and the partial residual $w[j+1]$ might also be negative.

One advantage of nonrestoring division is the enhanced performance, since estimates of limited precision ($\hat{w}[j] \approx w[j]$ and $\hat{d} \approx d$) might be used in the QDS function $\mathrm{SEL}(10\hat{w}[j], \hat{d})$. This performance gain is achieved by using a redundant digit set $-a \le z_j \le a$, which defines a redundancy factor

$$\rho = \frac{a}{10 - 1} = \frac{a}{9}. \tag{7}$$

Then, in each iteration step, the quotient digit $z_{j+1}$ should be selected such that the next residual $w[j+1]$ is bounded by

$$-\rho d \le w[j+1] \le \rho d, \tag{8}$$

which is called the convergence condition [20].

This paper presents five different decimal fixed-point dividers that are described in the following. The first divider (type1) is a restoring divider, whereas the four others (type2, type3.a, type3.b, and type4) are nonrestoring dividers.

4.2. Type1 QDS Function. The QDS function of the type1 algorithm is very simple. Nine multiples of the divisor ($1d, \ldots, 9d$) are subtracted in parallel from the residual by carry-propagate adders (CPAs), and the smallest nonnegative result is selected. The CPAs exploit the FPGA's internal fast carry logic as described in [21].

The nine multiples are precomputed and are composed of the multiples $1d$, $2d$, $5d$, $10d$ and their negatives (10's complement), which requires at most one additional CPA per multiple. The multiples $2d$ and $5d$ can be easily computed by digit recoding and constant shift operations:

$$(X)_{\text{BCD-5421}} \ll 1 \equiv (X \cdot 2)_{\text{BCD-8421}}, \tag{9}$$

$$(X)_{\text{BCD-8421}} \ll 3 \equiv (X \cdot 5)_{\text{BCD-5421}}, \tag{10}$$

where (9) is read as follows: a BCD-5421 coded number $X$ left-shifted by one bit is equivalent to the corresponding BCD-8421 coded number multiplied by two. In a similar fashion, we obtain a multiplication by five using (10).
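Identities (9) and (10) can be checked exhaustively in software. The following Python sketch is an illustration only: the helpers `encode`, `decode`, `times2`, and `times5` are hypothetical names, digits are packed into 4-bit groups of a plain integer, and the greedy encoder is assumed to produce the usual BCD-5421 code words (for example, 5 is `1000` and 9 is `1100`).

```python
W8421 = (8, 4, 2, 1)   # bit weights of a BCD-8421 digit, MSB first
W5421 = (5, 4, 2, 1)   # bit weights of a BCD-5421 digit, MSB first


def encode(value: int, weights, digits: int) -> int:
    """Pack a non-negative integer into `digits` 4-bit groups with the given weights."""
    word = 0
    for pos in range(digits):
        digit = (value // 10 ** pos) % 10
        nibble = 0
        for bit, weight in zip((8, 4, 2, 1), weights):
            if digit >= weight:          # greedy choice, MSB first
                nibble |= bit
                digit -= weight
        word |= nibble << (4 * pos)
    return word


def decode(word: int, weights, digits: int) -> int:
    """Interpret `digits` 4-bit groups of `word` with the given bit weights."""
    value = 0
    for pos in range(digits):
        nibble = (word >> (4 * pos)) & 0xF
        digit = sum(w for bit, w in zip((8, 4, 2, 1), weights) if nibble & bit)
        value += digit * 10 ** pos
    return value


def times2(x: int, digits: int) -> int:
    """Eq. (9): shift the BCD-5421 word left by one bit and read it as BCD-8421."""
    return decode(encode(x, W5421, digits) << 1, W8421, digits + 1)


def times5(x: int, digits: int) -> int:
    """Eq. (10): shift the BCD-8421 word left by three bits and read it as BCD-5421."""
    return decode(encode(x, W8421, digits) << 3, W5421, digits + 1)
```

For instance, 37 in BCD-5421 is `0011 1010`; shifted left by one bit it becomes `0111 0100`, which reads as 74 in BCD-8421.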

The implementation of the type1 digit recurrence divider is characterized by a high area use due to the utilization of nine parallel CPAs. The corresponding algorithm of the type1 digit recurrence is shown in Algorithm 1.
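A behavioral model of the type1 recurrence may help to follow Algorithm 1. The Python sketch below is not the hardware description: it uses exact rational arithmetic instead of BCD adders, the name `type1_divide` is hypothetical, and the operands are assumed to satisfy $0.1 \le x, d < 1$.

```python
from fractions import Fraction


def type1_divide(x: Fraction, d: Fraction, digits: int):
    """Restoring radix-10 digit recurrence, eqs. (5) and (6), in exact arithmetic."""
    multiples = [k * d for k in range(1, 10)]       # 1d ... 9d, precomputed once
    w = x / 10                                      # initial residual w[0]
    quotient_digits = []
    for _ in range(digits):
        shifted = 10 * w
        deltas = [shifted - m for m in multiples]   # nine parallel subtractions
        # The differences decrease with k, so the number of nonnegative
        # differences equals the largest admissible quotient digit.
        z = sum(1 for delta in deltas if delta >= 0)
        w = shifted if z == 0 else deltas[z - 1]    # smallest nonnegative difference
        quotient_digits.append(z)
    return quotient_digits, w


# 0.5 / 0.4 = 1.25: digits 1, 2, 5, 0 and a zero final residual
example_digits, example_residual = type1_divide(Fraction(1, 2), Fraction(2, 5), 4)
```

The first digit carries the weight $10^{0}$, in agreement with (6), so the digit sequence 1, 2, 5, 0 represents the quotient 1.250.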

4.3. Type2 QDS Function. The approach of the type2 division algorithm is to implement the QDS function fully by a ROM. This ROM is addressed by estimates of the residual and divisor. The use of estimates is feasible because a signed digit set together with a redundancy greater than $1/2$ is used. The estimates $10\hat{w}[j]$ and $\hat{d}$ are obtained by truncation; that is, only a limited number of MSDs of the residual $10w[j]$ and divisor $d$ are regarded. Furthermore, the residual is implemented in BCD-4221 carry-save representation such that the maximum error introduced by an estimation with precision $t$ (one integer digit and $t - 1$ fractional digits) is bounded by $\epsilon = 2 \cdot 10^{-t+1}$. Negative residuals are represented by their 10's complement.

The divisor is subdivided into subranges $[d_i, d_{i+1})$ of equal width $d_{i+1} - d_i = 10^{-\delta}$. For each subrange $i$, the QDS function $z_{j+1} = \mathrm{SEL}_{\mathrm{ROM}}(10\hat{w}[j], \hat{d})$ is defined by selection constants $m_k(i)$:

$$z_{j+1} = k \iff m_k(i) \le 10\hat{w}[j] < m_{k+1}(i). \tag{11}$$

(1) Compute $\delta_1 = 10w[j] - d, \; \ldots, \; \delta_9 = 10w[j] - 9d$
(2) if $\delta_1 < 0$ then $z_{j+1} = 0$ and $w[j+1] = 10 \cdot w[j]$
    else if $\delta_2 < 0$ then $z_{j+1} = 1$ and $w[j+1] = \delta_1$
    ...
    else if $\delta_9 < 0$ then $z_{j+1} = 8$ and $w[j+1] = \delta_8$
    else $z_{j+1} = 9$ and $w[j+1] = \delta_9$

Algorithm 1: Pseudocode for type1 digit recurrence division.

These selection constants are bounded by selection intervals $m_k(i) \in [L_k, U_{k-1}]$ with the containment condition [20]

$$L_k(d_i) = (k - \rho)\, d_i, \tag{12a}$$

$$U_k(d_i) = (k + \rho)\, d_i. \tag{12b}$$

Furthermore, the continuity condition states that every value $10w[j]$ must belong to at least one selection interval [20]. Considering the maximum error due to truncation, the continuity condition can be expressed by

$$U_{k-1}(d_i) - L_k(d_{i+1}) \ge \epsilon = 2 \cdot 10^{-t+1}. \tag{13}$$

If we suppose the subrange width to be constant ($d_{i+1} - d_i = 10^{-\delta}$) and consider that the minimum overlap occurs for the minimum divisor $d_{\min}$ and the maximum digit $k_{\max}$, then we obtain the condition

$$(k_{\max} - 1 + \rho)\, d_{\min} - (k_{\max} - \rho)\,(d_{\min} + 10^{-\delta}) \ge 2 \cdot 10^{-t+1}. \tag{14}$$

We choose $\rho = 8/9$ ($k_{\max} = 8$) and $d_{\min} = 0.4$, which leads to $\delta = 2$ and $t = 2$. Thus, the divisor's estimate $\hat{d}$ requires an accuracy of two fractional digits, and the residual's estimate $10\hat{w}$ requires an accuracy of one integer digit and one fractional digit. Furthermore, the divisor must be prescaled such that $0.4 \le d < 1.0$.
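The choice $\delta = 2$ and $t = 2$ can be verified numerically by evaluating the left-hand and right-hand sides of (14). The Python sketch below uses exact fractions; the helper names `min_overlap` and `max_error` are hypothetical and only serve this check.

```python
from fractions import Fraction

RHO = Fraction(8, 9)       # redundancy factor
K_MAX = 8                  # largest quotient digit
D_MIN = Fraction(4, 10)    # smallest prescaled divisor


def min_overlap(delta: int) -> Fraction:
    """Left-hand side of (14) for a divisor subrange width of 10**-delta."""
    step = Fraction(1, 10 ** delta)
    return (K_MAX - 1 + RHO) * D_MIN - (K_MAX - RHO) * (D_MIN + step)


def max_error(t: int) -> Fraction:
    """Right-hand side of (14): truncation error 2 * 10**(-t + 1)."""
    return Fraction(2, 10 ** (t - 1))


overlap = min_overlap(2)               # 6/25 = 0.24
feasible = overlap >= max_error(2)     # 0.24 >= 0.2
```

With $\delta = 1$ the overlap becomes negative ($-0.4$), and with $t = 1$ the error bound of 2 exceeds every possible overlap, so $\delta = 2$ and $t = 2$ are the smallest values that satisfy (14) for these parameters.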

The P-D diagram is a visualization technique for designing a quotient digit selection function and for computing the decision boundaries $m_k(i)$. It plots the shifted residual versus the divisor. The P-D diagram for the type2 quotient digit selection function with the selection constants $m_k(i)$ is shown in Figure 2.

The two MSDs of the divisor and residual address the ROM in order to obtain the signed quotient digit, which comprises 5 bits. In order to reduce the ROM's size, the two MSDs of the divisor and the residual are implemented by using a nonredundant radix-2 representation. This leads to an address that is composed of 15 bits: seven bits from the unsigned radix-2 representation of the divisor's two MSDs and eight bits from the signed radix-2 representation of the residual's two MSDs. Hence, the size of the ROM is $2^{15} \times 5$ bits. Unfortunately, the use of radix-2 representation complicates the multiplication by 10, which is required according to (5). Therefore, an additional binary adder is needed:

$$10 \cdot w[j] = 8 \cdot w[j] + 2 \cdot w[j] = (w[j] \ll 3) + (w[j] \ll 1). \tag{15}$$

The corresponding algorithm for the calculation of the ROM entries can be derived from the P-D diagram and is depicted in Algorithm 2. It should be noted that the radix-2 representations $\hat{d}$ and $\hat{w}$ are truncations of $d$ and $w$, that is, $\hat{d} \le d$ and $\hat{w} \le w$.

Figure 2: P-D diagram for the type2 QDS function (shifted residual $10 \cdot w$ versus divisor $d$, with the selection constants $m_k(i)$ forming a staircase over the divisor subranges).

(1) compute $L_k(d_i)$ and $U_k(d_i)$ for each $d_i \in \{0.40, 0.41, \ldots, 0.99\}$ and $k \in \{-8, \ldots, +8\}$ according to (12a) and (12b)
(2) for all $10 \cdot w_n \in \{-8.8, -8.7, \ldots, +8.8\}$ with $|w| \le \rho \cdot d_{\max} = 0.88$, compute the quotient digits $z$ as follows:
    if $10 \cdot w_n \ge U_7 - \epsilon = U_7 - 0.2$ then $z = 8$
    else if $10 \cdot w_n \ge U_6 - 0.2$ then $z = 7$
    ...
    else if $10 \cdot w_n \ge U_0 - 0.2$ then $z = 1$
    else if $10 \cdot w_n \ge L_0$ then $z = 0$
    ...
    else if $10 \cdot w_n \ge L_{-8}$ then $z = -8$

Algorithm 2: Algorithm for the calculation of the type2 ROM entries.
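Algorithm 2 translates almost directly into a table generator. The following Python sketch is an illustration under stated assumptions: the names `select_digit` and `rom` are hypothetical, the table is indexed by the decimal estimates (in units of 0.1 and 0.01) rather than by the 15-bit radix-2 address, and estimates below $L_{-8}(d_i)$, which cannot occur for a bounded residual, are marked as don't-care with `None`.

```python
from fractions import Fraction

RHO = Fraction(8, 9)       # redundancy factor
EPS = Fraction(2, 10)      # truncation error bound for t = 2


def upper(k: int, d: Fraction) -> Fraction:
    return (k + RHO) * d   # U_k(d_i), eq. (12b)


def lower(k: int, d: Fraction) -> Fraction:
    return (k - RHO) * d   # L_k(d_i), eq. (12a)


def select_digit(w10: Fraction, d: Fraction):
    """Quotient digit for the estimates 10w and d, following Algorithm 2."""
    for k in range(8, 0, -1):                  # z = 8 ... 1
        if w10 >= upper(k - 1, d) - EPS:
            return k
    for k in range(0, -9, -1):                 # z = 0 ... -8
        if w10 >= lower(k, d):
            return k
    return None                                # don't-care entry (assumption)


# Key (n, i) stands for the estimates 10w = n/10 and d = i/100.
rom = {(n, i): select_digit(Fraction(n, 10), Fraction(i, 100))
       for i in range(40, 100) for n in range(-88, 89)}
```

For $\hat{d} = 0.40$, for example, the estimate $10\hat{w} = 3.0$ selects the digit 8 because $U_7(0.40) - 0.2 \approx 2.956$, whereas $10\hat{w} = 0$ selects the digit 0.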

Once the quotient digit $z_{j+1}$ has been determined, the multiple $z_{j+1} \cdot d$ is subtracted from the current residual to compute the next residual as stated in (5). The subtracter is a fast redundant BCD-4221 carry-save adder (CSA) as described in [21], except for the two MSDs, in which we apply radix-2 CPAs. Thus, the multiple $z_{j+1} \cdot d$ is composed of two summands,

$$z_{j+1} \cdot d = d^{1}_{j+1} + d^{2}_{j+1}, \tag{16}$$

where $d^{1,2} \in \{0, \pm 1d, \pm 2d, \pm 5d, \pm 10d\}$ are precomputed. The multiples $2d$ and $5d$ can be computed easily and quickly by digit recoding and constant shift operations as shown in (9) and (10). The 10's and 2's complements are applied for $d^{1,2} < 0$. For both the radix-2 and the BCD-4221 radix-10 representation, the complements can be computed by inverting each bit and adding one. In summary, the subtraction uses a (3:1) radix-2 CPA for the two MSDs and a redundant radix-10 (4:2) CSA for the remaining digits. The digit recurrence algorithm of the type2 divider is summarized in Algorithm 3, and its block diagram is depicted in Figure 3.

Figure 3: Block diagram of type2 digit recurrence (ROM-based digit selection, two multiplexers selecting among the multiples $\times 1$, $\times 2$, $\times 5$, $\times 10$, inverters, a binary CPA for the two MSDs, a BCD-4221 CSA for the remaining digits, and an L1 shifter).

(1) Select $z_{j+1} = \mathrm{SEL}_{\mathrm{ROM}}(10\hat{w}[j], \hat{d})$
(2) Compute $(w_s[j+1] + w_c[j+1]) = 10\,(w_s[j] + w_c[j]) - z_{j+1} \cdot d$

Algorithm 3: Pseudocode for the type2 digit recurrence.
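That two summands from the set $\{0, \pm 1d, \pm 2d, \pm 5d, \pm 10d\}$ suffice for every quotient digit in $\{-8, \ldots, +8\}$ can be confirmed by a short search. The Python sketch below is illustrative only: the names `BASE`, `decompose`, and `table` are hypothetical, and the particular pair returned for a digit is just the first one found, not necessarily the assignment used in the implemented dividers.

```python
from itertools import product

# Coefficients of the precomputed divisor multiples available to each summand
BASE = (0, 1, -1, 2, -2, 5, -5, 10, -10)


def decompose(z: int):
    """Return a pair (c1, c2) from BASE with c1 + c2 == z, as required by eq. (16)."""
    for c1, c2 in product(BASE, repeat=2):
        if c1 + c2 == z:
            return c1, c2
    raise ValueError(f"{z} is not a sum of two available multiples")


# One admissible decomposition z * d = c1 * d + c2 * d per quotient digit
table = {z: decompose(z) for z in range(-8, 9)}
```

Digits such as 3, 4, 6, 7, and 8 are covered, for example, by $1 + 2$, $5 - 1$, $5 + 1$, $5 + 2$, and $10 - 2$.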

4.4. Type3 QDS Function. The frequency-limiting component of the type2 divider is the digit recurrence step, which comprises the ROM (BRAM) to calculate the next quotient digit. This BRAM is slow compared to common FPGA logic. Hence, it appears to be beneficial to remove the slow BRAM from the critical path and use fast FPGA logic instead. To analyze the impact of the BRAM delays, we implement two dividers (type3.a and type3.b) which can be realized without BRAM in the critical path.

Lang and Nannarelli [14] propose a divider that implements a quotient digit selection function based on CPAs and sign detection. These CPAs subtract fixed values from the current residual with limited precision. These fixed values depend on the estimates of the divisor and are precomputed and stored in a ROM. Furthermore, the quotient digit is divided into two parts, which further simplifies the quotient digit selection function and reduces the critical path. Unfortunately, this divider cannot be implemented efficiently on FPGAs because the required carry-save adder with small error estimation shows a poor performance on the FPGA's slice structure. Nevertheless, to investigate the impact of BRAM delays in the type2 divider, we implement another two dividers (type3.a and type3.b) that have no BRAM in the critical path but a CPA-based quotient digit selection function instead.

The type3 dividers are modifications of the type2 divider. The common architectural features are

(i) the multiples of the divisor, which are composed of two components $z_{j+1} \cdot d = d^{1}_{j+1} + d^{2}_{j+1}$, where $d^{1,2} \in \{0, \pm 1d, \pm 2d, \pm 5d, \pm 10d\}$,
(ii) the fast redundant BCD-4221 carry-save adders for the digit recurrence step as stated in (5),
(iii) the radix-2 implementation of the two MSDs, and
(iv) the 10's and 2's complements for $d^{1,2} < 0$.

Furthermore, since the architectures of the type2 and type3 dividers are similar with the exception of the quotient digit selection function, most of the dividers' characteristics are also identical, including

(i) the P-D diagram,
(ii) the redundancy factor $\rho = 8/9$, $k_{\max} = 8$,
(iii) the need of prescaling the divisor, $0.4 \le d < 1.0$,
(iv) the accuracy of the divisor's estimation, $\delta = 2$, and
(v) the accuracy of the residual's estimation, $t = 2$.

Contrary to the type2 divider, the QDS functions of the type3 dividers are implemented by 16 carry-propagate adders with sign detection. These adders subtract fixed selection constants from the estimation of the current residual. The selection constants depend on the current divisor and are stored in a ROM, which is then no longer part of the critical path. The selection constants are computed according to Algorithm 4. The common digit recurrence algorithm of the type3.a and type3.b dividers is shown in Algorithm 5, and the corresponding block diagram is depicted in Figure 4.

Figure 4: Block diagram of type3 digit recurrence (the QDS function consists of a ROM holding the selection constants and 16 binary CPAs with sign detection; the remaining data path with multiplexers, inverters, binary CPA, BCD-4221 CSA, and L1 shifter corresponds to Figure 3).

(1) compute $L_k(d_i)$ and $U_k(d_i)$ for each $d_i \in \{0.40, 0.41, \ldots, 0.99\}$ and $k \in \{-8, \ldots, +8\}$ according to (12a) and (12b)
(2) compute the selection constants $m_k(d_i)$:
    $m_8(d_i) = U_7(d_i) - \epsilon = U_7(d_i) - 0.2$
    $m_7(d_i) = U_6(d_i) - 0.2$
    ...
    $m_1(d_i) = U_0(d_i) - 0.2$
    $m_0(d_i) = L_0(d_i)$
    ...
    $m_{-7}(d_i) = L_{-7}(d_i)$

Algorithm 4: Algorithm for the calculation of the type3 selection constants.

(1) if $m_8(d_i) - 10\hat{w}[j] < 0$ then $z_{j+1} = 8$
    else if $m_7(d_i) - 10\hat{w}[j] < 0$ then $z_{j+1} = 7$
    ...
    else if $m_{-7}(d_i) - 10\hat{w}[j] < 0$ then $z_{j+1} = -7$
    else $z_{j+1} = -8$
(2) Compute $(w_s[j+1] + w_c[j+1]) = 10\,(w_s[j] + w_c[j]) - z_{j+1} \cdot d$

Algorithm 5: Pseudocode for the type3 digit recurrence.

The sign signals of the binary carry-propagate adders are used to determine the corresponding quotient digit and to select the multiples of the divisor in the digit recurrence iteration step. The advantage of the type3 QDS function is its short critical path, which comprises only one 8-bit binary carry-propagate adder. However, this short critical path comes at a high price because the complexity is moved to the selection of the divisor's multiples. Hence, large fan-in multiplexers with poor latencies are required. In order to minimize the impact of these multiplexers, we designed a new dedicated (17:1) multiplexer that exploits the FPGA's fast carry chains and has a delay of only one LUT instance. In the following, we name the divider with improved fast multiplexers type3.b and the divider with traditional multiplexers type3.a. The new dedicated (17:1) multiplexer uses the 16 sign bits as select lines and is coded as follows:

$$
\begin{aligned}
&\text{if } \mathrm{sel} = \texttt{'1111111111111111'} \text{ then } y = x(16) \\
&\text{if } \mathrm{sel} = \texttt{'0111111111111111'} \text{ then } y = x(15) \\
&\qquad \vdots \\
&\text{if } \mathrm{sel} = \texttt{'0000000000000001'} \text{ then } y = x(1) \\
&\text{if } \mathrm{sel} = \texttt{'0000000000000000'} \text{ then } y = x(0)
\end{aligned}
\tag{17}
$$

with $\mathrm{sel}(-1) = 1$ and $\mathrm{sel}(16) = 0$. In other words, bit $x(i)$ is selected when $\mathrm{sel}(i) = 0$ and $\mathrm{sel}(i - 1) = 1$. The selection of two input signals is implemented in one (5:1) LUT, and all signals are combined by a long OR gate that is implemented by exploiting the FPGA's fast carry chain. Therefore, one (17:1) multiplexer requires nine LUTs, and the type3.b divider with fast multiplexers uses many more LUTs than the type3.a divider (see Section 6). The block diagram of such a (17:1) multiplexer is depicted in Figure 5.
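The interplay of the 16 sign bits and the thermometer-coded multiplexer of (17) can be modeled in a few lines. The Python sketch below is a behavioral illustration, not the LUT-level design: the names `selection_constants`, `sign_bits`, and `mux17` are hypothetical, exact fractions replace the 8-bit binary CPAs, and the multiplexer inputs are simply the 17 quotient digits instead of divisor multiples.

```python
from fractions import Fraction

RHO = Fraction(8, 9)       # redundancy factor
EPS = Fraction(2, 10)      # truncation error bound for t = 2


def selection_constants(d_i: Fraction):
    """m_{-7}, ..., m_8 for one divisor subrange as in Algorithm 4 (index 0 is m_{-7})."""
    constants = []
    for k in range(-7, 9):
        if k >= 1:
            constants.append((k - 1 + RHO) * d_i - EPS)   # U_{k-1}(d_i) - eps
        else:
            constants.append((k - RHO) * d_i)             # L_k(d_i)
    return constants


def sign_bits(w10_estimate: Fraction, constants):
    """sel(i) = 1 when comparator i reports m - 10w < 0, as in Algorithm 5."""
    return [1 if m - w10_estimate < 0 else 0 for m in constants]


def mux17(x, sel):
    """Select x[i] where sel(i) = 0 and sel(i-1) = 1, with sel(-1) = 1 and sel(16) = 0."""
    padded = [1] + list(sel) + [0]         # padded[i + 1] holds sel(i)
    for i in range(17):
        if padded[i + 1] == 0 and padded[i] == 1:
            return x[i]
    raise ValueError("select lines are not thermometer coded")


digits = list(range(-8, 9))                # x(0) = -8, ..., x(16) = +8
constants = selection_constants(Fraction(40, 100))
z = mux17(digits, sign_bits(Fraction(0), constants))
```

Because the selection constants increase with $k$, the sign bits always form a thermometer code, so exactly one position satisfies $\mathrm{sel}(i) = 0$ and $\mathrm{sel}(i-1) = 1$; for $\hat{d} = 0.40$ and $10\hat{w} = 0$ the selected digit is 0.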

Figure 5: Fast multiplexer (nine (5:1) LUTs with INIT = F5F7FDFF combined along the carry chain, inputs $x(0), \ldots, x(16)$, select lines $\mathrm{sel}(0), \ldots, \mathrm{sel}(15)$, and output $Y$).

4.5. Type4 QDS Function. The type4 divider applies a new algorithm that we proposed in a previous paper [22]. It is based on the decomposition of the signed quotient digit into three components having values $\{-5, 0, 5\}$, $\{-2, 0, 2\}$, and $\{-1, 0, 1\}$ as well as a fast constant digit selection function. In this implementation, neither a ROM for the QDS function nor a multiplexer to select the multiple $z_j \cdot d$ in the digit recurrence iteration step is required. The divider is intended to utilize fewer resources than the other implementations. Like the type2, type3.a, and type3.b dividers, the type4 divider uses signed-digit redundant quotient calculation, carry-save representation of the residual, fast BCD-4221 CSAs for the digit recurrence, and a radix-2 implementation of the MSDs.

Quotient digits are decomposed into three components, $z_{j+1} = 5 \cdot z^{3}_{j+1} + 2 \cdot z^{2}_{j+1} + z^{1}_{j+1}$, and each component $z^{i}$ is computed by a distinct selection function. These components can hold three values, $z^{i} \in \{-1, 0, 1\}$, so that only two comparators per component are required to distinguish between these values. Since $-8 \le z_{j+1} \le 8$, we get a redundancy factor of $\rho = 8/9$. Furthermore, the selection functions become very simple due to prescaling of the divisor ($0.4 \le d < 0.8$); that is, they are constant functions $\mathrm{SEL}^{i}$ that no longer depend on the divisor:

$$z^{3}_{j+1} = \mathrm{SEL}^{3}(10\hat{w}[j]), \tag{18a}$$

$$z^{2}_{j+1} = \mathrm{SEL}^{2}(\hat{v}_1[j]), \tag{18b}$$

$$z^{1}_{j+1} = \mathrm{SEL}^{1}(\hat{v}_2[j]). \tag{18c}$$

The recurrence is then defined as follows:

$$v_1[j] = 10\,w[j] - z^{3}_{j+1} \cdot 5d, \tag{19a}$$

$$v_2[j] = v_1[j] - z^{2}_{j+1} \cdot 2d, \tag{19b}$$

$$w[j+1] = v_2[j] - z^{1}_{j+1} \cdot d. \tag{19c}$$

The multiples $2d$ and $5d$ can be computed easily and quickly according to (9) and (10). Each selection function requires two comparators, implemented as carry-save adders, that subtract constant values from the residuals' estimations ($\hat{w}[j]$, $\hat{v}_1[j]$, and $\hat{v}_2[j]$). As we will show in the following, these estimations require only the two most significant digits (MSDs) of the corresponding exact values.

First, the selection intervals $[L^{i}_{k}, U^{i}_{k}]$ for $z^{i}_{j+1} = k$ with $k \in \{-1, 0, 1\}$ and $i \in \{1, 2, 3\}$ have to be determined, where $L^{i}_{k}$ is the smallest and $U^{i}_{k}$ is the greatest value of the selection constant $m_k(i)$ such that the next residual is still bounded. Applying the convergence condition (8), the digit recurrence (19a), (19b), and (19c), and the redundancy factor $\rho = 8/9$, we obtain

$$\left| w[j] \right| \le \rho d = \frac{8}{9}\, d, \tag{20a}$$

$$\left| v_2[j] \right| \le (\rho + 1)\, d = \frac{17}{9}\, d, \tag{20b}$$

$$\left| v_1[j] \right| \le (\rho + 1 + 2)\, d = \frac{35}{9}\, d. \tag{20c}$$

From the recurrence $v_1[j] = 10w[j] - z^{3}_{j+1} \cdot 5d$, the bound $|v_1[j]| \le (35/9)\, d$, $z^{3}_{j+1} \in \{-1, 0, 1\}$, and replacing $10w[j]$ by the upper limit $U^{3}_{k}$, we get

$$U^{3}_{k} = (5k + \rho + 3)\, d = \left(5k + \frac{35}{9}\right) d. \tag{21a}$$

Similarly, we obtain the upper limits $U^{2}_{k}$ and $U^{1}_{k}$ from the recurrence (19b) and (19c), respectively:

$$U^{2}_{k} = (2k + \rho + 1)\, d = \left(2k + \frac{17}{9}\right) d, \tag{21b}$$

$$U^{1}_{k} = (k + \rho)\, d = \left(k + \frac{8}{9}\right) d. \tag{21c}$$

Likewise, the lower limits $L^{i}_{k}$ can be computed:

$$L^{3}_{k} = (5k - \rho - 3)\, d = \left(5k - \frac{35}{9}\right) d, \tag{22a}$$

$$L^{2}_{k} = (2k - \rho - 1)\, d = \left(2k - \frac{17}{9}\right) d, \tag{22b}$$

$$L^{1}_{k} = (k - \rho)\, d = \left(k - \frac{8}{9}\right) d. \tag{22c}$$

Due to the redundancy factor $\rho = 8/9 > 1/2$, we have overlapping regions $[L^{i}_{k+1}, U^{i}_{k}]$ where more than one quotient digit component $z^{i}_{j+1}$ may be selected. The decision boundary of a selection function should lie inside these overlapping regions. Figure 6 shows the P-D diagram for $z^{3}$. The figures for $z^{2}$ and $z^{1}$ look similar. As $z^{3}$ can hold three different values, the selection function requires two decision boundaries. Generally, it is implemented by a staircase function (see Figure 2), but for the bounded divisor $d \in [0.4, 0.8)$ it is independent of the divisor and is reduced to a constant function (see Figure 6). Fortunately, the selection functions for $z^{2}$ and $z^{1}$ are also independent of the divisor and can be implemented by constant functions in a similar fashion.

Figure 6: P-D diagram for the type4 QDS function (shifted residual $10 \cdot w$ versus divisor $d$, with the constant decision boundaries $m^{3}_{\mathrm{pos}} = 0.9$ and $m^{3}_{\mathrm{neg}} = -1.5$).

We now determine the required number of fractional digits for the estimates $10\hat{w}$, $\hat{v}_1$, and $\hat{v}_2$. Moreover, we choose suitable selection constants $m^{i}_{\mathrm{pos}}$ and $m^{i}_{\mathrm{neg}}$. These constants depend on the maximum error due to the estimates and on the minimum overlapping width, which is given by $U^{i}_{0}(0.4) - L^{i}_{1}(0.8)$ as well as $U^{i}_{-1}(0.8) - L^{i}_{0}(0.4)$. Since we use a BCD-4221 carry-save redundant digit representation, for a precision of $t$ digits (one integer digit and $t - 1$ fractional digits) we have a maximum error of $\epsilon = 2 \cdot 10^{-(t-1)}$. Moreover, the estimations $\hat{w}$ are computed by truncation ($w - \hat{w} \le \epsilon$). This means that for given positive and negative selection constants $m^{i}_{\mathrm{pos}} > 0$ and $m^{i}_{\mathrm{neg}} < 0$, the following expressions must be true:

$$U^{i}_{0}(0.4) - m^{i}_{\mathrm{pos}} \ge \epsilon, \qquad L^{i}_{1}(0.8) \le m^{i}_{\mathrm{pos}}, \tag{23}$$

$$L^{i}_{0}(0.4) \le m^{i}_{\mathrm{neg}}, \qquad U^{i}_{-1}(0.8) - m^{i}_{\mathrm{neg}} \ge \epsilon. \tag{24}$$
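Conditions (23) and (24) can be evaluated exactly for the selection constants of Table 2. The following Python sketch is a plausibility check under the stated definitions (21a) to (22c); the names `WEIGHT`, `SLACK`, `CONSTANTS`, and `margins` are hypothetical, and the supremum 0.8 of the divisor range is used as the upper corner.

```python
from fractions import Fraction

RHO = Fraction(8, 9)
EPS = Fraction(2, 10)                        # eps = 2 * 10**-(t-1) for t = 2
D_MIN, D_MAX = Fraction(4, 10), Fraction(8, 10)

WEIGHT = {3: 5, 2: 2, 1: 1}                  # component i weights the divisor by 5, 2, 1
SLACK = {3: RHO + 3, 2: RHO + 1, 1: RHO}     # bounds (20c), (20b), (20a) in units of d


def U(i: int, k: int, d: Fraction) -> Fraction:
    return (WEIGHT[i] * k + SLACK[i]) * d    # eqs. (21a)-(21c)


def L(i: int, k: int, d: Fraction) -> Fraction:
    return (WEIGHT[i] * k - SLACK[i]) * d    # eqs. (22a)-(22c)


# (m_pos, m_neg) for each component, taken from Table 2
CONSTANTS = {3: (Fraction(9, 10), Fraction(-15, 10)),
             2: (Fraction(1, 10), Fraction(-7, 10)),
             1: (Fraction(1, 10), Fraction(-3, 10))}


def margins(i: int):
    """Return (all conditions hold, margin of (23), margin of (24)) for component i."""
    m_pos, m_neg = CONSTANTS[i]
    margin_pos = U(i, 0, D_MIN) - m_pos
    margin_neg = U(i, -1, D_MAX) - m_neg
    ok = (margin_pos >= EPS and L(i, 1, D_MAX) <= m_pos          # (23)
          and L(i, 0, D_MIN) <= m_neg and margin_neg >= EPS)     # (24)
    return ok, margin_pos, margin_neg


report = {i: margins(i) for i in (3, 2, 1)}
```

The smallest margin occurs for $z^{1}$ and equals $19/90 \approx 0.211$, which is still above $\epsilon = 0.2$; a precision of $t = 2$ digits is therefore sufficient for all three selection functions.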

If we choose the selection constants as listed in Table 2, all conditions are fulfilled for a precision of $t = 2$ digits, and each selection function is reduced to two simple 2-digit comparators.

As soon as one component $z^{i}$ of the quotient digit is computed, the corresponding multiple of the divisor must be subtracted from the partial residual. Each component $z^{i}$ can hold three different values, $z^{i} \in \{-1, 0, +1\}$, that are multiplied with the corresponding weighted divisor $n \cdot d$ with $n \in \{5, 2, 1\}$. In other words, $n \cdot d$ is either passed, negated, or reset. Passing and resetting the multiple of the divisor is performed at no extra cost. Negation is accomplished by bit inversion and adding $+1$ through the carry inputs of the carry-save adders in (19a), (19b), and (19c). In summary, this is a very fast operation and requires only two LUTs per digit.

Table 2: Constants for type4 quotient digit selection function.

Component   $m^{i}_{\mathrm{pos}}$   $m^{i}_{\mathrm{neg}}$   $U^{i}_{0}(0.4) - m^{i}_{\mathrm{pos}}$   $U^{i}_{-1}(0.8) - m^{i}_{\mathrm{neg}}$
$z^{3}$     0.9                      -1.5                     0.656                                     0.611
$z^{2}$     0.1                      -0.7                     0.656                                     0.611
$z^{1}$     0.1                      -0.3                     0.255                                     0.211
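The complete type4 iteration (18a) to (19c) with the constants of Table 2 can be simulated as follows. This Python sketch is a behavioral model under simplifying assumptions: exact fractions replace the truncated carry-save estimates (so the estimation error is zero), the comparison directions in the hypothetical helper `sel` (select $+1$ for values at or above $m_{\mathrm{pos}}$, $-1$ for values below $m_{\mathrm{neg}}$) are an assumption, and the operands are expected as $0 \le x < 1$ and $0.4 \le d < 0.8$.

```python
from fractions import Fraction

RHO = Fraction(8, 9)
# (m_pos, m_neg) per component, taken from Table 2
M3 = (Fraction(9, 10), Fraction(-15, 10))
M2 = (Fraction(1, 10), Fraction(-7, 10))
M1 = (Fraction(1, 10), Fraction(-3, 10))


def sel(value: Fraction, constants) -> int:
    """Constant selection function with two comparators (comparison sense assumed)."""
    m_pos, m_neg = constants
    if value >= m_pos:
        return 1
    if value < m_neg:
        return -1
    return 0


def type4_divide(x: Fraction, d: Fraction, digits: int):
    """Digit recurrence (19a)-(19c) with the digit z = 5*z3 + 2*z2 + z1."""
    if not (0 <= x < 1 and Fraction(4, 10) <= d < Fraction(8, 10)):
        raise ValueError("operands outside the range assumed by this sketch")
    w = x / 10                               # initial residual w[0]
    quotient_digits = []
    for _ in range(digits):
        z3 = sel(10 * w, M3)
        v1 = 10 * w - z3 * 5 * d             # (19a)
        z2 = sel(v1, M2)
        v2 = v1 - z2 * 2 * d                 # (19b)
        z1 = sel(v2, M1)
        w = v2 - z1 * d                      # (19c)
        assert abs(w) <= RHO * d             # convergence condition (8)
        quotient_digits.append(5 * z3 + 2 * z2 + z1)
    return quotient_digits, w


# 0.5 / 0.4 = 1.25 is produced as the signed-digit sequence 2, -7, -5, 0
example_digits, example_residual = type4_divide(Fraction(1, 2), Fraction(2, 5), 4)
```

The signed digits evaluate to $2 - 0.7 - 0.05 = 1.25$, which shows how the redundant digit set represents the same quotient as the nonredundant digits 1, 2, 5 of the restoring divider.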

Similar to the type2 and type3 divisions, we implement the two MSDs of the type4 residual by using a nonredundant radix-2 representation, which requires fewer LUTs and results in a faster quotient digit selection function. The pseudocode for the digit recurrence iteration is shown in Algorithm 6, and the corresponding block diagram is depicted in Figure 7.
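The three-stage selection of the type4 recurrence can be modeled in software. The sketch below is an illustration under stated assumptions rather than the hardware design: it uses exact rational residuals instead of truncated carry-save estimates, takes the selection constants from Table 2, and assumes a scaled divisor $d \in [0.4, 0.8)$ and an initial residual $|x| \le \tfrac{8}{9} d$. The names `STAGES`, `select`, `type4_step`, and `type4_divide` are hypothetical.

```python
from fractions import Fraction

# (weight n, m_pos, m_neg) for the components z3, z2, z1 (constants of Table 2).
STAGES = (
    (5, Fraction(9, 10), Fraction(-15, 10)),
    (2, Fraction(1, 10), Fraction(-7, 10)),
    (1, Fraction(1, 10), Fraction(-3, 10)),
)


def select(v, m_pos, m_neg):
    """Constant-threshold selection of one component in {-1, 0, +1}."""
    if v >= m_pos:
        return 1
    if v < m_neg:
        return -1
    return 0


def type4_step(w, d):
    """One recurrence step: returns (quotient digit, next residual)."""
    v = 10 * w                      # shift the residual by one decimal digit
    digit = 0
    for n, m_pos, m_neg in STAGES:
        z = select(v, m_pos, m_neg)
        v -= z * n * d              # subtract the weighted divisor multiple
        digit += z * n              # digit = 5*z3 + 2*z2 + z1
    return digit, v


def type4_divide(x, d, steps):
    """Run `steps` iterations; returns the signed digits and final residual."""
    w, digits = x, []
    for _ in range(steps):
        z, w = type4_step(w, d)
        digits.append(z)
    return digits, w
```

With exact comparisons, each stage keeps the intermediate residual within the range that the remaining components can still correct, so the final residual satisfies $|w| \le \tfrac{8}{9} d$ and the digits lie in $\{-8, \ldots, 8\}$. For example, `type4_divide(Fraction(1, 4), Fraction(1, 2), 3)` yields the digits 5, 0, 0, that is, the quotient 0.5.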

4.6. Proposed Decimal Fixed-Point Divider. For each type of division algorithm presented in the preceding sections, we have implemented a corresponding fixed-point divider, which is described in the following. For all fixed-point dividers we expect the input operands to be normalized.

The block diagram of the type1 divider is depicted in Figure 8. First, the divisor multiples $2d, \ldots, 9d$ are precomputed. In the following cycles, $p = 16$ quotient digits (16 + 1 if the first digit is zero) are computed one by one, followed by an additional rounding digit. The quotient (Z) and the quotient +1 (ZP1) are computed on-the-fly by using two registers that are updated every cycle. The algorithm has the advantage that no additional slow decimal CPA is required to compute the incremented quotient. It is described more precisely in Section 5. The normalization of the result is also performed on-the-fly by locking the conversion when 16 + 1 quotient digits for the significand and the rounding digit have been computed; that is, in the worst case, when the first quotient digit is zero, 16 + 1 + 1 = 18 cycles are required. Furthermore, the sticky bit, which is required for rounding, is calculated. It is set to one whenever the final remainder is unequal to zero.

The architectures of the type2, type3.a, type3.b, and type4 dividers are similar in their structure but differ in their digit recurrence algorithm and scaling. The common block diagram is shown in Figure 9. First, the dividend and the divisor are rescaled,

$$d \in [0.1, 1.0) \;\Longrightarrow\; d' \in \begin{cases} [0.4, 1.0) & \text{for type2, type3.a/b}, \\ [0.4, 0.8) & \text{for type4}, \end{cases} \tag{25}$$

and five as well as two times the divisor are precomputed according to (9) and (10). Then the operands are recoded because the digit recurrence iteration uses a redundant digit representation with BCD-4221 carry-save adders.

The digit recurrence retires one signed quotient digit $z_k$ in each iteration. The on-the-fly conversion algorithm converts this signed-digit representation to the BCD-8421-coded


Figure 7: Block diagram of type4 digit recurrence.

Figure 8: Block diagram of the normalized type1 division.

(1) Select

$$z^3_{j+1} = \begin{cases} 1 & \text{if } 0.9 \le 10\hat{w}[j], \\ 0 & \text{if } -1.5 \le 10\hat{w}[j] < 0.9, \\ -1 & \text{if } 10\hat{w}[j] < -1.5, \end{cases}$$

and compute $(v^1_s[j] + v^1_c[j]) = 10\,(w_s[j] + w_c[j]) - z^3_{j+1} \cdot 5d$.

(2) Select

$$z^2_{j+1} = \begin{cases} 1 & \text{if } 0.1 \le \hat{v}^1[j], \\ 0 & \text{if } -0.7 \le \hat{v}^1[j] < 0.1, \\ -1 & \text{if } \hat{v}^1[j] < -0.7, \end{cases}$$

and compute $(v^2_s[j] + v^2_c[j]) = (v^1_s[j] + v^1_c[j]) - z^2_{j+1} \cdot 2d$.

(3) Select

$$z^1_{j+1} = \begin{cases} 1 & \text{if } 0.1 \le \hat{v}^2[j], \\ 0 & \text{if } -0.3 \le \hat{v}^2[j] < 0.1, \\ -1 & \text{if } \hat{v}^2[j] < -0.3, \end{cases}$$

and compute $(w_s[j+1] + w_c[j+1]) = (v^2_s[j] + v^2_c[j]) - z^1_{j+1} \cdot d$.

Algorithm 6: Pseudocode for the type4 digit recurrence.

Figure 9: Block diagram of the normalized type2, type3.a, type3.b, and type4 divisions.

result [20] and computes the quotient (Z), the quotient +1 (ZP1), and the quotient −1 (ZM1). This conversion is accomplished every iteration step and does not need a slow CPA. The incremented and decremented quotients are required for rounding. Moreover, the quotient is also normalized on-the-fly by locking the conversion when 16 + 1 quotient digits (including the rounding digit) have been computed. The algorithm is described explicitly in Section 5 because it is accomplished together with the gradual underflow handling of the floating-point divider.


if ($\overline{\mathrm{qNaN}_X}$ and $\overline{\mathrm{sNaN}_X}$) and ($\mathrm{qNaN}_D$ or $\mathrm{sNaN}_D$) then
    $C'_X = C_D$, $C'_D = 0 \ldots 01$
    $q'_X = q_D$, $q'_D = 0 + \mathrm{bias} = 398$
else if ($\mathrm{qNaN}_X$ or $\mathrm{sNaN}_X$) then
    $C'_X = C_X$, $C'_D = 0 \ldots 01$
    $q'_X = q_X$, $q'_D = 0 + \mathrm{bias} = 398$
else (no changes)
    $C'_X = C_X$, $C'_D = C_D$
    $q'_X = q_X$, $q'_D = q_D$

Algorithm 7: Algorithm of NaN handling.

Furthermore, the sticky bit, which is required for rounding, is calculated. It is set to one whenever the final remainder is unequal to zero. Moreover, when the final remainder is negative, the quotient of the type2, type3.a, type3.b, and type4 dividers has to be adjusted by subtracting one LSD. This subtraction does not need another slow CPA because the quotient −1 (ZM1) has already been computed on-the-fly. The calculations of both the sticky bit and the sign bit require the reduction of the redundant remainder by a CPA. This CPA might be subdivided into multiple smaller CPAs to keep the latency low.

5. IEEE 754-2008 Floating-Point Division

The IEEE 754-2008 compliant decimal floating-point divider is an extension of the normalized fixed-point divider. The divider presented in this paper supports the interchange format IEEE 754-2008 decimal64 with DPD encoding, but it can be easily adapted to any other precision and exponent range.

The block diagram of the divider is depicted in Figure 10. The DFP division begins with the decoding of the dividend X and the divisor D and the extraction of their signs, significands, and exponents. In the following, the significands of the operands X ($c_X$) and D ($c_D$) are regarded as integers. Therefore, the corresponding exponents are $q_X$ and $q_D$.

According to [4], if one of the operands is a signaling NaN (sNaN) or a quiet NaN (qNaN), then the result is also a quiet NaN with the payload of the original NaN. Hence, in order to preserve the payload while using an unmodified divider, the NaN handling unit sets the NaN-holding operand as the dividend and resets the divisor to $1.0 \cdot 10^{0}$, as depicted in Algorithm 7.
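The operand substitution of Algorithm 7 can be written as a small function. This is a sketch under assumptions: the function name `nan_handling` and its boolean NaN flags are hypothetical, the significands are plain integers that carry the payload, and the exponents are biased integers.

```python
BIAS = 398  # decimal64 exponent bias, as used in Algorithm 7


def nan_handling(x_is_nan, d_is_nan, c_x, q_x, c_d, q_d):
    """Return (c_x', q_x', c_d', q_d') so that a NaN payload ends up in the dividend.

    x_is_nan / d_is_nan are True if the operand is a quiet or signaling NaN.
    The divisor is reset to significand 1 with biased exponent 0 + BIAS,
    so the unmodified divider passes the payload through.
    """
    if not x_is_nan and d_is_nan:
        return c_d, q_d, 1, 0 + BIAS      # move the divisor's payload to the dividend
    if x_is_nan:
        return c_x, q_x, 1, 0 + BIAS      # keep the dividend's payload
    return c_x, q_x, c_d, q_d             # no NaN involved: no changes
```

If both operands are NaNs, the second branch applies and the payload of the dividend is preserved.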

The fixed-point dividers presented in Section 4 require normalized operands. Therefore, the number of leading zeros is counted for both operands, and the significands are normalized by barrel shifters. For performance reasons, the leading zeros counters exploit the FPGA's fast carry chains as proposed in [23].

The exponent of a floating-point division is first estimated by the exponent of the normalized quotient in fractional representation (QNF) and is updated iteratively in each digit recurrence step. Since most of the decimal floating-point numbers specified by IEEE 754-2008 have multiple representations of the same value, the result of a division might also have more than one correct representation, differing in the exponent. However, to obtain a unique and reproducible result in the case of such an ambiguity, IEEE 754-2008 defines the exponent of the result, which is called the preferred exponent (QP).

Both exponents QNF and QP are positive integers and are determined as follows:

$$\Delta q = (q_X - q_D), \qquad \Delta \mathrm{LZ} = (\mathrm{LZ}_X - \mathrm{LZ}_D), \tag{26}$$

$$\mathrm{QNF} = \Delta q - \Delta \mathrm{LZ} + \mathrm{offset1}, \qquad \mathrm{QP} = \Delta q + \mathrm{offset1}. \tag{27}$$

The offset1 is a bias and ensures that the exponents are positive, since unsigned integer calculation is used in this paper. The bias is composed of

$$\begin{aligned} \mathrm{offset1} = 783 ={} & 398 && \text{(IEEE 754-2008 bias)} \\ {}+{} & 369 && (q_{\max}) \\ {}+{} & 15 && \text{(conversion fractional to integer)} \\ {}+{} & 1 && \text{(normalization of the result)}. \end{aligned} \tag{28}$$
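Equations (26) to (28) translate directly into integer arithmetic. The sketch below is illustrative only: the helper `leading_zeros` counts decimal leading zeros of a $p$-digit integer significand in software (the hardware uses carry-chain based counters), and the names `OFFSET1`, `leading_zeros`, and `estimate_exponents` are hypothetical.

```python
# Composition of offset1 according to (28).
OFFSET1 = 398 + 369 + 15 + 1   # bias + q_max + fractional-to-integer + normalization


def leading_zeros(c: int, p: int = 16) -> int:
    """Number of leading decimal zeros of a p-digit integer significand."""
    return p if c == 0 else p - len(str(c))


def estimate_exponents(c_x, q_x, c_d, q_d, p=16):
    """Return (QNF, QP) according to (26) and (27)."""
    delta_q = q_x - q_d
    delta_lz = leading_zeros(c_x, p) - leading_zeros(c_d, p)
    qnf = delta_q - delta_lz + OFFSET1
    qp = delta_q + OFFSET1
    return qnf, qp
```

For instance, dividing $100 \cdot 10^{2}$ by $3 \cdot 10^{-1}$ gives $\Delta q = 3$ and $\Delta\mathrm{LZ} = 13 - 15 = -2$, hence QNF = 788 and QP = 786.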

The normalized fixed-point division unit then computes $p = 16$ significand digits and one additional rounding digit. Additionally, the divider encompasses the on-the-fly conversion unit, which also detects and handles gradual underflow with zero delay overhead. One signed quotient digit $z_j$ is retired in each cycle, and the on-the-fly conversion algorithm converts this signed-digit representation into the BCD-8421-coded partial result [20]. The conversion is accomplished every iteration step and does not need a slow CPA (see Algorithm 8). The partial quotient is stored in a register Z′. Moreover, two additional registers ZM1′ and ZP1′ are provided. ZM1′ is the partial quotient decremented by one least significant digit (LSD), and ZP1′ is the partial quotient incremented by one LSD. ZM1′ is selected when the final residual in the last iteration is negative, and ZP1′ is required by the following rounding operation. Furthermore, the exponent for the normalized significand in integer representation (QNI) is calculated (see (29)), and gradual underflow handling is performed at no extra cost (see Algorithm 8). The exponent QNI is computed by decrementing the exponent of the normalized significand in fractional representation (QNF) by one in each iteration step. The recurrence iteration terminates earlier in case of gradual underflow. The gradual underflow signal (GUF) is asserted high when $\mathrm{QNI} = q_{\min}$ and the calculated integer quotient is less than $10^{p}$:

$$\mathrm{QNI} = \begin{cases} q_{\min} & \text{if } \mathrm{GUF} = 1, \\ \mathrm{QNF} - (p - 1) & \text{if } z_1 > 0, \\ \mathrm{QNF} - p & \text{if } z_1 = 0. \end{cases} \tag{29}$$

The extended quotient (composed of the calculated quotient and the rounding digit) must be decremented by one if the


Figure 10: Block diagram of the floating-point divider.

final remainder is less than zero. The sticky bit (sb) is used for rounding and is asserted high whenever the remainder is unequal to zero.

Moreover, the number of trailing zeros (TZ) is computed on-the-fly. This number indicates by how many digits the computed quotient might be shifted to the right without losing accuracy. The number of trailing zeros has to be counted only for Z′ because, if ZM1′ is selected, the residual will be unequal to zero and TZ = 0 anyway. The number of trailing zeros is used for the selection between the preferred exponent (QP) and the normalized exponent for integer representation (QNI), as described in the following paragraph.

The exponent calculation unit selects either QP or QNI. The preferred exponent should be selected whenever possible. This is only feasible when $\mathrm{QNI} < \mathrm{QP} \le \mathrm{QNI} + \mathrm{TZ}$, that is, when there is a sufficient number of trailing zeros to shift the significand to the right. Furthermore, the final exponent must lie within the minimum and maximum exponent range. The algorithm of the exponent calculation is illustrated in Algorithm 9, where the value

$$\mathrm{offset2} = 783 - \mathrm{bias} = 783 - 398 = 385 \tag{30}$$

is used to add the IEEE 754-2008 bias and to remove offset1 again (which was introduced in (27)).

QNI′ = QNF + 1
Z′ = ZM1′ = ZP1′ = 0
TZ′ = 0, $j = 0$
while (Z′ < $10^{p}$) and (QNI′ ≥ $q_{\min}$)
    Z′ = 10 · Z′ + $z_{j+1}$ if $z_{j+1} \ge 0$; 10 · ZM1′ + ($z_{j+1}$ + 10) if $z_{j+1} < 0$
    ZM1′ = 10 · Z′ + ($z_{j+1}$ − 1) if $z_{j+1} > 0$; 10 · ZM1′ + ($z_{j+1}$ + 9) if $z_{j+1} \le 0$
    ZP1′ = 10 · Z′ + ($z_{j+1}$ + 1) if $z_{j+1} \ge -1$; 10 · ZM1′ + ($z_{j+1}$ + 11) if $z_{j+1} < -1$
    TZ′ = TZ′ + 1 if $z_{j+1} = 0$; 0 else
    QNI′ = QNI′ − 1
    $j = j + 1$
QNI = QNI′ + 1 (remove the impact of the rounding digit)
R′ = $z_{j+1}$ − 1 if remainder < 0; $z_{j+1}$ else
ZM1″ = $\lfloor$ZM1′ · $10^{-1}\rfloor$
Z″ = $\lfloor$Z′ · $10^{-1}\rfloor$
ZP1″ = $\lfloor$ZP1′ · $10^{-1}\rfloor$
(Z, ZP1, R) = (ZM1″, Z″, R′ + 10) if R′ < 0; (Z″, ZP1″, R′) else
sb = 0 if remainder = 0; 1 else
GUF = 1 if QNI = $q_{\min}$ and Z′ < $10^{p}$ and (sb = 1 or R ≠ 0); 0 else
if GUF = 1 then underflow
TZ = 0 if sb = 1; TZ′ else

Algorithm 8: On-the-fly conversion with gradual underflow handling.
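The register updates inside the loop of Algorithm 8 can be exercised in isolation. The following sketch covers only the conversion of a signed-digit sequence into Z′, ZM1′, ZP1′, and the trailing zeros counter; the loop termination, the exponent update, and the gradual underflow handling are omitted. The function name `on_the_fly_convert` is hypothetical, and the digit set $\{-8, \ldots, 8\}$ and a nonnegative quotient are assumed.

```python
def on_the_fly_convert(digits):
    """Convert signed digits (most significant first) without carry propagation.

    Returns (Z, ZM1, ZP1, TZ). After the first nonzero digit, ZM1 = Z - 1 and
    ZP1 = Z + 1 hold in every step; each update appends one digit in 0..9 to
    either the old Z or the old ZM1, which is a concatenation in hardware.
    """
    z = zm1 = zp1 = 0
    tz = 0
    for d in digits:
        if not -8 <= d <= 8:
            raise ValueError("quotient digit out of range")
        # All three registers are updated from the OLD values of Z and ZM1.
        new_z = 10 * z + d if d >= 0 else 10 * zm1 + (d + 10)
        new_zm1 = 10 * z + (d - 1) if d > 0 else 10 * zm1 + (d + 9)
        new_zp1 = 10 * z + (d + 1) if d >= -1 else 10 * zm1 + (d + 11)
        z, zm1, zp1 = new_z, new_zm1, new_zp1
        tz = tz + 1 if d == 0 else 0   # trailing zeros counter
    return z, zm1, zp1, tz
```

For the signed-digit sequence 3, −2, 0, −8, 5 the value is $3 \cdot 10^4 - 2 \cdot 10^3 - 8 \cdot 10 + 5 = 27925$, and the function returns Z = 27925 together with ZM1 = 27924 and ZP1 = 27926.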

Rounding is required when the number of significant digits of the quotient exceeds the length $p$ of a decimal word. The divider presented in this paper provides the five rounding modes as requested by [4]. Rounding selects either the integer quotient $Z_e$ or the incremented quotient $\mathrm{ZP1}_e = Z_e + 1$. As proven by Theorem A.1 in the appendix, rounding overflow cannot occur in DFP division. Rounding overflow would occur if $Z_e + 1$ overflows due to rounding. The selection of the quotient depends on the rounding mode, the rounding digit (R), the sticky bit (sb), and the least significant digit (LSD) of the quotient. The calculation of the round-up detection is summarized in Table 3.
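The round-up decision for the five rounding modes can be sketched as a single predicate. This is an illustration under assumptions: the mode names, the function name `round_up`, and the convention that `sign` is 1 for a negative result are hypothetical, and the negated terms (no round-up toward positive for negative results, ties resolved by the parity of the LSD only when the sticky bit is zero) follow the usual IEEE 754-2008 semantics of the modes.

```python
def round_up(mode, r, sb, lsd, sign):
    """Return True if the incremented quotient must be selected.

    r    : rounding digit (0..9)
    sb   : sticky bit (1 if the remainder is nonzero)
    lsd  : least significant digit of the quotient
    sign : 1 for a negative result, 0 otherwise
    """
    if mode == "ties_to_even":
        # Above the midpoint, or exactly 5 with a nonzero tail or an odd LSD.
        return r > 5 or (r == 5 and bool(sb or lsd % 2 == 1))
    if mode == "ties_to_away":
        return r >= 5
    if mode == "toward_positive":
        return (not sign) and bool(r > 0 or sb)
    if mode == "toward_negative":
        return bool(sign) and bool(r > 0 or sb)
    if mode == "toward_zero":
        return False
    raise ValueError(f"unknown rounding mode: {mode}")
```

A result of True corresponds to selecting $\mathrm{ZP1}_e$, and False to selecting $Z_e$.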

Finally, the result is encoded again, and the exception unit might assert six exception signals. These are division by zero, invalid operation, result is infinite, inexact, overflow, and underflow. Division by zero is asserted when the divisor equals zero but the dividend is unequal to zero. The operation is called invalid when both operands are zero, both operands are infinity, or any of the operands is either a signaling or a quiet NaN. The inexact flag is asserted when the


Table 3: Round-up detection.

| Rounding mode | Round-up detection |
|---|---|
| Round Ties To Even | $(R > 5) + (R \equiv 5) \cdot \mathrm{sb} + (R \equiv 5) \cdot \overline{\mathrm{sb}} \cdot \mathrm{LSD}(0)$ |
| Round Ties To Away | $(R \ge 5)$ |
| Round Toward Positive | $\overline{\mathrm{sign}} \cdot ((R > 0) + \mathrm{sb})$ |
| Round Toward Negative | $\mathrm{sign} \cdot ((R > 0) + \mathrm{sb})$ |
| Round Toward Zero | 0 |

Legend: "+" logical OR; "·" logical AND.

offset2 = 783 − 398 = 385
if (QP ≤ QNI) or (QNI + TZ < QP) or (QP > $q_{\max}$) then
    (a) QNI > $q_{\max}$ ⇒ overflow
    (b) $q_{\min}$ ≤ QNI ≤ $q_{\max}$ ⇒
        $q_Z$ = QNI − offset2
        $Z_e$ = Z
        $\mathrm{ZP1}_e$ = ZP1
    (c) QNI < $q_{\min}$ ⇒ underflow
else
    (a) $q_{\min}$ ≤ QP ≤ $q_{\max}$ ⇒
        $q_Z$ = QP − offset2
        $Z_e$ = Z ≫ (QP − QNI)
        $\mathrm{ZP1}_e$ = ZP1 ≫ (QP − QNI)
    (b) QNI < $q_{\min}$ ⇒ underflow

Algorithm 9: Exponent calculation.

rounding digit or the sticky bit is unequal to zero. Overflow and underflow are computed as described for the exponent calculation unit and are listed in Algorithms 8 and 9. The infinity exception is asserted when the result is greater than the largest number, that is, when either the overflow or the division by zero exception is asserted, or when the dividend is infinity while the divisor is a finite number.

6. Implementation Results

All dividers are modeled using VHDL and are implemented for Xilinx Virtex-5 devices with speed grade −2 using Xilinx ISE 10.1. The post-place-and-route results of the fixed-point dividers are listed in Table 4.

All five types of fixed-point dividers require 19 cycles to perform a division. This includes 17 cycles to determine the quotient digits and the rounding digit, one additional cycle to normalize the result (when the first quotient digit is zero), and one cycle of latency to perform the on-the-fly conversion and the sticky bit determination. As expected, the type1 divider uses the most resources in terms of look-up tables (LUTs) and flip-flops (FFs). The type2, type3.a, and type3.b dividers consume fewer LUTs but additionally require five (type2) or two (type3.a and type3.b) BRAMs. However, comparing the delays of the dividers leads to an unexpected result. Contrary to decimal divider implementations in ASIC designs, on FPGA platforms the shift-and-subtract algorithm is the fastest. The reason is that the signal propagation on the FPGA's internal fast carry chains is faster than over interconnections between slices across the FPGA's general routing matrix. The type1 divider exploits these fast carry chains, whereas the type2, type3.a, type3.b, and type4 dividers only use the normal, slow slice interconnection resources. In ASIC implementations, for instance, the longest path of the type4 divider would be much shorter than the longest path of the type1 divider, but on FPGA architectures the situation is the opposite.

One of the fastest decimal fixed-point dividers on ASICs is the design of Lang and Nannarelli [14]. They minimize the critical path in the digit recurrence by using a fast quotient digit selection function based on binary carry-propagate adders that subtract fixed values from the estimation of the current residual. These fixed values depend on estimations of the divisor and are precomputed and stored in a ROM. Therefore, the latency of the ROM does not contribute to the critical path of the digit recurrence. The implementation of such a divider on FPGAs would perform poorly because the required carry-save adder with small error estimation cannot be implemented efficiently on the FPGA's slice structure. However, the concept of removing the ROM from the critical path is applied to the type3.a and type3.b dividers, and the corresponding post-place-and-route results are shown in Table 4. These results point out that the type3.a divider utilizes a similar number of LUTs as the type2 divider, but its cycle time is much higher (10.1 ns compared to 8.5 ns). The increased cycle time can be explained by the raised complexity of the multiplexers for the selection of the divisor's multiples in the digit recurrence step. These multiplexers show a poor performance in terms of propagation delay because they are implemented as a tree with slow slice interconnections. In contrast, the type3.b divider applies dedicated fast multiplexers, which have a propagation delay of only one LUT instance and eight fast carry chains. This dedicated fast multiplexer speeds up the divider and reduces the maximum cycle time by one nanosecond. Unfortunately, the number of used LUTs increases dramatically, by approximately 1000 LUTs.

The comparison of the type2, type3.a, type3.b, and type4 dividers in terms of speed and area (number of occupied LUTs and FFs combined) shows that the type2 algorithm is the fastest and requires the least number of slices. Hence, there is no benefit in removing the ROM's latency from the critical path because the complexity is moved to the selection of the divisor's multiples. The type1 divider is only 5% faster than the type2 divider, but the speed is bought at a high price because the number of occupied slices is 75% higher. Therefore, the type2 divider shows a good tradeoff in terms of area and latency and is used for the floating-point divider presented in this paper. In the following paragraph, we compare the type2 implementation with other published fixed-point dividers.

Four other FPGA-based dividers are presented by Ercegovac and McIlhenny [17, 18], Deschamps and Sutter [24], Zhang et al. [25], and Vestias and Neto [12]. These implementations are compared to our type2 divider in the following. The divider presented in [17, 18] is based on a digit recurrence algorithm that only requires limited-precision multipliers,


Table 4: Results for normalized decimal fixed-point dividers, $p = 16 + 1$.

| | Type1 | Type2 | Type3.a | Type3.b | Type4 |
|---|---|---|---|---|---|
| Number of LUTs | 3595 | 1704 | 1749 | 2751 | 1846 |
| Number of FFs | 1234 | 1126 | 1262 | 1262 | 1031 |
| Number of LUTs and FFs combined | 3868 | 2210 | 2304 | 3240 | 2203 |
| Number of 36 k BRAMs | 0 | 5 | 2 | 2 | 0 |
| Number of LUTs normalized to the type2 divider | 2.11 | 1.00 | 1.03 | 1.61 | 1.08 |
| Number of FFs normalized to the type2 divider | 1.10 | 1.00 | 1.12 | 1.12 | 0.92 |
| Number of LUTs and FFs combined normalized to the type2 divider | 1.75 | 1.00 | 1.04 | 1.47 | 1.00 |
| Cycle time (ns) | 8.1 | 8.5 | 10.1 | 9.1 | 12.1 |
| Overall latency (ns) (19 × cycle time) | 154 | 162 | 192 | 173 | 230 |
| Max. frequency (MHz) | 123 | 118 | 99 | 110 | 83 |
| Max. frequency normalized to the type2 divider | 1.05 | 1.00 | 0.84 | 0.93 | 0.70 |

Table 5: Performance comparison of fixed-point dividers.

| Divider | Occupied area | Cycle time (ns) | Latency (ns) |
|---|---|---|---|
| Ercegovac and McIlhenny [17] ($p = 14$, Virtex-5) | 1263 (6,2)-LUTs | 13.1 | 197 |
| Type2 divider equalized to $p = 14$ | 1692 (6,2)-LUTs | 8.5 | 136 |
| Deschamps and Sutter [24] ($p = 16$, Virtex-4): (i) nonrestoring algorithm | 2974 (4,1)-LUTs | 21.4 | 386 |
| Deschamps and Sutter [24] ($p = 16$, Virtex-4): (ii) SRT-like algorithm | 3799 (4,1)-LUTs | 16.6 | 300 |
| Zhang et al. [25] ($p = 16$, Virtex-II Pro) | 3976 slices | 20.0 | 420 |
| Vestias and Neto [12] ($p = 16$, Virtex-4, Newton-Raphson): (i) Type A3 (0 DSPs, 118 cycles) | 2756 (4,1)-LUTs | 3.4 | 394 |
| Vestias and Neto [12]: (ii) Type A4 (0 DSPs, 90 cycles) | 3768 (4,1)-LUTs | 3.4 | 306 |
| Vestias and Neto [12]: (iii) Type A5 (7 DSPs, 112 cycles) | 2091 (4,1)-LUTs | 3.4 | 380 |
| Vestias and Neto [12]: (iv) Type A6 (10 DSPs, 86 cycles) | 2718 (4,1)-LUTs | 3.4 | 292 |
| Type2 divider equalized to $p = 16$ | 1704 (6,2)-LUTs | 8.5 | 153 |

adders, and LUTs. Furthermore, a compensation term is computed in the digit recurrence that compensates for the error caused by this limited precision. The design is optimized for Virtex-5 FPGAs and has a good area characteristic (a 14-digit divider requires 1263 LUTs) but suffers from a high cycle time of 13.1 ns, which is more than 50% higher than that of the type1 or type2 design proposed in this paper. The reason for that high cycle time is (similar to the type4 divider) the long critical paths across many LUTs, which cannot exploit the FPGA's fast carry chains. Therefore, the design would probably fit better on ASICs with dedicated routing. The dividers implemented in [24] apply two different digit recurrence algorithms with a redundancy of $\rho = 1$. The quotient digit selection function is very complex because it requires three MSDs of the divisor and four MSDs of the residual. The cycle times are high, but it must be taken into account that these designs are optimized for Virtex-4 FPGAs. Nevertheless, the number of used LUTs is very high. The divider presented in [25] applies a radix-100 digit recurrence algorithm with comprehensive prescaling of the dividend and the divisor. It is optimized for Virtex-II Pro devices; hence, it is hard to compare it with the dividers presented in this paper. However, the critical path includes a decimal carry-save adder tree, two carry-propagate adder stages, and three multiplexer stages. It can therefore be assumed that this algorithm would also perform poorly on Virtex-5 devices, although two quotient digits are retired in each iteration step. Vestias and Neto [12] propose four different decimal dividers optimized for different speed and area tradeoffs. Contrary to the circuits presented in this paper, these dividers apply the Newton-Raphson algorithm. The multiplier used is sequential; therefore, many cycles per division are required. Each of these dividers has a higher delay and a greater LUT usage than our type2 implementation.

Table 5 lists the results of the dividers presented in [17, 24, 25] together with the results of the type2 divider. For a fair performance comparison, the precision of the type2 divider is equalized to match the number of calculated quotient digits of the corresponding divider. Furthermore, it should be noted that the dividers presented in [12, 17, 24, 25] do not provide rounding support.

For the floating-point implementation, the fixed-point divider of type2 is used because type1 and type3.b consume too many resources and type3.a and type4 are too slow. The type2 divider requires five additional BRAMs, but these are available in sufficient quantities on Virtex-5 devices. The corresponding post-place-and-route results for two configurations with different numbers of serialization stages are listed in Table 6. A comparison with other implementations is


Table 6: Post-place-and-route results of the floating-point divider based on type2.

| | Config. 1 | Config. 2 |
|---|---|---|
| Number of cycles per float division | 23 | 21 |
| Number of LUTs | 3231 | 3205 |
| Number of FFs | 1630 | 1377 |
| Number of LUTs and FFs combined | 3571 | 3566 |
| Cycle time (ns) | 8.5 | 9.1 |
| Overall latency (ns) | 195.5 | 191 |
| Max. frequency (MHz) | 118 | 110 |

Table 7: Post-place-and-route results of a 64-bit binary floating-point divider (Core Gen).

| Cycles per division | 1 | 10 | 20 | 55 |
|---|---|---|---|---|
| No. of LUTs | 3161 | 592 | 425 | 394 |
| No. of FFs | 196 | 518 | 469 | 542 |
| No. of LUTs and FFs combined | 3232 | 917 | 675 | 675 |
| Cycle time (ns) | 153.7 | 20.8 | 8.9 | 6.8 |
| Overall latency (ns) | 153.7 | 208 | 178 | 374 |
| Max. frequency (MHz) | 6.5 | 48 | 112 | 147 |

complicated because no other FPGA-based implementations of decimal floating-point dividers have been published yet. Therefore, we can only compare the decimal divider with binary dividers implemented on the same FPGA. Table 7 lists post-place-and-route results of binary floating-point dividers for the double data format on a Virtex-5, provided by the Xilinx Core Generator [26]. These dividers also provide the exception signals overflow, underflow, invalid operation, and divide-by-zero. Unfortunately, the binary floating-point divider does not support gradual underflow because this feature increases the complexity. The divider with one cycle per division is an unpipelined and fully parallel design, and the divider that requires 55 cycles per division is a fully sequential design. The binary floating-point divider with 20 cycles per division is best suited for a comparison with the decimal floating-point divider because it requires a comparable number of cycles for one operation. Decimal arithmetic has a large overhead regarding the total number of used LUTs (3231 compared to 425), but the cycle time and latency are similar. This resource overhead can be explained on the one hand by the inefficient representation of decimal values in binary logic and on the other hand by the more complex specification of the decimal part of the IEEE 754-2008 standard. For instance, decimal floating-point division requires additional encoders, decoders, and a normalization stage.

7. Conclusion

In this paper we have presented five different radix-10 digit recurrence division algorithms. The type1 divider implements the simple shift-and-subtract algorithm. The type2 divider is based on a nonrestoring algorithm with a ROM-based quotient digit selection function. The type3.a and type3.b dividers are both similar to type2 but use fast binary carry-propagate adders in the quotient digit selection function. However, the type3.a divider uses LUT-based multiplexers, whereas the type3.b divider uses fast carry chain-based multiplexers. The type4 divider applies a new algorithm with constant digit selection functions. This type4 divider requires neither a ROM nor multiplexers to select multiples of the divisor.

We have shown that the shift-and-subtract algorithm, which has the worst latency in ASIC architectures, is the fastest design on FPGAs but uses the most FPGA resources in terms of LUTs. A good tradeoff between latency and area is the type2 architecture with a redundant carry-save representation of the residual, a radix-2 representation of the two MSDs, and a quotient digit selection function implemented in ROM. Furthermore, we have extended this fixed-point divider to a fully IEEE 754-2008 compliant decimal floating-point divider for the decimal64 data format, and finally we have shown implementation results of all dividers.

Appendix

Proofs

Theorem A.1 (rounding overflow). Let us consider a decimal floating-point division $Q = X / D$ with the signs $s_i$, the significands $c_i$, the exponents $q_i$, and the precision $p$:

$$(-1)^{s_Q} \cdot c_Q \cdot 10^{q_Q} = \frac{(-1)^{s_X} \cdot c_X \cdot 10^{q_X}}{(-1)^{s_D} \cdot c_D \cdot 10^{q_D}}. \tag{A.1}$$

Then rounding overflow, that is, an overflow that arises due to adding +1 to the least significant digit of the final result, cannot occur.

Proof. Assume to the contrary that there are a floating-point dividend X, a floating-point divisor D, and a proper rounding mode for which the decimal floating-point division produces a rounding overflow. Without loss of generality, we consider in the following only positive operands ($s_X = s_D = 0$), normalized significands ($c_X, c_D \ge 10^{p-1}$), and exponents equal to zero ($q_X = q_D = 0$). In the case of rounding overflow, the $p$ digits of the quotient must all be nines and the remainder must be unequal to zero, that is,

$$\underbrace{c_X}_{p\ \text{digits}} = c_Q \cdot c_D + \mathrm{rem} = \underbrace{99 \cdots 9}_{p\ \text{digits}} \cdot \underbrace{c_D}_{p\ \text{digits}} + \underbrace{\mathrm{rem}}_{p\ \text{digits}} \tag{A.2}$$

with

$$0 < \mathrm{rem} < c_D \cdot 10^{-p+1}. \tag{A.3}$$

Condition (A.2) is fulfilled for

$$c_D = 10^{p-1}, \qquad c_X = \underbrace{99 \cdots 9}_{p\ \text{digits}}, \tag{A.4}$$

but this violates condition (A.3) because the remainder is zero (rem = 0); that is, the result is exact, no round-up is applied, and hence no rounding overflow occurs.


Otherwise, if the divisor is greater than $10^{p-1}$, it follows that

$$\begin{aligned} \underbrace{c_X}_{p\ \text{digits}} &= (99 \cdots 9) \cdot c_D + \mathrm{rem} \\ &= (10 - 10^{-p+1}) \cdot c_D + \mathrm{rem} \\ &= 10 \cdot c_D + \underbrace{(-10^{-p+1} \cdot c_D + \mathrm{rem})}_{< 0} \\ &= \underbrace{10 \cdot c_D + \mu}_{\ge\, p+1\ \text{digits}} \end{aligned} \tag{A.5}$$

with $-10 < \mu < 0$. The left term of the equation has by definition a precision of $p$ digits, while the right term of the equation has a precision of more than $p$ digits. This contradiction finally proves that no rounding overflow can occur.
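The statement of Theorem A.1 can also be checked exhaustively for small precisions. The sketch below is an independent brute-force check, not part of the proof: it works on integer significands, normalizes the truncated quotient to exactly $p$ digits, and counts the operand pairs for which this quotient consists of all nines while the remainder is nonzero. The function name `count_rounding_overflows` is hypothetical.

```python
def count_rounding_overflows(p: int) -> int:
    """Count normalized operand pairs whose truncated p-digit quotient is all
    nines with a nonzero remainder (the precondition of a rounding overflow)."""
    lo, hi = 10 ** (p - 1), 10 ** p
    count = 0
    for c_x in range(lo, hi):
        for c_d in range(lo, hi):
            # The quotient c_x / c_d lies in (0.1, 10); scale it so that the
            # truncated integer quotient has exactly p digits.
            scale = 10 ** (p - 1) if c_x >= c_d else 10 ** p
            t, rem = divmod(c_x * scale, c_d)
            if t == hi - 1 and rem != 0:
                count += 1
    return count
```

In line with the theorem, the count is zero for the small precisions that can be enumerated this way; the only all-nines quotient arises for $c_X = 10^{p} - 1$ and $c_D = 10^{p-1}$, where the remainder is zero.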

References

[1] M. Baesler, S. Voigt, and T. Teufel, "FPGA implementations of radix-10 digit recurrence fixed-point and floating-point dividers," in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig '11), pp. 13–19, IEEE Computer Society, Los Alamitos, CA, USA, December 2011.

[2] IEEE Task P754, ANSI/IEEE 754-1985, Standard for Binary Floating-Point Arithmetic, New York, NY, USA, August 1985.

[3] M. F. Cowlishaw, "Decimal floating-point: algorism for computers," in Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH '03), pp. 104–111, IEEE Computer Society, Washington, DC, USA, June 2003.

[4] IEEE Task P754, IEEE 754-2008, Standard for Floating-Point Arithmetic, New York, NY, USA, August 2008.

[5] ANSI/IEEE, ANSI/IEEE Std 854-1987: An American National Standard: IEEE Standard for Radix-Independent Floating-Point Arithmetic, New York, NY, USA, October 1987.

[6] A. Y. Duale, M. H. Decker, H. G. Zipperer, M. Aharoni, and T. J. Bohizic, "Decimal floating-point in z9: an implementation and testing perspective," IBM Journal of Research and Development, vol. 51, no. 1-2, pp. 217–227, 2007.

[7] C. F. Webb, "IBM z10: the next-generation mainframe microprocessor," IEEE Micro, vol. 28, no. 2, pp. 19–29, 2008.

[8] L. Eisen, J. W. Ward, H. W. Tast, et al., "IBM POWER6 accelerators: VMX and DFU," IBM Journal of Research and Development, vol. 51, no. 6, pp. 663–683, 2007.

[9] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd, "Power7: IBM's next-generation server processor," IEEE Micro, vol. 30, no. 2, pp. 7–15, 2010.

[10] S. F. Oberman and M. J. Flynn, "Division algorithms and implementations," IEEE Transactions on Computers, vol. 46, no. 8, pp. 833–854, 1997.

[11] L. K. Wang and M. J. Schulte, "A decimal floating-point divider using Newton-Raphson iteration," Journal of VLSI Signal Processing Systems, vol. 49, no. 1, pp. 3–18, 2007.

[12] M. Vestias and H. Neto, "Revisiting the Newton-Raphson iterative method for decimal division," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '11), pp. 138–143, IEEE Computer Society Press, September 2011.

[13] H. Nikmehr, B. Phillips, and C. C. Lim, "Fast decimal floating-point division," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 9, pp. 951–961, 2006.

[14] T. Lang and A. Nannarelli, "A radix-10 digit-recurrence division unit: algorithm and architecture," IEEE Transactions on Computers, vol. 56, no. 6, pp. 727–739, 2007.

[15] A. Vazquez, E. Antelo, and P. Montuschi, "A radix-10 SRT divider based on alternative BCD codings," in Proceedings of the 25th IEEE International Conference on Computer Design (ICCD '07), pp. 280–287, IEEE Computer Society Press, Los Alamitos, CA, USA, October 2007.

[16] E. Schwarz and S. Carlough, "Power6 decimal divide," in Proceedings of the 18th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '07), pp. 128–133, IEEE Computer Society, July 2007.

[17] M. D. Ercegovac and R. McIlhenny, "Design and FPGA implementation of radix-10 algorithm for division with limited precision primitives," in Proceedings of the 42nd Asilomar Conference on Signals, Systems and Computers (ASILOMAR '08), pp. 762–766, IEEE Computer Society, Pacific Grove, Calif, USA, October 2008.

[18] M. D. Ercegovac and R. McIlhenny, "Design and FPGA implementation of radix-10 combined division/square root algorithm with limited precision primitives," in Proceedings of the 44th Asilomar Conference on Signals, Systems and Computers (Asilomar '10), pp. 87–91, IEEE Computer Society, Pacific Grove, Calif, USA, November 2010.

[19] F. Y. Busaba, C. A. Krygowski, W. H. Li, E. M. Schwarz, and S. R. Carlough, "The IBM z900 decimal arithmetic unit," in Proceedings of the 35th Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1335–1339, IEEE Computer Society, November 2001.

[20] M. Ercegovac and T. Lang, Division and Square Root: Digit-Recurrence Algorithms and Implementations, Kluwer Academic Publishers, Norwell, Mass, USA, 1994.

[21] M. Baesler, S. O. Voigt, and T. Teufel, "A decimal floating-point accurate scalar product unit with a parallel fixed-point multiplier on a Virtex-5 FPGA," International Journal of Reconfigurable Computing, vol. 2010, Article ID 357839, 13 pages, 2010.

[22] M. Baesler, S. O. Voigt, and T. Teufel, "A radix-10 digit recurrence division unit with a constant digit selection function," in Proceedings of the 28th IEEE International Conference on Computer Design (ICCD '10), pp. 241–246, IEEE Computer Society, Los Alamitos, CA, USA, October 2010.

[23] M. Baesler, S. O. Voigt, and T. Teufel, "An IEEE 754-2008 decimal parallel and pipelined FPGA floating-point multiplier," in Proceedings of the 20th International Conference on Field Programmable Logic and Applications (FPL '10), pp. 489–495, IEEE Computer Society, Washington, DC, USA, September 2010.

[24] J. P. Deschamps and G. Sutter, "Decimal division: algorithms and FPGA implementations," in Proceedings of the 6th Southern Programmable Logic Conference (SPL '10), pp. 67–72, IEEE Computer Society, March 2010.

[25] Y. Zhang, D. Chen, L. Chen, et al., "Design and implementation of a radix-100 decimal division," in Proceedings of IEEE Symposium on Circuit and System (ISCAS '09), Taibei, Taiwan, May 2009.

[26] Xilinx, Xilinx LogiCORE Floating-Point Operator v4.0, April 2008.


to 1.0 but lower than 1.1112 in order to simplify the digit selection function. However, this prescaling is very costly: it requires a 2-digit multiply, and it needs overall six cycles on each operand. Furthermore, the digit selection function still requires a look-up table.

The first decimal fixed-point divider designed for FPGAs applies a digit recurrence algorithm and was proposed by Ercegovac and McIlhenny [17, 18]. However, it has a poor cycle time because it does not fit well on the slice structure of FPGAs, although it would probably fit well into ASIC designs with dedicated routing.

4.1. Digit Recurrence Division. In the following, we use different decimal number representations. A decimal number $B$ is called BCD-$\beta_0\beta_1\beta_2\beta_3$ coded when $B$ can be expressed by

$$B = \sum_{i=0}^{p-1} B_i \cdot 10^i \quad \text{with} \quad B_i = \sum_{k=0}^{3} B_{i,k} \cdot \beta_k, \quad B_{i,k} \in \{0, 1\}. \tag{3}$$

A number representation is called redundant if one or more digits have multiple representations.
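As a small illustration of the coding in (3), the following sketch enumerates all 4-bit codes for a given weight vector and groups them by digit value. The helper names are hypothetical and not taken from the paper; the weight vectors (8, 4, 2, 1) and (4, 2, 2, 1) correspond to the BCD-8421 and BCD-4221 codings used later in the text.

```python
from itertools import product

def bcd_value(bits, weights):
    """Digit value B_i = sum_k B_{i,k} * beta_k for one 4-bit code."""
    return sum(b * w for b, w in zip(bits, weights))

def codes_per_digit(weights):
    """Map every decimal digit 0..9 to the list of 4-bit codes that encode it."""
    table = {digit: [] for digit in range(10)}
    for bits in product((0, 1), repeat=4):
        value = bcd_value(bits, weights)
        if value in table:          # codes with value > 9 are not valid digits
            table[value].append(bits)
    return table

bcd8421 = codes_per_digit((8, 4, 2, 1))
bcd4221 = codes_per_digit((4, 2, 2, 1))
```

In BCD-8421 every digit has exactly one code, whereas in BCD-4221 several digits (for example, 2) have two codes and all 16 bit patterns are valid digits.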

Moreover, we consider a division $z = x/d$ in which the decimal dividend $x$ and the decimal divisor $d$ are positive and normalized fractional numbers of precision $p$, that is, $0.1 \le x, d < 1$. The quotient $z$ and the remainder rem are calculated as follows:

$$x = z \cdot d + \mathrm{rem}, \quad \mathrm{rem} \le d \cdot \mathrm{ulp}, \quad \mathrm{ulp} = 10^{-p}. \tag{4}$$

Then the radix-10 digit recurrence is implemented by

$$w[j+1] = 10\,w[j] - z_{j+1} \cdot d, \quad j = 0, 1, \ldots, p, \tag{5}$$

with the initial residual $w[0] = x/10$ and with the quotient digit calculated by the selection function

$$z_{j+1} = \mathrm{SEL}(10\,w[j], d) \quad \text{with} \quad z = z_1.z_2 z_3 \cdots z_p = \sum_{j=0}^{p-1} z_{j+1} \cdot 10^{-j}. \tag{6}$$

Digit recurrence division is subdivided into the classes of restoring and nonrestoring algorithms, which differ in the quotient digit selection (QDS) function and the dynamic range of the residual. A restoring divider selects the next positive quotient digit $0 \le z_{j+1} \le 9$ such that the next partial residual is as small as possible but still positive. The IBM z900 architecture, for instance, implements such a decimal restoring divider [19]. By contrast, the QDS function of a decimal nonrestoring divider uses a digit set that is positive as well as negative, $-a \le z_{j+1} \le a$, and the partial residual $w[j+1]$ might also be negative.

One advantage of nonrestoring division is the enhanced performance, since estimates of limited precision ($\hat{w}[j] \approx w[j]$ and $\hat{d} \approx d$) might be used in the QDS function $\mathrm{SEL}(10\hat{w}[j], \hat{d})$. This performance gain is achieved by using a redundant digit set $-a \le z_j \le a$, which defines a redundancy factor

$$\rho = \frac{a}{10 - 1} = \frac{a}{9}. \tag{7}$$

Then, in each iteration step, the quotient digit $z_{j+1}$ should be selected such that the next residual $w[j+1]$ is bounded by

$$-\rho d \le w[j+1] \le \rho d, \tag{8}$$

which is called the convergence condition [20].
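The recurrence (5) together with the convergence condition (8) can be sketched with exact rational arithmetic. This is only an illustrative model, not one of the dividers of this paper: it assumes an idealized selection function that picks the nearest digit to $10w[j]/d$ and clips it to the digit set, and it assumes operands for which $|x/10| \le \rho d$ holds initially (for example, $d \ge 0.4$). The function names are hypothetical.

```python
from fractions import Fraction

def nonrestoring_divide(x, d, n, a=8):
    """Run n steps of w[j+1] = 10*w[j] - z*d with signed digits -a..a."""
    x, d = Fraction(x), Fraction(d)
    rho = Fraction(a, 9)                      # redundancy factor, equation (7)
    w = x / 10                                # initial residual w[0]
    assert abs(w) <= rho * d, "operands outside the assumed range"
    digits = []
    for _ in range(n):
        # idealized selection: nearest integer to 10*w/d, clipped to the digit set
        z = max(-a, min(a, round(10 * w / d)))
        w = 10 * w - z * d
        assert -rho * d <= w <= rho * d       # convergence condition (8)
        digits.append(z)
    return digits, w

def digits_to_value(digits):
    """z = sum_j z_{j+1} * 10^(-j), as in equation (6)."""
    return sum(Fraction(z, 10 ** j) for j, z in enumerate(digits))
```

For instance, `nonrestoring_divide(Fraction(1, 2), Fraction(7, 10), 3)` yields the signed digits 1, -3, 1, that is, 1 - 0.3 + 0.01 = 0.71.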

This paper presents five different decimal fixed-point dividers that are described in the following. The first divider (type1) is a restoring divider, whereas the four others (type2, type3.a, type3.b, and type4) are nonrestoring dividers.

4.2. Type1 QDS Function. The QDS function of the type1 algorithm is very simple. Nine multiples of the divisor ($1d, \ldots, 9d$) are subtracted in parallel from the residual by carry-propagate adders (CPAs), and the smallest positive result is selected. The CPAs exploit the FPGA's internal fast carry logic, as described in [21].

The nine multiples are precomputed and are composed of the multiples $1d$, $2d$, $5d$, $10d$, and their negatives (10's complement), which require at most one additional CPA per multiple. The multiples $2d$ and $5d$ can be easily computed by digit recoding and constant shift operations:

$$(X)_{\text{BCD-5421}} \ll 1 \equiv (X \cdot 2)_{\text{BCD-8421}}, \tag{9}$$

$$(X)_{\text{BCD-8421}} \ll 3 \equiv (X \cdot 5)_{\text{BCD-5421}}, \tag{10}$$

where (9) is read as follows. A BCD-5421 coded number $X$ left-shifted by one bit is equivalent to the corresponding BCD-8421-coded number multiplied by two. In a similar fashion, we obtain a multiplication by five using (10).
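The shift identities (9) and (10) can be checked at bit level with a short sketch. The encoding helper below uses a greedy per-digit coding, which is an assumption of this example (it produces the usual BCD-5421 and BCD-8421 codes); all function names are hypothetical.

```python
W8421 = (8, 4, 2, 1)
W5421 = (5, 4, 2, 1)

def encode(n, weights, ndigits):
    """Pack the decimal digits of n into 4-bit groups (greedy coding per digit)."""
    bits = 0
    for i in range(ndigits):
        digit = (n // 10 ** i) % 10
        code = 0
        for w in weights:                     # most significant weight first
            code <<= 1
            if digit >= w:
                code |= 1
                digit -= w
        bits |= code << (4 * i)
    return bits

def decode(bits, weights, ndigits):
    """Interpret consecutive 4-bit groups as digits with the given weights."""
    value = 0
    for i in range(ndigits):
        group = (bits >> (4 * i)) & 0xF
        digit = sum(w for k, w in enumerate(weights) if group & (8 >> k))
        value += digit * 10 ** i
    return value

def times2(n, ndigits):
    """Equation (9): recode to BCD-5421, shift left by 1, read as BCD-8421."""
    return decode(encode(n, W5421, ndigits) << 1, W8421, ndigits + 1)

def times5(n, ndigits):
    """Equation (10): take BCD-8421, shift left by 3, read as BCD-5421."""
    return decode(encode(n, W8421, ndigits) << 3, W5421, ndigits + 1)
```

The result needs one extra digit position because the bits shifted out of the most significant digit form a new leading digit.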

The implementation of the type1 digit recurrence divider is characterized by a high area use due to the utilization of nine parallel CPAs. The corresponding algorithm of the type1 digit recurrence is shown in Algorithm 1.
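A behavioral sketch of the type1 selection (Algorithm 1) is given here, assuming exact rational arithmetic in place of the nine hardware CPAs and the multiplexer. The function names are illustrative only.

```python
from fractions import Fraction

def type1_step(w, d):
    """One restoring step: form the nine differences and keep the last nonnegative one."""
    shifted = 10 * w
    deltas = [shifted - k * d for k in range(1, 10)]   # delta_1 .. delta_9
    for k, delta in enumerate(deltas, start=1):
        if delta < 0:
            # first negative difference found at k, so the digit is k - 1
            return k - 1, (shifted if k == 1 else deltas[k - 2])
    return 9, deltas[8]

def type1_divide(x, d, n):
    """Compute n quotient digits z_1 .. z_n and the final residual w[n]."""
    x, d = Fraction(x), Fraction(d)
    w = x / 10                                         # w[0] = x / 10
    digits = []
    for _ in range(n):
        z, w = type1_step(w, d)
        digits.append(z)
    return digits, w
```

With x = 0.5 and d = 0.7, seven steps produce the digits 0, 7, 1, 4, 2, 8, 5, that is, 0.714285, and the residual always stays in the range 0 ≤ w < d.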

(1) Compute $\delta_1 = 10w[j] - d$, $\ldots$, $\delta_9 = 10w[j] - 9d$
(2) if $\delta_1 < 0$ then $z_{j+1} = 0$ and $w[j+1] = 10 \cdot w[j]$
    else if $\delta_2 < 0$ then $z_{j+1} = 1$ and $w[j+1] = \delta_1$
    $\vdots$
    else if $\delta_9 < 0$ then $z_{j+1} = 8$ and $w[j+1] = \delta_8$
    else $z_{j+1} = 9$ and $w[j+1] = \delta_9$

Algorithm 1: Pseudocode for type1 digit recurrence division.

4.3. Type2 QDS Function. The approach of the type2 division algorithm is based on the implementation of the QDS function fully by a ROM. This ROM is addressed by estimates of the residual and divisor. The use of estimates is feasible because a signed digit set together with a redundancy greater than 1/2 is used. The estimates $10\hat{w}[j]$ and $\hat{d}$ are obtained by truncation; that is, only a limited number of MSDs of the residual $10w[j]$ and divisor $d$ are regarded. Furthermore, the residual is implemented in BCD-4221 carry-save representation such that the maximum error introduced by an estimation with precision $t$ (one integer digit and $t-1$ fractional digits) is bounded by $\epsilon = 2 \cdot 10^{-t+1}$. Negative residuals are represented by their 10's complement.

The divisor is subdivided into subranges $[d_i, d_{i+1})$ of equal width $d_{i+1} - d_i = 10^{-\delta}$. For each subrange $i$, the QDS function $z_{j+1} = \mathrm{SEL}_{\mathrm{ROM}}(10\hat{w}[j], \hat{d})$ is defined by selection constants $m_k(i)$:

$$z_{j+1} = k \iff m_k(i) \le 10\hat{w}[j] < m_{k+1}(i). \tag{11}$$

These selection constants are bounded by selection intervals $m_k(i) \in [L_k, U_{k-1}]$ with the containment condition [20]

$$L_k(d_i) = (k - \rho)\, d_i, \tag{12a}$$

$$U_k(d_i) = (k + \rho)\, d_i. \tag{12b}$$

Furthermore, the continuity condition states that every value $10w[j]$ must belong to at least one selection interval [20]. Considering the maximum error due to truncation, the continuity condition can be expressed by

$$U_{k-1}(d_i) - L_k(d_{i+1}) \ge \epsilon = 2 \cdot 10^{-t+1}. \tag{13}$$

If we suppose the subrange width to be constant ($d_{i+1} - d_i = 10^{-\delta}$) and consider that the minimum overlapping occurs for minimum $d_{\min}$ and maximum $k_{\max}$, then we obtain the term

$$x = (k_{\max} - 1 + \rho)\, d_{\min} - (k_{\max} - \rho)(d_{\min} + 10^{-\delta}) \ge 2 \cdot 10^{-t+1}. \tag{14}$$

We choose $\rho = 8/9$ ($k_{\max} = 8$) and $d_{\min} = 0.4$, which leads to $\delta = 2$ and $t = 2$. Thus, the divisor's estimation $\hat{d}$ requires an accuracy of two fractional digits, and the residual's estimation $10\hat{w}$ requires an accuracy of one integer and one fractional digit. Furthermore, the divisor must be prescaled such that $0.4 \le d < 1.0$.
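The choice $\delta = 2$, $t = 2$ can be verified numerically from (14). The following sketch evaluates the minimum overlap with exact fractions; the function names are hypothetical, and the parameter values are the ones stated in the text.

```python
from fractions import Fraction

def min_overlap(rho, k_max, d_min, delta):
    """Left-hand side of (14): smallest overlap of adjacent selection intervals."""
    step = Fraction(1, 10 ** delta)           # subrange width 10^(-delta)
    return (k_max - 1 + rho) * d_min - (k_max - rho) * (d_min + step)

def continuity_holds(rho, k_max, d_min, delta, t):
    """Check (14) against the truncation error eps = 2 * 10^(-t+1)."""
    eps = 2 * Fraction(1, 10 ** (t - 1))
    return min_overlap(rho, k_max, d_min, delta) >= eps

RHO, K_MAX, D_MIN = Fraction(8, 9), 8, Fraction(2, 5)
```

With these parameters the minimum overlap is 0.24, which exceeds $\epsilon = 0.2$ for $t = 2$. Reducing either $\delta$ or $t$ to 1 violates the condition.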

The P-D diagram is a visualization technique for designing a quotient digit selection function and for computing the decision boundaries $m_k(i)$. It plots the shifted residual versus the divisor. The P-D diagram for the type2 quotient digit selection function with the selection constants $m_k(i)$ is shown in Figure 2.

Figure 2: P-D diagram for the type2 QDS function (shifted residual $10 \cdot w$ versus divisor $d$).

The two MSDs of the divisor and residual address the ROM in order to obtain the signed quotient digit, which comprises 5 bits. In order to reduce the ROM's size, the two MSDs of the divisor and the residual are implemented by using a nonredundant radix-2 representation. This leads to an address that is composed of 15 bits: seven bits from the unsigned radix-2 representation of the divisor's two MSDs and eight bits from the signed radix-2 representation of the residual's two MSDs. Hence, the size of the ROM is $2^{15} \times 5$ bits. Unfortunately, the use of radix-2 representation complicates the multiplication by 10, which is required according to (5). Therefore, an additional binary adder is needed:

$$10 \cdot w[j] = 8 \cdot w[j] + 2 \cdot w[j] = (w[j] \ll 3) + (w[j] \ll 1). \tag{15}$$

The corresponding algorithm for the calculation of the ROM entries can be derived from the P-D diagram and is depicted in Algorithm 2. It should be noted that the radix-2 representations of $d$ and $w$ are truncations of $d$ and $w$, that is, $\hat{d} \le d$ and $\hat{w} \le w$.

(1) compute $L_k(d_i)$ and $U_k(d_i)$ for each $d_i \in \{0.40, 0.41, \ldots, 0.99\}$ and $k \in \{-8, \ldots, +8\}$ according to (12a) and (12b)
(2) for all $10 \cdot w_n \in \{-8.8, -8.7, \ldots, +8.8\}$ with $|w| \le \rho \cdot d_{\max} = 0.88$ compute the quotient digits $z$ as follows:
    if $10 \cdot w_n \ge U_7 - \epsilon = U_7 - 0.2$ then $z = 8$
    else if $10 \cdot w_n \ge U_6 - 0.2$ then $z = 7$
    $\vdots$
    else if $10 \cdot w_n \ge U_0 - 0.2$ then $z = 1$
    else if $10 \cdot w_n \ge L_0$ then $z = 0$
    $\vdots$
    else if $10 \cdot w_n \ge L_{-8}$ then $z = -8$

Algorithm 2: Algorithm for the calculation of the type2 ROM entries.
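The ROM contents described by Algorithm 2 can be tabulated with a few lines of code. The sketch below is a software model only: it keys the table by the divisor estimate in hundredths and the residual estimate in tenths (hypothetical names), and it assumes that every estimate below $L_{-7}$ is mapped to $-8$, so that no entry is left undefined.

```python
from fractions import Fraction

RHO = Fraction(8, 9)
EPS = Fraction(2, 10)                          # eps = 2 * 10^(-t+1) with t = 2

def rom_entry(d_i, w10):
    """Quotient digit for subrange start d_i and residual estimate w10 = 10*w."""
    for k in range(8, 0, -1):                  # k = 8 .. 1
        if w10 >= (k - 1 + RHO) * d_i - EPS:   # U_{k-1}(d_i) - eps
            return k
    for k in range(0, -8, -1):                 # k = 0 .. -7
        if w10 >= (k - RHO) * d_i:             # L_k(d_i)
            return k
    return -8                                  # assumption: remaining range maps to -8

def build_rom():
    """Table indexed by (divisor in 1/100, residual estimate in 1/10)."""
    rom = {}
    for d100 in range(40, 100):                # d_i = 0.40 .. 0.99
        for w10 in range(-88, 89):             # 10*w = -8.8 .. +8.8
            rom[(d100, w10)] = rom_entry(Fraction(d100, 100), Fraction(w10, 10))
    return rom

rom = build_rom()
```

The 60 × 177 = 10620 populated entries fit into the $2^{15}$ addresses spanned by the 7-bit divisor field and the 8-bit signed residual field.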

Once the quotient digit $z_{j+1}$ has been determined, the multiple $z_{j+1} \cdot d$ is subtracted from the current residual to compute the next residual as stated in (5). The subtracter is a fast redundant BCD-4221 carry-save adder (CSA) as described in [21], except for the two MSDs, in which we apply radix-2 CPAs. Thus, the multiple $z_{j+1} \cdot d$ is composed of two summands

$$z_{j+1} \cdot d = d^1_{j+1} + d^2_{j+1}, \tag{16}$$

where $d^{1,2} \in \{0, \pm 1d, \pm 2d, \pm 5d, \pm 10d\}$ are precomputed. The multiples $2d$ and $5d$ can be computed easily and quickly by digit recoding and constant shift operations, as shown in (9) and (10). The 10's and 2's complements are applied for $d^{1,2} < 0$. For both radix-2 and BCD-4221 radix-10 representation, the complements can be computed by inverting each bit and adding one. In summary, the subtraction uses a (3:1) radix-2 CPA for the two MSDs and a redundant radix-10 (4:2) CSA for the remaining digits. The digit recurrence algorithm of the type2 divider is summarized in Algorithm 3, and its block diagram is depicted in Figure 3.

Figure 3: Block diagram of type2 digit recurrence.

(1) Select $z_{j+1} = \mathrm{SEL}_{\mathrm{ROM}}(10\hat{w}[j], \hat{d})$
(2) Compute $(w_s[j+1] + w_c[j+1]) = 10\,(w_s[j] + w_c[j]) - z_{j+1} \cdot d$

Algorithm 3: Pseudocode for the type2 digit recurrence.
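Two properties used in this step can be checked with a short sketch: every quotient digit in $-8, \ldots, 8$ is a sum of two of the precomputed multiples from (16), and inverting the bits of a BCD-4221 digit yields its 9's complement (so that adding one gives the 10's complement). The helper names are hypothetical; the particular pair returned by `decompose` is just the first match of a brute-force search, not necessarily the pairing used in the hardware.

```python
from itertools import product

MULTIPLES = (0, 1, -1, 2, -2, 5, -5, 10, -10)   # factors of the precomputed multiples
W4221 = (4, 2, 2, 1)

def decompose(z):
    """Return one pair (a, b) of available factors with a + b == z."""
    for a, b in product(MULTIPLES, repeat=2):
        if a + b == z:
            return a, b
    raise ValueError(f"digit {z} is not representable")

def value_4221(bits):
    """Digit value of a 4-bit BCD-4221 code."""
    return sum(b * w for b, w in zip(bits, W4221))

def invert(bits):
    """Bitwise inversion of a 4-bit code."""
    return tuple(1 - b for b in bits)
```

Because the weights 4, 2, 2, 1 sum to 9, `value_4221(invert(bits))` equals `9 - value_4221(bits)` for all 16 codes.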

4.4. Type3 QDS Function. The frequency-limiting component of the type2 divider is the digit recurrence step, which comprises the ROM (BRAM) to calculate the next quotient digit. This BRAM is slower compared to common FPGA logic. Hence, it appears to be beneficial to remove the slow BRAM from the critical path and use fast FPGA logic instead. To analyze this impact of the BRAM delays, we implement two dividers (type3.a and type3.b), which can also be realized without BRAM in the critical path.

Lang and Nannarelli [14] propose a divider that implements a quotient digit selection function based on CPAs and sign detection. These CPAs subtract fixed values from the current residual with limited precision. These fixed values depend on the estimates of the divisor and are precomputed and stored in a ROM. Furthermore, the quotient digit is divided into two parts, which further simplifies the quotient digit selection function and reduces the critical path. Unfortunately, this divider cannot be implemented efficiently on FPGAs because the required carry-save adder with small error estimation shows a poor performance on the FPGA's slice structure. Nevertheless, to investigate the impact of BRAM delays in the type2 divider, we implement another two dividers (type3.a and type3.b) that have no BRAM in the critical path but a CPA-based quotient digit selection function instead.

The type3 dividers are modifications of the type2 divider. The common architectural features are

(i) the multiples of the divisor, which are composed of two components $z_{j+1} \cdot d = d^1_{j+1} + d^2_{j+1}$, where $d^{1,2} \in \{0, \pm 1d, \pm 2d, \pm 5d, \pm 10d\}$,
(ii) the fast redundant BCD-4221 carry-save adders for the digit recurrence step as stated in (5),
(iii) the radix-2 implementation of the two MSDs, and
(iv) the 10's and 2's complements for $d^{1,2} < 0$.

Furthermore, since the architectures of the type2 and type3 dividers are similar with the exception of the quotient digit selection function, most of the dividers' characteristics are also identical, including

(i) the P-D diagram,
(ii) the redundancy factor $\rho = 8/9$, $k_{\max} = 8$,
(iii) the need of prescaling the divisor, $0.4 \le d < 1.0$,
(iv) the accuracy of the divisor's estimation, $\delta = 2$, and
(v) the accuracy of the residual's estimation, $t = 2$.
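With these shared parameters, the comparison-based selection of the type3 dividers (Algorithms 4 and 5) can be modeled in software. The sketch below is a behavioral model under stated assumptions: exact rational arithmetic replaces the carry-save residual, and the estimates are obtained by a simple floor to one fractional digit (residual) and two fractional digits (divisor), whose error stays below $\epsilon = 0.2$. All function names are hypothetical.

```python
from fractions import Fraction
import math

RHO = Fraction(8, 9)
EPS = Fraction(2, 10)

def floor_to(value, frac_digits):
    """Truncate toward minus infinity to the given number of fractional digits."""
    scale = 10 ** frac_digits
    return Fraction(math.floor(value * scale), scale)

def selection_constants(d_i):
    """m_k(d_i) for k = 8 .. -7, following Algorithm 4."""
    m = {}
    for k in range(8, -8, -1):
        if k >= 1:
            m[k] = (k - 1 + RHO) * d_i - EPS   # U_{k-1}(d_i) - eps
        else:
            m[k] = (k - RHO) * d_i             # L_k(d_i)
    return m

def select_digit(w10_hat, m):
    """Algorithm 5, step (1): sign detection on m_k - 10*w_hat, from k = 8 downward."""
    for k in range(8, -8, -1):
        if m[k] - w10_hat < 0:
            return k
    return -8

def type3_divide(x, d, n):
    """n recurrence steps; asserts the convergence condition after each step."""
    x, d = Fraction(x), Fraction(d)
    m = selection_constants(floor_to(d, 2))    # constants depend only on the divisor
    w = x / 10
    digits = []
    for _ in range(n):
        z = select_digit(floor_to(10 * w, 1), m)
        w = 10 * w - z * d
        assert abs(w) <= RHO * d
        digits.append(z)
    return digits, w
```

For x = 0.5 and d = 0.7, three steps give the signed digits 1, -2, -8, that is, 1 - 0.2 - 0.08 = 0.72 with a negative residual.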

Contrary to the type2 divider, the QDS functions of the type3 dividers are implemented by 16 carry-propagate adders with sign detection. These adders subtract fixed selection constants from the estimation of the current residual. The selection constants are dependent on the current divisor and are stored in a ROM, which is then no longer part of the critical path. The selection constants are computed according to Algorithm 4. The common digit recurrence algorithm of the type3.a and type3.b dividers is shown in Algorithm 5, and the corresponding block diagram is depicted in Figure 4.

Figure 4: Block diagram of type3 digit recurrence.

(1) compute $L_k(d_i)$ and $U_k(d_i)$ for each $d_i \in \{0.40, 0.41, \ldots, 0.99\}$ and $k \in \{-8, \ldots, +8\}$ according to (12a) and (12b)
(2) compute the selection constants $m_k(d_i)$:
    $m_8(d_i) = U_7(d_i) - \epsilon = U_7(d_i) - 0.2$
    $m_7(d_i) = U_6(d_i) - 0.2$
    $\vdots$
    $m_1(d_i) = U_0(d_i) - 0.2$
    $m_0(d_i) = L_0(d_i)$
    $\vdots$
    $m_{-7}(d_i) = L_{-7}(d_i)$

Algorithm 4: Algorithm for the calculation of the type3 selection constants.

(1) if $m_8(d_i) - 10\hat{w}[j] < 0$ then $z_{j+1} = 8$
    else if $m_7(d_i) - 10\hat{w}[j] < 0$ then $z_{j+1} = 7$
    $\vdots$
    else if $m_{-7}(d_i) - 10\hat{w}[j] < 0$ then $z_{j+1} = -7$
    else $z_{j+1} = -8$
(2) Compute $(w_s[j+1] + w_c[j+1]) = 10\,(w_s[j] + w_c[j]) - z_{j+1} \cdot d$

Algorithm 5: Pseudocode for the type3 digit recurrence.

The sign signals of the binary carry-propagate adders are used to determine the corresponding quotient digit and to select the multiples of the divisor in the digit recurrence iteration step. The advantage of the type3 QDS function is its short critical path, which comprises only one 8-bit binary carry-propagate adder. However, this short critical path is bought at a high price because the complexity is moved to the selection of the divisor's multiples. Hence, large fan-in multiplexers with poor latencies are required. In order to minimize the impact of these multiplexers, we designed a new dedicated (17:1) multiplexer that exploits the FPGA's fast carry chains and has a delay of only one LUT instance. In the following, we name the divider with improved fast multiplexers type3.b and the divider with traditional multiplexers type3.a. The new dedicated (17:1) multiplexer uses the 16 sign bits as select lines and is coded as follows:

$$y = \begin{cases} x(16) & \text{if sel} = \texttt{'1111111111111111'}, \\ x(15) & \text{if sel} = \texttt{'0111111111111111'}, \\ \;\vdots & \\ x(1) & \text{if sel} = \texttt{'0000000000000001'}, \\ x(0) & \text{if sel} = \texttt{'0000000000000000'}, \end{cases} \tag{17}$$

with sel(−1) = 1 and sel(16) = 0. In other words, bit $x(i)$ is selected when sel($i$) = 0 and sel($i-1$) = 1. The selection of two input signals is implemented in one (5:1) LUT, and all signals are combined by a long OR gate that is implemented exploiting the FPGA's fast carry chain. Therefore, one (17:1) multiplexer requires nine LUTs, and the type3.b divider with fast multiplexers uses many more LUTs than the type3.a divider (see Section 6). The block diagram of such a (17:1) multiplexer is depicted in Figure 5.
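The select logic of (17) can be written as a Boolean expression and checked for all 17 thermometer-coded select words. The sketch below models only the logic function, not the LUT and carry-chain mapping; the function name and the list-based bit representation (index $i$ holds sel($i$) or $x(i)$) are assumptions of this example.

```python
def mux17(sel, x):
    """y = OR over i of ( x(i) AND NOT sel(i) AND sel(i-1) ), sel(-1) = 1, sel(16) = 0."""
    def sel_ext(i):
        if i < 0:
            return 1            # sel(-1) = 1
        if i > 15:
            return 0            # sel(16) = 0
        return sel[i]

    y = 0
    for i in range(17):
        y |= x[i] & (1 - sel_ext(i)) & sel_ext(i - 1)
    return y

def thermometer(n):
    """Select word with sel(0) .. sel(n-1) = 1 and the remaining bits 0."""
    return [1] * n + [0] * (16 - n)
```

For a thermometer code with $n$ ones, exactly the term for $i = n$ is active, so input $x(n)$ is passed to the output.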

Figure 5: Fast multiplexer (nine (5:1) LUTs, each with INIT = F5F7FDFF, combined through the carry chain).

4.5. Type4 QDS Function. The type4 divider applies a new algorithm as proposed by us in a previous paper [22]. It is based on the decomposition of the signed quotient digit into three components having values $\{-5, 0, 5\}$, $\{-2, 0, 2\}$, and $\{-1, 0, 1\}$ as well as a fast constant digit selection function. In this implementation, neither a ROM for the QDS function nor a multiplexer to select the multiple $z_j \cdot d$ in the digit recurrence iteration step is required. The divider is intended to utilize fewer resources than other implementations. Like the type2, type3.a, and type3.b dividers, the type4 divider uses signed-digit redundant quotient calculation, carry-save representation of the residual, fast BCD-4221 CSAs for the digit recurrence, and a radix-2 implementation of the MSDs.

Quotient digits are decomposed into three components, $z_{j+1} = 5 \cdot z^3_{j+1} + 2 \cdot z^2_{j+1} + z^1_{j+1}$, and each component $z^i$ is computed by a distinct selection function. These components can hold three values $z^i \in \{-1, 0, 1\}$, so that only two comparators per component are required to distinguish between these values. Since $-8 \le z_{j+1} \le 8$, we get a redundancy factor of $\rho = 8/9$. Furthermore, the selection functions become very simple due to prescaling of the divisor ($0.4 \le d < 0.8$); that is, they are constant functions $\mathrm{SEL}^i$ that do not depend on the divisor anymore:

$$z^3_{j+1} = \mathrm{SEL}^3(10\hat{w}[j]), \tag{18a}$$

$$z^2_{j+1} = \mathrm{SEL}^2(\hat{v}^1[j]), \tag{18b}$$

$$z^1_{j+1} = \mathrm{SEL}^1(\hat{v}^2[j]). \tag{18c}$$

The recurrence is then defined as follows:

$$v^1[j] = 10\,w[j] - z^3_{j+1} \cdot 5d, \tag{19a}$$

$$v^2[j] = v^1[j] - z^2_{j+1} \cdot 2d, \tag{19b}$$

$$w[j+1] = v^2[j] - z^1_{j+1} \cdot d. \tag{19c}$$

The multiples $2d$ and $5d$ can be computed easily and quickly according to (9) and (10). Each selection function requires two comparators implemented as carry-save adders that subtract constant values from the residuals' estimations ($\hat{w}[j]$, $\hat{v}^1[j]$, and $\hat{v}^2[j]$). As we will show in the following, these estimations require only the two most significant digits (MSDs) of the corresponding exact values.

First, the selection intervals $[L^i_k, U^i_k]$ for $z^i_{j+1} = k$ with $k \in \{-1, 0, 1\}$ and $i \in \{1, 2, 3\}$ have to be determined, where $L^i_k$ is the smallest and $U^i_k$ is the greatest value of the selection constant $m_k(i)$ such that the next residual is still bounded. Applying the convergence condition (8), the digit recurrence (19a), (19b), and (19c), and the redundancy factor $\rho = 8/9$, we obtain

$$|w[j]| \le \rho d = \frac{8}{9} d, \tag{20a}$$

$$|v^2[j]| \le (\rho + 1)\, d = \frac{17}{9} d, \tag{20b}$$

$$|v^1[j]| \le (\rho + 1 + 2)\, d = \frac{35}{9} d. \tag{20c}$$

From the recurrence $v^1[j] = 10w[j] - z^3_{j+1} \cdot 5d$, the bound $v^1[j] \le (35/9)\,d$, $z^3_{j+1} \in \{-1, 0, 1\}$, and replacing $10w[j]$ by the upper limit $U^3_k$, we get

$$U^3_k = (5k + \rho + 3)\, d = \left(5k + \frac{35}{9}\right) d. \tag{21a}$$

Similarly, we obtain the upper limits $U^2_k$ and $U^1_k$ from the recurrences (19b) and (19c), respectively:

$$U^2_k = (2k + \rho + 1)\, d = \left(2k + \frac{17}{9}\right) d, \tag{21b}$$

$$U^1_k = (k + \rho)\, d = \left(k + \frac{8}{9}\right) d. \tag{21c}$$

Likewise, the lower limits $L^i_k$ can be computed:

$$L^3_k = (5k - \rho - 3)\, d = \left(5k - \frac{35}{9}\right) d, \tag{22a}$$

$$L^2_k = (2k - \rho - 1)\, d = \left(2k - \frac{17}{9}\right) d, \tag{22b}$$

$$L^1_k = (k - \rho)\, d = \left(k - \frac{8}{9}\right) d. \tag{22c}$$

Due to the redundancy factor $\rho = 8/9 > 1/2$, we have overlapping regions $[L^i_{k+1}, U^i_k]$ where more than one quotient digit component $z^i_{j+1}$ may be selected. The decision boundary of a selection function should lie inside these overlapping regions. Figure 6 shows the P-D diagram for $z^3$. The figures for $z^2$ and $z^1$ look similar. As $z^3$ can hold three different values, the selection function requires two decision boundaries. Generally, it is implemented by a staircase function (see Figure 2), but for the bounded divisor $d \in [0.4, 0.8)$ it is independent of the divisor and is reduced to a constant function (see Figure 6). Fortunately, the selection functions for $z^2$ and $z^1$ are also independent of the divisor and can also be implemented by constant functions in a similar fashion.

Figure 6: P-D diagram for the type4 QDS function (shifted residual $10 \cdot w$ versus divisor $d$; decision boundaries $m^3_{\mathrm{pos}} = 0.9$ and $m^3_{\mathrm{neg}} = -1.5$).

We now determine the required number of fractional digits for the estimates $10\hat{w}$, $\hat{v}^1$, and $\hat{v}^2$. Moreover, we choose suitable selection constants $m^i_{\mathrm{pos}}$ and $m^i_{\mathrm{neg}}$. These constants depend on the maximum error due to the estimates and on the minimum overlapping width, which is given by $U^i_0(0.4) - L^i_1(0.8)$ as well as $U^i_{-1}(0.8) - L^i_0(0.4)$. Since we use BCD-4221 carry-save redundant digit representation, for a precision of $t$ digits (one integer digit and $t-1$ fractional digits) we have a maximum error of $\epsilon = 2 \cdot 10^{-(t-1)}$. Moreover, the estimations $\hat{w}$ are computed by truncations ($w - \hat{w} \le \epsilon$). This means that for given positive and negative selection constants $m^i_{\mathrm{pos}} > 0$ and $m^i_{\mathrm{neg}} < 0$ the following expressions must be true:

$$U^i_0(0.4) - m^i_{\mathrm{pos}} \ge \epsilon, \quad L^i_1(0.8) \le m^i_{\mathrm{pos}}, \tag{23}$$

$$L^i_0(0.4) \le m^i_{\mathrm{neg}}, \quad U^i_{-1}(0.8) - m^i_{\mathrm{neg}} \ge \epsilon. \tag{24}$$

If we choose the selection constants as listed in Table 2all conditions are fulfilled for a precision of 119905 = 2 digitsand each selection function is reduced to two simple 2-digitcomparators

As soon as one component $z^i$ of the quotient digit is computed, the corresponding multiple of the divisor must be subtracted from the partial residual. Each component $z^i$ can hold three different values, $z^i \in \{-1, 0, +1\}$, that are multiplied with the corresponding weighted divisor $n \cdot d$ with $n \in \{5, 2, 1\}$. In other words, $n \cdot d$ is either passed, negated, or reset. Passing and resetting the multiple of the divisor is performed at no extra cost. Negation is accomplished by bit inversion and adding +1 through the carry inputs of the carry-save adders in (19a), (19b), and (19c). In summary, this is a very fast operation and requires only two LUTs per digit.

Table 2: Constants for type4 quotient digit selection function.

| Component | $m^i_{\mathrm{pos}}$ | $m^i_{\mathrm{neg}}$ | $U^i_0(0.4) - m^i_{\mathrm{pos}}$ | $U^i_{-1}(0.8) - m^i_{\mathrm{neg}}$ |
|---|---|---|---|---|
| $z^3$ | 0.9 | $-1.5$ | 0.656 | 0.611 |
| $z^2$ | 0.1 | $-0.7$ | 0.656 | 0.611 |
| $z^1$ | 0.1 | $-0.3$ | 0.255 | 0.211 |
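The slack columns of Table 2 can be checked against the error bound $\epsilon = 2 \cdot 10^{-(t-1)}$ from (23) and (24). The following is only a minimal sketch: the names `SLACK`, `max_error`, and `slack_conditions_hold` are illustrative and not from the paper, the table values are read with their decimal points restored, and only the two slack inequalities are tested (the lower-bound conditions on $L^i_1$ and $L^i_0$ are not, because their values are not tabulated).

```python
from decimal import Decimal as D

# Slack values from Table 2: (U_0(0.4) - m_pos, U_-1(0.8) - m_neg) per component.
SLACK = {
    "z3": (D("0.656"), D("0.611")),
    "z2": (D("0.656"), D("0.611")),
    "z1": (D("0.255"), D("0.211")),
}


def max_error(t):
    """Maximum estimate error for t digits in carry-save form: 2 * 10^-(t-1)."""
    return 2 * D(10) ** (-(t - 1))


def slack_conditions_hold(t):
    """True if every tabulated slack is at least the maximum error."""
    eps = max_error(t)
    return all(pos >= eps and neg >= eps for pos, neg in SLACK.values())
```

Under these assumptions, the smallest slack (0.211) still exceeds the error of 0.2 for $t = 2$, whereas a single digit ($t = 1$, error 2) would not be sufficient.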

Similar to type2 and type3 division, we implement the two MSDs of the type4 residual by using a nonredundant radix-2 representation, which requires fewer LUTs and results in a faster quotient digit selection function. The pseudocode for the digit recurrence iteration is shown in Algorithm 6, and the corresponding block diagram is depicted in Figure 7.

4.6. Proposed Decimal Fixed-Point Divider. For each type of division algorithm presented in the preceding sections, we have implemented a corresponding fixed-point divider, which is described in the following. For all fixed-point dividers, we expect the input operands to be normalized.

The block diagram of the type1 divider is depicted in Figure 8. First, the divisor multiples $2d, \ldots, 9d$ are precomputed. In the following cycles, $p = 16$ quotient digits ($16 + 1$ if the first digit is zero) are computed one by one, followed by an additional rounding digit. The quotient (Z) and the quotient +1 (ZP1) are computed on-the-fly by using two registers that are updated every cycle. The algorithm has the advantage that no additional slow decimal CPA is required to compute the incremented quotient. It is described more precisely in Section 5. The normalization of the result is also performed on-the-fly by locking the conversion when $16 + 1$ quotient digits for the significand and the rounding digit have been computed; that is, in the worst case, when the first quotient digit is zero, $16 + 1 + 1 = 18$ cycles are required. Furthermore, the sticky bit, which is required for rounding, is calculated. It is set to one whenever the final remainder is unequal to zero.

The architectures of the type2, type3.a, type3.b, and type4 dividers are similar in their structure but differ in their digit recurrence algorithm and scaling. The common block diagram is shown in Figure 9. First, the dividend and the divisor are rescaled,

$$ d \in [0.1, 1.0) \;\Longrightarrow\; d' \in \begin{cases} [0.4, 1.0) & \text{for type2, type3.a, and type3.b}, \\ [0.4, 0.8) & \text{for type4}, \end{cases} \tag{25} $$

and five as well as two times the divisor are precomputed according to (9) and (10). Then the operands are recoded, because the digit recurrence iteration uses a redundant digit representation with BCD-4221 carry-save adders.

The digit recurrence retires one signed quotient digit $z_k$ in each iteration. The on-the-fly conversion algorithm converts this signed-digit representation to the BCD-8421-coded


Figure 7: Block diagram of type4 digit recurrence. [Diagram not reproduced. Recoverable elements: three cascaded stages for $z^3$, $z^2$, and $z^1$, each with binary CPA comparators against the constants 0.9/−1.5, 0.1/−0.7, and 0.1/−0.3, an invert/reset unit for $5d$, $2d$, and $d$, and a BCD-4221 CSA; an L1 shifter forms the next residual.]

Figure 8: Block diagram of the normalized type1 division. [Diagram not reproduced. Blocks: normalized dividend and normalized divisor; compute divisor multiples ($2d, 3d, \ldots, 9d$); compute $16 + 1 + 1$ digit recurrence steps ($z_k$, $w[k]$); compute the quotient on-the-fly and determine the sticky bit; outputs Z, ZP1, and sb.]

(1) Select

$$ z^3_{j+1} = \begin{cases} 1 & \text{if } 0.9 \le 10\hat{w}[j], \\ 0 & \text{if } -1.5 \le 10\hat{w}[j] < 0.9, \\ -1 & \text{if } 10\hat{w}[j] < -1.5, \end{cases} $$

and compute $(v^1_s[j] + v^1_c[j]) = 10\,(w_s[j] + w_c[j]) - z^3_{j+1} \cdot 5d$.

(2) Select

$$ z^2_{j+1} = \begin{cases} 1 & \text{if } 0.1 \le \hat{v}^1[j], \\ 0 & \text{if } -0.7 \le \hat{v}^1[j] < 0.1, \\ -1 & \text{if } \hat{v}^1[j] < -0.7, \end{cases} $$

and compute $(v^2_s[j] + v^2_c[j]) = (v^1_s[j] + v^1_c[j]) - z^2_{j+1} \cdot 2d$.

(3) Select

$$ z^1_{j+1} = \begin{cases} 1 & \text{if } 0.1 \le \hat{v}^2[j], \\ 0 & \text{if } -0.3 \le \hat{v}^2[j] < 0.1, \\ -1 & \text{if } \hat{v}^2[j] < -0.3, \end{cases} $$

and compute $(w_s[j+1] + w_c[j+1]) = (v^2_s[j] + v^2_c[j]) - z^1_{j+1} \cdot d$.

Algorithm 6: Pseudocode for the type4 digit recurrence.
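The three selection steps of Algorithm 6 can be illustrated in software. The sketch below is a behavioral model under stated assumptions, not the hardware design: it uses exact rational arithmetic instead of truncated carry-save estimates, starts the recurrence with $w[0] = x$, and assumes $d \in [0.4, 0.8)$ and $|x| \le \tfrac{8}{9} d$. The function names `select`, `type4_step`, and `divide` are illustrative.

```python
from fractions import Fraction as F


def select(value, m_pos, m_neg):
    """Constant, divisor-independent selection of a component in {-1, 0, +1}."""
    if value >= m_pos:
        return 1
    if value >= m_neg:
        return 0
    return -1


def type4_step(w, d):
    """One radix-10 iteration: returns (signed quotient digit, next residual)."""
    z3 = select(10 * w, F(9, 10), F(-15, 10))   # step (1), weight 5
    v1 = 10 * w - z3 * 5 * d
    z2 = select(v1, F(1, 10), F(-7, 10))        # step (2), weight 2
    v2 = v1 - z2 * 2 * d
    z1 = select(v2, F(1, 10), F(-3, 10))        # step (3), weight 1
    w_next = v2 - z1 * d
    return 5 * z3 + 2 * z2 + z1, w_next


def divide(x, d, digits):
    """Run `digits` iterations; returns (quotient, final residual, digit list)."""
    assert F(2, 5) <= d < F(4, 5), "sketch assumes a scaled divisor in [0.4, 0.8)"
    assert abs(x) <= F(8, 9) * d, "sketch assumes |x| <= (8/9) d"
    w, q, zs = x, F(0), []
    for j in range(1, digits + 1):
        z, w = type4_step(w, d)
        zs.append(z)
        q += F(z, 10 ** j)
    return q, w, zs
```

For example, with $x = 1/3$ and $d = 1/2$ the first iteration selects $z^3 = 1$, $z^2 = 1$, $z^1 = 0$, which gives the quotient digit 7 and the residual $-1/6$; the following negative digits then correct the quotient toward $0.666\ldots$.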

Figure 9: Block diagram of the normalized type2, type3.a, type3.b, and type4 divisions. [Diagram not reproduced. Blocks: normalized dividend and normalized divisor; scale divisor $d \in [0.1, 1) \rightarrow d' \in [0.4, 1)$ or $[0.4, 0.8)$; compute divisor multiples $2d'$ and $5d'$; BCD-8421 to BCD-4221 recoding of both operands; compute $16 + 1 + 1$ digit recurrence steps ($z_k$, $w[k]$); compute the quotient on-the-fly (ZM1′, Z′, ZP1′); determine sticky bit and sign of the remainder; MUX; outputs Z, ZP1, R, and sb.]

result [20] and computes the quotient (Z), the quotient +1 (ZP1), and the quotient −1 (ZM1). This conversion is accomplished in every iteration step and does not need a slow CPA. The incremented and decremented quotients are required for rounding. Moreover, the quotient is also normalized on-the-fly by locking the conversion when $16 + 1$ quotient digits (including the rounding digit) have been computed. The algorithm is described explicitly in Section 5 because it is accomplished together with the gradual underflow handling of the floating-point divider.


if $(\overline{\mathrm{qNaN}_X}$ and $\overline{\mathrm{sNaN}_X})$ and $(\mathrm{qNaN}_D$ or $\mathrm{sNaN}_D)$ then
    $C'_X = C_D$, $C'_D = 0 \ldots 01$
    $q'_X = q_D$, $q'_D = 0 + \mathrm{bias} = 398$
else if $(\mathrm{qNaN}_X$ or $\mathrm{sNaN}_X)$ then
    $C'_X = C_X$, $C'_D = 0 \ldots 01$
    $q'_X = q_X$, $q'_D = 0 + \mathrm{bias} = 398$
else (no changes)
    $C'_X = C_X$, $C'_D = C_D$
    $q'_X = q_X$, $q'_D = q_D$

Algorithm 7: Algorithm of NaN handling.
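The operand swap of Algorithm 7 can be sketched as follows. This is a simplified model under assumptions: quiet and signaling NaNs are merged into one hypothetical `is_nan` flag, the first branch is read as "dividend is not a NaN and divisor is a NaN", and the `Operand` type and `nan_handling` function are illustrative names rather than parts of the published design.

```python
from dataclasses import dataclass

BIAS = 398  # IEEE 754-2008 decimal64 exponent bias


@dataclass
class Operand:
    c: int         # significand (for a NaN: the payload), regarded as an integer
    q: int         # biased exponent
    is_nan: bool   # True for a quiet or a signaling NaN (assumed merged flag)


def nan_handling(x, d):
    """Move a NaN operand into the dividend and replace the divisor by 1 * 10^0."""
    one = Operand(c=1, q=0 + BIAS, is_nan=False)
    if not x.is_nan and d.is_nan:
        # Only the divisor is a NaN: its payload becomes the dividend.
        return Operand(d.c, d.q, True), one
    if x.is_nan:
        # The dividend is a NaN: keep it and neutralize the divisor.
        return Operand(x.c, x.q, True), one
    return x, d  # no NaN involved: no changes
```

Dividing the NaN-carrying dividend by one leaves the payload digits untouched, which is why an unmodified divider can be used afterwards.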

Furthermore, the sticky bit, which is required for rounding, is calculated. It is set to one whenever the final remainder is unequal to zero. Moreover, when the final remainder is negative, the quotient of the type2, type3.a, type3.b, and type4 dividers has to be adjusted by subtracting one LSD. This subtraction does not need another slow CPA because the quotient −1 (ZM1) has already been computed on-the-fly. The calculations of both the sticky bit and the sign bit require the reduction of the redundant remainder by a CPA. This CPA might be subdivided into multiple smaller CPAs to keep the latency low.

5. IEEE 754-2008 Floating-Point Division

The IEEE 754-2008 compliant decimal floating-point divider is an extension of the normalized fixed-point divider. The divider presented in this paper supports the interchange format IEEE 754-2008 decimal64 with DPD encoding, but it can easily be adapted to any other precision and exponent range.

The block diagram of the divider is depicted in Figure 10. The DFP division begins with the decoding of the dividend X and the divisor D and the extraction of their signs, significands, and exponents. In the following, the significands of the operands X ($c_X$) and D ($c_D$) are regarded as integers. Therefore, the corresponding exponents are $q_X$ and $q_D$.

According to [4], if one of the operands is a signaling NaN (sNaN) or a quiet NaN (qNaN), then the result is also a quiet NaN with the payload of the original NaN. Hence, in order to preserve the payload while using an unmodified divider, the NaN handling unit sets the NaN-holding operand as the dividend and resets the divisor to $1.0 \cdot 10^{0}$, as depicted in Algorithm 7.

The fixed-point dividers presented in Section 4 require normalized operands. Therefore, the number of leading zeros of both operands is counted, and the significands are normalized by barrel shifters. For performance reasons, the leading-zero counters exploit the FPGA's fast carry chains, as proposed in [23].

The exponent of a floating-point division is first estimated by the exponent of the normalized quotient in fractional representation (QNF) and is updated iteratively in each digit recurrence step. Since most of the decimal floating-point numbers specified by IEEE 754-2008 have multiple representations of the same value, the result of a division might also have more than one correct representation, and these differ in the exponent. To obtain a unique and reproducible result in the case of such an ambiguity, IEEE 754-2008 defines the exponent of the result, which is called the preferred exponent (QP).

Both exponents, QNF and QP, are positive integers and are determined as follows:

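$$ \Delta q = (q_X - q_D), \qquad \Delta \mathrm{LZ} = (\mathrm{LZ}_X - \mathrm{LZ}_D), \tag{26} $$

$$ \mathrm{QNF} = \Delta q - \Delta \mathrm{LZ} + \mathrm{offset1}, \qquad \mathrm{QP} = \Delta q + \mathrm{offset1}. \tag{27} $$

The offset1 is a bias that ensures that the exponents are positive, since unsigned integer calculation is used in this paper. The bias is composed of

$$ \begin{aligned} \mathrm{offset1} = 783 ={} & 398 && \text{(IEEE 754-2008 bias)} \\ {}+{} & 369 && (q_{\max}) \\ {}+{} & 15 && \text{(conversion fractional to integer)} \\ {}+{} & 1 && \text{(normalization of the result)}. \end{aligned} \tag{28} $$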
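Equations (26) to (28) amount to a few integer additions. The sketch below evaluates them directly; the function name `exponent_estimates` and the constant names are illustrative assumptions, while the four summands of offset1 are the ones listed in (28).

```python
IEEE_BIAS = 398        # IEEE 754-2008 decimal64 bias
Q_MAX = 369            # q_max
FRAC_TO_INT = 15       # conversion fractional to integer
NORMALIZATION = 1      # normalization of the result
OFFSET1 = IEEE_BIAS + Q_MAX + FRAC_TO_INT + NORMALIZATION


def exponent_estimates(q_x, q_d, lz_x, lz_d):
    """Return (QNF, QP) from operand exponents and leading-zero counts."""
    delta_q = q_x - q_d
    delta_lz = lz_x - lz_d
    qnf = delta_q - delta_lz + OFFSET1   # normalized, fractional representation
    qp = delta_q + OFFSET1               # preferred exponent
    return qnf, qp
```

Because only the difference of the operand exponents enters, it does not matter for this sketch whether `q_x` and `q_d` are passed in biased or unbiased form, as long as both use the same convention.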

The normalized fixed-point division unit then computes $p = 16$ significand digits and one additional rounding digit. Additionally, the divider encompasses the on-the-fly conversion unit, which also detects and handles gradual underflow with zero delay overhead. One signed quotient digit $z_j$ is retired in each cycle, and the on-the-fly conversion algorithm converts this signed-digit representation into the BCD-8421-coded partial result [20]. The conversion is accomplished in every iteration step and does not need a slow CPA (see Algorithm 8). The partial quotient is stored in a register Z′. Moreover, two additional registers ZM1′ and ZP1′ are provided: ZM1′ is the partial quotient decremented by one least significant digit (LSD), and ZP1′ is the partial quotient incremented by one LSD. ZM1′ is selected when the final residual in the last iteration is negative, and ZP1′ is required by the following rounding operation. Furthermore, the exponent for the normalized significand in integer representation (QNI) is calculated (see (29)), and gradual underflow handling is performed at no extra cost (see Algorithm 8). The exponent QNI is computed by decrementing the exponent of the normalized significand in fractional representation (QNF) by one in each iteration step. The recurrence iteration terminates earlier in the case of gradual underflow. The gradual underflow signal (GUF) is asserted high when $\mathrm{QNI} = q_{\min}$ and the calculated integer quotient is less than $10^{p}$:

$$ \mathrm{QNI} = \begin{cases} q_{\min} & \text{if } \mathrm{GUF} = 1, \\ \mathrm{QNF} - (p - 1) & \text{if } z_1 > 0, \\ \mathrm{QNF} - p & \text{if } z_1 = 0. \end{cases} \tag{29} $$

The extended quotient (composed of the calculated quotient and the rounding digit) must be decremented by one if the


Figure 10: Block diagram of the floating-point divider. [Diagram not reproduced. Blocks: two DPD decoders delivering $s_X, q_X, c_X$ and $s_D, q_D, c_D$; NaN handling; leading zeros count and QNF and QP calculations; dividend normalization and divisor normalization; normalized fixed-point division; on-the-fly conversion with gradual underflow handling (GUF, QNI, TZ, Z, ZP1, R, sb); exponent calculation ($\mathrm{Z}_e$, $\mathrm{ZP1}_e$, of, uf); rounding unit; exception unit (overflow, underflow, gradual underflow, inexact, invalid, div0); DPD encoder delivering the quotient from $s_Z, q_Z, c_Z$.]

final remainder is less than zero. The sticky bit (sb) is used for rounding and is asserted high whenever the remainder is unequal to zero.

Moreover, the number of trailing zeros (TZ) is computed on-the-fly. This number indicates by how many digits the computed quotient might be shifted to the right without losing accuracy. The number of trailing zeros has to be counted only for Z′, because if ZM1′ is selected, the residual will be unequal to zero and TZ = 0 anyway. The number of trailing zeros is used for the selection between the preferred exponent (QP) and the normalized exponent for integer representation (QNI), as described in the following paragraph.

The exponent calculation unit selects either QP or QNI. The preferred exponent should be selected whenever possible. This is only feasible when $\mathrm{QNI} < \mathrm{QP} \le \mathrm{QNI} + \mathrm{TZ}$, that is, when there is a sufficient number of trailing zeros to shift the significand to the right. Furthermore, the final exponent must satisfy the minimum and maximum exponent range. The algorithm of the exponent calculation is illustrated in Algorithm 9, where the value

$$ \mathrm{offset2} = 783 - \mathrm{bias} = 783 - 398 = 385 \tag{30} $$

is used to add the IEEE 754-2008 bias and to remove offset1 again (which was introduced in (27)).

QNI′ = QNF + 1; Z′ = ZM1′ = ZP1′ = 0; TZ′ = 0; $j = 0$
while (Z′ < $10^{p}$) and (QNI′ ≥ $q_{\min}$)
    Z′ = $\begin{cases} 10 \cdot \mathrm{Z}' + z_{j+1} & \text{if } z_{j+1} \ge 0 \\ 10 \cdot \mathrm{ZM1}' + (z_{j+1} + 10) & \text{if } z_{j+1} < 0 \end{cases}$
    ZM1′ = $\begin{cases} 10 \cdot \mathrm{Z}' + (z_{j+1} - 1) & \text{if } z_{j+1} > 0 \\ 10 \cdot \mathrm{ZM1}' + (z_{j+1} + 9) & \text{if } z_{j+1} \le 0 \end{cases}$
    ZP1′ = $\begin{cases} 10 \cdot \mathrm{Z}' + (z_{j+1} + 1) & \text{if } z_{j+1} \ge -1 \\ 10 \cdot \mathrm{ZM1}' + (z_{j+1} + 11) & \text{if } z_{j+1} < -1 \end{cases}$
    TZ′ = $\begin{cases} \mathrm{TZ}' + 1 & \text{if } z_{j+1} = 0 \\ 0 & \text{else} \end{cases}$
    QNI′ = QNI′ − 1
    $j = j + 1$
QNI = QNI′ + 1 (remove the impact of the rounding digit)
R′ = $\begin{cases} z_{j+1} - 1 & \text{if remainder} < 0 \\ z_{j+1} & \text{else} \end{cases}$
ZM1″ = $\lfloor \mathrm{ZM1}' \cdot 10^{-1} \rfloor$; Z″ = $\lfloor \mathrm{Z}' \cdot 10^{-1} \rfloor$; ZP1″ = $\lfloor \mathrm{ZP1}' \cdot 10^{-1} \rfloor$
(Z, ZP1, R) = $\begin{cases} (\mathrm{ZM1}'', \mathrm{Z}'', R' + 10) & \text{if } R' < 0 \\ (\mathrm{Z}'', \mathrm{ZP1}'', R') & \text{else} \end{cases}$
sb = $\begin{cases} 0 & \text{if remainder} = 0 \\ 1 & \text{else} \end{cases}$
GUF = $\begin{cases} 1 & \text{if } \mathrm{QNI} = q_{\min} \text{ and } \mathrm{Z}' < 10^{p} \text{ and } (\mathrm{sb} = 1 \text{ or } R \ne 0) \\ 0 & \text{else} \end{cases}$
if GUF = 1 then underflow
TZ = $\begin{cases} 0 & \text{if sb} = 1 \\ \mathrm{TZ}' & \text{else} \end{cases}$

Algorithm 8: On-the-fly conversion with gradual underflow handling.
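The register updates inside the loop of Algorithm 8 can be modeled with plain integers. The sketch below covers only the conversion of signed digits into Z′, ZM1′, and ZP1′ and the trailing-zero count; the loop termination, the exponent update, the rounding-digit extraction, and the GUF logic are omitted. The function name `on_the_fly` is illustrative, and the model assumes that the first nonzero digit is positive, as for a positive quotient.

```python
def on_the_fly(digits):
    """Convert signed digits (most significant first) without carry propagation.

    Returns (Z, ZM1, ZP1, TZ). Every update appends one digit in 0..9 to
    either the old Z or the old ZM1, so no carry-propagate adder is needed.
    """
    z = zm1 = zp1 = 0
    tz = 0
    for d in digits:
        # All three updates read the registers of the previous iteration.
        z_new = 10 * z + d if d >= 0 else 10 * zm1 + (d + 10)
        zm1_new = 10 * z + (d - 1) if d > 0 else 10 * zm1 + (d + 9)
        zp1_new = 10 * z + (d + 1) if d >= -1 else 10 * zm1 + (d + 11)
        z, zm1, zp1 = z_new, zm1_new, zp1_new
        tz = tz + 1 if d == 0 else 0
    return z, zm1, zp1, tz
```

For the digit sequence 7, −3, 0, 5, −8 the value is $70000 - 3000 + 50 - 8 = 67042$, and the sketch keeps ZM1′ and ZP1′ exactly one LSD below and above it after every step.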

Rounding is required when the number of significant digits of the quotient exceeds the length $p$ of a decimal word. The divider presented in this paper provides the five rounding modes requested by [4]. Rounding either selects the integer quotient $\mathrm{Z}_e$ or the incremented quotient $\mathrm{ZP1}_e = \mathrm{Z}_e + 1$. As proven by Theorem A.1 in the appendix, rounding overflow cannot occur in DFP division. Rounding overflow would occur if $\mathrm{Z}_e + 1$ overflowed due to rounding. The selection of the quotient depends on the rounding mode, the rounding digit (R), the sticky bit (sb), and the least significant digit (LSD) of the quotient. The calculation of the round-up detection is summarized in Table 3.

Finally, the result is encoded again, and the exception unit might assert six exception signals. These are division by zero, invalid operation, result is infinite, inexact, overflow, and underflow. Division by zero is asserted when the divisor equals zero but the dividend is unequal to zero. The operation is called invalid when both operands are zero, both operands are infinity, or any of the operands is either a signaling or a quiet NaN. The inexact flag is asserted when the


Table 3: Round-up detection.

| Rounding mode | Round-up detection |
|---|---|
| Round Ties To Even | $(R > 5) + (R \equiv 5) \cdot \mathrm{sb} + (R \equiv 5) \cdot \overline{\mathrm{sb}} \cdot \mathrm{LSD}(0)$ |
| Round Ties To Away | $(R \ge 5)$ |
| Round Toward Positive | $\overline{\mathrm{sign}} \cdot ((R > 0) + \mathrm{sb})$ |
| Round Toward Negative | $\mathrm{sign} \cdot ((R > 0) + \mathrm{sb})$ |
| Round Toward Zero | 0 |

Legend: "+": logical OR; "$\cdot$": logical AND.
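The round-up detection of Table 3 can be expressed as a small Boolean function. This is a sketch under assumptions: the mode strings and the function name `round_up` are illustrative, `sign` is taken as 1 for a negative result, LSD(0) is read as the least significant bit of the last quotient digit (set for odd digits), and the usual meaning of the directed modes is used (toward positive rounds up only for a non-negative result).

```python
def round_up(mode, r, sb, lsd, sign):
    """Return True if the incremented quotient ZP1 has to be selected.

    r: rounding digit (0..9), sb: sticky bit, lsd: last quotient digit,
    sign: 1 for a negative result, 0 otherwise.
    """
    sb = bool(sb)
    inexact = r > 0 or sb
    if mode == "ties_to_even":
        return r > 5 or (r == 5 and sb) or (r == 5 and not sb and lsd % 2 == 1)
    if mode == "ties_to_away":
        return r >= 5
    if mode == "toward_positive":
        return sign == 0 and inexact
    if mode == "toward_negative":
        return sign == 1 and inexact
    if mode == "toward_zero":
        return False
    raise ValueError("unknown rounding mode: " + mode)
```

An exact tie (rounding digit 5, sticky bit 0) is thus rounded up only if the last quotient digit is odd, which makes the rounded digit even.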

offset2 = 783 − 398 = 385
if (QP ≤ QNI) or (QNI + TZ < QP) or (QP > $q_{\max}$) then
    (a) QNI > $q_{\max}$ ⇒ overflow
    (b) $q_{\min}$ ≤ QNI ≤ $q_{\max}$ ⇒
            $q_Z$ = QNI − offset2
            $\mathrm{Z}_e$ = Z
            $\mathrm{ZP1}_e$ = ZP1
    (c) QNI < $q_{\min}$ ⇒ underflow
else
    (a) $q_{\min}$ ≤ QP ≤ $q_{\max}$ ⇒
            $q_Z$ = QP − offset2
            $\mathrm{Z}_e$ = Z ≫ (QP − QNI)
            $\mathrm{ZP1}_e$ = ZP1 ≫ (QP − QNI)
    (b) QNI < $q_{\min}$ ⇒ underflow

Algorithm 9: Exponent calculation.
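The case distinction of Algorithm 9 can be sketched as follows. Several points are assumptions made for illustration: `q_min` and `q_max` are passed as parameters on the same offset scale as QNI and QP, the operator ≫ is interpreted as a decimal right shift (integer division by a power of ten), and the result is returned as a tuple whose first element is a status string. The name `exponent_calculation` is hypothetical.

```python
OFFSET2 = 783 - 398  # removes offset1 and adds the IEEE 754-2008 bias


def exponent_calculation(qp, qni, tz, z, zp1, q_min, q_max):
    """Select QP or QNI; returns (status, q_Z, Z_e, ZP1_e)."""
    if qp <= qni or qni + tz < qp or qp > q_max:
        # The preferred exponent is not reachable: keep the normalized result.
        if qni > q_max:
            return ("overflow", None, None, None)
        if qni >= q_min:
            return ("ok", qni - OFFSET2, z, zp1)
        return ("underflow", None, None, None)
    if q_min <= qp:
        # Enough trailing zeros: shift right by QP - QNI decimal digits.
        scale = 10 ** (qp - qni)
        return ("ok", qp - OFFSET2, z // scale, zp1 // scale)
    # Here QNI < QP < q_min holds, which matches case (b) of the else branch.
    return ("underflow", None, None, None)
```

In the second branch the condition QP ≤ $q_{\max}$ is already guaranteed by the outer test, so only the lower bound is checked.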

rounding digit or the sticky bit is unequal to zero. Overflow and underflow are computed as described in the exponent calculation unit and are listed in Algorithms 8 and 9. The infinity exception is asserted when the result is greater than the largest number, that is, when either the overflow or the division-by-zero exception is asserted, or when the dividend is infinity while the divisor is a finite number.

6. Implementation Results

All dividers are modeled using VHDL and are implemented for Xilinx Virtex-5 devices with speed grade −2 using Xilinx ISE 10.1. The postplace and route results of the fixed-point dividers are listed in Table 4.

All five types of fixed-point dividers require 19 cycles to perform a division. This includes 17 cycles to determine the quotient digits and the rounding digit, one additional cycle to normalize the result (when the first quotient digit is zero), and one cycle of latency to perform the on-the-fly conversion and the sticky bit determination. As expected, the type1 divider uses the most resources in terms of look-up tables (LUTs) and flip-flops (FFs). The type2, type3.a, and type3.b dividers consume fewer LUTs but additionally require five (type2) or two (type3.a and type3.b) BRAMs. However, comparing the delay of the dividers leads to an unexpected result. Contrary to decimal divider implementations in ASIC designs, on FPGA platforms the shift-and-subtract algorithm is the fastest. The reason is that the signal propagation on the FPGA's internal fast carry chains is faster than that on interconnections between slices over the FPGA's general routing matrix. The type1 divider exploits these fast carry chains, whereas the type2, type3.a, type3.b, and type4 dividers only use the normal, slow slice interconnection resources. In ASIC implementations, for instance, the longest path of the type4 divider would be much shorter than the longest path of the type1 divider, but on FPGA architectures the situation is the opposite.

One of the fastest decimal fixed-point dividers on ASICs is the design of Lang and Nannarelli [14]. They minimize the critical path in the digit recurrence by using a fast quotient digit selection function based on binary carry-propagate adders that subtract fixed values from the estimate of the current residual. These fixed values depend on estimates of the divisor and are precomputed and stored in a ROM. Therefore, the latency of the ROM does not contribute to the critical path of the digit recurrence. The implementation of such a divider on FPGAs would perform poorly because the required carry-save adder with small error estimation cannot be implemented efficiently on the FPGA's slice structure. However, the concept of removing the ROM from the critical path is applied to the type3.a and type3.b dividers, and the corresponding postplace and route results are shown in Table 4. These results show that the type3.a divider utilizes a similar number of LUTs as the type2 divider, but its cycle time is much higher. The increased cycle time can be explained by the raised complexity of the multiplexers for the selection of the divisor's multiples in the digit recurrence step. These multiplexers perform poorly in terms of propagation delay because they are implemented as a tree with slow slice interconnections. In contrast, the type3.b divider applies dedicated fast multiplexers, which have a propagation delay of only one LUT instance and eight fast carry chains. This dedicated fast multiplexer speeds up the divider and reduces the maximum cycle time by one nanosecond. Unfortunately, the number of used LUTs increases dramatically, by approximately 1000 LUTs.

The comparison of the type2, type3.a, type3.b, and type4 dividers in terms of speed and area (number of occupied LUTs and FFs combined) shows that the type2 algorithm is the fastest and requires the least number of slices. Hence, there is no benefit in removing the ROM's latency from the critical path, because the complexity is moved to the selection of the divisor's multiples. The type1 divider is only 5% faster than the type2 divider, but the speed is bought at a high price because the number of occupied slices is 75% higher. Therefore, the type2 divider shows a good tradeoff in terms of area and latency and is used for the floating-point divider presented in this paper. In the following paragraph, we compare the type2 implementation with other published fixed-point dividers.

Four other FPGA-based dividers are presented by Ercegovac and McIlhenny [17, 18], Deschamps and Sutter [24], Zhang et al. [25], and Vestias and Neto [12]. These implementations are compared to our type2 divider in the following. The divider presented in [17, 18] is based on a digit recurrence algorithm that only requires limited-precision multipliers,


Table 4: Results for normalized decimal fixed-point dividers, $p = 16 + 1$.

| | Type1 | Type2 | Type3.a | Type3.b | Type4 |
|---|---|---|---|---|---|
| Number of LUTs | 3595 | 1704 | 1749 | 2751 | 1846 |
| Number of FFs | 1234 | 1126 | 1262 | 1262 | 1031 |
| Number of LUTs and FFs combined | 3868 | 2210 | 2304 | 3240 | 2203 |
| Number of 36k BRAM | 0 | 5 | 2 | 2 | 0 |
| Number of LUTs, normalized to the type2 divider | 2.11 | 1.00 | 1.03 | 1.61 | 1.08 |
| Number of FFs, normalized to the type2 divider | 1.10 | 1.00 | 1.12 | 1.12 | 0.92 |
| Number of LUTs and FFs combined, normalized to the type2 divider | 1.75 | 1.00 | 1.04 | 1.47 | 1.00 |
| Cycle time (ns) | 8.1 | 8.5 | 10.1 | 9.1 | 12.1 |
| Overall latency (ns) (19 × cycle time) | 154 | 162 | 192 | 173 | 230 |
| Max. frequency (MHz) | 123 | 118 | 99 | 110 | 83 |
| Max. frequency, normalized to the type2 divider | 1.05 | 1.00 | 0.84 | 0.93 | 0.70 |

Table 5: Performance comparison of fixed-point dividers.

| Design | Occupied area | Cycle time (ns) | Latency (ns) |
|---|---|---|---|
| Ercegovac and McIlhenny [17] ($p = 14$, Virtex-5) | 1263 (6,2) LUTs | 13.1 | 197 |
| Type2 divider, equalized to $p = 14$ | 1692 (6,2) LUTs | 8.5 | 136 |
| Deschamps and Sutter [24] ($p = 16$, Virtex-4): (i) nonrestoring algorithm | 2974 (4,1) LUTs | 21.4 | 386 |
| Deschamps and Sutter [24] ($p = 16$, Virtex-4): (ii) SRT-like algorithm | 3799 (4,1) LUTs | 16.6 | 300 |
| Zhang et al. [25] ($p = 16$, Virtex-II Pro) | 3976 slices | 20.0 | 420 |
| Vestias and Neto [12] ($p = 16$, Virtex-4, Newton-Raphson): (i) type A3 (0 DSPs, 118 cycles) | 2756 (4,1) LUTs | 3.4 | 394 |
| Vestias and Neto [12]: (ii) type A4 (0 DSPs, 90 cycles) | 3768 (4,1) LUTs | 3.4 | 306 |
| Vestias and Neto [12]: (iii) type A5 (7 DSPs, 112 cycles) | 2091 (4,1) LUTs | 3.4 | 380 |
| Vestias and Neto [12]: (iv) type A6 (10 DSPs, 86 cycles) | 2718 (4,1) LUTs | 3.4 | 292 |
| Type2 divider, equalized to $p = 16$ | 1704 (6,2) LUTs | 8.5 | 153 |

adders, and LUTs. Furthermore, a compensation term is computed in the digit recurrence that compensates the error caused by this limited precision. The design is optimized for Virtex-5 FPGAs and has a good area characteristic (a 14-digit divider requires 1263 LUTs), but it suffers from a high cycle time of 13.1 ns, which is more than 50% higher than that of the type1 or type2 design proposed in this paper. The reason for this high cycle time is (similar to the type4 divider) the long critical paths across many LUTs that cannot exploit the FPGA's fast carry chains. Therefore, the design would probably fit better on ASICs with dedicated routing. The dividers implemented in [24] apply two different digit recurrence algorithms with a redundancy of $\rho = 1$. The quotient digit selection function is very complex because it requires three MSDs of the divisor and four MSDs of the residual. The cycle times are high, but it must be taken into account that these designs are optimized for Virtex-4 FPGAs. Nevertheless, the number of used LUTs is very high. The divider presented in [25] applies a radix-100 digit recurrence algorithm with comprehensive prescaling of the dividend and divisor. It is optimized for Virtex-II Pro devices; hence, it is hard to compare it with the dividers presented in this paper. However, the critical path includes a decimal carry-save adder tree, two carry-propagate adder stages, and three multiplexer stages. It can therefore be assumed that this algorithm would also perform poorly on Virtex-5 devices, although two quotient digits are retired in each iteration step. Vestias and Neto [12] propose four different decimal dividers optimized for different speed and area tradeoffs. Contrary to the circuits presented in this paper, the dividers apply the Newton-Raphson algorithm. The multiplier used is sequential; therefore, many cycles per division are required. However, each divider has a higher delay and a greater LUT usage compared to our type2 implementation.

Table 5 lists the results of the dividers presented in [17, 24, 25] together with the results of the type2 divider. For a fair performance comparison, the precision of the type2 divider is equalized to match the number of calculated quotient digits of the corresponding divider. Furthermore, it should be noted that the dividers presented in [12, 17, 24, 25] do not provide rounding support.

For the floating-point implementation, the fixed-point divider of type2 is used because type1 and type3.b consume too many resources and type3.a and type4 are too slow. Although five additional BRAMs are required, these are available in sufficient quantities on Virtex-5 devices. The corresponding postplace and route results for two configurations with different numbers of serialization stages are listed in Table 6. A comparison with other implementations is


Table 6: Postplace and route result of the floating-point divider based on type2.

| | Config. 1 | Config. 2 |
|---|---|---|
| Number of cycles per float division | 23 | 21 |
| Number of LUTs | 3231 | 3205 |
| Number of FFs | 1630 | 1377 |
| Number of LUTs and FFs combined | 3571 | 3566 |
| Cycle time (ns) | 8.5 | 9.1 |
| Overall latency (ns) | 195.5 | 191 |
| Max. frequency (MHz) | 118 | 110 |

Table 7: Postplace and route result of a 64-bit binary floating-point divider (Core Gen).

| Cycles per division | 1 | 10 | 20 | 55 |
|---|---|---|---|---|
| No. of LUTs | 3161 | 592 | 425 | 394 |
| No. of FFs | 196 | 518 | 469 | 542 |
| No. of LUTs and FFs combined | 3232 | 917 | 675 | 675 |
| Cycle time (ns) | 153.7 | 20.8 | 8.9 | 6.8 |
| Overall latency (ns) | 153.7 | 208 | 178 | 374 |
| Max. frequency (MHz) | 6.5 | 48 | 112 | 147 |

complicated because there are no other FPGA-based implementations of decimal floating-point dividers published yet. Therefore, we can only compare the decimal divider with binary dividers implemented on the same FPGA. In Table 7, postplace and route results of binary floating-point dividers for the double data format on a Virtex-5, provided by the Xilinx Core Generator [26], are listed. The dividers also provide the exception signals overflow, underflow, invalid operation, and divide-by-zero. Unfortunately, the binary floating-point divider does not support gradual underflow because this feature increases the complexity. The divider with one cycle per division is an unpipelined and fully parallel design, and the divider that requires 55 cycles per division is a fully sequential design. The binary floating-point divider with 20 cycles per division is best suited for a comparison with the decimal floating-point divider because it requires a comparable number of cycles for one operation. Obviously, decimal arithmetic has a great overhead regarding the total number of used LUTs, but the cycle time and latency are similar. This resource overhead can be explained, on the one hand, by the inefficient representation of decimal values in binary logic and, on the other hand, by the more complex specification of the decimal standard IEEE 754-2008. For instance, decimal floating-point division requires additional encoders, decoders, and a normalization stage.

7. Conclusion

In this paper, we have presented five different radix-10 digit recurrence division algorithms. The type1 divider implements the simple shift-and-subtract algorithm. The type2 divider is based on a nonrestoring algorithm with a ROM-based quotient digit selection function. The type3.a and type3.b dividers are both similar to type2 but use fast binary carry-propagate adders in the quotient digit selection function. However, the type3.a divider uses LUT-based multiplexers, whereas the type3.b divider uses fast carry-chain-based multiplexers. The type4 divider applies a new algorithm with constant digit selection functions. This type4 divider requires neither a ROM nor multiplexers to select multiples of the divisor.

We have shown that the shift-and-subtract algorithm, which has the worst latency in ASIC architectures, is the fastest design on FPGAs but uses the most FPGA resources in terms of LUTs. A good tradeoff between latency and area is the type2 architecture, with a redundant carry-save representation of the residual, a radix-2 representation of the two MSDs, and a quotient digit selection function implemented in ROM. Furthermore, we have extended this fixed-point divider to a fully IEEE 754-2008 compliant decimal floating-point divider for the decimal64 data format, and finally we have shown implementation results of all dividers.

Appendix

Proofs

Theorem A.1 (rounding overflow). Let us consider a decimal floating-point division $Q = X / D$ with the signs $s_i$, the significands $c_i$, the exponents $q_i$, and the precision $p$:

$$ (-1)^{s_Q} \cdot c_Q \cdot 10^{q_Q} = \frac{(-1)^{s_X} \cdot c_X \cdot 10^{q_X}}{(-1)^{s_D} \cdot c_D \cdot 10^{q_D}}. \tag{A.1} $$

Then rounding overflow, that is, an overflow that arises due to adding +1 to the least significant digit of the final result, cannot occur.

Proof. Assume to the contrary that there are a floating-point dividend X, a floating-point divisor D, and a proper rounding mode for which the decimal floating-point division produces a rounding overflow. Without loss of generality, we consider in the following only positive operands ($s_X = s_D = 0$), normalized significands ($c_X, c_D \ge 10^{p-1}$), and exponents equal to zero ($q_X = q_D = 0$). In the case of rounding overflow, the $p$ digits of the quotient must all be nines and the remainder must be unequal to zero, that is,

$$ \underbrace{c_X}_{p \text{ digits}} = c_Q \cdot c_D + \mathrm{rem} = \underbrace{99 \cdots 9}_{p \text{ digits}} \cdot \underbrace{c_D}_{p \text{ digits}} + \underbrace{\mathrm{rem}}_{p \text{ digits}} \tag{A.2} $$

$$ \text{with } 0 < \mathrm{rem} < c_D \cdot 10^{-p+1}. \tag{A.3} $$

Condition (A.2) is fulfilled for

$$ c_D = 10^{p-1}, \qquad c_X = \underbrace{99 \cdots 9}_{p \text{ digits}}, \tag{A.4} $$

but this violates condition (A.3) because the remainder is zero ($\mathrm{rem} = 0$); that is, the result is exact, no round-up is applied, and hence no rounding overflow occurs.


Otherwise, if the divisor is greater than $10^{p-1}$, it follows that

$$ \begin{aligned} \underbrace{c_X}_{p \text{ digits}} &= (99 \cdots 9) \cdot c_D + \mathrm{rem} \\ &= (10 - 10^{-p+1}) \cdot c_D + \mathrm{rem} \\ &= 10 \cdot c_D + \underbrace{(-10^{-p+1} \cdot c_D + \mathrm{rem})}_{< 0} \\ &= \underbrace{10 \cdot c_D + \mu}_{\ge\, p+1 \text{ digits}} \end{aligned} \tag{A.5} $$

with $-10 < \mu < 0$. The left term of the equation has, by definition, a precision of $p$ digits, while the right term of the equation has a precision of more than $p$ digits. This contradiction finally proves that no rounding overflow can occur.
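For small precisions, the statement of the theorem can also be checked exhaustively. The sketch below is not part of the proof; it is a brute-force search under an integer formulation chosen here for illustration: both significands are normalized $p$-digit integers, the quotient is scaled to exactly $p$ integer digits, and a rounding-overflow candidate is a truncated quotient of all nines with a nonzero remainder. The function name `rounding_overflow_cases` is hypothetical.

```python
def rounding_overflow_cases(p):
    """List all (c_X, c_D) whose p-digit quotient is 99...9 with a nonzero remainder."""
    lo, hi = 10 ** (p - 1), 10 ** p
    cases = []
    for c_x in range(lo, hi):
        for c_d in range(lo, hi):
            # Scale so that the truncated quotient has exactly p digits.
            scale = p - 1 if c_x >= c_d else p
            z, rem = divmod(c_x * 10 ** scale, c_d)
            if z == hi - 1 and rem != 0:
                cases.append((c_x, c_d))
    return cases
```

For $p = 1, 2, 3$ the search returns no candidates, which agrees with the theorem; the only all-nines quotients found in this range are exact, as in (A.4).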

References

[1] M. Baesler, S. Voigt, and T. Teufel, "FPGA implementations of radix-10 digit recurrence fixed-point and floating-point dividers," in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig '11), pp. 13–19, IEEE Computer Society, Los Alamitos, CA, USA, December 2011.

[2] IEEE Task P754, ANSI/IEEE 754-1985, Standard for Binary Floating-Point Arithmetic, New York, NY, USA, August 1985.

[3] M. F. Cowlishaw, "Decimal floating-point: algorism for computers," in Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH '03), pp. 104–111, IEEE Computer Society, Washington, DC, USA, June 2003.

[4] IEEE Task P754, IEEE 754-2008, Standard for Floating-Point Arithmetic, New York, NY, USA, August 2008.

[5] ANSI/IEEE, ANSI/IEEE Std 854-1987: An American National Standard, IEEE Standard for Radix-Independent Floating-Point Arithmetic, New York, NY, USA, October 1987.

[6] A. Y. Duale, M. H. Decker, H. G. Zipperer, M. Aharoni, and T. J. Bohizic, "Decimal floating-point in z9: an implementation and testing perspective," IBM Journal of Research and Development, vol. 51, no. 1-2, pp. 217–227, 2007.

[7] C. F. Webb, "IBM z10: the next-generation mainframe microprocessor," IEEE Micro, vol. 28, no. 2, pp. 19–29, 2008.

[8] L. Eisen, J. W. Ward, H. W. Tast et al., "IBM POWER6 accelerators: VMX and DFU," IBM Journal of Research and Development, vol. 51, no. 6, pp. 663–683, 2007.

[9] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd, "Power7: IBM's next-generation server processor," IEEE Micro, vol. 30, no. 2, pp. 7–15, 2010.

[10] S. F. Oberman and M. J. Flynn, "Division algorithms and implementations," IEEE Transactions on Computers, vol. 46, no. 8, pp. 833–854, 1997.

[11] L. K. Wang and M. J. Schulte, "A decimal floating-point divider using Newton-Raphson iteration," Journal of VLSI Signal Processing Systems, vol. 49, no. 1, pp. 3–18, 2007.

[12] M. Vestias and H. Neto, "Revisiting the Newton-Raphson iterative method for decimal division," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '11), pp. 138–143, IEEE Computer Society Press, September 2011.

[13] H. Nikmehr, B. Phillips, and C. C. Lim, "Fast decimal floating-point division," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 9, pp. 951–961, 2006.

[14] T. Lang and A. Nannarelli, "A radix-10 digit-recurrence division unit: algorithm and architecture," IEEE Transactions on Computers, vol. 56, no. 6, pp. 727–739, 2007.

[15] A. Vazquez, E. Antelo, and P. Montuschi, "A radix-10 SRT divider based on alternative BCD codings," in Proceedings of the 25th IEEE International Conference on Computer Design (ICCD '07), pp. 280–287, IEEE Computer Society Press, Los Alamitos, CA, USA, October 2007.

[16] E Schwarz and S Carlough ldquoPower6 decimal dividepagesrdquoin Proceedings of the 18th IEEE International Conference onApplication-Specific Systems Architectures and Processors (ASAPrsquo07) pp 128ndash133 IEEE Computer Society July 2007

[17] M D Ercegovac and R McIlhenny ldquoDesign and FPGA imple-mentation of radix-10 algorithm for division with limited preci-sion primitivesrdquo in Proceedings of the 42nd Asilomar Conferenceon Signals Systems and Computers (ASILOMAR rsquo08) pp 762ndash766 IEEEComputer Society PacificGrove Calif USAOctober2008

[18] M D Ercegovac and R McIlhenny ldquoDesign and FPGAimplementation of radix-10 combined divisionsquare rootalgorithm with limited precision primitivesrdquo in Proceedings ofthe 44th Asilomar Conference on Signals Systems andComputers(Asilomar rsquo10) pp 87ndash91 IEEEComputer Society PacificGroveCalif USA November 2010

[19] F Y Busaba C A Krygowski W H Li E M Schwarz andS R Carlough ldquoThe IBM z900 decimal arithmetic unitrdquo inProceedings of the 35th Asilomar Conference on Signals Systemsand Computers vol 2 pp 1335ndash1339 IEEE Computer SocietyNovember 2001

[20] M Ercegovac and T Lang Division and Square Root Digit-Recurrence Algorithms and Implementations Kluwer AcademicPublishers Norwell Mass USA 1994

[21] M Baesler S O Voigt and T Teufel ldquoA decimal floating-point accurate scalar product unit with a parallel fixed-pointmultiplier on a Virtex-5 FPGArdquo International Journal of Recon-figurable Computing vol 2010 Article ID 357839 13 pages 2010

[22] M Baesler S O Voigt and T Teufel ldquoA radix-10 digit recur-rence division unit with a constant digit selection functionrdquoin Proceedings of the 28th IEEE International Conference onComputer Design (ICCD rsquo10) pp 241ndash246 IEEE ComputerSociety Los Alamitos CA USA October 2010

[23] M Baesler S O Voigt and T Teufel ldquoAn IEEE 754-2008decimal parallel and pipelined FPGA floating-point multiplierrdquoin Proceedings of the 20th International Conference on FieldProgrammable Logic and Applications (FPL rsquo10) pp 489ndash495IEEE Computer Society Washington DC USA September2010

[24] J P Deschamps and G Sutter ldquoDecimal division algorithmsand FPGA implementationsrdquo in Proceedings of the 6th SouthernProgrammable Logic Conference (SPL rsquo10) pp 67ndash72 IEEEComputer Society March 2010

[25] Y Zhang D Chen L Chen et al ldquoDesign and implementationof a readix-100 decimal divisionrdquo in Proceedings of IEEESymposium on Circuit and System (ISCAS rsquo09) Taibei TaiwanMay 2009

[26] Xilinx Xilinx LogiCORE Floating-Point Operator v40 April2008


(1) Compute $\delta_1 = 10w[j] - d$
    ⋮
    $\delta_9 = 10w[j] - 9d$
(2) if $\delta_1 < 0$ then $z_{j+1} = 0$ and $w[j+1] = 10 \cdot w[j]$
    else if $\delta_2 < 0$ then $z_{j+1} = 1$ and $w[j+1] = \delta_1$
    ⋮
    else if $\delta_9 < 0$ then $z_{j+1} = 8$ and $w[j+1] = \delta_8$
    else $z_{j+1} = 9$ and $w[j+1] = \delta_9$

Algorithm 1: Pseudocode for type1 digit recurrence division.
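As an illustration of Algorithm 1, the following Python sketch runs the restoring recurrence on plain integers. The function names are hypothetical, and the nine differences are evaluated in a loop here, whereas the hardware computes them in parallel and selects the result with a multiplexer.

```python
def type1_step(w, d):
    """One radix-10 restoring step: returns (z_next, w_next)."""
    shifted = 10 * w
    z, w_next = 0, shifted               # case delta_1 < 0: digit 0
    for k in range(1, 10):
        delta = shifted - k * d          # delta_k = 10*w[j] - k*d
        if delta < 0:
            break                        # keep the last nonnegative difference
        z, w_next = k, delta
    return z, w_next


def type1_divide(x, d, n):
    """Compute n fractional quotient digits of x/d, assuming 0 <= x < d."""
    if not 0 <= x < d:
        raise ValueError("sketch assumes 0 <= x < d")
    w, digits = x, []
    for _ in range(n):
        z, w = type1_step(w, d)
        digits.append(z)
    return digits, w


if __name__ == "__main__":
    print(type1_divide(1, 7, 6))         # digits of 1/7 and the final residual
```

Selecting the smallest nonnegative difference is what makes the quotient digit unsigned and nonredundant: each step yields exactly one digit in the range 0 to 9.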

These selection constants are bounded by selection intervals, $m_k(i) \in [L_k, U_{k-1}]$, with the containment condition [20]

$$L_k(d_i) = (k - \rho)\, d_i, \tag{12a}$$

$$U_k(d_i) = (k + \rho)\, d_i. \tag{12b}$$

Furthermore, the continuity condition states that every value $10w[j]$ must belong to at least one selection interval [20]. Considering the maximum error due to truncation, the continuity condition can be expressed by

$$U_{k-1}(d_i) - L_k(d_{i+1}) \geq \epsilon = 2 \cdot 10^{-t+1}. \tag{13}$$

If we suppose the subrange width to be constant ($d_{i+1} - d_i = 10^{-\delta}$) and consider that the minimum overlapping occurs for minimum $d_{\min}$ and maximum $k_{\max}$, then we obtain the term

$$x = (k_{\max} - 1 + \rho)\, d_{\min} - (k_{\max} - \rho)(d_{\min} + 10^{-\delta}) \geq 2 \cdot 10^{-t+1}. \tag{14}$$

We choose $\rho = 8/9$ ($k_{\max} = 8$) and $d_{\min} = 0.4$, which leads to $\delta = 2$ and $t = 2$. Thus, the divisor's estimation $\hat{d}$ requires an accuracy of two fractional digits, and the residual's estimation $10\hat{w}$ requires an accuracy of one integer and one fractional digit. Furthermore, the divisor must be prescaled such that $0.4 \leq d < 1.0$.
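The choice $\delta = 2$ and $t = 2$ can be checked numerically by evaluating (14) with exact rational arithmetic. The short Python sketch below does this; the helper names are introduced only for this example.

```python
from fractions import Fraction


def overlap(rho, k_max, d_min, delta):
    """Left-hand side x of (14) for a subrange width of 10**(-delta)."""
    step = Fraction(1, 10 ** delta)
    return (k_max - 1 + rho) * d_min - (k_max - rho) * (d_min + step)


def epsilon(t):
    """Maximum truncation error 2 * 10**(-t+1) from (13)."""
    return Fraction(2, 10 ** (t - 1))


if __name__ == "__main__":
    rho, d_min = Fraction(8, 9), Fraction(4, 10)
    for delta in (1, 2):
        x = overlap(rho, 8, d_min, delta)
        print(f"delta={delta}: x={x}, meets eps(t=2): {x >= epsilon(2)}")
```

With these parameters, one fractional digit of the divisor ($\delta = 1$) gives a negative overlap, while $\delta = 2$ gives $x = 0.24$, which exceeds $\epsilon = 0.2$ for $t = 2$ but not $\epsilon = 2$ for $t = 1$.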

The P-D diagram is a visualization technique for designing a quotient digit selection function and for computing the decision boundaries $m_k(i)$. It plots the shifted residual versus the divisor. The P-D diagram for the type2 quotient digit selection function with the selection constants $m_k(i)$ is shown in Figure 2.

The two MSDs of the divisor and the residual address the ROM in order to obtain the signed quotient digit, which comprises 5 bits. In order to reduce the ROM's size, the two MSDs of the divisor and the residual are implemented by using a nonredundant radix-2 representation. This leads to an address that is composed of 15 bits: seven bits from the

[Figure 2: P-D diagram for the type2 QDS function (vertical axis: shifted residual $10 \cdot w$).]

(1) compute $L_k(d_i)$ and $U_k(d_i)$ for each $d_i \in \{0.40, 0.41, \ldots, 0.99\}$ and $k \in \{-8, \ldots, +8\}$ according to (12a) and (12b)
(2) for all $10 \cdot w_n \in \{-8.8, -8.7, \ldots, +8.8\}$ with $|w| \leq \rho \cdot d_{\max} = 0.88$, compute the quotient digits $z$ as follows:
    if $10 \cdot w_n \geq U_7 - \epsilon = U_7 - 0.2$ then $z = 8$
    else if $10 \cdot w_n \geq U_6 - 0.2$ then $z = 7$
    ⋮
    else if $10 \cdot w_n \geq U_0 - 0.2$ then $z = 1$
    else if $10 \cdot w_n \geq L_0$ then $z = 0$
    ⋮
    else if $10 \cdot w_n \geq L_{-8}$ then $z = -8$

Algorithm 2: Algorithm for the calculation of the type2 ROM entries.

unsigned radix-2 representation of the divisor's two MSDs and eight bits from the signed radix-2 representation of the residual's two MSDs. Hence, the size of the ROM is $2^{15} \times 5$ bits. Unfortunately, the use of radix-2 representation complicates the multiplication by 10, which is required according to (5). Therefore, an additional binary adder is needed:

$$10 \cdot w[j] = 8 \cdot w[j] + 2 \cdot w[j] = (w[j] \ll 3) + (w[j] \ll 1). \tag{15}$$

The corresponding algorithm for the calculation of the ROM entries can be derived from the P-D diagram and is depicted in Algorithm 2. It should be noted that the radix-2 representations $\hat{d}$ and $\hat{w}$ are truncations of $d$ and $w$, that is, $\hat{d} \leq d$ and $\hat{w} \leq w$.
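The table generation of Algorithm 2 and the shift-and-add multiplication of (15) can be sketched in Python as follows. This is a minimal sketch, not the authors' generator: the dictionary keyed by integer hundredths and tenths, the `None` marker for residual estimates below $L_{-8}$, and all function names are assumptions made for the example, and the actual ROM address encoding is not modeled.

```python
from fractions import Fraction

RHO = Fraction(8, 9)      # redundancy factor
EPS = Fraction(2, 10)     # maximum truncation error for t = 2


def lower(k, d):
    """L_k(d) from (12a)."""
    return (k - RHO) * d


def upper(k, d):
    """U_k(d) from (12b)."""
    return (k + RHO) * d


def rom_entry(d_i, w10):
    """Quotient digit for divisor estimate d_i and shifted residual w10."""
    for k in range(8, 0, -1):             # z = 8 ... 1: compare with U_{k-1} - eps
        if w10 >= upper(k - 1, d_i) - EPS:
            return k
    for k in range(0, -9, -1):            # z = 0 ... -8: compare with L_k
        if w10 >= lower(k, d_i):
            return k
    return None                           # below L_{-8}: not reachable, don't care


def build_rom():
    """Table indexed by (divisor in hundredths, shifted residual in tenths)."""
    return {(d100, w): rom_entry(Fraction(d100, 100), Fraction(w, 10))
            for d100 in range(40, 100) for w in range(-88, 89)}


def times10(w):
    """Multiplication by 10 as in (15), for a radix-2 integer w."""
    return (w << 3) + (w << 1)


if __name__ == "__main__":
    rom = build_rom()
    print(len(rom), rom[(40, 0)], rom[(99, 88)], times10(-37))
```

The sketch enumerates 60 divisor estimates and 177 residual estimates, far fewer than the $2^{15}$ addressable words, because many binary addresses do not correspond to valid operand estimates.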

Once the quotient digit $z_{j+1}$ has been determined, the multiple $z_{j+1} \cdot d$ is subtracted from the current residual to compute the next residual as stated in (5). The subtracter is a fast redundant BCD-4221 carry-save adder (CSA) as described in [21], except for the two MSDs, in which we apply

[Figure 3: Block diagram of type2 digit recurrence (ROM, multiplexers for the divisor multiples ×1, ×2, ×5, and ×10, binary CPA, BCD-4221 CSA, and L1 shifter).]

(1) Select $z_{j+1} = \mathrm{SEL}_{\mathrm{ROM}}(10\hat{w}[j], \hat{d})$
(2) Compute $(w_s[j+1] + w_c[j+1]) = 10\,(w_s[j] + w_c[j]) - z_{j+1} \cdot d$

Algorithm 3: Pseudocode for the type2 digit recurrence.

radix-2 CPAs. Thus, the multiple $z_{j+1} \cdot d$ is composed of two summands,

$$z_{j+1} \cdot d = d^1_{j+1} + d^2_{j+1}, \tag{16}$$

where $d^{1,2} \in \{0, \pm 1d, \pm 2d, \pm 5d, \pm 10d\}$ are precomputed. The multiples $2d$ and $5d$ can be computed easily and quickly by digit recoding and constant shift operations, as shown in (9) and (10). The 10's and 2's complements are applied for $d^{1,2} < 0$. For both radix-2 and BCD-4221 radix-10 representation, the complements can be computed by inverting each bit and adding one. In summary, the subtraction uses a (3:1) radix-2 CPA for the two MSDs and a redundant radix-10 (4:2) CSA for the remaining digits. The digit recurrence algorithm of the type2 divider is summarized in Algorithm 3, and its block diagram is depicted in Figure 3.
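That every quotient digit in $\{-8, \ldots, +8\}$ can be written as a sum of two components according to (16) can be confirmed by enumeration. The Python sketch below searches for one such pair per digit; the tie-breaking rule (smallest total magnitude) and the names are assumptions for illustration, and the pairing actually wired into the multiplexers of Figure 3 is not given in the text.

```python
from itertools import product

# Allowed factors of d for each of the two summands in (16).
COMPONENTS = (0, 1, -1, 2, -2, 5, -5, 10, -10)


def decompose(z):
    """Return one pair (a, b) of allowed factors with a + b == z."""
    candidates = [(a, b) for a, b in product(COMPONENTS, repeat=2) if a + b == z]
    if not candidates:
        raise ValueError(f"no decomposition for digit {z}")
    # Hypothetical preference: the pair with the smallest total magnitude.
    return min(candidates, key=lambda ab: abs(ab[0]) + abs(ab[1]))


if __name__ == "__main__":
    for z in range(-8, 9):
        a, b = decompose(z)
        print(f"{z:+d} * d = ({a:+d})d + ({b:+d})d")
```

For example, the digit 8 has no representation with two positive components from the set and has to be formed as $10d - 2d$, which is why the negative multiples and their complements are needed even for positive quotient digits.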

4.4. Type3 QDS Function. The frequency-limiting component of the type2 divider is the digit recurrence step, which comprises the ROM (BRAM) to calculate the next quotient digit. This BRAM is slower compared to common FPGA logic. Hence, it appears to be beneficial to remove the slow BRAM from the critical path and use fast FPGA logic instead. To analyze this impact of the BRAM delays, we implement two dividers (type3.a and type3.b), which can also be realized without BRAM in the critical path.

Lang and Nannarelli [14] propose a divider that implements a quotient digit selection function based on CPAs and sign detection. These CPAs subtract fixed values from the current residual with limited precision. These fixed values depend on the estimates of the divisor and are precomputed and stored in a ROM. Furthermore, the quotient digit is divided into two parts, which further simplifies the quotient digit selection function and reduces the critical path. Unfortunately, this divider cannot be implemented efficiently on FPGAs because the required carry-save adder with small error estimation shows a poor performance on the FPGA's slice structure. Nevertheless, to investigate the impact of BRAM delays in the type2 divider, we implement another two dividers (type3.a and type3.b) that have no BRAM in the critical path but a CPA-based quotient digit selection function instead.

The type3 dividers are modifications of the type2 divider. The common architectural features are

(i) the multiples of the divisor, which are composed of two components, $z_{j+1} \cdot d = d^1_{j+1} + d^2_{j+1}$, where $d^{1,2} \in \{0, \pm 1d, \pm 2d, \pm 5d, \pm 10d\}$,
(ii) the fast redundant BCD-4221 carry-save adders for the digit recurrence step as stated in (5),
(iii) the radix-2 implementation of the two MSDs, and
(iv) the 10's and 2's complements for $d^{1,2} < 0$.

Furthermore, since the architectures of the type2 and type3 dividers are similar with the exception of the quotient digit selection function, most of the dividers' characteristics are also identical, including

(i) the P-D diagram,
(ii) the redundancy factor $\rho = 8/9$, $k_{\max} = 8$,
(iii) the need of prescaling the divisor, $0.4 \leq d < 1.0$,
(iv) the accuracy of the divisor's estimation, $\delta = 2$, and
(v) the accuracy of the residual's estimation, $t = 2$.

Contrary to the type2 divider, the QDS functions of the type3 dividers are implemented by 16 carry-propagate adders with sign detection. These adders subtract fixed selection constants from the estimation of the current residual. The selection constants depend on the current divisor and are stored in a ROM, which is then no longer part of the critical path. The selection constants are computed according to Algorithm 4. The common digit recurrence algorithm of the type3.a and type3.b dividers is shown in Algorithm 5, and the corresponding block diagram is depicted in Figure 4.

The sign signals of the binary carry-propagate adders are used to determine the corresponding quotient digit and to select the multiples of the divisor in the digit recurrence

[Figure 4: Block diagram of type3 digit recurrence (QDS function with 16 binary CPAs and sign detection, ROM labeled 256 × 128, multiplexers for the divisor multiples, binary CPA, BCD-4221 CSA, and L1 shifter).]

(1) compute $L_k(d_i)$ and $U_k(d_i)$ for each $d_i \in \{0.40, 0.41, \ldots, 0.99\}$ and $k \in \{-8, \ldots, +8\}$ according to (12a) and (12b)
(2) compute the selection constants $m_k(d_i)$:
    $m_8(d_i) = U_7(d_i) - \epsilon = U_7(d_i) - 0.2$
    $m_7(d_i) = U_6(d_i) - 0.2$
    ⋮
    $m_1(d_i) = U_0(d_i) - 0.2$
    $m_0(d_i) = L_0(d_i)$
    ⋮
    $m_{-7}(d_i) = L_{-7}(d_i)$

Algorithm 4: Algorithm for the calculation of the type3 selection constants.

(1) if $m_8(d_i) - 10\hat{w}[j] < 0$ then $z_{j+1} = 8$
    else if $m_7(d_i) - 10\hat{w}[j] < 0$ then $z_{j+1} = 7$
    ⋮
    else if $m_{-7}(d_i) - 10\hat{w}[j] < 0$ then $z_{j+1} = -7$
    else $z_{j+1} = -8$
(2) Compute $(w_s[j+1] + w_c[j+1]) = 10\,(w_s[j] + w_c[j]) - z_{j+1} \cdot d$

Algorithm 5: Pseudocode for the type3 digit recurrence.

iteration step. The advantage of the type3 QDS function is its short critical path, which comprises only one 8-bit binary carry-propagate adder. However, this short critical path comes at a high price because the complexity is moved to the selection of the divisor's multiples. Hence, large fan-in multiplexers with poor latencies are required. In order to minimize the impact of these multiplexers, we designed a new dedicated (17:1) multiplexer that exploits the FPGA's fast carry chains and has a delay of only one LUT instance. In the following, we name the divider with improved fast multiplexers type3.b and the divider with traditional multiplexers type3.a. The new dedicated (17:1) multiplexer uses the 16 sign bits as select lines and is coded as follows:

    if sel = '1111111111111111' then y = x(16)
    if sel = '0111111111111111' then y = x(15)
    ⋮
    if sel = '0000000000000001' then y = x(1)
    if sel = '0000000000000000' then y = x(0)        (17)

with sel(−1) = 1 and sel(16) = 0. In other words, bit x(i) is selected when sel(i) = 0 and sel(i − 1) = 1. The selection of two input signals is implemented in one (5:1) LUT, and all signals are combined by a long OR gate that is implemented by exploiting the FPGA's fast carry chain. Therefore, one (17:1) multiplexer requires nine LUTs, and the type3.b divider with fast multiplexers uses many more LUTs than the type3.a divider (see Section 6). The block diagram of such a (17:1) multiplexer is depicted in Figure 5.
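The interplay of the selection constants (Algorithm 4), the sign detection (Algorithm 5), and the thermometer-coded multiplexer (17) can be modeled in a few lines of Python. The sketch below is a behavioral model under simplifying assumptions: it compares against the exact shifted residual instead of a truncated estimate, uses the divisor itself instead of the estimate $d_i$, and all names are hypothetical.

```python
from fractions import Fraction

RHO = Fraction(8, 9)
EPS = Fraction(2, 10)


def selection_constants(d_i):
    """m_k(d_i) for k = -7 .. 8, following Algorithm 4."""
    m = {}
    for k in range(-7, 9):
        if k >= 1:
            m[k] = (k - 1 + RHO) * d_i - EPS    # U_{k-1}(d_i) - eps
        else:
            m[k] = (k - RHO) * d_i              # L_k(d_i)
    return m


def sign_bits(w10, m):
    """sel(i) = 1 if m_k - 10w < 0, with i = k + 7, so i runs from 0 to 15."""
    return [1 if m[k] - w10 < 0 else 0 for k in range(-7, 9)]


def mux17(sel, x):
    """(17:1) multiplexer: x[i] is passed when sel(i) = 0 and sel(i-1) = 1."""
    ext = [1] + list(sel) + [0]    # ext[i] = sel(i-1); sel(-1) = 1, sel(16) = 0
    y = 0
    for i in range(17):
        if ext[i] == 1 and ext[i + 1] == 0:
            y |= x[i]              # OR-combination; only one term is active
    return y


def qds(w10, d_i):
    """Quotient digit selected by the 16 comparators and the multiplexer."""
    sel = sign_bits(w10, selection_constants(d_i))
    return mux17(sel, list(range(-8, 9)))       # input i carries digit i - 8


if __name__ == "__main__":
    d = Fraction(1, 2)
    for w10 in (Fraction(-40, 9), Fraction(0), Fraction(40, 9)):
        print(w10, qds(w10, d))
```

Because the selection constants increase with $k$, the 16 sign bits always form a thermometer code, so exactly one pair sel(i − 1) = 1, sel(i) = 0 exists and the OR-combination passes a single input.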

4.5. Type4 QDS Function. The type4 divider applies a new algorithm as proposed by us in a previous paper [22]. It is based on the decomposition of the signed quotient digit into three components having values $\{-5, 0, 5\}$, $\{-2, 0, 2\}$, and $\{-1, 0, 1\}$, as well as a fast constant digit selection function. In this implementation, neither a ROM for the QDS function nor a multiplexer to select the multiple $z_j \cdot d$ in the digit recurrence iteration step is required. The divider is intended to utilize fewer resources than the other implementations. Like

[Figure 5: Fast multiplexer (nine LUT(5:1) instances with INIT = F5F7FDFF, select lines sel(0) to sel(15), inputs x(0) to x(16), and output Y).]

the type2, type3.a, and type3.b dividers, the type4 divider uses signed-digit redundant quotient calculation, carry-save representation of the residual, fast BCD-4221 CSAs for the digit recurrence, and a radix-2 implementation of the MSDs.

Quotient digits are decomposed into three components, $z_{j+1} = 5 \cdot z^3_{j+1} + 2 \cdot z^2_{j+1} + z^1_{j+1}$, and each component $z^i$ is computed by a distinct selection function. These components can hold three values, $z^i \in \{-1, 0, 1\}$, so that only two comparators per component are required to distinguish between these values. Since $-8 \leq z_{j+1} \leq 8$, we get a redundancy factor of $\rho = 8/9$. Furthermore, the selection functions become very simple due to prescaling of the divisor ($0.4 \leq d < 0.8$); that is, they are constant functions $\mathrm{SEL}^i$ that do not depend on the divisor anymore:

$$z^3_{j+1} = \mathrm{SEL}^3(10\hat{w}[j]), \tag{18a}$$

$$z^2_{j+1} = \mathrm{SEL}^2(\hat{v}^1[j]), \tag{18b}$$

$$z^1_{j+1} = \mathrm{SEL}^1(\hat{v}^2[j]). \tag{18c}$$

The recurrence is then defined as follows:

$$v^1[j] = 10w[j] - z^3_{j+1} \cdot 5d, \tag{19a}$$

$$v^2[j] = v^1[j] - z^2_{j+1} \cdot 2d, \tag{19b}$$

$$w[j+1] = v^2[j] - z^1_{j+1} \cdot d. \tag{19c}$$

The multiples $2d$ and $5d$ can be computed easily and quickly according to (9) and (10). Each selection function requires two comparators, implemented as carry-save adders, that subtract constant values from the residuals' estimations ($\hat{w}[j]$, $\hat{v}^1[j]$, and $\hat{v}^2[j]$). As we will show in the following, these estimations require only the two most significant digits (MSDs) of the corresponding exact values.

First, the selection intervals $[L^i_k, U^i_k]$ for $z^i_{j+1} = k$ with $k \in \{-1, 0, 1\}$ and $i \in \{1, 2, 3\}$ have to be determined, where $L^i_k$ is the smallest and $U^i_k$ is the greatest value of the selection constant $m_k(i)$ such that the next residual is still bounded. Applying the convergence condition (8), the digit recurrence (19a), (19b), and (19c), and the redundancy factor $\rho = 8/9$, we obtain

$$|w[j]| \leq \rho d = \tfrac{8}{9} d, \tag{20a}$$

$$|v^2[j]| \leq (\rho + 1)\, d = \tfrac{17}{9} d, \tag{20b}$$

$$|v^1[j]| \leq (\rho + 1 + 2)\, d = \tfrac{35}{9} d. \tag{20c}$$

From the recurrence $v^1[j] = 10w[j] - z^3_{j+1} \cdot 5d$, the bound $|v^1[j]| \leq (35/9)\,d$, and $z^3_{j+1} \in \{-1, 0, 1\}$, and replacing $10w[j]$ by the upper limit $U^3_k$, we get

$$U^3_k = (5k + \rho + 3)\, d = \left(5k + \tfrac{35}{9}\right) d. \tag{21a}$$

Similarly, we obtain the upper limits $U^2_k$ and $U^1_k$ from the recurrences (19b) and (19c), respectively:

$$U^2_k = (2k + \rho + 1)\, d = \left(2k + \tfrac{17}{9}\right) d, \tag{21b}$$

$$U^1_k = (k + \rho)\, d = \left(k + \tfrac{8}{9}\right) d. \tag{21c}$$

Likewise, the lower limits $L^i_k$ can be computed:

$$L^3_k = (5k - \rho - 3)\, d = \left(5k - \tfrac{35}{9}\right) d, \tag{22a}$$

$$L^2_k = (2k - \rho - 1)\, d = \left(2k - \tfrac{17}{9}\right) d, \tag{22b}$$

$$L^1_k = (k - \rho)\, d = \left(k - \tfrac{8}{9}\right) d. \tag{22c}$$

[Figure 6: P-D diagram for the type4 QDS function (vertical axis: shifted residual $10 \cdot w$; selection constants $m^3_{\mathrm{pos}} = 0.9$ and $m^3_{\mathrm{neg}} = -1.5$).]

Due to the redundancy factor $\rho = 8/9 > 1/2$, we have overlapping regions $[L^i_{k+1}, U^i_k]$ where more than one quotient digit component $z^i_{j+1}$ may be selected. The decision boundary of a selection function should lie inside these overlapping regions. Figure 6 shows the P-D diagram for $z^3$. The figures for $z^2$ and $z^1$ look similar. As $z^3$ can hold three different values, the selection function requires two decision boundaries. Generally, it is implemented by a staircase function (see Figure 2), but for the bounded divisor $d \in [0.4, 0.8)$, it is independent of the divisor and is reduced to a constant function (see Figure 6). Fortunately, the selection functions for $z^2$ and $z^1$ are also independent of the divisor and can also be implemented by constant functions in a similar fashion.

We now determine the required number of fractional digits for the estimates $10\hat{w}$, $\hat{v}^1$, and $\hat{v}^2$. Moreover, we choose suitable selection constants $m^i_{\mathrm{pos}}$ and $m^i_{\mathrm{neg}}$. These constants depend on the maximum error due to the estimates and on the minimum overlapping width, which is given by $U^i_0(0.4) - L^i_1(0.8)$ as well as $U^i_{-1}(0.8) - L^i_0(0.4)$. Since we use BCD-4221 carry-save redundant digit representation, for a precision of $t$ digits (one integer digit and $t - 1$ fractional digits), we have a maximum error of $\epsilon = 2 \cdot 10^{-(t-1)}$. Moreover, the estimations $\hat{w}$ are computed by truncations ($w - \hat{w} \leq \epsilon$). This means that for given positive and negative selection constants $m^i_{\mathrm{pos}} > 0$ and $m^i_{\mathrm{neg}} < 0$, the following expressions must be true:

$$U^i_0(0.4) - m^i_{\mathrm{pos}} \geq \epsilon, \qquad L^i_1(0.8) \leq m^i_{\mathrm{pos}}, \tag{23}$$

$$L^i_0(0.4) \leq m^i_{\mathrm{neg}}, \qquad U^i_{-1}(0.8) - m^i_{\mathrm{neg}} \geq \epsilon. \tag{24}$$

If we choose the selection constants as listed in Table 2, all conditions are fulfilled for a precision of $t = 2$ digits, and each selection function is reduced to two simple 2-digit comparators.
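Conditions (23) and (24) can be verified for the constants of Table 2 with exact rational arithmetic. The following Python sketch evaluates the limits (21a) to (22c) at the interval ends $d = 0.4$ and $d = 0.8$; the data layout and function names are assumptions made for this example.

```python
from fractions import Fraction

RHO = Fraction(8, 9)

# Component index i -> (weight n, offset) such that U = (n*k + offset) * d
# and L = (n*k - offset) * d, following (21a)-(22c).
BOUNDS = {3: (5, RHO + 3), 2: (2, RHO + 1), 1: (1, RHO)}

# Component index i -> (m_pos, m_neg) as listed in Table 2.
TABLE2 = {3: (Fraction(9, 10), Fraction(-15, 10)),
          2: (Fraction(1, 10), Fraction(-7, 10)),
          1: (Fraction(1, 10), Fraction(-3, 10))}

D_LO, D_HI = Fraction(4, 10), Fraction(8, 10)


def upper(i, k, d):
    n, offset = BOUNDS[i]
    return (n * k + offset) * d


def lower(i, k, d):
    n, offset = BOUNDS[i]
    return (n * k - offset) * d


def conditions_hold(i, t=2):
    """Check (23) and (24) for component z^i and a precision of t digits."""
    eps = Fraction(2, 10 ** (t - 1))
    m_pos, m_neg = TABLE2[i]
    return (upper(i, 0, D_LO) - m_pos >= eps and lower(i, 1, D_HI) <= m_pos
            and lower(i, 0, D_LO) <= m_neg and upper(i, -1, D_HI) - m_neg >= eps)


if __name__ == "__main__":
    for i in (3, 2, 1):
        margin_pos = float(upper(i, 0, D_LO) - TABLE2[i][0])
        margin_neg = float(upper(i, -1, D_HI) - TABLE2[i][1])
        print(i, conditions_hold(i), margin_pos, margin_neg)
```

The two margins evaluated by the script correspond to the last two columns of Table 2, and all of them stay above $\epsilon = 0.2$ for $t = 2$.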

As soon as one component $z^i$ of the quotient digit is computed, the corresponding multiple of the divisor must be subtracted from the partial residual. Each component $z^i$ can hold three different values, $z^i \in \{-1, 0, +1\}$, that are multiplied with the corresponding weighted divisor $n \cdot d$ with $n \in \{5, 2, 1\}$. In other words, $n \cdot d$ is either passed, negated,

Table 2: Constants for type4 quotient digit selection function.

| | $m^i_{\mathrm{pos}}$ | $m^i_{\mathrm{neg}}$ | $U^i_0(0.4) - m^i_{\mathrm{pos}}$ | $U^i_{-1}(0.8) - m^i_{\mathrm{neg}}$ |
|---|---|---|---|---|
| $z^3$ | 0.9 | −1.5 | 0.656 | 0.611 |
| $z^2$ | 0.1 | −0.7 | 0.656 | 0.611 |
| $z^1$ | 0.1 | −0.3 | 0.255 | 0.211 |

or reset. Passing and resetting the multiple of the divisor are performed at no extra cost. Negation is accomplished by bit inversion and adding +1 through the carry inputs of the carry-save adders in (19a), (19b), and (19c). In summary, this is a very fast operation and requires only two LUTs per digit.

Similar to type2 and type3 division, we implement the two MSDs of the type4 residual by using a nonredundant radix-2 representation, which requires fewer LUTs and results in a faster quotient digit selection function. The pseudocode for the digit recurrence iteration is shown in Algorithm 6, and the corresponding block diagram is depicted in Figure 7.
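The three-stage selection can be simulated with exact rational arithmetic to observe that the residual stays bounded. The Python sketch below is a behavioral model only, not the hardware: the estimates are obtained by truncating the exact values to one fractional digit (instead of using carry-save estimates), the initial residual is simply assumed to satisfy $|w[0]| \leq \rho d$, and the function names are hypothetical.

```python
from fractions import Fraction
from math import floor


def sel(value, m_pos_tenths, m_neg_tenths):
    """Constant selection function with thresholds given in tenths."""
    est = floor(value * 10)          # estimate truncated to one fractional digit
    if est >= m_pos_tenths:
        return 1
    if est >= m_neg_tenths:
        return 0
    return -1


def type4_step(w, d):
    """One iteration of (19a)-(19c); returns (z_next, w_next)."""
    z3 = sel(10 * w, 9, -15)         # constants 0.9 and -1.5
    v1 = 10 * w - z3 * 5 * d
    z2 = sel(v1, 1, -7)              # constants 0.1 and -0.7
    v2 = v1 - z2 * 2 * d
    z1 = sel(v2, 1, -3)              # constants 0.1 and -0.3
    return 5 * z3 + 2 * z2 + z1, v2 - z1 * d


def type4_divide(w0, d, n):
    """Run n iterations from residual w0 (assumed |w0| <= 8/9 * d)."""
    w, digits = w0, []
    for _ in range(n):
        z, w = type4_step(w, d)
        digits.append(z)
    return digits, w


if __name__ == "__main__":
    d, w0 = Fraction(57, 100), Fraction(3, 100)
    digits, w = type4_divide(w0, d, 8)
    q = sum(Fraction(z, 10 ** (j + 1)) for j, z in enumerate(digits))
    print(digits, float(q), float(w0 / d))
```

In this model the signed digits $z_{j+1} \in \{-8, \ldots, 8\}$ satisfy $w[0] = d \cdot \sum_j z_j 10^{-j} + w[n] \cdot 10^{-n}$, so the signed-digit sum approximates $w[0]/d$ with an error bounded by the final residual.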

4.6. Proposed Decimal Fixed-Point Divider. For each type of division algorithm presented in the preceding sections, we have implemented a corresponding fixed-point divider, which is described in the following. For all fixed-point dividers, we expect the input operands to be normalized.

The block diagram of the type1 divider is depicted in Figure 8. First, the divisor multiples $2d, \ldots, 9d$ are precomputed. In the following cycles, $p = 16$ quotient digits (16 + 1 if the first digit is zero) are computed one by one, followed by an additional rounding digit. The quotient (Z) and the quotient +1 (ZP1) are computed on-the-fly by using two registers that are updated every cycle. The algorithm has the advantage that no additional slow decimal CPA is required to compute the incremented quotient. It is described more precisely in Section 5. The normalization of the result is also performed on-the-fly by locking the conversion when 16 + 1 quotient digits for the significand and the rounding digit have been computed; that is, in the worst case, when the first quotient digit is zero, 16 + 1 + 1 = 18 cycles are required. Furthermore, the sticky bit is calculated, which is required for rounding. It is set to one whenever the final remainder is unequal to zero.

The architectures of the type2, type3.a, type3.b, and type4 dividers are similar in their structure but differ in their digit recurrence algorithm and scaling. The common block diagram is shown in Figure 9. First, the dividend and the divisor are rescaled,

$$d \in [0.1, 1.0) \;\Longrightarrow\; d' \in \begin{cases} [0.4, 1.0) & \text{for type2, type3.a/b,} \\ [0.4, 0.8) & \text{for type4,} \end{cases} \tag{25}$$

and five as well as two times the divisor are precomputed according to (9) and (10). Then, the operands are recoded because the digit recurrence iteration uses a redundant digit representation with BCD-4221 carry-save adders.

The digit recurrence retires in each iteration one signed quotient digit $z_k$. The on-the-fly-conversion algorithm converts this signed-digit representation to the BCD-8421-coded

[Figure 7: Block diagram of type4 digit recurrence (three cascaded stages with selection functions SEL $z^3$, SEL $z^2$, and SEL $z^1$, selection constants 0.9/−1.5, 0.1/−0.7, and 0.1/−0.3, invert/reset units for $5d$, $2d$, and $d$, binary CPAs, BCD-4221 CSAs, and L1 shifter).]

[Figure 8: Block diagram of the normalized type1 division (inputs: normalized dividend and normalized divisor; blocks: compute divisor multiples $2d, 3d, \ldots, 9d$; compute 16 + 1 + 1 digit recurrence steps; compute the quotient on-the-fly; determine sticky bit; outputs: Z, ZP1, sb).]

(1) select
    $z^3_{j+1} = 1$ if $0.9 \leq 10\hat{w}[j]$
    $z^3_{j+1} = 0$ if $-1.5 \leq 10\hat{w}[j] < 0.9$
    $z^3_{j+1} = -1$ if $10\hat{w}[j] < -1.5$
    and compute $(v^1_s[j] + v^1_c[j]) = 10\,(w_s[j] + w_c[j]) - z^3_{j+1} \cdot 5d$
(2) select
    $z^2_{j+1} = 1$ if $0.1 \leq \hat{v}^1[j]$
    $z^2_{j+1} = 0$ if $-0.7 \leq \hat{v}^1[j] < 0.1$
    $z^2_{j+1} = -1$ if $\hat{v}^1[j] < -0.7$
    and compute $(v^2_s[j] + v^2_c[j]) = (v^1_s[j] + v^1_c[j]) - z^2_{j+1} \cdot 2d$
(3) select
    $z^1_{j+1} = 1$ if $0.1 \leq \hat{v}^2[j]$
    $z^1_{j+1} = 0$ if $-0.3 \leq \hat{v}^2[j] < 0.1$
    $z^1_{j+1} = -1$ if $\hat{v}^2[j] < -0.3$
    and compute $(w_s[j+1] + w_c[j+1]) = (v^2_s[j] + v^2_c[j]) - z^1_{j+1} \cdot d$

Algorithm 6: Pseudocode for the type4 digit recurrence.

[Figure 9: Block diagram of the normalized type2, type3.a, type3.b, and type4 divisions (inputs: normalized dividend and normalized divisor; blocks: scale divisor $d \in [0.1, 1) \rightarrow d' \in [0.4, 1)$ or $[0.4, 0.8)$; compute divisor multiples $2d'$ and $5d'$; BCD-8421 to BCD-4221 recoding; compute 16 + 1 + 1 digit recurrence steps; compute the quotient on-the-fly; determine sticky bit and sign of the remainder; multiplexer selecting among ZM1, Z, and ZP1).]

result [20] and computes the quotient (Z), the quotient +1 (ZP1), and the quotient −1 (ZM1). This conversion is accomplished in every iteration step and does not need a slow CPA. The incremented and decremented quotients are required for rounding. Moreover, the quotient is also normalized on-the-fly by locking the conversion when 16 + 1 quotient digits (including the rounding digit) have been computed. The algorithm is described explicitly in Section 5 because it is accomplished together with the gradual underflow handling of the floating-point divider.
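The idea of the on-the-fly conversion can be illustrated with a short Python sketch. It follows the classical scheme described in [20], extended by the incremented value; it is not the authors' implementation, which is given in Section 5 and additionally handles normalization and gradual underflow. The digit-list representation and the initial values are assumptions made for this example.

```python
def on_the_fly(signed_digits):
    """Convert radix-10 signed digits (each in -8..8) into conventional digits.

    Returns the digit lists of Z, ZM1 (= Z - 1), and ZP1 (= Z + 1). Each
    update only appends one digit to a copy of Z or ZM1, so no carry has to
    propagate through the digits that are already stored.
    """
    z, zm1, zp1 = [0], [9], [1]      # values 0, -1 (modulo the length), +1
    for s in signed_digits:
        new_z = z + [s] if s >= 0 else zm1 + [10 + s]
        new_zm1 = z + [s - 1] if s >= 1 else zm1 + [9 + s]
        new_zp1 = z + [s + 1] if s >= -1 else zm1 + [11 + s]
        z, zm1, zp1 = new_z, new_zm1, new_zp1
    return z, zm1, zp1


def value(digits):
    """Integer value of a most-significant-first decimal digit list."""
    return int("".join(str(x) for x in digits))


if __name__ == "__main__":
    z, zm1, zp1 = on_the_fly([1, -3, 0, 5, -8])   # 10000 - 3000 + 50 - 8
    print(value(z), value(zm1), value(zp1))
```

In hardware, appending a digit to Z or ZM1 corresponds to a register load with a shift, which is why the incremented and decremented quotients are available without an additional decimal CPA.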


if ($\overline{qNaN_X}$ and $\overline{sNaN_X}$) and ($qNaN_D$ or $sNaN_D$) then
    $C'_X = C_D$, $C'_D = 0 \ldots 01$
    $q'_X = q_D$, $q'_D = 0 + \text{bias} = 398$
else if ($qNaN_X$ or $sNaN_X$) then
    $C'_X = C_X$, $C'_D = 0 \ldots 01$
    $q'_X = q_X$, $q'_D = 0 + \text{bias} = 398$
else (no changes)
    $C'_X = C_X$, $C'_D = C_D$
    $q'_X = q_X$, $q'_D = q_D$

Algorithm 7: Algorithm of NaN handling.

Furthermore, the sticky bit is calculated, which is required for rounding. It is set to one whenever the final remainder is unequal to zero. Moreover, when the final remainder is negative, the quotient of the type2, type3.a, type3.b, and type4 dividers has to be adjusted by subtracting one LSD. This subtraction does not need another slow CPA because the quotient −1 (ZM1) has already been computed on-the-fly. The calculations of both the sticky bit and the sign bit require the reduction of the redundant remainder by a CPA. This CPA might be subdivided into multiple smaller CPAs to keep the latency low.

5. IEEE 754-2008 Floating-Point Division

The IEEE 754-2008 compliant decimal floating-point divider is an extension of the normalized fixed-point divider. The divider presented in this paper supports the interchange format IEEE 754-2008 decimal64 with DPD encoding, but it can be easily adapted to any other precision and exponent range.

The block diagram of the divider is depicted in Figure 10. The DFP division begins with the decoding of the dividend X and the divisor D and the extraction of their signs, significands, and exponents. In the following, the significands of the operands X ($c_X$) and D ($c_D$) are regarded as integers. Therefore, the corresponding exponents are $q_X$ and $q_D$.

According to [4], if one of the operands is a signaling NaN (sNaN) or a quiet NaN (qNaN), then the result is also a quiet NaN with the payload of the original NaN. Hence, in order to preserve the payload while using an unmodified divider, the NaN handling unit sets the NaN-holding operand as the dividend and resets the divisor to $1.0 \cdot 10^0$, as depicted in Algorithm 7.
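The operand substitution of Algorithm 7 can be sketched as follows. The `Operand` record and the single `is_nan` flag (which merges the qNaN and sNaN signals) are simplifying assumptions made only for this illustration; they are not part of the described hardware.

```python
from dataclasses import dataclass

BIAS = 398  # IEEE 754-2008 decimal64 exponent bias


@dataclass
class Operand:
    c: int                # significand (or NaN payload) as an integer
    q: int                # biased exponent
    is_nan: bool = False  # hypothetical flag: qNaN or sNaN


def nan_handling(x, d):
    """Return the (dividend, divisor) pair that is fed to the unmodified divider."""
    one = Operand(c=1, q=0 + BIAS)            # divisor reset to 1.0 * 10^0
    if not x.is_nan and d.is_nan:
        return Operand(d.c, d.q, True), one   # NaN of D becomes the dividend
    if x.is_nan:
        return Operand(x.c, x.q, True), one   # NaN of X stays the dividend
    return x, d                               # no changes
```

Dividing the NaN-holding operand by one leaves its payload untouched, which is why no dedicated payload path through the divider is needed.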

The fixed-point dividers presented in Section 4 require normalized operands. Therefore, the number of leading zeros is counted for both operands, and the significands are normalized by barrel shifters. For performance reasons, the leading zeros counters exploit the FPGA's fast carry chains, as proposed in [23].

The exponent of a floating-point division is first estimated as the exponent of the normalized quotient in fractional representation (QNF) and is updated iteratively in each digit recurrence step. Since most of the decimal floating-point numbers specified by IEEE 754-2008 have multiple representations of the same value, the result of a division might also have more than one correct representation, and these representations differ in the exponent. However, to obtain a unique and reproducible result in the case of such an ambiguity, IEEE 754-2008 defines the exponent of the result, which is called the preferred exponent (QP).

Both exponents QNF and QP are positive integers and are determined as follows:

$$\Delta q = (q_X - q_D), \qquad \Delta LZ = (LZ_X - LZ_D), \tag{26}$$

$$QNF = \Delta q - \Delta LZ + \text{offset1}, \qquad QP = \Delta q + \text{offset1}. \tag{27}$$

The offset1 is a bias that ensures that both exponents are positive, since unsigned integer arithmetic is used in this paper. The bias is composed of

$$\begin{aligned} \text{offset1} = 783 ={} & 398 && \text{(IEEE 754-2008 bias)} \\ & + 369 && (q_{\max}) \\ & + 15 && \text{(conversion fractional to integer)} \\ & + 1 && \text{(normalization of the result).} \end{aligned} \tag{28}$$
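As a small worked illustration of (26) to (28), the two exponent estimates can be computed as below. The function and parameter names are hypothetical; the inputs are the operand exponents and the leading-zero counts of the two significands.

```python
# Components of offset1 as listed in (28).
IEEE_BIAS = 398
Q_MAX = 369
FRACTIONAL_TO_INTEGER = 15
RESULT_NORMALIZATION = 1
OFFSET1 = IEEE_BIAS + Q_MAX + FRACTIONAL_TO_INTEGER + RESULT_NORMALIZATION


def estimate_exponents(q_x, q_d, lz_x, lz_d):
    """Return (QNF, QP) according to (26) and (27)."""
    delta_q = q_x - q_d        # difference of the operand exponents
    delta_lz = lz_x - lz_d     # difference of the leading-zero counts
    qnf = delta_q - delta_lz + OFFSET1
    qp = delta_q + OFFSET1
    return qnf, qp
```

With equal exponents and equal leading-zero counts, both estimates reduce to offset1 = 783.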

The normalized fixed-point division unit then computes $p = 16$ significand digits and one additional rounding digit. Additionally, the divider encompasses the on-the-fly conversion unit, which also detects and handles gradual underflow with zero delay overhead. One signed quotient digit $z_j$ is retired in each cycle, and the on-the-fly conversion algorithm converts this signed-digit representation into the BCD-8421-coded partial result [20]. The conversion is accomplished in every iteration step and does not need a slow CPA (see Algorithm 8). The partial quotient is stored in a register Z'. Moreover, two additional registers ZM1' and ZP1' are provided: ZM1' is the partial quotient decremented by one least significant digit (LSD), and ZP1' is the partial quotient incremented by one LSD. ZM1' is selected when the final residual in the last iteration is negative, and ZP1' is required by the following rounding operation. Furthermore, the exponent for the normalized significand in integer representation (QNI) is calculated (see (29)), and gradual underflow handling is performed at no extra cost (see Algorithm 8). The exponent QNI is computed by decrementing the exponent of the normalized significand in fractional representation (QNF) by one in each iteration step. The recurrence iteration terminates earlier in case of gradual underflow. The gradual underflow signal (GUF) is asserted high when $QNI = q_{\min}$ and the calculated integer quotient is less than $10^p$.

$$QNI = \begin{cases} q_{\min} & \text{if } GUF = 1 \\ QNF - (p - 1) & \text{if } z_1 > 0 \\ QNF - p & \text{if } z_1 = 0 \end{cases} \tag{29}$$

The extended quotient (composed of the calculated quotient and the rounding digit) must be decremented by one if the final remainder is less than zero. The sticky bit (sb) is used for rounding and is asserted high whenever the remainder is unequal to zero.

Figure 10: Block diagram of the floating-point divider. [Blocks recoverable from the figure: two DPD decoders; NaN handling; leading zeros count and QNF and QP calculations; dividend normalization and divisor normalization; normalized fixed-point division; on-the-fly conversion with gradual underflow handling; exponent calculation; rounding unit; exception unit (overflow, underflow, gradual underflow, inexact, invalid, div0); DPD encoder delivering the quotient. Signals recoverable from the figure: $s_X, q_X, c_X$; $s_D, q_D, c_D$; QNF; QP; GUF; QNI; TZ; Z; ZP1; R; sb; of; uf; $Z_e$; $ZP1_e$; $q_Z, c_Z, s_Z$.]

Moreover, the number of trailing zeros (TZ) is computed on-the-fly. This number indicates by how many digits the computed quotient might be shifted to the right without losing accuracy. The number of trailing zeros has to be counted only for Z' because, if ZM1' is selected, the residual is unequal to zero and TZ = 0 anyway. The number of trailing zeros is used for the selection between the preferred exponent (QP) and the normalized exponent for integer representation (QNI), as described in the following paragraph.

The exponent calculation unit selects either QP or QNI. The preferred exponent should be selected whenever possible. This is only feasible when $QNI < QP \le QNI + TZ$, that is, when there is a sufficient number of trailing zeros to shift the significand to the right. Furthermore, the final exponent must satisfy the minimum and maximum exponent range. The algorithm of the exponent calculation is illustrated in Algorithm 9, where the value

$$\text{offset2} = 783 - \text{bias} = 783 - 398 = 385 \tag{30}$$

is used to add the IEEE 754-2008 bias and to remove offset1 again (which was introduced in (27)).
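The selection between QP and QNI can be illustrated with the following minimal sketch. It covers only the feasibility condition $QNI < QP \le QNI + TZ$ and the decimal right shift of the significand; the range checks against the minimum and maximum exponents and the removal of the offset are deliberately left out. The function name and argument order are assumptions.

```python
def select_exponent(z, qni, qp, tz):
    """Return (exponent, significand) after choosing between QP and QNI.

    z   -- integer significand of the quotient
    qni -- exponent belonging to z in integer representation
    qp  -- preferred exponent
    tz  -- number of trailing decimal zeros of z
    """
    if qni < qp <= qni + tz:
        shift = qp - qni                 # number of decimal digits to drop
        return qp, z // 10 ** shift      # only zeros are shifted out
    return qni, z                        # preferred exponent not reachable
```

In both cases the represented value z · 10^exponent is unchanged; only the representation differs.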

$QNI' = QNF + 1$; $Z' = ZM1' = ZP1' = 0$; $TZ' = 0$; $j = 0$
while $(Z' < 10^p)$ and $(QNI' \ge q_{\min})$
    $Z' = \begin{cases} 10 \cdot Z' + z_{j+1} & \text{if } z_{j+1} \ge 0 \\ 10 \cdot ZM1' + (z_{j+1} + 10) & \text{if } z_{j+1} < 0 \end{cases}$
    $ZM1' = \begin{cases} 10 \cdot Z' + (z_{j+1} - 1) & \text{if } z_{j+1} > 0 \\ 10 \cdot ZM1' + (z_{j+1} + 9) & \text{if } z_{j+1} \le 0 \end{cases}$
    $ZP1' = \begin{cases} 10 \cdot Z' + (z_{j+1} + 1) & \text{if } z_{j+1} \ge -1 \\ 10 \cdot ZM1' + (z_{j+1} + 11) & \text{if } z_{j+1} < -1 \end{cases}$
    $TZ' = \begin{cases} TZ' + 1 & \text{if } z_{j+1} = 0 \\ 0 & \text{else} \end{cases}$
    $QNI' = QNI' - 1$
    $j = j + 1$
$QNI = QNI' + 1$ (remove the impact of the rounding digit)
$R' = \begin{cases} z_{j+1} - 1 & \text{if remainder} < 0 \\ z_{j+1} & \text{else} \end{cases}$
$ZM1'' = \lfloor ZM1' \cdot 10^{-1} \rfloor$; $Z'' = \lfloor Z' \cdot 10^{-1} \rfloor$; $ZP1'' = \lfloor ZP1' \cdot 10^{-1} \rfloor$
$(Z, ZP1, R) = \begin{cases} (ZM1'', Z'', R' + 10) & \text{if } R' < 0 \\ (Z'', ZP1'', R') & \text{else} \end{cases}$
$sb = \begin{cases} 0 & \text{if remainder} = 0 \\ 1 & \text{else} \end{cases}$
$GUF = \begin{cases} 1 & \text{if } QNI = q_{\min} \text{ and } Z' < 10^p \text{ and } (sb = 1 \text{ or } R \ne 0) \\ 0 & \text{else} \end{cases}$
if $GUF = 1$ then underflow
$TZ = \begin{cases} 0 & \text{if } sb = 1 \\ TZ' & \text{else} \end{cases}$

Algorithm 8: On-the-fly conversion with gradual underflow handling.
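The register updates inside the loop of Algorithm 8 can be checked with a short software model. The sketch below covers only the conversion of the signed digits into Z', ZM1', and ZP1'; the exponent update, trailing-zero counting, and gradual underflow handling are omitted, and the function name is hypothetical. It assumes a positive quotient, that is, the first nonzero digit is positive.

```python
def on_the_fly(digits):
    """Convert signed radix-10 digits (each in -8..8) into (Z, Z - 1, Z + 1).

    Every update appends exactly one decimal digit in 0..9 to an existing
    register value, so no carry-propagate addition is required.
    """
    z = zm1 = zp1 = 0
    for digit in digits:
        # All three updates read the register values of the previous step.
        new_z = 10 * z + digit if digit >= 0 else 10 * zm1 + (digit + 10)
        new_zm1 = 10 * z + (digit - 1) if digit > 0 else 10 * zm1 + (digit + 9)
        new_zp1 = 10 * z + (digit + 1) if digit >= -1 else 10 * zm1 + (digit + 11)
        z, zm1, zp1 = new_z, new_zm1, new_zp1
    return z, zm1, zp1
```

For the digit sequence 0, 3, −8, 0, −1, 8 the model yields Z = 21998 together with its decremented and incremented neighbors.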

Rounding is required when the number of significant digits of the quotient exceeds the length $p$ of a decimal word. The divider presented in this paper provides the five rounding modes required by [4]. Rounding selects either the integer quotient $Z_e$ or the incremented quotient $ZP1_e = Z_e + 1$. Rounding overflow would occur if $Z_e + 1$ overflowed due to rounding; as proven by Theorem A.1 in the appendix, this cannot happen in DFP division. The selection of the quotient depends on the rounding mode, the rounding digit (R), the sticky bit (sb), and the least significant digit (LSD) of the quotient. The calculation of the round-up detection is summarized in Table 3.

Finally, the result is encoded again, and the exception unit might assert six exception signals: division by zero, invalid operation, result is infinite, inexact, overflow, and underflow. Division by zero is asserted when the divisor equals zero but the dividend is unequal to zero. The operation is called invalid when both operands are zero, both operands are infinity, or any of the operands is either a signaling or a quiet NaN. The inexact flag is asserted when the rounding digit or the sticky bit is unequal to zero. Overflow and underflow are computed as described in the exponent calculation unit and are listed in Algorithms 8 and 9. The infinity exception is asserted when the result is greater than the largest number, that is, when either the overflow or the division by zero exception is asserted, or when the dividend is infinity while the divisor is a finite number.

Table 3: Round-up detection.

| Rounding mode | Round-up detection |
|---|---|
| Round Ties To Even | $(R > 5) + (R \equiv 5) \cdot sb + (R \equiv 5) \cdot \overline{sb} \cdot LSD(0)$ |
| Round Ties To Away | $(R \ge 5)$ |
| Round Toward Positive | $\overline{sign} \cdot ((R > 0) + sb)$ |
| Round Toward Negative | $sign \cdot ((R > 0) + sb)$ |
| Round Toward Zero | $0$ |

Legend: "+" logical OR, "·" logical AND.

offset2 = 783 − 398 = 385
if $(QP \le QNI)$ or $(QNI + TZ < QP)$ or $(QP > q_{\max})$ then
    (a) $QNI > q_{\max} \Rightarrow$ overflow
    (b) $q_{\min} \le QNI \le q_{\max} \Rightarrow$
        $q_Z = QNI - \text{offset2}$
        $Z_e = Z$
        $ZP1_e = ZP1$
    (c) $QNI < q_{\min} \Rightarrow$ underflow
else
    (a) $q_{\min} \le QP \le q_{\max} \Rightarrow$
        $q_Z = QP - \text{offset2}$
        $Z_e = Z \gg (QP - QNI)$
        $ZP1_e = ZP1 \gg (QP - QNI)$
    (b) $QNI < q_{\min} \Rightarrow$ underflow

Algorithm 9: Exponent calculation.
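The round-up detection summarized in Table 3 can be expressed as a small Boolean function. The sketch assumes that LSD(0) denotes the least significant bit of the quotient's LSD (set when the LSD is odd), that the sticky bit is negated in the tie case of Round Ties To Even, and that the sign is negated for Round Toward Positive. The mode strings and the function name are hypothetical.

```python
def round_up(mode, r, sb, lsd, sign):
    """Return True when the incremented quotient must be selected.

    r    -- rounding digit (0..9)
    sb   -- sticky bit (True if the remainder is nonzero)
    lsd  -- least significant digit of the quotient (0..9)
    sign -- True for a negative result
    """
    if mode == "ties_to_even":
        tie = r == 5 and not sb
        return r > 5 or (r == 5 and sb) or (tie and lsd % 2 == 1)
    if mode == "ties_to_away":
        return r >= 5
    if mode == "toward_positive":
        return (not sign) and (r > 0 or sb)
    if mode == "toward_negative":
        return sign and (r > 0 or sb)
    if mode == "toward_zero":
        return False
    raise ValueError("unknown rounding mode: " + mode)
```

The result only selects between the quotient and the incremented quotient; because both are already available from the on-the-fly conversion, no additional adder is involved.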

6. Implementation Results

All dividers are modeled using VHDL and are implemented for Xilinx Virtex-5 devices with speed grade −2 using Xilinx ISE 10.1. The post-place-and-route results of the fixed-point dividers are listed in Table 4.

All five types of fixed-point dividers require 19 cycles to perform a division. This includes 17 cycles to determine the quotient digits and the rounding digit, one additional cycle to normalize the result (when the first quotient digit is zero), and one cycle of latency to perform the on-the-fly conversion and the sticky bit determination. As expected, the type1 divider uses the most resources in terms of look-up tables (LUTs) and flip-flops (FFs). The type2, type3.a, and type3.b dividers consume fewer LUTs but additionally require five (type2) or two (type3.a and type3.b) BRAMs. However, comparing the delays of the dividers leads to an unexpected result: contrary to decimal divider implementations in ASIC designs, on FPGA platforms the shift-and-subtract algorithm is the fastest. The reason is that signal propagation on the FPGA's internal fast carry chains is faster than over the interconnections between slices through the FPGA's general routing matrix. The type1 divider exploits these fast carry chains, whereas the type2, type3.a, type3.b, and type4 dividers only use the normal, slow slice interconnection resources. In ASIC implementations, for instance, the longest path of the type4 divider would be much shorter than the longest path of the type1 divider, but on FPGA architectures the situation is the opposite.

One of the fastest decimal fixed-point dividers on ASICs is the design of Lang and Nannarelli [14]. They minimize the critical path in the digit recurrence by using a fast quotient digit selection function based on binary carry-propagate adders that subtract fixed values from the estimation of the current residual. These fixed values depend on estimations of the divisor and are precomputed and stored in a ROM. Therefore, the latency of the ROM does not contribute to the critical path of the digit recurrence. The implementation of such a divider on FPGAs would have a poor performance because the required carry-save adder with small error estimation cannot be implemented efficiently on the FPGA's slice structure. However, the concept of removing the ROM from the critical path is applied to the type3.a and type3.b dividers, and the corresponding post-place-and-route results are shown in Table 4. These results point out that the type3.a divider utilizes a similar amount of LUTs as the type2 divider, but its cycle time is much higher. The increased cycle time can be explained by the raised complexity of the multiplexers for the selection of the divisor's multiples in the digit recurrence step. These multiplexers show a poor performance in terms of propagation delay because they are implemented as a tree with slow slice interconnections. On the contrary, the type3.b divider applies dedicated fast multiplexers, which have a propagation delay of only one LUT instance and eight fast carry chains. This dedicated fast multiplexer speeds up the divider and reduces the maximum cycle time by one nanosecond. Unfortunately, the number of used LUTs is increased dramatically, by approximately 1000 LUTs.

The comparison of the type2, type3.a, type3.b, and type4 dividers in terms of speed and area (number of occupied LUTs and FFs combined) shows that the type2 algorithm is the fastest and requires the least number of slices. Hence, there is no benefit in removing the ROM's latency from the critical path because the complexity is moved to the selection of the divisor's multiples. The type1 divider is only 5% faster than the type2 divider, but the speed is bought at a high price because the number of occupied slices is 75% higher. Therefore, the type2 divider shows a good tradeoff in terms of area and latency and is used for the floating-point divider presented in this paper. In the following paragraph, we compare the type2 implementation with other published fixed-point dividers.
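The derived quantities quoted above (overall latency, maximum frequency, and the values normalized to the type2 divider) follow directly from the cycle times and resource counts of Table 4. The short sketch below re-derives them; the dictionary and function names are hypothetical, and the numbers are copied from Table 4.

```python
CYCLES_PER_DIVISION = 19

# Cycle times in ns and combined LUT/FF counts as reported in Table 4.
cycle_time_ns = {"type1": 8.1, "type2": 8.5, "type3.a": 10.1,
                 "type3.b": 9.1, "type4": 12.1}
luts_ffs_combined = {"type1": 3868, "type2": 2210, "type3.a": 2304,
                     "type3.b": 3240, "type4": 2203}


def latency_ns(name):
    """Overall latency of one division: 19 cycles times the cycle time."""
    return CYCLES_PER_DIVISION * cycle_time_ns[name]


def max_frequency_mhz(name):
    """Maximum clock frequency in MHz derived from the cycle time in ns."""
    return 1000.0 / cycle_time_ns[name]


def relative_to_type2(table, name):
    """Value of `name` in `table`, normalized to the type2 divider."""
    return table[name] / table["type2"]
```

For instance, the speed advantage of type1 over type2 is the ratio of the two cycle times, 8.5/8.1, which is about 1.05, while `relative_to_type2(luts_ffs_combined, "type1")` is about 1.75.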

Four other FPGA-based dividers are presented by Ercegovac and McIlhenny [17, 18], Deschamps and Sutter [24], Zhang et al. [25], and Vestias and Neto [12]. These implementations are compared to our type2 divider in the following. The divider presented in [17, 18] is based on a digit recurrence algorithm that only requires limited-precision multipliers, adders, and LUTs. Furthermore, a compensation term is computed in the digit recurrence that compensates the error caused by this limited precision. The design is optimized for Virtex-5 FPGAs and has a good area characteristic (a 14-digit divider requires 1263 LUTs) but suffers from a high cycle time of 13.1 ns, which is more than 50% higher compared to the type1 or type2 design proposed in this paper. The reason for that high cycle time is (similar to the type4 divider) the long critical paths across many LUTs that cannot exploit the FPGA's fast carry chains. Therefore, the design would probably fit better on ASICs with dedicated routing. The dividers implemented in [24] apply two different digit recurrence algorithms with a redundancy of $\rho = 1$. The quotient digit selection function is very complex because it requires three MSDs of the divisor and four MSDs of the residual. The cycle times are high, but it must be taken into account that these designs are optimized for Virtex-4 FPGAs. Nevertheless, the number of used LUTs is very high. The divider presented in [25] applies a radix-100 digit recurrence algorithm with comprehensive prescaling of the dividend and divisor. It is optimized for Virtex-II Pro devices; hence it is hard to compare it with the dividers presented in this paper. However, the critical path includes a decimal carry-save adder tree, two carry-propagate adder stages, and three multiplexer stages. It can therefore be assumed that this algorithm would also have a poor performance on Virtex-5 devices, although two quotient digits are retired in each iteration step. Vestias and Neto [12] propose four different decimal dividers optimized for different speed and area tradeoffs. Contrary to the circuits presented in this paper, the dividers apply the Newton-Raphson algorithm. The used multiplier is sequential; therefore many cycles per division are required. However, each divider has a higher delay and a greater LUT usage compared to our type2 implementation.

Table 4: Results for normalized decimal fixed-point dividers, $p = 16 + 1$.

| | Type1 | Type2 | Type3.a | Type3.b | Type4 |
|---|---|---|---|---|---|
| Number of LUTs | 3595 | 1704 | 1749 | 2751 | 1846 |
| Number of FFs | 1234 | 1126 | 1262 | 1262 | 1031 |
| Number of LUTs and FFs combined | 3868 | 2210 | 2304 | 3240 | 2203 |
| Number of 36k BRAM | 0 | 5 | 2 | 2 | 0 |
| Number of LUTs normalized to the type2 divider | 2.11 | 1.00 | 1.03 | 1.61 | 1.08 |
| Number of FFs normalized to the type2 divider | 1.10 | 1.00 | 1.12 | 1.12 | 0.92 |
| Number of LUTs and FFs combined normalized to the type2 divider | 1.75 | 1.00 | 1.04 | 1.47 | 1.00 |
| Cycle time (ns) | 8.1 | 8.5 | 10.1 | 9.1 | 12.1 |
| Overall latency (ns) (19 × cycle time) | 154 | 162 | 192 | 173 | 230 |
| Max frequency (MHz) | 123 | 118 | 99 | 110 | 83 |
| Max frequency normalized to the type2 divider | 1.05 | 1.00 | 0.84 | 0.93 | 0.70 |

Table 5: Performance comparison of fixed-point dividers.

| Divider | Occupied area | Cycle time (ns) | Latency (ns) |
|---|---|---|---|
| Ercegovac and McIlhenny [17] ($p = 14$, Virtex-5) | 1263 (6,2)-LUTs | 13.1 | 197 |
| Type2 divider equalized to $p = 14$ | 1692 (6,2)-LUTs | 8.5 | 136 |
| Deschamps and Sutter [24] ($p = 16$, Virtex-4): (i) nonrestoring algorithm | 2974 (4,1)-LUTs | 21.4 | 386 |
| Deschamps and Sutter [24] ($p = 16$, Virtex-4): (ii) SRT-like algorithm | 3799 (4,1)-LUTs | 16.6 | 300 |
| Zhang et al. [25] ($p = 16$, Virtex-II Pro) | 3976 slices | 20.0 | 420 |
| Vestias and Neto [12] ($p = 16$, Virtex-4, Newton-Raphson): (i) type A3 (0 DSPs, 118 cycles) | 2756 (4,1)-LUTs | 3.4 | 394 |
| Vestias and Neto [12]: (ii) type A4 (0 DSPs, 90 cycles) | 3768 (4,1)-LUTs | 3.4 | 306 |
| Vestias and Neto [12]: (iii) type A5 (7 DSPs, 112 cycles) | 2091 (4,1)-LUTs | 3.4 | 380 |
| Vestias and Neto [12]: (iv) type A6 (10 DSPs, 86 cycles) | 2718 (4,1)-LUTs | 3.4 | 292 |
| Type2 divider equalized to $p = 16$ | 1704 (6,2)-LUTs | 8.5 | 153 |

Table 5 lists the results of the dividers presented in [17, 24, 25] together with the results of the type2 divider. However, for a fair performance comparison, the precision of the type2 divider is equalized to match the number of calculated quotient digits of the corresponding divider. Furthermore, it should be noted that the dividers presented in [12, 17, 24, 25] do not provide rounding support.

For the floating-point implementation, the fixed-point divider of type2 is used because type1 and type3.b consume too many resources and type3.a and type4 are too slow. Five additional BRAMs are required, but these are available in sufficient quantities on Virtex-5 devices. The corresponding post-place-and-route results for two configurations with different numbers of serialization stages are listed in Table 6. A comparison with other implementations is complicated because there are no other FPGA-based implementations of decimal floating-point dividers published yet. Therefore, we can only compare the decimal divider with binary dividers implemented on the same FPGA. In Table 7, post-place-and-route results of binary floating-point dividers for the double data format on a Virtex-5, provided by the Xilinx Core Generator [26], are listed. The dividers also provide the exception signals overflow, underflow, invalid operation, and divide-by-zero. Unfortunately, the binary floating-point divider does not support gradual underflow because this feature increases the complexity. The divider with one cycle per division is an unpipelined and fully parallel design, and the divider that requires 55 cycles per division is a fully sequential design. The binary floating-point divider with 20 cycles per division is best suited for a comparison with the decimal floating-point divider because it requires a comparable number of cycles for one operation. Obviously, decimal arithmetic has a great overhead regarding the total number of used LUTs, but the cycle time and latency are similar. This resource overhead can be explained, on the one hand, by the inefficient representation of decimal values in binary logic and, on the other hand, by the more complex specification of the decimal standard IEEE 754-2008. For instance, decimal floating-point division requires additional encoders, decoders, and a normalization stage.

Table 6: Post-place-and-route results of the floating-point divider based on type2.

| | Config 1 | Config 2 |
|---|---|---|
| Number of cycles per float division | 23 | 21 |
| Number of LUTs | 3231 | 3205 |
| Number of FFs | 1630 | 1377 |
| Number of LUTs and FFs combined | 3571 | 3566 |
| Cycle time (ns) | 8.5 | 9.1 |
| Overall latency (ns) | 195.5 | 191 |
| Max frequency (MHz) | 118 | 110 |

Table 7: Post-place-and-route results of a 64-bit binary floating-point divider (Core Gen).

| Cycles per division | 1 | 10 | 20 | 55 |
|---|---|---|---|---|
| Number of LUTs | 3161 | 592 | 425 | 394 |
| Number of FFs | 196 | 518 | 469 | 542 |
| Number of LUTs and FFs combined | 3232 | 917 | 675 | 675 |
| Cycle time (ns) | 153.7 | 20.8 | 8.9 | 6.8 |
| Overall latency (ns) | 153.7 | 208 | 178 | 374 |
| Max frequency (MHz) | 6.5 | 48 | 112 | 147 |

7. Conclusion

In this paper, we have presented five different radix-10 digit recurrence division algorithms. The type1 divider implements the simple shift-and-subtract algorithm. The type2 divider is based on a nonrestoring algorithm with a ROM-based quotient digit selection function. The type3.a and type3.b dividers are both similar to type2 but use fast binary carry-propagate adders in the quotient digit selection function. However, the type3.a divider uses LUT-based multiplexers, whereas the type3.b divider uses fast carry chain-based multiplexers. The type4 divider applies a new algorithm with constant digit selection functions. This type4 divider requires neither a ROM nor multiplexers to select multiples of the divisor.

We have shown that the shift-and-subtract algorithm, which has the worst latency in ASIC architectures, is the fastest design on FPGAs but uses the most FPGA resources in terms of LUTs. A good tradeoff between latency and area is the type2 architecture with a redundant carry-save representation of the residual, a radix-2 representation of the two MSDs, and a quotient digit selection function implemented in ROM. Furthermore, we have extended this fixed-point divider to a fully IEEE 754-2008 compliant decimal floating-point divider for the decimal64 data format, and finally we have shown implementation results of all dividers.

Appendix

Proofs

Theorem A.1 (rounding overflow). Let us consider a decimal floating-point division $Q = X/D$ with the signs $s_i$, the significands $c_i$, the exponents $q_i$, and the precision $p$:

$$(-1)^{s_Q} \cdot c_Q \cdot 10^{q_Q} = \frac{(-1)^{s_X} \cdot c_X \cdot 10^{q_X}}{(-1)^{s_D} \cdot c_D \cdot 10^{q_D}}. \tag{A.1}$$

Then rounding overflow, that is, an overflow that arises due to adding +1 to the least significant digit of the final result, cannot occur.

Proof. Assume to the contrary that there are a floating-point dividend X, a floating-point divisor D, and a proper rounding mode where the decimal floating-point division produces a rounding overflow. Without loss of generality, we consider in the following only positive operands ($s_X = s_D = 0$), normalized significands ($c_X, c_D \ge 10^{p-1}$), and the exponents to be zero ($q_X = q_D = 0$). In the case of rounding overflow, the $p$ digits of the quotient must be all nines and the remainder must be unequal to zero, that is,

$$\underbrace{c_X}_{p \text{ digits}} = c_Q \cdot c_D + \text{rem} = \underbrace{9.9 \cdots 9}_{p \text{ digits}} \cdot \underbrace{c_D}_{p \text{ digits}} + \underbrace{\text{rem}}_{p \text{ digits}} \tag{A.2}$$

$$\text{with } 0 < \text{rem} < c_D \cdot 10^{-p+1}. \tag{A.3}$$

Condition (A.2) is fulfilled for

$$c_D = 10^{p-1}, \qquad c_X = \underbrace{99 \cdots 9}_{p \text{ digits}}, \tag{A.4}$$

but this violates condition (A.3) because the remainder is zero (rem = 0); that is, the result is exact, no round-up is applied, and hence no rounding overflow occurs.


Otherwise, if the divisor is greater than $10^{p-1}$, it follows that

$$\begin{aligned} \underbrace{c_X}_{p \text{ digits}} &= (9.9 \cdots 9) \cdot c_D + \text{rem} \\ &= (10 - 10^{-p+1}) \cdot c_D + \text{rem} \\ &= 10 \cdot c_D + \underbrace{(-10^{-p+1} \cdot c_D + \text{rem})}_{< 0} \\ &= \underbrace{10 \cdot c_D + \mu}_{\ge p + 1 \text{ digits}} \end{aligned} \tag{A.5}$$

with $-10 < \mu < 0$. The left term of the equation has, by definition, a precision of $p$ digits, while the right term of the equation has a precision of more than $p$ digits. This contradiction finally proves that no rounding overflow can occur.
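The statement of Theorem A.1 can also be checked by brute force for small precisions. The following sketch is not part of the proof; it merely enumerates all pairs of normalized $p$-digit significands and counts the cases in which the truncated $p$-digit quotient consists of nines only while the remainder is nonzero. The function name is hypothetical.

```python
def rounding_overflow_cases(p):
    """Count operand pairs with an all-nines p-digit quotient and a nonzero remainder."""
    lo, hi = 10 ** (p - 1), 10 ** p
    all_nines = hi - 1
    count = 0
    for c_x in range(lo, hi):
        for c_d in range(lo, hi):
            # Scale the dividend so that the integer quotient has exactly p digits.
            scale = 10 ** (p - 1) if c_x >= c_d else 10 ** p
            quotient, remainder = divmod(c_x * scale, c_d)
            if quotient == all_nines and remainder != 0:
                count += 1
    return count
```

The only pair that produces an all-nines quotient is $c_X = 99 \cdots 9$ with $c_D = 10^{p-1}$, and there the remainder is zero, in agreement with (A.4).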

References

[1] M. Baesler, S. Voigt, and T. Teufel, "FPGA implementations of radix-10 digit recurrence fixed-point and floating-point dividers," in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig '11), pp. 13–19, IEEE Computer Society, Los Alamitos, CA, USA, December 2011.

[2] IEEE Task P754, ANSI/IEEE 754-1985, Standard for Binary Floating-Point Arithmetic, New York, NY, USA, August 1985.

[3] M. F. Cowlishaw, "Decimal floating-point: algorism for computers," in Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH '03), pp. 104–111, IEEE Computer Society, Washington, DC, USA, June 2003.

[4] IEEE Task P754, IEEE 754-2008, Standard for Floating-Point Arithmetic, New York, NY, USA, August 2008.

[5] ANSI/IEEE, ANSI/IEEE Std 854-1987: An American National Standard, IEEE Standard for Radix-Independent Floating-Point Arithmetic, New York, NY, USA, October 1987.

[6] A. Y. Duale, M. H. Decker, H. G. Zipperer, M. Aharoni, and T. J. Bohizic, "Decimal floating-point in z9: an implementation and testing perspective," IBM Journal of Research and Development, vol. 51, no. 1-2, pp. 217–227, 2007.

[7] C. F. Webb, "IBM z10: the next-generation mainframe microprocessor," IEEE Micro, vol. 28, no. 2, pp. 19–29, 2008.

[8] L. Eisen, J. W. Ward, H. W. Tast et al., "IBM POWER6 accelerators: VMX and DFU," IBM Journal of Research and Development, vol. 51, no. 6, pp. 663–683, 2007.

[9] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd, "Power7: IBM's next-generation server processor," IEEE Micro, vol. 30, no. 2, pp. 7–15, 2010.

[10] S. F. Oberman and M. J. Flynn, "Division algorithms and implementations," IEEE Transactions on Computers, vol. 46, no. 8, pp. 833–854, 1997.

[11] L. K. Wang and M. J. Schulte, "A decimal floating-point divider using Newton-Raphson iteration," Journal of VLSI Signal Processing Systems, vol. 49, no. 1, pp. 3–18, 2007.

[12] M. Vestias and H. Neto, "Revisiting the Newton-Raphson iterative method for decimal division," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '11), pp. 138–143, IEEE Computer Society Press, September 2011.

[13] H. Nikmehr, B. Phillips, and C. C. Lim, "Fast decimal floating-point division," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 9, pp. 951–961, 2006.

[14] T. Lang and A. Nannarelli, "A radix-10 digit-recurrence division unit: algorithm and architecture," IEEE Transactions on Computers, vol. 56, no. 6, pp. 727–739, 2007.

[15] A. Vazquez, E. Antelo, and P. Montuschi, "A radix-10 SRT divider based on alternative BCD codings," in Proceedings of the 25th IEEE International Conference on Computer Design (ICCD '07), pp. 280–287, IEEE Computer Society Press, Los Alamitos, CA, USA, October 2007.

[16] E. Schwarz and S. Carlough, "Power6 decimal divide," in Proceedings of the 18th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '07), pp. 128–133, IEEE Computer Society, July 2007.

[17] M. D. Ercegovac and R. McIlhenny, "Design and FPGA implementation of radix-10 algorithm for division with limited precision primitives," in Proceedings of the 42nd Asilomar Conference on Signals, Systems and Computers (ASILOMAR '08), pp. 762–766, IEEE Computer Society, Pacific Grove, Calif, USA, October 2008.

[18] M. D. Ercegovac and R. McIlhenny, "Design and FPGA implementation of radix-10 combined division/square root algorithm with limited precision primitives," in Proceedings of the 44th Asilomar Conference on Signals, Systems and Computers (Asilomar '10), pp. 87–91, IEEE Computer Society, Pacific Grove, Calif, USA, November 2010.

[19] F. Y. Busaba, C. A. Krygowski, W. H. Li, E. M. Schwarz, and S. R. Carlough, "The IBM z900 decimal arithmetic unit," in Proceedings of the 35th Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1335–1339, IEEE Computer Society, November 2001.

[20] M. Ercegovac and T. Lang, Division and Square Root: Digit-Recurrence Algorithms and Implementations, Kluwer Academic Publishers, Norwell, Mass, USA, 1994.

[21] M. Baesler, S. O. Voigt, and T. Teufel, "A decimal floating-point accurate scalar product unit with a parallel fixed-point multiplier on a Virtex-5 FPGA," International Journal of Reconfigurable Computing, vol. 2010, Article ID 357839, 13 pages, 2010.

[22] M. Baesler, S. O. Voigt, and T. Teufel, "A radix-10 digit recurrence division unit with a constant digit selection function," in Proceedings of the 28th IEEE International Conference on Computer Design (ICCD '10), pp. 241–246, IEEE Computer Society, Los Alamitos, CA, USA, October 2010.

[23] M. Baesler, S. O. Voigt, and T. Teufel, "An IEEE 754-2008 decimal parallel and pipelined FPGA floating-point multiplier," in Proceedings of the 20th International Conference on Field Programmable Logic and Applications (FPL '10), pp. 489–495, IEEE Computer Society, Washington, DC, USA, September 2010.

[24] J. P. Deschamps and G. Sutter, "Decimal division: algorithms and FPGA implementations," in Proceedings of the 6th Southern Programmable Logic Conference (SPL '10), pp. 67–72, IEEE Computer Society, March 2010.

[25] Y. Zhang, D. Chen, L. Chen et al., "Design and implementation of a radix-100 decimal division," in Proceedings of IEEE Symposium on Circuit and System (ISCAS '09), Taipei, Taiwan, May 2009.

[26] Xilinx, Xilinx LogiCORE Floating-Point Operator v4.0, April 2008.


Page 6: Research Article Analysis of Fast Radix-10 Digit …downloads.hindawi.com/journals/ijrc/2013/453173.pdfnumbers. e decimal number 0.1 , for example, has a peri-odical continued fraction

6 International Journal of Reconfigurable Computing

Figure 3: Block diagram of type2 digit recurrence.

(1) Select $z_{j+1} = \mathrm{SELROM}(\widehat{10w}[j], \hat{d})$
(2) Compute $(w_s[j+1] + w_c[j+1]) = 10\,(w_s[j] + w_c[j]) - z_{j+1} \cdot d$

Algorithm 3: Pseudocode for the type2 digit recurrence.

radix-2 CPAs. Thus, the multiple $z_{j+1} \cdot d$ is composed of two summands:

$$z_{j+1} \cdot d = d^{1}_{j+1} + d^{2}_{j+1}, \tag{16}$$

where $d^{1,2} \in \{0, \pm 1d, \pm 2d, \pm 5d, \pm 10d\}$ are precomputed. The multiples $2d$ and $5d$ can be computed quickly and easily by digit recoding and constant shift operations, as shown in (9) and (10). The 10's and 2's complements are applied for $d^{1,2} < 0$. For both the radix-2 and the BCD-4221 radix-10 representations, the complements can be computed by inverting each bit and adding one. In summary, the subtraction uses a (3:1) radix-2 CPA for the two MSDs and a redundant radix-10 (4:2) CSA for the remaining digits. The digit recurrence algorithm of the type2 divider is summarized in Algorithm 3, and its block diagram is depicted in Figure 3.
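The two-summand composition in (16) can be illustrated with a short sketch. The Python snippet below is not the hardware implementation; the helper name `decompose` and the brute-force search order are assumptions made here for illustration. It only confirms that every quotient digit in $\{-8, \ldots, 8\}$ can be written as the sum of two of the precomputed multiples.

```python
from itertools import product

# Coefficients of the precomputed multiples 0, +-1d, +-2d, +-5d, +-10d from (16).
COMPONENTS = (0, 1, -1, 2, -2, 5, -5, 10, -10)

def decompose(z):
    """Return one pair (a, b) of COMPONENTS with a + b == z (hypothetical helper)."""
    for a, b in product(COMPONENTS, repeat=2):
        if a + b == z:
            return a, b
    raise ValueError(f"digit {z} is not a sum of two components")

if __name__ == "__main__":
    # List one possible decomposition for every signed quotient digit.
    for z in range(-8, 9):
        a, b = decompose(z)
        print(f"{z:+d} * d = ({a:+d})d + ({b:+d})d")
```

Several digits admit more than one decomposition (for example, $4 = 2 + 2 = 5 - 1$); which one the hardware uses is not stated in this section.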

4.4. Type3 QDS Function. The frequency-limiting component of the type2 divider is the digit recurrence step, which comprises the ROM (BRAM) that determines the next quotient digit. This BRAM is slow compared to common FPGA logic. Hence, it appears beneficial to remove the slow BRAM from the critical path and use fast FPGA logic instead. To analyze the impact of the BRAM delays, we implement two dividers (type3a and type3b) that can be realized without a BRAM in the critical path.

Lang and Nannarelli [14] propose a divider that implements a quotient digit selection function based on CPAs and sign detection. These CPAs subtract fixed values from the current residual with limited precision. The fixed values depend on estimates of the divisor and are precomputed and stored in a ROM. Furthermore, the quotient digit is divided into two parts, which further simplifies the quotient digit selection function and reduces the critical path. Unfortunately, this divider cannot be implemented efficiently on FPGAs because the required carry-save adder with small error estimation performs poorly on the FPGA's slice structure. Nevertheless, to investigate the impact of BRAM delays in the type2 divider, we implement two further dividers (type3a and type3b) that have no BRAM in the critical path but use a CPA-based quotient digit selection function instead.

The type3 dividers are modifications of the type2 divider. The common architectural features are

(i) the multiples of the divisor are composed of two components, $z_{j+1} \cdot d = d^{1}_{j+1} + d^{2}_{j+1}$, where $d^{1,2} \in \{0, \pm 1d, \pm 2d, \pm 5d, \pm 10d\}$,
(ii) the fast redundant BCD-4221 carry-save adders for the digit recurrence step, as stated in (5),
(iii) the radix-2 implementation of the two MSDs, and
(iv) the 10's and 2's complements for $d^{1,2} < 0$.

Furthermore, since the architectures of the type2 and type3 dividers are similar with the exception of the quotient digit selection function, most of the dividers' characteristics are also identical, including

(i) the P-D diagram,
(ii) the redundancy factor $\rho = 8/9$, $k_{\max} = 8$,
(iii) the need for prescaling the divisor, $0.4 \le d < 1.0$,
(iv) the accuracy of the divisor's estimation, $\delta = 2$, and
(v) the accuracy of the residual's estimation, $t = 2$.

Contrary to the type2 divider, the QDS functions of the type3 dividers are implemented by 16 carry-propagate adders with sign detection. These adders subtract fixed selection constants from the estimate of the current residual. The selection constants depend on the current divisor and are stored in a ROM, which is then no longer part of the critical path. The selection constants are computed according to Algorithm 4. The common digit recurrence algorithm of the type3a and type3b dividers is shown in Algorithm 5, and the corresponding block diagram is depicted in Figure 4.

The sign signals of the binary carry-propagate adders are used to determine the corresponding quotient digit and to select the multiples of the divisor in the digit recurrence

Figure 4: Block diagram of type3 digit recurrence.

(1) Compute $L_k(d_i)$ and $U_k(d_i)$ for each $d_i \in \{0.40, 0.41, \ldots, 0.99\}$ and $k \in \{-8, \ldots, +8\}$ according to (12a) and (12b).
(2) Compute the selection constants $m_k(d_i)$:
    $m_8(d_i) = U_7(d_i) - \epsilon = U_7(d_i) - 0.2$
    $m_7(d_i) = U_6(d_i) - 0.2$
    $\vdots$
    $m_1(d_i) = U_0(d_i) - 0.2$
    $m_0(d_i) = L_0(d_i)$
    $\vdots$
    $m_{-7}(d_i) = L_{-7}(d_i)$

Algorithm 4: Algorithm for the calculation of the type3 selection constants.

(1) if $m_8(d_i) - \widehat{10w}[j] < 0$ then $z_{j+1} = 8$
    else if $m_7(d_i) - \widehat{10w}[j] < 0$ then $z_{j+1} = 7$
    $\vdots$
    else if $m_{-7}(d_i) - \widehat{10w}[j] < 0$ then $z_{j+1} = -7$
    else $z_{j+1} = -8$
(2) Compute $(w_s[j+1] + w_c[j+1]) = 10\,(w_s[j] + w_c[j]) - z_{j+1} \cdot d$

Algorithm 5: Pseudocode for the type3 digit recurrence.
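The priority comparison of step (1) in Algorithm 5 can be sketched in software as follows. This is a behavioral illustration only: the function name `select_digit` and the toy constants used in the usage example are assumptions made here and are not the ROM contents of the actual design.

```python
def select_digit(constants, estimate):
    """Behavioral model of step (1) of Algorithm 5.

    constants: dict mapping k in {8, 7, ..., -7} to a selection constant m_k
    estimate:  estimate of the shifted residual 10*w[j]
    Returns the largest k whose comparator m_k - estimate is negative, else -8.
    """
    for k in range(8, -8, -1):            # k = 8, 7, ..., -7 (16 comparators)
        if constants[k] - estimate < 0:   # sign bit of the k-th subtraction
            return k
    return -8

if __name__ == "__main__":
    # Toy, divisor-independent constants chosen only to exercise the function.
    toy = {k: k - 0.5 for k in range(-7, 9)}
    print(select_digit(toy, 3.2))
```

In hardware all 16 subtractions run in parallel and only their sign bits are evaluated; the sequential loop above merely mirrors the if/else chain of the pseudocode.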

iteration step. The advantage of the type3 QDS function is its short critical path, which comprises only one 8-bit binary carry-propagate adder. However, this short critical path comes at a high price because the complexity is moved to the selection of the divisor's multiples. Hence, large fan-in multiplexers with poor latencies are required. In order to minimize the impact of these multiplexers, we designed a new dedicated (17:1) multiplexer that exploits the FPGA's fast carry chains and has a delay of only one LUT instance. In the following, we name the divider with the improved fast multiplexers type3b and the divider with traditional multiplexers type3a. The new dedicated (17:1) multiplexer uses the 16 sign bits as select lines and is coded as follows:

    if sel = '1111111111111111' then y = x(16)
    if sel = '0111111111111111' then y = x(15)
    ...
    if sel = '0000000000000001' then y = x(1)
    if sel = '0000000000000000' then y = x(0)        (17)

with $\mathrm{sel}(-1) = 1$ and $\mathrm{sel}(16) = 0$. In other words, bit $x(i)$ is selected when $\mathrm{sel}(i) = 0$ and $\mathrm{sel}(i-1) = 1$. The selection of two input signals is implemented in one (5:1) LUT, and all signals are combined by a long OR gate that is implemented by exploiting the FPGA's fast carry chain. Therefore, one (17:1) multiplexer requires nine LUTs, and the type3b divider with fast multiplexers uses many more LUTs than the type3a divider (see Section 6). The block diagram of such a (17:1) multiplexer is depicted in Figure 5.
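The selection rule in (17) can be modeled in a few lines of Python. The sketch below is a bit-level behavioral model under the stated boundary conditions $\mathrm{sel}(-1) = 1$ and $\mathrm{sel}(16) = 0$; the names `mux17` and `thermometer` are hypothetical, and the model says nothing about the LUT and carry-chain mapping.

```python
def mux17(x, sel):
    """Behavioral model of the (17:1) multiplexer in (17).

    x:   list of 17 input bits, x[0] .. x[16]
    sel: list of 16 select bits, sel[0] .. sel[15], thermometer coded
    Bit x[i] is passed when sel(i) = 0 and sel(i-1) = 1.
    """
    ext = [1] + list(sel) + [0]   # ext[i] = sel(i-1), ext[i+1] = sel(i)
    y = 0
    for i in range(17):
        hit = ext[i] & (1 - ext[i + 1])   # exactly one i satisfies this
        y |= hit & x[i]                   # wide OR over all gated inputs
    return y

def thermometer(n):
    """Select word with the n lowest bits set (n = 0 .. 16)."""
    return [1 if i < n else 0 for i in range(16)]

if __name__ == "__main__":
    x = [0] * 17
    x[5] = 1
    print(mux17(x, thermometer(5)))   # passes x[5]
```

Because the select word is thermometer coded, at most one AND term is active, so the final OR never has to resolve conflicts; this is the property that allows the wide OR to be mapped onto a carry chain.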

4.5. Type4 QDS Function. The type4 divider applies a new algorithm that we proposed in a previous paper [22]. It is based on the decomposition of the signed quotient digit into three components with the values $\{-5, 0, 5\}$, $\{-2, 0, 2\}$, and $\{-1, 0, 1\}$, as well as on a fast constant digit selection function. This implementation requires neither a ROM for the QDS function nor a multiplexer to select the multiple $z_j \cdot d$ in the digit recurrence iteration step. The divider is intended to utilize fewer resources than the other implementations. Like the type2, type3a, and type3b dividers, the type4 divider uses a signed-digit redundant quotient calculation, a carry-save representation of the residual, fast BCD-4221 CSAs for the digit recurrence, and a radix-2 implementation of the MSDs.

Figure 5: Fast multiplexer.

Quotient digits are decomposed into three components, $z_{j+1} = 5 \cdot z^{3}_{j+1} + 2 \cdot z^{2}_{j+1} + z^{1}_{j+1}$, and each component $z^{i}$ is computed by a distinct selection function. These components can hold three values, $z^{i} \in \{-1, 0, 1\}$, so that only two comparators per component are required to distinguish between these values. Since $-8 \le z_{j+1} \le 8$, we get a redundancy factor of $\rho = 8/9$. Furthermore, the selection functions become very simple due to the prescaling of the divisor ($0.4 \le d < 0.8$); that is, they are constant functions $\mathrm{SEL}^{i}$ that no longer depend on the divisor:

$$z^{3}_{j+1} = \mathrm{SEL}^{3}(\widehat{10w}[j]), \tag{18a}$$
$$z^{2}_{j+1} = \mathrm{SEL}^{2}(\hat{v}^{1}[j]), \tag{18b}$$
$$z^{1}_{j+1} = \mathrm{SEL}^{1}(\hat{v}^{2}[j]). \tag{18c}$$

The recurrence is then defined as follows:

$$v^{1}[j] = 10w[j] - z^{3}_{j+1} \cdot 5d, \tag{19a}$$
$$v^{2}[j] = v^{1}[j] - z^{2}_{j+1} \cdot 2d, \tag{19b}$$
$$w[j+1] = v^{2}[j] - z^{1}_{j+1} \cdot d. \tag{19c}$$
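As a quick plausibility check of the decomposition $z_{j+1} = 5 z^{3}_{j+1} + 2 z^{2}_{j+1} + z^{1}_{j+1}$, the following sketch enumerates all 27 component combinations. The function name `digit_values` is an assumption introduced for this illustration.

```python
from itertools import product

def digit_values():
    """All quotient digits reachable by 5*z3 + 2*z2 + z1 with z_i in {-1, 0, 1}."""
    return sorted({5 * z3 + 2 * z2 + z1
                   for z3, z2, z1 in product((-1, 0, 1), repeat=3)})

if __name__ == "__main__":
    print(digit_values())
```

The 27 combinations cover the 17 digits from $-8$ to $8$ without gaps, so some digits have several component encodings.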

The multiples $2d$ and $5d$ can be computed quickly and easily according to (9) and (10). Each selection function requires two comparators, implemented as carry-save adders, that subtract constant values from the residuals' estimates ($\hat{w}[j]$, $\hat{v}^{1}[j]$, and $\hat{v}^{2}[j]$). As we will show in the following, these estimates require only the two most significant digits (MSDs) of the corresponding exact values.

First, the selection intervals $[L^{i}_{k}, U^{i}_{k}]$ for $z^{i}_{j+1} = k$ with $k \in \{-1, 0, 1\}$ and $i \in \{1, 2, 3\}$ have to be determined, where $L^{i}_{k}$ is the smallest and $U^{i}_{k}$ is the greatest value of the selection constant $m_k(i)$ such that the next residual is still bounded. Applying the convergence condition (8), the digit recurrence (19a), (19b), and (19c), and the redundancy factor $\rho = 8/9$, we obtain

$$|w[j]| \le \rho d = \frac{8}{9} d, \tag{20a}$$
$$|v^{2}[j]| \le (\rho + 1)\, d = \frac{17}{9} d, \tag{20b}$$
$$|v^{1}[j]| \le (\rho + 1 + 2)\, d = \frac{35}{9} d. \tag{20c}$$

From the recurrence $v^{1}[j] = 10w[j] - z^{3}_{j+1} \cdot 5d$, the bound $|v^{1}[j]| \le (35/9)\,d$, and $z^{3}_{j+1} \in \{-1, 0, 1\}$, and replacing $10w[j]$ by the upper limit $U^{3}_{k}$, we get

$$U^{3}_{k} = (5k + \rho + 3)\, d = \left(5k + \frac{35}{9}\right) d. \tag{21a}$$

Similarly, we obtain the upper limits $U^{2}_{k}$ and $U^{1}_{k}$ from the recurrences (19b) and (19c), respectively:

$$U^{2}_{k} = (2k + \rho + 1)\, d = \left(2k + \frac{17}{9}\right) d, \tag{21b}$$
$$U^{1}_{k} = (k + \rho)\, d = \left(k + \frac{8}{9}\right) d. \tag{21c}$$

Likewise, the lower limits $L^{i}_{k}$ can be computed:

$$L^{3}_{k} = (5k - \rho - 3)\, d = \left(5k - \frac{35}{9}\right) d, \tag{22a}$$
$$L^{2}_{k} = (2k - \rho - 1)\, d = \left(2k - \frac{17}{9}\right) d, \tag{22b}$$
$$L^{1}_{k} = (k - \rho)\, d = \left(k - \frac{8}{9}\right) d. \tag{22c}$$

Figure 6: P-D diagram for the type4 QDS function.

Due to the redundancy factor $\rho = 8/9 > 1/2$, we have overlapping regions $[L^{i}_{k+1}, U^{i}_{k}]$ where more than one quotient digit component $z^{i}_{j+1}$ may be selected. The decision boundary of a selection function should lie inside these overlapping regions. Figure 6 shows the P-D diagram for $z^{3}$. The figures for $z^{2}$ and $z^{1}$ look similar. As $z^{3}$ can hold three different values, the selection function requires two decision boundaries. Generally, it is implemented by a staircase function (see Figure 2), but for the bounded divisor $d \in [0.4, 0.8)$ it is independent of the divisor and is reduced to a constant function (see Figure 6). Fortunately, the selection functions for $z^{2}$ and $z^{1}$ are also independent of the divisor and can be implemented by constant functions in a similar fashion.

We now determine the required number of fractional digits for the estimates $\widehat{10w}$, $\hat{v}^{1}$, and $\hat{v}^{2}$. Moreover, we choose suitable selection constants $m^{i}_{\mathrm{pos}}$ and $m^{i}_{\mathrm{neg}}$. These constants depend on the maximum error due to the estimates and on the minimum overlapping width, which is given by $U^{i}_{0}(0.4) - L^{i}_{1}(0.8)$ as well as $U^{i}_{-1}(0.8) - L^{i}_{0}(0.4)$. Since we use a BCD-4221 carry-save redundant digit representation, for a precision of $t$ digits (one integer digit and $t - 1$ fractional digits) we have a maximum error of $\epsilon = 2 \cdot 10^{-(t-1)}$. Moreover, the estimates $\hat{w}$ are computed by truncation ($w - \hat{w} \le \epsilon$). This means that for given positive and negative selection constants $m^{i}_{\mathrm{pos}} > 0$ and $m^{i}_{\mathrm{neg}} < 0$, the following expressions must be true:

$$U^{i}_{0}(0.4) - m^{i}_{\mathrm{pos}} \ge \epsilon, \qquad L^{i}_{1}(0.8) \le m^{i}_{\mathrm{pos}}, \tag{23}$$
$$L^{i}_{0}(0.4) \le m^{i}_{\mathrm{neg}}, \qquad U^{i}_{-1}(0.8) - m^{i}_{\mathrm{neg}} \ge \epsilon. \tag{24}$$

If we choose the selection constants as listed in Table 2, all conditions are fulfilled for a precision of $t = 2$ digits, and each selection function is reduced to two simple 2-digit comparators.
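Conditions (23) and (24) can be checked numerically for the constants of Table 2. The sketch below evaluates the limits (21a)-(22c) with exact rational arithmetic; the dictionary and function names are assumptions introduced here, while the numeric constants are the ones given in the text.

```python
from fractions import Fraction as F

RHO = F(8, 9)
EPS = F(2, 10)                       # epsilon = 2 * 10**-(t-1) with t = 2
WEIGHT = {3: 5, 2: 2, 1: 1}          # factor of k in (21a)-(22c)
SLACK = {3: 3, 2: 1, 1: 0}           # additive term next to rho in (21a)-(22c)
M_POS = {3: F(9, 10), 2: F(1, 10), 1: F(1, 10)}      # Table 2
M_NEG = {3: F(-15, 10), 2: F(-7, 10), 1: F(-3, 10)}  # Table 2

def upper(i, k, d):
    """U^i_k(d) according to (21a)-(21c)."""
    return (WEIGHT[i] * k + RHO + SLACK[i]) * d

def lower(i, k, d):
    """L^i_k(d) according to (22a)-(22c)."""
    return (WEIGHT[i] * k - RHO - SLACK[i]) * d

def conditions_hold(i):
    """Evaluate (23) and (24) at the divisor bounds 0.4 and 0.8."""
    d_min, d_max = F(4, 10), F(8, 10)
    return (upper(i, 0, d_min) - M_POS[i] >= EPS
            and lower(i, 1, d_max) <= M_POS[i]
            and lower(i, 0, d_min) <= M_NEG[i]
            and upper(i, -1, d_max) - M_NEG[i] >= EPS)

if __name__ == "__main__":
    for i in (3, 2, 1):
        margin_pos = float(upper(i, 0, F(4, 10)) - M_POS[i])
        margin_neg = float(upper(i, -1, F(8, 10)) - M_NEG[i])
        print(i, conditions_hold(i), round(margin_pos, 3), round(margin_neg, 3))
```

The printed margins correspond to the last two columns of Table 2, up to the rounding used there.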

Table 2: Constants for the type4 quotient digit selection function.

         m^i_pos    m^i_neg    U^i_0(0.4) - m^i_pos    U^i_{-1}(0.8) - m^i_neg
z^3      0.9        -1.5       0.656                   0.611
z^2      0.1        -0.7       0.656                   0.611
z^1      0.1        -0.3       0.255                   0.211

As soon as one component $z^{i}$ of the quotient digit is computed, the corresponding multiple of the divisor must be subtracted from the partial residual. Each component $z^{i}$ can hold three different values, $z^{i} \in \{-1, 0, +1\}$, that are multiplied with the corresponding weighted divisor $n \cdot d$ with $n \in \{5, 2, 1\}$. In other words, $n \cdot d$ is either passed, negated, or reset. Passing and resetting the multiple of the divisor is performed at no extra cost. Negation is accomplished by bit inversion and by adding $+1$ through the carry inputs of the carry-save adders in (19a), (19b), and (19c). In summary, this is a very fast operation and requires only two LUTs per digit.
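The reason why negation reduces to bit inversion plus one is that the BCD-4221 weights sum to nine. The following sketch demonstrates this at the digit level; the helper names and the particular 4221 encoding chosen for each digit are assumptions made for illustration, and the carry-save structure of the real adders is not modeled.

```python
from itertools import product

WEIGHTS = (4, 2, 2, 1)   # bit weights of one BCD-4221 digit

def value(bits):
    """Decimal value of a 4-bit BCD-4221 pattern."""
    return sum(w * b for w, b in zip(WEIGHTS, bits))

def invert(bits):
    """Bitwise inversion of one digit."""
    return tuple(1 - b for b in bits)

# One (arbitrary) valid encoding per decimal digit 0..9.
ENCODE = {}
for pattern in product((0, 1), repeat=4):
    ENCODE.setdefault(value(pattern), pattern)

def tens_complement(x, n):
    """10's complement of an n-digit number x via bit inversion and +1."""
    digits = [int(c) for c in f"{x:0{n}d}"]
    nines = [value(invert(ENCODE[d])) for d in digits]   # 9 - d per digit
    return (int("".join(map(str, nines))) + 1) % 10 ** n

if __name__ == "__main__":
    print(tens_complement(1234, 4))
```

Since $4 + 2 + 2 + 1 = 9$, inverting all four bits of any 4221 pattern yields $9$ minus the digit value, independent of which of the redundant encodings is used.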

Similar to the type2 and type3 dividers, we implement the two MSDs of the type4 residual in a nonredundant radix-2 representation, which requires fewer LUTs and results in a faster quotient digit selection function. The pseudocode for the digit recurrence iteration is shown in Algorithm 6, and the corresponding block diagram is depicted in Figure 7.

4.6. Proposed Decimal Fixed-Point Divider. For each type of division algorithm presented in the preceding sections, we have implemented a corresponding fixed-point divider, which is described in the following. For all fixed-point dividers, we expect the input operands to be normalized.

The block diagram of the type1 divider is depicted in Figure 8. First, the divisor multiples $2d, \ldots, 9d$ are precomputed. In the following cycles, $p = 16$ quotient digits ($16 + 1$ if the first digit is zero) are computed one by one, followed by an additional rounding digit. The quotient (Z) and the quotient +1 (ZP1) are computed on-the-fly by using two registers that are updated every cycle. This algorithm has the advantage that no additional slow decimal CPA is required to compute the incremented quotient. It is described more precisely in Section 5. The normalization of the result is also performed on-the-fly by locking the conversion when $16 + 1$ quotient digits for the significand and the rounding digit have been computed; that is, in the worst case, when the first quotient digit is zero, $16 + 1 + 1 = 18$ cycles are required. Furthermore, the sticky bit, which is required for rounding, is calculated. It is set to one whenever the final remainder is unequal to zero.

The architectures of the type2, type3a, type3b, and type4 dividers are similar in their structure but differ in their digit recurrence algorithm and scaling. The common block diagram is shown in Figure 9. First, the dividend and the divisor are rescaled,

$$d \in [0.1, 1.0) \;\Longrightarrow\; d' \in \begin{cases} [0.4, 1.0) & \text{for type2, type3a/b}, \\ [0.4, 0.8) & \text{for type4}, \end{cases} \tag{25}$$

and five as well as two times the divisor are precomputed according to (9) and (10). Then the operands are recoded because the digit recurrence iteration uses a redundant digit representation with BCD-4221 carry-save adders.

The digit recurrence retires one signed quotient digit $z_k$ in each iteration. The on-the-fly conversion algorithm converts this signed-digit representation to the BCD-8421-coded

Figure 7: Block diagram of type4 digit recurrence.

Figure 8: Block diagram of the normalized type1 division.

(1) Select
    $z^{3}_{j+1} = 1$ if $0.9 \le \widehat{10w}[j]$,
    $z^{3}_{j+1} = 0$ if $-1.5 \le \widehat{10w}[j] < 0.9$,
    $z^{3}_{j+1} = -1$ if $\widehat{10w}[j] < -1.5$,
    and compute $(v^{1}_{s}[j] + v^{1}_{c}[j]) = 10\,(w_s[j] + w_c[j]) - z^{3}_{j+1} \cdot 5d$.
(2) Select
    $z^{2}_{j+1} = 1$ if $0.1 \le \hat{v}^{1}[j]$,
    $z^{2}_{j+1} = 0$ if $-0.7 \le \hat{v}^{1}[j] < 0.1$,
    $z^{2}_{j+1} = -1$ if $\hat{v}^{1}[j] < -0.7$,
    and compute $(v^{2}_{s}[j] + v^{2}_{c}[j]) = (v^{1}_{s}[j] + v^{1}_{c}[j]) - z^{2}_{j+1} \cdot 2d$.
(3) Select
    $z^{1}_{j+1} = 1$ if $0.1 \le \hat{v}^{2}[j]$,
    $z^{1}_{j+1} = 0$ if $-0.3 \le \hat{v}^{2}[j] < 0.1$,
    $z^{1}_{j+1} = -1$ if $\hat{v}^{2}[j] < -0.3$,
    and compute $(w_s[j+1] + w_c[j+1]) = (v^{2}_{s}[j] + v^{2}_{c}[j]) - z^{1}_{j+1} \cdot d$.

Algorithm 6: Pseudocode for the type4 digit recurrence.
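The three-stage recurrence of Algorithm 6 can be tried out with a small behavioral model. The sketch below is a simplification under explicit assumptions: it uses exact rational values instead of truncated two-digit estimates, it ignores the carry-save and BCD-4221 representations, and the names `type4_step` and `divide` are hypothetical. It only illustrates that the constant selection functions keep the residual bounded by $\rho d$ for $d \in [0.4, 0.8)$.

```python
from fractions import Fraction as F

RHO = F(8, 9)
# (weight, m_pos, m_neg) for the components z^3, z^2, z^1 (constants of Table 2).
STAGES = ((5, F(9, 10), F(-15, 10)),
          (2, F(1, 10), F(-7, 10)),
          (1, F(1, 10), F(-3, 10)))

def type4_step(w, d):
    """One iteration: returns (z_{j+1}, w[j+1]) from w[j] and divisor d."""
    v = 10 * w
    z = 0
    for weight, m_pos, m_neg in STAGES:
        zi = 1 if v >= m_pos else (-1 if v < m_neg else 0)
        v -= zi * weight * d          # (19a), (19b), (19c) in turn
        z += weight * zi
    return z, v

def divide(x, d, steps):
    """Run `steps` iterations starting from w[0] = x; requires |x| <= rho*d."""
    assert F(4, 10) <= d < F(8, 10) and abs(x) <= RHO * d
    w, q, digits = x, F(0), []
    for j in range(1, steps + 1):
        z, w = type4_step(w, d)
        digits.append(z)
        q += F(z, 10 ** j)            # quotient = sum of z_j * 10**-j
    return digits, q, w

if __name__ == "__main__":
    digits, q, w = divide(F(1, 3), F(57, 100), 16)
    print(digits)
    print(float(q), float(F(1, 3) / F(57, 100)))
```

After $n$ steps, the exact quotient and the accumulated signed-digit quotient differ by $w[n] / (d \cdot 10^{n})$, which follows directly from unrolling (19a)-(19c).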

Figure 9: Block diagram of the normalized type2, type3a, type3b, and type4 divisions.

result [20] and computes the quotient (Z), the quotient +1 (ZP1), and the quotient -1 (ZM1). This conversion is accomplished in every iteration step and does not need a slow CPA. The incremented and decremented quotients are required for rounding. Moreover, the quotient is also normalized on-the-fly by locking the conversion when $16 + 1$ quotient digits (including the rounding digit) have been computed. The algorithm is described explicitly in Section 5 because it is accomplished together with the gradual underflow handling of the floating-point divider.

if $(\overline{\mathrm{qNaN}_X}$ and $\overline{\mathrm{sNaN}_X})$ and $(\mathrm{qNaN}_D$ or $\mathrm{sNaN}_D)$ then
    $C'_X = C_D$, $C'_D = 0 \ldots 01$
    $q'_X = q_D$, $q'_D = 0 + \mathrm{bias} = 398$
else if $(\mathrm{qNaN}_X$ or $\mathrm{sNaN}_X)$ then
    $C'_X = C_X$, $C'_D = 0 \ldots 01$
    $q'_X = q_X$, $q'_D = 0 + \mathrm{bias} = 398$
else (no changes)
    $C'_X = C_X$, $C'_D = C_D$
    $q'_X = q_X$, $q'_D = q_D$

Algorithm 7: Algorithm of NaN handling.

Furthermore, the sticky bit, which is required for rounding, is calculated. It is set to one whenever the final remainder is unequal to zero. Moreover, when the final remainder is negative, the quotient of the type2, type3a, type3b, and type4 dividers has to be adjusted by subtracting one LSD. This subtraction does not need another slow CPA because the quotient -1 (ZM1) has already been computed on-the-fly. The calculations of both the sticky bit and the sign bit require the reduction of the redundant remainder by a CPA. This CPA might be subdivided into multiple smaller CPAs to keep the latency low.

5. IEEE 754-2008 Floating-Point Division

The IEEE 754-2008 compliant decimal floating-point divider is an extension of the normalized fixed-point divider. The divider presented in this paper supports the IEEE 754-2008 decimal64 interchange format with DPD encoding, but it can easily be adapted to any other precision and exponent range.

The block diagram of the divider is depicted in Figure 10. The DFP division begins with the decoding of the dividend X and the divisor D and the extraction of their signs, significands, and exponents. In the following, the significands of the operands X ($c_X$) and D ($c_D$) are regarded as integers. Therefore, the corresponding exponents are $q_X$ and $q_D$.

According to [4], if one of the operands is a signaling NaN (sNaN) or a quiet NaN (qNaN), then the result is also a quiet NaN with the payload of the original NaN. Hence, in order to preserve the payload while using an unmodified divider, the NaN handling unit sets the NaN-holding operand as the dividend and resets the divisor to $1.0 \cdot 10^{0}$, as depicted in Algorithm 7.

The fixed-point dividers presented in Section 4 require normalized operands. Therefore, the number of leading zeros of both operands is counted, and the significands are normalized by barrel shifters. For performance reasons, the leading zeros counters exploit the FPGA's fast carry chains, as proposed in [23].

The exponent of a floating-point division is first estimated as the exponent of the normalized quotient in fractional representation (QNF) and is updated iteratively in each digit recurrence step. Since most of the decimal floating-point numbers specified by IEEE 754-2008 have multiple representations of the same value, the result of a division might also have more than one correct representation, differing in the exponent. However, to obtain a unique and reproducible result in the case of such an ambiguity, IEEE 754-2008 defines the exponent of the result, which is called the preferred exponent (QP).

Both exponents, QNF and QP, are positive integers and are determined as follows:

$$\Delta q = (q_X - q_D), \qquad \Delta\mathrm{LZ} = (\mathrm{LZ}_X - \mathrm{LZ}_D), \tag{26}$$
$$\mathrm{QNF} = \Delta q - \Delta\mathrm{LZ} + \mathrm{offset1}, \qquad \mathrm{QP} = \Delta q + \mathrm{offset1}. \tag{27}$$

The offset1 is a bias that ensures that these exponents are positive, since unsigned integer calculation is used in this paper. The bias is composed of

$$\begin{aligned} \mathrm{offset1} = 783 &= 398 && \text{(IEEE 754-2008 bias)} \\ &+ 369 && (q_{\max}) \\ &+ 15 && \text{(conversion fractional to integer)} \\ &+ 1 && \text{(normalization of the result)}. \end{aligned} \tag{28}$$
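Equations (26)-(28) amount to a few integer additions, as the following sketch shows. The function name `exponents` is hypothetical, and the usage example assumes unbiased decimal64 exponents in the range $[-398, 369]$ and leading-zero counts between 0 and 15, which is an interpretation made here rather than something stated explicitly in this section.

```python
# Components of offset1 as listed in (28).
IEEE_BIAS = 398
Q_MAX = 369
FRAC_TO_INT = 15
NORMALIZATION = 1
OFFSET1 = IEEE_BIAS + Q_MAX + FRAC_TO_INT + NORMALIZATION

def exponents(q_x, q_d, lz_x, lz_d):
    """Return (QNF, QP) according to (26) and (27)."""
    delta_q = q_x - q_d
    delta_lz = lz_x - lz_d
    qnf = delta_q - delta_lz + OFFSET1
    qp = delta_q + OFFSET1
    return qnf, qp

if __name__ == "__main__":
    print(OFFSET1)
    # Most negative case under the assumed ranges: still a positive result.
    print(exponents(-398, 369, 15, 0))
```

Under the assumed operand ranges, the smallest possible value is $-767 - 15 + 783 = 1$, which is consistent with the statement that both exponents remain positive.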

The normalized fixed-point division unit then computes $p = 16$ significand digits and one additional rounding digit. Additionally, the divider encompasses the on-the-fly conversion unit, which also detects and handles gradual underflow with zero delay overhead. One signed quotient digit $z_j$ is retired in each cycle, and the on-the-fly conversion algorithm converts this signed-digit representation into the BCD-8421-coded partial result [20]. The conversion is accomplished in every iteration step and does not need a slow CPA (see Algorithm 8). The partial quotient is stored in a register $Z'$. Moreover, two additional registers, $ZM1'$ and $ZP1'$, are provided: $ZM1'$ is the partial quotient decremented by one least significant digit (LSD), and $ZP1'$ is the partial quotient incremented by one LSD. $ZM1'$ is selected when the final residual in the last iteration is negative, and $ZP1'$ is required by the subsequent rounding operation. Furthermore, the exponent for the normalized significand in integer representation (QNI) is calculated (see (29)), and gradual underflow handling is performed at no extra cost (see Algorithm 8). The exponent QNI is computed by decrementing the exponent of the normalized significand in fractional representation (QNF) by one in each iteration step. The recurrence iteration terminates earlier in case of gradual underflow. The gradual underflow signal (GUF) is asserted high when $\mathrm{QNI} = q_{\min}$ and the calculated integer quotient is less than $10^{p}$:

$$\mathrm{QNI} = \begin{cases} q_{\min} & \text{if } \mathrm{GUF} = 1, \\ \mathrm{QNF} - (p - 1) & \text{if } z_1 > 0, \\ \mathrm{QNF} - p & \text{if } z_1 = 0. \end{cases} \tag{29}$$

The extended quotient (composed of the calculated quotient and the rounding digit) must be decremented by one if the

Figure 10: Block diagram of the floating-point divider.

final remainder is less than zero. The sticky bit (sb) is used for rounding and is asserted high whenever the remainder is unequal to zero.

Moreover, the number of trailing zeros (TZ) is computed on-the-fly. This number indicates by how many digits the computed quotient might be shifted to the right without losing accuracy. The number of trailing zeros has to be counted only for $Z'$ because, if $ZM1'$ is selected, the residual is unequal to zero and $\mathrm{TZ} = 0$ anyway. The number of trailing zeros is used for the selection between the preferred exponent (QP) and the normalized exponent for integer representation (QNI), as described in the following paragraph.

The exponent calculation unit selects either QP or QNI. The preferred exponent should be selected whenever possible. This is only feasible when $\mathrm{QNI} < \mathrm{QP} \le \mathrm{QNI} + \mathrm{TZ}$, that is, when there is a sufficient number of trailing zeros to shift the significand to the right. Furthermore, the final exponent must lie within the minimum and maximum exponent range. The exponent calculation is illustrated in Algorithm 9, where the value

$$\mathrm{offset2} = 783 - \mathrm{bias} = 783 - 398 = 385 \tag{30}$$

is used to add the IEEE 754-2008 bias and to remove offset1 again (which was introduced in (27)).

QNI' = QNF + 1
Z' = ZM1' = ZP1' = 0
TZ' = 0, j = 0
while (Z' < 10^p) and (QNI' >= q_min)
    Z'   = 10 * Z'   + z_{j+1}            if z_{j+1} >= 0
           10 * ZM1' + (z_{j+1} + 10)     if z_{j+1} < 0
    ZM1' = 10 * Z'   + (z_{j+1} - 1)      if z_{j+1} > 0
           10 * ZM1' + (z_{j+1} + 9)      if z_{j+1} <= 0
    ZP1' = 10 * Z'   + (z_{j+1} + 1)      if z_{j+1} >= -1
           10 * ZM1' + (z_{j+1} + 11)     if z_{j+1} < -1
    TZ'  = TZ' + 1                        if z_{j+1} = 0
           0                              else
    QNI' = QNI' - 1
    j = j + 1
QNI = QNI' + 1    (remove the impact of the rounding digit)
R' = z_{j+1} - 1                          if remainder < 0
     z_{j+1}                              else
ZM1'' = floor(ZM1' * 10^-1)
Z''   = floor(Z' * 10^-1)
ZP1'' = floor(ZP1' * 10^-1)
(Z, ZP1, R) = (ZM1'', Z'', R' + 10)       if R' < 0
              (Z'', ZP1'', R')            else
sb = 0                                    if remainder = 0
     1                                    else
GUF = 1                                   if QNI = q_min and Z' < 10^p and (sb = 1 or R ≠ 0)
      0                                   else
if GUF = 1 then underflow
TZ = 0                                    if sb = 1
     TZ'                                  else

Algorithm 8: On-the-fly conversion with gradual underflow handling.
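The register updates in the loop of Algorithm 8 can be modeled with digit lists to make the absence of carry propagation visible. The sketch below covers only the conversion of signed digits into $Z'$, $ZM1'$, and $ZP1'$; exponent tracking, trailing-zero counting, and gradual underflow are left out, and the names `on_the_fly` and `to_int` are assumptions made here. It also assumes a digit sequence whose partial value stays positive, as is the case for a quotient of positive operands.

```python
def on_the_fly(signed_digits):
    """Convert signed digits (-8..8) into digit lists for Z, Z-1 and Z+1.

    Every update copies one of the old registers and appends a single
    digit in 0..9, so no carry ever propagates through a register.
    """
    z, zm1, zp1 = [], [], []
    for d in signed_digits:
        new_z = z + [d] if d >= 0 else zm1 + [d + 10]
        new_zm1 = z + [d - 1] if d > 0 else zm1 + [d + 9]
        new_zp1 = z + [d + 1] if d >= -1 else zm1 + [d + 11]
        z, zm1, zp1 = new_z, new_zm1, new_zp1   # simultaneous register update
    return z, zm1, zp1

def to_int(register):
    """Interpret a list of decimal digits as an integer."""
    return int("".join(map(str, register)))

if __name__ == "__main__":
    digits = [3, -4, 0, 8, -8, 5, -1]
    z, zm1, zp1 = on_the_fly(digits)
    print(to_int(zm1), to_int(z), to_int(zp1))
```

Keeping the decremented and incremented values alongside the quotient is what allows the final correction for a negative remainder and the rounding increment to be resolved by a simple selection.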

Rounding is required when the number of significant digits of the quotient exceeds the length $p$ of a decimal word. The divider presented in this paper provides the five rounding modes requested by [4]. Rounding selects either the integer quotient $Z_e$ or the incremented quotient $ZP1_e = Z_e + 1$. Rounding overflow would occur if $Z_e + 1$ overflowed due to rounding; as proven by Theorem A.1 in the appendix, this cannot occur in DFP division. The selection of the quotient depends on the rounding mode, the rounding digit (R), the sticky bit (sb), and the least significant digit (LSD) of the quotient. The calculation of the round-up detection is summarized in Table 3.
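The round-up detection can be expressed as a small Boolean function. The sketch below follows the usual semantics of the five IEEE 754-2008 rounding modes applied to a magnitude; the function name, the mode strings, and the convention that `sign = 1` denotes a negative result are assumptions made for this illustration.

```python
def round_up(mode, r, sb, lsd_odd, sign):
    """Return True if the incremented quotient must be selected.

    r:       rounding digit (0..9)
    sb:      sticky bit (1 if the remainder is nonzero)
    lsd_odd: lowest bit of the quotient's least significant digit
    sign:    1 for a negative result, 0 for a positive one (assumed convention)
    """
    if mode == "ties_to_even":
        return bool(r > 5 or (r == 5 and sb) or (r == 5 and not sb and lsd_odd))
    if mode == "ties_to_away":
        return r >= 5
    if mode == "toward_positive":
        return bool(not sign and (r > 0 or sb))
    if mode == "toward_negative":
        return bool(sign and (r > 0 or sb))
    if mode == "toward_zero":
        return False
    raise ValueError(f"unknown rounding mode: {mode}")

if __name__ == "__main__":
    # Exact tie with an odd LSD rounds up to the even neighbor.
    print(round_up("ties_to_even", 5, 0, 1, 0))
```

Because the significand is handled as a magnitude, rounding toward positive infinity increments only positive results, and rounding toward negative infinity increments only negative ones.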

Finally, the result is encoded again, and the exception unit might assert six exception signals. These are division by zero, invalid operation, result is infinite, inexact, overflow, and underflow. Division by zero is asserted when the divisor equals zero but the dividend is unequal to zero. The operation is called invalid when both operands are zero, both operands are infinity, or any of the operands is either a signaling or a quiet NaN. The inexact flag is asserted when the

Table 3: Round-up detection.

Rounding mode             Round-up detection
Round Ties To Even        $(R > 5) + (R \equiv 5) \cdot \mathrm{sb} + (R \equiv 5) \cdot \overline{\mathrm{sb}} \cdot \mathrm{LSD}(0)$
Round Ties To Away        $(R \ge 5)$
Round Toward Positive     $\overline{\mathrm{sign}} \cdot ((R > 0) + \mathrm{sb})$
Round Toward Negative     $\mathrm{sign} \cdot ((R > 0) + \mathrm{sb})$
Round Toward Zero         $0$

Legend: "+" logical OR, "$\cdot$" logical AND.

offset2 = 783 minus 398 = 385if (QP le QNI) or (QNI + TZ lt QP)

or (QP gt 119902max) then(a) QNI gt 119902max rArr overflow(b) 119902min le QNI le 119902max rArr

119902119885= QNI minus offset2

Z119890= Z

ZP11015840e = ZP1(c) QNI lt 119902min rArr underflow

else(a) 119902min le QP le 119902max rArr

119902119885= QP minus offset2

Z119890= Z≫ (QP minus QNI)

ZP1119890= ZP1≫ (QP minus QNI)

(b) QNI lt 119902min rArr underflow

Algorithm 9 Exponent calculation

rounding digit or the sticky bit is unequal to zero Overflowand underflow are computed as described in the exponentcalculation unit and are listed in Algorithms 8 and 9 Theinfinity exception is asserted when the result is greater thanthe largest number that is when either overflow or divisionby zero exception is asserted or when the dividend is infinitywhile the divisor is a finite number
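The round-up detection of Table 3 can be sketched in a few lines of Python. This is only an illustrative software model, not the authors' VHDL: the mode names and the function `round_up` are hypothetical, and it assumes that the tie-break term of Round Ties To Even applies only when the sticky bit is zero and that Round Toward Positive applies only to non-negative results (the negation bars appear to have been lost in the extracted table).

```python
def round_up(mode: str, r: int, sb: int, lsd: int, sign: int) -> bool:
    """Return True if the quotient magnitude must be incremented.

    r    : rounding digit (0..9)
    sb   : sticky bit (1 if any digit beyond the rounding digit is nonzero)
    lsd  : least significant digit of the truncated quotient
    sign : 0 for a positive result, 1 for a negative result
    """
    if mode == "ties_to_even":
        # Above the midpoint, or exactly at the midpoint with an odd LSD.
        return r > 5 or (r == 5 and sb == 1) or (r == 5 and sb == 0 and lsd % 2 == 1)
    if mode == "ties_to_away":
        return r >= 5
    if mode == "toward_positive":
        # Assumption: only positive, inexact results are incremented.
        return sign == 0 and (r > 0 or sb == 1)
    if mode == "toward_negative":
        return sign == 1 and (r > 0 or sb == 1)
    if mode == "toward_zero":
        return False
    raise ValueError(f"unknown rounding mode: {mode}")
```

The result would then select between the quotient $Z_e$ and the incremented quotient $ZP1_e$.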

6. Implementation Results

All dividers are modeled using VHDL and are implemented for Xilinx Virtex-5 devices with speed grade −2 using Xilinx ISE 10.1. The post-place-and-route results of the fixed-point dividers are listed in Table 4.

All five types of fixed-point dividers require 19 cycles to perform a division. This includes 17 cycles to determine the quotient digits and the rounding digit, one additional cycle to normalize the result (when the first quotient digit is zero), and one cycle of latency to perform the on-the-fly conversion and the sticky bit determination. As expected, the type1 divider uses the most resources in terms of look-up tables (LUTs) and of LUTs and flip-flops (FFs) combined. The type2, type3a, and type3b dividers consume fewer LUTs but require an additional five (type2) or two (type3a and type3b) BRAMs. However, comparing the delays of the dividers leads to an unexpected result. Contrary to decimal divider implementations in ASIC designs, on FPGA platforms the shift-and-subtract algorithm is the fastest. The reason is that signal propagation on the FPGA's internal fast carry chains is faster than interconnections between slices over the FPGA's general routing matrix. The type1 divider exploits these fast carry chains, whereas the type2, type3a, type3b, and type4 dividers only use the normal, slow slice interconnection resources. In ASIC implementations, for instance, the longest path of the type4 divider would be much shorter than the longest path of the type1 divider, but on FPGA architectures the situation is the opposite.

One of the fastest decimal fixed-point dividers on ASICs is the design of Lang and Nannarelli [14]. They minimize the critical path in the digit recurrence by using a fast quotient digit selection function based on binary carry-propagate adders that subtract fixed values from the estimation of the current residual. These fixed values depend on estimations of the divisor and are precomputed and stored in a ROM. Therefore, the latency of the ROM does not contribute to the critical path of the digit recurrence. An implementation of such a divider on FPGAs would perform poorly because the required carry-save adder with small error estimation cannot be implemented efficiently on the FPGA's slice structure. However, the concept of removing the ROM from the critical path is applied to the type3a and type3b dividers, and the corresponding post-place-and-route results are shown in Table 4. These results show that the type3a divider utilizes a similar number of LUTs as the type2 divider, but its cycle time is much higher. The increased cycle time can be explained by the raised complexity of the multiplexers for the selection of the divisor's multiples in the digit recurrence step. These multiplexers show a poor performance in terms of propagation delay because they are implemented as a tree with slow slice interconnections. In contrast, the type3b divider applies dedicated fast multiplexers, which have a propagation delay of only one LUT instance and eight fast carry chains. This dedicated fast multiplexer speeds up the divider and reduces the maximum cycle time by one nanosecond. Unfortunately, the number of used LUTs increases dramatically, by approximately 1000 LUTs.

The comparison of the type2, type3a, type3b, and type4 dividers in terms of speed and area (number of occupied LUTs and FFs combined) shows that the type2 algorithm is the fastest and, together with type4, requires the smallest area. Hence, there is no benefit in removing the ROM's latency from the critical path because the complexity is moved to the selection of the divisor's multiples. The type1 divider is only 5% faster than the type2 divider, but this speed is bought at a high price because the number of occupied slices is 75% higher. Therefore, the type2 divider shows a good tradeoff in terms of area and latency and is used for the floating-point divider presented in this paper. In the following paragraph, we compare the type2 implementation with other published fixed-point dividers.
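The derived figures quoted above (latency, maximum frequency, the 5% speed advantage, and the 75% area overhead of type1) follow directly from the raw entries of Table 4. The following Python sketch reproduces that arithmetic; it assumes the cycle times read as 8.1 ns to 12.1 ns (decimal points restored from the table), and the dictionary and function names are purely illustrative.

```python
# Raw post-place-and-route figures as read from Table 4.
CYCLE_TIME_NS = {"type1": 8.1, "type2": 8.5, "type3a": 10.1, "type3b": 9.1, "type4": 12.1}
LUTS_FFS_COMBINED = {"type1": 3868, "type2": 2210, "type3a": 2304, "type3b": 3240, "type4": 2203}
CYCLES_PER_DIVISION = 19  # 17 digit cycles + normalization + conversion


def latency_ns(divider: str) -> float:
    """Overall latency of one fixed-point division."""
    return CYCLES_PER_DIVISION * CYCLE_TIME_NS[divider]


def max_frequency_mhz(divider: str) -> float:
    """Maximum clock frequency implied by the cycle time."""
    return 1000.0 / CYCLE_TIME_NS[divider]


def area_relative_to_type2(divider: str) -> float:
    """Combined LUT/FF count normalized to the type2 divider."""
    return LUTS_FFS_COMBINED[divider] / LUTS_FFS_COMBINED["type2"]


def speed_relative_to_type2(divider: str) -> float:
    """Clock frequency normalized to the type2 divider."""
    return CYCLE_TIME_NS["type2"] / CYCLE_TIME_NS[divider]
```

For type1 this gives a relative speed of about 1.05 and a relative area of about 1.75, matching the percentages in the text.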

Table 4: Results for normalized decimal fixed-point dividers, $p = 16 + 1$.

| | Type1 | Type2 | Type3a | Type3b | Type4 |
|---|---|---|---|---|---|
| Number of LUTs | 3595 | 1704 | 1749 | 2751 | 1846 |
| Number of FFs | 1234 | 1126 | 1262 | 1262 | 1031 |
| Number of LUTs and FFs combined | 3868 | 2210 | 2304 | 3240 | 2203 |
| Number of 36k BRAMs | 0 | 5 | 2 | 2 | 0 |
| Number of LUTs, normalized to the type2 divider | 2.11 | 1.00 | 1.03 | 1.61 | 1.08 |
| Number of FFs, normalized to the type2 divider | 1.10 | 1.00 | 1.12 | 1.12 | 0.92 |
| Number of LUTs and FFs combined, normalized to the type2 divider | 1.75 | 1.00 | 1.04 | 1.47 | 1.00 |
| Cycle time (ns) | 8.1 | 8.5 | 10.1 | 9.1 | 12.1 |
| Overall latency (ns) (19 × cycle time) | 154 | 162 | 192 | 173 | 230 |
| Max. frequency (MHz) | 123 | 118 | 99 | 110 | 83 |
| Max. frequency, normalized to the type2 divider | 1.05 | 1.00 | 0.84 | 0.93 | 0.70 |

Four other groups of FPGA-based dividers are presented by Ercegovac and McIlhenny [17, 18], Deschamps and Sutter [24], Zhang et al. [25], and Vestias and Neto [12]. These implementations are compared to our type2 divider in the following. The divider presented in [17, 18] is based on a digit recurrence algorithm that only requires limited-precision multipliers, adders, and LUTs. Furthermore, a compensation term is computed in the digit recurrence that compensates the error caused by this limited precision. The design is optimized for Virtex-5 FPGAs and has a good area characteristic (a 14-digit divider requires 1263 LUTs) but suffers from a high cycle time of 13.1 ns, which is more than 50% higher than that of the type1 or type2 design proposed in this paper. The reason for this high cycle time is (similar to the type4 divider) the long critical paths across many LUTs that cannot exploit the FPGA's fast carry chains. Therefore, the design would probably fit better on ASICs with dedicated routing. The dividers implemented in [24] apply two different digit recurrence algorithms with a redundancy of $\rho = 1$. The quotient digit selection function is very complex because it requires three MSDs of the divisor and four MSDs of the residual. The cycle times are high, but it must be taken into account that these designs are optimized for Virtex-4 FPGAs. Nevertheless, the number of used LUTs is very high. The divider presented in [25] applies a radix-100 digit recurrence algorithm with comprehensive prescaling of the dividend and divisor. It is optimized for Virtex-II Pro devices; hence, it is hard to compare it with the dividers presented in this paper. However, the critical path includes a decimal carry-save adder tree, two carry-propagate adder stages, and three multiplexer stages. It can therefore be assumed that this algorithm would also perform poorly on Virtex-5 devices, although two quotient digits are retired in each iteration step. Vestias and Neto [12] propose four different decimal dividers optimized for different speed and area tradeoffs. Contrary to the circuits presented in this paper, these dividers apply the Newton-Raphson algorithm. The multiplier used is sequential; therefore, many cycles per division are required. However, each divider has a higher delay and a greater LUT usage compared to our type2 implementation.

Table 5: Performance comparison of fixed-point dividers.

| Divider | Occupied area | Cycle time (ns) | Latency (ns) |
|---|---|---|---|
| Ercegovac and McIlhenny [17] ($p = 14$, Virtex-5) | 1263 (6:2) LUTs | 13.1 | 197 |
| Type2 divider equalized to $p = 14$ | 1692 (6:2) LUTs | 8.5 | 136 |
| Deschamps and Sutter [24] ($p = 16$, Virtex-4): (i) nonrestoring algorithm | 2974 (4:1) LUTs | 21.4 | 386 |
| Deschamps and Sutter [24] ($p = 16$, Virtex-4): (ii) SRT-like algorithm | 3799 (4:1) LUTs | 16.6 | 300 |
| Zhang et al. [25] ($p = 16$, Virtex-II Pro) | 3976 slices | 20.0 | 420 |
| Vestias and Neto [12] ($p = 16$, Virtex-4, Newton-Raphson): (i) Type A3 (0 DSPs, 118 cycles) | 2756 (4:1) LUTs | 3.4 | 394 |
| Vestias and Neto [12]: (ii) Type A4 (0 DSPs, 90 cycles) | 3768 (4:1) LUTs | 3.4 | 306 |
| Vestias and Neto [12]: (iii) Type A5 (7 DSPs, 112 cycles) | 2091 (4:1) LUTs | 3.4 | 380 |
| Vestias and Neto [12]: (iv) Type A6 (10 DSPs, 86 cycles) | 2718 (4:1) LUTs | 3.4 | 292 |
| Type2 divider equalized to $p = 16$ | 1704 (6:2) LUTs | 8.5 | 153 |

Table 5 lists the results of the dividers presented in [17, 24, 25] together with the results of the type2 divider. For a fair performance comparison, the precision of the type2 divider is equalized to match the number of calculated quotient digits of the corresponding divider. Furthermore, it should be noted that the dividers presented in [12, 17, 24, 25] do not provide rounding support.

For the floating-point implementation, the fixed-point divider of type2 is used because type1 and type3b consume too many resources and type3a and type4 are too slow. Five additional BRAMs are required, but these are available in sufficient quantities on Virtex-5 devices. The corresponding post-place-and-route results for two configurations with different numbers of serialization stages are listed in Table 6. A comparison with other implementations is complicated because no other FPGA-based implementations of decimal floating-point dividers have been published yet. Therefore, we can only compare the decimal divider with binary dividers implemented on the same FPGA. Table 7 lists post-place-and-route results of binary floating-point dividers for the double data format on a Virtex-5, provided by the Xilinx Core Generator [26]. These dividers also provide the exception signals overflow, underflow, invalid operation, and divide-by-zero. Unfortunately, the binary floating-point divider does not support gradual underflow because this feature increases the complexity. The divider with one cycle per division is an unpipelined and fully parallel design, and the divider that requires 55 cycles per division is a fully sequential design. The binary floating-point divider with 20 cycles per division is best suited for a comparison with the decimal floating-point divider because it requires a comparable number of cycles for one operation. Decimal arithmetic clearly has a large overhead regarding the total number of used LUTs, but the cycle time and latency are similar. This resource overhead can be explained, on the one hand, by the inefficient representation of decimal values in binary logic and, on the other hand, by the more complex specification of the decimal standard IEEE 754-2008. For instance, decimal floating-point division requires additional encoders, decoders, and a normalization stage.

Table 6: Post-place-and-route results of the floating-point divider based on type2.

| | Config. 1 | Config. 2 |
|---|---|---|
| Number of cycles per float division | 23 | 21 |
| Number of LUTs | 3231 | 3205 |
| Number of FFs | 1630 | 1377 |
| Number of LUTs and FFs combined | 3571 | 3566 |
| Cycle time (ns) | 8.5 | 9.1 |
| Overall latency (ns) | 195.5 | 191 |
| Max. frequency (MHz) | 118 | 110 |

Table 7: Post-place-and-route results of a 64-bit binary floating-point divider (Core Gen).

| Cycles per division | 1 | 10 | 20 | 55 |
|---|---|---|---|---|
| No. of LUTs | 3161 | 592 | 425 | 394 |
| No. of FFs | 196 | 518 | 469 | 542 |
| No. of LUTs and FFs combined | 3232 | 917 | 675 | 675 |
| Cycle time (ns) | 153.7 | 20.8 | 8.9 | 6.8 |
| Overall latency (ns) | 153.7 | 208 | 178 | 374 |
| Max. frequency (MHz) | 6.5 | 48 | 112 | 147 |

7. Conclusion

In this paper we have presented five different radix-10 digit recurrence division algorithms. The type1 divider implements the simple shift-and-subtract algorithm. The type2 divider is based on a nonrestoring algorithm with a ROM-based quotient digit selection function. The type3a and type3b dividers are both similar to type2 but use fast binary carry-propagate adders in the quotient digit selection function. However, the type3a divider uses LUT-based multiplexers, whereas the type3b divider uses fast carry-chain-based multiplexers. The type4 divider applies a new algorithm with constant digit selection functions. This type4 divider requires neither a ROM nor multiplexers to select multiples of the divisor.

We have shown that the shift-and-subtract algorithm, which has the worst latency in ASIC architectures, is the fastest design on FPGAs but uses the most FPGA resources in terms of LUTs. A good tradeoff between latency and area is the type2 architecture with a redundant carry-save representation of the residual, a radix-2 representation of the two MSDs, and a quotient digit selection function implemented in ROM. Furthermore, we have extended this fixed-point divider to a fully IEEE 754-2008 compliant decimal floating-point divider for the decimal64 data format, and finally we have shown implementation results of all dividers.

Appendix

Proofs

Theorem A.1 (rounding overflow). Let us consider a decimal floating-point division $Q = X / D$ with the signs $s_i$, the significands $c_i$, the exponents $q_i$, and the precision $p$:

$$(-1)^{s_Q} \cdot c_Q \cdot 10^{q_Q} = \frac{(-1)^{s_X} \cdot c_X \cdot 10^{q_X}}{(-1)^{s_D} \cdot c_D \cdot 10^{q_D}}. \tag{A.1}$$

Then rounding overflow, that is, an overflow that arises due to adding $+1$ to the least significant digit of the final result, cannot occur.

Proof. Assume to the contrary that there are a floating-point dividend $X$, a floating-point divisor $D$, and a proper rounding mode for which the decimal floating-point division produces a rounding overflow. Without loss of generality, we consider in the following only positive operands ($s_X = s_D = 0$), normalized significands ($c_X, c_D \ge 10^{p-1}$), and exponents equal to zero ($q_X = q_D = 0$). In the case of rounding overflow, the $p$ digits of the quotient must be all nines and the remainder must be unequal to zero, that is,

$$\underbrace{c_X}_{p\ \text{digits}} = c_Q \cdot c_D + \text{rem} = \underbrace{99 \cdots 9}_{p\ \text{digits}} \cdot \underbrace{c_D}_{p\ \text{digits}} + \underbrace{\text{rem}}_{p\ \text{digits}} \tag{A.2}$$

with

$$0 < \text{rem} < c_D \cdot 10^{-p+1}. \tag{A.3}$$

Condition (A.2) is fulfilled for

$$c_D = 10^{p-1}, \qquad c_X = \underbrace{99 \cdots 9}_{p\ \text{digits}}, \tag{A.4}$$

but this violates condition (A.3) because the remainder is zero ($\text{rem} = 0$); that is, the result is exact, no round-up is applied, and hence no rounding overflow occurs.

Otherwise, if the divisor is greater than $10^{p-1}$, it follows that

$$\begin{aligned} \underbrace{c_X}_{p\ \text{digits}} &= (99 \cdots 9) \cdot c_D + \text{rem} \\ &= (10 - 10^{-p+1}) \cdot c_D + \text{rem} \\ &= 10 \cdot c_D + \underbrace{(-10^{-p+1} \cdot c_D + \text{rem})}_{< 0} \\ &= \underbrace{10 \cdot c_D + \mu}_{\ge p+1\ \text{digits}} \end{aligned} \tag{A.5}$$

with $-10 < \mu < 0$. The left term of the equation has, by definition, a precision of $p$ digits, while the right term of the equation has a precision of more than $p$ digits. This contradiction finally proves that no rounding overflow can occur.
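For small precisions, the statement of Theorem A.1 can also be checked exhaustively. The following Python sketch is an independent sanity check rather than part of the proof; it assumes normalized $p$-digit integer significands and treats a rounding overflow as an all-nines $p$-digit quotient with a nonzero remainder. The function name is hypothetical.

```python
def has_rounding_overflow(p: int) -> bool:
    """Search all normalized p-digit significand pairs for a rounding overflow."""
    lo, hi = 10 ** (p - 1), 10 ** p
    for c_d in range(lo, hi):
        for c_x in range(lo, hi):
            # Scale the dividend so that the integer quotient has exactly p digits.
            scale = 10 ** (p - 1) if c_x >= c_d else 10 ** p
            quotient, remainder = divmod(c_x * scale, c_d)
            # A round-up can only overflow if the quotient is all nines
            # and the result is inexact.
            if quotient == hi - 1 and remainder != 0:
                return True
    return False
```

Running the search for $p = 1, 2, 3$ finds no counterexample, which is consistent with the theorem.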

References

[1] M. Baesler, S. Voigt, and T. Teufel, "FPGA implementations of radix-10 digit recurrence fixed-point and floating-point dividers," in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig '11), pp. 13–19, IEEE Computer Society, Los Alamitos, CA, USA, December 2011.

[2] IEEE Task P754, ANSI/IEEE 754-1985, Standard for Binary Floating-Point Arithmetic, New York, NY, USA, August 1985.

[3] M. F. Cowlishaw, "Decimal floating-point: algorism for computers," in Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH '03), pp. 104–111, IEEE Computer Society, Washington, DC, USA, June 2003.

[4] IEEE Task P754, IEEE 754-2008, Standard for Floating-Point Arithmetic, New York, NY, USA, August 2008.

[5] ANSI/IEEE, ANSI/IEEE Std 854-1987: An American National Standard, IEEE Standard for Radix-Independent Floating-Point Arithmetic, New York, NY, USA, October 1987.

[6] A. Y. Duale, M. H. Decker, H. G. Zipperer, M. Aharoni, and T. J. Bohizic, "Decimal floating-point in z9: an implementation and testing perspective," IBM Journal of Research and Development, vol. 51, no. 1-2, pp. 217–227, 2007.

[7] C. F. Webb, "IBM z10: the next-generation mainframe microprocessor," IEEE Micro, vol. 28, no. 2, pp. 19–29, 2008.

[8] L. Eisen, J. W. Ward, H. W. Tast, et al., "IBM POWER6 accelerators: VMX and DFU," IBM Journal of Research and Development, vol. 51, no. 6, pp. 663–683, 2007.

[9] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd, "Power7: IBM's next-generation server processor," IEEE Micro, vol. 30, no. 2, pp. 7–15, 2010.

[10] S. F. Oberman and M. J. Flynn, "Division algorithms and implementations," IEEE Transactions on Computers, vol. 46, no. 8, pp. 833–854, 1997.

[11] L. K. Wang and M. J. Schulte, "A decimal floating-point divider using Newton-Raphson iteration," Journal of VLSI Signal Processing Systems, vol. 49, no. 1, pp. 3–18, 2007.

[12] M. Vestias and H. Neto, "Revisiting the Newton-Raphson iterative method for decimal division," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '11), pp. 138–143, IEEE Computer Society Press, September 2011.

[13] H. Nikmehr, B. Phillips, and C. C. Lim, "Fast decimal floating-point division," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 9, pp. 951–961, 2006.

[14] T. Lang and A. Nannarelli, "A radix-10 digit-recurrence division unit: algorithm and architecture," IEEE Transactions on Computers, vol. 56, no. 6, pp. 727–739, 2007.

[15] A. Vazquez, E. Antelo, and P. Montuschi, "A radix-10 SRT divider based on alternative BCD codings," in Proceedings of the 25th IEEE International Conference on Computer Design (ICCD '07), pp. 280–287, IEEE Computer Society Press, Los Alamitos, CA, USA, October 2007.

[16] E. Schwarz and S. Carlough, "Power6 decimal divide," in Proceedings of the 18th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '07), pp. 128–133, IEEE Computer Society, July 2007.

[17] M. D. Ercegovac and R. McIlhenny, "Design and FPGA implementation of radix-10 algorithm for division with limited precision primitives," in Proceedings of the 42nd Asilomar Conference on Signals, Systems and Computers (ASILOMAR '08), pp. 762–766, IEEE Computer Society, Pacific Grove, Calif, USA, October 2008.

[18] M. D. Ercegovac and R. McIlhenny, "Design and FPGA implementation of radix-10 combined division/square root algorithm with limited precision primitives," in Proceedings of the 44th Asilomar Conference on Signals, Systems and Computers (Asilomar '10), pp. 87–91, IEEE Computer Society, Pacific Grove, Calif, USA, November 2010.

[19] F. Y. Busaba, C. A. Krygowski, W. H. Li, E. M. Schwarz, and S. R. Carlough, "The IBM z900 decimal arithmetic unit," in Proceedings of the 35th Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1335–1339, IEEE Computer Society, November 2001.

[20] M. Ercegovac and T. Lang, Division and Square Root: Digit-Recurrence Algorithms and Implementations, Kluwer Academic Publishers, Norwell, Mass, USA, 1994.

[21] M. Baesler, S. O. Voigt, and T. Teufel, "A decimal floating-point accurate scalar product unit with a parallel fixed-point multiplier on a Virtex-5 FPGA," International Journal of Reconfigurable Computing, vol. 2010, Article ID 357839, 13 pages, 2010.

[22] M. Baesler, S. O. Voigt, and T. Teufel, "A radix-10 digit recurrence division unit with a constant digit selection function," in Proceedings of the 28th IEEE International Conference on Computer Design (ICCD '10), pp. 241–246, IEEE Computer Society, Los Alamitos, CA, USA, October 2010.

[23] M. Baesler, S. O. Voigt, and T. Teufel, "An IEEE 754-2008 decimal parallel and pipelined FPGA floating-point multiplier," in Proceedings of the 20th International Conference on Field Programmable Logic and Applications (FPL '10), pp. 489–495, IEEE Computer Society, Washington, DC, USA, September 2010.

[24] J. P. Deschamps and G. Sutter, "Decimal division: algorithms and FPGA implementations," in Proceedings of the 6th Southern Programmable Logic Conference (SPL '10), pp. 67–72, IEEE Computer Society, March 2010.

[25] Y. Zhang, D. Chen, L. Chen, et al., "Design and implementation of a radix-100 decimal division," in Proceedings of IEEE Symposium on Circuit and System (ISCAS '09), Taipei, Taiwan, May 2009.

[26] Xilinx, Xilinx LogiCORE Floating-Point Operator v4.0, April 2008.


Figure 4: Block diagram of type3 digit recurrence (QDS function with a 256 × 128 ROM and binary CPAs, multiplexers for the divisor multiples, and a BCD-4221 CSA for the residual).

(1) Compute $L_k(d_i)$ and $U_k(d_i)$ for each $d_i \in \{0.40, 0.41, \ldots, 0.99\}$ and $k \in \{-8, \ldots, +8\}$ according to (12a) and (12b).

(2) Compute the selection constants $m_k(d_i)$:

$$\begin{aligned} m_8(d_i) &= U_7(d_i) - \epsilon = U_7(d_i) - 0.2 \\ m_7(d_i) &= U_6(d_i) - 0.2 \\ &\ \ \vdots \\ m_1(d_i) &= U_0(d_i) - 0.2 \\ m_0(d_i) &= L_0(d_i) \\ &\ \ \vdots \\ m_{-7}(d_i) &= L_{-7}(d_i) \end{aligned}$$

Algorithm 4: Algorithm for the calculation of the type3 selection constants.

(1) if $m_8(d_i) - 10\hat{w}[j] < 0$ then $z_{j+1} = 8$
    else if $m_7(d_i) - 10\hat{w}[j] < 0$ then $z_{j+1} = 7$
    ⋮
    else if $m_{-7}(d_i) - 10\hat{w}[j] < 0$ then $z_{j+1} = -7$
    else $z_{j+1} = -8$

(2) Compute $(w_s[j+1] + w_c[j+1]) = 10\,(w_s[j] + w_c[j]) - z_{j+1} \cdot d$

Algorithm 5: Pseudocode for the type3 digit recurrence.
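The comparator chain of step (1) and the residual update of step (2) in Algorithm 5 can be modeled in software as follows. This is a minimal sketch under several assumptions that are not taken from the paper: it uses exact rational arithmetic instead of a truncated two-digit estimate and a carry-save residual, and it uses hypothetical thresholds $m_k = (k - 1/2) \cdot d$ in place of the ROM contents produced by Algorithm 4. These thresholds lie inside the overlap regions for $\rho = 8/9$, so the residual stays bounded.

```python
from fractions import Fraction

RHO = Fraction(8, 9)  # redundancy factor for the digit set {-8, ..., 8}


def selection_constants(d):
    """Hypothetical thresholds m_k = (k - 1/2) * d for k = -7..8 (not the paper's ROM)."""
    return {k: (k - Fraction(1, 2)) * d for k in range(-7, 9)}


def select_digit(m, w10):
    """Step (1): the first k, scanning from 8 down to -7, with m_k - 10w < 0 wins."""
    for k in range(8, -8, -1):
        if m[k] - w10 < 0:
            return k
    return -8


def divide(x, d, n):
    """Run n recurrence steps; returns the signed digits and the final residual."""
    m = selection_constants(d)
    w, digits = x, []
    for _ in range(n):
        z = select_digit(m, 10 * w)
        w = 10 * w - z * d  # step (2), without the carry-save split w_s + w_c
        digits.append(z)
    return digits, w
```

With `x = Fraction(3, 10)` and `d = Fraction(7, 10)`, the first selected digit is 4, and the digits satisfy $x = d \cdot \sum_j z_j 10^{-j} + w[n] \cdot 10^{-n}$ exactly.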

iteration step. The advantage of the type3 QDS function is its short critical path, which comprises only one 8-bit binary carry-propagate adder. However, this short critical path is bought at a high price because the complexity is moved to the selection of the divisor's multiples. Hence, large fan-in multiplexers with poor latencies are required. In order to minimize the impact of these multiplexers, we designed a new dedicated (17:1) multiplexer that exploits the FPGA's fast carry chains and has a delay of only one LUT instance. In the following, we name the divider with the improved fast multiplexers type3b and the divider with traditional multiplexers type3a. The new dedicated (17:1) multiplexer uses the 16 sign bits as select lines and is coded as follows:

if sel = '1111111111111111' then y = x(16)
if sel = '0111111111111111' then y = x(15)
⋮
if sel = '0000000000000001' then y = x(1)
if sel = '0000000000000000' then y = x(0)        (17)

with sel(−1) = 1 and sel(16) = 0. In other words, bit x(i) is selected when sel(i) = 0 and sel(i − 1) = 1. The selection of two input signals is implemented in one (5:1) LUT, and all signals are combined by a long OR gate that is implemented by exploiting the FPGA's fast carry chain. Therefore, one (17:1) multiplexer requires nine LUTs, and the type3b divider with fast multiplexers uses many more LUTs than the type3a divider (see Section 6). The block diagram of such a (17:1) multiplexer is depicted in Figure 5.
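The selection rule of (17) can be expressed as a bit-level model. The following Python sketch is a behavioral illustration only, not the LUT and carry-chain mapping itself; the function name `mux17` and the integer encoding of `sel` (bit i of the integer is sel(i)) are assumptions made for this example.

```python
def mux17(sel: int, x: list) -> int:
    """Select one of 17 input bits x[0..16] with a 16-bit thermometer-coded sel."""
    assert len(x) == 17 and 0 <= sel < (1 << 16)

    def sel_bit(i: int) -> int:
        # Boundary conditions from the text: sel(-1) = 1 and sel(16) = 0.
        if i == -1:
            return 1
        if i == 16:
            return 0
        return (sel >> i) & 1

    y = 0
    for i in range(17):
        # x(i) passes when sel(i) = 0 and sel(i - 1) = 1; all terms are ORed,
        # which corresponds to the long OR gate on the carry chain.
        y |= x[i] & (1 - sel_bit(i)) & sel_bit(i - 1)
    return y
```

For a select word with the n lowest bits set, the model returns x(n), which matches the four cases listed in (17).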

Figure 5: Fast multiplexer (nine (5:1) LUTs with INIT = F5F7FDFF, inputs x(0) to x(16) and sel(0) to sel(15), output Y).

4.5. Type4 QDS Function

The type4 divider applies a new algorithm as proposed by us in a previous paper [22]. It is based on the decomposition of the signed quotient digit into three components having the values $\{-5, 0, 5\}$, $\{-2, 0, 2\}$, and $\{-1, 0, 1\}$, as well as a fast constant digit selection function. In this implementation, neither a ROM for the QDS function nor a multiplexer to select the multiple $z_j \cdot d$ in the digit recurrence iteration step is required. The divider is intended to utilize fewer resources than the other implementations. Like the type2, type3a, and type3b dividers, the type4 divider uses a signed-digit redundant quotient calculation, a carry-save representation of the residual, fast BCD-4221 CSAs for the digit recurrence, and a radix-2 implementation of the MSDs.

Quotient digits are decomposed into three components, $z_{j+1} = 5 \cdot z^3_{j+1} + 2 \cdot z^2_{j+1} + z^1_{j+1}$, and each component $z^i$ is computed by a distinct selection function. These components can hold three values, $z^i \in \{-1, 0, 1\}$, so that only two comparators per component are required to distinguish between these values. Since $-8 \le z_{j+1} \le 8$, we get a redundancy factor of $\rho = 8/9$. Furthermore, the selection functions become very simple due to prescaling of the divisor ($0.4 \le d < 0.8$); that is, they are constant functions $\text{SEL}^i$ that no longer depend on the divisor:

$$z^3_{j+1} = \text{SEL}^3(10\hat{w}[j]), \tag{18a}$$

$$z^2_{j+1} = \text{SEL}^2(\hat{v}^1[j]), \tag{18b}$$

$$z^1_{j+1} = \text{SEL}^1(\hat{v}^2[j]). \tag{18c}$$

The recurrence is then defined as follows:

$$v^1[j] = 10w[j] - z^3_{j+1} \cdot 5d, \tag{19a}$$

$$v^2[j] = v^1[j] - z^2_{j+1} \cdot 2d, \tag{19b}$$

$$w[j+1] = v^2[j] - z^1_{j+1} \cdot d. \tag{19c}$$

The multiples $2d$ and $5d$ can be computed easily and quickly according to (9) and (10). Each selection function requires two comparators implemented as carry-save adders that subtract constant values from the residuals' estimations ($\hat{w}[j]$, $\hat{v}^1[j]$, and $\hat{v}^2[j]$). As we will show in the following, these estimations require only the two most significant digits (MSDs) of the corresponding exact values.

First, the selection intervals $[L^i_k, U^i_k]$ for $z^i_{j+1} = k$ with $k \in \{-1, 0, 1\}$ and $i \in \{1, 2, 3\}$ have to be determined, where $L^i_k$ is the smallest and $U^i_k$ is the greatest value of the selection constant $m_k(i)$ such that the next residual is still bounded. Applying the convergence condition (8), the digit recurrence (19a), (19b), and (19c), and the redundancy factor $\rho = 8/9$, we obtain

$$|w[j]| \le \rho d = \tfrac{8}{9} d, \tag{20a}$$

$$|v^2[j]| \le (\rho + 1)\, d = \tfrac{17}{9} d, \tag{20b}$$

$$|v^1[j]| \le (\rho + 1 + 2)\, d = \tfrac{35}{9} d. \tag{20c}$$

From the recurrence $v^1[j] = 10w[j] - z^3_{j+1} \cdot 5d$, the bound $v^1[j] \le (35/9)\, d$, and $z^3_{j+1} \in \{-1, 0, 1\}$, and by replacing $10w[j]$ by the upper limit $U^3_k$, we get

$$U^3_k = (5k + \rho + 3)\, d = \left(5k + \tfrac{35}{9}\right) d. \tag{21a}$$

Similarly, we obtain the upper limits $U^2_k$ and $U^1_k$ from the recurrences (19b) and (19c), respectively:

$$U^2_k = (2k + \rho + 1)\, d = \left(2k + \tfrac{17}{9}\right) d, \tag{21b}$$

$$U^1_k = (k + \rho)\, d = \left(k + \tfrac{8}{9}\right) d. \tag{21c}$$

Likewise, the lower limits $L^i_k$ can be computed:

$$L^3_k = (5k - \rho - 3)\, d = \left(5k - \tfrac{35}{9}\right) d, \tag{22a}$$

$$L^2_k = (2k - \rho - 1)\, d = \left(2k - \tfrac{17}{9}\right) d, \tag{22b}$$

$$L^1_k = (k - \rho)\, d = \left(k - \tfrac{8}{9}\right) d. \tag{22c}$$

Figure 6: P-D diagram for the type4 QDS function (residual $10 \cdot \hat{w}$ over the divisor, with the decision boundaries $m^3_{\text{pos}} = 0.9$ and $m^3_{\text{neg}} = -1.5$).

Due to the redundancy factor $\rho = 8/9 > 1/2$, we have overlapping regions $[L^i_{k+1}, U^i_k]$ where more than one quotient digit component $z^i_{j+1}$ may be selected. The decision boundary of a selection function should lie inside these overlapping regions. Figure 6 shows the P-D diagram for $z^3$. The figures for $z^2$ and $z^1$ look similar. As $z^3$ can hold three different values, the selection function requires two decision boundaries. Generally, it is implemented by a staircase function (see Figure 2), but for the bounded divisor $d \in [0.4, 0.8)$ it is independent of the divisor and reduces to a constant function (see Figure 6). Fortunately, the selection functions for $z^2$ and $z^1$ are also independent of the divisor and can be implemented by constant functions in a similar fashion.

We now determine the required number of fractional digits for the estimates $10\hat{w}$, $\hat{v}^1$, and $\hat{v}^2$. Moreover, we choose suitable selection constants $m^i_{\mathrm{pos}}$ and $m^i_{\mathrm{neg}}$. These constants depend on the maximum error due to the estimates and on the minimum overlapping width, which is given by $U^i_0(0.4) - L^i_1(0.8)$ as well as $U^i_{-1}(0.8) - L^i_0(0.4)$. Since we use a BCD-4221 carry-save redundant digit representation, for a precision of $t$ digits (one integer digit and $t - 1$ fractional digits) we have a maximum error of $\epsilon = 2 \cdot 10^{-(t-1)}$. Moreover, the estimations $\hat{w}$ are computed by truncation ($w - \hat{w} \le \epsilon$). This means that for given positive and negative selection constants $m^i_{\mathrm{pos}} > 0$ and $m^i_{\mathrm{neg}} < 0$, the following expressions must be true:

$$U^i_0(0.4) - m^i_{\mathrm{pos}} \ge \epsilon, \qquad L^i_1(0.8) \le m^i_{\mathrm{pos}}, \tag{23}$$

$$L^i_0(0.4) \le m^i_{\mathrm{neg}}, \qquad U^i_{-1}(0.8) - m^i_{\mathrm{neg}} \ge \epsilon. \tag{24}$$

If we choose the selection constants as listed in Table 2, all conditions are fulfilled for a precision of $t = 2$ digits, and each selection function is reduced to two simple 2-digit comparators.
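Conditions (23) and (24) can be checked numerically for the constants of Table 2. The following is a minimal sketch in exact rational arithmetic, not part of the original design flow. The names `upper`, `lower`, `SPAN`, and `constants_valid` are illustrative, and the form of $U^3_k$ is assumed by symmetry with (22a), since (21a) is not repeated here.

```python
from fractions import Fraction as F

RHO = F(8, 9)                    # redundancy factor from the text
WEIGHT = {3: 5, 2: 2, 1: 1}      # component z^i selects the multiple n*d
# Residual magnitude (in units of d) that the remaining components can absorb.
# SPAN[3] = rho + 3 is an assumption mirroring (22a); the others follow (21b)-(22c).
SPAN = {3: RHO + 3, 2: RHO + 1, 1: RHO}
# Selection constants (m_pos, m_neg) as listed in Table 2.
CONSTANTS = {3: (F(9, 10), F(-15, 10)),
             2: (F(1, 10), F(-7, 10)),
             1: (F(1, 10), F(-3, 10))}


def upper(i, k, d):
    """Upper limit U^i_k(d) of the selection interval for value k."""
    return (WEIGHT[i] * k + SPAN[i]) * d


def lower(i, k, d):
    """Lower limit L^i_k(d) of the selection interval for value k."""
    return (WEIGHT[i] * k - SPAN[i]) * d


def constants_valid(t):
    """Check conditions (23) and (24) for an estimate precision of t digits."""
    eps = 2 * F(1, 10 ** (t - 1))
    d_min, d_max = F(4, 10), F(8, 10)
    for i, (m_pos, m_neg) in CONSTANTS.items():
        cond23 = (upper(i, 0, d_min) - m_pos >= eps
                  and lower(i, 1, d_max) <= m_pos)
        cond24 = (lower(i, 0, d_min) <= m_neg
                  and upper(i, -1, d_max) - m_neg >= eps)
        if not (cond23 and cond24):
            return False
    return True


print(constants_valid(2), constants_valid(1))
```

Under these assumptions the check passes for $t = 2$ and fails for $t = 1$, where the estimate error $\epsilon = 2$ exceeds every overlap margin.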

As soon as one component $z^i$ of the quotient digit is computed, the corresponding multiple of the divisor must be subtracted from the partial residual. Each component $z^i$ can hold three different values, $z^i \in \{-1, 0, +1\}$, that are multiplied with the corresponding weighted divisor $n \cdot d$ with $n \in \{5, 2, 1\}$. In other words, $n \cdot d$ is either passed, negated, or reset. Passing and resetting the multiple of the divisor is performed at no extra cost. Negation is accomplished by bit inversion and adding +1 through the carry inputs of the carry-save adders in (19a), (19b), and (19c). In summary, this is a very fast operation and requires only two LUTs per digit.

Table 2: Constants for type4 quotient digit selection function.

| | $m^i_{\mathrm{pos}}$ | $m^i_{\mathrm{neg}}$ | $U^i_0(0.4) - m^i_{\mathrm{pos}}$ | $U^i_{-1}(0.8) - m^i_{\mathrm{neg}}$ |
|---|---|---|---|---|
| $z^3$ | 0.9 | -1.5 | 0.656 | 0.611 |
| $z^2$ | 0.1 | -0.7 | 0.656 | 0.611 |
| $z^1$ | 0.1 | -0.3 | 0.255 | 0.211 |

Similar to the type2 and type3 division, we implement the two MSDs of the type4 residual by using a nonredundant radix-2 representation, which requires fewer LUTs and results in a faster quotient digit selection function. The pseudocode for the digit recurrence iteration is shown in Algorithm 6, and the corresponding block diagram is depicted in Figure 7.
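The type4 recurrence of Algorithm 6 can be simulated in software to see how the three components combine into one signed quotient digit. The sketch below is an illustration under stated assumptions, not the hardware design. It uses exact rationals instead of BCD-4221 carry-save words, and `estimate` stands in for the truncated estimates by flooring to one fractional digit, which keeps the error below the bound $\epsilon = 0.2$. The function names and the initial condition $|x| \le \rho d$ are choices made for this example.

```python
import math
from fractions import Fraction as F

# (m_pos, m_neg, weight n) for the components z^3, z^2, z^1 (constants of Table 2).
STAGES = [(F(9, 10), F(-15, 10), 5),
          (F(1, 10), F(-7, 10), 2),
          (F(1, 10), F(-3, 10), 1)]


def estimate(v):
    """Assumed estimate: truncate toward minus infinity to one fractional digit."""
    return F(math.floor(v * 10), 10)


def type4_step(w, d):
    """One iteration: returns the signed digit 5*z3 + 2*z2 + z1 and the new residual."""
    v = 10 * w
    digit = 0
    for m_pos, m_neg, weight in STAGES:
        est = estimate(v)
        z = 1 if est >= m_pos else (0 if est >= m_neg else -1)
        v -= z * weight * d          # subtract the selected multiple of d
        digit += z * weight
    return digit, v


def type4_divide(x, d, n):
    """Approximate x/d with n signed digits; requires 0.4 <= d < 0.8 and |x| <= (8/9)*d."""
    assert F(4, 10) <= d < F(8, 10) and abs(x) <= F(8, 9) * d
    w, q = F(x), F(0)
    for j in range(1, n + 1):
        digit, w = type4_step(w, d)
        q += F(digit, 10 ** j)
    return q, w


q, w = type4_divide(F(1, 3), F(2, 5), 16)
print(float(q), float(w))
```

After $n$ steps the invariant $x = q \cdot d + w \cdot 10^{-n}$ holds exactly, and the residual stays within $|w| \le \rho d$, so the quotient error is bounded by $\rho \cdot 10^{-n}$.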

4.6. Proposed Decimal Fixed-Point Divider. For each type of division algorithm presented in the preceding sections, we have implemented a corresponding fixed-point divider, which is described in the following. For all fixed-point dividers, we expect the input operands to be normalized.

The block diagram of the type1 divider is depicted in Figure 8. First, the divisor multiples $2d, \ldots, 9d$ are precomputed. In the following cycles, $p = 16$ quotient digits ($16 + 1$ if the first digit is zero) are computed one by one, followed by an additional rounding digit. The quotient (Z) and the quotient +1 (ZP1) are computed on-the-fly by using two registers that are updated every cycle. The algorithm has the advantage that no additional slow decimal CPA is required to compute the incremented quotient. It is described more precisely in Section 5. The normalization of the result is also performed on-the-fly by locking the conversion when $16 + 1$ quotient digits for the significand and the rounding digit have been computed; that is, in the worst case, when the first quotient digit is zero, $16 + 1 + 1 = 18$ cycles are required. Furthermore, the sticky bit is calculated, which is required for rounding. It is set to one whenever the final remainder is unequal to zero.
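As an illustration of the type1 data flow, the following sketch performs a radix-10 restoring division on integer significands and keeps Z and ZP1 up to date in every step. It is a software model under assumptions. The function name is hypothetical, and the ZP1 update rule (append $k + 1$, or append 0 to the old ZP1 when $k = 9$) is one plausible way to avoid a carry-propagate increment; the paper defers the exact scheme to Section 5.

```python
def type1_divide(x, d, n):
    """Restoring shift-and-subtract division producing n decimal digits.

    x, d: nonnegative integer significands with x < 10*d (d > 0).
    Returns (Z, ZP1, sticky).
    """
    assert d > 0 and 0 <= x < 10 * d
    multiples = [k * d for k in range(10)]   # the hardware precomputes 2d ... 9d
    w = x
    rem = x
    z_reg, zp1_reg = 0, 1                    # quotient Z and quotient + 1 (ZP1)
    for _ in range(n):
        # Largest multiple that still fits, i.e. the smallest nonnegative difference.
        k = max(m for m in range(10) if multiples[m] <= w)
        rem = w - multiples[k]
        # Append one digit to each register; both right-hand sides use the old values.
        z_reg, zp1_reg = (10 * z_reg + k,
                          10 * z_reg + k + 1 if k < 9 else 10 * zp1_reg)
        w = 10 * rem                         # shift the residual by one digit
    sticky = int(rem != 0)                   # set when the final remainder is nonzero
    return z_reg, zp1_reg, sticky


print(type1_divide(1000, 3000, 5))
```

For normalized operands with the same number of digits, the condition $x < 10d$ always holds, so the first digit is at most 9.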

The architectures of the type2, type3.a, type3.b, and type4 dividers are similar in their structure but differ in their digit recurrence algorithm and scaling. The common block diagram is shown in Figure 9. First, the dividend and the divisor are rescaled,

$$d \in [0.1, 1.0) \;\Longrightarrow\; d' \in \begin{cases} [0.4, 1.0) & \text{for type2, type3.a/b,} \\ [0.4, 0.8) & \text{for type4,} \end{cases} \tag{25}$$

and five as well as two times the divisor are precomputed according to (9) and (10). Then the operands are recoded, because the digit recurrence iteration uses a redundant digit representation with BCD-4221 carry-save adders.

The digit recurrence retires one signed quotient digit $z_k$ in each iteration. The on-the-fly conversion algorithm converts this signed-digit representation to the BCD-8421-coded


Figure 7: Block diagram of type4 digit recurrence.

Figure 8: Block diagram of the normalized type1 division.

(1) Select

$$z^3_{j+1} = \begin{cases} 1 & \text{if } 0.9 \le 10\hat{w}[j], \\ 0 & \text{if } -1.5 \le 10\hat{w}[j] < 0.9, \\ -1 & \text{if } 10\hat{w}[j] < -1.5, \end{cases}$$

and compute $(v^1_s[j] + v^1_c[j]) = 10\,(w_s[j] + w_c[j]) - z^3_{j+1} \cdot 5d$.

(2) Select

$$z^2_{j+1} = \begin{cases} 1 & \text{if } 0.1 \le \hat{v}^1[j], \\ 0 & \text{if } -0.7 \le \hat{v}^1[j] < 0.1, \\ -1 & \text{if } \hat{v}^1[j] < -0.7, \end{cases}$$

and compute $(v^2_s[j] + v^2_c[j]) = (v^1_s[j] + v^1_c[j]) - z^2_{j+1} \cdot 2d$.

(3) Select

$$z^1_{j+1} = \begin{cases} 1 & \text{if } 0.1 \le \hat{v}^2[j], \\ 0 & \text{if } -0.3 \le \hat{v}^2[j] < 0.1, \\ -1 & \text{if } \hat{v}^2[j] < -0.3, \end{cases}$$

and compute $(w_s[j+1] + w_c[j+1]) = (v^2_s[j] + v^2_c[j]) - z^1_{j+1} \cdot d$.

Algorithm 6: Pseudocode for the type4 digit recurrence.

Figure 9: Block diagram of the normalized type2, type3.a, type3.b, and type4 divisions.

result [20] and computes the quotient (Z), the quotient +1 (ZP1), and the quotient -1 (ZM1). This conversion is accomplished in every iteration step and does not need a slow CPA. The incremented and decremented quotients are required for rounding. Moreover, the quotient is also normalized on-the-fly by locking the conversion when $16 + 1$ quotient digits (including the rounding digit) have been computed. The algorithm is described explicitly in Section 5 because it is accomplished together with the gradual underflow handling of the floating-point divider.


if ($\overline{\mathrm{qNaN}_X}$ and $\overline{\mathrm{sNaN}_X}$) and ($\mathrm{qNaN}_D$ or $\mathrm{sNaN}_D$) then
    $C'_X = C_D$, $C'_D = 0 \ldots 01$
    $q'_X = q_D$, $q'_D = 0 + \mathrm{bias} = 398$
else if ($\mathrm{qNaN}_X$ or $\mathrm{sNaN}_X$) then
    $C'_X = C_X$, $C'_D = 0 \ldots 01$
    $q'_X = q_X$, $q'_D = 0 + \mathrm{bias} = 398$
else (no changes)
    $C'_X = C_X$, $C'_D = C_D$
    $q'_X = q_X$, $q'_D = q_D$

Algorithm 7: Algorithm of NaN handling.

Furthermore, the sticky bit is calculated, which is required for rounding. It is set to one whenever the final remainder is unequal to zero. Moreover, when the final remainder is negative, the quotient of the type2, type3.a, type3.b, and type4 dividers has to be adjusted by subtracting one LSD. This subtraction does not need another slow CPA because the quotient -1 (ZM1) has already been computed on-the-fly. The calculations of both the sticky bit and the sign bit require the reduction of the redundant remainder by a CPA. This CPA might be subdivided into multiple smaller CPAs to keep the latency low.

5. IEEE 754-2008 Floating-Point Division

The IEEE 754-2008 compliant decimal floating-point divider is an extension of the normalized fixed-point divider. The divider presented in this paper supports the interchange format IEEE 754-2008 decimal64 with DPD encoding, but it can be easily adapted to any other precision and exponent range.

The block diagram of the divider is depicted in Figure 10. The DFP division begins with the decoding of the dividend X and the divisor D and the extraction of their signs, significands, and exponents. In the following, the significands of the operands X ($c_X$) and D ($c_D$) are regarded as integers. Therefore, the corresponding exponents are $q_X$ and $q_D$.

According to [4], if one of the operands is a signaling NaN (sNaN) or a quiet NaN (qNaN), then the result is also a quiet NaN with the payload of the original NaN. Hence, in order to preserve the payload while using an unmodified divider, the NaN handling unit sets the NaN-holding operand as the dividend and resets the divisor to $1.0 \cdot 10^{0}$, as depicted in Algorithm 7.

The fixed-point dividers presented in Section 4 require normalized operands. Therefore, the number of leading zeros of both operands is counted, and the significands are normalized by barrel shifters. For performance reasons, the leading zeros counters exploit the FPGA's fast carry chains, as proposed in [23].

The exponent of a floating-point division is first estimated as the exponent of the normalized quotient in fractional representation (QNF) and is updated iteratively in each digit recurrence step. Since most of the decimal floating-point numbers specified by IEEE 754-2008 have multiple representations of the same value, the result of a division might also have more than one correct representation, and these differ in the exponent. However, to obtain a unique and reproducible result in the case of such an ambiguity, IEEE 754-2008 defines the exponent of the result, which is called the preferred exponent (QP).

Both exponents QNF and QP are positive integers and are determined as follows:

$$\Delta q = q_X - q_D, \qquad \Delta LZ = LZ_X - LZ_D, \tag{26}$$

$$\mathrm{QNF} = \Delta q - \Delta LZ + \mathrm{offset1}, \qquad \mathrm{QP} = \Delta q + \mathrm{offset1}. \tag{27}$$

The offset1 is a bias that ensures that both exponents are positive, since unsigned integer calculation is used in this paper. The bias is composed of

$$\begin{aligned} \mathrm{offset1} = 783 ={} & 398 && \text{(IEEE 754-2008 bias)} \\ {}+{} & 369 && (q_{\max}) \\ {}+{} & 15 && \text{(conversion fractional to integer)} \\ {}+{} & 1 && \text{(normalization of the result).} \end{aligned} \tag{28}$$
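The exponent estimates of (26) to (28) amount to a few integer additions. The short sketch below evaluates them. It assumes decimal64 operands with biased exponents in the range 0 to 767 and leading-zero counts in the range 0 to 15; the function name and these ranges are illustrative assumptions rather than definitions taken from the paper.

```python
BIAS = 398       # IEEE 754-2008 decimal64 exponent bias
Q_MAX = 369      # largest decimal64 exponent for an integer significand
P = 16           # precision in digits

# Composition of offset1 as given in (28).
OFFSET1 = BIAS + Q_MAX + (P - 1) + 1


def estimate_exponents(q_x, q_d, lz_x, lz_d):
    """Return (QNF, QP) according to (26) and (27)."""
    delta_q = q_x - q_d
    delta_lz = lz_x - lz_d
    qnf = delta_q - delta_lz + OFFSET1
    qp = delta_q + OFFSET1
    return qnf, qp


# Smallest case under the assumed ranges: tiny dividend exponent with many
# leading zeros, divided by a large divisor exponent without leading zeros.
print(OFFSET1, estimate_exponents(0, 767, 15, 0))
```

Under the assumed operand ranges, this extreme case still yields positive values for QNF and QP, which is consistent with the role of offset1 described above.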

The normalized fixed-point division unit then computes $p = 16$ significand digits and one additional rounding digit. Additionally, the divider encompasses the on-the-fly conversion unit, which also detects and handles gradual underflow with zero delay overhead. One signed quotient digit $z_j$ is retired in each cycle, and the on-the-fly conversion algorithm converts this signed-digit representation into the BCD-8421-coded partial result [20]. The conversion is accomplished in every iteration step and does not need a slow CPA (see Algorithm 8). The partial quotient is stored in a register $Z'$. Moreover, two additional registers $ZM1'$ and $ZP1'$ are provided: $ZM1'$ is the partial quotient decremented by one least significant digit (LSD), and $ZP1'$ is the partial quotient incremented by one LSD. $ZM1'$ is selected when the final residual in the last iteration is negative, and $ZP1'$ is required by the following rounding operation. Furthermore, the exponent for the normalized significand in integer representation (QNI) is calculated (see (29)), and gradual underflow handling is performed at no extra cost (see Algorithm 8). The exponent QNI is computed by decrementing the exponent of the normalized significand in fractional representation (QNF) by one in each iteration step. The recurrence iteration terminates earlier in case of gradual underflow. The gradual underflow signal (GUF) is asserted high when $\mathrm{QNI} = q_{\min}$ and the calculated integer quotient is less than $10^p$:

$$\mathrm{QNI} = \begin{cases} q_{\min} & \text{if } \mathrm{GUF} = 1, \\ \mathrm{QNF} - (p - 1) & \text{if } z_1 > 0, \\ \mathrm{QNF} - p & \text{if } z_1 = 0. \end{cases} \tag{29}$$

The extended quotient (composed of the calculated quotient and the rounding digit) must be decremented by one if the


Figure 10: Block diagram of the floating-point divider.

final remainder is less than zero. The sticky bit (sb) is used for rounding and is asserted high whenever the remainder is unequal to zero.

Moreover, the number of trailing zeros (TZ) is computed on-the-fly. This number indicates by how many digits the computed quotient might be shifted to the right without losing accuracy. The number of trailing zeros has to be counted only for $Z'$ because, if $ZM1'$ is selected, the residual will be unequal to zero and TZ = 0 anyway. The number of trailing zeros is used for the selection between the preferred exponent (QP) and the normalized exponent for integer representation (QNI), as described in the following paragraph.

The exponent calculation unit selects either QP or QNI. The preferred exponent should be selected whenever possible. This is only feasible when $\mathrm{QNI} < \mathrm{QP} \le \mathrm{QNI} + \mathrm{TZ}$, that is, when there is a sufficient number of trailing zeros to shift the significand to the right. Furthermore, the final exponent must satisfy the minimum and maximum exponent range. The algorithm of the exponent calculation is illustrated in Algorithm 9, where the value

$$\mathrm{offset2} = 783 - \mathrm{bias} = 783 - 398 = 385 \tag{30}$$

is used to add the IEEE 754-2008 bias and to remove offset1 again (which was introduced in (27)).

$\mathrm{QNI}' = \mathrm{QNF} + 1$; $Z' = ZM1' = ZP1' = 0$; $TZ' = 0$; $j = 0$

while ($Z' < 10^p$) and ($\mathrm{QNI}' \ge q_{\min}$)

    $Z' = \begin{cases} 10 \cdot Z' + z_{j+1} & \text{if } z_{j+1} \ge 0 \\ 10 \cdot ZM1' + (z_{j+1} + 10) & \text{if } z_{j+1} < 0 \end{cases}$

    $ZM1' = \begin{cases} 10 \cdot Z' + (z_{j+1} - 1) & \text{if } z_{j+1} > 0 \\ 10 \cdot ZM1' + (z_{j+1} + 9) & \text{if } z_{j+1} \le 0 \end{cases}$

    $ZP1' = \begin{cases} 10 \cdot Z' + (z_{j+1} + 1) & \text{if } z_{j+1} \ge -1 \\ 10 \cdot ZM1' + (z_{j+1} + 11) & \text{if } z_{j+1} < -1 \end{cases}$

    $TZ' = \begin{cases} TZ' + 1 & \text{if } z_{j+1} = 0 \\ 0 & \text{else} \end{cases}$

    $\mathrm{QNI}' = \mathrm{QNI}' - 1$

    $j = j + 1$

$\mathrm{QNI} = \mathrm{QNI}' + 1$ (remove the impact of the rounding digit)

$R' = \begin{cases} z_{j+1} - 1 & \text{if remainder} < 0 \\ z_{j+1} & \text{else} \end{cases}$

$ZM1'' = \lfloor ZM1' \cdot 10^{-1} \rfloor$; $Z'' = \lfloor Z' \cdot 10^{-1} \rfloor$; $ZP1'' = \lfloor ZP1' \cdot 10^{-1} \rfloor$

$(Z, ZP1, R) = \begin{cases} (ZM1'', Z'', R' + 10) & \text{if } R' < 0 \\ (Z'', ZP1'', R') & \text{else} \end{cases}$

$sb = \begin{cases} 0 & \text{if remainder} = 0 \\ 1 & \text{else} \end{cases}$

$\mathrm{GUF} = \begin{cases} 1 & \text{if } \mathrm{QNI} = q_{\min} \text{ and } Z' < 10^p \text{ and } (sb = 1 \text{ or } R \ne 0) \\ 0 & \text{else} \end{cases}$

if $\mathrm{GUF} = 1$ then underflow

$TZ = \begin{cases} 0 & \text{if } sb = 1 \\ TZ' & \text{else} \end{cases}$

Algorithm 8: On-the-fly conversion with gradual underflow handling.
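The register updates in the loop body of Algorithm 8 can be sketched in software as follows. This is a simplified model, not the full algorithm. It omits the exponent tracking, the gradual underflow logic, and the final rounding-digit extraction, and it uses plain integers in place of BCD-8421 registers. The function name is hypothetical, and the digit range $[-8, 8]$ follows from the redundancy $\rho = 8/9$ used for the signed-digit dividers.

```python
def on_the_fly(digits):
    """Convert signed digits (most significant first) to Z, Z-1, Z+1 and count trailing zeros."""
    z = zm1 = zp1 = 0
    tz = 0
    for d in digits:
        assert -8 <= d <= 8
        # All three right-hand sides read the register values of the previous step;
        # each update only appends one digit in 0..9, so no carry has to propagate.
        z, zm1, zp1 = (
            10 * z + d if d >= 0 else 10 * zm1 + (d + 10),
            10 * z + (d - 1) if d > 0 else 10 * zm1 + (d + 9),
            10 * z + (d + 1) if d >= -1 else 10 * zm1 + (d + 11),
        )
        tz = tz + 1 if d == 0 else 0
    return z, zm1, zp1, tz


print(on_the_fly([1, -8, 0, 5, -1, 0, 0]))
```

Once the first positive digit has been retired, the three registers hold the conventional value of the signed-digit string, that value minus one, and that value plus one, which is why no final carry-propagate correction is needed.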

Rounding is required when the number of significant digits of the quotient exceeds the length $p$ of a decimal word. The divider presented in this paper provides the five rounding modes as requested by [4]. Rounding either selects the integer quotient $Z_e$ or the incremented quotient $ZP1_e = Z_e + 1$. As proven by Theorem A.1 in the appendix, rounding overflow cannot occur in DFP division. Rounding overflow would occur if $Z_e + 1$ overflows due to rounding. The selection of the quotient depends on the rounding mode, the rounding digit (R), the sticky bit (sb), and the least significant digit (LSD) of the quotient. The calculation of the round-up detection is summarized in Table 3.
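The round-up detection of Table 3 can be written as a small Boolean function. The sketch below is an interpretation under assumptions: the sign is taken as 1 for a negative result, LSD(0) is read as the lowest bit of the least significant digit (set when the digit is odd), and the mode names follow the IEEE 754-2008 attribute names. The function name is hypothetical.

```python
def round_up(mode, sign, r, sb, lsd):
    """Return True if the incremented quotient should be selected.

    mode: IEEE 754-2008 rounding-direction attribute name
    sign: 0 for a positive result, 1 for a negative result (assumed convention)
    r:    rounding digit (0..9), sb: sticky bit (0 or 1)
    lsd:  least significant digit of the unrounded quotient
    """
    discarded_nonzero = r > 0 or sb == 1
    if mode == "roundTiesToEven":
        return (r > 5
                or (r == 5 and sb == 1)
                or (r == 5 and sb == 0 and lsd % 2 == 1))
    if mode == "roundTiesToAway":
        return r >= 5
    if mode == "roundTowardPositive":
        return sign == 0 and discarded_nonzero
    if mode == "roundTowardNegative":
        return sign == 1 and discarded_nonzero
    if mode == "roundTowardZero":
        return False
    raise ValueError("unknown rounding mode: " + mode)


print(round_up("roundTiesToEven", 0, 5, 0, 3))
```

Because the quotient is kept as a magnitude with a separate sign, rounding toward positive or negative infinity only increments the magnitude when the direction points away from zero.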

Finally, the result is encoded again, and the exception unit might assert six exception signals. These are division by zero, invalid operation, result is infinite, inexact, overflow, and underflow. Division by zero is asserted when the divisor equals zero but the dividend is unequal to zero. The operation is called invalid when both operands are zero, both operands are infinity, or any of the operands is either a signaling or a quiet NaN. The inexact flag is asserted when the


Table 3: Round-up detection.

| Rounding mode | Round-up detection |
|---|---|
| Round Ties To Even | $(R > 5) + (R \equiv 5) \cdot sb + (R \equiv 5) \cdot \overline{sb} \cdot \mathrm{LSD}(0)$ |
| Round Ties To Away | $(R \ge 5)$ |
| Round Toward Positive | $\overline{sign} \cdot ((R > 0) + sb)$ |
| Round Toward Negative | $sign \cdot ((R > 0) + sb)$ |
| Round Toward Zero | $0$ |

Legend: "+": logical OR; "$\cdot$": logical AND; overline: logical negation.

offset2 = 783 - 398 = 385
if ($\mathrm{QP} \le \mathrm{QNI}$) or ($\mathrm{QNI} + \mathrm{TZ} < \mathrm{QP}$) or ($\mathrm{QP} > q_{\max}$) then
    (a) $\mathrm{QNI} > q_{\max} \Rightarrow$ overflow
    (b) $q_{\min} \le \mathrm{QNI} \le q_{\max} \Rightarrow$ $q_Z = \mathrm{QNI} - \mathrm{offset2}$, $Z_e = Z$, $ZP1_e = ZP1$
    (c) $\mathrm{QNI} < q_{\min} \Rightarrow$ underflow
else
    (a) $q_{\min} \le \mathrm{QP} \le q_{\max} \Rightarrow$ $q_Z = \mathrm{QP} - \mathrm{offset2}$, $Z_e = Z \gg (\mathrm{QP} - \mathrm{QNI})$, $ZP1_e = ZP1 \gg (\mathrm{QP} - \mathrm{QNI})$
    (b) $\mathrm{QNI} < q_{\min} \Rightarrow$ underflow

Algorithm 9: Exponent calculation.

rounding digit or the sticky bit is unequal to zero. Overflow and underflow are computed as described in the exponent calculation unit and are listed in Algorithms 8 and 9. The infinity exception is asserted when the result is greater than the largest number, that is, when either the overflow or the division by zero exception is asserted, or when the dividend is infinity while the divisor is a finite number.

6. Implementation Results

All dividers are modeled using VHDL and are implemented for Xilinx Virtex-5 devices with speed grade -2 using Xilinx ISE 10.1. The post-place and route results of the fixed-point dividers are listed in Table 4.

All five types of fixed-point dividers require 19 cycles to perform a division. This includes 17 cycles to determine the quotient digits and the rounding digit, one additional cycle to normalize the result (when the first quotient digit is zero), and one cycle of latency to perform the on-the-fly conversion and the sticky bit determination. As expected, the type1 divider uses the most resources in terms of look-up tables (LUTs) and flip-flops (FFs). The type2, type3.a, and type3.b dividers consume fewer LUTs but additionally require five (type2) or two (type3.a and type3.b) BRAMs. However, comparing the delays of the dividers leads to an unexpected result. Contrary to decimal divider implementations in ASIC designs, on FPGA platforms the shift-and-subtract algorithm is the fastest. The reason is that the signal propagation on the FPGA's internal fast carry chains is faster than the interconnections between slices over the FPGA's general routing matrix. The type1 divider exploits these fast carry chains, whereas the type2, type3.a, type3.b, and type4 dividers only use the normal, slow slice interconnection resources. In ASIC implementations, for instance, the longest path of the type4 divider would be much shorter than the longest path of the type1 divider, but on FPGA architectures the situation is the opposite.

One of the fastest decimal fixed-point dividers on ASICs is the design of Lang and Nannarelli [14]. They minimize the critical path in the digit recurrence by using a fast quotient digit selection function based on binary carry-propagate adders that subtract fixed values from the estimation of the current residual. These fixed values depend on estimations of the divisor and are precomputed and stored in a ROM. Therefore, the latency of the ROM does not contribute to the critical path of the digit recurrence. The implementation of such a divider on FPGAs would perform poorly because the required carry-save adder with small error estimation cannot be implemented efficiently on the FPGA's slice structure. However, the concept of removing the ROM from the critical path is applied to the type3.a and type3.b dividers, and the corresponding post-place and route results are shown in Table 4. These results show that the type3.a divider utilizes a similar number of LUTs as the type2 divider, but the cycle time is much higher. The increased cycle time can be explained by the higher complexity of the multiplexers for the selection of the divisor's multiples in the digit recurrence step. These multiplexers perform poorly in terms of propagation delay because they are implemented as a tree with slow slice interconnections. In contrast, the type3.b divider applies dedicated fast multiplexers, which have a propagation delay of only one LUT instance and eight fast carry chains. This dedicated fast multiplexer speeds up the divider and reduces the maximum cycle time by one nanosecond. Unfortunately, the number of used LUTs increases dramatically, by approximately 1000 LUTs.

The comparison of the type2, type3.a, type3.b, and type4 dividers in terms of speed and area (number of occupied LUTs and FFs combined) shows that the type2 algorithm is the fastest and requires the least number of slices. Hence, there is no benefit in removing the ROM's latency from the critical path, because the complexity is moved to the selection of the divisor's multiples. The type1 divider is only 5% faster than the type2 divider, but the speed is bought at a high price because the number of occupied slices is 75% higher. Therefore, the type2 divider shows a good tradeoff in terms of area and latency and is used for the floating-point divider presented in this paper. In the following paragraph, we compare the type2 implementation with other published fixed-point dividers.

Four other FPGA-based dividers are presented by Ercegovac and McIlhenny [17, 18], Deschamps and Sutter [24], Zhang et al. [25], and Vestias and Neto [12]. These implementations are compared to our type2 divider in the following. The divider presented in [17, 18] is based on a digit recurrence algorithm that only requires limited-precision multipliers,


Table 4: Results for normalized decimal fixed-point dividers, $p = 16 + 1$.

| | Type1 | Type2 | Type3.a | Type3.b | Type4 |
|---|---|---|---|---|---|
| Number of LUTs | 3595 | 1704 | 1749 | 2751 | 1846 |
| Number of FFs | 1234 | 1126 | 1262 | 1262 | 1031 |
| Number of LUTs and FFs combined | 3868 | 2210 | 2304 | 3240 | 2203 |
| Number of 36 k BRAM | 0 | 5 | 2 | 2 | 0 |
| Number of LUTs normalized to the type2 divider | 2.11 | 1.00 | 1.03 | 1.61 | 1.08 |
| Number of FFs normalized to the type2 divider | 1.10 | 1.00 | 1.12 | 1.12 | 0.92 |
| Number of LUTs and FFs combined normalized to the type2 divider | 1.75 | 1.00 | 1.04 | 1.47 | 1.00 |
| Cycle time (ns) | 8.1 | 8.5 | 10.1 | 9.1 | 12.1 |
| Overall latency (ns) (19 × cycle time) | 154 | 162 | 192 | 173 | 230 |
| Max. frequency (MHz) | 123 | 118 | 99 | 110 | 83 |
| Max. frequency normalized to the type2 divider | 1.05 | 1.00 | 0.84 | 0.93 | 0.70 |

Table 5: Performance comparison of fixed-point dividers.

| | Occupied area | Cycle time (ns) | Latency (ns) |
|---|---|---|---|
| Ercegovac and McIlhenny [17] ($p = 14$, Virtex-5) | 1263 (6,2)-LUTs | 13.1 | 197 |
| Type2 divider equalized to $p = 14$ | 1692 (6,2)-LUTs | 8.5 | 136 |
| Deschamps and Sutter [24] ($p = 16$, Virtex-4) | | | |
| (i) Nonrestoring algorithm | 2974 (4,1)-LUTs | 21.4 | 386 |
| (ii) SRT-like algorithm | 3799 (4,1)-LUTs | 16.6 | 300 |
| Zhang et al. [25] ($p = 16$, Virtex-II Pro) | 3976 slices | 20.0 | 420 |
| Vestias and Neto [12] ($p = 16$, Virtex-4, Newton-Raphson) | | | |
| (i) Type A3 (0 DSPs, 118 cycles) | 2756 (4,1)-LUTs | 3.4 | 394 |
| (ii) Type A4 (0 DSPs, 90 cycles) | 3768 (4,1)-LUTs | 3.4 | 306 |
| (iii) Type A5 (7 DSPs, 112 cycles) | 2091 (4,1)-LUTs | 3.4 | 380 |
| (iv) Type A6 (10 DSPs, 86 cycles) | 2718 (4,1)-LUTs | 3.4 | 292 |
| Type2 divider equalized to $p = 16$ | 1704 (6,2)-LUTs | 8.5 | 153 |

adders, and LUTs. Furthermore, a compensation term is computed in the digit recurrence that compensates for the error caused by this limited precision. The design is optimized for Virtex-5 FPGAs and has a good area characteristic (a 14-digit divider requires 1263 LUTs) but suffers from a high cycle time of 13.1 ns, which is more than 50% higher compared to the type1 or type2 design proposed in this paper. The reason for that high cycle time is (similar to the type4 divider) the long critical paths across many LUTs that cannot exploit the FPGA's fast carry chains. Therefore, the design would probably fit better on ASICs with dedicated routing. The dividers implemented in [24] apply two different digit recurrence algorithms with a redundancy of $\rho = 1$. The quotient digit selection function is very complex because it requires three MSDs of the divisor and four MSDs of the residual. The cycle times are high, but it must be taken into account that these designs are optimized for Virtex-4 FPGAs. Nevertheless, the number of used LUTs is very high. The divider presented in [25] applies a radix-100 digit recurrence algorithm with comprehensive prescaling of the dividend and divisor. It is optimized for Virtex-II Pro devices; hence, it is hard to compare it with the dividers presented in this paper. However, the critical path includes a decimal carry-save adder tree, two carry-propagate adder stages, and three multiplexer stages. It can therefore be assumed that this algorithm would also perform poorly on Virtex-5 devices, although two quotient digits are retired in each iteration step. Vestias and Neto [12] propose four different decimal dividers optimized for different speed and area tradeoffs. Contrary to the circuits presented in this paper, the dividers apply the Newton-Raphson algorithm. The multiplier used is sequential; therefore, many cycles per division are required. However, each divider has a higher delay and a greater LUT usage compared to our type2 implementation.

Table 5 lists the results of the dividers presented in [17, 24, 25] together with the results of the type2 divider. However, for a fair performance comparison, the precision of the type2 divider is equalized to match the number of calculated quotient digits of the corresponding divider. Furthermore, it should be noted that the dividers presented in [12, 17, 24, 25] do not provide rounding support.

For the floating-point implementation, the fixed-point divider of type2 is used because type1 and type3.b consume too many resources and type3.a and type4 are too slow. However, five additional BRAMs are required, but these are available in sufficient quantities on Virtex-5 devices. The corresponding post-place and route results for two configurations with different numbers of serialization stages are listed in Table 6. A comparison with other implementations is


Table 6: Post-place and route result of the floating-point divider based on type2.

| | Config. 1 | Config. 2 |
|---|---|---|
| Number of cycles per float division | 23 | 21 |
| Number of LUTs | 3231 | 3205 |
| Number of FFs | 1630 | 1377 |
| Number of LUTs and FFs combined | 3571 | 3566 |
| Cycle time (ns) | 8.5 | 9.1 |
| Overall latency (ns) | 195.5 | 191 |
| Max. frequency (MHz) | 118 | 110 |

Table 7: Post-place and route result of a 64-bit binary floating-point divider (Core Gen).

| Cycles per division | 1 | 10 | 20 | 55 |
|---|---|---|---|---|
| No. of LUTs | 3161 | 592 | 425 | 394 |
| No. of FFs | 196 | 518 | 469 | 542 |
| No. of LUTs and FFs comb. | 3232 | 917 | 675 | 675 |
| Cycle time (ns) | 153.7 | 20.8 | 8.9 | 6.8 |
| Overall latency (ns) | 153.7 | 208 | 178 | 374 |
| Max. frequency (MHz) | 6.5 | 48 | 112 | 147 |

complicated because there are no other FPGA-based implementations of decimal floating-point dividers published yet. Therefore, we can only compare the decimal divider with binary dividers implemented on the same FPGA. In Table 7, post-place and route results of binary floating-point dividers for the double data format on a Virtex-5, provided by the Xilinx Core Generator [26], are listed. The dividers also provide the exception signals overflow, underflow, invalid operation, and divide-by-zero. Unfortunately, the binary floating-point divider does not support gradual underflow because this feature increases the complexity. The divider with one cycle per division is an unpipelined and fully parallel design, and the divider that requires 55 cycles per division is a fully sequential design. The binary floating-point divider with 20 cycles per division is best suited for a comparison with the decimal floating-point divider because it requires a comparable number of cycles for one operation. Obviously, decimal arithmetic has a great overhead regarding the total number of used LUTs, but the cycle time and latency are similar. This resource overhead can be explained, on the one hand, by the inefficient representation of decimal values in binary logic and, on the other hand, by the more complex specification of the decimal standard IEEE 754-2008. For instance, decimal floating-point division requires additional encoders, decoders, and a normalization stage.

7. Conclusion

In this paper, we have presented five different radix-10 digit recurrence division algorithms. The type1 divider implements the simple shift-and-subtract algorithm. The type2 divider is based on a nonrestoring algorithm with a ROM-based quotient digit selection function. The type3.a and type3.b dividers are both similar to type2 but use fast binary carry-propagate adders in the quotient digit selection function. However, the type3.a divider uses LUT-based multiplexers, whereas the type3.b divider uses fast carry chain-based multiplexers. The type4 divider applies a new algorithm with constant digit selection functions. This type4 divider requires neither a ROM nor multiplexers to select multiples of the divisor.

We have shown that the shift-and-subtract algorithm, which has the worst latency in ASIC architectures, is the fastest design on FPGAs but uses the most FPGA resources in terms of LUTs. A good tradeoff between latency and area is the type2 architecture with a redundant carry-save representation of the residual, a radix-2 representation of the two MSDs, and a quotient digit selection function implemented in ROM. Furthermore, we have extended this fixed-point divider to a fully IEEE 754-2008 compliant decimal floating-point divider for the decimal64 data format, and finally we have shown implementation results of all dividers.

Appendix

Proofs

Theorem A.1 (rounding overflow). Let us consider a decimal floating-point division $Q = X/D$ with the signs $s_i$, the significands $c_i$, the exponents $q_i$, and the precision $p$:

$$(-1)^{s_Q} \cdot c_Q \cdot 10^{q_Q} = \frac{(-1)^{s_X} \cdot c_X \cdot 10^{q_X}}{(-1)^{s_D} \cdot c_D \cdot 10^{q_D}}. \tag{A.1}$$

Then rounding overflow, that is, an overflow that arises due to adding +1 to the least significant digit of the final result, cannot occur.

Proof. Assume to the contrary that there are a floating-point dividend X, a floating-point divisor D, and a proper rounding mode for which the decimal floating-point division produces a rounding overflow. Without loss of generality, we consider in the following only positive operands ($s_X = s_D = 0$), normalized significands ($c_X, c_D \ge 10^{p-1}$), and the exponents to be zero ($q_X = q_D = 0$). In the case of rounding overflow, the $p$ digits of the quotient must be all nines, and the remainder must be unequal to zero, that is,

$$\underbrace{c_X}_{p \text{ digits}} = c_Q \cdot c_D + \mathrm{rem} = \underbrace{9.9 \cdots 9}_{p \text{ digits}} \cdot \underbrace{c_D}_{p \text{ digits}} + \underbrace{\mathrm{rem}}_{p \text{ digits}} \tag{A.2}$$

$$\text{with } 0 < \mathrm{rem} < c_D \cdot 10^{-p+1}. \tag{A.3}$$

Condition (A.2) is fulfilled for

$$c_D = 10^{p-1}, \qquad c_X = \underbrace{99 \cdots 9}_{p \text{ digits}}, \tag{A.4}$$

but this violates condition (A.3) because the remainder is zero ($\mathrm{rem} = 0$); that is, the result is exact, no round-up is applied, and hence no rounding overflow occurs.


Otherwise, if the divisor is greater than $10^{p-1}$, it follows that

\[ \underbrace{c_X}_{p\ \text{digits}} = (9.9 \cdots 9) \cdot c_D + \mathrm{rem} = (10 - 10^{-p+1}) \cdot c_D + \mathrm{rem} = 10 \cdot c_D + \underbrace{(-10^{-p+1} \cdot c_D + \mathrm{rem})}_{< 0} = \underbrace{10 \cdot c_D + \mu}_{\ge p+1\ \text{digits}} \tag{A.5} \]

with $-10 < \mu < 0$. The left term of the equation has by definition a precision of $p$ digits, while the right term of the equation has a precision of more than $p$ digits. This contradiction finally proves that no rounding overflow can occur.
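As a sanity check of Theorem A.1, the statement can be tested exhaustively for small precisions. The following is only a sketch under stated assumptions: the function name `rounding_overflow_cases` is hypothetical, the significands are treated as $p$-digit integers, and the quotient is scaled by a power of ten so that it always has exactly $p$ digits. A rounding overflow would require an all-nines quotient together with a nonzero remainder.

```python
def rounding_overflow_cases(p):
    """List all (c_x, c_d) pairs of normalized p-digit significands whose
    p-digit truncated quotient is all nines AND has a nonzero remainder."""
    lo, hi = 10 ** (p - 1), 10 ** p
    cases = []
    for c_d in range(lo, hi):
        for c_x in range(lo, hi):
            # Scale so that the integer quotient has exactly p digits:
            # c_x/c_d lies in [1, 10) if c_x >= c_d, else in (0.1, 1).
            scale = 10 ** (p - 1) if c_x >= c_d else 10 ** p
            quotient, rem = divmod(c_x * scale, c_d)
            if quotient == hi - 1 and rem != 0:
                cases.append((c_x, c_d))
    return cases


if __name__ == "__main__":
    for p in (1, 2, 3):
        print(p, rounding_overflow_cases(p))
```

For the precisions tried here the list stays empty. The only all-nines quotients are exact ones such as the case (A.4), for example 999/100 with $p = 3$.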

References

[1] M. Baesler, S. Voigt, and T. Teufel, "FPGA implementations of radix-10 digit recurrence fixed-point and floating-point dividers," in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig '11), pp. 13–19, IEEE Computer Society, Los Alamitos, CA, USA, December 2011.

[2] IEEE Task P754, ANSI/IEEE 754-1985, Standard for Binary Floating-Point Arithmetic, New York, NY, USA, August 1985.

[3] M. F. Cowlishaw, "Decimal floating-point: algorism for computers," in Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH '03), pp. 104–111, IEEE Computer Society, Washington, DC, USA, June 2003.

[4] IEEE Task P754, IEEE 754-2008, Standard for Floating-Point Arithmetic, New York, NY, USA, August 2008.

[5] ANSI/IEEE, ANSI/IEEE Std 854-1987: An American National Standard, IEEE Standard for Radix-Independent Floating-Point Arithmetic, New York, NY, USA, October 1987.

[6] A. Y. Duale, M. H. Decker, H. G. Zipperer, M. Aharoni, and T. J. Bohizic, "Decimal floating-point in z9: an implementation and testing perspective," IBM Journal of Research and Development, vol. 51, no. 1-2, pp. 217–227, 2007.

[7] C. F. Webb, "IBM z10: the next-generation mainframe microprocessor," IEEE Micro, vol. 28, no. 2, pp. 19–29, 2008.

[8] L. Eisen, J. W. Ward, H. W. Tast et al., "IBM POWER6 accelerators: VMX and DFU," IBM Journal of Research and Development, vol. 51, no. 6, pp. 663–683, 2007.

[9] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd, "Power7: IBM's next-generation server processor," IEEE Micro, vol. 30, no. 2, pp. 7–15, 2010.

[10] S. F. Oberman and M. J. Flynn, "Division algorithms and implementations," IEEE Transactions on Computers, vol. 46, no. 8, pp. 833–854, 1997.

[11] L. K. Wang and M. J. Schulte, "A decimal floating-point divider using Newton-Raphson iteration," Journal of VLSI Signal Processing Systems, vol. 49, no. 1, pp. 3–18, 2007.

[12] M. Vestias and H. Neto, "Revisiting the Newton-Raphson iterative method for decimal division," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '11), pp. 138–143, IEEE Computer Society Press, September 2011.

[13] H. Nikmehr, B. Phillips, and C. C. Lim, "Fast decimal floating-point division," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 9, pp. 951–961, 2006.

[14] T. Lang and A. Nannarelli, "A radix-10 digit-recurrence division unit: algorithm and architecture," IEEE Transactions on Computers, vol. 56, no. 6, pp. 727–739, 2007.

[15] A. Vazquez, E. Antelo, and P. Montuschi, "A radix-10 SRT divider based on alternative BCD codings," in Proceedings of the 25th IEEE International Conference on Computer Design (ICCD '07), pp. 280–287, IEEE Computer Society Press, Los Alamitos, CA, USA, October 2007.

[16] E. Schwarz and S. Carlough, "Power6 decimal divide," in Proceedings of the 18th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '07), pp. 128–133, IEEE Computer Society, July 2007.

[17] M. D. Ercegovac and R. McIlhenny, "Design and FPGA implementation of radix-10 algorithm for division with limited precision primitives," in Proceedings of the 42nd Asilomar Conference on Signals, Systems and Computers (ASILOMAR '08), pp. 762–766, IEEE Computer Society, Pacific Grove, Calif, USA, October 2008.

[18] M. D. Ercegovac and R. McIlhenny, "Design and FPGA implementation of radix-10 combined division/square root algorithm with limited precision primitives," in Proceedings of the 44th Asilomar Conference on Signals, Systems and Computers (Asilomar '10), pp. 87–91, IEEE Computer Society, Pacific Grove, Calif, USA, November 2010.

[19] F. Y. Busaba, C. A. Krygowski, W. H. Li, E. M. Schwarz, and S. R. Carlough, "The IBM z900 decimal arithmetic unit," in Proceedings of the 35th Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1335–1339, IEEE Computer Society, November 2001.

[20] M. Ercegovac and T. Lang, Division and Square Root: Digit-Recurrence Algorithms and Implementations, Kluwer Academic Publishers, Norwell, Mass, USA, 1994.

[21] M. Baesler, S. O. Voigt, and T. Teufel, "A decimal floating-point accurate scalar product unit with a parallel fixed-point multiplier on a Virtex-5 FPGA," International Journal of Reconfigurable Computing, vol. 2010, Article ID 357839, 13 pages, 2010.

[22] M. Baesler, S. O. Voigt, and T. Teufel, "A radix-10 digit recurrence division unit with a constant digit selection function," in Proceedings of the 28th IEEE International Conference on Computer Design (ICCD '10), pp. 241–246, IEEE Computer Society, Los Alamitos, CA, USA, October 2010.

[23] M. Baesler, S. O. Voigt, and T. Teufel, "An IEEE 754-2008 decimal parallel and pipelined FPGA floating-point multiplier," in Proceedings of the 20th International Conference on Field Programmable Logic and Applications (FPL '10), pp. 489–495, IEEE Computer Society, Washington, DC, USA, September 2010.

[24] J. P. Deschamps and G. Sutter, "Decimal division: algorithms and FPGA implementations," in Proceedings of the 6th Southern Programmable Logic Conference (SPL '10), pp. 67–72, IEEE Computer Society, March 2010.

[25] Y. Zhang, D. Chen, L. Chen et al., "Design and implementation of a radix-100 decimal division," in Proceedings of IEEE Symposium on Circuit and System (ISCAS '09), Taibei, Taiwan, May 2009.

[26] Xilinx, Xilinx LogiCORE Floating-Point Operator v4.0, April 2008.


Figure 5: Fast multiplexer (a chain of LUT(5:1) cells with INIT = F5F7FDFF, data inputs x(0) to x(16), select inputs sel(0) to sel(15), and output Y).

the type2, type3.a, and type3.b dividers, the type4 divider uses signed-digit redundant quotient calculation, carry-save representation of the residual, fast BCD-4221 CSAs for the digit recurrence, and a radix-2 implementation of the MSDs.

Quotient digits are decomposed into three components, $z_{j+1} = 5 \cdot z^3_{j+1} + 2 \cdot z^2_{j+1} + z^1_{j+1}$, and each component $z^i$ is computed by a distinct selection function. These components can hold three values, $z^i \in \{-1, 0, 1\}$, so that only two comparators per component are required to distinguish between these values. Since $-8 \le z_{j+1} \le 8$, we get a redundancy factor of $\rho = 8/9$. Furthermore, the selection functions become very simple due to prescaling of the divisor ($0.4 \le d < 0.8$); that is, they are constant functions $\mathrm{SEL}^i$ that do not depend on the divisor anymore:

\[ z^3_{j+1} = \mathrm{SEL}^3(10\hat{w}[j]), \tag{18a} \]
\[ z^2_{j+1} = \mathrm{SEL}^2(\hat{v}^1[j]), \tag{18b} \]
\[ z^1_{j+1} = \mathrm{SEL}^1(\hat{v}^2[j]). \tag{18c} \]

The recurrence is then defined as follows:

\[ v^1[j] = 10 w[j] - z^3_{j+1} \cdot 5d, \tag{19a} \]
\[ v^2[j] = v^1[j] - z^2_{j+1} \cdot 2d, \tag{19b} \]
\[ w[j+1] = v^2[j] - z^1_{j+1} \cdot d. \tag{19c} \]

The multiples $2d$ and $5d$ can be computed easily and fast according to (9) and (10). Each selection function requires two comparators implemented as carry-save adders that subtract constant values from the residuals' estimations ($\hat{w}[j]$, $\hat{v}^1[j]$, and $\hat{v}^2[j]$). As we will show in the following, these estimations require only the two most significant digits (MSDs) of the corresponding exact values.

First, the selection intervals $[L^i_k, U^i_k]$ for $z^i_{j+1} = k$ with $k \in \{-1, 0, 1\}$ and $i \in \{1, 2, 3\}$ have to be determined, where $L^i_k$ is the smallest and $U^i_k$ is the greatest value of the selection constant $m_k(i)$ such that the next residual is still bounded. Applying the convergence condition (8), the digit recurrence (19a), (19b), and (19c), and the redundancy factor $\rho = 8/9$, we obtain

\[ |w[j]| \le \rho d = \tfrac{8}{9} d, \tag{20a} \]
\[ |v^2[j]| \le (\rho + 1) d = \tfrac{17}{9} d, \tag{20b} \]
\[ |v^1[j]| \le (\rho + 1 + 2) d = \tfrac{35}{9} d. \tag{20c} \]

From the recurrence $v^1[j] = 10 w[j] - z^3_{j+1} \cdot 5d$, the bound $|v^1[j]| \le (35/9) d$, $z^3_{j+1} \in \{-1, 0, 1\}$, and replacing $10 w[j]$ by the upper limit $U^3_k$, we get

\[ U^3_k = (5k + \rho + 3) d = \left(5k + \tfrac{35}{9}\right) d. \tag{21a} \]

Similarly, we obtain the upper limits $U^2_k$ and $U^1_k$ from the recurrences (19b) and (19c), respectively:

\[ U^2_k = (2k + \rho + 1) d = \left(2k + \tfrac{17}{9}\right) d, \tag{21b} \]
\[ U^1_k = (k + \rho) d = \left(k + \tfrac{8}{9}\right) d. \tag{21c} \]

Likewise, the lower limits $L^i_k$ can be computed:

\[ L^3_k = (5k - \rho - 3) d = \left(5k - \tfrac{35}{9}\right) d, \tag{22a} \]
\[ L^2_k = (2k - \rho - 1) d = \left(2k - \tfrac{17}{9}\right) d, \tag{22b} \]
\[ L^1_k = (k - \rho) d = \left(k - \tfrac{8}{9}\right) d. \tag{22c} \]


Figure 6: P-D diagram for the type4 QDS function (residual $10 \cdot \hat{w}$ with the selection intervals of $z^3$ and the constant decision boundaries $m^3_{\mathrm{pos}} = 0.9$ and $m^3_{\mathrm{neg}} = -1.5$).

Due to the redundancy factor $\rho = 8/9 > 1/2$, we have overlapping regions $[L^i_{k+1}, U^i_k]$ where more than one quotient digit component $z^i_{j+1}$ may be selected. The decision boundary of a selection function should lie inside these overlapping regions. Figure 6 shows the P-D diagram for $z^3$. The figures for $z^2$ and $z^1$ look similar. As $z^3$ can hold three different values, the selection function requires two decision boundaries. Generally, it is implemented by a staircase function (see Figure 2), but for the bounded divisor $d \in [0.4, 0.8)$ it is independent of the divisor and is reduced to a constant function (see Figure 6). Fortunately, the selection functions for $z^2$ and $z^1$ are also independent of the divisor and can also be implemented by constant functions in a similar fashion.

We now determine the required number of fractional digits for the estimates $10\hat{w}$, $\hat{v}^1$, and $\hat{v}^2$. Moreover, we choose suitable selection constants $m^i_{\mathrm{pos}}$ and $m^i_{\mathrm{neg}}$. These constants depend on the maximum error due to the estimates and on the minimum overlapping width, which is given by $U^i_0(0.4) - L^i_1(0.8)$ as well as $U^i_{-1}(0.8) - L^i_0(0.4)$. Since we use a BCD-4221 carry-save redundant digit representation, for a precision of $t$ digits (one integer digit and $t - 1$ fractional digits) we have a maximum error of $\epsilon = 2 \cdot 10^{-(t-1)}$. Moreover, the estimations $\hat{w}$ are computed by truncations ($w - \hat{w} \le \epsilon$). This means that for given positive and negative selection constants $m^i_{\mathrm{pos}} > 0$ and $m^i_{\mathrm{neg}} < 0$ the following expressions must be true:

\[ U^i_0(0.4) - m^i_{\mathrm{pos}} \ge \epsilon, \qquad L^i_1(0.8) \le m^i_{\mathrm{pos}}, \tag{23} \]
\[ L^i_0(0.4) \le m^i_{\mathrm{neg}}, \qquad U^i_{-1}(0.8) - m^i_{\mathrm{neg}} \ge \epsilon. \tag{24} \]

If we choose the selection constants as listed in Table 2, all conditions are fulfilled for a precision of $t = 2$ digits, and each selection function is reduced to two simple 2-digit comparators.

As soon as one component $z^i$ of the quotient digit is computed, the corresponding multiple of the divisor must be subtracted from the partial residual. Each component $z^i$ can hold three different values, $z^i \in \{-1, 0, +1\}$, that are multiplied with the corresponding weighted divisor $n \cdot d$ with $n \in \{5, 2, 1\}$. In other words, $n \cdot d$ is either passed, negated, or reset. Passing and resetting the multiple of the divisor is performed at no extra cost. Negation is accomplished by bit inversion and adding +1 through the carry inputs of the carry-save adders in (19a), (19b), and (19c). In summary, this is a very fast operation and requires only two LUTs per digit.

Table 2: Constants for type4 quotient digit selection function.

         m^i_pos   m^i_neg   U^i_0(0.4) - m^i_pos   U^i_{-1}(0.8) - m^i_neg
z^3      0.9       -1.5      0.656                  0.611
z^2      0.1       -0.7      0.656                  0.611
z^1      0.1       -0.3      0.255                  0.211
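The conditions (23) and (24) can be re-checked numerically for the constants of Table 2. The sketch below is not part of the original design flow: the helper names (`upper`, `lower`, `check_constants`) are hypothetical, and exact rational arithmetic is used so that the margins can be compared with the table entries.

```python
from fractions import Fraction

RHO = Fraction(8, 9)
# Residual bounds without the factor d, from (20c), (20b), (20a).
BOUND = {3: RHO + 3, 2: RHO + 1, 1: RHO}
# Weight of the divisor multiple for each component (5d, 2d, d).
WEIGHT = {3: 5, 2: 2, 1: 1}
# Selection constants (m_pos, m_neg) as listed in Table 2.
CONSTANTS = {
    3: (Fraction(9, 10), Fraction(-15, 10)),
    2: (Fraction(1, 10), Fraction(-7, 10)),
    1: (Fraction(1, 10), Fraction(-3, 10)),
}


def upper(i, k, d):
    """Upper limit U^i_k(d) according to (21a)-(21c)."""
    return (WEIGHT[i] * k + BOUND[i]) * d


def lower(i, k, d):
    """Lower limit L^i_k(d) according to (22a)-(22c)."""
    return (WEIGHT[i] * k - BOUND[i]) * d


def check_constants(t=2):
    """Return {i: (pos_margin, neg_margin, ok)} for estimates of t digits."""
    eps = 2 * Fraction(1, 10 ** (t - 1))
    d_lo, d_hi = Fraction(4, 10), Fraction(8, 10)
    result = {}
    for i, (m_pos, m_neg) in CONSTANTS.items():
        pos_margin = upper(i, 0, d_lo) - m_pos
        neg_margin = upper(i, -1, d_hi) - m_neg
        ok = (pos_margin >= eps and lower(i, 1, d_hi) <= m_pos      # (23)
              and lower(i, 0, d_lo) <= m_neg and neg_margin >= eps)  # (24)
        result[i] = (pos_margin, neg_margin, ok)
    return result


if __name__ == "__main__":
    for i, (pos, neg, ok) in check_constants().items():
        print(f"z^{i}: {float(pos):.4f} {float(neg):.4f} ok={ok}")
```

With $t = 2$ the error bound is $\epsilon = 0.2$, and the smallest margin, $19/90 \approx 0.211$ for $z^1$, still exceeds it. With $t = 1$ ($\epsilon = 2$) the check fails, which is consistent with the choice of two-digit comparators.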

Similar to type2 and type3 division, we implement the two MSDs of the type4 residual by using a nonredundant radix-2 representation, which requires fewer LUTs and results in a faster quotient digit selection function. The pseudocode for the digit recurrence iteration is shown in Algorithm 6, and the corresponding block diagram is depicted in Figure 7.

4.6. Proposed Decimal Fixed-Point Divider. For each type of division algorithm presented in the preceding sections, we have implemented a corresponding fixed-point divider, which is described in the following. For all fixed-point dividers, we expect the input operands to be normalized.

The block diagram of the type1 divider is depicted in Figure 8. First, the divisor multiples $2d, \ldots, 9d$ are precomputed. In the following cycles, $p = 16$ quotient digits (16 + 1 if the first digit is zero) are computed one by one, followed by an additional rounding digit. The quotient (Z) and the quotient +1 (ZP1) are computed on-the-fly by using two registers that are updated every cycle. The algorithm has the advantage that no additional slow decimal CPA is required to compute the incremented quotient. It is described more precisely in Section 5. The normalization of the result is also performed on-the-fly by locking the conversion when 16 + 1 quotient digits for the significand and the rounding digit have been computed; that is, in the worst case, when the first quotient digit is zero, 16 + 1 + 1 = 18 cycles are required. Furthermore, the sticky bit is calculated, which is required for rounding. It is set to one whenever the final remainder is unequal to zero.

The architectures of the type2, type3.a, type3.b, and type4 dividers are similar in their structure but differ in their digit recurrence algorithm and scaling. The common block diagram is shown in Figure 9. First, the dividend and the divisor are rescaled,

\[ d \in [0.1, 1.0) \;\Longrightarrow\; d' \in \begin{cases} [0.4, 1.0) & \text{for type2, type3.a/b,} \\ [0.4, 0.8) & \text{for type4,} \end{cases} \tag{25} \]

and five as well as two times the divisor are precomputed according to (9) and (10). Then the operands are recoded, because the digit recurrence iteration uses a redundant digit representation with BCD-4221 carry-save adders.

The digit recurrence retires in each iteration one signed quotient digit $z_k$. The on-the-fly-conversion algorithm converts this signed-digit representation to the BCD-8421-coded


Figure 7: Block diagram of type4 digit recurrence (three cascaded stages for $z^3$, $z^2$, and $z^1$, each with two constant comparators built from binary CPAs, an invert/reset unit for the divisor multiple $d \cdot 5$, $d \cdot 2$, or $d \cdot 1$, and a BCD-4221 CSA; the constants are 0.9 and −1.5, 0.1 and −0.7, and 0.1 and −0.3, respectively).

Figure 8: Block diagram of the normalized type1 division (normalized dividend and divisor; computation of the divisor multiples $2 \cdot d, 3 \cdot d, \ldots, 9 \cdot d$; 16 + 1 + 1 digit recurrence steps producing $z_k$ and $w[k]$; on-the-fly quotient computation and sticky bit determination with the outputs Z, ZP1, and sb).

(1) select

\[ z^3_{j+1} = \begin{cases} 1 & \text{if } 0.9 \le 10\hat{w}[j], \\ 0 & \text{if } -1.5 \le 10\hat{w}[j] < 0.9, \\ -1 & \text{if } 10\hat{w}[j] < -1.5, \end{cases} \]

and compute $(v^1_s[j] + v^1_c[j]) = 10 (w_s[j] + w_c[j]) - z^3_{j+1} \cdot 5d$

(2) select

\[ z^2_{j+1} = \begin{cases} 1 & \text{if } 0.1 \le \hat{v}^1[j], \\ 0 & \text{if } -0.7 \le \hat{v}^1[j] < 0.1, \\ -1 & \text{if } \hat{v}^1[j] < -0.7, \end{cases} \]

and compute $(v^2_s[j] + v^2_c[j]) = (v^1_s[j] + v^1_c[j]) - z^2_{j+1} \cdot 2d$

(3) select

\[ z^1_{j+1} = \begin{cases} 1 & \text{if } 0.1 \le \hat{v}^2[j], \\ 0 & \text{if } -0.3 \le \hat{v}^2[j] < 0.1, \\ -1 & \text{if } \hat{v}^2[j] < -0.3, \end{cases} \]

and compute $(w_s[j+1] + w_c[j+1]) = (v^2_s[j] + v^2_c[j]) - z^1_{j+1} \cdot d$

Algorithm 6: Pseudocode for the type4 digit recurrence.
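The three selection steps of Algorithm 6 can be simulated in software to observe that the residual stays bounded. The following is a behavioral sketch only, under several assumptions that are not taken from the hardware design: exact rational arithmetic replaces the carry-save and radix-2 representations, the estimate is modeled as a truncation to one fractional digit (error below 0.1, hence within $\epsilon = 0.2$), the initial residual is taken as one tenth of a dividend in $[0, 1)$, and all function names are hypothetical.

```python
from fractions import Fraction
from math import floor

# (divisor weight, m_pos, m_neg) for z^3, z^2, z^1, constants from Table 2.
CONSTANTS = [
    (5, Fraction(9, 10), Fraction(-15, 10)),
    (2, Fraction(1, 10), Fraction(-7, 10)),
    (1, Fraction(1, 10), Fraction(-3, 10)),
]


def estimate(x):
    """Assumed estimate model: truncate x to one fractional digit."""
    return Fraction(floor(x * 10), 10)


def select(est, m_pos, m_neg):
    """Constant selection function: two comparisons give -1, 0, or 1."""
    if est >= m_pos:
        return 1
    if est >= m_neg:
        return 0
    return -1


def type4_step(w, d):
    """One iteration of (19a)-(19c); returns (z_{j+1}, w[j+1])."""
    v = 10 * w
    digit = 0
    for weight, m_pos, m_neg in CONSTANTS:
        z = select(estimate(v), m_pos, m_neg)
        v -= z * weight * d          # subtract z^i * (5d, 2d, or d)
        digit += z * weight          # z = 5*z^3 + 2*z^2 + z^1
    return digit, v


def type4_divide(w0, d, n):
    """Run n iterations; returns (signed quotient digits, final residual)."""
    digits, w = [], w0
    for _ in range(n):
        z, w = type4_step(w, d)
        digits.append(z)
    return digits, w


if __name__ == "__main__":
    # 0.03 / 0.5 = 0.06, retired as the signed digits 1 and -4 (0.1 - 0.04).
    print(type4_divide(Fraction(3, 100), Fraction(1, 2), 2))
```

In this model the signed digits stay within $[-8, 8]$ and the final residual satisfies $|w| \le (8/9) d$ for the divisors tried in $[0.4, 0.8)$, which is what the convergence condition requires.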

Figure 9: Block diagram of the normalized type2, type3.a, type3.b, and type4 divisions (normalized dividend and divisor; divisor scaling $d \in [0.1, 1) \rightarrow d' \in [0.4, 1)$ for type2 and type3 or $[0.4, 0.8)$ for type4; computation of the divisor multiples $2 \cdot d'$ and $5 \cdot d'$; BCD-8421 to BCD-4221 recoding; 16 + 1 + 1 digit recurrence steps producing $z_k$ and $w[k]$; on-the-fly quotient computation with ZM1', Z', and ZP1'; determination of the sticky bit and the sign of the remainder; a multiplexer selecting the outputs Z, ZP1, and R).

result [20] and computes the quotient (Z), the quotient +1 (ZP1), and the quotient −1 (ZM1). This conversion is accomplished every iteration step and does not need a slow CPA. The incremented and decremented quotients are required for rounding. Moreover, the quotient is also normalized on-the-fly by locking the conversion when 16 + 1 quotient digits (including the rounding digit) have been computed. The algorithm is described explicitly in Section 5 because it is accomplished together with the gradual underflow handling of the floating-point divider.


if ($\overline{\mathrm{qNaN}_X}$ and $\overline{\mathrm{sNaN}_X}$) and ($\mathrm{qNaN}_D$ or $\mathrm{sNaN}_D$) then
    $C'_X = C_D$, $C'_D = 0 \ldots 01$
    $q'_X = q_D$, $q'_D = 0 + \mathrm{bias} = 398$
else if ($\mathrm{qNaN}_X$ or $\mathrm{sNaN}_X$) then
    $C'_X = C_X$, $C'_D = 0 \ldots 01$
    $q'_X = q_X$, $q'_D = 0 + \mathrm{bias} = 398$
else (no changes)
    $C'_X = C_X$, $C'_D = C_D$
    $q'_X = q_X$, $q'_D = q_D$

Algorithm 7: Algorithm of NaN handling.

Furthermore, the sticky bit is calculated, which is required for rounding. It is set to one whenever the final remainder is unequal to zero. Moreover, when the final remainder is negative, the quotient of the type2, type3.a, type3.b, and type4 dividers has to be adjusted by subtracting one LSD. This subtraction does not need another slow CPA because the quotient −1 (ZM1) has already been computed on-the-fly. The calculations of both the sticky and sign bit require the reduction of the redundant remainder by a CPA. This CPA might be subdivided into multiple smaller CPAs to keep the latency low.

5. IEEE 754-2008 Floating-Point Division

The IEEE 754-2008 compliant decimal floating-point divider is an extension of the normalized fixed-point divider. The divider presented in this paper supports the interchange format IEEE 754-2008 decimal64 with DPD encoding, but it can be easily adapted to any other precision and exponent range.

The block diagram of the divider is depicted in Figure 10. The DFP division begins with decoding of the dividend X and divisor D and the extraction of their signs, significands, and exponents. In the following, the significands of the operands X ($c_X$) and D ($c_D$) are regarded as integers. Therefore, the corresponding exponents are $q_X$ and $q_D$.

According to [4], if one of the operands is a signaling NaN (sNaN) or quiet NaN (qNaN), then the result is also a quiet NaN with the payload of the original NaN. Hence, in order to preserve the payload while using an unmodified divider, the NaN handling unit sets the NaN holding operand as the dividend and resets the divisor to $1.0 \cdot 10^0$, as depicted in Algorithm 7.

The fixed-point dividers presented in Section 4 require normalized operands. Therefore, the number of leading zeros for both operands is counted, and the significands are normalized by barrel shifters. Due to performance issues, the leading zeros counters exploit the FPGA's fast carry chains, as proposed in [23].

The exponent of a floating-point division is first estimated by the exponent of the normalized quotient in fractional representation (QNF) and is updated iteratively in each digit recurrence step. Since most of the decimal floating-point numbers specified by IEEE 754-2008 have multiple representations of the same value, the result of a division might also have more than one correct representation, each with a different exponent. However, to obtain a unique and reproducible result in the case of such an ambiguity, IEEE 754-2008 defines the exponent of the result, which is called the preferred exponent (QP).

Both exponents QNF and QP are positive integers and are determined as follows:

\[ \Delta q = (q_X - q_D), \qquad \Delta \mathrm{LZ} = (\mathrm{LZ}_X - \mathrm{LZ}_D), \tag{26} \]
\[ \mathrm{QNF} = \Delta q - \Delta \mathrm{LZ} + \mathrm{offset1}, \qquad \mathrm{QP} = \Delta q + \mathrm{offset1}. \tag{27} \]

The offset1 is a bias and ensures that both values are positive, since unsigned integer calculation is used in this paper. The bias is composed of

\[ \begin{aligned} \mathrm{offset1} = 783 ={} & 398 && \text{(IEEE 754-2008 bias)} \\ {}+{} & 369 && (q_{\max}) \\ {}+{} & 15 && \text{(conversion fractional to integer)} \\ {}+{} & 1 && \text{(normalization of the result).} \end{aligned} \tag{28} \]
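As a small worked example, (26) to (28) translate directly into integer arithmetic. The function name `exponent_estimates` and its argument names are hypothetical; the constants are the ones given in (28).

```python
# offset1 as composed in (28): IEEE 754-2008 bias + q_max
# + fractional-to-integer conversion + normalization of the result.
OFFSET1 = 398 + 369 + 15 + 1


def exponent_estimates(q_x, q_d, lz_x, lz_d):
    """Return (QNF, QP) according to (26) and (27)."""
    delta_q = q_x - q_d        # exponent difference
    delta_lz = lz_x - lz_d     # difference of the leading-zero counts
    return delta_q - delta_lz + OFFSET1, delta_q + OFFSET1


if __name__ == "__main__":
    print(OFFSET1)
    print(exponent_estimates(5, -3, 2, 0))
```

For equal exponents and equal leading-zero counts both values reduce to offset1 itself.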

The normalized fixed-point division unit then computes $p = 16$ significand digits and one additional rounding digit. Additionally, the divider encompasses the on-the-fly conversion unit that also detects and handles gradual underflow with zero delay overhead. One signed quotient digit $z_j$ is retired in each cycle, and the on-the-fly-conversion algorithm converts this signed-digit representation into the BCD-8421-coded partial result [20]. The conversion is accomplished every iteration step and does not need a slow CPA (see Algorithm 8). The partial quotient is stored in a register Z'. Moreover, two additional registers ZM1' and ZP1' are provided. ZM1' is the partial quotient decremented by one least significant digit (LSD), and ZP1' is the partial quotient incremented by one LSD. ZM1' is selected when the final residual in the last iteration is negative, and ZP1' is required by the following rounding operation. Furthermore, the exponent for the normalized significand in integer representation (QNI) is calculated (see (29)), and gradual underflow handling is performed at no extra cost (see Algorithm 8). The exponent QNI is computed by decrementing the exponent of the normalized significand in fractional representation (QNF) by one in each iteration step. The recurrence iteration terminates earlier in case of gradual underflow. The gradual underflow signal (GUF) is asserted high when $\mathrm{QNI} = q_{\min}$ and the calculated integer quotient is less than $10^p$:

\[ \mathrm{QNI} = \begin{cases} q_{\min} & \text{if } \mathrm{GUF} = 1, \\ \mathrm{QNF} - (p - 1) & \text{if } z_1 > 0, \\ \mathrm{QNF} - p & \text{if } z_1 = 0. \end{cases} \tag{29} \]

The extended quotient (composed of the calculated quotientand the rounding digit) must be decremented by one if the


Figure 10: Block diagram of the floating-point divider (DPD decoders for both operands; NaN handling; leading zeros count with QNF and QP calculations; dividend and divisor normalization; normalized fixed-point division; on-the-fly conversion with gradual underflow handling, producing GUF, QNI, TZ, Z, ZP1, R, and sb; exponent calculation; rounding unit; exception unit signaling overflow, underflow, gradual underflow, inexact, invalid, and division by zero; DPD encoder for the quotient).

final remainder is less than zero. The sticky bit (sb) is used for rounding and is asserted high whenever the remainder is unequal to zero.

Moreover, the number of trailing zeros (TZ) is computed on-the-fly. This number indicates by how many digits the computed quotient might be shifted to the right without losing accuracy. The number of trailing zeros has to be counted only for Z' because, if ZM1' is selected, the residual will be unequal to zero and TZ = 0 anyway. The number of trailing zeros is used for the selection between the preferred exponent (QP) and the normalized exponent for integer representation (QNI), as described in the following paragraph.

The exponent calculation unit selects either QP or QNI. The preferred exponent should be selected whenever possible. This is only feasible when $\mathrm{QNI} < \mathrm{QP} \le \mathrm{QNI} + \mathrm{TZ}$, that is, when there is a sufficient number of trailing zeros to shift the significand to the right. Furthermore, the final exponent must satisfy the minimum and maximum exponent range. The algorithm of the exponent calculation is illustrated in Algorithm 9, where the value

\[ \mathrm{offset2} = 783 - \mathrm{bias} = 783 - 398 = 385 \tag{30} \]

is used to add the IEEE 754-2008 bias and to remove offset1 again (which was introduced in (27)).
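The selection between QP and QNI listed in Algorithm 9 can be sketched as a plain function. Several points are assumptions rather than statements of the paper: the name `select_exponent` and the tuple return format are hypothetical, the operator ≫ is interpreted as a right shift by decimal digits, the limits `q_min` and `q_max` are supplied by the caller in the same offset domain as QNI and QP, and in the second branch case (a) is tested before case (b).

```python
def select_exponent(qp, qni, tz, z, zp1, q_min, q_max, offset2=385):
    """Return (status, q_Z, Z_e, ZP1_e) following the cases of Algorithm 9."""
    if qp <= qni or qni + tz < qp or qp > q_max:
        # The preferred exponent is not reachable: keep QNI.
        if qni > q_max:
            return ("overflow", None, None, None)
        if qni < q_min:
            return ("underflow", None, None, None)
        return ("ok", qni - offset2, z, zp1)
    # Here QNI < QP <= QNI + TZ and QP <= q_max hold.
    if qp >= q_min:
        shift = 10 ** (qp - qni)          # assumed decimal digit shift
        return ("ok", qp - offset2, z // shift, zp1 // shift)
    return ("underflow", None, None, None)


if __name__ == "__main__":
    # Three trailing zeros allow the shift by two digits to reach QP.
    print(select_exponent(800, 798, 3, 5000, 5001, q_min=385, q_max=1152))
```

The numeric limits in the usage line are illustrative values only.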

$\mathrm{QNI}' = \mathrm{QNF} + 1$; $Z' = \mathrm{ZM1}' = \mathrm{ZP1}' = 0$; $\mathrm{TZ}' = 0$; $j = 0$
while $(Z' < 10^p)$ and $(\mathrm{QNI}' \ge q_{\min})$

\[ Z' = \begin{cases} 10 \cdot Z' + z_{j+1} & \text{if } z_{j+1} \ge 0, \\ 10 \cdot \mathrm{ZM1}' + (z_{j+1} + 10) & \text{if } z_{j+1} < 0, \end{cases} \]
\[ \mathrm{ZM1}' = \begin{cases} 10 \cdot Z' + (z_{j+1} - 1) & \text{if } z_{j+1} > 0, \\ 10 \cdot \mathrm{ZM1}' + (z_{j+1} + 9) & \text{if } z_{j+1} \le 0, \end{cases} \]
\[ \mathrm{ZP1}' = \begin{cases} 10 \cdot Z' + (z_{j+1} + 1) & \text{if } z_{j+1} \ge -1, \\ 10 \cdot \mathrm{ZM1}' + (z_{j+1} + 11) & \text{if } z_{j+1} < -1, \end{cases} \]
\[ \mathrm{TZ}' = \begin{cases} \mathrm{TZ}' + 1 & \text{if } z_{j+1} = 0, \\ 0 & \text{else,} \end{cases} \]

    $\mathrm{QNI}' = \mathrm{QNI}' - 1$
    $j = j + 1$

$\mathrm{QNI} = \mathrm{QNI}' + 1$ (remove the impact of the rounding digit)

\[ R' = \begin{cases} z_{j+1} - 1 & \text{if remainder} < 0, \\ z_{j+1} & \text{else,} \end{cases} \]

$\mathrm{ZM1}'' = \lfloor \mathrm{ZM1}' \cdot 10^{-1} \rfloor$; $Z'' = \lfloor Z' \cdot 10^{-1} \rfloor$; $\mathrm{ZP1}'' = \lfloor \mathrm{ZP1}' \cdot 10^{-1} \rfloor$

\[ (Z, \mathrm{ZP1}, R) = \begin{cases} (\mathrm{ZM1}'', Z'', R' + 10) & \text{if } R' < 0, \\ (Z'', \mathrm{ZP1}'', R') & \text{else,} \end{cases} \]
\[ \mathrm{sb} = \begin{cases} 0 & \text{if remainder} = 0, \\ 1 & \text{else,} \end{cases} \]
\[ \mathrm{GUF} = \begin{cases} 1 & \text{if } \mathrm{QNI} = q_{\min} \text{ and } Z' < 10^p \text{ and } (\mathrm{sb} = 1 \text{ or } R \ne 0), \\ 0 & \text{else,} \end{cases} \]

if GUF = 1 then underflow

\[ \mathrm{TZ} = \begin{cases} 0 & \text{if } \mathrm{sb} = 1, \\ \mathrm{TZ}' & \text{else.} \end{cases} \]

Algorithm 8: On-the-fly conversion with gradual underflow handling.
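The register updates for Z', ZM1', and ZP1' in Algorithm 8 only append one digit to either the previous Z' or the previous ZM1', so no carry ever propagates. The sketch below illustrates just this conversion part. It is a simplified model with hypothetical names (`on_the_fly`, `value`): digit lists stand in for the BCD registers, and normalization, the exponent update, and gradual underflow handling are left out.

```python
def on_the_fly(digits):
    """Convert signed digits (-8..8, most significant first) into the
    digit lists of Z, Z - 1 (ZM1), and Z + 1 (ZP1) without carry propagation."""
    z, zm1, zp1 = [], [], []
    for q in digits:
        assert -8 <= q <= 8
        # All three updates read the OLD z and zm1 (simultaneous register load).
        new_z = z + [q] if q >= 0 else zm1 + [q + 10]
        new_zm1 = z + [q - 1] if q > 0 else zm1 + [q + 9]
        new_zp1 = z + [q + 1] if q >= -1 else zm1 + [q + 11]
        z, zm1, zp1 = new_z, new_zm1, new_zp1
    return z, zm1, zp1


def value(digit_list):
    """Interpret a list of decimal digits as a nonnegative integer."""
    return int("".join(map(str, digit_list)) or "0")


if __name__ == "__main__":
    # 1*10^5 - 4*10^4 + 0 + 8*10^2 - 8*10 + 3 = 60723
    z, zm1, zp1 = on_the_fly([1, -4, 0, 8, -8, 3])
    print(value(zm1), value(z), value(zp1))
```

Every appended digit lies in 0 to 9 because the signed digits are limited to $[-8, 8]$. A digit set that includes 9 would need a different rule for ZP1'.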

Rounding is required when the number of significant digits of the quotient exceeds the length $p$ of a decimal word. The divider presented in this paper provides the five rounding modes as requested by [4]. Rounding either selects the integer quotient $Z_e$ or the incremented quotient $\mathrm{ZP1}_e = Z_e + 1$. As proven by Theorem A.1 in the appendix, rounding overflow cannot occur in DFP division. Rounding overflow would occur if $Z_e + 1$ overflows due to rounding. The selection of the quotient depends on the rounding mode, the rounding digit (R), the sticky bit (sb), and the least significant digit (LSD) of the quotient. The calculation of the round-up detection is summarized in Table 3.
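The round-up detection can be written as a small Boolean function of the rounding digit, the sticky bit, the parity of the LSD, and the sign. This is a sketch under assumptions: the mode strings and the function name `round_up` are hypothetical, `sign` is taken to be true for a negative result, and LSD(0) is read as the least significant bit of the LSD (true for an odd digit).

```python
def round_up(mode, r, sb, lsd_odd, sign):
    """Return True if the incremented quotient ZP1 has to be selected.

    r: rounding digit (0..9), sb: sticky bit, lsd_odd: bit 0 of the LSD,
    sign: True for a negative result (assumed convention).
    """
    inexact = r > 0 or sb
    if mode == "ties_to_even":
        return bool(r > 5 or (r == 5 and sb) or (r == 5 and not sb and lsd_odd))
    if mode == "ties_to_away":
        return r >= 5
    if mode == "toward_positive":
        return bool(not sign and inexact)   # only positive results grow
    if mode == "toward_negative":
        return bool(sign and inexact)       # only negative results grow
    if mode == "toward_zero":
        return False                        # truncation never rounds up
    raise ValueError(f"unknown rounding mode: {mode}")


if __name__ == "__main__":
    # Exact tie with an odd LSD rounds up to the even neighbor.
    print(round_up("ties_to_even", 5, False, True, False))
```

An exact tie (R = 5, sb = 0) rounds up only when the LSD is odd, which yields an even result.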

Finally, the result is encoded again, and the exception unit might assert six exception signals. These are division by zero, invalid operation, result is infinite, inexact, overflow, and underflow. Division by zero is asserted when the divisor equals zero but the dividend is unequal to zero. The operation is called invalid when both operands are zero, both operands are infinity, or any of the operands is either a signaling or a quiet NaN. The inexact flag is asserted when the


Table 3: Round-up detection.

Rounding mode            Round-up detection
Round Ties To Even       $(R > 5) + (R \equiv 5) \cdot \mathrm{sb} + (R \equiv 5) \cdot \overline{\mathrm{sb}} \cdot \mathrm{LSD}(0)$
Round Ties To Away       $(R \ge 5)$
Round Toward Positive    $\overline{\mathrm{sign}} \cdot ((R > 0) + \mathrm{sb})$
Round Toward Negative    $\mathrm{sign} \cdot ((R > 0) + \mathrm{sb})$
Round Toward Zero        $0$

Legend: "+" logical OR, "·" logical AND.

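$\mathrm{offset2} = 783 - 398 = 385$
if $(\mathrm{QP} \le \mathrm{QNI})$ or $(\mathrm{QNI} + \mathrm{TZ} < \mathrm{QP})$ or $(\mathrm{QP} > q_{\max})$ then
    (a) $\mathrm{QNI} > q_{\max} \Rightarrow$ overflow
    (b) $q_{\min} \le \mathrm{QNI} \le q_{\max} \Rightarrow$
        $q_Z = \mathrm{QNI} - \mathrm{offset2}$
        $Z_e = Z$
        $\mathrm{ZP1}_e = \mathrm{ZP1}$
    (c) $\mathrm{QNI} < q_{\min} \Rightarrow$ underflow
else
    (a) $q_{\min} \le \mathrm{QP} \le q_{\max} \Rightarrow$
        $q_Z = \mathrm{QP} - \mathrm{offset2}$
        $Z_e = Z \gg (\mathrm{QP} - \mathrm{QNI})$
        $\mathrm{ZP1}_e = \mathrm{ZP1} \gg (\mathrm{QP} - \mathrm{QNI})$
    (b) $\mathrm{QNI} < q_{\min} \Rightarrow$ underflow

Algorithm 9: Exponent calculation.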

rounding digit or the sticky bit is unequal to zero Overflowand underflow are computed as described in the exponentcalculation unit and are listed in Algorithms 8 and 9 Theinfinity exception is asserted when the result is greater thanthe largest number that is when either overflow or divisionby zero exception is asserted or when the dividend is infinitywhile the divisor is a finite number

6. Implementation Results

All dividers are modeled using VHDL and are implemented for Xilinx Virtex-5 devices with speed grade −2 using Xilinx ISE 10.1. The post-place and route results of the fixed-point dividers are listed in Table 4.

All five types of fixed-point dividers require 19 cycles to perform a division. This includes 17 cycles to determine the quotient digits and the rounding digit, one additional cycle to normalize the result (when the first quotient digit is zero), and one cycle of latency to perform on-the-fly conversion and sticky bit determination. As expected, the type1 divider uses the most resources in terms of look-up tables (LUTs) and flip-flops (FFs). The type2, type3.a, and type3.b dividers consume fewer LUTs but additionally require five (type2) or two (type3.a and type3.b) BRAMs. However, comparing the delays of the dividers leads to an unexpected result: contrary to decimal divider implementations in ASIC designs, on FPGA platforms the shift-and-subtract algorithm is the fastest. The reason is that signal propagation on the FPGA's internal fast carry chains is faster than interconnections between slices over the FPGA's general routing matrix. The type1 divider exploits these fast carry chains, whereas the type2, type3.a, type3.b, and type4 dividers only use the normal, slow slice interconnection resources. In ASIC implementations, for instance, the longest paths of the type4 divider would be much shorter than the longest path of the type1 divider, but on FPGA architectures the situation is the opposite.

One of the fastest decimal fixed-point dividers on ASICs is the design of Lang and Nannarelli [14]. They minimize the critical path in the digit recurrence by using a fast quotient digit selection function based on binary carry-propagate adders that subtract fixed values from the estimation of the current residual. These fixed values depend on estimations of the divisor and are precomputed and stored in a ROM. Therefore, the latency of the ROM does not contribute to the critical path of the digit recurrence. The implementation of such a divider on FPGAs would have a poor performance because the required carry-save adder with small error estimation cannot be implemented efficiently on the FPGA's slice structure. However, the concept of removing the ROM from the critical path is applied to the type3.a and type3.b dividers, and the corresponding post-place and route results are shown in Table 4. These results show that the type3.a divider utilizes a similar number of LUTs as the type2 divider, but its cycle time is much higher (10.1 ns versus 8.5 ns). The increased cycle time can be explained by the raised complexity of the multiplexers for the selection of the divisor's multiples in the digit recurrence step. These multiplexers show a poor performance in terms of propagation delay because they are implemented as a tree with slow slice interconnections. In contrast, the type3.b divider applies dedicated fast multiplexers, which have a propagation delay of only one LUT instance and eight fast carry chains. This dedicated fast multiplexer speeds up the divider and reduces the maximum cycle time by one nanosecond. Unfortunately, the number of used LUTs increases dramatically, by approximately 1000 LUTs.

The comparison of the type2, type3.a, type3.b, and type4 dividers in terms of speed and area (number of occupied LUTs and FFs combined) shows that the type2 algorithm is the fastest and requires the least number of slices. Hence, there is no benefit in removing the ROM's latency from the critical path because the complexity is moved to the selection of the divisor's multiples. The type1 divider is only 5% faster than the type2 divider, but the speed is bought at a high price because the number of occupied slices is 75% higher. Therefore, the type2 divider shows a good tradeoff in terms of area and latency and is used for the floating-point divider presented in this paper. In the following paragraph, we compare the type2 implementation with other published fixed-point dividers.
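The derived figures quoted in this comparison can be recomputed from the raw rows of Table 4. The following sketch is only a worked check of that arithmetic; the dictionary and function names are illustrative and not part of the original design flow.

```python
CYCLES_PER_DIVISION = 19  # fixed-point dividers, see Table 4

# Raw post-place and route values copied from Table 4.
luts_and_ffs = {"type1": 3868, "type2": 2210, "type3.a": 2304,
                "type3.b": 3240, "type4": 2203}
cycle_time_ns = {"type1": 8.1, "type2": 8.5, "type3.a": 10.1,
                 "type3.b": 9.1, "type4": 12.1}


def latency_ns(name):
    """Overall latency: number of cycles times the cycle time."""
    return CYCLES_PER_DIVISION * cycle_time_ns[name]


def max_frequency_mhz(name):
    """Maximum clock frequency derived from the cycle time."""
    return 1000.0 / cycle_time_ns[name]


def area_relative_to_type2(name):
    """Combined LUT and FF count normalized to the type2 divider."""
    return luts_and_ffs[name] / luts_and_ffs["type2"]


def speed_relative_to_type2(name):
    """Clock frequency normalized to the type2 divider."""
    return cycle_time_ns["type2"] / cycle_time_ns[name]
```

With these numbers, `speed_relative_to_type2("type1")` is about 1.05 and `area_relative_to_type2("type1")` is about 1.75, which matches the 5% and 75% quoted in the text.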

Four other FPGA-based dividers are presented by Ercegovac and McIlhenny [17, 18], Deschamps and Sutter [24], Zhang et al. [25], and Vestias and Neto [12]. These implementations are compared to our type2 divider in the following. The divider presented in [17, 18] is based on a digit recurrence algorithm that only requires limited-precision multipliers,


Table 4: Results for normalized decimal fixed-point dividers, $p = 16 + 1$.

| | Type1 | Type2 | Type3.a | Type3.b | Type4 |
|---|---|---|---|---|---|
| Number of LUTs | 3595 | 1704 | 1749 | 2751 | 1846 |
| Number of FFs | 1234 | 1126 | 1262 | 1262 | 1031 |
| Number of LUTs and FFs combined | 3868 | 2210 | 2304 | 3240 | 2203 |
| Number of 36 k BRAMs | 0 | 5 | 2 | 2 | 0 |
| Number of LUTs normalized to the type2 divider | 2.11 | 1.00 | 1.03 | 1.61 | 1.08 |
| Number of FFs normalized to the type2 divider | 1.10 | 1.00 | 1.12 | 1.12 | 0.92 |
| Number of LUTs and FFs combined normalized to the type2 divider | 1.75 | 1.00 | 1.04 | 1.47 | 1.00 |
| Cycle time (ns) | 8.1 | 8.5 | 10.1 | 9.1 | 12.1 |
| Overall latency (ns) (19 × cycle time) | 154 | 162 | 192 | 173 | 230 |
| Max frequency (MHz) | 123 | 118 | 99 | 110 | 83 |
| Max frequency normalized to the type2 divider | 1.05 | 1.00 | 0.84 | 0.93 | 0.70 |

Table 5: Performance comparison of fixed-point dividers.

| Design | Occupied area | Cycle time (ns) | Latency (ns) |
|---|---|---|---|
| Ercegovac and McIlhenny [17] ($p = 14$, Virtex-5) | 1263 (6, 2) LUTs | 13.1 | 197 |
| Type2 divider equalized to $p = 14$ | 1692 (6, 2) LUTs | 8.5 | 136 |
| Deschamps and Sutter [24] ($p = 16$, Virtex-4), (i) nonrestoring algorithm | 2974 (4, 1) LUTs | 21.4 | 386 |
| Deschamps and Sutter [24] ($p = 16$, Virtex-4), (ii) SRT-like algorithm | 3799 (4, 1) LUTs | 16.6 | 300 |
| Zhang et al. [25] ($p = 16$, Virtex-II Pro) | 3976 slices | 20.0 | 420 |
| Vestias and Neto [12] ($p = 16$, Virtex-4, Newton-Raphson), (i) Type A3 (0 DSPs, 118 cycles) | 2756 (4, 1) LUTs | 3.4 | 394 |
| Vestias and Neto [12], (ii) Type A4 (0 DSPs, 90 cycles) | 3768 (4, 1) LUTs | 3.4 | 306 |
| Vestias and Neto [12], (iii) Type A5 (7 DSPs, 112 cycles) | 2091 (4, 1) LUTs | 3.4 | 380 |
| Vestias and Neto [12], (iv) Type A6 (10 DSPs, 86 cycles) | 2718 (4, 1) LUTs | 3.4 | 292 |
| Type2 divider equalized to $p = 16$ | 1704 (6, 2) LUTs | 8.5 | 153 |

adders, and LUTs. Furthermore, a compensation term is computed in the digit recurrence that compensates the error caused by this limited precision. The design is optimized for Virtex-5 FPGAs and has a good area characteristic (a 14-digit divider requires 1263 LUTs) but suffers from a high cycle time of 13.1 ns, which is more than 50% higher than that of the type1 or type2 design proposed in this paper. The reason for that high cycle time is (similar to the type4 divider) the long critical paths across many LUTs that cannot exploit the FPGA's fast carry chains. Therefore, the design would probably fit better on ASICs with dedicated routing. The dividers implemented in [24] apply two different digit recurrence algorithms with a redundancy of $\rho = 1$. The quotient digit selection function is very complex because it requires three MSDs of the divisor and four MSDs of the residual. The cycle times are high, but it must be taken into account that these designs are optimized for Virtex-4 FPGAs. Nevertheless, the number of used LUTs is very high. The divider presented in [25] applies a radix-100 digit recurrence algorithm with comprehensive prescaling of the dividend and divisor. It is optimized for Virtex-II Pro devices; hence, it is hard to compare it with the dividers presented in this paper. However, the critical path includes a decimal carry-save adder tree, two carry-propagate adder stages, and three multiplexer stages. It can therefore be assumed that this algorithm would also have a poor performance on Virtex-5 devices, although two quotient digits are retired in each iteration step. Vestias and Neto [12] propose four different decimal dividers optimized for different speed and area tradeoffs. Contrary to the circuits presented in this paper, the dividers apply the Newton-Raphson algorithm. The used multiplier is sequential; therefore, many cycles per division are required. However, each divider has a higher delay and a greater LUT usage compared to our type2 implementation.

Table 5 lists the results of the dividers presented in [12, 17, 24, 25] together with the results of the type2 divider. For a fair performance comparison, the precision of the type2 divider is equalized to match the number of calculated quotient digits of the corresponding divider. Furthermore, it should be noted that the dividers presented in [12, 17, 24, 25] do not provide rounding support.

For the floating-point implementation, the fixed-point divider of type2 is used because type1 and type3.b consume too many resources and type3.a and type4 are too slow. However, five additional BRAMs are required, but these are available in sufficient quantities on Virtex-5 devices. The corresponding post-place and route results for two configurations with different numbers of serialization stages are listed in Table 6. A comparison with other implementations is


Table 6: Post-place and route result of the floating-point divider based on type2.

| | Config. 1 | Config. 2 |
|---|---|---|
| Number of cycles per float division | 23 | 21 |
| Number of LUTs | 3231 | 3205 |
| Number of FFs | 1630 | 1377 |
| Number of LUTs and FFs combined | 3571 | 3566 |
| Cycle time (ns) | 8.5 | 9.1 |
| Overall latency (ns) | 195.5 | 191 |
| Max frequency (MHz) | 118 | 110 |

Table 7: Post-place and route result of a 64-bit binary floating-point divider (Core Gen).

| Cycles per division | 1 | 10 | 20 | 55 |
|---|---|---|---|---|
| No. of LUTs | 3161 | 592 | 425 | 394 |
| No. of FFs | 196 | 518 | 469 | 542 |
| No. of LUTs and FFs comb. | 3232 | 917 | 675 | 675 |
| Cycle time (ns) | 153.7 | 20.8 | 8.9 | 6.8 |
| Overall latency (ns) | 153.7 | 208 | 178 | 374 |
| Max frequency (MHz) | 6.5 | 48 | 112 | 147 |

complicated because there are no other FPGA-based implementations of decimal floating-point dividers published yet. Therefore, we can only compare the decimal divider with binary dividers implemented on the same FPGA. Table 7 lists post-place and route results of binary floating-point dividers for the double data format on a Virtex-5, provided by the Xilinx Core Generator [26]. The dividers also provide the exception signals overflow, underflow, invalid operation, and divide-by-zero. Unfortunately, the binary floating-point divider does not support gradual underflow because this feature increases the complexity. The divider with one cycle per division is an unpipelined and fully parallel design, and the divider that requires 55 cycles per division is a fully sequential design. The binary floating-point divider with 20 cycles per division is best suited for a comparison with the decimal floating-point divider because it requires a comparable number of cycles for one operation. Decimal arithmetic has a large overhead regarding the total number of used LUTs (3231 versus 425), but the cycle time and latency are similar. This resource overhead can be explained, on the one hand, by the inefficient representation of decimal values in binary logic and, on the other hand, by the more complex specification of the decimal standard IEEE 754-2008. For instance, decimal floating-point division requires additional encoders, decoders, and a normalization stage.

7. Conclusion

In this paper we have presented five different radix-10 digit recurrence division algorithms. The type1 divider implements the simple shift-and-subtract algorithm. The type2 divider is based on a nonrestoring algorithm with a ROM-based quotient digit selection function. The type3.a and type3.b dividers are both similar to type2 but use fast binary carry-propagate adders in the quotient digit selection function. However, the type3.a divider uses LUT-based multiplexers, whereas the type3.b divider uses fast carry chain-based multiplexers. The type4 divider applies a new algorithm with constant digit selection functions. This type4 divider requires neither a ROM nor multiplexers to select multiples of the divisor.

We have shown that the shift-and-subtract algorithm, which has the worst latency in ASIC architectures, is the fastest design on FPGAs but uses the most FPGA resources in terms of LUTs. A good tradeoff between latency and area is the type2 architecture with a redundant carry-save representation of the residual, a radix-2 representation of the two MSDs, and a quotient digit selection function implemented in ROM. Furthermore, we have extended this fixed-point divider to a fully IEEE 754-2008 compliant decimal floating-point divider for the decimal64 data format, and finally we have shown implementation results of all dividers.

Appendix

Proofs

Theorem A.1 (rounding overflow). Let us consider a decimal floating-point division $Q = X/D$ with the signs $s_i$, the significands $c_i$, the exponents $q_i$, and the precision $p$:

$$(-1)^{s_Q} \cdot c_Q \cdot 10^{q_Q} = \frac{(-1)^{s_X} \cdot c_X \cdot 10^{q_X}}{(-1)^{s_D} \cdot c_D \cdot 10^{q_D}}. \tag{A.1}$$

Then rounding overflow, that is, an overflow that arises due to adding +1 to the least significant digit of the final result, cannot occur.

Proof. Assume to the contrary that there are a floating-point dividend $X$, a floating-point divisor $D$, and a proper rounding mode where the decimal floating-point division produces a rounding overflow. Without loss of generality, we consider in the following only positive operands ($s_X = s_D = 0$), normalized significands ($c_X, c_D \ge 10^{p-1}$), and the exponents to be zero ($q_X = q_D = 0$). In the case of rounding overflow, the $p$ digits of the quotient must be all nines and the remainder must be unequal to zero, that is,

$$\underbrace{c_X}_{p\ \text{digits}} = c_Q \cdot c_D + \mathrm{rem} = \underbrace{9.9\cdots9}_{p\ \text{digits}} \cdot \underbrace{c_D}_{p\ \text{digits}} + \underbrace{\mathrm{rem}}_{p\ \text{digits}} \tag{A.2}$$

with

$$0 < \mathrm{rem} < c_D \cdot 10^{-p+1}. \tag{A.3}$$

Condition (A.2) is fulfilled for

$$c_D = 10^{p-1}, \qquad c_X = \underbrace{99\cdots9}_{p\ \text{digits}}, \tag{A.4}$$

but this violates condition (A.3) because the remainder is zero ($\mathrm{rem} = 0$); that is, the result is exact, no round-up is applied, and hence no rounding overflow occurs.


Otherwise, if the divisor is greater than $10^{p-1}$, it follows that

$$\begin{aligned}
\underbrace{c_X}_{p\ \text{digits}} &= (9.9\cdots9) \cdot c_D + \mathrm{rem} \\
&= (10 - 10^{-p+1}) \cdot c_D + \mathrm{rem} \\
&= 10 \cdot c_D + \underbrace{(-10^{-p+1} \cdot c_D + \mathrm{rem})}_{<0} \\
&= \underbrace{10 \cdot c_D + \mu}_{\ge p+1\ \text{digits}}
\end{aligned} \tag{A.5}$$

with $-10 < \mu < 0$. The left term of the equation has, by definition, a precision of $p$ digits, while the right term of the equation has a precision of more than $p$ digits. This contradiction finally proves that no rounding overflow can occur.
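For small precisions, the statement of Theorem A.1 can also be checked exhaustively. The sketch below is an illustrative brute-force test, not part of the proof: it assumes integer significands with exactly $p$ digits, scales the dividend so that the truncated quotient has exactly $p$ digits, and looks for an all-nines quotient with a nonzero remainder, which is the only situation in which a round-up could overflow. The function name is hypothetical.

```python
def rounding_overflow_possible(p):
    """Search all normalized p-digit significand pairs for a rounding overflow.

    A rounding overflow needs a truncated p-digit quotient of 99...9 together
    with a nonzero remainder (otherwise the result is exact and no round-up
    is applied in any rounding mode).
    """
    lo, hi = 10 ** (p - 1), 10 ** p          # normalized range [10^(p-1), 10^p)
    for c_d in range(lo, hi):
        for c_x in range(lo, hi):
            # Scale the dividend so that the integer quotient has p digits:
            # c_x / c_d lies in [1, 10) if c_x >= c_d, else in (0.1, 1).
            scale = 10 ** (p - 1) if c_x >= c_d else 10 ** p
            quotient, remainder = divmod(c_x * scale, c_d)
            if quotient == hi - 1 and remainder != 0:
                return True
    return False
```

The search is quadratic in $10^p$, so it is only practical for very small $p$; it returns `False` for those cases, in agreement with the theorem.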

References

[1] M. Baesler, S. Voigt, and T. Teufel, "FPGA implementations of radix-10 digit recurrence fixed-point and floating-point dividers," in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig '11), pp. 13–19, IEEE Computer Society, Los Alamitos, CA, USA, December 2011.

[2] IEEE Task P754, ANSI/IEEE 754-1985, Standard for Binary Floating-Point Arithmetic, New York, NY, USA, August 1985.

[3] M. F. Cowlishaw, "Decimal floating-point: algorism for computers," in Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH '03), pp. 104–111, IEEE Computer Society, Washington, DC, USA, June 2003.

[4] IEEE Task P754, IEEE 754-2008, Standard for Floating-Point Arithmetic, New York, NY, USA, August 2008.

[5] ANSI/IEEE, ANSI/IEEE Std 854-1987: An American National Standard, IEEE Standard for Radix-Independent Floating-Point Arithmetic, New York, NY, USA, October 1987.

[6] A. Y. Duale, M. H. Decker, H. G. Zipperer, M. Aharoni, and T. J. Bohizic, "Decimal floating-point in z9: an implementation and testing perspective," IBM Journal of Research and Development, vol. 51, no. 1-2, pp. 217–227, 2007.

[7] C. F. Webb, "IBM z10: the next-generation mainframe microprocessor," IEEE Micro, vol. 28, no. 2, pp. 19–29, 2008.

[8] L. Eisen, J. W. Ward, H. W. Tast, et al., "IBM POWER6 accelerators: VMX and DFU," IBM Journal of Research and Development, vol. 51, no. 6, pp. 663–683, 2007.

[9] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd, "Power7: IBM's next-generation server processor," IEEE Micro, vol. 30, no. 2, pp. 7–15, 2010.

[10] S. F. Oberman and M. J. Flynn, "Division algorithms and implementations," IEEE Transactions on Computers, vol. 46, no. 8, pp. 833–854, 1997.

[11] L. K. Wang and M. J. Schulte, "A decimal floating-point divider using Newton-Raphson iteration," Journal of VLSI Signal Processing Systems, vol. 49, no. 1, pp. 3–18, 2007.

[12] M. Vestias and H. Neto, "Revisiting the Newton-Raphson iterative method for decimal division," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '11), pp. 138–143, IEEE Computer Society Press, September 2011.

[13] H. Nikmehr, B. Phillips, and C. C. Lim, "Fast decimal floating-point division," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 9, pp. 951–961, 2006.

[14] T. Lang and A. Nannarelli, "A radix-10 digit-recurrence division unit: algorithm and architecture," IEEE Transactions on Computers, vol. 56, no. 6, pp. 727–739, 2007.

[15] A. Vazquez, E. Antelo, and P. Montuschi, "A radix-10 SRT divider based on alternative BCD codings," in Proceedings of the 25th IEEE International Conference on Computer Design (ICCD '07), pp. 280–287, IEEE Computer Society Press, Los Alamitos, CA, USA, October 2007.

[16] E. Schwarz and S. Carlough, "Power6 decimal divide," in Proceedings of the 18th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '07), pp. 128–133, IEEE Computer Society, July 2007.

[17] M. D. Ercegovac and R. McIlhenny, "Design and FPGA implementation of radix-10 algorithm for division with limited precision primitives," in Proceedings of the 42nd Asilomar Conference on Signals, Systems and Computers (ASILOMAR '08), pp. 762–766, IEEE Computer Society, Pacific Grove, Calif, USA, October 2008.

[18] M. D. Ercegovac and R. McIlhenny, "Design and FPGA implementation of radix-10 combined division/square root algorithm with limited precision primitives," in Proceedings of the 44th Asilomar Conference on Signals, Systems and Computers (Asilomar '10), pp. 87–91, IEEE Computer Society, Pacific Grove, Calif, USA, November 2010.

[19] F. Y. Busaba, C. A. Krygowski, W. H. Li, E. M. Schwarz, and S. R. Carlough, "The IBM z900 decimal arithmetic unit," in Proceedings of the 35th Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1335–1339, IEEE Computer Society, November 2001.

[20] M. Ercegovac and T. Lang, Division and Square Root: Digit-Recurrence Algorithms and Implementations, Kluwer Academic Publishers, Norwell, Mass, USA, 1994.

[21] M. Baesler, S. O. Voigt, and T. Teufel, "A decimal floating-point accurate scalar product unit with a parallel fixed-point multiplier on a Virtex-5 FPGA," International Journal of Reconfigurable Computing, vol. 2010, Article ID 357839, 13 pages, 2010.

[22] M. Baesler, S. O. Voigt, and T. Teufel, "A radix-10 digit recurrence division unit with a constant digit selection function," in Proceedings of the 28th IEEE International Conference on Computer Design (ICCD '10), pp. 241–246, IEEE Computer Society, Los Alamitos, CA, USA, October 2010.

[23] M. Baesler, S. O. Voigt, and T. Teufel, "An IEEE 754-2008 decimal parallel and pipelined FPGA floating-point multiplier," in Proceedings of the 20th International Conference on Field Programmable Logic and Applications (FPL '10), pp. 489–495, IEEE Computer Society, Washington, DC, USA, September 2010.

[24] J. P. Deschamps and G. Sutter, "Decimal division: algorithms and FPGA implementations," in Proceedings of the 6th Southern Programmable Logic Conference (SPL '10), pp. 67–72, IEEE Computer Society, March 2010.

[25] Y. Zhang, D. Chen, L. Chen, et al., "Design and implementation of a radix-100 decimal division," in Proceedings of IEEE Symposium on Circuit and System (ISCAS '09), Taipei, Taiwan, May 2009.

[26] Xilinx, Xilinx LogiCORE Floating-Point Operator v4.0, April 2008.


[Figure 6: P-D diagram for the type4 QDS function. Vertical axis: residual $10 \cdot w$; the two constant decision boundaries for $z^3$ are $m^3_{\mathrm{pos}} = 0.9$ and $m^3_{\mathrm{neg}} = -1.5$.]

Due to the redundancy factor $\rho = 8/9 > 1/2$, we have overlapping regions $[L^i_{k+1}, U^i_k]$ where more than one quotient digit component $z^i_{j+1}$ may be selected. The decision boundary of a selection function should lie inside these overlapping regions. Figure 6 shows the P-D diagram for $z^3$. The figures for $z^2$ and $z^1$ look similar. As $z^3$ can hold three different values, the selection function requires two decision boundaries. Generally, it is implemented by a staircase function (see Figure 2), but for the bounded divisor $d \in [0.4, 0.8)$ it is independent of the divisor and is reduced to a constant function (see Figure 6). Fortunately, the selection functions for $z^2$ and $z^1$ are also independent of the divisor and can also be implemented by constant functions in a similar fashion.

We now determine the required number of fractional digits for the estimates $10\hat{w}$, $\hat{v}^1$, and $\hat{v}^2$. Moreover, we choose suitable selection constants $m^i_{\mathrm{pos}}$ and $m^i_{\mathrm{neg}}$. These constants depend on the maximum error due to the estimates and on the minimum overlapping width, which is given by $U^i_0(0.4) - L^i_1(0.8)$ as well as $U^i_{-1}(0.8) - L^i_0(0.4)$. Since we use a BCD-4221 carry-save redundant digit representation, for a precision of $t$ digits (one integer digit and $t-1$ fractional digits) we have a maximum error of $\epsilon = 2 \cdot 10^{-(t-1)}$. Moreover, the estimates $\hat{w}$ are computed by truncation ($w - \hat{w} \le \epsilon$). This means that for given positive and negative selection constants $m^i_{\mathrm{pos}} > 0$ and $m^i_{\mathrm{neg}} < 0$, the following expressions must be true:

$$U^i_0(0.4) - m^i_{\mathrm{pos}} \ge \epsilon, \qquad L^i_1(0.8) \le m^i_{\mathrm{pos}}, \tag{23}$$

$$L^i_0(0.4) \le m^i_{\mathrm{neg}}, \qquad U^i_{-1}(0.8) - m^i_{\mathrm{neg}} \ge \epsilon. \tag{24}$$

If we choose the selection constants as listed in Table 2, all conditions are fulfilled for a precision of $t = 2$ digits, and each selection function is reduced to two simple 2-digit comparators.
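Conditions (23) and (24) can be checked numerically for the constants of Table 2. The sketch below rests on an assumption that is not spelled out in this section: the selection interval bounds are taken as $U^i_k(d) = (n_i k + r_i + \rho)\,d$ and $L^i_k(d) = (n_i k - r_i - \rho)\,d$, where $n_i \in \{5, 2, 1\}$ is the weight of component $z^i$ and $r_i$ is the sum of the lower weights. With this form, the computed margins agree with the last two columns of Table 2 to within rounding of the third decimal. All names are illustrative.

```python
from fractions import Fraction

RHO = Fraction(8, 9)                              # redundancy factor
EPS = Fraction(2, 10)                             # max. estimate error for t = 2
D_MIN, D_MAX = Fraction(4, 10), Fraction(8, 10)   # bounded divisor range

# name: (weight n, sum of lower weights r, m_pos, m_neg), constants from Table 2
COMPONENTS = {
    "z3": (5, 3, Fraction(9, 10), Fraction(-15, 10)),
    "z2": (2, 1, Fraction(1, 10), Fraction(-7, 10)),
    "z1": (1, 0, Fraction(1, 10), Fraction(-3, 10)),
}


def upper(n, r, k, d):
    """Assumed upper bound U_k(d) of the selection interval for value k."""
    return (n * k + r + RHO) * d


def lower(n, r, k, d):
    """Assumed lower bound L_k(d) of the selection interval for value k."""
    return (n * k - r - RHO) * d


def margins(name):
    """Return (U_0(0.4) - m_pos, U_-1(0.8) - m_neg) for one component."""
    n, r, m_pos, m_neg = COMPONENTS[name]
    return upper(n, r, 0, D_MIN) - m_pos, upper(n, r, -1, D_MAX) - m_neg


def conditions_hold(name):
    """Check conditions (23) and (24) for one component."""
    n, r, m_pos, m_neg = COMPONENTS[name]
    pos_margin, neg_margin = margins(name)
    return (pos_margin >= EPS and lower(n, r, 1, D_MAX) <= m_pos
            and lower(n, r, 0, D_MIN) <= m_neg and neg_margin >= EPS)
```

Under these assumptions, `conditions_hold` is true for all three components, so one fractional digit in the estimates is sufficient.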

As soon as one component $z^i$ of the quotient digit is computed, the corresponding multiple of the divisor must be subtracted from the partial residual. Each component $z^i$ can hold three different values, $z^i \in \{-1, 0, +1\}$, that are multiplied with the corresponding weighted divisor $n \cdot d$ with $n \in \{5, 2, 1\}$. In other words, $n \cdot d$ is either passed, negated,

Table 2: Constants for the type4 quotient digit selection function.

| | $m^i_{\mathrm{pos}}$ | $m^i_{\mathrm{neg}}$ | $U^i_0(0.4) - m^i_{\mathrm{pos}}$ | $U^i_{-1}(0.8) - m^i_{\mathrm{neg}}$ |
|---|---|---|---|---|
| $z^3$ | 0.9 | −1.5 | 0.656 | 0.611 |
| $z^2$ | 0.1 | −0.7 | 0.656 | 0.611 |
| $z^1$ | 0.1 | −0.3 | 0.255 | 0.211 |

or reset. Passing and resetting the multiple of the divisor is performed at no extra cost. Negation is accomplished by bit inversion and adding +1 through the carry inputs of the carry-save adders in (19a), (19b), and (19c). In summary, this is a very fast operation and requires only two LUTs per digit.

Similar to the type2 and type3 division, we implement the two MSDs of the type4 residual by using a nonredundant radix-2 representation, which requires fewer LUTs and results in a faster quotient digit selection function. The pseudocode for the digit recurrence iteration is shown in Algorithm 6, and the corresponding block diagram is depicted in Figure 7.

4.6. Proposed Decimal Fixed-Point Divider. For each type of division algorithm presented in the preceding sections, we have implemented a corresponding fixed-point divider, which is described in the following. For all fixed-point dividers, we expect the input operands to be normalized.

The block diagram of the type1 divider is depicted in Figure 8. First, the divisor multiples $2d, \ldots, 9d$ are precomputed. In the following cycles, $p = 16$ quotient digits (16 + 1 if the first digit is zero) are computed one by one, followed by an additional rounding digit. The quotient (Z) and the quotient +1 (ZP1) are computed on-the-fly by using two registers that are updated every cycle. The algorithm has the advantage that no additional slow decimal CPA is required to compute the incremented quotient. It is described more precisely in Section 5. The normalization of the result is also performed on-the-fly by locking the conversion when 16 + 1 quotient digits for the significand and the rounding digit have been computed; that is, in the worst case, when the first quotient digit is zero, 16 + 1 + 1 = 18 cycles are required. Furthermore, the sticky bit is calculated, which is required for rounding. It is set to one whenever the final remainder is unequal to zero.
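The type1 recurrence can be modeled in a few lines of software. The following is a behavioral sketch only, under the simplifying assumption that the operands are integers with $0 \le x < d$ (so that every quotient digit lies in 0..9); it does not model the parallel subtractors, the multiplexer, or the on-the-fly registers of the hardware, and the function name is hypothetical.

```python
def restoring_divide(x, d, digits=17):
    """Radix-10 shift-and-subtract division of x by d (integers, 0 <= x < d).

    Returns the quotient digits of x/d = 0.z1 z2 z3 ..., the final remainder,
    and the sticky bit.
    """
    if not 0 <= x < d:
        raise ValueError("this sketch assumes 0 <= x < d")
    multiples = [k * d for k in range(10)]   # 0, d, 2d, ..., 9d precomputed once
    w = x                                    # partial residual
    quotient_digits = []
    for _ in range(digits):
        shifted = 10 * w
        # Hardware performs the nine subtractions in parallel and selects the
        # smallest nonnegative difference; here we pick the same digit directly.
        z = max(k for k in range(10) if shifted - multiples[k] >= 0)
        w = shifted - multiples[z]
        quotient_digits.append(z)
    sticky = int(w != 0)                     # set when the final remainder != 0
    return quotient_digits, w, sticky
```

For instance, dividing 1 by 8 with five digits yields the digits 1, 2, 5, 0, 0 with a zero remainder, so the sticky bit stays cleared.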

The architectures of the type2, type3.a, type3.b, and type4 dividers are similar in their structure but differ in their digit recurrence algorithm and scaling. The common block diagram is shown in Figure 9. First, the dividend and the divisor are rescaled,

$$d \in [0.1, 1.0) \;\Longrightarrow\; d' \in \begin{cases} [0.4, 1.0) & \text{for type2, type3.a/b}, \\ [0.4, 0.8) & \text{for type4}, \end{cases} \tag{25}$$

and five as well as two times the divisor are precomputed according to (9) and (10). Then the operands are recoded because the digit recurrence iteration uses a redundant digit representation with BCD-4221 carry-save adders.

The digit recurrence retires one signed quotient digit $z_k$ in each iteration. The on-the-fly conversion algorithm converts this signed-digit representation to the BCD-8421-coded

[Figure 7: Block diagram of type4 digit recurrence. Three cascaded stages for $z^3$, $z^2$, and $z^1$, each consisting of a binary CPA for the two MSDs, a BCD-4221 CSA, an invert/reset unit for the divisor multiple ($5d$, $2d$, $d$), and a selection unit comparing against the constants 0.9/−1.5, 0.1/−0.7, and 0.1/−0.3.]

[Figure 8: Block diagram of the normalized type1 division. Normalized dividend and divisor → compute divisor multiples $2d, 3d, \ldots, 9d$ → compute 16 + 1 + 1 digit recurrence steps ($z_k$, $w[k]$) → compute the quotient on-the-fly and determine the sticky bit → Z, ZP1, sb.]

(1) select

$$z^3_{j+1} = \begin{cases} 1 & \text{if } 0.9 \le 10\hat{w}[j], \\ 0 & \text{if } -1.5 \le 10\hat{w}[j] < 0.9, \\ -1 & \text{if } 10\hat{w}[j] < -1.5, \end{cases}$$

and compute $(v^1_s[j] + v^1_c[j]) = 10\,(w_s[j] + w_c[j]) - z^3_{j+1} \cdot 5d$

(2) select

$$z^2_{j+1} = \begin{cases} 1 & \text{if } 0.1 \le \hat{v}^1[j], \\ 0 & \text{if } -0.7 \le \hat{v}^1[j] < 0.1, \\ -1 & \text{if } \hat{v}^1[j] < -0.7, \end{cases}$$

and compute $(v^2_s[j] + v^2_c[j]) = (v^1_s[j] + v^1_c[j]) - z^2_{j+1} \cdot 2d$

(3) select

$$z^1_{j+1} = \begin{cases} 1 & \text{if } 0.1 \le \hat{v}^2[j], \\ 0 & \text{if } -0.3 \le \hat{v}^2[j] < 0.1, \\ -1 & \text{if } \hat{v}^2[j] < -0.3, \end{cases}$$

and compute $(w_s[j+1] + w_c[j+1]) = (v^2_s[j] + v^2_c[j]) - z^1_{j+1} \cdot d$

Algorithm 6: Pseudocode for the type4 digit recurrence.
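The three selection steps of Algorithm 6 can be exercised with a small behavioral model. The sketch below makes several simplifying assumptions that are not taken from the paper: the residual is kept as an exact nonredundant rational number instead of a carry-save pair, the estimates are obtained by truncating to one fractional digit (an error below 0.1, which is within the bound $\epsilon = 0.2$), and the recurrence is started with $w[0] = x/10$ so that the accumulated digits approximate $x/(10d)$. All function names are illustrative.

```python
from fractions import Fraction
import math

RHO = Fraction(8, 9)  # redundancy factor of the digit set {-8, ..., 8}


def truncate(value, frac_digits=1):
    """Estimate by truncation toward minus infinity (estimate <= value)."""
    scale = 10 ** frac_digits
    return Fraction(math.floor(value * scale), scale)


def select(estimate, m_pos, m_neg):
    """Constant selection function with two decision boundaries."""
    if estimate >= m_pos:
        return 1
    if estimate >= m_neg:
        return 0
    return -1


def type4_step(w, d):
    """One radix-10 iteration: returns the digit 5*z3 + 2*z2 + z1 and w[j+1]."""
    v0 = 10 * w
    z3 = select(truncate(v0), Fraction(9, 10), Fraction(-15, 10))
    v1 = v0 - z3 * 5 * d
    z2 = select(truncate(v1), Fraction(1, 10), Fraction(-7, 10))
    v2 = v1 - z2 * 2 * d
    z1 = select(truncate(v2), Fraction(1, 10), Fraction(-3, 10))
    return 5 * z3 + 2 * z2 + z1, v2 - z1 * d


def divide(x, d, steps=17):
    """Approximate x / (10 * d) for d in [0.4, 0.8) and x in [0.1, 1)."""
    w = x / 10                       # assumed initialization, |w[0]| <= rho * d
    q = Fraction(0)
    for j in range(1, steps + 1):
        z, w = type4_step(w, d)
        assert abs(w) <= RHO * d     # the residual must stay bounded
        q += Fraction(z, 10 ** j)
    return q, w
```

After `steps` iterations the accumulated quotient differs from $x/(10d)$ by $|w| \cdot 10^{-\text{steps}}/d$, which is at most $\rho \cdot 10^{-\text{steps}}$ as long as the residual bound holds.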

[Figure 9: Block diagram of the normalized type2, type3.a, type3.b, and type4 divisions. Normalized dividend and divisor → scale divisor $d \in [0.1, 1) \to d' \in [0.4, 1)$ (type2, type3) or $[0.4, 0.8)$ (type4) → compute divisor multiples $2d'$ and $5d'$ → BCD-8421 to BCD-4221 recoding → compute 16 + 1 + 1 digit recurrence steps ($z_k$, $w[k]$) → compute the quotient on-the-fly (ZM1′, Z′, ZP1′) and determine the sticky bit and the sign of the remainder → MUX → Z, ZP1, $R$, sb.]

result [20] and computes the quotient (Z), the quotient +1 (ZP1), and the quotient −1 (ZM1). This conversion is accomplished in every iteration step and does not need a slow CPA. The incremented and decremented quotients are required for rounding. Moreover, the quotient is also normalized on-the-fly by locking the conversion when 16 + 1 quotient digits (including the rounding digit) have been computed. The algorithm is described explicitly in Section 5 because it is accomplished together with the gradual underflow handling of the floating-point divider.
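The idea behind on-the-fly conversion can be illustrated with the textbook scheme described in [20], adapted here to three registers. This is a software sketch under stated assumptions, not the authors' circuit: Python integers stand in for the BCD registers, the initial value −1 of ZM1 is a modeling convenience, and the function name is hypothetical. The key property is that each update only appends one digit in the range 0..9 to one of the old registers, so no carry has to propagate.

```python
def on_the_fly_conversion(signed_digits):
    """Convert signed radix-10 digits (-8..8, MSD first) to Z, Z+1, and Z-1.

    Each register update is of the form 10 * old_register + appended_digit
    with appended_digit in 0..9, i.e. a shift and a digit load.
    """
    z, zp1, zm1 = 0, 1, -1
    for q in signed_digits:
        if not -8 <= q <= 8:
            raise ValueError("quotient digits must lie in -8..8")
        new_z = 10 * z + q if q >= 0 else 10 * zm1 + (10 + q)
        new_zp1 = 10 * z + (q + 1) if q >= -1 else 10 * zm1 + (11 + q)
        new_zm1 = 10 * z + (q - 1) if q >= 1 else 10 * zm1 + (9 + q)
        z, zp1, zm1 = new_z, new_zp1, new_zm1
    return z, zp1, zm1
```

For the digit sequence 1, −3, 0, 8, −8 the model returns 7072, 7073, and 7071, so the incremented and decremented quotients are available without a final addition.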


if ($\overline{\mathrm{qNaN}_X}$ and $\overline{\mathrm{sNaN}_X}$) and ($\mathrm{qNaN}_D$ or $\mathrm{sNaN}_D$) then
    $C'_X = C_D$,  $C'_D = 0\ldots01$
    $q'_X = q_D$,  $q'_D = 0 + \mathrm{bias} = 398$
else if ($\mathrm{qNaN}_X$ or $\mathrm{sNaN}_X$) then
    $C'_X = C_X$,  $C'_D = 0\ldots01$
    $q'_X = q_X$,  $q'_D = 0 + \mathrm{bias} = 398$
else (no changes)
    $C'_X = C_X$,  $C'_D = C_D$
    $q'_X = q_X$,  $q'_D = q_D$

Algorithm 7: Algorithm of NaN handling.

Furthermore, the sticky bit is calculated, which is required for rounding. It is set to one whenever the final remainder is unequal to zero. Moreover, when the final remainder is negative, the quotient of the type2, type3.a, type3.b, and type4 dividers has to be adjusted by subtracting one LSD. This subtraction does not need another slow CPA because the quotient −1 (ZM1) has already been computed on-the-fly. The calculations of both the sticky and sign bit require the reduction of the redundant remainder by a CPA. This CPA might be subdivided into multiple smaller CPAs to keep the latency low.

5. IEEE 754-2008 Floating-Point Division

The IEEE 754-2008 compliant decimal floating-point divider is an extension of the normalized fixed-point divider. The divider presented in this paper supports the interchange format IEEE 754-2008 decimal64 with DPD encoding, but it can be easily adapted to any other precision and exponent range.

The block diagram of the divider is depicted in Figure 10. The DFP division begins with the decoding of the dividend X and the divisor D and the extraction of their signs, significands, and exponents. In the following, the significands of the operands X ($c_X$) and D ($c_D$) are regarded as integers. Therefore, the corresponding exponents are $q_X$ and $q_D$.

According to [4], if one of the operands is a signaling NaN (sNaN) or a quiet NaN (qNaN), then the result is also a quiet NaN with the payload of the original NaN. Hence, in order to preserve the payload while using an unmodified divider, the NaN handling unit sets the NaN holding operand as the dividend and resets the divisor to $1.0 \cdot 10^{0}$, as depicted in Algorithm 7.
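The operand swap of Algorithm 7 can be sketched in a few lines of Python. This is only an illustrative software model, not the VHDL of the divider: the function name `handle_nan` and the boolean flags `x_is_nan` and `d_is_nan` (each standing for "qNaN or sNaN") are assumptions made for this example.

```python
BIAS = 398  # decimal64 exponent bias, as used in Algorithm 7


def handle_nan(x_is_nan, c_x, q_x, d_is_nan, c_d, q_d):
    """Return (c_x', q_x', c_d', q_d') in the spirit of Algorithm 7.

    x_is_nan / d_is_nan are hypothetical flags meaning 'operand is a
    quiet or signaling NaN'; c_* are significands (payloads for NaNs)
    and q_* are biased exponents.
    """
    if d_is_nan and not x_is_nan:
        # Only the divisor is a NaN: move it to the dividend position
        # and replace the divisor by 1 * 10^0.
        return c_d, q_d, 1, 0 + BIAS
    if x_is_nan:
        # The dividend is a NaN: keep it and neutralize the divisor.
        return c_x, q_x, 1, 0 + BIAS
    # No NaN involved: operands pass through unchanged.
    return c_x, q_x, c_d, q_d
```

Because the divisor becomes one, the unmodified fixed-point datapath reproduces the payload in the quotient, which appears to be the design intent described above.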

The fixed-point dividers presented in Section 4 require normalized operands. Therefore, the number of leading zeros is counted for both operands, and the significands are normalized by barrel shifters. Due to performance issues, the leading zeros counters exploit the FPGA's fast carry chains, as proposed in [23].

The exponent of a floating-point division is first estimated by the exponent of the normalized quotient in fractional representation (QNF) and is updated iteratively in each digit recurrence step. Since most of the decimal floating-point numbers specified by IEEE 754-2008 have multiple representations of the same value, the result of a division might also have more than one correct representation, and these representations differ in the exponent. To obtain a unique and reproducible result in the case of such an ambiguity, IEEE 754-2008 defines the exponent of the result, which is called the preferred exponent (QP).

Both exponents QNF and QP are positive integers and are determined as follows:

$$\Delta q = q_X - q_D, \qquad \Delta\mathrm{LZ} = \mathrm{LZ}_X - \mathrm{LZ}_D, \tag{26}$$

$$\mathrm{QNF} = \Delta q - \Delta\mathrm{LZ} + \mathrm{offset1}, \qquad \mathrm{QP} = \Delta q + \mathrm{offset1}. \tag{27}$$

The value offset1 is a bias that ensures that the exponents are positive, since unsigned integer calculation is used in this paper. The bias is composed of

$$\begin{aligned} \mathrm{offset1} = 783 ={} & 398 && \text{(IEEE 754-2008 bias)} \\ {}+{} & 369 && (q_{\max}) \\ {}+{} & 15 && \text{(conversion fractional to integer)} \\ {}+{} & 1 && \text{(normalization of the result).} \end{aligned} \tag{28}$$
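Equations (26) to (28) can be checked with a short Python sketch. It is a software illustration under stated assumptions, not the hardware implementation: the helper names `leading_zeros` and `estimate_exponents` are hypothetical, and the leading zeros are counted on the decimal string of a 16-digit integer significand rather than with carry chains.

```python
P = 16                         # decimal64 precision in digits
OFFSET1 = 398 + 369 + 15 + 1   # composition of offset1 according to (28)


def leading_zeros(c, p=P):
    """Leading zero digits of a nonzero integer significand padded to p digits."""
    return p - len(str(c))


def estimate_exponents(c_x, q_x, c_d, q_d):
    """Return (QNF, QP) according to (26) and (27)."""
    dq = q_x - q_d                                # delta q
    dlz = leading_zeros(c_x) - leading_zeros(c_d)  # delta LZ
    qnf = dq - dlz + OFFSET1
    qp = dq + OFFSET1
    return qnf, qp
```

For example, dividing a significand of 100 by a significand of 3 (equal exponents) gives ΔLZ = 13 − 15 = −2, so QNF exceeds QP by two, reflecting that the normalized quotient is two decades larger than the preferred representation suggests.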

The normalized fixed-point division unit then computes $p = 16$ significand digits and one additional rounding digit. Additionally, the divider encompasses the on-the-fly conversion unit, which also detects and handles gradual underflow with zero delay overhead. One signed quotient digit $z_j$ is retired in each cycle, and the on-the-fly conversion algorithm converts this signed-digit representation into the BCD-8421-coded partial result [20]. The conversion is accomplished in every iteration step and does not need a slow CPA (see Algorithm 8). The partial quotient is stored in a register $Z'$. Moreover, two additional registers $\mathrm{ZM1}'$ and $\mathrm{ZP1}'$ are provided. $\mathrm{ZM1}'$ is the partial quotient decremented by one least significant digit (LSD), and $\mathrm{ZP1}'$ is the partial quotient incremented by one LSD. $\mathrm{ZM1}'$ is selected when the final residual in the last iteration is negative, and $\mathrm{ZP1}'$ is required by the following rounding operation. Furthermore, the exponent for the normalized significand in integer representation (QNI) is calculated (see (29)), and gradual underflow handling is performed at no extra cost (see Algorithm 8). The exponent QNI is computed by decrementing the exponent of the normalized significand in fractional representation (QNF) by one in each iteration step. The recurrence iteration terminates earlier in case of gradual underflow. The gradual underflow signal (GUF) is asserted high when $\mathrm{QNI} = q_{\min}$ and the calculated integer quotient is less than $10^{p}$:

$$\mathrm{QNI} = \begin{cases} q_{\min} & \text{if } \mathrm{GUF} = 1, \\ \mathrm{QNF} - (p - 1) & \text{if } z_1 > 0, \\ \mathrm{QNF} - p & \text{if } z_1 = 0. \end{cases} \tag{29}$$

The extended quotient (composed of the calculated quotient and the rounding digit) must be decremented by one if the final remainder is less than zero. The sticky bit (sb) is used for rounding and is asserted high whenever the remainder is unequal to zero.

Figure 10: Block diagram of the floating-point divider (DPD decoders, NaN handling, leading zeros count with QNF and QP calculations, dividend and divisor normalization, normalized fixed-point division, on-the-fly conversion with gradual underflow handling, exponent calculation, rounding unit, exception unit, and DPD encoder).

Moreover, the number of trailing zeros (TZ) is computed on-the-fly. This number indicates by how many digits the computed quotient might be shifted to the right without losing accuracy. The number of trailing zeros has to be counted only for $Z'$ because, if $\mathrm{ZM1}'$ is selected, the residual will be unequal to zero and TZ = 0 anyway. The number of trailing zeros is used for the selection between the preferred exponent (QP) and the normalized exponent for integer representation (QNI), as described in the following paragraph.

The exponent calculation unit selects either QP or QNI. The preferred exponent should be selected whenever possible. This is only feasible when $\mathrm{QNI} < \mathrm{QP} \le \mathrm{QNI} + \mathrm{TZ}$, that is, when there is a sufficient number of trailing zeros to shift the significand to the right. Furthermore, the final exponent must satisfy the minimum and maximum exponent range. The algorithm of the exponent calculation is illustrated in Algorithm 9, where the value

$$\mathrm{offset2} = 783 - \mathrm{bias} = 783 - 398 = 385 \tag{30}$$

is used to add the IEEE 754-2008 bias and to remove offset1 again (which was introduced in (27)).

$\mathrm{QNI}' = \mathrm{QNF} + 1$; $Z' = \mathrm{ZM1}' = \mathrm{ZP1}' = 0$; $\mathrm{TZ}' = 0$; $j = 0$
while ($Z' < 10^{p}$) and ($\mathrm{QNI}' \ge q_{\min}$)
  $Z' = \begin{cases} 10 \cdot Z' + z_{j+1} & \text{if } z_{j+1} \ge 0 \\ 10 \cdot \mathrm{ZM1}' + (z_{j+1} + 10) & \text{if } z_{j+1} < 0 \end{cases}$
  $\mathrm{ZM1}' = \begin{cases} 10 \cdot Z' + (z_{j+1} - 1) & \text{if } z_{j+1} > 0 \\ 10 \cdot \mathrm{ZM1}' + (z_{j+1} + 9) & \text{if } z_{j+1} \le 0 \end{cases}$
  $\mathrm{ZP1}' = \begin{cases} 10 \cdot Z' + (z_{j+1} + 1) & \text{if } z_{j+1} \ge -1 \\ 10 \cdot \mathrm{ZM1}' + (z_{j+1} + 11) & \text{if } z_{j+1} < -1 \end{cases}$
  $\mathrm{TZ}' = \begin{cases} \mathrm{TZ}' + 1 & \text{if } z_{j+1} = 0 \\ 0 & \text{else} \end{cases}$
  $\mathrm{QNI}' = \mathrm{QNI}' - 1$
  $j = j + 1$
$\mathrm{QNI} = \mathrm{QNI}' + 1$ (remove the impact of the rounding digit)
$R' = \begin{cases} z_{j+1} - 1 & \text{if remainder} < 0 \\ z_{j+1} & \text{else} \end{cases}$
$\mathrm{ZM1}'' = \lfloor \mathrm{ZM1}' \cdot 10^{-1} \rfloor$; $Z'' = \lfloor Z' \cdot 10^{-1} \rfloor$; $\mathrm{ZP1}'' = \lfloor \mathrm{ZP1}' \cdot 10^{-1} \rfloor$
$(Z, \mathrm{ZP1}, R) = \begin{cases} (\mathrm{ZM1}'', Z'', R' + 10) & \text{if } R' < 0 \\ (Z'', \mathrm{ZP1}'', R') & \text{else} \end{cases}$
$\mathrm{sb} = \begin{cases} 0 & \text{if remainder} = 0 \\ 1 & \text{else} \end{cases}$
$\mathrm{GUF} = \begin{cases} 1 & \text{if } \mathrm{QNI} = q_{\min} \text{ and } Z' < 10^{p} \text{ and } (\mathrm{sb} = 1 \text{ or } R \ne 0) \\ 0 & \text{else} \end{cases}$
if GUF = 1 then underflow
$\mathrm{TZ} = \begin{cases} 0 & \text{if } \mathrm{sb} = 1 \\ \mathrm{TZ}' & \text{else} \end{cases}$

Algorithm 8: On-the-fly conversion with gradual underflow handling.
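The update rules inside the loop of Algorithm 8 can be exercised with a small Python model. This is a hedged sketch that covers only the conversion and the trailing-zero count; the exponent update, the rounding digit extraction, and the gradual underflow logic are left out. The function name `on_the_fly` is hypothetical, plain integers stand in for the BCD registers, and the digit set {−8, …, 8} is assumed from the redundancy of 8/9 stated for the type2 divider.

```python
def on_the_fly(digits):
    """Convert signed quotient digits (most significant first) on the fly.

    Returns (Z, ZM1, ZP1, TZ): the converted quotient, the quotient minus
    one LSD, the quotient plus one LSD, and the number of trailing zeros.
    Each step only appends one digit to an existing register value, so no
    carry has to ripple through the partial result.
    """
    z = zm1 = zp1 = 0
    tz = 0
    for d in digits:
        if not -8 <= d <= 8:
            raise ValueError("digit outside the assumed set {-8, ..., 8}")
        # All three updates use the register values of the previous step.
        new_z = 10 * z + d if d >= 0 else 10 * zm1 + (d + 10)
        new_zm1 = 10 * z + (d - 1) if d > 0 else 10 * zm1 + (d + 9)
        new_zp1 = 10 * z + (d + 1) if d >= -1 else 10 * zm1 + (d + 11)
        z, zm1, zp1 = new_z, new_zm1, new_zp1
        tz = tz + 1 if d == 0 else 0
    return z, zm1, zp1, tz
```

In every branch the appended digit lies between 0 and 9, which is why the hardware can realize the update with multiplexers and a shift instead of a carry-propagate adder. Note that the invariants ZM1 = Z − 1 and ZP1 = Z + 1 hold once the first nonzero digit has been a positive one, which is the case for a division of positive normalized operands.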

Rounding is required when the number of significant digits of the quotient exceeds the length $p$ of a decimal word. The divider presented in this paper provides the five rounding modes as requested by [4]. Rounding either selects the integer quotient $Z_e$ or the incremented quotient $\mathrm{ZP1}_e = Z_e + 1$. As proven by Theorem A.1 in the appendix, rounding overflow cannot occur in DFP division. Rounding overflow would occur if $Z_e + 1$ overflowed due to rounding. The selection of the quotient depends on the rounding mode, the rounding digit (R), the sticky bit (sb), and the least significant digit (LSD) of the quotient. The calculation of the round-up detection is summarized in Table 3.
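The round-up conditions summarized in Table 3 can be written as a short Python predicate. This is a sketch under assumptions: the mode strings and the function name `round_up` are invented for this example, `sign` is taken to be 1 for a negative result, and LSD(0) is read as the least significant bit of the quotient's last digit, that is, whether that digit is odd.

```python
def round_up(mode, r, sb, lsd, sign):
    """Return True if the incremented quotient has to be selected.

    r    : rounding digit (0..9)
    sb   : sticky bit (0 or 1)
    lsd  : least significant digit of the quotient (0..9)
    sign : sign of the result (assumed 0 = positive, 1 = negative)
    """
    discarded_nonzero = r > 0 or sb == 1
    if mode == "ties_to_even":
        return (r > 5
                or (r == 5 and sb == 1)
                or (r == 5 and sb == 0 and lsd % 2 == 1))
    if mode == "ties_to_away":
        return r >= 5
    if mode == "toward_positive":
        return sign == 0 and discarded_nonzero
    if mode == "toward_negative":
        return sign == 1 and discarded_nonzero
    if mode == "toward_zero":
        return False
    raise ValueError("unknown rounding mode: " + mode)
```

For instance, an exact tie (R = 5, sb = 0) rounds up under ties-to-even only when the last quotient digit is odd, so 2.5 stays at 2 while 3.5 becomes 4.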

Finally, the result is encoded again, and the exception unit might assert six exception signals. These are division by zero, invalid operation, result is infinite, inexact, overflow, and underflow. Division by zero is asserted when the divisor equals zero but the dividend is unequal to zero. The operation is called invalid when both operands are zero, both operands are infinity, or any of the operands is either a signaling or a quiet NaN. The inexact flag is asserted when the rounding digit or the sticky bit is unequal to zero. Overflow and underflow are computed as described in the exponent calculation unit and are listed in Algorithms 8 and 9. The infinity exception is asserted when the result is greater than the largest number, that is, when either the overflow or the division by zero exception is asserted, or when the dividend is infinity while the divisor is a finite number.

Table 3: Round-up detection.

| Rounding mode | Round-up detection |
| --- | --- |
| Round Ties To Even | $(R > 5) + (R \equiv 5) \cdot \mathrm{sb} + (R \equiv 5) \cdot \overline{\mathrm{sb}} \cdot \mathrm{LSD}(0)$ |
| Round Ties To Away | $(R \ge 5)$ |
| Round Toward Positive | $\overline{\mathrm{sign}} \cdot ((R > 0) + \mathrm{sb})$ |
| Round Toward Negative | $\mathrm{sign} \cdot ((R > 0) + \mathrm{sb})$ |
| Round Toward Zero | 0 |

Legend: "+" denotes logical OR; "·" denotes logical AND.

offset2 = 783 − 398 = 385
if ($\mathrm{QP} \le \mathrm{QNI}$) or ($\mathrm{QNI} + \mathrm{TZ} < \mathrm{QP}$) or ($\mathrm{QP} > q_{\max}$) then
  (a) $\mathrm{QNI} > q_{\max}$ ⇒ overflow
  (b) $q_{\min} \le \mathrm{QNI} \le q_{\max}$ ⇒ $q_Z = \mathrm{QNI} - \mathrm{offset2}$, $Z_e = Z$, $\mathrm{ZP1}_e = \mathrm{ZP1}$
  (c) $\mathrm{QNI} < q_{\min}$ ⇒ underflow
else
  (a) $q_{\min} \le \mathrm{QP} \le q_{\max}$ ⇒ $q_Z = \mathrm{QP} - \mathrm{offset2}$, $Z_e = Z \gg (\mathrm{QP} - \mathrm{QNI})$, $\mathrm{ZP1}_e = \mathrm{ZP1} \gg (\mathrm{QP} - \mathrm{QNI})$
  (b) $\mathrm{QNI} < q_{\min}$ ⇒ underflow

Algorithm 9: Exponent calculation.
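The selection between QP and QNI in Algorithm 9 can be illustrated with the following Python sketch. Several points are assumptions rather than statements of the paper: the function name `select_exponent` and the returned tuple layout are hypothetical, `q_min` and `q_max` are passed in as parameters in the same offset domain as QP and QNI, the operator ≫ is interpreted as a right shift by decimal digits, and in the else branch case (a) is tested before case (b).

```python
OFFSET2 = 783 - 398  # = 385, see (30)


def select_exponent(qp, qni, tz, z, zp1, q_min, q_max):
    """Return (status, q_z, z_e, zp1_e) in the spirit of Algorithm 9."""
    preferred_unreachable = qp <= qni or qni + tz < qp or qp > q_max
    if preferred_unreachable:
        if qni > q_max:
            return "overflow", None, None, None
        if qni < q_min:
            return "underflow", None, None, None
        # Keep the normalized result unchanged.
        return "ok", qni - OFFSET2, z, zp1
    if q_min <= qp <= q_max:
        # Enough trailing zeros: shift right by (QP - QNI) decimal digits.
        shift = 10 ** (qp - qni)
        return "ok", qp - OFFSET2, z // shift, zp1 // shift
    # Here QP < q_min, and QNI < QP, so QNI < q_min as well.
    return "underflow", None, None, None
```

With QP = 400, QNI = 398, and three trailing zeros, the preferred exponent is reachable and the significand loses two zeros; with only one trailing zero, the normalized exponent QNI has to be kept.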

6. Implementation Results

All dividers are modeled using VHDL and are implemented for Xilinx Virtex-5 devices with speed grade −2 using Xilinx ISE 10.1. The postplace and route results of the fixed-point dividers are listed in Table 4.

All five types of fixed-point dividers require 19 cycles to perform a division. This includes 17 cycles to determine the quotient digits and the rounding digit, one additional cycle to normalize the result (when the first quotient digit is zero), and one cycle of latency to perform the on-the-fly conversion and the sticky bit determination. As expected, the type1 divider uses the most resources in terms of look-up tables (LUTs) and flip-flops (FFs). The type2, type3.a, and type3.b dividers consume fewer LUTs but additionally require five (type2) or two (type3.a and type3.b) BRAMs. However, comparing the delays of the dividers leads to an unexpected result. Contrary to decimal divider implementations in ASIC designs, on FPGA platforms the shift-and-subtract algorithm is the fastest. The reason is that the signal propagation on the FPGA's internal fast carry chains is faster than on the interconnections between slices over the FPGA's general routing matrix. The type1 divider exploits these fast carry chains, whereas the type2, type3.a, type3.b, and type4 dividers only use the normal, slow slice interconnection resources. In ASIC implementations, for instance, the longest path of the type4 divider would be much shorter than the longest path of the type1 divider, but on FPGA architectures the situation is the opposite.

One of the fastest decimal fixed-point dividers on ASICs is the design of Lang and Nannarelli [14]. They minimize the critical path in the digit recurrence by using a fast quotient digit selection function based on binary carry-propagate adders that subtract fixed values from the estimation of the current residual. These fixed values are dependent on estimations of the divisor and are precomputed and stored in a ROM. Therefore, the latency of the ROM does not contribute to the critical path of the digit recurrence. The implementation of such a divider on FPGAs would have a poor performance because the required carry-save adder with small error estimation cannot be implemented efficiently on the FPGA's slice structure. However, the concept of removing the ROM from the critical path is applied to the type3.a and type3.b dividers, and the corresponding postplace and route results are shown in Table 4. These results point out that the type3.a divider utilizes a similar amount of LUTs as the type2 divider, but the cycle time is much higher. The reason for this increased cycle time can be explained by the raised complexity of the multiplexers for the selection of the divisor's multiples in the digit recurrence step. These multiplexers show a poor performance in terms of propagation delay because they are implemented as a tree with slow slice interconnections. On the contrary, the type3.b divider applies dedicated fast multiplexers, which have a propagation delay of only one LUT instance and eight fast carry chains. This dedicated fast multiplexer speeds up the divider and reduces the maximum cycle time by one nanosecond. Unfortunately, the number of used LUTs is increased dramatically by approximately 1000 LUTs.

The comparison of the type2, type3.a, type3.b, and type4 dividers in terms of speed and area (number of occupied LUTs and FFs combined) shows that the type2 algorithm is the fastest and requires the least number of slices. Hence, there is no benefit in removing the ROM's latency from the critical path because the complexity is moved to the selection of the divisor's multiples. The type1 divider is only 5% faster than the type2 divider, but the speed is bought at a high price because the number of occupied slices is 75% higher. Therefore, the type2 divider shows a good tradeoff in terms of area and latency and is used for the floating-point divider presented in this paper. In the following paragraph, we compare the type2 implementation with other published fixed-point dividers.
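The derived figures quoted above (overall latency, maximum frequency, and the ratios relative to type2) follow from the raw Table 4 entries by simple arithmetic, which the Python sketch below reproduces. It assumes that the cycle times in Table 4 read 8.1, 8.5, 10.1, 9.1, and 12.1 ns, a reading that is consistent with the stated relation latency = 19 × cycle time; the dictionary layout and the function name `derived` are illustrative only.

```python
CYCLES = 19  # cycles per fixed-point division, as stated in the text

# Raw values taken from Table 4 (combined = LUTs and FFs combined).
TABLE4 = {
    "type1":   {"combined": 3868, "cycle_ns": 8.1},
    "type2":   {"combined": 2210, "cycle_ns": 8.5},
    "type3.a": {"combined": 2304, "cycle_ns": 10.1},
    "type3.b": {"combined": 3240, "cycle_ns": 9.1},
    "type4":   {"combined": 2203, "cycle_ns": 12.1},
}


def derived(name, ref="type2"):
    """Latency, maximum frequency, and ratios relative to the reference."""
    d, r = TABLE4[name], TABLE4[ref]
    return {
        "latency_ns": round(CYCLES * d["cycle_ns"]),
        "fmax_mhz": round(1000 / d["cycle_ns"]),
        "area_ratio": round(d["combined"] / r["combined"], 2),
        "speed_ratio": round(r["cycle_ns"] / d["cycle_ns"], 2),
    }
```

Calling `derived("type1")` yields a speed ratio of 1.05 and an area ratio of 1.75, which correspond to the 5% speed advantage and the 75% area overhead of the type1 divider discussed in the text.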

Four other FPGA-based dividers are presented by Ercegovac and McIlhenny [17, 18], Deschamps and Sutter [24], Zhang et al. [25], and Vestias and Neto [12]. These implementations are compared to our type2 divider in the following. The divider presented in [17, 18] is based on a digit recurrence algorithm that only requires limited-precision multipliers, adders, and LUTs. Furthermore, a compensation term is computed in the digit recurrence that compensates the error caused by this limited precision. The design is optimized for Virtex-5 FPGAs and has a good area characteristic (a 14-digit divider requires 1263 LUTs) but suffers from a high cycle time of 13.1 ns, which is more than 50% higher compared to the type1 or type2 design proposed in this paper. The reason for that high cycle time is (similar to the type4 divider) the long critical paths across many LUTs that cannot exploit the FPGA's fast carry chains. Therefore, the design would probably fit better on ASICs with dedicated routing. The dividers implemented in [24] apply two different digit recurrence algorithms with a redundancy of $\rho = 1$. The quotient digit selection function is very complex because it requires three MSDs of the divisor and four MSDs of the residual. The cycle times are high, but it must be taken into account that these designs are optimized for Virtex-4 FPGAs. Nevertheless, the number of used LUTs is very high. The divider presented in [25] applies a radix-100 digit recurrence algorithm with comprehensive prescaling of the dividend and divisor. It is optimized for Virtex-II Pro devices; hence, it is hard to compare it with the dividers presented in this paper. However, the critical path includes a decimal carry-save adder tree, two carry-propagate adder stages, and three multiplexer stages. It can therefore be assumed that this algorithm would also have a poor performance on Virtex-5 devices, although two quotient digits are retired in each iteration step. Vestias and Neto [12] propose four different decimal dividers optimized for different speed and area tradeoffs. Contrary to the circuits presented in this paper, the dividers apply the Newton-Raphson algorithm. The used multiplier is sequential; therefore, many cycles per division are required. However, each divider has a higher delay and a greater LUT usage compared to our type2 implementation.

Table 4: Results for normalized decimal fixed-point dividers, $p = 16 + 1$.

| | Type1 | Type2 | Type3.a | Type3.b | Type4 |
| --- | --- | --- | --- | --- | --- |
| Number of LUTs | 3595 | 1704 | 1749 | 2751 | 1846 |
| Number of FFs | 1234 | 1126 | 1262 | 1262 | 1031 |
| Number of LUTs and FFs combined | 3868 | 2210 | 2304 | 3240 | 2203 |
| Number of 36 k BRAMs | 0 | 5 | 2 | 2 | 0 |
| Number of LUTs normalized to the type2 divider | 2.11 | 1.00 | 1.03 | 1.61 | 1.08 |
| Number of FFs normalized to the type2 divider | 1.10 | 1.00 | 1.12 | 1.12 | 0.92 |
| Number of LUTs and FFs combined normalized to the type2 divider | 1.75 | 1.00 | 1.04 | 1.47 | 1.00 |
| Cycle time (ns) | 8.1 | 8.5 | 10.1 | 9.1 | 12.1 |
| Overall latency (ns) (19 × cycle time) | 154 | 162 | 192 | 173 | 230 |
| Max. frequency (MHz) | 123 | 118 | 99 | 110 | 83 |
| Max. frequency normalized to the type2 divider | 1.05 | 1.00 | 0.84 | 0.93 | 0.70 |

Table 5: Performance comparison of fixed-point dividers.

| Design | Occupied area | Cycle time (ns) | Latency (ns) |
| --- | --- | --- | --- |
| Ercegovac and McIlhenny [17] ($p = 14$, Virtex-5) | 1263 (6,2)-LUTs | 13.1 | 197 |
| Type2 divider equalized to $p = 14$ | 1692 (6,2)-LUTs | 8.5 | 136 |
| Deschamps and Sutter [24] ($p = 16$, Virtex-4): (i) nonrestoring algorithm | 2974 (4,1)-LUTs | 21.4 | 386 |
| Deschamps and Sutter [24] ($p = 16$, Virtex-4): (ii) SRT-like algorithm | 3799 (4,1)-LUTs | 16.6 | 300 |
| Zhang et al. [25] ($p = 16$, Virtex-II Pro) | 3976 slices | 20.0 | 420 |
| Vestias and Neto [12] ($p = 16$, Virtex-4, Newton-Raphson): (i) Type A3 (0 DSPs, 118 cycles) | 2756 (4,1)-LUTs | 3.4 | 394 |
| Vestias and Neto [12]: (ii) Type A4 (0 DSPs, 90 cycles) | 3768 (4,1)-LUTs | 3.4 | 306 |
| Vestias and Neto [12]: (iii) Type A5 (7 DSPs, 112 cycles) | 2091 (4,1)-LUTs | 3.4 | 380 |
| Vestias and Neto [12]: (iv) Type A6 (10 DSPs, 86 cycles) | 2718 (4,1)-LUTs | 3.4 | 292 |
| Type2 divider equalized to $p = 16$ | 1704 (6,2)-LUTs | 8.5 | 153 |

Table 5 lists the results of the dividers presented in [17, 24, 25] together with the results of the type2 divider. However, for a fair performance comparison, the precision of the type2 divider is equalized to match the number of calculated quotient digits of the corresponding divider. Furthermore, it should be noted that the dividers presented in [12, 17, 24, 25] do not provide rounding support.

For the floating-point implementation, the fixed-point divider of type2 is used because type1 and type3.b consume too many resources and type3.a and type4 are too slow. The type2 divider requires five additional BRAMs, but these are available in sufficient quantities on Virtex-5 devices. The corresponding postplace and route results for two configurations with different numbers of serialization stages are listed in Table 6. A comparison with other implementations is complicated because there are no other FPGA-based implementations of decimal floating-point dividers published yet. Therefore, we can only compare the decimal divider with binary dividers implemented on the same FPGA. In Table 7, postplace and route results of binary floating-point dividers for the double data format on a Virtex-5, provided by the Xilinx Core Generator [26], are listed. The dividers also provide the exception signals overflow, underflow, invalid operation, and divide-by-zero. Unfortunately, the binary floating-point divider does not support gradual underflow because this feature increases the complexity. The divider with one cycle per division is an unpipelined and fully parallel design, and the divider that requires 55 cycles per division is a fully sequential design. The binary floating-point divider with 20 cycles per division is best suited for a comparison with the decimal floating-point divider because it requires a comparable number of cycles for one operation. Obviously, decimal arithmetic has a great overhead regarding the total number of used LUTs, but the cycle time and latency are similar. This resource overhead can be explained on the one hand by the inefficient representation of decimal values in binary logic and on the other hand by the more complex specification of the decimal standard IEEE 754-2008. For instance, decimal floating-point division requires additional encoders, decoders, and a normalization stage.

Table 6: Postplace and route results of the floating-point divider based on type2.

| | Config. 1 | Config. 2 |
| --- | --- | --- |
| Number of cycles per float division | 23 | 21 |
| Number of LUTs | 3231 | 3205 |
| Number of FFs | 1630 | 1377 |
| Number of LUTs and FFs combined | 3571 | 3566 |
| Cycle time (ns) | 8.5 | 9.1 |
| Overall latency (ns) | 195.5 | 191 |
| Max. frequency (MHz) | 118 | 110 |

Table 7: Postplace and route results of a 64-bit binary floating-point divider (Core Gen).

| Cycles per division | 1 | 10 | 20 | 55 |
| --- | --- | --- | --- | --- |
| No. of LUTs | 3161 | 592 | 425 | 394 |
| No. of FFs | 196 | 518 | 469 | 542 |
| No. of LUTs and FFs combined | 3232 | 917 | 675 | 675 |
| Cycle time (ns) | 153.7 | 20.8 | 8.9 | 6.8 |
| Overall latency (ns) | 153.7 | 208 | 178 | 374 |
| Max. frequency (MHz) | 6.5 | 48 | 112 | 147 |

7. Conclusion

In this paper we have presented five different radix-10 digit recurrence division algorithms. The type1 divider implements the simple shift-and-subtract algorithm. The type2 divider is based on a nonrestoring algorithm with a ROM-based quotient digit selection function. The type3.a and type3.b dividers are both similar to type2 but use fast binary carry-propagate adders in the quotient digit selection function. However, the type3.a divider uses LUT-based multiplexers, whereas the type3.b divider uses fast carry chain-based multiplexers. The type4 divider applies a new algorithm with constant digit selection functions. This type4 divider requires neither a ROM nor multiplexers to select multiples of the divisor.

We have shown that the shift-and-subtract algorithm, which has the worst latency in ASIC architectures, is the fastest design on FPGAs but uses the most FPGA resources in terms of LUTs. A good tradeoff between latency and area is the type2 architecture with a redundant carry-save representation of the residual, a radix-2 representation of the two MSDs, and a quotient digit selection function implemented in ROM. Furthermore, we have extended this fixed-point divider to a fully IEEE 754-2008 compliant decimal floating-point divider for the decimal64 data format, and finally we have shown implementation results of all dividers.

Appendix

Proofs

Theorem A.1 (rounding overflow). Let us consider a decimal floating-point division $Q = X/D$ with the signs $s_i$, the significands $c_i$, the exponents $q_i$, and the precision $p$:

$$(-1)^{s_Q} \cdot c_Q \cdot 10^{q_Q} = \frac{(-1)^{s_X} \cdot c_X \cdot 10^{q_X}}{(-1)^{s_D} \cdot c_D \cdot 10^{q_D}}. \tag{A.1}$$

Then rounding overflow, that is, an overflow that arises due to adding +1 to the least significant digit of the final result, cannot occur.

Proof. Assume to the contrary that there are a floating-point dividend X, a floating-point divisor D, and a proper rounding mode where the decimal floating-point division produces a rounding overflow. Without loss of generality, we consider in the following only positive operands ($s_X = s_D = 0$), normalized significands ($c_X, c_D \ge 10^{p-1}$), and the exponents to be zero ($q_X = q_D = 0$). In the case of rounding overflow, the $p$ digits of the quotient must be all nines and the remainder must be unequal to zero, that is,

$$\underbrace{c_X}_{p\ \text{digits}} = c_Q \cdot c_D + \mathrm{rem} = \underbrace{9.9 \cdots 9}_{p\ \text{digits}} \cdot \underbrace{c_D}_{p\ \text{digits}} + \underbrace{\mathrm{rem}}_{p\ \text{digits}} \tag{A.2}$$

$$\text{with } 0 < \mathrm{rem} < c_D \cdot 10^{-p+1}. \tag{A.3}$$

Condition (A.2) is fulfilled for

$$c_D = 10^{p-1}, \qquad c_X = \underbrace{99 \cdots 9}_{p\ \text{digits}}, \tag{A.4}$$

but this violates condition (A.3) because the remainder is zero (rem = 0); that is, the result is exact, no round-up is applied, and hence no rounding overflow occurs.

Otherwise, if the divisor is greater than $10^{p-1}$, it follows that

$$\begin{aligned} \underbrace{c_X}_{p\ \text{digits}} &= (9.9 \cdots 9) \cdot c_D + \mathrm{rem} \\ &= (10 - 10^{-p+1}) \cdot c_D + \mathrm{rem} \\ &= 10 \cdot c_D + \underbrace{(-10^{-p+1} \cdot c_D + \mathrm{rem})}_{< 0} \\ &= \underbrace{10 \cdot c_D + \mu}_{\ge p+1\ \text{digits}} \end{aligned} \tag{A.5}$$

with $-10 < \mu < 0$. The left term of the equation has, by definition, a precision of $p$ digits, while the right term of the equation has a precision of more than $p$ digits. This contradiction finally proves that no rounding overflow can occur.
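For small precisions, the statement of Theorem A.1 can also be confirmed by exhaustive search. The Python sketch below is not part of the proof; it enumerates all pairs of normalized $p$-digit significands, computes the first $p$ quotient digits with integer arithmetic, and collects every case in which these digits are all nines while the remainder is nonzero. The function name `rounding_overflow_cases` is an assumption of this example.

```python
def rounding_overflow_cases(p):
    """List all (c_x, c_d) whose p-digit quotient is 99...9 and inexact."""
    lo, hi = 10 ** (p - 1), 10 ** p   # normalized p-digit significands
    all_nines = hi - 1
    cases = []
    for c_x in range(lo, hi):
        for c_d in range(lo, hi):
            # Scale the dividend so that the integer quotient has exactly
            # p digits: c_x / c_d lies in [1, 10) or in (0.1, 1).
            scale = 10 ** (p - 1) if c_x >= c_d else 10 ** p
            q, rem = divmod(c_x * scale, c_d)
            if q == all_nines and rem != 0:
                # An inexact all-nines quotient could round up and overflow.
                cases.append((c_x, c_d))
    return cases
```

Running the search for $p = 1, 2, 3$ returns an empty list in each case, in line with the theorem: the only all-nines quotient with $c_X \ge c_D$ arises from $c_X = 10^{p} - 1$ and $c_D = 10^{p-1}$, and it is exact.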

References

[1] M. Baesler, S. Voigt, and T. Teufel, “FPGA implementations of radix-10 digit recurrence fixed-point and floating-point dividers,” in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig '11), pp. 13–19, IEEE Computer Society, Los Alamitos, CA, USA, December 2011.

[2] IEEE Task P754, ANSI/IEEE 754-1985, Standard for Binary Floating-Point Arithmetic, New York, NY, USA, August 1985.

[3] M. F. Cowlishaw, “Decimal floating-point: algorism for computers,” in Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH '03), pp. 104–111, IEEE Computer Society, Washington, DC, USA, June 2003.

[4] IEEE Task P754, IEEE 754-2008, Standard for Floating-Point Arithmetic, New York, NY, USA, August 2008.

[5] ANSI/IEEE, ANSI/IEEE Std 854-1987: An American National Standard, IEEE Standard for Radix-Independent Floating-Point Arithmetic, New York, NY, USA, October 1987.

[6] A. Y. Duale, M. H. Decker, H. G. Zipperer, M. Aharoni, and T. J. Bohizic, “Decimal floating-point in z9: an implementation and testing perspective,” IBM Journal of Research and Development, vol. 51, no. 1-2, pp. 217–227, 2007.

[7] C. F. Webb, “IBM z10: the next-generation mainframe microprocessor,” IEEE Micro, vol. 28, no. 2, pp. 19–29, 2008.

[8] L. Eisen, J. W. Ward, H. W. Tast et al., “IBM POWER6 accelerators: VMX and DFU,” IBM Journal of Research and Development, vol. 51, no. 6, pp. 663–683, 2007.

[9] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd, “Power7: IBM's next-generation server processor,” IEEE Micro, vol. 30, no. 2, pp. 7–15, 2010.

[10] S. F. Oberman and M. J. Flynn, “Division algorithms and implementations,” IEEE Transactions on Computers, vol. 46, no. 8, pp. 833–854, 1997.

[11] L. K. Wang and M. J. Schulte, “A decimal floating-point divider using Newton-Raphson iteration,” Journal of VLSI Signal Processing Systems, vol. 49, no. 1, pp. 3–18, 2007.

[12] M. Vestias and H. Neto, “Revisiting the Newton-Raphson iterative method for decimal division,” in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '11), pp. 138–143, IEEE Computer Society Press, September 2011.

[13] H. Nikmehr, B. Phillips, and C. C. Lim, “Fast decimal floating-point division,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 9, pp. 951–961, 2006.

[14] T. Lang and A. Nannarelli, “A radix-10 digit-recurrence division unit: algorithm and architecture,” IEEE Transactions on Computers, vol. 56, no. 6, pp. 727–739, 2007.

[15] A. Vazquez, E. Antelo, and P. Montuschi, “A radix-10 SRT divider based on alternative BCD codings,” in Proceedings of the 25th IEEE International Conference on Computer Design (ICCD '07), pp. 280–287, IEEE Computer Society Press, Los Alamitos, CA, USA, October 2007.

[16] E. Schwarz and S. Carlough, “Power6 decimal divide,” in Proceedings of the 18th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '07), pp. 128–133, IEEE Computer Society, July 2007.

[17] M. D. Ercegovac and R. McIlhenny, “Design and FPGA implementation of radix-10 algorithm for division with limited precision primitives,” in Proceedings of the 42nd Asilomar Conference on Signals, Systems and Computers (ASILOMAR '08), pp. 762–766, IEEE Computer Society, Pacific Grove, Calif, USA, October 2008.

[18] M. D. Ercegovac and R. McIlhenny, “Design and FPGA implementation of radix-10 combined division/square root algorithm with limited precision primitives,” in Proceedings of the 44th Asilomar Conference on Signals, Systems and Computers (Asilomar '10), pp. 87–91, IEEE Computer Society, Pacific Grove, Calif, USA, November 2010.

[19] F. Y. Busaba, C. A. Krygowski, W. H. Li, E. M. Schwarz, and S. R. Carlough, “The IBM z900 decimal arithmetic unit,” in Proceedings of the 35th Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1335–1339, IEEE Computer Society, November 2001.

[20] M. Ercegovac and T. Lang, Division and Square Root: Digit-Recurrence Algorithms and Implementations, Kluwer Academic Publishers, Norwell, Mass, USA, 1994.

[21] M. Baesler, S. O. Voigt, and T. Teufel, “A decimal floating-point accurate scalar product unit with a parallel fixed-point multiplier on a Virtex-5 FPGA,” International Journal of Reconfigurable Computing, vol. 2010, Article ID 357839, 13 pages, 2010.

[22] M. Baesler, S. O. Voigt, and T. Teufel, “A radix-10 digit recurrence division unit with a constant digit selection function,” in Proceedings of the 28th IEEE International Conference on Computer Design (ICCD '10), pp. 241–246, IEEE Computer Society, Los Alamitos, CA, USA, October 2010.

[23] M. Baesler, S. O. Voigt, and T. Teufel, “An IEEE 754-2008 decimal parallel and pipelined FPGA floating-point multiplier,” in Proceedings of the 20th International Conference on Field Programmable Logic and Applications (FPL '10), pp. 489–495, IEEE Computer Society, Washington, DC, USA, September 2010.

[24] J. P. Deschamps and G. Sutter, “Decimal division: algorithms and FPGA implementations,” in Proceedings of the 6th Southern Programmable Logic Conference (SPL '10), pp. 67–72, IEEE Computer Society, March 2010.

[25] Y. Zhang, D. Chen, L. Chen et al., “Design and implementation of a radix-100 decimal division,” in Proceedings of IEEE Symposium on Circuit and System (ISCAS '09), Taipei, Taiwan, May 2009.

[26] Xilinx, Xilinx LogiCORE Floating-Point Operator v4.0, April 2008.


Figure 7: Block diagram of type4 digit recurrence (three cascaded stages, each with a binary CPA comparing against the constants 0.9/−1.5, 0.1/−0.7, and 0.1/−0.3, a BCD-4221 CSA, and selection signals SEL $z^3$, SEL $z^2$, and SEL $z^1$ for the divisor multiples $5d$, $2d$, and $d$).

Figure 8: Block diagram of the normalized type1 division.

(1) Select

$$z^{3}_{j+1} = \begin{cases} 1 & \text{if } 0.9 \le 10\,w[j] \\ 0 & \text{if } -1.5 \le 10\,w[j] < 0.9 \\ -1 & \text{if } 10\,w[j] < -1.5 \end{cases}$$

and compute $(v^{1}_{s}[j] + v^{1}_{c}[j]) = 10\,(w_{s}[j] + w_{c}[j]) - z^{3}_{j+1} \cdot 5d$.

(2) Select

$$z^{2}_{j+1} = \begin{cases} 1 & \text{if } 0.1 \le v^{1}[j] \\ 0 & \text{if } -0.7 \le v^{1}[j] < 0.1 \\ -1 & \text{if } v^{1}[j] < -0.7 \end{cases}$$

and compute $(v^{2}_{s}[j] + v^{2}_{c}[j]) = (v^{1}_{s}[j] + v^{1}_{c}[j]) - z^{2}_{j+1} \cdot 2d$.

(3) Select

$$z^{1}_{j+1} = \begin{cases} 1 & \text{if } 0.1 \le v^{2}[j] \\ 0 & \text{if } -0.3 \le v^{2}[j] < 0.1 \\ -1 & \text{if } v^{2}[j] < -0.3 \end{cases}$$

and compute $(w_{s}[j+1] + w_{c}[j+1]) = (v^{2}_{s}[j] + v^{2}_{c}[j]) - z^{1}_{j+1} \cdot d$.

Algorithm 6: Pseudocode for the type4 digit recurrence.
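The three selection steps of Algorithm 6 can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the hardware design: it uses exact rational arithmetic in place of the carry-save residual and its binary MSD estimate, it assumes a scaled divisor in [0.4, 0.8), and it initializes the residual with x/10. The function names `select`, `type4_step`, and `type4_divide` are hypothetical.

```python
from fractions import Fraction as F


def select(value, upper, lower):
    """Constant-threshold selection of one component in {-1, 0, 1}."""
    if value >= upper:
        return 1
    if value >= lower:
        return 0
    return -1


def type4_step(w, d):
    """One recurrence step: returns (signed quotient digit, next residual)."""
    z3 = select(10 * w, F(9, 10), F(-15, 10))  # component with weight 5
    v1 = 10 * w - z3 * 5 * d
    z2 = select(v1, F(1, 10), F(-7, 10))       # component with weight 2
    v2 = v1 - z2 * 2 * d
    z1 = select(v2, F(1, 10), F(-3, 10))       # component with weight 1
    return 5 * z3 + 2 * z2 + z1, v2 - z1 * d


def type4_divide(x, d, steps):
    """Approximate x / d with `steps` signed digits (assumes 0.4 <= d < 0.8)."""
    w = x / 10  # assumed initialization, keeps |w| <= (8/9) * d
    q = F(0)
    for j in range(1, steps + 1):
        z, w = type4_step(w, d)
        q += F(z, 10 ** j)
    return 10 * q, w  # rescale because the dividend was divided by 10


if __name__ == "__main__":
    quotient, residual = type4_divide(F(1, 2), F(3, 5), 17)
    print(float(quotient), float(residual))
```

Each digit is assembled as 5·z3 + 2·z2 + z1, so it lies in the range −8 to 8, and in this exact-arithmetic model the residual stays within (8/9)·d after every step for divisors in the assumed range.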

Figure 9: Block diagram of the normalized type2, type3.a, type3.b, and type4 divisions. (The divisor is scaled from $d \in [0.1, 1)$ to $d' \in [0.4, 1)$ for type2 and to $d' \in [0.4, 0.8)$ for type3; the divisor multiples $2 \cdot d'$ and $5 \cdot d'$ are precomputed.)

result [20] and computes the quotient (Z), the quotient +1 (ZP1), and the quotient −1 (ZM1). This conversion is accomplished in every iteration step and does not need a slow CPA. The incremented and decremented quotients are required for rounding. Moreover, the quotient is also normalized on-the-fly by locking the conversion when 16 + 1 quotient digits (including the rounding digit) have been computed. The algorithm is described explicitly in Section 5 because it is accomplished together with the gradual underflow handling of the floating-point divider.


if $(\lnot\,\mathrm{qNaN}_X \land \lnot\,\mathrm{sNaN}_X) \land (\mathrm{qNaN}_D \lor \mathrm{sNaN}_D)$ then
  $C'_X = C_D$, $C'_D = 0\ldots01$,
  $q'_X = q_D$, $q'_D = 0 + \text{bias} = 398$
else if $(\mathrm{qNaN}_X \lor \mathrm{sNaN}_X)$ then
  $C'_X = C_X$, $C'_D = 0\ldots01$,
  $q'_X = q_X$, $q'_D = 0 + \text{bias} = 398$
else (no changes)
  $C'_X = C_X$, $C'_D = C_D$,
  $q'_X = q_X$, $q'_D = q_D$

Algorithm 7: Algorithm of NaN handling.

Furthermore, the sticky bit, which is required for rounding, is calculated. It is set to one whenever the final remainder is nonzero. Moreover, when the final remainder is negative, the quotient of the type2, type3.a, type3.b, and type4 dividers has to be adjusted by subtracting one LSD. This subtraction does not need another slow CPA because the quotient −1 (ZM1) has already been computed on-the-fly. The calculations of both the sticky bit and the sign bit require the reduction of the redundant remainder by a CPA. This CPA might be subdivided into multiple smaller CPAs to keep the latency low.

5. IEEE 754-2008 Floating-Point Division

The IEEE 754-2008 compliant decimal floating-point divider is an extension of the normalized fixed-point divider. The divider presented in this paper supports the interchange format IEEE 754-2008 decimal64 with DPD encoding, but it can be easily adapted to any other precision and exponent range.

The block diagram of the divider is depicted in Figure 10. The DFP division begins with decoding the dividend X and the divisor D and extracting their signs, significands, and exponents. In the following, the significands of the operands X ($c_X$) and D ($c_D$) are regarded as integers. Therefore, the corresponding exponents are $q_X$ and $q_D$.

According to [4], if one of the operands is a signaling NaN (sNaN) or a quiet NaN (qNaN), then the result is also a quiet NaN with the payload of the original NaN. Hence, in order to preserve the payload while using an unmodified divider, the NaN handling unit sets the NaN-holding operand as the dividend and resets the divisor to $1.0 \cdot 10^{0}$, as depicted in Algorithm 7.
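The operand swap of Algorithm 7 can be sketched as a small preprocessing function. This is a minimal sketch with hypothetical names; the flags `x_is_nan` and `d_is_nan` are assumed to be true if the respective operand is a quiet or a signaling NaN, and the exponents are assumed to be biased.

```python
BIAS = 398  # decimal64 exponent bias, as used in Algorithm 7


def nan_preprocess(x_is_nan, d_is_nan, c_x, q_x, c_d, q_d):
    """Return (c_x', q_x', c_d', q_d') so that a NaN payload sits in the dividend."""
    if d_is_nan and not x_is_nan:
        # Only the divisor is a NaN: move it into the dividend position.
        return c_d, q_d, 1, 0 + BIAS
    if x_is_nan:
        # The dividend is a NaN: keep it and neutralize the divisor.
        return c_x, q_x, 1, 0 + BIAS
    # No NaN involved: operands pass through unchanged.
    return c_x, q_x, c_d, q_d
```

In both NaN cases the divisor becomes 1 with a biased exponent of 398, so dividing by it leaves the payload in the dividend's significand untouched.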

The fixed-point dividers presented in Section 4 require normalized operands. Therefore, the number of leading zeros is counted for both operands, and the significands are normalized by barrel shifters. For performance reasons, the leading zeros counters exploit the FPGA's fast carry chains as proposed in [23].

The exponent of a floating-point division is first estimated by the exponent of the normalized quotient in fractional representation (QNF) and is updated iteratively in each digit recurrence step. Since most of the decimal floating-point numbers specified by IEEE 754-2008 have multiple representations of the same value, the result of a division might also have more than one correct representation, each differing in the exponent. However, to obtain a unique and reproducible result in the case of such an ambiguity, IEEE 754-2008 defines the exponent of the result, which is called the preferred exponent (QP).

Both exponents QNF and QP are positive integers and are determined as follows:

$$\Delta q = (q_X - q_D), \qquad \Delta \mathrm{LZ} = (\mathrm{LZ}_X - \mathrm{LZ}_D), \tag{26}$$

$$\mathrm{QNF} = \Delta q - \Delta \mathrm{LZ} + \text{offset1}, \qquad \mathrm{QP} = \Delta q + \text{offset1}. \tag{27}$$

The value offset1 is a bias that ensures that QNF and QP are positive, since unsigned integer calculation is used in this paper. The bias is composed of

$$\begin{aligned} \text{offset1} = 783 ={} & 398 && \text{(IEEE 754-2008 bias)} \\ {}+{} & 369 && (q_{\max}) \\ {}+{} & 15 && \text{(conversion fractional to integer)} \\ {}+{} & 1 && \text{(normalization of the result).} \end{aligned} \tag{28}$$
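As a worked illustration of (26) to (28), the following sketch computes QNF and QP from two decimal64 operands. It assumes that the exponents are the biased values in the range 0 to 767 and that the significands are integers with at most 16 digits; the helper names are hypothetical, and the string-based leading zeros count merely stands in for the carry-chain-based counter of the hardware.

```python
P = 16                                # decimal64 precision in digits
BIAS = 398                            # IEEE 754-2008 decimal64 bias
Q_MAX = 369
OFFSET1 = BIAS + Q_MAX + (P - 1) + 1  # = 783, see (28)


def leading_zeros(c, p=P):
    """Number of leading zero digits of a p-digit significand."""
    return p - len(str(c)) if c else p


def exponent_estimates(c_x, q_x, c_d, q_d):
    """Return (QNF, QP) according to (26) and (27)."""
    delta_q = q_x - q_d
    delta_lz = leading_zeros(c_x) - leading_zeros(c_d)
    qnf = delta_q - delta_lz + OFFSET1
    qp = delta_q + OFFSET1
    return qnf, qp


if __name__ == "__main__":
    # 1 / 3 with both biased exponents equal to 398
    print(exponent_estimates(1, 398, 3, 398))
```

With biased exponents, Δq is at least −767 and −ΔLZ is at least −15, so the sum with offset1 = 783 cannot become negative, which is consistent with the purpose of the bias stated above.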

The normalized fixed-point division unit then computes $p = 16$ significand digits and one additional rounding digit. Additionally, the divider encompasses the on-the-fly conversion unit that also detects and handles gradual underflow with zero delay overhead. One signed quotient digit $z_j$ is retired in each cycle, and the on-the-fly conversion algorithm converts this signed-digit representation into the BCD-8421-coded partial result [20]. The conversion is accomplished in every iteration step and does not need a slow CPA (see Algorithm 8). The partial quotient is stored in a register $Z'$. Moreover, two additional registers, $\mathrm{ZM1}'$ and $\mathrm{ZP1}'$, are provided: $\mathrm{ZM1}'$ is the partial quotient decremented by one least significant digit (LSD), and $\mathrm{ZP1}'$ is the partial quotient incremented by one LSD. $\mathrm{ZM1}'$ is selected when the final residual in the last iteration is negative, and $\mathrm{ZP1}'$ is required by the following rounding operation. Furthermore, the exponent for the normalized significand in integer representation (QNI) is calculated (see (29)), and gradual underflow handling is performed at no extra cost (see Algorithm 8). The exponent QNI is computed by decrementing the exponent of the normalized significand in fractional representation (QNF) by one in each iteration step. The recurrence iteration terminates earlier in case of gradual underflow. The gradual underflow signal (GUF) is asserted high when $\mathrm{QNI} = q_{\min}$ and the calculated integer quotient is less than $10^{p}$:

$$\mathrm{QNI} = \begin{cases} q_{\min} & \text{if } \mathrm{GUF} = 1 \\ \mathrm{QNF} - (p - 1) & \text{if } z_1 > 0 \\ \mathrm{QNF} - p & \text{if } z_1 = 0 \end{cases} \tag{29}$$

The extended quotient (composed of the calculated quotient and the rounding digit) must be decremented by one if the final remainder is less than zero. The sticky bit (sb) is used for rounding and is asserted high whenever the remainder is nonzero.

Figure 10: Block diagram of the floating-point divider.

Moreover, the number of trailing zeros (TZ) is computed on-the-fly. This number indicates by how many digits the computed quotient might be shifted to the right without losing accuracy. The number of trailing zeros has to be counted only for $Z'$ because if $\mathrm{ZM1}'$ is selected, the residual will be nonzero and TZ = 0 anyway. The number of trailing zeros is used for the selection between the preferred exponent (QP) and the normalized exponent for integer representation (QNI), as described in the following paragraph.

The exponent calculation unit selects either QP or QNI. The preferred exponent should be selected whenever possible. This is only feasible when $\mathrm{QNI} < \mathrm{QP} \le \mathrm{QNI} + \mathrm{TZ}$, that is, when there is a sufficient number of trailing zeros to shift the significand to the right. Furthermore, the final exponent must satisfy the minimum and maximum exponent range. The exponent calculation is illustrated in Algorithm 9, where the value

$$\text{offset2} = 783 - \text{bias} = 783 - 398 = 385 \tag{30}$$

is used to add the IEEE 754-2008 bias and to remove offset1 again (which was introduced in (27)).

$\mathrm{QNI}' = \mathrm{QNF} + 1$; $Z' = \mathrm{ZM1}' = \mathrm{ZP1}' = 0$; $\mathrm{TZ}' = 0$; $j = 0$

while $(Z' < 10^{p})$ and $(\mathrm{QNI}' \ge q_{\min})$

$$Z' = \begin{cases} 10 \cdot Z' + z_{j+1} & \text{if } z_{j+1} \ge 0 \\ 10 \cdot \mathrm{ZM1}' + (z_{j+1} + 10) & \text{if } z_{j+1} < 0 \end{cases}$$

$$\mathrm{ZM1}' = \begin{cases} 10 \cdot Z' + (z_{j+1} - 1) & \text{if } z_{j+1} > 0 \\ 10 \cdot \mathrm{ZM1}' + (z_{j+1} + 9) & \text{if } z_{j+1} \le 0 \end{cases}$$

$$\mathrm{ZP1}' = \begin{cases} 10 \cdot Z' + (z_{j+1} + 1) & \text{if } z_{j+1} \ge -1 \\ 10 \cdot \mathrm{ZM1}' + (z_{j+1} + 11) & \text{if } z_{j+1} < -1 \end{cases}$$

$$\mathrm{TZ}' = \begin{cases} \mathrm{TZ}' + 1 & \text{if } z_{j+1} = 0 \\ 0 & \text{else} \end{cases}$$

$\mathrm{QNI}' = \mathrm{QNI}' - 1$; $j = j + 1$

end while

$\mathrm{QNI} = \mathrm{QNI}' + 1$ (remove the impact of the rounding digit)

$$R' = \begin{cases} z_{j+1} - 1 & \text{if remainder} < 0 \\ z_{j+1} & \text{else} \end{cases}$$

$\mathrm{ZM1}'' = \lfloor \mathrm{ZM1}' \cdot 10^{-1} \rfloor$; $Z'' = \lfloor Z' \cdot 10^{-1} \rfloor$; $\mathrm{ZP1}'' = \lfloor \mathrm{ZP1}' \cdot 10^{-1} \rfloor$

$$(Z, \mathrm{ZP1}, R) = \begin{cases} (\mathrm{ZM1}'', Z'', R' + 10) & \text{if } R' < 0 \\ (Z'', \mathrm{ZP1}'', R') & \text{else} \end{cases}$$

$$\mathrm{sb} = \begin{cases} 0 & \text{if remainder} = 0 \\ 1 & \text{else} \end{cases}$$

$$\mathrm{GUF} = \begin{cases} 1 & \text{if } \mathrm{QNI} = q_{\min} \text{ and } Z' < 10^{p} \text{ and } (\mathrm{sb} = 1 \text{ or } R \ne 0) \\ 0 & \text{else} \end{cases}$$

if $\mathrm{GUF} = 1$ then underflow

$$\mathrm{TZ} = \begin{cases} 0 & \text{if } \mathrm{sb} = 1 \\ \mathrm{TZ}' & \text{else} \end{cases}$$

Algorithm 8: On-the-fly conversion with gradual underflow handling.
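The core of the on-the-fly conversion in Algorithm 8, that is, the simultaneous update of the three registers and of the trailing zeros counter, can be sketched as below. This is a simplified sketch with hypothetical names: it omits the exponent tracking, the gradual underflow handling, and the final rounding-digit extraction, and it assumes a positive quotient whose first nonzero signed digit is positive.

```python
def on_the_fly(digits):
    """Convert signed radix-10 digits (most significant first, each in -8..8).

    Returns (Z, ZM1, ZP1, TZ): the conventional quotient, the quotient minus
    one, the quotient plus one, and the number of trailing zeros of Z.
    """
    z = zm1 = zp1 = 0
    tz = 0
    for zj in digits:
        # All three registers are updated from the *previous* Z and ZM1,
        # so no carry or borrow ever has to propagate.
        new_z = 10 * z + zj if zj >= 0 else 10 * zm1 + (zj + 10)
        new_zm1 = 10 * z + (zj - 1) if zj > 0 else 10 * zm1 + (zj + 9)
        new_zp1 = 10 * z + (zj + 1) if zj >= -1 else 10 * zm1 + (zj + 11)
        z, zm1, zp1 = new_z, new_zm1, new_zp1
        tz = tz + 1 if zj == 0 else 0
    return z, zm1, zp1, tz


if __name__ == "__main__":
    # 1*1000 - 1*100 + 0*10 + 3 = 903
    print(on_the_fly([1, -1, 0, 3]))
```

Every appended digit lies between 0 and 9 in all six cases, which is why the conversion can be done by concatenation only and needs no carry-propagate adder.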

Rounding is required when the number of significant digits of the quotient exceeds the length $p$ of a decimal word. The divider presented in this paper provides the five rounding modes required by [4]. Rounding selects either the integer quotient $Z_e$ or the incremented quotient $\mathrm{ZP1}_e = Z_e + 1$. Rounding overflow would occur if $Z_e + 1$ overflowed due to rounding; as proven by Theorem A.1 in the appendix, this cannot happen in DFP division. The selection of the quotient depends on the rounding mode, the rounding digit (R), the sticky bit (sb), and the least significant digit (LSD) of the quotient. The round-up detection is summarized in Table 3.
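The round-up conditions of Table 3 can be written as a small decision function. This is a sketch under assumptions: the mode names are hypothetical string labels, `lsd_odd` stands for bit 0 of the quotient's least significant digit, and `negative` stands for the sign of the result.

```python
def round_up(mode, r, sb, lsd_odd, negative):
    """Return True if the incremented quotient ZP1_e must be selected.

    r: rounding digit (0..9), sb: sticky bit, lsd_odd: LSD(0) of the quotient,
    negative: sign of the result.
    """
    sb = bool(sb)
    discarded = r > 0 or sb  # something nonzero is cut off
    if mode == "ties_to_even":
        return r > 5 or (r == 5 and sb) or (r == 5 and not sb and lsd_odd)
    if mode == "ties_to_away":
        return r >= 5
    if mode == "toward_positive":
        return (not negative) and discarded
    if mode == "toward_negative":
        return negative and discarded
    if mode == "toward_zero":
        return False
    raise ValueError(f"unknown rounding mode: {mode}")
```

For example, an exact tie (R = 5, sb = 0) rounds up under ties-to-even only if the least significant digit is odd, whereas ties-to-away rounds up regardless of the sticky bit and the LSD.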

Table 3: Round-up detection.

| Rounding mode | Round-up detection |
|---|---|
| Round Ties To Even | $(R > 5) + (R \equiv 5) \cdot \mathrm{sb} + (R \equiv 5) \cdot \overline{\mathrm{sb}} \cdot \mathrm{LSD}(0)$ |
| Round Ties To Away | $(R \ge 5)$ |
| Round Toward Positive | $\overline{\mathrm{sign}} \cdot ((R > 0) + \mathrm{sb})$ |
| Round Toward Negative | $\mathrm{sign} \cdot ((R > 0) + \mathrm{sb})$ |
| Round Toward Zero | $0$ |

Legend: "+": logical OR; "·": logical AND; overline: logical NOT.

offset2 = 783 − 398 = 385
if $(\mathrm{QP} \le \mathrm{QNI})$ or $(\mathrm{QNI} + \mathrm{TZ} < \mathrm{QP})$ or $(\mathrm{QP} > q_{\max})$ then
  (a) $\mathrm{QNI} > q_{\max} \Rightarrow$ overflow
  (b) $q_{\min} \le \mathrm{QNI} \le q_{\max} \Rightarrow$ $q_Z = \mathrm{QNI} - \text{offset2}$, $Z_e = Z$, $\mathrm{ZP1}_e = \mathrm{ZP1}$
  (c) $\mathrm{QNI} < q_{\min} \Rightarrow$ underflow
else
  (a) $q_{\min} \le \mathrm{QP} \le q_{\max} \Rightarrow$ $q_Z = \mathrm{QP} - \text{offset2}$, $Z_e = Z \gg (\mathrm{QP} - \mathrm{QNI})$, $\mathrm{ZP1}_e = \mathrm{ZP1} \gg (\mathrm{QP} - \mathrm{QNI})$
  (b) $\mathrm{QNI} < q_{\min} \Rightarrow$ underflow

Algorithm 9: Exponent calculation.

Finally, the result is encoded again, and the exception unit might assert six exception signals. These are division by zero, invalid operation, result is infinite, inexact, overflow, and underflow. Division by zero is asserted when the divisor equals zero but the dividend is nonzero. The operation is called invalid when both operands are zero, both operands are infinity, or any of the operands is either a signaling or a quiet NaN. The inexact flag is asserted when the rounding digit or the sticky bit is nonzero. Overflow and underflow are computed as described in the exponent calculation unit and are listed in Algorithms 8 and 9. The infinity exception is asserted when the result is greater than the largest number, that is, when either the overflow or the division by zero exception is asserted, or when the dividend is infinite while the divisor is a finite number.
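The case split of Algorithm 9 can be sketched as follows. This is an illustrative sketch only: the function name and the returned status strings are hypothetical, `q_min` and `q_max` are assumed to be given in the same offset scale as QNI and QP, and the operator ≫ is read as a right shift by decimal digits.

```python
OFFSET2 = 783 - 398  # = 385, see (30)


def select_exponent(qp, qni, tz, z, zp1, q_min, q_max):
    """Return (status, q_Z, Z_e, ZP1_e) following the case split of Algorithm 9.

    The last three entries are None when overflow or underflow is signaled.
    """
    if qp <= qni or qni + tz < qp or qp > q_max:
        # The preferred exponent cannot be reached: keep the normalized result.
        if qni > q_max:
            return ("overflow", None, None, None)
        if qni < q_min:
            return ("underflow", None, None, None)
        return ("ok", qni - OFFSET2, z, zp1)
    # Here QNI < QP <= QNI + TZ and QP <= q_max hold.
    if qp < q_min:
        # QP below the range implies QNI < q_min, that is, case (b).
        return ("underflow", None, None, None)
    shift = qp - qni  # number of trailing decimal digits to drop
    return ("ok", qp - OFFSET2, z // 10 ** shift, zp1 // 10 ** shift)
```

Because the preferred exponent is only taken when enough trailing zeros are available, the decimal right shift in the second branch never discards a nonzero digit of Z.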

6. Implementation Results

All dividers are modeled using VHDL and are implemented for Xilinx Virtex-5 devices with speed grade −2 using Xilinx ISE 10.1. The post-place and route results of the fixed-point dividers are listed in Table 4.

All five types of fixed-point dividers require 19 cycles to perform a division. This includes 17 cycles to determine the quotient digits and the rounding digit, one additional cycle to normalize the result (when the first quotient digit is zero), and one cycle of latency to perform the on-the-fly conversion and the sticky bit determination. As expected, the type1 divider uses the most resources in terms of look-up tables (LUTs) and flip-flops (FFs). The type2, type3.a, and type3.b dividers consume fewer LUTs but require an additional five (type2) or two (type3.a and type3.b) BRAMs. However, comparing the delays of the dividers leads to an unexpected result. Contrary to decimal divider implementations in ASIC designs, on FPGA platforms the shift-and-subtract algorithm is the fastest. The reason is that the signal propagation on the FPGA's internal fast carry chains is faster than the interconnections between slices over the FPGA's general routing matrix. The type1 divider exploits these fast carry chains, whereas the type2, type3.a, type3.b, and type4 dividers only use the normal, slow slice interconnection resources. In ASIC implementations, for instance, the longest path of the type4 divider would be much shorter than the longest path of the type1 divider, but on FPGA architectures the situation is the opposite.

One of the fastest decimal fixed-point dividers on ASICs is the design of Lang and Nannarelli [14]. They minimize the critical path in the digit recurrence by using a fast quotient digit selection function based on binary carry-propagate adders that subtract fixed values from the estimation of the current residual. These fixed values depend on estimations of the divisor and are precomputed and stored in a ROM. Therefore, the latency of the ROM does not contribute to the critical path of the digit recurrence. An implementation of such a divider on FPGAs would perform poorly because the required carry-save adder with small error estimation cannot be implemented efficiently on the FPGA's slice structure. However, the concept of removing the ROM from the critical path is applied to the type3.a and type3.b dividers, and the corresponding post-place and route results are shown in Table 4. These results show that the type3.a divider uses a similar number of LUTs as the type2 divider, but its cycle time is much higher. This increased cycle time can be explained by the higher complexity of the multiplexers that select the divisor's multiples in the digit recurrence step. These multiplexers show a poor performance in terms of propagation delay because they are implemented as a tree with slow slice interconnections. In contrast, the type3.b divider applies dedicated fast multiplexers, which have a propagation delay of only one LUT instance and eight fast carry chains. This dedicated fast multiplexer speeds up the divider and reduces the maximum cycle time by one nanosecond. Unfortunately, the number of used LUTs increases dramatically, by approximately 1000 LUTs.

The comparison of the type2, type3.a, type3.b, and type4 dividers in terms of speed and area (number of occupied LUTs and FFs combined) shows that the type2 algorithm is the fastest and requires the least number of slices. Hence, there is no benefit in removing the ROM's latency from the critical path because the complexity is moved to the selection of the divisor's multiples. The type1 divider is only 5% faster than the type2 divider, but the speed is bought at a high price because the number of occupied slices is 75% higher. Therefore, the type2 divider shows a good tradeoff in terms of area and latency and is used for the floating-point divider presented in this paper. In the following paragraph, we compare the type2 implementation with other published fixed-point dividers.
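The percentages quoted above can be reproduced from the cycle times and the combined LUT and FF counts of Table 4. The short sketch below does this arithmetic; the dictionary and function names are hypothetical, and the numbers are copied from Table 4 with the decimal points of the cycle times restored.

```python
CYCLES_PER_DIVISION = 19

# Values taken from Table 4.
cycle_time_ns = {"type1": 8.1, "type2": 8.5, "type3.a": 10.1,
                 "type3.b": 9.1, "type4": 12.1}
luts_and_ffs = {"type1": 3868, "type2": 2210, "type3.a": 2304,
                "type3.b": 3240, "type4": 2203}


def latency_ns(name):
    """Overall latency of one fixed-point division."""
    return CYCLES_PER_DIVISION * cycle_time_ns[name]


def relative_to_type2(name):
    """Return (frequency ratio, area ratio) with respect to the type2 divider."""
    speed = cycle_time_ns["type2"] / cycle_time_ns[name]
    area = luts_and_ffs[name] / luts_and_ffs["type2"]
    return speed, area


if __name__ == "__main__":
    for name in cycle_time_ns:
        speed, area = relative_to_type2(name)
        print(f"{name}: {latency_ns(name):.1f} ns, "
              f"speed x{speed:.2f}, area x{area:.2f}")
```

For type1 this gives a frequency ratio of about 1.05 and an area ratio of about 1.75, which matches the 5% and 75% figures in the text.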

Table 4: Results for normalized decimal fixed-point dividers, p = 16 + 1.

| | Type1 | Type2 | Type3.a | Type3.b | Type4 |
|---|---|---|---|---|---|
| Number of LUTs | 3595 | 1704 | 1749 | 2751 | 1846 |
| Number of FFs | 1234 | 1126 | 1262 | 1262 | 1031 |
| Number of LUTs and FFs combined | 3868 | 2210 | 2304 | 3240 | 2203 |
| Number of 36k BRAMs | 0 | 5 | 2 | 2 | 0 |
| Number of LUTs, normalized to the type2 divider | 2.11 | 1.00 | 1.03 | 1.61 | 1.08 |
| Number of FFs, normalized to the type2 divider | 1.10 | 1.00 | 1.12 | 1.12 | 0.92 |
| Number of LUTs and FFs combined, normalized to the type2 divider | 1.75 | 1.00 | 1.04 | 1.47 | 1.00 |
| Cycle time (ns) | 8.1 | 8.5 | 10.1 | 9.1 | 12.1 |
| Overall latency (ns) (19 × cycle time) | 154 | 162 | 192 | 173 | 230 |
| Max. frequency (MHz) | 123 | 118 | 99 | 110 | 83 |
| Max. frequency, normalized to the type2 divider | 1.05 | 1.00 | 0.84 | 0.93 | 0.70 |

Four other FPGA-based dividers are presented by Ercegovac and McIlhenny [17, 18], Deschamps and Sutter [24], Zhang et al. [25], and Vestias and Neto [12]. These implementations are compared to our type2 divider in the following. The divider presented in [17, 18] is based on a digit recurrence algorithm that only requires limited-precision multipliers, adders, and LUTs. Furthermore, a compensation term is computed in the digit recurrence that compensates the error caused by this limited precision. The design is optimized for Virtex-5 FPGAs and has a good area characteristic (a 14-digit divider requires 1263 LUTs) but suffers from a high cycle time of 13.1 ns, which is more than 50% higher than that of the type1 or type2 design proposed in this paper. The reason for that high cycle time is (similar to the type4 divider) the long critical paths across many LUTs that cannot exploit the FPGA's fast carry chains. Therefore, the design would probably fit better on ASICs with dedicated routing. The dividers implemented in [24] apply two different digit recurrence algorithms with a redundancy of $\rho = 1$. The quotient digit selection function is very complex because it requires three MSDs of the divisor and four MSDs of the residual. The cycle times are high, but it must be taken into account that these designs are optimized for Virtex-4 FPGAs. Nevertheless, the number of used LUTs is very high. The divider presented in [25] applies a radix-100 digit recurrence algorithm with comprehensive prescaling of the dividend and divisor. It is optimized for Virtex-II Pro devices; hence it is hard to compare it with the dividers presented in this paper. However, the critical path includes a decimal carry-save adder tree, two carry-propagate adder stages, and three multiplexer stages. It can therefore be assumed that this algorithm would also perform poorly on Virtex-5 devices, although two quotient digits are retired in each iteration step. Vestias and Neto [12] propose four different decimal dividers optimized for different speed and area tradeoffs. Contrary to the circuits presented in this paper, the dividers apply the Newton-Raphson algorithm. The multiplier used is sequential; therefore, many cycles per division are required. However, each divider has a higher delay and a greater LUT usage compared to our type2 implementation.

Table 5: Performance comparison of fixed-point dividers.

| Design | Occupied area | Cycle time (ns) | Latency (ns) |
|---|---|---|---|
| Ercegovac and McIlhenny [17] (p = 14, Virtex-5) | 1263 (6,2)-LUTs | 13.1 | 197 |
| Type2 divider, equalized to p = 14 | 1692 (6,2)-LUTs | 8.5 | 136 |
| Deschamps and Sutter [24] (p = 16, Virtex-4), (i) nonrestoring algorithm | 2974 (4,1)-LUTs | 21.4 | 386 |
| Deschamps and Sutter [24] (p = 16, Virtex-4), (ii) SRT-like algorithm | 3799 (4,1)-LUTs | 16.6 | 300 |
| Zhang et al. [25] (p = 16, Virtex-II Pro) | 3976 slices | 20.0 | 420 |
| Vestias and Neto [12] (p = 16, Virtex-4, Newton-Raphson), (i) Type A3 (0 DSPs, 118 cycles) | 2756 (4,1)-LUTs | 3.4 | 394 |
| Vestias and Neto [12], (ii) Type A4 (0 DSPs, 90 cycles) | 3768 (4,1)-LUTs | 3.4 | 306 |
| Vestias and Neto [12], (iii) Type A5 (7 DSPs, 112 cycles) | 2091 (4,1)-LUTs | 3.4 | 380 |
| Vestias and Neto [12], (iv) Type A6 (10 DSPs, 86 cycles) | 2718 (4,1)-LUTs | 3.4 | 292 |
| Type2 divider, equalized to p = 16 | 1704 (6,2)-LUTs | 8.5 | 153 |

Table 5 lists the results of the dividers presented in [17, 24, 25] together with the results of the type2 divider. However, for a fair performance comparison, the precision of the type2 divider is equalized to match the number of calculated quotient digits of the corresponding divider. Furthermore, it should be noted that the dividers presented in [12, 17, 24, 25] do not provide rounding support.

For the floating-point implementation, the fixed-point divider of type2 is used because type1 and type3.b consume too many resources and type3.a and type4 are too slow. This choice requires five additional BRAMs, but these are available in sufficient quantities on Virtex-5 devices. The corresponding post-place and route results for two configurations with different numbers of serialization stages are listed in Table 6. A comparison with other implementations is complicated because there are no other FPGA-based implementations of decimal floating-point dividers published yet. Therefore, we can only compare the decimal divider with binary dividers implemented on the same FPGA. Table 7 lists post-place and route results of binary floating-point dividers for the double data format on a Virtex-5, as provided by the Xilinx Core Generator [26]. These dividers also provide the exception signals overflow, underflow, invalid operation, and divide-by-zero. Unfortunately, the binary floating-point divider does not support gradual underflow because this feature increases the complexity. The divider with one cycle per division is an unpipelined and fully parallel design, and the divider that requires 55 cycles per division is a fully sequential design. The binary floating-point divider with 20 cycles per division is best suited for a comparison with the decimal floating-point divider because it requires a comparable number of cycles for one operation. Decimal arithmetic clearly has a large overhead regarding the total number of used LUTs, but the cycle time and latency are similar. This resource overhead can be explained on the one hand by the inefficient representation of decimal values in binary logic and on the other hand by the more complex specification of the decimal standard IEEE 754-2008. For instance, decimal floating-point division requires additional encoders, decoders, and a normalization stage.

Table 6: Post-place and route results of the floating-point divider based on type2.

| | Config. 1 | Config. 2 |
|---|---|---|
| Number of cycles per float division | 23 | 21 |
| Number of LUTs | 3231 | 3205 |
| Number of FFs | 1630 | 1377 |
| Number of LUTs and FFs combined | 3571 | 3566 |
| Cycle time (ns) | 8.5 | 9.1 |
| Overall latency (ns) | 195.5 | 191 |
| Max. frequency (MHz) | 118 | 110 |

Table 7: Post-place and route results of a 64-bit binary floating-point divider (Core Gen).

| Cycles per division | 1 | 10 | 20 | 55 |
|---|---|---|---|---|
| Number of LUTs | 3161 | 592 | 425 | 394 |
| Number of FFs | 196 | 518 | 469 | 542 |
| Number of LUTs and FFs combined | 3232 | 917 | 675 | 675 |
| Cycle time (ns) | 153.7 | 20.8 | 8.9 | 6.8 |
| Overall latency (ns) | 153.7 | 208 | 178 | 374 |
| Max. frequency (MHz) | 6.5 | 48 | 112 | 147 |

7. Conclusion

In this paper, we have presented five different radix-10 digit recurrence division algorithms. The type1 divider implements the simple shift-and-subtract algorithm. The type2 divider is based on a nonrestoring algorithm with a ROM-based quotient digit selection function. The type3.a and type3.b dividers are both similar to type2 but use fast binary carry-propagate adders in the quotient digit selection function. However, the type3.a divider uses LUT-based multiplexers, whereas the type3.b divider uses fast carry chain-based multiplexers. The type4 divider applies a new algorithm with constant digit selection functions. This type4 divider requires neither a ROM nor multiplexers to select multiples of the divisor.

We have shown that the shift-and-subtract algorithm, which has the worst latency in ASIC architectures, is the fastest design on FPGAs but uses the most FPGA resources in terms of LUTs. A good tradeoff between latency and area is the type2 architecture, with a redundant carry-save representation of the residual, a radix-2 representation of the two MSDs, and a quotient digit selection function implemented in ROM. Furthermore, we have extended this fixed-point divider to a fully IEEE 754-2008 compliant decimal floating-point divider for the decimal64 data format, and finally we have shown implementation results of all dividers.

Appendix

Proofs

Theorem A.1 (rounding overflow). Let us consider a decimal floating-point division $Q = X / D$ with the signs $s_i$, the significands $c_i$, the exponents $q_i$, and the precision $p$:

$$(-1)^{s_Q} \cdot c_Q \cdot 10^{q_Q} = \frac{(-1)^{s_X} \cdot c_X \cdot 10^{q_X}}{(-1)^{s_D} \cdot c_D \cdot 10^{q_D}}. \tag{A.1}$$

Then rounding overflow, that is, an overflow that arises due to adding +1 to the least significant digit of the final result, cannot occur.

Proof. Assume to the contrary that there are a floating-point dividend X, a floating-point divisor D, and a proper rounding mode for which the decimal floating-point division produces a rounding overflow. Without loss of generality, we consider in the following only positive operands ($s_X = s_D = 0$), normalized significands ($c_X, c_D \ge 10^{p-1}$), and exponents equal to zero ($q_X = q_D = 0$). In the case of rounding overflow, the $p$ digits of the quotient must be all nines and the remainder must be nonzero, that is,

$$\underbrace{c_X}_{p\ \text{digits}} = c_Q \cdot c_D + \text{rem} = \underbrace{9.9 \cdots 9}_{p\ \text{digits}} \cdot \underbrace{c_D}_{p\ \text{digits}} + \underbrace{\text{rem}}_{p\ \text{digits}} \tag{A.2}$$

with

$$0 < \text{rem} < c_D \cdot 10^{-p+1}. \tag{A.3}$$

Condition (A.2) is fulfilled for

$$c_D = 10^{p-1}, \qquad c_X = \underbrace{99 \cdots 9}_{p\ \text{digits}}, \tag{A.4}$$

but this violates condition (A.3) because the remainder is zero (rem = 0); that is, the result is exact, no round-up is applied, and hence no rounding overflow occurs.


Otherwise, if the divisor is greater than $10^{p-1}$, it follows that

$$\begin{aligned} \underbrace{c_X}_{p\ \text{digits}} &= (9.9 \cdots 9) \cdot c_D + \text{rem} \\ &= (10 - 10^{-p+1}) \cdot c_D + \text{rem} \\ &= 10 \cdot c_D + \underbrace{(-10^{-p+1} \cdot c_D + \text{rem})}_{< 0} \\ &= \underbrace{10 \cdot c_D + \mu}_{\ge\, p+1\ \text{digits}} \end{aligned} \tag{A.5}$$

with $-10 < \mu < 0$. The left-hand side of the equation has, by definition, a precision of $p$ digits, while the right-hand side has a precision of more than $p$ digits. This contradiction finally proves that no rounding overflow can occur.
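The statement of Theorem A.1 can also be checked exhaustively for small precisions. The sketch below is only a sanity check under the same normalization as in the proof (positive operands, normalized integer significands, zero exponents), not a substitute for the proof; the function name is hypothetical.

```python
def rounding_overflow_cases(p):
    """List all (c_x, c_d) whose p-digit truncated quotient is 99...9 and inexact."""
    lo, hi = 10 ** (p - 1), 10 ** p
    bad = []
    for c_d in range(lo, hi):
        for c_x in range(lo, hi):
            # Scale the dividend so that the integer quotient has exactly p digits.
            k = p - 1 if c_x >= c_d else p
            quotient, rem = divmod(c_x * 10 ** k, c_d)
            # A round-up can only overflow if all p digits are nines and
            # something nonzero is left to round.
            if quotient == hi - 1 and rem != 0:
                bad.append((c_x, c_d))
    return bad


if __name__ == "__main__":
    for p in (1, 2, 3):
        print(p, rounding_overflow_cases(p))
```

The only operand pair that yields an all-nines quotient is the one in (A.4), and for it the remainder is zero, so the returned list is empty for every precision tried.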

References

[1] M. Baesler, S. Voigt, and T. Teufel, "FPGA implementations of radix-10 digit recurrence fixed-point and floating-point dividers," in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig '11), pp. 13–19, IEEE Computer Society, Los Alamitos, CA, USA, December 2011.

[2] IEEE Task P754, ANSI/IEEE 754-1985, Standard for Binary Floating-Point Arithmetic, New York, NY, USA, August 1985.

[3] M. F. Cowlishaw, "Decimal floating-point: algorism for computers," in Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH '03), pp. 104–111, IEEE Computer Society, Washington, DC, USA, June 2003.

[4] IEEE Task P754, IEEE 754-2008, Standard for Floating-Point Arithmetic, New York, NY, USA, August 2008.

[5] ANSI/IEEE, ANSI/IEEE Std 854-1987: An American National Standard: IEEE Standard for Radix-Independent Floating-Point Arithmetic, New York, NY, USA, October 1987.

[6] A. Y. Duale, M. H. Decker, H. G. Zipperer, M. Aharoni, and T. J. Bohizic, "Decimal floating-point in z9: an implementation and testing perspective," IBM Journal of Research and Development, vol. 51, no. 1-2, pp. 217–227, 2007.

[7] C. F. Webb, "IBM z10: the next-generation mainframe microprocessor," IEEE Micro, vol. 28, no. 2, pp. 19–29, 2008.

[8] L. Eisen, J. W. Ward, H. W. Tast, et al., "IBM POWER6 accelerators: VMX and DFU," IBM Journal of Research and Development, vol. 51, no. 6, pp. 663–683, 2007.

[9] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd, "Power7: IBM's next-generation server processor," IEEE Micro, vol. 30, no. 2, pp. 7–15, 2010.

[10] S. F. Oberman and M. J. Flynn, "Division algorithms and implementations," IEEE Transactions on Computers, vol. 46, no. 8, pp. 833–854, 1997.

[11] L. K. Wang and M. J. Schulte, "A decimal floating-point divider using Newton-Raphson iteration," Journal of VLSI Signal Processing Systems, vol. 49, no. 1, pp. 3–18, 2007.

[12] M. Vestias and H. Neto, "Revisiting the Newton-Raphson iterative method for decimal division," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '11), pp. 138–143, IEEE Computer Society Press, September 2011.

[13] H. Nikmehr, B. Phillips, and C. C. Lim, "Fast decimal floating-point division," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 9, pp. 951–961, 2006.

[14] T. Lang and A. Nannarelli, "A radix-10 digit-recurrence division unit: algorithm and architecture," IEEE Transactions on Computers, vol. 56, no. 6, pp. 727–739, 2007.

[15] A. Vazquez, E. Antelo, and P. Montuschi, "A radix-10 SRT divider based on alternative BCD codings," in Proceedings of the 25th IEEE International Conference on Computer Design (ICCD '07), pp. 280–287, IEEE Computer Society Press, Los Alamitos, CA, USA, October 2007.

[16] E. Schwarz and S. Carlough, "Power6 decimal divide," in Proceedings of the 18th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '07), pp. 128–133, IEEE Computer Society, July 2007.

[17] M. D. Ercegovac and R. McIlhenny, "Design and FPGA implementation of radix-10 algorithm for division with limited precision primitives," in Proceedings of the 42nd Asilomar Conference on Signals, Systems and Computers (ASILOMAR '08), pp. 762–766, IEEE Computer Society, Pacific Grove, Calif, USA, October 2008.

[18] M. D. Ercegovac and R. McIlhenny, "Design and FPGA implementation of radix-10 combined division/square root algorithm with limited precision primitives," in Proceedings of the 44th Asilomar Conference on Signals, Systems and Computers (Asilomar '10), pp. 87–91, IEEE Computer Society, Pacific Grove, Calif, USA, November 2010.

[19] F. Y. Busaba, C. A. Krygowski, W. H. Li, E. M. Schwarz, and S. R. Carlough, "The IBM z900 decimal arithmetic unit," in Proceedings of the 35th Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1335–1339, IEEE Computer Society, November 2001.

[20] M. Ercegovac and T. Lang, Division and Square Root: Digit-Recurrence Algorithms and Implementations, Kluwer Academic Publishers, Norwell, Mass, USA, 1994.

[21] M. Baesler, S. O. Voigt, and T. Teufel, "A decimal floating-point accurate scalar product unit with a parallel fixed-point multiplier on a Virtex-5 FPGA," International Journal of Reconfigurable Computing, vol. 2010, Article ID 357839, 13 pages, 2010.

[22] M. Baesler, S. O. Voigt, and T. Teufel, "A radix-10 digit recurrence division unit with a constant digit selection function," in Proceedings of the 28th IEEE International Conference on Computer Design (ICCD '10), pp. 241–246, IEEE Computer Society, Los Alamitos, CA, USA, October 2010.

[23] M. Baesler, S. O. Voigt, and T. Teufel, "An IEEE 754-2008 decimal parallel and pipelined FPGA floating-point multiplier," in Proceedings of the 20th International Conference on Field Programmable Logic and Applications (FPL '10), pp. 489–495, IEEE Computer Society, Washington, DC, USA, September 2010.

[24] J. P. Deschamps and G. Sutter, "Decimal division: algorithms and FPGA implementations," in Proceedings of the 6th Southern Programmable Logic Conference (SPL '10), pp. 67–72, IEEE Computer Society, March 2010.

[25] Y. Zhang, D. Chen, L. Chen, et al., "Design and implementation of a radix-100 decimal division," in Proceedings of IEEE Symposium on Circuit and System (ISCAS '09), Taibei, Taiwan, May 2009.

[26] Xilinx, Xilinx LogiCORE Floating-Point Operator v4.0, April 2008.


    if (not $\mathrm{qNaN}_X$ and not $\mathrm{sNaN}_X$) and ($\mathrm{qNaN}_D$ or $\mathrm{sNaN}_D$) then
        $C'_X = C_D$, $C'_D = 0 \ldots 01$
        $q'_X = q_D$, $q'_D = 0 + \mathrm{bias} = 398$
    else if ($\mathrm{qNaN}_X$ or $\mathrm{sNaN}_X$) then
        $C'_X = C_X$, $C'_D = 0 \ldots 01$
        $q'_X = q_X$, $q'_D = 0 + \mathrm{bias} = 398$
    else (no changes)
        $C'_X = C_X$, $C'_D = C_D$
        $q'_X = q_X$, $q'_D = q_D$

Algorithm 7: Algorithm of NaN handling.

Furthermore, the sticky bit, which is required for rounding, is calculated. It is set to one whenever the final remainder is unequal to zero. Moreover, when the final remainder is negative, the quotient of the type2, type3.a, type3.b, and type4 dividers has to be adjusted by subtracting one LSD. This subtraction does not need another slow CPA because the quotient minus one (ZM1) has already been computed on-the-fly. The calculations of both the sticky bit and the sign bit require the reduction of the redundant remainder by a CPA. This CPA might be subdivided into multiple smaller CPAs to keep the latency low.

5. IEEE 754-2008 Floating-Point Division

The IEEE 754-2008 compliant decimal floating-point divider is an extension of the normalized fixed-point divider. The divider presented in this paper supports the interchange format IEEE 754-2008 decimal64 with DPD encoding, but it can be easily adapted to any other precision and exponent range.

The block diagram of the divider is depicted in Figure 10. The DFP division begins with the decoding of the dividend X and the divisor D and the extraction of their signs, significands, and exponents. In the following, the significands of the operands X ($c_X$) and D ($c_D$) are regarded as integers. Therefore, the corresponding exponents are $q_X$ and $q_D$.

According to [4], if one of the operands is a signaling NaN (sNaN) or a quiet NaN (qNaN), then the result is also a quiet NaN with the payload of the original NaN. Hence, in order to preserve the payload while using an unmodified divider, the NaN handling unit sets the NaN-holding operand as the dividend and resets the divisor to $1.0 \cdot 10^{0}$, as depicted in Algorithm 7.
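The operand routing of Algorithm 7 can be sketched in software as follows. This is a minimal illustration, not the authors' VHDL: the `Operand` record, its field names, and the function `route_nan` are hypothetical, and a single `is_nan` flag stands in for the separate qNaN and sNaN signals.

```python
from dataclasses import dataclass

BIAS = 398  # decimal64 exponent bias, as used in Algorithm 7


@dataclass
class Operand:
    coeff: int            # integer significand (carries the payload for a NaN)
    biased_exp: int       # biased exponent
    is_nan: bool = False  # hypothetical flag: True for a quiet or signaling NaN


def route_nan(x: Operand, d: Operand) -> tuple[Operand, Operand]:
    """Return (dividend, divisor) as they would be fed to the unmodified divider."""
    one = Operand(coeff=1, biased_exp=BIAS)  # the value 1 * 10^0
    if not x.is_nan and d.is_nan:
        # Only the divisor is a NaN: move it (and its payload) to the dividend.
        return Operand(d.coeff, d.biased_exp, True), one
    if x.is_nan:
        # The dividend is a NaN: keep its payload and neutralize the divisor.
        return x, one
    return x, d  # no NaN involved, no changes
```

Dividing the NaN-holding operand by one leaves its significand untouched, which is why the payload survives without any change to the divider itself.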

The fixed-point dividers presented in Section 4 require normalized operands. Therefore, the number of leading zeros of both operands is counted, and the significands are normalized by barrel shifters. For performance reasons, the leading zeros counters exploit the FPGA's fast carry chains, as proposed in [23].
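As a behavioral illustration of this normalization step (not of the carry-chain implementation from [23]), the following sketch counts the leading zeros of a $p$-digit significand and shifts it left by that many decimal digits. The function names are hypothetical.

```python
P = 16  # number of significand digits in decimal64


def leading_zeros(c: int, p: int = P) -> int:
    """Number of leading zero digits of c when written with p decimal digits."""
    return p if c == 0 else p - len(str(c))


def normalize(c: int, p: int = P) -> tuple[int, int]:
    """Shift c left until its most significant digit is nonzero; return (c_norm, LZ)."""
    lz = leading_zeros(c, p)
    return c * 10 ** lz, lz
```

For example, `normalize(123)` yields the 16-digit significand `1230000000000000` together with `LZ = 13`; the leading zeros counts feed the exponent estimate.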

The exponent of a floating-point division is first estimated as the exponent of the normalized quotient in fractional representation (QNF) and is updated iteratively in each digit recurrence step. Since most of the decimal floating-point numbers specified by IEEE 754-2008 have multiple representations of the same value, the result of a division might also have more than one correct representation, differing in the exponent. However, to obtain a unique and reproducible result in the case of such an ambiguity, IEEE 754-2008 defines the exponent of the result, which is called the preferred exponent (QP).

Both exponents, QNF and QP, are positive integers and are determined as follows:

$$\Delta q = (q_X - q_D), \qquad \Delta \mathrm{LZ} = (\mathrm{LZ}_X - \mathrm{LZ}_D), \tag{26}$$

$$\mathrm{QNF} = \Delta q - \Delta \mathrm{LZ} + \mathrm{offset1}, \qquad \mathrm{QP} = \Delta q + \mathrm{offset1}. \tag{27}$$

The offset1 is a bias that ensures that QNF and QP are positive, since unsigned integer calculation is used in this paper. The bias is composed of

$$\begin{aligned}
\mathrm{offset1} = 783 = {} & 398 && \text{(IEEE 754-2008 bias)} \\
{} + {} & 369 && (q_{\max}) \\
{} + {} & 15 && \text{(conversion fractional to integer)} \\
{} + {} & 1 && \text{(normalization of the result).}
\end{aligned} \tag{28}$$
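Equations (26) to (28) amount to a few integer additions. The sketch below evaluates them directly; the function name and argument names are hypothetical, and the inputs are assumed to be the operand exponents and the leading zeros counts of the significands.

```python
BIAS = 398    # IEEE 754-2008 decimal64 bias
Q_MAX = 369   # maximum exponent q_max
OFFSET1 = BIAS + Q_MAX + 15 + 1  # = 783, see (28)


def estimate_exponents(q_x: int, q_d: int, lz_x: int, lz_d: int) -> tuple[int, int]:
    """Return (QNF, QP) according to (26) and (27)."""
    delta_q = q_x - q_d      # exponent difference
    delta_lz = lz_x - lz_d   # difference of the leading zeros counts
    qnf = delta_q - delta_lz + OFFSET1
    qp = delta_q + OFFSET1
    return qnf, qp
```

Because only the difference of the two exponents enters the formulas, it makes no difference whether biased or unbiased operand exponents are passed in, as long as both use the same convention.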

The normalized fixed-point division unit then computes $p = 16$ significand digits and one additional rounding digit. Additionally, the divider encompasses the on-the-fly conversion unit, which also detects and handles gradual underflow with zero delay overhead. One signed quotient digit $z_j$ is retired in each cycle, and the on-the-fly conversion algorithm converts this signed-digit representation into the BCD-8421-coded partial result [20]. The conversion is accomplished in every iteration step and does not need a slow CPA (see Algorithm 8). The partial quotient is stored in a register $Z'$. Moreover, two additional registers, $ZM1'$ and $ZP1'$, are provided: $ZM1'$ is the partial quotient decremented by one least significant digit (LSD), and $ZP1'$ is the partial quotient incremented by one LSD. $ZM1'$ is selected when the final residual in the last iteration is negative, and $ZP1'$ is required by the following rounding operation. Furthermore, the exponent for the normalized significand in integer representation (QNI) is calculated (see (29)), and gradual underflow handling is performed at no extra cost (see Algorithm 8). The exponent QNI is computed by decrementing the exponent of the normalized significand in fractional representation (QNF) by one in each iteration step. The recurrence iteration terminates earlier in case of gradual underflow. The gradual underflow signal (GUF) is asserted high when $\mathrm{QNI} = q_{\min}$ and the calculated integer quotient is less than $10^{p}$:

$$\mathrm{QNI} = \begin{cases}
q_{\min} & \text{if } \mathrm{GUF} = 1, \\
\mathrm{QNF} - (p - 1) & \text{if } z_1 > 0, \\
\mathrm{QNF} - p & \text{if } z_1 = 0.
\end{cases} \tag{29}$$

The extended quotient (composed of the calculated quotient and the rounding digit) must be decremented by one if the final remainder is less than zero. The sticky bit (sb) is used for rounding and is asserted high whenever the remainder is unequal to zero.

[Figure 10: Block diagram of the floating-point divider. Blocks, from input to output: DPD decoders for dividend and divisor; NaN handling; leading zeros count with QNF and QP calculations; dividend and divisor normalization; normalized fixed-point division; on-the-fly conversion with gradual underflow handling; exponent calculation; rounding unit; exception unit; DPD encoder.]

Moreover, the number of trailing zeros (TZ) is computed on-the-fly. This number indicates by how many digits the computed quotient might be shifted to the right without losing accuracy. The number of trailing zeros has to be counted only for $Z'$ because, if $ZM1'$ is selected, the residual will be unequal to zero and TZ = 0 anyway. The number of trailing zeros is used for the selection between the preferred exponent (QP) and the normalized exponent for integer representation (QNI), as described in the following paragraph.

The exponent calculation unit selects either QP or QNI. The preferred exponent should be selected whenever possible. This is only feasible when $\mathrm{QNI} < \mathrm{QP} \le \mathrm{QNI} + \mathrm{TZ}$, that is, when there is a sufficient number of trailing zeros to shift the significand to the right. Furthermore, the final exponent must lie within the minimum and maximum exponent range. The algorithm of the exponent calculation is illustrated in Algorithm 9, where the value

$$\mathrm{offset2} = 783 - \mathrm{bias} = 783 - 398 = 385 \tag{30}$$

is used to add the IEEE 754-2008 bias and to remove offset1 again (which was introduced in (27)).
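The selection between QP and QNI can be sketched as below. This follows one reading of Algorithm 9 and is not the authors' implementation: the function name, the status strings, and the tuple layout are hypothetical, `q_min` and `q_max` are assumed to be given in the same offset1-biased domain as QP and QNI, and the second branch is assumed to signal underflow only when QP itself falls below `q_min`.

```python
OFFSET2 = 783 - 398  # = 385, see (30)


def select_exponent(qp, qni, tz, z, zp1, q_min, q_max):
    """Return (status, q_z, z_e, zp1_e) for the final result."""
    if qp <= qni or qni + tz < qp or qp > q_max:
        # The preferred exponent cannot be reached: keep the normalized result.
        if qni > q_max:
            return "overflow", None, None, None
        if qni < q_min:
            return "underflow", None, None, None
        return "ok", qni - OFFSET2, z, zp1
    # QNI < QP <= QNI + TZ: enough trailing zeros to shift right exactly.
    if qp < q_min:
        return "underflow", None, None, None
    shift = 10 ** (qp - qni)  # decimal right shift by (QP - QNI) digits
    return "ok", qp - OFFSET2, z // shift, zp1 // shift
```

With `qp = 790`, `qni = 788`, and three trailing zeros, a quotient of `45000` is shifted right by two digits to `450`, and the exponent becomes `790 - 385 = 405`.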

    $QNI' = \mathrm{QNF} + 1$; $Z' = ZM1' = ZP1' = 0$; $TZ' = 0$; $j = 0$
    while ($Z' < 10^{p}$) and ($QNI' \ge q_{\min}$)
        $Z' = 10 \cdot Z' + z_{j+1}$ if $z_{j+1} \ge 0$;  $10 \cdot ZM1' + (z_{j+1} + 10)$ if $z_{j+1} < 0$
        $ZM1' = 10 \cdot Z' + (z_{j+1} - 1)$ if $z_{j+1} > 0$;  $10 \cdot ZM1' + (z_{j+1} + 9)$ if $z_{j+1} \le 0$
        $ZP1' = 10 \cdot Z' + (z_{j+1} + 1)$ if $z_{j+1} \ge -1$;  $10 \cdot ZM1' + (z_{j+1} + 11)$ if $z_{j+1} < -1$
        $TZ' = TZ' + 1$ if $z_{j+1} = 0$;  $0$ else
        $QNI' = QNI' - 1$
        $j = j + 1$
    $\mathrm{QNI} = QNI' + 1$ (remove the impact of the rounding digit)
    $R' = z_{j+1} - 1$ if remainder $< 0$;  $z_{j+1}$ else
    $ZM1'' = \lfloor ZM1' \cdot 10^{-1} \rfloor$
    $Z'' = \lfloor Z' \cdot 10^{-1} \rfloor$
    $ZP1'' = \lfloor ZP1' \cdot 10^{-1} \rfloor$
    $(Z, ZP1, R) = (ZM1'', Z'', R' + 10)$ if $R' < 0$;  $(Z'', ZP1'', R')$ else
    $sb = 0$ if remainder $= 0$;  $1$ else
    $\mathrm{GUF} = 1$ if $\mathrm{QNI} = q_{\min}$ and $Z' < 10^{p}$ and ($sb = 1$ or $R \ne 0$);  $0$ else
    if $\mathrm{GUF} = 1$ then underflow
    $\mathrm{TZ} = 0$ if $sb = 1$;  $TZ'$ else

All assignments inside the loop body are evaluated simultaneously, that is, every right-hand side uses the register values of the previous iteration.

Algorithm 8: On-the-fly conversion with gradual underflow handling.
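The digit update rules of Algorithm 8 can be checked with a small behavioral model. The sketch below covers only the loop body (the conversion and the trailing zeros counter) and leaves out the exponent update, the rounding digit, and the gradual underflow handling. It uses Python integers instead of BCD registers, assumes a positive quotient whose first nonzero digit is positive, and the function name is hypothetical.

```python
def on_the_fly(digits):
    """Convert signed quotient digits (most significant first) without carry propagation.

    Returns (Z, ZM1, ZP1, TZ): the quotient, the quotient minus one, the quotient
    plus one, and the number of trailing zero digits retired last.
    """
    z = zm1 = zp1 = 0
    tz = 0
    for d in digits:
        # Each new register value is formed by appending a single digit to
        # either the old Z or the old ZM1, so no carry ever ripples.
        new_z = 10 * z + d if d >= 0 else 10 * zm1 + (d + 10)
        new_zm1 = 10 * z + (d - 1) if d > 0 else 10 * zm1 + (d + 9)
        new_zp1 = 10 * z + (d + 1) if d >= -1 else 10 * zm1 + (d + 11)
        z, zm1, zp1 = new_z, new_zm1, new_zp1  # simultaneous register update
        tz = tz + 1 if d == 0 else 0
    return z, zm1, zp1, tz
```

For the digit sequence `[3, -2, 0, 5, -8, 0, 0]`, the model returns `Z = 2804200`, `ZM1 = 2804199`, `ZP1 = 2804201`, and `TZ = 2`, which matches the value obtained by evaluating the signed-digit number directly.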

Rounding is required when the number of significant digits of the quotient exceeds the length $p$ of a decimal word. The divider presented in this paper provides the five rounding modes as requested by [4]. Rounding selects either the integer quotient $Z_e$ or the incremented quotient $ZP1_e = Z_e + 1$. Rounding overflow would occur if $Z_e + 1$ overflowed due to rounding; as proven by Theorem A.1 in the appendix, this cannot happen in DFP division. The selection of the quotient depends on the rounding mode, the rounding digit (R), the sticky bit (sb), and the least significant digit (LSD) of the quotient. The calculation of the round-up detection is summarized in Table 3.
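The round-up detection of Table 3 reduces to a handful of Boolean expressions. The following sketch is one reading of that table: it assumes that `sign` is 1 for a negative result, that LSD(0) denotes the lowest bit of the BCD-coded least significant digit (so it is set for odd digits), and that the negated terms are placed as in the usual definitions of the IEEE 754-2008 rounding modes. The function name and the mode strings are hypothetical.

```python
def round_up(mode: str, r: int, sb: bool, lsd: int, sign: bool) -> bool:
    """Return True if the incremented quotient ZP1_e has to be selected."""
    inexact_tail = r > 0 or sb  # anything nonzero below the LSD
    if mode == "ties_to_even":
        return bool(r > 5 or (r == 5 and sb) or (r == 5 and not sb and lsd % 2 == 1))
    if mode == "ties_to_away":
        return r >= 5
    if mode == "toward_positive":
        return bool(not sign and inexact_tail)
    if mode == "toward_negative":
        return bool(sign and inexact_tail)
    if mode == "toward_zero":
        return False
    raise ValueError(f"unknown rounding mode: {mode}")
```

An exact tie (`r == 5`, `sb == False`) is rounded up only when the least significant digit is odd, which keeps the result even.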

Finally, the result is encoded again, and the exception unit might assert six exception signals. These are division by zero, invalid operation, result is infinite, inexact, overflow, and underflow. Division by zero is asserted when the divisor equals zero but the dividend is unequal to zero. The operation is called invalid when both operands are zero, both operands are infinity, or any of the operands is either a signaling or a quiet NaN. The inexact flag is asserted when the rounding digit or the sticky bit is unequal to zero. Overflow and underflow are computed as described in the exponent calculation unit and are listed in Algorithms 8 and 9. The infinity exception is asserted when the result is greater than the largest number, that is, when either the overflow or the division by zero exception is asserted, or when the dividend is infinite while the divisor is a finite number.

Table 3: Round-up detection.

| Rounding mode | Round-up detection |
|---|---|
| Round Ties To Even | $(R > 5) + (R \equiv 5) \cdot sb + (R \equiv 5) \cdot \overline{sb} \cdot \mathrm{LSD}(0)$ |
| Round Ties To Away | $(R \ge 5)$ |
| Round Toward Positive | $\overline{\mathrm{sign}} \cdot ((R > 0) + sb)$ |
| Round Toward Negative | $\mathrm{sign} \cdot ((R > 0) + sb)$ |
| Round Toward Zero | $0$ |

Legend: "+" logical OR; "$\cdot$" logical AND; an overline denotes logical negation.

    $\mathrm{offset2} = 783 - 398 = 385$
    if ($\mathrm{QP} \le \mathrm{QNI}$) or ($\mathrm{QNI} + \mathrm{TZ} < \mathrm{QP}$) or ($\mathrm{QP} > q_{\max}$) then
        (a) $\mathrm{QNI} > q_{\max} \Rightarrow$ overflow
        (b) $q_{\min} \le \mathrm{QNI} \le q_{\max} \Rightarrow$
                $q_Z = \mathrm{QNI} - \mathrm{offset2}$
                $Z_e = Z$
                $ZP1_e = ZP1$
        (c) $\mathrm{QNI} < q_{\min} \Rightarrow$ underflow
    else
        (a) $q_{\min} \le \mathrm{QP} \le q_{\max} \Rightarrow$
                $q_Z = \mathrm{QP} - \mathrm{offset2}$
                $Z_e = Z \gg (\mathrm{QP} - \mathrm{QNI})$
                $ZP1_e = ZP1 \gg (\mathrm{QP} - \mathrm{QNI})$
        (b) $\mathrm{QNI} < q_{\min} \Rightarrow$ underflow

Algorithm 9: Exponent calculation.

6. Implementation Results

All dividers are modeled using VHDL and are implemented for Xilinx Virtex-5 devices with speed grade −2 using Xilinx ISE 10.1. The post-place and route results of the fixed-point dividers are listed in Table 4.

All five types of fixed-point dividers require 19 cycles to perform a division. This includes 17 cycles to determine the quotient digits and the rounding digit, one additional cycle to normalize the result (when the first quotient digit is zero), and one cycle of latency to perform the on-the-fly conversion and the sticky bit determination. As expected, the type1 divider uses the most resources in terms of look-up tables (LUTs) and flip-flops (FFs). The type2, type3.a, and type3.b dividers consume fewer LUTs but additionally require five (type2) or two (type3.a and type3.b) BRAMs. However, comparing the delays of the dividers leads to an unexpected result. Contrary to decimal divider implementations in ASIC designs, on FPGA platforms the shift-and-subtract algorithm is the fastest. The reason is that the signal propagation on the FPGA's internal fast carry chains is faster than interconnections between slices over the FPGA's general routing matrix. The type1 divider exploits these fast carry chains, whereas the type2, type3.a, type3.b, and type4 dividers only use the normal, slow slice interconnection resources. In ASIC implementations, for instance, the longest path of the type4 divider would be much shorter than the longest path of the type1 divider, but on FPGA architectures the situation is the opposite.
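The latency and frequency figures reported for the dividers follow directly from the cycle count and the cycle time. A small sketch of that arithmetic, with hypothetical helper names, is given here; the cycle times of 8.1 ns (type1) and 8.5 ns (type2) are the values reported in Table 4.

```python
def overall_latency_ns(cycles: int, cycle_time_ns: float) -> float:
    """Latency of one division: number of cycles times the cycle time."""
    return cycles * cycle_time_ns


def max_frequency_mhz(cycle_time_ns: float) -> float:
    """Maximum clock frequency in MHz for a given cycle time in ns."""
    return 1000.0 / cycle_time_ns


# 19 cycles at 8.5 ns give 161.5 ns, and 1000 / 8.5 is roughly 118 MHz.
type2_latency = overall_latency_ns(19, 8.5)
type2_fmax = max_frequency_mhz(8.5)
```

The same two formulas reproduce the remaining columns of the table, up to rounding to whole nanoseconds and megahertz.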

One of the fastest decimal fixed-point dividers on ASICs is the design of Lang and Nannarelli [14]. They minimize the critical path in the digit recurrence by using a fast quotient digit selection function based on binary carry-propagate adders that subtract fixed values from the estimation of the current residual. These fixed values depend on estimations of the divisor and are precomputed and stored in a ROM. Therefore, the latency of the ROM does not contribute to the critical path of the digit recurrence. The implementation of such a divider on FPGAs would have a poor performance because the required carry-save adder with small error estimation cannot be implemented efficiently on the FPGA's slice structure. However, the concept of removing the ROM from the critical path is applied to the type3.a and type3.b dividers, and the corresponding post-place and route results are shown in Table 4. These results point out that the type3.a divider utilizes a similar amount of LUTs as the type2 divider, but the cycle time is much higher. This increased cycle time can be explained by the raised complexity of the multiplexers for the selection of the divisor's multiples in the digit recurrence step. These multiplexers show a poor performance in terms of propagation delay because they are implemented as a tree with slow slice interconnections. On the contrary, the type3.b divider applies dedicated fast multiplexers, which have a propagation delay of only one LUT instance and eight fast carry chains. This dedicated fast multiplexer speeds up the divider and reduces the maximum cycle time by one nanosecond. Unfortunately, the number of used LUTs is increased dramatically, by approximately 1000 LUTs.

The comparison of the type2, type3.a, type3.b, and type4 dividers in terms of speed and area (number of occupied LUTs and FFs combined) shows that the type2 algorithm is the fastest and, on par with type4, occupies the smallest area. Hence, there is no benefit in removing the ROM's latency from the critical path because the complexity is moved to the selection of the divisor's multiples. The type1 divider is only 5% faster than the type2 divider, but the speed is bought at a high price because the number of occupied slices is 75% higher. Therefore, the type2 divider shows a good tradeoff in terms of area and latency and is used for the floating-point divider presented in this paper. In the following paragraph, we compare the type2 implementation with other published fixed-point dividers.

Four other FPGA-based dividers are presented by Ercegovac and McIlhenny [17, 18], Deschamps and Sutter [24], Zhang et al. [25], and Vestias and Neto [12]. These implementations are compared to our type2 divider in the following. The divider presented in [17, 18] is based on a digit recurrence algorithm that only requires limited-precision multipliers, adders, and LUTs. Furthermore, a compensation term is computed in the digit recurrence that compensates for the error caused by this limited precision. The design is optimized for Virtex-5 FPGAs and has a good area characteristic (a 14-digit divider requires 1263 LUTs) but suffers from a high cycle time of 13.1 ns, which is more than 50% higher compared to the type1 or type2 design proposed in this paper. The reason for that high cycle time is (similar to the type4 divider) the long critical paths across many LUTs that cannot exploit the FPGA's fast carry chains. Therefore, the design would probably fit better on ASICs with dedicated routing. The dividers implemented in [24] apply two different digit recurrence algorithms with a redundancy of $\rho = 1$. The quotient digit selection function is very complex because it requires three MSDs of the divisor and four MSDs of the residual. The cycle times are high, but it must be taken into account that these designs are optimized for Virtex-4 FPGAs. Nevertheless, the number of used LUTs is very high. The divider presented in [25] applies a radix-100 digit recurrence algorithm with comprehensive prescaling of the dividend and divisor. It is optimized for Virtex-II Pro devices; hence, it is hard to compare it with the dividers presented in this paper. However, the critical path includes a decimal carry-save adder tree, two carry-propagate adder stages, and three multiplexer stages. It can therefore be assumed that this algorithm would also have a poor performance on Virtex-5 devices, although two quotient digits are retired in each iteration step. Vestias and Neto [12] propose four different decimal dividers optimized for different speed and area tradeoffs. Contrary to the circuits presented in this paper, the dividers apply the Newton-Raphson algorithm. The used multiplier is sequential; therefore, many cycles per division are required. However, each divider has a higher delay and a greater LUT usage compared to our type2 implementation.

Table 4: Results for normalized decimal fixed-point dividers ($p = 16 + 1$).

| | Type1 | Type2 | Type3.a | Type3.b | Type4 |
|---|---|---|---|---|---|
| Number of LUTs | 3595 | 1704 | 1749 | 2751 | 1846 |
| Number of FFs | 1234 | 1126 | 1262 | 1262 | 1031 |
| Number of LUTs and FFs combined | 3868 | 2210 | 2304 | 3240 | 2203 |
| Number of 36 k BRAMs | 0 | 5 | 2 | 2 | 0 |
| Number of LUTs, normalized to the type2 divider | 2.11 | 1.00 | 1.03 | 1.61 | 1.08 |
| Number of FFs, normalized to the type2 divider | 1.10 | 1.00 | 1.12 | 1.12 | 0.92 |
| Number of LUTs and FFs combined, normalized to the type2 divider | 1.75 | 1.00 | 1.04 | 1.47 | 1.00 |
| Cycle time (ns) | 8.1 | 8.5 | 10.1 | 9.1 | 12.1 |
| Overall latency (ns) (19 × cycle time) | 154 | 162 | 192 | 173 | 230 |
| Max. frequency (MHz) | 123 | 118 | 99 | 110 | 83 |
| Max. frequency, normalized to the type2 divider | 1.05 | 1.00 | 0.84 | 0.93 | 0.70 |

Table 5: Performance comparison of fixed-point dividers.

| Design | Occupied area | Cycle time (ns) | Latency (ns) |
|---|---|---|---|
| Ercegovac and McIlhenny [17] ($p = 14$, Virtex-5) | 1263 (6,2)-LUTs | 13.1 | 197 |
| Type2 divider, equalized to $p = 14$ | 1692 (6,2)-LUTs | 8.5 | 136 |
| Deschamps and Sutter [24] ($p = 16$, Virtex-4), (i) nonrestoring algorithm | 2974 (4,1)-LUTs | 21.4 | 386 |
| Deschamps and Sutter [24] ($p = 16$, Virtex-4), (ii) SRT-like algorithm | 3799 (4,1)-LUTs | 16.6 | 300 |
| Zhang et al. [25] ($p = 16$, Virtex-II Pro) | 3976 slices | 20.0 | 420 |
| Vestias and Neto [12] ($p = 16$, Virtex-4, Newton-Raphson), (i) Type A3 (0 DSPs, 118 cycles) | 2756 (4,1)-LUTs | 3.4 | 394 |
| Vestias and Neto [12], (ii) Type A4 (0 DSPs, 90 cycles) | 3768 (4,1)-LUTs | 3.4 | 306 |
| Vestias and Neto [12], (iii) Type A5 (7 DSPs, 112 cycles) | 2091 (4,1)-LUTs | 3.4 | 380 |
| Vestias and Neto [12], (iv) Type A6 (10 DSPs, 86 cycles) | 2718 (4,1)-LUTs | 3.4 | 292 |
| Type2 divider, equalized to $p = 16$ | 1704 (6,2)-LUTs | 8.5 | 153 |

Table 5 lists the results of the dividers presented in [17, 24, 25] together with the results of the type2 divider. For a fair performance comparison, however, the precision of the type2 divider is equalized to match the number of calculated quotient digits of the corresponding divider. Furthermore, it should be noted that the dividers presented in [12, 17, 24, 25] do not provide rounding support.

For the floating-point implementation, the fixed-point divider of type2 is used because type1 and type3.b consume too many resources and type3.a and type4 are too slow. Five additional BRAMs are required, but these are available in sufficient quantities on Virtex-5 devices. The corresponding post-place and route results for two configurations with different numbers of serialization stages are listed in Table 6. A comparison with other implementations is complicated because there are no other FPGA-based implementations of decimal floating-point dividers published yet. Therefore, we can only compare the decimal divider with binary dividers implemented on the same FPGA. In Table 7, post-place and route results of binary floating-point dividers for the double data format on a Virtex-5, provided by the Xilinx Core Generator [26], are listed. These dividers also provide the exception signals overflow, underflow, invalid operation, and divide-by-zero. Unfortunately, the binary floating-point divider does not support gradual underflow because this feature increases the complexity. The divider with one cycle per division is an unpipelined and fully parallel design, and the divider that requires 55 cycles per division is a fully sequential design. The binary floating-point divider with 20 cycles per division is best suited for a comparison with the decimal floating-point divider because it requires a comparable number of cycles for one operation. Obviously, decimal arithmetic has a great overhead regarding the total number of used LUTs, but the cycle time and latency are similar. This resource overhead can be explained, on the one hand, by the inefficient representation of decimal values on binary logic and, on the other hand, by the more complex specification of the decimal standard IEEE 754-2008. For instance, decimal floating-point division requires additional encoders, decoders, and a normalization stage.

Table 6: Post-place and route results of the floating-point divider based on type2.

| | Config. 1 | Config. 2 |
|---|---|---|
| Number of cycles per float division | 23 | 21 |
| Number of LUTs | 3231 | 3205 |
| Number of FFs | 1630 | 1377 |
| Number of LUTs and FFs combined | 3571 | 3566 |
| Cycle time (ns) | 8.5 | 9.1 |
| Overall latency (ns) | 195.5 | 191 |
| Max. frequency (MHz) | 118 | 110 |

Table 7: Post-place and route results of a 64-bit binary floating-point divider (Core Gen).

| Cycles per division | 1 | 10 | 20 | 55 |
|---|---|---|---|---|
| No. of LUTs | 3161 | 592 | 425 | 394 |
| No. of FFs | 196 | 518 | 469 | 542 |
| No. of LUTs and FFs combined | 3232 | 917 | 675 | 675 |
| Cycle time (ns) | 153.7 | 20.8 | 8.9 | 6.8 |
| Overall latency (ns) | 153.7 | 208 | 178 | 374 |
| Max. frequency (MHz) | 6.5 | 48 | 112 | 147 |

7. Conclusion

In this paper we have presented five different radix-10 digit recurrence division algorithms. The type1 divider implements the simple shift-and-subtract algorithm. The type2 divider is based on a nonrestoring algorithm with a ROM-based quotient digit selection function. The type3.a and type3.b dividers are both similar to type2 but use fast binary carry-propagate adders in the quotient digit selection function. However, the type3.a divider uses LUT-based multiplexers, whereas the type3.b divider uses fast carry chain-based multiplexers. The type4 divider applies a new algorithm with constant digit selection functions. This type4 divider requires neither a ROM nor multiplexers to select multiples of the divisor.

We have shown that the shift-and-subtract algorithm, which has the worst latency in ASIC architectures, is the fastest design on FPGAs but uses the most FPGA resources in terms of LUTs. A good tradeoff between latency and area is the type2 architecture, with a redundant carry-save representation of the residual, a radix-2 representation of the two MSDs, and a quotient digit selection function implemented in ROM. Furthermore, we have extended this fixed-point divider to a fully IEEE 754-2008 compliant decimal floating-point divider for the decimal64 data format, and finally we have shown implementation results of all dividers.

Appendix

Proofs

Theorem A.1 (rounding overflow). Let us consider a decimal floating-point division $Q = X/D$ with the signs $s_i$, the significands $c_i$, the exponents $q_i$, and the precision $p$:

$$(-1)^{s_Q} \cdot c_Q \cdot 10^{q_Q} = \frac{(-1)^{s_X} \cdot c_X \cdot 10^{q_X}}{(-1)^{s_D} \cdot c_D \cdot 10^{q_D}}. \tag{A.1}$$

Then rounding overflow, that is, an overflow that arises due to adding +1 to the least significant digit of the final result, cannot occur.

Proof. Assume to the contrary that there are a floating-point dividend X, a floating-point divisor D, and a proper rounding mode for which the decimal floating-point division produces a rounding overflow. Without loss of generality, we consider in the following only positive operands ($s_X = s_D = 0$), normalized significands ($c_X, c_D \ge 10^{p-1}$), and the exponents to be zero ($q_X = q_D = 0$). In the case of rounding overflow, the $p$ digits of the quotient must all be nines and the remainder must be unequal to zero, that is,

$$\underbrace{c_X}_{p \text{ digits}} = c_Q \cdot c_D + \mathrm{rem} = \underbrace{9.9 \cdots 9}_{p \text{ digits}} \cdot \underbrace{c_D}_{p \text{ digits}} + \underbrace{\mathrm{rem}}_{p \text{ digits}} \tag{A.2}$$

$$\text{with } 0 < \mathrm{rem} < c_D \cdot 10^{-p+1}. \tag{A.3}$$

Condition (A.2) is fulfilled for

$$c_D = 10^{p-1}, \qquad c_X = \underbrace{99 \cdots 9}_{p \text{ digits}}, \tag{A.4}$$

but this violates condition (A.3) because the remainder is zero (rem = 0); that is, the result is exact, no round-up is applied, and hence no rounding overflow occurs.


Otherwise, if the divisor is greater than $10^{p-1}$, it follows that

$$\begin{aligned}
\underbrace{c_X}_{p \text{ digits}} &= (9.9 \cdots 9) \cdot c_D + \mathrm{rem} \\
&= (10 - 10^{-p+1}) \cdot c_D + \mathrm{rem} \\
&= 10 \cdot c_D + \underbrace{(-10^{-p+1} \cdot c_D + \mathrm{rem})}_{< 0} \\
&= \underbrace{10 \cdot c_D + \mu}_{\ge p+1 \text{ digits}}
\end{aligned} \tag{A.5}$$

with $-10 < \mu < 0$. The left term of the equation has, by definition, a precision of $p$ digits, while the right term of the equation has a precision of more than $p$ digits. This contradiction finally proves that no rounding overflow can occur.
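For small precisions, the statement of Theorem A.1 can also be checked exhaustively. The following sketch is an illustration only and does not replace the proof: it enumerates all pairs of normalized $p$-digit significands, computes the $p$ leading quotient digits by integer division, and collects every case in which these digits are all nines while the remainder is nonzero. The function name is hypothetical.

```python
def rounding_overflow_cases(p: int) -> list[tuple[int, int]]:
    """List all (c_X, c_D) whose p-digit quotient is 99...9 with a nonzero remainder."""
    lo, hi = 10 ** (p - 1), 10 ** p  # normalized significands lie in [lo, hi)
    hits = []
    for c_x in range(lo, hi):
        for c_d in range(lo, hi):
            # Scale the dividend so that the integer quotient has exactly p digits:
            # the true quotient is in [1, 10) if c_x >= c_d, else in (0.1, 1).
            scale = 10 ** (p - 1) if c_x >= c_d else 10 ** p
            digits, rem = divmod(c_x * scale, c_d)
            if digits == hi - 1 and rem != 0:
                hits.append((c_x, c_d))
    return hits
```

For $p = 1, 2, 3$ the returned list is empty: the only pair that produces an all-nines quotient is the one from (A.4), and its remainder is zero.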

References

[1] M. Baesler, S. Voigt, and T. Teufel, "FPGA implementations of radix-10 digit recurrence fixed-point and floating-point dividers," in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig '11), pp. 13–19, IEEE Computer Society, Los Alamitos, CA, USA, December 2011.

[2] IEEE Task P754, ANSI/IEEE 754-1985, Standard for Binary Floating-Point Arithmetic, New York, NY, USA, August 1985.

[3] M. F. Cowlishaw, "Decimal floating-point: algorism for computers," in Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH '03), pp. 104–111, IEEE Computer Society, Washington, DC, USA, June 2003.

[4] IEEE Task P754, IEEE 754-2008, Standard for Floating-Point Arithmetic, New York, NY, USA, August 2008.

[5] ANSI/IEEE, ANSI/IEEE Std 854-1987: An American National Standard: IEEE Standard for Radix-Independent Floating-Point Arithmetic, New York, NY, USA, October 1987.

[6] A. Y. Duale, M. H. Decker, H. G. Zipperer, M. Aharoni, and T. J. Bohizic, "Decimal floating-point in z9: an implementation and testing perspective," IBM Journal of Research and Development, vol. 51, no. 1-2, pp. 217–227, 2007.

[7] C. F. Webb, "IBM z10: the next-generation mainframe microprocessor," IEEE Micro, vol. 28, no. 2, pp. 19–29, 2008.

[8] L. Eisen, J. W. Ward, H. W. Tast et al., "IBM POWER6 accelerators: VMX and DFU," IBM Journal of Research and Development, vol. 51, no. 6, pp. 663–683, 2007.

[9] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd, "Power7: IBM's next-generation server processor," IEEE Micro, vol. 30, no. 2, pp. 7–15, 2010.

[10] S. F. Oberman and M. J. Flynn, "Division algorithms and implementations," IEEE Transactions on Computers, vol. 46, no. 8, pp. 833–854, 1997.

[11] L. K. Wang and M. J. Schulte, "A decimal floating-point divider using Newton-Raphson iteration," Journal of VLSI Signal Processing Systems, vol. 49, no. 1, pp. 3–18, 2007.

[12] M. Vestias and H. Neto, "Revisiting the Newton-Raphson iterative method for decimal division," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '11), pp. 138–143, IEEE Computer Society Press, September 2011.

[13] H. Nikmehr, B. Phillips, and C. C. Lim, "Fast decimal floating-point division," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 9, pp. 951–961, 2006.

[14] T. Lang and A. Nannarelli, "A radix-10 digit-recurrence division unit: algorithm and architecture," IEEE Transactions on Computers, vol. 56, no. 6, pp. 727–739, 2007.

[15] A. Vazquez, E. Antelo, and P. Montuschi, "A radix-10 SRT divider based on alternative BCD codings," in Proceedings of the 25th IEEE International Conference on Computer Design (ICCD '07), pp. 280–287, IEEE Computer Society Press, Los Alamitos, CA, USA, October 2007.

[16] E. Schwarz and S. Carlough, "Power6 decimal divide," in Proceedings of the 18th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '07), pp. 128–133, IEEE Computer Society, July 2007.

[17] M. D. Ercegovac and R. McIlhenny, "Design and FPGA implementation of radix-10 algorithm for division with limited precision primitives," in Proceedings of the 42nd Asilomar Conference on Signals, Systems and Computers (ASILOMAR '08), pp. 762–766, IEEE Computer Society, Pacific Grove, Calif, USA, October 2008.

[18] M. D. Ercegovac and R. McIlhenny, "Design and FPGA implementation of radix-10 combined division/square root algorithm with limited precision primitives," in Proceedings of the 44th Asilomar Conference on Signals, Systems and Computers (Asilomar '10), pp. 87–91, IEEE Computer Society, Pacific Grove, Calif, USA, November 2010.

[19] F. Y. Busaba, C. A. Krygowski, W. H. Li, E. M. Schwarz, and S. R. Carlough, "The IBM z900 decimal arithmetic unit," in Proceedings of the 35th Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1335–1339, IEEE Computer Society, November 2001.

[20] M. Ercegovac and T. Lang, Division and Square Root: Digit-Recurrence Algorithms and Implementations, Kluwer Academic Publishers, Norwell, Mass, USA, 1994.

[21] M. Baesler, S. O. Voigt, and T. Teufel, "A decimal floating-point accurate scalar product unit with a parallel fixed-point multiplier on a Virtex-5 FPGA," International Journal of Reconfigurable Computing, vol. 2010, Article ID 357839, 13 pages, 2010.

[22] M. Baesler, S. O. Voigt, and T. Teufel, "A radix-10 digit recurrence division unit with a constant digit selection function," in Proceedings of the 28th IEEE International Conference on Computer Design (ICCD '10), pp. 241–246, IEEE Computer Society, Los Alamitos, CA, USA, October 2010.

[23] M. Baesler, S. O. Voigt, and T. Teufel, "An IEEE 754-2008 decimal parallel and pipelined FPGA floating-point multiplier," in Proceedings of the 20th International Conference on Field Programmable Logic and Applications (FPL '10), pp. 489–495, IEEE Computer Society, Washington, DC, USA, September 2010.

[24] J. P. Deschamps and G. Sutter, "Decimal division: algorithms and FPGA implementations," in Proceedings of the 6th Southern Programmable Logic Conference (SPL '10), pp. 67–72, IEEE Computer Society, March 2010.

[25] Y. Zhang, D. Chen, L. Chen et al., "Design and implementation of a radix-100 decimal division," in Proceedings of IEEE Symposium on Circuit and System (ISCAS '09), Taipei, Taiwan, May 2009.

[26] Xilinx, Xilinx LogiCORE Floating-Point Operator v4.0, April 2008.


Figure 10: Block diagram of the floating-point divider. (Blocks shown: DPD decoders, NaN handling, leading zeros count with QNF and QP calculations, dividend and divisor normalization, normalized fixed-point division, on-the-fly conversion with gradual underflow handling, exponent calculation, rounding unit, exception unit, and DPD encoder.)

final remainder is less than zero. The sticky bit (sb) is used for rounding and is asserted high whenever the remainder is unequal to zero.

Moreover, the number of trailing zeros (TZ) is computed on-the-fly. This number indicates by how many digits the computed quotient might be shifted to the right without losing accuracy. The number of trailing zeros has to be counted only for $Z'$ because, if $ZM1'$ is selected, the residual will be unequal to zero and TZ = 0 anyway. The number of trailing zeros is used for the selection between the preferred exponent (QP) and the normalized exponent for integer representation (QNI), as described in the following paragraph.

The exponent calculation unit selects either QP or QNI. The preferred exponent should be selected whenever possible. This is only feasible when $\mathrm{QNI} < \mathrm{QP} \le \mathrm{QNI} + \mathrm{TZ}$, that is, when there is a sufficient number of trailing zeros to shift the significand to the right. Furthermore, the final exponent must satisfy the minimum and maximum exponent range. The algorithm of the exponent calculation is illustrated in Algorithm 9, where the value

$$\mathrm{offset2} = 783 - \mathrm{bias} = 783 - 398 = 385 \tag{30}$$

is used to add the IEEE 754-2008 bias and to remove offset1 again (which was introduced in (27)).

$$
\begin{aligned}
&\mathrm{QNI}' = \mathrm{QNF} + 1,\quad Z' = ZM1' = ZP1' = 0,\quad \mathrm{TZ}' = 0,\quad j = 0\\
&\textbf{while } (Z' < 10^{p}) \textbf{ and } (\mathrm{QNI}' \ge q_{\min})\\
&\qquad Z' = \begin{cases} 10 \cdot Z' + z_{j+1} & \text{if } z_{j+1} \ge 0\\ 10 \cdot ZM1' + (z_{j+1} + 10) & \text{if } z_{j+1} < 0 \end{cases}\\
&\qquad ZM1' = \begin{cases} 10 \cdot Z' + (z_{j+1} - 1) & \text{if } z_{j+1} > 0\\ 10 \cdot ZM1' + (z_{j+1} + 9) & \text{if } z_{j+1} \le 0 \end{cases}\\
&\qquad ZP1' = \begin{cases} 10 \cdot Z' + (z_{j+1} + 1) & \text{if } z_{j+1} \ge -1\\ 10 \cdot ZM1' + (z_{j+1} + 11) & \text{if } z_{j+1} < -1 \end{cases}\\
&\qquad \mathrm{TZ}' = \begin{cases} \mathrm{TZ}' + 1 & \text{if } z_{j+1} = 0\\ 0 & \text{else} \end{cases}\\
&\qquad \mathrm{QNI}' = \mathrm{QNI}' - 1\\
&\qquad j = j + 1\\
&\mathrm{QNI} = \mathrm{QNI}' + 1 \quad \text{(remove the impact of the rounding digit)}\\
&R' = \begin{cases} z_{j+1} - 1 & \text{if remainder} < 0\\ z_{j+1} & \text{else} \end{cases}\\
&ZM1'' = \lfloor ZM1' \cdot 10^{-1} \rfloor,\quad Z'' = \lfloor Z' \cdot 10^{-1} \rfloor,\quad ZP1'' = \lfloor ZP1' \cdot 10^{-1} \rfloor\\
&(Z, ZP1, R) = \begin{cases} (ZM1'', Z'', R' + 10) & \text{if } R' < 0\\ (Z'', ZP1'', R') & \text{else} \end{cases}\\
&\mathrm{sb} = \begin{cases} 0 & \text{if remainder} = 0\\ 1 & \text{else} \end{cases}\\
&\mathrm{GUF} = \begin{cases} 1 & \text{if } \mathrm{QNI} = q_{\min} \text{ and } Z' < 10^{p} \text{ and } (\mathrm{sb} = 1 \text{ or } R = 0)\\ 0 & \text{else} \end{cases}\\
&\textbf{if } \mathrm{GUF} = 1 \textbf{ then } \text{underflow}\\
&\mathrm{TZ} = \begin{cases} 0 & \text{if } \mathrm{sb} = 1\\ \mathrm{TZ}' & \text{else} \end{cases}
\end{aligned}
$$

Algorithm 8: On-the-fly conversion with gradual underflow handling.
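The digit-append rules inside the loop of Algorithm 8 can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the hardware described in this paper: the function name `on_the_fly` is hypothetical, Python integers stand in for the BCD registers, and the loop termination conditions, the rounding digit extraction, and the gradual underflow handling are omitted. It only shows that $Z'$, $ZM1'$, and $ZP1'$ can be maintained by appending one nonnegative digit per step, without carry propagation.

```python
def on_the_fly(digits):
    """Convert signed quotient digits (each in -8..8) to Z, Z-1, Z+1 and count trailing zeros.

    Assumes the first nonzero digit is positive (positive quotient), as the
    update rules rely on ZM1 being a valid predecessor whenever it is used.
    """
    z = zm1 = zp1 = 0
    tz = 0
    for d in digits:
        # All three updates use the values from the previous step.
        z_new = 10 * z + d if d >= 0 else 10 * zm1 + (d + 10)
        zm1_new = 10 * z + (d - 1) if d > 0 else 10 * zm1 + (d + 9)
        zp1_new = 10 * z + (d + 1) if d >= -1 else 10 * zm1 + (d + 11)
        tz = tz + 1 if d == 0 else 0
        z, zm1, zp1 = z_new, zm1_new, zp1_new
    return z, zm1, zp1, tz


def reference_value(digits):
    """Direct evaluation of the signed-digit string, for comparison only."""
    value = 0
    for d in digits:
        value = 10 * value + d
    return value


digits = [3, -4, 0, 5, -8, 0, 0]
z, zm1, zp1, tz = on_the_fly(digits)
print(z, zm1, zp1, tz)
```

In each branch the appended digit lies in the range 0 to 9, which is why no carry or borrow has to ripple through the already converted digits.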

Rounding is required when the number of significant digits of the quotient exceeds the length $p$ of a decimal word. The divider presented in this paper provides the five rounding modes as requested by [4]. Rounding either selects the integer quotient $Z_e$ or the incremented quotient $ZP1_e = Z_e + 1$. As proven by Theorem A.1 in the appendix, rounding overflow cannot occur in DFP division. Rounding overflow would occur if $Z_e + 1$ overflows due to rounding. The selection of the quotient depends on the rounding mode, the rounding digit (R), the sticky bit (sb), and the least significant digit (LSD) of the quotient. The calculation of the round-up detection is summarized in Table 3.
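The round-up decision of Table 3 can be sketched in a few lines of Python. This is an illustrative model only: the function name `round_up` and the mode strings are hypothetical, and the sketch assumes the usual IEEE 754-2008 semantics, with `sign = 1` denoting a negative result, a tie being `R = 5` with a cleared sticky bit, and LSD(0) being the least significant bit of the last quotient digit.

```python
def round_up(mode, r, sb, lsd, sign):
    """Return True if the incremented quotient ZP1_e should be selected.

    r    : rounding digit (0..9)
    sb   : sticky bit (0 or 1)
    lsd  : least significant digit of the quotient (0..9)
    sign : 0 for a positive result, 1 for a negative result (assumption)
    """
    discarded_nonzero = r > 0 or sb == 1
    if mode == "ties_to_even":
        # Above the tie, or exactly at the tie with an odd last digit.
        return r > 5 or (r == 5 and (sb == 1 or lsd % 2 == 1))
    if mode == "ties_to_away":
        return r >= 5
    if mode == "toward_positive":
        return sign == 0 and discarded_nonzero
    if mode == "toward_negative":
        return sign == 1 and discarded_nonzero
    if mode == "toward_zero":
        return False
    raise ValueError(f"unknown rounding mode: {mode}")


# A tie (R = 5, sb = 0) rounds up only when the last digit is odd.
print(round_up("ties_to_even", 5, 0, 4, 0), round_up("ties_to_even", 5, 0, 3, 0))
```

Because the magnitude is rounded, the two directed modes differ only in the sign condition.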

Finally, the result is encoded again, and the exception unit might assert six exception signals. These are division by zero, invalid operation, result is infinite, inexact, overflow, and underflow. Division by zero is asserted when the divisor equals zero but the dividend is unequal to zero. The operation is called invalid when both operands are zero, both operands are infinity, or any of the operands is either a signaling or a quiet NaN. The inexact flag is asserted when the rounding digit or the sticky bit is unequal to zero. Overflow and underflow are computed as described in the exponent calculation unit and are listed in Algorithms 8 and 9. The infinity exception is asserted when the result is greater than the largest number, that is, when either the overflow or the division by zero exception is asserted, or when the dividend is infinity while the divisor is a finite number.

Table 3: Round-up detection.

| Rounding mode | Round-up detection |
| --- | --- |
| Round Ties To Even | $(R > 5) + (R \equiv 5) \cdot \mathrm{sb} + (R \equiv 5) \cdot \overline{\mathrm{sb}} \cdot \mathrm{LSD}(0)$ |
| Round Ties To Away | $(R \ge 5)$ |
| Round Toward Positive | $\overline{\mathrm{sign}} \cdot ((R > 0) + \mathrm{sb})$ |
| Round Toward Negative | $\mathrm{sign} \cdot ((R > 0) + \mathrm{sb})$ |
| Round Toward Zero | $0$ |

Legend: "+" logical OR, "·" logical AND.

$$
\begin{aligned}
&\mathrm{offset2} = 783 - 398 = 385\\
&\textbf{if } (\mathrm{QP} \le \mathrm{QNI}) \textbf{ or } (\mathrm{QNI} + \mathrm{TZ} < \mathrm{QP}) \textbf{ or } (\mathrm{QP} > q_{\max}) \textbf{ then}\\
&\qquad \text{(a) } \mathrm{QNI} > q_{\max} \Rightarrow \text{overflow}\\
&\qquad \text{(b) } q_{\min} \le \mathrm{QNI} \le q_{\max} \Rightarrow q_Z = \mathrm{QNI} - \mathrm{offset2},\quad Z_e = Z,\quad ZP1_e = ZP1\\
&\qquad \text{(c) } \mathrm{QNI} < q_{\min} \Rightarrow \text{underflow}\\
&\textbf{else}\\
&\qquad \text{(a) } q_{\min} \le \mathrm{QP} \le q_{\max} \Rightarrow q_Z = \mathrm{QP} - \mathrm{offset2},\quad Z_e = Z \gg (\mathrm{QP} - \mathrm{QNI}),\quad ZP1_e = ZP1 \gg (\mathrm{QP} - \mathrm{QNI})\\
&\qquad \text{(b) } \mathrm{QNI} < q_{\min} \Rightarrow \text{underflow}
\end{aligned}
$$

Algorithm 9: Exponent calculation.
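The case split of the exponent calculation (Algorithm 9) can be expressed as a small Python sketch. It is an illustration under assumptions rather than the implemented unit: the function name `select_exponent`, the returned tuple layout, and the values chosen for `q_min` and `q_max` in the usage lines are hypothetical, the right shift is modeled as an integer division by a power of ten, and case (a) of the second branch is checked before case (b), since the listing does not state a priority.

```python
def select_exponent(qp, qni, tz, z, zp1, q_min, q_max, offset2=783 - 398):
    """Return (status, q_z, z_e, zp1_e) following the case split of Algorithm 9."""
    preferred_unreachable = qp <= qni or qni + tz < qp or qp > q_max
    if preferred_unreachable:
        # Fall back to the normalized exponent for integer representation.
        if qni > q_max:
            return ("overflow", None, None, None)
        if qni < q_min:
            return ("underflow", None, None, None)
        return ("ok", qni - offset2, z, zp1)
    if q_min <= qp <= q_max:
        # Enough trailing zeros: shift the significand right by QP - QNI digits.
        shift = qp - qni
        return ("ok", qp - offset2, z // 10**shift, zp1 // 10**shift)
    return ("underflow", None, None, None)


# Hypothetical exponent range, chosen only to exercise the branches.
print(select_exponent(qp=800, qni=798, tz=3, z=1234500, zp1=1234501, q_min=385, q_max=1152))
print(select_exponent(qp=800, qni=798, tz=1, z=1234500, zp1=1234501, q_min=385, q_max=1152))
```

In the first call two of the three trailing zeros are shifted out and the preferred exponent is used; in the second call only one trailing zero is available, so the normalized exponent QNI is kept.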

6. Implementation Results

All dividers are modeled using VHDL and are implemented for Xilinx Virtex-5 devices with speed grade −2 using Xilinx ISE 10.1. The post-place-and-route results of the fixed-point dividers are listed in Table 4.

All five types of fixed-point dividers require 19 cycles to perform a division. This includes 17 cycles to determine the quotient digits and the rounding digit, one additional cycle to normalize the result (when the first quotient digit is zero), and one cycle of latency to perform on-the-fly conversion and sticky bit determination. As expected, the type1 divider uses the most resources in terms of look-up tables (LUTs) and flip-flops (FFs). The type2, type3.a, and type3.b dividers consume fewer LUTs but additionally require five (type2) or two (type3.a and type3.b) BRAMs. However, comparing the delays of the dividers leads to an unexpected result. Contrary to decimal divider implementations in ASIC designs, on FPGA platforms the shift-and-subtract algorithm is the fastest. The reason is that signal propagation on the FPGA's internal fast carry chains is faster than interconnections between slices over the FPGA's general routing matrix. The type1 divider exploits these fast carry chains, whereas the type2, type3.a, type3.b, and type4 dividers only use the normal, slow slice interconnection resources. In ASIC implementations, for instance, the longest path of the type4 divider would be much shorter than the longest path of the type1 divider, but on FPGA architectures the situation is the opposite.

One of the fastest decimal fixed-point dividers on ASICs is the design of Lang and Nannarelli [14]. They minimize the critical path in the digit recurrence by using a fast quotient digit selection function based on binary carry-propagate adders that subtract fixed values from the estimation of the current residual. These fixed values depend on estimations of the divisor and are precomputed and stored in a ROM. Therefore, the latency of the ROM does not contribute to the critical path of the digit recurrence. The implementation of such a divider on FPGAs would perform poorly because the required carry-save adder with small error estimation cannot be implemented efficiently on the FPGA's slice structure. However, the concept of removing the ROM from the critical path is applied to the type3.a and type3.b dividers, and the corresponding post-place-and-route results are shown in Table 4. These results point out that the type3.a divider utilizes a similar number of LUTs as the type2 divider, but its cycle time is much higher. The increased cycle time can be explained by the raised complexity of the multiplexers that select the divisor's multiples in the digit recurrence step. These multiplexers show a poor performance in terms of propagation delay because they are implemented as a tree with slow slice interconnections. In contrast, the type3.b divider applies dedicated fast multiplexers, which have a propagation delay of only one LUT instance and eight fast carry chains. This dedicated fast multiplexer speeds up the divider and reduces the maximum cycle time by one nanosecond. Unfortunately, the number of used LUTs increases dramatically, by approximately 1000 LUTs.

The comparison of the type2, type3.a, type3.b, and type4 dividers in terms of speed and area (number of occupied LUTs and FFs combined) shows that the type2 algorithm is the fastest and requires the least number of slices. Hence, there is no benefit in removing the ROM's latency from the critical path because the complexity is moved to the selection of the divisor's multiples. The type1 divider is only 5% faster than the type2 divider, but the speed is bought at a high price because the number of occupied slices is 75% higher. Therefore, the type2 divider shows a good tradeoff in terms of area and latency and is used for the floating-point divider presented in this paper. In the following paragraph, we compare the type2 implementation with other published fixed-point dividers.
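The figures quoted above can be rechecked from the raw values of Table 4. The following Python sketch is only a worked calculation: the dictionary names are hypothetical, the cycle times and combined LUT/FF counts are copied from Table 4, and the 19-cycle division length is the one stated in this section.

```python
CYCLES_PER_DIVISION = 19

# Cycle time in ns and combined LUT/FF count, taken from Table 4.
cycle_time_ns = {"type1": 8.1, "type2": 8.5, "type3.a": 10.1, "type3.b": 9.1, "type4": 12.1}
luts_ffs_combined = {"type1": 3868, "type2": 2210, "type3.a": 2304, "type3.b": 3240, "type4": 2203}


def latency_ns(name):
    """Overall latency of one division: number of cycles times cycle time."""
    return CYCLES_PER_DIVISION * cycle_time_ns[name]


def max_frequency_mhz(name):
    """Maximum clock frequency implied by the cycle time."""
    return 1000.0 / cycle_time_ns[name]


speed_ratio = cycle_time_ns["type2"] / cycle_time_ns["type1"]        # type1 vs type2 speed
area_ratio = luts_ffs_combined["type1"] / luts_ffs_combined["type2"]  # type1 vs type2 area

for name in cycle_time_ns:
    print(name, round(latency_ns(name)), round(max_frequency_mhz(name)))
print(round(speed_ratio, 2), round(area_ratio, 2))
```

The two ratios correspond to the roughly 5% speed advantage and 75% area overhead of the type1 divider relative to the type2 divider.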

Table 4: Results for normalized decimal fixed-point dividers, $p = 16 + 1$.

| | Type1 | Type2 | Type3.a | Type3.b | Type4 |
| --- | --- | --- | --- | --- | --- |
| Number of LUTs | 3595 | 1704 | 1749 | 2751 | 1846 |
| Number of FFs | 1234 | 1126 | 1262 | 1262 | 1031 |
| Number of LUTs and FFs combined | 3868 | 2210 | 2304 | 3240 | 2203 |
| Number of 36 k BRAMs | 0 | 5 | 2 | 2 | 0 |
| Number of LUTs normalized to the type2 divider | 2.11 | 1.00 | 1.03 | 1.61 | 1.08 |
| Number of FFs normalized to the type2 divider | 1.10 | 1.00 | 1.12 | 1.12 | 0.92 |
| Number of LUTs and FFs combined normalized to the type2 divider | 1.75 | 1.00 | 1.04 | 1.47 | 1.00 |
| Cycle time (ns) | 8.1 | 8.5 | 10.1 | 9.1 | 12.1 |
| Overall latency (ns) (19 × cycle time) | 154 | 162 | 192 | 173 | 230 |
| Max. frequency (MHz) | 123 | 118 | 99 | 110 | 83 |
| Max. frequency normalized to the type2 divider | 1.05 | 1.00 | 0.84 | 0.93 | 0.70 |

Four other FPGA-based dividers are presented by Ercegovac and McIlhenny [17, 18], Deschamps and Sutter [24], Zhang et al. [25], and Vestias and Neto [12]. These implementations are compared to our type2 divider in the following. The divider presented in [17, 18] is based on a digit recurrence algorithm that only requires limited-precision multipliers, adders, and LUTs. Furthermore, a compensation term is computed in the digit recurrence that compensates the error caused by this limited precision. The design is optimized for Virtex-5 FPGAs and has a good area characteristic (a 14-digit divider requires 1263 LUTs) but suffers from a high cycle time of 13.1 ns, which is more than 50% higher compared to the type1 or type2 design proposed in this paper. The reason for that high cycle time is (similar to the type4 divider) the long critical paths across many LUTs that cannot exploit the FPGA's fast carry chains. Therefore, the design would probably fit better on ASICs with dedicated routing. The dividers implemented in [24] apply two different digit recurrence algorithms with a redundancy of $\rho = 1$. The quotient digit selection function is very complex because it requires three MSDs of the divisor and four MSDs of the residual. The cycle times are high, but it must be taken into account that these designs are optimized for Virtex-4 FPGAs. Nevertheless, the number of used LUTs is very high. The divider presented in [25] applies a radix-100 digit recurrence algorithm with comprehensive prescaling of the dividend and divisor. It is optimized for Virtex-II Pro devices; hence, it is hard to compare it with the dividers presented in this paper. However, the critical path includes a decimal carry-save adder tree, two carry-propagate adder stages, and three multiplexer stages. It can therefore be assumed that this algorithm would also perform poorly on Virtex-5 devices, although two quotient digits are retired in each iteration step. Vestias and Neto [12] propose four different decimal dividers optimized for different speed and area tradeoffs. Contrary to the circuits presented in this paper, the dividers apply the Newton-Raphson algorithm. The multiplier used is sequential; therefore, many cycles per division are required. However, each divider has a higher delay and a greater LUT usage compared to our type2 implementation.

Table 5: Performance comparison of fixed-point dividers.

| Divider | Occupied area | Cycle time (ns) | Latency (ns) |
| --- | --- | --- | --- |
| Ercegovac and McIlhenny [17] ($p = 14$, Virtex-5) | 1263 (6,2)-LUTs | 13.1 | 197 |
| Type2 divider equalized to $p = 14$ | 1692 (6,2)-LUTs | 8.5 | 136 |
| Deschamps and Sutter [24] ($p = 16$, Virtex-4): (i) nonrestoring algorithm | 2974 (4,1)-LUTs | 21.4 | 386 |
| Deschamps and Sutter [24] ($p = 16$, Virtex-4): (ii) SRT-like algorithm | 3799 (4,1)-LUTs | 16.6 | 300 |
| Zhang et al. [25] ($p = 16$, Virtex-II Pro) | 3976 slices | 20.0 | 420 |
| Vestias and Neto [12] ($p = 16$, Virtex-4, Newton-Raphson): (i) type A3 (0 DSPs, 118 cycles) | 2756 (4,1)-LUTs | 3.4 | 394 |
| Vestias and Neto [12]: (ii) type A4 (0 DSPs, 90 cycles) | 3768 (4,1)-LUTs | 3.4 | 306 |
| Vestias and Neto [12]: (iii) type A5 (7 DSPs, 112 cycles) | 2091 (4,1)-LUTs | 3.4 | 380 |
| Vestias and Neto [12]: (iv) type A6 (10 DSPs, 86 cycles) | 2718 (4,1)-LUTs | 3.4 | 292 |
| Type2 divider equalized to $p = 16$ | 1704 (6,2)-LUTs | 8.5 | 153 |

Table 5 lists the results of the dividers presented in [17, 24, 25] together with the results of the type2 divider. However, for a fair performance comparison, the precision of the type2 divider is equalized to match the number of calculated quotient digits of the corresponding divider. Furthermore, it should be noted that the dividers presented in [12, 17, 24, 25] do not provide rounding support.

For the floating-point implementation, the fixed-point divider of type2 is used because type1 and type3.b consume too many resources and type3.a and type4 are too slow. However, five additional BRAMs are required, but these are available in sufficient quantities on Virtex-5 devices. The corresponding post-place-and-route results for two configurations with different numbers of serialization stages are listed in Table 6. A comparison with other implementations is complicated because there are no other FPGA-based implementations of decimal floating-point dividers published yet. Therefore, we can only compare the decimal divider with binary dividers implemented on the same FPGA. In Table 7, post-place-and-route results of binary floating-point dividers for the double data format on a Virtex-5, provided by the Xilinx Core Generator [26], are listed. The dividers also provide the exception signals overflow, underflow, invalid operation, and divide-by-zero. Unfortunately, the binary floating-point divider does not support gradual underflow because this feature increases the complexity. The divider with one cycle per division is an unpipelined and fully parallel design, and the divider that requires 55 cycles per division is a fully sequential design. The binary floating-point divider with 20 cycles per division is best suited for a comparison with the decimal floating-point divider because it requires a comparable number of cycles for one operation. Obviously, decimal arithmetic has a great overhead regarding the total number of used LUTs, but the cycle time and latency are similar. This resource overhead can be explained on the one hand by the inefficient representation of decimal values on binary logic and on the other hand by the more complex specification of the decimal standard IEEE 754-2008. For instance, decimal floating-point division requires additional encoders, decoders, and a normalization stage.

Table 6: Post-place-and-route results of the floating-point divider based on type2.

| | Config. 1 | Config. 2 |
| --- | --- | --- |
| Number of cycles per float division | 23 | 21 |
| Number of LUTs | 3231 | 3205 |
| Number of FFs | 1630 | 1377 |
| Number of LUTs and FFs combined | 3571 | 3566 |
| Cycle time (ns) | 8.5 | 9.1 |
| Overall latency (ns) | 195.5 | 191 |
| Max. frequency (MHz) | 118 | 110 |

Table 7: Post-place-and-route results of a 64-bit binary floating-point divider (Core Gen).

| Cycles per division | 1 | 10 | 20 | 55 |
| --- | --- | --- | --- | --- |
| No. of LUTs | 3161 | 592 | 425 | 394 |
| No. of FFs | 196 | 518 | 469 | 542 |
| No. of LUTs and FFs comb. | 3232 | 917 | 675 | 675 |
| Cycle time (ns) | 153.7 | 20.8 | 8.9 | 6.8 |
| Overall latency (ns) | 153.7 | 208 | 178 | 374 |
| Max. frequency (MHz) | 6.5 | 48 | 112 | 147 |

7. Conclusion

In this paper we have presented five different radix-10 digit recurrence division algorithms. The type1 divider implements the simple shift-and-subtract algorithm. The type2 divider is based on a nonrestoring algorithm with a ROM-based quotient digit selection function. The type3.a and type3.b dividers are both similar to type2 but use fast binary carry-propagate adders in the quotient digit selection function. However, the type3.a divider uses LUT-based multiplexers, whereas the type3.b divider uses fast carry chain-based multiplexers. The type4 divider applies a new algorithm with constant digit selection functions. This type4 divider requires neither a ROM nor multiplexers to select multiples of the divisor.

We have shown that the shift-and-subtract algorithm, which has the worst latency in ASIC architectures, is the fastest design on FPGAs but uses the most FPGA resources in terms of LUTs. A good tradeoff between latency and area is the type2 architecture with a redundant carry-save representation of the residual, a radix-2 representation of the two MSDs, and a quotient digit selection function implemented in ROM. Furthermore, we have extended this fixed-point divider to a fully IEEE 754-2008 compliant decimal floating-point divider for the decimal64 data format, and finally we have shown implementation results of all dividers.

Appendix

Proofs

Theorem A.1 (rounding overflow). Let us consider a decimal floating-point division $Q = X/D$ with the signs $s_i$, the significands $c_i$, the exponents $q_i$, and the precision $p$:

$$(-1)^{s_Q} \cdot c_Q \cdot 10^{q_Q} = \frac{(-1)^{s_X} \cdot c_X \cdot 10^{q_X}}{(-1)^{s_D} \cdot c_D \cdot 10^{q_D}}. \tag{A.1}$$

Then rounding overflow, that is, an overflow that arises due to adding +1 to the least significant digit of the final result, cannot occur.

Proof. Assume to the contrary that there are a floating-point dividend X, a floating-point divisor D, and a proper rounding mode where the decimal floating-point division produces a rounding overflow. Without loss of generality, we consider in the following only positive operands ($s_X = s_D = 0$), normalized significands ($c_X, c_D \ge 10^{p-1}$), and the exponents to be zero ($q_X = q_D = 0$). In the case of rounding overflow, the $p$ digits of the quotient must be all nines and the remainder must be unequal to zero, that is,

$$\underbrace{c_X}_{p \text{ digits}} = c_Q \cdot c_D + \mathrm{rem} = \underbrace{99 \cdots 9}_{p \text{ digits}} \cdot \underbrace{c_D}_{p \text{ digits}} + \underbrace{\mathrm{rem}}_{p \text{ digits}} \tag{A.2}$$

with

$$0 < \mathrm{rem} < c_D \cdot 10^{-p+1}. \tag{A.3}$$

Condition (A.2) is fulfilled for

$$c_D = 10^{p-1}, \qquad c_X = \underbrace{99 \cdots 9}_{p \text{ digits}}, \tag{A.4}$$

but this violates condition (A.3) because the remainder is zero ($\mathrm{rem} = 0$); that is, the result is exact, no round-up is applied, and hence no rounding overflow occurs.

Otherwise, if the divisor is greater than $10^{p-1}$, it follows that

$$
\begin{aligned}
\underbrace{c_X}_{p \text{ digits}} &= (99 \cdots 9) \cdot c_D + \mathrm{rem}\\
&= (10 - 10^{-p+1}) \cdot c_D + \mathrm{rem}\\
&= 10 \cdot c_D + \underbrace{(-10^{-p+1} \cdot c_D + \mathrm{rem})}_{< 0}\\
&= \underbrace{10 \cdot c_D + \mu}_{\ge p+1 \text{ digits}}
\end{aligned}
\tag{A.5}
$$

with $-10 < \mu < 0$. The left term of the equation has by definition a precision of $p$ digits, while the right term of the equation has a precision of more than $p$ digits. This contradiction finally proves that no rounding overflow can occur.
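The statement of the theorem can also be checked exhaustively for small precisions. The following Python sketch is a sanity check under assumptions, not part of the proof: the function name `rounding_overflow_cases` is hypothetical, both significands are taken as $p$-digit integers with zero exponents, and the quotient is normalized to exactly $p$ digits by scaling the dividend with a suitable power of ten.

```python
def rounding_overflow_cases(p):
    """List all (c_x, c_d) pairs whose p-digit quotient is all nines and inexact."""
    lo, hi = 10 ** (p - 1), 10 ** p
    cases = []
    for c_x in range(lo, hi):
        for c_d in range(lo, hi):
            # Scale the dividend so that the integer quotient has exactly p digits.
            shift = p - 1 if c_x >= c_d else p
            quotient, remainder = divmod(c_x * 10 ** shift, c_d)
            if quotient == hi - 1 and remainder != 0:
                cases.append((c_x, c_d))
    return cases


for p in (1, 2, 3):
    print(p, len(rounding_overflow_cases(p)))
```

An all-nines quotient does occur, for example for $c_X = 99$ and $c_D = 10$ with $p = 2$, but only with a zero remainder, so no round-up is applied in that case.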

References

[1] M. Baesler, S. Voigt, and T. Teufel, “FPGA implementations of radix-10 digit recurrence fixed-point and floating-point dividers,” in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig ’11), pp. 13–19, IEEE Computer Society, Los Alamitos, CA, USA, December 2011.

[2] IEEE Task P754, ANSI/IEEE 754-1985, Standard for Binary Floating-Point Arithmetic, New York, NY, USA, August 1985.

[3] M. F. Cowlishaw, “Decimal floating-point: algorism for computers,” in Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH ’03), pp. 104–111, IEEE Computer Society, Washington, DC, USA, June 2003.

[4] IEEE Task P754, IEEE 754-2008, Standard for Floating-Point Arithmetic, New York, NY, USA, August 2008.

[5] ANSI/IEEE, ANSI/IEEE Std 854-1987: An American National Standard, IEEE Standard for Radix-Independent Floating-Point Arithmetic, New York, NY, USA, October 1987.

[6] A. Y. Duale, M. H. Decker, H. G. Zipperer, M. Aharoni, and T. J. Bohizic, “Decimal floating-point in z9: an implementation and testing perspective,” IBM Journal of Research and Development, vol. 51, no. 1-2, pp. 217–227, 2007.

[7] C. F. Webb, “IBM z10: the next-generation mainframe microprocessor,” IEEE Micro, vol. 28, no. 2, pp. 19–29, 2008.

[8] L. Eisen, J. W. Ward, H. W. Tast et al., “IBM POWER6 accelerators: VMX and DFU,” IBM Journal of Research and Development, vol. 51, no. 6, pp. 663–683, 2007.

[9] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd, “Power7: IBM’s next-generation server processor,” IEEE Micro, vol. 30, no. 2, pp. 7–15, 2010.

[10] S. F. Oberman and M. J. Flynn, “Division algorithms and implementations,” IEEE Transactions on Computers, vol. 46, no. 8, pp. 833–854, 1997.

[11] L. K. Wang and M. J. Schulte, “A decimal floating-point divider using Newton-Raphson iteration,” Journal of VLSI Signal Processing Systems, vol. 49, no. 1, pp. 3–18, 2007.

[12] M. Vestias and H. Neto, “Revisiting the Newton-Raphson iterative method for decimal division,” in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL ’11), pp. 138–143, IEEE Computer Society Press, September 2011.

[13] H. Nikmehr, B. Phillips, and C. C. Lim, “Fast decimal floating-point division,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 9, pp. 951–961, 2006.

[14] T. Lang and A. Nannarelli, “A radix-10 digit-recurrence division unit: algorithm and architecture,” IEEE Transactions on Computers, vol. 56, no. 6, pp. 727–739, 2007.

[15] A. Vazquez, E. Antelo, and P. Montuschi, “A radix-10 SRT divider based on alternative BCD codings,” in Proceedings of the 25th IEEE International Conference on Computer Design (ICCD ’07), pp. 280–287, IEEE Computer Society Press, Los Alamitos, CA, USA, October 2007.

[16] E. Schwarz and S. Carlough, “Power6 decimal divide,” in Proceedings of the 18th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP ’07), pp. 128–133, IEEE Computer Society, July 2007.

[17] M. D. Ercegovac and R. McIlhenny, “Design and FPGA implementation of radix-10 algorithm for division with limited precision primitives,” in Proceedings of the 42nd Asilomar Conference on Signals, Systems and Computers (ASILOMAR ’08), pp. 762–766, IEEE Computer Society, Pacific Grove, Calif, USA, October 2008.

[18] M. D. Ercegovac and R. McIlhenny, “Design and FPGA implementation of radix-10 combined division/square root algorithm with limited precision primitives,” in Proceedings of the 44th Asilomar Conference on Signals, Systems and Computers (Asilomar ’10), pp. 87–91, IEEE Computer Society, Pacific Grove, Calif, USA, November 2010.

[19] F. Y. Busaba, C. A. Krygowski, W. H. Li, E. M. Schwarz, and S. R. Carlough, “The IBM z900 decimal arithmetic unit,” in Proceedings of the 35th Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1335–1339, IEEE Computer Society, November 2001.

[20] M. Ercegovac and T. Lang, Division and Square Root: Digit-Recurrence Algorithms and Implementations, Kluwer Academic Publishers, Norwell, Mass, USA, 1994.

[21] M. Baesler, S. O. Voigt, and T. Teufel, “A decimal floating-point accurate scalar product unit with a parallel fixed-point multiplier on a Virtex-5 FPGA,” International Journal of Reconfigurable Computing, vol. 2010, Article ID 357839, 13 pages, 2010.

[22] M. Baesler, S. O. Voigt, and T. Teufel, “A radix-10 digit recurrence division unit with a constant digit selection function,” in Proceedings of the 28th IEEE International Conference on Computer Design (ICCD ’10), pp. 241–246, IEEE Computer Society, Los Alamitos, CA, USA, October 2010.

[23] M. Baesler, S. O. Voigt, and T. Teufel, “An IEEE 754-2008 decimal parallel and pipelined FPGA floating-point multiplier,” in Proceedings of the 20th International Conference on Field Programmable Logic and Applications (FPL ’10), pp. 489–495, IEEE Computer Society, Washington, DC, USA, September 2010.

[24] J. P. Deschamps and G. Sutter, “Decimal division: algorithms and FPGA implementations,” in Proceedings of the 6th Southern Programmable Logic Conference (SPL ’10), pp. 67–72, IEEE Computer Society, March 2010.

[25] Y. Zhang, D. Chen, L. Chen et al., “Design and implementation of a radix-100 decimal division,” in Proceedings of IEEE Symposium on Circuit and System (ISCAS ’09), Taipei, Taiwan, May 2009.

[26] Xilinx, Xilinx LogiCORE Floating-Point Operator v4.0, April 2008.


International Journal of Reconfigurable Computing 13

Table 3 Round-up detection

Rounding mode Round-up detection

Round Ties To Even (119877 gt 5) + (119877 equiv 5) sdot sb+(119877 equiv 5) sdot sb sdot LSD(0)

Round Ties To Away (119877 ge 5)

Round Toward Positive sign sdot ((119877 gt 0) + sb)Round Toward Negative sign sdot((119877 gt 0) + sb)Round Toward Zero 0Legend ldquo+rdquo logical OR ldquosdotrdquo logical AND

offset2 = 783 − 398 = 385
if (QP ≤ QNI) or (QNI + TZ < QP) or (QP > q_max) then
    (a) QNI > q_max ⇒ overflow
    (b) q_min ≤ QNI ≤ q_max ⇒
            q_Z = QNI − offset2
            Z_e = Z
            ZP1_e = ZP1
    (c) QNI < q_min ⇒ underflow
else
    (a) q_min ≤ QP ≤ q_max ⇒
            q_Z = QP − offset2
            Z_e = Z ≫ (QP − QNI)
            ZP1_e = ZP1 ≫ (QP − QNI)
    (b) QNI < q_min ⇒ underflow

Algorithm 9: Exponent calculation.
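The case distinction of Algorithm 9 can be transliterated into Python as a rough sketch. The inputs QP, QNI, and TZ are used exactly as named in the listing; their definitions belong to the exponent calculation unit and are not restated here. The function name, the result type, and the choice to return a shift amount instead of the shifted values Z_e and ZP1_e are assumptions made for illustration, as is the reading of "≫" as a right shift by whole decimal digits. The limits `q_min` and `q_max` are passed in as parameters because their values are not given in this listing.

```python
from typing import NamedTuple, Optional

OFFSET2 = 783 - 398  # = 385, as given in the listing


class ExponentResult(NamedTuple):
    status: str          # "ok", "overflow", or "underflow"
    q_z: Optional[int]   # result exponent; None unless status == "ok"
    shift: int           # right shift (in digits) to apply to Z and ZP1


def exponent_calculation(qp: int, qni: int, tz: int,
                         q_min: int, q_max: int) -> ExponentResult:
    """Follow the branches of Algorithm 9 in the order they are listed."""
    if qp <= qni or qni + tz < qp or qp > q_max:
        if qni > q_max:                      # case (a)
            return ExponentResult("overflow", None, 0)
        if qni < q_min:                      # case (c)
            return ExponentResult("underflow", None, 0)
        return ExponentResult("ok", qni - OFFSET2, 0)  # case (b): Z, ZP1 unshifted
    # Else branch: here qni < qp <= q_max holds.
    if qp >= q_min:                          # case (a)
        return ExponentResult("ok", qp - OFFSET2, qp - qni)
    # qp < q_min together with qni < qp implies qni < q_min: case (b)
    return ExponentResult("underflow", None, 0)
```

Checking the branches in listing order makes the two cases of the else branch mutually exclusive, because QP > QNI is guaranteed there.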

rounding digit or the sticky bit is unequal to zero. Overflow and underflow are computed as described in the exponent calculation unit and are listed in Algorithms 8 and 9. The infinity exception is asserted when the result is greater than the largest number, that is, when either the overflow or the division-by-zero exception is asserted, or when the dividend is infinity while the divisor is a finite number.

6. Implementation Results

All dividers are modeled using VHDL and are implemented for Xilinx Virtex-5 devices with speed grade −2 using Xilinx ISE 10.1. The post-place-and-route results of the fixed-point dividers are listed in Table 4.

All five types of fixed-point dividers require 19 cycles to perform a division. This includes 17 cycles to determine the quotient digits and the rounding digit, one additional cycle to normalize the result (when the first quotient digit is zero), and one cycle of latency to perform on-the-fly conversion and sticky bit determination. As expected, the type1 divider uses the most resources in terms of look-up tables (LUTs) and flip-flops (FFs). The type2, type3.a, and type3.b dividers consume fewer LUTs but require a further five (type2) or two (type3.a and type3.b) BRAMs. However, comparing the delays of the dividers leads to an unexpected result. Contrary to decimal divider implementations in ASIC designs, on FPGA platforms the shift-and-subtract algorithm is the fastest. The reason is that signal propagation on the FPGA's internal fast carry chains is faster than interconnections between slices over the FPGA's general routing matrix. The type1 divider exploits these fast carry chains, whereas the type2, type3.a, type3.b, and type4 dividers use only the normal, slow slice interconnection resources. In ASIC implementations, for instance, the longest path of the type4 divider would be much shorter than the longest path of the type1 divider, but on FPGA architectures the situation is the opposite.

One of the fastest decimal fixed-point dividers on ASICs is the design of Lang and Nannarelli [14]. They minimize the critical path in the digit recurrence by using a fast quotient digit selection function based on binary carry-propagate adders that subtract fixed values from the estimation of the current residual. These fixed values depend on estimations of the divisor and are precomputed and stored in a ROM. Therefore, the latency of the ROM does not contribute to the critical path of the digit recurrence. An implementation of such a divider on FPGAs would perform poorly because the required carry-save adder with small error estimation cannot be implemented efficiently on the FPGA's slice structure. However, the concept of removing the ROM from the critical path is applied to the type3.a and type3.b dividers, and the corresponding post-place-and-route results are shown in Table 4. These results show that the type3.a divider uses a similar number of LUTs as the type2 divider, but its cycle time is much higher. The increased cycle time is explained by the greater complexity of the multiplexers that select the divisor's multiples in the digit recurrence step. These multiplexers have a poor propagation delay because they are implemented as a tree with slow slice interconnections. In contrast, the type3.b divider applies dedicated fast multiplexers, which have a propagation delay of only one LUT instance and eight fast carry chains. This dedicated fast multiplexer speeds up the divider and reduces the maximum cycle time by one nanosecond. Unfortunately, the number of used LUTs increases dramatically, by approximately 1000 LUTs.

The comparison of the type2, type3.a, type3.b, and type4 dividers in terms of speed and area (number of occupied LUTs and FFs combined) shows that the type2 algorithm is the fastest and requires the least number of slices. Hence, there is no benefit in removing the ROM's latency from the critical path because the complexity is moved to the selection of the divisor's multiples. The type1 divider is only 5% faster than the type2 divider, but the speed is bought at a high price because the number of occupied slices is 75% higher. Therefore, the type2 divider shows a good tradeoff in terms of area and latency and is used for the floating-point divider presented in this paper. In the following paragraph, we compare the type2 implementation with other published fixed-point dividers.

Four other FPGA-based dividers are presented by Ercegovac and McIlhenny [17, 18], Deschamps and Sutter [24], Zhang et al. [25], and Vestias and Neto [12]. These implementations are compared to our type2 divider in the following. The divider presented in [17, 18] is based on a digit recurrence algorithm that only requires limited-precision multipliers,


Table 4: Results for normalized decimal fixed-point dividers, p = 16 + 1.

| | Type1 | Type2 | Type3.a | Type3.b | Type4 |
|---|---|---|---|---|---|
| Number of LUTs | 3595 | 1704 | 1749 | 2751 | 1846 |
| Number of FFs | 1234 | 1126 | 1262 | 1262 | 1031 |
| Number of LUTs and FFs combined | 3868 | 2210 | 2304 | 3240 | 2203 |
| Number of 36 k BRAMs | 0 | 5 | 2 | 2 | 0 |
| Number of LUTs, normalized to the type2 divider | 2.11 | 1.00 | 1.03 | 1.61 | 1.08 |
| Number of FFs, normalized to the type2 divider | 1.10 | 1.00 | 1.12 | 1.12 | 0.92 |
| Number of LUTs and FFs combined, normalized to the type2 divider | 1.75 | 1.00 | 1.04 | 1.47 | 1.00 |
| Cycle time (ns) | 8.1 | 8.5 | 10.1 | 9.1 | 12.1 |
| Overall latency (ns) (19 × cycle time) | 154 | 162 | 192 | 173 | 230 |
| Max. frequency (MHz) | 123 | 118 | 99 | 110 | 83 |
| Max. frequency, normalized to the type2 divider | 1.05 | 1.00 | 0.84 | 0.93 | 0.70 |
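The derived rows of Table 4 follow from the measured cycle times and LUT counts. The short Python sketch below recomputes them, reading the cycle times as 8.1, 8.5, 10.1, 9.1, and 12.1 ns (an interpretation that matches the stated rule "overall latency = 19 × cycle time"). The dictionary and function names are hypothetical.

```python
CYCLES_PER_DIVISION = 19  # 17 digit cycles + 1 normalization + 1 conversion/sticky

# Measured values taken from Table 4 (cycle times read as nanoseconds with one decimal).
CYCLE_TIME_NS = {"type1": 8.1, "type2": 8.5, "type3.a": 10.1, "type3.b": 9.1, "type4": 12.1}
LUTS = {"type1": 3595, "type2": 1704, "type3.a": 1749, "type3.b": 2751, "type4": 1846}


def latency_ns(cycle_time_ns: float, cycles: int = CYCLES_PER_DIVISION) -> float:
    """Overall latency of one division in nanoseconds."""
    return cycles * cycle_time_ns


def max_frequency_mhz(cycle_time_ns: float) -> float:
    """Maximum clock frequency in MHz for a given cycle time in nanoseconds."""
    return 1000.0 / cycle_time_ns


def normalized(values: dict, reference: str = "type2") -> dict:
    """Scale every entry by the value of the reference divider."""
    return {name: value / values[reference] for name, value in values.items()}


if __name__ == "__main__":
    lut_ratio = normalized(LUTS)
    for name, t in CYCLE_TIME_NS.items():
        print(f"{name:8s} latency={latency_ns(t):6.1f} ns  "
              f"fmax={max_frequency_mhz(t):6.1f} MHz  LUT ratio={lut_ratio[name]:.2f}")
```

The same two formulas apply to the floating-point configurations once the cycle count is replaced by the number of cycles per floating-point division.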

Table 5: Performance comparison of fixed-point dividers.

| Divider | Occupied area | Cycle time (ns) | Latency (ns) |
|---|---|---|---|
| Ercegovac and McIlhenny [17] (p = 14, Virtex-5) | 1263 (6,2)-LUTs | 13.1 | 197 |
| Type2 divider equalized to p = 14 | 1692 (6,2)-LUTs | 8.5 | 136 |
| Deschamps and Sutter [24] (p = 16, Virtex-4) | | | |
| (i) Nonrestoring algorithm | 2974 (4,1)-LUTs | 21.4 | 386 |
| (ii) SRT-like algorithm | 3799 (4,1)-LUTs | 16.6 | 300 |
| Zhang et al. [25] (p = 16, Virtex-II Pro) | 3976 slices | 20.0 | 420 |
| Vestias and Neto [12] (p = 16, Virtex-4, Newton-Raphson) | | | |
| (i) Type A3 (0 DSPs, 118 cycles) | 2756 (4,1)-LUTs | 3.4 | 394 |
| (ii) Type A4 (0 DSPs, 90 cycles) | 3768 (4,1)-LUTs | 3.4 | 306 |
| (iii) Type A5 (7 DSPs, 112 cycles) | 2091 (4,1)-LUTs | 3.4 | 380 |
| (iv) Type A6 (10 DSPs, 86 cycles) | 2718 (4,1)-LUTs | 3.4 | 292 |
| Type2 divider equalized to p = 16 | 1704 (6,2)-LUTs | 8.5 | 153 |

adders, and LUTs. Furthermore, a compensation term is computed in the digit recurrence that compensates the error caused by this limited precision. The design is optimized for Virtex-5 FPGAs and has a good area characteristic (a 14-digit divider requires 1263 LUTs) but suffers from a high cycle time of 13.1 ns, which is more than 50% higher than that of the type1 or type2 design proposed in this paper. The reason for that high cycle time is (similar to the type4 divider) the long critical paths across many LUTs that cannot exploit the FPGA's fast carry chains. Therefore, the design would probably fit better on ASICs with dedicated routing. The dividers implemented in [24] apply two different digit recurrence algorithms with a redundancy of ρ = 1. The quotient digit selection function is very complex because it requires three MSDs of the divisor and four MSDs of the residual. The cycle times are high, but it must be taken into account that these designs are optimized for Virtex-4 FPGAs. Nevertheless, the number of used LUTs is very high. The divider presented in [25] applies a radix-100 digit recurrence algorithm with comprehensive prescaling of the dividend and divisor. It is optimized for Virtex-II Pro devices; hence, it is hard to compare it with the dividers presented in this paper. However, the critical path includes a decimal carry-save adder tree, two carry-propagate adder stages, and three multiplexer stages. It can therefore be assumed that this algorithm would also perform poorly on Virtex-5 devices, although two quotient digits are retired in each iteration step. Vestias and Neto [12] propose four different decimal dividers optimized for different speed and area tradeoffs. Contrary to the circuits presented in this paper, the dividers apply the Newton-Raphson algorithm. The multiplier used is sequential; therefore, many cycles per division are required. However, each divider has a higher delay and a greater LUT usage compared to our type2 implementation.

Table 5 lists the results of the dividers presented in [17, 24, 25] together with the results of the type2 divider. However, for a fair performance comparison, the precision of the type2 divider is equalized to match the number of calculated quotient digits of the corresponding divider. Furthermore, it should be noted that the dividers presented in [12, 17, 24, 25] do not provide rounding support.

For the floating-point implementation, the fixed-point divider of type2 is used because type1 and type3.b consume too many resources and type3.a and type4 are too slow. The type2 divider requires five additional BRAMs, but these are available in sufficient quantities on Virtex-5 devices. The corresponding post-place-and-route results for two configurations with different numbers of serialization stages are listed in Table 6. A comparison with other implementations is


Table 6: Post-place-and-route results of the floating-point divider based on type2.

| | Config. 1 | Config. 2 |
|---|---|---|
| Number of cycles per float division | 23 | 21 |
| Number of LUTs | 3231 | 3205 |
| Number of FFs | 1630 | 1377 |
| Number of LUTs and FFs combined | 3571 | 3566 |
| Cycle time (ns) | 8.5 | 9.1 |
| Overall latency (ns) | 195.5 | 191 |
| Max. frequency (MHz) | 118 | 110 |

Table 7: Post-place-and-route results of a 64-bit binary floating-point divider (Core Gen).

| Cycles per division | 1 | 10 | 20 | 55 |
|---|---|---|---|---|
| No. of LUTs | 3161 | 592 | 425 | 394 |
| No. of FFs | 196 | 518 | 469 | 542 |
| No. of LUTs and FFs combined | 3232 | 917 | 675 | 675 |
| Cycle time (ns) | 153.7 | 20.8 | 8.9 | 6.8 |
| Overall latency (ns) | 153.7 | 208 | 178 | 374 |
| Max. frequency (MHz) | 6.5 | 48 | 112 | 147 |

complicated because there are no other FPGA-based implementations of decimal floating-point dividers published yet. Therefore, we can only compare the decimal divider with binary dividers implemented on the same FPGA. In Table 7, post-place-and-route results of binary floating-point dividers for the double data format on a Virtex-5, provided by the Xilinx Core Generator [26], are listed. The dividers also provide the exception signals overflow, underflow, invalid operation, and divide-by-zero. Unfortunately, the binary floating-point divider does not support gradual underflow because this feature increases the complexity. The divider with one cycle per division is an unpipelined and fully parallel design, and the divider that requires 55 cycles per division is a fully sequential design. The binary floating-point divider with 20 cycles per division is best suited for a comparison with the decimal floating-point divider because it requires a comparable number of cycles for one operation. Obviously, decimal arithmetic has a great overhead regarding the total number of used LUTs, but the cycle time and latency are similar. This resource overhead can be explained on the one hand by the inefficient representation of decimal values on binary logic and on the other hand by the more complex specification of the decimal standard IEEE 754-2008. For instance, decimal floating-point division requires additional encoders, decoders, and a normalization stage.

7. Conclusion

In this paper we have presented five different radix-10 digit recurrence division algorithms. The type1 divider implements the simple shift-and-subtract algorithm. The type2 divider is based on a nonrestoring algorithm with a ROM-based quotient digit selection function. The type3.a and type3.b dividers are both similar to type2 but use fast binary carry-propagate adders in the quotient digit selection function. However, the type3.a divider uses LUT-based multiplexers, whereas the type3.b divider uses fast carry chain-based multiplexers. The type4 divider applies a new algorithm with constant digit selection functions. This type4 divider requires neither a ROM nor multiplexers to select multiples of the divisor.
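To make the simplest of these schemes concrete, the following Python sketch gives a behavioral model of the type1 shift-and-subtract recurrence as described in the introduction: the residual is shifted by one decimal digit, the precomputed divisor multiples are subtracted, and the smallest nonnegative difference is kept. It is a software illustration on plain integers, not the authors' hardware description; the function name and the inclusion of the multiple 0·d (which stands for "keep the shifted residual") are choices made here for brevity.

```python
def restoring_divide_radix10(x: int, d: int, digits: int) -> tuple[list[int], int]:
    """Radix-10 shift-and-subtract division of x by d.

    Requires 0 <= x < d so that every quotient digit lies in 0..9.
    Returns the fractional quotient digits of x/d and the final residual.
    """
    if not 0 <= x < d:
        raise ValueError("operands must satisfy 0 <= x < d")
    multiples = [k * d for k in range(10)]  # 0*d .. 9*d; hardware precomputes 1*d .. 9*d
    residual = x
    quotient_digits = []
    for _ in range(digits):
        shifted = residual * 10  # shift the residual one decimal digit to the left
        differences = [shifted - m for m in multiples]  # parallel subtractions in hardware
        # The largest multiple with a nonnegative difference yields the smallest
        # nonnegative difference; its index is the quotient digit.
        digit = max(k for k, diff in enumerate(differences) if diff >= 0)
        quotient_digits.append(digit)
        residual = differences[digit]
    return quotient_digits, residual
```

For example, `restoring_divide_radix10(1, 7, 6)` produces the digits of 1/7 = 0.142857… together with the residual that would feed the next iteration.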

We have shown that the shift-and-subtract algorithm, which has the worst latency in ASIC architectures, is the fastest design on FPGAs but uses the most FPGA resources in terms of LUTs. A good tradeoff between latency and area is the type2 architecture with a redundant carry-save representation of the residual, a radix-2 representation of the two MSDs, and a quotient digit selection function implemented in ROM. Furthermore, we have extended this fixed-point divider to a fully IEEE 754-2008 compliant decimal floating-point divider for the decimal64 data format, and finally we have shown implementation results of all dividers.

Appendix

Proofs

Theorem A.1 (rounding overflow). Let us consider a decimal floating-point division $Q = X/D$ with the signs $s_i$, the significands $c_i$, the exponents $q_i$, and the precision $p$:

$$(-1)^{s_Q} \cdot c_Q \cdot 10^{q_Q} = \frac{(-1)^{s_X} \cdot c_X \cdot 10^{q_X}}{(-1)^{s_D} \cdot c_D \cdot 10^{q_D}}. \tag{A.1}$$

Then rounding overflow, that is, an overflow that arises due to adding +1 to the least significant digit of the final result, cannot occur.

Proof. Assume to the contrary that there are a floating-point dividend $X$, a floating-point divisor $D$, and a proper rounding mode where the decimal floating-point division produces a rounding overflow. Without loss of generality, we consider in the following only positive operands ($s_X = s_D = 0$), normalized significands ($c_X, c_D \geq 10^{p-1}$), and the exponents to be zero ($q_X = q_D = 0$). In the case of rounding overflow, the $p$ digits of the quotient must be all nines and the remainder must be unequal to zero, that is,

$$\underbrace{c_X}_{p\ \text{digits}} = c_Q \cdot c_D + \mathrm{rem} = \underbrace{9.9 \cdots 9}_{p\ \text{digits}} \cdot \underbrace{c_D}_{p\ \text{digits}} + \underbrace{\mathrm{rem}}_{p\ \text{digits}} \tag{A.2}$$

with

$$0 < \mathrm{rem} < c_D \cdot 10^{-p+1}. \tag{A.3}$$

Condition (A.2) is fulfilled for

$$c_D = 10^{p-1}, \qquad c_X = \underbrace{99 \cdots 9}_{p\ \text{digits}}, \tag{A.4}$$

but this violates condition (A.3) because the remainder is zero ($\mathrm{rem} = 0$); that is, the result is exact, no round-up is applied, and hence no rounding overflow occurs.

Otherwise, if the divisor is greater than $10^{p-1}$, it follows that

$$\begin{aligned}
\underbrace{c_X}_{p\ \text{digits}} &= (9.9 \cdots 9) \cdot c_D + \mathrm{rem} \\
&= (10 - 10^{-p+1}) \cdot c_D + \mathrm{rem} \\
&= 10 \cdot c_D + \underbrace{(-10^{-p+1} \cdot c_D + \mathrm{rem})}_{< 0} \\
&= \underbrace{10 \cdot c_D + \mu}_{\geq\, p+1\ \text{digits}}
\end{aligned} \tag{A.5}$$

with $-10 < \mu < 0$. The left term of the equation has, by definition, a precision of $p$ digits, while the right term of the equation has a precision of more than $p$ digits. This contradiction finally proves that no rounding overflow can occur.
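The statement can also be checked exhaustively for small precisions. The Python sketch below is an illustration, not part of the proof: it treats the significands as p-digit integers, normalizes the truncated quotient to exactly p digits, and counts operand pairs whose truncated quotient is all nines while the remainder is nonzero, which is the only situation in which a round-up could overflow. The function name is hypothetical, and the search is only practical for very small p.

```python
def rounding_overflow_candidates(p: int) -> int:
    """Count pairs (c_x, c_d) of normalized p-digit significands whose truncated
    p-digit quotient is 99...9 while the division leaves a nonzero remainder."""
    low, high = 10 ** (p - 1), 10 ** p  # normalized significands: low <= c < high
    all_nines = high - 1
    count = 0
    for c_d in range(low, high):
        for c_x in range(low, high):
            # Scale the dividend so that the integer quotient has exactly p digits:
            # c_x / c_d lies in [1, 10) if c_x >= c_d, otherwise in (0.1, 1).
            scale = 10 ** (p - 1) if c_x >= c_d else 10 ** p
            quotient, remainder = divmod(c_x * scale, c_d)
            if quotient == all_nines and remainder != 0:
                count += 1
    return count
```

For p = 1, 2, and 3 the function returns 0: the only pair with an all-nines quotient is the one given in (A.4), and it divides exactly.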

References

[1] M. Baesler, S. Voigt, and T. Teufel, "FPGA implementations of radix-10 digit recurrence fixed-point and floating-point dividers," in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig '11), pp. 13–19, IEEE Computer Society, Los Alamitos, CA, USA, December 2011.

[2] IEEE Task P754, ANSI/IEEE 754-1985, Standard for Binary Floating-Point Arithmetic, New York, NY, USA, August 1985.

[3] M. F. Cowlishaw, "Decimal floating-point: algorism for computers," in Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH '03), pp. 104–111, IEEE Computer Society, Washington, DC, USA, June 2003.

[4] IEEE Task P754, IEEE 754-2008, Standard for Floating-Point Arithmetic, New York, NY, USA, August 2008.

[5] ANSI/IEEE, ANSI/IEEE Std 854-1987: An American National Standard, IEEE Standard for Radix-Independent Floating-Point Arithmetic, New York, NY, USA, October 1987.

[6] A. Y. Duale, M. H. Decker, H. G. Zipperer, M. Aharoni, and T. J. Bohizic, "Decimal floating-point in z9: an implementation and testing perspective," IBM Journal of Research and Development, vol. 51, no. 1-2, pp. 217–227, 2007.

[7] C. F. Webb, "IBM z10: the next-generation mainframe microprocessor," IEEE Micro, vol. 28, no. 2, pp. 19–29, 2008.

[8] L. Eisen, J. W. Ward, H. W. Tast et al., "IBM POWER6 accelerators: VMX and DFU," IBM Journal of Research and Development, vol. 51, no. 6, pp. 663–683, 2007.

[9] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd, "Power7: IBM's next-generation server processor," IEEE Micro, vol. 30, no. 2, pp. 7–15, 2010.

[10] S. F. Oberman and M. J. Flynn, "Division algorithms and implementations," IEEE Transactions on Computers, vol. 46, no. 8, pp. 833–854, 1997.

[11] L. K. Wang and M. J. Schulte, "A decimal floating-point divider using Newton-Raphson iteration," Journal of VLSI Signal Processing Systems, vol. 49, no. 1, pp. 3–18, 2007.

[12] M. Vestias and H. Neto, "Revisiting the Newton-Raphson iterative method for decimal division," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '11), pp. 138–143, IEEE Computer Society Press, September 2011.

[13] H. Nikmehr, B. Phillips, and C. C. Lim, "Fast decimal floating-point division," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 9, pp. 951–961, 2006.

[14] T. Lang and A. Nannarelli, "A radix-10 digit-recurrence division unit: algorithm and architecture," IEEE Transactions on Computers, vol. 56, no. 6, pp. 727–739, 2007.

[15] A. Vazquez, E. Antelo, and P. Montuschi, "A radix-10 SRT divider based on alternative BCD codings," in Proceedings of the 25th IEEE International Conference on Computer Design (ICCD '07), pp. 280–287, IEEE Computer Society Press, Los Alamitos, CA, USA, October 2007.

[16] E. Schwarz and S. Carlough, "Power6 decimal divide," in Proceedings of the 18th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '07), pp. 128–133, IEEE Computer Society, July 2007.

[17] M. D. Ercegovac and R. McIlhenny, "Design and FPGA implementation of radix-10 algorithm for division with limited precision primitives," in Proceedings of the 42nd Asilomar Conference on Signals, Systems and Computers (ASILOMAR '08), pp. 762–766, IEEE Computer Society, Pacific Grove, Calif, USA, October 2008.

[18] M. D. Ercegovac and R. McIlhenny, "Design and FPGA implementation of radix-10 combined division/square root algorithm with limited precision primitives," in Proceedings of the 44th Asilomar Conference on Signals, Systems and Computers (Asilomar '10), pp. 87–91, IEEE Computer Society, Pacific Grove, Calif, USA, November 2010.

[19] F. Y. Busaba, C. A. Krygowski, W. H. Li, E. M. Schwarz, and S. R. Carlough, "The IBM z900 decimal arithmetic unit," in Proceedings of the 35th Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1335–1339, IEEE Computer Society, November 2001.

[20] M. Ercegovac and T. Lang, Division and Square Root: Digit-Recurrence Algorithms and Implementations, Kluwer Academic Publishers, Norwell, Mass, USA, 1994.

[21] M. Baesler, S. O. Voigt, and T. Teufel, "A decimal floating-point accurate scalar product unit with a parallel fixed-point multiplier on a Virtex-5 FPGA," International Journal of Reconfigurable Computing, vol. 2010, Article ID 357839, 13 pages, 2010.

[22] M. Baesler, S. O. Voigt, and T. Teufel, "A radix-10 digit recurrence division unit with a constant digit selection function," in Proceedings of the 28th IEEE International Conference on Computer Design (ICCD '10), pp. 241–246, IEEE Computer Society, Los Alamitos, CA, USA, October 2010.

[23] M. Baesler, S. O. Voigt, and T. Teufel, "An IEEE 754-2008 decimal parallel and pipelined FPGA floating-point multiplier," in Proceedings of the 20th International Conference on Field Programmable Logic and Applications (FPL '10), pp. 489–495, IEEE Computer Society, Washington, DC, USA, September 2010.

[24] J. P. Deschamps and G. Sutter, "Decimal division: algorithms and FPGA implementations," in Proceedings of the 6th Southern Programmable Logic Conference (SPL '10), pp. 67–72, IEEE Computer Society, March 2010.

[25] Y. Zhang, D. Chen, L. Chen et al., "Design and implementation of a radix-100 decimal division," in Proceedings of IEEE Symposium on Circuit and System (ISCAS '09), Taibei, Taiwan, May 2009.

[26] Xilinx, Xilinx LogiCORE Floating-Point Operator v4.0, April 2008.


14 International Journal of Reconfigurable Computing

Table 4 Results for normalized decimal fixed-point dividers 119901 = 16 + 1

Type1 Type2 Type3a Type3b Type4Number of LUTs 3595 1704 1749 2751 1846Number of FFs 1234 1126 1262 1262 1031Number of LUTs and FFs combined 3868 2210 2304 3240 2203Number of 36 k BRAM 0 5 2 2 0Number of LUTs normalized to the type2 divider 211 100 103 161 108Number of FFs normalized to the type2 divider 110 100 112 112 092Number of LUTs and FFs combined normalized to the type2 divider 175 100 104 147 100Cycle time (ns) 81 85 101 91 121Overall latency (ns) (19 times cycle time) 154 162 192 173 230Max frequency (MHz) 123 118 99 110 83Max frequency normalized to the type2 divider 105 100 084 093 070

Table 5 Performance comparison of fixed-point dividers

Occupied area Cycle time (ns) Latency (ns)Ercegovac and McIlhenny [17] (119901 = 14 Virtex-5) 1263 (6 2) LUTs 131 197Type2 divider equalized to 119901 = 14 1692 (6 2) LUTs 85 136Deschamps and Sutter [24] (119901 = 16 Virtex-4)

(i) Nonrestoring algorithm 2974 (4 1) LUTs 214 386(ii) SRT-like algorithm 3799 (4 1) LUTs 166 300

Zhang et al [25] (119901 = 16 Virtex-II Pro) 3976 slices 200 420Vestias and Neto [12] (119901 = 16 Virtex-4 Newton-Raphson)

(i) Type A3 (0 DSPs 118 cycles) 2756 (4 1) LUTs 34 394(ii) Type A4 (0 DSPs 90 cycles) 3768 (4 1) LUTs 34 306(iii) Type A5 (7 DSPs 112 cycles) 2091 (4 1) LUTs 34 380(iv) Type A6 (10 DSPs 86 cycles) 2718 (4 1) LUTs 34 292

Type2 divider equalized to 119901 = 16 1704 (6 2) LUTs 85 153

adders and LUTs Furthermore a compensation term iscomputed in the digit recurrence that compensates the errorcaused by this limited precision The design is optimized forVirtex-5 FPGAs and has a good area characteristic (a 14-digit divider requires 1263 LUTs) but suffers from a high cycletime of 131 ns which is more than 50 higher comparedto the type1 or type2 design proposed in this paper Thereason for that high cycle time is (similar to the type4divider) the long critical paths across many LUTs that cannotexploit the FPGArsquos fast carry chains Therefore the designwould probably fit better on ASICs with dedicated routingThe dividers implemented in [24] apply two different digitrecurrence algorithms with a redundancy of 120588 = 1 Thequotient digit selection function is very complex becauseit requires three MSDs of the divisor and four MSDs ofthe residual The cycle times are high but it must be takeninto account that these designs are optimized for Virtex-4 FPGAs Nevertheless the number of used LUTs is veryhigh The divider presented in [25] applies a radix-100 digitrecurrence algorithm with comprehensive prescaling of thedividend and divisor It is optimized for Virtex-II Pro deviceshence it is hard to compare it with the dividers presentedin this paper However the critical path includes a decimalcarry-save adder tree two carry-propagate adder stages andthree multiplexer stages It can therefore be assumed that this

algorithm would also have a poor performance on Virtex-5 devices although two quotient digits are retired in eachiteration step Vestias and Neto [12] propose four differentdecimal dividers optimized for different speed and areatradeoffs Contrary to the circuits presented in this paperthe dividers apply the Newton-Raphson algorithm The usedmultiplier is sequential therefore many cycles per divisionare required However each divider has a higher delay and agreater LUT usage compared to our type2 implementation

Table 5 lists the results of the dividers presented in [17 2425] together with the results of the type2 divider Howeverfor a fair performance comparison the precision of the type2divider is equalized to match the number of calculatedquotient digits of the corresponding divider Furthermore itshould be noted that the dividers presented in [12 17 24 25]do not provide rounding support

For the floating-point implementation the fixed-pointdivider of type2 is used because type1 and type3b consumetoo many resources and type3a and type4 are too slowHowever five additional BRAMs are required but these areavailable in sufficient quantities on Virtex-5 devices Thecorresponding postplace and route results for two config-urations with different numbers of serialization stages arelisted in Table 6 A comparisonwith other implementations is

International Journal of Reconfigurable Computing 15

Table 6 Postplace and route result of the floating-point dividerbased on type2

Config 1 Config 2Number of cycles per float division 23 21Number of LUTs 3231 3205Number of FFs 1630 1377Number of LUTs and FFs combined 3571 3566Cycle time (ns) 85 91Overall latency (ns) 1955 191Max frequency (MHz) 118 110

Table 7 Postplace and route result of a 64-bit binary floating-pointdivider (Core Gen)

Cycles per division1 10 20 55

No of LUTs 3161 592 425 394No of FFs 196 518 469 542No of LUTs and FFs comb 3232 917 675 675Cycle time (ns) 1537 208 89 68Overall latency (ns) 1537 208 178 374Max frequency (MHz) 65 48 112 147

complicated because there are no other FPGA-based imple-mentations of decimal floating-point dividers published yetTherefore we can only compare the decimal divider withbinary dividers implemented on the same FPGA In Table 7postplace and route results of binary floating-point dividersfor the double data format on a Virtex-5 provided by theXilinx Core Generator [26] are listed The dividers alsoprovide exception signals overflow underflow invalid opera-tion and divide-by-zero Unfortunately the binary floating-point divider does not support gradual underflow becausethis feature increases the complexity The divider with onecycle per division is an unpipelined and fully parallel designand the divider that requires 55 cycles per division is afully sequential design The binary floating-point dividerwith 20 cycles per division is best suited for a comparisonwith the decimal floating-point divider because it requires acomparable number of cycles for one operation Obviouslydecimal arithmetic has a great overhead regarding the totalnumber of used LUTs but the cycle time and latency aresimilar This resource overhead can be explained on the onehand by the inefficient representation of decimal values onbinary logic and on the other hand by the more complexspecification of the decimal standard IEEE 754-2008 Forinstance decimal floating-point division requires additionalencoders decoders and a normalization stage

7 Conclusion

In this paper we have presented five different radix-10 digitrecurrence division algorithmsThe type1 divider implementsthe simple shift-and-subtract algorithm The type2 divider isbased on a nonrestoring algorithmwith ROM-based quotientdigit selection function The type3a and type3b dividers are

both similar to type2 but use fast binary carry-propagateadders in the quotient digit selection function Howeverthe type3a divider uses LUT-based multiplexers whereas thetype3b uses fast carry chain-based multiplexers The type4divider applies a new algorithm with constant digit selectionfunctions This type4 divider requires neither a ROM normultiplexers to select multiples of the divisor

Wehave shown that the subtract-and-shift algorithmwiththeworst latency inASIC architectures is the fastest design onFPGAs but uses the most FPGA resources in terms of LUTsA good tradeoff between latency and area is the architecturetype2with a redundant carry-save representation of the resid-ual a radix-2 representation of the twoMSDs and a quotientdigit selection function implemented in ROM Furthermorewe have extended thisfixed-point to a fully IEEE 754-2008compliant decimal floating-point divider for decimal64 dataformat and finally we have shown implementation results ofall dividers

Appendix

Proofs

Theorem A1 (rounding overflow) Let us consider a decimalfloating-point division 119876 = 119883119863 with the signs 119904

119894 the

significands 119888119894 the exponents 119902

119894 and the precision p

(minus1sQ sdot cQ sdot 10

qQ) =(minus1

sX sdot cX sdot 10qX)

(minus1sD sdot cD sdot 10qD) (A1)

Then rounding overflow that is an overflow that arises due toadding +1 to the least significant digit of the final result cannotoccur

Proof Assume to the contrary that there are a floating-pointdividendX a floating-point divisorD and a proper roundingmode where the decimal floating-point division produces arounding overflow Without loss of generality we considerin the following only positive operands (119904

119883= 119904119863

= 0)normalized significands (119888

119883 119888119863ge 10119901minus1) and the exponents

to be zero (119902119883= 119902119863= 0) In the case of rounding overflow the

119901 digits of the quotient must be all nines and the remaindermust be unequal to zero that is

119888119883⏟⏟⏟⏟⏟⏟⏟

119901 digits

= 119888119876sdot 119888119863+ rem = 99 sdot sdot sdot 9⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟

119901 digitssdot 119888119863⏟⏟⏟⏟⏟⏟⏟

119901 digits

+ rem⏟⏟⏟⏟⏟⏟⏟

119901 digits (A2)

with 0 lt rem lt 119888119863sdot 10minus119901+1

(A3)

Condition (A2) is fulfilled for

119888119863= 10119901minus1

119888119883= 99 sdot sdot sdot 9⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟

119901 digits (A4)

but this violates condition (A3) because the remainder is zero(rem = 0) that is the result is exact no round-up is appliedand hence no rounding overflow occurs

16 International Journal of Reconfigurable Computing

Otherwise, if the divisor is greater than $10^{p-1}$, it follows that

$$\begin{aligned}
\underbrace{c_X}_{p\ \text{digits}} &= (9.9 \cdots 9) \cdot c_D + \mathrm{rem} \\
&= (10 - 10^{-p+1}) \cdot c_D + \mathrm{rem} \\
&= 10 \cdot c_D + \underbrace{(-10^{-p+1} \cdot c_D + \mathrm{rem})}_{<\,0} \\
&= \underbrace{10 \cdot c_D + \mu}_{\ge\, p+1\ \text{digits}}
\end{aligned} \tag{A.5}$$

with $-10 < \mu < 0$. The left term of the equation has by definition a precision of $p$ digits, while the right term of the equation has a precision of more than $p$ digits. This contradiction finally proves that no rounding overflow can occur.
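The statement of Theorem A.1 can also be checked exhaustively for small precisions. The following is only a sketch under stated assumptions, not part of the paper's implementation: the function name `rounding_overflow_cases` is hypothetical, significands are modeled as p-digit integers, and the extra decimal shift applied when the dividend significand is smaller than the divisor significand is an assumption about how the quotient is normalized to p digits.

```python
def rounding_overflow_cases(p):
    """Return all (c_x, c_d) pairs of normalized p-digit significands whose
    p-digit truncated quotient is all nines while the remainder is nonzero.

    Such a pair would be a candidate for rounding overflow; the theorem
    predicts that the returned list is empty.
    """
    lo, hi = 10 ** (p - 1), 10 ** p
    all_nines = hi - 1
    cases = []
    for c_x in range(lo, hi):
        for c_d in range(lo, hi):
            # Assumed normalization: scale the dividend so that the integer
            # quotient has exactly p digits (c_x / c_d lies in (0.1, 10)).
            shift = p - 1 if c_x >= c_d else p
            quotient, rem = divmod(c_x * 10 ** shift, c_d)
            if quotient == all_nines and rem != 0:
                cases.append((c_x, c_d))
    return cases


if __name__ == "__main__":
    for p in (1, 2, 3):
        print(p, rounding_overflow_cases(p))
```

For p = 3, the only all-nines quotient found by this search should be the exact case of (A.4), 999 / 100, which has a zero remainder and therefore needs no round-up.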

References

[1] M. Baesler, S. Voigt, and T. Teufel, “FPGA implementations of radix-10 digit recurrence fixed-point and floating-point dividers,” in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig ’11), pp. 13–19, IEEE Computer Society, Los Alamitos, CA, USA, December 2011.

[2] IEEE Task P754, ANSI/IEEE 754-1985, Standard for Binary Floating-Point Arithmetic, New York, NY, USA, August 1985.

[3] M. F. Cowlishaw, “Decimal floating-point: algorism for computers,” in Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH ’03), pp. 104–111, IEEE Computer Society, Washington, DC, USA, June 2003.

[4] IEEE Task P754, IEEE 754-2008, Standard for Floating-Point Arithmetic, New York, NY, USA, August 2008.

[5] ANSI/IEEE, ANSI/IEEE Std 854-1987: An American National Standard: IEEE Standard for Radix-Independent Floating-Point Arithmetic, New York, NY, USA, October 1987.

[6] A. Y. Duale, M. H. Decker, H. G. Zipperer, M. Aharoni, and T. J. Bohizic, “Decimal floating-point in z9: an implementation and testing perspective,” IBM Journal of Research and Development, vol. 51, no. 1-2, pp. 217–227, 2007.

[7] C. F. Webb, “IBM z10: the next-generation mainframe microprocessor,” IEEE Micro, vol. 28, no. 2, pp. 19–29, 2008.

[8] L. Eisen, J. W. Ward, H. W. Tast et al., “IBM POWER6 accelerators: VMX and DFU,” IBM Journal of Research and Development, vol. 51, no. 6, pp. 663–683, 2007.

[9] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd, “Power7: IBM’s next-generation server processor,” IEEE Micro, vol. 30, no. 2, pp. 7–15, 2010.

[10] S. F. Oberman and M. J. Flynn, “Division algorithms and implementations,” IEEE Transactions on Computers, vol. 46, no. 8, pp. 833–854, 1997.

[11] L. K. Wang and M. J. Schulte, “A decimal floating-point divider using Newton-Raphson iteration,” Journal of VLSI Signal Processing Systems, vol. 49, no. 1, pp. 3–18, 2007.

[12] M. Vestias and H. Neto, “Revisiting the Newton-Raphson iterative method for decimal division,” in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL ’11), pp. 138–143, IEEE Computer Society Press, September 2011.

[13] H. Nikmehr, B. Phillips, and C. C. Lim, “Fast decimal floating-point division,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 9, pp. 951–961, 2006.

[14] T. Lang and A. Nannarelli, “A radix-10 digit-recurrence division unit: algorithm and architecture,” IEEE Transactions on Computers, vol. 56, no. 6, pp. 727–739, 2007.

[15] A. Vazquez, E. Antelo, and P. Montuschi, “A radix-10 SRT divider based on alternative BCD codings,” in Proceedings of the 25th IEEE International Conference on Computer Design (ICCD ’07), pp. 280–287, IEEE Computer Society Press, Los Alamitos, CA, USA, October 2007.

[16] E. Schwarz and S. Carlough, “Power6 decimal divide,” in Proceedings of the 18th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP ’07), pp. 128–133, IEEE Computer Society, July 2007.

[17] M. D. Ercegovac and R. McIlhenny, “Design and FPGA implementation of radix-10 algorithm for division with limited precision primitives,” in Proceedings of the 42nd Asilomar Conference on Signals, Systems and Computers (ASILOMAR ’08), pp. 762–766, IEEE Computer Society, Pacific Grove, Calif, USA, October 2008.

[18] M. D. Ercegovac and R. McIlhenny, “Design and FPGA implementation of radix-10 combined division/square root algorithm with limited precision primitives,” in Proceedings of the 44th Asilomar Conference on Signals, Systems and Computers (Asilomar ’10), pp. 87–91, IEEE Computer Society, Pacific Grove, Calif, USA, November 2010.

[19] F. Y. Busaba, C. A. Krygowski, W. H. Li, E. M. Schwarz, and S. R. Carlough, “The IBM z900 decimal arithmetic unit,” in Proceedings of the 35th Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1335–1339, IEEE Computer Society, November 2001.

[20] M. Ercegovac and T. Lang, Division and Square Root: Digit-Recurrence Algorithms and Implementations, Kluwer Academic Publishers, Norwell, Mass, USA, 1994.

[21] M. Baesler, S. O. Voigt, and T. Teufel, “A decimal floating-point accurate scalar product unit with a parallel fixed-point multiplier on a Virtex-5 FPGA,” International Journal of Reconfigurable Computing, vol. 2010, Article ID 357839, 13 pages, 2010.

[22] M. Baesler, S. O. Voigt, and T. Teufel, “A radix-10 digit recurrence division unit with a constant digit selection function,” in Proceedings of the 28th IEEE International Conference on Computer Design (ICCD ’10), pp. 241–246, IEEE Computer Society, Los Alamitos, CA, USA, October 2010.

[23] M. Baesler, S. O. Voigt, and T. Teufel, “An IEEE 754-2008 decimal parallel and pipelined FPGA floating-point multiplier,” in Proceedings of the 20th International Conference on Field Programmable Logic and Applications (FPL ’10), pp. 489–495, IEEE Computer Society, Washington, DC, USA, September 2010.

[24] J. P. Deschamps and G. Sutter, “Decimal division: algorithms and FPGA implementations,” in Proceedings of the 6th Southern Programmable Logic Conference (SPL ’10), pp. 67–72, IEEE Computer Society, March 2010.

[25] Y. Zhang, D. Chen, L. Chen et al., “Design and implementation of a radix-100 decimal division,” in Proceedings of IEEE Symposium on Circuit and System (ISCAS ’09), Taibei, Taiwan, May 2009.

[26] Xilinx, Xilinx LogiCORE Floating-Point Operator v4.0, April 2008.



Table 6: Post-place-and-route results of the floating-point divider based on type2.

                                      Config 1   Config 2
Number of cycles per float division   23         21
Number of LUTs                        3231       3205
Number of FFs                         1630       1377
Number of LUTs and FFs combined       3571       3566
Cycle time (ns)                       8.5        9.1
Overall latency (ns)                  195.5      191
Max. frequency (MHz)                  118        110

Table 7: Post-place-and-route results of a 64-bit binary floating-point divider (Core Gen).

Cycles per division                   1       10      20      55
No. of LUTs                           3161    592     425     394
No. of FFs                            196     518     469     542
No. of LUTs and FFs combined          3232    917     675     675
Cycle time (ns)                       153.7   20.8    8.9     6.8
Overall latency (ns)                  153.7   208     178     374
Max. frequency (MHz)                  6.5     48      112     147

complicated because there are no other FPGA-based implementations of decimal floating-point dividers published yet. Therefore, we can only compare the decimal divider with binary dividers implemented on the same FPGA. In Table 7, post-place-and-route results of binary floating-point dividers for the double data format on a Virtex-5, provided by the Xilinx Core Generator [26], are listed. The dividers also provide the exception signals overflow, underflow, invalid operation, and divide-by-zero. Unfortunately, the binary floating-point divider does not support gradual underflow because this feature increases the complexity. The divider with one cycle per division is an unpipelined and fully parallel design, and the divider that requires 55 cycles per division is a fully sequential design. The binary floating-point divider with 20 cycles per division is best suited for a comparison with the decimal floating-point divider because it requires a comparable number of cycles for one operation. Obviously, decimal arithmetic has a great overhead regarding the total number of used LUTs, but the cycle time and latency are similar. This resource overhead can be explained, on the one hand, by the inefficient representation of decimal values in binary logic and, on the other hand, by the more complex specification of the decimal standard IEEE 754-2008. For instance, decimal floating-point division requires additional encoders, decoders, and a normalization stage.
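The latency and frequency rows of Tables 6 and 7 follow directly from the cycle count and the cycle time. The short sketch below recomputes them as a sanity check. It takes the cycle times as 8.5 ns and 9.1 ns for the two decimal configurations and 8.9 ns for the 20-cycle binary divider; the helper names are hypothetical, and the small difference for Config 2 (191.1 ns computed versus 191 ns reported) is presumably due to rounding in the table.

```python
def overall_latency_ns(cycles, cycle_time_ns):
    """Latency of one division: number of cycles times the cycle time."""
    return cycles * cycle_time_ns


def max_frequency_mhz(cycle_time_ns):
    """Maximum clock frequency in MHz for a cycle time given in ns."""
    return 1000.0 / cycle_time_ns


# (cycles per division, cycle time in ns); values as read from Tables 6 and 7.
designs = {
    "decimal64, config 1": (23, 8.5),
    "decimal64, config 2": (21, 9.1),
    "binary64, 20 cycles": (20, 8.9),
}

if __name__ == "__main__":
    for name, (cycles, t_ns) in designs.items():
        latency = overall_latency_ns(cycles, t_ns)
        freq = max_frequency_mhz(t_ns)
        print(f"{name}: {latency:.1f} ns, {freq:.0f} MHz")
```

Under these assumptions the decimal divider needs roughly 10% more time per division than the 20-cycle binary divider, which matches the observation that cycle time and latency are similar.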

7. Conclusion

In this paper, we have presented five different radix-10 digit recurrence division algorithms. The type1 divider implements the simple shift-and-subtract algorithm. The type2 divider is based on a nonrestoring algorithm with a ROM-based quotient digit selection function. The type3.a and type3.b dividers are both similar to type2 but use fast binary carry-propagate adders in the quotient digit selection function. However, the type3.a divider uses LUT-based multiplexers, whereas the type3.b uses fast carry chain-based multiplexers. The type4 divider applies a new algorithm with constant digit selection functions. This type4 divider requires neither a ROM nor multiplexers to select multiples of the divisor.
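As an illustration of the simplest of these schemes, the type1 shift-and-subtract recurrence can be modeled in software. This is a behavioral sketch only, assuming integer operands with dividend smaller than divisor. The function name `radix10_shift_and_subtract` is hypothetical, and the loop over candidate multiples stands in for the parallel carry-propagate subtractions and the large multiplexer of the hardware design.

```python
def radix10_shift_and_subtract(x, d, num_digits):
    """Behavioral model of a radix-10 restoring digit recurrence.

    x, d: nonnegative integers with 0 <= x < d (dividend and divisor).
    Returns (digits, residual): the first num_digits fractional quotient
    digits of x / d (most significant first) and the final residual.
    """
    if not 0 <= x < d:
        raise ValueError("expected 0 <= x < d")
    # Precomputed divisor multiples 0*d .. 9*d (0*d is trivial).
    multiples = [k * d for k in range(10)]
    residual = x
    digits = []
    for _ in range(num_digits):
        shifted = 10 * residual  # shift the residual one decimal digit left
        # Subtract every multiple; hardware would do this in parallel.
        differences = [shifted - m for m in multiples]
        # Select the smallest nonnegative difference; its index is the digit.
        digit = max(k for k in range(10) if differences[k] >= 0)
        residual = differences[digit]
        digits.append(digit)
    return digits, residual


if __name__ == "__main__":
    print(radix10_shift_and_subtract(1, 8, 3))
    print(radix10_shift_and_subtract(1, 3, 4))
```

Each iteration preserves the invariant x · 10^n = Q · d + residual, where Q is the integer formed by the n digits produced so far. The quotient digits are unsigned and nonredundant, in contrast to the signed-digit schemes of the other divider types.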



Page 16: Research Article Analysis of Fast Radix-10 Digit …downloads.hindawi.com/journals/ijrc/2013/453173.pdfnumbers. e decimal number 0.1 , for example, has a peri-odical continued fraction

16 International Journal of Reconfigurable Computing

Otherwise if the divisor is greater than 10119901minus1 it follows

that119888119883⏟⏟⏟⏟⏟⏟⏟

119901 digits

= (99 sdot sdot sdot 9) sdot 119888119863+ rem

= (10 minus 10minus119901+1

) sdot 119888119863+ rem

= 10 sdot 119888119863+ (minus10

minus119901+1sdot 119888119863+ rem)

⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟

lt0

= 10 sdot 119888119863+ 120583⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟

⩾119901+1 digits

(A5)

with minus10 lt 120583 lt 0 The left term of the equation hasby definition a precision of 119901 digits while the right termof the equation has a precision of more than 119901 digits Thiscontradiction finally proves that no rounding overflow canoccur

References

[1] M Baesler S Voigt and T Teufel ldquoFPGA implementationsof radix-10 digit recurrence fixed-point and floating-pointdividersrdquo in Proceedings of the International Conference onReconfigurable Computing and FPGAs (ReConFig rsquo11 ) pp 13ndash19 IEEE Computer Society Los Alamitos CA USA December2011

[2] IEEE Task P754 ANSIIEEE 754-1985 Standard for BinaryFloating- Point Arithmetic New York NY USA August 1985

[3] M F Cowlishaw ldquoDecimal floating-point algorism for comput-ersrdquo in Proceedings of the 16th IEEE Symposium on ComputerArithmetic (ARITH rsquo03) pp 104ndash111 IEEE Computer SocietyWashington DC USA June 2003

[4] IEEE Task P754 IEEE 754-2008 Standard for Floating-PointArithmetic New York NY USA August 2008

[5] ANSIIEEE ANSIIEEE Std 854-1987 An American NationalStandard IEEE Standard for Radix-Independent Floating-PointArithmetic New York NY USA October 1987

[6] A Y Duale M H Decker H G Zipperer M Aharoni and T JBohizic ldquoDecimal floating-point in z9 an implementation andtesting perspectiverdquo IBM Journal of Research and Developmentvol 51 no 1-2 pp 217ndash227 2007

[7] C F Webb ldquoIBM z10 the next-generation mainframe micro-processorrdquo IEEE Micro vol 28 no 2 pp 19ndash29 2008

[8] L Eisen J W Ward H W Tast et al ldquoIBM POWER6accelerators VMX and DFUrdquo IBM Journal of Research andDevelopment vol 51 no 6 pp 663ndash683 2007

[9] R Kalla B SinharoyW J Starke andM Floyd ldquoPower7 IBMrsquosnext-generation server processorrdquo IEEEMicro vol 30 no 2 pp7ndash15 2010

[10] S F Oberman and M J Flynn ldquoDivision algorithms andimplementationsrdquo IEEE Transactions on Computers vol 46 no8 pp 833ndash854 1997

[11] L K Wang and M J Schulte ldquoA decimal floating-pointdivider using newton-raphson iterationrdquo Journal of VLSI SignalProcessing Systems vol 49 no 1 pp 3ndash18 2007

[12] M Vestias and H Neto ldquoRevisiting the newton-raphsoniterative method for decimal divisionpagesrdquo in Proceedings ofthe International Conference on Field Programmable Logic andApplications (FPL rsquo11) pp 138ndash143 IEEE Computer SocietyPress September 2011

[13] H Nikmehr B Phillips and C C Lim ldquoFast decimal floating-point divisionrdquo IEEE Transactions on Very Large Scale Integra-tion (VLSI) Systems vol 14 no 9 pp 951ndash961 2006

[14] T Lang and A Nannarelli ldquoA radix-10 digit-recurrence divisionunit algorithm and architecturerdquo IEEE Transactions on Com-puters vol 56 no 6 pp 727ndash739 2007

[15] A Vazquez E Antelo and P Montuschi ldquoA radix-10 SRTdivider based on alternative BCD codingsrdquo in Proceedings of the25th IEEE International Conference on Computer Design (ICCDrsquo07) pp 280ndash287 IEEE Computer Society Press Los AlamitosCA USA October 2007

[16] E Schwarz and S Carlough ldquoPower6 decimal dividepagesrdquoin Proceedings of the 18th IEEE International Conference onApplication-Specific Systems Architectures and Processors (ASAPrsquo07) pp 128ndash133 IEEE Computer Society July 2007

[17] M D Ercegovac and R McIlhenny ldquoDesign and FPGA imple-mentation of radix-10 algorithm for division with limited preci-sion primitivesrdquo in Proceedings of the 42nd Asilomar Conferenceon Signals Systems and Computers (ASILOMAR rsquo08) pp 762ndash766 IEEEComputer Society PacificGrove Calif USAOctober2008

[18] M D Ercegovac and R McIlhenny ldquoDesign and FPGAimplementation of radix-10 combined divisionsquare rootalgorithm with limited precision primitivesrdquo in Proceedings ofthe 44th Asilomar Conference on Signals Systems andComputers(Asilomar rsquo10) pp 87ndash91 IEEEComputer Society PacificGroveCalif USA November 2010

[19] F Y Busaba C A Krygowski W H Li E M Schwarz andS R Carlough ldquoThe IBM z900 decimal arithmetic unitrdquo inProceedings of the 35th Asilomar Conference on Signals Systemsand Computers vol 2 pp 1335ndash1339 IEEE Computer SocietyNovember 2001

[20] M Ercegovac and T Lang Division and Square Root Digit-Recurrence Algorithms and Implementations Kluwer AcademicPublishers Norwell Mass USA 1994

[21] M Baesler S O Voigt and T Teufel ldquoA decimal floating-point accurate scalar product unit with a parallel fixed-pointmultiplier on a Virtex-5 FPGArdquo International Journal of Recon-figurable Computing vol 2010 Article ID 357839 13 pages 2010

[22] M Baesler S O Voigt and T Teufel ldquoA radix-10 digit recur-rence division unit with a constant digit selection functionrdquoin Proceedings of the 28th IEEE International Conference onComputer Design (ICCD rsquo10) pp 241ndash246 IEEE ComputerSociety Los Alamitos CA USA October 2010

[23] M Baesler S O Voigt and T Teufel ldquoAn IEEE 754-2008decimal parallel and pipelined FPGA floating-point multiplierrdquoin Proceedings of the 20th International Conference on FieldProgrammable Logic and Applications (FPL rsquo10) pp 489ndash495IEEE Computer Society Washington DC USA September2010

[24] J P Deschamps and G Sutter ldquoDecimal division algorithmsand FPGA implementationsrdquo in Proceedings of the 6th SouthernProgrammable Logic Conference (SPL rsquo10) pp 67ndash72 IEEEComputer Society March 2010

[25] Y Zhang D Chen L Chen et al ldquoDesign and implementationof a readix-100 decimal divisionrdquo in Proceedings of IEEESymposium on Circuit and System (ISCAS rsquo09) Taibei TaiwanMay 2009

[26] Xilinx Xilinx LogiCORE Floating-Point Operator v40 April2008
