
    High Performance FPGA based Floating Point Arithmetics

    Project report for Computer Arithmetic Algorithms

    Andreas Ehliar and Per Karlstrom

    {ehliar,perk}@isy.liu.se

    June 13, 2006

    1 Introduction

    We decided to investigate what kind of floating point arithmetic performance it is possible to achieve in a modern FPGA. In order to gain a thorough understanding of the issues involved we decided to try to implement a fast FPU ourselves. We do expect that an expert in the field could come up with a better solution, especially given the limited amount of time available for this project.

    However, a search on the Internet did not turn up any references to high performance FPUs on Virtex 4 FPGAs.

    The Virtex-4 uses a relatively standard FPGA architecture with CLBs consisting of 4 slices, each of which contains 2 4-LUTs and 2 flip flops. The FPGA also has a large number of embedded memories and DSP blocks containing high speed multipliers and adders. In addition, the Virtex-4 contains a number of specialized components which were not used in this project. For further details about the Virtex-4 FPGA, see the Virtex-4 User Guide [2]. The DSP blocks are thoroughly described in the XtremeDSP user guide [3].

    In order to test the FPU in a realistic environment we decided to implement a complex radix-2 butterfly kernel built from the FPU adder and multiplier. This kernel can be used to implement, for example, higher radix FFTs.
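
    As a point of reference, a complex radix-2 butterfly computes A' = A + W*B and B' = A - W*B, i.e. four real multiplications and six real additions/subtractions per butterfly. The sketch below only illustrates that operation count; the fpu_add, fpu_sub and fpu_mul helpers are hypothetical stand-ins for the hardware operators described later, not part of the actual design.

    // Behavioural sketch of the complex radix-2 butterfly used as the test
    // kernel. fpu_add/fpu_sub/fpu_mul are hypothetical stand-ins for the
    // hardware floating point operators; here they are plain float ops.
    struct Cplx { float re, im; };

    static float fpu_add(float a, float b) { return a + b; }
    static float fpu_sub(float a, float b) { return a - b; }
    static float fpu_mul(float a, float b) { return a * b; }

    // A' = A + W*B, B' = A - W*B, where W is the twiddle factor.
    static void butterfly(Cplx a, Cplx b, Cplx w, Cplx &a_out, Cplx &b_out) {
        // Complex multiplication W*B: 4 real multiplies, 1 add, 1 subtract.
        float t_re = fpu_sub(fpu_mul(w.re, b.re), fpu_mul(w.im, b.im));
        float t_im = fpu_add(fpu_mul(w.re, b.im), fpu_mul(w.im, b.re));
        // The two complex add/sub operations: 4 more real adders/subtracters.
        a_out.re = fpu_add(a.re, t_re);
        a_out.im = fpu_add(a.im, t_im);
        b_out.re = fpu_sub(a.re, t_re);
        b_out.im = fpu_sub(a.im, t_im);
    }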

    We selected a simple floating point format with no denormalized numbers and with neither NaN nor Inf.

    2 Methodology

    To test the final result we implemented a C++ class for floating point numbers. The number of bits in the mantissa and exponent could be configured from 1 to 31 bits. The C++ model was used to generate the test vectors for the RTL test benches.

    An initial RTL model was then developed and tested against the floating point test data. The RTL model was written with the hardware in mind, but it was not optimized for the Virtex 4 FPGA.
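
    As an illustration, a minimal configurable reference model of this kind, assuming it simply packs a sign, a biased exponent and a stored mantissa into a bit vector for the test benches, could look like the following; the class and member names are illustrative only, not the original code.

    // Illustrative sketch of a configurable floating point reference model
    // used to generate test vectors. Class and member names are assumptions,
    // not the original C++ class.
    #include <cstdint>
    #include <cassert>

    class SoftFloat {
    public:
        SoftFloat(int exp_bits, int man_bits)
            : exp_bits_(exp_bits), man_bits_(man_bits) {
            // Both field widths were configurable from 1 to 31 bits.
            assert(exp_bits >= 1 && exp_bits <= 31);
            assert(man_bits >= 1 && man_bits <= 31);
        }

        // Excess-(2^(e-1) - 1) bias, i.e. excess-511 for a 10-bit exponent.
        uint32_t bias() const { return (1u << (exp_bits_ - 1)) - 1; }

        // Pack sign, biased exponent and stored mantissa (implicit one not
        // included) into one word, as written to the RTL test vector files.
        uint64_t pack(uint32_t sign, uint32_t exp, uint32_t man) const {
            return (uint64_t(sign & 1) << (exp_bits_ + man_bits_))
                 | (uint64_t(exp) << man_bits_)
                 | uint64_t(man);
        }

    private:
        int exp_bits_;
        int man_bits_;
    };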

    The performance of the initial RTL model was evaluated and the most critical part of the design was optimized to better fit the FPGA. After the optimization, the model was verified with the test benches. This was repeated until the performance was satisfactory.

    Finally, the design was tested in an FPGA by downloading test data to the FPGA and uploading the results from the butterfly calculation for verification against test patterns generated by the C++ model.

    3 Floating point format

    The first version of the RTL code was fairly configurable with regard to mantissa and exponent sizes. In order to ease the development of an optimized FPGA implementation, we decided to limit the floating point format to a maximum of one sign bit, 10 bits of exponent, and 15 bits of mantissa with an implicit one. The mantissa is represented using regular unsigned binary numbers. The exponent is implemented using excess-511.
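
    To make the format concrete: a value then occupies 26 bits, with the sign in the top bit, a 10-bit excess-511 exponent below it, and a 15-bit stored mantissa with an implicit leading one. The decoder sketch below shows how such a word maps to a numeric value; the helper names and the assumption that zero is encoded as an all-zero word are illustrative assumptions.

    // Decoding sketch for the chosen 26-bit format: bit 25 = sign,
    // bits 24..15 = exponent (excess-511), bits 14..0 = mantissa with an
    // implicit leading one. Names and the zero encoding are assumptions.
    #include <cstdint>
    #include <cmath>

    struct Fp26 {
        uint32_t bits;  // only the low 26 bits are used

        int      sign()     const { return (bits >> 25) & 1; }
        int      exponent() const { return (bits >> 15) & 0x3FF; } // 10 bits
        uint32_t mantissa() const { return bits & 0x7FFF; }        // 15 bits

        // Convert to double for comparison against a reference model. There
        // are no denormals, NaN or Inf in this format.
        double to_double() const {
            if (bits == 0) return 0.0;                        // assumed zero encoding
            double frac = 1.0 + mantissa() / 32768.0;         // implicit one + 15 bits
            double val  = std::ldexp(frac, exponent() - 511); // excess-511 exponent
            return sign() ? -val : val;
        }
    };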

    4 Multiplier

    The multiplier is quite simple to construct due to the large number of available multiplier blocks in the FPGA. A single multiplier is used for the mantissa and an adder is used for the exponent. It is also necessary to normalize the result of the multiplication. This normalizer is very simple since the most significant bit can only be located at one out of two bit positions given normalized inputs to the multiplier. The overall architecture of the multiplier is shown in figure 1.

    A simple rounding scheme was chosen where the rounding was done before the normalization. This can be implemented basically for free in the DSP48 blocks in the FPGA. This can be contrasted with the rounding schemes used in IEEE-754, where rounding is performed after normalization with an extra small normalization step required to check for overflow after rounding. This would not map very well to the DSP48 block. Except for the utilization of the DSP48 block, no FPGA specific optimizations were performed in the multiplier block.
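
    A behavioural sketch of this datapath is given below: the 16-bit mantissas (implicit ones prepended) are multiplied, the exponents are added, a rounding constant is added before the shift, and the result is normalized by at most one position. The exact rounding point used in the hardware is not specified here, so the constant in the sketch is an assumption, as are all names; it is an illustration, not the RTL.

    // Behavioural model of the multiplier datapath: mantissa product in a
    // single multiplier, exponent addition, rounding before the single
    // one-position normalization step. Illustrative only; the exact rounding
    // point used in the real DSP48 datapath is an assumption.
    #include <cstdint>

    struct UnpackedFp {
        int      sign;   // 0 or 1
        int      exp;    // excess-511
        uint32_t man;    // 15 stored mantissa bits (implicit one not included)
    };

    UnpackedFp fp_mul(UnpackedFp a, UnpackedFp b) {
        uint32_t ma = (1u << 15) | a.man;   // 16-bit mantissa incl. implicit one
        uint32_t mb = (1u << 15) | b.man;
        uint64_t prod = uint64_t(ma) * mb;  // 32-bit product, MSB in bit 31 or 30

        // Round before normalization by adding half an LSB of the final
        // 15-bit mantissa (assuming no normalization shift); in the FPGA such
        // a constant is folded into the DSP48 adder stage essentially for free.
        prod += 1u << 14;

        UnpackedFp r;
        r.sign = a.sign ^ b.sign;
        if (prod & (1ull << 31)) {
            // Product in [2, 4): shift right by one and bump the exponent.
            r.man = uint32_t((prod >> 16) & 0x7FFF);
            r.exp = a.exp + b.exp - 511 + 1;
        } else {
            // Product in [1, 2): already normalized.
            r.man = uint32_t((prod >> 15) & 0x7FFF);
            r.exp = a.exp + b.exp - 511;
        }
        return r;
    }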


    Figure 1: The floating point multiplier architecture

    5 Adder/Subtracter

    A floating point adder is more complicated than a floating point multiplier. The basic architecture for the adder is shown in figure 2. The first step compares the operands and swaps them if necessary so that the largest number always enters the left path. This step also adds the implicit one if the input operands are non-zero. In the next step, the smallest number is aligned so that the exponents of both operands match. After this step, an addition or subtraction of the two numbers is performed. A subtraction can never cause a negative result because of the earlier comparison and swap step.

    The normalization step is the final step. It is implemented using two pipeline stages. The first stage looks at the mantissa in 4-bit intervals as seen in figure 3. The first module looks at the first four bits and outputs a normalized result assuming a one was found in these bits. An extra output signal, shown as gray lines in the figure, is used to signal that all four bits were zero. The second module assumes that the first four bits were all zero and instead looks at the following four bits, outputting a normalized result. This is repeated for the remaining bits of the mantissa. The next stage decides which of the previous results should be used. If all bits were zero, a zero is output as the result. The value needed to correct the exponent is generated according to the same scheme. This is shown as dashed lines in the figure.
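
    For clarity, the pipeline can be summarized as a behavioural model: compare and swap on magnitude, align the smaller operand, add or subtract, and then normalize by locating the leading one (which the hardware does in 4-bit groups as described above). The sketch below uses assumed names, performs the leading-one search with a plain loop instead of the grouped hardware structure, and omits rounding; it is an illustration, not the RTL.

    // Behavioural model of the adder datapath: compare/swap, align, add or
    // subtract, normalize on the leading one. Illustrative only; the hardware
    // performs the leading-one search in 4-bit groups over two pipeline stages.
    #include <cstdint>
    #include <utility>

    struct UnpackedFp {
        int      sign;   // 0 or 1
        int      exp;    // excess-511
        uint32_t man;    // 15 stored mantissa bits
    };

    UnpackedFp fp_addsub(UnpackedFp a, UnpackedFp b, bool subtract) {
        if (subtract) b.sign ^= 1;

        // Step 1: swap so the larger magnitude takes the left path, and add
        // the implicit one (both inputs are assumed non-zero here).
        uint32_t ma = (1u << 15) | a.man;
        uint32_t mb = (1u << 15) | b.man;
        if (b.exp > a.exp || (b.exp == a.exp && mb > ma)) {
            std::swap(a, b);
            std::swap(ma, mb);
        }

        // Step 2: align the smaller operand to the larger exponent.
        int shift = a.exp - b.exp;
        uint32_t aligned = (shift > 16) ? 0u : (mb >> shift);

        // Step 3: add or subtract; the result is never negative thanks to the
        // earlier compare-and-swap step.
        uint32_t sum = (a.sign == b.sign) ? (ma + aligned) : (ma - aligned);

        UnpackedFp r{a.sign, a.exp, 0};
        if (sum == 0) { r.sign = 0; r.exp = 0; return r; }  // all bits zero -> zero

        // Step 4: normalize by finding the leading one and adjusting the
        // exponent accordingly.
        int msb = 0;
        for (uint32_t t = sum; t > 1; t >>= 1) ++msb;
        r.exp += msb - 15;
        r.man  = ((msb >= 15) ? (sum >> (msb - 15)) : (sum << (15 - msb))) & 0x7FFF;
        return r;
    }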


    Figure 2: The overall architecture for the adder


    Figure 3: The normalizer architecture

    5.1 FPGA optimizations

    Initially the adder met timing at 250 MHz, but it did not achieve this performance once it was inserted into a complex butterfly. At this point further optimizations were required. The first FPGA specific optimization was to make sure that the adder/subtracter was implemented using only one LUT per bit. A standard adder structure compared to an adder structure with both addition and subtraction is shown in figure 4.
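
    The standard way to get both operations into one LUT per bit is to XOR each B bit with the subtract control inside the LUT and feed the same control in as the carry-in, so that A - B is computed as A + ~B + 1. The bit-level model below is a software illustration of that structure with assumed names, not the actual slice mapping.

    // Bit-level model of a combined adder/subtracter costing one LUT per bit:
    // each B bit is XORed with the 'sub' control inside the LUT and 'sub' is
    // injected as the carry-in, so A - B becomes A + ~B + 1.
    #include <cstdint>

    uint32_t addsub(uint32_t a, uint32_t b, bool sub, int width) {
        uint32_t carry  = sub ? 1u : 0u;  // the carry-in supplies the +1
        uint32_t result = 0;
        for (int i = 0; i < width; ++i) {
            uint32_t ai = (a >> i) & 1u;
            uint32_t bi = ((b >> i) & 1u) ^ (sub ? 1u : 0u); // extra XOR fits in the LUT
            uint32_t s  = ai ^ bi ^ carry;                   // sum bit
            carry       = (ai & bi) | (ai & carry) | (bi & carry);
            result     |= s << i;
        }
        return result & ((width < 32) ? ((1u << width) - 1u) : 0xFFFFFFFFu);
    }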

    Another optimization concerned the exponent selection in the normalization step. At first, this was implemented using a 5-to-1 mux in front of an adder. By implementing a 2-to-1 mux directly in the same LUT used for the addition, a smaller 4-to-1 mux could be used in front of the adder. In order to make sure that this mux was placed near the adder, RLOC directives were used to place the components in relation to each other.

    In both the exponent and mantissa mux, the reset signal of the flip flop was used to set the result to zero instead of embedding this logic into the LUT.

    Another technique that we tried was to construct a 4-to-1 mux combined with a priority decoder as shown in figure 5. This mux should achieve slightly better performance than an ordinary mux since there is only one level of LUTs. In a later stage of the implementation we moved the OR function to the previous pipeline stage as well.
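
    Behaviourally, the combined priority decoder and mux simply selects the candidate from the first 4-bit group, counted from the most significant end, whose "found a one" flag is set, and outputs zero when no group found a one. A small sketch with assumed names:

    // Behavioural model of the priority-decoded 4-to-1 selection used in the
    // normalizer: pick the candidate from the first group (most significant
    // first) that reported a leading one; output zero if none did.
    #include <cstdint>

    uint32_t priority_select(const uint32_t candidate[4], const bool found[4]) {
        for (int i = 0; i < 4; ++i) {
            if (found[i]) return candidate[i];  // first group with a one wins
        }
        return 0;  // all-zero mantissa: force the result to zero
    }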


    Figure 4: A regular adder using 1 LUT/bit compared to an adder/subtracter using 1 LUT/bit.

    Figure 5: A priority decoder combined with a 4-to-1 mux.


    6 Floorplanning

    In order to improve the performance of the final system we tried to locate different pipeline stages close to each other by using RLOC directives. Doing this resulted in more regularity and a smaller area footprint.

    7 Results and Discussion

    Knowing the FPGA architecture is important in order to write efficient HDL code. A good understanding of the Virtex 4 architecture enables the designer to use the fabric in ways not (yet) supported by the synthesis tools. In some cases the gains can be substantial; in other cases the gains are more limited.

    With the initial RLOC optimizations we achieved better timing results, but as soon as we tried to use RLOC over pipeline boundaries we got worse timing results. Eventually we managed to reach a 250 MHz clock frequency for the radix-2 butterfly by using RLOC. The floorplan for this implementation is shown in figure 6. At this point, however, a number of low level optimizations had been done which enabled the design to meet timing at 250 MHz even without the use of RLOC. Unfortunately, the RLOCed radix-4 butterfly could not be fitted into the FPGA because one radix-2 butterfly was too wide, and we did not have time to correct this problem. Thus the radix-4 butterfly could only be placed without RLOC directives. The radix-4 butterfly also met timing at 250 MHz. The floorplan for the radix-4 butterfly is shown in figure 7. Table 1 lists the final resource utilization in the FPGA for various components. The radix-2 and radix-4 are complex valued butterflies whereas the floating point adder and multiplier operate on real values.

    Resource     Radix-4   Radix-2   Adder   Multiplier   Available
    LUTs           10104      2514      73          372       30720
    Flip Flops     14432      3660      63          325       30720
    DSP48             16         4       1            0         192

    Table 1: Component resource utilization

    There are a number of opportunities for further optimizations in this design. For example, instead of using CLBs for the shifting, a multiplier could be used for this task by sending in the number to be shifted as one operand and a bit vector with a single one in a suitable position as the other operand.
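
    The idea rests on the fact that a left shift by n is the same as a multiplication by 2^n, so the shifter can be pushed into a DSP48 multiplier by feeding the value on one port and a one-hot constant on the other; a right shift can be obtained the same way by taking the upper half of the product. The minimal sketch below, with assumed names and widths, only illustrates the arithmetic identity, not the actual mapping.

    // Sketch of shifting with a multiplier: multiplying by the one-hot value
    // 2^n shifts the operand left by n; taking the upper 16 bits of
    // value * 2^(16 - n) gives a right shift by n. Illustrative only.
    #include <cstdint>

    uint32_t shl_via_mul(uint16_t value, unsigned n) {       // n in [0, 15]
        uint32_t one_hot = 1u << n;                          // single one at position n
        return uint32_t(value) * one_hot;                    // equals value << n
    }

    uint16_t shr_via_mul(uint16_t value, unsigned n) {       // n in [1, 16]
        uint32_t one_hot = 1u << (16u - n);
        return uint16_t((uint32_t(value) * one_hot) >> 16);  // equals value >> n
    }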

    If the application of the floating point blocks is known, it is possible to do some application specific optimizations. For example, in a butterfly with an adder and a subtracter operating on the same operands, the first compare stage could be shared between them. If the application can tolerate it, further pipelining could increase the performance significantly. If the latency tolerance is very high, bit-serial arithmetic could probably be used as well. In this project we limited the pipeline depth to compare well with FPUs used in CPUs.

    According to a post on comp.arch.fpga it is possible to achieve 400 MHz performance for IEEE single precision floating point arithmetic. Few details are available, but a key technique is to use the DSP48 block for the adder since an adder implemented with a carry chain would be too slow. The post normalization step is supposed to be implemented using both DSP48 blocks and Block RAMs [1]. The pipeline depth of this implementation is not known.


    Figure 6: RLOCed complex butterfly


    Figure 7: Non-RLOCed radix-4 butterfly



    It would also be interesting to look at the newly announced Virtex 5 architecture. The 6-LUT architecture should reduce the number of logic levels and the amount of routing throughout the design. Unfortunately, no tools that target the Virtex 5 are publicly available today.

    8 RLOC related problems

    It is relatively easy to RLOC individual pipeline stages, but once we tried to hierarchically RLOC several pipeline stages, the performance suddenly decreased. Generally, the place and route tool seems to place modules quite far from each other. This tends to balance the different pipeline stages and eases routing due to lower congestion. However, as soon as we started to RLOC several pipeline stages together, the distance between two non-RLOCed stages grew larger and it was harder to meet timing. In the end, we had to RLOC at least some parts of all modules involved in the design to be able to meet timing.

    9 Conclusions

    The Virtex 4 FPGA is not really suited for floating point arithmetic. With some techniques detailed in this report it is possible to get relatively decent performance, although we would have liked to achieve a higher performance. We also realized that the placer does a pretty good job and that it is not trivial to achieve higher performance by doing some of the placement by hand.

    References

    [1] Andraka, Ray; Re: Floating point reality check, news:comp.arch.fpga, 14 May 2006

    [2] Xilinx; Virtex-4 User Guide

    [3] Xilinx; XtremeDSP for Virtex-4 FPGAs User Guide
