number formats fixed vs floating point

Number Formats: Fixed vs. Floating Point, which representation to use in DSP applications?

Teodor Neagoe, email: [email protected] Senior Applications Engineer, Arrow Electronics, Montreal

Logica Banica, email: [email protected]

Informatics Department, Faculty of Mathematics and Computer Science, University of Pitesti, Street Târgul din Vale, No. 1, Pitesti, Romania

ABSTRACT: The most common misconception about number representation is that the floating point format is more accurate than fixed point. This article reviews most of the number formats used in computers, with specific reference to signal processing, and uses examples to explain how fixed and floating point numbers are represented on integer machines. In fact, the most accurate representation of numbers is the integer format. Each representation has advantages and disadvantages for a specific application. This paper analyzes some of the practical factors that influence a designer's decision when choosing a number format and the type of processor for his or her next project.

1. Introduction

As we all know, numbers in computers are represented in binary. Whether you are working with integer, fixed point, or floating point data, it all comes back to bits. In signal processing we deal with waveforms, so we first have to define a couple of basic parameters that are used throughout this paper.

2. Full Scale vs. Dynamic Range

If you are representing a waveform or any other function digitally, the Full Scale Range is the difference between the largest and the smallest values you wish to digitize.

$$\text{Full Scale Range} = \max[f(x)] - \min[f(x)]$$

Figure 1 Full Scale Range

Dynamic Range is the ratio of the full scale range to the smallest increment that the digital data can resolve. Of course, this increment is in all cases the same, and it represents the one-bit increment of the digital data representation.

$$\Delta f(x) = |f(x_1) - f(x_2)|$$

$$\text{Dynamic Range} = \frac{\text{Full Scale Range}}{\min[\Delta f(x)]}$$

Figure 2 Dynamic Range
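The two definitions can be tried out numerically. The following is a minimal Python sketch, not part of the original paper: the sample values, the 16-bit word length, the choice of the one-bit increment as the full scale range divided by 2^16, and the decibel conversion are all assumptions made for illustration.

```python
import math

def full_scale_range(samples):
    """Full Scale Range = Max[f(x)] - Min[f(x)]."""
    return max(samples) - min(samples)

def dynamic_range(fsr, q):
    """Ratio of the full scale range to the smallest resolvable increment q."""
    return fsr / q

# Hypothetical signal spanning -1.0 .. +1.0, digitized with a 16-bit word.
samples = [-1.0, -0.25, 0.0, 0.5, 1.0]
fsr = full_scale_range(samples)      # 2.0
q = fsr / 2 ** 16                    # assumed one-bit increment
dr = dynamic_range(fsr, q)           # 65536.0
dr_db = 20 * math.log10(dr)          # the same ratio expressed in decibels
```

Under these assumptions each additional bit doubles the dynamic range, which is roughly 6 dB per bit.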

3. Precision vs. Accuracy

Precision is the smallest increment that digital data can resolve (also called the Quantization level [Q]). It relates to resolution and dynamic range.

Error example, round-off vs. truncation:

Round-off: Precision = Q, Accuracy = Q/2

Truncation: Precision = Q, Accuracy = Q

Figure 3 Precision vs. Accuracy

On the other hand, accuracy is the difference between the actual data and the digital representation of the same data. Accuracy relates to system errors and the signal to noise ratio.
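The difference between round-off and truncation can be checked with a short experiment. This is a sketch under assumed values: the quantization level Q = 2^-3 and the grid of test inputs are arbitrary, and the helper names are hypothetical.

```python
import math

def quantize_round(x, q):
    """Quantize to the nearest multiple of q (round-off)."""
    return math.floor(x / q + 0.5) * q

def quantize_truncate(x, q):
    """Quantize down to the next lower multiple of q (truncation)."""
    return math.floor(x / q) * q

q = 2.0 ** -3                              # assumed quantization level Q
xs = [i / 1000.0 for i in range(1000)]     # test values in [0, 1)

# Worst-case error of each method over the test values.
max_round_err = max(abs(x - quantize_round(x, q)) for x in xs)
max_trunc_err = max(abs(x - quantize_truncate(x, q)) for x in xs)
```

Both methods have the same precision Q, but the worst-case round-off error stays within Q/2 while the worst-case truncation error approaches Q.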

Let's see how we can represent the decimal number 32,166 in the three formats.

Integer: $32{,}166_{10}$

Fixed Point: $32.166_{10}$

Floating Point: $32.166_{10} \times 10^{3}$

The question is, which representation is more accurate? The answer may not be obvious, but the integer format is the most accurate. We'll analyze these aspects in more detail in the rest of the article.

4. Integer Format

Representing values in base two is much the same as representing decimal numbers in base ten. Each digit is multiplied by the base raised to a power given by the digit's position, 0 to N-1, where N is the number of digits (bits, in base 2). The products are summed, and the result is the integer value of the digital data. Representing the same number in base 2 requires more digits (in this case bits) than in base 10. For example, representing the value 145 in base two requires 8 bits:

$$145_{10} = (1 \times 10^{2}) + (4 \times 10^{1}) + (5 \times 10^{0}) = 145_{10}$$

$$10010001_{2} = (1 \times 2^{7}) + (1 \times 2^{4}) + (1 \times 2^{0}) = 145_{10}$$

Figure 4 Integer Representation of 145
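The positional sum in Figure 4 can be written as a small routine. This is an illustrative Python sketch; the function name is hypothetical.

```python
def positional_value(digits, base):
    """Sum digit * base**position, with position 0 at the rightmost digit."""
    total = 0
    for position, digit in enumerate(reversed(digits)):
        total += int(digit) * base ** position
    return total

decimal_value = positional_value("145", 10)      # 1*10^2 + 4*10^1 + 5*10^0
binary_value = positional_value("10010001", 2)   # 1*2^7 + 1*2^4 + 1*2^0
```

Both calls evaluate to the same integer, 145, which is the point of the figure: only the base and the number of digits change.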

The following logarithmic formula is a simple way to calculate the number of bits needed to represent an integer in base two. Larger numbers need more bits.

$$N = \log_{n} x = \frac{\log_{10} x}{\log_{10} n}$$

Where: N is the number of bits needed, n is 2, and x is the number to be represented.

In the previous case, to represent 145 in base 2, 8 bits were necessary.

$$N = \log_{2} 145 = \frac{\log_{10} 145}{\log_{10} 2} = \frac{2.161}{0.301} = 7.18 \Rightarrow 8 \text{ bits}$$
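The bit-count calculation can be sketched in Python as follows. The function names are hypothetical. One caveat worth noting: rounding the logarithm up works for 145, but for an exact power of two such as 128 the logarithm is exactly 7 while 8 bits are required, so the safer rule is floor(log2(x)) + 1, which is what Python's built-in `int.bit_length` returns for positive integers.

```python
import math

def bits_by_log(x):
    """log2(x), equal to log10(x) / log10(2), rounded up to a whole number of bits."""
    return math.ceil(math.log2(x))

def bits_exact(x):
    """floor(log2(x)) + 1 for a positive integer, using integer arithmetic."""
    return x.bit_length()

n_log = bits_by_log(145)        # 7.18 rounded up
n_exact = bits_exact(145)       # same result for 145
n_power_of_two = bits_exact(128)  # 128 = 10000000 in binary
```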

Fixed Point Format (Fractions)

In fixed point representations, the data is normalized. So, the full range always varies between -1 and +1. Representing fixed point data (fractions) in base two is also the same as base ten.

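$$0.75_{10} = (7 \times 10^{-1}) + (5 \times 10^{-2}) = 0.75_{10}$$

$$0.1100000_{2} = (1 \times 2^{-1}) + (1 \times 2^{-2}) = 0.75_{10}$$

Bit weights: $2^{0}, 2^{-1}, 2^{-2}, \dots, 2^{-7}$; bits: 0.1100000 (decimal point after the first bit)

Figure 5 Fractional Representation of 0.75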

The most significant bit is used as the sign bit, with the remaining N-1 bits for the fractional part. Because the representation is normalized, the same number of bits N is used to represent any value, small or large.

An interesting example is the representation of 0.2 as a fixed point fraction with 16-bit accuracy and only one sign bit.

$$0.2 \approx \frac{1}{8} + \frac{1}{16} + \frac{1}{128} + \frac{1}{256} + \frac{1}{2048} + \frac{1}{4096} + \dots$$

Bit weights: $2^{0}, 2^{-1}, 2^{-2}, \dots, 2^{-15}$

Bits: 0.0011 0011 0011 0011 ... (decimal point after the sign bit)

Figure 6 Representation of 0.2 in Fixed Point (16-bit Accuracy)

It may be surprising, but to represent 0.2 in fixed point with 16 bit accuracy, more than 16 bits are needed. The explanation is in the fact that .2 in base two is a repeating fraction.
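The repeating pattern can be generated by repeatedly doubling the fraction and taking the integer part as the next bit. The sketch below uses Python's exact `fractions.Fraction` type so that no floating point rounding interferes; the function name is hypothetical.

```python
from fractions import Fraction

def fractional_bits(value, n_bits):
    """Return the first n_bits binary digits of 0 <= value < 1 and what is left over."""
    bits = []
    remainder = Fraction(value)
    for _ in range(n_bits):
        remainder *= 2
        bit = int(remainder >= 1)   # next binary digit
        bits.append(bit)
        remainder -= bit
    return bits, remainder

bits, remainder = fractional_bits(Fraction(1, 5), 16)
pattern = "".join(str(b) for b in bits)
# Value actually stored in the 16 fractional bits.
approx = sum(Fraction(b, 2 ** (i + 1)) for i, b in enumerate(bits))
```

After 16 fractional bits the remainder is again 1/5, so the group 0011 keeps repeating and the stored value is always slightly below 0.2, whatever the word length.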

5. Two's Complement Data Format

Two's complement is a common data format used in microprocessors as well as digital signal processors. The advantage of this format is that it allows subtraction to be replaced by the addition of negative data values.

Here are the steps to convert a fractional number to two's complement format.

Determine the number of bits (N) needed to represent the data;

Multiply the absolute value of the fractional number by $2^{N-1}$;

Round to the nearest integer; and

To make it negative, complement the bits and add one.

Figure 7 shows -0.2 converted to a 16-bit value.

Bit weights: sign, $2^{-1}, 2^{-2}, \dots, 2^{-15}$

Bits: 1.110011001100110 (decimal point after the sign bit)

Figure 7 Converting a -0.2 to a 16-bit Value
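The conversion steps listed above can be sketched in Python as follows. The function name and the default word length are assumptions for illustration; the sketch does not guard against inputs so close to +1 that the rounded magnitude no longer fits in N-1 bits.

```python
def to_twos_complement(value, n_bits=16):
    """Convert a fraction in [-1, 1) to an n_bits two's complement word."""
    # Scale the absolute value by 2**(N-1) and round to the nearest integer.
    magnitude = int(abs(value) * 2 ** (n_bits - 1) + 0.5)
    mask = (1 << n_bits) - 1
    if value < 0:
        # To make it negative, complement the bits and add one.
        return (~magnitude + 1) & mask
    return magnitude & mask

word = to_twos_complement(-0.2)       # 0xE666
bit_string = format(word, "016b")     # "1110011001100110"
```

The resulting bit string is the 16-bit pattern shown in Figure 7.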

6. Accuracy Concerns: Rollover

Fixed point representations are the most effective but require additional attention to scaling and overflow issues. Figure 8 illustrates the problem with two's complement integers. The positive numbers are above the line, while the negative ones are below it. Zero is considered to be a positive number, but the most negative number lies on the same axis.

[Figure: Two's Complement Integers shown on a circle. Positive numbers occupy the upper half, from 0 to Most Positive; negative numbers occupy the lower half, from Least Negative to Most Negative.]

Figure 8 Accuracy Concerns: Rollover

As the numbers increase counterclockwise around the circle, data can suddenly roll over from the most positive to the most negative value in an increment of one bit. In any application, this is a dramatic effect that can affect the functionality of control loops, monitoring devices, or safety equipment. Many DSPs have a saturation mode that does not allow rollover to occur. However, using this mode is not always a wise choice. For example, Figure 9 illustrates a pitfall you can encounter with saturation modes on DSPs. With saturation activated, the intermediate result is not allowed to overflow to a negative value and is held at the most positive value, which leads to a wrong result.

Decimal: $0.5_{10} + 0.75_{10} = 1.25_{10}$, then $1.25_{10} - 0.375_{10} = 0.875_{10} = 0.111_{2}$

Binary:

    Saturation on              Saturation off
      0.1                        0.1
    + 0.11                     + 0.11
    = 0.11   (held)            = 1.01   (overflow)
    + 1.101                    + 1.101
    = 0.011  wrong answer      = 0.111  right answer

Figure 9 Rollover Example

However, with the saturation mode off, the intermediate result is allowed to overflow and the next addition, which pulls the data value back into the allowed range, generates the correct result. What actually happened is one positive number was added to another positive number going to the negative side but, by subtracting the last number, the result ended back in range. The same principle is applied to signals; generally speaking, you should know the range of your output the same way you know the range of your input.
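The effect can be reproduced with integer arithmetic. The sketch below is an illustration under an assumed 4-bit word (sign bit plus three fractional bits), so values are stored as integers scaled by 2^3. With this word length the saturated intermediate value is 0.875, so the wrong answer differs numerically from the one in Figure 9, but the mechanism is the same.

```python
N_BITS = 4                                   # assumed word: sign + 3 fractional bits
LO = -(1 << (N_BITS - 1))                    # most negative value, -8
HI = (1 << (N_BITS - 1)) - 1                 # most positive value, +7

def add_wrap(a, b):
    """Two's complement addition that rolls over on overflow."""
    s = (a + b) & ((1 << N_BITS) - 1)
    return s - (1 << N_BITS) if s > HI else s

def add_saturate(a, b):
    """Addition clamped to the most positive / most negative value."""
    return max(LO, min(HI, a + b))

a, b, c = 4, 6, -3                           # 0.5, 0.75, -0.375 scaled by 2**3

wrapped = add_wrap(add_wrap(a, b), c) / 8            # intermediate overflow allowed
saturated = add_saturate(add_saturate(a, b), c) / 8  # intermediate result clamped
```

With rollover allowed, the intermediate sum wraps to a negative value and the final addition brings it back to 0.875. With saturation, the intermediate sum is clamped and the final result is 0.5, which is wrong.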

7. Bipolar Data Format Conversions

There are situations when data has to be converted from one format to another. The following table summarizes the steps to be taken in order to do this. The number of bits is equally important. Many data converters use bipolar fixed point formats other than two's complement. There are two basic reasons for this: noise and computational simplicity. Some formats are good for A/D and D/A converters because the bit transitions when moving through the codes minimize glitches. Other formats are very good computationally because they eliminate the need for additional hardware, as in the case of two's complement. The designer has to be aware of this and know how the data is represented in various parts of the design. For example, an analog-to-digital converter supplies 12-bit data in Sign Magnitude format while the microprocessor or the DSP processes 16-bit data in two's complement format. First, the sign should be extended from 12 to 16 bits, and then the format is converted according to the steps in the table.

| To \ From | Sign Magn | 2's Compl | Offset Binary | 1's Compl |
|---|---|---|---|---|
| Sign Magn | No change | If MSB=1, compl other bits and add 00..01 | Compl MSB; if MSB=1, compl other bits, add 00..01 | If MSB=1, then compl other bits |
| 2's Compl | If MSB=1, compl other bits and add 00..01 | No change | Compl MSB | If MSB=1, then add 00..01 |
| Offset Binary | Compl MSB; if new MSB=0, compl other bits and add 00..01 | Compl MSB | No change | Compl MSB; if MSB=0, then add 00..01 |
| 1's Compl | If MSB=1, then compl other bits | If MSB=1, then add 11..11 | Compl MSB; if MSB=1, add 11..11 | No change |

Figure 10 Conversion Table for Bipolar Digital Data Formats
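As an illustration of the 12-bit A/D example, the following Python sketch converts a Sign Magnitude word to 16-bit two's complement and then to offset binary. The function names are hypothetical, and for brevity the sketch widens the magnitude and applies the conversion rule in a single step instead of extending the word first.

```python
def sign_magnitude_to_twos(word, in_bits=12, out_bits=16):
    """Convert an in_bits Sign Magnitude word to out_bits two's complement."""
    sign = (word >> (in_bits - 1)) & 1
    magnitude = word & ((1 << (in_bits - 1)) - 1)
    if sign:
        # MSB=1: complement the other bits and add 00..01.
        return (~magnitude + 1) & ((1 << out_bits) - 1)
    return magnitude

def twos_to_offset_binary(word, bits=16):
    """Two's complement to offset binary: complement the MSB."""
    return word ^ (1 << (bits - 1))

minus_three = sign_magnitude_to_twos(0x803)          # sign bit set, magnitude 3
plus_three = sign_magnitude_to_twos(0x003)
offset_minus_three = twos_to_offset_binary(minus_three)
```

Here -3 becomes 0xFFFD in two's complement and 0x7FFD in offset binary, while zero in two's complement maps to the mid-scale code 0x8000.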

8. Q Format: Q(M,N)

The Q format is a common data representation that was made popular by Texas Instruments. It allows data to be represented as a combination of integer and fractional components: M integer bits, N fractional bits, and a sign bit, for a total of M+N+1 bits. The most common Q formats are Q(0,15), also called Q15, and Q(0,31), also known as Q31. They are essentially two's complement fixed point representations using 16 and 32 bits respectively. They represent purely fractional numbers, where one bit is used for the sign and the rest for the fraction. Multiplying Q(M) and Q(N) data values produces a Q(M+N) value with the sign bit replicated. For example, multiplying two Q15 numbers results in a Q30 number with the sign bit represented twice.

Bits: 1 110011 . 001100110 (sign bit, M=6 integer bits, decimal point, N=9 fractional bits)

Figure 11 Q(M,N) Format with M=6, N=9
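The Q15 multiplication rule can be sketched with plain Python integers standing in for 16-bit and 32-bit registers. The helper names are hypothetical, and truncation is assumed when the product is brought back to Q15.

```python
def to_q15(x):
    """Scale a fraction in [-1, 1) to a Q15 integer."""
    return int(round(x * (1 << 15)))

def from_q15(n):
    """Convert a Q15 integer back to a fraction."""
    return n / (1 << 15)

def q15_mul_q30(a, b):
    """Multiply two Q15 integers; the product is Q30 with the sign bit duplicated."""
    return a * b

def q30_to_q15(p):
    """Drop the 15 extra fractional bits (truncation) to return to Q15."""
    return p >> 15

a = to_q15(0.5)             # 16384
b = to_q15(-0.25)           # -8192
p = q15_mul_q30(a, b)       # -0.125 scaled by 2**30
r = q30_to_q15(p)           # -0.125 scaled by 2**15
```

On a 32-bit register the Q30 product occupies 30 fractional bits plus two copies of the sign, which is why DSP code commonly shifts the product left by one bit or right by 15 bits before storing a Q15 result.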

9. Floating Point Format

In floating point format, the data is represented with a fractional part, called mantissa (M), an exponent (E), and a sign bit (S).

$$\text{Floating Point Number} = (-1)^{S} \times M \times 2^{E - \text{offset}}$$

Mantissa is normalized to 0.5
