unit2 arithmetics

7/27/2019 Unit2 Arithmetics

1/79

UNIT 2


2/79

Addition/subtraction of signed numbers

si =

ci+1=

13

7+Y

1

0

0

0

1

0

1

1

0

0

1

1

0

11

0

0

1

1

0

1

0

01

0

0

0

0

1

1

1

1

0

0

0

0

1

1

11

Example:

10= = 0

01 1

11 1 0 0

1

1 1 10

Legend for stage i

xi

yi

Carry-inci

Sumsi

Carry-outci+1

X

Z

+ 6 0+xiyi

si

Carry-outci+1

Carry-inci

xiyici

xiyici

xiyici

xiyici xi yi ci =

+ + +

yicixicixiyi+ +

At the ith stage:Input:

ci is the carry-in

Output:

si is the sum

ci+1 carry-out to (i+1)st

state


3/79

Addition logic for a single stage

Full adder(FA)

ci

ci 1+

si

Sum Carry

yixi

ci

yi

xi

ci

yi

x

i

xi

ci

yi

si

ci 1+

Full Adder (FA): Symbol for the complete circuit

for a single stage of addition.


4/79

n-bit adderCascade n full adder (FA) blocks to form a n-bit adder.Carries propagate or ripple through this cascade, n-bit ripple carry adder.

FA c0

y1

x1

s1

FA

c1

y0

x0

s0

FA

cn 1-

yn 1-

xn 1-

cn

sn 1-

Most significant bit(MSB) position

Least significant bit(LSB) position

Carry-in c0 into the LSB position provides a convenient way to

perform subtraction.


5/79

K n-bit adderK n-bit numbers can be added by cascading kn-bit adders.

n-bit c0

yn

xn

sn

cn

y0

xn 1-

s0

ckn

sk 1-( )n

x0

yn 1-

y2n 1-

x2n 1-

ykn 1-

sn 1-

s2n 1-

skn 1-

xkn 1-

addern-bitaddern-bitadder

Each n-bit adder forms a block, so this is cascading of blocks.Carries ripple or propagate through blocks, Blocked Ripple Carry Adder


6/79

n-bit subtractor

FA 1

y1

x1

s1

FAc1

y0

x0

s0

FAcn 1-

yn 1-

xn 1-

cn

sn 1-

Most significant bit(MSB) position

Least significant bit(LSB) position

RecallXYis equivalent to adding 2s complement ofYtoX.

2s complement is equivalent to 1s complement + 1.

XY = X + Y + 1

2s complement of positive and negative numbers is computed similarly.


7/79

n-bit adder/subtractor (contd..)

Add/Subcontrol

n-bit adder

xn 1-

x1

x0

cn

sn 1- s1s0

c0

yn 1-

y1

y0

Add/sub control = 0, addition.

Add/sub control = 1, subtraction.


8/79

Detecting overflows

Overflows can only occur when the sign of the two operands isthe same.

Overflow occurs if the sign of the result is different from the

sign of the operands.

Recall that the MSB represents the sign. xn-1, yn-1, sn-1 represent the sign of operand x, operand yand result s

respectively.

Circuit to detect overflow can be implemented by the

following logic expressions:

111111 nnnnnn syxsyxOverflow

1 nn ccOverflow


9/79

Computing the add timeConsider 0th stage:

x0 y0

c0c1

s0

FA

c1 is available after 2 gate delays.

s1 is available after 1 gate delay.

ci

yi

xi

ci

yi

xi

xi

ci

yi

si

ci 1+

Sum Carry


10/79

Computing the add time (contd..)

x0 y0

s2

FA

x0 y0x0 y0

s1

FAc2

s0

FAc

1

c3

c0

x0 y0

s3

FA

c4

Cascade of 4 Full Adders, or a 4-bit adder

s0 available after 1 gate delays, c1 available after 2 gate delays.

s1

available after 3 gate delays, c2

available after 4 gate delays.



For an n-bit adder,sn-1is available after 2n-1 gate delayscnis available after2ngate delays.


11/79

Fast additionRecall the equations:

iiiiiii

iiii

cycxyxc

cyxs

1

Second equation can be written as:

iiiiii cyxyxc )(1

We can write:

iiiiii

iiii

yxPandyxGwhere

cPGc

1

Giis called generate function andPiis called propagate functionGi and Pi are computed only from xiand yi and not ci, thus they canbe computed in one gate delay after Xand Yare applied to the

inputs of an n-bit adder.


12/79

Carry lookaheadci1 Gi Pici

ci Gi1 Pi1ci1

ci1 Gi Pi(Gi1 Pi1ci1)

continuing

ci1 Gi Pi(Gi1 Pi1(Gi2 Pi2ci2 ))

until

ci1 Gi PiGi1 PiPi1Gi2 . . PiPi1. .P1G0 PiPi1. ..P0c0

All carries can be obtained 3 gate delays afterX, Yandc0are applied.

-One gate delay for Pi and Gi-Two gate delays in the AND-OR circuit for ci+1

All sums can be obtained 1 gate delay after the carries are computed.Independent of n, n-bit addition requires only 4 gate delays.This is called Carry Lookahead adder.


13/79

Carry-lookahead adder

Carry-lookahead logic

B cell B cell B cell B cell

s3

P3

G3

c3

P2

G2

c2

s2

G1

c1

P1

s1

G0

c0

P0

s0

.c

4

x1

y1

x3

y3

x2

y2

x0

y0

Gi

ci

..

.

Pi

si

xi

yi

B cell

4-bit

carry-lookahead

adder

B-cell for a single stage


14/79

Carry lookahead adder (contd..) Performing n-bit addition in 4 gate delays independent ofn

is good only theoretically because of fan-in constraints.

Last AND gate and OR gate require a fan-in of (n+1) for a n-bit adder. For a 4-bit adder (n=4) fan-in of 5 is required. Practical limit for most gates.

In order to add operands longer than 4 bits, we can cascade4-bit Carry-Lookahead adders. Cascade of Carry-Lookaheadadders is called Blocked Carry-Lookahead adder.

ci1 Gi PiGi1 PiPi1Gi2 ..PiPi1..P1G0 PiPi1...P0c0


15/79

4-bit carry-lookahead Adder


16/79

Blocked Carry-Lookahead adder

c4 G3 P3G2 P3P2G1 P3P2P1G0 P3P2P1P0c0

Carry-out from a 4-bit block can be given as:

Rewrite this as:

P0I

P3P2P1P0

G0IG3 P3G2 P3P2G1 P3P2P1G0

Subscript I denotes the blocked carry lookahead and identifies the block.

Cascade 4 4-bit adders, c16can be expressed as:

c16 G3I P3

IG2

I P3

IP2

IG1

IP3

IP2

IP1

0G0

IP3

IP2

IP1

0P0

0c0


17/79

Blocked Carry-Lookahead adder

Carry-lookahead logic

4-bit adder 4-bit adder 4-bit adder 4-bit adder

s15-12

P3

IG3

I

c12

P2

IG2

I

c8

s11-8

G1

I

c4

P1

I

s7-4

G0

I

c0

P0

I

s3-0

c16

x15-12

y15-12

x11-8

y11-8

x7-4

y7-4

x3-0

y3-0

.

Afterxi, yi and c0are applied as inputs:- Gi and Pi for each stage are available after 1 gate delay.- PI is available after 2 and GI after 3 gate delays.- All carries are available after 5 gate delays.- c16 is available after 5 gate delays.- s15 which depends on c12 is available after 8 (5+3)gate delays

(Recall that for a 4-bit carry lookahead adder, the last sum bit isavailable 3 gate delays after all inputs are available)


18/79


19/79

Multiplication of unsigned numbers

Product of 2 n-bit numbers is at most a 2n-bit number.

Unsigned multiplication can be viewed as addition of shifted

versions of the multiplicand.


20/79

Multiplication of unsigned

numbers (contd..) We added the partial products at end.

Alternative would be to add the partial products at each stage.

Rules to implement multiplication are:

If the ith bit of the multiplier is 1, shift the multiplicand and add the

shifted multiplicand to the current value of the partial product.

Hand over the partial product to the next stage

Value of the partial product at the start stage is 0.


21/79

Multiplication of unsigned numbers

ith multiplier bit

carry incarry out

jth multiplicand bit

ith multiplier bit

Bit of incoming partial product (PPi)

Bit of outgoing partial product (PP(i+1))

FA

Typical multiplication cell


22/79

Combinatorial array multiplier

Multiplicand

m3 m2 m1 m00 0 0 0

q3

q2

q1

q00

p2

p1

p0

0

0

0

p3p4p5p6p7

PP1

PP2

PP3

(PP0)

,

Product is: p7,p6,..p0

Multiplicand is shifted by displacing it through an array of adders.



23/79


(contd..) Combinatorial array multipliers are:

Extremely inefficient.

Have a high gate count for multiplying numbers of practical size such as 32-

bit or 64-bit numbers. Perform only one function, namely, unsigned integer product.

Improve gate efficiency by using a mixture of

combinatorial array techniques and sequential

techniques requiring less combinational logic.


24/79

Sequential multiplication Recall the rule for generating partial products:

If the ith bit of the multiplier is 1, add the appropriately shifted multiplicand

to the current partial product.

Multiplicand has been shifted left when added to the partial product.

However, adding a left-shifted multiplicand to an

unshifted partial product is equivalent to adding an

unshifted multiplicand to a right-shifted partial

product.


25/79

Sequential Circuit Multiplier

qn 1-

mn 1-

n-bit

Adder

Multiplicand M

Controlsequencer

Multiplier Q

0

C

Shift right

Register A (initially 0)

Add/Noaddcontrol

an 1-

a0

q0

m0

0

MUX


26/79

Sequential multiplication (contd..)

1 1 1 1

1 0 1 1

1 1 1 1

1 1 1 0

1 1 1 0

1 1 0 1

1 1 0 1

Initial configuration

Add

M1 1 0 1

C

First cycle

Second cycle

Third cycle

Fourth cycle

No add

Shift

ShiftAdd

Shift

ShiftAdd

1 1 1 1

0

0

0

1

0

0

0

1

0

0 0 0 0

0 1 1 0

1 1 0 1

0 0 1 1

1 0 0 1

0 1 0 0

0 0 0 1

1 0 0 0

1 0 0 1

1 0 1 1

QA

Product


27/79


28/79

Signed Multiplication Considering 2s-complement signed operands, what will happen to (-

13)(+11) if following the same method of unsigned multiplication?

Sign extension of negative multiplicand.

1

0

11 11 1 1 0 0 1 1

110

110

1

0

1000111011

000000

1

1

0

0

1

1

1

00000000

110011111

13-( )

143-( )

11+( )

Sign extension isshown in blue


29/79

Signed Multiplication For a negative multiplier, a straightforward solution is

to form the 2s-complement of both the multiplier andthe multiplicand and proceed as in the case of a

positive multiplier. This is possible because complementation of both

operands does not change the value or the sign of theproduct.

A technique that works equally well for both negativeand positive multipliers Booth algorithm.


30/79

Signed MultiplicationHardware Implementation


31/79

Booth Algorithm Consider in a multiplication, the multiplier is positive0011110, how many appropriately shifted versions ofthe multiplicand are added in a standard procedure?

0

0 0

1 0 1 1 0 1

0

0 0 0 0 0 0

1

0

011010

1011010

1011010

1011010

0000000

000000

011000101010

0

00

1+ 1+ 1+ 1+


32/79

Booth Algorithm Since 0011110 = 0100000 0000010, if we use the

expression to the right, what will happen?

0

1

0 1 0 1 1 10000

00000000000000

1 1 1 1 1 1 1 0 1 0 0 1

00

0

0 0 0 1 0 1 1 0 1

0 0 0 0 0 0 0 0

0110001001000 1

2's complement of

the multiplicand

00

0

0

1+ 1-

0

0

0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0


33/79

Booth Algorithm In general, in the Booth scheme, -1 times the shifted multiplicand isselected when moving from 0 to 1, and +1 times the shiftedmultiplicand is selected when moving from 1 to 0, as the multiplieris scanned from right to left.

Booth recoding of a multiplier.

001101011100110100

00000000 1+ 1-1-1+1-1+1-1+1-1+


34/79

Booth Algorithm

Booth multiplication with a negative multiplier.

01

0

1 1 1 1 0 1 1

0 0 0 0 0 0 0 0 0

00

0110

0 0 0 0 1 1 0

1100111

0 0 0 0 0 0

01000 11111

1

10 1 1 0 1

1 1 0 1 0 6-( )

13+( )

X

78-( )

+11- 1-


35/79

Booth AlgorithmMultiplier

Biti Biti 1-

Version of multiplicandselected by biti

0

1

0

0

01

1 1

0 M

1+ M

1 M

0 M

Booth multiplier recoding table.

X

X

X

X


36/79

Booth Algorithm Best case a long string of 1s (skipping over 1s)

Worst case 0s and 1s are alternating

1

0

1110000111110000

001111011010001

1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

0

000000000000

00000000

1- 1- 1- 1- 1- 1- 1- 1-

1- 1- 1- 1-

1-1-

1+ 1+ 1+ 1+ 1+ 1+ 1+ 1+

1+

1+1+1+

1+

Worst-case

multiplier

Ordinarymultiplier

Goodmultiplier


37/79

Booth AlgorithmFlow Chart

Table


38/79


39/79

Bit-Pair Recoding of Multipliers Bit-pair recoding halves the maximum number ofsummands (versions of the multiplicand).

1+1

(a) Example of bit-pair recoding derived from Booth recoding

0

000

1

1

0

1

0

Implied 0 to right of LSB

1

0

Sign extension

1

21


40/79

Bit-Pair Recoding of Multipliersi 1+ i 1

(b) Table of multiplicand selection decisions

selected at position i

MultiplicandMultiplier bit-pair

i

0

0

1

1

1

0

1

0

1

1

1

1

0

0

0

1

1

0

0

1

0

0

1

Multiplier bit on the right

0 0 X M

1+

1

1+

0

1

2

2+

X M

X M

X M

X M

X M

X M

X M


41/79

Bit-Pair Recoding of Multipliers

41

1-

0000

1 1 1 1 1 0

0 0 0 0 11

1 1 1 1 10 0

0 0 0 0 0 0

0000 111111

0 1 1 0 1

0

1 010011111

1 1 1 1 0 0 1 1

0 0 0 0 0 0

1 1 1 0 1 1 0 0 1 0

0

1

0 0

1 0

1

0 0

0

0 1

0

0 1

10

0

0100

1

1

0

1

11

1-

6-( )13

+

(

)

1+

78-( )

1- 2-

Figure 6.15. Multiplication requiring only n/2 summands.


42/79

Carry-Save Addition of Summands CSA speeds up the addition process.

42P7 P6 P5 P4 P3 P2 P1 P0


43/79

Carry-Save Addition of Summands(Cont.,)

P3 P2 P1 P0P5 P4P7 P6


44/79

Carry-Save Addition of Summands(Cont.,)

Consider the addition of many summands, we can: Group the summands in threes and perform carry-save addition on

each of these groups in parallel to generate a set of S and C vectors inone full-adder delay

Group all of the S and C vectors into threes, and perform carry-saveaddition on them, generating a further set of S and C vectors in onemore full-adder delay

Continue with this process until there are only two vectors remaining

They can be added in a RCA or CLA to produce the desired product


45/79

Carry-Save Addition of Summands

Figure 6.17. A multiplication example used to illustrate carry-save addition as shown in Figure 6.18.

100 1 11

100 1 11

100 1 11

11111 1

100 1 11 M

Q

A

B

C

D

E

F

(2,835)

X

(45)

(63)

100 1 11

100 1 11

100 1 11

000 1 11 111 0 00 Product

100 1 11 M

Q


46/79

Figure 6.18. The multiplication example from Figure 6.17 performed usingcarry-save addition.

00000101 0 10

10010000 1 11 1

+

1000011 1

10010111 0 10 1

0110 1 10 0

00011010 0 00

10001011 1 0 1

110001 1 0

00111100

00110 1 10

11001 0 01

100 1 11

100 1 11

100 1 11

00110 1 10

11001 0 01

100 1 11

100 1 11

100 1 11

11111 1 Q

A

B

C

S1

C1

D

E

F

S2

C2

S1

C1

S2

S3

C3

C2

S4

C4

Product

x


47/79


48/79

Manual Division

Longhand division examples.

1101

1

13

14

26

21

274 100010010

10101

1101

1

1110

1101

10000

13 1101


49/79

Longhand Division Steps Position the divisor appropriately with respect to the

dividend and performs a subtraction.

If the remainder is zero or positive, a quotient bit of 1

is determined, the remainder is extended by another

bit of the dividend, the divisor is repositioned, and

another subtraction is performed.

If the remainder is negative, a quotient bit of 0 is

determined, the dividend is restored by adding back

the divisor, and the divisor is repositioned for another

subtraction.


50/79

Circuit Arrangement

Figure 6.21. Circuit arrangement for binary division.

qn-1

Divisor M

Control

Sequencer

Dividend Q

Shift left

N+1 bit

adder

q0

Add/Subtract

Quotient

SettingA

m00 mn-1

a0an an-1


51/79

Binary DivisionHardware Implementation


52/79

Restoring Division Shift A and Q left one binary position

Subtract M from A, and place the answer back in A

If the sign of A is 1, set q0 to 0 and add M back to A(restore A); otherwise, set q0 to 1

Repeat these steps n times


53/79

Examples

10111

Figure 6.22. A restoring-division example.

1 1 1 1 1

01111

0

0

0

1

000

0

0

00

0

0

0

00

0

1

00

0

0

0

101

11

11

01

0001

SubtractShift

Restore

1 00001 0000

1 1

Initially

Subtract

Shift10111

100001100000000

SubtractShift

Restore

101110100010000

1 1

QuotientRemainder

Shift

10111

1 0000

Subtract

Second cycle

First cycle

Third cycle

Fourth cycle

00

0

0

00

1

0

1

10000

1 11 0000

11111Restore

q0Set

q0Set

q0Set

q0Set


54/79

Restoring DivisionFlow Chart


55/79

Nonrestoring DivisionAvoid the need for restoring A after an

unsuccessful subtraction.

Any idea? Step 1: (Repeatn times) If the sign of A is 0, shift A and Q left one bit position and

subtract M from A; otherwise, shift A and Q left and add M

to A.Now, if the sign of A is 0, set q0 to 1; otherwise, set q0 to 0.

Step2: If the sign of A is 1, add M to A


56/79

Examples

A nonrestoring-division example.

Add

Restore

remainderRemainder

0

0

0

0

1

1 1 1 1 10 0 0 1 1

1

Quotient

0 0 1 01 1 1 1 1

0 0 0 01 1 1 1 1

Shift 0 0 0

11000

01111

Add

0 0 0 1 10 0 0 0 1 0 0 0

1 1 1 0 1

Shift

Subtract

Initially 0 0 0 0 0 1 0 0 0

1 1 1 0 0000

1 1 1 0 0

0 0 0 1 1

0 0 0Shift

Add

0 0 10 0 0 01

1 1 1 0 1

Shift

Subtract

0 0 0 110000

Fourth cycle

Third cycle

Second cycle

First cycle

q0Set

q0

Set

q0

Set

q0

Set


57/79

Nonrestoring DivisionFlow Chart


58/79


59/79

FractionsIfbis a binary vector, then we have seen that it can be interpreted asan unsigned integer by:

V(b) = b31.231 + b30.2

30 + bn-3.229 + .... + b1.2

1 + b0.20

This vector has an implicit binary point to its immediate right:

b31b30b29....................b1b0. implicit binary point

Suppose if the binary vector is interpreted with the implicit binary point isjust left of the sign bit:

implicit binary point .b31b30b29....................b1b0

The value ofbis then given by:

V(b) = b31.2-1 + b30.2

-2 + b29.2-3 + .... + b1.2

-31 + b0.2-32


60/79

Range of fractionsThe value of the unsigned binary fraction is:

V(b) = b31.2-1 + b30.2

-2 + b29.2-3 + .... + b1.2

-31 + b0.2-32

The range of the numbers represented in this format is:

In general for a n-bit binary fraction (a number with an assumed binary

point at the immediate left of the vector), then the range of values is:

9999999998.021)(0 32 bV

n

bV

21)(0


61/79

Scientific notationPrevious representations have a fixed point. Either the point is to the immediate

right or it is to the immediate left. This is called Fixed point representation.Fixed point representation suffers from a drawback that the representation can

only represent a finite range (and quite small) range of numbers.

A more convenient representation is the scientific representation, where

the numbers are represented in the form:

x m1.m2m3m4 be

Components of these numbers are:

Mantissa (m), implied base (b), and exponent (e)


62/79

Significant digits

A number such as the following is said to have 7 significant digits

x 0.m1m

2m

3m

4m

5m

6m

7 b

e

Fractions in the range 0.0 to 0.9999999 need about 24 bits of precision

(in binary). For example the binary fraction with 24 1s:

111111111111111111111111 = 0.9999999404

Not every real number between 0 and 0.9999999404 can be represented

by a 24-bit fractional number.

The smallest non-zero number that can be represented is:

000000000000000000000001 = 5.96046 x 10-8

Every other non-zero number is constructed in increments of this value.


63/79

Sign and exponent digitsIn a 32-bit number, suppose we allocate 24 bits to represent a fractional

mantissa.Assume that the mantissa is represented in sign and magnitude format,

and we have allocated one bit to represent the sign.

We allocate 7 bits to represent the exponent, and assume that the

exponent is represented as a 2s complement integer.

There are no bits allocated to represent the base, we assume that the

base is implied for now, that is the base is 2.Since a 7-bit 2s complement number can represent values in the range

-64 to 63, the range of numbers that can be represented is:

0.0000001 x 2-64 < = | x |


64/79

A sample representation

Sign Exponent Fractional mantissa

bit

1 7 24

24-bit mantissa with an implied binary point to the immediate left

7-bit exponent in 2s complement form, and implied base is 2.

Normalization


65/79

Normalization

If the number is to be represented using only 7 significant mantissa digits,

the representation ignoring rounding is:

Consider the number: x = 0.0004056781 x 1012

x = 0.0004056 x 1012

If the number is shifted so that as many significant digits are brought into

7 available slots:

x = 0.4056781 x 109

= 0.0004056 x 1012

Exponent ofx was decreased by 1 for every left shift ofx.

A number which is brought into a form so that all of the available mantissa

digits are optimally used (this is different from all occupied which may

not hold), is called a normalized number.

Same methodology holds in the case of binary mantissas

0001101000(10110) x 28 = 1101000101(10) x 25


66/79

Normalization (contd..)A floating point number is in normalized form if the most significant

1 in the mantissa is in the most significant bit of the mantissa.All normalized floating point numbers in this system will be of the form:

0.1xxxxx.......xx

Range of numbers representable in this system, if every number must benormalized is:

0.5 x 2-64


67/79

Normalization, overflow and underflowThe procedure for normalizing a floating point number is:

Do (until MSB of mantissa = = 1)Shift the mantissa left (or right)

Decrement (increment) the exponent by 1

end do

Applying the normalization procedure to: .000111001110....0010 x 2-62

gives: .111001110........ x 2-65

But we cannot represent an exponent of65, in trying to normalize the

number we have underflowed our representation.

Applying the normalization procedure to: 1.00111000............x 263

gives: 0.100111..............x 264

This overflows the representation.

Ch i th i li d b


68/79

Changing the implied base

So far we have assumed an implied base of 2, that is our floating point

numbers are of the form:x = m 2e

If we choose an implied base of 16, then:

x = m 16e

Then:

y = (m.16) .16e-1 (m.24) .16e-1 = m . 16e = x

Thus, every four left shifts of a binary mantissa results in a decrease of 1in a base 16 exponent.

Normalization in this case means shifting the mantissa until there is a 1 in

the first four bits of the mantissa.


69/79

Excess notationRather than representing an exponent in 2s complement form, it turns out to be

more beneficial to represent the exponent in excess notation.If 7 bits are allocated to the exponent, exponents can be represented in the range of

-64 to +63, that is:

-64


70/79

IEEE notationIEEE Floating Point notation is the standard representation in use. There are two

representations:- Single precision.- Double precision.

Both have an implied base of 2.

Single precision:

- 32 bits (23-bit mantissa, 8-bit exponent in excess-127 representation)

Double precision:- 64 bits (52-bit mantissa, 11-bit exponent in excess-1023 representation)

Fractional mantissa, with an implied binary point at immediate left.

Sign Exponent Mantissa

1 8 or 11 23 or 52


71/79

Peculiarities of IEEE notationFloating point numbers have to be represented in a normalized form to

maximize the use of available mantissa digits.In a base-2 representation, this implies that the MSB of the mantissa is

always equal to 1.

If every number is normalized, then the MSB of the mantissa is always 1.

We can do away without storing the MSB.

IEEE notation assumes that all numbers are normalized so that the MSB

of the mantissa is a 1 and does not store this bit.So the real MSB of a number in the IEEE notation is either a 0 or a 1.

The values of the numbers represented in the IEEE single precision

notation are of the form:

(+,-) 1.M x 2(E - 127)

The hidden 1 forms the integer part of the mantissa.Note that excess-127 and excess-1023 (not excess-128 or excess-1024) are used

to represent the exponent.

E fi ld


72/79

Exponent fieldIn the IEEE representation, the exponent is in excess-127 (excess-1023)

notation.

The actual exponents represented are:

-126


73/79

Floating point arithmeticAddition:

3.1415 x 108

+ 1.19 x 106

= 3.1415 x 108

+ 0.0119 x 108

= 3.1534 x 108

Multiplication:

3.1415 x 108 x 1.19 x 106 = (3.1415 x 1.19 ) x 10(8+6)

Division:3.1415 x 108 / 1.19 x 106 = (3.1415 / 1.19 ) x 10(8-6)

Biased exponent problem:

If a true exponent e is represented in excess-p notation, that is as e+p.

Then consider what happens under multiplication:

a. 10(x + p) * b. 10(y + p) = (a.b). 10(x + p + y +p) = (a.b). 10(x +y + 2p)

Representing the result in excess-p notation implies that the exponent

should bex+y+p. Instead it isx+y+2p.

Biases should be handled in floating point arithmetic.

/


74/79

Floating point arithmetic: ADD/SUB

rule Choose the number with the smaller exponent.

Shift its mantissa right until the exponents of both the

numbers are equal.

Add or subtract the mantissas.

Determine the sign of the result.

Normalize the result if necessary and truncate/round

to the number of mantissa bits.

Note: This does not consider the possibility of overflow/underflow.


75/79

Floating point arithmetic: MUL rule Add the exponents.

Subtract the bias.

Multiply the mantissas and determine the sign of the

result.

Normalize the result (if necessary).

Truncate/round the mantissa of the result.


76/79

Floating point arithmetic: DIV rule Subtract the exponents

Add the bias.

Divide the mantissas and determine the sign of the

result.

Normalize the result if necessary.

Truncate/round the mantissa of the result.

Note: Multiplication and division does not require alignment of themantissas the way addition and subtraction does.

Guard bits


77/79

Guard bitsWhile adding two floating point numbers with 24-bit mantissas, we shift

the mantissa of the number with the smaller exponent to the right untilthe two exponents are equalized.

This implies that mantissa bits may be lost during the right shift (that is,

bits of precision may be shifted out of the mantissa being shifted).

To prevent this, floating point operations are implemented by keeping

guard bits, that is, extra bits of precision at the least significant end

of the mantissa.The arithmetic on the mantissas is performed with these extra bits of

precision.

After an arithmetic operation, the guarded mantissas are:

- Normalized (if necessary)

- Converted back by a process called truncation/rounding to a 24-bit

mantissa.


78/79

Truncation/rounding Straight chopping: The guard bits (excess bits of precision) are dropped.

Von Neumann rounding: If the guard bits are all 0, they are dropped.

However, if any bit of the guard bit is a 1, then the LSB of the retained bit is

set to 1.

Rounding: If there is a 1 in the MSB of the guard bit then a 1 is added to the LSB of the

retained bits.


79/79

Rounding Rounding is evidently the most accurate truncationmethod.

However, Rounding requires an addition operation.

Rounding may require a renormalization, if the addition operation de-

normalizes the truncated number.

IEEE uses the rounding method.

0.111111100000 rounds to 0.111111 + 0.000001

=1.000000 which must be renormalized to 0.100000

unit2 arithmetics

Documents