IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 26, NO. 10, OCTOBER 1991

A 30-b Integrated Logarithmic Number System Processor

Lawrence K. Yu, Member, IEEE, and David M. Lewis, Member, IEEE

Abstract—This paper describes an integrated processor that performs addition and subtraction of 30-b numbers in the logarithmic number system (LNS). This processor offers 5-MOPS performance in 3-µm CMOS technology, and is implemented in a two-chip set comprising 170K transistors. A new algorithm for linear approximation using different-sized approximation intervals in each of a number of segments is used. A second technique, nonlinear compression, further reduces table space by storing the difference between the exact value of the function and a linear approximation. This allows the implementation of logarithmic arithmetic using much less ROM than previously required, making high-speed logarithmic arithmetic possible in area comparable to single-precision floating-point processors. Compared to previous techniques for logarithmic arithmetic, a factor of 285 reduction in table space is realized.

I. INTRODUCTION

CALCULATIONS requiring high precision and range can be performed with several different numeric representations, including floating-point (FP) or logarithmic number system (LNS) representations. FP representation is the most common number representation, while LNS representations have rarely been used. The scarcity of LNS processors is due to the difficulty of performing LNS addition and subtraction. While the LNS offers better accuracy than FP [1] and simple multiplication and division, addition and subtraction circuits have area that is exponential in numeric precision. Most applications require both additive and multiplicative operations, making LNS arithmetic impractical. As a result, the highest precision processor previously described offered only 12-b precision in a 3-µm I²L technology [2]. No algorithms for higher precision LNS arithmetic have previously been described, making LNS arithmetic impractical for most applications.

This paper describes a new algorithm and architecture for performing LNS addition and subtraction, and its prototype implementation in an integrated processor. This processor offers 5-MOPS performance using a 30-b number representation, and is implemented in a two-chip set in 3-µm CMOS. Although the prototype is slightly less accurate than a single-precision FP processor, due to the limited circuit area in 3-µm technology, it offers higher performance than FP arithmetic in the same technology. A denser technology would allow LNS arithmetic to offer better speed, accuracy, and performance than single-precision FP in the same technology.

Manuscript received January 17, 1991; revised May 13, 1991. This work was supported by the Natural Sciences and Engineering Research Council of Canada.
The authors are with the Department of Electrical Engineering, University of Toronto, Toronto, Ont., Canada M5S 1A4.
IEEE Log Number 9101852.

The central difficulty in implementing addition and subtraction operations in LNS is the need to approximate two nonlinear functions, which has typically been performed using lookup tables. In a straightforward implementation with F bits of fractional precision, roughly 2F × 2^F words are required [3]. For this reason, published implementations have been restricted to 8 to 12 b of fractional precision [2], [4].

Efficient approximation of a nonlinear function using small lookup tables suggests the use of a Taylor series approach. Linear approximation [5], quadratic approximation [6], and linear approximation with a nonlinear difference function in a PLA [7] have been used to advantage in the approximation of some functions, such as log(x). This is possible for log(x) because of its smooth nature over a small range. In contrast, one of the functions that must be approximated in LNS arithmetic has a singularity that makes straightforward Taylor series approximations difficult. A previous attempt [8] at using linear approximation in LNS arithmetic achieved better precision for addition only by using a modified linear approximation, but is limited to about 3.85 additional bits of precision.

This paper describes an integrated LNS arithmetic processor using a new method for linear approximation of the LNS arithmetic functions. Using 3-µm CMOS technology, the prototype offers 20-b precision, considerably greater than previous designs. Two techniques are used to increase the precision possible for a given amount of ROM. First, a new segmented technique for linear approximation is used to reduce the amount of table storage required to 561 kb, a factor of 127 reduction compared to the most efficient previous method [2]. A second table compression technique, linear approximation with difference coding, is used for further reduction, to 251 kb, a factor of 285 reduction.

The remainder of the paper is organized as follows. Section II introduces the LNS representation and the algorithms used. The chip optimization and design are described in Section III. Section IV compares the processor to floating-point implementations, and conclusions are presented in Section V.



II. NUMBER REPRESENTATION AND ALGORITHM DESIGN

The logarithmic number system represents a number x by its sign and logarithm in some base b, together with some distinct representation for zero. In this paper we will only consider b = 2, and use a distinct bit to represent zero for simplicity. Formally, the number x is represented by the triple (z_x, s_x, e_x), with z_x being a zero flag, s_x the sign, e_x = log2(|x|), and x = (1 − z_x) × (−1)^(s_x) × 2^(e_x). e_x is an N-bit fixed-point number with I integer bits and F fractional bits of precision, and N = I + F. An F-fractional-bit LNS representation has precision comparable to an (F + 0.53)-b precision FP representation. We also assume an excess-n representation of e_x, with n = 2^(I−1) − 1. The processor described in this paper uses a 30-b LNS format, with I = 8 and F = 20, and two extra bits for z_x and s_x. This is slightly inferior to single-precision FP, but is the most accurate format that could be implemented in the available process technology.
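To make the number format concrete, the following sketch encodes and decodes this 30-b LNS word. The field contents (zero flag, sign, and a 28-b excess-127 fixed-point e_x with I = 8 and F = 20, where n = 2^(I−1) − 1 = 127) follow the description above, but the specific bit ordering is an assumption made only for illustration.

```python
import math

I_BITS, F_BITS = 8, 20          # integer / fractional bits of e_x
N_BITS = I_BITS + F_BITS        # 28-b fixed-point exponent field
BIAS = 2 ** (I_BITS - 1) - 1    # excess-127 coding of e_x

def lns_encode(x: float) -> int:
    """Pack x into a 30-b word [zero | sign | e_x]; bit order is assumed."""
    if x == 0.0:
        return 1 << (N_BITS + 1)                 # distinct zero flag
    sign = 1 if x < 0 else 0
    e = math.log2(abs(x))                        # e_x = log2(|x|)
    fixed = round((e + BIAS) * (1 << F_BITS))    # excess-coded fixed point
    assert 0 <= fixed < (1 << N_BITS), "magnitude out of range"
    return (sign << N_BITS) | fixed

def lns_decode(word: int) -> float:
    """Recover x = (1 - z_x) * (-1)**s_x * 2**e_x."""
    if word >> (N_BITS + 1):                     # zero flag set
        return 0.0
    sign = (word >> N_BITS) & 1
    e = (word & ((1 << N_BITS) - 1)) / (1 << F_BITS) - BIAS
    return (-1.0) ** sign * 2.0 ** e

print(lns_decode(lns_encode(-3.75)))             # ~ -3.75, to within 2**-20
```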

The central algorithmic technique used for high-precision LNS addition and subtraction is a new method for segmented linear approximation, which is fully described in [9]. A brief summary and subsequent enhancements are presented here for exposition of the architecture.

Let a and b be two numbers represented in LNS format by (z_a, s_a, e_a) and (z_b, s_b, e_b), and without loss of generality assume e_a ≥ e_b. Ignoring signs, algorithms for addition and subtraction using auxiliary functions f_a and f_s can be derived as follows:

Addition: c = a + b

log(c) = log(a + b)
log(c) = log(a × (1 + b/a))
log(c) = log(a) + log(1 + b/a)
log(c) = log(a) + log(1 + 2^(log(b) − log(a)))
e_c = e_a + f_a(e_b − e_a), where f_a(r) = log(1 + 2^r).

Subtraction: c = a − b

e_c = e_a + f_s(e_b − e_a), where f_s(r) = log(1 − 2^r).

In this paper log(x) means the logarithm to base 2, and exp(x) means 2^x.
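The derivation translates directly into a table-free reference model. The sketch below evaluates f_a and f_s in floating point rather than from ROMs, so it only illustrates the data flow (operand swap so that e_a ≥ e_b, then one addition and one function evaluation); the hardware described later replaces the math calls with the segmented table lookups.

```python
import math

def lns_add_sub(ea: float, eb: float, subtract: bool = False) -> float:
    """Return e_c for c = a ± b, with a, b > 0 given by their base-2 logs."""
    if eb > ea:                        # enforce e_a >= e_b, as in the data path
        ea, eb = eb, ea
    r = eb - ea                        # r <= 0
    if subtract:
        f = math.log2(1.0 - 2.0 ** r)  # f_s(r); singular as r -> 0 (a == b)
    else:
        f = math.log2(1.0 + 2.0 ** r)  # f_a(r); well behaved near 0
    return ea + f

# 6 + 2 and 6 - 2: results should be log2(8) = 3 and log2(4) = 2
print(lns_add_sub(math.log2(6), math.log2(2)))              # ~3.0
print(lns_add_sub(math.log2(6), math.log2(2), True))        # ~2.0
```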

The central difficulty in LNS arithmetic is the implementation of the functions f_a(r) and f_s(r) for r < 0. Lookup tables are the most common method used in previous processors. A graph of these functions in Fig. 1 illustrates the singularity in f_s(r) that makes linear approximation difficult. While f_a(r) is well behaved near 0, f_s(r) → −∞ as r → 0. Both functions are smooth for more negative r. LNS arithmetic often exploits the essential zeros of f_a(r) and f_s(r), which are thresholds such that for r smaller than the essential zero, f_a(r) and f_s(r) are zero to within the accuracy of the representation. The essential zero has the value −(F + 2), at which both |f_a(r)| < 2^(−F−1) and |f_s(r)| < 2^(−F−1), so lookup tables are required only for −(F + 2) < r < 0.

Fig. 1. Plot of the f_a(r) and f_s(r) functions.

Previous implementations of LNS arithmetic used lookup tables with all bits of r as input, detecting essential zeros to eliminate the associated ROM space [1]. This leads to 2 × (F + 2) × 2^F words of lookup table.
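A quick numeric check of the essential-zero bound and of the resulting direct-lookup table size for the F = 20 format (a sketch; the 2 × (F + 2) × 2^F word count is taken directly from the text):

```python
import math

F = 20                                   # fractional bits of the format
r = -(F + 2)                             # essential-zero threshold
fa = math.log2(1 + 2.0 ** r)             # |f_a(r)| at the threshold
fs = -math.log2(1 - 2.0 ** r)            # |f_s(r)| at the threshold
print(fa < 2 ** (-F - 1), fs < 2 ** (-F - 1))   # both True: rounds to zero

words = 2 * (F + 2) * 2 ** F             # direct-lookup table size in words
print(f"{words:,} words")                # 46,137,344 words for F = 20
```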

The varying nonlinearity of the functions makes a straightforward linear approximation using equal intervals for approximation across the entire domain impractical. Instead, the algorithms described in this paper use a new segmented approach, with linear approximation across smaller regions where the function is more nonlinear. Furthermore, study of f_a(r) and f_s(r) will reveal properties of these functions that simplify linear approximation if logarithmic arithmetic is used in the approximation.

A. Linear Approximation of f_a and f_s

A linear approximation of some function f(x) in the neighborhood of x is defined by

f(x + Δx) = f(x) + (df(x)/dx) × Δx.   (1)

This formulation appears to require a ROM storing f(x) at some set of points, a ROM for df(x)/dx, and a multiplier, which is potentially expensive. Further consideration of f_a(r) and f_s(r) will show that the df(x)/dx ROM and the multiplier can be eliminated at the cost of a small ROM and additional logic. First, the multiplication can be eliminated by using logarithmic arithmetic, so (1) can be replaced by (2) if df(x)/dx > 0 and (3) if df(x)/dx < 0:

f(x + Δx) = f(x) + sgn(Δx) × exp(log(|Δx|) + log(df(x)/dx))   (2)
f(x + Δx) = f(x) − sgn(Δx) × exp(log(|Δx|) + log(−df(x)/dx))   (3)

The function sgn(·) in (2) and (3) is the sign function. The function f_a(r) has a positive derivative and is approximated using (2), while f_s(r) always has a negative derivative and is approximated using (3). These calculations eliminate the multiplication, but appear to increase the complexity of the circuitry. This complexity can be eliminated by noticing that f_a(r) and f_s(r) have properties (4) and (5). As a result, the calculations can be performed using logarithmic arithmetic, as shown in (6) and (7).


TABLE I — SEGMENT CHOICE (sizes of approximation intervals and segments for f_a and f_s, to within a constant factor).

The multiplication and calculation of df(x)/dx have been eliminated, at the cost of adding lookup tables for log(·) and exp(·). These tables can be shared for both addition and subtraction, and will be seen to be small.

df_a(x)/dx = exp(x − f_a(x))   (4)
df_s(x)/dx = −exp(x − f_s(x))   (5)

f_a(x + Δx) = f_a(x) + sgn(Δx) × exp(log(|Δx|) + x − f_a(x))   (6)
f_s(x + Δx) = f_s(x) − sgn(Δx) × exp(log(|Δx|) + x − f_s(x))   (7)

Previous linear approximations have generally used positive Δx, extracted as an unsigned bit field from the input parameter. In contrast, Δx is a signed number in this paper. This doubles the range that can be used for linear approximation with some fixed maximum error, and halves the size of the lookup tables.
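As a sanity check on (6), the following sketch evaluates the multiplication-free update for f_a with both signs of Δx, using floating-point log and exp in place of the small tables (both base 2, as defined above):

```python
import math

def fa(x: float) -> float:
    return math.log2(1.0 + 2.0 ** x)           # f_a(x) = log(1 + 2^x)

def fa_linear(x: float, dx: float) -> float:
    """Equation (6): f_a(x+dx) ~ f_a(x) + sgn(dx)*exp(log|dx| + x - f_a(x))."""
    if dx == 0.0:
        return fa(x)
    term = 2.0 ** (math.log2(abs(dx)) + x - fa(x))   # exp/log are base 2
    return fa(x) + math.copysign(term, dx)

x = -2.0
for dx in (+0.01, -0.01):                       # signed delta doubles the range
    print(fa(x + dx), fa_linear(x, dx))         # error is O(dx**2)
```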

The remaining problem is how to choose x and Δx such that r = x + Δx. The error in linear approximation with some maximum value of |Δx|, called Δx_max, is (d^2 f(x)/dx^2) × Δx_max^2 / 2, so the choice of the points x and corresponding Δx_max must be made to meet the required error tolerance. For a given maximum error, the value of Δx_max is proportional to |(1/2) × d^2 f(x)/dx^2|^(−1/2).

This choice of Δx_max forms the basis of segmented linear approximation. The domain of x is divided into a number of segments, and a worst-case value of Δx_max is used in each segment. Within a segment, the values of f(x) are stored at a set of points 2 × Δx_max apart (the factor of 2 arises from the fact that Δx is signed). Thus, segments are chosen in a manner that makes it easy to compute the correct segment and Δx_max for the given segment, while not wasting excessive table space.

A simple technique for choosing Δx_max is to partition the binary representation of r into several parts, specifically r_i, r_h, r_l, and r_e, such that r = r_i + r_h + r_l + r_e. Also define r_f = r_i + r_h. The linear approximation will be performed with x = r_f, Δx = r_l. The value r_e is ignored, as it is chosen to be too small to affect the result. Although r_i is not directly used in this calculation, it is used later to select a segment.

The partitioning of r into bit fields can be described by two integers p_l and p_e, with p_e < p_l and p_l < 0. Using segmented linear approximation, p_e and p_l are functions of r, but are constant within each segment.

Fig. 2. Linear approximation intervals for F = 4.

The corresponding value of Δx_max is 2^(p_l). Given some p_l and p_e, the values of r_i, r_h, r_l, and r_e are defined by (8)–(11). The notation r_(m...n) means the value of the binary representation of bits m through n inclusive of r.

r_e = r_(p_e−1 ... −F)   (8)
r_l = r_(p_l ... p_e) − 2^(p_l)   (9)
r_h = r_(−1 ... p_l+1) + 2^(p_l)   (10)
r_i = r_(I−1 ... 0)   (11)

Using this partitioning, r_i is the integer part of r, r_h is a positive quantity with −1 − p_l bits being dependent upon r, and r_l is a signed quantity with p_l − p_e + 1 significant bits and |r_l| ≤ 2^(p_l). Finally, 0 ≤ r_e < 2^(p_e), and r_i ≤ r < r_i + 1.

Combining this partitioning of r with the approximations (6) and (7) leads to the formulas (12) and (13) as approximations for f_a(r) and f_s(r):

f_a(r) ≈ f_a(r_f) + sgn(r_l) × exp(log(|r_l|) + r_f − f_a(r_f))   (12)
f_s(r) ≈ f_s(r_f) − sgn(r_l) × exp(log(|r_l|) + r_f − f_s(r_f))   (13)
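The partitioning (8)–(11) and the approximation (12) combine into the following sketch. It models the data path in software: the f_a, log, and exp tables are stood in for by floating-point math, and the per-segment parameters p_l and p_e are passed in as illustrative arguments rather than derived from the segment-selection logic, so only the bit-field extraction and the multiplication-free correction are shown.

```python
import math

def log2_tbl(x): return math.log2(x)                 # stand-in for the log ROM
def exp2_tbl(x): return 2.0 ** x                     # stand-in for the exp ROM
def fa_tbl(x):  return math.log2(1.0 + 2.0 ** x)     # stored f_a values

def fa_approx(r: float, F: int = 20, pl: int = -6, pe: int = -12) -> float:
    """Approximate f_a(r), r < 0, using the partition (8)-(11) and (12).

    pl and pe are per-segment constants; the values here are assumptions
    (the hardware derives them from the segment-selection logic)."""
    ri = math.floor(r)                               # integer part, bits I-1..0
    frac = round((r - ri) * 2 ** F)                  # fraction bits -1..-F as int
    re = (frac & ((1 << (pe + F)) - 1)) * 2.0 ** -F               # bits pe-1..-F  (8)
    rl = ((frac >> (pe + F)) & ((1 << (pl - pe + 1)) - 1)) \
         * 2.0 ** pe - 2.0 ** pl                                  # bits pl..pe    (9)
    rh = (frac >> (pl + 1 + F)) * 2.0 ** (pl + 1) + 2.0 ** pl     # bits -1..pl+1 (10)
    rf = ri + rh                                                  # x = r_f       (11)
    assert abs((rf + rl + re) - r) < 2.0 ** -F       # the partition recovers r
    if rl == 0.0:
        return fa_tbl(rf)
    cor = exp2_tbl(log2_tbl(abs(rl)) + rf - fa_tbl(rf))           # no multiplier
    return fa_tbl(rf) + math.copysign(cor, rl)                    # equation (12)

r = -1.37
print(fa_tbl(r), fa_approx(r))                       # exact vs. approximation
```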

The choice of segments and Δx_max is made to meet error tolerance requirements as well as to result in a simple implementation. The values of p_l and p_e are chosen to meet the accuracy constraint that the error due to linear approximation should be smaller than half a least significant bit. The function f_a(r) is smooth everywhere, allowing a relatively simple choice of segments based on intervals [r_i, r_i + 1). The value of Δx_max increases with more negative r_i. For r < −1, f_s(r) allows a similar treatment, but the singularity at 0 requires a different approach for r ∈ [−1, 0). In this region the interval is divided into segments [−2^(−k), −2^(−k−1)). The segment size and Δx_max both decrease as r → 0. Table I shows the sizes of intervals and segments for the functions, to within a constant factor. The effect of segmented linear approximation for F = 4 is shown in Fig. 2. The crosses mark the points stored in lookup tables, and the lines represent the range of linear approximation about each point. The arithmetic performed by the remainder of the processor and the corresponding data paths are shown in Figs. 3 and 4, respectively.


Fig. 3. Addition algorithm (pipeline steps: generate p_e and p_l; partition r into r_i, r_h, r_l, r_e with r_f = r_i + r_h; fn = fa_tbl(r_f); lrl = log_tbl(|r_l|); lcor = lrl + r_f − fn; cor = exp_tbl(lcor); f = fn + cor if r_l > 0, f = fn − cor if r_l < 0; e_c = e_a + f).

Fig. 4. Data path for segmented linear approximation.

Each step in the algorithm and corresponding hardware block is labeled with an identifying number. The data path operates on two numbers x and y, and chooses a and b that meet the constraint e_a ≥ e_b. The dotted line is related to the circuit partitioning into two chips, and will be described later.

B. Range Reduction

The size of the log(·) and exp(·) lookup tables can be reduced by using algebraic identities. Consider the log ROM, which looks up log(x) in the range [0,1) using a fixed-point input representation,¹ and has the identity log(2^i × x) = i + log(x). This can be used to reduce table space by half. A priority encoder generates i corresponding to the most significant one in x; x is left shifted by i, resulting in a value in the range [0.5, 1), and is used to perform the table lookup. The table output has i subtracted, producing the result. A similar technique is used for exp(·).

¹ Log(0) is chosen to be a negative value of large enough magnitude to guarantee a correct result.

TABLE II — TABLE SIZES FOR LINEAR EXTRAPOLATION (columns: Table, Words, Width).

Fig. 5. Nonlinear function compression.

The total lookup table sizes after applying this optimization to the log and exp tables are shown in Table II. A total of 22K words with 574 464 b is required. Further reduction is required to fit in the available technology.
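Before turning to that further reduction, here is a sketch of the range reduction just described for the log table: normalization to [0.5, 1) by a priority encoder and shifter, lookup with the leading one implicit (which is what halves the table), then subtraction of i. The table width and entry values are illustrative assumptions, not the chip's actual log ROM contents.

```python
import math

B = 7                                               # assumed table index width
LOG_TBL = [math.log2((1 << B | k) / (1 << (B + 1)))  # log2(m) for m in [0.5, 1)
           for k in range(1 << B)]                  # leading 1 of m is implicit

def log2_reduced(x_fix: int, frac_bits: int = 20) -> float:
    """Approximate log2 of a fixed-point x in (0, 1) using the halved table."""
    assert 0 < x_fix < (1 << frac_bits)
    i = 0
    while not (x_fix >> (frac_bits - 1)) & 1:       # priority encode / left shift
        x_fix <<= 1                                 # until x lies in [0.5, 1)
        i += 1
    k = (x_fix >> (frac_bits - 1 - B)) & ((1 << B) - 1)   # bits after leading 1
    return LOG_TBL[k] - i                           # identity: log(x) = log(m) - i

print(log2_reduced(round(0.3 * 2 ** 20)), math.log2(0.3))  # roughly equal
```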

C. Nonlinear Table Compression

A second technique, called nonlinear table compression, is used to reduce the size of each of the lookup tables. Nonlinear compression uses the observation that a linear approximation provides a close, but inexact, approximation of the function. Since the approximation is close, a table storing the difference between the linear approximation and the exact function can use a few bits to represent the difference. This is expressed by (14) through (16). It is necessary to represent the value of f(x_b) and df(x_b)/dx for each possible value of x_b, and the value of f_d(x) for each possible value of x. Equation (15) corresponds to the hardware implementation shown in Fig. 5, so each ROM of Fig. 4 is replaced by an instance of the logic in Fig. 5. Given some f(x) with N_x bits of precision in x, nonlinear table compression

splits x into two parts: x_e, containing the N_e least significant bits of x used for linear approximation, and x_b, containing the N_b most significant bits used to look up the function value and derivative, with N_x = N_e + N_b. The value of x_b is used to address a ROM. Each word contains the value of f(x_b), the value of df(x_b)/dx, and 2^N_e − 1 correction words f_d(x) (note that f_d(x_b) = 0 and is not stored). The widths of these quantities are W_f, W_d, and W_c, respectively, with the total width of bits used for f_d(x) being W_fd = (2^N_e − 1) × W_c. A multiplexer uses x_e to select a field from the ROM, producing the value of f_d(x). A multiplier and two adders produce f(x).

In fact, the value stored for df(x_b)/dx is chosen to optimize the accuracy of the linear approximation and thus minimize the number of bits of ROM, and is not necessarily the exact value of df(x_b)/dx (although the result must still be exact). The total ROM width is W = W_f + W_d + W_fd, but the number of words is reduced by a factor of 2^N_e. As N_e increases, the number of words decreases, and W_fd and W increase. The value of N_e is chosen to minimize total ROM space.
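The following sketch shows the compression scheme in software form, standing in for (14)–(16): a word addressed by x_b holds f(x_b), a slope, and 2^N_e − 1 narrow correction entries, and the reconstruction adds slope × x_e plus the selected correction, as in Fig. 5. The field widths and the example function are illustrative assumptions, and the chip's tuning of the stored slope to minimize correction width is omitted here.

```python
import math

def build_table(f, n_x: int, n_e: int, w_c: int):
    """ROM words indexed by x_b: (f(x_b), slope, [2**n_e - 1 corrections])."""
    n_b, step = n_x - n_e, 2.0 ** -n_x
    words = []
    for b in range(1 << n_b):
        x_b = b << n_e                                   # base point
        f0 = f(x_b * step)
        slope = (f((x_b + (1 << n_e) - 1) * step) - f0) / ((1 << n_e) - 1)
        corr = [round((f((x_b + e) * step) - (f0 + slope * e)) * 2 ** w_c)
                for e in range(1, 1 << n_e)]             # f_d(x_b) = 0, not stored
        words.append((f0, slope, corr))
    return words

def lookup(words, x: int, n_e: int, w_c: int) -> float:
    """Reconstruct f(x) = f(x_b) + slope * x_e + f_d(x)."""
    x_b, x_e = x >> n_e, x & ((1 << n_e) - 1)
    f0, slope, corr = words[x_b]
    f_d = 0.0 if x_e == 0 else corr[x_e - 1] * 2.0 ** -w_c
    return f0 + slope * x_e + f_d

N_X, N_E, W_C = 10, 3, 7                                  # assumed parameters
tbl = build_table(lambda t: math.log2(1.0 + 2.0 ** (t - 1.0)), N_X, N_E, W_C)
x = 0b0110101101
print(lookup(tbl, x, N_E, W_C))                           # close to the exact value
```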


III. CHIP AREA OPTIMIZATION AND DESIGN

A chip set that implements LNS addition using the above algorithm has been implemented. The chip set is designed to demonstrate the feasibility of high-precision LNS processors, so only the arithmetic section of the processor has been designed. This is implemented as a pipelined processor, accepting two operands and delivering a result per clock cycle. The processor is deeply pipelined in order to maximize throughput, at the cost of slightly increasing latency. In a pipelined architecture, combinational circuits can be pipelined to use a cycle time as small as a gate delay plus a pipeline register, but an architecture that uses ROM's cannot easily be pipelined using a cycle shorter than a ROM access. A goal of 140 ns was set for ROM access time and cycle time of the chip.

The algorithms described above define an architecture for LNS arithmetic, but detailed design decisions must be made to optimize circuit area. There are two principal decisions, related to nonlinear compression and allocation of lookup tables to ROM's.

A. Nonlinear Compression Optimization

The nonlinear compression algorithm places no constraints on N_e. Because the number of words and the size of each word vary with N_e, it is important to choose an N_e that minimizes chip area. The first step in this optimization was to write a computer program that accepts the contents of a lookup table and evaluates the total table size for any given value of N_e. This can be used to determine the optimum value of N_e. This technique can

be applied to each segment of the f_a(r) and f_s(r) lookup tables, and to the log and exp tables as a whole. This optimization is relatively straightforward, and becomes complex only when the possible allocation of multiple segments or tables to a single ROM is considered.
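A sketch of that optimization program: given the exact contents of one lookup table, it evaluates the total storage for each candidate N_e (words × (W_f + W_d + (2^N_e − 1) × W_c), with W_c set here by the largest correction magnitude) and keeps the minimum. The width bookkeeping is a simplified assumption; the real program also co-optimizes the stored slope and models ROM overhead area.

```python
import math

def table_bits(values, n_e: int, w_f: int, w_d: int, err: float) -> int:
    """Total ROM bits for one table compressed with a given N_e."""
    n_words = math.ceil(len(values) / 2 ** n_e)
    worst = 0.0
    for w in range(n_words):
        seg = values[w * 2 ** n_e:(w + 1) * 2 ** n_e]
        slope = (seg[-1] - seg[0]) / max(len(seg) - 1, 1)
        worst = max(worst, *(abs(v - (seg[0] + slope * k))
                             for k, v in enumerate(seg)))
    w_c = max(1, math.ceil(math.log2(worst / err + 1)))   # correction width
    return n_words * (w_f + w_d + (2 ** n_e - 1) * w_c)

def best_ne(values, w_f=27, w_d=14, err=2.0 ** -21):
    return min(range(1, 6), key=lambda n_e: table_bits(values, n_e, w_f, w_d, err))

vals = [math.log2(1.0 + 2.0 ** (-k / 1024.0)) for k in range(4096)]  # sample table
print(best_ne(vals))                    # N_e minimizing total bits for this table
```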

TABLE III — COMPRESSED TABLE SIZES: (a) optimal values of W; (b) actual values of W (rows for the fa, fsb, fss1, fss2, log, and exp tables, listing N_e, W_f, W_d, W_fd, W, and words).

The log and exp tables are both accessed once per operation, and so require a distinct ROM for each table. The processor also requires one access to either f_a(r) or f_s(r). This potentially allows several segments of f_a(r) and f_s(r) to be implemented in a single ROM, if this will save space. This is only directly possible if the ROM is wide enough to accommodate the widest of each of the fields, since otherwise some data steering logic is required. As a first step, the segments for f_a(r) and f_s(r) were independently optimized and grouped into four collections that could fit into identical word widths. About 5% of the table space is not used, but the corresponding area is small compared to the cost of separate tables. These collections of segments and corresponding word widths are shown in Table III(a). Total table space is reduced to 234 290 b, a reduction of 57%.

It is desirable to perform further sharing by merging the various f_a(r) and f_s(r) tables into fewer ROM's. Although the merged ROM might contain more bits, it will eliminate some wiring and overhead circuits such as word-line drivers and sense amplifiers. Merging cannot be done efficiently using the tables of Table III(a) because of the large variations in word widths. However, the variation of ROM size with respect to N_e is relatively flat near the minimum. This allows ROM's to be merged by choosing a suboptimal N_e that results in comparable values of W_f, W_d, and W_fd for the different segments. While the value of N_e for a given collection of segments is suboptimal, the overall chip area is minimized.

We wrote computer programs to study the area trade-off between sharing tables and implementing each table separately. The programs use a detailed model of ROM area, including core area and overhead area for the technology available to us. This was used to explore the possibilities


TABLE IV
ROM SIZES

ROM     Words   Width   Total Bits
log     128     110     14 080
exp     256     142     36 352
Total                   249 131

(Entries for the remaining ROM's are not legible in this copy.)

Fig. 6. Clock timing.

of sharing tables, and used to find the optimal implementation of the tables. It was found that it was most economical to merge all sections of f_a and f_s into two ROM's, as shown in Table IV. A large ROM (the f-ROM) with W = 90 stores most of the data using the values W_f = 27, W_d = 14, and W_fd = 49, which can be interpreted as either three fields of 14 b or seven fields of 7 b. No data steering logic is required except for f_d(x). This can store any table entry except the fss2 segments, where 1 b of the W_f bits and 8 b of the W_d bits are stored in a separate 265 × 9 excess ROM. Although the values of N_e for each table are slightly suboptimal, the overall circuit area is minimized. This increases the number of bits of ROM to 245 731 b, as shown in Table III(b), a 5% increase, but overall ROM area is reduced by 28%, and wiring area is reduced as well.

The optimal ROM sizes presented above require complicated logic to map r into an f-ROM address. The complexity of the logic can be reduced by placing each table at an address in the ROM that simplifies the mapping function. This leaves "holes" in the address space of the ROM, but no additional area is consumed, since the corresponding words in the ROM are not implemented. The word-line decoder must decode only those addresses that are implemented. The resulting structure is effectively a PLA implemented with the area efficiency of a ROM, due to the density of the terms.

The use of a multilevel decoder tree requiring multiples of eight words in each segment, plus the need for reasonable aspect ratios, creates a further small increase in ROM space. A total of 251 kb of ROM is actually present in the chip.

An unfortunate side effect of the complicated algorithms used by this processor is the large delay through the processor. A total of ten pipeline stages is used in the present implementation. One of these stages is used for interchip communication, since multiplexing of the interchip signals is required. Four of these stages are required to perform range reduction operations for the log and exp lookup tables. In retrospect, the area saved by range reduction was small, so these stages should be eliminated at the expense of slightly greater area.

B. Chip Implementation

A two-chip set that performs LNS addition and subtraction has been implemented in 3-µm DLM p-well CMOS. The circuits were estimated as requiring a 10-mm × 10-mm die, but the multiproject silicon foundry service available to us had a maximum die size of 8.2 mm × 8.2 mm, necessitating partitioning the chip into a two-chip set. Partitioning the chip into a two-chip set increased area considerably due to the large number of pads required for communication. The total area of the processor is still well within the amount that can be fabricated on a single chip with acceptable yield, so this decision is purely due to the constraints of the chip manufacturer.

Two-phase clocking with the timing illustrated in Fig. 6 is used. Two-stage dynamic latches are used throughout, with all state changes on φ2, which is used for precharging dynamic circuits. The timing of φ1 is primarily dictated by ROM requirements. φ2 is only used for the first latch stage in pipeline registers. This constrains the processor to a single dynamic circuit per pipe stage, but does not cause any extra pipe stages.

The chip design uses simple static circuits wherever possible in order to keep design complexity low. A small number of commonly used cells, such as adders and multiplexers, were designed to keep layout time moderate. A fully static adder cell with block carry lookahead was designed. Variable-sized blocks are used to reduce carry propagation time, and the resulting circuit can perform a 30-b addition in 55 ns. Dynamic circuits were used only when a static circuit was too slow or too large. Dynamic circuits were used only for ROM's, shifters, PLA's, and pipeline registers. The shifters and PLA's are constructed using domino logic, precharged on φ2 and evaluated on φ1. The pipeline registers are dynamic latches.

The pipeline partitioning of the chip is chosen to make the ROM access time the only limit on clock cycle. The use of no more than one dynamic circuit per pipe stage sets a lower bound of five clock cycles of latency on the circuit. Although consecutive dynamic stages with no intervening logic can be placed in consecutive clock cycles, most of the dynamic circuits have some intervening static logic. The static logic requires the use of an extra pipeline stage if it cannot fit into the time slack of the preceding dynamic stage. A total of nine stages results from these constraints, and an extra stage to communicate between chips brings the total to ten stages.

The f-ROM is the limiting factor in processor speed, so a simple but reasonably fast design was used. The address is latched and predecoded during φ2 while the ROM precharges. The access is performed during φ1, and the output data are latched during φ2. The ROM design uses


Fig. 7. Photomicrographs of the processor chips.

an El-ROM core with straps to ground every 32 b and second-level-metal word-line straps every 64 b. Two-level decoders are used for fast word-line drive. A simple sense amplifier, a ratioed inverter with a high switching threshold, is used to provide reasonably fast sensing with good noise immunity. By choosing the inverter ratio appropriately, fast switching can be achieved. Since a ratioed inverter has a long pull-down time, an additional NMOS FET is used to precharge the data-out line low. Simulations predicted a 35-ns precharge and 90-ns access time for a 125-ns cycle. An additional 15-ns dead time is allowed for a total cycle of 140 ns.

The processor is partitioned into two chips, with the functions allocated to each chip shown in Fig. 4. Fig. 7 shows photomicrographs of the two chips, with labels corresponding to Fig. 4. Both chips have been fabricated and tested successfully. One circuit design error, a floating address line to a ROM, was found. This was not detected during simulations due to the limited number of vectors that could be simulated using the entire chip. At V_DD = 5.0 V and 25°C, chip 1 operated at a speed of 4.5 MHz, and chip 2 operated at 5.6 MHz. The lower than expected clock speed was the result of inadvertent omission of metal straps for the predecoded address lines. SPICE predicts a 175-ns cycle time for chip 1 with this omission, closer to the measurements. Our test setup limited control of the clock widths to multiples of 20 ns, so the actual maximum clock rates could be higher than these rates, and closer to the SPICE simulations.

IV. COMPARISON TO FLOATING-POINT PROCESSORS

The use of 3-µm technology makes the performance of the prototype uncompetitive with modern FP processors. The processor is not only slower than recent FP processors, but this technology only allows 30-b words with 20-b precision, which is less accurate than a single-precision FP processor. A 2-µm or higher density technology is required to implement 32-b words with 23-b precision (deleting the zero bit, and using a distinct value to represent zero), offering about 0.53 b more precision than FP.

As mentioned earlier, the LNS has not been widely used due to the difficulty of implementing addition and subtraction. The methods described here could make LNS arithmetic competitive with FP processors, and feasible for general-purpose computations. To compare LNS arithmetic to FP, we consider general-purpose computations, which require circuits for multiplication as well as addition. In the LNS, a multiplication or division is performed with addition or subtraction, so the silicon area required is insignificant, and the time required using our adder is 55 ns in 3-µm CMOS. Using improved ROM's, the processor presented here can achieve 140-ns addition with 1400-ns latency. An FP processor designed using similar technology [10] offers lower performance than the LNS processor, offering 1.1-µs addition, 1.6-µs multiplication, and 2.7-µs division.

LNS arithmetic should offer excellent performance in a higher density technology. The core area of the LNS chips totals about 100 mm², of which about 30% is ROM. Increasing the word size to 32 b to achieve better accuracy than a single-precision FP processor would increase ROM area to 74 mm², for a total of 144 mm². Scaling to 1.2-µm technology suggests an area of 23 mm², smaller than current FP processors. Eliminating the interchip communication and range reduction logic for the log and exp ROM's would eliminate five pipe stages while increasing area by about 5%. The predicted cycle time in 1.2-µm CMOS from SPICE simulations is 25 ns, with a latency of 125 ns. Multiplication and division could be accomplished with insignificant area and 25-ns latency. A recent FP processor [11] in 1.2-µm CMOS offers 25-ns cycle times for addition and multiplication, and can complete two operations every three cycles, but has a circuit area of 110 mm². This processor can perform either single- or double-precision FP arithmetic, and also integrates a register file and interface logic. Another single-precision FP processor in the same technology [12] has an area of 58 mm² and performs an addition and multiplication every 67 ns. An LNS processor in this technology would offer higher performance and smaller circuit area.


LNS processors using the algorithms described here cannot implement double-precision FP arithmetic in reasonable area due to the O(2^(F/2)) dependency of circuit area on precision. Implementation of high-precision LNS addition will require the development of different algorithms.

V. CONCLUSIONS

This paper has described the architecture of an integrated processor for 30-b LNS arithmetic. Two techniques are used to achieve this precision in moderate circuit area. Linear approximation of the LNS arithmetic functions using logarithmic arithmetic is shown to be simple due to the particular functions involved. A segmented approach to linear approximation minimizes the amount of table space required. Subsequent nonlinear compression of each lookup table leads to a further reduction in table size. The result is that a factor of 285 reduction in table size is achieved, compared to previous techniques.

The circuit area of the implementation is minimized by optimizing the table parameters, using a computer program that accurately models ROM area. The implementation is highly pipelined, and produces one result per clock cycle using a ten-stage pipeline. The architecture could be scaled using modern technology to higher precision and performance, as well as reduced latency. As a result, LNS arithmetic can offer higher speed and better accuracy than a single-precision FP processor in smaller circuit area.

ACKNOWLEDGMENT

Circuit fabrication was provided by the Canadian Microelectronics Corporation.

REFERENCES
[1] J. N. Mitchell, Jr., "Computer multiplication and division using binary logarithms," IRE Trans. Electron. Comput., vol. EC-11, pp. 512-517, Aug. 1962.
[2] F. J. Taylor, R. Gill, J. Joseph, and J. Radke, "A 20 bit logarithmic number system processor," IEEE Trans. Comput., vol. 37, pp. 190-200, Feb. 1988.
[3] E. E. Swartzlander and A. G. Alexopoulos, "The sign/logarithm number system," IEEE Trans. Comput., vol. C-24, pp. 1238-1242, Dec. 1975.
[4] J. H. Lang, C. A. Zukowski, R. O. LaMaire, and C. H. An, "Integrated-circuit logarithmic units," IEEE Trans. Comput., vol. C-34, pp. 475-483, May 1985.
[5] M. Combet, H. Van Zonneveld, and L. Verbeek, "Computation of the base two logarithm of binary numbers," IEEE Trans. Electron. Comput., vol. EC-14, pp. 863-867, Dec. 1965.
[6] D. Marino, "New algorithm for the approximate evaluation in hardware of binary logarithms and elementary functions," IEEE Trans. Comput., vol. C-21, pp. 1416-1421, Dec. 1972.
[7] H.-Y. Lo and Y. Aoki, "Generation of a precise binary logarithm with difference grouping programmable logic array," IEEE Trans. Comput., vol. C-34, pp. 681-691, Aug. 1985.
[8] F. J. Taylor, "An extended precision logarithmic number system," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-31, pp. 232-234, Jan. 1983.
[9] D. M. Lewis, "An architecture for addition and subtraction of long word length numbers in the logarithmic number system," IEEE Trans. Comput., vol. 39, pp. 1326-1336, Nov. 1990.
[10] G. Wolrich et al., "A high performance floating point coprocessor," IEEE J. Solid-State Circuits, vol. SC-19, pp. 690-696, Oct. 1984.
[11] K. Molnar et al., "A 40 MHz 64-b floating point co-processor," in ISSCC Dig. Tech. Papers, 1989, pp. 48-49.
[12] D. A. Staver et al., "A 30-MFLOPS CMOS single precision floating point multiply-accumulate chip," in ISSCC Dig. Tech. Papers, 1987, pp. 274-275.

Lawrence K. Yu (S'85-M'90) received the B.A.Sc. and M.A.Sc. degrees in electrical engineering from the University of Toronto, Toronto, Ont., Canada, in 1986 and 1990, respectively.
He is presently employed at the University of Toronto as a Research Associate on the Hubnet project. His research interests include computer arithmetic and VLSI systems design.

David M. Lewis (M'88) received the B.A.Sc. degree with honors in engineering science from the University of Toronto, Toronto, Ont., Canada, in 1977, and the Ph.D. degree in electrical engineering in 1985.
From 1982 to 1985 he was employed as a Research Associate on the Hubnet project, and developed custom integrated circuits for a 50-Mb/s local area network. He has been an Assistant Professor at the University of Toronto since 1985. His research interests include logic and circuit simulation, computer architecture for high-level languages, logarithmic arithmetic, and VLSI architecture.
Dr. Lewis is a member of the ACM.