8/8/2019 A 30-b Integrated Logarithmic Number System Processor -91
http://slidepdf.com/reader/full/a-30-b-integrated-logarithmic-number-system-processor-91 1/8
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 26, NO. 10, OCTOBER 1991 1433
A 30-b Integrated Logarithmic Number System Processor
Lawrence K. Yu, Member, IEEE, and David M. Lewis, Member, IEEE
Abstract - This paper describes an integrated processor that performs addition and subtraction of 30-b numbers in the logarithmic number system (LNS). This processor offers 5-MOPS performance in 3-μm CMOS technology, and is implemented in a two-chip set comprising 170K transistors. A new algorithm for linear approximation using different-sized approximation intervals in each of a number of segments is used. A second technique, nonlinear compression, further reduces table space by storing the difference between the exact value of the function and a linear approximation. This allows the implementation of logarithmic arithmetic using much less ROM than previously required, making high-speed logarithmic arithmetic possible in area comparable to single-precision floating-point processors. Compared to previous techniques for logarithmic arithmetic, a factor of 285 reduction in table space is realized.
I. INTRODUCTION
CALCULATIONS requiring high precision and range can be performed with
several different numeric representations, including floating-point (FP)
or logarithmic number system (LNS) representations. FP representation is
the most common number representation, while LNS representations have
rarely been used. The scarcity of LNS processors is due to the difficulty
of performing LNS addition and subtraction. While the LNS offers better
accuracy than FP [1] and simple multiplication and division, addition and
subtraction circuits have area that is exponential in numeric precision.
Most applications require both additive and multiplicative operations,
making LNS arithmetic impractical. As a result, the highest precision
processor previously described offered only 12-b precision in a 3-μm I²L
technology [2]. No algorithms for higher precision LNS arithmetic have
previously been described, making LNS arithmetic impractical for most
applications.
This paper describes a new algorithm and architecture for performing LNS
addition and subtraction, and its prototype implementation in an
integrated processor. This processor offers 5-MOPS performance using a
30-b number representation, and is implemented in a two-chip set in 3-μm
CMOS. Although the prototype is slightly less accurate than a
single-precision FP processor, due to the limited circuit area in 3-μm
technology, it offers higher performance than FP arithmetic in the same
technology. A denser technology would allow LNS arithmetic to offer
better speed, accuracy, and performance than single-precision FP in the
same technology.
Manuscript received January 17, 1991; revised May 13, 1991. This work was supported by the Natural Sciences and Engineering Research Council of Canada.
The authors are with the Department of Electrical Engineering, University of Toronto, Toronto, Ont., Canada M5S 1A4.
IEEE Log Number 9101852.
The central difficulty in implementing addition and subtraction
operations in LNS is the need to approximate two nonlinear functions,
which has typically been performed using lookup tables. In a
straightforward implementation with $F$ bits of fractional precision,
roughly $2F \times 2^F$ words are required [3]. For this reason, published
implementations have been restricted to 8 to 12 b of fractional
precision [2], [4].
Efficient approximation of a nonlinear function using small lookup tables
suggests the use of a Taylor series approach. Linear approximation [5],
quadratic approximation [6], and linear approximation with a nonlinear
difference function in a PLA [7] have been used to advantage in the
approximation of some functions, such as log(x). This is possible for
log(x) because of its smooth nature over a small range. In contrast, one
of the functions that must be approximated in LNS arithmetic has a
singularity that makes straightforward Taylor series approximations
difficult. A previous attempt [8] at using linear approximation in LNS
arithmetic has achieved better precision for addition only by using a
modified linear approximation, but is limited to about 3.85 additional
bits of precision.
This paper describes an integrated LNS arithmetic processor using a new
method for linear approximation of the LNS arithmetic functions. Using
3-μm CMOS technology, the prototype offers 20-b precision, considerably
greater than previous designs. Two techniques are used to increase the
precision possible for a given amount of ROM. First, a new segmented
technique for linear approximation is used to reduce the amount of table
storage required to 561 kb, a factor of 127 reduction compared to the most
efficient previous method [2]. A second table compression technique,
linear approximation with difference coding, is used for further
reduction, to 251 kb, a factor of 285 reduction.
The remainder of the paper is organized as follows. Section II introduces
the LNS representation and the algorithms used. The chip optimization and
design is described in Section III, followed by conclusions in Section IV.
0018-9200/91/1000-1433$01.00 © 1991 IEEE
II. NUMBER REPRESENTATION AND ALGORITHM DESIGN
The logarithmic number system represents a number $x$ by its sign and
logarithm in some base $b$, together with some distinct representation
for zero. In this paper we will only consider $b = 2$, and use a distinct
bit to represent zero for simplicity. Formally, the number $x$ is
represented by the triple $(z_x, s_x, e_x)$, with $z_x$ being a zero flag,
$s_x$ the sign, $e_x = \log_2(|x|)$, and
$x = (1 - z_x) \times (-1)^{s_x} \times 2^{e_x}$. $e_x$ is an $N$-bit
fixed-point number with $I$ integer bits and $F$ fractional bits of
precision, and $N = I + F$. An $F$ fractional bit LNS representation has
precision comparable to an $F + 0.53$-b precision FP representation. We
also assume an excess-$n$ representation of $e_x$, with
$n = 2^{I-1} - 1$. The processor described in this paper uses a 30-b LNS
format, with $I = 8$ and $F = 20$, and two extra bits for $z_x$ and $s_x$.
This is slightly inferior to single-precision FP, but is the most accurate
format that could be implemented in the available process technology.
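As an illustration of the format just described, the following minimal Python sketch encodes and decodes the triple $(z_x, s_x, e_x)$ in software. The function names `lns_encode` and `lns_decode` are hypothetical, the bias of $2^{I-1} - 1$ is an assumption about the excess-$n$ convention, and rounding to nearest is assumed; none of this is taken from the chip itself.

```python
import math

I_BITS = 8                      # integer bits of the exponent (from the text)
F_BITS = 20                     # fractional bits of the exponent (from the text)
BIAS = 2 ** (I_BITS - 1) - 1    # assumed excess-n bias, in units of 1.0

def lns_encode(x):
    """Return (z, s, e); e is a biased integer in units of 2**-F_BITS."""
    if x == 0.0:
        return (1, 0, 0)        # zero flag set, other fields unused
    s = 1 if x < 0 else 0
    e = round((math.log2(abs(x)) + BIAS) * 2 ** F_BITS)
    if not 0 <= e < 2 ** (I_BITS + F_BITS):
        raise OverflowError("exponent does not fit in I + F bits")
    return (0, s, e)

def lns_decode(z, s, e):
    """Inverse mapping: x = (1 - z) * (-1)**s * 2**e, after removing the bias."""
    return (1 - z) * (-1) ** s * 2.0 ** (e / 2 ** F_BITS - BIAS)
```

For example, `lns_decode(*lns_encode(-8.0))` recovers -8.0 exactly, while a value such as 3.14159 is recovered with a relative error below $2^{-20}$.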
The central algorithmic technique used for high-precision LNS addition
and subtraction is a new method for segmented linear approximation, which
is fully described in [9]. A brief summary and subsequent enhancements are
presented here for exposition of the architecture.
Let $a$ and $b$ be two numbers represented in LNS format by
$(z_a, s_a, e_a)$ and $(z_b, s_b, e_b)$, and without loss of generality
assume $e_a \ge e_b$. Ignoring signs, algorithms for addition and
subtraction using auxiliary functions $f_a$ and $f_s$ can be derived as
follows:
Addition: $c = a + b$
$\log(c) = \log(a + b)$
$\log(c) = \log(a \times (1 + b/a))$
$\log(c) = \log(a) + \log(1 + b/a)$
$\log(c) = \log(a) + \log(1 + 2^{\log(b) - \log(a)})$
$e_c = e_a + f_a(e_b - e_a)$, where $f_a(r) = \log(1 + 2^r)$.
Subtraction: $c = a - b$
$e_c = e_a + f_s(e_b - e_a)$, where $f_s(r) = \log(1 - 2^r)$.
In this paper $\log(x)$ means the logarithm to base 2, and $\exp(x)$
means $2^x$.
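The derivation above can be checked numerically with a short Python sketch. It evaluates $f_a$ and $f_s$ in floating point instead of using lookup tables, handles positive operands only, and uses the hypothetical helper names `lns_add` and `lns_sub`.

```python
import math

def f_a(r):
    """Addition function f_a(r) = log2(1 + 2**r), used for r <= 0."""
    return math.log2(1.0 + 2.0 ** r)

def f_s(r):
    """Subtraction function f_s(r) = log2(1 - 2**r), defined only for r < 0."""
    return math.log2(1.0 - 2.0 ** r)

def lns_add(e_a, e_b):
    """log2(a + b) for positive a, b, given e_a = log2(a) and e_b = log2(b)."""
    if e_a < e_b:                 # enforce e_a >= e_b, as assumed in the text
        e_a, e_b = e_b, e_a
    return e_a + f_a(e_b - e_a)

def lns_sub(e_a, e_b):
    """log2(a - b) for a > b > 0 (the caller must guarantee e_a > e_b)."""
    return e_a + f_s(e_b - e_a)
```

With $a = 6$ and $b = 2$, `2 ** lns_add(math.log2(6), math.log2(2))` gives 8 and `2 ** lns_sub(math.log2(6), math.log2(2))` gives 4, up to floating-point rounding.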
The central difficulty in LNS arithmetic is the implementation of the
functions $f_a(r)$ and $f_s(r)$, for $r < 0$. Lookup tables are the most
common method used in previous processors. A graph of these functions in
Fig. 1 illustrates the singularity in $f_s(r)$ that makes linear
approximation difficult. While $f_a(r)$ is well behaved near 0,
$f_s(r) \to -\infty$ as $r \to 0$.
Fig. 1. Plot of $f_a(r)$ and $f_s(r)$ functions.
Both functions are smooth for more negative $r$. LNS arithmetic often
exploits the essential zeros of $f_a(r)$ and $f_s(r)$, which are
thresholds such that for $r$ smaller than the essential zero, $f_a(r)$ and
$f_s(r)$ are zero to within the accuracy of the representation. The
essential zero has the value $-F - 2$, at which both
$|f_a(r)| < 2^{-F-1}$ and $|f_s(r)| < 2^{-F-1}$, so lookup tables are
required only for $-F - 2 < r < 0$. Previous implementations of LNS
arithmetic used lookup tables with all bits of $r$ as input, and detected
essential zeros to eliminate the associated ROM space [1]. This leads to
$2 \times (F + 2) \times 2^F$ words of lookup table.
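The essential-zero bound and the resulting table size can be verified with a few lines of Python. This is only a numerical check in floating point; the helper names are hypothetical, and `math.log1p` is used purely to keep the evaluation accurate for very small arguments.

```python
import math

def f_a(r):
    """f_a(r) = log2(1 + 2**r), evaluated accurately for very negative r."""
    return math.log1p(2.0 ** r) / math.log(2.0)

def f_s(r):
    """f_s(r) = log2(1 - 2**r), evaluated accurately for very negative r."""
    return math.log1p(-(2.0 ** r)) / math.log(2.0)

def essential_zero(F):
    """Threshold below which both functions round to zero at F fractional bits."""
    return -F - 2

def direct_table_words(F):
    """Two functions, (F + 2) integer steps of r, 2**F fractional values each."""
    return 2 * (F + 2) * 2 ** F
```

For $F = 20$ the direct method needs `direct_table_words(20)`, that is 46 137 344 words, which illustrates why earlier designs stopped at 8 to 12 fractional bits.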
The varying nonlinearity of the functions makes a straightforward linear
approximation using equal intervals for approximation across the entire
domain impractical. Instead, the algorithms described in this paper use a
new segmented approach, with linear approximation across smaller regions
where the function is more nonlinear. Furthermore, study of $f_a(r)$ and
$f_s(r)$ will reveal properties of these functions that simplify linear
approximation if logarithmic arithmetic is used in the approximation.
A. Linear Approximation of $f_a$ and $f_s$
A linear approximation of some function $f(x)$ in the neighborhood of $x$
is defined by
$f(x + \Delta x) = f(x) + \frac{df(x)}{dx} \times \Delta x$    (1)
This formulation appears to require a ROM storing $f(x)$ at some set of
points, a ROM for $df(x)/dx$, and a multiplier, which is potentially
expensive. Further consideration of $f_a(r)$ and $f_s(r)$ will show that
the $df(x)/dx$ ROM and the multiplier can be eliminated at the cost of a
small ROM and additional logic. First, the multiplication can be
eliminated by using logarithmic arithmetic, so (1) can be replaced by (2)
if $df(x)/dx > 0$ and (3) if $df(x)/dx < 0$:
$f(x + \Delta x) = f(x) + \mathrm{sgn}(\Delta x) \times \exp(\log(|\Delta x|) + \log(df(x)/dx))$    (2)
$f(x + \Delta x) = f(x) - \mathrm{sgn}(\Delta x) \times \exp(\log(|\Delta x|) + \log(-df(x)/dx))$    (3)
The function sgn(.) in (2) and (3) is the sign function. The function
$f_a(r)$ has a positive derivative and is approximated using (2), while
$f_s(r)$ always has a negative derivative and is approximated using (3).
These calculations eliminate the multiplication, but appear to increase
the complexity of the circuitry. This complexity can be eliminated by
noticing that $f_a(r)$ and $f_s(r)$ have properties (4) and (5). As a
result, the calculations can be performed using logarithmic arithmetic, as
shown in (6) and (7). The multiplication and calculation of $df(x)/dx$
have been eliminated, at the cost of adding lookup tables for log(.) and
exp(.). These tables can be shared for both addition and subtraction, and
will be seen to be small.
$\log(df_a(x)/dx) = x - f_a(x)$    (4)
$\log(-df_s(x)/dx) = x - f_s(x)$    (5)
$f_a(x + \Delta x) = f_a(x) + \mathrm{sgn}(\Delta x) \times \exp(\log(|\Delta x|) + x - f_a(x))$    (6)
$f_s(x + \Delta x) = f_s(x) - \mathrm{sgn}(\Delta x) \times \exp(\log(|\Delta x|) + x - f_s(x))$    (7)
TABLE I
SEGMENT CHOICE
Previous linear approximations have generally used positive $\Delta x$,
extracted as an unsigned bit field from the input parameter. In contrast,
$\Delta x$ is a signed number in this paper. This doubles the range that
can be used for linear approximation with some fixed maximum error, and
halves the size of the lookup tables.
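A small floating-point sketch may help make the multiplier-free form concrete. It assumes the standard derivatives $df_a/dr = 2^r/(1 + 2^r)$ and $df_s/dr = -2^r/(1 - 2^r)$, whose base-2 logarithms are $r - f_a(r)$ and $r - f_s(r)$. The function names are hypothetical, and the log and exp lookups are replaced by direct evaluation.

```python
import math

def f_a(r):
    return math.log2(1.0 + 2.0 ** r)

def f_s(r):
    return math.log2(1.0 - 2.0 ** r)

def sgn(v):
    """Sign function: -1, 0, or +1."""
    return (v > 0) - (v < 0)

def approx_f_a(x, dx):
    """Linear step for f_a: the slope is applied as an addition of logarithms."""
    if dx == 0:
        return f_a(x)
    return f_a(x) + sgn(dx) * 2.0 ** (math.log2(abs(dx)) + x - f_a(x))

def approx_f_s(x, dx):
    """Linear step for f_s: same form, with the correction subtracted."""
    if dx == 0:
        return f_s(x)
    return f_s(x) - sgn(dx) * 2.0 ** (math.log2(abs(dx)) + x - f_s(x))
```

Around $x = -3$ with $\Delta x = \pm 2^{-8}$, both approximations agree with the exact functions to within about $10^{-6}$, and the signed $\Delta x$ lets one stored point serve an interval on both sides of it.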
The remaining problem is how to choose $x$ and $\Delta x$ such that
$r = x + \Delta x$. The error in linear approximation with some maximum
value of $|\Delta x|$, called $\Delta x_{max}$, is
$(d^2 f(x)/dx^2) \times \Delta x_{max}^2 / 2$, so the choice of the points
$x$ and corresponding $\Delta x_{max}$ must be made to meet the required
error tolerance. For a given maximum error, the value of $\Delta x_{max}$
is proportional to $|1/2 \times d^2 f(x)/dx^2|^{-1/2}$.
This choice of $\Delta x_{max}$ forms the basis of segmented linear
approximation. The domain of $x$ is divided into a number of segments, and
a worst-case value of $\Delta x_{max}$ is used in each segment. Within a
segment, the values of $f(x)$ are stored at a set of points
$2 \times \Delta x_{max}$ apart (the factor of 2 arises from the fact that
$\Delta x$ is signed). Thus, segments are chosen in a manner that makes it
easy to compute the correct segment and $\Delta x_{max}$ for the given
segment, while not wasting excessive table space.
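The relation between the error tolerance and $\Delta x_{max}$ can be illustrated with the sketch below. It assumes the closed-form second derivatives $f_a''(r) = \ln 2 \cdot 2^r/(1 + 2^r)^2$ and $f_s''(r) = -\ln 2 \cdot 2^r/(1 - 2^r)^2$, a tolerance of half a least significant bit for $F = 20$, and a power-of-two interval $2^p$. The helper `max_p` is hypothetical, and the values it returns are not claimed to match those used on the chip, which must also budget for other error sources.

```python
import math

LN2 = math.log(2.0)

def d2_f_a(r):
    """Second derivative of f_a(r) = log2(1 + 2**r)."""
    t = 2.0 ** r
    return LN2 * t / (1.0 + t) ** 2

def d2_f_s(r):
    """Second derivative of f_s(r) = log2(1 - 2**r), for r < 0."""
    t = 2.0 ** r
    return -LN2 * t / (1.0 - t) ** 2

def max_p(second_derivative, eps):
    """Largest integer p such that (|f''| / 2) * (2**p)**2 <= eps."""
    dx_max = math.sqrt(2.0 * eps / abs(second_derivative))
    return math.floor(math.log2(dx_max))

F = 20
EPS = 2.0 ** -(F + 1)    # half a least significant bit
```

Under these assumptions `max_p(d2_f_a(0.0), EPS)` is -9 while `max_p(d2_f_a(-9.0), EPS)` is -6, showing the interval growing for more negative $r$; for $f_s$ the allowed interval shrinks rapidly as $r$ approaches the singularity at 0.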
A simple technique for choosing $\Delta x_{max}$ is to partition the
binary representation of $r$ into several parts, specifically, $r_i$,
$r_h$, $r_l$, and $r_e$, such that $r = r_i + r_h + r_l + r_e$. Also
define $r_t = r_i + r_h$. The linear approximation will be performed with
$x = r_t$, $\Delta x = r_l$. The value $r_e$ is ignored, as it is chosen
to be too small to affect the result. Although $r_i$ is not directly used
in this calculation, it is used later to select a segment.
The partitioning of $r$ into bit fields can be described by two integers
$p_l$ and $p_e$, $p_e < p_l$ and $p_l < 0$. Using segmented linear
approximation, $p_e$ and $p_l$ are functions of $r$, but are constant
within each segment. The corresponding value of $\Delta x_{max}$ is
$2^{p_l}$. Given some $p_l$ and $p_e$, the values of $r_i$, $r_h$, $r_l$,
and $r_e$ are defined by (8)-(11). The notation $r_m \ldots r_n$ means the
value of the binary representation of bits $m$ through $n$ inclusive of
$r$.
$r_e = r_{p_e - 1} \ldots r_{-F}$    (8)
$r_l = r_{p_l} \ldots r_{p_e} - 2^{p_l}$    (9)
$r_h = r_{-1} \ldots r_{p_l + 1} + 2^{p_l}$    (10)
$r_i = r_{I-1} \ldots r_0$    (11)
Fig. 2. Linear approximation intervals for $F = 4$.
Using this partitioning, $r_i$ is the integer part of $r$, $r_h$ is a
positive quantity, with $-1 - p_l$ bits being dependent upon $r$, and
$r_l$ is a signed quantity with $p_l - p_e + 1$ significant bits, and
$|r_l| < 2^{p_l}$. Finally, $0 \le r_e < 2^{p_e}$, and
$r_i < r_t < r_i + 1$.
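The bit-field partition can be sketched in Python on an integer $R = r \times 2^F$. The function name `partition` and the use of Python's floor-shifting for negative values (which matches a two's complement view of $r$) are assumptions made for this illustration; the hardware presumably extracts the same fields with wiring and a few adders.

```python
def partition(R, F, p_l, p_e):
    """Split R = r * 2**F into (r_i, r_h, r_l, r_e), all in units of 2**-F.

    Assumes -F <= p_e < p_l < 0.
    """
    assert -F <= p_e < p_l < 0
    r_i = (R >> F) << F                  # integer part; >> floors negatives
    frac = R - r_i                       # fractional bits -1 .. -F, 0 <= frac < 2**F
    r_e = frac & ((1 << (F + p_e)) - 1)  # bits p_e-1 .. -F (ignored later)
    low = frac & ((1 << (F + p_l + 1)) - 1)   # bits p_l .. -F
    half = 1 << (F + p_l)                # the constant 2**p_l
    r_l = low - r_e - half               # bits p_l .. p_e, recentred to be signed
    r_h = frac - low + half              # bits -1 .. p_l+1, plus 2**p_l
    return r_i, r_h, r_l, r_e
```

For example, with $F = 8$, $p_l = -3$, $p_e = -6$, and $R = -297$, the call returns `(-512, 224, -12, 3)`; the four parts sum to $R$, and the magnitude of $r_l$ is bounded by $2^{p_l}$ (32 in these units).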
Combining this partitioning of $r$ with the approximations (6) and (7)
leads to the formulas (12) and (13) as approximations for $f_a(r)$ and
$f_s(r)$:
$f_a(r) = f_a(r_t) + \mathrm{sgn}(r_l) \times \exp(\log(|r_l|) + r_t - f_a(r_t))$    (12)
$f_s(r) = f_s(r_t) - \mathrm{sgn}(r_l) \times \exp(\log(|r_l|) + r_t - f_s(r_t))$    (13)
The choice of segments and $\Delta x_{max}$ is made to meet error
tolerance requirements as well as result in a simple implementation. The
values of $p_l$ and $p_e$ are chosen to meet the accuracy constraint that
the error due to linear approximation should be smaller than half a least
significant bit. The function $f_a(r)$ is smooth everywhere, allowing a
relatively simple choice of segments based on intervals $[r_i, r_i + 1)$.
The value of $\Delta x_{max}$ increases with more negative $r_i$. For
$r < -1$, $f_s(r)$ allows a similar treatment, but the singularity at 0
requires a different approach for $r \in [-1, 0)$. In this region the
interval is divided into segments $[-2^{-j}, -2^{-j-1})$. The segment size
and $\Delta x_{max}$ both decrease as $r \to 0$. Table I shows the sizes
of intervals and segments for the functions, to within a constant factor.
The effect of segmented linear approximation for $F = 4$ is shown in
Fig. 2. The crosses mark the points stored in lookup tables, and the lines
represent the range of linear approximation about each point. The
arithmetic performed by the remainder of the processor and the
corresponding data paths are shown in Figs. 3 and 4, respectively. Each
step in the algorithm and corresponding hardware block is labeled with an
identifying number. The data path operates on two numbers $x$ and $y$, and
chooses $a$ and $b$ that meet the constraint $e_a \ge e_b$. The dotted
line is related to the circuit partitioning into two chips, and will be
described later.
Fig. 3. Addition algorithm. Recoverable steps:
    generate p_e, p_l
    partition r into r_i, r_h, r_l; r_t = r_i + r_h
    frt = f_tbl(r_t)
    lrl = log_tbl(|r_l|)
    fd = r_t - frt
    lcor = lrl + fd
    cor = exp_tbl(lcor)
    f = frt + cor if r_l > 0; f = frt - cor if r_l < 0
    e_c = e_a + f
Fig. 4. Data path for segmented linear approximation.
B. Range Reduction
The size of the log(.) and exp(.) lookup tables can be reduced by using
algebraic identities. Consider the log ROM, which looks up $\log(x)$ in
the range [0, 1) using a fixed-point input representation,¹ and has the
identity $\log(2^i \times x) = i + \log(x)$. This can be used to reduce
table space by half. A priority encoder generates $i$ corresponding to the
most significant one in $x$; $x$ is left shifted by $i$, resulting in a
value in the range [0.5, 1), and is used to perform the table lookup. The
table output has $i$ subtracted, producing the result. A similar technique
is used for exp(.).
¹Log(0) is chosen to be a negative value of large enough magnitude to guarantee a correct result.
TABLE II
TABLE SIZES FOR LINEAR EXTRAPOLATION
Table | Words | Width
exp   | 4096  | 23
Fig. 5. Nonlinear function compression.
The total lookup table sizes after applying this optimization to the log
and exp tables are shown in Table II. A total of 22K words with 574 464 b
is required. Further reduction is required to fit in the available
technology.
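The range reduction identity for the log table can be sketched as follows. The integer `X` stands for a fixed-point input $x = X / 2^{n}$, `int.bit_length` plays the role of the priority encoder, and the table is built with floating-point `math.log2`; the function names and the table layout are assumptions for illustration only.

```python
import math

def build_log_table(n_bits):
    """Table of log2(x) for x in [0.5, 1): half as many entries as 2**n_bits."""
    half = 1 << (n_bits - 1)
    return [math.log2((half + k) / (1 << n_bits)) for k in range(half)]

def log2_range_reduced(X, n_bits, table):
    """log2(X / 2**n_bits) for an integer 0 < X < 2**n_bits."""
    i = n_bits - X.bit_length()        # priority encoder: number of leading zeros
    Xn = X << i                        # normalised so the top bit is set
    idx = Xn - (1 << (n_bits - 1))     # the top bit is implied, so drop it
    return table[idx] - i              # log(x) = log(2**i * x) - i
```

With `n_bits = 8` the table holds 128 entries instead of 256, and `log2_range_reduced(3, 8, table)` returns $\log_2(3/256)$ by looking up $\log_2(0.75)$ and subtracting the shift count of 6.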
C. Nonlinear Table Compression
A second technique, called nonlinear table compression, is used to reduce
the size of each of the lookup tables. Nonlinear compression uses the
observation that a linear approximation provides a close, but inexact,
approximation of the function. Since the approximation is close, a table
storing the difference between the linear approximation and the exact
function can use a few bits to represent the difference. This is
expressed by (14) through (16). It is necessary to represent the value of
$f(x_b)$ and $df(x_b)/dx$ for each possible value of $x_b$, and the value
of $f_d(x)$ for each possible value of $x$.
Equation (15) corresponds to the hardware implementation shown in Fig. 5,
so each ROM of Fig. 4 is replaced by an instance of the logic in Fig. 5.
Given some $f(x)$ with $N_x$ bits of precision in $x$, nonlinear table
compression splits $x$ into two parts, $x_e$, containing the $N_e$ least
significant bits of $x$ used for linear approximation, and $x_b$,
containing the $N_b$ most significant bits used to look up the function
value and derivative, with $N_x = N_e + N_b$. The value of $x_b$ is used
to address a ROM. Each word contains the value of $f(x_b)$, the value of
$df(x_b)/dx$, and $2^{N_e} - 1$ correction words $f_d(x)$ (note
$f_d(x_b) = 0$ and is not stored). The widths of these quantities are
$W_f$, $W_d$, and $W_e$, respectively, with the total width of bits used
for $f_d(x)$ being $W_{dd} = (2^{N_e} - 1) \times W_e$. A multiplexer uses
$x_e$ to select a field from the ROM, producing the value of $f_d(x)$. A
multiplier and two adders produce $f(x)$.
In fact, the value stored for $df(x_b)/dx$ is chosen to optimize the
accuracy of the linear approximation and thus minimize the number of bits
of ROM, and is not necessarily the exact value of $df(x_b)/dx$ (although
the result must still be exact). The total ROM width is
$W = W_f + W_d + W_{dd}$, but the number of words is reduced by a factor
of $2^{N_e}$. As $N_e$ increases, the number of words decreases, and
$W_{dd}$ and $W$ increase. The value of $N_e$ is chosen to minimize total
ROM space.
III. CHIP AREA OPTIMIZATION AND DESIGN
A chip set that implements LNS addition using the above algorithm has
been implemented. The chip set is designed to demonstrate the feasibility
of high-precision LNS processors, so only the arithmetic section of the
processor has been designed. This is implemented as a pipelined processor,
accepting two operands and delivering a result per clock cycle. The
processor is deeply pipelined in order to maximize throughput, at the cost
of slightly increasing latency. In a pipelined architecture, combinational
circuits can be pipelined to use a cycle time as small as a gate delay
plus pipeline register, but an architecture that uses ROM's cannot easily
be pipelined using a cycle shorter than a ROM access. A goal of 140 ns was
set for ROM access time and cycle time of the chip.
The algorithms described above describe an architecture for LNS
arithmetic, but detailed design decisions must be made to optimize circuit
area. There are two principal decisions, related to nonlinear compression
and allocation of lookup tables to ROM's.
A. Nonlinear Compression Optimization
The nonlinear compression algorithm places no constraints on $N_e$.
Because the number of words and the size of each word varies with $N_e$,
it is important to choose an $N_e$ that minimizes chip area. The first
step in this optimization was to write a computer program that accepts the
contents of a lookup table and evaluates the total table size for any
given value of $N_e$. This can be used to determine the optimum value of
$N_e$. This technique can be applied to each segment of the $f_a(r)$ and
$f_s(r)$ lookup tables, and to the log and exp tables as a whole. This
optimization is relatively straightforward, and becomes complex only when
the possible allocation of multiple segments or tables to a single ROM is
considered.
TABLE III
COMPRESSED TABLE SIZES
(a) Optimal Values of W
Name | N_e | W_f | W_d | W_dd | W   | words
fa   | 3   | 23  | 13  | 49   | 85  | 258
fsb  | 3   | 23  | 13  | 49   | 85  | 386
fss1 | 4   | 27  | 14  | 135  | 176 | 640
fss2 | 3   | 28  | 22  | 77   | 127 | 162
log  | 4   | 24  | 13  | 105  | 142 | 128
exp  | 4   | 23  | 12  | 75   | 110 | 256
(b) Actual Values of W
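The kind of program described above, which takes table contents and evaluates the compressed size for each $N_e$, might look like the following minimal sketch. It is not the authors' program: the slope field is stored here as the integer rise across a block, the linear term uses floor division instead of a hardware multiplier, all field widths are computed as two's complement widths, and the names `compress`, `expand`, and `best_n_e` are hypothetical. The table length is assumed to be a multiple of $2^{N_e}$ with $N_e \ge 1$.

```python
import math

def signed_width(values):
    """Bits needed to hold every value in two's complement."""
    m = max(max(values), -min(values) - 1)
    return m.bit_length() + 1

def compress(table, n_e):
    """Split an integer table into blocks of 2**n_e entries.

    Each word is (f(x_b), slope field, list of 2**n_e - 1 differences).
    Returns (words, total_bits).
    """
    size = 1 << n_e
    words = []
    for b in range(0, len(table), size):
        blk = table[b:b + size]
        rise = blk[-1] - blk[0]                      # slope field for this block
        diffs = [blk[k] - (blk[0] + (rise * k) // (size - 1))
                 for k in range(1, size)]            # f_d for x_e = 1 .. size-1
        words.append((blk[0], rise, diffs))
    w_f = signed_width([w[0] for w in words])
    w_d = signed_width([w[1] for w in words])
    w_e = signed_width([d for w in words for d in w[2]])
    total_bits = len(words) * (w_f + w_d + (size - 1) * w_e)
    return words, total_bits

def expand(words, n_e):
    """Rebuild the original table exactly: base + linear term + difference."""
    size = 1 << n_e
    out = []
    for base, rise, diffs in words:
        out.append(base)                             # f_d(x_b) = 0 is implied
        for k in range(1, size):
            out.append(base + (rise * k) // (size - 1) + diffs[k - 1])
    return out

def best_n_e(table, candidates):
    """Value of N_e giving the fewest total bits under this simple model."""
    return min(candidates, key=lambda n_e: compress(table, n_e)[1])
```

Applied to a 1024-entry table of $f_a$ on $[-1, 0)$ quantized to 20 fractional bits, the compressed size at the best $N_e$ is well below the size of the uncompressed table, and `expand` reproduces every entry exactly, which mirrors the requirement that the result must still be exact.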
The log and exp tables are both accessed once per operation, and so
require a distinct ROM for each table. The processor also requires one
access to either $f_a(r)$ or $f_s(r)$. This potentially allows several
segments of $f_a(r)$ and $f_s(r)$ to be implemented in a single ROM, if
this will save space. This is only directly possible if the ROM is wide
enough to accommodate the widest of each of the fields, since otherwise
some data steering logic is required. As a first step, the segments for
$f_a(r)$ and $f_s(r)$ were independently optimized and grouped into four
collections that could fit into identical word widths. About 5% of the
table space is not used, but the corresponding area is small compared to
the cost of separate tables. These collections of segments and
corresponding word widths are shown in Table III(a). Total table space is
reduced to 234290 b, a reduction of 57%.
It is desirable to perform further sharing by merging the various
$f_a(r)$ and $f_s(r)$ tables into fewer ROM's. Although the merged ROM
might contain more bits, it will eliminate some wiring and overhead
circuits such as word-line drivers and sense amplifiers. Merging cannot be
done efficiently using the tables of Table III(a) because of the large
variations in word widths. However, the variation of ROM size with respect
to $N_e$ is relatively flat near the minimum. This allows ROM's to be
merged by choosing a suboptimal $N_e$ that results in comparable values of
$W_f$, $W_d$, and $W_{dd}$ for the different segments. While the value of
$N_e$ for a given collection of segments is suboptimal, the overall chip
area is minimized.
We wrote computer programs to study the area trade-off between sharing
tables and implementing each table separately. The programs use a detailed
model of ROM area, including core area and overhead area for the
technology available to us. This was used to explore the possibilities of
sharing tables, and used to find the optimal implementation of the tables.
It was found that it was most economical to merge all sections of $f_a$
and $f_s$ into two ROM's, as shown in Table IV. A large ROM (the f-ROM)
with $W = 90$ stores most of the data using the values $W_f = 27$,
$W_d = 14$, and $W_{dd} = 49$, which can be interpreted as either three
fields of 14 b or seven fields of 7 b. No data steering logic is required
except for $f_d(x)$. This can store any table entry except the fss2
segments, where 1 b of $W_f$ and 8 b of $W_d$ are stored in a separate
265 × 9 excess ROM. Although the values of $N_e$ for each table are
slightly suboptimal, the overall circuit area is minimized. This increases
the number of bits of ROM to 245731 b, as shown in Table III(b), a 5%
increase, but overall ROM area is reduced by 28%, and wiring area is
reduced as well.
TABLE IV
ROM SIZES
ROM   | Words | Width | Total Bits
log   | 128   | 110   | 14080
exp   | 256   | 142   | 36352
Total |       |       | 249131
Fig. 6. Clock timing (time axis t in ns).
The optimal ROM sizes presented above require complicated
logic to map r into an f-ROM address. The
complexity of the logic can be reduced by placing each
table at an address in the ROM that simplifies the map-
ping function. This leaves “holes” in the address space of
the ROM, but no additional area is consumed, since the
corresponding words in the ROM are not implemented.
The word-line decoder must decode only those addresses
that are implemented. The resulting structure is effec-
tively a PLA implemented with the area efficiency of a
ROM due to the density of the terms.
The use of a multilevel decoder tree requiring multiples
of eight words in each segment, plus the need for reason-
able aspect ratios, creates a further small increase in
ROM space. A total of 251 kb of ROM is actually present
in the chip.
An unfortunate side effect of the complicated algo-
rithms used by this processor is the large delay through
the processor. A total of ten pipeline stages is used in the
present implementation. One of these stages is used for
interchip communication, since multiplexing of the inter-
chip signals is required. Four of these stages are required
to perform range reduction operations for the log and exp
lookup tables. In retrospect, the area saved by range
reduction was small, so these stages should be eliminated
at the expense of slightly greater area.
B. Chip Implementation
A two-chip set that performs LNS addition and subtraction has been
implemented in 3-μm DLM p-well CMOS. The circuits were estimated as
requiring a 10-mm × 10-mm die, but the multiproject silicon foundry
service available to us had a maximum die size of 8.2 mm × 8.2 mm,
necessitating partitioning the chip into a two-chip set. Partitioning the
chip into a two-chip set increased area considerably due to the large
number of pads required for communication. The total area of the processor
is still well within the amount that can be fabricated on a single chip
with acceptable yield, so this decision is purely due to the constraints
of the chip manufacturer.
Two-phase clocking with the timing illustrated in Fig. 6 is used.
Two-stage dynamic latches are used throughout, with all state changes on
$\phi_1$, which is used for precharging dynamic circuits. The timing of
$\phi_1$ is primarily dictated by ROM requirements. $\phi_2$ is only used
for the first latch stage in pipeline registers. This constrains the
processor to a single dynamic circuit per pipe stage, but does not cause
any extra pipe stages.
The chip design uses simple static circuits wherever possible in order to
keep design complexity low. A small number of commonly used cells, such as
adders and multiplexers, were designed to keep layout time moderate. A
fully static adder cell with block carry lookahead was designed.
Variable-sized blocks are used to reduce carry propagation time, and the
resulting circuit can perform a 30-b addition in 55 ns. Dynamic circuits
were used only when a static circuit was too slow or too large, namely for
ROM's, shifters, PLA's, and pipeline registers. The shifters and PLA's are
constructed using domino logic, precharged on $\phi_1$ and evaluated on
$\bar{\phi}_1$. The pipeline registers are dynamic latches.
The pipeline partitioning of the chip is chosen to make the ROM access
time the only limit on clock cycle. The use of no more than one dynamic
circuit per pipe stage sets a lower bound of five-clock-cycle latency on
the circuit. Although consecutive dynamic stages with no intervening logic
can be placed in consecutive clock cycles, most of the dynamic circuits
have some intervening static logic. The static logic requires the use of
an extra pipeline stage if it cannot fit into the time slack of the
preceding dynamic stage. A total of nine stages results from these
constraints, and an extra stage to communicate between chips brings the
total to ten stages.
The f-ROM is the limiting factor in processor speed, so a simple but
reasonably fast design was used. The address is latched and predecoded
during $\phi_1$ while the ROM precharges. The access is performed during
$\bar{\phi}_1$, and the output data are latched during $\phi_2$. The ROM
design uses an El-ROM core with straps to ground every 32 b and
second-level-metal word-line straps every 64 b. Two-level decoders are
used for fast word-line drive. A simple sense amplifier, a ratioed
inverter with a high switching threshold, is used to provide reasonably
fast sensing with good noise immunity. By choosing the inverter ratio
appropriately, fast switching can be achieved. Since a ratioed inverter
has a long pull-down time, an additional NMOS FET is used to precharge the
data-out line low. Simulations predicted a 35-ns precharge and 90-ns
access time for a 125-ns cycle. An additional 15-ns dead time is allowed
for a total cycle of 140 ns.
Fig. 7. Photomicrographs of processor chips.
The processor is partitioned into two chips, with the functions allocated
to each chip shown in Fig. 4. Fig. 7 shows photomicrographs of the two
chips, with labels corresponding to Fig. 4. Both chips have been
fabricated and tested successfully. One circuit design error, a floating
address line to a ROM, was found. This was not detected during simulations
due to the limited number of vectors that could be simulated using the
entire chip. At $V_{DD} = 5.0$ V and 25°C, chip 1 operated at a speed of
4.5 MHz, and chip 2 operated at 5.6 MHz. The lower than expected clock
speed was the result of inadvertent omission of metal straps for the
predecoded address lines. SPICE predicts 175-ns cycle time for chip 1 with
this omission, closer to the measurements. Our test setup limited control
of the clock widths to multiples of 20 ns, so the actual maximum clock
rates could be higher than these rates, and closer to the SPICE
simulations.
IV. COMPARISON TO FLOATING-POINT PROCESSORS
The use of 3-μm technology makes the performance of the prototype
uncompetitive with modern FP processors. The processor is not only slower
than recent FP processors, but this technology only allows 30-b words with
20-b precision that has less accuracy than a single-precision FP
processor. A 2-μm or higher density technology is required to implement
32-b words with 23-b precision (deleting the zero bit, and using a
distinct value to represent zero), offering about 0.53 b more precision
than FP.
As mentioned earlier, the LNS has not been widely used due to the
difficulty of implementing addition and subtraction. The methods described
here could make LNS arithmetic competitive with FP processors, and
feasible for general-purpose computations. To compare LNS arithmetic to
FP, we consider general-purpose computations, which require circuits for
multiplication as well as addition. In the LNS, a multiplication or
division is performed with addition or subtraction, so the silicon area
required is insignificant, and the time required using our adder is 55 ns
in 3-μm CMOS. Using improved ROM's, the processor presented here can
achieve 140-ns addition with 1400-ns latency. An FP processor designed
using similar technology [10] offers lower performance than the LNS
processor, offering 1.1-μs addition, 1.6-μs multiplication, and 2.7-μs
division.
LNS arithmetic should offer excellent performance in
higher density technology. The core area of the LNS chips
totals about 100 mm², of which about 30% is ROM.
Increasing the word size to 32 b to achieve better accuracy
than a single-precision FP processor would increase
ROM area to 74 mm², for a total of 144 mm². Scaling to
1.2-μm technology suggests an area of 23 mm², smaller
than current FP processors. Eliminating the interchip
communication and range reduction logic for the log and
exp ROM’s would eliminate five pipe stages while in-
creasing area by about 5%. The predicted cycle time in
1.2-μm CMOS from SPICE simulations is 25 ns, with a
latency of 125 ns. Multiplication and division could be
accomplished with insignificant area and 25-ns latency. A
recent FP processor [11] in 1.2-μm CMOS offers 25-ns
cycle times for addition and multiplication, and can complete
two operations every three cycles, but has circuit
area of 110 mm². This processor can perform either
single- or double-precision FP arithmetic, and also integrates
a register file and interface logic. Another single-precision
FP processor in the same technology [12] has
area of 58 mm² and performs an addition and multiplication
every 67 ns. A LNS processor in this technology
would offer higher performance and smaller circuit area.
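
The area and latency figures in this comparison can be reproduced with a short calculation. The sketch assumes that area scales with the square of the feature size, and that removing five stages from the ten-stage pipeline leaves five; both are inferences that happen to match the quoted numbers, not statements from the text. The variable names are illustrative.

```python
# Figures quoted in the text.
core_area_mm2 = 100.0      # current core area of the two chips
rom_fraction = 0.30        # share of that area occupied by ROM
rom_area_32b_mm2 = 74.0    # ROM area needed for 32-b words

# Non-ROM logic is assumed unchanged when the word size grows.
non_rom_mm2 = core_area_mm2 * (1.0 - rom_fraction)
total_32b_mm2 = non_rom_mm2 + rom_area_32b_mm2

# Assumption: area shrinks with the square of the feature-size ratio.
scale = (1.2 / 3.0) ** 2
scaled_area_mm2 = total_32b_mm2 * scale

# Assumption: ten pipeline stages minus the five eliminated leaves five.
stages = 10 - 5
cycle_ns = 25
latency_ns = stages * cycle_ns

print(round(total_32b_mm2), round(scaled_area_mm2), latency_ns)
```

The same stage count explains the prototype figures: ten stages at a 140-ns cycle give the 1400-ns latency quoted earlier.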
LNS processors using the algorithms described here
cannot implement double-precision FP processors in reasonable
area due to the $O(2^{F/2})$ dependency of circuit
area on precision. Implementation of high-precision LNS
addition will require the development of different algorithms.
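
To give a sense of scale for this exponential dependency, the sketch below evaluates the growth factor relative to a 23-b fraction. It assumes F is the number of fraction bits and ignores constant factors; `relative_area` is a hypothetical helper, and 52 is the fraction width of IEEE double precision.

```python
def relative_area(f_bits: int, f_ref: int = 23) -> float:
    """Area growth factor under an O(2**(F/2)) model, relative to f_ref bits."""
    return 2.0 ** ((f_bits - f_ref) / 2)


# Every two extra fraction bits double the table area under this model.
print(relative_area(25))

# Going from a 23-b to a 52-b fraction multiplies area by 2**14.5.
print(f"{relative_area(52):.0f}")
```

A factor of 2**14.5 is more than twenty thousand, which is why a straightforward extension of these tables to double precision is impractical.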
V. CONCLUSIONS
This paper has described the architecture of an inte-
grated processor for 30-b LNS arithmetic. Two techniques
are used to achieve this precision in moderate circuit area.
Linear approximation of the LNS arithmetic functions
using logarithmic arithmetic is shown to be simple due to
the particular functions involved. A segmented approach
to linear approximation minimizes the amount of table
space required. Subsequent nonlinear compression of each
lookup table leads to a further reduction in table size.
The result is that a factor of 285 reduction in table size is
achieved, compared to previous techniques.
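
As a rough illustration of table-based linear approximation for LNS addition, the following sketch approximates the addition function log2(1 + 2^d) with a uniformly spaced table and linear interpolation. It is a simplification under stated assumptions: the interval width `STEP`, the cutoff `D_MIN`, and all function names are hypothetical, floating point stands in for the fixed-point ROM contents, and neither the segmented variable-width intervals nor the nonlinear compression of the processor are modelled.

```python
import math


def f_add(d: float) -> float:
    """Exact LNS addition function log2(1 + 2**d), used for d <= 0."""
    return math.log2(1.0 + 2.0 ** d)


STEP = 1.0 / 16   # uniform interval width (assumption; not the paper's scheme)
D_MIN = -32.0     # below this point the function is treated as zero
TABLE = [f_add(D_MIN + i * STEP) for i in range(int(-D_MIN / STEP) + 1)]


def f_add_approx(d: float) -> float:
    """Linear interpolation between adjacent table entries, for d <= 0."""
    if d <= D_MIN:
        return 0.0
    pos = (d - D_MIN) / STEP
    i = min(int(pos), len(TABLE) - 2)   # keep i + 1 inside the table
    frac = pos - i
    return TABLE[i] + frac * (TABLE[i + 1] - TABLE[i])


def lns_add(a: float, b: float) -> float:
    """Return log2(2**a + 2**b) for two positive numbers given as log2 values."""
    hi, lo = max(a, b), min(a, b)
    return hi + f_add_approx(lo - hi)
```

For instance, `lns_add(math.log2(3), math.log2(5))` is close to 3, the logarithm of 8. Subtraction would need the companion function log2(1 - 2^d), which is omitted here.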
The circuit area of the implementation is minimized by
optimizing the table parameters, using a computer pro-
gram that accurately models ROM area. The implementa-
tion is highly pipelined, and produces one result per clock
cycle using a ten-stage pipeline. The architecture could be
scaled using modern technology to higher precision and
performance, as well as reduced latency. As a result, LNS
arithmetic can offer higher speed and better accuracy
than a single-precision FP processor in smaller circuit
area.
ACKNOWLEDGMENT
Circuit fabrication was provided by the Canadian Mi-
croelectronics Corporation.
REFERENCES
[1] J. N. Mitchell Jr., "Computer multiplication and division using binary logarithms," IRE Trans. Electron. Comput., vol. EC-11, pp. 512-517, Aug. 1962.
[2] F. J. Taylor, R. Gill, J. Joseph, and J. Radke, "A 20 bit logarithmic number system processor," IEEE Trans. Comput., vol. 37, pp. 190-200, Feb. 1988.
[3] E. E. Swartzlander and A. G. Alexopolous, "The signed logarithm number system," IEEE Trans. Comput., vol. C-24, pp. 1238-1242, Dec. 1975.
[4] J. H. Lang, C. A. Zukowski, R. O. LaMaire, and C. H. An, "Integrated-circuit logarithmic units," IEEE Trans. Comput., vol. C-34, pp. 475-483, May 1985.
[5] M. Combet, H. Van Zonneveld, and L. Verbeek, "Computation of the base two logarithm of binary numbers," IEEE Trans. Electron. Comput., vol. EC-14, pp. 863-867, Dec. 1965.
[6] D. Marino, "New algorithm for the approximate evaluation in hardware of binary logarithms and elementary functions," IEEE Trans. Comput., vol. C-21, pp. 1416-1421, Dec. 1972.
[7] H.-Y. Lo and Y. Aoki, "Generation of a precise binary logarithm with difference grouping programmable logic array," IEEE Trans. Comput., vol. C-34, pp. 681-691, Aug. 1985.
[8] F. J. Taylor, "An extended precision logarithmic number system," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-31, pp. 232-234, Jan. 1983.
[9] D. M. Lewis, "An architecture for addition and subtraction of long word length numbers in the logarithmic number system," IEEE Trans. Comput., vol. 39, pp. 1326-1336, Nov. 1990.
[10] G. Wolrich et al., "A high performance floating point coprocessor," IEEE J. Solid-State Circuits, vol. SC-19, pp. 690-696, Oct. 1984.
[11] K. Molnar et al., "A 40 MHz 64-b floating point co-processor," in ISSCC Dig. Tech. Papers, 1989, pp. 48-49.
[12] D. A. Staver et al., "A 30-MFLOPS CMOS single precision floating point multiply-accumulate chip," in ISSCC Dig. Tech. Papers, 1987, pp. 274-275.
Lawrence K. Yu (S'85-M'90) received the B.A.Sc. and M.A.Sc. degrees in electrical engineering from the University of Toronto, Toronto, Ont., Canada, in 1986 and 1990, respectively.
He is presently employed at the University of Toronto as a Research Associate on the Hubnet project. His research interests include computer arithmetic and VLSI systems design.
David M. Lewis (M'88) received the B.A.Sc. degree with honors in engineering science from the University of Toronto, Toronto, Ont., Canada, in 1977, and the Ph.D. degree in electrical engineering in 1985.
From 1982 to 1985 he was employed as a Research Associate on the Hubnet project, and developed custom integrated circuits for a 50-Mb/s local area network. He has been an Assistant Professor at the University of Toronto since 1985. His research interests include logic and circuit simulation, computer architecture for high-level languages, logarithmic arithmetic, and VLSI architecture.
Dr. Lewis is a member of the ACM.