Journal of VLSI Signal Processing 33, 83–103, 2003
© 2003 Kluwer Academic Publishers. Manufactured in The Netherlands.
Energy Efficient Adiabatic Multiplier-Accumulator Design
DUSAN SUVAKOVIC AND C. ANDRE T. SALAMA
Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, 10 King’s College Road, Toronto, Ontario M5S 3G4, Canada
Received October 20, 2000; Revised August 29, 2001
Abstract. This paper presents a strategy for minimizing non-adiabatic dissipation in adiabatic arithmetic units. The non-adiabatic dissipation is minimized by architectural design involving a small number of complex logic gates. Circuit design of complex adiabatic gates, based on ordered binary decision diagrams (OBDD), is introduced. An optimized architecture for adiabatic parallel multipliers is proposed and savings in energy dissipation over competing architectures are estimated. Experimental results obtained from implementation of an adiabatic multiply-accumulate (MAC) unit suggest that the proposed strategy provides substantial improvement in energy efficiency over equivalent non-adiabatic and alternative adiabatic implementations, while achieving a competitive operating speed.
Keywords: adiabatic, arithmetic, circuits, multiplier, low-power
1. Introduction
Unlike other low power design techniques that attempt to minimize the energy used in computation [1], the energy recovery or adiabatic technique involves recycling of that energy. The delivery and recovery of energy is performed virtually without dissipation [2], resulting in potentially better energy efficiency than in conventional digital systems. Since energy consumption is not necessary in order to perform computation [3], energy recovery using CMOS circuits is possible.
Dissipation in adiabatic logic consists of two components: adiabatic and non-adiabatic dissipation. The former component can be reduced asymptotically to zero by slowing down the transfer of charge between digital circuitry and the power supply [2]. The latter component cannot be eliminated [4] and results from erasure of information that occurs in all conventional arithmetic architectures [3, 5]. Since non-adiabatic dissipation introduces a lower bound on the overall dissipation, adiabatic implementation of a digital system is justified only if this lower bound is significantly smaller than the dissipation achievable by conventional low power design techniques.
Three different approaches to the problem of non-adiabatic dissipation have been proposed in previous work. The first approach minimizes the non-adiabatic dissipation at the circuit level [6–12], by replacing conventional logic gates in a digital system by their adiabatic counterparts. An improvement in energy efficiency of 3–4 times over equivalent CMOS circuit implementations has been reported for small adiabatic multipliers and adders designed using this approach [13]. The second approach involves logically reversible system design [14], which eliminates non-adiabatic dissipation, while introducing significant overhead in circuitry and adiabatic dissipation that is particularly pronounced in DSP building blocks [2, 15]. The third approach is based on applying the energy recovery technique only to selected, high capacitance nodes, for which the non-adiabatic dissipation is negligible compared to the savings achieved by energy recovery. Such design is beneficial in system architectures in which switching at a small number of heavily loaded circuit nodes dominates the overall dissipation [16, 17]. None of the approaches described above is convenient for adiabatic implementation of arithmetic units, which are the main source of dissipation in conventional DSP systems.
The work described in this paper builds on the third approach by introducing special architectures for adiabatic implementation of parallel multipliers and multiplier-accumulators (MACs) with a small number of internal nodes, which facilitates energy recovery. In the proposed architectures, non-adiabatic dissipation is minimized by using high fan-in gates, involving transistor networks with a topology of ordered binary decision diagrams (OBDD) [18]. OBDD-style counter circuits, with as many as 15 inputs, are used as major building blocks in the implemented parallel multiplier/MAC. Their feasibility and energy efficiency are experimentally verified.
The paper is organized as follows. The sources of dissipation in adiabatic systems are summarized in Section 2. Section 3 describes the circuit design of complex OBDD logic gates and adiabatic sense amplifiers. The architecture design for adiabatic parallel multipliers that minimizes the number of latches, together with the required complex logic networks, is described in Section 4. Section 5 presents the design of a multiply-accumulate arithmetic unit built using high fan-in, OBDD-style counter gates. Conclusions are given in Section 6.
2. Energy Dissipation in Adiabatic Systems
Energy dissipation in adiabatic systems is made up of adiabatic dissipation ($E_a$), power supply losses ($E_{ps}$), non-adiabatic dissipation ($E_{na}$) and CMOS dissipation ($E_c$). The adiabatic dissipation is specified by

$$E_a = \frac{R \cdot C_a}{T} \cdot C_a \cdot V_{max}^2 \qquad (1)$$

where $R$ is the on (triode region) resistance of the transistors responsible for energy recovery, $C_a$ is the adiabatically charged capacitance and $T$ and $V_{max}$ are the slope (i.e. the rise or fall time) and the amplitude, respectively, of the power clock, which serves as both the clock signal and the supply voltage [2]. The adiabatic dissipation can be reduced by reducing the supply voltage $V_{max}$ or the adiabatic load capacitance $C_a$. Moreover, it can be made arbitrarily low by increasing $T$.
The power supply losses can be expressed as

$$E_{ps} = (1 - \eta) \cdot \frac{1}{2} \cdot C_a \cdot V_{max}^2 \qquad (2)$$

where $\eta$ is the efficiency factor of the power supply, which increases with the power clock period $T$ [19, 20]. Similarly to adiabatic dissipation, $E_{ps}$ depends on the supply voltage $V_{max}$ and the adiabatic load capacitance $C_a$ and can be reduced arbitrarily by increasing the power clock period.
The non-adiabatic dissipation for systems using latches consisting of two cross coupled CMOS inverters is given by

$$E_{na} = N_{la} \cdot C_{la} \cdot V_t^2 \qquad (3)$$

where $N_{la}$ is the average latch switching rate, $C_{la}$ is the total latch node capacitance and $V_t$ is the threshold voltage of the PMOS transistor [4, 8]. Since $E_a$ and $E_{ps}$ can be made arbitrarily low, the non-adiabatic dissipation, caused by erasure of information in pipelined systems, is exposed as the dominant part of the overall dissipation. $E_{na}$ depends on the latch implementation but, to a greater extent, its reduction is achievable at the architectural design level, as explained in Section 4.
Finally, the CMOS dissipation is given by

$$E_c = N_c \cdot C_c \cdot V_{dd}^2 \qquad (4)$$

where $C_c$ and $N_c$ are the total physical capacitance and its associated switching rate for the part of the system consisting of conventional CMOS gates. Since this paper focuses on adiabatic implementation of arithmetic units, $E_c$ is ignored in further discussion.
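The different scaling of the dissipation components can be illustrated with a minimal numerical sketch of Eqs. (1)–(3). The function names and all component values below are hypothetical and chosen only for illustration; they are not taken from the implemented circuits.

```python
def adiabatic_dissipation(r_on, c_a, t_slope, v_max):
    """Eq. (1): E_a = (R * C_a / T) * C_a * V_max^2."""
    return (r_on * c_a / t_slope) * c_a * v_max ** 2


def supply_loss(eta, c_a, v_max):
    """Eq. (2): E_ps = (1 - eta) * 1/2 * C_a * V_max^2."""
    return (1.0 - eta) * 0.5 * c_a * v_max ** 2


def non_adiabatic_dissipation(n_la, c_la, v_t):
    """Eq. (3): E_na = N_la * C_la * V_t^2."""
    return n_la * c_la * v_t ** 2


# Hypothetical values (assumptions for illustration only).
R_ON, C_A, V_MAX = 1e3, 1e-12, 1.6       # ohm, farad, volt
N_LA, C_LA, V_T = 0.5, 20e-15, 0.5       # rate, farad, volt

for t_slope in (10e-9, 100e-9, 1e-6):
    e_a = adiabatic_dissipation(R_ON, C_A, t_slope, V_MAX)
    e_na = non_adiabatic_dissipation(N_LA, C_LA, V_T)
    # E_a shrinks as 1/T, whereas E_na does not depend on T at all.
    print(f"T = {t_slope:.0e} s: E_a = {e_a:.3e} J, E_na = {e_na:.3e} J")
```

With these assumed numbers, slowing the power clock lowers $E_a$ in proportion to $1/T$ while $E_{na}$ stays fixed, which is the sense in which the non-adiabatic term sets the lower bound.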
3. Design of Complex Adiabatic Logic Gates
The proposed structure of an adiabatic logic gate is illustrated in Fig. 1. It consists of a complex NMOS logic network, two precharge transistors, a sense amplifier, a latch and two output adiabatic buffers/drivers. The design of the complex NMOS logic networks, and their output detection, are the key issues for the proposed circuit technique.
The topology of the complex NMOS logic network used as part of the adiabatic gate shown in Fig. 1 is that of an ordered binary decision diagram (OBDD), which is known to be a more compact representation of a logic function than conventional representations based on product terms [18]. Figure 2(a) shows an example of an OBDD with four binary inputs, whereas Fig. 2(b) shows the corresponding NMOS transistor network, in which each OBDD edge is replaced with an NMOS pass transistor whose gate is controlled by an input signal. All transistors in the same row are controlled by either the non-inverted or inverted version of the same input signal, depending on the label on the corresponding OBDD edge. For each combination of the input signals, the circuit shown in Fig. 2(b) performs computation by creating a low impedance path between the root node and one of the output nodes, whereas the energy used to control the NMOS switches is completely recoverable from the gate capacitances. Although the computation itself does not produce dissipation, the output detection requires dissipation of a finite amount of energy.

Figure 1. Adiabatic gate structure.
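The path-selection behaviour described above can be sketched in a few lines. The example below is an assumption-laden illustration: it uses a 4-input parity function as a hypothetical stand-in (it is not the function drawn in Fig. 2), represents each OBDD node as a dictionary entry, and borrows the node names NR, NE0 and NE1 only as labels for the root and the two output nodes.

```python
from itertools import product

# Each non-terminal node maps to (input_index, next_if_0, next_if_1).
# Hypothetical example: 4-input parity, with NR as root and NE0/NE1 as outputs.
PARITY4 = {
    "NR":    (0, "even1", "odd1"),
    "even1": (1, "even2", "odd2"),
    "odd1":  (1, "odd2", "even2"),
    "even2": (2, "even3", "odd3"),
    "odd2":  (2, "odd3", "even3"),
    "even3": (3, "NE0", "NE1"),
    "odd3":  (3, "NE1", "NE0"),
}


def evaluate(obdd, root, inputs):
    """Follow the single conducting path from the root to an output node."""
    node = root
    while node in obdd:
        index, next_if_0, next_if_1 = obdd[node]
        node = next_if_1 if inputs[index] else next_if_0
    return node


def transistor_count(obdd):
    """One pass transistor per OBDD edge, two edges per non-terminal node."""
    return 2 * len(obdd)


for inputs in product((0, 1), repeat=4):
    reached = evaluate(PARITY4, "NR", inputs)
    assert reached == ("NE1" if sum(inputs) % 2 else "NE0")
print(transistor_count(PARITY4), "pass transistors in this example network")
```

Exactly one output node is reached for every input combination, which mirrors the low impedance path formed between the root and one output of the NMOS network.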
The feasibility of high fan-in logic gates in practical implementations is limited by their complexity, which impacts their speed, physical footprint, input capacitance and energy required for detection.
Figure 2. Logic function representation using OBDD: (a) OBDD and (b) equivalent NMOS logic network.
Techniques for reliable output detection for complex, high fan-in NMOS networks, including OBDD-based networks, were reported in previous work [21, 22]. The output detection technique [23] used here minimizes the energy required for detection by using voltage sensing.
The sense amplifier circuit is shown in Fig. 3(a). The key waveforms for the gate operation, including the power clock signals PWR1 and PWR2 for a two-phase non-overlapping adiabatic clocking scheme, are shown in Fig. 3(b).
The logic gate operates in two phases. In the second phase, nodes F and FB are precharged, while all transistors in the OBDD NMOS network are off. In the first phase of the next clock cycle, inputs of the NMOS tree are energized, discharging either node F or FB to ground and creating a small differential voltage between nodes S and SB. Subsequently, PWR2 energizes the sense amplifier and creates full swing differential signals at nodes S and SB reflecting the detected NMOS tree output.

Figure 3. Circuit design for adiabatic OBDD-style gates: (a) sense amplifier, (b) typical waveforms (voltage versus time for PWR1 and PWR2, F and FB, S and SB), (c) adiabatic drivers controlled by sense amplifier and (d) adiabatic drivers controlled by latch.
The gate following the sense amplifier has a small input capacitance, decoupling the sense amplifier nodes from the load capacitance of the following stage. The choice of this gate generally depends on the type of logic in the following stage.
For relatively small capacitive loads, adiabatic drivers powered by the same power clock phase as the sense amplifier can be controlled directly by the sense amplifier, as shown in Fig. 3(c). However, for larger load capacitances, CMOS signal levels are required to control the adiabatic driver instead of the pulse-shaped voltage at the sense amplifier outputs. The CMOS level signals are obtained by the latch shown in Fig. 3(d), which is identical to the pulse-to-level converter given in [17] except for the addition of transistors M3 and M4. These transistors are controlled by power clock PWR1 and eliminate the effect of the non-zero voltage at the sense amplifier nodes S and SB during the opposite power clock phase, when PWR2 is high. The outputs of the latch are stable for the duration of the PWR2 pulse, thus allowing complete adiabatic charging and discharging of large capacitive loads.
3.1. Energy Efficiency of Proposed Adiabatic Gates
The addition of CMOS-type dissipation introduced by the pulse-to-level converter is justified if it is much smaller than the non-adiabatic dissipation due to incomplete discharging of the load capacitance, which is eliminated this way. Based on Eqs. (3) and (4), this condition is satisfied when

$$C_{ld} \gg \left(\frac{V_{dd}}{V_t}\right)^2 \cdot C_{la} \qquad (5)$$

where $C_{ld}$ and $C_{la}$ are the load and latch capacitances, respectively. Condition (5) is easily met for low voltage operation, which is preferred in order to achieve high energy efficiency. Moreover, since the fraction of the total power supply losses associated with the energy recovered from the observed load capacitance $C_{ld}$ can be expressed as

$$E_{ps,ld} = (1 - \eta) \cdot \frac{1}{2} \cdot C_{ld} \cdot V_{max}^2 \qquad (6)$$

the energy consumed by the CMOS latch is negligible compared to $E_{ps,ld}$ if

$$C_{ld} \gg \frac{2 \cdot N}{1 - \eta} \cdot \left(\frac{V_{dd}}{V_{max}}\right)^2 \cdot C_{la} \qquad (7)$$

Assuming that the latch switching rate $N$ is equal to 0.5, that $\eta = 0.9$ and that the amplitude of the power clock $V_{max}$ is equal to the DC supply voltage $V_{dd}$, condition (7) reduces to

$$C_{ld} \gg 10 \cdot C_{la} \qquad (7a)$$
Condition (7a) is more relaxed than (5), especially for higher $V_{dd}$. However, it is not likely to be satisfied for adiabatic implementation of typical arithmetic architectures, in which the fan-out and load capacitance are small for the majority of the logic gates. Consequently, the non-adiabatic dissipation in latches would dominate in such implementations.
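The step from the energy expressions to condition (7a) can be written out as a short worked derivation. This is a sketch that assumes the latch dissipates CMOS-style energy of the form of Eq. (4), with switching rate $N$ and capacitance $C_{la}$; the symbol $E_{latch}$ is introduced here only for illustration.

```latex
\begin{aligned}
E_{latch} = N \cdot C_{la} \cdot V_{dd}^2
  \;&\ll\; E_{ps,ld} = (1-\eta)\cdot\tfrac{1}{2}\cdot C_{ld}\cdot V_{max}^2 \\
\Longrightarrow\quad C_{ld}
  \;&\gg\; \frac{2N}{1-\eta}\cdot\left(\frac{V_{dd}}{V_{max}}\right)^2 \cdot C_{la} \\
N = 0.5,\ \eta = 0.9,\ V_{max} = V_{dd}:\quad
  \frac{2 \cdot 0.5}{1 - 0.9}\cdot 1^2 &= 10
\end{aligned}
```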
It should be noted that non-adiabatic dissipation is not eliminated by circuit techniques that allow gate pipelining without the use of latches, such as PAL [9] and SCAL [24]. The outputs of such gates are latched dynamically, which also results in dissipation due to erasure of information. In addition, only small logic gates with 2 to 3 inputs are feasible in these circuit techniques due to their inherently poor handling of high fan-in and fan-out. Consequently, gate-pipelined arithmetic architectures involving such gates must consist of a very large number of gates with a substantial overhead in the number of delay-matching gates, causing significant total non-adiabatic dissipation.
For the purpose of comparison between PAL, SCAL and the proposed circuit technique, an adiabatic 15-input counter and a 4-bit adder were implemented in a standard 0.25 µm CMOS process using all three circuit techniques and simulated in HSPICE. The OBDD-style counter operates in a single stage and includes 4 latched logic gates, whereas its PAL and SCAL counterparts operate in 10 stages and include 77 gates each. For the same supply voltage of 1.6 V and the range of operating frequencies between 1 and 50 MHz, the average dissipation per computation for the OBDD-style counter is 2.6–4.2 times less than that of PAL and 2.5–4.4 times less than that of SCAL. The OBDD-style adder operates in a single stage and includes 5 latched logic gates, whereas the PAL and SCAL adders operate in 8 pipelined stages and consist of 56 gates each. Using the same supply voltage and operating frequencies as for the counter, dissipation of the OBDD-style adder is found to be 1.1–1.5 and 1.6–3.7 times less than that of its PAL and SCAL counterparts, respectively.
The obtained results suggest that adiabatic design based on latched OBDD-style gates can achieve better energy efficiency than other adiabatic techniques if the utilization of complex logic gates is high, which means that the number of such gates and the associated latches is small. For this reason, architectural optimizations described in Section 4 are aimed at minimizing the number of latches.
Finally, it should be pointed out that the effect of complexity on the speed of the proposed adiabatic logic gates is less pronounced than in conventional CMOS gates [25]. This is a result of low voltage swing at internal nodes of OBDD-style transistor networks and the use of sense amplifiers at their outputs. HSPICE simulations indicate that the maximum power clock frequencies achievable for the implemented counter based on 15-input gates and the implemented adder based on 9-input gates are 250 MHz and 600 MHz, respectively, for the power clock voltage of 3.5 V, whereas for the power clock voltage of 1 V, the maximum clock frequencies are 72 MHz and 230 MHz, respectively.
4. Architectural Design of Adiabatic Arithmetic Units in General Purpose DSP Systems
In general purpose DSPs, the major sources of energy dissipation are multiplier-accumulator (MAC) units featuring fast parallel multipliers, with a 16 × 16 bit fixed-point multiplication or higher complexity. MAC architectures are typically based on small logic gates implementing Wallace tree [26] or Dadda [27] partial product reduction schemes, usually combined with a Booth encoding algorithm [28] that reduces the initial number of partial products.
In a straightforward approach, an adiabatic parallel multiplier can be designed by substituting each logic gate in a conventional CMOS multiplier with an equivalent adiabatic latched gate. Additional latches are needed to accommodate gate pipelining by providing matched pipeline delay paths. Since the number of latches in a gate-pipelined implementation exceeds the number of combinational logic gates, if latched logic gates performing simple logic functions are used, the latches dominate the overall multiplier area and energy consumption, leading to poor utilization of the proposed circuit technique.
In order to minimize the number of latches and the related non-adiabatic dissipation, multiplier architectures consisting of fewer logic gates must be sought. From that standpoint, an ideal n × n bit multiplier architecture would consist of 2n gates, i.e. of only one gate per output bit. However, the maximum fan-in for these gates would be 2n, hence the gate complexity would limit the feasibility of the ideal architecture to rather small values of n.
In order to estimate the circuit complexity of multipliers achievable by this approach, OBDDs for multiplier output bits were generated for several values of n by an OBDD logic synthesis program. As shown in Fig. 4, the total number of transistors in the OBDD networks approximately triples when n is increased by 1. Therefore, the single stage implementation is only advantageous for small values of n.
Multipliers for which n ≥ 16, as typically used in DSP processors, are not suitable for single-stage architectures due to an impractically large area and input capacitance. Therefore, adiabatic versions of such multipliers need to be pipelined, while keeping the number of logic gates and pipeline stages as low as possible.
A block diagram of a commonly used parallel n × n-bit multiplier architecture, which consists of three stages (the Booth stage, the partial product reduction stage and the carry-propagate adder (CPA)), is shown in Fig. 5. It computes the product

$$P = A \cdot B \qquad (8a)$$

$$A = \sum_{i=0}^{n-1} a_i \cdot 2^i, \qquad (8b)$$

$$B = \sum_{i=0}^{n-1} b_i \cdot 2^i \quad \text{and} \qquad (8c)$$

$$P = \sum_{i=0}^{2n-1} p_i \cdot 2^i \qquad (8d)$$

where $a_i$, $b_i$ ($i = 0, \ldots, n-1$) and $p_i$ ($i = 0, \ldots, 2n-1$) are the bits in the binary representations of A, B and P, respectively.

The most commonly implemented radix-4 modified Booth algorithm [29] reduces the number of partial products for an n × n bit multiplication from $n^2$ to approximately $n^2/2$. The second multiplier stage, typically a Wallace tree architecture, reduces $n^2/2$ binary product terms to 2 output bits. The final, CPA stage performs the two-input addition.
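The halving of the number of partial products by radix-4 modified Booth recoding can be checked with a small behavioural sketch. It assumes two's complement operands and the textbook digit rule $d_k = -2 b_{2k+1} + b_{2k} + b_{2k-1}$ with $b_{-1} = 0$; the function names are hypothetical and the code models arithmetic only, not the circuit.

```python
def to_signed(x, n):
    """Interpret the n-bit pattern x as a two's complement integer."""
    return x - 2 ** n if x >= 2 ** (n - 1) else x


def booth_radix4_digits(b, n):
    """Recode an n-bit multiplier (n even) into n/2 digits from {-2, ..., +2}."""
    bits = [(b >> i) & 1 for i in range(n)]
    digits = []
    for k in range(n // 2):
        b_low = bits[2 * k - 1] if k > 0 else 0   # b_{2k-1}, with b_{-1} = 0
        b_mid = bits[2 * k]                       # b_{2k}
        b_high = bits[2 * k + 1]                  # b_{2k+1}
        digits.append(-2 * b_high + b_mid + b_low)
    return digits


def booth_multiply(a, b, n):
    """Sum the n/2 recoded partial products d_k * A * 4^k."""
    multiplicand = to_signed(a, n)
    return sum(d * multiplicand * 4 ** k
               for k, d in enumerate(booth_radix4_digits(b, n)))


# An 8-bit multiplier yields 4 recoded digits instead of 8 AND-array rows.
print(booth_radix4_digits(0b01101110, 8))
print(booth_multiply(0b00010111, 0b01101110, 8))
```

Each recoded digit selects one of 0, ±A or ±2A, which is why the partial product generator only needs shifts and inversions for radix 4.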
The pipelined multiplier architecture described in this section minimizes the number of latches by utilizing OBDD gates of manageable size. It reduces the number of latches significantly when compared with adiabatic architectures based on small logic gates and almost completely eliminates delay-matching latches.
Figure 4. Number of OBDD transistors for single-stage implementation of n × n-bit multipliers (number of OBDD transistors per multiplier, logarithmic scale, versus n).

Figure 5. A typical parallel multiplier architecture (radix-4 Booth partial product generator, compressor stage (Wallace tree) and CPA).
4.1. Complex Adiabatic Gates for Radix-4 Booth Partial Product Generator
The radix-4 modified Booth partial product generator in Fig. 5 includes the recoder and the multiplexer stages as shown in Fig. 6(a). The inputs to the recoder stage are the multiplier bits $b_i$. The multiplicand bits $a_i$, along with the outputs of the recoder stage, are inputs to the multiplexer stage.
In multipliers built with conventional CMOS gates, each recoder and each multiplexer consists of several logic gates. In a gate-level pipelined adiabatic implementation, a latch is introduced for each one of these gates. In addition, since recoding of the multiplier precedes multiplexing, at least n latches are required to delay the multiplicand bits $a_0, \ldots, a_{n-1}$ to the multiplexer stage. Assuming that there are 3 gates per partial product in the multiplexer stage [30], the total number of latches in such an adiabatic Booth unit is greater than $3n^2/2 + n$, exceeding the $n^2$ latches associated with AND gates in a multiplier without a Booth unit. An alternative, single stage Booth architecture is based on complex OBDD-style gates and involves only $n^2/2$ latches.
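The three latch counts quoted above can be compared directly for a given word length. The sketch below simply evaluates the stated expressions; the function and key names are hypothetical, and the first entry is the lower bound from the text rather than an exact count.

```python
def booth_latch_counts(n):
    """Latch counts for an n x n-bit partial product generator (n even)."""
    return {
        # Lower bound for a gate-pipelined radix-4 Booth unit: 3n^2/2 + n.
        "gate_pipelined_booth_lower_bound": 3 * n * n // 2 + n,
        # AND-gate array without Booth recoding: n^2.
        "and_array_without_booth": n * n,
        # Single-stage OBDD-style Booth gates: n^2/2.
        "single_stage_obdd_booth": n * n // 2,
    }


for name, latches in booth_latch_counts(16).items():
    print(f"{name}: {latches}")
```

For n = 16 this gives at least 400 latches for the gate-pipelined Booth unit, 256 for the plain AND array and 128 for the single-stage OBDD version.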
The logic function performed by such Booth stage gates is equivalent to the OBDD shown in Fig. 6(b). This OBDD has 5 binary inputs: $b_{2k-1}$, $b_{2k}$, $b_{2k+1}$, $a_i$ and $a_{i+1}$. Booth recoding of inputs $b_{2k+1}$, $b_{2k}$, $b_{2k-1}$ is performed in the first three OBDD rows. The nodes in row 4, from left to right, correspond to the recoded values “−2”, “−1”, “0”, “+1” and “+2”, respectively, such that for any combination of $b_{2k+1}$, $b_{2k}$, $b_{2k-1}$, the corresponding OBDD path between the root node and row 4 ends in the node representing the Booth-recoded 3-bit value $b_{2k+1} b_{2k} b_{2k-1}$. The remaining part of the graph below row 4 is equivalent to the multiplexer gate in Fig. 6(a). The subgraphs following nodes “−1” and “+1” in row 4 depend only on input $a_i$, whereas the subgraphs following nodes “−2” and “+2” depend only on $a_{i+1}$. There is no subgraph for node “0”, since the output is already decided for that case. Consequently, this node is connected directly to the OBDD ‘zero’ output “out0”.

Figure 6. Radix-4 modified Booth partial product generator: (a) conventional circuit and (b) OBDD for single-stage implementation.
The circuit implementation of the radix-4 Booth OBDD includes 22 transistors and features a simple topology resulting in a compact layout.
Partial product generation for Booth algorithms of radix-8 and higher involves recoded values such as ±3, for which carry propagate addition needs to be performed. Its single stage implementation is impractically complex since it requires the inclusion of all multiplicand bits $a_{n-1}, \ldots, a_0$ as inputs to the Booth gate.
A compromise solution for single stage implementation of the radix-8 modified Booth algorithm, which avoids the use of a carry-propagate adder as part of the partial product generator in exchange for a certain increase in the number of partial products, was proposed by Bewick [31]. Its single stage implementation using OBDD-style logic gates is described below.
In order to avoid full n-bit addition in computing 3A, the addition

$$3A = A + 2A = \sum_{i=0}^{n-1} a_i \cdot 2^i + \sum_{i=0}^{n-1} a_i \cdot 2^{i+1} \qquad (9a)$$

is broken into sums of two 4-bit inputs as follows:

$$3A = a_0 + \sum_{k=0}^{n/4-1} 2^{4k+1} \cdot \left( \sum_{j=0}^{3} 2^j \cdot (a_{4k+j} + a_{4k+j+1}) \right) \qquad (9b)$$

Since each such sum is generally a 5-bit number, (9b) can be rewritten as

$$3A = a_0 + \sum_{k=0}^{n/4-1} 2^{4k+1} \cdot \left( 2^4 \cdot c_{4k+5} + \sum_{j=0}^{3} 2^j \cdot s_{4k+j+1} \right) \qquad (9c)$$

The lower 4 bits of the k-th such sum, $s_{4k+1}$, $s_{4k+2}$, $s_{4k+3}$ and $s_{4k+4}$, can be generated as functions of the 5 multiplicand bits $a_{4k}$, $a_{4k+1}$, $a_{4k+2}$, $a_{4k+3}$ and $a_{4k+4}$ and used instead of the 3A values for bit positions 4k, 4k + 1, 4k + 2 and 4k + 3, respectively. As shown in Fig. 7(a), the maximum number of OBDD inputs for these 4 bit positions is 9.

Figure 7. Single stage OBDD-style radix-8 Booth encoder: (a) regular partial products and (b) additional partial product.
To preserve the correctness of computation, the output carry bit $c_{4k+5}$ in (9c), which results from the 4-bit addition in (9b), is evaluated by a separate gate and represents an additional partial product. It can have a non-zero value only for the recoded multiplier values of “+3” or “−3”. As shown in Fig. 7(b), the OBDD gate producing this additional partial product, at bit position 4k + 4, also has 9 inputs. Single stage, adiabatic implementation of the radix-8 modified Booth algorithm requires n/4 such partial products, in addition to the $n^2/3$ regular ones. This is a negligible overhead for 16 × 16-bit and larger multipliers, thus making the radix-8 Booth stage an attractive option for reduction of non-adiabatic dissipation. However, the average number of 82 transistors per OBDD gate represents a significant increase compared to the radix-4 Booth stage and creates additional adiabatic dissipation that needs to be compensated by slowing down the system operation.
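The grouping of the 3A addition into independent 4-bit sums with separate carry bits can be verified numerically. The sketch below follows Eqs. (9b) and (9c) for an unsigned n-bit multiplicand with n divisible by 4, and assumes $a_n = 0$ for the bit just above the most significant one; the function name is hypothetical.

```python
def three_a_by_groups(a, n):
    """Rebuild 3A from per-group 4-bit sums s and carry bits c, as in Eq. (9c)."""
    bits = [(a >> i) & 1 for i in range(n)] + [0]   # assumed a_n = 0
    total = bits[0]                                 # the leading a_0 term
    groups = []
    for k in range(n // 4):
        # Inner sum of Eq. (9b): uses only the five bits a_4k .. a_4k+4.
        group_sum = sum(2 ** j * (bits[4 * k + j] + bits[4 * k + j + 1])
                        for j in range(4))
        s = group_sum % 16      # lower 4 bits: regular partial product bits
        c = group_sum // 16     # carry c_{4k+5}: additional partial product
        groups.append((s, c))
        total += 2 ** (4 * k + 1) * (16 * c + s)
    return total, groups


total, groups = three_a_by_groups(0b1011011101101110, 16)
print(total == 3 * 0b1011011101101110, groups)
```

No carry ever crosses a group boundary inside this computation, which is what lets each group be evaluated by gates that see only five multiplicand bits.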
4.2. Adiabatic Architecture of the Partial Product Reduction Stage
The second building block of the parallel multiplier in Fig. 5 is a Wallace tree [26] or Dadda counter [27], consisting of full adders. Compressors of the 4-to-2 type are not considered in this analysis since they inherently consist of two levels of logic and their gate-pipelined version is equivalent to two cascaded full adders.
Typical full adder circuits in low power DSP designs consist of several simpler gates, such as the one shown in Fig. 8(a) [32]. Although only 3 gates are used in this circuit, the gate-pipelined version requires 5 latches. However, full adder circuit implementation without cascading gates is also possible.
For a full adder design that does not include cascaded logic gates, the gate-pipelined version of a Wallace tree is obtained by adding a latch at every full adder output and by inserting a number of delay-matching latches. An example of a Wallace tree bit slice reducing the number of partial products from 15 to 2 is shown in Fig. 8(b). It includes 13 full adders and has 7 pipeline stages. It also involves a total of 31 latches, 5 of which are delay-matching latches. However, the Wallace tree in Fig. 8(b) using the gate-pipelined version of the full adder shown in Fig. 8(a) involves 75 latches and operates in 14 pipeline stages.
An alternative architecture for the Wallace tree shown in Fig. 8(b), utilizing complex counters, is given in Fig. 8(c). It operates in 3 pipeline stages and features a 15:4 counter, two full adders and a total of 9 latches, one of which is inserted to provide delay matching. It reduces the number of latches by a factor of 3.44 compared with the circuit shown in Fig. 8(b) (31 latches) and by a factor of 8.33 compared with the same tree built from the gate-pipelined full adder of Fig. 8(a) (75 latches). The 15:4 counter consists of only four OBDD-style logic gates, whereas the equivalent circuit based on full adders [33] includes 11 full adders and at least 22 logic gates.

Figure 8. Partial product reduction schemes: (a) gate-pipelining of a conventional FA circuit, (b) gate-pipelined, 15-input Wallace tree and (c) alternative tree using complex counter.
The topology for the counter gates is obtained from the topology of a generalized counter graph with multiple outputs, shown in Fig. 9(a). This graph effectively counts logic ‘ones’ among the input bits and each node in row i represents a possible sum of ‘ones’ in the input bits $in_1, in_2, \ldots, in_i$. Since the possible sums range from 0 to i, the number of nodes in the i-th row is i + 1. Therefore, each combination of the n input bits including k ‘ones’ (k ≤ n) corresponds to one of the graph paths that starts at the root node and ends at output node k.
Figure 9. 7-input counter: (a) counter graph topology, (b) OBDDs encoding output bits and (c) layout.
Since counting ones amounts to binary encoding of k, the number of binary-output OBDD gates in the counter unit equals the number of bits in the binary representation of k.
The OBDD for the counter output bit m is obtained from the graph in Fig. 9(a) by merging all output nodes whose label number in binary representation has a “1” at bit position m and by labeling the new output node as “1”. All remaining output nodes are merged into node “0”. Subsequently, Bryant’s reduction algorithm [18] is applied to the OBDD. The resulting OBDDs for the 7-input, 3-output counter [34] are shown in Fig. 9(b).
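The construction of the per-bit counter gates from the counter graph can be mimicked behaviourally. The sketch below walks the graph, merges output nodes by bit position and checks that the gates together encode the number of ones. It is an illustration under stated assumptions: the function names are hypothetical and Bryant's reduction step is not modelled, since it changes the graph size but not the logic function.

```python
from itertools import product


def count_ones_path(bits):
    """Walk the counter graph: a '1' input moves one node to the right in
    the next row, a '0' input keeps the position. The final position is
    the output node k, i.e. the number of ones."""
    position = 0
    for bit in bits:
        if bit:
            position += 1
    return position


def merged_one_outputs(n, m):
    """Output nodes k (0..n) that are merged into terminal '1' for bit m."""
    return [k for k in range(n + 1) if (k >> m) & 1]


def counter_gate(bits, m):
    """Output of the gate for counter bit m, before any OBDD reduction."""
    return 1 if count_ones_path(bits) in merged_one_outputs(len(bits), m) else 0


def number_of_gates(n):
    """Number of bits needed to encode any count from 0 to n."""
    return n.bit_length()


# Exhaustive check of a 7-input, 3-output counter.
for bits in product((0, 1), repeat=7):
    encoded = sum(counter_gate(bits, m) * 2 ** m
                  for m in range(number_of_gates(7)))
    assert encoded == sum(bits)
print(merged_one_outputs(7, 2))   # output nodes merged into '1' for bit 2
```

For 7 inputs this yields three gates and for 15 inputs four gates, consistent with the (7, 3) and 15:4 counters discussed in the text.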
Whereas the OBDD complexity for a general logic function, measured as the number of transistors in the equivalent circuit, increases exponentially, the OBDD complexity of the counter gates increases linearly with the number of inputs. The number of transistors versus the number of inputs for counter output bits 0, 1, 2 and 3 is plotted in Fig. 10. Counter output gates with fan-in as large as 15 have been implemented and experimentally verified.
The regular and local interconnection patterns in the counter OBDDs enable very compact layout of their circuit implementation. Since edges originating from adjacent nodes in one row end in the same node in the following row, all transistors controlled by the same input signal (including the inverted and non-inverted versions) can be connected by abutment and laid out in one row, as shown in Fig. 9(c). This way, the transistor junction capacitance at the internal nodes of the OBDD network is reduced, along with the dissipation necessary for output detection.

Figure 10. Number of OBDD transistors for counter output gates (number of OBDD transistors versus number of input bits, for the bit 0, bit 1, bit 2 and bit 3 OBDDs).
As explained in Section 3, the output detection for OBDD-style transistor networks relies on the differential voltage between the OBDD network output nodes, one of which is discharged to ground, whereas the other retains a small positive voltage. The worst case for output detection occurs when the voltage at the latter node is minimal. This occurs for the combination of input bits that connects this node to the maximum internal capacitance, thus causing maximum voltage drop due to charge sharing.
The maximum internal capacitive load for all counter OBDD networks is listed in Table 1. For each counter size, the maximum worst case load capacitance relative to the total internal capacitance is that for the MSB gate. For all other bit positions, the variation in the internal capacitive load for different input combinations is relatively small. This is explained by the fact that in MSB OBDD networks, all transistors connected to the output node “1” are controlled by the non-inverted input signals, whereas all transistors connected to output node “0” are controlled by the inverted input signals, as shown in Fig. 9(b) for the OBDD corresponding to bit 2 of the (7, 3) counter. If, for example, the inputs to the (7, 3) counter are

(in1, in2, in3, in4, in5, in6, in7) = (0, 0, 0, 0, 1, 1, 1),

a conducting path will be created between output “0” and the root node, whereas output “1” is connected to 12 out of 15 internal nodes since 3 out of 4 transistors connected to it are turned on.
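The charge-sharing load in this example can be reproduced with a small graph model. The sketch below is built on assumptions: the MSB gate of the (7, 3) counter is modelled as a threshold network ("at least 4 ones among 7 inputs"), each OBDD edge is an ideal bidirectional switch, and "internal nodes" are taken to be all network nodes other than the root and the two outputs. All names are hypothetical.

```python
from collections import defaultdict


def msb_counter_network(n, threshold):
    """Edges of an OBDD for 'at least `threshold` ones among n inputs'.

    Nodes are (row, ones_seen) with (0, 0) as root. Each edge is
    (node_a, node_b, input_index, input_value_that_turns_it_on).
    """
    def resolve(row, ones):
        if ones >= threshold:
            return "out1"                   # output already decided as 1
        if threshold > ones + (n - row):
            return "out0"                   # threshold can no longer be reached
        return (row, ones)

    edges, frontier = [], {(0, 0)}
    for row in range(n):
        next_frontier = set()
        for (_, ones) in frontier:
            for value in (0, 1):
                target = resolve(row + 1, ones + value)
                edges.append(((row, ones), target, row, value))
                if isinstance(target, tuple):
                    next_frontier.add(target)
        frontier = next_frontier
    return edges


def connected_internal_nodes(edges, inputs, output):
    """Internal nodes electrically connected to `output` through on-switches."""
    adjacency = defaultdict(set)
    for a, b, index, value in edges:
        if inputs[index] == value:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, stack = {output}, [output]
    while stack:
        for neighbour in adjacency[stack.pop()]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return {v for v in seen if isinstance(v, tuple) and v != (0, 0)}


edges = msb_counter_network(7, 4)
internal = {v for a, b, _, _ in edges for v in (a, b)
            if isinstance(v, tuple) and v != (0, 0)}
loaded = connected_internal_nodes(edges, (0, 0, 0, 0, 1, 1, 1), "out1")
print(len(loaded), "of", len(internal), "internal nodes load output 1")
```

Under this assumed model the count agrees with the 12 out of 15 internal nodes quoted above, and three of the four switches attached to output “1” are on.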
The variation in the internal load capacitance of OBDD output nodes is much smaller for OBDD networks at other bit positions since their output nodes are connected to an equal number of transistors controlled by non-inverted signals and their complements. This way, for all input combinations, both output nodes are connected to approximately equal numbers of internal nodes.
Output detection for the MSB OBDD counter networks can either be achieved by precharging the output nodes with energy sufficient to provide the necessary voltage swing for the worst case input combination or, alternatively, by providing additional energy only when it is needed. The latter solution is illustrated in Fig. 11, in which additional precharged capacitance is connected to an output node of an MSB OBDD network for each turned-on transistor connected to that node. This way, less energy is dissipated, on average, in output detection.
Table 1. Internal capacitance of large counter OBDDs.
OBDD counter    Number of         Maximum capacitive load    Variation of load capacitance
                internal nodes    (number of nodes)          (% of total)
15-input
  bit 3         64                56                         12.5–87.5
  bit 2         80                44                         45–55
  bit 1         52                26                         50
  bit 0         28                14                         50
14-input
  bit 3         56                49                         12.5–87.5
  bit 2         72                40                         44.4–55.6
  bit 1         48                24                         50
  bit 0         26                13                         50
13-input
  bit 3         48                42                         12.5–87.5
  bit 2         64                36                         43.75–56.25
  bit 1         44                22                         50
  bit 0         24                12                         50
12-input
  bit 3         40                35                         12.5–87.5
  bit 2         56                32                         42.9–57.1
  bit 1         40                20                         50
  bit 0         22                11                         50
[Figure 11 schematic; labels: in1 to in7 and their complements, precharge, root, OBDD network, out1, out0.]
Figure 11. Dynamic OBDD charging for MSB of 7-input counter.
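As a rough cross-check of Table 1, the upper end of each variation range can be recomputed as the maximum capacitive load divided by the number of internal nodes. The Python sketch below uses only the bit 3 and bit 2 rows of Table 1; the dictionary layout and the function name are illustrative and not part of the original design flow.

```python
# (counter size, output bit) -> (internal nodes, maximum load in nodes),
# copied from the bit 3 and bit 2 rows of Table 1.
TABLE_1 = {
    (15, 3): (64, 56), (15, 2): (80, 44),
    (14, 3): (56, 49), (14, 2): (72, 40),
    (13, 3): (48, 42), (13, 2): (64, 36),
    (12, 3): (40, 35), (12, 2): (56, 32),
}


def worst_case_load_percent(internal_nodes, max_load_nodes):
    """Maximum load as a percentage of the total internal capacitance,
    assuming every internal node contributes the same capacitance."""
    return 100.0 * max_load_nodes / internal_nodes


if __name__ == "__main__":
    for (size, bit), (nodes, load) in sorted(TABLE_1.items(), reverse=True):
        pct = worst_case_load_percent(nodes, load)
        print(f"{size}-input, bit {bit}: {pct:.2f}% of total")
```

For every MSB (bit 3) row the ratio is 87.5%, whereas the bit 2 rows stay within a few percent of 50%, which is the contrast the text draws between MSB gates and the other bit positions.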
4.3. Adiabatic Architectures for Carry-Propagate Adders
The number of latches involved in the implementation of the carry-propagate adder (CPA) is minimized if the logic function for each of its output bits is implemented as a single complex logic gate. For the n + n bit addition, the number of output latches in such an implementation is n + 1. The differential NMOS network evaluating the arithmetic sum for bit position i (where the LSB position is labelled as 1), which minimizes the transistor height, is shown in Fig. 12(a). It consists of i − 1 carry propagate networks and one 3-input XOR network. The carry propagation network in Fig. 12(a) does not have an OBDD topology, but rather takes advantage of the specific nature of the carry propagation logic function, reducing the total transistor height to i + 1 for the i-th bit position [35], for a fan-in of 2i.
The transistor height of an OBDD-style network performing the same function is 2i, as shown in Fig. 12(b). Given that the number of stacked transistors is the limiting factor for implementation, larger single stage CPAs are achievable using the carry propagation network in Fig. 12(a) than using the network in Fig. 12(b). However, the advantage of the circuit in Fig. 12(b) is that its intermediate nodes represent valid carry bits for bit positions from 1 to i, thus allowing a very compact OBDD Manchester carry adder implementation, as also shown in Fig. 12(b).
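The trade-off between the two networks can be illustrated with a short Python sketch that uses only the two height expressions quoted above (i + 1 and 2i). The stack limit of 9 transistors is a hypothetical value chosen for illustration, and the function names are not taken from the paper.

```python
def height_min_network(i):
    """Transistor height of the minimum-height network for bit position i."""
    return i + 1


def height_obdd_network(i):
    """Transistor height of the OBDD-style network for bit position i."""
    return 2 * i


def max_single_stage_bits(max_stack, height_fn):
    """Largest bit position whose gate fits under the given stack limit."""
    i = 0
    while height_fn(i + 1) <= max_stack:
        i += 1
    return i


if __name__ == "__main__":
    MAX_STACK = 9  # hypothetical limit on the number of stacked transistors
    print(max_single_stage_bits(MAX_STACK, height_min_network))   # 8
    print(max_single_stage_bits(MAX_STACK, height_obdd_network))  # 4
```

Under this assumed limit, the minimum-height network supports an 8-bit single stage adder, while the OBDD-style network supports only 4 bits.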
[Figure 12 schematics; block labels: carry propagate network, OBDD-style carry propagate network, XOR3 network, n-bit adder.]
Figure 12. Carry propagate transistor networks: (a) minimum transistor height network and (b) OBDD-style network.
Adiabatic adders of sizes exceeding the maximum number of stackable transistors have to be implemented in more than one pipeline stage. We propose the carry-select adder architecture as the most appropriate, since it allows two-stage implementation for a wide range of adder sizes. The design of a 2-stage, 32-bit carry-select adder with minimized transistor height is described in the following example.
The 8 most significant bits in the first stage of the 32-bit carry-select adder architecture are shown in Fig. 13(a). At this stage, the sum and the carry output bits for the 8-bit groups 32–25, 24–17 and 16–9 are found for both possible values of the input carry signals c24, c16 and c8, respectively. Assuming that the input carry signal c0 is known at the first stage, only one sum and carry output are required for the group 8–1. The number of latched gates at the first stage is therefore 63. The carry propagate networks used are those shown in Fig. 12(a) and the maximum transistor height for a single gate is 9 (for gates s32_1, s32_0, s24_1, s24_0, s16_1, s16_0 and s8).
[Figure 13 block diagrams: (a) two copies of the 8-bit group with inputs a25..a32 and b25..b32, one for c24 = 0 and one for c24 = 1, each containing an 8-bit carry chain and 1-bit to 8-bit adders producing s25 to s32 and c32; (b) OBDD gate with inputs s32_0, s32_1, c24_0, c24_1, c16_0, c16_1 and c8.]
Figure 13. Adiabatic carry-select adder architecture: (a) first stage, bit group 32–25, and (b) second stage, bit position 32.
OBDD-style gates at the second stage evaluate the adder output si by selecting one of si_0 and si_1, based on the carry signals c8, c16_0, c16_1, c24_0 and c24_1. The block diagram of one such gate, for bit position 32, is shown in Fig. 13(b). The maximum number of inputs per gate at the second stage is 7. The number of latches at this stage is 33, hence the total number of latches for the adder is 96.
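The latch counts quoted for the 32-bit example (63, 33 and 96) can be reproduced with a small Python sketch. It assumes that the adder width is a multiple of the group size, that c0 is known at the first stage, and that the second stage latches the n sum bits plus one final carry; the function names are hypothetical.

```python
def carry_select_latches(n_bits=32, group=8):
    """Return (stage 1, stage 2, total) latch counts for the 2-stage adder.

    Assumes n_bits is a multiple of group and that c0 is known at stage 1.
    """
    groups = n_bits // group
    per_group = group + 1  # `group` sum bits plus one carry output
    # The lowest group is evaluated once; every other group is evaluated
    # for both possible values of its input carry.
    stage1 = per_group + 2 * per_group * (groups - 1)
    stage2 = n_bits + 1    # assumed: n sum bits plus the final carry
    return stage1, stage2, stage1 + stage2


def max_stage2_inputs(n_bits=32, group=8):
    """Inputs of the widest selection gate (needs at least two groups):
    si_0, si_1, the single carry of the lowest group, and a 0/1 carry pair
    from each group between the lowest group and the one being selected."""
    groups = n_bits // group
    return 2 + 1 + 2 * (groups - 2)


if __name__ == "__main__":
    print(carry_select_latches())  # (63, 33, 96)
    print(max_stage2_inputs())     # 7
```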
The described architecture enables practically feasible 2-stage implementations of larger adders, with a moderate increase in the OBDD size. For example, a 2-stage carry-select architecture for a 64-bit adder, based on the same type of circuits and 11-bit groups, involves transistor networks whose transistor height does not exceed 12.
By comparison, the carry-lookahead architecture requires at least 3 stages of logic gates since it takes at least two stages to generate the carry signal for each bit position and one additional stage to generate the final result. Using the gate count for the 3-stage 32-bit adder based on enhanced multiple output domino logic (EMODL) [35], the total number of latches for the carry-lookahead architecture would be 161, which is 67% more latches than needed for the proposed carry-select architecture.
Further, a gate-pipelined ripple-carry (RCA) implementation of an n + n bit adder, consisting of single-stage, latched full-adder gates, would have a latency of n clock cycles and involve a prohibitively large number of (3/2)·n² + (1/2)·n latches, the majority of which would be the delay matching latches. In the case of the 32-bit addition, the number of latches would amount to 1552, thus disqualifying RCA as a candidate architecture for adiabatic implementation.
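The comparison of the three adder architectures reduces to simple arithmetic, sketched below in Python. The RCA formula and the latch counts 96 and 161 are taken from the text; the constant and function names are illustrative.

```python
def rca_latches(n):
    """Latches in a gate-pipelined n + n bit ripple-carry adder:
    (3/2) n^2 + (1/2) n, evaluated in exact integer arithmetic."""
    return (3 * n * n + n) // 2


CARRY_SELECT_LATCHES = 96      # proposed 2-stage, 32-bit carry-select adder
CARRY_LOOKAHEAD_LATCHES = 161  # 3-stage, 32-bit EMODL-based adder [35]

if __name__ == "__main__":
    n = 32
    print(rca_latches(n))                                      # 1552
    print(CARRY_LOOKAHEAD_LATCHES / CARRY_SELECT_LATCHES)      # about 1.68
    print(rca_latches(n) / CARRY_SELECT_LATCHES)               # about 16.2
```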
4.4. Comparison with Previous Designs
In order to assess the overall savings in non-adiabatic dissipation in parallel multipliers, achievable by the proposed architectural optimizations, the gate counts for 16 × 16 and 32 × 32-bit Wallace/Dadda multipliers based on small logic gates [36] were used. The number of latches in adiabatic multipliers obtained from these by replacing each logic gate with an equivalent latched adiabatic gate was estimated to be 15% higher than the total number of logic gates, taking into account the delay matching latches. Also, the number of latches in the multipliers of the same size but based on complex gates, as described in this section, was calculated. A radix-4 Booth single stage partial product generator was used for the 16 × 16-bit multiplier, whereas a radix-8 Booth partial product generator was used for the 32 × 32-bit multiplier. As shown in Table 2, the proposed architectural approach reduces the number of latches by a factor of 8 for the 16 × 16-bit multiplier and by a factor of 10.7 for the 32 × 32-bit multiplier.
For both multiplier sizes considered, the number of latches in the proposed architecture is dominated by the number of latches in the partial product reduction tree. It should be noted that for the example of such a tree shown in Fig. 8(c), the reduction of partial products from 15 to 4 involves 4 latches, whereas the elimination of a further 2 partial products involves 5 more latches. This observation suggests that, for DSP algorithms computing sums of a large number of products, better reduction in the number of latches can be achieved by an application-specific architecture that does not reduce the result of each separate multiplication down to 2 partial products, but rather uses the (15, 4) counter as many times as possible and performs 4 to 2 compression only once, to calculate the final result. The asymptotic minimum number of latches for such an architecture compressing X partial products to 4 using (15, 4) counters is 1.45X. The use of (7, 3) counters would involve a minimum of 2.25X latches.
Table 2. Comparison between conventional and proposed multiplier architectures in adiabatic implementation.
Design                                                     Number of gates    Number of latches
Conventional 16 × 16-bit multiplier [34]                   2569               2920
Proposed 16 × 16-bit adiabatic multiplier architecture     340                364
Conventional 32 × 32-bit multiplier [34]                   10,417             11,980
Proposed 32 × 32-bit adiabatic multiplier architecture     1026               1122
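The latch reduction factors quoted for Table 2 follow directly from the latch counts in that table. A minimal Python sketch, with an illustrative data layout, is:

```python
# Multiplier size -> (conventional latches, proposed latches), from Table 2.
LATCHES = {
    "16 x 16": (2920, 364),
    "32 x 32": (11980, 1122),
}


def reduction_factor(conventional, proposed):
    """Ratio of conventional to proposed latch count."""
    return conventional / proposed


if __name__ == "__main__":
    for size, (conventional, proposed) in LATCHES.items():
        factor = reduction_factor(conventional, proposed)
        print(f"{size}-bit multiplier: {factor:.1f}x fewer latches")
```

The computed ratios are approximately 8.0 and 10.7, in agreement with the factors stated in the text.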
5. 15 × 15-Bit Adiabatic MAC: Design and Implementation
5.1. Specifications
In order to illustrate the design procedure outlined above, an adiabatic multiply-accumulate (MAC) unit, employing high fan-in, OBDD-based counter gates, was designed in a 0.25 µm CMOS process [23]. The following specifications were adopted for the design:
• It was assumed that the MAC is an adiabatic subsystem in a conventional CMOS environment and that its inputs are driven by non-adiabatic CMOS circuits;
• MAC input word lengths were chosen to be 15 bits for the multiplicand and 15 bits for the multiplier in order to take full advantage of the 15:4 compression rate;
• The MAC was intended for applications, such as FIR filtering, where the result is not a single product but rather a sum of multiple products and is required only once at the end of a sequence of multiply-accumulate computations.
The MAC datapath architecture includes three pipeline stages, as shown in Fig. 14. Stage 1 consists of 15 × 15 = 225 adiabatic, two-input AND/NAND gates generating partial products ppij = ai bj for the 15-bit multiplication operands A (A = a14 a13 .. a0) and B (B = b14 b13 .. b0). Inputs ai and bj are assumed to be non-adiabatic, latched signals that are present at the inputs of the AND/NAND gates during the first clock phase. Since there are no latches in this stage, all circuits are energized and de-energized through power clock PWR1 without non-adiabatic losses.
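The function of stage 1 can be sketched behaviourally in Python. The sketch assumes unsigned operands and models only the logic, not the adiabatic circuit; the function names are hypothetical. It also shows why a fan-in of 15 suffices for the counters that follow: no bit weight receives more than 15 partial products.

```python
N = 15  # operand width from the specifications


def partial_products(a, b, n=N):
    """Return pp[i][j] = a_i AND b_j for n-bit unsigned operands."""
    return [[((a >> i) & 1) & ((b >> j) & 1) for j in range(n)]
            for i in range(n)]


def column_heights(n=N):
    """Number of partial product bits with weight 2**k, k = 0 .. 2n - 2."""
    return [min(k, n - 1) - max(0, k - n + 1) + 1 for k in range(2 * n - 1)]


def product_from_partial_products(pp):
    """Sum the partial products with their weights 2**(i + j)."""
    return sum(bit << (i + j)
               for i, row in enumerate(pp)
               for j, bit in enumerate(row))


if __name__ == "__main__":
    a, b = 12345, 23456  # arbitrary 15-bit test operands
    pp = partial_products(a, b)
    print(sum(len(row) for row in pp))            # 225 AND gates
    print(max(column_heights()))                  # 15 inputs at most
    print(product_from_partial_products(pp) == a * b)
```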
Pipeline stage 2 consists of adiabatic gates that perform n-to-4 compression, where n ≤ 15. The bit-slice of stage 2 is a counter circuit with a fan-in of up to 15, producing 4-bit outputs. Each output signal is computed by a separate logic gate, whereas all gates in one counter share the same inputs driven by the first stage. Gates in different bit slices are custom sized for the actual number of partial products generated for the particular bit positions. All gate outputs at stage 2 are latched and the latches are powered by power clock PWR2.
[Figure 14 block diagrams; labels: inputs A (a0 .. a14) and B (b0 .. b14), partial product generator (AND/NAND gates, PWR1), 15-input NMOS trees, 7-input NMOS trees, adiabatic latches, PWR1, PWR2, logic networks, sense amplifiers, adiabatic drivers, bit slices k to k+3, output to CPA.]
Figure 14. Implemented MAC architecture: (a) overall architecture, (b) stage 1, (c) stage 2 and (d) stage 3.
Pipeline stage 3 is the accumulator stage. The accumulator is based on 7-input counter circuits, each one with 3-bit outputs. The outputs are double latched, with the first set of latches powered by PWR1 and the second by PWR2. The second set of latches provides synchronization of the signals in the feedback path with the inputs from stage 2. As illustrated in Fig. 14, all signals connected to a particular accumulator bit slice have the same bit weight. The direct path inputs to the k-th bit slice are driven by circuits at bit slices k, k − 1, k − 2 and k − 3 of stage 2, whereas the inputs in the feedback path are driven by circuits at bit slices k, k − 1 and k − 2 of stage 3.
The throughput of the described architecture is one multiplication per clock cycle. The sum of products is available at the output of stage 3 with a latency of one and a half clock cycles and it is compressed to three signals per bit position. If one additional multiplication with zeroed inputs is performed at the end of the multiply-accumulate sequence, the output of stage 3 is further compressed to 2 bits and only a carry-propagate adder (CPA) is needed to obtain the final result. In applications such as FIR filtering, which compute not a single product but rather a sum of multiple products, the final result is required only at the end of the computation. In such cases, the activity rate of the final CPA is rather low, making its implementation and energy efficiency less critical at the system level.
Figure 15. Chip micrograph.
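The three-stage datapath described in Section 5.1 can be modelled at the bit level with a short Python sketch. It is a behavioural model only, under several assumptions: operands are unsigned, timing and power clocks are ignored, and the number of modelled bit slices (WIDTH) is chosen simply to be wide enough for the test sequence. All function names are hypothetical.

```python
N = 15      # operand width from the specifications
WIDTH = 48  # number of modelled bit slices (assumption, wide enough here)


def stage2_counts(a, b, n=N):
    """Stages 1 and 2: count the partial products of each bit weight."""
    counts = [0] * WIDTH
    for i in range(n):
        for j in range(n):
            counts[i + j] += ((a >> i) & 1) & ((b >> j) & 1)
    return counts  # each entry is at most 15, i.e. a 4-bit value


def stage3_step(acc, counts):
    """Stage 3: one 7-input counter per slice, 4 direct + 3 feedback inputs."""
    new_acc = []
    for k in range(WIDTH):
        # Bit d of the stage 2 counter at slice k - d has weight 2**k.
        direct = [(counts[k - d] >> d) & 1 for d in range(4) if k - d >= 0]
        # Bit d of the stage 3 counter at slice k - d has weight 2**k.
        feedback = [(acc[k - d] >> d) & 1 for d in range(3) if k - d >= 0]
        new_acc.append(sum(direct) + sum(feedback))  # 0 .. 7, a 3-bit value
    return new_acc


def value(acc):
    """Numeric value represented by the per-slice counter outputs."""
    return sum(count << k for k, count in enumerate(acc))


def mac(pairs):
    """Accumulate the products, then flush once with zeroed inputs."""
    acc = [0] * WIDTH
    for a, b in pairs:
        acc = stage3_step(acc, stage2_counts(a, b))
    return stage3_step(acc, stage2_counts(0, 0))


if __name__ == "__main__":
    pairs = [(32767, 32767), (12345, 6789), (1, 2), (0, 5)]
    acc = mac(pairs)
    print(value(acc) == sum(a * b for a, b in pairs))
    print(max(acc) <= 3)  # after the flush each slice holds a 2-bit count
```

In this model the flush step leaves each slice with only its three feedback inputs, so every counter output fits in 2 bits, which matches the reduction to two signals per bit position before the final CPA.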
5.2. Performance Analysis
The described MAC unit was implemented and the chip micrograph is shown in Fig. 15. The MAC is functional for clock frequencies up to 66 MHz, while operated from a 1 V power supply.
The total non-adiabatic dissipation per clock cycle (i.e. per multiplication) is 0.28 pJ, as shown in Table 3, and it is caused by latch/sense amplifier activity. In addition to non-adiabatic dissipation, the MAC requires 4.5 pJ of recoverable energy in order to perform its operation. 10% of that energy, or 0.45 pJ per clock cycle, is lost due to the internal dissipation in the power supply. Therefore, the total energy consumption per multiplication for the adiabatic MAC and the power supply is 0.28 pJ + 0.45 pJ = 0.73 pJ. Compared with a conventional CMOS implementation of an equivalent MAC using the same process [32], the adiabatic MAC described in this paper consumes 23 times less energy per computation. The comparison between the two units is listed in Table 4.
Table 3. MAC performance analysis.
Power supply                                0–1 V adiabatic
Maximum clock frequency                     66 MHz
Energy efficiency (per multiplication):
  Energy used                               0.28 pJ + 4.5 pJ
  Energy recovered                          4.5 pJ
  Non-adiabatic dissipation                 0.28 pJ
  Power supply dissipation                  0.1 × 4.5 pJ = 0.45 pJ
  Total dissipation                         0.28 pJ + 0.45 pJ = 0.73 pJ
Table 4. MAC performance comparison.
                              CMOS        Adiabatic MAC (conventional    This design
                                          MAC architecture)
Total dissipation             17.6 pJ     1.57 pJ                        0.73 pJ
Non-adiabatic dissipation     N/A         1.12 pJ                        0.28 pJ
Number of latches             N/A         773                            190
Number of pipeline stages     2           8                              3
Latency                       2 cycles    4 cycles                       1.5 cycles
Maximum frequency             100 MHz     66 MHz                         66 MHz
Number of transistors         58000       14500                          10450
Finally, to demonstrate the advantage of the proposed architecture in adiabatic implementation, the implemented architecture was compared with an alternative adiabatic architecture in which Wallace tree compression is performed using full-adder (FA) gates. The characteristics of the alternative architecture are also listed in Table 4. The FA-based architecture includes 773 latches, whereas the proposed architecture includes only 190 latches, reducing the related non-adiabatic dissipation by a factor of 4. The total charged capacitance per clock cycle is approximately the same for the two architectures. It is dominated by latch capacitance in the case of the FA-based architecture, whereas in the proposed architecture, the combinational circuit capacitance dominates. Generally, the proposed architecture is more energy efficient than the FA-based one and this advantage is more pronounced for higher power supply efficiency. The proposed architecture also has lower latency since it operates in 3 pipeline stages, compared to 8 for the FA-based one.
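The energy figures in Tables 3 and 4 can be recombined with a few lines of Python. The sketch assumes, as the text states only approximately, that both adiabatic architectures circulate the same 4.5 pJ of recoverable energy; the loss fraction is treated as a free parameter to illustrate the dependence on power supply efficiency. Names are illustrative.

```python
RECOVERABLE_PJ = 4.5             # recoverable energy per multiplication
NON_ADIABATIC_PROPOSED_PJ = 0.28
NON_ADIABATIC_FA_PJ = 1.12
CMOS_TOTAL_PJ = 17.6


def total_dissipation_pj(non_adiabatic_pj, loss_fraction,
                         recoverable_pj=RECOVERABLE_PJ):
    """Non-adiabatic loss plus the part of the recoverable energy that is
    dissipated in the power supply."""
    return non_adiabatic_pj + loss_fraction * recoverable_pj


if __name__ == "__main__":
    proposed = total_dissipation_pj(NON_ADIABATIC_PROPOSED_PJ, 0.10)
    fa_based = total_dissipation_pj(NON_ADIABATIC_FA_PJ, 0.10)
    print(f"proposed: {proposed:.2f} pJ, FA-based: {fa_based:.2f} pJ")
    print(f"CMOS / proposed: {CMOS_TOTAL_PJ / proposed:.1f}")
    # Advantage over the FA-based design for different supply losses.
    for loss in (0.0, 0.05, 0.10, 0.20):
        ratio = (total_dissipation_pj(NON_ADIABATIC_FA_PJ, loss)
                 / total_dissipation_pj(NON_ADIABATIC_PROPOSED_PJ, loss))
        print(f"supply loss {loss:.0%}: FA-based / proposed = {ratio:.2f}")
```

With a 10% supply loss the sketch reproduces the 0.73 pJ and 1.57 pJ totals of Table 4, and the ratio of the two Table 4 totals for CMOS and this design comes out at roughly 24, of the same order as the factor quoted in the text. With an ideal supply the ratio between the two adiabatic architectures approaches the factor of 4 in non-adiabatic dissipation, which illustrates why the advantage grows with power supply efficiency.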
6. Conclusions
Issues related to architectural design of parallel multipliers for adiabatic implementation have been analyzed in this paper. Non-adiabatic dissipation in latches associated with adiabatically driven full-swing signals was identified as a lower bound on the overall energy consumption. It has been shown that a significant improvement in achievable energy efficiency of adiabatic arithmetic units can be made by using complex logic gates as building blocks for such units, rather than small gates that typically comprise equivalent CMOS designs. This way, the number of full-swing signals and the associated latches causing non-adiabatic dissipation is reduced. In addition, such an architectural design approach minimizes the pipeline depth of the inherently gate-pipelined adiabatic systems.
A circuit technique enabling logic design and output detection for complex logic gates has been developed. Logic design based on ordered binary decision diagrams (OBDD) was used to achieve circuit compaction and design automation. Custom built CAD tools for OBDD-style logic synthesis and layout were developed and used in the design of the prototype chip. Special attention was given to the design and analysis of high fan-in counter gates featuring high computational efficiency due to their linear complexity and regular topology.
A low power OBDD output detection scheme involving sense amplifiers was developed. The proposed logic gate design achieves operation at clock speeds comparable to those typically used in DSP systems. In addition, the proposed circuit style allows low voltage operation, which also boosts energy efficiency. The worst case analysis of internal capacitive load for the sense amplifier was performed for the counter gates. The highest capacitive load that represents the worst case for output detection occurs for counter MSB gates. A conditional output precharging technique for such OBDD networks is proposed to minimize the average energy required for output detection.
A parallel multiplier architecture was developed that minimizes non-adiabatic dissipation and its advantage over alternative architectures was demonstrated. A multiply-accumulate (MAC) unit based on counter gates with up to 15 inputs was designed and implemented in a 0.25 µm CMOS process. This design was found to be 27 times more energy efficient than an equivalent conventional design and 4 times more energy efficient than an alternative adiabatic architecture consisting of smaller gates.
Acknowledgment
The work was supported by NSERC, Micronet, Gennum, Mitel, Nortel Networks and PMC Sierra.
References
1. A.P. Chandrakasan, S. Sheng, and R.W. Brodersen, “Low-Power CMOS Digital Design,” IEEE Journal of Solid-State Circuits, vol. 27, 1992, pp. 473–483.
2. W. Athas, L.J. Svensson, J.G. Koller, N. Tzartzanis, and E. Chou, “Low-Power Digital Systems Based on Adiabatic Switching Principles,” IEEE Transactions on VLSI Systems, vol. 2, 1994, pp. 398–407.
3. R. Landauer, “Irreversibility and Heat Generation in the Computing Process,” IBM Journal of Research and Development, vol. 5, 1961, pp. 183–191.
4. J.S. Denker, “A Review of Adiabatic Computing,” in Symposium on Low Power Electronics Proceedings, 1994, pp. 94–97.
5. C.H. Bennett, “Logical Reversibility of Computation,” IBM Journal of Research and Development, vol. 6, 1973, pp. 525–532.
6. A.G. Dickinson and J.S. Denker, “Adiabatic Dynamic Logic,” IEEE Journal of Solid-State Circuits, vol. 30, 1995, pp. 311–315.
7. A. Kramer, J.S. Denker, S.C. Avery, A.G. Dickinson, and T.R. Wik, “Adiabatic Computing with the 2N-2N2D Logic Family,” in IEEE Symposium on VLSI Circuits, 1994.
8. D. Maksimovic, V.G. Oklobdzija, B. Nikolic, and K.W. Current, “Clocked CMOS Adiabatic Logic with Integrated Single-Phase Power-Clock Supply: Experimental Results,” in ISLPED Proceedings, 1997, pp. 323–327.
9. V.G. Oklobdzija, D. Maksimovic, and F. Lin, “Pass-Transistor Adiabatic Logic Using Single Power-Clock Supply,” IEEE TCAS II: Analog and Digital Signal Processing, vol. 44, 1997, pp. 842–846.
10. Y. Moon and D.-K. Jeong, “An Efficient Charge Recovery Logic Circuit,” IEEE Journal of Solid-State Circuits, vol. 31, 1996, pp. 514–522.
11. K.T. Lau and F. Liu, “Improved Adiabatic Pseudo-Domino Logic,” Electronics Letters, vol. 33, 1997, pp. 1982–1983.
12. J. Lim, K. Kwon, and S.-I. Chae, “Reversible Energy Recovery Logic Circuit Without Non-Adiabatic Energy Loss,” Electronics Letters, vol. 34, 1998, pp. 344–345.
13. M.C. Knapp, P.J. Kindlmann, and M.C. Papaefthymiou, “Implementing and Evaluating Adiabatic Arithmetic Units,” in CICC Proceedings, 1996, pp. 115–118.
14. R.C. Merkle, “Reversible Electronic Logic Using Switches,” Nanotechnology, vol. 4, 1993, pp. 21–40.
15. J. Lim, D.G. Kim, and S.I. Chae, “A 16-bit Carry-Lookahead Adder Using Reversible Energy Recovery Logic for Ultra-Low-Energy Systems,” IEEE J. Solid-State Circuits, vol. 34, 1999, pp. 898–903.
16. W.C. Athas, N. Tzartzanis, L.J. Svensson, and L. Peterson, “A Low-Power Microprocessor Based on Resonant Energy,” IEEE Journal of Solid-State Circuits, vol. 32, 1997, pp. 1693–1701.
17. W.C. Athas, N. Tzartzanis, W. Mao, R. Lal, K. Chong, L. Peterson, and M. Bolotski, “Clock-Powered CMOS VLSI Graphics Processor for Embedded Display Controller Application,” in ISSCC Proceedings, 2000, pp. 296–297.
18. R.E. Bryant, “Graph-Based Algorithms for Boolean Function Manipulation,” IEEE Transactions on Computers, vol. C-35, 1986, pp. 677–691.
19. D. Maksimovic and V.G. Oklobdzija, “Integrated Power Clock Generators for Low-Energy Logic,” in IEEE Power Electronic Specialists Conference Proceedings, 1995, pp. 61–67.
20. W. Athas, L. Svensson, and N. Tzartzanis, “A Resonant Signal Driver for Two-Phase, Almost Non-overlapping Clocks,” in ISCAS Proceedings, 1996, pp. 129–132.
21. P. Zhou, J.C. Czilli, G.A. Jullien, and W.C. Miller, “Current Input TSPC Latch for High Speed, Complex Switching Trees,” in ISCAS Proceedings, 1994, pp. 335–338.
22. G.A. Jullien, W.C. Miller, R. Grondin, L. Del Pup, S.S. Bizzan, and D. Zhang, “Dynamic Computational Blocks for Bit-Level Systolic Array,” IEEE Journal of Solid-State Circuits, vol. 29, 1994, pp. 14–22.
23. D. Suvakovic and C.A.T. Salama, “A Pipelined Multiply-Accumulate Unit Design for Energy Recovery DSP Systems,” in ISCAS Proceedings, 2000.
24. S. Kim and M.C. Papaefthymiou, “True Single-Phase Adiabatic Circuitry,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 9, 2001, pp. 52–63.
25. K.W. Martin, Digital Integrated Circuit Design, New York: Oxford University Press, 2000.
26. C.S. Wallace, “A Suggestion for a Fast Multiplier,” IEEE Transactions on Computers, vol. EC13, 1964, pp. 14–17.
27. L. Dadda, “Some Schemes for Parallel Multipliers,” Alta Frequenza, vol. 34, 1965, pp. 349–356.
28. A.D. Booth, “A Signed Binary Multiplication Technique,” Quarterly J. Mechan. Appl. Math., vol. IV, 1951.
29. O.L. MacSorley, “High-Speed Arithmetic in Binary Computations,” IRE Proc., vol. 49, 1961, pp. 67–91.
30. D. Villeger and V.G. Oklobdzija, “Analysis of Booth Encoding Efficiency in Parallel Multipliers Using Compressors for Reduction of Partial Products,” in The Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, 1993, pp. 781–784.
31. G. Bewick, “Fast Multiplication: Algorithms and Implementation,” Ph.D. Thesis, Stanford University, 1994.
32. M. Izumikava et al., “A 0.25-µm CMOS 0.9-V 100-MHz DSP Core,” IEEE J. Solid-State Circuits, vol. 32, 1997, pp. 52–61.
33. E.E. Swartzlander, Jr., “Parallel Counters,” IEEE Transactions on Computers, vol. c-22, 1973, pp. 1021–1024.
34. P.J. Song and G. De Micheli, “Circuit and Architecture Trade-offs for High-Speed Multiplication,” IEEE J. Solid-State Circuits, vol. 26, 1991, pp. 1184–1198.
35. Z. Wang, G.A. Jullien, W.C. Miller, J. Wang, and S.S. Bizzan, “Fast Adders Using Enhanced Multiple-Output Domino Logic,” IEEE J. Solid-State Circuits, vol. 32, 1997, pp. 206–214.
36. T.K. Callaway and E.E. Swartzlander, “Optimizing Arithmetic Elements for Signal Processing,” in Workshop on VLSI Signal Processing Proceedings, 1992, pp. 91–100.
Dusan Suvakovic received his B.S., M.S. and M.A.Sc. degrees in Electrical Engineering from the University of Novi Sad, Yugoslavia in 1988, University of Belgrade, Yugoslavia in 1992 and University of Toronto in 1998, respectively. He is currently working towards the completion of his Ph.D. thesis at the University of Toronto. His research interests are in the area of low energy DSP design as well as low-power, high-speed digital circuits. From 1988 to 1995, he was a research associate at M. Pupin Institute, Belgrade and a design engineer at Perle Systems, Markham, Ontario and Mark IV Industries, Mississauga, Ontario. In December 2001, he joined Bell Laboratories, Lucent Technologies, Murray Hill, NJ as a member of technical staff.
C. Andre T. Salama received the B.A.Sc. (Hons.), M.A.Sc. and Ph.D. degrees, all in Electrical Engineering, from the University of British Columbia in 1961, 1962 and 1966 respectively.
From 1962 to 1963 he served as a Research Assistant at the University of California, Berkeley. From 1966 to 1967 he was employed at Bell Northern Research, Ottawa, as a Member of Scientific Staff working in the area of integrated circuit design. Since 1967 he has been on the staff of the Department of Electrical and Computer Engineering, University of Toronto where he held the J.M. Ham Chair in Microelectronics from 1987 to 1997. In 1992, he was appointed to his present position of University Professor for scholarly achievements and preeminence in the field of microelectronics. In 1989–90, he was awarded the ITAC/NSERC Research Fellowship in information technology. In 1994, he was awarded the Canada Council I.W. Killam Memorial Prize in Engineering for outstanding career contributions to the field of microelectronics. In 2000, he received the IEEE Millennium Medal.
He was associate editor of the IEEE Transactions on Circuits and Systems in 1986–88 and a member of the International Electron Devices Meeting (IEDM) Technical Program Committee in 1980–82, 1987–89 and 1996–98. He was the chair of the Solid State Devices Subcommittee for IEDM in 1998 and is a member of the editorial board of Solid State Electronics, the Analog IC and Signal Processing Journal and the Technical Program Committee of the International Symposium on Power Semiconductor Devices and ICs (ISPSD). He chaired the technical program committee of ISPSD in 1996 and was the general chair for the conference in 1999.
Dr. Salama is the Scientific Director of Micronet, a network of centres of excellence focussing on microelectronics research and funded by the Canadian Government. He is also a principal investigator for Communications and Information Technology Ontario, a centre of excellence funded by the Province of Ontario.
He has published extensively in technical journals, is the holder of eleven patents and has served as a consultant to the semiconductor industry in Canada and the U.S. His research interests include the design and fabrication of semiconductor devices and integrated circuits with emphasis on deep submicron devices as well as circuits and systems for high speed, low power signal processing applications.
Dr. Salama is a Fellow of the Institute of Electrical and Electronics Engineers, a Fellow of the Royal Society of Canada, a member of the Association of Professional Engineers of Ontario, the Electrochemical Society and the Innovation Management Association of Canada.