A High-Speed Low-Power Modulo 2^n+1 Multiplier
Design Using Carbon-Nanotube Technology
A Thesis Presented
by
He Qi
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirement
for the degree of
Master of Science
in
Electrical Engineering
in the field of
Electronic Circuits and Semiconductor Devices
Northeastern University
Boston, Massachusetts
April, 2012
© Copyright 2012 by He Qi
All Rights Reserved
NORTHEASTERN UNIVERSITY Graduate School of Engineering
Thesis Title: A High-Speed Low-Power Modulo 2^n+1 Multiplier Design Using Carbon-Nanotube Technology.
Author: He Qi.
Department: Department of Electrical and Computer Engineering.
Approved for Thesis Requirements of the Master of Science Degree
____________________________________________ ______________________
Thesis Advisor: Prof. Yong-Bin Kim Date
____________________________________________ ______________________
Thesis Reader: Prof. Fabrizio Lombardi Date
____________________________________________ ______________________
Thesis Reader: Prof. Minsu Choi Date
____________________________________________ ______________________
Department Chair: Prof. Ali Abur Date
Graduate School Notified of Acceptance:
____________________________________________ ______________________
Director of the Graduate School Date
Graduate School Notified of Acceptance:
____________________________________________ ______________________
Dean: Prof. Sara Wadia-Fascetti Date
Copy Deposited in Library:
____________________________________________ ______________________
Reference Librarian Date
Abstract
The modulo 2^n+1 multiplier is one of the critical components in digital signal processing, residue
arithmetic, and data encryption, all of which demand high-speed and low-power operation. In this thesis,
a new circuit implementation of a high-speed low-power modulo 2^n+1 multiplier is proposed. It has three
major stages: the partial product generation stage, the partial product reduction stage, and the final
adder stage. The major technical contribution of the thesis is that the partial product reduction stage
introduces a new MUX-based compressor to reduce power and increase speed. Secondly, in the final
adder stage, the sparse-tree based inverted end-around-carry adder reduces the number of circuit blocks
on the critical path. Finally, the proposed design is implemented using both 32nm CNTFET
(Carbon-Nanotube FET) and bulk CMOS technology for comparison. The CNTFET-based design dramatically
decreases the PDP (Power Delay Product) of the circuit. The simulation results demonstrate that the
MUX-based compressor reduces the PDP of the partial product reduction stage by 4.24 times compared to the
traditional full-adder based design. The sparse-tree architecture solves the wire interconnection problem
while slightly reducing the PDP of the final adder stage compared to the Kogge-Stone design. The power
consumption of the CNTFET-based multiplier is on average 5.72 times less than that of its conventional
bulk CMOS counterpart, while the PDP of the CNTFET design is 94 times less than that of the CMOS one.
The proposed multiplier circuit and its implementation demonstrate the viability of the ultra-low-power
and high-performance features of the promising CNTFET technology.
Index Terms
Modulo 2^n+1 Multiplier, MUX-based Compressor, Sparse-tree Adder, Carbon-Nanotube Technology
Acknowledgements
First of all, I would like to thank Prof. Yong-Bin Kim, my research advisor. His constructive suggestions
and encouragement led me to make progress in my master's research. In addition, his guidance helped
me realize where my passion lies and which research area I will concentrate on in the future.
Thank you so much! I would also like to thank the members of the committee for reviewing my research
results and offering valuable advice.
He Qi
Boston, MA
For my parents
CONTENTS
ABSTRACT
ACKNOWLEDGEMENTS
I. INTRODUCTION
1.1 BACKGROUND
1.2 PROBLEM AND WORK STATEMENT
1.3 OUTLINE OF THE THESIS
II. ALGORITHM
2.1 ALGORITHM OF THE PARTIAL PRODUCT GENERATION STAGE
2.2 ALGORITHM OF THE PARTIAL PRODUCT REDUCTION STAGE
2.3 ALGORITHM OF THE FINAL ADDITION STAGE
2.4 AN EXAMPLE
III. CIRCUIT IMPLEMENTATION
3.1 CIRCUIT DESIGN OF THE PARTIAL PRODUCT GENERATION STAGE
3.2 CIRCUIT DESIGN OF THE PARTIAL PRODUCT REDUCTION STAGE
3.2.1 INTRODUCTION OF A MUX-BASED COMPRESSOR DESIGN
3.2.2 DIFFERENT TYPES OF THE MUX-BASED COMPRESSORS
3.2.2.1 Circuit Design of the 3:2 Compressor
3.2.2.2 Circuit Design of the 4:2 Compressor
3.2.2.3 Circuit Design of the 5:2 Compressor
3.2.2.4 Circuit Design of the 7:2 Compressor
3.2.3 DETAILED SUBCIRCUIT DESIGN OF THE COMPRESSORS
3.2.3.1 MUX Subcircuit Design
3.2.3.2 Complementary MUX Subcircuit Design
3.2.3.3 XOR-XNOR Subcircuit Design
3.2.3.4 CGEN Subcircuit Design
3.2.4 THE ARCHITECTURE OF THE PARTIAL PRODUCT REDUCTION STAGE
3.2.4.1 Architecture Designed for an 8-bit Modulo 2^n+1 Multiplier
3.2.4.2 Architecture Designed for a 16-bit Modulo 2^n+1 Multiplier
3.3 CIRCUIT DESIGN OF THE FINAL ADDITION STAGE
IV. SIMULATION RESULTS OF THE PROPOSED DESIGN AND TECHNOLOGY COMPARISON
4.1 PERFORMANCE COMPARISON BETWEEN THE FULL ADDER BASED COMPRESSOR AND THE MUX BASED COMPRESSOR
4.2 SIMULATION RESULTS OF DIFFERENT COMPRESSOR ARCHITECTURES IN THE PARTIAL PRODUCT REDUCTION STAGE
4.3 SIMULATION RESULTS OF THE SPARSE-TREE ARCHITECTURE AND THE KOGGE-STONE ARCHITECTURE
4.4 SIMULATION RESULTS OF THE CNT-BASED DESIGN AND THE BULK CMOS-BASED DESIGN
4.4.1 FEATURES OF THE CNT TECHNOLOGY
4.4.2 POWER, DELAY AND AREA
4.4.3 PVT VARIATION
V. CONCLUSION
REFERENCE
APPENDIX: HSPICE INPUT FILES
List of Figures
Fig. 1 Initial Partial Product Matrix
Fig. 2 Modified Partial Product Matrix
Fig. 3 Final n × n Partial Product Matrix
Fig. 4 8-bit Kogge-Stone Adder
Fig. 5 16-bit Kogge-Stone Adder
Fig. 6 8-bit Kogge-Stone Diminished-1 Adder
Fig. 7 Revised Diminished-1 Kogge-Stone Adder with Stages
Fig. 8 16-bit Kogge-Stone Adder with Sparsity of 4
Fig. 9 Inverted EAC Adder with Sparsity of 4
Fig. 10 Inverted EAC Adder with Sparsity of 4 in Stages
Fig. 11 the Initial Output of the Partial Product Generation Stage
Fig. 12 the n×n Partial Product Matrix
Fig. 13 the Final Partial Product Matrix with the Correction Factor
Fig. 14 the Initial Output of the Partial Product Reduction Stage
Fig. 15 the Output of the Partial Product Reduction Stage after Repositioning
Fig. 16 Proposed Inverter
Fig. 17 Nand Gate with 2 Inputs
Fig. 18 Nor Gate with 2 Inputs
Fig. 19 Traditional Design of the Partial Product Reduction Stage
Fig. 20 A New Design of the Partial Product Reduction Stage
Fig. 21 Traditional MUX-based Design of the 3:2 Compressor
Fig. 22 A New MUX-based Design of the 3:2 Compressor
Fig. 23 Traditional MUX-based Design of the 4:2 Compressor
Fig. 24 A New MUX-based Design of the 4:2 Compressor
Fig. 25 Existing Architectures of the 5:2 Compressor
Fig. 26 A New MUX-based Design of the 5:2 Compressor
Fig. 27 A New MUX-based Design of the 7:2 Compressor
Fig. 28 Original Design of the MUX Subcircuit
Fig. 29 Modified Design of the MUX Subcircuit
Fig. 30 Proposed Design of the MUX Subcircuit
Fig. 31 Existing Designs of the Complementary-output MUX Subcircuit
Fig. 32 Proposed Design of the Complementary-output MUX Subcircuit
Fig. 33 Original Design of the XOR-XNOR Subcircuit
Fig. 34 Modified Designs of the XOR-XNOR Subcircuit
Fig. 35 Proposed Design of the XOR-XNOR Subcircuit
Fig. 36 Proposed Design of the CGEN Subcircuit
Fig. 37 Possible Compressor Architectures for an 8-bit Modulo 2^n+1 Multiplier
Fig. 38 Possible Compressor Architectures for a 16-bit Modulo 2^n+1 Multiplier
Fig. 39 the 4-bit Conditional Sum Generator
Fig. 40 Delay of the Full Adder Based Compressor
Fig. 41 Delay of the MUX Based Compressor
Fig. 42 Critical Path Delay of the Sparse-tree Adder
Fig. 43 Noncritical Path Delay of the Sparse-tree Adder
Fig. 44 Critical Path Delay of Kogge-Stone Adder
Fig. 45 Delay and Rise-time of the Proposed Multiplier Based on CMOS Technology
Fig. 46 Delay and Rise-time of the Proposed Multiplier Based on CNTFET Technology
Fig. 47 Power Consumption of the Proposed Multiplier Based on Two Technologies
Fig. 48 Temperature Variation
Fig. 49 Voltage Variation
Fig. 50 Process Variation
List of Tables
Table 1 Truth Table of the CGEN Subcircuit
Table 2 Comparison between the Kogge-Stone Adder and the Sparse-tree Adder
Table 3 Performance Comparison between the Full Adder Based Compressor and the MUX-based Compressor
Table 4 Performance and Power Comparison between Different Types of Compressors
Table 5 Performance and Power Comparison among Different Compressor Architectures for an 8-bit Modulo 2^n+1 Multiplier
Table 6 Performance and Power Comparison among Different Compressor Architectures for a 16-bit Modulo 2^n+1 Multiplier
Table 7 Performance and Power Comparison between the Kogge-Stone Architecture and the Sparse-tree Architecture
Table 8 Performance Comparison between the Proposed Multiplier Based on Two Different Technologies
Table 9 Delay Comparison between Two Technologies with Different Temperatures
Table 10 Rise-time Comparison between Two Technologies with Different Temperatures
Table 11 Delay Comparison between Two Technologies with Different Supply Voltages
Table 12 Delay Comparison between Two Technologies with Different Process Corners
Table 13 Rise-time Comparison between Two Technologies with Different Process Corners
I. Introduction
1.1 BACKGROUND
Modulo arithmetic is used in many areas. In cryptography, it is the foundation of public-key
systems and is used in a number of symmetric-key algorithms such as the International Data
Encryption Algorithm (IDEA) and the Advanced Encryption Standard (AES). A variety of modulo
operations also appear in computer science, such as the XOR operation in programming languages.
Modulo arithmetic even has applications in music and chemistry, such as the modulo 12 operations
used in electronic instruments to implement twelve-tone equal temperament. It is also frequently
used in the fault-tolerant design of ad-hoc networks and in digital and linear convolution
architectures [1]. In recent years, information security, especially the confidentiality of data
transmitted through communication channels, has become more and more important as the internet
has grown in popularity and maturity, which gives cryptography a significant role in the
information age. Modulo 2^n and modulo 2^n+1 multipliers are key blocks in circuit implementations
of cryptographic algorithms such as IDEA [1].
The residue number system (RNS) is another important application of modulo arithmetic. In
recent years, the RNS has been widely used in arithmetic computation and signal processing
applications such as fast Fourier transforms, digital filtering, and image processing [2]. The RNS
is popular because it decomposes a large integer into several small integers, so that one
large-integer calculation becomes several small-integer calculations performed in parallel. This
effectively increases the operating speed [3]. Among popular moduli sets, (2^n-1, 2^n, 2^n+1) draws
the most attention and has been studied for several decades because of its easy conversion
between binary and residue representations. Such conversion is based on the conventional Chinese
remainder theorem [2]. The modulo 2^n-1 and modulo 2^n operations take n-bit wide inputs, while
the modulo 2^n+1 operation takes (n+1)-bit wide inputs [1]. This makes the modulo 2^n+1 block the
more difficult and complex one to implement, and it has therefore received much attention.
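The decomposition described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names (rns_moduli, to_rns, from_rns, rns_mul) are hypothetical and do not come from the cited works, and the reconstruction uses the textbook Chinese remainder theorem rather than any optimized converter.

```python
from math import prod  # Python 3.8+


def rns_moduli(n):
    """Moduli set (2^n - 1, 2^n, 2^n + 1); pairwise coprime for n >= 2."""
    return (2**n - 1, 2**n, 2**n + 1)


def to_rns(x, moduli):
    """Binary-to-residue conversion: one small residue per modulus."""
    return tuple(x % m for m in moduli)


def from_rns(residues, moduli):
    """Residue-to-binary conversion with the Chinese remainder theorem."""
    big_m = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        m_i = big_m // m
        x += r * m_i * pow(m_i, -1, m)  # pow(..., -1, m) is the modular inverse
    return x % big_m


def rns_mul(x, y, moduli):
    """Multiply channel by channel; each channel only sees small operands."""
    rx, ry = to_rns(x, moduli), to_rns(y, moduli)
    return tuple((a * b) % m for a, b, m in zip(rx, ry, moduli))


moduli = rns_moduli(4)                      # (15, 16, 17), dynamic range 4080
product = from_rns(rns_mul(50, 60, moduli), moduli)
```

The result is only unique while the true product stays below the dynamic range (2^n-1)·2^n·(2^n+1), which is why the example operands are kept small.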
Many architectures and circuit implementations of the modulo 2^n+1 block have been proposed and
compared in the past decades. In Curiger's work [4], three multiplication architectures are
proposed. The first architecture uses an (n+1) × (n+1)-bit multiplier followed by modulo adders
to correct errors caused by the carry. The second architecture takes advantage of a modulo 2^n+1
adder, where the multiplier consists of a carry-save adder and a final carry-select addition unit
to reduce design complexity [1]. The third architecture modifies the second by correcting errors
in the carry-select adder. Furthermore, the circuit area is significantly reduced and the
operating speed is increased by introducing a bit-pair recoding scheme in the carry-save adder
block [4]. Although the last two architectures are suitable for full-custom design [1], they
increase not only layout and fabrication complexity but also design challenges.
In the work of Hiasat [5], a very high speed modulo (2^n+1) multiplier is proposed. The circuit
implementation uses a binary multiplier stage, an adder stage, and a combination of several logic
gates. The main contributions of his work are a reduced hardware requirement and support for
very large dynamic ranges.
Later, in the work of Wrzyszcz and Milford [6], a new partial product matrix is introduced to
reduce the design and hardware complexity of the previous designs while adding very little
hardware overhead. Furthermore, their design yields a regular VLSI layout, since the whole
structure is composed almost entirely of full adders and half adders, which also improves the
parallel computing performance, speed, and maximum operating frequency. Finally, since the same
periodic property occurs in every row of the partial product array, only bits with weight less
than 2^n remain to compose the final (n+1) × (n+1) partial product matrix after the reposition
computation. The correction process is also easy to realize because of these characteristics.
In the work of Zimmermann [7], a new implementation of the modulo (2^n+1) multiplier is
proposed, which has three major parts: a modulo-reduced partial product generation block, a
modulo carry-save adder, and a modulo final adder. To implement the final modulo addition, a
fast and simple end-around-carry adder is needed. Zimmermann introduces a new parallel prefix
adder to realize this function, which dramatically increases the operating speed. Furthermore,
conventional Booth recoding in the partial product generation stage and a Wallace tree structure
in the final adder stage can also be used to speed up Zimmermann's algorithm. Also, the highly
regular structure of this implementation reduces the complexity of the layout process and is very
suitable for VLSI implementation and modularization. Chaves and Sousa [8] later realized
Zimmermann's idea. The Booth recoder and the Wallace tree structure made their implementation
the fastest modulo (2^n+1) multiplier at that time.
More broadly, a lot of work on the diminished-1 representation has been done to solve the
problem of the (n+1)-bit input length in a modulo (2^n+1) multiplier implementation. For
example, Yutai Ma [9] introduces a bit-pair Booth recoding technique and a carry-save adder to
reduce the number of partial products to roughly half, with the exact count depending on whether
n is even or odd. In the work of Zimmermann [7], a weighted operand representation is introduced
to implement the diminished-1 function at the cost of an additional correction circuit. Wang's
work [10] eliminates the conversion circuit between binary and diminished-1 representations,
which reduces power and circuit complexity. Chaves and Sousa [8] compare ordinary and
diminished-1 implementations of the modulo (2^n+1) multiplier and also optimize the Booth
recoding scheme to speed up the multiplier. Vergos and Efstathiou [11] improve on the work of
Wrzyszcz and Milford [6] by reducing the correction factor from 3 to 1, which reduces the circuit
complexity and increases the speed.
1.2 PROBLEM AND WORK STATEMENT
To sum up, today's modulo (2^n+1) multipliers feature high speed, low power, small area, and a
regular structure that is suitable for VLSI implementation. However, further improvements of the
circuit implementation can still be achieved. The enhancements can be made in the partial product
reduction stage and the final adder stage, because these two stages form the critical path of the
multiplier. Thus, new efficient hardware designs of the partial product reduction block and the
final adder block are needed to achieve higher speed and lower power.
To further improve the modulo (2^n+1) multiplier, a new circuit implementation is proposed in
this thesis. It has three major stages: the partial product generation stage, the partial product
reduction stage, and the final adder stage. The last two stages determine the speed and power of
the whole circuit. The conventional compressor in the partial product reduction stage is built
from cascaded full adders and half adders. However, adders consume a lot of power and have a
large delay. In this thesis, a new compressor based on the combination of MUX and XOR-XNOR
gates is proposed to reduce the PDP [1]. For the final adder stage, the conventional Kogge-Stone
adder is the fastest parallel-prefix carry look-ahead adder [13]. However, the performance of
parallel prefix adders is limited by the large number of carry-merge cells and excessive
inter-stage wiring tracks. In this thesis, a sparse-tree based inverted EAC adder is used to
solve this problem [14]. The sparse-tree architecture dramatically reduces the number of blocks
in the last stage compared to the Kogge-Stone adder, which simplifies the layout process. The
sparse-tree architecture also reduces the delay of the last stage, because the sparse-tree path is
not the critical path and the fan-out of the critical path is also reduced.
Additionally, the limitations of the technology itself restrict further improvement of modulo
(2^n+1) multiplier circuits. Transistors in the popular CMOS technology can be scaled down to
very small sizes to achieve very high integration density in VLSI implementations. Nowadays,
32nm CMOS technology is widely used and dramatically increases the speed of the multiplier.
However, as feature sizes scale down toward 25nm in the near future, the leakage current of the
transistors will increase significantly. Also, the sensitivity to process variation increases to
a level that cannot be ignored, as does the accuracy required of the manufacturing process [12].
Furthermore, the intrinsic capacitance of nodes gets smaller as transistor sizes and supply
voltages decrease, so less charge can be stored at each node. This makes instantaneous voltage
changes, such as those caused by cosmic particle strikes, a serious problem that could destroy
the device under some conditions [12]. Thus, robust technologies whose properties remain stable
as transistors get smaller will be required in the near future.
Among the variety of modern technologies, cylindrical carbon molecules have beneficial properties
for applications in electronics and nanotechnology [12]. The carbon nanotube (CNT) is a
tube-shaped allotrope of carbon. CNTs have been built with a length-to-diameter ratio of over
130 million to one, which is far larger than that of any other material under study. One of the
advantageous properties of CNTs is their extreme hardness and stiffness. The only limitation of
this property is that CNTs are sensitive to high-energy electron irradiation. The particular
structure of a CNT determines whether it conducts like a metal or like a semiconductor. For a
given (n,m) CNT, if n = m, the CNT is metallic; if n − m is a nonzero multiple of 3, the CNT is
quasi-metallic with a very small band gap; otherwise, the CNT is a semiconductor. Furthermore,
CNTs have very good thermal properties such as thermal conductivity and thermal stability. Based
on CNT technology, the CNT transistor (CNTFET) has been introduced in recent years with the
advantages of lower leakage power, better frequency response, lower PVT variation, and extremely
low PDP, which makes the CNTFET a very competitive substitute for the traditional MOSFET.
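The standard chirality rule for a single-walled (n,m) nanotube can be written as a small Python sketch. The function name cnt_type and the returned labels are hypothetical, chosen only for illustration; the rule itself is the usual one: n = m gives a metallic tube, a nonzero multiple of 3 for n − m gives a quasi-metallic tube with a very small band gap, and every other case gives a semiconducting tube.

```python
def cnt_type(n, m):
    """Classify an (n, m) carbon nanotube by its chiral indices."""
    if n == m:
        return "metallic"                 # armchair tubes
    if (n - m) % 3 == 0:
        return "quasi-metallic"           # very small band gap
    return "semiconducting"               # the kind a CNTFET channel needs


examples = {chirality: cnt_type(*chirality)
            for chirality in [(10, 10), (9, 0), (10, 0), (19, 0)]}
```

Only the last category is useful as a transistor channel, so the sketch makes clear why chirality control matters for CNTFET fabrication.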
1.3 OUTLINE OF THE THESIS
The rest of the thesis is organized as follows. Section II presents the algorithm used to
implement the multiplier. Section III describes the proposed circuit implementation of the modulo
2^n+1 multiplier, including the novel sparse-tree based inverted EAC adder and the MUX-based
compressor. Section IV gives the simulation results of the CNTFET-based design and the
comparison with the traditional CMOS-based design, and Section V concludes the thesis.
II. Algorithm
Among the various existing A·B mod (2^n+1) algorithms, the one presented by Vergos and Efstathiou
[1] is considered to be the best. The proposed circuit implementation based on this algorithm can
be adapted to various applications such as the IDEA cipher mentioned in Section I. One mismatch
arises when this algorithm is used for the IDEA cipher: the work of Vergos and Efstathiou [1]
uses (n+1)-bit wide inputs, while in the IDEA application the input width is n. However, this
problem is easily solved by connecting the MSBs of the two inputs to ground and neglecting the
MSB of the output.
2.1 ALGORITHM OF THE PARTIAL PRODUCT GENERATION STAGE
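Assume A and B are two inputs represented as A = a_n a_{n-1} a_{n-2} ··· a_1 a_0 and
B = b_n b_{n-1} b_{n-2} ··· b_1 b_0. Then A·B modulo (2^n+1) can be represented as follows [1]:
$$ |A \cdot B|_{2^n+1} = \left| \sum_{i=0}^{n} \sum_{j=0}^{n} p_{i,j}\, 2^{i+j} \right|_{2^n+1} \tag{1} $$
where p_{i,j} = a_i AND b_j. The A×B operation can thus be performed by adding a group of partial
products together in a certain order.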
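As a behavioural reference for the sum of weighted partial products, the following Python sketch may help. It is a plain software model, not the circuit: the function names are hypothetical, and the final reduction uses the built-in remainder operator instead of any hardware technique.

```python
def partial_products(a, b, n):
    """p[i][j] = a_i AND b_j for the (n+1)-bit operands a and b."""
    return [[((a >> i) & 1) & ((b >> j) & 1) for j in range(n + 1)]
            for i in range(n + 1)]


def modmul_from_partial_products(a, b, n):
    """Sum every partial product at weight 2^(i+j), then reduce mod 2^n + 1."""
    p = partial_products(a, b, n)
    total = sum(p[i][j] << (i + j)
                for i in range(n + 1)
                for j in range(n + 1))
    return total % (2**n + 1)


# Exhaustive comparison against ordinary multiplication for n = 4.
n = 4
matches = all(modmul_from_partial_products(a, b, n) == (a * b) % (2**n + 1)
              for a in range(2**n + 1)
              for b in range(2**n + 1))
```

The rest of the algorithm rearranges exactly this sum so that no term has a weight of 2^n or more.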
Observing the partial product matrix, it can be divided into four groups, A, B, C and D, as
shown in Fig. 1 (where P_{i,j} = a_i AND b_j). Only one of these groups can be different from 0
at any time. Thus, partial products in different groups can be ORed instead of being added
together. Firstly, we perform the logic “OR” operation on the terms of groups A, B, and D in the
columns with weight 2^n up to 2^(2n-2), and on the two terms of groups B and D with weight
2^(2n-1). Since |2^(2n-1)|_{2^n+1} = 2^(n-1) + 1, the term q_{n-1} with weight 2^(2n-1) can be
substituted by two terms q_{n-1} in the columns with weight 2^(n-1) and 1, respectively, and ORed
with any term of group A there. Moreover, since |2^(2n)|_{2^n+1} = 1, the term p_{n,n} can be
repositioned to the rightmost column and ORed with p_{0,0} [1, 11]. The modified version of the
partial product matrix after the “OR” operation is shown in Fig. 2 (where q_i = p_{i,n} ∨ p_{n,i}).
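Column weights run from 2^(2n) (leftmost) down to 2^0 (rightmost). Row j holds the terms
P_{n,j}, ..., P_{0,j} in the columns with weights 2^(n+j) down to 2^j:
Row 0   (2^n ... 2^0):            P_{n,0}  P_{n-1,0}  P_{n-2,0}  ...  P_{2,0}  P_{1,0}  P_{0,0}
Row 1   (2^(n+1) ... 2^1):        P_{n,1}  P_{n-1,1}  P_{n-2,1}  P_{n-3,1}  ...  P_{1,1}  P_{0,1}
Row 2   (2^(n+2) ... 2^2):        P_{n,2}  P_{n-1,2}  P_{n-2,2}  P_{n-3,2}  P_{n-4,2}  ...  P_{0,2}
...
Row n-2 (2^(2n-2) ... 2^(n-2)):   P_{n,n-2}  ...  P_{4,n-2}  P_{3,n-2}  P_{2,n-2}  P_{1,n-2}  P_{0,n-2}
Row n-1 (2^(2n-1) ... 2^(n-1)):   P_{n,n-1}  P_{n-1,n-1}  ...  P_{3,n-1}  P_{2,n-1}  P_{1,n-1}  P_{0,n-1}
Row n   (2^(2n) ... 2^n):         P_{n,n}  P_{n-1,n}  P_{n-2,n}  ...  P_{2,n}  P_{1,n}  P_{0,n}
Groups: A contains the terms P_{i,j} with i, j < n; B and D contain the terms with exactly one
index equal to n; C is the single term P_{n,n}.
Fig. 1 Initial Partial Product Matrix
Column weights run from 2^(2n-2) (leftmost) down to 2^0 (rightmost):
Row 0   (2^(n-1) ... 2^0):        (P_{n-1,0} ∨ q_{n-1})  P_{n-2,0}  ...  P_{2,0}  P_{1,0}  (P_{0,0} ∨ P_{n,n} ∨ q_{n-1})
Row 1   (2^n ... 2^1):            (P_{n-1,1} ∨ q_0)  P_{n-2,1}  P_{n-3,1}  ...  P_{1,1}  P_{0,1}
Row 2   (2^(n+1) ... 2^2):        (P_{n-1,2} ∨ q_1)  P_{n-2,2}  P_{n-3,2}  ...  P_{0,2}
...
Row n-2 (2^(2n-3) ... 2^(n-2)):   ...  P_{3,n-2}  P_{2,n-2}  P_{1,n-2}  P_{0,n-2}
Row n-1 (2^(2n-2) ... 2^(n-1)):   (P_{n-1,n-1} ∨ q_{n-2})  ...  P_{2,n-1}  P_{1,n-1}  P_{0,n-1}
Fig. 2 Modified Partial Product Matrix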
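Two facts carry the step from the initial to the modified matrix: for valid operands (0 to 2^n) at most one group of terms is nonzero, and the weights 2^(2n-1) and 2^(2n) fold to 2^(n-1)+1 and 1 modulo 2^n+1. The Python sketch below checks both by brute force. The function names are hypothetical, and the four groups are identified by their index pattern rather than by letter.

```python
def active_groups(a, b, n):
    """Flags telling which groups of partial products can be nonzero."""
    a_n, b_n = (a >> n) & 1, (b >> n) & 1
    a_low, b_low = a & ((1 << n) - 1), b & ((1 << n) - 1)
    return [
        a_low != 0 and b_low != 0,   # terms p_{i,j} with i, j < n
        a_n == 1 and b_low != 0,     # terms p_{n,j} with j < n
        b_n == 1 and a_low != 0,     # terms p_{i,n} with i < n
        a_n == 1 and b_n == 1,       # the single term p_{n,n}
    ]


def at_most_one_group_active(n):
    """Exhaustive check over all valid operands 0 .. 2^n."""
    return all(sum(active_groups(a, b, n)) <= 1
               for a in range(2**n + 1)
               for b in range(2**n + 1))


def folded_weights(n):
    """Residues of 2^(2n-1) and 2^(2n) modulo 2^n + 1."""
    m = 2**n + 1
    return pow(2, 2 * n - 1, m), pow(2, 2 * n, m)
```

Because an operand with a_n = 1 must be exactly 2^n, all of its lower bits are zero, which is what makes the groups mutually exclusive and lets an OR gate replace an adder cell.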
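The terms of the n×n partial product matrix with weight greater than 2^(n-1) can be repositioned
based on the following equation [11], where s is a single bit and \bar{s} is its complement:
$$ |s \cdot 2^{n+i}|_{2^n+1} = |-s \cdot 2^{i}|_{2^n+1} = |\bar{s} \cdot 2^{i} + 2^{n} \cdot 2^{i}|_{2^n+1} \tag{2} $$
Equation (2) shows that repositioning each bit to the ith bit position needs a correction factor
of 2^n·2^i to make sure that the partial product matrix is equivalent to the initial partial
product matrix before the reposition operation. For the ith partial product vector, the
correction factor is derived as 2^n(2^i − 1). Hence, the correction factor of the entire partial
product matrix is given by [11]:
$$ COR_1 = \left| \sum_{i=0}^{n-1} 2^{n}\,(2^{i}-1) \right|_{2^n+1} = \left| 2^{n}\,(2^{n}-n-1) \right|_{2^n+1} = n+2 \tag{3} $$
The final n × n partial product matrix after the reposition operation is shown in Fig. 3.
Column weights run from 2^(n-1) (leftmost) down to 2^0 (rightmost). In rows 1 to n-1, the
low-order positions to the right of the listed terms hold the repositioned, complemented bits:
Row 0:    (P_{n-1,0} ∨ q_{n-1})  P_{n-2,0}  P_{n-3,0}  ...  P_{2,0}  P_{1,0}  (P_{0,0} ∨ P_{n,n} ∨ q_{n-1})
Row 1:    P_{n-2,1}  P_{n-3,1}  P_{n-4,1}  ...  P_{1,1}  P_{0,1}
Row 2:    P_{n-3,2}  P_{n-4,2}  P_{n-5,2}  ...  P_{0,2}
...
Row n-2:  P_{1,n-2}  P_{0,n-2}  ...
Row n-1:  P_{0,n-1}  ...
Fig. 3 Final n × n Partial Product Matrix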
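The construction of the final matrix can be modelled in Python as a sanity check. This is a sketch under explicit assumptions, not the thesis circuit: it assumes that each bit moved from weight 2^(n+i) to weight 2^i is stored complemented, that the matching correction constant is n+2, and that n ≥ 2. The function names are hypothetical, and the rows are summed with ordinary integer addition rather than with a modulo carry-save adder.

```python
def final_matrix_rows(a, b, n):
    """Return the n rows (n-bit integers) of the repositioned matrix."""
    a_bits = [(a >> i) & 1 for i in range(n + 1)]
    b_bits = [(b >> j) & 1 for j in range(n + 1)]

    def p(i, j):
        return a_bits[i] & b_bits[j]

    q = [p(i, n) | p(n, i) for i in range(n)]    # q_i = p_{i,n} OR p_{n,i}
    rows = []
    for j in range(n):
        bits = [p(i, j) for i in range(n)]        # bits[i] has weight 2^(i+j)
        # The leading term of every row absorbs one q term.
        bits[n - 1] |= q[n - 1] if j == 0 else q[j - 1]
        if j == 0:
            bits[0] |= p(n, n) | q[n - 1]         # folded 2^(2n) and 2^(2n-1)
        row = 0
        for i, s in enumerate(bits):
            weight = i + j
            if weight < n:
                row |= s << weight
            else:
                row |= (s ^ 1) << (weight - n)    # reposition, complemented
        rows.append(row)
    return rows


def modmul_via_matrix(a, b, n):
    """Sum of the rows plus the assumed correction n + 2, reduced mod 2^n + 1."""
    return (sum(final_matrix_rows(a, b, n)) + n + 2) % (2**n + 1)


matrix_model_ok = all(modmul_via_matrix(a, b, n) == (a * b) % (2**n + 1)
                      for n in range(2, 6)
                      for a in range(2**n + 1)
                      for b in range(2**n + 1))
```

The exhaustive comparison for n = 2 to 5 exercises every valid operand pair, including the special operands equal to 2^n.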
2.2 ALGORITHM OF THE PARTIAL PRODUCT REDUCTION STAGE
Another observation concerns the compressors in the partial product reduction stage, which
behave like a carry-save adder (CSA). Since this CSA works as a modulo 2^n+1 adder, the carry-out
bit of each level of the CSA has to be fed back as the carry-in bit of the next level [1].
Supposing that the carry-out bit of the nth column at the ith stage of the CSA is c_i with
weight 2^n, the carry-out can be reduced to [11]:
$$ |c_i \cdot 2^{n}|_{2^n+1} = |-c_i|_{2^n+1} = |\bar{c}_i + 2^{n}|_{2^n+1} \tag{4} $$
Thus, in an (n-1)-stage CSA, the carry-out bits give rise to another correction factor according
to equation (4) [1]:
$$ COR_2 = \left| 2^{n}\,(n-1) \right|_{2^n+1} = \left| -(n-1) \right|_{2^n+1} \tag{5} $$
The final correction factor can be calculated from the sum of COR1 and COR2:
$$ COR = \left| 2^{n}\,(2^{n}-n-1) + 2^{n}\,(n-1) \right|_{2^n+1} = \left| 2^{n}\,(2^{n}-2) \right|_{2^n+1} = 3 \tag{6} $$
For an n-bit modulo (2^n+1) multiplier, the constant “3” is the final correction factor. A “2”
is added in the partial product reduction stage, while a “1” is added in the final adder stage
due to the inverted carry feedback issue discussed later in this thesis.
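The constant 3 can be checked numerically. The Python sketch below assumes the two correction terms take the forms 2^n(2^i − 1) summed over the n partial product vectors and 2^n(n − 1) for an (n−1)-stage carry-save adder; these forms, and the function names, are assumptions made for illustration.

```python
def correction_factors(n):
    """Return (COR1, COR2, COR) as residues modulo 2^n + 1."""
    m = 2**n + 1
    cor1 = sum(2**n * (2**i - 1) for i in range(n)) % m   # repositioned bits
    cor2 = (2**n * (n - 1)) % m                           # inverted CSA carries
    return cor1, cor2, (cor1 + cor2) % m


def carry_feedback_identity_holds(n):
    """Check c * 2^n == NOT(c) + 2^n (mod 2^n + 1) for both carry values."""
    m = 2**n + 1
    return all((c << n) % m == ((1 - c) + 2**n) % m for c in (0, 1))


totals = [correction_factors(n)[2] for n in range(2, 17)]
```

For every width from 2 to 16 bits the total reduces to 3, independent of n, which is what allows the correction to be hard-wired.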
2.3 ALGORITHM OF THE FINAL ADDITION STAGE
When two 1-bit inputs A and B are added, A and B are said to “generate” if the carry-out of A+B is always 1, regardless of the value of the input carry. In practice, A and B generate only when both A and B are logic 1. We use g to represent the “generate” relationship, denoted as g = A · B. Similarly, A and B are said to “propagate” if the carry-out of A+B is 1 whenever the carry-in bit is 1. In practice, A and B propagate only when at least one of A and B is logic 1. We use p to represent the “propagate” relationship, denoted as p = A + B.
Fig. 4 8-bit Kogge-Stone Adder
The final adder stage is an inverted End-Around-Carry (EAC) adder derived from the conventional Kogge-Stone adder. An 8-bit Kogge-Stone adder is shown in Fig. 4. The algorithm of the Kogge-Stone adder is as follows. Each “□” produces a “propagate” bit and a “generate” bit from one pair of input bits. Next, the operator “○” combines neighboring generate-propagate pairs in the following stages in the vertical direction. The final “generate” bits are produced in the last stage. These bits are XORed with the initial “propagate” bits to produce the final sum bits. For example, the LSB of the sum vector is P0 XORed with the carry-in bit, and the second LSB of the sum vector is P1 XORed with the rightmost carry-out bit in the last stage of “○” operations. The 16-bit Kogge-Stone adder operates in the same manner, as shown in Fig. 5.
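The prefix computation described above can be modeled bit by bit. The following is a minimal functional sketch, not the circuit of Fig. 4: it assumes the textbook prefix operator (g, p) ○ (g', p') = (g + p·g', p·p'), uses the OR form of “propagate” from the definition above, and therefore keeps a separate half-sum h = a XOR b for the sum bits. The function name and the way the carry-in is folded into bit 0 are choices made for illustration.

```python
def kogge_stone_add(a, b, width=8, cin=0):
    """Functional model of a Kogge-Stone adder; returns (sum, carry_out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # generate: both bits are 1
    p = [((a >> i) | (b >> i)) & 1 for i in range(width)]  # propagate: at least one bit is 1
    h = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # half-sum used for the sum bits

    g[0] |= p[0] & cin  # fold the carry-in into the least significant generate bit

    G, P = g[:], p[:]
    distance = 1
    while distance < width:  # one prefix stage per doubling of the span
        G_next, P_next = G[:], P[:]
        for i in range(distance, width):
            G_next[i] = G[i] | (P[i] & G[i - distance])
            P_next[i] = P[i] & P[i - distance]
        G, P = G_next, P_next
        distance *= 2

    # After the last stage G[i] is the carry out of bit i.
    total = 0
    carry_in = cin
    for i in range(width):
        total |= (h[i] ^ carry_in) << i
        carry_in = G[i]
    return total, G[width - 1]


if __name__ == "__main__":
    print(kogge_stone_add(119, 87))
```

For width 8 the loop runs three prefix stages (distances 1, 2, and 4); for width 16 it runs four.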
Although the conventional Kogge-Stone adder is regarded as one of the fastest adder architectures, it needs some structural revision to realize the modulo (2^n+1) function. The partial product reduction stage generates an n-bit sum vector and an n-bit carry vector, which are added in the final adder stage. To achieve modulo (2^n+1) addition, the carry-out bit of this addition must be fed back to the LSB of the final adder stage, as shown in the work of Zimmerman [7]:
(7)
From (7) we can observe that the inverted carry-out bit of the addition of the Sum and Carry vectors has to be fed back to achieve the modulo (2^n+1) function in the revised Kogge-Stone adder architecture shown later.
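The effect of the inverted carry feedback can be illustrated with plain integer arithmetic. The sketch below assumes the commonly cited form of the relation, in which S + C + 1 modulo 2^n+1 equals S + C plus the inverted carry-out, taken modulo 2^n; it is a behavioral model only, and the function name is hypothetical. The one case where the modulo 2^n+1 result is exactly 2^n appears as an all-zero n-bit word in this model.

```python
def inverted_eac_add(s, c, n):
    """Add two n-bit vectors and feed the inverted carry-out back into the LSB."""
    mask = (1 << n) - 1
    total = s + c
    carry_out = total >> n            # carry out of the n-bit addition
    return (total + (carry_out ^ 1)) & mask


def reference(s, c, n):
    """(s + c + 1) mod (2^n + 1), reduced to n bits for comparison."""
    return ((s + c + 1) % ((1 << n) + 1)) % (1 << n)


if __name__ == "__main__":
    print(inverted_eac_add(212, 117, 8), reference(212, 117, 8))
```

With n = 8, s = 212, and c = 117 both functions return 73.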
Fig. 5 16-bit Kogge-Stone Adder
The parallel prefix computation in the form of “○” operations is retained in the revised architecture. Instead of directly XORing the “propagate” bit of the nth bit with the (n-1)th carry-out bit of the last prefix stage, the new architecture inverts the (n-1)th carry-out bit of that stage, and this inverted (pi*, gi*) set is then combined by “○” with the (pi, gi) of each bit to generate the final (G’, P’) set. Finally, the sum vector of the final adder stage is generated by XORing the final carry-out bit gi* with the initial “propagate” bit pi. The revised 8-bit EAC Kogge-Stone adder is shown in Fig. 6.
Fig. 6 8-bit Kogge-Stone Diminished-1 Adder
Since the final sum vector and carry vector depend mainly on the generate-propagate set in every stage, the derivation of (G, P) and some of its characteristics should be discussed. Furthermore, the architecture in Fig. 6 has a logic depth of log2 n + 1. To reduce the logic depth from log2 n + 1 to log2 n, a new architecture is introduced based on the algorithm improvement shown below. The carry-out bit of a carry-look-ahead (CLA) adder is logic 1 when one of the following cases takes place: A+B “generate”, or the next less significant carry-out bit is 1 and A+B “propagate”. The carry-out bit of the CLA can then be written as:
$c_i = g_i + p_i \cdot c_{i-1}$ (8)
According to (8), the final generate-propagate set
in the th stage could be
expressed below (Let
) [1]:
(9)
Several observations can be made about the equation above. Firstly,
(10)
which means that the inverted EAC adder simply inverts the “generate” bit and keeps the value of the “propagate” bit. The second observation is:
(11)
The third observation is on the derivation of
, as shown
below [15]:
(12)
In some cases, generating the whole architecture in log2 n stages based on (12) is not possible.
To solve this problem, we could transfer (12) into another form [15]. Suppose that
and , then,
(13)
According to (13), . The new designed final stage adder based on
this algorithm is shown in Fig. 7. The addition in the final adder stage is done in log2 n stages. However, this implementation has an obvious wire interconnection problem because of the complexity of the cells [1].
One possible solution to the wire interconnection problem is to introduce a sparse-tree architecture. The sparsity of a Kogge-Stone adder refers to how often the adder generates a carry-out bit. For example, a sparsity of 1 means that the adder generates a carry-out for every bit, a sparsity of 2 means generating a carry-out for every other bit, and a sparsity of 4 means generating a carry-out for every fourth bit. A much shorter carry ripple adder, which takes the carry-out of the sparse-tree adder as its carry input, then produces the sum bits in between. As long as this shorter carry ripple adder is not on the critical path, the delay of the final adder stage is reduced and the wire interconnection problem is solved. There is a trade-off between the sparsity and the effectiveness of solving the wire interconnection problem. Increasing the sparsity increases the speed of the sparse-tree adder; however, the delay of the short carry ripple adder grows as well. Eventually, the critical path is no longer the sparse-tree adder but the short carry ripple adder.
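The division of work between the sparse carry tree and the short ripple adders can be sketched functionally. The model below is an illustration under simplifying assumptions: the block carries are computed one after another rather than by a parallel prefix tree, the adder is a plain binary adder without end-around carry, and the width is a multiple of the sparsity. The function name is hypothetical.

```python
def sparse_tree_add(a, b, width=16, sparsity=4):
    """Functional model: one carry per block, then a short ripple inside each block."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) | (b >> i)) & 1 for i in range(width)]
    h = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]

    # Sparse carries: only the carry entering each block is produced here.
    block_carry_in = []
    carry = 0
    for start in range(0, width, sparsity):
        block_carry_in.append(carry)
        group_g, group_p = 0, 1
        for i in range(start, start + sparsity):
            group_g = g[i] | (p[i] & group_g)
            group_p &= p[i]
        carry = group_g | (group_p & carry)

    # Short ripple adders: each block forms its sum bits from its own carry-in.
    total = 0
    for block, start in enumerate(range(0, width, sparsity)):
        c = block_carry_in[block]
        for i in range(start, start + sparsity):
            total |= (h[i] ^ c) << i
            c = g[i] | (p[i] & c)
    return total, carry


if __name__ == "__main__":
    print(sparse_tree_add(0x1234, 0x0FFF))
```

With a sparsity of 4 and a width of 16 the model produces four block carries, and the ripple inside each block is at most four bits long, which is the trade-off described above.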
Fig. 7 Revised Diminished-1 Kogge-Stone Adder with log2 n Stages (inputs a0,b0 to a7,b7; outputs s0 to s7)
An example of 16-bit Kogge–Stone adder with sparsity-4 is shown in Fig. 8, while the Inverted
EAC adder with sparsity-4 is shown in Fig. 9.
Fig. 8 16-bit Kogge–Stone adder with sparsity of 4 (inputs a0,b0 to a15,b15; carries C1, C5, C9, C13)
Fig. 9 Inverted EAC adder with sparsity of 4 (inputs a0,b0 to a15,b15; carries C15=C-1, C11, C7, C3)
Generally, a sparsity of 4 is chosen for 8-bit and 16-bit adders [14]. The carry-out equations for the 16-bit sparse-tree inverted EAC adder are as follows:
(14)
Based on the deduction shown in (12), the equations turn into:
(15)
Based on the deduction shown in (13), the equations turn into:
(16)
In (16), the final equations limit the modulo addition in the final adder stage to log2 n stages, as shown in Fig. 10. This architecture solves the wire interconnection problem and reduces the non-critical path delay.
A = 119 = 001110111, B = 87 = 001010111
b0 = 1:  0 0 1 1 1 0 1 1 1
b1 = 1:  0 0 1 1 1 0 1 1 1
b2 = 1:  0 0 1 1 1 0 1 1 1
b3 = 0:  0 0 0 0 0 0 0 0 0
b4 = 1:  0 0 1 1 1 0 1 1 1
b5 = 0:  0 0 0 0 0 0 0 0 0
b6 = 1:  0 0 1 1 1 0 1 1 1
b7 = 0:  0 0 0 0 0 0 0 0 0
b8 = 0:  0 0 0 0 0 0 0 0 0
Fig. 11 The Initial Output of the Partial Product Generation Stage
Fig. 10 Inverted EAC adder with sparsity of 4 in log2 n stages
2.4 AN EXAMPLE
Take a 9-bit modulo (2^n+1) multiplier as an example, where n = 8 and the modulus is 2^8+1 = 257. Assume the two inputs are A = 119 = 001110111 and B = 87 = 001010111. The initial output of the partial product generation stage is shown in Fig. 11.
0 1 1 1 0 1 1 1
1 1 1 0 1 1 1 1
1 1 0 1 1 1 1 0
0 0 0 0 0 1 1 1
0 1 1 1 1 0 0 0
0 0 0 1 1 1 1 1
1 1 1 0 0 0 1 0
0 1 1 1 1 1 1 1
Fig. 12 The n×n Partial Product Matrix
The left half (to the left of the dashed line) of the initial partial products shown in Fig. 11 needs to be repositioned using the principle illustrated in Fig. 3. The final n×n partial product matrix after repositioning is shown in Fig. 12. A correction factor of 2, in the form of the correction vector shown in the block in Fig. 13, is added to the bottom of the n×n partial product matrix. The total correction factor of the modulo 2^n+1 multiplier is 3; the remaining “1” is added in the final adder stage.
0 1 1 1 0 1 1 1
1 1 1 0 1 1 1 1
1 1 0 1 1 1 1 0
0 0 0 0 0 1 1 1
0 1 1 1 1 0 0 0
0 0 0 1 1 1 1 1
1 1 1 0 0 0 1 0
0 1 1 1 1 1 1 1
0 0 0 0 0 0 1 0   (correction vector)
Fig. 13 The Final Partial Product Matrix with the Correction Factor
Sum Vector:    1 1 0 1 0 1 0 0
Carry Vector:  0 0 1 1 1 0 1 0
Fig. 14 The Initial Output of the Partial Product Reduction Stage
Sum Vector:    1 1 0 1 0 1 0 0
Carry Vector:  0 1 1 1 0 1 0 1
Fig. 15 The Output of the Partial Product Reduction Stage after Repositioning
The partial product reduction stage compresses the partial product matrix in Fig. 13 to a sum vector and a carry vector, as shown in Fig. 14. This initial output of the partial product reduction stage also needs to be repositioned. The final sum vector and carry vector after repositioning are then added modulo 2^n+1 together with the remaining “1”. In this example, the sum vector is 11010100 = 212 and the carry vector is 01110101 = 117, so 212 + 117 + 1 = 330, which is 73 modulo 257. Indeed, 119×87 modulo (2^8+1) equals 73.
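The whole example can be replayed in a few lines. The sketch below is a behavioral model under stated assumptions, not the thesis circuit: operands are restricted to values below 2^n, the reduction is a simple linear chain of carry-save adders with inverted carry feedback (the compressor trees of the thesis may produce different intermediate vectors than Fig. 14 and Fig. 15), and all function names are hypothetical.

```python
def reposition_rows(a, b, n):
    """Build the n x n partial product matrix with complemented wrap-around bits."""
    mask = (1 << n) - 1
    rows = []
    for i in range(n):
        pp = a if (b >> i) & 1 else 0
        low = (pp << i) & mask                 # bits that stay below weight 2^n
        wrapped = (pp << i) >> n               # the i bits that overflow past 2^n
        rows.append(low | (~wrapped & ((1 << i) - 1)))  # complemented and moved to the LSB side
    return rows


def csa_inverted_feedback(x, y, z, n):
    """3:2 carry-save step whose carry-out of weight 2^n is inverted and fed to the LSB."""
    mask = (1 << n) - 1
    s = x ^ y ^ z
    majority = (x & y) | (x & z) | (y & z)
    carry_out = (majority >> (n - 1)) & 1
    c = ((majority << 1) & mask) | (carry_out ^ 1)
    return s, c


def modmul_2n_plus_1(a, b, n):
    """a*b mod (2^n + 1) for 0 <= a, b < 2^n, following the three stages of the example."""
    rows = reposition_rows(a, b, n) + [2]      # correction vector "2"
    s, c = rows[0], rows[1]
    for row in rows[2:]:                       # n - 1 carry-save steps in total
        s, c = csa_inverted_feedback(s, c, row, n)
    return (s + c + 1) % ((1 << n) + 1)        # remaining "1" added in the final stage


if __name__ == "__main__":
    print(reposition_rows(119, 87, 8))
    print(modmul_2n_plus_1(119, 87, 8))
```

For A = 119 and B = 87 the rows are 119, 239, 222, 7, 120, 31, 226, and 127, which are the eight rows of Fig. 12 read as binary numbers, and the final result is 73.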
III. Circuit Implementation
The proposed implementation of the modulo 2^n+1 multiplier consists of three stages: the partial product generation stage, the partial product reduction stage, and the final addition stage. The possible circuit configurations for each stage are discussed in this section.
3.1 CIRCUIT DESIGN OF THE PARTIAL PRODUCT GENERATION
STAGE
This stage is the simplest stage in the circuit design of the entire multiplier. Traditional 2-input
NAND gate, 2-input NOR gate, and inverter need to be optimized to meet the power and speed
demand of this stage.
Fig. 16 Proposed Inverter
The structure and the size of the transistors composing the proposed inverter, 2-input NAND and
2-input NOR are shown in Fig. 16, Fig. 17, and Fig. 18, respectively. The NAND gates are used
for generating initial partial product terms, while the NOR gates and inverters are the key circuit
components to implement the operations of repositioning to get the final n×n partial product
matrix. The most complex logic functions in the reposition operations are ,
and
, where and [1].
Fig. 17 NAND Gate with 2 Inputs
Fig. 18 NOR Gate with 2 Inputs
3.2 CIRCUIT DESIGN OF THE PARTIAL PRODUCT REDUCTION
STAGE
3.2.1 INTRODUCTION OF A MUX-BASED COMPRESSOR DESIGN
The partial product reduction stage is considered the most important stage in determining the power and speed of the entire modulo 2^n+1 multiplier [1]. Thus, this stage must be designed with a group of low-power, high-speed compressors.
Fig. 19 Traditional Design of the Partial Product Reduction Stage
In this stage, the n×n partial product matrix and a correction factor of “2” are compressed to a final sum vector and a carry vector. The remaining correction factor of “1” is added in the final addition stage by using the inverted EAC adder. Traditional compressors are designed with full adders. However, these designs consume too much power and occupy too much chip area, and they cannot meet today's requirement of very high speed. For example, to compress a single column of an 8×8 partial product matrix, 7 full adders are needed in total, while in the worst case of the possible new designs proposed in this thesis, only one 7:2 compressor and two 3:2 compressors are needed. The traditional full adder based compressor and the worst case of the possible new design are shown in Fig. 19 and Fig. 20, respectively.
Fig. 20 A New Design of the Partial Product Reduction Stage
The compressor architecture shown in Fig. 20 is designed with MUX and XOR-XNOR sub-circuits. The MUX-based compressors use far fewer transistors than the full adder based compressors, and the total number of compressors used in the traditional full adder based design is much higher than in the new MUX-based design. Thus, the new compressor architecture is better suited to the requirements of low power and high speed.
3.2.2 DIFFERENT TYPES OF THE MUX-BASED COMPRESSORS
Several basic MUX-based compressors are discussed below:
3.2.2.1 Circuit Design of the 3:2 compressor:
A 3:2 compressor takes 3 inputs x1, x2, and x3 and generates two outputs, Sum and Carry. The relationship between the inputs and outputs is given in equation (17) [16]:
$x_1 + x_2 + x_3 = \mathrm{Sum} + 2 \cdot \mathrm{Carry}$ (17)
Fig. 21 Traditional MUX-based Design of the 3:2 Compressor
Fig. 21 shows an existing design of the MUX-based 3:2 compressor [16]. However, this design is not fast enough because X1 and X2 must be combined first, and only then can their result be combined with X3; the second operation has to wait for the result of the first. The total delay of this design is 2×∆XOR. To reduce the critical path delay of the 3:2 compressor, a new design of the MUX-based 3:2 compressor is shown in Fig. 22. In the proposed design, X3 can set up the select inputs of the MUXs before the other signals arrive. Thus, the time taken to switch the transistors in the critical path is reduced, increasing circuit efficiency [16]. The total delay of the proposed design is ∆XOR+∆MUX. The output equations of the proposed design are shown below [16]:
(18)
(19)
Fig. 22 A New MUX-based Design of the 3:2 Compressor
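One way to obtain a delay of one XOR plus one MUX is to let x3 act as the select signal of two multiplexers. The sketch below is an assumption made for illustration, since the output equations of the proposed design are not reproduced here; it only shows that such a structure satisfies x1 + x2 + x3 = Sum + 2·Carry. The helper names are hypothetical.

```python
from itertools import product


def mux(select, d0, d1):
    """2-to-1 multiplexer: returns d1 when select is 1, otherwise d0."""
    return d1 if select else d0


def compressor_3_2(x1, x2, x3):
    """Assumed MUX form of a 3:2 compressor with x3 as the select input."""
    xor12 = x1 ^ x2
    xnor12 = xor12 ^ 1                      # complementary output of an XOR-XNOR block
    total = mux(x3, xor12, xnor12)          # Sum = x1 xor x2 xor x3
    carry = mux(x3, x1 & x2, x1 | x2)       # Carry = majority(x1, x2, x3)
    return total, carry


def satisfies_arithmetic_relation():
    """Exhaustively check x1 + x2 + x3 == Sum + 2*Carry."""
    for x1, x2, x3 in product((0, 1), repeat=3):
        total, carry = compressor_3_2(x1, x2, x3)
        if x1 + x2 + x3 != total + 2 * carry:
            return False
    return True


if __name__ == "__main__":
    print(satisfies_arithmetic_relation())
```

Because x3 only drives select inputs, both data paths can settle before the select arrives, which is the property the text attributes to the proposed design.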
3.2.2.2 Circuit Design of the 4:2 compressor:
A 4:2 compressor takes 4 inputs x1, x2, x3, and x4 along with a carry-in bit Cin and generates three outputs, Sum, Carry, and Cout, where “Sum” is weighted at 2^0 and “Carry” and “Cout” are weighted at 2^1. The relationship between the inputs and outputs is given in equation (20) [16]:
$x_1 + x_2 + x_3 + x_4 + C_{in} = \mathrm{Sum} + 2 \cdot (\mathrm{Carry} + C_{out})$ (20)
An existing circuit design of the MUX-based 4:2 compressor is shown in Fig. 23 [16]. As in the traditional 3:2 compressor, the second and third XOR operations have to wait for the result of the previous one, which limits the speed of the compressor (3×∆XOR). In Fig. 24, a new design of the MUX-based 4:2 compressor is proposed. In this design, each output and its complementary signal are generated at the same time, avoiding the race-hazard problem. The power consumed by the inverters that would otherwise generate the complementary signals is also saved. Furthermore, the MUX connected to Cin can be selected in advance. The total delay of the proposed design is ∆XOR+2×∆MUX.
Fig.23 Traditional MUX-based Design of the 4:2 Compressor
The output equations of the proposed design are shown below [16]:
(21)
(22)
(23)
Fig.24 A New MUX-based Design of the 4:2 Compressor
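The arithmetic behavior of a 4:2 compressor can be checked with the widely used XOR/MUX formulation. The sketch below uses that textbook form and is not necessarily the gate arrangement of Fig. 24; the function names are hypothetical. Note that Cout does not depend on Cin, so no carry ripples along a row of such compressors.

```python
from itertools import product


def mux(select, d0, d1):
    """2-to-1 multiplexer: returns d1 when select is 1, otherwise d0."""
    return d1 if select else d0


def compressor_4_2(x1, x2, x3, x4, cin):
    """Textbook XOR/MUX 4:2 compressor; returns (Sum, Carry, Cout)."""
    x12 = x1 ^ x2
    x34 = x3 ^ x4
    x1234 = x12 ^ x34
    cout = mux(x12, x1, x3)        # majority of x1, x2, x3; independent of cin
    total = x1234 ^ cin
    carry = mux(x1234, x4, cin)    # majority of (x1^x2^x3), x4, cin
    return total, carry, cout


def satisfies_arithmetic_relation():
    """Exhaustively check x1+x2+x3+x4+Cin == Sum + 2*(Carry + Cout)."""
    for bits in product((0, 1), repeat=5):
        total, carry, cout = compressor_4_2(*bits)
        if sum(bits) != total + 2 * (carry + cout):
            return False
    return True


if __name__ == "__main__":
    print(satisfies_arithmetic_relation())
```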
3.2.2.3 Circuit Design of the 5:2 compressor:
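The 5:2 compressor has 7 inputs (x1, x2, x3, x4, x5, Cin1, and Cin2) and 4 outputs (Sum, Carry, Cout1, and Cout2). The relationship between the inputs and outputs is shown below [16]:
$x_1 + x_2 + x_3 + x_4 + x_5 + C_{in1} + C_{in2} = \mathrm{Sum} + 2 \cdot (\mathrm{Carry} + C_{out1} + C_{out2})$ (24)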
Several existing circuit implementations of the MUX-based 5:2 compressor are shown in Fig. 25 (a), (b), and (c) [16]. If all the full adder blocks of the original full adder based design are replaced by their constituent XOR blocks, its delay is 6×ΔXOR [16]; in the designs of Fig. 25, the delay is reduced to 5×ΔXOR. However, the delay of the MUX-based 5:2 compressor can be reduced further by replacing some XOR gates with MUX blocks. The proposed implementation is shown in Fig. 26. In the first stage, 2 XOR-XNOR blocks generate each output and its complementary signal at the same time, which saves the power of additional inverters and avoids the race-hazard problem. In the second and fourth stages, the MUXs controlled by X3, Cin1, and Cin2 can be selected before the input signals arrive. The remaining MUX blocks also make efficient use of the outputs of the blocks in the previous stage. As a result of these features, the critical path delay of the proposed design is reduced to ΔXOR+3×ΔMUX. The output equations are shown below:
(25)
(26)
(27)
(28)
Fig. 25 Existing Architectures of the 5:2 Compressor: (a), (b), (c)
Fig. 26 A New MUX-based Design of the 5:2 Compressor
3.2.2.4 Circuit Design of the 7:2 compressor:
The 7:2 compressor has 9 inputs (x1, x2, x3, x4, x5, x6, x7, Cin1, and Cin2) and 4 outputs (Sum, Carry, Cout1, and Cout2). Unlike the 5:2 compressor, where Carry, Cout1, and Cout2 are all weighted at 2^1, the 7:2 compressor has a Cout1 output weighted at 2^2. To sum up, the relationship between the inputs and the outputs of a 7:2 compressor is [1]:
$x_1 + x_2 + \dots + x_7 + C_{in1} + C_{in2} = \mathrm{Sum} + 2 \cdot (\mathrm{Carry} + C_{out2}) + 4 \cdot C_{out1}$ (29)
The MUX-based 7:2 compressor is a new design in this thesis. The principle of the design is to replace XOR with MUX wherever possible to reduce delay, and to generate each output together with its complementary signal to reduce power. The output equations shown below [17] can then be transformed into the circuit implementation of the MUX-based 7:2 compressor shown in Fig. 27, with some additional logic gates such as NAND. The total delay of the proposed design is ΔXOR+5×ΔMUX.
(30)
(31)
(32)
(33)
where
Fig. 27 A New MUX-based Design of the 7:2 Compressor
3.2.3 DETAILED SUBCIRCUIT DESIGN OF THE COMPRESSORS
To realize the circuit implementations mentioned above, detailed transistor-level designs also need to be discussed and compared. The MUX subcircuit, the complementary-output MUX subcircuit, the XOR-XNOR subcircuit, and the CGEN subcircuit are discussed one by one.
3.2.3.1 MUX Subcircuit Design:
Fig. 28 Original Design of the MUX Subcircuit
First, we look at the MUX subcircuit. The original 2-1 MUX is shown in Fig. 28 [18]. This is the most widespread MUX cell today, especially in low power applications. However, this structure lacks the driving ability to drive the large input capacitance of the following stages, especially when many stages are cascaded. This introduces a large delay and worsens the performance of the entire modulo multiplier. Thus, this implementation is not chosen. To solve this weak driving ability problem, another circuit implementation of the MUX, shown in Fig. 29, was introduced. The modified structure solves the driving problem by adding two cascaded inverters at the output of the original design. This method is highly effective. However, the inverters consume a lot of power and enlarge the MUX block to more than twice the size of the one in Fig. 28. So this is also not a desirable design for low power applications.
Fig. 29 Modified Design of the MUX Subcircuit
The proposed design of the MUX subcircuit is shown in Fig. 30. This design takes advantage of complementary CMOS technology, which is robust against both voltage scaling and transistor sizing [18]. Compared to the modified MUX circuit shown in Fig. 29, the proposed design has only one inverter, which saves considerable power. The driving ability of the proposed design is not reduced by having fewer inverters, because the remaining transistors of the proposed design are also connected to vdd/gnd, which provides driving strength. The total number of transistors in the proposed design is 2 more than in Fig. 29. However, the total silicon area of the transistors in the two designs is the same. Thus, based on the discussion above, the circuit design in Fig. 30 is chosen in this research as a balanced choice for low power, small silicon area, and high speed.
Fig. 30 Proposed Design of the MUX Subcircuit
3.2.3.2 Complementary MUX Subcircuit Design:
Second, the complementary-output MUX subcircuit needs to be designed. Two existing designs of the complementary-output MUX are shown in Fig. 31 (a) and (b), respectively [18]. Design (a) has some driving ability because two compensation transistors, both driven by vdd, are introduced. For the same reason, structure (a) also obtains a full voltage swing at the output. However, the driving ability of (a) is not strong enough to drive many cascaded stages. Unlike (a), structure (b) has no driving ability at all. Additionally, in some cases its output and complementary output do not have a full swing.
Fig. 31 Existing Designs of the Complementary-output MUX Subcircuit: (a), (b)
To solve the problems mentioned above, the complementary-output MUX needs to be redesigned. In the circuit design of Fig. 31 (a), an inverter needs to be added to each of the two outputs to improve the driving ability. In the circuit design of Fig. 31 (b), two cascaded inverters are needed and all the pass-gates need to be replaced by complementary CMOS pass-gates to obtain a full swing. After these improvements, (b) occupies much more silicon area than (a), so the proposed design, shown in Fig. 32, builds on (a).
Fig. 32 Proposed Design of the Complementary-output MUX Subcircuit
3.2.3.3 XOR-XNOR Subcircuit Design:
Third, the XOR-XNOR subcircuit needs to be designed. The original design of the XOR-XNOR subcircuit is shown in Fig. 33 [18]. This design suffers from weak driving ability, especially when the XNOR node is at logic 0, which dramatically reduces speed. Another problem concerns the complementary outputs: a skew occurs between the XOR node and the XNOR node. Additionally, this design generates a weak logic “1” at the XNOR node because an NMOS-based pass-gate has a Vth voltage drop when passing logic “1”. Thus, this design cannot be used with a low supply voltage.
Fig. 33 Original Design of the XOR-XNOR Subcircuit
To solve these problems, other XOR-XNOR subcircuits have been designed, as shown in Fig. 34 (a), (b), and (c), respectively [18]. The modified XOR-XNOR block shown in (a) can be used with a low supply voltage because complementary CMOS pass-gates replace the original ones. However, the weak driving ability problem and the skew problem at the output remain. Unlike (a), design (b) solves the skew problem at the output by adding a group of complementary transistors to the circuit shown in Fig. 33. However, it generates a weak “0” at the XOR node and a weak “1” at the XNOR node.
Fig. 34 Modified Designs of the XOR-XNOR Subcircuit: (a), (b), (c)
So this design is also not a good choice for low power applications. The circuit implementation in (c) solves the weak logic problem and the weak driving ability problem at the same time because of the feedback NMOS-PMOS transistors in the middle of the circuit. However, it is still not a good choice for low power applications for the following reason. When the input changes from any other pattern to “00” or “11”, the feedback NMOS-PMOS transistors, which were originally turned off, are turned on by a weak logic driver and a high-impedance driver. This transition therefore takes a long time, worsens the performance of the entire circuit, and consumes a large amount of dynamic power [18].
Fig. 35 Proposed Design of the XOR-XNOR Subcircuit
The proposed design of the XOR-XNOR subcircuit is shown in Fig. 35. It combines all the desired features, solving the weak logic problem, the output skew problem, the weak driving ability problem, and the long transition time problem of Fig. 34 (c) at the same time.
Fig. 36 Proposed Design of the CGEN Subcircuit
3.2.3.4 CGEN Subcircuit Design:
Finally, the proposed CGEN subcircuit is shown in Fig. 36 [18]. The CGEN subcircuit works like a full adder without the “Sum” output. The truth table of the CGEN block is shown in Table 1. This circuit implementation takes advantage of complementary CMOS logic, providing good driving ability (small delay) with relatively small silicon area.
Table 1 Truth Table of the CGEN Subcircuit
A   B   Cin   Carry
0   0   0     0
0   0   1     0
0   1   0     0
0   1   1     1
1   0   0     0
1   0   1     1
1   1   0     1
1   1   1     1
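Table 1 is the majority function of the three inputs. The short sketch below restates it as a Boolean expression and compares it against the rows of the table; the function name `cgen` is chosen here for illustration and says nothing about the transistor-level structure of Fig. 36.

```python
def cgen(a, b, cin):
    """Carry of a full adder: 1 when at least two of the three inputs are 1."""
    return (a & b) | (a & cin) | (b & cin)


# Rows of Table 1 as (A, B, Cin, Carry).
TABLE_1 = [
    (0, 0, 0, 0), (0, 0, 1, 0), (0, 1, 0, 0), (0, 1, 1, 1),
    (1, 0, 0, 0), (1, 0, 1, 1), (1, 1, 0, 1), (1, 1, 1, 1),
]


def matches_table_1():
    """Compare the Boolean expression with every row of the truth table."""
    return all(cgen(a, b, cin) == carry for a, b, cin, carry in TABLE_1)


if __name__ == "__main__":
    print(matches_table_1())
```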
Fig. 37 Possible Compressor Architectures for an 8-bit Modulo 2^n+1 Multiplier
3.2.4 THE ARCHITECTURE OF THE PARTIAL PRODUCT REDUCTION
STAGE
3.2.4.1 Architecture Designed for an 8-bit Modulo 2^n+1 Multiplier
After designing the specific circuit blocks, the architecture of the whole compressor needs to be decided. For an 8-bit modulo 2^n+1 multiplier, two possible compressor architectures are compared. The architectures discussed in this section are architectures of the partial product reduction stage that compress a single column of the final partial product matrix together with the corresponding correction bit. The first compressor architecture is shown in Fig. 37 (a); it uses the fewest compressors (three in total) of all the possible architectures. Only three stages are introduced and only one compressor is used in each stage. However, when parallelism is taken into consideration, the other architecture, shown in Fig. 37 (b), performs much better. This architecture uses a total of seven 3:2 compressors in four stages. In the first stage, three 3:2 compressors work in parallel, while the numbers of compressors used in the remaining stages are 2, 1, and 1, respectively. Although the second architecture uses more compressors than the first, it has less delay and fewer transistors. The simulation results are shown in Section IV.
Furthermore, the architecture in Fig. 37 (b) has an advantage in layout over the first one because two types of compressors are used in Fig. 37 (a), while a single type of compressor is used in Fig. 37 (b). However, an interconnect wire routing issue occurs in Fig. 37 (b) because of the parallel design, especially when the input size is large. The 3:2 compressor [16] is described in Section 3.2.2.1. In this thesis, the architecture in Fig. 37 (b) is chosen to achieve high speed, small area, and low power.
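The stage counts quoted for Fig. 37 (b) follow from a simple column-height argument. The sketch below is an assumption-based model rather than the thesis netlist: it packs as many 3:2 compressors as possible into each stage and assumes that every column receives as many carries from its lower-order neighbor as it sends to its higher-order neighbor. The function name is hypothetical.

```python
def reduction_schedule(column_height):
    """Number of 3:2 compressors per stage needed to reduce one column to 2 bits."""
    stages = []
    height = column_height
    while height > 2:
        compressors = height // 3
        stages.append(compressors)
        # Each compressor consumes 3 bits, leaves 1 sum bit in the column,
        # and 1 carry bit arrives from the neighboring column.
        height = height - 3 * compressors + 2 * compressors
    return stages


if __name__ == "__main__":
    print(reduction_schedule(9))   # 8 partial product bits plus 1 correction bit
    print(reduction_schedule(17))  # 16 partial product bits plus 1 correction bit
```

For a column of 9 bits the model yields the stage sizes 3, 2, 1, 1, that is, seven compressors in four stages, matching the description above. For 17 bits it yields six stages under the same assumptions.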
3.2.4.2 Architecture Designed for a 16-bit Modulo 2^n+1 Multiplier
For a 16-bit compressor, more possible architectures are discussed, as shown in Fig. 38 (a), (b), (c), and (d), respectively. Among all these architectures, (c) is the best choice. The architecture in (c) has the smallest delay, the lowest power, and the smallest silicon area, which makes it suitable for low-power, high-speed applications. Also, the architecture in (c) has the advantage of being composed of only one type of compressor with a regular layout, just like the proposed 8-bit compressor architecture. The simulated performance and power comparison of all these architectures is shown in Section IV.
Fig. 38 Possible Compressor Architectures for a 16-bit Modulo 2^n+1 Multiplier: (a), (b), (c), (d)
3.3 CIRCUIT DESIGN OF THE FINAL ADDITION STAGE
Table 2 Comparison between the Kogge-stone adder and the Sparse-tree Adder
Adder Type Logic Depth Max Fanout # of Cells
Kogge-Stone 2
Sparse Tree 2
The comparison between the original Kogge-Stone architecture and the sparse-tree structure is summarized in Table 2. The logic depth and maximum fanout of the sparse-tree structure are the same as those of the Kogge-Stone architecture. However, the total number of critical path blocks used in the sparse-tree structure is much smaller. Therefore, the interconnect wire routing problem no longer exists. The advantages of the sparse-tree structure over the Kogge-Stone adder become striking when the input size is large. The architecture of the 16-bit sparse-tree design has been shown in Fig. 10. In Fig. 39, the detailed 4-bit conditional sum generator is proposed.
Fig. 39 the 4-bit Conditional Sum Generator
IV. SIMULATION RESULTS OF THE PROPOSED DESIGN
AND TECHNOLOGY COMPARISON
Fig. 40 Delay of the Full Adder Based Compressor
In this thesis, three main improvements have been made to the circuit implementation. First, in the partial product reduction stage, the MUX-based compressor is introduced to replace the original full adder based design to achieve high performance, low power, and small area. The best architectures of this stage for input widths of 8 and 16 have been chosen. Second, in the final adder stage, a new sparse-tree architecture is introduced to improve on the original Kogge-Stone adder by solving the wire interconnection problem while maintaining the high-speed and low-power characteristics of the Kogge-Stone structure. Finally, a new CNTFET technology is introduced and compared with the popular bulk CMOS technology. The following simulation results cover these points one by one.
4.1 PERFORMANCE COMPARISON BETWEEN THE FULL ADDER
BASED COMPRESSOR AND THE MUX BASED COMPRESSOR
In Fig. 19 and Fig. 20, the structures of the traditional full adder based compressor and the proposed MUX-based compressor are shown, respectively. Table 3 summarizes the delay, power, and area comparison between the two designs. It shows that the MUX-based compressor uses about half the transistors of its full adder based counterpart (250 versus 518), has about 35% less delay (398.74 ps versus 611.41 ps), and consumes about 64% less power (21.93 µW versus 60.75 µW). Thus, the proposed MUX-based compressor is much better than the original full adder based one.
Fig. 41 Delay of the MUX Based Compressor
Table 3 Performance Comparison between the Full Adder Based Compressor and the MUX-based Compressor
                    Full Adder Based   MUX-based
Delay               611.41 ps          398.74 ps
Power               60.75 µW           21.93 µW
# of transistors    518                250
4.2 SIMULATION RESULTS OF DIFFERENT COMPRESSOR
ARCHITECTURES IN THE PARTIAL PRODUCT REDUCTION STAGE
Table 4 Performance and Power Comparison between Different Types of Compressors
                    3:2 Compressor   4:2 Compressor   5:2 Compressor   7:2 Compressor
Delay               1 MUX + 1 XOR    2 MUX + 1 XOR    3 MUX + 1 XOR    5 MUX + 1 XOR
Delay (simulated)   64.26 ps         94.97 ps         126.00 ps        186.26 ps
Power               6.48 µW          9.97 µW          11.87 µW         14.16 µW
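The structural delays and the simulated delays in Table 4 can be related by a simple fit. The sketch below is an illustration only: it assumes that each compressor delay is one XOR delay plus a whole number of identical MUX delays and estimates the two unit delays by ordinary least squares. The variable and function names are hypothetical.

```python
# (number of MUX delays, simulated delay in ps) for the 3:2, 4:2, 5:2, and 7:2 compressors.
TABLE_4 = [(1, 64.26), (2, 94.97), (3, 126.00), (5, 186.26)]


def fit_unit_delays(samples):
    """Least-squares fit of delay = n_mux * t_mux + t_xor; returns (t_mux, t_xor) in ps."""
    count = len(samples)
    mean_mux = sum(n_mux for n_mux, _ in samples) / count
    mean_delay = sum(delay for _, delay in samples) / count
    sxx = sum((n_mux - mean_mux) ** 2 for n_mux, _ in samples)
    sxy = sum((n_mux - mean_mux) * (delay - mean_delay) for n_mux, delay in samples)
    t_mux = sxy / sxx
    t_xor = mean_delay - t_mux * mean_mux
    return t_mux, t_xor


if __name__ == "__main__":
    t_mux, t_xor = fit_unit_delays(TABLE_4)
    print(f"MUX: {t_mux:.2f} ps, XOR: {t_xor:.2f} ps")
```

The fit gives roughly 30.5 ps per MUX and roughly 34 ps per XOR, so the simulated delays in Table 4 grow almost linearly with the number of MUX levels.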
There are a variety of architectures for the partial product reduction stage. Architectures of the partial product reduction stage with an 8-bit input width are shown in Fig. 37, while those with a 16-bit input width are shown in Fig. 38. Each architecture uses different types of compressors, whose features are listed in Table 4. For the 8-bit input width, the architecture in Fig. 37 (b) has advantages in delay, power consumption, and area, as shown in Table 5. For the 16-bit input width, the architecture in Fig. 38 (c) is chosen because, as shown in Table 6, it has the lowest delay, the lowest power, and the smallest silicon area among all the possible architectures.
Table 5 Performance and Power Comparison among Different Compressor Architectures for an 8-bit Modulo 2^n+1 Multiplier
                    Fig. 37 (a)     Fig. 37 (b)
Delay               7 MUX + 3 XOR   4 MUX + 4 XOR
Delay (simulated)   398.74 ps       285.37 ps
Power               21.93 µW        13.81 µW
# of transistors    250             210
Table 6 Performance and Power Comparison among Different Compressor Architectures for a 16-bit Modulo 2^n+1 Multiplier
                    Fig. 38 (a)      Fig. 38 (b)      Fig. 38 (c)     Fig. 38 (d)
Delay               11 MUX + 6 XOR   13 MUX + 4 XOR   6 MUX + 6 XOR   11 MUX + 5 XOR
Delay (simulated)   673.00 ps        707.47 ps        365.24 ps       489.38 ps
Power               48.38 µW         53.54 µW         17.67 µW        22.25 µW
# of transistors    578              534              390             496
4.3 SIMULATION RESULTS OF THE SPARSE-TREE ARCHITECTURE
AND THE KOGGE-STONE ARCHITECTURE
In the final addition stage, another important simulation is needed. The proposed sparse-tree architecture is designed on the assumption that the critical path of the stage is the path that generates the carry-outs, while the 4-bit conditional sum generator is on the non-critical path. The simulated delays of the two paths are shown in Fig. 42 and Fig. 43, respectively. They confirm the assumption: the critical path delay is about 82.15 ps and the non-critical path delay is about 74.84 ps.
Fig. 42 Critical Path Delay of the Sparse-tree Adder
The purpose of replacing the original Kogge-Stone adder with the sparse-tree architecture is to
solve the wire interconnection problem in layout. However, the performance and power of the
new design should be no worse than those of the Kogge-Stone structure. As Table 7 shows, the
PDP of the new sparse-tree structure is even slightly lower than that of the Kogge-Stone
structure, while the wire interconnection problem is also solved.
Table 7 Performance and Power Comparison between the Kogge-Stone Architecture and the
Sparse-tree Architecture
Metric            Kogge-Stone  Sparse-tree
Delay             81.57ps      82.15ps
Power             28.45uW      25.86uW
# of transistors  518          250
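As a worked check of the PDP comparison, the products below are computed directly from the delay and power values in Table 7. PDP is taken here as delay times power, and 1 ps × 1 µW = 1 aJ.

```latex
\begin{align*}
\mathrm{PDP}_{\text{Kogge-Stone}} &= 81.57\,\text{ps} \times 28.45\,\mu\text{W}
  \approx 2320.7\,\text{aJ} \approx 2.32\,\text{fJ} \\
\mathrm{PDP}_{\text{Sparse-tree}} &= 82.15\,\text{ps} \times 25.86\,\mu\text{W}
  \approx 2124.4\,\text{aJ} \approx 2.12\,\text{fJ} \\
\frac{\mathrm{PDP}_{\text{Sparse-tree}}}{\mathrm{PDP}_{\text{Kogge-Stone}}} &\approx 0.915
\end{align*}
```

On these numbers the sparse-tree adder has a PDP roughly 8.5% lower than the Kogge-Stone adder, with 250 transistors instead of 518.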
Fig. 43 Noncritical Path Delay of the Sparse-tree Adder
4.4 SIMULATION RESULTS OF THE CNT-BASED DESIGN AND THE
BULK CMOS-BASED DESIGN
4.4.1 FEATURES OF THE CNT TECHNOLOGY
Fig. 44 Critical Path Delay of the Kogge-Stone Adder
Finally, the simulation results of the CNT-based design and the bulk CMOS-based design are
compared. CNTFETs use semiconducting SWCNTs as the essential element of the integrated
circuit. Depending on the arrangement of its atoms, a SWCNT acts as either a conductor or a
semiconductor. The atom arrangement is represented by the integer pair (n, m). The relationship
between n and m determines the characteristics of the CNT: if n - m = 3i, where i is an integer
(this includes the case n = m), the CNT is metallic. Otherwise, it is semiconducting. The
diameter of the CNT is given by [12]:
$$D_{CNT} = \frac{\sqrt{3}\,a_0}{\pi}\sqrt{n^2 + m^2 + nm} \qquad (34)$$
where a0 = 0.142 nm is the inter-atomic distance between neighboring carbon atoms. The
threshold voltage of the CNT transistor is determined by DCNT, as shown in (35) [12]:
$$V_{th} \approx \frac{E_g}{2e} = \frac{\sqrt{3}}{3}\,\frac{a V_\pi}{e D_{CNT}} \qquad (35)$$
where a = 2.49 Å is the carbon-to-carbon atom distance, V_π = 3.033 eV is the carbon π-π bond
energy in the tight-binding model, and e is the unit electron charge. The threshold voltage is
inversely proportional to the diameter DCNT, so DCNT can be adjusted to obtain the desired
threshold voltage. In this thesis, the (n, m) integer pair is set to (19, 0) [12], which gives a
threshold voltage of 0.293V.
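The chirality rule and the 0.293V figure quoted above can be reproduced with a short calculation. The sketch below assumes the commonly used forms of the two relations, D = (√3·a0/π)·√(n² + m² + nm) and Vth ≈ (√3/3)·a·Vπ/(e·D). The value Vπ = 3.033 eV and the function names are assumptions of this sketch and are not stated in the text above.

```python
import math

A0_NM = 0.142    # inter-atomic distance a0 in nm, from the text
A_NM = 0.249     # a = 2.49 angstrom expressed in nm, from the text
V_PI_EV = 3.033  # ASSUMED pi-pi bond energy in eV (tight-binding model)


def is_metallic(n, m):
    """Chirality rule: metallic when n - m is a multiple of 3 (includes n == m)."""
    return (n - m) % 3 == 0


def diameter_nm(n, m):
    """CNT diameter in nm for chirality (n, m), assumed form of (34)."""
    return math.sqrt(3) * A0_NM / math.pi * math.sqrt(n * n + m * m + n * m)


def threshold_voltage_v(n, m):
    """Threshold voltage in volts, assumed form of (35); eV divided by e gives V."""
    return (math.sqrt(3) / 3) * A_NM * V_PI_EV / diameter_nm(n, m)


if __name__ == "__main__":
    n, m = 19, 0
    print("metallic:", is_metallic(n, m))
    print(f"diameter: {diameter_nm(n, m):.4f} nm")
    print(f"Vth: {threshold_voltage_v(n, m):.3f} V")
```

Under these assumptions a (19, 0) tube is semiconducting, with a diameter close to 1.49 nm and a threshold voltage that agrees with the value quoted in the text.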
CNTFETs have several advantages over CMOS, based on the following observations. First, when
the channel length of a transistor is scaled down to a certain level, generally about 25nm,
traditional methods are no longer effective at reducing power, because the static power
increases rapidly and far outweighs the dynamic power in a traditional design [19]. The maximum
leakage power of MOSFET-based gates is 75 times larger than that of CNTFET gates, and the
minimum leakage power of MOSFET-based gates is about three times larger [12]. The second
observation concerns the frequency response. In [12], the simulation of an inverter shows that
the CNTFET inverter has nearly 3dB more voltage gain and a 3dB frequency three times higher
than the MOSFET inverter. CNTFETs also have an advantage in PVT variation, which is discussed
in Section 4.4.3. The number of tubes in parallel in a CNTFET plays the role of the transistor
width in CMOS, so the “width” of a CNTFET can be adjusted by changing its number of tubes.
Unlike bulk CMOS technology, however, the sizing ratio between pFET and nFET for CNTFETs is
1:1, because the nFET and the pFET have almost the same current driving ability for the same
transistor geometry [12].
To sum up, compared with conventional CMOS technology, CNT technology is increasingly
attractive because of its better performance. With its lower leakage power, better frequency
response, lower PVT variation, and extremely low PDP, CNT is a good candidate to replace CMOS
in the future [12]. In this section, simulation results of the CNT-based and CMOS-based designs
are compared for the proposed modulo 2^n+1 multiplier.
Fig. 45 Delay and Rise-time of the Proposed Multiplier Based on CMOS Technology
4.4.2 POWER, DELAY AND AREA
The simulated output delay and rise time of the CMOS-based modulo 2^n+1 multiplier and of its
CNTFET-based counterpart, each with a fan-out of 4, are shown in Fig. 45 and Fig. 46, and their
power consumption is shown in Fig. 47 (solid line for CMOS and dotted line for CNTFET). A
detailed comparison of delay and power is given in Table 8. All of these simulations are based
on the modulo 2^n+1 multiplier built from the newly designed blocks discussed in this thesis,
including the new compressor, its new parallel architecture, and the sparse-tree design. The PDP
of the CNT-based design is about 94 times lower than the PDP of the bulk CMOS-based design.
Simulation results of each newly designed stage are also reported individually.
Fig. 46 Delay and Rise-time of the Proposed Multiplier Based on CNTFET Technology
Table 8 Performance Comparison between the Proposed Multiplier Based on Two Different
Technologies
Metric            CMOS      CNT
Delay             494.94ps  30.25ps
Rise Time         16.82ps   0.73ps
Power             71.28uW   12.45uW
# of Transistors  2738      2738
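The factor of about 94 between the two PDP values follows directly from the delay and power entries of Table 8. A minimal sketch of the arithmetic, with illustrative names that are not from the thesis:

```python
# Values copied from Table 8 (delay in ps, power in uW).
CMOS = {"delay_ps": 494.94, "power_uw": 71.28}
CNT = {"delay_ps": 30.25, "power_uw": 12.45}


def pdp_fj(tech):
    """Power-delay product in femtojoules (1 ps * 1 uW = 1e-3 fJ)."""
    return tech["delay_ps"] * tech["power_uw"] * 1e-3


def pdp_ratio(reference, candidate):
    """How many times larger the reference PDP is than the candidate PDP."""
    return pdp_fj(reference) / pdp_fj(candidate)


if __name__ == "__main__":
    print(f"CMOS PDP: {pdp_fj(CMOS):.2f} fJ")
    print(f"CNT  PDP: {pdp_fj(CNT):.3f} fJ")
    print(f"CMOS/CNT PDP ratio: {pdp_ratio(CMOS, CNT):.1f}")
```

Most of the gap comes from delay (roughly a factor of 16), with power contributing a further factor of about 6.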
4.4.3 PVT VARIATION
The temperature, supply voltage, and process variations of the CMOS-based and CNT-based
designs are compared in Tables 9 to 13 and in Fig. 48, Fig. 49, and Fig. 50, respectively. As
the tables and figures show, the CNT-based design is much more robust than its CMOS
counterpart.
Fig. 47 Power Consumption of the Proposed Multiplier Based on Two Technologies
Table 9 Delay Comparison between Two Technologies with Different Temperatures
Temperature (ºC) CMOS (ps) CNT (ps)
0 345.19 31.23
25 409.48 31.229
50 480.85 31.23
75 558.7 31.229
100 642.93 31.229
Table 10 Rise-time Comparison between Two Technologies with Different Temperatures
Temperature (ºC) CMOS (ps) CNTFET (ps)
0 33.3 3.4885
25 40.83 3.4883
50 48.972 3.4885
75 57.957 3.4883
100 68.875 3.4883
Table 11 Delay Comparison between Two Technologies with Different Supply Voltages
Supply Voltage (V) CMOS (ps) CNTFET (ps)
0.72 551.5 35.057
0.76 470.29 33.016
0.8 409.48 31.229
0.84 362.39 29.729
0.88 326.14 26.434
Table 12 Delay Comparison between Two Technologies with Different Process Corners
Process Corner CMOS (ps) CNTFET (ps)
ff (-3%) 272.8 30.815
normal 409.48 31.229
ss (+3%) 616.69 31.525
Table 13 Rise-time Comparison between Two Technologies with Different Process Corners
Process Corner CMOS (ps) CNTFET (ps)
ff (-3%) 30.743 3.2904
normal 40.83 3.4883
ss (+3%) 55.658 3.4779
Fig. 48 Temperature Variation (delay in ps versus temperature from 0 °C to 100 °C, for CMOS and CNT)
Fig. 49 Voltage Variation (delay in ps versus supply voltage from 0.72 V to 0.88 V, for CMOS and CNT)
Fig. 50 Process Variation
The simulation results show that the delay of the CMOS-based modulo 2^n+1 multiplier rises
with temperature, from 345.19ps at 0 ºC to 642.93ps at 100 ºC, whereas the delay of the
CNT-based multiplier stays at about 31.23ps over the same range (Table 9). The CNT-based
multiplier is likewise much less sensitive to supply voltage variation and process variation.
Unlike the temperature case, the delay of both multipliers decreases as the supply voltage
increases. To sum up, CNT technology is much more robust than its CMOS counterpart, which
makes it a very promising choice for applications that require high stability against
environmental variation, such as military and research applications.
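The robustness comparison can be made quantitative from the delay tables. The sketch below uses the values of Tables 9, 11, and 12 and an assumed, simple sensitivity metric: the spread of delay across the sweep divided by the nominal delay (25 ºC, 0.8 V, normal corner). The metric and the names are illustrative choices, not taken from the thesis.

```python
# Delay values in ps, copied from Tables 9, 11 and 12.
DELAYS = {
    "temperature": {
        "CMOS": [345.19, 409.48, 480.85, 558.7, 642.93],
        "CNT": [31.23, 31.229, 31.23, 31.229, 31.229],
    },
    "voltage": {
        "CMOS": [551.5, 470.29, 409.48, 362.39, 326.14],
        "CNT": [35.057, 33.016, 31.229, 29.729, 26.434],
    },
    "process": {
        "CMOS": [272.8, 409.48, 616.69],
        "CNT": [30.815, 31.229, 31.525],
    },
}
NOMINAL = {"CMOS": 409.48, "CNT": 31.229}


def relative_spread(values, nominal):
    """(max - min) / nominal: an assumed, simple sensitivity metric."""
    return (max(values) - min(values)) / nominal


if __name__ == "__main__":
    for sweep, by_tech in DELAYS.items():
        for tech, values in by_tech.items():
            spread = relative_spread(values, NOMINAL[tech])
            print(f"{sweep:12s} {tech:5s} spread = {100 * spread:.2f}%")
```

With this metric the CNT-based design shows the smaller spread in all three sweeps. The difference is largest for temperature and process and smallest for supply voltage.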
[Fig. 50 plot: delay in ps versus process corner (ff, nom, ss), for CMOS and CNT]
V. Conclusion
In this thesis, a new design of a modulo 2^n+1 multiplier is proposed. The new MUX-based
compressor increases speed and reduces power compared with the conventional full-adder-based
compressor. The parallel architecture of the compressors further speeds up the partial product
reduction stage and leads to a regular layout. For the final addition stage, the sparse-tree
architecture keeps the speed advantage of the Kogge-Stone adder and solves its wire
interconnection problem. Finally, a comparison between the CNT-based and CMOS-based designs is
presented. CNT has advantages in leakage power, frequency response, PVT variation, and PDP.
The results indicate that CNT is a better choice than CMOS for meeting aggressive high-speed,
low-power requirements with less PVT variation, and this thesis can serve as a reference for
future CNTFET-based designs in other applications.
References
[1] Modugu, R., Kim, Y.B., and Choi, M., “A fast low-power modulo 2^n+1 multiplier”.
[2] Wang, Y., Swamy, M. N. S., and Omair Ahmad, M., “Residue-to-binary number converters
for three moduli set”, IEEE Transactions on Circuits and Systems, Vol. 46, No. 2, February
1999.
[3] Gallaher, D., Petry, F. E., and Srinivasan, P., “The Digit Parallel Method for Fast RNS to
Weighted Number System Conversion for Specific Moduli (2^n-1; 2^n; 2^n+1)”, IEEE
Transactions on Circuits and Systems, Vol. 44, No. 1, January 1997.
[4] Curiger, A., Bonnenberg, H., and Kaeslin, H., “Regular VLSI Architectures for
Multiplication Modulo (2^n + 1)”, IEEE Journal of Solid-State Circuits, Vol. 26, No. 7, July 1991.
[5] Hiasat, A., “New memoryless, mod (2^n+1) residue multiplier”, Electronics Letters, Vol. 28,
No. 3, 30th January 1992.
[6] Wrzyszcz, A. and Milford, D., “A new modulo 2^n+1 multiplier”.
[7] Zimmermann, R., “Efficient VLSI implementation of modulo (2^n ± 1) addition and
multiplication”, IEEE Trans. Comput., Vol. 51, pp. 1389-1399, 2002.
[8] Chaves, R. and Sousa, L., “Faster Modulo 2^n + 1 Multipliers without Booth recoding”.
[9] Ma, Y., “A Simplified Architecture for Modulo (2^n + 1) Multiplication”, IEEE Transactions
on Computers, Vol. 47, No. 3, March 1998.
[10] Wang, Z., Jullien, G.A. and Miller, W.C., “An Efficient Tree Architecture for Modulo
2^n + 1 Multiplication”, Journal of VLSI Signal Processing 14, 241-248, 1996.
[11] Vergos, H.T. and Efstathiou, C., “Design of efficient modulo 2^n + 1 multipliers”, IET
Comput. Digital Technology, Vol. 1, No. 1, pp. 49-57, 2007.
[12] Kim, Y.B., “Integrated circuit design based on carbon nanotube field effect transistor,”
IEEE Journal of Trans. on EE Materials, Vol. 12, No.5, pp.175-188, Oct. 25, 2011.
[13] Kogge, P. and Stone, H. S., “A parallel algorithm for the efficient solution of a general class
of recurrence equations” IEEE Trans. Comput., Vol. C-22, pp. 786-793, Aug 1973.
[14] Mathew, S., Anders, M., Krishnamurthy, R.K. and Borkar, S., “A 4-GHz 130-nm address
generation unit with 32-bit sparse-tree adder core”, IEEE Journal of Solid-State Circuits, Vol.
38, No. 5, pp. 689-695, May 2003.
[15] Vergos, H.T., Efstathiou, C., and Nikolos, D., “Diminished-One Modulo 2^n + 1 Adder
Design”, IEEE Transactions on Computers, Vol. 51, No. 12, December 2002.
[16] Sreehari, V., Kirthi, M., Lingamneni, A. and Sreekanth, R., “Novel architectures for
high-speed and low-power 3-2, 4-2 and 5-2 compressor,” IEEE 20th International Conference on
VLSI Design.
[17] Ma, W.N. and Li, S.G., “A New High Compression Compressor for Large Multiplier”.
[18] Chip-Hong, C., Jiangmin, G. and Mingyan, Z., “Ultra low-voltage low-power CMOS 4-2
and 5-2 compressors for fast arithmetic circuits,” IEEE Trans. on Circuits and Systems, Vol.51,
No. 10, Oct., 2004.
[19] Chandrakasan, A., Bowhill, W.J. and Fox, F., “Design of High-Performance Microprocessor
Circuits”, Wiley-IEEE Press, October 2000.
Appendix: Hspice Input Files
A.1 PARTIAL PRODUCT GENERATION STAGE SUBCIRCUIT FOR
BOTH CMOS AND CNT TECHNOLOGY
.subckt partial_product
+x1_1 x1_2 x1_3 x1_4 x1_5 x1_6 x1_7 x1_8
+x2_1 x2_2 x2_3 x2_4 x2_5 x2_6 x2_7 x2_8
+x3_1 x3_2 x3_3 x3_4 x3_5 x3_6 x3_7 x3_8
+x4_1 x4_2 x4_3 x4_4 x4_5 x4_6 x4_7 x4_8
+x5_1 x5_2 x5_3 x5_4 x5_5 x5_6 x5_7 x5_8
+x6_1 x6_2 x6_3 x6_4 x6_5 x6_6 x6_7 x6_8
+x7_1 x7_2 x7_3 x7_4 x7_5 x7_6 x7_7 x7_8
+x8_1 x8_2 x8_3 x8_4 x8_5 x8_6 x8_7 x8_8
+a0 a1 a2 a3 a4 a5 a6 a7
+b0 b1 b2 b3 b4 b5 b6 b7
***first (LSB) column of the final partial product matrix
X1 x1_1_bar a0 b0 nand2
X2 x1_2 a7 b1 nand2
X3 x1_3 a6 b2 nand2
X4 x1_4 a5 b3 nand2
X5 x1_5 a4 b4 nand2
X6 x1_6 a3 b5 nand2
X7 x1_7 a2 b6 nand2
X8 x1_8 a1 b7 nand2
X17 x1_1 x1_1_bar inv
***second column of the final partial product matrix
X18 x2_1_bar a1 b0 nand2
X19 x2_2_bar a0 b1 nand2
X20 x2_3 a7 b2 nand2
X21 x2_4 a6 b3 nand2
X22 x2_5 a5 b4 nand2
X23 x2_6 a4 b5 nand2
X24 x2_7 a3 b6 nand2
X25 x2_8 a2 b7 nand2
X34 x2_1 x2_1_bar inv
X35 x2_2 x2_2_bar inv
***third column of the final partial product matrix
X36 x3_1_bar a2 b0 nand2
X37 x3_2_bar a1 b1 nand2
X38 x3_3_bar a0 b2 nand2
X39 x3_4 a7 b3 nand2
X40 x3_5 a6 b4 nand2
X41 x3_6 a5 b5 nand2
X42 x3_7 a4 b6 nand2
X43 x3_8 a3 b7 nand2
X52 x3_1 x3_1_bar inv
X53 x3_2 x3_2_bar inv
X54 x3_3 x3_3_bar inv
***fourth column of the final partial product matrix
X55 x4_1_bar a3 b0 nand2
X56 x4_2_bar a2 b1 nand2
X57 x4_3_bar a1 b2 nand2
X58 x4_4_bar a0 b3 nand2
X59 x4_5 a7 b4 nand2
X60 x4_6 a6 b5 nand2
X61 x4_7 a5 b6 nand2
X62 x4_8 a4 b7 nand2
X71 x4_1 x4_1_bar inv
X72 x4_2 x4_2_bar inv
X73 x4_3 x4_3_bar inv
X74 x4_4 x4_4_bar inv
***fifth column of the final partial product matrix
X75 x5_1_bar a4 b0 nand2
X76 x5_2_bar a3 b1 nand2
X77 x5_3_bar a2 b2 nand2
X78 x5_4_bar a1 b3 nand2
X79 x5_5_bar a0 b4 nand2
X80 x5_6 a7 b5 nand2
X81 x5_7 a6 b6 nand2
X82 x5_8 a5 b7 nand2
X91 x5_1 x5_1_bar inv
X92 x5_2 x5_2_bar inv
X93 x5_3 x5_3_bar inv
X94 x5_4 x5_4_bar inv
X95 x5_5 x5_5_bar inv
***sixth column of the final partial product matrix
X96 x6_1_bar a5 b0 nand2
X97 x6_2_bar a4 b1 nand2
X98 x6_3_bar a3 b2 nand2
X99 x6_4_bar a2 b3 nand2
X100 x6_5_bar a1 b4 nand2
X101 x6_6_bar a0 b5 nand2
X102 x6_7 a7 b6 nand2
X103 x6_8 a6 b7 nand2
X112 x6_1 x6_1_bar inv
X113 x6_2 x6_2_bar inv
X114 x6_3 x6_3_bar inv
X115 x6_4 x6_4_bar inv
X116 x6_5 x6_5_bar inv
X117 x6_6 x6_6_bar inv
***seventh column of the final partial product matrix
X118 x7_1_bar a6 b0 nand2
X119 x7_2_bar a5 b1 nand2
X120 x7_3_bar a4 b2 nand2
X121 x7_4_bar a3 b3 nand2
X122 x7_5_bar a2 b4 nand2
X123 x7_6_bar a1 b5 nand2
X124 x7_7_bar a0 b6 nand2
X125 x7_8 a7 b7 nand2
X134 x7_1 x7_1_bar inv
X135 x7_2 x7_2_bar inv
X136 x7_3 x7_3_bar inv
X137 x7_4 x7_4_bar inv
X138 x7_5 x7_5_bar inv
X139 x7_6 x7_6_bar inv
X140 x7_7 x7_7_bar inv
***eighth (MSB) column of the final partial product matrix
X141 x8_1_bar a7 b0 nand2
X142 x8_2_bar a6 b1 nand2
X143 x8_3_bar a5 b2 nand2
X144 x8_4_bar a4 b3 nand2
X145 x8_5_bar a3 b4 nand2
X146 x8_6_bar a2 b5 nand2
X147 x8_7_bar a1 b6 nand2
X148 x8_8_bar a0 b7 nand2
X157 x8_1 x8_1_bar inv
X158 x8_2 x8_2_bar inv
X159 x8_3 x8_3_bar inv
X160 x8_4 x8_4_bar inv
X161 x8_5 x8_5_bar inv
X162 x8_6 x8_6_bar inv
X163 x8_7 x8_7_bar inv
X563 x8_8 x8_8_bar inv
.ends
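The wiring pattern of the netlist above can be summarized behaviourally. In column c (counted from 0), row r carries a[(c - r) mod 8] AND b[r], and the bits with r > c, which are the ones that wrapped around past bit 7, are left inverted because they are taken straight from the NAND gate. The sketch below is an illustrative Python model of that pattern and is not part of the thesis's HSPICE files. The correction constant it subtracts is derived here from 2^n ≡ -1 (mod 2^n+1) for plain binary operands. It is an assumption of this sketch, and the operand representation and correction terms used in the thesis may differ.

```python
def partial_product_matrix(a, b, n=8):
    """Return matrix[col][row] of bits, mirroring the x<col>_<row> nets."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    matrix = []
    for col in range(n):
        column = []
        for row in range(n):
            bit = a_bits[(col - row) % n] & b_bits[row]
            if row > col:
                bit ^= 1  # wrapped-around bit: NAND output used without inverter
            column.append(bit)
        matrix.append(column)
    return matrix


def correction_constant(n=8):
    """Sum of the weights contributed by the inverted positions alone."""
    return sum((n - 1 - col) << col for col in range(n))


def product_mod(a, b, n=8):
    """a * b mod (2^n + 1), reconstructed from the bit matrix."""
    matrix = partial_product_matrix(a, b, n)
    total = sum(sum(column) << col for col, column in enumerate(matrix))
    return (total - correction_constant(n)) % ((1 << n) + 1)


if __name__ == "__main__":
    print(product_mod(200, 123), (200 * 123) % 257)
```

Each inverted bit contributes 2^c - p·2^c instead of -p·2^c, so subtracting the fixed constant once restores the modular product. This is why the wrapped bits can be left inverted rather than negated explicitly.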
A.2 PARTIAL PRODUCT REDUCTION STAGE SUBCIRCUIT FOR BOTH
CMOS AND CNT TECHNOLOGY
.subckt overall_compressor sum4_8 sum4_7 sum4_6 sum4_5 sum4_4 sum4_3 sum4_2 sum4_1
+carry4_7 carry4_6 carry4_5 carry4_4 carry4_3 carry4_2 carry4_1 carry4_8_bar
+x1_1 x1_2 x1_3 x1_4 x1_5 x1_6 x1_7 x1_8
+x2_1 x2_2 x2_3 x2_4 x2_5 x2_6 x2_7 x2_8
+x3_1 x3_2 x3_3 x3_4 x3_5 x3_6 x3_7 x3_8
+x4_1 x4_2 x4_3 x4_4 x4_5 x4_6 x4_7 x4_8
+x5_1 x5_2 x5_3 x5_4 x5_5 x5_6 x5_7 x5_8
+x6_1 x6_2 x6_3 x6_4 x6_5 x6_6 x6_7 x6_8
+x7_1 x7_2 x7_3 x7_4 x7_5 x7_6 x7_7 x7_8
+x8_1 x8_2 x8_3 x8_4 x8_5 x8_6 x8_7 x8_8
*x1_2 = row 2, column 1
*sum1_2 = 3:2 compressor block, level 1, second block
X1 sum1_1 carry1_1 x1_1 x1_2 x1_3 compressor
X2 sum1_2 carry1_2 x1_4 x1_5 x1_6 compressor
X3 sum1_3 carry1_3 x1_7 x1_8 0 compressor
X6 sum1_6 carry1_6 x2_1 x2_2 x2_3 compressor
X7 sum1_7 carry1_7 x2_4 x2_5 x2_6 compressor
X8 sum1_8 carry1_8 x2_7 x2_8 vdd compressor
X11 sum1_11 carry1_11 x3_1 x3_2 x3_3 compressor
X12 sum1_12 carry1_12 x3_4 x3_5 x3_6 compressor
X13 sum1_13 carry1_13 x3_7 x3_8 0 compressor
X16 sum1_16 carry1_16 x4_1 x4_2 x4_3 compressor
X17 sum1_17 carry1_17 x4_4 x4_5 x4_6 compressor
X18 sum1_18 carry1_18 x4_7 x4_8 0 compressor
X21 sum1_21 carry1_21 x5_1 x5_2 x5_3 compressor
X22 sum1_22 carry1_22 x5_4 x5_5 x5_6 compressor
X23 sum1_23 carry1_23 x5_7 x5_8 0 compressor
X26 sum1_26 carry1_26 x6_1 x6_2 x6_3 compressor
X27 sum1_27 carry1_27 x6_4 x6_5 x6_6 compressor
X28 sum1_28 carry1_28 x6_7 x6_8 0 compressor
X31 sum1_31 carry1_31 x7_1 x7_2 x7_3 compressor
X32 sum1_32 carry1_32 x7_4 x7_5 x7_6 compressor
X33 sum1_33 carry1_33 x7_7 x7_8 0 compressor
X36 sum1_36 carry1_36 x8_1 x8_2 x8_3 compressor
X37 sum1_37 carry1_37 x8_4 x8_5 x8_6 compressor
X38 sum1_38 carry1_38 x8_7 x8_8 0 compressor
X81 carry1_36_bar carry1_36 inv
X82 carry1_37_bar carry1_37 inv
X83 carry1_38_bar carry1_38 inv
*new column1 inputs: sum1_1, sum1_2, sum1_3, sum1_4, sum1_5, carry1_76_bar,
*carry1_77_bar, carry1_78_bar, carry1_79_bar, carry1_80_bar, x1_16, 0
*new column2 inputs: sum1_6, sum1_7, sum1_8, sum1_9, sum1_10, carry1_1, carry1_2,
*carry1_3, carry1_4, carry1_5, x2_16, vdd
X86 sum2_1 carry2_1 sum1_1 sum1_2 sum1_3 compressor
X87 sum2_2 carry2_2 carry1_36_bar carry1_37_bar carry1_38_bar compressor
X90 sum2_5 carry2_5 sum1_6 sum1_7 sum1_8 compressor
X91 sum2_6 carry2_6 carry1_1 carry1_2 carry1_3 compressor
X94 sum2_9 carry2_9 sum1_11 sum1_12 sum1_13 compressor
X95 sum2_10 carry2_10 carry1_6 carry1_7 carry1_8 compressor
X98 sum2_13 carry2_13 sum1_16 sum1_17 sum1_18 compressor
X99 sum2_14 carry2_14 carry1_11 carry1_12 carry1_13 compressor
X102 sum2_17 carry2_17 sum1_21 sum1_22 sum1_23 compressor
X103 sum2_18 carry2_18 carry1_16 carry1_17 carry1_18 compressor
X106 sum2_21 carry2_21 sum1_26 sum1_27 sum1_28 compressor
X107 sum2_22 carry2_22 carry1_21 carry1_22 carry1_23 compressor
X110 sum2_25 carry2_25 sum1_31 sum1_32 sum1_33 compressor
X111 sum2_26 carry2_26 carry1_26 carry1_27 carry1_28 compressor
X114 sum2_29 carry2_29 sum1_36 sum1_37 sum1_38 compressor
X115 sum2_30 carry2_30 carry1_31 carry1_32 carry1_33 compressor
X150 carry2_29_bar carry2_29 inv
X151 carry2_30_bar carry2_30 inv
*new column1 inputs: sum2_1 sum2_2 sum2_3 sum2_4 carry2_61_bar carry2_62_bar
*new column2 inputs: sum2_5 sum2_6 sum2_7 sum2_8 carry2_1 carry2_2
X154 sum3_1 carry3_1 sum2_1 sum2_2 carry2_29_bar compressor
X155 sum3_2 carry3_2 sum2_5 sum2_6 carry2_1 compressor
X156 sum3_3 carry3_3 sum2_9 sum2_10 carry2_5 compressor
X157 sum3_4 carry3_4 sum2_13 sum2_14 carry2_9 compressor
X158 sum3_5 carry3_5 sum2_17 sum2_18 carry2_13 compressor
X159 sum3_6 carry3_6 sum2_21 sum2_22 carry2_17 compressor
X160 sum3_7 carry3_7 sum2_25 sum2_26 carry2_21 compressor
X161 sum3_8 carry3_8 sum2_29 sum2_30 carry2_25 compressor
X186 carry3_8_bar carry3_8 inv
*new column1 inputs: sum3_1 sum3_2 carry2_63_bar carry2_64_bar carry3_31_bar
*carry3_32_bar
*new column2 inputs: sum3_3 sum3_4 carry2_3 carry2_4 carry3_1 carry3_2
X200 sum4_1 carry4_1 sum3_1 carry3_8_bar carry2_30_bar compressor
X201 sum4_2 carry4_2 sum3_2 carry3_1 carry2_2 compressor
X202 sum4_3 carry4_3 sum3_3 carry3_2 carry2_6 compressor
X203 sum4_4 carry4_4 sum3_4 carry3_3 carry2_10 compressor
X204 sum4_5 carry4_5 sum3_5 carry3_4 carry2_14 compressor
X205 sum4_6 carry4_6 sum3_6 carry3_5 carry2_18 compressor
X206 sum4_7 carry4_7 sum3_7 carry3_6 carry2_22 compressor
X207 sum4_8 carry4_8 sum3_8 carry3_7 carry2_26 compressor
X208 carry4_8_bar carry4_8 inv
*final output: sum4_8 sum4_7 sum4_6 sum4_5 sum4_4 sum4_3 sum4_2 sum4_1
* carry4_7 carry4_6 carry4_5 carry4_4 carry4_3 carry4_2 carry4_1 carry4_8_bar
.ends
A.3 FINAL ADDITION STAGE SUBCIRCUIT FOR BOTH CMOS AND
CNT TECHNOLOGY
.subckt carry_merge C_m1 C_m1_bar C3 g0 g1 g2 g3 g4 g5 g6 g7 p0 p1 p2 p3 p4 p5 p6 p7
+psum0 psum0_bar psum1 psum2 psum3 psum4 psum4_bar psum5 psum6 psum7 a0 a1 a2 a3 a4
+a5 a6 a7 b0 b1 b2 b3 b4 b5 b6 b7
*generate & propagate generation block
X1 g0 p0 a0 b0 g_p
X2 g1 p1 a1 b1 g_p
X3 g2 p2 a2 b2 g_p
X4 g3 p3 a3 b3 g_p
X5 g4 p4 a4 b4 g_p
X6 g5 p5 a5 b5 g_p
X7 g6 p6 a6 b6 g_p
X8 g7 p7 a7 b7 g_p
*inverted carry emerge block
X21 go5 po5 g7 g6 p7 p6 s_o
X22 go6 po6 g5 g4 p5 p4 s_o
X23 go7 po7 g3 g2 p3 p2 s_o
X24 go8 po8 g1 g0 p1 p0 s_o
X27 go11 po11 go5 go6 po5 po6 s_o
X28 go12 po12 go7 go8 po7 po8 s_o
X31 C_m1_bar po15 go11 go12 po11 po12 s_o
X32 C3 po19 go12 go11_b po12 po11 s_o
* final carry
X33 C_m1 C_m1_bar inv
X33i go11_b go11 inv
*partial sum generation block
X37 psum0_bar psum0 a0 b0 xor_xnor
X38 psum1_bar psum1 a1 b1 xor_xnor
X39 psum2_bar psum2 a2 b2 xor_xnor
X40 psum3_bar psum3 a3 b3 xor_xnor
X41 psum4_bar psum4 a4 b4 xor_xnor
X42 psum5_bar psum5 a5 b5 xor_xnor
X43 psum6_bar psum6 a6 b6 xor_xnor
X44 psum7_bar psum7 a7 b7 xor_xnor
.ends
*4-bit conditional sum generator
.subckt cond s0 s1 s2 s3 cin cin_bar psum0 psum0_bar psum1 psum2 psum3 g0 g1 g2 g3 p0 p1
+p2 p3
*s0 generator
X3 s0 psum0_bar psum0 cin cin_bar mux
*s1 generator
X4 n4 g0 p0 nor2
X5 n3 n3_0 n4 psum1 xor_xnor
X6 n08 n2 g0 psum1 xor_xnor
X7 s1 n3 n2 cin cin_bar mux
*s2 generator
X8 n9 g0 p1 nand2
X9 n10 g1 inv
X10 n7 n9 n10 nand2
X11 n7_0 n7 inv
X12 n11 p1 p0 nand2
X13 n8 n11 n7_0 nand2
X14 n09 n6 n8 psum2 xor_xnor
X15 n010 n5 n7 psum2 xor_xnor
X16 s2 n6 n5 cin cin_bar mux
*s3 generator
X17 n16 p1 p2 g0 nand3
X18 n17 p2 g1 nand2
X19 n13 g2 inv
X20 n18 n16 n17 n13 nand3
X21 n14 n18 inv
X22 n12 p1 p2 p0 nand3
X23 n15 n12 n14 nand2
X24 n011 n20 n15 psum3 xor_xnor
X25 n012 n19 n18 psum3 xor_xnor
X26 s3 n20 n19 cin cin_bar mux
.ends
*final stage adder
.subckt final_stage_adder s0 s1 s2 s3 s4 s5 s6 s7 a0 a1 a2 a3 a4 a5 a6 a7 b0 b1 b2 b3 b4 b5 b6 b7
X1 C_m1 C_m1_bar C3 g0 g1 g2 g3 g4 g5 g6 g7 p0 p1 p2 p3 p4 p5 p6 p7 psum0 psum0_bar
+psum1 psum2 psum3 psum4 psum4_bar psum5 psum6 psum7 a0 a1 a2 a3 a4 a5 a6 a7 b0 b1 b2
+b3 b4 b5 b6 b7 carry_merge
X2 C3_bar C3 inv
X5 s0 s1 s2 s3 C_m1 C_m1_bar psum0 psum0_bar psum1 psum2 psum3 g0 g1 g2 g3
+p0 p1 p2 p3 cond
X6 s4 s5 s6 s7 C3 C3_bar psum4 psum4_bar psum5 psum6 psum7 g4 g5 g6 g7 p4
+p5 p6 p7 cond
.ends
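Read structurally, the three subcircuits above form a sparse-tree adder with an inverted end-around carry. `carry_merge` produces the carry-out of the 8-bit sum (`C_m1_bar`), its inverse `C_m1` is the carry-in of the low 4-bit conditional sum block, and `C3` is the carry into the high block. The Python model below is a behavioural sketch of that reading. The function names are illustrative, and the group terms are merged serially here rather than in a tree, which gives the same result. It checks that the structure computes (a + b + NOT carry_out) mod 2^8.

```python
def merge(gl, pl, gr, pr):
    """The netlist's s_o operator: left (more significant) group over right group."""
    return gl | (pl & gr), pl & pr


def group(g, p, lo, hi):
    """Group generate/propagate over bits lo..hi inclusive."""
    big_g, big_p = g[lo], p[lo]
    for i in range(lo + 1, hi + 1):
        big_g, big_p = merge(g[i], p[i], big_g, big_p)
    return big_g, big_p


def cond_sum4(g, p, x, base, cin):
    """4-bit conditional sum: compute both candidate sums, then select by cin."""
    candidates = {}
    for assumed in (0, 1):
        carry, bits = assumed, []
        for i in range(base, base + 4):
            bits.append(x[i] ^ carry)
            carry = g[i] | (p[i] & carry)
        candidates[assumed] = bits
    return candidates[cin]


def final_stage_add(a, b):
    """8-bit addition with inverted end-around carry, as read from the netlist."""
    g = [(a >> i) & (b >> i) & 1 for i in range(8)]    # generate = AND
    p = [((a >> i) | (b >> i)) & 1 for i in range(8)]  # propagate = OR
    x = [((a >> i) ^ (b >> i)) & 1 for i in range(8)]  # partial sum = XOR
    g_lo, p_lo = group(g, p, 0, 3)
    g_hi, p_hi = group(g, p, 4, 7)
    carry_out, _ = merge(g_hi, p_hi, g_lo, p_lo)       # net C_m1_bar
    c_m1 = carry_out ^ 1                               # inverted end-around carry
    c3, _ = merge(g_lo, p_lo, g_hi ^ 1, p_hi)          # instance X32: carry into bit 4
    bits = cond_sum4(g, p, x, 0, c_m1) + cond_sum4(g, p, x, 4, c3)
    return sum(bit << i for i, bit in enumerate(bits))
```

Using the inverse of the high-group generate signal in `c3`, instead of the full inverted carry-out, avoids a combinational loop. The two are equivalent because the full carry-out only matters for `c3` when the low group does not generate a carry itself.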
A.4 OTHER SUBCIRCUITS FOR CMOS TECHNOLOGY
*inverter
.subckt inv out in
M1 out in vdd vdd pmos W=256n L=32n
M2 out in 0 0 nmos W=128n L=32n
.ends
*xor_xnor
.subckt xor_xnor xnor xor a b
M1 a b xor vdd pmos L=32nm W=256nm
M2 xor a b vdd pmos L=32nm W=256nm
M3 xor b 1 0 nmos L=32nm W=64nm
M4 1 a 0 0 nmos L=32nm W=64nm
M5 xnor xor vdd vdd pmos L=32nm W=64nm
M6 xor xnor 0 0 nmos L=32nm W=32nm
M7 2 b vdd vdd pmos L=32nm W=128nm
M8 xnor a 2 vdd pmos L=32nm W=128nm
M9 a b xnor 0 nmos L=32nm W=128nm
M10 xnor a b 0 nmos L=32nm W=128nm
.ends
*2 to 1 mux
.subckt mux out a b set set_bar
M1 1 a vdd vdd pmos W=128nm L=32nm
M2 4 b vdd vdd pmos W=128nm L=32nm
M3 2 set_bar 1 vdd pmos W=128nm L=32nm
M4 2 set 4 vdd pmos W=128nm L=32nm
M5 2 set 3 0 nmos W=64nm L=32nm
M6 2 set_bar 5 0 nmos W=64nm L=32nm
M7 3 a 0 0 nmos W=64nm L=32nm
M8 5 b 0 0 nmos W=64nm L=32nm
X1 out 2 inv
.ends
*3 input nand
.subckt nand3 out a b c
m1 out a vdd vdd pmos l=32n w=256n
m2 out b vdd vdd pmos l=32n w=256n
m3 out c vdd vdd pmos l=32n w=256n
m4 out a 2 0 nmos l=32n w=384n
m5 2 b 3 0 nmos l=32n w=384n
m6 3 c 0 0 nmos l=32n w=384n
.ends
*2 input nand
.subckt nand2 out a b
m1 out a vdd vdd pmos l=32n w=256n
m2 out b vdd vdd pmos l=32n w=256n
m3 out a 2 0 nmos l=32n w=256n
m4 2 b 0 0 nmos l=32n w=256n
.ends
*2 input nor
.subckt nor2 out a b
m1 2 a vdd vdd pmos l=32n w=512n
m2 out b 2 vdd pmos l=32n w=512n
m3 out a 0 0 nmos l=32n w=128n
m4 out b 0 0 nmos l=32n w=128n
.ends
* special operator
.subckt s_o Gout Pout gl gr pl pr
X1 1 pl gr nand2
X2 2 gl inv
X3 Gout 1 2 nand2
X4 3 pl pr nand2
X5 Pout 3 inv
.ends
* G_P generator
.subckt g_p gi pi ai bi
X1 gi 1 inv
X2 pi 2 inv
X3 1 ai bi nand2
X4 2 ai bi nor2
.ends
.subckt compressor sum carry x1 x2 x3
X1 xnor xor x1 x2 xor_xnor
X2 sum xnor xor x3 x3_bar mux
X3 x3_bar x3 inv
X4 carry x1 x3 xnor xor mux
.ends
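The `compressor` subcircuit above is a 3:2 compressor built from one XOR/XNOR cell and two multiplexers. Reading the transistor-level `mux` as "out = a when set is high, otherwise out = b" (an interpretation of the netlist, not a statement from the thesis), its behaviour can be sketched and checked against a full adder as follows:

```python
from itertools import product


def mux(a, b, sel):
    """2-to-1 multiplexer as read from the netlist: a when sel is 1, else b."""
    return a if sel else b


def compressor_3_2(x1, x2, x3):
    """MUX-based 3:2 compressor; returns (sum, carry)."""
    xor = x1 ^ x2
    xnor = xor ^ 1
    total = mux(xnor, xor, x3)  # instance X2: x3 selects XNOR or XOR
    carry = mux(x1, x3, xnor)   # instance X4: x1 if x1 == x2, otherwise x3
    return total, carry


if __name__ == "__main__":
    for bits in product((0, 1), repeat=3):
        s, c = compressor_3_2(*bits)
        print(bits, "-> sum", s, "carry", c)
```

In this structure both outputs are one multiplexer away from the XOR/XNOR cell, which matches the "1mux+1xor" delay listed for the 3:2 compressor in Table 4.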
A.5 OTHER SUBCIRCUITS FOR CNT TECHNOLOGY
* PCNFET Lch=32nm n1=19 n2=0 tubes=8
* NCNFET Lch=32nm n1=19 n2=0 tubes=8
*inverter
.subckt inv out in
X1 out in vdd vdd PCNFET Lch=32nm n1=19 n2=0 tubes=8
X2 out in 0 0 NCNFET Lch=32nm n1=19 n2=0 tubes=8
.ends
*xor_xnor
.subckt xor_xnor xnor xor a b
X1 a b xor vdd PCNFET Lch=32nm n1=19 n2=0 tubes=8
X2 xor a b vdd PCNFET Lch=32nm n1=19 n2=0 tubes=8
X3 xor b 1 0 NCNFET Lch=32nm n1=19 n2=0 tubes=4
X4 1 a 0 0 NCNFET Lch=32nm n1=19 n2=0 tubes=4
X5 xnor xor vdd vdd PCNFET Lch=32nm n1=19 n2=0 tubes=2
X6 xor xnor 0 0 NCNFET Lch=32nm n1=19 n2=0 tubes=2
X7 2 b vdd vdd PCNFET Lch=32nm n1=19 n2=0 tubes=4
X8 xnor a 2 vdd PCNFET Lch=32nm n1=19 n2=0 tubes=4
X9 a b xnor 0 NCNFET Lch=32nm n1=19 n2=0 tubes=8
X10 xnor a b 0 NCNFET Lch=32nm n1=19 n2=0 tubes=8
.ends
*2 to 1 mux
.subckt mux out a b set set_bar
X1 1 a vdd vdd PCNFET Lch=32nm n1=19 n2=0 tubes=4
X2 4 b vdd vdd PCNFET Lch=32nm n1=19 n2=0 tubes=4
X3 2 set_bar 1 vdd PCNFET Lch=32nm n1=19 n2=0 tubes=4
X4 2 set 4 vdd PCNFET Lch=32nm n1=19 n2=0 tubes=4
X5 2 set 3 0 NCNFET Lch=32nm n1=19 n2=0 tubes=4
X6 2 set_bar 5 0 NCNFET Lch=32nm n1=19 n2=0 tubes=4
X7 3 a 0 0 NCNFET Lch=32nm n1=19 n2=0 tubes=4
X8 5 b 0 0 NCNFET Lch=32nm n1=19 n2=0 tubes=4
X0 out 2 inv
.ends
*3 input nand
.subckt nand3 out a b c
X1 out a vdd vdd PCNFET Lch=32nm n1=19 n2=0 tubes=8
X2 out b vdd vdd PCNFET Lch=32nm n1=19 n2=0 tubes=8
X3 out c vdd vdd PCNFET Lch=32nm n1=19 n2=0 tubes=8
X4 out a 2 0 NCNFET Lch=32nm n1=19 n2=0 tubes=24
X5 2 b 3 0 NCNFET Lch=32nm n1=19 n2=0 tubes=24
X6 3 c 0 0 NCNFET Lch=32nm n1=19 n2=0 tubes=24
.ends
*2 input nand
.subckt nand2 out a b
X1 out a vdd vdd PCNFET Lch=32nm n1=19 n2=0 tubes=8
X2 out b vdd vdd PCNFET Lch=32nm n1=19 n2=0 tubes=8
X3 out a 2 0 NCNFET Lch=32nm n1=19 n2=0 tubes=16
X4 2 b 0 0 NCNFET Lch=32nm n1=19 n2=0 tubes=16
.ends
*2 input nor
.subckt nor2 out a b
X1 2 a vdd vdd PCNFET Lch=32nm n1=19 n2=0 tubes=16
X2 out b 2 vdd PCNFET Lch=32nm n1=19 n2=0 tubes=16
X3 out a 0 0 NCNFET Lch=32nm n1=19 n2=0 tubes=8
X4 out b 0 0 NCNFET Lch=32nm n1=19 n2=0 tubes=8
.ends
* special operator
.subckt s_o Gout Pout gl gr pl pr
X1 1 pl gr nand2
X2 2 gl inv
X3 Gout 1 2 nand2
X4 3 pl pr nand2
X5 Pout 3 inv
.ends
* G_P generator
.subckt g_p gi pi ai bi
X1 gi 1 inv
X2 pi 2 inv
X3 1 ai bi nand2
X4 2 ai bi nor2
.ends
.subckt compressor sum carry x1 x2 x3
X1 xnor xor x1 x2 xor_xnor
X2 sum xnor xor x3 x3_bar mux
X3 x3_bar x3 inv
X4 carry x1 x3 xnor xor mux
.ends
A.6 MODULO 2^N+1 MULTIPLIER TESTING CIRCUIT FOR BOTH CMOS
AND CNT TECHNOLOGY
.lib "CNFET.lib" CNFET
*.include 'PTM_customized_32nm_nom.lib'
.include 'partial_product_8bit.sp'
.include 'compressor_8bit.sp'
.include 'sparsetree_8bit.sp'
.include 'subckt_CNT.sp'
.global vdd
Vdd vdd 0 0.8
Va0 a000 0 pulse 0.8 0 0 1e-10 1e-10 4e-10 1e-9
Xi1 a00 a000 inv
Xi2 a0 a00 inv
*Va0 a0 0 0.8
Va1 a1 0 0
Va2 a2 0 0
Va3 a3 0 0
Va4 a4 0 0
Va5 a5 0 0
Va6 a6 0 0
Va7 a7 0 0
Vb0 b0 0 0.8
Vb1 b1 0 0
Vb2 b2 0 0
Vb3 b3 0 0
Vb4 b4 0 0
Vb5 b5 0 0
Vb6 b6 0 0
Vb7 b7 0 0
X1 x1_1 x1_2 x1_3 x1_4 x1_5 x1_6 x1_7 x1_8
+x2_1 x2_2 x2_3 x2_4 x2_5 x2_6 x2_7 x2_8
+x3_1 x3_2 x3_3 x3_4 x3_5 x3_6 x3_7 x3_8
+x4_1 x4_2 x4_3 x4_4 x4_5 x4_6 x4_7 x4_8
+x5_1 x5_2 x5_3 x5_4 x5_5 x5_6 x5_7 x5_8
+x6_1 x6_2 x6_3 x6_4 x6_5 x6_6 x6_7 x6_8
+x7_1 x7_2 x7_3 x7_4 x7_5 x7_6 x7_7 x7_8
+x8_1 x8_2 x8_3 x8_4 x8_5 x8_6 x8_7 x8_8
+a0 a1 a2 a3 a4 a5 a6 a7
+b0 b1 b2 b3 b4 b5 b6 b7
+partial_product
X2 sum6_8 sum6_7 sum6_6 sum6_5 sum6_4 sum6_3 sum6_2 sum6_1
+carry6_7 carry6_6 carry6_5 carry6_4 carry6_3 carry6_2 carry6_1 carry6_8_bar
+x1_1 x1_2 x1_3 x1_4 x1_5 x1_6 x1_7 x1_8
+x2_1 x2_2 x2_3 x2_4 x2_5 x2_6 x2_7 x2_8
+x3_1 x3_2 x3_3 x3_4 x3_5 x3_6 x3_7 x3_8
+x4_1 x4_2 x4_3 x4_4 x4_5 x4_6 x4_7 x4_8
+x5_1 x5_2 x5_3 x5_4 x5_5 x5_6 x5_7 x5_8
+x6_1 x6_2 x6_3 x6_4 x6_5 x6_6 x6_7 x6_8
+x7_1 x7_2 x7_3 x7_4 x7_5 x7_6 x7_7 x7_8
+x8_1 x8_2 x8_3 x8_4 x8_5 x8_6 x8_7 x8_8
+overall_compressor
X3 s0 s1 s2 s3 s4 s5 s6 s7
+sum6_1 sum6_2 sum6_3 sum6_4 sum6_5 sum6_6 sum6_7 sum6_8
+carry6_8_bar carry6_1 carry6_2 carry6_3 carry6_4 carry6_5 carry6_6 carry6_7
+final_stage_adder
X4 to1 s0 inv
X5 to2 s0 inv
X6 to3 s0 inv
X7 to4 s0 inv
.options AUTOSTOP
.options INGOLD=2 DCON=1
.options GSHUNT=1e-12 RMIN=1e-15
.options ABSTOL=1e-5 ABSVDC=1e-4
.options RELTOL=1e-2 RELVDC=1e-2
.options NUMDGT=4 PIVOT=1
.option convergence=1
.param TEMP=27
.option post
.tran 1e-12 2e-9
.end
A.7 7:2 COMPRESSOR SUBCIRCUIT AND ITS TESTING CIRCUIT
*.lib "CNFET.lib" CNFET
.include 'PTM_customized_32nm_nom.lib'
.global vdd
* PCNFET Lch=32nm n1=19 n2=0 tubes=8
* NCNFET Lch=32nm n1=19 n2=0 tubes=8
X1 3 4 x5 x6 xor_xnor
X2 5 6 x2 x3 xor_xnor
X3 9 19 3 4 x7 x7_0 mux
X4 10 11 5 6 x4 x4_0 mux
X5 13 14 10 11 9 9_0 mux
X6 15 16 13 14 x1 x1_0 mux
X7 17 18 15 16 cin2 cin2_0 mux
X8 carry 15 cin1 17 18 mux_single
X9 sum 23 17 18 cin1 cin1_0 mux
x10 t1 x2 x3 x4 CGEN
x11 b x5 x6 x7 CGEN
x12 t2 x2 x3 x4 nor3
x13 t3 t2_0 x1 nand2
x14 a t1_0 t3 nand2
x15 t9 t4 10 x1 xor_xnor
x16 t5 x1 x2 x3 x4 nand4
x17 t6 t4 9 nand2
x18 c t5 t6 nand2
x19 cout1 a b c CGEN
x20 t7 t8 a b xor_xnor
x21 cout2 t7 t8 c c_0 mux_single
Xi1 12_0 12 inv
Xi2 x1_0 x1 inv
Xi3 cin2_0 cin2 inv
Xi4 cin1_0 cin1 inv
xi5 x7_0 x7 inv
xi6 x4_0 x4 inv
xi7 9_0 9 inv
xi8 c_0 c inv
xi9 t2_0 t2 inv
xi10 t1_0 t1 inv
.subckt inv out in
M1 out in vdd vdd pmos L=32nm W=64nm
M2 out in 0 0 nmos L=32nm W=32nm
.ends
.subckt xor_xnor xnor xor a b
M1 a b xor vdd pmos L=32nm W=48nm
M2 xor a b vdd pmos L=32nm W=48nm
M3 xor b 1 0 nmos L=32nm W=32nm
M4 1 a 0 0 nmos L=32nm W=32nm
M5 vdd xor xnor vdd pmos L=32nm W=48nm
M6 xor xnor 0 0 nmos L=32nm W=32nm
M7 vdd b 2 vdd pmos L=32nm W=48nm
M8 2 a xnor vdd pmos L=32nm W=48nm
M9 a b xnor 0 nmos L=32nm W=32nm
M10 xnor a b 0 nmos L=32nm W=32nm
.ends
.subckt mux_single out a b set set_bar
M1 1 a vdd vdd pmos L=32nm W=48nm
M2 4 b vdd vdd pmos L=32nm W=48nm
M3 2 set_bar 1 vdd pmos L=32nm W=48nm
M4 2 set 4 vdd pmos L=32nm W=48nm
M5 2 set 3 0 nmos L=32nm W=32nm
M6 2 set_bar 5 0 nmos L=32nm W=32nm
M7 3 a 0 0 nmos L=32nm W=32nm
M8 5 b 0 0 nmos L=32nm W=32nm
M9 out 2 vdd vdd pmos L=32nm W=48nm
M10 out 2 0 0 nmos L=32nm W=32nm
.ends
.subckt mux out outbar a b set set_bar
M1 a set out 0 nmos L=32nm W=32nm
M2 b set_bar out 0 nmos L=32nm W=32nm
M3 b set outbar 0 nmos L=32nm W=32nm
M4 a set_bar outbar 0 nmos L=32nm W=32nm
M5 out outbar vdd vdd pmos L=32nm W=48nm
M6 outbar out vdd vdd pmos L=32nm W=48nm
.ends
.subckt CGEN carry a b cin
M1 1 b vdd vdd pmos L=32nm W=48nm
M2 1 a vdd vdd pmos L=32nm W=48nm
M3 2 cin 1 vdd pmos L=32nm W=48nm
M4 2 cin 3 0 nmos L=32nm W=32nm
M5 3 b 0 0 nmos L=32nm W=32nm
M6 3 a 0 0 nmos L=32nm W=32nm
M7 4 b vdd vdd pmos L=32nm W=48nm
M8 2 a 4 vdd pmos L=32nm W=48nm
M9 2 a 6 0 nmos L=32nm W=32nm
M10 6 b 0 0 nmos L=32nm W=32nm
X11 carry 2 inv
.ends
*2 input nand
.subckt nand2 out a b
M1 out a vdd vdd pmos L=32nm W=64nm
M2 out b vdd vdd pmos L=32nm W=64nm
M3 out a 2 0 nmos L=32nm W=64nm
M4 2 b 0 0 nmos L=32nm W=64nm
.ends
*4 input nand
.subckt nand4 out a b c d
M1 out a vdd vdd pmos L=32nm W=64nm
M2 out b vdd vdd pmos L=32nm W=64nm
M3 out c vdd vdd pmos L=32nm W=64nm
M4 out d vdd vdd pmos L=32nm W=64nm
M5 out a 1 0 nmos L=32nm W=128nm
M6 1 b 2 0 nmos L=32nm W=128nm
M7 2 c 3 0 nmos L=32nm W=128nm
M8 3 d 0 0 nmos L=32nm W=128nm
.ends
*3 input nor
.subckt nor3 out a b c
M1 2 a vdd vdd pmos L=32nm W=192nm
M2 3 b 2 vdd pmos L=32nm W=192nm
M3 out c 3 vdd pmos L=32nm W=192nm
M4 out a 0 0 nmos L=32nm W=32nm
M5 out b 0 0 nmos L=32nm W=32nm
M6 out c 0 0 nmos L=32nm W=32nm
.ends
*2 input nor
.subckt nor2 out a b
M1 2 a vdd vdd pmos L=32nm W=128nm
M2 out b 2 vdd pmos L=32nm W=128nm
M3 out a 0 0 nmos L=32nm W=32nm
M4 out b 0 0 nmos L=32nm W=32nm
.ends
Vdd vdd 0 0.8
Va a00 0 pulse 0.8 0 0 1e-10 1e-10 4e-10 1e-9
Xii1 a0 a00 inv
Xii2 x1 a0 inv
Vb b00 0 pulse 0.8 0 0 1e-10 1e-10 4e-10 1e-9
Xii3 b0 b00 inv
Xii4 x2 b0 inv
Vc c00 0 pulse 0.8 0 0 1e-10 1e-10 4e-10 1e-9
Xii5 c0 c00 inv
Xii6 x3 c0 inv
Vd d00 0 pulse 0.8 0 0 1e-10 1e-10 4e-10 1e-9
Xii7 d0 d00 inv
Xii8 x4 d0 inv
Ve e00 0 pulse 0.8 0 0 1e-10 1e-10 4e-10 1e-9
Xii9 e0 e00 inv
Xii10 x5 e0 inv
Vf f00 0 pulse 0.8 0 0 1e-10 1e-10 4e-10 1e-9
Xii11 f0 f00 inv
Xii12 x6 f0 inv
Vg g00 0 pulse 0.8 0 0 1e-10 1e-10 4e-10 1e-9
Xii13 g0 g00 inv
Xii14 x7 g0 inv
Vh h00 0 pulse 0.8 0 0 1e-10 1e-10 4e-10 1e-9
Xii15 h0 h00 inv
Xii16 cin1 h0 inv
Vi i00 0 pulse 0.8 0 0 1e-10 1e-10 4e-10 1e-9
Xii17 i0 i00 inv
Xii18 cin2 i0 inv
.options POST
.options AUTOSTOP
.options INGOLD=2 DCON=1
.options GSHUNT=1e-12 RMIN=1e-15
.options ABSTOL=1e-5 ABSVDC=1e-4
.options RELTOL=1e-2 RELVDC=1e-2
.options NUMDGT=4 PIVOT=1
.option convergence=1
.param TEMP=27
.option post
.tran 1e-12 20e-10
.end