
A High-Speed Low-Power Modulo 2^n+1 Multiplier

Design Using Carbon-Nanotube Technology

A Thesis Presented

by

He Qi

to

The Department of Electrical and Computer Engineering

in partial fulfillment of the requirement

for the degree of

Master of Science

in

Electrical Engineering

in the field of

Electronic Circuits and Semiconductor Devices

Northeastern University

Boston, Massachusetts

April, 2012

NORTHEASTERN UNIVERSITY Graduate School of Engineering

Thesis Title: A High-Speed Low-Power Modulo 2^n+1 Multiplier Design Using Carbon-Nanotube Technology.

Author: He Qi.

Department: Department of Electrical and Computer Engineering.

Approved for Thesis Requirements of the Master of Science Degree

____________________________________________ ______________________

Thesis Advisor: Prof. Yong-Bin Kim Date

____________________________________________ ______________________

Thesis Reader: Prof. Fabrizio Lombardi Date

____________________________________________ ______________________

Thesis Reader: Prof. Minsu Choi Date

____________________________________________ ______________________

Department Chair: Prof. Ali Abur Date

Graduate School Notified of Acceptance:

____________________________________________ ______________________

Director of the Graduate School Date

____________________________________________ ______________________

Dean: Prof. Sara Wadia-Fascetti Date

Copy Deposited in Library:

____________________________________________ ______________________

Reference Librarian Date

Abstract

The modulo 2^n+1 multiplier is a critical component in digital signal processing, residue arithmetic, and data encryption, all of which demand high-speed and low-power operation. In this thesis, a new circuit implementation of a high-speed low-power modulo 2^n+1 multiplier is proposed. It has three major stages: the partial product generation stage, the partial product reduction stage, and the final adder stage. The first major technical contribution of the thesis is that the partial product reduction stage introduces a new MUX-based compressor to reduce power and increase speed. Secondly, in the final adder stage, the sparse-tree based inverted end-around-carry adder reduces the number of circuit blocks on the critical path. Finally, the proposed design is implemented in both 32nm CNTFET (Carbon-Nanotube FET) and bulk CMOS technology for comparison. The CNTFET-based design dramatically decreases the PDP (Power Delay Product) of the circuit. The simulation results demonstrate that the MUX-based compressor reduces the PDP of the partial product reduction stage by a factor of 4.24 compared to the traditional full-adder based design. The sparse-tree architecture solves the wire interconnection problem while slightly reducing the PDP of the final adder stage compared to the Kogge-Stone design. The power consumption of the CNTFET-based multiplier is on average 5.72 times lower than that of its conventional bulk CMOS counterpart, while its PDP is 94 times lower. The proposed multiplier circuit and its implementation demonstrate the viability of the ultra-low-power, high-performance features of the promising CNTFET technology.

Index Terms

Modulo 2^n+1 Multiplier, MUX-based Compressor, Sparse-tree Adder, Carbon-Nanotube Technology

Acknowledgements

First of all, I would like to thank Prof. Yong-Bin Kim, my research advisor. His constructive suggestions and encouragement led me to make progress in my master's research. In addition, his guidance helped me to realize where my passion lies and which research area I am going to concentrate on in the future. Thank you so much! I would also like to thank the members of the committee for reviewing my research results and offering valuable advice.

He Qi

Boston, MA

For my parents

CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
I. INTRODUCTION
1.1 BACKGROUND
1.2 PROBLEM AND WORK STATEMENT
1.3 OUTLINE OF THE THESIS
II. ALGORITHM
2.1 ALGORITHM OF THE PARTIAL PRODUCT GENERATION STAGE
2.2 ALGORITHM OF THE PARTIAL PRODUCT REDUCTION STAGE
2.3 ALGORITHM OF THE FINAL ADDITION STAGE
2.4 AN EXAMPLE
III. CIRCUIT IMPLEMENTATION
3.1 CIRCUIT DESIGN OF THE PARTIAL PRODUCT GENERATION STAGE
3.2 CIRCUIT DESIGN OF THE PARTIAL PRODUCT REDUCTION STAGE
3.2.1 INTRODUCTION OF A MUX-BASED COMPRESSOR DESIGN
3.2.2 DIFFERENT TYPES OF THE MUX-BASED COMPRESSORS
3.2.2.1 Circuit Design of the 3:2 Compressor
3.2.2.2 Circuit Design of the 4:2 Compressor
3.2.2.3 Circuit Design of the 5:2 Compressor
3.2.2.4 Circuit Design of the 7:2 Compressor
3.2.3 DETAILED SUBCIRCUIT DESIGN OF THE COMPRESSORS
3.2.3.1 MUX Subcircuit Design
3.2.3.2 Complementary MUX Subcircuit Design
3.2.3.3 XOR-XNOR Subcircuit Design
3.2.3.4 CGEN Subcircuit Design
3.2.4 THE ARCHITECTURE OF THE PARTIAL PRODUCT REDUCTION STAGE
3.2.4.1 Architecture Designed for an 8-bit Modulo 2^n+1 Multiplier
3.2.4.2 Architecture Designed for a 16-bit Modulo 2^n+1 Multiplier
3.3 CIRCUIT DESIGN OF THE FINAL ADDITION STAGE
IV. SIMULATION RESULTS OF THE PROPOSED DESIGN AND TECHNOLOGY COMPARISON
4.1 PERFORMANCE COMPARISON BETWEEN THE FULL ADDER BASED COMPRESSOR AND THE MUX BASED COMPRESSOR
4.2 SIMULATION RESULTS OF DIFFERENT COMPRESSOR ARCHITECTURES IN THE PARTIAL PRODUCT REDUCTION STAGE
4.3 SIMULATION RESULTS OF THE SPARSE-TREE ARCHITECTURE AND THE KOGGE-STONE ARCHITECTURE
4.4 SIMULATION RESULTS OF THE CNT-BASED DESIGN AND THE BULK CMOS-BASED DESIGN
4.4.1 FEATURES OF THE CNT TECHNOLOGY
4.4.2 POWER, DELAY AND AREA
4.4.3 PVT VARIATION
V. CONCLUSION
REFERENCE
APPENDIX: HSPICE INPUT FILES

List of Figures

Fig. 1 Initial Partial Product Matrix
Fig. 2 Modified Partial Product Matrix
Fig. 3 Final n × n Partial Product Matrix
Fig. 4 8-bit Kogge-Stone Adder
Fig. 5 16-bit Kogge-Stone Adder
Fig. 6 8-bit Kogge-Stone Diminished-1 Adder
Fig. 7 Revised Diminished-1 Kogge-Stone Adder with Stages
Fig. 8 16-bit Kogge-Stone Adder with Sparsity of 4
Fig. 9 Inverted EAC Adder with Sparsity of 4
Fig. 10 Inverted EAC Adder with Sparsity of 4 in Stages
Fig. 11 The Initial Output of the Partial Product Generation Stage
Fig. 12 The n×n Partial Product Matrix
Fig. 13 The Final Partial Product Matrix with the Correction Factor
Fig. 14 The Initial Output of the Partial Product Reduction Stage
Fig. 15 The Output of the Partial Product Reduction Stage after Repositioning
Fig. 16 Proposed Inverter
Fig. 17 NAND Gate with 2 Inputs
Fig. 18 NOR Gate with 2 Inputs
Fig. 19 Traditional Design of the Partial Product Reduction Stage
Fig. 20 A New Design of the Partial Product Reduction Stage
Fig. 21 Traditional MUX-based Design of the 3:2 Compressor
Fig. 22 A New MUX-based Design of the 3:2 Compressor
Fig. 23 Traditional MUX-based Design of the 4:2 Compressor
Fig. 24 A New MUX-based Design of the 4:2 Compressor
Fig. 25 Existing Architectures of the 5:2 Compressor

Fig. 28 Original Design of the MUX Subcircuit
Fig. 29 Modified Design of the MUX Subcircuit
Fig. 30 Proposed Design of the MUX Subcircuit
Fig. 31 Existing Designs of the Complementary-output MUX Subcircuit
Fig. 32 Proposed Design of the Complementary-output MUX Subcircuit
Fig. 33 Original Design of the XOR-XNOR Subcircuit
Fig. 34 Modified Designs of the XOR-XNOR Subcircuit
Fig. 35 Proposed Design of the XOR-XNOR Subcircuit
Fig. 36 Proposed Design of the CGEN Subcircuit
Fig. 37 Possible Compressor Architectures for an 8-bit Modulo 2^n+1 Multiplier
Fig. 38 Possible Compressor Architectures for a 16-bit Modulo 2^n+1 Multiplier
Fig. 39 The 4-bit Conditional Sum Generator
Fig. 40 Delay of the Full Adder Based Compressor
Fig. 41 Delay of the MUX Based Compressor
Fig. 42 Critical Path Delay of the Sparse-tree Adder
Fig. 43 Noncritical Path Delay of the Sparse-tree Adder
Fig. 44 Critical Path Delay of Kogge-Stone Adder
Fig. 45 Delay and Rise-time of the Proposed Multiplier Based on CMOS Technology
Fig. 46 Delay and Rise-time of the Proposed Multiplier Based on CNTFET Technology
Fig. 47 Power Consumption of the Proposed Multiplier Based on Two Technologies
Fig. 48 Temperature Variation
Fig. 49 Voltage Variation
Fig. 50 Process Variation

List of Tables

Table 1 Truth Table of the CGEN Subcircuit
Table 2 Comparison between the Kogge-Stone Adder and the Sparse-tree Adder
Table 3 Performance Comparison between the Full Adder Based Compressor and the MUX-based Compressor
Table 4 Performance and Power Comparison between Different Types of Compressors
Table 5 Performance and Power Comparison among Different Compressor Architectures for an 8-bit Modulo 2^n+1 Multiplier
Table 6 Performance and Power Comparison among Different Compressor Architectures for a 16-bit Modulo 2^n+1 Multiplier
Table 7 Performance and Power Comparison between the Kogge-Stone Architecture and the Sparse-tree Architecture
Table 8 Performance Comparison between the Proposed Multiplier Based on Two Different Technologies
Table 9 Delay Comparison between Two Technologies with Different Temperatures
Table 10 Rise-time Comparison between Two Technologies with Different Temperatures
Table 11 Delay Comparison between Two Technologies with Different Supply Voltages
Table 12 Delay Comparison between Two Technologies with Different Process Corners
Table 13 Rise-time Comparison between Two Technologies with Different Process Corners


I. Introduction

1.1 BACKGROUND

Modulo arithmetic is widely used in many areas. In cryptography, modulo arithmetic is the foundation of public-key systems and is used in a number of symmetric-key algorithms such as the International Data Encryption Algorithm (IDEA) and the Advanced Encryption Standard (AES). A variety of modulo operations are also implemented in computer science, such as the XOR operation in programming languages. Furthermore, modulo arithmetic has applications in music and chemistry, such as the modulo 12 operations used in electronic instruments to implement twelve-tone equal temperament. Modulo arithmetic is also frequently used in the fault-tolerant design of ad-hoc networks and in digital and linear convolution architectures [1]. In recent years, information security, especially the confidentiality of data transmitted through signal channels, has become more and more important because of the increasing popularity and maturing functionality of the Internet, which gives cryptography a significant role in the information age. Modulo 2^n and modulo 2^n+1 multipliers are key blocks in the circuit implementation of cryptographic algorithms such as IDEA [1].

The residue number system (RNS) is another important application of modulo arithmetic. In recent years, RNS has been widely used in arithmetic computation and signal processing applications such as fast Fourier transforms, digital filtering, and image processing [2]. RNS became popular because it decomposes a large integer into several small integers, so that one large-integer calculation is replaced by several small-integer calculations performed in parallel. This effectively increases the operating speed [3]. Among popular moduli sets, (2^n-1, 2^n, 2^n+1) draws the most attention and has been studied for several decades because of its easy conversion between binary and residue representations. Such conversion is based on the conventional Chinese remainder theorem [2]. Modulo 2^n-1 and modulo 2^n operations take n-bit wide inputs, while the modulo 2^n+1 operation takes (n+1)-bit wide inputs [1]. This makes the modulo 2^n+1 block more difficult and complex to implement in hardware, and it has therefore received much attention.
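The decomposition described above can be illustrated with a short sketch. It is not taken from the thesis: the function names (`rns_moduli`, `to_rns`, `from_rns`, `rns_mul`) are hypothetical, and the reconstruction uses the textbook Chinese remainder theorem rather than any hardware-oriented converter. It only shows that a product can be computed channel by channel in the moduli set (2^n-1, 2^n, 2^n+1) and then recovered.

```python
from math import prod


def rns_moduli(n):
    """Moduli set (2^n - 1, 2^n, 2^n + 1); pairwise coprime for n >= 2."""
    return (2**n - 1, 2**n, 2**n + 1)


def to_rns(x, moduli):
    """Decompose x into one small residue per modulus."""
    return tuple(x % m for m in moduli)


def from_rns(residues, moduli):
    """Recover the integer with the Chinese remainder theorem."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_i = big_m // m
        # pow(m_i, -1, m) is the modular inverse (Python 3.8+)
        total += r * m_i * pow(m_i, -1, m)
    return total % big_m


def rns_mul(x, y, moduli):
    """Multiply channel by channel; each channel is independent."""
    rx, ry = to_rns(x, moduli), to_rns(y, moduli)
    return tuple((a * b) % m for a, b, m in zip(rx, ry, moduli))


moduli = rns_moduli(4)            # (15, 16, 17), dynamic range 15*16*17 = 4080
x, y = 37, 101                    # product 3737 fits in the dynamic range
assert from_rns(rns_mul(x, y, moduli), moduli) == x * y
```

The modulo 2^n+1 channel is the only one whose residues can reach 2^n, which is why it needs (n+1)-bit operands.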

Many architectures and circuit implementations of the modulo 2^n+1 block have been proposed and compared in the past decades. In Curiger's work [4], three multiplication architectures are proposed. The first architecture uses an (n+1) × (n+1)-bit multiplier followed by modulo adders that correct the errors caused by carries. The second architecture takes advantage of a modulo 2^n+1 adder, where the multiplier consists of a carry-save adder and a final carry-select addition unit to reduce design complexity [1]. The third architecture modifies the second by correcting errors in the carry-select adder. Furthermore, the circuit area is significantly reduced and the operating speed is increased by introducing a bit-pair recoding scheme in the carry-save adder block [4]. Although the last two architectures are suitable for full-custom design [1], they increase layout and fabrication complexity as well as design challenges.

In the work of Hiasat [5], a very high speed modulo (2^n+1) multiplier is proposed. The circuit implementation takes advantage of a binary multiplier stage, an adder stage, and a combination of several logic gates. The main contribution of his work is to reduce the hardware requirement while supporting very large dynamic ranges.

Later, in the work of Wrzyszcz and Milford [6], a new partial product matrix is introduced that reduces the design and hardware complexity of the previous designs while introducing only a very small hardware overhead. Furthermore, their design yields a regular VLSI layout, since the whole structure is composed almost entirely of full adders and half adders, which also improves the parallel computing performance, speed, and maximum operating frequency. Finally, because the periodic properties of the powers of two modulo 2^n+1 occur in every row of the partial product array, only bits with weight less than 2^n remain to compose the final (n+1) × (n+1) partial product matrix after the repositioning computation. These characteristics also make the correction process easy to realize.

In the work of Zimmermann [7], a new implementation of the modulo (2^n+1) multiplier is proposed, which has three major parts: a modulo-reduced partial product generation block, a modulo carry-save adder, and a modulo final adder. To implement the final modulo addition, a fast and simple end-around-carry adder is needed. Zimmermann introduces a new parallel-prefix adder to realize this function, which dramatically increases the operating speed. Furthermore, conventional Booth recoding in the partial product generation stage and a Wallace tree structure in the final adder stage could also be used to speed up Zimmermann's algorithm. The highly regular structure of this implementation also reduces layout complexity and is very suitable for VLSI implementation and modularization. Chaves and Sousa [8] later realized Zimmermann's idea. Booth recoding and the Wallace tree structure made their implementation the fastest modulo (2^n+1) multiplier at that time.

From a broad perspective, much work on the diminished-1 algorithm has been done to solve the problem of the (n+1)-bit input length in a modulo (2^n+1) multiplier implementation. For example, Yutai Ma [9] introduces a bit-pair Booth recoding technique and a carry-save adder to

reduce partial products to

for even n or

for odd n. In the work of

Zimmermann [7], a weighted operand representation is introduced to implement the diminished-1 function at the cost of an additional correction circuit. Wang's work [10] eliminates the conversion circuit between binary and diminished-1 representations, which reduces power and circuit complexity. Chaves and Sousa [8] compare ordinary and diminished-1 implementations of the modulo (2^n+1) multiplier and optimize the Booth recoding scheme to speed up the multiplier. Vergos and Efstathiou [11] improve on the work of Wrzyszcz and Milford [6] by reducing the correction factor from 3 to 1, which reduces the circuit complexity and increases speed.

1.2 PROBLEM AND WORK STATEMENT

To sum up, today's modulo (2^n+1) multipliers are characterized by high speed, low power, small area, and a regular structure that is suitable for VLSI implementation. However, further improvements of the circuit implementation can still be achieved. Enhancements can be made in the partial product reduction stage and the final adder stage, because these two stages form the critical path of the multiplier. Thus, a new efficient hardware design of the partial product reduction block and the final adder block is needed to achieve higher speed and lower power.

To further improve the modulo (2^n+1) multiplier, a new circuit implementation is proposed in this thesis. It has three major stages: the partial product generation stage, the partial product reduction stage, and the final adder stage. The last two stages determine the speed and power of the whole circuit. The conventional compressor in the partial product reduction stage is built from cascaded full adders and half adders. However, adders consume a lot of power and have a large delay. In this thesis, a new compressor based on a combination of MUX and XOR-XNOR gates is proposed to reduce the PDP [1]. For the final adder stage, the conventional Kogge-Stone adder is the fastest parallel-prefix carry look-ahead adder [13]. However, the performance of parallel-prefix adders is limited by the large number of carry-merge cells and excessive inter-stage wiring tracks. In this thesis, a sparse-tree based inverted EAC adder is used to solve this problem [14]. The sparse-tree architecture dramatically reduces the number of blocks in the last stage compared to the Kogge-Stone adder, which greatly simplifies the layout process. The sparse-tree architecture also reduces the delay of the last stage, because the sparse-tree path is not the critical path and the fan-out on the critical path is reduced.

Additionally, the limitations of the technology itself restrict further improvement of the circuit implementation of the modulo (2^n+1) multiplier. Transistors in the popular CMOS technology can be scaled down to very small sizes to achieve a very high integration density in VLSI implementations. Nowadays, 32nm CMOS technology is widely used and dramatically increases the speed of the multiplier. However, as feature sizes scale down to 25nm in the near future, the leakage current of the transistors will increase significantly. The sensitivity to process variation also increases to an unavoidable level, as does the required accuracy of the manufacturing process [12]. Furthermore, the intrinsic capacitance of the nodes becomes smaller and smaller as the transistor size and supply voltage decrease, which reduces the amount of charge that can be stored at the nodes. This makes instantaneous voltage changes, such as those caused by cosmic particle strikes, a serious problem that could destroy the device under some conditions [12]. Thus, robust technologies whose properties remain stable as transistors shrink will be required in the near future.

Among a variety of modern technologies, cylindrical carbon molecules have beneficial properties for applications in electronics and nanotechnology [12]. A carbon nanotube (CNT) is a tube-shaped allotrope of carbon. A CNT can reach a length-to-diameter ratio of more than 130 million, which is far larger than that of any other material under study. One of the advantageous properties of CNTs is their extreme hardness and stiffness. The only limitation of this property is its sensitivity to high-energy electron irradiation. The particular structure of a CNT allows its conductivity to range between semiconducting and metallic. For a given (n,m) CNT, if n = m, the CNT is metallic; if n − m is a nonzero multiple of 3, the CNT is quasi-metallic with a very small band gap; otherwise, the CNT is a semiconductor. Furthermore, CNTs have very good thermal properties such as thermal conductivity and thermal stability. Based on CNT technology, a new CNT transistor (CNTFET) has been introduced in recent years with the advantages of lower leakage power, better frequency response, lower PVT variation, and extremely low PDP, which makes the CNTFET a very competitive substitute for the traditional MOSFET.

1.3 OUTLINE OF THE THESIS

The rest of the thesis is organized as follows. Section II presents the algorithm used to implement the multiplier. Section III describes the proposed circuit implementation of the modulo 2^n+1 multiplier, including the novel sparse-tree based inverted EAC adder and the MUX-based compressor. Section IV gives the simulation results of the CNTFET-based design and the comparison with the traditional CMOS-based design, and Section V concludes the thesis.


II. Algorithm

Among the various existing A·B mod (2^n+1) algorithms, the one presented by Vergos and Efstathiou [1] is considered to be the best. The proposed circuit implementation based on this algorithm can be adapted to various applications such as the IDEA cipher mentioned in Section I. A mismatch arises when this algorithm is used for the IDEA cipher, because the work of Vergos and Efstathiou [1] uses (n+1)-bit wide inputs while in the IDEA application the input width is n. However, this problem is easily solved by connecting the MSBs of the two inputs to ground and neglecting the MSB of the output.

2.1 ALGORITHM OF THE PARTIAL PRODUCT GENERATION STAGE

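Assume A and B are two inputs represented as $A = a_n a_{n-1} a_{n-2} \cdots a_1 a_0$ and $B = b_n b_{n-1} b_{n-2} \cdots b_1 b_0$. Then A·B modulo (2^n+1) can be represented as follows [1]:

$$|A \cdot B|_{2^n+1} = \left| \sum_{i=0}^{n} \sum_{j=0}^{n} p_{i,j}\, 2^{i+j} \right|_{2^n+1} \qquad (1)$$

where $p_{i,j} = a_i \text{ AND } b_j$ and $|x|_{2^n+1}$ denotes x modulo (2^n+1). The A×B operation can thus be achieved by adding a group of partial products together in a certain order.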
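As a quick sanity check of this representation, the following minimal sketch forms every AND-ed bit pair, weights it by 2^(i+j), and reduces the sum modulo 2^n+1. The helper names `bits` and `mod_mul_partial_products` are illustrative only and do not come from the thesis; the code models the arithmetic, not the circuit.

```python
def bits(x, width):
    """Little-endian list of the low `width` bits of x."""
    return [(x >> i) & 1 for i in range(width)]


def mod_mul_partial_products(a, b, n):
    """A*B mod (2^n + 1) as a weighted sum of partial products p_ij = a_i AND b_j."""
    m = (1 << n) + 1
    a_bits, b_bits = bits(a, n + 1), bits(b, n + 1)   # (n+1)-bit operands
    total = 0
    for i in range(n + 1):
        for j in range(n + 1):
            p_ij = a_bits[i] & b_bits[j]
            total += p_ij << (i + j)                   # weight 2^(i+j)
    return total % m


# Exhaustive check for n = 4 (modulus 17, operands 0..16)
n = 4
for a in range((1 << n) + 1):
    for b in range((1 << n) + 1):
        assert mod_mul_partial_products(a, b, n) == (a * b) % ((1 << n) + 1)
```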

Observing the partial product matrix, it can be divided into four groups, A, B, C and D, as shown in Fig. 1 (where $p_{i,j} = a_i \text{ AND } b_j$). Only one of these groups can be different from 0 at any given time. Thus, partial products in different groups can be ORed instead of being added together. Firstly, we perform the logic "OR" operation on the terms of groups A, B, and D in the columns with weight 2^n up to 2^(2n-2), and on the two terms of groups B and D with weight 2^(2n-1). Since $|2^{2n-1}|_{2^n+1} = 2^{n-1} + 1$, the term weighted 2^(2n-1), $q_{n-1}$, can be substituted by two terms $q_{n-1}$ in the columns with weight 2^(n-1) and 1, respectively, and ORed with any term of group A there. Moreover, since $|2^{2n}|_{2^n+1} = 1$, the term $p_{n,n}$ can be repositioned to the rightmost column and ORed with $p_{0,0}$ [1, 11]. The modified partial product matrix after the "OR" operation is shown in Fig. 2 (where $q_i = p_{i,n} \lor p_{n,i}$).

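$$
\begin{array}{ccccccccccccc}
2^{2n} & 2^{2n-1} & 2^{2n-2} & \cdots & 2^{n+2} & 2^{n+1} & 2^{n} & 2^{n-1} & 2^{n-2} & \cdots & 2^{2} & 2^{1} & 2^{0} \\ \hline
 & & & & & & p_{n,0} & p_{n-1,0} & p_{n-2,0} & \cdots & p_{2,0} & p_{1,0} & p_{0,0} \\
 & & & & & p_{n,1} & p_{n-1,1} & p_{n-2,1} & p_{n-3,1} & \cdots & p_{1,1} & p_{0,1} & \\
 & & & & p_{n,2} & p_{n-1,2} & p_{n-2,2} & p_{n-3,2} & p_{n-4,2} & \cdots & p_{0,2} & & \\
 & & & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & & & \\
 & & p_{n,n-2} & \cdots & p_{4,n-2} & p_{3,n-2} & p_{2,n-2} & p_{1,n-2} & p_{0,n-2} & & & & \\
 & p_{n,n-1} & p_{n-1,n-1} & \cdots & p_{3,n-1} & p_{2,n-1} & p_{1,n-1} & p_{0,n-1} & & & & & \\
p_{n,n} & p_{n-1,n} & p_{n-2,n} & \cdots & p_{2,n} & p_{1,n} & p_{0,n} & & & & & &
\end{array}
$$

Fig. 1 Initial Partial Product Matrix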

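$$
\begin{array}{cccccccccc}
2^{2n-2} & \cdots & 2^{n+1} & 2^{n} & 2^{n-1} & 2^{n-2} & \cdots & 2^{2} & 2^{1} & 2^{0} \\ \hline
 & & & & p_{n-1,0} \lor q_{n-1} & p_{n-2,0} & \cdots & p_{2,0} & p_{1,0} & p_{0,0} \lor p_{n,n} \lor q_{n-1} \\
 & & & p_{n-1,1} \lor q_{0} & p_{n-2,1} & p_{n-3,1} & \cdots & p_{1,1} & p_{0,1} & \\
 & & p_{n-1,2} \lor q_{1} & p_{n-2,2} & p_{n-3,2} & p_{n-4,2} & \cdots & p_{0,2} & & \\
 & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & & & \\
 & \cdots & p_{3,n-2} & p_{2,n-2} & p_{1,n-2} & p_{0,n-2} & & & & \\
p_{n-1,n-1} \lor q_{n-2} & \cdots & p_{2,n-1} & p_{1,n-1} & p_{0,n-1} & & & & &
\end{array}
$$

Fig. 2 Modified Partial Product Matrix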
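The OR-merging step can be checked numerically. The sketch below is an illustration under stated assumptions, not the thesis's implementation: it assumes operands in the range 0 to 2^n (so that at most one group is non-zero), n ≥ 2, and the hypothetical helper `modified_matrix` that lists each entry of the modified matrix as a (bit, weight exponent) pair.

```python
def bits(x, width):
    """Little-endian list of the low `width` bits of x."""
    return [(x >> i) & 1 for i in range(width)]


def modified_matrix(a, b, n):
    """Entries (bit, weight_exponent) of the OR-merged n-row matrix, n >= 2."""
    a_bits, b_bits = bits(a, n + 1), bits(b, n + 1)

    def p(i, j):
        return a_bits[i] & b_bits[j]

    q = [p(i, n) | p(n, i) for i in range(n)]        # q_i = p_{i,n} OR p_{n,i}
    entries = []
    for j in range(n):                                # row j
        for i in range(n):                            # core term p_{i,j}, weight 2^(i+j)
            bit = p(i, j)
            if i == n - 1:
                # leading term of row j absorbs q_{j-1}; row 0 absorbs q_{n-1}
                bit |= q[(j - 1) % n]
            if i == 0 and j == 0:
                # 2^(2n) = 1 and 2^(2n-1) = 2^(n-1) + 1 modulo 2^n + 1
                bit |= p(n, n) | q[n - 1]
            entries.append((bit, i + j))
    return entries


def product_from_matrix(a, b, n):
    m = (1 << n) + 1
    return sum(bit << w for bit, w in modified_matrix(a, b, n)) % m


n = 4
m = (1 << n) + 1
assert pow(2, 2 * n - 1, m) == (1 << (n - 1)) + 1    # identity used for q_{n-1}
assert pow(2, 2 * n, m) == 1                          # identity used for p_{n,n}
for a in range(m):
    for b in range(m):
        assert product_from_matrix(a, b, n) == (a * b) % m
```

The exhaustive loop passes only because the inputs never exceed 2^n; for larger (n+1)-bit values the groups overlap and OR is no longer equivalent to addition.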


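A further observation concerns the repositioning of the partial product terms with weight greater than 2^(n-1) in the n×n partial product matrix, which is based on the following equation [11]:

$$|s\,2^{n+i}|_{2^n+1} = |-s\,2^{i}|_{2^n+1} = |\bar{s}\,2^{i} + 2^{n}\,2^{i}|_{2^n+1} \qquad (2)$$

Equation (2) shows that repositioning a bit s from weight 2^(n+i) to the ith bit position requires complementing the bit and adding a correction factor $2^{n} 2^{i}$, so that the repositioned partial product matrix is equivalent to the initial partial product matrix. For the ith partial product vector, in which i bits are repositioned, the correction factor is derived as $2^{n}(2^{i} - 1)$. Hence, the correction factor of the entire partial product matrix is given by [11]:

$$COR_1 = \left| \sum_{i=0}^{n-1} 2^{n}\,(2^{i} - 1) \right|_{2^n+1} = |2^{n}(2^{n} - 1 - n)|_{2^n+1} = |n + 2|_{2^n+1} \qquad (3)$$

The final n × n partial product matrix after the repositioning operation is shown in Fig. 3.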

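$$
\begin{array}{ccccccc}
2^{n-1} & 2^{n-2} & 2^{n-3} & \cdots & 2^{2} & 2^{1} & 2^{0} \\ \hline
p_{n-1,0} \lor q_{n-1} & p_{n-2,0} & p_{n-3,0} & \cdots & p_{2,0} & p_{1,0} & p_{0,0} \lor p_{n,n} \lor q_{n-1} \\
p_{n-2,1} & p_{n-3,1} & p_{n-4,1} & \cdots & p_{1,1} & p_{0,1} & \overline{p_{n-1,1} \lor q_{0}} \\
p_{n-3,2} & p_{n-4,2} & p_{n-5,2} & \cdots & p_{0,2} & \overline{p_{n-1,2} \lor q_{1}} & \overline{p_{n-2,2}} \\
\cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\
p_{1,n-2} & p_{0,n-2} & \cdots & & & & \\
p_{0,n-1} & \cdots & & & & &
\end{array}
$$

Fig. 3 Final n × n Partial Product Matrix (an overline marks a complemented, repositioned term)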
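The repositioning step can also be checked with a small model. The sketch below rests on two assumptions that should be read as such: every term of weight 2^n or more is complemented and moved n positions to the right (using the identity s·2^(n+i) ≡ s̄·2^i + 2^n·2^i modulo 2^n+1), and the matrix-wide correction is the sum of 2^n(2^j − 1) over the rows j. The helper names are hypothetical, and the code models only the arithmetic.

```python
def bits(x, width):
    return [(x >> i) & 1 for i in range(width)]


def modified_matrix(a, b, n):
    """(bit, weight_exponent) entries of the OR-merged matrix, n >= 2."""
    a_bits, b_bits = bits(a, n + 1), bits(b, n + 1)

    def p(i, j):
        return a_bits[i] & b_bits[j]

    q = [p(i, n) | p(n, i) for i in range(n)]
    entries = []
    for j in range(n):
        for i in range(n):
            bit = p(i, j)
            if i == n - 1:
                bit |= q[(j - 1) % n]
            if i == 0 and j == 0:
                bit |= p(n, n) | q[n - 1]
            entries.append((bit, i + j))
    return entries


def final_matrix(a, b, n):
    """Complement and reposition every entry whose weight is 2^n or more."""
    out = []
    for bit, w in modified_matrix(a, b, n):
        if w >= n:
            out.append((bit ^ 1, w - n))      # inverted bit at weight 2^(w-n)
        else:
            out.append((bit, w))
    return out


def cor1(n):
    """Assumed correction: row j repositions j bits, costing 2^n * (2^j - 1)."""
    m = (1 << n) + 1
    return sum((1 << n) * ((1 << j) - 1) for j in range(n)) % m


def product_from_final_matrix(a, b, n):
    m = (1 << n) + 1
    total = sum(bit << w for bit, w in final_matrix(a, b, n))
    return (total + cor1(n)) % m


for n in (2, 3, 4):
    m = (1 << n) + 1
    assert cor1(n) == (n + 2) % m             # closed form n + 2
    assert all(w < n for _, w in final_matrix(m - 1, m - 1, n))
    for a in range(m):
        for b in range(m):
            assert product_from_final_matrix(a, b, n) == (a * b) % m
```

After repositioning, every entry has a weight below 2^n, which is what makes the n × n layout possible.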

2.2 ALGORITHM OF THE PARTIAL PRODUCT REDUCTION STAGE

Another observation concerns the compressors in the partial product reduction stage, which behave like a carry-save adder (CSA). Since this CSA works as a modulo 2^n+1 adder, the carry-out bit of each CSA level has to be fed back as the carry-in bit of the next level [1]. Supposing that the carry-out bit of the nth column at the ith stage of the CSA is $c_i$ with weight 2^n, the carry-out can be reduced to [11]:

$$|c_i\,2^{n}|_{2^n+1} = |-c_i|_{2^n+1} = |\bar{c_i} + 2^{n}|_{2^n+1} \qquad (4)$$

Thus, in an (n−1)-stage CSA, the carry-out bits give rise to another correction factor according to equation (4) [1]:

$$COR_2 = |2^{n}(n-1)|_{2^n+1} = |1 - n|_{2^n+1} \qquad (5)$$

The final correction factor is calculated as the sum of COR1 and COR2:

$$COR = |COR_1 + COR_2|_{2^n+1} = |(n + 2) + (1 - n)|_{2^n+1} = 3 \qquad (6)$$

For an n-bit modulo (2^n+1) multiplier, the constant "3" is the final correction factor. A "2" is added in the partial product reduction stage, while a "1" is added in the final adder stage because of the inverted carry feedback discussed later in this thesis.
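To make the carry feedback concrete, the following sketch models one carry-save level with an inverted end-around carry and checks how much correction each level accumulates. It is a behavioural model under assumptions of my own choosing: a linear chain of 3:2 levels rather than the compressor tree of the actual design, n-bit operands, and hypothetical names (`csa_level`, `reduce_chain`). The correction values 2^n per level and 2^n(2^j − 1) per row are taken as assumptions consistent with a final constant of 3.

```python
import random


def csa_level(x, y, z, n):
    """One 3:2 carry-save level modulo 2^n + 1.

    The carry leaving the top column has weight 2^n, which is -1 modulo
    2^n + 1, so it re-enters at the LSB inverted; the level then owes a
    constant correction of 2^n.
    """
    mask = (1 << n) - 1
    s = x ^ y ^ z
    c = (x & y) | (x & z) | (y & z)
    msb = (c >> (n - 1)) & 1
    carry = ((c << 1) & mask) | (msb ^ 1)     # inverted end-around carry
    return s, carry


def reduce_chain(operands, n):
    """Reduce a list of n-bit operands to (sum, carry); returns the level count too."""
    s, c = operands[0], operands[1]
    levels = 0
    for z in operands[2:]:
        s, c = csa_level(s, c, z, n)
        levels += 1
    return s, c, levels


def check(n, trials=200, seed=0):
    rng = random.Random(seed)
    m = (1 << n) + 1
    for _ in range(trials):
        # n partial products plus one constant operand -> n - 1 levels
        ops = [rng.randrange(1 << n) for _ in range(n + 1)]
        s, c, levels = reduce_chain(ops, n)
        assert levels == n - 1
        cor2 = ((1 << n) * levels) % m
        assert (s + c + cor2) % m == sum(ops) % m
    return True


for n in range(2, 17):
    m = (1 << n) + 1
    cor1 = sum((1 << n) * ((1 << j) - 1) for j in range(n)) % m
    cor2 = ((1 << n) * (n - 1)) % m
    assert (cor1 + cor2) % m == 3             # total correction is the constant 3
    assert check(n)
```

Because the total is independent of n, the correction can be hard-wired as a constant operand instead of being computed.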

2.3 ALGORITHM OF THE FINAL ADDITION STAGE

When two 1-bit inputs A and B are added, A and B are said to “generate” if the carry-out of A+B is always 1, regardless of the value of the input carry. In practice, A and B generate only when both A and B are logic 1. We use g to represent the “generate” relationship, denoted as g = A·B. Similarly, A and B are said to “propagate” if the carry-out of A+B is 1 whenever the carry-in bit is 1. In practice, A and B propagate only when at least one of A or B is logic 1. We use p to represent the “propagate” relationship, denoted as p = A + B (logical OR).


Fig. 4 8-bit Kogge-Stone Adder

The final adder stage is an inverted End-Around-Carry (EAC) adder derived from the conventional Kogge-Stone adder. An 8-bit Kogge-Stone adder is shown in Fig. 4, and its algorithm is as follows. Each “□” cell produces a "propagate" and a "generate" bit, where “propagate” is p_i = a_i ⊕ b_i and “generate” is g_i = a_i·b_i. Next, the operator “○” combines two (generate, propagate) pairs as (g, p) ○ (g', p') = (g + p·g', p·p') in the following stages in the vertical direction. The final “generate” bits, which are the carries, are produced in the last stage. These bits need to be XORed with the initial propagate bits (p_i) to produce the final sum bits. For example, the LSB of the sum vector is P0 XORed with the carry-in bit, and the second LSB is P1 XORed with the rightmost carry-out bit of the last stage of “○” operations. The 16-bit Kogge-Stone adder works in the same manner, as shown in Fig. 5.
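The prefix computation described above can be sketched in software. The following is a minimal bit-level model, assuming p_i = a_i XOR b_i, g_i = a_i AND b_i, and the standard prefix operator; the function name and the list-based representation are illustrative and say nothing about the circuit-level design.

```python
def kogge_stone_add(a: int, b: int, width: int = 8, carry_in: int = 0):
    """Add two width-bit integers with a Kogge-Stone style prefix network.

    Returns (sum modulo 2**width, carry-out).
    """
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]   # generate bits
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]   # propagate bits

    G, P = g[:], p[:]
    G[0] = g[0] | (p[0] & carry_in)      # fold the carry-in into bit 0

    dist = 1
    while dist < width:                  # one pass per prefix stage
        new_G, new_P = G[:], P[:]
        for i in range(dist, width):
            # (g, p) o (g', p') = (g | p & g', p & p')
            new_G[i] = G[i] | (P[i] & G[i - dist])
            new_P[i] = P[i] & P[i - dist]
        G, P = new_G, new_P
        dist *= 2

    carries = [carry_in] + G[:-1]        # carry into each bit position
    total = 0
    for i in range(width):
        total |= (p[i] ^ carries[i]) << i
    return total, G[width - 1]


print(kogge_stone_add(119, 87))
```

After the last stage, G[i] holds the carry out of bit i, so each sum bit is the initial propagate bit XORed with the carry from the next less significant position.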


Although the conventional Kogge-Stone adder is regarded as one of the fastest adder architectures available, it needs some structural revision to realize the modulo 2^n+1 function. The partial product reduction stage generates an n-bit sum vector and an n-bit carry vector, which are added in the final adder stage. To achieve modulo 2^n+1 addition, however, the carry-out bit of this addition has to be fed back to the LSB of the final adder stage, as shown in the work of Zimmermann [7]:

(7)

From (7) we can observe that the inverted carry-out bit of the addition of the sum and carry vectors has to be fed back to achieve the modulo 2^n+1 function in the revised Kogge-Stone adder architecture shown later.
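The effect of the inverted end-around carry can be illustrated with a small software model. The sketch assumes diminished-1 operands (a value x in 1..2^n is stored as x - 1) and the well-known identity that adding 1 modulo 2^n+1 is equivalent to adding the inverted carry-out modulo 2^n. The zero result, which needs separate handling in diminished-1 arithmetic, is not modeled, and the function name is hypothetical.

```python
def diminished1_add(a_d: int, b_d: int, n: int) -> int:
    """Diminished-1 modulo (2^n + 1) addition with an inverted end-around carry.

    a_d and b_d are n-bit diminished-1 operands; the result is also diminished-1.
    """
    mask = (1 << n) - 1
    total = a_d + b_d
    carry_out = total >> n                    # carry-out of the n-bit addition
    return (total + (1 - carry_out)) & mask   # feed back the inverted carry


n = 8
modulus = (1 << n) + 1
a, b = 119, 87                                # ordinary values in 1..2^n
s_d = diminished1_add(a - 1, b - 1, n)
print(s_d + 1, (a + b) % modulus)
```

The two printed values agree whenever the true sum is not congruent to zero.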

Fig. 5 16-bit Kogge-Stone Adder

The parallel prefix computation, in the form of “○” operations, is retained in the revised architecture. Instead of directly XORing the “propagate” of each bit with the carry-out of the next less significant bit from the last prefix stage, the new architecture inverts the (n-1)th carry-out bit of that stage. This inverted (pi*, gi*) set is then combined by “○” with the (pi, gi) of each bit in that stage to generate the final (G’, P’) set. Finally, the sum vector of the final adder stage is generated by XORing the final carry-out bits with the initial “propagate” bits. The revised 8-bit EAC Kogge-Stone adder is shown in Fig. 6.

Fig. 6 8-bit Kogge-Stone Diminished-1 Adder

Since the final sum vector and carry vector depend mainly on the “generate-propagate set” of every stage, the derivation of (G, P) and some of its characteristics should be discussed. Furthermore, to reduce the logic depth of the architecture in Fig. 6, a new architecture is introduced based on the algorithm improvement shown below. The carry-out bit of a carry-look-ahead (CLA) adder is logic 1 in either of two cases: A and B “generate”, or the next less significant carry-out bit is 1 and A and B “propagate”. The carry-out bit of the CLA can therefore be written as:

$c_i = g_i + p_i \cdot c_{i-1}$ (8)

According to (8), the final generate-propagate set

in the th stage could be

expressed below (Let

) [1]:

(9)

Several observations can be made about the equation above. Firstly,

(10)

which means that the inverted EAC adder simply inverts the “generate” bit and keeps the value of the “propagate” bit. The second observation is:

(11)

The third observation is on the derivation of

, as shown

below [15]:


(12)

In some cases, generating the whole architecture in stages based on (12) is not possible.

To solve this problem, we could transfer (12) into another form [15]. Suppose that

and , then,

(13)

According to (13), . The new designed final stage adder based on

this algorithm is shown in Fig. 7. The addition operation in the final adder stage is done in

stages. However, this implementation has an obvious wire interconnection problem because of the complexity of its cells [1].

One possible solution to the wire interconnection problem is to introduce a sparse-tree architecture. The sparsity of a Kogge-Stone adder describes how often the prefix tree generates a carry-out bit. For example, a sparsity of 1 means that the adder generates a carry-out for every bit, a sparsity of 2 means that it generates a carry-out for every other bit, and a sparsity of 4 means that it generates a carry-out for every fourth bit. A much shorter ripple-carry adder, whose carry input is the carry-out of the sparse-tree adder, then produces the bits in between. Because this shorter ripple-carry adder is not on the critical path, the delay of the final adder stage is reduced while the wire interconnection problem is solved. There is, however, a trade-off between the sparsity and the effectiveness of this solution. Increasing the sparsity increases the speed of the sparse-tree adder, but the delay of the short ripple-carry adder grows as well. Eventually, the critical path is no longer the sparse-tree adder but the short ripple-carry adder.
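The division of work between the sparse prefix tree and the short ripple-carry chains can be sketched as follows. This is a behavioral model under simplifying assumptions: the prefix network runs over 4-bit block (generate, propagate) pairs only, and each block is completed by a plain ripple-carry chain. The function names are illustrative and the model does not include the end-around carry.

```python
def combine(hi, lo):
    """Prefix operator on (generate, propagate) pairs; hi is the more significant one."""
    return (hi[0] | (hi[1] & lo[0]), hi[1] & lo[1])


def sparse_tree_add(a: int, b: int, width: int = 16, sparsity: int = 4):
    """Add two width-bit integers; returns (sum modulo 2**width, carry-out)."""
    bits = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1)
            for i in range(width)]

    # Group (generate, propagate) of each block of `sparsity` bits.
    groups = []
    for start in range(0, width, sparsity):
        gp = bits[start]
        for i in range(start + 1, min(start + sparsity, width)):
            gp = combine(bits[i], gp)
        groups.append(gp)

    # Kogge-Stone style prefix over the blocks only: one carry per block.
    prefix = groups[:]
    dist = 1
    while dist < len(prefix):
        prefix = [combine(prefix[k], prefix[k - dist]) if k >= dist else prefix[k]
                  for k in range(len(prefix))]
        dist *= 2
    block_carry_in = [0] + [gp[0] for gp in prefix[:-1]]

    # Short ripple-carry chains fill in the bits between the sparse carries.
    total = 0
    for k, start in enumerate(range(0, width, sparsity)):
        carry = block_carry_in[k]
        for i in range(start, min(start + sparsity, width)):
            g, p = bits[i]
            total |= (p ^ carry) << i
            carry = g | (p & carry)
    return total, prefix[-1][0]


print(sparse_tree_add(40000, 30000))
```

With a larger `sparsity` the prefix network shrinks while each ripple chain gets longer, which is the trade-off described above.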

[Figure: 8-bit adder with inputs a0,b0 … a7,b7 and outputs s0 … s7; the legend defines the input cell producing hi and (gi, pi) from ai and bi, and the prefix cells combining (Gi,j, Pi,j) with (Gk,m, Pk,m)]

Fig. 7 Revised Diminished-1 Kogge-Stone Adder with Stages

An example of 16-bit Kogge–Stone adder with sparsity-4 is shown in Fig. 8, while the Inverted

EAC adder with sparsity-4 is shown in Fig. 9.


[Figure: inputs a0,b0 … a15,b15; carry outputs C1, C5, C9, C13]

Fig. 8 16-bit Kogge–Stone adder with sparsity of 4

[Figure: inputs a0,b0 … a15,b15; carry outputs C15 = C-1, C11, C7, C3]

Fig. 9 Inverted EAC adder with sparsity of 4

Generally, for 8-bit and 16-bit adders, a sparsity of 4 is usually chosen [14]. The carry-out equations for the 16-bit sparse-tree inverted EAC adder are as follows:


(14)

Based on the deduction shown in (12), the equations turn into:

(15)

Based on the deduction shown in (13), the equations turn into:

(16)

In (16), the final equations limit the modulo addition operation in the final adder stage within

stages, as shown in Fig. 10. This architecture solves the wire interconnection problem and reduces the non-critical path delay.

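A = 119 = 001110111, B = 87 = 001010111

Partial product rows (row i is shifted left by i bit positions):

Row 0 (b0 = 1): 0 0 1 1 1 0 1 1 1
Row 1 (b1 = 1): 0 0 1 1 1 0 1 1 1
Row 2 (b2 = 1): 0 0 1 1 1 0 1 1 1
Row 3 (b3 = 0): 0 0 0 0 0 0 0 0 0
Row 4 (b4 = 1): 0 0 1 1 1 0 1 1 1
Row 5 (b5 = 0): 0 0 0 0 0 0 0 0 0
Row 6 (b6 = 1): 0 0 1 1 1 0 1 1 1
Row 7 (b7 = 0): 0 0 0 0 0 0 0 0 0
Row 8 (b8 = 0): 0 0 0 0 0 0 0 0 0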

Fig. 11 the Initial Output of the Partial Product Generation Stage


Fig. 10 Inverted EAC adder with sparsity of 4 in stages

2.4 AN EXAMPLE

Take a 9-bit modulo 2^n+1 multiplier with n = 8 as an example. Assume the two inputs are A=119=001110111 and B=87=001010111. The initial output of the partial product generation stage is shown in Fig. 11.

0 1 1 1 0 1 1 1
1 1 1 0 1 1 1 1
1 1 0 1 1 1 1 0
0 0 0 0 0 1 1 1
0 1 1 1 1 0 0 0
0 0 0 1 1 1 1 1
1 1 1 0 0 0 1 0
0 1 1 1 1 1 1 1

Fig. 12 the n×n Partial Product Matrix


The left half (to the left of the dashed line) of the initial partial products shown in Fig. 11 needs to be repositioned using the principle illustrated in Fig. 3. The final n×n partial product matrix after repositioning is shown in Fig. 12. A correction factor of 2, in the form of the correction vector shown in the block in Fig. 13, is added to the bottom of the n×n partial product matrix. The total correction factor of the modulo 2^n+1 multiplier is 3; the remaining “1” is added in the final adder stage.

0 1 1 1 0 1 1 1
1 1 1 0 1 1 1 1
1 1 0 1 1 1 1 0
0 0 0 0 0 1 1 1
0 1 1 1 1 0 0 0
0 0 0 1 1 1 1 1
1 1 1 0 0 0 1 0
0 1 1 1 1 1 1 1
0 0 0 0 0 0 1 0   (correction vector)

Fig. 13 the Final Partial Product Matrix with the Correction Factor

Sum Vector:   1 1 0 1 0 1 0 0
Carry Vector: 0 0 1 1 1 0 1 0

Fig. 14 the Initial Output of Partial Product Reduction Stage

Sum Vector:   1 1 0 1 0 1 0 0
Carry Vector: 0 1 1 1 0 1 0 1

Fig. 15 the Output of the Partial Product Reduction Stage after Repositioning

The partial product reduction stage compresses the partial product matrix in Fig. 13 to a final sum vector and a carry vector, as shown in Fig. 14. This initial output of the partial product reduction stage also needs to be repositioned, as shown in Fig. 15. The final sum vector and carry vector after repositioning are then added modulo 2^n+1 together with the remaining “1”. In this example, 119×87 modulo (2^8+1) equals 73.
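The example can be reproduced with a short script. The sketch below rebuilds the repositioned matrix of Fig. 12 from the two operands and checks the final result. It assumes that a bit wrapped past position n-1 is inverted and placed at position i+j-n, and that the accompanying correction is the sum over rows i of 2^n(2^i - 1); this is a reconstruction consistent with the text rather than a transcription of equation (3). Only the low n bits of the operands are modeled, and the function name is illustrative.

```python
def repositioned_rows(a: int, b: int, n: int) -> list:
    """Build the n x n partial product matrix after repositioning (one int per row)."""
    rows = []
    for i in range(n):                        # row i belongs to bit b_i
        b_i = (b >> i) & 1
        row = 0
        for j in range(n):                    # index of bit a_j
            pp = ((a >> j) & 1) & b_i         # partial product bit
            pos = i + j
            if pos < n:
                row |= pp << pos              # bit stays in place
            else:
                row |= (1 - pp) << (pos - n)  # wrapped bit is inverted
        rows.append(row)
    return rows


n, a, b = 8, 119, 87
modulus = (1 << n) + 1
rows = repositioned_rows(a, b, n)
for row in rows:
    print(format(row, "08b"))

# Assumed correction for the repositioned bits.
cor1 = sum((1 << n) * ((1 << i) - 1) for i in range(n))
print((sum(rows) + cor1) % modulus)           # product modulo 2^n + 1

# Final addition of the repositioned vectors of Fig. 15 plus the remaining 1.
sum_vec, carry_vec = 0b11010100, 0b01110101
print((sum_vec + carry_vec + 1) % modulus)
```

Both printed residues equal 73, and the printed rows match the matrix in Fig. 12.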


III. Circuit Implementation

The proposed implementation of the modulo 2^n+1 multiplier consists of three stages: the partial product generation stage, the partial product reduction stage, and the final addition stage. The possible circuit configurations for each stage are discussed in this section.

3.1 CIRCUIT DESIGN OF THE PARTIAL PRODUCT GENERATION STAGE

This stage is the simplest stage in the circuit design of the entire multiplier. The traditional 2-input NAND gate, 2-input NOR gate, and inverter need to be optimized to meet the power and speed demands of this stage.

Fig. 16 Proposed Inverter

The structure and transistor sizes of the proposed inverter, 2-input NAND, and 2-input NOR are shown in Fig. 16, Fig. 17, and Fig. 18, respectively. The NAND gates are used for generating the initial partial product terms, while the NOR gates and inverters are the key circuit components that implement the repositioning operations to obtain the final n×n partial product

matrix. The most complex logic functions in the reposition operations are ,

and

, where and [1].

Fig. 17 Nand Gate with 2 Inputs

Fig. 18 Nor Gate with 2 Inputs


3.2 CIRCUIT DESIGN OF THE PARTIAL PRODUCT REDUCTION STAGE

3.2.1 INTRODUCTION OF A MUX-BASED COMPRESSOR DESIGN

The partial product reduction stage is considered the most important stage in determining the power and speed of the entire modulo 2^n+1 multiplier [1]. Thus, this stage must be designed with a group of low-power, high-speed compressors.

Fig. 19 Traditional Design of the Partial Product Reduction Stage

In this stage, the n×n partial product matrix and a correction factor of “2” are compressed to a final sum vector and a carry vector. The remaining correction factor of “1” is added in the final addition stage by using the inverted EAC adder. Traditional compressors are designed with full adders. However, these designs consume too much power, occupy too much chip area, and cannot meet today's requirement for ultra-high speed. For example, compressing a single column of an 8×8 partial product matrix takes 7 full adders in total, while even the worst case of the possible new designs proposed in this thesis needs only one 7:2 compressor and two 3:2 compressors. The traditional full-adder-based compressor and the worst case of the possible new designs are shown in Fig. 19 and Fig. 20, respectively.

Fig. 20 A New Design of the Partial Product Reduction Stage

The compressor architecture shown in Fig. 20 is designed with MUX and XOR-XNOR subcircuits. The MUX-based compressors use far fewer transistors than the full-adder-based compressors, and the new MUX-based design also needs far fewer compressor blocks than the traditional full-adder-based design. Thus, the new compressor architecture is better suited to meet the requirements of low power and high speed.

3.2.2 DIFFERENT TYPES OF THE MUX-BASED COMPRESSORS

Several basic MUX-based compressors are discussed below:

3.2.2.1 Circuit Design of the 3:2 compressor:

A 3:2 compressor takes 3 inputs x1, x2, and x3 to generate two outputs Sum and Carry. The

logic relationship between inputs and outputs is demonstrated in equation (17) [16]:

$x_1 + x_2 + x_3 = \mathrm{Sum} + 2 \cdot \mathrm{Carry}$ (17)

Fig. 21 Traditional MUX-based Design of the 3:2 Compressor

Fig. 21 shows an existing design of the MUX-based 3:2 compressor [16]. However, this design is not fast enough because X1 and X2 have to be added first, and their sum is then added to X3; the second addition has to wait for the result of the first. The total delay of this design is 2×ΔXOR. To reduce the critical path delay of the 3:2 compressor, a new design of the MUX-based 3:2 compressor is shown in Fig. 22. In the proposed design, X3 can select the MUXs before the other input signals arrive. Thus, the time taken to switch the transistors in the critical path is reduced, increasing circuit efficiency [16]. The total delay of the proposed design is ΔXOR+ΔMUX. The output equations of the proposed design are shown below [16]:

(18)

(19)

Fig. 22 A New MUX-based Design of the 3:2 Compressor
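As a behavioral illustration of the structure described above, the sketch below models a 3:2 compressor with one XOR-XNOR block followed by one level of 2-to-1 MUXs, which matches the stated delay of ΔXOR+ΔMUX. The specific output expressions are an assumption (one common MUX formulation of a full adder), not a transcription of equations (18) and (19), and the helper names are hypothetical.

```python
from itertools import product


def mux(sel: int, in0: int, in1: int) -> int:
    """2-to-1 multiplexer: returns in1 when sel is 1, otherwise in0."""
    return in1 if sel else in0


def compressor_3_2(x1: int, x2: int, x3: int):
    """MUX-based 3:2 compressor; returns (Sum, Carry)."""
    xor12 = x1 ^ x2                   # XOR-XNOR block: both polarities at once
    xnor12 = xor12 ^ 1
    total = mux(x3, xor12, xnor12)    # X3 selects between XOR and XNOR
    carry = mux(xor12, x1, x3)        # majority of (x1, x2, x3)
    return total, carry


for x1, x2, x3 in product((0, 1), repeat=3):
    s, c = compressor_3_2(x1, x2, x3)
    print(x1, x2, x3, "->", s, c)
```

For every input combination the outputs satisfy x1 + x2 + x3 = Sum + 2·Carry.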


3.2.2.2 Circuit Design of the 4:2 compressor:

A 4:2 compressor takes 4 inputs x1, x2, x3, and x4 along with a carry-in bit Cin to generate three outputs Sum, Carry, and Cout, where “Sum” is weighted at 2^0, while “Carry” and “Cout” are weighted at 2^1. The logic relationship between inputs and outputs is demonstrated in equation (20) [16]:

$x_1 + x_2 + x_3 + x_4 + C_{in} = \mathrm{Sum} + 2 \cdot (\mathrm{Carry} + C_{out})$ (20)


An existing circuit design of the MUX-based 4:2 compressor is shown in Fig. 23 [16]. As in the traditional 3:2 compressor, the second and third XOR operations have to wait for the result of the previous one, which limits the speed of the compressor to 3×ΔXOR. In Fig. 24, a new design of the MUX-based 4:2 compressor is proposed. In this design, each output and its complementary signal are generated at the same time, avoiding the race-hazard problem. The power consumed by the inverters that would otherwise generate the complementary signals is also saved. Furthermore, the MUX connected to Cin can be selected in advance. The total delay of the proposed design is ΔXOR+2×ΔMUX.

Fig.23 Traditional MUX-based Design of the 4:2 Compressor

The output equations of the proposed design are shown below [16]:

(21)

(22)

(23)


Fig.24 A New MUX-based Design of the 4:2 Compressor
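A behavioral sketch of a MUX-based 4:2 compressor with the stated depth of one XOR-XNOR level followed by two MUX levels is given below. The output expressions are an assumption (a common MUX formulation), not a transcription of equations (21) to (23), and the helper names are hypothetical.

```python
from itertools import product


def mux(sel: int, in0: int, in1: int) -> int:
    """2-to-1 multiplexer: returns in1 when sel is 1, otherwise in0."""
    return in1 if sel else in0


def compressor_4_2(x1: int, x2: int, x3: int, x4: int, cin: int):
    """MUX-based 4:2 compressor; returns (Sum, Carry, Cout)."""
    a, a_n = x1 ^ x2, (x1 ^ x2) ^ 1      # first XOR-XNOR block
    b, b_n = x3 ^ x4, (x3 ^ x4) ^ 1      # second XOR-XNOR block
    cout = mux(a, x1, x3)                # independent of Cin
    s = mux(a, b, b_n)                   # x1 ^ x2 ^ x3 ^ x4
    s_n = mux(a, b_n, b)                 # its complement, produced in parallel
    total = mux(cin, s, s_n)             # Cin selects between s and its complement
    carry = mux(s, x4, cin)
    return total, carry, cout


for bits in product((0, 1), repeat=5):
    print(bits, "->", compressor_4_2(*bits))
```

Because Cout does not depend on Cin, neighboring compressors can be chained without creating a ripple path through the carry inputs.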


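3.2.2.3 Circuit Design of the 5:2 compressor:

The 5:2 compressor has 7 inputs (x1, x2, x3, x4, x5, Cin1, and Cin2) and 4 outputs (Sum, Carry, Cout1, and Cout2). The relationship between inputs and outputs is shown below [16]:

$x_1 + x_2 + x_3 + x_4 + x_5 + C_{in1} + C_{in2} = \mathrm{Sum} + 2 \cdot (\mathrm{Carry} + C_{out1} + C_{out2})$ (24)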

Several existing circuit implementations of the MUX-based 5:2 compressor are shown in Fig. 25 (a), (b), and (c) [16]. In Fig. 25, the delay of the compressor is reduced to 5×ΔXOR, whereas the delay of the original full-adder-based design is 6×ΔXOR if all the full adder blocks are replaced by their constituent XOR blocks [16]. However, the delay of the MUX-based 5:2 compressor can be further reduced by replacing some XOR gates with MUX blocks. The proposed implementation is shown in Fig. 26. In the first stage, two XOR-XNOR blocks generate each output and its complementary signal at the same time, which saves the power of additional inverters and avoids the race-hazard problem. In the second and fourth stages, the MUXs controlled by X3, Cin1, and Cin2 can be selected before the other input signals arrive. The remaining MUX blocks also make efficient use of the outputs of the blocks in the previous stage. Thanks to these features, the critical path delay of the proposed design is reduced to ΔXOR+3×ΔMUX. The output equations are shown below:

(25)

(26)

(27)

(28)
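For reference, the arithmetic behavior that any 5:2 compressor must satisfy can be modeled with three cascaded full adders, which corresponds to the full-adder-based structure with a delay of six XOR gates mentioned above. The cascade order and signal names below are assumptions for illustration; the sketch does not model the proposed MUX-based circuit of Fig. 26.

```python
from itertools import product


def full_adder(a: int, b: int, c: int):
    """Return (sum, carry) of three input bits."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_5_2(x1, x2, x3, x4, x5, cin1, cin2):
    """Reference 5:2 compressor; returns (Sum, Carry, Cout1, Cout2)."""
    s1, cout1 = full_adder(x1, x2, x3)
    s2, cout2 = full_adder(s1, x4, cin1)
    total, carry = full_adder(s2, x5, cin2)
    return total, carry, cout1, cout2


consistent = all(
    sum(bits) == s + 2 * (c + co1 + co2)
    for bits in product((0, 1), repeat=7)
    for s, c, co1, co2 in [compressor_5_2(*bits)]
)
print(consistent)
```

The check covers all 128 input combinations and confirms that Carry, Cout1, and Cout2 all carry weight 2^1.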

[Figure: panels (a), (b), and (c), each built from XOR and MUX blocks, with a CGEN block in (c); inputs X1-X5, Cin1, Cin2; outputs Sum, Carry, Cout1, Cout2; one panel also labels the intermediate signals (x1+x2)(x3+x4) and (x1x2+x3x4)]

Fig. 25 Existing Architectures of the 5:2 Compressor

[Figure: proposed 5:2 compressor built from two XOR-XNOR blocks, MUX blocks, and a CGEN block; inputs X1-X5, Cin1, Cin2; outputs Sum, Carry, Cout1, Cout2]


3.2.2.4 Circuit Design of the 7:2 compressor:

The 7:2 compressor has 9 inputs (x1, x2, x3, x4, x5, x6, x7, Cin1, and Cin2) and 4 outputs (Sum, Carry, Cout1, and Cout2). Unlike the 5:2 compressor, where Carry, Cout1, and Cout2 are all weighted at 2^1, the 7:2 compressor has a Cout1 output weighted at 2^2. To sum up, the relationship between the inputs and the outputs of a 7:2 compressor is [1]:

$x_1 + x_2 + \dots + x_7 + C_{in1} + C_{in2} = \mathrm{Sum} + 2 \cdot (\mathrm{Carry} + C_{out2}) + 4 \cdot C_{out1}$ (29)

The MUX-based 7:2 compressor is an entirely new design in this thesis. The design principle is to replace XOR gates with MUXs wherever possible to reduce delay, and to generate each output together with its complementary signal to reduce power. The output equations shown below [17] are then turned into the circuit implementation of the MUX-based 7:2 compressor shown in Fig. 27, with some additional logic gates such as NAND gates. The total delay of the proposed design is ΔXOR+5×ΔMUX.

(30)

(31)

(32)

(33)

where

[Figure: 7:2 compressor built from XOR-XNOR, MUX, and CGEN blocks with additional NAND and NOR gates; inputs X1-X7, cin1, cin2; outputs Carry, Sum, Cout1, Cout2]


3.2.3 DETAILED SUBCIRCUIT DESIGN OF THE COMPRESSORS

To realize the circuit implementations mentioned above, the detailed transistor-level designs also need to be discussed and compared. The MUX subcircuit, the complementary-output MUX subcircuit, the XOR-XNOR subcircuit, and the CGEN subcircuit are discussed one by one.

3.2.3.1 MUX Subcircuit Design:

Fig. 28 Original Design of the MUX Subcircuit

First, consider the MUX subcircuit. The original 2-to-1 MUX is shown in Fig. 28 [18]. This is the most widespread MUX cell today, especially in low-power applications. However, this structure has no driving ability to drive the large input capacitance of the following stages, especially when many stages are cascaded. This introduces a large delay and worsens the performance of the entire modulo multiplier, so this implementation is not chosen. To solve the weak driving ability problem, another circuit implementation of the MUX is shown in Fig. 29. The modified structure solves the driving problem by adding two cascaded inverters at the output of the original design. This method is highly effective. However, the inverters consume a lot of power and enlarge the MUX block to more than twice the size of the one in Fig. 28, so this is not a desirable design for low-power applications either.


Fig. 29 Modified Design of the MUX Subcircuit

The proposed design of the MUX subcircuit is shown in Fig. 30. This design takes advantage of complementary CMOS technology, which is robust against both voltage scaling and transistor sizing [18]. Compared to the modified MUX circuit shown in Fig. 29, the proposed design has only one inverter, which saves a considerable amount of power. Reducing the number of inverters does not reduce the driving ability of the proposed design, because its remaining transistors are also connected to vdd/gnd, which provides driving strength. The proposed design has 2 more transistors than the one in Fig. 29, but the total silicon area of the transistors in the two designs is the same. Thus, the circuit design in Fig. 30 is chosen in this research for its combination of low power, small silicon area, and high speed.


Fig. 30 Proposed Design of the MUX Subcircuit

3.2.3.2 Complementary MUX Subcircuit Design:

Secondly, the complementary-output MUX subcircuit needs to be designed. Two existing designs of the complementary-output MUX are shown in Fig. 31 (a) and (b), respectively [18]. Design (a) has some driving ability because it introduces two compensation transistors, both driven by vdd. For the same reason, structure (a) also obtains a full voltage swing at the output. However, the driving ability of (a) is not strong enough to drive many cascaded stages. Unlike (a), structure (b) has no driving ability at all. Additionally, in some cases, its output and complementary output do not have a full swing.

[Figure: transistor-level schematics (a) and (b) with inputs A and B, select signals set, and complementary outputs out; transistor widths W = 64 nm and W = 128 nm]

Fig. 31 Existing Designs of the Complementary-output MUX Subcircuit

To solve the problems mentioned above, the complementary-output MUX needs to be redesigned. In the circuit design of Fig. 31 (a), an inverter needs to be added to each of the two outputs to improve the driving ability. In the circuit design of Fig. 31 (b), two cascaded inverters are needed and all the other pass-gates need to be replaced by complementary CMOS pass-gates to obtain a full swing. After these improvements, (b) occupies much more silicon area than (a), so the proposed design, which is shown in Fig. 32, builds on (a).

[Figure: transistor-level schematic with inputs A and B, select signals set, and complementary outputs out; transistor widths W = 64 nm, 128 nm, and 256 nm]

Fig. 32 Proposed Design of the Complementary-output MUX Subcircuit

3.2.3.3 XOR-XNOR Subcircuit Design:

Thirdly, the XOR-XNOR subcircuit needs to be designed. The original design of the XOR-XNOR subcircuit is shown in Fig. 33 [18]. This design suffers from weak driving ability, especially when the XNOR node is at logic 0, which dramatically reduces speed. Another problem concerns the complementary outputs: a skew occurs between the XOR node and the XNOR node. Additionally, this design generates a weak logic “1” at the XNOR node because an NMOS-based pass-gate has a Vth voltage drop when passing logic “1”. Thus, this design cannot be used with a low supply voltage.

[Figure: transistor-level schematic with inputs A and B and outputs xor and xnor; transistor widths W = 128 nm and W = 256 nm]

Fig. 33 Original Design of the XOR-XNOR Subcircuit

To solve those problems, other XOR-XNOR subcircuits have been designed, as shown in Fig. 34 (a), (b), and (c), respectively [18]. The modified XOR-XNOR block shown in (a) can be used with a low supply voltage because complementary CMOS pass-gates replace the original ones. However, the weak driving ability problem and the skew problem at the output remain. Unlike (a), design (b) solves the skew problem at the output by adding a group of complementary transistors to the circuit shown in Fig. 33. But it generates a weak “0” at the XOR node and a weak “1” at the XNOR node.

[Figure: transistor-level schematics (a), (b), and (c) with inputs A and B and outputs xor and xnor; transistor widths W = 64 nm, 128 nm, and 256 nm]

Fig. 34 Modified Designs of the XOR-XNOR Subcircuit

So this design is also not a good choice for low-power applications. The circuit implementation in (c) solves the weak logic problem and the weak driving ability problem at the same time because of the feedback NMOS-PMOS transistors in the middle of the circuit. However, it is still not a good choice for low-power applications for the following reason. When the input changes from any other pattern to “00” or “11”, the feedback NMOS-PMOS transistors, which are initially turned off, are turned on by a weak logic driver and a high-impedance driver. This transition therefore takes a long time, which worsens the performance of the entire circuit and consumes a large amount of dynamic power [18].

[Figure: transistor-level schematic with inputs A and B and outputs xor and xnor; transistor widths W = 32 nm, 64 nm, 128 nm, and 256 nm]

Fig. 35 Proposed Design of the XOR-XNOR Subcircuit


The proposed design of the XOR-XNOR subcircuit is shown in Fig. 35. It combines all the desired features, solving the weak logic problem, the skew problem at the output, the weak driving ability problem, and the long transition time problem of Fig. 34 (c) at the same time.

[Figure: transistor-level schematic with inputs A, B, and Cin and output Carry; transistor widths W = 64 nm, 128 nm, and 256 nm]

Fig. 36 Proposed Design of the CGEN Subcircuit

3.2.3.4 CGEN Subcircuit Design:

Finally, the proposed CGEN subcircuit is shown in Fig. 36 [18]. The CGEN subcircuit works like a full adder without the “Sum” output. The truth table of the CGEN block is shown in Table 1. This circuit implementation takes advantage of complementary CMOS logic, providing good driving ability (small delay) with a relatively small silicon area.

Table 1 Truth Table of the CGEN Subcircuit

A    B    Cin    Carry
0    0    0      0
0    0    1      0
0    1    0      0
0    1    1      1
1    0    0      0
1    0    1      1
1    1    0      1
1    1    1      1
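The truth table corresponds to the majority function of the three inputs, which is the carry output of a full adder. A minimal behavioral sketch (the function name is illustrative and unrelated to the transistor-level design) reproduces the table:

```python
from itertools import product


def cgen(a: int, b: int, cin: int) -> int:
    """Carry generator: majority of the three input bits."""
    return (a & b) | (a & cin) | (b & cin)


for a, b, cin in product((0, 1), repeat=3):
    print(a, b, cin, cgen(a, b, cin))
```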

Fig. 37 Possible Compressor Architectures for an 8-bit Modulo 2^n+1 Multiplier

3.2.4 THE ARCHITECTURE OF THE PARTIAL PRODUCT REDUCTION STAGE

3.2.4.1 Architecture Designed for an 8-bit Modulo 2^n+1 Multiplier

After the specific circuit blocks have been designed, the architecture of the whole compressor needs to be decided. For an 8-bit modulo 2^n+1 multiplier, two possible compressor architectures are compared. The architectures discussed in this section belong to the partial product reduction stage and compress a single column of the final partial product matrix together with the corresponding correction bit. The first compressor architecture is shown in Fig. 37 (a); it uses the fewest compressors (three in total) among all the possible architectures. It has only three stages, with only one compressor in each stage. However, when parallelism is taken into consideration, the other architecture, shown in Fig. 37 (b), performs much better. This architecture uses a total of seven 3:2 compressors in four stages. In the first stage, three 3:2 compressors work in parallel, while the remaining stages use 2, 1, and 1 compressors, respectively. Although the second architecture uses more compressors than the first, it has less delay and fewer transistors. The simulation results are shown in Section IV.

Furthermore, the architecture in Fig. 37 (b) has an advantage in layout compared to the first one, because Fig. 37 (a) uses two types of compressors while Fig. 37 (b) uses a single type. However, Fig. 37 (b) raises an interconnect wire routing issue because of the parallel design, especially when the input size is large. The 3:2 compressor [16]

is shown in Fig. 3. In this thesis, this architecture is chosen to achieve high speed, small area, and

low power.
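The stage counts of the all-3:2 architecture can be reproduced with a simple counting model. The sketch assumes that each stage places as many 3:2 compressors as fit in a column, and that the column receives as many carry bits from its less significant neighbor as it sends to its more significant neighbor, so every compressor reduces the column height by one. This is a simplified model for illustration, not the thesis's design procedure, and the function name is hypothetical.

```python
def schedule_3_2_column(height: int) -> list:
    """Return the number of 3:2 compressors used per stage to reduce a column to 2 bits."""
    stages = []
    while height > 2:
        used = height // 3       # compressors that fit in this stage
        stages.append(used)
        height -= used           # 3 bits in, 1 sum out, 1 carry in from the neighbor
    return stages


# 8 partial product bits plus 1 correction bit per column
print(schedule_3_2_column(9))
```

For a column of 9 bits the model yields stages of 3, 2, 1, and 1 compressors, seven in total, which agrees with the description of Fig. 37 (b).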

3.2.4.2 Architecture Designed for a 16-bit Modulo 2^n+1 Multiplier

For a 16-bit compressor, more possible architectures are discussed, as shown in Fig. 38 (a), (b), (c), and (d), respectively. Among all these architectures, (c) is the best choice: it has the smallest delay, the lowest power, and the smallest silicon area, which makes it suitable for low-power high-speed applications. In addition, architecture (c) is composed of only one type of compressor with a regular layout, just like the proposed 8-bit compressor architecture. The simulated performance and power comparison of all these architectures is shown in Section IV.

[Figure: four candidate compressor trees, panels (a) to (d), connecting Inputs to Outputs through 3:2, 4:2, 5:2, and 7:2 compressors]

Fig. 38 Possible Compressor Architectures for a 16-bit Modulo 2^n+1 Multiplier

3.3 CIRCUIT DESIGN OF THE FINAL ADDITION STAGE

Table 2 Comparison between the Kogge-stone adder and the Sparse-tree Adder

Adder Type Logic Depth Max Fanout # of Cells

Kogge-Stone 2

Sparse Tree 2

The comparison between the original Kogge-Stone architecture and the sparse-tree structure is summarized in Table 2. The logic depth and maximum fanout of the sparse-tree structure are the same as those of the Kogge-Stone architecture. However, the total number of critical path blocks used in the sparse-tree structure is much smaller, so the interconnect wire routing problem no longer exists. The advantages of the sparse-tree structure over the Kogge-Stone adder become striking when the input size is large. The architecture of the 16-bit sparse-tree design has been shown in Fig. 10. In Fig. 39, the detailed 4-bit conditional sum generator is proposed.

Fig. 39 the 4-bit Conditional Sum Generator

IV. SIMULATION RESULTS OF THE PROPOSED DESIGN AND TECHNOLOGY COMPARISON

Fig. 40 Delay of the Full Adder Based Compressor

In this thesis, three main improvements have been made to the circuit implementation. First, in the partial product reduction stage, the MUX-based compressor replaces the original full-adder-based design to achieve high performance, low power, and small area. The best architectures of this stage for input widths of 8 and 16 have already been chosen. Secondly, in the final adder stage, a new sparse-tree architecture improves on the original Kogge-Stone design by solving the wire interconnection problem while maintaining the high-speed and low-power characteristics of the Kogge-Stone structure. Finally, CNTFET technology is introduced and compared with the popular bulk CMOS technology. The following simulation results address these points one by one.

4.1 PERFORMANCE COMPARISON BETWEEN THE FULL ADDER BASED COMPRESSOR AND THE MUX BASED COMPRESSOR

The structures of the traditional full-adder-based compressor and the proposed MUX-based compressor are shown in Fig. 19 and Fig. 20, respectively. Table 3 summarizes the power, delay, and area comparison between the two designs. It shows that the MUX-based compressor uses about half as many transistors as its full-adder-based counterpart (250 versus 518), has about 35% less delay (398.74 ps versus 611.41 ps), and consumes about 64% less power (21.93 µW versus 60.75 µW). Thus, the proposed MUX-based compressor is much better than the original full-adder-based one.

Fig. 41 Delay of the MUX Based Compressor

Table 3 Performance Comparison between the Full Adder Based Compressor and the MUX Based Compressor

                    Full Adder Based    MUX Based
Delay               611.41 ps           398.74 ps
Power               60.75 µW            21.93 µW
# of transistors    518                 250
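The relative improvements implied by Table 3 can be computed directly from the tabulated values. The short script below is only an arithmetic aid; the dictionary keys and the function name are illustrative.

```python
# (full adder based, MUX based) values copied from Table 3
table3 = {
    "delay_ps": (611.41, 398.74),
    "power_uW": (60.75, 21.93),
    "transistors": (518, 250),
}


def reduction_percent(reference: float, proposed: float) -> float:
    """Relative reduction of the proposed value with respect to the reference, in percent."""
    return 100.0 * (reference - proposed) / reference


for name, (full_adder, mux_based) in table3.items():
    print(f"{name}: {reduction_percent(full_adder, mux_based):.1f}% lower")
```

This gives reductions of roughly 34.8% in delay, 63.9% in power, and 51.7% in transistor count.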

4.2 SIMULATION RESULTS OF DIFFERENT COMPRESSOR ARCHITECTURES IN THE PARTIAL PRODUCT REDUCTION STAGE

Table 4 Performance and Power Comparison between Different Types of Compressors

                    3:2 Compressor    4:2 Compressor    5:2 Compressor    7:2 Compressor
Delay               1mux+1xor         2mux+1xor         3mux+1xor         5mux+1xor
Delay Simulation    64.26 ps          94.97 ps          126.00 ps         186.26 ps
Power               6.48 µW           9.97 µW           11.87 µW          14.16 µW

There are a variety of architectures for the partial product reduction stage. The architectures with an 8-bit input width are shown in Fig. 37, while those with a 16-bit input width are shown in Fig. 38. Each architecture takes advantage of different types of compressors, whose features are listed in Table 4. In the case of an 8-bit input width, the architecture in Fig. 37 (b) has advantages in delay, power consumption, and area, as shown in Table 5. In the case of a 16-bit input width, the architecture in Fig. 38 (c) is chosen because, as shown in Table 6, it has the lowest delay, the lowest power, and the smallest silicon area among all the possible architectures.

Table 5 Performance and Power Comparison among Different Compressor Architectures for an 8-bit Modulo 2^n+1 Multiplier

                    Fig. 37 (a)    Fig. 37 (b)
Delay               7mux+3xor      4mux+4xor
Delay Simulation    398.74 ps      285.37 ps



Table 6 Performance and Power Comparison among Different Compressor Architectures for a 16-bit Modulo 2^n+1 Multiplier

                      Fig. 38 (a)    Fig. 38 (b)    Fig. 38 (c)    Fig. 38 (d)
Delay                 11mux+6xor     13mux+4xor     6mux+6xor      11mux+5xor
Delay Simulation      673.00 ps      707.47 ps      365.24 ps      489.38 ps
Power                 48.38 µW       53.54 µW       17.67 µW       22.25 µW
# of transistors      578            534            390            496
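The choice of the architecture in Fig. 38 (c) can be cross-checked by computing the power-delay product (PDP) of each column of Table 6. The following sketch uses only the tabulated values; the helper name `pdp_fj` and the dictionary layout are illustrative assumptions.

```python
# Table 6 values: simulated delay in ps and power in uW for each 16-bit architecture.
TABLE6 = {
    "Fig.38 (a)": {"delay_ps": 673.00, "power_uw": 48.38},
    "Fig.38 (b)": {"delay_ps": 707.47, "power_uw": 53.54},
    "Fig.38 (c)": {"delay_ps": 365.24, "power_uw": 17.67},
    "Fig.38 (d)": {"delay_ps": 489.38, "power_uw": 22.25},
}


def pdp_fj(delay_ps, power_uw):
    """Power-delay product in femtojoules (1 ps * 1 uW = 1e-18 J = 1e-3 fJ)."""
    return delay_ps * power_uw * 1e-3


# Rank the architectures from lowest to highest PDP.
ranking = sorted(TABLE6, key=lambda name: pdp_fj(**TABLE6[name]))
for name in ranking:
    print(f"{name}: PDP = {pdp_fj(**TABLE6[name]):.2f} fJ")
```

With these numbers the architecture in Fig. 38 (c) has the lowest PDP (about 6.45 fJ), followed by (d), (a), and (b).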

4.3 SIMULATION RESULTS OF THE SPARSE-TREE ARCHITECTURE AND THE KOGGE-STONE ARCHITECTURE

One more simulation is needed for the final addition stage. The proposed sparse-tree architecture is designed on the assumption that the critical path of the stage is the carry-out generation path, while the 4-bit conditional sum generator is the non-critical path. The simulated delays of the two paths are shown in Fig. 42 and Fig. 43, respectively. The results confirm the assumption: the critical path delay is about 82.15 ps and the non-critical path delay is about 74.84 ps.

Fig. 42 Critical Path Delay of the Sparse-tree Adder

The purpose of replacing the original Kogge-Stone adder with the sparse-tree architecture is to solve the wire interconnection problem in layout. However, the performance and power of the new design should not be worse than those of the Kogge-Stone structure. As shown in Table 7, the PDP of the new sparse-tree structure is even slightly lower than that of the Kogge-Stone structure, while the wire interconnection problem is also solved.


Table 7 Performance and Power Comparison between the Kogge-Stone Architecture and the Sparse-tree Architecture

                      Kogge-Stone    Sparse-tree
Delay                 81.57 ps       82.15 ps



Fig. 43 Noncritical Path Delay of the Sparse-tree Adder

4.4 SIMULATION RESULTS OF THE CNT-BASED DESIGN AND THE BULK CMOS-BASED DESIGN

4.4.1 FEATURES OF THE CNT TECHNOLOGY


Fig. 44 Critical Path Delay of the Kogge-Stone Adder

Finally, the simulation results of the CNT technology and the bulk CMOS technology need to be compared. CNTFETs use semiconducting SWCNTs as the essential element of the integrated circuit. Depending on the arrangement of the atoms in the tube, a SWCNT can act as either a conductor or a semiconductor. The atom arrangement is represented by the integer pair (n, m), and the relationship between n and m determines the characteristics of the CNT: if n = m or n - m = 3i, where i is an integer, the CNT is metallic; otherwise, it is semiconducting. The diameter of the CNT is given by [12]:

$$D_{CNT} = \frac{\sqrt{3}\,a_0}{\pi}\sqrt{n^2 + m^2 + nm} \qquad (34)$$

where a0 = 0.142 nm. The threshold voltage of the CNT transistor is determined by D_CNT, as shown in (35) [12]:

$$V_{th} \approx \frac{E_g}{2e} = \frac{\sqrt{3}}{3}\,\frac{a V_\pi}{e D_{CNT}} \qquad (35)$$

where a = 2.49 Å and, following [12], V_π = 3.033 eV is the carbon π-π bond energy in the tight-binding model and e is the unit electron charge. The threshold voltage is inversely proportional to the diameter D_CNT, so the desired threshold voltage can be obtained by adjusting D_CNT. In this thesis, the (n, m) integer pair is (19, 0) [12], which gives a threshold voltage of 0.293 V.
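The quoted threshold voltage can be reproduced numerically. The sketch below assumes the commonly cited expressions D_CNT = (√3·a0/π)·√(n² + m² + nm) and V_th ≈ (√3/3)·a·V_π/(e·D_CNT), with a0 = 0.142 nm and a = 2.49 Å taken from the text; the bond energy V_π = 3.033 eV is an assumption taken from the model described in [12], and all function names are hypothetical.

```python
import math

A0_NM = 0.142     # a0 from the text, in nm
A_NM = 0.249      # a = 2.49 angstrom from the text, expressed in nm
V_PI_EV = 3.033   # assumed pi-pi bond energy in eV (model of [12], not stated in the text)


def is_metallic(n, m):
    """A (n, m) tube is metallic when n - m is a multiple of 3 (this includes n = m)."""
    return (n - m) % 3 == 0


def cnt_diameter_nm(n, m):
    """Tube diameter in nm for chirality (n, m)."""
    return math.sqrt(3) * A0_NM / math.pi * math.sqrt(n * n + m * m + n * m)


def threshold_voltage_v(n, m):
    """Approximate threshold voltage in volts; dividing eV by e yields volts."""
    return math.sqrt(3) / 3 * A_NM * V_PI_EV / cnt_diameter_nm(n, m)


n, m = 19, 0
print(f"metallic: {is_metallic(n, m)}")
print(f"D_CNT = {cnt_diameter_nm(n, m):.3f} nm")
print(f"V_th  = {threshold_voltage_v(n, m):.3f} V")
```

For the (19, 0) pair this sketch yields a semiconducting tube with a diameter of about 1.49 nm and a threshold voltage of about 0.293 V, in agreement with the value used in this thesis.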

CNTFETs have advantages over CMOS in several respects, based on the observations below. First, when the channel length of the transistors shrinks below a certain level, generally 25 nm, traditional methods can no longer reduce power effectively, because the static power increases rapidly and far outweighs the dynamic power in traditional designs [19]. The maximum leakage power of MOSFET-based gates is 75 times larger than that of CNTFET gates, and the minimum leakage power of MOSFET-based gates is about three times larger than that of CNTFET gates [12]. The second observation concerns the frequency response. In [12], the simulation of an inverter shows that the CNTFET inverter has nearly 3 dB more voltage gain and a 3 times higher 3 dB frequency than the MOSFET inverter. CNTFETs also have an advantage in PVT variation, which is discussed below. The number of tubes in parallel in a CNTFET is equivalent to the width of a CMOS transistor; thus, the “width” of a CNTFET can be adjusted by changing its number of tubes. Unlike in bulk CMOS technology, however, the ratio between pFET and nFET in the CNTFET case is 1:1, because the nFET and the pFET have almost the same current driving ability with the same transistor geometry [12].


To sum up, compared with conventional CMOS technology, CNT technology is increasingly attractive because of its much better performance. With its lower leakage power, better frequency response, lower PVT variation, and extremely low PDP, CNT is a good candidate to replace CMOS in the future [12]. In this section, simulation results of the CNT-based and CMOS-based designs of the proposed modulo 2^n+1 multiplier are compared.

Fig. 45 Delay and Rise-time of the Proposed Multiplier Based on CMOS Technology

4.4.2 POWER, DELAY AND AREA

The simulation waveforms of the final output delay and power of the CNTFET-based modulo 2^n+1 multiplier and its CMOS counterpart, with a fan-out of 4, are shown in Fig. 45, Fig. 46, and Fig. 47 (solid line for CMOS and dotted line for CNTFET). A detailed comparison of delay and power is given in Table 8. All of these simulations are based on the modulo 2^n+1 multiplier with the newly designed blocks discussed in this thesis, including the new compressor, its new parallel architecture, and the sparse-tree design. The PDP of the CNT-based design is about 94 times smaller than the PDP of the bulk CMOS-based design. Simulation results of each newly designed stage are also shown one by one.

Fig. 46 Delay and Rise-time of the Proposed Multiplier Based on CNTFET Technology

Table 8 Performance Comparison between the Proposed Multiplier Based on Two Different Technologies

                      CMOS         CNT
Delay                 494.94 ps    30.25 ps
Rise Time             16.82 ps     0.73 ps
Power                 71.28 µW     12.45 µW
# of Transistors      2738         2738
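The PDP ratio between the two technologies follows directly from the delay and power rows of Table 8. The short sketch below repeats that arithmetic; the variable names are illustrative.

```python
# Delay (ps) and power (uW) of the proposed multiplier, taken from Table 8.
CMOS = {"delay_ps": 494.94, "power_uw": 71.28}
CNT = {"delay_ps": 30.25, "power_uw": 12.45}


def pdp(tech):
    """Power-delay product in ps*uW (the unit cancels in the ratio)."""
    return tech["delay_ps"] * tech["power_uw"]


delay_ratio = CMOS["delay_ps"] / CNT["delay_ps"]
power_ratio = CMOS["power_uw"] / CNT["power_uw"]
pdp_ratio = pdp(CMOS) / pdp(CNT)

print(f"delay ratio = {delay_ratio:.2f}")
print(f"power ratio = {power_ratio:.2f}")
print(f"PDP ratio   = {pdp_ratio:.1f}")
```

The delay ratio is about 16.4 and the power ratio about 5.7, so the PDP ratio is about 93.7, which rounds to the factor of 94 quoted in the text.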


4.4.3 PVT VARIATION

The comparisons of temperature variation, voltage variation, and process variation between the CMOS and CNT technologies are shown in Table 9 to Table 13 and in Fig. 48, Fig. 49, and Fig. 50, respectively. As the tables and figures show, the CNT-based design is much more robust than its CMOS counterpart.

Fig. 47 Power Consumption of the Proposed Multiplier Based on Two Technologies

Table 9 Delay Comparison between Two Technologies with Different Temperatures

Temperature (°C)      CMOS (ps)    CNT (ps)
0                     345.19       31.23
25                    409.48       31.229
50                    480.85       31.23
75                    558.7        31.229
100                   642.93       31.229


Table 10 Rise-time Comparison between Two Technologies with Different Temperatures

Temperature (°C)      CMOS (ps)    CNTFET (ps)
0                     33.3         3.4885
25                    40.83        3.4883
50                    48.972       3.4885
75                    57.957       3.4883
100                   68.875       3.4883

Table 11 Delay Comparison between Two Technologies with Different Supply Voltages

Supply Voltage (V)    CMOS (ps)    CNTFET (ps)
0.72                  551.5        35.057
0.76                  470.29       33.016
0.8                   409.48       31.229
0.84                  362.39       29.729
0.88                  326.14       26.434

Table 12 Delay Comparison between Two Technologies with Different Process Corners

Process Corner        CMOS (ps)    CNTFET (ps)
ff (-3%)              272.8        30.815
normal                409.48       31.229
ss (+3%)              616.69       31.525

Table 13 Rise-time Comparison between Two Technologies with Different Process Corners

Process Corner        CMOS (ps)    CNTFET (ps)
ff (-3%)              30.743       3.2904
normal                40.83        3.4883
ss (+3%)              55.658       3.4779


Fig. 48 Temperature Variation (plot of delay in ps versus temperature in °C, from 0 °C to 100 °C, for CMOS and CNT)

Fig. 49 Voltage Variation (plot of delay in ps versus supply voltage in V, from 0.72 V to 0.88 V, for CMOS and CNT)

Fig. 50 Process Variation

The simulation results show that the delay of the CMOS-based modulo 2^n+1 multiplier increases with temperature, whereas the CNT-based multiplier is far less sensitive to temperature variation: its delay stays essentially constant over the simulated range (Table 9). Similarly, the CNT-based multiplier is less sensitive to voltage variation and process variation. In contrast to the temperature trend, the delay of the multiplier decreases as the supply voltage increases. To sum up, CNT technology is much more robust than its CMOS counterpart, which makes it a very promising choice for applications that require high stability against environmental variation, such as military and research applications.
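One way to quantify this robustness is to normalize the delay spread of each sweep to the nominal delay. The sketch below does this for the delay data of Tables 9, 11, and 12; the metric (maximum minus minimum, divided by the nominal delay) and all names are assumptions chosen for illustration, not a metric defined in this thesis.

```python
# Nominal delays in ps (25 C, 0.8 V, normal corner) from Tables 9, 11 and 12.
NOMINAL = {"CMOS": 409.48, "CNT": 31.229}

# Delay sweeps in ps, copied from Tables 9 (temperature), 11 (voltage), 12 (process).
SWEEPS = {
    "temperature": {
        "CMOS": [345.19, 409.48, 480.85, 558.7, 642.93],
        "CNT": [31.23, 31.229, 31.23, 31.229, 31.229],
    },
    "voltage": {
        "CMOS": [551.5, 470.29, 409.48, 362.39, 326.14],
        "CNT": [35.057, 33.016, 31.229, 29.729, 26.434],
    },
    "process": {
        "CMOS": [272.8, 409.48, 616.69],
        "CNT": [30.815, 31.229, 31.525],
    },
}


def relative_spread(delays, nominal):
    """(max - min) / nominal: fraction of the nominal delay covered by the sweep."""
    return (max(delays) - min(delays)) / nominal


spreads = {
    sweep: {tech: relative_spread(data[tech], NOMINAL[tech]) for tech in data}
    for sweep, data in SWEEPS.items()
}
for sweep, result in spreads.items():
    print(f"{sweep:12s} CMOS {result['CMOS']:.1%}   CNT {result['CNT']:.3%}")
```

With this metric the CMOS delay spread is roughly 73% (temperature), 55% (voltage), and 84% (process) of the nominal delay, while the CNT spread is well below 0.01%, about 28%, and about 2.3%, respectively.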

[Fig. 50 plot: delay (ps) versus process corner (ff, nom, ss) for CMOS and CNT]

V. Conclusion

In this thesis, a new design of a modulo 2^n+1 multiplier is proposed. The new MUX-based compressor increases speed and reduces power compared with the conventional full adder based compressor. The parallel architecture of compressors further speeds up the partial product reduction stage and yields a regular layout. In the final addition stage, the sparse-tree architecture keeps the speed advantage of the Kogge-Stone adder and solves its wire interconnection problem. Finally, a comparison between the CNT-based and CMOS-based designs is presented. CNT has advantages in leakage power, frequency response, PVT variation, and PDP. CNT therefore turns out to be a better choice than CMOS for meeting aggressive high-speed, low-power requirements with less PVT variation, and this thesis can serve as a reference for future CNTFET-based designs in other applications.


References

[1] Modugu, R., Kim, Y.B., and Choi, M., “A fast low-power modulo 2^n+1 multiplier”.

[2] Wang, Y., Swamy, M. N. S., and Omair Ahmad, M., “Residue-to-binary number converters for three moduli set”, IEEE Transactions on Circuits and Systems, Vol. 46, No. 2, February 1999.

[3] Gallaher, D., Petry, F. E., and Srinivasan, P., “The Digit Parallel Method for Fast RNS to Weighted Number System Conversion for Specific Moduli (2^n-1, 2^n, 2^n+1)”, IEEE Transactions on Circuits and Systems, Vol. 44, No. 1, January 1997.

[4] Curiger, A., Bonnenberg, H., and Kaeslin, H., “Regular VLSI Architectures for Multiplication Modulo (2^n + 1)”, IEEE Journal of Solid-State Circuits, Vol. 26, No. 7, July 1991.

[5] Hiasat, A., “New memoryless, mod (2^n+1) residue multiplier”, Electronics Letters, Vol. 28, No. 3, 30th January 1992.

[6] Wrzyszcz, A. and Milford, D., “A new modulo 2^n+1 multiplier”.

[7] Zimmermann, R., “Efficient VLSI implementation of modulo (2^n ± 1) addition and multiplication”, IEEE Trans. Comput., Vol. 51, pp. 1389-1399, 2002.

[8] Chaves, R. and Sousa, L., “Faster Modulo 2^n + 1 Multipliers without Booth recoding”.

[9] Ma, Y., “A Simplified Architecture for Modulo (2^n + 1) Multiplication”, IEEE Transactions on Computers, Vol. 47, No. 3, March 1998.

[10] Wang, Z., Jullien, G.A. and Miller, W.C., “An Efficient Tree Architecture for Modulo 2^n + 1 Multiplication”, Journal of VLSI Signal Processing 14, 241-248, 1996.

[11] Vergos, H.T. and Efstathiou, C., “Design of efficient modulo 2^n + 1 multipliers”, IET Comput. Digit. Tech., Vol. 1, No. 1, pp. 49-57, 2007.

[12] Kim, Y.B., “Integrated circuit design based on carbon nanotube field effect transistor,” IEEE Journal of Trans. on EE Materials, Vol. 12, No. 5, pp. 175-188, Oct. 25, 2011.

[13] Kogge, P. and Stone, H. S., “A parallel algorithm for the efficient solution of a general class of recurrence equations”, IEEE Trans. Comput., Vol. C-22, pp. 786-793, Aug 1973.

[14] Mathew, S., Anders, M., Krishnamurthy, R.K. and Borkar, S., “A 4-GHz 130-nm address generation unit with 32-bit sparse-tree adder core”, IEEE Journal of Solid-State Circuits, Vol. 38, No. 5, pp. 689-695, May 2003.

[15] Vergos, H.T., Efstathiou, C., and Nikolos, D., “Diminished-One Modulo 2^n + 1 Adder Design”, IEEE Transactions on Computers, Vol. 51, No. 12, December 2002.

[16] Sreehari, V., Kirthi, M., Lingamneni, A. and Sreekanth, R., “Novel architectures for high-speed and low-power 3-2, 4-2 and 5-2 compressor,” IEEE 20th International Conference on VLSI Design.

[17] Ma, W.N. and Li, S.G., “A New High Compression Compressor for Large Multiplier”.

[18] Chip-Hong, C., Jiangmin, G. and Mingyan, Z., “Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors for fast arithmetic circuits,” IEEE Trans. on Circuits and Systems, Vol. 51, No. 10, Oct. 2004.

[19] Chandrakasan, A., Bowhill, W.J. and Fox, F., “Design of High-Performance Microprocessor Circuits”, Wiley-IEEE Press, October 2000. http://www.wiley.com/WileyCDA/Section/id-302475.html?query=Frank+Fox


Appendix: Hspice Input Files

A.1 PARTIAL PRODUCT GENERATION STAGE SUBCIRCUIT FOR BOTH CMOS AND CNT TECHNOLOGY

.subckt partial_product

+x1_1 x1_2 x1_3 x1_4 x1_5 x1_6 x1_7 x1_8

+x2_1 x2_2 x2_3 x2_4 x2_5 x2_6 x2_7 x2_8

+x3_1 x3_2 x3_3 x3_4 x3_5 x3_6 x3_7 x3_8

+x4_1 x4_2 x4_3 x4_4 x4_5 x4_6 x4_7 x4_8

+x5_1 x5_2 x5_3 x5_4 x5_5 x5_6 x5_7 x5_8

+x6_1 x6_2 x6_3 x6_4 x6_5 x6_6 x6_7 x6_8

+x7_1 x7_2 x7_3 x7_4 x7_5 x7_6 x7_7 x7_8

+x8_1 x8_2 x8_3 x8_4 x8_5 x8_6 x8_7 x8_8

+a0 a1 a2 a3 a4 a5 a6 a7

+b0 b1 b2 b3 b4 b5 b6 b7

*first (LSB) column of the final partial product matrix

X1 x1_1_bar a0 b0 nand2

X2 x1_2 a7 b1 nand2

X3 x1_3 a6 b2 nand2

X4 x1_4 a5 b3 nand2

X5 x1_5 a4 b4 nand2

X6 x1_6 a3 b5 nand2

X7 x1_7 a2 b6 nand2

X8 x1_8 a1 b7 nand2

X17 x1_1 x1_1_bar inv

*second column of the final partial product matrix

X18 x2_1_bar a1 b0 nand2

X19 x2_2_bar a0 b1 nand2

X20 x2_3 a7 b2 nand2









*third column of the final partial product matrix

X36 x3_1_bar a2 b0 nand2











*fourth column of the final partial product matrix

X55 x4_1_bar a3 b0 nand2













*fifth column of the final partial product matrix

X75 x5_1_bar a4 b0 nand2













*sixth column of the final partial product matrix

X96 x6_1_bar a5 b0 nand2






X102 x6_7 a7 b6 nand2

X103 x6_8 a6 b7 nand2







*seventh column of the final partial product matrix

X118 x7_1_bar a6 b0 nand2








X125 x7_8 a7 b7 nand2








*eighth (MSB) column of the final partial product matrix

X141 x8_1_bar a7 b0 nand2

X142 x8_2_bar a6 b1 nand2















.ends
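The instance lines that survive in the listing above suggest a regular wiring rule: output x{c}_{r} (column c, row r, both counted from 1) combines a[(c - r) mod 8] with b[r - 1], and it is left complemented (direct NAND output) when r > c, that is, when the term wraps around the modulus. The sketch below is a behavioral model built on that inferred rule, not a transcription of the full netlist. It assumes 8-bit operands in ordinary weighted binary and models the correction for the complemented terms as a plain subtraction, which is not necessarily how the thesis circuit applies it. All names are hypothetical.

```python
N = 8                     # operand width used by the appendix netlists
MODULUS = (1 << N) + 1    # 2^n + 1 = 257


def operands(col, row):
    """Map partial product x{col}_{row} (1-based) to (a index, b index, inverted)."""
    return (col - row) % N, row - 1, row > col


def partial_product_bit(a, b, col, row):
    """Bit value of x{col}_{row}: AND of the two operand bits, complemented if wrapped."""
    a_idx, b_idx, inverted = operands(col, row)
    bit = (a >> a_idx) & (b >> b_idx) & 1
    return bit ^ 1 if inverted else bit


def multiply_mod(a, b):
    """Sum the weighted partial products and remove the complement correction.

    Since 2^n = -1 (mod 2^n + 1), a wrapped term -x * 2^k equals (NOT x) * 2^k - 2^k,
    so every complemented bit of weight 2^k needs a correction of -2^k.
    """
    total = 0
    for col in range(1, N + 1):
        weight = 1 << (col - 1)
        for row in range(1, N + 1):
            total += partial_product_bit(a, b, col, row) * weight
            if row > col:
                total -= weight
    return total % MODULUS


print(operands(1, 2))          # wiring of x1_2
print(multiply_mod(200, 123))  # compare with (200 * 123) % 257
```

Checking `multiply_mod(a, b)` against `(a * b) % 257` over the 8-bit operand range is a quick way to confirm that the inferred index rule is arithmetically consistent.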


A.2 PARTIAL PRODUCT REDUCTION STAGE SUBCIRCUIT FOR BOTH CMOS AND CNT TECHNOLOGY

.subckt overall_compressor sum4_8 sum4_7 sum4_6 sum4_5 sum4_4 sum4_3 sum4_2 sum4_1

+carry4_7 carry4_6 carry4_5 carry4_4 carry4_3 carry4_2 carry4_1 carry4_8_bar

+x1_1 x1_2 x1_3 x1_4 x1_5 x1_6 x1_7 x1_8

+x2_1 x2_2 x2_3 x2_4 x2_5 x2_6 x2_7 x2_8

+x3_1 x3_2 x3_3 x3_4 x3_5 x3_6 x3_7 x3_8

+x4_1 x4_2 x4_3 x4_4 x4_5 x4_6 x4_7 x4_8

+x5_1 x5_2 x5_3 x5_4 x5_5 x5_6 x5_7 x5_8

+x6_1 x6_2 x6_3 x6_4 x6_5 x6_6 x6_7 x6_8

+x7_1 x7_2 x7_3 x7_4 x7_5 x7_6 x7_7 x7_8

+x8_1 x8_2 x8_3 x8_4 x8_5 x8_6 x8_7 x8_8

*x1_2 = row2,column1

*sum1_2 = 32compressor block level 1, second block

X1 sum1_1 carry1_1 x1_1 x1_2 x1_3 compressor


X3 sum1_3 carry1_3 x1_7 x1_8 0 compressor



X8 sum1_8 carry1_8 x2_7 x2_8 vdd compressor




















X81 carry1_36_bar carry1_36 inv



*new column1 inputs: sum1_1, sum1_2, sum1_3, sum1_4, sum1_5, carry1_76_bar,

*carry1_77_bar, carry1_78_bar, carry1_79_bar, carry1_80_bar, x1_16, 0

*new column2 inputs: sum1_6, sum1_7, sum1_8, sum1_9, sum1_10, carry1_1, carry1_2,

*carry1_3, carry1_4, carry1_5, x2_16, vdd

X86 sum2_1 carry2_1 sum1_1 sum1_2 sum1_3 compressor

X87 sum2_2 carry2_2 carry1_36_bar carry1_37_bar carry1_38_bar compressor



X91 sum2_6 carry2_6 carry1_1 carry1_2 carry1_3 compressor















*new column1 inputs: sum2_1 sum2_2 sum2_3 sum2_4 carry2_61_bar carry2_62_bar

*new column2 inputs: sum2_5 sum2_6 sum2_7 sum2_8 carry2_1 carry2_2

X154 sum3_1 carry3_1 sum2_1 sum2_2 carry2_29_bar compressor

X155 sum3_2 carry3_2 sum2_5 sum2_6 carry2_1 compressor









*new column1 inputs: sum3_1 sum3_2 carry2_63_bar carry2_64_bar carry3_31_bar

*carry3_32_bar

*new column2 inputs: sum3_3 sum3_4 carry2_3 carry2_4 carry3_1 carry3_2

X200 sum4_1 carry4_1 sum3_1 carry3_8_bar carry2_30_bar compressor

X201 sum4_2 carry4_2 sum3_2 carry3_1 carry2_2 compressor








*final output: sum4_8 sum4_7 sum4_6 sum4_5 sum4_4 sum4_3 sum4_2 sum4_1

* carry4_7 carry4_6 carry4_5 carry4_4 carry4_3 carry4_2 carry4_1 carry4_8_bar

.ends


A.3 FINAL ADDITION STAGE SUBCIRCUIT FOR BOTH CMOS AND CNT TECHNOLOGY

.subckt carry_merge C_m1 C_m1_bar C3 g0 g1 g2 g3 g4 g5 g6 g7 p0 p1 p2 p3 p4 p5 p6 p7

+psum0 psum0_bar psum1 psum2 psum3 psum4 psum4_bar psum5 psum6 psum7 a0 a1 a2 a3 a4

+a5 a6 a7 b0 b1 b2 b3 b4 b5 b6 b7

*generate & propagate generation block

X1 g0 p0 a0 b0 g_p

X2 g1 p1 a1 b1 g_p

X3 g2 p2 a2 b2 g_p

X4 g3 p3 a3 b3 g_p

X5 g4 p4 a4 b4 g_p

X6 g5 p5 a5 b5 g_p

X7 g6 p6 a6 b6 g_p

X8 g7 p7 a7 b7 g_p

*inverted carry merge block

X21 go5 po5 g7 g6 p7 p6 s_o




X27 go11 po11 go5 go6 po5 po6 s_o

X28 go12 po12 go7 go8 po7 po8 s_o

X31 C_m1_bar po15 go11 go12 po11 po12 s_o

X32 C3 po19 go12 go11_b po12 po11 s_o

* final carry

X33 C_m1 C_m1_bar inv

X33i go11_b go11 inv


*partial sum generation block

X37 psum0_bar psum0 a0 b0 xor_xnor








.ends

*4-bit conditional sum generator

.subckt cond s0 s1 s2 s3 cin cin_bar psum0 psum0_bar psum1 psum2 psum3 g0 g1 g2 g3 p0 p1

+p2 p3

*s0 generator

X3 s0 psum0_bar psum0 cin cin_bar mux

*s1 generator

X4 n4 g0 p0 nor2

X5 n3 n3_0 n4 psum1 xor_xnor

X6 n08 n2 g0 psum1 xor_xnor

X7 s1 n3 n2 cin cin_bar mux

*s2 generator

X8 n9 g0 p1 nand2

X9 n10 g1 inv

X10 n7 n9 n10 nand2

X11 n7_0 n7 inv

X12 n11 p1 p0 nand2

X13 n8 n11 n7_0 nand2


X14 n09 n6 n8 psum2 xor_xnor



*s3 generator

X17 n16 p1 p2 g0 nand3

X18 n17 p2 g1 nand2

X19 n13 g2 inv

X20 n18 n16 n17 n13 nand3

X21 n14 n18 inv

X22 n12 p1 p2 p0 nand3

X23 n15 n12 n14 nand2




.ends

*final stage adder

.subckt final_stage_adder s0 s1 s2 s3 s4 s5 s6 s7 a0 a1 a2 a3 a4 a5 a6 a7 b0 b1 b2 b3 b4 b5 b6 b7

X1 C_m1 C_m1_bar C3 g0 g1 g2 g3 g4 g5 g6 g7 p0 p1 p2 p3 p4 p5 p6 p7 psum0 psum0_bar

+psum1 psum2 psum3 psum4 psum4_bar psum5 psum6 psum7 a0 a1 a2 a3 a4 a5 a6 a7 b0 b1 b2

+b3 b4 b5 b6 b7 carry_merge

X2 C3_bar C3 inv

X5 s0 s1 s2 s3 C_m1 C_m1_bar psum0 psum0_bar psum1 psum2 psum3 g0 g1 g2 g3

+p0 p1 p2 p3 cond

X6 s4 s5 s6 s7 C3 C3_bar psum4 psum4_bar psum5 psum6 psum7 g4 g5 g6 g7 p4

+p5 p6 p7 cond

.ends


A.4 OTHER SUBCIRCUITS FOR CMOS TECHNOLOGY

*inverter

.subckt inv out in

M1 out in vdd vdd pmos W=256n L=32n

M2 out in 0 0 nmos W=128n L=32n

.ends

*xor_xnor

.subckt xor_xnor xnor xor a b

M1 a b xor vdd pmos L=32nm W=256nm

M2 xor a b vdd pmos L=32nm W=256nm

M3 xor b 1 0 nmos L=32nm W=64nm

M4 1 a 0 0 nmos L=32nm W=64nm

M5 xnor xor vdd vdd pmos L=32nm W=64nm

M6 xor xnor 0 0 nmos L=32nm W=32nm

M7 2 b vdd vdd pmos L=32nm W=128nm

M8 xnor a 2 vdd pmos L=32nm W=128nm

M9 a b xnor 0 nmos L=32nm W=128nm

M10 xnor a b 0 nmos L=32nm W=128nm

.ends

*2 to 1 mux

.subckt mux out a b set set_bar

M1 1 a vdd vdd pmos W=128nm L=32nm

M2 4 b vdd vdd pmos W=128nm L=32nm

M3 2 set_bar 1 vdd pmos W=128nm L=32nm

M4 2 set 4 vdd pmos W=128nm L=32nm

M5 2 set 3 0 nmos W=64nm L=32nm

M6 2 set_bar 5 0 nmos W=64nm L=32nm

M7 3 a 0 0 nmos W=64nm L=32nm


M8 5 b 0 0 nmos W=64nm L=32nm

X1 out 2 inv

.ends

*3 input nand

.subckt nand3 out a b c

m1 out a vdd vdd pmos l=32n w=256n

m2 out b vdd vdd pmos l=32n w=256n

m3 out c vdd vdd pmos l=32n w=256n

m4 out a 2 0 nmos l=32n w=384n

m5 2 b 3 0 nmos l=32n w=384n

m6 3 c 0 0 nmos l=32n w=384n

.ends

*2 input nand

.subckt nand2 out a b

m1 out a vdd vdd pmos l=32n w=256n

m2 out b vdd vdd pmos l=32n w=256n


m4 2 b 0 0 nmos l=32n w=256n

.ends

*2 input nor

.subckt nor2 out a b

m1 2 a vdd vdd pmos l=32n w=512n

m2 out b 2 vdd pmos l=32n w=512n


m4 out b 0 0 nmos l=32n w=128n

.ends

* special operator


.subckt s_o Gout Pout gl gr pl pr

X1 1 pl gr nand2

X2 2 gl inv

X3 Gout 1 2 nand2

X4 3 pl pr nand2

X5 Pout 3 inv

.ends

* G_P generator

.subckt g_p gi pi ai bi

X1 gi 1 inv

X2 pi 2 inv

X3 1 ai bi nand2

X4 2 ai bi nor2

.ends
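Read at the gate level, the g_p subcircuit above produces generate = a AND b (NAND followed by an inverter) and propagate = a OR b (NOR followed by an inverter), and the s_o subcircuit produces Gout = gl OR (pl AND gr) and Pout = pl AND pr, which is the usual parallel-prefix carry operator. The behavioral sketch below models only these two cells and chains them serially to obtain a plain binary carry-out; it does not reproduce the sparse-tree wiring or the inverted end-around carry of the modulo adder, and the function `carry_out` is a hypothetical helper.

```python
def g_p(a_bit, b_bit):
    """g_p cell: (generate, propagate) = (a AND b, a OR b)."""
    return a_bit & b_bit, a_bit | b_bit


def s_o(gl, gr, pl, pr):
    """s_o cell: merge a left (more significant) group with a right group."""
    return gl | (pl & gr), pl & pr


def carry_out(a, b, width=8):
    """Carry-out of a + b (carry-in 0), folding the s_o operator from the LSB upward."""
    g_acc, p_acc = g_p(a & 1, b & 1)
    for i in range(1, width):
        g_i, p_i = g_p((a >> i) & 1, (b >> i) & 1)
        g_acc, p_acc = s_o(g_i, g_acc, p_i, p_acc)
    return g_acc


print(carry_out(200, 100))  # 200 + 100 = 300 overflows 8 bits
print(carry_out(100, 100))  # 200 fits in 8 bits
```

Because the operator is associative, a tree such as Kogge-Stone or a sparse tree can evaluate the same group signals in logarithmic depth; the serial fold here is used only to keep the sketch short.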

.subckt compressor sum carry x1 x2 x3

X1 xnor xor x1 x2 xor_xnor

X2 sum xnor xor x3 x3_bar mux

X3 x3_bar x3 inv

X4 carry x1 x3 xnor xor mux

.ends
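The compressor subcircuit above can be checked against the full adder function it replaces. In the sketch below, the behavior of the mux cell (out = a when set is high, otherwise b) and the port order of xor_xnor (xnor first, then xor) are read off the transistor-level netlists in this appendix; the Python functions themselves are illustrative models, not part of the simulation flow.

```python
from itertools import product


def xor_xnor(a, b):
    """Model of the xor_xnor cell; returns (xnor, xor) in the netlist port order."""
    return 1 - (a ^ b), a ^ b


def mux(a, b, sel):
    """Model of the 2-to-1 mux cell: output follows a when sel is 1, b when sel is 0."""
    return a if sel else b


def compressor(x1, x2, x3):
    """MUX based 3:2 compressor wired as in the compressor subcircuit."""
    xnor, xor = xor_xnor(x1, x2)
    s = mux(xnor, xor, x3)       # X2: sum = x1 XOR x2 XOR x3
    carry = mux(x1, x3, xnor)    # X4: carry = x1 if x1 == x2, otherwise x3
    return s, carry


for x1, x2, x3 in product((0, 1), repeat=3):
    s, carry = compressor(x1, x2, x3)
    print(x1, x2, x3, "->", "sum", s, "carry", carry)
```

For every input combination the outputs satisfy sum + 2·carry = x1 + x2 + x3, so the two multiplexers driven by one XOR/XNOR pair implement the full adder function.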


A.5 OTHER SUBCIRCUITS FOR CNT TECHNOLOGY

* PCNFET Lch=32nm n1=19 n2=0 tubes=8

* NCNFET Lch=32nm n1=19 n2=0 tubes=8

*inverter

.subckt inv out in

X1 out in vdd vdd PCNFET Lch=32nm n1=19 n2=0 tubes=8

X2 out in 0 0 NCNFET Lch=32nm n1=19 n2=0 tubes=8

.ends

*xor_xnor


X1 a b xor vdd PCNFET Lch=32nm n1=19 n2=0 tubes=8

X2 xor a b vdd PCNFET Lch=32nm n1=19 n2=0 tubes=8

X3 xor b 1 0 NCNFET Lch=32nm n1=19 n2=0 tubes=4

X4 1 a 0 0 NCNFET Lch=32nm n1=19 n2=0 tubes=4

X5 xnor xor vdd vdd PCNFET Lch=32nm n1=19 n2=0 tubes=2

X6 xor xnor 0 0 NCNFET Lch=32nm n1=19 n2=0 tubes=2

X7 2 b vdd vdd PCNFET Lch=32nm n1=19 n2=0 tubes=4

X8 xnor a 2 vdd PCNFET Lch=32nm n1=19 n2=0 tubes=4

X9 a b xnor 0 NCNFET Lch=32nm n1=19 n2=0 tubes=8

X10 xnor a b 0 NCNFET Lch=32nm n1=19 n2=0 tubes=8

.ends

*2 to 1 mux

.subckt mux out a b set set_bar

X1 1 a vdd vdd PCNFET Lch=32nm n1=19 n2=0 tubes=4

X2 4 b vdd vdd PCNFET Lch=32nm n1=19 n2=0 tubes=4

X3 2 set_bar 1 vdd PCNFET Lch=32nm n1=19 n2=0 tubes=4

X4 2 set 4 vdd PCNFET Lch=32nm n1=19 n2=0 tubes=4


X5 2 set 3 0 NCNFET Lch=32nm n1=19 n2=0 tubes=4

X6 2 set_bar 5 0 NCNFET Lch=32nm n1=19 n2=0 tubes=4

X7 3 a 0 0 NCNFET Lch=32nm n1=19 n2=0 tubes=4

X8 5 b 0 0 NCNFET Lch=32nm n1=19 n2=0 tubes=4

X0 out 2 inv

.ends

*3 input nand

.subckt nand3 out a b c

X1 out a vdd vdd PCNFET Lch=32nm n1=19 n2=0 tubes=8

X2 out b vdd vdd PCNFET Lch=32nm n1=19 n2=0 tubes=8

X3 out c vdd vdd PCNFET Lch=32nm n1=19 n2=0 tubes=8

X4 out a 2 0 NCNFET Lch=32nm n1=19 n2=0 tubes=24


X6 3 c 0 0 NCNFET Lch=32nm n1=19 n2=0 tubes=24

.ends

*2 input nand


X1 out a vdd vdd PCNFET Lch=32nm n1=19 n2=0 tubes=8

X2 out b vdd vdd PCNFET Lch=32nm n1=19 n2=0 tubes=8



.ends

*2 input nor


X1 2 a vdd vdd PCNFET Lch=32nm n1=19 n2=0 tubes=16

X2 out b 2 vdd PCNFET Lch=32nm n1=19 n2=0 tubes=16


X4 out b 0 0 NCNFET Lch=32nm n1=19 n2=0 tubes=8


.ends

* special operator

.subckt s_o Gout Pout gl gr pl pr

X1 1 pl gr nand2

X2 2 gl inv

X3 Gout 1 2 nand2

X4 3 pl pr nand2

X5 Pout 3 inv

.ends

* G_P generator

.subckt g_p gi pi ai bi

X1 gi 1 inv

X2 pi 2 inv

X3 1 ai bi nand2

X4 2 ai bi nor2

.ends

.subckt compressor sum carry x1 x2 x3

X1 xnor xor x1 x2 xor_xnor

X2 sum xnor xor x3 x3_bar mux

X3 x3_bar x3 inv

X4 carry x1 x3 xnor xor mux

.ends


A.6 MODULO 2^n+1 MULTIPLIER TESTING CIRCUIT FOR BOTH CMOS AND CNT TECHNOLOGY

.lib "CNFET.lib" CNFET

*.include 'PTM_customized_32nm_nom.lib'

.include 'partial_product_8bit.sp'

.include 'compressor_8bit.sp'

.include 'sparsetree_8bit.sp'

.include 'subckt_CNT.sp'

.global vdd

Vdd vdd 0 0.8

Va0 a000 0 pulse 0.8 0 0 1e-10 1e-10 4e-10 1e-9

Xi1 a00 a000 inv

Xi2 a0 a00 inv

*Va0 a0 0 0.8

Va1 a1 0 0

Va2 a2 0 0

Va3 a3 0 0

Va4 a4 0 0

Va5 a5 0 0

Va6 a6 0 0

Va7 a7 0 0

Vb0 b0 0 0.8

Vb1 b1 0 0

Vb2 b2 0 0

Vb3 b3 0 0


Vb4 b4 0 0

Vb5 b5 0 0

Vb6 b6 0 0

Vb7 b7 0 0

X1 x1_1 x1_2 x1_3 x1_4 x1_5 x1_6 x1_7 x1_8

+x2_1 x2_2 x2_3 x2_4 x2_5 x2_6 x2_7 x2_8

+x3_1 x3_2 x3_3 x3_4 x3_5 x3_6 x3_7 x3_8

+x4_1 x4_2 x4_3 x4_4 x4_5 x4_6 x4_7 x4_8

+x5_1 x5_2 x5_3 x5_4 x5_5 x5_6 x5_7 x5_8

+x6_1 x6_2 x6_3 x6_4 x6_5 x6_6 x6_7 x6_8

+x7_1 x7_2 x7_3 x7_4 x7_5 x7_6 x7_7 x7_8

+x8_1 x8_2 x8_3 x8_4 x8_5 x8_6 x8_7 x8_8

+a0 a1 a2 a3 a4 a5 a6 a7

+b0 b1 b2 b3 b4 b5 b6 b7

+partial_product

X2 sum6_8 sum6_7 sum6_6 sum6_5 sum6_4 sum6_3 sum6_2 sum6_1

+carry6_7 carry6_6 carry6_5 carry6_4 carry6_3 carry6_2 carry6_1 carry6_8_bar

+x1_1 x1_2 x1_3 x1_4 x1_5 x1_6 x1_7 x1_8

+x2_1 x2_2 x2_3 x2_4 x2_5 x2_6 x2_7 x2_8

+x3_1 x3_2 x3_3 x3_4 x3_5 x3_6 x3_7 x3_8

+x4_1 x4_2 x4_3 x4_4 x4_5 x4_6 x4_7 x4_8

+x5_1 x5_2 x5_3 x5_4 x5_5 x5_6 x5_7 x5_8

+x6_1 x6_2 x6_3 x6_4 x6_5 x6_6 x6_7 x6_8

+x7_1 x7_2 x7_3 x7_4 x7_5 x7_6 x7_7 x7_8

+x8_1 x8_2 x8_3 x8_4 x8_5 x8_6 x8_7 x8_8

+overall_compressor

X3 s0 s1 s2 s3 s4 s5 s6 s7

+sum6_1 sum6_2 sum6_3 sum6_4 sum6_5 sum6_6 sum6_7 sum6_8


+carry6_8_bar carry6_1 carry6_2 carry6_3 carry6_4 carry6_5 carry6_6 carry6_7

+final_stage_adder

X4 to1 s0 inv

X5 to2 s0 inv

X6 to3 s0 inv

X7 to4 s0 inv

.options AUTOSTOP

.options INGOLD=2 DCON=1

.options GSHUNT=1e-12 RMIN=1e-15

.options ABSTOL=1e-5 ABSVDC=1e-4

.options RELTOL=1e-2 RELVDC=1e-2

.options NUMDGT=4 PIVOT=1

.option convergence=1

.param TEMP=27

.option post

.tran 1e-12 2e-9

.end
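The stimulus in the test circuit above is a SPICE pulse source, `pulse 0.8 0 0 1e-10 1e-10 4e-10 1e-9`, whose arguments are, in order, the initial value, the pulsed value, the delay, the rise time, the fall time, the pulse width, and the period. The sketch below evaluates that piecewise-linear waveform at a given time. It models the ideal source only, not the two buffering inverters that follow it, and the function name `pulse` and its keyword names are illustrative.

```python
def pulse(t, v1=0.8, v2=0.0, td=0.0, tr=1e-10, tf=1e-10, pw=4e-10, per=1e-9):
    """Ideal PULSE source value at time t (seconds).

    The source starts at v1, ramps to v2 over tr, holds v2 for pw,
    ramps back to v1 over tf, and repeats every per seconds.
    """
    if t < td:
        return v1
    tau = (t - td) % per                  # position inside the current period
    if tau < tr:                          # first ramp: v1 -> v2
        return v1 + (v2 - v1) * tau / tr
    if tau < tr + pw:                     # flat part at v2
        return v2
    if tau < tr + pw + tf:                # second ramp: v2 -> v1
        return v2 + (v1 - v2) * (tau - tr - pw) / tf
    return v1                             # rest of the period at v1


for t_ps in (0, 50, 300, 550, 800):
    print(f"t = {t_ps:4d} ps -> {pulse(t_ps * 1e-12):.2f} V")
```

With the values used in the test circuit, the source falls from 0.8 V to 0 V during the first 100 ps, stays low for 400 ps, and returns to 0.8 V by 600 ps, so each 1 ns period of the 2 ns transient run contains one falling and one rising input edge.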


A.7 7:2 COMPRESSOR SUBCIRCUIT AND ITS TESTING CIRCUIT

*.lib "CNFET.lib" CNFET

.include 'PTM_customized_32nm_nom.lib'

.global vdd

* PCNFET Lch=32nm n1=19 n2=0 tubes=8

* NCNFET Lch=32nm n1=19 n2=0 tubes=8

X1 3 4 x5 x6 xor_xnor

X2 5 6 x2 x3 xor_xnor

X3 9 19 3 4 x7 x7_0 mux

X4 10 11 5 6 x4 x4_0 mux

X5 13 14 10 11 9 9_0 mux

X6 15 16 13 14 x1 x1_0 mux

X7 17 18 15 16 cin2 cin2_0 mux

X8 carry 15 cin1 17 18 mux_single

X9 sum 23 17 18 cin1 cin1_0 mux

x10 t1 x2 x3 x4 CGEN

x11 b x5 x6 x7 CGEN

x12 t2 x2 x3 x4 nor3

x13 t3 t2_0 x1 nand2

x14 a t1_0 t3 nand2

x15 t9 t4 10 x1 xor_xnor

x16 t5 x1 x2 x3 x4 nand4

x17 t6 t4 9 nand2

x18 c t5 t6 nand2

x19 cout1 a b c CGEN

x20 t7 t8 a b xor_xnor

x21 cout2 t7 t8 c c_0 mux_single


Xi1 12_0 12 inv

Xi2 x1_0 x1 inv

Xi3 cin2_0 cin2 inv

Xi4 cin1_0 cin1 inv

xi5 x7_0 x7 inv

xi6 x4_0 x4 inv

xi7 9_0 9 inv

xi8 c_0 c inv

xi9 t2_0 t2 inv

xi10 t1_0 t1 inv

.subckt inv out in

M1 out in vdd vdd pmos L=32nm W=64nm

M2 out in 0 0 nmos L=32nm W=32nm

.ends


M1 a b xor vdd pmos L=32nm W=48nm

M2 xor a b vdd pmos L=32nm W=48nm

M3 xor b 1 0 nmos L=32nm W=32nm


M5 vdd xor xnor vdd pmos L=32nm W=48nm

M6 xor xnor 0 0 nmos L=32nm W=32nm

M7 vdd b 2 vdd pmos L=32nm W=48nm

M8 2 a xnor vdd pmos L=32nm W=48nm

M9 a b xnor 0 nmos L=32nm W=32nm

M10 xnor a b 0 nmos L=32nm W=32nm

.ends

.subckt mux_single out a b set set_bar

M1 1 a vdd vdd pmos L=32nm W=48nm



M3 2 set_bar 1 vdd pmos L=32nm W=48nm

M4 2 set 4 vdd pmos L=32nm W=48nm

M5 2 set 3 0 nmos L=32nm W=32nm

M6 2 set_bar 5 0 nmos L=32nm W=32nm


M8 5 b 0 0 nmos L=32nm W=32nm

M9 out 2 vdd vdd pmos L=32nm W=48nm

M10 out 2 0 0 nmos L=32nm W=32nm

.ends

.subckt mux out outbar a b set set_bar

M1 a set out 0 nmos L=32nm W=32nm

M2 b set_bar out 0 nmos L=32nm W=32nm

M3 b set outbar 0 nmos L=32nm W=32nm

M4 a set_bar outbar 0 nmos L=32nm W=32nm

M5 out outbar vdd vdd pmos L=32nm W=48nm

M6 outbar out vdd vdd pmos L=32nm W=48nm

.ends

.subckt CGEN carry a b cin



M3 2 cin 1 vdd pmos L=32nm W=48nm

M4 2 cin 3 0 nmos L=32nm W=32nm




M8 2 a 4 vdd pmos L=32nm W=48nm




X11 carry 2 inv

.ends

*2 input nand


M1 out a vdd vdd pmos L=32nm W=64nm

M2 out b vdd vdd pmos L=32nm W=64nm

M3 out a 2 0 nmos L=32nm W=64nm


.ends

*4 input nand

.subckt nand4 out a b c d

M1 out a vdd vdd pmos L=32nm W=64nm

M2 out b vdd vdd pmos L=32nm W=64nm

M3 out c vdd vdd pmos L=32nm W=64nm

M4 out d vdd vdd pmos L=32nm W=64nm



M7 2 c 3 0 nmos L=32nm W=128nm

M8 3 d 0 0 nmos L=32nm W=128nm

.ends

*3 input nor

.subckt nor3 out a b c


M2 3 b 2 vdd pmos L=32nm W=192nm

M3 out c 3 vdd pmos L=32nm W=192nm


M5 out b 0 0 nmos L=32nm W=32nm

M6 out c 0 0 nmos L=32nm W=32nm


.ends

*2 input nor



M2 out b 2 vdd pmos L=32nm W=128nm


M4 out b 0 0 nmos L=32nm W=32nm

.ends

Vdd vdd 0 0.8

Va a00 0 pulse 0.8 0 0 1e-10 1e-10 4e-10 1e-9

Xii1 a0 a00 inv

Xii2 x1 a0 inv

Vb b00 0 pulse 0.8 0 0 1e-10 1e-10 4e-10 1e-9

Xii3 b0 b00 inv

Xii4 x2 b0 inv

Vc c00 0 pulse 0.8 0 0 1e-10 1e-10 4e-10 1e-9

Xii5 c0 c00 inv

Xii6 x3 c0 inv

Vd d00 0 pulse 0.8 0 0 1e-10 1e-10 4e-10 1e-9

Xii7 d0 d00 inv

Xii8 x4 d0 inv

Ve e00 0 pulse 0.8 0 0 1e-10 1e-10 4e-10 1e-9

Xii9 e0 e00 inv

Xii10 x5 e0 inv


Vf f00 0 pulse 0.8 0 0 1e-10 1e-10 4e-10 1e-9

Xii11 f0 f00 inv

Xii12 x6 f0 inv

Vg g00 0 pulse 0.8 0 0 1e-10 1e-10 4e-10 1e-9

Xii13 g0 g00 inv

Xii14 x7 g0 inv

Vh h00 0 pulse 0.8 0 0 1e-10 1e-10 4e-10 1e-9

Xii15 h0 h00 inv

Xii16 cin1 h0 inv

Vi i00 0 pulse 0.8 0 0 1e-10 1e-10 4e-10 1e-9

Xii17 i0 i00 inv

Xii18 cin2 i0 inv

.options POST

.options AUTOSTOP

.options INGOLD=2 DCON=1

.options GSHUNT=1e-12 RMIN=1e-15

.options ABSTOL=1e-5 ABSVDC=1e-4

.options RELTOL=1e-2 RELVDC=1e-2

.options NUMDGT=4 PIVOT=1

.option convergence=1

.param TEMP=27

.option post

.tran 1e-12 20e-10

.end
