design of fast and low power multiplier

8/13/2019 Design Of Fast and low power Multiplier




Department of ECE, JNTUHCEH 2

multipliers have been plagued by complicated switching systems and/or irregularities

in design.

1.2 Power Optimization

Power is the rate at which energy is dissipated, measured in Joules per unit time, whereas energy is the total number of Joules dissipated by a circuit. In digital CMOS design, the well-known power-delay product is commonly used to assess the merits of designs. It can be written as power × delay = (energy/delay) × delay = energy, so the power-delay product is simply the energy consumed per operation and is independent of how quickly the operation completes.
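As a small illustration of the identity above, the following sketch computes the power-delay product for a hypothetical block. The function name and the numbers are assumptions chosen for the example; they are not measurements from this project.

```python
def power_delay_product(power_watts, delay_seconds):
    """Return the power-delay product, i.e. the energy per operation in Joules."""
    return power_watts * delay_seconds


# Hypothetical figures: a block that dissipates 2 mW and settles in 5 ns.
energy = power_delay_product(2e-3, 5e-9)
print(f"energy per operation = {energy:.1e} J")
```

Two designs with the same product spend the same energy per operation even if one is faster, which is why delay has to be judged separately.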

1.3 Low-Power Multiplier Design

Multiplication consists of three steps: partial product generation (PPG), partial product reduction (PPR), and a final carry-propagate addition (CPA). In general there are sequential and combinational multiplier implementations. Only the combinational case is considered here, because the scale of integration is now large enough to accommodate parallel multiplier implementations in digital VLSI systems.

Different multiplication algorithms vary in their approaches to PPG, PPR, and CPA. For PPG, an AND gate array is the simplest option; an array of multiplexers can also be used. For PPR, two alternatives exist: reduction by rows, performed by an array of adders, and reduction by columns, performed by an array of counters. The final CPA requires a fast adder scheme because it is on the critical path. In some cases, the final CPA is postponed if it is advantageous to keep the result in redundant form for further arithmetic operations.
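The three steps can be sketched behaviourally as follows. This is a minimal Python model, not the Verilog used in the project: the function names are assumptions, PPG uses the AND gate array, PPR is a simple column reduction with (3,2) counters only, and the CPA is modelled as a bit-serial carry chain.

```python
def partial_products(x, y, n):
    """PPG: AND-gate array. cols[k] holds the partial-product bits of weight 2**k."""
    cols = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((x >> i) & 1) & ((y >> j) & 1))
    return cols


def reduce_columns(cols):
    """PPR: apply (3,2) counters column by column until no column has more than two bits."""
    cols = [list(c) for c in cols]
    while any(len(c) > 2 for c in cols):
        nxt = [[] for _ in range(len(cols) + 1)]
        for k, c in enumerate(cols):
            while len(c) >= 3:
                a, b, cin = c.pop(), c.pop(), c.pop()
                nxt[k].append(a ^ b ^ cin)                           # sum keeps the weight
                nxt[k + 1].append((a & b) | (b & cin) | (a & cin))   # carry moves up one column
            nxt[k].extend(c)                                         # leftover bits pass through
        cols = nxt
    return cols


def carry_propagate_add(cols):
    """CPA: add the (at most two) remaining rows with a simple carry chain."""
    result, carry = 0, 0
    for k, c in enumerate(cols):
        total = sum(c) + carry
        result |= (total & 1) << k
        carry = total >> 1
    return result | (carry << len(cols))


def multiply(x, y, n):
    return carry_propagate_add(reduce_columns(partial_products(x, y, n)))


print(multiply(13, 11, 4))
```

The Wallace, Dadda and HPM schemes discussed later differ only in how the counters are placed inside the reduction step.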

1.4 Languages and Tools Used

Verilog HDL was used as the primary design language. Simulation was carried out with the Synopsys VCS compiler, and synthesis with Synopsys Design Compiler targeting a 90 nm process technology.

1.5 Research Approach

The basic motive of the project was to study and develop an efficient multiplier that is both fast and low power, so speed and power had to be optimized simultaneously. The basic building block of a multiplier is the adder circuit, so the adders were studied first. The area occupied and the delay of different adders were studied, and a relation between the time and area complexity of all the adders under consideration was established.

The focus then turned to the multipliers. Different multipliers were coded, their waveforms verified, and the area and power consumed by each circuit calculated. The delay of each multiplier was also calculated, which helped to determine the best multiplier. The HPM multiplier was found to be the best among all, with the lowest power consumption and a good area-delay trade-off. Future work will be to optimize the power consumed by the different multipliers, thereby reducing the number of gates used and the area they occupy.




Adder Scheme        Complexity (A)    Delay (T)          Product (A x T)
Ripple-Carry        O(n)              O(n)               O(n^2)
Carry-Select        O(n)              O(n^(1/(l+1)))     O(n^((l+2)/(l+1)))
Carry-Look ahead    O(n)              O(log n)           O(n log n)

Table 2.1 Categorization of adders with respect to delay time and capacity

2.2 Half and Full Adder

The basic building block of a multiplier is the adder circuit. An adder or summer is a digital circuit that performs addition of numbers. In many computers and other kinds of processors, adders are used not only in the arithmetic logic unit but also in other parts of the processor, where they calculate addresses, table indices, and similar operations.

2.2.1 Half Adder

A half adder adds two one-bit binary numbers “ain” and “bin”. It has two outputs, “sout” and “cout” (the value theoretically carried on to the next addition); the final sum is 2·cout + sout. The simplest half-adder design, shown in figure 2.1, uses an XOR gate for “sout” and an AND gate for “cout”. A half adder cannot be chained on its own to add multi-bit numbers, because it has no carry-in input. TABLE 2.2 shows the truth table of the half adder.

Figure 2.1 Logic Diagram of Half Adder





HALF ADDER INPUTS    HALF ADDER OUTPUTS
ain    bin           sout    cout
0      0             0       0
0      1             1       0
1      0             1       0
1      1             0       1

Table 2.2 Truth Table of Half Adder
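The half adder behaviour can be checked with a short Python model. This is an illustrative sketch only; the function name is an assumption, and `bin_` carries a trailing underscore because `bin` is a Python built-in.

```python
def half_adder(ain, bin_):
    """Return (sout, cout) for two one-bit inputs: XOR for the sum, AND for the carry."""
    return ain ^ bin_, ain & bin_


# Print the truth table; in every row 2*cout + sout equals ain + bin.
for ain in (0, 1):
    for bin_ in (0, 1):
        sout, cout = half_adder(ain, bin_)
        print(ain, bin_, sout, cout)
```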

2.2.2 Full Adder

A full adder adds binary numbers and accounts for values carried in as well as out. A one-bit full adder adds three one-bit numbers, often written as “ain”, “bin”, and “cin”, where “ain” and “bin” are the operands and “cin” is a bit carried in (in theory from a previous addition). The circuit produces a two-bit output sum, typically represented by the signals “cout” and “sout”. The truth table of the one-bit full adder is shown in TABLE 2.3.

FULL ADDER INPUTS          FULL ADDER OUTPUTS
ain    bin    cin          sout    cout
0      0      0            0       0
0      0      1            1       0
0      1      0            1       0
0      1      1            0       1
1      0      0            1       0
1      0      1            0       1
1      1      0            0       1
1      1      1            1       1

Table 2.3 Truth Table of Full Adder

A full adder can be implemented in many different ways, such as with a custom transistor-level circuit or composed of other gates. One implementation is

sout = ain ⊕ bin ⊕ cin

cout = ain · bin + bin · cin + ain · cin

In this implementation the carry is generated using three AND gates and two OR gates, as shown in figure 2.2. As the logic diagram shows, this adder uses a larger number of gates. The results from the synthesis tool are shown in table 2.4.

Figure 2.2 Logic Diagram of Full Adder1

A full adder can also be implemented using only two types of gates, which is convenient if the circuit is built from simple IC chips that contain only one gate type per chip. In this light, “cout” can be implemented as shown in figure 2.3.


A full adder can be constructed from two half adders by connecting “ain” and “bin” to the inputs of one half adder, connecting the sum from that to an input of the second half adder, connecting “cin” to the other input, and ORing the two carry outputs. Equivalently, “sout” is the three-input XOR of “ain”, “bin”, and “cin”, and “cout” is the three-input majority function of “ain”, “bin”, and “cin”, as shown in figure 2.4, where h1 and h2 are the half adders shown in figure 2.1.
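The equivalence of the two descriptions (XOR plus majority, and two half adders with an OR gate) can be confirmed exhaustively. The sketch below is a behavioural Python model with assumed function names; it does not reproduce the gate-level netlists of figures 2.2 to 2.4.

```python
def half_adder(a, b):
    """Return (sum, carry) of two one-bit inputs."""
    return a ^ b, a & b


def full_adder_gates(ain, bin_, cin):
    """Three-input XOR for the sum, three-input majority for the carry."""
    sout = ain ^ bin_ ^ cin
    cout = (ain & bin_) | (bin_ & cin) | (ain & cin)
    return sout, cout


def full_adder_from_half_adders(ain, bin_, cin):
    """Two half adders (h1, h2) with the two carry outputs ORed together."""
    s1, c1 = half_adder(ain, bin_)   # h1 adds the operands
    sout, c2 = half_adder(s1, cin)   # h2 adds the carry-in
    return sout, c1 | c2


# Both versions agree on all eight input combinations.
for bits in range(8):
    a, b, c = (bits >> 2) & 1, (bits >> 1) & 1, bits & 1
    assert full_adder_gates(a, b, c) == full_adder_from_half_adders(a, b, c)
```

The two carries c1 and c2 can never both be 1, which is why a plain OR gate is sufficient to combine them.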






The full adder can be viewed as a 3:2 compressor: it sums three one-bit inputs and returns the result as a single two-bit number. Thus, for example, a binary input of 101 results in an output of 1+0+1 = 10 (decimal 2). The carry-out represents bit one of the result, while the sum represents bit zero. Likewise, a half adder can be used as a 2:2 compressor. The results from the synthesis tool for the three full adders are shown in TABLE 2.4. From the table it is clear that, of all the full adders shown, full adder 1 is the best choice, so full adder 1 is used throughout the project both as an adder and as a 3:2 compressor.

Type of Full Adder    Area (µm2)    Delay (ps)    Power (µW)
Full Adder 1          30.1          0.45          6.6
Full Adder 2          41.1          0.45          12.64
Full Adder 3          51.5          0.53          14.72

Table 2.4 Comparisons of Different Implementations of Full Adders

2.3 Ripple Carry Adder

The ripple carry adder, a well-known adder architecture, is built by cascading full adder blocks in series, as shown in figure 2.5. The carry-out of one stage is fed directly to the carry-in of the next stage. An n-bit parallel adder requires n full adders.




Figure 2.5 4-Bit Ripple Carry Adder

Multiple full adder circuits can be cascaded to add N-bit numbers; an N-bit parallel adder needs N full adder circuits. A ripple carry adder is a logic circuit in which the carry-out of each full adder is the carry-in of the next most significant full adder.

It is called a ripple carry adder because each carry bit ripples into the next stage. In a ripple carry adder the sum and carry-out bits of any full adder stage are not valid until the carry-in of that stage has arrived, because of the propagation delays inside the logic circuitry. Propagation delay is the time elapsed between the application of an input and the occurrence of the corresponding output. The ripple carry adder is therefore not very efficient for large word lengths: its delay increases linearly with the bit length.

2.3.1 Delay

The delay from carry-in to carry-out is more important than the delay from A to carry-out or from carry-in to SUM, because the carry-propagation chain determines the latency of the whole ripple carry adder.
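The carry chain can be made explicit in a small behavioural model. The following Python sketch uses assumed function names and is not the project's Verilog; each loop iteration corresponds to one full adder stage waiting for the carry of the stage below it.

```python
def full_adder(a, b, cin):
    """One-bit full adder returning (sum, carry-out)."""
    return a ^ b ^ cin, (a & b) | (b & cin) | (a & cin)


def ripple_carry_add(a, b, n, cin=0):
    """Add two n-bit numbers with n cascaded full adders; returns (sum, carry-out)."""
    total, carry = 0, cin
    for i in range(n):
        # Stage i cannot produce valid outputs before the carry of stage i-1 is known.
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


print(ripple_carry_add(0b1011, 0b0110, 4))
```

Because the loop has n dependent iterations, the worst-case delay of the hardware it models grows linearly with n.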

2.4 Carry Select Adder

In the carry select adder scheme, blocks of bits are added in two ways: once assuming a carry-in of 0 and once assuming a carry-in of 1.

The duplicated adders and the multiplexers require a larger area, but the delay is lower than that of a ripple carry adder (about half the RCA delay). The carry select adder is therefore a good choice when working with a small number of bits.




Figure 2.6 Carry Select with 1 Level using n/2- bit RCA




Figure 2.6 shows the basic building block of a carry-select adder, which generally consists of two ripple carry adders and a multiplexer. Adding two n-bit numbers with a carry-select adder is done with two adders (therefore two ripple carry adders) in order to perform the calculation twice, once assuming a carry-in of zero and once assuming a carry-in of one. After the two results are calculated, the correct sum, as well as the correct carry, is selected with the multiplexer once the correct carry-in is known.

The block size should be chosen so that the delay from the addition inputs a and b to the carry-out equals that of the multiplexer chain leading into it, so that the carry-out is calculated just in time. Since one ripple carry adder assumes a carry-in of 0 and the other assumes a carry-in of 1, selecting which adder had the correct assumption via the actual carry-in yields the desired result.
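A one-level carry-select adder built from n/2-bit blocks can be modelled as below. This is a hedged Python sketch: `block_add` is an assumed behavioural stand-in for an n/2-bit ripple carry adder, and the `if` expression plays the role of the multiplexer.

```python
def block_add(a, b, width, cin):
    """Behavioural stand-in for a width-bit ripple carry adder: returns (sum, carry-out)."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a, b, n, cin=0):
    """One-level carry-select addition of two n-bit numbers (n assumed even)."""
    half = n // 2
    mask = (1 << half) - 1
    lo_sum, lo_carry = block_add(a & mask, b & mask, half, cin)
    # The upper half is computed twice, in parallel in hardware.
    hi_if_0 = block_add(a >> half, b >> half, half, 0)
    hi_if_1 = block_add(a >> half, b >> half, half, 1)
    # Multiplexer: the actual carry out of the lower half selects the valid result.
    hi_sum, cout = hi_if_1 if lo_carry else hi_if_0
    return (hi_sum << half) | lo_sum, cout


print(carry_select_add(0b101101, 0b011011, 6))
```

In hardware the two upper-half additions run at the same time as the lower-half addition, so the critical path is one half-width carry chain plus a multiplexer.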

2.5 Carry Look Ahead Adder

The carry look ahead adder produces carries faster because the carry bits are generated in parallel by additional circuitry whenever the inputs change. This technique uses carry bypass logic to speed up the carry propagation. Let ai and bi be the augend and addend inputs, ci the carry input, and si and ci+1 the sum and carry-out of the i-th bit position. The auxiliary functions pi and gi, called the propagate and generate signals, and the sum and carry outputs are defined as follows.

Figure 2.7 4 BIT CLA Logic equations

pi = ai + bi              (2.1)

gi = ai · bi              (2.2)

si = ai ⊕ bi ⊕ ci         (2.3)

ci+1 = gi + pi · ci       (2.4)





ACLA = O(n) = 14n              (2.5)

TCLA = O(log n) = 4 log2 n     (2.6)
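Equations (2.1) to (2.4) can be exercised with a short Python model. The sketch below, with assumed names, expands the recurrence (2.4) so that every carry is written directly in terms of the p and g signals and the input carry c0, which is what allows the hardware to form all carries in parallel.

```python
def cla_add(a, b, c0=0, n=4):
    """n-bit carry look ahead addition; returns (sum, carry-out)."""
    abits = [(a >> i) & 1 for i in range(n)]
    bbits = [(b >> i) & 1 for i in range(n)]
    p = [x | y for x, y in zip(abits, bbits)]   # propagate, eq. (2.1)
    g = [x & y for x, y in zip(abits, bbits)]   # generate, eq. (2.2)

    c = [c0]
    for i in range(n):
        # Expanded eq. (2.4):
        # c[i+1] = g[i] + p[i]g[i-1] + ... + p[i]...p[1]g[0] + p[i]...p[0]c0
        carry, prefix = 0, 1
        for j in range(i, -1, -1):
            carry |= prefix & g[j]
            prefix &= p[j]
        carry |= prefix & c0
        c.append(carry)

    total = 0
    for i in range(n):
        total |= (abits[i] ^ bbits[i] ^ c[i]) << i   # sum, eq. (2.3)
    return total, c[n]


print(cla_add(0b1001, 0b0111))
```

The inner loop only reads p, g and c0, never a previously computed carry, so no carry has to wait for another one.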

2.6 Binary To Excess - 1 Code Converter (BEC)

A Binary to Excess-1 Converter (BEC) can also be used as an adder in situations where a result is already available and only the carry from the previous stage is awaited. The BEC computes the result for a carry-in of one, and the carry arriving from the previous stage is then used as the select signal of a multiplexer that chooses between the original result and the incremented one.

Figure 2.8 5-Bit Binary to Excess-1 Code Converter without carry and with carry

The BEC takes n inputs and generates n outputs; the BECWC (BEC with Carry) takes n inputs and generates n+1 outputs, the extra bit being the carry output.

The output value is one more than the given input. The detailed structures of the 5-bit BEC without carry (BEC) and with carry (BECWC) are shown in figure 2.8. The function table of the BEC and BECWC is shown in TABLE 2.5.




Table 2.5 Functional Table of 5 Bit BEC and BECWC

Input      BEC without carry    BEC with carry
b[4:0]     x[4:0]               cy    x[4:0]
00000      00001                0     00001
00001      00010                0     00010
00010      00011                0     00011
00011      00100                0     00100
00100      00101                0     00101
...        ...                  ...   ...
11011      11100                0     11100
11100      11101                0     11101
11101      11110                0     11110
11110      11111                0     11111
11111      00000                1     00000
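The function table can be reproduced with a bit-level model. The Python sketch below assumes the commonly used BEC structure, in which bit 0 is inverted and every higher bit is XORed with the AND of all lower input bits; whether figure 2.8 uses exactly this gate arrangement is an assumption. The function name is also illustrative.

```python
def bec(b, n=5):
    """n-bit Binary to Excess-1 converter; returns (x, cy) where x = (b + 1) mod 2**n."""
    x = 0
    all_lower_ones = 1   # AND of b[0..i-1]; defined as 1 for i = 0 so that x[0] = NOT b[0]
    for i in range(n):
        bit = (b >> i) & 1
        x |= (bit ^ all_lower_ones) << i
        all_lower_ones &= bit
    # After the loop the AND covers every input bit: this is the carry of the BECWC.
    return x, all_lower_ones


for b in (0b00000, 0b00100, 0b11011, 0b11111):
    x, cy = bec(b)
    print(f"{b:05b} -> x={x:05b} cy={cy}")
```

No carry chain of full adders is needed, which is why the BEC is smaller and faster than an adder that computes the same incremented value.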




Chapter -3

The Multipliers

3.1 Introduction

High speed multiplication is a primary requirement of high performance digital systems. Column compression multipliers have recently become popular for high speed computation because of their higher speeds. The data flow in a column compression multiplier is shown in figure 3.1.

Figure 3.1 Data flow in a column compression multiplier

As figure 3.1 shows, the total delay of the multiplier can be split into three parts: the delay due to the Partial Product Generation (PPG), the Partial Product Summation Tree (PPST), and the Final Adder. Of these, the dominant components of the multiplier delay are the PPST and the final adder; the relative delay due to the PPG is small. A significant improvement in the speed of the multiplier can therefore be achieved by reducing the delay in the PPST and the final adder stage.

The first column compression multiplier was introduced by Wallace in 1964. He reduced the N rows of partial products by grouping them into sets of three rows and sets of two rows, using (3, 2) counters and (2, 2) counters respectively. In 1965, Dadda altered Wallace's approach by starting from the exact placement of the (3, 2) counters and (2, 2) counters in the maximum critical path delay of the multiplier. Since the 2000s, closer reconsideration of the Wallace and Dadda multipliers has shown that the Dadda multiplier is slightly faster than the Wallace multiplier and requires less hardware. In 2006, H. Eriksson and his research team presented the HPM reduction tree structure, which is easier to lay out than Dadda's approach. Compared to Dadda, HPM is slightly faster and consumes less power, while the area is the same.

3.2 Wallace Tree Multiplier

The Wallace tree multiplier is considerably faster than a simple array multiplier because its height is logarithmic in the word size, not linear. However, in addition to the large number of adders required, the Wallace tree's wiring is much less regular and more complicated. As a result, Wallace trees are often avoided by designers when design complexity is a concern.

3.2.1 WALLACE Column Compression Algorithm

1. The N rows of partial products are grouped together in sets of three. Any additional rows that are not a member of a group of three are transferred to the next level without modification.

2. Within each group of three rows, (3,2) compressors are applied to the columns containing three bits and (2,2) compressors are applied to the columns containing two bits.

3. Columns containing only a single bit are transferred to the next level unchanged.

The height dj of the partial product matrix after stage j is given by

d0 = N

dj+1 = 2 · ⌊dj / 3⌋ + (dj mod 3)     (3.1)
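Recurrence (3.1) is easy to evaluate. The short Python sketch below (function name assumed) lists the matrix height after each Wallace reduction stage until only two rows remain.

```python
def wallace_heights(n):
    """Matrix heights d0, d1, ... from eq. (3.1), stopping once two rows remain."""
    heights = [n]
    while heights[-1] > 2:
        d = heights[-1]
        heights.append(2 * (d // 3) + d % 3)
    return heights


print(wallace_heights(8))   # [8, 6, 4, 3, 2]
```

For N = 8 the height falls from 8 to 2 in four steps, so an 8 x 8 Wallace multiplier needs four reduction stages before the final adder.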




[Figure: dot diagram of the 64 partial-product bits (numbered 0 to 63) formed from inputs x7..x0 and y7..y0, reduced through four reduction stages of full adders and half adders and a final adder to the product bits p15..p0]

Figure 3.2 8 x 8 Wallace Tree Multiplier Logarithmic Depth Hierarchy




3.2.2 Applying Wallace Compression Algorithm to 8 x 8 multiplier

Consider an N-bit multiplier X and an N-bit multiplicand Y, represented as

Multiplicand, Y = yn-1 yn-2 yn-3 ... y3 y2 y1 y0     (3.2)

Multiplier, X = xn-1 xn-2 xn-3 ... x3 x2 x1 x0     (3.3)

The flow diagram in figure 3.2, which was prepared in a Microsoft Excel sheet, shows the intermediate stage reductions of the multiplier being done by full adders and half adders, while the final addition is done by an RCA. The architecture of the 8 x 8 Wallace multiplier, along with the theoretical delay values, is shown in figure 3.3, where FA is a full adder and HA is a half adder.

Figure 3.3 Architecture of 8 x 8 Wallace Tree Multiplier with RCA as final adder

3.3 DADDA Multiplier

The Dadda multiplier is faster than the Wallace tree multiplier, whose wiring is much less regular and more complicated. The Dadda multiplier overcomes this disadvantage by placing the 3:2 compressors and 2:2 compressors in the critical path. The disadvantage of the Dadda multiplier is that its layout is not regular.

Unlike Wallace multipliers, which reduce as much as possible on each layer, Dadda multipliers do as few reductions as possible. Because of this, Dadda multipliers have a less expensive reduction phase, but the numbers may be a few bits longer, thus requiring slightly bigger adders.

Take any three wires with the same weight and input them into a full adder. The result is an output wire of the same weight and an output wire of the next higher weight for each three input wires. If there are two wires of the same weight left, and the current number of output wires with that weight is equal to 2 (modulo 3), input them into a half adder. Otherwise, pass them through to the next layer. If there is just one wire left, connect it to the next layer.

3.3.1 Dadda Column Compression Algorithm

1. Let d1 = 2 and dj+1 = ⌊1.5 · dj⌋, where dj is the height of the matrix for the j-th stage. Repeat until the largest j-th stage is reached for which the original N-height matrix contains at least one column with more than dj partial products.

2. In the j-th stage from the end, place (3,2) and (2,2) compressors as required to achieve a reduced matrix. Only columns with more than dj partial products, counting the carries they receive from less significant (3,2) and (2,2) compressors, are reduced.

3. Let j = j-1 and repeat step 2 until a matrix with a height of two is generated. This should occur when j = 1.
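Step 1 of the algorithm can be sketched in a few lines of Python. The function name is an assumption; it returns the target heights dj in the order in which the reduction stages use them, from the first stage down to the final height of two.

```python
def dadda_targets(n):
    """Target heights for an n x n Dadda reduction, largest first."""
    d = [2]
    while d[-1] * 3 // 2 < n:       # d[j+1] = floor(1.5 * d[j]), kept while below n
        d.append(d[-1] * 3 // 2)
    return [h for h in reversed(d) if h < n]


print(dadda_targets(8))    # [6, 4, 3, 2]
print(dadda_targets(16))   # [13, 9, 6, 4, 3, 2]
```

For N = 8 the sequence gives four stages, the same number as the Wallace scheme, but each stage only reduces a column as far as its target height requires.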




[Figure: dot diagram of the 64 partial-product bits (numbered 0 to 63) reduced stage by stage, with the stages labelled from N=8 down to N=2, using full adder and half adder sum (s) and carry (c) outputs]




Figure 3.6 8 x 8 HPM Multiplier Logarithmic Depth Hierarchy

3.4.2 Applying HPM Compression Algorithm to 8 x 8 multiplier

The flow diagram in figure 3.6 shows the intermediate stage reductions of the multiplier being done by full adders and half adders, while the final addition is done by an RCA. The architecture is shown in figure 3.7, where FA is a full adder and HA is a half adder.

Figure 3.7 Architecture of 8 x 8 HPM Multiplier with RCA as final adder

3.5 Analysis of Multipliers

The theoretical comparison of the number of adders in the Wallace, Dadda and HPM multipliers is shown in table 3.1.

Compression    Number of Full       Number of Half    Carry Propagation
Technique      Adders               Adders            Adder (CPA) Length
Wallace        N^2 - 4N + 2 + S     > N               2N - 1 - S
Dadda          N^2 - 4N + 3         N - 1             2N - 2
HPM            N^2 - 4N + 3         N - 1             2N - 2

Table 3.1 Theoretical Comparison of Different Multipliers

Where N = multiplier and multiplicand bit length and

S = number of reduction stages
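The Dadda row of table 3.1 can be checked by counting adders in a simple model of the column compression algorithm of section 3.3.1. The Python sketch below tracks only column heights, not bit values. It assumes the usual Dadda rule that a column exceeding the stage target by exactly one bit gets a half adder and otherwise gets full adders; the function names are illustrative.

```python
def dadda_targets(n):
    """Stage target heights, largest first: d1 = 2, d[j+1] = floor(1.5 * d[j])."""
    d = [2]
    while d[-1] * 3 // 2 < n:
        d.append(d[-1] * 3 // 2)
    return [h for h in reversed(d) if h < n]


def dadda_adder_counts(n):
    """Return (full adders, half adders, CPA length) for an n x n Dadda reduction."""
    # Column heights of the n x n partial product matrix, plus one spare column.
    heights = [min(k + 1, 2 * n - 1 - k) for k in range(2 * n - 1)] + [0]
    full = half = 0
    for target in dadda_targets(n):
        carries, reduced = 0, []
        for h in heights:
            h += carries            # carries produced by the previous column in this stage
            carries = 0
            while h > target:
                if h == target + 1:
                    half += 1       # (2,2) counter: column shrinks by one bit
                    h -= 1
                else:
                    full += 1       # (3,2) counter: column shrinks by two bits
                    h -= 2
                carries += 1
            reduced.append(h)
        heights = reduced
    cpa_length = sum(1 for h in heights if h == 2)
    return full, half, cpa_length


print(dadda_adder_counts(8))   # (35, 7, 14)
```

For N = 8 the model gives 35 full adders, 7 half adders and a 14-bit CPA, matching N^2 - 4N + 3, N - 1 and 2N - 2.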

TABLE 3.2 shows the synthesis results of the three multipliers for area, delay and power. 8 x 8 bit, 16 x 16 bit and 32 x 32 bit multipliers were simulated and synthesized using each column reduction technique. The 2:2 compressor used in the design is the half adder shown in figure 2.1, and the 3:2 compressor used is full adder 1 shown in figure 2.2. All the results shown here are for the combinational circuits, i.e., inputs and outputs are not registered.

Bit width of    PPST       Area (µm2)    Delay (ns)    Power (mW)
Multiplier
8 x 8           Wallace    2770.4        7.62          1.2
8 x 8           Dadda      2240.6        6.63          0.877
8 x 8           HPM        2052.2        5.81          0.817
16 x 16         Wallace    9273.5        15.46         5.07
16 x 16         Dadda      9116.6        13.94         4.86
16 x 16         HPM        8934.3        12.77         4.74
32 x 32         Dadda      47059.1       25.43         31.09
32 x 32         HPM        42724.7       25.33         20.40

Table 3.2 Comparison of Implementation of Different Multipliers w.r.t area, delay and power

From table 3.2 it is clear that the HPM multiplier is the best of the three with respect to area, delay and power. HPM is therefore used as the column reduction technique in the proposed multiplier.




Chapter - 4

The Proposed Design for High Speed and Low Power

4.1 Recursive Multiplication for High Speed

The architecture proposed in this project work is centered on a recursive

multiplication algorithm by Danysh and Swartzlander [15]. The authors present a

multiplication algorithm based on divide and conquer methodologies that introduces

greater regularity in design than standard column compression multipliers, while avoiding the

linear latency of array multipliers.

Recent studies have examined the consequences of technology scaling on arithmetic

circuitry. These investigations strongly support the need for the consideration of

interconnect layout as an integral part of future arithmetic circuitry. The predominant

advantage offered by the recursive multiplication scheme is the use of smaller multipliers to

implement a larger operation, which is in direct compliance with the presented results. This

structure promotes the notion of exploiting locally optimized arrays for reduced

interconnect power through shorter local interconnects, and a more regular integration of the

sub-components on a larger scale.

The recursive multiplier scheme works by executing an n-bit multiplication using four n/2-bit multipliers in parallel and adding up the results. The n/2-bit multipliers may be further reduced, where each sub-multiplier carries out four parallel n/4-bit multiplications, and so forth. In this manner a large multiplication is carried out using recursions of simpler base multiplier modules.

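Mathematically, the recursive algorithm may be derived by first considering two unsigned n-bit operands, the multiplier X and the multiplicand A:

X = Σ(i = 0 .. n-1) xi · 2^i,     A = Σ(i = 0 .. n-1) ai · 2^i

Splitting each sum at bit n/2, so that XL and AL collect the terms with i < n/2 and XH and AH collect the terms with i ≥ n/2 (including their weights 2^i), X and A may now be defined as:

X = XL + XH     (4.1)

A = AL + AH     (4.2)

The overall multiplication of A and X is given by

P = A · X     (4.3)

  = (AL + AH) · (XL + XH)

  = (AL · XL) + (AL · XH) + (AH · XL) + (AH · XH)

  = P1 + P2 + P3 + P4     (4.4)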

Figure 4.1 Pictorial Representation of Recursive Multiplication

Therefore, the overall multiplication may be reduced to four smaller multiplications, and this process may be repeated using even smaller base multipliers. In order to minimize the delay introduced by subdividing the process, the results of the base multipliers, or the intermediary products, are kept in carry save form. Hence only one final fast adder is required to yield the final product. All four N/2-bit multiplications derived above are shown diagrammatically in figure 4.1 (for N = 8). The results from multipliers M1, M2, M3 and M4 are P1, P2, P3 and P4 respectively.
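The decomposition of equation (4.4) can be sketched as a recursive Python function. This is a behavioural model under stated assumptions: the halves are held here as unweighted n/2-bit values, so the weights that the text folds into AH and XH appear as explicit shifts; the `base` parameter and the use of Python's `*` as the base multiplier are illustrative stand-ins for the HPM base multiplier; and n is assumed to be a power of two.

```python
def recursive_multiply(a, x, n, base=4):
    """Multiply two unsigned n-bit numbers using four n/2-bit multiplications."""
    if n <= base:
        return a * x                      # stand-in for a base multiplier module
    half = n // 2
    mask = (1 << half) - 1
    al, ah = a & mask, a >> half          # low and high halves of the multiplicand
    xl, xh = x & mask, x >> half          # low and high halves of the multiplier
    p1 = recursive_multiply(al, xl, half, base)   # M1: AL . XL
    p2 = recursive_multiply(al, xh, half, base)   # M2: AL . XH
    p3 = recursive_multiply(ah, xl, half, base)   # M3: AH . XL
    p4 = recursive_multiply(ah, xh, half, base)   # M4: AH . XH
    # P2 and P3 carry weight 2**(n/2), P4 carries weight 2**n.
    return p1 + ((p2 + p3) << half) + (p4 << n)


print(recursive_multiply(0xB7, 0x5C, 8))
```

In hardware the four sub-products are computed in parallel, so the recursion adds depth only through the merging adders.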

The architecture of the recursive multiplier for an N x N bit multiplication with an RCA as the merging adder is shown in figure 4.2. Each N/2-bit multiplier in figure 4.2 uses the HPM algorithm for the PPST.




Figure 4.2 N – Bit Recursive Multiplier with RCA as Merging Adder

Figure 4.2 shows that all the inputs and outputs are registered, so the latency of the multiplier is two: it takes two clock cycles to get the first output, and after that an output is obtained every clock cycle.




As mentioned earlier, the partial products that are dependent (marked in black in figure. 1(b) and 1(c)) are given to multipliers M2 and M3 respectively. The products obtained from M2 and M3 are given to an N-bit RCA, and the result is given to an N+1-bit RCA along with the MSB N/2 bits of the product from M1 and the LSB N/2 bits of the product from M4. The MSB N/2 bits of the M4 product are given to an N/2-1-bit RCA with “1” as carry input, which calculates its result before the actual carry arrives, and a multiplexer selects the product based on the actual carry generated by the N+1-bit RCA. This dependency and flow can be clearly observed in figure 4.2.

To improve the speed further, the N/2-1-bit RCA was replaced with a BEC adder. The logical structure of a 7-bit BEC adder is shown in figure 4.3. Speed is thus achieved by using recursive multiplication together with a hybrid adder (a combination of RCA and BEC) for merging the products of the four N/2-bit multipliers. To reduce the power, twin precision multiplication was adopted, as described in the next section.
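The merging step can be sketched as follows. This Python model is one possible reading of the data flow described above, not a transcription of figure 4.2: it assumes that the N+1-bit adder receives the lower N/2+1 bits of P4 together with the upper N/2 bits of P1, so that the remaining N/2-1 upper bits of P4 are the ones incremented by the BEC (seven bits for N = 16). All function names are illustrative.

```python
def merge_products(p1, p2, p3, p4, n):
    """Merge the four n/2-bit sub-products into the 2n-bit product."""
    half = n // 2
    low = p1 & ((1 << half) - 1)                  # lowest n/2 product bits come straight from P1
    mid_sum = p2 + p3                             # n-bit RCA, (n+1)-bit result
    concat = ((p4 & ((1 << (half + 1)) - 1)) << half) | (p1 >> half)
    s = mid_sum + concat                          # (n+1)-bit RCA
    mid = s & ((1 << (n + 1)) - 1)
    carry = s >> (n + 1)                          # 0 or 1
    top = p4 >> (half + 1)                        # remaining n/2-1 upper bits of P4
    top_plus_1 = (top + 1) & ((1 << (half - 1)) - 1)   # BEC: precomputed for carry = 1
    top_sel = top_plus_1 if carry else top        # multiplexer selected by the actual carry
    return (top_sel << (n + half + 1)) | (mid << half) | low


def multiply_via_merge(a, x, n):
    """Split the operands, form P1..P4 with a stand-in base multiplier, then merge."""
    half = n // 2
    mask = (1 << half) - 1
    al, ah, xl, xh = a & mask, a >> half, x & mask, x >> half
    return merge_products(al * xl, al * xh, ah * xl, ah * xh, n)


print(multiply_via_merge(0xFFFF, 0xFFFF, 16) == 0xFFFF * 0xFFFF)
```

Under this reading the incremented upper bits are available before the carry arrives, so only the multiplexer delay is added after the N+1-bit adder.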

Figure 4.3 7 – Bit BEC Adder without Carry

4.2 Twin Precision Multiplication For Low Power

Multiplication is a complex arithmetic operation, which is reflected in its relatively high signal propagation delay, high power dissipation, and large area requirement. When choosing a multiplier for a digital system, the bit-width of the multiplier is required to be at least as wide as the largest operand of the applications that are to be executed on that digital system. The bit-width of the multiplier is therefore often much larger than the data represented inside the operands, which leads to unnecessarily high power dissipation and unnecessarily long delay.

This resource waste could partially be remedied by having several multipliers, each with a specific bit-width, and using the multiplier with the smallest bit-width that is large enough to accommodate the current multiplication. Such a scheme would ensure that a multiplication is computed on a multiplier that has been optimized in terms of power and delay for that specific bit-width.

Figure 4.4 (a) Partial Product Array for N = 8 (b) Partial Products showing the dependency.

In figure 4.4(b) the partial products are partitioned into four partial product arrays of N = 4. Of these, the arrays marked in black are dependent, because they write to the same output bits, so they cannot be operated on simultaneously. To increase the throughput, the independent partial products, shaded in grey, are used along with operand guarding. The architecture for N = 8 with the HPM algorithm as the reduction technique, operand guarding, and the independent partial products performing two N/2-bit multiplications simultaneously is shown in figure 4.5.




Figure 4.5 Block Diagram of 8 x 8 Twin Precision Multiplier

4.3 Clock Gating and Recursive Multiplication

Clock gating is a popular technique used in many synchronous circuits for

reducing dynamic power dissipation. Recursive Multiplication is used to reduce the power.

4.3.1 Clock Gating

Clock gating saves power by adding more logic to a circuit to prune the clock tree. Pruning the clock disables portions of the circuitry so that the flip-flops in them do not have to switch states. Switching states consumes power. When the flip-flops are not being switched, their switching power consumption goes to zero, and only leakage currents are incurred.

Clock gating works by taking the enable conditions attached to registers and using them to gate the clocks. A design must therefore contain these enable conditions in order to use and benefit from clock gating. Clock gating can also save significant die area as well as power, since it removes large numbers of muxes and replaces them with clock gating logic. This clock gating logic is generally in the form of "Integrated clock gating" (ICG) cells. Note, however, that the clock gating logic changes the clock tree structure, since it sits in the clock tree.







Clock gating logic can be added into a design in a variety of ways:

1. Coded into the RTL code as enable conditions that can be automatically translated

into clock gating logic by synthesis tools (fine grain clock gating).

2. Inserted into the design manually by the RTL designers (typically as module level

clock gating) by instantiating library specific ICG (Integrated Clock Gating) cells

to gate the clocks of specific modules or registers.

3. Semi-automatically inserted into the RTL by automated clock gating tools. These

tools either insert ICG cells into the RTL, or add enable conditions into the RTL

code. These typically also offer sequential clock gating optimizations.

Sequential clock gating is the process of extracting/propagating the enable conditions to the upstream/downstream sequential elements, so that additional registers can be clock gated. Although asynchronous circuits by definition do not have a "clock", the term perfect clock gating is used to illustrate how various clock gating techniques are simply approximations of the data-dependent behaviour exhibited by asynchronous circuitry. As the granularity at which the clock of a synchronous circuit is gated approaches zero, the power consumption of that circuit approaches that of an asynchronous circuit: the circuit only generates logic transitions when it is actively computing.

4.3.2 Recursive Multiplication with Clock Gating

In this project clock gating is used to achieve operator isolation in the recursive multiplier for power reduction. As mentioned earlier, the recursive multiplier has four N/2-bit multipliers, of which M2 and M3 are dependent and M1 and M4 are independent. To achieve low power and double throughput, clock gating is used to isolate the M2 and M3 multipliers so that no inputs are transferred to them. To perform twin precision multiplication an extra control input is needed. Here a two-bit input “Twin” is used as the control input. “Twin” is passed through a 2:3 decoder, which generates T[1], T[2] and T[3] as control signals, and these signals are used for the operator isolation. TABLE 4.1 shows the truth table of the 2:3 decoder. As shown in TABLE 4.1, there are four operating modes:





Mode 0 – Both M1 and M4 in operation for Twin Precision

Mode 1 – Only M1 in operation

Mode 2 – Only M4 in operation

Mode 3 – Full Mode operation

Operation Mode                                          T[1]  T[2]  T[3]
00 – Both M1 and M4 in operation for Twin Precision      1     0     1
01 – Only M1 in operation                                1     0     0
10 – Only M4 in operation                                0     0     1
11 – Full Mode operation                                 1     1     1

TABLE 4.1 Decoder Truth Table
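The decoder of TABLE 4.1 can be modelled in a few lines. The following is a minimal Python sketch, not the Verilog used in the project; the names `DECODER`, `decode` and `decode_logic` are hypothetical, and the Boolean equations are an assumption derived here from the truth table rather than taken from the original design.

```python
# Truth table of the 2:3 decoder (TABLE 4.1).
# Key: 2-bit "Twin" value; value: (T[1], T[2], T[3]).
DECODER = {
    0b00: (1, 0, 1),  # twin precision: M1 and M4 in operation
    0b01: (1, 0, 0),  # only M1 in operation
    0b10: (0, 0, 1),  # only M4 in operation
    0b11: (1, 1, 1),  # full mode: all four multipliers
}


def decode(twin):
    """Look up (T1, T2, T3) for a 2-bit Twin value."""
    return DECODER[twin & 0b11]


def decode_logic(twin):
    """Same decoder written as Boolean equations (derived from the table)."""
    hi, lo = (twin >> 1) & 1, twin & 1
    t1 = (1 - hi) | lo   # T[1] = (NOT hi) OR lo
    t2 = hi & lo         # T[2] = hi AND lo
    t3 = hi | (1 - lo)   # T[3] = hi OR (NOT lo)
    return (t1, t2, t3)
```

Both functions agree for all four modes, which is a quick way to check that the equations match the table.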

Each output signal of the decoder is then fed to a two-input AND gate whose other input is the clock, so that T[1], T[2] and T[3] generate three gated clocks, clock1, clock2 and clock3, respectively. Clock2 drives the registers of M2 and M3, as shown in figure 4.6. Only in mode 3 are the multipliers M2 and M3 working; in all remaining modes they are off, which saves switching power.
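As a rough illustration of the saving, the sketch below counts how many clock edges reach each gated clock domain for a given sequence of modes. It is a behavioural Python model under the assumption of one mode per clock cycle; the name `gated_clock_edges` is hypothetical, and the AND gating is modelled only at cycle level (glitches on the gated clock are not modelled).

```python
# (T[1], T[2], T[3]) for each 2-bit Twin value, as in TABLE 4.1.
DECODER = {0b00: (1, 0, 1), 0b01: (1, 0, 0), 0b10: (0, 0, 1), 0b11: (1, 1, 1)}


def gated_clock_edges(mode_sequence):
    """Count the rising edges seen on clock1, clock2 and clock3.

    mode_sequence: iterable of 2-bit Twin values, one per clock cycle
    (an assumption of this sketch).
    """
    edges = {"clock1": 0, "clock2": 0, "clock3": 0}
    for twin in mode_sequence:
        t1, t2, t3 = DECODER[twin & 0b11]
        clk = 1  # one rising edge of the free-running clock per cycle
        edges["clock1"] += t1 & clk  # clock1 = T[1] AND clock
        edges["clock2"] += t2 & clk  # clock2 = T[2] AND clock (M2, M3)
        edges["clock3"] += t3 & clk  # clock3 = T[3] AND clock
    return edges


# One cycle in each mode: M2 and M3 are clocked in only one of four cycles.
example = gated_clock_edges([0b00, 0b01, 0b10, 0b11])
```

In this model the registers of M2 and M3 receive a clock edge only in full mode, which is where the reduction in switching activity comes from.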

The advantage of this design over the regular twin-precision multiplier is that we isolate the operator instead of guarding the operands. In this design a single multiplier can be used on its own for one N/2-bit multiplication. In the regular twin-precision multiplier, the N/2 most significant bits of both the multiplier and the multiplicand must be set to zero to perform the same operation, which restricts the inputs and is not always feasible. The control circuit used here removes this restriction.




The architecture shown in figure 4.6 has increased speed and also offers the flexibility of N/2-bit multiplication with lower power consumption and double throughput. This can be clearly observed in the result analysis.
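The mode-dependent behaviour can be sketched as a behavioural model. The Python below is an illustration only: it assumes that M1 multiplies the low halves, M4 the high halves, and M2 and M3 the cross terms, and that T[1], T[2] and T[3] enable M1, M2/M3 and M4 respectively. The enable mapping is consistent with TABLE 4.1, but the operand assignment is an assumption of this sketch, and the name `recursive_multiply` is hypothetical.

```python
# (T[1], T[2], T[3]) for each 2-bit Twin value, as in TABLE 4.1.
DECODER = {0b00: (1, 0, 1), 0b01: (1, 0, 0), 0b10: (0, 0, 1), 0b11: (1, 1, 1)}


def recursive_multiply(a, b, n, twin):
    """Behavioural model of an N-bit recursive multiplier with operator isolation.

    Returns (m1, m4, full): the two independent N/2-bit products and the
    full N-bit product. Isolated (inactive) results are returned as None.
    """
    half = n // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, (a >> half) & mask
    b_lo, b_hi = b & mask, (b >> half) & mask

    t1, t2, t3 = DECODER[twin & 0b11]
    m1 = a_lo * b_lo if t1 else None  # assumed: low x low
    m2 = a_hi * b_lo if t2 else None  # cross term, active in full mode only
    m3 = a_lo * b_hi if t2 else None  # cross term, active in full mode only
    m4 = a_hi * b_hi if t3 else None  # assumed: high x high

    full = None
    if (twin & 0b11) == 0b11:
        # a*b = m4 * 2^n + (m2 + m3) * 2^(n/2) + m1
        full = m1 + ((m2 + m3) << half) + (m4 << n)
    return m1, m4, full
```

In twin mode (Twin = 00) the model returns two independent N/2-bit products from one call, which corresponds to the double throughput described above.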




Chapter - 5

ASIC Implementation of Proposed Design

5.1 Introduction

In this project we implemented different types of multipliers. Of these, the multiplier with the HPM column reduction technique, the recursive multiplier and the recursive multiplier with clock gating are the most important, and their implementation is described here.

For the VLSI (hardware) implementation we followed the ASIC design flow, from RTL description to GDSII. The architecture is described in Verilog HDL, functional simulation is done in the VCS simulator, and synthesis is carried out in Design Compiler.

5.2 ASIC Design Methodology

Application Specific Integrated Circuit (ASIC) design, as the name suggests, focuses on developing a hardware module that is completely dedicated to a particular application or process. This type of design makes economical use of silicon and also offers good speed compared with other implementations such as FPGA and CPLD devices. The development of an ASIC generally follows a flow called the ASIC design flow, shown in figure 5.1; each step is discussed in the following sections.

Figure 5.1 ASIC Design




5.2.1 System Partitioning

This is the first step of the ASIC design flow; here the complex problem statement

is decomposed into smaller subsystems. The decomposition is carried out hierarchically

until each subsystem is of manageable size.

5.2.2 Design Entry

In any design, specifications are written first, abstractly describing the functionality, interface and overall architecture of the circuit. A behavioural description is then created to analyze the design in terms of functionality, compliance with standards, and other high-level issues. Behavioural descriptions (simulation models) are typically created in HDLs. Here we used Verilog HDL for design entry.

5.2.3 Simulation

Simulation is carried out at this stage on the written code; this type of simulation is called behavioural or functional simulation. Here, the simulation is carried out with the help of a “testbench”, a piece of code that provides the required stimulus (inputs and control signals) to the design. The functionality is confirmed by observing the outputs in a waveform. In this design we used the VCS simulator for functional simulation.
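The idea of a testbench (apply stimulus, then compare the outputs against a reference) can be sketched outside the simulator. The following Python is only an analogy to the Verilog testbench run in VCS: `multiplier_model` is a hypothetical stand-in for the design under test, and `run_testbench` is an illustrative name.

```python
import random


def multiplier_model(a, b, width=16):
    """Hypothetical stand-in for the design under test (unsigned multiply)."""
    mask = (1 << width) - 1
    return (a & mask) * (b & mask)


def run_testbench(dut, width=16, n_random=1000, seed=0):
    """Drive corner-case and random operands into `dut` and check the products.

    Returns a list of (a, b, got, expected) tuples for every mismatch.
    """
    rng = random.Random(seed)
    top = (1 << width) - 1
    # Directed corner cases first, then random stimulus.
    vectors = [(0, 0), (0, top), (top, 0), (top, top), (1, top)]
    vectors += [(rng.randint(0, top), rng.randint(0, top)) for _ in range(n_random)]

    failures = []
    for a, b in vectors:
        expected = a * b  # reference model
        got = dut(a, b, width)
        if got != expected:
            failures.append((a, b, got, expected))
    return failures


failures = run_testbench(multiplier_model)
```

An empty `failures` list plays the role of a clean waveform check; a self-checking testbench of this kind scales better than inspecting waveforms by eye.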

5.2.4 Synthesis

The next stage is synthesis, which means converting the written code into gates and their interconnections. The conversion is done by mapping to a particular technology, for example 0.35 µm, 0.18 µm or 90 nm technology. The technology here refers to the gate length of the transistors used in the design. The outputs of the synthesis stage are the “gate level netlist (.vg)” and “design constraints (.sdc)” files. The gate-level netlist contains information about the gates and interconnections, and the design constraints file contains information such as the clock frequency and the wire-load models used. This is the final stage of the logical or front-end flow. The netlist and .sdc files are taken as input to the physical or back-end flow.

RTL synthesis is an automated design task in which high-level design descriptions written in Hardware Description Languages (such as VHDL, Verilog, or SystemVerilog) are transformed into gate-level netlists. A gate-level netlist is basically a circuit implementation of the design made of library components (both combinational and sequential cells) available in the technology library, together with their interconnections. The netlist is generated by the synthesis tool according to the constraints set by the designer.

Design Compiler is an RTL synthesis tool from Synopsys. It supports UNIX platforms and is installed on the Institute's Linux computer systems. Design Compiler is not supported on the Windows platform.


5.3 Synthesis Overview

Synthesis with Design Compiler includes the following main tasks: reading in the design, setting constraints, optimizing the design, analyzing the results and saving the design database. These tasks are described below.

5.3.1 Reading in the Design

The first task in synthesis is to read the design into Design Compiler memory. Reading in an HDL design description consists of two tasks: analyzing and elaborating the description. The analysis command (analyze) performs the following tasks:

1. Reads the HDL source and checks it for syntactical errors.

2. Creates HDL library objects in an HDL-independent intermediate format and saves these intermediate files in a specified location.

5.3.2 Constraining the design

The next task is to set the design constraints. Constraints are the instructions that the

designer gives to Design Compiler. They define what the synthesis tool can or cannot do

with the design or how the tool behaves. Usually this information can be derived from the

various design specifications (e.g. from timing specification).

There are basically two types of design constraints:




5.3.2.1 Design Rule Constraints

Design rule constraints are implicit constraints, which means that they are defined by the ASIC vendor in the technology library. By specifying the technology library that Design Compiler should use, the designer also specifies all the design rules in that library; these rules cannot be discarded or overridden.

5.3.2.2 Optimization Constraints

Optimization constraints are explicit constraints (set by the designer). They describe the design goals (area, timing, and so on) that the designer has set for the design, and they work as instructions to Design Compiler on how to perform synthesis.

Design rule constraints comprise:

5.3.2.3 Maximum transition time

Longest time allowed for a driving pin of a net to change its logic value

5.3.2.4 Maximum fanout

Maximum fanout for a driving pin

5.3.2.5 Maximum (and minimum) capacitance

The maximum (and minimum) total capacitive load that an output pin can drive. The total capacitance comprises the load pin capacitances and the interconnect capacitance.

5.3.2.6 Cell degradation

Some technology libraries contain cell degradation tables. The cell degradation

tables list the maximum capacitance that can be driven by a cell as a function of the

transition times at the inputs of the cell.

Optimization constraints comprise:

5.3.2.7 System clock definition and clock delays

Clock constraints are the most important constraints in an ASIC design. The clock signal is the synchronization signal that controls the operation of the system. It also defines the timing requirements for all paths in the design, and most of the other timing constraints are related to it.

5.3.2.8 Multicycle paths

A multicycle path is an exception to the default single cycle timing requirement of

paths. That is, on a multicycle path the signal requires more than a single clock cycle to

propagate from the path startpoint to the path endpoint.
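To see what a multicycle exception allows, the relation between path delay and clock period can be worked through numerically. The sketch below is a simplified Python model that ignores clock skew and uncertainty; the function name `min_clock_period` and the example numbers are assumptions chosen for illustration, not values from this design.

```python
def min_clock_period(path_delay_ns, setup_ns=0.0, cycles=1):
    """Smallest clock period (ns) for which a path meets setup timing.

    cycles=1 is the default single-cycle requirement; cycles=N models a
    multicycle path that is given N clock cycles to propagate, so that
    N * period >= path_delay + setup.
    """
    if cycles < 1:
        raise ValueError("cycles must be at least 1")
    return (path_delay_ns + setup_ns) / cycles


# Hypothetical path: 9 ns of logic delay, 1 ns register setup time.
single_cycle = min_clock_period(9.0, setup_ns=1.0, cycles=1)
two_cycle = min_clock_period(9.0, setup_ns=1.0, cycles=2)
```

Declaring the path as a two-cycle path halves the clock period it can tolerate, provided the design really does wait two cycles before sampling the result.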




5.3.2.9 Input and output delays

Input and output delays constrain external path delays at the boundaries of a design. Input delay is used to model the path delay from external inputs to the first registers in the design. Output delay constrains the path from the last register to the outputs of the design.

5.3.2.10 Minimum and maximum path delays

Minimum and maximum path delays allow paths to be constrained individually by setting specific timing constraints on those paths.

5.3.4 Optimizing the Design

This section presents the behaviour of the Design Compiler optimization step. The optimization step translates the HDL description into a gate-level netlist using the cells available in the technology library. Optimization is done in several phases, and in each phase different optimization techniques are applied according to the design constraints.

5.3.4.1 Gate-level Optimizations

Gate-level optimizations work on the technology-independent netlist and map it to the library cells to produce a technology-specific gate-level netlist. Gate-level optimizations include the following processes:

5.3.4.2 Area Optimization

Area optimization is the last step that Design Compiler performs on the design.

During this phase, only those optimizations that don't break design rules or timing

constraints are allowed.

5.3.5 Reporting and Analyzing the Design

Once synthesis has been completed, the results need to be analyzed. Design Compiler, together with its graphical user interface (Design Vision), provides various means to debug the synthesized design. These include textual reports that can be generated for different design objects and graphical views that help in inspecting and visualizing the design.




There are basically two types of analysis methods and tools:

5.3.5.1 Generating reports for design object properties

Reporting commands generate textual reports for various design objects:

timing and area, cells, clocks, ports, buses, pins, nets, hierarchy, resources,

constraints in the design, and so on.

5.3.5.2 Visualizing design objects (Design Vision)

Some design objects and their properties can be analyzed graphically. The designer may, for example, examine the design schematic and explore the design structure, visualize critical and other timing paths in the design, and generate histograms for various metrics.

5.3.6 Save Design

The final task in synthesis with Design Compiler is to save the synthesized design. The design can be saved in many formats; at a minimum the gate-level netlist (usually in Verilog) and/or the design database should be saved. Remember that, by default, Design Compiler does not save anything when exiting.




Chapter -6

Results

6.1 ASIC Results

In this chapter we present the simulation and synthesis results of the various multipliers along with the proposed design.

6.1.1 Simulation Results

Figure 6.1 Simulation Result of 16 x 16 HPM Multiplier

Analysis

Signal          In/Out   Description
clk             input    Input to the multiplier
Rst             input    Input to the multiplier
datain1[15:0]   input    Input to the multiplier
dataout[31:0]   output   Output of the multiplier

Table 6.1 Analysis of 16 x 16 HPM Multiplier




Figure 6.2 Simulation Result of 32 x 32 HPM Multiplier

Analysis



rst input Input to the multiplier




Table 6.2 Analysis of 32 x 32 HPM Multiplier




Figure 6.4 Simulation Result of 32 x 32 Recursive Multiplier with clock gating

Analysis



Rst input Input to the multiplier




Table 6.4 Analysis of 32 x 32 Recursive Multiplier with clock gating




6.2.2 Power report of 16 x 16 Basic HPM

The above report displays the total power of the design. Dynamic power is the power dissipated when the circuit is active, i.e. performing some function. It is further divided into two components: switching power and internal power.

Switching power is dissipated when charging and discharging the load capacitance at the cell output. The amount of switching power depends on the switching activity of the cell, which is related to the operating frequency. The more logic transitions there are on the cell output, the higher the switching power.

Internal power is consumed within a cell for charging and discharging internal cell capacitances. Internal power also includes short-circuit power: during logic transitions the P-type and N-type transistors are both on simultaneously for a short time, causing a direct connection from the Vdd rail to the ground rail.
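The dependence of switching power on activity can be made concrete with the standard first-order CMOS estimate, P = α · C · Vdd² · f, where α is taken here as the probability of a power-consuming (0 to 1) output transition per clock cycle. The Python sketch below evaluates this estimate; the function name and all numerical values are hypothetical and are not taken from the reports in this chapter.

```python
def switching_power_watts(alpha, c_load_farads, vdd_volts, freq_hz):
    """First-order switching power estimate: alpha * C * Vdd^2 * f.

    alpha is the probability of a 0-to-1 output transition per clock cycle.
    """
    return alpha * c_load_farads * vdd_volts ** 2 * freq_hz


# Hypothetical node: 10 fF load, 1.0 V supply, 100 MHz clock.
p_active = switching_power_watts(0.10, 10e-15, 1.0, 100e6)

# Same node when clock gating removes half of the transitions.
p_gated = switching_power_watts(0.05, 10e-15, 1.0, 100e6)
```

Because the estimate is linear in α, halving the switching activity of a node halves its switching power, which is the effect clock gating exploits.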




6.2.3 Timing report of 16 x 16 Basic HPM

The delay report shows the delay calculation in two sections: the first for the data arrival time and the second for the data required time. The data arrival time is the time required for a signal to travel from the path start point to the path end point. The data required time is the maximum time a signal has for traveling that path. The difference between the data required time and the data arrival time is called the slack or timing margin of the path. If the slack is negative, there is a timing violation on that path.
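The slack calculation described above is a simple subtraction, sketched below in Python. The function names and the numbers in the example are hypothetical and serve only to illustrate the sign convention.

```python
def slack_ns(data_required_ns, data_arrival_ns):
    """Slack (timing margin) = data required time - data arrival time."""
    return data_required_ns - data_arrival_ns


def has_timing_violation(data_required_ns, data_arrival_ns):
    """A path violates timing when its slack is negative."""
    return slack_ns(data_required_ns, data_arrival_ns) < 0


# Hypothetical path: the signal must arrive within 10.0 ns.
met = slack_ns(10.0, 7.5)       # positive slack: timing is met
violated = slack_ns(10.0, 12.5)  # negative slack: timing violation
```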




6.2.4 Area Report of 32 x 32 Basic HPM

The above report displays the total area of the design. The total area is the sum of three factors: combinational, noncombinational, and net interconnect area. The combinational factor (basic logic gates such as ANDs and ORs) and the noncombinational factor (registers) together make up the total cell area of the logic cells in the design. The third factor, net interconnect area, is due to the wires connecting these cells.




6.2.5 Power report of 32 x 32 Basic HPM

The above report displays the total power of the design. Dynamic power is the power dissipated when the circuit is active, i.e. performing some function. It is further divided into two components: switching power and internal power.

Switching power is dissipated when charging and discharging the load capacitance at the cell output. The amount of switching power depends on the switching activity of the cell, which is related to the operating frequency. The more logic transitions there are on the cell output, the higher the switching power.








6.2.6 Timing report of 32 x 32 Basic HPM





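The delay report shows the delay calculation in two sections: the first for the data arrival time and the second for the data required time. The difference between the data required time and the data arrival time is called the slack or timing margin of the path. If the slack is negative, there is a timing violation on that path.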




6.2.7 Area Report of 16 x 16 Recursive Multiplier with Clock Gating

The above report displays the total area of the design. The total area is the sum of three factors: combinational, noncombinational, and net interconnect area. The combinational factor (basic logic gates such as ANDs and ORs) and the noncombinational factor (registers) together make up the total cell area of the logic cells in the design. The third factor, net interconnect area, is due to the wires connecting these cells.




6.2.11 Power Report of 32 x 32 Recursive Multiplier with Clock Gating

The above report displays the total power of the design. Dynamic power is the power dissipated when the circuit is active, i.e. performing some function. It is further divided into two components: switching power and internal power.

Switching power is dissipated when charging and discharging the load capacitance at the cell output. The amount of switching power depends on the switching activity of the cell, which is related to the operating frequency. The more logic transitions there are on the cell output, the higher the switching power.








6.2.12 Timing Report of 32 x 32 Recursive Multiplier with Clock Gating





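The delay report shows the delay calculation in two sections: the first for the data arrival time and the second for the data required time. The difference between the data required time and the data arrival time is called the slack or timing margin of the path. If the slack is negative, there is a timing violation on that path.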

6.3 Comparison

The comparison between Table 6.5 (Basic HPM Multiplier) and Table 6.6 (Recursive Multiplier with Clock Gating) summarizes the enhanced performance of the proposed multiplier; the changes are listed as percentages in Table 6.7. The area, power and delay comparisons from Tables 6.5 and 6.6 for 16 and 32 bits are plotted in figures 6.5, 6.6 and 6.7 respectively.

Multiplier Word size   Type of Operation   Area (µm²)   Delay (ns)   Power (µW)
16 x 16                Basic HPM           12771.8      7.43         313
32 x 32                Basic HPM           44388.3      13.13        815

Table 6.5 HPM Multiplier

Multiplier Word size   Type of Operation                            Area (µm²)   Delay (ns)   Power (µW)
16 x 16                Recursive Multiplication with Clock Gating   14297.7      7.13         161
32 x 32                Recursive Multiplication with Clock Gating   48912.5      12.13        443.7

Table 6.6 Recursive Multiplier with Clock Gating




Multiplier Word size   Area (%)   Delay (%)   Power (%)
16 x 16                11.94      -4.03       -48.5
32 x 32                10.19      -7.16       -45.5

Table 6.7 Percentage Results of Recursive Multiplier with Clock Gating with reference to the HPM

Figure 6.5 Area comparison plot (area in µm² of the 16 and 32 bit multipliers; series: HPM, Recursive Multiplication with Clock Gating)

As shown in figure 6.5, the area occupied by the 16-bit recursive multiplier is larger than that of the 16-bit basic HPM multiplier; similarly, the area occupied by the 32-bit recursive multiplier is larger than that of the 32-bit basic HPM multiplier.
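The percentages in Table 6.7 can be recomputed from the values in Tables 6.5 and 6.6 as (proposed - HPM) / HPM × 100. The Python sketch below does this; the dictionary layout and the name `percent_change` are assumptions of the sketch. Recomputed this way, five of the six entries agree with Table 6.7 to within rounding, while the 32 x 32 delay entry comes out at about -7.6% from the tabulated 13.13 ns and 12.13 ns, so the listed -7.16 may be worth rechecking against the original reports.

```python
# Values copied from Table 6.5 (Basic HPM) and Table 6.6 (recursive, clock gated).
HPM = {
    "16 x 16": {"area": 12771.8, "delay": 7.43, "power": 313.0},
    "32 x 32": {"area": 44388.3, "delay": 13.13, "power": 815.0},
}
RECURSIVE_CG = {
    "16 x 16": {"area": 14297.7, "delay": 7.13, "power": 161.0},
    "32 x 32": {"area": 48912.5, "delay": 12.13, "power": 443.7},
}


def percent_change(new, reference):
    """Relative change of `new` with respect to `reference`, in percent."""
    return (new - reference) / reference * 100.0


def comparison_table():
    """Percentage change of the proposed design relative to the HPM design."""
    return {
        size: {
            metric: percent_change(RECURSIVE_CG[size][metric], HPM[size][metric])
            for metric in ("area", "delay", "power")
        }
        for size in HPM
    }


table = comparison_table()
for size, row in table.items():
    print(size, {metric: round(value, 2) for metric, value in row.items()})
```

A positive value means the proposed design is larger or slower than the HPM reference; a negative value means it is smaller, faster or lower in power.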




Figure 6.6 Power Comparison Plot (power in µW of the 16 and 32 bit multipliers; series: HPM, Recursive Multiplication with Clock Gating)

As shown in figure 6.6, the power consumed by the 16-bit recursive multiplier is less than that of the 16-bit basic HPM multiplier; similarly, the power consumed by the 32-bit recursive multiplier is less than that of the 32-bit basic HPM multiplier.




Figure 6.7 Delay Comparison Plot (delay in ns of the 16 and 32 bit multipliers; series: HPM, Recursive Multiplication with Clock Gating)

As shown in figure 6.7, the delay of the 16-bit recursive multiplier is less than that of the 16-bit basic HPM multiplier; similarly, the delay of the 32-bit recursive multiplier is less than that of the 32-bit basic HPM multiplier.




Chapter - 7

Conclusion

In this thesis, we achieved faster and lower-power multiplication by combining the High Performance Multiplication (HPM) column reduction technique, the implementation of an N-bit multiplier by recursive multiplication, and acceleration of the final addition using a hybrid adder (RCA and BEC adder); low power was achieved by using the clock gating technique. The result analysis shows that the area overhead is not significant when compared with the increase in speed and the reduction in power consumption. The proposed multiplier design technique can be applied to any type of parallel multiplier to achieve faster and lower-power performance.

The design is implemented in Verilog HDL, simulated with the VCS compiler and synthesized with Design Compiler. With the proposed architecture, double throughput has been achieved, and the results show that the 32-bit proposed multiplier is faster, occupies more area and consumes less power than the regular HPM multiplier.




Chapter - 8

Future Scope

As an attempt to develop a fast and low-power multiplier design, the research presented in this dissertation has achieved good results and demonstrated the efficiency of high-level optimization techniques. However, there are limitations in our work, and several future research directions are possible.

The result analysis shows an increase in speed and a reduction in power consumption at the synthesis level. By carrying out the physical design, the speed and power consumption could be improved further.




