
REDUCING THE COMPUTATION TIME IN (SHORT BIT-WIDTH) TWO'S COMPLEMENT MULTIPLIERS

ABSTRACT:

Two's complement multipliers are important for a wide range of applications. In this paper, we present a technique to reduce by one row the maximum height of the partial product array generated by a radix-4 Modified Booth Encoded multiplier, without any increase in the delay of the partial product generation stage. This reduction may allow for a faster compression of the partial product array and regular layouts. This technique is of particular interest in all multiplier designs, but especially in short bit-width two's complement multipliers for high-performance embedded cores. The proposed method is general and can be extended to higher radix encodings, as well as to any size square and m × n rectangular multipliers. We evaluated the proposed approach by comparison with some other possible solutions; the results based on a rough theoretical analysis and on logic synthesis showed its efficiency in terms of both area and delay.

Introduction about Verilog

Overview:

Hardware description languages such as Verilog differ from software programming languages because they include ways of describing the propagation of time and signal dependencies (sensitivity). There are two assignment operators: a blocking assignment (=) and a non-blocking assignment (<=). The non-blocking assignment allows designers to describe a state-machine update without needing to declare and use temporary storage variables (in a general-purpose programming language, such temporary variables are needed to hold operands that are used later). Since these concepts are part of Verilog's language semantics, designers can quickly write descriptions of large circuits in a relatively compact and concise form. At the time of Verilog's introduction (1984), Verilog represented a tremendous productivity improvement for circuit designers who were already using graphical schematic capture software and specially written software programs to document and simulate electronic circuits.
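The difference between the two assignment operators can be mimicked outside Verilog. The following Python sketch is only an analogy, not Verilog and not a simulator: the hypothetical functions `blocking_step` and `nonblocking_step` model two statements that try to exchange the contents of two registers named a and b.

```python
def blocking_step(state):
    """Model of `a = b; b = a;`: each assignment takes effect immediately."""
    s = dict(state)
    s["a"] = s["b"]      # a is overwritten first
    s["b"] = s["a"]      # b then sees the new value of a
    return s


def nonblocking_step(state):
    """Model of `a <= b; b <= a;`: all right-hand sides are sampled
    before any target is updated, so no temporary variable is needed."""
    sampled = {"a": state["b"], "b": state["a"]}
    s = dict(state)
    s.update(sampled)    # all targets change together
    return s


if __name__ == "__main__":
    regs = {"a": 1, "b": 2}
    print("blocking:    ", blocking_step(regs))
    print("non-blocking:", nonblocking_step(regs))
```

In this model only the non-blocking version swaps the two values; the blocking version leaves both registers holding the old value of b.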

The designers of Verilog wanted a language with syntax similar to the C programming language, which was already widely used in engineering software development. Verilog is case-sensitive, has a basic preprocessor (though less sophisticated than that of ANSI C/C++), and equivalent control flow keywords (if/else, for, while, case, etc.), and compatible operator precedence. Syntactic differences include variable declaration (Verilog requires bit-widths on net/reg types), demarcation of procedural blocks (begin/end instead of curly braces {}), and many other minor differences.

A Verilog design consists of a hierarchy of modules. Modules encapsulate design hierarchy, and communicate with other modules through a set of declared input, output, and bidirectional ports. Internally, a module can contain any combination of the following: net/variable declarations (wire, reg, integer, etc.), concurrent and sequential statement blocks, and instances of other modules (sub-hierarchies). Sequential statements are placed inside a begin/end block and executed in sequential order within the block. But the blocks themselves are executed concurrently, qualifying Verilog as a dataflow language.

Verilog's concept of 'wire' consists of both signal values (4-state: "1, 0, floating, undefined") and strengths (strong, weak, etc.). This system allows abstract modeling of shared signal lines, where multiple sources drive a common net. When a wire has multiple drivers, the wire's (readable) value is resolved by a function of the source drivers and their strengths.

A subset of statements in the Verilog language is synthesizable. Verilog modules that conform to a synthesizable coding style, known as RTL (register transfer level), can be physically realized by synthesis software. Synthesis software algorithmically transforms the (abstract) Verilog source into a net-list, a logically equivalent description consisting only of elementary logic primitives (AND, OR, NOT, flip-flops, etc.) that are available in a specific FPGA or VLSI technology. Further manipulations to the net-list ultimately lead to a circuit fabrication blueprint (such as a photo mask set for an ASIC, or a bit-stream file for an FPGA).

Verilog-HDL History

Beginning:

Verilog was one of the first modern hardware description languages to be invented. It was created by Phil Moorby and Prabhu Goel during the winter of 1983/1984 at Automated Integrated Design Systems (renamed Gateway Design Automation in 1985) as a hardware modeling language. Gateway Design Automation was purchased by Cadence Design Systems in 1990. Cadence now has full proprietary rights to Gateway's Verilog and to Verilog-XL, the HDL simulator that would become the de facto standard (of Verilog logic simulators) for the next decade. Originally, Verilog was intended to describe and allow simulation; only afterwards was support for synthesis added.

Verilog-95:

With the increasing success of VHDL at the time, Cadence decided to make the language available for open standardization. Cadence transferred Verilog into the public domain under the Open Verilog International (OVI) (now known as Accellera) organization. Verilog was later submitted to IEEE and became IEEE Standard 1364-1995, commonly referred to as Verilog-95.

In the same time frame Cadence initiated the creation of Verilog-A to put standards support behind its analog simulator Spectre. Verilog-A was never intended to be a standalone language and is a subset of Verilog-AMS which encompassed Verilog-95.

Verilog 2001:

Extensions to Verilog-95 were submitted back to IEEE to cover the deficiencies that users had found in the original Verilog standard. These extensions became IEEE Standard 1364-2001 known as Verilog-2001.

Verilog-2001 is a significant upgrade from Verilog-95. First, it adds explicit support for (2's complement) signed nets and variables. Previously, code authors had to perform signed operations using awkward bit-level manipulations (for example, the carry-out bit of a simple 8-bit addition required an explicit description of the Boolean algebra to determine its correct value). The same function under Verilog-2001 can be more succinctly described by one of the built-in operators: +, -, /, *, >>>. A generate/end-generate construct (similar to VHDL's generate/end-generate) allows Verilog-2001 to control instance and statement instantiation through normal decision operators (case/if/else). Using generate/end-generate, Verilog-2001 can instantiate an array of instances, with control over the connectivity of the individual instances. File I/O has been improved by several new system tasks. Finally, a few syntax additions were introduced to improve code readability (e.g., always @*, named-parameter override, C-style function/task/module header declaration).

Verilog-2001 is the dominant flavor of Verilog supported by the majority of commercial EDA software packages.

Introduction about Multiplication:

Multiplication (often denoted by the cross symbol "×") is the mathematical operation of scaling one number by another. It is one of the four basic operations in elementary arithmetic (the others being addition, subtraction and division).

Multiplication:

If a positional numeral system is used, a natural way of multiplying numbers is taught in schools as long multiplication, sometimes called grade-school multiplication: multiply the multiplicand by each digit of the multiplier and then add up all the properly shifted results. It requires memorization of the multiplication table for single digits.

This is the usual algorithm for multiplying larger numbers by hand in base 10. Computers normally use a very similar shift and add algorithm in base 2. A person doing long multiplication on paper will write down all the products and then add them together; an abacus-user will sum the products as soon as each one is computed.

Example:

This example uses long multiplication to multiply 23,958,233 (multiplicand) by 5,830 (multiplier) and arrives at 139,676,498,390 for the result (product).

        23958233
            5830 ×
    ------------
        00000000   (= 23,958,233 × 0)
       71874699    (= 23,958,233 × 30)
     191665864     (= 23,958,233 × 800)
    119791165      (= 23,958,233 × 5,000)
    ------------
    139676498390   (= 139,676,498,390)
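The long multiplication above can be sketched in a few lines of Python. The function name `long_multiply` and its `base` parameter are illustrative choices, not part of the original text; the sketch assumes non-negative integers and simply records each shifted row before summing.

```python
def long_multiply(multiplicand, multiplier, base=10):
    """Grade-school multiplication: one shifted row per multiplier digit.

    Assumes non-negative integer operands. Returns (product, rows), where
    rows[k] is the multiplicand times the k-th digit, already shifted.
    """
    rows = []
    shift = 0
    while multiplier > 0:
        digit = multiplier % base                # current least-significant digit
        rows.append(multiplicand * digit * base ** shift)
        multiplier //= base                      # move to the next digit
        shift += 1
    return sum(rows), rows


if __name__ == "__main__":
    product, rows = long_multiply(23958233, 5830)
    for row in rows:
        print(f"{row:>14}")
    print(f"{product:>14}")
```

With `base=2` the same loop becomes the shift-and-add scheme mentioned above, since every digit is then 0 or 1 and each row is either zero or a shifted copy of the multiplicand.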

Multiplication algorithm:


A multiplication algorithm is an algorithm (or method) to multiply two numbers. Depending on the size of the numbers, different algorithms are in use. Efficient multiplication algorithms have existed since the advent of the decimal system.

Types of Multiplication Algorithms

1. Booth’s Algorithm

2. Modified Booth’s Algorithm

3. Wallace Tree Algorithm

Booth's Algorithm:

Booth's algorithm is a multiplication algorithm that works for two's complement numbers. It is similar to the paper-and-pencil method, except that it looks at the current bit as well as the previous bit of the multiplier in order to decide what to do. The steps are as follows:

If the current multiplier bit is 1 and the previous (less significant) bit is 0 (i.e., a 10 pair), shift and sign-extend the multiplicand and subtract it from the previous result.

If it is a 01 pair, add the shifted, sign-extended multiplicand to the previous result. If it is a 00 pair or a 11 pair, do nothing.

Let's look at a few examples.

          0110   <- 6   (4 bits)
    x     0010   <- 2
    ----------
      00000000
    -    0110
    ----------
      11110100
    +   0110
    ----------
(1)   00001100   <- 12 (overflow bit ignored; 8 bits)

In Booth's algorithm, if the multiplicand and multiplier are n-bit two's complement numbers, the result is considered as 2n-bit two's complement value. The overflow bit (outside 2n bits) is ignored.

The above computation works because

0110 x 0010 = 0110 x (-0010 + 0100) = -01100 + 011000 = 1100.

Example 2:

          0010
    x     0110
    ----------
      00000000
    -    0010
    ----------
      11111100
    +  0010
    ----------
(1)   00001100

Here we have computed

0010 x 0110 = 0010 x (-0010 + 1000) = -00100 + 0010000 = 1100

Example 3, (-5) x (-3):

          1011   <- -5 (4-bit two's complement)
    x     1101   <- -3
    ----------
      00000000
    - 11111011   (notice the sign extension of the multiplicand)
    ----------
      00000101
    + 1111011
    ----------
      11111011
    - 111011
    ----------
      00001111   <- +15

A long example:

               10011100   <- -100
    x          01100011   <- 99
    -------------------
      00000000 00000000
    - 11111111 10011100
    -------------------
      00000000 01100100
    + 11111110 011100
    -------------------
      11111110 11010100
    - 11110011 100
    -------------------
      00001011 01010100
    + 11001110 0
    -------------------
      11011001 01010100   <- -9900

Note that the multiplicand and the multiplier are 8-bit two's complement numbers, but the result is understood as a 16-bit two's complement number. Be careful about the proper alignment of the columns: a 10 pair causes a subtraction, aligned with the 1, and a 01 pair causes an addition, aligned with the 0. In both cases, the operand is aligned with the left bit of the pair. The algorithm starts with the 0-th bit, assuming that there is a (-1)-th bit with value 0.
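The rules above (subtract on a 10 pair, add on a 01 pair, do nothing otherwise) can be checked with a small Python sketch. The function names `booth_multiply` and `as_bits` are hypothetical, and Python's unbounded integers stand in for the sign-extended 2n-bit registers of a real datapath.

```python
def booth_multiply(multiplicand, multiplier, n):
    """Radix-2 Booth multiplication of two signed n-bit integers.

    Scans the multiplier from bit 0 upwards, using an implicit bit -1 = 0.
    """
    acc = 0
    previous = 0                               # the assumed (-1)-th bit
    for i in range(n):
        current = (multiplier >> i) & 1        # works for negative ints too
        if (current, previous) == (1, 0):      # start of a string of 1s
            acc -= multiplicand << i
        elif (current, previous) == (0, 1):    # end of a string of 1s
            acc += multiplicand << i
        previous = current
    return acc


def as_bits(value, width):
    """Two's complement bit pattern of value, truncated to width bits."""
    return format(value & ((1 << width) - 1), f"0{width}b")


if __name__ == "__main__":
    for m, y, n in [(6, 2, 4), (2, 6, 4), (-5, -3, 4), (-100, 99, 8)]:
        p = booth_multiply(m, y, n)
        print(f"{m} x {y} = {p}  ({as_bits(p, 2 * n)})")
```

Truncating the result to 2n bits in `as_bits` corresponds to ignoring the overflow bit in the hand-worked examples.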

Booth Algorithm Advantages and Disadvantages

• Depends on the architecture

– Potential advantage: might reduce the number of 1’s in the multiplier

• In the multipliers that we have seen so far:

– Doesn’t save in speed (still have to wait for the critical path, e.g., the shift-add delay in a sequential multiplier)

– Increases area: recoding circuitry and subtraction

Modified Booth:

• Booth 2 modified to produce at most n/2+1 partial products.

• Algorithm (for unsigned numbers):

1. Pad the LSB with one zero.

2. Pad the MSB with 2 zeros if n is even and 1 zero if n is odd.

3. Divide the multiplier into overlapping groups of 3 bits.

4. Determine the partial product scale factor from the modified Booth 2 encoding table.

5. Compute the multiplicand multiples.

6. Sum the partial products.

• Can encode the digits by looking at three bits at a time

• Booth recoding table:

1. Must be able to add the multiplicand times –2, –1, 0, 1 and 2.

2. Since Booth recoding got rid of 3’s, generating partial products is not that hard (shifting and negating).

    i+1   i   i-1    add
     0    0    0     0*M
     0    0    1     1*M
     0    1    0     1*M
     0    1    1     2*M
     1    0    0    –2*M
     1    0    1    –1*M
     1    1    0    –1*M
     1    1    1     0*M

• Booth 2 modified to produce at most n/2+1 partial products.

• Algorithm (for signed numbers, as the sign extension in step 2 indicates):

1. Pad the LSB with one zero.

2. If n is even, don’t pad the MSB (n/2 PPs); if n is odd, sign-extend the MSB by 1 bit ((n+1)/2 PPs).

3. Divide the multiplier into overlapping groups of 3 bits.

4. Determine the partial product scale factor from the modified Booth 2 encoding table.

5. Compute the multiplicand multiples.

6. Sum the partial products.

• Interpretation of the Booth recoding table:

    i+1   i   i-1    add     Explanation
     0    0    0     0*M     No string of 1’s in sight
     0    0    1     1*M     End of a string of 1’s
     0    1    0     1*M     Isolated 1
     0    1    1     2*M     End of a string of 1’s
     1    0    0    –2*M     Beginning of a string of 1’s
     1    0    1    –1*M     End one string, begin new one
     1    1    0    –1*M     Beginning of a string of 1’s
     1    1    1     0*M     Continuation of string of 1’s

• Grouping multiplier bits into pairs

– Orthogonal idea to the Booth recoding

– Reduces the number of partial products to half

– If Booth recoding is not used → have to be able to multiply by 3 (hard: shift + add)

• Applying the grouping idea to Booth → Modified Booth Recoding (Encoding)

– We already got rid of sequences of 1’s → no multiplication by 3

– Just negate, shift once or twice

• Uses high-radix to reduce the number of intermediate addition operands

– Can go higher: radix-8, radix-16

– Radix-8 should implement *3, *-3, *4, *-4

– Recoding and partial product generation become more complex

• Can automatically take care of signed multiplication
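The recoding table above can be exercised with a short Python sketch. It follows the signed variant of the algorithm (sign extension by one bit when n is odd); the names `BOOTH2_TABLE`, `booth2_recode` and `booth2_multiply` are illustrative and not taken from the original slides.

```python
# (bit i+1, bit i, bit i-1) -> multiple of the multiplicand M, as in the table
BOOTH2_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth2_recode(multiplier, n):
    """Radix-4 (modified Booth) digits of a signed n-bit multiplier, LSB first."""
    if n % 2:
        n += 1                                   # odd n: sign-extend by one bit
    # padded[0] is the zero padded below the LSB; padded[k + 1] is bit k.
    # Python's >> on negative ints sign-extends, which supplies the extra bit.
    padded = [0] + [(multiplier >> k) & 1 for k in range(n)]
    digits = []
    for i in range(0, n, 2):                     # overlapping 3-bit groups
        group = (padded[i + 2], padded[i + 1], padded[i])
        digits.append(BOOTH2_TABLE[group])
    return digits


def booth2_multiply(multiplicand, multiplier, n):
    """Sum the partial products digit * M * 4**i."""
    digits = booth2_recode(multiplier, n)
    return sum(d * multiplicand * 4 ** i for i, d in enumerate(digits))


if __name__ == "__main__":
    print(booth2_recode(-3, 4))          # digits for 1101
    print(booth2_multiply(-5, -3, 4))
```

Each digit needs only a shift and an optional negation of the multiplicand, which is why no multiple of 3 ever has to be formed.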

Wallace tree:

A Wallace tree is an efficient hardware implementation of a digital circuit that multiplies two integers, devised by the Australian computer scientist Chris Wallace in 1964.

The Wallace tree has three steps:

1. Multiply (that is, AND) each bit of one of the arguments by each bit of the other, yielding n^2 results. Depending on the position of the multiplied bits, the wires carry different weights; for example, the wire carrying the result of a2b3 has weight 32 (see explanation of weights below).

2. Reduce the number of partial products to two by layers of full and half adders.

3. Group the wires in two numbers, and add them with a conventional adder.


The second phase works as follows. As long as there are three or more wires with the same weight add a following layer:

Take any three wires with the same weights and input them into a full adder. The result will be an output wire of the same weight and an output wire with a higher weight for each three input wires.

If there are two wires of the same weight left, input them into a half adder. If there is just one wire left, connect it to the next layer.

The benefit of the Wallace tree is that there are only O(log n) reduction layers, and each layer has O(1) propagation delay. As making the partial products is O(1) and the final addition is O(log n), the multiplication is only O(log n), not much slower than addition (however, much more expensive in gate count). Naively adding the partial products with regular adders would require O(log^2 n) time. From a complexity-theoretic perspective, the Wallace tree algorithm puts multiplication in the class NC^1.

These computations only consider gate delays and don't deal with wire delays, which can also be very substantial. The Wallace tree can be also represented by a tree of 3/2 or 4/2 adders. It is sometimes combined with Booth encoding.

Weights explained

The weight of a wire is the radix (to base 2) of the digit that the wire carries. In general, a_n b_m has indexes n and m, and since 2^n × 2^m = 2^{n+m}, the weight of a_n b_m is 2^{n+m}.

Example

n = 4, multiplying a3a2a1a0 by b3b2b1b0:

1. First we multiply every bit by every bit: o weight 1 - a0b0

o weight 2 - a0b1, a1b0

o weight 4 - a0b2, a1b1, a2b0

o weight 8 - a0b3, a1b2, a2b1, a3b0

o weight 16 - a1b3, a2b2, a3b1

o weight 32 - a2b3, a3b2

o weight 64 - a3b3

2. Reduction layer 1:

o Pass the only weight-1 wire through, output: 1 weight-1 wire

o Add a half adder for weight 2, outputs: 1 weight-2 wire, 1 weight-4 wire

o Add a full adder for weight 4, outputs: 1 weight-4 wire, 1 weight-8 wire

o Add a full adder for weight 8, and pass the remaining wire through, outputs: 2 weight-8 wires, 1 weight-16 wire

o Add a full adder for weight 16, outputs: 1 weight-16 wire, 1 weight-32 wire

o Add a half adder for weight 32, outputs: 1 weight-32 wire, 1 weight-64 wire

o Pass the only weight-64 wire through, output: 1 weight-64 wire

3. Wires at the output of reduction layer 1:

o weight 1 - 1

o weight 2 - 1

o weight 4 - 2

o weight 8 - 3

o weight 16 - 2

o weight 32 - 2

o weight 64 - 2

4. Reduction layer 2:

o Add a full adder for weight 8, and half adders for weights 4, 16, 32, 64

5. Outputs:

o weight 1 - 1

o weight 2 - 1

o weight 4 - 1

o weight 8 - 2

o weight 16 - 2

o weight 32 - 2

o weight 64 - 2

o weight 128 - 1

6. Group the wires into a pair of integers and use an adder to add them.
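The wire counts in this example can be reproduced with a small Python sketch that tracks only how many wires exist at each weight, not their logic values. The function names are hypothetical, and the layer rule is the one stated earlier: full adders on groups of three wires, a half adder on a leftover pair, and a single leftover wire passed through.

```python
from collections import defaultdict


def initial_wires(n):
    """Number of AND-gate outputs a_i b_j at each weight 2**(i + j)."""
    counts = defaultdict(int)
    for i in range(n):
        for j in range(n):
            counts[1 << (i + j)] += 1
    return dict(counts)


def reduction_layer(counts):
    """Apply one layer of full/half adders and return the new wire counts."""
    out = defaultdict(int)
    for weight, wires in counts.items():
        full_adders, left = divmod(wires, 3)
        out[weight] += full_adders          # sum outputs keep the weight
        out[2 * weight] += full_adders      # carry outputs double it
        if left == 2:                       # half adder on the leftover pair
            out[weight] += 1
            out[2 * weight] += 1
        elif left == 1:                     # single wire passes through
            out[weight] += 1
    return {w: c for w, c in sorted(out.items()) if c}


def wallace_layers(n):
    """Wire counts after each layer, until no weight has 3 or more wires."""
    counts = initial_wires(n)
    layers = []
    while max(counts.values()) >= 3:
        counts = reduction_layer(counts)
        layers.append(counts)
    return layers


if __name__ == "__main__":
    for number, layer in enumerate(wallace_layers(4), start=1):
        print(f"after layer {number}: {layer}")
```

For n = 4 the sketch stops after two layers, leaving at most two wires per weight, which is the point where a conventional adder takes over.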

Two’s complement:

The two's complement of a binary number is defined as the value obtained by subtracting the number from a large power of two (specifically, from 2^N for an N-bit two's complement). The two's complement of the number then behaves like the negative of the original number in most arithmetic, and it can coexist with positive numbers in a natural way.

A two's-complement system, or two's-complement arithmetic, is a system in which negative numbers are represented by the two's complement of the absolute value; this system is the most common method of representing signed integers on computers. In such a system, a number is negated (converted from positive to negative or vice versa) by computing its ones' complement and adding one. An N-bit two's-complement numeral system can represent every integer in the range −2^{N−1} to 2^{N−1} − 1, while ones' complement can only represent integers in the range −(2^{N−1} − 1) to 2^{N−1} − 1.

The two's-complement system has the advantage of not requiring that the addition and subtraction circuitry examine the signs of the operands to determine whether to add or subtract. This property makes the system both simpler to implement and capable of easily handling higher precision arithmetic. Also, zero has only a single representation, obviating the subtleties associated with negative zero, which exists in ones'-complement systems.
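As a minimal illustration of these definitions, the following Python sketch encodes, decodes and negates N-bit two's complement patterns. The function names are illustrative; bit patterns are held as non-negative Python integers.

```python
def to_twos_complement(value, n):
    """N-bit two's complement pattern of a signed integer."""
    if not -(1 << (n - 1)) <= value <= (1 << (n - 1)) - 1:
        raise ValueError(f"{value} does not fit in {n} bits")
    return value & ((1 << n) - 1)        # same as 2**n + value for negatives


def from_twos_complement(pattern, n):
    """Signed value of an N-bit pattern (the MSB carries weight -2**(n-1))."""
    if pattern & (1 << (n - 1)):
        return pattern - (1 << n)
    return pattern


def negate(pattern, n):
    """Negation as described above: ones' complement, then add one."""
    mask = (1 << n) - 1
    return ((pattern ^ mask) + 1) & mask


if __name__ == "__main__":
    five = to_twos_complement(5, 4)
    minus_five = negate(five, 4)
    print(format(five, "04b"), format(minus_five, "04b"),
          from_twos_complement(minus_five, 4))
```

Adding the patterns of 5 and -5 and keeping only the low N bits gives zero, which is why the same adder serves both signed and unsigned operands.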



1. INTRODUCTION:

In multimedia, 3D graphics, and signal processing applications, performance in most cases strongly depends on the effectiveness of the hardware used for computing multiplications, since multiplication is, besides addition, massively used in these environments. The high interest in this application field is witnessed by the large number of algorithms and implementations of the multiplication operation that have been proposed in the literature (for a representative set of references, see [1]). More specifically, short bit-width (8-16 bits) two’s complement multipliers with single-cycle throughput and latency have emerged and become very important building blocks for high-performance embedded processors and DSP execution cores [2], [3]. In this case, the multiplier must be highly optimized to fit within the required cycle time and power budgets. Another relevant application for short bit-width multipliers is the design of SIMD units supporting different data formats [3], [4]. In this case, short bit-width multipliers often play the role of basic building blocks. Two’s complement multipliers of moderate bit-width (less than 32 bits) are also being used massively in FPGAs. All of the above translates into a high interest and motivation on the part of the industry for the design of high-performance short or moderate bit-width two’s complement multipliers.

The basic algorithm for multiplication is based on the well-known paper and pencil approach [1] and passes through three main phases: 1) partial product (PP) generation, 2) PP reduction, and 3) final (carry-propagated) addition. During PP generation, a set of rows is generated where each one is the result of the product of one bit of the multiplier by the multiplicand. For example, if we consider the multiplication X × Y with both X and Y on n bits and of the form x_{n-1} ... x_0 and y_{n-1} ... y_0, then the ith row is, in general, a proper left shifting of y_i × X, i.e., either a string of all zeros when y_i = 0, or the multiplicand X itself when y_i = 1. In this case, the number of PP rows generated during the first phase is clearly n. Modified Booth Encoding (MBE) is a technique that has been introduced to reduce the number of PP rows, while still keeping the generation process of each row both simple and fast enough. One of the most commonly used schemes is radix-4 MBE, for a number of reasons, the most important being that it allows for the reduction of the size of the partial product array by almost half, and it is very simple to generate the multiples of the multiplicand. More specifically, the classic two’s complement n × n bit multiplier using the radix-4 MBE scheme generates a PP array with a maximum height of [n/2] + 1 rows, each row before the last one being one of the following possible values: all zeros, ±X, ±2X. The last row, which is due to the negative encoding, can be kept very simple by using specific techniques integrating two’s complement and sign extension prevention [1].

The PP reduction is the process of adding all PP rows by using a compression tree [6], [7]. Since the knowledge of intermediate addition values is not important, the outcome of this phase is a result represented in redundant carry-save form, i.e., as two rows, which allows for much faster implementations. The final (carry-propagated) addition has the task of adding these two rows and of presenting the final result in a nonredundant form, i.e., as a single row.
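The idea of carry-save form can be illustrated with a short Python sketch, under the simplifying assumption that each row is a non-negative integer. The names `carry_save_add` and `reduce_to_two_rows` are hypothetical, and the sketch uses plain 3:2 compression rather than any particular tree from [6], [7].

```python
def carry_save_add(a, b, c):
    """3:2 compression: three rows in, a sum row and a carry row out.

    Every bit position is handled independently (one full adder per column),
    so no carry propagates along the row.
    """
    sum_row = a ^ b ^ c
    carry_row = ((a & b) | (a & c) | (b & c)) << 1
    return sum_row, carry_row


def reduce_to_two_rows(rows):
    """Repeatedly compress three rows into two until only two remain."""
    rows = list(rows)
    while len(rows) > 2:
        a, b, c = rows.pop(), rows.pop(), rows.pop()
        rows.extend(carry_save_add(a, b, c))
    return rows


if __name__ == "__main__":
    partial_products = [13, 7, 22, 5, 9]
    redundant = reduce_to_two_rows(partial_products)
    # Only this last step needs a carry-propagating adder.
    print(redundant, sum(redundant), sum(partial_products))
```

The two returned rows are the redundant representation; the single ordinary addition at the end corresponds to the final (carry-propagated) addition phase.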

In this work, we introduce an idea to overlap, to some extent, the PP generation and the PP reduction phases. Our aim is to produce a PP array with a maximum height of [n/2] rows that is then reduced by the compressor tree stage. As we will see, for the common case of values of n that are powers of two, the above reduction can lead to an implementation where the delay of the compressor tree is reduced by one XOR2 gate while keeping a regular layout. Since we are focusing on small values of n and fast single-cycle units, this reduction might be important in cases where, for example, high computation performance through the assembly of a large number of small processing units with limited computation capabilities is required, such as 8 × 8 or 16 × 16 multipliers [8].

A similar study aimed at the reduction of the maximum height to [n/2], but using a different approach, has recently presented interesting results in [9] and previously, by the same authors, in [10]. Thus, in the following, we will evaluate and compare the proposed approach with the technique in [9]. Additional details of our approach, besides the main results presented here, can be found in [11].

The paper is organized as follows: in Section 2, the multiplication algorithm based on MBE is briefly reviewed and analyzed. In Section 3, we describe related works. In Section 4, we present our scheme to reduce the maximum height of the partial product array by one unit during the generation of the PP rows. Finally, in Section 5, we provide evaluations and comparisons.

2 .MODIFIED BOOTH RECODED MULTIPLIERS:

In general, a radix-B = 2^b MBE leads to a reduction of the number of rows to about [n/b] while, on the other hand, it introduces the need to generate all the multiples of the multiplicand X, at least from -B/2 × X to B/2 × X. As mentioned above, radix-4 MBE is particularly of interest since, for radix-4, it is easy to create the multiples of the multiplicand 0, ±X, ±2X. In particular, ±2X can be simply obtained by a single left shift of the corresponding terms ±X. It is clear that the MBE can be extended to higher radices (see [12] among others), but the advantage of getting a higher reduction in the number of rows is paid for by the need to generate more multiples of X. In this paper, we focus our attention on radix-4 MBE, although the proposed method can be easily extended to any radix-B MBE [11].

From an operational point of view, it is well known that the radix-4 MBE scheme consists of scanning the multiplier operand with a three-bit window and a stride of two bits (radix-4). For each group of three bits (y_{2i+1}, y_{2i}, y_{2i-1}), only one partial product row is generated according to the encoding in Table 1. A possible implementation of the radix-4 MBE and of the corresponding partial product generation is shown in Fig. 1, which comes from a small adaptation of [10, Fig. 12b]. For each partial product row, Fig. 1a produces the one, two, and neg signals. These signals are then exploited by the logic in Fig. 1b, along with the appropriate bits of the multiplicand, in order to generate the whole partial product array. Other alternatives for the implementation of the recoding and partial product generation can be found in [13], [14], [15], among others.

As introduced previously, the use of radix-4 MBE allows for the (theoretical) reduction of the PP rows to [n/2], with the possibility for each row to host a multiple of y_i × X, with y_i ∈ {0, ±1, ±2}. While it is straightforward to generate the positive terms 0, X, and 2X, at least through a left shift of X, some attention is required to generate the terms -X and -2X which, as observed in Table 1, can arise from three configurations of the y_{2i+1}, y_{2i}, and y_{2i-1} bits. To avoid computing negative encodings, i.e., -X and -2X, the two’s complement of the multiplicand is generally used. From a mathematical point of view, the use of two’s complement requires extension of the sign to the leftmost part of each partial product row, with the consequence of an extra area overhead. Thus, a number of strategies for preventing sign extension have been developed. For instance, the scheme in [1] relies on the observation that 1-2+4. The array resulting from the application of the sign extension prevention technique in [1] to the partial product array of an 8 × 8 MBE multiplier [5] is shown in Fig. 2.

The use of two’s complement requires a neg signal (e.g., neg0, neg1, neg2, and neg3 in Fig. 2) to be added in the LSB position of each partial product row to generate the two’s complement, as needed. Thus, although for an n × n multiplier only [n/2] partial products are generated, the maximum height of the partial product array is [n/2] + 1.
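The role of the neg bits can be made concrete with a Python sketch. It is a simplified model, not the circuit of Fig. 1 or the array of Fig. 2: each row is kept as a full 2n-bit pattern, the sign-extension-prevention bits are ignored, and row i is assumed to occupy n + 1 columns starting at column 2i. All function names are hypothetical.

```python
def mbe_rows(x, y, n):
    """Radix-4 MBE rows for signed n-bit x (multiplicand) and y (multiplier).

    Returns (pattern, neg, shift) tuples. For a negative Booth digit the
    pattern is the bitwise complement of the positive multiple, and neg = 1
    is the bit that must still be added at the row's LSB position.
    """
    mask = (1 << (2 * n)) - 1
    rows, previous = [], 0                       # previous models y_{-1} = 0
    for i in range(n // 2):
        y0 = (y >> (2 * i)) & 1
        y1 = (y >> (2 * i + 1)) & 1
        digit = -2 * y1 + y0 + previous          # in {-2, -1, 0, 1, 2}
        previous = y1
        pattern = (abs(digit) * x) & mask        # 0, X or 2X
        neg = 1 if digit < 0 else 0
        if neg:
            pattern ^= mask                      # ones' complement only
        rows.append((pattern, neg, 2 * i))
    return rows


def product_from_rows(rows, n):
    """Add every row and every neg bit, modulo 2**(2n)."""
    total = sum((pattern << shift) + (neg << shift) for pattern, neg, shift in rows)
    return total & ((1 << (2 * n)) - 1)


def max_column_height(n):
    """Tallest column when row i spans columns 2i .. 2i+n (an assumption)."""
    heights = [0] * (2 * n)
    for i in range(n // 2):
        for col in range(2 * i, min(2 * i + n + 1, 2 * n)):
            heights[col] += 1                    # one bit of row i
        heights[2 * i] += 1                      # neg_i under the row's LSB
    return max(heights)


if __name__ == "__main__":
    print(hex(product_from_rows(mbe_rows(-100, 99, 8), 8)), max_column_height(8))
```

In this model the tallest column is the one holding the last neg bit, which is where the extra row comes from.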

When 4-to-2 compressors are used, which is a widely used option because of the high regularity of the resultant circuit layout for n power of two, the reduction of the extra row may require an additional delay of two XOR2 gates. By properly connecting partial product rows and using a Wallace reduction tree [7], the extra delay can be further reduced to one XOR2 [16], [17]. However, the reduction still requires additional hardware, roughly a row of n half adders. This issue is of special interest when n is a power of two, which is by far a very common case, and the multiplier’s critical path has to fit within the clock period of a high performance processor. For instance, in the design presented in [2], for n =16, the maximum column height of the partial product array is nine, with an equivalent delay for the reduction of six XOR2 gates [16], [17]. For a maximum height of the partial product array of 8, the delay of the reduction tree would be reduced by one XOR2 gate [16], [17]. Alternatively, with a maximum height of eight, it would be possible to use 4 to 2 adders, with a delay of the reduction tree of six XOR2 gates, but with a very regular layout.

3. RELATED WORK:

Some approaches have been proposed aiming to add the [n/2] + 1 rows, possibly in the same time as the [n/2] rows. The solution presented in [14] is based on the use of different types of counters, that is, it operates at the level of the PP reduction phase. Kang and Gaudiot propose a different approach in [9] that manages to achieve the goal of eliminating the extra row before the PP reduction phase. This approach is based on computing the two’s complement of the last partial product, thus eliminating the need for the last neg signal, in logarithmic time. A special tree structure (basically an incrementer implemented as a prefix tree [18]) is used in order to produce the two’s complement (Fig. 3), by decoding the MBE signals through a 3-5 decoder (Fig. 4a). Finally, a row of 4-1 multiplexers with implicit zero output is used (Fig. 4b) to produce the last partial product row directly in two’s complement, without the need for the neg signal. The goal is to produce the two’s complement in parallel with the computation of the partial products of the other rows with maximum overlap. In such a case, it is expected to have no or only a small time penalty in the critical path. The architecture in [9], [18] is a logarithmic version of the linear method presented in [19] and [20]. With respect to [19], [20], the approach in [9] is more general and shows better adaptability to any word size. An example of the partial product array produced using the above method is depicted in Fig. 5.

In this work, we present a technique that also aims at producing only [n/2] rows, but by relying on a different approach than [9].

4. BASIC IDEA:

The case of n × n square multipliers is quite common, as is the case of n being a power of two. Thus, we start by focusing our attention on square multipliers, and then present the extension to the general case of m × n rectangular multipliers.

4.1 Square Multipliers: The proposed approach is general and, for the sake of clarity, will be explained through the practical case of 8 × 8 multiplications (as in the previous figures). As briefly outlined in the previous sections, the main goal of our approach is to produce a partial product array with a maximum height of [n/2] rows, without introducing any additional delay.

Let us consider, as the starting point, the form of the simplified array reported in Fig. 2 for all the partial product rows except the first one. As depicted in Fig. 6a, the first row is temporarily considered as being split into two sub rows: the first one contains the partial product bits (from right to left) from pp00 to pp80 bar, and the second one has two bits set to “one” in positions 9 and 8. Then, the bit neg3, related to the fourth partial product row, is moved to become part of the second sub row. The key point of this “graphical” transformation is that the second sub row, which now also contains the bit neg3, can be easily added to the first sub row with a constant short carry propagation of three positions (further denoted as “3-bits addition”), a value which is easily shown to be general, i.e., independent of the length of the operands, for square multipliers. In fact, with reference to the notation of Fig. 6, we have that

qq90 bar qq90 qq80 qq70 qq60 = 0 0 pp80 bar pp70 pp60 + 0 1 1 0 neg3.

As introduced above, due to the particular value of the second operand, i.e., 0 1 1 0 neg3, we have observed in [11] that this sum requires a carry propagation only across the three least-significant positions, a fact that can also be seen from the implementation shown in Fig. 7.
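The short carry chain of the 3-bit addition can be checked with a small sketch. The two modules below are hypothetical: the first is a behavioral reference that adds the pattern 0 1 1 0 neg3 to the three most significant bits of the first sub row, and the second is one possible gate-level form in which the carry ripples over three positions only. Neither is claimed to be the circuit of Fig. 7; the module and port names are assumptions, with the output bits ordered as {qq90 bar, qq90, qq80, qq70, qq60}.

```verilog
`timescale 1ns/1ps

// Behavioral reference: {0,0,pp80_n,pp70,pp60} + {0,1,1,0,neg3}.
module first_row_msb_add_ref (
  input  wire       pp80_n,   // complemented pp80
  input  wire       pp70,
  input  wire       pp60,
  input  wire       neg3,
  output wire [4:0] qq        // {qq90_n, qq90, qq80, qq70, qq60}
);
  assign qq = {2'b00, pp80_n, pp70, pp60} + {4'b0110, neg3};
endmodule

// Gate-level sketch: the carry propagates across positions 6, 7 and 8 only.
module first_row_msb_add_gates (
  input  wire       pp80_n,
  input  wire       pp70,
  input  wire       pp60,
  input  wire       neg3,
  output wire [4:0] qq
);
  wire c6 = pp60 & neg3;          // carry out of position 6
  wire c7 = pp70 & c6;            // carry out of position 7
  wire c8 = pp80_n | c7;          // carry out of position 8 (addend bit is 1)

  assign qq[0] = pp60 ^ neg3;
  assign qq[1] = pp70 ^ c6;
  assign qq[2] = ~(pp80_n ^ c7);  // position 8: operand bit + 1 + carry
  assign qq[3] = ~c8;             // position 9: 1 + carry
  assign qq[4] = c8;              // position 10: complement of qq[3]
endmodule

// Exhaustive comparison over the 16 input combinations.
module first_row_msb_add_tb;
  reg  [3:0] v;
  wire [4:0] q_ref, q_gate;
  integer    k, errors;

  first_row_msb_add_ref   u_ref  (.pp80_n(v[3]), .pp70(v[2]), .pp60(v[1]),
                                  .neg3(v[0]), .qq(q_ref));
  first_row_msb_add_gates u_gate (.pp80_n(v[3]), .pp70(v[2]), .pp60(v[1]),
                                  .neg3(v[0]), .qq(q_gate));

  initial begin
    errors = 0;
    for (k = 0; k < 16; k = k + 1) begin
      v = k;
      #1;
      if (q_ref !== q_gate)         errors = errors + 1;
      if (q_ref[4] !== ~q_ref[3])   errors = errors + 1;
    end
    if (errors == 0) $display("Gate-level 3-bit addition matches the reference.");
    else             $display("%0d mismatches found.", errors);
    $finish;
  end
endmodule
```

The second check in the test bench confirms that the two leading result bits are always complementary, which is why a single signal and its complement suffice for positions 9 and 10.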

It is worth observing that, in order not to have delay penalizations, it is necessary that the generation of the other rows is done in parallel with the generation of the first row cascaded by the computation of the bits qq70 qq60 in Fig. 6b. In order to achieve this, we must simplify and differentiate the generation of the first row with respect to the other rows. We observe that the Booth recoding for the first row is computed more easily than for the other rows, because the y-1 bit used by the MBE (the bit implicitly appended to the right of y0) is always equal to zero. In order to have a preliminary analysis that is as independent as possible of technological details, we refer to the circuits in the following figures:

Fig. 1, slightly adapted from [10, Fig. 12], for the partial product generation using MBE;

Fig. 7, obtained through manual synthesis (aimed at modularity and area reduction without compromising the delay), for the addition of the last neg bit to the three most significant bits of the first row;

Fig. 8, obtained by simplifying Fig. 1 (since, in the first row, it is y2i-1 = 0), for the partial product generation of the first row only using MBE; and

Fig. 9, obtained through manual synthesis of a combination of the two parts of Fig. 8 and aimed at decreasing the delay of Fig. 8 with no or very small area increase, for the partial product generation of the first row only using MBE.

In particular, we observe that, by direct comparison of Figs. 1 and 8, the generation of the MBE signals for the first row is simpler, and theoretically allows for the saving of the delay of one NAND3 gate. In addition, the implementation in Fig. 9 has a delay that is smaller than the two parts of Fig. 8, although it could require a small amount of additional area.

As we see in the following, this issue hardly has any significant impact on the overall design, since this extra hardware is used only for the three most significant bits of the first row, and not for all the other bits of the array.
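The simplification of the first-row recoding can be illustrated with a short sketch. The encoder equations below follow a common textbook formulation of radix-4 MBE (signals one, two, neg) and are an assumption made for illustration; they are not claimed to match the circuits of Fig. 1, Fig. 8, or Fig. 9 gate for gate, and the module names are hypothetical.

```verilog
`timescale 1ns/1ps

// Generic-row encoder: inputs are y(2i+1), y(2i), y(2i-1).
module mbe_enc (
  input  wire y_hi,
  input  wire y_mid,
  input  wire y_lo,
  output wire one,
  output wire two,
  output wire neg
);
  assign one = y_mid ^ y_lo;            // digit is +1 or -1
  assign two = (y_hi ^ y_mid) & ~one;   // digit is +2 or -2
  assign neg = y_hi;                    // complement the selected bits
endmodule

// First-row encoder: y(-1) is always 0, so the XOR on the low bits disappears.
module mbe_enc_row0 (
  input  wire y1,
  input  wire y0,
  output wire one,
  output wire two,
  output wire neg
);
  assign one = y0;
  assign two = y1 & ~y0;
  assign neg = y1;
endmodule

// Check that the simplified encoder equals the generic one with y_lo tied to 0.
module mbe_enc_tb;
  reg  [1:0] y;
  wire       g_one, g_two, g_neg;
  wire       f_one, f_two, f_neg;
  integer    k, errors;

  mbe_enc      u_generic (.y_hi(y[1]), .y_mid(y[0]), .y_lo(1'b0),
                          .one(g_one), .two(g_two), .neg(g_neg));
  mbe_enc_row0 u_first   (.y1(y[1]), .y0(y[0]),
                          .one(f_one), .two(f_two), .neg(f_neg));

  initial begin
    errors = 0;
    for (k = 0; k < 4; k = k + 1) begin
      y = k;
      #1;
      if ({g_one, g_two, g_neg} !== {f_one, f_two, f_neg}) errors = errors + 1;
    end
    if (errors == 0) $display("First-row encoder matches the generic encoder with y(-1) = 0.");
    else             $display("%0d mismatches found.", errors);
    $finish;
  end
endmodule
```

In this formulation the first-row encoder needs no XOR gates at all, which gives an intuition for why the first row can afford the extra short addition without lengthening the critical path.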

The high-level description of our idea is as follows:

1. Generation of the three most significant bit weights of the first row, plus addition of the last neg bit: possible implementations can use three replicas of the circuit of Fig. 9 (one for each of the three most significant bits of the first row), cascaded by the circuit of Fig. 7 to add the neg signal;

2. Parallel generation of the other bits of the first row: possible implementations can use instances of the circuitry depicted in Fig. 8, for each bit of the first row, except for the three most significant;

3. Parallel generation of the bits of the other rows: possible implementations can use the circuitry of Fig. 1, replicated for each bit of the other rows.

All items 1 to 3 are independent, and therefore can be executed in parallel. Clearly if, as assumed and expected, item 1 is not the bottleneck (i.e., the critical path), then the implementation of the proposed idea has reached the goal of not introducing time penalties.
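The three items above can be put together in a behavioral check of the whole 8 * 8 array. The test bench below is a sketch under stated assumptions: it uses a textbook MBE formulation (one, two, neg), the sign-extension scheme described for the simplified array (a leading 1 and a complemented most significant bit in rows 1 to 3), and hypothetical signal names. It builds only four rows, with neg3 absorbed into the first row through the 3-bit addition, and compares their sum modulo 2^16 with the signed product.

```verilog
`timescale 1ns/1ps

module mbe_reduced_array_tb;
  reg  signed [7:0]  x, y;
  reg  signed [15:0] expected;
  reg         [8:0]  ye;        // {y7..y0, y(-1)} with y(-1) = 0
  reg         [8:0]  p, p0;     // current row bits, saved first-row bits
  reg         [3:0]  neg;
  reg         [4:0]  qq;
  reg         [15:0] row0, rest, sum;
  reg                one, two;
  integer            a, b, i, errors;

  // Bits pp(0,i)..pp(8,i) of one row: ((one & x_j) | (two & x_(j-1))) ^ neg.
  function [8:0] pp_row;
    input [7:0] xv;
    input       f_one, f_two, f_neg;
    reg   [9:0] xe;
    integer     j;
    begin
      xe = {xv[7], xv, 1'b0};   // xe[j+1] = x_j, with x_8 = x_7 and x_(-1) = 0
      for (j = 0; j < 9; j = j + 1)
        pp_row[j] = ((f_one & xe[j+1]) | (f_two & xe[j])) ^ f_neg;
    end
  endfunction

  initial begin
    errors = 0;
    for (a = -128; a < 128; a = a + 1)
      for (b = -128; b < 128; b = b + 1) begin
        x    = a;
        y    = b;
        ye   = {y, 1'b0};
        rest = 16'd0;
        for (i = 0; i < 4; i = i + 1) begin
          one    = ye[2*i+1] ^ ye[2*i];
          two    = (ye[2*i+2] ^ ye[2*i+1]) & ~one;
          neg[i] = ye[2*i+2];
          p      = pp_row(x, one, two, neg[i]);
          if (i == 0)
            p0 = p;
          else
            // Rows 1..3: leading 1, complemented MSB, plus neg of the row above.
            rest = rest + ({6'b0, 1'b1, ~p[8], p[7:0]} << (2*i))
                        + ({15'b0, neg[i-1]} << (2*(i-1)));
        end
        // First row: the 3-bit addition absorbs the two leading ones and neg3.
        qq       = {2'b00, ~p0[8], p0[7:6]} + {4'b0110, neg[3]};
        row0     = {5'b0, qq, p0[5:0]};
        sum      = row0 + rest;
        expected = x * y;
        if (sum !== expected) errors = errors + 1;
      end
    if (errors == 0) $display("Four-row array matches x * y for all 8-bit operands.");
    else             $display("%0d mismatches found.", errors);
    $finish;
  end
endmodule
```

This is a functional model only: it says nothing about delay, which is the subject of the evaluation in Section 5.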

4.2 Extension to Rectangular Multipliers:

A number of potential extensions to the proposed method exist, including rectangular multipliers, higher radix MBE, and multipliers with fused accumulation [11]. Here, we briefly focus on m * n rectangular multipliers. With no loss of generality, we assume m >= n, i.e., m = n + m’ with m’ >= 0, since it leads to a smaller number of rows; for simplicity, and also with no loss of generality, in the following we assume that both m and n are even. Now, we have seen in Fig. 6a that for m’ = 0 the last neg bit, i.e., neg_{n/2-1}, belongs to the same column as the first-row partial product bit pp_{n-2,0}. We observe that the first partial product row has bits up to pp_{m,0}; therefore, in order to also include in the first row the contribution of neg_{n/2-1}, due to the particular nature of the operands, it is necessary to perform a carry propagation over m’ + 3 positions (i.e., a (m’ + 3)-bit addition) in the sum that merges neg_{n/2-1} and the two leading ones into the most significant bits of the first row. Thus, for rectangular multipliers, the proposed approach can be applied with the cost of a (m’ + 3)-bit addition.

The complete or even partial execution overlap of the first row with other rows generation clearly depends on a number of factors, including the value of m’ and the way that the (m’ + 3)-bit addition is implemented, but the proposed approach still offers an interesting alternative that can be explored for designing and implementing rectangular multipliers.
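The width of the short addition in the rectangular case can be checked by counting columns. The sketch below assumes, as in Fig. 6a, that the last neg bit sits in column n - 2 and that the complemented most significant bit of the first row sits in column m; the symbol w for the adder width is introduced here only for illustration.

```latex
\begin{align*}
w &= \underbrace{m}_{\text{column of } \overline{pp_{m,0}}}
   - \underbrace{(n-2)}_{\text{column of the last } neg} + 1
   = (n + m') - n + 3 = m' + 3,\\
w\,\big|_{m'=0} &= 3 \quad \text{(the 3-bit addition of the square case).}
\end{align*}
```

For example, under these assumptions a 12 * 8 multiplier (m’ = 4) would need a 7-bit addition on the first row.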

5. EVALUATION AND COMPARISONS:

In this section, the proposed method based on the addition of the last neg signal to the first row is first evaluated. The designed architecture is then compared with an implementation based on the computation of the two’s complement of the last row (referred to as “Two’s complement” method) using the designs for the 3-5 decoders, 4-1 multiplexers, and two’s complement tree in [9]. Moreover, in the analysis, the standard MBE implementations for the first row and for a generic partial product row are also taken into account (as summarized in Table 2).

For all the implementations, we explicitly evaluate the most common case of an n x n multiplier, although we have shown in Section 4 that the proposed approach can also be extended to m x n rectangular multipliers. While studying the framework of possible implementations, we considered the first phase of the multiplication algorithm (i.e., the partial product generation) and we focused our attention on the issues of area occupancy and modular design, since it is reasonable to expect that they lead to a possibly small multiplier with regular layout. The detailed results of some extensive evaluations and comparisons, based both on theoretical analysis and on related implementations, are reported in [11]. Results encompass the following:

1. Theoretical analysis based on the concept of equivalent gates from Gajski’s analysis [21] (as in [9]),

2. Theoretical analysis based on delay and area costs for elementary gates in a standard cell library,

3. Theoretical analysis showing that the proposed approach, in the version minimizing area, can very likely overlap the generation of the first row with the generation of the other rows, and

4. Validation by logic synthesis and technology mapping to an industrial cell library.

All the results show the feasibility of the proposed approach. Here, for the sake of simplicity, we briefly summarize the results of the theoretical analysis and we check the validity of our estimations through logic synthesis and simulation.

5.1 High-Level Remarks and Theoretical Analysis:

As can be seen from Fig. 6, the generation of the first row is different from the generation of the other rows, basically for two reasons:

1. The first row needs to assimilate the last neg signal, an operation which requires an addition over the three most significant bit weights;

2. The first row can take advantage of a simpler Booth recoding, as the y-1 bit used by the MBE is always equal to zero (Section 4).

As seen before, Fig. 8 shows a possible implementation to generate the first row, which takes into account the simpler generation of the MBE signals. We have seen that by combining the two parts of Fig. 8 we get Fig. 9, which is faster than Fig. 8 at a possibly slightly larger area cost; this cost is in any case very marginal with respect to the global area of all the partial product bits coming from the other rows. We have done some rough simulations and found that a good trade-off could be to have the generation of the three most significant bits of the first row carried out by the circuit of Fig. 9, followed by the cascaded addition provided by Fig. 7 (Section 4).

Based on all of the above, our architecture has been designed to perform the following operations: 1. Generation of the three most significant bit weights of the first row (through the very small and regular circuitry of Fig. 9) and addition to these bits of the neg signal (by means of the circuitry of Fig. 7); 2. Generation of the other bits of the first row, using the circuitry depicted in Fig. 8; and 3. Generation of the bits of the other rows, using the circuitry of Fig. 1.

As these three operations can be carried out in parallel, the overall critical path of the proposed architecture emerges from the largest delay among the above paths.

Critical path and area cost for the proposed architecture, as well as for the other implementations in Table 2, were computed with reference to a 130 nm HCMOS standard cell library from STMicroelectronics [22] (later used also for obtaining overall synthesis results). In this analysis, the contribution of wires was neglected, and a buffer-free configuration was considered. Nonetheless, details regarding buffer stages location and size are discussed in [11]. Data concerning area and delay for elementary cells used in this work (as well as in [9]) are reported in Table 3. Results are reported in Tables 4 and 5, respectively. It is worth observing that results may vary depending on specific parameters selected for the synthesis such as logic implementation, optimization strategies, and target libraries.

We observe that the “Two’s Complement” approach has a delay that is longer than the delay to generate the standard partial product rows, becoming even longer as the size n of the multiplier increases (e.g., exceeding the delay of a XNOR2 gate starting from n = 16). On the other hand, according to theoretical estimations, we can see that the delay for generating the first row in the proposed method is estimated to be lower than the delay for generating the standard rows. This means that the extra row is eliminated without any penalty on the overall critical path.

With respect to area costs, it can be observed that the proposed method hardly introduces any area overhead with respect to the standard generation of a partial product row. On the other hand, the “Two’s Complement” approach requires additional hardware, which increases with the size of the multiplier.

5.2 Implementation Results:

In order to further check the validity of our estimations in an implementation technology, we implemented the designs in Table 2 through logic synthesis and technology mapping to an industrial standard cell library. Specifically, for the logic synthesis, we used Synopsys Design Compiler and the designs were mapped to a 130 nm HCMOS industrial library from STMicroelectronics [22].

To perform the evaluation, we obtained the area-delay space for the sole generation of the partial product row of interest (i.e., the first row in the proposed approach, the last row in the implementation presented in [9]). In order to support the comparison, the area-delay space for the generation of the partial product rows using standard MBE implementations was also evaluated, by considering the first row and the other rows of the partial product array separately (Table 2). The results, obtained for n = 8, 16, and 32, are depicted in Fig. 10.

The delays are shown both in absolute units (ns) and normalized to the delay of an inverter with a fan-out of four (68 ps for the technology used, under worst-case conditions). Accordingly, the area is presented both in absolute units (µm²) and normalized to equivalent gates using the area of a NAND2 gate (4.39 µm² for the technology used). We obtained several design points (using different target delays) for each approach, and the minimum delay shown corresponds to the fastest design that the tool was capable of synthesizing.
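As a quick worked example of these two normalizations, the conversion can be written out explicitly. The constants 68 ps and 4.39 µm² are the ones quoted for the technology used; the sample values of 5 FO4 and 100 equivalent gates are chosen here only for illustration and are not results from the tables.

```latex
\begin{align*}
t_{\mathrm{FO4}} &= \frac{t}{68\ \mathrm{ps}}, &
5\ \mathrm{FO4} &= 5 \times 68\ \mathrm{ps} = 340\ \mathrm{ps} = 0.34\ \mathrm{ns},\\
A_{\mathrm{eq}} &= \frac{A}{4.39\ \mu\mathrm{m}^2}, &
100\ \text{equivalent gates} &= 100 \times 4.39\ \mu\mathrm{m}^2 = 439\ \mu\mathrm{m}^2 .
\end{align*}
```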

We observe that the “Proposed method” implementation produces a curve in the delay-area graph bounded by the curve for the generation of a standard partial product (upper bound) and by the curve for the standard generation of the first partial product (lower bound) for the three values of n considered. Moreover, the minimum delay that is achieved is very similar to the case of the generation of a standard partial product for n = 8 and n = 16 (with our approach it is about 0.5-0.7 FO4 higher), and is even less for n = 32 due to the predominant effect of the higher loading of the control signals. Therefore, our scheme does not introduce any additional delay in the partial product generation stage for target delays higher than about 5 FO4.

The curve for our scheme gets closer to the curve corresponding to the standard generation of the first partial product as n increases. This is due to the fact that as n increases, the short addition of the leading part achieves more overlap with the generation of the rest of the partial product (with higher input load capacitance, as n increases).

The “Two’s Complement” scheme achieves minimum delays between 7 and 10 FO4, at the cost of requiring more than four times the area at this point, compared to the “Proposed method” approach. Most importantly, its delay is much higher than the one of any standard row.

6. CONCLUSIONS:

Two’s complement n x n multipliers using radix-4 Modified Booth Encoding produce [n/2] partial products but, due to the sign handling, the partial product array has a maximum height of [n/2] + 1. We presented a scheme that produces a partial product array with a maximum height of [n/2], without introducing any extra delay in the partial product generation stage. With the extra hardware of a (short) 3-bit addition, and the simpler generation of the first partial product row, we have been able to achieve a delay for the proposed scheme within the bound of the delay of a standard partial product row generation. The outcome is that the reduction of the maximum height of the partial product array by one unit may simplify the partial product reduction tree, both in terms of delay and regularity of the layout. This is of interest for all multipliers, and especially for single-cycle short bit-width multipliers for high-performance embedded cores, where short bit-width multiplications are common operations. We have also compared our approach with a recent proposal with the same aim, considering results obtained with a widely used industrial synthesis tool and a modern industrial technology library, and concluded that our approach may improve both the performance and area requirements of square multiplier designs. The proposed approach also applies with minor modifications to rectangular and to general radix-B Modified Booth Encoding multipliers.

7. References:

1. M.D. Ercegovac and T. Lang, Digital Arithmetic. Morgan Kaufmann Publishers, 2003.

2. S.K. Hsu, S.K. Mathew, M.A. Anders, B.R. Zeydel, V.G. Oklobdzija, R.K. Krishnamurthy, and S.Y. Borkar, “A 110GOPS/W 16-Bit Multiplier and Reconfigurable PLA Loop in 90-nm CMOS,” IEEE J. Solid State Circuits, vol. 41, no. 1, pp. 256-264, Jan. 2006.

3. H. Kaul, M.A. Anders, S.K. Mathew, S.K. Hsu, A. Agarwal, R.K. Krishnamurthy, and S. Borkar, “A 300 mV 494GOPS/W Reconfigurable Dual-Supply 4-Way SIMD Vector Processing Accelerator in 45 nm CMOS,” IEEE J. Solid State Circuits, vol. 45, no. 1, pp. 95-101, Jan. 2010.

4. M.S. Schmookler, M. Putrino, A. Mather, J. Tyler, H.V. Nguyen, C. Roth, M. Sharma, M.N. Pham, and J. Lent, “A Low-Power, High-Speed Implementation of a PowerPC Microprocessor Vector Extension,” Proc. 14th IEEE Symp. Computer Arithmetic, pp. 12-19, 1999.

5. O.L. MacSorley, “High Speed Arithmetic in Binary Computers,” Proc. IRE, vol. 49, pp. 67-91, Jan. 1961.

6. L. Dadda, “Some Schemes for Parallel Multipliers,” Alta Frequenza, vol. 34, pp. 349-356, May 1965.

7. C.S. Wallace, “A Suggestion for a Fast Multiplier,” IEEE Trans. Electronic Computers, vol. EC-13, no. 1, pp. 14-17, Feb. 1964.

8. D.E. Shaw, “Anton: A Specialized Machine for Millisecond-Scale Molecular Dynamics Simulations of Proteins,” Proc. 19th IEEE Symp. Computer Arithmetic, p. 3, 2009.

9. J.-Y. Kang and J.-L. Gaudiot, “A Simple High-Speed Multiplier Design,” IEEE Trans. Computers, vol. 55, no. 10, pp. 1253-1258, Oct. 2006.

10. J.-Y. Kang and J.-L. Gaudiot, “A Fast and Well-Structured Multiplier,” Proc. Euromicro Symp. Digital System Design, pp. 508-515, Sept. 2004.

11. F. Lamberti, N. Andrikos, E. Antelo, and P. Montuschi, “Speeding-Up Booth Encoded Multipliers by Reducing the Size of Partial Product Array,” internal report, http://arith.polito.it/ir_mbe.pdf, pp. 1-14, 2009.

12. E.M. Schwarz, R.M. Averill III, and L.J. Sigal, “A Radix-8 CMOS S/390 Multiplier,” Proc. 13th IEEE Symp. Computer Arithmetic, pp. 2-9, 1997.

13. W.-C. Yeh and C.-W. Jen, “High-Speed Booth Encoded Parallel Multiplier Design,” IEEE Trans. Computers, vol. 49, no. 7, pp. 692-701, July 2000.

14. Z. Huang and M.D. Ercegovac, “High-Performance Low-Power Left-to-Right Array Multiplier Design,” IEEE Trans. Computers, vol. 54, no. 3, pp. 272-283, Mar. 2005.

15. R. Zimmermann and D.Q. Tran, “Optimized Synthesis of Sum-of-Products,” Proc. Conf. Record of the 37th Asilomar Conf. Signals, Systems and Computers, vol. 1, pp. 867-872, 2003.

16. V.G. Oklobdzija, D. Villeger, and S.S. Liu, “A Method for Speed Optimized Partial Product Reduction and Generation of Fast Parallel Multipliers Using an Algorithmic Approach,” IEEE Trans. Computers, vol. 45, no. 3, pp. 294-306, Mar. 1996.

17. P.F. Stelling, C.U. Martel, V.G. Oklobdzija, and R. Ravi, “Optimal Circuits for Parallel Multipliers,” IEEE Trans. Computers, vol. 47, no. 3, pp. 273-285, Mar. 1998.

18. J.-Y. Kang and J.-L. Gaudiot, “A Logarithmic Time Method for Two’s Complementation,” Proc. Int’l Conf. Computational Science, pp. 212-219, 2005.

19. K. Hwang, Computer Arithmetic Principles, Architectures, and Design. Wiley, 1979.

20. R. Hashemian and C.P. Chen, “A New Parallel Technique for Design of Decrement/Increment and Two’s Complement Circuits,” Proc. 34th Midwest Symp. Circuits and Systems, vol. 2, pp. 887-890, 1991.

21. D. Gajski, Principles of Digital Design. Prentice-Hall, 1997.

22. STMicroelectronics, “130nm HCMOS9 Cell Library,” http://www.st.com/stonline/products/technologies/soc/evol.htm, 2010.

Syntax report

Started : "Check Syntax for Partial_product".

=========================================================================

* HDL Compilation *

=========================================================================

Compiling verilog file "fig8b.v" in library work

Compiling verilog file "fig8a.v" in library work

Module <fig8b> compiled


Module <fig8a> compiled



Compiling verilog file "fig1.v" in library work


Module <Partial_product> compiled

No errors in compilation

Analysis of file <"Partial_product.prj"> succeeded.

Process "Check Syntax" completed successfully

Synthesis report

Release 9.2i - xst J.36

Copyright (c) 1995-2007 Xilinx, Inc. All rights reserved.

--> Parameter TMPDIR set to ./xst/projnav.tmp

CPU : 0.00 / 0.13 s | Elapsed : 0.00 / 0.00 s

--> Parameter xsthdpdir set to ./xst

CPU : 0.00 / 0.13 s | Elapsed : 0.00 / 0.00 s

--> Reading design: Partial_product.prj

TABLE OF CONTENTS

1) Synthesis Options Summary

2) HDL Compilation

3) Design Hierarchy Analysis

4) HDL Analysis

5) HDL Synthesis

5.1) HDL Synthesis Report

6) Advanced HDL Synthesis

6.1) Advanced HDL Synthesis Report

7) Low Level Synthesis

8) Partition Report

9) Final Report

9.1) Device utilization summary

9.2) Partition Resource Summary

9.3) TIMING REPORT

=========================================================================

* Synthesis Options Summary *

=========================================================================

---- Source Parameters

Input File Name : "Partial_product.prj"

Input Format : mixed

Ignore Synthesis Constraint File : NO

---- Target Parameters

Output File Name : "Partial_product"

Output Format : NGC

Target Device : xc3s500e-5-cp132

---- Source Options

Top Module Name : Partial_product

Automatic FSM Extraction : YES

FSM Encoding Algorithm : Auto

Safe Implementation : No

FSM Style : lut

RAM Extraction : Yes

RAM Style : Auto

ROM Extraction : Yes

Mux Style : Auto

Decoder Extraction : YES

Priority Encoder Extraction : YES

Shift Register Extraction : YES

Logical Shifter Extraction : YES

XOR Collapsing : YES

ROM Style : Auto

Mux Extraction : YES

Resource Sharing : YES

Asynchronous To Synchronous : NO

Multiplier Style : auto

Automatic Register Balancing : No

---- Target Options

Add IO Buffers : YES

Global Maximum Fanout : 500

Add Generic Clock Buffer(BUFG) : 24

Register Duplication : YES

Slice Packing : YES

Optimize Instantiated Primitives : NO

Use Clock Enable : Yes

Use Synchronous Set : Yes

Use Synchronous Reset : Yes

Pack IO Registers into IOBs : auto

Equivalent register Removal : YES

---- General Options

Optimization Goal : Speed

Optimization Effort : 1

Library Search Order : Partial_product.lso

Keep Hierarchy : NO

RTL Output : Yes

Global Optimization : AllClockNets

Read Cores : YES

Write Timing Constraints : NO

Cross Clock Analysis : NO

Hierarchy Separator : /

Bus Delimiter : <>

Case Specifier : maintain

Slice Utilization Ratio : 100

BRAM Utilization Ratio : 100

Verilog 2001 : YES

Auto BRAM Packing : NO

Slice Utilization Ratio Delta : 5

=========================================================================

=========================================================================

* HDL Compilation *

=========================================================================








Compiling verilog file "fig1.v" in library work


Module <Partial_product> compiled

No errors in compilation

Analysis of file <"Partial_product.prj"> succeeded.

=========================================================================

* Design Hierarchy Analysis *

=========================================================================

Analyzing hierarchy for module <Partial_product> in library <work>.

Analyzing hierarchy for module <fig8a> in library <work>.

Analyzing hierarchy for module <fig8b> in library <work>.

Analyzing hierarchy for module <fig1a> in library <work>.

Analyzing hierarchy for module <fig1b> in library <work>.

=========================================================================

* HDL Analysis *

=========================================================================

Analyzing top module <Partial_product>.

Module <Partial_product> is correct for synthesis.

Analyzing module <fig8a> in library <work>.

Module <fig8a> is correct for synthesis.

Analyzing module <fig8b> in library <work>.

Module <fig8b> is correct for synthesis.

Analyzing module <fig1a> in library <work>.

Module <fig1a> is correct for synthesis.

Analyzing module <fig1b> in library <work>.

Module <fig1b> is correct for synthesis.

=========================================================================

* HDL Synthesis *

=========================================================================

Performing bidirectional port resolution...

Synthesizing Unit <fig8a>.

Related source file is "fig8a.v".

Unit <fig8a> synthesized.

Synthesizing Unit <fig8b>.

Related source file is "fig8b.v".

Found 1-bit xor2 for signal <pp0j_0$xor0000>.

Unit <fig8b> synthesized.

Synthesizing Unit <fig1a>.

Related source file is "fig1a.v".

Found 1-bit xor2 for signal <onei<0>>.

Unit <fig1a> synthesized.

Synthesizing Unit <fig1b>.

Related source file is "fig1b.v".

Found 1-bit xor2 for signal <ppij_0$xor0000>.

Unit <fig1b> synthesized.

Synthesizing Unit <Partial_product>.

Related source file is "fig1.v".

WARNING:Xst:1306 - Output <qq90bar> is never assigned.

WARNING:Xst:1306 - Output <qq60> is never assigned.




WARNING:Xst:646 - Signal <pp01> is assigned but never used.

WARNING:Xst:646 - Signal <pp60<0>> is assigned but never used.




WARNING:Xst:646 - Signal <pp70<0>> is assigned but never used.




WARNING:Xst:1780 - Signal <pp80> is never used or assigned.

Unit <Partial_product> synthesized.

=========================================================================

HDL Synthesis Report

Macro Statistics

# Xors : 35

1-bit xor2 : 35

=========================================================================

=========================================================================

* Advanced HDL Synthesis *

=========================================================================

Loading device for application Rf_Device from file '3s500e.nph' in environment C:\Xilinx92i.

WARNING:Xst:1290 - Hierarchical block <row1pp60> is unconnected in block <Partial_product>.

It will be removed from the design.

WARNING:Xst:1290 - Hierarchical block <row1pp70> is unconnected in block <Partial_product>.


WARNING:Xst:1290 - Hierarchical block <row2p011> is unconnected in block <Partial_product>.

=========================================================================

Advanced HDL Synthesis Report

Macro Statistics

# Xors : 35

1-bit xor2 : 35

=========================================================================

=========================================================================

* Low Level Synthesis *

=========================================================================

Optimizing unit <Partial_product> ...

Mapping all equations...

Building and optimizing final netlist ...

Found area constraint ratio of 100 (+ 5) on block Partial_product, actual ratio is 0.

Final Macro Processing ...

=========================================================================

Final Register Report

Found no macro

=========================================================================

=========================================================================

* Partition Report *

=========================================================================

Partition Implementation Status

-------------------------------

No Partitions were found in this design.

-------------------------------

=========================================================================

* Final Report *

=========================================================================

Final Results

RTL Top Level Output File Name : Partial_product.ngr

Top Level Output File Name : Partial_product

Output Format : NGC

Optimization Goal : Speed

Keep Hierarchy : NO

Design Statistics

# IOs : 82

Cell Usage :

# BELS : 6

# LUT3 : 1

# LUT4 : 5

# IO Buffers : 44

# IBUF : 8

# OBUF : 36

=========================================================================

Device utilization summary:

---------------------------

Selected Device : 3s500ecp132-5

Number of Slices: 3 out of 4656 0%

Number of 4 input LUTs: 6 out of 9312 0%

Number of IOs: 82

Number of bonded IOBs: 44 out of 92 47%

---------------------------

Partition Resource Summary:

---------------------------

No Partitions were found in this design.

---------------------------

=========================================================================

TIMING REPORT

NOTE: THESE TIMING NUMBERS ARE ONLY A SYNTHESIS ESTIMATE.

FOR ACCURATE TIMING INFORMATION PLEASE REFER TO THE TRACE REPORT

GENERATED AFTER PLACE-and-ROUTE.

Clock Information:

------------------

No clock signals found in this design

Asynchronous Control Signals Information:

----------------------------------------

No asynchronous control signals found in this design

Timing Summary:

---------------

Speed Grade: -5

Minimum period: No path found

Minimum input arrival time before clock: No path found

Maximum output required time after clock: No path found

Maximum combinational path delay: 6.176ns

Timing Detail:

--------------

All values displayed in nanoseconds (ns)

=========================================================================

Timing constraint: Default path analysis

Total number of paths / destination ports: 138 / 36

-------------------------------------------------------------------------

Delay: 6.176ns (Levels of Logic = 3)

Source: Mr<1> (PAD)

Destination: pp00<5> (PAD)

Data Path: Mr<1> to pp00<5>

Gate Net

Cell:in->out fanout Delay Delay Logical Name (Net Name)

---------------------------------------- ------------

IBUF:I->O 6 1.106 0.721 Mr_1_IBUF (Mr_1_IBUF)

LUT3:I0->O 6 0.612 0.569 row1pp00/pp0j_0_not00001 (pp00_5_OBUF)

OBUF:I->O 3.169 pp00_5_OBUF (pp00<5>)

----------------------------------------

Total 6.176ns (4.887ns logic, 1.289ns route)

(79.1% logic, 20.9% route)

=========================================================================

CPU : 3.31 / 3.45 s | Elapsed : 3.00 / 3.00 s

-->

Total memory usage is 147804 kilobytes

Number of errors : 0 ( 0 filtered)

Number of warnings : 71 ( 0 filtered)

Number of infos : 0 ( 0 filtered)

Test bench coding

////////////////////////////////////////////////////////////////////////////////

// Copyright (c) 1995-2007 Xilinx, Inc.

// All Right Reserved.

////////////////////////////////////////////////////////////////////////////////

// ____ ____

// / /\/ /

// /___/ \ / Vendor: Xilinx

// \ \ \/ Version : 9.2i

// \ \ Application : ISE

// / / Filename : ts_tb_selfcheck.tfw

// /___/ /\ Timestamp : Mon Jan 23 18:06:08 2012

// \ \ / \

// \___\/\___\

//

//Command:

//Design Name: ts_tb_selfcheck_beh

//Device: Xilinx

//

`timescale 1ns/1ps

module ts_tb_selfcheck_beh;

reg [7:0] Md = 8'b00000000;

reg [7:0] Mr = 8'b00000000;

wire [15:0] km;

wire [15:0] k1;

wire [15:0] k2;

wire [15:0] k3;

wire [15:0] k4;

test UUT (

.Md(Md),

.Mr(Mr),

.km(km),

.k1(k1),

.k2(k2),

.k3(k3),

.k4(k4));

integer TX_ERROR = 0;

initial begin // Open the results file...

#1000 // Final time: 1000 ns

if (TX_ERROR == 0) begin

$display("No errors or warnings.");

end else begin

$display("%d errors found in simulation.", TX_ERROR);

end

$stop;

end

initial begin

// ------------- Current Time: 200ns

#200;

Mr = 8'b00001100;

// -------------------------------------

// ------------- Current Time: 250ns

#50;

CHECK_k2(16'b0000101111111100);

CHECK_k3(16'b0011000000000100);

// -------------------------------------

// ------------- Current Time: 300ns

#50;

Md = 8'b00110111;

// -------------------------------------

// ------------- Current Time: 350ns

#50;

CHECK_km(16'b0000001010010100);

CHECK_k2(16'b0000101100100000);

CHECK_k3(16'b0011001101110100);

// -------------------------------------

// ------------- Current Time: 600ns

#250;

Md = 8'b01111101;

Mr = 8'b10101101;

// -------------------------------------

// ------------- Current Time: 650ns

#50;

CHECK_km(16'b1101011101111001);

CHECK_k1(16'b0000010010111101);

CHECK_k2(16'b0000101000001000);

CHECK_k3(16'b0010100000100100);

CHECK_k4(16'b1010000010010000);

// -------------------------------------

// ------------- Current Time: 800ns

#150;

Md = 8'b10011001;

Mr = 8'b10011001;

// -------------------------------------

// ------------- Current Time: 850ns

#50;

CHECK_km(16'b0010100101110001);

CHECK_k1(16'b0000001111011001);

CHECK_k2(16'b0000111100110100);

CHECK_k3(16'b0010001100100100);

CHECK_k4(16'b1111001101000000);

end

task CHECK_km;

input [15:0] NEXT_km;

#0 begin

if (NEXT_km !== km) begin

$display("Error at time=%dns km=%b, expected=%b", $time, km, NEXT_km);

TX_ERROR = TX_ERROR + 1;

end

end

endtask

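task CHECK_k1;

input [15:0] NEXT_k1;

#0 begin

if (NEXT_k1 !== k1) begin

$display("Error at time=%dns k1=%b, expected=%b", $time, k1, NEXT_k1);

TX_ERROR = TX_ERROR + 1;

end

end

endtask

task CHECK_k2;

input [15:0] NEXT_k2;

#0 begin

if (NEXT_k2 !== k2) begin

$display("Error at time=%dns k2=%b, expected=%b", $time, k2, NEXT_k2);

TX_ERROR = TX_ERROR + 1;

end

end

endtask

task CHECK_k3;

input [15:0] NEXT_k3;

#0 begin

if (NEXT_k3 !== k3) begin

$display("Error at time=%dns k3=%b, expected=%b", $time, k3, NEXT_k3);

TX_ERROR = TX_ERROR + 1;

end

end

endtask

task CHECK_k4;

input [15:0] NEXT_k4;

#0 begin

if (NEXT_k4 !== k4) begin

$display("Error at time=%dns k4=%b, expected=%b", $time, k4, NEXT_k4);

TX_ERROR = TX_ERROR + 1;

end

end

endtask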

endmodule

Output waveform:

Schematic diagram:

Technical schematic diagram:
