Final thesis
Low Power Design Using RNS
by
Viktor Classon
LITH-ISY-EX--14/4792--SE
2014-08-25
Supervisor, ISY: Oscar Gustafsson
Supervisor, Ericsson: Shafqat Ullah
Examiner: Mark Vesterbacka
Abstract
Power dissipation has become one of the major limiting factors in the design of digital ASICs. Low power dissipation will increase the mobility of the ASIC by reducing the system cost, size and weight. DSP blocks are a major source of power dissipation in modern ASICs. The residue number system (RNS) has, for a long time, been proposed as an alternative to the regular two’s complement number system (TCS) in DSP applications to reduce the power dissipation. The basic concept of RNS is to first encode the input data into several smaller independent residues. The computational operations are then performed in parallel and the results are eventually decoded back to the original number system. Due to the inherent parallelism of residue arithmetic, a hardware implementation results in multiple smaller design units. Therefore an RNS design requires low leakage power cells and will result in a lower switching activity.
The residue number system has been analyzed by first investigating different implementations of RNS adders and multipliers (which are the basic arithmetic functions in a DSP system) and then deriving an optimal combination of these. The optimum combinations have been used to implement an FIR filter in RNS that has been compared with a TCS FIR filter.
By providing different input data and coefficients to both the RNS and TCS FIR filters, an evaluation of their respective performance in terms of area, power and operating frequency has been performed. The result is promising for uniformly distributed random input data, with approximately 15 % reduction of average power with RNS compared to TCS. For a realistic DSP application with normally distributed input data, the power reduction is negligible for practical purposes.
Acknowledgements
First of all I would like to thank the employees at the section Digital ASIC at Ericsson in Kista for all the help and support and most of all for giving me the opportunity of doing my master’s thesis. A big thanks especially to my supervisor Shafqat Ullah at Ericsson for all the support and for sharing his knowledge with me. I also want to thank my examiner, Mark Vesterbacka, and my supervisor at ISY, Oscar Gustafsson, for all the help and support during the thesis.
Most of all I want to thank my parents, Maria and Svante, and my partner, Linda, for supporting me during my five years of studies. Without you I would not have been able to make it!
Finally I would like to thank all fellow students at Linköping University and master thesis students at Digital ASIC for a fantastic time and supporting company! Especially a big thanks to Emil Lundqvist for his review of this work as an opponent and to Ejaz Sadiq for being a great sounding board during my master’s thesis.
Stockholm, August 2014
Viktor Classon
Contents
1 Introduction
  1.1 Problem statement
  1.2 Methodology
  1.3 Prior work
  1.4 Outline
  1.5 Limitations
2 Background
  2.1 RNS arithmetic
    2.1.1 Basic arithmetic operations
    2.1.2 Conversion
    2.1.3 Choosing a moduli-set
  2.2 FIR filters
  2.3 Design flow
    2.3.1 Synthesis
    2.3.2 Profile developing
    2.3.3 Power
3 Proposed design
  3.1 Arithmetic functions
    3.1.1 Addition
    3.1.2 Multiplication
  3.2 Conversion
    3.2.1 Forward conversion
    3.2.2 Reverse conversion
  3.3 Choosing a moduli-set
    3.3.1 Modulus for comparison
4 Implementation
  4.1 RNS addition
    4.1.1 LUT and binary adders
    4.1.3 End-around carry parallel-prefix adder
    4.1.4 Parallel-prefix adder using the diminished-one number representation for modulo 2^n − 1
    4.1.5 Addition using Verilog’s built-in modulo operator
    4.1.6 Ordinary addition for modulo 2^n
  4.2 Multiplication
    4.2.0 LUT based multiplication
    4.2.1 Modulo-m product-partitioning multiplier with ROM
    4.2.2 Parallel-prefix multiplier for modulo 2^n − 1
    4.2.3 Parallel-prefix multiplier for modulo 2^n + 1
    4.2.4 Modular multiplication using the isomorphic technique
    4.2.5 High radix modulo 2^n − 1 multiplier
    4.2.6 Using Verilog’s built-in operators
    4.2.7 Ordinary multiplication for modulo 2^n
  4.3 Forward conversion
    4.3.1 RNS adder tree
    4.3.2 Periodicity
    4.3.3 Forward conversion for modulo 2^n − 1
    4.3.4 Using Verilog’s built-in modulo operator
    4.3.5 Forward conversion for modulo 2^n
  4.4 Reverse conversion
    4.4.1 CRT
  4.5 Choosing a moduli set
  4.6 FIR filter
5 Results
  5.1 Input data and coefficients
    5.1.1 Uniformly distributed data and coefficients
    5.1.2 Sawtooth data and coefficients ramp
    5.1.3 Realistic input data and FIR coefficients
    5.1.4 Different properties of the data and coefficients
  5.2 Adders and multipliers
    5.2.1 Adders
    5.2.2 Multipliers
  5.3 Moduli-set
  5.4 FIR filters
    5.4.1 Varying input word length
    5.4.2 Varying number of taps
    5.4.3 Folded FIR filter
  5.5 Maximum frequency
6 Discussion and conclusions
  6.1 Adders
  6.2 Multipliers
  6.3 FIR filters
7 Future Work
Appendix
  A Modulus
  B Optimum moduli-sets
  C RNS adders results
  D RNS multiplier results
List of Figures
1.1 The basic principle of RNS
2.1 Flowchart for profile development
4.1 RNS addition using two binary adders
4.2 Hybrid version of RNS addition
4.3 Logic operators for the parallel-prefix adder
4.4 End-around carry parallel-prefix adder with Sklansky parallel-prefix structure
4.5 Adder based on the diminished-one number representation
4.6 Adder using Verilog’s built-in operator
4.7 Modulo 2^n addition using a binary adder
4.8 Modulo-m product-partitioning multiplier with ROM
4.9 Multiplier based on parallel-prefix RNS adders
4.10 Multiplier based on the isomorphic technique
4.11 Modular high-radix RNS multiplier
4.12 Multiplier based on the isomorphic technique
4.13 Multiplier based on binary multiplier for modulo 2^n
4.14 Forward conversion with registers at input and output
4.15 Forward conversion using an RNS adder tree
4.16 Reverse conversion
4.17 Reverse conversion using CRT
4.18 Modulo-m product-partitioning multiplier with combinatorial logic instead of LUT. Changes from ordinary RNS multiplier are shown in white.
4.19 Direct-form FIR filter
4.20 Transposed direct-form FIR filter
4.21 Folded FIR
5.1 Discrete uniform distributions for different number of bits
5.2 Sawtooth data and ramp coefficients
5.3 Histogram for realistic input data for a 20-bit FIR filter
5.4 Frequency response for some different FIR filter coefficients
5.5 Description of the RNS multiplier and adder graphs
5.6 Test setup for RNS adders and multipliers
5.7 Total power dissipation for all RNS adders using uniformly distributed input data as described in section 5.1.1.
5.8 The best RNS adder for each modulo compared with RNS adders for modulo 2^n. Power dissipation was calculated using uniformly distributed input data as described in section 5.1.1.
5.9 RNS adders type 0 and 1
5.10 RNS adders type 2 and 3
5.11 RNS adders type 4 and 5
5.12 RNS adders type 6
5.13 All RNS multipliers
5.14 RNS multipliers type 0 and 1
5.15 RNS multipliers type 2 and 4
5.16 RNS multipliers type 5 and 6
5.17 RNS multipliers type 7 and TCS multiplier
5.18 Combinations of RNS multipliers with a maximum of 3, 5, 7, 9 and 11 RNS multipliers in the moduli-set compared with TCS multiplier
5.19 Combinations of RNS adders with a maximum of 3, 5, 7, 9 and 11 RNS adders in the moduli-set compared with RNS adder for modulo 2^n (which is almost identical to a TCS adder)
5.20 64-tap FIR filter with varying input bit width for RNS and TCS. Uniform data as described in section 5.1.1. The red line represents the power reduction.
5.21 16-tap FIR filter with varying input word length for RNS and TCS. Sawtooth data with ramp coefficients as described in section 5.1.2. The red line represents the power reduction.
5.22 20-bit FIR filter with varying number of taps for RNS and TCS. Uniformly distributed data and coefficients are used as described in section 5.1.1. The red line represents the power reduction.
5.23 20-bit FIR filter with varying number of taps for RNS and TCS. Realistic data with constant FIR coefficients are used as described in section 5.1.3. The red line represents the power reduction.
List of Tables
2.1 Example of signed and unsigned representations using the moduli-set {m_1, m_2} = {2, 3}, M = 6.
3.1 Definition of different adder types
3.2 Definition of different RNS multiplier types
3.3 Definition of different RNS forward conversion types
3.4 Definition of different RNS reverse conversion types
4.1 Periodicity of some residues
5.1 Sign switching rate of input data
5.2 Theoretical toggle rate at output of a 20-bit input multiplication. The optimum moduli-sets as presented in table 5.5 are used.
5.3 The best adder type for chosen modulo with respect to power; refer to table 3.1 for details about the adder types.
5.4 The best multiplier type for chosen modulo with respect to power; refer to table 3.2 for details about the multiplier types.
5.5 Some of the optimum moduli-sets and their resulting number of bits. For the complete list refer to Appendix B.
5.6 Results for an FIR filter folded 22 times with 20-bit input and 22 taps
5.7 Synthesis results for 4-tap FIR filter with 20 or 30 input bit-width. The synthesis maximum frequency goal was set to 1.5 GHz.
B.1 Resulting moduli-sets
C.1 Results for RNS adders
D.1 Results for RNS multipliers
Nomenclature
ASIC Application-specific integrated circuit
CRT Chinese remainder theorem
DSP Digital signal processing
FIR Finite impulse response
LUT Look-up table
RNS Residue number system
ROM Read only memory
RTL Register-transfer level
TCS Two’s complement number system
VHDL Very high speed integrated circuit hardware description language
VLSI Very large scale integration
Chapter 1
Introduction
Power dissipation has become one of the major limiting factors in the design of digital ASICs. Low power dissipation will increase the mobility of the ASIC by reducing the system cost, size and weight. DSP blocks are a major source of power dissipation in modern ASICs. The residue number system (RNS) has, for a long time, been proposed as an alternative to the regular two’s complement number system (TCS) in DSP applications to reduce the power dissipation. Some research has shown that implementing FIR filters in RNS instead of TCS can give a reduction in power dissipation. FIR filters are among the less complex DSP blocks. A general sketch of how RNS computations can be performed is shown in figure 1.1. The earliest usage of the residue number system can be found in The Mathematical Classic of Sun Tzu by the Chinese mathematician Sun Tzu, who lived in the 3rd century AD. A famous riddle from his book [1] is quoted below.
Now there are an unknown number of things.
If we count by threes, there is a remainder of 2.
If we count by fives, there is a remainder of 3.
If we count by sevens, there is a remainder of 2.
Find the number of things.
(Sun Tzu)
Figure 1.1: The basic principle of RNS (blocks: operands, forward conversion, modulo channels for modulo m_1, modulo m_2, ..., modulo m_n, reverse conversion, results)
1.1 Problem statement
The problem to be investigated in this thesis is to compare RNS with TCS. This will be done by implementing FIR filters in RNS and TCS and comparing the two implementations. The requirement on the RNS implementation is to minimize the power while still being able to run the circuit at 500 MHz and without a massive increase in area. Both the RNS and the TCS implementation shall be able to receive and process one sample per clock cycle. Another design goal is to be able to process an input of around 20 bits. An important idea in the thesis is that RNS could in the future be implemented in large parts of the ASIC, in which case the forward and reverse conversion will not contribute as much as the computational operations to power dissipation and area. The implementation and results will therefore focus on an implementation without the conversion. The ASIC is intended for, and implemented in, 32 nm technology. The aim of the thesis can be summarized as answering these four questions:
• Is RNS better than TCS with respect to power, area and timing?
• How can RNS be implemented and what different design choices can be made?

• What further extensions of RNS exist that can further improve its properties?
• How can RNS be introduced into Ericsson’s systems?
1.2 Methodology
The thesis work has been performed at Ericsson in Kista, Stockholm. The work has been executed in the following way:
1. Literature study
2. Implementation of adders and multipliers in RNS
3. Comparison of individual RNS adders and multipliers
4. Study of which RNS adders and multipliers to use and which combinations of them result in the lowest power dissipation
5. Implementation of RNS and TCS FIR filter
6. Comparison of RNS and TCS FIR filters
7. Implementation of forward and reverse conversion
8. Comparison of different forward and reverse conversion techniques
9. Analysis of the different results
1.3 Prior work
The arithmetic of a residue number system and its application to digital signal processing and computer technology has earlier been described in [2], [3] and [4]. The use of RNS for reduction of power in FIR filters has earlier been discussed in, for example, [5], [6] and [7] with good results. Two promising results can be seen in figure 5 from [7] and figure 6 from [5], where the power is significantly lower with RNS compared to TCS.
Figure 5 from [7] shows an RNS FIR filter with forward and reverse conversion, 16-bit coefficients and a 32-bit dynamic range, compared with a TCS FIR filter designed with the same restrictions.
Figure 6 from [5] shows the dynamic and static power dissipation of an RNS FIR filter compared with a TCS FIR filter. Both the RNS and the TCS filter have 10-bit input and coefficients and a dynamic range of 20 bits. Note that neither in figure 5 from [7] nor in figure 6 from [5] do the authors take into account the increasing bit width in the accumulator due to the number of taps.
1.4 Outline
A brief introduction to the thesis is given in chapter 1. In chapter 2 the basic mathematical principles of RNS are presented. From these basic mathematical principles a set of different implementations of RNS is presented, and a subset of these forms the proposed design presented in chapter 3. The detailed implementation is presented in chapter 4 and the simulation results of the implementation are presented in chapter 5. The results are discussed in chapter 6. From the results and the discussion some conclusions can be made, which are presented in chapter 7 together with suggestions of future work areas in the subject.
1.5 Limitations
The aim of the thesis is to investigate RNS; hence individual TCS adders and multipliers will not be implemented (here the synthesis tool will decide which adders and multipliers to use). The focus of the thesis has been on RNS-specific algorithms and not on low power algorithms that are suitable for TCS or FIR filters in general. When implementing the individual RNS adders and multipliers the focus has been on the structure and not on the exact implementation of the ordinary binary adders and binary multipliers used in the implementation; again, this has been left to the synthesis tool in most cases. The major limitation of the thesis work is that it has a time budget of 20 weeks.
Chapter 2
Background
The basic concept of a residue number system (RNS) is to represent a large number with a set of smaller integers. In RNS some computations can be performed more efficiently. RNS originates from the Chinese remainder theorem (CRT) of modular arithmetic, which was first described by the Chinese third-century mathematician Sun Tzu [4]. The CRT can be used to solve his famous riddle quoted in chapter 1.
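As an illustration, the riddle can be solved mechanically with the textbook CRT construction. The following Python sketch is only an assumption about how one might write this in software; the function name `crt` is hypothetical and the code is not the reverse conversion hardware described in chapter 4.

```python
from math import prod

def crt(residues, moduli):
    """Return the unique X in [0, M) with X % m_i == r_i.

    Assumes the moduli are pairwise relatively prime.
    """
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        M_i = M // m             # product of all the other moduli
        inv = pow(M_i, -1, m)    # inverse of M_i modulo m (Python 3.8+)
        x += r * M_i * inv
    return x % M

# Sun Tzu's riddle: remainders 2, 3 and 2 when counting by 3, 5 and 7.
print(crt([2, 3, 2], [3, 5, 7]))   # prints 23
```

The answer is unique only within the dynamic range 3 · 5 · 7 = 105; adding any multiple of 105 gives another solution of the riddle.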
2.1 RNS arithmetic
RNS arithmetic is based on the mathematical congruence relation. Let a and b be integers. These integers are said to be congruent modulo m if a − b is exactly divisible by m. In mathematical contexts this is often written as a ≡ b (mod m). The number m is called a modulus or base.
Now let q be the quotient and r be the remainder from the division of the integer a by the modulus m, a = q · m + r. From the congruence definition above we then have a ≡ r (mod m). The integer r is the residue of a with respect to m, which will be denoted as r = |a|_m. We shall assume that r ∈ {0, 1, 2, ..., m − 1}, that is, r lies in the set of least positive residues modulo m.
Now define a moduli-set as {m_1, m_2, ..., m_N} that contains N positive and pairwise relatively prime moduli. That is, for every i and j where i ≠ j, the moduli m_i and m_j in the moduli-set have no common divisor larger than unity. Now M can be defined as the dynamic range of the RNS moduli-set. M can be computed as the product of the moduli in the moduli-set according to equation (2.1).
M = \prod_{n=1}^{N} m_n.    (2.1)
For every moduli-set, a number X < M has a unique representation consisting of the N residues. This representation can be calculated as {x_i = |X|_{m_i} : 1 ≤ i ≤ N}. We shall write such a representation as ⟨x_1, x_2, ..., x_N⟩.
Example 1 Take the moduli-set {3, 5, 7}, then m_1 = 3, m_2 = 5 and m_3 = 7. The dynamic range of the moduli-set will be

M = \prod_{n=1}^{3} m_n = m_1 \cdot m_2 \cdot m_3 = 3 \cdot 5 \cdot 7 = 105.

Now let X = 10. Then ⟨x_1, x_2, x_3⟩ can be calculated as follows:

x_1 = |X|_{m_1} = |10|_3 = |3 \cdot 3 + 1|_3 = |3 \cdot 3|_3 + |1|_3 = 1
x_2 = |X|_{m_2} = |10|_5 = |2 \cdot 5|_5 = 0
x_3 = |X|_{m_3} = |10|_7 = |1 \cdot 7 + 3|_7 = |1 \cdot 7|_7 + |3|_7 = 3.

So X = 10 can be represented as ⟨1, 0, 3⟩ in the RNS moduli-set {3, 5, 7}.
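The example can be reproduced in a few lines of Python. This is a minimal sketch for illustration only; the helper name `to_rns` is an assumption and does not come from the thesis.

```python
from itertools import combinations
from math import gcd, prod

def to_rns(x, moduli):
    """Return the residues <x_1, ..., x_N> of x for the given moduli-set."""
    if any(gcd(a, b) != 1 for a, b in combinations(moduli, 2)):
        raise ValueError("moduli must be pairwise relatively prime")
    return tuple(x % m for m in moduli)

moduli = (3, 5, 7)
print(prod(moduli))          # dynamic range M, prints 105
print(to_rns(10, moduli))    # prints (1, 0, 3)
```

Every X in the range 0 to M − 1 maps to a distinct residue tuple, which is what makes the representation unique.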
A residue number system can be used to represent both signed and unsigned numbers. For unsigned numbers, RNS can represent numbers in the range 0 ≤ X ≤ M − 1. For signed numbers, RNS can represent numbers that satisfy one of the following relations:

-\frac{M-1}{2} \le X \le \frac{M-1}{2}    if M is odd
-\frac{M}{2} \le X \le \frac{M}{2} - 1    if M is even.
See table 2.1 for an example of RNS representation for signed and unsigned numbers.

⟨x_1, x_2⟩   Unsigned   Signed
⟨0, 0⟩       0          0
⟨1, 1⟩       1          1
⟨0, 2⟩       2          2
⟨1, 0⟩       3          −3
⟨0, 1⟩       4          −2
⟨1, 2⟩       5          −1

Table 2.1: Example of signed and unsigned representations using the moduli-set {m_1, m_2} = {2, 3}, M = 6.
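The signed interpretation in the table follows directly from the two range relations. The sketch below, with the hypothetical helper names `signed_range` and `to_signed`, regenerates the rows of table 2.1 under that assumption.

```python
def signed_range(M):
    """Return (lowest, highest) signed value for a dynamic range M."""
    if M % 2:                        # M is odd
        half = (M - 1) // 2
        return -half, half
    return -(M // 2), M // 2 - 1     # M is even

def to_signed(x, M):
    """Interpret an unsigned value 0 <= x < M as a signed value."""
    _, highest = signed_range(M)
    return x if x <= highest else x - M

moduli = (2, 3)
M = 6
for x in range(M):
    residues = tuple(x % m for m in moduli)
    print(residues, x, to_signed(x, M))
```

Note that the residues themselves do not change; only the interpretation of the upper part of the range differs between the unsigned and the signed case.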
2.1.1 Basic arithmetic operations
Addition, subtraction and multiplication are quite straightforward to calculate in RNS. Division, sign determination, overflow detection and magnitude comparison are significantly harder to implement. For addition, subtraction and multiplication the only difference from ordinary TCS operations is that the result has to be in the range [0 : m − 1]. Addition X + Y = Z can be calculated as

X + Y = \langle x_1, x_2, ..., x_n \rangle + \langle y_1, y_2, ..., y_n \rangle = \langle z_1, z_2, ..., z_n \rangle = Z

where z_i = |x_i + y_i|_{m_i}.

Multiplication X · Y = Z can be calculated in a similar fashion

X \cdot Y = \langle x_1, x_2, ..., x_n \rangle \cdot \langle y_1, y_2, ..., y_n \rangle = \langle z_1, z_2, ..., z_n \rangle = Z

where z_i = |x_i \cdot y_i|_{m_i}.

Note that for addition x_i + y_i ≤ 2(m_i − 1), whereas for multiplication x_i · y_i ≤ (m_i − 1)^2. The reduction required to get a result in the range [0 : m − 1] can therefore be much greater for multiplication, which makes RNS multipliers more complex to implement than RNS adders.
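The channel-wise nature of these two operations can be sketched in Python as follows. The function names `rns_add` and `rns_mul` are assumptions made for this illustration, and the moduli-set is the one from Example 1.

```python
def rns_add(x, y, moduli):
    """Channel-wise addition: z_i = |x_i + y_i|_{m_i}."""
    return tuple((a + b) % m for a, b, m in zip(x, y, moduli))

def rns_mul(x, y, moduli):
    """Channel-wise multiplication: z_i = |x_i * y_i|_{m_i}."""
    return tuple((a * b) % m for a, b, m in zip(x, y, moduli))

moduli = (3, 5, 7)
X = (1, 0, 3)    # residues of 10
Y = (1, 2, 0)    # residues of 7
print(rns_add(X, Y, moduli))   # prints (2, 2, 3), the residues of 17
print(rns_mul(X, Y, moduli))   # prints (1, 0, 0), the residues of 70
```

No carry or other information is exchanged between the channels, which is the parallelism that the thesis aims to exploit.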
2.1.2 Conversion
The goal of the forward and reverse conversion is to convert a number represented in TCS into RNS, and from RNS into TCS.
Forward conversion
Conversion from TCS to RNS can in a straightforward way be computed using division, where the remainder of the division will be the residue.
Reverse conversion
Reverse conversion is described from an implementation perspective in chapter 4.
2.1.3 Choosing a moduli-set
There exist in general two types of moduli, arbitrary and special. The special moduli are usually the ones that are used in a special moduli-set, {2^n − 1, 2^n, 2^n + 1}, or extensions of this. The arbitrary moduli are the remaining integers, including the primes.
In this thesis it will be assumed that the arbitrary sets consist only of primes, since completely arbitrary moduli are not guaranteed to be relatively prime. The special sets are designed to be more hardware efficient and are only guaranteed to be relatively prime. Using only prime moduli probably gives the best moduli-set from a purely mathematical view [8], but the special sets might have other advantages. The desired moduli for comparison are therefore the primes and those fulfilling the requirements of a special set.
Special moduli-sets
The most common special moduli-set is {2^n − 1, 2^n, 2^n + 1} and extensions of this [4]. The use of this moduli-set is often motivated by less complicated implementation of RNS to TCS converters and the fact that dedicated hardware multipliers can be used on FPGA platforms [9]. A common extension is to add 2^(n±q) ± 1, where q ≥ 1, to the moduli-set.
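That the special moduli-set is pairwise relatively prime, while an arbitrary choice of moduli need not be, can be checked with a short Python sketch. The helper names are hypothetical and the set (4, 6, 7) is an arbitrary counterexample chosen for this illustration.

```python
from itertools import combinations
from math import gcd, prod

def pairwise_coprime(moduli):
    """True if no two moduli share a divisor larger than unity."""
    return all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))

def special_set(n):
    """The special moduli-set {2^n - 1, 2^n, 2^n + 1}."""
    return (2**n - 1, 2**n, 2**n + 1)

print(special_set(4))                      # prints (15, 16, 17)
print(pairwise_coprime(special_set(4)))    # prints True
print(prod(special_set(4)))                # dynamic range, prints 4080
print(pairwise_coprime((4, 6, 7)))         # prints False, 4 and 6 share the factor 2
```

The dynamic range of the special set is 2^n · (2^(2n) − 1), so three n-bit channels (one of them n + 1 bits wide) cover slightly less than 3n bits.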
2.2 FIR filters
Finite-duration impulse response (FIR) filters are probably the most commonly used digital filters. An FIR filter is based on the mathematical concept of discrete convolution, where the filtered output of a signal can be calculated using equation (2.2) [10].

y[n] = \sum_{i=0}^{N} h[i] x[n-i].    (2.2)

In equation (2.2) y[n] is the output, x[n] is the input and h[i] are the coefficients. N is defined as the order of the filter and the filter will have N + 1 taps.
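Equation (2.2) translates directly into code. The sketch below is a plain software model written for illustration, assuming that input samples before the first one are zero; the function name `fir_filter` is an assumption.

```python
def fir_filter(x, h):
    """Evaluate y[n] = sum_{i=0}^{N} h[i] * x[n - i] for every input sample.

    h holds the N + 1 coefficients; samples x[k] with k < 0 are taken as zero.
    """
    y = []
    for n in range(len(x)):
        acc = 0
        for i, coeff in enumerate(h):
            if n - i >= 0:
                acc += coeff * x[n - i]
        y.append(acc)
    return y

# A unit impulse returns the coefficients themselves (the impulse response).
print(fir_filter([1, 0, 0, 0], [1, 2, 3]))   # prints [1, 2, 3, 0]
```

Each output sample needs N + 1 multiplications and N additions, which is why adders and multipliers are the arithmetic functions of interest.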
2.3 Design flow
Each implementation was performed in the steps presented below. If an error occurred at any step, the process was restarted from step 1.
1. Implement
2. Simulate and verify with TCS result
3. Synthesize
4. Do analysis of synthesis and develop a profile in terms of area, power and delay
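Step 2 amounts to checking that the RNS channels produce the residues of the ordinary integer result. In the thesis this is done by simulating the hardware description; the Python sketch below only mimics the idea in software for a multiply-accumulate operation. The moduli-set (7, 8, 9) and all function names are assumptions made for this illustration.

```python
from math import prod

MODULI = (7, 8, 9)    # hypothetical pairwise relatively prime set, M = 504

def reference_mac(xs, hs):
    """Ordinary integer multiply-accumulate, used as the reference result."""
    return sum(x * h for x, h in zip(xs, hs))

def rns_mac(xs, hs, moduli=MODULI):
    """The same computation carried out independently in each modulo channel."""
    out = []
    for m in moduli:
        acc = 0
        for x, h in zip(xs, hs):
            acc = (acc + (x % m) * (h % m)) % m
        out.append(acc)
    return tuple(out)

def verify(xs, hs, moduli=MODULI):
    """Compare the channel results with the residues of the reference result."""
    expected = reference_mac(xs, hs)
    if expected >= prod(moduli):
        raise ValueError("result exceeds the dynamic range")
    return rns_mac(xs, hs, moduli) == tuple(expected % m for m in moduli)

print(verify([1, 2, 3, 4], [5, 6, 7, 8]))   # prints True
```

The dynamic range check matters: a result of M or larger would wrap around and could no longer be distinguished from a smaller value.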
Figure 2.1: Flowchart for profile development (blocks: RTL Design, Synthesis, Netlist, Verilog Simulation, Input Data, Switching Activity File, Cell Library, Power Calculation, Power Reports, Synthesis Reports)
2.3.1 Synthesis
The RTL code has been synthesized using Synopsys Design Compiler®. During synthesis (for all designs, both RNS and TCS) some optimizations will be done by the synthesis tool. The synthesis tool will try to minimize the power while still fulfilling the required critical path [11].
2.3.2 Profile developing
The design flow for developing a profile in terms of area, delay and power is shown in figure 2.1. The sources of the area, delay, power and other interesting parameters are presented below:
Synthesis Reports Area (cell library specific), gate count, UVT cell ratio (see section 2.3.3), etc.
Power Reports Power dissipation (leakage power, switching power andinternal power), delay, critical path, etc.
2.3.3 Power
The power calculations are made in the Power Calculation block in figure 2.1. The power dissipation can be divided into dynamic and static power dissipation. Dynamic power dissipation consists of switching and internal power dissipation. The power dissipation reports that are generated from PrimeTime¹ are described in [13]. Note that both static and dynamic power in the equations below scale with the size of the design as well.
• Static power
– Leakage power: P_l = V \cdot I_{leak}
• Dynamic power
– Switching power: P_s = \frac{1}{2} \cdot C_{load} \cdot V^2 \cdot f
– Internal power: P_{int} = (\frac{1}{2} \cdot C_{int} \cdot V^2 \cdot f) + (V \cdot I_{shortcut})
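To get a feeling for how the terms scale, the three expressions can be evaluated numerically. The Python sketch below is only an illustration; the function names and all numeric values are hypothetical and are not taken from the cell library or the power reports.

```python
def leakage_power(v, i_leak):
    """Static leakage power P_l = V * I_leak."""
    return v * i_leak

def switching_power(c_load, v, f):
    """Dynamic switching power P_s = 1/2 * C_load * V^2 * f."""
    return 0.5 * c_load * v**2 * f

def internal_power(c_int, v, f, i_shortcut):
    """Dynamic internal power P_int = 1/2 * C_int * V^2 * f + V * I_shortcut."""
    return 0.5 * c_int * v**2 * f + v * i_shortcut

# Hypothetical operating point: 1 fF load, 1.0 V supply, 500 MHz.
p_nominal = switching_power(1e-15, 1.0, 500e6)
p_low_v = switching_power(1e-15, 0.5, 500e6)
print(p_nominal)             # about 2.5e-07 W
print(p_low_v / p_nominal)   # 0.25, halving V quarters the switching power
```

The quadratic dependence on V and the linear dependence on f and on the capacitance are the only properties this sketch is meant to show.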
Different standard cells
Depending on which standard cell the synthesis tool chooses, the leakage power consumption will be different. A bigger V_T will result in smaller leakage. The synthesis tool can choose between the following standard cell types (sorted in decreasing V_T):
UVT Ultra-high V_T
SVT Super-high V_T
MVT Mezzanine V_T
HVT High V_T
¹ The Synopsys PrimeTime suite provides a single, golden, trusted signoff solution for timing, signal integrity, power, timing constraint and variation-aware analysis. [12]
Chapter 3
Proposed design
3.1 Arithmetic functions
The basic arithmetic functions of an FIR filter are addition and multiplication. These operations can be implemented in many different ways in the residue number system. The basic complication with RNS is dealing with the modulo overflow that occurs when the result exceeds m_i − 1. For a modulus m_i the result of an operation always has to be within the range {0, ..., m_i − 1}. For addition the intermediate result will be in the range {0, ..., 2(m_i − 1)}, and therefore at most one subtraction of m_i has to be performed to get into the correct range. For multiplication, on the other hand, the product will be in the range {0, ..., (m_i − 1)^2}, which complicates the reduction.
To find out which algorithms for addition and multiplication are the best in terms of power dissipation, simulations will be made on individual adders and multipliers for all chosen moduli.
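The single conditional subtraction for addition can be illustrated with a small Python model. It follows the idea of computing both the sum and the sum minus the modulus and then selecting one of them; the function name `mod_add` is hypothetical and the code is a software sketch, not the hardware description used in the thesis.

```python
def mod_add(a, b, m):
    """Modular addition of two residues 0 <= a, b < m.

    Since a + b <= 2 * (m - 1), at most one subtraction of m is needed.
    """
    s = a + b      # first addition, result in 0 .. 2(m - 1)
    t = s - m      # second addition (subtraction of the modulus)
    return t if t >= 0 else s

print(mod_add(5, 6, 7))   # prints 4, since 11 - 7 = 4
print(mod_add(2, 3, 7))   # prints 5, no correction needed
```

For multiplication the same trick does not suffice: the largest product (m − 1)^2 equals m · (m − 2) + 1, so up to m − 2 subtractions of m would be needed if the reduction were done by repeated subtraction.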
3.1.1 Addition
Three basic approaches for designing addition for an arbitrary modulus are presented in [14]: using a LUT, using two ordinary binary adders, and a hybrid between these two. Each of these three will be optimal in terms of area and timing for certain moduli [14].
An interesting approach to implementing addition in the special moduli-set {2^n − 1, 2^n, 2^n + 1} by using a parallel-prefix adder is presented in [4]. A more detailed description is available in [15]. Due to the low level of this approach, [16] can be used as an initial implementation idea.
The Verilog language and the synthesis tool support the built-in Verilog operators “+” (addition) and “%” (modulus). An implementation with only these operators will be a good naive reference when comparing with the other implementations. For modulo 2^n the trivial implementation using a standard adder will also be used. The additions to implement are summarized in table 3.1.

Type   Description
0      Look-up table (LUT) based RNS adder
1      Two binary adders
2      A hybrid between 0 and 1
3      Modulo 2^n − 1 using modified parallel-prefix adder
4      Modulo 2^n + 1 using diminished-one number representation
5      Using Verilog’s built-in operators “+” and “%”
6      Ordinary adder for modulo 2^n

Table 3.1: Definition of different adder types
3.1.2 Multiplication
RNS multiplication can be implemented in a huge variety of ways. A promising implementation, presented in [17], is a modulo-m product-partitioning multiplier with ROM. This implementation seems more promising than multiplication by the reciprocal of the modulus as described in [18], since the latter uses three multipliers instead of two.
For the special set {2^n − 1, 2^n, 2^n + 1} some improvements in terms of area, power and delay can be made. A parallel modulo-m multiplier for 2^n ± 1 is presented in [4] without any special speed-up techniques. This implementation might be interesting especially for relatively small n.
An implementation for 2^n ± 1 using Booth-8 encoding is presented in [19]. This approach is compared with other implementations with a good result for n ≥ 32, though this may extrapolate to a good result at lower n as well. If this is not the case, a Booth-4 encoding could be used. The Booth encoding technique is well known in other contexts than RNS and will therefore not be investigated further in this thesis.
Another interesting approach is [20]. In [5] an isomorphic technique is used to replace multiplication with addition and a look-up table. This implementation would be very interesting.
The different RNS multipliers that have been selected for implementation are presented in table 3.2.
3.2 Conversion
As with the RNS adders and multipliers, several forward and reverse conversion algorithms should be investigated.

Type  Description
0     Look-up table (LUT) based RNS multiplier
1     Modulo-m product-partitioning multiplier with ROM
2     Parallel modulo-m multiplier for $2^n - 1$
3     Parallel modulo-m multiplier for $2^n + 1$
4     Isomorphism technique as described in [21]
5     High radix multiplier for modulo $2^n - 1$ [20]
6     Using Verilog’s built-in operators “*” and “%”
7     Ordinary multiplication for modulo $2^n$
Table 3.2: Definition of different RNS multiplier types
3.2.1 Forward conversion
Forward conversion is generally far less complicated to implement than reverse conversion. Even though the residue number system needs to be able to represent a certain bit width, the input is mostly represented with a much smaller bit width, which reduces the complexity. The general way of solving the forward conversion problem relies on the fact that an $n$-bit TCS number can be calculated in the following well-known manner: $-a_{n-1}2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i$. The most straightforward solution is to calculate the sum of the $a_i 2^i$ terms using RNS adders instead of TCS adders. By slightly modifying the solution on page 64 in [4], negative numbers can be supported as well.
A modification of this algorithm is to use the periodic properties of the modulus. The periodic properties can be derived by calculating the residue $2^i \bmod m$ for each $i$.
A look-up table based solution is also possible, but the table would have to contain all possible input combinations of $n_{input}$ bits, each mapped to an RNS value of $n_{rns} \ge n_{input}$ bits. Because the table size grows exponentially with the input width, this solution is excluded from further investigation.
In [22] a modular exponentiation algorithm is proposed that seems promising. Unfortunately it is very complex and therefore very difficult to implement in a parametrized way for arbitrary modulus and input bit width.
Several other sequential algorithms have been proposed in [4], but these do not produce one result per clock cycle and are therefore not investigated further.
Type Description
0 RNS adder tree
1 RNS adder tree with periodicity
2 Forward conversion for the special moduli-set
3 Using SystemVerilog’s built-in operators
Table 3.3: Definition of different RNS forward conversion types
3.2.2 Reverse conversion
Reverse conversion is the conversion process from RNS to TCS. The main methods for implementing reverse conversion use either the Chinese Remainder Theorem (CRT) or the Mixed-Radix Conversion (MRC) technique. All other techniques are variants of these two [4]. Of these, CRT is the most straightforward solution. MRC utilizes “mixed-radix” techniques and would require far more investigation. Other implementations involve using pseudo-SRT division (a division algorithm modified so that it only produces the remainder) or the core function (as described in [4]). Another interesting implementation would be to use a LUT. Unfortunately the resulting LUT would be larger than what a synthesis tool would support. The reverse converter to be implemented is presented in table 3.4.
Type Description
1 Using CRT
Table 3.4: Definition of different RNS reverse conversion types
3.3 Choosing a moduli-set
Previous research [8], [5], [7] has shown that, when the number of taps in an FIR filter is large, a significant amount of the power dissipation still takes place during the regular computations and not in the forward or reverse conversion. Therefore the initial guess of which moduli-sets to choose was made by comparing the power dissipation of a simple one-tap FIR filter element without conversion. These simple components were designed in various ways and then an optimal (or near optimal) combination was calculated. There are basically two groups of moduli-sets: arbitrary and special sets, as described in section 2.1.3 on page 7.
3.3.1 Modulus for comparison
Since the basic idea of RNS is to choose several small numbers to represent a big number, it is advantageous to choose these numbers quite small (but not necessarily as small as possible). A requirement is that the RNS FIR filter must be able to compute inputs that are 20 bits wide, and due to the multiplication the incoming word length has to be extended to 40 bits. Therefore the moduli for the comparison described above are chosen as follows.
• All primes between 2 and 251
• All numbers of the form $2^n$ or $2^n \pm 1$ where $n \le 14$ (to get a dynamic range of $2^{40}$ with the moduli-set $\{2^n - 1, 2^n, 2^n + 1\}$)
• The closest prime smaller than $2^n$ for each $n \le 14$. If these turn out to be optimal, more similar primes may be added.
Note that these sets intersect and no modulus shall be tested twice. These rules result in the set of integers presented in Appendix A.
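The three rules can be sketched in Python as follows. This is only an illustration of the selection rules, not the script used for the thesis: the function names are hypothetical, the lower bound $n \ge 1$ is an assumption, and the authoritative list remains the one in Appendix A.

```python
def is_prime(x):
    """Trial division; adequate for the small moduli considered here."""
    if x < 2:
        return False
    d = 2
    while d * d <= x:
        if x % d == 0:
            return False
        d += 1
    return True


def prev_prime(x):
    """Largest prime strictly smaller than x (x must be greater than 2)."""
    x -= 1
    while not is_prime(x):
        x -= 1
    return x


def candidate_moduli(max_prime=251, max_n=14):
    # Rule 1: all primes between 2 and max_prime.
    moduli = {p for p in range(2, max_prime + 1) if is_prime(p)}
    for n in range(1, max_n + 1):          # assumption: n starts at 1
        # Rule 2: 2^n and 2^n +/- 1.
        moduli.update({2**n - 1, 2**n, 2**n + 1})
        # Rule 3: closest prime below 2^n (none exists below 2).
        if 2**n > 2:
            moduli.add(prev_prime(2**n))
    moduli.discard(1)                      # modulus 1 carries no information
    return sorted(moduli)                  # the set guarantees no duplicates


moduli = candidate_moduli()
```

Using a set makes the "no modulus shall be tested twice" rule automatic, since the three rules overlap (for example, 3 is both a prime and $2^2 - 1$).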
Chapter 4
Implementation
The main implementation philosophy has been to use parametrized modules and functions. The implementation has been done at RTL level in SystemVerilog [23]. It can therefore not be guaranteed (and is in fact unlikely) that the synthesis tool maps (as described in section 2.3.1 on page 9) the RTL code directly to the hardware structure described by the RTL code. It has, however, been verified that the functionality is consistent.
The parametrization of the RTL code makes the RNS implementation easily adaptable to new DSP algorithms, scalable in terms of number of taps and bit-widths, and easily modifiable for new algorithms for, for example, adders and multipliers.
4.1 RNS addition
The main issue with RNS addition is that the sum has to be within the range $[0, m_i - 1]$. The corresponding binary adder produces a sum in the range $[0, 2(m_i - 1)]$, and therefore at most one modulo reduction (a single subtraction of $m_i$) is required.
4.1.1 LUT and binary adders
The most direct approach to implementing RNS addition is to use a look-up table (LUT), two binary adders, or a combination of these.
LUT
The LUT RNS adder implementation is a straightforward ROM storing the residue of the sum for each combination of the two inputs.
Two binary adders
By using one binary adder for the addition and a second adder for subtracting the modulus, which serves both as modulo reduction and as overflow detection, a quite neat RNS adder was implemented, as shown in figure 4.1.
Figure 4.1: RNS addition using two binary adders
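A behavioral sketch of the two-adder structure is given below in Python. It is not the SystemVerilog used in the thesis; the function name is hypothetical and the sign test stands in for the multiplexer select signal of the figure.

```python
def rns_add_two_adders(a, b, m):
    """Modular addition of residues a, b in [0, m-1] using two additions."""
    s = a + b          # first binary adder, result in [0, 2(m-1)]
    t = s - m          # second binary adder adds -m
    # The sign of t selects which of the two sums is the valid residue.
    return s if t < 0 else t
```

Because the first sum never exceeds $2(m - 1)$, a single subtraction of $m$ is always sufficient.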
Hybrid
The hybrid RNS adder consists of one adder connected to a LUT. The LUT stores the resulting residue for each sum of the adder.
Figure 4.2: Hybrid version of RNS addition
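The difference between the pure LUT adder and the hybrid adder is mainly the size of the table. The Python sketch below (hypothetical function names, lists standing in for ROMs) shows that the pure LUT needs $m^2$ entries while the hybrid LUT only needs one entry per possible binary sum, that is $2m - 1$ entries.

```python
def build_full_lut(m):
    """ROM addressed by both operands: m * m entries."""
    return [[(a + b) % m for b in range(m)] for a in range(m)]


def build_hybrid_lut(m):
    """ROM addressed by the binary sum a + b, which lies in [0, 2(m-1)]."""
    return [s % m for s in range(2 * (m - 1) + 1)]


def add_lut(a, b, lut):
    return lut[a][b]


def add_hybrid(a, b, lut):
    return lut[a + b]      # one binary adder followed by the LUT
```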
4.1.3 End-around carry parallel-prefix adder
The end-around carry parallel-prefix adder is designed to work only for modulo $2^n - 1$. Its advantage is that, by using the end-around carry, it uses approximately the same hardware as an ordinary parallel-prefix adder.
The parallel-prefix adder was implemented by translating the RNS adder in [16] from VHDL to SystemVerilog. It uses a Sklansky parallel-prefix structure with an end-around carry. The adder uses different logic operators as shown in figure 4.3. The exact behavior of the logic operators is described in equation (4.1).
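Figure 4.3: Logic operators for the parallel-prefix adder

The four operators of figure 4.3 are, in order, a pass-through node, a prefix node, an input cell and an output cell:

$$
\begin{aligned}
\text{Pass-through node:}\quad & G^{l}_{i:k} = G^{l-1}_{i:k}, \qquad P^{l}_{i:k} = P^{l-1}_{i:k} \\
\text{Prefix node:}\quad & G^{l}_{i:k} = G^{l-1}_{i:j+1} \lor \left( G^{l-1}_{j:k} \land P^{l-1}_{i:j+1} \right), \qquad P^{l}_{i:k} = P^{l-1}_{j:k} \land P^{l-1}_{i:j+1} \\
\text{Input cell:}\quad & g_i = \begin{cases} a_0 \land b_0 \lor a_0 \land c_0 \lor b_0 \land c_0 & \text{if } i = 0 \\ a_i \land b_i & \text{otherwise} \end{cases}, \qquad p_i = a_i \oplus b_i \\
\text{Output cell:}\quad & c_{i+1} = G^{m}_{i:0}, \qquad s_i = p_i \oplus c_i
\end{aligned}
\tag{4.1}
$$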
In equation (4.1), $i = 0, \ldots, n_{bits} - 1$ is the bit position and $l = 1, \ldots, m$ is the level in the prefix structure, where $m$ is the total required depth of the prefix structure (which can be calculated as $\lceil \log_2(n_{bits}) \rceil$). Furthermore, $0 \le k \le j \le i$ (for more details see [15]). An 8-bit example of the parallel-prefix adder can be seen in figure 4.4.
Figure 4.4: End-around carry parallel-prefix adder with Sklansky parallel-prefix structure
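The arithmetic effect of the end-around carry can be illustrated with a short behavioral model in Python. This sketch does not model the Sklansky prefix network itself, only the result it computes; the function name is hypothetical. The key observation is that $2^n \equiv 1 \pmod{2^n - 1}$, so a carry out of bit $n - 1$ is worth exactly one and can be fed back into the least significant bit.

```python
def add_mod_2n_minus_1(a, b, n):
    """Add two residues modulo 2**n - 1 using an end-around carry."""
    mask = (1 << n) - 1
    s = a + b                       # n-bit addition, carry-out lands in bit n
    s = (s & mask) + (s >> n)       # feed the carry-out back into the LSB
    return 0 if s == mask else s    # the all-ones word is a second code for zero
```

The final line handles the double representation of zero that is inherent to modulo $2^n - 1$ arithmetic on $n$ bits.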
4.1.4 Parallel-prefix adder using the diminished-one number representation for modulo $2^n + 1$
By using the fact that modulo $2^n + 1$ can almost be represented with $n$ bits, a diminished-one number representation can be implemented. In a diminished-one representation, $n$ bits represent the number and bit $n + 1$ is used to identify a zero. Hence an ordinary number $X$ can be represented as $X'$ in the diminished-one representation, as presented in equation (4.2).

$$
\begin{aligned}
X = 0 &: \quad X'[n] = 1 \\
X \ne 0 &: \quad X'[n] = 0, \quad X'[n-1:0] = X - 1.
\end{aligned}
\tag{4.2}
$$

The advantage of this adder is that the parallel-prefix structure used in the modulo $2^n - 1$ adder in section 4.1.3 can be reused, with the small change that the end-around carry is inverted. Some forward and reverse conversion is also needed, which is described in figure 4.5. The blocks used in figure 4.5 are the same as those used in the adder for modulo $2^n - 1$ and are described in equation (4.1).
Figure 4.5: Adder based on the diminished-one number representation
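A behavioral sketch of diminished-one addition is shown below in Python. The tuple encoding (zero flag, $n$-bit value) and all function names are assumptions made for illustration; the model reproduces the arithmetic of the inverted end-around carry, not the prefix structure.

```python
def to_dim1(x):
    """Ordinary residue -> (zero flag, diminished value X - 1)."""
    return (1, 0) if x == 0 else (0, x - 1)


def from_dim1(d):
    zero, value = d
    return 0 if zero else value + 1


def add_dim1(da, db, n):
    """Add two diminished-one numbers modulo 2**n + 1."""
    if da[0]:                      # a is zero: the sum is b
        return db
    if db[0]:                      # b is zero: the sum is a
        return da
    t = da[1] + db[1]
    carry = t >> n
    # Inverted end-around carry: add one only when there was no carry-out.
    r = (t & ((1 << n) - 1)) + (1 - carry)
    if r >> n:                     # wrapped to 2**n: the sum is congruent to 0
        return (1, 0)
    return (0, r)


def add_mod_2n_plus_1(x, y, n):
    return from_dim1(add_dim1(to_dim1(x), to_dim1(y), n))
```

For $n = 4$ (modulus 17), adding 1 and 16 gives diminished values 0 and 15, no carry-out, and the incremented sum wraps to $2^4$, which is flagged as zero.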
4.1.5 Addition using Verilog’s built-in modulo operator
Addition using Verilog’s built-in modulo operator can be performed by using the %-sign and then letting the synthesis tool decide what to do with it. The implementation is illustrated in figure 4.6 and can be expressed as

    assign output_sum = (input_a + input_b) % modulo_parameter;

Figure 4.6: Adder using Verilog’s built-in operator
4.1.6 Ordinary addition for modulo $2^n$
The easiest and most efficient implementation of an RNS addition is the one for modulo $2^n$, as it only requires an ordinary binary adder where the resulting carry-out is discarded. The implementation is illustrated in figure 4.7 and can be expressed as

    assign output_sum = input_a + input_b;

Figure 4.7: Modulo $2^n$ addition using a binary adder
4.2 Multiplication
RNS multiplication has the same requirement as RNS addition in that the product has to be within the range $[0, m_i - 1]$. Unfortunately, the product of an ordinary binary multiplier will be within the range $[0, (m_i - 1)^2]$, so up to $m_i - 2$ modulo reductions with $m_i$ would be needed (instead of one in the RNS adder), which would increase the complexity dramatically with increasing modulus.
4.2.0 LUT based multiplication
The look-up table based RNS multiplication uses the two operands as addresses to a two-dimensional look-up table where the product is stored.
4.2.1 Modulo-m product-partitioning multiplier with ROM
A modulo-m product-partitioning multiplier with ROM is presented in [17] and [4] for arbitrary modulus. This multiplier is based on the fact that the product $P \triangleq AB$ can be expressed as in equation (4.3). $AB$ is partitioned into four parts: $W$, $k + 1$ bits; $Z$, $n - (k + 1)$ bits; $Y$, 1 bit and $X$, $n - 1$ bits.

$$
P = AB = 2^{2n-(k+1)} W + 2^{n} Z + 2^{n-1} Y + X. \tag{4.3}
$$

$$
\begin{aligned}
|AB|_m &= \left| 2^{2n-(k+1)} W + 2^{n} Z + 2^{n-1} Y + X \right|_m \\
&= \left| \left| 2^{2n-(k+1)} W + 2^{n-1} Y \right|_m + \left| 2^{n} Z \right|_m + X \right|_m \\
&= \left/\; 2^n = m + c \Rightarrow |2^n|_m = c \;\right/ \\
&= \left| \left| 2^{2n-(k+1)} W + 2^{n-1} Y \right|_m + cZ + X \right|_m.
\end{aligned}
\tag{4.4}
$$

Here $n$ is the number of bits, $c = 2^n - m$, $k = 1 + \lfloor \log_2 c \rfloor$ and $m$ is the modulus. By ensuring that the product is within the range of the moduli, equation (4.3) can be rewritten as equation (4.4).
Since $k$ will be relatively small and $Y$ only consists of one bit, $e \triangleq \left| 2^{2n-(k+1)} W + 2^{n-1} Y \right|_m$ can be pre-calculated for each value of $W$ and $Y$ and stored in a ROM. Due to the number of bits used to store $e$, $cZ$ and $X$, the result will be in the range $0 \le e + cZ + X < 2m$. It is slightly better to instead store $e - m$ in the ROM [4], detect whether the result became negative, and in that case add $m$. The resulting RNS multiplier can be seen in figure 4.8.
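The partitioning of equations (4.3) and (4.4) can be checked with a small behavioral model in Python. The sketch below is written under several assumptions: $n$ is taken as the bit length of $m$, $m$ is not a power of two (so that $c \ge 1$ and $k + 1 \le n$), the ROM stores $e$ rather than $e - m$, and the final correction is a conditional subtraction. All names are hypothetical.

```python
def build_rom(m, n, k):
    """ROM of e = |2^(2n-(k+1)) W + 2^(n-1) Y|_m, addressed by W and Y."""
    shift_w = 2 * n - (k + 1)
    return [[((w << shift_w) + (y << (n - 1))) % m for y in (0, 1)]
            for w in range(1 << (k + 1))]


def mul_product_partitioning(a, b, m):
    assert m > 2 and m & (m - 1), "sketch assumes m is not a power of two"
    n = m.bit_length()
    c = (1 << n) - m
    k = c.bit_length()                        # equals 1 + floor(log2(c))
    rom = build_rom(m, n, k)                  # rebuilt per call for brevity
    p = a * b                                 # 2n-bit binary product
    x = p & ((1 << (n - 1)) - 1)              # X: n - 1 least significant bits
    y = (p >> (n - 1)) & 1                    # Y: 1 bit
    z = (p >> n) & ((1 << (n - k - 1)) - 1)   # Z: n - (k + 1) bits
    w = p >> (2 * n - (k + 1))                # W: k + 1 most significant bits
    s = rom[w][y] + c * z + x                 # lies in [0, 2m)
    return s - m if s >= m else s
```

For example, with $m = 13$ the parameters are $n = 4$, $c = 3$ and $k = 2$, so the ROM holds only $2^{k+1} \cdot 2 = 16$ words instead of the $13^2$ words of a full product table.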
4.2.2 Parallel-prefix multiplier for modulo $2^n - 1$
By reusing the parallel-prefix adder from section 4.1.3 and connecting partial products to it, a parallel-prefix multiplier for modulo $2^n - 1$ can be implemented. The entire multiplication can be rewritten as equation (4.5). Note that due to the properties of RNS, $PP_i$ will always be $n$ bits wide.

$$
|X \cdot Y|_{2^n-1} = \left| \sum_{i=0}^{n-1} PP_i \right|_{2^n-1} \quad \text{where} \quad PP_i = x_i \land \left( y_{n-i-1} \ldots y_0 \, y_{n-1} \ldots y_{n-i} \right). \tag{4.5}
$$

The schematic of the parallel-prefix multiplier for modulo $2^n - 1$ is shown in figure 4.9. Note that more optimal adder tree structures can probably be used.
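Equation (4.5) states that each partial product is the operand $Y$ rotated left by $i$ positions and gated by bit $x_i$, because multiplying by $2^i$ modulo $2^n - 1$ is a cyclic rotation. A minimal Python sketch is given below; the names are hypothetical and the partial products are accumulated in a simple chain rather than in the tree of the figure.

```python
def add_mod_2n_minus_1(a, b, n):
    """Modulo 2**n - 1 addition with end-around carry."""
    mask = (1 << n) - 1
    s = a + b
    s = (s & mask) + (s >> n)
    return 0 if s == mask else s


def rotate_left(y, i, n):
    """Cyclic left rotation of an n-bit word, i.e. y * 2**i mod 2**n - 1."""
    mask = (1 << n) - 1
    return ((y << i) | (y >> (n - i))) & mask


def mul_mod_2n_minus_1(x, y, n):
    acc = 0
    for i in range(n):
        if (x >> i) & 1:                     # PP_i is gated by bit x_i
            acc = add_mod_2n_minus_1(acc, rotate_left(y, i, n), n)
    return acc
```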
Figure 4.8: Modulo-m product-partitioning multiplier with ROM

Figure 4.9: Multiplier based on parallel-prefix RNS adders

4.2.3 Parallel-prefix multiplier for modulo $2^n + 1$
The parallel-prefix multiplier for modulo $2^n + 1$ may be implemented using the diminished-one representation to remove the extra bit required compared to modulo $2^n$. This implementation would require a diminished-one adder, but due to the poor results of this adder (as can be seen in figure 5.7 on page 42) this multiplier has not been implemented.
4.2.4 Modular multiplication using the isomorphic technique
This technique has earlier been used by [5] and [8]. The basic principle of the isomorphic technique is described in [21] and can be summarized as in equation (4.6). When $m$ is a prime there exists a $q$ that fulfills the equation. This means that a multiplicand $n_i$ can instead be represented by $w_i$.

$$
n_i = \left| q^{w_i} \right|_m \quad \text{with} \quad n_i \in [1, m-1], \; w_i \in [0, m-2] \tag{4.6}
$$

For the specific case of a two-input modular multiplier, $i \in [1, 2]$, we get equation (4.7).

$$
|a_1 \cdot a_2|_m = \left| q^{w} \right|_m \quad \text{where} \quad w = |w_1 + w_2|_{m-1} \quad \text{and} \quad a_1 = \left| q^{w_1} \right|_m, \; a_2 = \left| q^{w_2} \right|_m. \tag{4.7}
$$

A direct implementation of equation (4.6) and equation (4.7) can be realized using two different look-up tables, each storing $m - 1$ entries, and a modulo-$(m - 1)$ adder. Since zero cannot be represented by $n_i = |q^{w_i}|_m$, it has to be handled separately, which is done by simple zero-detecting logic. A schematic of the multiplier can be found in figure 4.10.
Figure 4.10: Multiplier based on the isomorphic technique
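The index (logarithm) tables and the zero handling can be sketched in Python as follows. The generator $q$ is found by brute force, which is an assumption about how the tables could be produced and not a description of the thesis flow; all names are hypothetical and $m$ is assumed to be an odd prime.

```python
def find_generator(m):
    """Brute-force search for a q whose powers cover 1 .. m-1 (m odd prime)."""
    for q in range(2, m):
        powers = {pow(q, w, m) for w in range(m - 1)}
        if len(powers) == m - 1:
            return q
    raise ValueError("no generator found; is m an odd prime?")


def build_tables(m):
    q = find_generator(m)
    antilog = [pow(q, w, m) for w in range(m - 1)]         # w_i -> n_i
    log = {value: w for w, value in enumerate(antilog)}    # n_i -> w_i
    return log, antilog


def mul_isomorphic(a, b, m, log, antilog):
    if a == 0 or b == 0:               # zero has no index: detect it separately
        return 0
    w = (log[a] + log[b]) % (m - 1)    # addition modulo m - 1
    return antilog[w]


log13, antilog13 = build_tables(13)
```

Both tables hold $m - 1$ entries, so the storage grows linearly with the modulus instead of quadratically as for a full product table.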
4.2.5 High radix modulo $2^n - 1$ multiplier
The high radix modular RNS multiplier for modulo $2^n - 1$ is based on a multiplier suggested in [4]. In [20] another multiplier is suggested that only works for moduli where $\frac{n-1}{k} = 4$, where $k$ is an integer and $n - 1$ is the number of bits required to represent a number in modulo $2^n - 1$.
This multiplier is based on the fact that a multiplication $A \cdot B$ can be rewritten as a sum of partial products. First divide $A$ and $B$ into two $k$-bit numbers, where $k = \left\lfloor \frac{\lceil \log_2(2^n - 1) \rceil + 1}{2} \right\rfloor$, so that $A = A_1 2^k + A_0$ and $B = B_1 2^k + B_0$. The product $A \cdot B$ can now be rewritten as equation (4.8) by using cyclic convolution.

$$
A \cdot B = (A_1 2^k + A_0) \cdot (B_1 2^k + B_0) = 2^k (A_1 B_0 + A_0 B_1) + (A_1 B_1 + A_0 B_0) = 2^k P_1 + P_0. \tag{4.8}
$$

This can be extended to $|A \cdot B|_{2^n-1} = |2^k P_1 + P_0|_{2^n-1}$ for modulo $2^n - 1$. $P_0$ and $P_1$ can also be expressed as

$$
P_0 = \frac{a^2 - b^2 + c^2 - d^2}{8}, \qquad P_1 = \frac{a^2 - b^2 - c^2 + d^2}{8},
$$

where

$$
\begin{aligned}
a &= A_0 + A_1 + B_0 + B_1 \\
b &= A_0 + A_1 - B_0 - B_1 \\
c &= A_0 - A_1 + B_0 - B_1 \\
d &= A_0 - A_1 - B_0 + B_1.
\end{aligned}
$$

By combining these equations a schematic can be derived as shown in figure 4.11.
Figure 4.11: Modular high-radix RNS multiplier
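The squaring identities can be verified numerically with the Python sketch below. It assumes an even $n$ with $k = n/2$, so that $2^{2k} = 2^n \equiv 1 \pmod{2^n - 1}$ and the cyclic convolution of equation (4.8) holds exactly. It uses the assignment $a = A_0 + A_1 + B_0 + B_1$, $b = A_0 + A_1 - B_0 - B_1$, $c = A_0 - A_1 + B_0 - B_1$, $d = A_0 - A_1 - B_0 + B_1$, which is the one that makes $P_0$ and $P_1$ agree with equation (4.8). The `square` function stands in for the squaring LUT and all names are hypothetical.

```python
def square(v):
    """Stand-in for the squaring look-up table of the schematic."""
    return v * v


def high_radix_mul(A, B, n):
    assert n % 2 == 0, "sketch assumes an even number of bits"
    k = n // 2
    low = (1 << k) - 1
    A0, A1 = A & low, A >> k
    B0, B1 = B & low, B >> k
    a = A0 + A1 + B0 + B1
    b = A0 + A1 - B0 - B1
    c = A0 - A1 + B0 - B1
    d = A0 - A1 - B0 + B1
    # Both numerators are exact multiples of 8 (the ">> 3" of the schematic).
    P0 = (square(a) - square(b) + square(c) - square(d)) // 8
    P1 = (square(a) - square(b) - square(c) + square(d)) // 8
    assert P0 == A1 * B1 + A0 * B0
    assert P1 == A1 * B0 + A0 * B1
    return ((P1 << k) + P0) % ((1 << n) - 1)
```

The design point is that four squarings of $(k + 2)$-bit sums replace the four $k \times k$ multiplications, and a squaring table is addressed by a single operand.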
4.2.6 Using Verilog’s built-in operators
Multiplication using Verilog’s built-in modulo operator can be performed by using the %-sign and then letting the synthesis tool decide what to do with it. The implementation is illustrated in figure 4.12 and can be expressed as

    assign output_product = (input_a * input_b) % modulo_parameter;

Figure 4.12: Multiplier using Verilog’s built-in operators
4.2.7 Ordinary multiplication for modulo $2^n$
The easiest and most efficient implementation of an RNS multiplication is the one for modulo $2^n$, as it only requires an ordinary binary multiplier where the most-significant half of the product is discarded. The implementation is illustrated in figure 4.13 and can be expressed as

    assign output_product = input_a * input_b;

Figure 4.13: Multiplier based on binary multiplier for modulo $2^n$
4.3 Forward conversion
Forward conversion is the translation process from TCS to RNS. Since the TCS bit-width is most often smallest at the input, the complexity of the forward conversion will be less than the complexity of the reverse conversion. Due to this smaller bit-width no pipelining¹ is required to fulfill the timing goal of 500 MHz, and therefore only registers at the input and output of the forward conversion are considered in the design, which can be seen in figure 4.14.

¹ Pipelining is a process where registers are inserted in the critical path to increase the maximum operating frequency.

Figure 4.14: Forward conversion with registers at input and output

Figure 4.15: Forward conversion using an RNS adder tree
4.3.1 RNS adder tree
The most straightforward solution for the forward conversion is to use the fact that a TCS number can be represented as $-a_{n-1}2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i$. The RNS representation of a number in TCS can be derived by first converting each term in the summation to RNS (using a LUT) and then performing each individual addition with RNS adders. This results in an RNS adder tree.
The RNS adder tree has parametrized input bit-width and modulus. The entire tree scales with these parameters, as seen in figure 4.15. The number of levels, $n_{levels}$, in the RNS adder tree can be calculated as $n_{levels} = \lceil \log_2(n) \rceil$, where $n$ is the number of TCS input bits. At each level there will be $w^{in}_l = \lceil n / 2^l \rceil$ input wires, where $l$ is the level. There will also be $w^{out}_l = \lceil w^{in}_l / 2 \rceil$ output wires, which results in $n_{adders} = \lfloor w^{in}_l / 2 \rfloor$ adders.
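The adder tree can be modelled in a few lines of Python. In this sketch the per-bit LUT holds $|2^i|_m$ for the ordinary bits and $|-2^{n-1}|_m$ for the sign bit, which is one way to handle the negative weight of the MSB and is an assumption about the details. The function names are hypothetical, and the function also returns the number of tree levels so that it can be compared with $\lceil \log_2(n) \rceil$.

```python
def rns_add(a, b, m):
    """Two-adder style modular addition of residues in [0, m-1]."""
    s = a + b
    return s - m if s >= m else s


def forward_convert(x, n, m):
    """Convert the n-bit two's complement number x to its residue modulo m."""
    # LUT contents: residue of the weight of every bit position.
    weights = [pow(2, i, m) for i in range(n - 1)]
    weights.append((-(1 << (n - 1))) % m)        # negative weight of the MSB
    # Python's >> on negative integers yields the two's complement bits.
    level = [w if (x >> i) & 1 else 0 for i, w in enumerate(weights)]
    n_levels = 0
    while len(level) > 1:
        nxt = [rns_add(level[j], level[j + 1], m)
               for j in range(0, len(level) - 1, 2)]
        if len(level) % 2:                       # odd wire passes straight through
            nxt.append(level[-1])
        level = nxt
        n_levels += 1
    return level[0], n_levels
```

For a 13-bit input the wire counts per level are 13, 7, 4, 2 and 1, giving four levels.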
4.3.2 Periodicity
The periodicity of a modulus can be derived from the fact that the result of $2^i \bmod m$ repeats itself for every modulus as $i$ increases (note that this repetition is not necessarily valid for residues where $i < \lceil \log_2(m) \rceil$). The periodicity of the moduli can be found by a brute-force search, after which the periodicity of the relevant moduli is stored in a ROM. An example of this can be seen in table 4.1.
Modulus m  Residue $2^n \bmod m$                              Periodicity, p
3          1,2,1,2,1,2,...                                    2
4          1,2,0,0,0,...                                      1
5          1,2,4,3,1,2,4,3,...                                4
6          1,2,4,2,4,2,4,...                                  2
7          1,2,4,1,2,4,...                                    3
11         1,2,4,8,5,10,9,7,3,6,1,2,4,8,5,10,9,7,3,6,1,...    10
17         1,2,4,8,16,15,13,9,1,2,4,8,16,15,13,9,1,...        8
31         1,2,4,8,16,1,2,4,8,16,1,2,4,8,16,1,...             5
51         1,2,4,8,16,32,13,26,1,2,4,8,16,32,13,26,1,2,...    8
Table 4.1: Periodicity of some residues
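The brute-force search for the periodicity can be sketched in Python as below. The function name is hypothetical; the loop simply walks through $2^i \bmod m$ until a residue reappears and reports the distance between the two occurrences, which also copes with even moduli where the first residues are not part of the repeating cycle.

```python
def periodicity(m):
    """Length of the repeating cycle of 2**i mod m, found by brute force."""
    first_seen = {}
    r, i = 1 % m, 0
    while r not in first_seen:
        first_seen[r] = i
        r = (2 * r) % m
        i += 1
    return i - first_seen[r]


table = {m: periodicity(m) for m in (3, 4, 5, 6, 7, 11, 17, 31, 51)}
```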
The periodic property of a modulus can be used to reduce the TCS bit-width used for forward conversion. A TCS number can be sign-extended to $p \cdot \lceil \log_2(n_{TCSbits}) \rceil$ bits and then partitioned into chunks that are $p$ bits wide. These chunks are then added with regular TCS adders. The sum is then used in the forward conversion, which reduces the number of bits used in the RNS forward conversion. A conversion process identical to the one presented in section 4.3.1 can follow the periodicity simplification.
Example 2 Consider the forward conversion of the 13-bit TCS representation of the number −3821 for modulus 5. −3821 is expressed as 1000100010011 in TCS. The periodicity of modulus 5 is 4, which can be fetched from table 4.1. So sign-extend the TCS number to $p \cdot \lceil \log_2(n_{TCSbits}) \rceil = 4 \cdot \lceil \log_2(13) \rceil = 16$ bits: 1111000100010011. Separate this number into 4-bit chunks and add them (remember the negative weight of the MSB):

1111 + 0001 + 0001 + 0011 = −1 + 1 + 1 + 3 = 4 = 000100

Then use an RNS adder tree to compute the RNS representation of the number:

$|-3821|_5 = |0 + 0 + 0 + |2^2|_5 + 0 + 0|_5 = 4$
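Example 2 can be reproduced with the Python sketch below. It assumes a modulus whose residues $2^i \bmod m$ are periodic from $i = 0$ (which holds for odd moduli), and it pads the input to the next multiple of $p$ bits, which gives the same 16 bits as in the example; that padding rule and the function name are assumptions of this sketch.

```python
def chunk_sum(x, n_bits, p):
    """Sign-extend x to a multiple of p bits and add the p-bit chunks."""
    width = p * -(-n_bits // p)                  # next multiple of p (assumption)
    word = x & ((1 << width) - 1)                # sign-extended bit pattern
    chunks = [(word >> s) & ((1 << p) - 1) for s in range(0, width, p)]
    top = chunks[-1]
    if top >> (p - 1):                           # MSB set: negative weight
        top -= 1 << p
    return sum(chunks[:-1]) + top


reduced = chunk_sum(-3821, 13, 4)                # 3 + 1 + 1 + (-1)
residue = reduced % 5                            # final, much narrower, conversion
```

Since $2^4 \equiv 1 \pmod 5$, every chunk has weight one modulo 5, so the small chunk sum has the same residue as the original 13-bit number.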
4.3.3 Forward conversion for modulo $2^n - 1$
By extending the solution for forward conversion in the special moduli-set in [4], a more general forward conversion solution for a bigger moduli-set containing modulo $2^n - 1$ can be achieved.
The first step is to sign-extend the TCS input to a number of bits, $n_{S.e.-bits}$, that is evenly divisible by $n$. This new number is then divided into $\frac{n_{S.e.-bits}}{n}$ chunks, which are summed using an RNS adder tree. A small modification is necessary to allow the modulo-$(2^n - 1)$ RNS adder to support inputs in the range $[0, m_i]$ instead of $[0, m_i - 1]$.
4.3.4 Using Verilog’s built-in modulo operator
For comparison, a TCS to RNS forward converter has also been implemented using the built-in Verilog modulo operator, %, as seen in the Verilog code below.

    assign output_rns = input_tcs % modulo_parameter;

4.3.5 Forward conversion for modulo $2^n$
Forward conversion modulo $2^n$ is easily performed by selecting the $n$ least significant bits. An example of how this could be realized in Verilog is shown below.

    assign output_rns = input_tcs[n_bits-1:0];
4.4 Reverse conversion
Reverse conversion is the translation process from RNS to TCS. Observe that, in contrast to forward conversion, no computations can be performed individually in each modulus; instead the entire moduli-set has to be taken into account. This complication makes reverse conversion a major, if not the major, drawback of RNS. Since this approach lacks the parallelism found in other parts of RNS, the reverse conversion process is also more complex. The two main reverse conversion algorithms are based on the Mixed-Radix Conversion (MRC) or the Chinese Remainder Theorem (CRT), of which only the latter has been investigated in this thesis. The reverse conversion process can be seen in figure 4.16. Note that, in contrast to forward conversion, the reverse conversion has to be pipelined to fulfill the timing goals.

Figure 4.16: Reverse conversion
4.4.1 CRT
The Chinese Remainder Theorem (recall the quote in chapter 1 on page 1) is a mathematical way of finding the TCS representation of an RNS number. Recall from chapter 2 on page 5 that a moduli-set, $m_1, m_2, \ldots, m_N$, consisting of $N$ pairwise relatively prime moduli can represent a number $X$ within the range $k \le X < k + M$, where $M = \prod_{i=1}^{N} m_i$ and $k$ is an integer. A number in RNS can be represented as $\langle x_1, x_2, \ldots, x_N \rangle$ where each $x_i = |X|_{m_i}$.
Now define $M_i = \frac{M}{m_i}$ and $M_i^{-1}$ as the multiplicative inverse where $\left| |M_i^{-1}|_{m_i} M_i \right|_{m_i} = 1$. The CRT then states that a TCS number $X$ can be computed by equation (4.9).

$$
X = \left| \sum_{i=1}^{N} x_i \, |M_i^{-1}|_{m_i} \, M_i \right|_M. \tag{4.9}
$$

$M_i$ can easily be computed as described above, though the multiplicative inverse, $M_i^{-1}$, is far harder to calculate. There is in fact no general expression for calculating the multiplicative inverse in this context [4]. For prime moduli Fermat’s Theorem may sometimes be useful for finding the multiplicative inverse. A far less complicated way of finding the multiplicative inverse is to instead calculate $|M_i^{-1}|_{m_i}$ with a brute-force search over all numbers between 0 and $m_i - 1$. This can be computed on a PC, and at elaboration time the multiplicative inverses for the chosen moduli in the moduli-set can be stored in a memory element on the ASIC. Note that $M_i$ and $M_i^{-1}$ will be unique for each moduli-set. Python code for finding $M_i$ and $|M_i^{-1}|_{m_i}$ is presented below:

    from math import prod

    moduli_set = [7, 8, 9]  # example values only; any pairwise relatively prime set works
    for modulus in moduli_set:
        # M_i is the product of all other moduli in the set
        M_i = prod(moduli_set) // modulus
        # brute-force search for the multiplicative inverse of M_i modulo modulus
        for inv_iter in range(1, modulus):
            if (M_i * inv_iter) % modulus == 1:
                M_i_inverse = inv_iter
                break
        print(modulus, M_i, M_i_inverse)

The product of $M_i$ and $|M_i^{-1}|_{m_i}$ is stored in a look-up table in the ASIC. Besides the LUT, the reverse conversion is just a matter of multiplying each $x_i$ with the content of the LUT and then adding the products. Both the multiplication and the addition are performed using RNS multipliers and adders. The resulting schematic is shown in figure 4.17.
Figure 4.17: Reverse conversion using CRT
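The complete CRT reverse conversion of equation (4.9) can be sketched in Python as follows. The function names are hypothetical, the example moduli-set $\{7, 8, 9\}$ is chosen only for illustration, and the mapping to a signed range with $k = -\lfloor M/2 \rfloor$ is an assumption, since $k$ is left open in the text.

```python
from math import prod


def crt_lut(moduli):
    """Per-modulus constants M_i * |M_i^-1|_{m_i}, found by brute force."""
    M = prod(moduli)
    lut = []
    for m_i in moduli:
        M_i = M // m_i
        inverse = next(v for v in range(1, m_i) if (M_i * v) % m_i == 1)
        lut.append(M_i * inverse)
    return lut


def reverse_convert(residues, moduli, signed=False):
    """Recover X from its residues using equation (4.9)."""
    M = prod(moduli)
    total = 0
    for x_i, w_i in zip(residues, crt_lut(moduli)):
        total = (total + x_i * w_i) % M          # modulo-M multiply and add
    if signed and total >= M - M // 2:           # assumption: k = -floor(M/2)
        total -= M
    return total


moduli = (7, 8, 9)                               # {2^n - 1, 2^n, 2^n + 1}, n = 3
x = reverse_convert([100 % 7, 100 % 8, 100 % 9], moduli)
```

Every intermediate value is reduced modulo $M$, which is why the adders and multipliers in the converter must handle the full dynamic range rather than the small per-modulus word lengths.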
In figure 4.17 the RNS adder tree has been pipelined due to the large number of bits needed to represent the dynamic range $M$. This is enough to fulfill the timing requirements in the RNS adder tree when using the simplest and most straightforward adder type with two binary adders connected (adder type 1, figure 4.1 on page 17). A quick review of the implemented multipliers reveals that none of them are purely combinatorial for arbitrary moduli. Due to the need for multiplication of large bit widths in CRT reverse conversion, multiplier type 1 (modulo-m product-partitioning multiplier with LUT) was reimplemented with combinatorial logic instead of a LUT, and some registers were inserted to pipeline the multiplier, as can be seen in figure 4.18.

Figure 4.18: Modulo-m product-partitioning multiplier with combinatorial logic instead of LUT. Changes from the ordinary RNS multiplier are shown in white.

By adding registers in the RNS adder tree, designing the RNS multiplier purely combinatorially and adding registers inside the multiplier, the required maximum operating frequency of 500 MHz could be achieved.
4.5 Choosing a moduli set
The optimum moduli set in terms of power dissipation for representing $n$ bits can be found by solving equation (4.10). Here $p_i$ is the power dissipation, $m_i$ is the modulus, $N$ is the number of moduli and $s_i$ is a decision variable. When optimizing the moduli-sets only the power dissipation of the computational operations has been taken into account (one adder and one multiplier); in a larger system the conversion is considered to be negligible.

$$
\min_{s_i \in \{0,1\}} \left( \sum_{i=0}^{N-1} s_i p_i \right) \quad \text{when} \quad \prod_{s_i \ne 0} s_i m_i \ge 2^n \quad \text{and} \quad |m_i|_{m_j} \ne 0 \;\; \forall \, (i \ne j, \; m_i \ge m_j) \tag{4.10}
$$

This can be solved with the following Python code.

    from itertools import combinations
    from math import gcd, prod

    def find_best_moduli_set(all_modulus, power_cost, dynamic_range, max_n_comb):
        # power_cost maps each modulus to its power dissipation p_i
        best_cost = float("inf")
        best_moduli_set = None
        for n_comb in range(1, max_n_comb):
            for moduli_set in combinations(all_modulus, n_comb):
                current_cost = sum(power_cost[i] for i in moduli_set)
                if current_cost < best_cost:
                    if prod(moduli_set) >= dynamic_range:
                        relative_prime = True
                        for pair in combinations(moduli_set, 2):
                            # gcd = greatest common divisor
                            if gcd(pair[0], pair[1]) != 1:
                                relative_prime = False
                        if relative_prime:
                            best_cost = current_cost
                            best_moduli_set = moduli_set
        return best_moduli_set, best_cost

Due to the exponentially increasing number of combinations with the number of moduli, the moduli sent to the program have been pruned: those moduli with a very high $\frac{p_i}{\log_2(m_i)}$ have been excluded without any effect on the outcome.
4.6 FIR filter
Several different implementations can achieve functionality identical to the direct-form implementation of equation (2.2) on page 8, which is shown in figure 4.19. A major improvement of this design can be achieved by moving the registers from before the multiplications to inside the summation chain, as shown in figure 4.20. This design is usually referred to as a transposed direct-form FIR filter. It has a larger area than the direct-form FIR filter (since the registers are more than twice as wide), but the critical path only goes through one multiplication and one addition (compared to one multiplication and the entire summation chain in the previous case). The simulations performed in this thesis use the transposed direct-form FIR filter shown in figure 4.20 unless stated otherwise. To prevent overflow, the word length of the accumulator registers is in this case given by equation (4.11).
$$
w_{acc} = w_{data} + w_{coef} + \lceil \log_2(n_{taps}) \rceil \tag{4.11}
$$

Figure 4.19: Direct-form FIR filter

Figure 4.20: Transposed direct-form FIR filter
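The equivalence of the two structures and the accumulator width of equation (4.11) can be illustrated with the Python sketch below. It is a plain integer model with hypothetical function names, not the RTL of the thesis; the list `regs` plays the role of the registers inside the summation chain.

```python
def acc_width(w_data, w_coef, n_taps):
    """Equation (4.11); (n_taps - 1).bit_length() equals ceil(log2(n_taps))."""
    return w_data + w_coef + (n_taps - 1).bit_length()


def fir_direct(x, c):
    """Direct form: y[n] = sum over k of c[k] * x[n - k]."""
    return [sum(c[k] * x[n - k] for k in range(len(c)) if n - k >= 0)
            for n in range(len(x))]


def fir_transposed(x, c):
    """Transposed direct form: registers sit inside the summation chain."""
    regs = [0] * (len(c) - 1)
    out = []
    for sample in x:
        prods = [ck * sample for ck in c]        # all taps see the same sample
        out.append(prods[0] + (regs[0] if regs else 0))
        # Each register latches its tap product plus the next register's value.
        regs = [prods[j + 1] + (regs[j + 1] if j + 1 < len(regs) else 0)
                for j in range(len(regs))]
    return out
```

In the transposed form each output depends on only one multiplication and one addition after the registers, which is the shorter critical path described above.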
In a larger DSP system the samples are very unlikely to arrive at every clock cycle. The hardware can therefore be reused by using a folded FIR filter as presented in figure 4.21. Several other techniques for designing the structure of FIR filters are available but are not discussed further in this thesis.

Figure 4.21: Folded FIR

There are several different ways of deriving the FIR coefficients to fulfill certain goals for the filter. The method used in this thesis is based on the program described in [24] from [25], which is implemented in the signal processing library in SciPy, an open-source library of scientific tools for Python. The choice of coefficients is not very important for the results of this thesis as long as they are realistic.
Chapter 5
Results
The results have been obtained under conditions that are very close to those of a real DSP system.
• A 500 MHz clock has been used
• New data has been assumed to arrive every clock cycle
• The libraries used are based on a 32 nm technology
• The power dissipation was calculated as average power dissipation andnot peak power dissipation
5.1 Input data and coefficients
The dynamic power dissipation depends heavily on the input data and coefficients that are provided to the system. The input data and coefficients also affect RNS and TCS systems in different ways. For example, signed and unsigned values result in similar behavior in RNS, but signed values will most likely give a considerably higher dynamic power dissipation than unsigned values for TCS.
5.1.1 Uniformly distributed data and coefficients
In some cases uniformly distributed random data has been used. The data and coefficients have been generated by randomly generating each bit. The distributions can be seen in figure 5.1 for different numbers of bits. In this case both the data and the coefficients are updated each clock cycle; the reason for also updating the coefficients is to prevent the choice of coefficients from affecting the result. Updating the coefficients will affect the result, but the result is assumed to be affected in the same way for TCS and RNS.
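Generating each bit independently yields a uniform distribution over all words of the given width. A minimal Python sketch is shown below; the function name, the fixed seed and the interpretation of the word as a two's complement number are assumptions made for illustration, not details of the thesis test bench.

```python
import random


def random_word(n_bits, rng):
    """Draw every bit independently, then read the word as two's complement."""
    word = 0
    for _ in range(n_bits):
        word = (word << 1) | rng.getrandbits(1)
    return word - (1 << n_bits) if word >> (n_bits - 1) else word


rng = random.Random(0)          # fixed seed only to make the sketch repeatable
samples = [random_word(8, rng) for _ in range(1000)]
```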
Figure 5.1: Discrete uniform distributions for different number of bits (panels: 64 bits, 32 bits and 8 bits)
5.1.2 Sawtooth data and coefficients ramp
Another interesting input data to investigate is sawtooth data. The idea is togenerate the highest possible switching activity. This can then be matchedwith for example a ramp as coefficients. The input data and coefficientsused in this case are presented in figure 5.2 on the following page. They aregenerated using equation (5.1), where i is the current clock-cycle.
$$\text{data} = i \cdot (-1)^{i}, \qquad \text{coef} = i \tag{5.1}$$
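As a minimal sketch, equation (5.1) can be written out in Python. The function names are hypothetical and chosen only for this illustration; the thesis does not show how its stimulus generator was implemented.

```python
def sawtooth_data(i: int) -> int:
    """Input sample at clock cycle i: magnitude i with alternating sign."""
    return i * (-1) ** i


def ramp_coefficient(i: int) -> int:
    """Coefficient at clock cycle i: a plain ramp."""
    return i


cycles = range(8)
data = [sawtooth_data(i) for i in cycles]
coef = [ramp_coefficient(i) for i in cycles]
print(data)  # [0, -1, 2, -3, 4, -5, 6, -7]
print(coef)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The sign of the data flips every clock cycle, which is what drives the high switching activity in a two's complement representation.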
5.1.3 Realistic input data and FIR coefficients
The most realistic data is normally distributed data with constant FIR coefficients. The FIR coefficients have been generated using the Remez exchange algorithm as presented in [26]. The FIR filter has a passband between 0 and 2π · 0.297 rad/sample and a stopband between 2π · 0.328 and π rad/sample. The resulting frequency response for some different numbers of taps can be seen in figure 5.4 on page 39. The input data is normally distributed and consists of a signal that has already been processed by a low-pass filter. These are typical properties of an input signal to an FIR filter in a DSP application. The histogram of the input data is plotted in figure 5.3 on the next page.
5.1.4 Different properties of the data and coefficients
Because the data and coefficients have different properties, they behave differently in RNS and TCS.
Sign switching rate The sign switching rate is the rate at which the data switches from positive to negative or vice versa. The switching rates for the input data used and for an ordinary normal distribution containing white noise are presented in table 5.1 on page 40.
Figure 5.2: Sawtooth data and ramp coefficients (panels: Data, Coefficients)
Figure 5.3: Histogram for realistic input data for a 20-bit FIR filter (x-axis: Value, from −2^14 to 2^14; y-axis: Probability of occurrence)
Figure 5.4: Frequency response for some different FIR filter coefficients (x-axis: Frequency (rad/sample); y-axes: Amplitude (dB) and Angle (radians); curves: 3, 47, 95 and 151 taps)
Theoretical multiplier toggle rate The theoretical multiplier toggle rate is the rate at which the product of a multiplication in TCS and RNS toggles for different input data and coefficients. The results are shown in table 5.2. A 60-bit RNS multiplier is included in table 5.2 as well because, for example, an FIR filter with more than one tap requires a word length larger than two times the input word length, as seen in equation (4.11) on page 35.
Input data              Sign switching rate
Uniform distribution    0.5
Normal distribution     0.5
Sawtooth data           1.0
Realistic data          0.33
Table 5.1: Sign switching rate of input data
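The sign switching rate can be estimated directly from a sample stream. The sketch below is an illustration only: the function name, the treatment of zero as non-negative and the 20-bit signed range for the uniform samples are assumptions, not details taken from the thesis.

```python
import random


def sign_switching_rate(samples):
    """Fraction of consecutive sample pairs whose signs differ.

    Zero is treated as non-negative (an assumption of this sketch).
    """
    pairs = list(zip(samples, samples[1:]))
    switches = sum((a < 0) != (b < 0) for a, b in pairs)
    return switches / len(pairs)


# Sawtooth data from equation (5.1): the sign flips every clock cycle.
sawtooth = [i * (-1) ** i for i in range(1, 1001)]

# Uniformly distributed 20-bit signed samples (fixed seed for repeatability).
rng = random.Random(0)
uniform = [rng.randrange(-2**19, 2**19) for _ in range(100_000)]

print(sign_switching_rate(sawtooth))  # 1.0
print(sign_switching_rate(uniform))   # close to 0.5
```

Independent samples that are equally likely to be positive or negative switch sign in about half of the cycles, which is consistent with the 0.5 listed for the uniform and normal distributions. Low-pass filtered data is correlated from sample to sample, so its rate is lower.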
Input data              40-bit TCS   40-bit RNS   60-bit RNS
Uniform distribution    0.486        0.425        0.456
Normal distribution     0.484        0.427        0.458
Sawtooth data           0.810        0.414        0.451
Realistic data          0.414        0.429        0.458
Table 5.2: Theoretical toggle rate at the output of a multiplication with 20-bit inputs. The optimum moduli-sets presented in table 5.5 on page 55 are used.
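One possible way to estimate such a toggle rate is to count the bits that flip between consecutive products. The sketch below is not the thesis's method, which is not described in this chapter. It assumes that the TCS product is stored in 40 bits, that each residue is stored in ⌈log2 m⌉ bits, and that negative products are mapped to non-negative residues. All function names are hypothetical.

```python
def toggle_rate(words, width):
    """Average fraction of the `width` bits that flip between consecutive words."""
    mask = (1 << width) - 1
    flips = sum(bin((a ^ b) & mask).count("1") for a, b in zip(words, words[1:]))
    return flips / (width * (len(words) - 1))


def tcs_words(products, width):
    """Two's complement encoding: keep the low `width` bits of each product."""
    mask = (1 << width) - 1
    return [p & mask for p in products]


def rns_words(products, moduli):
    """Concatenate the residues of each product, one bit field per modulus."""
    total_width = sum((m - 1).bit_length() for m in moduli)
    words = []
    for p in products:
        word = 0
        for m in moduli:
            # Python's % always returns a non-negative residue.
            word = (word << (m - 1).bit_length()) | (p % m)
        words.append(word)
    return words, total_width


# 40-bit moduli-set from table 5.5.
MODULI_40 = (3, 5, 7, 11, 13, 17, 19, 29, 31, 256)

# Products of sawtooth data and ramp coefficients, equation (5.1).
products = [(i * (-1) ** i) * i for i in range(1, 1000)]

tcs_rate = toggle_rate(tcs_words(products, 40), 40)
rns, rns_width = rns_words(products, MODULI_40)
rns_rate = toggle_rate(rns, rns_width)
print(f"TCS: {tcs_rate:.3f}  RNS ({rns_width} bits): {rns_rate:.3f}")
```

In two's complement every sign change flips all sign-extension bits, so sawtooth data keeps the upper half of the 40-bit word toggling on every cycle. The residues carry no sign-extension bits, which is one way to see why the RNS columns in table 5.2 vary so little between input types. The numbers produced by this sketch are not expected to reproduce the table exactly.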
5.2 Adders and multipliers
The results for different moduli for each adder and multiplier are presented in this section. The test setup used to generate the results is shown in figure 5.6 on the next page, where Op. A and Op. B are provided with two different streams of uniformly distributed random data. In the resulting graphs the total power, toggle rate, UVT ratio and gate count are plotted on the two y-axes. The modulo is plotted on the x-axis with a base-two logarithmic scale. In figure 5.5 on the facing page the quantities in the graphs for the RNS adders and multipliers are pointed out, and they are described below.
Total power The total power is the sum of the static and dynamic powerdissipation.
Toggle rate The toggle rate is plotted relative to the entire y-axis, that is, the maximum “total power” represents a toggle rate of one and a “total power” of zero represents a toggle rate of zero. The toggle rate itself is the average rate at which all the nets in the design toggle. The toggle rate in combination with the gate count is therefore approximately proportional to the dynamic power dissipation.
UVT ratio The UVT ratio is represented on the y-axis in the same way as the toggle rate. The UVT ratio is the ratio of low-leakage cells used in the design. A UVT ratio close to 100 % is desirable.
Gate count The gate count is a technology-independent measure of the total area and, together with the UVT ratio, correlates with the static power dissipation.
Figure 5.5: Description of the RNS multiplier and adder graphs (example plot: RNS adder type 0; x-axis: Modulo, m; left y-axis: Total power [µW], also used for Toggle rate and UVT ratio; right y-axis: Gate count)
Figure 5.6: Test setup for RNS adders and multipliers. (a) RNS adders: Op. A and Op. B produce Sum. (b) RNS multipliers: Op. A and Op. B produce Product.
5.2.1 Adders
The resulting best RNS adders for each modulo are compared in figure 5.8 on page 43 with the RNS adders for modulo 2^n. In table 5.3 on page 44 the best RNS adders with their corresponding adder types are presented. An observation of all the results presented in appendix C leads to the conclusion (with some simplifications) that modulo 2^n − 1 should use adder type 3, modulo 2^n should use adder type 6 and all other moduli > 67 should use adder type 1. The complete list can be found in table 5.3. As a result of the approximation made for m > 67, 25 % of the moduli above 67 use a non-optimal adder type, but the average power dissipation increase per modulus is only approximately 1 %. This approximation was necessary since it is not convenient to store the optimum adder type for each modulo in an overly large array when implementing RNS.
Figure 5.7: Total power dissipation for all RNS adders (types 0 to 6) using uniformly distributed input data as described in section 5.1.1 on page 36.
Figure 5.8: The best RNS adder for each modulo (m ≠ 2^n) compared with RNS adders for modulo 2^n (m = 2^n). Solid lines show total power dissipation [µW] and dashed lines show total area [µm²]. Power dissipation was calculated using uniformly distributed input data as described in section 5.1.1 on page 36.
Modulo   Adder type   Modulo          Adder type
2        6            37              1
3        0            41              2
4        6            43              2
5        2            47              2
7        3            53              1
8        6            59              5
9        2            61              2
11       2            63              3
13       2            64              6
15       3            65              2
16       6            67              1
17       2            71              1
19       2            73              5
23       5            79              1
29       2            2^n − 1         3
31       3            2^n             6
32       6            2^n + 1         1
33       2            other m > 67    1
Table 5.3: The best adder type for chosen modulo with respect to power, refer to table 3.1 on page 12 for details about the adder types.
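The selection rule in table 5.3 amounts to a small look-up table plus a few generic rules. The following sketch shows one way to express it. The names `ADDER_TYPE_TABLE` and `best_adder_type` are hypothetical, and the thesis implements the selection in RTL rather than Python, so this is only an illustration of the rule.

```python
# Explicit entries of table 5.3 (modulo -> adder type).
ADDER_TYPE_TABLE = {
    2: 6, 3: 0, 4: 6, 5: 2, 7: 3, 8: 6, 9: 2, 11: 2, 13: 2, 15: 3,
    16: 6, 17: 2, 19: 2, 23: 5, 29: 2, 31: 3, 32: 6, 33: 2, 37: 1,
    41: 2, 43: 2, 47: 2, 53: 1, 59: 5, 61: 2, 63: 3, 64: 6, 65: 2,
    67: 1, 71: 1, 73: 5, 79: 1,
}


def is_power_of_two(x: int) -> bool:
    return x > 0 and x & (x - 1) == 0


def best_adder_type(m: int) -> int:
    """Adder type for modulo m according to table 5.3."""
    if m in ADDER_TYPE_TABLE:
        return ADDER_TYPE_TABLE[m]
    if m <= 67:
        # Small moduli without an explicit entry are not covered by the table.
        raise KeyError(f"modulo {m} is not listed in table 5.3")
    if is_power_of_two(m + 1):   # m = 2^n - 1
        return 3
    if is_power_of_two(m):       # m = 2^n
        return 6
    return 1                     # m = 2^n + 1 and all other m > 67


print([best_adder_type(m) for m in (31, 32, 127, 128, 129, 83)])
# [3, 6, 3, 6, 1, 1]
```

Keeping explicit entries only for the small moduli reflects the trade-off described above: the generic rules cost roughly 1 % in average power per modulus but avoid storing an optimum type for every modulo.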
5.2.2 Multipliers
Since the RTL code was written in a generic fashion, the moduli m_i > 2^(n+1) have been excluded, because the look-up table based multiplier in table 3.2 on page 13 was implemented using a look-up table of size m_i × m_i. The synthesis tool elaborates the entire code, and for these moduli the look-up table becomes too big in the elaboration phase, so the code is not synthesizable.
Figure 5.9: RNS adders type 0 and 1. (a) RNS adder type 0, (b) RNS adder type 1
Figure 5.10: RNS adders type 2 and 3. (a) RNS adder type 2, (b) RNS adder type 3
Figure 5.11: RNS adders type 4 and 5. (a) RNS adder type 4, (b) RNS adder type 5
Figure 5.12: RNS adders type 6. (a) RNS adder type 6
Figure 5.13: All RNS multipliers (total power dissipation [µW] versus modulo for multiplier types 0 to 7)
Figure 5.14: RNS multipliers type 0 and 1. (a) RNS multiplier type 0, (b) RNS multiplier type 1
Figure 5.15: RNS multipliers type 2 and 4. (a) RNS multiplier type 2, (b) RNS multiplier type 4
Figure 5.16: RNS multipliers type 5 and 6. (a) RNS multiplier type 5, (b) RNS multiplier type 6
Figure 5.17: RNS multipliers type 7 and TCS multiplier. (a) RNS multiplier type 7, (b) TCS multiplier (x-axis: Dynamic range, M)
Modulo   Multiplier type   Modulo          Multiplier type
2        7                 47              4
3        0                 53              4
4        7                 59              4
5        0                 61              4
7        0                 63              2
8        7                 64              7
9        0                 65              1
11       4                 67              4
13       4                 71              4
15       2                 73              4
16       7                 79              4
17       1                 83              4
19       4                 89              1
23       4                 97              1
29       4                 127             2
31       2                 255             2
32       7                 2^n − 1         1
33       1                 2^n             7
37       4                 other m > 97    1
41       4
43       4
Table 5.4: The best multiplier type for chosen modulo with respect to power, refer to table 3.2 on page 13 for details about the multiplier types.
5.3 Moduli-set
The optimum moduli-sets were derived using the technique described in section 4.5 on page 33. They were optimized to have the least total power for one tap, that is, one RNS adder and one RNS multiplier. The complete list of optimum moduli-sets is presented in Appendix B, and some examples are presented in table 5.5 on the facing page. Combining the RNS multipliers according to the optimum moduli-sets, with different numbers of moduli, gives the result in figure 5.18 on the next page.
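To make concrete what one tap computes, the sketch below performs a single multiply-accumulate independently in each residue channel and decodes the result. It is an illustration under stated assumptions: it handles unsigned values only, uses the 20-bit moduli-set from table 5.5, and uses a textbook Chinese remainder theorem decoder, which is not necessarily the converter used in the thesis. All function names are hypothetical.

```python
from math import prod

MODULI = (3, 5, 7, 11, 31, 32)  # 20-bit moduli-set from table 5.5
M = prod(MODULI)                # dynamic range: values 0 .. M - 1


def to_rns(x):
    """Forward conversion: one residue per modulus."""
    return tuple(x % m for m in MODULI)


def rns_tap(acc, data, coef):
    """One tap, acc + data * coef, computed separately in each channel."""
    return tuple((a + d * c) % m for a, d, c, m in zip(acc, data, coef, MODULI))


def from_rns(residues):
    """Reverse conversion using the Chinese remainder theorem."""
    x = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # modular inverse, Python 3.8+
    return x % M


result = rns_tap(to_rns(100), to_rns(123), to_rns(45))
print(from_rns(result))  # 5635, equal to 100 + 123 * 45
```

No carries pass between the channels, so each channel needs only a small modular adder and a small modular multiplier. This is the pair of units whose combined power is minimized when the moduli-sets are chosen.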
Req. no. bits   log2(∏ m_i)   Optimum moduli-set
6               6.0           {64}
7               7.39          {3, 7, 8}
10              10.13         {5, 7, 32}
20              20.13         {3, 5, 7, 11, 31, 32}
30              30.08         {3, 5, 7, 11, 13, 19, 31, 128}
40              40.02         {3, 5, 7, 11, 13, 17, 19, 29, 31, 256}
50              50.03         {5, 7, 9, 11, 13, 19, 23, 29, 31, 127, 512}
60              60.01         {11, 13, 17, 19, 23, 29, 31, 37, 63, 127, 4096}
Table 5.5: Some of the optimum moduli-sets and their resulting number of bits. For the complete list refer to Appendix B.
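The resulting number of bits in table 5.5 is the base-two logarithm of the product of the moduli, and a valid moduli-set must be pairwise coprime. Both properties can be checked with a few lines of Python. The helper names are hypothetical and the check is only a sanity test of the tabulated values, not the optimization procedure from section 4.5.

```python
from itertools import combinations
from math import gcd, log2, prod


def is_pairwise_coprime(moduli):
    """True if no two moduli share a common factor."""
    return all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))


def dynamic_range_bits(moduli):
    """Dynamic range of the moduli-set expressed in bits."""
    return log2(prod(moduli))


examples = {
    7: (3, 7, 8),
    20: (3, 5, 7, 11, 31, 32),
    40: (3, 5, 7, 11, 13, 17, 19, 29, 31, 256),
}

for required, moduli in examples.items():
    bits = dynamic_range_bits(moduli)
    ok = is_pairwise_coprime(moduli) and bits >= required
    print(required, round(bits, 2), ok)
# 7 7.39 True
# 20 20.13 True
# 40 40.02 True
```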
Figure 5.18: Combinations of RNS multipliers with a maximum of 3, 5, 7, 9 and 11 RNS multipliers in the moduli-set compared with a TCS multiplier (x-axis: Dynamic range [bits]; y-axis: Total power dissipation [mW])
Figure 5.19: Combinations of RNS adders with a maximum of 3, 5, 7, 9 and 11 RNS adders in the moduli-set compared with an RNS adder for modulo 2^n (which is almost identical to a TCS adder) (x-axis: Dynamic range [bits]; y-axis: Total power dissipation [mW])
5.4 FIR filters
RNS and TCS FIR filters have been synthesized and simulated to obtain results on RNS performance in a real DSP application.
5.4.1 Varying input word length
In figure 5.20 on the facing page and figure 5.21 on page 58, RNS FIR filters with varying input word length have been tested. Uniformly distributed data has been used in the first figure and sawtooth data in the second.
Figure 5.20: 64-tap FIR filter with varying input bit width for RNS and TCS. Uniform data as described in section 5.1.1 on page 36. The red line represents the power reduction. (Plot title: RNS vs. TCS 64-tap FIR filter random data and coefficients; x-axis: Input bit width; y-axes: Total power [mW] and Gate count; the power-reduction labels in the plot range from 1 % to 20 %.)
Figure 5.21: 16-tap FIR filter with varying input word length for RNS and TCS. Sawtooth data with ramp coefficients as described in section 5.1.2 on page 37. The red line represents the power reduction. (Plot title: RNS vs. TCS 16-tap FIR filter sawtooth input data; x-axis: Input bit width; y-axes: Total power [mW] and Gate count; the power-reduction labels in the plot range from 6 % to 20 %.)
5.4.2 Varying number of taps
A realistic FIR filter for a telecommunication system can have an input word length of 20 bits. By varying the number of taps, the results should be comparable with, for example, [5], in which the authors performed similar tests. In figure 5.22 on the facing page the results for an FIR filter with 20-bit uniformly distributed input data are presented, and in figure 5.23 on page 60 exactly the same RNS FIR filter is instead provided with realistic input data and coefficients.
Figure 5.22: 20-bit FIR filter with varying number of taps for RNS and TCS. Uniformly distributed data and coefficients are used as described in section 5.1.1 on page 36. The red line represents the power reduction. (Plot title: RNS vs. TCS 20 input bits FIR filter random input data and coefficients; y-axes: Total power [mW] and percentage, Gate count; the power-reduction labels in the plot range from 25 % down to 15 %.)
Figure 5.23: 20-bit FIR filter with varying number of taps for RNS and TCS. Realistic data with constant FIR coefficients are used as described in section 5.1.3 on page 37. The red line represents the power reduction. (Plot title: RNS vs. TCS 20 input bits FIR filter real input data and coefficients; y-axes: Total power [mW] and percentage, Gate count; the power-reduction labels in the plot range from 16 % down to −5 %.)
5.4.3 Folded FIR filter
In a more realistic FIR filter the input data will not arrive every clock cycle, and therefore a folded FIR filter could be used instead. An FIR filter that is folded N times, where N is the number of taps, has been designed. The resulting schematic was discussed earlier in figure 4.21 on page 35. The FIR filter was designed with 20-bit input and 22 taps and was provided with uniformly distributed input data and coefficients as described in section 5.1.1 on page 36, yielding an RNS dynamic range of 20 · 2 + ⌈log2 22⌉ = 45 bits.
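The dynamic range calculation above can be reproduced with a short helper. This is a sketch that assumes the coefficients have the same word length as the input data, which is how the expression 20 · 2 reads. The function name is hypothetical.

```python
def required_dynamic_range(input_bits: int, taps: int) -> int:
    """Bits needed for a sum of `taps` products of two `input_bits`-bit words."""
    product_bits = 2 * input_bits
    # (taps - 1).bit_length() equals ceil(log2(taps)) for taps >= 1.
    growth_bits = (taps - 1).bit_length()
    return product_bits + growth_bits


print(required_dynamic_range(20, 22))  # 45
print(required_dynamic_range(20, 64))  # 46
```

With 45 required bits, table B.1 in Appendix B lists the moduli-set {5, 7, 9, 11, 13, 19, 23, 29, 31, 2048}.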
                 TCS         RNS          Difference
Total power      2.69 mW     3.34 mW      + 24 %
Total area       5745 µm²    10323 µm²    + 80 %
No. registers    986         2319         + 135 %
Table 5.6: Results for an FIR filter folded 22 times with 20-bit input and 22 taps
5.5 Maximum frequency
The maximum frequency of the RNS computational elements has, as an example, been calculated for a 4-tap FIR filter with 20-bit and 30-bit input for both RNS and TCS. For these designs the synthesis tool was set to target a maximum frequency of 1.5 GHz. The results are presented in table 5.7. From the results it is clear that the RNS implementation is slightly worse than TCS in terms of maximum frequency. This result is reasonable since the individual RNS adders and multipliers in this thesis have been optimized for power and not for maximum frequency.
                 20-bit TCS   20-bit RNS   30-bit TCS   30-bit RNS
Max frequency    1066 MHz     1052 MHz     981 MHz      895 MHz
Total area       9262 µm²     6414 µm²     18693 µm²    11645 µm²
UVT ratio        10.43 %      14.74 %      8.46 %       18.19 %
Table 5.7: Synthesis results for 4-tap FIR filter with 20 or 30 input bit-width. The synthesis maximum frequency goal was set to 1.5 GHz.
Chapter 6
Discussion and conclusions
In this chapter the results are discussed, analyzed and compared with some previous research. In general the results are in line with previous research. The main conclusion of this thesis follows from analyzing figure 5.22 on page 59 and figure 5.23 on page 60: RNS achieves a decrease in power dissipation when it is provided with random data, but with more realistic data it does not reduce the power dissipation.
6.1 Adders
By analyzing the different adders in figure 5.9, figure 5.10, figure 5.11 and figure 5.12 on page 48 it is quite clear that some of the moduli are more efficient than others. Naturally the modulo 2^n adders are the best since they have the same profile as ordinary TCS adders, as can be seen in figure 5.8 on page 43. These results are clearly visible in almost all graphs as “dips” in both total power and gate count. The dips are most likely caused by the synthesis tool detecting that it has been provided with what is almost an ordinary TCS adder and optimizing the hardware accordingly.
Another interesting observation from the RNS adder results is the difference in behavior between a look-up table based implementation and a combinatorial implementation, as can clearly be seen by comparing figure 5.9a on page 45 and figure 5.9b on page 45. In figure 5.9a the power and gate count clearly increase with increasing modulo, whereas in figure 5.9b they instead seem to be related to the number of bits used to represent the modulo.
Looking at the results for all the different RNS adders, it is possible to generalize that the power and gate count of the RNS adders implemented in this thesis grow linearly. As a consequence, a combination of only RNS adders in a moduli-set will not be better than a TCS adder for the same bit-width. This can clearly be seen in figure 5.19 on page 56. Fortunately RNS is almost as good as TCS, so a generalized view is that a combination of RNS adders will perform almost as well as one TCS adder representing the same dynamic range.
6.2 Multipliers
The results for the multipliers have in general the same properties as the results for the RNS adders. The main difference is that power and gate count seem to grow more than linearly, which is logical since multiplication is more complex than addition. Because of this more than linear growth, a combination of RNS multipliers will at a certain point outperform TCS multipliers, as can be seen in figure 5.18 on page 55. The conclusion is that RNS multipliers will outperform TCS multipliers as long as enough moduli are used in the moduli-set and the dynamic range (or output bit width) is greater than approximately 22 bits.
Another interesting observation is that almost all the multipliers for moduli other than 2^n or 2^n − 1 utilize some kind of LUT approach. These LUTs can probably share hardware resources to a greater extent than they currently do, and this could be a future improvement of both the multipliers and the adders.
6.3 FIR filters
Regarding the FIR filters, the authors of [5] performed tests in an environment very similar to the one used for figure 5.22 on page 59. The results presented in [5] used a moduli-set of {7, 11, 13, 17, 64}, which corresponds to a dynamic range of 20 bits, while the results in figure 5.22 use an input bit-width of 20 bits, which results in a dynamic range of at least 40 bits. In [5] the authors achieve a power reduction between 30 and 35 %, while the power reduction in figure 5.22 is approximately 15 %. This difference can be explained by the fact that the cell libraries used for synthesis in industry are far richer in terms of, for example, different multiplier types, which benefits TCS FIR filters.
Even more interesting results are obtained when the same FIR filter as described above is provided with more realistic input data and coefficients, as can be seen in figure 5.23 on page 60. In this case the RNS FIR filter actually increases the power dissipation. This can be explained by inspecting table 5.1 on page 40 and table 5.2 on page 40. The very interesting result in these tables is that RNS seems to treat almost any type of input data as random data, which is caused by the forward conversion; hence almost any type of input data will toggle like random data in RNS. So to get a power reduction with RNS, the original input data has to toggle a lot.
From the results in figure 5.23 on page 60 one might reason that a completely folded FIR filter would decrease the power even more, since a 12-tap FIR filter using realistic data actually decreases the power dissipation by 16 %. Unfortunately a folded FIR filter has some sort of shift registers at the input, which use a significantly larger area in RNS than in TCS because the entire RNS word length has to be stored instead of only the input word length as in TCS. This difference affects the result far more in the folded FIR filter case due to the heavily increased number of registers at the inputs.
Chapter 7
Future Work
RNS can be investigated much further, since a lot of research has been done in the field and many other implementations have been proposed. Further investigation of RNS would probably decrease the power dissipation further. Some examples that I have discovered during the thesis work are presented below:
• In [27] a possibly better RNS multiplier is presented, which uses the fact that not all input bit combinations are used for moduli ≠ 2^n.
• Some research has been done on multi-level RNS, which simply means that the largest modulus consists of an additional level of RNS. This might be interesting since the implementations of RNS investigated in this thesis usually use a slightly larger bit-width than the ones implemented in academic papers. A multi-level RNS is used in [28].
• Due to the inherently small word lengths in RNS, a great many interesting arithmetic algorithms can be investigated, for example one-hot encoding and shift-and-add multiplication [29].
• The parallel-prefix multiplier implemented in this thesis can probably be re-implemented using a more sophisticated adder tree structure.
• It would be interesting to optimize the moduli-sets for realistic data as well (instead of only for a uniform distribution, as has been done in this thesis).
If RNS is to be implemented in a real system, further research would also be required on forward and reverse conversion (these have to be properly adapted to the surrounding system). Further research would probably also be necessary on scaling, overflow and sign detection, since the RNS computations would have to be large to compensate for the conversion process and would therefore probably require these operations.
References
[1] Lay Yong Lam and Tian Se Ang. Fleeting footsteps tracing the conception of arithmetic and algebra in ancient China. River Edge, N.J.: World Scientific, 2004 (cit. on p. 1).
[2] Michael A. Soderstrand, W. Kenneth Jenkins, Graham A. Jullien, and Fred J. Taylor, eds. Residue Number System Arithmetic: Modern Applications in Digital Signal Processing. Piscataway, NJ, USA: IEEE Press, 1986 (cit. on p. 3).
[3] N.S. Szabo and R.I. Tanaka. Residue arithmetic and its applications to computer technology. McGraw-Hill series in information processing and computers. McGraw-Hill, 1967 (cit. on p. 3).
[4] Amos R. Omondi and Benjamin Premkumar. Residue number systems: theory and implementation. Advances in computer science and engineering: Texts: v. 2. London: Imperial College Press; Singapore; Hackensack, NJ: Distributed by World Scientific Publishing, 2007 (cit. on pp. 3, 5, 8, 11–14, 22, 25, 30, 31).
[5] G.C. Cardarilli, A. Del Re, A. Nannarelli, and M. Re. “Low Power and Low Leakage Implementation of RNS FIR Filters.” In: Conference Record of the Thirty-Ninth Asilomar Conference on Signals, Systems & Computers, 2005 (2005), p. 1620 (cit. on pp. 3, 12, 14, 24, 58, 63).
[6] G.C. Cardarilli, M. Re, A. Del Re, and A. Nannarelli. “Impact of RNS coding overhead on FIR filters performance.” In: Conference Record - Asilomar Conference on Signals, Systems and Computers. Conference Record of the 41st Asilomar Conference on Signals, Systems and Computers, ACSSC. Department of Electronics, University of Rome Tor Vergata, 2007, pp. 1426–1429 (cit. on p. 3).
[7] W.L. Freking, K.K. Parhi, M.P. Fargues, and R.D. Hippenstiel. “Low-power FIR digital filters using residue arithmetic.” In: vol. 1. 1998 (cit. on pp. 3, 14).
[8] G.C. Cardarilli, A. Nannarelli, and M. Re. “Reducing power dissipation in FIR filters using the residue number system.” In: Midwest Symposium on Circuits and Systems. Vol. 1. Department of Electrical Engineering, Univ. of Rome Tor Vergata, 2000, pp. 320–323 (cit. on pp. 7, 14, 24).
[9] D. Zivaljevic, N. Stamenkovic, and V. Stojanovic. “Digital filter implementation based on the RNS with diminished-1 encoded channel.” In: University of Nis, Faculty of Electronic Engineering, A. Medvedeva 14, Nis, 18000, Serbia, 2012 (cit. on p. 8).
[10] Sune Soderkvist. Fran insignal till utsignal. 2007 (cit. on p. 8).
[11] Himanshu Bhatnagar. Advanced ASIC chip synthesis using Synopsys Design Compiler, Physical Compiler, and PrimeTime. Boston: Kluwer Academic Publishers, 2002 (cit. on p. 9).
[12] PrimeTime Datasheet. Accessed: 2014-05-08. url: http://www.synopsys.com/Tools/Implementation/SignOff/Documents/primetime_ds.pdf (cit. on p. 10).
[13] Gordon Yip. Expanding the Synopsys PrimeTime® Solution with Power Analysis. 2006. url: https://www.synopsys.com/Tools/Implementation/SignOff/CapsuleModule/ptpx_wp.pdf (cit. on p. 10).
[14] M. Bayoumi, G. Jullien, and W. Miller. “A VLSI implementation of residue adders.” In: IEEE Transactions on Circuits & Systems 34.3 (1987), p. 284 (cit. on p. 11).
[15] R. Zimmermann, I. Koren, and P. Kornerup. “Efficient VLSI implementation of modulo (2^n ± 1) addition and multiplication.” In: Integrated Syst. Lab., Swiss Federal Inst. of Technol., Zurich, Switzerland, 1999 (cit. on pp. 11, 18).
[16] R. Zimmermann. “VHDL Library of Arithmetic Units”. In: Proc. First Int. Forum on Design Languages (FDL’98) (1998) (cit. on pp. 11, 18).
[17] A.A. Hiasat. “New efficient structure for a modular multiplier for RNS.” In: IEEE Transactions on Computers 49.2 (2000), pp. 170–174 (cit. on pp. 12, 22).
[18] G. Alia and E. Martinelli. “A VLSI modulo m multiplier.” In: IEEE Transactions on Computers 40.7 (1991), p. 873 (cit. on p. 12).
[19] Ramya Muralidharan and Chip-Hong Chang. “Area-Power Efficient Modulo 2^n − 1 and Modulo 2^n + 1 Multipliers for {2^n − 1, 2^n, 2^n + 1} Based RNS.” In: IEEE Transactions on Circuits & Systems. Part I: Regular Papers 59.10 (2012), p. 2263 (cit. on p. 12).
[20] Alexander Skavantzos and Poornachandra B. Rao. “New multipliers modulo 2^N − 1.” In: IEEE Transactions on Computers 41.8 (1992), pp. 957–961 (cit. on pp. 12, 13, 25).
[21] Ivan Matveyevich Vinogradov. Elements of number theory. New York: Dover, 1954 (cit. on pp. 13, 24).
[22] A.B. Premkumar. “A formal framework for conversion from binaryto residue numbers”. In: Circuits and Systems II: Analog and DigitalSignal Processing, IEEE Transactions on 49.2 (2002), pp. 135–144(cit. on p. 13).
[23] “IEEE Standard for SystemVerilog–Unified Hardware Design, Specifi-cation, and Verification Language”. In: IEEE STD 1800-2009 (2009),pp. 1–1285 (cit. on p. 16).
[24] J.H. McClellan, T.W. Parks, and L. Rabiner. “A computer programfor designing optimum FIR linear phase digital filters”. In: Audio andElectroacoustics, IEEE Transactions on 21.6 (1973), pp. 506–526 (cit.on p. 35).
[25] J. McClellan and T. Parks. “A unified approach to the design of opti-mum FIR linear-phase digital filters”. In: Circuit Theory, IEEE Trans-actions on 20.6 (1973), pp. 697–701 (cit. on p. 35).
[26] SciPy v0.13 Reference Guide for the Remez exchange algorithm. Ac-cessed: 2014-05-07. url: http : / / docs . scipy . org / doc / scipy -
0.13.0/reference/generated/scipy.signal.remez.html (cit. onp. 37).
[27] V. Paliouras, K. Karagianni, and T. Stouraitis. “A low-complexitycombinatorial RNS multiplier.” In: IEEE Transactions on Circuits andSystems II: Analog and Digital Signal Processing 48.7 (2001), pp. 675–683 (cit. on p. 65).
[28] Jassbi S. Jafarali, Navi K., and Khademzadeh A. “An optimum moduliset in residue number system.” In: International Mathematical Forum59 (2010), p. 2911 (cit. on p. 65).
[29] K. Johansson, O. Gustafsson, and L. Wanhammar. “Bit-Level Opti-mization of Shift-and-Add Based FIR Filters”. In: Electronics, Circuitsand Systems, 2007. ICECS 2007. 14th IEEE International Conferenceon. 2007, pp. 713–716 (cit. on p. 65).
Appendix A
Modulus
2  101  233  511
3  103  239  512
5  107  241  513
7  109  251  1023
11  113  509  1024
13  127  1021  1025
17  131  2039  2047
19  137  4093  2048
23  139  8179  2049
29  149  16381  4095
31  151  4  4096
37  157  8  4097
41  163  9  8191
43  167  15  8192
47  173  16  8193
53  179  32  16383
59  181  33  16384
61  191  63  16385
67  193  64
71  197  65
73  199  128
79  211  129
83  223  255
89  227  256
97  229  257
Appendix B
Optimum moduli-sets
Table B.1: Resulting moduli-sets
Req. no. bits  Resulting no. bits  Resulting moduli-set
2  2.0  {4}
3  3.0  {8}
4  4.0  {16}
5  5.0  {32}
6  6.0  {64}
7  7.39231742278  {3, 7, 8}
8  8.39231742278  {3, 7, 16}
9  9.12928301694  {5, 7, 16}
10  10.1292830169  {5, 7, 32}
11  11.1292830169  {5, 7, 64}
12  12.0927571409  {3, 7, 13, 16}
13  13.0927571409  {3, 7, 13, 32}
14  14.0927571409  {3, 7, 13, 64}
15  15.1736771363  {3, 5, 7, 11, 32}
16  16.1736771363  {3, 5, 7, 11, 64}
17  17.1736771363  {3, 5, 7, 11, 128}
18  18.0469534513  {3, 7, 13, 31, 32}
19  19.0469534513  {3, 7, 13, 31, 64}
20  20.1278734467  {3, 5, 7, 11, 31, 32}
21  21.1278734467  {3, 5, 7, 11, 31, 64}
22  22.1278734467  {3, 5, 7, 11, 31, 128}
23  23.1278734467  {3, 5, 7, 11, 31, 256}
24  24.1278734467  {3, 5, 7, 11, 31, 512}
25  25.1220443679  {3, 5, 7, 11, 13, 19, 128}
26  26.1220443679  {3, 5, 7, 11, 13, 19, 256}
27  27.0762406783  {3, 5, 7, 11, 13, 16, 19, 31}
28  28.0762406783  {3, 5, 7, 11, 13, 19, 31, 32}
29  29.0762406783  {3, 5, 7, 11, 13, 19, 31, 64}
30  30.0762406783  {3, 5, 7, 11, 13, 19, 31, 128}
31  31.0762406783  {3, 5, 7, 11, 13, 19, 31, 256}
32  32.0762406783  {3, 5, 7, 11, 13, 19, 31, 512}
33  33.0762406783  {3, 5, 7, 11, 13, 19, 31, 1024}
34  34.209856116  {3, 5, 7, 11, 13, 23, 29, 31, 64}
35  35.209856116  {3, 5, 7, 11, 13, 23, 29, 31, 128}
36  36.209856116  {3, 5, 7, 11, 13, 23, 29, 31, 256}
37  37.209856116  {3, 5, 7, 11, 13, 23, 29, 31, 512}
38  38.209856116  {3, 5, 7, 11, 13, 23, 29, 31, 1024}
39  39.0216845147  {3, 5, 7, 11, 13, 17, 19, 29, 31, 128}
40  40.0216845147  {3, 5, 7, 11, 13, 17, 19, 29, 31, 256}
41  41.0427461302  {5, 7, 9, 11, 13, 19, 23, 29, 31, 128}
42  42.0427461302  {5, 7, 9, 11, 13, 19, 23, 29, 31, 256}
43  43.0427461302  {5, 7, 9, 11, 13, 19, 23, 29, 31, 512}
44  44.0427461302  {5, 7, 9, 11, 13, 19, 23, 29, 31, 1024}
45  45.0427461302  {5, 7, 9, 11, 13, 19, 23, 29, 31, 2048}
46  46.1302089714  {5, 7, 9, 11, 13, 17, 19, 23, 29, 31, 256}
47  47.1302089714  {5, 7, 9, 11, 13, 17, 19, 23, 29, 31, 512}
48  48.1302089714  {5, 7, 9, 11, 13, 17, 19, 23, 29, 31, 1024}
49  49.1302089714  {5, 7, 9, 11, 13, 17, 19, 23, 29, 31, 2048}
50  50.031430817  {5, 7, 9, 11, 13, 19, 23, 29, 31, 127, 512}
51  51.031430817  {5, 7, 9, 11, 13, 19, 23, 29, 31, 127, 1024}
52  52.031430817  {5, 7, 9, 11, 13, 19, 23, 29, 31, 127, 2048}
53  53.031430817  {5, 7, 9, 11, 13, 19, 23, 29, 31, 127, 4096}
54  54.0247889997  {7, 11, 13, 15, 19, 23, 29, 31, 37, 41, 2048}
55  55.0247889997  {7, 11, 13, 15, 19, 23, 29, 31, 37, 41, 4096}
56  56.0010571679  {7, 11, 13, 15, 19, 23, 29, 31, 47, 127, 2048}
57  57.0010571679  {7, 11, 13, 15, 19, 23, 29, 31, 47, 127, 4096}
58  58.0083788194  {7, 11, 13, 15, 19, 29, 31, 41, 53, 127, 4096}
59  59.0725091568  {7, 13, 15, 19, 23, 29, 31, 41, 53, 127, 4096}
60  60.0064189289  {11, 13, 17, 19, 23, 29, 31, 37, 63, 127, 4096}
61  61.2765080923  {11, 13, 19, 23, 29, 31, 37, 41, 63, 127, 4096}
62  62.0184105261  {11, 13, 19, 23, 29, 31, 43, 59, 63, 127, 4096}
63  63.0023331289  {11, 19, 23, 29, 31, 37, 41, 43, 63, 127, 4096}
64  64.0229340157  {11, 13, 23, 29, 31, 37, 41, 63, 127, 255, 2048}
65  65.0282412996  {13, 19, 23, 29, 31, 41, 43, 63, 127, 255, 2048}
66  66.0207068321  {15, 19, 23, 29, 31, 37, 41, 63, 127, 511, 2048}
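The "Resulting no. bits" column of table B.1 appears to be the base-2 logarithm of the product of the moduli, i.e. the dynamic range of the set expressed in bits. The following is a minimal Python sketch under that reading. The function name `dynamic_range_bits` is hypothetical and not taken from the thesis, and the pairwise-coprime check is an assumption based on the usual RNS requirement.

```python
from itertools import combinations
from math import gcd, log2, prod


def dynamic_range_bits(moduli):
    """Return log2 of the product of the moduli (assumed 'resulting no. bits')."""
    # Assumption: an RNS moduli set must be pairwise coprime.
    if any(gcd(a, b) != 1 for a, b in combinations(moduli, 2)):
        raise ValueError("moduli must be pairwise coprime")
    return log2(prod(moduli))


# Table B.1 lists 7.39231742278 for {3, 7, 8} and 9.12928301694 for {5, 7, 16}.
print(dynamic_range_bits([3, 7, 8]))
print(dynamic_range_bits([5, 7, 16]))
```

A set satisfies a requirement of n bits when the returned value is at least n, which is consistent with every row of the table.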
Appendix C
RNS adders results
The results have been generated using uniformly distributed random input operands and a setup as described in figure 5.6a on page 41. Below is a description of the titles in the header row of table C.1.
Modulo The modulus.
Type Results for the specific adder type, see table 3.1 on page 12 for details.
Area The total area of the adder including the registers, in µm².
Gates The total gate count including registers.
Switch power The switching power in µW. See section 2.3.3 on page 10 for details about the different power values.
Int. power The internal power in µW.
Leak power The leakage power in µW.
Total power The total power in µW.
Toggle rate The average toggle rate per net per clock cycle.
UVT cells The percentage of UVT cells. See section 2.3.3 on page 10 for details.
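As a quick consistency check on the power columns, the sketch below assumes that the total power is the plain sum of the switching, internal and leakage power. That relation is an assumption here (section 2.3.3 would be the place to confirm it), and the helper name `total_power_uw` is hypothetical. The example values are read from the modulo 3, type 0 row of table C.1, which lists a total of 17.2 µW.

```python
def total_power_uw(switch_uw, internal_uw, leak_uw):
    """Sum the three power components, all given in µW (assumed relation)."""
    return switch_uw + internal_uw + leak_uw


# Switching, internal and leakage power of the modulo 3, type 0 adder row.
total = total_power_uw(1.07, 15.5, 0.614)
print(f"{total:.1f} µW")
```

The same check can be applied to any other row; small deviations are expected because each column is rounded to three significant figures.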
Table C.1: Results for RNS adders
Modulo  Type  Area  Gates  Switch power  Int. power  Leak power  Total power  Toggle rate  UVT cells
2  0  14  30  0.23  8.07  0.385  8.69  0.21  85.71%
2  1  14  30  0.23  8.07  0.385  8.69  0.21  85.71%
2  2  14  30  0.23  8.07  0.385  8.69  0.21  85.71%
2  5  14  30  0.23  8.07  0.385  8.69  0.21  85.71%
2  6  14  30  0.23  8.07  0.385  8.69  0.21  85.71%  BEST
3  0  26  56  1.07  15.5  0.614  17.2  0.2  93.33%  BEST
3  1  29  62  1.74  16  0.677  18.4  0.2  95.24%
3  2  28  59  1.31  15.7  0.65  17.7  0.2  94.44%
3  3  29  61  1.5  16  0.681  18.1  0.2  95%
3  5  29  61  1.49  15.9  0.681  18.1  0.2  95.24%
4  0  26  55  0.526  16.1  0.694  17.3  0.22  91.67%
4  1  26  55  0.526  16.1  0.694  17.3  0.22  91.67%
4  2  26  55  0.526  16.1  0.694  17.3  0.22  91.67%
4  5  26  55  0.526  16.1  0.694  17.3  0.22  91.67%
4  6  26  55  0.526  16.1  0.694  17.3  0.22  91.67%  BEST
5  0  48  103  3.76  24.4  1.08  29.2  0.18  97.5%
5  1  44  93  2.23  24.5  1.07  27.8  0.19  96%
5  2  41  88  2.28  23.9  1  27.2  0.2  95.83%  BEST
5  5  46  99  3.25  24.7  1.13  29  0.19  96.88%
7  0  60  128  6.88  27.4  1.38  35.7  0.2  98.25%
7  1  43  91  2.6  25.4  1.07  29.1  0.22  95.83%
7  2  43  92  2.95  25.3  1.09  29.4  0.22  96.3%
7  3  41  87  2.54  24.9  1  28.5  0.22  95.45%  BEST
7  5  45  96  2.99  25.8  1.12  29.9  0.21  96.43%
8  0  38  80  1.57  24.5  0.961  27  0.23  93.75%
8  1  38  80  1.57  24.5  0.962  27  0.23  93.75%
8  2  38  80  1.57  24.5  0.962  27  0.23  93.75%
8  5  38  80  1.57  24.5  0.962  27  0.23  93.75%
8  6  38  80  1.57  24.5  0.962  27  0.23  93.75%  BEST
9  0  98  209  8.56  33.7  2.04  44.3  0.11  99.07%
9  1  59  125  3.73  33.8  1.53  39.1  0.2  96.67%
9  2  56  118  3.15  32.4  1.39  36.9  0.2  96.88%  BEST
9  4  63  134  4.98  33.5  1.49  40  0.2  97.73%
9  5  60  127  3.83  32.7  1.45  37.9  0.18  97.44%
11  0  125  265  11.2  36  2.67  49.9  0.1  99.35%
11  1  59  125  3.87  34.9  1.57  40.4  0.21  96.77%
11  2  59  126  4.53  33.5  1.45  39.5  0.21  97.37%  BEST
11  5  63  133  4.85  33.9  1.53  40.3  0.2  97.78%
13  0  150  320  13  37  3.2  53.2  0.085  99.48%
13  1  59  125  4.09  35.5  1.59  41.2  0.22  96.77%
13  2  59  125  4.78  34.2  1.44  40.4  0.22  97.3%  BEST
13  5  61  129  4.44  34.7  1.51  40.6  0.21  97.44%
15  0  128  272  19.7  44.1  2.88  66.8  0.19  99.37%
15  1  58  124  3.81  35.6  1.61  41  0.22  96.43%
15  2  58  122  4.24  34.5  1.44  40.2  0.23  96.97%
15  3  56  118  4.04  34.3  1.41  39.7  0.23  96.88%  BEST
15  5  58  123  4.24  34.3  1.41  39.9  0.22  96.97%
16  0  65  139  5.84  35  1.6  42.4  0.21  98.15%
16  1  50  106  1.98  32.8  1.3  36.1  0.23  95.24%
16  2  50  106  1.98  32.8  1.3  36.1  0.23  95.24%
16  5  50  106  1.98  32.8  1.3  36.1  0.23  95.24%
16  6  50  106  1.98  32.8  1.3  36.1  0.23  95.24%  BEST
17  0  193  410  19.3  48.9  4.45  72.6  0.11  99.62%
17  1  73  155  4.93  42.4  1.88  49.2  0.2  97.37%
17  2  69  147  4.34  40.4  1.73  46.5  0.19  97.5%  BEST
17  5  76  162  4.74  41.4  1.89  48.1  0.16  98.04%
19  0  222  473  21.2  50.9  5.25  77.4  0.11  99.68%
19  1  73  155  4.89  43.3  1.91  50.1  0.21  97.22%
19  2  73  156  5.43  42  1.82  49.2  0.21  97.87%  BEST
19  5  76  163  5.36  42.5  1.89  49.7  0.18  97.96%
23  0  267  568  23.5  53.5  6.43  83.4  0.097  99.75%
23  1  73  155  4.87  43.9  1.96  50.7  0.21  97.3%
23  2  77  163  6.32  42.7  1.86  50.9  0.21  98.11%
23  5  77  163  5.43  43.2  1.95  50.6  0.18  97.92%  BEST
29  0  240  511  32.8  59.4  5.5  97.8  0.14  99.7%
29  1  73  155  5.16  44.6  1.98  51.7  0.22  97.3%
29  2  73  154  5.76  42.9  1.8  50.5  0.22  97.73%  BEST
29  5  75  160  5.2  44  1.93  51.1  0.22  97.78%
31  1  72  154  4.8  44.5  2  51.3  0.22  97.06%
31  2  72  153  5.81  43  1.81  50.6  0.23  97.78%
31  3  70  148  5.16  42.6  1.75  49.5  0.23  97.73%  BEST
31  5  73  156  4.88  43.3  1.87  50.1  0.22  97.62%
32  0  119  252  16.7  51.2  2.81  70.7  0.22  99.16%
32  1  61  130  2.67  41.1  1.6  45.4  0.23  96%
32  2  61  130  2.67  41.1  1.6  45.4  0.23  96%
32  5  61  130  2.67  41.1  1.6  45.4  0.23  96%
32  6  61  130  2.67  41.1  1.6  45.4  0.23  96%  BEST
33  1  88  186  6.07  51.4  2.27  59.8  0.21  97.83%
33  2  83  176  4.89  49.3  2.1  56.3  0.19  97.73%  BEST
33  4  93  198  6.82  50.7  2.29  59.9  0.2  98.39%
33  5  91  193  5.98  50.5  2.28  58.8  0.17  98.28%
37  1  87  185  6.19  52  2.29  60.5  0.21  97.78%  BEST
37  2  91  193  7.3  51  2.25  60.6  0.2  98.25%
37  5  92  197  7.07  51.2  2.3  60.6  0.18  98.33%
41  1  87  185  6.29  52.6  2.3  61.2  0.22  97.78%
41  2  92  195  7.12  51.5  2.26  60.8  0.19  98.28%  BEST
41  5  93  197  6.82  51.8  2.34  61  0.18  98.44%
43  1  87  185  6.35  52.9  2.33  61.5  0.22  97.83%
43  2  88  186  7.33  51.3  2.18  60.9  0.22  98.15%  BEST
43  5  93  197  7.53  52  2.32  61.8  0.19  98.36%
47  1  87  184  5.85  52.8  2.34  61  0.22  97.67%
47  2  89  190  7.02  51.6  2.21  60.8  0.21  98.21%  BEST
47  5  93  198  6.98  52.1  2.33  61.4  0.18  98.41%
53  1  87  185  6.68  53.8  2.34  62.9  0.23  97.83%  BEST
53  2  94  200  8.8  53  2.29  64.1  0.21  98.44%
53  5  92  195  8.08  53  2.33  63.4  0.2  98.31%
59  1  87  184  6.17  53.5  2.37  62.1  0.22  97.67%
59  2  87  184  7.41  52.2  2.19  61.8  0.23  98.08%
59  5  89  189  6.52  52.8  2.32  61.6  0.19  98.11%  BEST
61  1  87  184  6.33  53.8  2.37  62.5  0.23  97.67%
61  2  86  183  6.95  52.3  2.18  61.4  0.23  97.96%  BEST
61  5  88  187  6.44  52.9  2.26  61.6  0.2  98.08%
63  1  86  183  5.81  53.3  2.38  61.4  0.22  97.5%
63  2  86  183  6.8  52  2.18  60.9  0.22  97.96%
63  3  82  175  6  50.9  2.05  59  0.23  97.96%  BEST
63  5  87  185  6.33  51.7  2.21  60.3  0.21  98%
64  1  72  154  3.64  49.5  1.86  55  0.2  96.3%
64  2  72  154  3.64  49.5  1.86  55  0.23  96.3%
64  5  72  154  3.64  49.5  1.86  55  0.23  96.3%
64  6  72  154  3.64  49.5  1.86  55  0.23  96.3%  BEST
65  1  102  216  6.98  59.9  2.65  69.5  0.2  98.11%
65  2  98  209  6.25  57.6  2.46  66.3  0.19  98.18%  BEST
65  4  121  257  10.7  62.1  3.05  75.8  0.2  98.9%
65  5  107  228  6.64  58.8  2.7  68.1  0.16  98.55%
67  1  101  215  6.96  60.1  2.64  69.7  0.21  98.04%  BEST
67  2  107  228  8.31  59.1  2.7  70.1  0.19  98.61%
67  5  107  228  7.42  59.7  2.72  69.8  0.17  98.51%
71  1  101  214  7.07  60.8  2.65  70.5  0.21  98%  BEST
71  2  109  232  8.97  60.1  2.74  71.8  0.2  98.68%
71  5  108  230  7.8  60.1  2.72  70.6  0.18  98.57%
73  1  101  215  7.4  60.7  2.65  70.7  0.21  98.04%
73  2  114  243  9.92  60.3  2.8  73.1  0.19  98.82%
73  5  108  230  7.94  59.6  2.7  70.3  0.17  98.61%  BEST
79  1  100  213  6.89  61.2  2.68  70.8  0.22  97.92%  BEST
79  2  110  234  9.83  60.5  2.68  73  0.2  98.68%
79  5  109  232  8.1  60.5  2.75  71.3  0.18  98.57%
83  1  101  215  7.41  61.7  2.68  71.8  0.22  98.04%  BEST
83  2  153  325  16.3  64.1  3.73  84.1  0.16  99.34%
83  5  107  229  8.79  61  2.7  72.5  0.19  98.63%
89  1  101  215  7.46  62  2.69  72.1  0.22  98.04%  BEST
89  2  141  300  14.2  63.9  3.42  81.5  0.17  99.24%
89  5  108  231  8.35  61.5  2.74  72.6  0.19  98.63%
97  1  101  216  7.35  61.6  2.67  71.7  0.22  98.08%
97  2  145  309  14.7  64.4  3.49  82.6  0.17  99.28%
97  5  105  224  7.36  60.6  2.68  70.7  0.18  98.46%  BEST
101  1  101  215  7.61  62.2  2.7  72.5  0.23  98.04%  BEST
101  2  154  327  16.8  65.2  3.74  85.7  0.16  99.34%
101  5  107  228  8.78  61.1  2.73  72.6  0.19  98.57%
103  1  100  213  7.29  62.5  2.71  72.5  0.23  97.92%  BEST
103  2  157  335  16.6  66  3.8  86.4  0.16  99.36%
103  5  107  227  8.39  61.8  2.69  72.9  0.19  98.51%
107  1  101  215  7.58  62.7  2.73  73  0.23  98.08%  BEST
107  2  161  342  17.1  66.1  3.85  87.1  0.16  99.37%
107  5  109  232  9.64  62.2  2.76  74.6  0.2  98.63%
109  1  101  215  7.76  63.1  2.73  73.6  0.23  98.08%  BEST
109  2  146  311  16.2  65.8  3.65  85.6  0.17  99.28%
109  5  106  225  8.72  62.4  2.7  73.8  0.2  98.46%
113  1  101  214  7.48  62.1  2.69  72.3  0.23  98%
113  2  122  259  12  63.1  2.98  78.2  0.2  98.96%
113  5  105  223  7.67  61.3  2.69  71.7  0.18  98.51%  BEST
127  1  100  212  6.83  62.1  2.77  71.7  0.22  97.83%
127  2  102  216  8.4  60.6  2.58  71.6  0.22  98.39%
127  3  95  202  7.07  59.8  2.41  69.3  0.23  98.15%  BEST
127  5  101  215  7.51  60.3  2.53  70.3  0.19  98.21%
128  1  84  178  4.32  57.7  2.16  64.2  0.2  96.77%
128  2  84  178  4.32  57.7  2.16  64.2  0.23  96.77%
128  5  84  178  4.32  57.7  2.16  64.2  0.23  96.77%
128  6  84  178  4.32  57.7  2.16  64.2  0.23  96.77%  BEST
129  1  115  245  8  68.3  2.99  79.3  0.2  98.33%
129  2  119  253  8.77  66.7  2.91  78.4  0.19  98.7%  BEST
129  4  141  299  12.8  71.3  3.54  87.6  0.2  99.1%
129  5  122  261  8.03  67.8  3.06  78.9  0.16  98.73%
131  1  115  245  7.98  68.8  3.01  79.8  0.21  98.28%
131  2  138  293  13.4  70  3.25  86.7  0.19  99.07%
131  5  122  260  8.18  67.8  3.08  79  0.16  98.72%  BEST
137  1  115  245  8.62  69.7  3.01  81.4  0.22  98.31%
137  2  143  304  14.6  71.2  3.4  89.2  0.19  99.11%
137  5  124  263  9.1  68.9  3.12  81.1  0.18  98.78%  BEST
139  1  115  245  8.37  69.6  3.03  81  0.22  98.31%  BEST
139  2  122  260  10.1  68.8  3.03  81.9  0.2  98.75%
139  5  124  263  9.79  68.9  3.13  81.9  0.18  98.8%
149  1  115  245  8.69  70.4  3.04  82.1  0.22  98.33%  BEST
149  2  147  313  16.4  72.7  3.56  92.6  0.2  99.17%
149  5  123  263  10.3  69.6  3.09  83.1  0.19  98.77%
151  1  115  244  8.35  70.5  3.06  81.9  0.22  98.25%  BEST
151  2  132  281  13.2  71.1  3.25  87.5  0.21  98.97%
151  5  124  264  10.1  69.9  3.11  83.1  0.19  98.75%
157  1  115  244  8.36  70.5  3.06  81.9  0.22  98.25%
157  2  129  274  12.6  70.6  3.15  86.4  0.21  98.91%
157  5  123  261  9.38  69.5  3.06  81.9  0.19  98.72%  BEST
163  1  115  245  8.6  70.5  3.03  82.1  0.22  98.31%  BEST
163  2  130  278  12  70.3  3.2  85.5  0.2  98.95%
163  5  123  262  9.75  69.5  3.13  82.4  0.18  98.81%
167  1  115  244  8.49  70.9  3.06  82.5  0.23  98.25%  BEST
167  2  135  287  13.9  71.9  3.29  89.1  0.21  99%
167  5  124  264  10.1  70.4  3.14  83.6  0.19  98.81%
173  1  115  245  8.82  71.1  3.09  83  0.23  98.36%  BEST
173  2  129  274  12.4  70.7  3.17  86.3  0.21  98.9%
173  5  124  263  10.4  70.4  3.11  83.9  0.19  98.77%
179  1  115  244  8.68  71.1  3.07  82.9  0.23  98.25%  BEST
179  2  134  286  14.8  72.7  3.32  90.8  0.22  98.98%
179  5  124  264  10.4  70.8  3.15  84.4  0.19  98.81%
181  1  115  245  8.92  71.5  3.09  83.5  0.23  98.36%  BEST
181  2  134  286  14.3  72.4  3.31  90.1  0.21  99.01%
181  5  123  262  10.2  70.9  3.12  84.2  0.2  98.75%
191  1  114  243  7.82  70.6  3.11  81.6  0.22  98.18%  BEST
191  2  123  262  10.9  69.9  3.11  83.9  0.21  98.81%
191  5  123  261  9.17  69.8  3.11  82.1  0.18  98.75%
193  1  115  246  8.57  70.9  3.05  82.5  0.22  98.31%
193  2  116  247  9.27  69.2  2.91  81.3  0.22  98.59%  BEST
193  5  122  260  8.86  69.7  3.11  81.7  0.18  98.75%
197  1  115  245  8.85  71.3  3.05  83.2  0.23  98.31%  BEST
197  2  134  285  14.1  72.4  3.28  89.7  0.21  99%
197  5  122  260  10.2  70.4  3.08  83.7  0.19  98.77%
199  1  114  243  8.56  71.4  3.06  83  0.23  98.21%  BEST
199  2  139  295  14.4  72.3  3.36  90.1  0.2  99.07%
199  5  123  261  9.86  70.7  3.13  83.7  0.19  98.75%
211  1  115  244  8.74  71.5  3.08  83.3  0.23  98.25%  BEST
211  2  144  306  16.8  74.1  3.49  94.4  0.22  99.13%
211  5  122  259  10.6  70.5  3.04  84.2  0.2  98.7%
223  1  114  243  8.08  71.1  3.12  82.3  0.22  98.18%  BEST
223  2  125  267  12.3  71  3.13  86.4  0.22  98.86%
223  5  123  262  9.65  70.3  3.13  83.1  0.19  98.78%
227  1  114  243  8.7  71.6  3.07  83.3  0.23  98.21%  BEST
227  2  125  266  12.6  71.8  3.14  87.5  0.23  98.84%
227  5  121  257  9.81  70.6  3.08  83.5  0.19  98.7%
229  1  115  244  8.81  71.8  3.09  83.7  0.23  98.25%  BEST
229  2  140  297  16.2  74  3.46  93.6  0.22  99.1%
229  5  121  258  10.1  70.6  3.09  83.8  0.2  98.72%
233  1  115  244  8.73  71.3  3.1  83.1  0.23  98.25%
233  2  116  246  9.67  69.6  2.93  82.2  0.22  98.55%  BEST
233  5  121  258  9.5  70.5  3.12  83.1  0.19  98.72%
239  1  114  243  8.21  71.4  3.13  82.8  0.23  98.18%
239  2  117  248  9.71  70  2.96  82.6  0.22  98.55%  BEST
239  5  121  257  9.34  70.4  3.07  82.8  0.19  98.7%
241  1  114  243  8.73  71.5  3.07  83.3  0.23  98.21%
241  2  115  245  9.57  69.8  2.9  82.2  0.22  98.53%  BEST
241  5  119  253  8.9  70.6  3.05  82.6  0.19  98.63%
251  1  114  243  8.27  71.4  3.14  82.8  0.23  98.18%
251  2  115  246  9.59  69.7  2.91  82.2  0.23  98.46%
251  5  118  252  9.15  70  3.01  82.2  0.19  98.57%  BEST
255  1  114  242  7.89  70.9  3.14  82  0.22  98.08%
255  2  115  245  9.55  69.5  2.89  81.9  0.22  98.51%
255  3  109  232  8.22  68.4  2.76  79.3  0.23  98.48%  BEST
255  5  116  246  8.3  69  2.91  80.2  0.19  98.48%
256  1  95  203  5.02  66.1  2.45  73.6  0.2  97.14%
256  2  95  203  5.02  66.1  2.45  73.6  0.23  97.14%
256  5  95  203  5.02  66.1  2.45  73.6  0.23  97.14%
256  6  95  203  5.02  66.1  2.45  73.6  0.23  97.14%  BEST
257  1  130  276  9.24  77.7  3.37  90.4  0.21  98.53%
257  2  136  289  10.4  76.1  3.29  89.8  0.18  98.89%
257  4  158  336  14.5  80.9  3.98  99.4  0.21  99.2%
257  5  136  290  8.69  76.4  3.45  88.5  0.16  98.84%  BEST
509  1  128  272  9.2  79.9  3.52  92.6  0.22  98.36%
509  2  130  277  10.3  77.8  3.24  91.3  0.21  98.7%
509  5  131  279  9.73  78.1  3.32  91.1  0.19  98.73%  BEST
511  1  127  271  8.84  80  3.53  92.4  0.22  98.28%
511  2  130  277  10.8  78.2  3.29  92.3  0.22  98.72%
511  3  124  264  9.21  77.1  3.13  89.4  0.23  98.7%  BEST
511  5  130  276  9.63  77.7  3.24  90.5  0.19  98.63%
512  1  107  228  5.75  74.5  2.74  83  0.21  97.44%
512  2  107  228  5.75  74.5  2.75  83  0.23  97.44%
512  5  107  228  5.75  74.5  2.75  83  0.24  97.44%
512  6  107  228  5.75  74.5  2.75  83  0.24  97.44%  BEST
513  1  144  305  10.2  86.1  3.71  100  0.21  98.67%
513  2  152  323  11.4  84.5  3.73  99.6  0.18  99.03%
513  4  177  376  16.3  90.3  4.47  111  0.21  99.28%
513  5  152  322  9.91  85.3  3.81  99  0.17  98.96%  BEST
1021  1  142  302  10.3  89.2  3.91  103  0.23  98.51%
1021  2  162  344  15.7  88.9  3.89  109  0.2  99.13%
1021  5  146  311  11.3  87.1  3.81  102  0.2  96.43%  BEST
1023  1  141  300  9.89  88.9  3.91  103  0.22  98.44%
1023  2  152  322  13.2  87.6  3.77  105  0.21  99.01%
1023  3  137  291  10.3  86  3.46  99.8  0.23  98.77%  BEST
1023  5  146  310  11  86.4  3.61  101  0.19  98.77%
1024  1  119  252  6.46  82.8  3.04  92.3  0.21  97.67%
1024  2  135  287  9.63  84.9  3.38  97.9  0.22  98.63%
1024  5  119  252  6.46  82.8  3.04  92.3  0.24  97.67%
1024  6  119  252  6.46  82.8  3.04  92.3  0.24  97.67%  BEST
1025  1  158  337  11.4  95.6  4.15  111  0.21  98.8%
1025  2  169  359  12.6  93.4  4.12  110  0.18  99.14%
1025  4  193  411  17.7  99.7  4.91  122  0.21  99.34%
1025  5  167  356  10.9  93.8  4.22  109  0.16  99.07%  BEST
2039  1  156  332  11.2  98.2  4.33  114  0.23  98.63%  BEST
2039  2  199  424  21.2  101  4.84  127  0.2  99.39%
2039  5  165  350  13.2  96.9  4.49  115  0.19  96.84%
2047  1  155  331  10.8  97.7  4.33  113  0.22  98.57%
2047  2  165  351  14  95.9  4.12  114  0.21  99.07%
2047  3  150  319  11.4  94.4  3.8  110  0.23  98.86%  BEST
2047  5  160  340  12  94.8  3.95  111  0.19  98.91%
2048  1  130  277  7.16  91.1  3.34  102  0.21  97.87%
2048  2  149  317  9.81  94.3  3.69  108  0.22  98.65%
2048  5  130  277  7.17  91.1  3.34  102  0.24  97.87%
2048  6  130  277  7.17  91.1  3.34  102  0.24  97.87%  BEST
2049  1  173  367  12.6  104  4.48  121  0.21  98.91%  BEST
2049  2  191  406  15.2  103  4.65  123  0.18  99.28%
2049  4  211  450  19.7  109  5.34  134  0.21  99.4%
2049  5  214  456  19.3  107  5.32  132  0.17  99.4%
4093  1  170  362  12.3  107  4.73  124  0.23  98.75%  BEST
4093  2  210  447  21.4  109  5.01  135  0.19  99.41%
4093  5  177  377  13.7  105  6.13  125  0.19  89.62%
4095  1  170  361  11.9  107  4.73  123  0.22  98.7%
4095  2  195  416  18.2  106  4.75  129  0.19  99.32%
4095  3  163  347  12.6  103  4.11  119  0.23  98.99%  BEST
4095  5  176  374  13.2  104  5.94  123  0.19  90.1%
4096  1  142  301  7.88  99.5  3.63  111  0.21  98.04%
4096  2  162  345  11.6  102  4.03  118  0.22  98.88%
4096  5  142  301  7.88  99.5  3.63  111  0.24  98.04%
4096  6  142  301  7.88  99.5  3.63  111  0.24  98.04%  BEST
4097  1  187  399  13.7  113  4.87  132  0.22  99.01%
4097  2  200  425  15.3  110  4.88  131  0.18  99.28%  BEST
4097  4  229  488  21.1  118  5.83  145  0.21  99.46%
4097  5  234  497  21.1  117  5.78  143  0.17  99.47%
8179  1  184  392  13.7  117  5.06  135  0.23  98.85%  BEST
8179  2  225  480  22.7  118  5.47  146  0.2  99.44%
8179  5  247  525  22  128  7.11  157  0.18  99.45%
8191  1  184  391  12.9  115  5.12  133  0.22  98.81%
8191  2  204  433  18.3  114  4.96  137  0.2  99.3%
8191  3  178  378  13.6  111  4.48  129  0.23  99.09%  BEST
8191  5  194  414  15.2  113  6.06  134  0.19  94.69%
8192  1  153  326  8.59  108  3.93  120  0.21  98.18%
8192  2  176  374  12  111  4.36  127  0.21  98.91%
8192  5  153  326  8.59  108  3.93  120  0.24  98.18%
8192  6  153  326  8.59  108  3.93  120  0.24  98.18%  BEST
8193  1  201  428  14.7  121  5.21  141  0.21  99.07%  BEST
8193  2  7325  15585  196  273  213  682  0.026  99.84%
8193  4  248  527  22.9  128  6.38  157  0.2  98.49%
8193  5  251  533  22.3  125  6.16  154  0.17  99.49%
16381  1  198  421  14.4  125  5.5  144  0.23  98.92%  BEST
16381  2  243  517  25.1  126  5.7  157  0.19  99.47%
16381  5  257  546  22.3  134  7.34  163  0.18  99.42%
16383  1  197  420  13.8  124  5.5  143  0.22  98.89%
16383  2  228  486  20.7  123  5.4  149  0.18  99.4%
16383  3  192  409  14.8  120  4.86  140  0.23  99.15%  BEST
16383  5  256  544  21.8  133  7.3  162  0.17  99.43%
16384  1  165  351  9.28  116  4.22  130  0.21  98.31%
16384  2  189  402  13.2  120  4.68  137  0.22  99.03%
16384  5  165  351  9.28  116  4.22  130  0.24  98.31%
16384  6  165  351  9.28  116  4.22  130  0.24  98.31%  BEST
16385  1  216  459  15.8  131  5.6  152  0.22  99.14%  BEST
16385  2  15576  33141  315  386  455  1160  0.021  99.87%
16385  4  266  567  24.3  137  7.07  169  0.21  98.09%
16385  5  273  582  24.4  136  6.88  167  0.17  99.55%
Appendix D
RNS multiplier results
The results have been generated using uniformly distributed random input operands and a setup as described in figure 5.6b on page 41. Below is a description of the titles in the header row of table D.1.
Modulo The modulus.
Type Results for the specific multiplier type, see table 3.2 on page 13 for details.
Area The total area of the multiplier including the registers, in µm².
Gates The total gate count including registers.
Switch power The switching power in µW. See section 2.3.3 on page 10 for details about the different power values.
Int. power The internal power in µW.
Leak power The leakage power in µW.
Total power The total power in µW.
Toggle rate The average toggle rate per net per clock cycle.
UVT cells The percentage of UVT cells. See section 2.3.3 on page 10 for details.
Table D.1: Results for RNS multipliers
Modulo  Type  Area  Gates  Switch power  Int. power  Leak power  Total power  Toggle rate  UVT cells
2  7  14  29  0.113  7.59  0.348  8.04  0.2  85.71  BEST
2  0  14  29  0.113  7.59  0.348  8.04  0.2  85.71
2  1  14  29  0.113  7.59  0.348  8.04  0.2  85.71
3  0  25  52  0.406  14.9  0.575  15.9  0.19  90.91  BEST
3  1  27  58  1.02  15.2  0.618  16.8  0.18  94.12
3  2  28  59  1.2  15.2  0.631  17  0.18  94.44
3  6  28  60  1.05  15.2  0.631  16.9  0.17  94.12
4  7  25  53  0.602  15.4  0.611  16.6  0.2  92.31  BEST
4  0  25  53  0.6  15.4  0.611  16.6  0.2  92.31
4  1  25  53  0.602  15.4  0.611  16.6  0.2  92.31
5  0  43  92  2.84  23.9  1  27.7  0.2  96.77  BEST
5  1  66  140  4.86  26.9  1.69  33.4  0.13  98.15
5  6  69  148  4.86  25.8  1.59  32.2  0.12  98.55
7  0  46  99  3.48  24.9  1.07  29.5  0.2  97.06  BEST
7  1  57  122  4.66  26.7  1.4  32.8  0.17  97.87
7  2  51  109  3.89  26  1.27  31.1  0.2  97.3
7  4  51  108  4.07  26.5  1.22  31.8  0.21  97.44
7  6  67  143  5.66  27.7  1.59  34.9  0.15  98.41
8  0  40  86  2.06  24.1  0.994  27.2  0.21  96.3  BEST
8  1  38  81  1.43  23.6  0.917  25.9  0.2  95.24
8  7  38  81  1.43  23.6  0.917  25.9  0.2  95.24
9  0  84  179  7.71  34.4  1.91  44  0.16  98.85  BEST
9  1  103  218  8.68  38.2  2.68  49.5  0.15  98.89
9  6  105  224  7.8  38.3  3.19  49.3  0.12  95.7
11  4  74  158  7.78  36.5  1.77  46  0.21  98.41  BEST
11  0  91  193  8.97  35.7  2.13  46.8  0.15  99.01
11  1  106  225  9.71  40.3  2.75  52.8  0.16  98.91
11  6  118  251  10.6  42.5  4.51  57.6  0.14  96.26
13  4  78  165  9.42  37.4  1.84  48.6  0.22  98.55  BEST
13  0  135  286  13.2  37.7  3.01  53.9  0.11  99.44
13  1  101  215  10.4  43.1  2.88  56.4  0.17  98.59
13  6  113  239  10.8  43.4  3.22  57.4  0.16  97.94
15  2  87  185  9.46  38.4  2.14  50  0.18  98.63  BEST
15  0  102  218  14  39.7  2.32  56  0.19  99.1
15  1  91  194  8.35  41.3  2.57  52.2  0.18  98.46
15  5  680  1448  95.7  144  50.1  290  0.13  83.01
15  6  112  238  11.2  44.4  3.43  59  0.15  94.9
16  7  56  120  2.86  33.3  1.48  37.6  0.18  96.88  BEST
16  0  73  156  8.2  35.5  1.74  45.5  0.21  98.57
16  1  56  120  2.86  33.3  1.48  37.6  0.18  96.88
17  1  137  292  13.5  51.7  3.73  68.9  0.16  99.14  BEST
17  0  187  397  22.1  53.8  4.6  80.5  0.15  99.6
17  6  288  613  38.4  85.6  10.3  134  0.18  97.01
19  4  110  234  13.5  47.6  2.58  63.7  0.19  99.06  BEST
19  0  206  439  20.5  52.2  5.03  77.7  0.12  99.65
19  1  189  402  20.4  60.3  5.89  86.6  0.15  98.4
19  6  261  556  30.7  72.6  8.65  112  0.15  97.38
23  4  121  258  17.2  50.3  2.83  70.3  0.2  99.19  BEST
23  0  228  485  22.4  54.5  6.01  82.9  0.12  99.69
23  1  167  356  20.3  57.7  4.42  82.5  0.17  99.39
23  6  281  598  39.7  84.4  10.2  134  0.18  97.21
29  4  138  293  20.5  52.5  3.18  76.1  0.2  99.34  BEST
29  0  377  803  55  78  9  142  0.15  99.82
29  1  146  310  18.1  62.2  4.67  84.9  0.18  98.04
29  6  289  615  42.4  89.7  9.93  142  0.19  97.69
31  1  127  270  14.1  55.9  3.64  73.6  0.19  98.9  BEST
31  0  240  511  38.6  64.4  5.56  109  0.17  99.7
31  2  163  348  21.4  61.3  4.94  87.6  0.18  98.77
31  4  141  300  22.6  54.3  3.27  80.2  0.21  99.36
31  5  1099  2338  102  147  145  394  0.08  55.96
31  6  259  552  35.3  78.8  8.82  123  0.17  97.42
32  7  76  162  4.74  43.8  2.08  50.6  0.19  97.73  BEST
32  0  163  346  27.4  56.5  3.77  87.6  0.2  99.51
32  1  76  162  4.74  43.8  2.08  50.6  0.19  97.73
33  1  186  397  21.3  67.5  5.39  94.2  0.16  98.68  BEST
33  0  404  861  33.8  72.2  10.9  117  0.095  99.84
33  6  465  990  67.2  124  23.4  214  0.16  91.49
37  4  201  427  25.7  63.2  4.68  93.6  0.15  99.6  BEST
37  1  297  631  37.2  81.6  10.3  129  0.15  95.69
37  6  432  919  59.8  116  21.7  197  0.16  92.89
41  4  201  427  27.7  64.5  4.88  97.1  0.16  99.6  BEST
41  1  264  563  34  77.7  7.26  119  0.16  99.36
41  6  435  926  63.6  122  22.5  208  0.16  89.86
43  4  217  462  31.6  67.1  5.14  104  0.16  99.63  BEST
43  1  261  555  36.4  83.3  8.49  128  0.17  97.45
43  6  420  895  63.8  119  19.4  202  0.17  93.04
47  4  236  503  34.5  69.2  5.74  110  0.16  99.67  BEST
47  1  248  528  34.4  83  8.92  126  0.18  96.8
47  6  425  905  65.2  120  21  206  0.17  91.23
53  4  239  508  36.6  70.1  5.68  112  0.16  99.67  BEST
53  1  273  581  37.7  89.5  9.81  137  0.17  96.89
53  6  430  914  69.1  126  18.9  214  0.18  94.93
59  4  236  503  41.1  75.2  5.55  122  0.19  99.67  BEST
59  1  241  512  33  84.7  7.71  125  0.18  98.7
59  6  406  864  65.3  120  18.2  203  0.18  93.06
61  4  237  504  41.8  76  5.51  123  0.19  99.66  BEST
61  1  230  490  32.2  87.3  7.76  127  0.18  97.96
61  6  417  888  67.8  123  20  211  0.18  91.1
63  1  168  358  20.8  72  5.5  98.4  0.18  97.35  BEST
63  2  285  607  40.9  89.6  9.26  140  0.19  98.06
63  5  1346  2863  234  304  226  765  0.14  39.3
63  6  417  887  67.2  124  19.5  210  0.18  93.21
64  1  99  210  6.92  54.3  2.79  64  0.19  98.28
67  1  444  945  61.8  113  15  190  0.15  97.45  BEST
67  4  308  654  41.7  81.3  8.65  132  0.14  98.78
67  6  649  1382  117  184  54.8  356  0.18  78.33
71  1  458  975  67.2  116  13.9  197  0.16  98.14  BEST
71  4  349  743  49.7  87.5  10.3  147  0.13  98.91
71  6  654  1391  115  183  56.7  355  0.19  75.72
73  1  376  800  51.5  108  12.2  172  0.16  97.29  BEST
73  4  315  670  43.9  80.5  9.12  134  0.13  97.61
73  6  689  1467  123  192  46.4  361  0.19  85.28
79  4  382  812  55  92.6  10.6  158  0.14  99.04  BEST
79  1  451  960  64.8  122  16.3  203  0.16  96.63
79  6  680  1447  124  196  61.3  381  0.19  78.21
83  4  438  933  63.9  97.2  14.2  175  0.13  96.7  BEST
83  1  471  1003  72.7  124  16.1  213  0.16  97.43
83  6  656  1395  116  191  61.9  368  0.19  75.94
89  4  425  903  64.4  95.5  13.7  174  0.13  96.42  BEST
89  1  396  842  59.7  113  13.2  186  0.17  97.93
89  6  566  1205  106  163  49.6  318  0.18  75.84
97  1  397  844  58.8  115  14.4  188  0.17  97.03  BEST
97  4  407  865  53.6  86.4  9.99  150  0.11  99.31
97  6  630  1340  114  189  49  352  0.19  79.79
101  4  488  1038  64.8  102  15.2  182  0.11  97.63  BEST
101  1  414  880  61.5  119  17.1  197  0.16  95.63
101  6  584  1242  115  168  52.1  334  0.18  75.73
103  4  497  1058  67.1  101  13.7  182  0.12  98.81  BEST
103  1  391  831  60.4  122  15.5  198  0.18  96.25
103  6  643  1367  121  196  43.2  360  0.2  85.26
107  4  561  1193  73.5  107  15.7  197  0.11  98.61  BEST
107  1  424  902  70.4  127  20.6  218  0.18  92.56
107  6  684  1456  134  207  63.1  405  0.2  75.11
109  4  536  1140  77.4  111  16.5  205  0.12  97.97  BEST
109  1  446  949  72.7  131  16.4  220  0.17  98.17
109  6  655  1394  131  204  48.2  383  0.2  82.01
113  1  333  707  48.3  110  11.6  169  0.18  97.97  BEST
113  4  465  990  61.5  95.1  14.6  171  0.12  97.59
113  6  611  1299  116  185  44.8  346  0.2  83.57
127  1  246  523  34.2  92  7.4  134  0.19  99.09  BEST
127  2  567  1207  100  160  25.3  286  0.19  94.97
127  4  458  974  83.3  114  13.1  210  0.17  98.27
127  5  2066  4396  230  294  425  950  0.085  17.69
127  6  594  1263  118  177  60.9  356  0.19  69.61
128  7  124  264  9.96  65.8  3.52  79.3  0.19  98.67  BEST
128  1  124  264  9.96  65.8  3.52  79.3  0.19  98.67
129  1  336  716  42.5  106  11.9  160  0.17  95.17  BEST
129  6  915  1946  151  236  118  506  0.16  59.88
131  4  575  1223  64.6  107  19.5  191  0.092  97.45  BEST
131  1  583  1241  82.4  138  18.5  238  0.16  97.01
131  6  924  1966  159  243  127  529  0.16  58.75
137  1  663  1410  101  154  22.9  277  0.16  96.46  BEST
137  6  945  2010  168  247  126  541  0.17  57.27
139  4  615  1309  65.7  111  17.8  195  0.091  99.08  BEST
139  1  740  1574  116  173  24.7  313  0.16  96.66
139  6  923  1964  156  247  123  526  0.17  60.02
149  4  601  1278  61.8  105  20.4  187  0.088  97.13  BEST
149  1  673  1433  103  163  25.4  291  0.16  96.53
149  6  876  1864  169  237  126  532  0.17  47.52
151  4  652  1386  66.1  112  17.9  196  0.09  99.29  BEST
151  1  693  1474  109  166  22.3  297  0.16  97.83
151  6  820  1745  155  229  107  492  0.17  54.75
157  4  642  1366  65.1  107  21.5  194  0.088  97.02  BEST
157  1  718  1527  117  173  24.5  314  0.16  97.37
157  6  933  1986  167  251  123  541  0.17  57.49
163  4  714  1518  76.4  118  24.3  219  0.093  97.06  BEST
163  1  709  1509  114  172  24.3  309  0.16  97.52
163  6  943  2007  170  258  121  549  0.17  61.66
167  4  709  1508  73  119  20.8  213  0.093  98.26  BEST
167  1  772  1642  119  174  25.9  319  0.15  97.21
167  6  893  1899  154  242  118  515  0.17  57.4
173  4  714  1519  72.5  118  20.4  210  0.085  99.07  BEST
173  1  767  1631  120  177  27.6  324  0.16  96.36
173  6  868  1847  172  233  122  527  0.18  48.32
179  4  711  1514  72  114  24.5  210  0.085  96.83  BEST
179  1  725  1542  117  176  25  318  0.17  97.32
179  6  894  1901  167  248  117  532  0.18  57.82
181  4  742  1580  75  125  22.7  223  0.085  98.91  BEST
181  1  731  1556  117  176  25.2  318  0.16  97.33
181  6  859  1828  151  235  108  494  0.18  60.78
191  4  771  1640  77.9  118  23.8  220  0.085  97.87  BEST
191  1  494  1052  75.7  139  17.4  232  0.18  96.12
191  6  916  1950  160  244  120  524  0.17  58.58
193  4  694  1477  66.4  105  19.7  191  0.08  98.58  BEST
193  1  525  1117  82.7  150  18  251  0.17  98.24
193  6  970  2065  181  273  136  590  0.18  54.59
197  4  792  1686  76.5  119  24.5  220  0.081  98.26  BEST
197  1  582  1237  93.8  151  23.1  268  0.17  94.97
197  6  863  1837  158  242  116  516  0.18  57.22
199  4  806  1714  78.5  125  23.5  227  0.084  99.18  BEST
199  6  890  1894  165  247  123  534  0.18  58.4
211  4  832  1769  84.5  125  22.3  232  0.083  99.52  BEST
211  1  591  1257  92.9  155  20.1  268  0.16  97.79
211  6  839  1786  180  242  117  539  0.19  48.98
223  4  878  1868  87.3  129  30.1  247  0.083  97.13  BEST
223  1  533  1133  83.3  145  18.6  247  0.17  96.83
223  6  938  1996  169  261  130  560  0.18  54.92
227  1  469  997  72.5  144  16.9  234  0.18  96.76  BEST
APPENDIX D. RNS MULTIPLIER RESULTS
Table
D.1
–conti
nued
from
pre
vio
us
page
Modulo
Typ
eA
rea
Gates
Sw
itch
pow
er
Int.
pow
er
Leak
pow
er
Total
pow
er
Toggle
rate
UV
Tcell
s227
4909
1934
80.9
128
31.3
240
0.0
73
97.4
227
6828
1763
170
240
117
527
0.1
945.5
229
4873
1857
81.6
126
28.7
236
0.0
78
97.3
5B
EST
229
1514
1093
83.4
150
17.5
251
0.1
897.1
2229
6822
1750
171
235
115
520
0.1
947.3
4233
1510
1085
84.9
150
17.5
252
0.1
896.7
BE
ST
233
4856
1820
142
169
24.5
335
0.1
597.9
8233
6911
1937
174
263
123
560
0.1
857.1
9239
1431
916
70.7
133
14.4
218
0.1
897.5
4B
EST
239
4880
1872
140
175
26.9
342
0.1
598.0
1239
6927
1973
179
266
128
573
0.1
856.3
6241
1389
827
60
132
14.2
206
0.1
997.9
1B
EST
241
4910
1936
152
178
24.5
354
0.1
598.9
8241
6835
1777
178
239
115
532
0.1
947.2
8251
1374
795
60.1
131
13.1
205
0.1
997.5
1B
EST
251
4976
2078
154
184
27.8
366
0.1
499.0
2251
6854
1817
162
245
102
509
0.1
961.5
3255
1315
670
45.2
114
10.4
169
0.1
997.3
2B
EST
255
2999
2126
201
260
111
573
0.1
865.7
9255
52599
5531
519
606
559
1680
0.1
510.9
1255
6938
1996
174
267
117
558
0.1
859.6
6256
7153
325
14.1
77.8
4.3
496.2
0.1
998.9
9B
EST
256
1153
325
14.1
77.8
4.3
496.2
0.1
998.9
9257
1407
866
54.1
125
12.5
192
0.1
798.4
1B
EST
257
61242
2642
218
327
197
742
0.1
645.8
509
1429
913
68.6
146
14.1
229
0.1
997.8
6B
EST
509
61132
2409
215
324
171
710
0.1
849.5
511
1397
845
59.1
135
12.6
207
0.1
998.7
3B
EST
511
22001
4257
434
482
275
1190
0.1
848.3
1511
53241
6896
358
450
667
1480
0.0
83
9.8
6511
61131
2406
211
308
163
681
0.1
750.9
3512
7183
389
19.4
89.2
5.0
2114
0.1
999.2
BE
ST
512
1183
389
19.4
89.2
5.0
2114
0.1
999.2
513
1498
1060
71.4
152
15.5
239
0.1
798.4
9B
EST
513
61738
3698
348
480
325
1150
0.1
831.6
81021
1536
1140
89.4
180
18.5
288
0.1
997.7
7B
EST
1021
61693
3602
376
510
308
1190
0.1
932.5
21023
1467
993
73.4
156
15.8
245
0.1
997.4
9B
EST
Table D.1 – continued from previous page
Modulo  Type  Area  Gates  Switch power  Int. power  Leak power  Total power  Toggle rate  UVT cells
1023    2     2919  6210   715           733         582         2030         0.18         12.08
1023    5     3803  8090   788           894         779         2460         0.16         14.41
1023    6     1713  3644   368           514         326         1210         0.19         30.56
1024    7     216   460    24.6          102         5.96        133          0.19         99.33      BEST
1024    1     216   460    24.6          102         5.96        133          0.19         99.33
1025    1     614   1306   92.8          181         21.3        295          0.17         97.09      BEST
1025    6     2108  4486   454           607         432         1490         0.18         18.9
2039    1     746   1587   134           250         28.5        412          0.21         97.6       BEST
2039    6     1939  4125   445           587         393         1430         0.19         20.04
2047    1     550   1170   88            180         17.5        285          0.19         98.16      BEST
2047    2     3729  7935   917           905         764         2590         0.18         9.88
2047    5     4192  8920   453           542         831         1830         0.081        13.51
2047    6     2031  4322   454           597         406         1460         0.19         19.21
2048    7     252   535    31            117         6.93        155          0.19         99.43      BEST
2048    1     252   535    31            117         6.93        155          0.19         99.43
2049    1     735   1563   119           230         27.6        377          0.19         96.34      BEST
2049    6     2949  6274   640           793         639         2070         0.17         9.28
4096    7     289   616    36.5          131         8.11        176          0.2          99.5       BEST
4096    1     289   616    36.5          131         8.11        176          0.2          99.5
Division, Department: Division of Electronics Systems, Department of Electrical Engineering, 581 83 Linköping
Date: 2014-08-25
Language: English
Report category: Degree thesis (Examensarbete)
URL for electronic version: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-110176
ISBN: -
ISRN: LiTH-ISY-EX--14/4792--SE
Title of series, numbering: Linköping Studies in Science and Technology, Thesis No. 4792
ISSN: -
Title: Low Power Design Using RNS
Author: Viktor Classon

Abstract

Power dissipation has become one of the major limiting factors in the design of digital ASICs. Low power dissipation will increase the mobility of the ASIC by reducing the system cost, size and weight. DSP blocks are a major source of power dissipation in modern ASICs. The residue number system (RNS) has, for a long time, been proposed as an alternative to the regular two's complement number system (TCS) in DSP applications to reduce the power dissipation. The basic concept of RNS is to first encode the input data into several smaller independent residues. The computational operations are then performed in parallel and the results are eventually decoded back to the original number system. Due to the inherent parallelism of the residue arithmetic, hardware implementation results in multiple smaller design units. Therefore an RNS design requires low leakage power cells and will result in a lower switching activity.

The residue number system has been analyzed by first investigating different implementations of RNS adders and multipliers (which are the basic arithmetic functions in a DSP system) and then deriving an optimal combination of these. The optimum combinations have been used to implement an FIR filter in RNS that has been compared with a TCS FIR filter.

By providing different input data and coefficients to both the RNS and TCS FIR filter, an evaluation of their respective performance in terms of area, power and operating frequency has been performed. The result is promising for uniformly distributed random input data, with approximately 15 % reduction of average power with RNS compared to TCS. For a realistic DSP application with normally distributed input data, the power reduction is negligible for practical purposes.

Keywords: residue number system, RNS, low power, ASIC