FPGA AND ASIC IMPLEMENTATION OF RHO AND P-1 METHODS OF FACTORING
by
Ramakrishna Bachimanchi A Thesis
Submitted to the Graduate Faculty
of George Mason University
In Partial fulfillment of The Requirements for the Degree
of Master of Science
Computer Engineering
Committee:
_________________________________ Dr. Kris Gaj, Thesis Director
_________________________________ Dr. Rao Mulpuri, Committee Member
_________________________________ Dr. Jens-Peter Kaps, Committee Member
_________________________________ Andre Manitius, Chairman, Department Electrical and Computer Engineering
_________________________________ Lloyd J. Griffiths, Dean, The Volgenau School of Information Technology and Engineering
Date: _____________________________ Spring Semester 2007 George Mason University Fairfax, Virginia
FPGA and ASIC Implementation of rho and p-1 methods of factoring
A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science at George Mason University
By
Ramakrishna Bachimanchi Bachelor of Engineering
Osmania University, 2004
Director: Dr. Kris Gaj, Associate Professor Department of Electrical and Computer Engineering
Spring Semester 2007 George Mason University
Fairfax, VA
ACKNOWLEDGEMENTS
I would like to thank Dr. Kris Gaj for helping me throughout the course of this research. Special thanks to Dr. Soonhak Kwon, Dr. Patrick Baier, Paul Kohlbrenner, Hoang Le and Mohammed Khaleeluddin.
TABLE OF CONTENTS

ABSTRACT
CHAPTER 1: INTRODUCTION
  1.1 PUBLIC KEY CRYPTOSYSTEMS
  1.2 RSA AND ITS SECURITY
  1.3 GOALS OF THIS THESIS
CHAPTER 2: CO-FACTORING PHASE OF NUMBER FIELD SIEVE: OVERVIEW AND ALGORITHMS
  2.1 RHO METHOD
  2.2 P-1 METHOD: OVERVIEW AND ALGORITHM
  2.3 ECM METHOD
  2.4 BASIC OPERATIONS
    2.4.1 Montgomery Multiplication
    2.4.2 Exponentiation Algorithms
CHAPTER 3: PROPOSED ARCHITECTURE OF CO-FACTORING CIRCUIT
  3.1 TOP LEVEL VIEW
  3.2 IMPLEMENTATION OF BASIC ARITHMETIC OPERATIONS
    3.2.1 Modular Addition and Subtraction
    3.2.2 Montgomery Multiplier
CHAPTER 4: ARCHITECTURE OF RHO AND P-1
  4.1 PARTITIONING OF OPERATIONS BETWEEN HARDWARE AND SOFTWARE
  4.2 HARDWARE ARCHITECTURE OF RHO METHOD
  4.3 HARDWARE ARCHITECTURE OF P-1 METHOD
  4.4 GLOBAL MEMORY MAPS
    4.4.1 Rho
    4.4.2 P-1
    4.4.3 Unified rho and p-1
  4.4 LOCAL MEMORY MAPS
    4.4.1 Medium level operations
    4.4.2 Rho
    4.4.3 P-1
    4.4.4 Unified rho and p-1
  4.5 CONTROL UNIT
    4.5.1 Rho
    4.5.2 P-1
    4.5.3 Unified Unit
CHAPTER 5: FPGA IMPLEMENTATION AND VERIFICATION
  5.1 OVERVIEW OF FPGA FAMILIES
  5.2 FPGA DESIGN FLOW
  5.3 TOOLS AND VERIFICATION METHODOLOGY
CHAPTER 6: ASIC IMPLEMENTATION AND VERIFICATION
  6.1 ASIC DESIGN FLOW
  6.2 PORTING DESIGNS FROM FPGA TO ASIC
  6.3 TOOLS AND LIBRARIES
    6.3.1 Tools
    6.3.2 Libraries
CHAPTER 7: RESULTS
  7.1 FPGA RESULTS
    7.1.1 Memory Requirements
      7.1.1.1 P-1
      7.1.1.2 Rho
      7.1.1.3 Unified rho and p-1
    7.1.2 Timing Calculations for rho, p-1 and unified unit
      7.1.2.1 Rho
      7.1.2.2 P-1
      7.1.2.3 Unified rho and p-1
    7.1.3 Area and Timing results from FPGA implementations
  7.2 AREA AND TIMING RESULTS FROM ASIC IMPLEMENTATIONS
  7.3 COMPARISON OF FPGA AND ASIC RESULTS
CHAPTER 8: SUMMARY AND CONCLUSION
BIBLIOGRAPHY
CURRICULUM VITAE
LIST OF TABLES

TABLE 1: CONVERSIONS FROM ORDINARY DOMAIN TO MONTGOMERY DOMAIN
TABLE 2: OPERATION TABLE FOR RHO
TABLE 3: MEMORY RESOURCES NEEDED FOR P-1 HARDWARE IMPLEMENTATION
TABLE 4: MEMORY RESOURCES NEEDED FOR RHO IMPLEMENTATION
TABLE 5: EXECUTION TIMES OF ARITHMETIC BLOCKS
TABLE 6: TIMING CALCULATIONS FOR PHASE 1 AND PHASE 2 OF P-1
TABLE 7: FPGA IMPLEMENTATION OF SINGLE UNIT OF RHO
TABLE 8: FPGA IMPLEMENTATION OF SINGLE UNIT OF P-1
TABLE 9: FPGA IMPLEMENTATION OF SINGLE UNIFIED UNIT
TABLE 10: FPGA IMPLEMENTATION OF MAX. NO. OF RHO UNITS
TABLE 11: FPGA IMPLEMENTATION OF MAX. NO. OF P-1 UNITS
TABLE 12: FPGA IMPLEMENTATION OF MAXIMUM NO. OF UNIFIED UNITS
TABLE 13: RESULTS FROM ASIC IMPLEMENTATION
LIST OF FIGURES

FIGURE 1: STEPS OF NFS
FIGURE 2: PATTERN OF RHO
FIGURE 3: PROPOSED ARCHITECTURE FOR CO-FACTORING
FIGURE 4: BLOCK DIAGRAM OF THE ADDER/SUBTRACTOR
FIGURE 5: CASCADE OF TWO CARRY SAVE ADDERS, REDUCING FOUR OPERANDS TO TWO
FIGURE 6: BLOCK DIAGRAM OF THE MULTIPLIER
FIGURE 7: TOP LEVEL BLOCK DIAGRAM
FIGURE 8: GRAPHICAL REPRESENTATION OF ORIGINAL POLLARD'S RHO
FIGURE 9: GRAPHICAL REPRESENTATION OF IMPROVED VERSION BY BRENT
FIGURE 10: SEQUENCE OF OPERATIONS IN HARDWARE FOR BRENT'S RHO
FIGURE 11: CONTENTS OF GLOBAL MEMORY FOR RHO
FIGURE 12: CONTENTS OF GLOBAL MEMORY FOR P-1 A) PHASE 1 B) PHASE 2
FIGURE 13: CONTENTS OF LOCAL MEMORY OF RHO
FIGURE 14: LOCAL MEMORY OF P-1
FIGURE 15: CONTENTS OF LOCAL MEMORY OF P-1 A) PHASE 1 B) PHASE 2
FIGURE 16: FPGA DESIGN FLOW
FIGURE 17: ASIC DESIGN FLOW
FIGURE 18: PHYSICAL DESIGN FLOW OF ASIC
ABSTRACT
FPGA AND ASIC IMPLEMENTATION OF RHO AND P-1 METHODS OF FACTORING
Ramakrishna Bachimanchi, MS
George Mason University, 2007
Thesis Director: Dr. Kris Gaj
One of the main challenges in cryptography is breaking RSA-1024, which amounts to
factoring large integers. The methods suitable for this purpose are the Multiple
Polynomial Quadratic Sieve (MPQS) and the Number Field Sieve (NFS); of the two,
the Number Field Sieve is the more efficient for breaking RSA-1024. The most
time-consuming step of NFS is checking a large quantity of numbers for smoothness
and factoring the numbers identified as potentially smooth. This phase is called
relation collection. In this phase, the numbers identified by sieving as
potentially smooth go through trial division and then through special-purpose
integer factorization methods: rho, p-1, and the elliptic curve method. Each of
these methods suits a different type of number, depending on how the number is
constructed and on the properties of the factors it may have. Because the
smoothness test must be performed on a large quantity of numbers, an architecture
is needed that can carry out this task quickly and efficiently. The solution is
to implement all the special-purpose integer factorization methods in hardware
and integrate them together. The elliptic curve method of factoring was
implemented in hardware and presented in the thesis submitted by Mohammed
Khaleeluddin. In this thesis I implemented the remaining methods, rho and p-1,
and developed a unified architecture combining the two. Both methods and the
unified unit were implemented in two technologies, FPGA and ASIC. Among FPGAs,
the low-cost devices Spartan 3 and Spartan 3E outperform the high-performance
devices Virtex II and Virtex 4 in terms of performance-to-cost ratio, and are
thus the most suitable for code breaking. The ASIC implementation achieved a
higher clock frequency and better area efficiency than the FPGA implementations.
When ASIC chips are manufactured in quantities large enough to amortize the
non-recurring cost of fabrication, ASICs have an edge over FPGAs; when such
quantities are not expected, low-cost FPGA devices are the best choice.
Chapter 1: Introduction
Cryptography ("secret writing", from the Greek kryptós, "hidden," and gráfo,
"write") is the study of message secrecy. It refers to encryption, the process of
scrambling or obscuring ordinary messages (plaintext) so that they are not easily
readable (they take the form of a random-looking stream of bits called
ciphertext). The process reverse to encryption is decryption, which recovers the
original message (plaintext) from the encrypted message (ciphertext). An
algorithm that defines both encryption and decryption is called a cryptosystem.
Cryptosystems are mainly categorized into secret key (or symmetric)
cryptosystems and public key (or asymmetric) cryptosystems. In all cryptosystems
encryption and decryption depend on a key (or a pair of keys) which changes
their detailed operation.
1.1 Public Key Cryptosystems
In secret key cryptosystems the same secret key is used for both encryption and
decryption. The challenging task in such a cryptosystem is to agree on a shared
secret key without any third party learning its value. Anyone who finds out the
key can intercept messages in transit, as well as read, modify or forge the
information. The process of generating, transmitting and storing all the keys is
known as key management; securing key management is one of the main difficulties
of symmetric cryptosystems.
In public key cryptosystems each user is supplied with a public key and a private
key. A message is encrypted using the recipient's public key and can only be
decrypted using the intended recipient's private key. The public key may be
published, but the private key must be kept secret. Public and private keys are
related mathematically, but deriving the private key from the public key is
computationally impractical. As all communications make use of public keys, the
problem of sharing a secret key between sender and receiver is avoided.
1.2 RSA and its security
RSA is one of the first and most significant developments in public key
cryptography, and it was the first algorithm suitable for both signing and
encryption. It is one of the most widely used public key cryptosystems in
electronic commerce protocols. The security of RSA depends on the length of the
keys used: the harder the modulus is to factor, the more secure the system using
RSA is. Suitable methods for factoring RSA-1024 exist and are discussed in the
following section.
RSA security is determined by the difficulty of factoring integers. The most
suitable methods for factoring large integers are MPQS (Multiple Polynomial
Quadratic Sieve) and NFS (Number Field Sieve). Of these two methods, the Number
Field Sieve is the more efficient for factoring RSA-1024.
The number field sieve was invented by Pollard in 1991 [6]. Initially it was used
only for factoring numbers that are close to perfect powers, but it was later
extended and improved to handle arbitrarily large integers. Using NFS, an RSA
modulus of 663 bits was successfully factored by Bahr, Boehm, Franke and
Kleinjung in May 2005 [7]. The main steps of NFS are shown in Figure 1.
[Figure: flow of the NFS steps -- Polynomial Selection; Sieving; Relation
Collection / norm factoring (co-factoring) of 200-bit and 350-bit numbers by
ECM, the p-1 method, and the rho method; Linear Algebra; Square Root.]

Figure 1: Steps of NFS
The main steps of the NFS are polynomial selection, relation collection, linear
algebra, and square root. The relation collection step consists of two phases,
sieving and co-factoring (also known as norm-factoring). This thesis focuses on
the p-1 and rho methods of factoring used as part of co-factoring, the second
phase of the relation collection step. In the relation collection step, sieving
selects the numbers, between 200 and 350 bits in size, which are potentially
smooth, i.e., whose largest factor does not exceed a given bound. Once these
numbers are identified, they go through the co-factoring phase, where they can
be factored using special-purpose integer factorization algorithms, such as the
rho method, the p-1 method, and ECM. All these methods are probabilistic, and
there is no guarantee that every number generated by sieving will be factored in
this phase, even if it fulfills all requirements. The main features of these
algorithms are as follows:
- the rho method easily finds relatively small factors,
- the p-1 method finds factors p for which p-1 is smooth over a certain bound,
  and
- ECM finds factors p for which the order of the associated elliptic curve group
  is smooth. In the case of ECM, the probability of success can be increased by
  running multiple iterations of the algorithm with different initial parameters.
1.3 Goals of this Thesis
The main goals of this thesis are to:
- determine the optimum architectures for rho, p-1, and a unified architecture
  supporting both algorithms,
- implement and verify these architectures in FPGA and ASIC, and
- demonstrate the speed-up of ASIC vs. FPGA in terms of the number of operations
  per second, assuming maximum resource utilization.
The choice of optimum architecture amounts to selecting an optimum number of
arithmetic units, i.e. multipliers and adder/subtractors, choosing an optimum
exponentiation algorithm for p-1, and optimum memory organization.
Chapter 2: Co-factoring phase of number field sieve: Overview and
Algorithms
2.1 Rho Method
The rho method is a special-purpose integer factorization algorithm which was
proposed by John M. Pollard [13] in 1975 and improved later by Richard Brent [15]
in 1980. This algorithm is suitable when the composite number has small factors.
It is based on Floyd's cycle-finding algorithm and the birthday paradox.

The theory behind Pollard's rho method is as follows. Consider a composite N
with an unknown factor q; there always exists such a q no bigger than sqrt(N).
If we start picking random numbers x_i which are greater than zero and less than
N, the only time we can get x_i = x_j (mod N) is when x_i and x_j are identical.
But since q is less than sqrt(N), there is a chance that x_i = x_j (mod q) even
if they are not identical. In this case x_i = x_j (mod q) means that q divides
x_i - x_j. As q is a factor of N and a divisor of x_i - x_j, gcd(x_i - x_j, N)
is an integer multiple of q. We can divide N by this gcd to break it into
smaller composite numbers, and then apply the same algorithm to the smaller
numbers if they are composite. This method is effective at splitting composite
numbers which have small factors.
The basics of the algorithm are as follows: pick two numbers x and y at random
and find the greatest common divisor of x - y and N; if it is equal to one, pick
another pair of numbers and repeat the procedure based on the result. For m
pairs we would have to do on the order of m^2 gcd checks. We can minimize the
number of gcd checks if we follow the approach below.
Figure 2: Pattern of Rho
Instead of choosing numbers completely at random, we choose a polynomial f which
generates the next random number from the previous one. If we generate the
sequence x_{i+1} = f(x_i) (mod N), at some point we will have two indices i and
j such that x_i is congruent to x_j modulo q. Once we reach a point where x_i is
congruent to x_j, every element in the sequence after x_i will be congruent to
the corresponding element after x_j. This sequence does not repeat from the very
first element; rather, it repeats after some number s of elements, which form a
tail of the sequence, as shown in Figure 2. The subsequent sequence values
reduced modulo q form a loop. The tail and loop together have the shape of the
Greek letter rho, from which the method acquired its name.

Why will this work? Consider a loop of length t which starts at the s-th random
number. As shown in Figure 2, at some point we are at elements x_k and x_{2k} of
the sequence, and x_{2k} = x_k (mod q) if k is equal to t or is a multiple of t
(once k has passed the tail). Instead of checking the gcd for all possible
differences x_i - x_j, we find the gcd of differences of elements which are in
the form (x_{2k} - x_k), increasing k by one in each step. The next thing to
choose is the polynomial; in general the polynomial is f(x) = x^2 + a (mod N),
where a is some constant which is not congruent to 0 or -2 modulo N; a = 1 and
a = -1 are typical choices. The algorithm is explained as Algorithm 1.
In 1980 Richard P. Brent [15] improved the algorithm so that it is about 25%
faster than Pollard's version, while the outcome is still the same. The improved
version is shown in Algorithm 2.
Complexity of the rho method [3]: let p be a factor of n with p <= sqrt(n); then
the rho algorithm has an expected running time on the order of sqrt(p)
iterations, i.e., O(n^(1/4)) in the worst case.
Algorithm 1: Pollard's rho Algorithm

1. Initialize b = c = x_0; choose the polynomial as f(x) = x^2 + a
2. Calculate b = f(b) mod n and c = f(f(c)) mod n
3. Compute d = gcd(b - c, n)
4. If 1 < d < n, a nontrivial factor of n is found
5. If d = 1, go to step 2
6. If d = n, change a and go to step 1
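As a software cross-check (not part of the thesis hardware), Algorithm 1 can be sketched in a few lines of Python; the function name and the test value 8051 are ours:

```python
from math import gcd

def pollard_rho(n, x0=2, a=1):
    """Pollard's rho with Floyd's cycle finding (Algorithm 1):
    b advances one step per iteration, c two steps, and a factor is
    found when gcd(b - c, n) is a nontrivial divisor of n."""
    f = lambda x: (x * x + a) % n
    b = c = x0
    while True:
        b = f(b)                 # b = f(b) mod n
        c = f(f(c))              # c = f(f(c)) mod n
        d = gcd(b - c, n)
        if 1 < d < n:
            return d             # nontrivial factor found
        if d == n:
            return None          # failure: retry with a different a

print(pollard_rho(8051))         # 8051 = 83 * 97; prints 97
```

With the classic starting values x0 = 2 and a = 1, the gcd becomes nontrivial after only three iterations for this input.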
Algorithm 2: Brent's Improved algorithm based on rho

Input: composite integer n
Output: a nontrivial factor p of n

1. Set r = 1, q = 1, p = 1; choose a random integer y, a constant c, and a
   batch size m
2. Choose f(x) = x^2 + c (mod n)
3. While p = 1 do:
   a. Set x = y
   b. For i = 1 to r: set y = f(y)
   c. Set k = 0
   d. While k < r and p = 1 do:
      i. Set ys = y
      ii. For i = 1 to min(m, r - k): set y = f(y) and q = q * |x - y| (mod n)
      iii. Set p = gcd(q, n) and k = k + m
   e. Set r = 2r
4. If p = n, set p = 1 and repeat while p = 1: set ys = f(ys) and
   p = gcd(|x - ys|, n)
5. If p = n, change c and go to step 1; else return p
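A direct Python transcription of Brent's variant (the naming and the default batch size m are ours) shows where the saving comes from: gcds are taken on a batched product of differences rather than once per step, and the slow pointer jumps straight to the saved position x:

```python
from math import gcd

def brent_rho(n, y=2, c=1, m=20):
    """Brent's improvement of rho (Algorithm 2). q accumulates the
    product |x - y| mod n over batches of up to m steps; on a batched
    failure (gcd = n) we backtrack one step at a time from ys."""
    f = lambda v: (v * v + c) % n
    r, q, g = 1, 1, 1
    while g == 1:
        x = y
        for _ in range(r):              # advance y by r steps
            y = f(y)
        k = 0
        while k < r and g == 1:
            ys = y                      # backtracking point
            for _ in range(min(m, r - k)):
                y = f(y)
                q = q * abs(x - y) % n
            g = gcd(q, n)               # one gcd per batch of m steps
            k += m
        r *= 2
    if g == n:                          # batch overshot: backtrack
        g = 1
        while g == 1:
            ys = f(ys)
            g = gcd(abs(x - ys), n)
    return g if g != n else None        # None: retry with a new c

print(brent_rho(10403))                 # 10403 = 101 * 103
```

Any value it returns is a nontrivial factor; a None result means both prime factors were hit in the same step and the constant c should be changed, as in step 5 of Algorithm 2.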
2.2 P-1 Method: Overview and algorithm
Pollard's p-1 algorithm is a special-purpose integer factorization method which
was proposed in 1974 by John M. Pollard [14]. This method is based on Fermat's
Little Theorem. It is only suitable for factoring integers with a prime factor p
such that p-1 is smooth over a bound B; it will not give any nontrivial factor
of n if p-1 is not smooth over the bound B.

From Fermat's Little Theorem,

a^(p-1) = 1 (mod p)
a^(m(p-1)) = 1 (mod p)
a^(m(p-1)) - 1 = 0 (mod p)

In this method a is any small integer (in practice a = 2).

For a small integer a and a prime factor p the algorithm is described as follows:

Input to the algorithm: N, the composite number to be factored
Output of the algorithm: p, a nontrivial factor of N

1. Choose a small number a such that 1 < a < N
2. Choose a number k
3. Compute a^k (mod N) - 1
4. Compute gcd(a^k (mod N) - 1, N)
5. Use the division algorithm to check whether the result of the gcd is a
   factor of N

If a factor is found, it is a success; if not, change a and/or k and repeat the
algorithm. If k is chosen properly, k and p-1 will have many factors in common.
This algorithm is well suited when p-1 is a product of many small primes. The
above can also be stated as Algorithm 3.
Algorithm 3: p-1 Algorithm

Inputs:
N: composite number to be factored
a: small prime number such that gcd(a, N) = 1
B1, B2: smoothness bounds

Output: q such that q | N

Phase 1
1: k = prod_i p_i^(floor(log_{p_i} B1)), over the primes p_i <= B1
2: d = a^k (mod N)

Phase 2
3: x = 1
4: for each prime p, B1 < p <= B2, do
5:   x = x * (d^p - 1) (mod N)
6: end for
7: q = gcd(x, N)
Complexity of the p-1 method [3]: with p a factor of N such that p-1 is
B-smooth, the expected running time of the p-1 algorithm is dominated by the
exponentiation cost, which grows essentially linearly with the bound B.
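The two phases of Algorithm 3 map directly onto a short Python sketch (the names, bounds, and test values are ours; unlike Algorithm 3, this sketch also takes a gcd right after Phase 1, which catches easy cases early):

```python
from math import gcd

def primes_up_to(b):
    """Simple sieve of Eratosthenes, primes <= b."""
    s = bytearray([1]) * (b + 1)
    s[0:2] = b"\x00\x00"
    for i in range(2, int(b ** 0.5) + 1):
        if s[i]:
            s[i * i :: i] = bytearray(len(s[i * i :: i]))
    return [i for i in range(2, b + 1) if s[i]]

def p_minus_1(n, a=2, b1=100, b2=1000):
    """Pollard's p-1 (Algorithm 3). Phase 1 raises a to
    k = prod p_i^e_i over prime powers p_i^e_i <= b1; Phase 2
    accumulates x = x * (d^p - 1) mod n for primes b1 < p <= b2
    and takes a single gcd at the end."""
    d = a
    for p in primes_up_to(b1):
        pe = p
        while pe * p <= b1:          # maximal prime power <= b1
            pe *= p
        d = pow(d, pe, n)            # d = a^k (mod n), built up stepwise
    g = gcd(d - 1, n)
    if 1 < g < n:
        return g                     # Phase 1 already succeeded
    x = 1
    for p in primes_up_to(b2):
        if p > b1:
            x = x * (pow(d, p, n) - 1) % n
    g = gcd(x, n)
    return g if 1 < g < n else None

assert p_minus_1(97 * 1019) == 97    # 97 - 1 = 2^5 * 3 is 100-smooth
```

The second factor 1019 is deliberately chosen so that 1018 = 2 * 509 is not 100-smooth, which keeps the Phase 1 gcd from collapsing to n; pairing 1019 with 2039 (2038 = 2 * 1019) instead exercises Phase 2, since 509 lies between the two bounds.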
2.3 ECM Method
The Elliptic Curve Method (ECM) was originally proposed by H. W. Lenstra [9] in
1985 and is based on Pollard's p-1 method [13]. It was subsequently extended by
Brent [8] and Montgomery [11]. The original part of the algorithm proposed by
Lenstra is typically referred to as Phase 1 and the extension by Brent and
Montgomery is called Phase 2. The pseudo-code of both phases is given below as
Algorithm 4. Let q be an unknown factor of N. Then the order of the curve,
|E(Z_q)|, i.e., the number of points on the curve E with operations performed
(mod q), might be a smooth number that divides k. In that case, we have
k = l * |E(Z_q)| for some l. For any point P0 belonging to the curve,
|E(Z_q)| * P0 = O, and therefore kP0 = O (mod q). Thus q divides the
z-coordinate of kP0, and the unknown factor q of N can be recovered by taking
gcd(z_kP0, N).

Montgomery and Brent independently suggested a continuation of Phase 1 if one
has kP0 != O. Their ideas utilize the fact that even if one has Q0 = kP0 != O,
the value of k might miss just one large prime divisor p of |E(Z_q)|. In that
case, one only needs to compute the scalar multiplication by p to get pQ0 = O. A
second bound B2 restricts the size of possible values of p.

Let M(N) be the cost of one multiplication (mod N). Then Phase 1 of ECM finds a
factor q of N with the conjectured time complexity
O(exp((sqrt(2) + o(1)) * sqrt(log q log log q)) * M(N)) [9]. Phase 2 speeds up
Lenstra's original method by a factor of log q, which is absorbed in the o(1)
term of the complexity, but is significant for small and medium size factors.
More details about ECM and its hardware implementation can be found in the
Master's thesis presented by Mohammed Khaleeluddin [5].
Algorithm 4: ECM Algorithm

Require: N: composite number to be factored; E: elliptic curve;
  P0 = (x0, y0, z0) in E(Z_N): initial point;
  B1, B2: bounds for Phase 1 and Phase 2 respectively, B1 <= B2.
Ensure: q: factor of N, 1 < q < N, or FAIL.

Phase 1
 1: k = prod_i p_i^(e_i), where the p_i <= B1 are consecutive primes and
    e_i is the largest exponent such that p_i^(e_i) <= B1
 2: Q0 = kP0 = (x_Q0 : y_Q0 : z_Q0)
 3: q = gcd(z_Q0, N)
 4: if q > 1
 5:   return q (factor of N)
 6: else
 7:   go to Phase 2
 8: end if

Phase 2
 9: d = 1
10: for each prime p = B1 to B2 do
11:   (x_pQ0, y_pQ0, z_pQ0) = pQ0
12:   d = d * z_pQ0 (mod N)
13: end for
14: q = gcd(d, N)
15: if q > 1 then
16:   return q
17: else
18:   return FAIL
19: end if
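For intuition only, here is a much-simplified software sketch of ECM Phase 1 in affine coordinates (Algorithm 4 and the thesis hardware use projective coordinates precisely to avoid these per-step modular inversions; all names and parameters below are ours). The key effect is the same: when the group order mod an unknown factor q divides k, some inversion fails, and its gcd exposes q:

```python
from math import gcd, lcm
import random

class FactorFound(Exception):
    def __init__(self, g):
        self.g = g

def inv_mod(x, n):
    """Modular inverse; when it does not exist, gcd(x, n) > 1 is a factor."""
    g = gcd(x, n)
    if g != 1:
        raise FactorFound(g)
    return pow(x, -1, n)

def ec_add(P, Q, a, n):
    """Affine addition on y^2 = x^3 + a*x + b over Z_n (O is None).
    The coefficient b never appears in the addition formulas."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % n == 0:
        return None                          # P + (-P) = O
    if P == Q:
        lam = (3 * x1 * x1 + a) * inv_mod(2 * y1, n) % n
    else:
        lam = (y2 - y1) * inv_mod(x2 - x1, n) % n
    x3 = (lam * lam - x1 - x2) % n
    return (x3, (lam * (x1 - x3) - y1) % n)

def ec_mul(k, P, a, n):
    """Binary double-and-add scalar multiplication."""
    R = None
    while k:
        if k & 1:
            R = ec_add(R, P, a, n)
        P = ec_add(P, P, a, n)
        k >>= 1
    return R

def ecm_phase1(n, b1=60, tries=100, seed=1):
    """Lenstra's ECM, Phase 1 only. k = lcm(1..b1) equals the product
    of maximal prime powers <= b1, i.e. exactly the k of Algorithm 4."""
    rng = random.Random(seed)
    k = 1
    for i in range(2, b1 + 1):
        k = lcm(k, i)
    for _ in range(tries):
        # random curve through a random point; b is implied and unused
        x, y, a = (rng.randrange(n) for _ in range(3))
        try:
            ec_mul(k, (x, y), a, n)
        except FactorFound as e:
            if e.g != n:
                return e.g                   # nontrivial factor
    return None

print(ecm_phase1(91))                        # 91 = 7 * 13
```

Varying the curve on each try is what distinguishes ECM from p-1: a new curve gives a fresh group order, and only one of them needs to be smooth.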
2.4 Basic Operations
2.4.1 Montgomery Multiplication
Multiplication is the most crucial operation of rho, p-1 and ECM. The
multiplication used in these methods is modular multiplication XY (mod N), where
N is the number to be factored. This multiplication is more time consuming than
a plain multiplication XY. A method for calculating products (mod N) without the
costly reduction (mod N) was introduced by Montgomery and is known as Montgomery
multiplication. The Montgomery multiplication of X and Y, MP(X, Y, N), is
defined as X * Y * 2^(-n) (mod N) for a fixed integer N.

Since Montgomery multiplication is not an ordinary multiplication, there is a
process of conversion between the ordinary domain (with ordinary multiplication)
and the Montgomery domain. The conversion between the ordinary domain and the
Montgomery domain is given by the relation X <-> X' with X' = X * 2^n (mod N).
The conversions are shown in Table 1.

Table 1: Conversions from Ordinary Domain to Montgomery Domain

Ordinary Domain <-> Montgomery Domain
X  <-> X' = X * 2^n (mod N)
Y  <-> Y' = Y * 2^n (mod N)
XY <-> (XY)' = XY * 2^n (mod N)

Table 1 shows that the conversion is compatible with multiplications in each
domain, since

MP(X', Y', N) = X' * Y' * 2^(-n) = (X * 2^n)(Y * 2^n) * 2^(-n)
              = XY * 2^n = (XY)' (mod N).
The conversion between each domain can be done using the same Montgomery
operation, in particular X’ = MP(X, 22n(modN), N) and X = MP(X’,1,N), where 22n(modN)
can be pre-computed. Despite the initial conversion cost, as many Montgomery
multiplications are followed by a single conversion, an advantage over ordinary
multiplication is obtained.
The pseudo-code for radix-2 Montgomery multiplication is shown in Algorithm 5, where
n = ⌊log2(N)⌋ + 2. It should be mentioned that here n is slightly different from the
n = ⌊log2(N)⌋ + 1 which Montgomery [10] originally used. This modified algorithm keeps all
the inputs and the output in the same range, i.e., 0 ≤ X, Y, S[n] < 2N. Therefore it is possible
to apply Algorithm 5 repeatedly without any reduction, unlike the original
algorithm [10], where one has to reduce (mod N) at the end of the algorithm to
bring the output value into the same range as the input values.
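As a sanity check of the definitions above, a minimal software model of this multiplication can be sketched as follows (a Python sketch under the same assumptions; the names `mont_mul`, `to_mont` and `from_mont` are illustrative, not part of the hardware design):

```python
# Radix-2 Montgomery multiplication, MP(X, Y, N) = X*Y*2^(-n) mod N.
# Inputs and output stay in the range [0, 2N), so calls can be chained
# without a final reduction, as described in the text.
def mont_mul(X, Y, N):
    n = N.bit_length() + 1          # n = floor(log2 N) + 2
    S = 0
    for i in range(n):
        xi = (X >> i) & 1           # i-th bit of X
        q = (S + xi * Y) & 1        # parity decides whether to add N
        S = (S + xi * Y + q * N) >> 1
    return S                        # S*2^n == X*Y (mod N)

def to_mont(x, N):
    n = N.bit_length() + 1
    return mont_mul(x, pow(2, 2 * n, N), N)   # x' = MP(x, 2^(2n) mod N, N)

def from_mont(x_m, N):
    return mont_mul(x_m, 1, N) % N            # back to the ordinary domain
```

The round trip to_mont / mont_mul / from_mont reproduces an ordinary modular product, mirroring the conversion relations of Table 1.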
2.4.2 Exponentiation Algorithms
The main part of the p-1 algorithm is based on exponentiation: given a number g and an
exponent e, the result is g^e. There are many algorithms for modular exponentiation, including
left-to-right binary exponentiation, right-to-left binary exponentiation, exponentiation
using addition chains, and window-based exponentiation methods. Out of the several
algorithms described in [3], we have selected sliding window exponentiation as best
suited for the hardware implementation of exponentiation in the p-1 method. This
algorithm is described as Algorithm 6.
Algorithm 5: Radix-2 Montgomery Multiplication

Require: N, n = ⌊log2(N)⌋ + 2, X = Σ xj·2^j, Y = Σ yj·2^j, with 0 ≤ X, Y < 2N
Ensure: Z = MP(X, Y, N) ≡ X·Y·2^-n (mod N), Z < 2N
1: S[0] ← 0
2: for i = 0 to n - 1 do
3:     qi ← (S[i] + xi·Y) (mod 2)
4:     S[i+1] ← (S[i] + xi·Y + qi·N) div 2
5: end for
6: return S[n]
Algorithm 6: Sliding window exponentiation

Input: g, e = (e_t e_{t-1} ... e_1 e_0)_2 with e_t = 1, and an integer w ≥ 1
Output: g^e
1. Precomputation:
   1.1 g_1 ← g, g_2 ← g^2
   1.2 for i from 1 to 2^(w-1) - 1 do: g_{2i+1} ← g_{2i-1}·g_2
2. A ← 1, i ← t
3. while i ≥ 0 do the following:
   3.1 if e_i = 0 then do: A ← A^2, i ← i - 1
   3.2 otherwise (e_i ≠ 0), find the longest bitstring e_i e_{i-1} ... e_l such that
       i - l + 1 ≤ w and e_l = 1, and do the following:
       A ← A^(2^(i-l+1))·g_{(e_i ... e_l)_2}, i ← l - 1
4. Return(A)
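A software sketch of the sliding window method, written for checking the control flow rather than for performance (the helper `sliding_window_pow` and its window width `w` are illustrative assumptions):

```python
# Sliding window modular exponentiation (Algorithm 6), scanning the
# exponent from the most significant bit; w is the window width.
def sliding_window_pow(g, e, N, w=4):
    # Precompute the odd powers g^1, g^3, ..., g^(2^w - 1)
    g2 = (g * g) % N
    odd = {1: g % N}
    for k in range(3, 1 << w, 2):
        odd[k] = (odd[k - 2] * g2) % N
    bits = bin(e)[2:]                  # e_t ... e_0, with e_t = 1
    A, i = 1, 0
    while i < len(bits):
        if bits[i] == '0':             # single squaring for a zero bit
            A = (A * A) % N
            i += 1
        else:
            # longest window of width <= w that ends in a 1 bit
            l = min(i + w, len(bits))
            while bits[l - 1] == '0':
                l -= 1
            for _ in range(l - i):     # l - i squarings
                A = (A * A) % N
            A = (A * odd[int(bits[i:l], 2)]) % N
            i = l
    return A
```

Each window costs one multiplication by a precomputed odd power, which is where the savings over plain binary exponentiation come from.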
Chapter 3: Proposed architecture of co-factoring circuit
The proposed architecture of the co-factoring circuit will perform all three algorithms,
starting with rho, then p-1, and finally ECM. In an ASIC implementation, unifying all three
algorithms reduces the non-recurrent cost, as only one device instead of three needs to
be developed and fabricated. Each device can perform all three functions, which
increases flexibility when building and reusing a large factoring circuit.
In FPGA implementation, these issues are not of primary concern, as non-recurrent
costs are minimal and devices can be reconfigured in the middle of computations. In
both types of implementations, transfer of inputs and outputs between hardware
devices can be reduced in the unified architecture. Another reason for unification of p-1
and ECM is that these methods make use of the same parameters such as the product
of prime powers that are less than B1, gcd tables and prime tables. With minor
modifications all this data can be used for both methods.
The original idea was to unify all three special methods of factoring: rho, p-1 and ECM.
Due to time limitations and design discrepancies, however, only the unification of rho and
p-1 has been completed. The ECM method of factoring is implemented using two multipliers and
one adder/subtractor, whereas rho and p-1 are implemented using only one multiplier
and one adder/subtractor. Unification of all three methods requires an implementation of
ECM using only one multiplier and one adder/subtractor, which would require a very
substantial amount of time and effort. That is why only the unification of rho and p-1 is
done, leaving the unification with ECM as future work.
3.1 Top Level View
Figure 3: Proposed architecture for co-factoring
The proposed architecture of the co-factoring circuit is shown in Figure 3. In this circuit, the host
computer is responsible for pre-computations, such as computing the initial parameters
needed for all three algorithms, and post-computations, such as calculating the gcd of
the results from the algorithms with the number to be factored. It is also responsible for
transferring the needed initial parameters to the global memory. This architecture has
multiple co-factoring units performing the same algorithms in parallel on the same or
different numbers, using different sets of parameters. Operation of the units is
controlled by a global control unit, which monitors and controls the operation
of the whole circuit. Details about the contents of the global memory for ECM and the
architecture of the ECM unit were explained in the thesis submitted by Khaleeluddin
Mohammed. Details of the remaining units, i.e., rho, p-1 and the unified rho and p-1, are
explained in Chapter 4.
3.2 Implementation of Basic Arithmetic Operations
The low-level arithmetic unit is the same for all three architectures. This arithmetic unit
has one Montgomery multiplier and one adder/subtractor. The three basic low-level
operations of rho, p-1 and the unified rho and p-1 are modular multiplication (defined in
Section 2.4.1), modular addition and modular subtraction. Modular addition and
subtraction are very similar to each other, and as a result they are implemented using
one functional unit, the adder/subtractor.
In order to simplify our Montgomery multiplier, all operations are performed on inputs
X, Y in the range 0 ≤ X, Y < 2N, and return an output S in the same range, 0 ≤ S < 2N. This
is equivalent to computing all intermediate results modulo 2N instead of N, which
increases the size of all intermediate values by one bit, but shortens the time of
computations, and leads to exactly the same final results as operations (mod N).
3.2.1 Modular Addition and Subtraction
The algorithms for modular addition and subtraction are shown as Algorithms 7 and 8
respectively. In both algorithms, S is the result, T is the temporary variable and C1, C2
are two carry bits. The block diagram of the adder/subtractor unit implementing both
algorithms is shown in Figure 4. The modulus N is loaded into the adder/subtractor,
through input X_N, once, during the initialization stage of Phase 1, using the control
signal X_N_Choice given by the control unit, and does not need to be changed until the
next run of Phase 1 for another number N. This modulus is stored in the internal 32x32-
bit memory, which is used to hold three numbers, N, S, and T, all up to 256 bits wide. The 32-bit
words of operands X and Y are loaded in parallel, starting from the least significant
word, and immediately added or subtracted, depending on the value of the control
input sub (with sub = 1 denoting subtraction). The result is stored in the internal
memory as a variable T for addition, i.e., X + Y, and S for subtraction, i.e., X - Y.
Subtraction is performed as addition in 2's complement, which is done by inverting one of
the inputs using the XOR gates at the input of the adder and by setting the carry bit
C2; these operations are controlled by the sub signal coming from the adder/subtractor
state machine.
This first operation is followed by the second operation of the respective algorithm,
involving the previously computed value and the modulus 2N computed on the fly, with
the result stored back to the memory. Finally, depending on the sign of S, stored in the
flip-flop C2, either T or S is returned as a final result. For 256-bit operands, the entire
operation takes 41 clock cycles (including writing data back to local RAM), the same
amount for addition and subtraction. I would like to thank Mohammed Khaleeluddin for
his contributions in designing the adder/subtractor.
Figure 4: Block diagram of the adder/subtractor
Algorithm 7: Modular Addition

Require: N, X, Y < 2N, all expressed using e 32-bit words X(j), Y(j), N(j), j = 0, ..., e-1
Ensure: Z = X + Y mod 2N
1: for j = 0 to e - 1 do
2:     (C1, T(j)) ← C1 + X(j) + Y(j)
3: end for
4: for j = 0 to e - 1 do
5:     (C2, S(j)) ← C2 + T(j) - (2N)(j)
6: end for
7: if S < 0 then
8:     return T
9: else
10:    return S
11: end if

Algorithm 8: Modular Subtraction

Require: N, X, Y < 2N, all expressed using e 32-bit words X(j), Y(j), N(j), j = 0, ..., e-1
Ensure: Z = X - Y mod 2N
1: for j = 0 to e - 1 do
2:     (C2, S(j)) ← C2 + X(j) - Y(j)
3: end for
4: for j = 0 to e - 1 do
5:     (C1, T(j)) ← C1 + S(j) + (2N)(j)
6: end for
7: if S < 0 then
8:     return T
9: else
10:    return S
11: end if
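The word-serial scheme of Algorithm 7 can be modeled in software as follows; this is a sketch using 32-bit words, with `mod2N_add` as an illustrative name (the subtraction of Algorithm 8 is analogous, with the roles of the two passes swapped):

```python
# Word-serial modular addition modulo 2N (Algorithm 7), using e+1
# 32-bit words so that the intermediate sum T = X + Y (< 4N) fits.
W = 32
MASK = (1 << W) - 1

def words(x, e):
    return [(x >> (W * j)) & MASK for j in range(e)]

def mod2N_add(X, Y, N, e):
    N2 = words(2 * N, e + 1)
    x, y = words(X, e + 1), words(Y, e + 1)
    T, S = [0] * (e + 1), [0] * (e + 1)
    c1 = 0
    for j in range(e + 1):                    # first pass: T = X + Y
        s = c1 + x[j] + y[j]
        T[j], c1 = s & MASK, s >> W
    c2 = 1                                    # 2's-complement subtraction
    for j in range(e + 1):                    # second pass: S = T - 2N
        s = c2 + T[j] + (~N2[j] & MASK)
        S[j], c2 = s & MASK, s >> W
    result = S if c2 else T                   # c2 = 0 means S went negative
    return sum(w << (W * j) for j, w in enumerate(result))
```

The final carry c2 plays the role of the sign flip-flop in the hardware unit: it selects between T and S exactly as in steps 7-11 of the algorithm.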
3.2.2 Montgomery Multiplier
The multiplier circuit used was designed by my colleague Hoang Le. This multiplier is
based on the radix-2 version of the Montgomery Multiplier algorithm shown in
Algorithm 5. This algorithm assumes that all words of the inputs X, Y, and N are already
available inside the multiplier, and can be accessed at the same time. The second
instruction inside the for loop involves the addition of three long words.
Figure 5: Cascade of two carry save adders, reducing four operands to two
Figure 6: Block diagram of the multiplier
If implemented directly in hardware the operation would result in a long critical path
and a very low clock frequency. In order to prevent that, this addition is performed
using carry save adders, and the result S [i + 1] is stored in the carry save form. Using
carry save adders, the sum of three numbers U, V , W is reduced to the sum of two
numbers S (sum) and C (carry), such that U + V + W = C + S. Similarly, using a cascade of
two carry save adders, as shown in Figure 5, the sum of four numbers, U, V , W, and Y
can be reduced to the sum of two numbers S and C, such that U +V +W +Y = C +S. Each
carry save adder is composed of a row of n Full Adders working in parallel, so it
introduces a delay of just a single Full Adder (i.e., a delay of a single stage of a basic
ripple-carry adder). The modified algorithm, based on carry save addition (CSA), is shown
as Algorithm 9. This algorithm has been described earlier. The block diagram of the
circuit implementing Algorithm 9 is shown in Figure 6. The modulus N and the
parameter n are loaded into the multiplier once at the beginning of Phase 1, and do not
need to be changed until the beginning of Phase 1 for another number N. At the
beginning of a multiplication, the inputs X and Y are first loaded in parallel, in 32-bit
words, into internal 256-bit registers X and Y. In the following n clock cycles, the circuit
executes the n iterations of the for loop. Finally, in the last 8 clock cycles, the final result is
computed word by word, starting from the least significant word, and transferred to the
output. The total execution time of a single Montgomery multiplication is equal to n +
16 clock cycles.
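The carry save reduction itself is easy to model in software; the following sketch (the function names `csa` and `csa4` are illustrative) shows the three-to-two and four-to-two reductions described above:

```python
# Carry save addition: three operands are reduced to a (sum, carry)
# pair without propagating carries, so the delay is one full-adder
# stage regardless of the word length.
def csa(u, v, w):
    s = u ^ v ^ w                              # bitwise sum
    c = ((u & v) | (u & w) | (v & w)) << 1     # carries, shifted into place
    return s, c

def csa4(u, v, w, y):
    # Cascade of two CSAs (as in Figure 5): four operands down to two.
    s, c = csa(u, v, w)
    return csa(s, c, y)
```

The invariant in both cases is that the returned pair satisfies s + c = sum of the inputs, which is exactly the property used to keep S[i+1] in carry save form.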
Algorithm 9: Radix-2 Montgomery Multiplication with carry save addition

Require: N, n = ⌊log2(N)⌋ + 2, X = Σ xj·2^j, with 0 ≤ X, Y < 2N
Ensure: Z = MP(X, Y, N) ≡ X·Y·2^-n (mod N), Z < 2N; Z(j), C[n](j), S[n](j) denote the
        j-th word of Z, C[n] and S[n], respectively.
1: S[0] ← 0
2: C[0] ← 0
3: for i = 0 to n - 1 do
4:     qi ← (C[i](0) + S[i](0) + xi·Y(0)) (mod 2)
5:     (C[i+1], S[i+1]) ← CSA(C[i], S[i], xi·Y, qi·N) div 2
6: end for
7: C ← 0
8: for j = 0 to 7 do
9:     (C, Z(j)) ← C[n](j) + S[n](j) + C
10: end for
11: return Z
Chapter 4: Architecture of Rho and p-1
4.1 Partitioning of operations between hardware and software
All the computations in the three architectures can be divided into three categories:
pre-computations, main computations and post-computations. Pre-computations and post-
computations are performed on the host computer, whereas the main computations are
performed in hardware. In a typical scenario, the pre- and post-computations take a
negligible amount of time compared to the time taken by the main computations in
hardware.
For the rho method, the pre-computations consist of converting the initial point and the other
parameters needed for the algorithm into the Montgomery domain, and the main
computation is the execution of Algorithm 10, which involves Montgomery multiplication,
addition and subtraction.
The software pre-computations for Phase 1 of p-1 include calculating the product of prime
powers k, which depends on B1, and converting the initial values needed for algorithm
execution, i.e., a and g, into the Montgomery domain. The Phase 2 pre-computations
include the generation of two bit tables, prime_table and GCD_table, described in
Section 4.3.
These bit tables can be reused as long as the bounds B1 and B2 are kept constant; thus,
they need to be calculated only once if we use the same bounds. The main computations
in hardware for p-1 are modular exponentiation and the Phase 2 operations, modular
multiplication and modular subtraction, which run most of the time and keep the
majority of the hardware resources busy almost all the time.
For the unified architecture, the pre-computations are a combination of the pre-computations
of the individual rho and p-1 units; the main computations are likewise a combination of the
individual rho and p-1 operations.
In all three architectures, the post-computations consist of calculating the gcd of the final
results with the number to be factored, which is done only once, at the end.
The top-level block diagrams of the three architectures look similar; the differences are
explained in the following sections.
In all three architectures, multiple identical units work in parallel on the
same or different numbers, with different sets of parameters, depending on the
requirement. Each unit has its own local memory, Montgomery multiplier and
adder/subtractor. The global memory, instruction memory and control unit are common to
all processing units. The operation starts with loading the initial parameters needed by
the particular architecture. Once all the computations are finished in hardware, the
result is transferred to the host PC for the final gcd calculation.
Figure 7: Top level block diagram
As the contents of the global memory are different for the different factoring methods, they
are explained in detail with block diagrams in this chapter.
In the following sections, the hardware architectures of all three implementations, i.e., rho,
p-1 and the unified rho and p-1, are discussed in detail. Along with the architectures, the
optimization criteria and the criteria for selecting the optimum number of arithmetic units
are also discussed.
4.2 Hardware Architecture of Rho Method
Algorithm 1, which was explained in Section 2.1, can be re-written as shown in
Algorithm 10, clearly defining all the inputs and outputs of the system and the complete
execution of the algorithm.
Algorithm 10: Re-written rho algorithm based on Algorithm 1

Inputs: x0, a, f(x) = x^2 + a, N, t (even, t > 2)
Outputs: q such that (q | N)
v1 ← x1 = f(x0), v2 ← x2 = f(x1), temp ← v2 - v1, d ← 1
for (i = 2; i ≤ t; i++)
{
    v2 ← v2^2
    v2 ← v2 + a
    v2 ← v2^2
    v2 ← v2 + a
    v1 ← v1^2
    v1 ← v1 + a        // all operations are done modulo N
    temp ← (v2 - v1) mod N
    d ← d * temp
}
q = gcd(d, N)
As explained in Section 2.1, in order to minimize the number of gcd computations, we
accumulate the differences between the numbers of the sequence x_i
whenever the difference between the indices increases by one. This increase comes
from applying the polynomial twice to x_2i and once to x_i. This scheme results in a regular
pattern of the numbers, where the difference between the indices of the numbers is
increased by one in each iteration. The reason for increasing the differences in indices
by one is to find the loop length, starting from 1, until it occurs. If we collect the value
of every difference, one of the differences leads to a factor of N. The corresponding
graphical representation of Algorithm 10 is shown in Figure 8.

In the accumulated product d, we need to include one difference from each column of
the matrix given in Figure 8. In Algorithm 10, we include only the terms shown in bold and
belonging to the diagonal of the matrix.
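The accumulation scheme of Algorithm 10 can be sketched in software as follows (an illustrative model, not the hardware implementation; the initial difference x2 - x1 is folded into d, and the gcd is taken once at the end):

```python
from math import gcd

# Pollard's rho with an accumulated product of differences: v2 advances
# two steps and v1 one step per iteration, so the index difference
# grows by one each time, covering the diagonal of the matrix.
def rho_accumulated(x0, a, N, t):
    f = lambda x: (x * x + a) % N
    v1 = f(x0)                    # x1
    v2 = f(v1)                    # x2
    d = (v2 - v1) % N             # first diagonal term
    for i in range(2, t + 1):
        v2 = f(f(v2))             # apply the polynomial twice to v2
        v1 = f(v1)                # and once to v1
        d = (d * ((v2 - v1) % N)) % N
    return gcd(d, N)
```

Note that if t is taken too large, the difference can hit zero once the sequence collides completely, in which case the gcd degenerates to N; choosing t appropriately (or checking the gcd periodically) avoids this.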
Figure 8: Graphical representation of original Pollard’s rho
Algorithm 2, Brent's improvement on the original Pollard's Algorithm 1, can
be re-written as shown in Algorithm 11.
x1-x2   x1-x3   x1-x4   x1-x5   x1-x6   --------------------------------- x1-xi
x2-x3   x2-x4   x2-x5   x2-x6   x2-x7   --------------------------------- x2-xi
x3-x4   x3-x5   x3-x6   x3-x7   x3-x8   --------------------------------- x3-xi
x4-x5   x4-x6   x4-x7   x4-x8   x4-x9   --------------------------------- x4-xi
x5-x6   x5-x7   x5-x8   x5-x9   x5-x10  --------------------------------- x5-xi
x6-x7   x6-x8   x6-x9   x6-x10  x6-x11  x6-x12  ------------------------- x6-xi
x7-x8   x7-x9   x7-x10  x7-x11  x7-x12  x7-x13  x7-x14  ----------------- x7-xi
x8-x9   x8-x10  x8-x11  x8-x12  x8-x13  x8-x14  x8-x15  x8-x16  --------- x8-xi
-------------------------------------------------------------------------------
xk-xk+1  xk-xk+2  xk-xk+3  -------------------- xk-x2k  ----------------- xk-xi
The difference between the original algorithm and the improved version is that in the
original algorithm some of the values are calculated more than once. For example, in the
third iteration we calculate x4 and x2 from x2 and x1; this way x2 is calculated twice. If
we want to minimize the number of computations by storing the already computed
values, we substantially increase the memory needed by the system. When the
number of iterations is large, on the order of 2^20, the huge amount of memory
needed by the system might limit the number of operations on a platform like an FPGA.
Algorithm 11: Brent's algorithm for rho

Inputs: x0, a, f(x) = x^2 + a, N, t (even, t > 2)
Outputs: q such that (q | N)
x1 ← f(x0), v1 ← v ← x2 = f(x1), k ← 1, d ← 1
for (i = 3; i ≤ 2^t; i++)
{
    v ← f(v)
    if (2^k + 2^(k-1) + 1 ≤ i ≤ 2^(k+1))
    {
        temp ← v1 - v
        d ← d * temp
    }
    if (i = 2^(k+1))
    {
        v1 ← v
        k ← k + 1
    }
}
q = gcd(d, N)
Instead of increasing the amount of memory needed, we store only a single value,
updated according to the condition described in Algorithm 11. Thus, we reduce the
number of computations needed at no extra expense. In this algorithm,
accumulation is not performed in every iteration as in Algorithm 10, even though
the number of accumulations is the same in both methods. The corresponding
graphical representation of Algorithm 11 is shown in Figure 9.
In this method, whenever the iteration number is equal to a power of two, the
corresponding value of x is stored and, depending on the condition explained in
Algorithm 11, the appropriate difference and accumulation are computed.
The similarity between the two methods is as follows: in each method there is an
element (a difference between two values) from each column contributing to the final
accumulated product. The differences between the indices cover all the values from 1
up to the iteration value, which makes both methods equivalent.
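A software sketch of Brent's variant (Algorithm 11), with the power-of-two bookkeeping made explicit (an illustrative model, not the hardware implementation):

```python
from math import gcd

# Brent-style rho: v1 is refreshed only when the iteration index
# reaches a power of two, and the difference is accumulated only
# while 2^k + 2^(k-1) + 1 <= i <= 2^(k+1).
def rho_brent(x0, a, N, t):
    f = lambda x: (x * x + a) % N
    x1 = f(x0)
    v = v1 = f(x1)                # x2
    k, d = 1, 1
    for i in range(3, 2 ** t + 1):
        v = f(v)
        if 2 ** k + 2 ** (k - 1) + 1 <= i <= 2 ** (k + 1):
            d = (d * ((v1 - v) % N)) % N
        if i == 2 ** (k + 1):
            v1 = v                # store v at powers of two
            k += 1
    return gcd(d, N)
```

Because only one stored value v1 is kept, the memory cost stays constant while roughly a quarter of the multiplications of the original scheme are saved.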
Figure 9: Graphical representation of improved version by Brent
x1-x2   x1-x3   x1-x4   x1-x5   x1-x6   --------------------------------- x1-xi
x2-x3   x2-x4   x2-x5   x2-x6   x2-x7   --------------------------------- x2-xi
x3-x4   x3-x5   x3-x6   x3-x7   x3-x8   --------------------------------- x3-xi
x4-x5   x4-x6   x4-x7   x4-x8   x4-x9   --------------------------------- x4-xi
x5-x6   x5-x7   x5-x8   x5-x9   x5-x10  --------------------------------- x5-xi
x6-x7   x6-x8   x6-x9   x6-x10  x6-x11  x6-x12  ------------------------- x6-xi
x7-x8   x7-x9   x7-x10  x7-x11  x7-x12  x7-x13  x7-x14  ----------------- x7-xi
x8-x9   x8-x10  x8-x11  x8-x12  x8-x13  x8-x14  x8-x15  x8-x16  --------- x8-xi
-------------------------------------------------------------------------------
xk-xk+1  xk-xk+2  xk-xk+3  ---- x(2^k) - x(2^k + 2^(k-1) + 1) ---- x(2^k) - x(2^(k+1)) ---- xk-xi
Speed up vs. original Pollard's rho:
The numbers of operations (multiplications, additions and subtractions) needed to arrive
at x8-x16 in both methods are as follows:
Pollard’s algorithm:
Number of multiplications: 30
Number of additions: 23
Number of subtractions: 8
Brent’s algorithm:
Number of multiplications: 23
Number of additions: 16
Number of subtractions: 8
For the above case, and in general, Brent’s algorithm is approximately 25% faster than
Pollard’s algorithm, and thus Brent’s algorithm is chosen for hardware implementation.
Figure 10: Sequence of operations in hardware for Brent's rho

    v2      d                   v1
    x2      d = 1               x2
    x3
    x4      d = d*(x4 - x2)     x4
    x5
    x6
    x7      d = d*(x7 - x4)
    x8      d = d*(x8 - x4)     x8
    x9
    x10
    x11
    x12
    x13     d = d*(x13 - x8)
    x14     d = d*(x14 - x8)
    x15     d = d*(x15 - x8)
    x16     d = d*(x16 - x8)    x16

Operation table in hardware:
Inputs: x0, a, f(x) = x^2 + a, n, t (even, > 2)
Outputs: d

Table 2: Operation table for rho

Iterations        MUL              ADD/SUB
1 to 2^t - 1      v2 ← v2^2        v2 ← v2 + a
cond1             d ← d*temp       temp ← (v2 - v1)

cond1: 2^k + 2^(k-1) + 1 ≤ i - 1 ≤ 2^(k+1)
The sequence of operations in hardware is shown in Figure 10. As already explained,
whenever the index of v2 is a power of two, its value is stored in v1 and used for
subtractions later. In selecting the number of arithmetic units, Table 2 is taken into
account. In all the operations, subtraction and accumulation are done only when the
condition is satisfied. As the algorithm checks for a specific condition, which is not true
all the time, the operations cannot be effectively parallelized, as there are dependencies
among the computations. If we had more than one multiplier and adder/subtractor, only
one multiplier and one adder/subtractor would be active all the time, and the other pair
of arithmetic units would be in use less than 50% of the time, in which case the resources
would not be utilized efficiently. This criterion justifies the selection of one multiplier and
one adder/subtractor for the implementation.
4.3 Hardware Architecture of P-1 Method
Algorithm 3, which was introduced as part of the overview of p-1, is explained in
detail in this section. It can be described as two phases; Phase 1 is
shown in Algorithm 12.
Algorithm 12: Phase1 of p-1 algorithm
Inputs:
The primary operation of Phase 1 is modular exponentiation. The value of the exponent k
used in the exponentiation is independent of n and can be pre-computed. Out of the
multiple existing algorithms for modular exponentiation, the sliding window method has
been selected for Phase 1. Based on the number of operations needed to finish the
exponentiation, the addition-chain method is the fastest one, but it requires a lot of memory
resources, which would limit the number of processing units that can be placed on low-
cost FPGAs. On the other hand, if we choose one of the basic binary exponentiations, then for
a random L-bit exponent the total number of multiplications will be 3·L/2 (assuming that in a
randomly chosen number 50% of the bits have value '1'), which is more than the sliding
window method takes. Thus, the sliding window method is faster than binary
exponentiation, and needs less memory than the addition-chain method.
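The operation counts behind this comparison can be checked with a small model (the helper functions are illustrative and the counts are rough; for example, precomputation is charged as 2^(w-1) multiplications):

```python
# Rough multiplication counts for an L-bit exponent: binary
# exponentiation needs about L squarings plus L/2 multiplications
# (~3L/2 total), while a width-w sliding window needs about L
# squarings plus L/(w+1) multiplications plus precomputation.
def binary_cost(e):
    b = bin(e)[2:]
    return (len(b) - 1) + b.count('1') - 1   # squarings + multiplies

def sliding_window_cost(e, w=4):
    bits, i, sq, mul = bin(e)[2:], 0, 0, 0
    while i < len(bits):
        if bits[i] == '0':
            sq, i = sq + 1, i + 1
        else:
            l = min(i + w, len(bits))
            while bits[l - 1] == '0':
                l -= 1
            sq, mul, i = sq + (l - i), mul + 1, l
    return sq + mul - 1 + (2 ** (w - 1))     # include precomputation
```

For a dense 64-bit exponent the window method saves roughly a third of the operations, which is consistent with the 3L/2 estimate above.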
In case Phase 1 does not produce a non-trivial factor of n, Phase 2, described in Algorithm
13, is applied. For the purpose of an efficient hardware implementation, Algorithm 3 is
modified as follows.
Choose 0 < D < B2, and let every prime p, B1 < p ≤ B2, be expressed in the form
p = m·D - j, where m changes between MMIN = (B1 + D - 1)/D and MMAX = (B2 + 1)/D,
and j varies between 1 and D. The condition that p is prime implies that gcd(j, D) = 1.
Thus, the possible values of j form a set JS = {j: 1 ≤ j < D, gcd(j, D) = 1}, and the possible values
of m form a set MT = {m: MMIN ≤ m ≤ MMAX} of size MN = MMAX - MMIN + 1, where MN
is approximately equal to (B2 - B1)/D.
Then, for p = m·D - j, the condition d^(m·D) - d^j = d^j·(d^p - 1) is satisfied, which implies
that any factor dividing d^p - 1 also divides d^(m·D) - d^j.
For this purpose, all values d^j, j ∈ JS, are pre-computed once Phase 2 is started. One then
computes x ← x·(d^(m·D) - d^j) with a current value of d^(m·D), and all pre-computed values
d^j, for which m·D - j is a prime. In order to simplify the calculations, a bit table, prime_table, is pre-
computed: prime_table[m, j] = 1 when m·D - j is a prime, and 0 otherwise. This
table can be reused for multiple iterations of Phase 2 with the same values of B1 and B2.
Similarly, another bit table, GCD_table, can be pre-computed:
GCD_table[j] = 1 when gcd(j, D) = 1, and 0 otherwise.
This table has D bits, and this leads to the Phase 2 algorithm shown in Algorithm
13. The value D = 210 = 2·3·5·7 is the most natural choice for D, as it minimizes the
size of the set JS and, as a result, the amount of pre-computations and memory storage
required for Phase 2.
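The construction of the two bit tables can be sketched as follows (an illustrative software model with D = 210; the trial-division `is_prime` stands in for whatever primality test the host computer uses, and m_max is taken as a ceiling so that all primes up to B2 are covered):

```python
from math import gcd

# Phase 2 pre-computation sketch: build GCD_table and prime_table for
# primes written as p = m*D - j, with D = 210 = 2*3*5*7.
def is_prime(n):
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def phase2_tables(B1, B2, D=210):
    m_min = (B1 + D - 1) // D          # floor((B1 + D - 1) / D)
    m_max = (B2 + D) // D              # ceil((B2 + 1) / D)
    # GCD_table[j] = 1 when gcd(j, D) = 1
    gcd_table = [1 if gcd(j, D) == 1 else 0 for j in range(D)]
    # prime_table marks (m, j) pairs with p = m*D - j prime, B1 < p <= B2
    prime_table = {(m, j): 1
                   for m in range(m_min, m_max + 1)
                   for j in range(1, D)
                   if B1 < m * D - j <= B2 and is_prime(m * D - j)}
    return gcd_table, prime_table, m_min, m_max
```

Note that every (m, j) pair marked in prime_table automatically has GCD_table[j] = 1, since a prime p = m·D - j larger than 7 cannot share a factor with D.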
Algorithm 13: Phase 2 of p-1 algorithm

Mmin ← (B1 + D - 1)/D, Mmax ← (B2 + 1)/D
clear GCD_table
for each j ← 1 to D - 1 do
    if gcd(j, D) = 1 then
        GCD_table[j] ← 1
    end if
end for
clear prime_table
for each m ← Mmin to Mmax do
    for each j ← 1 to D - 1 do
        if (m·D - j) is prime then
            prime_table[m, j] ← 1
        end if
    end for
end for
for each j ← 1 to D - 1 do
    if GCD_table[j] = 1 then
        compute d^j
    end if
end for
x ← 1, y ← d^(Mmin·D), t ← d^D
for m ← Mmin to Mmax do
    for j ← 1 to D - 1 do
        if prime_table[m, j] = 1 then
            x ← x·(y - d^j)
        end if
    end for
    y ← y·t
end for
q ← gcd(x, N)
4.4 Global Memory Maps
4.4.1 Rho
Figure 11: Contents of global memory for rho
The global memory is a single-port memory. It transfers data to and from the local memories
word by word. Prior to the execution of the algorithm, the host PC transfers the data to
the global memory. The memory map of the global memory is shown in Figure 11. This
includes the data needed by each unit to start the execution of the algorithm, such as the
modulus n, the initial point x0, and a small integer a. It also contains the number of
iterations t needed by the control unit to determine the end of the execution. All the
parameters needed by each unit are transferred, and then the computations are
started.

Once all the computations are finished, the results from all the units are transferred
to the global memory, which sends all that data to the host PC for the gcd calculation. In the p-1
method, the global memory is responsible for holding the data needed for Phase 1
and Phase 2. The contents of the global memory in Phases 1 and 2 are shown in Figure 12.
4.4.2 P-1
Figure 12: Contents of global memory for p-1 a) phase 1 b) phase 2
(In Phase 1, the global memory holds the modulus n for each unit, and the exponent k
and the initial values g1, g2, which are the same for all units. In Phase 2, it holds
prime_table[1..PMAXD], which determines the pairs m, j such that p = m·D - j is prime;
GCD_table[1..GMAXD], which determines the values of j such that 1 ≤ j ≤ D and
gcd(j, D) = 1; and the bounds Mmin and Mmax.)
Parameters needed for the p-1 algorithm:
n, the number to be factored
a, a small integer with gcd(a, n) = 1
k, the product of prime powers that are less than B1
the gcd and prime tables needed for Phase 2
4.4.3 Unified rho and p-1
The contents of the global memory of the unified rho and p-1 are a combination of the rho and
p-1 global memory contents. The only difference here is that, once the rho algorithm is completed,
instead of computing the gcd, the unit continues with the p-1 algorithm; once the execution of the two
algorithms is finished, the results from both are transferred to the global
memory. These results are then sent to the host computer for the gcd computations. All the
global and local memory requirements for the three architectures are specified in the
Results section.
4.4 Local Memory Maps
4.4.1 Medium level operations
In rho the medium level operations are computing the value of v2 and computing the
accumulated product d, based on the iteration value. The final value of d is computed
using the condition specified in Table 2.
Medium level operations in p-1 algorithm are modular exponentiation in phase 1 and
finding the value of x in phase 2. In phase 1, exponentiation is performed using sliding
window exponentiation based on the value of k, which depends on the phase 1 bound
B1. In phase2, calculating the value of x is based on the GCD_table and prime_table.
In the unified rho and p-1, the medium-level operations are the rho operation and the p-1
operation. First the unit tries to factor the number using the rho method; once rho is
completed, it moves to the p-1 algorithm, so that if the number has more than one factor,
both factors can be given as output. The main advantages of integrating rho and p-1
are that both units use the same low-level unit, and that there is a possibility of finding
more than one factor in a single run.
4.4.2 Rho
The local memory is a single-port-in, dual-port-out memory. The purpose of the dual-port
output is to load the two operands simultaneously for multiplication, addition and
subtraction. A single input port is sufficient for loading data into the local memory from the
global memory, the multiplier and the adder/subtractor. This local memory transfers the data to
the global memory after all the computations are done, using a tri-state buffer which
connects the local memory to the global memory. The contents of the local memory are shown in
Figure 13.
4.4.3 P-1
Local memory is a dual port in and dual port out memory. The reason for selecting dual
port memory is to ease the loading of both the operands simultaneously for addition,
subtraction and multiplication. It is also helpful if there is any data movement that
needs to be done within the local memory. This feature also supports data output to the
global memory once the whole algorithm has been executed. This is done using a tri
state buffer at port A.
Figure 13: Contents of local memory of rho (the modulus, temp, V1, V2, a, and d)
Figure 14: Local memory of p-1
The contents of the local memory in Phase 1 and Phase 2 are shown in Figure 15 (a) and
(b). In Phase 1 it holds the modulus, all the pre-computed values for sliding window
exponentiation, and the result of Phase 1. In Phase 2 it holds the values d^j, where
gcd(j, D) = 1. It also holds the other values needed to compute the final result x, which are
d^D and d^(m·D), and the difference between the current d^(m·D) and d^j, where m·D - j is a prime.
Once all the operations are done, the result x which is at the last location of the local
memory will be transferred to global memory.
Figure 15: Contents of local memory of p-1 a) phase 1 b) phase 2

(In Phase 1, the local memory holds N, the pre-computed powers g1, g2, g3, ..., gs needed
for sliding window exponentiation, and the result d = g^e. In Phase 2, it holds N, d, the
pre-computed values d^11, d^13, ..., d^209, up to d^D, the current d^(m·D), the
difference d^(m·D) - d^j, and, at the last location, the result x.)
4.4.4 Unified rho and p-1
The local memory of the unified architecture consists of two separate local memories: one
for holding the data needed and computed by rho, and the other for p-1.
Depending on the algorithm, the unit accesses the corresponding local memory. All
the local memory requirements of the three architectures are discussed in the
Results section.
4.5 Control Unit
The control units in all three architectures have been implemented in hardware; each is part of
the chip, whether it is an FPGA or an ASIC. The basic components of the control units are
registers, shift registers, counters, flip-flops, multiplexers and state machines. The
operation common to all the control units is the following: once the host PC transfers the initial data
to the global memory, the control unit takes control of the system and sends the
data to the individual units; once the whole algorithm has been executed, it transfers
the data back to the global memory. Finally, the data are sent from the global memory to the host PC for
the gcd calculation. The detailed operation of each control unit is explained as follows.
4.5.1 Rho
The control unit of rho has 5 state machines with a total of 45 states. These state
machines transfer the initial data required by the individual units, store the number of
iterations in a register inside the control unit, and execute the algorithm based on the
operation table described in Table 2. The detailed operation of the control unit is
described below.
Out of the four operations in Table 2, the squaring and the addition are performed every
iteration, irrespective of the condition in Table 2, whereas the other two operations,
the subtraction and the accumulation, are performed only when the condition shown in
Table 2 is true. The flow of operations when the condition is true is as follows.

In the first row, the multiplier is loaded with the two operands v2 and v2, which are
the same for the squaring operation, and the multiplier is started. While the multiplier
is performing the squaring, v2 and v1 from the previous iteration are transferred to the
adder/subtractor for the subtraction v2 - v1, which is done in parallel with the
multiplication; thus the subtraction needs no extra clock cycles. In the second row, the
addition is computed every iteration.

When the condition is true, the same strategy used for the first row of operations is
applied: the multiplier is loaded with the operands needed for the accumulation, d and
temp, and the multiplication is started. The addition operation is done in parallel with
the multiplication, which effectively decreases the overall execution time.
The number of operations in rho with t iterations is:

First row: 2t squarings, and t subtractions for v2 - v1.

Second row: t - 1 multiplications for the accumulation, and 2t additions.

The total number of operations is (squarings + accumulations + t additions) =
((3t - 1) multiplications + t additions), as the remaining operations can be
parallelized with the multiplications. Once all the operations are completed, the
control unit transfers the data from all the individual units to global memory, which is
the last operation of the control unit.
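The scheduling above can be checked against a software model. The sketch below is a plain-Python version of the rho method with the deferred-gcd accumulation described in this chapter; it uses ordinary modular arithmetic rather than the hardware's Montgomery multiplier, and the block size, backtracking fallback, and small test modulus are illustrative choices of this sketch, not taken from the thesis.

```python
from math import gcd

def rho(n, x0=2, a=1, block=32, max_iters=1 << 16):
    """Pollard's rho with an accumulated product of differences.

    One gcd is taken per block of iterations; in the hardware version
    the single gcd is deferred entirely to the host PC.
    """
    f = lambda x: (x * x + a) % n        # the rho iteration function
    v1 = v2 = x0
    it = 0
    while it < max_iters:
        s1, s2 = v1, v2                  # save state for backtracking
        d = 1
        for _ in range(block):
            v1 = f(v1)                   # slow sequence: one step
            v2 = f(f(v2))                # fast sequence: two steps
            d = d * (v2 - v1) % n        # accumulate the difference
            it += 1
        g = gcd(d, n)
        if g == 1:
            continue                     # no factor in this block yet
        if g < n:
            return g                     # non-trivial factor found
        # g == n: a full collision hid the factor; redo the block
        # one difference at a time.
        v1, v2 = s1, s2
        for _ in range(block):
            v1 = f(v1)
            v2 = f(f(v2))
            g = gcd(v2 - v1, n)
            if 1 < g < n:
                return g
        return None
    return None

print(rho(8051))  # 8051 = 83 * 97; prints one of the two prime factors
```

Accumulating differences and deferring the gcd is what lets the hardware overlap the subtraction with the multiplier, as described above.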
4.5.2 P-1
The detailed operations performed by the control unit of p-1 are described in this
section. The control unit of p-1 has 12 state machines and a total of 103 states. These
state machines execute Phase 1 and Phase 2 of the algorithm. The sequence of operations
in Phase 1 is based on the sliding window algorithm described in Algorithm 6, and
consists of:
Loading the initial values to all the individual units
Pre-computing all the odd powers of g used in the exponentiation
Computing the exponentiation as described in Algorithm 6. The window size is chosen as
5, and the control unit takes five bits from k, provided the leading bit is not zero; if
the leading bit is zero, a squaring operation is performed instead. After taking the 5
bits, the unit identifies the window based on the condition that the first and the last
bits are non-zero. Once the window size and value are determined, the squarings and the
multiplication with one of the pre-computed values are performed as described in
Algorithm 6. Once all the bits of k have been considered by the algorithm, Phase 1 is
completed.
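The window scanning just described can be written out in software. The following is a generic left-to-right sliding-window modular exponentiation with w = 5, using plain Python integers; the hardware performs the same steps with the Montgomery multiplier, and the exact control details of Algorithm 6 may differ from this sketch.

```python
def sliding_window_pow(g, k, n, w=5):
    """Compute g^k mod n by left-to-right sliding-window exponentiation."""
    # Pre-compute the odd powers g^1, g^3, ..., g^(2^w - 1):
    # 2^(w-1) - 1 values beyond g itself (15 extra values for w = 5).
    g2 = g * g % n
    odd = {1: g % n}
    for j in range(3, 1 << w, 2):
        odd[j] = odd[j - 2] * g2 % n

    bits = bin(k)[2:]
    result = 1
    i = 0
    while i < len(bits):
        if bits[i] == '0':
            result = result * result % n       # a 0 bit costs one squaring
            i += 1
        else:
            # Take up to w bits, then shrink the window until it ends in
            # a 1, so its value is odd and matches a pre-computed power.
            j = min(i + w, len(bits))
            while bits[j - 1] == '0':
                j -= 1
            window = int(bits[i:j], 2)
            for _ in range(j - i):
                result = result * result % n   # one squaring per window bit
            result = result * odd[window] % n  # multiply by g^window
            i = j
    return result

assert sliding_window_pow(7, 1234567, 1000003) == pow(7, 1234567, 1000003)
```

The pre-computation count 2^(w-1) - 1 matches the 15 odd powers mentioned for w = 5 in the memory-requirements discussion of Chapter 7.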
The sequence of operations in Phase 2 is based on the algorithm described in Algorithm
13. The multiplier is loaded with d, the result from Phase 1, for computing d^2, which
will be used for computing the d^j values.

The next step is to compute the d^j values based on GCD_table. The values d^j are
computed for all the odd values of j, and only those values are stored for which
gcd(j, D) = 1. This can be done by starting from d and multiplying each subsequent value
by d^2.

The last value computed as part of this step is d^(D-1), and for computing d^D the
multiplier is supplied with the operands d^(D-1) and d.

The next operation in the sequence is computing d^(mD) from d^D for m = Mmin. For this
operation, as m is a small number, left-to-right binary exponentiation [3] is used. The
result of this step is d^(mD) for m = Mmin.
The final operation of Phase 2 is the accumulation based on the prime_table. For each
pair (m, j), the prime_table holds '1' if mD - j is a prime and '0' if it is not. If
mD - j is a prime, first the subtraction d^(mD) - d^j is calculated, and then the
multiplication is performed with x and (d^(mD) - d^j) as the operands; the result is
stored back in x. This accumulation is performed for all the prime values in the
prime_table.

Once the accumulation is finished for all the prime values mD - j for a particular m,
the value of m is increased by one. This means the multiplication d^(mD) * d^D is
computed and the value of d^(mD) is updated with the new value.

Once all the values of m have been considered for accumulation, the Phase 2 algorithm is
finished in hardware and the result of Phase 2 is stored in x.

The control unit then transfers the result of Phase 2 to global memory, which is the
final step of the p-1 algorithm.
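The Phase 2 flow above can be modeled in a few lines of software. This is a plain-Python sketch of the standard continuation, with a trial-division primality test standing in for the pre-computed prime_table, and toy parameters far smaller than the B1, B2, D used in the thesis; the hardware performs the same sequence with Montgomery arithmetic.

```python
from math import gcd

def is_prime(v):
    # small trial-division test standing in for the prime_table
    if v < 2:
        return False
    i = 2
    while i * i <= v:
        if v % i == 0:
            return False
        i += 1
    return True

def phase2(d, n, D, m_min, m_max):
    """Phase 2 of p-1: accumulate x = prod(d^(mD) - d^j) over prime mD - j."""
    d2 = d * d % n
    dj = {}                              # d^j for odd j with gcd(j, D) = 1
    power = d
    for j in range(1, D, 2):
        if gcd(j, D) == 1:
            dj[j] = power
        power = power * d2 % n           # advance d^j -> d^(j+2)
    dD = pow(d, D, n)                    # in hardware: d^(D-1) * d
    dmD = pow(dD, m_min, n)              # binary exponentiation for m = Mmin
    x = 1
    for m in range(m_min, m_max + 1):
        for j in dj:
            if is_prime(m * D - j):      # a '1' entry in the prime_table
                x = x * (dmD - dj[j]) % n
        dmD = dmD * dD % n               # advance d^(mD) -> d^((m+1)D)
    return x

# Toy run: n = 31 * 83; 31 - 1 = 2 * 3 * 5. Phase 1 with k = 12 covers the
# powers of 2 and 3; Phase 2 catches the remaining prime 5, revealing 31.
n = 2573
d = pow(2, 12, n)                        # stands in for the Phase 1 result g^k
x = phase2(d, n, D=6, m_min=1, m_max=2)
print(gcd(x, n))                         # prints 31
```

The per-m update dmD = dmD * dD mirrors the single multiplication the control unit performs when m is increased by one.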
4.5.3 Unified Unit
The control unit of the unified unit performs the combined operations of the individual
rho and p-1 control units. It has 17 state machines and a total of 140 states for
performing the rho and p-1 operations in sequence.

Initially it loads the rho local memories of the individual units for the execution of
rho. Once the rho algorithm has been completely executed, it does not transfer the
result to global memory as the control unit of rho does. Instead, the control unit loads
the initial values and parameters needed for Phase 1 and Phase 2 of p-1 for the
corresponding algorithm execution. Once all the operations of p-1 are completed, the
control unit transfers the results from both the rho local memory and the p-1 local
memory to global memory. This is the final step of the control unit for the unified
architecture.
Chapter 5: FPGA Implementation and Verification
5.1 Overview of FPGA families
An FPGA is a field programmable gate array with various configurable elements. The
user has the flexibility of re-configuring these devices based on the application. Xilinx
and Altera are the two main manufacturers of FPGAs and they hold a large market share
(around 89%). Xilinx offers various FPGA families differing in technology, performance,
and cost, to suit different requirements and applications.
The main families in the new Xilinx FPGAs are Virtex (high performance) and Spartan
(low cost). Some of the high performance FPGAs are
1. Virtex-II and Virtex-II Pro, based on 130 nm technology
2. Virtex-4 based on 90 nm technology
3. Virtex-5 based on 65 nm technology.
Among the high-performance FPGAs, Virtex-5 is the most recent product and has many DSP
blocks embedded in it. Apart from Virtex-5, Virtex-4 is also widely used in signal
processing applications. In contrast, Xilinx also offers low-cost devices that are
widely used in industrial applications. Some of the low-cost FPGA families are
1. Spartan-3 based on 90 nm technology
2. Spartan-II derived from Virtex
3. Spartan-IIE derived from Virtex-E
The basic building blocks of a Xilinx FPGA are CLB (Configurable Logic Block) slices.
One CLB slice has two Look-Up Tables (LUTs) and two flip-flops; a Look-Up Table is
capable of implementing any 4-input logic function. Apart from CLB slices, depending on
the technology, some FPGAs have embedded PowerPC processors, Ethernet MAC interfaces,
and multipliers, and most Xilinx FPGAs have Block RAMs. These Block RAMs are used
whenever there is a need to process or manage large amounts of data.
5.2 FPGA Design Flow
The FPGA design flow is shown in Figure 16. It starts with design entry, in which the
specification is analyzed and described at the RTL level using either VHDL or Verilog
HDL. This is done by developing block diagrams and ASM charts from the specification,
which are then translated into HDL. Once the RTL description is done, its functionality
must be verified; this is done with a testbench through functional simulation, and the
same testbench can be reused for the later simulations. If there are any discrepancies
in the design, we go back to the design entry step, modify the RTL description, and
verify the functionality again. Once the functionality is verified, the next step is to
synthesize the code; Synplify Pro is one of the leading tools for this task. Synthesis
is the process of converting the RTL description into a gate-level netlist. Once this
netlist is generated, a post-synthesis simulation is needed to make sure that the
gate-level model works properly. If there are any mismatches in the post-synthesis
simulation, we go back to the RTL code and fix them before proceeding to
implementation.
Figure 16: FPGA Design Flow
Implementation is performed using ISE, the Xilinx vendor tool. As part of the
implementation, the tool goes through mapping, placing and routing, static timing
analysis, and bit-stream generation, so that the design can be downloaded to the FPGA
for on-chip testing. Before on-chip testing, we need to perform timing simulation, in
which the timing model adds realistic delays to the signals inside the design. Before
running the timing simulation, it is advisable to go through static timing analysis, so
that we know the maximum frequency at which the design can operate. If the timing
simulation results match the functional simulation, the final step is to program the
FPGA and verify the functionality on the chip. Once all the simulations generate the
expected output, it can be expected that experimental on-chip testing will be successful
as well.
5.3 Tools and Verification Methodology
Functionality of the circuit is verified using the testbenches and running functional, post-
synthesis and timing simulations. Once a testbench is written using the input values, the
same testbench can be used for all the simulations. I would like to thank Paul
Kohlbrenner for providing me with the test vectors needed for verification.
Tools used for FPGA design flow are:
Simulation: Active HDL, Modelsim
Synthesis: Synplicity Synplify Pro, Xilinx XST
Chapter 6: ASIC Implementation and Verification
ASIC stands for application-specific integrated circuit. Unlike an FPGA, an ASIC is not
reconfigurable: once a design is fabricated, it stays on the chip permanently. Compared
with an equivalent FPGA implementation, an ASIC implementation is faster and uses less
area.
6.1 ASIC Design Flow
The ASIC design flow is similar to the FPGA design flow and is shown in Figure 17. The
first step is to start with the specification and describe it in HDL. Once the design is
described in HDL, two aspects are checked before moving on: its functionality, by
writing a testbench and running a functional simulation, and its timing, by checking the
design against the timing goals during design exploration. Once the design meets the
timing goals, it goes through synthesis, which generates the circuit netlist. Once the
post-synthesis simulation is verified, the next step is physical design. The physical
design flow is shown in Figure 18.
Figure 17: ASIC design flow
The main steps of the physical design process are design and timing setup,
floorplanning, placement, clock tree synthesis, routing, and design for manufacturing.
In floorplanning, all the power and ground connections are made, and any macros in the
design are placed in the core area. After floorplanning, the next step is placement, in
which all the standard cells are placed inside the chip area. The following step is
building the clock tree to minimize clock skew and propagation delay. Once clock tree
synthesis is completed, the next step is routing, in which the clocks are routed first
and then all the other nets. Once all the nets and macros are routed, the generated
design needs to be checked for DRC violations, metal rules, antenna rules, etc. Once all
these errors are cleared, the design can be sent for fabrication.
Figure 18: Physical design flow of ASIC
6.2 Porting designs from FPGA to ASIC
Designs can be ported from FPGA to ASIC with minor or major modifications. Synopsys
Design Analyzer requires a certain format to be followed when the design is synthesized.
The major differences appear if there are any FPGA technology-specific blocks such as
LUT-based shift registers, BRAMs, or embedded multipliers: these exist only in FPGAs,
and the Synopsys library does not contain them, so changes need to be made accordingly
for a successful synthesis with Synopsys Design Analyzer. In p-1 and rho I am using
BRAMs, which are FPGA-specific, and these memories are replaced by equivalent memory
components from the memory database library.
6.3 Tools and Libraries
6.3.1 Tools
The following tools were used for the ASIC implementation:
Synthesis: Synopsys Design Analyzer
Physical Design: Astro
Static Timing Analysis: PrimeTime
Formal Verification: Formality
6.3.2 Libraries
The following libraries were used for synthesis and implementation:
Standard Cell Library: cb13fs120_tsmc_max (TSMC 130 nm library)
Memory Libraries: ram32x64_max, ram32x32_max, ram32x128_max
Chapter 7: Results
7.1 FPGA Results
All three architectures are developed entirely in RTL using VHDL. An effort was made to
make the code as generic as possible, so that the same code can be used across different
families of FPGAs and in ASIC with minor modifications. The three architectures, i.e.,
rho, p-1, and unified rho and p-1, are implemented in three different families of FPGAs:
Spartan 3, Spartan 3E, and Virtex II. Initially, single units are implemented to measure
the resource utilization of the three architectures, and then the maximum number of
units is implemented in all three FPGA families to compare the results. Apart from these
three families, all the single units are also implemented in Virtex-4; because of
limited system resources, the maximum number of units could not be implemented in
Virtex-4. For all FPGA implementations, the code was synthesized using Synplify Pro 8.6
and implemented using Xilinx ISE 8.2i. The functionality of all the units was thoroughly
verified using test vectors provided by Paul Kohlbrenner.
7.1.1 Memory Requirements
7.1.1.1 P-1
Memory requirements for the implementation of p-1 are shown in Table 3. For Phase 1, the
local memory has to hold the modulus, the data from the pre-computations of the sliding
window exponentiation, and the final result. The amount of resources needed for the
pre-computations depends on the size of the window: if the window size is w, the number
of pre-computed values is 2^(w-1) - 1. For the implementation of p-1, the window size is
chosen as 5, so the number of pre-computed values is 15. Apart from the pre-computed
values, the initial parameters needed for the implementation are g, g^2, and 1 in the
Montgomery domain. The intermediate and final results are also stored in local memory.
Combining all these values, the local memory for Phase 1 requires 19 registers. In Phase
2, the size of the local memory depends on the number of gcd values, which in turn
depends on the parameter D. The gcd values computed in Phase 2 are stored in local
memory, as they are used for the subtraction and accumulation. For D = 210, there are 48
gcd values, and each value requires 8 locations in the memory. Other than the gcd
values, there are 4 other locations used for intermediate values during the computation
and the final result. After combining all these locations, the local memory requires 424
locations.
Table 3: Memory resources needed for p-1 hardware implementation

Objects | # objects | # 32-bit words per object | # words | # bits | Memory size | # BRAMs (512x32)

p-1 Local Memory, Phase 1:
Registers | 19 | 8 | 152 | 4,864 | 256x32 | 1

p-1 Local Memory, Phases 1 & 2:
Registers | 53 | 8 | 424 | 13,568 | 512x32 | 1

Global Memory, Phase 1 (14 units):
Registers | 4*14 | 8 | 468 | 14,976 | |
kN | 1 | 1 | 1 | 32 | |
k | 1 | 43 | 43 | 1,376 | |
Total | | | 512 | 16,384 | 512x32 | 1

Global Memory, Phase 2:
GCD_table | 4 | 1 | 4 | 128 | |
Mmin, MN | 2 | 1 | 2 | 64 | |
prime_table | 402 | 1 | 402 | 12,864 | |
Total | | | 408 | 13,056 | 512x32 | 1
The global memory size depends on the phase of the algorithm. In Phase 1, global memory
holds the initial data needed for all the individual units and the value of k, which is
a product of primes and is common to all the units. Each unit needs the number to be
factored n and the initial values g, g^2, and 1 in the Montgomery domain. The size of k,
which is common to all the units, is 1375 bits, occupying 43 locations, plus one
location for the number of words in k. The size of the global memory for Phase 2 is
independent of the number of units and depends on the parameters B1, B2, and D. For the
values B1 = 960, B2 = 57,000 and D = 210, the GCD_table takes four words and the
prime_table takes 402 words. Including the values of Mmin and MN, the size of the global
memory needed for Phase 2 is 408x32; the available BRAM size in the FPGA is 512x32. A
512x32 global memory used for Phase 2 can, when used for Phase 1, hold the initial
values needed for 14 units along with the value of k: the size of the global memory for
Phase 1 with m units is m*32 + 44 words. Using the same memory, once all the units are
supplied with the parameters needed for Phase 1, the global memory can be loaded with
the data needed for Phase 2.
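As a quick check of the sizing rule above, the snippet below recovers the 14-unit figure from the m*32 + 44 formula and the 512-word BRAM depth (both taken from the text):

```python
BRAM_WORDS = 512                 # depth of one 512x32 block RAM

def phase1_words(m):
    # per unit: n, g, g^2 and 1 in the Montgomery domain, 4 objects x 8
    # words = 32; shared: 43 words for k plus 1 word for its word count
    return m * 32 + 44

# largest m such that the Phase 1 data still fits in one BRAM
max_units = max(m for m in range(1, 100) if phase1_words(m) <= BRAM_WORDS)
print(max_units)                 # prints 14, matching the text
```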
7.1.1.2 Rho
Memory requirements for the rho implementation are shown in Table 4. As shown in Figure
13, the local memory contains 6 registers and each register holds 8 words, for 48 words
in total; rounded up to a power of two, the local memory size is 64x32.
Table 4: Memory resources needed for rho implementation

Objects | # objects | # 32-bit words per object | # words | # bits | Memory size | # BRAMs (512x32)

Local Memory:
Registers | 6 | 8 | 48 | 1,536 | 64x32 |

Global Memory:
Registers | 5 | 8 | 40 | 1,280 | |
t | 1 | 1 | 1 | 32 | |
Total | | | 41 | 1,312 | 64x32 |
Global memory holds the parameters needed for each unit. The parameters n, x0, a, and d
are different for different units, while the number of iterations t is common to all the
units. For ease of implementation, x0 is loaded twice into the local memory, as it is
the initial value of both v1 and v2. As explained, global memory needs 41 locations for
one unit; for m units, the size of the global memory is m*40 + 1 words.
7.1.1.3 Unified rho and p-1
As explained in the architecture of the unified unit, it has two local memories, one for
rho and the other for p-1. The contents and sizes of these memories are the same as
described in the earlier sections.

The global memory of the unified unit holds all the parameters needed for both rho and
p-1. Combining the parameters and values needed for both methods, the global memory
requires 517 locations for a single unified unit. The size of the global memory with m
units is composed as follows:
m*8 locations for the numbers to be factored
m*32 locations for parameters of rho
1 location for number of iterations in rho
m*24 locations for Phase 1 parameters of p-1
1 location for no. of words in k for p-1
43 locations for the value of k
408 locations for Phase 2 of p-1
Combining all these parameters and values results in a global memory size of m*64 + 453
words with m unified units.
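Summing the per-unit and shared contributions in the list above gives the closed form m*64 + 453, which checks out against the 517 locations quoted for a single unified unit:

```python
def unified_words(m):
    per_unit = 8 + 32 + 24        # number to factor, rho params, p-1 Phase 1 params
    shared = 1 + 1 + 43 + 408     # rho iteration count, word count of k, k, Phase 2 data
    return m * per_unit + shared  # = m*64 + 453

print(unified_words(1))           # prints 517, as stated for a single unit
```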
7.1.2 Timing Calculations for rho, p-1 and unified unit
The execution times of rho, p-1, and the unified unit are explained in this section,
with generic timing formulas given for each method. The timing details of the basic
arithmetic blocks, the multiplier and the adder/subtractor, are shown in Table 5, where
n is the number of bits in the number to be factored. A 195-bit number is factored, so a
Montgomery multiplication takes n + 16 = 211 clock cycles and a modular addition takes
41 clock cycles.
Table 5: Execution times of arithmetic blocks

Operation | Notation | Formula | # clk cycles
Modular addition | TA | | 41
Montgomery multiplication | TM | TM = n + 16 | 211
7.1.2.1 Rho
The execution time of the rho method depends on the number of iterations t. From Figure
9 and Algorithm 11, the total number of operations in the rho method with t iterations
is

2t multiplications + 2t additions

(t - 1) accumulations + t subtractions

As shown in Table 2, the operations in the first and second rows are executed in
parallel, i.e., the multiplier is loaded with its operands and, once the multiplication
has started, the addition or subtraction finishes in parallel, so most additions and
subtractions require no extra clock cycles.
Total execution time for the rho method ≈ (3t - 1) TM + t TA
For t = 1024, which is √(2^20), the estimated and simulated timings are

Testrho = 689,965 clock cycles

Tsimrho = 703,125 clock cycles

The difference between the estimated and simulated timings arises because the timing of
control operations and data movements is not included in the estimate.
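The estimate can be reproduced directly from the cycle counts in Table 5:

```python
TM, TA = 211, 41        # Montgomery multiplication / modular addition (Table 5)
t = 1024                # number of rho iterations, sqrt(2^20)

est = (3 * t - 1) * TM + t * TA   # (3t-1) multiplications plus t additions
print(est)                        # prints 689965, matching Testrho
```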
7.1.2.2 P-1
The execution time of the p-1 algorithm is shown in Table 6. This time consists of the
pre-computation and modular exponentiation times in Phase 1, and the gcd-value
computation and accumulation times in Phase 2. The number of multiplications in Phase 1
depends on the value of k and the number of pre-computations.
Table 6: Timing calculations for Phase 1 and Phase 2 of p-1

Operation | Notation | Formula | # clk cycles

Phase 1:
Pre-computation | Tpre | Tpre = 15 TM | 3,165
Exponentiation | Texp | Texp = 1602 TM | 338,022
Phase 1 (estimation) | TP1est | TP1est = Tpre + Texp | 341,187
Phase 1 (simulation) | TP1sim | | 343,087

Phase 2:
Pre-calculating d^2 | Tsq | Tsq = TM | 211
Pre-calculating d^j | Tj | Tj = (D/2 - 1) TM | 21,944
Calculating d^D | TdD | TdD = TM | 211
Calculating d^(mD) from d^D for m = Mmin | TdmD | TdmD = 3 TM | 633
Calculating d^(mD) for m = Mmin+1 to Mmax | TmnD | TmnD = (MN - 1) TM | 56,337
Number of ones in the prime_table | nprime_table | | 5,646
Calculating accumulated product x | Tx | Tx = nprime_table (TM + TA) | 1,422,792
Phase 2 (estimation) | TP2est | | 1,501,917
Phase 2 (simulation) | TP2sim | | 1,565,271

Total time (estimation) = TP1est + TP2est = 1,843,104 clock cycles
Total time (simulation) = 1,912,155 clock cycles
Table 6 lists all the detailed operations in Phase 2 of p-1 along with the time each
takes. The expected execution time is derived from these operations and compared with
the time obtained from simulation; there is about a 3% difference between the estimated
and simulated times. This is because the control unit operations and the data transfers
inside the local memory and between the local and global memories are not included in
the estimated timing. Looking at the timing of Phase 2, the accumulation operation takes
94% of the total time.
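The Phase 2 rows of Table 6 can be recomputed from their formulas; the values of D, MN, and the prime_table count below are taken from the table (note that the quoted TP2est total matches the sum without the separate Tsq row, so Tsq appears to be folded into Tj):

```python
TM, TA = 211, 41                 # cycle counts from Table 5
D, MN = 210, 268                 # Phase 2 parameters implied by Table 6
n_prime = 5646                   # number of ones in the prime_table

Tj = (D // 2 - 1) * TM           # d^j values:          21,944 cycles
TdD = TM                         # d^D:                     211 cycles
TdmD = 3 * TM                    # d^(mD), m = Mmin:        633 cycles
TmnD = (MN - 1) * TM             # remaining d^(mD):     56,337 cycles
Tx = n_prime * (TM + TA)         # accumulation:      1,422,792 cycles

TP2est = Tj + TdD + TdmD + TmnD + Tx
print(TP2est)                    # prints 1501917, the TP2est in Table 6
```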
7.1.2.3 Unified rho and p-1
Since the unified unit executes rho and p-1 in sequence, its execution time is the sum
of the execution times of the two methods. The estimated and simulated times for the
unified unit are

Tuniest (estimation) = 2,533,069 clock cycles

Tunisim (simulation) = 2,614,746 clock cycles

As explained earlier, the difference between the estimated and simulated timings comes
from the control operations and the data movement inside the local memory and between
the local and global memories, which are not included in the estimate.
7.1.3 Area and Timing results from FPGA implementations
Table 7: FPGA implementation of a single unit of rho

Results | Virtex II XC2V6000-6 | Spartan 3 XC3S5000-5 | Spartan 3E XC3S1600E-5 | Virtex 4 XC4VLX200-11
CLB Slices | 1,680 (4%) | 1,740 (5%) | 1,778 (12%) | 1,785 (2%)
LUTs | 2,714 (4%) | 2,845 (4%) | 2,905 (9%) | 2,903 (1%)
FFs | 1,518 (2%) | 1,524 (2%) | 1,522 (5%) | 1,513 (1%)
BRAMs | 0/144 | 0/104 | 0/36 | 0/336
Technology | 0.15/0.12 um | 90 nm | 90 nm | 90 nm
Max. Clock Frequency | 130 MHz | 94 MHz | 110 MHz | 141 MHz
Table 8: FPGA implementation of a single unit of p-1

Results | Virtex II XC2V6000-6 | Spartan 3 XC3S5000-5 | Spartan 3E XC3S1600E-5 | Virtex 4 XC4VLX200-11
CLB Slices | 1,749 (5%) | 1,787 (5%) | 1,805 (12%) | 1,830 (2%)
LUTs | 2,875 (4%) | 2,915 (4%) | 2,947 (9%) | 2,963 (1%)
FFs | 1,645 (2%) | 1,672 (2%) | 1,681 (5%) | 1,642 (1%)
BRAMs | 2/144 | 2/104 | 2/36 | 2/336
Technology | 0.15/0.12 um | 90 nm | 90 nm | 90 nm
Max. Clock Frequency | 131 MHz | 112 MHz | 120 MHz | 155 MHz
Table 9: FPGA implementation of a single unified unit

Results | Virtex II XC2V6000-6 | Spartan 3 XC3S5000-5 | Spartan 3E XC3S1600E-5 | Virtex 4 XC4VLX200-11
CLB Slices | 2,042 (6%) | 2,199 (6%) | 2,201 (14%) | 2,166 (2%)
LUTs | 3,451 (5%) | 3,763 (5%) | 3,774 (12%) | 3,669 (2%)
FFs | 1,740 (2%) | 1,742 (2%) | 1,755 (5%) | 1,729 (1%)
BRAMs | 2/144 | 2/104 | 2/36 | 2/336
Technology | 0.15/0.12 um | 90 nm | 90 nm | 90 nm
Max. Clock Frequency | 115 MHz | 95 MHz | 101 MHz | 141 MHz
Results for single units of rho, p-1, and the unified architecture are tabulated in
Table 7, Table 8, and Table 9. As expected, the high-performance devices Virtex II and
Virtex 4 can operate at higher frequencies than the low-cost devices Spartan 3 and
Spartan 3E. These results demonstrate that, with a small increase in area, the two
methods rho and p-1 can be combined into one unit. As both methods use the same
low-level arithmetic units, only the area of the control unit is larger in the unified
unit. The unification of rho and p-1 is achieved at the expense of a frequency drop;
this results from the combined state machines of the two individual units increasing the
fanout on some internal signals. The frequency could be improved by adding registers in
the critical path if necessary. The individual rho and p-1 units use similar amounts of
resources, whereas the unified unit uses slightly more.
Table 10: FPGA implementation of the maximum number of rho units

Results | Virtex II XC2V6000-6 | Spartan 3 XC3S5000-5 | Spartan 3E XC3S1600E-5 | Virtex 4 XC4VLX200-11 (estimated)
Max # of rho units per FPGA device | 20 | 20 | 8 | 40
CLB Slices | 29,837 (88%) | 32,069 (96%) | 13,122 (88%) |
LUTs | 49,261 (72%) | 53,902 (80%) | 21,827 (73%) |
FFs | 27,709 (40%) | 27,732 (41%) | 11,285 (38%) |
BRAMs | 2/144 | 2/104 | 2/36 | 4/336
Technology | 0.15/0.12 um | 90 nm | 90 nm | 90 nm
Cost of an FPGA device | $2700 | $130 | $35 | $3000
Max. Clock Frequency | 108 MHz | 81 MHz | 97 MHz | 141 MHz
Time for rho | 6.5 ms | 8.68 ms | 7.25 ms | 5 ms
# of rho computations per second | 3,076 | 2,304 | 1,103 | 8,000
# of rho computations per second per $100 | 114 | 1,772 | 3,151 | 266
Table 11: FPGA implementation of the maximum number of p-1 units

Results | Virtex II XC2V6000-6 | Spartan 3 XC3S5000-5 | Spartan 3E XC3S1600E-5 | Virtex 4 XC4VLX200-11 (estimated)
Max # of p-1 units per FPGA device | 23 | 23 | 9 | 46
CLB Slices | 31,827 (94%) | 32,451 (97%) | 13,013 (88%) |
LUTs | 49,668 (73%) | 51,207 (76%) | 20,768 (70%) |
FFs | 31,942 (47%) | 31,949 (48%) | 12,754 (43%) |
BRAMs | 25/144 | 15/104 | 10/36 | 50/336
Technology | 0.15/0.12 um | 90 nm | 90 nm | 90 nm
Cost of an FPGA device | $2700 | $130 | $35 | $3000
Max. Clock Frequency | 107 MHz | 86 MHz | 112 MHz | 155 MHz
Time for p-1 | 17.88 ms | 22.24 ms | 17.08 ms | 12.33 ms
# of p-1 computations per second | 1,286 | 1,034 | 527 | 3,730
# of p-1 computations per second per $100 | 47 | 795 | 1,505 | 124
FPGA implementations with the maximum number of rho, p-1, and unified units are shown in
Table 10, Table 11, and Table 12. Resource utilization in terms of CLB slices, number of
BRAMs used, maximum clock frequency, execution time, computations per second, and
computations per second per $100 are shown in the tables. As mentioned before, due to
the limited system resources, the maximum number of units was not implemented on Virtex
4; instead, these values are estimated assuming the best case, so that the comparison
remains fair.
Looking at the results in terms of computations per second per $100, the low-cost
devices outperform the high-performance devices: Spartan 3 outperforms Virtex II by
factors of 15 (rho), 17 (p-1), and 15 (unified), and Spartan 3E outperforms Virtex 4 by
factors of 12, 12, and 11, respectively. The low-cost devices are thus the most suitable
for code breaking. Additionally, the new-generation devices Spartan 3E and Virtex 4
perform better than their previous-generation counterparts Spartan 3 and Virtex II.
Table 12: FPGA implementation of the maximum number of unified units

Results | Virtex II XC2V6000-6 | Spartan 3 XC3S5000-5 | Spartan 3E XC3S1600E-5 | Virtex 4 XC4VLX200-11 (estimated)
Max # of unified units per FPGA device | 21 | 19 | 8 | 42
CLB Slices | 32,280 (95%) | 30,905 (93%) | 13,587 (92%) |
LUTs | 53,791 (79%) | 45,832 (69%) | 19,810 (67%) |
FFs | 29,268 (43%) | 27,894 (41%) | 11,392 (38%) |
BRAMs | 25/144 | 23/104 | 10/36 | 49/336
Technology | 0.15/0.12 um | 90 nm | 90 nm | 90 nm
Cost of an FPGA device | $2700 | $130 | $35 | $3000
Max. Clock Frequency | 102 MHz | 80 MHz | 95 MHz | 141 MHz
Time for rho & p-1 | 25.62 ms | 32.68 ms | 27.53 ms | 18.56 ms
# of unified computations per second | 819 | 581 | 290 | 2,262
# of unified computations per second per $100 | 30 | 447 | 828 | 75
7.2 Area and Timing Results from ASIC implementations
All three architectures are implemented as ASICs using the TSMC 130 nm library, with the
results shown in Table 13. The netlist generated after synthesis is verified for
functional equivalence against the RTL using the Formality tool, a process known as
formal verification. In general, the frequency of operation in an ASIC is 2 to 2.5 times
that of an FPGA, but I could never get more than 200 MHz in any of the cases. Initially,
a target frequency of 250 MHz was tried, which resulted in many timing violations and
DRC errors. The target frequency was then decreased to 200 MHz, leaving no timing
violations or DRC errors. Even though some positive slack was obtained, 200 MHz is used
for comparison purposes. The maximum number of units could not be implemented due to the
limited system resources available in the lab.
Table 13: Results from ASIC implementation

Operation | rho | p-1 | Unified architecture
Area | 1.15 mm^2 | 1.21 mm^2 | 1.8 mm^2
Max. Clock Frequency | 200 MHz | 200 MHz | 200 MHz
Time for execution | 3.52 ms | 9.56 ms | 13.1 ms
# of operations per second (using max. no. of units) | 96,022 | 34,100 | 16,615
Core utilization ratio | 70% | 70% | 65%
When 20 rho units are implemented in ASIC, the resulting layout is shown in Figure 19.
The improvement in terms of area is about a factor of 51. An area factor of 50 is
reasonable [19] for this kind of circuit, where both logic and memory are used. This
factor may be smaller for p-1 and the unified architecture, as they need more area for
memory.
Figure 19: ASIC layout of 20 units of rho
7.3 Comparison of FPGA and ASIC Results
Before comparing the performance of the FPGA and ASIC implementations, we need to determine how many units would fit on an ASIC with the same die area as a Virtex II, since the Virtex II is the closest match to the ASIC technology used for these implementations. The area of the Virtex II FPGA is 19.68 x 19.8 mm2 (estimate by R. J. Lim Fong, MS Thesis, VPI, 2004).
When the same area is assumed for the ASIC implementation, the ASIC can fit 338 units of rho, 322 units of p-1, and 216 units of the unified rho and p-1 architecture. The number of computations is then much higher than in the FPGA, because of the more effective utilization of area and the higher operating frequency of the ASIC. In fact, the maximum number of units that can be implemented on the ASIC is greater than this estimate, for two reasons. First, the present comparison assumes one control unit per execution unit on the ASIC; this is not the case in the FPGA, and would not be the case in an actual ASIC implementation, where one control unit can serve several execution units. Second, the global memory, estimated here at 512x32 words per unit, does not need to grow linearly with the number of units. The global memory occupies a large portion of the area, and if it is properly shared, more units can be implemented on the chip. For comparison purposes, however, all calculations use the numbers presented in this section. Comparing the number of operations per second, the ASIC outperforms the FPGA by a factor of 31 for rho, 27 for p-1, and 21 for the unified unit. If all the factors mentioned above are taken into account, these ratios may rise to 65-80.
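As a quick sanity check, the throughput figures follow directly from the unit counts and per-operation execution times: operations per second equal the number of parallel units divided by the time one unit needs per operation. The short Python sketch below uses the rho estimates from this section (338 units, 3.52 ms per operation) and reproduces the reported 96,022 operations per second; the unit counts and times are this section's estimates, not measured values.

```python
# Throughput sanity check:
#   ops/s = (number of parallel units) / (time per operation).
# Values are this section's ASIC estimates for the rho architecture.

def ops_per_second(units, exec_time_ms):
    """Operations per second for `units` parallel units, each taking exec_time_ms."""
    return units / (exec_time_ms / 1000.0)

rho_throughput = ops_per_second(338, 3.52)
print(int(rho_throughput))  # 96022, matching the reported figure for rho
```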
Chapter 8: Summary and Conclusion
A novel architecture for the rho and p-1 methods, as well as a unified rho and p-1 architecture, has been proposed. The rho and p-1 methods are implemented in hardware for the first time. The three architectures, rho, p-1, and unified rho and p-1, are implemented in two different technologies, FPGA and ASIC. On FPGAs, all three architectures are implemented in four different device families. The FPGA implementation results show that low-cost devices such as Spartan 3 and Spartan 3E outperform the high-performance Virtex II and Virtex 4 devices by factors of 15 and 11, respectively, in the case of the unified architecture, when the number of operations per unit of time per unit of cost is considered. It is also shown that, with little area overhead, both methods, rho and p-1, can be integrated into one unit. This unified unit may find more than one factor, and has a higher probability of finding a factor, as it executes both algorithms sequentially. When implemented in ASIC, the hardware resources are used effectively, and it is possible to fit significantly more units than in an FPGA. The area of the Virtex II is used for comparison purposes, as its technology is the closest match to the ASIC technology used. In terms of frequency, the ASIC achieves approximately 1.5 times the frequency of the FPGA. An ASIC chip with an area similar to that of a Virtex II FPGA can fit 338 units of rho, 322 units of p-1, and 216 unified units. The number of computations performed by the ASIC exceeds that of the FPGA by a factor of 31 for rho, 27 for p-1, and 26 for the unified architecture. These numbers are calculated under the least efficient use of the ASIC resources; if the resources are utilized properly, they increase by at least a factor of 3-5. When produced in large quantities (greater than 100,000 chips) to amortize the non-recurring cost (about $1,000,000), ASIC is the better technology, as it uses area effectively and achieves a higher frequency. When mass production is not an option, the most suitable devices for code-breaking are low-cost FPGAs such as Spartan 3 and Spartan 3E.
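For reference, the two factoring methods that the unified unit executes sequentially can be sketched in software. The minimal Python sketch below shows Pollard's rho method (Floyd's cycle detection on x -> x^2 + c mod n [13]) and stage 1 of the p-1 method [14], run one after the other as in the unified unit. It omits the Montgomery arithmetic [10] and all other hardware optimizations, and the parameter choices (c = 1, starting point 2, the smoothness bound) are illustrative only.

```python
from math import gcd

def pollard_rho(n, c=1):
    """Pollard's rho with Floyd cycle detection on x -> x^2 + c (mod n)."""
    x = y = 2
    d = 1
    while d == 1:
        x = (x * x + c) % n          # tortoise: one step
        y = (y * y + c) % n          # hare: two steps
        y = (y * y + c) % n
        d = gcd(abs(x - y), n)
    return d if d != n else None     # d == n means this parameter choice failed

def pollard_p_minus_1(n, bound=100):
    """Stage 1 of the p-1 method: compute 2^(bound!) mod n, then take a gcd."""
    a = 2
    for k in range(2, bound + 1):
        a = pow(a, k, n)             # accumulates the exponent bound!
    d = gcd(a - 1, n)
    return d if 1 < d < n else None

def unified(n):
    """Mirrors the unified unit: try rho first, then p-1, sequentially."""
    return pollard_rho(n) or pollard_p_minus_1(n)
```

For example, `unified(8051)` finds the factor 97 of 8051 = 83 x 97 via the rho step.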
The unified rho and p-1 unit uses slightly more resources than the individual units, but it has a higher probability of finding factors. If this unified unit is combined with ECM [5] and trial division, it can implement the entire post-sieving phase of NFS, and this combined unit will have a better chance of finding factor(s). ECM was implemented using two multipliers and one adder/subtractor, whereas the unified architecture has only one multiplier and one adder/subtractor. Implementing trial division in hardware, redesigning ECM with one multiplier and one adder/subtractor, and combining all the methods is a possible direction for future work.
BIBLIOGRAPHY
[1] C. McIvor, M. McLoone, J. McCanny, A. Daly, and W. Marnane, "Fast Montgomery modular multiplication and RSA cryptographic processor architectures," Proc. 37th IEEE Computer Society Asilomar Conference on Signals, Systems and Computers, pp. 379-384, Nov. 2003.
[2] K. Gaj, S. Kwon, P. Baier, P. Kohlbrenner, H. Le, M. Khaleeluddin, and R. Bachimanchi, "Implementing the elliptic curve method of factoring in reconfigurable hardware," in Proc. Cryptographic Hardware and Embedded Systems Workshop, CHES 2006, Yokohama, Japan, Oct. 2006.
[3] A. Menezes, P. van Oorschot, and S. Vanstone, Handbook of Applied Cryptography, CRC Press, 1996.
[4] P. L. Jensen, "Integer Factorization," Master's Thesis, Department of Computer Science, University of Copenhagen, Fall 2005.
[5] M. Khaleeluddin, "Hardware Implementation of the Elliptic Curve Method of Factoring," Master's Thesis, Department of Electrical and Computer Engineering, George Mason University, Summer 2006.
[6] J. M. Pollard, "Factoring with cubic integers," in Lecture Notes in Mathematics 1554, Springer-Verlag, 1993, pp. 4-10.
[7] F. Bahr, M. Boehm, J. Franke, and T. Kleinjung, "Factorization of RSA-200." [Online]. Available: http://crypto-world.com/announcements/rsa200.txt
[8] R. P. Brent, "Some integer factorization algorithms using elliptic curves," Australian Computer Science Communications, vol. 8, pp. 149-163, 1986.
[9] H. W. Lenstra, "Factoring integers with elliptic curves," Annals of Mathematics, vol. 126, no. 2, pp. 649-673, 1987.
[10] P. L. Montgomery, "Modular multiplication without trial division," Mathematics of Computation, vol. 44, pp. 519-521, 1985.
[11] P. L. Montgomery, "Speeding the Pollard and elliptic curve methods of factorization," Mathematics of Computation, vol. 48, pp. 243-264, 1987.
[12] P. L. Montgomery, "An FFT extension of the elliptic curve method of factorization," Ph.D. dissertation, UCLA, 1992.
[13] J. M. Pollard, "A Monte Carlo method for factorization," Nordisk Tidskrift for Informationsbehandling (BIT), vol. 15, pp. 331-334, 1975.
[14] J. M. Pollard, "Theorems on factorization and primality testing," Proceedings of the Cambridge Philosophical Society, vol. 76, pp. 521-528, 1974.
[15] R. P. Brent, "An improved Monte Carlo factorization algorithm," BIT, vol. 20, pp. 176-184, 1980.
[16] W. Stallings, Cryptography and Network Security, 3rd ed., Prentice Hall, 2003.
[17] I. Kuon and J. Rose, "Measuring the Gap Between FPGAs and ASICs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 2, Feb. 2007.