

Incomplete reduction in modular arithmetic

T. Yanık, E. Savaş and Ç.K. Koç

Abstract: The authors describe a novel method for obtaining fast software implementations of the arithmetic operations in the finite field GF(p) with an arbitrary prime modulus p of arbitrary length. The most important feature of the method is that it avoids bit-level operations, which are slow on microprocessors, and performs word-level operations, which are significantly faster. The proposed method has applications in public-key cryptographic algorithms defined over the finite field GF(p), most notably the elliptic curve digital signature algorithm.

1 Introduction

The basic arithmetic operations (i.e. addition, subtraction and multiplication) in the finite field GF(p) have several applications in cryptography, such as the decipherment operation of the RSA algorithm [1], the Diffie–Hellman key exchange algorithm [2], elliptic curve cryptography [3, 4], and the Digital Signature Standard including the Elliptic Curve Digital Signature Algorithm (ECDSA) [5]. These applications demand high-speed software implementations of the arithmetic operations in GF(p) for $160 \le \lceil \log_2 p \rceil \le 2048$. In this paper we describe a new method for obtaining high-speed software implementations of the arithmetic operations on microprocessors and general-purpose computers. The most important feature of this method is that it avoids bit-level operations, which are slow on modern microprocessors. The algorithms proposed in this paper replace bit-level operations with word-level operations, which results in much higher speeds. We provide the timing results of our implementations on a Pentium II computer, supporting our speedup claims.

2 Representation of the numbers

The arithmetic of GF(p) is also called modular arithmetic where the modulus is p. The elements of the field are the set of integers $\{0, 1, \ldots, (p-1)\}$, and the arithmetic function (addition, subtraction and multiplication) takes two input operands from this set and produces the output which is also in this set. We are assuming that the modulus p is a k-bit integer, where $k \in [160, 2048]$. A number in this range is represented as an array of words, where each word is of length w. Most software implementations require that w = 32; however, w can be selected as 8 or 16 on 8-bit or 16-bit microprocessors.

To create a scalable implementation, we are not placing any restrictions on the prime p or its length k. The prime number does not need to be in any special form, as some methods require; for example, the method in [6] requires that $p = 2^k - c$. Furthermore, the length of the prime p does not need to be an integer multiple of the word-size of the computer.

© IEE, 2002

IEE Proceedings online no. 20020235

DOI: 10.1049/ip-cdt:20020235

Paper first received 26th June 2001 and in revised form 25th January 2002

The authors are with the Department of Electrical & Computer Engineering, Oregon State University, Corvallis, Oregon 97331, USA

We have the following definitions:

k: the exact number of bits required to represent the prime modulus p, i.e. $k = \lceil \log_2 p \rceil$

w: the word-size of the representation (i.e. the computer); usually, w = 8, 16, 32

s: the exact number of words required to represent the prime modulus p, i.e. $s = \lceil k/w \rceil$

m: the total number of bits in s words, i.e. $m = sw$.
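As a quick illustration of these definitions, the following Python sketch derives k, s and m from p and w. The helper name `representation_parameters` is hypothetical and introduced only for this example; it assumes p is an odd prime, in which case `int.bit_length()` agrees with $\lceil \log_2 p \rceil$.

```python
def representation_parameters(p: int, w: int = 32) -> dict:
    """Return k, w, s and m for an odd prime modulus p and word size w (bits)."""
    k = p.bit_length()      # equals ceil(log2 p) when p is not a power of two
    s = -(-k // w)          # ceil(k / w) using integer arithmetic only
    m = s * w               # total number of bits in s words
    return {"k": k, "w": w, "s": s, "m": m}


# The toy parameters used later in the paper: p = 11 with 3-bit words.
print(representation_parameters(11, w=3))   # {'k': 4, 'w': 3, 's': 2, 'm': 6}
```

For a 161-bit prime and w = 32 this gives s = 6 and m = 192, so 31 bits of the most significant word would be unused in the completely reduced representation.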

Since our computers are capable of performing only two's complement binary arithmetic, we represent the numbers as unsigned binary numbers. A number from the field GF(p) is represented as an s-word array of unsigned binary integers. We will use the notation $A = (A_{s-1} A_{s-2} \cdots A_1 A_0)$, where the words $A_i$ for $i = 0, 1, \ldots, (s-1)$ are unsigned binary numbers of length w. The most significant word (MSW) of A is $A_{s-1}$, while the least significant word (LSW) of A is $A_0$. The bit-level representation of A is given as $A = (a_{k-1} a_{k-2} \cdots a_1 a_0)$. Similarly, the most significant bit (MSB) of A is $a_{k-1}$, while the least significant bit (LSB) of A is $a_0$. We will use exactly s words to represent a number. If k is not an integer multiple of w, then we have $k = (s-1)w + u$, where $u < w$ is a positive integer, and thus only the least significant u bits of the MSW $A_{s-1}$ are occupied, and the most significant $(w-u)$ bits are all zero.

$$
\begin{array}{c|c|c|c|c}
A_{s-1} & A_{s-2} & \cdots & A_1 & A_0 \\ \hline
\underbrace{0 \cdots 0}_{w-u}\ a_{(s-1)w+u-1} \cdots a_{(s-1)w} & a_{(s-1)w-1} \cdots a_{(s-2)w} & \cdots & a_{2w-1} \cdots a_w & a_{w-1} \cdots a_0
\end{array}
$$

IEE Proc.-Comput. Digit. Tech., Vol. 149, No. 2, March 2002

2.1 Incompletely reduced numbers

This representation methodology has a shortcoming: it requires bit-level operations on the MSW in order to perform arithmetic, which affects the speed of the operations in software implementations. To overcome this shortcoming, we introduce the concept of incomplete modular arithmetic. To explain the mechanics of the method, we make the following definitions:

• completely reduced numbers: the numbers from 0 to $(p-1)$:

$$C = \{0, 1, \ldots, (p-1)\}$$

• incompletely reduced numbers: the numbers from 0 to $(2^m-1)$:

$$I = \{0, 1, \ldots, p-1, p, p+1, \ldots, (2^m-1)\}$$

• unreduced numbers: the numbers from p to $(2^m-1)$:

$$U = \{p, p+1, \ldots, (2^m-1)\}$$

Note that we have the following relationship between these sets:

$$C \subset I, \qquad U \subset I, \qquad U = I - C$$

If $A \in C$ and $2p < 2^m$, then A has an incompletely reduced equivalent $B \in I$ such that $A = B \pmod p$. Thus, instead of working with A, we can also work with B in our arithmetic operations. The incompletely reduced numbers completely occupy s words as follows:

$$
\begin{array}{c|c|c|c|c}
B_{s-1} & B_{s-2} & \cdots & B_1 & B_0 \\ \hline
b_{sw-1} \cdots b_{(s-1)w} & b_{(s-1)w-1} \cdots b_{(s-2)w} & \cdots & b_{2w-1} \cdots b_w & b_{w-1} \cdots b_0
\end{array}
$$

When we perform arithmetic with incompletely reduced numbers, we do not need to perform bit-level operations on the MSW. All operations are word-level operations, and checks for carry bits are performed on the word boundaries, not within the words. Furthermore, we skip unnecessary reductions until the actual output is produced, in which case we make sure that it belongs to C. This representation yields high-speed implementations of the arithmetic operations.

An incompletely reduced number B can be converted to its completely reduced equivalent A by subtracting integer multiples of p from B as many times as necessary (until it is less than p).

2.2 Positive and negative numbers

The numbers in the range $[0, p-1]$ as defined are positive numbers. The implementation of the subtraction operation requires a method of representation for negative numbers as well. We use the least positive residues, which allow simultaneous representation of the positive and negative numbers modulo p. In this representation, the numbers are always left as positive, i.e. if the result of a subtraction operation is a negative number, then it is converted back to positive by adding p to it. For example, for p = 7, the operation $s = 3 - 4$ is performed as $s = 3 - 4 + 7 = 6$. The numbers from 0 to $(p-1)/2$ can be interpreted as positive numbers modulo p, while the numbers from $(p-1)/2 + 1$ to $p - 1$ can be interpreted as negative numbers modulo p. However, numbers are always kept as unsigned positive integers.
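The conversion to a completely reduced number and the signed interpretation of least positive residues can be sketched in a few lines of Python. The function names `complete` and `signed_view` are hypothetical helpers chosen for this illustration and do not appear in the paper.

```python
def complete(b: int, p: int) -> int:
    """Subtract p repeatedly until the value lies in [0, p - 1]."""
    while b >= p:
        b -= p
    return b


def signed_view(a: int, p: int) -> int:
    """Read a completely reduced residue a as a signed number modulo p.

    Values up to (p - 1) / 2 are read as positive, larger ones as negative.
    """
    return a if a <= (p - 1) // 2 else a - p


print(complete(60, 11))     # 5
print(signed_view(6, 7))    # -1, because 3 - 4 is stored as 6 modulo 7
```

The stored value never becomes negative; `signed_view` only changes how a result is read.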

2.3 Representation example

We take the prime modulus as p = 11 = (1011) and the word-size as w = 3. Thus, we have k = 4 and $s = \lceil k/w \rceil = \lceil 4/3 \rceil = 2$, which gives $m = 2 \cdot 3 = 6$. The completely reduced set of numbers is $C = \{0, 1, \ldots, 9, 10\}$, while the incompletely reduced set contains the numbers ranging from 0 to $(2^m - 1) = (2^6 - 1) = 63$ as $I = \{0, 1, \ldots, 62, 63\}$. The incompletely reduced numbers occupy two words as $A = (A_1 A_0) = (a_5 a_4 a_3\ a_2 a_1 a_0)$. For example, the decimal number 44 is represented as (101 100) in binary or (5 4) in octal.

An incompletely reduced equivalent of A is given as $B = A + i \cdot p$, where $B \in [0, 63]$ and i is a non-negative integer. For example, if A = 5, then the incompletely reduced equivalents of A are given as {5, 16, 27, 38, 49, 60}. The incompletely reduced representation is a redundant representation: we use the notation $\bar{5} = \{5, 16, 27, 38, 49, 60\}$ to represent the residue class of 5. We will show in the following sections that this redundant representation provides more efficient arithmetic in GF(p).
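The representation example can be reproduced with a short Python sketch. The helpers `to_words` and `equivalents` are hypothetical names introduced here; words are stored least significant word first, which is an implementation choice of this sketch rather than something the paper prescribes.

```python
def to_words(x: int, w: int, s: int) -> list:
    """Split x into s words of w bits each, least significant word first."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(s)]


def equivalents(a: int, p: int, m: int) -> list:
    """All incompletely reduced representatives in [0, 2^m - 1] of the class of a."""
    return list(range(a % p, 1 << m, p))


print(to_words(44, w=3, s=2))       # [4, 5], i.e. (5 4) in octal, MSW written first
print(equivalents(5, p=11, m=6))    # [5, 16, 27, 38, 49, 60]
```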

3 Modular addition

Since we use the incompletely reduced numbers, our numbers are allowed to grow as large as $2^m - 1$. The incompletely reduced representation avoids unnecessary reduction operations. We add the input operands as $X := A + B \pmod p$. If the result does not exceed $2^m - 1$, then we do not perform reduction. This check is simple to perform: since m = sw, we are only checking to see if there is a carry-out from the MSW. We will use the notation

$$(c, S_i) := A_i + B_i + c \tag{1}$$

to denote the word-level addition operation which adds the 1-word numbers $A_i$ and $B_i$ and the 1-bit carry-in c, producing the outputs c and $S_i$ such that c is the 1-bit carry-out and $S_i$ is the 1-word sum. The addition algorithm is given as follows:

Modular addition algorithm

Inputs: A = (A_{s-1} ... A_1 A_0) and B = (B_{s-1} ... B_1 B_0)

Auxiliary: F = (F_{s-1} ... F_1 F_0)

Output: X = (X_{s-1} ... X_1 X_0)

step 1: c := 0

step 2: for i = 0 to s - 1

step 3:     (c, S_i) := A_i + B_i + c

step 4: if c = 0 then return X = (S_{s-1} ... S_1 S_0)

step 5: c := 0

step 6: for i = 0 to s - 1

step 7:     (c, T_i) := S_i + F_i + c

step 8: if c = 0 then return X = (T_{s-1} ... T_1 T_0)

step 9: c := 0

step 10: for i = 0 to s - 1

step 11:     (c, U_i) := T_i + F_i + c

step 12: return X = (U_{s-1} ... U_1 U_0)

If the carry-out from the MSW is zero, then the algorithm produces the correct result in step 4 as $X = S = (S_{s-1} \cdots S_1 S_0)$. If the carry-out is one, we first ignore the carry-out and then correct the result. By ignoring the carry-out, we are essentially performing the operation $S := S - 2^m$. Since we need to perform modulo p arithmetic, we are allowed only to add or subtract integer multiples of p; therefore, we need to correct the result as $T := (S - 2^m) + F$, where $F = (F_{s-1} \cdots F_1 F_0)$ is called the correction factor for addition and is defined as

$$F = 2^m - Ip \tag{2}$$

where I is the largest possible integer which brings F to the range $[1, p-1]$; in other words, $I = \lfloor 2^m/p \rfloor$. The number F is precomputed and saved. By performing the operation $T := (S - 2^m) + F$, we essentially perform a modulo p reduction as

$$T := (S - 2^m) + F = S - 2^m + 2^m - Ip = S - Ip \tag{3}$$

Thus, the result X = T will be correct as a modular number after step 8. However, there is a danger in performing the operation $T := S + F$ since this might also cause a carry-out from the MSW. The input operands A and B are arbitrary numbers which can be as large as $2^m - 1$, giving $S = 2^{m+1} - 2$. By ignoring the carry in step 3, we obtain $S = 2^m - 2$. Therefore, $T := S + F$ in step 7 may exceed $2^m$, and we need to correct the result one more time, which is accomplished in steps 9–11. This is the last correction we need to perform. There is no need for another correction after step 11, since the maximum value of U is strictly less than $2^m$:

$$U = (T - 2^m) + F = 2^m - 2 - 2^m + F = -2 + F \le -2 + p - 1 < 2^m \tag{4}$$
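The addition algorithm above can be sketched in Python as follows. This is an illustrative transcription, not the authors' C code: the helper names (`to_words`, `from_words`, `add_words`, `mod_add`) are hypothetical, words are stored least significant word first, and Python integers stand in for w-bit machine words.

```python
def to_words(x, w, s):
    """Split x into s words of w bits each, least significant word first."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(s)]


def from_words(words, w):
    """Inverse of to_words."""
    return sum(word << (w * i) for i, word in enumerate(words))


def add_words(x, y, w):
    """Apply (c, S_i) := X_i + Y_i + c to every word; return (carry_out, sum_words)."""
    mask = (1 << w) - 1
    c, out = 0, []
    for xi, yi in zip(x, y):
        t = xi + yi + c
        out.append(t & mask)
        c = t >> w
    return c, out


def mod_add(a, b, f, w):
    """Incompletely reduced addition of word arrays a and b with correction factor f."""
    c, s_words = add_words(a, b, w)          # steps 1-3
    if c == 0:
        return s_words                       # step 4
    c, t_words = add_words(s_words, f, w)    # steps 5-7, carry-out of S is dropped
    if c == 0:
        return t_words                       # step 8
    _, u_words = add_words(t_words, f, w)    # steps 9-11
    return u_words                           # step 12


p, w, s = 11, 3, 2
m = s * w
F = to_words((1 << m) % p, w, s)             # 2^m - floor(2^m / p) * p = 9
for x, y in [(26, 27), (37, 49), (61, 62)]:
    r = from_words(mod_add(to_words(x, w, s), to_words(y, w, s), F, w), w)
    print(x, y, r, r % p)
```

Only carry-outs of the most significant word are tested; no comparison against p is made anywhere, which is the point of the incompletely reduced representation.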

3.1 Addition examples

Let p = 11, k = 4, w = 3, m = 6 and s = 2. We also compute F as

$$F = 2^m - \lfloor 2^m/p \rfloor \cdot p = 64 - \lfloor 2^6/11 \rfloor \cdot 11 = 64 - 5 \cdot 11 = 9 \tag{5}$$

(i) We illustrate the addition of $\bar{4} = \{4, 15, 26, 37, 48, 59\}$ and $\bar{5} = \{5, 16, 27, 38, 49, 60\}$ and use the incompletely reduced numbers 26 and 27:

$$
\begin{aligned}
S &= 26 + 27 \\
  &= 53 && (c = 0,\ \text{return in step 4})
\end{aligned}
$$

The result is indeed correct since 53 is equivalent to $\bar{9} = \{9, 20, 31, 42, 53\}$.

(ii) We give an addition example where the first correction (steps 5–8) would be required. The addition of $\bar{4} = \{4, 15, 26, 37, 48, 59\}$ and $\bar{5} = \{5, 16, 27, 38, 49, 60\}$ using the incompletely reduced numbers 37 and 49 is such an example:

$$
\begin{aligned}
S &= 37 + 49 \\
  &= 86 && (c = 1,\ \text{step 4}) \\
  &= 86 - 64 && (\text{ignore carry, step 4}) \\
  &= 22 \\
T &= 22 + 9 && (\text{correction, steps 5–7}) \\
  &= 31 && (c = 0,\ \text{return in step 8})
\end{aligned}
$$

The result is indeed correct since 31 is equivalent to $\bar{9} = \{9, 20, 31, 42, 53\}$.

(iii) We give an addition example where the second correction (steps 9–11) would also be required. The addition of $\bar{6} = \{6, 17, 28, 39, 50, 61\}$ and $\bar{7} = \{7, 18, 29, 40, 51, 62\}$ using the incompletely reduced numbers 61 and 62 is such an example:

$$
\begin{aligned}
S &= 61 + 62 \\
  &= 123 && (c = 1,\ \text{step 4}) \\
  &= 123 - 64 && (\text{ignore carry, step 4}) \\
  &= 59 \\
T &= 59 + 9 && (\text{correction, steps 5–7}) \\
  &= 68 && (c = 1,\ \text{step 8}) \\
  &= 68 - 64 && (\text{ignore carry, step 8}) \\
  &= 4 \\
U &= 4 + 9 && (\text{correction, steps 9–11}) \\
  &= 13 && (\text{return in step 12})
\end{aligned}
$$

The result is indeed correct since 13 is equivalent to $\bar{2} = \{2, 13, 24, 35, 46, 57\}$.

4 Modular subtraction

The subtraction is performed using two's complement arithmetic. The input operands are least positive residues represented using the incompletely reduced representation. We will use the notation

$$(b, S_i) := A_i - B_i - b \tag{6}$$

to denote the word-level subtraction operation which subtracts the 1-word number $B_i$ and the 1-bit borrow-in b from the 1-word number $A_i$, producing the outputs, which are the 1-word number $S_i$ and the 1-bit borrow-out b. The field subtraction algorithm for computing $X = A - B \bmod p$ is given as follows:

Modular subtraction algorithm

Inputs: A = (A_{s-1} ... A_1 A_0) and B = (B_{s-1} ... B_1 B_0)

Auxiliary: G = (G_{s-1} ... G_1 G_0) and F = (F_{s-1} ... F_1 F_0)

Output: X = (X_{s-1} ... X_1 X_0)

step 1: b := 0

step 2: for i = 0 to s - 1

step 3:     (b, S_i) := A_i - B_i - b

step 4: if b = 0 then return X = (S_{s-1} ... S_1 S_0)

step 5: c := 0

step 6: for i = 0 to s - 1

step 7:     (c, T_i) := S_i + G_i + c

step 8: if c = 0 then return X = (T_{s-1} ... T_1 T_0)

step 9: c := 0

step 10: for i = 0 to s - 1

step 11:     (c, U_i) := T_i + F_i + c

step 12: return X = (U_{s-1} ... U_1 U_0)

If b = 0 in step 4, then the result is positive, and it is a properly reduced modular number. If b = 1, then the result is negative; we obtain the two's complement result, i.e. we essentially compute

$$S := A - B = A + 2^m - B \tag{7}$$

The result S is in the range $[0, 2^m - 1]$, as required. However, it is incorrectly reduced, i.e. $2^m$ is added, and thus we need to correct the result by adding $G = (G_{s-1} \cdots G_1 G_0)$, which is the correction factor for subtraction, defined as

$$G = Jp - 2^m \tag{8}$$

where J is the smallest integer which brings G to the range $[1, p-1]$, i.e. $J = \lceil 2^m/p \rceil$. It is easily proven that the sum of the correction factors for addition and subtraction, i.e. the sum of F and G, is equal to p since

$$F + G = 2^m - Ip + Jp - 2^m = (J - I)p = (\lceil 2^m/p \rceil - \lfloor 2^m/p \rfloor)p = p \tag{9}$$

In other words, we have $G = p - F$ or $F = p - G$. The result S is corrected to obtain T in steps 5–8. After the correction of S in step 8, we obtain

$$T = S + G = A + 2^m - B + Jp - 2^m = A - B + Jp \tag{10}$$

Similar to the addition algorithm in step 8, this correction might cause a carry from the MSW, requiring another correction which we need to take care of using F. This is accomplished in steps 9–11. There is no need for another correction after step 12, since the maximum value $S = 2^m - 1$ gives

$$U \le (2^m - 1) + G - 2^m + F = -1 + p < 2^m \tag{11}$$
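A Python sketch of the subtraction algorithm, under the same assumptions as before (hypothetical helper names, least significant word first, Python integers standing in for w-bit words), might look as follows. The borrow is detected from the sign of the intermediate difference, and masking with $2^w - 1$ yields the two's complement word.

```python
def to_words(x, w, s):
    """Split x into s words of w bits each, least significant word first."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(s)]


def from_words(words, w):
    """Inverse of to_words."""
    return sum(word << (w * i) for i, word in enumerate(words))


def add_words(x, y, w):
    """Word-level addition with carry propagation; returns (carry_out, words)."""
    mask = (1 << w) - 1
    c, out = 0, []
    for xi, yi in zip(x, y):
        t = xi + yi + c
        out.append(t & mask)
        c = t >> w
    return c, out


def sub_words(x, y, w):
    """Apply (b, S_i) := X_i - Y_i - b to every word; returns (borrow_out, words)."""
    mask = (1 << w) - 1
    b, out = 0, []
    for xi, yi in zip(x, y):
        t = xi - yi - b
        out.append(t & mask)        # two's complement wrap-around of one word
        b = 1 if t < 0 else 0
    return b, out


def mod_sub(a, b, f, g, w):
    """Incompletely reduced subtraction with correction factors f and g."""
    borrow, s_words = sub_words(a, b, w)     # steps 1-3
    if borrow == 0:
        return s_words                       # step 4
    c, t_words = add_words(s_words, g, w)    # steps 5-7
    if c == 0:
        return t_words                       # step 8
    _, u_words = add_words(t_words, f, w)    # steps 9-11
    return u_words                           # step 12


p, w, s = 11, 3, 2
m = s * w
F = to_words((1 << m) % p, w, s)             # 9
G = to_words((-(1 << m)) % p, w, s)          # ceil(2^m / p) * p - 2^m = 2
for x, y in [(49, 29), (16, 40), (49, 50)]:
    r = from_words(mod_sub(to_words(x, w, s), to_words(y, w, s), F, G, w), w)
    print(x, y, r, r % p)
```

Computing G as `(-(1 << m)) % p` relies on $2^m$ not being a multiple of p, which always holds for an odd prime p.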

4.1 Subtraction examples

Let p = 11, k = 4, w = 3, m = 6 and s = 2. We also compute G as

$$G = \lceil 2^m/p \rceil \cdot p - 2^m = \lceil 2^6/11 \rceil \cdot 11 - 64 = 6 \cdot 11 - 64 = 2 \tag{12}$$

Since $F + G = p$, we could have also obtained G using $G = p - F = 11 - 9 = 2$.

(i) We illustrate the subtraction operation $S := 5 - 7$, where $\bar{5} = \{5, 16, 27, 38, 49, 60\}$ and $\bar{7} = \{7, 18, 29, 40, 51, 62\}$, using the incompletely reduced equivalents 49 and 29.

$$
\begin{aligned}
S &= 49 - 29 \\
  &= 20 && (b = 0,\ \text{return in step 4})
\end{aligned}
$$

The result is indeed correct since 20 is an incompletely reduced number representing the reduced number $\bar{9} = \{9, 20, 31, 42, 53\}$, which is equal to $5 - 7 = -2 = 9 \bmod 11$.

(ii) On the other hand, the same subtraction operation $S := 5 - 7$ using the incompletely reduced equivalents 16 and 40 is performed as

$$
\begin{aligned}
S &= 16 - 40 \\
  &= -24 && (b = 1,\ \text{step 4}) \\
  &= 64 - 24 && (\text{two's complement, step 4}) \\
  &= 40 \\
T &= 40 + 2 && (\text{correction, steps 5–8}) \\
  &= 42 && (c = 0,\ \text{return in step 8})
\end{aligned}
$$

The incompletely reduced number 42 is also the correct result since it represents $\bar{9} = \{9, 20, 31, 42, 53\}$.

(iii) We now give a subtraction example where the second correction (steps 9–12) would be required. The subtraction operation $5 - 6$, where $\bar{5} = \{5, 16, 27, 38, 49, 60\}$ and $\bar{6} = \{6, 17, 28, 39, 50, 61\}$, using the incompletely reduced numbers 49 and 50 is such an example:

$$
\begin{aligned}
S &= 49 - 50 \\
  &= -1 && (b = 1,\ \text{step 4}) \\
  &= 64 - 1 && (\text{two's complement, step 4}) \\
  &= 63 \\
T &= 63 + 2 && (\text{correction, steps 5–8}) \\
  &= 65 && (c = 1,\ \text{step 8}) \\
  &= 65 - 64 && (\text{ignore carry, step 8}) \\
  &= 1 \\
U &= 1 + 9 && (\text{correction, steps 9–11}) \\
  &= 10 && (\text{return in step 12})
\end{aligned}
$$

The result is indeed correct since 10 is equal to $(-1) \bmod 11$.

5 Montgomery modular multiplication

The modular multiplication operation multiplies the input operands A and B and reduces the product modulo p, i.e. it computes $C := AB \pmod p$. The reduction operation often requires bit-level shift-subtract operations [7]. We are interested in algorithms which require word-level operations. Instead of the classical modular multiplication operation, we prefer to use the Montgomery modular multiplication [8], which computes

$$T := ABR^{-1} \bmod p \tag{13}$$

where R is an integer with the property gcd(R, p) = 1. The selection of R is very important since it determines the algorithmic details and the speed of the Montgomery multiplication. Generally R is selected as the smallest power of 2 which is larger than p, i.e. $R = 2^k$, where $k = \lceil \log_2 p \rceil$. Thus, we have $1 < p < R$; however, $2p > R$. If k is not an integer multiple of the word-length w, this selection requires that we perform bit-level operations. Following the general premise of this paper, we propose the use of $R = 2^m$, where m = sw, to avoid performing bit-level operations. In this case, $2^m$ may be several times larger than p, as was also the case for the addition and subtraction algorithms presented in the previous sections.

We propose to use the Montgomery multiplication algorithm for incompletely reduced numbers, which receives two numbers A and B in the range $[0, 2^m - 1]$, and computes the result T, which is also an incompletely reduced number in the range $[0, 2^m - 1]$, given by (13). In the high-level view, the Montgomery multiplication algorithm computes the result T using

$$T = \frac{AB + p\,(ABp' \bmod R)}{R} \tag{14}$$

where $p'$ is defined using the multiplicative inverse $R^{-1} \bmod p$ as

$$RR^{-1} - pp' = 1 \tag{15}$$

and computed using the extended Euclidean algorithm. The algorithm receives the inputs $A, B \in [0, R-1]$ and computes the result T in (14). Since $A, B < R$, the result of the operation in (14) will have the maximum value

$$\frac{(R-1)(R-1) + p(R-1)}{R} = \frac{(R-1)(R-1+p)}{R} < R - 1 + p \tag{16}$$

In other words, T as computed by (14) exceeds R only by an additive factor of p; therefore we need to perform only a single subtraction to bring it back to the range $[0, R-1]$.
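The high-level formula (14) can be checked directly with Python's arbitrary-precision integers. This sketch is an illustration only: the function name `montgomery_product` is hypothetical, and $p'$ is obtained with the built-in `pow(x, -1, n)` (available from Python 3.8) instead of an explicit extended Euclidean algorithm.

```python
def montgomery_product(a: int, b: int, p: int, m: int) -> int:
    """Return an incompletely reduced T with T = a * b * R^(-1) (mod p), R = 2^m."""
    r = 1 << m
    p_prime = pow(-p, -1, r)          # p * p' = -1 (mod R), consistent with (15)
    q = (a * b * p_prime) % r
    t = (a * b + p * q) // r          # the numerator is a multiple of R
    if t >= r:                        # by (16), one subtraction of p is enough
        t -= p
    return t


# Values from the multiplication examples of Section 5.1 (p = 53, R = 64).
print(montgomery_product(58, 60, p=53, m=6))   # 61
print(montgomery_product(61, 63, p=53, m=6))   # 41
```

The result is only guaranteed to lie in $[0, 2^m - 1]$, not in $[0, p-1]$; 61 is the incompletely reduced form of 8 modulo 53.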


The word-level description of the Montgomery multiplication involves a word-level multiplication operation, which we denote as

$$(c, T_j) := T_j + A_i \cdot B_j + c \tag{17}$$

in which the new value of $T_j$ and the new carry word c are computed using the old value of $T_j$ and also the 1-word operands $A_i$ and $B_j$, and the old carry word c. Here, all operands $A_i, B_j, T_j, c \in [0, 2^w - 1]$, i.e. they are all 1-word numbers. Since we have

$$(2^w - 1) + (2^w - 1) \cdot (2^w - 1) + (2^w - 1) = (2^w - 1)(2^w + 1) = 2^{2w} - 1 \tag{18}$$

the result of the operation in (17) is a 2-word number represented using the 1-word numbers $T_j$ and c.

The details of different Montgomery multiplication algorithms can be found in [9]. Here we describe an algorithm which computes T using the least significant word of the number $p'$ defined in (15). Since $R = 2^{sw}$, we can reduce (15) mod $2^w$, and obtain

$$-pp' = 1 \bmod 2^w \tag{19}$$

Let $P_0$ and $Q_0$ be the LSW of p and $p'$, respectively. Then, $Q_0$ is the negative of the multiplicative inverse of the LSW of p mod $2^w$, i.e.

$$Q_0 = -P_0^{-1} \bmod 2^w \tag{20}$$

This 1-word number can be computed very quickly using a variation of the extended Euclidean algorithm given in [10]. The Montgomery multiplication algorithm for computing $T = AB2^{-m} \pmod p$ using $Q_0$ is given below.
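One way to obtain $Q_0$ in software is sketched below. Note that this is not the extended Euclidean variant of [10]; it swaps in the well-known Newton (Hensel lifting) iteration $x \leftarrow x(2 - P_0 x) \bmod 2^w$, which doubles the number of correct low-order bits of the inverse in each pass. The function name `neg_inverse_word` is hypothetical.

```python
def neg_inverse_word(p0: int, w: int) -> int:
    """Return Q0 = -p0^(-1) mod 2^w for an odd word p0."""
    if p0 % 2 == 0:
        raise ValueError("p0 must be odd to be invertible modulo 2^w")
    mask = (1 << w) - 1
    x, correct_bits = 1, 1            # x = 1 inverts any odd p0 modulo 2
    while correct_bits < w:
        x = (x * (2 - p0 * x)) & mask # now correct modulo 2^(2 * correct_bits)
        correct_bits *= 2
    return (-x) & mask


print(neg_inverse_word(5, 3))         # 3, since 5 * 3 = 15 = -1 (mod 8)
```

For w = 32 the loop runs five times, each pass costing two word multiplications.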

Montgomery modular multiplication algorithm

Inputs: A = (A_{s-1} ... A_1 A_0) and B = (B_{s-1} ... B_1 B_0)

Auxiliary: Q_0 and p = (P_{s-1} ... P_1 P_0)

Output: T = (T_{s-1} ... T_1 T_0)

step 1: for j = 0 to s - 1

step 2:     T_j := 0

step 3: for i = 0 to s - 1

step 4:     c := 0

step 5:     for j = 0 to s - 1

step 6:         (c, T_j) := T_j + A_i · B_j + c

step 7:     T_s := c

step 8:     M := T_0 · Q_0 mod 2^w

step 9:     c := (T_0 + M · P_0) / 2^w

step 10:     for j = 1 to s - 1

step 11:         (c, T_{j-1}) := T_j + M · P_j + c

step 12:     (c, T_{s-1}) := T_s + c

step 13: if c = 0 return T = (T_{s-1} ... T_1 T_0)

step 14: b := 0

step 15: for j = 0 to s - 1

step 16:     (b, T_j) := T_j - P_j - b

step 17: return T = (T_{s-1} ... T_1 T_0)
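A word-level Python sketch in the spirit of the listing is given below. It is not a literal transcription: as an assumption added here, it keeps one extra carry word (`t[s + 1]`) between outer iterations, a common arrangement in word-serial Montgomery code, so that a carry out of step 12 in an intermediate iteration is carried forward and not reset in step 4. All names are hypothetical, and words are stored least significant word first.

```python
def to_words(x, w, s):
    """Split x into s words of w bits each, least significant word first."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(s)]


def from_words(words, w):
    """Inverse of to_words."""
    return sum(word << (w * i) for i, word in enumerate(words))


def mont_mul_words(a, b, p, q0, w):
    """Return T = a * b * 2^(-s*w) mod p as s words in [0, 2^(s*w) - 1]."""
    s = len(p)
    mask = (1 << w) - 1
    t = [0] * (s + 2)                      # steps 1-2, plus two carry words
    for i in range(s):                     # step 3
        c = 0                              # step 4
        for j in range(s):                 # steps 5-6: T := T + A_i * B
            v = t[j] + a[i] * b[j] + c
            t[j], c = v & mask, v >> w
        v = t[s] + c                       # step 7, keeping any earlier carry
        t[s], t[s + 1] = v & mask, v >> w
        m_ = (t[0] * q0) & mask            # step 8
        c = (t[0] + m_ * p[0]) >> w        # step 9: low word becomes zero
        for j in range(1, s):              # steps 10-11: add M * p, shift one word
            v = t[j] + m_ * p[j] + c
            t[j - 1], c = v & mask, v >> w
        v = t[s] + c                       # step 12
        t[s - 1], c = v & mask, v >> w
        t[s] = t[s + 1] + c                # carry beyond s words (0 or 1)
    if t[s] == 0:                          # step 13
        return t[:s]
    borrow = 0                             # step 14
    for j in range(s):                     # steps 15-16: T := T - p
        v = t[j] - p[j] - borrow
        t[j], borrow = v & mask, int(v < 0)
    return t[:s]                           # step 17


w, s, p = 3, 2, 53
P = to_words(p, w, s)
Q0 = 3                                     # -5^(-1) mod 8
for x, y in [(58, 60), (61, 63)]:
    r = mont_mul_words(to_words(x, w, s), to_words(y, w, s), P, Q0, w)
    print(x, y, from_words(r, w))
```

On the two worked examples of Section 5.1 the sketch returns 61 and 41. The extra carry word matters only for inputs whose intermediate value needs more than s words after an inner reduction, for example A = B = 63 with p = 53.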

The explanations and proofs of the steps are given below:

(i) In steps 1 and 2, we clear the words of the result T to zero. The final result $T = AB2^{-m} \pmod p$ will be located in the s-word T at the end of the computation.

(ii) The first part of the multiplication loop (steps 3–7) computes a partial product T of length s + 1 words. For i = 0, this value is given as

$$T := A_0 \cdot B$$

Since $A_0 \in [0, 2^w - 1]$ and $B \in [0, 2^m - 1]$, this value is

$$T \le (2^w - 1)(2^m - 1) = (2^w - 1)(2^{sw} - 1) < 2^{(s+1)w}$$

and therefore fits in s + 1 words.

(iii) In steps 8–12, we are reducing T mod p in such a way that it is now of length s words at the end of step 12. This is accomplished using the following substeps:

(a) First, in step 8, we multiply the LSW of T by $Q_0$ modulo $2^w$. Recall that $Q_0$ is the LSW of $p'$ or it is equal to $-P_0^{-1} \bmod 2^w$. Thus, M (which is a 1-word number) is given as

$$M := T_0 \cdot Q_0 = T_0 \cdot (-P_0^{-1}) = -T_0 P_0^{-1} \bmod 2^w$$

(b) Then, in step 9, we compute $T_0 + M \cdot P_0$, which is equal to

$$X := T_0 + M \cdot P_0 = T_0 + (-T_0 P_0^{-1}) P_0$$

Note that X is a 2-word number; however, the LSW of X is zero since

$$T_0 + (-T_0 P_0^{-1}) P_0 = 0 \bmod 2^w$$

Therefore, after the division by $2^w$ in step 9, we obtain the 1-word carry c from the computation $T_0 + M \cdot P_0$.

(c) Then, in the remaining steps, i.e. in steps 10–12, we are finishing the computation of $T + M \cdot P$. Since the LSW of the result is zero, we are also shifting the result by 1 word to the right (towards the least significant word) to obtain the s-word number given by (14).

(iv) According to (16), the result computed at the end of step 12 can exceed $R - 1$ by at most p, and thus a single subtraction will bring it back to the range $[0, R-1]$. In step 13, we check if the carry computed at the end of step 12 is 1, i.e. if T exceeds $R - 1$. If there is no carry, then we return the result T in step 13 as the final product.

(v) Otherwise, we perform a simple subtraction $T := T - p$ to bring T back to the range $[0, R-1]$. The subtraction operation is accomplished in steps 14–16, and the final product value is returned in step 17.

Therefore, the Montgomery modular multiplication works even if the radix $R = 2^{sw}$ is much larger than p, i.e. it need not be the smallest number of the form $2^i$ which is larger than p. While there may be several correction steps needed in the addition and subtraction operations, a single subtraction operation is sufficient for computing the Montgomery product $T = AB2^{-sw} \bmod p$.

The complete Montgomery multiplication and the incomplete Montgomery multiplication algorithms differ only slightly from one another. Their main differences are in the way the input and output operands are specified:

(i) The radix R in the complete Montgomery multiplication algorithm is taken as $2^k$, while the incomplete Montgomery multiplication algorithm uses the value $2^{sw}$, therefore avoiding bit-level operations if k is not an integer multiple of w.

(ii) The complete Montgomery multiplication algorithm requires that input operands be complete, i.e. numbers in the range $[0, p-1]$, while the incomplete Montgomery multiplication algorithm requires that input operands be in the range $[0, 2^m - 1]$.

(iii) The complete Montgomery multiplication algorithm computes the final result as a complete number, i.e. a number in the range $[0, p-1]$, while the incomplete Montgomery multiplication algorithm computes the result in the range $[0, 2^m - 1]$.


5.1 Multiplication examples

Let p = 53, k = 6, w = 3, m = 6 and s = 2. Since p = 53 = (110 101) and $P_0$ = (101) = 5, we compute $Q_0 = -P_0^{-1} \bmod 2^w$ as

$$Q_0 = -5^{-1} \bmod 8 = -5 \bmod 8 = 3$$

Also, we have $R = 2^m = 2^6 = 64$. Here are two multiplication examples:

(i) We illustrate the multiplication of $\bar{5} = \{5, 58\}$ and $\bar{7} = \{7, 60\}$ using the incompletely reduced numbers 58 and 60. Taking A = 58 = (111 010) and B = 60 = (111 100), we compute $T = A \cdot B \cdot R^{-1} \bmod p$ as follows:

step 3: i = 0

steps 4, 5, 6 and j = 0: (c, T_0) := A_0 · B_0 = 2 · 4 = 8 = (001 000)

steps 5, 6 and j = 1: (c, T_1) := A_0 · B_1 + c = 2 · 7 + 1 = 15 = (001 111)

step 7: T_2 = c = 1; therefore, we have T = (001 111 000)

step 8: M = T_0 · Q_0 = 0 · 3 mod 8 = 0

step 9: c = (T_0 + M · P_0)/8 = (0 + 0 · 5)/8 = 0

steps 10, 11 and j = 1: (c, T_0) = T_1 + M · P_1 + c = 7 + 0 · 6 + 0 = 7 = (000 111)

step 12: (c, T_1) = T_2 + c = 1 + 0 = 1 = (000 001); we now have T = (001 111)

step 3: i = 1

steps 4, 5, 6 and j = 0: (c, T_0) := T_0 + A_1 · B_0 = 7 + 7 · 4 = 35 = (100 011)

steps 5, 6 and j = 1: (c, T_1) := T_1 + A_1 · B_1 + c = 1 + 7 · 7 + 4 = 54 = (110 110)

step 7: T_2 = c = 6; therefore, we have T = (110 110 011)

step 8: M = T_0 · Q_0 = 3 · 3 mod 8 = 1

step 9: c = (T_0 + M · P_0)/8 = (3 + 1 · 5)/8 = 1

steps 10, 11 and j = 1: (c, T_0) = T_1 + M · P_1 + c = 6 + 1 · 6 + 1 = 13 = (001 101)

step 12: (c, T_1) = T_2 + c = 6 + 1 = 7 = (000 111); we now have T = (111 101)

step 13: since c = 0, the result T = (111 101) is returned.

The final result is T = (111 101) = 61, which is an incomplete number. The complete equivalent of 61 is 8, which is equal to $5 \cdot 7 \cdot 64^{-1} \bmod 53$.

(ii) We now illustrate the multiplication of $\bar{8} = \{8, 61\}$ and $\bar{10} = \{10, 63\}$ using the incompletely reduced numbers 61 and 63. Taking A = 61 = (111 101) and B = 63 = (111 111), we compute $T = A \cdot B \cdot R^{-1} \bmod p$ below. In this multiplication, the subtraction steps (steps 14–17) are performed.

step 3: i = 0

steps 4, 5, 6 and j = 0: (c, T_0) := A_0 · B_0 = 5 · 7 = 35 = (100 011)

steps 5, 6 and j = 1: (c, T_1) := A_0 · B_1 + c = 5 · 7 + 4 = 39 = (100 111)

step 7: T_2 = c = 4; therefore, we have T = (100 111 011)

step 8: M = T_0 · Q_0 = 3 · 3 mod 8 = 1

step 9: c = (T_0 + M · P_0)/8 = (3 + 1 · 5)/8 = 1

steps 10, 11 and j = 1: (c, T_0) = T_1 + M · P_1 + c = 7 + 1 · 6 + 1 = 14 = (001 110)

step 12: (c, T_1) = T_2 + c = 4 + 1 = 5 = (000 101); we now have T = (101 110)

step 3: i = 1

steps 4, 5, 6 and j = 0: (c, T_0) := T_0 + A_1 · B_0 = 6 + 7 · 7 = 55 = (110 111)

steps 5, 6 and j = 1: (c, T_1) := T_1 + A_1 · B_1 + c = 5 + 7 · 7 + 6 = 60 = (111 100)

step 7: T_2 = c = 7; therefore, we have T = (111 100 111)

step 8: M = T_0 · Q_0 = 7 · 3 mod 8 = 5

step 9: c = (T_0 + M · P_0)/8 = (7 + 5 · 5)/8 = 4

steps 10, 11 and j = 1: (c, T_0) = T_1 + M · P_1 + c = 4 + 5 · 6 + 4 = 38 = (100 110)

step 12: (c, T_1) = T_2 + c = 7 + 4 = 11 = (001 011)

step 13: since c = 1, we execute the subtraction steps below

step 14: b = 0

steps 15, 16 and j = 0: (b, T_0) = T_0 - P_0 - b = 6 - 5 - 0 = 1 = (000 001)

steps 15, 16 and j = 1: (b, T_1) = T_1 - P_1 - b = 3 - 6 - 0 = -3 mod 8 = 5 = (000 101)

step 17: T = (101 001) = 41 is returned.

The final result is T = (101 001) = 41, which is a complete number. The final result, 41, is equal to $8 \cdot 10 \cdot 64^{-1} \bmod 53$.

6 Implementation results and conclusions

To measure the speed of the incomplete addition, subtrac-tion and Montgomery multiplication algorithms togetherwith their complete counterparts, we have implemented theECDSA [5, 11] over the finite field GF(p). The objectivewas to assess the performance impact of the incompletearithmetic as compared to the complete arithmetic on theECDSA. We executed the ECDSA code several hundredtimes using two different random elliptic curve sets. Theaddition, subtraction and multiplication timings shown inTable 1 are obtained by running the ECDSA code andmeasuring the timings of the arithmetic operations in thatcontext. The computer platform was a 450 MHz Pentium IIPC running the Windows NT 4.0 operating system, with256 Mbytes of memory. The code is written entirely in C.The timings of the complete and incomplete routines aretabulated in Table 1 in microseconds.

The speedup in percent is obtained by subtracting the incomplete timing from the complete timing and dividing the difference by the complete timing. As can be seen from Table 1, the incomplete addition is 34–43% faster than the complete addition for k from 161 to 256. Similarly, the incomplete subtraction is 17–25% faster than the complete subtraction. The speedup for subtraction is smaller than for addition mainly because subtraction requires more correction steps than addition. In contrast, we obtain only a small speedup (3–5%) for the incomplete Montgomery multiplication, since the incomplete and complete Montgomery multiplication algorithms differ only slightly.
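As a quick check of the speedup definition, the following sketch recomputes the percentage columns for the k = 161 row of Table 1. The function name `speedup_percent` and the dictionary layout are illustrative choices, not part of the original implementation.

```python
def speedup_percent(complete_us, incomplete_us):
    """Time saved by the incomplete routine, as a percentage of the complete timing."""
    return 100.0 * (complete_us - incomplete_us) / complete_us


# (complete, incomplete) timings in microseconds for k = 161, taken from Table 1
row_161 = {
    "addition": (1.85, 1.11),
    "subtraction": (1.43, 1.10),
    "multiplication": (4.80, 4.58),
}

for operation, (complete, incomplete) in row_161.items():
    print(operation, round(speedup_percent(complete, incomplete)))
```

Rounded to whole percent, the three values agree with the 40, 23 and 5 listed for k = 161.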

The timing results of the ECDSA signature generation algorithm are given in Table 2 in milliseconds. These ECDSA timings are obtained without using any precomputation. The ECDSA code was executed several hundred times using two different random elliptic curve sets for the bit lengths specified in Table 2. These random elliptic curves are not special in any sense; the curve parameters are randomly selected from the field GF(p), i.e. they are full-size numbers. The implementation results show that the ECDSA algorithm can be made to execute 10–13% faster using the incomplete modular arithmetic, which is significant. The point multiplication, i.e. the computation of dP in ECDSA, where d is an integer and P is a point on the curve, is computed using the addition–subtraction elliptic scalar multiplication [11]. Furthermore, we use the projective co-ordinate system to represent the points on the curve.

Table 1: Incomplete and complete arithmetic timings in microseconds (% is the speedup of the incomplete routine)

       Addition                     Subtraction                  Multiplication
k      Complete  Incomplete  %      Complete  Incomplete  %      Complete  Incomplete  %
161    1.85      1.11        40     1.43      1.10        23     4.80      4.58        5
176    1.90      1.11        42     1.38      1.10        20     4.74      4.57        4
192    2.00      1.26        37     1.38      1.04        25     4.79      4.62        4
193    1.98      1.23        38     1.47      1.20        18     6.36      6.17        3
208    2.14      1.22        43     1.46      1.19        18     6.40      6.13        4
224    2.03      1.28        37     1.45      1.16        20     6.35      6.17        3
225    2.20      1.30        41     1.58      1.29        18     8.06      7.73        4
240    2.23      1.32        41     1.53      1.27        17     8.03      7.74        4
256    2.31      1.52        34     1.53      1.27        17     8.02      7.76        3

According to the IEEE Standard [11], an elliptic curve point doubling requires 9 field additions, 3 field subtractions and 10 field multiplications, while an elliptic curve point addition requires 3 field additions, 5 field subtractions and 11 field multiplications. As an example of performance estimation, let us take the scalar multiplication operation dP in which the integer d is 176 bits. The addition–subtraction elliptic curve scalar multiplication then requires approximately 176/3 ≈ 59 point additions and 175 point doublings. Therefore, the computation of dP would require approximately (175 × 9 + 59 × 3) = 1752 field additions, (175 × 3 + 59 × 5) = 820 field subtractions and (175 × 10 + 59 × 11) = 2399 field multiplications. Using the 176-bit timings from Table 1, this gives the total time for the computation of the elliptic curve scalar multiplication using the complete and incomplete arithmetic as

complete = (1752 × 1.90 + 820 × 1.38 + 2399 × 4.74) µs ≈ 15.83 ms

incomplete = (1752 × 1.11 + 820 × 1.10 + 2399 × 4.57) µs ≈ 13.81 ms

Thus, this rough estimate shows that the point multiplication based on incomplete arithmetic is approximately 15.83/13.81 ≈ 1.15 times faster than its complete arithmetic version. This means that the incomplete arithmetic for 176 bits would be about 15% faster than the complete arithmetic for ECDSA; our actual timing result in Table 2 gives this as 13%, which is very close.

Table 2: ECDSA over GF(p) signature generation timings in milliseconds

       C code only                    C + assembly
k      Complete  Incomplete  %        Incomplete
161    13.6      12.0        12       5.3
176    14.8      12.9        13       5.8
192    16.5      14.7        11       6.6
193    20.8      18.4        12       8.5
208    22.6      19.7        13       9.1
224    23.7      21.1        11       9.7
225    29.8      26.5        11       12.2
240    31.1      27.9        10       12.8
256    34.2      30.8        10       14.0
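The rough estimate for a 176-bit scalar can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the text: the variable names are illustrative, the per-operation costs are the ones quoted from [11], and the timings are the k = 176 row of Table 1.

```python
d_bits = 176
point_additions = round(d_bits / 3)   # about 59, as in the text
point_doublings = d_bits - 1          # 175

# Field operations per elliptic curve operation, as quoted from [11]
DOUBLING = {"add": 9, "sub": 3, "mul": 10}
ADDITION = {"add": 3, "sub": 5, "mul": 11}

field_ops = {
    op: point_doublings * DOUBLING[op] + point_additions * ADDITION[op]
    for op in DOUBLING
}

# Timings in microseconds for k = 176, from Table 1
complete_us = {"add": 1.90, "sub": 1.38, "mul": 4.74}
incomplete_us = {"add": 1.11, "sub": 1.10, "mul": 4.57}


def total_ms(timings_us):
    """Total time of all field operations in milliseconds."""
    return sum(field_ops[op] * timings_us[op] for op in field_ops) / 1000.0


complete_ms = total_ms(complete_us)
incomplete_ms = total_ms(incomplete_us)
ratio = complete_ms / incomplete_ms
saving_percent = 100.0 * (complete_ms - incomplete_ms) / complete_ms

print(field_ops)
print(round(complete_ms, 2), round(incomplete_ms, 2))
print(round(ratio, 2), round(saving_percent, 1))
```

The ratio of about 1.15 is the "15% faster" figure. Expressed with the Table 1 definition of speedup (time saved divided by the complete timing), the same estimate is about 12.8%, which lines up with the 13% measured for k = 176 in Table 2.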

7 Conclusions

In this paper we have presented a new methodology for speeding up arithmetic operations, and shown that the new method provides up to 13% speedup in the execution of the ECDSA algorithm over the field GF(p) for the length of p in the range 161 ≤ k ≤ 256. Coupled with some machine-level programming, the ECDSA algorithm can be made significantly faster, as shown in the last column of Table 2.

Here we have applied the incomplete arithmetic only to the modular addition, the modular subtraction and the Montgomery multiplication operations. A similar word-level methodology was described in [12] for the modular inverse and Montgomery modular inverse operations. These algorithms are also word-level, i.e. they avoid bit-level operations.

8 Acknowledgments

This research is supported in part by rTrust Technologies. The reader should note that Oregon State University has filed US and International patent applications for inventions described in this paper.

9 References

1 QUISQUATER, J.-J., and COUVREUR, C.: ‘Fast decipherment algorithm for RSA public-key cryptosystem’, Electron. Lett., 1982, 18, (21), pp. 905–907

2 DIFFIE, W., and HELLMAN, M.E.: ‘New directions in cryptography’, IEEE Trans. Inf. Theory, 1976, 22, pp. 644–654

3 KOBLITZ, N.: ‘Elliptic curve cryptosystems’, Math. Comput., 1987, 48, (177), pp. 203–209

4 MENEZES, A.J.: ‘Elliptic curve public key cryptosystems’ (Kluwer Academic Publishers, Boston, MA, 1993)

5 National Institute for Standards and Technology. Digital Signature Standard (DSS). FIPS PUB 186–2, January 2000

6 CRANDALL, R.E.: ‘Method and apparatus for public key exchange in a cryptographic system’. US Patent Numbers 5,463,690, 5,271,061 and 5,159,632, October 1995

7 KOÇ, Ç.K.: ‘High-speed RSA implementation’. Technical Report TR 201, RSA Laboratories, November 1994

8 MONTGOMERY, P.L.: ‘Modular multiplication without trial division’, Math. Comput., 1985, 44, (170), pp. 519–521

9 KOÇ, Ç.K., ACAR, T., and KALISKI, B.S. Jr.: ‘Analyzing and comparing Montgomery multiplication algorithms’, IEEE Micro, 1996, 16, (3), pp. 26–33

10 DUSSE, S.R., and KALISKI, B.S. Jr.: ‘A cryptographic library for the Motorola DSP56000’ in DAMGARD, I.B. (Ed.): ‘Advances in cryptology – EUROCRYPT 90’, Lecture Notes in Computer Science, No. 473 (Springer, Berlin, 1990), pp. 230–244

11 IEEE. P1363: ‘Standard specifications for public-key cryptography’. Draft Version 13, 12 November 1999

12 SAVAŞ, E., and KOÇ, Ç.K.: ‘The Montgomery modular inverse – revisited’, IEEE Trans. Comput., 2000, 49, (7), pp. 763–766

IEE Proc.-Comput. Digit. Tech., Vol. 149, No. 2, March 2002