time-memorytrade-oﬀsforparallelcollision … · factoring and discrete logs, ... mapping f. the...

Time-Memory Trade-offs for Parallel CollisionSearch Algorithms

Monika Trimoska and Sorina Ionica and Gilles Dequen

Laboratoire MIS, Université de Picardie Jules Verne33 Rue Saint Leu Amiens 80039 - France

Abstract. Parallel versions of collision search algorithms require a sig-nificant amount of memory to store a proportion of the points computedby the pseudo-random walks. Implementations available in the literatureuse a hash table to store these points and allow fast memory access.We provide theoretical evidence that memory is an important factor indetermining the runtime of this method. We propose to replace the tra-ditional hash table by a simple structure, inspired by radix trees, whichsaves space and provides fast look-up and insertion. In the case of many-collision search algorithms, our variant has a constant-factor improvedruntime. We give benchmarks that evaluate the linear parallel perfor-mance of the attack on ECDLP.

Keywords: discrete logarithm, parallelism, collision, elliptic curves, meet-in-the-middle, attack, trade off, radix tree

1 Introduction

Given a function f : S → S on a finite set S, we call collision any pair a, bof elements in S such that f(a) = f(b). Collision search has a broad range ofapplications in the cryptanalysis of both symmetric and asymmetric ciphers:computing discrete logarithms, finding collisions on hash functions and meet-in-the-middle attacks. The Pollard’s rho method [14], initially proposed for solvingfactoring and discrete logs, can be adapted to find collisions for any randommapping f . The parallel collision search algorithm, proposed by van Oorschotand Wiener [18], builds on Pollard’s rho method, and is expected to have alinear speedup compared to the sequential version. This algorithm computesseveral walks in parallel and stores some of these points, called distinguishedpoints, within the shared memory of an SMP system.

In this paper, we revisit the memory complexity of the parallel collisionsearch algorithm, both for applications which need a small number of collisions(i.e. discrete logs) and those needing a large number of collisions, such as meet-in-middle attacks. In the case of discrete logarithms, collision search methodsare the fastest known attacks in a generic group. In elliptic curve cryptogra-phy, subexponential attacks are known for solving the discrete log on curves

defined over extension fields, but only generic attacks are known to work inthe prime field case. Evaluating the performance of collision search algorithmsis thus essential for understanding the security of curve-based cryptosystems.Several record-breaking implementations of this algorithm are available in theliterature: over a prime field the current record reaches a discrete log in a 112-bit group on a curve of the form y2 = x3 − 3x+ b [5,10]. This computation wasperformed on a Playstation 3. More recently, Bernstein, Lange and Schwabe [7]reported on an implementation on the same platform and for the same curve, inwhich the use of the negation map gives a speed-up by a factor

√2. Over binary

fields, the current record is a FPGA implementation breaking a discrete loga-rithm in a 117-bit group [1]. As for the meet-in-the-middle attack, this generictechnique is widely used in cryptanalysis to break block ciphers (double andtriple DES, GOST [9]), hash functions [12,13] and lattice-based cryptosystems(NTRU [4,19]).

First, our contribution is to extend the analysis of the parallel collision searchalgorithm and present a formula for the expected runtime to find any givennumber of collisions, with and without a memory constraint. We show how tocompute optimal values of θ - the proportion of distinguished points, allowing tominimize the running time of collision search, both in the case of discrete loga-rithms and meet-in-the-middle attacks. In the case where the available memoryis limited, we determine the optimal value of θ, proving that the value conjec-tured by van Oorschot and Wiener was asymptotically correct. Going further inthe analysis, our formulae show that the actual running time of finding-many-collisions algorithm is critically reduced if the number of words w that can bestored in memory is larger.

Secondly, we focus on the data structure used for the algorithm. To the best ofour knowledge, all existing implementations of parallel collision search algorithmsuse hash tables to organize memory and allow fast look-up operations. In thispaper, we introduce a new structure, called Packed Radix-Tree-List, which isinspired by radix trees. We show that the use of this structure leads to a betteruse of memory in implementations and thus yields improved running times formany-collision applications.

Using the PRTL structure, we have implemented the parallel collision searchalgorithm for discrete logarithms on elliptic curves defined over prime fields,as well as a multi-collision search algorithm. Our benchmarks demonstrate theperformance and scalability of this method.

Organisation. Section 2 reviews algorithms for solving the discrete logarithmproblem and for meet-in-the-middle attacks. In Section 3, we revisit the prooffor the time complexity of the collision finding algorithm for a small and a largenumber of collisions. We furthermore show how to minimize the runtime, infunction of the proportion of distinguished points. Section 4 describes our choicefor the data structure, complexity estimates and comparison with hash tables.Finally, Section 5 presents our experimental results.

2

2 Parallel collision search

In this section we briefly review Pollard’s rho method and the parallel algorithmfor searching collisions. In order to look for collisions for a function f : S → Swith Pollard’s rho method, the idea is to compute a sequence of elements xi =f(xi−1) starting at some random element x0. Since S is finite, eventually thissequence begins to cycle and we therefore obtain the desired collision f(xk) =f(xk+t), where xk is the point in the sequence before the cycle begins and xk+tis the last point on the cycle before getting to xk+1 (hence f(xk) = f(xk+t) =xk+1). One may show that the expected number of steps taken until the collisionis found is

√πn2 , and therefore that the memory complexity is also O(

√πn2 ). This

algorithm can be further optimized to constant memory time by using Floyd’scycle [11,8]. We do not further detail memory optimizations here since they areinherently of sequential nature and no way to exploit these ideas in a parallelalgorithm is currently known.

The parallel version for the collision search was proposed by van Oorschotand Wiener [18] and assigns to each thread the computation of a trail givenby points xi = f(xi−1) starting at some point x0. Only points that belong toa certain subset, called the set of distinguished points, are stored. This set isdefined by points having an easily testable property. Whenever a walk computesa distinguished point xd, it stores in a common list of tuples (x0, xd). If two walkscollide, this is identified only when they both reached a common distinguishedpoint. We may then re-compute the paths and the points preceding the commonpoint are distinct points which map to the same value.

Solving discrete logarithms. Our focus is on the elliptic curve discrete log-arithm (ECDLP) in a cyclic group G = 〈P 〉, but the methods described in thispaper apply to any cyclic finite group. We will assume that the curve E and thegroup G are defined over a finite field Fp, where p is a prime number. Let Q ∈ Gand say we want to solve the discrete logarithm problem Q = xP , where x ∈ Z.To apply the ideas explained above, we define a map F : G→ G which behavesrandomly and such that each time we compute f(R) we can easily keep track ofintegers a and b such that f(R) = aP + bQ. Pollard’s initial proposal for such afunction was

f(R) =

R+ P if R ∈ S1

2R if R ∈ S2

R+Q if R ∈ S3

(1)

where the sets Si, i ∈ {1, 2, 3} are pairwise disjoint and give a partition of thegroup G. As a consequence, whenever a collision f(R) = f(R′) occurs, we obtainan equality

aP + bQ = a′P + b′Q. (2)

This allows us to recover x = (a−a′)/(b−b′), provided that b−b′ is not a multipleof r. Starting from R0, a multiple of P , Pollard’s rho [15] method computes a

3

sequence Ri of points where Ri+1 = f(Ri). Since the group G is finite, thissequence will produce a collision after

√πn2 iterations on average, where n is

the cardinality of the group G. To define distinguished points, we take an easilytestable property, such as a certain number of trailing bits of their x-coordinatebeing zero. Whenever a walk computes such a point, this is stored in a commonlist, together with the corresponding a and b. If two walks collide, this can notbe identified until the computation of the common distinguished point. Thenthe discrete logarithm is recovered from (2).

Meet-in-the-middle attacks. Meet-in-the-middle attacks require finding acollision of the type f1(a) = f2(b), where f1 : D1 → R and f2 : D2 → Rare two functions with the same co-domain. As explained in [18], solving thisequation may be formulated as a collision search problem on a single functionf : S × {1, 2} → S × {1, 2}, where the solution we need is of the type:

f(a, 1) = f(b, 2), (3)

and has some extra specific property. This collision is called the golden collision.The number of unordered pairs in S are approximatively n2

2 and the probabilitythat the two points in a pair map to the same value of f is 1

n . There are n2

expected collisions for f and there may be several solutions to Equation (3).Hence one typically assumes that all collisions are equally likely to occur andthat in the worst case, all possible n

2 collisions for f are generated before findingthe golden one. Because so many collisions are generated, memory complexitycan be the bottleneck in meet-in-the-middle attacks and the memory constraintbecomes an important factor in determining the running time of the algorithm.We further explain this idea in Section 3.

Computational model and data structure. We consider a CPU imple-mentation of the shared memory paradigm, where each thread involved in theprocess performs the same task of finding and storing distinguished points. Inthis case, the choice of a data structure allowing efficient look-up and insertion issignificant. The most common structure used in the literature is a hash table. Inorder to make parallel access to memory possible, van Oorschot and Wiener [18]propose the use of the most significant bits of distinguished points. Their ideais to divide the memory in segments, each corresponding to a pattern on thefirst few bits. Threads read off these first bits and are directed towards the rightsegment. Each segment is organized as a memory structure on its own.

A more commonly used computational model for solving the ECDLP is thedistributed client-server model, where a large number of client processors arecommunicating with a central server over the Internet. Here the server storesand compares the distinguished points received from the clients. A representativeimplementation of this model is [6]. Following the client-server architecture, thereis the FPGA model, where a host computer is controlling a large number ofFPGAs, as in [1].

4

Excluding the need for a structure that allows efficient simultaneous accessto memory, all results in this paper apply to both computational models, eventhough our experimental results are obtained using the CPU implementation.

Notation. In the remainder of this paper, we denote by θ the proportion ofdistinguished points in a set S. We denote by n the number of elements of S. Wedenote by E an elliptic curve defined over a prime finite field Fp. Whenever theset S is the group E(Fp), n is the cardinality of this group. For simplicity, in thiscase, we assume that n is prime (which is the optimal case in implementations).

3 Time complexity

Van Oorschot and Wiener [18] gave formulae for the expected running time ofparallel collision search algorithms. In this section, we revisit the steps of theirproof and show a careful analysis of the running time. Our theoretical modelprovides a more rigorous argument for linear scalability and yields runtimeswhich asymptotically coincide to those in [18]. However, our refined formulaeindicate that the actual running time of the algorithm depends on the proportionof distinguished points and allow us to determine the optimal choice of θ foractual implementations.

3.1 Finding one collision: elliptic curve discrete logarithm

Van Oorschot and Wiener [18] proved that the runtime for finding one collisionis

O

(1

L

√πn

2

),

with L the number of threads we use. This is obtained by finding the expectednumber of computed points before a collision occurs and then intuitively dividingthe clock time by L when L processors are involved. The proof of the followingtheorem, which is in Appendix A gives a more refined analysis on why the PCSparallelizes linearly.

Theorem 1. In the parallel collision search algorithm, the expected runningtime to find one collision is

f(θ) = (1

L

√πn

2+

1

θ)tc + (

θ

L

√πn

2)ts, (4)

where tc and ts are constants of the time it takes to compute and to store a pointrespectively.

As we can see in Equation (4), the proportion of distinguished points wechoose will influence our time complexity. The optimal value for θ is the one thatgives the minimal run complexity. Most importantly, our analysis puts forward

5

the idea that the optimal choice for θ depends essentially on the choices madefor the implementation and the memory management. From this formula, weeasily deduce that if the proportion of distinguished points is too small or toolarge, the running time of the algorithm increases significantly.

By estimating the ratio ts/tc for a given implementation, one can extrapolatethe optimal value of θ by computing the zeros of the derivative:

f ′(θ) =1

L

√πn

2ts −

1

θ2tc.

Figure 1 gives timings for our implementation of the attack, using a hashtable to store distinguished points. Timings shown on the figure are a mean for100 tests on a 65-bits curve and support our theoretical findings.

4 6 8 10 12 14 16 18 20 22 24 26 28

1.5

2

2.5

3

·10949m29s

29m1s22m41s22m1s20m26s21m12s22m38s21m55s23m5s

30m8s

50m9s

Trailling zero bits

Run

time(µs)

Fig. 1: Timings of solving ECDLP for different values of θ, 65-bits curve, 28threads

Note that most recent implementations available in the literature choosethe number of trailing bits giving the distinguished point property in a rangebetween 0.178 log n and 0.256 log n (see [1,7,10]). This value was determined byexperimenting on curves defined over small size fields. Our theoretical findingsconfirm that these values were close to optimal, but we support the idea that forfuture record-breaking implementations, the value of θ should be determined asexplained above.

When we consider the client-server model, clients do not have access to mem-ory, but they send distinguished points to the server and thus ts stands for costof communication on the client side. We suppose that all of the clients’ proces-sors are dedicated to computing points. On the server side however, the analysisis different. This theorem and the means of finding the optimal value for θ applyboth to the shared memory implementation adopted in this paper, and to themore common distributed client-server model.

3.2 Finding many collisions: Meet-in the middle attacks

Using a simplified complexity analysis, van Oorschot andWiener [18] put forwardthe following heuristic.

6

Heuristic. Let n be the cardinality of the set S. For a memory which can holdw distinguished points, the (conjectured) optimum proportion of distinguishedpoints is θ ∼ 2.25

√wn . Under this assumption, the expected number of itera-

tions required to complete a meet-in-the-middle attack using these parametersis O(nL

√nw ).

This heuristic suggests that in the case of meet-in-the-middle attacks, a memorydata structure allowing to store more distinguished points will yield a bettertime complexity. To prove the conjectured runtime, we first give a more refinedanalysis for the running time of a parallel collision search for findingm collisions.

Theorem 2. In the parallel collision search algorithm, the expected runningtime to find m collisions with a memory constraint of w words is:

1

L

(w

θ+ (m− w2

2θ2n)θn

w

)+m

θ. (5)

The proof of this Theorem is in Appendix B and it relies strongly on ourformula for the expected total number of computed distinguished points forfinding k collisions, when the memory is not limited:

Sk = θ√

2kn. (6)

We confirmed this estimation experimentally by running a multi-collision algo-rithm for a curve over a 55-bit prime field. The comparison of our formula withthe experimental results is in Table 1.

Collisions ExperimentalAvg.

Sk Collisions ExperimentalAvg.

Sk

100 238289 231704 500 530493 5181071000 750572 732714 2000 1062581 10362155000 1681831 1638399 7000 1990671 1938581

Table 1: Comparing Formula 6 to an experimental average. Each value is anaverage of 100 tests.

Recall that in the meet-in-the-middle attack, one needs to compute n2 colli-

sions. By minimizing the complexity function obtained in Theorem 2 , we obtainan estimate for the optimal value of θ to take, in order to minimize the runningtime of the algorithm.

Corollary 1. The optimum proportion of distinguished points minimizing thetime complexity bound in Theorem 2 is θ =

√w2+nLwn . Furthermore, by choosing

this value for θ, the running time of the parallel collision search algorithm forfinding n

2 collisions is bounded by:

O

(n√w2 + nLw

wL

). (7)

7

The proof of this corollary is in Appendix C and it confirms and proves theheuristic findings in [18]. Most importantly, Theorem 1 suggests that in the caseof applications which fill the memory available, the number of words we canstore is an important factor in the running time complexity. More storage spaceyields a faster algorithm by a constant factor. We propose such optimization inSection 4.

4 Our approach for the data structure

In this section, we evaluate the memory complexity of parallel collision searchalgorithms. As explained in Section 2, van Oorschot and Wiener’s [16] proposedto divide the memory into segments to allow simultaneous access by threads.We revisit this construction, with the goal in mind to minimize the memoryconsumption as well. Since in Section 3 we showed that the time complexity ofcollision search depends strongly on the available amount of memory, we proposean alternative structure called a Packed Radix-Tree-List, which will be referredto as PRTL in this paper. We explain how to choose the densest implementationof this structure for collision search data storing in Section 5.

Since the PRTL is inspired by radix trees, we first describe the classic radixtree structure and then we give complexity analysis confirmed with experimentalresults on why its straightforward implementation is not memory efficient. Thestructure that we finally use for collision search has the memory gain of radixtree common prefixes, but avoids the memory loss of manipulating pointers.

4.1 Radix tree structure

Each distinguished point from the collision search is represented as a number ina base of our choice, denoted by b. For example, in the case of attacks on thediscrete logs on the elliptic curve, we may represent a point by its x-coordinate.The first numerical digit of this number in base b gives the root node in thetree, the next digit is a child and so on. According to graph theory, this leads todefine an acyclic graph which consists of b connected components (i.e. a forest).

In regard to the memory consumption, we take advantage of common prefixesto have a more compact structure. Let c be the length of words we store in thetree and K the number of distinguished points computed by our algorithm.To estimate the memory complexity of this approach, we give upper and lowerbounds for the number of nodes that will be allocated in the radix tree before acollision is found.

Proposition 1. The expected number of nodes in the radix tree verifies the fol-lowing inequalities:

b

b− 1K − c− logK − 1 ≤ N(K) ≤ (c− logK +

b

b− 1)K. (8)

8

The proof of these inequalities is detailed in Appendix D.Traditionally, nodes in a radix tree are implemented as arrays of pointers

to child nodes. This representation will lead to excessive memory consumptionwhen the data to be stored follows a uniform random distribution, leading tosparsely populated branches and to the average distribution of nodes in the treebeing closer to the worst case than to the best case.

The difference between the worst case value and the best case value is ap-proximatively (c − logK)K. Depending on the application, this value may belarge. We take the case where a single collision is stored for solving the ECDLP.By a theorem of Hasse [16], we know that the number of points on the curve isgiven by n = p + 1 − t, with |t| ≤ 2

√p. Since we assume that n is prime, we

approximate log n ∼ log p. Hence an approximation of this number is:

(1

2log n− log

√π

2)θ

√πn

2.

In the case of many collisions algorithms, c ∼ K and this standard deviationbecomes negligible, resulting into a space-reduced data structure. We show howto handle sparse trees efficiently in Section 4.2.

Memory consumption of a radix tree for solving ECDLP. In the case ofattacks on the discrete logs on the elliptic curve, we store the starting point ofthe Pollard walk aP and the first distinguished point we find, represented by thecoefficient a and the x-coordinate correspondingly. In practice, implementingthis structure is very costly. To have a log p look-up and insertion time eachnode has to store the pointers to all of its b children in an array. Furthermore,internal nodes, as well as leaves, have to hold a pointer to a coefficient a, sincex-coordinates can be of different lengths.1 For reasons pointed out earlier, thereis also a lock associated to each node.

We label a pointer wasted when it is pending (i.e. unallocated child or acoefficient) and we denote by rate of use the ratio between the memory that isused and the allocated amount of memory. We show in Table 2 the experimentalresults that we got for the rate of use of the radix tree structure in the case ofcollision search for the discrete log attack.

Key size base 2 base 4 base 8 base 1060 bits 51.47% 35.46% 22.4% 18.58%65 bits 51.39% 35.36% 21.95% 18.50%

Table 2: Rate of use of the radix tree structure for solving ECDLP according tobase 2, 4, 8 and 10. Each value is an average of 100 tests.

1 Words can be of uniform length if we add zeros at the beginning.

9

4.2 Packed Radix-Tree-List

Using the analysis from the previous paragraph we look to construct a moreefficient memory structure by avoiding the properties of the classic radix treewhich make it memory costly in the collision search case. Intuitively, we see thatthe radix tree is dense at the upper levels and sparse at the lower ones. Henceit would be less memory consuming to construct a radix tree up to certain leveland then add the points to linked lists, each list starting from a leaf on the tree.We call this a Packed Radix-Tree-List2. Figure 2 illustrates an example of anabstract Radix-Tree-List in base 4.

Fig. 2: Radix-Tree-List structure with b = 4 and l = 2

We look to estimate up to which level the tree is complete for our use case.Let K be the number of stored points before a collision is found and let l be thelevel up to which we build the radix tree. The number of leaves in a completeradix tree of depth l is bl. As per the coupon collector’s problem, all the linkedlists associated with a leaf will contain at least one point when the followinginequality is verified:

K ≥ bl(ln bl + 0.577). (9)

We consider the highest value of l which satisfies this inequality to be the optimallevel, as it allows us to obtain the shortest linked lists while having 100% rateof use of the memory structure. We verify this experimentally by inserting agiven number of randomly obtained points of length c = 65 bits to the PRTLstructure. The results are in Table 3.

We do 100 runs per K using a base 2 tree implementation and we count thenumber of empty lists at the end of each run. None of the 300 runs finished withan empty list in the PRTL structure, which confirms that the obtained l is smallenough to have at least one point per list.

Then, to confirm that l is the highest possible value that achieves this, wedo the same number of tests, taking l + 1, which is the lowest value that doesnot satisfy the inequality 9. The results show that l + 1 is not small enough toproduce a 100% rate of use of the memory, therefore l is in fact the optimal levelto choose.2 The ’packed’ property is addressed in Section 5, where we give implementation de-tails.

10

Nb. of points Chosen level Average nb. of empty lists per runl l + 1 l l + 1

5 million 18 19 0 377 million 18 19 0 0.8410 million 19 20 0 75

Table 3: Verifying experimentally the optimal level.

We notice that in the case of 7 million points, we have very few (or none)empty slots even by taking the level l + 1. This is explained by the fact thata PRTL of level 19 needs 7207281 points to have all of its lists filled, as per 9.Given that 7 million is very close to this number, some of the runs finish with100% rate of use.

The attribution of a point to a leaf is determined by its prefix and we knowin advance that all of the leaves will be allocated. Therefore, in practice we donot actually have to construct the whole tree, but only the leaves. Hence, weallocate an array indexed by prefixes beforehand and then we insert each pointto the corresponding prefix’s list. The operation used to map a point to an indexis faster than a hash table function. More precisely, we perform a logical ANDoperation between the point’s x-coordinate and a precomputed mask to extractthe prefix. Furthermore, the lists are sorted. Since we are doing a search-and-add operation, sorting the lists does not take additional time and proves to bemore efficient than simply adding at the end of the list. Figure 3 illustrates theimplementation of this structure.

Fig. 3: PRTL implementation. Same points stored as in Figure 2.

Remark 1. When implementing the attack for curves defined over sparse primes,we advise taking an l-bit suffix instead of an l-bit prefix. Prefixes of numbersin sparse prime fields are not uniformly distributed and one might end up onlywith prefixes starting with the 0-bit, and therefore a half empty array.

Remark 2. Describing the structure, we took the example of the ECDLP. How-ever, the analyses and choices we made for constructing the PRTL are valid forany collision search application which includes storing a pair (key, data) and re-quires pairs to be efficiently looked up by key. For the ECDLP, Bailey et al. [6]propose, for example, to store a 64-bit seed on the server instead of the initialpoint, which makes the pair (x− coordinate, seed).

11

PRTL vs. hash table We experimented with the ElfHash function, which isused in the UNIX ELF format for object files. It is a very fast hash function, andthus comparable to the mask operation in our implementation. Small differencesin efficiency are negligible since the insertion is the less significant part of thealgorithm. Indeed, recall one insertion is done after 1

θ computations.As is the practice with the Parallel Collision search, we allocate K indexes for

the hash table, since we expect to have K stored points. Recall that this guaran-tees an average search time of O(1), but it does not avoid multi-collisions. Indeed,according to [11, Section 6.3.2], in order to avoid to avoid 3-multicollisions, oneshould choose a hash table with K

32 buckets. Consequently, we insert points in

the linked lists corresponding to their hash keys, as we did with the PRTL. Everyelement in the list holds a pair (key, data) and a link to the next element. ThePRTL is more efficient in this case as we only need to store the suffix of the key.

With this approach, we can not be sure that a 100% of the hash table indexeswill have at least one element. We test this by inserting a given number of randompoints on a 65-bit curve and counting the number of empty lists at the end ofeach run, like we did to test the rate of use for the PRTL. We try out twodifferent table sizes: the recommended hash table size and for comparison, a sizewhich matches the number of leaves in the PRTL. All results are an average of100 runs.

Nb. of points Average nb. of empty listsfor size = K

Average nb. of empty listsfor size = 2l

5 million 2592960 (51.85%) 98308 (37.50%)7 million 3632679 (51.89%) 98304 (37.50%)10 million 5138792 (51.38%) 196615 (37.50%)Table 4: Test the rate of memory use of a hash table structure.

Results in Table 4 show that when we choose a smaller table size, we havefewer empty lists, but the hash table is still not 100% full. Due to these results,when implementing a hash table we choose to allocate a table of pointers toslots, instead of allocating a table of actual slots which will not be filled. This isthe optimal choice because we only waste 8 Bytes for each empty slot, insteadof 24 (the size of one slot).

Since results in Table 3 show that the array in PRTL will be filled completely,when using this structure we allocate an array of slots directly. This makes PRTLsave a constant of 8K Bytes compared to a hash table.

To sum up, the PRTL structure is less space consuming and has a memoryrate of use of 1. Note however that by Equation (9), the average number ofelements in a chained list corresponding to a prefix is K

bl≈ l log b + 0.577. This

shows that the search time in our structure is negligible, and our benchmarksshown in Section 5 confirm that memory access has no impact on the totalrunning time for the algorithm.

12

It is clear that when one implements the PRTL, this structure takes theform of a hash table where the hash function is in fact the modulo a specificvalue calculated using the correlation (9). It might seem counter-intuitive thatthe optimal solution for a hash function is the modulo function. However, col-lision search algorithms do not require a memory structure that has hash tableproperties, such as each key to be assigned to a unique index.

Indeed, a well distributed hash function is useful when we look to avoidmulticollisions. With collision search algorithms, the number of stored elementsis so vast that we can not possibly allocate a hash table of the appropriatesize and thus we are sure to have longer than usual linked lists. Fortunately,this is not a problem since the insertion time is, in this case, not significantcompared to the 1

θ random walk computations needed before each insertion. Forexample, 1

θ would be of order 232 for a 129-bit curve. On the other hand, asshown in Section 3, the available storage space is a significant factor in the timecomplexity, which makes the use of this alternative structure more appropriatefor collision searches.

5 Implementation and benchmarks

Packed RTL A link in the lists in the PRTL stores one (key, data) pair. Theintuitive structure implementation of such a link would be the following:

s t r u c t {po in t e r to key−s u f f i x ;po in t e r to data ;po in t e r to next ;} l i n k ;

In addition, we have the bytes storing the actual values for the key-suffix andthe data.

In order to have the best packed structure, we look to avoid wasting spaceon addressing, structure memory alignment and unintended padding. Hence wepropose to store all relevant data in one byte-vector. Our compact slot has thefollowing structure:

s t r u c t {byte vec to r [ vec to r s i z e ] ;po in t e r to next ;} l i n k ;

The key-suffix and data are bound in one single vector. We designed functionsthat allow us to extract and set a specific bit. In this way, we have at most 7bits wasted due to alignment.

Our implementation of a PRTL yields a better memory occupation, but mostimportantly, manipulating this structure does not slow down the overall runtimeof the attack. We show tests that verify this in Table 5, where we insert a givennumber of random points on a 65-bit curve, using both a hash table and the

13

PRTL. To have a measurement of the runtime that does not depend on pointcomputation time, we take θ = 1, meaning every point is a distinguished one.The key length is thus c = 65. All results are an average of 100 runs.

K Memory RuntimePRTL Hash table PRTL Hash table

5 million 106 MB 324 MB 5.05 s 5.20 s7 million 148 MB 454 MB 6.74 s 7.01 s10 million 213 MB 649 MB 9.84 s 10.2 s

Table 5: Comparing the insertion runtime and memory occupation of a PRTLvs. a hash table.

We show similar experiments in Table 6. This time, we performed actualattacks on the discrete log over elliptic curves, instead of inserting random points.Since the number of stored points is now random and can be different betweentwo sets of runs, the runtime per stored point and memory per stored point aremore relevant results.

Field Memory Memory per point Runtime Runtime per pointPRTL Hash

tablePRTL Hash

tablePRTL Hash

tablePRTL Hash

table55-bit 402 KB 1172 KB 19 B 59 B 35.16 s 36.42 s 1.69 ms 1.81 ms60-bit 618 KB 1801 KB 20 B 59 B 210.33 s 212.83 s 6.88 ms 6.91 ms65-bit 1856 KB 5212 KB 21 B 60 B 1292 s 1291 s 14.90 ms 14.95 msTable 6: Runtime and the memory cost for the attack on ECDLP using PRTLsand hash tables.

The results are an average of 100 runs and they show that by using a PRTLfor the storage of distinguished points we optimize the memory complexity by afactor of 3. The vector method that we describe in this section can be appliedfor a classic hash table as well. In that case, the PRTL would have a gain factorof 1.5 over the packed hash table, according to these experimental results.

However, the number of stored points for the presented tests is relativelysmall, corresponding to the number of points needed for one collision. For ap-plications that demand a large number of collisions, such as meet-in-the-middleattacks, the ratio between the prefix and the suffix grows in the favour of thePRTL. Moreover, the difference between the memory occupied by the hash tableand the one taken by the PRTL grows linearly with the number of stored points,and can be calculated as follows:

Mhybrid = Mhash −K(d l8e+ 8).

We suppose that all of the points can be addressed using 8-Byte addresses, i.e.K < 264, and let us look at the example of a meet-in-the-middle attack on the

14

3-DES with three independent keys. Following van Oorschot and Wiener [18],this has a runtime of 7n2

√n1

K , where the memory available is K words andn1 = 256 and n2 = 2112. We calculate the memory requirements to store K pairs(key, data) ∈ {0, 1}64 × {0, 1}56. Table 7 shows the memory occupation for thepacked hash table and the PRTL and the ratio between them when we increasethe value of K.

K Packed hash table PRTL Gain Ratio Complexity232 137438 MB 85899 MB 12K 1.60 7 · 2124240 35184372 MB 20890720 MB 13K 1.68 7 · 2120248 9007199254 MB 5066549580 MB 14K 1.77 7 · 2116256 2305843009213 MB 1224979098644 MB 15K 1.88 7 · 2112264 590295810358705 MB 295147905179352 MB 16K 2.00 7 · 2108

Table 7: Comparing structure-specific memory requirements for a parallel colli-sion search MITM attack on 3-DES.

To prove our claims from Section 3 that more storage space yields a fasteralgorithm, we ran a multi-collision search while limiting the available memory.When the memory is filled, each thread continues to search for collisions withoutadding any new points. Results in Table 8 show that the PRTL yields a betterruntime compared to a classic Hash table due to the more efficient memory use.

Collisions Memory limit Runtime Stored pointsPRTL Hash table PRTL Hash

table400 10MB 14,3 min 18,8 min 474019 216459

10000 40MB 74,2 min 113,4 min 2104832 86742950000 100MB 172,5 min 241,6 min 5262727 2169383

Table 8: Runtime for multi-collision search for a 55-bit curve using PRTLs andhash tables. Each value is an average of 100 runs.

5.1 ECDLP implementation details and scalability

To support our findings, we implemented the parallel collision search using bothPRTLs and hash tables for discrete logarithms on elliptic curves defined overprime fields. Our implementation is in C and the external libraries we used areThe GNU Multiple Precision Arithmetic Library [2] for large numbers arith-metic, and the OpenMP (Open Multi-Processing) interface [3] which supportsshared memory multiprocessing programming. Our tests were performed on a28-core Intel Xeon E5-2640 processor and we experimented using between 1 and28 threads. For completeness, in Appendix E we enumerate the choices we madein our implementation and which are common in the literature.

15

Parallel Performance We were interested in the parallel performance i.e.how efficient our program is when increasing the number of parallel processingelements. A program is considered to scale linearly if the speedup is equal tothe number of threads used. In the theoretical model [18], the Parallel CollisionSearch is considered to have a linear scalability and our time complexity inTheorem 1 also proves this.

We experimented with L ∈ {1, 2, 7, 14, 28} threads, solving the discrete logover a 60-bit curve. Table 9 shows the Wall clock runtime and the parallel per-formance of the attack when we double the number of threads.

L1 L2 Parallel performanceThreads Runtime Threads Runtime1 2459 s 2 1699 s 0.727 776 s 14 411 s 0.9414 411 s 28 210 s 0.97

Table 9: Runtime and Parallel performance of the attack on ECDLP. Resultsare based on 100 runs per Li ∈ L.

We conclude that the parallel performance is not as good as expected for 2threads, yet it gets closer to linear as the number of threads grows.

6 Conclusion

We revisited the time complexity of the parallel collision search and explainedhow to choose the optimal value for the proportion of distinguished points whenimplementing this algorithm. We proposed an alternative memory structure forthe parallel collision search algorithm proposed by van Oorschot and Wiener [18].We show that this structure yields a better memory complexity than the hashtable variant of the algorithm. Moreover, using the new memory structure, weobtained a better bound for the time complexity of the parallel collision search,in the case where a large number of collisions is needed. Finally, we implementedthe radix tree parallel collision search algorithm for solving discrete logarithmsand showed its scalability.

References

1. Faster elliptic-curve discrete logarithms on FPGAs. https://eprint.iacr.org/2016/382.

2. GNU Multiple Precision Arithmetic Library. https://gmplib.org/.3. Open Multi-Processing Specification for Parallel Programming. https://gmplib.

org/.4. A meet-in-the-middle attack on an NTRU private key. Tehnical report, NTRU

Cryptosystems, 2003.5. Playstation 3 computing breaks 260 barrier; 112-bit prime ECDLP solved. http:

//lacal.epfl.ch/112bit_prime, 2009.

16

https://eprint.iacr.org/2016/382


https://gmplib.org/

https://gmplib.org/

https://gmplib.org/

http://lacal.epfl.ch/112bit_prime

http://lacal.epfl.ch/112bit_prime

6. Daniel V. Bailey, Lejla Batina, Daniel J. Bernstein, Peter Birkner, Joppe W. Bos,Hsieh-Chung Chen, Chen-Mou Cheng, Gauthier van Damme, Giacomo de Meule-naer, Luis Julian Dominguez Perez, Junfeng Fan, Tim Güneysu, Frank Gurkaynak,Thorsten Kleinjung, Tanja Lange, Nele Mentens, Ruben Niederhagen, ChristofPaar, Francesco Regazzoni, Peter Schwabe, Leif Uhsadel, Anthony Van Herrewege,and Bo-Yin Yang. Breaking ecc2k-130. Cryptology ePrint Archive, Report2009/541, 2009. https://eprint.iacr.org/2009/541.

7. Daniel J. Bernstein, Tanja Lange, and Peter Schwabe. On the Correct Use of theNegation Map in the Pollard rho Method. In Dario Catalano, Nelly Fazio, RosarioGennaro, and Antonio Nicolosi, editors, PKC 2011: 14th International Conferenceon Theory and Practice of Public Key Cryptography, volume 6571 of Lecture Notesin Computer Science, pages 128–146, Taormina, Italy, March 6–9, 2011. Springer,Heidelberg, Germany.

8. Richard P. Brent. An improved Monte Carlo factorization algorithm. BIT, 20:176–184, 1980.

9. Takanori Isobe. A single-key attack on the full GOST block cipher. In AntoineJoux, editor, Fast Software Encryption – FSE 2011, volume 6733 of Lecture Notesin Computer Science, pages 290–305, Lyngby, Denmark, February 13–16, 2011.Springer, Heidelberg, Germany.

10. Peter L. Montgomery Joppe W. Bos, Marcelo E. Kaihara. Pollard rho on theplaystation 3. Workshop record of SHARCS’09 http://www.hyperelliptic.org/tanja/SHARCS/record2.pdf, 2009.

11. Antoine Joux. Algorithmic Cryptanalysis, chapter 7, pages 225–226. Chapman &Hall/CRC, 2009.

12. Dmitry Khovratovich, Ivica Nikolic, and Ralf-Philipp Weinmann. Meet-in-the-middle attacks on SHA-3 candidates. In Orr Dunkelman, editor, Fast SoftwareEncryption – FSE 2009, volume 5665 of Lecture Notes in Computer Science, pages228–245, Leuven, Belgium, February 22–25, 2009. Springer, Heidelberg, Germany.

13. Florian Mendel, Christian Rechberger, Martin Schläffer, and Søren S. Thomsen.The rebound attack: Cryptanalysis of reduced Whirlpool and Grøstl. In OrrDunkelman, editor, Fast Software Encryption – FSE 2009, volume 5665 of Lec-ture Notes in Computer Science, pages 260–276, Leuven, Belgium, February 22–25,2009. Springer, Heidelberg, Germany.

14. J. M. Pollard. Monte carlo methods for index computation ( (mod p)). Mathe-matics of Computation, 32(143):918–924, 1978.

15. John Pollard. Monte Carlo methods for index computation (mod p). Math.Comp., (32):918–924, 1978.

16. J. H. Silverman. The Arithmetic of Elliptic Curves, volume 106 of Graduate Textsin Mathematics. Springer, 1986.

17. Edlyn Teske. On random walks for Pollard’s rho method. Math. Comp.,70(234):809–825, 2001.

18. Paul C. van Oorschot and Michael J. Wiener. Parallel collision search with crypt-analytic applications. Journal of Cryptology, 12(1):1–28, 1999.

19. C. van Vredendaal. Reduced memory meet-in-the-middle attack against the NTRUprivate key. LMS Journal of Computation and Mathematics, 19(Issue A (Algorith-mic Number Theory Symposium XII)):43–57, 2016.

17


http://www.hyperelliptic.org/tanja/SHARCS/record2.pdf

http://www.hyperelliptic.org/tanja/SHARCS/record2.pdf

A Appendix : Proof of Theorem 1

Proof. We call short path the chain of points computed by a thread between twoconsecutive distinguished points. The expected number of distinguished pointsproduced after a certain clock time T is TLθ. The probability of not having acollision at T = 1, for one thread is

1− Lθ

nθ.

Note that any of the L threads can cause a collision. Thus, the probability forall threads of not finding a collision on any point on the short walk is:

(1− L

n)L,

at the moment T = 1.Let X be the number of points calculated per thread before duplication.

Hence:

P (X > T ) = (1− Lθ

nθ)L · (1− 2L

nθ)L · . . . · (1− TLθ

nθ)L.

To do this multiplication we are going to take a shortcut. When x is close to0, a coarse first-order Taylor approximation for ex as:

ex ≈ 1 + x.

Now we can rewrite our expression to:

P (X > T ) = (e−Ln · e− 2L

n · . . . · e−TLn )=(e−(L+2L+...+TL)

n ))L=

= (e−T (T+1)L

2n )L = (e−T2L

2n )L = e−T2L2

2n . (10)

This gives us the probability

P (X > T ) = e−T2L2

2n ,

thus the expected number of distinguished points found before duplication, is

E(X) =

∞∑T=1

T ·P (X = T ) =

∞∑T=1

T ·(P (X > T−1)−P (X > T )) =

∞∑T=0

P (X > T ).

We approximate

E(X) =

∞∑T=0

e−T2L2

2n ≈∫ ∞0

e−x2L2

2n dx ≈ 1

L

√πn

2.

18

Since the expected length of a short walk is 1θ , the number of distinguished points

before a collision occurs isθ

L

√πn

2.

However, a collision might occur on any point on the walk and it will not bedetected until the walk reaches a distinguished one. We add 1

θ to the numberof calculations for the discovery of a collision. Finally, the expected number ofcalculated points per thread is:

1

L

√πn

2+

1

θ.

The two main operations in our algorithm are computing the next point on therandom walk and storing a distinguished point. Thus, the time complexity ofour algorithm is:

f(θ) = (1

L

√πn

2+

1

θ)tc + (

θ

L

√πn

2)ts. (11)

Remark 3. Note that the analysis above shows that the number of points com-puted by the algorithm is O

(θ√

πn2

). This was proven by van Oorschot and

Wiener in the first place.

B Appendix : Proof of Theorem 2

Proof. Let X be the number of distinguished points calculated per thread beforeduplication. Let T1 be the number of distinguished points computed until thefirst collision was found, and Ti, for any i > 1, the number of points stored inthe memory after the (i− 1)th collision was found and before the ith collision isfound.

As shown in Theorem 1, the expected number of points stored before findingthe first collision is T1 = θ

√πn2 . The probability of not having found the second

collision after each thread has found and stored T distinguished points is

P (X > T ) = (1− L+ T1nθ

)Lθ · (1− 2L+ T1

nθ)Lθ · . . . · (1− TL+ T1

nθ)Lθ .

As in the proof of Theorem 1, we approximate this expression by

P (X > T ) = e−T2L2−2LT1T

2nθ2 .

19

Hence the expected number of distinguished points computed by a threadbefore the second collision is:

E(X) =

∞∑T=0

e−T2L2−2LT1T

2nθ2 ≈∫ ∞0

e−x2L2−2xLT1

2nθ2 dx =

= eT21

2nθ2

∫ ∞0

e−(xL+T1)2

2nθ2 dx =θ√n

LeT21

2nθ2

∫ ∞T1θ√

2n

e−t2

dt

=θ√

2n

LeT21

2nθ2

(−e−t2

2t

∣∣∣∣∞T1θ√

2n

−∫ ∞

T1θ√n

e−t2

2t2

).

We denote by

Uk = T1 + T2 + . . . Tk.

By applying repeatedly the formula above (and neglecting the last integral),we have that Tk = θ2n

LUk−1. Therefore we have Uk = Uk−1 + θ2n

LUk−1. By letting

Vk = LUkθ√n, we obtain a sequence given by the recurrence formula

Vk = Vk−1 +1

Vk−1.

We will use the Cesaro-Stolz criterion to prove the convergence of this limit.First, we note that this sequence is increasing and tends to ∞. Moreover wehave that V 2

k = V 2k−1 + 2 + 1

V 2k−1

. Hence V 2k−V

2k−1

k+1−k → 2 and as per Cesaro-Stolz

we have Vk ∼√

2k. We conclude that Uk ∼ θ√2knL .

Since Uk is the number of distinguished points computed per thread, the totalnumber of stored points is θ

√2kn. Hence the memory will fill when θ

√2kn = w.

This will occur after computing the first kw = w2

2θ2n collisions and the expectedtotal time for one thread is:

w

Lθ.

When the memory is full, the time to find a collision is θnLw . To sum up, the total

time to find m collisions is:

1

L

(w

θ+ (m− w2

θ22n)θn

w

)+m

θ.

Remark 4. According to formula obtained for Uk in the proof of Theorem 2, wesee that if the memory is not filled when running the algorithm for finding n

2collisions, as in meet-in-the-middle applications, then we store θn distinguishedpoints, i.e. all distinguished points in S.

20

C Appendix : Proof of Corollary 1

Proof. From Theorem 2, the runtime complexity is given by:

1

L

(w

θ+ (

n

2− w2

θ22n)θn

w

)+

n

2θ.

By computing the zeros of the derivative:

f ′(θ) =n2θ2 − w2 − nLw

2Lwθ2,

we obtain that by taking θ =√w2+nLwn , the time complexity is O

(n√w2+nLwwL

).

D Appendix : Radix tree best and worst-case

Worst-case scenario. In the worst case scenario, for each new word addedin this structure we will create as much nodes as possible. This means that thex-coordinates of the added points have the shortest possible common prefix, asshown in Figure 4. For the first b points, we will use bc nodes. After that, thefirst distinguished point that we find will take c− 1 nodes, since all possibilitiesfor the first letter in the string were created. We repeat this operation (b − 1)btimes, provided that K > b+ (b− 1)b.

Fig. 4: Worst-case scenario example, b = 10

More generally, let k = blogbKc − 1. We build the tree by allocating nodesas follows:

– bc nodes for the first b points– (b− 1)b(c− 1) for the next (b− 1)b points– (b− 1)b2(c− 2) for the next (b− 1)b2 points etc.– (b− 1)bk(c− k) for (b− 1)bk points.

For each of the remaining K − (b+∑ki=1(b− 1)bi) points we will need c− k− 1

nodes. To sum up, the total number of nodes that will bound our worst-case

21

scenario is given by:

N(K) = bc+

k∑i=1

(b− 1)bi(c− i) + (K − b− b(b− 1)

k−1∑i=0

bi)(c− k − 1).

After simplification, we have that

N(K) ≈ b

b− 1bblogbKc +K(c− blogbKc). (12)

Best-case scenario Let K be the number of distinguished points that we needto store and let k = blogbKc. In the best-case scenario, we may assume withoutloss of generality that each time a new point is added in the structure, theminimal number of nodes is used, i.e. the x-coordinate of the added point hasthe longest possible common prefix with some other point that was previouslystored. For example, for the first point c nodes are allocated, for the next b-1nodes, one extra node is allocated and so on, until all subtrees of depth 1, 2 etc.are filled one by one. Figure 5 gives an example of how 215 points are stored. IfK > bc−1, we fill the first tree and start a new one. Let xi, for i ∈ {0, 1 . . . , k},

Fig. 5: Best-case scenario example

denote the i-th digit of K, from right to the left. In full generality, since c > k,we use:

– xk complete subtrees of depth k and a (xk+1)-th incomplete tree of depthk;

– the (xk+1)-th tree of depth k has xk−1 complete subtrees of depth k−1 anda (xk−1+1)-th incomplete tree of depth k − 1;

– c− k − 1 extra nodes.

22

Summing up all nodes, we get the following formula:

N(K) =

k∑i=0

xi

i∑j=0

bj + k + c− k − 1 =1

b− 1

k∑i=0

xi(bi+1 − 1) + c =

=b+ 1

bK − c− 1

b− 1

k∑i=0

xi.

We conclude that:

N(K) ≥ b

b− 1K − c− k − 1. (13)

E Appendix : implementation details

Additive walks. Teske [17] showed experimentally that the walk proposed byPollard 1 originally performs on average slightly worse than a random walk. Sheproposes alternative mappings that lead to the same performance as expectedin the random case: additive walks and mixed walks. The additive walks arepresented as follows. Let r be the number of sets Si which give a partition of thegroup G, and let Mi be a linear combination of P and Q: Mi = aiP + biQ, fori = ¯1, r. We choose the iterating function of the form:

Ri+1 = f(Ri) =

Ri +M1, Ri ∈ S1;

Ri +M2, Ri ∈ S2;

. . .

Ri +Mr, Ri ∈ Sr.

(14)

In the case of mixed walks, we introduce squaring steps to r-additive walks.However, Teske’s experimental results show that apart from the case r = 3, theintroduction of squaring steps does not lead to a significantly better performance.After experimenting with both of them, we confirmed her conclusion and decidedto use additive walks.

Teske shows experimentally that if r ≥ 20 then additive walks are close torandom walks. We therefore chose r = 20 in our implementation.

Use of automorphisms If the function f is chosen such that f(R) = f(−R)then we may regard f as being defined on equivalence classes under ±. Sincethere are n

2 equivalence classes, this would lead to a theoretical speed-up of√2. However, it was observed that the use of the negation map leads to so-

called fruitless cycles, cycles that trap the random-walks. In practice, since thesecycles need to be handled, the actual speed-up is significantly less than

√2

and actually depends on the platform one uses [7]. In this paper, we aim atevaluating the performance of our algorithm independently of the platform onemay choose for its implementation. Therefore, we do not use automorphisms inour implementation.

23

Long walks vs. short walks As we explained in Section 2 every thread selectsa starting point, which is a multiple of P , and computes the random walk until adistinguished point is found. After the distinguished point is stored in the radixtree, the thread starts a new walk from a new starting point. We refer to this asa short walk because the walk stops at the first distinguished point and has anaverage length of 1/θ. A second possibility is that the thread would continue thewalk from a distinguished point rather than start from a new one. We refer tothis as a long walk. This approach is used in the Pollard’s rho method becauseit allows the walk to enter a cycle and is an indispensable factor in finding acollision using the Floyd’s cycle finding algorithm[11]. However, in the ParallelCollision Search algorithm every distinguished point is stored, and thus the cycleproperty is irrelevant.

Furthermore, using the short walk method we are not required to calculatethe coefficients a and b every time. We calculate only the value of R for eachiteration, and when we find a distinguished point we store the coefficients of thestarting point (only the coefficient a is stored because the starting point being amultiple of P , the b coefficient is zero). It is only when a collision is found thatwe start iterating from the beginning of the short walk, this time computing aand b. This convenience makes short walks the better choice. Furthermore, weexperimented with both short walks and long walks, finding that short walks giveslightly better runtime results. All our results presented here use short walks.

24

time-memorytrade-oﬀsforparallelcollision … · factoring and discrete logs, ... mapping f. the...

Documents