Post on 21-May-2020

High-speed elliptic-curve cryptography

D. J. Bernstein

Thanks to:

University of Illinois at Chicago

NSF CCR–9983950

Alfred P. Sloan Foundation

Define $p = 2^{255} - 19$; prime.

Define $A = 358990$. Define $\mathrm{Curve} : \mathbf{Z} \to \{0, 1, \ldots, p - 1\} \cup \{\infty\}$ by $n \mapsto$ $x$-coordinate of the $n$th multiple of $(2, \ldots)$ on the elliptic curve $y^2 = x^3 + Ax^2 + x$ over $\mathbf{F}_p$.

Main topic of this talk: Compute $n, \mathrm{Curve}(m) \mapsto \mathrm{Curve}(mn)$ in very few CPU cycles.

In particular, use floating point for fast arithmetic mod $p$.

Why cryptographers care

Each user has secret key $n$, public key $\mathrm{Curve}(n)$.

Users with secret keys $m, n$ exchange $\mathrm{Curve}(m), \mathrm{Curve}(n)$ through an authenticated channel; compute $\mathrm{Curve}(mn)$; hash it; use hash as shared secret to encrypt and authenticate messages.

Curve speed is important when number of messages is small.

Analogous system using $2^n \bmod p$: 1976 Diffie Hellman.

Using elliptic curves to avoid index-calculus attacks: 1986 Miller, 1987 Koblitz.

Using $x^3 + Ax^2 + x$ for speed: 1987 Montgomery (for ECM).

High precision from fp sums: 1968 Veltkamp, 1971 Dekker.

Speedups: 1999–2005 Bernstein.

Understanding CPU design

Computers are designed for music, movies, Photoshop, Doom 3, etc. Heavy use of fp arithmetic, i.e., approximate real arithmetic.

Example: Athlon, every cycle, does one add and one multiply of high-precision fp numbers.

Programmer paying attention to these CPU features can use them for cryptography.

A 53-bit fp number is a real number $2^e f$ with $e, f \in \mathbf{Z}$ and $|f| \le 2^{53}$.

Round each real number $r$ to closest 53-bit fp number, $\mathrm{fp}_{53}\, r$. Round halves to even.

Examples:

$\mathrm{fp}_{53}(8675309) = 8675309$;

$\mathrm{fp}_{53}(2^{127} + 8675309) = 2^{127}$;

$\mathrm{fp}_{53}(2^{127} - 8675309) = 2^{127}$.
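The three rounding examples can be reproduced in Python. This is a minimal sketch that assumes Python's `float` is an IEEE 754 binary64 value (53-bit significand, round to nearest with ties to even), which holds on mainstream platforms. The helper name `fp53` is illustrative and not from the talk.

```python
def fp53(r: int) -> float:
    """Round an integer to the nearest double, ties to even.

    Assumption: Python's float is IEEE 754 binary64 and int-to-float
    conversion is correctly rounded.
    """
    return float(r)


print(fp53(8675309) == 8675309)            # small integers are exact
print(fp53(2**127 + 8675309) == 2.0**127)  # low-order bits are rounded away
print(fp53(2**127 - 8675309) == 2.0**127)
```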

Typical CPU: UltraSPARC III.

Every cycle, UltraSPARC III can do one fp multiplication $a, b \mapsto \mathrm{fp}_{53}(ab)$ and one fp addition $a, b \mapsto \mathrm{fp}_{53}(a + b)$, subject to limits on $e$.

“4-cycle fp-operation latency”: Results available after 4 cycles.

Can substitute subtraction for addition. I’ll count subtractions as additions.

Some variation among CPUs.

PowerPC RS64 IV: One addition or one multiplication or one “fused” $a, b, c \mapsto \mathrm{fp}_{53}(ab + c)$. Results available after 4 cycles.

Athlon: $\mathrm{fp}_{64}$ instead of $\mathrm{fp}_{53}$; one multiplication and one addition. Results available after 4 cycles.

I’ll focus on UltraSPARC III. Not the most important CPU, but it’s a good warmup.

Exact dot products

If $a, b \in \{-2^{20}, \ldots, 0, 1, \ldots, 2^{20}\}$ then $ab$ is a 53-bit fp number so $ab = \mathrm{fp}_{53}(ab)$.

If $a, b, c, d \in \{-2^{20}, \ldots, 2^{20}\}$ then $ab, cd, ab + cd$ are 53-bit fp numbers so $ab = \mathrm{fp}_{53}(ab)$, $cd = \mathrm{fp}_{53}(cd)$, $ab + cd = \mathrm{fp}_{53}(ab + cd)$.

UltraSPARC III computes $a, b, c, d \mapsto ab + cd$ with two fp mults, one fp add.
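A quick way to see the exactness claim is to compare floating-point dot products against Python's exact integers. This is only a sketch: `dot2` and `check` are hypothetical helper names, and Python's `float` is assumed to be an IEEE 754 double.

```python
import random

LIMIT = 2**20


def dot2(a: int, b: int, c: int, d: int) -> float:
    """a*b + c*d using two float multiplications and one float addition."""
    return float(a) * float(b) + float(c) * float(d)


def check(trials: int = 1000, seed: int = 1) -> bool:
    """Compare dot2 with exact integer arithmetic on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b, c, d = (rng.randint(-LIMIT, LIMIT) for _ in range(4))
        if dot2(a, b, c, d) != a * b + c * d:
            return False
    return True


print(check())
print(dot2(LIMIT, LIMIT, LIMIT, LIMIT) == 2**41)  # largest case still fits in 53 bits
```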

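Bit extraction

Define $\alpha_i = 3 \cdot 2^{i+51}$,

$\mathrm{top}_i\, r = \mathrm{fp}_{53}(\mathrm{fp}_{53}(r + \alpha_i) - \alpha_i)$,

$\mathrm{bottom}_i\, r = \mathrm{fp}_{53}(r - \mathrm{top}_i\, r)$.

If $r$ is a 53-bit fp number and $|r| \le 2^{i+51}$ then $\mathrm{top}_i\, r \in 2^i \mathbf{Z}$; $|\mathrm{bottom}_i\, r| \le 2^{i-1}$; and $r = \mathrm{top}_i\, r + \mathrm{bottom}_i\, r$.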
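The definitions of top and bottom translate directly into two float additions each. The sketch below assumes Python floats are IEEE 754 doubles evaluated without extended intermediate precision (true for typical 64-bit builds). The function names mirror the slide notation but are otherwise illustrative.

```python
def top(i: int, r: float) -> float:
    """Round r to a multiple of 2**i by adding and subtracting 3 * 2**(i + 51)."""
    alpha = 3.0 * 2.0 ** (i + 51)
    return (r + alpha) - alpha


def bottom(i: int, r: float) -> float:
    """Remainder r - top(i, r); its magnitude is at most 2**(i - 1)."""
    return r - top(i, r)


r = 8675309.0
print(top(22, r), bottom(22, r))  # 8388608.0 286701.0
```

Adding `alpha` pushes the sum into the range where the spacing between consecutive doubles is exactly `2**i`, so the hardware rounding does the bit extraction.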

Big integers as fp sums

Every integer mod $2^{255} - 19$ can be written as a sum

$u_0 + u_{22} + u_{43} + u_{64} + u_{85} + u_{107} + u_{128} + u_{149} + u_{170} + u_{192} + u_{213} + u_{234}$

where $u_i / 2^i \in \{-2^{22}, \ldots, 2^{22}\}$.

Indices $i$ are $\lceil 255j/12 \rceil$ for $j \in \{0, 1, \ldots, 11\}$.

Representation is not unique; it’s not the input/output format. Uniqueness would cost cycles!
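As a rough illustration, the limb indices and one possible (nonnegative, hence not the only) decomposition can be computed as follows. `to_limbs` and `from_limbs` are hypothetical helpers for this sketch, not the talk's conversion routines.

```python
P = 2**255 - 19
# ceil(255 * j / 12) for j = 0..11, using integer arithmetic only
INDICES = [-(-255 * j // 12) for j in range(12)]


def to_limbs(n: int) -> list:
    """Split n mod P into 12 floats u_i, each a multiple of 2**i below 2**22 * 2**i."""
    n %= P
    limbs = []
    for j, i in enumerate(INDICES):
        nxt = INDICES[j + 1] if j + 1 < len(INDICES) else 255
        digit = (n >> i) & ((1 << (nxt - i)) - 1)  # 21 or 22 bits wide
        limbs.append(float(digit << i))            # exactly representable
    return limbs


def from_limbs(limbs) -> int:
    """Recover the integer mod P from the float limbs."""
    return sum(int(u) for u in limbs) % P


print(INDICES)
```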

Assume $u = \sum_i u_i$ as above, and similarly $v = \sum_j v_j$. Then $uv = w_0 + w_{22} + \cdots + w_{468}$

where $w_0 = u_0 v_0$,

$w_{22} = u_0 v_{22} + u_{22} v_0$,

$w_{43} = u_0 v_{43} + u_{22} v_{22} + u_{43} v_0$,

etc.

Each $w_k$ is a 53-bit fp number.

Given $u_i$’s and $v_j$’s, can compute $w_k$’s using 144 fp mults, 121 fp adds.
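The operation counts follow from a plain schoolbook product of two 12-limb operands: 144 products spread over 23 output coefficients, so 144 - 23 = 121 additions. A minimal sketch, with `schoolbook` as a hypothetical helper and Python floats standing in for fp53:

```python
INDICES = [-(-255 * j // 12) for j in range(12)]  # ceil(255 * j / 12)


def schoolbook(u, v):
    """Return the 23 coefficients w_k plus the multiplication and addition counts."""
    w = [None] * 23
    mults = adds = 0
    for a in range(12):
        for b in range(12):
            prod = u[a] * v[b]
            mults += 1
            if w[a + b] is None:
                w[a + b] = prod      # first term of this coefficient: no addition
            else:
                w[a + b] += prod
                adds += 1
    return w, mults, adds


# Limbs at the largest allowed magnitude, 2**22 * 2**i, and a small second operand.
u = [float((2**22) << i) for i in INDICES]
v = [float(3 << i) for i in INDICES]
w, mults, adds = schoolbook(u, v)
print(mults, adds)  # 144 121
```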

Furthermore, modulo $2^{255} - 19$, $uv \equiv r_0 + r_{22} + \cdots + r_{234}$

where $r_0 = w_0 + 19 \cdot 2^{-255} w_{255}$,

$r_{22} = w_{22} + 19 \cdot 2^{-255} w_{277}$, etc.

Each $r_i$ is a 53-bit fp number. Example: $r_0$ is an integer; $|r_0| \le 381 \cdot 2^{44}$.

Computing $r_i$’s from $w_k$’s takes 11 fp mults, 11 fp adds.

Structure: $(\mathbf{Z}[x] \cap \overline{\mathbf{Z}}[2^{255/12} x]) / (2^{255} x^{12} - 19) \to \mathbf{Z}/(2^{255} - 19)$.
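The constant 381 in the bound on $r_0$ can be rechecked by counting the terms that feed $w_0$ and $w_{255}$. This is a sketch of that arithmetic under the limb bound $|u_i| \le 2^{22} \cdot 2^i$; the function name is made up for the illustration.

```python
INDICES = [-(-255 * j // 12) for j in range(12)]  # ceil(255 * j / 12)


def r0_bound_in_units_of_2_44() -> int:
    """Upper bound on |r_0| / 2**44, given |u_i|, |v_i| <= 2**22 * 2**i."""
    bound = 1  # w_0 = u_0 * v_0 contributes at most 2**44
    # w_255 collects the products whose limb positions a and b sum to 12.
    # Each is at most 2**44 * 2**(i_a + i_b) and is scaled by 19 * 2**-255.
    for a in range(1, 12):
        b = 12 - a
        bound += 19 * 2 ** (INDICES[a] + INDICES[b] - 255)
    return bound


print(r0_bound_in_units_of_2_44())  # 381
```

Since 381 is below 512, the bound $381 \cdot 2^{44}$ stays under $2^{53}$, which is why $r_0$ still fits in a 53-bit fp number.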

Carries

“Carry from $r_0$ to $r_{22}$”: replace $r_0$ and $r_{22}$ by $\mathrm{bottom}_{22}\, r_0$ and $r_{22} + \mathrm{top}_{22}\, r_0$.

This takes 4 fp adds, and guarantees $|r_0| \le 2^{21}$.

Series of 13 carries puts all $r_i$’s in range for subsequent products: from $r_{192}$ to $r_{213}$ to $r_{234}$ to $w_{255}$; then from $r_0$ to $r_{22}$ to $r_{43}$ to $\cdots$ to $r_{192}$ to $r_{213}$.

This takes 52 fp adds.
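A single carry step can be sketched with the top/bottom trick, using four float additions. The names `top` and `carry` and the sample values are illustrative assumptions, and Python floats are assumed to behave as IEEE 754 doubles.

```python
def top(i: int, r: float) -> float:
    """Round r to a multiple of 2**i (two float additions)."""
    alpha = 3.0 * 2.0 ** (i + 51)
    return (r + alpha) - alpha


def carry(lo: float, hi: float, i_hi: int):
    """Carry from limb lo into limb hi, whose index is i_hi (4 float additions)."""
    t = top(i_hi, lo)      # part of lo that belongs to the next limb
    return lo - t, hi + t  # (bottom of lo, updated hi)


r0 = float(381 * 2**44 - 12345)  # an integer near the largest r_0 allowed
r22 = float(7 << 22)             # some multiple of 2**22
new0, new22 = carry(r0, r22, 22)
print(abs(new0) <= 2**21, int(new0) + int(new22) == int(r0) + int(r22))
```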

Total 155 mults, 184 adds to multiply modulo $2^{255} - 19$ in this representation.

$\ge 184$ UltraSPARC III cycles.

$= 184$ cycles? Two obstacles: fp-operation latency; “load/store” latency imposed by limited number of “registers.”

Schedule instructions carefully to bring cycles down to 184.

Have developed qhasm, new programming language for high-speed computations.

Includes range verification, guided register allocation, etc.

Lets me write desired code with much less human time than traditional asm, C compiler, etc.

Have also used for fast AES, fast Poly1305, fast Salsa20, etc.; see, e.g., http://cr.yp.to/mac/poly1305_athlon.s.

Speedup: Squarings

Often know in advance that $u = v$.

$u_0 u_{64} + u_{22} u_{43} + u_{43} u_{22} + u_{64} u_0$ is more efficiently computed as $2(u_0 u_{64} + u_{22} u_{43})$.

Even better: First compute $2u_0, 2u_{22}, \ldots, 2u_{234}$ and then compute $(2u_0) u_{64} + (2u_{22}) u_{43}$ etc.

130 fp adds instead of 184.

Makes carry time even more visible.
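One accounting that reproduces the figure of 130 additions is sketched below: 12 additions to double the limbs, 55 additions inside the 23 coefficients, then the same 11 reduction additions and 52 carry additions as for a general product. The split is an assumption made for this illustration, and `square_coeffs` is a hypothetical helper.

```python
def square_coeffs(u):
    """Coefficients of u*u from 12 limbs, using pre-doubled limbs for cross terms."""
    doubled = [x + x for x in u]  # 12 additions
    adds = len(u)
    w = [None] * 23
    for a in range(12):
        for b in range(a, 12):
            term = u[a] * u[a] if a == b else doubled[a] * u[b]
            if w[a + b] is None:
                w[a + b] = term
            else:
                w[a + b] += term
                adds += 1
    return w, adds


u = list(range(1, 13))  # small integer stand-ins for the limbs
w, adds = square_coeffs(u)
print(adds, adds + 11 + 52)  # 67 130
```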

Speedup: Karatsuba’s method

Say $U_0 = u_0 + u_{22} x + \cdots + u_{107} x^5$,

$U_1 = u_{128} + u_{149} x + \cdots + u_{234} x^5$,

$V_0 = v_0 + \cdots$, $V_1 = v_{128} + \cdots$.

Original, 184 adds: Product is $U_0 V_0 + (U_0 V_1 + U_1 V_0) x^6 + U_1 V_1 x^{12}$.

Karatsuba, 182 adds: $((U_0 + U_1)(V_0 + V_1) - U_0 V_0 - U_1 V_1) x^6 + U_0 V_0 + U_1 V_1 x^{12}$.

Improved Karatsuba, 177 adds: $(U_0 + U_1)(V_0 + V_1) x^6 + (U_0 V_0 - U_1 V_1 x^6)(1 - x^6)$.
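The improved Karatsuba identity can be checked on random polynomials with integer coefficients. The sketch verifies only the algebraic identity, not the addition counts; all helper names are hypothetical.

```python
import random


def pmul(f, g):
    """Schoolbook polynomial product; coefficient lists, lowest degree first."""
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out


def padd(f, g, sign=1):
    """f + sign * g, padding the shorter list with zeros."""
    n = max(len(f), len(g))
    f = f + [0] * (n - len(f))
    g = g + [0] * (n - len(g))
    return [a + sign * b for a, b in zip(f, g)]


def shift(f, k):
    """Multiply by x**k."""
    return [0] * k + f


def improved_karatsuba(U0, U1, V0, V1):
    """(U0 + U1)(V0 + V1) x^6 + (U0 V0 - U1 V1 x^6)(1 - x^6)."""
    mid = pmul(padd(U0, U1), padd(V0, V1))
    t = padd(pmul(U0, V0), shift(pmul(U1, V1), 6), sign=-1)
    return padd(shift(mid, 6), padd(t, shift(t, 6), sign=-1))


def check(trials=100, seed=1):
    rng = random.Random(seed)
    for _ in range(trials):
        u = [rng.randint(-2**22, 2**22) for _ in range(12)]
        v = [rng.randint(-2**22, 2**22) for _ in range(12)]
        if improved_karatsuba(u[:6], u[6:], v[:6], v[6:]) != pmul(u, v):
            return False
    return True


print(check())
```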

The Curve function

Overall strategy to compute $n, \mathrm{Curve}(m) \mapsto \mathrm{Curve}(mn)$, using arithmetic mod $p = 2^{255} - 19$:

For various integers $i$, find $x_i, z_i$ such that $\mathrm{Curve}(im) \equiv x_i / z_i \pmod p$, i.e., $z_i\, \mathrm{Curve}(im) \equiv x_i \pmod p$.

e.g. $x_1 = \mathrm{Curve}(m)$, $z_1 = 1$, assuming $\mathrm{Curve}(m) \ne \infty$.

Can easily restrict $n, \mathrm{Curve}(m)$ to ensure that $\infty$ never appears.

We’ll see how to compute $x_i, z_i \mapsto x_{2i}, z_{2i}$; and $x_i, z_i, x_{i+1}, z_{i+1}, \mathrm{Curve}(m) \mapsto x_{2i+1}, z_{2i+1}$.

Combine to compute $x_i, z_i, x_{i+1}, z_{i+1}, b, \mathrm{Curve}(m) \mapsto x_j, z_j, x_{j+1}, z_{j+1}$ where $i = \lfloor j/2 \rfloor$, $b = j \bmod 2$.

Conditional branches and input-dependent load addresses can leak via timing. Replace with arithmetic: e.g., $(1 - b) x_i + (b) x_{i+1}$.
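The arithmetic selection can be written as a one-line function. This sketch only illustrates the formula: Python integers are not constant-time, so it makes no timing claims, and `select` is a hypothetical name.

```python
def select(b: int, x_i: int, x_next: int) -> int:
    """Return x_i when b == 0 and x_next when b == 1, without branching on b."""
    return (1 - b) * x_i + b * x_next


print(select(0, 5, 9), select(1, 5, 9))  # 5 9
```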

Eventually reach $i = n$. Divide $x_n$ by $z_n$ modulo $p$ to obtain $\mathrm{Curve}(mn)$.

Simple division method: Fermat! $x_n, z_n \mapsto x_n z_n^{p-2}$.

Euclid-type division methods are faster but have input-dependent timings.

Finally convert from floating-point representation to byte-string output format.
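Fermat's little theorem gives $z^{p-2} \equiv z^{-1} \pmod p$ for nonzero $z$, so the division reduces to a modular exponentiation. A minimal sketch using Python's built-in three-argument `pow` (the talk's implementation works on the floating-point representation instead); `divide` is an illustrative name.

```python
P = 2**255 - 19


def divide(x: int, z: int) -> int:
    """x / z modulo P, computed as x * z**(P - 2) mod P (Fermat inversion)."""
    return x * pow(z, P - 2, P) % P


print(divide(10, 5))                # 2
print(divide(3, 7) * 7 % P == 3)    # the quotient times z gives back x
```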

From $n$ to $2n$

In $\mathbf{Z}/p$:

$x_{2n} = (x_n^2 - z_n^2)^2$,

$z_{2n} = 4 x_n z_n (x_n^2 + A x_n z_n + z_n^2)$.

Compute as follows:

$(x_n - z_n)^2$; $(x_n + z_n)^2$;

$x_{2n} = (x_n - z_n)^2 (x_n + z_n)^2$;

$4 x_n z_n = (x_n + z_n)^2 - (x_n - z_n)^2$;

$(A - 2) x_n z_n = 89747 \cdot 4 x_n z_n$;

$z_{2n} = 4 x_n z_n ((x_n + z_n)^2 + (A - 2) x_n z_n)$.
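The doubling sequence above can be written out step by step. This is a sketch with Python integers modulo $p$ standing in for the talk's floating-point arithmetic; the function name `double` and the intermediate names are hypothetical. The constant 89747 is $(A - 2)/4$ for $A = 358990$.

```python
P = 2**255 - 19
A = 358990
A24 = 89747  # (A - 2) / 4


def double(xn, zn, p=P):
    """Map (x_n, z_n) to (x_2n, z_2n) using the sequence from the slide."""
    d2 = (xn - zn) ** 2 % p            # (x_n - z_n)^2
    s2 = (xn + zn) ** 2 % p            # (x_n + z_n)^2
    x2n = d2 * s2 % p                  # (x_n^2 - z_n^2)^2
    four_xz = (s2 - d2) % p            # 4 x_n z_n
    am2_xz = A24 * four_xz % p         # (A - 2) x_n z_n
    z2n = four_xz * (s2 + am2_xz) % p  # 4 x_n z_n (x_n^2 + A x_n z_n + z_n^2)
    return x2n, z2n
```

Written this way the step costs two squarings, two general multiplications and one multiplication by the small constant 89747, instead of evaluating the defining formulas term by term.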

From $n, n+1$ to $2n+1$

$x_{2n+1} = 4 (x_n x_{n+1} - z_n z_{n+1})^2$,

$z_{2n+1} = 4 (x_n z_{n+1} - z_n x_{n+1})^2\, \mathrm{Curve}(b)$.

Compute as follows:

$(x_n - z_n)(x_{n+1} + z_{n+1})$;

$(x_n + z_n)(x_{n+1} - z_{n+1})$;

$2 (x_n x_{n+1} - z_n z_{n+1})$ = sum;

$2 (x_n z_{n+1} - z_n x_{n+1})$ = difference;

$x_{2n+1} = (2 (x_n x_{n+1} - z_n z_{n+1}))^2$;

$(2 (x_n z_{n+1} - z_n x_{n+1}))^2$;

$z_{2n+1} = (2 (x_n z_{n+1} - z_n x_{n+1}))^2\, \mathrm{Curve}(b)$.
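The following sketch puts the addition step together with the doubling step and the arithmetic selection into one per-bit loop. Several details are assumptions rather than anything stated on the slides: the starting pair $(x_0, z_0) = (1, 0)$, $(x_1, z_1) = (q, 1)$ with $q = \mathrm{Curve}(b)$, processing the bits of $a$ from most significant to least, the fixed length of 256 bits, and all function names. Python integers modulo $p$ replace the floating-point arithmetic.

```python
P = 2**255 - 19
A24 = 89747  # (A - 2) / 4 for A = 358990


def double(x, z):
    """(x_n, z_n) -> (x_2n, z_2n)."""
    d2, s2 = (x - z) ** 2 % P, (x + z) ** 2 % P
    t = (s2 - d2) % P                       # 4 x_n z_n
    return d2 * s2 % P, t * (s2 + A24 * t) % P


def diff_add(xn, zn, xn1, zn1, q):
    """(x_n, z_n, x_n+1, z_n+1, q) -> (x_2n+1, z_2n+1)."""
    u = (xn - zn) * (xn1 + zn1) % P
    v = (xn + zn) * (xn1 - zn1) % P
    # u + v = 2(x_n x_n+1 - z_n z_n+1); u - v = 2(x_n z_n+1 - z_n x_n+1)
    return (u + v) ** 2 % P, (u - v) ** 2 * q % P


def ladder(a, q, bits=256):
    """Return x_a / z_a mod p, starting from index 0 (assumed setup)."""
    x0, z0, x1, z1 = 1, 0, q % P, 1
    for i in reversed(range(bits)):
        c = (a >> i) & 1
        # Branch-free choice of the doubling input: index m or m + 1.
        xd, zd = (1 - c) * x0 + c * x1, (1 - c) * z0 + c * z1
        xs, zs = diff_add(x0, z0, x1, z1, q)   # index 2m + 1
        xe, ze = double(xd, zd)                # index 2m + 2c
        # c = 0: new pair is (double, sum); c = 1: new pair is (sum, double).
        x0, z0 = (1 - c) * xe + c * xs, (1 - c) * ze + c * zs
        x1, z1 = (1 - c) * xs + c * xe, (1 - c) * zs + c * ze
    return x0 * pow(z0, P - 2, P) % P          # Fermat division
```

Every iteration performs one doubling and one addition regardless of the bit, which is the property the timing discussion asks for. A quick sanity check is that `ladder(a, ladder(b, 2))` and `ladder(b, ladder(a, 2))` agree, as the key exchange requires.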

Total time

Slightly over 1600 fp adds

(520 from carries)

for each bit of $a$.

Total for 256-bit $a$:

413000 fp adds; plus

50000 fp adds for final division.

Aiming for 500000 cycles.

Still have to finish software.

Should end up even faster than

my NIST P-224 software,

despite 14% more bits!
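The figures on this slide can be cross-checked with a few lines of arithmetic. The variable names are illustrative, and reading "14% more bits" as 255 (or 256) bits against 224 is an assumption about what the comparison refers to.

```python
adds_per_bit = 413_000 / 256           # "slightly over 1600" per bit
carry_share = 520 / 1600               # fraction of per-bit adds spent on carries
total_adds = 413_000 + 50_000          # main loop plus final division
adds_per_cycle_budget = total_adds / 500_000

extra_bits_255 = 255 / 224 - 1         # size of p against P-224
extra_bits_256 = 256 / 224 - 1         # 256-bit scalar against 224

print(round(adds_per_bit), round(100 * carry_share, 1), total_adds)
print(round(adds_per_cycle_budget, 3))
print(round(100 * extra_bits_255), round(100 * extra_bits_256))
```

The per-bit count comes to about 1613 fp adds, the two totals sum to 463000, and both readings of the bit-length comparison round to 14%.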
