
AES’ [and other Block Ciphers] Implementation Tricks

Cryptographic algorithms Basic primitives

Survey by Stephen et al, LNCS 1482, Sep. 98

General Structure of a Block Cipher

Useful Properties for Implementing Block Ciphers

Bit-wise operations (XOR, AND, OR, etc.)

LUT: 4 inputs x 1 output

X = a + b
Y = X + c
Z = Y + d

Z = a + b + c + d
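The collapse of the three chained operations into one 4-input look-up can be sketched as follows. This is only an illustration: it assumes "+" stands for a bit-wise operation such as XOR (the slide does not say which), and the names `chained`, `LUT4` and `lut4` are made up for the example.

```python
from itertools import product


def chained(a, b, c, d):
    """Three dependent 2-input operations, as in X, Y, Z on the slide."""
    x = a ^ b      # assumption: '+' is read as XOR
    y = x ^ c
    z = y ^ d
    return z


# Truth table of a single 4-input, 1-output LUT: 16 one-bit entries.
LUT4 = {bits: chained(*bits) for bits in product((0, 1), repeat=4)}


def lut4(a, b, c, d):
    """One table look-up replaces the whole chain."""
    return LUT4[(a, b, c, d)]
```

Any other 4-input Boolean function would fill the same 16-entry table, which is why the choice of operator does not change the cost.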

Useful Properties for Implementing Block Ciphers

Substitution

Useful Properties for Implementing Block Ciphers

Permutation

Permutation = [1, 5, 4, 3, 2, 6]
Change of wires: free of cost
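In software the same permutation is an index table. The sketch below assumes that output position i is wired to the input named by the i-th table entry (1-based); the slide does not state the direction, and for this particular table both readings give the same result.

```python
PERMUTATION = [1, 5, 4, 3, 2, 6]  # 1-based wire indices from the slide


def permute(wires, table=PERMUTATION):
    """Output position i is connected to input wire table[i] (assumed direction)."""
    return [wires[src - 1] for src in table]


print(permute(["w1", "w2", "w3", "w4", "w5", "w6"]))
```

In hardware no table exists at all: the list only records which wire goes where.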

Useful Properties for Implementing Block Ciphers

Shift & rotation

IN[31:0] = A[31:24], B[23:16], C[15:8], D[7:0]

[Figure: shifting (IN[24:0] wired into OUT[31:0]) and 8-bit rotation of the bytes A, B, C, D]

• Cost free operations
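A byte-wise shift or rotation only reorders the four bytes A, B, C, D. A minimal sketch, assuming a left shift and a left rotation (the direction is not recoverable from the slide) and using illustrative function names:

```python
def split_bytes(word):
    """Return (A, B, C, D) = IN[31:24], IN[23:16], IN[15:8], IN[7:0]."""
    return tuple((word >> s) & 0xFF for s in (24, 16, 8, 0))


def join_bytes(a, b, c, d):
    """Reassemble four bytes into a 32-bit word."""
    return (a << 24) | (b << 16) | (c << 8) | d


def rotl8(word):
    """8-bit left rotation: ABCD -> BCDA, a pure reordering of bytes."""
    a, b, c, d = split_bytes(word)
    return join_bytes(b, c, d, a)


def shl8(word):
    """8-bit left shift: ABCD -> BCD0."""
    a, b, c, d = split_bytes(word)
    return join_bytes(b, c, d, 0)
```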

Useful Properties for Implementing Block Ciphers

Iterative nature

[Figure: three ways to organize the n rounds between IN and Out.
Iterative: one round whose latched output (CE, CLK) is fed back through a Select input.
Pipeline: 1st Round, 2nd Round, ..., nth Round, each followed by a latch (CE, CLK).
Sub-Pipeline: 1st Round, 2nd Round, ..., nth Round with two sets of latches, clocked by CLK1 (with CE) and CLK2.]

Useful Properties for Implementing Block Ciphers

Parallelism

Three cycles (X, then Y, then Z):
X = a + b
Y = X + c
Z = Y + d

One cycle (X, Y and Z together):
X = a + b
Y = a + b + c
Z = a + b + c + d

How do FPGA implementations speed up encryption?

• Lots of permutation operations. Is there any difficulty?
• Is substitution a problem?

Example for DES Implementation on FPGA

right <= ip(56) & ip(48) & ip(40) & ip(32) & ip(24) & ip(16) & ip(8)  & ip(0)
       & ip(58) & ip(50) & ip(42) & ip(34) & ip(26) & ip(18) & ip(10) & ip(2)
       & ip(60) & ip(52) & ip(44) & ip(36) & ip(28) & ip(20) & ip(12) & ip(4)
       & ip(62) & ip(54) & ip(46) & ip(38) & ip(30) & ip(22) & ip(14) & ip(6);

& is the concatenation operator; ip[63:0] is the input and right is the permuted output.

Permutations in Hardware (FPGA)
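The VHDL assignment is a fixed list of taps into ip. The Python sketch below copies that list and checks that it follows a regular pattern; `RIGHT_TAPS`, `right_half` and `regular` are illustrative names, not part of the original design.

```python
# Indices copied from the VHDL assignment (ip is treated as ip[63:0]).
RIGHT_TAPS = [56, 48, 40, 32, 24, 16, 8, 0,
              58, 50, 42, 34, 26, 18, 10, 2,
              60, 52, 44, 36, 28, 20, 12, 4,
              62, 54, 46, 38, 30, 22, 14, 6]


def right_half(ip_bits):
    """ip_bits[i] plays the role of ip(i); returns the 32 selected bits, leftmost first."""
    return [ip_bits[i] for i in RIGHT_TAPS]


# The same taps generated from their pattern: four runs stepping down by 8.
regular = [start - 8 * k for start in (56, 58, 60, 62) for k in range(8)]
```

Each output bit is a direct connection to one input bit, so the permutation costs routing only.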

S1: 64 x 4 = 256 bits
S2: 64 x 4 = 256 bits
S3: 64 x 4 = 256 bits
S4: 64 x 4 = 256 bits
S5: 64 x 4 = 256 bits
S6: 64 x 4 = 256 bits
S7: 64 x 4 = 256 bits
S8: 64 x 4 = 256 bits

Total: 2048 bits = 2K

CLB slices in memory mode = 4 x 8 = 32 CLB slices
Using selected BRAMs: Virtex series devices contain more than 280 BRAMs of 4K each

Substitution in Hardware (FPGA)

DES implementation in Hardware (FPGA)

Author                     Device      CLB Slices   Allowed Freq. (MHz)   Throughput (Mbits/s)
Biham (software) 1997      Alpha 8400  -            300                   127
Wong et al 1998            XC4020E     438          10                    26.7
Kaps and Paar 1998         XC4028EX    741          25.18                 402.7
Free-DES 2000              XCV400      5263         47.7                  3052
McLoony 2003               XCV1000     6446         59.5                  3808
Sandia Laboratories 1999   ASIC        -            -                     9280
Patterson 2000 (Jbits)     XCV150      1584         168                   10752
This work                  XCV312      165          68.05                 274

Does the same hold for other block ciphers?

AES

[Figure: AES block with a 128-bit Plain Text input, a 128-bit Key input and a 128-bit Cipher Text output]

AES Processes: Key Scheduling, Encryption, Decryption

Rijndael: Advanced Encryption Standard

The Rijndael block cipher algorithm has been chosen by NIST as the Advanced Encryption Standard
– 128, 192 and 256 bit block-length
– When it is called AES, it means block length of 128 bits only

FPGA AES implementations:
– Single encryptor:
  • Dandalis, …, Elbirt, … & Gaj, …: 2000
– Full encryptor/decryptor:
  • McLoone & McCanny 2001, CHES 2001
– 3.2 Gbps

AES Encryption Algorithm Flow

BS: Byte Substitution
SR: Shift Rows
MC: Mix Column
ARK: Add Round Key

[Figure: IN → ARK (USER KEY) → BS → SR → MC → ARK (SUB KEY), repeated ROUND-1 times → BS → SR → ARK (SUB KEY) → OUT]

Selection of rounds

AES

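\[
b_{0}\, b_{1}\, b_{2} \dots b_{15}
\quad\longrightarrow\quad
\begin{bmatrix}
b_{0} & b_{1} & b_{2} & b_{3} \\
b_{4} & b_{5} & b_{6} & b_{7} \\
b_{8} & b_{9} & b_{10} & b_{11} \\
b_{12} & b_{13} & b_{14} & b_{15}
\end{bmatrix}
\]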

Input = 128 bits = 16 bytes

Both plaintext and key are arranged into 4 x 4 matrix

State Matrix

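Key Scheduling

[Figure labels: RoundKey 0, RoundKey 1, RoundKey 3, ..., RoundKey 10]

User key:
\[
\begin{bmatrix}
k_{0} & k_{1} & k_{2} & k_{3} \\
k_{4} & k_{5} & k_{6} & k_{7} \\
k_{8} & k_{9} & k_{10} & k_{11} \\
k_{12} & k_{13} & k_{14} & k_{15}
\end{bmatrix}
\]

Generated keys:
\[
\begin{bmatrix}
k_{16} & k_{17} & k_{18} & k_{19} \\
k_{20} & k_{21} & k_{22} & k_{23} \\
k_{24} & k_{25} & k_{26} & k_{27} \\
k_{28} & k_{29} & k_{30} & k_{31}
\end{bmatrix}
\;\cdots\;
\begin{bmatrix}
k_{160} & k_{161} & k_{162} & k_{163} \\
k_{164} & k_{165} & k_{166} & k_{167} \\
k_{168} & k_{169} & k_{170} & k_{171} \\
k_{172} & k_{173} & k_{174} & k_{175}
\end{bmatrix}
\]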

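1. Byte Substitution

\[
\begin{bmatrix}
a_{0,0} & a_{0,1} & a_{0,2} & a_{0,3} \\
a_{1,0} & a_{1,1} & a_{1,2} & a_{1,3} \\
a_{2,0} & a_{2,1} & a_{2,2} & a_{2,3} \\
a_{3,0} & a_{3,1} & a_{3,2} & a_{3,3}
\end{bmatrix}
\xrightarrow{\ \text{S-BOX } 16 \times 16\ }
\begin{bmatrix}
b_{0,0} & b_{0,1} & b_{0,2} & b_{0,3} \\
b_{1,0} & b_{1,1} & b_{1,2} & b_{1,3} \\
b_{2,0} & b_{2,1} & b_{2,2} & b_{2,3} \\
b_{3,0} & b_{3,1} & b_{3,2} & b_{3,3}
\end{bmatrix}
\]

State Matrix

[Figure: position of BS in the round: BS → SR → MC → ARK (SUB KEY)]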

[Figure: IN → 16 BRAMs → OUT]

16 BRAMs of 256 x 8

Byte Substitution

2. ShiftRow (SR)

\[
\begin{bmatrix}
a & b & c & d \\
e & f & g & h \\
i & j & k & l \\
m & n & o & p
\end{bmatrix}
\xrightarrow{\ \text{SR}\ }
\begin{bmatrix}
a & b & c & d \\
f & g & h & e \\
k & l & i & j \\
p & m & n & o
\end{bmatrix}
\qquad
\begin{matrix}
\text{Offset 0} \\ \text{Offset 1} \\ \text{Offset 2} \\ \text{Offset 3}
\end{matrix}
\]

The shifted state is the input to MC (or IMC).

[Figure: ShiftRow as a rewiring of the bytes a ... p from IN to OUT]
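The row offsets 0, 1, 2, 3 translate directly into list rotations. A small sketch on the letters a ... p used in the slide; `shift_rows` and `inv_shift_rows` are illustrative names.

```python
def shift_rows(state):
    """Rotate row r of a 4x4 state left by r positions (offsets 0, 1, 2, 3)."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def inv_shift_rows(state):
    """Rotate row r right by r positions, undoing shift_rows."""
    return [row[-r:] + row[:-r] if r else row[:] for r, row in enumerate(state)]


state = [list("abcd"), list("efgh"), list("ijkl"), list("mnop")]
shifted = shift_rows(state)
```

As with the permutation slides, the hardware version is pure wiring; the code only documents which byte ends up where.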

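3. MixColumn (MC) & Inv MixColumn (IMC)

Every entry is represented in GF(2^8).

MC:
\[
\begin{bmatrix} c'_{0,i} \\ c'_{1,i} \\ c'_{2,i} \\ c'_{3,i} \end{bmatrix}
=
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
\begin{bmatrix} c_{0,i} \\ c_{1,i} \\ c_{2,i} \\ c_{3,i} \end{bmatrix}
\quad \text{in } GF(2^8), \qquad i = 0, 1, 2, 3
\]

IMC:
\[
\begin{bmatrix} c'_{0,i} \\ c'_{1,i} \\ c'_{2,i} \\ c'_{3,i} \end{bmatrix}
=
\begin{bmatrix}
0E & 0B & 0D & 09 \\
09 & 0E & 0B & 0D \\
0D & 09 & 0E & 0B \\
0B & 0D & 09 & 0E
\end{bmatrix}
\begin{bmatrix} c_{0,i} \\ c_{1,i} \\ c_{2,i} \\ c_{3,i} \end{bmatrix}
\quad \text{in } GF(2^8), \qquad i = 0, 1, 2, 3
\]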

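4. AddRoundKey (ARK)

\[
\begin{bmatrix}
a_{0,0} & a_{0,1} & a_{0,2} & a_{0,3} \\
a_{1,0} & a_{1,1} & a_{1,2} & a_{1,3} \\
a_{2,0} & a_{2,1} & a_{2,2} & a_{2,3} \\
a_{3,0} & a_{3,1} & a_{3,2} & a_{3,3}
\end{bmatrix}
\oplus
\underbrace{\begin{bmatrix}
k_{0,0} & k_{0,1} & k_{0,2} & k_{0,3} \\
k_{1,0} & k_{1,1} & k_{1,2} & k_{1,3} \\
k_{2,0} & k_{2,1} & k_{2,2} & k_{2,3} \\
k_{3,0} & k_{3,1} & k_{3,2} & k_{3,3}
\end{bmatrix}}_{\text{key}}
=
\begin{bmatrix}
b_{0,0} & b_{0,1} & b_{0,2} & b_{0,3} \\
b_{1,0} & b_{1,1} & b_{1,2} & b_{1,3} \\
b_{2,0} & b_{2,1} & b_{2,2} & b_{2,3} \\
b_{3,0} & b_{3,1} & b_{3,2} & b_{3,3}
\end{bmatrix}
\]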

Novel techniques for implementing AES round transformation

Steps
• Key schedule
• S-Box & Inv. S-Box
• MC & Inv. MC

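Key Schedule

\[
\begin{bmatrix}
k_{0} & k_{4} & k_{8} & k_{12} \\
k_{1} & k_{5} & k_{9} & k_{13} \\
k_{2} & k_{6} & k_{10} & k_{14} \\
k_{3} & k_{7} & k_{11} & k_{15}
\end{bmatrix}
\longrightarrow
\begin{bmatrix}
k'_{0} & k'_{4} & k'_{8} & k'_{12} \\
k'_{1} & k'_{5} & k'_{9} & k'_{13} \\
k'_{2} & k'_{6} & k'_{10} & k'_{14} \\
k'_{3} & k'_{7} & k'_{11} & k'_{15}
\end{bmatrix}
\]

Four steps (Step 1, Step 2, Step 3, Step 4):
\[
\begin{array}{ll}
\text{Step 1:} & k'_{0} = k_{0} \oplus Sbox(k_{13}) \oplus rcon \\
\text{Step 2:} & k'_{4} = k_{4} \oplus k'_{0} \\
\text{Step 3:} & k'_{8} = k_{8} \oplus k'_{4} \\
\text{Step 4:} & k'_{12} = k_{12} \oplus k'_{8}
\end{array}
\]

Two steps (Step 1, Step 2):
\[
\begin{array}{ll}
\text{Step 1:} & k'_{0} = k_{0} \oplus Sbox(k_{13}) \oplus rcon \\
\text{Step 2:} & k'_{4} = k_{4} \oplus k'_{0}, \quad
k'_{8} = k_{8} \oplus k_{4} \oplus k'_{0}, \quad
k'_{12} = k_{12} \oplus k_{8} \oplus k_{4} \oplus k'_{0}
\end{array}
\]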
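The four-step and two-step forms of the key schedule can be compared in a short Python sketch. The slide gives only the first row of the key matrix; extending it to rows 1 to 3 (S-box input k14, k15, k12, no rcon) follows the standard AES-128 key expansion and is an assumption here, as are all function names. The S-box is computed rather than stored, to keep the block self-contained.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def sbox(x):
    """AES S-box: multiplicative inverse followed by the affine transform."""
    inv = next(y for y in range(1, 256) if gf_mul(x, y) == 1) if x else 0
    out = 0
    for i in range(8):
        bit = (inv >> i) ^ (inv >> ((i + 4) % 8)) ^ (inv >> ((i + 5) % 8)) \
            ^ (inv >> ((i + 6) % 8)) ^ (inv >> ((i + 7) % 8)) ^ (0x63 >> i)
        out |= (bit & 1) << i
    return out


def next_round_key(k, rcon, parallel=False):
    """k holds the 16 bytes k0..k15; returns k'0..k'15."""
    out = [0] * 16
    for r in range(4):                        # one row of the key matrix at a time
        k0, k4, k8, k12 = k[r], k[r + 4], k[r + 8], k[r + 12]
        rot = k[12 + (r + 1) % 4]             # k13 for row 0
        n0 = k0 ^ sbox(rot) ^ (rcon if r == 0 else 0)   # Step 1
        n4 = k4 ^ n0                          # Step 2
        if parallel:                          # still Step 2: no waiting on n4 or n8
            n8 = k8 ^ k4 ^ n0
            n12 = k12 ^ k8 ^ k4 ^ n0
        else:                                 # Steps 3 and 4: each waits for the previous
            n8 = k8 ^ n4
            n12 = k12 ^ n8
        out[r], out[r + 4], out[r + 8], out[r + 12] = n0, n4, n8, n12
    return out
```

Both forms give the same round key; the two-step form trades a few extra XOR gates for a shorter dependency chain.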

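Byte Substitution (BS)
• Look-up table method
• Composite Field approach

[Figure: S-BOX = MI followed by AF; INV S-BOX = IAF followed by MI. With an E/D select on IN, a single MI block is shared between IAF (before) and AF (after). All operations in GF(2^8).]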
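The decomposition into MI, AF and IAF can be checked with a small Python sketch. The affine constants are those of the AES standard; the table `MI` and the function names are illustrative, and the brute-force inverse search is only there to keep the example short.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


# One 256-entry multiplicative-inverse table, shared by both directions (MI[0] = 0).
MI = [0] + [next(y for y in range(1, 256) if gf_mul(x, y) == 1) for x in range(1, 256)]


def rotl(b, n):
    """Rotate a byte left by n bits."""
    return ((b << n) | (b >> (8 - n))) & 0xFF


def af(b):
    """Affine transform (AF), applied after MI in the S-box."""
    return b ^ rotl(b, 1) ^ rotl(b, 2) ^ rotl(b, 3) ^ rotl(b, 4) ^ 0x63


def iaf(b):
    """Inverse affine transform (IAF), applied before MI in the inverse S-box."""
    return rotl(b, 1) ^ rotl(b, 3) ^ rotl(b, 6) ^ 0x05


def s_box(x):
    return af(MI[x])


def inv_s_box(x):
    return MI[iaf(x)]
```

Only `MI` needs memory; AF and IAF are XORs of rotated bits, which is what allows one table to serve both encryption and decryption.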

Byte Substitution (BS) (MI manipulation)

• Look-up table method
Two methods to construct the S-Box using the look-up table method:
1. Using distributed memory
2. Using built-in memories called BRAMs

• Composite Field GF(((2^2)^2)^2)
1. Map the element A ∈ GF(2^8) to a composite field F
2. Compute the multiplicative inverse over the field F
3. Map back from field F to GF(2^8)

S. Morioka and A. Satoh, CHES 2002

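MixColumn (MC)

\[
\begin{bmatrix} a'[0] \\ a'[1] \\ a'[2] \\ a'[3] \end{bmatrix}
=
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
\begin{bmatrix} a[0] \\ a[1] \\ a[2] \\ a[3] \end{bmatrix}
=
\begin{bmatrix}
02\,a[0] \oplus 03\,a[1] \oplus 01\,a[2] \oplus 01\,a[3] \\
01\,a[0] \oplus 02\,a[1] \oplus 03\,a[2] \oplus 01\,a[3] \\
01\,a[0] \oplus 01\,a[1] \oplus 02\,a[2] \oplus 03\,a[3] \\
03\,a[0] \oplus 01\,a[1] \oplus 01\,a[2] \oplus 02\,a[3]
\end{bmatrix}
\quad \text{in } GF(2^8)
\]

Computation with xtime (02 × v):
\[
\begin{array}{lll}
t = a[0] \oplus a[1] \oplus a[2] \oplus a[3] & & \\
v = a[0] \oplus a[1], & v = xtime(v), & a'[0] = a[0] \oplus v \oplus t \\
v = a[1] \oplus a[2], & v = xtime(v), & a'[1] = a[1] \oplus v \oplus t \\
v = a[2] \oplus a[3], & v = xtime(v), & a'[2] = a[2] \oplus v \oplus t \\
v = a[3] \oplus a[0], & v = xtime(v), & a'[3] = a[3] \oplus v \oplus t
\end{array}
\]

For the first byte this expands to:
\[
a'[0] = a[0] \oplus 02\,(a[0] \oplus a[1]) \oplus (a[0] \oplus a[1] \oplus a[2] \oplus a[3])
\]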
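The t / v formulation of MixColumn maps directly onto a few lines of Python. The sketch assumes the AES reduction polynomial x^8 + x^4 + x^3 + x + 1 for `xtime`; the function names are illustrative.

```python
def xtime(b):
    """Multiply by 02 in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def mix_column(a):
    """a is one column [a0, a1, a2, a3]; returns [a'0, a'1, a'2, a'3]."""
    t = a[0] ^ a[1] ^ a[2] ^ a[3]
    out = []
    for i in range(4):
        v = xtime(a[i] ^ a[(i + 1) % 4])   # 02 x (a[i] xor a[i+1])
        out.append(a[i] ^ v ^ t)
    return out
```

Each output byte needs a single xtime, so the whole column costs four xtime operations plus XORs.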

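MixColumn (MC)

\[
\begin{array}{lll}
v = a[1] \oplus a[2] \oplus a[3], & xt_0 = xtime(a[0]), & a'[0] = v \oplus xt_0 \oplus xt_1 \\
v = a[0] \oplus a[2] \oplus a[3], & xt_1 = xtime(a[1]), & a'[1] = v \oplus xt_1 \oplus xt_2 \\
v = a[0] \oplus a[1] \oplus a[3], & xt_2 = xtime(a[2]), & a'[2] = v \oplus xt_2 \oplus xt_3 \\
v = a[0] \oplus a[1] \oplus a[2], & xt_3 = xtime(a[3]), & a'[3] = v \oplus xt_3 \oplus xt_0
\end{array}
\]

With the key addition merged in (Key):
\[
\begin{array}{lll}
v = a[1] \oplus a[2] \oplus a[3], & xt_0 = xtime(a[0]), & a'[0] = k[0] \oplus v \oplus xt_0 \oplus xt_1 \\
v = a[0] \oplus a[2] \oplus a[3], & xt_1 = xtime(a[1]), & a'[1] = k[1] \oplus v \oplus xt_1 \oplus xt_2 \\
v = a[0] \oplus a[1] \oplus a[3], & xt_2 = xtime(a[2]), & a'[2] = k[2] \oplus v \oplus xt_2 \oplus xt_3 \\
v = a[0] \oplus a[1] \oplus a[2], & xt_3 = xtime(a[3]), & a'[3] = k[3] \oplus v \oplus xt_3 \oplus xt_0
\end{array}
\]

xtime computes 02 × v.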
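The variant that precomputes xt_i = xtime(a[i]) and folds in the round key can be sketched in the same way. `mix_column_add_key` is an illustrative name, and passing an all-zero key reduces it to plain MixColumn.

```python
def xtime(b):
    """Multiply by 02 in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def mix_column_add_key(a, k):
    """MixColumn on column a with the key bytes k XORed into the same expression."""
    xt = [xtime(x) for x in a]
    out = []
    for i in range(4):
        # v collects every byte of the column except a[i]
        v = a[(i + 1) % 4] ^ a[(i + 2) % 4] ^ a[(i + 3) % 4]
        out.append(k[i] ^ v ^ xt[i] ^ xt[(i + 1) % 4])
    return out
```

Because everything is XOR, the key byte is one more input to the same XOR tree and does not need a separate ARK stage.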

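Inv MixColumn (IMC)

\[
\begin{bmatrix} a'[0] \\ a'[1] \\ a'[2] \\ a'[3] \end{bmatrix}
=
\begin{bmatrix}
0E & 0B & 0D & 09 \\
09 & 0E & 0B & 0D \\
0D & 09 & 0E & 0B \\
0B & 0D & 09 & 0E
\end{bmatrix}
\begin{bmatrix} a[0] \\ a[1] \\ a[2] \\ a[3] \end{bmatrix}
=
\begin{bmatrix}
0E\,a[0] \oplus 0B\,a[1] \oplus 0D\,a[2] \oplus 09\,a[3] \\
09\,a[0] \oplus 0E\,a[1] \oplus 0B\,a[2] \oplus 0D\,a[3] \\
0D\,a[0] \oplus 09\,a[1] \oplus 0E\,a[2] \oplus 0B\,a[3] \\
0B\,a[0] \oplus 0D\,a[1] \oplus 09\,a[2] \oplus 0E\,a[3]
\end{bmatrix}
\]

IMC coefficients need up to three nested xtime calls, while MC needs only one:
\[
0E \cdot x = \underbrace{xtime(xtime(xtime(x)))}_{08(x)} \oplus \underbrace{xtime(xtime(x))}_{04(x)} \oplus \underbrace{xtime(x)}_{02(x)}
\]
\[
02 \cdot x = xtime(x), \qquad 03 \cdot x = xtime(x) \oplus x
\]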
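The xtime decompositions above can be verified exhaustively against a generic GF(2^8) multiplier. This is a checking sketch with illustrative names, assuming the AES reduction polynomial.

```python
def xtime(b):
    """Multiply by 02 in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def gf_mul(a, b):
    """Generic shift-and-add multiplication in GF(2^8), used as the reference."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = xtime(a)
        b >>= 1
    return r


def mul03(x):
    """03 . x = xtime(x) xor x"""
    return xtime(x) ^ x


def mul0e(x):
    """0E . x = 08(x) xor 04(x) xor 02(x), i.e. three chained xtime calls."""
    x2 = xtime(x)
    x4 = xtime(x2)
    x8 = xtime(x4)
    return x8 ^ x4 ^ x2
```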

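Inv MixColumn (IMC)

Now compare MC & IMC: IMC = MC * (a second matrix)

\[
\begin{bmatrix}
0E & 0B & 0D & 09 \\
09 & 0E & 0B & 0D \\
0D & 09 & 0E & 0B \\
0B & 0D & 09 & 0E
\end{bmatrix}
=
\underbrace{\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}}_{(1)}
\times
\underbrace{\begin{bmatrix}
05 & 00 & 04 & 00 \\
00 & 05 & 00 & 04 \\
04 & 00 & 05 & 00 \\
00 & 04 & 00 & 05
\end{bmatrix}}_{(2)}
\]

We observe that
\[
05 \cdot x = xtime(xtime(x)) \oplus x
\]

• The biggest coefficient in Eq. 2 is 05.
• Eq. 1 is already available (MC); the Eq. 2 calculation can be made before Eq. 1.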
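The factorization of IMC into the MC matrix and the 04/05 matrix can be checked numerically. The sketch below multiplies the two matrices in GF(2^8) and also applies them in sequence to one column; `D`, `mat_mul`, `mat_vec` and `inv_mix_column` are illustrative names.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


MC = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]          # Eq. 1
D = [[5, 0, 4, 0], [0, 5, 0, 4], [4, 0, 5, 0], [0, 4, 0, 5]]           # Eq. 2
IMC = [[0xE, 0xB, 0xD, 0x9], [0x9, 0xE, 0xB, 0xD],
       [0xD, 0x9, 0xE, 0xB], [0xB, 0xD, 0x9, 0xE]]


def mat_mul(a, b):
    """4x4 matrix product with GF(2^8) multiplication and XOR as addition."""
    out = [[0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            for t in range(4):
                out[i][j] ^= gf_mul(a[i][t], b[t][j])
    return out


def mat_vec(m, v):
    """Matrix times column vector in GF(2^8)."""
    out = []
    for row in m:
        acc = 0
        for coeff, x in zip(row, v):
            acc ^= gf_mul(coeff, x)
        out.append(acc)
    return out


def inv_mix_column(col):
    """Apply Eq. 2 (coefficients 04 and 05) first, then reuse the MC matrix of Eq. 1."""
    return mat_vec(MC, mat_vec(D, col))
```

The decryption path can therefore reuse the MC hardware and add only the small pre-multiplication stage.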

Implementing AES on FPGAs

Architecture 1: Encryptor core
• Sequential approach

Architecture 2: Encryptor core
• Pipeline approach

Architecture 3: Encryptor/decryptor core
• MC/IMC modified approach

Architecture 4: Encryptor/decryptor core
• Using look-up table method

Architecture 5: Encryptor/decryptor core
• Using composite field approach

AES Implementation Strategies

The commonly used architectures are:

• Iterative looping: one round repeated n times
• Loop unrolling: n rounds (round 1, round 2, ..., round n)
• Inner-round pipelining: one round split into Stage 1, ..., Stage k, each followed by a register (Register 1, Register 2, ..., Register k)

Architecture 1: Sequential Approach

[Figure: PLAIN TEXT and USER-KEY → RND 0 → RND 1-9 with LATCH (inputs S, ROUND-KEY, CLK) → RND 10 → CIPHER TEXT; USER KEY → KGEN with LATCH (inputs S, RCON, CLK) → ROUND KEY]

Architecture 2: Pipelined Approach

[Figure: IN → REG → RND 0 → RND 1 → ... → RND 10 → OUT; USER-KEY → IN REG → chain of KGEN blocks supplying RK 0, RK 1, ..., RK 10 to the rounds]

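Architecture 3: Encryption/Decryption

Encryption: MI + AF + SR + MC + ARK
Decryption: ISR + IAF + MI + ModM + MC + ARK

[Figure: two datapaths from IN to OUT with E/D selects between the ENC and DEC branches.
First: ISR, IAF (DEC) → MI → AF, SR (ENC) → ModM (DEC) → MC → ARK.
Second: ISR, IAF (DEC) → MI → AF, SR (ENC) → IMC, IARK (DEC) or MC, ARK (ENC).]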

Architecture 4: Encryptor/decryptor core using look-up table method

• Same S-Box (MI) for encryption/decryption
• Memory requirements are halved
• BRAMs are used for storing MI values
• No initial time to prepare them

[Figure: IN → ISR, IAF (decryption) → MI → AF, SR (encryption) → MC, ARK or IMC, IARK → OUT, selected by E/D]

Architecture 5: Encryptor/decryptor core using composite field for MI

1st transformation (M): GF(2^8) to field F = GF(((2^2)^2)^2)
MI manipulation in field F
2nd transformation (M^-1): field F to GF(2^8)

Let A ∈ F and A = A_H y + A_L; then it can be shown that:

\[
A^{-1} = (A^{17})^{-1} \times A^{16}, \qquad
A^{16} = A_H\, y + (A_H + A_L), \qquad
A^{17} = \lambda A_H^{2} + (A_H + A_L)\, A_L
\]

[Figure: 8-bit A → GF(2^8) to GF(2^4) → 4-bit A_H and A_L → X^2, × λ and three Mul 4x4 blocks forming λ A_H^2, A^16 and A^17 → X^-1 on A^17 → GF(2^4) to GF(2^8) → 8-bit A^-1]
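The identity A^-1 = (A^17)^-1 × A^16 can be checked without building the field mapping. The Python sketch below works directly in the AES representation of GF(2^8), not in the composite field F of the slide, and only verifies the algebra: A^17 always falls in the 16-element subfield, so the one true inversion needs a 15-entry table. All names are illustrative.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def gf_pow(a, n):
    """Square-and-multiply exponentiation in GF(2^8)."""
    r = 1
    while n:
        if n & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        n >>= 1
    return r


# The copy of GF(2^4) inside GF(2^8): exactly the elements with x^16 == x.
SUBFIELD = [x for x in range(256) if gf_pow(x, 16) == x]
# Inversion table for nonzero subfield elements only (x^14 = x^-1 there).
SUB_INV = {x: gf_pow(x, 14) for x in SUBFIELD if x}


def inverse(a):
    """A^-1 = (A^17)^-1 * A^16, with the inversion done on a subfield element."""
    if a == 0:
        return 0
    a16 = gf_pow(a, 16)
    a17 = gf_mul(a16, a)
    return gf_mul(SUB_INV[a17], a16)
```

In the composite-field hardware the same split is what replaces a 256-entry table by 4-bit multipliers and a 4-bit inverter.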

Results Comparison

AES Algorithm Implementations

Metrics to measure performance

1. Throughput = (Clock frequency x No. of bits) / No. of rounds
2. Area: CLB slices, BRAMs, etc.
3. Ratio = Throughput / Area
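The two derived metrics are simple arithmetic. In the sketch below the function names and the 100 MHz example design are hypothetical; the ratio example uses the throughput and slice count reported for Gaj et al in the sequential-architecture comparison.

```python
def throughput_mbps(freq_mhz, block_bits, rounds):
    """Throughput = (clock frequency x number of bits) / number of rounds."""
    return freq_mhz * block_bits / rounds


def throughput_per_area(throughput, slices):
    """Ratio = Throughput / Area, with area counted in CLB slices."""
    return throughput / slices


# Hypothetical iterative design: 100 MHz, 128-bit block, 10 rounds.
example = throughput_mbps(100, 128, 10)

# Gaj et al: 331.5 Mbits/s on 2902 CLB slices.
gaj_ratio = round(throughput_per_area(331.5, 2902), 2)
```

A fully pipelined design delivers one block per clock cycle, which corresponds to setting the divisor to 1.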

Architecture 1: AES encryptor core using sequential approach

Design               Device (XCV)   Area (CLB slices)   Throughput (Mbits/s)   Throughput/Area
Gaj et al [1]        1000           2902                331.5                  0.11
Dandalis et al [2]   1000           5673                353                    0.06
Nazar et al          812            2744                258.5                  0.09

Architecture 2: AES encryptor core using pipeline approach

Design               Device (XCV)   Area (CLB slices)   Throughput (Mbits/s)   Throughput/Area
Elbirt et al [3]     1000           9004                1940                   0.22
Nazar et al          2600           2136                2868                   1.29

5%, 51% 22%, 26%

76% 47%

Architecture 3: AES encryptor/decryptor core using MC/IMC modified approach

Design          Device     BRAMs   CLB Slices (S)   Throughput (Mbits/s) (T)   T/S
This design     XCV2600E   80      5677             4121                       0.73
McLoone et al   XCV3200E   102     7576             3239                       0.43

• Two approaches for MC/IMC
• Fewer BRAMs
• Fewer slices
• Highest throughput reported to date

27.03%, 25.06%

Architecture 4 & 5: AES encryptor/decryptor core using MI look-up table and composite field approach

Design        Device     BRAMs      CLB Slices (S)   Throughput (Mbits/s) (T)   T/S
E/D GF(2^4)   XCV2600E   No BRAMs   13416            3136                       0.24
E/D GF(2^8)   XCV2600E   80         6676             3840                       0.58
McLoone       XCV3200E   102        7576             3239                       0.43

• Two approaches for MI
• Key scheduling included
• No initial delay

The first design uses a look-up table for MI: fast, but with high memory requirements. The second design uses the composite field approach for MI: slower, with lower memory requirements.

Both are efficient compared to the reported design.

11%, 77 % 25%, 3 %

Related Publications

1. Nazar A. Saqib, Francisco Rodriguez-Henriquez, and Arturo Diaz-Perez, “Sequential and pipelined architectures for AES implementation,” Proceedings of the IASTED International Conference Computer Science and Technology, pp. 159-163, May 19-21, 2003, Cancun, Mexico.

2. F. Rodriguez-Henriquez, N.A. Saqib, and A. Diaz-Perez, “4.2 Gbit/s single-chip FPGA implementation of AES algorithm,” Electronics Letters, Vol. 39, No. 15, July 24, 2003.

3. Nazar A. Saqib, Francisco Rodriguez-Henriquez, and Arturo Diaz-Perez, “Two Approaches for a Single-Chip FPGA Implementation of an Encryptor/Decryptor AES Core,” FPL 2003, Lecture Notes in Computer Science 2778, pp. 303-312, 2003 (FPL 2003, Sep. 1-3, Lisbon, Portugal).

4. Nazar A. Saqib, Francisco Rodriguez-Henriquez, and Arturo Diaz-Perez, “AES Algorithm Implementation-An efficient approach for Sequential and Pipeline architectures,” Fourth Mexican International Conference on Computer Science, ENC’03, pp. 126-130, Sep. 8-12, 2003, Tlaxcala, Mexico.

5. Nazar A. Saqib, Arturo Diaz-Perez, and Francisco Rodriguez-Henriquez, “Highly Optimized Single-Chip FPGA Implementations of AES Encryption and Decryption Cores,” accepted for Iberchip 2004.

Conclusions

• A promising AES encryptor/decryptor core (contributions for AES S-Box/Inv S-Box)
  – Using look-up table for S-Box
  – Using composite fields GF(2^4)
• An optimized AES encryptor/decryptor core (contributions for AES MC/IMC)
  – Using modified version for IMC
• A sequential and pipeline encryptor core (tradeoff between speed and area)

Future work: completion of ECC scalar multiplication; thesis writing and defense