ece 545 lecture 8b hardware architectures of secret-key ... · - outer-round pipelining -...

George Mason University

Hardware Architectures of Secret-Key Block Ciphers and Hash Functions

ECE 545 Lecture 8b

Recommended reading

•  K. Gaj and P. Chodowiec, "FPGA and ASIC Implementations of AES," Chapter 10 in C.K. Koc (Ed.), Cryptographic Engineering. Section 10.4, Parameters of Hardware Implementations; Section 10.5, Hardware Architectures of Symmetric Block Ciphers.

•  E. Homsirikamol, M. Rogawski, and K. Gaj, "Throughput vs. Area Trade-offs in High-Speed Architectures of Five Round 3 SHA-3 Candidates Implemented Using Xilinx and Altera FPGAs," in LNCS 6917, Cryptographic Hardware and Embedded Systems - CHES 2011, Nara, Japan, Sep. 28-Oct. 1, pp. 491-506. Sections 1-4.

Secret-key Ciphers

[Figure: a cipher transforms a message into a ciphertext under a cryptographic key. Labels: block width N bits, key width K bits.]

Current American Standards: AES vs. Triple DES

•  Triple DES: 64-bit input and output blocks, 168-bit key
•  AES: 128-bit input and output blocks; 128-, 192-, or 256-bit key

Typical Flow Diagram of a Secret-Key Block Cipher

[Flow diagram: initial transformation using Round Key[0]; cipher round using Round Key[i] followed by i:=i+1, repeated #rounds times while i<#rounds; final transformation using Round Key[#rounds+1].]

Top level block diagram

[Block diagram: input interface (input/key), key scheduling (key setup and key expansion), memory of internal keys, encryption/decryption unit, output interface (output), and a control unit generating the control signals.]

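Primary parameters of hardware implementations for secret-key block ciphers

•  Latency: time to encrypt/decrypt a single block of data
•  Throughput: number of bits encrypted/decrypted in a unit of time

[Figure: latency is illustrated by a single block Mi passing through the encryption/decryption unit to give Ci; throughput by a stream of blocks Mi, Mi+1, Mi+2 producing Ci, Ci+1, Ci+2.]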

$$\text{Throughput} = \frac{\text{Block\_size} \cdot \text{Number\_of\_blocks\_processed\_simultaneously}}{\text{Latency}}$$

Dependence of the encryption time on latency and throughput

[Plot: encryption time vs. message size.]

$$\text{Encryption\_time} = \text{Latency} + \frac{\text{Message\_size} - \text{Block\_size}}{\text{Throughput}}$$
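The two relations on these slides can be checked numerically. The sketch below assumes the encryption-time plot expresses Encryption_time = Latency + (Message_size − Block_size) / Throughput; the design point (128-bit block, 10 rounds, 10 ns clock) and all function names are hypothetical and chosen only for illustration.

```python
def throughput(block_size_bits, blocks_in_flight, latency_s):
    """Throughput in bit/s: block size times blocks processed simultaneously,
    divided by the latency of one block."""
    return block_size_bits * blocks_in_flight / latency_s


def encryption_time(message_size_bits, block_size_bits, latency_s, throughput_bps):
    """Assumed model: the first block costs one full latency, and the remaining
    bits are delivered at the sustained throughput."""
    return latency_s + (message_size_bits - block_size_bits) / throughput_bps


# Hypothetical basic iterative design: one block in flight at a time.
ROUNDS, CLOCK_PERIOD_S, BLOCK_BITS = 10, 10e-9, 128
LATENCY_S = ROUNDS * CLOCK_PERIOD_S
THROUGHPUT_BPS = throughput(BLOCK_BITS, 1, LATENCY_S)

if __name__ == "__main__":
    for message_bits in (128, 1280, 12800):
        t = encryption_time(message_bits, BLOCK_BITS, LATENCY_S, THROUGHPUT_BPS)
        print(f"{message_bits:6d} bits -> {t * 1e9:8.1f} ns")
```

For a single-block message the encryption time equals the latency; for long messages it is dominated by the throughput term.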

Basic iterative architecture

[Figure: a register and a multiplexer feed combinational logic implementing one round; the round output is fed back through the multiplexer, and a round key is supplied to the round.]

Basic architecture: Timing

Latency = #rounds · clock_period

Block vs. stream ciphers

•  Block cipher: M1, M2, …, Mn are mapped to C1, C2, …, Cn, with Ci = fK(Mi). Every block of ciphertext is a function of only one corresponding block of plaintext.
•  Stream cipher (internal state IS): m1, m2, …, mn are mapped to c1, c2, …, cn, with ci = fK(mi, ISi) and ISi+1 = gK(mi, ISi). Every block of ciphertext is a function of the current block of plaintext and the current internal state of the cipher.

Typical stream cipher

[Figure: sender and receiver each run a pseudorandom key generator initialized with the key and the initialization vector (seed). At the sender the keystream ki is combined with the plaintext mi to give the ciphertext; at the receiver the same keystream ki is combined with the ciphertext to recover the plaintext.]

ECB (Electronic CodeBook) mode

Electronic CodeBook Mode – ECB Encryption

Ci = EK(Mi) for i=1..N

[Figure: each plaintext block M1, M2, M3, …, MN-1, MN is encrypted independently by E under the key K, giving C1, C2, C3, …, CN-1, CN.]

Electronic CodeBook Mode – ECB Decryption

Mi = DK(Ci) for i=1..N

[Figure: each ciphertext block C1, C2, C3, …, CN-1, CN is decrypted independently by D under the key K, giving M1, M2, M3, …, MN-1, MN.]

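Counter Mode

Counter Mode - CTR Encryption

ci = mi ⊕ ki, where ki = EK(IV+i-1), for i=1..N

[Figure: counter values IV, IV+1, IV+2, …, IV+N-2, IV+N-1 are encrypted by E under the key K to give keystream blocks k1, k2, k3, …, kN-1, kN, which are XORed with m1, m2, m3, …, mN-1, mN to give c1, c2, c3, …, cN-1, cN.]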

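Counter Mode - CTR Decryption

mi = ci ⊕ ki, where ki = EK(IV+i-1), for i=1..N

[Figure: the same keystream blocks k1, k2, k3, …, kN-1, kN are generated by encrypting IV, IV+1, IV+2, …, IV+N-2, IV+N-1 with E under the key K, and are XORed with c1, c2, c3, …, cN-1, cN to recover m1, m2, m3, …, mN-1, mN.]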

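Counter Mode - CTR

[Figure: a counter holding the internal state feeds the encryption unit; labels 1 and L mark the bus widths.]

$$IS_1 = IV, \qquad c_i = E_K(IS_i) \oplus m_i, \qquad IS_{i+1} = IS_i + 1$$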
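The counter-mode recurrence (IS1 = IV, ci = EK(ISi) ⊕ mi, next state = current state + 1) can be sketched in a few lines. The block function `toy_E` below is a hypothetical stand-in built from SHA-256; it is not AES and is used only so the example runs without third-party libraries. The names `ctr_xcrypt` and `BLOCK_BYTES` are likewise illustrative.

```python
import hashlib

BLOCK_BYTES = 16  # assumed 128-bit block


def toy_E(key, counter):
    """Stand-in for E_K (NOT a real block cipher): keyed hash of the counter,
    truncated to one block. CTR mode only ever uses E in this direction."""
    data = key + counter.to_bytes(BLOCK_BYTES, "big")
    return hashlib.sha256(data).digest()[:BLOCK_BYTES]


def ctr_xcrypt(key, iv, blocks):
    """Encrypt or decrypt a list of blocks; the two operations are identical."""
    out = []
    state = iv  # IS_1 = IV
    for m in blocks:
        keystream = toy_E(key, state)  # k_i = E_K(IS_i)
        out.append(bytes(a ^ b for a, b in zip(m, keystream)))
        state = (state + 1) % (1 << (8 * BLOCK_BYTES))  # next state = state + 1
    return out


if __name__ == "__main__":
    key = b"hypothetical key"
    plaintext = [b"block number one", b"block number two"]
    ciphertext = ctr_xcrypt(key, 42, plaintext)
    print(ctr_xcrypt(key, 42, ciphertext) == plaintext)
```

Because every keystream block depends only on the counter value, the blocks can be computed in any order or in parallel, which is what makes this a non-feedback mode.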

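CBC (Cipher Block Chaining) Mode

Cipher Block Chaining Mode - CBC Encryption

ci = EK(mi ⊕ ci-1) for i=1..N, with c0=IV

[Figure: each plaintext block m1, m2, m3, …, mN-1, mN is XORed with the previous ciphertext block (IV for the first block) and encrypted by E, giving c1, c2, c3, …, cN-1, cN.]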

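Cipher Block Chaining Mode - CBC Decryption

mi = DK(ci) ⊕ ci-1 for i=1..N, with c0=IV

[Figure: each ciphertext block c1, c2, c3, …, cN-1, cN is decrypted by D and XORed with the previous ciphertext block (IV for the first block), giving m1, m2, m3, …, mN-1, mN.]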
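The CBC equations can be exercised with any invertible block function. The sketch below uses a hypothetical 4-round Feistel network with a SHA-256 round function as a stand-in for EK and DK; it is not AES or DES and offers no security claim. All names are illustrative.

```python
import hashlib

BLOCK_BYTES = 16
HALF = BLOCK_BYTES // 2
TOY_ROUNDS = 4


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def _round_fn(key, rnd, half):
    """Toy Feistel round function (an assumption, not from the slides)."""
    return hashlib.sha256(key + bytes([rnd]) + half).digest()[:HALF]


def toy_encrypt(key, block):
    left, right = block[:HALF], block[HALF:]
    for rnd in range(TOY_ROUNDS):
        left, right = right, xor(left, _round_fn(key, rnd, right))
    return left + right


def toy_decrypt(key, block):
    left, right = block[:HALF], block[HALF:]
    for rnd in reversed(range(TOY_ROUNDS)):
        left, right = xor(right, _round_fn(key, rnd, left)), left
    return left + right


def cbc_encrypt(key, iv, blocks):
    prev, out = iv, []  # c_0 = IV
    for m in blocks:
        prev = toy_encrypt(key, xor(m, prev))  # c_i = E_K(m_i XOR c_{i-1})
        out.append(prev)
    return out


def cbc_decrypt(key, iv, blocks):
    prev, out = iv, []
    for c in blocks:
        out.append(xor(toy_decrypt(key, c), prev))  # m_i = D_K(c_i) XOR c_{i-1}
        prev = c
    return out
```

Note the asymmetry: in `cbc_encrypt` each iteration needs the previous ciphertext block before it can start, while in `cbc_decrypt` every `toy_decrypt` call depends only on ciphertext that is already available.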

Primary factor in choosing the encryption/decryption unit architecture

Symmetric-key cipher mode of operation:

1. Non-feedback cipher modes

ECB, counter mode

2. Feedback cipher modes

CBC, CFB, OFB

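Non-feedback Counter Mode - CTR

Ci = Mi ⊕ AES(IV+i) for i=0..N

[Figure: counter values IV, IV+1, IV+2, …, IV+N-1, IV+N are encrypted independently by E and XORed with the message blocks M0, M1, M2, …, MN-1, MN to give the ciphertext blocks.]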

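Feedback cipher modes - CBC

C1 = AES(M1 ⊕ IV)
Ci = AES(Mi ⊕ Ci-1) for i=2..N

[Figure: message blocks M1, M2, M3, …, MN-1, MN are each XORed with the previous ciphertext block (IV for the first block) and encrypted by E, giving C1, C2, C3, …, CN-1, CN.]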

Feedback cipher modes CBC, CFB, OFB

k-rounds Loop Unrolling

[Figure: a register and a multiplexer feed combinational logic implementing k rounds (round 1, round 2, …, round k) within a single clock cycle.]

Loop Unrolling: Timing

Latency = #rounds/k · extended_clock_period

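Loop Unrolling: Speed vs. Area

[Plot: speed vs. area for the basic architecture and for loop unrolling with k=2, k=3, k=4, k=5.]

$$\text{speed} = \text{speed}_{\text{basic}} \cdot \frac{1 + \tau}{1 + \tau / k}, \qquad \tau \ll 1$$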
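A quick numerical look at the loop-unrolling relation, read here as speed / speed_basic = (1 + τ) / (1 + τ/k), shows how small the gain is when τ is much less than 1. The slides do not define τ in the extracted text; the sketch assumes it is the ratio of the register-plus-multiplexer delay to the delay of one round, and the value 0.1 is purely illustrative.

```python
def unrolling_speedup(k, tau):
    """Speed of a k-round unrolled design relative to the basic architecture,
    under the assumed reading speed/speed_basic = (1 + tau) / (1 + tau / k)."""
    return (1 + tau) / (1 + tau / k)


if __name__ == "__main__":
    tau = 0.1  # hypothetical overhead ratio
    for k in (1, 2, 3, 4, 5):
        print(f"k={k}: speedup = {unrolling_speedup(k, tau):.4f}")
```

The speedup is bounded above by 1 + τ no matter how large k becomes, while the area of the round logic grows roughly in proportion to k.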

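Architectures suitable for feedback modes

[Figure: the basic architecture (register, MUX, combinational logic for one round) shown alongside loop-unrolled variants with K rounds (round 1, round 2, …, round K) and with all rounds (round 1, round 2, …, round #rounds).]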

Decreasing area by resource sharing

[Figure: a circuit before and after resource sharing. Labels: D0’, D1’, multiplexer, register.]

Resource Sharing: Speed vs. Area

[Plot: throughput vs. area for the basic architecture and for resource sharing.]

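Non-Feedback Cipher Modes ECB, counter, OCB

Comparison for non-feedback cipher modes, e.g. Counter Mode - CTR

Ci = Mi ⊕ AES(IV+i) for i=0..N

[Figure: counter values IV, IV+1, IV+2, …, IV+N-1, IV+N are encrypted independently by E and XORed with the message blocks M0, M1, M2, …, MN-1, MN to give the ciphertext blocks.]

[Figure: OCB mode. Labels: L, length, Zi=f(L, R), τ bits, Control sum.]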

Increasing speed by parallel processing

Increasing speed using pipelining

[Figure: Cipher 1 (round 1, round 2, …, round 10) and Cipher 2 (round 1, …, round 16), both fully pipelined for the same target clock period, e.g., 20 ns.]

$$\text{Speed} = \frac{\text{block size}}{\text{target\_clock\_period}}$$

Pipelined operation of the encryption unit

[Figure: pipeline occupancy by clock cycle, for clock cycles 1 to 16. Blocks B1, B2, …, B16 enter one per clock cycle and advance one stage per cycle; successive rows read B3 B2 B1, B4 B3 B2, …, B16 B15 B14 B13.]

Full outer-round pipelining

[Figure: round 1, …, round #rounds, each forming one pipeline stage, separated by #rounds registers.]

Total # of pipeline stages = #rounds

Full mixed inner- and outer-round pipelining

[Figure: round 1, round 2, …, round #rounds, each divided into k pipeline stages by k registers.]

Total # of pipeline stages = #rounds·k

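k-stage Outer-Round Pipelining

[Figure: k rounds unrolled, with pipeline stage 1 = round 1, pipeline stage 2 = round 2, …, pipeline stage k = round k. The stages are separated by register1, register2, …, register k, and a multiplexer selects the input.]

Outer-Round Pipelining: Timing

Latency = #rounds · clock_period

[Timing diagram: consecutive blocks P4, P5, P6 shown in flight.]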

Outer-Round Pipelining: Speed vs. Area

[Plot: speed vs. area for the basic architecture, outer-round pipelining in non-feedback modes, and outer-round pipelining in feedback modes.]

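Outer-round Pipelining

[Figure: three variants. One round with no pipelining (register, MUX, combinational logic); K unrolled rounds with K registers and a MUX, where each round up to round K is one pipeline stage; and full unrolling, where each round up to round #rounds is one pipeline stage.]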

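k-stage Inner-Round Pipelining

[Figure: the logic of one round is divided into pipeline stage 1, pipeline stage 2, …, pipeline stage k by register1, register2, …, register k, with a multiplexer selecting the input.]

Inner-Round Pipelining: Timing

Latency = #rounds · (k · reduced_clock_period)

[Timing diagram: consecutive blocks P4, P5, P6 shown in flight.]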

Inner-Round Pipelining: Speed vs. Area

[Plot: speed vs. area for the basic architecture, inner-round pipelining in non-feedback modes (point k=2 marked), and inner-round pipelining in feedback modes.]
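The timing expressions for the basic, outer-round pipelined, and inner-round pipelined architectures can be combined with the throughput definition (block size times blocks processed simultaneously, divided by latency). The sketch below is a simplified model under stated assumptions: in non-feedback modes a k-stage pipeline keeps k blocks in flight, while in feedback modes on a single data stream only one block can be in flight, because each block needs the previous ciphertext. All numbers and function names are hypothetical.

```python
def basic(block_bits, rounds, t_clk):
    """Basic iterative architecture: returns (latency, throughput)."""
    latency = rounds * t_clk
    return latency, block_bits / latency


def outer_round_pipelined(block_bits, rounds, t_clk, k, feedback):
    """k rounds unrolled, one pipeline stage per round."""
    latency = rounds * t_clk
    in_flight = 1 if feedback else k
    return latency, block_bits * in_flight / latency


def inner_round_pipelined(block_bits, rounds, t_clk_reduced, k, feedback):
    """One round split into k stages, each taking one reduced clock period."""
    latency = rounds * k * t_clk_reduced
    in_flight = 1 if feedback else k
    return latency, block_bits * in_flight / latency


if __name__ == "__main__":
    # Hypothetical design point: 128-bit block, 10 rounds, 10 ns round delay,
    # and a 6 ns clock after splitting the round into k = 2 stages.
    print("basic           ", basic(128, 10, 10e-9))
    for fb in (False, True):
        print("outer, feedback=%s" % fb, outer_round_pipelined(128, 10, 10e-9, 2, fb))
        print("inner, feedback=%s" % fb, inner_round_pipelined(128, 10, 6e-9, 2, fb))
```

Under these assumptions pipelining raises throughput only in non-feedback modes; in feedback modes inner-round pipelining can even lower it, since k · reduced_clock_period is generally larger than the original clock period.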

Mixed Inner- and Outer-round Pipelining

[Figure: a) one round, no pipelining (register, combinational logic); b) one round = k pipeline stages, with k registers; a variant with K unrolled rounds, each of k pipeline stages, with k registers per round; d) all rounds unrolled (round 2, …, round #rounds), each of k pipeline stages.]

[Plot: throughput for the basic architecture, inner-round pipelining, outer-round pipelining (K=2, K=3), and mixed inner and outer-round pipelining.]

Comparison of the traditional and new design methodologies

[Plot: Speed [Mbit/s] vs. Area [CLB slices].]

Choosing optimum architecture for non-feedback cipher modes

Latency vs. area dependence for the new design methodology

[Plot: latency vs. area for the basic architecture, outer-round pipelining (K=2, K=3, K=4, K=5), inner-round pipelining, and mixed inner and outer-round pipelining.]

Limits on the minimum clock period after pipelining (1)

[Figure: a round divided into operations op1, op2, op3, op4, op5 by registers r1, r2, r3, r4; the minimum clock period TCLKmin is marked.]

1. Delay of a single round divided by k = number of internal pipeline stages

2. Delay of the longest indivisible operation

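Limits on the minimum clock period after pipelining (2)

[Figure: datapath with registers r1, r2 and operations op1, op3, op4, op5, driven by a Control Unit with counters cntr1, cntr2, cntr3, cntr4 and signal rc; the minimum clock period TCLKmin is marked.]

3. Delays within the control unit

4. Maximum latency

5. Maximum input/output bandwidth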
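The first three limits can be read as lower bounds on the clock period, with the largest bound being the binding one. The sketch below is a minimal illustration of that reading; the delay values are hypothetical, and limits 4 and 5 (maximum latency, maximum input/output bandwidth) are system-level constraints that this simple model leaves out.

```python
def min_clock_period(round_delay, k, longest_indivisible_op, control_delay=0.0):
    """Lower bound on the clock period after splitting one round into k stages.

    The clock cannot be faster than (1) the round delay divided by k,
    (2) the longest operation that cannot be split by a register, or
    (3) the critical path inside the control unit.
    """
    return max(round_delay / k, longest_indivisible_op, control_delay)


if __name__ == "__main__":
    # Hypothetical delays in ns: 20 ns round, 4 ns longest indivisible operation.
    for k in (1, 2, 4, 5, 10):
        print(f"k={k:2d}: T_CLKmin >= {min_clock_period(20.0, k, 4.0)} ns")
```

In this example, increasing k beyond 5 no longer shortens the clock period, because the longest indivisible operation becomes the limit.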

[Block diagram: DES encryption/decryption unit with key schedule and key memory. Signals: clock, reset, encrypt/decrypt, DES/3DES; data input, data available, data read; key input (56 bits), key available, key read; IV input, IV available, IV read; data output, write, full; key choice (2 bits), read bank, write bank, round number (4 bits), round key (48 bits).]

[Block diagram: AES encryption/decryption unit with key schedule and key memory. Signals: clock, reset, encrypt/decrypt; data input, data available, data read; key input, key available, key read; IV input, IV available, IV read; data output, write, full; key size (2 bits), cycle number (6 bits), key material (64 bits), read bank, write bank, round number (4 bits), round key (128 bits).]

Performance of alternative architectures: in non-feedback cipher modes (ECB, counter)

[Plot: basic architecture, loop unrolling (k=2, k=3, k=4, k=5), inner-round pipelining, outer-round pipelining, and resource sharing.]

Performance of alternative architectures: in feedback cipher modes (CBC, CFB, OFB)

[Plot: basic architecture, loop unrolling (k=2, k=3, k=4, k=5), inner-round pipelining, outer-round pipelining, and resource sharing.]

Hash Functions

Hash Function

[Figure: a hash function maps a message m of arbitrary length to a hash value h(m) of fixed length.]

Collision Resistance: It is computationally infeasible to find two distinct messages m and m’ such that h(m)=h(m’).

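Hash Functions in Digital Signature Schemes

[Figure: Alice hashes the message and applies the public key cipher with Alice’s private key to the hash value to produce the signature. The verifier hashes the received message (hash value 1), applies the public key cipher with Alice’s public key to the signature (hash value 2), and compares the two values: yes/no.]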

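General scheme for constructing a secure hash function: Merkle–Damgård scheme

[Figure: the message m is padded and its bit length is appended, giving M = M1, M2, …, Mt. Starting from H0 = IV, the compression function f processes one block at a time to produce H1, H2, …; the output transformation g then yields h(m).]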
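The Merkle–Damgård flow (pad and append the bit length, iterate the compression function f from the IV, apply the output transformation g) can be sketched as follows. The compression function here is a hypothetical stand-in that simply calls SHA-256 on the chaining value and the block; the all-zero IV, the 64-byte block size, and the identity output transformation are assumptions for illustration, not the parameters of any standard hash function.

```python
import hashlib

BLOCK_BYTES = 64           # assumed message block size
STATE_BYTES = 32           # assumed chaining value size
IV = bytes(STATE_BYTES)    # hypothetical all-zero initialization vector


def md_pad(message):
    """Append a 1 bit, zero bits, and the 64-bit big-endian message bit length,
    so that the result is a whole number of blocks."""
    bit_len = 8 * len(message)
    padded = message + b"\x80"
    padded += bytes((BLOCK_BYTES - 8 - len(padded)) % BLOCK_BYTES)
    return padded + bit_len.to_bytes(8, "big")


def toy_f(h, block):
    """Stand-in compression function f(H_{i-1}, M_i) -> H_i."""
    return hashlib.sha256(h + block).digest()


def toy_g(h):
    """Output transformation g; the identity in this sketch."""
    return h


def md_hash(message):
    padded = md_pad(message)
    h = IV  # H_0 = IV
    for i in range(0, len(padded), BLOCK_BYTES):
        h = toy_f(h, padded[i:i + BLOCK_BYTES])
    return toy_g(h)


if __name__ == "__main__":
    print(md_hash(b"abc").hex())
```

The loop makes the hardware implication visible: each call to f needs the previous chaining value, so the blocks of one message must be processed sequentially.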

Sponge Scheme

Basic iterative architecture of SHA-1 and SHA-2

[Figure: a single unit computing Step t from the inputs Wt and Kt.]

Architecture with Multiple Processing Units

[Figure: k copies of the Step t unit (inputs Wt, Kt), processing Data Stream 1, …, Data Stream k in parallel.]

Features of architecture with multiple processing units

•  Pros
   -  Throughput increases by a factor of k
•  Cons
   -  Latency the same as for the basic architecture
   -  Area increases by a factor of k
   -  Requires k independent data streams (messages)

Pipelined architecture

[Figure: step t divided into stage 1 and stage 2 by a pipeline register; inputs Kt and Wt.]

Features of the pipelined architecture

•  Pros
   -  Throughput increases by a factor close to k
   -  Area increases by a factor smaller than k
•  Cons
   -  Latency the same as for the basic architecture
   -  Requires k independent data streams (messages)

Unrolled architecture

[Figure: k consecutive steps, Step t, Step t+1, …, Step t+k-1, chained combinationally, with inputs Wt, Wt+1, …, Wt+k-1 and Kt, Kt+1, …, Kt+k-1.]

Features of the unrolled architecture

•  Pros
   -  Improves both latency and throughput
   -  Requires only one data stream
•  Cons
   -  Area may increase substantially compared to the basic architecture

Starting Point: Basic Iterative Architecture

•  datapath width = state size
•  one clock cycle per one round

Currently, the most common architecture used to implement SHA-1, SHA-2, and many other hash functions.

[Plot: Throughput vs. Area.]

Horizontal Folding - /2(h)

•  datapath width = state size
•  two clock cycles per one round

Typically Throughput/Area ratio increases

[Plot: Throughput vs. Area; points x1 and /2(h).]

Vertical Folding - /2(v)

•  datapath width = state size/2
•  two clock cycles per one round/step

Typically Throughput/Area ratio decreases

[Plot: Throughput vs. Area.]

Vertical Folding with the State Kept in Memory, BLAKE /4(h)/4(v)-m

•  datapath width = state size/4
•  16 clock cycles per one round

Throughput/Area ratio increases

[Plot: Throughput vs. Area; point /4(h)/4(v)-m.]

Unrolling - x2

•  datapath width = state size
•  one clock cycle per two rounds

Typically Throughput/Area ratio decreases

[Plot: Throughput vs. Area.]

Efficient Unrolling - x2

•  datapath width = state size
•  one clock cycle per two rounds

Sometimes Throughput/Area ratio increases

[Plot: Throughput vs. Area; point x2-efficient.]

Multiple Packets Available for Parallel Processing

[Figure: queues of packets of mixed sizes (40 B, 64 B, 576 B, 1500 B) available for parallel processing.]

•  Typical sizes of packets: 40 B – 1500 B
•  1500 B = Maximum Transmission Unit (MTU) for Ethernet v2

Parallel Processing Using Multi-Unit Architecture – MU2

Typically Throughput/Area ratio stays the same

[Plot: Throughput vs. Area.]

Unrolled Architecture with Pipelining - x2-PPL2

Typically Throughput/Area ratio stays almost the same

[Plot: Throughput vs. Area; point x2-PPL2.]

Basic Architecture with Pipelining - x1-PPL2

Typically Throughput/Area ratio increases

[Plot: Throughput vs. Area; point x1-PPL2.]

[Plot: BLAKE-256 in Virtex 5.]

Why Interface Matters?

•  Pin limit

Total number of I/O ports ≤ Total number of FPGA I/O pins

•  Support for the maximum throughput

Time to load the next message block ≤ Time to process the previous block
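Both interface conditions are simple inequalities and can be checked for a candidate design. The sketch below assumes the core reads one w-bit word per I/O clock cycle, which the slides do not state; the parameter values (512-bit block, 65 cycles per block) and function names are hypothetical.

```python
import math


def fits_pin_limit(io_ports, fpga_io_pins):
    """Pin limit: the core's I/O ports must fit in the device's I/O pins."""
    return io_ports <= fpga_io_pins


def load_time(block_bits, w, t_io_clk):
    """Time to load one message block, assuming one w-bit word per I/O clock."""
    return math.ceil(block_bits / w) * t_io_clk


def supports_max_throughput(block_bits, w, t_io_clk, cycles_per_block, t_clk):
    """True if the next block can be loaded while the previous one is processed."""
    return load_time(block_bits, w, t_io_clk) <= cycles_per_block * t_clk


if __name__ == "__main__":
    # Hypothetical core: 512-bit block, 65 clock cycles per block, 10 ns clocks.
    for w in (8, 16, 32, 64):
        ok = supports_max_throughput(512, w, 10, 65, 10)
        print(f"w={w:2d}: load {load_time(512, w, 10)} ns, sustains core: {ok}")
```

A narrower bus saves pins but lengthens the load time, so the two conditions pull the choice of w in opposite directions.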

Interface: Two possible solutions

Length of the message communicated at the beginning

+ easy to implement passive source circuit
− area overhead for the counter of message bits

[Figure: input stream with msg_bitlen, message, and zero_word.]

Dedicated end of message port

− more intelligent source circuit required
+ no need for internal message bit counter

[Figure: SHA core with message and end_of_msg inputs.]

SHA Core: Interface & Typical Configuration

•  SHA core is an active component; surrounding FIFOs are passive and widely available
•  Input interface is separate from an output interface
•  Processing a current block, reading the next block, and storing a result for the previous message can be all done in parallel

[Block diagram: Input FIFO, SHA core, and Output FIFO connected in series by w-bit data buses, all clocked by clk and reset by rst. FIFO ports: din, dout, full, empty, write, read. SHA core ports: din, dout, src_ready, src_read, dst_ready, dst_write. External and interconnect signals: ext_idata, fifoin_full, fifoin_write, idata, fifoin_empty, fifoin_read, fifoout_full, fifoout_write, ext_odata, fifoout_empty, fifoout_read.]

SHA Core Interface

[Block diagram: SHA core with ports clk, rst, din, src_ready, src_read, dout, dst_ready, dst_write.]

Communication Protocol for Unpadded Messages

[Figure: input stream as a sequence of w-bit words. Without message splitting: msg_bitlen, message, zero_word. With message splitting: seg_0_bitlen, . . ., seg_1_bitlen, . . ., seg_n-1_bitlen, seg_n-1, zero_word.]

Communication Protocol for Unpadded Messages Without Message Splitting

[Figure: a w-bit header word "last = 1 | msg_len_bp" followed by the message.]

msg_len_bp – message length before padding [bits]

Communication Protocol for Unpadded Messages With Message Splitting

[Figure: w-bit header words "last = 0 | seg_0_len_bp", "last = 0 | seg_1_len_bp", . . ., "last = 1 | seg_n-1_len_bp", each followed by its segment, ending with seg_n-1.]

seg_i_len_bp – segment i length before padding [bits]

* For all i < n-1 segment i length is assumed to be a multiple of the message block size, b [characteristic to each function], and thus also the word size, w. The last segment cannot consist of only padding bits. It must include at least one message bit.
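The framing described by these two protocols (a header word carrying the last flag and the length before padding, followed by the data words) can be sketched in software. The exact bit layout of the header word is not given in the extracted text; the sketch assumes the last flag occupies the most significant bit and the length the remaining w-1 bits, with w = 64. The function names, the zero-filling of the final partial word, and the default block size of 64 bytes are likewise assumptions.

```python
W = 64  # assumed word size w in bits


def header_word(last, len_bits, w=W):
    """Pack 'last | len_bp' into one w-bit word (assumed layout: flag in MSB)."""
    if not 0 <= len_bits < (1 << (w - 1)):
        raise ValueError("length does not fit in w-1 bits")
    return (int(last) << (w - 1)) | len_bits


def to_words(data, w=W):
    """Split bytes into w-bit big-endian words, zero-filling the last word."""
    step = w // 8
    padded = data + bytes(-len(data) % step)
    return [int.from_bytes(padded[i:i + step], "big")
            for i in range(0, len(padded), step)]


def frame_message(msg, seg_bytes=None, block_bytes=64, w=W):
    """Build the word stream for one unpadded message.

    seg_bytes=None sends the message as a single segment (no splitting).
    Otherwise every segment except the last has exactly seg_bytes bytes,
    which must be a multiple of the message block size.
    """
    if seg_bytes is None or len(msg) <= seg_bytes:
        segments = [msg]
    else:
        if seg_bytes % block_bytes:
            raise ValueError("segment size must be a multiple of the block size")
        segments = [msg[i:i + seg_bytes] for i in range(0, len(msg), seg_bytes)]
    stream = []
    for i, seg in enumerate(segments):
        last = i == len(segments) - 1
        stream.append(header_word(last, 8 * len(seg), w))
        stream.extend(to_words(seg, w))
    return stream
```

For example, `frame_message(b"abc")` yields two words: a header with last = 1 and length 24, then the three message bytes left-aligned in one word.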

[Block diagram: SHA core with ports clk, rst, io_clk, din, src_ready, src_read, dout, dst_ready, dst_write.]

[Block diagram: Input FIFO, SHA core, and Output FIFO connected in series by w-bit data buses, with a separate I/O clock. The SHA core uses clk, rst, and io_clk; in the first variant the FIFOs are clocked by io_clk, and in the second by clk or io_clk. FIFO ports: din, dout, full, empty, write, read. External and interconnect signals: ext_idata, fifoin_full, fifoin_write, idata, fifoin_empty, fifoin_read, fifoout_full, fifoout_write, ext_odata, fifoout_empty, fifoout_read.]
