comparative implementation of aes-based crypto-cores


Upload: muhammad-asim-zahid

Post on 19-Jan-2017

Page 1: Comparative Implementation of AES-based Crypto-Cores

Department of Electrical Engineering and Information Technology

Computer Engineering Group

Comparative Implementation of AES-based Crypto-Cores

Master's Thesis

Submitted to Prof. Dr. Sybille Hellebrand

Muhammad Asim Zahid


Declaration

I declare that I have developed and written the enclosed Master Thesis completely by

myself, and have not used sources or means without declaration in the text. Any

thoughts from others or literal quotations are clearly marked. The Master Thesis was not

used in the same or in a similar version to achieve an academic grading or is being

published elsewhere.

Location, Date _____________________ Signature ___________________________


Contents

1 Introduction
  1.1 Motivation: Industry 4.0 and Data Security
  1.2 Objectives
2 Theoretical and Mathematical Preliminaries
  2.1 Essentials
  2.2 Standards and Information
  2.3 Mathematical Operations
    2.3.1 Addition and Subtraction
    2.3.2 Multiplication
3 Advanced Encryption Standard
  3.1 Introduction
    3.1.1 Features
    3.1.2 Usage
  3.2 Encryption
    3.2.1 SubBytes Transformation
    3.2.2 ShiftRows Transformation
    3.2.3 MixColumns Transformation
    3.2.4 AddRoundKey Transformation
      3.2.4.1 Key Expansion
      3.2.4.2 Round Key Addition
  3.3 Decryption
    3.3.1 InverseShiftRows Transformation
    3.3.2 InverseSubBytes Transformation
    3.3.3 InverseMixColumns Transformation
    3.3.4 AddRoundKey Transformation
4 Encryption & Authentication Modes
  4.1 Encryption Modes
    4.1.1 Electronic Codebook (ECB)
    4.1.2 Cipher Block Chaining (CBC)
    4.1.3 Counter Mode (CTR)
  4.2 Counter with CBC-MAC (CCM)
    4.2.1 Introduction
    4.2.2 Algorithm
  4.3 Galois/Counter Mode (GCM)
    4.3.1 Introduction
    4.3.2 Algorithm
      4.3.2.1 GHASH
      4.3.2.2 GCTR
    4.3.3 Authenticated Encryption
    4.3.4 Authenticated Decryption
5 Literature Review
  5.1 AES Designs
    5.1.1 Iterative Loop Structure
    5.1.2 Memory-Based AES Hardware Implementation Designs
    5.1.3 Unfolded / Parallel Structure Based Designs
    5.1.4 Sub-Pipelined or Stage-Level Pipelined Structure
  5.2 Galois Field Multiplier (GFM) Designs
6 Design & Implementation Results
  6.1 Design
    6.1.1 AES Core
    6.1.2 Galois Field Multiplier (GFM) Core
    6.1.3 Complete Design
  6.2 Implementation and Results
    6.2.1 Implementation Platform
    6.2.2 Tests & Results
  6.3 Conclusion
Appendix
Bibliography


List of Figures

Figure 2.1 Diagrammatic Representation of Encryption/Decryption
Figure 2.2 Diagrammatic Representation of Authentication
Figure 3.1 Diagrammatic Representation of a State
Figure 3.2 Flow Diagram of Round Transformations in AES Encryption
Figure 3.3 SubBytes Transformation
Figure 3.4 ShiftRows Transformation
Figure 3.5 Diagrammatic Representation of MixColumns
Figure 3.6 Flow Diagram of Key Expansion Algorithm
Figure 3.7 Diagrammatic Representation of RotWord
Figure 3.8 Diagrammatic Representation of AddRoundKey Transformation
Figure 3.9 Flow Diagram Representation of AES Decryption
Figure 3.10 Diagrammatic Representation of InverseShiftRows Transformation
Figure 3.11 Diagrammatic Representation of InverseMixColumns
Figure 4.1 Diagrammatic Representation of ECB Mode
Figure 4.2 Diagrammatic Representation of CBC Mode
Figure 4.3 Diagrammatic Representation of CTR Mode
Figure 4.4 Diagrammatic Representation of CBC-MAC
Figure 4.5 Diagrammatic Representation of the GHASH Function
Figure 4.6 Diagrammatic Representation of GF(2^128) Multiplication
Figure 4.7 Diagrammatic Representation of the Hashing Sequence
Figure 4.8 Diagrammatic Representation of Authenticated Encryption
Figure 4.9 Diagrammatic Representation of Authenticated Decryption
Figure 5.1 Iterative Loop Implementation
Figure 5.2 Diagrammatic Representation of Memory Based AES Structure
Figure 5.3 Unrolled AES Structure
Figure 5.4 (a) Pipelined AES Encryption Datapath, (b) Pipelined Key Scheduler
Figure 5.5 Stages for Sub-Pipelining
Figure 6.1 Internal Structure of AES Core
Figure 6.2 Diagrammatic Representation of Encryption Block
Figure 6.3 Diagrammatic Representation of Pipelined GF(2^128) Multiplier
Figure 6.4 Diagrammatic Representation of Authentication Block
Figure 6.5 State Diagram of Crypto-Core State Machine
Figure 6.6 Complete Design Layout
Figure 6.7 Graphical Comparison of Frequency and Throughput
Figure 6.8 Graphical Comparison of Area Usage
Figure 6.9 Graphical Comparison of Efficiency


List of Tables

Table 2.1 GF(2^8) Polynomial Representation
Table 3.1 AES Number of Rounds with respect to Key Length
Table 3.2 S-box: Substitution values for byte 'xy' (hexadecimal)
Table 3.3 Inverse S-box: Substitution values for byte 'xy' (hexadecimal)
Table 6.1 I/O Signals of Crypto-Core
Table 6.2 Comparison of Operational Frequency and Throughput
Table 6.3 Comparison of Area Usage and Efficiency


1 Introduction

As its name suggests, this thesis compares AES (Advanced Encryption Standard) crypto-cores in order to find a good, optimized way of transferring data securely. The need for safe and secure data transfer is growing, and preventing unauthorized access requires both data security and data integrity. Security protocols are used in everyday computing, and whenever the security of a standard protocol is compromised, the protocol is either redesigned immediately or replaced with an improved standard; for example, the Data Encryption Standard (DES) was replaced by AES [6].

A good security protocol should have a short processing time and a high throughput, should be reliable and, where required, should provide good authentication. The purpose of encryption is that only those who are authorized to read a message can read it, by decrypting it with a valid decryption algorithm.

Every sector aims to choose an encryption scheme that meets its security standards. A good encryption scheme is one for which deciphering the encrypted data is close to impossible, although in most cases even the most promising algorithm can be broken with a suitable attack. Choosing the right encryption scheme, with the known threats to the data in question in mind, is therefore very important.

1.1 Motivation: Industry 4.0 and Data Security

The fourth industrial revolution has set the trend for the exchange of data. It rests on four principles. The first is interoperability: sensors, devices, people and machines are connected and communicate through the Internet of Things (IoT) and the Internet of People (IoP). The second is information transparency: information systems create a virtual copy of the physical world by combining raw sensor data with higher-value context information. The third is technical assistance, which covers two abilities: supporting humans in solving problems and making informed decisions at short notice by aggregating and visualizing information comprehensibly, and supporting humans physically through cyber-physical systems. The fourth is decentralized decisions: cyber-physical systems perform their tasks autonomously and make their own decisions, and only in exceptional cases is a task delegated to a higher level. [3]

Industry 4.0 has created a growing need for systems with integrated cryptography, and many organizations have therefore proposed standards for data security and integrity that meet the Industry 4.0 requirements. Open Platform Communications Unified Architecture (OPC-UA), maintained by the non-profit OPC Foundation, is becoming increasingly popular as a standard for safe, reliable and platform-independent data exchange. It is developed in association with many researchers, manufacturers and users and is continually improved to remain competitive. To implement the vision of Industry 4.0, OPC-UA suggests that energy and resources must be used more effectively, in minimum time and with reduced complexity. This requires activities such as automating and optimizing systems and digitalizing all industrial sectors. This research is based on the security standard set by OPC-UA, which includes scalable mechanisms and the introduction of fast but area-efficient crypto-cores. [1], [2]

1.2 Objectives

Secure data transfer requires a good and reliable encryption standard. The Advanced Encryption Standard (AES) is widely used and has, in recent years, been regarded as one of the most reliable encryption standards available. AES has therefore been chosen as the encryption standard for this research. The main objectives of this thesis are:

- Study various AES techniques to gain a better understanding for further optimization.
- Implement a fast pipelined AES technique as an IP core with considerably lower area cost.
- Implement two AES modes, Galois/Counter Mode (GCM) and Counter with CBC-MAC (CCM), as IP cores for an accurate comparison.
- Devise a method for implementing the Galois Field multiplication of GF(2^8) for GCM that has considerably lower area cost but is still fast.


2 Theoretical and Mathematical Preliminaries

This chapter discusses the theoretical and mathematical preliminaries used in the following chapters.

2.1 Essentials

The main idea behind this thesis is a hardware security platform that makes the transfer of information from one node to another much more secure at the hardware level. Before going into the details of how this is done, the basic essentials are defined.

Encryption/Decryption: Encryption is a process that protects data so that only authorized units (people, machines, etc.) can access it. The data is encoded with a special key and can be decoded only by whoever holds that key. For example, if two units, A and B, exchange information that no unauthorized unit may access, encryption is used. The data may still be accessible, but it is unreadable to anyone who does not have the key. The process of decoding the encrypted data back into the original data is called decryption. Figure 2.1 illustrates encryption and decryption.

Figure 2.1 Diagrammatic Representation of Encryption/Decryption (the plaintext "Hello" is encrypted into the unreadable ciphertext "f3#7r" and decrypted back into "Hello")

Authentication: Authentication is a process that verifies that the data received is the data that was sent, i.e., that it has not been changed or modified in any way during transfer. This is achieved by attaching a verification tag, known as a Message Authentication Code (MAC), to the message. On the sender side, the MAC algorithm takes a secret key and the message to be sent and generates a MAC, which is sent along with the message. On the receiver side, the received message is passed through the same MAC algorithm with the same secret key, and the resulting MAC is compared with the received MAC to determine whether the message is authentic or has been tampered with. Figure 2.2 shows how authentication works.

Figure 2.2 Diagrammatic Representation of Authentication
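The sender/receiver flow described above can be sketched in a few lines of Python. This is only an illustration: it uses HMAC-SHA256 from the Python standard library as a stand-in MAC algorithm, not the CBC-MAC or GHASH constructions discussed later in this thesis, and the key, messages and function names are hypothetical.

```python
import hashlib
import hmac


def make_tag(key: bytes, message: bytes) -> bytes:
    """Sender side: compute a MAC over the message with the shared secret key."""
    # Assumption: HMAC-SHA256 is used here purely as an example MAC algorithm.
    return hmac.new(key, message, hashlib.sha256).digest()


def is_authentic(key: bytes, message: bytes, received_tag: bytes) -> bool:
    """Receiver side: recompute the MAC and compare it with the received one."""
    expected_tag = make_tag(key, message)
    # compare_digest performs a constant-time comparison of the two tags.
    return hmac.compare_digest(expected_tag, received_tag)


key = b"shared-secret-key"      # hypothetical key known to sender and receiver
message = b"Hello"
tag = make_tag(key, message)    # sent over the channel together with the message

print(is_authentic(key, message, tag))     # message arrives unmodified
print(is_authentic(key, b"Hellp", tag))    # message was modified in the channel
```

The receiver accepts the message only if the recomputed tag matches the received tag, which corresponds to the two outcomes "authentic" and "modified" in the text.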

Crypto-Core: An IP (intellectual property) core is a block of logic or data used in a field-programmable gate array (FPGA) or application-specific integrated circuit (ASIC) design. Ideally, an IP core is entirely portable, that is, it can easily be inserted into any vendor technology or design methodology. Universal Asynchronous Receiver/Transmitters (UARTs), central processing units (CPUs), Ethernet controllers and PCI interfaces are all examples of IP cores. Crypto-cores are IP cores that embed encryption in the hardware.


2.2 Standards and Information

The AES algorithm takes sequences of bits (binary values), referred to as blocks, as input and output. The number of bits in a block is known as the block length. AES requires a secret key, known as the Cipher Key, to encrypt and decrypt data. AES can be implemented with three different key lengths: 128, 192 and 256 bits [5]. The AES algorithm used in this thesis is the 128-bit variant.

Bits: Within such sequences, the bits are numbered from 0 to (block length - 1). A sequence of 8 bits is known as a byte, which is the basic unit of the AES algorithm. A sequence of 4 bytes is known as a word.

Index: The number attached to a bit. It ranges from 0 to 127 for a 128-bit algorithm.

Galois Field: A field containing a finite set of elements is known as a Galois Field (GF). A Galois Field is written as GF(p^n), which denotes a field of p^n elements, where p is a prime number. A Galois Field of 256 elements is therefore written as GF(2^8). This thesis considers GF(2^8), since AES operations take elements of GF(2^8) as input.

Galois Field Polynomial: The elements of a Galois Field GF(p^n) can also be written as polynomials of degree less than n. Consider an element of GF(2^8) written in binary form as {00110101}. The corresponding GF polynomial is x^5 + x^4 + x^2 + 1. Table 2.1 illustrates the polynomial representation in GF(2^8). Since GF(2^8) has 256 elements, Table 2.1 shows only a selection of them.

Binary     Conversion                                GF Polynomial
00000000   0x^7+0x^6+0x^5+0x^4+0x^3+0x^2+0x^1+0x^0   0
00000001   0x^7+0x^6+0x^5+0x^4+0x^3+0x^2+0x^1+1x^0   1
00000010   0x^7+0x^6+0x^5+0x^4+0x^3+0x^2+1x^1+0x^0   x
00000011   0x^7+0x^6+0x^5+0x^4+0x^3+0x^2+1x^1+1x^0   x+1
00000100   0x^7+0x^6+0x^5+0x^4+0x^3+1x^2+0x^1+0x^0   x^2
00000101   0x^7+0x^6+0x^5+0x^4+0x^3+1x^2+0x^1+1x^0   x^2+1
00000110   0x^7+0x^6+0x^5+0x^4+0x^3+1x^2+1x^1+0x^0   x^2+x
00000111   0x^7+0x^6+0x^5+0x^4+0x^3+1x^2+1x^1+1x^0   x^2+x+1
...        ...                                       ...
11111101   1x^7+1x^6+1x^5+1x^4+1x^3+1x^2+0x^1+1x^0   x^7+x^6+x^5+x^4+x^3+x^2+1
11111110   1x^7+1x^6+1x^5+1x^4+1x^3+1x^2+1x^1+0x^0   x^7+x^6+x^5+x^4+x^3+x^2+x
11111111   1x^7+1x^6+1x^5+1x^4+1x^3+1x^2+1x^1+1x^0   x^7+x^6+x^5+x^4+x^3+x^2+x+1

Table 2.1 GF(2^8) Polynomial Representation
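The mapping from a byte to its GF(2^8) polynomial can be sketched as a small Python helper. The function name byte_to_poly is hypothetical and chosen only for this illustration; each set bit i of the byte contributes the term x^i.

```python
def byte_to_poly(value: int) -> str:
    """Return the GF(2^8) polynomial of a byte as text, e.g. 0b101 -> 'x^2 + 1'."""
    if not 0 <= value <= 0xFF:
        raise ValueError("value must fit in one byte")
    terms = []
    # Walk from the most significant bit (x^7) down to the least (x^0).
    for power in range(7, -1, -1):
        if (value >> power) & 1:
            if power == 0:
                terms.append("1")
            elif power == 1:
                terms.append("x")
            else:
                terms.append(f"x^{power}")
    # The all-zero byte corresponds to the zero polynomial.
    return " + ".join(terms) if terms else "0"


for byte in (0b00000000, 0b00000110, 0b00110101, 0b11111111):
    print(f"{byte:08b} -> {byte_to_poly(byte)}")
```

For the byte {00110101} used as an example in the text, the helper returns x^5 + x^4 + x^2 + 1.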

2.3 Mathematical Operations

The elements in the AES algorithm are treated as elements of GF(2^8), as explained above. Finite field elements can be added and multiplied; however, these operations differ from those used for ordinary numbers. The mathematical concepts for finite field elements are explained in this section.

2.3.1 Addition and Subtraction

Although it differs slightly from ordinary algebraic addition and subtraction, the addition or subtraction of two GF polynomials is very simple. To add two GF(2^n) polynomials, the coefficients of equal powers are added and each resulting coefficient is reduced modulo 2. Reducing modulo an integer or polynomial means dividing by that integer or polynomial and taking the remainder as the answer. Subtraction of two GF(2^n) polynomials is the same as addition, because every coefficient is its own negative modulo 2. In GF(2^8), therefore, both operations amount to addition modulo 2.

As an example, take two GF(2^8) polynomials, A and B. Their sum is shown in Equation 2.1:

( 2.1)

Equation 2.1 shows that the same result is obtained by an exclusive-OR (XOR) operation between A and B, so the sum can also be written as A ⊕ B¹. The addition of two GF(2^8) polynomials can thus be carried out as an XOR operation between the two polynomials, which makes implementation much easier.
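The equivalence between coefficient-wise addition modulo 2 and XOR can be checked with a short Python sketch. The operands {57} and {83} are chosen here only for illustration and are not necessarily the polynomials A and B of Equation 2.1; the function names are hypothetical.

```python
def gf_add(a: int, b: int) -> int:
    """Add two GF(2^8) elements: a bitwise XOR of their byte representations."""
    return a ^ b


def gf_add_by_coefficients(a: int, b: int) -> int:
    """Same addition, written out as coefficient-wise addition modulo 2."""
    result = 0
    for power in range(8):
        coefficient = ((a >> power) & 1) + ((b >> power) & 1)
        result |= (coefficient % 2) << power
    return result


a = 0x57  # x^6 + x^4 + x^2 + x + 1
b = 0x83  # x^7 + x + 1
print(f"{gf_add(a, b):02x}")                  # x^7 + x^6 + x^4 + x^2
print(f"{gf_add_by_coefficients(a, b):02x}")  # same value
```

Because subtraction equals addition in GF(2^n), adding b to the sum a ⊕ b returns a again.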

2.3.2 Multiplication

To multiply two polynomials in the Galois Field GF(2^n), the polynomials are first multiplied as in ordinary algebra, except that the coefficients are only 0 and 1. Many terms drop out because 1 + 1 = 0, which makes the calculation easier. The result is then reduced modulo an irreducible polynomial of degree n. For the AES algorithm in GF(2^8), the irreducible polynomial is shown in Equation 2.2:

$$m(x) = x^8 + x^4 + x^3 + x + 1 \tag{2.2}$$

Multiplication is denoted by ●. Implementing the multiplication of finite field elements is somewhat more complex than addition. Reduction modulo m(x) ensures that the resulting binary polynomial has a degree less than 8 and can therefore be represented by a byte. Unlike addition, multiplication has no simple operation at the byte level.

As an example, consider the multiplication of two GF(2^8) polynomials, worked through in Equations 2.3 to 2.10.

( 2.3)

This implies:

( 2.4)

And,

1 ⊕ represents XOR of two elements


( 2.5)

= | ⁄ |

To compute

, first use as the quotient. Thus, by multiplying

with , results in:

( 2.6)

This when subtracted from gives:

( 2.7)

as the remainder. Now, the degree of the polynomial in the remainder is 10 so select

as the quotient. Multiplying with , result is:

( 2.8)

Subtracting 2.8 from 2.7 the remainder is:

( 2.9)

Since the terms with factor 2 in 2.9 will be dropped as mentioned previously and since

the final result would be the absolute value of the remainder, so the result is:

(2.10)
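The two steps described in this section, polynomial multiplication followed by reduction modulo m(x), can be sketched in Python as follows. The operands are taken from the well-known example {57} ● {83} = {c1} in the AES specification and are not necessarily the polynomials used in Equations 2.3 to 2.10; the names gf_mul and AES_MODULUS are hypothetical. The sketch works with XOR throughout, so no terms with a factor of 2 ever appear.

```python
AES_MODULUS = 0x11B  # m(x) = x^8 + x^4 + x^3 + x + 1 as a bit pattern


def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements and reduce the product modulo m(x)."""
    # Step 1: carry-less (polynomial) multiplication; partial products are XORed.
    product = 0
    for power in range(8):
        if (b >> power) & 1:
            product ^= a << power
    # Step 2: long division by m(x); the product has degree 14 at most.
    for degree in range(14, 7, -1):
        if (product >> degree) & 1:
            # Subtract (XOR) m(x) shifted so that it cancels the leading term.
            product ^= AES_MODULUS << (degree - 8)
    return product  # remainder of degree < 8, i.e. one byte


print(f"{gf_mul(0x57, 0x83):02x}")
print(f"{gf_mul(0x57, 0x02):02x}")
```

Each pass of the second loop corresponds to choosing one quotient term and subtracting the shifted m(x), as in the hand calculation above.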


3 Advanced Encryption Standard

For the encryption of commercial and sensitive computer data, the US government adopted the Data Encryption Standard (DES) as an official Federal Information Processing Standard (FIPS). Because it was the first encryption algorithm approved by the US government, public and private industries that required strong encryption welcomed it readily, and it was adopted in a wide variety of embedded systems, smart cards, SIM cards and network devices. DES is a symmetric cipher, i.e., it uses the same key to encrypt and decrypt a message. The most basic attack on any cipher is brute force, which involves trying each key until the right one is found; encryption strength therefore depends directly on the key size. DES uses a 64-bit key, of which eight bits are used for parity checks, effectively limiting the key to 56 bits. Compared with the processing power of modern computers, 56-bit keys came to be considered too small, which made DES susceptible to attacks, and it soon began to lose its usefulness. In 1997, the U.S. National Institute of Standards and Technology (NIST) started looking for a better alternative to DES. In 2001, it selected the Advanced Encryption Standard (AES) as a replacement.

3.1 Introduction

The Advanced Encryption Standard (AES) [5], also known as Rijndael after its two designers, the Belgian cryptographers Joan Daemen and Vincent Rijmen, was published by NIST in 2001. It is the most commonly used encryption standard throughout the world.

AES is a symmetric block cipher that operates on 128-bit blocks of input and output data. It is implemented in software and hardware to encrypt sensitive data and is used to protect classified information.

3.1.1 Features

AES is a mathematically efficient and elegant cryptographic algorithm, but its main strength rests in its key length options. It is based on a design principle known as a substitution-permutation network, a combination of substitution and permutation, and is fast in both software and hardware [31]. The algorithm encrypts and decrypts blocks using a secret key of 128, 192 or 256 bits. One of the main features of AES is its simplicity, which is achieved by repeatedly combining substitution and permutation computations over several rounds: AES encrypts or decrypts a 128-bit plaintext or ciphertext block by applying the same round transformation a number of times that depends on the key size.

Variant    Key Length   Block Length   Number of Rounds
AES-128    128-bit      128-bit        10
AES-192    192-bit      128-bit        12
AES-256    256-bit      128-bit        14

Table 3.1 AES Number of Rounds with respect to Key Length
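The relationship in Table 3.1 can be expressed compactly: with Nk denoting the key length in 32-bit words, the number of rounds is Nr = Nk + 6. The following Python sketch, with the hypothetical function name aes_rounds, reproduces the table under that rule.

```python
def aes_rounds(key_bits: int) -> int:
    """Return the number of AES rounds Nr for a key length given in bits."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES supports 128-, 192- and 256-bit keys only")
    nk = key_bits // 32   # key length in 32-bit words
    return nk + 6         # Nr = Nk + 6 for the fixed 128-bit block length


for bits in (128, 192, 256):
    print(f"AES-{bits}: {aes_rounds(bits)} rounds")
```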

The key length to use depends on the desired security level. Today, AES-128 is predominant and is supported by most hardware implementations. It is also the variant this implementation focuses on, since it is the preferred choice for the GCTR module of AES-GCM, the mode used to provide authenticity.

3.1.2 Usage

The AES standard is used by the departments and agencies concerned whenever sensitive unclassified information is considered important enough to require cryptographic protection.

Other FIPS-approved cryptographic algorithms may be used in addition to, or in lieu of, this standard. In recent years, commercial and private organizations have also turned to this standard to secure their information and systems.

3.2 Encryption

AES encryption is based on the design principle commonly referred to as a substitution-permutation network, a combination of substitution and permutation; the resulting algorithm is called a cipher. In plain words, a cipher is any method of encrypting a text, known as plaintext, so that its readability and/or meaning is concealed. It is a coded or disguised way of writing a message, and this coding is known as encryption. The encrypted text itself is sometimes also referred to as a cipher, but the usual term is ciphertext. The word is understood to originate from the Arabic word sifr, which means empty or zero. AES operates on a 4 × 4 matrix of bytes, referred to as the state S, although certain variants of Rijndael operate on a larger block size with more columns in the state [5]. The majority of AES calculations are performed in a special finite field. For instance, 16 bytes b0, b1, b2, ..., b15 are represented by the two-dimensional array shown in Equation 3.1, which is filled column by column.

$$S = \begin{bmatrix} b_0 & b_4 & b_8 & b_{12} \\ b_1 & b_5 & b_9 & b_{13} \\ b_2 & b_6 & b_{10} & b_{14} \\ b_3 & b_7 & b_{11} & b_{15} \end{bmatrix} \tag{3.1}$$

The diagrammatic representation of a State is shown in Figure 3.1.

Figure 3.1 Diagrammatic Representation of a State
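The column-wise arrangement of the 16 input bytes in the State can be sketched in Python. The helper names bytes_to_state and state_to_bytes are hypothetical; the sketch assumes the standard AES convention that byte b(r + 4c) is stored in row r and column c.

```python
def bytes_to_state(block: bytes) -> list:
    """Arrange a 16-byte block as a 4 x 4 State, filled column by column."""
    if len(block) != 16:
        raise ValueError("an AES block has exactly 16 bytes")
    # state[row][col] holds input byte number row + 4 * col.
    return [[block[row + 4 * col] for col in range(4)] for row in range(4)]


def state_to_bytes(state: list) -> bytes:
    """Read the State back out column by column into a 16-byte block."""
    return bytes(state[row][col] for col in range(4) for row in range(4))


block = bytes(range(16))        # b0 = 0, b1 = 1, ..., b15 = 15
state = bytes_to_state(block)
for row in state:
    print(row)                  # first row is b0, b4, b8, b12
```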

Initially, the input of the cipher is copied into the State array. After an initial Round Key Addition, the State array is transformed by applying a round function 10, 12 or 14 times, depending on the key length, as discussed previously.

The cipher algorithm for a 128-bit cipher is shown as a flow diagram in Figure 3.2. The individual transformations (SubBytes, ShiftRows, MixColumns and AddRoundKey) are explained in detail later in this chapter.

As shown in Figure 3.2, all Nr rounds are identical except for the final round (round Nr = 10), which does not include the MixColumns transformation.
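The order of the transformations can be made explicit with a short Python sketch that only lists the steps by name; it does not implement the transformations themselves, and the function name encryption_steps is hypothetical.

```python
def encryption_steps(nr: int = 10) -> list:
    """List the transformations applied during AES encryption with nr rounds."""
    steps = ["AddRoundKey"]                    # initial round key addition
    for _ in range(1, nr):                     # rounds 1 .. nr - 1
        steps += ["SubBytes", "ShiftRows", "MixColumns", "AddRoundKey"]
    steps += ["SubBytes", "ShiftRows", "AddRoundKey"]   # final round, no MixColumns
    return steps


steps = encryption_steps(10)
print(len(steps), "transformations in total")
print(steps.count("MixColumns"), "MixColumns,", steps.count("AddRoundKey"), "AddRoundKey")
```

For Nr = 10 the list contains nine MixColumns steps and eleven AddRoundKey steps, which is why the key expansion has to provide eleven round keys for AES-128.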


Figure 3.2 Flow Diagram of Round Transformations in AES Encryption

3.2.1 SubBytes Transformation

The SubBytes transformation substitutes bytes using an 8-bit substitution table (S-box). It operates on each byte of the State independently [4].

Figure 3.3 shows how the SubBytes transformation acts on the State.

Figure 3.3 SubBytes Transformation


The S-box is invertible. It is derived by taking the multiplicative inverse in GF(2^8), which has good non-linearity properties, followed by an affine transformation over GF(2). In the inversion step the element {00} is mapped to itself. Table 3.2 shows the substitution table used in the AES encryption algorithm [5].

x\y 0 1 2 3 4 5 6 7 8 9 a b c d e f

0 63 7c 77 7b f2 6b 6f c5 30 01 67 2b fe d7 ab 76

1 ca 82 c9 7d fa 59 47 f0 ad d4 a2 af 9c a4 72 c0

2 b7 fd 93 26 36 3f f7 cc 34 a5 e5 f1 71 d8 31 15

3 04 c7 23 c3 18 96 05 9a 07 12 80 e2 eb 27 b2 75

4 09 83 2c 1a 1b 6e 5a a0 52 3b d6 b3 29 e3 2f 84

5 53 d1 00 ed 20 fc b1 5b 6a cb be 39 4a 4c 58 cf

6 d0 ef aa fb 43 4d 33 85 45 f9 02 7f 50 3c 9f a8

7 51 a3 40 8f 92 9d 38 f5 bc b6 da 21 10 ff f3 d2

8 cd 0c 13 ec 5f 97 44 17 c4 a7 7e 3d 64 5d 19 73

9 60 81 4f dc 22 2a 90 88 46 ee b8 14 de 5e 0b db

a e0 32 3a 0a 49 06 24 5c c2 d3 ac 62 91 95 e4 79

b e7 c8 37 6d 8d d5 4e a9 6c 56 f4 ea 65 7a ae 08

c ba 78 25 2e 1c a6 b4 c6 e8 dd 74 1f 4b bd 8b 8a

d 70 3e b5 66 48 03 f6 0e 61 35 57 b9 86 c1 1d 9e

e e1 f8 98 11 69 d9 8e 94 9b 1e 87 e9 ce 55 28 df

f 8c a1 89 0d bf e6 42 68 41 99 2d 0f b0 54 bb 16

Table 3.2 S-box: Substitution values for byte 'xy' (hexadecimal)

The value of the byte is used as an index to find the substitution byte. For example, the byte {6d} is replaced by the entry located at row x = 6 and column y = d, i.e., {3c}.
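The table entries can also be computed instead of stored. The following is a minimal sketch, assuming the standard AES construction (inverse in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1, followed by the affine transformation with constant {63}); the helper names `gf_mul`, `gf_inv` and `sbox` are illustrative only.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B  # reduce by the AES field polynomial
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse in GF(2^8); {00} is mapped to itself."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):  # a^254 equals a^-1 for non-zero a
        result = gf_mul(result, a)
    return result


def sbox(byte):
    """Inverse in GF(2^8) followed by the affine transformation."""
    inv = gf_inv(byte)
    out = 0x63
    for shift in range(5):
        # XOR of the inverse with its left rotations by 0..4 bits
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out


substituted = sbox(0x6D)  # row x = 6, column y = d of Table 3.2
```

In hardware this corresponds to the choice between a lookup table and combinational logic for the S-box.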

3.2.2 ShiftRows Transformation

The ShiftRows operation cyclically shifts the bytes in each row of the State by a fixed offset. In AES, the first row (r = 0) remains unchanged, while the bytes of the second row are shifted to the left by an offset of one. Similarly, the third and fourth rows are shifted to the left by offsets of two and three, respectively [5].
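As a small illustration, the row shifts can be written as list rotations. The State is assumed to be a list of four rows of four bytes; the name `shift_rows` is chosen here for the sketch.

```python
def shift_rows(state):
    """Cyclically shift row r of the State to the left by r positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


state = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
shifted = shift_rows(state)
```

In a hardware design no logic is needed for this step, since a fixed rotation is only a matter of wiring.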


Figure 3.4 ShiftRows Transformation

3.2.3 MixColumns Transformation

The MixColumns transformation operates on the State column by column. Each column is treated as a four-term polynomial, so the transformation takes four bytes as input and gives four bytes as output, and each input byte affects all four output bytes. The columns, taken as polynomials over GF(2^8), are multiplied modulo x^4 + 1 with a fixed polynomial q(x), where [5]

\[ q(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\} \tag{3.2} \]

Here {01}, {02} and {03} are the hexadecimal values 0x01, 0x02 and 0x03, respectively.

Let the new column (in the State) be b(x) and the original column be a(x). The MixColumns transformation can then be represented as:

\[ b(x) = q(x) \otimes a(x) \bmod (x^4 + 1) \tag{3.3} \]

This can be written in matrix multiplication form [4]:

\[ \begin{bmatrix} b_{0,j} \\ b_{1,j} \\ b_{2,j} \\ b_{3,j} \end{bmatrix} = \begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix} \begin{bmatrix} a_{0,j} \\ a_{1,j} \\ a_{2,j} \\ a_{3,j} \end{bmatrix} \tag{3.4} \]

The four bytes in the new column after the MixColumns operation can be calculated by the expressions given in Equations (3.5) to (3.8).


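\[ b_{0,j} = (\{02\} \bullet a_{0,j}) \oplus (\{03\} \bullet a_{1,j}) \oplus a_{2,j} \oplus a_{3,j} \tag{3.5} \]

\[ b_{1,j} = a_{0,j} \oplus (\{02\} \bullet a_{1,j}) \oplus (\{03\} \bullet a_{2,j}) \oplus a_{3,j} \tag{3.6} \]

\[ b_{2,j} = a_{0,j} \oplus a_{1,j} \oplus (\{02\} \bullet a_{2,j}) \oplus (\{03\} \bullet a_{3,j}) \tag{3.7} \]

\[ b_{3,j} = (\{03\} \bullet a_{0,j}) \oplus a_{1,j} \oplus a_{2,j} \oplus (\{02\} \bullet a_{3,j}) \tag{3.8} \]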

The diagrammatic representation of the MixColumns transformation is given in Figure 3.5.

Figure 3.5 Diagrammatic Representation of MixColumns
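The four output bytes of one column can be sketched as follows. Multiplication by {02} is written as a helper called `xtime` here (an assumed name), and multiplication by {03} is obtained as `xtime(a) ^ a`; the function `mix_single_column` is likewise illustrative.

```python
def xtime(a):
    """Multiply a byte by {02} in GF(2^8)."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B  # reduce modulo x^8 + x^4 + x^3 + x + 1
    return a


def mix_single_column(column):
    """Apply the MixColumns matrix to one column [a0, a1, a2, a3]."""
    a0, a1, a2, a3 = column
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


mixed = mix_single_column([0xDB, 0x13, 0x53, 0x45])
```

Since only the {02} multiplication needs dedicated logic, the same decomposition is attractive for hardware implementations.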

3.2.4 AddRoundKey Transformation

In simple words, the AddRoundKey transformation XORs the output of the previous step (MixColumns in the first 9 rounds and ShiftRows in the final round) with a Round Key generated by the Key Expansion algorithm [4]. The transformation involves two steps, Key Expansion and the addition of the Round Key, which are described below:

3.2.4.1 Key Expansion

Considering AES-128, the Key Expansion algorithm takes a 128-bit key as input and generates a key schedule. Expanding the input key into the key schedule requires two functions, namely SubWord and RotWord [5], which are explained in detail later in this section. The 16-byte input cipher key is first copied into a word array w[i] following the pseudo-code shown below [5].

i = 0

while (i < 4)

    w[i] = word(key[4*i], key[4*i+1], key[4*i+2], key[4*i+3])

    i = i + 1


The flow diagram of the Key Expansion algorithm is given in Figure 3.6.

Figure 3.6 Flow Diagram of Key Expansion Algorithm

RotWord: Performs a cyclic permutation on a 4-byte word, as depicted in Figure 3.7.

Figure 3.7 Diagrammatic Representation of RotWord

SubWord: Takes a 4-byte word as input and applies the S-box substitution to each of the four bytes to give a 4-byte output. The S-box is the same one used in the SubBytes transformation.

From the pseudo-code and the flow diagram the following can be deduced:


The first 4 words of the expanded key are filled with the input Cipher Key.

Every following word is the XOR of the previous word (w[i-1]) and the word 4 positions earlier (w[i-4]).

For words in positions that are a multiple of 4, the RotWord and SubWord transformations are applied to w[i-1], and the result is XORed with a round constant Rcon[i/4] before the final XOR with w[i-4].
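The three rules above can be put together in a short sketch for AES-128. The S-box is computed on the fly to keep the example self-contained; all names (`expand_key_128`, `rot_word`, `sub_word`, `SBOX`) are illustrative and not taken from the thesis.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """Compute the AES S-box: field inverse followed by the affine map."""
    table = []
    for value in range(256):
        inv = 0
        if value:
            inv = 1
            for _ in range(254):
                inv = gf_mul(inv, value)
        out = 0x63
        for shift in range(5):
            out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        table.append(out)
    return table


SBOX = build_sbox()


def rot_word(word):
    """Cyclic permutation [a0, a1, a2, a3] -> [a1, a2, a3, a0]."""
    return word[1:] + word[:1]


def sub_word(word):
    """Apply the S-box to each of the four bytes."""
    return [SBOX[b] for b in word]


def expand_key_128(key):
    """Return the 44 words w[0..43] of the AES-128 key schedule."""
    if len(key) != 16:
        raise ValueError("AES-128 needs a 16-byte key")
    words = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = words[i - 1]
        if i % 4 == 0:
            temp = sub_word(rot_word(temp))
            temp = [temp[0] ^ rcon] + temp[1:]  # XOR with Rcon[i/4]
            rcon = gf_mul(rcon, 0x02)           # next round constant
        words.append([a ^ b for a, b in zip(words[i - 4], temp)])
    return words
```

Words w[4r] to w[4r+3] form the Round Key for round r, so the 44 words cover the initial key addition and the ten rounds.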

3.2.4.2 Round Key Addition

Once the Round Key k_{i,j} is generated, it is added to the output of the previous transformation, a_{i,j}, with a simple bitwise XOR, i.e., b_{i,j} = a_{i,j} ⊕ k_{i,j}. The diagrammatic representation of the Round Key addition is shown in Figure 3.8:

Figure 3.8 Diagrammatic Representation of AddRoundKey Transformation

3.3 Decryption

To decrypt data using the AES algorithm, the Cipher transformations stated above are inverted and then applied in reverse order. The transformations used in the decryption algorithm, or the Inverse Cipher, are InverseSubBytes, InverseShiftRows, InverseMixColumns and AddRoundKey.


The overall flow of decryption is the same as that of encryption, except that all transformations are the inverses of those in the encryption algorithm [5]. The flow diagram of the Inverse Cipher is given in Figure 3.9.

Figure 3.9 Flow Diagram Representation of AES Decryption

3.3.1 InverseShiftRows Transformation

As its name indicates, this is the inverse of the ShiftRows transformation. In the InverseShiftRows transformation the bytes are shifted to the right instead of to the left as in ShiftRows. The first row, r = 0, is not shifted. The bottom three rows are shifted right by offsets of 1, 2 and 3, respectively. Figure 3.10 shows the diagrammatic representation of the InverseShiftRows transformation.
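A short sketch shows that the right shift undoes the left shift. The State is assumed to be a list of four rows; `shift_rows` and `inv_shift_rows` are names chosen for this illustration.

```python
def shift_rows(state):
    """Forward transformation: rotate row r left by r positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def inv_shift_rows(state):
    """Inverse transformation: rotate row r right by r positions."""
    return [row[4 - r:] + row[:4 - r] for r, row in enumerate(state)]


state = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
restored = inv_shift_rows(shift_rows(state))  # equals the original State
```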


Figure 3.10 Diagrammatic Representation of InverseShiftRows Transformation

3.3.2 InverseSubBytes Transformation

The inverse of the SubBytes transformation requires an inverse S-box, which is used for one-to-one byte substitution. Table 3.3 shows the inverse S-box [5].

x\y 0 1 2 3 4 5 6 7 8 9 a b c d e f

0 52 09 6a d5 30 36 a5 38 bf 40 a3 9e 81 f3 d7 fb

1 7c e3 39 82 9b 2f ff 87 34 8e 43 44 c4 de e9 cb

2 54 7b 94 32 a6 c2 23 3d ee 4c 95 0b 42 fa c3 4e

3 08 2e a1 66 28 d9 24 b2 76 5b a2 49 6d 8b d1 25

4 72 f8 f6 64 86 68 98 16 d4 a4 5c cc 5d 65 b6 92

5 6c 70 48 50 fd ed b9 da 5e 15 46 57 a7 8d 9d 84

6 90 d8 ab 00 8c bc d3 0a f7 e4 58 05 b8 b3 45 06

7 d0 2c 1e 8f ca 3f 0f 02 c1 af bd 03 01 13 8a 6b

8 3a 91 11 41 4f 67 dc ea 97 f2 cf ce f0 b4 e6 73

9 96 ac 74 22 e7 ad 35 85 e2 f9 37 e8 1c 75 df 6e

a 47 f1 1a 71 1d 29 c5 89 6f b7 62 0e aa 18 be 1b

b fc 56 3e 4b c6 d2 79 20 9a db c0 fe 78 cd 5a f4

c 1f dd a8 33 88 07 c7 31 b1 12 10 59 27 80 ec 5f

d 60 51 7f a9 19 b5 4a 0d 2d e5 7a 9f 93 c9 9c ef

e a0 e0 3b 4d ae 2a f5 b0 c8 eb bb 3c 83 53 99 61

f 17 2b 04 7e ba 77 d6 26 e1 69 14 63 55 21 0c 7d

Table 3.3 Inverse S-box: Substitution values for byte 'xy' (hexadecimal)


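3.3.3 InverseMixColumns Transformation

As for the MixColumns transformation, the InverseMixColumns transformation also operates column by column, with each column being treated as a four-term polynomial. Each of the four input bytes has an effect on all four output bytes. The columns, taken as polynomials over GF(2^8), are multiplied modulo x^4 + 1 with a fixed polynomial, say q^{-1}(x), which can be represented as shown in Equation (3.9) [5]:

\[ q^{-1}(x) = \{0b\}x^3 + \{0d\}x^2 + \{09\}x + \{0e\} \tag{3.9} \]

Here {0b}, {0d}, {09} and {0e} are the hexadecimal values 0x0b, 0x0d, 0x09 and 0x0e, respectively.

Let the new column (in the State) be b^{-1}(x) and the original column be a^{-1}(x). The InverseMixColumns transformation can then be represented as:

\[ b^{-1}(x) = q^{-1}(x) \otimes a^{-1}(x) \bmod (x^4 + 1) \tag{3.10} \]

This can be written in matrix multiplication form [4], with a_{i,j} and b_{i,j} denoting the bytes of a^{-1}(x) and b^{-1}(x):

\[ \begin{bmatrix} b_{0,j} \\ b_{1,j} \\ b_{2,j} \\ b_{3,j} \end{bmatrix} = \begin{bmatrix} 0e & 0b & 0d & 09 \\ 09 & 0e & 0b & 0d \\ 0d & 09 & 0e & 0b \\ 0b & 0d & 09 & 0e \end{bmatrix} \begin{bmatrix} a_{0,j} \\ a_{1,j} \\ a_{2,j} \\ a_{3,j} \end{bmatrix} \tag{3.11} \]

The four bytes in the new column after the InverseMixColumns operation can be calculated by the expressions given in Equations (3.12) to (3.15) [5].

\[ b_{0,j} = (\{0e\} \bullet a_{0,j}) \oplus (\{0b\} \bullet a_{1,j}) \oplus (\{0d\} \bullet a_{2,j}) \oplus (\{09\} \bullet a_{3,j}) \tag{3.12} \]

\[ b_{1,j} = (\{09\} \bullet a_{0,j}) \oplus (\{0e\} \bullet a_{1,j}) \oplus (\{0b\} \bullet a_{2,j}) \oplus (\{0d\} \bullet a_{3,j}) \tag{3.13} \]

\[ b_{2,j} = (\{0d\} \bullet a_{0,j}) \oplus (\{09\} \bullet a_{1,j}) \oplus (\{0e\} \bullet a_{2,j}) \oplus (\{0b\} \bullet a_{3,j}) \tag{3.14} \]

\[ b_{3,j} = (\{0b\} \bullet a_{0,j}) \oplus (\{0d\} \bullet a_{1,j}) \oplus (\{09\} \bullet a_{2,j}) \oplus (\{0e\} \bullet a_{3,j}) \tag{3.15} \]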

The diagrammatic representation of the InverseMixColumns transformation is given in Figure 3.11.
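The inverse column operation can be sketched with a general GF(2^8) multiplication, since the coefficients {09}, {0b}, {0d} and {0e} are no longer a single doubling. The names `gf_mul`, `INV_MIX_MATRIX` and `inv_mix_single_column` are assumptions made for this example.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


# Coefficient matrix of the InverseMixColumns transformation
INV_MIX_MATRIX = [
    [0x0E, 0x0B, 0x0D, 0x09],
    [0x09, 0x0E, 0x0B, 0x0D],
    [0x0D, 0x09, 0x0E, 0x0B],
    [0x0B, 0x0D, 0x09, 0x0E],
]


def inv_mix_single_column(column):
    """Multiply one State column by the inverse matrix over GF(2^8)."""
    output = []
    for row in INV_MIX_MATRIX:
        acc = 0
        for coefficient, byte in zip(row, column):
            acc ^= gf_mul(coefficient, byte)
        output.append(acc)
    return output


original = inv_mix_single_column([0x8E, 0x4D, 0xA1, 0xBC])
```

The larger coefficients are the reason the inverse transformation typically costs more logic than the forward one.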


Figure 3.11 Diagrammatic Representation of InverseMixColumns

3.3.4 AddRoundKey Transformation

The AddRoundKey transformation remains the same for decryption and encryption, since the Key Expansion algorithm does not change and the addition of the Round Key is a simple XOR, which is its own inverse. Please see Section 3.2.4 for a detailed explanation.
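That XOR is its own inverse can be checked with a few lines. The State and the Round Key are assumed to be 4 × 4 lists of bytes, and `add_round_key` is a name chosen for this sketch.

```python
def add_round_key(state, round_key):
    """XOR every State byte with the corresponding Round Key byte."""
    return [
        [s ^ k for s, k in zip(state_row, key_row)]
        for state_row, key_row in zip(state, round_key)
    ]


state = [[0x32, 0x88, 0x31, 0xE0], [0x43, 0x5A, 0x31, 0x37],
         [0xF6, 0x30, 0x98, 0x07], [0xA8, 0x8D, 0xA2, 0x34]]
round_key = [[0x2B, 0x28, 0xAB, 0x09], [0x7E, 0xAE, 0xF7, 0xCF],
             [0x15, 0xD2, 0x15, 0x4F], [0x16, 0xA6, 0x88, 0x3C]]

once = add_round_key(state, round_key)
twice = add_round_key(once, round_key)  # applying it again restores the State
```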


4 Encryption & Authentication Modes

A Cipher encrypts or decrypts a single block of data; a scheme for applying the Cipher repeatedly to a message consisting of many blocks is known as a Mode of Operation for that Cipher. Many modes of operation for AES have been introduced over the years, and those relevant for this thesis are discussed in this chapter. Along with modes of operation for encryption, authentication modes are also discussed.

Over the years, authentication has become an integral part of information exchange. Many attacks involve the attacker injecting messages into the data in question, and thus there is a need to verify whether the data was sent by the claimed sender or by someone else. A mode of operation that provides both encryption and authentication is known as Authenticated Encryption (AE) [32]. AES also has various modes which provide AE. In this chapter, two AE modes of AES, namely CCM (Counter with CBC-MAC) and GCM (Galois/Counter Mode), are discussed; they are the main comparison platforms for the Crypto-Cores implemented in this thesis.

4.1 Encryption Modes

4.1.1 Electronic Codebook (ECB)

Electronic Codebook (ECB) is the simplest mode of operation for AES. A large message is divided into blocks of the cipher's block size, i.e. 128 bits for AES, and each block is encrypted or decrypted separately [32]. For example, Figure 4.1 explains the ECB mode diagrammatically for a message consisting of m blocks. Each Plaintext/Ciphertext block is 128 bits in size and is passed through the Cipher/Inverse Cipher separately with an identical key to produce an output Ciphertext/Plaintext block of 128 bits.


Figure 4.1 Diagrammatic Representation of ECB Mode

4.1.2 Cipher Block Chaining (CBC)

Cipher Block Chaining (CBC) mode is an AES mode of operation in which each Plaintext block is XORed with the previous Ciphertext block before encryption. For decryption, the output of the Inverse Cipher is XORed with the previous Ciphertext block to obtain the Plaintext block. Figure 4.2 shows the diagrammatic representation. For the first block, the Plaintext is XORed with an Initialization Vector (IV). An IV is a fixed-size input, in this case 128 bits, which can take any random value.
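The chaining can be sketched independently of the block cipher. In the following example `toy_encrypt` and `toy_decrypt` are hypothetical, insecure stand-ins for the AES Cipher and Inverse Cipher, used only so that the sketch runs on its own; all other names are likewise illustrative.

```python
BLOCK = 16


def toy_encrypt(key, block):
    """Hypothetical stand-in for the AES Cipher (NOT secure, only invertible)."""
    mixed = bytes(b ^ k for b, k in zip(block, key))
    return mixed[1:] + mixed[:1]


def toy_decrypt(key, block):
    """Inverse of toy_encrypt, standing in for the Inverse Cipher."""
    rotated = block[-1:] + block[:-1]
    return bytes(b ^ k for b, k in zip(rotated, key))


def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def cbc_encrypt(key, iv, plaintext):
    """XOR each Plaintext block with the previous Ciphertext block, then encrypt."""
    assert len(plaintext) % BLOCK == 0, "sketch assumes whole blocks"
    previous, out = iv, []
    for i in range(0, len(plaintext), BLOCK):
        previous = toy_encrypt(key, xor_blocks(plaintext[i:i + BLOCK], previous))
        out.append(previous)
    return b"".join(out)


def cbc_decrypt(key, iv, ciphertext):
    """Decrypt each block, then XOR with the previous Ciphertext block."""
    previous, out = iv, []
    for i in range(0, len(ciphertext), BLOCK):
        block = ciphertext[i:i + BLOCK]
        out.append(xor_blocks(toy_decrypt(key, block), previous))
        previous = block
    return b"".join(out)


key, iv = bytes(range(16)), bytes(16)
ciphertext = cbc_encrypt(key, iv, b"A" * 32)  # two identical Plaintext blocks
```

Because of the chaining, the two identical Plaintext blocks in the example produce different Ciphertext blocks, which would not be the case in ECB mode. The feedback also means that block i cannot be encrypted before block i-1 is finished, which limits pipelining.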

Figure 4.2 Diagrammatic Representation of CBC mode


4.1.3 Counter Mode (CTR)

Counter Mode (CTR) is a mode of operation for AES which turns a block cipher into a stream cipher. A 96-bit IV is given as input to a counter function. This can be any function that generates a sequence of values that does not repeat, but usually an increment-by-one counter is used. The 32-bit counter value is appended to the IV to form a new 128-bit string for each iteration. These strings are then used as inputs to the Cipher to generate a keystream, a stream of pseudorandom values, which is then XORed with the Plaintext to generate the Ciphertext. For decryption, the same keystream is XORed with the Ciphertext to recover the Plaintext. Figure 4.3 shows the diagrammatic representation of the CTR mode.
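The counter construction can be sketched as follows. Since CTR mode only uses the cipher in the encryption direction, a truncated SHA-256 is used here as a hypothetical stand-in for AES (`stand_in_cipher`); it is not a block cipher and serves only to make the example self-contained. The starting counter value of 1 is also an assumption of this sketch.

```python
import hashlib


def stand_in_cipher(key, block):
    """Hypothetical placeholder for AES_K(block); NOT a real block cipher."""
    return hashlib.sha256(key + block).digest()[:16]


def counter_block(iv, counter):
    """Append the 32-bit counter to the 96-bit IV to form a 128-bit block."""
    if len(iv) != 12:
        raise ValueError("sketch assumes a 96-bit IV")
    return iv + (counter % 2**32).to_bytes(4, "big")


def ctr_crypt(key, iv, data, cipher=stand_in_cipher, start=1):
    """Encrypt or decrypt: XOR the data with the generated keystream."""
    out = bytearray()
    for offset in range(0, len(data), 16):
        keystream = cipher(key, counter_block(iv, start + offset // 16))
        chunk = data[offset:offset + 16]
        out.extend(d ^ k for d, k in zip(chunk, keystream))
    return bytes(out)


key, iv = b"k" * 16, bytes(12)
ciphertext = ctr_crypt(key, iv, b"industry 4.0 sensor data")
plaintext = ctr_crypt(key, iv, ciphertext)  # the same call decrypts
```

Each keystream block depends only on the key, the IV and the counter, so the blocks can be computed in parallel or in a pipeline, unlike in CBC mode.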

Figure 4.3 Diagrammatic Representation of CTR Mode

4.2 Counter with CBC-MAC (CCM)

4.2.1 Introduction

As its name indicates, the Counter with CBC-MAC (CCM) mode uses the CBC mode to generate a MAC, and then CTR mode is applied to the message and the MAC (tag) to encrypt them. CCM is therefore a mode that applies Authenticated Encryption (AE) to the message. CCM can only be applied to block ciphers with a block size of 128 bits.


4.2.2 Algorithm

The MAC is generated using CBC-MAC, i.e. by applying the CBC mode of encryption to the message with an IV of 128 zero bits. Each block in CBC mode depends on the encryption of the previous block, so a change to any intermediate block propagates to the last block. The last block is used as the MAC, which is sent along with the message and is used to check whether the message is authentic. The diagrammatic representation in Figure 4.4 shows the working of the CBC-MAC.

Figure 4.4 Diagrammatic Representation of CBC-MAC

The generated MAC is encrypted along with the message using the CTR mode.
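The propagation of a change to the final block can be illustrated with a short sketch of the plain CBC-MAC chain described above (without the additional formatting that CCM applies to its input). The function `stand_in_cipher` is a hypothetical placeholder for AES built from SHA-256, and `cbc_mac` is a name chosen here.

```python
import hashlib


def stand_in_cipher(key, block):
    """Hypothetical placeholder for AES_K(block); NOT a real block cipher."""
    return hashlib.sha256(key + block).digest()[:16]


def cbc_mac(key, message, cipher=stand_in_cipher):
    """Run CBC with an all-zero IV and keep only the last block as the MAC."""
    assert len(message) % 16 == 0, "sketch assumes whole blocks"
    state = bytes(16)  # IV of 128 zero bits
    for offset in range(0, len(message), 16):
        block = message[offset:offset + 16]
        state = cipher(key, bytes(s ^ b for s, b in zip(state, block)))
    return state


key = b"\x01" * 16
message = b"block one......." + b"block two......." + b"block three....."
tampered = b"block one......." + b"block TWO......." + b"block three....."

mac = cbc_mac(key, message)
mac_tampered = cbc_mac(key, tampered)  # differs, although only block 2 changed
```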

4.3 Galois/Counter Mode (GCM)

4.3.1 Introduction

Galois/Counter Mode (GCM) is a block cipher mode of operation that uses universal hashing over a binary Galois field to provide Authenticated Encryption (AE). It can be implemented in hardware to achieve high speeds with low cost and low latency. There is a growing need for a mode of operation that can efficiently provide authenticated encryption at high speeds without too much area cost and that is free of Intellectual Property (IP) restrictions [33]. Since the possible use case for this thesis is Industry 4.0, achieving AE at high data rates is essential. The mode must admit pipelined implementations and have minimal computational latency in order to be useful at high data rates. GCM has the added advantage that it can act as a stand-alone MAC when encryption is not required. This is a feature which is not available in any of the other proposed AE implementations.

4.3.2 Algorithm

GCM implements the Galois mode of authentication with an underlying Cipher, usually AES, which is also used in this research. The underlying AES is operated in CTR mode [33]. The GCM algorithm has two core functions, namely GHASH and GCTR, which are explained below.

4.3.2.1 GHASH

The GHASH function is essentially a chain of finite field multiplications of the input with a hashing key H over GF(2^128). The hashing key H can be treated as a fixed 128-bit constant, since it does not change as long as the Cipher Key does not change [7]. It is calculated by applying the AES block cipher to 128 zero bits.

Algorithmically speaking, take X as the input bit string, where the length of X is 128*m for some integer m, H as the hash subkey and the block Y_m as the output. The following steps explain the algorithm [33]:

Let X_1, X_2, ..., X_{m-1}, X_m represent the unique sequence of blocks such that X = X_1 || X_2 || ... || X_{m-1} || X_m.²

Let Y_0 be the "zero block", i.e. a bit string consisting of 128 binary 0s.

For i = 1, ..., m, let Y_i = (Y_{i-1} ⊕ X_i) • H, where "•" indicates multiplication over the finite field.

Y_m, obtained at the end, is the output block that would be the MAC.

² || represents the concatenation of two elements.


The following block diagram of the algorithm gives a better understanding of it.

Figure 4.5 Diagrammatic Representation of the GHASH function

The multiplication over the finite field GF(2^128) can be explained by the following algorithm [33]:

Let X be the 128-bit block that has to be hashed, containing the bits x_0, x_1, x_2, ..., x_127.

Let H be the 128-bit Hash Key, i.e., 128 zero bits ciphered through the AES block.

Let Z_0 be 128 bits of 0s, U_0 = H, and R be a constant 128-bit string with the value 11100001 || 0^120.

For i = 0 to 127

\[ Z_{i+1} = \begin{cases} Z_i & \text{if } x_i = 0 \\ Z_i \oplus U_i & \text{if } x_i = 1 \end{cases} \tag{4.1} \]

\[ U_{i+1} = \begin{cases} U_i \gg 1 & \text{if } \mathrm{LSB}(U_i) = 0 \\ (U_i \gg 1) \oplus R & \text{if } \mathrm{LSB}(U_i) = 1 \end{cases} \tag{4.2} \]

After these operations are done, the 128 bits of Z_128 are the output of the multiplication.
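The bit-serial multiplication and the GHASH chain built on it can be sketched as below. Blocks are held as Python integers with bit 0 of the algorithm as the most significant bit; the names `gf128_mul` and `ghash` are chosen for this illustration, and the sketch makes no attempt to run in constant time.

```python
R = 0xE1 << 120  # the constant 11100001 || 0^120


def gf128_mul(x, h):
    """Multiply X by H in GF(2^128), one bit of X per iteration."""
    z = 0  # Z_0 = 0^128
    u = h  # U_0 = H
    for i in range(128):
        if (x >> (127 - i)) & 1:  # x_i, taken from the leftmost bit onwards
            z ^= u
        if u & 1:                 # LSB(U_i) = 1: shift and reduce with R
            u = (u >> 1) ^ R
        else:
            u >>= 1
    return z                      # Z_128


def ghash(h, data):
    """Y_i = (Y_{i-1} XOR X_i) * H over all 128-bit blocks of data."""
    assert len(data) % 16 == 0, "sketch assumes whole blocks"
    y = 0  # Y_0 is the zero block
    for offset in range(0, len(data), 16):
        x = int.from_bytes(data[offset:offset + 16], "big")
        y = gf128_mul(y ^ x, h)
    return y.to_bytes(16, "big")


ONE = 1 << 127  # the field element "1" in this bit ordering
product = gf128_mul(0x0123456789ABCDEF, ONE)  # multiplying by "1" changes nothing
```

The loop body corresponds to one clock cycle of a bit-serial multiplier, whereas a bit-parallel multiplier unrolls all 128 iterations into combinational logic.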


Figure 4.6 shows the diagrammatic representation of the GF(2^128) multiplication operation.

Figure 4.6 Diagrammatic Representation of GF(2^128) Multiplication


The finite field multiplication in GHASH has two possible implementations, i.e. a bit-serial implementation and a bit-parallel implementation. Simply explained, the core of the GHASH architecture is a 128-bit multiplier over GF(2^128), which multiplies two 128-bit operands to generate a 128-bit output. One operand of the GF multiplier is the hash subkey H, which can be treated as a fixed 128-bit constant because it does not change as long as the 128-bit key does not change. For the second operand, two values have to be taken into consideration: the 128-bit additional authenticated data (AAD) block sequence and the Ciphertext block sequence [7]. Figure 4.7 shows the diagrammatic representation of the hashing sequence.

Figure 4.7 Diagrammatic Representation of the Hashing sequence

The 128-bit AAD blocks A_1, A_2, ..., A_m are fed into the GHASH through one of the two inputs of the XOR gates. The 128-bit Ciphertext block sequence, C_1, C_2, ..., C_n, is fed into the same input of the XOR gates following the AAD. Meanwhile, the intermediate hash value is fed back to the other input of the XOR gates to generate the second operand for the GF multiplier. Considering that it takes m clock cycles for the AAD hashing and n clock cycles for the ciphertext block hashing, the latency of a bit-parallel multiplier is m+n+1 clock cycles and that of a bit-serial multiplier is 128*(m+n+1) clock cycles [33]. The advantage of the bit-serial multiplier over the bit-parallel multiplier is that it uses fewer logic elements, but at the same time it adds more latency to the system.
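The two latency figures can be compared with a small helper. The function name and the block counts in the example are made up for illustration; only the two formulas come from the text above.

```python
def ghash_latency_cycles(aad_blocks, ciphertext_blocks, bit_serial=False):
    """Clock cycles for hashing m AAD blocks and n Ciphertext blocks."""
    cycles = aad_blocks + ciphertext_blocks + 1  # m + n + 1
    return 128 * cycles if bit_serial else cycles


# Hypothetical packet: 2 AAD blocks and 4 Ciphertext blocks
parallel = ghash_latency_cycles(2, 4)                  # 7 cycles
serial = ghash_latency_cycles(2, 4, bit_serial=True)   # 896 cycles
```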


4.3.2.2 GCTR

GCTR is the implementation of the previously explained CTR mode with a particular incrementing function for generating the necessary sequence of counter blocks. GCM consists of an underlying block cipher and a Galois field multiplier, with which authenticated encryption and authenticated decryption are realized. The cipher needs to have a block size of 128 bits. For encryption, an initial counter is first derived from an Initialization Vector (IV). The initial counter value is incremented, encrypted and XORed with the first plaintext block. For each subsequent plaintext block, the counter is incremented again and then encrypted. The underlying cipher is only used in its encryption direction. GCM allows pre-computation of the block cipher function if the IV is known ahead of time [33].

4.3.3 Authenticated Encryption

Now that the working of the GHASH and GCTR functions is understood, they can be combined to understand the authenticated encryption and authenticated decryption that take place inside the GCM mode. Authenticated encryption is considered first. With a 128-bit AES as the underlying block cipher, the inputs are a Plaintext P, an initialization vector IV and additional authenticated data A. The outputs are the Ciphertext C and the authentication MAC. The following steps explain the authenticated encryption algorithm [33]:

Let H be the hash subkey, which is 128 bits of 0s ciphered through the block cipher, i.e., H = AES_K(0^128).

Define J_0 such that J_0 is a 128-bit string consisting of 96 bits of any value (the IV), 31 '0' bits and one '1' bit, i.e., J_0 = IV || 0^31 || 1.

Let C_i = AES_K(CB_i) ⊕ P_i, where CB_i are the counter blocks in the GCTR, starting from the incremented J_0, and P_i is the Plaintext data.

Let MAC = GHASH_H(A || C) ⊕ AES_K(J_0), where, following [33], A and C are each zero-padded to a multiple of 128 bits and followed by a block encoding their bit lengths before hashing.

The resulting C is the Ciphertext and the resulting MAC is the authentication MAC.

The flow diagram in Figure 4.8 shows how the Authenticated Encryption algorithm works.


Figure 4.8 Diagrammatic Representation of Authenticated Encryption

4.3.4 Authenticated Decryption

The inputs for the authenticated decryption are the 128-bit block cipher AES, the initialization vector IV, the ciphertext C, the additional authenticated data A and the MAC, whereas the output is either the plaintext P or the indication of inauthenticity FAIL. The following steps explain the algorithm of the authenticated decryption [33]:

Let H be the hash subkey, which is 128 bits of 0s ciphered through the block cipher, i.e., H = AES_K(0^128).

Define J_0 such that J_0 is a 128-bit string consisting of 96 bits of any value (the IV), 31 '0' bits and one '1' bit, i.e., J_0 = IV || 0^31 || 1.

Let P_i = AES_K(CB_i) ⊕ C_i, where CB_i are the counter blocks in the GCTR, starting from the incremented J_0, and C_i is the ciphered data.

Let MAC' = GHASH_H(A || C) ⊕ AES_K(J_0), where, following [33], A and C are each zero-padded to a multiple of 128 bits and followed by a block encoding their bit lengths before hashing.

If MAC' = MAC, then return P, which is the resultant plaintext; else return FAIL.
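The encryption and decryption steps can be sketched together as follows. This is a structural illustration only: `stand_in_cipher` is a hypothetical placeholder for AES built from SHA-256, so the outputs do not match real AES-GCM, and all function names are assumptions. The padding and length block follow the GCM specification for a 96-bit IV.

```python
import hashlib
import hmac

R = 0xE1 << 120  # reduction constant 11100001 || 0^120


def stand_in_cipher(key, block):
    """Hypothetical placeholder for AES_K(block); NOT a real block cipher."""
    return hashlib.sha256(key + block).digest()[:16]


def gf128_mul(x, h):
    z, u = 0, h
    for i in range(128):
        if (x >> (127 - i)) & 1:
            z ^= u
        u = (u >> 1) ^ R if u & 1 else u >> 1
    return z


def ghash(h, data):
    y = 0
    for i in range(0, len(data), 16):
        y = gf128_mul(y ^ int.from_bytes(data[i:i + 16], "big"), h)
    return y.to_bytes(16, "big")


def inc32(block):
    """Increment only the rightmost 32 bits of a counter block."""
    counter = (int.from_bytes(block[12:], "big") + 1) % 2**32
    return block[:12] + counter.to_bytes(4, "big")


def gctr(key, counter_block, data, cipher):
    out = bytearray()
    for i in range(0, len(data), 16):
        keystream = cipher(key, counter_block)
        out.extend(a ^ b for a, b in zip(data[i:i + 16], keystream))
        counter_block = inc32(counter_block)
    return bytes(out)


def pad16(data):
    return data + bytes(-len(data) % 16)  # zero-pad to a block boundary


def compute_mac(key, j0, aad, ciphertext, cipher):
    h = int.from_bytes(cipher(key, bytes(16)), "big")  # H = E_K(0^128)
    lengths = (8 * len(aad)).to_bytes(8, "big") + (8 * len(ciphertext)).to_bytes(8, "big")
    s = ghash(h, pad16(aad) + pad16(ciphertext) + lengths)
    return gctr(key, j0, s, cipher)  # GHASH output XOR E_K(J_0)


def gcm_encrypt(key, iv, plaintext, aad=b"", cipher=stand_in_cipher):
    j0 = iv + b"\x00\x00\x00\x01"  # 96-bit IV || 0^31 || 1
    ciphertext = gctr(key, inc32(j0), plaintext, cipher)
    return ciphertext, compute_mac(key, j0, aad, ciphertext, cipher)


def gcm_decrypt(key, iv, ciphertext, mac, aad=b"", cipher=stand_in_cipher):
    j0 = iv + b"\x00\x00\x00\x01"
    expected = compute_mac(key, j0, aad, ciphertext, cipher)
    if not hmac.compare_digest(expected, mac):
        return None  # FAIL: data is not authentic
    return gctr(key, inc32(j0), ciphertext, cipher)
```

Note that decryption recomputes the MAC over the received Ciphertext before releasing any Plaintext, and that both directions use the cipher only in its encryption direction.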

Figure 4.9 shows the diagrammatic representation of Authenticated Decryption.


Figure 4.9 Diagrammatic Representation of Authenticated Decryption


5 Literature Review

The implementation of the previously discussed algorithms in hardware in the form of Crypto-Cores has been studied in the recent past. In this chapter the previous studies are discussed. The hardware implementation of an AES-GCM crypto-core consists of two core blocks:

An AES core

A Galois Field Multiplier core

The previous studies on the implementation of these core blocks are discussed further in this chapter.

5.1 AES Designs

The AES algorithm itself has four transformations, which for a 128-bit key have to be applied 10 times, as discussed earlier. When implemented in an iterative structure, this takes one clock cycle to complete each transformation of each round. Although this implementation is simple, the efficiency of the system is very low. Further studies introduced pipelining for implementing AES in hardware. A lot of research has been done in this area, and the studies have provided better solutions for the hardware implementation of the AES architecture.

An efficient hardware implementation shows good data throughput with very little area usage. Throughput can be defined as:

\[ \text{Throughput} = \frac{\text{Block size (bits)} \times \text{Clock frequency}}{\text{Clock cycles per block}} \]

The existing proposed designs for the hardware implementation of the AES architecture can be classified into four groups:

Iterative Loop Structure based Designs


Memory-Based Designs

Unfolded Structure or Parallel Structure based Designs

Sub-Pipelined Structure based Designs

5.1.1 Iterative Loop Structure

A design based on the Iterative Loop Structure is the simplest form of implementing the AES architecture in hardware. It has very low hardware utilization, since it uses a single hardware core for one round, which is reused for all ten rounds. If all of the round transformations (SubBytes, ShiftRows, MixColumns and AddRoundKey) are done in a single clock cycle, so that encrypting a 128-bit block takes ten clock cycles, the system shows a low clock frequency.

Figure 5.1 Iterative Loop Implementation

On the other hand, [18] discusses that if each round transformation is done in a separate clock cycle using intermediate registers, a higher operating frequency is achievable, but the number of clock cycles required to encrypt a block also increases by a factor of 4. Thus, in both cases, the achievable throughput is not that impressive. Pipelined structures have been implemented in many other studies and compared with other structures.
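The trade-off between the two iterative variants can be illustrated numerically, assuming that throughput is the block size times the clock frequency divided by the clock cycles per block. The clock frequencies below are invented for the example and are not measurements from [18] or from this thesis.

```python
def throughput_mbps(block_bits, clock_mhz, cycles_per_block):
    """Throughput in Mbit/s for one block processed at a time."""
    return block_bits * clock_mhz / cycles_per_block


# Hypothetical figures: one round per cycle at a low clock frequency ...
round_per_cycle = throughput_mbps(128, clock_mhz=100, cycles_per_block=10)
# ... versus one transformation per cycle at a three times higher frequency
transformation_per_cycle = throughput_mbps(128, clock_mhz=300, cycles_per_block=40)
```

With these assumed numbers the second variant would need more than four times the clock frequency of the first to come out ahead, because it needs four times as many cycles per block.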

5.1.2 Memory-Based AES Hardware Implementation Designs

One example of this is a Memory-Based structure of the AES where, instead of FPGA logic units, memory blocks are utilized to perform the round transformations.


The S-box in the SubBytes transformation and the column multiplication in the MixColumns transformation are implemented in internal memory blocks instead of the FPGA logic units. This is possible because each resultant element in these operations depends on a single element of the data block. ShiftRows is a fixed operation and can be achieved by routing the values to the appropriate memory blocks. Only the addition part of the MixColumns transformation and the AddRoundKey transformation are done using the logic units. Figure 5.2 shows what the blocks look like during a round.

Figure 5.2 Diagrammatic Representation of Memory Based AES structure

In [9] a memory-based design was suggested for an unfolded AES core where all the rounds are processed in parallel. The logic unit utilization is reduced, but the overall frequency depends on the operating frequency of the internal memory and, furthermore, the memory utilization in this architecture is quite high.

In [16], a fully synchronous, memory-based, single-chip FPGA implementation of the Rijndael encryption algorithm, the recent AES standard, is presented. Design partitioning allowed for an iterative loop structure in which the block cipher was implemented using the Electronic Codebook (ECB) mode of operation. The encryption RTL design focuses on a memory-based bite-sized arithmetic pipeline structure that processes one round at a time.


In [17] a comparison of AES hardware implementations was introduced, using either a composite field algorithm or Block RAM to realize the SubBytes and MixColumns modules. The composite field algorithm resulted in a lower efficiency (throughput/area), whereas for the Block RAM based design the maximum frequency was limited to the maximum frequency of the Block RAMs.

5.1.3 Unfolded / Parallel Structure Based Designs

A parallel structure based design is used to achieve very high throughput, but at a correspondingly high area cost. All the rounds are performed in a single iteration, which means that the loop is unfolded or unrolled and all ten round cores operate in parallel.

Figure 5.3 Unrolled AES Structure

Unrolled AES structures have been widely used for high-throughput implementations where hardware cost is not an issue. Studies like [7] and [13] have implemented unrolled AES architectures for Galois/Counter Mode. Since the parallel structure of AES computes all the rounds in one clock cycle, it shows very high throughput, even though the achievable frequency is not that high.

5.1.4 Sub-Pipelined or Stage-Level Pipelined Structure

A sub-pipelined structure is one where the internal blocks of a round in an AES structure are pipelined. In [10], an architecture for pipelining the AES rounds efficiently is introduced. It is suggested that multiple packets be processed in parallel and that registers be introduced in the critical path of the stages. This increases the achievable frequency of the system, at the expense of latency, with significantly less area usage on the hardware and very little internal memory usage. The suggested system is considerably faster than a normal iterative structure and is known as a sub-pipelined structure. It was introduced because simple pipelining was not efficient for the CBC mode due to its feedback nature.

The sub-pipelined structure works by pipelining the internal transformations of the AES algorithm, which is done by inserting registers into the critical paths. In [10], this stage pipelining was introduced for an AES-CCM based design. It allows 4 blocks of data to be encrypted concurrently by one AES core. The flow diagram of the encryption datapath is shown in Figure 5.4.

Figure 5.4 (a) Pipelined AES encryption datapath, (b) Pipelined Key Scheduler

The SubBytes and ShiftRows transformations are implemented in the stage for reg2. SubBytes can be implemented in two ways: as a 256-byte lookup table (LUT), or by calculating the sub-byte transformation in logic. Calculating it in logic would have a substantial area cost and a larger latency; therefore it is easier, cheaper and faster to implement a LUT. ShiftRows is implemented simply by routing the Sbox outputs to the appropriate MixColumns inputs, and thus does not require a separate step.

[Figure 5.4 block labels: Input, Output, Round Key, RCON, Key0, Key1, Key2, Key4, reg1 to reg8, Sbox LUT, MixColumn Mult(2's), MixColumn Add (part), AddRoundKey]

In the MixColumns transformation, addition and multiplication take place in GF(2^8). As previously discussed, MixColumns requires the multiplication of the polynomial with {02} and {03}. The multiplication with {03} can be achieved by multiplying the polynomial with {02} and then adding the original polynomial to the result; therefore, only the multiplication with {02} needs to be implemented. The MixColumns transformation is achieved in three stages. Following are the stages:

Multiplying polynomial with {02}

Addition process

AddRoundKey performed
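The reduction of the {03} multiplication to a {02} multiplication plus an addition can be illustrated with a short Python sketch. The function names xtime, mul3 and mix_column are chosen here for illustration and are not taken from the thesis sources; the reduction polynomial x^8 + x^4 + x^3 + x + 1 is the one specified for AES in [5]. The sketch covers only the first two stages (multiplication and addition), not AddRoundKey.

```python
def xtime(b: int) -> int:
    """Multiply a GF(2^8) element by {02}, reducing with the AES polynomial 0x11B."""
    b <<= 1
    if b & 0x100:          # overflow out of 8 bits: subtract (XOR) the polynomial
        b ^= 0x11B
    return b & 0xFF


def mul3(b: int) -> int:
    """Multiply by {03} as ({02} * b) + b, where addition in GF(2^8) is XOR."""
    return xtime(b) ^ b


def mix_column(col):
    """Apply MixColumns to one 4-byte column using only xtime and XOR."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]
```

In hardware, xtime corresponds to a one-bit shift and a conditional XOR, which is why only the {02} multiplier needs dedicated logic.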

The 4-stage pipelined structure works such that 4 blocks of data are processed concurrently, each offset from the previous one by one clock cycle. As soon as a stage has computed its result for one block, it is ready to take the next block of data. In this way four pipelines are working, as shown in Figure 5.5.

Figure 5.5 Stages for Sub-Pipelining

It takes 40 clock cycles to compute one block (10 rounds of 4 stages each), but four blocks are processed at the same time. This increases the operating frequency of the system at very little area cost [10].
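The interleaving of the four pipelines can be sketched as a small scheduling model. This is an idealized illustration under stated assumptions: block b is assumed to enter Stage 1 in clock cycle b, every stage takes exactly one cycle, and control or I/O overhead is not modelled. The function name block_position is hypothetical.

```python
STAGES = 4    # pipeline stages per round
ROUNDS = 10   # AES-128 rounds


def block_position(block: int, cycle: int):
    """Return (round, stage), both 1-based, of a block in a given cycle.

    Returns None if the block has not entered the pipeline yet or has
    already finished all rounds.
    """
    elapsed = cycle - block                  # block b enters in cycle b (assumption)
    if elapsed < 0 or elapsed >= STAGES * ROUNDS:
        return None
    return (elapsed // STAGES + 1, elapsed % STAGES + 1)


if __name__ == "__main__":
    for cycle in range(6):
        print(cycle, [block_position(b, cycle) for b in range(4)])
```

In this model each block needs 40 cycles, and in any cycle where all four blocks are active they occupy four different stages, so no stage is idle.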

[Figure 5.5 content: Pipeline 1 to Pipeline 4, each passing through Stage 1, Stage 2, Stage 3 and Stage 4]

5.2 Galois Field Multiplier (GFM) Designs

In recent years GCM has been increasingly adopted in hardware since it has proven to be fast and efficient. Since GCM can be parallelized and pipelined (unlike CBC, due to its feedback nature), it is very desirable where hardware implementations are concerned. Many ideas have been suggested for implementing GCM as a Crypto-Core, and the main computational complexity of GCM, compared to any other mode, usually lies in the multiplication in the GF(2^128) field for hashing.

Conventionally, two methods have been suggested for implementing the finite field multiplication required for hashing in hardware:

Bit-Serial Multiplier

Bit-Parallel Multiplier

The Bit-Serial Multiplier is an implementation of the multiplier in which each iteration of the product, as depicted in Equations 4.1 and 4.2, is calculated serially. This design trades a higher latency for very low area usage on the hardware.

The Bit-Parallel Multiplier calculates the complete product in 1 clock cycle, but the hardware cost of handling the 128-bit operands in GF(2^128) is very high. The high complexity of the multiplier also reduces the achievable operating frequency of the system when the product is computed in parallel.
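The serial scheme can be sketched in software as a loop with one iteration per operand bit. The sketch below follows the GF(2^128) multiplication as defined for GCM in [33], with blocks held as Python integers in big-endian order (the leftmost bit of the block is the coefficient of x^0). The function name gf128_mul_serial is an assumption made for this illustration; a bit-parallel multiplier would compute the same result with all 128 iterations unrolled into combinational logic.

```python
# Reduction constant for x^128 + x^7 + x^2 + x + 1 in the GCM bit ordering.
R = 0xE1 << 120


def gf128_mul_serial(x: int, y: int) -> int:
    """Multiply two GF(2^128) elements, one bit of x per iteration."""
    z = 0          # accumulator for the product
    v = y          # holds y * x^i at the start of iteration i
    for i in range(128):
        if (x >> (127 - i)) & 1:      # bit i of x, counted from the left
            z ^= v                    # addition in GF(2^128) is XOR
        # Multiply v by x: shift right, reduce if a bit falls off the end.
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z
```

Each loop iteration corresponds to one clock cycle of a bit-serial design, which makes the latency (128 cycles per product) and the small datapath of this approach visible.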

Many studies have been carried out to find a suitable solution for implementing Galois Field multiplication on hardware. In [8], a method for implementing a parallel multiplier was introduced, known as the Mastrovito multiplier. It is the most widely used method for implementing Galois Field multiplication on hardware. The design is essentially a brute force multiplier in the sense that the matrix-vector product, shown in Equations 4.1 and 4.2, is computed like a traditional matrix multiplication. The operands are elements of GF(2^m) whose coordinates lie in GF(2), so AND and XOR gates are used for element-wise multiplication and addition, respectively. Although the Mastrovito multiplier is fast, its area usage on the hardware is considerably large, which increases the cost of the hardware.


Another method for implementing GF multiplication on hardware was introduced in [29]. The idea was to use the multiplication method introduced in [22] by A. Karatsuba, hence the name Karatsuba multiplier. The Karatsuba multiplier decreases the number of multiplication operations while increasing the number of addition operations. Since an addition operation requires less area than a multiplication, the area cost of a Karatsuba multiplier is lower than that of a Mastrovito multiplier. This, however, comes with a delay cost, due to which Karatsuba based multipliers are much slower than Mastrovito multipliers.

Recently, a method that serves as a compromise between the Karatsuba and Mastrovito multipliers was introduced, called the Fan-Hasan (FH) multiplier [25], [27], [28]. The FH multiplier is considerably faster than the Karatsuba multiplier but still has a larger delay than the Mastrovito multiplier. However, its area overhead is much lower than that of the Mastrovito multiplier. A comparison between the three types of multipliers is provided in [30].

Since the Mastrovito multiplier is still the most widely used multiplier for the GFM in hardware implementations of GCM, this study also looks for a good GFM solution using the Mastrovito multiplier. The problem with the Mastrovito multiplier is the large area overhead caused by the parallel matrix-vector product, which also leads to a low achievable operational frequency. In [11], a method was introduced to use Mastrovito parallel multiplication in a much more efficient way. The idea was to pipeline the multiplier and to perform the multiplication in multiple iterations instead of doing the complete multiplication in one. With this, a higher operational frequency was achievable and the area cost was reduced, but it also introduced a latency in computing the result.


6 Design & Implementation Results

6.1 Design

The final design was derived by reviewing the various past studies and choosing the most suitable approach. The authentication mode chosen was the Galois/Counter Mode because of its flexibility when implemented on hardware. Another advantage of GCM over any other mode is the fact that the authentication core (GMAC) can operate as a separate entity for messages that only need authentication.

The design has two cores, an underlying AES core and a Galois Field Multiplier (GFM) core, that are controlled by a state machine. Each core is explained separately and then the state machine is explained (see footnote 3).

6.1.1 AES Core

The AES core is based on the stage-level pipelining design explained in [10]. Although this design was proposed as a pipelining method for the AES-CCM mode, for which normal pipelining is not possible, it also achieves a good operating frequency for the system. Due to the feedback nature of the CBC mode, the result of the previous block has to be awaited, which increases the complexity of the system. In AES-GCM, where the mode of operation is the CTR mode, this is simplified by using a single Initialization Vector and an increment function for the subsequent iterations.

The AES core takes the Initialization Vector as input data along with a Cipher Key. A data enable bit notifies the core that a data block is available for encryption. A Key Generator core, as shown in Figure 5.4(b), generates the key for encryption. A counter keeps track of the number of pipelines that are being utilized.

3 See Appendix for source codes.


When the core receives data, it sends the data to the first stage of the round transformation. Each stage takes 1 clock cycle; thus, in the next clock cycle another data block can be sent to the round transformation. With each input data block the pipeline counter is incremented (max. 4). Once all the pipelines have been computed, the AES core sends a done signal to enable the next blocks of data to be computed. Figure 6.1 shows the internal structure of the AES core.

Figure 6.1 Internal Structure of AES core

The AES core is the central part of the Encryption Block of the AES-GCM core. Following are the characteristics of the Encryption Block:

The mode of operation for the Encryption Block is the CTR mode.

The inputs for the Encryption Block are an initialization vector IV, a Cipher Key and the input plaintext P.

The IV is passed through an increment block incr to generate a sequence of values depending on the size of P.

The AES core inside the Encryption Block takes the incremented IV values and the Cipher Key as input and generates a Keystream as an output.

The output Keystream is then XORed with P and the result is given out as the ciphertext C.
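The data flow of the Encryption Block can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: toy_block_cipher is a stand-in built from SHA-256 so that the example runs with the standard library only, and it is NOT AES; the names inc32 and ctr_encrypt are hypothetical. The counter layout (96-bit IV followed by a 32-bit counter, with the first counter value reserved for the tag) follows the GCM definition in [33].

```python
import hashlib


def inc32(block: bytes) -> bytes:
    """Increment the rightmost 32 bits of a 16-byte counter block modulo 2^32."""
    head, ctr = block[:12], int.from_bytes(block[12:], "big")
    return head + ((ctr + 1) % (1 << 32)).to_bytes(4, "big")


def toy_block_cipher(key: bytes, block: bytes) -> bytes:
    """Stand-in for the AES core (assumption): any keyed 128-bit function works for CTR."""
    return hashlib.sha256(key + block).digest()[:16]


def ctr_encrypt(key: bytes, iv96: bytes, plaintext: bytes, cipher=toy_block_cipher) -> bytes:
    """XOR the plaintext with a keystream generated from incremented counter values."""
    counter = iv96 + b"\x00\x00\x00\x01"     # Y0, reserved for the tag computation
    out = bytearray()
    for offset in range(0, len(plaintext), 16):
        counter = inc32(counter)             # incr block
        keystream = cipher(key, counter)     # AES core output in the real design
        chunk = plaintext[offset:offset + 16]
        out += bytes(p ^ k for p, k in zip(chunk, keystream))
    return bytes(out)
```

Because the keystream depends only on the key and the counter, applying ctr_encrypt a second time with the same key and IV recovers the plaintext, and the counter blocks can be fed to a pipelined core without waiting for earlier results.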

[Figure 6.1 block labels: Input Data, Data Enable, Round Enable, Key Generator, Generated Key, AES Round, Round Register, Round Data Input, Intermediate Round Output, Pipeline Counter, Pipeline Count, Last Round]

Figure 6.2 shows the diagrammatic representation of the Encryption Block.

Figure 6.2 Diagrammatic Representation of Encryption Block

6.1.2 Galois Field Multiplier (GFM) Core

The GF(2^128) multiplier core takes in AAD and C sequentially and applies GF(2^128) multiplication with the hashing key H. H is required to generate the Matrix U as shown in Equation 4.2. Multiplying two 128-bit operands in a GF(2^128) field takes a lot of area and leads to a lower achievable frequency due to its complexity.

In this design it is proposed that the Hashing Matrix U is computed in advance and stored as a memory block. This is done by using a constant key and calculating H beforehand. All 128 elements of the Matrix are calculated using Equation 4.2 and then stored as memory blocks. This solves the problem of high area usage seen in a basic Mastrovito bit-parallel multiplier, and because most of the complex operations are eliminated, leaving just XOR and AND operations, a considerably better operational frequency is achievable.

Figure 6.3 shows the diagrammatic representation of this design.

[Figure 6.2 block labels: Encryption Block containing incr and AES Core; signals IV, Key, P, Keystream, C]

Figure 6.3 Diagrammatic Representation of Pipelined GF(2^128) Multiplier

Following are the characteristics of the Authentication Block:

The GF multiplier core has a 128-bit input for the AAD and for the C generated by the encryption block, which are entered sequentially.

The corresponding bits of Matrix U are read from the memory to be XORed with the bits from the data input.

All the XOR operations are done in 1 clock cycle (Mastrovito parallel multiplier).

The output of the GF Multiplier is looped back and XORed with the next block to be sent as input.

After the final block, the 128 output bits are XORed with the keystream generated by the AES core.

The resultant value is sent as the MAC.
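The steps above can be sketched in Python as a matrix-vector product over GF(2) with a feedback register. The sketch rests on assumptions: the precomputed matrix is modelled as 128 columns, where column i holds H multiplied by x^i in the GCM bit ordering of [33], which may differ from the exact layout of Matrix U in Equation 4.2; the function names are hypothetical; and the caller is expected to supply every block to be hashed (AAD, ciphertext and the length block defined in [33]) as well as the encrypted initial counter.

```python
# Reduction constant for x^128 + x^7 + x^2 + x + 1 in the GCM bit ordering.
R = 0xE1 << 120


def build_matrix(h: int):
    """Precompute the 128 columns H * x^i once for a constant hashing key."""
    cols, v = [], h
    for _ in range(128):
        cols.append(v)
        v = (v >> 1) ^ R if v & 1 else v >> 1   # multiply by x with reduction
    return cols


def matvec(cols, x: int) -> int:
    """Multiply x by H using only AND (bit select) and XOR (accumulate)."""
    z = 0
    for i, col in enumerate(cols):
        if (x >> (127 - i)) & 1:
            z ^= col
    return z


def ghash_tag(cols, blocks, ek_y0: int) -> int:
    """Chain the blocks through the multiplier and mask the result with E_K(Y0)."""
    z = 0                                  # Z register
    for block in blocks:
        z = matvec(cols, z ^ block)        # feedback XOR, then multiply by H
    return z ^ ek_y0                       # MAC
```

In hardware the loop body of matvec is fully parallel, so one block is absorbed per clock cycle; only build_matrix depends on the key, which is why a constant key allows the matrix to be stored in memory.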

Figure 6.4 shows the diagrammatic representation of the Authentication Block:

Figure 6.4 Diagrammatic Representation of Authentication Block

[Figure 6.3 block labels: Matrix U Memory Block, GF(2^128) Multiplier, Input 128-bits, Output 128-bits, Z register 128-bits]

[Figure 6.4 block labels: Authentication Block containing GF(2^128) Multiplier Core; signals AAD & C, Matrix U, Z, Keystream, MAC]

6.1.3 Complete Design

The AES core and the GFM core explained above are combined to form a Crypto-Core. A state machine is implemented to operate the two cores as required. To understand the state machine of the Crypto-Core, the signals of the Crypto-Core first have to be defined:

Signal | Type | Description

data_in (128-bit) | Input | 128-bit data input interface

data_in_valid (4-bit) | Input | Notifies that data is available on data_in

data_in_type (1-bit) | Input | Notifies data type (1 = AAD, 0 = Plaintext)

data_in_not_ready (1-bit) | Output | Output busy bit

data_in_last_word (1-bit) | Input | Notifies the last block of input data

data_in_size (4-bit) | Input | Notifies size of data (0 to 15 => 8 bit to 128 bit)

start (1-bit) | Input | Notifies start of operation

IV_valid (1-bit) | Input | Notifies that data on data_in is the IV

data_out (128-bit) | Output | 128-bit data output interface

data_out_valid (1-bit) | Output | Notifies that data is available on data_out

data_out_size (4-bit) | Output | Notifies size of data (0 to 15 => 8 bit to 128 bit)

data_out_last_word (1-bit) | Output | Notifies the last block of output data

tag_valid (1-bit) | Output | Notifies that the Tag is available on data_out

Table 6.1 I/O Signals of Crypto-Core

The top layer of the AES-GCM core controls the operation of the two cores with the help of a state machine. The state machine defines how the AES-GCM core handles the incoming data blocks. Each state of the state machine is described as follows:

IDLE: In the IDLE state the system checks the start and IV_valid signals. Once both are high, it stores the IV available on the data_in port in a register (Yi) and changes the state to INIT_COUNTER.

INIT_COUNTER: In the INIT_COUNTER state the system sends the value from Yi to the AES core to generate the keystream. The state is changed to ENCRYPT_Y0.

ENCRYPT_Y0: The system waits for the AES core to generate the keystream, which is stored in a register (EkY0). The state is changed to DATA_ACCEPT.

DATA_ACCEPT: The data_in_not_ready signal is set to '0'. The system checks data_in_valid[0]; if it is high, it checks data_in_type. For data_in_type = 1 (AAD) it starts the GFM count register and sends the data to the GFM input, and the state is changed to GFM_MULT. For data_in_type = 0 (Plaintext) the system increments the Yi register by 1 and changes the state to INC_COUNTER_1.

INC_COUNTER_1: The system sends the value from Yi to the input of the AES core, and data_in_valid[0] is sent to the data enable of the AES core. The data on data_in is stored in a register. The system checks the data_in_valid[1] signal; if it is '0', the state is changed to ENCRYPT1. If it is '1', the system increments the Yi register by 1 and a stream register by 1, and the state is changed to INC_COUNTER_2.

INC_COUNTER_2: The system sends the value from Yi to the input of the AES core, and data_in_valid[1] is sent to the data enable of the AES core. The data on data_in is stored in a register. The system checks the data_in_valid[2] signal; if it is '1', the system increments the Yi register by 1 and the stream register by 1, and the state is changed to INC_COUNTER_3. If it is '0', the state is changed to ENCRYPT1.

INC_COUNTER_3: The system sends the value from Yi to the input of the AES core, and data_in_valid[2] is sent to the data enable of the AES core. The data on data_in is stored in a register. The system checks the data_in_valid[3] signal; if it is '1', the system increments the Yi register by 1 and the stream register by 1, and the state is changed to INC_COUNTER_4. If it is '0', the state is changed to ENCRYPT1.

INC_COUNTER_4: The system sends the value from Yi to the input of the AES core, and data_in_valid[3] is sent to the data enable of the AES core. The data on data_in is stored in a register. The state is changed to ENCRYPT1.

ENCRYPT1: The system waits for the AES core to generate the keystream, and the generated keystream is XORed with the value in r_datain0. The data is sent to data_out and data_out_valid is set to 1. The output data is also sent as input to the GFM. The system checks the stream register. If it is not '0', it decrements the stream register by 1 and the state is changed to ENCRYPT2; otherwise it is changed to GFM_MULT.

ENCRYPT2: The system waits for the AES core to generate the keystream, and the generated keystream is XORed with the value in r_datain1. The data is sent to data_out and data_out_valid is set to 1. The output data is also sent as input to the GFM. The system checks the stream register. If it is not '0', it decrements the stream register by 1 and the state is changed to ENCRYPT3; otherwise it is changed to GFM_MULT.

ENCRYPT3: The system waits for the AES core to generate the keystream, and the generated keystream is XORed with the value in r_datain2. The data is sent to data_out and data_out_valid is set to 1. The output data is also sent as input to the GFM. The system checks the stream register. If it is not '0', it decrements the stream register by 1 and the state is changed to ENCRYPT4; otherwise it is changed to GFM_MULT.

ENCRYPT4: The system waits for the AES core to generate the keystream, and the generated keystream is XORed with the value in r_datain3. The data is sent to data_out and data_out_valid is set to 1. This data is sent as input to the GFM and the state is changed to GFM_MULT.

GFM_MULT: The system checks the data_in_last_word signal. If it is '0', the state is changed back to DATA_ACCEPT. If it is '1', the state is changed to PRE_TAG_CALC.

PRE_TAG_CALC: The GFM count register is reset to '0' and the state is changed to TAG_CALC.

TAG_CALC: The output of the multiplier is XORed with the value in EkY0 to generate the Tag. The generated Tag is sent to the data_out port and the tag_valid signal is set to '1'.

The state diagram in Figure 6.5 shows the flow of the state machine.
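The control flow described above can be summarized as a transition function. This is a simplified sketch and not the Verilog source: the four INC_COUNTER and four ENCRYPT states are collapsed into one state each (as in the state diagram), registers such as Yi and the stream register are omitted, the aes_done input is a hypothetical name for the done signal of the AES core, and the return from TAG_CALC to IDLE is an assumption, since the text does not state what follows TAG_CALC.

```python
from enum import Enum


class State(Enum):
    IDLE = "IDLE"
    INIT_COUNTER = "INIT_COUNTER"
    ENCRYPT_Y0 = "ENCRYPT_Y0"
    DATA_ACCEPT = "DATA_ACCEPT"
    INC_COUNTER = "INC_COUNTER"      # stands for INC_COUNTER_1..4
    ENCRYPT = "ENCRYPT"              # stands for ENCRYPT1..4
    GFM_MULT = "GFM_MULT"
    PRE_TAG_CALC = "PRE_TAG_CALC"
    TAG_CALC = "TAG_CALC"


def next_state(state, *, start=False, iv_valid=False, aes_done=True,
               data_in_valid=False, data_in_type=0, data_in_last_word=False):
    """Return the next state for the given input signals (control flow only)."""
    if state is State.IDLE:
        return State.INIT_COUNTER if start and iv_valid else State.IDLE
    if state is State.INIT_COUNTER:
        return State.ENCRYPT_Y0
    if state is State.ENCRYPT_Y0:
        return State.DATA_ACCEPT if aes_done else State.ENCRYPT_Y0
    if state is State.DATA_ACCEPT:
        if not data_in_valid:
            return State.DATA_ACCEPT
        # data_in_type: 1 = AAD goes straight to the GFM, 0 = plaintext is encrypted first
        return State.GFM_MULT if data_in_type == 1 else State.INC_COUNTER
    if state is State.INC_COUNTER:
        return State.ENCRYPT
    if state is State.ENCRYPT:
        return State.GFM_MULT if aes_done else State.ENCRYPT
    if state is State.GFM_MULT:
        return State.PRE_TAG_CALC if data_in_last_word else State.DATA_ACCEPT
    if state is State.PRE_TAG_CALC:
        return State.TAG_CALC
    return State.IDLE                # TAG_CALC -> IDLE is an assumption
```

Stepping this function through one AAD block followed by one final plaintext block visits DATA_ACCEPT twice before reaching TAG_CALC, which matches the loop between DATA_ACCEPT and GFM_MULT in the description.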


Figure 6.5 State Diagram of Crypto-Core State Machine

Following are the characteristics of the final completed design:

The inputs for the Crypto-Core are a 96-bit initialization vector IV, a 128-bit input Plaintext P and 128-bit input additional authenticated data AAD.

The outputs are a 128-bit output Ciphertext C and a 128-bit output MAC.

The 128-bit Cipher Key, the 256-byte SBox and the 1 Kbyte Matrix U are memory units and are accessed from the Memory Block.

A Control Block controls all the operations of the Crypto-Core.

The diagrammatic representation of the complete design is shown in Figure 6.6:

[Figure 6.5 labels: states IDLE, INIT_COUNTER, ENCRYPT_Y0, DATA_ACCEPT, INC_COUNTER(x4), ENCRYPT(x4), GFM_MULT, PRE_TAG_CALC, TAG_CALC; transition labels start = 1 and IV_valid = 1, Start AES core, AES Done, data_in_valid = 1 and data_in_type = 1, data_in_valid = 1 and data_in_type = 0, Start AES core for next block, data_in_last_word = 0]

Figure 6.6 Complete Design Layout

6.2 Implementation and Results

6.2.1 Implementation Platform

The design explained above has been implemented and tested on Altera based hardware. The board used for the hardware implementation is the DB5CGXFC7 provided by Devboards GmbH (see footnote 4). The DB5CGXFC7 board is based on the Altera Cyclone V GX device and has an Altera EP5CGXFC7C6F23C7N FPGA chip, which contains 150K Logic Elements and 7 Mbit of RAM. This is a low-end device of the Altera FPGA family and was selected because the idea behind this Crypto-Core was to have a design that is suitable for low-end devices as well. The design was also simulated with the ModelSim – Altera 10.4b software using test_bench.v (see Appendix).

4 http://www.devboards.de/en/home/boards/product-details/article/db5cgxfc7/

[Figure 6.6 block labels: AES-GCM Crypto-Core containing Control Block, Memory Block (Key, Sbox, Matrix U), Encryption Block and Authentication Block; signals IV, P, AAD, Keystream, C, AAD & C, MAC]

6.2.2 Tests & Results

The system has been tested and compared with previous studies to provide a better understanding of its performance. The designs were tested for area usage, achievable operational frequency, throughput and efficiency.

The proposed design was simulated in ModelSim to determine the total delay until the output is received. It was then compared with the design in [10], with both implemented on the same hardware platform. A further implementation, referred to as {1}, is also compared: it implements the Mastrovito multiplier in four steps as described in [11], without storing Matrix U in memory, and implements the AES round transformations in sub-pipelined mode.

Table 6.2 shows the comparison of the designs with respect to achievable operational frequency, clock cycles taken for 4 blocks of data, and throughput. The graphical representation of the comparison is given in Figure 6.7.

Design | Operational Frequency | Clock Cycles per 4 blocks | Throughput

Proposed Design | 140.17 MHz | 44 | 1.63 Gbps

[10] | 146.34 MHz | 44 | 1.87 Gbps

{1} | 159.32 MHz | 56 | 1.46 Gbps

Table 6.2 Comparison of Operational Frequency and Throughput
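The relation between the columns of Table 6.2 can be checked with a short calculation, assuming that throughput is defined as the number of bits in four 128-bit blocks multiplied by the clock frequency and divided by the clock cycles needed for those four blocks. This definition and the function name throughput_gbps are assumptions made for this sketch; the thesis does not state the formula explicitly.

```python
BLOCK_BITS = 128   # AES block size in bits


def throughput_gbps(freq_mhz: float, cycles: int, blocks: int = 4) -> float:
    """Throughput in Gbit/s for `blocks` AES blocks processed in `cycles` clock cycles."""
    bits_per_second = blocks * BLOCK_BITS * freq_mhz * 1e6 / cycles
    return bits_per_second / 1e9


if __name__ == "__main__":
    for name, freq, cycles in [("Proposed Design", 140.17, 44),
                               ("[10]", 146.34, 44),
                               ("{1}", 159.32, 56)]:
        print(f"{name}: {throughput_gbps(freq, cycles):.2f} Gbit/s")
```

Under this assumed definition the rows for the proposed design and for {1} reproduce the tabulated throughput. For [10], the tabulated 1.87 corresponds to 40 clock cycles per 4 blocks, whereas 44 cycles would give about 1.70, so one of the two entries in that row may need to be rechecked against the measurement.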

Figure 6.7 Graphical Comparison of Frequency and Throughput

[Figure 6.7 content: bar chart with the series Frequency (GHz) and Throughput for the designs Proposed Work, [10] and [11]; the plotted values are those of Table 6.2]

The comparison of the area usage, in Adaptive Logic Modules (ALMs) used, and of the efficiency (throughput/area) is shown in Table 6.3. Figure 6.8 shows the graphical comparison of the area usage and Figure 6.9 shows the graphical comparison of the efficiencies of the designs.

Design | Area (ALM) | Efficiency (Mbps/ALM)

Proposed Design | 1176 | 1.39

[10] | 1582 | 1.18

{1} | 2784 | 0.52

Table 6.3 Comparison of Area usage and Efficiency
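The efficiency column can be reproduced from the throughput values of Table 6.2 and the area values of Table 6.3, assuming efficiency is throughput in Mbit/s divided by the number of ALMs. The function name efficiency_mbps_per_alm is chosen for this sketch only.

```python
def efficiency_mbps_per_alm(throughput_gbps: float, area_alm: int) -> float:
    """Efficiency as throughput (converted from Gbit/s to Mbit/s) per ALM."""
    return throughput_gbps * 1000.0 / area_alm


if __name__ == "__main__":
    for name, tput, area in [("Proposed Design", 1.63, 1176),
                             ("[10]", 1.87, 1582),
                             ("{1}", 1.46, 2784)]:
        print(f"{name}: {efficiency_mbps_per_alm(tput, area):.2f} Mbps/ALM")
```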

Figure 6.8 Graphical Comparison of Area Usage

[Figure 6.8 content: bar chart of Area (ALMs) for the designs Proposed Work, [10] and {1}; the plotted values are those of Table 6.3]

Figure 6.9 Graphical Comparison of Efficiency

[Figure 6.9 content: bar chart of Efficiency (Mbps/ALMs) for the designs Proposed Work, [10] and {1}; the plotted values are those of Table 6.3]

6.3 Conclusion

The results in the previous section show that although {1} achieves a higher frequency, its throughput is lower because of the latency of the 4 clock cycles taken by dividing the multiplication into four iterations. Implementing the calculation of Matrix U in logic also increases the area usage, which decreases the overall efficiency of the system. In the proposed design, on the other hand, Matrix U is pre-calculated and stored as a memory block, thus decreasing the area cost. The memory usage, even after storing Matrix U and the SBox as memory blocks, is nearly 1%. The efficiency of the proposed design (1.39 Mbps/ALM) is higher than that of both designs in comparison (1.18 and 0.52 Mbps/ALM). Although the constant key poses a security risk, this design is very suitable for hardware-level implementation in an automated system, since there the cryptographic key can be kept constant and is not user provided.


Appendix

The source codes are provided in a CD along with the Thesis. Following is a brief

description of the source code files:

Name | Description

AESGCM.v | Top-level file for AES-GCM Crypto-Core

GFM.v | Implementation of Galois Field Multiplier

AESConstants.v | Necessary definitions for the core

AESCore.v [10] | Top-Level for AES Core

AESRound.v [10] | Implementation of Round Transformations

Counter.v [10] | Counter function to count round number

defineAES.v [10] | Pipeline definitions

lib.v [10] | Other primitives

MixColumnAddKey.v [10] | MixColumn and AddRoundKey Implementation

KeyGenerator.v [10] | Key Generation implementation

SBox1.v [10] | SBox as LUT

SBox2.v [10] | SBox and polynomial multiply by 02 as LUT

SBox1_2LUT.v [10] | SBox Top-Level


Bibliography

[1] OPC Unified Architecture, Interoperability for Industry 4.0 and the Internet of

Things, OPC Foundation Forum.

[2] Klaus Schwab, The Fourth Industrial Revolution, World Economic Forum, 2016.

[3] Mario Hermann, Tobias Pentek and Boris Otto, Design Principles for Industrie

4.0 Scenarios, System Sciences (HICSS), 2016.

[4] J. Daemen and V. Rijmen, AES Proposal: Rijndael, AES Algorithm Submission,

September 3, 1999.

[5] Federal Information Processing Standards (FIPS), Specification for the

Advanced Encryption Standard (AES), FIPS Publication 197, November 26,

2001.

[6] Harris Nover, Algebraic Cryptanalysis of AES: An Overview, Department of

Mathematics, University of Wisconsin.

[7] Sheng Wang, An Architecture for AES-GCM Security Standard, Master's Thesis presented at University of Waterloo, Canada, 2006.

[8] E. D. Mastrovito, VLSI Designs for Multiplication over Finite Fields GF(2^m), in Proc. Sixth International Conference, Applied Algebra, Algebraic Algorithms and Error-Correcting Codes (AAECC-6), Rome, July 1988.

[9] Ricardo Chaves, Georgi Kuzmanov, Stamatis Vassiliadis and Leonel Sousa,

Reconfigurable Memory based AES Co-Processor, Parallel and Distributed

Processing Symposium, 2006.


[10] Haeyoung Rha and Hae-wook Choi, Efficient Pipelined Multistream AES CCMP

Architecture for Wireless LAN, Paper submitted at Korea Advanced Institute of

Science & Technology (KAIST), 2012.

[11] Bryce Barcelo and John Taylor, Crypto Acceleration Using Asynchronous

FPGAs, Submitted to the faculty of Worcester Polytechnic Institute.

[12] Cheng Wang and Howard M. Heys, Using a Pipelined SBox in Compact AES

Hardware Implementations, IEEE NEWCAS2010, pp. 101-104, 2010.

[13] Arash Reyhani-Masoleh, Mehran Mozaffari-Kermani, Efficient and High-

Performance Parallel Hardware Architectures for the AES-GCM, IEEE

Transactions on Computers, vol. 61, no. , pp. 1165-1178, Aug. 2012.

[14] Muhammad H. Rais and Syed M. Qasim, Efficient Hardware Realization of

Advanced Encryption Standard Algorithm using Virtex-5 FPGA, International

Journal of Computer Science and Network Security (IJCSNS) Vol. 9 No. 9,

September 2009.

[15] Abolfazl Soltani and Saeed Sharifian, An Ultra-High Throughput and fully

Pipelined Implementation of AES algorithm on FPGA, Journal:

Microprocessors and Microsystems Vol. 39 Issue 7, Amsterdam, October 2015.

[16] A. Brokalakis and H. Michail, A High-Speed and Area-Efficient Hardware Implementation of AES-128 Encryption Standard, 5th WSEAS Conference on Multimedia, Internet and Video Technologies, Greece, August 2005.

[17] D. Chen, G. Shou, Y. Hu and Z. Guo, Efficient Architecture and

Implementations of AES, IEEE ICACTE2010, pp. V6-295-V6-298, 2010.

[18] Nadia Nedjah, Luiza de Macedo Mourelle, Marco Paulo Cardoso, A Compact

Pipelined hardware Implementation of the AES-128 Cipher, IEEE ITNG2006,

2006.


[19] Kenneth Stevens, Otmane A. Mohamed, Single-chip FPGA Implementation of a Pipelined, Memory-Based AES Rijndael Encryption Design, IEEE ECE2005, pp. 1296-1299, 2005.

[20] Deen Kotturi, Seong-Moo Yoo, and John Blizzard, AES Crypto Chip Utilizing High-Speed Parallel Pipelined Architecture, IEEE ISCAS2005, pp. 4653-4656, vol. 5, 2005.

[21] J. Guajardo, T. Güneysu, Sandeep S. Kumar, C. Paar and J. Pelzl, Efficient

Hardware Implementation of Finite Fields with Applications to Cryptography,

Acta Appl Math (2006) 93: 75–118, September 2006.

[22] A. Karatsuba and Y. Ofman, Multiplication of Multidigit Numbers on Automata,

Soviet Physics Doklady, 7:595, 1963.

[23] Emilia Käsper and Peter Schwabe, Faster and Timing-Attack Resistant AES-

GCM, Katholieke Universiteit Leuven.

[24] Bo Yang, Sambit Mishra and Ramesh Karri, High Speed Architecture for

Galois/Counter Mode of Operation (GCM), Polytechnic University, Brooklyn,

New York.

[25] H. Fan and M.A. Hasan, A New Approach to Subquadratic Space Complexity

Parallel Multipliers for Extended Binary Fields, IEEE Transactions on

Computers, 56(2):224–233, 2007.

[26] A. Satoh, High-Speed Parallel Hardware Architecture for Galois Counter

Mode, IEEE International Symposium on Circuits and Systems(ISCAS 2007),

pages 1863–1866, 2007.

[27] M.A. Hasan, Matrix-vector Product based Subquadratic Arithmetic Complexity

Schemes for Field Multiplication, Proceedings of SPIE, 6697:669702, 2007.


[28] H. Fan and Y. Dai, Fast Bit-Parallel GF(2^n) Multiplier for All Trinomials, IEEE Transactions on Computers, 54(4):485–490, 2005.

[29] C. Paar, A new Architecture for a Parallel Finite Field Multiplier with low

Complexity based on Composite Fields, IEEE Transactions on Computers,

45(7):856–861, 1996.

[30] Pujan Patel, Parallel Multiplier Designs for the Galois/Counter Mode of

Operation, A thesis presented to the University of Waterloo, Waterloo, Ontario,

Canada, 2008.

[31] Bruce Schneier, John Kelsey, Doug Whiting, David Wagner, Chris Hall, Niels Ferguson, Tadayoshi Kohno, The Twofish Team's Final Comments on AES Selection, May 2000.

[32] NIST Computer Security Division's (CSD) Security Technology Group (STG),

Proposed modes, Cryptographic Toolkit, NIST, April 14, 2013.

[33] David A. McGrew, John Viega, The Galois/Counter Mode of Operation, NIST,

2005.