
Algorithm and Hardware Design of Encryption Scheme for H.264/AVC

FAN, Yibo

Graduate School of Information, Production and Systems

Waseda University

February 2009


Abstract

H.264, also known as MPEG-4 Part 10 or AVC (Advanced Video Coding), is the latest international video coding standard, proposed in 2003. Currently, few encryption schemes have been proposed for the H.264/AVC standard; most existing schemes were designed for previous video coding standards such as MPEG-1, MPEG-2/H.262, MPEG-4 and H.263.

This dissertation presents a new video encryption scheme for H.264/AVC together with the hardware design of the encryption module. Its contributions fall into three parts: 1) The proposed video encryption scheme provides higher security at lower computational cost. 2) The proposed scalable hardware architecture for the encryption module can be used across a wide range of video systems. 3) The five proposed DPA countermeasure methods can be applied in the encryption module to prevent DPA attacks.

This dissertation consists of seven chapters:

Chapter 1 [Introduction] introduces the basic concepts of video coding systems, video encryption methods and cryptographic algorithms. The H.264 video coding standard, selective video encryption methods and the AES algorithm are also introduced in this chapter.

Chapter 2 [Selective Video Encryption Schemes] describes recently proposed video encryption schemes and some of the encryption algorithms they use. The basic idea of selective encryption is to encrypt only part of the video data and leave the rest unencrypted. The security of selective encryption is low, but it saves a great deal of computational cost. A brief survey is provided to show the differences between these schemes. Three main problems of the existing schemes are discussed: the security problem, the computation problem, and the feasibility problem.


Chapter 3 [Unequal Secure Encryption Scheme for H.264/AVC] describes the proposed Unequal Secure Encryption (USE) scheme for H.264/AVC. The purpose of this scheme is to reduce the computational cost while keeping a high security level. The main idea of the USE scheme is to encrypt the important data partition with a high-security algorithm and the unimportant data partition with a low-security algorithm. All of the data are encrypted to improve security. The new ideas of the proposed scheme are as follows:

1) Data classification methods. Three data classification methods are proposed: Data Partitioning, FMO and Parameter Extraction. Each method targets different coding profiles of H.264/AVC.

2) Multiple security level definition. Four security levels are defined to trade off security against computational cost. For security level 0, the computational cost is only 18% of full encryption, and for level 3, the computational cost is 50% of full encryption. Compared to other selective encryption schemes, our scheme achieves much lower computational cost while still encrypting 100% of the video data.

3) Hybrid encryption module. This module includes two encryption functions: AES encryption for the important data partition, and FLEX encryption for the unimportant data partition. The proposed FLEX algorithm achieves 5 times the throughput of AES, and it can reuse the AES hardware.
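The routing idea behind the hybrid encryption module can be illustrated with a minimal sketch. It is not the dissertation's implementation: both ciphers below are placeholder keystream functions built from SHA-256 (standing in for AES and FLEX, whose details are given in Chapter 3), and the names `strong_cipher`, `fast_cipher` and `use_encrypt` are hypothetical. The sketch only shows that every partition is encrypted and that the cipher is chosen by the partition's importance.

```python
import hashlib

def _keystream(key: bytes, label: bytes, index: int, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from SHA-256 used in counter mode."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = key + label + index.to_bytes(4, "big") + counter.to_bytes(8, "big")
        out += hashlib.sha256(block).digest()
        counter += 1
    return bytes(out[:length])

def strong_cipher(key: bytes, index: int, data: bytes) -> bytes:
    """Placeholder for the high-security path (AES in the dissertation)."""
    stream = _keystream(key, b"strong", index, len(data))
    return bytes(d ^ s for d, s in zip(data, stream))

def fast_cipher(key: bytes, index: int, data: bytes) -> bytes:
    """Placeholder for the low-cost path (FLEX in the dissertation)."""
    stream = _keystream(key, b"fast", index, len(data))
    return bytes(d ^ s for d, s in zip(data, stream))

def use_encrypt(partitions, key: bytes):
    """Encrypt every (important, payload) pair, choosing the cipher by importance.

    Both placeholders are XOR stream ciphers, so calling this function again
    with the same key decrypts the result.
    """
    result = []
    for index, (important, payload) in enumerate(partitions):
        cipher = strong_cipher if important else fast_cipher
        result.append((important, cipher(key, index, payload)))
    return result
```

In this sketch the two paths cost the same to compute; in the scheme described above, the saving comes from the low-cost path running several times faster than AES.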

Chapter 4 [Hardware Design of AES & RSA] presents the proposed hardware architectures for the AES and RSA algorithms. For AES, since the performance requirements of different video applications vary widely, the scalability of the hardware design becomes very important. Parallel data paths and configurable hardware modules are used to achieve high scalability. The experimental results show that the throughput of the lowest-cost AES implementation, which uses 1 S-Box and 1 MixColumn, is 75 Mbps, while the highest-cost AES, with 20 S-Boxes and 4 MixColumns, reaches 2.4 Gbps. Our design offers a new approach to scalable hardware design of AES. Unlike other AES architectures, which are not scalable, it can be used to design AES under various performance specifications. As a result, it is well suited to video encryption systems.

For RSA, firstly, a modified scalable high-radix Montgomery algorithm is proposed to reduce the critical path. Secondly, a high-radix clock-saving dataflow is proposed to support high-radix operation and one clock cycle delay in the dataflow. Finally, a hardware-reused architecture is proposed to reduce the hardware cost, together with a parallel radix-16 design of the data path. The implementation results show that the total cost of the Montgomery multiplier is 130 KGates, the clock frequency is 180 MHz and the throughput of 1024-bit RSA encryption is 352 Kbps.
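For readers unfamiliar with high-radix Montgomery multiplication, the following sketch shows the textbook digit-serial form with radix 2^k (k = 4 gives radix 16). It is not the modified algorithm proposed in Chapter 4; the function names and parameters are illustrative assumptions, and the modular inverse via `pow(n, -1, m)` assumes Python 3.8 or later.

```python
def montgomery_multiply(a: int, b: int, n: int, digits: int, k: int = 4) -> int:
    """Return a * b * R^-1 mod n, where R = 2**(k * digits).

    Requires n odd, 0 <= a, b < n and R > n. Each loop iteration consumes
    one radix-2**k digit of `a`, least significant digit first.
    """
    mask = (1 << k) - 1
    n_prime = (-pow(n, -1, 1 << k)) & mask      # -n^-1 mod 2**k
    s = 0
    for i in range(digits):
        a_i = (a >> (k * i)) & mask             # i-th digit of a
        s += a_i * b
        q = ((s & mask) * n_prime) & mask       # makes s + q*n divisible by 2**k
        s = (s + q * n) >> k                    # exact division by the radix
    return s - n if s >= n else s               # s stays below 2n, one subtraction suffices

def to_montgomery(x: int, n: int, digits: int, k: int = 4) -> int:
    """Map x into the Montgomery domain: x * R mod n."""
    return (x << (k * digits)) % n
```

A product in the ordinary domain is obtained by converting both operands with `to_montgomery`, multiplying them with `montgomery_multiply`, and multiplying the result by 1 with `montgomery_multiply` to leave the Montgomery domain. A larger k processes more bits per iteration at the cost of a wider digit multiplier.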

Chapter 5 [DPA Attack on AES] introduces side-channel attack methods, in particular the Differential Power Analysis (DPA) attack. The DPA attack, proposed by Paul Kocher in 1998, can recover the secret key of a cryptographic device by collecting the device's power consumption. It poses a serious threat to the security of cryptographic devices. The detailed attack procedure on AES and some recently proposed countermeasure methods are also discussed in this chapter.
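The principle of a DPA attack can be illustrated on simulated data. The sketch below is a simplified assumption, not the attack procedure of Chapter 5: it leaks the Hamming weight of a single AES S-Box output plus Gaussian noise (the dissertation uses a Hamming distance model on real measurements), and the attacker applies a difference-of-means test on one predicted bit for each of the 256 key-byte guesses. All function names are hypothetical.

```python
import random

def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def build_sbox():
    """Build the AES S-Box: field inverse followed by the affine transform."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = build_sbox()

def simulate_traces(key_byte: int, count: int, noise_sigma: float, rng):
    """Return (plaintext byte, leakage) pairs under a Hamming-weight model."""
    traces = []
    for _ in range(count):
        p = rng.randrange(256)
        leak = bin(SBOX[p ^ key_byte]).count("1") + rng.gauss(0.0, noise_sigma)
        traces.append((p, leak))
    return traces

def dpa_recover_key(traces) -> int:
    """Pick the key guess whose predicted bit best splits the leakage values."""
    best_guess, best_diff = 0, -1.0
    for guess in range(256):
        ones, zeros = [], []
        for p, leak in traces:
            # Selection function: least significant bit of the S-Box output.
            (ones if SBOX[p ^ guess] & 1 else zeros).append(leak)
        if not ones or not zeros:
            continue
        diff = abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))
        if diff > best_diff:
            best_guess, best_diff = guess, diff
    return best_guess
```

Only the correct guess sorts the traces consistently with the data the device actually processed, so its two groups show a clear difference in mean leakage, while wrong guesses produce groups that look nearly alike. Hiding and masking countermeasures aim to remove exactly this statistical dependence.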

Chapter 6 [AES Design with DPA Countermeasure] presents our proposed AES designs with DPA countermeasures. A hybrid countermeasure solution is proposed that combines five methods: Independent ARK, Data Sliding, Subbyte Hiding, Simplified S-Box Masking and Registers Masking. The theoretical analysis shows that our solution increases the complexity of a DPA attack to 212N times. In this way, even if one or two countermeasure methods are cracked, the remaining methods can still prevent a successful attack. Few papers address the hardware design of DPA countermeasure methods. In this dissertation, the detailed hardware implementation of the DPA countermeasure methods is presented. Moreover, an ultra low-cost AES with the five proposed countermeasure methods is designed for real-time video encryption. A test chip that includes four AES cores is implemented in the VDEC project (ROHM 0.18 um, chip size 2.5 mm × 2.5 mm):

1) AES0: Pure AES design without any countermeasure methods. It achieves the lowest hardware cost (4678 gates) with adequate throughput (51 Mbps) and clock frequency (80 MHz).

2) AES1.0: AES design with Independent ARK and Data Sliding. The hardware cost is 5500 gates, the clock frequency reaches 125 MHz, and the throughput is 75 Mbps.

3) AES1.1: AES design with Subbyte Hiding. The hardware cost is 6244 gates, and the clock frequency and throughput are the same as for AES1.0.

4) AES1.2: AES design with Simplified S-Box Masking and Registers Masking. The hardware cost is 6834 gates, the clock frequency is reduced to 75 MHz, and the throughput is reduced to 45 Mbps.

In order to evaluate the effectiveness of the proposed countermeasure methods, a DPA attack system based on the SASEBO board is designed. The DPA attack experiment results show that the AES designs with our proposed countermeasure methods (AES1.0, AES1.1, AES1.2) successfully prevent DPA attacks.
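As a rough consistency check on the figures reported for the test chip, the number of clock cycles each core spends per 128-bit block can be estimated from its clock frequency and throughput. The sketch assumes that the core processes one block at a time without pipelining, so that throughput equals block size times clock frequency divided by cycles per block; this relation and the helper name are assumptions, not statements from the dissertation.

```python
AES_BLOCK_BITS = 128

def cycles_per_block(clock_mhz: float, throughput_mbps: float) -> float:
    """Estimate clock cycles per 128-bit block for a non-pipelined core."""
    return AES_BLOCK_BITS * clock_mhz / throughput_mbps

# (clock in MHz, throughput in Mbps) as reported in the abstract.
reported = {"AES0": (80, 51), "AES1.0": (125, 75), "AES1.2": (75, 45)}

for name, (clock, rate) in reported.items():
    print(f"{name}: about {cycles_per_block(clock, rate):.0f} cycles per block")
```

Under this assumption all three cores need roughly 200 to 215 cycles per block, which is in line with a compact design that reuses a small data path over many cycles rather than processing a full round at once.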

Chapter 7 [Conclusion] summarizes the contributions of this dissertation.

Keywords

Video Encryption, H.264/AVC, Unequal Secure Encryption, AES, RSA, Side-channel Attack, Differential Power Analysis, Low cost, Scalable Architecture, VLSI, SASEBO Board


Acknowledgements

First of all, I would like to thank Professor Satoshi Goto for his guidance, instruction, and support during my research. He advised me to set up a research goal and to achieve it step by step. What I learned from him will be the most valuable asset in my life. I also thank Professor Takeshi Ikenaga for his continuous support, instruction and insightful comments on my work. He gave me a great deal of valuable and helpful advice on detailed technical problems. I also express my appreciation to Professor Yoshimura for his continuous support, encouragement and insightful comments throughout my research work.

I also thank Dr. Tsunoo (NEC Central Research Lab) for advising me on cryptography research. His great knowledge of cryptography helped me to find the right research directions and showed me how to continue my work. Thanks to Mr. Kimura (Y.D.K. Corp.), Mr. Nozawa (Y.D.K. Corp.) and Mr. Syouji (Y.D.K. Corp.) for helping me to use the SASEBO board.

I also thank Mr. Jidong Wang and Mr. Guoyu Qian for working with me on video encryption and side-channel attacks. Thanks to the graduates of Goto Lab: Dr. Yang Song, Dr. Lingfeng Li, Dr. Shen Li and Dr. Jing Wang. Discussions with you gave me great inspiration in my research work. Thanks to all the students of Goto Lab; you made my life joyful. Thanks also go to all of my friends; I appreciate every moment with you.

Finally, I would like to thank my family for their unconditional support and love.


Contents

Abstract
Acknowledgements
List of Tables
List of Figures
List of Notations
1 Introduction
1.1 Video Compression
1.2 Video Encryption
1.3 Cryptography
1.4 Our Contributions and Dissertation Organization
2 Selective Video Encryption Schemes
2.1 Visual Data Formats
2.1.1 Video Sequence
2.1.2 Coded video stream format
2.2 Conventional video encryption methods
2.2.1 Cryptography based video encryption
2.2.2 Permutation based video encryption
2.3 A Survey of selective video encryption schemes
2.4 Problems of current video encryption scheme
2.5 Conclusion
3 Unequal Secure Encryption (USE) Scheme for H.264
3.1 Introduction of H.264
3.2 USE Scheme for H.264/AVC
3.2.1 Data Partition Methods
3.2.2 Data Partition Methods
3.2.3 Security levels
3.2.4 Encryption Methods
3.3 Comparison
3.4 Conclusion
4 Hardware Design of Encryption Accelerator
4.1 Hardware Design of AES
4.1.1 Introduction of AES Algorithm
4.1.2 Existing low-cost implementations of AES
4.1.3 Proposed Scalable Hardware Architecture for AES
4.1.3.1 Top Level Architecture
4.1.3.2 Two typical subclass architectures
4.1.3.3 Sub-
4.1.4 Performance Analysis
4.1.4.1 Scalability
4.1.4.2 Dataflows
4.1.4.3 Hardware Implementation
4.2 Hardware Design of RSA
4.2.1 Introduction of RSA Algorithm
4.2.2 Proposed Optimized Algorithm
4.2.3 Proposed Optimized Data Flow
4.2.4 Proposed Hardware Architecture for RSA
4.2.5 Performance Analysis
4.3 Conclusion
5 DPA Attack on AES
5.1 Introduction of Differential Power Analysis attack
5.1.1 Power Consumption of CMOS Circuit
5.1.2 Power Model
5.1.3 Hypothetical Power Consumption based on HD model: Case study
5.1.4 Differential Power Analysis Attacks
5.2 DPA attack on AES
5.2.1 DPA attack on AES: An Example
5.2.2 DPA attack on AES: A successful attack and a failed attack
5.3 Conventional Countermeasure Methods
5.4 Conclusion
6 AES Design with DPA Countermeasure
6.1 Proposed DPA Countermeasure methods for AES
6.1.1 Register Masking
6.1.2 S-Box Masking
6.1.3 Subbytes Hiding
6.1.4 Independent ARK and Data Sliding
6.1.5 Time Complexity Analysis
6.2 Ultra Low-cost Design of AES with DPA Countermeasure
6.2.1 Specification
6.2.2 Hardware Architecture
6.2.3 Data Flow
6.2.4 Implementation
6.3 DPA Attack Evaluation Environment
6.3.1 DPA attack platform
6.3.2 Sasebo Board
6.3.3 Test Flow
6.4 Experiment Results of DPA Attack
6.5 Chip Design
6.6 Conclusion
7 Conclusion
Reference
Publications
International Journal
International Conference (with review)
Domestic Conference (with review)
Domestic Conference (without review)

List of Tables

Table 2.1 A survey of selective video encryption schemes
Table 3.1 Security levels in the USE scheme
Table 3.2 Video data partition size
Table 3.3 Video data partition for different security levels
Table 3.4 Comparison with other video encryption schemes
Table 4.1 Hardware cost of 32-bit AES @ 131 MHz, 0.11um, [48]
Table 4.2 Hardware cost of 8-bit AES @ 100 KHz, 0.35um, [50]
Table 4.3 Bit width of operations in AES algorithm
Table 4.4 Comparison of two architectures
Table 4.5 Possible implementations of AES based on scalable architecture
Table 4.6 Hardware cost of lowest cost AES @ 123 MHz, 0.18 um
Table 4.7 Hardware cost of highest performance AES @ 416 MHz, 0.18 um
Table 4.8 Scalability of hardware implementations
Table 4.9
-7, 8].
Table 4.11 M0 to InvM
Table 4.12 Clock cycles comparison of different dataflows
Table 5.1 Power consumption of four transitions in a circuit
Table 6.1 Summary of different countermeasure methods
Table 6.2 Comparison of time complexity for each countermeasure methods
Table 6.3 Max bit-rate and resolution of selected H.264 levels
Table 6.4 AES0@80MHz, TSMC 0.18um
Table 6.5 AES1.1@125MHz, TSMC 0.18um
Table 6.6 AES1.0@125MHz, TSMC 0.18um
Table 6.7 AES1.2@75MHz, TSMC 0.18um
Table 6.8 VDEC Test Chip


List of Figures

Figure 1.1 Video encoder/decoder system
Figure 1.2 Video encoder
Figure 1.3 Video decoder
Figure 1.4 Secure Video System
Figure 2.1 Video Sequence: I, P, B Frames and I, P, B MBs
Figure 2.2 Coded Video Stream Format
Figure 2.3 Samples of Selective Encryption
Figure 3.1 H.264 Baseline, Main and Extended Profiles
Figure 3.2 H.264/AVC data format
Figure 3.3 Unequal Secure Encryption Scheme
Figure 3.4 Data Partition in H.264/AVC Extended Profile
Figure 3.5 Data Partition by FMO
Figure 3.6 Data Partition by Parameters Extraction
Figure 3.7 FLEX encryption algorithm
Figure 3.8 Leak position in the even and odd rounds
Figure 3.9 XOR Method
Figure 3.10 Comparison of security and computational complexity
Figure 4.1 Dataflow. (a) Encryption. (b) Decryption
Figure 4.2 Transformations in AES algorithm
Figure 4.3 32-bit architecture for AES
Figure 4.4 8-bit architecture for AES
Figure 4.5 Scalable Hardware Architecture for AES
Figure 4.6 Shared S-Box Architecture
Figure 4.7 Unified S-Box Architecture
Figure 4.8 S-Box structure
Figure 4.9 MixColumns structure
Figure 4.10 Dataflows for scalable architecture
Figure 4.11 Comparison with others
Figure 4.12 Optimized Data Flow
Figure 4.13 Proposed Hardware Architecture
Figure 4.14 Implementation of qMj
Figure 4.15 Comparison of dataflow and corresponding data path
Figure 5.1 CMOS Inverter
Figure 5.2 Power consumption of a circuit: Case I
Figure 5.3 Power consumption of a circuit: Case II
Figure 5.4 Last round of AES module
Figure 5.5 2-D views of successful DPA attack
Figure 5.6 3-D views of successful DPA attack
Figure 5.7 2-D views of failed DPA attack
Figure 5.8 3-D views of failed DPA attack
Figure 5.9 Time dimension hiding
Figure 5.10 Amplitude dimension hiding
Figure 5.11 AES after masking
Figure 5.12 S-Box after masking
Figure 6.1 The i-th round of the AES without and with masking countermeasures
Figure 6.2 Proposed Registers Masking
Figure 6.3 Proposed S-Box Masking
Figure 6.4 A power trace of AES
Figure 6.5 Subbytes without and with hiding
Figure 6.6 Hardware design of Subbytes hiding
Figure 6.7 Integrated Subbytes and AddRoundKey
Figure 6.8 Separated Subbyte and AddRoundKey
Figure 6.9 Feedback structure and Data Sliding Structure
Figure 6.10 Ultra low-cost AES with DPA countermeasure
Figure 6.11 Data flow for ultra low-cost AES
Figure 6.12 DPA Attack Evaluation System (Photo)
Figure 6.13 DPA Attack Evaluation System (Architecture)
Figure 6.14 Sasebo Board
Figure 6.15 DPA attack test flow
Figure 6.16 Power trace from oscilloscope
Figure 6.17 2-D view of DPA attack on Pure AES
Figure 6.18 3-D view of DPA attack on Pure AES
Figure 6.19 2-D view of DPA attack on AES with Subbytes hiding
Figure 6.20 3-D view of DPA attack on AES with Subbytes hiding
Figure 6.21 2-D view of DPA attack on AES with masking
Figure 6.22 3-D view of DPA attack on AES with masking
Figure 6.23 2-D view of DPA attack on AES with Independent ARK and Data Sliding
Figure 6.24 3-D view of DPA attack on AES with Independent ARK and Data Sliding
Figure 6.25 Test Chip Architecture
Figure 6.26 Chip design of AES


List of Notations

VOD Video on Demand

AVC Advanced Video Coding

MPEG Moving Picture Experts Group

VCEG Video Coding Experts Group

MB Macro Block

I-MB Intra-coded Macro Block

P-MB Inter-coded Macro Block

B-MB Bi-directional coded Macro Block

MV Motion Vector

MVD Motion Vector Difference

DCT Discrete Cosine Transform

Q Quantization

MC Motion Compensation

FMO Flexible Macroblock Ordering

VLC Variable Length Coding

DES Data Encryption Standard

AES Advanced Encryption Standard

USE Unequal Secure Encryption

FLEX Fast Leakage EXtraction

DPA Differential Power Analysis

CPA Correlation coefficient Power Analysis

GF Galois Field

I.ARK Independent AddRoundKey

D.S. Data Sliding

bps bits per second

P Power consumption

Coefficient for power modeling

HD Hamming Distance

HW Hamming Weight

n Noise

R/REG Registers

C/Comb Combinational logic

Inv Inverse Operation

T / t Time

K / k Key

X / Y Random number

~o(DPA) Time complexity of DPA attack on pure AES

Function

AES0 Pure AES

AES1.0 AES + I.ARK&D.S.

AES1.1 AES + Hiding

AES1.2 AES + Masking

SASEBO Side-channel Attack Standard Evaluation Board

Introduction

1 Introduction

Multimedia is a hot topic in this IT era, especially in telecommunications and on the Internet. Ten years ago, people communicated over the Internet with text-based tools such as ICQ, and published their information online only as text or pictures. Now things have changed. We talk to others face to face over the Internet using a monitor and a web camera. We use Skype to make Internet calls for free. We share our videos on YouTube and our personal photos on Picasa, watch TV on PPStream and enjoy music on Kugou. All of these applications can be used on the Internet for free, and we can even use them from a mobile phone, anywhere.

The virtual world becomes more and more attractive because we can sense it, and video plays a most important role. Popular video applications include VOD (Video on Demand), which is used to watch movies over the Internet; Pay-TV, which is widely used in television set-top boxes; and video conferencing. However, video data is very large, which makes its transmission and storage a problem. To reduce the data size, many video compression methods have been proposed, such as MPEG-4 and the latest H.264/AVC. On the other hand, to protect the sensitive information in video, many video encryption schemes have been proposed. Cryptosystems are widely used in many video applications.

1.1 Video Compression

Video makes multimedia applications more attractive. However, an uncompressed video sequence requires a large bit rate (approximately 2 Gbps for HDTV 1080p). Compression is therefore necessary for practical storage and transmission of digital video.
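The order of magnitude of the uncompressed bit rate can be checked with a short calculation. The sketch below is only illustrative: the frame rate (60 frames per second), the sample depth (8 bits) and the chroma formats are assumptions that the text does not state, and the function name is hypothetical.

```python
# Raw (uncompressed) video bit rate under assumed capture parameters.
# Average samples per pixel for common chroma subsampling formats.
SAMPLES_PER_PIXEL = {"4:4:4": 3.0, "4:2:2": 2.0, "4:2:0": 1.5}


def raw_bit_rate(width, height, fps, bits_per_sample=8, chroma="4:2:0"):
    """Return the uncompressed bit rate in bits per second."""
    return width * height * SAMPLES_PER_PIXEL[chroma] * bits_per_sample * fps


if __name__ == "__main__":
    for chroma in ("4:2:0", "4:2:2", "4:4:4"):
        rate = raw_bit_rate(1920, 1080, 60, chroma=chroma)
        print(f"1080p at 60 fps, {chroma}: {rate / 1e9:.2f} Gbps")
```

With these assumed parameters, the 4:2:2 case comes out close to the 2 Gbps figure quoted above; other frame rates or sample depths scale the result linearly.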

Video compression research has a long history. Over the past decades, many international standards have been developed, namely the ISO/IEC MPEG-x series [1][2][3][4][5] and the ITU-T H.26x series [6][7]. The Moving Picture Experts Group (MPEG) is a working group of the International Organization for Standardization (ISO) that develops standards for the compression, processing and representation of moving pictures and audio. It has been responsible for a series of important standards, such as MPEG-1 [1] (compression of video and audio for CD playback) and MPEG-2 [2] (storage and broadcasting of television-quality video and audio). MPEG-4 [3] (coding of audio-visual objects) is the latest standard that deals specifically with audio-visual coding. MPEG-7 [4] and MPEG-21 [5] are concerned with multimedia content representation and a generic multimedia framework, respectively. MPEG is best known for its contribution to audio and video compression. In particular, MPEG-2 is widely used in digital TV broadcasting and DVD video, and MPEG Layer 3 audio coding has become very popular for music storage and sharing.

The Video Coding Experts Group (VCEG) is a working group of the International Telecommunication Union Telecommunication Standardization Sector (ITU-T). VCEG has developed a series of standards for video communication over telecommunication networks and computer networks, such as H.261 [6] (the first widely used standard for video conferencing) and the subsequent H.263 [7] and its later versions. In 2001, VCEG and MPEG began to cooperate and formed a new organization, the Joint Video Team (JVT). The JVT consists of members of MPEG and VCEG, and its main purpose is to develop a new video coding standard, Advanced Video Coding (AVC) [8], which is also known as MPEG-4 part 10 or H.264.

Figure 1.1 Video encoder/decoder system

A basic video encoder/decoder system is shown in Figure 1.1. The camera captures video sequences and transfers the uncompressed video data to the video encoder. The encoder compresses the video data according to a specific coding standard and then sends the coded video stream to the transmission channel. On the receiver side, the video decoder receives the coded video stream and decodes it using the same coding standard, and then the display plays the video sequence.

A conventional video encoder architecture is shown in Figure 1.2. It consists of three main functional units: a temporal model, a spatial model and an entropy encoder. The temporal model attempts to reduce temporal redundancy by exploiting the similarities between neighbouring video frames or neighbouring macro blocks (MBs) within one frame. The figure shows two predictors: an inter predictor for inter-frame prediction and an intra predictor for intra-frame prediction. The output of the temporal model includes residual data and a set of model parameters, such as motion vectors, block types and prediction types.

The spatial model makes use of similarities between neighbouring samples in the residual frame to reduce spatial redundancy. In MPEG-4 Visual and H.264, this is achieved by DCT transformation and quantization. The input of the spatial model is the residual data produced by the temporal model, and its output is a set of quantized transform coefficients.

Figure 1.2 Video encoder.

Figure 1.3 Video decoder.

The entropy encoder compresses the parameters produced by the temporal model and the transform coefficients produced by the spatial model. The final encoded video stream consists of header information, parameter information, coded motion vectors and coded residual data.

A conventional video decoder architecture is shown in Figure 1.3. The decoder architecture is much simpler than that of the encoder. The entropy decoder extracts the header, parameters and transform coefficients from the coded video stream. The prediction parameters are used to reconstruct the prediction data in the motion compensation (MC) module. Combined with the residual data from inverse quantization and inverse DCT, the prediction data allows the original video sequence to be reconstructed.

1.2 Video Encryption

With the increase of multimedia applications, huge amounts of digital visual data are stored on different media and exchanged over various sorts of networks. As a consequence, techniques are required to provide security functionalities such as privacy, integrity or authentication. The area of multimedia security is aimed towards these emerging technologies and applications.

To protect video content, conventional technologies can be classified into three categories: 1) Encryption technology, which provides end-to-end security when distributing video over the Internet or other public communication channels. 2) Watermarking technology, which achieves copyright protection, ownership tracing and authentication. 3) Access control technology, which prevents unauthorized access. In this dissertation, we focus on video data encryption technology.

Figure 1.4 Secure Video System.

Several dedicated international meetings have emerged as a forum to represent and

transaction, and EURASIP and so on. However, a common video encryption standard

does not exist. A conventional video encryption system is shown in Figure 1.4. Two security modules are added: a video encryption module for video content encryption and an RSA encryption module for secret key exchange.

Several review papers have been published on video encryption, such as the work of Liu and Eskicioglu [9] [10] and of Furht, Socek and Eskicioglu [11]. From these review papers and other literature, we found that most of the proposed video encryption schemes are designed for previous video coding standards such as MPEG-1, MPEG-2/H.262, MPEG-4 and H.263, and that most of them use selective encryption. Selective encryption encrypts only a portion of the video bit-stream. By contrast, full encryption encrypts the whole video bit-stream using a specific encryption algorithm.

Full video encryption has two different approaches: (a) Video scrambling, which permutes the video in the time domain or the frequency domain; however, it does not provide substantially high security. (b) Encryption, which encrypts the entire video data using a standard cryptographic algorithm; however, the computational cost is very high.

Selective video encryption can be further classified into three types: temporal domain schemes, spatial domain schemes and entropy coding schemes. A temporal domain scheme selects temporal model parameters such as motion vectors, DCT coefficients, I-blocks and I-frames. Most selective encryption methods are based on the temporal domain [14-32]. A spatial domain scheme makes use of spatial model parameters in the video data; the scheme in [23], for example, uses the quadtree structure of motion vectors and the quadtree structure of residual errors for encryption. An entropy coding scheme uses a special entropy codec to perform encryption; the schemes in [33-35] use multiple Huffman tables and multiple state indices in the entropy encoder.

The detailed introduction and discussion of selective video encryption schemes are

provided in Chapter 2.

1.3 Cryptography

Before the modern era, cryptography was concerned solely with message

confidentiality. In recent decades, the field has expanded beyond confidentiality concerns

to include techniques for message integrity checking, sender/receiver identity

authentication, digital signatures, interactive proofs, and secure computation, amongst

others. Modern cryptography can be divided into two types: 1) Symmetric key

cryptography and 2) Asymmetric key cryptography.

Symmetric Key Cryptography

The common property of symmetric key cryptography is that a secret key is shared between the communicating parties. This key is used both as the encryption key and as the decryption key. Most modern cryptosystems use symmetric key cryptography. Well-known and widely used symmetric cryptosystems include DES [36] (Data Encryption Standard), IDEA [40], Triple-DES and AES [37] (Advanced Encryption Standard). Typical key sizes are 56 bits (DES), 128 bits (IDEA, AES), and 192 and 256 bits (AES).

Asymmetric Key Cryptography

Asymmetric key cryptography is also called public key cryptography. There are two different keys in this system: a public key, which is publicly known, and a secret key. They are used for encryption and decryption: if data is encrypted with a public key, it can only be decrypted with the corresponding secret key, and vice versa. Two widely used asymmetric cryptosystems are RSA [38] (named after Ron Rivest, Adi Shamir and Leonard Adleman at MIT) and ECC [39] (elliptic curve cryptography). RSA is the most popular asymmetric cryptosystem, with key lengths of 1024 bits, 2048 bits or longer. ECC is a newer asymmetric cryptosystem with smaller key sizes.

Both symmetric key and asymmetric key cryptography have advantages and disadvantages. Symmetric ciphers require a much smaller key size for the same level of security, their computations are much faster, and their memory requirements are smaller. However, since all parties share the same key, they must keep the key absolutely secret, which becomes more dangerous as the number of involved parties increases. In a conventional cryptosystem, asymmetric ciphers are used for key exchange, authentication, digital signatures and integrity checking, and symmetric ciphers are used for data encryption. For video encryption, symmetric ciphers are widely used. Meanwhile, some scrambling methods with low security are also widely used in selective video encryption.

Side-Channel Attack

A side-channel attack uses side-channel information from cryptographic devices, such as power consumption or execution time, to recover the secret key stored in these devices. In particular, differential power analysis (DPA) has been used successfully to break symmetric ciphers such as DES and asymmetric ciphers such as RSA. It poses a serious threat to the security of current cryptosystems.

1.4 Our Contributions and Dissertation Organization

In this dissertation, a new video encryption scheme for H.264/AVC and the hardware

design of encryption module are proposed.

An Unequal Secure Encryption (USE) scheme is proposed for the H.264/AVC video coding standard. The USE scheme has three major targets: security, feasibility and low computational cost. In the USE scheme, we encrypt the entire video data using standard cryptography to make the scheme highly secure. We perform all encryption operations after entropy coding to separate the video coding system from the encryption system, so the USE scheme is feasible in any kind of video security application. The remaining problem is computational cost. Since the computational cost of encrypting the entire video data is very high, we need some optimization to reduce it. Here we use two methods: (1) Data classification. We classify the video data into two partitions, an important data partition and an unimportant data partition. Many new features in H.264/AVC make this procedure easy to implement. Normally, the important data partition is smaller than the unimportant one. (2) Unequal secure encryption. We use AES to encrypt the important data partition and our proposed FLEX to encrypt the unimportant data partition. FLEX is a cipher based on AES, and its computational cost is only 20% of that of AES. In this way, the scheme remains highly secure at low computational cost.
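The effect of unequal encryption on cost can be illustrated with a small calculation. The sketch below takes the stated 20% cost ratio of FLEX to AES and expresses the total cost relative to encrypting everything with AES. The share of important data (20% in the example) is an assumed value chosen only for illustration, and the function name is hypothetical.

```python
# Relative cost of unequal encryption compared with encrypting all data by AES.
def relative_cost(important_fraction, flex_cost_ratio=0.20):
    """Cost of AES on the important partition plus FLEX on the rest.

    important_fraction: share of the stream in the important partition (0..1).
    flex_cost_ratio: cost of FLEX per byte relative to AES (0.20 in the text).
    """
    if not 0.0 <= important_fraction <= 1.0:
        raise ValueError("important_fraction must lie between 0 and 1")
    unimportant_fraction = 1.0 - important_fraction
    return important_fraction + unimportant_fraction * flex_cost_ratio


if __name__ == "__main__":
    # Assumed example: 20% of the stream is classified as important.
    print(f"relative cost: {relative_cost(0.20):.2f} of full AES encryption")
```

The smaller the important partition, the closer the total cost approaches the 20% lower bound set by FLEX alone.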

Hardware architectures for AES and RSA are proposed for low-cost and high-performance design. For AES, we propose a scalable framework for designing AES modules with different throughput and hardware cost, which is especially useful for low-cost, low-power AES design. The architecture has two important features: 1) Scalable S-Box and scalable MixColumns design, which supports different numbers of S-Box and MixColumns units running in the architecture. 2) Parallel data path design. The proposed architecture has three main data paths: the S-Box data path, the MixColumns data path and the AddRoundKey data path, each with a different bit width. The advantages are high scalability and a shortened critical path. For RSA, we propose a high-speed Montgomery multiplier design. It parallelizes the data path and shortens the critical path. With the proposed clock-saving dataflow, it reduces the total number of clock cycles per multiplication to a very small number. Our design achieves very high performance at low hardware cost.

DPA countermeasure methods for AES design are proposed to counter DPA attacks. Five countermeasure methods are proposed in this dissertation: 1) Independent ARK separates the AddRoundKey operation from the other operations in the hardware data path. 2) Data Sliding scrambles the register data. 3) SubBytes Hiding randomizes the SubBytes operation in the time domain. 4) Simplified S-Box masking introduces randomization into the S-Box power consumption. 5) Register masking introduces randomization into the register power consumption. The five methods can be used together or independently. Theoretical analysis and experimental results show that the proposed methods greatly increase security and effectively counter DPA attacks.

An ultra low-cost implementation of AES with DPA countermeasures is proposed for video encryption. It is based on the proposed scalable architecture and combines all of the proposed countermeasure methods. Only one S-Box is used in this implementation, and the hardware cost is extremely low, about 7k gates with the TSMC 0.18 µm standard cell library. The throughput is about 75 Mbps, which can be used in real-time video encryption. Five countermeasure methods are used, and they are evaluated with our DPA attack system. A test chip that includes four AES implementations is designed.

The rest of this dissertation is organized as follows. A survey of selective video encryption schemes is given in Chapter 2. The proposed Unequal Secure Encryption (USE) scheme is presented in Chapter 3. The hardware architectures for AES and RSA are presented in Chapter 4. The DPA attack and conventional countermeasure methods are introduced in Chapter 5. The proposed DPA countermeasure methods and the ultra low-cost implementation of AES with DPA countermeasures are presented in Chapter 6. Finally, Chapter 7 concludes the dissertation.

Selective Video Encryption Schemes

2 Selective Video Encryption Schemes

In order to encrypt video data in real time, selective video encryption is proposed to reduce computational cost. The basic idea of selective encryption is to encrypt only a part of the compressed video data, which reduces the computational cost. The selected part is considered the important part of the data. There is no clear definition of the importance of video data. Normally, importance can be interpreted as difficulty: the more difficult it is to reconstruct the video when a piece of data is lost, the more important that data is considered to be. For example, the header information and the parameters are much more important than the VLC data in a video stream. Over the past years, a number of selective video encryption schemes for different video coding standards have been proposed. An introduction to and discussion of these schemes are presented in this chapter.

2.1 Visual Data Formats

2.1.1 Video Sequence

A video sequence is usually organized as a series of rectangular arrays called frames, as shown in Figure 2.1. Figure 2.1 A) shows the frame structure of a video sequence. The video data is organized frame by frame, and every frame contains a picture. The frames are played along the time axis to form a video. There are several types of frame: I-frame, P-frame and B-frame, as shown in B). I-frame is short for intra frame. An I-frame is independent of other frames and is encoded by intra-frame prediction. P-frame is short for predicted frame, which means that it is predicted from previous frames. B-frame is short for bi-directional predicted frame: both backward and forward frames are used for prediction. A B-frame has two reference frames, while a P-frame has only one. An I-frame can be reconstructed by itself, whereas B- and P-frames must be combined with their reference frames to reconstruct a picture.

Figure 2.1 C) shows the MB structure in a frame. The MB size varies, depending on the definition in the specific video coding standard. Normally, there are two kinds of MB: I-MB and P-/B-MB. As with I-, P- and B-frames, an I-MB is an intra MB, which is predicted from the surrounding MBs in the same frame, and a P- or B-MB is an inter MB, which is predicted from MBs in other frames.

Figure 2.1 Video Sequence: I, P, B Frames and I, P, B MBs.

Figure 2.2 Coded Video Stream Format.

2.1.2 Coded video stream format

A coded video sequence consists of one or more video packets. A video packet is analogous to a slice or a frame in MPEG-1, MPEG-2 or H.264, and consists of a resynchronization marker, a header field and a series of coded macro blocks.

As shown in Figure 2.2, the coded video stream is formatted as a layered structure. The sequence layer defines the properties of the whole video sequence in a sequence header, followed by a string of frames. The frame layer includes the frame header, which indicates the frame properties, such as the frame type and whether the frame is used for prediction, and the MBs in the frame. The MB layer consists of MB parameters such as MB type, MB partitions and MB prediction methods. For P- and B-MBs, it includes motion vectors and VLC

(Residual data coded by Variable

vectors.

2.2 Conventional video encryption methods

2.2.1 Cryptography based video encryption

Cryptography-based video encryption means that the selected video contents are encrypted by a cryptosystem such as DES or AES. The security depends heavily on the cryptosystem; in other words, it rests on a cipher whose security is approved and certified.

Header encryption

Since the header information in a video stream contains many parameters needed to reconstruct the video data, most video encryption schemes make use of headers. There are several headers in different layers of the video bit-stream: the sequence header contains global parameters for the whole video sequence; the slice header defines constraints only for the current slice; and the MB header is the lowest-level header, which consists of MB type, MB partitions and so on.

I-frame, I-MB encryption

I-frames and I-MBs are very important for reconstructing a picture, because they can be reconstructed without any other information. I-frames are widely used to synchronize video data or to recover broken pictures in a video stream. I-MBs are also very important for MB reconstruction and are then used for prediction. Since P- and B-frames are reconstructed from predictions obtained from I-frames, the main assumption is that if the I-frames are encrypted, the P- and B-frames are expected to be well protected.

Motion Vectors encryption

Motion vectors are used in B- and P-frames to indicate the prediction position of each MB. In the decoder, the MC (motion compensation) block uses the motion vectors to reconstruct the prediction data. Motion vectors comprise about 10% to 20% of the entire video data; therefore, many high-security encryption schemes encrypt them.

VLC encryption

VLC data occupies most of the video data, about 60% to 80%. Most of the proposed video encryption schemes leave it unencrypted to save computational cost; however, this may pose a serious security problem in the future. VLC data can be classified into two types: I-MB VLC and B-/P-MB VLC. Some high-security encryption schemes encrypt I-MB VLC and leave B-/P-MB VLC unencrypted, because I-MB VLC occupies only about 20% of the total VLC data.
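The shares quoted in this section give a rough idea of how much data a selective scheme actually passes through the cipher. The sketch below is a back-of-the-envelope estimate only: it combines the quoted ranges for motion vectors (10% to 20% of the stream) and VLC data (60% to 80% of the stream, of which about 20% belongs to I-MBs), ignores headers, and uses a hypothetical function name.

```python
# Estimate the share of a coded stream encrypted by a selective scheme.
def encrypted_fraction(mv_share, vlc_share, intra_vlc_share=0.20,
                       encrypt_mv=True, encrypt_intra_vlc=True):
    """Return the fraction of the whole stream that is encrypted.

    mv_share: motion vector data as a fraction of the stream.
    vlc_share: VLC (residual) data as a fraction of the stream.
    intra_vlc_share: I-MB VLC as a fraction of all VLC data.
    """
    fraction = 0.0
    if encrypt_mv:
        fraction += mv_share
    if encrypt_intra_vlc:
        fraction += vlc_share * intra_vlc_share
    return fraction


if __name__ == "__main__":
    # The two calls use the lower and upper ends of the quoted ranges.
    low = encrypted_fraction(mv_share=0.10, vlc_share=0.60)
    high = encrypted_fraction(mv_share=0.20, vlc_share=0.80)
    print(f"encrypted share: about {low:.0%} to {high:.0%} of the stream")
```

Under these assumptions, a scheme that encrypts motion vectors and I-MB VLC still leaves well over half of the stream in the clear.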

2.2.2 Permutation based video encryption

Permutation-based video encryption means that the selected video contents are encrypted by permutation, scrambling, shuffling or other simple methods. These methods target video encryption with low computation and low security. Their security is very weak and has not been proven.

Macroblock Permutation

As discussed in Section 2.1.1, each frame consists of the same number of macroblocks, and each macroblock contains a piece of the picture. Macroblock permutation exchanges the order of macroblocks within a frame. This encryption variant makes the picture annoying to watch but is not secure, because the original neighboring macroblocks can be recovered from the correlation of their border pixels. The risk increases when more frames are permuted using the same order.
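Macroblock permutation can be sketched in a few lines. The example below is a minimal illustration, not any published scheme: the key seeds Python's `random.Random`, which is not a cryptographically secure generator, and the macroblocks are represented by placeholder labels. All function names are hypothetical.

```python
import random


def keyed_order(count, key):
    """Derive a permutation of range(count) from a key (illustration only)."""
    order = list(range(count))
    random.Random(key).shuffle(order)  # not cryptographically secure
    return order


def permute_macroblocks(blocks, key):
    """Reorder the macroblocks of one frame according to the keyed order."""
    order = keyed_order(len(blocks), key)
    return [blocks[source] for source in order]


def restore_macroblocks(shuffled, key):
    """Invert permute_macroblocks by regenerating the same keyed order."""
    order = keyed_order(len(shuffled), key)
    restored = [None] * len(shuffled)
    for position, source in enumerate(order):
        restored[source] = shuffled[position]
    return restored


if __name__ == "__main__":
    frame = [f"MB{i}" for i in range(12)]
    scrambled = permute_macroblocks(frame, key=2009)
    print(scrambled)
    print(restore_macroblocks(scrambled, key=2009) == frame)
```

Because the content of every macroblock is unchanged, an attacker can compare border pixels to reassemble the frame, which is the weakness described above.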

Motion Vectors Permutation

Each predicted P- or B-MB has a corresponding motion vector. It is possible to permute the motion vectors assigned to different macroblocks. However, this only affects the P- and B-frames, and in many cases many motion vectors within the same frame have the same overall direction. Thus, the distortion of the encrypted video is very light.

DCT Coefficient Permutation

Similar to macroblock permutation, DCT coefficient permutation exchanges the DCT coefficients within a macroblock. There are two kinds of DCT permutation: DC coefficient permutation and AC coefficient permutation. Since there are far fewer DC coefficients than AC coefficients, most proposed schemes use DC coefficient permutation. However, like macroblock permutation, this method is not secure either: it makes reconstruction difficult, but not impossible.

Sign-bit Masking

Many values in a coded video stream carry a sign bit; for example, a motion vector has a sign bit to indicate its direction, and a DCT coefficient also has a sign bit. Sign-bit masking flips selected sign bits, changing a sign from +1 to -1 or from -1 to +1.
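Sign-bit masking can be sketched as an XOR between the sign bits and a key-dependent bit stream. In the example below, the keystream is derived from SHA-256 in a counter construction purely as a stand-in; the schemes surveyed here use their own ciphers, and all names in the sketch are hypothetical.

```python
import hashlib


def keystream_bits(key: bytes, count: int):
    """Generate `count` pseudo-random bits from a key (stand-in generator)."""
    bits = []
    counter = 0
    while len(bits) < count:
        block = hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        for byte in block:
            for shift in range(8):
                bits.append((byte >> shift) & 1)
        counter += 1
    return bits[:count]


def mask_signs(values, key: bytes):
    """Flip the sign of each value whose keystream bit is 1.

    Applying the function twice with the same key restores the input,
    because flipping a sign is its own inverse.
    """
    bits = keystream_bits(key, len(values))
    return [-value if bit else value for value, bit in zip(values, bits)]


if __name__ == "__main__":
    coefficients = [12, -3, 0, 7, -1, 4, -9, 2]
    masked = mask_signs(coefficients, b"demo key")
    print(masked)
    print(mask_signs(masked, b"demo key") == coefficients)
```

Only the signs change, so the magnitudes of all coefficients and motion vectors remain visible, which limits the protection this method can offer.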

In practice, most of the proposed video encryption schemes adopt more than one encryption method to ensure security. Many schemes also define several security levels to balance security against computational cost.

2.3 A Survey of selective video encryption schemes

Many selective video encryption schemes have been proposed. Liu and Eskicioglu in [9] and Furht, Socek and Eskicioglu in [11] have presented a comprehensive classification that includes most of the published selective video encryption algorithms. A recent version of this classification is shown in Table 2.1 [65]. According to their work, these encryption schemes can be classified into three types: frequency domain schemes, spatial domain schemes and entropy coding schemes. Comprehensive surveys of video encryption techniques are given in [9-13].

Table 2.1 A survey of selective video encryption schemes

(adapted from [9][11])

Domain | Proposal | Encryption Algorithm | Encrypted Content | Year & Ref
Frequency Domain | Meyer, Gadegast | DES, RSA | Header, All I-block, Partial I-block, I-frame | 1995 [14]
Frequency Domain | Spanos, Maples | DES | I-frame, Sequence Header, ISO end code | 1995 [15][16]
Frequency Domain | Tang | DES, Permutation | DCT coefficients | 1996 [17]
Frequency Domain | Qiao, Nahrstedt | IDEA, XOR, Permutation | Every other bit of bit-stream | 1997 [18]
Frequency Domain | Shi, Bhargava | XOR | DCT coefficients (Only Sign bit) | 1998 [19]
Frequency Domain | Shi, Wang, Bhargava | IDEA | Motion Vector (Only Sign bit) | 1999 [20]
Frequency Domain | Alattar, Al-Regib, Al-Semari | DES | Header (nth I-macroblock); Header (nth Predicted-macroblock) | 1999 [21]
Frequency Domain | Shin, Sim, Rhee | RC4, Permutation | DCT coefficients (Only Sign bit of I frame) | 1999 [22]
Frequency Domain | Cheng, Li | No algorithm is specified | Significance information (Pixel and set related) in the residual data (Two highest pyramid levels of SPIHT) | 2000 [23]
Frequency Domain | Tosun, Feng | VEA | DC & AC (Lower layer). Classify the coefficients into three layers; the two lower layers are encrypted by VEA | 2000 [24]
Frequency Domain | Wen, Severa, Zeng, Luttrell, Jin | AES, DES | Significance information (FLC codewords, VLC codewords) | 2002 [25]
Frequency Domain | Zeng, Lei | XOR, Permutation | Transform coefficients (JPEG & Wavelet); Motion Vector (JPEG) | 2002 [26]
Frequency Domain | Wu, Mao | Any modern cipher, random shuffling on bit-planes in MPEG-4 FGS | Bitstream; Quantized values (before RLC Coding); RLC symbol; Intra bit-plane shuffling | 2002 [27]
Frequency Domain | Choon, Samsudin, Budiarto | Confusion, Diffusion | Permutation between macroblock and XOR template in macroblock | 2004 [28]
Frequency Domain | Liu, Li, Dong | Permutation | Permutation of macroblocks, DC, AC | 2004 [29]
Frequency Domain | Liu, Ikenaga, Baba, Goto | Event Shuffle, DCEA | DCT coefficients | 2004, 2006 [30][31]
Frequency Domain | Wang, Fan, Ikenaga, Goto | XOR, Permutation | Motion Vector (Only Sign bit); Intra mode; Trailing one of VLC code | 2007 [32]
Spatial Domain | Cheng, Li | No algorithm is specified | Motion Vector (Quadtree structure); Residual errors (Quadtree structure) | 2000 [23]
Entropy Codec | Wu, Kuo | MHT (Multiple Huffman tables) & multiple states | Encryption of data by MHT & multiple state indices in the QM coder | 2000, 2001 [33][34]
Entropy Codec | Cheong, Hung, Tung, Ke, Chen | MHT rotation, XOR | DCT Coefficients; Motion Vectors; MHT | 2005 [35]

Some important selective video encryption schemes include:

SECMPEG [14]

SECMPEG, also called Secure MPEG, was proposed by Meyer and Gadegast in 1995. It was designed for the MPEG-1 video standard. It defines four levels of security:

1) Encrypt the header data from the sequence layer down to the slice layer.

2) Encrypt the same data as in level 1 and the low-frequency DCT coefficients of all blocks in I-frames.

3) Encrypt all I-blocks (including I-frames and the I-blocks in B- and P-frames).

4) Encrypt the entire video.

The authors chose the DES symmetric cryptosystem for encryption, which was the natural choice since DES was the official symmetric algorithm at that time. AES can also be used in SECMPEG for higher security.

Aegis [15][16]

Aegis was proposed by Maples and Spanos in 1995 and was designed for MPEG-1 and MPEG-2. Aegis encrypts the following parts of a video stream: the video sequence header, the I-frames and the ISO end code. It leaves B- and P-frames unencrypted. The encryption engine is also DES. Aegis is very similar to SECMPEG level 3.

VEA [18]

VEA stands for Video Encryption Algorithm, which was developed by Qiao and Nahrstedt in 1997. It was also designed for the MPEG video standard. This algorithm processes the whole video stream, which makes it significantly different from other selective video encryption schemes. The algorithm consists of the following four steps:

1) Let the $2n$-byte sequence $a_1, a_2, \ldots, a_{2n}$ represent the video frames.

2) Separate the sequence into two lists: the odd list $(a_1, a_3, a_5, \ldots, a_{2n-1})$ and the even list $(a_2, a_4, \ldots, a_{2n})$.

3) XOR the two lists into one list: $(b_1, b_2, \ldots, b_n) = (a_1, a_3, \ldots, a_{2n-1}) \oplus (a_2, a_4, \ldots, a_{2n})$.

4) Apply the chosen symmetric cryptosystem $E$ with the secret key to encrypt either the odd list or the even list. The ciphertext sequence is $\{E_{key}(a_1, a_3, a_5, \ldots, a_{2n-1}), b_1, b_2, \ldots, b_n\}$ or $\{E_{key}(a_2, a_4, \ldots, a_{2n}), b_1, b_2, \ldots, b_n\}$.
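The four steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the cipher E is replaced by a toy XOR keystream built from SHA-256 (a hypothetical stand-in for DES, IDEA or AES), the list holding a2, a4, ... is the one chosen for encryption, and all function names are invented for the example.

```python
import hashlib


def toy_cipher(data: bytes, key: bytes) -> bytes:
    """Stand-in for the symmetric cryptosystem E (assumption, not a real cipher choice).

    XORs the data with a SHA-256 counter keystream, so the same call
    both encrypts and decrypts.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + (offset // 32).to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(c ^ p for c, p in zip(chunk, pad))
    return bytes(out)


def vea_encrypt(frame: bytes, key: bytes):
    """Split into two interleaved lists, XOR them, and encrypt one list."""
    if len(frame) % 2:
        raise ValueError("this sketch expects an even number of bytes")
    first = frame[0::2]    # a1, a3, ..., a(2n-1)
    second = frame[1::2]   # a2, a4, ..., a(2n)
    xored = bytes(x ^ y for x, y in zip(first, second))  # b1, ..., bn
    return toy_cipher(second, key), xored


def vea_decrypt(encrypted_second: bytes, xored: bytes, key: bytes) -> bytes:
    """Recover the second list with E, then the first list by XOR."""
    second = toy_cipher(encrypted_second, key)
    first = bytes(x ^ y for x, y in zip(xored, second))
    frame = bytearray(len(xored) * 2)
    frame[0::2] = first
    frame[1::2] = second
    return bytes(frame)


if __name__ == "__main__":
    data = bytes(range(64))
    enc, b = vea_encrypt(data, b"secret")
    print(vea_decrypt(enc, b, b"secret") == data)
```

Only half of the bytes pass through E, yet every output byte depends on the key, because the XOR list cannot be separated into its two inputs without the encrypted half.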

RVEA [19][20]

RVEA was proposed by Shi, Wang and Bhargava in 1999. In fact, they proposed four video encryption methods: Method 1, Method 2 (VEA), Method 3 (MVEA) and Method 4 (RVEA). RVEA has the highest security level of the four. RVEA uses a conventional cryptosystem such as DES or AES to encrypt the sign bits of DCT coefficients and motion vectors, which are simply extracted from the MPEG video sequence. The encrypted bits are then restored to their original positions.

Alattar [21]

In 1999, Alattar, Al-Regib and Al-Semari presented a video encryption scheme based on the DES cryptosystem. They defined four security methods:

1) Method 0: Encrypt all macroblocks from I-frames and the headers of all predicted macroblocks.

2) Method 1: Encrypt all data associated with every nth I-macroblock.

3) Method 2: Encrypt the same data as in Method 1 and all header data of predicted macroblocks.

4) Method 3: Encrypt the same data as in Method 1 and every nth predicted macroblock.
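The "every nth macroblock" selection used by these methods can be sketched as a simple counter over macroblock types. The example is hypothetical: the macroblock types are given as a plain list of labels, and counting is assumed to start so that the nth, 2nth, ... macroblocks of the wanted type are selected, which the text does not specify.

```python
# Pick every nth macroblock of a given type for encryption (illustrative only).
def select_every_nth(mb_types, n, wanted="I"):
    """Return the indices of every nth macroblock whose type equals `wanted`."""
    if n < 1:
        raise ValueError("n must be at least 1")
    selected = []
    seen = 0
    for index, mb_type in enumerate(mb_types):
        if mb_type == wanted:
            seen += 1
            if seen % n == 0:
                selected.append(index)
    return selected


if __name__ == "__main__":
    types = ["I", "P", "I", "I", "B", "I", "P", "I"]
    print(select_every_nth(types, n=2))              # every 2nd I-macroblock
    print(select_every_nth(types, n=1, wanted="P"))  # all P-macroblocks
```

A larger n lowers the amount of encrypted data, and therefore the cost, at the price of leaving more macroblocks unprotected.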

2.4 Problems of current video encryption scheme

There are three main problems in current selective encryption schemes.

A. Security Problem

A lot of cryptanalysis work has been done on the proposed video encryption schemes [10, 41-45]. The security of schemes that do not use standard cryptographic algorithms is very low. For example, permutation is highly risky, as shown in [10, 42-44]. Even when standard cryptographic algorithms such as DES or AES are used in a video encryption scheme, many security problems remain. The corresponding cryptanalysis can be found in [10, 41, 45].

Another crucial problem of selective encryption is that information cannot be fully concealed after encryption. Some objects in the video sequence can still be recognized from the unencrypted part of the video stream. As shown in Figure 2.3 [63] [64], the original video is encrypted by I-frame/I-MB encryption, DCT coefficient encryption and motion vector encryption, respectively. After encryption, the video quality is greatly reduced; however, the contents of the picture can still be recognized. For security-sensitive systems, this poses a great risk.

B. Computational Cost Problem

Some methods can provide substantial security. However, computational overhead and

data overhead become worse. For example, VEA scheme [18] is

[11]. However, it needs to

encrypt half of the video data using the internal encryption scheme E and to transfer a large number of additional keys to the receiver. The computational cost of selective video encryption schemes is analyzed in detail in the next chapter.



Figure 2.3 Samples of Selective Encryption.

C. Feasibility Problem

Feasibility is another problem in many schemes. A lot of existing schemes are

so called . It means that the video

encryption module must be integrated into the video compression system. For example, permutation of AC and DC coefficients must be done before entropy coding. The encryption therefore interrupts the video compression procedure, and the encryption module must be integrated into the video compression system. That is why the standard

this secure encoder should be .

This makes such schemes very hard to use widely in commercial applications.



2.5 Conclusion

In this chapter, I introduced the fundamentals of selective video encryption. A survey of current video encryption schemes and a discussion of their problems were presented. From the security point of view, the best way of protecting video data is full encryption, which encrypts the entire video data with a standard cryptosystem. However, its expensive computational overhead makes it inefficient or impossible in many applications. Selective encryption therefore encrypts only a part of the video data, aiming to reduce the computational cost while keeping the security level high. However, many proposed schemes achieve only moderate to low security, and only a few of the proposed methods promise substantial security. A highly secure video encryption scheme with low computational cost is necessary for future high definition video coded by H.264/AVC.


3 Unequal Secure Encryption (USE) Scheme for H.264

and the computational cost. Selective encryption schemes show their weak points in

security, and full encryption schemes require too much computational cost. The original idea of our USE scheme is: if video data can be selected, why not also select the cryptosystem for each selected part of the video data? In this way, all of the video data can be encrypted, and a light cryptosystem with low computational cost can be chosen to encrypt the unimportant data set, which reduces the total computational cost. As a result, both the security problem and the computational cost problem can be solved.

This chapter introduces our proposed USE video encryption scheme. In the last part,

the comparison with other schemes is discussed.

3.1 Introduction of H.264

H.264/AVC is the newest international video coding standard. It has been approved by

ITU-T as Recommendation H.264 and by ISO/IEC as International Standard 14496-10

(MPEG-4 part 10) Advanced Video Coding (AVC).

There are a lot of new techniques used in H.264/AVC, including new coding techniques, new data structures, and new video storage and broadcast techniques. As the USE scheme is applied after video coding, the details of H.264/AVC coding, storage and transmission techniques do not need to be considered very much. The H.264/AVC video data structure has more impact on the USE scheme: data classification requires careful study of the data structure of H.264/AVC.



Figure 3.1 H.264 Baseline, Main and Extended Profiles.

Figure 3.2 H.264/AVC data format. (Hierarchy shown in the figure: a sequence of NAL units, each with a NAL header; each slice consists of a slice header and slice data; slice data consists of MBs and skip_run elements; each MB contains mb_type, mb_pred and coded residual.)



In H.264/AVC, profiles and levels specify conformance points. A profile defines a set

of coding tools or algorithms that can be used in generating a conforming bitstream,

whereas a level places constraints on certain parameters of the bitstream. The first version

of H.264/AVC defines a set of three profiles as shown in Figure 3.1 [46][47].

The Baseline profile supports most features except two sets: 1) B slices, CABAC, field coding and weighted prediction; 2) SP/SI coding and data partitioning. The Main profile supports the first set of features, but does not support FMO, ASO and redundant slices, which are supported in the Baseline profile. The Extended profile supports all features of the Baseline profile and both sets of features, except for CABAC.

Some new features which can be used in the USE scheme are listed below:

Coded Data Format: H.264/AVC makes a distinction between a Video Coding Layer (VCL) and a Network Abstraction Layer (NAL). The output of the encoding process is VCL data, which is mapped to NAL units prior to transmission or storage. A coded video sequence is represented by a sequence of NAL units. The data format of NAL is shown in Figure 3.2. One NAL unit contains one or more slices, and each slice contains an integral number of macroblocks (MBs). Each MB contains a series of header elements and coded residual data.

Parameter sets: H.264 introduces the concept of parameter sets, which provide robust and efficient conveyance of header information. Parameter sets include key information such as the sequence header and picture header. This key information is separated so that it can be handled in a more flexible and specialized manner in H.264/AVC. This new feature is fully used in our USE scheme.



Flexible macroblock ordering (FMO): FMO is a new technique introduced by H.264/AVC which can partition the picture into regions called slice groups. FMO can be used to enhance robustness to data losses in transmission. In the USE scheme, we provide two usages of FMO for video encryption.

Data partitioning: As some coded information is more important than other information for representing the video content, H.264/AVC allows the syntax of each slice to be separated into three partitions. The USE scheme makes use of this data partitioning.

3.2 USE Scheme for H.264/AVC

The purpose of designing Unequal Secure Encryption scheme is to provide substantial

security with low computational cost for video encryption. As discussed in Chapter 2, a

lot of existing video encryption schemes target low computational cost while ignoring

Integrated video compression

and encryption system

proposed schemes can achieve high security level. However, the computational cost is

bad.

USE scheme is a full encryption scheme which encrypts the entire video data using

selective encryption methods. The target application of the USE scheme is H.264/AVC

based video security system. Especially, the USE scheme can be very efficiently used for

high definition video encryption, because the computational cost of USE scheme is much

The contents of USE scheme can also

be found in [61] and [62].



Figure 3.3 Unequal Secure Encryption Scheme.

3.2.1 Data Partition Methods

The USE scheme is shown in Figure 3.3. It includes two major steps. The first step is video data classification. The purpose of classification is to divide video data into two partitions: an important video data partition and an unimportant video data partition. The importance is evaluated by how difficult it is to reconstruct a picture. If the data in the important partition is lost, the video content cannot be reconstructed; if the data in the unimportant partition is lost, the video content can still be reconstructed, only with reduced quality. Therefore, the important video data group needs to be protected more securely than the unimportant one. As shown in this figure, after data classification, the H.264/AVC video data is divided into DPA (Data Partition A, important) and DPB (Data Partition B, unimportant).



The second step in the USE scheme is unequal secure encryption. Unlike existing selective encryption schemes, the USE scheme encrypts the entire video data, and different cryptosystems are selected to encrypt different parts of the video data. As discussed in Chapter 2, from the viewpoint of cryptanalysis, the best way to maintain security is to encrypt the entire video data with standard cryptographic algorithms rather than with methods whose security cannot be proven. As shown in Figure 3.3, two ciphers are used in the USE scheme: DPA is encrypted by cipher A, and DPB is encrypted by cipher B. Different algorithms have different security levels and computational costs. In the USE scheme, we use AES as cipher A and FLEX as cipher B. FLEX is our proposed algorithm, which is based on AES; hardware implementations of AES can also support FLEX, and FLEX is faster than AES. Besides AES and FLEX, other cryptographic algorithms can also be used in the USE scheme.

The computational cost for USE scheme depends on data classification and

cryptographic algorithms. As the algorithms have been decided, the data classification

plays a more important role. There are three data classification methods in the USE

scheme. As the USE scheme is designed for H.264/AVC, two of these classification

methods use the new features of H.264/AVC.

3.2.2 Data Partition Methods

The purpose of data classification is to partition video data based on importance. The difficulty of reconstructing the picture after data loss is used to evaluate the importance of data. In H.264, the loss of header data (including parameter sets and MVD) makes picture reconstruction the most difficult. The loss of VLC data (including Intra and Inter residual data) causes video quality reduction. Intra data is independent between frames while Inter data depends on neighboring frames, so reconstruction after Intra loss is much more difficult than after Inter loss.

There are three data classification methods in the USE scheme. All of them are

performed after video encoding. The video coding scheme and video encryption scheme

are totally separated in our USE scheme.

Figure 3.4 Data Partition in H.264/AVC Extended Profile.

Data Partitioning (Extended Profile)

This is a new feature in the H.264/AVC Extended Profile, which performs data partitioning automatically. As shown in Figure 3.4, the coded data that makes up a slice is placed in three separate Data Partitions (A, B and C). Partition A contains the slice header and the header data of each MB, Partition B contains intra residual data, and Partition C contains inter residual data. Partition A is more important than B and C. Normally, intra data (Partition B) is considered more important than inter data (Partition C).
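When data partitioning is enabled, the partition that a NAL unit carries can be read from the nal_unit_type field, which is the low five bits of the first byte of the NAL unit. The sketch below shows how a cipher could be chosen per NAL unit under that assumption. The nal_unit_type values are those of the H.264/AVC specification; the function names and the rule for mapping partitions to ciphers (taken from the security levels that allow Data Partitioning) are illustrative and not part of the original design.

```python
AES, FLEX = "AES", "FLEX"

# nal_unit_type values defined by the H.264/AVC specification
NAL_PARTITION_A = 2   # slice header and MB header data
NAL_PARTITION_B = 3   # intra residual data
NAL_PARTITION_C = 4   # inter residual data
NAL_SPS = 7           # sequence parameter set
NAL_PPS = 8           # picture parameter set


def nal_unit_type(first_byte: int) -> int:
    """Extract nal_unit_type from the first byte of a NAL unit."""
    return first_byte & 0x1F


def cipher_for_nal(first_byte: int, level: int) -> str:
    """Pick a cipher for one NAL unit (hypothetical helper).

    level 1: headers and MVD use AES, residual data uses FLEX
    level 2: intra residual data also uses AES
    level 3: everything uses AES
    """
    if level not in (1, 2, 3):
        raise ValueError("this sketch covers levels 1 to 3 only")
    kind = nal_unit_type(first_byte)
    if kind in (NAL_SPS, NAL_PPS, NAL_PARTITION_A):
        return AES
    if kind == NAL_PARTITION_B:
        return AES if level >= 2 else FLEX
    if kind == NAL_PARTITION_C:
        return AES if level >= 3 else FLEX
    raise ValueError(f"nal_unit_type {kind} is not handled in this sketch")
```

Because the decision uses only the NAL header byte, classification of this kind does not need to parse the slice payload.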



FMO (Baseline Profile, Extended Profile)

FMO is a new feature in H.264/AVC. It can partition the picture into regions called slice groups. In the H.264/AVC standard, FMO consists of seven different partition types, all of which make it easy to partition pictures. In the USE scheme, there are two partition modes (shown in Figure 3.5). The first partition mode is Region Based FMO. In this mode, the picture is partitioned into two slice groups: Secret regions and Normal regions. The shape of the secret regions can be decided by other pre-processing tools such as object recognition and extraction. This mode supports extraction of any shape of interest in the picture, so object based encryption can be realized. The second partition mode is Mode Based FMO. In this mode, the picture is partitioned into two slice groups: Intra MBs and Inter MBs. As Intra MBs are more important than Inter MBs for reconstructing the picture, the Intra MBs should use highly secure encryption algorithms.

Figure 3.5 Data Partition by FMO.



Figure 3.6 Data Partition by Parameters Extraction.

Parameters Extraction (All Profiles)

Since the Data Partitioning method and the FMO method are limited to certain profiles, a common method which can be used in any profile is needed. The Parameters Extraction method, shown in Figure 3.6, is such a method. Its effect is similar to that of the Data Partitioning method. The difference is that Data Partitioning is done automatically by the codec, whereas this method needs a parser to do the data classification.



3.2.3 Security levels

According to [71], and other

video encryption survey papers [9-13], the security strength of video data depends on

how much of the important data is encrypted. The importance of video data is defined as how difficult it is to reconstruct the video when the data is lost.

Similar to other video encryption schemes, there are four security levels defined in the USE scheme, plus an extra Level x (shown in Table 3.1). The definitions are listed as follows:

Table 3.1 Security levels in the USE scheme.

Secure Level   Algorithm   Video content          Data Classification Methods
Level 0        AES         Headers                Parameters Extraction
               FLEX        Inter, Intra, MVD
Level 1        AES         Headers, MVD           Data Partitioning, Parameters Extraction
               FLEX        Inter, Intra
Level 2        AES         Headers, MVD, Intra    Data Partitioning, Parameters Extraction, FMO
               FLEX        Inter
Level 3        AES         All                    -
Level x        AES         Secret Region          FMO
               FLEX        Normal Region



Level 0: Headers are encrypted by AES, and the remaining data is encrypted by FLEX. Level 0 has the lowest computational cost. The Parameters Extraction method can be used in this level.

Level 1: Headers and MVDs (in H.264/AVC, MVD denotes the motion vector difference) are encrypted by AES, and the remaining data is encrypted by FLEX. The Data Partitioning method and the Parameters Extraction method can be used in this level.

Level 2: Headers, MVDs and Intra MBs are encrypted by AES, and Inter MBs are encrypted by FLEX. All three data classification methods can be used in this level.

Level 3: The entire video is encrypted by AES. Level 3 has the highest computational cost and security.

Level x: This is an extra security level of the USE scheme. Only the FMO method can be used in this level. It can be used in object-based encryption applications.

3.2.4 Encryption Methods

A. AES Algorithm

Advanced Encryption Standard (AES), also known as Rijndael, is a block cipher adopted as an encryption standard by the U.S. government in 2001. AES is the most popular algorithm used in symmetric key cryptography. AES has a fixed block size of 128 bits and a key size of 128, 192 or 256 bits. AES operates on a 4×4 array of bytes termed the State. For encryption, it applies a round function 10, 12 or 14 times, depending on the key length. A detailed introduction of AES is given in the next chapter.



Figure 3.7 FLEX encryption algorithm.

Figure 3.8 Leak position in the even and odd rounds.



B. FLEX Algorithm

FLEX (which stands for Fast Leak EXtraction) is a cryptographic method based on the AES round transformation. FLEX can handle longer contexts more quickly than AES while maintaining the same key agility and short-context block performance. Moreover, its flexibility for hardware and software implementation is the same as that of AES, and hardware resources can be shared between FLEX and AES. The detailed algorithm is shown in Figure 3.7.

Firstly, the given IV is encrypted by one AES invocation: $S = \mathrm{AES}_{Key}(IV)$. The 128-bit result S together with the encryption Key constitutes a 256-bit secret state of the stream cipher.

Secondly, we use result S as a new input d Key(S). The cipher stream

it comes

from internal states of AES. As shown in Figure 3.8, a 4×4 array of bytes constitutes the internal state of AES. In every round of AES, a part of the AES state is output. In the FLEX algorithm, $b_{0,0}, b_{0,2}, b_{1,1}, b_{1,3}, b_{2,0}, b_{2,2}, b_{3,1}, b_{3,3}$ are output in odd rounds, and $b_{0,1}, b_{0,3}, b_{1,1}, b_{1,3}, b_{2,1}, b_{2,3}, b_{3,1}, b_{3,3}$ are output in even rounds. In total, 80 States (bytes) of AES, i.e. 640 bits, are output in every 10-round AES encryption, compared with the 128 bits of ciphertext that AES itself produces. The speed of FLEX is therefore 5 times that of AES (640/128 = 5).
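The leak extraction step can be sketched as follows. This is only an illustration of how state bytes are picked from a round state, not the FLEX implementation: the AES round function itself is omitted, the byte positions are copied from the text as printed, and it is assumed that $b_{i,j}$ denotes row i, column j of the 4×4 state. The names `leak_bytes` and `keystream_bits_per_encryption` are hypothetical.

```python
# (row, column) positions leaked in odd and even rounds, as listed in the text
ODD_ROUND_LEAK = [(0, 0), (0, 2), (1, 1), (1, 3), (2, 0), (2, 2), (3, 1), (3, 3)]
EVEN_ROUND_LEAK = [(0, 1), (0, 3), (1, 1), (1, 3), (2, 1), (2, 3), (3, 1), (3, 3)]


def leak_bytes(state, round_no):
    """Return the 8 state bytes output as keystream in the given round.

    state    : 4x4 list of byte values, indexed as state[row][column]
    round_no : round counter starting from 1
    """
    positions = ODD_ROUND_LEAK if round_no % 2 == 1 else EVEN_ROUND_LEAK
    return bytes(state[r][c] for r, c in positions)


def keystream_bits_per_encryption(rounds=10):
    """Keystream bits produced by one AES encryption with the given round count."""
    return rounds * len(ODD_ROUND_LEAK) * 8


if __name__ == "__main__":
    demo_state = [[4 * r + c for c in range(4)] for r in range(4)]
    print(leak_bytes(demo_state, 1).hex())
    print(keystream_bits_per_encryption() // 128)  # speed-up relative to a 128-bit AES block
```

With 10 rounds and 8 bytes per round, one AES invocation yields 640 keystream bits, which is where the factor of 5 over a 128-bit AES output block comes from.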

C. XOR Method

In order to further reduce computational cost, we use the XOR method, which reduces the computational cost by 50%. This method is shown in Figure 3.9. It has three steps:

Step 1: Divide the total plaintext into two partitions A and B of the same size.

Step 2: Encrypt partition A, and XOR partition A with partition B bit by bit.

Step 3: The resulting partitions C and D are the ciphertext.



Figure 3.9 XOR Method

By using the XOR method, only half of the video data needs to be encrypted, which lowers the computational cost. The security of the total plaintext is equal to that of partition A.
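A minimal sketch of the XOR method is given below, assuming that partition C is the encryption of A and partition D is A XOR B (the assignment of C and D is inferred from the three steps, since Figure 3.9 is not reproduced here). A SHA-256 based keystream is used purely as a stand-in so that the example is self-contained; it is not AES or FLEX and is not meant to be secure. All function names are hypothetical.

```python
import hashlib


def toy_keystream(key: bytes, length: int) -> bytes:
    """Stand-in keystream generator (NOT AES or FLEX), for illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_bytes(x: bytes, y: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(x, y))


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Stand-in cipher; XOR with a keystream, so it is its own inverse."""
    return xor_bytes(data, toy_keystream(key, len(data)))


def xor_method_encrypt(key: bytes, plaintext: bytes) -> bytes:
    if len(plaintext) % 2:
        raise ValueError("plaintext must split into two equal halves")
    half = len(plaintext) // 2
    a, b = plaintext[:half], plaintext[half:]   # Step 1: partitions A and B
    c = toy_cipher(key, a)                      # Step 2: encrypt A
    d = xor_bytes(a, b)                         # Step 2: XOR A with B
    return c + d                                # Step 3: C and D are the ciphertext


def xor_method_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    half = len(ciphertext) // 2
    c, d = ciphertext[:half], ciphertext[half:]
    a = toy_cipher(key, c)                      # recover A with the cipher
    b = xor_bytes(a, d)                         # recover B = A XOR D
    return a + b
```

Only partition A passes through the cipher, so the cipher workload is half of the plaintext size; recovering B requires A, which is why the protection of the whole plaintext rests on partition A.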

3.3 Comparison

In order to compare the computational cost and encrypted data part of our proposed

stream should be firstly calculated.

Table 3.2 shows the experimental results for several H.264/AVC QCIF sequences. It lists the header information size, MVD size, Intra MB residue size and Inter MB residue size of 10 QCIF test sequences. Every test sequence begins with an I frame, followed by P or B frames, and contains 100 frames in total. Over these 10 sequences, the average share of the data size is about 20% for Header, about 20% for MVD, about 15% for Intra residue, and about 45% for Inter residue.

Table 3.3 shows the ratio of each data partition for different video sequences under different security levels. In level 0, about 20% of the video data is encrypted by AES and 80% by FLEX. In level 1, the split is 40% and 60%, and in level 2 it

is 55% and 45%. Level 3 is 100% encrypted by AES. Level x uses FMO data partition

ment, Level x is not

included.

Table 3.4 shows the computational cost and encrypted data percentage comparison of

results listed in Table 3.2. We use the average percentages of the 10 sequences. The computational cost is measured in n@AES, where full encryption by AES is defined as 100%@AES. For example, the computational cost of SECMPEG level 1 is 20%@AES, which means that it is 20% of the cost of full encryption. The encrypted data percentage reflects the security strength of each video encryption scheme. As all of the schemes use AES to encrypt the selected important data, the security can be evaluated by the amount of encrypted data.

From Table 3.4, it can be seen that our scheme can achieve both high security and low

the computational cost of

Level 0 in our USE scheme is just about 18% of naive encryption, and the encrypted data

percentage is 100%.
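The 18%@AES figure for Level 0 can be reproduced with a short calculation. The sketch below assumes the cost model implied by the text: FLEX costs one fifth of AES per bit, the XOR method halves the amount of data that is actually encrypted, and the average data shares are 20% header, 20% MVD, 15% Intra residue and 45% Inter residue. The formula and the names are an interpretation, not a statement of how the original figures were computed; the values for Levels 1 and 2 are outputs of this sketch and do not appear in the source.

```python
# Average share of each data class over the 10 test sequences (from the text)
AVERAGE_SHARE = {"header": 0.20, "mvd": 0.20, "intra": 0.15, "inter": 0.45}

# Data classes encrypted by AES at each security level (Table 3.1)
AES_CLASSES = {
    0: {"header"},
    1: {"header", "mvd"},
    2: {"header", "mvd", "intra"},
    3: {"header", "mvd", "intra", "inter"},
}

FLEX_SPEEDUP = 5      # FLEX is taken to be 5 times as fast as AES
XOR_FACTOR = 0.5      # the XOR method encrypts only half of the data


def cost_at_aes(level: int) -> float:
    """Computational cost as a fraction of full AES encryption (1.0 = 100%@AES)."""
    aes_share = sum(AVERAGE_SHARE[name] for name in AES_CLASSES[level])
    flex_share = 1.0 - aes_share
    return XOR_FACTOR * (aes_share + flex_share / FLEX_SPEEDUP)


if __name__ == "__main__":
    for level in sorted(AES_CLASSES):
        print(f"Level {level}: {cost_at_aes(level) * 100:.0f}% @ AES")
```

Under these assumptions Level 0 evaluates to 0.5 × (0.20 + 0.80/5) = 0.18 and Level 3 to 0.5 × 1.0 = 0.50.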

Figure 3.10 shows the comparison of security and computational complexity of our proposed USE scheme with other schemes. The computational complexity is expressed as a percentage of full encryption by the AES algorithm. The security is also evaluated relative to AES full encryption. We considered the security of FLEX

algorithm is 1/5 of AES algorithm. From this figure, our proposed USE scheme is much

of our scheme is very

low.



3.4 Conclusion

In this chapter, an unequal secure encryption scheme for H.264/AVC is proposed. In order to maintain high security, our scheme uses a full encryption approach that encrypts the whole video data by selective encryption methods. The scheme mainly includes two parts: data classification and unequal secure encryption.

(1) Data classification: There are three classification methods in the USE scheme: Data Partitioning for the Extended profile, FMO for the Baseline and Extended profiles, and Parameters Extraction for all profiles. After data classification, the entire video data is divided into two partitions: the important data partition and the unimportant data partition.

(2) Unequal secure encryption: This method can also be called the selective cryptosystem method, as different cryptosystems have different security levels and different computational costs. In the USE scheme, we choose AES to encrypt the important data partition and propose a light encryption algorithm called FLEX to encrypt the unimportant data partition. FLEX is 5 times as fast as AES. However, its security is lower than that of AES because it leaks many internal states during encryption.

The experimental results and comparison show that our scheme can achieve both high security and low computational cost. For level 0 of the USE scheme, the computational cost is only 18% of full encryption, and for the highest security level 3, the computational cost is only 50% of full encryption. The scheme is very suitable for high security and high definition video encryption systems.



Table 3.2 Video data partition size (QCIF@100 Frames, I Frame followed by P or B Frames).

Column groups: "Header" and "MVD" belong to Header (includes MVD); "Intra VLC" is the Intra MBs residue; "Inter VLC" is the Inter MBs residue; "Total" is the total size of the compressed H.264 file. Sizes are in bits; percentages are relative to Total.

Video Sequence   Header    Header/Total   MVD       MVD/Total   Intra VLC   Intra VLC/Total   Inter VLC   Inter VLC/Total   Total
Canoa            676577    26.04%         300816    11.58%      769777      29.62%            1152357     44.34%            2608088
CarPhone         314675    51.84%         150868    24.85%      55551       9.15%             236802      39.01%            616672
Claire           95326     57.69%         38300     23.18%      10801       6.54%             59111       35.77%            175640
Container        96239     46.49%         32468     15.68%      23877       11.53%            86899       41.98%            217832
Football         825441    30.14%         390128    14.25%      866291      31.64%            1046531     38.22%            2747592
Foreman          375985    55.99%         195606    29.13%      43971       6.55%             251588      37.46%            680648
Grandma          99382     52.85%         39218     20.86%      17903       9.52%             70763       37.63%            198600
Mobile           454322    36.29%         207090    16.54%      54242       4.33%             743504      59.38%            1261768
News             183186    41.21%         86012     19.35%      55332       12.45%            206017      46.34%            454736
Table            312751    39.18%         165196    21.03%      78360       9.98%             394422      50.21%            795512



Table 3.3 Video data partition for different security levels.

                 Level 0            Level 1            Level 2
Video Sequence   AES      FLEX      AES      FLEX      AES      FLEX
Canoa            14.41%   85.59%    26.04%   73.96%    55.66%   44.34%
CarPhone         26.56%   73.44%    51.84%   48.16%    60.99%   39.01%
Claire           32.47%   67.53%    57.69%   42.31%    64.23%   35.77%
Container        29.28%   70.72%    46.49%   53.51%    58.02%   41.98%
Football         15.84%   84.16%    30.14%   69.86%    61.78%   38.22%
Foreman          26.50%   73.50%    55.99%   44.01%    62.54%   37.46%
Grandma          30.29%   69.71%    52.85%   47.15%    62.37%   37.63%
Mobile           19.59%   80.41%    36.29%   63.71%    40.62%   59.38%
News             21.37%   78.63%    41.21%   58.79%    53.66%   46.34%
Table            18.55%   81.45%    39.18%   60.82%    49.16%   50.84%



Table 3.4 Comparison with other video encryption schemes.

Encryption Scheme          Content to be encrypted               Computational cost (@ AES)    Encrypted Data
SECMPEG [14]   Level 1     Header                                20% @ AES                     20%
               Level 3     Header and Intra                      35% @ AES                     35%
               Level 4     All                                   100% @ AES                    100%
Aegis [15][16]             Header, I frame                       35% @ AES                     35%
VEA [18]                   All                                   50% @ AES                     100%
RVEA [19][20]              Sign bit of DCT and motion vectors    10% @ AES                     10%
Alattar [21]   Method 0    Header, Intra and MVD                 55% @ AES                     55%
               Method 1    Every nth I MB                        1/n*15% @ AES                 1/n*15%
               Method 2    + Header                              (1/n*15 + 40)% @ AES          (1/n*15 + 40)%
               Method 3    + nth Header                          (1/n*15 + 1/n*40)% @ AES      (1/n*15 + 1/n*40)%
Ours           Level 0     All                                   18% @ AES                     100%






Figure 3.10 Comparison of security and computational complexity.


4 Hardware Design of Encryption Accelerator

As video streaming needs real-time encryption, the hardware implementation of a secure video system becomes very important, especially for achieving high performance, low latency and low power. A complete secure video streaming system includes two main parts: a secret key exchange module and a video content encryption module. For secret key exchange, RSA is the most widely used algorithm, and the Montgomery multiplier is the core accelerator in RSA. For video content encryption, the USE scheme has been proposed in Chapter 3, and AES is the core accelerator. In this chapter, hardware designs for AES and RSA are proposed to achieve high performance and low hardware cost.

4.1 Hardware Design of AES

4.1.1 Introduction of AES Algorithm

AES [37], also known as Rijndael, is the most popular algorithm used in symmetric key cryptography. AES operates on a 4×4 array of bytes termed the State. For encryption, it applies a round function 10, 12 or 14 times, depending on the key length. The encryption and decryption flows of the AES algorithm are shown in Figure 4.1 (a) and (b). Four transformations, SubBytes, ShiftRows, MixColumns and AddRoundKey, are performed in the encryption process, and the four corresponding inverse transformations are performed in the decryption process. A separate KeyExpansion unit is used to generate the keys for each round of the AES algorithm. In order to reduce the hardware cost, we propose a hybrid dataflow for both encryption and decryption, which is shown in Figure 4.1 (c). All of its function modules support forward and inverse operation, and the KeyExpansion module also supports generating forward and inverse key sequences. Compared to a solution that uses two AES cores (one for encryption and another for decryption), the hybrid solution saves about 40% of the hardware cost (for single modules, (Inv)SubBytes saves 50%, (Inv)MixColumns saves 30% and (Inv)ShiftRows saves 20% of the hardware cost).

Figure 4.2 shows the operations in the AES algorithm. A brief introduction is given below:

1) SubBytes: The SubBytes operation is a non-linear byte substitution that operates on each byte of the State using a substitution table.

2) ShiftRows: In the ShiftRows operation, the bytes in the last three rows of the State are cyclically shifted over different numbers of bytes.

3) MixColumns: A mixing operation which operates on the columns of the State using a linear transformation.

4) AddRoundKey: A Round Key is added to the State by a simple bitwise XOR operation.
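Three of these transformations are simple enough to sketch in a few lines. The code below is a software illustration following the standard AES definitions, not a model of the hardware modules in this chapter. The State is assumed to be stored as a list of four rows; SubBytes is omitted because it needs the full substitution table. Function names are illustrative.

```python
def shift_rows(state):
    """Cyclically shift row r of the 4x4 State left by r bytes."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def xtime(b):
    """Multiply a byte by 2 in GF(2^8) with the AES polynomial x^8+x^4+x^3+x+1."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b & 0xFF


def mix_column(col):
    """Apply the MixColumns linear transformation to one 4-byte column."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,   # 2*a0 + 3*a1 + a2 + a3
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,   # a0 + 2*a1 + 3*a2 + a3
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),   # a0 + a1 + 2*a2 + 3*a3
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),   # 3*a0 + a1 + a2 + 2*a3
    ]


def add_round_key(state, round_key):
    """XOR the State with a round key of the same shape."""
    return [[s ^ k for s, k in zip(s_row, k_row)]
            for s_row, k_row in zip(state, round_key)]
```

ShiftRows only rearranges bytes and AddRoundKey is a plain XOR, so in software, as in hardware, most of the arithmetic sits in SubBytes and MixColumns.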

Figure 4.1 Dataflow. (a) Encryption. (b) Decryption.

(c) Proposed hybrid dataflow for encryption & decryption.



Figure 4.2 Transformations in AES algorithm.

4.1.2 Existing low-cost implementations of AES

Many hardware implementations of the AES algorithm have already been proposed. They can be classified into two types: high-speed designs and low-cost designs. Most of the existing designs are high-speed designs. However, with the increasing demand for personal security and the growing use of portable consumer electronic devices, low-power and low-cost design becomes very important.

The existing low-cost implementations of AES can be classified into two types: 32-bit designs, which use a 32-bit wide data path, and 8-bit designs, which use an 8-bit wide data path. A brief introduction of these two kinds of designs follows.

32-bit implementation of AES Algorithm [48]

The 32-bit implementation of the AES algorithm is shown in Figure 4.3. It was proposed by Satoh in [48]. This architecture uses a 32-bit wide data path. The key function modules of this architecture include:

S-Box module (32-bit): Every S-Box supports one SubBytes (8-bit) operation, so the 32-bit SubBytes operation is realized by using 4 S-Boxes in the data path. These S-Box modules are used for both the AES round function and the key expansion operation.

MixColumns module (32-bit): The MixColumns module supports both the MixColumns and the Inverse MixColumns operation by reusing part of the hardware.

In this architecture, the S-Box module, the MixColumns module and the AddRoundKey module (XOR module) are serially connected, so it can also be called a serial data path design. The hardware cost of this implementation is shown in Table 4.1. The ASIC implementation runs at 131 MHz with a throughput of 311 Mbps.

Figure 4.3 32-bit architecture for AES.



Table 4.1 Hardware cost of 32-bit AES @ 131 MHz, 0.11um, [48].

Components      Gates    %
Data Register   864      16.01%
ShiftRows       160      2.96%
S-Boxes         1,176    21.79%
MixColumns      350      6.48%
AddRoundKey     56       1.04%
Key Expander    1,896    35.12%
Others          699      12.95%
Total           5,398    100%

8-bit implementation of AES Algorithm [50]

The 8-bit implementation of the AES algorithm is shown in Figure 4.4. It was proposed by Feldhofer in [50]. This architecture uses an 8-bit wide data path, and it is designed to be used in RFID tags. The key function modules are as follows:

S-Box module (8-bit): There is only one S-Box in this architecture. It is used for both the round function and the key expansion.

1/4 MixColumns module (8-bit): Since the MixColumns operation is a 32-bit operation, the authors proposed a new method to implement it with an 8-bit data path, at the cost of additional registers and clock cycles.

In this architecture, the S-Box module, the MixColumns module and the AddRoundKey module are arranged in parallel, so it is also referred to as a parallel data path design. The hardware cost of this implementation is shown in Table 4.2. The ASIC implementation of this design works at about 100 kHz with a throughput of about 12.6 kbps, which is enough for RFID communication.
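The reported frequency and throughput figures imply how many clock cycles each design spends on one 128-bit block. The following arithmetic sketch derives that number, assuming that blocks are processed one at a time without pipelining, so that throughput = 128 × frequency / cycles per block. The cycle counts are derived here from the quoted figures and are not values stated in the text; the function name is hypothetical.

```python
def cycles_per_block(freq_hz: float, throughput_bps: float, block_bits: int = 128) -> float:
    """Clock cycles per block implied by a clock frequency and a throughput."""
    return block_bits * freq_hz / throughput_bps


if __name__ == "__main__":
    # 32-bit design [48]: 131 MHz (Table 4.1) and 311 Mbps
    print(round(cycles_per_block(131e6, 311e6)))
    # 8-bit design [50]: about 100 kHz and about 12.6 kbps
    print(round(cycles_per_block(100e3, 12.6e3)))
```

The large gap between the two results reflects the narrower data path of the 8-bit design, which needs many more cycles to process the same 128-bit block.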

The 32-bit and 8-bit implementations of AES achieve very low hardware cost, and they are suitable for low-cost, low-throughput systems. However, the architectures in these two

implementations are dedicated to specific applications, and can

performance. For video encryption, the performance requirement varies widely with video resolution and frame rate. For example, for oneseg [51] (mobile TV used in Japan), the throughput of video data is 160 kbps, whereas for HDTV DVD, the throughput of video data can reach 50 Mbps. Therefore, a scalable architecture which can provide different levels of performance is the best choice for AES hardware design.

Figure 4.4 8-bit architecture for AES.

Table 4.2 Hardware cost of 8-bit AES @ 100 kHz, 0.35um, [50].

Components      Gates    %
S-Boxes         395      11.0%
MixColumns      252      7.0%
AddRoundKey     90       2.5%
Key Expander    161      4.5%
RAM             2,337    65.0%
Controller      360      10.0%
Total           3,595    100%



4.1.3 Proposed Scalable Hardware Architecture for AES

Scalable architecture is very important for IP design. A common architecture from which different implementations, with different performance and hardware cost, can be derived according to specific requirements provides great flexibility and reliability for IP design and system integration. Multimedia systems require various specifications for different applications. For example, mobile TV usually uses CIF size video with a bit rate of less than 1 Mbps. In contrast, for HDTV broadcasting the bit rate is more than 20 Mbps, and for future super HDTV the bit rate will increase to hundreds of Mbps. A scalable architecture which can cover a wide range of specifications is urgently needed for AES IP design.

4.1.3.1 Top Level Architecture

The top level of the proposed scalable hardware architecture for AES is shown in Figure 4.5. It includes the following blocks:

Data Registers

The data registers consist of 16 bytes of registers, the same as the block length (128 bits) of AES plaintext. Every subblock (gray color) represents a byte of data, which is termed a State. Before encryption, the data registers load the plaintext from external memory, and after encryption, the data registers output the ciphertext.

Key Expander Module

The key expander module is used to generate the key for each round of AES. It includes two parts: key registers and a key scheduler. The key registers can be 16, 24 or 32 bytes of registers for 128-bit, 192-bit and 256-bit key lengths. Currently, a 128-bit key length is enough for high security applications, and most cryptosystems use a 128-bit key length. The key scheduler is an XOR gate array with a control unit that generates the round keys for encryption.



Figure 4.5 Scalable Hardware Architecture for AES



ShiftRows Module

The ShiftRows module is very easy to implement. It is a wire mapping box that maps each input port to the correct output port. The input and output of the ShiftRows module are 128 bits wide.

MixColumns Array

The MixColumns Array is a set of MixColumns modules. Between one and four modules can be implemented: since MixColumns is a 32-bit operation, at most four modules can be used in an implementation. The detailed structure of MixColumns will be discussed in Chapter 4.3.3.

S-Box Array

The S-Box Array is a set of S-Box modules. The S-Box performs the SubBytes operation of the AES algorithm, and it is very important for the hardware implementation of AES. The SubBytes operation is the main computation in AES. The number of S-Boxes used in a hardware design greatly affects the performance and the hardware cost. The S-Box will be discussed in detail in Chapter 4.3.3.

The scalability of this design is achieved by the following new ideas:

Independent Data Path for Each Operation

There are three main data paths in our design: the MixColumns data path, the SubBytes data path and the AddRoundKey data path. The advantages of this design include:

1) Scalability

As shown in Table 4.3, the operations in AES have different bit widths: SubBytes is an 8-bit operation, MixColumns is a 32-bit operation, and AddRoundKey and ShiftRows are 128-bit operations.

A main problem of previous architectures for scalable design is that they integrate different operations in one data path. As a result, all of the operations in the same data path must use the same bit width. In other words, the numbers of processing elements for the individual operations are tightly coupled.

However, in our proposed design, each operation is placed in its own data path. This makes it easy to increase or decrease the parallelism of each single operation. For example, in our proposed architecture, the number of MixColumns modules and the number of S-Box modules can be configured freely, without regard to the other operations.

Table 4.3 Bit width of operations in AES algorithm.

Operations                   Bit width for operations
SubBytes                     8-bit
ShiftRows                    128-bit
MixColumns                   32-bit
AddRoundKey                  128-bit
SubBytes for KeyExpansion    8-bit

2) Power & Performance Improvement

Since the operations are distributed over several parallel data paths, the critical path of the hardware implementation becomes much shorter than in a serial architecture, so the maximum working frequency can be raised. For low-power design, the shorter critical path allows a lower supply voltage and a higher threshold voltage to be used to reduce power consumption.

Scalable S-Box Array, MixColumns Array

As shown in Figure 4.5, the S-Box array and the MixColumns array are scalable modules that support different numbers of processing elements. Each S-Box performs the SubBytes operation on an 8-bit input and produces an 8-bit output. The data registers are 128 bits wide, so 16 SubBytes operations are needed in each round of the AES algorithm. For the key registers, four SubBytes operations are needed in each round. In total, 20 SubBytes operations are executed in one round of AES. In our architecture, the S-Box array supports 1-20 S-Boxes, which covers a wide range of performance and hardware cost requirements.

For the MixColumns array, MixColumns is a 32-bit operation, and four MixColumns operations are executed in each round of the AES algorithm. Correspondingly, 1-4 MixColumns modules can be used in this architecture.

The advantage of the scalable S-Box array and MixColumns array is that performance and hardware cost can be balanced simply by adjusting the number of processing elements. These two scalable modules greatly improve the scalability of the AES hardware.

4.1.3.2 Two typical subclass architectures

Based on the proposed scalable architecture, there are two typical subclass

architectures: Shared S-Box Architecture and Unified S-Box Architecture. Table 4.4

compares these two architectures.



Figure 4.6 Shared S-Box Architecture.

Shared S-Box Architecture

As shown in Figure 4.6, the shared S-Box architecture uses a single S-Box array for both the data registers and the key registers. The advantage of this architecture is that all of the S-Boxes can be used for both the data SubBytes and the key SubBytes operations. However, these two SubBytes operations share the same S-Boxes and therefore cannot be executed at the same time. Thus, this architecture is suitable for low hardware cost implementations.



Figure 4.7 Unified S-Box Architecture

Unified S-Box Architecture

The unified S-Box architecture separates the S-Boxes into two parts: a data S-Box array and a key S-Box array, as shown in Figure 4.7. Since the key SubBytes operation is executed four times in each round of AES, up to four S-Boxes can be used for the key. For the data S-Box array, the number of S-Boxes can be 1-16. The advantage of this architecture is that the key operation is completely separated from the data operation, so it needs fewer clock cycles than the shared architecture and is well suited to high-performance implementations. However, the utilization of the key S-Boxes is low, because the key SubBytes operation is executed far less often than the data SubBytes operation.



Table 4.4 Comparison of two architectures.

                                  Shared S-Box Arch.           Unified S-Box Arch.
Configurable S-Box number         1 to 20                      Data part 1-16, Key part 1-4
Configurable MixColumns number    1 to 4                       1 to 4
Advantages                        S-Box utilization is high;   Separate Key and Data operation;
                                  Save hardware cost           Easy to control
Orientation                       Low-cost AES                 High-performance AES

4.1.3.3 Sub-modules

As shown in Figure 4.5, there are three main sub-modules in the architecture: the ShiftRows module, the S-Box module and the MixColumns module. ShiftRows is very simple: it consists only of byte shifts, which can be implemented by hard wiring. The detailed implementations of the S-Box and MixColumns modules are presented as follows:

S-Box Design

The S-Box design follows [49]. The architecture of the S-Box is shown in Figure 4.8. This S-Box design uses a normal basis to reduce the GF(2^8) inverter to the composite field GF(((2^2)^2)^2). The basis changes and the affine transformation are matrix operations, which are easy to implement in hardware.

According to the experimental results of Canr



Figure 4.8 S-Box structure.

a) Factors of Inverse MixColumns b) Dual-function module

Figure 4.9 MixColumns structure.



MixColumns Design

There are many hardware reuse methods for the MixColumns module. We follow the method of [48]. Figure 4.9 a) shows the reuse method for this module. In this equation, InvMixColumns is decomposed into one MixColumns operation and two additional matrices. Figure 4.9 b) shows the hardware architecture of the dual-function MixColumns module.
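The factorization in Figure 4.9 is not reproduced in the text. As one well-known example of this kind of reuse (an assumption here, not necessarily the exact decomposition of [48]), the InvMixColumns matrix equals the MixColumns matrix multiplied by a sparse circulant matrix with entries {05, 00, 04, 00}. The sketch below checks that identity over GF(2^8); the helper names `gf_mul`, `mat_mul`, `circulant` and `PRE` are ours.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return result


def mat_mul(a, b):
    """4x4 matrix product with entries in GF(2^8)."""
    out = [[0] * 4 for _ in range(4)]
    for r in range(4):
        for c in range(4):
            for t in range(4):
                out[r][c] ^= gf_mul(a[r][t], b[t][c])
    return out


def circulant(first_row):
    """Build the 4x4 circulant matrix whose first row is given."""
    return [[first_row[(c - r) % 4] for c in range(4)] for r in range(4)]


MIX = circulant([0x02, 0x03, 0x01, 0x01])      # MixColumns matrix
INV_MIX = circulant([0x0E, 0x0B, 0x0D, 0x09])  # InvMixColumns matrix
PRE = circulant([0x05, 0x00, 0x04, 0x00])      # sparse extra matrix (illustrative)

if __name__ == "__main__":
    print(mat_mul(MIX, PRE) == INV_MIX)
```

In a dual-function module of this style, the MixColumns logic is shared and only the cheap extra multiplication by 04 and 05 is switched in for decryption.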

4.1.4 Performance Analysis

4.1.4.1 Scalability

Our proposed scalable architecture provides a high degree of scalability: in total, 336 implementations are possible. As shown in Table 4.5, the shared S-Box architecture can use 1-20 S-Boxes and 1-4 MixColumns modules, which gives 20 x 4 = 80 possible implementations. The unified S-Box architecture can use 1-16 S-Boxes for data, 1-4 S-Boxes for the key and 1-4 MixColumns modules, which gives 16 x 4 x 4 = 256 possibilities. Counting both architectures, there are 336 possible implementations of AES based on the proposed scalable architecture.

Table 4.5 Possible implementations of AES based on scalable architecture.

                           Shared S-Box Arch.       Unified S-Box Arch.
S-Box for Data             1-20 (data and key)      1-16
S-Box for Key                                       1-4
MixColumns module          1-4                      1-4
Possible Configurations    80                       256
Total                      336
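The configuration counts can be reproduced by simple enumeration. The following sketch lists every (S-Box, MixColumns) combination for both subclass architectures; the variable names are ours.

```python
from itertools import product

# Shared S-Box architecture: (number of S-Boxes, number of MixColumns modules)
shared = list(product(range(1, 21), range(1, 5)))

# Unified S-Box architecture: (data S-Boxes, key S-Boxes, MixColumns modules)
unified = list(product(range(1, 17), range(1, 5), range(1, 5)))

total = len(shared) + len(unified)

if __name__ == "__main__":
    print(len(shared), len(unified), total)
```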



Figure 4.10 Dataflows for scalable architecture.

4.1.4.2 Dataflows

Figure 4.10 shows the dataflows for the proposed scalable architecture. The dataflow includes three parts: the first round, round i and the final round. The first round includes two sub-procedures: {A, S} and {ks}. The meaning of the notation is listed in the table in this figure. In particular, A and S are executed in the same clock cycle. Round i is the loop body of AES; for a 128-bit key, the loop is executed 9 times. In this step, the dataflow of the unified S-Box architecture includes two sub-procedures, {A, S, ks} and {A, R}, whereas the dataflow of the shared S-Box architecture includes three sub-procedures, {S}, {ks, M} and {A, R}. All of the operations within the same sub-procedure are executed in parallel. The final round is the last round of AES and includes two sub-procedures: {S} and {A}. For an AES-128 encryption, the total numbers of clock cycles needed by these two architectures are:



Clock cycles for shared S-Box Architecture

Clock cycles for unified S-Box Architecture

4.1.4.3 Hardware Implementation

To obtain the lowest hardware cost, we use only one S-Box and one MixColumns module and constrain the design with a loose frequency target in the synthesis tool. The synthesis results are shown in Table 4.6. We use the TSMC 0.18 um standard cell library and Synopsys Design Compiler for synthesis. To measure the highest performance of the proposed architecture, we use 20 S-Boxes and 4 MixColumns modules and constrain the design for maximum frequency. The synthesis results are shown in Table 4.7.

The results show that the lowest hardware cost of each AES component is obtained when the constrained frequency is at or below 123 MHz. In other words, the hardware cost increases considerably when the constrained frequency is set above 123 MHz. Because the parallel data paths shorten the critical path, the maximum frequency of the proposed architecture reaches 416 MHz. Along with this performance gain, the hardware cost also increases considerably. The detailed hardware cost is listed in Table 4.7.

Table 4.6 Hardware cost of lowest cost AES @ 123 MHz, 0.18 um.

Components Gates

Data Registers 1079

ShiftRows + AddRoundKey 307

S-Box 358

MixColumns/InvMixcolumns 376

Key Expander + Key Registers 1935

Controller 247

Others 318

Total 4620

Table 4.7 Hardware cost of highest performance AES @ 416 MHz, 0.18 um.

Components Gates

Data Registers 1079

ShiftRows + AddRoundKey 310

S-Box 1138 * 20

MixColumns/InvMixcolumns 349 * 4

Key Expander + Key Registers 2548

Controller 264

Others 1112

Total 29469



Table 4.8 Scalability of hardware implementations.

Components                                  Lowest hardware cost implementation    Highest performance implementation
Technology                                  TSMC 0.18 um                           TSMC 0.18 um
Frequency                                   123 MHz                                416 MHz
Hardware cost                               1 S-Box, 1 MixColumns                  20 S-Boxes, 4 MixColumns
Required clock cycles for AES encryption    211                                    22
Throughput                                  75 Mbps                                2.4 Gbps

Table 4.9

                                   *32-bit Architecture [48]    8-bit Architecture [50]     Proposed Scalable Architecture
                                                                                            Implementation example 1    Implementation example 2
Hardware cost                      4 S-Box, 1 MixColumns        1 S-Box, 1/4 MixColumns     5 S-Box, 1 MixColumns       4 S-Box, 1 MixColumns
Gates                              7226                         3595                        7344                        6986
Frequency                          138 MHz @ 0.18 um            100 kHz @ 0.35 um           180 MHz @ 0.18 um           180 MHz @ 0.18 um
Clock cycles for AES encryption    54                           >1000                       54                          64
Throughput                         327 Mbps                     12.6 kbps                   427 Mbps                    360 Mbps
Power consumption                  18.3 mW                      8.5 uW                      22.3 mW                     20 mW
Scalable                           NO                           NO                          YES

ture, and implemented by us.



Figure 4.11

Table 4.8 shows the scalability of the proposed hardware design of AES. The lowest hardware cost implementation uses only 1 S-Box and 1 MixColumns module, and its throughput is 75 Mbps. The highest performance implementation uses 20 S-Boxes and 4 MixColumns modules, and its throughput reaches 2.4 Gbps.

Table 4.9 compares the proposed architecture with the 32-bit architecture and the 8-bit architecture. For a fair comparison with Sato's design, two example implementations with similar hardware cost were designed [66]. Figure 4.11 shows this comparison more clearly. It can be seen that our scalable architecture provides great scalability in both performance and hardware cost. At the same level of hardware cost, our design achieves better performance.
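The throughput figures in Tables 4.8 and 4.9 follow from the clock frequency and the number of clock cycles per 128-bit block. A minimal sketch of that arithmetic is given below; the function name `aes_throughput_mbps` is ours, and the sketch assumes one block is processed at a time with no overlap between blocks.

```python
def aes_throughput_mbps(freq_mhz, cycles_per_block, block_bits=128):
    """Throughput in Mbps when one 128-bit block takes `cycles_per_block` cycles."""
    return block_bits * freq_mhz / cycles_per_block


if __name__ == "__main__":
    # (frequency in MHz, clock cycles per encryption) taken from Tables 4.8 and 4.9
    for freq, cycles in [(123, 211), (416, 22), (138, 54), (180, 54), (180, 64)]:
        print(freq, cycles, round(aes_throughput_mbps(freq, cycles), 1))
```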



4.2 Hardware Design of RSA

4.2.1 Introduction of RSA Algorithm

Public key cryptography plays a very important role in modern information security. It can be used not only to encrypt and decrypt data like symmetric cryptography, but also to provide services such as confidentiality, authentication, data integrity checking and non-repudiation. The RSA algorithm [38], which was proposed by Rivest, Shamir and Adleman in 1977, is the most widely used public key cryptographic algorithm.

The RSA algorithm uses modular multiplication as its primary operation. As the key length increases, the speed of modular multiplication becomes a bottleneck. Many papers have been published on accelerating modular multiplication. So far, the Montgomery modular multiplication algorithm [73] is considered the most efficient, and many hardware implementations are based on it. Some of them focus on scalable design [74-75], which enables the hardware to handle encryption and decryption with any key length. Some focus on high-radix design [76-78, 80], which reduces the total number of clock cycles per multiplication. Some focus on dataflow optimization [79], which reduces the delay cycles in the dataflow.

Algorithm 1 shows the original Montgomery algorithm. The advantage of this algorithm is that the division in the modular operation is replaced by a shift, which makes the algorithm very suitable for hardware implementation.

A direct hardware implementation of the Montgomery algorithm supports only a fixed key length. To make the Montgomery algorithm scalable to variable key lengths and to improve its speed, Tenca and Koç proposed a scalable high-radix Montgomery multiplication algorithm, shown in Algorithm 2.



The sign ext in steps 12 and 13 is the sign-extension operation. The Booth function in step 3 is used to support high-radix operation. The Booth function is defined as

$\mathrm{Booth}(X_{j+k-1 \ldots j-1}) = -2^{k-1} X_{j+k-1} + 2^{k-2} X_{j+k-2} + \cdots + 2^{0} X_{j} + X_{j-1}$   (4.3)
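The recoding can be sketched in a few lines of Python. The sketch assumes the standard radix-2^k Booth digit, in which the top bit of each k-bit window has weight -2^(k-1) and the bit just below the window, X_{j-1}, has weight 1. The function name `booth_digits` and the extra most significant digit that absorbs the final carry for an unsigned X are our assumptions for illustration.

```python
def booth_digits(x, n_bits, k=4):
    """Radix-2^k Booth digits (least significant first) of an integer 0 <= x < 2**n_bits."""
    digits = []
    prev_msb = 0  # X_{-1} = 0
    for j in range(0, n_bits + k, k):  # one extra window absorbs the last carry
        window = (x >> j) & ((1 << k) - 1)  # bits X_{j+k-1} ... X_j
        msb = window >> (k - 1)
        # window - msb * 2^k gives the top bit a weight of -2^(k-1)
        digits.append(window - (msb << k) + prev_msb)
        prev_msb = msb
    return digits


def from_digits(digits, k=4):
    """Rebuild the integer from its signed radix-2^k digits."""
    return sum(d << (k * idx) if d >= 0 else -((-d) << (k * idx))
               for idx, d in enumerate(digits))


if __name__ == "__main__":
    print(booth_digits(0xB7, 8))
```

For k = 4 every digit lies in [-8, 8], so each partial product can be formed from shifted copies of Y.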

Algorithm 1: Montgomery Multiplication Algorithm

Input: X, Y, M
Output: S = MM(X, Y) = X Y r^{-1} mod M

1. S = 0
2. For i = 0 to N-1
3.     S = S + X_i Y
4.     S = S + S_0 M
5.     S = S / 2
6. End For
7. If S >= M Then S = S - M
8. Return S
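A direct software rendering of Algorithm 1 may help when checking a hardware model against it. The sketch below assumes r = 2^N, an odd modulus M, and operands X, Y < M < 2^N; the function name `montgomery_multiply` is ours.

```python
def montgomery_multiply(x, y, m, n):
    """Bit-serial Montgomery product x * y * 2^(-n) mod m (m odd, x, y < m < 2**n)."""
    s = 0
    for i in range(n):
        s += ((x >> i) & 1) * y  # step 3: add X_i * Y
        s += (s & 1) * m         # step 4: add S_0 * M so that S becomes even
        s >>= 1                  # step 5: exact division by 2 (a right shift)
    if s >= m:                   # step 7: final correction
        s -= m
    return s


if __name__ == "__main__":
    m, n = 239, 8
    x, y = 100, 77
    s = montgomery_multiply(x, y, m, n)
    # Multiplying the result by r = 2^n must give x * y modulo m.
    print((s << n) % m == (x * y) % m)
```

The only operations in the loop are additions and a one-bit right shift, which is the property that makes the algorithm attractive for hardware.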

Algorithm 2: Scalable High-radix Montgomery Multiplication Algorithm

Input: X, Y, M
Output: S = MM(X, Y) = X Y r^{-1} mod M

1. S = 0, C = 0, X_{-1} = 0
2. For j = 0 to N-1 Step k
3.     qY_j = Booth(X_{j+k-1...j-1})
4.     (C^{(0)}, S^{(0)}) = C^{(0)} + S^{(0)} + qY_j Y^{(0)}
5.     qM_j = (S^{(0)}_{k-1...0} + C^{(0)}_{k-1...0}) (2^k - (M^{(0)})^{-1}_{k-1...0}) mod 2^k
6.     (C^{(0)}, S^{(0)}) = C^{(0)} + S^{(0)} + qM_j M^{(0)}
7.     For i = 1 to NW-1
8.         (C^{(i)}, S^{(i)}) = C^{(i)} + S^{(i)} + qY_j Y^{(i)} + qM_j M^{(i)}
9.         S^{(i-1)} = (S^{(i)}_{k-1...0}, S^{(i-1)}_{BPW-1...k})
10.        C^{(i-1)} = (C^{(i)}_{k-1...0}, C^{(i-1)}_{BPW-1...k})
11.    End For
12.    S^{(NW-1)} = sign ext (S^{(NW-1)}_{BPW-1...k})
13.    C^{(NW-1)} = sign ext (C^{(NW-1)}_{BPW-1...k})
14. End For
15. P = S + C
16. If P >= M Then P = P - M
17. Return P



Algorithm 2 has several advantages over Algorithm 1. Firstly, word-based operation makes the multiplier scalable to variable key lengths. Secondly, the high-radix design processes multiple bits of X in every loop iteration (Step 3, Algorithm 2), which reduces the number of clock cycles per multiplication. Thirdly, carry-save adders are introduced to shorten the critical path.

However, Algorithm 2 also has some disadvantages. Firstly, the high-radix design makes the calculation of qYj and qMj very complex, and the path delay increases rapidly with the radix. Secondly, the scalable design introduces a data dependency in the pipeline, which causes a delay of two clock cycles.

4.2.2 Proposed Optimized Algorithm

Algorithm 3 [68] differs from Algorithm 2 in two ways. Firstly, in step 1, Y is multiplied by 2^k, so that the k least significant bits of Y are all zero. The result of (S^{(0)}_{k-1...0}, C^{(0)}_{k-1...0}) is therefore not changed by adding qY_j Y^{(0)}, and the calculation of qMj becomes independent of qYj. Secondly, the calculations of qYj and qMj are performed in parallel (Steps 3 and 4), whereas they are performed serially in Algorithm 2 (Steps 3 and 5).

Algorithm 3: Modified scalable high-radix Montgomery multiplication algorithm

Input: X, Y, M
Output: S = MM(X, Y) = X Y r^{-1} mod M

1. S = 0, C = 0, X_{-1} = 0, Y = Y 2^k
2. For j = 0 to N+k-1 Step k
3.     qY_j = Booth(X_{j+k-1...j-1})
4.     qM_j = (S^{(0)}_{k-1...0} + C^{(0)}_{k-1...0}) (2^k - (M^{(0)})^{-1}_{k-1...0}) mod 2^k
5.     (C^{(0)}, S^{(0)}) = S^{(0)} + C^{(0)} + qY_j Y^{(0)} + qM_j M^{(0)}
6.     For i = 1 to NW-1
7.         (C^{(i)}, S^{(i)}) = S^{(i)} + C^{(i)} + qY_j Y^{(i)} + qM_j M^{(i)}
8.         S^{(i-1)} = (S^{(i)}_{k-1...0}, S^{(i-1)}_{BPW-1...k})
9.         C^{(i-1)} = (C^{(i)}_{k-1...0}, C^{(i-1)}_{BPW-1...k})
10.    End For
11.    S^{(NW-1)} = sign ext (S^{(NW-1)}_{BPW-1...k})
12.    C^{(NW-1)} = sign ext (C^{(NW-1)}_{BPW-1...k})
13. End For
14. P = S + C
15. If P >= M Then P = P - M
16. Return P

For a high-radix design, the path delays of qYj and qMj are large. Because our proposed algorithm calculates qYj and qMj in parallel, it achieves a much shorter critical path than Algorithm 2 and is much more suitable for high-speed hardware implementation of the Montgomery algorithm.

4.2.3 Proposed Optimized Data Flow

The most frequently used dataflow for the scalable high-radix Montgomery algorithm is shown in Figure 4.12(a); it was proposed by Tenca and Koç in [76]. It is a pipelined dataflow of Algorithm 2. Because of the data dependency in Algorithm 2 (in Steps 9 and 10, the output (S^{(i-1)}, C^{(i-1)}) needs both (S^{(i)}, C^{(i)}) and (S^{(i-1)}, C^{(i-1)})), there is a delay of two clock cycles in this dataflow. As a result, this dataflow needs more clock cycles to complete one multiplication.

To deal with this problem, Herris proposed a new radix-2 dataflow in [79], which is shown in Figure 4.12(b). This dataflow achieves a delay of one clock cycle by removing the data dependency in Algorithm 2. In Algorithm 2, the right shift (Steps 9 and 10) causes the data dependency. In this dataflow, the right shift of the product S is removed. As a result, the product S is effectively multiplied by 2 in every pipeline stage, so the input data (Y, M) of the next stage must be multiplied by 2 as well.

-save result and support high-

dataflow achieves one clock cycle delay while using radix-



Figure 4.12 Optimized Data Flow.

The proposed dataflow is shown in Figure 4.12(c) [68]. This dataflow is based on the proposed Algorithm 3. It achieves both a one-clock-cycle delay and a high-radix design. As shown in the figure, operand Y is initially multiplied by 2^k as specified in Algorithm 3, so the input data (Y, M) for each stage becomes (2^k Y, M). In order to achieve a one-clock-cycle delay in the dataflow and support a high-radix design, the input data (2^k Y, M) must be multiplied by 2^k cumulatively in each stage (except the first stage). Compare with

-radix and one

clock cycle delay make this dataflow need very few clock cycles to do multiplication.

Secondly, algorithm 3 used in this dataflow can achieve much shorter critical path than

-speed design of

Montgomery multiplier.

4.2.4 Proposed Hardware Architecture for RSA

The proposed Montgomery multiplier is shown in Figure 4.13(a) [68]. The data path contains NS MM Cells. The MM Cell is the basic processing element of the pipeline. There are two coefficient processing elements, the qYj PE and the qMj PE, which can be shared by all of the MM Cells in the pipeline. From Figure 4.12(c), it can be seen that qYj and qMj are calculated only in the first cycle of each stage (the grey cycles in Figure 4.12(c)). The remaining cycles (white cycles) do not need to calculate qYj and qMj. This property makes it possible to reuse the qYj PE and the qMj PE in the data path.

The FIFO in Figure 4.13(a) is used to avoid data overflow in the pipeline. When NW (the number of words per operand) is larger than NS (the number of stages in the dataflow), data overflow occurs in the pipeline, and the FIFO stores the overflowed data temporarily.

Figure 4.13 Proposed Hardware Architecture.

Parallel Radix-16 MM-Cell Design

The design of the MM Cell is shown in Figure 4.13(b) [68]. This is a high-radix design; in this work, we implement a radix-16 (k = 4) MM Cell. As shown in Algorithm 3, the function of the MM Cell is:

$(C^{(i)}, S^{(i)}) = S^{(i)} + C^{(i)} + qY_j Y^{(i)} + qM_j M^{(i)}$   (4.4)

When the proposed dataflow in Figure 4.12(c) is used, the input data $(S^{(i)}, C^{(i)}, Y^{(i)}, M^{(i)})$ becomes $(2^{jk} S^{(i)}, 2^{jk} C^{(i)}, 2^{(j+1)k} Y^{(i)}, 2^{jk} M^{(i)})$ in the j-th pipeline stage. The inputs qYj_sel and qMj_sel are the select signals for the multiplexers, which are generated from qYj and qMj by the qYj PE and the qMj PE. The product $qY_j Y^{(i)}$ is implemented as follows. Firstly, qYj is split into two numbers, each of which is a power of 2. Secondly, $Y^{(i)}$ is shifted according to these two numbers to obtain the two components of $qY_j Y^{(i)}$. Finally, these two components are added to $(S^{(i)}, C^{(i)})$ by a 4-to-2 carry-save adder. For example, when qYj = 6, qYj can be split into 2 and 4, so $6Y^{(i)}$ can be represented as $(2Y^{(i)} + 4Y^{(i)})$, and the inputs of the 4-to-2 carry-save adder are $(2Y^{(i)}, 4Y^{(i)}, S^{(i)}, C^{(i)})$. Because $2Y^{(i)}$ and $4Y^{(i)}$ are easily generated by left-shifting $Y^{(i)}$, the multiplication by 6 is avoided. $qM_j M^{(i)}$ is implemented in the same way as $qY_j Y^{(i)}$. The Shift & Inverse modules are used for this purpose.
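The splitting step can be illustrated in software. The sketch below searches for two terms, each either zero or a positive or negative power of 2, whose sum equals a coefficient in [-8, 8], and then forms the product using shifts and negation only. Allowing negative terms (for example 7 = 8 - 1) is our reading of the role of the inverse logic, and the names `split_into_two_shifts` and `multiply_by_coefficient` are hypothetical.

```python
# Candidate terms: 0 and +/- 1, 2, 4, 8 (each reachable by a shift and an optional negation).
TERMS = [0] + [sign * (1 << e) for e in range(4) for sign in (1, -1)]


def split_into_two_shifts(q):
    """Return (a, b) from TERMS with a + b == q, for -8 <= q <= 8."""
    for a in TERMS:
        for b in TERMS:
            if a + b == q:
                return a, b
    raise ValueError("coefficient out of range: %d" % q)


def shifted(term, y):
    """Compute term * y using only a left shift and an optional negation."""
    if term == 0:
        return 0
    value = y << (abs(term).bit_length() - 1)
    return value if term > 0 else -value


def multiply_by_coefficient(q, y):
    """Form q * y as the sum of two shifted copies of y."""
    a, b = split_into_two_shifts(q)
    return shifted(a, y) + shifted(b, y)


if __name__ == "__main__":
    print(split_into_two_shifts(6))
```

In hardware the two shifted terms would feed the 4-to-2 carry-save adder together with the sum and carry words instead of being added directly.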

Considering equation (4.3) with radix 16, it becomes:

$qY_j = \mathrm{Booth}(X_{j+3 \ldots j-1}) = -2^{3} X_{j+3} + 2^{2} X_{j+2} + 2 X_{j+1} + X_{j} + X_{j-1}$   (4.5)

The range of qYj is [-8, 8]. Every number in this range can be split into two components, each of which is a power of 2. For qMj under radix 16,

$qM_j = (S^{(0)}_{3 \ldots 0} + C^{(0)}_{3 \ldots 0}) \cdot (16 - (M^{(0)})^{-1}_{3 \ldots 0}) \bmod 16$   (4.6)

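The range of qMj is [0, 15], and not every number in this range can be split into two components that are powers of 2. In order to deal with this problem, we propose a mapping from [0, 15] to [-7, 8] in Table 4.10, which can be used equivalently for qMj.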

Table 4.10 Mapping of qMj from [0, 15] to [-7, 8].

qMj   Mapped    qMj   Mapped    qMj   Mapped    qMj   Mapped
0     0         4     4         8     8         12    -4
1     1         5     5         9     -7        13    -3
2     2         6     6         10    -6        14    -2
3     3         7     7         11    -5        15    -1
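The mapping in Table 4.10 subtracts 16 from every value above 8, so the mapped value is congruent to the original modulo 16 and still clears the four least significant bits of the partial sum. The short check below illustrates this; the function names are ours, and the low bits of S + C are represented by a single integer `s_low` for simplicity.

```python
def map_qm(qm):
    """Map qM in [0, 15] to the congruent value in [-7, 8] (Table 4.10)."""
    return qm if qm <= 8 else qm - 16


def qm_radix16(s_low, m_low):
    """qM = s_low * (16 - M^-1) mod 16 for an odd modulus word m_low."""
    inverse = next(v for v in range(16) if (m_low * v) % 16 == 1)
    return (s_low * (16 - inverse)) % 16


if __name__ == "__main__":
    print([map_qm(q) for q in range(16)])
```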



Very Low Complexity Implementation of qMj

qMj is much more complex than qYj. Normally, the qYj PE and the qMj PE are implemented directly as lookup tables, whose size grows exponentially with the radix. With radix 16, the table for qMj is 4 times the size of the table for qYj. To reduce the cost of calculating qMj, we present a very low complexity implementation of qMj [68]. Considering equation (4.6), we use InvM to denote $(16 - (M^{(0)})^{-1}_{3 \ldots 0})$. Table 4.11 shows the mapping from $M^{(0)}_{3:0}$ to InvM. As the modulus M is an odd number, there are 8 different values of $M^{(0)}_{3:0}$.

Table 4.11 Mapping from M0 to InvM.

M0      InvM      M0      InvM
0001    1111      1001    0111
0011    0101      1011    1101
0101    0011      1101    1011
0111    1001      1111    0001

All of these values can be divided into two groups: {(0001, 1111), (0111, 1001)} and {(0011, 0101), (1011, 1101)}. Here (A, B) denotes a pair $(M^{(0)}_{3:0}, InvM)$ or $(InvM, M^{(0)}_{3:0})$.

In the first group:

$InvM = \sim M^{(0)}_{3:0} + 1$   (4.7)

In the second group:

$InvM = \{M^{(0)}_{3}, M^{(0)}_{1}, M^{(0)}_{2}, M^{(0)}_{0}\}$   (4.8)

The two groups are distinguished as follows:

Group 1: $M^{(0)}_{2} \oplus M^{(0)}_{1} = 0$
Group 2: $M^{(0)}_{2} \oplus M^{(0)}_{1} = 1$

For example, (0001) is in group 1, since 0 xor 0 equals 0; (0011) is in group 2, since 0 xor 1 equals 1. Based on the above analysis, qMj can be implemented as shown in Figure 4.14. Because the modulus M is an odd number, equation (4.7) can be further written as:

$InvM = \sim M^{(0)}_{3:0} + 1 = \{\sim M^{(0)}_{3}, \sim M^{(0)}_{2}, \sim M^{(0)}_{1}, 1\}$   (4.9)

After qMj is calculated, qMj_sel can be obtained from a small lookup table in the same way as qYj_sel.
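The two-group rule can be checked exhaustively against the definition InvM = 16 - M0^{-1} mod 16. The following sketch does so for the eight odd values of the low four bits of M; the function names `inv_m_logic` and `inv_m_reference` are ours and the code is an illustration of the rule, not the circuit in Figure 4.14.

```python
def inv_m_logic(m0):
    """InvM from the low 4 bits of an odd modulus, using only inversions and a bit swap."""
    b3, b2, b1, b0 = (m0 >> 3) & 1, (m0 >> 2) & 1, (m0 >> 1) & 1, m0 & 1
    if b0 != 1:
        raise ValueError("modulus must be odd")
    if (b2 ^ b1) == 0:
        # Group 1: ~M0 + 1, which for odd M0 is {~b3, ~b2, ~b1, 1}
        return ((b3 ^ 1) << 3) | ((b2 ^ 1) << 2) | ((b1 ^ 1) << 1) | 1
    # Group 2: swap bits 1 and 2
    return (b3 << 3) | (b1 << 2) | (b2 << 1) | b0


def inv_m_reference(m0):
    """InvM computed from the definition 16 - (M0^-1 mod 16)."""
    inverse = next(v for v in range(16) if (m0 * v) % 16 == 1)
    return (16 - inverse) % 16


if __name__ == "__main__":
    for m0 in range(1, 16, 2):
        print(format(m0, "04b"), format(inv_m_logic(m0), "04b"))
```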

Figure 4.14 Implementation of qMj.

4.2.5 Performance Analysis

Table 4.12 compares the clock cycles of our dataflow with those of the Tenca-Koç dataflow and the Herris dataflow. It shows that our dataflow needs far fewer clock cycles. In this table, different key lengths and numbers of stages are used to calculate the clock cycles for each dataflow, and BPW is equal to 32. Because our dataflow and the Tenca-Koç dataflow are high-radix dataflows, we use radix 16 for these two dataflows in Table 4.12, while the Herris dataflow uses radix 2. In this table, the NS of a one-clock-cycle-delay dataflow is twice that of a two-clock-cycle-delay dataflow. The reason is illustrated in Figure 4.15. Each dataflow stage of the two-clock-cycle-delay dataflow in Figure 4.15(a) actually includes two data path stages: one is the MM Cell stage, which performs the operation of equation (4.4), and the other is the register bank stage, which stores (Yi, Mi, Si, Ci). The dashed cycles in Figure 4.15(a) represent the registering stages in the dataflow. For a fair comparison, the NS of the one-clock-cycle-delay dataflow should therefore be twice that of the two-clock-cycle-delay dataflow.



Table 4.12 Clock cycles comparison of different dataflows.

Columns: N; NS of two-clock-cycle-delay dataflow; NS of one-clock-cycle-delay dataflow; NW; clock cycles of this paper (radix-16); clock cycles and reduced cycles (%) versus Tenca-Koç dataflow (radix-16); clock cycles and reduced cycles (%) versus Herris dataflow (radix-2).

N       NS (2-cycle)   NS (1-cycle)   NW    This paper   Tenca-Koç   Reduced (%)   Herris   Reduced (%)
512     4              8              16    297          552         46.2%         1111     73.3%
512     8              16             16    171          289         40.8%         575      70.3%
512     16             32             16    167          281         40.6%         559      70.1%
1024    8              16             32    578          1072        46.1%         2159     73.2%
1024    16             32             32    333          561         40.6%         1119     70.2%
1024    32             64             32    329          553         40.5%         1103     70.2%
2048    16             32             64    1140         2112        46.0%         4255     73.2%
2048    32             64             64    657          1105        40.5%         2207     70.2%
2048    64             128            64    653          1097        40.5%         2191     70.2%
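The "reduced cycles" columns are the relative saving of our cycle count over the reference dataflow. A minimal sketch of that calculation, using the first 512-bit row of Table 4.12 as input, is shown below; the function name is ours.

```python
def reduced_cycles_percent(ours, reference):
    """Percentage of clock cycles saved relative to a reference dataflow."""
    return 100.0 * (reference - ours) / reference


if __name__ == "__main__":
    # N = 512, first row of Table 4.12: ours = 297, Tenca-Koç = 552, Herris = 1111
    print(round(reduced_cycles_percent(297, 552), 1))
    print(round(reduced_cycles_percent(297, 1111), 1))
```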

Table 4.13

Ref.          Tech       Area (Gates/LUTs)                      Frequency (MHz)   Scalable   Radix   Delay cycles in dataflow   1024-bit RSA throughput (Kbps)
[77]          0.5 um     28k Gates                              64 MHz            Yes        8       2                          22 Kbps
[78]          0.25 um    100k Gates                             125 MHz           Yes        16      2                          143 Kbps
[79]          FPGA       5598 LUTs                              144 MHz           Yes        2       1                          62.5 Kbps
[80]          FPGA       2847 LUTs + 32 Mults + 5n RAM          102 MHz           Yes        2^16    4                          152 Kbps
[72]          0.18 um    150k                                   300 MHz           Yes        2       2                          100 Kbps
This Paper    0.25 um    130k                                   180 MHz           Yes        16      1                          352 Kbps



Figure 4.15 Comparison of dataflow and corresponding data path.

The ASIC implementation of this work uses NS = 32, BPW = 32 and radix 16. We use the HHNEC 0.25 um CMOS standard cell library and Synopsys EDA tools for the ASIC design. Table 4.13 compares this design with previous work. [77] is a radix-8 design proposed by Tenca and Koç, and [78] is an improved radix-16 design. The frequency of our design is higher than that of [78] at the same radix. [79] is an FPGA implementation that uses a one-clock-cycle-delay dataflow, [80] is a very high radix design (radix 2^16) that uses the multipliers and RAM embedded in the FPGA, and [72] is a radix-2 scalable design that uses a two-clock-cycle-delay dataflow.

Normally, a radix-2^k design achieves about k times the performance of a radix-2 design, and a one-clock-cycle-delay dataflow achieves about 2 times the performance of a two-clock-cycle-delay dataflow. As shown in Table 4.13, our design uses radix-16 with one

clock cycle delay dataflow. It achieves much higher performance than



4.3 Conclusion

This chapter presents the hardware design of encryption accelerators for AES and RSA. For AES, our proposed architecture provides high scalability, allowing AES implementations with a wide range of hardware cost and performance. Two subclass architectures are introduced: the unified S-Box architecture and the shared S-Box architecture. These two architectures are compared and their advantages are discussed.

The sub- . For RSA, our proposed

architecture parallelizes the data path and shortens the critical path. By using proposed

clock-saving dataflow, it reduces the total clock cycles of multiplication to a very small

number. Finally, the performance analysis for both of AES and RSA are discussed.


DPA Attack on AES

5 DPA Attack on AES

Cryptographic devices such as smart cards and USB keys are widely used. These devices provide user authentication or store secret information, and their security is a very active research topic. Many researchers work in this field and have proposed many attack methods to crack cryptographic devices. Correspondingly, many countermeasures have been proposed to prevent these attacks.

In recent years, side-channel attacks have become very popular, because they take a different approach. Traditional attack methods use mathematical analysis to reveal the weak points of a cryptographic algorithm; the attacker must be experienced in cryptanalysis and have deep knowledge of the cryptographic algorithm. A side-channel attack, however, uses only side-channel information, such as power consumption or execution time, to recover the secret information in a cryptographic device. The attacker does not need cryptanalysis experience or detailed knowledge of the cryptographic algorithm. Among the proposed side-channel attack methods, the DPA attack has proved to be the most efficient and the easiest to implement.

In this chapter, we briefly introduce the DPA attack method, with a focus on DPA attacks on AES. Some DPA countermeasures are also discussed at the end of this chapter.

5.1 Introduction of Differential Power Analysis attack

In recent years, several kinds of attacks on cryptographic devices have become public. The goal of these attacks is to reveal the secret keys of cryptographic devices. Attacks on cryptographic devices differ significantly in terms of the cost, time, equipment and expertise needed. When Kocher et al. [52] showed in 1998 that power analysis attacks can efficiently reveal the secrets of cryptographic devices, the world was shocked. Since then, power analysis attacks have received the most attention, because they are very powerful and can be conducted relatively easily. Consequently, they pose a serious threat to the security of cryptographic devices in practice. For the design and development of modern cryptographic devices, it is crucial to understand power analysis attacks and countermeasures.

The basic idea of a power analysis attack is to reveal the key by analyzing the power consumption of a cryptographic device. The power consumption of a cryptographic device depends on the data it processes and the operation it performs. By combining power consumption measurements with mathematical methods, an attacker can recover the secret key.

Power analysis attacks can be classified into two categories: simple power analysis and differential power analysis. Simple power analysis (SPA) attacks are characterized by Kocher et al. as a technique that involves directly interpreting power consumption measurements collected during cryptographic operations, which means that the attacker tries to derive the key more or less directly from a given trace. In contrast to SPA, differential power analysis (DPA) attacks require a large number of power traces, and they exploit the data dependency of the power consumption of the cryptographic device.

DPA attacks exploit the fact that the power consumption of a cryptographic device depends on the intermediate values that are processed during the execution of a cryptographic algorithm. DPA attacks are the most popular type of power analysis attack because they do not require detailed knowledge of the attacked device. Furthermore, they can reveal the secret key of a device even if the recorded power traces are noisy.


5.1.1 Power Consumption of CMOS Circuit

The total power consumption of a circuit depends on two factors: the set of logic cells making up the circuit and the activity of each logic cell. The logic cells can be considered at every level from the system level, through the architecture level, down to the MOS transistor level. Currently, logic cells are usually implemented in CMOS. We use the CMOS inverter to describe the power consumption, because the inverter is representative of all other cells.

As shown in Figure 5.1, the inverter consists of two transistors, P1 and N1, and a load capacitance CL. The power consumption includes two parts: static power consumption and dynamic power consumption.

Figure 5.1 CMOS Inverter.


Static Power Consumption

In this CMOS inverter, P1 is conducting and N1 is insulating if the input a is set to GND. Conversely, P1 is insulating and N1 is conducting if the input a is set to VDD. In both cases, there is no direct connection between VDD and GND. Therefore, only a small leakage current flows through the MOS transistors. This leakage current is denoted by Ileak. The static power consumption Ps can be calculated by the following equation:

$P_s = I_{leak} \cdot V_{DD}$   (5.1)

Dynamic Power Consumption

Dynamic power consumption occurs when a logic cell switches. The output of a logic cell can essentially perform four transitions: 0->0, 1->1, 0->1 and 1->0. In the first two cases (0->0, 1->1), only static power is consumed. In the last two cases (0->1, 1->0), dynamic power is consumed as well. Table 5.1 lists the power consumption for each transition.

Table 5.1 Power consumption of four transitions in a circuit.

Transitions Power consumption

0 -> 0 Static power consumption

0 -> 1 Static + Dynamic power consumption

1 -> 0 Static + Dynamic power consumption

1 -> 1 Static power consumption


The dynamic power consumption Pd consists of two parts: charging power consumption and short-circuit power consumption.

1) Charging Power Consumption

The CMOS inverter draws a charging current from the power supply to charge the output capacitance CL when the output q switches from 0 to 1. CL comprises the capacitances connected to the output port q, and it depends on the physical properties of the process technology, the fan-out cells and the length of the connected wires.

The average charging power Pch consumed by a cell during the time T is given by equation 5.2:

$P_{ch} = \frac{1}{T} \int_0^T p_{ch}(t)\,dt = \alpha \cdot f \cdot C_L \cdot V_{DD}^2$   (5.2)

In this equation, pch(t) denotes the instantaneous charging power, f is the clock frequency, and $\alpha$ is the activity factor of the cell, which corresponds to the average number of 0->1 transitions that occur at the output of the cell in every clock cycle.

2) Short-Circuit Power Consumption

A short circuit occurs temporarily in a CMOS circuit while the output is switching. In the case of the CMOS inverter, there is a short period of time during which both P1 and N1 are conducting simultaneously. The average power consumption Psc caused by this short circuit can be calculated by equation 5.3:

$$P_{sc} = \frac{1}{T}\int_0^T p_{sc}(t)\,dt = \alpha \cdot f \cdot V_{DD} \cdot I_{peak} \cdot t_{sc} \qquad (5.3)$$

In this equation, psc(t) denotes the instantaneous short-circuit power consumed by a cell, α is the activity factor, f is the clock frequency, Ipeak is the current peak caused by the short circuit during a switching event, and tsc is the duration of the short circuit.
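The three power components described above can be evaluated numerically. The following is a minimal sketch that assumes the standard textbook forms Ps = Ileak·VDD, Pch = α·f·CL·VDD² and Psc = α·f·VDD·Ipeak·tsc; the function names and the operating point are hypothetical and chosen only for illustration.

```python
import math

def static_power(i_leak, v_dd):
    """Static power: leakage current times supply voltage."""
    return i_leak * v_dd

def charging_power(alpha, f, c_load, v_dd):
    """Average charging power: alpha * f * C_L * V_DD^2."""
    return alpha * f * c_load * v_dd ** 2

def short_circuit_power(alpha, f, v_dd, i_peak, t_sc):
    """Average short-circuit power: alpha * f * V_DD * I_peak * t_sc."""
    return alpha * f * v_dd * i_peak * t_sc

# Hypothetical operating point (illustrative values, not taken from the text).
V_DD = 1.2        # supply voltage in volts
F_CLK = 100e6     # clock frequency in hertz
ALPHA = 0.1       # average 0->1 transitions per clock cycle
C_LOAD = 10e-15   # load capacitance in farads
I_LEAK = 1e-9     # leakage current in amperes
I_PEAK = 20e-6    # short-circuit current peak in amperes
T_SC = 50e-12     # short-circuit duration in seconds

p_static = static_power(I_LEAK, V_DD)
p_dynamic = (charging_power(ALPHA, F_CLK, C_LOAD, V_DD)
             + short_circuit_power(ALPHA, F_CLK, V_DD, I_PEAK, T_SC))
print(f"static: {p_static:.3e} W, dynamic: {p_dynamic:.3e} W")
```

With values of this kind the dynamic part is much larger than the static part, which is consistent with power analysis attacks modeling only dynamic power.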


5.1.2 Power Model

A power analysis attack is a cryptographic attack that uses power consumption information and a hypothetical power model to reveal the secret keys of cryptographic devices. The power consumption can be easily measured while a cryptographic device is running. The most important step in a power analysis attack is to build an accurate hypothetical power model. Unlike the conventional power models used in other applications, the absolute values of power consumption are not relevant in a power analysis attack. Only the relative differences between simulated power consumption values are important. For this reason, researchers use only dynamic power to model the circuit power.

The Hamming weight (HW) model and the Hamming distance (HD) model are two important hypothetical power models for power analysis attacks. The Hamming weight model is a simple power model, which assumes that the power consumption is proportional to the number of bits that are set in the processed data value. The data values that are processed before and after this value are ignored. Therefore, this power model is not very well suited to describing the power consumption of a CMOS circuit. Equation 5.4 shows the power consumption based on the HW model.

$$P = a \cdot HW(S) + n \qquad (5.4)$$

a is a constant that models the ratio of noise power and circuit power, HW(S) is the Hamming weight of the internal state S, and n is the noise power.

The Hamming distance model is more accurate than the Hamming weight model. The basic idea of the Hamming distance model is to count the number of 1->0 and 0->1 transitions that occur in a digital circuit during a certain time interval. This number is then used to describe the power consumption of the circuit in this time interval. The Hamming distance model assumes that all 1->0 and 0->1 transitions in a digital circuit lead to the same power consumption. By dividing the entire simulation of a circuit into small intervals, a kind of power trace can be generated. This power trace does not contain actual power consumption values but the number of transitions that occur in the corresponding time interval. A formal definition of the Hamming distance is given in the following.

$$HD(S_1, S_2) = HW(S_1 \oplus S_2) \qquad (5.5)$$

The Hamming distance of two values S1 and S2 corresponds to the Hamming weight of S1 ⊕ S2. The Hamming weight corresponds to the number of bits that are set to one. Hence, HW(S1 ⊕ S2) corresponds to the number of bits that differ between S1 and S2.

The formal equation of power consumption in the HD model is as follows:

$$P = a \cdot HD(S_1, S_2) + n \qquad (5.6)$$

The constant a and the noise power n have the same meaning as in the HW model; HW is replaced by HD in this equation. S1 and S2 are two states (the outputs of a circuit) in two clock cycles.
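The two power models can be sketched in a few lines of code. This is an illustrative sketch only: the function names, the `scale` parameter (standing in for the constant factor of the model) and the `noise` parameter are assumptions introduced here.

```python
def hamming_weight(value):
    """Number of bits set to one in a non-negative integer."""
    return bin(value).count("1")

def hamming_distance(s1, s2):
    """Number of bit positions in which s1 and s2 differ: HW(s1 XOR s2)."""
    return hamming_weight(s1 ^ s2)

def hw_power(state, scale=1.0, noise=0.0):
    """Hypothetical power under the HW model: scale * HW(state) + noise."""
    return scale * hamming_weight(state) + noise

def hd_power(s1, s2, scale=1.0, noise=0.0):
    """Hypothetical power under the HD model: scale * HD(s1, s2) + noise."""
    return scale * hamming_distance(s1, s2) + noise

# A register that goes from 0x0F to 0xF0 toggles all eight bits,
# although both values have the same Hamming weight.
print(hamming_weight(0x0F), hamming_weight(0xF0), hamming_distance(0x0F, 0xF0))
```

The example shows why the HD model describes a register update better than the HW model: the two values have identical weight, yet every bit toggles.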

5.1.3 Hypothetical Power Consumption based on HD model: Case study

Based on the discussion in the last two subsections, two cases of the power consumption of a specific circuit are discussed in this section.

Figure 5.2 shows an RTL-level diagram of a circuit. This circuit includes two registers, REG0 and REG1, with a combinational circuit inserted between them. The left part of this figure represents the state of the circuit at a specific time (S0). After one clock cycle, the state of the circuit has changed (S1), as shown in the right part.

The power consumption of this circuit includes two parts: the sequential logic (register) power consumption PREG and the combinational logic power consumption PComb. For PREG, it is,

(5.7)

where the coefficient of the Hamming distance term is the SNR for the registers, and the additive term denotes the power noise produced by the registers. S0 and S1 are the states of REG in different clock cycles. PComb can be modeled in the same way,

(5.8)

where the coefficient is the SNR for the combinational circuit and the additive term is the power noise produced by the combinational circuit. The total power consumption of this circuit in this clock cycle is:

(5.9)

Figure 5.3 shows another circuit. In this circuit, the results produced by the combinational circuit are fed back to the original register. As shown in this figure, the next state of the register is computed from its current state S0. The total power consumption of this circuit is different because the architecture has changed. As discussed above, the power consumption includes:

(5.10)

(5.11)

(5.12)

As a result, the power consumption of the circuit in Case II depends on only one variable, S0, while in Case I it depends on two variables, S0 and S1.

Figure 5.2 Power consumption of a circuit: Case I.

Figure 5.3 Power consumption of a circuit: Case II.


5.1.4 Differential Power Analysis Attacks

The Differential Power Analysis (DPA) attack was proposed by Kocher et al. in 1998. DPA is a side-channel attack method which measures power consumption and uses a hypothetical power model to recover secret information. In a successful attack, the hypothetical power consumption trace of the correctly guessed key displays a significantly higher correlation with the actual measurements of the cryptographic device than the traces of the other guesses. The DPA attack has been proven to be practical and efficient. Therefore, it poses a serious threat to the security of cryptographic devices.

CPA (Correlation coefficient Power Analysis) was proposed by Brier, Clavier and Olivier in 2003 [54]. CPA uses the Hamming distance model instead of the Hamming weight model, and it uses the correlation coefficient instead of the differential coefficient. CPA is an improvement of the original DPA that attacks a cryptographic device more accurately. Normally, DPA attack is used as a common name for both DPA and CPA attacks. In this dissertation, DPA attack means a power analysis attack using DPA or another optimized DPA method, such as CPA.

DPA uses the following equations to calculate the correlation coefficient:

DPA Algorithm:

(5.13)

(5.14)

(5.15)

(5.16)


(5.17)

(5.18)

W(t) denotes the power traces, P(k) is the hypothetical power consumption calculated with the Hamming distance model, t is the time step, k is the hypothetical key, and N is the number of power traces. Equations 5.17 and 5.18 calculate the means of the real and the hypothetical power consumption over the traces. Equations 5.15 and 5.16 calculate the variances of the real and the hypothetical power consumption. Equation 5.14 calculates the covariance of the real power and the hypothetical power. Finally, the correlation coefficients are calculated by equation 5.13.
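The chain of mean, variance, covariance and correlation coefficient described above can be sketched as follows. This is an illustration under assumptions, not the exact formulation of equations 5.13 to 5.18: it uses the population (1/N) form of variance and covariance, and the data layout `traces[i][t]` and `hypotheses[i][k]` is chosen here for convenience.

```python
import math

def mean(values):
    """Arithmetic mean over the N traces."""
    return sum(values) / len(values)

def correlation(w, p):
    """Correlation coefficient of measured samples w and hypothetical values p.

    Both lists hold one entry per power trace.
    """
    n = len(w)
    w_bar, p_bar = mean(w), mean(p)
    cov = sum((wi - w_bar) * (pi - p_bar) for wi, pi in zip(w, p)) / n
    var_w = sum((wi - w_bar) ** 2 for wi in w) / n
    var_p = sum((pi - p_bar) ** 2 for pi in p) / n
    if var_w == 0 or var_p == 0:
        return 0.0  # a constant series carries no information
    return cov / math.sqrt(var_w * var_p)

def correlation_matrix(traces, hypotheses):
    """Return C[k][t] for every hypothetical key k and time step t.

    traces[i][t]     : measured power of trace i at time step t
    hypotheses[i][k] : hypothetical power of trace i under key guess k
    """
    n_steps = len(traces[0])
    n_keys = len(hypotheses[0])
    return [[correlation([tr[t] for tr in traces],
                         [hyp[k] for hyp in hypotheses])
             for t in range(n_steps)]
            for k in range(n_keys)]
```

Each row of the returned matrix is one correlation curve over time; the row with the most pronounced peak would indicate the most likely key guess.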

5.2 DPA attack on AES

5.2.1 DPA attack on AES: An Example

Figure 5.4 Last round of AES module.

A successful DPA attack probes the power consumption, obtains the ciphertext, and models the hypothetical power consumption based on the hardware. Finally, the DPA algorithm is used to calculate the correlation coefficients of the real power consumption and the hypothetical power consumption.

Figure 5.4 shows the last round of the AES module. The attacker knows only the ciphertext and the power consumption. The detailed DPA attack procedure is as follows:

Step 1. Measuring the power consumption of the AES encryption device. The power trace data W(t) can be easily captured with an oscilloscope. The device executes the AES encryption algorithm with an unknown, constant key. The ciphertext is known to the attacker.

Step 2. Calculating hypothetical intermediate values. As shown in Figure 5.4, the intermediate value corresponding to a known ciphertext and a hypothetical key can be calculated.

(5.19)

InvOP represents the inverse operation of AddRoundKey, ShiftRows and SubBytes. key0 is one byte of the hypothetical key, so it has 256 possible values.

Step 3. Calculating hypothetical power consumption. The Hamming distance model is chosen in this step. The hypothetical power consumption is equal to the Hamming distance between the intermediate value and the ciphertext.

(5.20)

Step 4. Calculating correlation coefficients by using P(k) and W(t). The correctly

guessed key is highly correlated with real power consumption, and it can be

identified from the significant peaks in DPA curves.
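The four steps can be tried out on simulated data. The sketch below is a simplified illustration under several assumptions that are not part of the text: the leakage is a single sample per trace, equal to the Hamming distance between the state before the last round and the ciphertext byte plus Gaussian noise; ShiftRows is ignored, so the same byte position is used before and after; and all function names are hypothetical.

```python
import math
import random

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        high = a & 0x80
        a = (a << 1) & 0xFF
        if high:
            a ^= 0x1B
        b >>= 1
    return result

def build_sboxes():
    """Build the AES S-box (inversion, then affine map) and its inverse."""
    sbox = [0] * 256
    for x in range(256):
        inv = next(y for y in range(1, 256) if gf_mul(x, y) == 1) if x else 0
        out = 0
        for i in range(8):
            bit = ((inv >> i) ^ (inv >> ((i + 4) % 8)) ^ (inv >> ((i + 5) % 8))
                   ^ (inv >> ((i + 6) % 8)) ^ (inv >> ((i + 7) % 8)))
            out |= (bit & 1) << i
        sbox[x] = out ^ 0x63
    inv_sbox = [0] * 256
    for x, y in enumerate(sbox):
        inv_sbox[y] = x
    return sbox, inv_sbox

SBOX, INV_SBOX = build_sboxes()

def hd(a, b):
    """Hamming distance of two bytes."""
    return bin(a ^ b).count("1")

def pearson(xs, ys):
    """Correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0

def simulate_traces(key_byte, n_traces, noise_sigma, rng):
    """Step 1 (simulated): ciphertext bytes and one noisy power sample each."""
    ciphertexts, powers = [], []
    for _ in range(n_traces):
        c = rng.randrange(256)
        previous_state = INV_SBOX[c ^ key_byte]  # register value before last round
        ciphertexts.append(c)
        powers.append(hd(previous_state, c) + rng.gauss(0.0, noise_sigma))
    return ciphertexts, powers

def recover_key_byte(ciphertexts, powers):
    """Steps 2 to 4: hypothesize each key byte and keep the best correlation."""
    best_key, best_corr = 0, -1.0
    for guess in range(256):
        hypothetical = [hd(INV_SBOX[c ^ guess], c) for c in ciphertexts]
        corr = pearson(powers, hypothetical)
        if corr > best_corr:
            best_key, best_corr = guess, corr
    return best_key, best_corr
```

A possible usage is `recover_key_byte(*simulate_traces(0x3C, 400, 1.0, random.Random(1)))`, which returns the guessed key byte together with its correlation value. On a real device the traces would contain many time samples, and the correlation would be computed per sample as described in Section 5.1.4.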

5.2.2 DPA attack on AES: A successful attack and a failed attack

According to the discussion in Section 5.2.1, the final set of correlation coefficients has three dimensions: time, hypothetical key and correlation coefficient value. For each byte of the key, denoted by ByteKey, T is the time, K is the hypothetical key and C is the coefficient value. For an AES 128-bit key, there are 16 key bytes in total. The results of a DPA attack can be shown clearly in a 2-D or a 3-D view.

The 2-D view shows 16 graphs, and each graph represents one byte of the key. The x-axis is time and the y-axis is the coefficient value. The 2-D view shows the coefficient curves under the right key. In a successful attack, the curves show a significant peak in each graph (Figure 5.5). A failed attack shows smooth and noisy curves (Figure 5.7).

The 3-D view shows the coefficient value mesh of one byte of the key. The x-axis is the hypothetical key, which ranges from 0 to 255, the y-axis is time, and the z-axis is the coefficient value. A successful attack shows a wall (Figure 5.6), while a failed attack shows a smooth and noisy mesh (Figure 5.8).

Figure 5.5 2-D views of successful DPA attack.

(16 bytes of key, the peak in each graph indicates that the hypothetical key is a right key)

Figure 5.6 3-D views of successful DPA attack.

(4th byte of AES key, the wall in this 3-D view indicates that the hypothetical key is a right key)

Figure 5.7 2-D views of failed DPA attack.

(there is no peak in the correlation graphs)

Figure 5.8 3-D views of failed DPA attack.

(4th byte of AES key, there is no wall in the correlation coefficients mesh)


5.3 Conventional Countermeasure Methods

A DPA attack works because the power consumption of a cryptographic device depends on the intermediate values of the executed cryptographic algorithm. The goal of a countermeasure is to avoid or reduce these dependencies. According to reference [53], techniques to prevent DPA attacks fall into two categories.

Hiding Method

Hiding method is done by breaking the link between power consumption of the devices

and processed data values. This method makes it difficult for an attacker to find

exploitable information in power traces. Two types of hiding methods are introduced by

Stefan Mangard in [53]: Time dimension hiding and amplitude hiding.

Time dimension hiding usually shuffles the operations of a cryptographic algorithm to randomize their execution. This makes the power consumption appear more or less random to an attacker. As shown in Figure 5.9, a power trace consists of several operations, such as A, B and C in this figure. Each operation executes in different clock cycles and produces different power consumption. Before hiding, the execution order of the operations is fixed: A->B->C. After hiding, the execution order is shuffled. The operations are executed in a random order in the time dimension, and the order changes in every power trace. As a result, it is impossible for the attacker to get the right power consumption for each operation.

Amplitude dimension hiding differs from time dimension hiding in that it adds a noise source to the original circuit. As shown in Figure 5.10, the hidden power consumption consists of two sources: the pure power produced by the original circuit and the noise power produced by the noise source. The idea of this method is to reduce the SNR so that the power consumption becomes too noisy for DPA analysis.

Figure 5.9 Time dimension hiding.

Figure 5.10 Amplitude dimension hiding.

Currently, hiding methods are most commonly used in software implementations of AES in embedded systems. The conventional techniques are the random insertion of dummy operations and the shuffling of operations. A hardware implementation of hiding methods has not been reported yet.

Masking Method

The masking method randomizes the intermediate values that are processed by the cryptographic device. This makes the power consumption independent of the intermediate values. The mask operation can be illustrated as

$$S_M = S \ast M \qquad (5.21)$$

S is the internal state of the circuit before masking, M is a random number used to mask S, and SM is the masked state. The operation ∗ is most often the Boolean exclusive-or, the modular addition, or the modular multiplication.
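For the Boolean exclusive-or case, the effect of masking can be checked exhaustively on 8-bit values. The sketch below is an illustration only, with hypothetical function names: for any fixed state S, the masked value takes every 8-bit value exactly once as the mask M runs over all 256 values, so the distribution of its Hamming weight does not depend on S.

```python
def mask(state, m):
    """Boolean masking: masked state = state XOR mask."""
    return state ^ m

def unmask(masked_state, m):
    """Remove a Boolean mask; XOR is its own inverse."""
    return masked_state ^ m

def masked_weight_histogram(state):
    """Histogram of HW(state XOR m) over all 256 possible 8-bit masks m."""
    histogram = [0] * 9
    for m in range(256):
        histogram[bin(mask(state, m)).count("1")] += 1
    return histogram

# The histogram is the same for every state, for example 0x00 and 0xFF.
print(masked_weight_histogram(0x00))
print(masked_weight_histogram(0xFF))
```

The identical histograms show that, with a uniformly random mask, the Hamming weight of the masked value gives no information about the unmasked state.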

Masking methods are widely used in both software and hardware implementations of AES to counter DPA attacks. The most frequently used masking method for AES hardware design was proposed by Akkar and Giraud in [55]. This method includes two parts:

1) Register Masking. Register masking, or internal state masking, masks the intermediate data while AES encryption/decryption is running. As shown in Figure 5.11, all of the intermediate values (A, B, C, D, E) are masked with a random number X.

2) S-Box Masking. Since the S-Box consumes much more power than the other blocks, attackers always focus on the S-Box. As shown in Figure 5.12, S-Box masking masks every intermediate value with a random number. One GF(256) inverter, four GF(256) multipliers and two GF(256) adders are added to the original S-Box design as the extra hardware cost of the masking method.

Figure 5.11 AES after masking.


Figure 5.12 S-Box after masking.

5.4 Conclusion

This chapter introduced the DPA attack. Firstly, the basic concepts, such as the power consumption of CMOS circuits, power models for power estimation and the basic DPA workflow, were introduced. Secondly, the DPA attack on the AES algorithm was introduced. An attack procedure consists of four steps, and the final results of a successful attack and a failed attack were also discussed. Finally, two countermeasure methods, hiding and masking, were introduced.

AES Design with DPA Countermeasure

6 AES Design with DPA Countermeasure

For hardware designs of AES with DPA countermeasures, currently only masking methods have been proposed, and almost all hardware implementations use only masking to counter DPA attacks. However, the masking method has been proved insecure against high-order DPA attacks on software implementations in [56]. For hardware implementations, there are currently few articles about DPA attacks on masked AES. Nevertheless, the risk remains, and many new attack methods are still under research. Attack methods change very quickly: a countermeasure method may be strong against some specific attack methods and still be weak against others. A basic way to improve security is to combine several countermeasure methods. In this chapter, we propose several DPA countermeasure methods for AES hardware design and, finally, an ultra low-cost AES design with multiple DPA countermeasures, which combines masking, hiding, and our proposed Independent ARK and Data Sliding. The experiment environment (DPA attack system) and experimental results are also provided in this chapter.

6.1 Proposed DPA Countermeasure methods for AES

6.1.1 Register Masking

As discussed in Chapter 5, a DPA attack uses the dynamic power consumption and a power model. The dynamic power consumption of a circuit includes sequential logic power consumption (consumed by registers) and combinational logic power consumption. In order to counter DPA attacks, both the registers and the combinational logic should be masked.

Figure 6.1 The ith round of AES without and with masking countermeasures.


Registers are updated in every clock cycle and consume a lot of power. Moreover, all registers are refreshed at the same time, so it is very easy for an attacker to locate the right positions in a power trace according to the hypothetical power model. Register masking masks the values stored in the registers. After masking, the power consumption of the registers PR is randomized by a random number X:

(6.1)

Figure 6.1 shows the round function of AES without and with masking countermeasures. In the left part, all of the values (A, B, C, D, E) stored in the registers are unmasked, so the internal states of AES are known to the attacker. After masking, all of the internal states are masked by a random number X, and this number changes for every encryption.

Figure 6.2 Proposed Registers Masking.

Figure 6.2 shows the registers masking method. Two exclusive-or gates are inserted

around the registers. The values stored in the registers are randomized.



6.1.2 S-Box Masking

The original S-box and the proposed masked S-box are shown in Figure 6.3. The original S-box includes a GF(2^8) inversion and an affine operation. GF(2^8) inversion is a non-linear Galois field operation. The affine operation is a simple matrix transformation which can be easily implemented in hard-wired logic.

In order to conceal the intermediate value, a simplified S-box masking method is proposed. The main idea of the S-box masking method is to use a random number to mask the input data of the S-box. In this way, the power consumption of the S-box depends only on the masked input data and is independent of the original data. Since Galois field inversion is a non-linear operation, Galois field multiplication is used for masking.

As shown in Figure 6.3, two Galois field multipliers are added, and a random number Y is used as the masking pattern. Since Y is independent of A, A×Y is also independent of A. The power consumption is a linear function of the Hamming distance of A×Y; thus, the power consumption is independent of A.

The combinational logic power consumption PComb is represented as:

(6.2)

Compared to the original masking method in [55], which masks both the S-box and the data bus, our proposed design saves one Galois field inverter, two Galois field multipliers and two Galois field adders in the hardware implementation.
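The reason why two multipliers are sufficient around the inverter can be checked in software. The sketch below is an assumption about how the two multipliers are used, not a description of Figure 6.3: the input A is multiplied by a non-zero random Y, the inverter processes only A×Y, and a second multiplication by Y removes the mask, since (A×Y)^-1 × Y = A^-1 in GF(2^8). All names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        high = a & 0x80
        a = (a << 1) & 0xFF
        if high:
            a ^= 0x1B
        b >>= 1
    return result

# Inversion table; as in the AES S-box, the inverse of 0 is defined as 0.
GF_INV = [0] * 256
for _x in range(1, 256):
    for _y in range(1, 256):
        if gf_mul(_x, _y) == 1:
            GF_INV[_x] = _y
            break

def masked_inverse(a, y):
    """Compute the inverse of a while the inverter only sees a * y.

    y is the random mask and must be non-zero (assumption of this sketch).
    """
    if y == 0:
        raise ValueError("the multiplicative mask must be non-zero")
    masked_input = gf_mul(a, y)            # first multiplier: A x Y
    masked_output = GF_INV[masked_input]   # inverter output: (A x Y)^-1
    return gf_mul(masked_output, y)        # second multiplier restores A^-1
```

For example, `masked_inverse(0x53, y)` returns `0xCA` for every non-zero mask `y`, which is the unmasked inverse of `0x53`.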



Figure 6.3 Proposed S-Box Masking.

6.1.3 Subbytes Hiding

Subbytes Hiding means that the sequence of Subbytes operations is shuffled. Subbytes Hiding breaks the correlation between power consumption and AES operations in the time domain. After shuffling, the attacker cannot know which operation corresponds to a given time coordinate of the power trace. Thus, the hypothetical power consumption cannot be correctly calculated for power analysis.

A power trace of AES hardware is shown in Figure 6.4. Four operations are executed one by one: ShiftRows, SubBytes, MixColumns and AddRoundKey. Each operation produces different power consumption. However, an attacker knows that the same operation occurs at the same time in different power traces, as Figure 6.5 a) shows. This makes the power analysis attack feasible.


Figure 6.4 A power trace of AES.

a) Subbytes without hiding b) Subbytes with hiding

Figure 6.5 Subbytes without and with hiding.

In order to prevent power analysis attacks, hiding methods shuffle the execution of operations. As a result, in the collected power traces, the same operation is randomly distributed in the time domain, as shown in Figure 6.5 b). The power consumption of Subbytes Psub can be denoted as,

(6.3)

Pti represents the power consumption in the ith clock cycle. There are n possibilities for Psub in total, and n depends on the hiding method used.

Figure 6.6 Hardware design of Subbytes hiding.

The hardware design of Subbytes hiding is shown in Figure 6.6. One of the 16 bytes of data is selected for SubBytes. The SEL module is a multiplexer for 16-to-1 selection. The LFSR is a linear feedback shift register used to generate the 4-bit selection signals. The initial vector of the LFSR is generated by a random number generator (RNG). The RNG is a module outside of AES that normally exists in every cryptographic device.
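How an LFSR can produce the 4-bit selection signal may be sketched as follows. The feedback polynomial x^4 + x^3 + 1 and the function names are assumptions of this sketch; the text does not specify the LFSR used in Figure 6.6. Note that a plain 4-bit LFSR of this kind cycles through 15 non-zero states and never reaches the all-zero state, so an actual design would need some additional means to select the sixteenth byte.

```python
def lfsr4_step(state):
    """Advance a 4-bit Fibonacci LFSR (taps for x^4 + x^3 + 1) by one step."""
    feedback = ((state >> 3) ^ (state >> 2)) & 1
    return ((state << 1) | feedback) & 0xF

def selection_sequence(seed, count):
    """Return `count` successive 4-bit selection values starting from `seed`.

    The seed plays the role of the initial vector delivered by the RNG
    and must be non-zero, otherwise the register would stay at zero.
    """
    if not 1 <= seed <= 15:
        raise ValueError("seed must be a non-zero 4-bit value")
    state, sequence = seed, []
    for _ in range(count):
        sequence.append(state)
        state = lfsr4_step(state)
    return sequence

# Different seeds start the same cycle at different points, so the order in
# which the bytes are sent to SubBytes differs from one encryption to the next.
print(selection_sequence(1, 15))
print(selection_sequence(9, 15))
```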



6.1.4 Independent ARK and Data Sliding

Independent ARK

The conventional AES hardware design integrates all AES operations on one data path to save clock cycles. As shown in Figure 6.7, in the last round of AES, Subbytes and AddRoundKey are executed within one clock cycle, and the ciphertext is calculated. The power consumption of this procedure can be represented by

(6.4)

The function in this equation is the inverse operation of Subbytes and AddRoundKey. C is the ciphertext, which is known to the attacker. S1 is an internal state of the circuit, obtained by applying the inverse function to C. The key is unknown, and the attacker uses a hypothetical key in this equation to calculate the hypothetical power consumption. As discussed in the last chapter, DPA attacks compare the hypothetical power with the real power, and the right key can be easily recognized in the correlation coefficient graph. In this equation, the attacker needs to guess one byte of the key (8 bits), which has 256 possibilities.

Independent ARK means that the AddRoundKey operation is separated from the other operations. As shown in Figure 6.8, the last round of AES is separated into two sub-steps: Subbytes and AddRoundKey. These two steps are executed in different clock cycles. For a DPA attack, only the Subbytes operation can be used, because the S-Box consumes much more power than the other operations. The power consumption of Subbytes in this circuit is,

(6.5)

S2 and S3 are two internal states of the circuit. S2 is the result of the exclusive-or of the ciphertext and the key. S3 is the result of applying the inverse Subbytes to S2.


Figure 6.7 Integrated Subbytes and AddRoundKey.

Figure 6.8 Separated Subbyte and AddRoundKey.



Figure 6.9 Feedback structure and Data Sliding Structure.

Data Sliding

Data sliding is used to make the state of a register depend on its neighbouring registers. In Figure 6.9, two kinds of circuit structures are shown:

A) Feedback circuit structure

A feedback circuit means that the source and the destination of a data value are the same. As shown in this figure, R0 and R1 are two registers. R0 and Subbytes make up a feedback circuit. At time t0, the states of these two registers are S0 and S1. After one clock cycle, the result of Subbytes is written back to R0. The power consumption can be represented as,

(6.6)



B) Data Sliding circuit structure

In the Data Sliding circuit, there is no feedback. The source and the destination of a combinational circuit are different registers. As shown in this figure, the input of Subbytes comes from R0, and the output of Subbytes goes to R1. The changes of state of this circuit are also listed in the table of this figure. The power consumption of the Data Sliding circuit can be represented as,

(6.7)

Unlike the feedback circuit, the power consumption of this circuit depends on two registers, and the states of both registers must be taken into account. As shown in this equation, the power consumption P depends on both S0 and S1.
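The difference between the two structures can be illustrated with the Hamming distance model. The sketch below is a simplified illustration and rests on several assumptions: only the register that receives the Subbytes result is modeled, noise and scaling are omitted, and `toy_sub` is a hypothetical stand-in for Subbytes rather than the real AES S-box.

```python
def hd(a, b):
    """Hamming distance of two bytes."""
    return bin(a ^ b).count("1")

def toy_sub(x):
    """Hypothetical 8-bit substitution used in place of Subbytes."""
    return ((x * 7) ^ 0x63) & 0xFF

def feedback_power(s0, sub=toy_sub):
    """Feedback structure: R0 is overwritten with sub(R0).

    The register changes from s0 to sub(s0), so the modeled power
    depends on s0 only.
    """
    return hd(s0, sub(s0))

def sliding_power(s0, s1, sub=toy_sub):
    """Data Sliding structure: R1 is overwritten with sub(R0).

    The register changes from s1 to sub(s0), so the modeled power
    depends on both s0 and s1.
    """
    return hd(s1, sub(s0))

# For a fixed s0 the feedback value is a single number, whereas the sliding
# value still varies with the unknown neighbouring state s1.
print(feedback_power(0x12))
print(sorted({sliding_power(0x12, s1) for s1 in range(256)}))
```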

While combining Independent ARK and Data Sliding together, the power consumption

becomes:

(6.8)

key0 and key1 are two bytes of the key. C0 and C1 are ciphertext bytes, which are known to the attacker. In this power consumption equation, two bytes of the key need to be hypothesized. Compared to the DPA attack on a conventional circuit, in which only one byte of the key needs to be hypothesized, our proposed methods increase this to two bytes. As a result, the computational cost is increased by a factor of 2^8 for every power trace.


6.1.5 Time Complexity Analysis

The proposed DPA countermeasure methods greatly increase the computational complexity of DPA attacks. The time complexity of a DPA attack on an AES design with countermeasure methods is analyzed in this section.

For a DPA attack, the time complexity includes three parts: 1) power trace measuring, 2) hypothetical power modeling, and 3) correlation coefficient calculating. Between a DPA attack on pure AES (without countermeasures) and one on secure AES (with countermeasures), the difference lies in parts 2 and 3. For pure AES, the hypothetical power trace is,

(6.9)

In order to perform a DPA attack, the value of the key must be hypothesized. Every hypothetical key corresponds to a set of hypothetical power traces. The correlation coefficients are calculated for every pair of hypothetical and real power traces. We define the time complexity of a DPA attack on pure AES as O(DPA).

For AES with countermeasure methods, because the power consumption changes according to the method (as discussed in Sections 6.1.1-6.1.4), the hypothetical power model has many more unknown variables than for pure AES. A summary of the hypothetical power consumption and the time complexity of the DPA attack is given as follows:

AES with Masking method

As discussed for equations 6.1 and 6.2, the hypothetical power consumption of AES with the masking method is,

(6.10)

Compared to equation 6.9, an additional 8-bit random number X is included in this equation. In order to get the power trace, the attacker must additionally guess the random number X for every hypothetical key. For a single power trace, the number of possible power values is increased by a factor of 2^8. For N power traces, the number of possibilities for the power trace set is increased by a factor of 2^(8N). In other words, the time complexity is also increased by a factor of 2^(8N).

AES with Hiding method

As shown in equation 6.3, the hypothetical power consumption of AES with hiding

method is,

(6.11)

Y is a random selection number which indicates which power value is the right value for a specific time slot. Normally, Y has 16 possible values, since there are 16 Subbytes operations in each round of the AES algorithm. Similarly, for a single power trace, the number of possibilities is increased by a factor of 2^4, and for N power traces, it is increased by a factor of 2^(4N). The time complexity is also increased by a factor of 2^(4N).

AES with Independent ARK + Data Sliding

As discussed in Section 6.1.4, the hypothetical power consumption of AES with Independent ARK and Data Sliding is,

(6.12)

key0 and key1 are 8-bit hypothetical keys. This equation is similar to equation 6.10 (considering key0 as the key and key1 as the random number X). The time complexity analysis is also similar to that of masking: the complexity is increased by a factor of 2^(8N).

Table 6.1 summarizes the power consumption of the AES design with each countermeasure method. Table 6.2 summarizes the time complexity for AES without countermeasures, AES with masking, AES with hiding, and AES with Independent ARK & Data Sliding.

Our proposed countermeasure methods can also be combined to raise the security to a higher level. For example, when all of the methods (masking, hiding, Independent ARK & Data Sliding) are combined, the power consumption becomes,

(6.13)

There are four unknown variables in this equation: the key bytes (key0, key1) and the random numbers (X, Y). The computational complexity of the DPA attack becomes 2^(20N) times that of the attack on pure AES, which is 2^(12N) times more secure than the AES design with only the masking method. In this way, even if the masking method is proved to be insecure, the other countermeasure methods can still guarantee the security.
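The combined factor follows from multiplying the individual factors of this section, under the assumption that the unknowns of the different countermeasures are guessed independently for each of the N traces. Written out, with O(DPA) denoting the cost of the attack on pure AES and the subscripted C symbols introduced here only for illustration:

```latex
\begin{align*}
C_{\text{masking}}     &= 2^{8N} \cdot O(\mathrm{DPA}) \\
C_{\text{hiding}}      &= 2^{4N} \cdot O(\mathrm{DPA}) \\
C_{\text{ARK+sliding}} &= 2^{8N} \cdot O(\mathrm{DPA}) \\
C_{\text{combined}}    &= 2^{8N} \cdot 2^{4N} \cdot 2^{8N} \cdot O(\mathrm{DPA})
                        = 2^{20N} \cdot O(\mathrm{DPA}) \\
\frac{C_{\text{combined}}}{C_{\text{masking}}} &= 2^{20N - 8N} = 2^{12N}
\end{align*}
```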



Table 6.1 Summary of different countermeasure methods.

Method | Description | Effect on power consumption
Masking | Randomizes the internal data | Adds a random number X to the power consumption
Subbytes Hiding | Shuffles the execution order of Subbytes | The right power consumption is one member of a set of power consumption values
Independent ARK with Data Sliding | Equivalent to masking the data with another key | Adds another key to the power consumption

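Table 6.2 Comparison of time complexity for each countermeasure method.

Method | Power consumption | Complexity
AES without DPA countermeasure | Equation 6.9 | O(DPA)
Masking | Equation 6.10 | 2^(8N) × O(DPA)
Subbytes Hiding | Equation 6.11 | 2^(4N) × O(DPA)
Independent ARK with Data Sliding | Equation 6.12 | 2^(8N) × O(DPA)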


6.2 Ultra Low-cost Design of AES with DPA Countermeasure

6.2.1 Specification

The data size of coded video depends strongly on the resolution, the frame rate and the coding methods. Resolution is the size of a picture in the video sequence. When the resolution becomes higher, every frame of the video sequence consists of more MBs, and the data size increases greatly. On the other hand, a high resolution lets the video contain more details and makes it more attractive to the audience. Frame rate is the number of frames within one second. A higher frame rate means that more pictures must be displayed in one second. A high frame rate makes moving pictures seem smoother, especially for high-motion pictures. Normally, for low-resolution video (less than 1920×1088), the frame rate is set to 30 fps, and for high-resolution video (more than 1920×1088), the frame rate is set to 60 fps. The coding method is another important factor for the video data size. Some coding methods, such as RDO (Rate-Distortion Optimization), QP Matrix, CAVLC and CABAC, greatly affect the coded video data size. For more information about coding methods, please refer to [46].

Table 6.3 shows the maximum bit rate of selected levels in H.264 [57]. Each level defines the maximum bit rate, video resolution, frame rate and maximum number of frames stored in the buffer. Table 6.3 lists only selected levels; the full list of levels can be found in [57]. Some frequently used video resolutions are listed in this table. 176×144 (QCIF) and 352×288 (CIF) are usually used in mobile phones. Since the screen size and the battery power of a mobile phone are limited, small-size video is acceptable. The system in [51] uses QVGA (320×240) @ 15 fps, 128 kbps to broadcast TV to mobile phones. 720×480 (VGA) is normally used in high-end portable media players. 1280×720 (HDTV 720p) and 1920×1080 (HDTV 1080p) are widely used in High Definition TV. For future use, 4096×2048 (4kx2k) and 8192×4096 (8kx4k) super-HDTV are under research.


Table 6.3 Max bit-rate and resolution of selected H.264 levels.

H.264 Level | Max bit rate (bps): Baseline, Main, Extended Profile | High Profile | High 10 Profile | High 4:2:2, 4:4:4 Profile | Resolution @ frame rate
1 | 64 K | 80 K | 192 K | 256 K | 128×96@30
1.1 | 192 K | 240 K | 576 K | 768 K | 176×144@30
2 | 2 M | 2.5 M | 6 M | 8 M | 352×288@30
3 | 10 M | 12.5 M | 30 M | 40 M | 720×480@30
3.1 | 14 M | 17.5 M | 42 M | 56 M | 1280×720@30
3.2 | 20 M | 25 M | 60 M | 80 M | 1280×720@60
4 | 20 M | 25 M | 60 M | 80 M | 1920×1080@30
4.1 | 50 M | 62.5 M | 150 M | 200 M | 2048×1024@30
4.2 | 50 M | 62.5 M | 150 M | 200 M | 2048×1080@60; 1920×1080@64
5 | 135 M | 168.75 M | 405 M | 540 M | 1920 [email protected]; 2048 [email protected]; 2048 [email protected]
5.1 | 240 M | 300 M | 720 M | 960 M | 1920 [email protected]; 4096×2048@30


Currently, 1920×1080@60fps with High Profile is the highest configuration for commercial products. For video communication, such as video conferencing, the most widely used resolution is VGA. In conclusion, the maximum bit rate for current video applications does not exceed 62.5 Mbps. A real-time video encryption module must therefore achieve a throughput above this number.
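The bit-rate requirement can be checked against Table 6.3 with a small lookup. The sketch below is only an illustration: the dictionary and function names are hypothetical, the profile factors (1, 1.25, 3 and 4) are read off the columns of Table 6.3, and the 75 Mbps figure is the throughput quoted for the lowest-cost AES architecture of Chapter 4.

```python
# Maximum bit rate in bps for Baseline/Main/Extended Profile, per level (Table 6.3).
BASE_MAX_BPS = {
    "1": 64e3, "1.1": 192e3, "2": 2e6, "3": 10e6, "3.1": 14e6, "3.2": 20e6,
    "4": 20e6, "4.1": 50e6, "4.2": 50e6, "5": 135e6, "5.1": 240e6,
}

# Multipliers of the other profile columns relative to the first column.
PROFILE_FACTOR = {
    "baseline": 1.0, "main": 1.0, "extended": 1.0,
    "high": 1.25, "high10": 3.0, "high422": 4.0, "high444": 4.0,
}

def max_bit_rate(level, profile):
    """Maximum video bit rate in bps for a given H.264 level and profile."""
    return BASE_MAX_BPS[level] * PROFILE_FACTOR[profile]

def can_encrypt_in_real_time(level, profile, cipher_bps):
    """True if the cipher throughput covers the maximum video bit rate."""
    return cipher_bps >= max_bit_rate(level, profile)

AES_THROUGHPUT_BPS = 75e6  # lowest-cost architecture of Chapter 4

print(max_bit_rate("4.2", "high"))
print(can_encrypt_in_real_time("4.2", "high", AES_THROUGHPUT_BPS))
print(can_encrypt_in_real_time("5", "high", AES_THROUGHPUT_BPS))
```

The lookup makes the margin explicit: Level 4.2 with High Profile stays below the cipher throughput, while the higher levels of the table would require a faster configuration.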

6.2.2 Hardware Architecture

As discussed in the last subsection, the maximum throughput required by video applications is about 62.5 Mbps. The lowest-cost AES based on the scalable architecture proposed in Chapter 4 achieves a throughput of 75 Mbps. Therefore, for real-time video encryption, the lowest-cost architecture in Chapter 4 is the most suitable one.

Figure 6.10 shows the hardware architecture of the ultra low-cost AES design with DPA countermeasures. This architecture is based on our proposed scalable architecture in Chapter 4, so most parts of the architecture are the same. Some important points of this architecture are:

S-Box with masking: Only one S-Box and one MixColumns unit are used in this design
to reduce the total hardware cost. The S-Box masking method proposed in Section 6.1.2
is used.

SubBytes Shuffling: A 17-to-1 multiplexer is used for SubBytes hiding. One of the 16
data bytes is randomly selected for SubBytes in every clock cycle. The selection signal
is produced by the circuit proposed in Section 6.1.3.

Register Masking: All data registers are masked. Two XOR gate arrays are placed
before and after each data register, as in Section 6.1.1.



Figure 6.10 Ultra low-cost AES with DPA countermeasure.

Figure 6.11 Data flow for ultra low-cost AES.



Independent ARK: Independent ARK is already used in the scalable architecture.
Because the data path is separated into three parts, the AddRoundKey, SubBytes and
MixColumns operations are independent of each other.

Data Sliding: Data sliding is achieved by right-shifting the data registers in this
architecture. Since only one S-Box is used, the right shift is performed in every clock
cycle, which is equivalent to data sliding.

In hardware design, the masking and hiding methods require extra hardware. In contrast,
Independent ARK and Data Sliding are architecture-level design choices which do not
require extra hardware.
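The behaviour of SubBytes shuffling and register masking can be illustrated in software. The following Python sketch is a behavioural model only, not the circuit of Figure 6.10: the random select signal of Section 6.1.3 is replaced by Python's pseudo-random generator, the mask source is left to the caller, and `TOY_TABLE` is a placeholder substitution table rather than the AES S-Box. All names are hypothetical.

```python
import random

# Placeholder byte substitution table (NOT the AES S-Box); any 256-entry table works.
TOY_TABLE = [(7 * x + 99) % 256 for x in range(256)]


def shuffled_subbytes(state, table, rng):
    """Substitute the 16 state bytes one per clock cycle, in random order."""
    order = list(range(16))
    rng.shuffle(order)                      # stands in for the random select signal
    out = list(state)
    for index in order:                     # one table lookup per clock cycle
        out[index] = table[state[index]]
    return out, order


class MaskedRegister:
    """Data register with XOR arrays before (write) and after (read) it."""

    def __init__(self):
        self._stored = 0
        self._mask = 0

    def write(self, value, mask):
        self._mask = mask
        self._stored = value ^ mask         # only the masked value is stored

    def read(self):
        return self._stored ^ self._mask    # the mask is removed on the way out


rng = random.Random(0)
state = list(range(16))
result, order = shuffled_subbytes(state, TOY_TABLE, rng)
# The processing order changes from run to run, the result does not.
assert result == [TOY_TABLE[b] for b in state]
```

The point of the model is that the final state is independent of the processing order, so an attacker cannot tell from the cycle index which byte is being substituted.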

6.2.3 Data Flow

The dataflow of the proposed ultra low-cost AES design with DPA countermeasures is
similar to that of the unified architecture, which was discussed in Section 4.4.2.
Figure 6.11 shows this dataflow in detail. The notations used in the dataflow are listed
in the notation table. Every block in the dataflow represents one clock cycle, and
operations in the same block are executed in parallel.

The dataflow consists of three parts: the first round, Round i (i from 1 to 9), and the
last round. The first round costs 5 clock cycles and executes three operations:
AddRoundKey, ShiftRows and key SubBytes. Because AddRoundKey and ShiftRows both
operate on the data registers, they can be merged into one combined operation. Round i
is a loop that is executed 9 times. It executes six operations: key update, data
SubBytes, MixColumns, key SubBytes, AddRoundKey and ShiftRows. Many operations are
executed in parallel to save clock cycles. The last round consists of only three
operations: key update, data SubBytes and AddRoundKey. From this dataflow, it can be
seen that AddRoundKey is always executed independently, and that the data SubBytes can
be executed in random order in every step.
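The merging of AddRoundKey and ShiftRows can be checked with a few lines of Python. The sketch assumes the usual FIPS-197 byte ordering (state stored column by column) and is a software illustration, not the register-level implementation of Figure 6.11; the function names are illustrative.

```python
# State bytes are stored column by column: index = 4 * column + row.
# ShiftRows rotates row r to the left by r positions.
SHIFT_SOURCE = [4 * ((col + row) % 4) + row for col in range(4) for row in range(4)]


def add_round_key(state, round_key):
    """Byte-wise XOR of the state with the round key."""
    return [s ^ k for s, k in zip(state, round_key)]


def shift_rows(state):
    """Pure byte permutation of the state."""
    return [state[src] for src in SHIFT_SOURCE]


def ark_shift_rows(state, round_key):
    """AddRoundKey and ShiftRows merged into a single pass over the registers."""
    return [state[src] ^ round_key[src] for src in SHIFT_SOURCE]


state = list(range(16))
key = [0xA0 + i for i in range(16)]
assert ark_shift_rows(state, key) == shift_rows(add_round_key(state, key))
```

Because ShiftRows only moves bytes and AddRoundKey works on each byte separately, the combined operation needs no more logic than an XOR placed on the shifted wiring.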

6.2.4 Implementation

To compare the hardware cost of AES designs with different countermeasure methods, we
implemented four AES designs. All designs are coded in Verilog HDL and synthesized with
Synopsys Design Compiler, using the TSMC 0.18 um standard cell library.

AES without countermeasure methods (AES 0)

This AES design was proposed in [58]. Its architecture is similar to the scalable
architecture proposed in Chapter 4. To further reduce the hardware cost, AddRoundKey,
MixColumns and ShiftRows are integrated into one 32-bit data path, so that there are
only two parallel data paths. This design achieves the lowest hardware cost for a pure
AES design (without DPA countermeasures). A detailed description can be found in [58],
and the reconfigurable version of this architecture in [66]. Table 6.4 shows the
hardware cost of AES0 at a clock frequency of 80 MHz.

AES with Independent ARK and Data Sliding (AES 1.0)

This AES design uses the architecture shown in Figure 6.10, with only Independent ARK
and Data Sliding. In other words, the 17-to-1 multiplexer shown in the figure is not
used. Independent ARK and Data Sliding are inherent in this architecture and dataflow.
Table 6.6 shows the hardware cost of AES1.0. The clock frequency reaches 125 MHz, which
is much higher than that of AES0. The reason is that AES1.0 uses three data paths, so
its critical path is much shorter than that of AES0.



Table 6.4 AES0@80MHz, TSMC 0.18um (Pure AES)

Components                                            Gates
S-Box                                                   358
MixColumns                                              376
Key Expander                                           1935
Controller                                              247
Data Registers + others (ARK, ShiftRows)               1762
Total                                                  4678

Table 6.5 AES1.1@125MHz, TSMC 0.18um (AES + Subbytes Hiding)

Components                                            Gates
S-Box                                                   383
MixColumns                                              313
Key Expander                                           2220
Controller                                              235
Data Registers + others (Multiplexer, ARK, ShiftRows)  3093
Total                                                  6244

Table 6.6 AES1.0@125MHz, TSMC 0.18um (AES + Independent ARK, Data Sliding)

Components                                            Gates
S-Box                                                   423
MixColumns                                              313
Key Expander                                           2223
Controller                                              235
Data Registers + others (ARK, ShiftRows)               2306
Total                                                  5500

Table 6.7 AES1.2@75MHz, TSMC 0.18um (AES + Masking)

Components                                            Gates
S-Box                                                  1124
MixColumns                                              325
Key Expander                                           2220
Controller                                              235
Data Registers + others (XOR Gates, ARK, ShiftRows)    2930
Total                                                  6834



AES with Subbytes hiding (AES 1.1)

AES with SubBytes hiding adds a 16-to-1 multiplexer compared to AES1.0. The same
architecture and the same dataflow are used. Table 6.5 shows the hardware cost of
AES1.1 at 125 MHz. The performance of AES1.1 is the same as that of AES1.0.

AES with masking (AES 1.2)

AES with masking combines register masking and S-Box masking. Since masking adds
considerable circuitry to the original design, the performance drops significantly
compared to AES1.0. Table 6.7 shows the hardware cost of AES1.2. The clock frequency of
AES1.2 is reduced to 75 MHz.

From the implementation results listed above, Independent ARK and Data Sliding have the
smallest effect on hardware cost. SubBytes hiding has the second smallest effect.
Masking is the weakest method in terms of both hardware cost and performance. For
hardware-cost-sensitive AES designs, such as RFID, the proposed Independent ARK, Data
Sliding and SubBytes hiding are therefore preferable to masking.
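The comparison can be made quantitative from the totals and clock frequencies of Tables 6.4 to 6.7. The short Python sketch below computes the gate overhead of each design relative to a chosen baseline; the data structure and function name are illustrative.

```python
# Total gate counts and clock frequencies taken from Tables 6.4 to 6.7.
DESIGNS = {
    "AES0":   {"gates": 4678, "clock_mhz": 80},    # pure AES
    "AES1.0": {"gates": 5500, "clock_mhz": 125},   # Independent ARK + Data Sliding
    "AES1.1": {"gates": 6244, "clock_mhz": 125},   # + SubBytes hiding
    "AES1.2": {"gates": 6834, "clock_mhz": 75},    # + masking
}


def overhead_percent(design, baseline="AES1.0"):
    """Gate-count overhead of `design` relative to `baseline`, in percent."""
    base = DESIGNS[baseline]["gates"]
    return 100.0 * (DESIGNS[design]["gates"] - base) / base


for name in ("AES1.1", "AES1.2"):
    print(name, f"{overhead_percent(name):.1f}% more gates than AES1.0")
print("AES1.0", f"{overhead_percent('AES1.0', 'AES0'):.1f}% more gates than AES0")
```

With these numbers, SubBytes hiding adds about 13.5% to the gate count of AES1.0 and masking about 24.3%, while AES1.0 itself is about 17.6% larger than AES0.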

6.3 DPA Attack Evaluation Environment

6.3.1 DPA attack platform

To mount a DPA attack on the AES algorithm, an attack environment is needed first. We
use a SASEBO board to run the power analysis tests, an oscilloscope to capture the
power traces from the FPGA board, and a PC to process the captured power traces using
the specified power model.

Figure 6.12 shows a photo of our DPA attack system, and Figure 6.13 shows the system
architecture. We use the SASEBO board provided by AIST to run the AES operation. The
board is connected to an independent power supply. While the encryption is running, a
digital oscilloscope captures the power traces. The trace data is recorded when a
trigger signal occurs and is then transmitted to the PC. Two types of data are
transferred: the ciphertext produced by the SASEBO device is sent to the PC through the
RS232 serial port, and the digitized power trace waveform is sent through LAN. The
power analysis attack is based entirely on the power consumption data and the
ciphertext. A detailed description of our DPA attack system can be found in [69].

6.3.2 Sasebo Board

The Side-channel Attack Standard Evaluation Board (SASEBO) is a board specifically
designed for developing standard evaluation schemes to secure cryptographic modules
against physical attacks. It was developed by AIST and Tohoku University [59][60]. It
comes in an FPGA version and an ASIC version. The FPGA version uses a Xilinx Virtex-II
Pro XC2VP7 FPGA to implement AES designs on the board. The ASIC version uses an ASIC
chip in which several AES designs are already implemented.

In this dissertation, we use only the FPGA version, because the proposed AES hardware
can be implemented in the FPGA. Figure 6.14 shows the architecture of the SASEBO board.
There are two FPGAs on the board: FPGA1 is used for the cryptographic operation, and
FPGA2 is used for control logic. Two EEPROMs are used to configure the FPGAs, and the
configuration file is downloaded through the JTAG port. The power supply and clock
source of each FPGA are separate. An RS232 serial port is used for PC communication. An
LED module displays the internal status of FPGA1. A detailed description of the
SASEBO-G board can be found on the websites [59][60].



Figure 6.12 DPA Attack Evaluation System (Photo).

Figure 6.13 DPA Attack Evaluation System (Architecture).



Figure 6.14 Sasebo Board.

6.3.3 Test Flow

In this part, we describe the testing process. The complete test flow is shown in
Figure 6.15.

First, AES encryption is selected. Then the oscilloscope is initialized (for example,
the sampling rate is set to 2 GSa/s), and the number of AES operations to execute is
set. Next, the oscilloscope is checked to see whether it is in the Run state; if not,
the flow returns to the oscilloscope initialization phase. The control signal is then
sent to the FPGA through the RS232C serial port, according to the input data format.
After receiving the control signal, the FPGA can perform encryption, decryption or
reset. Here, we perform encryption in order to obtain power trace data. After the
encryption, the data is transferred back to the PC. At the same time, we check whether
a trigger has occurred; if not, something is wrong with the FPGA. If the FPGA is
running normally, the power trace data is recorded on the PC through LAN as a CSV or
text file. Once the required number of AES operations has been completed, the flow
enters the DPA attack phase.



Figure 6.15 DPA attack test flow.



In the DPA attack step, we use the Hamming distance model to relate the power traces to
the processed data, and the correlation coefficient to measure the strength of this
relationship. Throughout the flow, data must be transferred between different pieces of
equipment: the host PC and the FPGA board are connected through the RS232C serial port,
and LAN is used to control data reading and writing between the PC and the
oscilloscope.
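The attack step can be illustrated on simulated data. The following Python sketch is a simplified correlation attack under stated assumptions, not the measurement code used in this dissertation: the traces are replaced by one simulated sample per encryption (Hamming distance plus Gaussian noise), and the targeted register transition is a plaintext byte being overwritten by its S-Box output, whereas the real system works from ciphertext. All function names are illustrative.

```python
import math
import random


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """AES S-Box: multiplicative inverse followed by the affine transform."""
    sbox = []
    for x in range(256):
        inv = 1
        for _ in range(254):                # x^254 is the inverse of x (0 maps to 0)
            inv = gf_mul(inv, x)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


def hamming_distance(a, b):
    return bin(a ^ b).count("1")


def pearson(xs, ys):
    """Correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def simulate_leakage(plaintexts, key_byte, sbox, rng, noise_sigma=0.5):
    """One simulated power sample per encryption (assumed leakage model)."""
    return [hamming_distance(p, sbox[p ^ key_byte]) + rng.gauss(0, noise_sigma)
            for p in plaintexts]


def recover_key_byte(plaintexts, leakage, sbox):
    """Return the key guess whose Hamming distance model correlates best."""
    best_guess, best_corr = 0, -1.0
    for guess in range(256):
        model = [hamming_distance(p, sbox[p ^ guess]) for p in plaintexts]
        corr = abs(pearson(model, leakage))
        if corr > best_corr:
            best_guess, best_corr = guess, corr
    return best_guess, best_corr


SBOX = build_sbox()
rng = random.Random(2009)
plaintexts = [rng.randrange(256) for _ in range(500)]
leakage = simulate_leakage(plaintexts, 0x3C, SBOX, rng)
guess, corr = recover_key_byte(plaintexts, leakage, SBOX)
print(hex(guess), round(corr, 2))
```

In a real measurement, the same correlation is computed for every sample point of the trace, which gives the 2-D and 3-D views shown in the experimental results.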

6.4 Experiment Results of DPA Attack

Figure 6.16 shows a power trace measured by the oscilloscope. One AES encryption takes
211 clock cycles in total. We sample about 5000 points from the start to the end of the
encryption. For each DPA attack, we collect 5000 power traces for the analysis.
Figures 6.17 and 6.18 show the 2-D and 3-D results of the DPA attack on pure AES.
Figures 6.19 and 6.20 show the DPA attack results for the AES design with only the
hiding method. Figures 6.21 and 6.22 show the DPA attack results for the AES design
with only the masking methods. Figures 6.23 and 6.24 show the DPA attack results for
the AES design with only Independent ARK and Data Sliding. These figures show that all
of the proposed DPA countermeasure methods resist the DPA attack well.

Figure 6.16 Power trace from oscilloscope



Figure 6.17 2-D view of DPA attack on Pure AES.

Figure 6.18 3-D view of DPA attack on Pure AES.

(4th byte of key, the other 15 bytes of key are similar)



Figure 6.19 2-D view of DPA attack on AES with Subbytes hiding.

Figure 6.20 3-D view of DPA attack on AES with Subbytes hiding.




Figure 6.21 2-D view of DPA attack on AES with masking.

Figure 6.22 3-D view of DPA attack on AES with masking.




Figure 6.23 2-D view of DPA attack on AES with Independent ARK and Data Sliding.

Figure 6.24 3-D view of DPA attack on AES with Independent ARK and Data Sliding.




6.5 Chip Design

To evaluate the proposed countermeasure methods in an ASIC, a test chip was designed
for the VDEC project [70], using the ROHM 0.18 um standard cell library. The chip size
is 2.5 mm × 2.5 mm [...] constraints. The chip contains the four AES designs discussed
in Section 6.2.4. The top module consists of multiplexers and a UART module. A select
signal chooses which of the four AES designs is running. The architecture of this chip
is shown in Figure 6.25.

Figure 6.25 Test Chip Architecture.



Figure 6.26 Chip design of AES

Table 6.8 VDEC Test Chip.

Technology    ROHM 0.18 um
Chip Size     2.5 mm × 2.5 mm
PAD Number    128
Voltage       1.8 V
Metal         5
Frequency     ~100 MHz
Designs       AES0, AES1.0, AES1.1, AES1.2



6.6 Conclusion

This chapter presented five DPA countermeasure methods for AES hardware design:
Register Masking, S-Box Masking, SubBytes Hiding, Independent ARK and Data Sliding.
The theoretical analysis shows that the complexity of a DPA attack on an AES that uses
the hybrid countermeasure solution is increased 2^12 N times. Thus, even if one or two
countermeasure methods are broken, the remaining methods can still prevent a successful
attack. For the hardware design, an ultra low-cost AES design with these countermeasure
methods was proposed. This AES is designed for real-time video encryption, and only one
S-Box and one MixColumns unit are used in the architecture. The effect of the different
countermeasure methods on hardware cost was discussed. Finally, to evaluate the
effectiveness of the proposed countermeasure methods, a DPA attack evaluation system
and a test chip containing four AES cores were implemented. The DPA attack experimental
results show that the proposed countermeasure methods successfully prevent the DPA
attack.


7 Conclusion

In this dissertation, a new video encryption scheme and the hardware design of the
encryption module are proposed. The work consists of three parts: 1) at the algorithm
level, a new video encryption scheme is proposed; 2) at the hardware level, optimized
hardware architectures for the AES and RSA algorithms are proposed; 3) at the security
level, DPA countermeasure methods for AES hardware design are proposed.

Conventional selective video encryption schemes have many problems, such as low
security, high computational cost and difficulty of implementation. To improve the
security and reduce the computational cost of video encryption, we proposed an Unequal
Secure Encryption (USE) scheme, designed especially for the H.264/AVC video coding
standard. The scheme consists of two parts: data classification and unequal secure
encryption. For data classification, we proposed three methods based on H.264/AVC.
After data classification, the video bit stream is separated into two parts: an
important data partition and an unimportant data partition. Four security levels are
defined in the USE scheme to balance security strength against computational
complexity. For unequal secure encryption, we use two encryption methods: the AES
algorithm for the important data partition, and the FLEX algorithm for the unimportant
data partition. The FLEX algorithm is based on AES and is 5 times as fast as AES. As a
result, only AES needs to be implemented in the encryption module.

For the hardware design of the AES algorithm, a scalable architecture is proposed.
Since the video data size varies greatly across video levels, a fixed architecture with
one specific performance is not a good solution. In the proposed scalable architecture,
the number of S-Boxes and MixColumns units is configurable: 1 to 20 S-Boxes and 1 to 4
MixColumns units can be used. The experimental results show that the lowest-cost
implementation uses only one S-Box and one MixColumns unit and achieves a throughput of
75 Mbps, while the highest-performance implementation, with 20 S-Boxes and 4
MixColumns units, achieves a throughput of 2.4 Gbps.

For the RSA hardware design, first, a modified scalable high-radix Montgomery algorithm
is proposed to reduce the critical path. Second, a high-radix clock-saving dataflow is
proposed to support high-radix operation with a one-clock-cycle delay in the dataflow.
Finally, a hardware-reuse architecture is proposed to reduce the hardware cost, and a
parallel radix-16 data path design is proposed to increase the speed. The
implementation results show that the Montgomery multiplier costs 130 Kgates in total,
the clock frequency is 180 MHz, and the throughput of 1024-bit RSA encryption is
352 Kbps. This design is suitable for high-speed RSA or ECC encryption and decryption.
As a scalable design, it supports encryption and decryption with any key length up to
the size of the on-chip memory.
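For readers unfamiliar with the operation that the RSA hardware accelerates, the following Python sketch shows the plain radix-2 Montgomery product and a modular exponentiation built on it. It is the textbook bit-serial form, not the modified high-radix algorithm or the radix-16 data path proposed in this dissertation, and the function names are illustrative.

```python
def montgomery_product(a, b, n, k):
    """Return a * b * 2^(-k) mod n for odd n < 2^k and 0 <= a, b < n."""
    s = 0
    for i in range(k):
        s += ((a >> i) & 1) * b     # add b if bit i of a is set
        if s & 1:                   # make s even by adding the odd modulus
            s += n
        s >>= 1                     # exact division by 2
    return s - n if s >= n else s


def montgomery_modexp(base, exponent, n):
    """Compute base^exponent mod n (n odd) using only Montgomery products."""
    k = n.bit_length()
    r2 = pow(2, 2 * k, n)                               # R^2 mod n, with R = 2^k
    x = montgomery_product(base % n, r2, n, k)          # base in Montgomery form
    acc = montgomery_product(1, r2, n, k)               # 1 in Montgomery form
    for bit in bin(exponent)[2:]:                       # left-to-right square and multiply
        acc = montgomery_product(acc, acc, n, k)
        if bit == "1":
            acc = montgomery_product(acc, x, n, k)
    return montgomery_product(acc, 1, n, k)             # leave Montgomery form


# Small RSA-style example with n = 61 * 53.
assert montgomery_modexp(65, 17, 3233) == pow(65, 17, 3233)
```

The loop replaces the trial division of an ordinary modular multiplication by additions and shifts, which is why it maps well onto hardware and why the operand length can be scaled by changing only the number of iterations.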

To enhance the security of the AES encryption module, especially against DPA attacks,
we proposed five DPA countermeasure methods: Register Masking, S-Box Masking, SubBytes
Hiding, Independent ARK and Data Sliding. Combining these methods, an ultra low-cost
AES design with multiple DPA countermeasures is proposed. The DPA attack experimental
results show that the proposed methods successfully prevent the DPA attack.

In conclusion, an efficient video encryption scheme for the H.264/AVC video coding
standard and the hardware implementation of the encryption module are presented in
this dissertation. The proposed design is useful for secure video communication
systems.


Reference

[1] ISO/IEC 11172, Information technology coding of moving pictures and associated

audio for digital storage media at up to about 1.5Mbit/s, 1993 (MPEG-1).

[2] ISO/IEC 13818, Information technology: generic coding of moving pictures and

associated audio information, 1995 (MPEG-2).

[3] ISO/IEC 14496-2, Coding of audio-visual objects Part 2: visual, 2001.

[4] ISO/IEC 15938, Information technology multimedia content description interface

(MPEG-7), 2002.

[5] ISO/IEC 21000, Information technology multimedia framework (MPEG-21), 2003.

[6] ITU-T Recommendation H.261, Video CODEC for audiovisual services at px64

kbit/s, 1993

[7] ITU-T Recommendation H.263, Video coding for low bit rate communication,
Version 2, 1998.

[8] ISO/IEC 14496-10 and ITU-T Rec. H.264, Advanced Video Coding, 2003.

[9] X. Liu and A.M. Eskicioglu "Selective Encryption of Multimedia Content in

Distribution

Conference on Communications, Internet and Information Technology (CIIT 2003),

Scottsdale, AZ, November 17-19, 2003.

[10]

International Journal on Computer and Graphics, Special Issue on Data Security in

Image Communication and Network, 22(3), 1998.

[11]

C, Ch. 3, pp. 93-131.

December 2004.

[12]


Comprehensive Report on Information Security, International Engineering

Consortium, Chicago, IL, 2003.

[13]T. Lookabaugh, D. C. Sicker, D. M. Keaton,

Analysis of Selectively Encrypted MPEG-

Applications VI Conference, Orlando, FL, September 7-11, 2003.

[14]

Example MPEG-

Berlin, Germany, May 1995.

[15] T.B. Maples and G.A. Spanos, "Performance study of selective encryption scheme
for the security of networked real-time video," in Proceedings of the 4th International
Conference on Computer and Communications, Las Vegas, NV, 1995.

[16]G.A. Spanos and T.B. Maples, "Security for Real-Time MPEG Compressed Video in

Distributed Multimedia Applications," in Conference on Computers and

Communications, 1996, pp. 72-78.

[17]L.

Proceedings of the 4th ACM International Multimedia Conference, Boston, MA,

November 18-22, 1996, pp. 219-230.

[18]

Proceedings of the 1st International Conference on Imaging Science, Systems and

-29.

[19]

of the 6th International Multimedia Conference, Bristol, UK, September 12-16, 1998.

[20]C. Shi, S.- -Time Using

Processing Techniques and Applications (PDPTA'99), Las Vegas, NV, June 28 - July

1, 1999.


[21]A. M. Alattar, G. I. Al-Regib and S. A. Al-

techniques for Secure Transmission of MPEG Video Bit-

the 1999 International Conference on Image Processing (ICIP '99), Vol. 4, Kobe,

Japan, October 24-28, 1999,pp. 256-260.

[22]

Information Security (ISW 99), Kuala Lumpur, Malaysia, November 1999, Lecture

Notes in Computer Science, Vol. 1729, pp. 191-201, 1999.

[23]

Transactions on Signal Processing, 48(8), 2000, pp. 2439-2451.

[24]A.S. Tosun, -layer coding and encryption of MPEG

July 2000, pp. 119 122.

[25] -Compliant

Configurabl

Transactions of Circuits and Systems for Video Technology, Vol. 12, No. 6, June

2002, pp. 545-557.

[26]

ctions on Multimedia, Vol. 5, No. 1, March 2002, pp. 118-129.

[27] -

International Workshop on Multimedia Signal Processing, St. Thomas, US Virgin

Islands, December 9-11, 2002.

[28]L.S. Choon, -effective MPEG video

Technologies: From Theory to Applications, 19-23 April 2004 pp.525 526.


[29]

Proceedings of the 12th annual ACM international conference on Multimedia, New

York, USA, October 10-16, 2004, pp.304-307.

[30]

MPEG Com

Communications and Computer Sciences, Volume E89-A, Issue 1, January 2006,

pp.194-202.

[31]

based on event shuffl -Pacific

Conference on Circuits and Systems, Volume 2, 6-9 Dec. 2004, pp.761-764.

[32]

information security,

Huistenbosch, Japan, 23-25 January, 2007, 3B3-1.

[33]C.-P. Wu and C.-

Boston, MA, November 2000, pp. 284-295.

[34]C.-P. Wu and C.-

Volume 4314, San Jose, CA, January 2001.

[35]I. K. Cheong, Y. C. Hung, Y. S. Tung, S. R. Ke

electronics, 8-12 January 2005, pp.61-62.

[36]National Institute of Standards and Technology (U.S.). Data Encryption Standard

(DES). FIPS Publication 46-3, NIST, 1999.

[37]National Institute of Standards and Technology (U.S.). Advanced Encryption

Standards (AES). FIPS Publication 197, 2001.

[38] R. L. Rivest, A. Shamir and L. Adleman, "A method for obtaining digital
signatures and public key cryptosystems", Communications of the ACM, 21 (1978),
pp. 120-126.

[39]

[40]

ypt 91, 1991,

pp.17-38.

[41]

Proceedings of the Symposium on Network and Distributed Systems Security, IEEE,

1996.

[42]L. Qiao, K. Nahrstedt, and I. Tam, "Is MPEG Encryption by Using Random List

Instead of Zigzag Order Secure?" IEEE International Symposium on Consumer

Electronics, December 1997. Singapore.

[43]

2002, available at http://raidlab.cs.purdue.edu/papers/mm.ps.

[44]

Cryptology TATRACRYPT 2003, Bratislava, Slovak Republic, 2003.

[45]A. Alattar and G. Al-

secure transmission of MPEG video bit-

International Symposium on Circuits and Systems, vol. 4, pp IV-340-IV-343, 1999.

[46]Ia -4 Video Compression, Video coding for

next- -223.

[47]

nsactions on Circuits and Systems for Video

Technology, Volume 13, Issue 7, July 2003, pp.560 - 576.

[48]

Architecture with S- - ASIACRYPT


2001, 7th International Conference on the Theory and Application of Cryptology and

Information Security, Gold Coast, Australia, December 9-13, 2001, pp.239 254.

[49] -

Embedded Systems CHES, September 2005, pp.441 455.

[50]

Systems - CHES 2004, Volume 3156, 2004, pp.357-370.

[51]OneSeg in Japan. http://en.wikipedia.org/wiki/Oneseg

[52]P. Kocher, J. Jaffe, and B. Jun. Differential power analysis. In M. Wiener, editor,

in Computer Science, pages 388 397, Santa Barbara, CA, USA, August 15-19 1999.

Springer-Verlag.

[53]S. Mangard, E. Oswald, and T. Popp Power Analysis Attacks: Revealing the secrets

of smart card published by Springer, 2007.

[54]Eric Brier, Christophe Clavier, and Francis Olivier, "Optimal Statistical Power

Analysis", Cryptology ePrint Archive, http://eprint.iacr.org/2003/152.pdf

[55]M. L. Akkar, C. Giraud

Proceedings International Workshop on Cryptographic Hardware and

Embedded Systems (CHES 2001), pp.309-318, 2001.

[56]M. Joye, P. Paillier, B. Schoenmakers, On second-order differential power analysis,

Proceedings International Workshop on Cryptographic Hardware and Embedded

Systems (CHES 2005), pp.293-308, 2005.

[57]H.264 in wikipedia. http://en.wikipedia.org/wiki/H.264

[58] Yibo Fan, Jidong Wang, T. Ikenaga, S. Goto, "Mixed bus width architecture for low
cost AES VLSI design", 7th International Conference on ASIC (ASICON), 22-25 Oct. 2007,
pp. 854-857.

[59] SASEBO project in Research Center for Information Security (RCIS),
www.rcis.aist.go.jp/special/SASEBO/

[60]Cryptographic hardware project in TOHOKU University,

http://www.aoki.ecei.tohoku.ac.jp/crypto/

[61]

Unequal Secure Encryption Scheme For H.264/AVC Video Compression Standard

Dat -A, No.1, pp.12-21,

Jan 2008.

[62]Yibo FAN

-Rim conference on multimedia (PCM 2007), 2007.

[63]Jidong Wang, Yibo FAN

2007.

[64]Jidong Wang, Yibo FAN

Scheme for H.264 Format Vi

systems in karuizawa, 23-24 April, 2007.

[65]Yibo FAN

(IPS), pp. 17-20, Taipei, Taiwan, July 2007.

[66]Yibo FAN, Takeshi Ikenaga, Satoshi Goto, "A Low-cost Reconfigurable Architecture

for AES Algorithm", International Conference on Information and Communications

Security (ICICS 2008), Prague, Czech Republic, July 25-27, 2008.

[67]Yibo FAN, Takeshi IKENAGA, Yukiyasu TSUNOO, Satoshi GOTO, "A Low-cost

LSI design of AES against DPA attack by hiding power information", The 21th

workshop on circuits and systems in karuizawa, 2008.

[68] -speed Design of Montgomery

-A, No.4, pp.971-977, April,

2008.


[69]Guoyu QIAN, Yibo FAN, Yukiyasu Tsunoo, Takeshi Ikenaga, Satoshi Goto, "FPGA

& ASIC Implementation of Differential Power Analysis Attack on AES", the 4th

International Conferences on Information Security and Cryptology, Dec. 14-17, 2008

(to be published)

[70]VDEC. http://www.vdec.u-tokyo.ac.jp/

[71] Andreas Uhl, Andreas Pommer, "Image and video encryption, from digital rights

management to secured personal communication", Springer, 1 edition November 4,

2004.

[72] -based RSA crypto-processor

-Pacific

Conference on Advanced System Integrated Circuits, pp.218-221, Aug.4-5, 2004.

[73]

computation, vol.44, no.170, pp.519- 521, April 1985.

[74] comparing Montgomery

-33, June 1996.

[75]

.

1215 1221, Sep. 2003.

[76] -Radix Design of a Scalable Modular

-CHES 2001, Lecture

Notes in Computer Science, no.2162, pp.189-205, May 13-16, 2001.

[77]G. Todorov -radix

[78] -16

design of a scalable Montgomery multi

ASIC-ASICON 2005, vol.1, 24-27, pp.153-157, Oct. 2005.

[79]


Computer Arithmetic, pp.172-178, June 27-29, 2005.

[80]

International Workshop on System-on-Chip for Real-Time Applications, pp.400- 404,

July 2005.


Publications

International Journal

[1] Yibo FAN

Transaction on Electronics, Vol.E91-C, No.4, pp.440-448, April, 2008.

[2] Yibo FAN, Takeshi Ikenaga, Satoshi -speed Design of Montgomery

-A, No.4, pp.971-977, April,

2008.

[3] Yibo FAN

Unequal Secure Encryption Scheme For H.264/AVC Video Compression Standard

-A, No.1, pp.12-21,

Jan 2008.

International Conference (with review)

[1] Guoyu QIAN, Yibo FAN, Yukiyasu Tsunoo, Takeshi Ikenaga, Satoshi Goto, "FPGA

& ASIC Implementation of Differential Power Analysis Attack on AES", the 4th

International Conferences on Information Security and Cryptology, Dec. 14-17, 2008

(to be published)

[2] Yibo FAN, Takeshi Ikenaga, Satoshi Goto, "Optimized 2-D SAD Tree Architecture of

Integer Motion Estimation for H.264/AVC", 16th IFIP/IEEE international conference

on very large scale integration (VLSI-SoC 2008), Rhodes Island, Greece, Oct. 13-15,

2008.

[3] Yibo FAN, Takeshi Ikenaga, Satoshi Goto, "Fast VBSME design

using reconfigurable hardware achitecture and search range reduction algorithm", The

10th IASTED International Conference on Signal and Image Processing (SIP 2008),
Kailua-Kona, Hawaii, August 18-20, 2008.

[4] Yibo FAN, Takeshi Ikenaga, Satoshi Goto, "A Low-cost Reconfigurable Architecture

for AES Algorithm", International Conference on Information and Communications

Security (ICICS 2008), Prague, Czech Republic, July 25-27, 2008.

[5] Yibo FAN

-Rim conference on multimedia (PCM 2007), 2007.

[6] Yibo FAN

ASIC (ASICON 2007), 2007.

[7] Jidong Wang, Yibo FAN, Takeshi Ikenaga, Satoshi G

2007.

[8] Yibo FAN

hop on SOC

(IPS), pp. 17-20, Taipei, Taiwan, July 2007.

Domestic Conference (with review)

[1] Yibo FAN, Takeshi IKENAGA, Yukiyasu TSUNOO, Satoshi GOTO, "A Low-cost

LSI design of AES against DPA attack by hiding power information", The 21th

workshop on circuits and systems in karuizawa, 2008.

[2] Yibo FAN -Speed Design of Montgomery

-24 April,

2007.

[3] Jidong Wang, Yibo FAN fficient Encryption

systems in karuizawa, 23-24 April, 2007.


Domestic Conference (without review)

[1] Yibo FAN, Jidong WANG, Takeshi IKENAGA, Yukiyasu TSUNOO, Satoshi GOTO,

"Hardware Evaluation of eSTREAM Stream Cipher Candidates in Phase 3 Profile 2:

Moustique, Pomaranch and Decim v2", Symposium on Cryptography and

Information Security (SCIS), 2008.

[2] Yibo FAN, Xiaoyang Zeng, Takeshi Ikenaga, Satoshi Goto, "Hardware Reuse

Architecture for High-Radix Scalable Montgomery Multiplier", 2E2-1, Symposium

on Cryptography and Information Security (SCIS2007), Jan. 2007.

[3] Jidong Wang, Yibo FAN, Xiaoyang Zeng, Takeshi Ikenaga, Satoshi Goto, "No

Compression Ratio Reduction H.264 Video Scrambling", 3B3-1, Symposium on

Cryptography and Information Security (SCIS2007), Jan. 2007.
