Algorithm and Hardware Design of Encryption Scheme for H.264/AVC
FAN, Yibo
Graduate School of Information, Production and Systems
Waseda University
February 2009
Abstract
H.264, also known as MPEG-4 Part 10 or AVC (Advanced Video Coding), is the latest
international video coding standard, proposed in 2003. Currently, few encryption schemes
have been proposed for the H.264/AVC standard; most existing schemes were designed for
previous video coding standards, such as MPEG-1, MPEG-2/H.262, MPEG-4 and H.263.
This dissertation presents a new video encryption scheme for H.264/AVC, together with
the hardware design of the encryption module. Its contributions fall into three parts: 1)
The proposed video encryption scheme provides higher security at lower computational
cost. 2) The proposed scalable hardware architecture for the encryption module achieves
high scalability, so it can be used in a wide range of video systems. 3) The five proposed
DPA countermeasure methods can be applied in the encryption module to prevent DPA
attacks.
This dissertation consists of seven chapters, as follows:
Chapter 1 [Introduction] introduces the basic concepts of video coding systems,
video encryption methods and cryptographic algorithms. The H.264 video coding standard,
selective video encryption methods and the AES algorithm are also introduced in this
chapter.
Chapter 2 [Selective Video Encryption Schemes] describes recently proposed video
encryption schemes and some of the encryption algorithms used in them. The basic idea of
selective encryption is to encrypt one part of the video data and leave the rest
unencrypted. The security of selective encryption is low; however, it saves a great deal
of computational cost. A brief survey is provided to show clearly the differences between
these schemes. Three main problems of the existing schemes are discussed: the security
problem, the computation problem, and the feasibility problem.
Chapter 3 [Unequal Secure Encryption Scheme for H.264/AVC] describes the
proposed Unequal Secure Encryption (USE) scheme for H.264/AVC. The purpose of this
scheme is to reduce the computational cost while keeping a high security level. The main
idea of the USE scheme is to use a high-security algorithm to encrypt the important data
partition and a low-security algorithm to encrypt the unimportant data partition. All of
the data are encrypted to improve security. The new ideas of the proposed scheme are
listed as follows:
1) Data classification methods. Three data classification methods are proposed: Data
Partitioning, FMO and Parameter Extraction. Each method targets different coding
profiles of H.264/AVC.
2) Definition of multiple security levels. Four security levels are defined to trade off
security against computational cost. For security level 0, the computational cost is
only 18% of full encryption, and for level 3 it is 50% of full encryption. Compared to
other selective encryption schemes, our scheme achieves much lower computational cost
while still encrypting 100% of the video data.
3) Hybrid encryption module. This module includes two encryption functions: AES
encryption for the important data partition and FLEX encryption for the unimportant
data partition. Our proposed FLEX algorithm achieves 5 times the throughput of AES,
and it can reuse the AES hardware.
Chapter 4 [Hardware Design of AES & RSA] presents the proposed hardware
architectures for the AES and RSA algorithms. For AES, since the performance requirements
of different video applications vary widely, the scalability of the hardware design
becomes very important. Parallel data paths and configurable hardware modules are used
to achieve high scalability. The experimental results show that the throughput of the
lowest-cost AES implementation, which uses 1 S-Box and 1 MixColumn, is 75 Mbps, while
the highest-cost AES, with 20 S-Boxes and 4 MixColumns, reaches 2.4 Gbps. Our design
offers a new approach to the scalable hardware design of AES. Unlike other AES
architectures, which are not scalable, it can be used to design AES under various
performance specifications. As a result, it is well suited to video encryption systems.
For RSA, firstly, a modified scalable high-radix Montgomery algorithm is proposed to
reduce the critical path. Secondly, a high-radix clock-saving dataflow is proposed to
support high-radix operation and one clock cycle delay in the dataflow. Finally, a
hardware-reuse architecture is proposed to reduce the hardware cost, together with a
parallel radix-16 design of the data path. The implementation results show that the total
cost of the Montgomery multiplier is 130 KGates, the clock frequency is 180 MHz and the
throughput of 1024-bit RSA encryption is 352 Kbps.
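As background for the RSA part of Chapter 4, the following is a minimal Python sketch of textbook Montgomery multiplication, with a square-and-multiply modular exponentiation built on it. It is not the modified scalable high-radix algorithm proposed in this dissertation. The function names (`montgomery_setup`, `mont_mul`, `mont_pow`) and the choice of R = 2^k, with k the bit length of the modulus, are assumptions made only for illustration.

```python
def montgomery_setup(n):
    """Precompute constants for an odd modulus n, using R = 2**k > n."""
    if n % 2 == 0:
        raise ValueError("Montgomery reduction needs an odd modulus")
    k = n.bit_length()
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r  # satisfies n * n_prime == -1 (mod R)
    return k, r, n_prime


def mont_mul(a, b, n, k, r, n_prime):
    """Return a * b * R^-1 mod n for a, b < n, without dividing by n."""
    t = a * b
    m = ((t & (r - 1)) * n_prime) & (r - 1)  # m = t * n' mod R
    u = (t + m * n) >> k                     # t + m*n is a multiple of R
    return u - n if u >= n else u            # u < 2n, so one subtraction is enough


def mont_pow(base, exponent, n):
    """Compute base**exponent mod n by left-to-right square-and-multiply."""
    k, r, n_prime = montgomery_setup(n)
    r2 = (r * r) % n
    x = mont_mul(base % n, r2, n, k, r, n_prime)  # base in Montgomery form
    acc = r % n                                   # 1 in Montgomery form
    for bit in bin(exponent)[2:]:
        acc = mont_mul(acc, acc, n, k, r, n_prime)
        if bit == "1":
            acc = mont_mul(acc, x, n, k, r, n_prime)
    return mont_mul(acc, 1, n, k, r, n_prime)     # convert back to normal form
```

For example, `mont_pow(65, 17, 3233)` should agree with Python's built-in `pow(65, 17, 3233)`. The point of the technique is that the reduction step uses only masking and shifting by a power of two, which is why it maps well to hardware. A high-radix design such as the one described above processes several bits of the operand per clock cycle instead of the whole product at once.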
Chapter 5 [DPA Attack on AES] introduces side-channel attack methods, especially
the Differential Power Analysis (DPA) attack. The DPA attack was proposed by Paul Kocher
in 1998; it can recover the secret key of a cryptographic device by collecting the power
consumption of the device. It poses a serious threat to the security of cryptographic
devices. The detailed attack procedure on AES and some recently proposed countermeasure
methods are also discussed in this chapter.
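To make the idea of key recovery from power consumption concrete, here is a small simulated sketch in Python. It is an illustration under simplifying assumptions, not the attack procedure used in this dissertation. It uses the correlation-based variant of the attack (CPA), it models leakage as the Hamming weight of one S-Box output byte plus Gaussian noise instead of real measured traces, and all names (`simulate_traces`, `recover_key_byte`, the noise level, the trace count) are hypothetical.

```python
import math
import random


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """Build the AES S-Box: field inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 1
        for _ in range(254):       # x^254 is the inverse of x (0 maps to 0)
            inv = gf_mul(inv, x)
        y = inv
        for shift in range(1, 5):  # XOR with four left rotations of inv
            y ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(y ^ 0x63)
    return sbox


SBOX = build_sbox()
HW = [bin(v).count("1") for v in range(256)]  # Hamming weight table


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def simulate_traces(key_byte, count, noise_sigma, rng):
    """Simulated leakage (assumption): HW(SBOX[p ^ k]) plus Gaussian noise."""
    plaintexts = [rng.randrange(256) for _ in range(count)]
    traces = [HW[SBOX[p ^ key_byte]] + rng.gauss(0.0, noise_sigma)
              for p in plaintexts]
    return plaintexts, traces


def recover_key_byte(plaintexts, traces):
    """Return the key guess whose predicted leakage correlates best."""
    def score(guess):
        predicted = [HW[SBOX[p ^ guess]] for p in plaintexts]
        return pearson(predicted, traces)
    return max(range(256), key=score)
```

A possible use is `recover_key_byte(*simulate_traces(0x3C, 2000, 1.0, random.Random(1)))`, which ranks all 256 guesses for one key byte. Only the correct guess predicts the simulated leakage well, because the S-Box is highly nonlinear. Hiding and masking countermeasures aim to break exactly this correlation between predicted and measured power.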
Chapter 6 [AES Design with DPA Countermeasure] presents our proposed AES
designs with DPA countermeasures. A hybrid countermeasure solution is proposed that
includes five methods: Independent ARK, Data Sliding, Subbyte Hiding, Simplified S-Box
Masking and Registers Masking. The theoretical analysis shows that our solution
increases the complexity of a DPA attack to 212N times. In this way, even if one or two
countermeasure methods are cracked, the remaining methods can still prevent a successful
attack. There are few papers on the hardware design of DPA countermeasure methods; in
this dissertation, their detailed hardware implementation is presented. Moreover, an
ultra-low-cost AES with the five proposed countermeasure methods is designed for
real-time video encryption. A test chip that includes four AES cores is implemented in
the VDEC project (ROHM 0.18 um, chip size 2.5 mm × 2.5 mm):
1) AES0: Pure AES design without any countermeasure methods. It achieves the lowest
hardware cost (4678 gates) with adequate throughput (51 Mbps) and clock frequency
(80 MHz).
2) AES1.0: AES design with Independent ARK and Data Sliding. The hardware cost is 5500
gates, the clock frequency reaches 125 MHz, and the throughput is 75 Mbps.
3) AES1.1: AES design with Subbyte Hiding. The hardware cost is 6244 gates, and the
clock frequency and throughput are the same as for AES1.0.
4) AES1.2: AES design with Simplified S-Box Masking and Registers Masking. The
hardware cost is 6834 gates, the clock frequency is reduced to 75 MHz, and the
throughput is reduced to 45 Mbps.
To evaluate the effectiveness of the proposed countermeasure methods, a DPA attack
system based on the SASEBO board is designed. The DPA attack experiments show that the
AES designs with our proposed countermeasure methods (AES1.0, AES1.1, AES1.2) can
successfully prevent DPA attacks.
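The clock frequency and throughput figures quoted above can be related by a simple back-of-the-envelope calculation. The sketch below assumes that throughput is counted as 128-bit AES blocks completed per second with no I/O overhead, so that throughput = block bits × clock frequency / cycles per block. That accounting is an assumption, as is the helper name `cycles_per_block`; the cycle counts reported in Chapter 6 itself take precedence.

```python
AES_BLOCK_BITS = 128


def cycles_per_block(clock_mhz, throughput_mbps, block_bits=AES_BLOCK_BITS):
    """Clock cycles per block implied by throughput = bits * f / cycles."""
    return block_bits * clock_mhz / throughput_mbps


# (clock in MHz, throughput in Mbps) as quoted in the abstract
cores = {
    "AES0": (80, 51),
    "AES1.0": (125, 75),
    "AES1.1": (125, 75),
    "AES1.2": (75, 45),
}

for name, (clock_mhz, throughput_mbps) in cores.items():
    cycles = cycles_per_block(clock_mhz, throughput_mbps)
    print(f"{name}: about {cycles:.0f} cycles per 128-bit block")
```

Under this assumption all four cores come out at roughly 200 to 215 cycles per block, which is consistent with a narrow, byte-serial data path. It also suggests that the throughput differences between the cores follow mainly from their clock frequencies.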
Chapter 7 [Conclusion] summarizes the contributions of this dissertation.
Keywords
Video Encryption, H.264/AVC, Unequal Secure Encryption, AES, RSA, Side-channel
Attack, Differential Power Analysis, Low cost, Scalable Architecture, VLSI, Sasebo
Board
Acknowledgements
First of all, I would like to thank Professor Satoshi Goto for his guidance,
instruction and support during my research. He advised me to set up a research goal and
to achieve it step by step. What I learned from him will be the most valuable asset in my
life. I also thank Professor Takeshi Ikenaga for his continuous support, instruction and
insightful comments on my work. He gave me a great deal of valuable and helpful advice on
detailed technical problems. I also express my appreciation to Professor Yoshimura for
his continuous support, encouragement and insightful comments throughout my research
work.
I also thank Dr. Tsunoo (NEC Central Research Lab) for advising me on cryptography
research. His great knowledge of cryptography helped me to find the right research
directions and showed me how to continue my work. Thanks to Mr. Kimura (Y.D.K.
Corp.), Mr. Nozawa (Y.D.K. Corp.) and Mr. Syouji (Y.D.K. Corp.) for helping me to use
the Sasebo Board.
I also thank Mr. Jidong Wang and Mr. Guoyu Qian for working with me on video
encryption and side-channel attacks. Thanks to the graduates of Goto Lab: Dr. Yang Song,
Dr. Lingfeng Li, Dr. Shen Li and Dr. Jing Wang. Discussions with you gave me great
inspiration in my research work. Thanks to all the students of Goto Lab; you make my
life joyful. Thanks also go to all of my friends; I appreciate every moment with you.
Finally, I would like to thank my family for their unconditional support and love.
Contents
Abstract
Acknowledgements
List of Tables
List of Figures
List of Notations
1 Introduction
1.1 Video Compression
1.2 Video Encryption
1.3 Cryptography
1.4 Our Contributions and Dissertation Organization
2 Selective Video Encryption Schemes
2.1 Visual Data Formats
2.1.1 Video Sequence
2.1.2 Coded video stream format
2.2 Conventional video encryption methods
2.2.1 Cryptography based video encryption
2.2.2 Permutation based video encryption
2.3 A Survey of selective video encryption schemes
2.4 Problems of current video encryption scheme
2.5 Conclusion
3 Unequal Secure Encryption (USE) Scheme for H.264
3.1 Introduction of H.264
3.2 USE Scheme for H.264/AVC
3.2.1 Data Partition Methods
3.2.2 Data Partition Methods
3.2.3 Security levels
3.2.4 Encryption Methods
3.3 Comparison
3.4 Conclusion
4 Hardware Design of Encryption Accelerator
4.1 Hardware Design of AES
4.1.1 Introduction of AES Algorithm
4.1.2 Existing low-cost implementations of AES
4.1.3 Proposed Scalable Hardware Architecture for AES
4.1.3.1 Top Level Architecture
4.1.3.2 Two typical subclass architectures
4.1.3.3 Sub-
4.1.4 Performance Analysis
4.1.4.1 Scalability
4.1.4.2 Dataflows
4.1.4.3 Hardware Implementation
4.2 Hardware Design of RSA
4.2.1 Introduction of RSA Algorithm
4.2.2 Proposed Optimized Algorithm
4.2.3 Proposed Optimized Data Flow
4.2.4 Proposed Hardware Architecture for RSA
4.2.5 Performance Analysis
4.3 Conclusion
5 DPA Attack on AES
5.1 Introduction of Differential Power Analysis attack
5.1.1 Power Consumption of CMOS Circuit
5.1.2 Power Model
5.1.3 Hypothetical Power Consumption based on HD model: Case study
5.1.4 Differential Power Analysis Attacks
5.2 DPA attack on AES
5.2.1 DPA attack on AES: An Example
5.2.2 DPA attack on AES: A successful attack and a failed attack
5.3 Conventional Countermeasure Methods
5.4 Conclusion
6 AES Design with DPA Countermeasure
6.1 Proposed DPA Countermeasure methods for AES
6.1.1 Register Masking
6.1.2 S-Box Masking
6.1.3 Subbytes Hiding
6.1.4 Independent ARK and Data Sliding
6.1.5 Time Complexity Analysis
6.2 Ultra Low-cost Design of AES with DPA Countermeasure
6.2.1 Specification
6.2.2 Hardware Architecture
6.2.3 Data Flow
6.2.4 Implementation
6.3 DPA Attack Evaluation Environment
6.3.1 DPA attack platform
6.3.2 Sasebo Board
6.3.3 Test Flow
6.4 Experiment Results of DPA Attack
6.5 Chip Design
6.6 Conclusion
7 Conclusion
Reference
Publications
International Journal
International Conference (with review)
Domestic Conference (with review)
Domestic Conference (without review)
List of Tables
Table 2.1 A survey of selective video encryption schemes
Table 3.1 Security levels in the USE scheme
Table 3.2 Video data partition size
Table 3.3 Video data partition for different security levels
Table 3.4 Comparison with other video encryption schemes
Table 4.1 Hardware cost of 32-bit AES @ 131 MHz, 0.11 um [48]
Table 4.2 Hardware cost of 8-bit AES @ 100 kHz, 0.35 um [50]
Table 4.3 Bit width of operations in AES algorithm
Table 4.4 Comparison of two architectures
Table 4.5 Possible implementations of AES based on scalable architecture
Table 4.6 Hardware cost of lowest cost AES @ 123 MHz, 0.18 um
Table 4.7 Hardware cost of highest performance AES @ 416 MHz, 0.18 um
Table 4.8 Scalability of hardware implementations
Table 4.9 [...]
Table 4.10 [...] -7, 8]
Table 4.11 M0 to InvM
Table 4.12 Clock cycles comparison of different dataflows
Table 4.13 [...]
Table 5.1 Power consumption of four transitions in a circuit
Table 6.1 Summary of different countermeasure methods
Table 6.2 Comparison of time complexity for each countermeasure method
Table 6.3 Max bit-rate and resolution of selected H.264 levels
Table 6.4 AES0 @ 80 MHz, TSMC 0.18 um
Table 6.5 AES1.1 @ 125 MHz, TSMC 0.18 um
Table 6.6 AES1.0 @ 125 MHz, TSMC 0.18 um
Table 6.7 AES1.2 @ 75 MHz, TSMC 0.18 um
Table 6.8 VDEC Test Chip
List of Figures
Figure 1.1 Video encoder/decoder system
Figure 1.2 Video encoder
Figure 1.3 Video decoder
Figure 1.4 Secure Video System
Figure 2.1 Video Sequence: I, P, B Frames and I, P, B MBs
Figure 2.2 Coded Video Stream Format
Figure 2.3 Samples of Selective Encryption
Figure 3.1 H.264 Baseline, Main and Extended Profiles
Figure 3.2 H.264/AVC data format
Figure 3.3 Unequal Secure Encryption Scheme
Figure 3.4 Data Partition in H.264/AVC Extended Profile
Figure 3.5 Data Partition by FMO
Figure 3.6 Data Partition by Parameters Extraction
Figure 3.7 FLEX encryption algorithm
Figure 3.8 Leak position in the even and odd rounds
Figure 3.9 XOR Method
Figure 3.10 Comparison of security and computational complexity
Figure 4.1 Dataflow. (a) Encryption. (b) Decryption
Figure 4.2 Transformations in AES algorithm
Figure 4.3 32-bit architecture for AES
Figure 4.4 8-bit architecture for AES
Figure 4.5 Scalable Hardware Architecture for AES
Figure 4.6 Shared S-Box Architecture
Figure 4.7 Unified S-Box Architecture
Figure 4.8 S-Box structure
Figure 4.9 MixColumns structure
Figure 4.10 Dataflows for scalable architecture
Figure 4.11 Comparison with others
Figure 4.12 Optimized Data Flow
Figure 4.13 Proposed Hardware Architecture
Figure 4.14 Implementation of qMj
Figure 4.15 Comparison of dataflow and corresponding data path
Figure 5.1 CMOS Inverter
Figure 5.2 Power consumption of a circuit: Case I
Figure 5.3 Power consumption of a circuit: Case II
Figure 5.4 Last round of AES module
Figure 5.5 2-D views of successful DPA attack
Figure 5.6 3-D views of successful DPA attack
Figure 5.7 2-D views of failed DPA attack
Figure 5.8 3-D views of failed DPA attack
Figure 5.9 Time dimension hiding
Figure 5.10 Amplitude dimension hiding
Figure 5.11 AES after masking
Figure 5.12 S-Box after masking
Figure 6.1 The i-th round of the AES without and with masking countermeasures
Figure 6.2 Proposed Registers Masking
Figure 6.3 Proposed S-Box Masking
Figure 6.4 A power trace of AES
Figure 6.5 Subbytes without and with hiding
Figure 6.6 Hardware design of Subbytes hiding
Figure 6.7 Integrated Subbytes and AddRoundKey
Figure 6.8 Separated Subbyte and AddRoundKey
Figure 6.9 Feedback structure and Data Sliding Structure
Figure 6.10 Ultra low-cost AES with DPA countermeasure
Figure 6.11 Data flow for ultra low-cost AES
Figure 6.12 DPA Attack Evaluation System (Photo)
Figure 6.13 DPA Attack Evaluation System (Architecture)
Figure 6.14 Sasebo Board
Figure 6.15 DPA attack test flow
Figure 6.16 Power trace from oscilloscope
Figure 6.17 2-D view of DPA attack on Pure AES
Figure 6.18 3-D view of DPA attack on Pure AES
Figure 6.19 2-D view of DPA attack on AES with Subbytes hiding
Figure 6.20 3-D view of DPA attack on AES with Subbytes hiding
Figure 6.21 2-D view of DPA attack on AES with masking
Figure 6.22 3-D view of DPA attack on AES with masking
Figure 6.23 2-D view of DPA attack on AES with Independent ARK and Data Sliding
Figure 6.24 3-D view of DPA attack on AES with Independent ARK and Data Sliding
Figure 6.25 Test Chip Architecture
Figure 6.26 Chip design of AES
List of Notations
VOD Video on Demand
AVC Advanced Video Coding
MPEG Moving Picture Experts Group
VCEG Video Coding Experts Group
MB Macro Block
I-MB Intra-coded Macro Block
P-MB Inter-coded Macro Block
B-MB Bi-directional coded Macro Block
MV Motion Vector
MVD Motion Vector Difference
DCT Discrete Cosine Transform
Q Quantization
MC Motion Compensation
FMO Flexible Macroblock Ordering
VLC Variable Length Coding
DES Data Encryption Standard
AES Advanced Encryption Standard
USE Unequal Secure Encryption
FLEX Fast Leakage EXtraction
DPA Differential Power Analysis
CPA Correlation coefficient Power Analysis
GF Galois Field
I.ARK Independent AddRoundKey
D.S. Data Sliding
bps bit-per-second
P Power consumption
Coefficient for power modeling
HD Hamming Distance
HW Hamming Weight
n Noise
R/REG Registers
C/Comb Combinational logic
Inv Inverse Operation
T / t Time
K / k Key
X / Y Random number
~o(DPA) Time complexity of DPA attack on pure AES
Function
AES0 Pure AES
AES1.0 AES + I.ARK&D.S.
AES1.1 AES + Hiding
AES1.2 AES + Masking
SASEBO Side-channel Attack Standard Evaluation Board
1 Introduction
Multimedia is a prominent topic in the current IT era, especially in telecommunications
and on the Internet. Ten years ago, people used text-based tools such as ICQ to
communicate with each other over the Internet, and published information online only as
text or pictures. Things have since changed. We now talk face-to-face over the Internet
using a monitor and a web camera. We use Skype to make Internet calls for free. We share
videos on YouTube and personal photos on Picasa, watch TV on PPStream and enjoy music
on Kugou. All of these applications can be used on the Internet for free, and they are
also available on mobile phones, so content can be enjoyed almost anywhere.
The virtual world becomes more and more attractive because we can sense it, and video
plays the most important role. Popular video applications include VOD (Video on Demand),
which is used to watch movies over the Internet; Pay-TV, which is widely used in
television set-top boxes; and video conferencing. However, video data is very large,
which makes its transmission and storage a problem. To reduce the data size, many video
compression methods have been proposed, such as MPEG-4 and the latest H.264/AVC. On the
other hand, to protect sensitive information in video, many video encryption schemes
have been proposed. Cryptosystems are widely used in many video applications.
1.1 Video Compression
Video makes multimedia applications more attractive. However, an uncompressed video
sequence requires a very high bit rate (approximately 2 Gbps for HDTV 1080p).
Compression is therefore necessary for practical storage and transmission of digital video.
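The bit rate quoted above can be checked with a short calculation. The sketch below is only an illustration: the frame rate (60 frames per second) and the sampling format (4:2:2 at 8 bits per sample, which averages 16 bits per pixel) are assumptions that are not stated in the text, and the function name is hypothetical. Other 1080p formats give different figures of the same order of magnitude.

```python
def raw_bit_rate(width, height, fps, bits_per_pixel):
    """Return the uncompressed video bit rate in bits per second."""
    return width * height * fps * bits_per_pixel


# Assumed format: 1920x1080 pixels, 60 frames/s, 4:2:2 sampling at 8 bits per
# sample, i.e. 16 bits per pixel on average.
rate_1080p = raw_bit_rate(1920, 1080, 60, 16)
print(f"{rate_1080p / 1e9:.2f} Gbps")  # prints "1.99 Gbps"
```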
Video compression research has a long history. Over the past decades, many
international standards have been developed, namely the ISO/IEC MPEG-x series
[1][2][3][4][5] and the ITU-T H.26x series [6][7]. The Moving Picture Experts Group
(MPEG) is a working group of the International Organization for Standardization (ISO)
that develops standards for the compression, processing and representation of moving
pictures and audio. It has been responsible for a series of important standards, such as
MPEG-1 [1] (compression of video and audio for CD playback) and MPEG-2 [2] (storage and
broadcasting of television-quality video and audio). MPEG-4 [3] (coding of audio-visual
objects) is the latest standard that deals specifically with audio-visual coding. MPEG-7 [4]
and MPEG-21 [5] are concerned with multimedia content representation and a generic
multimedia framework, respectively. MPEG is best known for its contribution to audio
and video compression. In particular, MPEG-2 is widely used in digital TV broadcasting
and DVD video, and MPEG Layer 3 audio coding has become very popular for music
storage and sharing.
The Video Coding Experts Group (VCEG) is a working group of the
International Telecommunication Union Telecommunication Standardization Sector
(ITU-T). VCEG has developed a series of standards related to video communication
over telecommunication networks and computer networks, such as H.261 [6] (the first
widely used standard for video conferencing) and the subsequent H.263 [7] and its later
versions. Since 2001, VCEG and MPEG have cooperated, and a new organization, the Joint
Video Team (JVT), was formed. The JVT consists of members of MPEG and VCEG, and its
main purpose is to develop a new video coding standard, Advanced Video Coding (AVC)
[8], which is also known as MPEG-4 part 10 or H.264.
Figure 1.1 Video encoder/decoder system
A basic video encoder/decoder system is shown in Figure 1.1. The camera captures video
sequences and transfers the uncompressed video data to the video encoder. The video
encoder compresses the video data according to a specific coding standard and then
sends the coded video stream to the transmission channel. At the receiver, the video
decoder receives the coded video stream and decodes it according to the same coding
standard, and the display then plays the video sequence.
A conventional video encoder architecture is shown in Figure 1.2. It consists of
three main functional units: a temporal model, a spatial model and an entropy encoder.
The temporal model attempts to reduce temporal redundancy by exploiting the
similarities between neighbouring video frames or between neighbouring macro blocks
(MBs) within one frame. In this figure, there are two predictors: an inter predictor for
inter-frame prediction and an intra predictor for intra-frame prediction. The output of
the temporal model consists of residual data and a set of model parameters, such as
motion vectors, block types and prediction types.
The spatial model makes use of similarities between neighbouring samples in the
residual frame to reduce spatial redundancy. In MPEG-4 Visual and H.264, this is
achieved by a DCT transform followed by quantization. The input of the spatial model is
the residual data produced by the temporal model, and its output is a set of quantized
transform coefficients.
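The data flow through the temporal and spatial models can be sketched for a single block as follows. This is a minimal illustration under stated assumptions: it uses a floating-point textbook DCT-II and a single uniform quantization step, not the integer transform and quantization tables of any particular standard, and all function names are hypothetical.

```python
import math


def dct_1d(values):
    """Orthonormal DCT-II of a list of numbers."""
    n = len(values)
    out = []
    for k in range(n):
        s = sum(values[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out


def dct_2d(block):
    """Separable 2-D DCT: transform the rows, then the columns."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def encode_block(current, prediction, qstep):
    """Temporal model output (residual) followed by the spatial model."""
    residual = [[c - p for c, p in zip(cur_row, pred_row)]
                for cur_row, pred_row in zip(current, prediction)]
    coefficients = dct_2d(residual)
    return [[round(x / qstep) for x in row] for row in coefficients]


# A 4x4 block that differs from its prediction by a constant 4: only the DC
# coefficient of the residual is non-zero after the transform.
prediction = [[100] * 4 for _ in range(4)]
current = [[104] * 4 for _ in range(4)]
levels = encode_block(current, prediction, qstep=8)
```

When the prediction is good, the residual is small and smooth, so most quantized coefficients are zero; this is what the entropy encoder later exploits.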
Figure 1.2 Video encoder.
Figure 1.3 Video decoder.
The entropy encoder compresses the parameters produced by the temporal model and
the transform coefficients produced by the spatial model. The final encoded video stream
consists of header information, parameter information, coded motion vectors and coded
residual data.
A conventional video decoder architecture is shown in Figure 1.3. The decoder is much
simpler than the encoder. The entropy decoder extracts the header, the parameters and
the transform coefficients from the coded video stream. The prediction parameters are
used to reconstruct the prediction data in the motion compensation (MC) module.
Combining the prediction data with the residual data obtained from inverse quantization
and inverse DCT reconstructs the original video sequence.
1.2 Video Encryption
With the increase of multimedia applications, huge amounts of digital visual data are
stored on different media and exchanged over various sorts of networks. As a
consequence, techniques are required to provide security functionalities such as privacy,
integrity or authentication. is aimed towards these emerging
technologies and applications.
To protect video content, the conventional technologies can be classified into three
categories: 1) Encryption technology, which provides end-to-end security when video is
distributed over the Internet or other public communication channels. 2) Watermarking
technology, which provides copyright protection, ownership tracing and authentication.
3) Access control technology, which prevents unauthorized access. In this dissertation,
we focus on video data encryption technology.
Figure 1.4 Secure Video System.
Several dedicated international meetings have emerged as a forum to represent and
transaction, and EURASIP and so on. However, a common video encryption standard
does not exist. A conventional video encryption system is shown in Figure 1.4. Two
security modules are added: a video encryption module for encrypting the video content
and an RSA encryption module for exchanging the secret key.
Several review papers have been published on video encryption, such as the work of Liu
and Eskicioglu in [9] [10] and of Furht, Socek and Eskicioglu in [11]. From these review
papers and other literature available on the Internet, we found that most of the
proposed video encryption schemes are designed for previous video coding standards such
as MPEG-1, MPEG-2/H.262, MPEG-4 and H.263, and that most of them use selective
encryption. Selective encryption encrypts only a portion of the video bit-stream. In
contrast, full encryption encrypts the whole video bit-stream using a specific
encryption algorithm.
Full video encryption has two different approaches: (a) Video scrambling, which
permutes the video in the time domain or the frequency domain; however, it does not
provide substantially high security. (b) Encryption of the entire video data using a
standard cryptographic algorithm; however, the computational cost is very high.
Selective video encryption can be further classified into three types: temporal
domain schemes, spatial domain schemes and entropy coding schemes. Temporal domain
schemes select temporal model parameters such as motion vectors, DCT coefficients, I
blocks, I frames and so on. Most selective encryption methods are based on the
temporal domain [14-32]. Spatial domain schemes make use of spatial model parameters
in the video data. The scheme in [23] uses the quadtree structure of motion vectors and
the quadtree structure of residual errors for video encryption. Entropy coding schemes
use a special entropy codec to perform encryption. The schemes in [33-35] use multiple
Huffman tables and multiple state indices in the entropy encoder.
A detailed introduction to and discussion of selective video encryption schemes is
provided in Chapter 2.
1.3 Cryptography
Before the modern era, cryptography was concerned solely with message
confidentiality. In recent decades, the field has expanded beyond confidentiality concerns
to include techniques for message integrity checking, sender/receiver identity
authentication, digital signatures, interactive proofs, and secure computation, amongst
others. Modern cryptography can be divided into two types: 1) Symmetric key
cryptography and 2) Asymmetric key cryptography.
Symmetric Key Cryptography
The common property of symmetric key cryptography is that a secret key is shared
between the communicating parties. This key is used both as the encryption key and as
the decryption key. Most modern cryptosystems use symmetric key cryptography.
Well-known and widely used modern symmetric cryptosystems include DES [36] (Data
Encryption Standard), IDEA [40], Triple-DES and AES [37] (Advanced Encryption
Standard). Typical key sizes are 56 bits (DES), 128 bits (IDEA, AES), and 192 and 256
bits (AES).
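The defining property, one shared key for both directions, can be illustrated with a toy cipher. The sketch below is not DES, IDEA or AES: it is a stand-in that XORs the data with a keystream derived from SHA-256, chosen only because it runs with the Python standard library. The function names are hypothetical and the construction is not meant for real use.

```python
import hashlib


def keystream(key: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from the shared key (toy construction)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_cipher(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt: the same function and the same key serve both roles."""
    return bytes(d ^ k for d, k in zip(data, keystream(key, len(data))))


shared_key = b"shared secret key"
plaintext = b"coded video stream"
ciphertext = xor_cipher(shared_key, plaintext)
recovered = xor_cipher(shared_key, ciphertext)  # equals plaintext
```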
Asymmetric Key Cryptography
Asymmetric key cryptography is also called public key cryptography. There are two
different keys in this system: a public key, which is publicly known, and a secret key,
which is kept private. Both keys are used for encryption and decryption. If data is
encrypted with a public key, it can only be decrypted with the corresponding secret key,
and vice versa. Well-known asymmetric cryptosystems are RSA [38] (named after Ron
Rivest, Adi Shamir and Leonard Adleman at MIT) and ECC [39] (Elliptic Curve
Cryptography). RSA is the most popular asymmetric cryptosystem, with key lengths of
1024 bits, 2048 bits or longer. ECC is a newer asymmetric cryptosystem that uses smaller
key sizes.
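The relationship between the two keys can be shown with textbook RSA on very small numbers. This is an illustrative sketch only: the primes 61 and 53 and the exponent 17 are assumptions chosen for readability, real RSA uses moduli of 1024 bits or more together with padding, and the function names are hypothetical. The modular inverse via `pow(e, -1, phi)` requires Python 3.8 or later.

```python
# Key generation with toy parameters (far too small for real use).
p, q = 61, 53
n = p * q                  # public modulus, 3233
phi = (p - 1) * (q - 1)    # 3120
e = 17                     # public exponent, coprime with phi
d = pow(e, -1, phi)        # secret exponent, 2753


def rsa_apply(message: int, exponent: int, modulus: int) -> int:
    """Textbook RSA operation: modular exponentiation with either key."""
    return pow(message, exponent, modulus)


m = 65
c = rsa_apply(m, e, n)          # encrypt with the public key
assert rsa_apply(c, d, n) == m  # decrypt with the secret key

s = rsa_apply(m, d, n)          # "vice versa": apply the secret key first
assert rsa_apply(s, e, n) == m  # anyone can invert it with the public key
```

The second pair of operations is the basis of digital signatures, which is why asymmetric ciphers are used for authentication as well as for key exchange.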
Both symmetric key and asymmetric key cryptography have their advantages and
disadvantages. Symmetric ciphers require a much smaller key size for the same level of
security, their computations are much faster and their memory requirements are
smaller. However, since all parties share the same key, they must keep the key
absolutely secret, and this becomes more dangerous as the number of parties involved
increases. In a conventional cryptosystem, asymmetric ciphers are used for key
exchange, authentication, digital signatures and integrity checks, and symmetric
ciphers are used for data encryption. For video encryption, symmetric ciphers are widely
used. Meanwhile, some other scrambling methods with low security are also widely used
in selective video encryption.
A side-channel attack uses side-channel information leaked by a cryptographic device,
such as its power consumption or execution time, to recover the secret key stored in the
device. In particular, differential power analysis (DPA) has been successfully used to
break symmetric ciphers such as DES and asymmetric ciphers such as RSA. It poses a
serious threat to the security of current cryptosystems.
1.4 Our Contributions and Dissertation Organization
In this dissertation, a new video encryption scheme for H.264/AVC and the hardware
design of encryption module are proposed.
An Unequal Secure Encryption (USE) scheme is proposed for the H.264/AVC video
coding standard. The USE scheme has three major targets: security, feasibility and low
computational cost. In the USE scheme, we encrypt the entire video data using
standard cryptography to make the scheme highly secure. We perform all encryption
operations after entropy coding, which separates the encryption system from the video
coding system. In this way, the USE scheme is feasible in any kind of video security
application. The remaining problem is computational cost. Because the computational
cost of encrypting the entire video data is high, we need some optimizations to reduce
it. Here we use two methods: (1) Data classification. We classify the video data into
two partitions: an important data partition and an unimportant data partition. Many new
features in H.264/AVC make this procedure easy to implement. Normally, the important
data partition is smaller than the unimportant one. (2) Unequal secure encryption. We
use AES to encrypt the important data partition and our proposed FLEX to encrypt the
unimportant data partition. FLEX is a cipher based on AES, and its computational cost is
only 20% of that of AES. In this way, we keep the scheme highly secure at low
computational cost.
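The effect of unequal encryption on the total cost can be estimated with a simple model. The sketch below is an illustration under assumptions: it takes the cost of AES per byte as 1, uses the 20% figure stated above for FLEX, and assumes an important-data share of 30%, which is a made-up example value and not a measurement from this dissertation. The function name is hypothetical.

```python
def use_relative_cost(important_fraction: float, flex_cost_ratio: float = 0.2) -> float:
    """Encryption cost of the whole stream relative to encrypting it all with AES.

    important_fraction: share of the data encrypted with AES (cost 1 per byte).
    flex_cost_ratio: cost of the light cipher relative to AES (0.2 in the text).
    """
    if not 0.0 <= important_fraction <= 1.0:
        raise ValueError("important_fraction must be between 0 and 1")
    return important_fraction + (1.0 - important_fraction) * flex_cost_ratio


# Assumed example: 30% of the stream is classified as important.
cost = use_relative_cost(0.30)
print(f"{cost:.2f}")  # prints "0.44": about 44% of the cost of full AES encryption
```

The smaller the important partition, the closer the total cost gets to the 20% lower bound set by the light cipher.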
Hardware architectures for AES and RSA are proposed for low-cost and high-performance
design. For AES, we propose a scalable framework for designing specific AES modules
with different throughput and hardware cost. It is especially useful for low-cost,
low-power AES designs. This architecture has two important features: 1) Scalable S-Box
and scalable MixColumns design. The architecture supports different numbers of S-Box
and MixColumns units. 2) Parallel data path design. The proposed architecture has three
main data paths: the S-Box data path, the MixColumns data path and the AddRoundKey data
path. Each of these data paths has a different bit width. The advantages are high
scalability and a shorter critical path. For RSA, we propose a high-speed Montgomery
multiplier design. It parallelizes the data path and shortens the critical path. By
using the proposed clock-saving dataflow, it reduces the total number of clock cycles
per multiplication to a very small number. Our design achieves very high performance
with low hardware cost.
DPA countermeasure methods for AES design are proposed to counter DPA attacks. Five
countermeasure methods are proposed in this dissertation: 1) Independent ARK separates
the AddRoundKey operation from the other operations in the hardware data path. 2) Data
Sliding scrambles the register data. 3) SubBytes Hiding randomizes the SubBytes
operation in the time domain. 4) Simplified S-Box masking introduces randomization into
the S-Box power consumption. 5) Register masking introduces randomization into the
register power consumption. The five methods can be used together or independently. The
theoretical analysis and experimental results show that the proposed methods greatly
increase security and effectively counter DPA attacks.
An ultra low-cost implementation of AES with DPA countermeasures is proposed for video
encryption. It is based on the proposed scalable architecture and combines all of the
proposed countermeasure methods. Only one S-Box is used in this implementation, and the
hardware cost is extremely low: about 7k gates with the TSMC 0.18 µm standard cell
library. The throughput is about 75 Mbps, which is sufficient for real-time video
encryption. Five countermeasure methods are used, and they are evaluated with our DPA
attack system. A test chip that includes 4 AES implementations is designed.
The rest of this dissertation is organized as follows. A survey of selective video
encryption schemes is given in Chapter 2. The proposed Unequal Secure Encryption
(USE) scheme is presented in Chapter 3. The hardware architectures for AES and RSA are
presented in Chapter 4. The DPA attack and conventional countermeasure methods are
introduced in Chapter 5. The proposed DPA countermeasure methods and the ultra low-cost
implementation of AES with DPA countermeasures are presented in Chapter 6. Finally,
Chapter 7 concludes the dissertation.
2 Selective Video Encryption Schemes
Selective video encryption was proposed to reduce computational cost so that video data
can be encrypted in real time. The basic idea of selective encryption is to encrypt only
a part of the compressed video data, namely the part that is considered important. There
is no clear definition of the importance of video data. Normally, importance is measured
by difficulty: the more difficult it is to reconstruct the video when a piece of data is
lost, the more important that data is considered to be. For example, the header
information and the parameters are much more important than the VLC data in a video
stream. Over the past years, a number of different selective video encryption schemes
for different video coding standards have been proposed. This chapter introduces and
discusses these schemes.
2.1 Visual Data Formats
2.1.1 Video Sequence
A video sequence is usually organized as a series of rectangular arrays called frames,
as shown in Figure 2.1. Figure 2.1 A) shows the frame structure of a video sequence. The
video data is organized frame by frame, and every frame contains a picture. The frames
are played along the time axis to form a video. There are several types of frame:
I-frame, P-frame and B-frame, as shown in B). I-frame is short for Intra frame. An
I-frame is independent of other frames and is encoded by intra-frame prediction. P-frame
is short for Predicted frame, which means that it is predicted from previous frames.
B-frame is short for Bi-directional Predicted frame: both backward and forward frames
are used for prediction. A B-frame has two reference frames, while a P-frame has only
one. To reconstruct a picture, an I-frame can be decoded by itself, whereas B and P
frames need to be combined with their reference frames.
Figure 2.1 C) shows the MB structure in a frame. The MB size varies and depends on the
definition in the specific video coding standard. Normally, there are two kinds of MB:
I-MB and P, B-MB. As with I, P and B frames, an I-MB is an intra MB, which is predicted
from the surrounding MBs in the same frame, while P and B-MBs are inter MBs, which are
predicted from MBs in other frames.
Figure 2.1 Video Sequence: I, P, B Frames and I, P, B MBs.
Figure 2.2 Coded Video Stream Format.
2.1.2 Coded video stream format
A coded video sequence consists of one or more video packets. A video packet is
analogous to a slice or a frame in MPEG-1, MPEG-2 or H.264, and consists of a
resynchronization marker, a header field and a series of coded macro blocks.
As shown in Figure 2.2, the coded video stream is formatted as a layered structure.
The sequence layer defines the properties of the whole video sequence in the sequence
header, which is followed by a string of frames. The frame layer includes the frame
header, which indicates frame properties such as the frame type and whether the frame is
used for prediction, and the MBs in this frame. The MB layer consists of MB parameters
such as the MB type, the MB partitions and the MB prediction method. A P or B MB
includes motion vectors and VLC data (residual data coded by Variable Length Coding),
whereas an I MB includes VLC data only, since it has no motion vectors.
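The layered structure described above can be written down as a small data model. The sketch below is only an illustration of the sequence, frame and MB layers; the class and field names are hypothetical and do not correspond to the syntax elements of any particular standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MacroBlock:
    mb_type: str                   # "I", "P" or "B"
    parameters: Dict[str, int]     # e.g. partition and prediction mode
    vlc_data: bytes                # residual data coded by VLC
    motion_vectors: List[Tuple[int, int]] = field(default_factory=list)


@dataclass
class Frame:
    frame_type: str                # "I", "P" or "B"
    used_for_prediction: bool
    macroblocks: List[MacroBlock] = field(default_factory=list)


@dataclass
class Sequence:
    header: Dict[str, int]         # global properties of the sequence
    frames: List[Frame] = field(default_factory=list)


def vlc_bytes(sequence: Sequence) -> int:
    """Total size of the VLC data in the sequence, in bytes."""
    return sum(len(mb.vlc_data) for f in sequence.frames for mb in f.macroblocks)


seq = Sequence(header={"width": 176, "height": 144}, frames=[
    Frame("I", True, [MacroBlock("I", {"mode": 0}, b"\x01\x02\x03")]),
    Frame("P", False, [MacroBlock("P", {"mode": 1}, b"\x04", [(1, -2)])]),
])
```

A selective encryption scheme can be seen as a rule that picks some of these fields (headers, motion vectors, VLC data of certain MB types) for encryption.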
2.2 Conventional video encryption methods
2.2.1 Cryptography based video encryption
Cryptography based video encryption means that the selected video content is
encrypted by a cryptosystem such as DES or AES. The security depends heavily on the
cryptosystem used; in other words, it rests on an approved and certified algorithm.
Header encryption
Since the header information in a video stream contains many of the parameters needed
to reconstruct the video data, most video encryption schemes make use of headers. There
are several headers in different layers of the video bit stream. The sequence header
contains global parameters for the whole video sequence. The slice header only defines
constraints for the current slice. The MB header is the lowest-level header and consists
of the MB type, MB partitions and so on.
I-frame, I-MB encryption
I-frames and I-MBs are very important for reconstructing a picture, because they can be
reconstructed without any other information. I-frames are widely used to synchronize
video data or to recover broken pictures in a video stream. I-MBs are also very
important for MB reconstruction and are then used for prediction. Since P and B frames
are reconstructed from predictions obtained from I-frames, the main assumption is that
if the I-frames are encrypted, the P and B frames are also expected to be well protected.
Motion Vectors encryption
Motion vectors are used in B and P frames to indicate the prediction position of each
MB. In the decoder, the MC (motion compensation) block uses the motion vectors to locate
the prediction data. Motion vectors comprise about 10% to 20% of the entire video data;
therefore, many high-security encryption schemes tend to encrypt them.
VLC encryption
VLC data makes up the largest part of the video data, about 60% to 80% of the whole.
Most of the proposed video encryption schemes leave it unencrypted to save
computational cost. However, this may pose a serious security problem in the future. VLC
data can also be classified into two types: I-MB VLC and B, P-MB VLC. Some high-security
encryption schemes encrypt the I-MB VLC and leave the B, P-MB VLC unencrypted,
because the I-MB VLC makes up only about 20% of the total VLC data.
2.2.2 Permutation based video encryption
Permutation based video encryption means that the selected video content is encrypted
by permutation, scrambling, shuffling or other simple methods. These methods target
video encryption with low computational cost and low security. Their security is very
weak and has not been approved.
Macroblock Permutation
As discussed in Section 2.1.1, each frame consists of the same number of macroblocks,
and each macroblock contains a piece of the picture. Macroblock permutation exchanges
the order of the macroblocks within a frame. This encryption variant is annoying for the
viewer but not secure, because the originally neighboring macroblocks can be recovered
from the correlation of their border pixels. The risk increases when more frames are
permuted using the same order.
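A keyed macroblock permutation and its inverse can be sketched as follows. This is an illustration under assumptions: the permutation is derived from an integer key with Python's `random` module, which is not a cryptographically secure generator, and the function names are hypothetical. Macroblocks are represented by simple labels.

```python
import random


def permutation_from_key(key: int, count: int) -> list:
    """Derive a reproducible ordering of `count` macroblocks from an integer key."""
    order = list(range(count))
    random.Random(key).shuffle(order)
    return order


def permute(macroblocks: list, order: list) -> list:
    """Scramble: position i of the output holds macroblock order[i]."""
    return [macroblocks[i] for i in order]


def unpermute(scrambled: list, order: list) -> list:
    """Restore the original order at the receiver, which knows the key."""
    original = [None] * len(order)
    for position, index in enumerate(order):
        original[index] = scrambled[position]
    return original


frame = [f"MB{i}" for i in range(8)]
order = permutation_from_key(1234, len(frame))
scrambled = permute(frame, order)
restored = unpermute(scrambled, order)  # equals frame
```

Every macroblock is still present unchanged in the scrambled frame, which is why border-pixel correlation can be used to guess the original neighbors.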
Motion Vectors Permutation
Each predicted P or B-MB has a corresponding motion vector. It is possible to permute
the motion vectors assigned to different macroblocks. However, this only affects the P
and B frames, and in many cases the motion vectors within the same frame have the same
overall direction. Thus, the distortion of the encrypted video is very slight.
DCT Coefficient Permutation
Similar to macroblock permutation, DCT coefficient permutation exchanges the
DCT coefficients within a macroblock. There are two kinds of DCT permutation: DC
coefficient permutation and AC coefficient permutation. Since there are far fewer DC
coefficients than AC coefficients, most of the proposed schemes use DC coefficient
permutation. However, like macroblock permutation, this method is not secure either. It
makes reconstruction difficult, but not impossible.
Sign-bit Masking
Many values in a coded video stream carry a sign bit. For example, a motion vector has
a sign bit to indicate its direction, and a DCT coefficient also has a sign bit. Sign-bit
masking changes the sign of such a value from +1 to -1 or from -1 to +1.
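Sign-bit masking can be sketched as a keyed sign flip. The example below is an assumption-laden illustration: the mask bits come from a toy SHA-256 based generator rather than from any cipher named in the text, the values stand for motion vector components or DCT coefficients, and the function names are hypothetical.

```python
import hashlib


def mask_bits(key: bytes, count: int) -> list:
    """Derive `count` pseudo-random bits from the key (toy generator)."""
    bits = []
    counter = 0
    while len(bits) < count:
        block = hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        for byte in block:
            for shift in range(8):
                bits.append((byte >> shift) & 1)
        counter += 1
    return bits[:count]


def mask_signs(values: list, key: bytes) -> list:
    """Flip the sign of every value whose mask bit is 1; magnitudes are untouched."""
    return [-v if bit else v for v, bit in zip(values, mask_bits(key, len(values)))]


coefficients = [12, -3, 0, 7, -1, 5, -8, 2]
masked = mask_signs(coefficients, b"secret key")
unmasked = mask_signs(masked, b"secret key")  # applying the mask twice restores the input
```

Because only the signs change, the magnitudes of the coefficients remain visible in the encrypted stream.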
In practice, most of the proposed video encryption schemes adopt more than one
encryption method to ensure security. Many schemes also define several security
levels to balance security against computational cost.
2.3 A Survey of selective video encryption schemes
Many selective video encryption schemes have been proposed. Liu and Eskicioglu in [9]
and Furht, Socek and Eskicioglu in [11] presented a comprehensive classification that
includes most of the published selective video encryption algorithms. A recent version
of this classification is shown in Table 2.1 [65]. According to their work, these
encryption schemes can be classified into three types: frequency domain schemes,
spatial domain schemes and entropy coding schemes. Comprehensive surveys of video
encryption techniques are given in [9-13].
Table 2.1 A survey of selective video encryption schemes (adapted from [9][11])
Domain | Proposal | Encryption Algorithm | Encrypted Content | Year & Ref
Frequency Domain | Meyer, Gadegast | DES, RSA | Header, All I-block, Partial I-block, I-frame | 1995 [14]
Frequency Domain | Spanos, Maples | DES | I-frame, Sequence Header, ISO end code | 1995 [15][16]
Frequency Domain | Tang | DES, Permutation | DCT coefficients | 1996 [17]
Frequency Domain | Qiao, Nahrstedt | IDEA, XOR, Permutation | Every other bit of bit-stream | 1997 [18]
Frequency Domain | Shi, Bhargava | XOR | DCT coefficients (Only Sign bit) | 1998 [19]
Frequency Domain | Shi, Wang, Bhargava | IDEA | Motion Vector (Only Sign bit) | 1999 [20]
Frequency Domain | Alattar, Al-Regib, Al-Semari | DES | Header (nth I-macroblock); Header (nth Predicted-macroblock) | 1999 [21]
Frequency Domain | Shin, Sim, Rhee | RC4, Permutation | DCT coefficients (Only Sign bit of I frame) | 1999 [22]
Frequency Domain | Cheng, Li | No algorithm is specified | Significance information (Pixel and set related) in the residual data (Two highest pyramid levels of SPIHT) | 2000 [23]
Frequency Domain | Tosun, Feng | VEA | DC & AC (Lower layer). Classify the coefficients into three layers, the two lower layers are encrypted by VEA | 2000 [24]
Frequency Domain | Wen, Severa, Zeng, Luttrell, Jin | AES, DES | Significance information (FLC codewords, VLC codewords) | 2002 [25]
Frequency Domain | Zeng, Lei | XOR, Permutation | Transform coefficients (JPEG & Wavelet); Motion Vector (JPEG) | 2002 [26]
Frequency Domain | Wu, Mao | Any modern cipher, random shuffling on bit-planes in MPEG-4 FGS | Bitstream; Quantized values (before RLC Coding); RLC symbol; Intra bit-plane shuffling | 2002 [27]
Frequency Domain | Choon, Samsudin, Budiarto | Confusion, Diffusion | Permutation between macroblock and XOR template in macroblock | 2004 [28]
Frequency Domain | Liu, Li, Dong | Permutation | Permutation of macroblocks, DC, AC | 2004 [29]
Frequency Domain | Liu, Ikenaga, Baba, Goto | Event Shuffle, DCEA | DCT coefficients | 2004, 2006 [30][31]
Frequency Domain | Wang, Fan, Ikenaga, Goto | XOR, Permutation | Motion Vector (Only Sign bit); Intra mode; Trailing one of VLC code | 2007 [32]
Spatial Domain | Cheng, Li | No algorithm is specified | Motion Vector (Quadtree structure); Residual errors (Quadtree structure) | 2000 [23]
Entropy Codec | Wu, Kuo | MHT (Multiple Huffman tables) & multiple states | Encryption of data by MHT & multiple state indices in the QM coder | 2000, 2001 [33][34]
Entropy Codec | Cheong, Hung, Tung, Ke, Chen | MHT rotation, XOR | DCT Coefficients; Motion Vectors; MHT | 2005 [35]
Some important selective video encryption schemes include:
SECMPEG [14]
SECMPEG, also called Secure MPEG, was proposed by Meyer and Gadegast in 1995.
It was designed for the MPEG-1 video standard. It defines four levels of security:
1) Encrypt the header data from the sequence layer down to the slice layer.
2) Encrypt the same data as in level 1 and the low-frequency DCT coefficients of all
blocks in I-frames.
3) Encrypt all I-blocks (including I-frames and the I-blocks in B and P-frames).
4) Encrypt the entire video.
The authors chose the DES symmetric cryptosystem for encryption, which was the
natural choice since DES was the official symmetric algorithm at that time. AES can
also be used in SECMPEG for higher security.
Aegis [15][16]
Aegis was proposed by Maples and Spanos in 1995 and was designed for MPEG-1
and MPEG-2. Aegis encrypts the following parts of a video stream: the video sequence
header, the I-frames and the ISO end code. It leaves B and P frames unencrypted. The
encryption engine is also DES. Aegis is very similar to SECMPEG level 3.
VEA [18]
VEA stands for Video Encryption Algorithm, which was developed by Qiao and
Nahrstedt in 1997. It was also designed for the MPEG video standard. This algorithm
encrypts the whole video, which makes it significantly different from other selective
video encryption schemes. The algorithm consists of the following four steps:
1) Let the 2n-byte sequence, denoted by a1, a2, ..., a2n, represent the video frames.
2) Separate the sequence into two lists: an odd list (a1, a3, a5, ..., a2n-1) and an
even list (a2, a4, ..., a2n).
3) XOR the two lists into one list: b1, b2, ..., bn = (a1, a3, ..., a2n-1) xor
(a2, a4, ..., a2n).
4) Apply the chosen symmetric cryptosystem E with the secret key to encrypt either the
odd list or the even list. The ciphertext sequence is {Ekey(a1, a3, a5, ..., a2n-1),
b1, b2, ..., bn} or {Ekey(a2, a4, ..., a2n), b1, b2, ..., bn}.
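The four steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the cipher E is replaced by a toy SHA-256 keystream XOR so that the example runs with the standard library, the variant that encrypts the list a2, a4, ..., a2n is chosen arbitrarily, and all function names are hypothetical.

```python
import hashlib


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Stand-in for the symmetric cryptosystem E (XOR with a SHA-256 keystream)."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))


def vea_encrypt(key: bytes, data: bytes):
    """Return (E_key(a2, a4, ...), b1..bn) for a byte sequence of even length."""
    if len(data) % 2:
        raise ValueError("the sequence must contain 2n bytes")
    list_a1 = data[0::2]   # a1, a3, ..., a2n-1
    list_a2 = data[1::2]   # a2, a4, ..., a2n
    b = bytes(x ^ y for x, y in zip(list_a1, list_a2))
    return toy_cipher(key, list_a2), b


def vea_decrypt(key: bytes, encrypted_list: bytes, b: bytes) -> bytes:
    """Recover a2, a4, ... with E, then a1, a3, ... as b xor (a2, a4, ...)."""
    list_a2 = toy_cipher(key, encrypted_list)  # the toy cipher is its own inverse
    list_a1 = bytes(x ^ y for x, y in zip(b, list_a2))
    out = bytearray(2 * len(list_a2))
    out[0::2] = list_a1
    out[1::2] = list_a2
    return bytes(out)


video = bytes(range(16))
encrypted_half, xored_half = vea_encrypt(b"secret key", video)
assert vea_decrypt(b"secret key", encrypted_half, xored_half) == video
```

Only half of the bytes pass through E, yet no byte is transmitted in the clear, because the other half is hidden by XOR with the half that is encrypted.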
RVEA [19][20]
RVEA was proposed by Shi, Wang and Bhargava in 1999. In fact, they proposed
four video encryption methods: Method 1, Method 2 (VEA), Method 3 (MVEA) and
Method 4 (RVEA). RVEA has the highest security level of the four methods.
RVEA extracts the sign bits of the DCT coefficients and motion vectors from the MPEG
video sequence and encrypts them with a conventional cryptosystem such as DES or AES.
The encrypted bits are then put back in their original positions.
Alattar [21]
In 1999, Alattar, Al-Regib and Al-Semari presented a video encryption scheme based on
the DES cryptosystem. They defined four security methods:
1) Method 0: Encrypt all macroblocks from I-frames and the headers of all
predicted macroblocks.
2) Method 1: Encrypt all data associated with every nth I-macroblock.
3) Method 2: Encrypt the same data as in Method 1 and all header data of predicted
macroblocks.
4) Method 3: Encrypt the same data as in Method 1 and every nth predicted
macroblock.
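The "every nth macroblock" selection used by Methods 1 and 3 can be sketched as an index filter. The counting convention is an assumption made for illustration: counting starts with the first macroblock of each kind and runs over macroblocks of that kind only. The function names and the list-of-types representation are hypothetical.

```python
def every_nth_of_kind(mb_types: list, kinds: set, n: int) -> list:
    """Indices of every nth macroblock whose type is in `kinds`."""
    selected = []
    seen = 0
    for index, mb_type in enumerate(mb_types):
        if mb_type in kinds:
            if seen % n == 0:
                selected.append(index)
            seen += 1
    return selected


def method1_selection(mb_types: list, n: int) -> list:
    """Method 1: every nth I-macroblock."""
    return every_nth_of_kind(mb_types, {"I"}, n)


def method3_selection(mb_types: list, n: int) -> list:
    """Method 3: the Method 1 selection plus every nth predicted macroblock."""
    predicted = every_nth_of_kind(mb_types, {"P", "B"}, n)
    return sorted(set(method1_selection(mb_types, n)) | set(predicted))


types = ["I", "P", "I", "I", "B", "I", "P", "P"]
print(method1_selection(types, 2))  # [0, 3]
print(method3_selection(types, 2))  # [0, 1, 3, 6]
```

Increasing n reduces the amount of data passed to DES, which is how such schemes trade security for computational cost.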
2.4 Problems of current video encryption scheme
There are three main problems in current selective encryption schemes.
A. Security Problem
A lot of cryptanalysis work has been done on the proposed video encryption schemes
[10, 41-45]. The security of schemes that do not use standard cryptographic algorithms
is very low. For example, permutation is highly risky, as shown in [10, 42-44]. Even
when standard cryptographic algorithms such as DES or AES are used in a video encryption
scheme, many security problems remain. The corresponding cryptanalysis can be found in
[10, 41, 45].
Another crucial problem of selective encryption is that information cannot be
concealed after encryption. Some objects in the video sequence can still be recognized
from the unencrypted part of the video stream. As shown in Figure 2.3 [63] [64], the
original video is encrypted by I-frame/I-MB encryption, DCT coefficient encryption and
motion vector encryption, respectively. After encryption, the video quality is greatly
reduced. However, the content of the picture can still be recognized. For
security-sensitive systems, this poses a great risk.
B. Computational Cost Problem
Some methods can provide substantial security, but at the price of higher
computational overhead and data overhead. For example, VEA scheme [18] is
[11]. However, it needs to
encrypt half of the video data using the internal encryption scheme E and to transfer a
large amount of additional keys to the receiver. The detailed computational cost of
selective video encryption schemes is analyzed further in the next chapter.
Figure 2.3 Samples of Selective Encryption.
C. Feasibility Problem
Feasibility is another problem in many schemes. A lot of existing schemes are
so called . It means that the video
encryption module must be integrated into the video compression system. For example,
permutation of AC and DC coefficients must be done before entropy coding. The
encryption therefore interrupts the video compression procedure, and the encryption
module must be integrated into the video compression system. That is why the standard
this secure encoder should be .
This makes such schemes very hard to use widely in commercial applications.
2.5 Conclusion
In this chapter, I introduced the fundamentals of selective video encryption, followed
by a survey of current video encryption schemes and a discussion of their problems.
From the security point of view, the best way of protecting video data is full
encryption, which encrypts the entire video data with a standard cryptosystem. However,
its expensive computational overhead makes it inefficient or impossible in many
applications. Selective encryption therefore encrypts only a part of the video data in
order to reduce the computational cost while keeping the security level high. However,
many proposed schemes achieve only moderate to low security, and only a few of the
proposed methods promise substantial security. A video encryption scheme with high
security and low computational cost is necessary for future high definition video coded
by H.264/AVC.
3 Unequal Secure Encryption (USE) Scheme for H.264
and the computational cost. Selective encryption schemes are weak in security, and
full encryption requires too much computational cost. The original idea of our USE
scheme is this: if video data can be selected, why not also select the cryptosystem for
each selected part of the video data? In this way, all of the video data can be
encrypted, and a light cryptosystem with low computational cost can be chosen for the
unimportant data set to reduce the total computational cost. As a result, both the
security problem and the computational cost problem can be solved.
This chapter introduces our proposed USE video encryption scheme. In the last part,
a comparison with other schemes is discussed.
3.1 Introduction of H.264
H.264/AVC is the newest international video coding standard. It has been approved by
ITU-T as Recommendation H.264 and by ISO/IEC as International Standard 14496-10
(MPEG-4 part 10) Advanced Video Coding (AVC).
Many new techniques are used in H.264/AVC, including new coding techniques, a new
data structure, and new video storage and broadcast techniques. As the USE scheme is
applied after video coding, the details of the H.264/AVC coding, storage and
transmission techniques do not need to be considered very much. The H.264/AVC video
data structure has more impact on the USE scheme: data classification requires a
careful study of the data structure of H.264/AVC.
Figure 3.1 H.264 Baseline, Main and Extended Profiles.
Figure 3.2 H.264/AVC data format (a sequence of NAL units, each with a NAL header; a
slice consists of a slice header and slice data; slice data consists of MBs and skip_run
elements; each MB consists of mb_type, mb_pred and coded residual).
In H.264/AVC, profiles and levels specify conformance points. A profile defines a set
of coding tools or algorithms that can be used in generating a conforming bitstream,
whereas a level places constraints on certain parameters of the bitstream. The first
version of H.264/AVC defines a set of three profiles, as shown in Figure 3.1 [46][47].
The Baseline profile supports most features except two sets: 1) B slices, CABAC,
field coding and weighted prediction; 2) SP/SI coding and data partitioning. The Main
profile supports the first set, but does not support FMO, ASO and redundant slices,
which are supported in the Baseline profile. The Extended profile supports all features
of the Baseline profile and both sets of features, except for CABAC.
Some new features that are used in the USE scheme are listed below:
Coded Data Format: H.264/AVC makes a distinction between a Video Coding Layer
(VCL) and a Network Abstraction Layer (NAL). The output of the encoding process is
VCL data, which is mapped to NAL units prior to transmission or storage. A coded
video sequence is represented by a sequence of NAL units. The data format of NAL is
shown in Figure 3.2. One NAL unit contains one or more slices, and each slice contains
an integral number of macroblocks (MBs). Each MB contains a series of header elements
and coded residual data.
Parameter sets: H.264 introduces the concept of parameter sets, which provides
robust and efficient conveyance of header information. Parameter sets include key
information such as the sequence header and picture header. This key information is
separated so that it can be handled in a more flexible and specialized manner in
H.264/AVC. This new feature is fully used in our USE scheme.
Flexible macroblock ordering (FMO): FMO is a new technique introduced by
H.264/AVC which can partition the picture into regions called slice groups.
FMO can be used to enhance robustness to data losses in transmission. The USE
scheme provides two ways of using FMO for video encryption.
Data partitioning: As some coded information is more important than other
information for representing the video content, H.264/AVC allows the syntax of each
slice to be separated into three partitions. The USE scheme makes use of this data
partitioning.
3.2 USE Scheme for H.264/AVC
The purpose of designing Unequal Secure Encryption scheme is to provide substantial
security with low computational cost for video encryption. As discussed in Chapter 2, a
lot of existing video encryption schemes target low computational cost while ignoring
Integrated video compression
and encryption system
proposed schemes can achieve a high security level. However, their computational cost
is high.
The USE scheme is a full encryption scheme that encrypts the entire video data using
selective encryption methods. Its target application is H.264/AVC based video security
systems. In particular, the USE scheme can be used very efficiently for high definition
video encryption, because the computational cost of the USE scheme is much
The contents of USE scheme can also
be found in [61] and [62].
Figure 3.3 Unequal Secure Encryption Scheme.
3.2.1 Data Partition Methods
The USE scheme is shown in Figure 3.3. It includes two major steps. The first step is
video data classification. The purpose of classification is to divide the video data
into two partitions: an important video data partition and an unimportant video data
partition. The importance is evaluated by how difficult it is to reconstruct a picture.
If the data in the important partition is lost, the video content cannot be
reconstructed; if the data in the unimportant partition is lost, the video content can
still be reconstructed, only with reduced quality. Therefore, the important video data
group needs to be protected more securely than the unimportant one. As shown in the
figure, after data classification, the H.264/AVC video data is divided into DPA (Data
Partition A, important) and DPB (Data Partition B, unimportant).
The second step in the USE scheme is unequal secure encryption. Unlike existing
selective encryption schemes, the USE scheme encrypts the entire video data, and
different cryptosystems are selected to encrypt different parts of the video data. As
discussed in Chapter 2, from the viewpoint of cryptanalysis, the best way to keep the
data secure is to encrypt the entire video data and to use standard cryptography rather
than other methods whose security cannot be proven. As shown in Figure 3.3, two ciphers
are used in the USE scheme: DPA is encrypted by cipher A, and DPB is encrypted by
cipher B. Different algorithms have different security levels and computational costs.
In the USE scheme, we use AES as cipher A and FLEX as cipher B. FLEX is our proposed
algorithm, which is based on AES: hardware implementations of AES can also support
FLEX, and FLEX is faster than AES. Besides AES and FLEX, other cryptographic algorithms
can also be used in the USE scheme.
The computational cost of the USE scheme depends on the data classification and the
cryptographic algorithms. Once the algorithms have been decided, the data
classification plays the more important role. There are three data classification
methods in the USE scheme. As the USE scheme is designed for H.264/AVC, two of these
classification methods use the new features of H.264/AVC.
3.2.2 Data Partition Methods
The purpose of data classification is to partition video data based on importance.
The difficulty of reconstructing the picture after data loss is used to evaluate the
importance of data. In H.264, the loss of header data (which includes parameter sets
and MVD) makes the picture most difficult to reconstruct. The loss of VLC data (which
includes Intra and Inter residual data) causes video quality reduction. Intra data is
independent between frames while Inter data depends on neighboring frames, so
reconstruction after Intra loss is much more difficult than after Inter loss.
There are three data classification methods in the USE scheme. All of them are
performed after video encoding. The video coding scheme and the video encryption scheme
are totally separated in our USE scheme.
Figure 3.4 Data Partition in H.264/AVC Extended Profile.
Data Partitioning (Extended Profile)
This is a new feature in the H.264/AVC Extended Profile, which performs data partition
automatically. As shown in Figure 3.4, the coded data that makes up a slice is placed in
three separate Data Partitions (A, B and C). Partition A contains the slice header and
the header data of each macroblock, Partition B contains intra residual data, and
Partition C contains inter residual data. Partition A is more important than B and C.
Normally, intra data (Partition B) is considered more important than inter data
(Partition C).
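As an illustration of how Data Partitioning could drive cipher selection, the sketch below reads the nal_unit_type from the one-byte NAL header and maps it to AES or FLEX for security levels 1 and 2 of Table 3.1. The nal_unit_type values are those of the H.264/AVC specification; the function names and the rule that every other NAL unit type falls back to FLEX are assumptions made for this example, not part of the USE scheme as described.

```python
# NAL unit types from the H.264/AVC specification (subset).
NAL_SLICE_PARTITION_A = 2   # slice header + macroblock header data
NAL_SLICE_PARTITION_B = 3   # intra residual data
NAL_SLICE_PARTITION_C = 4   # inter residual data
NAL_SPS = 7                 # sequence parameter set
NAL_PPS = 8                 # picture parameter set


def nal_unit_type(nal_header_byte: int) -> int:
    """Return nal_unit_type, the low 5 bits of the one-byte NAL header."""
    return nal_header_byte & 0x1F


def select_cipher(nal_header_byte: int, level: int) -> str:
    """Hypothetical helper: choose "AES" or "FLEX" for one NAL unit.

    Level 1 protects header data (parameter sets, Partition A) with AES.
    Level 2 additionally protects intra residual data (Partition B) with AES.
    All other NAL unit types default to FLEX (an assumption of this sketch).
    """
    if level not in (1, 2):
        raise ValueError("Data Partitioning is listed for levels 1 and 2 only")
    aes_types = {NAL_SPS, NAL_PPS, NAL_SLICE_PARTITION_A}
    if level == 2:
        aes_types.add(NAL_SLICE_PARTITION_B)
    return "AES" if nal_unit_type(nal_header_byte) in aes_types else "FLEX"
```

For example, a header byte of 0x43 carries nal_unit_type 3 (Partition B), so it is routed to FLEX at level 1 and to AES at level 2.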
FMO (Baseline Profile, Extended Profile)
FMO is a new feature in H.264/AVC. It can partition the picture into regions called
slice groups. In the H.264/AVC standard, FMO consists of seven different partition
types, all of which make it easy to partition pictures. In the USE scheme, there are
two partition modes (shown in Figure 3.5). The first partition mode is Region Based
FMO. In this mode, the picture is partitioned into two slice groups: Secret regions
and Normal regions. The shape of the secret regions can be decided by other
pre-processing tools such as object recognition and extraction. This mode supports the
extraction of any shape of interest in the picture, so object based encryption can be
realized. The second partition mode is Mode Based FMO. In this mode, the picture is
partitioned into two slice groups: Intra MBs and Inter MBs. As Intra MBs are more
important than Inter MBs for reconstructing the picture, the Intra MBs should be
encrypted with a highly secure encryption algorithm.
Figure 3.5 Data Partition by FMO.
Figure 3.6 Data Partition by Parameters Extraction.
Parameters Extraction (All Profiles)
Since the Data Partitioning method and the FMO method are limited to certain profiles,
a common method that can be used in any profile is needed. The Parameters Extraction
method, shown in Figure 3.6, is such a method. Its effect is similar to that of the
Data Partitioning method. The difference is that Data Partitioning is done
automatically by the codec, whereas this method needs a parser to do the data
classification.
3.2.3 Security levels
According to [71] and other video encryption survey papers [9-13], the security
strength of video data depends on how much of the important data is encrypted. The
importance of video data is defined by how difficult it is to reconstruct the video
when the data is lost.
Similar to other video encryption schemes, there are 4 security levels defined in
the USE scheme (shown in Table 3.1), plus one extra level. The definitions are as
follows:
Table 3.1 Security levels in the USE scheme.
Secure level  Algorithm  Video content        Data classification methods
Level 0       AES        Headers              Parameters Extraction
              FLEX       Inter, Intra, MVD
Level 1       AES        Headers, MVD         Data Partitioning, Parameters Extraction
              FLEX       Inter, Intra
Level 2       AES        Headers, MVD, Intra  Data Partitioning, Parameters Extraction, FMO
              FLEX       Inter
Level 3       AES        All                  -
Level x       AES        Secret Region        FMO
              FLEX       Normal Region
Level 0: Headers are encrypted by AES, and the remaining data is encrypted by
FLEX. Level 0 has the lowest computational cost. The Parameters Extraction
method can be used in this level.
Level 1: Headers and MVDs (in H.264/AVC, the MVD carries the motion vector
information) are encrypted by AES, and the remaining data is encrypted by FLEX. The
Data Partitioning method and the Parameters Extraction method can be used in this level.
Level 2: Headers, MVD and Intra MBs are encrypted by AES, and Inter MBs are
encrypted by FLEX. All three data classification methods can be used in this level.
Level 3: The entire video is encrypted by AES. Level 3 has the highest
computational cost and security.
Level x: This is an extra security level of the USE scheme. Only the FMO method can
be used in this level. It can be used in object-based encryption applications.
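The level definitions above can be written down as a small lookup table. The following is a minimal sketch, assuming that a parser has already labelled every syntax element with one of four data classes; the class names, `cipher_for` and `split_by_cipher` are hypothetical names chosen for illustration. Level x is left out because it depends on a region map rather than on the data class.

```python
# Data classes encrypted by AES at each security level (Table 3.1).
# Every other class is encrypted by FLEX.
ALL_CLASSES = {"header", "mvd", "intra", "inter"}
AES_CLASSES = {
    0: {"header"},
    1: {"header", "mvd"},
    2: {"header", "mvd", "intra"},
    3: {"header", "mvd", "intra", "inter"},
}


def cipher_for(level: int, data_class: str) -> str:
    """Return the cipher name used for one data class at a security level."""
    if data_class not in ALL_CLASSES:
        raise ValueError(f"unknown data class: {data_class}")
    return "AES" if data_class in AES_CLASSES[level] else "FLEX"


def split_by_cipher(level, elements):
    """Group (data_class, payload) pairs into the AES and FLEX partitions."""
    partitions = {"AES": [], "FLEX": []}
    for data_class, payload in elements:
        partitions[cipher_for(level, data_class)].append(payload)
    return partitions
```

With this table, raising the security level only moves data classes from the FLEX partition to the AES partition; the total amount of encrypted data stays at 100%.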
3.2.4 Encryption Methods
A. AES Algorithm
Advanced Encryption Standard (AES), also known as Rijndael, is a block cipher adopted
as an encryption standard by the U.S. government in 2001. AES is the most popular
algorithm used in symmetric key cryptography. AES has a fixed block size of 128 bits and
a key size of 128, 192 or 256 bits. AES operates on a 4×4 array of bytes termed the
State. For encryption, it applies a round function 10, 12 or 14 times, depending on the
key length. AES is discussed in more detail in the next chapter.
Figure 3.7 FLEX encryption algorithm.
Figure 3.8 Leak position in the even and odd rounds.
B. FLEX Algorithm
FLEX (which stands for Fast Leak EXtraction) is a cryptographic method based on the AES
round transformation. FLEX can process long data more quickly than AES while
maintaining the same key agility and short-block performance. Moreover, its flexibility
for hardware and software implementation is the same as that of AES, and the hardware
resources can be shared between FLEX and AES. The detailed algorithm is shown in
Figure 3.7.
Firstly, the given IV is encrypted by an AES invocation: S = AES_Key(IV). The 128-bit
result S together with the encryption Key constitutes a 256-bit secret state of the
stream cipher.
Secondly, we use result S as a new input d Key(S). The cipher stream
it comes
from internal states of AES. As shown in Figure 3.8, a 4×4 array of bytes constitutes
the internal state of AES. In every round function of AES, a part of the AES State is
output. In the FLEX algorithm, b0,0, b0,2, b1,1, b1,3, b2,0, b2,2, b3,1, b3,3 are output
in odd rounds, and b0,1, b0,3, b1,1, b1,3, b2,1, b2,3, b3,1, b3,3 are output in even
rounds. This is 8 State bytes per round, so a 10-round AES encryption outputs 80 State
bytes (640 bits) in total. Since a plain AES encryption outputs 128 bits, FLEX is
5 times faster than AES.
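The leak extraction can be illustrated with a few lines of code. The sketch below only selects the leaked bytes from a given 4×4 State; it does not implement the AES rounds themselves. The leak positions are copied from the text as (row, column) pairs, and the function names and the dummy State are assumptions for illustration.

```python
# Leak positions (row, column) as listed in the text for FLEX.
ODD_ROUND_LEAK = [(0, 0), (0, 2), (1, 1), (1, 3), (2, 0), (2, 2), (3, 1), (3, 3)]
EVEN_ROUND_LEAK = [(0, 1), (0, 3), (1, 1), (1, 3), (2, 1), (2, 3), (3, 1), (3, 3)]


def leak_bytes(state, round_number):
    """Return the 8 State bytes leaked in the given round (1-based)."""
    positions = ODD_ROUND_LEAK if round_number % 2 == 1 else EVEN_ROUND_LEAK
    return bytes(state[r][c] for r, c in positions)


def keystream_bits_per_encryption(rounds=10, leaked_per_round=8):
    """Key stream bits produced by one AES invocation (10 rounds for AES-128)."""
    return rounds * leaked_per_round * 8


# Dummy State whose byte at (r, c) equals 4*r + c, only to show the selection.
state = [[4 * r + c for c in range(4)] for r in range(4)]
odd_leak = leak_bytes(state, 1)
even_leak = leak_bytes(state, 2)
speedup = keystream_bits_per_encryption() / 128
```

Here `keystream_bits_per_encryption()` evaluates to 640, and dividing by the 128-bit AES block gives the factor of 5 quoted in the text.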
C. XOR Method
In order to further reduce the computational cost, we use an XOR method that reduces
the computational cost by 50%. This method is shown in Figure 3.9. It consists of three
steps:
Step 1: Divide the total plaintext into two partitions A and B of the same size.
Step 2: Encrypt partition A, and XOR partition A with partition B bit by bit.
Step 3: The two results, partitions C and D, are the ciphertext.
Figure 3.9 XOR Method
By using the XOR method, only half of the video data needs to be encrypted, which
lowers the computational cost. The security of the whole plaintext is equal to that of
partition A.
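The three steps of the XOR method can be sketched as follows. This is an illustration under stated assumptions: partition C is taken to be the encryption of A and partition D the XOR of A and B, and `toy_encrypt` is a stand-in keystream cipher built from SHA-256, used only so that the example runs with the standard library. It is not AES or FLEX and should not be used for real protection.

```python
import hashlib


def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """Stand-in cipher (NOT AES or FLEX): XOR with a SHA-256 counter keystream."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))


toy_decrypt = toy_encrypt  # an XOR keystream is its own inverse


def xor_bytes(x: bytes, y: bytes) -> bytes:
    """Bitwise XOR of two equally long byte strings."""
    return bytes(a ^ b for a, b in zip(x, y))


def xor_method_encrypt(key: bytes, plaintext: bytes):
    """Steps 1-3: split into A and B, encrypt A, mask B with A."""
    if len(plaintext) % 2:
        raise ValueError("plaintext must split into two equal halves")
    half = len(plaintext) // 2
    a, b = plaintext[:half], plaintext[half:]
    c = toy_encrypt(key, a)   # only partition A goes through the cipher
    d = xor_bytes(a, b)       # partition B is masked by partition A
    return c, d


def xor_method_decrypt(key: bytes, c: bytes, d: bytes) -> bytes:
    """Recover A by decryption, then B by XOR with A."""
    a = toy_decrypt(key, c)
    return a + xor_bytes(a, d)
```

Only half of the plaintext passes through the cipher, which is where the 50% saving comes from; partition B can be recovered only after partition A has been decrypted.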
3.3 Comparison
In order to compare the computational cost and encrypted data part of our proposed
stream should be firstly calculated.
Table 3.2 shows the experimental results for several H.264/AVC QCIF sequences. It
lists the header information size, MVD size, Intra MB residue size and Inter MB
residue size of 10 QCIF test sequences. Every test sequence begins with an I frame,
followed by P or B frames, and contains 100 frames in total. Over these 10 sequences,
the average share of the data size is about 20% for the Header, about 20% for MVD,
about 15% for the Intra residue, and about 45% for the Inter residue.
Table 3.3 shows the ratio of each data partition for different video sequences under
different security levels. In level 0, about 20% of the video data is encrypted by AES
and 80% is encrypted by FLEX. In level 1, the percentages are 40% and 60%, and in
level 2 they are 55% and 45%. In level 3, 100% of the data is encrypted by AES.
Level x uses FMO data partition
ment, Level x is not
included.
Table 3.4 shows the computational cost and encrypted data percentage comparison of
results listed in Table 3.2. We use the average percentage of 10 sequences. The
computational cost is measured in n%@AES, where full encryption by AES is defined as
100%@AES. For example, the computational cost of SECMPEG level 1 is 20%@AES, which
means that it is 20% of the cost of full encryption. The encrypted data percentage
reflects the security strength of each video encryption scheme. As all of the schemes
use AES to encrypt the selected important data, the security can be evaluated by the
amount of encrypted data.
From Table 3.4, it can be seen that our scheme can achieve both high security and low
the computational cost of
Level 0 in our USE scheme is just about 18% of naive encryption, and the encrypted data
percentage is 100%.
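The cost figures for the USE scheme in Table 3.4 can be reproduced with a short calculation. The text does not spell out the formula, so the model below is an inference: the AES share of the data costs 1 per unit, the FLEX share costs 1/5 per unit because FLEX is stated to be 5 times faster, and the XOR method halves the result. The data shares are the approximate averages quoted above; all names are chosen for illustration.

```python
# Approximate average shares of an H.264/AVC stream quoted in the text.
SHARES = {"header": 0.20, "mvd": 0.20, "intra": 0.15, "inter": 0.45}

# Data classes encrypted by AES at each security level (Table 3.1).
AES_CLASSES = {
    0: ("header",),
    1: ("header", "mvd"),
    2: ("header", "mvd", "intra"),
    3: ("header", "mvd", "intra", "inter"),
}

FLEX_SPEEDUP = 5   # FLEX is stated to be 5 times faster than AES
XOR_FACTOR = 0.5   # assumption: the XOR method halves the encrypted data


def aes_share(level):
    """Fraction of the stream encrypted by AES at the given level."""
    return sum(SHARES[c] for c in AES_CLASSES[level])


def cost_at_aes(level):
    """Estimated cost relative to full AES encryption (1.0 = 100%@AES)."""
    a = aes_share(level)
    return XOR_FACTOR * (a + (1.0 - a) / FLEX_SPEEDUP)


for level in range(4):
    print(f"Level {level}: AES share {aes_share(level):.0%}, "
          f"cost {cost_at_aes(level):.0%} @ AES")
```

Under these assumptions the model gives 18%, 26%, 32% and 50% for levels 0 to 3, which matches the rows for our scheme in Table 3.4.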
Figure 3.10 compares the security and computational complexity of our proposed USE
scheme with other schemes. The computational complexity is expressed as a percentage
of full encryption by the AES algorithm. The security is also evaluated relative to AES
full encryption. We consider the security of the FLEX algorithm to be 1/5 of that of
the AES algorithm. From this figure, our proposed USE scheme is much
of our scheme is very
low.
3.4 Conclusion
In this chapter, an unequal secure encryption scheme for H.264/AVC was proposed. In
order to maintain high security, our scheme takes a full encryption approach and
encrypts the whole video data by selective encryption methods. The scheme mainly
includes two parts: data classification and unequal secure encryption.
(1) Data classification: There are three classification methods in the USE scheme:
Data Partitioning for the Extended profile, FMO for the Baseline and Extended profiles,
and Parameters Extraction for all profiles. After data classification, the entire video
data is divided into two partitions: the important data partition and the unimportant
data partition.
(2) Unequal secure encryption: This method can also be called the selective
cryptosystem method, as different cryptosystems have different security levels and
different computational costs. In the USE scheme, we choose AES to encrypt the
important data partition and propose a light encryption algorithm called FLEX to encrypt
the unimportant data partition. FLEX is 5 times faster than AES. However, its security
is lower than that of AES because it leaks many internal states during encryption.
The experimental results and comparison show that our scheme can achieve both high
security and low computational cost. For level 0 of the USE scheme, the computational
cost is only 18% of full encryption, and for the highest security level, level 3, it is
only 50% of full encryption. The scheme is very suitable for high security, high
definition video encryption systems.
Table 3.2 Video data partition size
(QCIF@100 Frames, I Frame followed by P or B Frames).
Columns: Video sequence; Header incl. MVD (bits), Header/Total (%); MVD (bits),
MVD/Total (%); Intra MBs residue VLC (bits), VLC/Total (%); Inter MBs residue VLC
(bits), VLC/Total (%); total size of compressed H.264 file (bits).
Canoa 676577 26.04% 300816 11.58% 769777 29.62% 1152357 44.34% 2608088
CarPhone 314675 51.84% 150868 24.85% 55551 9.15% 236802 39.01% 616672
Claire 95326 57.69% 38300 23.18% 10801 6.54% 59111 35.77% 175640
Container 96239 46.49% 32468 15.68% 23877 11.53% 86899 41.98% 217832
Football 825441 30.14% 390128 14.25% 866291 31.64% 1046531 38.22% 2747592
Foreman 375985 55.99% 195606 29.13% 43971 6.55% 251588 37.46% 680648
Grandma 99382 52.85% 39218 20.86% 17903 9.52% 70763 37.63% 198600
Mobile 454322 36.29% 207090 16.54% 54242 4.33% 743504 59.38% 1261768
News 183186 41.21% 86012 19.35% 55332 12.45% 206017 46.34% 454736
Table 312751 39.18% 165196 21.03% 78360 9.98% 394422 50.21% 795512
Table 3.3 Video data partition for different security levels.
Video sequence  Level 0          Level 1          Level 2
                AES     FLEX     AES     FLEX     AES     FLEX
Canoa           14.41%  85.59%   26.04%  73.96%   55.66%  44.34%
CarPhone        26.56%  73.44%   51.84%  48.16%   60.99%  39.01%
Claire          32.47%  67.53%   57.69%  42.31%   64.23%  35.77%
Container       29.28%  70.72%   46.49%  53.51%   58.02%  41.98%
Football        15.84%  84.16%   30.14%  69.86%   61.78%  38.22%
Foreman         26.50%  73.50%   55.99%  44.01%   62.54%  37.46%
Grandma         30.29%  69.71%   52.85%  47.15%   62.37%  37.63%
Mobile          19.59%  80.41%   36.29%  63.71%   40.62%  59.38%
News            21.37%  78.63%   41.21%  58.79%   53.66%  46.34%
Table           18.55%  81.45%   39.18%  60.82%   49.16%  50.84%
Table 3.4 Comparison with other video encryption schemes.
Encryption scheme        Content to be encrypted             Computational cost (@ AES)   Encrypted data
SECMPEG [14]  Level 1    Header                              20% @ AES                    20%
              Level 3    Header and Intra                    35% @ AES                    35%
              Level 4    All                                 100% @ AES                   100%
Aegis [15][16]           Header, I frame                     35% @ AES                    35%
VEA [18]                 All                                 50% @ AES                    100%
RVEA [19][20]            Sign bit of DCT and motion vectors  10% @ AES                    10%
Alattar [21]  Method 0   Header, Intra and MVD               55% @ AES                    55%
              Method 1   Every nth I MB                      (1/n*15)% @ AES              (1/n*15)%
              Method 2   + Header                            (1/n*15 + 40)% @ AES         (1/n*15 + 40)%
              Method 3   + nth Header                        (1/n*15 + 1/n*40)% @ AES     (1/n*15 + 1/n*40)%
Ours          Level 0    All                                 18% @ AES                    100%
              Level 1    All                                 26% @ AES                    100%
              Level 2    All                                 32% @ AES                    100%
              Level 3    All                                 50% @ AES                    100%
Figure 3.10 Comparison of security and computational complexity.
4 Hardware Design of Encryption Accelerator
As video streaming needs real-time encryption, the hardware implementation of a secure
video system becomes very important, especially with regard to high performance, low
latency and low power. A complete secure video streaming system includes two main
parts: a secret key exchange module and a video content encryption module. For secret
key exchange, RSA is the most widely used algorithm, and the Montgomery multiplier is
the core accelerator in RSA. For video content encryption, the USE scheme has been
proposed in Chapter 3, and AES is the core accelerator. In this chapter, hardware
designs for AES and RSA are proposed to achieve high performance and low hardware
cost.
4.1 Hardware Design of AES
4.1.1 Introduction of AES Algorithm
AES [37], also known as Rijndael, is the most popular algorithm used in symmetric
key cryptography. AES operates on a 4×4 array of bytes termed the State. For encryption,
it applies a round function 10, 12 or 14 times, depending on the key length. The
encryption and decryption flows of the AES algorithm are shown in Figure 4.1 (a) and (b).
Four transformations, SubBytes, ShiftRows, MixColumns and AddRoundKey, are performed in
the encryption process, and the four corresponding inverse transformations are
performed in the decryption process. A separate KeyExpansion unit generates the keys
for each round of the AES algorithm. In order to reduce the hardware cost, we propose a
hybrid dataflow for both encryption and decryption, which is shown in Figure 4.1 (c).
All of the function modules support both forward and inverse operation, and the
KeyExpansion module also supports generating the forward and inverse key sequences.
Compared to a solution that uses two AES cores (one for encryption and another for
decryption), the hybrid solution saves about 40% of the hardware cost (for single
modules, (Inv)SubBytes saves 50%, (Inv)MixColumns saves 30% and (Inv)ShiftRows saves
20% of the hardware cost).
Figure 4.2 shows the operations in the AES algorithm. A brief introduction is given
below:
1) SubBytes: The SubBytes operation is a non-linear byte substitution that operates on
each byte of the State using a substitution table.
2) ShiftRows: In the ShiftRows operation, the bytes in the last three rows of the State
are cyclically shifted over different numbers of bytes.
3) MixColumns: A mixing operation which operates on the columns of the State using a
linear transformation.
4) AddRoundKey: A Round Key is added to the State by a simple bitwise XOR
operation.
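Three of the four transformations above are simple enough to sketch in software. The following is a minimal reference sketch of ShiftRows, MixColumns and AddRoundKey for a State stored as four rows of four bytes; SubBytes is omitted because it needs the 256-entry substitution table. The function names and the row-major layout are choices made for this example and say nothing about the hardware modules discussed in this chapter.

```python
def shift_rows(state):
    """Cyclically shift row r of the 4x4 State left by r bytes."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def add_round_key(state, round_key):
    """Bitwise XOR of the State with a 4x4 round key."""
    return [[s ^ k for s, k in zip(srow, krow)]
            for srow, krow in zip(state, round_key)]


def xtime(b):
    """Multiply a byte by 2 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def mix_single_column(col):
    """Multiply one column by the MixColumns matrix (rows 2 3 1 1, 1 2 3 1, ...)."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


def mix_columns(state):
    """Apply mix_single_column to each of the four columns of the State."""
    cols = [mix_single_column([state[r][c] for r in range(4)]) for c in range(4)]
    return [[cols[c][r] for c in range(4)] for r in range(4)]
```

The sketch also shows the natural width of each operation: AddRoundKey and ShiftRows touch the whole 128-bit State, whereas MixColumns works on one 32-bit column at a time.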
Figure 4.1 Dataflow. (a) Encryption. (b) Decryption.
(c) Proposed hybrid dataflow for encryption & decryption.
Figure 4.2 Transformations in AES algorithm.
4.1.2 Existing low-cost implementations of AES
Many hardware implementations of the AES algorithm have already been proposed. They
can be classified into two types: high-speed designs and low-cost designs. Most of the
existing designs are high-speed designs. However, with the growth of personal security
requirements and the use of portable consumer electronic devices, low-power and
low-cost design has become very important.
The existing low-cost implementations of AES can be classified into two types: 32-bit
designs, which use a 32-bit wide data path, and 8-bit designs, which use an 8-bit wide
data path. A brief introduction to these two kinds of designs is given below.
32-bit implementation of AES Algorithm [48]
The 32-bit implementation of the AES algorithm is shown in Figure 4.3. It was proposed
by Satoh in [48]. This architecture uses a 32-bit wide data path. The key function
modules of this architecture include:
S-Box module (32-bit): Every S-Box supports one SubBytes (8-bit) operation, so a 32-bit
SubBytes operation is realized by using 4 S-Boxes in the data path. These S-Box modules
are used for both the AES round function and the key expansion operation.
MixColumns module (32-bit): The MixColumns module supports both the MixColumns
and the Inverse MixColumns operation by reusing part of the hardware.
In this architecture, the S-Box module, MixColumns module and AddRoundKey module
(XOR module) are serially connected, so it can also be called a serial data path
design. The hardware cost of this implementation is shown in Table 4.1. The ASIC
implementation runs at 131 MHz with a throughput of 311 Mbps.
Figure 4.3 32-bit architecture for AES.
Table 4.1 Hardware cost of 32-bit AES @ 131 MHz, 0.11um, [48].
Components Gates %
Data Register 864 16.01%
ShiftRows 160 2.96%
S-Boxes 1,176 21.79%
MixColumns 350 6.48%
AddRoundKey 56 1.04%
Key Expander 1,896 35.12%
Others 699 12.95%
Total 5398 100%
8-bit implementation of AES Algorithm [50]
The 8-bit implementation of the AES algorithm is shown in Figure 4.4. It was proposed
by Feldhofer in [50]. This architecture uses an 8-bit wide data path and is designed
for use in RFID tags. The key function modules are as follows:
S-Box module (8-bit): There is only one S-Box in this architecture. It is used for
both the round function and the key expansion.
1/4 MixColumns module (8-bit): Since the MixColumns operation is a 32-bit operation,
the authors proposed a new method to implement it with an 8-bit width. Additional
registers and clock cycles are needed.
In this architecture, the S-Box module, MixColumns module and AddRoundKey module
are arranged in parallel, so it is also referred to as a parallel data path design. The
hardware cost of this implementation is shown in Table 4.2. The ASIC implementation of
this design works at about 100 kHz with a throughput of about 12.6 kbps, which is
enough for RFID communication.
The 32-bit and 8-bit implementations of AES achieve very low hardware cost, and they
are suitable for low cost, low throughput system. However, the architectures in these two
implementations are dedicated to specific applications, and can
performance. For video encryption, the performance requirement differs widely with
video resolution and frame rate. For example, for One Seg [51] (mobile TV used in
Japan), the throughput of video data is 160 kbps, whereas for HDTV DVD the throughput
of video data can reach 50 Mbps. Therefore, a scalable architecture which can provide
different levels of performance is the best choice for AES hardware design.
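The reported figures can be related to each other with simple arithmetic: throughput equals block size times clock frequency divided by the number of cycles per block. The sketch below uses only the frequencies and throughputs quoted in this section to estimate the implied cycles per 128-bit block, and checks each design against the two video bit rates mentioned above. The cycle counts are derived estimates, not values stated in the text, and the dictionary keys are labels chosen for illustration.

```python
BLOCK_BITS = 128  # AES block size in bits


def cycles_per_block(freq_hz, throughput_bps):
    """Clock cycles per AES block implied by a frequency and a throughput."""
    return BLOCK_BITS * freq_hz / throughput_bps


def throughput_bps(freq_hz, cycles):
    """Throughput of a design that needs `cycles` clock cycles per block."""
    return BLOCK_BITS * freq_hz / cycles


# (frequency in Hz, throughput in bit/s) as quoted in this section.
designs = {
    "32-bit [48]": (131e6, 311e6),
    "8-bit [50]": (100e3, 12.6e3),
}
# Video bit rates quoted in this section.
targets = {"One Seg": 160e3, "HDTV DVD": 50e6}

for name, (freq, rate) in designs.items():
    print(name, "needs about", round(cycles_per_block(freq, rate)), "cycles/block")
    for app, need in targets.items():
        print("   ", app, "ok" if rate >= need else "too slow")
```

Under this estimate the 32-bit design needs roughly 54 cycles per block and the 8-bit design roughly 1016, and the 8-bit design falls short of even the 160 kbps stream, which illustrates why a single fixed architecture does not cover the whole range of video applications.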
Figure 4.4 8-bit architecture for AES.
Table 4.2 Hardware cost of 8-bit AES @ 100 kHz, 0.35um, [50].
Components Gates %
S-Boxes 395 11.0%
MixColumns 252 7.0%
AddRoundKey 90 2.5%
Key Expander 161 4.5%
RAM 2,337 65%
Controller 360 10.0%
Total 3,595 100%
4.1.3 Proposed Scalable Hardware Architecture for AES
A scalable architecture is very important for IP design. When implementations with
different performance and hardware cost can be derived from a common architecture
according to specific requirements, IP design and system integration gain great
flexibility and reliability. Multimedia systems require various specifications for
different applications. For example, mobile TV usually uses CIF size video with a bit
rate of less than 1 Mbps. In contrast, for HDTV broadcasting the bit rate is more
than 20 Mbps, and for future super HDTV the bit rate will increase to hundreds of
Mbps. A scalable architecture that can cover a wide range of specifications is
urgently needed for AES IP design.
4.1.3.1 Top Level Architecture
The top level of the proposed scalable hardware architecture for AES is shown in
Figure 4.5. It includes the following blocks:
Data Registers
The data registers consist of 16 bytes of registers, matching the block length
(128 bits) of the AES plaintext. Every subblock (gray color) represents one byte of the
State. Before encryption, the data registers load the plaintext from external memory,
and after encryption, the data registers output the ciphertext.
Key Expander Module
The key expander module is used to generate the key for each round of AES. It includes
two parts: the key registers and the key scheduler. The key registers can be 16, 24 or
32 bytes of registers for 128-bit, 192-bit and 256-bit key lengths. Currently, a
128-bit key length is enough for high security applications, and most cryptosystems use
a 128-bit key length. The key scheduler is an XOR gate array with a control unit that
generates the round keys for encryption.
Figure 4.5 Scalable Hardware Architecture for AES
ShiftRows Module
The ShiftRows module is very easy to implement. It is a wire mapping box that maps the
input ports to the correct output ports. The input and output of the ShiftRows module
are 128 bits wide.
MixColumns Array
The MixColumns Array is a set of MixColumns modules. The number of modules can be
from one to four: since MixColumns is a 32-bit operation, at most four modules can be
used in an implementation. The detailed structure of MixColumns will be discussed in
Chapter 4.3.3.
S-Box Array
The S-Box Array is a set of S-Box modules. The S-Box performs the SubBytes operation of
the AES algorithm, and it is very important for the hardware implementation of AES. The
SubBytes operation is the main computation in AES. The number of S-Boxes used in the
hardware design greatly affects the performance and the hardware cost. The S-Box will
be discussed in detail in Chapter 4.3.3.
The scalability of this design is achieved by the following new ideas:
Independent Data Path for Each Operation
There are three main data paths in our design: the MixColumns data path, the SubBytes
data path, and the AddRoundKey data path. The advantages of this design include:
1) Scalability
As shown in Table 4.3, the operations in AES have different bit widths: SubBytes is an
8-bit operation, MixColumns is a 32-bit operation, and AddRoundKey and ShiftRows are
128-bit operations.
A main problem of previous architectures for scalable design is that they integrate
different operations in one data path. As a result, all of the operations in the same data
path must use the same bit width. In other words, the numbers of processing elements
for the individual operations are tightly coupled.
In our proposed design, by contrast, each operation has its own data path.
This makes it easy to increase or decrease the parallelism of each operation
individually. For example, in our proposed architecture, the number of MixColumns modules
and the number of S-Box modules can be configured freely, without regard to the other
operations.
Table 4.3 Bit width of operations in AES algorithm.
Operations                    Bit width for operations
SubBytes                      8-bit
ShiftRows                     128-bit
MixColumns                    32-bit
AddRoundKey                   128-bit
SubBytes for KeyExpansion     8-bit
2) Power & Performance Improvement
Since the operations are distributed over several parallel data paths, the critical path of
the hardware implementation becomes much shorter than in a serial architecture, so the maximum
working frequency can be raised. For low-power design, the shorter critical path allows
a lower supply voltage and a higher threshold voltage to be used,
which reduces power consumption.
Scalable S-Box Array, MixColumns Array
As shown in Figure 4.5, the S-Box array and the MixColumns array are scalable modules
that support different numbers of processing elements. Each S-Box performs the
SubBytes operation on an 8-bit input and produces an 8-bit output. The data registers hold 128 bits,
so SubBytes is executed 16 times per AES round for the data. For the key registers,
SubBytes is needed four times per round. In total, one round of the AES algorithm
executes 20 SubBytes operations. In our architecture, the S-Box array supports 1-20
S-Boxes, which covers a wide range of performance and hardware cost requirements.
For the MixColumns array, MixColumns is a 32-bit operation, and four MixColumns
operations are executed in each round of the AES algorithm. Correspondingly, 1-4 MixColumns modules
can be used in this architecture.
The advantage of the scalable S-Box array and MixColumns array is that
performance and hardware cost can be balanced simply by adjusting the number of
processing elements. These two scalable modules greatly improve the scalability of the
AES hardware.
4.1.3.2 Two typical subclass architectures
Based on the proposed scalable architecture, there are two typical subclass
architectures: Shared S-Box Architecture and Unified S-Box Architecture. Table 4.4
compares these two architectures.
Figure 4.6 Shared S-Box Architecture.
Shared S-Box Architecture
As shown in Figure 4.6, the shared S-Box architecture uses a single S-Box array for
both the data registers and the key registers. The advantage of this architecture is that all of
the S-Boxes can be used for both the data SubBytes and the key SubBytes operations. However, these two
SubBytes operations then have to take turns on the same array (see the dataflow in Section 4.1.4.2).
Thus, this architecture is suitable for low hardware cost
implementations.
Figure 4.7 Unified S-Box Architecture
Unified S-Box Architecture
The unified S-Box architecture separates the S-Boxes into two parts: a data S-Box array and a key
S-Box array, as shown in Figure 4.7. Since the key SubBytes operation is executed four times in each round
of AES, up to four S-Boxes can be used for the key. For the data S-Box array, the number of S-Boxes can be
1-16. The advantage of this architecture is that the key operation is completely separated from
the data operation. It needs fewer clock cycles than the shared architecture and is well suited
to high-performance implementations. However, the utilization of the key S-Boxes is not
high, because the key SubBytes operation is executed far less often than the data SubBytes operation.
Table 4.4 Comparison of two architectures.
                                 Shared S-Box Arch.           Unified S-Box Arch.
Configurable S-Box number        1 to 20                      Data part: 1-16; Key part: 1-4
Configurable MixColumns number   1 to 4                       1 to 4
Advantages                       S-Box utilization is high;   Separate Key and Data operation;
                                 Save hardware cost           Easy to control
Orientation                      Low-cost AES                 High-performance AES
4.1.3.3 Sub-modules
As shown in Figure 4.5, there are three main sub-modules in the architecture: the ShiftRows
module, the S-Box module, and the MixColumns module. ShiftRows is very simple:
it is just a set of shifting operations that can be implemented by hard
wiring. The detailed implementations of the S-Box and MixColumns modules are presented as follows:
S-Box Design
The S-Box design follows [49]. The
architecture of the S-Box is shown in Figure 4.8. This S-Box design uses a normal basis to
optimize the GF(2^8) inverter by mapping it to the composite field GF(((2^2)^2)^2). The
affine transformation is a matrix operation, which is easy to implement in hardware.
According to the experimental results of Canr
Figure 4.8 S-Box structure.
a) Factors of Inverse MixColumns b) Dual-function module
Figure 4.9 MixColumns structure.
MixColumns Design
There are many hardware reuse methods for the MixColumns module. We refer to the method in
[48]. Figure 4.9 a) shows the reuse method for this module. In this
equation, InvMixColumns is decomposed into one MixColumns operation with two
additional matrices. Figure 4.9 b) shows the hardware architecture of the dual-function
MixColumns module.
4.1.4 Performance Analysis
4.1.4.1 Scalability
Our proposed scalable architecture provides a high degree of scalability: there are 336
possible implementations in total. As shown in Table 4.5, the
shared S-Box architecture can use 1-20 S-Boxes and 1-4 MixColumns modules,
which gives 20 x 4 = 80 possible implementations. The unified
S-Box architecture can use 1-16 S-Boxes for data, 1-4 S-Boxes for key, and 1-4 MixColumns
modules, which gives 16 x 4 x 4 = 256 possibilities. Counting both architectures, there are
80 + 256 = 336 possible implementations of AES based on the proposed scalable architecture.
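The count of 336 can be reproduced by enumerating the configuration space. The following is a minimal sketch, assuming that every combination of the ranges given above is a valid configuration; the function names are hypothetical and used only for illustration.

```python
from itertools import product

def shared_configs():
    # Shared S-Box architecture: one array of 1-20 S-Boxes (data and key)
    # combined with 1-4 MixColumns modules.
    return list(product(range(1, 21), range(1, 5)))

def unified_configs():
    # Unified S-Box architecture: 1-16 data S-Boxes, 1-4 key S-Boxes,
    # and 1-4 MixColumns modules.
    return list(product(range(1, 17), range(1, 5), range(1, 5)))

if __name__ == "__main__":
    n_shared = len(shared_configs())
    n_unified = len(unified_configs())
    print(n_shared, n_unified, n_shared + n_unified)
```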
Table 4.5 Possible implementations of AES based on scalable architecture.
                          Shared S-Box Arch.          Unified S-Box Arch.
S-Box for Data            1-20 (one array for         1-16
S-Box for Key             data and key)               1-4
MixColumns module         1-4                         1-4
Possible Configurations   80                          256
Total                     336
Figure 4.10 Dataflows for scalable architecture.
4.1.4.2 Dataflows
Figure 4.10 shows the dataflows for the proposed scalable architecture. The dataflow
includes three parts: first round, round i, and final round. The first round includes two
sub-procedures: {A, S} and {ks}. The notation is explained in the table in this
figure. In particular, A and S are executed in the same clock cycle. Round i is the loop
body of AES; for a 128-bit key, the loop is executed 9 times. In this step, for the unified
S-Box architecture, the dataflow includes two sub-procedures: {A, S, ks} and {A, R}. For the
shared S-Box architecture, the dataflow includes three sub-procedures: {S}, {ks, M} and
{A, R}. All of the operations within the same sub-procedure are executed in parallel. The final
round is the last round of AES, and it includes two sub-procedures: {S} and {A}. For an
AES-128 encryption, the total numbers of clock cycles needed by these two architectures are,
Clock cycles for shared S-Box Architecture
Clock cycles for unified S-Box Architecture
4.1.4.3 Hardware Implementation
To obtain the lowest hardware cost, we use only one S-Box and one MixColumns module
and constrain the design with a loose frequency target in the
synthesis tool. The synthesis results are shown in Table 4.6. We use the TSMC 0.18 um
standard cell library and Synopsys Design Compiler for synthesis. To
measure the highest performance of the proposed architecture, we use 20 S-Boxes and 4
MixColumns modules and constrain the design for maximum frequency. The synthesis
results are shown in Table 4.7.
The results show that the lowest hardware cost for each component of the AES hardware
is achieved when the frequency constraint is at or below 123 MHz. In other words, the
hardware cost increases sharply once the constrained frequency is set above
123 MHz. Because the parallel data paths shorten the critical path, the maximum frequency
of the proposed architecture reaches 416 MHz. Along with the performance, the
hardware cost also increases considerably. The detailed hardware cost is listed in Table 4.7.
Table 4.6 Hardware cost of lowest cost AES @ 123 MHz, 0.18 um.
Components Gates
Data Registers 1079
ShiftRows + AddRoundKey 307
S-Box 358
MixColumns/InvMixcolumns 376
Key Expander + Key Registers 1935
Controller 247
Others 318
Total 4620
Table 4.7 Hardware cost of highest performance AES @ 416 MHz, 0.18 um.
Components Gates
Data Registers 1079
ShiftRows + AddRoundKey 310
S-Box 1138 * 20
MixColumns/InvMixcolumns 349 * 4
Key Expander + Key Registers 2548
Controller 264
Others 1112
Total 29469
Table 4.8 Scalability of hardware implementations.
Components                                 Lowest hardware cost        Highest performance
                                           implementation              implementation
Technology                                 TSMC 0.18 um                TSMC 0.18 um
Frequency                                  123 MHz                     416 MHz
Hardware cost                              1 S-Box, 1 MixColumns       20 S-Boxes, 4 MixColumns
Required clock cycles for AES encryption   211                         22
Throughput                                 75 Mbps                     2.4 Gbps
Table 4.9
                                   *32-bit              8-Bit                Proposed Scalable Architecture
                                   Architecture [48]    Architecture [50]    Implementation       Implementation
                                                                             example 1            example 2
Hardware cost                      4 S-Box,             1 S-Box,             5 S-Box,             4 S-Box,
                                   1 MixColumns         1/4 MixColumns       1 MixColumns         1 MixColumns
Gate                               7226                 3595                 7344                 6986
Frequency                          138 MHz @ 0.18 um    100 KHz @ 0.35 um    180 MHz @ 0.18 um    180 MHz @ 0.18 um
Clock cycles for AES encryption    54                   >1000                54                   64
Throughput                         327 Mbps             12.6 kbps            427 Mbps             360 Mbps
Power consumption                  18.3 mW              8.5 uW               22.3 mW              20 mW
Scalable                           NO                   NO                   YES
ture, and implemented by us.
Figure 4.11
Table 4.8 shows the scalability of the proposed hardware design of AES. The lowest
hardware cost implementation uses only 1 S-Box and 1 MixColumns module, and its
throughput is 75 Mbps. The highest performance implementation uses 20
S-Boxes and 4 MixColumns modules, and its throughput reaches 2.4 Gbps.
Table 4.9 compares the proposed architecture with a 32-bit architecture and an 8-bit
architecture. For a fair comparison with Sato's architecture, two example
implementations with similar hardware cost were designed [66]. Figure 4.11 shows this
comparison more clearly. It can be seen that our scalable architecture provides great
scalability in both performance and hardware cost. At the same level of hardware cost,
our design achieves better performance.
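The throughput figures in Tables 4.8 and 4.9 are consistent with a simple relation between clock frequency and cycles per block. The sketch below assumes throughput = 128 bits x frequency / clock cycles per encryption; this formula is not stated explicitly in the text, and the function name is hypothetical.

```python
def aes_throughput_mbps(freq_mhz, cycles_per_block, block_bits=128):
    # Assumed relation: one 128-bit block is produced every
    # `cycles_per_block` clock cycles at `freq_mhz` MHz.
    return block_bits * freq_mhz / cycles_per_block

if __name__ == "__main__":
    # (frequency in MHz, clock cycles) taken from Tables 4.8 and 4.9
    for label, f, c in [
        ("lowest cost", 123, 211),
        ("highest performance", 416, 22),
        ("example 1", 180, 54),
        ("example 2", 180, 64),
    ]:
        print(f"{label}: {aes_throughput_mbps(f, c):.1f} Mbps")
```

Under this assumption the two extreme configurations differ by a factor of roughly 32 in throughput, which is the scalability range the tables report.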
4.2 Hardware Design of RSA
4.2.1 Introduction of RSA Algorithm
Public key cryptography plays a very important role in modern information security.
It can be used not only to encrypt and decrypt data, like symmetric cryptography, but also to
provide services such as confidentiality, authentication, data integrity checking, and
non-repudiation. The RSA algorithm [38], proposed by Rivest, Shamir, and
Adleman in 1977, is the most widely used public key cryptographic algorithm.
The RSA algorithm uses modular multiplication as its primary operation. As the
key length grows, the speed of modular multiplication
becomes a bottleneck. Many papers have been published on accelerating
modular multiplication. To date, the Montgomery modular multiplication algorithm [73] is
considered the most efficient, and many hardware implementations are based
on it. Some of them focus on scalable design [74-75], which enables the
hardware to handle encryption and decryption with any key length.
Some focus on high-radix design [76-78, 80], which reduces the total number of clock cycles per
multiplication. Some focus on dataflow optimization [79], which reduces the delay
cycles in the dataflow.
Algorithm 1 shows the original Montgomery algorithm. The advantage of this
algorithm is that the division in the modular operation is replaced by a shift, which makes
the algorithm very suitable for hardware implementation.
A direct hardware implementation of the Montgomery algorithm supports only a fixed
key length. To make the Montgomery algorithm scalable to variable key lengths
and to improve its speed, Tenca and Koç proposed a scalable high-radix
Montgomery multiplication algorithm, shown in Algorithm 2.
The sign ext in steps 12 and 13 is the sign-extension operation. The Booth function in step
3 is used to support high-radix operation. The Booth function is defined as
$\mathrm{Booth}(X_{j+k-1 \ldots j-1}) = -2^{k-1} X_{j+k-1} + 2^{k-2} X_{j+k-2} + \cdots + 2^{0} X_{j} + X_{j-1}$ (4.3)
Algorithm 1: Montgomery Multiplication Algorithm
Input: X, Y, M
Output: $S = \mathrm{MM}(X, Y) = X Y r^{-1} \bmod M$
1. $S = 0$
2. For $i = 0$ to $N-1$
3.    $S = S + X_{i} Y$
4.    $S = S + S_{0} M$
5.    $S = S / 2$
6. End For
7. If $S \ge M$ Then $S = S - M$
8. Return S
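The steps of Algorithm 1 can be sketched in software as follows. This is a minimal illustration, assuming that M is odd, that X and Y are smaller than M, that M is smaller than 2^N, and that r = 2^N; the function name is hypothetical, and the reference check relies on Python 3.8 or later for the modular inverse via pow.

```python
def montgomery_multiply(x, y, m, n):
    """Bit-serial Montgomery multiplication: returns x*y*2^(-n) mod m.

    Assumes m is odd and 0 <= x, y < m < 2**n.
    """
    s = 0
    for i in range(n):
        x_i = (x >> i) & 1      # bit i of X
        s += x_i * y            # step 3: S = S + X_i * Y
        s += (s & 1) * m        # step 4: add M if S is odd, making S even
        s >>= 1                 # step 5: exact division by 2 (a shift)
    if s >= m:                  # step 7: final conditional subtraction
        s -= m
    return s

if __name__ == "__main__":
    m, n = 239, 8
    x, y = 123, 200
    reference = (x * y * pow(1 << n, -1, m)) % m
    print(montgomery_multiply(x, y, m, n), reference)
```

The only division left in the loop is the shift in step 5, which is why the algorithm maps well to hardware.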
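Algorithm 2: Scalable High-radix Montgomery Multiplication Algorithm
Input: X, Y, M
Output: $S = \mathrm{MM}(X, Y) = X Y r^{-1} \bmod M$
1. $S = 0$, $C = 0$, $X_{-1} = 0$
2. For $j = 0$ to $N-1$ Step $k$
3.    $q_{Y_j} = \mathrm{Booth}(X_{j+k-1 \ldots j-1})$
4.    $(C^{(0)}, S^{(0)}) = C^{(0)} + S^{(0)} + q_{Y_j} Y^{(0)}$
5.    $q_{M_j} = (S^{(0)}_{k-1 \ldots 0} + C^{(0)}_{k-1 \ldots 0}) \cdot (2^{k} - (M^{(0)})^{-1}_{k-1 \ldots 0}) \bmod 2^{k}$
6.    $(C^{(0)}, S^{(0)}) = C^{(0)} + S^{(0)} + q_{M_j} M^{(0)}$
7.    For $i = 1$ to $NW - 1$
8.       $(C^{(i)}, S^{(i)}) = C^{(i)} + S^{(i)} + q_{Y_j} Y^{(i)} + q_{M_j} M^{(i)}$
9.       $S^{(i-1)} = (S^{(i)}_{k-1 \ldots 0}, S^{(i-1)}_{BPW-1 \ldots k})$
10.      $C^{(i-1)} = (C^{(i)}_{k-1 \ldots 0}, C^{(i-1)}_{BPW-1 \ldots k})$
11.   End For
12.   $S^{(NW-1)} = \text{sign ext}(S^{(NW-1)}_{BPW-1 \ldots k})$
13.   $C^{(NW-1)} = \text{sign ext}(S^{(NW-1)}_{BPW-1 \ldots k})$
14. End For
15. $P = S + C$
16. If $P \ge M$ Then $P = P - M$
17. Return P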
Algorithm 2 has several advantages over Algorithm 1. Firstly, word-based
operation makes the multiplier scalable to variable key lengths. Secondly, the high-radix
design processes multiple bits of X in every loop iteration (step 3 of Algorithm 2), which reduces the
number of clock cycles per multiplication. Thirdly, a carry-save adder is introduced to shorten
the critical path.
However, Algorithm 2 also has disadvantages. Firstly, the high-radix design
makes the calculation of qYj and qMj very complex, and the path delay grows
quickly as the radix increases. Secondly, the scalable design introduces a data
dependency in the pipeline, which causes a delay of two clock cycles.
4.2.2 Proposed Optimized Algorithm
Algorithm 3 [68] differs from Algorithm 2 in two ways. Firstly, in step 1, Y is multiplied by 2^k,
so that the k least significant bits of Y are all zero. The result of ($S^{(0)}_{k-1 \ldots 0}$, $C^{(0)}_{k-1 \ldots 0}$)
is therefore not changed by adding $q_{Y_j} Y^{(0)}$, and the calculation of qMj becomes independent of qYj.
Secondly, the calculations of qYj and qMj are performed in parallel (steps 3 and 4), whereas they are
performed serially in Algorithm 2 (steps 3 and 5).
Algorithm 3: Modified scalable high-radix Montgomery multiplication algorithm
Input: X, Y, M
Output: $S = \mathrm{MM}(X, Y) = X Y r^{-1} \bmod M$
1. $S = 0$, $C = 0$, $X_{-1} = 0$, $Y = Y \cdot 2^{k}$
2. For $j = 0$ to $N+k-1$ Step $k$
3.    $q_{Y_j} = \mathrm{Booth}(X_{j+k-1 \ldots j-1})$
4.    $q_{M_j} = (S^{(0)}_{k-1 \ldots 0} + C^{(0)}_{k-1 \ldots 0}) \cdot (2^{k} - (M^{(0)})^{-1}_{k-1 \ldots 0}) \bmod 2^{k}$
5.    $(C^{(0)}, S^{(0)}) = S^{(0)} + C^{(0)} + q_{Y_j} Y^{(0)} + q_{M_j} M^{(0)}$
6.    For $i = 1$ to $NW - 1$
7.       $(C^{(i)}, S^{(i)}) = S^{(i)} + C^{(i)} + q_{Y_j} Y^{(i)} + q_{M_j} M^{(i)}$
8.       $S^{(i-1)} = (S^{(i)}_{k-1 \ldots 0}, S^{(i-1)}_{BPW-1 \ldots k})$
9.       $C^{(i-1)} = (C^{(i)}_{k-1 \ldots 0}, C^{(i-1)}_{BPW-1 \ldots k})$
10.   End For
11.   $S^{(NW-1)} = \text{sign ext}(S^{(NW-1)}_{BPW-1 \ldots k})$
12.   $C^{(NW-1)} = \text{sign ext}(S^{(NW-1)}_{BPW-1 \ldots k})$
13. End For
14. $P = S + C$
15. If $P \ge M$ Then $P = P - M$
16. Return P
For a high-radix design, the path delay of qYj and qMj is large. Because our proposed
algorithm computes qYj and qMj in parallel, it achieves a much shorter
critical path than Algorithm 2 and is therefore much more suitable for
high-speed hardware implementation of the Montgomery algorithm.
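The effect of pre-multiplying Y by 2^k can be illustrated at the integer level. The sketch below is not the word-level carry-save Algorithm 3: it assumes plain unsigned radix-2^k digits of X instead of Booth recoding, a single integer accumulator instead of (S, C) words, an odd modulus M smaller than 2^N, operands smaller than M, and k dividing N. The function name is hypothetical. It shows that the quotient digit for M depends only on the accumulator, not on the current digit of X, and that the extra iteration compensates for the factor 2^k.

```python
def modified_montgomery(x, y, m, n, k):
    """Integer-level sketch of the Y*2^k idea: returns x*y*2^(-n) mod m.

    Assumes m is odd, 0 <= x, y < m < 2**n, and k divides n.
    """
    mask = (1 << k) - 1
    inv_term = (1 << k) - pow(m, -1, 1 << k)   # 2^k - M^(-1) mod 2^k
    y_shifted = y << k                          # step 1: Y = Y * 2^k
    s = 0
    for j in range(0, n + k, k):                # one extra digit because of the pre-shift
        x_digit = (x >> j) & mask               # unsigned digit (no Booth recoding here)
        # q_m uses only the low k bits of s: adding x_digit * y_shifted
        # cannot change them, so q_m is independent of x_digit.
        q_m = ((s & mask) * inv_term) & mask
        s = (s + x_digit * y_shifted + q_m * m) >> k   # exact division by 2^k
    if s >= m:
        s -= m
    return s

if __name__ == "__main__":
    m, n, k = 0xF123, 16, 4
    x, y = 0x1234, 0xBEEF
    reference = (x * y * pow(1 << n, -1, m)) % m
    print(modified_montgomery(x, y, m, n, k), reference)
```

In hardware terms, the two digit computations can start from values that are already available at the beginning of the iteration, which is the property the critical-path argument above relies on.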
4.2.3 Proposed Optimized Data Flow
The most frequently used dataflow for the scalable high-radix Montgomery algorithm is
shown in Figure 4.12(a), which was proposed by Tenca and Koç in [76]. It is a
pipelined dataflow of Algorithm 2. Due to the data dependency in Algorithm 2 (in steps 9 and
10, the output (S^(i-1), C^(i-1)) needs both (S^(i), C^(i)) and (S^(i-1), C^(i-1))), there is a delay of two clock cycles
in this dataflow. As a result, this dataflow needs more clock cycles to complete one
multiplication.
To deal with this problem, Herris proposed a new radix-2 dataflow in [79],
shown in Figure 4.12(b). This dataflow achieves a one-clock-cycle delay by
removing the data dependency of Algorithm 2. As shown in Algorithm 2, the right-shifting
(steps 9 and 10) causes the data dependency. In this dataflow, the right-shifting of the product S
is removed. As a result, the product S is effectively multiplied by 2 in every pipeline
stage, so the input data (Y, M) of the next stage need to be multiplied by 2 as well.
-save result and support high-
dataflow achieves one clock cycle delay while using radix-
Figure 4.12 Optimized Data Flow.
The proposed dataflow is shown in Figure 4.12(c) [68]. This dataflow is based on the
proposed Algorithm 3. It achieves both a one-clock-cycle delay and a high-radix design. As
shown in the figure, operand Y is initially multiplied by 2^k as specified in Algorithm 3, so the
input data (Y, M) for each stage becomes (2^k Y, M). To achieve a one-clock-cycle
delay in the dataflow while supporting a high-radix design, the input data (2^k Y, M) needs to be
multiplied by 2^k cumulatively in each stage (except the first stage). Compare with
-radix and one
clock cycle delay make this dataflow need very few clock cycles to do multiplication.
Secondly, algorithm 3 used in this dataflow can achieve much shorter critical path than
-speed design of
Montgomery multiplier.
4.2.4 Proposed Hardware Architecture for RSA
The proposed Montgomery multiplier is shown in Figure 4.13(a) [68]. The
data path contains NS MM Cells. The MM Cell is the basic processing element in the
pipeline. There are two coefficient processing elements, the qYj PE and the qMj PE, which can be
shared by all of the MM Cells in the pipeline. From Figure 4.12(c), it can be seen that
qYj and qMj are calculated only in the first cycle of each stage (the gray cycles
in Figure 4.12(c)). None of the remaining cycles (white cycles) needs to calculate qYj
or qMj. This property makes it possible to reuse the qYj PE and the qMj PE in the data path.
The FIFO in Figure 4.13(a) is used to avoid data overflow in the pipeline. When NW
(the number of words per operand) is larger than NS (the number of stages in the dataflow), data
overflow occurs in the pipeline, and the FIFO stores the overflowed data
temporarily.
Figure 4.13 Proposed Hardware Architecture.
Parallel Radix-16 MM-Cell Design
The design of the MM Cell is shown in Figure 4.13(b) [68]. This is a high-radix design; in
this work, we implement a radix-16 (k = 4) MM Cell. As shown in Algorithm 3,
the function of the MM Cell is:
$(C^{(i)}, S^{(i)}) = S^{(i)} + C^{(i)} + q_{Y_j} Y^{(i)} + q_{M_j} M^{(i)}$ (4.4)
With the proposed dataflow in Figure 4.12(c), the input data ($S^{(i)}$, $C^{(i)}$, $Y^{(i)}$, $M^{(i)}$)
becomes ($2^{jk} S^{(i)}$, $2^{jk} C^{(i)}$, $2^{(j+1)k} Y^{(i)}$, $2^{jk} M^{(i)}$) in the j-th pipeline stage. The inputs qYj_sel and
qMj_sel are the select signals for the multiplexers, which are generated from qYj and qMj by the qYj
PE and the qMj PE. The product qYj Y^(i) is implemented as follows. Firstly, qYj is split into
two numbers, each of which is a power of 2. Secondly, Y^(i) is shifted to obtain the two components of qYj Y^(i)
that correspond to these two numbers. Finally, these two components are added to (S^(i), C^(i)) using a
4-to-2 carry-save adder. For example, when qYj = 6, qYj can be split into 2 and 4, so
6Y^(i) can be represented as (2Y^(i) + 4Y^(i)). The inputs of the 4-to-2 carry-save adder are then (2Y^(i), 4Y^(i),
S^(i), C^(i)). Because 2Y^(i) and 4Y^(i) can easily be generated by left-shifting Y^(i), the
multiplication 6Y^(i) is avoided. The product qMj M^(i) is implemented in the same way as qYj Y^(i). The
Shift & Inverse modules are used for this purpose.
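The splitting step can be checked with a small search. The sketch below assumes that each of the two components may be zero or a positive or negated power of two up to 8 (negation being what the Shift & Inverse modules would provide); this reading and the names used are assumptions for illustration, not a description of the actual multiplexer encoding.

```python
# Candidate components: zero and signed powers of two up to 8 (assumed set).
CANDIDATES = [0, 1, -1, 2, -2, 4, -4, 8, -8]

def split_digit(d):
    """Return (a, b) from CANDIDATES with a + b == d."""
    for a in CANDIDATES:
        for b in CANDIDATES:
            if a + b == d:
                return a, b
    raise ValueError(f"{d} cannot be written as two shifted terms")

if __name__ == "__main__":
    for d in range(-8, 9):
        a, b = split_digit(d)
        print(f"{d:3d} = {a:3d} + {b:3d}")
```

Every digit in [-8, 8] has such a split under this assumption, so a 4-to-2 carry-save adder with two shifted operands is sufficient for the qYj Y^(i) term.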
Considering equation (4.3), for radix-16 it becomes:
$q_{Y_j} = \mathrm{Booth}(X_{j+3 \ldots j-1}) = -2^{3} X_{j+3} + 2^{2} X_{j+2} + 2 X_{j+1} + X_{j} + X_{j-1}$ (4.5)
The range of qYj is [-8, 8]. Every number in this range can be split into two
components, each of which is a power of 2. For qMj under radix-16,
$q_{M_j} = (S^{(0)}_{3 \ldots 0} + C^{(0)}_{3 \ldots 0}) \cdot (16 - (M^{(0)})^{-1}_{3 \ldots 0}) \bmod 16$ (4.6)
The range of qMj is [0, 15], and not every number in this range can be split in this way. In
order to deal with this problem, we propose a mapping table from [0, 15] to [-7, 8] in
Table 4.10, which can be used equivalently for qMj.
Table 4.10 Mapping of qMj from [0, 15] to [-7, 8].
qMj  Mapped   qMj  Mapped   qMj  Mapped   qMj  Mapped
0    0        4    4        8    8        12   -4
1    1        5    5        9    -7       13   -3
2    2        6    6        10   -6       14   -2
3    3        7    7        11   -5       15   -1
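The mapping in Table 4.10 replaces each value above 8 by the congruent value modulo 16, so the low four bits of the accumulated sum are cancelled exactly as before. The following sketch checks this, again assuming that a component is zero or a signed power of two up to 8; the helper names are hypothetical.

```python
SIGNED_POWERS = [0, 1, -1, 2, -2, 4, -4, 8, -8]   # assumed component set

def map_qm(q):
    """Map q in [0, 15] to the congruent value in [-7, 8], as in Table 4.10."""
    return q if q <= 8 else q - 16

def is_two_term(d):
    """True if d is a sum of two entries of SIGNED_POWERS."""
    return any(a + b == d for a in SIGNED_POWERS for b in SIGNED_POWERS)

if __name__ == "__main__":
    for q in range(16):
        mapped = map_qm(q)
        # Same residue modulo 16, so the reduction step is unaffected.
        assert mapped % 16 == q
        print(q, mapped, is_two_term(q), is_two_term(mapped))
```

Under this assumption the unmapped values 11, 13, 14 and 15 have no two-term split, while every mapped value does.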
Very Low Complexity Implementation of qMj
The calculation of qMj is much more complex than that of qYj. Normally, the qYj PE and the qMj PE are
implemented directly as lookup tables, whose size increases exponentially with the
radix number. For radix 16, the table for qMj is 4 times as large as that for qYj. To
reduce the cost of the qMj calculation, we present a very low complexity implementation of qMj
[68]. Considering equation (4.6), we use InvM to represent $(16 - (M^{(0)})^{-1}_{3 \ldots 0})$. Table 4.11
shows the mapping from M0 to InvM. As the modulus M is an odd number, there are 8
different values of M0.
Table 4.11 Mapping from M0 to InvM.
M0     InvM     M0     InvM
0001   1111     1001   0111
0011   0101     1011   1101
0101   0011     1101   1011
0111   1001     1111   0001
All of these values can be divided into two groups: {(0001, 1111), (0111, 1001)} and
{(0011, 0101), (1011, 1101)}. Here (A, B) denotes a pair ($M^{(0)}_{3:0}$, InvM) or (InvM, $M^{(0)}_{3:0}$).
In the first group:
$\mathrm{InvM} = \sim M^{(0)}_{3:0} + 1$ (4.7)
In the second group:
$\mathrm{InvM} = \{M^{(0)}_{3}, M^{(0)}_{1}, M^{(0)}_{2}, M^{(0)}_{0}\}$ (4.8)
The two groups are distinguished as follows:
Group 1: $M^{(0)}_{2} \oplus M^{(0)}_{1} = 0$
Group 2: $M^{(0)}_{2} \oplus M^{(0)}_{1} = 1$
For example, 0001 is in group 1 because 0 xor 0 equals 0, and 0011 is in group 2 because 0 xor 1
equals 1. Based on the above analysis, qMj can be implemented as shown in Figure 4.14.
Because the modulus M is an odd number, equation (4.7) can be further simplified to:
$\mathrm{InvM} = \sim M^{(0)}_{3:0} + 1 = \{\sim M^{(0)}_{3}, \sim M^{(0)}_{2}, \sim M^{(0)}_{1}, 1\}$ (4.9)
After qMj is calculated, qMj_sel can be obtained from a small lookup
table, in the same way as qYj_sel.
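The two bit-level rules can be verified against the definition InvM = 16 - M^(-1) mod 16 for all eight odd values of the low four bits of M. The sketch below does this; the function names are hypothetical, and the reference computation uses Python 3.8 or later for the modular inverse via pow.

```python
def inv_m_reference(m0):
    """InvM by definition: 16 minus the inverse of m0 modulo 16 (m0 odd)."""
    return 16 - pow(m0, -1, 16)

def inv_m_low_cost(m0):
    """InvM from the bit-level rules of equations (4.8) and (4.9)."""
    b3, b2, b1, b0 = (m0 >> 3) & 1, (m0 >> 2) & 1, (m0 >> 1) & 1, m0 & 1
    if (b2 ^ b1) == 0:
        # Group 1: invert bits 3..1 and force the lowest bit to 1.
        return ((b3 ^ 1) << 3) | ((b2 ^ 1) << 2) | ((b1 ^ 1) << 1) | 1
    # Group 2: swap bits 2 and 1, keep bits 3 and 0.
    return (b3 << 3) | (b1 << 2) | (b2 << 1) | b0

if __name__ == "__main__":
    for m0 in range(1, 16, 2):
        print(f"{m0:04b} -> {inv_m_low_cost(m0):04b}")
```

The bit-level version needs only three inverters, one XOR and a 2-to-1 selection per bit, which is consistent with the claim that no lookup table is required for this step.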
Figure 4.14 Implementation of qMj.
4.2.5 Performance Analysis
Table 4.12 compares the clock cycles of our dataflow with those of the Tenca-Koç
dataflow and the Herris dataflow. It shows that our dataflow needs far fewer clock cycles.
In this table, different key lengths and numbers of stages are used to calculate
the clock cycles of each dataflow, and BPW is equal to 32. Because our dataflow and the
Tenca-Koç dataflow are high-radix dataflows, we use radix-16 for these two dataflows in
Table 4.12, while the Herris dataflow uses radix-2. In this table, the NS of the one-clock-cycle-delay
dataflow is twice that of the two-clock-cycle-delay dataflow. The reason is illustrated in Figure
4.15. Each dataflow stage of the two-clock-cycle-delay dataflow in Figure 4.15(a) actually
includes two data path stages: one is the MM Cell stage, which performs the operation of equation (4.4), and the
other is the register bank stage, which stores (Y^(i), M^(i), S^(i), C^(i)). The dashed cycles in Figure
4.15(a) represent the registering stages in the dataflow. For a fair comparison, the NS of the
one-clock-cycle-delay dataflow should therefore be two times that of the two-clock-cycle-delay dataflow.
Table 4.12 Clock cycles comparison of different dataflows.
NS(2): NS of the two-clock-cycle-delay dataflow. NS(1): NS of the one-clock-cycle-delay dataflow.
N      NS(2)  NS(1)  NW   This paper     Tenca-Koç dataflow (radix-16)   Herris dataflow (radix-2)
                          clock cycles   Clock      Reduced              Clock      Reduced
                          (radix-16)     cycles     cycles (%)           cycles     cycles (%)
512    4      8      16   297            552        46.2%                1111       73.3%
512    8      16     16   171            289        40.8%                575        70.3%
512    16     32     16   167            281        40.6%                559        70.1%
1024   8      16     32   578            1072       46.1%                2159       73.2%
1024   16     32     32   333            561        40.6%                1119       70.2%
1024   32     64     32   329            553        40.5%                1103       70.2%
2048   16     32     64   1140           2112       46.0%                4255       73.25%
2048   32     64     64   657            1105       40.5%                2207       70.2%
2048   64     128    64   653            1097       40.5%                2191       70.2%
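The percentage columns of Table 4.12 can be recomputed from the cycle counts. The sketch below assumes that "Reduced Cycles (%)" means 100 x (1 - cycles of this work / cycles of the other dataflow); the text does not state the formula, and the names used are hypothetical.

```python
# (N, cycles of this work, Tenca-Koc cycles, Herris cycles) from Table 4.12
ROWS = [
    (512, 297, 552, 1111), (512, 171, 289, 575), (512, 167, 281, 559),
    (1024, 578, 1072, 2159), (1024, 333, 561, 1119), (1024, 329, 553, 1103),
    (2048, 1140, 2112, 4255), (2048, 657, 1105, 2207), (2048, 653, 1097, 2191),
]

def reduced_percent(ours, theirs):
    # Assumed definition of the "Reduced Cycles (%)" column.
    return 100.0 * (1.0 - ours / theirs)

if __name__ == "__main__":
    for n, ours, tenca_koc, herris in ROWS:
        print(n, ours,
              f"{reduced_percent(ours, tenca_koc):.1f}%",
              f"{reduced_percent(ours, herris):.1f}%")
```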
Table 4.13
Ref.         Tech      Area (Gates/LUTs)                   Frequency   Scalable   Radix   Delay cycles   1024-bit RSA
                                                           (MHz)                          in dataflow    throughput (Kbps)
[77]         0.5 um    28k Gates                           64 MHz      Yes        8       2              22 Kbps
[78]         0.25 um   100k Gates                          125 MHz     Yes        16      2              143 Kbps
[79]         FPGA      5598 LUTs                           144 MHz     Yes        2       1              62.5 Kbps
[80]         FPGA      2847 LUTs + 32 Mults + 5n RAM       102 MHz     Yes        2^16    4              152 Kbps
[72]         0.18 um   150k                                300 MHz     Yes        2       2              100 Kbps
This Paper   0.25 um   130k                                180 MHz     Yes        16      1              352 Kbps
Figure 4.15 Comparison of dataflow and corresponding data path.
The ASIC implementation of this work uses NS = 32, BPW = 32, and radix 16. We use the
HHNEC 0.25 um CMOS standard cell library and Synopsys EDA tools for the ASIC
design. Table 4.13 compares our design with related
work. [77] is a radix-8 design proposed by Tenca-Koç, and [78] is an improved radix-16
design. The frequency of our design
is higher than that of [78] under the same radix. [79] is an FPGA
implementation using a one-clock-cycle-delay dataflow, [80] is a very high radix design
(radix 2^16) using the multipliers and RAM embedded in the FPGA, and [72] is a radix-2
scalable design using a two-clock-cycle-delay dataflow.
Normally, a radix-2^k design achieves about k times the performance of a radix-2
design, and a one-clock-cycle-delay dataflow achieves about 2 times the performance of a
two-clock-cycle-delay dataflow. As Table 4.13 shows, our design uses radix-16 with a one-
clock-cycle-delay dataflow and achieves much higher performance than the other designs.
4.3 Conclusion
This chapter presents the hardware design of encryption accelerators for AES
and RSA. For AES, our proposed architecture provides high scalability for designing
AES implementations with various hardware costs and performance levels. Two
subclass architectures are introduced: the unified S-Box architecture and the shared S-Box
architecture. The comparison and the advantages of these two architectures are discussed.
The sub-module designs are also described. For RSA, our proposed
architecture parallelizes the data path and shortens the critical path. By using the proposed
clock-saving dataflow, it reduces the total number of clock cycles per multiplication to a very small
number. Finally, the performance of both the AES and the RSA designs is analyzed.
5 DPA Attack on AES
Cryptographic devices, such as smart cards and USB keys, are widely used.
These devices authenticate users or store secret
information, and their security is a very active research topic. Many researchers
work in this field and have proposed many attack methods for breaking cryptographic devices.
Correspondingly, many countermeasures have been proposed to prevent these attacks.
In recent years, side-channel attacks have become very popular because they
take a different approach. Traditional attack methods use mathematical
analysis to reveal the weak points of a cryptographic algorithm, so the attacker must be
experienced in cryptanalysis and have deep knowledge of the cryptographic
algorithm. A side-channel attack, however, uses only side-channel information, such as
power consumption or execution time, to recover the secret information in a
cryptographic device. The attacker does not need cryptanalysis experience
or detailed knowledge of the cryptographic algorithm. Among the proposed side-channel attack methods,
the DPA attack has proved to be the most efficient and the easiest to implement.
In this chapter, we briefly introduce the DPA attack method, especially the DPA attack
on AES. Some DPA countermeasures are also discussed at the end of this
chapter.
5.1 Introduction of Differential Power Analysis attack
In recent years, several kinds of attacks on cryptographic devices have become public.
The goal of these attacks is to reveal the secret keys of cryptographic devices. Attacks on
cryptographic devices differ significantly in terms of cost, time, equipment, and expertise
needed. When Kocher et al. [52] showed in 1998 that power analysis attacks can
efficiently reveal the secrets of cryptographic devices, the result attracted wide attention. Since then,
power analysis attacks have received the most attention because they are very powerful
and can be conducted relatively easily. Consequently, they pose a serious
threat to the security of cryptographic devices in practice. For the design and
development of modern cryptographic devices, it is crucial to understand power analysis attacks and
countermeasures.
The basic idea of a power analysis attack is to reveal the key by analyzing the power
consumption of a cryptographic device. The power consumption of a
cryptographic device depends on the data it processes and the operation it performs.
By combining the measured power consumption with mathematical methods, an attacker
can recover the secret key.
Power analysis attacks can be classified into two categories: simple power
analysis and differential power analysis. Simple power analysis (SPA) attacks are
characterized by Kocher et al. as a technique that involves
directly interpreting power consumption measurements collected during cryptographic
operations, which means that the attacker tries to derive the key more or less directly from a
given trace. In contrast to SPA, differential power analysis (DPA) attacks require a
large number of power traces and exploit the data dependency of the power
consumption of the cryptographic device.
A DPA attack exploits the fact that the power consumption of a cryptographic device
depends on the intermediate values that are processed during the execution of a
cryptographic algorithm. DPA attacks are the most popular type of power analysis attack
because they do not require detailed knowledge about the attacked
device. Furthermore, they can reveal the secret key of a device even if the recorded power
traces are noisy.
5.1.1 Power Consumption of CMOS Circuit
The total power consumption of a circuit depends on two things: the logic cells
that make up the circuit and the activity of each logic cell. Logic cells can be
considered at the system level, the architecture level, and finally the MOS transistor level. Currently,
logic cells are usually implemented in CMOS. We use the CMOS inverter to describe the
power consumption, because the inverter is representative of all other cells.
As shown in Figure 5.1, the inverter consists of two transistors, P1 and N1,
and a load capacitance CL. The power consumption includes two parts: static power
consumption and dynamic power consumption.
Figure 5.1 CMOS Inverter.
Static Power Consumption
In this CMOS inverter, P1 is conducting and N1 is insulating if the input a is set to
GND. Vice versa, P1 is insulating and N1 is conducting if the input a is set to VDD. In
both cases, there is no direct connection between VDD and GND. Therefore, only
a small leakage current flows through the MOS transistors. This leakage current is denoted by
Ileak. The static power consumption Ps can be calculated by the following equation:
$P_{s} = I_{leak} \cdot V_{DD}$ (5.1)
Dynamic Power Consumption
Dynamic power consumption occurs when a logic cell switches. The output of a logic cell can
perform four transitions: 0->0, 1->1, 0->1 and 1->0. In the first two
cases (0->0, 1->1), only static power is consumed. In the last two cases (0->1, 1->0),
dynamic power is consumed as well. Table 5.1 lists the power consumption for each
transition.
Table 5.1 Power consumption of four transitions in a circuit.
Transitions Power consumption
0 -> 0 Static power consumption
0 -> 1 Static + Dynamic power consumption
1 -> 0 Static + Dynamic power consumption
1 -> 1 Static power consumption
The dynamic power consumption Pd consists of two parts: charging power consumption and
short-circuit power consumption.
1) Charging Power Consumption
A CMOS inverter draws a charging current from the power supply to charge the output
capacitance CL when the output q switches from 0 to 1. CL is the total capacitance
connected to the output port q; it depends on the physical properties of the process
technology, the fanout cells, and the length of the connected wires.
The average charging power Pch consumed by a cell during the time T is given by
equation 5.2:
$P_{ch} = \frac{1}{T}\int_0^T p_{ch}(t)\,dt = \alpha \cdot f \cdot C_L \cdot V_{DD}^2$ (5.2)
In this equation, pch(t) denotes the instantaneous charging power, f is the clock
frequency, and α is the activity factor of the cell, which corresponds to the average
number of 0->1 transitions that occur at the output of the cell in every clock cycle.
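As a rough numerical illustration of the charging power, the sketch below assumes the standard CMOS relation Pch = α · f · CL · VDD² for equation 5.2. The function name and the example cell values are hypothetical and are not taken from this dissertation.

```python
def charging_power(alpha, f_hz, c_load_farad, vdd_volt):
    """Average charging power in watts, assuming P_ch = alpha * f * C_L * VDD^2."""
    return alpha * f_hz * c_load_farad * vdd_volt ** 2


# Hypothetical cell: 10 fF load, 1.2 V supply, 100 MHz clock,
# output rising in half of the clock cycles (alpha = 0.5).
p_ch = charging_power(alpha=0.5, f_hz=100e6, c_load_farad=10e-15, vdd_volt=1.2)
print(f"{p_ch * 1e6:.3f} uW")
```

Because the attack only exploits relative differences, the absolute value matters less than the fact that the result scales linearly with the activity factor.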
2) Short-Circuit Power Consumption
A short circuit occurs temporarily in a CMOS circuit while the output is switching.
In the CMOS inverter, there is a short period of time in which both P1 and N1 are
conducting simultaneously. The average power consumption Psc caused by the short
circuit can be calculated by equation 5.3:
$P_{sc} = \frac{1}{T}\int_0^T p_{sc}(t)\,dt = \alpha \cdot f \cdot V_{DD} \cdot I_{peak} \cdot t_{sc}$ (5.3)
In this equation, psc(t) denotes the instantaneous short-circuit power consumed by a
cell, Ipeak is the current peak caused by the short circuit during a switching event, and
tsc is the duration of the short circuit. The clock frequency f and the activity factor α
are the same as in equation 5.2.
5.1.2 Power Model
A power analysis attack is a cryptographic attack that uses power consumption information
and a hypothetical power model to reveal the secret keys of cryptographic devices. The power
consumption can be easily measured while the cryptographic device is running. The most
important task in a power analysis attack is to build an accurate hypothetical power
model. Unlike the conventional power models used in other applications, the absolute
values of power consumption are not relevant in a power analysis attack; only the
relative differences between simulated power consumption values matter. For this
reason, researchers use only the dynamic power to model the circuit power.
The Hamming weight (HW) model and the Hamming distance (HD) model are two important
hypothetical power models for power analysis attacks. The HW model is a simple
power model, which assumes that the power consumption is proportional to the number
of bits that are set in the processed data value. The data values that are processed before
and after this value are ignored. Therefore, this power model is not very well suited to
describing the power consumption of a CMOS circuit. Equation 5.4 shows the power
consumption based on the HW model.
(5.4)
In this equation, the scaling coefficient is a constant that models the ratio between
circuit power and noise power, HW(S) is the Hamming weight of the internal state S, and n
is the noise power.
The Hamming distance model is more accurate than the Hamming weight model. The basic
idea of the HD model is to count the number of 1->0 and 0->1 transitions
that occur in a digital circuit during a certain time interval. This number is then used to
describe the power consumption of the circuit in this time interval. The HD model
assumes that all 1->0 and 0->1 transitions in a digital circuit lead to the same
power consumption. By dividing the entire simulation of a circuit into small intervals, a
kind of power trace can be generated. This power trace does not contain actual power
consumption values but the number of transitions that occur in the corresponding time
interval. A formal definition of the Hamming distance is given in the following.
$HD(S_1, S_2) = HW(S_1 \oplus S_2)$ (5.5)
The Hamming distance of two values S1 and S2 corresponds to the Hamming weight
of S1 ⊕ S2. The Hamming weight corresponds to the number of bits that are set to one.
Hence, HW(S1 ⊕ S2) corresponds to the number of bits that differ in S1 and S2.
The formal equation of power consumption in the HD model is as follows:
(5.6)
The scaling coefficient and n are the same as in the HW model; HW is replaced by HD in
this equation. S1 and S2 are two states (the outputs of a circuit) in two consecutive
clock cycles.
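The two power models can be sketched in a few lines of Python. This is a minimal illustration, assuming both models are linear (a scaling coefficient times HW or HD, plus noise) as the description suggests; the parameter names `scale` and `noise` are hypothetical stand-ins for the symbols of equations 5.4 and 5.6.

```python
def hamming_weight(value):
    """Number of bits set to one in a non-negative integer."""
    return bin(value).count("1")


def hamming_distance(s1, s2):
    """Number of bit positions in which s1 and s2 differ: HW(s1 XOR s2)."""
    return hamming_weight(s1 ^ s2)


def hw_power(state, scale=1.0, noise=0.0):
    """Hypothetical HW-model power: scale * HW(state) + noise."""
    return scale * hamming_weight(state) + noise


def hd_power(s1, s2, scale=1.0, noise=0.0):
    """Hypothetical HD-model power: scale * HD(s1, s2) + noise."""
    return scale * hamming_distance(s1, s2) + noise


# A register that changes from 0x0F to 0xF0 toggles all eight bits,
# although both values have the same Hamming weight.
print(hamming_weight(0x0F), hamming_weight(0xF0), hamming_distance(0x0F, 0xF0))
```

The example at the end shows why the HD model tracks switching activity more closely: the HW model sees no difference between the two states, while the HD model counts eight transitions.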
5.1.3 Hypothetical Power Consumption based on HD model: Case study
Based on the discussion in the last two subsections, the power consumption of two
specific circuits is discussed in this section.
Figure 5.2 shows an RTL-level diagram of a circuit. The circuit includes two registers,
REG0 and REG1, with a combinational circuit inserted between them. The left part of the
figure shows the state of the circuit at a specific time (S0); after one clock cycle, the
state of the circuit has changed (S1), as shown in the right part.
The power consumption of this circuit includes two parts: the sequential logic (register)
power consumption PREG and the combinational logic power consumption PComb. PREG is
modeled as
(5.7)
where the scaling coefficient is the SNR ratio for the registers and the additive term
denotes the power noise produced by the registers. S0 and S1 are the states of the
registers in different clock cycles. PComb can be modeled in the same way as
(5.8)
where the scaling coefficient is the SNR ratio for the combinational circuit and the
additive term is the power noise produced by the combinational circuit. The total power
consumption of this circuit in this clock cycle is:
(5.9)
Figure 5.3 shows another circuit. In this circuit, the results produced by the
combinational circuit are fed back to the original register, as shown in the figure.
The total power consumption of this circuit is different because the architecture has
changed. As in the discussion above, the power consumption includes:
(5.10)
(5.11)
(5.12)
As a result, the power consumption of the circuit in CASE II depends on only one variable,
S0, while in CASE I it depends on two variables, S0 and S1.
Figure 5.2 Power consumption of a circuit: Case I.
Figure 5.3 Power consumption of a circuit: Case II.
5.1.4 Differential Power Analysis Attacks
The Differential Power Analysis (DPA) attack was proposed by Kocher et al. in 1998. DPA
is a side-channel attack method which measures power consumption and uses a
hypothetical power model to recover secret information. In a successful attack, the
hypothetical power consumption trace of the correctly guessed key displays a significantly
higher correlation with the actual measurements of the cryptographic device than the
traces of the other key guesses. The DPA attack has been proven to be practical and
efficient. Therefore, it poses a serious threat to the security of cryptographic devices.
CPA (Correlation coefficient Power Analysis) was proposed by Brier, Clavier and
Olivier in 2003 [54]. CPA uses the Hamming distance model instead of the Hamming weight
model. It also uses the correlation coefficient instead of the differential coefficient.
CPA is an improvement on the original DPA that attacks a cryptographic device with much
higher accuracy. Normally, DPA attack is used as a common name for both DPA and CPA
attacks. In this dissertation, DPA attack means a power analysis attack using DPA or an
optimized DPA method such as CPA.
DPA uses the following equations to calculate the correlation coefficient:
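DPA Algorithm:
$C(t,k) = \frac{Cov(W(t), P(k))}{\sqrt{Var(W(t)) \cdot Var(P(k))}}$ (5.13)
$Cov(W(t), P(k)) = \frac{1}{N}\sum_{i=1}^{N} \left(W_i(t) - \overline{W(t)}\right)\left(P_i(k) - \overline{P(k)}\right)$ (5.14)
$Var(W(t)) = \frac{1}{N}\sum_{i=1}^{N} \left(W_i(t) - \overline{W(t)}\right)^2$ (5.15)
$Var(P(k)) = \frac{1}{N}\sum_{i=1}^{N} \left(P_i(k) - \overline{P(k)}\right)^2$ (5.16)
$\overline{W(t)} = \frac{1}{N}\sum_{i=1}^{N} W_i(t)$ (5.17)
$\overline{P(k)} = \frac{1}{N}\sum_{i=1}^{N} P_i(k)$ (5.18)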
W(t) denotes the power traces, P(k) is the hypothetical power consumption calculated with
the Hamming distance model, t is the time step, k is the hypothetical key, and N is the
number of power traces. Equations 5.17 and 5.18 calculate the means of the real and the
hypothetical power consumption over the N traces. Equations 5.15 and 5.16 calculate the
variances of the real and the hypothetical power consumption. Equation 5.14 calculates the
covariance of the real power and the hypothetical power. Finally, the correlation
coefficients are calculated by equation 5.13.
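The correlation step can be sketched in plain Python as follows. This is a minimal illustration that assumes the coefficient is the Pearson correlation built from the means, variances and covariance described above; the function names and the list-of-lists trace layout are hypothetical.

```python
import math


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def correlation(w, p):
    """Pearson correlation of measured samples w and hypothetical values p."""
    n = len(w)
    mw, mp = mean(w), mean(p)
    cov = sum((wi - mw) * (pi - mp) for wi, pi in zip(w, p)) / n
    var_w = sum((wi - mw) ** 2 for wi in w) / n
    var_p = sum((pi - mp) ** 2 for pi in p) / n
    if var_w == 0 or var_p == 0:
        return 0.0  # a constant series carries no information
    return cov / math.sqrt(var_w * var_p)


def correlation_trace(traces, hypo):
    """traces[i][t] is sample t of trace i; hypo[i] is the hypothetical power
    of trace i for one key guess. Returns one coefficient per time step."""
    steps = len(traces[0])
    return [correlation([tr[t] for tr in traces], hypo) for t in range(steps)]


# Three toy traces with two time steps; only the first step leaks.
print(correlation_trace([[1, 5], [2, 5], [3, 5]], [1, 2, 3]))
```

Repeating `correlation_trace` for every hypothetical key yields the time-by-key coefficient set that the attack inspects for peaks.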
5.2 DPA attack on AES
5.2.1 DPA attack on AES: An Example
Figure 5.4 Last round of AES module.
A successful DPA attack probes the power consumption, obtains the ciphertext, and
models the hypothetical power consumption based on the hardware. Finally, the DPA
algorithm calculates the correlation coefficients of the real power consumption and the
hypothetical power consumption.
Figure 5.4 shows the last round of the AES module. The attacker knows only the ciphertext
and the power consumption. The detailed DPA attack procedure is as follows:
Step 1. Measuring the power consumption of the AES encryption device. The power trace
data W(t) can be easily acquired with an oscilloscope. The device executes the AES
encryption algorithm with an unknown, constant key. The ciphertext is known to the
attacker.
Step 2. Calculating hypothetical intermediate values. As shown in Figure 5.4, the
corresponding intermediate value can be calculated from the known ciphertext and a
hypothetical key.
(5.19)
InvOP represents the inverse operation of AddRoundKey, ShiftRows and SubBytes. key0
is one byte of the hypothetical key. It has 256 possible values.
Step 3. Calculating hypothetical power consumption. The Hamming distance model is
chosen in this step. The hypothetical power consumption is equal to the Hamming distance
between the intermediate value and the ciphertext.
(5.20)
Step 4. Calculating correlation coefficients by using P(k) and W(t). The hypothetical
power of the correctly guessed key is highly correlated with the real power consumption,
and the key can be identified from the significant peaks in the DPA curves.
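The four steps can be tried out on simulated data. The sketch below is a toy model and not the measurement setup of this dissertation: it attacks a single key byte, replaces the oscilloscope by a simulated leakage of HD(state, ciphertext) plus Gaussian noise, and ignores ShiftRows by assuming the ciphertext byte overwrites the register that held the corresponding state byte. The key value 0x3C, the noise level and all function names are hypothetical.

```python
import math
import random


def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def make_sbox():
    """Build the AES S-box: GF(2^8) inversion followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = next(y for y in range(1, 256) if gf_mul(x, y) == 1) if x else 0
        s = 0x63
        for shift in range(5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s)
    return sbox


SBOX = make_sbox()
INV_SBOX = [0] * 256
for i, s in enumerate(SBOX):
    INV_SBOX[s] = i


def hd(a, b):
    """Hamming distance of two bytes."""
    return bin(a ^ b).count("1")


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0


def simulate(key_byte, n_traces, rng):
    """Step 1 (simulated): one leakage sample and one ciphertext byte per run."""
    ciphertexts, samples = [], []
    for _ in range(n_traces):
        state = rng.randrange(256)       # register value before the last round
        c = SBOX[state] ^ key_byte       # SubBytes followed by AddRoundKey
        ciphertexts.append(c)
        samples.append(hd(state, c) + rng.gauss(0.0, 1.0))
    return ciphertexts, samples


def attack(ciphertexts, samples):
    """Steps 2 to 4: hypothesise every key byte, keep the best-correlated one."""
    best_key, best_corr = None, -2.0
    for k in range(256):
        hypo = [hd(INV_SBOX[c ^ k], c) for c in ciphertexts]
        corr = pearson(samples, hypo)
        if corr > best_corr:
            best_key, best_corr = k, corr
    return best_key, best_corr


rng = random.Random(2009)
cts, power = simulate(key_byte=0x3C, n_traces=500, rng=rng)
recovered, corr = attack(cts, power)
print(hex(recovered), round(corr, 3))
```

With a single leaking sample per trace the time axis collapses to one point; a real attack repeats the correlation for every time step and every key byte.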
5.2.2 DPA attack on AES: A successful attack and a failed attack
According to the discussion in Section 5.2.1, the final set of correlation coefficients
has three dimensions: time, hypothetical key and correlation coefficient value. For each
ByteKey, the set is indexed by T and K, where T is the time, K is the hypothetical key, C is
the coefficient value, and ByteKey denotes which byte of the key is attacked. For an AES
128-bit key, there are 16 bytes in total. The results of a DPA attack can be clearly shown
in a 2-D or a 3-D view.
The 2-D view shows 16 graphs, and each graph represents one byte of the key. The x-axis is
the time axis, and the y-axis is the coefficient value. The 2-D view shows the coefficient
curves under the right key. In a successful attack, the curves show a significant peak in
each graph (Figure 5.5). A failed attack shows flat, noisy curves (Figure 5.7).
The 3-D view shows the coefficient value mesh of one byte of the key. The x-axis is the
hypothetical key, which ranges from 0 to 255, the y-axis is time, and the z-axis is the
coefficient value. A successful attack shows a wall (Figure 5.6), while a failed attack
shows a flat, noisy mesh (Figure 5.8).
Figure 5.5 2-D views of successful DPA attack.
(16 bytes of key, the peak in each graph indicates that the hypothetical key is a right key)
Figure 5.6 3-D views of successful DPA attack.
(4th byte of AES key, the wall in this 3-D view indicates that the hypothetical key is a right key)
Figure 5.7 2-D views of failed DPA attack.
(there is no peak in the correlation graphs)
Figure 5.8 3-D views of failed DPA attack.
(4th byte of AES key, there is no wall in the correlation coefficients mesh)
5.3 Conventional Countermeasure Methods
DPA attacks work because the power consumption of a cryptographic device depends on the
intermediate values of the executed cryptographic algorithm. The goal of a countermeasure
is to avoid or reduce these dependencies. According to reference [53], techniques to
prevent DPA attacks fall into two categories: hiding and masking.
Hiding Method
The hiding method breaks the link between the power consumption of the device and the
processed data values. This makes it difficult for an attacker to find exploitable
information in the power traces. Two types of hiding methods are introduced by
Stefan Mangard in [53]: time dimension hiding and amplitude dimension hiding.
Time dimension hiding usually shuffles the operations of the cryptographic algorithm to
randomize the execution. This makes the power consumption appear more or less
random to an attacker. As shown in Figure 5.9, a power trace consists of several
operations, such as A, B and C. Each operation executes in different clock cycles and
produces different power consumption. Before hiding, the execution order of the
operations is fixed: A->B->C. After hiding, the execution order is shuffled: the
operations are executed in random order in the time dimension, and the order changes in
every power trace. As a result, it becomes much harder for the attacker to get the right
power consumption for each operation.
Amplitude dimension hiding differs from time dimension hiding in that it adds a
noise source to the original circuit. As shown in Figure 5.10, the hidden power
consumption consists of two sources: the pure power produced by the original circuit and
the noise power produced by the noise source. The idea of this method is to reduce the
SNR so that the power consumption becomes too noisy for DPA analysis.
Figure 5.9 Time dimension hiding.
Figure 5.10 Amplitude dimension hiding.
Currently, hiding methods are most commonly used in software implementations of AES
in embedded systems. The conventional techniques are the random insertion of dummy
operations and the shuffling of operations. Hardware implementations of hiding methods
have not been reported yet.
Masking Method
The masking method randomizes the intermediate values that are processed by the
cryptographic device. This makes the power consumption independent of the
intermediate values. The mask operation can be written as
$S_M = S * M$ (5.21)
S is the internal state of the circuit before masking, M is a random number used to mask
S, and SM is the masked state. The operation * is most often the Boolean exclusive-or,
the modular addition, or the modular multiplication.
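For the Boolean exclusive-or case, the effect of masking on the Hamming weight can be checked directly. The sketch below is a minimal illustration with hypothetical function names; it averages over all 256 possible 8-bit masks, which corresponds to a uniformly random mask.

```python
def mask(state, m):
    """Boolean masking: S_M = S XOR M."""
    return state ^ m


def unmask(masked_state, m):
    """XOR with the same mask recovers the original state."""
    return masked_state ^ m


def mean_masked_hw(state):
    """Average Hamming weight of the masked byte over all 256 possible masks."""
    return sum(bin(mask(state, m)).count("1") for m in range(256)) / 256


# The average is the same for every unmasked value, so the mean power
# predicted by the HW model no longer reveals S.
print(mean_masked_hw(0x00), mean_masked_hw(0xFF), mean_masked_hw(0xA7))
```

The mask has to be fresh and unknown to the attacker for this argument to hold; the check says nothing about higher-order attacks that combine several leakage points.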
Masking methods are widely used in both software and hardware implementations of
AES to counter DPA attacks. The most frequently used masking method for AES hardware
design was proposed by Akkar and Giraud in [55]. This method includes two parts:
1) Register Masking. Register masking, or internal state masking, masks the
intermediate data while AES encryption/decryption is running. As shown in Figure 5.11,
all of the intermediate values (A, B, C, D, E) are masked with a random number X.
2) S-Box Masking. Since the S-Box consumes much more power than the other blocks,
attackers always focus on the S-Box. As shown in Figure 5.12, S-Box masking applies a
random mask to every intermediate value. One GF(256) inverter, four GF(256) multipliers
and two GF(256) adders are added to the original S-Box design as the extra hardware cost
of the masking method.
Figure 5.11 AES after masking.
Figure 5.12 S-Box after masking.
5.4 Conclusion
This chapter introduced the DPA attack. First, the basic concepts were introduced: the
power consumption of CMOS circuits, power models for power estimation, and the basic DPA
workflow. Second, the DPA attack on the AES algorithm was introduced. The attack
procedure consists of four steps, and the final results of a successful attack and a failed
attack were discussed. Finally, two countermeasure methods, hiding and masking, were
introduced.
6 AES Design with DPA Countermeasure
For hardware designs of AES with DPA countermeasures, only masking methods have been
proposed so far, and almost all hardware implementations rely on masking alone.
However, the masking method has been shown in [56] to be insecure against high-order DPA
attacks in software implementations. For hardware implementations, there are currently
few articles about DPA attacks on masked AES. Nevertheless, the risk remains, and many new
attack methods are still under research.
Attack methods change very quickly. A countermeasure may be strong against some
specific attack methods while remaining weak against others. A basic way to improve
security is to combine several countermeasure methods. In this chapter, we propose
several DPA countermeasure methods for AES hardware design and, finally, an ultra low-cost
AES design with multiple DPA countermeasures, which combines masking, hiding, and our
proposed Independent ARK and Data Sliding. The experiment environment (DPA attack system)
and the experimental results are also provided in this chapter.
6.1 Proposed DPA Countermeasure methods for AES
6.1.1 Register Masking
As discussed in Chapter 5, DPA attacks use the dynamic power consumption and a power
model to mount the attack. The dynamic power consumption of a circuit includes the
sequential logic power consumption (consumed by registers) and the combinational logic
power consumption. In order to counter DPA attacks, both the registers and the
combinational logic should be masked.
Figure 6.1 The ith round of AES without and with masking countermeasures.
Registers are updated in every clock cycle and consume a lot of power. Moreover, all of
the registers are refreshed at the same time, so it is very easy for an attacker to locate
the right positions in the power trace according to the hypothetical power model.
Register masking masks the values stored in the registers. After masking, the power
consumption of the registers PR is randomized by the random number X:
(6.1)
Figure 6.1 shows the round function of AES without and with masking countermeasures. In
the left part, all of the values (A, B, C, D, E) stored in the registers are unmasked, so
the internal states of AES are known to the attacker. After masking, all of the internal
states are masked by a random number X, and this number is changed for every encryption.
Figure 6.2 Proposed Registers Masking.
Figure 6.2 shows the registers masking method. Two exclusive-or gates are inserted
around the registers. The values stored in the registers are randomized.
6.1.2 S-Box Masking
The original S-box and the proposed masked S-box are shown in Figure 6.3. The original
S-box includes a GF($2^8$) inversion and an affine operation. The GF($2^8$) inversion is a
non-linear Galois field operation. The affine operation is a simple matrix transformation
which can be easily implemented in hard-wired logic.
In order to conceal the intermediate value, a simplified S-box masking method is
proposed. The main idea of the S-box masking method is to use a random number to mask
the input data of the S-box. In this way, the power consumption of the S-box depends only
on the masked input data and is independent of the original data. Since the Galois field
inversion is non-linear, an exclusive-or mask cannot simply be removed after the
inversion. As a result, Galois field multiplication is used for masking.
As shown in Figure 6.3, two Galois field multipliers are added, and a random number Y
is used as the masking pattern. Since Y is independent of A, A×Y is also independent of A.
The power consumption is a linear function of the Hamming distance of A×Y; thus, the
power consumption is independent of A.
The combinational logic power consumption PComb is represented as:
(6.2)
Compared to the original masking method in [55], which masks both the S-box and the data
bus, our proposed design saves one Galois field inverter, two Galois field multipliers and
two Galois field adders in the hardware implementation.
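The algebra behind multiplicative masking of the inversion can be checked with a short sketch. It shows one arrangement that is consistent with the description of two multipliers around the inverter: the inverter only sees A×Y, and a second multiplication by Y restores the inverse of A. The exact wiring of Figure 6.3 is not reproduced here, and the function names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def gf_inv(a):
    """Inverse in GF(2^8) computed as a^254; maps 0 to 0 as in the AES S-box."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def masked_inversion(a, y):
    """Invert a while the inverter only sees the product a*y (y must be non-zero)."""
    masked_in = gf_mul(a, y)        # first multiplier: mask the S-box input
    masked_out = gf_inv(masked_in)  # inverter works on masked data: (a*y)^-1
    return gf_mul(masked_out, y)    # second multiplier: (a*y)^-1 * y = a^-1


print(hex(gf_inv(0x53)), hex(masked_inversion(0x53, 0x1D)))
```

The identity relies on the inversion being multiplicative, (A×Y)^-1 = A^-1 × Y^-1, which is why a multiplicative mask passes through the inverter where an exclusive-or mask does not.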
Figure 6.3 Proposed S-Box Masking.
6.1.3 Subbytes Hiding
Subbytes Hiding means that the sequence of Subbytes operations is shuffled. Subbytes
Hiding breaks the correlation between power consumption and AES operations in the time
domain. After shuffling, the attacker cannot know which operation corresponds to which
position on the time axis of the power trace. Thus, the hypothetical power consumption
cannot be correctly calculated for power analysis.
A power trace of AES hardware is shown in Figure 6.4. Four operations are executed
one by one: ShiftRows, SubBytes, MixColumns and AddRoundKey. Each operation
produces different power consumption. However, the attacker knows that the same
operation occurs at the same time in different power traces, as Figure 6.5 a)
shows. This makes the power analysis attack feasible.
Figure 6.4 A power trace of AES.
a) Subbytes without hiding b) Subbytes with hiding
Figure 6.5 Subbytes without and with hiding.
In order to prevent power analysis attacks, the hiding method shuffles the
execution of operations. As a result, in the collected power traces, the same operation is
randomly distributed in the time domain, as shown in Figure 6.5 b). The power consumption
of Subbytes Psub can be denoted as,
(6.3)
Pti represents the power consumption in the ith clock cycle. There are n possibilities
for Psub in total, and n depends on the hiding method used.
Figure 6.6 Hardware design of Subbytes hiding.
The hardware design of Subbytes hiding is shown in Figure 6.6. One of the 16 bytes of data
is selected for SubBytes. The SEL module is a multiplexer for 16-to-1 selection. The LFSR
is a linear feedback shift register used to generate the 4-bit selection signals. The
initial vector of the LFSR is generated by a random number generator (RNG). The RNG is a
module outside of AES, and normally it exists in every cryptographic device.
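A 4-bit LFSR of the kind described above can be modelled as follows. This is a sketch under stated assumptions: the feedback polynomial x^4 + x^3 + 1 and the seed handling are hypothetical choices, since the section does not give the LFSR taps used in Figure 6.6.

```python
def lfsr4_step(state):
    """Advance a 4-bit Fibonacci LFSR (assumed polynomial x^4 + x^3 + 1) by one clock."""
    feedback = ((state >> 3) ^ (state >> 2)) & 1
    return ((state << 1) | feedback) & 0xF


def selection_sequence(seed, cycles):
    """4-bit selection signals for the SEL multiplexer, starting from a non-zero seed."""
    if not 1 <= seed <= 15:
        raise ValueError("seed must be a non-zero 4-bit value")
    state, out = seed, []
    for _ in range(cycles):
        out.append(state)
        state = lfsr4_step(state)
    return out


# The seed stands in for the initial vector delivered by the RNG.
seq = selection_sequence(seed=0b1001, cycles=15)
print(seq)
```

A maximal-length 4-bit LFSR of this form cycles through the 15 non-zero states only, so a design that has to reach all 16 byte positions needs some additional handling of the remaining index; how the proposed circuit does this is not stated in this section.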
6.1.4 Independent ARK and Data Sliding
Independent ARK
The conventional AES hardware design integrates all AES operations on one data path
to save clock cycles. As shown in Figure 6.7, in the last round of AES, Subbytes and
AddRoundKey are executed within one clock cycle, and the ciphertext is calculated. The
power consumption of this procedure can be represented by
(6.4)
The function in this equation is the inverse operation of Subbytes and AddRoundKey. C is
the ciphertext, which is known to the attacker. S1 is an internal state of the circuit,
obtained by applying the inverse function to C. The key is unknown, and the attacker uses
a hypothetical key in this equation to calculate the hypothetical power consumption. As
discussed in the last chapter, DPA attacks use the hypothetical power and the real power
to mount the attack. The right key can be easily recognized in the correlation coefficient
graph. In this equation, the attacker needs to guess one byte of the key (8 bits), which
has 256 possibilities.
Independent ARK means that the AddRoundKey operation is separated from the other
operations. As shown in Figure 6.8, the last round of AES is separated into two sub-steps:
Subbytes and AddRoundKey. These two steps are executed in different clock cycles. For a
DPA attack, only the Subbytes operation can be used because the S-Box consumes much more
power than the other operations. The power consumption of Subbytes in this circuit is,
(6.5)
S2 and S3 are two internal states of the circuit. S2 is the result of the exclusive-or of
the ciphertext and the key. S3 is the result of the inverse Subbytes of S2.
Figure 6.7 Integrated Subbytes and AddRoundKey.
Figure 6.8 Separated Subbyte and AddRoundKey.
Figure 6.9 Feedback structure and Data Sliding Structure.
Data Sliding
Data sliding is used to make the state of a register depend on its neighbouring registers.
In Figure 6.9, two kinds of circuit structures are shown:
A) Feedback circuit structure
In a feedback circuit, the source and the destination of the data are the same. As
shown in this figure, R0 and R1 are two registers. R0 and Subbytes make up a feedback
circuit. At time t0, the states of these two registers are S0 and S1. After one clock
cycle, the states change as listed in the table of this figure. The power
consumption can be represented as,
(6.6)
B) Data Sliding circuit structure
In a Data Sliding circuit, there is no feedback circuit: the source and the destination
of a combinational circuit are different registers. As shown in this figure, the
input of Subbytes comes from R0, and the output of Subbytes goes to R1. The state changes
of this circuit are also listed in the table of this figure. The power consumption
of the Data Sliding circuit can be represented as,
(6.7)
Unlike the feedback circuit, the power consumption of this circuit depends on
two registers, and the states of both registers must be taken into account.
As shown in this equation, the power consumption P depends on both S0 and S1.
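The difference between the two structures can be made concrete with the Hamming distance model. The sketch below is a simplified illustration with hypothetical function names: it models only the register that receives the Subbytes output, uses the AES S-box for Subbytes, and leaves out scaling and noise.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def make_sbox():
    """Build the AES S-box: GF(2^8) inversion followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = next(y for y in range(1, 256) if gf_mul(x, y) == 1) if x else 0
        s = 0x63
        for shift in range(5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s)
    return sbox


SBOX = make_sbox()


def hd(a, b):
    """Hamming distance of two bytes."""
    return bin(a ^ b).count("1")


def feedback_power(s0):
    """R0 <- Subbytes(R0): the toggles of R0 depend on S0 only."""
    return hd(s0, SBOX[s0])


def sliding_power(s0, s1):
    """R1 <- Subbytes(R0): the toggles of R1 depend on its old value S1 and on S0."""
    return hd(s1, SBOX[s0])


# Same S0, different neighbour state S1: the feedback value is fixed,
# the sliding value changes with S1.
print(feedback_power(0x00), sliding_power(0x00, 0x63), sliding_power(0x00, 0x9C))
```

For a fixed S0 the feedback structure always produces the same hypothetical power, whereas the sliding structure ranges from no toggles to eight toggles depending on S1, which is the extra unknown the attacker has to handle.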
When Independent ARK and Data Sliding are combined, the power consumption
becomes:
(6.8)
key0 and key1 are two bytes of the key. C0 and C1 are ciphertext bytes, which are known to
the attacker. In this power consumption equation, two bytes of the key need to be
hypothesized. Compared to the DPA attack on a conventional circuit, in which only one byte
of the key needs to be hypothesized, our proposed methods increase this to two bytes. As a
result, the computational cost is increased by a factor of $2^8$ for every power trace.
6.1.5 Time Complexity Analysis
The proposed DPA countermeasure methods greatly increase the computational
complexity of DPA attacks. The time complexity of a DPA attack on an AES design
with countermeasure methods is analyzed in this section.
For a DPA attack, the time complexity includes three parts: 1) power trace measurement,
2) hypothetical power modeling, and 3) correlation coefficient calculation. Between a DPA
attack on pure AES (without countermeasures) and on secure AES (with countermeasures), the
difference lies in parts 2 and 3. For pure AES, the hypothetical power trace is,
(6.9)
In order to mount a DPA attack, the value of the key must be hypothesized. Every
hypothetical key corresponds to a set of hypothetical power traces. The correlation
coefficients are calculated for every pair of hypothetical and real power traces. We
define the time complexity of a DPA attack on pure AES as O(DPA).
For AES with countermeasure methods, because the power consumption changes
according to the method used (as discussed in Sections 6.1.1-6.1.4), the hypothetical
power model has many more unknown variables than for pure AES. A summary of the
hypothetical power consumption and the time complexity of the DPA attack is as follows:
AES with Masking method
As discussed for equations 6.1 and 6.2, the hypothetical power consumption of AES with
the masking method is,
(6.10)
Compared to equation 6.9, an additional 8-bit random number X is included in this
equation. In order to get the power trace, the attacker must additionally guess the random
number X for every hypothetical key. For a single power trace, the number of possible
power values is increased by a factor of $2^8$. For N power traces, the number of
possibilities for the power trace set is increased by a factor of $2^{8N}$. In other words,
the time complexity is also increased by a factor of $2^{8N}$.
AES with Hiding method
As shown in equation 6.3, the hypothetical power consumption of AES with the hiding
method is,
(6.11)
Y is a random selection number which indicates which power value is the right value for a
specific time slot. Normally, Y has 16 possible values since there are 16 Subbytes
operations in each round of the AES algorithm. Similarly, for a single power trace, the
number of possibilities is increased by a factor of $2^4$, and for N power traces, by a
factor of $2^{4N}$. The time complexity is also increased by a factor of $2^{4N}$.
AES with Independent ARK + Data Sliding
As discussed in Section 6.1.4, the hypothetical power consumption of AES with
Independent ARK and Data Sliding is,
(6.12)
key0 and key1 are 8-bit hypothetical keys. This equation is similar to equation 6.10
(considering key0 as the key and key1 as the random number X). The time complexity
analysis is also similar to that of masking: the complexity is increased by a factor of
$2^{8N}$.
Table 6.1 summarizes the effect of each countermeasure method on the power consumption
of the AES design. Table 6.2 summarizes the time complexity for AES without
countermeasures, AES with masking, AES with hiding, and AES with Independent ARK
& Data Sliding.
Our proposed countermeasure methods can also be combined to raise the security to a
higher level. For example, when all of the methods (Masking, Hiding, Independent ARK &
Data Sliding) are combined, the power consumption becomes,
(6.13)
There are four unknown variables in this equation: the key bytes (key0, key1) and the
random numbers (X, Y). The computational complexity of the DPA attack becomes $2^{20N}$
times that of the attack on pure AES, which is $2^{12N}$ times higher than for the AES
design with only the masking method. In this way, even if the masking method is proved to
be insecure, the other countermeasure methods can still guarantee the security.
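The bookkeeping of the exponents can be written down as a short sketch. It only restates the factors given in this section (8 extra bits for the mask X, 4 bits for the selection Y, 8 bits for the second key byte); the dictionary keys, the function name and the trace count are hypothetical.

```python
# Extra unknown bits per power trace that each countermeasure adds for the attacker.
EXTRA_BITS = {
    "masking": 8,                  # 8-bit random mask X
    "subbytes_hiding": 4,          # one of 16 Subbytes positions, selected by Y
    "independent_ark_sliding": 8,  # second hypothetical key byte key1
}


def complexity_exponent(methods, n_traces):
    """Exponent e such that the attack cost grows by a factor of 2**e over pure AES."""
    return sum(EXTRA_BITS[m] for m in methods) * n_traces


n = 1000  # hypothetical number of power traces
masking_only = complexity_exponent(["masking"], n)
combined = complexity_exponent(list(EXTRA_BITS), n)
print(masking_only, combined, combined - masking_only)
```

The difference between the combined exponent and the masking-only exponent is 12 per trace, which matches the factor quoted for the combined design relative to masking alone.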
Table 6.1 Summary of different countermeasure methods.
Method | Description | Effect on power consumption
Masking | Randomize the internal data | Add a random number X in power consumption
Subbytes Hiding | Shuffling the execution order of Subbytes | Right power consumption is a member of the power consumption set
Independent ARK with Data Sliding | Equal to masking data with another key | Add another key in power consumption
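Table 6.2 Comparison of time complexity for each countermeasure method.
Method | Power consumption | Complexity
AES without DPA countermeasure | Equation 6.9 | O(DPA)
Masking | Equation 6.10 | $2^{8N}$ × O(DPA)
Subbytes Hiding | Equation 6.11 | $2^{4N}$ × O(DPA)
Independent ARK with Data Sliding | Equation 6.12 | $2^{8N}$ × O(DPA)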
6.2 Ultra Low-cost Design of AES with DPA Countermeasure
6.2.1 Specification
The data size of coded video depends heavily on the resolution, the frame rate and the
coding methods. Resolution is the size of the pictures in a video sequence. When the
resolution becomes higher, every frame of the video sequence consists of more MBs, and the
data size increases greatly. On the other hand, high resolution lets the video contain more
details and makes it more attractive to the audience. Frame rate is the number of frames
displayed within one second. A higher frame rate means that more pictures must be
displayed in one second. A high frame rate makes moving pictures look smoother,
especially for high-motion pictures. Normally, for low resolution video (less
than 1920×1088), the frame rate is set to 30 fps, and for high resolution video (more
than 1920×1088), the frame rate is set to 60 fps. The coding method is another important
factor for the video data size. Some coding tools greatly affect the coded video data size,
such as RDO (Rate-Distortion Optimization), the QP matrix, CAVLC and CABAC.
For more information about coding methods, please refer to [46].
Table 6.3 shows the maximum bit-rate of selected levels in H.264 [57]. Each level
defines the maximum bit-rate, video resolution, frame rate and maximum number of frames
stored in the buffer. Table 6.3 lists only selected levels; the full levels list can
be found in [57]. Some frequently used video resolutions are listed in this table.
176×144 (QCIF) and 352×288 (CIF) are usually used in mobile phones. Since the screen
size and the battery power of a mobile phone are limited, small-size video is acceptable.
Reference [51] uses QVGA (320×240) @ 15 fps, 128 kbps to broadcast TV to mobile phones.
720×480 (VGA) is normally used in high-end portable media players. 1280×720 (HDTV 720p)
and 1920×1080 (HDTV 1080p) are widely used in High Definition TV. For future use,
4096×2048 (4kx2k) and 8192×4096 (8kx4k) super-HDTV are under research.
Table 6.3 Max bit-rate and resolution of selected H.264 levels.
(Max bit rate in bps for each profile group)
H.264 Level | Baseline, Main, Extended Profile | High Profile | High 10 Profile | High 4:2:2, 4:4:4 Profile | Resolution @ frame rate
1 | 64 K | 80 K | 192 K | 256 K | 128×96@30
1.1 | 192 K | 240 K | 576 K | 768 K | 176×144@30
2 | 2 M | 2.5 M | 6 M | 8 M | 352×288@30
3 | 10 M | 12.5 M | 30 M | 40 M | 720×480@30
3.1 | 14 M | 17.5 M | 42 M | 56 M | 1280×720@30
3.2 | 20 M | 25 M | 60 M | 80 M | 1280×720@60
4 | 20 M | 25 M | 60 M | 80 M | 1920×1080@30
4.1 | 50 M | 62.5 M | 150 M | 200 M | 2048×1024@30
4.2 | 50 M | 62.5 M | 150 M | 200 M | 2048×1080@60; 1920×1080@64
5 | 135 M | 168.75 M | 405 M | 540 M | 1920×[email protected]; 2048×[email protected]; 2048×[email protected]
5.1 | 240 M | 300 M | 720 M | 960 M | 1920×[email protected]; 4096×2048@30
Currently, 1920×1080@60fps, High Profile is the highest configuration for
commercial products. For video communication, such as video conferencing, the widely used
resolution is VGA. In conclusion, the maximum bit rate for current video applications
does not exceed 62.5 Mbps. For a real-time video encryption module, the throughput should
be above this number.
6.2.2 Hardware Architecture
As discussed in the last subsection, the maximum throughput required by video applications
is about 62.5 Mbps. The lowest-hardware-cost AES based on the scalable architecture
proposed in Chapter 4 achieves a throughput of 75 Mbps. Therefore, for real-time video
encryption, the lowest-cost architecture of Chapter 4 is the most suitable choice.
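The comparison between required bit rate and available throughput can be sketched as a
small check. This is only an illustration: the bit rates are copied from Table 6.3, the
75 Mbps figure is the throughput quoted above for the lowest-cost design, and the names
MAX_BITRATE_MBPS, PROFILES and supported are hypothetical helpers, not part of the design.

```python
# Maximum bit rates in Mbps, copied from Table 6.3.
# Column order: Baseline/Main/Extended, High, High 10, High 4:2:2/4:4:4.
MAX_BITRATE_MBPS = {
    "1":   (0.064, 0.080, 0.192, 0.256),
    "1.1": (0.192, 0.240, 0.576, 0.768),
    "2":   (2, 2.5, 6, 8),
    "3":   (10, 12.5, 30, 40),
    "3.1": (14, 17.5, 42, 56),
    "3.2": (20, 25, 60, 80),
    "4":   (20, 25, 60, 80),
    "4.1": (50, 62.5, 150, 200),
    "4.2": (50, 62.5, 150, 200),
    "5":   (135, 168.75, 405, 540),
    "5.1": (240, 300, 720, 960),
}
PROFILES = ("Baseline/Main/Extended", "High", "High 10", "High 4:2:2/4:4:4")


def supported(throughput_mbps):
    """Return the (level, profile) pairs whose maximum bit rate fits the throughput."""
    return [
        (level, PROFILES[i])
        for level, rates in MAX_BITRATE_MBPS.items()
        for i, rate in enumerate(rates)
        if rate <= throughput_mbps
    ]


AES_LOW_COST_MBPS = 75.0  # throughput quoted for the lowest-cost AES design
pairs = supported(AES_LOW_COST_MBPS)
print(("4.2", "High") in pairs)     # True: 62.5 Mbps fits within 75 Mbps
print(("4.2", "High 10") in pairs)  # False: 150 Mbps exceeds 75 Mbps
```

Under these numbers, every level up to 4.2 in the Baseline/Main/Extended and High profiles
fits within 75 Mbps, which matches the 62.5 Mbps requirement stated above.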
Figure 6.10 shows the hardware architecture of the ultra low-cost AES design with DPA
countermeasures. This architecture is based on the scalable architecture proposed in
Chapter 4, so most of the architecture is the same. The important points of this
architecture are:
S-Box with masking: Only one S-Box and one MixColumns are used in this design
to reduce the total hardware cost. The S-Box masking method proposed in Section 6.1.2 is
used.
SubBytes Shuffling: A 17-to-1 multiplexer is used for SubBytes hiding. One of the 16
data bytes is randomly selected for SubBytes in every clock cycle. The selection signal is
produced by the circuit proposed in Section 6.1.3.
Register Masking: All of the data registers are masked. There are two sets of XOR
gate arrays, one before and one after each data register, as in Section 6.1.1.
Figure 6.10 Ultra low-cost AES with DPA countermeasure.
Figure 6.11 Data flow for ultra low-cost AES.
Independent ARK: Independent ARK is already used in the scalable
architecture. Because the data path is separated into three parts, the AddRoundKey,
SubBytes and MixColumns operations are independent of each other.
Data Sliding: Data Sliding is achieved by right-shifting the data registers in this
architecture. Since only one S-Box is used, the right shift has to be done in every
clock cycle, which has the same effect as data sliding.
In hardware design, the masking and hiding methods incur extra hardware cost. In contrast,
Independent ARK and Data Sliding are architecture-level designs which do not require
extra hardware.
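A behavioural sketch may help to picture SubBytes shuffling and register masking. It is
not the circuit of Figure 6.10: the random order comes from Python's random module
instead of the selection circuit of Section 6.1.3, the substitution is a stand-in byte
permutation instead of the (masked) AES S-Box, and the class and function names are
hypothetical.

```python
import random


def stand_in_sbox(byte):
    """Placeholder byte permutation; NOT the AES S-Box."""
    return (byte * 7 + 3) % 256


def subbytes_shuffled(state, rng, substitute=stand_in_sbox):
    """Substitute the 16 state bytes one per 'clock cycle' in a random order."""
    order = list(range(16))
    rng.shuffle(order)              # stand-in for the hardware selection signal
    out = list(state)
    for index in order:             # one byte passes through the single S-Box per cycle
        out[index] = substitute(state[index])
    return out, order


class MaskedRegister:
    """Register that stores data XOR mask, with XOR gates before and after it."""

    def __init__(self, rng):
        self.rng = rng
        self.mask = 0
        self.stored = 0

    def write(self, value):
        self.mask = self.rng.randrange(256)   # fresh random mask on every write
        self.stored = value ^ self.mask       # XOR array before the register

    def read(self):
        return self.stored ^ self.mask        # XOR array after the register


rng = random.Random(2009)
state = list(range(16))
shuffled, order = subbytes_shuffled(state, rng)
print(shuffled == [stand_in_sbox(b) for b in state])  # True: result does not depend on order

reg = MaskedRegister(rng)
reg.write(0xA5)
print(reg.read() == 0xA5)  # True: unmasking restores the written value
```

The point of the sketch is that the processing order changes from run to run while the
result stays the same, and that the value held in the register differs from the data
value whenever the mask is non-zero.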
6.2.3 Data Flow
The dataflow of the proposed ultra low-cost AES design with DPA countermeasures follows
a pattern similar to that of the unified architecture, already discussed in Section 4.4.2.
Figure 6.11 shows this dataflow in detail. The meaning of the notations used in this
dataflow is listed in the notations table. Every block in this dataflow represents a clock
cycle. Operations in the same block are executed in parallel.
The total dataflow consists of three parts: the First round, Round i (i from 1 to 9), and
the Last round. The First round costs 5 clock cycles and executes three operations:
AddRoundKey, ShiftRows and key SubBytes. Because AddRoundKey and ShiftRows are both
applied to the data registers, they can be merged into one combined operation. Round i is
a loop which is executed 9 times. Six operations are executed in total:
Key update, Data SubBytes, MixColumns, Key SubBytes, AddRoundKey and ShiftRows.
Many operations are executed in parallel to save clock cycles. The Last round consists of
only three operations: Key update, Data SubBytes, and AddRoundKey. From this dataflow,
it can be seen that AddRoundKey is always executed independently, and that the data
SubBytes can be executed in a random order in every step.
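The three-part structure described above can be written down as a small schedule. This
is a sketch of the operation lists only: it does not model the per-cycle parallelism of
Figure 6.11, and the names FIRST_ROUND, MIDDLE_ROUND, LAST_ROUND and schedule are
hypothetical.

```python
# Operations per phase, as listed in the text of this section.
FIRST_ROUND = ("AddRoundKey", "ShiftRows", "Key SubBytes")
MIDDLE_ROUND = ("Key update", "Data SubBytes", "MixColumns",
                "Key SubBytes", "AddRoundKey", "ShiftRows")
LAST_ROUND = ("Key update", "Data SubBytes", "AddRoundKey")


def schedule():
    """Yield (phase label, operations) for one encryption."""
    yield "First round", FIRST_ROUND
    for i in range(1, 10):          # Round i, with i from 1 to 9
        yield f"Round {i}", MIDDLE_ROUND
    yield "Last round", LAST_ROUND


phases = list(schedule())
print(len(phases))                                      # 11
print(all("AddRoundKey" in ops for _, ops in phases))   # True
```

AddRoundKey appears in every phase, which is consistent with the observation that it is
always executed as an independent operation.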
6.2.4 Implementation
In order to compare the hardware cost of AES designs with different countermeasure
methods, we implemented four AES designs. All of the designs are coded in Verilog HDL and
synthesized with Synopsys Design Compiler. The TSMC 0.18 um standard cell library is used
for circuit synthesis.
AES without countermeasure methods (AES 0)
This AES design was proposed in [58]. Its architecture is similar to the scalable
architecture proposed in Chapter 4. In order to further reduce the hardware cost,
AddRoundKey, MixColumns and ShiftRows are integrated into a 32-bit data path, so there
are only two parallel data paths. This design achieves the lowest
hardware cost for a pure AES design (without DPA countermeasure). The detailed
description can be found in [58], and the reconfigurable design of this architecture can
be found in [66]. Table 6.4 shows the hardware cost of AES0 at an 80 MHz clock frequency.
AES with Independent ARK and Data Sliding (AES 1.0)
This AES design uses the architecture shown in Figure 6.10. Only Independent ARK and
Data Sliding are used; in other words, the 17-to-1 multiplexer shown in this figure is
not used. Independent ARK and Data Sliding are inherent in this architecture and
dataflow. Table 6.6 shows the hardware cost of AES1.0. The clock frequency reaches
125 MHz, which is much higher than that of AES0. The reason is that AES1.0 uses three
data paths, so its critical path is much shorter than that of AES0.
Table 6.4 AES0@80MHz, TSMC 0.18um (Pure AES)
Components                                              Gates
S-Box                                                   358
MixColumns                                              376
Key Expander                                            1935
Controller                                              247
Data Registers + others (ARK, ShiftRow)                 1762
Total                                                   4678

Table 6.5 AES1.1@125MHz, TSMC 0.18um (AES + Subbytes Hiding)
Components                                              Gates
S-Box                                                   383
MixColumns                                              313
KeyExpander                                             2220
Controller                                              235
Data Registers + Others (Multiplexer, ARK, ShiftRows)   3093
Total                                                   6244

Table 6.6 AES1.0@125MHz, TSMC 0.18um (AES + Independent ARK, Data Sliding)
Components                                              Gates
S-Box                                                   423
MixColumns                                              313
KeyExpander                                             2223
Controller                                              235
Data Registers + others (ARK, ShiftRow)                 2306
Total                                                   5500

Table 6.7 AES1.2@75MHz, TSMC 0.18um (AES + Masking)
Components                                              Gates
S-Box                                                   1124
MixColumns                                              325
KeyExpander                                             2220
Controller                                              235
Data Registers + Others (Xor Gates, ARK, ShiftRows)     2930
Total                                                   6834
AES with Subbytes hiding (AES 1.1)
AES with SubBytes hiding adds a 16-to-1 multiplexer compared to AES1.0. The same
architecture and dataflow are used. Table 6.5 shows the hardware cost of AES1.1
at 125 MHz. The performance of AES1.1 is the same as that of AES1.0.
AES with masking (AES 1.2)
AES with masking consists of register masking and S-Box masking. Since masking
adds a lot of circuitry to the original design, the performance is reduced considerably
compared to AES1.0. Table 6.7 shows the hardware cost of AES1.2. The clock frequency of
AES1.2 drops to 75 MHz.
From the hardware implementation results listed above, Independent ARK and Data
Sliding have the smallest effect on hardware cost. SubBytes hiding has the second
smallest effect. Masking is the weakest method in terms of both hardware
cost and performance. For hardware-cost-sensitive AES designs, such as RFID,
our proposed Independent ARK, Data Sliding and SubBytes hiding are much better suited
than masking.
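The comparison can be checked numerically from the gate counts in Tables 6.4 to 6.7. The
sketch below only re-adds the reported component figures and expresses the cost of hiding
and masking relative to AES1.0, which shares their architecture. The choice of AES1.0 as
baseline and the names GATES, total and overhead_percent are assumptions made for this
illustration.

```python
# Component gate counts copied from Tables 6.4-6.7, in the order
# S-Box, MixColumns, KeyExpander, Controller, Data Registers + others.
GATES = {
    "AES0":   (358, 376, 1935, 247, 1762),
    "AES1.0": (423, 313, 2223, 235, 2306),
    "AES1.1": (383, 313, 2220, 235, 3093),
    "AES1.2": (1124, 325, 2220, 235, 2930),
}
REPORTED_TOTAL = {"AES0": 4678, "AES1.0": 5500, "AES1.1": 6244, "AES1.2": 6834}


def total(design):
    """Sum of the component gate counts of one design."""
    return sum(GATES[design])


def overhead_percent(design, baseline="AES1.0"):
    """Extra gates of a design relative to the baseline, in percent."""
    return 100.0 * (total(design) - total(baseline)) / total(baseline)


for name in GATES:
    assert total(name) == REPORTED_TOTAL[name]   # component sums match the tables

print(f"{overhead_percent('AES1.1'):.1f}")  # 13.5 (SubBytes hiding)
print(f"{overhead_percent('AES1.2'):.1f}")  # 24.3 (masking)
```

With these figures, SubBytes hiding adds 744 gates to AES1.0 and masking adds 1334 gates,
while masking also lowers the clock frequency from 125 MHz to 75 MHz.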
6.3 DPA Attack Evaluation Environment
6.3.1 DPA attack platform
In order to carry out the DPA attack on the AES algorithm, an attack environment
is needed first. We use the SASEBO board to run the power analysis attack tests. We also
need an oscilloscope to capture the power traces from the FPGA board. Moreover,
a PC is needed to process the captured power trace data, using the specified power
model.
Figure 6.12 shows a photo of our DPA attack system. Figure 6.13 shows
the system architecture. We use the SASEBO board provided by AIST to perform the AES
operation. The board is connected to an independent power supply. While the encryption
is running, we use a digital oscilloscope to capture the power traces. The power
trace data are recorded when a trigger signal occurs. After recording, the data are
transmitted back to the PC. Two types of data are transferred: the cipher text encrypted
by the SASEBO device is transmitted to the PC through RS232 serial port communication,
and the digitized power trace waveform data are transmitted back through LAN. The power
analysis attack is based entirely on the power consumption data and the cipher text. A
detailed description of our DPA attack system can also be found in [69].
6.3.2 Sasebo Board
Side-channel Attack Standard Evaluation Board (SASEBO) is a board specifically
designed to develop standard evaluation schemes for securing cryptographic modules
against physical attacks. It is developed by AIST and Tohoku University
[59][60]. It has an FPGA version and an ASIC version. The FPGA version uses a Xilinx
Virtex-II Pro XC2VP7 FPGA to implement AES designs on the board. The ASIC version uses
an ASIC chip in which several AES designs have already been implemented.
In this dissertation, we only use the FPGA version, because the proposed AES hardware
can be implemented in the FPGA. Figure 6.14 shows the architecture of the SASEBO
board. There are two FPGAs on this board: FPGA1 is used for cryptographic
operation; FPGA2 is used for control logic. Two EEPROMs are used to configure the FPGAs,
and the configuration file is downloaded through the JTAG port. The power supply and
clock source of each FPGA are separate. For PC communication, an RS232 serial port is
used. An LED module is used to show the internal status of FPGA1. A detailed description
of the SASEBO-G board can be found on the websites [59][60].
Figure 6.12 DPA Attack Evaluation System (Photo).
Figure 6.13 DPA Attack Evaluation System (Architecture).
Figure 6.14 Sasebo Board.
6.3.3 Test Flow
In this part, we describe the testing process. The complete test flow is
shown in Figure 6.15.
First, we select AES encryption. Then the oscilloscope is initialized
(for example, the sampling rate is set to 2 GSa/s), and the number of AES operations
to execute is set. After that, the oscilloscope is checked to see whether it is in the Run
status. If not, the flow returns to the oscilloscope initialization phase. Next, the
control signal is sent to the FPGA through the RS232C serial port, according to the input
data format. After receiving the control signal, the FPGA can perform encryption,
decryption or reset. Here, we perform encryption in order to obtain power trace data.
After the encryption, the data are transferred back to the PC. At the same time, we check
whether a trigger occurred. If not, something is wrong with the FPGA. If the FPGA is
running normally, the power trace data are recorded on the PC through LAN, as a CSV or
text file. When the required number of AES operations has been reached, which means that
all operations are done, the flow enters the DPA attack phase.
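The acquisition loop of this test flow can be outlined in code. The sketch below is an
assumption-laden illustration: StubScope and StubBoard are hypothetical stand-ins for the
LAN-controlled oscilloscope and the RS232C-connected FPGA board, the "encryption" is a
placeholder rather than AES, and writing the traces to a CSV or text file is omitted.

```python
import random


class StubScope:
    """Hypothetical stand-in for the oscilloscope controlled over LAN."""

    def __init__(self):
        self.running = False
        self.sample_rate = None

    def initialize(self, sample_rate):
        self.sample_rate = sample_rate
        self.running = True          # a real instrument may fail to enter Run status

    def triggered(self):
        return True                  # a real instrument reports its trigger state

    def read_trace(self, points=8):
        return [random.random() for _ in range(points)]


class StubBoard:
    """Hypothetical stand-in for the FPGA board on the serial link."""

    def encrypt(self, plaintext):
        return bytes(b ^ 0xFF for b in plaintext)   # placeholder, NOT AES


def acquire(scope, board, n_operations, sample_rate=2e9):
    """Collect (cipher text, power trace) pairs following the flow of Figure 6.15."""
    records = []
    scope.initialize(sample_rate)
    while len(records) < n_operations:
        if not scope.running:                       # not in Run status: initialize again
            scope.initialize(sample_rate)
            continue
        plaintext = bytes(random.randrange(256) for _ in range(16))
        ciphertext = board.encrypt(plaintext)       # result returned over the serial port
        if not scope.triggered():
            raise RuntimeError("no trigger: something is wrong with the FPGA")
        records.append((ciphertext, scope.read_trace()))   # trace read over LAN
    return records


records = acquire(StubScope(), StubBoard(), n_operations=5)
print(len(records))  # 5
```

Once the requested number of operations has been recorded, the collected pairs are the
input to the DPA attack phase.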
Figure 6.15 DPA attack test flow.
In the DPA attack step, we use the Hamming distance model to relate the
power traces to the processed data. We then use the correlation coefficient to measure
the strength of this relationship. Throughout the flow, data have to be transferred
between different pieces of equipment. An RS232C serial port connects the host PC and
the FPGA board, and LAN is used to control data reading/writing between the PC and the
oscilloscope.
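The combination of a Hamming distance model and a correlation coefficient can be
illustrated on simulated data. The sketch below is not the attack program used in this
chapter: the leakage is synthetic (Hamming distance plus Gaussian noise), the register is
assumed to hold the input byte before being overwritten by the S-Box output, a single key
byte is attacked, and all function names are hypothetical.

```python
import random


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def sbox(x):
    """AES S-Box: multiplicative inverse followed by the affine transform."""
    inv = 0
    if x:
        inv = 1
        for _ in range(254):            # x^254 is the inverse of x in GF(2^8)
            inv = gf_mul(inv, x)
    y = inv
    for shift in range(1, 5):
        y ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return y ^ 0x63


SBOX = [sbox(x) for x in range(256)]
HW = [bin(x).count("1") for x in range(256)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0


def simulate(key_byte, n_traces, rng, noise=1.0):
    """Synthetic leakage: HD between old register value d and new value S(d ^ k)."""
    data = [rng.randrange(256) for _ in range(n_traces)]
    leak = [HW[d ^ SBOX[d ^ key_byte]] + rng.gauss(0.0, noise) for d in data]
    return data, leak


def attack(data, leak):
    """Return the key guess whose Hamming distance model correlates best."""
    scores = []
    for guess in range(256):
        model = [HW[d ^ SBOX[d ^ guess]] for d in data]
        scores.append(abs(pearson(model, leak)))
    return max(range(256), key=scores.__getitem__)


rng = random.Random(2009)
data, leak = simulate(key_byte=0x3C, n_traces=500, rng=rng)
print(hex(attack(data, leak)))
```

In a real measurement each trace contains many sample points, so the correlation is
computed per point and per key guess, which is what the 2-D and 3-D views of the attack
results display.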
6.4 Experiment Results of DPA Attack
Figure 6.16 shows a power trace measured by the oscilloscope. One AES encryption
takes 211 clock cycles in total. We sample about 5000 points from the start to the end
of the encryption. For one DPA attack, we collect 5000 power
traces for analysis. Figure 6.17 and Figure 6.18 show the 2-D and 3-D results of the
DPA attack on pure AES. Figure 6.19 and Figure 6.20 show the DPA attack result for the
AES design with only the hiding method. Figure 6.21 and Figure 6.22 show the DPA attack
result for the AES design with only the masking methods. Figure 6.23 and Figure 6.24 show
the DPA attack result for the AES design with only Independent ARK and Data Sliding. From
these figures, all of the proposed DPA countermeasure methods resist the DPA attack
in this experiment.
Figure 6.16 Power trace from oscilloscope
Figure 6.17 2-D view of DPA attack on Pure AES.
Figure 6.18 3-D view of DPA attack on Pure AES.
(4th byte of key, the other 15 bytes of key are similar)
Figure 6.19 2-D view of DPA attack on AES with Subbytes hiding.
Figure 6.20 3-D view of DPA attack on AES with Subbytes hiding.
(4th byte of key, the other 15 bytes of key are similar)
Figure 6.21 2-D view of DPA attack on AES with masking.
Figure 6.22 3-D view of DPA attack on AES with masking.
(4th byte of key, the other 15 bytes of key are similar)
Figure 6.23 2-D view of DPA attack on AES with Independent ARK and Data Sliding.
Figure 6.24 3-D view of DPA attack on AES with Independent ARK and Data Sliding.
(4th byte of key, the other 15 bytes of key are similar)
6.5 Chip Design
In order to evaluate the proposed countermeasure methods in ASIC, a test chip is
designed. This chip is designed for the VDEC project [70]. The ROHM 0.18 um standard cell
library is used. The chip size is 2.5 mm × 2.5 mm under the given
constraints. This chip contains the four AES designs discussed in Section 6.2.4. The top
module consists of multiplexers and a UART module. A select signal chooses which
of the four AES designs is running. The architecture of this chip is shown in Figure 6.25.
Figure 6.25 Test Chip Architecture.
Figure 6.26 Chip design of AES (blocks shown: AES0, AES1.0, AES1.1, AES1.2, TOP Module + UART).
Table 6.8 VDEC Test Chip.
Technology    ROHM 0.18 um
Chip Size     2.5mm × 2.5mm
PAD Number    128
Voltage       1.8V
Metal         5
Frequency     ~100 MHz
Designs       AES0, AES1.0, AES1.1, AES1.2
6.6 Conclusion
This chapter presented five DPA countermeasure methods for AES hardware design:
Register Masking, S-Box Masking, SubBytes Hiding, Independent ARK and Data Sliding.
The theoretical analysis shows that the complexity of a DPA attack on an AES that uses
the hybrid countermeasure solution is increased to 212N times. Thus, even if one
or two countermeasure methods are broken, the remaining countermeasure methods
can still prevent a successful attack. For hardware design, an ultra low-cost AES
design with these countermeasure methods is proposed. This AES is designed for
real-time video encryption. Only one S-Box and one MixColumns are used in the
architecture. The effect of the different countermeasure methods on hardware cost is
discussed. Finally, in order to evaluate the effectiveness of the proposed countermeasure
methods, a DPA attack evaluation system and a test chip containing four AES cores were
implemented. The DPA attack experimental results show that the proposed
countermeasure methods successfully prevent the DPA attack.
7 Conclusion
In this dissertation, a new video encryption scheme and the hardware design of the
encryption module are proposed. The work consists of three parts: 1) At the algorithm
level, a new video encryption scheme is proposed. 2) At the hardware level, optimized
hardware architectures for the AES and RSA algorithms are proposed. 3) At the security
level, DPA countermeasure methods for AES hardware design are proposed.
Conventional selective video encryption schemes have many problems, such as low
security, high computational cost and difficulty of implementation. In order to improve
the security and reduce the computational cost of video encryption, we proposed an Unequal
Secure Encryption (USE) scheme for video encryption, especially for the H.264/AVC video
coding standard. This scheme mainly includes two parts: data classification and unequal
secure encryption. For data classification, we proposed three data classification methods
based on H.264/AVC. After data classification, the video bit stream is separated into
two parts: an important data partition and an unimportant data partition. Four
security levels are defined in the USE scheme. These security levels are used to balance
security strength against computational complexity. For unequal secure encryption, we use
two encryption methods: the AES encryption algorithm for the important data partition, and
the FLEX encryption algorithm for the unimportant data partition. The FLEX algorithm is
based on AES and is 5 times as fast as AES. In this way, for the encryption module design,
only AES needs to be implemented.
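The saving offered by unequal encryption can be estimated with a one-line cost model.
The sketch below is a rough illustration only: it assumes that encryption time is
proportional to data size, that FLEX is exactly 5 times as fast as AES, and that the
fraction of important data is a free parameter. The function name relative_cost is
hypothetical, and the four security levels of the USE scheme are not modelled.

```python
def relative_cost(important_fraction, flex_speedup=5.0):
    """Encryption time relative to encrypting the whole stream with AES."""
    if not 0.0 <= important_fraction <= 1.0:
        raise ValueError("important_fraction must lie in [0, 1]")
    # AES on the important partition, FLEX on the remaining data.
    return important_fraction + (1.0 - important_fraction) / flex_speedup


print(relative_cost(1.0))  # 1.0: everything encrypted with AES
print(relative_cost(0.0))  # 0.2: everything encrypted with FLEX
```

Under these assumptions the cost falls linearly from 1.0 to 0.2 as the important data
partition shrinks.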
For the hardware design of the AES algorithm, a scalable architecture is proposed. Since
the video data size varies greatly across video levels, a fixed
architecture with one specific performance is not a good solution. In the proposed
scalable architecture, the number of S-Boxes and MixColumns units is configurable
in this architecture. In total, 1-20 S-Boxes and 1-4 MixColumns can be used. The
experimental results show that the lowest-cost implementation uses only one S-Box and
one MixColumns and achieves a throughput of 75 Mbps, while the highest-performance
implementation, using 20 S-Boxes and 4 MixColumns, achieves a throughput of 2.4
Gbps.
For RSA hardware design, firstly, a modified scalable high-radix Montgomery
algorithm is proposed to reduce the critical path. Secondly, a high-radix clock-saving
dataflow is proposed to support high-radix operation and one clock cycle delay in
dataflow. Finally, a hardware-reused architecture is proposed to reduce the hardware cost
and a parallel radix-16 design of the data path is proposed to increase the speed. The
implementation results show that the total cost of the Montgomery multiplier is 130 KGates,
the clock frequency is 180 MHz and the throughput of 1024-bit RSA encryption is 352
Kbps. This design is suitable for high-speed RSA or ECC encryption/
decryption. As a scalable design, it supports encryption/decryption with any key length
up to the size of the on-chip memory.
In order to enhance the security of the AES encryption module, especially against DPA
attacks, we proposed five DPA attack countermeasure methods: Register
Masking, S-Box Masking, SubBytes Hiding, Independent ARK and Data Sliding. Combining
these methods, an ultra low-cost AES design with multiple DPA countermeasure
methods is proposed. The DPA attack experimental results show that the proposed
methods successfully prevent the DPA attack.
In conclusion, an efficient video encryption scheme for the H.264/AVC video coding
standard and the hardware implementation of the encryption module are presented in this
dissertation. The proposed design is useful for secure video
communication systems.
Reference
[1] ISO/IEC 11172, Information technology coding of moving pictures and associated
audio for digital storage media at up to about 1.5Mbit/s, 1993 (MPEG-1).
[2] ISO/IEC 13818, Information technology: generic coding of moving pictures and
associated audio information, 1995 (MPEG-2).
[3] ISO/IEC 14496-2, Coding of audio-visual objects Part 2: visual, 2001.
[4] ISO/IEC 15938, Information technology multimedia content description interface
(MPEG-7), 2002.
[5] ISO/IEC 21000, Information technology multimedia framework (MPEG-21), 2003.
[6] ITU-T Recommendation H.261, Video CODEC for audiovisual services at px64
kbit/s, 1993
[7] ITU-T Recommendation H.263, Video coding for low bit rate communication,
Version 2, 1998.
[8] ISO/IEC 14496-10 and ITU-T Rec. H.264, Advanced Video Coding, 2003.
[9] X. Liu and A.M. Eskicioglu "Selective Encryption of Multimedia Content in
Distribution
Conference on Communications, Internet and Information Technology (CIIT 2003),
Scottsdale, AZ, November 17-19, 2003.
[10]
International Journal on Computer and Graphics, Special Issue on Data Security in
Image Communication and Network, 22(3), 1998.
[11]
C, Ch. 3, pp. 93-131.
December 2004.
[12]
Comprehensive Report on Information Security, International Engineering
Consortium, Chicago, IL, 2003.
[13]T. Lookabaugh, D. C. Sicker, D. M. Keaton,
Analysis of Selectively Encrypted MPEG-
Applications VI Conference, Orlando, FL, September 7-11, 2003.
[14]
Example MPEG-
Berlin, Germany, May 1995.
[15]T.B. Maples and G.A. Spanos, "Performance study of selective encryption scheme
for the security of networked real-time video," in Proceedings of the 4th International
Conference on Computer and Communications, Las Vegas, NV, 1995.
[16]G.A. Spanos and T.B. Maples, "Security for Real-Time MPEG Compressed Video in
Distributed Multimedia Applications," in Conference on Computers and
Communications, 1996, pp. 72-78.
[17]L.
Proceedings of the 4th ACM International Multimedia Conference, Boston, MA,
November 18-22, 1996, pp. 219-230.
[18]
Proceedings of the 1st International Conference on Imaging Science, Systems and
-29.
[19]
of the 6th International Multimedia Conference, Bristol, UK, September 12-16, 1998.
[20]C. Shi, S.- -Time Using
Processing Techniques and Applications (PDPTA'99), Las Vegas, NV, June 28 - July
1, 1999.
[21]A. M. Alattar, G. I. Al-Regib and S. A. Al-
techniques for Secure Transmission of MPEG Video Bit-
the 1999 International Conference on Image Processing (ICIP '99), Vol. 4, Kobe,
Japan, October 24-28, 1999,pp. 256-260.
[22]
Information Security (ISW 99), Kuala Lumpur, Malaysia, November 1999, Lecture
Notes in Computer Science, Vol. 1729, pp. 191-201, 1999.
[23]
Transactions on Signal Processing, 48(8), 2000, pp. 2439-2451.
[24]A.S. Tosun, -layer coding and encryption of MPEG
July 2000, pp. 119-122.
[25] -Compliant
Configurabl
Transactions of Circuits and Systems for Video Technology, Vol. 12, No. 6, June
2002, pp. 545-557.
[26]
ctions on Multimedia, Vol. 5, No. 1, March 2002, pp. 118-129.
[27] -
International Workshop on Multimedia Signal Processing, St. Thomas, US Virgin
Islands, December 9-11, 2002.
[28]L.S. Choon, -effective MPEG video
Technologies: From Theory to Applications, 19-23 April 2004, pp.525-526.
[29]
Proceedings of the 12th annual ACM international conference on Multimedia, New
York, USA, October 10-16, 2004, pp.304-307.
[30]
MPEG Com
Communications and Computer Sciences, Volume E89-A, Issue 1, January 2006,
pp.194-202.
[31]
based on event shuffl -Pacific
Conference on Circuits and Systems, Volume 2, 6-9 Dec. 2004, pp.761-764.
[32]
information security,
Huistenbosch, Japan, 23-25 January, 2007, 3B3-1.
[33]C.-P. Wu and C.-
Boston, MA, November 2000, pp. 284-295.
[34]C.-P. Wu and C.-
Volume 4314, San Jose, CA, January 2001.
[35]I. K. Cheong, Y. C. Hung, Y. S. Tung, S. R. Ke
electronics, 8-12 January 2005, pp.61-62.
[36]National Institute of Standards and Technology (U.S.). Data Encryption Standard
(DES). FIPS Publication 46-3, NIST, 1999.
[37]National Institute of Standards and Technology (U.S.). Advanced Encryption
Standards (AES). FIPS Publication 197, 2001.
[38]R. L. Rivest, A. Shamir, and L. Adleman, A method for obtaining digital
signatures and public key cryptosystems, Communications of the ACM, 21(1978),
120-126.
[39]
[40]
ypt 91, 1991,
pp.17-38.
[41]
Proceedings of the Symposium on Network and Distributed Systems Security, IEEE,
1996.
[42]L. Qiao, K. Nahrstedt, and I. Tam, "Is MPEG Encryption by Using Random List
Instead of Zigzag Order Secure?" IEEE International Symposium on Consumer
Electronics, December 1997. Singapore.
[43]
2002, available at http://raidlab.cs.purdue.edu/papers/mm.ps.
[44]
Cryptology TATRACRYPT 2003, Bratislava, Slovak Republic, 2003.
[45]A. Alattar and G. Al-
secure transmission of MPEG video bit-
International Symposium on Circuits and Systems, vol. 4, pp IV-340-IV-343, 1999.
[46]Ia -4 Video Compression, Video coding for
next- -223.
[47]
nsactions on Circuits and Systems for Video
Technology, Volume 13, Issue 7, July 2003, pp.560 - 576.
[48]
Architecture with S- - ASIACRYPT
2001, 7th International Conference on the Theory and Application of Cryptology and
Information Security, Gold Coast, Australia, December 9-13, 2001, pp.239-254.
[49] -
Embedded Systems CHES, September 2005, pp.441-455.
[50]
Systems - CHES 2004, Volume 3156, 2004, pp.357-370.
[51]OneSeg in Japan. http://en.wikipedia.org/wiki/Oneseg
[52]P. Kocher, J. Jaffe, and B. Jun. Differential power analysis. In M. Wiener, editor,
in Computer Science, pages 388-397, Santa Barbara, CA, USA, August 15-19 1999.
Springer-Verlag.
[53]S. Mangard, E. Oswald, and T. Popp, "Power Analysis Attacks: Revealing the Secrets
of Smart Cards", Springer, 2007.
[54]Eric Brier, Christophe Clavier, and Francis Olivier, "Optimal Statistical Power
Analysis", Cryptology ePrint Archive, http://eprint.iacr.org/2003/152.pdf
[55]M. L. Akkar, C. Giraud
Proceedings International Workshop on Cryptographic Hardware and
Embedded Systems (CHES 2001), pp.309-318, 2001.
[56]M. Joye, P. Paillier, B. Schoenmakers, On second-order differential power analysis,
Proceedings International Workshop on Cryptographic Hardware and Embedded
Systems (CHES 2005), pp.293-308, 2005.
[57]H.264 in wikipedia. http://en.wikipedia.org/wiki/H.264
[58]Yibo Fan, Jidong Wang, T. Ikenaga, S. Goto, "Mixed bus width architecture for low
cost AES VLSI design", 7th International Conference on ASIC (ASICON 2007),
22-25 Oct. 2007, pp.854-857.
[59]SASEBO project in Research Center for Information Security(RCIS),
www.rcis.aist.go.jp/special/SASEBO/
[60]Cryptographic hardware project in TOHOKU University,
http://www.aoki.ecei.tohoku.ac.jp/crypto/
[61]
Unequal Secure Encryption Scheme For H.264/AVC Video Compression Standard
Dat -A, No.1, pp.12-21,
Jan 2008.
[62]Yibo FAN
-Rim conference on multimedia (PCM 2007), 2007.
[63]Jidong Wang, Yibo FAN
2007.
[64]Jidong Wang, Yibo FAN
Scheme for H.264 Format Vi
systems in karuizawa, 23-24 April, 2007.
[65]Yibo FAN
(IPS), pp. 17-20, Taipei, Taiwan, July 2007.
[66]Yibo FAN, Takeshi Ikenaga, Satoshi Goto, "A Low-cost Reconfigurable Architecture
for AES Algorithm", International Conference on Information and Communications
Security (ICICS 2008), Prague, Czech Republic, July 25-27, 2008.
[67]Yibo FAN, Takeshi IKENAGA, Yukiyasu TSUNOO, Satoshi GOTO, "A Low-cost
LSI design of AES against DPA attack by hiding power information", The 21st
Workshop on Circuits and Systems in Karuizawa, 2008.
[68] -speed Design of Montgomery
-A, No.4, pp.971-977, April,
2008.
[69]Guoyu QIAN, Yibo FAN, Yukiyasu Tsunoo, Takeshi Ikenaga, Satoshi Goto, "FPGA
& ASIC Implementation of Differential Power Analysis Attack on AES", the 4th
International Conferences on Information Security and Cryptology, Dec. 14-17, 2008
(to be published)
[70]VDEC. http://www.vdec.u-tokyo.ac.jp/
[71] Andreas Uhl, Andreas Pommer, "Image and video encryption, from digital rights
management to secured personal communication", Springer, 1 edition November 4,
2004.
[72] -based RSA crypto-processor
-Pacific
Conference on Advanced System Integrated Circuits, pp.218-221, Aug.4-5, 2004.
[73]
computation, vol.44, no.170, pp.519- 521, April 1985.
[74] comparing Montgomery
-33, June 1996.
[75]
.
1215-1221, Sep. 2003.
[76] -Radix Design of a Scalable Modular
-CHES 2001, Lecture
Notes in Computer Science, no.2162, pp.189-205, May 13-16, 2001.
[77]G. Todorov -radix
[78] -16
design of a scalable Montgomery multi
ASIC-ASICON 2005, vol.1, 24-27, pp.153-157, Oct. 2005.
[79]
Computer Arithmetic, pp.172-178, June 27-29, 2005.
[80]
International Workshop on System-on-Chip for Real-Time Applications, pp.400- 404,
July 2005.
Publications
International Journal
[1] Yibo FAN
Transaction on Electronics, Vol.E91-C, No.4, pp.440-448, April, 2008.
[2] Yibo FAN, Takeshi Ikenaga, Satoshi -speed Design of Montgomery
-A, No.4, pp.971-977, April,
2008.
[3] Yibo FAN
Unequal Secure Encryption Scheme For H.264/AVC Video Compression Standard
-A, No.1, pp.12-21,
Jan 2008.
International Conference (with review)
[1] Guoyu QIAN, Yibo FAN, Yukiyasu Tsunoo, Takeshi Ikenaga, Satoshi Goto, "FPGA
& ASIC Implementation of Differential Power Analysis Attack on AES", the 4th
International Conferences on Information Security and Cryptology, Dec. 14-17, 2008
(to be published)
[2] Yibo FAN, Takeshi Ikenaga, Satoshi Goto, "Optimized 2-D SAD Tree Architecture of
Integer Motion Estimation for H.264/AVC", 16th IFIP/IEEE international conference
on very large scale integration (VLSI-SoC 2008), Rhodes Island, Greece, Oct. 13-15,
2008.
[3] Yibo FAN, Takeshi Ikenaga, Satoshi Goto, "Fast VBSME design
using reconfigurable hardware architecture and search range reduction algorithm", The
10th IASTED International Conference on Signal and Image Processing (SIP 2008),
Kailua-Kona, Hawaii, August 18-20, 2008.
[4] Yibo FAN, Takeshi Ikenaga, Satoshi Goto, "A Low-cost Reconfigurable Architecture
for AES Algorithm", International Conference on Information and Communications
Security (ICICS 2008), Prague, Czech Republic, July 25-27, 2008.
[5] Yibo FAN
-Rim conference on multimedia (PCM 2007), 2007.
[6] Yibo FAN
ASIC (ASICON 2007), 2007.
[7] Jidong Wang, Yibo FAN, Takeshi Ikenaga, Satoshi G
2007.
[8] Yibo FAN
hop on SOC
(IPS), pp. 17-20, Taipei, Taiwan, July 2007.
Domestic Conference (with review)
[1] Yibo FAN, Takeshi IKENAGA, Yukiyasu TSUNOO, Satoshi GOTO, "A Low-cost
LSI design of AES against DPA attack by hiding power information", The 21st
Workshop on Circuits and Systems in Karuizawa, 2008.
[2] Yibo FAN -Speed Design of Montgomery
-24 April,
2007.
[3] Jidong Wang, Yibo FAN fficient Encryption
systems in karuizawa, 23-24 April, 2007.
Domestic Conference (without review)
[1] Yibo FAN, Jidong WANG, Takeshi IKENAGA, Yukiyasu TSUNOO, Satoshi GOTO,
"Hardware Evaluation of eSTREAM Stream Cipher Candidates in Phase 3 Profile 2:
Moustique, Pomaranch and Decim v2", Symposium on Cryptography and
Information Security (SCIS), 2008.
[2] Yibo FAN, Xiaoyang Zeng, Takeshi Ikenaga, Satoshi Goto, "Hardware Reuse
Architecture for High-Radix Scalable Montgomery Multiplier", 2E2-1, Symposium
on Cryptography and Information Security (SCIS2007), Jan. 2007.
[3] Jidong Wang, Yibo FAN, Xiaoyang Zeng, Takeshi Ikenaga, Satoshi Goto, "No
Compression Ratio Reduction H.264 Video Scrambling", 3B3-1, Symposium on
Cryptography and Information Security (SCIS2007), Jan. 2007.