single- and multi-core configurable aes

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 18, NO. 4, APRIL 2010 541

Single- and Multi-core Configurable AESArchitectures for Flexible Security

Mao-Yin Wang, Chih-Pin Su, Chia-Lung Horng, Cheng-Wen Wu, and Chih-Tsun Huang, Senior Member, IEEE

Abstract—As networking technology advances, the gap be-tween network bandwidth and network processing power widens.Information security issues add to the need for developinghigh-performance network processing hardware, particularly thatfor real-time processing of cryptographic algorithms. This paperpresents a configurable architecture for Advanced EncryptionStandard (AES) encryption, whose major building blocks are agroup of AES processors. Each AES processor provides �� blockcipher schemes with a novel on-the-fly key expansion design forthe original AES algorithm and an extended AES algorithm. Inthis multicore architecture, the memory controller of each AESprocessor is designed for the maximum overlapping between datatransfer and encryption, reducing interrupt handling load of thehost processor. This design can be applied to high-speed systemssince its independent data paths greatly reduces the input/outputbandwidth problem. A test chip has been fabricated for the AESarchitecture, using a standard 0.25- m CMOS process. It has asilicon area of 6.29 mm�, containing about 200,500 logic gates,and runs at a 66-MHz clock. In electronic codebook (ECB) andcipher-block chaining (CBC) cipher modes, the throughput ratesare 844.9, 704, and 603.4 Mb/s for 128-, 192-, and 256-b keys,respectively. In order to achieve 1-Gb/s throughput (includingoverhead) at the worst case, we design a multicore architecturecontaining three AES processors with 0.18- m CMOS process.The throughput rate of the architecture is between 1.29 and3.75 Gb/s at 102 MHz. The architecture performs encryptionand decryption of large data with 128-b key in CBC mode usingon-the-fly key generation and composite field S-box, making itmore cost effective (with better thousand-gate/gigabit-per-secondratio) than conventional methods.

Index Terms—Advanced Encryption Standard (AES), config-urability, cryptography, encryption, hardware design, multicorearchitecture, network security.

I. INTRODUCTION

Applications such as electronic transaction and audio/videocommunication require not only significant network bandwidthbut also high security measures [1], [2]. Security processingis computation intensive, which normally includes lookup andfetching/updating of parameters (keys, encryption/authentica-tion algorithms, initial values, and security-related protocolinformation), encryption and authentication, data transfer,bus contention resolution, etc. Powerful security processingarchitectures are thus important in high-speed network appli-cations. Some network security designs have been reported

Manuscript received April 29, 2008; revised September 15, 2008. First pub-lished July 21, 2009; current version published March 24, 2010.

M.-Y. Wang, C.-P. Su, C.-L. Horng, and C.-W. Wu are with the Depart-ment of Electrical Engineering, National Tsing Hua University, Hsinchu 30013,Taiwan.

C.-T. Huang is with the Department of Computer Science, National TsingHua University, Hsinchu 30013, Taiwan.

Digital Object Identifier 10.1109/TVLSI.2009.2013231

recently [3], including a network processor (NP) that offloadscryptographic algorithms into a security accelerator, a securitycoprocessor that handles secure socket layer or IP securityheader processing and cryptographic algorithms, and an in-linesecurity processor that integrates a packet processing engineand multiple cryptographic engines. Another recent designmethod is an NP that contains multiple cryptographic engines[4].

The symmetric block cipher, Rijndael, is standardized by theNational Institute of Standards and Technology as the AdvancedEncryption Standard (AES) [5], which is a replacement to theoriginal Data Encryption Standard [6]. It has widely been usedin the Internet and wireless communications [7]–[9]. Parallelismis a common practice for improving hardware performance [10],[11]. Pipelined implementations were often exploited for AESin the past [12]–[14]. In [15], parallel cores and pipelining areused to enhance the throughput of AES. However, for block ci-pher with feedback, such as cipher-block chaining (CBC) mode,pipelining is limited. A parallel architecture composed of en-cryption blocks for several symmetric encryption algorithms isreported in [16], where is the number of rounds of the blockcipher.

In this paper, we propose the “AES with Tsing HuaExTended Implicit Configurability” (AESTHETIC) architec-ture. It supports the original AES algorithm and also providesthe flexibility to configure the parameters of each of the fourtransforms defined in the AES algorithm. Security-relatedprotocols usually specify critical parameter handshaking (suchas the selection of compression algorithms, cryptographicalgorithms, keys, and initial values) in the first step. Encryptionand/or authentication are done next for sensitive informationtransmission. With AESTHETIC, the user can select randomlyamong many extended AES ciphers each time an encryptionor decryption session is requested. Such a dynamic configura-bility reinforces the security in data communication. In oursecurity processing architecture, we use multiple AESTHETICprocessors to achieve high speed and enhance security level. Inthis architecture, each encryption block handles an independentpacket encryption session. Shared control interface to hostprocessor through the system bus simplifies management ofthe multicore architecture. Independent data path for eachencryption block (input/output (I/O) first in–first outs and theAESTHETIC processor) ensures high-speed data encryption.In the three-core architecture, e.g., the throughput lies between1.29 and 3.75 Gb/s, with a 102-MHz clock.

Because the proposed work is a configurable solution, wesurveyed several similar works about configurable AES forreasonable comparison. We also list here three examples for the

1063-8210/$26.00 © 2010 IEEE

542 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 18, NO. 4, APRIL 2010

need of configurable solution as follows. The first exampleis the secure storage [17]. The authors use a diversifiedAES with different irreducible polynomials to secure thestorage. The second example is that, when standard AES issuccessfully attacked, other AES schemes can be immediatelyused to avoid the risk. The final example is that specific cipherschemes are used for secure communication between twoparties with the same parameters, e.g., the Internet key exchange(IKE) protocol is used for setting up security parameters(cryptographic algorithms) on each side. If other parties donot support extra cipher schemes, they do not have enoughknowledge to communicate with the parties supporting thesecipher schemes, reducing the security risk. These configurablesolutions can also increase the security since additional variables(e.g., configuration bits for selecting one among cipherschemes) is involved. This is similar to increasing key lengthfor security enhancement.

The rest of this paper is organized as follows. Section IIdescribes basic transforms and concepts for the enhancedAES algorithm. In Section III, we present the AESTHETICarchitecture, containing I/O block converter, configurationlogic, AESTHETIC engine, and novel on-the-fly key gener-ator. In Section IV, we propose a multicore architecture forAESTHETIC and analyze its performance and memory re-quirement. The implementation results for the two architecturesare presented in Section V. Finally, Section VI concludes thispaper.

II. CONFIGURABLE AES (AESTHETIC)

The AES algorithm is a symmetric block cipher thatprocesses a series of 128-b data blocks to be encrypted (de-crypted), and produces a series of 128-b encrypted (decrypted)data blocks, with a key of 128, 192, or 256 b [5]. The algorithmconsists of four transformations on the data block, which areorganized in a 4 4 B array, called state. The four transforma-tions are as follows.

1) SubBytes(): Each element of the state is an element inwith the irreducible polynomial . A nonlinear

operation is applied to each element using an S-box (sub-stitution table) [5]. A multiplicative inverse of the elementin is performed first, followed by an affine trans-form. The affine transform can be expressed as

, where is the inverse ofthe element, and and are two polynomialswith a degree less than eight.

2) ShiftRows(): A cyclic shifting operation is done on rows ofthe state with different numbers of bytes (offsets).

3) MixColumns(): Multiplication of a four-term polynomialwith a fixed polynomial modulo is performed.Each column of the state is considered a four-term polyno-mial, with coefficients in .

4) AddRoundKey(): A bitwise XOR operation is done, whichadds a round key to the state in each iteration, where theround keys are generated by the key expansion procedure.

The AESTHETIC engine uses the same encryption/de-cryption and key expansion procedures as the original AESalgorithm. However, the user can respectively select from setsof parameters for , , , and . Against

Fig. 1. Block diagram of the AESTHETIC processor.

existing attacks, several design criteria must be satisfied toensure that the selected tuple cangenerate strong S-boxes. We have selected 16 irreduciblepolynomials of degree eight for . Each has 256 pairsof , which are selected to provide sufficientalgebraic complexity of the S-box. The matrix coefficientsof MixColumns() transform are designed to provide strongdiffusion power and guarantee that the relative inverse matrixexists. The coefficients of are also restricted to fall withinthe set . Under these constraints,we have selected 128 alternatives for . Therefore, there aretotally types of block cipher that can beselected as the extended AES algorithm.

We need to design hardware for SubBytes() and Mix-Columns() with extended configurability, considering bothhardware overhead and performance impact. Most S-box im-plementations use either the lookup table (LUT) approach[12]–[16], [18]–[23] or arithmetic computing [14], [24]–[29].The trade-off between area and computation time for variousS-box implementations has been evaluated in [30]. Usingthe ROM-based LUT approach to support as much askinds of S-boxes will consume unacceptable area. Anotherway is to use a 256 8 b SRAM to store the S-boxvalues. Any change on S-box’s parameters requires dedicatedhardware to recompute all the values, which can lead tolong configuration time. We use the arithmetic computingapproach to implement the data path of the extended AESalgorithm for smaller hardware overhead and easy configura-tion. The arithmetic computing approach results in a longercritical path than the LUT approach. The drawback, however,can be alleviated by the proposed architecture, as shownin Fig. 1, which implements the field multiplier or inverterwith a dedicated irreducible polynomial. The data path canthus be optimized for this particular polynomial instead ofusing a general-purpose inverter or multiplier. Changing theirreducible polynomial can be done easily by convertingthe data between the corresponding fields.

A. Composite Field Arithmetic

Let the elements in be represented as polynomialsof degree one, i.e., , where , . The

WANG et al.: SINGLE- AND MULTI-CORE CONFIGURABLE AES ARCHITECTURES FOR FLEXIBLE SECURITY 543

inverse operation in the S-box can then be computed by takingthe inverse of the polynomial with the irreduciblepolynomial , where ,[29]. Although there are 30 irreducible polynomials in ,only 16 of them are primitive. In order to perform basis trans-formation between and (isomorphic map-ping), a proper primitive polynomial should be obtained. Weuse the primitive polynomial(11D) to construct for producing minimal critical pathand the smallest area overhead [28], the polynomial

for , where is the root of ,and the irreducible polynomial foraccording to [31], which shows an optimal multiplierusing the composite field arithmetic. All the computation logicblocks related to the extended AES algorithm are implementedusing the composite field arithmetic in .

B. Field Conversion

When using the data path with a fixed polynomial to performan extended AES algorithm operating in a different polynomial,it requires data conversion at the I/O of the data path. The mul-tiplication and addition defined in the extended AES algorithmcan easily be done by the composite field data path with the helpof a data conversion hardware. However, the affine transform isdone on the prime field , which does not have an isomor-phic operation defined in . Additional computationof the affine transform’s parameters is required to fit into thecomposite field data path. We derive the isomorphic relation ofthe S-box with polynomials and as follows. Letbe the transfer matrix from the field with to thefield with , and let be the transfer matrix from the field

with the polynomial (11D) to the .Thus, represents the transfer matrix fromwith to . Given the field element ,we can find its composite field representation . As-sume that and is the selected polynomial and constant ofthe affine transform. The relation of the S-box with polynomial

and that in the composite field can be expressed as

where is the inverse operation with polynomial . It canbe substituted by operation in composite field with data conver-sion. From the aforementioned equation, we have

The inverse of the S-box transform can be derived in a similarway, i.e.,

Thus, , , , andmust be computed beforehand. Only 16 irreducible polynomialsare used in our design, and and are precomputedand stored in a small ROM. The parameters of the affinetransform can be obtained within a short configuration time byusing an 8 8 bitwise matrix multiplier.

III. AESTHETIC ARCHITECTURE

The AESTHETIC architecture is shown as a block diagramin Fig. 1, which supports both ECB and CBC cipher modes. Theround keys for encryption and decryption are generated on thefly, without any internal memory. The reconfiguration betweentwo extended AES algorithms can be done within three clock cy-cles. All data access operations are manipulated by the I/O inter-face. The 32–128-b input first-in–first-out (FIFO) buffer cachesthe 32-b input data from the I/O interface to form a block of128-b data, while the output FIFO buffer is used to cache the128-b output block from the AESTHETIC engine. The I/O inter-face also consists of a user key (UK) register, an initial vector(IV) register, and a control register for use in configuring theAESTHETIC engine. The main controller is designed to enablethe configuration procedure, start or halt the encryption/decryp-tion procedure of the AESTHETIC engine, and also manage thedata flow between the input buffer, the data path, and the outputbuffer. Details can be found in [29].

A. I/O Block Converter

The input data from the input buffer will be converted tothe composite field representation in the input block converter,while the output block converter converts the data inback to .

B. Main Controller

The main controller controls the AESTHETIC engine, the keygenerator, and the I/O interface. In brief, it takes three clockcycles for configuration when the control register is set prop-erly. Once the configuration is done, the first round key is storedin the AESTHETIC data path, and the key expansion proce-dure that generates the round keys will be launched immedi-ately. After the last round key is generated and stored in theAESTHETIC data path, the following step is the encryption ordecryption process. The AESTHETIC data path will write theencrypted/decrypted data to the output buffer and continue com-putation in the following clock cycle.

C. Configuration ROMs/Logics

Fig. 2 shows the configuration logic that generates thenecessary coefficients for the AESTHETIC engine. There are16 256 16 b ROMs in Affine_ROM. Each ROM storesthe and parameters for a specific polynomial.The InvAffine Polynomial ROM is used to store the inversepolynomial of each . Either or the inverse polynomialof will be selected as the output according to the stateof the signal.

TMatrix_ROM stores the transfer matrix and its inversematrix , as discussed previously. The output signals

and represent the values of the


Fig. 2. Block diagram of the configuration logic, including the ROMs that store the coefficients of polynomials, affine transform and MixColumn, and a matrixmultiplication circuit. The shaded boxes represent the pipeline registers.

Fig. 3. Block diagram of the AESTHETIC engine.

and matrices, respectively. These signals are usedby MatrixMult to compute either and or

and .The configurable coefficients of MixColumns() transform

are stored in MixColumn_ROM. Since the coefficients of thepolynomial are within therange, 16-b data is sufficient to store the coefficients of

. All the coefficients of MixColumns() and inverse Mix-Columns() transformations are selected by andmultiplexed by . The selected coefficients are then con-verted to the composite field. Finally, a control signal, namely,

, selects between the converted coefficients and the

value. Therefore, in the MixColumns() transformof the AESTHETIC engine, the data will multiply with eitherthe MixColumns matrix or a unit matrix to bypass the effect ofMixColumns() transform. These parameters can be selected by( , , , and ), whichare specified in the control register.

D. AESTHETIC Engine

Fig. 3 shows the block diagram of the AESTHETIC engine. Itconsists of four transforms in the extended AES algorithm andis used for both encryption and decryption. The Block S-box im-plements the S-box transform for a 128-b data block. The Sub-Bytes and InvSubBytes transforms are implemented using one

inverse and two affine transform modules to elim-inate the computation loop. The affine transform and inverseaffine transform circuits are identical; however, the functions aredifferent, represented by different input values of affine matrixand constants. ShiftRows and InvShiftRows perform the samefunctions, as defined in the original AES algorithm. Since theshift operations are done byte-wise, it makes no difference whenimplemented in .

Since all the coefficients of MixColumn transform are pro-grammable, it requires a 4 4 matrix multiplication. Thus,in Block MixColumn, we implement 64 multipliersto process the data block in parallel. The MixColumn and In-vMixColumn functions can easily run on the same hardware bychanging the coefficients according to the processing mode. Inthe final round of the AES algorithm where the MixColumns()transform is omitted, we configure the value of MixColumn co-efficients to perform a unit matrix multiplication. Therefore, noextra hardware is required. The bit-wise AddRoundKey() trans-form is implemented by two groups of XOR gates that are placed


Fig. 4. Data flow graphs of key expansion procedure for (a) 128-b, (b) 192-b, and (c) 256-b keys, respectively.

before and after the Block MixColumn module, respectively.The signal selects one of the two groups of XOR gatesto determine the output data.

In order to balance the computation time in ECB and CBCcipher modes, our data path is designed to perform one roundfunction per clock cycle. In Fig. 3, register R1 latches the outputdata in each round. The output will be fed back to the input ofthe data path for the next round of computation. Register R2is used to support the CBC mode. In CBC mode, R2 holds thelatest result of the encrypted data block. It will be XORed withthe next input data block. In CBC decryption mode, R2 holdsthe latest input data block. Another register R3 is used to delayone clock cycle of R2’s output, which will be XORed with thecurrent decrypted block in the output block converter.

E. Key Generator

We propose a thee-in-one key generator that is associatedwith the AESTHETIC engine. Since the key expansion proce-dure is the same as the original AES algorithm, our design canalso be applied to standard AES cipher. All the round keys aregenerated on the fly without additional memory for the subkeys.Our key generator produces one 128-b round key per clock cycleto fit our data path, regardless of the key size.

Fig. 4 shows the data flow graph for three different key sizesbased on the key expansion procedure. It can be seen that gen-erating a 128-b round key per clock cycle is straightforwardusing the 128-b key expansion procedure. For 192- and 256-bkeys, however, the timing diagram is not compatible with the128-b hardware, so we rearrange the data flow graph for 192-and 256-b keys such that only one 128-b key material is pro-duced in each time frame. By properly shuffling the UK and theround keys, we can use only four computing elements to gen-erate the 128-b round key. The result is shown in Fig. 5. Theregularity in the data flow graph makes it easy to implementthree different key expansion procedures by a unified hardware.For decryption, the key expansion procedure can be derived byreversing the computing order.

Fig. 6 shows the data path of the key generator module basedon the rearranged data flow graph, where are32-b registers to store the shuffled round keys. All the UKsare converted to and stored in ac-cording to the key size. Once the key generator is enabled, thedata path will generate the round key by properly controllingthe multiplexers in each clock cycle. The data shuffling multi-plexer is used to swap the data in the registers and the gener-ated round key iteratively. For 192-b key encryption, the output


Fig. 5. Rearranged data flow graph for 192- and 256-b key expansion.

will be stored in , , , and , while thedata in and will be stored in and to produce thenext round key (see Fig. 5). Similarly, in 256-b key encryption,

will be stored in , , , and , while thedata in , , , and will be swapped to , , ,and . The S-box transform in this data path must behave inthe same way as that in the AESTHETIC engine. Thus, the pa-rameters of affine transform must be provided to configure theS-box. All the round constants should also be convertedto before entering the key generator. In order to re-duce the critical path and eliminate the computation loop, theoutput of the S-box is broadcast to each instead of being con-nected to the zero input of mux_00 and mux_12 in Fig. 6. There-fore, additional XOR gates and multiplexers (such as mux_s0,mux_s1, mux_s2, and mux_s3) are used to provide equivalentmathematical relation.

IV. MULTI-AESTHETIC ARCHITECTURE

The multicore architecture is shown as a block diagram inFig. 7. The host processor controls each AESTHETIC processorvia an AHB-like interface. First, when the host processor has toperform a security processing task, it checks which AESTHETICprocessor is free with the resource control register of the sharedmodule. If one or more AESTHETIC processors are available,it sets the corresponding bit of the register for the selected AES-THETIC processor. Otherwise, it waits. Second, it sets the de-scriptor pointer (Initial DescriptorPtr_Reg) for reading descrip-tors in the linked-list data structure (see Fig. 8), further reducing

the interrupt handling load of the host processor. In addition,related registers (e.g., DataInPtr_Reg) are then updated wheneach descriptor in memory is indexed. In the third step, the hostprocessor writes the UK, the IV (for CBC mode), and the config-uration word to the UK, IV, and control registers, respectively.In the fourth step, each memory controller independently ac-cesses data and result to avoid contention. The data access canbe carried out simultaneously with 128-b block encryption (de-cryption) as well. In the final step, the main controller issuesan interrupt to the host processor after an encryption (decryp-tion) task is complete. The host processor then reads the inter-rupt status register, and it clears the interrupt and correspondingbit of the resource control register for the selected AESTHETICprocessor.

In the multicore architecture, the critical path lies in the AES-THETIC engine in Fig. 1, not the control path. The design isgood for multigigabit wire-speed applications by properly se-lecting the number of AESTHETIC processors. The area over-head is smaller than using multiple copies of the original AES-THETIC processors, thanks to shared control path.

A. Performance Analysis and Memory Requirement

For the multicore architecture, the number of clock cycles fordata encryption (decryption) is , where is thefixed number of clock cycles, which is composed of items inTable I, while and are the encryption cost (e.g.,for 128-b keys) for one 128-b block and the number of 128-bblocks to be encrypted or decrypted, respectively. In addition,bus contention overhead is considered if necessary.


Fig. 6. Data path of the key generator module.

Fig. 7. Block diagram of the multi-AESTHETIC architecture.

To achieve the throughput of 1 Gb/s (with the fixed overhead) in the worst case, we show a three-AESTHETIC architec-

ture. The architecture has been implemented by using a standard0.18- m CMOS technology and a cell-based design flow, oper-ating at 102 MHz with around 382,500 gates. The throughput

equations and performance results of the three-AESTHETICarchitecture are listed in Table II, where is the number ofAESTHETIC processors, is the number of 128-b blocks tobe encrypted or decrypted, and is the operating frequency.After simple analysis for packet-size distribution in the worst


Fig. 8. Structure of I/O data.

TABLE IFIXED NUMBER OF CLOCK CYCLES

Fig. 9. Throughput (in gigabits per second) for a mix of 46- and 1500-B packetsand three key sizes.

case (46-B packet) and the best case (1500-B packet), the per-formance lies between 1.29 and 3.75 Gb/s (see Fig. 9). For ex-ample, 90% in Fig. 9 means that 90% of the total packets are1500-B packets , while 10% of them are 46-B packets

. From Fig. 9, the fixed processing cost has a sig-nificant performance impact in the worst case.

The minimum memory requirement for maximum per-formance in the three-AESTHETIC architecture equals

TABLE IITHROUGHPUT (IN GIGABITS PER SECOND) RESULTS FOR THREE AESTHETIC

PROCESSORS

Fig. 10. AESTHETIC processor die photograph.

TABLE IIIAREA STATISTICS OF THE AESTHETIC PROCESSOR

, where 128 is the size of one de-scriptor, is the size of input data and result, and 2 meansthat there are two memory spaces available for the currentand the following security tasks, i.e., successive cryptographicprocessing without waiting. For encrypting a 1500-B data inthe three-AESTHETIC architecture, the total memory size is142 kb.

V. IMPLEMENTATION RESULTS

The AESTHETIC chip was first fabricated. Then, we de-signed the multicore architecture based on the AESTHETICprocessor. Therefore, two different technologies are used for theAESTHETIC chip and the multicore architecture.

A. AESTHETIC Chip

The chip is fabricated in a mature 0.25- m CMOS technologywith five-level metallization. Fig. 10 shows a microphotographof the processor whose core contains 88,000 instances with200,500 equivalent gates, measuring 3.24 mm 1.94 mmand 259.1 mW. It is implemented by standard cells. The gatecounts of the modules in AESTHETIC are summarized inTable III. A full-scan test is implemented—there are 3906 testpatterns that achieve 99.33% fault coverage. Our chip runs at66 MHz, so the throughput values are 844.8 Mb/s for 128-b


TABLE IVCOMPARISON OF AES DESIGNS

TABLE VCOMPARISON FOR SINGLE AES (ASIC)

keys, 704 Mb/s for 192-b keys, and 603.4 Mb/s for 256-b keys,respectively. The comparison of our implementation with someother AES designs is shown in Table IV. Configurable designshave larger area than nonconfigurable ones, except [23]. In thetable, the configurable AES designs with variable parametersare implemented by field-programmable gate array (FPGA)or application-specific integrated circuit [32], [33]. In [32],the authors use a function generator to produce a set of LUTcontents for the LUT-based AES according to the selectionof parameters. They provide huge parameter space for theconfigurable AES. However, the work does not report totalarea result of AES design including the function generator. In[33], the authors also provide high configurability, although thenumber of MixColumn polynomials is not clearly specified.Nevertheless, these parameters are stored in a lot of externalmemories instead of inside the AES chip. In other words, theapproach increases additional cost and security risk, e.g., theattacker can obtain sensitive security parameters by monitoringthe data between these memories and the AES chip [34]. OurAESTHETIC is designed with medium number of parameters

on a single chip, and the user simply needs to select the AEScipher scheme to perform encryption. Moreover, the securityrisk can be avoided by using the parameter-embedded designwith extra silicon cost (49,260 gates).

For the key generator, on-the-fly key generation has been pro-posed in [21], [26], and [35]. In [21], the architecture was de-signed to support 128-, 192-, and 256-b blocks with three dif-ferent key sizes. Our design provides a more compact hardwarededicated for 128-b blocks, as is defined in standard AES. Thekey generators in [26] and [35] are designed for only 128-b keys,but our key generator can be used for all three encryption anddecryption key sizes, as defined in the standard.

B. Results of Multi-AESTHETIC Architecture

In order to increase throughput, pipelining or parallel tech-niques are often used in high-speed AES designs [12]–[16].For ASIC designs [14], the comparison of our multicore im-plementation (processing 1500-B packets) with them is shownin Table V. For 128-b AES in CBC mode, using on-the-flykey generator and composite field S-box, our architecture ob-tains better thousand-gate/gigabit-per-second ratio. In addition,the three-AESTHETIC architecture operates at 102 MHz with382,500 gates and consumes around 0.19 W. For FPGA designs[12], [13], [15], [16], the implementation results are shown inTable VI. These works use various numbers of hardware unitsor pipeline stages to enhance the total AES throughput. Most ofthem support 128-b AES encryption (En) in ECB mode only.In comparison with them, our design provides additional mode(CBC), decryption, and two key sizes (192 and 256 b). Our AESI/O sizes are smaller than others (128 b) as well. These works[12], [15], [16] exploit ROMs or FPGA block RAMs (BRAMs)that reduce the critical path. In [15], high-speed cells are usedto implement the AES design for maximum performance. Also,only the main data round was implemented, so this is not the per-formance for the entire AES computation. In addition, the max-imum degree of parallelism and utilization rate are limited byencryption cycles of block cipher and I/O bandwidth [15], [16].From Tables V and VI, we know that pipelining is inefficientfor performing encryption in CBC mode without multiple round


TABLE VICOMPARISON OF PARALLEL AND PIPELINING FPGA IMPLEMENTATIONS OF AES

units [12]–[15]. The proposed configurable multicore architec-ture in CBC mode achieves medium slices/gigabit-per-secondratio. In particular, the architecture can be configured as one of

cipher schemes.In cryptography, CBC mode is better than ECB mode. Dif-

ferent key sizes imply multiple security choices as well. Forkey-agility applications, pipelined design with one key moduleis not enough and insecure. Hardware parallelism improves per-formance as well as security [16]. We can use native hardwareredundancy to avoid faulty function induced by attack, materialdefects, or radiation. For high degree of security, our param-eter-configurable design provides extended-security AEScipher schemes. Our multigigabit architecture with four securityproperties leads to higher security than other AES architectures.

VI. CONCLUSION

We have designed a configurable AES chip, called AES-THETIC, which enhances security over standard AES designs.The implicit configurability allows the user to switch among

AES-extended block ciphers. The chip supports ECBand CBC cipher modes with 128-, 192-, and 256-b keys. The

maximum throughput is around 844.8 Mb/s for 128-b keys,704 Mb/s for 192-b keys, and 603.4 Mb/s for 256-b keys undera 66-MHz clock. An on-the-fly key generator is also developedthat provides exactly one 128-b round key per clock cycle. Itcan be used in not only the extended AES algorithm but alsothe original AES algorithm.

Based on our AESTHETIC processor (with configurableS-box and MixColumns transforms), we also propose ahigh-performance multicore architecture, where independentdata path for each AESTHETIC processor provides multigi-gabit security processing without I/O bandwidth problem. Thememory controller coordinates the maximum overlapping be-tween data transfer and encryption (decryption) process. As anexample, a three-AESTHETIC architecture has been designedto provide 1.28-Gb/s throughput in the worst case (256-b keyand 46-B data). When it encrypts 1500-B packets with 128-bkey, the maximum performance is 3.61 Gb/s.

ACKNOWLEDGMENT

The cryptanalysis of the configurable AES algorithms wasdone by Prof. C.-S. Lai of National Cheng Kung University,Tainan, Taiwan.


REFERENCES

[1] S. Kent and R. Atkinson, Security architecture for the internet pro-tocol IETF Netw. Working Group, RFC 2401, 1998 [Online]. Avail-able: http://www.rfc-editor.org/RFCeditor.html

[2] A. Frier, P. Karlton, and P. Kocher, The SSL Protocol Version 3.0.Loudoun County, VA: Netscape, Nov. 1996.

[3] S. Stanley, “Security processors,” Feb. 2003 [Online]. Available: http://www.lightreading.com/document.asp?doc_id=28307

[4] “Intel IXP2850 Network Processor Product Brief,” Intel Corp., SantaClara, CA, 2003 [Online]. Available: http://www.intel.com/

[5] NIST, Springfield, VA, “Advanced Encryption Standard (AES),” Nov.2001.

[6] NIST, Springfield, VA, “Data Encryption Standard (DES),” Oct. 1999.[7] P. Chown, “Advanced Encryption Standard (AES) Ciphersuites for

Transport Layer Security (TLS),” IETF Netw. Working Group, RFC3268, 2002 [Online]. Available: http://www.rfc-editor.org/RFCed-itor.html

[8] S. Frankel, R. Glenn, and S. Kelly, “The AES-CBC cipher algorithmand its use with IPsec,” IETF Netw. Working Group, RFC 3602, 2003[Online]. Available: http://www.rfc-editor.org/RFCeditor.html

[9] IEEE 802.11i Standard, IEEE Std 802.11i]2004, Jul. 2004.[10] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quan-

titative Approach, 2nd ed. San Francisco, CA: Morgan Kaufmann,1996.

[11] D. Carlson, D. Brasili, A. Hughes, A. Jain, T. Kiszely, P. Kodan-dapani, A. Vardharajan, T. Xanthopoulos, and V. Yalala, “A highperformance SSL IPSEC protocol aware security processor,” in Proc.IEEE Int. Solid-State Circuits Conf., San Francisco, CA, Feb. 2003,pp. 142–483.

[12] P. Chodowiec, P. Khuon, and K. Gaj, “Fast implementations of se-cret-key block ciphers using mixed inner- and outer-round pipelining,”in Proc. Int. Symp. Field Programmable Gate Arrays, Monterey, CA,Feb. 2001, pp. 94–102.

[13] K. U. Jarvinen, M. T. Tommiska, and J. O. Skytta, “A fully pipelinedmemoryless 17.8 Gbps AES-128 encryptor,” in Proc. Int. Symp. FPGA,Monterey, CA, 2003, pp. 207–215.

[14] A. Hodjat and I. Verbauwhede, “Minimum area cost for a 30 to 70 Gb/sAES processor,” in Proc. IEEE Comput. Soc. Annu. Symp., Lafayette,LA, Feb. 2004, pp. 83–88.

[15] M. Alam, W. Badawy, and G. Jullien, “A novel pipelined threads ar-chitecture for AES encryption algorithm,” in Proc. IEEE Int. Conf.Appl.-Specific Syst., Architectures, Process., San Jose, CA, Jul. 2002,pp. 296–302.

[16] E. J. Swankoski, R. R. Brooks, V. Narayanan, M. Kandemir, and M.J. Irwin, “A parallel architecture for secure FPGA symmetric encryp-tion,” in Proc. 18th Int. Parallel Distrib. Process. Symp., Santa Fe, NM,Apr. 2004, p. 132.

[17] M.-H. Jing, S.-Y. Ko, and W.-C. Wu, “The SOC design of a highlysecure and reliable storage using a conceptual environment,” in Proc.IEEE Asia-Pacific Conf. Circuits Syst., Tainan, Taiwan, Dec. 2004, pp.865–868.

[18] V. Fischer and M. Drutarovsky, “Two methods of Rijndael imple-mentation in reconfigurable hardware,” in Cryptographic Hard-ware and Embedded Systems (CHES) 2001. Berlin, Germany:Springer-Verlag, May 2001, vol. 2162, LNCS, pp. 77–92.

[19] S. McMillan and C. Patterson, “JBits implementations of the advancedencryption standard (Rijndael),” in Proc. 11th Int. Conf. FPL Appl.,Aug. 2001, vol. 2147, LNCS, pp. 162–171.

[20] P. Chodowiec, K. Gaj, P. Bellows, and B. Schott, “Experimental testingof the gigabit IPSec-compliant implementations of Rijndael and tripleDES using SLAAC-1V FPGA accelerator board,” in Proc. ISC, Oct.2001, vol. 2200, LNCS, pp. 220–234.

[21] H. Kuo and I. Verbauwhede, , Ç K. Koç, D. Naccache, and C. Paar,Eds., “Architectural optimization for a 1.82 Gb/s VLSI implementationof the AES Rijndael algorithm,” in Cryptographic Hardware and Em-bedded Systems (CHES) 2001. Berlin, Germany: Springer-Verlag,May 2001, vol. 2162, LNCS.

[22] S. Morioka and A. Satoh, “A 10 Gbps full-AES crypto design witha twisted-BDD S-Box architecture,” in Proc. IEEE ICCD, Freiburg,Germany, Sep. 2002, pp. 98–103.

[23] I. Verbauwhede, P. Schaumont, and H. Kuo, “Design and performancetesting of a 2.29-GB/s Rijndael processor,” IEEE J. Solid-State Cir-cuits, vol. 38, no. 3, pp. 569–572, Mar. 2003.

[24] A. Satoh, S. Morioka, K. Takano, and S. Munetoh, “A compact Rijn-dael hardware architecture with S-box optimization,” in ASIA CRYPT2001. Berlin, Germany: Springer-Verlag, 2001, vol. 2248, LNCS, pp.239–254.

[25] J. Wolkerstorfer, E. Oswald, and M. Lamberger, “An ASIC imple-mentation of the AES SBoxes,” in CT-RSA 2002.. Berlin, Germany:Springer-Verlag, 2002, vol. 2271, LNCS, pp. 67–78.

[26] S. Mangard, M. Aigner, and S. Dominikus, “A highly regular and scal-able AES hardware architecture,” IEEE Trans. Comput., vol. 52, no. 4,pp. 483–491, Apr. 2003.

[27] A. Satoh, S. Morioka, K. Takano, and S. Munetoh, “Unified hardwarearchitecture for 128-bit block ciphers AES and Camellia,” in Cryp-tographic Hardware and Embedded Systems (CHES) 2003. Berlin,Germany: Springer-Verlag, Aug. 2003.

[28] C.-P. Su, T.-F. Lin, C.-T. Huang, and C.-W. Wu, “A high-throughputlow-cost AES processor,” IEEE Commun. Mag., vol. 41, no. 12, pp.86–91, Dec. 2003.

[29] C.-P. Su, C.-L. Horng, C.-T. Huang, and C.-W. Wu, “A configurableAES processor for enhanced security,” in Proc. ASP-DAC, Shanghai,China, Jan. 2005, pp. 361–366.

[30] U. Mayer, C. Oelsner, and T. Kohler, “Evaluation of different Rijndaelimplementations for high end servers,” in Proc. IEEE Int. Symp. Cir-cuits Syst., May 2002, vol. 2, pp. 348–351.

[31] C. Paar, “A new architecture for a parallel finite field multiplier withlow complexity based on composite fields,” IEEE Trans. Comput., vol.45, no. 7, pp. 856–861, Jul. 1996.

[32] M.-H. Jing, Z.-H. Chen, J.-H. Chen, and Y.-H. Chen, “Reconfigurablesystem for high-speed and diversified AES using FPGA,” Micropro-cess. Microsyst., vol. 31, no. 2, pp. 94–102, Mar. 2007.

[33] C.-H. Yen, T.-Y. Pai, and B.-F. Wu, “The implementations of the re-configurable Rijndael algorithm with throughput of 4.9 Gbps,” in Proc.16th VLSI Des./CAD Symp., Hualien, Taiwan, Aug. 2005.

[34] M. Neve, E. Peeters, D. Samyde, and J.-J. Quisquater, “Memories: Asurvey of their secure uses in smart cards,” in Proc. IEEE Int. SecurityStorage Workshop, Washington, DC, Oct. 2003, pp. 62–72.

[35] J. H. Shim, D. W. Kim, Y. K. Kang, T. W. Kwon, and J. R. Choi,“A rijndael cryptoprocessor using shared on-the-fly key scheduler,” inProc. 3rd IEEE Asia-Pacific Conf. ASIC, Taipei, Taiwan, Aug. 2002,pp. 89–92.

Mao-Yin Wang received the B.S. degree from theDepartment of Electronic Engineering, NationalKaohsiung University of Applied Sciences, Kaoh-siung, Taiwan, in 2001. He is currently pursuing thePh.D. degree in the Department of Electrical Engi-neering, National Tsing Hua University, Hsinchu,Taiwan.

His research interests include design and testing ofVLSI, and cryptographic hardware design.

Chih-Pin Su received the B.S.E.E., M.S.E.E., andPh.D. degrees from Department of Electrical Engi-neering, National Tsing Hua University, Hsinchu,Taiwan, in 1996, 1998, and 2004, respectively.

His research includes VLSI testing as well as VLSIcircuit and processor design, especially for crypto-graphic applications. He is also interested in designfor testability, and system–a–chip testing.

Chia-Lung Horng received the B.S. and M.S. de-grees from the Department of Electrical Engineering,National Tsing Hua University, Hsinchu, Taiwan, in2002 and 2004, respectively.

He is currently a Design Engineer with the RealtekSemiconductor Corp., Hsinchu, Taiwan. His researchinterests mainly include VLSI design, especially incryptographic hardware design.


Cheng-Wen Wu received the B.S.E.E. degree fromNational Taiwan University, Taipei, Taiwan, in 1981,and the M.S. and Ph.D. degrees, both in electrical andcomputer engineering, from the University of Cali-fornia, Santa Barbara (UCSB) in 1985 and 1987, re-spectively.

Since 1988, he has been with the Departmentof Electrical Engineering, National Tsing HuaUniversity (NTHU), Hsinchu, Taiwan, where he iscurrently a Professor. He also served as the directorof the Computer Center of NTHU from February

1996 to February 1998, and the director of the Technology Service Center ofNTHU from March 1998 to July 1999. From August 1999 to February 2000,he was a visiting researcher with the Electrical and Computer EngineeringDepartment, UCSB. He then served as Chair of the Electrical EngineeringDepartment, NTHU from 2000 to 2003, Director of the IC Design TechnologyCenter of NTHU from 2000 to 2005, and Dean of the College of ElectricalEngineering and Computer Science from 2004 to 2007. He is currently onleave from NTHU, and serving as the General Director of the SOC TechnologyCenter (STC), Industrial Technology Research Institute (ITRI). His researchinterests include architectures and algorithms with respect to design and testingof reliable VLSI cores and systems, and all testing issues of semiconductormemories. During the past few years, he has consulted for the Computer &Communication Research Labs (CCL) and the SOC Technology Center, ITRI,Hsinchu, Taiwan, the Telecommunications Labs., Chungli, Taiwan, SyntestCo., Hsinchu, Taiwan, Faraday Co., Hsinchu, Taiwan, and Global UniChipCo., Hsinchu, Taiwan.

Prof. Wu was a recipient of the Distinguished Teaching Awards from NTHUin 1996 and 2006, the Outstanding Electrical Engineering Professor Award fromthe Chinese Institute of Electrical Engineers (CIEE) in 1997, the DistinguishedResearch Awards from National Science Council in 2000 and 2002, the Indus-trial Collaboration Awards from the Ministry of Education in 2001 and 2003, theBest Paper Award of the 2002 IEEE International Workshop on Design & Di-agnostics of Electronic Circuits & Systems, the Best Paper Award of the 2003IEEE Asia & South Pacific Design Automation Conference (ASP-DAC), andthe Special Feature Award of the 2003 ASP-DAC University LSI Design Con-test. He was a recipient of the IEEE Computer Society Certificates of Appre-ciation for dedicated service to the Asian Test Symposium and the Asian Ac-tivities of the TTTC in 1997 and 2001. He is a Fellow of the IEEE ComputerSociety and the IEEE Circuits and Systems Society. He is also a member of theAsian Group of the Test Technology Technical Committee (TTTC), IEEE Com-puter Society. He chaired the Technical Meetings Group of TTTC from 2003to 2005. He was the Technical Program Chair of the Fifth Asian Test Sympo-

sium (ATS’96), General Chair of ATS’00, and the General Chair of 2005 and2006 IEEE International Workshop on Memory Technology (MTDT). He is cur-rently the Steering Committee Chair of ATS. In March, 1997, he and 10 otherresearchers founded the VLSI Test Technology Forum (VTTF). He is the Ed-itor-in-Chief for the International Journal of Electrical Engineering (IJEE), anEditor for Journal of Electronic Testing: Theory and Applications (JETTA), andan Editor for IEEE Design and Test of Computers. He was an Editor for IJEEfrom 2000 to 2003, and in 2001 he edited the IJEE Special Issue on Design andTest of System-on-Chip. He also was a Guest Editor of the Journal of Informa-tion Science and Engineering (JISE), Special Issue on VLSI Testing. He also isa life member of the Chinese Institute of Electrical Engineering (CIEE) and alife member of the IC Design Society in Taiwan. In 2004, he became a Fellow ofthe IEEE. In 2005, he received the Academic Award from the Ministry of Edu-cation, as well as the Continuous Service Award and Outstanding ContributionAward from the IEEE Computer Society. In 2006, he became a Golden CoreMember of the IEEE Computer Society, as well as a Tsing Hua Chair Professorof NTHU.

Chih-Tsun Huang (SM’09) received the B.S., M.S.,and Ph.D. degrees in electrical engineering from theNational Tsing Hua University (NTHU), Hsinchu,Taiwan, in 1994, 1996, and 2000, respectively.

He is currently an Assistant Professor with theDepartment of Computer Science, NTHU, where hasbeen since 2004. He is also serving as the AssistantDirector of the Design Technology Center (DTC),NTHU. The research direction focuses on technolo-gies for robustness computing, including securityand cryptography designs, error-correcting codecs,

and system-level debug and performance monitoring. He has coauthored about65 technical publications in IEEE journals and conferences. He also holds fourU.S. patents and four Taiwan R.O.C. patents. During the past few years he hadone technology transfer to the industry, and directly contributed to five othercases.

Prof. Huang was a recipient of the Best Paper Award of the 2003 IEEE Asiaand South Pacific Design Automation Conference (ASP-DAC) and the SpecialFeature Award of the 2003 ASP-DAC University LSI Design Contest. He wasalso an recipient of the 2008 IEEE Computer Society Certificate of Apprecia-tion, TTTC (Test Technology Technical Council) Service Award.

single- and multi-core configurable aes

Documents