AES on modern GPUs


Upload: glupescu

Post on 16-Jul-2015


Page 1: AES on modern GPUs

AES encryption using GPU architectures

Author(s): Grigore Lupescu

Scientific Advisor: Emil Slusanschi

Politehnica University of Bucharest

Automatic Control and Computers Faculty

Computer Science Department

Scientific Student Projects Session - May 2014

Page 2: AES on modern GPUs

AES Encryption (1)

17.05.2014 Scientific Student Projects Session - May 2014 2

Cipher mode of operation: an algorithm that repeatedly applies a block cipher (e.g. AES) to the input plaintext

Most operation modes require an initialization vector

Most used cipher modes: Cipher-block chaining (CBC), Counter (CTR)

Other cipher modes: Electronic codebook (ECB), Output feedback (OFB)

Why use ECB?

Simple, fast, highly parallelizable, maximum throughput

Provides a good estimate of how CTR would perform
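The relation between ECB and CTR claimed above can be sketched in plain C. This is a minimal illustration, not the authors' code: `toy_encrypt_block` is a hypothetical stand-in for the block cipher (it is not AES and not secure), and the names `ecb_encrypt` and `ctr_crypt` are assumptions made for this example. In both modes each block is processed independently, and CTR adds only a counter increment and an XOR per block, which is why ECB throughput is a reasonable proxy for CTR.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16

/* Hypothetical stand-in for a block cipher: NOT AES and NOT secure.
   It only gives the mode functions something to call. */
static void toy_encrypt_block(const uint8_t in[BLOCK], uint8_t out[BLOCK],
                              const uint8_t key[BLOCK])
{
    for (int i = 0; i < BLOCK; i++)
        out[i] = (uint8_t)((in[i] ^ key[i]) + 0x3d);
}

/* ECB: every block is encrypted on its own, so on a GPU each work-item
   could take one block. Input is assumed to be nblocks full blocks. */
static void ecb_encrypt(const uint8_t *in, uint8_t *out, size_t nblocks,
                        const uint8_t key[BLOCK])
{
    for (size_t b = 0; b < nblocks; b++)
        toy_encrypt_block(in + b * BLOCK, out + b * BLOCK, key);
}

/* CTR: block b is XORed with E(key, iv + b). Blocks are still independent;
   the same function encrypts and decrypts. */
static void ctr_crypt(const uint8_t *in, uint8_t *out, size_t nblocks,
                      const uint8_t key[BLOCK], const uint8_t iv[BLOCK])
{
    for (size_t b = 0; b < nblocks; b++) {
        uint8_t ctr[BLOCK], ks[BLOCK];
        uint64_t add = b;
        unsigned carry = 0;

        /* counter block = iv + b, treated as a big-endian integer */
        memcpy(ctr, iv, BLOCK);
        for (int i = BLOCK - 1; i >= 0; i--) {
            unsigned sum = ctr[i] + (unsigned)(add & 0xff) + carry;
            ctr[i] = (uint8_t)sum;
            carry = sum >> 8;
            add >>= 8;
        }

        toy_encrypt_block(ctr, ks, key);          /* keystream block */
        for (int i = 0; i < BLOCK; i++)
            out[b * BLOCK + i] = in[b * BLOCK + i] ^ ks[i];
    }
}
```

Two identical plaintext blocks give identical ECB ciphertext blocks but different CTR blocks, which is the usual reason CTR is preferred in practice while ECB remains convenient for benchmarking.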

Page 3: AES on modern GPUs

AES Encryption (2)


KeyExpansion: round keys are derived from the cipher key.

InitialRound: AddRoundKey.

Rounds:

SubBytes: substitution step where each byte is replaced with another according to the S-box table.

ShiftRows: transposition step where the last three rows of the state are shifted cyclically.

MixColumns: a mixing operation which operates on the columns of the state. Addition and multiplication are performed in the Galois field GF(2^8).

AddRoundKey: bitwise XOR of each byte of the state with the round key.

FinalRound: SubBytes, ShiftRows, AddRoundKey (no MixColumns).
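The round structure listed above can be written out as a compact CPU reference for a single AES-128 block. This is a sketch for checking results, not the GPU kernel from the slides: all function names are chosen for this example, the S-box is computed at first use rather than stored as a literal table, and the state is assumed to be 16 bytes in column-major order.

```c
#include <stdint.h>
#include <string.h>

static uint8_t sbox[256];

/* Multiply by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a >> 7) * 0x1b));
}

static uint8_t gmul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return p;
}

/* S-box = multiplicative inverse in GF(2^8) followed by the affine map. */
static void init_sbox(void)
{
    for (int x = 0; x < 256; x++) {
        uint8_t inv = 0, s;
        for (int y = 1; y < 256 && x != 0; y++)
            if (gmul((uint8_t)x, (uint8_t)y) == 1) { inv = (uint8_t)y; break; }
        s = inv;
        for (int r = 1; r <= 4; r++)
            s ^= (uint8_t)((inv << r) | (inv >> (8 - r)));
        sbox[x] = s ^ 0x63;
    }
}

/* KeyExpansion: 11 round keys of 16 bytes each for AES-128. */
static void key_expansion(const uint8_t key[16], uint8_t rk[176])
{
    uint8_t rcon = 1;
    memcpy(rk, key, 16);
    for (int i = 16; i < 176; i += 4) {
        uint8_t t[4];
        memcpy(t, rk + i - 4, 4);
        if (i % 16 == 0) {                 /* RotWord, SubWord, Rcon */
            uint8_t t0 = t[0];
            t[0] = sbox[t[1]] ^ rcon;
            t[1] = sbox[t[2]];
            t[2] = sbox[t[3]];
            t[3] = sbox[t0];
            rcon = xtime(rcon);
        }
        for (int j = 0; j < 4; j++)
            rk[i + j] = rk[i + j - 16] ^ t[j];
    }
}

static void add_round_key(uint8_t s[16], const uint8_t *rk)
{
    for (int i = 0; i < 16; i++) s[i] ^= rk[i];
}

static void sub_bytes(uint8_t s[16])
{
    for (int i = 0; i < 16; i++) s[i] = sbox[s[i]];
}

static void shift_rows(uint8_t s[16])
{
    uint8_t t[16];
    memcpy(t, s, 16);
    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++)
            s[4 * c + r] = t[4 * ((c + r) % 4) + r];  /* row r rotates left by r */
}

static void mix_columns(uint8_t s[16])
{
    for (int c = 0; c < 4; c++) {
        uint8_t *p = s + 4 * c;
        uint8_t a0 = p[0], a1 = p[1], a2 = p[2], a3 = p[3];
        uint8_t all = a0 ^ a1 ^ a2 ^ a3;
        p[0] ^= all ^ xtime(a0 ^ a1);      /* 2*a0 + 3*a1 + a2 + a3 */
        p[1] ^= all ^ xtime(a1 ^ a2);
        p[2] ^= all ^ xtime(a2 ^ a3);
        p[3] ^= all ^ xtime(a3 ^ a0);
    }
}

void aes128_encrypt_block(const uint8_t in[16], uint8_t out[16],
                          const uint8_t key[16])
{
    uint8_t rk[176], s[16];
    if (sbox[0] == 0) init_sbox();         /* sbox[0] is 0x63 once filled */
    key_expansion(key, rk);
    memcpy(s, in, 16);
    add_round_key(s, rk);                  /* InitialRound */
    for (int round = 1; round <= 9; round++) {
        sub_bytes(s);
        shift_rows(s);
        mix_columns(s);
        add_round_key(s, rk + 16 * round);
    }
    sub_bytes(s);                          /* FinalRound: no MixColumns */
    shift_rows(s);
    add_round_key(s, rk + 160);
    memcpy(out, s, 16);
}
```

A reference like this is mainly useful for validating GPU output against the FIPS-197 example vectors; it makes no attempt at speed or side-channel resistance.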

Page 4: AES on modern GPUs

Target System (1)


SoC CPU – AMD A4 4000K (2 cores @ 3.0 GHz, Richland architecture, AES-NI), cores denoted by BLUE

SoC integrated GPU HD7480 (iGPU), 2 SIMD units of 64 cores each (VLIW4 architecture), SIMD units denoted by RED

Discrete GPU AMD R7 250 (dGPU), 6 SIMD units of 64 cores each (GCN architecture), PCIe 2.0 x16 bus, SIMD units denoted by RED

Data to be encrypted denoted by GREEN

Software – C/C++/OpenCL, Ubuntu Linux 14.04 x64

Page 5: AES on modern GPUs

Target System (2)


Page 6: AES on modern GPUs

Algorithm Opt_1

• Array “indata” will reside in global device memory (__global)

• Variable “state”, which holds the intermediate cipher state, will be in GPU local memory used as a cache (__local)

• Simple operation “ShiftRows” is implemented with vector component addressing (state.s05AF49E38..)

• Simple operation “AddRoundKey” is a single XOR (state ^ key)

• Complex operation “SubBytes” will use a precomputed S-box table, stored in constant memory

• Complex operation “MixColumns” will use precomputed Galois finite field multiplication tables, stored in constant memory

• Host sample code below (simple blocking enqueues)

while (!done()) {
    writeData(32MB, &offset);
    execKernel(32MB, &offset);
    readData(32MB, &offset);
}
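The vector-addressing form of ShiftRows mentioned in the list above amounts to a single fixed permutation of the 16 state bytes. The sketch below shows that permutation in plain C and checks it against the row-rotation definition. It assumes a column-major state; the slide elides the tail of the swizzle, so the full index table here is derived from the ShiftRows definition rather than copied from the authors' kernel, and the function names are chosen for this example.

```c
#include <stdint.h>
#include <string.h>

/* Source index for each output byte (column-major state). Its first
   entries, 0 5 A F 4 9 E 3 8, agree with the prefix shown on the slide. */
static const uint8_t SHIFT_ROWS_PERM[16] = {
    0, 5, 10, 15,   4, 9, 14, 3,   8, 13, 2, 7,   12, 1, 6, 11
};

/* ShiftRows as one gather, the C analogue of a 16-wide vector swizzle. */
static void shift_rows_perm(uint8_t state[16])
{
    uint8_t t[16];
    memcpy(t, state, 16);
    for (int i = 0; i < 16; i++)
        state[i] = t[SHIFT_ROWS_PERM[i]];
}

/* Reference definition: row r is rotated left by r positions. */
static void shift_rows_ref(uint8_t state[16])
{
    uint8_t t[16];
    memcpy(t, state, 16);
    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++)
            state[4 * c + r] = t[4 * ((c + r) % 4) + r];
}
```

On a GPU the appeal of this form is presumably that the whole step becomes one register-level shuffle with no loops or memory traffic.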


Page 7: AES on modern GPUs

Results Opt_1


• AMD CodeXL profiling, initial results – iGPU A4 4000, ~100 MB/s, AES-128 ECB

Page 8: AES on modern GPUs

Algorithm Opt_2

• Array “indata” will reside in global device memory (__global)

• Variable “state”, which holds the intermediate cipher state, will be in GPU local memory used as a cache (__local)

• Simple operation “ShiftRows” - unchanged

• Simple operation “AddRoundKey” – unchanged

• Complex operation “SubBytes” will use a precomputed S-box table, stored in local memory used as a cache (__local)

• Complex operation “MixColumns” computes values on the fly instead of using precomputed tables (an optimized version of MixColumns is used)

• Host sample code – unchanged
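The MixColumns change in this list, table lookups replaced by computation, can be illustrated on a single column. The sketch below is an assumption about what the two variants might look like, not the authors' kernels: `MUL2` is a hypothetical precomputed table of multiplication by 2 in GF(2^8), and `mix_column_compute` uses the common shift-and-conditional-XOR (`xtime`) formulation.

```c
#include <stdint.h>

/* Multiply by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a >> 7) * 0x1b));
}

/* Table variant (Opt_1 style): 2*x is looked up, 3*x = 2*x ^ x. */
static uint8_t MUL2[256];

static void init_mul2(void)
{
    for (int x = 0; x < 256; x++)
        MUL2[x] = xtime((uint8_t)x);
}

static void mix_column_table(uint8_t c[4])
{
    uint8_t a0 = c[0], a1 = c[1], a2 = c[2], a3 = c[3];
    c[0] = (uint8_t)(MUL2[a0] ^ MUL2[a1] ^ a1 ^ a2 ^ a3);
    c[1] = (uint8_t)(a0 ^ MUL2[a1] ^ MUL2[a2] ^ a2 ^ a3);
    c[2] = (uint8_t)(a0 ^ a1 ^ MUL2[a2] ^ MUL2[a3] ^ a3);
    c[3] = (uint8_t)(MUL2[a0] ^ a0 ^ a1 ^ a2 ^ MUL2[a3]);
}

/* Computed variant (Opt_2 style): no memory reads, four xtime calls. */
static void mix_column_compute(uint8_t c[4])
{
    uint8_t a0 = c[0], a1 = c[1], a2 = c[2], a3 = c[3];
    uint8_t all = a0 ^ a1 ^ a2 ^ a3;
    c[0] = (uint8_t)(a0 ^ all ^ xtime(a0 ^ a1));
    c[1] = (uint8_t)(a1 ^ all ^ xtime(a1 ^ a2));
    c[2] = (uint8_t)(a2 ^ all ^ xtime(a2 ^ a3));
    c[3] = (uint8_t)(a3 ^ all ^ xtime(a3 ^ a0));
}
```

Both functions produce the same column; which one is faster depends on how expensive table reads are relative to a few ALU operations on the target device, which is the trade-off this slide is measuring.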


Page 9: AES on modern GPUs

Results Opt_2


• Profiling, Opt_1 – iGPU A4 4000, ~100 MB/s, AES-128 ECB

• Profiling, Opt_2 – iGPU A4 4000, ~210 MB/s, AES-128 ECB

Page 10: AES on modern GPUs

Algorithm Opt_3

• Array “indata” will reside in global device memory (__global)

• Variable “state”, which holds the intermediate cipher state, will be in GPU local memory used as a cache (__local)

• Simple operation “ShiftRows” - unchanged

• Simple operation “AddRoundKey” – unchanged

• Complex operation “SubBytes” – unchanged

• Complex operation “MixColumns” - unchanged

• Host sample code – overlap execution with I/O by creating multiple command queues (R, W, E: read, write, execute)
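The effect of overlapping write, execute and read can be estimated with a small timing model. This is an idealized sketch under stated assumptions, not a measurement and not the authors' host code: the per-chunk costs `w`, `e` and `r` are hypothetical inputs, each of the three queues is assumed to run in order, and the three stages are assumed to proceed fully in parallel with no contention for memory bandwidth. It therefore gives an upper bound on the benefit.

```c
/* Serial host loop: each chunk is written, executed and read back
   before the next chunk starts. */
static double serial_time(int chunks, double w, double e, double r)
{
    return chunks * (w + e + r);
}

/* Three in-order queues (W, E, R). A chunk's kernel waits for its own
   write, and its read waits for its own kernel; otherwise the stages of
   different chunks overlap freely. */
static double overlapped_time(int chunks, double w, double e, double r)
{
    double w_done = 0.0, e_done = 0.0, r_done = 0.0;
    for (int i = 0; i < chunks; i++) {
        w_done += w;
        e_done = (e_done > w_done ? e_done : w_done) + e;
        r_done = (r_done > e_done ? r_done : e_done) + r;
    }
    return r_done;
}
```

With 4 chunks and costs of 1, 2 and 1 time units, the serial loop takes 16 units and the overlapped schedule 10, which matches the usual pipeline estimate of (w + e + r) + (chunks - 1) * max(w, e, r). On an integrated GPU, where transfers and kernels share system memory, the real gain can be expected to be smaller than this bound.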


Page 11: AES on modern GPUs

Algorithm Opt_3 (2)


Page 12: AES on modern GPUs

Results Opt_3


• Figure on the right: AES-128 ECB throughput in MB/s, serial (Opt_2) vs. overlapped (Opt_3)

• Figure below: three OpenCL queues (R, W, E) are used for asynchronous enqueues in order to overlap execution with I/O

Page 13: AES on modern GPUs

Conclusions


iGPU AES performance is good (faster than the CPU without AES-NI, but the CPU with AES-NI is fastest)

Prefer cache over constant memory

Where possible, compare precomputed tables against computation on the fly

Overlapping execution with I/O could improve iGPU performance by 10-20%

The share of the x86 SoC die occupied by the iGPU increases with each generation, and its contribution to AES throughput will increase as well

Memory transfer speeds are expected to improve with each new generation, and CPU/iGPU performance with them