"using sgemm and ffts to accelerate deep learning," a presentation from arm
TRANSCRIPT
Using SGEMM and FFTs to Accelerate Deep Learning
Gian Marco Iodice, SW Engineer – ARM
May 3, 2016
Contents
• About ARM
• Convolutional Neural Networks (CNN)
• Architecture and building blocks
• Convolutional Layer
• SGEMM-based convolution
• FFT-based convolution
• SGEMM vs FFT
• Limited Numerical Precision for CNN
• Lessons Learned
ARM Ltd
• ARM Holdings plc is a British multinational semiconductor and software
design company (www.arm.com)
• Headquarters in Cambridge, England
Architecture and Building Blocks of CNN
• Convolutional layer (the core block of a CNN)
• Number of convolution kernels (filter bank)
• Filter shape (width, height and depth)
• Pooling layer (typical size 2x2)
• Non-linear gating (ReLU)
• Classifier: fully connected neural network
[Figure: a CNN as a learned, non-linear, trainable feature extractor feeding the classifier]
Why Are We Going to Study the Convolutional Layer?
Compute load for AlexNet inference*:
conv1: 16.9%
relu: 0.7%
pool: 1.0%
conv2: 21.9%
pool2: 0.7%
norm2: 0.5%
conv3: 17.8%
relu3: 0.2%
conv4: 17.8%
conv5: 17.7%
fc6: 1.8%
fc7: 0.8%
The convolutional layers dominate, accounting for over 90% of the compute load, so they are the layers worth optimizing.
*Learning Semantic Image Representations at a Large Scale, Yangqing Jia
From 2D Convolution to 3D Batched Convolution
• Most of the time, for the convolutional layers we have:
• Multiple input images
• Multiple convolution kernels (various dimensions and shapes)
• Multiple channels per image/kernel (not necessarily 3!)
[Figure: input image, kernels, output images]
Why don't we use a sliding-window approach? (A baseline sketch follows below.)
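To make that baseline concrete, here is a minimal plain-C sketch of a single-channel, stride-1 sliding-window convolution; the function name and the "valid" padding are my assumptions, not something from the slides. Batching this over many images, kernels and channels is exactly the workload the next slides reformulate.

#include <stddef.h>

/* Naive sliding-window 2D convolution (single channel, stride 1,
 * "valid" padding): out is (in_h - k_h + 1) x (in_w - k_w + 1). */
void conv2d_naive(const float *in, size_t in_h, size_t in_w,
                  const float *kernel, size_t k_h, size_t k_w,
                  float *out)
{
    size_t out_h = in_h - k_h + 1;
    size_t out_w = in_w - k_w + 1;

    for (size_t y = 0; y < out_h; y++) {
        for (size_t x = 0; x < out_w; x++) {
            float acc = 0.0f;
            /* Slide the kernel window over the input. */
            for (size_t ky = 0; ky < k_h; ky++)
                for (size_t kx = 0; kx < k_w; kx++)
                    acc += in[(y + ky) * in_w + (x + kx)]
                         * kernel[ky * k_w + kx];
            out[y * out_w + x] = acc;
        }
    }
}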
SGEMM-based Convolution
SGEMM: Single-Precision GEneral Matrix Multiply
C = α·A·B + β·C
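As a reminder of what SGEMM computes, a minimal row-major reference in plain C (serial, for clarity; the GPU kernels discussed next parallelize the two outer loops):

#include <stddef.h>

/* Reference SGEMM, C = alpha*A*B + beta*C, with row-major
 * A (MxK), B (KxN), C (MxN). */
void sgemm_ref(size_t M, size_t N, size_t K,
               float alpha, const float *A, const float *B,
               float beta, float *C)
{
    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; k++)
                acc += A[i * K + k] * B[k * N + j]; /* B walked with stride N */
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
}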
Im2col
• im2col stores in each row the pixels needed for each kernel application (see the sketch below)
• It costs in terms of memory requirements: pixels are duplicated wherever kernel windows overlap
• col2im restores the output image structure
[Figure: im2col rearranges the input image so that convolution becomes the matrix multiply C = A·B, producing the output images]
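A minimal single-channel, stride-1 sketch of im2col in plain C (the function and parameter names are mine): with the kernel flattened into a k_h*k_w vector, the convolution reduces to the SGEMM above.

#include <stddef.h>

/* im2col for a single-channel image: each output row collects the
 * k_h*k_w pixels covered by one kernel application (stride 1,
 * "valid" padding), so convolution becomes one matrix multiply. */
void im2col(const float *in, size_t in_h, size_t in_w,
            size_t k_h, size_t k_w, float *cols)
{
    size_t out_h = in_h - k_h + 1;
    size_t out_w = in_w - k_w + 1;
    size_t row = 0;

    for (size_t y = 0; y < out_h; y++) {
        for (size_t x = 0; x < out_w; x++, row++) {
            size_t col = 0;
            for (size_t ky = 0; ky < k_h; ky++)
                for (size_t kx = 0; kx < k_w; kx++, col++)
                    /* Overlapping windows duplicate pixels: this is
                     * where the extra memory goes. */
                    cols[row * (k_h * k_w) + col] =
                        in[(y + ky) * in_w + (x + kx)];
        }
    }
}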
SGEMM: Naïve Implementation
• Each thread computes a single element of the output matrix
• Not cache friendly: B is walked column-wise, with a stride of N elements between consecutive loads

/* First accumulation */
ai = load(addr_a);
bi = load(addr_b);
c0 += ai * bi;
/* Second accumulation: A advances by 1 (contiguous),
   but B advances by a full row of N elements */
ai = load(addr_a + 1);
bi = load(addr_b + 1 * N);
c0 += ai * bi;
...
store(c0, addr_c);
Transpose Matrix B
[Figure: Matrix B transposition]

/* First accumulation */
ai = load(addr_a);
bi = load(addr_b);
c00 += ai * bi;
/* Second accumulation: after transposing B, both A and B
   advance by 1, so both are read contiguously */
ai = load(addr_a + 1);
bi = load(addr_b + 1);
c00 += ai * bi;
...
store(c00, addr_c);

Speed-up achievable? Only ~1.1x
Transpose Matrix B in Chunks of 1x4 (I)
• Each thread computes 1x4 elements of the output matrix
• Still not cache friendly: between accumulations, the vload4 of B jumps by a full row of N elements

float4 out = 0.0f;
/* First accumulation */
ai = load(addr_a);
bi = vload4(addr_b);
out += (float4)ai * bi;
/* Second accumulation: B is still walked with a stride of N */
ai = load(addr_a + 1);
bi = vload4(addr_b + 1 * N);
out += (float4)ai * bi;
...
store4(out, addr_c);
Transpose Matrix B in Chunks of 1x4 (II)

float4 out = 0.0f;
/* First accumulation */
ai = load(addr_a);
bi = vload4(addr_b);
out += (float4)ai * bi;
/* Second accumulation: with the BT1x4 layout, B advances by
   just 4 elements, so consecutive loads are contiguous */
ai = load(addr_a + 1);
bi = vload4(addr_b + 4);
out += (float4)ai * bi;
...
store4(out, addr_c);

[Figure: Matrix B reshaped into Matrix BT1x4]
[Chart: SGEMM speed-up for N = 512, 1024, 2048, 4096 (A, B, C all NxN); speed-up achievable: ~3.5x]
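A plain-C sketch of this reshape, under my reading of the BT1x4 layout implied by the pseudocode (between accumulations, addr_b advances by 4 rather than by N); the function and parameter names are mine:

#include <stddef.h>

/* Reshape row-major B (KxN, N divisible by 4) so that each panel of
 * 4 columns is stored contiguously, row after row. Consecutive
 * accumulations then read B at addr_b and addr_b + 4 instead of
 * addr_b and addr_b + N. */
void reshape_b_1x4(const float *B, size_t K, size_t N, float *Bt)
{
    for (size_t j = 0; j < N; j += 4)      /* one panel of 4 columns */
        for (size_t k = 0; k < K; k++)     /* all K rows of the panel */
            for (size_t v = 0; v < 4; v++) /* 4 consecutive columns */
                *Bt++ = B[k * N + j + v];
}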
Reshaping Matrix A (I)
• We can do more: we can compute a block of 4x4 elements per thread in order to reuse the values loaded from Matrix A
[Figure: a 4x4 block of Matrix C computed from Matrix A and Matrix BT1x4]
Reshaping Matrix A (II)
• Matrix A is reshaped into interleaved chunks, where each chunk packs a block of 4 rows; e.g. an 8x8 Matrix A becomes a 2x32 Matrix AI (a sketch of the interleaving follows below)
[Chart: SGEMM speed-up for N = 512, 1024, 2048, 4096 (A, B, C all NxN); speed-up achievable: > 8.0x]
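A plain-C sketch of this interleaving, based on my reading of the slide's 8x8 to 2x32 example (names are mine):

#include <stddef.h>

/* Reshape row-major A (MxK, M divisible by 4) into chunks that
 * interleave blocks of 4 rows: for each column k, the 4 values
 * A[i..i+3][k] are stored back to back, so one contiguous read
 * feeds a whole 4x4 output block. An 8x8 A becomes a 2x32 AI. */
void reshape_a_4rows(const float *A, size_t M, size_t K, float *AI)
{
    for (size_t i = 0; i < M; i += 4)      /* one chunk of 4 rows */
        for (size_t k = 0; k < K; k++)     /* walk the columns */
            for (size_t r = 0; r < 4; r++) /* interleave the 4 rows */
                *AI++ = A[(i + r) * K + k];
}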
FFT-based Convolution
• Convolution in the spatial domain is equivalent to an element-wise multiplication in the frequency domain
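Formally, with $\mathcal{F}$ the 2D DFT and $\odot$ element-wise multiplication, the convolution theorem gives:

$$x * k = \mathcal{F}^{-1}\big(\mathcal{F}(x) \odot \mathcal{F}(k)\big)$$

Note that the DFT implements circular convolution, so image and kernel are zero-padded to a common size before the transforms.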
From Radix-2 to Mixed-Radix
• The best-known FFT is the radix-2 Cooley–Tukey algorithm (it requires N to be a power of two: N = 2 x 2 x 2 x …)
• In general, though, any factorization of N is possible (N = N1 x N2 x N3 x …)
• Mixed-radix is the generalization of the basic radix-2 FFT to such factorizations
• Over 1.5x better performance than radix-2
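For reference, the standard Cooley–Tukey decomposition for $N = N_1 N_2$, with input index $n = N_2 n_1 + n_2$ and output index $k = k_1 + N_1 k_2$:

$$X_{k_1 + N_1 k_2} = \sum_{n_2=0}^{N_2-1} \left[ e^{-2\pi i\, n_2 k_1 / N} \left( \sum_{n_1=0}^{N_1-1} x_{N_2 n_1 + n_2}\, e^{-2\pi i\, n_1 k_1 / N_1} \right) \right] e^{-2\pi i\, n_2 k_2 / N_2}$$

The inner sums are $N_1$-point DFTs, followed by twiddle-factor multiplications and $N_2$-point DFTs; applying this step recursively to any factorization of N (e.g. radix-4 and radix-3 stages for N = 12) gives the mixed-radix FFT.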
FFT Implementation
• Recursive in-place FFT computation*
• Each thread computes a single radix-N stage (floating-point computation)
• Block-wise 2x2 in-place transposition
• ~2x better performance than 2x2 out-of-place transposition
• Out-of-place batched convolution
• High memory requirements, as we have to keep the frequency-domain representation of:
1. The input image
2. The convolution kernels
3. The results of the convolutions
* https://community.arm.com/groups/arm-mali-graphics/blog/2016/02/01/speeding-up-fast-fourier-transform-mixed-radix-on-mobile-arm-mali-gpu-by-means-of-opencl-part-2
SGEMM vs FFT (I)
SGEMM-based convolution:
• High memory requirements due to im2col, especially with:
• stride < kernel dimension
• large convolution kernels
• large input images
FFT-based convolution:
• No efficient way to handle stride != 1
• High memory requirements for batched convolutions
• Can require considerable effort to optimize well
SGEMM vs FFT (II)
• Study limited to the inference problem
• Stride x = 1 and stride y = 1
• Number of channels = 1
• Pre-computed FFTs for the convolution kernels
Case 1: 1 input image, 64/128/256 convolution kernels
Case 2: 64 input images, 32 convolution kernels
[Charts: SGEMM vs FFT as a function of image size and of kernel size / number of convolutions]
And what happens using stride x = 2?
Limited Numerical Precision for CNN (I)
• Some papers ([1], [2]) have demonstrated the feasibility of using limited numerical precision for CNNs
• This opens an interesting computational scenario if, for instance, the HW has accelerators for 16-bit half-precision floating point (see the sketch after the references):
• Performance boost
• Reduced memory traffic to/from external memory
• Possible to dispatch fewer threads
• Energy saving, essentially due to the reduced memory traffic to/from the external memory
[1] Training Deep Neural Networks with Low Precision Multiplications, Matthieu Courbariaux, Jean-Pierre David
[2] Deep Learning with Limited Numerical Precision, Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, Pritish Narayanan
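As an illustration of the storage side of this, a minimal sketch in plain C, assuming a toolchain with the ARM __fp16 extension (e.g. GCC or Clang targeting AArch64); this is my example, not code from the talk:

#include <stddef.h>

/* FP16 storage with FP32 accumulation: halving the element size
 * halves the traffic to/from external memory, while accumulating
 * in float limits the rounding error. Hardware FP16 arithmetic
 * (or OpenCL 'half' via cl_khr_fp16) can push performance further. */
float dot_fp16(const __fp16 *a, const __fp16 *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += (float)a[i] * (float)b[i]; /* __fp16 promotes to float */
    return acc;
}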
Limited Numerical Precision for CNN (II)
[Charts: FP16 speed-up over FP32 for N = 512, 1024, 2048, 4096 (A, B, C all NxN)]
SGEMM speed-up: > 2.0x (it is possible to dispatch fewer threads, i.e. 8x4 elements per thread)
FFT speed-up: > 1.5x (we cannot dispatch fewer threads: each thread computes a single radix-N stage)
Lessons Learned
1. A cache-efficient data layout has a huge impact on the performance of our algorithms, in GPU computing too
2. Simple changes in data layout make it possible to:
• dispatch fewer threads
• better exploit vector instructions
3. Limited numerical precision plays a crucial role IF it is HW accelerated
4. Convolution is an embarrassingly parallel task which can be easily and efficiently accelerated on mobile GPUs by means of OpenCL
Question Time
Thank you!
The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or
elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.
Copyright © 2016 ARM Limited