"using sgemm and ffts to accelerate deep learning," a presentation from arm
TRANSCRIPT
Using SGEMM and FFTs to Accelerate Deep Learning
Gian Marco Iodice, SW Engineer – ARM
May 3, 2016
Contents
• About ARM
• Convolutional Neural Networks (CNN)
• Architecture and building blocks
• Convolutional Layer
• SGEMM-based convolution
• FFT-based convolution
• SGEMM vs FFT
• Limited Numerical Precision for CNN
• Lessons Learned
ARM Ltd
• ARM Holdings plc is a British multinational semiconductor and software
design company (www.arm.com)
• Headquarters in Cambridge, England
Architecture and Building Blocks of CNN
• Convolutional layer (the core block of a CNN)
• Number of convolution kernels (filter bank)
• Filter shape (width, height and depth)
• Pooling layer (typical size 2x2)
• Non-linear gating (ReLU)
• Classifier: fully connected neural network
[Figure: a CNN as a learned, non-linear, trainable feature extractor feeding the classifier]
Why Are We Going to Study the Convolutional Layer?
Compute load for AlexNet inference*:
conv1: 16.9%
relu: 0.7%
pool: 1.0%
conv2: 21.9%
pool2: 0.7%
norm2: 0.5%
conv3: 17.8%
relu3: 0.2%
conv4: 17.8%
conv5: 17.7%
fc6: 1.8%
fc7: 0.8%
The convolutional layers dominate, accounting for over 90% of the compute load, so they are the layers worth optimizing.
*Learning Semantic Image Representations at a Large Scale, Yangqing Jia
From 2D Convolution to 3D Batched Convolution
• Most of the time, for the convolutional layers we have:
• Multiple input images
• Multiple convolution kernels (various dimensions and shapes)
• Multiple channels per image/kernel (not necessarily 3!)
[Figure: input image, kernels, output images]
Why don't we use a sliding-window approach? (A baseline sketch follows below.)
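To make that baseline concrete, here is a minimal plain-C sketch of a single-channel, stride-1 sliding-window convolution; the function name and the "valid" padding are my assumptions, not something from the slides. Batching this over many images, kernels and channels is exactly the workload the next slides reformulate.

#include <stddef.h>

/* Naive sliding-window 2D convolution (single channel, stride 1,
 * "valid" padding): out is (in_h - k_h + 1) x (in_w - k_w + 1). */
void conv2d_naive(const float *in, size_t in_h, size_t in_w,
                  const float *kernel, size_t k_h, size_t k_w,
                  float *out)
{
    size_t out_h = in_h - k_h + 1;
    size_t out_w = in_w - k_w + 1;

    for (size_t y = 0; y < out_h; y++) {
        for (size_t x = 0; x < out_w; x++) {
            float acc = 0.0f;
            /* Slide the kernel window over the input. */
            for (size_t ky = 0; ky < k_h; ky++)
                for (size_t kx = 0; kx < k_w; kx++)
                    acc += in[(y + ky) * in_w + (x + kx)]
                         * kernel[ky * k_w + kx];
            out[y * out_w + x] = acc;
        }
    }
}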
SGEMM-based Convolution
SGEMM: Single-Precision GEneral Matrix Multiply
C = α·A·B + β·C
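As a reminder of what SGEMM computes, a minimal row-major reference in plain C (serial, for clarity; the GPU kernels discussed next parallelize the two outer loops):

#include <stddef.h>

/* Reference SGEMM, C = alpha*A*B + beta*C, with row-major
 * A (MxK), B (KxN), C (MxN). */
void sgemm_ref(size_t M, size_t N, size_t K,
               float alpha, const float *A, const float *B,
               float beta, float *C)
{
    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; k++)
                acc += A[i * K + k] * B[k * N + j]; /* B walked with stride N */
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
}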
Im2col
• im2col stores in each row the pixels needed for each kernel application (see the sketch below)
• It costs in terms of memory requirements: pixels are duplicated wherever kernel windows overlap
• col2im restores the output image structure
[Figure: im2col rearranges the input image so that convolution becomes the matrix multiply C = A·B, producing the output images]
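A minimal single-channel, stride-1 sketch of im2col in plain C (the function and parameter names are mine): with the kernel flattened into a k_h*k_w vector, the convolution reduces to the SGEMM above.

#include <stddef.h>

/* im2col for a single-channel image: each output row collects the
 * k_h*k_w pixels covered by one kernel application (stride 1,
 * "valid" padding), so convolution becomes one matrix multiply. */
void im2col(const float *in, size_t in_h, size_t in_w,
            size_t k_h, size_t k_w, float *cols)
{
    size_t out_h = in_h - k_h + 1;
    size_t out_w = in_w - k_w + 1;
    size_t row = 0;

    for (size_t y = 0; y < out_h; y++) {
        for (size_t x = 0; x < out_w; x++, row++) {
            size_t col = 0;
            for (size_t ky = 0; ky < k_h; ky++)
                for (size_t kx = 0; kx < k_w; kx++, col++)
                    /* Overlapping windows duplicate pixels: this is
                     * where the extra memory goes. */
                    cols[row * (k_h * k_w) + col] =
                        in[(y + ky) * in_w + (x + kx)];
        }
    }
}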
SGEMM: Naïve Implementation
• Each thread computes a single element of the output matrix
• Not cache friendly: B is walked column-wise, with a stride of N elements between consecutive loads

/* First accumulation */
ai = load(addr_a);
bi = load(addr_b);
c0 += ai * bi;
/* Second accumulation: A advances by 1 (contiguous),
   but B advances by a full row of N elements */
ai = load(addr_a + 1);
bi = load(addr_b + 1 * N);
c0 += ai * bi;
...
store(c0, addr_c);
Transpose Matrix B
[Figure: Matrix B transposition]

/* First accumulation */
ai = load(addr_a);
bi = load(addr_b);
c00 += ai * bi;
/* Second accumulation: after transposing B, both A and B
   advance by 1, so both are read contiguously */
ai = load(addr_a + 1);
bi = load(addr_b + 1);
c00 += ai * bi;
...
store(c00, addr_c);

Speed-up achievable? Only ~1.1x
Transpose Matrix B in Chunks of 1x4 (I)
• Each thread computes 1x4 elements of the output matrix
• Still not cache friendly: between accumulations, the vload4 of B jumps by a full row of N elements

float4 out = 0.0f;
/* First accumulation */
ai = load(addr_a);
bi = vload4(addr_b);
out += (float4)ai * bi;
/* Second accumulation: B is still walked with a stride of N */
ai = load(addr_a + 1);
bi = vload4(addr_b + 1 * N);
out += (float4)ai * bi;
...
store4(out, addr_c);
Transpose Matrix B in Chunks of 1x4 (II)

float4 out = 0.0f;
/* First accumulation */
ai = load(addr_a);
bi = vload4(addr_b);
out += (float4)ai * bi;
/* Second accumulation: with the BT1x4 layout, B advances by
   just 4 elements, so consecutive loads are contiguous */
ai = load(addr_a + 1);
bi = vload4(addr_b + 4);
out += (float4)ai * bi;
...
store4(out, addr_c);

[Figure: Matrix B reshaped into Matrix BT1x4]
[Chart: SGEMM speed-up for N = 512, 1024, 2048, 4096 (A, B, C all NxN); speed-up achievable: ~3.5x]
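A plain-C sketch of this reshape, under my reading of the BT1x4 layout implied by the pseudocode (between accumulations, addr_b advances by 4 rather than by N); the function and parameter names are mine:

#include <stddef.h>

/* Reshape row-major B (KxN, N divisible by 4) so that each panel of
 * 4 columns is stored contiguously, row after row. Consecutive
 * accumulations then read B at addr_b and addr_b + 4 instead of
 * addr_b and addr_b + N. */
void reshape_b_1x4(const float *B, size_t K, size_t N, float *Bt)
{
    for (size_t j = 0; j < N; j += 4)      /* one panel of 4 columns */
        for (size_t k = 0; k < K; k++)     /* all K rows of the panel */
            for (size_t v = 0; v < 4; v++) /* 4 consecutive columns */
                *Bt++ = B[k * N + j + v];
}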
Reshaping Matrix A (I)
• We can do more: we can compute a block of 4x4 elements per thread in order to reuse the values loaded from Matrix A
[Figure: a 4x4 block of Matrix C computed from Matrix A and Matrix BT1x4]
Reshaping Matrix A (II)
• Matrix A is reshaped into interleaved chunks, where each chunk packs a block of 4 rows; e.g. an 8x8 Matrix A becomes a 2x32 Matrix AI (a sketch of the interleaving follows below)
[Chart: SGEMM speed-up for N = 512, 1024, 2048, 4096 (A, B, C all NxN); speed-up achievable: > 8.0x]
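A plain-C sketch of this interleaving, based on my reading of the slide's 8x8 to 2x32 example (names are mine):

#include <stddef.h>

/* Reshape row-major A (MxK, M divisible by 4) into chunks that
 * interleave blocks of 4 rows: for each column k, the 4 values
 * A[i..i+3][k] are stored back to back, so one contiguous read
 * feeds a whole 4x4 output block. An 8x8 A becomes a 2x32 AI. */
void reshape_a_4rows(const float *A, size_t M, size_t K, float *AI)
{
    for (size_t i = 0; i < M; i += 4)      /* one chunk of 4 rows */
        for (size_t k = 0; k < K; k++)     /* walk the columns */
            for (size_t r = 0; r < 4; r++) /* interleave the 4 rows */
                *AI++ = A[(i + r) * K + k];
}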
FFT-based Convolution
• Convolution in the spatial domain is equivalent to an element-wise multiplication in the frequency domain
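Formally, with $\mathcal{F}$ the 2D DFT and $\odot$ element-wise multiplication, the convolution theorem gives:

$$x * k = \mathcal{F}^{-1}\big(\mathcal{F}(x) \odot \mathcal{F}(k)\big)$$

Note that the DFT implements circular convolution, so image and kernel are zero-padded to a common size before the transforms.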
From Radix-2 to Mixed-Radix
• The best-known FFT is the radix-2 Cooley–Tukey algorithm (it requires N to be a power of two: N = 2 x 2 x 2 x …)
• In general, though, any factorization of N is possible (N = N1 x N2 x N3 x …)
• Mixed-radix is the generalization of the basic radix-2 FFT to such factorizations
• Over 1.5x better performance than radix-2
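For reference, the standard Cooley–Tukey decomposition for $N = N_1 N_2$, with input index $n = N_2 n_1 + n_2$ and output index $k = k_1 + N_1 k_2$:

$$X_{k_1 + N_1 k_2} = \sum_{n_2=0}^{N_2-1} \left[ e^{-2\pi i\, n_2 k_1 / N} \left( \sum_{n_1=0}^{N_1-1} x_{N_2 n_1 + n_2}\, e^{-2\pi i\, n_1 k_1 / N_1} \right) \right] e^{-2\pi i\, n_2 k_2 / N_2}$$

The inner sums are $N_1$-point DFTs, followed by twiddle-factor multiplications and $N_2$-point DFTs; applying this step recursively to any factorization of N (e.g. radix-4 and radix-3 stages for N = 12) gives the mixed-radix FFT.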
FFT Implementation
• Recursive in-place FFT computation*
• Each thread computes a single radix-N stage (floating-point computation)
• Block-wise 2x2 in-place transposition
• ~2x better performance than 2x2 out-of-place transposition
• Out-of-place batched convolution
• High memory requirements, as we have to keep the frequency-domain representation of:
1. The input image
2. The convolution kernels
3. The results of the convolutions
* https://community.arm.com/groups/arm-mali-graphics/blog/2016/02/01/speeding-up-fast-fourier-transform-mixed-radix-on-mobile-arm-mali-gpu-by-means-of-opencl-part-2
SGEMM vs FFT (I)
SGEMM-based convolution:
• High memory requirements due to im2col, especially with:
• stride < kernel dimension
• large convolution kernels
• large input images
FFT-based convolution:
• No efficient way to handle stride != 1
• High memory requirements for batched convolutions
• Can require considerable effort to optimize well
SGEMM vs FFT (II)
• Study limited to the inference problem
• Stride x = 1 and stride y = 1
• Number of channels = 1
• Pre-computed FFTs for the convolution kernels
Case 1: 1 input image, 64/128/256 convolution kernels
Case 2: 64 input images, 32 convolution kernels
[Charts: SGEMM vs FFT as a function of image size and of kernel size / number of convolutions]
And what happens using stride x = 2?
Limited Numerical Precision for CNN (I)
• Some papers ([1], [2]) have demonstrated the feasibility of using limited numerical precision for CNNs
• This opens an interesting computational scenario if, for instance, the HW has accelerators for 16-bit half-precision floating point (see the sketch after the references):
• Performance boost
• Reduced memory traffic to/from external memory
• Possible to dispatch fewer threads
• Energy saving, essentially due to the reduced memory traffic to/from the external memory
[1] Training Deep Neural Networks with Low Precision Multiplications, Matthieu Courbariaux, Jean-Pierre David
[2] Deep Learning with Limited Numerical Precision, Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, Pritish Narayanan
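As an illustration of the storage side of this, a minimal sketch in plain C, assuming a toolchain with the ARM __fp16 extension (e.g. GCC or Clang targeting AArch64); this is my example, not code from the talk:

#include <stddef.h>

/* FP16 storage with FP32 accumulation: halving the element size
 * halves the traffic to/from external memory, while accumulating
 * in float limits the rounding error. Hardware FP16 arithmetic
 * (or OpenCL 'half' via cl_khr_fp16) can push performance further. */
float dot_fp16(const __fp16 *a, const __fp16 *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += (float)a[i] * (float)b[i]; /* __fp16 promotes to float */
    return acc;
}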
Limited Numerical Precision for CNN (II)
[Charts: FP16 speed-up over FP32 for N = 512, 1024, 2048, 4096 (A, B, C all NxN)]
SGEMM speed-up: > 2.0x (it is possible to dispatch fewer threads, i.e. 8x4 elements per thread)
FFT speed-up: > 1.5x (we cannot dispatch fewer threads: each thread computes a single radix-N stage)
Lessons Learned
1. A cache-efficient data layout has a huge impact on the performance of our algorithms, in GPU computing too
2. Simple changes in data layout make it possible to:
• dispatch fewer threads
• better exploit vector instructions
3. Limited numerical precision plays a crucial role IF it is HW accelerated
4. Convolution is an embarrassingly parallel task which can be easily and efficiently accelerated on mobile GPUs by means of OpenCL
Question Time
Thank you!
The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or
elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.
Copyright © 2016 ARM Limited