2011 IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS), Tainan, Taiwan, 7-9 December 2011
Automatic FFT Performance Tuning on OpenCL GPUs
Yan Li∗†‡, Yunquan Zhang∗†, Haipeng Jia§, Guoping Long∗ and Ke Wang∗
∗Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
†State Key Lab. of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
‡Graduate University of Chinese Academy of Sciences, Beijing, China
§School of Information Science and Engineering, Ocean University of China, Qingdao, China
Email: [email protected], [email protected], {jiahaipeng95, longguoping, wangkehpc}@gmail.com
Abstract—Many fields of science and engineering, such as astronomy, medical imaging, seismology and spectroscopy, have been revolutionized by Fourier methods. The fast Fourier transform (FFT) is an efficient algorithm to compute the discrete Fourier transform (DFT) and its inverse. The emerging class of high performance computing architectures, such as GPUs, seeks to achieve much higher performance and efficiency by exposing a hierarchy of distinct memories to programmers. However, the complexity of GPU programming poses a significant challenge for programmers. In this paper, based on the Kronecker product form of multi-dimensional FFTs, we propose an automatic performance tuning framework for various OpenCL GPUs. Several key techniques of GPU programming on AMD and NVIDIA GPUs are also identified. Our OpenCL FFT library achieves 1.5 to 4 times, 1.5 to 40 times and 1.4 times the performance of clAmdFft 1.0 for 1D, 2D and 3D FFTs respectively on an AMD GPU, and its overall performance is within 90% of CUFFT 4.0 on two NVIDIA GPUs.
Keywords-FFT; DFT; GPU; OpenCL; Auto-tuning
I. INTRODUCTION
The fast Fourier transform (FFT) is one of the most
widely used algorithms for scientific and engineering com-
putation especially in the fields of signal processing, image
processing and data compression. Various algorithms have
been proposed for solving FFTs efficiently since 1965 [1].
However, the FFT is only a good starting point if an efficient
implementation exists for the architecture at hand. The gap
between processor performance and memory latency has widened
steadily over the last decade, and the increasingly complex
memory systems built to ameliorate this gap have made it ever
harder for compilers to optimize arbitrary code within an
acceptable amount of time.
Hardware accelerators such as GPUs are promising
platforms for general-purpose high-performance computing.
However, their explicitly managed memory hierarchies pose a
significant challenge for programmers. Due to the architectural
characteristics of the GPU, it is necessary to structure
algorithms so that data is placed as close as possible to the
processing elements. This requires applications to explicitly
orchestrate all data transfers and maintain data coherence
across the memory hierarchy. Traditional programming
approaches for multi-core CPUs and GPUs are very different.
CPU-based parallel programming models typically assume a
shared address space and do not address data placement.
General-purpose GPU programming models address complex memory
hierarchies and vector operations, but are traditionally
platform- or hardware-specific. These limitations and
differences make it difficult to harness the compute power of
heterogeneous CPUs, GPUs and other types of processors from a
single, multi-platform source code base. Thus, it becomes ever
more complicated to build algorithms that are able to utilize
modern computer systems to a satisfactory degree.
Only sophisticated techniques in both hardware architecture
and software development can overcome these difficulties.
Algorithms that were optimized for a specific architecture
several years ago fail to perform well on current and emerging
architectures. Due to the fast product cycles in hardware
development and the complexity of today's execution
environments, it is of utmost importance to provide users with
easy-to-use, self-adapting numerical software.
This paper makes three major contributions listed as
follows:
1) High performance multi-dimensional FFT based on
the Kronecker product is implemented in OpenCL on
GPUs.
2) An auto-tuning framework to optimize 3D FFT al-
gorithms on multiple platforms with OpenCL API is
proposed.
3) The architectural differences between AMD and NVIDIA
GPUs with respect to OpenCL programming are analyzed.
The principal idea of our work is that the movement and
placement of data across the levels of the memory hierarchy
is put under efficient, explicit control by our auto-tuning
framework, which ensures high occupancy of GPU resources.
The remainder of the paper is organized as follows. We
provide an overview of the OpenCL programming
2011 IEEE 17th International Conference on Parallel and Distributed Systems
1521-9097/11 $26.00 © 2011 IEEE. DOI 10.1109/ICPADS.2011.32
model and establish the principal rationale and approach
for efficient mapping of FFT algorithms onto GPUs in
Section II. Our auto-tuning framework is elaborated and
demonstrated in Section III. Section IV presents the perfor-
mance evaluation of our library in detail. After discussing
related work in Section V, we propose some ideas for future
work with a brief summary.
II. BACKGROUND
A. OpenCL Programming Model
OpenCL (Open Computing Language) [2] is an open,
widely accepted standard for heterogeneous parallel computing.
It defines a uniform programming model that provides access to
various processing units, referred to as devices, which include
GPUs, multi-core CPUs and the Cell BE. An OpenCL device is most
easily defined as a collection of compute units (CUs), each
containing multiple processing elements (PEs) whose
functionality corresponds to a streaming processor core (CUDA
core in the Fermi architecture) on NVIDIA GPUs or a stream core
on ATI GPUs. The PEs execute the computation commands submitted
by an OpenCL application. All PEs within a CU execute a single
stream of instructions as single instruction multiple data
(SIMD) or as SPMD.
OpenCL provides three key abstractions: a hierarchy of
thread groups, local memories, and barrier synchronization.
An OpenCL application consists of a host program, which
executes on the host processor, and kernels, which are
functions accelerated on a compute device through the OpenCL
API. The host program provides command queues for performing
computation in-order or out-of-order on the PEs, and also
defines a multi-dimensional abstract index space. Each point
within this index space is associated with an execution
instance of the kernel, called a work-item. Work-items are
further grouped into work-groups. Each work-group is assigned
to a CU for execution; all threads in a work-group can
cooperate, whereas threads from different work-groups cannot.
Furthermore, work-items within a work-group can be synchronized
using barriers or memory fences. The block of work-items that
are executed together is called a wavefront/warp, and the
number of work-items in one wavefront/warp is the
wavefront/warp size. If work-items within a wavefront/warp
diverge, for instance at a branch, all execution paths are
executed serially. This phenomenon, called thread
divergence [10], can degrade performance greatly.
B. Representation of DFT
In fact, the FFT has been listed among the top ten
algorithms of the 20th century, and considerable research
effort has been devoted to optimizing FFT codes over the past
four decades. The algorithm presented by Cooley and Tukey [3]
reduced the complexity of computing the naive DFT from
$O(n^2)$ to $O(n \log n)$, which is viewed as a turning point
for applications of the Fourier transform.
The DFT of a sequence $x = x_0, \ldots, x_{n-1}$ is defined in
summation form as follows:

$$y_j = (\mathrm{DFT}_n\, x)_j = \sum_{k=0}^{n-1} \omega_n^{jk} x_k, \qquad (1)$$

where $j \in [0, n-1]$ and $\omega_n = e^{-2\pi i / n}$.
The DFT can be represented in many different forms; for
instance, the butterfly is extracted from a signal flow graph
implementing an FFT algorithm for simplicity. In this paper, we
adopt the Kronecker product formalism to design and implement
FFT algorithms. Its properties facilitate verification of the
correctness of the implementation and code generation.
Furthermore, it helps us identify GPU kernel specifications and
optimize performance.
1) Basic properties of the Kronecker product: If $A$ is an
$m \times n$ matrix and $B$ is a $p \times q$ matrix, then the
Kronecker product of $A$ and $B$ is the $mp \times nq$ matrix
denoted by $A \otimes B$ and defined by

$$A \otimes B = [a_{ij} B]_{i,j} \qquad (2)$$

with $A = [a_{ij}]_{i,j}$.

The direct sum of $A$ and $B$ is the $(m+p) \times (n+q)$
matrix denoted by $A \oplus B$ and defined by

$$A \oplus B = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}, \qquad (3)$$

where the 0's denote blocks of zeros of appropriate size.
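As a concrete illustration of Eq. (2), the Kronecker product of two small row-major matrices can be formed directly (a sketch; the function name and storage convention are our own):

```c
/* Kronecker product C = A (x) B of an m x n matrix A and a p x q
 * matrix B, both row-major; C is the mp x nq matrix of Eq. (2). */
static void kron(int m, int n, int p, int q,
                 const double *A, const double *B, double *C)
{
    int cols = n * q;                      /* columns of C */
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            for (int r = 0; r < p; r++)
                for (int s = 0; s < q; s++)
                    /* block (i,j) of C is a_{ij} * B */
                    C[(i * p + r) * cols + (j * q + s)] =
                        A[i * n + j] * B[r * q + s];
}
```

For example, $I_2 \otimes B$ places two copies of $B$ on the diagonal, which is exactly the block structure exploited by the breakdown rules below.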
The mathematical identities for FFT algorithms can be obtained
by referring to [4][5][6]. Historically, FFT algorithms were
obtained by applying breakdown rules recursively and
manipulating the resulting formulas to obtain the respective
iterations. The rule we use is expressed in Eq. 4; it is
applied to exhibit the parallel Kronecker product structure,
which leads to the so-called six-step or parallel FFT
algorithm. A more detailed description of FFT algorithms and
their variants can be found in [1][4], and the notation we use
here mostly coincides with that of [4][5].
$$\mathrm{DFT}_{n_2 n_1} = L^{n_2 n_1}_{n_2} (I_{n_1} \otimes \mathrm{DFT}_{n_2}) L^{n_2 n_1}_{n_1} T^{n_2 n_1}_{n_1} (I_{n_2} \otimes \mathrm{DFT}_{n_1}) L^{n_2 n_1}_{n_2} \qquad (4)$$
The twiddle factor matrix, denoted by $T^{n_1 n_2}_{n_1}$, is
the diagonal matrix defined by

$$T^{n_1 n_2}_{n_1} = \bigoplus_{i=0}^{n_2-1} \bigoplus_{j=0}^{n_1-1} \omega_{n_1 n_2}^{ij} = \mathrm{diag}(I_{n_1}, \Omega_{n_1 n_2, n_1}, \ldots, \Omega_{n_1 n_2, n_1}^{n_2-1}) = \bigoplus_{j=0}^{n_2-1} \mathrm{diag}(1, \omega_{n_1 n_2}^{j}, \ldots, \omega_{n_1 n_2}^{j(n_1-1)}), \qquad (5)$$

and moreover, the notation $L^{mn}_{n}$ denotes the stride
permutation, which reorders a vector of size $mn$ by loading it
into $n$ segments at stride $n$.
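Under one common convention, $L^{mn}_{n}$ amounts to transposing the input viewed as an $m \times n$ row-major array, i.e. gathering elements at stride $n$; the paper's indexing may differ in detail. A C sketch under that assumption:

```c
/* Stride permutation L^{mn}_n: view x as an m x n row-major array
 * and emit its transpose, i.e. gather the input at stride n.
 * One common convention, used here for illustration. */
static void stride_perm(int m, int n, const double *x, double *y)
{
    for (int i = 0; i < n; i++)        /* row of the output    */
        for (int j = 0; j < m; j++)    /* column of the output */
            y[i * m + j] = x[j * n + i];
}
```

This transpose view is why stride permutations show up as the data-shuffling steps between the computation stages of the six-step algorithm.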
2) Representation of FFT on GPU: By applying formula
identities of the Kronecker product, we derive breakdown rules
for FFT algorithms on a GPU architecture with $p$ threads:

$$L^{n_2 n_1}_{n_2} = (I_p \otimes (L^{n_2}_{n_2/p} \otimes I_{n_1/p}))(L^{p^2}_{p} \otimes I_{n_2 n_1 / p^2})(I_p \otimes L^{n_2 n_1 / p}_{n_2}) \qquad (6)$$

$$T^{n_1 n_2}_{n_1} = \mathrm{diag}(I_{n_1}, \Omega_{n_1 n_2, n_1}, \ldots, \Omega_{n_1 n_2, n_1}^{n_2-1}) = \bigoplus_{i_1=0}^{p-1} \bigoplus_{i_2=0}^{n_2/p-1} \Omega_{n_1 n_2,\, i_1 n_2/p + i_2}(\omega_{n_1 n_2}) \qquad (7)$$

$$I_{n_1} \otimes \mathrm{DFT}_{n_2} = I_p \otimes (I_{n_1/p} \otimes \mathrm{DFT}_{n_2}), \qquad I_{n_2} \otimes \mathrm{DFT}_{n_1} = I_p \otimes (I_{n_2/p} \otimes \mathrm{DFT}_{n_1}) \qquad (8)$$
The row-column algorithm is obtained by applying the
definition of the 2D DFT and properties of the Kronecker
product (Eq. 9). Higher-dimensional DFT algorithms are derived
similarly.

$$\mathrm{DFT}_{m \times n} = (I_m \otimes \mathrm{DFT}_n)(\mathrm{DFT}_m \otimes I_n) = (I_m \otimes \mathrm{DFT}_n) L^{mn}_{m} (I_n \otimes \mathrm{DFT}_m) L^{mn}_{n} \qquad (9)$$
Due to the differing features of GPU architectures, variants of
the FFT algorithm and its implementation are required. We
choose the Stockham algorithm, in which the explicit
digit-reversing computations required by the Cooley-Tukey
process are avoided by performing a multi-dimensional transpose
in each step [4][7]. For a given size and dimension, we first
partition the FFT into multiple dimensions by considering the
resources available on the GPU, such as the number of registers
and the on-chip local memory size, and then transform each
dimension. The threads first load data from global memory into
registers to compute the FFT along each dimension, then shuffle
data through local memory for the next computation step. The
overall structure of the FFT kernel is presented in Fig. 1.
1) Treating the N -point FFT as a multi-dimensional
array
2) Load data from global memory into registers
3) Using local memory to perform FFT along each
dimension with multiple batches
a) Compute small-point FFT in registers and scale
with twiddle factors
b) Store and transpose data via local memory
c) Load data from local memory and continue to
perform small-point FFT in registers
4) Write back data to global memory
Figure 1: Overall structure of the FFT kernel on GPUs
III. ADAPTIVE OPTIMIZATION FRAMEWORK
As a result of the proliferation of multi-core processors and
GPUs, many traditionally used numerical algorithms are being
reconsidered. Issues that were previously very important, such
as conserving memory, are no longer significant, and other
issues have come to the fore. One issue that is becoming
increasingly important for a wide range of new systems is the
memory access pattern, especially the memory stride, which can
result in significantly decreased performance. Furthermore,
GPU applications are input-sensitive, and hence developing an
efficient GPU program remains as challenging as before, if not
more so.
The number of threads in a work-group is limited by the number
of registers and the size of local memory on a CU. However,
reducing the work-group size decreases occupancy, which is a
measure of how well a GPU utilizes its resources. Hence, GPU
resources should be managed appropriately to ensure high
occupancy; the resources that primarily impact occupancy are
registers, local memory and global memory. Fig. 4 presents our
auto-tuning framework: we achieve high utilization of thread
resources as well as memory resources, satisfy the occupancy
requirements as far as possible and, with coalesced memory
access in the kernel, achieve high performance.
A. Optimization of global memory access
Arranging global memory transfers in a coalesced ap-
proach is helpful to achieve close to the peak bandwidth
of memory transfer. Coalesced memory access makes what
could be several single memory transactions into one single
memory transaction. There are differences among CUDA
capable GPUs, classified by what is called compute capa-bility[10]. Devices with compute capability 1.0 and 1.1 are
more restricted than that of 1.2 or higher when it comes
to coalesced memory access. The compute capability of
Tesla C1060 and Tesla C2050 are 1.3 and 2.0 respectively.
Furthermore, the global memory space is not cached on
Tesla C1060, and a request of that for a warp is split in
two memory requests, one for each half-warp. However,
the Tesla C2050 are based on the new Fermi architecture
that all accesses to global memory go through L2 cache,
including copies to/from CPU host. There are two types
of loads in global memory access, the caching (the default
mode) and no-caching, the load granularity of the former is
128-byte and that of the later is just 32-byte, and moreover,
the memory operations are issued per warp not half-warp.
We show the performance of global memory copies with
different strides and offsets in Fig.2 (the word accessed by a
thread is 8-byte wide). The unaligned starting addresses and
discontinuous region accesses degrade performance greatly.
Furthermore, we evaluate the efficiency of global memory
access with different vector lengths on both AMD GPU and
NVIDIA GPU. We can see that there is very significant
impact on performance with various vector lengths.
[Figure 2: Performance (GBytes/s) of global memory copies as a function of stride and starting-address offset on (a) ATI 5850 GPU, (b) NVIDIA Tesla C1060 GPU, (c) NVIDIA Tesla C2050 GPU.]
[Figure 3: Global memory bandwidth access efficiency for vector lengths 1-16 of the double, float, int, short and char types on (a) ATI 5850 GPU, (b) NVIDIA Tesla C1060 GPU, (c) NVIDIA Tesla C2050 GPU.]
Fig. 3(b) and Fig. 3(c) present the ratio of measured global
memory bandwidth to theoretical peak bandwidth for different
vector lengths and data types on NVIDIA GPUs. The vector types
that achieve peak performance are char4, short4, int2, float2
and double1. Global memory instructions support reading or
writing 1-, 2-, 4-, 8- and 16-byte words; vector data types
larger than 16 bytes, such as double4, float8 and short16,
perform very poorly. On the Fermi architecture, access to
global memory is cached, and the cache lines of both L1 and L2
are tagged with a block size of 128 bytes and map to 128-byte
aligned segments in device memory. If the word accessed by each
thread is larger than 4 bytes, a memory request by a warp is
first split into separate 128-byte memory requests that are
issued independently, so there is little performance difference
among int, int2 and int4, as shown in Fig. 3(c). Because they
cannot benefit from memory access alignment, the short and char
vector types also perform very poorly. Fig. 3(a) shows little
performance difference across all vector lengths except char1,
char2 and short1. On ATI GPUs there are two independent memory
paths between the compute units and the memory, namely FastPath
and CompletePath. When a kernel contains atomic operations or
sub-32-bit data transfers, it uses the CompletePath, whose
maximum bus utilization between the shader unit and the memory
unit is 25%, compared to 100% for the FastPath. Hence, the
global memory access efficiency of char1, char2 and short1 is
very low.
Assume that an $N$-point FFT is computed by dividing it into
$n$ kernels with radices $n_k$ as follows:

$$N = \prod_{k=1}^{n} n_k, \quad N(k) = n_1 n_2 \cdots n_k, \quad P(k) = \prod_{j=1}^{k-1} n_j, \quad R(k) = \prod_{j=k+1}^{n} n_j \qquad (10)$$
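The quantities of Eq. (10) are simple cumulative products over the chosen radices. A small C helper (names are ours) that computes them:

```c
/* Radix bookkeeping of Eq. (10): given N = n_1 * ... * n_s, compute
 * Ncum[k] = n_1...n_{k+1}, P[k] = n_1...n_k (product of radices before
 * stage k), R[k] = n_{k+2}...n_s (product after it), 0-indexed. */
static void radix_products(int s, const int *nk, int *Ncum, int *P, int *R)
{
    P[0] = 1;
    for (int k = 0; k < s; k++) {
        Ncum[k] = (k ? Ncum[k - 1] : 1) * nk[k];
        if (k + 1 < s) P[k + 1] = P[k] * nk[k];
    }
    R[s - 1] = 1;
    for (int k = s - 2; k >= 0; k--) R[k] = R[k + 1] * nk[k + 1];
}
```

For example, $N = 64$ factored as $4 \cdot 4 \cdot 4$ gives $P = (1, 4, 16)$ and $R = (16, 4, 1)$, the strides used for the local-memory indexing below.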
The formulas to read data from global memory at the beginning
and to store back the result at the end are $L^{N}_{N/n_1}$ and
$L^{N}_{n_1}$ respectively. Each work-item deals with one of
$N/n_1$ or $n_1$ segments respectively, accessing each data
point at stride $N/n_1$ or $n_1$. Let accessed_word be the word
accessed by each thread. If
$128\,\mathrm{bytes} / ((N/N(1)) \times \mathrm{sizeof}(\mathrm{accessed\_word})) < 1$
or
$(\mathrm{work\mbox{-}items\ per\ work\mbox{-}group}) \times \mathrm{sizeof}(\mathrm{accessed\_word}) < 128\,\mathrm{bytes}$,
the accessed region does not satisfy the above coalescing
requirements, and local memory is used to rearrange data
properly for coalesced access. Furthermore, the coalesce width,
which denotes the number of work-items in a global memory
access request, is 16 on the Tesla
C1060, Tesla C2050 and ATI 5850. If
$\mathrm{coalesce\_width} \mid (N/n_1)$ or
$\mathrm{coalesce\_width} \mid n_1$, i.e. the coalesce width
divides the segment length, the accesses are properly aligned.
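The coalescing test described above can be sketched as a pair of small predicates. The 128-byte threshold and the coalesce width come from the text; the exact predicate structure is our reading of it:

```c
/* Our reading of the coalescing check above: rearrangement through
 * local memory is needed when the access stride is too large for a
 * 128-byte segment, or the work-group is too small to fill one. */
static int needs_lds_rearrange(long stride_elems, int wg_size, int word_bytes)
{
    return (128 / (stride_elems * word_bytes) < 1) ||
           ((long)wg_size * word_bytes < 128);
}

/* Alignment condition: the coalesce width (16 on the GPUs above)
 * must divide the segment length N/n_1 or n_1. */
static int is_aligned(long seg_len, int coalesce_width)
{
    return seg_len % coalesce_width == 0;
}
```

The auto-tuner would evaluate these conditions per plan and fall back to a local-memory staging pass when they fail.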
B. Optimization of local memory access
In GPUs, large interleaved local memory is used for inter-
thread communication within a work-group. Because of the
interleaved design of memory banks, multiple threads fetch
or store concurrently with unit stride operates at full speed,
since each word resides on a different bank, namely, any
memory load or store of n addresses that spans n distinct
memory banks can be served simultaneously, yielding an
effective bandwidth that is n times as high as the bandwidth
of a single bank.
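A bank-conflict count for one access by CW work-items follows directly from this model: map each word index to a bank and take the maximum number of work-items landing on the same bank (an illustrative host-side model of ours, not device code):

```c
/* Worst-case bank conflict degree for cw consecutive work-items
 * accessing local-memory word indices word_idx[i], with `banks`
 * word-wide banks. 1 means conflict-free; assumes banks <= 64. */
static int conflict_degree(int cw, const int *word_idx, int banks)
{
    int count[64] = {0}, worst = 0;
    for (int i = 0; i < cw; i++) {
        int b = word_idx[i] % banks;      /* interleaved bank mapping */
        if (++count[b] > worst) worst = count[b];
    }
    return worst;
}
```

Unit-stride indices give degree 1, while a power-of-two stride equal to the bank count sends every work-item to the same bank, which is exactly the case the padding below repairs.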
To avoid bank conflicts, CW consecutive work-items (where CW
denotes the number of work-items issued to local memory
accesses together) are required to access different banks. The
Tesla C2050 (Fermi architecture) has 32 banks of 4-byte width,
and local memory accesses are issued per warp (CW = 32), not
per half-warp (CW = 16) as in GPUs prior to the Fermi
architecture such as the Tesla C1060, which has just 16 banks.
On the ATI 5850, with 32 banks, CW is just a quarter-wavefront
(CW = 16); since each thread can access two banks
simultaneously, vector-based applications benefit on this
platform. For example, single-precision complex data in
interleaved form (a float2 of interleaved real and imaginary
components) should be transformed to planar form (a float2 of
two real or two imaginary components) in local memory to avoid
2-way bank conflicts on NVIDIA GPUs, but this is not necessary
on ATI GPUs. Among the low-level optimization techniques, we
highlight loop unrolling and constant propagation; constant
propagation avoids unnecessary arithmetic instructions,
especially when computing padding functions.
Each thread loads one element from every $P(k) \times R(k)$
$n_k$-tuple sub-array to compute an $n_k$-point FFT and stores
the result back to local memory. The formula to store the
result in local memory is
$S^{N}_{N/n_k} = (L^{N}_{N/n_k})^{-1}$, where $L$ stands for
the load operation and $S$ for the inverse store operation.
The indices are $\mathrm{ThreadId} + (N/n_k) \times r$, with
$r \in [0, n_k)$ and $\mathrm{ThreadId} \in [0, N/n_k - 1)$.
The indices are consecutive, and if the work-group size is
larger than $N/n_k$, the kernel performs multiple FFTs at a
time. We use Eq. 8 to partition the multi-batch FFT across
work-items, exchanging data via local memory. Bank conflicts
occur when $n_k$ is a power of two, and they can be eliminated
by inserting appropriate padding.
$I_{P(k)} \otimes L^{n_k R(k)}_{R(k)}$ is the formula to load
data from local memory for computing the $n_k$-point FFT, and
$P(k) \otimes n_k \otimes R(k)$ indicates the index information
for fetching data along the $n_k$ dimension. The threads read
one $R(k)$-tuple vector at stride $R(k)$ while skipping
$R(k) \times (n_k - 1)$ elements. This may incur bank
conflicts, which we handle by inserting padding after every
$R(k) \times (n_k - 1)$ elements.
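A common way to realize such padding is to shift each local-memory index by one extra word per fixed period, spreading formerly colliding indices across banks. A sketch (the period would be derived from $R(k) \times (n_k - 1)$ as described above; the helper name is ours):

```c
/* Padding transform used to break power-of-two bank conflicts:
 * insert one extra word after every `period` words, so indices that
 * previously mapped to the same bank now spread across banks. */
static int padded_index(int idx, int period)
{
    return idx + idx / period;
}
```

With 32 banks and period 32, the indices 0, 32, 64 become 0, 33, 66, which fall on banks 0, 1 and 2 instead of all on bank 0; the cost is a small amount of wasted local memory per period.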
Figure 4: Overview of the adaptive optimization framework
for our FFT algorithms.
C. Automatic Performance Tuning Framework
We employ two-stage adaptation methodology to map
FFT algorithms to GPU architecture and memory hierarchy
(see Fig.4). At the installation time, the codelets library
consists of many small size DFTs that are generated by a
lightweight code generator module which is similar to fftgenmodule in UHFFT [13]. The codelets are straight-line code
which attain optimal performance by reducing the number
of register usage, operations, precomputation of constants
and so on.
At runtime, after accepting the obligatory parameters such as
the DFT size, dimension, batch size, transform direction and
complex data format (planar or interleaved storage), the
initialization module constructs many FFT plans, each
representing a factorization of the given transform size. The
code generator produces the GPU kernels to be executed in the
next step. Then, the search module evaluates the performance of
the executed plans to select the best one. Furthermore, we
provide an empirical target value for the performance of a
given size: while the performance of the evaluated plans is
below that value, the search engine changes some parameters in
the initialization stage, such as the number of threads in a
work-group, the maximum radix in a plan, or the number of banks
interleaved in local memory, and repeats the process. The
optimal parameters are assembled by iteratively compiling and
evaluating the various plans. Finally, the search module stores
performance data containing information on all evaluated plans
in addition to the selected plan and reuses it in later
sessions. In addition, we provide APIs for programmers to
adjust some parameters via script files to avoid exhaustive
search.
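The runtime search loop can be sketched as follows; in the real framework each measurement would come from compiling and timing a generated kernel, which we abstract away here (the function and its interface are our simplification, not the paper's API):

```c
/* Skeleton of the runtime plan search described above: walk the
 * candidate plans in order, keep the fastest seen so far, and stop
 * early once the empirical performance target for this size is met.
 * gflops[i] stands in for "compile plan i, run it, measure GFlops". */
static int best_plan(int nplans, const double *gflops, double target)
{
    int best = -1;
    double best_perf = 0.0;
    for (int i = 0; i < nplans; i++) {
        if (gflops[i] > best_perf) { best_perf = gflops[i]; best = i; }
        if (best_perf >= target) break;   /* good enough: stop searching */
    }
    return best;
}
```

The early-exit target is what keeps the search from degenerating into an exhaustive sweep over every factorization.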
IV. PERFORMANCE EVALUATIONS
In this section we analyze the performance of our FFT
library on different GPUs and compare its performance with
optimized scientific libraries on GPUs and CPUs. Our library
supports batched execution for all dimensional FFT. To val-
idate our methodology, the performances of our library are
Table I: Evaluation platforms

Platform | CPU                           | RAM (GB) | GCC   | Linux        | GPU         | GPU SDK               | OpenCL
System1  | AMD Phenom II X4 940, 0.8 GHz | 8        | 4.3.3 | Ubuntu 9.04  | ATI 5850    | ATI Stream SDK 2.4    | 1.1
System2  | Intel Xeon X5472, 3.0 GHz     | 16       | 4.4.5 | Ubuntu 10.10 | Tesla C1060 | NVIDIA CUDA SDK 4.0.1 | 1.1
System3  | Intel Xeon X5550, 2.66 GHz    | 16       | 4.4.5 | Ubuntu 10.10 | Tesla C2050 | NVIDIA CUDA SDK 4.0.1 | 1.1

Table II: Configuration of the GPUs we employ in our experiments.

GPU         | Clock Rate (GHz) | PE  | CU | Peak Perf. (GFlops) | Memory (GB) | Bus Width (bits) | Peak BW (GB/s) | Registers/CU | Local Memory (K) | Driver
ATI 5850    | 0.725            | 288 | 18 | 2088                | 1.0         | 256              | 128            | 16k          | 32               | cal 1.4.900
Tesla C1060 | 1.30             | 240 | 30 | 933                 | 4.0         | 512              | 102            | 16k          | 16               | 280.13
Tesla C2050 | 1.15             | 448 | 14 | 1030                | 3.0         | 384              | 144            | 16k          | 48               | 280.13

PE - Processing Element; CU - Compute Unit; Peak Perf. - peak single-precision floating-point performance; BW - bandwidth.
compared with FFTW, NVIDIA CUFFT [11] (the CUDA-capable
FFT library) and AMD clAmdFft [12]. For a
three-dimensional FFT with total size
$N = N_x \times N_y \times N_z$ and execution time $t$ seconds,
its performance in GFlops is defined by the following equation:

$$\mathrm{GFlops} = \frac{5 N_x N_y N_z (\log_2 N_x + \log_2 N_y + \log_2 N_z) \times 10^{-9}}{t} \qquad (11)$$
Experiments are conducted to evaluate our approach for
out-of-place complex-to-complex FFTs of power-of-two sizes on
three platforms with different GPUs, detailed in Table I and
Table II. The FFT libraries used for comparison are FFTW 3.2.2
on all host CPUs, CUFFT 4.0.1 on the NVIDIA GPUs and
clAmdFft 1.0.53 on the AMD GPU.
Fig. 5 shows the performance of batched 1D FFTs of size $2^N$
on the three platforms; our automatic performance tuning OpenCL
FFT library is called clFFT. Comparing versions 1.0 and 2.0,
the performance improvement is achieved by automatic padding
insertion that takes memory coalescing and bank-conflict
avoidance into account. The performance of our library is 1.5
to 4 times that of the clAmdFft library, close to that of the
CUFFT library, and 6 to 18 times that of the FFTW library with
four threads. Many local memory bank conflicts in the clAmdFft
library result in suboptimal performance; for example,
profiling shows that the percentage of GPU time during which
local memory is stalled by bank conflicts reaches 24.7% in that
library for a 1024-point FFT with 1024 batches, whereas in our
library it is zero. For sizes larger than what can be computed
using local memory alone, global transposes and multiple kernel
launches are needed. For these sizes, $N$ can be decomposed
using much larger base radices for the local memory
computation, via Eq. 4, to amortize the cost of the increased
number of device memory accesses. When $N$ exceeds $2^{10}$ on
the ATI 5850, $2^{9}$ on the Tesla C1060 and $2^{12}$ on the
Tesla C2050, performance begins to degrade because of the
limited local memory capacity and register count, so $N$ is
partitioned into smaller factors using global memory to
exchange data; the latency of global memory access then
decreases performance. The performance of batched 2D FFTs of
size $N_1 \times N_2$ and 3D FFTs of size
$N_1 \times N_2 \times N_3$ on the three platforms is presented
in Fig. 6 and Fig. 7 respectively. For multi-dimensional FFTs,
we use the row-column algorithm to compute each dimension with
multiple batches, using Eq. 9. Due to the increased number of
transpose operations and strided accesses, coalesced and
bank-conflict-free memory access is crucial to maximizing
effective memory bandwidth. Our auto-tuner can effectively
discover hotspots by analyzing memory access behavior and then
insert appropriate padding to avoid reduced bandwidth. Further,
higher occupancy is obtained by increasing the computation
ratio in each thread, without overflowing registers or local
memory, to overlap the cost of communication. As shown in
Fig. 6, the performance of our 2D FFT is 1.5 to 40 times that
of the clAmdFft library and close to that of the CUFFT library
on the Tesla C1060; moreover, we obtain better performance than
CUFFT for large FFT sizes such as 2048 x 2048. Our 3D FFT is
40% faster than the clAmdFft library for appropriate sizes, and
within a 5% performance gap of CUFFT 4.0 on the NVIDIA GPUs.
Our library achieves an average speedup of 1.4x and a maximum
of 1.9x over clAmdFft on the ATI 5850 GPU (see Fig. 7), and its
overall performance is within 90% of CUFFT 4.0 on the two
NVIDIA GPUs.
V. RELATED WORK
The FFT is an important computational kernel which has
broad applicability across a wide range of disciplines in
audio signal processing, image processing, spectral methods
for solving partial differential equation (PDE) and so on.
Several FFT libraries on CPUs with automatic performance
tuning have been proposed, e.g. FFTW[8], SPIRAL[9] and
UHFFT[13][14]. Automatic tuning in FFTW is performed
in two different levels, namely the installation time and
runtime. At the installation time, the code generator gen-
erates highly optimized straight-line FFT code blocks called
[Figure 5: Performance (GFlops) comparison of 1D FFTs of size 2^N: clFFT-1.0 and clFFT-2.0 vs. clAmdFft and FFTW on (a) ATI 5850 GPU, and vs. CUFFT and FFTW on (b) NVIDIA Tesla C1060 GPU and (c) NVIDIA Tesla C2050 GPU.]
[Figure 6: Performance (GFlops) comparison of 2D FFTs of size N1 x N2 (from 32x64 up to 2048x2048): clFFT-2.0 vs. clAmdFft on (a) ATI 5850 GPU, and vs. CUFFT on (b) NVIDIA Tesla C1060 GPU and (c) NVIDIA Tesla C2050 GPU.]
[Figure 7: Performance comparison of 3D FFTs on GPUs. Each panel plots Performance (GFlops) against DFT size N1*N2*N3, from 4*8*16 up to 128*128*128: (a) ATI 5850 GPU (clAmdFft, clFFT-2.0); (b) NVIDIA Tesla C1060 GPU (CUFFT, clFFT-2.0); (c) NVIDIA Tesla C2050 GPU (CUFFT, clFFT-2.0).]
codelets. At runtime, the pre-generated codelets are
assembled into a plan to compute large FFT problem sizes.
The auto-tuning methodology of UHFFT is similar to that
of FFTW. SPIRAL is a program generation and optimization
system that generates optimized code for digital signal
processing (DSP) transforms. It employs a three-stage
adaptation methodology to adapt to various architectures.
In the first stage, mathematical rules and identities are
applied by the formula generator, expressed in a
special-purpose pseudo-mathematical language called SPL
(Signal Processing Language), to expand and optimize the
FFT formula for a given transform. Then, the optimized
SPL formula is translated into source code for a specific
platform. Finally, the source code is compiled and evaluated,
and the measurements guide the code generation process
toward the best code.
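The codelet-plus-plan scheme can be illustrated with a minimal sketch (illustrative Python, not FFTW's actual code): hard-coded small-size DFTs stand in for the generated straight-line codelets, and a recursive Cooley-Tukey driver plays the role of the runtime plan that assembles them.

```python
import cmath

# "Codelets": fully unrolled DFTs for small fixed sizes, standing in
# for the straight-line code blocks FFTW generates at installation time.
def codelet_2(x):
    return [x[0] + x[1], x[0] - x[1]]

def codelet_4(x):
    t0, t1 = x[0] + x[2], x[0] - x[2]
    t2, t3 = x[1] + x[3], -1j * (x[1] - x[3])
    return [t0 + t2, t1 + t3, t0 - t2, t1 - t3]

CODELETS = {2: codelet_2, 4: codelet_4}

def fft(x):
    """Runtime 'plan': recursively split a power-of-two DFT
    (radix-2 Cooley-Tukey) until a pre-generated codelet applies."""
    n = len(x)
    if n == 1:
        return list(x)
    if n in CODELETS:
        return CODELETS[n](x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    half = n // 2
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(half)]
    return [even[k] + tw[k] for k in range(half)] + \
           [even[k] - tw[k] for k in range(half)]
```

A real planner would additionally time alternative decompositions of the same size and keep the fastest; this sketch fixes one decomposition for brevity.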
Graphics APIs such as DirectX or OpenGL were used in the
earliest FFT implementations to access the computing
resources of GPUs. Due to the intrinsic restrictions of
these APIs, both the performance and the readability of
these implementations were poor. The advent of CUDA has
reduced the complexity of programming NVIDIA GPUs, and
there has been growing research in exploring auto-tuning
techniques for improving the performance of algorithms
such as SpMV (sparse matrix-vector multiplication), GEMM
(general matrix multiply) and FFTs on CUDA GPUs; moreover,
several studies have been conducted to optimize performance
manually. Nukada and Matsuoka [15] presented an auto-tuning
algorithm for optimizing 3D FFTs on CUDA GPUs. Their
algorithm optimizes the number of threads and, in
particular, resolves bank conflicts in local memory.
However, for larger FFT sizes it may yield suboptimal
performance because it restricts the search space severely.
Yuri Dotsenko, Sara S. Baghsorkhi, et al. [7] also presented
an auto-tuning framework that automatically generates
optimized FFT kernels, using pruning heuristics to
significantly reduce the optimization search space. Although
their work demonstrated significant performance improvements
on GPUs, the CUFFT library they compared against predates
the present version: version 4.0 delivers much higher
performance and accuracy for arbitrary sizes, especially
power-of-two sizes, on NVIDIA GPU architectures than earlier
releases. Furthermore, both studies considered only the
NVIDIA CUDA platform and ignored ATI GPUs.
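The search-space pruning idea behind these auto-tuners can be sketched as follows. This is a hypothetical Python outline: `run_kernel`, the candidate values and the pruning rule are illustrative assumptions, not the actual heuristics of [7] or [15].

```python
import itertools
import time

def tune(run_kernel, n):
    """Illustrative auto-tuning loop: enumerate candidate
    (work-group size, per-thread radix) pairs, discard implausible
    ones with a cheap static heuristic, then time only the survivors
    and keep the fastest configuration."""
    candidates = itertools.product([64, 128, 256], [2, 4, 8])
    best, best_t = None, float("inf")
    for wg, radix in candidates:
        # Pruning heuristic (hypothetical): skip configurations whose
        # total work per launch does not divide the problem size evenly.
        if n % (wg * radix) != 0:
            continue
        t0 = time.perf_counter()
        run_kernel(n, wg, radix)  # would launch the OpenCL/CUDA kernel
        dt = time.perf_counter() - t0
        if dt < best_t:
            best, best_t = (wg, radix), dt
    return best
```

Real frameworks use far richer heuristics (occupancy, register pressure, bank-conflict avoidance) and cache the winning configuration per device and size.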
VI. CONCLUSION AND FUTURE WORK
Differences among the underlying architectures pose various
challenges in memory optimization and parallelism
management, resulting in different performance. In this
paper, we describe an auto-tuning framework for FFT
algorithms on various GPUs that fully exploits the
parallelism exposed by FFT algorithms based on the Kronecker
product. We also identify several key techniques of GPU
programming on AMD and NVIDIA GPUs. Our OpenCL FFT library
achieves up to 1.5 to 4 times, 1.5 to 40 times and 1.4 times
the performance of clAmdFft 1.0 for 1D, 2D and 3D FFTs
respectively on an AMD GPU, and its overall performance is
within 90% of CUFFT 4.0 on two NVIDIA GPUs.
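The Kronecker-product form referred to here is the classical Cooley-Tukey factorization (cf. Van Loan [4]): for a transform of size N = N_1 N_2,

```latex
F_{N} \;=\; \left(F_{N_1} \otimes I_{N_2}\right)\, T^{N}_{N_2}\, \left(I_{N_1} \otimes F_{N_2}\right)\, L^{N}_{N_1},
```

where $F_n$ denotes the order-$n$ DFT matrix, $T^{N}_{N_2}$ the diagonal matrix of twiddle factors, and $L^{N}_{N_1}$ the stride permutation; the two Kronecker factors expose batches of independent sub-transforms that can be computed in parallel.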
There are several avenues for future work. First, we will
continue to optimize the performance of our FFT library
on GPUs and port it to other OpenCL devices, such as the
Cell processor and Intel processors with the Sandy Bridge
architecture. Second, we will extend our library with
double-precision arithmetic and additional FFT algorithms
to handle arbitrary sizes. Third, we plan to construct a
novel performance model, trained on measured data or built
by interpolation with machine learning, to attain a good
trade-off between performance and search time. Finally, we
will consider heterogeneous parallel computing across CPUs
and GPUs, with optimal distribution of data and computation
between them.
VII. ACKNOWLEDGMENT
This work is supported in part by the Development Plan of
China under grants No. 2009AA01A129 and No. 2009AA01A134,
and by the Knowledge Innovation Project of the Chinese
Academy of Sciences, No. KGCX1-YW-13. We would like to
thank the anonymous referees for their helpful comments,
from which the preparation of this version of the paper
has benefited.
REFERENCES
[1] P. Duhamel and M. Vetterli, "Fast Fourier Transforms: A Tutorial Review and a State of the Art," Signal Processing, vol. 19, no. 4, pp. 259-299, 1990.
[2] Khronos OpenCL Working Group, The OpenCL Specification,version: 1.1, September, 2010.
[3] Cooley, J.W. and Tukey, J.W., "An algorithm for the machine calculation of complex Fourier series," Math. Comp., vol. 19, pp. 297-301, April 1965.
[4] Charles Van Loan, “Computational frameworks for the fastFourier transform,” Philadelphia:SIAM, 1992.
[5] J. Johnson, R. W. Johnson, D. Rodriguez, R. Tolimieri, "A methodology for designing, modifying, and implementing Fourier transform algorithms on various architectures," IEEE Trans. on Circuits and Systems, pp. 449-498, 1990.
[6] Franz Franchetti, Markus Püschel, Yevgen Voronenko, Srinivas Chellappa and José M. F. Moura, "Discrete Fourier Transform on Multicore," IEEE Signal Processing Magazine, special issue on "Signal Processing on Platforms with Multiple Cores," vol. 26, no. 6, pp. 90-102, 2009.
[7] Yuri Dotsenko, Sara S. Baghsorkhi, Brandon Lloyd, Naga K. Govindaraju, "Auto-tuning of fast Fourier transform on graphics processors," in Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP'11), February 12-16, 2011.
[8] Matteo Frigo and Steven G. Johnson, “The design and im-plementation of FFTW3,” in Proceedings of the IEEE, 93(2),2005.
[9] Spiral project website, www.spiral.net.
[10] NVIDIA Corporation, "NVIDIA CUDA Compute Unified Device Architecture Programming Guide," version 2.1, 2008.
[11] NVIDIA Corporation, “CUDA CUFFT Library,” August,2010.
[12] AMD Developer Central, http://developer.amd.com/libraries/appmathlibs/pages/default.aspx.
[13] UHFFT website, http://www2.cs.uh.edu/~ayaz/uhfft/.
[14] Ayaz Ali, Lennart Johnsson, Dragan Mirkovic, “EmpiricalAuto-tuning Code Generator for FFT and Trigonometric Trans-forms,” ODES: 5th Workshop on Optimizations for DSP andEmbedded Systems, in conjunction with International Sympo-sium on Code Generation and Optimization (CGO), March2007.
[15] Akira Nukada, Satoshi Matsuoka, “Auto-tuning 3-D FFTlibrary for CUDA GPUs,” in Proceedings of the Conferenceon High Performance Computing Networking, Storage andAnalysis, November 14-20, 2009.