
2011 IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS), Tainan, Taiwan, 2011.12.7-2011.12.9

Automatic FFT Performance Tuning on OpenCL GPUs

Yan Li∗†‡, Yunquan Zhang∗†, Haipeng Jia§, Guoping Long∗ and Ke Wang∗
∗Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
†State Key Lab. of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
‡Graduate University of Chinese Academy of Sciences, Beijing, China
§School of Information Science and Engineering, Ocean University of China, Qingdao, China
Email: [email protected], [email protected], {jiahaipeng95, longguoping, wangkehpc}@gmail.com

Abstract—Many fields of science and engineering, such as astronomy, medical imaging, seismology and spectroscopy, have been revolutionized by Fourier methods. The fast Fourier transform (FFT) is an efficient algorithm to compute the discrete Fourier transform (DFT) and its inverse. The emerging class of high performance computing architectures, such as GPUs, seeks to achieve much higher performance and efficiency by exposing a hierarchy of distinct memories to programmers. However, the complexity of GPU programming poses a significant challenge for programmers. In this paper, based on the Kronecker product form of multi-dimensional FFTs, we propose an automatic performance tuning framework for various OpenCL GPUs. Several key techniques of GPU programming on AMD and NVIDIA GPUs are also identified. Our OpenCL FFT library achieves up to 1.5 to 4 times, 1.5 to 40 times and 1.4 times the performance of clAmdFft 1.0 for 1D, 2D and 3D FFTs respectively on an AMD GPU, and the overall performance is within 90% of CUFFT 4.0 on two NVIDIA GPUs.

Keywords-FFT; DFT; GPU; OpenCL; Auto-tuning

I. INTRODUCTION

The fast Fourier transform (FFT) is one of the most widely used algorithms in scientific and engineering computation, especially in the fields of signal processing, image processing and data compression. Various algorithms have been proposed for computing FFTs efficiently since 1965 [1]. However, the FFT is only a good starting point if an efficient implementation exists for the architecture at hand. The gap between processor performance and memory latency has widened steadily over the last decade, and the resulting increase in the complexity of memory systems designed to ameliorate this gap has made it increasingly harder for compilers to optimize arbitrary code within an acceptable amount of time.

Hardware accelerators, such as GPUs, are promising platforms for general-purpose high-performance computing. However, their programming complexity, with explicitly managed memory hierarchies, poses a significant challenge for programmers. Due to the architectural characteristics of the GPU, it is necessary to structure algorithms so that referenced data is placed as close as possible to the processing elements. This requires applications to explicitly orchestrate all data transfers and data coherence among the memory hierarchies. The traditional programming approaches for multi-core CPUs and GPUs are very different. CPU-based parallel programming models typically assume a shared address space and do not encompass data placement. General-purpose GPU programming models must not only address complex memory hierarchies and vector operations but are also traditionally platform-specific or hardware-specific. These limitations and differences make it difficult to access the compute power of heterogeneous CPUs, GPUs and other types of processors with a single, multi-platform source code base. Thus, it becomes more and more complicated to build algorithms that are able to utilize modern computer systems to a satisfactory degree. Only the use of sophisticated techniques in both hardware architecture and software development can overcome these difficulties. Algorithms which were optimized for a specific architecture several years ago fail to perform well on current and emerging architectures. Due to the fast product cycles in hardware development and the complexity of today's execution environments, it is of utmost importance to provide users with easy-to-use, self-adapting numerical software.

This paper makes three major contributions:

1) A high-performance multi-dimensional FFT based on the Kronecker product is implemented in OpenCL on GPUs.
2) An auto-tuning framework to optimize 3D FFT algorithms on multiple platforms with the OpenCL API is proposed.
3) The architectural differences between AMD and NVIDIA GPUs as they affect OpenCL programming are analyzed.

The principal idea of our work is that the movement and placement of data across the different levels of the memory hierarchy is placed under explicit, efficient control by our auto-tuning framework, which ensures high occupancy of GPU resources. The remainder of the paper is organized as follows. We provide an overview of the OpenCL programming model and establish the principal rationale and approach for efficiently mapping FFT algorithms onto GPUs in Section II. Our auto-tuning framework is elaborated and demonstrated in Section III. Section IV presents the performance evaluation of our library in detail. After discussing related work in Section V, we conclude with a brief summary and some ideas for future work.

2011 IEEE 17th International Conference on Parallel and Distributed Systems
1521-9097/11 $26.00 © 2011 IEEE
DOI 10.1109/ICPADS.2011.32

II. BACKGROUND

A. OpenCL Programming Model

OpenCL (Open Computing Language) [2] is an open, widely accepted standard for heterogeneous parallel computing. It defines a uniform programming model that provides access to various processing units, referred to as devices, which include GPUs, multi-core CPUs and the Cell BE. An OpenCL device is most easily described as a collection of compute units (CUs), each containing multiple processing elements (PEs) whose functionality corresponds to a streaming processor core (CUDA core in the Fermi architecture) on NVIDIA GPUs or a stream core on ATI GPUs. The PEs execute the computation commands submitted by an OpenCL application. All PEs within a CU execute a single stream of instructions as single instruction multiple data (SIMD) or as SPMD.

OpenCL provides three key abstractions: a hierarchy of thread groups, local memories, and barrier synchronization. An OpenCL application consists of a host program, which executes on the host processor, and kernels, which are functions accelerated on a compute device through the OpenCL API. The host program provides command-queues for performing computation in-order or out-of-order on the PEs, and also defines a multi-dimensional abstract index space. Each point within the index space is associated with an execution instance of the kernel, defined as a work-item. Work-items are further grouped into work-groups. Each work-group is assigned to a CU for execution; all threads in a work-group can cooperate, whereas threads from different work-groups cannot. Furthermore, work-items within a work-group can be synchronized using barriers or memory fences. The block of work-items that are executed together is defined as a wavefront/warp, and the number of work-items in one wavefront/warp is called the wavefront/warp size. If work-items within a wavefront/warp diverge, for example at a branch, all execution paths are executed serially. This phenomenon is called thread divergence [10], and it can degrade performance greatly.
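The index-space bookkeeping described above can be sketched in a few lines. The following Python model of OpenCL's 1D NDRange indexing is illustrative only (the function names are ours, not OpenCL API calls); it shows how a work-item's global id is derived from its work-group id and its local id within the group:

```python
def global_id(group_id, local_size, local_id):
    """OpenCL 1D indexing: global id = group id * work-group size + local id."""
    return group_id * local_size + local_id

def enumerate_work_items(global_size, local_size):
    """Yield (group_id, local_id, global_id) for a 1D NDRange; in OpenCL 1.1
    the global size must be a multiple of the work-group size."""
    assert global_size % local_size == 0
    for g in range(global_size // local_size):
        for l in range(local_size):
            yield g, l, global_id(g, local_size, l)
```

Every point of the index space is visited exactly once, which mirrors the one-kernel-instance-per-work-item execution model.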

B. Representation of DFT

In fact, the FFT algorithm is reported to be one of the top ten algorithms of the 20th century. A considerable research effort has been devoted to the optimization of FFT codes over the past four decades. The algorithm presented by Cooley and Tukey [3] reduced the complexity of computing the DFT from the naïve O(n^2) to O(n log n), which is viewed as a turning point for applications of the Fourier transform.

The DFT of a sequence x = x_0, \ldots, x_{n-1} is defined in summation form as follows:

y_j = (\mathrm{DFT}_n\, x)_j = \sum_{k=0}^{n-1} \omega_n^{jk} x_k, \qquad (1)

where j \in [0, n-1] and \omega_n = e^{-2\pi i/n}.
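Eq. 1 translates directly into code. The following Python sketch evaluates the summation form verbatim; it is an O(n^2) reference implementation for checking results, not an FFT:

```python
import cmath

def dft(x):
    """Direct evaluation of Eq. 1: y_j = sum_k w_n^(jk) * x_k, w_n = e^(-2*pi*i/n)."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(w ** (j * k) * x[k] for k in range(n)) for j in range(n)]
```

The unit impulse transforms to the all-ones vector and a constant vector transforms to an impulse of height n, which makes the routine easy to sanity-check.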

The DFT can be represented in many different forms. The butterfly representation, for simplicity, is extracted from a signal flow graph implementing an FFT algorithm. In this paper, we adopt the Kronecker product to design and implement FFT algorithms. The properties of this formalism facilitate verification of the correctness of the implementation and code generation. Furthermore, it helps us identify GPU kernel specifications and optimize performance.

1) Basic properties of the Kronecker product: If A is an m×n matrix and B is a p×q matrix, then the Kronecker product of A and B is the mp×nq matrix denoted by A ⊗ B and defined by

A \otimes B = [a_{ij} B]_{i,j} \qquad (2)

with A = [a_{ij}]_{i,j}.

The direct sum of A and B is the (m+p)×(n+q) matrix denoted by A ⊕ B and defined by

A \oplus B = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}, \qquad (3)

where the 0's denote blocks of zeros of appropriate size.
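Both constructions are easy to realize for small dense matrices. A minimal pure-Python sketch (row-major nested lists, for illustration only):

```python
def kron(A, B):
    """Kronecker product of Eq. 2: the block matrix [a_ij * B]."""
    m, n = len(A), len(A[0])
    p, q = len(B), len(B[0])
    return [[A[i // p][j // q] * B[i % p][j % q]
             for j in range(n * q)] for i in range(m * p)]

def direct_sum(A, B):
    """Direct sum of Eq. 3: block-diagonal matrix with A and B on the diagonal."""
    m, n = len(A), len(A[0])
    p, q = len(B), len(B[0])
    out = [[0] * (n + q) for _ in range(m + p)]
    for i in range(m):
        for j in range(n):
            out[i][j] = A[i][j]
    for i in range(p):
        for j in range(q):
            out[m + i][n + j] = B[i][j]
    return out
```

Note that the result of `kron` has mp rows and nq columns while `direct_sum` produces m+p rows and n+q columns, exactly as in the definitions above.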

The mathematical identities for FFT algorithms can be obtained by referring to [4][5][6]. Historically, FFT algorithms were obtained by applying breakdown rules recursively and manipulating the resulting formulas to obtain the respective iteration. The rule we use is expressed in Eq. 4; it is applied to exhibit the parallel Kronecker product structure which leads to the so-called six-step or parallel FFT algorithm. A more detailed description of FFT algorithms and their variants can be found in [1][4], and the notation we use here mostly coincides with the notation in [4][5].

\mathrm{DFT}_{n_2 n_1} = L^{n_2 n_1}_{n_2} (I_{n_1} \otimes \mathrm{DFT}_{n_2}) L^{n_2 n_1}_{n_1} T^{n_2 n_1}_{n_1} (I_{n_2} \otimes \mathrm{DFT}_{n_1}) L^{n_2 n_1}_{n_2} \qquad (4)

The twiddle factor matrix, denoted by T^{n_1 n_2}_{n_1}, is the diagonal matrix defined by

T^{n_1 n_2}_{n_1} = \bigoplus_{i=0}^{n_2-1} \bigoplus_{j=0}^{n_1-1} \omega_{n_1 n_2}^{ij} = \mathrm{diag}(I_{n_1}, \Omega_{n_1 n_2, n_1}, \ldots, \Omega^{n_2-1}_{n_1 n_2, n_1}) = \bigoplus_{j=0}^{n_2-1} \mathrm{diag}(1, \omega^{j}_{n_1 n_2}, \ldots, \omega^{j(n_1-1)}_{n_1 n_2}), \qquad (5)

and moreover, the notation L^{mn}_n denotes the stride permutation which reorders a vector of size mn by loading it into n segments at stride n.
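With these definitions in place, Eq. 4 can be checked numerically. The Python sketch below is illustrative only; it assumes the conventions stated above (L^{mn}_n gathers n segments at stride n, and T^{n_1 n_2}_{n_1} has n_2 diagonal blocks of size n_1 as in Eq. 5) and compares the factorization against the direct DFT:

```python
import cmath

def dft_direct(x):
    """O(n^2) reference DFT (Eq. 1)."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(w ** (j * k) * x[k] for k in range(n)) for j in range(n)]

def stride_perm(x, n):
    """L^{mn}_n: reorder a vector of size mn into n segments at stride n."""
    m = len(x) // n
    return [x[i * n + j] for j in range(n) for i in range(m)]

def block_dft(x, size):
    """(I_{len/size} tensor DFT_size): apply DFT_size to consecutive blocks."""
    out = []
    for b in range(0, len(x), size):
        out.extend(dft_direct(x[b:b + size]))
    return out

def twiddle(x, n1, n2):
    """T^{n1 n2}_{n1} (Eq. 5): n2 blocks of size n1, block j scales entry k by w^(jk)."""
    w = cmath.exp(-2j * cmath.pi / (n1 * n2))
    return [x[j * n1 + k] * w ** (j * k) for j in range(n2) for k in range(n1)]

def six_step_fft(x, n1, n2):
    """Eq. 4, with the six factors applied right to left."""
    y = stride_perm(x, n2)        # L^{N}_{n2}
    y = block_dft(y, n1)          # I_{n2} tensor DFT_{n1}
    y = twiddle(y, n1, n2)        # T^{N}_{n1}
    y = stride_perm(y, n1)        # L^{N}_{n1}
    y = block_dft(y, n2)          # I_{n1} tensor DFT_{n2}
    return stride_perm(y, n2)     # L^{N}_{n2}
```

Applying the six factors to an arbitrary vector reproduces the direct DFT, which is what the correctness argument for the generated kernels relies on.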


2) Representation of FFT on GPU: We derive breakdown rules for FFT algorithms on a GPU architecture with p threads, listed in the following equations, by applying formula identities of the Kronecker product.

L^{n_2 n_1}_{n_2} = (I_p \otimes (L^{n_2}_{n_2/p} \otimes I_{n_1/p}))(L^{p^2}_{p} \otimes I_{n_2 n_1/p^2})(I_p \otimes L^{n_2 n_1/p}_{n_2}) \qquad (6)

T^{n_1 n_2}_{n_1} = \mathrm{diag}(I_{n_1}, \Omega_{n_1 n_2, n_1}, \ldots, \Omega^{n_2-1}_{n_1 n_2, n_1}) = \bigoplus_{i_1=0}^{p-1} \bigoplus_{i_2=0}^{n_2/p-1} \Omega^{i_1 n_2/p + i_2}_{n_1 n_2, n_1}(\omega_{n_1 n_2}) \qquad (7)

I_{n_1} \otimes \mathrm{DFT}_{n_2} = I_p \otimes (I_{n_1/p} \otimes \mathrm{DFT}_{n_2}), \qquad I_{n_2} \otimes \mathrm{DFT}_{n_1} = I_p \otimes (I_{n_2/p} \otimes \mathrm{DFT}_{n_1}) \qquad (8)

The row-column algorithm is obtained by applying the definition of the 2D DFT and the properties of the Kronecker product (Eq. 9). Higher-dimensional DFT algorithms are derived similarly.

\mathrm{DFT}_{m \times n} = (I_m \otimes \mathrm{DFT}_n)(\mathrm{DFT}_m \otimes I_n) = (I_m \otimes \mathrm{DFT}_n) L^{mn}_m (I_n \otimes \mathrm{DFT}_m) L^{mn}_n \qquad (9)
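In array form, Eq. 9 says that a 2D DFT is a set of 1D DFTs along one axis followed by 1D DFTs along the other. A small Python sketch (illustrative, checked against a direct double summation):

```python
import cmath

def dft1d(x):
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(w ** (j * k) * x[k] for k in range(n)) for j in range(n)]

def dft2d_row_column(a):
    """Row-column algorithm (Eq. 9): 1D DFTs along rows, then along columns."""
    m, n = len(a), len(a[0])
    rows = [dft1d(r) for r in a]
    cols = [dft1d([rows[i][j] for i in range(m)]) for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(m)]

def dft2d_direct(a):
    """Direct 2D DFT by double summation, used as the reference."""
    m, n = len(a), len(a[0])
    wm = cmath.exp(-2j * cmath.pi / m)
    wn = cmath.exp(-2j * cmath.pi / n)
    return [[sum(wm ** (u * i) * wn ** (v * j) * a[i][j]
                 for i in range(m) for j in range(n))
             for v in range(n)] for u in range(m)]
```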

Due to the varying features of different architectures, variants of FFT algorithms and their implementations are required. We choose the Stockham algorithm, in which the explicit digit-reversing computations required by the Cooley-Tukey process are avoided by performing a multi-dimensional transpose in each step [4][7]. For a given size and dimension of FFT, we first partition it into multiple dimensions by considering the resources available on the GPU, such as the number of registers and the on-chip local memory size, and then transform each dimension. To begin with, the threads load data from global memory into registers to compute the FFT along each dimension, then shuffle data for the next computation via local memory. The overall structure of the FFT kernel is presented in Fig. 1.

1) Treat the N-point FFT as a multi-dimensional array
2) Load data from global memory into registers
3) Use local memory to perform the FFT along each dimension with multiple batches:
   a) Compute small-point FFTs in registers and scale with twiddle factors
   b) Store and transpose data via local memory
   c) Load data from local memory and continue to perform small-point FFTs in registers
4) Write data back to global memory

Figure 1: Overall structure of the FFT kernel on GPUs
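The steps in Fig. 1 can be mimicked on the CPU. The Python sketch below treats an N-point FFT as an n_1 × n_2 array and follows the same sequence (small DFTs, twiddle scaling, transpose, small DFTs); it is illustrative only, and the registers/local-memory distinction of the real kernel is elided:

```python
import cmath

def small_dft(x):
    """Reference DFT for the small per-dimension transforms."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(w ** (j * k) * x[k] for k in range(n)) for j in range(n)]

def fft_as_2d(x, n1, n2):
    """N-point FFT via an n1 x n2 decomposition, following Fig. 1."""
    N = n1 * n2
    wN = cmath.exp(-2j * cmath.pi / N)
    # step 1: view x as an n1 x n2 array (row-major)
    a = [x[i * n2:(i + 1) * n2] for i in range(n1)]
    # step 3a: n1-point DFT along each column, then scale with twiddle factors
    cols = [small_dft([a[j1][j2] for j1 in range(n1)]) for j2 in range(n2)]
    t = [[cols[j2][k1] * wN ** (j2 * k1) for j2 in range(n2)] for k1 in range(n1)]
    # steps 3b/3c: transpose (done by the indexing above), then n2-point DFTs
    b = [small_dft(t[k1]) for k1 in range(n1)]
    # step 4: write back; the output index is k2*n1 + k1
    return [b[k1][k2] for k2 in range(n2) for k1 in range(n1)]
```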

III. ADAPTIVE OPTIMIZATION FRAMEWORK

As a result of the proliferation of multi-core processors and GPUs, many traditionally used numeric computation algorithms are being reconsidered. Issues that previously were very important, such as conserving memory, are no longer significant, and other issues have come to the fore. One issue that is becoming increasingly important for a wide range of new systems is the memory access pattern, especially the memory stride, which can result in significantly decreased performance. Furthermore, GPU applications are input-sensitive, and hence developing an efficient GPU program remains as challenging as before, if not more so. The number of threads in a work-group is limited by the number of registers and the size of local memory on a CU. However, reducing this size would decrease the occupancy, which is a measure of how well a GPU utilizes its resources. Hence, the resources on the GPU should be managed appropriately to ensure high occupancy. The resources that impact occupancy primarily include registers, local memory and global memory. Fig. 4 presents our auto-tuning framework: we achieve high utilization of thread resources as well as memory resources and satisfy the occupancy requirements as far as possible. With coalesced memory access in the kernel, we achieve high performance.

A. Optimization of global memory access

Arranging global memory transfers in a coalesced fashion helps achieve close to the peak memory transfer bandwidth. Coalesced memory access combines what would otherwise be several separate memory transactions into one single transaction. There are differences among CUDA-capable GPUs, classified by what is called compute capability [10]. Devices with compute capability 1.0 and 1.1 are more restricted than those of 1.2 or higher when it comes to coalesced memory access. The compute capabilities of the Tesla C1060 and Tesla C2050 are 1.3 and 2.0 respectively. Furthermore, the global memory space is not cached on the Tesla C1060, and a memory request for a warp is split into two requests, one for each half-warp. The Tesla C2050, however, is based on the new Fermi architecture, in which all accesses to global memory go through the L2 cache, including copies to/from the CPU host. There are two types of loads in global memory access, caching (the default mode) and non-caching; the load granularity of the former is 128 bytes and that of the latter is just 32 bytes, and moreover, the memory operations are issued per warp, not per half-warp. We show the performance of global memory copies with different strides and offsets in Fig. 2 (the word accessed by a thread is 8 bytes wide). Unaligned starting addresses and discontinuous region accesses degrade performance greatly. Furthermore, we evaluate the efficiency of global memory access with different vector lengths on both AMD and NVIDIA GPUs. We can see that the vector length has a very significant impact on performance.


[Figure 2 shows three surface plots of performance (GBytes/s) versus stride and offset: (a) ATI 5850 GPU, (b) NVIDIA Tesla C1060 GPU, (c) NVIDIA Tesla C2050 GPU.]

Figure 2: Performance comparison of global memory copies with various strides and starting address offsets on GPUs.

[Figure 3 shows three charts of global memory bandwidth access efficiency versus vector length (1, 2, 4, 8, 16) for the double, float, int, short and char types: (a) ATI 5850 GPU, (b) NVIDIA Tesla C1060 GPU, (c) NVIDIA Tesla C2050 GPU.]

Figure 3: Comparison of global memory bandwidth efficiency with different vector lengths.

Fig. 3 (b) and Fig. 3 (c) present the ratio of practical global memory bandwidth to theoretical peak bandwidth with different vector lengths and various data types on NVIDIA GPUs. The vector types which achieve peak performance are char4, short4, int2, float2 and double1. Global memory instructions support reading or writing 1-, 2-, 4-, 8- and 16-byte words, and vector data types larger than 16 bytes, such as double4, float8 and short16, have very poor performance. On the Fermi architecture, access to global memory is cached; the cache lines of both L1 and L2 use a block size of 128 bytes and map to 128-byte aligned segments in device memory. If the size of the word accessed by each thread is more than 4 bytes, a memory request by a warp is first split into separate 128-byte memory requests that are issued independently. Consequently there is little difference in performance among int, int2 and int4, as shown in Fig. 3 (c). Because they cannot take advantage of the performance gain from memory access alignment, the short and char vector types also perform very poorly. Fig. 3 (a) shows that there is little performance difference across all vector lengths except char1, char2 and short1. There are two independent memory paths between the compute units and the memory on ATI GPUs, namely the FastPath and the CompletePath. When a kernel has atomic operations or sub-32-bit data transfers, the kernel will use the CompletePath. The maximum bus utilization between the shader unit and the memory unit for the CompletePath is 25%, compared to 100% for the FastPath. Hence, the global memory access efficiency of char1, char2 and short1 is very low.

Assume that an N-point FFT is computed by dividing it into n kernels with radices n_k, as follows:

N = \prod_{k=1}^{n} n_k, \quad N(k) = n_1 n_2 \cdots n_k, \quad P(k) = \prod_{j=1}^{k-1} n_j, \quad R(k) = \prod_{j=k+1}^{n} n_j \qquad (10)
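The bookkeeping of Eq. 10 is mechanical. A small Python helper (illustrative; note the invariant P(k) · n_k · R(k) = N at every stage):

```python
from math import prod

def radix_plan(radices):
    """For a plan with radices n_1..n_n, return N and per-stage (N(k), P(k), R(k))
    as defined in Eq. 10 (k is 1-based)."""
    N = prod(radices)
    stages = []
    for k in range(1, len(radices) + 1):
        Nk = prod(radices[:k])        # N(k) = n_1 * ... * n_k
        Pk = prod(radices[:k - 1])    # P(k) = n_1 * ... * n_{k-1}
        Rk = prod(radices[k:])        # R(k) = n_{k+1} * ... * n_n
        assert Pk * radices[k - 1] * Rk == N  # each stage covers all N points
        stages.append((Nk, Pk, Rk))
    return N, stages
```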

The formulas for reading data from global memory at the beginning and for storing the result back at the end are L^N_{N/n_1} and L^N_{n_1} respectively. Each work-item deals with one of N/n_1 or n_1 segments respectively, accessing each data point at stride N/n_1 or n_1. Let accessed_word denote the word size accessed by each thread. If 128 byte / ((N/N(1)) × sizeof(accessed_word)) < 1 or (work-items/work-group) × sizeof(accessed_word) < 128 byte, the accessed region does not satisfy the above coalescing requirements, and local memory is used to rearrange the data properly to assist coalesced access. Furthermore, the coalesce width, which denotes the number of work-items in a global memory access request, is 16 on the Tesla C1060, Tesla C2050 and ATI 5850. If coalesce_width | (N/n_1) or coalesce_width | n_1, the accesses are properly aligned.
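The two checks above can be written as small predicates. This is our Python paraphrase of the conditions, with hypothetical names (the real framework evaluates them while constructing plans):

```python
def needs_local_rearrangement(N, N1, word_size, group_size):
    """Coalescing test paraphrased from the text: stage data through local
    memory when a 128-byte transaction cannot be filled by the strided
    pattern, or a work-group touches fewer than 128 contiguous bytes."""
    return 128 / ((N // N1) * word_size) < 1 or group_size * word_size < 128

def properly_aligned(N, n1, coalesce_width=16):
    """Accesses are aligned when the coalesce width divides N/n1 or n1."""
    return (N // n1) % coalesce_width == 0 or n1 % coalesce_width == 0
```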

B. Optimization of local memory access

In GPUs, a large interleaved local memory is used for inter-thread communication within a work-group. Because of the interleaved design of memory banks, multiple threads fetching or storing concurrently with unit stride operate at full speed, since each word resides on a different bank; that is, any memory load or store of n addresses that spans n distinct memory banks can be served simultaneously, yielding an effective bandwidth that is n times as high as the bandwidth of a single bank.

To avoid bank conflicts, CW consecutive threads (CW denotes the number of work-items that issue local memory accesses together) are required to access different banks. The Tesla C2050 (Fermi architecture) has 32 banks of 4-byte width, and local memory accesses are issued per warp (CW = 32), not per half-warp (CW = 16) as in GPUs prior to the Fermi architecture, such as the Tesla C1060 with just 16 banks. However, the value of CW is just a quarter-wavefront (CW = 16) on the ATI 5850 with its 32 banks. The characteristic that each thread can access two banks simultaneously benefits vector-based applications on that platform. For example, the single-precision floating-point complex data type in interleaved form (a float2 consisting of interleaved real and imaginary components) should be transformed to planar format (a float2 consisting of two real or two imaginary components) in local memory to avoid 2-way bank conflicts on NVIDIA GPUs, but it is not necessary to do so on ATI GPUs. Among the low-level optimization techniques, we highlight loop unrolling and constant propagation. Constant propagation can avoid unnecessary arithmetic instructions, especially when computing padding functions.
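The interleaved-to-planar reshuffle described above is a pure layout transform. A Python sketch (illustrative; on the GPU this rearrangement happens in local memory):

```python
def interleaved_to_planar(data):
    """Split interleaved complex data [re0, im0, re1, im1, ...] into planar
    form (all real parts, then all imaginary parts) - the layout used to
    avoid 2-way local memory bank conflicts on NVIDIA GPUs."""
    return data[0::2], data[1::2]

def planar_to_interleaved(re, im):
    """Inverse transform back to the interleaved layout."""
    out = []
    for r, i in zip(re, im):
        out += [r, i]
    return out
```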

Each thread loads one data element from every P(k) × R(k) n_k-tuple sub-array to compute an n_k-point FFT and stores the result back to local memory. The formula for storing the result in local memory is S^N_{N/n_k} = (L^N_{N/n_k})^{-1} (L stands for the load operation and S for the inverse store operation). The indices are ThreadId + (N/n_k) × r (r ∈ [0, n_k), ThreadId ∈ [0, N/n_k)). The indices are consecutive, and if the work-group size is larger than N/n_k, the kernel performs multiple FFTs at a time. We use Eq. 8 to partition the multiple-batch FFT among work-items, using local memory to exchange data. Bank conflicts occur when n_k is a power of two, and they can be resolved by inserting appropriate padding. I_{P(k)} ⊗ L^{n_k R(k)}_{R(k)} is the formula for loading data from local memory to compute the n_k-point FFT, and P(k) × n_k × R(k) indicates the index information for fetching data along the n_k dimension. The threads read one R(k)-tuple vector at stride R(k) while skipping R(k) × (n_k − 1). This may suffer bank conflicts, and we insert padding after every R(k) × (n_k − 1) elements to deal with them.
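The effect of padding can be seen in a toy model of a banked local memory. The Python sketch below is illustrative only (it assumes 32 four-byte banks and one word per address): a power-of-two stride maps all accesses of a wavefront to one bank, while inserting one padding word per 32 elements spreads them across all banks:

```python
from collections import Counter

def max_bank_conflict(addresses, num_banks=32):
    """Conflict degree of one local-memory access: the largest number of
    word addresses that fall into the same bank."""
    counts = Counter(a % num_banks for a in addresses)
    return max(counts.values())

def padded_index(i, stride, pad_every):
    """Map logical index i (accessed at `stride`) to a physical address with
    one padding word inserted after every `pad_every` words."""
    a = i * stride
    return a + a // pad_every
```

With 32 threads reading at stride 32, every address lands in bank 0 (a 32-way conflict); after padding, the 32 accesses hit 32 distinct banks and are served in one step.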

Figure 4: Overview of the adaptive optimization framework for our FFT algorithms.

C. Automatic Performance Tuning Framework

We employ a two-stage adaptation methodology to map FFT algorithms to the GPU architecture and memory hierarchy (see Fig. 4). At installation time, a codelet library is built, consisting of many small-size DFTs generated by a lightweight code generator module similar to the fftgen module in UHFFT [13]. The codelets are straight-line code which attains optimal performance by reducing register usage and operation counts, precomputing constants, and so on.

At runtime, after accepting the obligatory parameters such as the DFT size, the dimension, the batch size, the transform direction and the complex data format (planar or interleaved storage), the initialization module constructs many FFT plans, each representing a factorization for the given transform size. The code generator produces the GPU kernels to be executed in the next step. Then, the search module evaluates the performance of the executed plans to select the best plan. Furthermore, we provide an empirical performance value for a given size. While the performance of the evaluated plans remains below that value, the search engine changes some parameters in the initialization stage and repeats the aforementioned process; such parameters include the number of threads in a work-group, the maximum radix in a plan, and the number of banks interleaved in local memory. The optimal parameters can be assembled by iteratively compiling and evaluating the various plans. Finally, the search module generates performance data that records all evaluated plans besides the selected plan, and reuses it in later sessions. In addition, we also provide APIs for programmers to adjust some parameters through script files to avoid an exhaustive search.
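The runtime stage can be pictured as a plan-enumeration loop. The Python sketch below is a simplified illustration of this search, not our actual implementation: `run_plan` stands in for compiling and timing a generated kernel, and the widening step models relaxing one tuning parameter (here, the maximum radix) when the empirical threshold is not met:

```python
def factorizations(n, radices=(2, 4, 8, 16)):
    """Enumerate candidate plans: ordered factorizations of n into radices."""
    if n == 1:
        yield ()
    for r in radices:
        if n % r == 0:
            for rest in factorizations(n // r, radices):
                yield (r,) + rest

def autotune(n, run_plan, threshold_gflops=None, max_rounds=3):
    """Evaluate every candidate plan with run_plan (returning GFlops) and keep
    the best; if the best stays below the empirical threshold, relax the
    radix set and search again."""
    best_plan, best_perf = None, 0.0
    radices = (2, 4, 8, 16)
    for _ in range(max_rounds):
        for plan in factorizations(n, radices):
            perf = run_plan(plan)
            if perf > best_perf:
                best_plan, best_perf = plan, perf
        if threshold_gflops is None or best_perf >= threshold_gflops:
            break
        radices = radices + (32,)  # widen the search space and retry
    return best_plan, best_perf
```

In the real framework the timing step executes the generated OpenCL kernels, and the winning plan is persisted so later sessions can skip the search.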

IV. PERFORMANCE EVALUATIONS

In this section we analyze the performance of our FFT library on different GPUs and compare it with optimized scientific libraries on GPUs and CPUs. Our library supports batched execution for FFTs of all dimensions. To validate our methodology, the performances of our library are


Table I: Evaluation platforms

Platform  CPU                             RAM(GB)  GCC    Linux         GPU          GPU SDK                OpenCL
System1   AMD Phenom II X4 940, 0.8 GHz   8        4.3.3  Ubuntu 9.04   ATI 5850     ATI Stream SDK 2.4     1.1
System2   Intel Xeon X5472, 3.0 GHz       16       4.4.5  Ubuntu 10.10  Tesla C1060  NVIDIA CUDA SDK 4.0.1  1.1
System3   Intel Xeon X5550, 2.66 GHz      16       4.4.5  Ubuntu 10.10  Tesla C2050  NVIDIA CUDA SDK 4.0.1  1.1

Table II: Configuration of the GPUs we employ in our experiments.

GPU          Clock Rate(GHz)  PE   CU  Peak Perf.(GFlops)  Memory(GB)  Bus Width(bits)  Peak BW(GB/s)  Registers/CU  Local Memory(K)  Driver
ATI 5850     0.725            288  18  2088                1.0         256              128            16k           32               cal 1.4.900
Tesla C1060  1.30             240  30  933                 4.0         512              102            16k           16               280.13
Tesla C2050  1.15             448  14  1030                3.0         384              144            16k           48               280.13

PE – Processing Element; CU – Compute Unit; Peak Perf. – peak single-precision floating-point performance; BW – bandwidth.

compared with those of FFTW, NVIDIA CUFFT [11] (the CUDA FFT library) and AMD clAmdFft [12]. For a three-dimensional FFT with total size N = N_x × N_y × N_z and execution time of t seconds, its performance in GFlops is defined by the following equation:

\mathrm{GFlops} = \frac{5 N_x N_y N_z (\log_2 N_x + \log_2 N_y + \log_2 N_z) \times 10^{-9}}{t} \qquad (11)
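Eq. 11 is straightforward to apply; a 1D or 2D transform is covered by setting the unused sizes to 1, since log2(1) = 0. A Python sketch:

```python
from math import log2

def fft_gflops(nx, ny, nz, seconds):
    """Eq. 11: the conventional 5 N log2(N) flop count for a 3D FFT of size
    Nx x Ny x Nz, divided by the execution time, reported in GFlops."""
    flops = 5 * nx * ny * nz * (log2(nx) + log2(ny) + log2(nz))
    return flops * 1e-9 / seconds
```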

Experiments are conducted to evaluate our approach for

out-of-place complex to complex FFTs of power-of-two

sizes on three platforms with different GPUs which are

shown in Table I and Table II in detail. The FFT libraries

used for comparison are FFTW 3.2.2 on all host CPUs,

CUFFT 4.0.1 on NVIDIA GPU and clAmdFft 1.0.53 on

AMD GPU.

Fig. 5 shows the performance of batched 1D FFTs of size 2^N on the three platforms; our automatic performance tuning OpenCL FFT library is called clFFT. Comparing version 1.0 with version 2.0, the performance improvement is achieved by automatic padding insertion, which takes memory coalescing and bank-conflict avoidance into account. The performance of our library is 1.5 to 4 times that of the clAmdFft library, close to that of the CUFFT library, and 6 to 18 times that of the FFTW library running with four threads. The many bank conflicts on local memory accesses in the clAmdFft library result in suboptimal performance: profiling shows that for an FFT of size 1024 with 1024 batches, local memory in clAmdFft is stalled by bank conflicts for 24.7% of GPU time, whereas in our library that figure is zero. For sizes larger than what can be computed with the local memory FFT, global transposes and multiple kernel launches are needed. For these sizes, N can be decomposed into much larger base radices for the local memory computation using Eq. 4, to amortize the cost of the increased number of device memory accesses. When N grows beyond 2^10 on the ATI 5850, 2^9 on the Tesla C1060 and 2^12 on the Tesla C2050, performance begins to degrade because of the limited local memory capacity and register counts, so N is partitioned into smaller sizes that exchange data through global memory; the latency of global memory accesses then causes the performance decrease.

The performance of batched 2D FFTs of size N1 × N2 and 3D FFTs of size N1 × N2 × N3 on the three platforms is presented in Fig. 6 and Fig. 7, respectively. For multidimensional FFTs, we use the row-column FFT algorithm to compute each dimension with multiple batches, using Eq. 9. Because of the increased number of transpose operations and strided accesses, coalesced, bank-conflict-free memory access is crucial to maximizing effective memory bandwidth. Our auto-tuner effectively discovers the hotspots by analyzing memory access behavior, then inserts appropriate padding to avoid the reduced bandwidth. Further, higher occupancy is obtained by increasing the computation ratio in each thread, without overflowing registers or local memory, to overlap the cost of communication. As shown in Fig. 6, the performance of our 2D FFT is 1.5 to 40 times that of the clAmdFft library and close to that of the CUFFT library on the Tesla C1060; moreover, we obtain better performance than CUFFT for large FFT sizes such as 2048 × 2048. Our 3D FFT is 0.4 times (40%) faster than the clAmdFft library at appropriate sizes, and comes within 5% of CUFFT 4.0 on the NVIDIA GPUs. Our library achieves an average speedup of 1.4x, and a maximum of 1.9x, over clAmdFft on the ATI 5850 GPU (see Fig. 7), and its overall performance is within 90% of CUFFT 4.0 on the two NVIDIA GPUs.

V. RELATED WORK

The FFT is an important computational kernel with broad applicability across a wide range of disciplines: audio signal processing, image processing, spectral methods for solving partial differential equations (PDEs), and so on. Several FFT libraries on CPUs with automatic performance tuning have been proposed, e.g. FFTW [8], SPIRAL [9] and UHFFT [13][14]. Automatic tuning in FFTW is performed at two different levels, namely installation time and runtime. At installation time, the code generator generates highly optimized straight-line FFT code blocks called


[Figure 5: Performance comparison of 1D FFTs on GPUs. Performance (GFlops) vs. DFT size 2^N. (a) ATI 5850 GPU: clAmdFft, clFFT-2.0, clFFT-1.0, FFTW; (b) NVIDIA Tesla C1060 GPU: CUFFT, clFFT-2.0, clFFT-1.0, FFTW; (c) NVIDIA Tesla C2050 GPU: CUFFT, clFFT-2.0, clFFT-1.0, FFTW.]

[Figure 6: Performance comparison of 2D FFTs on GPUs. Performance (GFlops) vs. DFT size N1 × N2, from 32×64 up to 2048×2048. (a) ATI 5850 GPU: clAmdFft, clFFT-2.0; (b) NVIDIA Tesla C1060 GPU: CUFFT, clFFT-2.0; (c) NVIDIA Tesla C2050 GPU: CUFFT, clFFT-2.0.]

[Figure 7: Performance comparison of 3D FFTs on GPUs. Performance (GFlops) vs. DFT size N1 × N2 × N3, from 4×8×16 up to 128×128×128. (a) ATI 5850 GPU: clAmdFft, clFFT-2.0; (b) NVIDIA Tesla C1060 GPU: CUFFT, clFFT-2.0; (c) NVIDIA Tesla C2050 GPU: CUFFT, clFFT-2.0.]

codelets. At runtime, the pre-generated codelets are assembled into a plan to compute large FFT problem sizes.

The auto-tuning methodology of UHFFT is similar to FFTW's. SPIRAL is a program generation and optimization system that generates optimized code for digital signal processing (DSP) transforms. It employs a three-stage adaptation methodology to adapt to various styles of architecture. In the first stage, mathematical rules and identities are applied by the formula generator, expressed in a special-purpose pseudo-mathematical language called SPL (Signal Processing Language), to expand and optimize the FFT formula for a given transform. Then the optimized SPL formula is translated into source code for a specific platform. Finally, the source code is compiled and evaluated, and the measured performance guides the code generation process toward the best code.

Graphics APIs such as DirectX or OpenGL were used in the earliest FFT implementations to access the computing resources of GPUs. Due to the intrinsic restrictions of these APIs, the performance and readability of those implementations were very poor. The advent of CUDA has decreased the complexity of programming on NVIDIA GPUs, and there has been growing research into auto-tuning techniques for improving the performance of algorithms


such as SpMV (sparse matrix-vector multiplication), GEMM (general matrix multiply) and FFTs on CUDA GPUs; moreover, several studies have optimized performance manually. Nukada and Matsuoka [15] presented an auto-tuning algorithm for optimizing 3D FFTs on CUDA GPUs. Their algorithm optimizes the number of threads and, in particular, resolves bank conflicts on local memory. However, larger FFT sizes may lead to suboptimal performance because the search space is severely restricted. Dotsenko et al. [7] also presented an auto-tuning framework that automatically generates optimized FFT kernels, with pruning heuristics that significantly reduce the optimization search space. Although their work demonstrated significant performance improvements on GPUs, the CUFFT library they compared against predates the present version; version 4.0 delivers much higher performance and accuracy for arbitrary sizes, especially power-of-two sizes, on NVIDIA GPU architectures. Furthermore, both of these works considered only the NVIDIA CUDA platform and ignored ATI GPUs.

VI. CONCLUSION AND FUTURE WORK

Architectural differences among the underlying platforms pose various challenges in memory optimization and parallelism management, and hence lead to different performance. In this paper, we describe an auto-tuning framework for FFT algorithms on various GPUs that fully exploits the parallelism exposed by the Kronecker-product formulation of the FFT. We also identify several key techniques of GPU programming on AMD and NVIDIA GPUs. Our OpenCL FFT library achieves 1.5 to 4 times, 1.5 to 40 times and up to 1.4 times the performance of clAmdFft 1.0 for 1D, 2D and 3D FFTs respectively on an AMD GPU, and its overall performance is within 90% of CUFFT 4.0 on two NVIDIA GPUs.

There are several avenues for future work. First, we will continue to optimize the performance of our FFT library on GPUs and port the library to other OpenCL devices, such as the Cell processor and Intel processors with the Sandy Bridge architecture. Second, we will extend our library to support double-precision arithmetic and additional FFT algorithms so as to handle arbitrary sizes. Third, we plan to construct a novel performance model, trained on data or built by machine-learning interpolation, to attain a good trade-off between performance and search time. Finally, we will consider heterogeneous parallel computing across CPUs and GPUs, with optimal data distribution and computation ratios between them.

VII. ACKNOWLEDGMENT

This work is supported in part by the Development Plan of China under grants No. 2009AA01A129 and No. 2009AA01A134, and by the Knowledge Innovation Project of the Chinese Academy of Sciences under grant No. KGCX1-YW-13. We would like to thank the anonymous referees for their helpful comments, from which the preparation of this version of the paper has benefited.

REFERENCES

[1] P. Duhamel and M. Vetterli, "Fast Fourier Transforms: A Tutorial Review and a State of the Art," Signal Processing, vol. 19, no. 4, pp. 259-299, 1990.

[2] Khronos OpenCL Working Group, The OpenCL Specification, version 1.1, September 2010.

[3] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comp., vol. 19, pp. 297-301, April 1965.

[4] Charles Van Loan, Computational Frameworks for the Fast Fourier Transform, Philadelphia: SIAM, 1992.

[5] J. Johnson, R. W. Johnson, D. Rodriguez, and R. Tolimieri, "A methodology for designing, modifying, and implementing Fourier transform algorithms on various architectures," IEEE Trans. on Circuits and Systems, pp. 449-498, 1990.

[6] Franz Franchetti, Markus Puschel, Yevgen Voronenko, Srinivas Chellappa, and Jose M. F. Moura, "Discrete Fourier Transform on Multicore," IEEE Signal Processing Magazine, special issue on "Signal Processing on Platforms with Multiple Cores," vol. 26, no. 6, pp. 90-102, 2009.

[7] Yuri Dotsenko, Sara S. Baghsorkhi, Brandon Lloyd, and Naga K. Govindaraju, "Auto-tuning of fast Fourier transform on graphics processors," in Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP'11), February 12-16, 2011.

[8] Matteo Frigo and Steven G. Johnson, "The design and implementation of FFTW3," Proceedings of the IEEE, vol. 93, no. 2, 2005.

[9] Spiral project website, www.spiral.net.

[10] NVIDIA Corporation, "NVIDIA CUDA Compute Unified Device Architecture - Programming Guide," version 2.1, 2008.

[11] NVIDIA Corporation, "CUDA CUFFT Library," August 2010.

[12] AMD Developer Central, http://developer.amd.com/libraries/appmathlibs/pages/default.aspx.

[13] UHFFT website, http://www2.cs.uh.edu/~ayaz/uhfft/.

[14] Ayaz Ali, Lennart Johnsson, and Dragan Mirkovic, "Empirical Auto-tuning Code Generator for FFT and Trigonometric Transforms," in ODES: 5th Workshop on Optimizations for DSP and Embedded Systems, held in conjunction with the International Symposium on Code Generation and Optimization (CGO), March 2007.

[15] Akira Nukada and Satoshi Matsuoka, "Auto-tuning 3-D FFT library for CUDA GPUs," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, November 14-20, 2009.
