8/2/2019 Report FFT Implementation 08gr943
FFT Parallelization for OFDM Systems
9TH SEMESTER PROJECT, AAU
APPLIED SIGNAL PROCESSING AND IMPLEMENTATION (ASPI)
Group 943
Jeremy LERESTEUX
Jean-Michel LORY
Olivier LE JACQUES
AALBORG UNIVERSITY
INSTITUTE FOR ELECTRONIC SYSTEMS
Fredrik Bajers Vej 7, DK-9220 Aalborg East. Phone: 96 35 80 80. http://www.esn.aau.dk
TITLE:
FFT Parallelization
for OFDM Systems
THEME: Parallel Architecture Processing
FFT implementation
PROJECT PERIOD:
9th Semester
September 2008 to January 2009
PROJECT GROUP:
ASPI 08gr943
PARTICIPANTS:
Jeremy Leresteux
Jean-Michel Lory
Olivier le Jacques
SUPERVISORS:
Yannick Le Moullec (AAU)
Ole Mikkelsen (Rohde&Schwarz)
Jes Toft Kristensen (Rohde&Schwarz)
PUBLICATIONS: 8
NUMBER OF PAGES: 46
APPENDICES: 1 CD-ROM
FINISHED: 5th of January 2009
Abstract
This 9th semester project for the Applied Signal Processing and Implementation specialization at Aalborg University is a study of the parallelization of FFT algorithms for OFDM receivers on the Cell BE. The project focuses on mobile applications, such as LTE, which require efficient bandwidth utilization. This can be achieved by means of OFDM technology. A significant part of the OFDM workload is the IFFT/FFT operations. This can be exploited by parallelizing special FFT algorithms to obtain a lower operation count and thereby reduce computation time. This project investigates the possibilities and differences, with regard to time usage, of computing FFT algorithms on the multiple processors of the Cell BE. First of all, LTE and OFDM are defined and explained. Then, two Fast Fourier Transform algorithms, a Radix-2 DIT FFT and a Sørensen FFT (SFFT), are examined and mapped onto the Cell BE processor architecture. Afterwards, both algorithms are tested and the results are discussed. The SFFT algorithm turns out to be better than the Radix-2 DIT algorithm in terms of execution time and performance. In the conclusion, an assessment is made and future perspectives are discussed.
Preface
This report is the documentation for a 9th semester project in Applied Signal Processing and Implementation (ASPI) entitled FFT Parallelization for OFDM Systems at Aalborg University (AAU). It is prepared by group 08gr943 and spans from September 2nd, 2008 to January 5th, 2009. The project is supervised by Yannick Le Moullec, Assistant Professor at AAU, and by Jes Toft Kristensen and Ole Mikkelsen from the company Rohde & Schwarz Technology Center A/S in Aalborg. The report is divided into four parts, corresponding to the introduction of the project, the analysis, the implementation and the conclusion.

Jeremy Leresteux    Jean-Michel Lory    Olivier Le Jacques

Aalborg, January 5th, 2009
Contents
Preface 4
1 Introduction 7
1.1 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.1 Long Term Evolution (LTE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.2 Orthogonal Frequency-Division Multiplexing (OFDM) . . . . . . . . . . . . . . . 10
1.1.3 Conclusion on the context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2 Project subject . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.1 Fast Fourier Transformation (FFT) . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.2 Cell Broadband Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2.3 Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4 Project Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Analysis 16
2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Design Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Design Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Cell BE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.2 Programming of the CBE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 FFT algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.2 Discrete Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.3 Cooley-Tukey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.4 Sørensen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 Conclusion of the Analysis section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 Implementation 32
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Cooley-Tukey Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Overall Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.3 Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Sørensen Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.1 Overall Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.3 Comparison with the CT algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 Conclusion & Perspectives 45
4.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.1 Short term perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.2 Long term perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Bibliography 47
List of Figures 49
Chapter 1  Introduction
1.1 Context
In 1981, Nordic Mobile Telephony (NMT) led to the commercialization of the first mobile phone (referred to as 1st Generation(1)). By the 29th of November 2007, 3.3 billion mobile phones had been identified worldwide [1]. Most of these phones are GSM phones (2G), but the 3rd Generation phones, which can provide features like web browsing or videoconferencing, approached half a billion devices at the end of September 2007. 3G phones have good capabilities, but a new generation (4G), with even better capabilities including higher bandwidth and more flexibility, is approaching. See Figure 1.1 for a summary of the history of the mobile phone generations.
1.1.1 Long Term Evolution (LTE)
LTE (Long Term Evolution) [2] is the next major step in mobile radio communication. It is one of the best candidates for the 4th Generation of mobile wireless data transfer. Its development was started in 2004 by 3GPP [3] and several European mobile manufacturers and operators [4].
802.16m WiMAX (Worldwide Interoperability for Microwave Access) is another candidate [5] for the 4G appellation. It is developed by the IEEE and headed by Intel [6]. The last candidate is Ultra Mobile Broadband (UMB), developed by 3GPP2 [7] and headed by Qualcomm (it was decided on November 13, 2008 to stop UMB development in favor of LTE [8]). This project only considers LTE; WiMAX and UMB are therefore disregarded.
LTE's major aim is to improve on the 3G UMTS (Universal Mobile Telecommunication System). It has ambitious requirements regarding spectrum efficiency, lower costs, higher capacity, improved services such as video conferencing and VoIP (Voice over Internet Protocol) communication, lower latency, and better integration with other standards.
The 3GPP Release 8 [9] specifies what the LTE requirements shall be (only the most significant ones are listed here):

Peak data rate
- Instantaneous downlink peak data rate of 100 Mb/s within a 20 MHz downlink spectrum allocation (5 bps/Hz)

(1) Generation: term used to define the technology used in mobile communication. 1G is NMT, 2G is GSM and 3G is UMTS/HSPA.
Figure 1.1: Standardization evolution track, 1990-2014: 2G GSM (40 kbps; first call made in 1991; GPRS in 2000; EDGE in 2003; 3.3 billion subscribers), 3G (UMTS in 2001; WCDMA, 384 kbps, 297 million subscribers; HSDPA in 2005; HSPA, 14.4/28/40 Mbps, 55 million subscribers; Evolved HSPA/HSUPA in 2008) and 4G LTE (100 Mbps, planned in 2009). GSM is Global System for Mobile communications, GPRS is General Packet Radio Service, UMTS is Universal Mobile Telecommunications System, WCDMA is Wideband Code Division Multiple Access, EDGE is Enhanced Data Rates for GSM Evolution, HSPA is High Speed Packet Access, HSDPA is High-Speed Downlink Packet Access, HSUPA is High-Speed Uplink Packet Access and LTE is Long-Term Evolution. Modified from [10]
- Instantaneous uplink peak data rate of 50 Mb/s within a 20 MHz uplink spectrum allocation (2.5 bps/Hz)

Latency
- Transition time of less than 100 ms from a camped state (Idle Mode) to an active state
- Less than 5 ms in unloaded condition (i.e. single user with single data stream) for a small IP packet

User capacity and throughput
- At least 200 users per cell should be supported in the active state for spectrum allocations up to 5 MHz
- Downlink: average user throughput per MHz of 3 to 4 times HSDPA (High-Speed Downlink Packet Access, the 3.5G downlink protocol)
- Uplink: average user throughput per MHz of 2 to 3 times HSUPA (High-Speed Uplink Packet Access, the 3.5G uplink protocol)

Spectrum efficiency
- Downlink: in a loaded network, target for spectrum efficiency (bits/sec/Hz/site) of 3 to 4 times HSDPA
- Uplink: in a loaded network, target for spectrum efficiency (bits/sec/Hz/site) of 2 to 3 times HSUPA

Coverage
- The throughput, spectrum efficiency and mobility targets above should be met for 5 km cells, with a slight degradation for 30 km cells. Cell ranges up to 100 km should not be precluded.

Complexity
- Minimize the number of options
- No redundant mandatory features
These characteristics are achieved thanks to the E-UTRA air interface. E-UTRA is the acronym for Evolved Universal Terrestrial Radio Access. It is the successor of the GERAN/UTRAN (GSM EDGE Radio Access Network / UMTS Terrestrial Radio Access Network) 2G/3G air interfaces. Also designed by 3GPP, its requirements are as follows:

Mobility
- E-UTRAN should be optimized for low mobile speeds from 0 to 15 km/h
- Higher mobile speeds between 15 and 120 km/h should be supported with high performance
- Mobility across the cellular network shall be maintained at speeds from 120 km/h to 350 km/h (or even up to 500 km/h depending on the frequency band)

Spectrum flexibility
- E-UTRA shall operate in spectrum allocations of different sizes, including 1.25 MHz, 1.6 MHz, 2.5 MHz, 5 MHz, 10 MHz, 15 MHz and 20 MHz, in both the uplink and the downlink. Operation in paired and unpaired spectrum shall be supported.

Co-existence and inter-working with 3GPP Radio Access Technology (RAT)
- Co-existence in the same geographical area and co-location with GERAN/UTRAN on adjacent channels
- E-UTRAN terminals also supporting UTRAN and/or GERAN operation should be able to support measurement of, and handover from and to, both 3GPP UTRAN and 3GPP GERAN
- The interruption time during a handover of real-time services between E-UTRAN and UTRAN (or GERAN) should be less than 300 ms
E-UTRA is the air interface which enables the communication between a BTS (Base Transmitter Station) and a UE (User Equipment). The signal modulation used by the BTS and the demodulation used by the UE are slightly different, but both are based on the same technology, namely Frequency-Division Multiplexing, which gives them many similarities. SC-FDM (Single Carrier Frequency-Division Multiplexing) is used for the transmitter part and OFDM (Orthogonal Frequency-Division Multiplexing) for the receiver part. This has been decided by the 3GPP members and is summarized in Release 8.
This project is related to the OFDM aspect at the receiver side. Section 1.1.2 gives an overview of OFDM fundamentals.
1.1.2 Orthogonal Frequency-Division Multiplexing (OFDM)
This section is based on the information provided in these two papers: [11] [12].
OFDM is a modulation technique used in most of the new wireless technologies such as IEEE 802.11a/b/g, 802.16, HiperLAN/2, DVB (digital TV) and DAB [13]. The 3GPP members selected it as the LTE/E-UTRA downlink protocol, i.e. for the system which receives data and communication packets from a transmitter. As indicated at the end of section 1.1.1, the selected uplink protocol, SC-FDM, presents similarities to OFDM, which is why this section only introduces OFDM on the transmitter and receiver sides.
1.1.2.1 Overview
With standard single-carrier transmitters, the signal is spread over multiple transmission paths. Because of the environment (buildings, cars, distance), the signal becomes weaker and distorted. This phenomenon, called fading, appears for example when signals are reflected off buildings. The reflected signals arrive at the receiver later than the main signal, which results in distortions, as illustrated in Figure 1.2.
Figure 1.2: Multipath propagation. A transmitted signal is spread over different paths (buildings, mobile and fixed obstacles) and, depending on the obstacles met and the distance covered, the distortion is more or less pronounced. Modified from [12].
These distortions are a major problem when establishing secure high-speed data transfers such as those used on 3G UMTS cell phones. OFDM addresses this distortion problem. It does not avoid reflections, but its characteristics make a transmission safer, in the sense that data packets are always present, by permitting multiple signals to be sent over a single radio channel. OFDM is a multi-carrier transmission/reception scheme, i.e. it can send/receive signals to/from several users. The next subsections describe the main principles of OFDM on the transmitter and receiver sides.
1.1.2.2 OFDM Principles
OFDM distributes the data over a large number of carriers at different frequencies. This spacing provides the orthogonality which prevents the receivers from seeing wrong frequencies. In contrast to other multi-carrier techniques, like CDMA, OFDM prevents Inter-Symbol Interference (ISI) by adding a cyclic prefix, which is explained in the section on Inter-Symbol Interference below.
One of the key features of OFDM is the IFFT/FFT pair. These two mathematical tools are used here to transform several signals on different carriers from the frequency domain to the time domain in the IFFT (or FFT^-1) and from the time domain back to the frequency domain in the FFT. Figure 1.3 shows the principle with the main parts of an OFDM system.
Figure 1.3: Main principle of an OFDM transmitter/receiver. Transmitter: input signal, serial-to-parallel conversion, IFFT (frequency domain to time domain), add cyclic prefix, antenna. Receiver: antenna, remove cyclic prefix, FFT (time domain to frequency domain), parallel-to-serial conversion, output signal.
The Transmitter. Figure 1.4 shows a representation of the transmitter. OFDM divides the spectrum into N sub-carriers, each on a different frequency and each carrying a part of the signal, by means of the IFFT (also noted FFT^-1). Unlike FDM, where there is no coordination or synchronisation between the sub-carriers, OFDM links them through the principle of orthogonality. This results in an overlapping of the sub-carriers, see Figure 1.5, where all the sub-carriers can be transmitted simultaneously at tightly spaced frequencies but without interfering with each other.
Figure 1.4: Representation of the OFDM transmitter [14]. The digital signal s[n] represents the data to transfer. It is modulated with QPSK, 16-QAM or 64-QAM constellation mapping to create symbols X_0 ... X_{N-1}. The spectrum then goes through an IFFT (FFT^-1) to transform it into the time domain. The real and imaginary components are converted to the analog domain (DACs) to modulate a cosine and a sine at the carrier frequency f_c. They are then summed into s(t) to be transferred to the receiver via the antenna.
Signals are orthogonal if they are mutually independent of each other. Orthogonality is based on the fact that the product of two different harmonic sub-carriers (sine or cosine waves) integrates to zero over a period. Let us assume two sine sub-carriers of integer frequencies m and n (with angular frequency ω) and multiply them together:

f(t) = sin(mωt) · sin(nωt)    (1.1)
Using the product-to-sum identity, f(t) expands into two sinusoids of frequencies (m − n) and (m + n):

f(t) = (1/2)·cos((m − n)ωt) − (1/2)·cos((m + n)ωt)    (1.2)

As these two components are sinusoids (for m ≠ n), the integral over one period T = 2π/ω is zero:

∫₀^T (1/2)·cos((m − n)ωt) dt − ∫₀^T (1/2)·cos((m + n)ωt) dt = 0    (1.3)

The conclusion is that when two sinusoids of different integer frequencies m and n are multiplied, the area under the product over a full period is zero. For all integers n and m with m ≠ n, sin(mx), cos(mx), sin(nx) and cos(nx) are all orthogonal to each other. These frequencies are called harmonics.
Overlapping gives a better spectrum usage than an FDM modulator, which simply places each carrier next to the others, resulting in interference between them.
Figure 1.5: Spectrum efficiency difference (Δf) between FDM and OFDM. With OFDM, the signals on the sub-carriers are overlapped but still orthogonal to each other. With FDM, the sub-carriers are placed next to each other.
The Receiver. OFDM symbols are transmitted over the channel to the receiver on a single frequency. Basically, the receiver performs the same operations as the transmitter, but in the inverse order. By means of an FFT, an approximation of the source signal is retrieved, as illustrated in Figure 1.6.
Figure 1.6: Representation of the OFDM receiver [14]. The antenna receives the whole spectrum as one signal r(t). It is demodulated (ADCs at the carrier frequency f_c) and, after the cyclic prefix has been eliminated with filters, an FFT algorithm transforms the samples back to the frequency domain. Then each symbol Y_0 ... Y_{N-1} is detected to create an approximation of the original data signal s[n].
Inter-Symbol Interference (ISI). As seen in Figure 1.5, the signals overlap. Transmission over the channel introduces a problem known as Inter-Symbol Interference (ISI): the spread delay of symbol N−1 onto symbol N, where, in the example of Figure 1.5, the last element of symbol 0 is overlapped by the first element of symbol 1 because of the channel.
Spread Delay. The spread delay corresponds to the propagation of a transmitted symbol onto the next one; it is the echo of the first symbol on the second one, as illustrated in Figure 1.7 (a). This physical effect depends on the channel and on the distance between the two symbols.
To avoid this problem, a gap, called the guard interval, longer than the spread delay is needed. Since it is impossible to send nothing at all, samples from the tail of the symbol are copied to its front, as illustrated in Figure 1.7 (b). This principle, explained in [15], is called the cyclic prefix. In theory, this protective prefix should be added to each sub-carrier, but in practice the OFDM signal is a linear combination, so only one cyclic prefix is added, as illustrated in Figure 1.7 (c).
Figure 1.7: The cyclic prefix, which makes it possible to avoid the ISI problems. (a) shows the spread delay problem. (b) shows the addition of a copy of the tail of the signal as a cyclic prefix in the guard interval, according to the theory. (c) shows the cyclic prefix addition in practice, due to the linear combination of the OFDM sub-carriers.
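In code, adding the cyclic prefix described above is a simple copy of the tail of the time-domain symbol in front of it. The sketch below is our own illustration; the function name and buffer layout are assumptions, and `cp_len` must be chosen larger than the channel's spread delay, a system-specific value.

```c
#include <complex.h>
#include <string.h>

/* Prepend a cyclic prefix: copy the last cp_len samples of the n-sample
   OFDM symbol to the front of the output buffer (length n + cp_len). */
static void add_cyclic_prefix(const double complex *symbol, int n,
                              double complex *out, int cp_len)
{
    memcpy(out, symbol + n - cp_len, cp_len * sizeof *out);
    memcpy(out + cp_len, symbol, n * sizeof *out);
}
```

The receiver side simply discards the first cp_len samples before the FFT, which is what the "remove cyclic prefix" block of Figure 1.3 amounts to.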
Section: 1.1 Context 13
-
8/2/2019 Report FFT Implementation 08gr943
14/50
1.1.2.3 Advantages
OFDM provides better spectrum flexibility by overlapping the signals on orthogonal frequencies, the harmonics. It is less sensitive to noise than a single-carrier system. And the ISI problem is solved thanks to the guard interval and the cyclic prefix.
1.1.2.4 Drawbacks
OFDM is sensitive to frequency offsets and synchronisation problems, which can destroy the orthogonality of the carriers. Also, the IFFT output of OFDM can exhibit very high amplitude peaks, which can lead to large power consumption. These peaks, quantified by the Peak-to-Average Power Ratio (PAPR), can be reduced with correction vectors applied to the transmitted signals, but this adds complexity to the OFDM transmitter.
1.1.3 Conclusion on the context
The next step in mobile communication, LTE, focuses on the quality and speed of the data transfer. This is realized with the help of the OFDM modulator/demodulator, which is the most widely used solution to problems such as ISI or fading. One of the key features in OFDM is the IFFT/FFT pair.
In this project, the focus is on the receiver side, hence on the FFT block. In section 1.2, the FFT concept is presented by means of three FFT algorithms, and the issue of parallelizing them is introduced.
1.2 Project subject
1.2.1 Fast Fourier Transformation (FFT)
The group members have selected three FFT algorithms which will be compared. These three algorithms are presented below:

Radix-2 DIT (Decimation In Time) Fast Fourier Transform: This algorithm is chosen because it is the simplest form of the Cooley-Tukey algorithm. Many other algorithms compute the DFT faster than radix-2 (radix-4 and split-radix, for example), but it is important for the project to be able to compare this basic algorithm with better algorithms (Sørensen, Edelman) to show the difference in computation and complexity (explained in 2.4.3).

"Sørensen" Fast Fourier Transform (SFFT): The second algorithm under test is a mix of a Cooley-Tukey algorithm, like split-radix, and Horner's polynomial evaluation scheme. It takes into account the fact that not all the outputs are of interest for the final result, so only some chosen outputs are computed. This makes it possible to avoid many operations which are expensive in time and memory. The Sørensen FFT is well known, and the project results can be compared with other studies. It is an interesting algorithm, in terms of complexity and challenge, to implement and compare with other algorithms like the Radix-2 DIT or Edelman.

"Edelman" Fast Fourier Transform: This algorithm computes the DFT approximately, introducing some errors which are minimal compared to the savings in the number of computations. This kind of algorithm allows the speed of computation to be increased in spite of some errors. The Edelman algorithm is useful for parallel computing.

All the algorithms mentioned above are further developed in section 2.4. However, because of a lack of documentation about the Edelman algorithm, it is disregarded in the project.
1.2.2 Cell Broadband Engine
The purpose of this project is to examine the implementation of FFT algorithms for the OFDM application presented in section 1.1.2 on a multiprocessor platform, namely the Cell Broadband Engine architecture. The Cell BE is, in this project, used for:

- the implementation of parallelized FFT algorithms
- the evaluation of the performance, in particular the execution time, of the implementation of the parallelized FFT algorithms

The Cell BE is constructed as a heterogeneous processor architecture, with multiple executions and memory transfers active at the same time. This architecture is composed of a processor that contains a dual-thread PowerPC unit (PPU) and eight simpler processors, the Synergistic Processing Units (SPUs), which are designed to perform calculations, whereas the PowerPC performs control, data management and scheduling of operations. Each SPU contains a RISC processor and is constructed with two pipelines that can each execute an instruction every cycle. Moreover, the data paths of the arithmetic functional units inside the SPUs are wide (128 bits), allowing the use of Single Instruction Multiple Data (SIMD) instructions. This design produces a processor optimized for computations.
1.2.3 Parallelization
Parallelization is an important part of this project. Indeed, the OFDM receiver requires an FFT as an integral part of the wireless communication. It is essential that this FFT be computed as fast as possible so that the achievable throughput is maximised. To obtain the best performance from the application running on the Cell BE processor, the use of multiple SPUs concurrently is evaluated. The application creates at least as many threads as concurrent SPU contexts are required; each of these threads runs a single SPU context at a time. With this method, the FFT is parallelized and uses some of the features of the Cell BE to accelerate the computation.
1.3 Problem Definition
This work seeks to answer the following question:
"How efficient, in terms of computation speed, is the Cell BE processor for the execution of parallelized FFT algorithms?"
1.4 Project Delimitations
In-depth research on LTE and OFDM is not the purpose of this project, nor is a complete mathematical examination of the FFT algorithms. This project focuses on the use of the Cell BE for the FFT, "probably the single most important tool in digital signal processing (DSP)", according to Sørensen and Burrus [16].
Chapter 2  Analysis
2.1 Overview
The purpose of this chapter is, first, to introduce the design methodology the project group has chosen, in terms of project methodology (the A3 design) and of the way an algorithm is parallelized following established procedures. This chapter then introduces the platform under test, the Cell BE, in part 2.3, followed, in part 2.4, by an analysis of the different chosen FFT algorithms, with an explanation of the reasons for choosing these algorithms.
2.2 Design Methodology
2.2.1 Design Model
The design of the model is divided into three parts, as in the A3 model [17]: Application, Algorithm and Architecture. First of all, Figure 2.1 shows the generic A3 model. Then, this methodology is applied to the specific project presented in this report, as shown in Figure 2.2.

Application: The application is any system with specifications and constraints. These can be time constraints, power consumption, area problems, etc. It is the main purpose of a project.

Algorithm: At this level, existing algorithms are developed. Special algorithms can be created for the application. The algorithms are optimized from a purely mathematical point of view, i.e. the optimizations are only done on the parts of the algorithms directly related to the application.

Architecture: The mapping of the previous algorithms is realised on the selected platform (DSP, FPGA, Cell BE, ...). In case of incompatibility between the specifications/constraints of the application and the results, modifications have to be made. On the one hand, if the algorithms are implemented on an established architecture, the program can be modified in terms of architecture-related code (bus control, data transfer control, memory allocation, ...) for the specified architecture. On the other hand, if the algorithms are fixed, then a modification of the architecture (the VHDL program for an FPGA platform, for example) can be done.

Application: In the application domain, LTE and OFDM are presented in the context section 1.1.
Figure 2.1: The generic A3 design methodology. Specifications and constraints flow from the Application to the Algorithm (algorithmic constraints, algorithmic optimizations) and to the Architecture (architectural constraints, architectural optimizations); the algorithm is mapped onto the architecture, and the results are compared against the specifications, iterating if necessary.
Algorithm: In the algorithm domain, three Fast Fourier Transform algorithms are compared. First of all, an analysis of their derivation is done. Then, the complexity, i.e. the number of computations needed to execute the Fourier Transform, of each algorithm is analysed. Finally, each algorithm is implemented in the C language twice: once for sequential execution and a second time for parallel execution.

Architecture: In the architecture domain, the platform used to implement the different algorithms is analysed. The available hardware and system limitations are studied. Then, it is examined how the compiler is used in order to parallelize programs, and also how to measure the computation cost in terms of resource utilisation, execution speed, etc.
2.3 Cell BE
2.3.1 Architecture
This section presents the architecture used throughout the project, the Cell Broadband Engine. According to the A3 design model, this section belongs to the analysis of the architecture, as illustrated in Figure 2.3.
2.3.1.1 Architecture Overview
The Cell Broadband Engine (CBE) is a multicore processor. It has a Power Processing Element (PPE), which is a dual-thread PowerPC Architecture core, and eight Synergistic Processing Elements (SPEs), which are SIMD (Single Instruction Multiple Data) processor elements. The communication path for commands and data between all processor elements and all chip controllers for memory access or input/output is provided by the Element Interconnect Bus (EIB) [18, p. 41]. An overview of the architecture is presented in Figure 2.4.
In the PlayStation 3, 6 of the 8 SPEs can be used for computation, because one is used by the OS virtualization layer and the other has been disabled for wafer yield reasons [19, p. 5]. That means that when running the operating system, 6 SPEs are available for computation, as shown in Figure 2.4.
Figure 2.2: The A3 model for this project: the application is an OFDM receiver for LTE 4G, the algorithms are the Sørensen, Edelman and Radix-2 FFTs, and the architecture is the Cell BE, with iteration between the levels.
2.3.1.2 Power Processing Element (PPE)
The PPE contains a 64-bit, dual-thread PowerPC Architecture RISC core. It has 32 KB level-1 (L1) instruction and data caches and a 512 KB level-2 (L2) unified (instruction and data) cache. It can run existing PowerPC architecture software and is well suited for executing system-control code. In this project, however, it is used as a managing controller for the SPE threads, and it is assumed that the PPE is fast enough to manage the threads executing on the SPEs. The PPE consists of two main units: the PowerPC processor unit (PPU), which performs instruction execution, and the PowerPC processor storage subsystem (PPSS), which handles memory requests from the PPU and external requests to the PPE from the SPEs [18, p. 41]. The architecture overview of the PPE is presented in Figure 2.5.
In the PlayStation 3, the PPE is clocked at 3.2 GHz, so it can theoretically reach 2 × 3.2 = 6.4 GFLOP/s of IEEE-compliant double-precision floating-point performance. It can also reach 4 × 2 × 3.2 = 25.6 GFLOP/s of non-IEEE-compliant single-precision floating-point performance using the 4-way single instruction multiple data (SIMD) fused multiply-add operation [19, p. 5].
2.3.1.3 Synergistic Processor Element (SPE)
The SPE is a single instruction multiple data (SIMD) processor element optimized for the data-rich operations (such as the computation of FFT butterflies) allocated to it by the PPE. Each SPE has a Synergistic Processor Unit (SPU), which fetches instructions and data from its 256 KB Local Store (LS) and its single register file of 128 entries, each 128 bits wide. Each SPE has a Direct Memory Access (DMA)
Figure 2.3: A3 model for the project. Highlighted in red: the algorithms analyzed in this section.
interface and a channel interface for communicating with its Memory Flow Controller (MFC) and all the other processor elements (PPE and SPEs). The SPE is intended to run its own program, which resides in the LS, and not to run an operating system [18, p. 63]. The architecture overview of the SPE is presented in Figure 2.6.
The SPU functional unit, as shown in Figure 2.7, consists of a Local Store (LS), in which all instructions and data used by the SPU are stored, a Synergistic Execution Unit (SXU), which executes all the instructions, and an SPU Register File Unit (SRF), which stores all data types, return addresses and results of comparisons. The SXU includes 6 execution units:
- SPU Odd Fixed-Point Unit (SFS), which executes byte-granularity shift, rotate-mask and shuffle operations on quadwords.
- SPU Even Fixed-Point Unit (SFX), which executes arithmetic instructions, logical instructions, word shifts and rotates, floating-point compares, and floating-point reciprocal and reciprocal square-root estimates.
- SPU Floating-Point Unit (SFP), which executes single-precision and double-precision floating-point instructions, integer multiplies and conversions, and byte operations. It can perform fully pipelined single-precision (32-bit) floating-point instructions and partially pipelined double-precision (64-bit) instructions.
- SPU Load and Store Unit (SLS), which executes load and store instructions. It also handles DMA requests to the LS.
- SPU Control Unit (SCN), which fetches and issues instructions to the two pipelines, executes branch instructions, arbitrates access to the LS and register file, and performs other control functions.
- SPU Channel and DMA Unit (SSC), which enables communication, data transfer, and control into and out of the SPU. The functions of the SSC are shared with the associated DMA controller in the Memory Flow Controller (MFC).
Figure 2.4: Architecture overview of the Cell Broadband Engine processor. The Element Interconnect Bus (EIB) is the connection between all processor elements and all chip controllers for memory access and input/output access. The Cell Broadband Engine has 1 PowerPC Processor Element and 8 Synergistic Processor Elements. Legend: BEI = Cell Broadband Engine Interface, EIB = Element Interconnect Bus, FlexIO = Rambus FlexIO Bus, IOIF = I/O Interface, MIC = Memory Interface Controller, PPE = PowerPC Processor Element, RAM = Resource Allocation Management, SPE = Synergistic Processor Element, XIO = Rambus XDR I/O cell. Adopted from [18, p. 37].
The Synergistic Execution Unit (SXU) is divided into an even and an odd pipeline (pipelines 0 and 1, respectively) and can complete up to two instructions per cycle, one on each pipeline [18, p. 68]. The odd pipeline provides the data-moving units and the even pipeline provides the data-processing units. Furthermore, each unit of the SXU has a 128-bit-wide datapath, giving the capability to use Single Instruction Multiple Data (SIMD) operations. When the SXU works on 32-bit-wide data, it can thus perform 4 operations with each instruction.
On the PlayStation 3, the SPU runs at a frequency of 3.2 GHz. With 32-bit-wide data, each SPU can thus theoretically provide 2 x 4 x 3.2 = 25.6 GFLOPS (one operation on each pipeline and 4 operations per instruction). Since 6 SPUs are available, this yields a total of 153.6 GFLOPS [18, p. 5].
It must be noted that single-precision floating-point operations do not conform to IEEE 754 because of the following differences:
- Truncation is used in rounding.
- Denormal numbers are treated as zero.
- NaNs are interpreted as normalized numbers.
Double-precision floating point does not have these problems [18, p. 68-69].
Figure 2.5: Architecture overview of the PPE, which consists of a PowerPC Processor Unit (PPU) and a PowerPC Processor Storage Subsystem (PPSS). It has 32 KB level-1 (L1) instruction and data caches and a 512 KB level-2 (L2) unified (instruction and data) cache. Adopted from [18, p. 49].
Figure 2.6: Architecture overview of the SPE, which consists of a Synergistic Processor Unit (SPU) and a Memory Flow Controller (MFC). The SPU has an LS of 256 KB. Adopted from [18, p. 63].
Figure 2.7: Architecture overview of the SPU functional unit. The 256 KB Local Store (LS) is filled from the Element Interconnect Bus (EIB) via the MFC. The SXU contains two fixed-point units and a floating-point unit. The odd pipeline takes care of moving data (fetching instructions to the pipelines, loading and storing data between the LS and the register file, which has 128 entries of 128 bits each), while the even pipeline takes care of data processing (arithmetic and logic instructions). Adopted from [18, p. 64].
2.3.1.4 Element Interconnection Bus (EIB)
One of the main components in the PlayStation 3 is the EIB, which connects all the components together, including the PPE, the SPEs, main memory and all inputs/outputs. The bus has a bandwidth of 25.6 GB/s (96 bytes per clock cycle) and enables multiple concurrent data transfers [18, p. 42].
2.3.1.5 Memory Interface Controller (MIC)
The Memory Interface Controller (MIC) provides the interface between the EIB and the physical mem-
ory. It supports one or two Rambus extreme data rate (XDR) memory interfaces, which together support
between 64 MB and 64 GB of XDR DRAM memory [18, p. 42].
2.3.1.6 Memory System
The PlayStation 3 has dual-channel Rambus extreme data rate (XDR) memory; however, the platform provides a modest 256 MB, of which only about 200 MB are available for the Linux OS and the applications [19, p. 7]. The SPU accesses the RAM through the EIB and moves the data to its LS via DMA transfers, with the MFC of the SPU acting as a DMA controller.
2.3.1.7 Cell Broadband Engine Interface (BEI)
The Cell Broadband Engine Interface (BEI) unit supports I/O interfacing. It manages data transfers between the EIB and I/O devices. The BEI supports two Rambus FlexIO interfaces. One of the two interfaces (IOIF1) supports only a noncoherent I/O interface (IOIF) protocol, which is suitable for I/O devices. The other interface (IOIF0) is software-selectable between the noncoherent IOIF protocol and the memory-coherent Cell Broadband Engine interface protocol [18, p. 42].
2.3.2 Programming the CBE
Programming for the CBE is split into two main tasks: the programming of the PPE, which manages the utilization of the SPEs, and the programming of what is executed on the SPEs.
2.3.2.1 Development platform
The platform used for the project is a PlayStation 3 with a monitor, keyboard, mouse and LAN connection for remote access. The PlayStation 3 is set up with a Linux operating system and a set of development tools:
Fedora 8, Linux kernel 2.6.23.1-42.fc8
IBM SDK 3.0 for the CBE architecture, including:
gcc compiler toolchain for the CBE (ppu-gcc and spu-gcc ver. 4.1.1)
libspe2 - SPE runtime management library ver. 2.2
Makefile from SDK
2.3.2.2 Creating a simple application on a SPE
Generally, applications do not have physical control of the SPEs; the operating system manages these resources. Applications use software constructs called SPE contexts, which are a logical representation of an SPE. The SPE Runtime Management Library (libspe) provides all the functions to manage the SPEs. This library also provides the means for communication and data transfer between the SPEs and the PPE. The flow of running a single SPU program context, as shown in Figure 2.8, is: create an SPE context; load an SPE executable object into the SPE context local store (LS); run the SPE context (the operating system requests the actual scheduling of the context onto a physical SPE); and lastly destroy the SPE context in order to free the memory resources used by the context. It must be noted that running the SPE context is a synchronous call to the operating system, and thus the calling application blocks until the SPE stops executing [20, p. 1]. All functions for SPE context management are described in [20].
Figure 2.8: The flow for running a simple application using an SPE.
2.3.2.3 Creating an application on several SPEs
In order to speed up computation, the project needs to use multiple SPEs concurrently. To achieve this, the application must create at least as many threads as concurrent SPE contexts are required. The library used to achieve this is libspe2, which relies on POSIX (Portable Operating System Interface) threads [20, p. 41]. The flow of running an application on several SPEs is shown in Figure 2.9.
Each of these threads may run a single SPE context at a time. If N concurrent contexts are required, it is common to have a main application thread plus N threads dedicated to SPE context execution. Each dedicated thread issues a request for its context to be run and blocks until the context finishes execution. This blocking does not affect the main program thread, which can still create as many threads as needed. If all SPEs are busy, the threads are queued up and executed in the order in which they were created.
Finally, when all the threads have finished, the main program thread destroys the no-longer-needed SPE contexts.
2.3.2.4 Project directory structure
In order to program the Cell Broadband Engine, the source code is arranged into two folders: one for the ppu code and one for the spu code. Furthermore, to use the makefile definitions supplied by the SDK for producing programs, the line "include $(CELL_TOP)/buildutils/make.footer" has to be included in each makefile. The project directory structure is shown in Figure 2.10.
2.3.2.5 Program compilation
Building the application for the Cell BE requires several steps, as shown in Figure 2.11.
First, all .c files in the ppu folder are compiled using ppu-gcc for PPE programs, and all .c files in the spu folder are compiled using spu-gcc for SPE programs. Next, spu-gcc creates SPE executables from the compiled SPE programs. These executables are embedded into the PPE programs by first creating embedded
Figure 2.9: The flow of running an application using several SPEs.
PPE images of the SPE executables (using ppu-embedspu), next creating PPE libraries (using ppu-ar), and finally compiling the PPE programs again, merging them with the SPE libraries, to obtain the final program FFT (using ppu-gcc).
2.4 FFT algorithms
This section is an introduction to selected FFT algorithms, with a complete analysis of each algorithm to be parallelized in Chapter 3. According to the A3 design model, this section belongs to the Algorithm domain, as illustrated in Figure 2.12. First, the selection of algorithms is discussed. Then, the mathematical forms of the algorithms are developed. Finally, the computational costs are compared.
An FFT algorithm computes the Discrete Fourier Transform (DFT) with minimal complexity. A direct application of the DFT definition has computational complexity O(N^2); the purpose of FFT algorithms is to split the transform so as to obtain a complexity of O(N log N).
Figure 2.10: Project directory structure, which yields two subfolders: one for the ppu program code and one for the spu program code. The Makefile in the program directory lists the subdirectories (DIRS = ppu spu); the Makefile in directory ppu sets the target (PROGRAM_ppu = main); the Makefile in directory spu sets the target (PROGRAM_spu = fft_spu); and each of the three includes $(CELL_TOP)/buildutils/make.footer.
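Based on this layout, the three makefiles can be sketched as follows. The variable names (PROGRAM_ppu, PROGRAM_spu, IMPORTS) follow common IBM SDK 3.0 examples and are assumptions here, not copied from the project's actual build files:

```make
# Makefile in the program directory: recurse into the two subfolders
DIRS = ppu spu
include $(CELL_TOP)/buildutils/make.footer
```

```make
# Makefile in directory spu: build the SPE program
PROGRAM_spu = fft_spu
include $(CELL_TOP)/buildutils/make.footer
```

```make
# Makefile in directory ppu: build the PPE program, linking the
# embedded SPE library (the IMPORTS line is assumed from SDK samples)
PROGRAM_ppu = main
IMPORTS = ../spu/fft_spu.a
include $(CELL_TOP)/buildutils/make.footer
```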
2.4.1 Overview
Many algorithms exist for computing the FFT. The most common is the Cooley-Tukey (CT) algorithm. It uses a divide-and-conquer approach based on recursion: the recursion divides a Discrete Fourier Transform into several smaller DFTs. The algorithm requires O(N) multiplications by twiddle factors, the constant trigonometric coefficients that appear in the course of the derivation developed in Section 2.4.2. James Cooley and John Tukey published this method in 1965, but the algorithm was originally devised by Carl Friedrich Gauss in 1805. The best-known use of the CT algorithm divides the transform into two parts of similar size.
2.4.2 Discrete Fourier Transform
The Discrete Fourier Transform (DFT) presented in [21] is a mathematical tool for digital signal processing (spectral analysis, data compression, partial differential equations, . . . ), analogous to the Continuous Fourier Transform (CFT) used for analog signals. The formula is shown below:

X[k] = \sum_{n=0}^{N-1} x[n] \exp\left(\frac{-2\pi j\, n k}{N}\right) \quad (2.1)

X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk} \quad (2.2)

W_N^{nk} = \exp\left(\frac{-2\pi j\, (n k)}{N}\right) \quad (2.3)

where W_N^{nk} is known as the twiddle factor. The time-domain input data x[n] is a finite series of N samples, n = [0, 1, . . . , N-1], and is transformed to the frequency-domain signal X[k], where k = [0, 1, . . . , N-1].
2.4.3 Cooley-Tukey
This section presents a theoretical analysis of the radix-2 DIT FFT. First, the DFT formula is developed to obtain the radix-2 DIT formula. Then, a data-path derivation is shown to optimize the implementation in code.
Figure 2.11: Flow for CBE program compilation. First, the .c files for PPE programs and SPE programs are compiled using ppu-gcc and spu-gcc, respectively. The compiled SPE programs are turned into SPE executables, which are embedded as PPE images, archived into PPE libraries, and finally linked with the PPE programs to produce the final program FFT.
2.4.3.1 Radix 2 DIT FFT
This section presents the radix-2 FFT implementation [22] used for testing against the Edelman and Sørensen FFT algorithms. It is chosen because it is one of the simplest FFT algorithms, for two reasons: it is well studied, so it can serve as a baseline for comparison, and it is a good way to get acquainted with the FFT. First, the analytical derivation of the radix-2 computation of a DFT is presented.
The radix-2 decimation-in-time rearranges the DFT equation into two parts: a sum over the even-numbered indices n = [0, 2, 4, . . . , N-2] and a sum over the odd-numbered indices n = [1, 3, 5, . . . , N-1], as in the following equations:

X_k = \sum_{m=0}^{N/2-1} x_{2m}\, e^{\frac{-2\pi j}{N}(2m)k} + \sum_{m=0}^{N/2-1} x_{2m+1}\, e^{\frac{-2\pi j}{N}(2m+1)k} \quad (2.4)

X_k = \sum_{m=0}^{N/2-1} x_{2m}\, e^{\frac{-2\pi j}{N}(2m)k} + e^{\frac{-2\pi j}{N}k} \sum_{m=0}^{N/2-1} x_{2m+1}\, e^{\frac{-2\pi j}{N}(2m)k} \quad (2.5)

X_k = \mathrm{DFT}_{N/2}[x(0), x(2), \ldots, x(N-2)] + W_N^k\, \mathrm{DFT}_{N/2}[x(1), x(3), \ldots, x(N-1)] \quad (2.6)

X_k = Even(k) + W_N^k\, Odd(k) \quad (2.7)
Figure 2.12: A3 model for the project. Highlighted in red: the algorithms analyzed in this section.
where k = [0, 1, . . . , N-1]. The previous simplifications show that the radix-2 DIT DFT can be computed as the sum of two length-N/2 DFTs: one over the even indices and one over the odd indices, the latter multiplied by the twiddle factor W_N^k = e^{\frac{-2\pi j}{N}k}. Whereas direct DFT computation requires N^2 complex multiplications and N^2 - N complex additions, the radix-2 DIT rearrangement costs only N^2/2 + N complex multiplications and N^2/2 complex additions.
2.4.3.2 Data path Derivation
One can notice that the radix-2 DIT simplification is recursive. This kind of expression is simple, but not optimal to implement in code, because of memory consumption and scheduling overhead; that is why iterative algorithms are generally preferable.
Another property is described below. The even and odd parts are periodic with period N/2, so Odd(k + N/2) = Odd(k) and Even(k + N/2) = Even(k). In addition, the twiddle factor satisfies W_N^{k+N/2} = -W_N^k. The equations may now be expressed as:

X_k = Even(k) + W_N^k\, Odd(k) \quad (2.8)

X_{k+N/2} = Even(k) - W_N^k\, Odd(k), \quad k = 0, 1, \ldots, \frac{N}{2}-1 \quad (2.9)

The decimation of the data sequence can be repeated until the resulting sequences are reduced to one-point sequences. Thus, for N = 2^n, the decimation can be performed n = log2(N) times. Therefore, the total number of complex multiplications is reduced to (N/2) log2(N) and the number of additions to N log2(N).
Figure 2.13: Eight-point decimation-in-time algorithm.
One can observe that the computation is divided into three stages: four two-point DFTs, then two four-point DFTs and finally one eight-point DFT. Another important observation is the order of the input data. Indeed, the input data have to be reordered (bit-reversed) to obtain the correct sequence for the corresponding output data.
2.4.4 Sørensen
The Sørensen FFT [16] (SFFT) algorithm is used in the project as a test algorithm, like Radix-2 DIT and Edelman. It is also known as Transform Decomposition. Its principle differs from that of standard algorithms, such as Radix-2 DIT or split radix, in terms of the number of input and output data points. Standard algorithms assume that both numbers are equal, as seen in Figure 2.13, where all the outputs are computed. SFFT proceeds differently: only some output points are said to be of interest, and only these points are computed, as illustrated in Figure 2.14.
Considering the DFT definition (2.2):

X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk} \quad (2.10)

where k = [0, 1, . . . , N-1], SFFT supposes that only L output points are of interest. There exist two lengths P and Q such that N
Figure 2.14: There are N inputs, but only one output (X(k) in this example) is computed and used for further operation. The way this is done is explained in the following paragraphs. Modified from Sørensen and Burrus, 1993, figure 4 [16].
divided by P defines Q as:

Q = N/P \quad (2.11)

n = Q n_1 + n_2 \quad (2.12)

with n_1 = [0, \ldots, P-1] and n_2 = [0, \ldots, Q-1]. So the DFT equation (2.2) becomes:

X[k] = \sum_{n_2=0}^{Q-1} \sum_{n_1=0}^{P-1} x[Q n_1 + n_2]\, W_N^{(Q n_1 + n_2) k} \quad (2.13)

X[k] = \sum_{n_2=0}^{Q-1} \left( \sum_{n_1=0}^{P-1} x[Q n_1 + n_2]\, W_P^{n_1 \langle k \rangle_P} \right) W_N^{n_2 k} \quad (2.14)

where \langle k \rangle_P is k modulo P.

X[k] = \sum_{n_2=0}^{Q-1} X_{n_2}[\langle k \rangle_P]\, W_N^{n_2 k} \quad (2.15)

X_{n_2}[j] = \sum_{n_1=0}^{P-1} x[Q n_1 + n_2]\, W_P^{n_1 j} \quad (2.16)

X_{n_2}[j] = \sum_{n_1=0}^{P-1} x_{n_2}[n_1]\, W_P^{n_1 j} \quad (2.17)

x_{n_2}[n_1] = x[Q n_1 + n_2] \quad (2.18)
Equation (2.17) is the equation of a length-P DFT and can be computed with any FFT algorithm, such as Radix-2 DIT or split radix. Sørensen's paper indicates that a split-radix FFT gives a lower operation count, but since the radix-2 FFT has been used previously in the project, it is used here so the results can be compared with the previous ones.
Equation (2.15) shows that Q FFTs of length P have to be computed, as illustrated in Figure 2.14.
2.4.4.1 Complexity
The SFFT complexity depends on the number P, which determines Q, the number of length-P FFTs that have to be performed. The overall complexity then depends on the complexity of the FFT algorithm used for these, Radix-2 DIT or split radix.
2.5 Conclusion of the Analysis section
The Analysis section presents the theoretical foundations of the subject developed in this project. The A3 design methodology is used to organize the project and makes it possible to establish simple, well-defined parts. The application is defined as developing an OFDM receiver for LTE. The algorithm part then describes the 2 FFT algorithms to be used: Radix-2 DIT and Sørensen. The last part corresponds to the architecture on which the algorithms are implemented, namely the Cell Broadband Engine.
The analysis of the Cell Broadband Engine shows a multiprocessor architecture containing one PPE managing the communication between the 6 SPEs available out of 8 on the PlayStation 3 platform. Instructions and data flow over the Element Interconnect Bus (EIB), which connects the PPE, the SPEs and the memories. The SPUs contain RISC cores and are built around two pipelines that can each complete an instruction per cycle. Moreover, the datapaths of the arithmetic functional units inside the SPUs are wide (128 bits), allowing the use of Single Instruction Multiple Data (SIMD) instructions. This yields a processor optimized for computation.
The last part of the Analysis section concerns the FFT algorithms. The FFT (and its inverse, the IFFT, also written FFT^{-1}) is a tool used in digital signal processing for transforming signals from the time domain to the frequency domain, where the useful frequencies can be separated from the added noise. In the case of OFDM, the transmitter contains an IFFT which transforms the digital symbols into the signal to be transmitted. The receiver performs the inverse computation to retrieve the transmitted data amid the noise. These operations are computationally expensive; an efficient multicore architecture designed for computation can reduce the computation time. The project group chooses the Radix-2 DIT as the first algorithm to implement on the CBE. This algorithm, based on the paper by J. Cooley and J. Tukey, is assumed to be one of the simplest FFT algorithms. The second algorithm to be implemented on the CBE is the Transform Decomposition algorithm, known as Sørensen. This algorithm, a bit more complex than Radix-2 DIT, speeds up the computation.
The next chapter deals with the experiments of implementing the FFT algorithms, first on the PPU, then on one SPU and finally on several SPUs. The Radix-2 DIT is the first algorithm used, followed by the Sørensen algorithm.
Chapter 3
Implementation
3.1 Overview
This chapter puts into practice the theoretical analysis developed in Chapter 2. It contains the results of the tests on one or several processors, with different FFT algorithms. All these results are evaluated, compared and discussed. According to the A3 design model, this section belongs to both the Algorithm and Architecture domains, i.e. the mapping of the algorithm onto the architecture, as illustrated in Figure 3.1.
3.2 Cooley-Tukey Implementation
3.2.1 Overall Approach
The tests are carried out with the CT algorithm. First of all, Matlab is used to obtain reference results: its fft function is used to verify that the results of the implementations are correct. This verification covers only the computation results; the Matlab computation time is of no interest. As mentioned in Chapter 2, Section 2.4.3, the CT algorithm is one of the simplest existing FFT algorithms; it is therefore selected for the initial tests, as its sequential implementation is straightforward. These tests also provide elements of comparison for the subsequent implementations.
Then, various types of tests are performed. All the following tests are carried out 10000 times to ensure that the results are meaningful (since the execution is not fully deterministic, due to architectural and OS hazards). The first test is a sequential execution on the main processor (PPU). The second is also a sequential computation, but on one SPU (without data transfer). These two tests show the difference in computation time between the two processors. Then, the parallel implementation on 6 SPUs is performed to evaluate the potential improvement.
Two parameters are evaluated during these tests: the computation time and the number of operations per second. The time is measured with the function gettimeofday [23] and covers the execution of the bit-reverse function, the twiddle factor computation and the butterfly computation. The following formula gives the number of operations per second:

\text{Number of operations per second} = \frac{10 \cdot \frac{N}{2} \log_2 N}{\text{total time}} \quad (3.1)

where N is the length of the FFT and the factor 10 is the number of floating-point operations per butterfly (one complex multiplication and two complex additions).
Figure 3.1: A3 model for the project. Highlighted in red: the mapping developed in this section.
This formula corresponds to the complexity of the CT butterflies seen in Section 2.4.3.2. The computation of the twiddle factors and the bit-reverse function does not affect equation (3.1), because there are no floating-point operations in these functions: bit reversal is only data movement, and the twiddle factor computation involves only cosine and sine, which in this case are not counted as floating-point operations.
3.2.2 Results
A graphical representation of the results is shown in Figure 3.2. This graph shows the computation time of the sequential executions on the PPU and on the SPU. Both are almost linear, which is expected: when the FFT length is doubled, the computation time almost doubles. A further comment on these results is that the SPU computation times are larger than the PPU ones; indeed, the difference between the two increases with the FFT length.
The graph in Figure 3.3 depicts the computation time of the parallel implementation as a function of the FFT length; here the CT algorithm is parallelized on the 6 SPUs of the Cell BE. The larger the FFT length, the larger the computation time. This is an unexpected result: firstly, the computation time for the parallelized version is larger than for the sequential one; secondly, the execution time grows with the FFT length. The explanation is that the data transfers between the main storage (RAM) and the local storage (LS) are very long compared to the computation, i.e. the data transfers are a bottleneck and the SPUs remain idle for significantly long periods of time. Moreover, no optimizations have been implemented so far.
Finally, the number of operations per second is plotted against the number of processors, as depicted in Figure 3.4. For an FFT length of 1024, it can be observed that the number of operations per second (in MFLOPS) decreases when the number of processors increases.
Considering the results of the previous tests (computation on 6 SPUs, Figure 3.3), this result was expected. Indeed, much time (cf. the previous comments) is spent on data transfers when the number of processors increases. Therefore, the number of operations per second (i.e. actual computations) is very low compared to the transfer times.
Figure 3.2: Computation time of a sequential radix-2 FFT implemented on the PPU (dashed blue) and one SPU (continuous red) for different FFT lengths (ranging from 4 to 1024).
Figure 3.3: Computation time of a parallel radix-2 FFT implemented on 6 SPUs for different FFT lengths, ranging from 4 to 1024.
Figure 3.4: Number of operations per second for a parallel radix-2 FFT implemented on different numbers of processors (from 1 to 6).
3.2.3 Optimizations
Intuitively, one would expect that increasing the parallelism would increase the number of operations per second. However, the opposite effect has been observed in the results described above. Therefore, the group members decided to evaluate whether the computation time can be reduced by means of several optimization techniques, described in what follows.
Problem of data transfers: the time for performing data transfers between the PPU and the SPUs is higher than the computation time. Several methods have been used to reduce this time, as described in the following paragraphs.
3.2.3.1 Deterministic twiddle factors
The twiddle factors have been made constants on the SPU. If the FFT length is fixed, the twiddle factors are deterministic (they can be computed in advance). Instead of passing them as arguments to the SPU, they are stored in the Local Store of the SPU. The twiddle factors are complex values with real and imaginary parts. Assuming 32-bit floats and a 1024-point FFT, the size of these data is 512 twiddle factors x 2 floats (real and imaginary parts) x 4 bytes per float = 4096 bytes. That is not a problem for the LS, because it is only 4096 bytes out of the 256 KB. This technique avoids wasting precious EIB bandwidth.
3.2.3.2 Double Buffering
One of the methods to transfer data between the PPU and the SPU uses Direct Memory Access (DMA). This section presents a technique called double buffering. To perform computation on the SPU, the program has to transfer data from main storage to the LS using DMA transfers. For example, consider an SPU program that repeats the following steps:
1. Transfer data via DMA from the main storage to the LS buffer B,
2. Wait for the transfer to complete,
3. Compute on the data in buffer B.
This sequence is not efficient, because the SPU has to wait for the complete transfer of the data before it can compute on the buffer, wasting much time. Figure 3.5 illustrates this scenario.
Figure 3.5: Serial computation and data transfer. Modified from [24].
This process can be significantly accelerated by using double buffering. Two buffers, B0 and B1,
are allocated, allowing overlapping computation on one buffer with data transfer in the other one. The
scheme is shown in figure 3.6.
Double buffering is achieved by using tag-group identifiers [25]. All transfers involving buffer B0
(respectively B1) are assigned to tag-group ID 0 (respectively ID 1). Then, software sets the tag-group
mask to include only tag ID 0 (respectively tag ID 1) and requests a conditional tag status update. This
ensures that the computation does not begin before the transfer to the buffer is complete. Figure 3.7
shows the resulting execution in time.
Double buffering is used in the project to transfer the data structure from the PPU to the SPU. This
structure is described below:
Figure 3.6: Double Buffering scheme (initiate DMA transfers from EA to the LS buffers B0 and B1;
wait for the transfer to B0 to complete, compute on B0 and re-initiate its transfer; then do
the same for B1, in a loop). Modified from [24]
Figure 3.7: Parallel computing and transfer. Double buffering is more efficient than the approach pre-
sented in Figure 3.5, as the SPU does not have to wait for the data: one part can be computed
in buffer B0 while the next data is in the DMA transfer to B1. Modified from [24]
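The pattern can be sketched in a platform-neutral way as follows. This is our own illustration, not the project code: the Cell-specific MFC calls (an mfc_get tagged per buffer, followed by a tag-group wait) are replaced by a synchronous dma_get placeholder, so only the control flow of the two alternating buffers is shown.

```c
#include <assert.h>
#include <string.h>

#define CHUNK  256   /* elements per transfer (illustrative size) */
#define CHUNKS 8     /* number of chunks to process               */

static float source[CHUNKS][CHUNK]; /* stands in for main storage (EA) */
static float buf[2][CHUNK];         /* buffers B0 and B1 in the LS     */

/* Placeholder for an asynchronous DMA get; on the SPU this would be an
   mfc_get tagged with the buffer index, and the tag-group wait that must
   precede the computation is marked in the loop below. */
static void dma_get(float *ls, const float *ea)
{
    memcpy(ls, ea, sizeof(float) * CHUNK);
}

static float compute(const float *b)   /* dummy work: sum one chunk */
{
    float s = 0.0f;
    for (int i = 0; i < CHUNK; i++)
        s += b[i];
    return s;
}

float process_all(void)
{
    float total = 0.0f;
    int cur = 0;

    dma_get(buf[cur], source[0]);             /* prefetch chunk 0 into B0 */
    for (int c = 0; c < CHUNKS; c++) {
        int nxt = cur ^ 1;
        if (c + 1 < CHUNKS)
            dma_get(buf[nxt], source[c + 1]); /* start the next transfer into the other buffer */
        /* (SPU: wait here on the tag group of buf[cur]) */
        total += compute(buf[cur]);           /* compute while the other transfer runs */
        cur = nxt;
    }
    return total;
}
```

With real asynchronous transfers, the dma_get for the next chunk overlaps the compute on the current one, which is exactly what Figure 3.7 depicts.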
typedef struct complex {
    float real;
    float imag;
} complex_t;

typedef struct {
    complex_t *input;
    complex_t *output;
    complex_t *twiddle;
    int count;
} spe_arg_t;
The structure spe_arg_t is passed as an argument from the PPU to the SPU. While the computation of
one butterfly is being performed by means of the first buffer transfer, the second buffer is transferring the
data for the computation of the next butterfly. Although the twiddle-factor and double-buffering methods
have been implemented, no significant improvement of the data transfer time has been observed (since
the results are the same with or without these two methods, the corresponding numbers are not repeated
here).
3.2.3.3 Large amount of data
After further consideration, the group members wanted to evaluate the hypothesis that, to gain anything
from using double buffering, a larger amount of data must be transferred: the EIB only becomes efficient
if it can work for longer durations of time. So, in a new experiment, instead of sending the input data in
1024 separate transfers, half of the data was sent to the SPU; the calculations then started while the other
half was being sent. Although this method has been implemented, no improvement of the computation
time has been measured.
3.2.3.4 Computation of several stages on the same SPU
The goal of all the previous optimizations is to reduce the data transfer time. Regarding figure 3.8, the
first four data (x(0), x(4), x(2), x(6)) are used together in stages 1 and 2. It means that only one transfer
is necessary from the PPU to the SPU to compute these four values in stages 1 and 2. If this method is
applied to a 1024-point FFT on 4 SPUs, 256 data (1024/4) are transferred to each SPU. It means that each
SPU computes 128 butterflies (2^7), thus each SPU computes the first seven stages with only one transfer
of 256 data. This optimization is only possible with a power-of-2 number of processors. Then, the last three
stages are computed on the 4 SPUs like the method described in part [].
Figure 3.8: Eight-point decimation-in-time algorithm, implemented on two SPUs. The first two stages
are computed with only one transfer from the main storage to the LS. Modified from [24]
The results are interesting. By means of this method, the computation time is improved: for a 1024-point
FFT, the time without optimization is 30 ms on 2 SPUs, whereas with this optimization the computation
time is 7 ms. This result shows two things. Firstly, the data transfer time is the problem (which was only a
supposition until this point); indeed, the time is divided by 4.3 thanks to sending less data. Secondly, the
improvement is not enough, because the computation time of the parallel implementation is still larger
than the sequential one. Another algorithm (Sørensen) has been analysed in part 2.4.4; its implementation
is developed in the following part. This implementation is better than Radix-2 DIT in terms of computation
time, as shown in section 3.3.
3.3 Sørensen Implementation
3.3.1 Overall Approach
The following tests are carried out on the Sørensen algorithm. The reference results come from Matlab
and are the same as presented in section 3.2. This implementation allows comparing the differences with
the CT algorithm. According to the theoretical analysis in section 2.4.4, the results should be better (in
terms of execution time) with Sørensen than with CT Radix-2. Indeed, the Sørensen algorithm divides a
large FFT into small FFTs, which facilitates the parallelization. Various tests are performed on Sørensen;
however, in order to compare the results with those of CT, the same type of tests as those used for CT
are carried out. Two sequential implementations on the PPU are performed: one with Q set to 2 and the
other one with Q set to 4 (Q is the number of small FFTs, as seen in part 2.4.4). Then, the parallel
implementation is tested to see the potential improvement. There are also two different values for Q (2
and 4); therefore, the parallel implementation is performed on 2 and 4 SPUs. The same parameters as
for the CT algorithm are measured. The measurement of the computation time covers the reordering,
compute_fft and recombination functions. The function used to measure the time is still gettimeofday [23].
Then, the number of operations per second is evaluated as well, but with a different formula, because the
complexity of the computation is not the same as for CT. The formula is given in equation 3.2:
GFLOPS = (5 Q P log2(P) + 8 (Q - 1) L) / total_time    (3.2)

where Q is the number of small FFTs, P the number of input data for each small FFT, and L the
desired number of output data.
The number of operations per second only concerns the computation of the small FFTs and the
recombination function. The reordering function performs no floating-point computations; it only
consists of data reordering by means of data moves.
3.3.2 Results
The graph in Figure 3.9 shows the computation time of the sequential execution on the PPU. There are
two lines: one (continuous red) for a division of the large FFT into two smaller ones (Q = 2) and another
one (dashed blue) for a division into four smaller ones (Q = 4). The execution time for Q = 2 is always
smaller than for Q = 4. That is expected, because the complexity depends on the chosen subdivision
factor P, as this defines Q, the number of small FFTs performed: the larger the factor Q is, the larger
the number of computations. Therefore, for a sequential execution, the time increases with the number
of calculations, which explains the behaviour of these measures. Moreover, these two curves are almost
linear, which is also expected, because increasing the number of input data increases the computation time.
Figure 3.10 shows the computation time for the two parallel implementations (Q = 2 and Q = 4, i.e. 2 and
4 SPUs, respectively) according to the FFT length (from 4 to 1024). It appears that the execution time is
always larger for a parallelization on 4 SPUs. Thus, it can be deduced that the problem still comes from
the time needed to transfer the data between the PPU and the SPUs, as for the parallel implementation
of the CT algorithm. However, the positive aspect in this case is that the computation time becomes
almost constant when the FFT length is increased (due to the effect of the pipeline). The computation
time of the parallel implementations is always larger than the sequential one. Moreover, the 4-SPU
execution is slower than the 2-SPU one. This is an expected result, because there are only 2 data
transfers for Q = 2 whereas 4 are needed when Q = 4.
3.3.3 Comparison with the CT algorithm
The goal of this section is to compare the results of the Sørensen implementation with the different measures
obtained for the CT Radix-2 DIT implementation. Indeed, although several optimizations have been
applied to the CT implementation, the results (especially for the computation time), due to the data
transfers between the PPU and the SPUs, are larger than the sequential execution time. Figure 3.11 shows
the parallel implementation of the Sørensen algorithm (Q = 2, i.e. 2 SPUs) and of CT (on 6 SPUs).
Although the parallel implementation of Sørensen also suffers from the data transfer problem, it
appears that it is better than Radix-2 DIT in terms of computation time. The explanation is simple: in
Sørensen's algorithm, data are transferred one and only one time for the computation of one small FFT,
whereas for Radix-2 DIT, even with the optimizations, data are transferred four times (N = 1024), as seen
in section 3.2.2. The conclusion is that Sørensen is better fitted for parallel implementation due to its
Figure 3.9: Computation time of a sequential Sørensen FFT implemented on the PPU for Q = 2
(continuous red) and Q = 4 (dashed blue) with different FFT lengths (ranging from 4 to 1024).
Figure 3.10: Computation time of a parallel Sørensen FFT for Q = 2 (i.e. 2 SPUs) (continuous red) and
Q = 4 (i.e. 4 SPUs) (dashed blue) with different FFT lengths (ranging from 4 to 1024).
Figure 3.11: Comparison of the computation time of a parallel Sørensen FFT for Q = 2 (i.e. 2 SPUs)
(dashed green), for Q = 4 (i.e. 4 SPUs) (dotted blue) and a parallel CT Radix-2 DIT FFT
on 6 SPUs (continuous red) with FFT lengths from 4 to 1024.
Figure 3.12: Comparison of the number of operations per second for a parallel Sørensen FFT (Q = 2)
(dashed blue) and a parallel Radix-2 DIT FFT on 6 SPUs (continuous red) with different
FFT lengths (ranging from 4 to 1024).
design as compared to CT. The time needed for data transfer thus appears to be the main limiting factor,
but with a better knowledge of these transfers, the Sørensen algorithm is most likely easily applicable
to this kind of parallel architecture. Another element of comparison is the number of computations per
second. Figure 3.12 shows this variable (in MFLOPS) for the Sørensen implementation on 2 processors
and for the CT Radix-2 DIT algorithm on 6 SPUs according to the FFT length (from 4 to 1024).
Please note that this is a different type of measure as compared to the execution time (here, larger
numbers indicate better performance). The number of floating-point operations per second is larger for
the Sørensen implementation than for the Radix-2 one. This can be explained by the fact that the
Sørensen algorithm performs more computations (compared to Radix-2): indeed, the recombination step
(cf. 3.2.2), after the computation of the small FFTs, adds some computations. Furthermore, Figure 3.11
has shown that the Sørensen FFT was faster than CT in terms of computation time. These two
explanations, combined together, explain the trend observed in Figure 3.12.
Chapter 4
Conclusion & Perspectives
4.1 Conclusions
The next step in mobile communication, LTE, focuses on the quality and speed of the data transfer. This
is realized with the help of the OFDM modulator/demodulator, which is the most used solution to problems
such as ISI or fading. One of the key features in OFDM is the IFFT/FFT pair. To speed up the data
transfer provided by OFDM, an improvement of the computation speed of the IFFT/FFT can be
sought. With the latest multiprocessor platforms, the speed-up can be improved even more, as long as the
data transfer protocol between the different parts of the architecture is well managed.
The goal of this 9th semester ASPI project is to answer the problem defined in section 1.3 as follows:
"How efficient, in terms of computation speed, is the Cell BE processor for the execution of parallelized
FFT algorithms?"
First of all, an analysis of the Cell BE has been done to determine whether this multicore architecture can
speed up a common tool of digital signal processing, the FFT. It appears that the Cell BE, combined with
the SIMD method, constitutes a processor optimized for computations, and it is therefore able to improve
the computation speed of the execution of FFT algorithms. To evaluate the efficiency of parallelized FFT
algorithms on the Cell BE processor, two FFT algorithms are used (Radix-2 DIT and Sørensen). The first
uses log2(N) stages with N/2 butterflies at each stage, which means that each stage can be parallelised.
This algorithm is assumed to be one of the simplest FFT algorithms; it is not the most efficient but the
easiest to establish. The Sørensen algorithm splits the FFT into smaller FFTs, which means that each
smaller FFT can be computed on one processor in a parallel scheduling.
Then, during the implementation, these algorithms are computed only with the PPE of the Cell BE.
Using these algorithms only on the PPE provides a computation without any parallelisation, to serve as
a reference against the same algorithms with a parallelised computation. Radix-2 DIT is implemented
first. The comparison between the PPU-only implementation and the multiple-SPU implementation shows
that the data transfers between the PPU and the SPUs cause a waste of time, and the results are
unexpected, in the sense that they show less efficiency than the unparallelised algorithm. Optimizations,
like the double-buffering method, are applied to reduce the data transfer time, but without any
improvement. The Sørensen algorithm is implemented next and shows an improvement of the computation
time in comparison with the Radix-2 DIT implementation. However, the results of this implementation
are still below the theoretical computation power the Cell BE can provide.
4.2 Perspectives
4.2.1 Short-term perspectives
The short-term perspectives for this project concern the usage of another op