8/2/2019 Report FFT Implementation 08gr943
FFT Parallelization for OFDM Systems
9TH SEMESTER PROJECT, AAU
APPLIED SIGNAL PROCESSING AND IMPLEMENTATION (ASPI)
Group 943
Jeremy LERESTEUX
Jean-Michel LORY
Olivier LE JACQUES
AALBORG UNIVERSITY
INSTITUTE FOR ELECTRONIC SYSTEMS
Fredrik Bajers Vej 7, DK-9220 Aalborg East. Phone: 96 35 80 80. http://www.esn.aau.dk
TITLE:
FFT Parallelization
for OFDM Systems
THEME: Parallel Architecture Processing
FFT implementation
PROJECT PERIOD:
9th Semester
September 2008 to January 2009
PROJECT GROUP:
ASPI 08gr943
PARTICIPANTS:
Jeremy Leresteux
Jean-Michel Lory
Olivier le Jacques
SUPERVISORS:
Yannick Le Moullec (AAU)
Ole Mikkelsen (Rohde&Schwarz)
Jes Toft Kristensen (Rohde&Schwarz)
PUBLICATIONS: 8
NUMBER OF PAGES: 46
APPENDICES: 1 CD-ROM
FINISHED: 5th of January 2009
Abstract
This 9th semester project for the Applied Signal Processing and Implementation specialization at Aalborg University is a study of the parallelization of FFT algorithms for OFDM receivers on the Cell BE. The project focuses on mobile applications, such as LTE, which require efficient bandwidth utilization. This can be achieved by means of OFDM technology. A significant part of the OFDM workload is the IFFT/FFT operations. This can be exploited by parallelizing special FFT algorithms to obtain a lower operation count and thereby reduce computation time. This project investigates the possibilities and differences, with regard to time usage, of computing FFT algorithms on the multiple processors of the Cell BE. First of all, LTE and OFDM are defined and explained. Then, two Fast Fourier Transform algorithms, a Radix-2 DIT FFT and a Sørensen FFT (SFFT), are examined and mapped onto the Cell BE processor architecture. Afterwards, both algorithms are tested and the results are discussed. The SFFT algorithm turns out to be better than the Radix-2 DIT algorithm in terms of execution time and performance. In the conclusion, an assessment is made and future perspectives are discussed.
Preface
This report is the documentation for a 9th semester project in Applied Signal Processing and Implementation (ASPI) entitled FFT Parallelization for OFDM Systems at Aalborg University (AAU). It is prepared by group 08gr943 and spans from September 2nd, 2008 to January 5th, 2009. The project is supervised by Yannick Le Moullec, Assistant Professor at AAU, and by Jes Toft Kristensen and Ole Mikkelsen from the company Rohde & Schwarz Technology Center A/S in Aalborg. The report is divided into four parts, corresponding to the introduction of the project, the analysis, the implementation and the conclusion.

Jeremy Leresteux    Jean-Michel Lory    Olivier Le Jacques

Aalborg, January 5th, 2009
Contents
Preface 4
1 Introduction 7
1.1 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.1 Long Term Evolution (LTE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.2 Orthogonal Frequency-Division Multiplexing (OFDM) . . . . . . . . . . . . . . . 10
1.1.3 Conclusion on the context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2 Project subject . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.1 Fast Fourier Transformation (FFT) . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.2 Cell Broadband Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2.3 Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4 Project Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Analysis 16
2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Design Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Design Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Cell BE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.2 Programming of the CBE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 FFT algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.2 Discrete Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.3 Cooley-Tukey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.4 Sørensen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 Conclusion of the Analysis section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 Implementation 32
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Cooley-Tukey Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Overall Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.3 Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Sørensen Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.1 Overall Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.3 Comparison with the CT algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 Conclusion & Perspectives 45
4.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.1 Short term perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.2 Long term perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Bibliography 47
List of Figures 49
Chapter 1  Introduction
1.1 Context
In 1981, Nordic Mobile Telephony (NMT) led to the commercialization of the first mobile phone (referred to as 1st Generation(1)). By the 29th of November 2007, 3.3 billion mobile phones had been identified worldwide [1]. Most of these phones are GSM phones (2G), but the 3rd Generation phones, which can provide features like web browsing or videoconferencing, approached half a billion devices at the end of September 2007. 3G phones have good capabilities, but a new generation (4G), with even better capabilities including higher bandwidth and more flexibility, is approaching. See Figure 1.1 for a summary of the history of the mobile phone generations.
1.1.1 Long Term Evolution (LTE)
LTE (Long Term Evolution) [2] is the next major step in mobile radio communication. It is one of the best candidates for the 4th Generation of mobile wireless data transfer. Its development was started in 2004 by 3GPP [3] and several European mobile manufacturers and operators [4].
802.16m WiMAX (Worldwide Interoperability for Microwave Access) is another candidate [5] for the 4G appellation. It is developed by the IEEE and headed by Intel [6]. The last candidate is Ultra Mobile Broadband (UMB), developed by 3GPP2 [7] and headed by Qualcomm (it was decided on November 13, 2008 to stop UMB development in favor of LTE [8]). This project only considers LTE; WiMAX and UMB are therefore disregarded.
LTE's major aim is to improve on the 3G UMTS (Universal Mobile Telecommunication System). It has ambitious requirements regarding spectrum efficiency, lower costs, higher capacity, improved services such as video conferencing and VoIP (Voice over Internet Protocol) communication, lower latency, and better integration with other standards.
The 3GPP Release 8 [9] specifies what the LTE requirements shall be (only the most significant ones are listed here):

Peak data rate
- Instantaneous downlink peak data rate of 100 Mb/s within a 20 MHz downlink spectrum allocation (5 bps/Hz)

(1) Generation: term used to define the technology used in mobile communication. 1G is NMT, 2G is GSM and 3G is UMTS/HSPA.
Figure 1.1: Standardization evolution track, 1990-2014: 2G GSM (40 kbps; first call made in 1991; GPRS in 2000; EDGE in 2003; 3.3 billion subscribers), 3G (UMTS in 2001; WCDMA, 384 kbps, 297 million subscribers; HSDPA in 2005; HSPA, 14.4/28/40 Mbps, 55 million subscribers; Evolved HSPA/HSUPA in 2008) and 4G LTE (100 Mbps, planned in 2009). GSM is Global System for Mobile communications, GPRS is General Packet Radio Service, UMTS is Universal Mobile Telecommunications System, WCDMA is Wideband Code Division Multiple Access, EDGE is Enhanced Data Rates for GSM Evolution, HSPA is High Speed Packet Access, HSDPA is High-Speed Downlink Packet Access, HSUPA is High-Speed Uplink Packet Access and LTE is Long-Term Evolution. Modified from [10]
- Instantaneous uplink peak data rate of 50 Mb/s within a 20 MHz uplink spectrum allocation (2.5 bps/Hz)

Latency
- Transition time of less than 100 ms from a camped state (Idle Mode) to an active state
- Less than 5 ms in unloaded condition (i.e. single user with single data stream) for a small IP packet

User capacity and throughput
- At least 200 users per cell should be supported in the active state for spectrum allocations up to 5 MHz
- Downlink: average user throughput per MHz of 3 to 4 times HSDPA (High-Speed Downlink Packet Access, the 3.5G downlink protocol)
- Uplink: average user throughput per MHz of 2 to 3 times HSUPA (High-Speed Uplink Packet Access, the 3.5G uplink protocol)

Spectrum efficiency
- Downlink: in a loaded network, target for spectrum efficiency (bits/sec/Hz/site) of 3 to 4 times HSDPA
- Uplink: in a loaded network, target for spectrum efficiency (bits/sec/Hz/site) of 2 to 3 times HSUPA

Coverage
- The throughput, spectrum efficiency and mobility targets above should be met for 5 km cells, with a slight degradation for 30 km cells. Cell ranges up to 100 km should not be precluded.

Complexity
- Minimize the number of options
- No redundant mandatory features
These characteristics are achieved thanks to the E-UTRA air interface. E-UTRA is the acronym for Evolved Universal Terrestrial Radio Access. It is the successor of the GERAN/UTRAN (GSM EDGE Radio Access Network / UMTS Terrestrial Radio Access Network) 2G/3G air interfaces. Also designed by 3GPP, its requirements are as follows:

Mobility
- E-UTRAN should be optimized for low mobile speeds from 0 to 15 km/h
- Higher mobile speeds between 15 and 120 km/h should be supported with high performance
- Mobility across the cellular network shall be maintained at speeds from 120 km/h to 350 km/h (or even up to 500 km/h depending on the frequency band)

Spectrum flexibility
- E-UTRA shall operate in spectrum allocations of different sizes, including 1.25 MHz, 1.6 MHz, 2.5 MHz, 5 MHz, 10 MHz, 15 MHz and 20 MHz, in both the uplink and the downlink. Operation in paired and unpaired spectrum shall be supported.

Co-existence and inter-working with 3GPP Radio Access Technology (RAT)
- Co-existence in the same geographical area and co-location with GERAN/UTRAN on adjacent channels
- E-UTRAN terminals also supporting UTRAN and/or GERAN operation should be able to support measurement of, and handover from and to, both 3GPP UTRAN and 3GPP GERAN
- The interruption time during a handover of real-time services between E-UTRAN and UTRAN (or GERAN) should be less than 300 ms
E-UTRA is the air interface which enables the communication between a BTS (Base Transmitter Station) and a UE (User Equipment). The signal modulation used by the BTS and the demodulation used by the UE are slightly different, but both are based on the same technology, namely Frequency-Division Multiplexing, which gives them many similarities. SC-FDM (Single Carrier Frequency-Division Multiplexing) is used for the transmitter part and OFDM (Orthogonal Frequency-Division Multiplexing) for the receiver part. This has been decided by the 3GPP members and is summarized in Release 8.
This project is related to the OFDM aspect at the receiver side. Section 1.1.2 gives an overview of OFDM fundamentals.
1.1.2 Orthogonal Frequency-Division Multiplexing (OFDM)
This section is based on the information provided in these two papers: [11] [12].
OFDM is a modulation technique used in most of the new wireless technologies such as IEEE 802.11a/b/g, 802.16, HiperLAN/2, DVB (digital TV) and DAB [13]. The 3GPP members selected it as the LTE/E-UTRA downlink protocol, i.e. for the system which receives data and communication packets from a transmitter. As indicated at the end of section 1.1.1, the selected uplink protocol, SC-FDM, presents similarities to OFDM, which is why this section only introduces OFDM on the transmitter and receiver sides.
1.1.2.1 Overview
With standard single-carrier transmitters, the signal is spread over multiple transmission paths. Because of the environment (buildings, cars, distance), the signal becomes weaker and distorted. This phenomenon, called fading, appears for example when signals are reflected off buildings. The reflected signals arrive at the receiver later than the main signal, which results in distortions, as illustrated in Figure 1.2.
Figure 1.2: Multipath propagation. A transmitted signal is spread over different paths (buildings, mobile and fixed obstacles) and, depending on the obstacles met and the distance covered, the distortion is more or less pronounced. Modified from [12].
These distortions are a major problem when establishing secure high-speed data transfers such as those used on 3G UMTS cell phones. OFDM addresses this distortion problem. It does not avoid reflections, but its characteristics make a transmission safer, in the sense that data packets are always present, by permitting multiple signals to be sent over a single radio channel. OFDM is a multi-carrier transmission/reception scheme, i.e. it can send/receive signals to/from several users. The next subsections describe the main principles of OFDM on the transmitter and receiver sides.
1.1.2.2 OFDM Principles
OFDM distributes the data over a large number of carriers at different frequencies. This spacing provides the orthogonality which prevents the receivers from seeing wrong frequencies. In contrast to other multi-carrier techniques, like CDMA, OFDM prevents Inter-Symbol Interference (ISI) by adding a cyclic prefix, which is explained in the section on Inter-Symbol Interference below.
One of the key features of OFDM is the IFFT/FFT pair. These two mathematical tools are used here to transform several signals on different carriers from the frequency domain to the time domain in the IFFT (or FFT^-1) and from the time domain back to the frequency domain in the FFT. Figure 1.3 shows the principle with the main parts of an OFDM system.
Figure 1.3: Main principle of an OFDM transmitter/receiver. Transmitter: input signal, serial-to-parallel conversion, IFFT (frequency domain to time domain), add cyclic prefix, antenna. Receiver: antenna, remove cyclic prefix, FFT (time domain to frequency domain), parallel-to-serial conversion, output signal.
The Transmitter. Figure 1.4 shows a representation of the transmitter. OFDM divides the spectrum into N sub-carriers, each on a different frequency and each carrying a part of the signal, by means of the IFFT (also noted FFT^-1). Unlike FDM, where there is no coordination or synchronisation between the sub-carriers, OFDM links them through the principle of orthogonality. This results in an overlapping of the sub-carriers, see Figure 1.5, where all the sub-carriers can be transmitted simultaneously at tightly spaced frequencies but without interfering with each other.
Figure 1.4: Representation of the OFDM transmitter [14]. The digital signal s[n] represents the data to transfer. It is modulated with QPSK, 16-QAM or 64-QAM constellation mapping to create symbols X_0 ... X_{N-1}. The spectrum then goes through an IFFT (FFT^-1) to transform it into the time domain. The real and imaginary components are converted to the analog domain (DACs) to modulate a cosine and a sine at the carrier frequency f_c. They are then summed into s(t) to be transferred to the receiver via the antenna.
Signals are orthogonal if they are mutually independent of each other. Orthogonality is based on the fact that the product of two different harmonic sub-carriers (sine or cosine waves) integrates to zero over a period. Let us assume two sine sub-carriers of integer frequencies m and n (with angular frequency ω) and multiply them together:

f(t) = sin(mωt) · sin(nωt)    (1.1)
Using the product-to-sum identity, f(t) expands into two sinusoids of frequencies (m − n) and (m + n):

f(t) = (1/2)·cos((m − n)ωt) − (1/2)·cos((m + n)ωt)    (1.2)

As these two components are sinusoids (for m ≠ n), the integral over one period T = 2π/ω is zero:

∫₀^T (1/2)·cos((m − n)ωt) dt − ∫₀^T (1/2)·cos((m + n)ωt) dt = 0    (1.3)

The conclusion is that when two sinusoids of different integer frequencies m and n are multiplied, the area under the product over a full period is zero. For all integers n and m with m ≠ n, sin(mx), cos(mx), sin(nx) and cos(nx) are all orthogonal to each other. These frequencies are called harmonics.
Overlapping gives a better spectrum usage than an FDM modulator, which simply places each carrier next to the others, resulting in interference between them.
Figure 1.5: Spectrum efficiency difference (Δf) between FDM and OFDM. With OFDM, the signals on the sub-carriers are overlapped but still orthogonal to each other. With FDM, the sub-carriers are placed next to each other.
The Receiver. OFDM symbols are transmitted over the channel to the receiver on a single frequency. Basically, the receiver performs the same operations as the transmitter, but in the inverse order. By means of an FFT, an approximation of the source signal is retrieved, as illustrated in Figure 1.6.
Figure 1.6: Representation of the OFDM receiver [14]. The antenna receives the whole spectrum as one signal r(t). It is demodulated (ADCs at the carrier frequency f_c) and, after the cyclic prefix has been eliminated with filters, an FFT algorithm transforms the samples back to the frequency domain. Then each symbol Y_0 ... Y_{N-1} is detected to create an approximation of the original data signal s[n].
Inter-Symbol Interference (ISI). As seen in Figure 1.5, the signals overlap. Transmission over the channel introduces a problem known as Inter-Symbol Interference (ISI): the spread delay of symbol N−1 onto symbol N, where, in the example of Figure 1.5, the last element of symbol 0 is overlapped by the first element of symbol 1 because of the channel.
Spread Delay. The spread delay corresponds to the propagation of a transmitted symbol onto the next one; it is the echo of the first symbol on the second one, as illustrated in Figure 1.7 (a). This physical effect depends on the channel and on the distance between the two symbols.
To avoid this problem, a gap, called the guard interval, longer than the spread delay is needed. Since it is impossible to send nothing at all, samples from the tail of the symbol are copied to its front, as illustrated in Figure 1.7 (b). This principle, explained in [15], is called the cyclic prefix. In theory, this protective prefix should be added to each sub-carrier, but in practice the OFDM signal is a linear combination, so only one cyclic prefix is added, as illustrated in Figure 1.7 (c).
Figure 1.7: The cyclic prefix, which makes it possible to avoid the ISI problems. (a) shows the spread delay problem. (b) shows the addition of a copy of the tail of the signal as a cyclic prefix in the guard interval, according to the theory. (c) shows the cyclic prefix addition in practice, due to the linear combination of the OFDM sub-carriers.
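In code, adding the cyclic prefix described above is a simple copy of the tail of the time-domain symbol in front of it. The sketch below is our own illustration; the function name and buffer layout are assumptions, and `cp_len` must be chosen larger than the channel's spread delay, a system-specific value.

```c
#include <complex.h>
#include <string.h>

/* Prepend a cyclic prefix: copy the last cp_len samples of the n-sample
   OFDM symbol to the front of the output buffer (length n + cp_len). */
static void add_cyclic_prefix(const double complex *symbol, int n,
                              double complex *out, int cp_len)
{
    memcpy(out, symbol + n - cp_len, cp_len * sizeof *out);
    memcpy(out + cp_len, symbol, n * sizeof *out);
}
```

The receiver side simply discards the first cp_len samples before the FFT, which is what the "remove cyclic prefix" block of Figure 1.3 amounts to.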
Section: 1.1 Context 13
-
8/2/2019 Report FFT Implementation 08gr943
14/50
1.1.2.3 Advantages
OFDM provides better spectrum flexibility by overlapping the signals on orthogonal frequencies, the harmonics. It is less sensitive to noise than a single-carrier system. And the ISI problem is solved thanks to the guard interval and the cyclic prefix.
1.1.2.4 Drawbacks
OFDM is sensitive to frequency offsets and synchronisation problems, which can destroy the orthogonality of the carriers. Also, the IFFT output of OFDM can exhibit very high amplitude peaks, which can lead to large power consumption. These peaks, quantified by the Peak-to-Average Power Ratio (PAPR), can be reduced with correction vectors applied to the transmitted signals, but this adds complexity to the OFDM transmitter.
1.1.3 Conclusion on the context
The next step in mobile communication, LTE, focuses on the quality and speed of the data transfer. This is realized with the help of the OFDM modulator/demodulator, which is the most widely used solution to problems such as ISI or fading. One of the key features in OFDM is the IFFT/FFT pair.
In this project, the focus is on the receiver side, hence on the FFT block. In section 1.2, the FFT concept is presented by means of three FFT algorithms, and the issue of parallelizing them is introduced.
1.2 Project subject
1.2.1 Fast Fourier Transformation (FFT)
The group members have selected three FFT algorithms which will be compared. These three algorithms are presented below:

Radix-2 DIT (Decimation In Time) Fast Fourier Transform: This algorithm is chosen because it is the simplest form of the Cooley-Tukey algorithm. Many other algorithms compute the DFT faster than radix-2 (radix-4 and split-radix, for example), but it is important for the project to be able to compare this basic algorithm with better algorithms (Sørensen, Edelman) to show the difference in computation and complexity (explained in 2.4.3).

"Sørensen" Fast Fourier Transform (SFFT): The second algorithm under test is a mix of a Cooley-Tukey algorithm, like split-radix, and Horner's polynomial evaluation scheme. It takes into account the fact that not all the outputs are of interest for the final result, so only some chosen outputs are computed. This makes it possible to avoid many operations which are expensive in time and memory. The Sørensen FFT is well known, and the project results can be compared with other studies. It is an interesting algorithm, in terms of complexity and challenge, to implement and compare with other algorithms like the Radix-2 DIT or Edelman.

"Edelman" Fast Fourier Transform: This algorithm computes the DFT approximately, introducing some errors which are minimal compared to the savings in the number of computations. This kind of algorithm allows the speed of computation to be increased in spite of some errors. The Edelman algorithm is useful for parallel computing.

All the algorithms mentioned above are further developed in section 2.4. However, because of a lack of documentation about the Edelman algorithm, it is disregarded in the project.
1.2.2 Cell Broadband Engine
The purpose of this project is to examine the implementation of FFT algorithms for the OFDM application presented in section 1.1.2 on a multiprocessor platform, namely the Cell Broadband Engine architecture. The Cell BE is, in this project, used for:

- the implementation of parallelized FFT algorithms
- the evaluation of the performance, in particular the execution time, of the implementation of the parallelized FFT algorithms

The Cell BE is constructed as a heterogeneous processor architecture, with multiple executions and memory transfers active at the same time. This architecture is composed of a processor that contains a dual-thread PowerPC unit (PPU) and eight simpler processors, the Synergistic Processing Units (SPUs), which are designed to perform calculations, whereas the PowerPC performs control, data management and scheduling of operations. Each SPU contains a RISC processor and is constructed with two pipelines that can each execute an instruction every cycle. Moreover, the data paths of the arithmetic functional units inside the SPUs are wide (128 bits), allowing the use of Single Instruction Multiple Data (SIMD) instructions. This design produces a processor optimized for computations.
1.2.3 Parallelization
Parallelization is an important part of this project. Indeed, the OFDM receiver requires an FFT as an integral part of the wireless communication. It is essential that this FFT be computed as fast as possible so that the achievable throughput is maximised. To obtain the best performance from the application running on the Cell BE processor, the use of multiple SPUs concurrently is evaluated. The application creates at least as many threads as concurrent SPU contexts are required; each of these threads runs a single SPU context at a time. With this method, the FFT is parallelized and uses some of the features of the Cell BE to accelerate the computation.
1.3 Problem Definition
This work seeks to answer the following question:
"How efficient, in terms of computation speed, is the Cell BE processor for the execution of parallelized FFT algorithms?"
1.4 Project Delimitations
In-depth research on LTE and OFDM is not the purpose of this project, nor is a complete mathematical examination of the FFT algorithms. This project focuses on the use of the Cell BE for the FFT, "probably the single most important tool in digital signal processing (DSP)", according to Sørensen and Burrus [16].
Chapter 2  Analysis
2.1 Overview
The purpose of this chapter is, first, to introduce the design methodology the project group has chosen, in terms of project methodology (the A3 design) and of the way an algorithm is parallelized following established procedures. This chapter then introduces the platform under test, the Cell BE, in part 2.3, followed, in part 2.4, by an analysis of the different chosen FFT algorithms, with an explanation of the reasons for choosing these algorithms.
2.2 Design Methodology
2.2.1 Design Model
The design of the model is divided into three parts, as in the A3 model [17]: Application, Algorithm and Architecture. First of all, Figure 2.1 shows the generic A3 model. Then, this methodology is applied to the specific project presented in this report, as shown in Figure 2.2.

Application: The application is any system with specifications and constraints. These can be time constraints, power consumption, area problems, etc. It is the main purpose of a project.

Algorithm: At this level, existing algorithms are developed. Special algorithms can be created for the application. The algorithms are optimized from a purely mathematical point of view, i.e. the optimizations are only done on the parts of the algorithms directly related to the application.

Architecture: The mapping of the previous algorithms is realised on the selected platform (DSP, FPGA, Cell BE, ...). In case of incompatibility between the specifications/constraints of the application and the results, modifications have to be made. On the one hand, if the algorithms are implemented on an established architecture, the program can be modified in terms of architecture-related code (bus control, data transfer control, memory allocation, ...) for the specified architecture. On the other hand, if the algorithms are fixed, then a modification of the architecture (the VHDL program for an FPGA platform, for example) can be done.

Application: In the application domain, LTE and OFDM are presented in the context section 1.1.
Figure 2.1: The generic A3 design methodology. Specifications and constraints flow from the Application to the Algorithm (algorithmic constraints, algorithmic optimizations) and to the Architecture (architectural constraints, architectural optimizations); the algorithm is mapped onto the architecture, and the results are compared against the specifications, iterating if necessary.
Algorithm: In the algorithm domain, three Fast Fourier Transform algorithms are compared. First of all, an analysis of their derivation is done. Then, the complexity, i.e. the number of computations needed to execute the Fourier Transform, of each algorithm is analysed. Finally, each algorithm is implemented in the C language twice: once for sequential execution and a second time for parallel execution.

Architecture: In the architecture domain, the platform used to implement the different algorithms is analysed. The available hardware and system limitations are studied. Then, it is examined how the compiler is used in order to parallelize programs, and also how to measure the computation cost in terms of resource utilisation, execution speed, etc.
2.3 Cell BE
2.3.1 Architecture
This section presents the architecture used throughout the project, the Cell Broadband Engine. According to the A3 design model, this section belongs to the analysis of the architecture, as illustrated in Figure 2.3.
2.3.1.1 Architecture Overview
The Cell Broadband Engine (CBE) is a multicore processor. It has a Power Processing Element (PPE), which is a dual-thread PowerPC Architecture core, and eight Synergistic Processing Elements (SPEs), which are SIMD (Single Instruction Multiple Data) processor elements. The communication path for commands and data between all processor elements and all chip controllers for memory access or input/output is provided by the Element Interconnect Bus (EIB) [18, p. 41]. An overview of the architecture is presented in Figure 2.4.
In the PlayStation 3, 6 of the 8 SPEs can be used for computation, because one is used by the OS virtualization layer and the other has been disabled for wafer yield reasons [19, p. 5]. That means that when running the operating system, 6 SPEs are available for computation, as shown in Figure 2.4.
Figure 2.2: The A3 model for this project: the application is an OFDM receiver for LTE 4G, the algorithms are the Sørensen, Edelman and Radix-2 FFTs, and the architecture is the Cell BE, with iteration between the levels.
2.3.1.2 Power Processing Element (PPE)
The PPE contains a 64-bit, dual-thread PowerPC Architecture RISC core. It has 32 KB level-1 (L1) instruction and data caches and a 512 KB level-2 (L2) unified (instruction and data) cache. It can run existing PowerPC architecture software and is well suited for executing system-control code. In this project, however, it is used as a managing controller for the SPE threads, and it is assumed that the PPE is fast enough to manage the threads executing on the SPEs. The PPE consists of two main units: the PowerPC processor unit (PPU), which performs instruction execution, and the PowerPC processor storage subsystem (PPSS), which handles memory requests from the PPU and external requests to the PPE from the SPEs [18, p. 41]. The architecture overview of the PPE is presented in Figure 2.5.
In the PlayStation 3, the PPE is clocked at 3.2 GHz, so it can theoretically reach 2 × 3.2 = 6.4 GFLOP/s of IEEE-compliant double-precision floating-point performance. It can also reach 4 × 2 × 3.2 = 25.6 GFLOP/s of non-IEEE-compliant single-precision floating-point performance using the 4-way single instruction multiple data (SIMD) fused multiply-add operation [19, p. 5].
2.3.1.3 Synergistic Processor Element (SPE)
The SPE is a single instruction multiple data (SIMD) processor element optimized for the data-rich operations (such as the computation of FFT butterflies) allocated to it by the PPE. Each SPE has a Synergistic Processor Unit (SPU), which fetches instructions and data from its 256 KB Local Store (LS) and its single register file of 128 entries, each 128 bits wide. Each SPE has a Direct Memory Access (DMA)
Figure 2.3: A3 model for the project. Highlighted in red: the algorithms analyzed in this section.
interface and a channel interface for communicating with its Memory Flow Controller (MFC) and all the other processor elements (PPE and SPEs). The SPE is intended to run its own program, which resides in the LS, and not to run an operating system [18, p. 63]. The architecture overview of the SPE is presented in Figure 2.6.
The SPU functional unit, as shown in Figure 2.7, consists of a Local Store (LS), in which all instructions and data used by the SPU are stored, a Synergistic Execution Unit (SXU), which executes all the instructions, and an SPU Register File Unit (SRF), which stores all data types, return addresses and results of comparisons. The SXU includes 6 execution units:
- SPU Odd Fixed-Point Unit (SFS), which executes byte-granularity shift, rotate-mask and shuffle operations on quadwords.
- SPU Even Fixed-Point Unit (SFX), which executes arithmetic instructions, logical instructions, word shifts and rotates, floating-point compares, and floating-point reciprocal and reciprocal square-root estimates.
- SPU Floating-Point Unit (SFP), which executes single-precision and double-precision floating-point instructions, integer multiplies and conversions, and byte operations. It can perform fully pipelined single-precision (32-bit) floating-point instructions and partially pipelined double-precision (64-bit) instructions.
- SPU Load and Store Unit (SLS), which executes load and store instructions. It also handles DMA requests to the LS.
- SPU Control Unit (SCN), which fetches and issues instructions to the two pipelines, executes branch instructions, arbitrates access to the LS and register file, and performs other control functions.
- SPU Channel and DMA Unit (SSC), which enables communication, data transfer, and control into and out of the SPU. The functions of the SSC are shared with the associated DMA controller in the Memory Flow Controller (MFC).
Figure 2.4: Architecture overview of the Cell Broadband Engine processor. The Element Interconnect Bus (EIB) is the connection between all processor elements and all chip controllers for memory access and input/output access. The Cell Broadband Engine has 1 PowerPC Processor Element and 8 Synergistic Processor Elements. Legend: BEI = Cell Broadband Engine Interface, EIB = Element Interconnect Bus, FlexIO = Rambus FlexIO Bus, IOIF = I/O Interface, MIC = Memory Interface Controller, PPE = PowerPC Processor Element, RAM = Resource Allocation Management, SPE = Synergistic Processor Element, XIO = Rambus XDR I/O cell. Adopted from [18, p. 37].
The Synergistic Execution Unit (SXU) is divided into an even and an odd pipeline (pipelines 0 and 1, respectively) and can complete up to two instructions per cycle, one on each pipeline [18, p. 68]. The odd pipeline provides the data-moving units and the even pipeline provides the data-processing units. Furthermore, each unit of the SXU has a 128-bit-wide datapath, giving the capability to use Single Instruction Multiple Data (SIMD) operations. When the SXU works on 32-bit-wide data, it can thus perform 4 operations with each instruction.
On the PlayStation 3, the SPU runs at a frequency of 3.2 GHz. With 32-bit-wide data, each SPU can thus theoretically provide 2 x 4 x 3.2 = 25.6 GFLOPS (one operation on each pipeline and 4 operations per instruction). Since 6 SPUs are available, this yields a total of 153.6 GFLOPS [18, p. 5].
It must be noted that single-precision floating-point operations do not conform to IEEE 754 because of the following differences:
- Truncation is used in rounding.
- Denormal numbers are treated as zero.
- NaNs are interpreted as normalized numbers.
Double-precision floating point does not have these problems [18, p. 68-69].
Figure 2.5: Architecture overview of the PPE, which consists of a PowerPC Processor Unit (PPU) and a PowerPC Processor Storage Subsystem (PPSS). It has 32 KB level-1 (L1) instruction and data caches and a 512 KB level-2 (L2) unified (instruction and data) cache. Adopted from [18, p. 49].
Figure 2.6: Architecture overview of the SPE, which consists of a Synergistic Processor Unit (SPU) and a Memory Flow Controller (MFC). The SPU has an LS of 256 KB. Adopted from [18, p. 63].
Figure 2.7: Architecture overview of the SPU functional unit. The 256 KB Local Store (LS) is filled from the Element Interconnect Bus (EIB) via the MFC. The SXU contains two fixed-point units and a floating-point unit. The odd pipeline takes care of moving data (fetching instructions to the pipelines, loading and storing data between the LS and the register file, which has 128 entries of 128 bits each), while the even pipeline takes care of data processing (arithmetic and logic instructions). Adopted from [18, p. 64].
2.3.1.4 Element Interconnection Bus (EIB)
One of the main components in the PlayStation 3 is the EIB, which connects all the components together, including the PPE, the SPEs, main memory and all inputs/outputs. The bus has a bandwidth of 25.6 GB/s (96 bytes per clock cycle) and enables multiple concurrent data transfers [18, p. 42].
2.3.1.5 Memory Interface Controller (MIC)
The Memory Interface Controller (MIC) provides the interface between the EIB and the physical mem-
ory. It supports one or two Rambus extreme data rate (XDR) memory interfaces, which together support
between 64 MB and 64 GB of XDR DRAM memory [18, p. 42].
2.3.1.6 Memory System
The PlayStation 3 has dual-channel Rambus extreme data rate (XDR) memory; however, the platform provides a modest 256 MB, of which only about 200 MB are available for the Linux OS and the applications [19, p. 7]. The SPU accesses the RAM through the EIB and moves the data to its LS via DMA transfers, with the MFC of the SPU acting as a DMA controller.
2.3.1.7 Cell Broadband Engine Interface (BEI)
The Cell Broadband Engine Interface (BEI) unit supports I/O interfacing. It manages data transfers between the EIB and I/O devices. The BEI supports two Rambus FlexIO interfaces. One of the two interfaces (IOIF1) supports only a noncoherent I/O interface (IOIF) protocol, which is suitable for I/O devices. The other interface (IOIF0) is software-selectable between the noncoherent IOIF protocol and the memory-coherent Cell Broadband Engine interface protocol [18, p. 42].
2.3.2 Programming the CBE
Programming for the CBE is split into two main tasks: the programming of the PPE, which manages the utilization of the SPEs, and the programming of what is executed on the SPEs.
2.3.2.1 Development platform
The platform used for the project is a PlayStation 3 with a monitor, keyboard, mouse and LAN connection for remote access. The PlayStation 3 is set up with a Linux operating system and a set of development tools:
Fedora 8, Linux kernel 2.6.23.1-42.fc8
IBM SDK 3.0 for the CBE architecture, including:
gcc compiler toolchain for the CBE (ppu-gcc and spu-gcc ver. 4.1.1)
libspe2 - SPE runtime management library ver. 2.2
Makefile from SDK
2.3.2.2 Creating a simple application on a SPE
Generally, applications do not have physical control of the SPEs; the operating system manages these resources. Applications use software constructs called SPE contexts, which are a logical representation of an SPE. The SPE Runtime Management Library (libspe) provides all the functions to manage the SPEs. This library also provides the means for communication and data transfer between the SPEs and the PPE. The flow of running a single SPU program context, as shown in Figure 2.8, is: create an SPE context; load an SPE executable object into the SPE context local store (LS); run the SPE context (the operating system requests the actual scheduling of the context onto a physical SPE); and lastly destroy the SPE context in order to free the memory resources used by the context. It must be noted that running the SPE context is a synchronous call to the operating system, and thus the calling application blocks until the SPE stops executing [20, p. 1]. All functions for SPE context management are described in [20].
Figure 2.8: The flow for running a simple application using an SPE.
2.3.2.3 Creating an application on several SPEs
In order to speed up computation, the project needs to use multiple SPEs concurrently. To achieve this, the application must create at least as many threads as concurrent SPE contexts are required. The library used to achieve this is libspe2, which relies on POSIX (Portable Operating System Interface) threads [20, p. 41]. The flow of running an application on several SPEs is shown in Figure 2.9.
Each of these threads may run a single SPE context at a time. If N concurrent contexts are required, it is common to have a main application thread plus N threads dedicated to SPE context execution. Each dedicated thread issues a request for its context to be run and blocks until the context finishes execution. This blocking does not affect the main program thread, which can still create as many threads as needed. If all SPEs are busy, the threads are queued up and executed in the order in which they were created.
Finally, when all the threads have finished, the main program thread destroys the no-longer-needed SPE contexts.
2.3.2.4 Project directory structure
In order to program the Cell Broadband Engine, the source code is arranged into two folders: one for the ppu code and one for the spu code. Furthermore, to use the makefile definitions supplied by the SDK for producing programs, the line "include $(CELL_TOP)/buildutils/make.footer" has to be included in each makefile. The project directory structure is shown in Figure 2.10.
2.3.2.5 Program compilation
Building the application for the Cell BE requires several steps, as shown in Figure 2.11.
First, all .c files in the ppu folder are compiled using ppu-gcc for PPE programs, and all .c files in the spu folder are compiled using spu-gcc for SPE programs. Next, spu-gcc creates SPE executables from the compiled SPE programs. These executables are embedded into the PPE programs by first creating embedded
Figure 2.9: The flow of running an application using several SPEs.
PPE images of the SPE executables (using ppu-embedspu), next creating PPE libraries (using ppu-ar), and finally compiling the PPE programs again, merging them with the SPE libraries, to obtain the final program FFT (using ppu-gcc).
2.4 FFT algorithms
This section is an introduction to selected FFT algorithms, with a complete analysis of each algorithm to be parallelized in Chapter 3. According to the A3 design model, this section belongs to the Algorithm domain, as illustrated in Figure 2.12. First, the selection of algorithms is discussed. Then, the mathematical forms of the algorithms are developed. Finally, the computational costs are compared.
An FFT algorithm computes the Discrete Fourier Transform (DFT) with minimal complexity. A direct application of the DFT definition has computational complexity O(N^2); the purpose of FFT algorithms is to split the transform so as to obtain a complexity of O(N log N).
Figure 2.10: Project directory structure, which yields two subfolders: one for the ppu program code and one for the spu program code. The Makefile in the program directory lists the subdirectories (DIRS = ppu spu); the Makefile in directory ppu sets the target (PROGRAM_ppu = main); the Makefile in directory spu sets the target (PROGRAM_spu = fft_spu); and each of the three includes $(CELL_TOP)/buildutils/make.footer.
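Based on this layout, the three makefiles can be sketched as follows. The variable names (PROGRAM_ppu, PROGRAM_spu, IMPORTS) follow common IBM SDK 3.0 examples and are assumptions here, not copied from the project's actual build files:

```make
# Makefile in the program directory: recurse into the two subfolders
DIRS = ppu spu
include $(CELL_TOP)/buildutils/make.footer
```

```make
# Makefile in directory spu: build the SPE program
PROGRAM_spu = fft_spu
include $(CELL_TOP)/buildutils/make.footer
```

```make
# Makefile in directory ppu: build the PPE program, linking the
# embedded SPE library (the IMPORTS line is assumed from SDK samples)
PROGRAM_ppu = main
IMPORTS = ../spu/fft_spu.a
include $(CELL_TOP)/buildutils/make.footer
```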
2.4.1 Overview
Many algorithms exist for computing the FFT. The most common is the Cooley-Tukey (CT) algorithm. It uses a divide-and-conquer approach based on recursion: the recursion divides a Discrete Fourier Transform into several smaller DFTs. The algorithm requires O(N) multiplications by twiddle factors, the constant trigonometric coefficients that appear in the course of the derivation developed in Section 2.4.2. James Cooley and John Tukey published this method in 1965, but the algorithm was originally devised by Carl Friedrich Gauss in 1805. The best-known use of the CT algorithm divides the transform into two parts of similar size.
2.4.2 Discrete Fourier Transform
The Discrete Fourier Transform (DFT) presented in [21] is a mathematical tool for digital signal processing (spectral analysis, data compression, partial differential equations, . . . ), analogous to the Continuous Fourier Transform (CFT) used for analog signals. The formula is shown below:

X[k] = \sum_{n=0}^{N-1} x[n] \exp\left(\frac{-2\pi j\, n k}{N}\right) \quad (2.1)

X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk} \quad (2.2)

W_N^{nk} = \exp\left(\frac{-2\pi j\, (n k)}{N}\right) \quad (2.3)

where W_N^{nk} is known as the twiddle factor. The time-domain input data x[n] is a finite series of N samples, n = [0, 1, . . . , N-1], and is transformed to the frequency-domain signal X[k], where k = [0, 1, . . . , N-1].
2.4.3 Cooley-Tukey
This section presents a theoretical analysis of the radix-2 DIT FFT. First, the DFT formula is developed to obtain the radix-2 DIT formula. Then, a data-path derivation is shown to optimize the implementation in code.
Figure 2.11: Flow for CBE program compilation. First, the .c files for PPE programs and SPE programs are compiled using ppu-gcc and spu-gcc, respectively. The compiled SPE programs are turned into SPE executables, which are embedded as PPE images, archived into PPE libraries, and finally linked with the PPE programs to produce the final program FFT.
2.4.3.1 Radix 2 DIT FFT
This section presents the radix-2 FFT implementation [22] used for testing against the Edelman and Sørensen FFT algorithms. It is chosen because it is one of the simplest FFT algorithms, for two reasons: it is well studied, so it can serve as a baseline for comparison, and it is a good way to get acquainted with the FFT. First, the analytical derivation of the radix-2 computation of a DFT is presented.
The radix-2 decimation-in-time rearranges the DFT equation into two parts: a sum over the even-numbered indices n = [0, 2, 4, . . . , N-2] and a sum over the odd-numbered indices n = [1, 3, 5, . . . , N-1], as in the following equations:

X_k = \sum_{m=0}^{N/2-1} x_{2m}\, e^{\frac{-2\pi j}{N}(2m)k} + \sum_{m=0}^{N/2-1} x_{2m+1}\, e^{\frac{-2\pi j}{N}(2m+1)k} \quad (2.4)

X_k = \sum_{m=0}^{N/2-1} x_{2m}\, e^{\frac{-2\pi j}{N}(2m)k} + e^{\frac{-2\pi j}{N}k} \sum_{m=0}^{N/2-1} x_{2m+1}\, e^{\frac{-2\pi j}{N}(2m)k} \quad (2.5)

X_k = \mathrm{DFT}_{N/2}[x(0), x(2), \ldots, x(N-2)] + W_N^k\, \mathrm{DFT}_{N/2}[x(1), x(3), \ldots, x(N-1)] \quad (2.6)

X_k = Even(k) + W_N^k\, Odd(k) \quad (2.7)
Figure 2.12: A3 model for the project. Highlighted in red: the algorithms analyzed in this section.
where k = [0, 1, . . . , N-1]. The previous simplifications show that the radix-2 DIT DFT can be computed as the sum of two length-N/2 DFTs: one over the even indices and one over the odd indices, the latter multiplied by the twiddle factor W_N^k = e^{\frac{-2\pi j}{N}k}. Whereas direct DFT computation requires N^2 complex multiplications and N^2 - N complex additions, the radix-2 DIT rearrangement costs only N^2/2 + N complex multiplications and N^2/2 complex additions.
2.4.3.2 Data path Derivation
One can notice that the radix-2 DIT simplification is recursive. This kind of expression is simple, but not optimal to implement in code, because of memory consumption and scheduling overhead; that is why iterative algorithms are generally preferable.
Another property is described below. The even and odd parts are periodic with period N/2, so Odd(k + N/2) = Odd(k) and Even(k + N/2) = Even(k). In addition, the twiddle factor satisfies W_N^{k+N/2} = -W_N^k. The equations may now be expressed as:

X_k = Even(k) + W_N^k\, Odd(k) \quad (2.8)

X_{k+N/2} = Even(k) - W_N^k\, Odd(k), \quad k = 0, 1, \ldots, \frac{N}{2}-1 \quad (2.9)

The decimation of the data sequence can be repeated until the resulting sequences are reduced to one-point sequences. Thus, for N = 2^n, the decimation can be performed n = log2(N) times. Therefore, the total number of complex multiplications is reduced to (N/2) log2(N) and the number of additions to N log2(N).
Figure 2.13: Eight-point decimation-in-time algorithm.
One can observe that the computation is divided into three stages: four two-point DFTs, then two four-point DFTs and finally one eight-point DFT. Another important observation is the order of the input data. Indeed, the input data have to be reordered (bit-reversed) to obtain the correct sequence for the corresponding output data.
2.4.4 Sørensen
The Sørensen FFT [16] (SFFT) algorithm is used in the project as a test algorithm, like Radix-2 DIT and Edelman. It is also known as Transform Decomposition. Its principle differs from that of standard algorithms, such as Radix-2 DIT or split radix, in terms of the number of input and output data points. Standard algorithms assume that both numbers are equal, as seen in Figure 2.13, where all the outputs are computed. SFFT proceeds differently: only some output points are said to be of interest, and only these points are computed, as illustrated in Figure 2.14.
Considering the DFT definition (2.2):

X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk} \quad (2.10)

where k = [0, 1, . . . , N-1], SFFT supposes that only L output points are of interest. There exist two lengths P and Q such that N
Figure 2.14: There are N inputs, but only one output (X(k) in this example) is computed and used for further operation. The way this is done is explained in the following paragraphs. Modified from Sørensen and Burrus, 1993, figure 4 [16].
divided by P defines Q as:

Q = N/P \quad (2.11)

n = Q n_1 + n_2 \quad (2.12)

with n_1 = [0, \ldots, P-1] and n_2 = [0, \ldots, Q-1]. So the DFT equation (2.2) becomes:

X[k] = \sum_{n_2=0}^{Q-1} \sum_{n_1=0}^{P-1} x[Q n_1 + n_2]\, W_N^{(Q n_1 + n_2) k} \quad (2.13)

X[k] = \sum_{n_2=0}^{Q-1} \left( \sum_{n_1=0}^{P-1} x[Q n_1 + n_2]\, W_P^{n_1 \langle k \rangle_P} \right) W_N^{n_2 k} \quad (2.14)

where \langle k \rangle_P is k modulo P.

X[k] = \sum_{n_2=0}^{Q-1} X_{n_2}[\langle k \rangle_P]\, W_N^{n_2 k} \quad (2.15)

X_{n_2}[j] = \sum_{n_1=0}^{P-1} x[Q n_1 + n_2]\, W_P^{n_1 j} \quad (2.16)

X_{n_2}[j] = \sum_{n_1=0}^{P-1} x_{n_2}[n_1]\, W_P^{n_1 j} \quad (2.17)

x_{n_2}[n_1] = x[Q n_1 + n_2] \quad (2.18)
Equation (2.17) is the equation of a length-P DFT and can be computed with any FFT algorithm, such as Radix-2 DIT or split radix. Sørensen's paper indicates that a split-radix FFT gives a lower operation count, but since the radix-2 FFT has been used previously in the project, it is used here so the results can be compared with the previous ones.
Equation (2.15) shows that Q FFTs of length P have to be computed, as illustrated in Figure 2.14.
2.4.4.1 Complexity
The SFFT complexity depends on the number P, which determines Q, the number of length-P FFTs that have to be performed. The overall complexity then depends on the complexity of the FFT algorithm used for these, Radix-2 DIT or split radix.
2.5 Conclusion of the Analysis section
The Analysis section presents the theoretical foundations of the subject developed in this project. The A3 design methodology is used to organize the project and makes it possible to establish simple, well-defined parts. The application is defined as developing an OFDM receiver for LTE. The algorithm part then describes the 2 FFT algorithms to be used: Radix-2 DIT and Sørensen. The last part corresponds to the architecture on which the algorithms are implemented, namely the Cell Broadband Engine.
The analysis of the Cell Broadband Engine shows a multiprocessor architecture containing one PPE managing the communication between the 6 SPEs available out of 8 on the PlayStation 3 platform. Instructions and data flow over the Element Interconnect Bus (EIB), which connects the PPE, the SPEs and the memories. The SPUs contain RISC cores and are built around two pipelines that can each complete an instruction per cycle. Moreover, the datapaths of the arithmetic functional units inside the SPUs are wide (128 bits), allowing the use of Single Instruction Multiple Data (SIMD) instructions. This yields a processor optimized for computation.
The last part of the Analysis section concerns the FFT algorithms. The FFT (and its inverse, the IFFT, also written FFT^{-1}) is a tool used in digital signal processing for transforming signals from the time domain to the frequency domain, where the useful frequencies can be separated from the added noise. In the case of OFDM, the transmitter contains an IFFT which transforms the digital symbols into the signal to be transmitted. The receiver performs the inverse computation to retrieve the transmitted data amid the noise. These operations are computationally expensive; an efficient multicore architecture designed for computation can reduce the computation time. The project group chooses the Radix-2 DIT as the first algorithm to implement on the CBE. This algorithm, based on the paper by J. Cooley and J. Tukey, is assumed to be one of the simplest FFT algorithms. The second algorithm to be implemented on the CBE is the Transform Decomposition algorithm, known as Sørensen. This algorithm, a bit more complex than Radix-2 DIT, speeds up the computation.
The next chapter deals with the experiments of implementing the FFT algorithms, first on the PPU, then on one SPU and finally on several SPUs. The Radix-2 DIT is the first algorithm used, followed by the Sørensen algorithm.
Chapter 3
Implementation
3.1 Overview
This chapter puts into practice the theoretical analysis developed in Chapter 2. It contains the results of the tests on one or several processors, with different FFT algorithms. All these results are evaluated, compared and discussed. According to the A3 design model, this section belongs to both the Algorithm and Architecture domains, i.e. the mapping of the algorithm onto the architecture, as illustrated in Figure 3.1.
3.2 Cooley-Tukey Implementation
3.2.1 Overall Approach
The tests are carried out with the CT algorithm. First of all, Matlab is used to obtain reference results: its fft function is used to verify that the results of the implementations are correct. This verification covers only the computation results; the Matlab computation time is of no interest. As mentioned in Chapter 2, Section 2.4.3, the CT algorithm is one of the simplest existing FFT algorithms; it is therefore selected for the initial tests, as its sequential implementation is straightforward. These tests also provide elements of comparison for the subsequent implementations.
Then, various types of tests are performed. All the following tests are carried out 10000 times to ensure that the results are meaningful (since the execution is not fully deterministic, due to architectural and OS hazards). The first test is a sequential execution on the main processor (PPU). The second is also a sequential computation, but on one SPU (without data transfer). These two tests show the difference in computation time between the two processors. Then, the parallel implementation on 6 SPUs is performed to evaluate the potential improvement.
Two parameters are evaluated during these tests: the computation time and the number of operations per second. The time is measured with the function gettimeofday [23] and covers the execution of the bit-reverse function, the twiddle factor computation and the butterfly computation. The following formula gives the number of operations per second:

\text{Number of operations per second} = \frac{10 \cdot \frac{N}{2} \log_2 N}{\text{total time}} \quad (3.1)

where N is the length of the FFT and the factor 10 is the number of floating-point operations per butterfly (one complex multiplication and two complex additions).
Figure 3.1: A3 model for the project. Highlighted in red: the mapping developed in this section.
This formula corresponds to the complexity of the CT butterflies seen in Section 2.4.3.2. The computation of the twiddle factors and the bit-reverse function does not affect equation (3.1), because there are no floating-point operations in these functions: bit reversal is only data movement, and the twiddle factor computation involves only cosine and sine, which in this case are not counted as floating-point operations.
3.2.2 Results
A graphical representation of the results is shown in Figure 3.2. This graph shows the computation time of the sequential executions on the PPU and on the SPU. Both are almost linear, which is expected: when the FFT length is doubled, the computation time almost doubles. A further comment on these results is that the SPU computation times are larger than the PPU ones; indeed, the difference between the two increases with the FFT length.
The graph in Figure 3.3 depicts the computation time of the parallel implementation as a function of the FFT length; here the CT algorithm is parallelized on the 6 SPUs of the Cell BE. The larger the FFT length, the larger the computation time. This is an unexpected result: firstly, the computation time for the parallelized version is larger than for the sequential one; secondly, the execution time grows with the FFT length. The explanation is that the data transfers between the main storage (RAM) and the local storage (LS) are very long compared to the computation, i.e. the data transfers are a bottleneck and the SPUs remain idle for significantly long periods of time. Moreover, no optimizations have been implemented so far.
Finally, the number of operations per second is plotted against the number of processors, as depicted in Figure 3.4. For an FFT length of 1024, it can be observed that the number of operations per second (in MFLOPS) decreases when the number of processors increases.
Considering the results of the previous tests (computation on 6 SPUs, Figure 3.3), this result was expected. Indeed, much time (cf. the previous comments) is spent on data transfers when the number of processors increases. Therefore, the number of operations per second (i.e. actual computations) is very low compared to the transfer times.
Figure 3.2: Computation time of a sequential radix-2 FFT implemented on the PPU (dashed blue) and one SPU (continuous red) for different FFT lengths (ranging from 4 to 1024).
Figure 3.3: Computation time of a parallel radix-2 FFT implemented on 6 SPUs for different FFT lengths, ranging from 4 to 1024.
Figure 3.4: Number of operations per second for a parallel radix-2 FFT implemented on different numbers of processors (from 1 to 6).
3.2.3 Optimizations
Intuitively, one would expect that increasing the parallelism would increase the number of operations per second. However, the opposite effect has been observed in the results described above. Therefore, the group members decided to evaluate whether the computation time can be reduced by means of several optimization techniques, described in what follows.
Problem of data transfers: the time for performing data transfers between the PPU and the SPUs is higher than the computation time. Several methods have been used to reduce this time, as described in the following paragraphs.
3.2.3.1 Deterministic twiddle factors
The twiddle factors have been made constants on the SPU. If the FFT length is fixed, the twiddle factors are deterministic (they can be computed in advance). Instead of passing them as arguments to the SPU, they are stored in the Local Store of the SPU. The twiddle factors are complex values with real and imaginary parts. Assuming 32-bit floats and a 1024-point FFT, the size of these data is 512 twiddle factors x 2 floats (real and imaginary parts) x 4 bytes per float = 4096 bytes. That is not a problem for the LS, because it is only 4096 bytes out of the 256 KB. This technique avoids wasting precious EIB bandwidth.
3.2.3.2 Double Buffering
One of the methods to transfer data between the PPU and the SPU uses Direct Memory Access (DMA). This section presents a technique called double buffering. To perform computation on the SPU, the program has to transfer data from main storage to the LS using DMA transfers. For example, consider an SPU program that repeats the following steps:
1. Transfer data via DMA from the main storage to the LS buffer B,
2. Wait for the transfer to complete,
3. Compute on the data in buffer B.
This sequence is not efficient, because the SPU has to wait for the complete transfer of the data before it can compute on the buffer, wasting much time. Figure 3.5 illustrates this scenario.
Figure 3.5: Serial computation and data transfer. Modified from [24].
This process can be significantly accelerated by using double buffering. Two buffers, B0 and B1,
are allocated, allowing overlapping computation on one buffer with data transfer in the other one. The
scheme is shown in figure 3.6.
Double buffering is achieved by using tag-group identifiers [25]. All transfers involving buffer B0
(respectively B1) are assigned to tag-group ID 0 (respectively ID 1). Then, software sets the tag-group
mask to include only tag ID 0 (respectively tag ID 1) and requests a conditional tag status update. This
ensures that the computation does not begin before the transfer to the buffer is complete. Figure 3.7
shows the resulting execution in time.
Double buffering is used in the project to transfer the data structure from the PPU to the SPU. This
structure is described below:
Figure 3.6: Double Buffering scheme (initiate DMA transfers from EA to the LS buffers B0 and B1;
wait for the transfer to B0 to complete, compute on B0 and re-initiate its transfer; then do
the same for B1, in a loop). Modified from [24]
Figure 3.7: Parallel computing and transfer. Double buffering is more efficient than the approach pre-
sented in Figure 3.5, as the SPU does not have to wait for the data: one part can be computed
in buffer B0 while the next data is in the DMA transfer to B1. Modified from [24]
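The pattern can be sketched in a platform-neutral way as follows. This is our own illustration, not the project code: the Cell-specific MFC calls (an mfc_get tagged per buffer, followed by a tag-group wait) are replaced by a synchronous dma_get placeholder, so only the control flow of the two alternating buffers is shown.

```c
#include <assert.h>
#include <string.h>

#define CHUNK  256   /* elements per transfer (illustrative size) */
#define CHUNKS 8     /* number of chunks to process               */

static float source[CHUNKS][CHUNK]; /* stands in for main storage (EA) */
static float buf[2][CHUNK];         /* buffers B0 and B1 in the LS     */

/* Placeholder for an asynchronous DMA get; on the SPU this would be an
   mfc_get tagged with the buffer index, and the tag-group wait that must
   precede the computation is marked in the loop below. */
static void dma_get(float *ls, const float *ea)
{
    memcpy(ls, ea, sizeof(float) * CHUNK);
}

static float compute(const float *b)   /* dummy work: sum one chunk */
{
    float s = 0.0f;
    for (int i = 0; i < CHUNK; i++)
        s += b[i];
    return s;
}

float process_all(void)
{
    float total = 0.0f;
    int cur = 0;

    dma_get(buf[cur], source[0]);             /* prefetch chunk 0 into B0 */
    for (int c = 0; c < CHUNKS; c++) {
        int nxt = cur ^ 1;
        if (c + 1 < CHUNKS)
            dma_get(buf[nxt], source[c + 1]); /* start the next transfer into the other buffer */
        /* (SPU: wait here on the tag group of buf[cur]) */
        total += compute(buf[cur]);           /* compute while the other transfer runs */
        cur = nxt;
    }
    return total;
}
```

With real asynchronous transfers, the dma_get for the next chunk overlaps the compute on the current one, which is exactly what Figure 3.7 depicts.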
typedef struct complex {
    float real;
    float imag;
} complex_t;

typedef struct {
    complex_t *input;
    complex_t *output;
    complex_t *twiddle;
    int count;
} spe_arg_t;
The structure spe_arg_t is passed as an argument from the PPU to the SPU. While the computation of
one butterfly is being performed by means of the first buffer transfer, the second buffer is transferring the
data for the computation of the next butterfly. Although the twiddle-factor and double-buffering methods
have been implemented, no significant improvement of the data transfer time has been observed (since
the results are the same with or without these two methods, the corresponding numbers are not repeated
here).
3.2.3.3 Large amount of data
After further consideration, the group members wanted to evaluate the hypothesis that, to gain anything
from using double buffering, a larger amount of data must be transferred: the EIB only becomes efficient
if it can work for longer durations of time. So, in a new experiment, instead of sending the input data in
1024 separate transfers, half of the data was sent to the SPU; the calculations then started while the other
half was being sent. Although this method has been implemented, no improvement of the computation
time has been measured.
3.2.3.4 Computation of several stages on the same SPU
The goal of all the previous optimizations is to reduce the data transfer time. Regarding figure 3.8, the
first four data (x(0), x(4), x(2), x(6)) are used together in stages 1 and 2. It means that only one transfer
is necessary from the PPU to the SPU to compute these four values in stages 1 and 2. If this method is
applied to a 1024-point FFT on 4 SPUs, 256 data (1024/4) are transferred to each SPU. It means that each
SPU computes 128 butterflies (2^7), thus each SPU computes the first seven stages with only one transfer
of 256 data. This optimization is only possible with a power-of-2 number of processors. Then, the last three
stages are computed on the 4 SPUs like the method described in part [].
Figure 3.8: Eight-point decimation-in-time algorithm, implemented on two SPUs. The first two stages
are computed with only one transfer from the main storage to the LS. Modified from [24]
The results are interesting. By means of this method, the computation time is improved: for a 1024-point
FFT, the time without optimization is 30 ms on 2 SPUs, whereas with this optimization the computation
time is 7 ms. This result shows two things. Firstly, the data transfer time is the problem (which was only a
supposition until this point); indeed, the time is divided by 4.3 thanks to sending less data. Secondly, the
improvement is not enough, because the computation time of the parallel implementation is still larger
than the sequential one. Another algorithm (Sørensen) has been analysed in part 2.4.4; its implementation
is developed in the following part. This implementation is better than Radix-2 DIT in terms of computation
time, as shown in section 3.3.
3.3 Sørensen Implementation
3.3.1 Overall Approach
The following tests are carried out on the Sørensen algorithm. The reference results come from Matlab
and are the same as presented in section 3.2. This implementation allows comparing the differences with
the CT algorithm. According to the theoretical analysis in section 2.4.4, the results should be better (in
terms of execution time) with Sørensen than with CT Radix-2. Indeed, the Sørensen algorithm divides a
large FFT into small FFTs, which facilitates the parallelization. Various tests are performed on Sørensen;
however, in order to compare the results with those of CT, the same type of tests as those used for CT
are carried out. Two sequential implementations on the PPU are performed: one with Q set to 2 and the
other one with Q set to 4 (Q is the number of small FFTs, as seen in part 2.4.4). Then, the parallel
implementation is tested to see the potential improvement. There are also two different values for Q (2
and 4); therefore, the parallel implementation is performed on 2 and 4 SPUs. The same parameters as
for the CT algorithm are measured. The measurement of the computation time covers the reordering,
compute_fft and recombination functions. The function used to measure the time is still gettimeofday [23].
Then, the number of operations per second is evaluated as well, but with a different formula, because the
complexity of the computation is not the same as for CT. The formula is given in equation 3.2:
GFLOPS = (5 Q P log2(P) + 8 (Q - 1) L) / total_time    (3.2)

where Q is the number of small FFTs, P the number of input data for each small FFT, and L the
desired number of output data.
The number of operations per second only concerns the computation of the small FFTs and the
recombination function. The reordering function performs no floating-point computations; it only
consists of data reordering by means of data moves.
3.3.2 Results
The graph in Figure 3.9 shows the computation time of the sequential execution on the PPU. There are
two lines: one (continuous red) for a division of the large FFT into two smaller ones (Q = 2) and another
one (dashed blue) for a division into four smaller ones (Q = 4). The execution time for Q = 2 is always
smaller than for Q = 4. That is expected, because the complexity depends on the chosen subdivision
factor P, as this defines Q, the number of small FFTs performed: the larger the factor Q is, the larger
the number of computations. Therefore, for a sequential execution, the time increases with the number
of calculations, which explains the behaviour of these measures. Moreover, these two curves are almost
linear, which is also expected, because increasing the number of input data increases the computation time.
Figure 3.10 shows the computation time for the two parallel implementations (Q = 2 and Q = 4, i.e. 2 and
4 SPUs, respectively) according to the FFT length (from 4 to 1024). It appears that the execution time is
always larger for a parallelization on 4 SPUs. Thus, it can be deduced that the problem still comes from
the time needed to transfer the data between the PPU and the SPUs, as for the parallel implementation
of the CT algorithm. However, the positive aspect in this case is that the computation time becomes
almost constant when the FFT length is increased (due to the effect of the pipeline). The computation
time of the parallel implementations is always larger than the sequential one. Moreover, the 4-SPU
execution is slower than the 2-SPU one. This is an expected result, because there are only 2 data
transfers for Q = 2 whereas 4 are needed when Q = 4.
3.3.3 Comparison with the CT algorithm
The goal of this section is to compare the results of the Sørensen implementation with the different measures
obtained for the CT Radix-2 DIT implementation. Indeed, although several optimizations have been
applied to the CT implementation, the results (especially for the computation time), due to the data
transfers between the PPU and the SPUs, are larger than the sequential execution time. Figure 3.11 shows
the parallel implementation of the Sørensen algorithm (Q = 2, i.e. 2 SPUs) and of CT (on 6 SPUs).
Although the parallel implementation of Sørensen also suffers from the data transfer problem, it
appears that it is better than Radix-2 DIT in terms of computation time. The explanation is simple: in
Sørensen's algorithm, data are transferred one and only one time for the computation of one small FFT,
whereas for Radix-2 DIT, even with the optimizations, data are transferred four times (N = 1024), as seen
in section 3.2.2. The conclusion is that Sørensen is better fitted for parallel implementation due to its
Figure 3.9: Computation time of a sequential Sørensen FFT implemented on the PPU for Q = 2
(continuous red) and Q = 4 (dashed blue) with different FFT lengths (ranging from 4 to 1024).
Figure 3.10: Computation time of a parallel Sørensen FFT for Q = 2 (i.e. 2 SPUs) (continuous red) and
Q = 4 (i.e. 4 SPUs) (dashed blue) with different FFT lengths (ranging from 4 to 1024).
Figure 3.11: Comparison of the computation time of a parallel Sørensen FFT for Q = 2 (i.e. 2 SPUs)
(dashed green), for Q = 4 (i.e. 4 SPUs) (dotted blue) and a parallel CT Radix-2 DIT FFT
on 6 SPUs (continuous red) with FFT lengths from 4 to 1024.
Figure 3.12: Comparison of the number of operations per second for a parallel Sørensen FFT (Q = 2)
(dashed blue) and a parallel Radix-2 DIT FFT on 6 SPUs (continuous red) with different
FFT lengths (ranging from 4 to 1024).
design as compared to CT. The time needed for data transfer thus appears to be the main limiting factor,
but with a better knowledge of these transfers, the Sørensen algorithm is most likely easily applicable
to this kind of parallel architecture. Another element of comparison is the number of computations per
second. Figure 3.12 shows this variable (in MFLOPS) for the Sørensen implementation on 2 processors
and for the CT Radix-2 DIT algorithm on 6 SPUs according to the FFT length (from 4 to 1024).
Please note that this is a different type of measure as compared to the execution time (here, larger
numbers indicate better performance). The number of floating-point operations per second is larger for
the Sørensen implementation than for the Radix-2 one. This can be explained by the fact that the
Sørensen algorithm performs more computations (compared to Radix-2): indeed, the recombination step
(cf. 3.2.2), after the computation of the small FFTs, adds some computations. Furthermore, Figure 3.11
has shown that the Sørensen FFT was faster than CT in terms of computation time. These two
explanations, combined together, explain the trend observed in Figure 3.12.
Chapter 4
Conclusion & Perspectives
4.1 Conclusions
The next step in mobile communication, LTE, focuses on the quality and speed of the data transfer. This
is realized with the help of the OFDM modulator/demodulator, which is the most used solution to problems
such as ISI or fading. One of the key features in OFDM is the IFFT/FFT pair. To speed up the data
transfer provided by OFDM, an improvement of the computation speed of the IFFT/FFT can be
sought. With the latest multiprocessor platforms, the speed-up can be improved even more, as long as the
data transfer protocol between the different parts of the architecture is well managed.
The goal of this 9th semester ASPI project is to answer the problem defined in section 1.3 as follows:
"How efficient, in terms of computation speed, is the Cell BE processor for the execution of parallelized
FFT algorithms?"
First of all, an analysis of the Cell BE has been done to determine whether this multicore architecture can
speed up a common tool of digital signal processing, the FFT. It appears that the Cell BE, combined with
the SIMD method, constitutes a processor optimized for computations, and it is therefore able to improve
the computation speed of the execution of FFT algorithms. To evaluate the efficiency of parallelized FFT
algorithms on the Cell BE processor, two FFT algorithms are used (Radix-2 DIT and Sørensen). The first
uses log2(N) stages with N/2 butterflies at each stage, which means that each stage can be parallelised.
This algorithm is assumed to be one of the simplest FFT algorithms; it is not the most efficient but the
easiest to establish. The Sørensen algorithm splits the FFT into smaller FFTs, which means that each
smaller FFT can be computed on one processor in a parallel scheduling.
Then, during the implementation, these algorithms are computed only with the PPE of the Cell BE.
Using these algorithms only on the PPE provides a computation without any parallelisation, to serve as
a reference against the same algorithms with a parallelised computation. Radix-2 DIT is implemented
first. The comparison between the PPU-only implementation and the multiple-SPU implementation shows
that the data transfers between the PPU and the SPUs cause a waste of time, and the results are
unexpected, in the sense that they show less efficiency than the unparallelised algorithm. Optimizations,
like the double-buffering method, are applied to reduce the data transfer time, but without any
improvement. The Sørensen algorithm is implemented next and shows an improvement of the computation
time in comparison with the Radix-2 DIT implementation. However, the results of this implementation
are still below the theoretical computation power the Cell BE can provide.
4.2 Perspectives
4.2.1 Short-term perspectives
The short-term perspectives for this project concern the usage of another op