
CONCURRENCY: PRACTICE AND EXPERIENCE
Concurrency: Pract. Exper., Vol. 10(8), 583–605 (1998)

Portable parallel FFT for MIMD multiprocessors
AMIR AVERBUCH AND ERAN GABBER

Computer Science Department, School of Mathematical Sciences, Tel-Aviv University, Tel-Aviv 69978, Israel (e-mail: [email protected])

SUMMARY
A portable parallelization of the Cooley–Tukey FFT algorithm for MIMD multiprocessors is presented. The implementation uses the virtual machine for multiprocessors (VMMP) and PVM portable software packages. Since VMMP provides the same set of services on all target machines, a single version of the parallel FFT code was used for shared memory (25-processor Sequent Symmetry), shared bus (MOS running distributed UNIX) and distributed memory multiprocessors (transputer network and 64-processor IBM SP2). It is accompanied by a detailed performance analysis of the implementations. The algorithm achieved high efficiencies on all target machines. The analysis indicates that most overheads are caused by the target architecture and not by VMMP or PVM inefficiencies.

The portability analysis of the FFT provides several important insights. On the message passing architecture, the parallel FFT algorithm can obtain linearly increasing speedup with respect to the number of processors with only a moderate increase in the problem size.

The parallel FFT can be executed by any number of processors, but generally the number of processors is much less than the length of the input data. The results indicate that the parallel FFT is portable: it achieves very good speedups on either a shared memory multiprocessor with high memory bandwidth or on a message passing multiprocessor without any change in the programs. © 1998 John Wiley & Sons, Ltd.

1. INTRODUCTION AND OVERVIEW

Portability analysis of real-world problems is valuable for developing parallel applications, especially when the reasons for inefficiencies are explained. This information can be used both to design better algorithms and to improve machine architecture. Portability analysis can assist in predicting the performance of different algorithms on distinct architectures without the need to devise a new algorithm each time.

The discrete Fourier transform (DFT) and its accelerated version, the fast Fourier transform (FFT), have become a necessary tool in many scientific and engineering applications. A detailed review of parallel FFT implementations on various architectures is given in [1].

This paper is an extension of [1]. The extension is three-fold: first, the implementation uses portable software packages (VMMP[2,3] and PVM[4]), which allowed us to run the same algorithm on serial computers, shared memory and distributed memory MIMD multiprocessors. The same code will run unmodified on all future multiprocessors supported by VMMP or PVM. Since VMMP and PVM share the same design and implementation ideas (see Section 4.1), and since VMMP was not ported to the IBM SP2, we ran the same algorithm on the IBM SP2 using PVM instead of VMMP. Second, we allow decomposition of the input vector into unequal factors, which improves the efficiency for


input data lengths which are not a multiple of equal factors. Thus, the algorithm can be utilized for non-uniform distribution of the modeled (transformed) data. Third, a detailed performance analysis of the algorithm on shared memory and distributed memory multiprocessors indicates that the algorithm is portable, and it should achieve good speedup when executed on a larger number of processors. This analysis also indicates that most overheads are caused by the target architecture and not by VMMP inefficiencies.

The parallel FFT algorithm was implemented in the C language using VMMP[2,3] services and PVM[4]. VMMP ensures total portability for explicitly parallel application programs. In other words, the same program code executes on all targets supporting VMMP. This code will also run on all future multiprocessors supporting VMMP or PVM.

Detailed performance analysis reveals that the execution bottlenecks are in the initial data distribution and in the inter-stage data distribution and collection. These bottlenecks arise from access to shared data, which is implemented by VMMP shared objects. Shared data access is inherently slow in both the shared memory and the distributed memory implementations: either it is implemented by access to shared memory via a shared bus, or it is implemented by message passing. Both implementations reveal architectural bottlenecks.

The algorithm achieved 81% efficiency for 128K data points on a 25-processor Sequent Symmetry shared memory multiprocessor. However, efficiency is lower on the distributed memory multiprocessor due to the high data transfer time and lack of cache (see Sections 6.1.1, 6.2.2 and 6.3.1). On the SP2, which is a more advanced and modern message passing multiprocessor, it achieved in many cases more than 90% efficiency (see Section 6.4).

A separate analysis indicates that the performance of the transputer is greatly influenced by the memory access time. Using the external memory (regular memory) may increase the run time by 50% compared to internal memory access.

This paper is organized as follows. In Section 2 we describe related algorithms. In Section 3 we review the parallel one-dimensional algorithm, which is based on [1, Sections 3 and 4]. Section 4 contains an overview of VMMP. The portable parallel FFT implementation is described in Section 5. Section 6 contains performance measurements and a detailed timing analysis.

2. RELATED PARALLEL FFT ALGORITHMS

A detailed review of parallel FFT implementations is given in [1]. Many parallel FFT algorithms were implemented on hypercube-like distributed memory multiprocessors[5–9]. These works focus on the communication complexity of the algorithms, which is the central issue in the implementation of the FFT on the hypercube. Chamberlain[10] describes the implementation of the FFT on a 64-node INTEL hypercube. The data are distributed according to a Gray code, so that neighbouring points in the domain are in neighbouring processors. Another FFT implementation is on the ICL DAP[11].

Ragland and Outten[12] describe an efficient parallel implementation of this FFT algorithm on the IBM 3090-600E VF supercomputer. Their work is based on our parallel FFT algorithm[1]. They also give a detailed account of the implementation of the FFT algorithm on multivector computers. Carlson[13] describes an FFT variant which uses the local memory of the CRAY-2 supercomputer to boost performance.


Swarztrauber[14] developed several multiprocessor FFTs, both for vector multiprocessors with shared memory and for hypercube distributed memory machines. This work does not emphasize portability. He developed different code for shared memory and message passing architectures. The hypercube code emphasizes the minimization of interprocessor communication.

The paper by Bailey and Swarztrauber[15] describes the fractional Fourier transform and a corresponding fast form, which are useful for computing DFTs of sequences of prime length and for other purposes.

An attempt to provide a methodology for scaling parallel programs for multiprocessors is given in [16]. The authors show that the scaling of real applications is more complicated than the standard analysis, which is based on the computation and communication complexities of individual algorithms.

3. 1-DIMENSIONAL PARALLEL FFT REVISITED

For the completeness of the description we repeat here a part of the mathematical description of the algorithm as it appeared in Sections 3 and 4 of [1].

Let {x(n) : 0 ≤ n ≤ N − 1} be N samples of a time function. The DFT is defined by

X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{kn}, \qquad 0 \le k \le N - 1, \qquad W_N = e^{-i 2\pi/N}    (1)
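For reference, a direct transcription of (1) into C, an O(N^2) sketch of our own (not the paper's implementation), reads:

    #include <complex.h>
    #include <math.h>

    /* Direct O(N^2) evaluation of (1): X[k] = sum_n x[n] * W_N^{kn},
       with W_N = exp(-i*2*pi/N).  Reference only; the paper replaces
       this with the staged (G)FFT described below. */
    void naive_dft(const double complex *x, double complex *X, int N)
    {
        for (int k = 0; k < N; k++) {
            double complex acc = 0.0;
            for (int n = 0; n < N; n++) {
                double phase = -2.0 * M_PI * (double)k * (double)n / (double)N;
                acc += x[n] * cexp(I * phase);
            }
            X[k] = acc;
        }
    }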

Assume that

N = N_1 N_2 \cdots N_p    (2)

Conceptually, we can think of a one-dimensional array x of dimension N as a p-dimensional array x' of dimension N_p \times N_{p-1} \times \cdots \times N_1. That is,

x'(n_p, \ldots, n_i, \ldots, n_1) = x(n_p + n_{p-1} N_p + \cdots + n_1 N_2 N_3 \cdots N_p)    (3)

n_i = 0, 1, \ldots, N_i - 1, \qquad i = 1, 2, \ldots, p

The first step of the algorithm is an 'index reversal', which is a generalization of a 'bit reversal'. In our case, this corresponds to a p-dimensional transposition. Therefore, we define another p-dimensional array x_0 in the following way:

x_0(n_1, \ldots, n_p) = x'(n_p, \ldots, n_1)    (4)

This step is done in parallel. From (3) we have that

x_0(n_1, \ldots, n_p) = x(n_p + n_{p-1} N_p + \cdots + n_1 N_2 N_3 \cdots N_p)    (5)
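For the two-factor case N = N_1 N_2 used later in the performance analysis, the index reversal reduces to a matrix transposition. A sketch in C follows; the row-major storage convention chosen here is our assumption and is not spelled out in the text:

    #include <complex.h>

    /* Index reversal (4)-(5) for p = 2, N = N1*N2.
       x is viewed as an N1 x N2 array with linear index n2 + n1*N2;
       x0 is laid out with n1 as the fastest-varying index
       (linear index n1 + n2*N1), so the index reversal is a transpose. */
    void index_reversal_p2(const double complex *x, double complex *x0,
                           int N1, int N2)
    {
        for (int n1 = 0; n1 < N1; n1++)
            for (int n2 = 0; n2 < N2; n2++)
                x0[n1 + n2 * N1] = x[n2 + n1 * N2];
    }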

We treat the output array X of length N as a p-dimensional array x_p,

x_p(k_1, \ldots, k_p) = X(k_1 + N_1 k_2 + \cdots + N_1 N_2 \cdots N_{p-1} k_p)    (6)

k_i = 0, \ldots, N_i - 1, \qquad i = 1, \ldots, p


Note that x_p and X are equivalent arrays. Substituting (5) and (6) in (1), we obtain

x_p(k_1, \ldots, k_p) = \sum_{n_1, n_2, \ldots, n_p} x_0(n_1, \ldots, n_p)\, W_N^{f_p g_p}    (7)

where

f_p = n_p + n_{p-1} N_p + \cdots + n_1 N_2 N_3 \cdots N_p    (8)

and

g_p = k_1 + N_1 k_2 + \cdots + N_1 N_2 \cdots N_{p-1} k_p    (9)

and the general definition of these functions is

f_j = f_j(n_1, n_2, \ldots, n_j) = n_j + n_{j-1} N_j + \cdots + n_1 N_2 N_3 \cdots N_j    (10)

and

g_j = g_j(k_1, k_2, \ldots, k_j) = k_1 + N_1 k_2 + \cdots + N_1 N_2 \cdots N_{j-1} k_j    (11)

j = 1, \ldots, p

Define

L_j = N_1 N_2 \cdots N_j, \qquad j = 1, 2, \ldots, p    (12)

Using (12), (10) and (11) can be written recursively as

f_j = n_j + f_{j-1} N_j, \qquad g_j = g_{j-1} + L_{j-1} k_j    (13)

From the above,

\frac{f_j g_j}{L_j} = f_{j-1} k_j + \frac{f_{j-1} g_{j-1}}{L_{j-1}} + \frac{n_j g_{j-1}}{L_j} + \frac{n_j k_j}{N_j}, \qquad j = 1, \ldots, p    (14)
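The identity (14) follows by expanding the recursions (13) and dividing by L_j = L_{j-1} N_j, a step the text does not write out:

\frac{f_j g_j}{L_j} = \frac{(n_j + f_{j-1} N_j)\,(g_{j-1} + L_{j-1} k_j)}{L_{j-1} N_j} = f_{j-1} k_j + \frac{f_{j-1} g_{j-1}}{L_{j-1}} + \frac{n_j g_{j-1}}{L_j} + \frac{n_j k_j}{N_j}

The integer term f_{j-1} k_j contributes nothing to the complex exponential, the second term reproduces the previous stage, and the last two terms give the twiddle factor and the radix-N_j DFT kernel that appear in (15).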

Define f_0 = g_0 = 0 and L_0 = 1. From the above, the DFT of (1) can be performed in j = 1, \ldots, p successive stages, in which N/N_j one-dimensional DFTs of length N_j are performed along the rows of the partially transformed X_{j-1}:

X_j(K_j) = \sum_{n_j=0}^{N_j-1} X_{j-1}(K_{j-1})\, W_{L_j}^{n_j g_{j-1}}\, W_{N_j}^{n_j k_j}    (15)

where

K_j = n_{j+1}, \ldots, n_p, k_1, \ldots, k_j

L_j = \prod_{l=1}^{j} N_l

g_j = \sum_{l=1}^{j} k_l L_{l-1}, \qquad L_0 = 1

n_j = 0, \ldots, N_j - 1


and X_0(K_0) is obtained from x(n) by an 'index-reversal' transformation. At the end of the pth stage, X(k) = X_p(K_p).

In (15), the second term, W_{L_j}^{n_j g_{j-1}}, represents the twiddle factor scaling the jth stage. The last term represents the DFTs of length N_j. We have to compute here the twiddle factor W_{L_j}^{n_j g_{j-1}}, j = 1, \ldots, p, of (15), but we can incorporate the computation of the twiddle factor into the computation of the FFT, which yields a 3-step parallel FFT that we call the generalized FFT (GFFT). From (15),

X_j(K_j) = \sum_{n_j=0}^{N_j-1} X_{j-1}(K_{j-1})\, W_{N_j}^{n_j (k_j + g_{j-1}/L_{j-1})}, \qquad \frac{g_0}{L_0} = 0    (16)

Equation (16) can be interpreted as the generalized in frequency FFT (GFFT) of X_{j-1}. During the initialization stage of the algorithm, L_j = \prod_{l=1}^{j} N_l, j = 1, \ldots, p - 1, are precomputed. Then the computation of N/N_j - 1 vector multiplications using N/N_j - 1 precomputed twiddle factor vectors can be incorporated in the computation of the FFT, and this is the main reason for the speedup achievement. This fact increases the locality of the computation.
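As an illustration of (16), the following is a direct O(N_j^2) evaluation of one generalized-in-frequency DFT row in C. This is our sketch of the definition, not the authors' radix-2 GFFT kernel, which folds the shift into precomputed twiddle-factor vectors:

    #include <complex.h>
    #include <math.h>

    /* One row of (16): given a row of length Nj and the accumulated
       frequency shift s = g_{j-1}/L_{j-1}, compute
       out[k] = sum_n in[n] * exp(-2*pi*i * n*(k + s) / Nj). */
    void gfft_row(const double complex *in, double complex *out,
                  int Nj, double s)
    {
        for (int k = 0; k < Nj; k++) {
            double complex acc = 0.0;
            for (int n = 0; n < Nj; n++) {
                double phase = -2.0 * M_PI * n * ((double)k + s) / (double)Nj;
                acc += in[n] * cexp(I * phase);
            }
            out[k] = acc;
        }
    }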

If a vector of length N that has to be transformed can be represented as a p-dimensional array, where N = N_1 N_2 \cdots N_p, then the DFT of the vector can be done in p successive stages, where each stage consists of the following phases:

1. Load N/N_j vectors of length N_j.
2. Compute the radix-N_j vector DFT. This contains N/N_j - 1 vector multiplications using N/N_j - 1 precomputed twiddle factor vectors (16).
3. Store N/N_j vectors.
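Putting the pieces together for the two-factor case analysed in Section 6 (N = N_1 N_2, p = 2), a serial reference driver built from the two sketches above looks as follows; again, this is our illustration, not the parallel implementation:

    #include <complex.h>
    #include <stdlib.h>

    /* Serial two-factor GFFT driver (p = 2) using index_reversal_p2()
       and gfft_row() from the sketches above.  Stage 1 performs N2 DFTs
       of length N1 (no frequency shift); stage 2 performs N1 generalized
       DFTs of length N2 with shift g1/L1 = k1/N1, cf. (16).
       Output ordering follows (6): X[k1 + N1*k2]. */
    void gfft_two_factor(const double complex *x, double complex *X,
                         int N1, int N2)
    {
        int N = N1 * N2;
        double complex *a = malloc(N  * sizeof *a);
        double complex *b = malloc(N  * sizeof *b);
        double complex *r = malloc(N2 * sizeof *r);
        double complex *t = malloc(N2 * sizeof *t);

        index_reversal_p2(x, a, N1, N2);            /* phase 0: index reversal    */

        for (int n2 = 0; n2 < N2; n2++)             /* stage 1: rows of length N1 */
            gfft_row(&a[n2 * N1], &b[n2 * N1], N1, 0.0);

        for (int k1 = 0; k1 < N1; k1++) {           /* stage 2: length-N2 columns */
            for (int n2 = 0; n2 < N2; n2++)
                r[n2] = b[k1 + n2 * N1];            /* gather the k1-th column    */
            gfft_row(r, t, N2, (double)k1 / N1);    /* frequency shift k1/N1      */
            for (int k2 = 0; k2 < N2; k2++)
                X[k1 + N1 * k2] = t[k2];
        }
        free(a); free(b); free(r); free(t);
    }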

4. VMMP: A VIRTUAL MACHINE FOR MULTIPROCESSORS

VMMP (virtual machine for multiprocessors)[2,3] is a software package, developed at the Computer Science Department of Tel-Aviv University, which provides a coherent set of services for explicitly parallel application programs running on diverse MIMD multiprocessors, both shared memory and message passing. It is intended to simplify parallel program writing and to promote portable and efficient programming. VMMP ensures high portability of application programs by implementing the same services on all target multiprocessors. It may also serve as a common intermediate language between very high level language compilers and various target multiprocessors. VMMP services have the same meaning regardless of the peculiarities of the specific target machine. In this way, the applications are truly portable and require only a recompilation in order to execute on a different multiprocessor. VMMP is designed for medium and large grained computations.

VMMP may be used directly by the application programmer as an abstract parallel machine to aid portability without sacrificing efficiency. VMMP may also be used as a common intermediate language between high level programming tools, such as parallelizers and specification language processors, and the target machines. In particular, VMMP is used as the intermediate language in P3C (portable parallelizing Pascal compiler)[17,18] and PPFC (portable parallelizing Fortran compiler)[19]. Both compilers translate serial programs in the respective languages to explicitly parallel code for a variety of target



multiprocessors. VMMP usage is illustrated in Figure 1. Full details and additional examples can be found in [2] and in the VMMP user's guide[3].

Figure 1. VMMP usage and its portable environment structure

VMMP supports two programming models: a virtual tree machine for tree computations, and a co-operating set (crowd) of processes for crowd computations.

Tree computations are a general computation strategy, which solves the problem by breaking it into several simpler subproblems that are solved recursively. The solution process looks like a tree of computations, in which the data flows down from the root into the leaves and solutions flow up towards the root. Some examples of tree computations are divide-and-conquer algorithms, game tree search and combinatorial algorithms.

Crowd computations are used for algorithms that are defined in terms of a set of co-operating processes. Each of these processes is an independent computational entity, which communicates and synchronizes with its peers. The processes communicate and synchronize mainly by message passing. A crowd may create a tree computation and vice versa.

A basic tool used by both computation models is the shared (memory) object. Shared objects provide an asynchronous and flexible communication mechanism in addition to parameters and message passing for all computations. Shared objects may be regarded as virtually shared memory areas, which contain any type of data, and can be accessed only by secure operations. Shared objects can be created, deleted, passed by reference, read and written. They provide a high bandwidth asynchronous communication mechanism.

The shared objects are intended to distribute data, collect results and compute global results by some associative and commutative operations. Other uses of the shared objects, such as random updates to small portions of a large object, are discouraged, because the implementation of these operations incurs high overhead on a message passing multiprocessor.


Users must manipulate the shared objects only through VMMP services, which guarantee the correctness of the operations.

VMMP supports several data distribution strategies using shared objects or by special crowd operations. In particular, regular distribution of arrays (e.g. by rows or columns) is supported for both shared objects and crowd communication.

VMMP provides high level services, which can be implemented efficiently on all target machines. It does not provide low level services, such as direct access to shared memory or semaphores, since these services are difficult to implement efficiently on all targets.

4.1. VMMP services and their relation to PVM

VMMP provides an abstract machine interface for the convenient use of tree computations, crowd computations and shared memory objects. As such, VMMP does not support base level services like semaphores and barriers. The use of its services ensures an efficient implementation on a wide variety of target machines, and the ability to effectively hide implementation details from the user or the application library developer.

The VMMP services may be used to create, synchronize and terminate crowd processes. These processes can communicate using message passing services, or through the use of the above-mentioned shared objects. Shared objects may be created, initialized, updated and deleted. VMMP supports several data distribution strategies by some crowd operations and by shared object operations.

Crowd services are used in the following manner: VMMP creates an initial process, which executes the user's main program. This process may spawn other processes using the Vcrowd service, which creates a crowd of co-operating processes running concurrently with the father process. The Vwait service forces the father process to wait until the termination of all its child processes in order to ensure correct output results.

Crowd processes receive input and produce output using shared objects. Shared objects are created and initialized by the Vdef service. The Vset service changes the value of a shared object and Vget retrieves the current value. Other services used will be detailed in the code examples later on. All VMMP services are documented in [3].

VMMP resembles other software packages, such as PVM[4] and Express[20]. All of these packages provide a set of machine independent services to explicitly parallel application programs. They support portable applications by defining 'virtual machines', which are independent of any specific hardware.

PVM is more mature than VMMP. It is used by a large community of users worldwide. It emphasizes the support of a dynamic network of heterogeneous machines, while VMMP supports only a single multiprocessor at a time. On the other hand, VMMP can be used easily as an intermediate language for a parallelizing compiler or a specification language.

It is easy to port VMMP applications to PVM and vice versa, since both tools provide crowd and tree computations with similar characteristics.

VMMP provides a shared object memory, which is not supported by either PVM or Express, but could be emulated using their services.

VMMP development started in 1989, when no other package with similar characteristics was available. It is remarkable that three different software packages independently developed a similar set of services.


4.2. VMMP implementation

VMMP has been implemented on the following multiprocessors:

Sequent Symmetry. A 26-processor Sequent Symmetry running the DYNIX operating system.

MMX. An experimental shared bus multiprocessor containing four processors running a simple runtime executive[21].

ACE. An experimental multiprocessor developed at the IBM T.J. Watson Research Center running the MACH operating system[22].

MOS. A shared bus multiprocessor containing eight NS32532 processors running MOSIX, a distributed Unix[23].

Transputers. A distributed memory multiprocessor consisting of a network of eight T800 transputers running the HELIOS distributed operating system[24].

VMMP simulators are available on SUN3, SUN4, IBM PC/RT and VME532 serial computers. The same application program will run unchanged on all target machines after a recompilation.

VMMP was used in the implementation of many other problems, such as the parallel solution of underwater acoustic IFD (implicit finite difference) schemes[25], using multigrid to solve Poisson equations on all the above target machines[26], solving Toeplitz matrices[27], a parallel debugger[28] and a parallel implementation of BLAS 3[29]. All the above implementations were also tested on shared memory and message passing multiprocessors.

VMMP can be ported to a new multiprocessor in approximately 2–4 man months.

4.3. Useful VMMP services

The following VMMP services are used by the implementation of the parallel FFT:

Vcrowd. Creates a crowd of processes of the specified size running the given procedure.
Vcrowd part. Computes the distribution of a one-dimensional vector among crowd processes.
Vcombine. Combines a distributed vector and computes a global result by an associative operation.
Vget. Gets the value of a shared object.
Vget slice. Gets a slice of a shared object. Slice boundaries are computed by Vcrowd part.
Vget sync. Waits until all processes arrive at the Vget sync and then gets the shared object value. This is a combination of a barrier synchronization and Vget.
Vset. Sets the shared object value.
Vset expand. Sets a scattered data area into the shared object. The data scatter stride in memory may be different than in the object.
Vwait. Waits until all crowd processes terminate.

5. PORTABLE IMPLEMENTATION OF THE PARALLEL FFT

5.1. Implementation decisions

The parallel FFT is an improvement of the program reported in [1]. The original program assumed that the input data length N should be a product of equal factors, N = M^p,


where M is a power of 2. The current implementation supports the decomposition of the input data into unequal factors, such that N = N_1 \times N_2, where N_1 and N_2 are not powers of 2. This decomposition reduces the number of computation stages in the FFT, which improves the efficiency by reducing the inter-stage communication and synchronization.

The implementation of the parallel FFT using VMMP was straightforward. The most important implementation decisions are described below:

1. A crowd of processes computes the parallel FFT. One process is placed in each physical processor.

2. The input data and the results are stored in shared objects, which is the most natural way to pass large amounts of data in VMMP.

3. The initial index reversal and the loading of input data elements into the processes were implemented differently according to the ability to access shared object data directly by all processors. Direct access to the input data by a pointer is much more efficient than using the corresponding VMMP service (and less secure). This is the only place in the implementation where the program interrogates the capabilities of the host machine and chooses the action at runtime.
If direct access to shared object data is possible (as detected by a VMMP predicate), then each processor retrieves the necessary input data points directly into its local memory. Such direct access is possible on shared memory implementations. When direct access to the object data is not possible (e.g. on a distributed memory multiprocessor), the entire input data is sent to all processors by the Vget sync service, which employs a spanning tree for efficient broadcasting. After the data have arrived, the processes perform the index reversal on the local copy.
Note that the above code is completely portable, since it will run correctly on all targets.

4. The data exchange between the stages is performed by distributing slices of the local result into an intermediate shared object, so that the processes will be able to read their next stage input from a contiguous slice of this object. We added a new VMMP service (Vset expand) for writing non-contiguous slices in order to speed up this action.

5. The processes and the shared objects are not embedded in the parallel processor using any specific topology, because VMMP provides transparent access to shared objects. This is in contrast with other parallel FFT implementations, such as [10], which try to reduce communication overhead by a specific embedding of data to processors.

6. The GFFT computation within each processor was implemented as in the serial case.

The pseudocode for the parallel implementation of the FFT algorithm is:

main()
{
    init()          /* Parameter check and data structure initialization */
    Vdef()          /* Define shared objects */
    Vcrowd()        /* Start the parallel code */
    Vwait()         /* Wait (for Vcrowd) till all results arrive from the processors */
    Vclock()        /* Elapsed time = Vclock() - start_time */
    Vget()          /* Read the results from the shared object */
}


worker()
{
    gen_tables()        /* Generate sin and cos tables */
    do_fft()            /* Start the FFT in each processor */
}

do_fft()
{
    N = m0 * m1         /* Decompose the size of the input vector to be transformed */
    Vcrowd_part()       /* Compute the limits for the process */
    Vget_sync()         /* Wait till all the processes arrive here (barrier)
                           and read the input from the shared objects */
    index_rev()         /* Index reversal on the input */
    for (stage = 0; stage <= 1; stage++) {
        Vcrowd_part()   /* Divide m[stage] into slices */
        if (stage > 0)
            Vget_sync() /* Wait for all the processes to arrive */
        gfft()          /* Compute sub-stages for N/M steps using the GFFT */
                        /* Prepare the complete output in a contiguous area in memory */
        Vset_expand()   /* Write the output from a contiguous array into the
                           shared object with jumps */
        exchange()      /* Exchange the input area and the temporary area */
    }
}

synchronization()
{
    Vcombine()          /* Barrier synchronization */
    Vget_slice()        /* Get a slice from the shared object */
    local_com()         /* Local computation: shuffle the coefficients */
    Vset_expand()       /* Re-shuffle into the shared object */
}

do_fft() describes the actual computation in each processor. We start with a vector of size N which can be decomposed into unequal parts m0 and m1.

The parallel FFT was developed and debugged using the SUN simulator. This source code was later recompiled and run on the shared memory and on the distributed memory multiprocessors. No source changes were necessary. The same code will also run on any future multiprocessor supporting VMMP or PVM. The only source changes necessary will be target specific performance improvements (e.g. using different VMMP services).

6. PORTABILITY ANALYSIS

In our performance analysis, if a vector of length N that has to be transformed can be represented as a 2-dimensional array, where N = N_1 N_2, then the DFT of the vector can be done in two successive stages, where each stage (j = 1, 2) consists of the following phases:

1. Load N/N_j vectors of length N_j.
2. Compute the radix-N_j vector DFT. This contains N/N_j - 1 vector multiplications using N/N_j - 1 precomputed twiddle factor vectors (16).
3. Store N/N_j vectors.


The algorithm we analyse is the result of two applications of the above 3-step parallel FFT.

The components of the FFT computation are:

Initialization: creating the parallel processes and computing sin and cos tables.
Initial data distribution: initial index reversal and retrieving the input slice into the processor.
In-memory data shuffle: index reversal in local memory after each computation stage.
Inter-stage data distribution/collection: distributing and collecting data between stages and writing the final results.
GFFT computation: GFFT computation on a data slice in each processor.

From Table 2 we can see that the algorithm overhead (all components except the GFFT computation) is less than 10% for the Sequent with 25 processors. The overhead increases slowly from 5% on two processors to 9% on 25 processors. Most of the time is spent on computation and not on communication. The same trend can be seen in Tables 4, 6 and 10. Even though a small percentage of the time is spent on global communication, it does not substantially degrade the performance or damage the efficiency and the portable nature of the algorithm.

The following subsections describe the performance of the parallel FFT and analyse its overhead on shared and distributed memory multiprocessors. The following tables describe the parallel FFT performance for various data sizes on various multiprocessors. The points column is the data size and its decomposition into factors. Notice that several data sizes are a product of unequal factors. The serial column is the serial radix-2 FFT timing. We use the serial radix-2 timing for comparison with the parallel timing, because the parallel FFT employs radix-2 butterfly operations. The parallel column is the parallel algorithm performance on varying numbers of processors. The time row is the running time in seconds, including all associated overheads (initial data distribution, result collection, computation of sin and cos tables, parallel task creation, etc.). The speedup row is the parallel algorithm speedup compared to the serial case. The eff% row is the parallel algorithm efficiency, which is computed by the formula efficiency = speedup × 100 / no. processors.
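In C, the two derived quantities reported in the tables are simply (a trivial helper of our own, for illustration):

    /* speedup = T_serial / T_parallel; efficiency = speedup * 100 / P */
    double speedup(double t_serial, double t_parallel)
    {
        return t_serial / t_parallel;
    }

    double efficiency_percent(double t_serial, double t_parallel, int nproc)
    {
        return speedup(t_serial, t_parallel) * 100.0 / nproc;
    }

For example, 131,072 points on 25 Sequent processors (Table 1) give 103.66/5.12 ≈ 20.2, i.e. an efficiency of about 81%.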

6.1. Sequent Symmetry performance

The Sequent Symmetry contains 26 i386 processors with 64 KB of cache for each processor. The global memory size is 32 MB. The Sequent runs the DYNIX operating system, which is a parallel variant of UNIX. The Sequent Symmetry is illustrated in [26, p. 26, Figure 15].

Table 1 describes the parallel FFT performance for various data sizes on up to 25 Sequent processors (one processor must be left for operating system use).

Table 1 does not contain results for data sizes 256 and 512 on 20 and 25 processors, since the number of processors in these cases is greater than the smallest factor of the data size. In this case, several processors will be idle.

6.1.1. Sequent Symmetry timing analysis

Table 2 describes the components of the parallel FFT execution time on 131,072 (256 × 512) data points. The time column is the execution time in seconds. The '%' column is the ratio of the component's execution time to the total time.


Table 1. Parallel FFT timing on the Sequent Symmetry shared memory multiprocessor

Points          Data      Serial   Parallel (no. processors)
                                       1       2       5      10      15      20      25
256             time, s     0.10     0.14    0.07    0.04    0.03    0.03
(16 × 16)       speedup              0.7     1.4     2.5     3.3     3.3
                eff., %              71      71      50      33      22
512             time, s     0.22     0.29    0.15    0.08    0.05    0.05
(16 × 32)       speedup              0.8     1.5     2.8     4.4     4.4
                eff., %              76      73      55      44      29
1024            time, s     0.47     0.63    0.32    0.15    0.09    0.07    0.06    0.06
(32 × 32)       speedup              0.8     1.5     3.1     5.2     6.7     7.8     7.8
                eff., %              75      73      63      52      45      39      31
2048            time, s     1.02     1.34    0.68    0.30    0.17    0.14    0.11    0.10
(32 × 64)       speedup              0.8     1.5     3.4     6.0     7.3     9.3    10.2
                eff., %              76      75      68      60      49      46      41
4096            time, s     2.24     2.84    1.44    0.59    0.33    0.24    0.20    0.16
(64 × 64)       speedup              0.8     1.6     3.8     6.8     9.3    11.2    14.0
                eff., %              79      78      76      68      62      56      56
8192            time, s     4.88     6.07    3.05    1.25    0.66    0.48    0.39    0.32
(64 × 128)      speedup              0.8     1.6     3.9     7.4    10.2    12.5    15.3
                eff., %              80      80      78      74      68      63      61
16,384          time, s    10.63    12.84    6.45    2.62    1.32    0.92    0.73    0.63
(128 × 128)     speedup              0.8     1.7     4.1     8.1    11.6    14.6    16.9
                eff., %              83      83      81      81      77      73      68
32,768          time, s    22.90    27.04   13.60    5.51    2.77    1.93    1.48    1.26
(128 × 256)     speedup              0.9     1.7     4.2     8.3    11.9    15.5    18.2
                eff., %              85      84      83      83      79      77      73
65,536          time, s    48.88    56.78   28.50   11.53    5.79    4.00    2.94    2.49
(256 × 256)     speedup              0.9     1.7     4.2     8.4    12.2    16.6    19.6
                eff., %              86      86      85      84      82      83      79
131,072         time, s   103.66   118.97   59.71   24.06   12.15    8.30    6.18    5.12
(256 × 512)     speedup              0.9     1.7     4.3     8.5    12.5    16.8    20.3
                eff., %              87      87      86      85      83      84      81

Discussion. Table 2 indicates that the algorithm overhead (all components except the GFFT computation) is less than 10% for 25 processors. The overhead increases slowly from 5% on two processors to 9% on 25 processors. Initial data distribution time is proportional to 1/no. processors. Inter-stage data distribution and collection time increases from 1.3% on two processors to 3.7% on 25.

Table 2. Components of parallel FFT execution time on Sequent Symmetry

                                          No. processors
                                1              2              5             25
Activity                    Time     %    Time     %     Time     %     Time     %
Initialization              0.04   0.0    0.04   0.1     0.04   0.2     0.06   1.2
Initial data distribution   2.66   2.2    1.33   2.2     0.54   2.2     0.11   2.1
In-memory data shuffle      1.88   1.6    1.07   1.8     0.31   1.3     0.10   2.0
Inter-stage data
  distribution/collection   0.96   0.8    0.76   1.3     0.33   1.4     0.19   3.7
GFFT computations         113.43  95.4   56.51  94.6    22.84  94.9     4.66  91.0
Total                     118.97 100.0   59.71 100.0    24.06 100.0     5.12 100.0


Data communication is not a major bottleneck on the Sequent, because the Sequent bus bandwidth is matched to the processor speed. This is not true for other shared memory multiprocessors, such as the MOS (see Section 6.2.2).

If we look at Table 1, the difference in efficiency between row 5 (4096 points) and row 6 (8192 points) is around 5–6%. The same is true between row 7 (16,384 points) and row 8 (32,768 points). As the problem size increases, this gap narrows. This can be seen in rows 9 and 10 (65,536 and 131,072 points), where the difference is 1–3%. These results indicate that the portable parallel FFT is efficient. It will achieve good speedup on a shared memory multiprocessor with a larger number of processors if the multiprocessor has sufficient memory bandwidth.

6.2. MOS performance

The MOS multiprocessor comprises eight DB532 processor cards connected to a shared bus via a VME bus. Each processor card contains a 25 MHz NS32532 CPU and FPU, 64 KB of write-through cache and 4 MB of local memory. The on-board cache minimizes global memory accesses. The multiprocessor runs the MOSIX[23] operating system, which is a distributed version of Unix. Figure 2 illustrates the MOS multiprocessor structure.

Figure 2. MOS shared memory multiprocessor structure

6.2.1. What is MOSIX

MOSIX is an enhancement of UNIX that allows distributed-memory multicomputers, including a LAN-connected network of workstations (NOW) or a shared bus, to share their resources by supporting pre-emptive process migration and dynamic load balancing among homogeneous subsets of nodes. These mechanisms respond to variations in the load of the workstations by migrating processes from one workstation to another, transparently, at any stage of the life cycle of a process. The granularity of the work distribution in MOSIX is the UNIX process. Users can benefit from the MOSIX execution environment by initiating multiple processes, e.g. for parallel execution. Alternatively, MOSIX supports an efficient multi-user execution environment.


The NOW MOSIX is designed to run on configurations that include personal workstations, file servers and CPU servers that are connected by LANs, a shared bus or fast interconnection networks. In these configurations each workstation (node) is an independent computer, with one or more (shared-memory) processors, local memory, communication and I/O devices. Users interact with the multicomputer via their own workstations. As long as the load of the user's workstation is light, all the user's processes are confined to the user's workstation. When this load increases above a certain threshold level, e.g. the load created by one CPU bound process, the process migration mechanism (transparently) migrates some processes to other workstations and to CPU servers.

The most noticeable properties of executing applications on MOSIX are its network transparency, the symmetry and flexibility of its configuration, and its pre-emptive process migration. The combined effect of these properties is that application programs do not have to know the current state of the system configuration. This is most useful for time-sharing and parallel processing systems. Users need not recompile their applications due to node or communication failures, nor be concerned about the load of the various processors. Parallel applications can be executed on MOSIX by simply creating many processes, just as in a single-machine environment. The operating system automatically attempts to optimize the process allocations and balance the load.

The system image model of MOSIX is a NOW, in which all the user's processes seem to run at the user's 'home' workstation. Each new process is created at the same site(s) as its parent process. All the processes of each user have the execution environment of the user's home workstation. Processes that migrate to other (remote) workstations use local resources if possible, but interact with the user's environment through the user's workstation.

6.2.2. MOS timing analysis

Table 3 describes the parallel FFT performance for various data sizes on an 8-processor MOS multiprocessor.

Table 4 describes the components of the parallel FFT execution time on 65,536 (256 × 256) data points. The description of the table columns and components is the same as for Table 2.

Discussion. Table 4 indicates that the bottlenecks in the MOS implementation are in the initial data distribution and the inter-stage data distribution and collection. These activities involve access to shared objects, which are stored in the shared memory. Shared memory access is a bottleneck due to the relatively low bandwidth of the shared bus on this machine.

Initial data distribution takes about the same total time on two and eight processors, because the same amount of data is accessed in both cases. It is actually a measurement of the MOS bus throughput.

Both the in-memory data shuffle and the GFFT computation exhibit super-linear speedup (the time on eight processors is less than 1/8 of the time on a single processor). This phenomenon is explained by a better cache hit ratio on eight processors, because the size of the local data decreases when the number of processors increases.

Inter-stage data distribution/collection involves process synchronization and communication. These activities consume approximately the same amount of time on two processors as on eight, because the total amount of data distributed and collected is the same in all cases.



Table 3. Parallel FFT timing on the MOS shared memory multiprocessor

Points          Data      Serial   Parallel (no. processors)
                                       1       2       4       6       8
256             time, s     0.03     0.05    0.02    0.01    0.02    0.02
(16 × 16)       speedup              0.6     1.5     3.0     1.5     1.5
                eff., %              60      75      75      25      19
512             time, s     0.07     0.10    0.05    0.03    0.02    0.02
(16 × 32)       speedup              0.7     1.4     2.3     3.5     3.5
                eff., %              70      70      58      58      44
1024            time, s     0.16     0.21    0.11    0.06    0.04    0.04
(32 × 32)       speedup              0.8     1.5     2.7     4.0     4.0
                eff., %              76      73      67      67      50
2048            time, s     0.36     0.44    0.23    0.12    0.09    0.07
(32 × 64)       speedup              0.8     1.6     3.0     4.0     5.1
                eff., %              82      78      75      67      64
4096            time, s     0.78     0.95    0.49    0.25    0.18    0.14
(64 × 64)       speedup              0.8     1.6     3.1     4.3     5.6
                eff., %              82      80      78      72      70
8192            time, s     1.72     2.01    1.04    0.54    0.37    0.29
(64 × 128)      speedup              0.9     1.7     3.2     4.7     5.9
                eff., %              86      83      80      78      74
16,384          time, s     3.74     4.28    2.22    1.17    0.82    0.64
(128 × 128)     speedup              0.9     1.7     3.2     4.6     5.8
                eff., %              87      84      80      76      73
32,768          time, s     8.10     9.10    4.69    2.47    1.73    1.34
(128 × 256)     speedup              0.9     1.7     3.3     4.7     6.0
                eff., %              89      86      82      78      76
65,536          time, s    17.53    19.19    9.89    5.18    3.58    2.80
(256 × 256)     speedup              0.9     1.8     3.4     4.9     6.3
                eff., %              91      89      85      82      78

The lower efficiency on the MOS relative to the Sequent Symmetry is the result of the MOS bus bottlenecks. This is apparent from the comparison of the initial data distribution time on the MOS and on the Sequent. This time is fixed for more than two MOS processors, while it is proportional to 1/no. processors on the Sequent.

If we look at Table 3 we see that the difference in efficiency between row 4 (2048 points) and row 5 (4096 points) is around 5–6%. As the problem size increases, the efficiency gap narrows. For example, the difference between rows 6 and 7 (8192 and 16,384 points) is 1%. MOS automatic load balancing was very useful for the FFT with unequal factors.

Table 4. Components of parallel FFT execution time on MOS

                                          No. processors
                                1              2              8
Activity                    Time     %    Time     %     Time     %
Initialization              0.02   0.1    0.02   0.2     0.02   0.7
Initial data distribution   0.68   3.5    0.35   3.5     0.30  10.7
In-memory data shuffle      0.25   1.4    0.12   1.2     0.01   0.4
Inter-stage data
  distribution/collection   0.18   0.9    0.38   3.9     0.43  15.4
GFFT computations          18.06  94.1    9.02  91.2     2.04  72.8
Total                      19.19 100.0    9.89 100.0     2.80 100.0


6.3. Message passing performance

The distributed memory multiprocessor consists of eight T800[24] transputer nodes. Each node contains a T800 transputer and 2 or 4 MB of local memory. The transputer clock speed is 20 MHz and the memory access time is 80 ns. The nodes are connected by high speed bidirectional serial links. There is no shared memory. The transputers run the Helios[30] distributed operating system. The host server provides I/O services for the transputers.

Table 5 contains the parallel FFT timing on the transputer system.

Table 5. Parallel FFT timing on the transputer distributed memory multiprocessor

Points          Data      Serial   Parallel (no. processors)
                                       1       2       4       6       8
256             time, s     0.03     0.07    0.05    0.06    0.09    0.15
(16 × 16)       speedup              0.4     0.6     0.5     0.3     0.2
                eff., %              43      30      13       6       3
512             time, s     0.08     0.14    0.09    0.11    0.16    0.27
(16 × 32)       speedup              0.6     0.9     0.7     0.5     0.3
                eff., %              57      44      18       8       4
1024            time, s     0.17     0.27    0.18    0.13    0.12    0.13
(32 × 32)       speedup              0.6     0.9     1.3     1.4     1.3
                eff., %              63      47      33      24      16
2048            time, s     0.36     0.57    0.35    0.23    0.19    0.20
(32 × 64)       speedup              0.6     1.0     1.6     1.9     1.8
                eff., %              63      51      39      32      23
4096            time, s     0.78     1.18    0.69    0.42    0.33    0.29
(64 × 64)       speedup              0.7     1.1     1.9     2.4     2.7
                eff., %              66      57      46      39      34
8192            time, s     1.66     2.46    1.47    0.84    0.64    0.54
(64 × 128)      speedup              0.7     1.1     2.0     2.6     3.1
                eff., %              68      57      49      43      38
16,384          time, s     3.56     5.18    2.92    1.71    1.26    1.05
(128 × 128)     speedup              0.7     1.2     2.1     2.8     3.4
                eff., %              69      61      52      47      42
32,768          time, s     7.57    10.87    6.09    3.54    2.59    2.19
(128 × 256)     speedup              0.7     1.2     2.1     2.9     3.5
                eff., %              70      62      53      49      43
65,536          time, s    16.04    22.64   12.95    7.29    5.29    4.39
(256 × 256)     speedup              0.7     1.2     2.2     3.0     3.7
                eff., %              71      62      55      51      46

6.3.1. Message passing timing analysis

Table 6 describes the components of the parallel FFT execution time on 65,536 (256 × 256) data points. The description of the table columns and the components is the same as for Tables 2 and 4. The time required for broadcasting the input to all processors, which is part of the initial data distribution time, is shown in the Vget sync row (see Section 5.1 for details).

Discussion. The bottlenecks in the distributed memory implementation are the initial data distribution and the inter-stage data distribution and collection.


Table 6. Components of parallel FFT execution time on message passing multiprocessor

                                          No. processors
                                1              2              8
Activity                    Time     %    Time     %     Time     %
Initialization              0.07   0.3    0.07   0.5     0.08   1.8
Initial data distribution   1.25   5.5    1.35  10.4     0.67  15.3
(Vget sync only)†           0.04   0.2    0.46   3.6     0.47  10.7
In-memory data shuffle      0.60   2.6    0.37   2.9     0.12   2.7
Inter-stage data
  distribution/collection   0.08   0.4    0.74   5.7     0.57  13.0
GFFT computations          20.64  91.2   10.42  80.5     2.95  67.2
Total                      22.64 100.0   12.95 100.0     4.39 100.0

† Vget sync time is included in the initial data distribution time.

These activities consume less time on eight processors, but they do not scale relative to the number of processors. Notice that the input data broadcast time (Vget sync time) remains the same on two and eight processors, because it employs a spanning tree for efficiency. The current implementation of data distribution and collection calls some shared object services, which are serialized by processor 0, with evident performance degradation on a large number of processors.
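One possible spanning-tree schedule of the kind Vget sync uses for the broadcast is a binomial tree, sketched below in C. The msg_send/msg_recv primitives are hypothetical placeholders for the underlying message passing layer; the actual VMMP internals are not shown in the paper.

    /* Binomial (spanning-tree) broadcast sketch: processor 0 holds the
       data and all P processors hold a copy after ceil(log2 P) rounds. */
    void msg_send(int to, const void *buf, int nbytes);   /* hypothetical */
    void msg_recv(int from, void *buf, int nbytes);       /* hypothetical */

    void tree_broadcast(int me, int P, void *buf, int nbytes)
    {
        int have = (me == 0);               /* only processor 0 has the data */

        for (int step = 1; step < P; step <<= 1) {
            if (have && me + step < P) {
                msg_send(me + step, buf, nbytes);    /* forward a copy       */
            } else if (!have && me >= step && me < 2 * step) {
                msg_recv(me - step, buf, nbytes);    /* receive from parent  */
                have = 1;
            }
        }
    }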

The in-memory data shuffle scales with the number of processors as the amount of local data decreases.

The GFFT computation scales with the number of processors. There is no super-linear speedup here because the transputer has no cache. However, the lack of cache causes the GFFT computation on a single processor to take 28% more time than the serial FFT, because the GFFT involves an extra scaling of the data, which increases the number of memory references. As shown in Section 6.3.2, these extra memory references have an adverse effect on execution time. This should be contrasted with the MOS, on which a parallel FFT on a single processor takes only 9% more time than the serial version thanks to the cache.

The efficiency difference between rows 7 and 9 (16,384 and 65,536 points) in Table 5 is 3%, which indicates again that the algorithm is portable.

We expect to improve the performance of the parallel FFT on the transputer by optimizing the shared object services invoked in the data distribution and collection. However, the performance will remain poor relative to the shared memory implementation due to the lack of cache. The efficiency of the parallel FFT will increase on message passing machines equipped with a cache.

6.3.2. The effects of memory access time on transputer performance

Each T800 transputer contains 4 KB of fast on-chip RAM and can be connected to a much larger external memory (several MB). The internal memory is accessed in a single clock cycle, while the external memory is accessed in three cycles. The T800 transputer does not contain any cache to speed up external memory access. A 20 MHz T800 transputer can deliver up to 10 MIPS and one double precision MFLOP only when both the program code and the data reside in the internal memory.

The external memory access penalty can increase typical program running time by about


50%, as illustrated in Table 7. This table shows the running time, MFLOPS and execution time ratio of a program computing 10,000 vector operations of the form v ← v + α × w, where v and w are vectors of length 100 and α is a scalar. All computations are in double precision floating point. The execution times are given for all combinations of program code and data placements. This inner loop is typical of many numerical algorithms.

Table 7. Execution times on a T800 with various code and data placements

Placement                      Execution    MFLOPS    Time
Code         Data              time, s                ratio, %
external     external          6.02         0.332     146.1
internal     external          5.54         0.361     134.4
external     internal          4.55         0.440     110.4
internal     internal          4.12         0.485     100.0
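A sketch of that microbenchmark in C, reconstructed by us from the description above (the placement of code and data in internal or external memory is controlled by the toolchain, not by the source). The 2 × 100 × 10,000 = 2 × 10^6 floating point operations are consistent with the MFLOPS figures in Table 7:

    #define LEN   100       /* vector length from the text  */
    #define REPS  10000     /* number of vector operations  */

    /* v <- v + alpha*w: 2 flops per element */
    void daxpy(double *v, const double *w, double alpha, int n)
    {
        for (int i = 0; i < n; i++)
            v[i] += alpha * w[i];
    }

    void benchmark(double *v, const double *w, double alpha)
    {
        for (int r = 0; r < REPS; r++)
            daxpy(v, w, alpha, LEN);
    }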

Because the transputer CPU is relatively fast compared to the external memory access, the number of data references affects the program execution time much more than the number of floating point operations. In other words, most programs suffer a performance degradation due to the external memory access time.

The parallel FFT, like almost all other practical programs, could not use the fast internal memory, because it can hold only a small amount of code or data. Moreover, the parallel FFT contains extra data shuffling and scaling compared to the serial FFT. This additional work increases the number of memory accesses, which affects the execution time.

6.4. SP2 performance

VMMP was not ported to the SP2. However, since VMMP resembles PVM, it is easy to port a VMMP application to PVM and vice versa, and both tools provide crowd and tree computations with similar characteristics (see Section 4.1). Therefore, instead of running the parallel FFT with VMMP, it was examined with PVM on the SP2. The implementation of the parallel FFT using PVM was straightforward.
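For illustration, a minimal PVM master-side skeleton of the scatter/gather pattern that the crowd model maps onto is shown below. This is our sketch, not the authors' code; the worker program name "fft_worker", the data sizes and the message tags are hypothetical:

    #include <pvm3.h>

    #define NWORK 8
    #define N     (1 << 16)
    #define TAG_DATA    1
    #define TAG_RESULT  2

    int main(void)
    {
        static double x[N], y[N];
        int tids[NWORK];
        int slice = N / NWORK;

        /* ... fill x[] with the input samples ... */

        pvm_spawn("fft_worker", NULL, PvmTaskDefault, "", NWORK, tids);

        for (int w = 0; w < NWORK; w++) {          /* scatter the slices */
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&slice, 1, 1);
            pvm_pkdouble(&x[w * slice], slice, 1);
            pvm_send(tids[w], TAG_DATA);
        }
        for (int w = 0; w < NWORK; w++) {          /* gather the results */
            pvm_recv(tids[w], TAG_RESULT);
            pvm_upkdouble(&y[w * slice], slice, 1);
        }
        pvm_exit();
        return 0;
    }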

The IBM SP2 has 64 processors, 62 of which are of type Thin Node 2; each node includes 128 Mbytes of memory and a 4.5 Gbyte internal disk. The two remaining processors are of type Thin Node 2, in which each node has 512 Mbytes of memory and a 4.5 Gbyte internal disk. Each processor has a POWER2 architecture RS/6000 processor running at 66.7 MHz. The processors are interconnected with a multistage switch made up of bidirectional 4 × 4 crossbars. The SP2 is theoretically capable of 266 MFLOPS and it runs the AIX Windows Environment/6000. The communication rates are given in Table 8.

Table 8. SP2 communication rates

Communication rate     280 Mbit/s
Latency                 40 µs

Table 9 describes the parallel FFT performance for various data sizes on up to 64 SP2 processors. It does not contain results for data sizes of less than 16,384 points.


Table 9. Parallel FFT timing on the SP2 multiprocessor

Points          Data      Serial   Parallel (no. processors)
                                       1       2       8      16      24      32      40      64
16,384          time, s     0.19     0.24    0.13    0.03    0.014
(128 × 128)     speedup              0.79    1.46    6.3    13.76
                eff., %              79      73      79      86
32,768          time, s     0.40     0.51    0.25    0.06    0.03    0.019   0.014
(128 × 256)     speedup              0.78    1.6     6.6    13.3    21.1    28.5
                eff., %              78      80      82      83      88      89
65,536          time, s     0.85     1.08    0.54    0.13    0.07    0.04    0.03    0.024
(256 × 256)     speedup              0.79    1.6     6.5    12.8    21.3    28.3    35.4
                eff., %              79      80      81.3    80      89      88      88
131,072         time, s     1.82     2.25    1.21    0.26    0.13    0.083   0.06    0.05    0.031
(256 × 512)     speedup              0.81    1.5     6.9    14.1    21.9    29.8    36.5    58.7
                eff., %              81      75      86      88      91      91      93      91
262,144         time, s     4.10     5.46    2.41    0.58    0.28    0.19    0.14    0.11    0.069
(512 × 512)     speedup              0.75    1.7     7.1    14.6    21.6    29.3    37.3    59.4
                eff., %              88      85      88      91      90      91      93      93
524,288         time, s     9.00    10.7     5.6     1.25    0.6     0.41    0.31    0.24    0.152
(512 × 1024)    speedup              0.84    1.6     7.2    14.9    21.8    29.0    37.1    59.2
                eff., %              84      80      90      93      91      90      92      92
1,048,576       time, s    21.60    26.02   12.71    3.0     1.49    1.11    0.724   0.59    0.36
(1024 × 1024)   speedup              0.83    1.7     7.2    14.5    18.9    29.8    36.9    59.4
                eff., %              83      85      90      90      87      93      92      93
2,097,152       time, s    45.70    55.73   26.9     6.26    3.17    2.21    1.52    1.29    0.76
(1024 × 2048)   speedup              0.82    1.7     7.3    14.4    20.6    30.1    35.4    60.1
                eff., %              82      85      90      90      86      94      88      94
4,194,304       time, s    98.20   122.75   54.55   13.45    6.59    4.79    3.17    2.71    1.60
(2048 × 2048)   speedup              0.80    1.8     7.3    14.9    20.5    31.0    36.2    61.3
                eff., %              80      90      91      93      85      97      90      96

6.4.1. SP2 timing analysis

Table 10 describes the components of the parallel FFT execution time on 4,194,304 (2048 × 2048) data points. The time column is the execution time in seconds. The '%' column is the ratio of the component's execution time to the total time.

Table 10. Components of parallel FFT execution time on SP2

                                          No. processors
                                8             16             32             64
Activity                    Time     %    Time     %     Time     %     Time     %
Initialization                –      –      –      –       –      –       –      –
Initial data distribution     –      –      –      –       –      –     0.01     1
In-memory data shuffle      0.067   0.5   0.046   0.7    0.032   1.0    0.022   1.4
Inter-stage data
  distribution/collection   0.143   1.06  0.079   1.2    0.047   1.5    0.029   1.8
GFFT computations          13.25   98.5   6.46   98.1    3.10   97.5    1.55   96.8
Total                      13.45  100.0   6.59  100.0    3.17  100.0    1.60  100.0


Discussion. Table 10 indicates that the algorithm overhead (all components except the GFFT computation) is less than 4% for 64 processors. The overhead increases slowly from 1.5% on eight processors to 3.2% on 64 processors. Initial data distribution time is negligible. Inter-stage data distribution and collection time increases from about 1% on eight processors to 1.8% on 64.

Data communication is not a major bottleneck on the SP2, because the SP2 bandwidth is matched to the processor speed.

Table 9 shows that the difference in efficiency between row 5 (262,144 points) and row 6 (524,288 points) is around 1–2%. The same is true between row 7 (1,048,576 points) and row 8 (2,097,152 points). As the problem size increases, this gap stays essentially constant. These results indicate that the portable parallel FFT is efficient. It will achieve good speedup on a message-passing multiprocessor.

7. CONCLUSIONS

When implementing this algorithm on a MIMD machine, the main problem is how to exploit the architecture to perform the required data organization between each stage of the proposed algorithm. The main problems encountered were how to avoid the steady decrease in the vector length at successive stages, how to avoid costly data unscrambling, and how to achieve optimal load balancing.

To get near peak performance from the machine, we note that the delivered speedup is dramatically better for much longer transforms. In this case the parallel overhead is minimized because of the greater amount of work performed locally on each processor. In addition, there is less traffic on the bus and less synchronization is required. Increasing the vector length may also improve data locality and therefore make better use of the cache.
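A crude way to make this argument quantitative (an illustrative model only, not an analysis taken from this paper) is to write the parallel time as T_P ≈ W(N)/P + O(N, P), where W(N) ∝ N log N is the transform work and O(N, P) lumps together synchronization, message passing and memory traffic. Then

\[
E_P \;=\; \frac{T_1}{P\,T_P} \;\approx\; \frac{1}{1 + P\,O(N,P)/W(N)},
\]

so for a fixed number of processors any overhead term that grows more slowly than N log N becomes negligible as the transform length increases, which is the trend visible in Table 9.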

The GFFT described in this paper is fitted to this architecture: its features comprise a good match between the architecture (the increased locality of the computation) and the algorithm (the preprocessing of the twiddle factors, which substantially reduces the number of accesses to global memory). The locality of the computation, which is based on local tasks, enables this parallel FFT to achieve good speedup and hence demonstrates an almost optimal integration of the algorithm and the architecture.
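As an illustration of the kind of twiddle-factor preprocessing meant here, the following sketch (hypothetical C code, not the paper's actual VMMP implementation) computes the factors W_N^k = e^{-2πik/N} once into a table that each task can hold locally, so that the butterfly loops index this table instead of re-evaluating sin and cos or fetching coefficients from global memory:

    /* Hypothetical sketch: precompute the twiddle factors W_N^k, k = 0 .. N/2-1,
       into a table held by each task, replacing repeated sin/cos evaluations. */
    #include <stdlib.h>
    #include <complex.h>
    #include <math.h>

    double complex *precompute_twiddles(size_t n)
    {
        const double pi = 3.14159265358979323846;
        double complex *w = malloc((n / 2) * sizeof *w);

        if (w != NULL)
            for (size_t k = 0; k < n / 2; k++)
                /* W_N^k = cos(2*pi*k/N) - i*sin(2*pi*k/N) */
                w[k] = cos(2.0 * pi * (double)k / (double)n)
                     - sin(2.0 * pi * (double)k / (double)n) * I;
        return w;   /* caller frees; butterflies read w[] instead of calling sin/cos */
    }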

From examination of the results and from the structure of the algorithm, it is impossible to improve performance in the range of short transforms. For short transforms the parallel overhead dominates the performance gain, since each processor performs less work while more time is spent on synchronization, message passing, semaphores and access to global memory.

Can we improve the current algorithm? There are some points in the implementation which could be changed to yield better performance:

1. In our case, where the decomposition of the vector to be transformed can be written as N = N_1 × N_2 × ··· × N_p, this implementation gives better results. On the other hand, in this general setting it is impossible to hold pre-computed coefficient tables for every N_i, so we are forced to obtain them by on-line computation of the mathematical functions sin and cos, which may degrade performance. A potential algorithm for a mixed-radix GFFT is suggested by the following analysis.


If we denote by η the ratio between the computational complexity of the twiddle factor array and that of the frequency shift array (L_j denotes the partial product N_1 N_2 ··· N_j, with L_0 = 1), then we have

\[
\eta \;=\; \frac{\sum_{j=1}^{p} L_{j-1}\,(N_j - 1)}{\sum_{j=1}^{p} L_{j-1} \;-\; 1} \tag{17}
\]

Equation (17) can easily be reduced to the following form:

\[
\eta \;=\; \frac{L_p - 1}{\sum_{j=1}^{p} L_{j-1} \;-\; 1} \;\simeq\; N_p - \frac{N_p}{N_{p-1}}
\]

An approximate upper limit for η is given by

\[
\eta \;\simeq\; N_p - \frac{N_p}{N_{p-1}} - \cdots \tag{18}
\]

According to (17) and (18) it can be concluded that:

(a) To increase the efficiency of a mixed-radix GFFT scheme it is recommended to factor the input vector in ascending order of the N_j (see the numerical sketch following this list).

(b) For the equi-radix case N_j = r, j = 1, ..., p, the bound equals r − 1. However, in this case it is sufficient to compute only the corresponding arrays of the last stage, as they are composed of sub-arrays of the preceding stages which can be referenced by proper indexing.

(c) The potential savings increase linearly with the radix of the transform kernel.
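As a small numerical check of equation (17) and of conclusion (a), the following sketch (hypothetical C code, not part of the original implementation) evaluates η directly from a given factorization. For N = 1024 it gives η ≈ 13.8 for the ascending factorization 2 × 4 × 8 × 16 and η ≈ 1.6 for the descending one, both close to the approximate bound N_p − N_p/N_{p−1} of (18):

    /* Hypothetical sketch: evaluate eta of equation (17) for a factorization
       N = N1 x N2 x ... x Np, with L0 = 1 and Lj = N1*N2*...*Nj. */
    #include <stdio.h>

    static double eta(const int *factors, int p)
    {
        double num = 0.0, den = 0.0;
        long L = 1;                              /* L_{j-1}, starting from L_0 = 1 */

        for (int j = 0; j < p; j++) {
            num += (double)L * (factors[j] - 1); /* sum L_{j-1}(N_j - 1) = L_p - 1 */
            den += (double)L;                    /* sum L_{j-1}                    */
            L *= factors[j];
        }
        return num / (den - 1.0);
    }

    int main(void)
    {
        int ascending[]  = { 2, 4, 8, 16 };      /* N = 1024, ascending factors  */
        int descending[] = { 16, 8, 4, 2 };      /* same N, descending factors   */

        printf("eta (ascending)  = %.2f\n", eta(ascending, 4));   /* ~13.8 */
        printf("eta (descending) = %.2f\n", eta(descending, 4));  /* ~1.6  */
        return 0;
    }

The larger η obtained for the ascending ordering reflects the larger potential savings predicted by conclusions (a) and (c).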

This paper presented a parallel implementation of the FFT algorithm, which is based on the VMMP[2] portable software package. The parallel FFT achieved good speedup on both shared memory and distributed memory multiprocessors, although it did not use any machine-dependent embedding to reduce overhead. A detailed performance analysis indicated that this parallel FFT is portable and will scale efficiently to a large number of processors.

REFERENCES

1. A. Averbuch, E. Gabber, B. Gordissky and Y. Medan, ‘A parallel FFT on an MIMD machine’, Parallel Comput., 15, 61–74 (1990).

2. E. Gabber, ‘VMMP: A practical tool for the development of portable and efficient programs for multiprocessors’, IEEE Trans. Parallel Distrib. Syst., 1(3), 304–317 (1990).

3. E. Gabber, ‘VMMP user’s guide’, Technical Report 239/92, Computer Science Department, Tel Aviv University, January 1992.

4. A. Geist et al., PVM: Parallel Virtual Machine. A User’s Guide and Tutorial for Networked Parallel Computing, The MIT Press, 1994.

5. A. Dubey, M. Zubair and C. E. Grosch, ‘A general purpose subroutine for fast Fourier transform on a distributed memory parallel machine’, ICASE Report 92-56, NASA, 1992.


6. D. Gannon and J. Rosendale, ‘On the impact of communication complexity on the design of parallel numerical algorithms’, IEEE Trans. Comput., 33, 1180–1194 (1984).

7. S. L. Johnsson, Communication efficient basic linear algebra computations on hypercube architectures, Yale University, Report No. YALEU/DCS/RR-361, 1985.

8. O. A. McBryan and E. F. Van de Velde, ‘Hypercube algorithms and implementations’, SIAM J. Sci. Stat. Comput., 8(2), s227–s287 (1987).

9. Y. Saad and M. H. Schultz, ‘Data communication in hypercubes’, Yale University, Report No. YALEU/DCS/RR-428, 1985.

10. R. M. Chamberlain, ‘Gray codes, fast Fourier transforms and hypercubes’, Parallel Comput., 6, 225–233 (1988).

11. P. M. Flanders, ‘A unified approach to a class of data movements on an array processor’, IEEE Trans. Comput., 31, 809–819 (1982).

12. T. E. Ragland II and C. A. Outten, An Efficient Parallel Implementation of the FFT Algorithm on IBM 3090-600E, Lab. for Scientific Computation, Technical Report, University of Michigan, September 1990, to be published.

13. D. A. Carlson, ‘Using local memory to boost the performance of FFT algorithms on the CRAY-2 supercomputer’, J. Supercomput., 4, 345–356 (1990).

14. P. N. Swarztrauber, ‘Multiprocessor FFTs’, Parallel Comput., 5, 197–210 (1987).

15. D. H. Bailey and P. N. Swarztrauber, ‘The fractional Fourier transform and applications’, SIAM Rev., 33(3), 389–404 (1991).

16. J. P. Singh, J. L. Hennessy and A. Gupta, ‘Scaling parallel programs for multiprocessors: Methodology and examples’, Preprint, Computer Systems Laboratory, Stanford University.

17. E. Gabber, A. Averbuch and A. Yehudai, ‘A portable parallelizing Pascal compiler’, IEEE Softw., 10, 71–81 (1993).

18. E. Gabber, A. Averbuch and A. Yehudai, ‘A system for generating portable and efficient code for multiprocessors’, Advanced Compilation Techniques for Novel Architectures, M. Chen and R. Pinter (Eds), Lecture Notes in Computer Science, Springer-Verlag, 1992.

19. A. Averbuch, R. Dekel and E. Gabber, ‘Portable parallelizing Fortran compiler’, Concurrency: Pract. Exp., 8(2), 91–123 (1996); also Technical Report 283/93, Dept. of Computer Science, Tel Aviv University, 1993; also in International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’96), Vol. III, pp. 1236–1247, 9–11 August 1996, Sunnyvale, CA.

20. J. Flower, A. Kolawa and B. Sharadwaj, ‘The Express way to distributed processing’, Supercomput. Rev., 54–55 (1991).

21. E. Gabber, ‘Parallel programming using the MMX operating system and its processor’, Proc. of the 3rd Israel Conference on Computer Systems and Software Engineering, Tel-Aviv, Israel, 6–7 June 1988, pp. 122–132. A summary of Technical Report 107/88, Eskenasy Institute of Computer Sciences, Tel-Aviv University, Tel-Aviv, Israel, May 1988.

22. A. Garcia, D. J. Foster and R. F. Freitas, The Advance Computing Environment Multiprocessor Workstation, IBM T.J. Watson Research Center, IBM Research Report RC 14491 (#64901), March 1989.

23. A. Barak and A. Litman, ‘MOS: A multicomputer distributed operating system’, Software: Pract. Exp., 15(8), 725–737 (1985).

24. M. Homewood, D. May, D. Shepherd and R. Shepherd, ‘The IMS T800 transputer’, IEEE Micro, 7(5), 10–26 (1987).

25. A. Averbuch, S. Itzikowitz, H. Primak and E. Sammsonov, ‘An efficient parallel algorithm for the solution of tridiagonal systems associated with the underwater acoustic parabolic equation’, 3rd Int. Assoc. for Math. and Computers Simulation (IMACS), Harvard Univ., Computational Acoustics, Vol. 2, June 1991, pp. 221–232.

26. A. Averbuch, E. Gabber, S. Itzikowitz and B. Shoham, ‘On the parallel elliptic single/multi-grid solutions about aligned and nonaligned bodies using VMMP’, Scientific Program., 3, 13–32 (1994).

27. I. Gohberg, I. Koltracht, A. Averbuch and B. Shoham, ‘Timing analysis of a parallel algorithm for Toeplitz matrices on a MIMD parallel machine’, Parallel Comput., 17, 563–577 (1991).

28. A. Averbuch and D. Zifrony, Pdeb – A Parallel Debugger, Parallel Computing, D. J. Evans et al. (Eds), North-Holland, Leiden, Holland, 1989, pp. 481–486.


29. A. Averbuch, D. Amitai, R. Friedman and E. Gabber, ‘Portable parallel implementation of BLAS3’, Concurrency: Pract. Exp., 6(5), 411–459 (1994).

30. Perihelion Software, HELIOS Operating System, Prentice Hall, 1989.
