TRANSCRIPT
2009/4/21 Third French-Japanese PAAP Workshop 1
A Volumetric 3-D FFT on Clusters of Multi-Core
Processors
Daisuke Takahashi
University of Tsukuba, Japan
Outline
• Background
• Objectives
• Approach
• 3-D FFT Algorithm
• Volumetric 3-D FFT Algorithm
• Performance Results
• Conclusion
Background
• The fast Fourier transform (FFT) is an algorithm widely used today in science and engineering.
• Parallel 3-D FFT algorithms on distributed-memory parallel computers have been well studied.
• November 2008 TOP500 Supercomputing Sites:
– Roadrunner: 1,105.00 TFlops (129,600 cores)
– Jaguar (Cray XT5 QC 2.3 GHz): 1,059.00 TFlops (150,152 cores)
• Recently, the number of cores per system has kept increasing.
Background (cont’d)
• A typical decomposition for performing a parallel 3-D FFT is slabwise.
– A 3-D array x(N_1, N_2, N_3) is distributed along the third dimension N_3.
– N_3 must be greater than or equal to the number of MPI processes.
• This becomes an issue with very large node counts for a massively parallel cluster of multi-core processors.
Related Works
• Scalable framework for 3-D FFTs on the Blue Gene/L supercomputer [Eleftheriou et al. 03, 05]
– Based on a volumetric decomposition of data.
– Scales well up to 1,024 nodes for 3-D FFTs of size 128x128x128.
• 3-D FFT on the 6-D network torus QCDOC parallel supercomputer [Fang et al. 07]
– 3-D FFTs of size 128x128x128 can scale well on QCDOC up to 4,096 nodes.
Objectives
• Implementation and evaluation of a highly scalable 3-D FFT on massively parallel clusters of multi-core processors.
• Reduce the communication time for larger numbers of MPI processes.
• A comparison between the 1-D and 2-D distributions for the 3-D FFT.
Approach
• Some previously presented volumetric 3-D FFT algorithms [Eleftheriou et al. 03, 05; Fang et al. 07] use a 3-D distribution for the 3-D FFT.
– These schemes require three all-to-all communications.
• We use a 2-D distribution for the volumetric 3-D FFT.
– It requires only two all-to-all communications.
3-D FFT
• The 3-D discrete Fourier transform (DFT) is given by

  y(k_1, k_2, k_3) = \sum_{j_3=0}^{n_3-1} \sum_{j_2=0}^{n_2-1} \sum_{j_1=0}^{n_1-1} x(j_1, j_2, j_3)\, \omega_{n_1}^{j_1 k_1} \omega_{n_2}^{j_2 k_2} \omega_{n_3}^{j_3 k_3},

  where \omega_{n_r} = \exp(-2\pi i / n_r) and 0 \le k_r \le n_r - 1 (r = 1, 2, 3).
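As a sanity check on the definition, here is a minimal naive evaluation of the triple sum (O(N^2) work, not an FFT; the sign convention ω_n = exp(−2πi/n) is assumed, since the extracted slide does not preserve it):

```python
import cmath

def dft3d(x):
    """Naive 3-D DFT evaluating the triple sum directly.

    x[j1][j2][j3] is a nested list of numbers; the twiddle factor
    omega_n = exp(-2*pi*i/n) is an assumed sign convention.
    """
    n1, n2, n3 = len(x), len(x[0]), len(x[0][0])
    y = [[[0j] * n3 for _ in range(n2)] for _ in range(n1)]
    for k1 in range(n1):
        for k2 in range(n2):
            for k3 in range(n3):
                s = 0j
                for j1 in range(n1):
                    for j2 in range(n2):
                        for j3 in range(n3):
                            s += (x[j1][j2][j3]
                                  * cmath.exp(-2j * cmath.pi * j1 * k1 / n1)
                                  * cmath.exp(-2j * cmath.pi * j2 * k2 / n2)
                                  * cmath.exp(-2j * cmath.pi * j3 * k3 / n3))
                y[k1][k2][k3] = s
    return y

# An all-ones 2x2x2 input transforms to a single spike of height 8 at (0,0,0).
x = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]
y = dft3d(x)
```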
1-D distribution along z-axis
[Figure: a 3-D array (axes x, y, z) shown at three stages:
1. FFTs in x-axis  2. FFTs in y-axis  3. FFTs in z-axis]
With a slab decomposition
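The three sweeps above are the row-column method that the slab decomposition parallelizes (with whole slabs local to each process, steps 1 and 2 need no communication, and one all-to-all re-slabs the array before step 3). A serial sketch of the three sweeps, with a naive 1-D DFT standing in for a real FFT:

```python
import cmath

def dft1d(v):
    """Naive 1-D DFT; omega_n = exp(-2*pi*i/n) is an assumed convention."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft3d_by_axes(a):
    """3-D DFT as three sweeps of 1-D DFTs: x-axis, then y-axis, then z-axis."""
    n1, n2, n3 = len(a), len(a[0]), len(a[0][0])
    for j2 in range(n2):              # Step 1: DFTs along the x-axis
        for j3 in range(n3):
            col = dft1d([a[j1][j2][j3] for j1 in range(n1)])
            for j1 in range(n1):
                a[j1][j2][j3] = col[j1]
    for j1 in range(n1):              # Step 2: DFTs along the y-axis
        for j3 in range(n3):
            row = dft1d([a[j1][j2][j3] for j2 in range(n2)])
            for j2 in range(n2):
                a[j1][j2][j3] = row[j2]
    for j1 in range(n1):              # Step 3: DFTs along the z-axis
        for j2 in range(n2):
            a[j1][j2] = dft1d(a[j1][j2])
    return a

out = fft3d_by_axes([[[1.0] * 2 for _ in range(2)] for _ in range(2)])
```

The sweeps agree with the direct triple-sum definition; each axis transform touches only one index at a time, which is what makes the decompositions below possible.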
2-D distribution along y- and z-axes
[Figure: a 3-D array (axes x, y, z) shown at three stages:
1. FFTs in x-axis  2. FFTs in y-axis  3. FFTs in z-axis]
With a volumetric domain decomposition
Communication time of 1-D distribution
• Let us assume for an N = N_1 \times N_2 \times N_3-point FFT:
– Latency of communication: L (sec)
– Bandwidth: W (Byte/s)
– The number of processors: P \times Q
• One all-to-all communication among all PQ processors, each exchanging messages of 16N/(PQ)^2 bytes.
• Communication time of 1-D distribution:

  T_{\text{1-dim}} = (PQ - 1)\left(L + \frac{16N}{(PQ)^2 W}\right) \approx PQ\,L + \frac{16N}{PQ\,W} \quad \text{(sec)}
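The model above is easy to evaluate numerically. A sketch under assumed parameters (the 1.2 µs latency matches the HCA figure quoted later for T2K-Tsukuba; the 1 GB/s per-process bandwidth is purely illustrative, not a measured value):

```python
def t_1dim(n, p, q, latency, bandwidth):
    """1-D (slab) distribution model: one all-to-all among P*Q processes,
    each sending P*Q - 1 messages of 16*N/(P*Q)**2 bytes
    (16 bytes per double-complex element)."""
    pq = p * q
    return (pq - 1) * (latency + 16.0 * n / (pq * pq * bandwidth))

# N = 256^3 on 4,096 processes with assumed L = 1.2 us, W = 1 GB/s.
t = t_1dim(256**3, 64, 64, 1.2e-6, 1.0e9)
```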
Communication time of 2-D distribution
• Two all-to-all communications:
– Q simultaneous all-to-all communications of P processors in the y-axis.
– P simultaneous all-to-all communications of Q processors in the z-axis.
• Communication time of 2-D distribution:

  T_{\text{2-dim}} = (P - 1)\left(L + \frac{16N}{P^2 Q\,W}\right) + (Q - 1)\left(L + \frac{16N}{P Q^2\,W}\right) \approx (P + Q)\,L + \frac{32N}{PQ\,W} \quad \text{(sec)}
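A sketch of the corresponding 2-D model, with the same illustrative parameters as for the 1-D case (assumed, not measured):

```python
def t_2dim(n, p, q, latency, bandwidth):
    """2-D (volumetric) distribution model: Q simultaneous all-to-alls
    among P processes (y-axis), then P simultaneous all-to-alls among
    Q processes (z-axis); 16 bytes per double-complex element."""
    return ((p - 1) * (latency + 16.0 * n / (p * p * q * bandwidth))
            + (q - 1) * (latency + 16.0 * n / (p * q * q * bandwidth)))

# N = 256^3 on a 64 x 64 process grid with assumed L = 1.2 us, W = 1 GB/s.
t = t_2dim(256**3, 64, 64, 1.2e-6, 1.0e9)
```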
Comparing communication time
• Communication time of 1-D distribution:

  T_{\text{1-dim}} \approx PQ\,L + \frac{16N}{PQ\,W}

• Communication time of 2-D distribution:

  T_{\text{2-dim}} \approx (P + Q)\,L + \frac{32N}{PQ\,W}

• Comparing the two equations: the latency term grows as PQ for the 1-D distribution but only as P + Q for the 2-D distribution, while the bandwidth term merely doubles. Hence the communication time of the 2-D distribution is less than that of the 1-D distribution for larger numbers of processors P, Q and larger latency L.
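The crossover can be illustrated by plugging assumed numbers into the two approximations (L = 1.2 µs as quoted for the HCA; W = 1 GB/s per process is an illustrative guess): at small process counts the halved data volume wins, at large counts the P + Q latency term wins.

```python
def t1(n, p, q, lat, bw):
    """1-D model: T ~= PQ*L + 16N/(PQ*W)."""
    return p * q * lat + 16.0 * n / (p * q * bw)

def t2(n, p, q, lat, bw):
    """2-D model: T ~= (P+Q)*L + 32N/(PQ*W)."""
    return (p + q) * lat + 32.0 * n / (p * q * bw)

n, lat, bw = 256**3, 1.2e-6, 1.0e9        # assumed parameters
small = (t1(n, 8, 8, lat, bw), t2(n, 8, 8, lat, bw))      # 64 processes
large = (t1(n, 64, 64, lat, bw), t2(n, 64, 64, lat, bw))  # 4,096 processes
```

With these numbers the 1-D model is faster at 64 processes and the 2-D model is faster at 4,096, consistent with the discussion slides that follow.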
Performance Results
• To evaluate parallel 3-D FFTs, we compared:
– 1-D distribution
– 2-D distribution
• N = 32^3, 64^3, 128^3, and 256^3-point FFTs on from 1 to 4,096 cores.
• Target parallel machine:
– T2K-Tsukuba system (256 nodes, 4,096 cores).
– The flat MPI programming model was used.
– MVAPICH 1.2.0 was used as the communication library.
– The compiler used was the Intel Fortran compiler 10.1.
T2K-Tsukuba System
• Specification
– Number of nodes: 648 (Appro Xtreme-X3 Server)
– Theoretical peak performance: 95.4 TFlops
– Node configuration: 4 sockets of quad-core AMD Opteron 8356 (Barcelona, 2.3 GHz)
– Total main memory size: 20 TB
– Network interface: DDR InfiniBand Mellanox ConnectX HCA x4
– Network topology: Fat Tree
– Full-bisection bandwidth: 5.18 TB/s
Computation Node of T2K-Tsukuba
[Figure: node block diagram. Four quad-core Opteron sockets, each with dual-channel registered DDR2 memory (2 GB 667 MHz DDR2 DIMM x4 per socket), interconnected by 8 GB/s (full-duplex) HyperTransport links. NVIDIA nForce 3050 and 3600 bridges, attached over 4 GB/s (full-duplex) links, provide PCI-Express x16/x8, PCI-X, SAS, and USB. Two pairs of Mellanox MHGH28-XTC ConnectX HCAs (1.2 µs MPI latency, 4X DDR 20 Gb/s) attach to the bridges.]
Performance of parallel 3-D FFTs (2-D distribution)
[Figure: GFLOPS (log scale, 0.1 to 1000) vs. number of cores, for N = 32^3, 64^3, 128^3, and 256^3]
Discussion (1/2)
• For the N = 32^3-point FFT, we can clearly see that communication overhead dominates the execution time.
– In this case, the total working set size is only 1 MB.
• On the other hand, the 2-D distribution scales well up to 4,096 cores for the N = 256^3-point FFT.
– Performance on 4,096 cores is over 401 GFlops, about 1.1% of theoretical peak.
– Performance except for all-to-all communications is over 10 TFlops, about 26.7% of theoretical peak.
Performance of parallel 3-D FFTs (N = 256^3)
[Figure: GFLOPS (log scale, 0.1 to 1000) vs. number of cores, comparing the 1-D and 2-D distributions]
Discussion (2/2)
• For P \times Q \le 64, the performance of the 1-D distribution is better than that of the 2-D distribution.
– This is because the total communication amount of the 1-D distribution is half that of the 2-D distribution.
• However, for P \times Q \ge 128, the performance of the 2-D distribution is better than that of the 1-D distribution due to the latency.
Breakdown of parallel 3-D FFTs (256 cores, N = 256^3)
[Figure: bar chart of time (sec, 0 to 0.06) for the 1-D and 2-D distributions, split into computation (Comp.) and communication (Comm.)]
Conclusions
• We implemented a volumetric parallel 3-D FFT on clusters of multi-core processors.
• We showed that a 2-D distribution improves performance effectively by reducing the communication time for larger numbers of MPI processes.
• The proposed volumetric parallel 3-D FFT algorithm is most advantageous on massively parallel clusters of multi-core processors.
• We successfully achieved performance of over 401 GFlops on the T2K-Tsukuba system with 4,096 cores for the N = 256^3-point FFT.