2009/4/21 Third French-Japanese PAAP Workshop: A Volumetric 3-D FFT on Clusters of Multi-Core Processors. Daisuke Takahashi, University of Tsukuba, Japan


Page 1: A Volumetric 3-D FFT on Clusters of Multi-Core Processors

Daisuke Takahashi

University of Tsukuba, Japan

Page 2: Outline

• Background

• Objectives

• Approach

• 3-D FFT Algorithm

• Volumetric 3-D FFT Algorithm

• Performance Results

• Conclusion

Page 3: Background

• The fast Fourier transform (FFT) is an algorithm widely used today in science and engineering.

• Parallel 3-D FFT algorithms on distributed-memory parallel computers have been well studied.

• November 2008 TOP500 Supercomputing Sites

– Roadrunner: 1,105.00 TFlops (129,600 Cores)

– Jaguar (Cray XT5 QC 2.3GHz): 1,059.00 TFlops (150,152 Cores)

• Recently, the number of cores keeps increasing.

Page 4: Background (cont'd)

• A typical decomposition for performing a parallel 3-D FFT is slabwise.

– A 3-D array $x(N_1, N_2, N_3)$ is distributed along the third dimension $N_3$.

– $N_3$ must be greater than or equal to the number of MPI processes.

• This becomes an issue with very large node counts for a massively parallel cluster of multi-core processors.
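The process-count limit of the slab decomposition can be made concrete with a small sketch. The function names are illustrative and do not come from the slides; the 2-D bound assumes that both the second and third dimensions are split across processes, as in the volumetric approach this talk proposes.

```python
def max_processes_slab(n1, n2, n3):
    """Slab (1-D) decomposition: every MPI process needs at least one x-y plane."""
    return n3


def max_processes_2d(n1, n2, n3):
    """2-D decomposition over y and z: every process needs at least one x-line."""
    return n2 * n3


n = 256  # one side of a 256^3-point FFT
print(max_processes_slab(n, n, n))  # 256
print(max_processes_2d(n, n, n))    # 65536
```

For a $256^3$-point transform, a slab decomposition cannot use more than 256 processes, which is well below the 4,096 cores used in the experiments.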

Page 5: Related Works

• Scalable framework for 3-D FFTs on the Blue Gene/L supercomputer [Eleftheriou et al. 03, 05]

– Based on a volumetric decomposition of data.

– Scales well up to 1,024 nodes for 3-D FFTs of size 128x128x128.

• 3-D FFT on the 6-D network torus QCDOC parallel supercomputer [Fang et al. 07]

– 3-D FFTs of size 128x128x128 can scale well on QCDOC up to 4,096 nodes.

Page 6: Objectives

• Implementation and evaluation of a highly scalable 3-D FFT on massively parallel clusters of multi-core processors.

• Reduce the communication time for larger numbers of MPI processes.

• A comparison between 1-D and 2-D distribution for 3-D FFT.

Page 7: Approach

• Some previously presented volumetric 3-D FFT algorithms [Eleftheriou et al. 03, 05, Fang et al. 07] use a 3-D distribution for the 3-D FFT.

– These schemes require three all-to-all communications.

• We use a 2-D distribution for the volumetric 3-D FFT.

– It requires only two all-to-all communications.

Page 8: 3-D FFT

• 3-D discrete Fourier transform (DFT) is given by

$$y(k_1, k_2, k_3) = \sum_{j_1=0}^{n_1-1} \sum_{j_2=0}^{n_2-1} \sum_{j_3=0}^{n_3-1} x(j_1, j_2, j_3)\, \omega_{n_1}^{j_1 k_1} \omega_{n_2}^{j_2 k_2} \omega_{n_3}^{j_3 k_3},$$

$$0 \le k_r \le n_r - 1, \quad \omega_{n_r} = \exp(-2\pi i / n_r), \quad 1 \le r \le 3$$
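Because the triple sum separates by index, the 3-D DFT can be computed as 1-D FFTs along each axis in turn. The following NumPy sketch checks this on a small random array; it assumes the usual forward-transform sign convention $\omega_n = \exp(-2\pi i/n)$, and the helper names are illustrative.

```python
import numpy as np


def dft3_direct(x):
    """Evaluate the triple sum literally (slow; for checking small inputs only)."""
    n1, n2, n3 = x.shape
    # w_r[k, j] = exp(-2*pi*i*j*k / n_r), i.e. the twiddle factor omega_{n_r}^{jk}
    w1, w2, w3 = (
        np.exp(-2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n)
        for n in (n1, n2, n3)
    )
    # y[i, j, k] = sum over a, b, c of x[a, b, c] * w1[i, a] * w2[j, b] * w3[k, c]
    return np.einsum("abc,ia,jb,kc->ijk", x, w1, w2, w3)


def dft3_by_axes(x):
    """Apply 1-D FFTs along the x-, y-, and z-axes in turn."""
    y = np.fft.fft(x, axis=0)
    y = np.fft.fft(y, axis=1)
    return np.fft.fft(y, axis=2)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 8)) + 1j * rng.standard_normal((4, 6, 8))
print(np.allclose(dft3_direct(x), dft3_by_axes(x)))  # True
```

This axis-by-axis structure is what lets a parallel implementation keep two of the three passes local and communicate only when an axis is split across processes.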

Page 9: 1-D distribution along z-axis

[Figure: three panels, each with x, y, z axes, showing the data layout with a slab decomposition. 1. FFTs in x-axis; 2. FFTs in y-axis; 3. FFTs in z-axis.]
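The three steps of the slab-decomposed transform can be simulated in a single Python process. This is a minimal sketch, not the implementation from the talk: NumPy array splits stand in for MPI processes, list reshuffling stands in for the all-to-all communication, and the name `fft3d_slab` is hypothetical.

```python
import numpy as np


def fft3d_slab(x, p):
    """Simulate a 3-D FFT with a 1-D (slab) distribution over p 'processes'."""
    n1, n2, n3 = x.shape
    assert n2 % p == 0 and n3 % p == 0, "p must divide n2 and n3 in this sketch"
    # Initial distribution: process r owns one z-slab of the array.
    slabs = np.split(x, p, axis=2)
    # Steps 1 and 2: x- and y-lines are complete inside each slab, so both FFTs are local.
    slabs = [np.fft.fft(np.fft.fft(s, axis=0), axis=1) for s in slabs]
    # All-to-all: process r sends the q-th y-block of its slab to process q.
    blocks = [np.split(s, p, axis=1) for s in slabs]  # blocks[r][q]
    pencils = [
        np.concatenate([blocks[r][q] for r in range(p)], axis=2) for q in range(p)
    ]
    # Step 3: every process now owns complete z-lines for its y-block.
    pencils = [np.fft.fft(s, axis=2) for s in pencils]
    # Gather the distributed result (for checking only).
    return np.concatenate(pencils, axis=1)


rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8, 8)) + 1j * rng.standard_normal((4, 8, 8))
print(np.allclose(fft3d_slab(x, 4), np.fft.fftn(x)))  # True
```

Only one all-to-all is needed, but it involves all processes at once.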

Page 10: 2-D distribution along y- and z-axes

[Figure: three panels, each with x, y, z axes, showing the data layout with a volumetric domain decomposition. 1. FFTs in x-axis; 2. FFTs in y-axis; 3. FFTs in z-axis.]
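The same kind of single-process simulation illustrates the 2-D distribution over a P x Q process grid. The data movement below is one plausible arrangement chosen for clarity, not necessarily the layout used in the talk; `fft3d_pencil` and its internal names are hypothetical.

```python
import numpy as np


def fft3d_pencil(x, P, Q):
    """Simulate a 3-D FFT with a 2-D distribution over a P x Q grid of 'processes'."""
    n1, n2, n3 = x.shape
    assert n1 % P == 0 and n2 % P == 0 and n2 % Q == 0 and n3 % Q == 0
    # Initial distribution: process (p, q) owns y-block p and z-block q, with all of x.
    local = [np.split(row, Q, axis=2) for row in np.split(x, P, axis=1)]
    # Step 1: FFTs in the x-axis are local.
    local = [[np.fft.fft(b, axis=0) for b in row] for row in local]
    # First all-to-all: Q independent exchanges, each among P processes.
    # Afterwards process (s, q) owns x-block s, all of y, and z-block q.
    new = [[None] * Q for _ in range(P)]
    for q in range(Q):
        pieces = [np.split(local[p][q], P, axis=0) for p in range(P)]
        for s in range(P):
            new[s][q] = np.concatenate([pieces[p][s] for p in range(P)], axis=1)
    # Step 2: FFTs in the y-axis are now local.
    local = [[np.fft.fft(b, axis=1) for b in row] for row in new]
    # Second all-to-all: P independent exchanges, each among Q processes.
    # Afterwards process (p, t) owns x-block p, y-block t, and all of z.
    new = [[None] * Q for _ in range(P)]
    for p in range(P):
        pieces = [np.split(local[p][q], Q, axis=1) for q in range(Q)]
        for t in range(Q):
            new[p][t] = np.concatenate([pieces[q][t] for q in range(Q)], axis=2)
    # Step 3: FFTs in the z-axis are now local.
    local = [[np.fft.fft(b, axis=2) for b in row] for row in new]
    # Gather the distributed result (for checking only).
    return np.concatenate([np.concatenate(row, axis=1) for row in local], axis=0)


rng = np.random.default_rng(2)
x = rng.standard_normal((4, 12, 6)) + 1j * rng.standard_normal((4, 12, 6))
print(np.allclose(fft3d_pencil(x, 2, 3), np.fft.fftn(x)))  # True
```

Two exchanges are needed instead of one, but each involves only P or Q processes rather than all P x Q.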

Page 11: Communication time of 1-D distribution

• Let us assume for $N = N_1 N_2 N_3$-point FFT:

– Latency of communication: $L$ (sec)

– Bandwidth: $W$ (Byte/s)

– The number of processors: $P \times Q$

• One all-to-all communication

• Communication time of 1-D distribution

$$T_{\mathrm{1dim}} = (PQ - 1)\left(L + \frac{16N}{(PQ)^2 \cdot W}\right) \approx PQ \cdot L + \frac{16N}{PQ \cdot W} \quad \text{(sec)}$$
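The 1-D cost model, $(PQ-1)(L + 16N/((PQ)^2 W))$ and its approximation $PQ \cdot L + 16N/(PQ \cdot W)$, is simple enough to evaluate directly. In the sketch below the latency and bandwidth values are made-up placeholders, not measurements from the talk, and the function names are illustrative.

```python
def t_1dim_exact(n, pq, latency, bandwidth):
    """(PQ - 1) * (L + 16 N / ((PQ)^2 W)) in seconds; 16 bytes per complex double."""
    return (pq - 1) * (latency + 16.0 * n / (pq ** 2 * bandwidth))


def t_1dim_approx(n, pq, latency, bandwidth):
    """PQ * L + 16 N / (PQ * W) in seconds."""
    return pq * latency + 16.0 * n / (pq * bandwidth)


N = 256 ** 3  # total number of points
L = 5.0e-6    # hypothetical latency in seconds
W = 1.0e9     # hypothetical bandwidth in bytes per second

for pq in (64, 256, 1024, 4096):
    print(pq, t_1dim_exact(N, pq, L, W), t_1dim_approx(N, pq, L, W))
```

The two forms differ exactly by the factor $(PQ-1)/PQ$, so the approximation is tight for large process counts. The latency term grows linearly with $PQ$ while the bandwidth term shrinks.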

Page 12: Communication time of 2-D distribution

• Two all-to-all communications

– $Q$ simultaneous all-to-all communications of $P$ processors in y-axis.

– $P$ simultaneous all-to-all communications of $Q$ processors in z-axis.

• Communication time of 2-D distribution

$$T_{\mathrm{2dim}} = (P - 1)\left(L + \frac{16N}{P^2 Q \cdot W}\right) + (Q - 1)\left(L + \frac{16N}{P Q^2 \cdot W}\right) \approx (P + Q) \cdot L + \frac{32N}{PQ \cdot W} \quad \text{(sec)}$$
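The 2-D cost model, $(P-1)(L + 16N/(P^2 Q W)) + (Q-1)(L + 16N/(P Q^2 W))$, depends on how the process count is factored into P x Q. The sketch below searches the factorizations of a fixed process count; the latency and bandwidth values are hypothetical placeholders and the function names are illustrative.

```python
def t_2dim_exact(n, p, q, latency, bandwidth):
    """Sum of the two all-to-all phases, in seconds."""
    t_y = (p - 1) * (latency + 16.0 * n / (p * p * q * bandwidth))
    t_z = (q - 1) * (latency + 16.0 * n / (p * q * q * bandwidth))
    return t_y + t_z


def t_2dim_approx(n, p, q, latency, bandwidth):
    """(P + Q) * L + 32 N / (PQ * W), in seconds."""
    return (p + q) * latency + 32.0 * n / (p * q * bandwidth)


def best_grid(n, pq, latency, bandwidth):
    """Return the (P, Q) factorization of pq with the smallest approximate cost."""
    grids = [(p, pq // p) for p in range(1, pq + 1) if pq % p == 0]
    return min(grids, key=lambda g: t_2dim_approx(n, g[0], g[1], latency, bandwidth))


N = 256 ** 3  # total number of points
L = 5.0e-6    # hypothetical latency in seconds
W = 1.0e9     # hypothetical bandwidth in bytes per second

print(best_grid(N, 4096, L, W))  # (64, 64)
```

For a fixed product $PQ$ the bandwidth term is constant, so the model favors the grid that minimizes $P + Q$, which is the most nearly square one.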

Page 13: Comparing communication time

• Communication time of 1-D distribution

$$T_{\mathrm{1dim}} \approx PQ \cdot L + \frac{16N}{PQ \cdot W}$$

• Communication time of 2-D distribution

$$T_{\mathrm{2dim}} \approx (P + Q) \cdot L + \frac{32N}{PQ \cdot W}$$

• By comparing the two equations, the communication time of the 2-D distribution is less than that of the 1-D distribution for a larger number of processors $P \times Q$ and a larger latency $L$.
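Subtracting the two approximate expressions, $T_{\mathrm{1dim}} \approx PQ \cdot L + 16N/(PQ \cdot W)$ and $T_{\mathrm{2dim}} \approx (P+Q) \cdot L + 32N/(PQ \cdot W)$, makes the crossover condition explicit. This worked step uses only the symbols defined in the cost model:

```latex
\begin{align*}
T_{\mathrm{2dim}} < T_{\mathrm{1dim}}
  &\iff (P + Q)\,L + \frac{32N}{PQ \cdot W} < PQ \cdot L + \frac{16N}{PQ \cdot W} \\
  &\iff \frac{16N}{PQ \cdot W} < (PQ - P - Q)\,L
\end{align*}
```

The left side is the extra data volume of the 2-D distribution and shrinks as $P \times Q$ grows. The right side is the latency saved and grows with both $P \times Q$ and $L$, so the 2-D distribution wins once the process count or the latency is large enough.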

Page 14: Performance Results

• To evaluate parallel 3-D FFTs, we compared

– 1-D distribution

– 2-D distribution

• $N = 32^3, 64^3, 128^3$ and $256^3$-point FFTs on 1 to 4,096 cores.

• Target parallel machine:

– T2K-Tsukuba system (256 nodes, 4,096 cores).

– The flat MPI programming model was used.

– MVAPICH 1.2.0 was used as a communication library.

– The compiler used was Intel Fortran compiler 10.1.

Page 15: T2K-Tsukuba System

• Specification

– The number of nodes: 648 (Appro Xtreme-X3 Server)

– Theoretical peak performance: 95.4 TFlops

– Node configuration: 4 sockets of quad-core AMD Opteron 8356 (Barcelona 2.3 GHz)

– Total main memory size: 20 TB

– Network interface: DDR InfiniBand Mellanox ConnectX HCA x 4

– Network topology: Fat Tree

– Full-bisection bandwidth: 5.18 TB/s
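The quoted peak performance follows from the node count and node configuration. The sketch below assumes 4 double-precision floating-point operations per cycle per core, a figure that is not stated on the slide; the constant and function names are illustrative.

```python
NODES = 648
SOCKETS_PER_NODE = 4
CORES_PER_SOCKET = 4
CLOCK_GHZ = 2.3
FLOPS_PER_CYCLE = 4  # assumption: double-precision flops per cycle per core


def peak_tflops(nodes):
    """Theoretical peak in TFlops for the given number of nodes."""
    cores = nodes * SOCKETS_PER_NODE * CORES_PER_SOCKET
    return cores * CLOCK_GHZ * FLOPS_PER_CYCLE / 1000.0


print(round(peak_tflops(NODES), 1))  # 95.4
```

Under this assumption each core peaks at 9.2 GFlops and each 16-core node at 147.2 GFlops.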

Page 16: Computation Node of T2K-Tsukuba

[Figure: block diagram of a computation node. Labels in the diagram: Dual Channel Reg DDR2; 2GB 667MHz DDR2 DIMM x4 (four banks); HyperTransport 8GB/s (Full-duplex); 4GB/s (Full-duplex) links marked (A)1, (B)1, (A)2, (B)2; NVIDIA nForce 3600 and NVIDIA nForce 3050 bridges; I/O Hub; PCI-Express X16, X8, X4; PCI-X; USB; SAS; two sets of Mellanox MHGH28-XTC ConnectX HCA x2 (1.2µs MPI Latency, 4X DDR 20Gb/s).]

Page 17: Performance of parallel 3-D FFTs (2-D distribution)

[Figure: GFLOPS (log scale, 0.1 to 1000) versus number of cores, with one curve each for N=32^3, N=64^3, N=128^3, and N=256^3.]

Page 18: Discussion (1/2)

• For $N = 32^3$-point FFT, we can clearly see that communication overhead dominates the execution time.

– In this case, the total working set size is only 1MB.

• On the other hand, the 2-D distribution scales well up to 4,096 cores for $N = 256^3$-point FFT.

– Performance on 4,096 cores is over 401 GFlops, about 1.1% of theoretical peak.

– Performance except for all-to-all communications is over 10 TFlops, about 26.7% of theoretical peak.
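The percentages of peak can be sanity-checked with a few lines of arithmetic. This sketch assumes a per-core peak of 2.3 GHz times 4 double-precision flops per cycle, which is not stated in the slides; the names are illustrative.

```python
PEAK_GFLOPS_PER_CORE = 2.3 * 4  # assumption: 2.3 GHz x 4 double-precision flops/cycle


def fraction_of_peak(gflops, cores):
    """Achieved performance divided by the theoretical peak of the cores used."""
    return gflops / (cores * PEAK_GFLOPS_PER_CORE)


print(f"{100 * fraction_of_peak(401.0, 4096):.2f}% of peak at 401 GFlops")
print(f"{100 * fraction_of_peak(10000.0, 4096):.2f}% of peak at 10 TFlops")
```

Under this assumption the 4,096 cores peak at roughly 37.7 TFlops, which is consistent with the quoted figures of about 1.1% overall and a little over a quarter of peak once all-to-all communication is excluded.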

Page 19: Performance of parallel 3-D FFTs (N=256^3)

[Figure: GFLOPS (log scale, 0.1 to 1000) versus number of cores, comparing the 1-D distribution and the 2-D distribution.]

Page 20: Discussion (2/2)

• For $P \times Q \le 64$, the performance of the 1-D distribution is better than that of the 2-D distribution.

– This is because the total communication amount of the 1-D distribution is half that of the 2-D distribution.

• However, for $P \times Q \ge 128$, the performance of the 2-D distribution is better than that of the 1-D distribution due to the latency.

Page 21: Breakdown of parallel 3-D FFTs (256 cores, N=256^3)

[Figure: bar chart of time (sec), axis from 0 to 0.06, for the 1-D distribution and the 2-D distribution, each broken down into Comp. and Comm.]

Page 22: Conclusions

• We implemented a volumetric parallel 3-D FFT on clusters of multi-core processors.

• We showed that a 2-D distribution improves performance effectively by reducing the communication time for larger numbers of MPI processes.

• The proposed volumetric parallel 3-D FFT algorithm is most advantageous on massively parallel clusters of multi-core processors.

• We successfully achieved performance of over 401 GFlops on the T2K-Tsukuba system with 4,096 cores for $N = 256^3$-point FFT.