TRANSCRIPT
2009/4/21 Third French-Japanese PAAP Workshop 1
A Volumetric 3-D FFT on Clusters of Multi-Core
Processors
Daisuke Takahashi
University of Tsukuba, Japan
Outline
• Background
• Objectives
• Approach
• 3-D FFT Algorithm
• Volumetric 3-D FFT Algorithm
• Performance Results
• Conclusion
Background
• The fast Fourier transform (FFT) is an algorithm widely used today in science and engineering.
• Parallel 3-D FFT algorithms on distributed-memory parallel computers have been well studied.
• November 2008 TOP500 Supercomputing Sites:
– Roadrunner: 1,105.00 TFlops (129,600 cores)
– Jaguar (Cray XT5 QC 2.3 GHz): 1,059.00 TFlops (150,152 cores)
• Recently, the number of cores per system has kept increasing.
Background (cont’d)
• A typical decomposition for performing a parallel 3-D FFT is slabwise.
– A 3-D array x(N_1, N_2, N_3) is distributed along the third dimension N_3.
– N_3 must be greater than or equal to the number of MPI processes.
• This becomes an issue with very large node counts for a massively parallel cluster of multi-core processors.
Related Works
• Scalable framework for 3-D FFTs on the Blue Gene/L supercomputer [Eleftheriou et al. 03, 05]
– Based on a volumetric decomposition of data.
– Scales well up to 1,024 nodes for 3-D FFTs of size 128x128x128.
• 3-D FFT on the 6-D network torus QCDOC parallel supercomputer [Fang et al. 07]
– 3-D FFTs of size 128x128x128 can scale well on QCDOC up to 4,096 nodes.
Objectives
• Implementation and evaluation of a highly scalable 3-D FFT on massively parallel clusters of multi-core processors.
• Reduce the communication time for larger numbers of MPI processes.
• A comparison between the 1-D and 2-D distributions for the 3-D FFT.
Approach
• Some previously presented volumetric 3-D FFT algorithms [Eleftheriou et al. 03, 05; Fang et al. 07] use a 3-D distribution for the 3-D FFT.
– These schemes require three all-to-all communications.
• We use a 2-D distribution for the volumetric 3-D FFT.
– It requires only two all-to-all communications.
3-D FFT
• The 3-D discrete Fourier transform (DFT) is given by

  y(k_1, k_2, k_3) = \sum_{j_3=0}^{n_3-1} \sum_{j_2=0}^{n_2-1} \sum_{j_1=0}^{n_1-1} x(j_1, j_2, j_3)\, \omega_{n_1}^{j_1 k_1} \omega_{n_2}^{j_2 k_2} \omega_{n_3}^{j_3 k_3},

  where \omega_{n_r} = \exp(-2\pi i / n_r) and 0 \le k_r \le n_r - 1 (r = 1, 2, 3).
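As a sanity check on the definition, here is a minimal naive evaluation of the triple sum (O(N^2) work, not an FFT; the sign convention ω_n = exp(−2πi/n) is assumed, since the extracted slide does not preserve it):

```python
import cmath

def dft3d(x):
    """Naive 3-D DFT evaluating the triple sum directly.

    x[j1][j2][j3] is a nested list of numbers; the twiddle factor
    omega_n = exp(-2*pi*i/n) is an assumed sign convention.
    """
    n1, n2, n3 = len(x), len(x[0]), len(x[0][0])
    y = [[[0j] * n3 for _ in range(n2)] for _ in range(n1)]
    for k1 in range(n1):
        for k2 in range(n2):
            for k3 in range(n3):
                s = 0j
                for j1 in range(n1):
                    for j2 in range(n2):
                        for j3 in range(n3):
                            s += (x[j1][j2][j3]
                                  * cmath.exp(-2j * cmath.pi * j1 * k1 / n1)
                                  * cmath.exp(-2j * cmath.pi * j2 * k2 / n2)
                                  * cmath.exp(-2j * cmath.pi * j3 * k3 / n3))
                y[k1][k2][k3] = s
    return y

# An all-ones 2x2x2 input transforms to a single spike of height 8 at (0,0,0).
x = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]
y = dft3d(x)
```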
1-D distribution along z-axis
[Figure: a 3-D array (axes x, y, z) shown at three stages:
1. FFTs in x-axis  2. FFTs in y-axis  3. FFTs in z-axis]
With a slab decomposition
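The three sweeps above are the row-column method that the slab decomposition parallelizes (with whole slabs local to each process, steps 1 and 2 need no communication, and one all-to-all re-slabs the array before step 3). A serial sketch of the three sweeps, with a naive 1-D DFT standing in for a real FFT:

```python
import cmath

def dft1d(v):
    """Naive 1-D DFT; omega_n = exp(-2*pi*i/n) is an assumed convention."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft3d_by_axes(a):
    """3-D DFT as three sweeps of 1-D DFTs: x-axis, then y-axis, then z-axis."""
    n1, n2, n3 = len(a), len(a[0]), len(a[0][0])
    for j2 in range(n2):              # Step 1: DFTs along the x-axis
        for j3 in range(n3):
            col = dft1d([a[j1][j2][j3] for j1 in range(n1)])
            for j1 in range(n1):
                a[j1][j2][j3] = col[j1]
    for j1 in range(n1):              # Step 2: DFTs along the y-axis
        for j3 in range(n3):
            row = dft1d([a[j1][j2][j3] for j2 in range(n2)])
            for j2 in range(n2):
                a[j1][j2][j3] = row[j2]
    for j1 in range(n1):              # Step 3: DFTs along the z-axis
        for j2 in range(n2):
            a[j1][j2] = dft1d(a[j1][j2])
    return a

out = fft3d_by_axes([[[1.0] * 2 for _ in range(2)] for _ in range(2)])
```

The sweeps agree with the direct triple-sum definition; each axis transform touches only one index at a time, which is what makes the decompositions below possible.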
2-D distribution along y- and z-axes
[Figure: a 3-D array (axes x, y, z) shown at three stages:
1. FFTs in x-axis  2. FFTs in y-axis  3. FFTs in z-axis]
With a volumetric domain decomposition
Communication time of 1-D distribution
• Let us assume for an N = N_1 \times N_2 \times N_3-point FFT:
– Latency of communication: L (sec)
– Bandwidth: W (Byte/s)
– The number of processors: P \times Q
• One all-to-all communication among all PQ processors, each exchanging messages of 16N/(PQ)^2 bytes.
• Communication time of 1-D distribution:

  T_{\text{1-dim}} = (PQ - 1)\left(L + \frac{16N}{(PQ)^2 W}\right) \approx PQ\,L + \frac{16N}{PQ\,W} \quad \text{(sec)}
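The model above is easy to evaluate numerically. A sketch under assumed parameters (the 1.2 µs latency matches the HCA figure quoted later for T2K-Tsukuba; the 1 GB/s per-process bandwidth is purely illustrative, not a measured value):

```python
def t_1dim(n, p, q, latency, bandwidth):
    """1-D (slab) distribution model: one all-to-all among P*Q processes,
    each sending P*Q - 1 messages of 16*N/(P*Q)**2 bytes
    (16 bytes per double-complex element)."""
    pq = p * q
    return (pq - 1) * (latency + 16.0 * n / (pq * pq * bandwidth))

# N = 256^3 on 4,096 processes with assumed L = 1.2 us, W = 1 GB/s.
t = t_1dim(256**3, 64, 64, 1.2e-6, 1.0e9)
```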
Communication time of 2-D distribution
• Two all-to-all communications:
– Q simultaneous all-to-all communications of P processors in the y-axis.
– P simultaneous all-to-all communications of Q processors in the z-axis.
• Communication time of 2-D distribution:

  T_{\text{2-dim}} = (P - 1)\left(L + \frac{16N}{P^2 Q\,W}\right) + (Q - 1)\left(L + \frac{16N}{P Q^2\,W}\right) \approx (P + Q)\,L + \frac{32N}{PQ\,W} \quad \text{(sec)}
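A sketch of the corresponding 2-D model, with the same illustrative parameters as for the 1-D case (assumed, not measured):

```python
def t_2dim(n, p, q, latency, bandwidth):
    """2-D (volumetric) distribution model: Q simultaneous all-to-alls
    among P processes (y-axis), then P simultaneous all-to-alls among
    Q processes (z-axis); 16 bytes per double-complex element."""
    return ((p - 1) * (latency + 16.0 * n / (p * p * q * bandwidth))
            + (q - 1) * (latency + 16.0 * n / (p * q * q * bandwidth)))

# N = 256^3 on a 64 x 64 process grid with assumed L = 1.2 us, W = 1 GB/s.
t = t_2dim(256**3, 64, 64, 1.2e-6, 1.0e9)
```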
Comparing communication time
• Communication time of 1-D distribution:

  T_{\text{1-dim}} \approx PQ\,L + \frac{16N}{PQ\,W}

• Communication time of 2-D distribution:

  T_{\text{2-dim}} \approx (P + Q)\,L + \frac{32N}{PQ\,W}

• Comparing the two equations: the latency term grows as PQ for the 1-D distribution but only as P + Q for the 2-D distribution, while the bandwidth term merely doubles. Hence the communication time of the 2-D distribution is less than that of the 1-D distribution for larger numbers of processors P, Q and larger latency L.
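The crossover can be illustrated by plugging assumed numbers into the two approximations (L = 1.2 µs as quoted for the HCA; W = 1 GB/s per process is an illustrative guess): at small process counts the halved data volume wins, at large counts the P + Q latency term wins.

```python
def t1(n, p, q, lat, bw):
    """1-D model: T ~= PQ*L + 16N/(PQ*W)."""
    return p * q * lat + 16.0 * n / (p * q * bw)

def t2(n, p, q, lat, bw):
    """2-D model: T ~= (P+Q)*L + 32N/(PQ*W)."""
    return (p + q) * lat + 32.0 * n / (p * q * bw)

n, lat, bw = 256**3, 1.2e-6, 1.0e9        # assumed parameters
small = (t1(n, 8, 8, lat, bw), t2(n, 8, 8, lat, bw))      # 64 processes
large = (t1(n, 64, 64, lat, bw), t2(n, 64, 64, lat, bw))  # 4,096 processes
```

With these numbers the 1-D model is faster at 64 processes and the 2-D model is faster at 4,096, consistent with the discussion slides that follow.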
Performance Results
• To evaluate parallel 3-D FFTs, we compared:
– 1-D distribution
– 2-D distribution
• N = 32^3, 64^3, 128^3, and 256^3-point FFTs on from 1 to 4,096 cores.
• Target parallel machine:
– T2K-Tsukuba system (256 nodes, 4,096 cores).
– The flat MPI programming model was used.
– MVAPICH 1.2.0 was used as the communication library.
– The compiler used was the Intel Fortran compiler 10.1.
T2K-Tsukuba System
• Specification
– Number of nodes: 648 (Appro Xtreme-X3 Server)
– Theoretical peak performance: 95.4 TFlops
– Node configuration: 4 sockets of quad-core AMD Opteron 8356 (Barcelona, 2.3 GHz)
– Total main memory size: 20 TB
– Network interface: DDR InfiniBand Mellanox ConnectX HCA x4
– Network topology: Fat Tree
– Full-bisection bandwidth: 5.18 TB/s
Computation Node of T2K-Tsukuba
[Figure: node block diagram. Four quad-core Opteron sockets, each with dual-channel registered DDR2 memory (2 GB 667 MHz DDR2 DIMM x4 per socket), interconnected by 8 GB/s (full-duplex) HyperTransport links. NVIDIA nForce 3050 and 3600 bridges, attached over 4 GB/s (full-duplex) links, provide PCI-Express x16/x8, PCI-X, SAS, and USB. Two pairs of Mellanox MHGH28-XTC ConnectX HCAs (1.2 µs MPI latency, 4X DDR 20 Gb/s) attach to the bridges.]
Performance of parallel 3-D FFTs (2-D distribution)
[Figure: GFLOPS (log scale, 0.1 to 1000) vs. number of cores, for N = 32^3, 64^3, 128^3, and 256^3]
Discussion (1/2)
• For the N = 32^3-point FFT, we can clearly see that communication overhead dominates the execution time.
– In this case, the total working set size is only 1 MB.
• On the other hand, the 2-D distribution scales well up to 4,096 cores for the N = 256^3-point FFT.
– Performance on 4,096 cores is over 401 GFlops, about 1.1% of theoretical peak.
– Performance except for all-to-all communications is over 10 TFlops, about 26.7% of theoretical peak.
Performance of parallel 3-D FFTs (N = 256^3)
[Figure: GFLOPS (log scale, 0.1 to 1000) vs. number of cores, comparing the 1-D and 2-D distributions]
Discussion (2/2)
• For P \times Q \le 64, the performance of the 1-D distribution is better than that of the 2-D distribution.
– This is because the total communication amount of the 1-D distribution is half that of the 2-D distribution.
• However, for P \times Q \ge 128, the performance of the 2-D distribution is better than that of the 1-D distribution due to the latency.
Breakdown of parallel 3-D FFTs (256 cores, N = 256^3)
[Figure: bar chart of time (sec, 0 to 0.06) for the 1-D and 2-D distributions, split into computation (Comp.) and communication (Comm.)]
Conclusions
• We implemented a volumetric parallel 3-D FFT on clusters of multi-core processors.
• We showed that a 2-D distribution improves performance effectively by reducing the communication time for larger numbers of MPI processes.
• The proposed volumetric parallel 3-D FFT algorithm is most advantageous on massively parallel clusters of multi-core processors.
• We successfully achieved performance of over 401 GFlops on the T2K-Tsukuba system with 4,096 cores for the N = 256^3-point FFT.