TRANSCRIPT
-
DATE: 1/16/09
Fang-an Kuo
Practical Experience on CUDA
-
2
Outline
Introduction to CUDA
Parallel loop via CUDA
  Example: element sum of a 3D array
  Summing the elements with a traditional loop
  Summing the elements in parallel with CUDA
  Performance comparison
FFT via CUDA
  FFTW 3.2alpha
  CUFFT example
  Performance comparison
Matrix multiplication on multiGPU
Conclusion
References
-
3
Introduction to CUDA
CUDA is NVIDIA's GPGPU technology.
GPGPU programs are developed with CUDA using its C library and language extensions, so OpenGL and Direct3D are not required.
CUDA architecture:
  Host, Device
  Grid, Block, Thread
  SIMD
  Memory Management
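These concepts can be illustrated with a minimal, self-contained sketch (not from the talk; the kernel name and sizes are arbitrary): the host manages device memory, and a grid of blocks of threads runs the same kernel over the data.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scale_kernel(float *data, float alpha, int n)
{
    // Grid/Block/Thread: each thread derives one global index.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= alpha;          // SIMD-style: same instruction, many data elements
}

int main(void)
{
    const int n = 1 << 20;
    float *h = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;                                        // memory management: device buffer
    cudaMalloc((void**)&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    scale_kernel<<<(n + 255) / 256, 256>>>(d, 2.0f, n);   // grid of blocks, 256 threads per block

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);
    cudaFree(d);
    free(h);
    return 0;
}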
-
4
Example: element sum of a 3D array
3D Array: mat[256][256][256]
Compute the sum of all elements of mat.
Traditional method: triple loop
GPGPU method: parallel loop via CUDA over mat[i][j][k]
[Figure: the i, j, k indices of mat mapped onto CUDA blocks (Block 1, 2, 3, ...) and threads (Thread 1, 2, 3, ...)]
-
5
Traditional computation of the element sum
Accumulate the elements with a for loop.
On a Q6850 3.0 GHz CPU host this takes about 930 ms, roughly 0.144 Gflops.
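For reference, a minimal sketch of such a CPU-side triple loop (the presenter's exact loop is not shown; the flattened array layout and names are assumptions):

/* CPU reference: sum all elements of mat[256][256][256] with a triple loop. */
float sum_cpu(const float *mat, int n)          /* n = 256 */
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                sum += mat[(i * n + j) * n + k];
    return sum;
}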
-
6
Summing the elements in parallel with CUDA
Submit a job Sum_gpu1 with n blocks and n threads per block.
Psum[i][k] = Σ_j mat[i][j][k]
CPU: Q6850 (quad-core, 3.0 GHz)
OS: openSUSE 10.2
GPU: Tesla C870, 1350.0 MHz clock, 1535.8 MB memory
Interface: PCI-e 1.0 16x
=========== Source ===============
sum_gpu1<<<n, n>>>(psum, dmat, n);

__global__ void sum_gpu1(float *psum, float *dmat, int n)
{
    int bid = blockIdx.x, tid = threadIdx.x;
    int shift = bid * n * n;              // block bid handles the plane i = bid
    psum += bid * n;                      // partial sums Psum[bid][0..n-1]
    dmat += shift;
    __syncthreads();
    float s = 0.0f;
    for (int i = tid; i < n * n; i += n)  // thread tid sums over j for k = tid
        s += dmat[i];
    psum[tid] = s;
}
-
7
Summing the elements in parallel with CUDA (continued)
Psum[i] = Σ_k Psum[i][k], using shared memory.
Psum[0] = Σ_i Psum[i], using shared memory.
Sum = Psum[0]
=========== Source ===============
sum_gpu2<<<1, n, n * sizeof(float)>>>(psum, n);

__global__ void sum_gpu2(float *psum, int n)
{
    int tid = threadIdx.x;
    extern __shared__ float shared[];
    shared[tid] = 0;
    for (int i = tid; i < n * n; i += n)   // each thread accumulates one slice of Psum
        shared[tid] += psum[i];
    __syncthreads();
    if (tid == 0) {                        // final reduction of the n partial sums
        float s = 0.0f;
        for (int i = 0; i < n; ++i) s += shared[i];
        psum[0] = s;                       // Sum = Psum[0]
    }
}
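The slides show only the kernels. A minimal host-side sketch of how the two stages might be driven, assuming mat is a host array of n*n*n floats (buffer names and launch parameters follow the reconstruction above, not the original source):

// Host-side driver sketch for the two-stage reduction (illustrative, not from the slides).
int n = 256;
float *dmat, *psum, sum;
cudaMalloc((void**)&dmat, n * n * n * sizeof(float));
cudaMalloc((void**)&psum, n * n * sizeof(float));
cudaMemcpy(dmat, mat, n * n * n * sizeof(float), cudaMemcpyHostToDevice);

// Stage 1: Psum[i][k] = sum over j of mat[i][j][k]  (n blocks, n threads per block)
sum_gpu1<<<n, n>>>(psum, dmat, n);
// Stage 2: reduce Psum to a single value in psum[0]  (1 block, n threads, shared memory)
sum_gpu2<<<1, n, n * sizeof(float)>>>(psum, n);

cudaMemcpy(&sum, psum, sizeof(float), cudaMemcpyDeviceToHost);   // Sum = Psum[0]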
-
8
Performance comparison
Tesla C870 Kernel Only: time spent in the kernel function only, excluding memory-transfer time; speedup of about 2x to 50x.
Tesla C870: total computation time, including memory-transfer time; speedup of about 2x to 5x.
[Chart: Tesla C870 Kernel Only vs. Q6850 — Gflops vs. size (16 to 64), GPU and CPU curves]
[Chart: Tesla C870 vs. Q6850 — Gflops vs. size (16 to 464), GPU and CPU curves]
-
9
FFT via CUDA: FFTW 3.2alpha vs. CUDA CUFFT
Comparing complex-to-complex single-precision 3D FFT.
The test machine:
  Intel Q6850 (quad-core, 3.0 GHz), L2: 4 MB
  Tesla C870, 1.35 GHz clock, 1.5 GB VRAM (PCI-e 1.0)
  CUDA 2.0 (NVCC 1.1)
  openSUSE 10.2
-
10
FFTW 3.2alpha
Only the single-precision 3D transform is compared between CUFFT and FFTW here.
The performance of fftw 3.2 is compared with one thread and with four threads.
Transform mat[N][N][N] into Fmat[N][N][N].
N    CPU Time (ms)   CPU 4-thread Time (ms)
16   0.81            0.86
32   1.62            1.75
48   10.62           9.73
64   17.18           15.00
80   44.08           28.44
96   76.65           40.93

N    CPU (Gflops)    CPU 4-thread (Gflops)
16   395.63          372.05
32   494.46          457.40
48   126.21          137.81
64   111.78          128.01
80   57.37           88.92
96   41.24           77.22
============= Source ===============
fftwf_complex *mat, *Fmat;
mat  = (fftwf_complex*) fftwf_malloc(sizeof(fftwf_complex) * N3);
Fmat = (fftwf_complex*) fftwf_malloc(sizeof(fftwf_complex) * N3);
fftwf_plan forward;
forward = fftwf_plan_dft_3d(N, N, N, mat, Fmat, FFTW_FORWARD, FFTW_ESTIMATE);
fftwf_execute(forward);
=============++++++=============
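The slide lists only the serial plan. For the 4-thread measurements, FFTW's threads interface would typically be initialized before planning, roughly as follows (a sketch of the standard FFTW API, not the presenter's code):

/* Threaded FFTW sketch (single precision); link with -lfftw3f_threads -lfftw3f. */
fftwf_init_threads();               /* initialize the threading layer once      */
fftwf_plan_with_nthreads(4);        /* plans created afterwards use 4 threads   */
fftwf_plan forward = fftwf_plan_dft_3d(N, N, N, mat, Fmat,
                                       FFTW_FORWARD, FFTW_ESTIMATE);
fftwf_execute(forward);
fftwf_destroy_plan(forward);
fftwf_cleanup_threads();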
-
11
CUFFT
Compare the time and performance of the cufft kernel alone versus cufft including data transfers.
From N = 48 upward (a total of 48^3 elements or more) it becomes more efficient than fftw.
N    GPU Time (ms)   GPU Kernel Time (ms)
16   2.03            1.94
32   3.71            3.47
48   8.93            8.28
64   8.03            6.64
80   18.08           15.43
96   24.05           19.55

N    GPU (Gflops)    GPU Kernel (Gflops)
16   157.48          165.31
32   215.69          230.55
48   150.15          161.96
64   239.22          289.25
80   139.89          163.94
96   131.40          161.65
============= Source ===============
cufftComplex *dmat, *Fdmat;
cudaMalloc((void**)&dmat,  sizeof(cufftComplex) * N3);
cudaMalloc((void**)&Fdmat, sizeof(cufftComplex) * N3);
cudaMemcpy(dmat, &mat[0], sizeof(cufftComplex) * N3, cudaMemcpyHostToDevice);
cufftHandle plan;
cufftPlan3d(&plan, N, N, N, CUFFT_C2C);
cufftExecC2C(plan, dmat, Fdmat, -1);
cudaMemcpy(&Fmat[0], Fdmat, sizeof(cufftComplex) * N3, cudaMemcpyDeviceToHost);
=============++++++=============
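The last argument of cufftExecC2C, -1, corresponds to CUFFT_FORWARD. A typical cleanup after the transform, not shown on the slide, would be:

/* Cleanup sketch; not part of the original slide. */
cufftDestroy(plan);      /* release the FFT plan          */
cudaFree(dmat);          /* free the device input buffer  */
cudaFree(Fdmat);         /* free the device output buffer */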
-
12
FFTW3 vs. CUDA CUFFT
FFT 3D
[Chart: Gflops vs. size (16 to 192); series: FFTW serial, FFTW 4 threads, CUDA with memory transfer, CUDA Kernel Only]
-
13
FFTW3 vs. CUDA CUFFT
FFT 3D
[Chart: Elapsed time (ms) vs. size (16 to 192); series: FFTW serial, FFTW 4 threads, CUDA with memory transfer, CUDA Kernel Only]
-
14
Matrix multiplication on multiGPU
Comparing Sgemm on a 4-thread CPU, 1x GPU, and 2x GPU.
Sgemm: single-precision matrix multiplication.
The test machine:
  CPU: Q6850 (quad-core), 3.0 GHz, L2: 4 MB
  GPU: 2x Tesla C870, 1.35 GHz clock, 1.5 GB memory
  PCI Interface: PCI-e 1.0 (16x)
  OS: openSUSE 10.2
  CUDA 2.0 (NVCC 1.1)
  gcc 4.3.0 (openMP 3.0)
  MKL 9.1.023
-
15
Matrix multiplication on multiGPU: openMP + CUDA
A, B, C in host:
  A: MxK matrix
  B: KxN matrix
  C: MxN matrix
dA, dB, dC in device:
  dA: (M/2)xK
  dB: KxN
  dC: (M/2)xN
In thread th_id: dA = A + th_id*(M/th)*K, dC = C + th_id*(M/th)*N
#include "cublas.h"......omp_set_num_threads(2);#pragma omp parallel shared(C) private(th,th_id,dA,dB,dC){th_id = omp_get_thread_num();th = omp_get_num_threads();cudaSetDevice( th_id );cublasInit();#pragma omp barriercublasAlloc( M/th * K, sizeof(float), (void**)&dA );cublasAlloc( K * N, sizeof(float), (void**)&dB );cublasAlloc( M/th * N, sizeof(float), (void**)&dC );cudaMemcpy( dB, B , K * N*sizeof(float),
cudaMemcpyHostToDevice );cudaMemcpy( dA, A+th_id*M/th*K, M/th*K*sizeof(float),
cudaMemcpyHostToDevice );cudaMemcpy( dC, C+th_id*M/th*N, M/th*N*sizeof(float),
cudaMemcpyHostToDevice );cublasSgemm( transa, transb, M/th, N, K, alpha, dA
, M/th, dB, K, beta, dC, M/th );cudaMemcpy( C+th_id*M/th * N, dC , M/th *
N*sizeof(float), cudaMemcpyDeviceToHost );}==========Variable=================M, K, N, alpha, beta are constants.A:MxK , B:KxN , C:MxNSgemm : C= alpha .* AB + beta .* C
-
16
Matrix multiplication on multiGPU
Model:
  openMP: shared-memory model
  MPI: distributed-memory model
PC + GPU vs. GPU cluster
[Figure: A and C split row-wise into A1/A2 and C1/C2; B used by both GPUs]
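The talk only names the MPI / distributed-memory model. Purely as an illustration (none of the following is from the talk), an Sgemm split across a GPU cluster with one MPI rank per GPU could be sketched as below; because CUBLAS uses column-major storage, B and C are split by columns here rather than the row split used in the openMP version.

/* Hypothetical MPI + CUBLAS Sgemm sketch: rank r computes its column block of
   C = alpha*A*B + beta*C. A must be allocated (M*K floats) on every rank; B and C
   are valid on rank 0. MPI_Init/MPI_Finalize are handled by the caller. */
#include <mpi.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include "cublas.h"

void mpi_sgemm(int M, int N, int K, float alpha, float beta,
               float *A, float *B, float *C)
{
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    int nc = N / nproc;                            /* columns per rank (assume divisible) */

    float *b = (float*)malloc((size_t)K * nc * sizeof(float));
    float *c = (float*)malloc((size_t)M * nc * sizeof(float));

    MPI_Bcast(A, M * K, MPI_FLOAT, 0, MPI_COMM_WORLD);                            /* replicate A */
    MPI_Scatter(B, K * nc, MPI_FLOAT, b, K * nc, MPI_FLOAT, 0, MPI_COMM_WORLD);   /* B columns  */
    MPI_Scatter(C, M * nc, MPI_FLOAT, c, M * nc, MPI_FLOAT, 0, MPI_COMM_WORLD);   /* C columns  */

    cudaSetDevice(rank % 2);                       /* e.g. two Tesla C870 per node */
    cublasInit();
    float *dA, *dB, *dC;
    cublasAlloc(M * K,  sizeof(float), (void**)&dA);
    cublasAlloc(K * nc, sizeof(float), (void**)&dB);
    cublasAlloc(M * nc, sizeof(float), (void**)&dC);
    cublasSetVector(M * K,  sizeof(float), A, 1, dA, 1);
    cublasSetVector(K * nc, sizeof(float), b, 1, dB, 1);
    cublasSetVector(M * nc, sizeof(float), c, 1, dC, 1);

    cublasSgemm('n', 'n', M, nc, K, alpha, dA, M, dB, K, beta, dC, M);

    cublasGetVector(M * nc, sizeof(float), dC, 1, c, 1);
    MPI_Gather(c, M * nc, MPI_FLOAT, C, M * nc, MPI_FLOAT, 0, MPI_COMM_WORLD);

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
    free(b); free(c);
}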
-
17
Matrix multiplication on multiGPU

N      4-thread Time(ms)   1xGPU Time(ms)   2xGPU Time(ms)   4-thread Gflops   1xGPU Gflops   2xGPU Gflops
2048   727.64              106.77           90.93            23.61             160.90         188.94
2304   1090.98             148.38           118.76           22.42             164.86         205.97
2560   1266.69             200.05           164.42           26.49             167.73         204.08
2816   1934.53             262.34           207.68           23.09             170.24         215.05
3072   2400.53             335.51           251.29           24.15             172.82         230.74
3328   2955.40             422.49           315.94           24.94             174.49         233.33
3584   3423.45             522.42           380.92           26.89             176.25         241.71
3840   4188.64             638.63           465.56           27.04             177.33         243.25
4096   5190.28             767.16           539.57           26.48             179.15         254.72
Benchmark, CUDA with memory transfer.
-
18
Matrix multiplication on multiGPU
Sgemm on MKL vs. Sgemm on CUDA with memory transfer.
[Chart: Gflops vs. size (128 to 2432); series: 2-GPU, 1-GPU, 4-CPU; the 2-GPU curve overtakes the 1-GPU curve near Size = 1560]
-
19
Matrix multiplication on multiGPU
Sgemm on MKL vs. Sgemm on CUDA (Kernel Only)
[Chart: Gflops vs. size (128 to 2432); series: 2-GPU, 1-GPU, 4-CPU]
-
20
Matrix multiplication on multiGPU
Sgemm on MKL vs. Sgemm on CUDA with memory transfer.
[Chart: Elapsed time (ms) vs. size (128 to 1024); series: 2-GPU, 1-GPU, 4-CPU]
-
21
Matrix multiplication on multiGPU
Sgemm on MKL vs. Sgemm on CUDA (Kernel Only)
[Chart: Elapsed time (ms) vs. size (128 to 1024); series: 2-GPU, 1-GPU, 4-CPU]
-
22
Conclusion
Parallelized for loops achieve very high computation speed on CUDA.
CUFFT is faster than FFTW3 for 3D Fourier transforms once the total number of elements reaches 48^3 or more.
For CUDA matrix multiplication, using two GPUs is faster than using one GPU when the matrices are larger than 1560x1560.
openMP + CUDA is the most convenient and fastest way to use CUDA on multiple GPUs in a single machine.
These techniques are used in an NCHC (國網中心) external contract project on materials-science simulation, with an overall speedup of more than about 50x; since the project is still under research, only some of the CUDA techniques are shared here.
-
23
References
CUDA ZONE: http://www.nvidia.com/object/cuda_home.htm
openMP Web: http://openmp.org/wp/
Heresy's Home: http://heresy.spaces.live.com/blog/
FFTW: http://www.fftw.org/
Vasily Volkov and James W. Demmel, 2008, "Benchmarking GPUs to tune dense linear algebra", SC '08.
http://portal.acm.org/citation.cfm?doid=1413370.1413402
-
24
Thank you