Practical Experience on CUDA (source PDF: cqse.ntu.edu.tw/cqse/download_file/fakao_20090116.pdf)


TRANSCRIPT

  • Slide 1

    Date: 1/16/09

    Fang-an Kuo

    Practical Experience on CUDA

  • Slide 2

    Outline

    Parallel loop via CUDA
      Introduction to CUDA
      Example: the element sum of a 3D array
      Computing the sum with a traditional loop
      Computing the sum in parallel with CUDA
      Performance comparison

    FFT via CUDA
      FFTW 3.2alpha
      CUFFT example
      Performance comparison

    Matrix multiplication on multi-GPU

    Conclusion

    References

  • Slide 3

    Introduction to CUDA

    CUDA is NVIDIA's GPGPU technology.

    GPGPU programs are developed with CUDA using its C library and language
    extensions, so there is no need for OpenGL or Direct3D.

    CUDA architecture: Host and Device

    Grid, Block, Thread

    SIMD

    Memory Management
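    A minimal, generic sketch of the host/device split and the grid/block/thread
    indexing described above (the kernel name, sizes, and host calls here are
    illustrative assumptions, not code from the slides):

      // Each thread handles one array element; grid and block sizes are illustrative.
      __global__ void add_one(float *a, const float *b, int n)
      {
          int idx = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
          if (idx < n) a[idx] += b[idx];
      }

      // Host side: allocate on the device, copy in, launch, copy back.
      // cudaMalloc((void**)&da, n*sizeof(float));
      // cudaMemcpy(da, a, n*sizeof(float), cudaMemcpyHostToDevice);
      // add_one<<<(n + 255)/256, 256>>>(da, db, n);
      // cudaMemcpy(a, da, n*sizeof(float), cudaMemcpyDeviceToHost);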

  • Slide 4

    Example: the element sum of a 3D array

    3D array: mat[256][256][256]

    Compute the sum of all elements of mat.

    Traditional approach: triple loop

    GPGPU approach: parallel loop via CUDA over mat[i][j][k]

    [Figure: the 3D array mat[i][j][k] with the i/j/k axes indicated, partitioned
    across CUDA blocks (Block 1, Block 2, Block 3) and threads (Thread 1,
    Thread 2, Thread 3)]

  • Slide 5

    Computing the sum the traditional way

    Accumulate the elements with a for loop.

    On a CPU Q6850 3.0 GHz host this takes about 930 ms, roughly 0.144 Gflops.
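    A minimal sketch of the triple loop just described, assuming the 256^3 array
    is stored flat in host memory (names are illustrative, not from the slides):

      // CPU reference: sum all elements of an n*n*n array with a triple loop.
      float sum_cpu(const float *mat, int n)        /* n = 256 in the slides */
      {
          float sum = 0.0f;
          for (int i = 0; i < n; i++)
              for (int j = 0; j < n; j++)
                  for (int k = 0; k < n; k++)
                      sum += mat[(i*n + j)*n + k];  // mat[i][j][k], flattened
          return sum;
      }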

  • Slide 6

    Computing the sum in parallel with CUDA

    Submit a kernel sum_gpu1 with n blocks and n threads per block.

    Psum[i][k] = sum_j mat[i][j][k]

    Test machine: CPU Q6850 (quad-core, 3.0 GHz); OS: OpenSUSE 10.2;
    GPU: Tesla C870, 1350.0 MHz clock, 1535.8 MB memory; Interface: PCI-e 1.0 16x

    =========== Source ===============
    // n blocks, n threads per block (launch configuration from the slide text)
    sum_gpu1<<<n, n>>>(psum, dmat, n);

    __global__ void sum_gpu1(float *psum, float *dmat, int n)
    {
        int bid = blockIdx.x;                 // block handles slab i = bid
        int tid = threadIdx.x;                // thread handles column k = tid
        int shift = bid*n*n;
        psum += bid*n;
        dmat += shift;
        __syncthreads();
        float s = 0.0f;
        for (int i = tid; i < n*n; i += n)    // remainder completed from context:
            s += dmat[i];                     // walk over j with stride n
        psum[tid] = s;                        // Psum[bid][tid]
    }

  • Slide 7

    Computing the sum in parallel with CUDA (reduction step)

    Psum[i] = sum_k Psum[i][k], using shared memory.

    Psum[0] = sum_i Psum[i], using shared memory.

    Sum = Psum[0]

    =========== Source ===============
    // assumed launch: one block, n threads, n floats of dynamic shared memory
    sum_gpu2<<<1, n, n*sizeof(float)>>>(psum, n);

    __global__ void sum_gpu2(float *psum, int n)
    {
        int tid = threadIdx.x;
        extern __shared__ float shared[];
        shared[tid] = 0;
        for (int i = tid; i < n*n; i += n)    // remainder completed from context:
            shared[tid] += psum[i];           // each thread accumulates its share
        __syncthreads();
        if (tid == 0) {                       // fold the n partial sums together
            float s = 0.0f;
            for (int i = 0; i < n; i++) s += shared[i];
            psum[0] = s;                      // Sum = Psum[0]
        }
    }
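    Putting the two kernels together, the host side might look like the sketch
    below; it assumes mat is the flattened 256^3 host array and uses the launch
    configurations stated on the slides, so treat it as an illustration rather
    than the author's code:

      int n = 256;
      size_t bytes = (size_t)n*n*n*sizeof(float);
      float *dmat, *psum, sum;
      cudaMalloc((void**)&dmat, bytes);               // device copy of mat
      cudaMalloc((void**)&psum, n*n*sizeof(float));   // n*n partial sums
      cudaMemcpy(dmat, mat, bytes, cudaMemcpyHostToDevice);
      sum_gpu1<<<n, n>>>(psum, dmat, n);              // Psum[i][k]
      sum_gpu2<<<1, n, n*sizeof(float)>>>(psum, n);   // reduce to psum[0]
      cudaMemcpy(&sum, psum, sizeof(float), cudaMemcpyDeviceToHost);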

  • Slide 8

    Performance comparison

    "Tesla C870 Kernel Only" counts only the time spent in the kernel,
    excluding memory-transfer time: roughly a 2x to 50x speedup.

    "Tesla C870" is the total computation time, including memory-transfer
    time: roughly a 2x to 5x speedup.
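    One way to obtain the two timings (kernel-only vs. total with transfers) is
    with CUDA events; a generic sketch, not necessarily the measurement code
    used for these numbers:

      cudaEvent_t t0, t1;
      cudaEventCreate(&t0);  cudaEventCreate(&t1);

      cudaEventRecord(t0, 0);
      sum_gpu1<<<n, n>>>(psum, dmat, n);                // kernel-only timing:
      sum_gpu2<<<1, n, n*sizeof(float)>>>(psum, n);     // wrap just the kernels
      cudaEventRecord(t1, 0);
      cudaEventSynchronize(t1);

      float kernel_ms;
      cudaEventElapsedTime(&kernel_ms, t0, t1);
      // For the total time, move the events so they also enclose the
      // cudaMemcpy host<->device transfers.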

    [Chart: Tesla C870 Kernel Only vs. Q6850; Gflops vs. Size (16 to 64);
    series: CPU, GPU]

    [Chart: Tesla C870 vs. Q6850; Gflops vs. Size (16 to 464);
    series: GPU, CPU]

  • Slide 9

    FFT via CUDA: FFTW 3.2alpha vs. CUDA CUFFT

    Comparing complex-to-complex single-precision 3D FFT.

    Test machine: Intel Q6850 (quad-core, 3.0 GHz), L2: 4 MB

    Tesla C870, 1.35 GHz clock, 1.5 GB VRAM (PCI-e 1.0)

    CUDA 2.0 (NVCC 1.1)

    openSUSE 10.2

  • Slide 10

    FFTW 3.2alpha

    This comparison covers only the single-precision transforms: CUFFT 3D vs.
    FFTW 3D. We compare fftw 3.2 performance with one thread and with multiple
    threads. The transform maps mat[N][N][N] to Fmat[N][N][N].

    N    CPU Time (ms)    CPU 4-thread Time (ms)
    16        0.81              0.86
    32        1.62              1.75
    48       10.62              9.73
    64       17.18             15.00
    80       44.08             28.44
    96       76.65             40.93

    N    CPU (Gflops)     CPU 4-thread (Gflops)
    16      395.63            372.05
    32      494.46            457.40
    48      126.21            137.81
    64      111.78            128.01
    80       57.37             88.92
    96       41.24             77.22

    ============= Source ===============
    fftwf_complex *mat, *Fmat;
    mat  = (fftwf_complex*) fftwf_malloc(sizeof(fftwf_complex) * N3);   // N3 denotes N*N*N
    Fmat = (fftwf_complex*) fftwf_malloc(sizeof(fftwf_complex) * N3);
    fftwf_plan forward;
    forward = fftwf_plan_dft_3d(N, N, N, mat, Fmat, FFTW_FORWARD, FFTW_ESTIMATE);
    fftwf_execute(forward);
    ============= ++++++ =============
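    The 4-thread FFTW figures presumably rely on FFTW's threading support; a
    minimal sketch of how that is usually enabled (assumed, not the author's
    exact code):

      // Requires linking against the FFTW threads library (e.g. -lfftw3f_threads).
      fftwf_init_threads();                 // initialise FFTW's thread support
      fftwf_plan_with_nthreads(4);          // subsequent plans use 4 threads
      fftwf_plan forward = fftwf_plan_dft_3d(N, N, N, mat, Fmat,
                                             FFTW_FORWARD, FFTW_ESTIMATE);
      fftwf_execute(forward);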

  • Slide 11

    CUFFT

    We compare the time and performance of the cufft kernel alone against cufft
    including data transfers. From N = 48 upward (a total of 48^3 elements),
    CUFFT starts to achieve better computational efficiency than fftw.

    N    GPU Time (ms)    GPU Kernel Time (ms)
    16        2.03              1.94
    32        3.71              3.47
    48        8.93              8.28
    64        8.03              6.64
    80       18.08             15.43
    96       24.05             19.55

    N    GPU (Gflops)     GPU Kernel (Gflops)
    16      157.48            165.31
    32      215.69            230.55
    48      150.15            161.96
    64      239.22            289.25
    80      139.89            163.94
    96      131.40            161.65

    ============= Source ===============
    cufftComplex *dmat, *Fdmat;
    cudaMalloc((void**)&dmat , sizeof(cufftComplex)*N3);
    cudaMalloc((void**)&Fdmat, sizeof(cufftComplex)*N3);
    cudaMemcpy(dmat, &mat[0], sizeof(cufftComplex)*N3, cudaMemcpyHostToDevice);
    cufftHandle plan;
    cufftPlan3d(&plan, N, N, N, CUFFT_C2C);
    cufftExecC2C(plan, dmat, Fdmat, -1);     // -1 == CUFFT_FORWARD
    cudaMemcpy(&Fmat[0], Fdmat, sizeof(cufftComplex)*N3, cudaMemcpyDeviceToHost);
    ============= ++++++ =============
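    For completeness, the plan and device buffers would normally be released
    afterwards; a small assumed cleanup step not shown on the slide:

      cufftDestroy(plan);      // release the FFT plan
      cudaFree(dmat);          // release the device buffers
      cudaFree(Fdmat);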

  • Slide 12

    FFTW3 vs. CUDA CUFFT

    [Chart: FFT 3D, Gflops vs. Size (16 to 192, step 16); series: FFTW serial,
    FFTW 4 threads, CUDA with memory transfer, CUDA Kernel Only]

  • Slide 13

    FFTW3 vs. CUDA CUFFT

    [Chart: FFT 3D, Elapsed time (ms) vs. Size (16 to 192, step 16); series:
    FFTW serial, FFTW 4 threads, CUDA with memory transfer, CUDA Kernel Only]

  • Slide 14

    Matrix multiplication on multi-GPU

    Comparing Sgemm on a 4-thread CPU, 1x GPU, and 2x GPU.

    Sgemm: single-precision matrix multiplication.

    Test machine:
      CPU: Q6850 (quad-core), 3.0 GHz, L2: 4 MB
      GPU: 2x Tesla C870, 1.35 GHz clock, 1.5 GB memory
      PCI interface: PCI-e 1.0 (16x)
      OS: openSUSE 10.2
      CUDA 2.0 (NVCC 1.1)
      gcc 4.3.0 (OpenMP 3.0)
      MKL 9.1.023

  • Slide 15

    Matrix multiplication on multi-GPU: OpenMP + CUDA

    A, B, C in host memory:
      A: MxK matrix
      B: KxN matrix
      C: MxN matrix

    dA, dB, dC in device memory:
      dA: M/2 x K
      dB: K x N
      dC: M/2 x N

    In thread th_id the host offsets are A + th_id*M/th*K and C + th_id*M/th*N,
    so each thread works on its own row block.

    ========== Source =================
    #include "cublas.h"
    ......
    omp_set_num_threads(2);
    #pragma omp parallel shared(C) private(th, th_id, dA, dB, dC)
    {
        th_id = omp_get_thread_num();
        th    = omp_get_num_threads();
        cudaSetDevice(th_id);                 // one GPU per OpenMP thread
        cublasInit();
    #pragma omp barrier
        cublasAlloc(M/th * K, sizeof(float), (void**)&dA);
        cublasAlloc(K * N,    sizeof(float), (void**)&dB);
        cublasAlloc(M/th * N, sizeof(float), (void**)&dC);
        cudaMemcpy(dB, B, K*N*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dA, A + th_id*M/th*K, M/th*K*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dC, C + th_id*M/th*N, M/th*N*sizeof(float), cudaMemcpyHostToDevice);
        cublasSgemm(transa, transb, M/th, N, K, alpha, dA, M/th, dB, K, beta, dC, M/th);
        cudaMemcpy(C + th_id*M/th*N, dC, M/th*N*sizeof(float), cudaMemcpyDeviceToHost);
    }
    ========== Variables =================
    M, K, N, alpha, beta are constants.
    A: MxK, B: KxN, C: MxN
    Sgemm: C = alpha .* A*B + beta .* C
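    Each thread would normally also release its device buffers and shut down
    CUBLAS before leaving the parallel region; an assumed cleanup sketch using
    the same legacy CUBLAS API (not shown on the slide):

      // inside the parallel region, after the final cudaMemcpy
      cublasFree(dA);
      cublasFree(dB);
      cublasFree(dC);
      cublasShutdown();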

  • Slide 16

    Matrix multiplication on multi-GPU

    Model:
      OpenMP: shared-memory mode
      MPI: distributed-memory mode
      PC + GPU vs. GPU cluster

    [Diagram: C and A split into row blocks C1/C2 and A1/A2, with the full B
    shared by both GPUs]
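    Restated in matrix form (my notation, consistent with the code on the
    previous slide): each GPU i holds the full B and one row block A_i, and
    computes its own row block C_i.

      C = \begin{pmatrix} C_1 \\ C_2 \end{pmatrix}, \quad
      A = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix}, \quad
      C_i = \alpha\, A_i B + \beta\, C_i \quad (i = 1, 2)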

  • Slide 17

    Matrix multiplication on multi-GPU

    N      4-thread    1xGPU      2xGPU      4-thread   1xGPU     2xGPU
           Time(ms)    Time(ms)   Time(ms)   Gflops     Gflops    Gflops
    2048    727.64     106.77      90.93      23.61     160.90    188.94
    2304   1090.98     148.38     118.76      22.42     164.86    205.97
    2560   1266.69     200.05     164.42      26.49     167.73    204.08
    2816   1934.53     262.34     207.68      23.09     170.24    215.05
    3072   2400.53     335.51     251.29      24.15     172.82    230.74
    3328   2955.40     422.49     315.94      24.94     174.49    233.33
    3584   3423.45     522.42     380.92      26.89     176.25    241.71
    3840   4188.64     638.63     465.56      27.04     177.33    243.25
    4096   5190.28     767.16     539.57      26.48     179.15    254.72

    Benchmark: CUDA with memory transfer.

  • Slide 18

    Matrix multiplication on multi-GPU

    Sgemm on MKL vs. Sgemm on CUDA with memory transfer.

    [Chart: Gflops vs. Size (128 to 2432); series: 2-GPU, 1-GPU, 4-CPU;
    annotation: Size = 1560]

  • Slide 19

    Matrix multiplication on multi-GPU

    Sgemm on MKL vs. Sgemm on CUDA (Kernel Only).

    [Chart: Gflops vs. Size (128 to 2432); series: 2-GPU, 1-GPU, 4-CPU]

  • Slide 20

    Matrix multiplication on multi-GPU

    Sgemm on MKL vs. Sgemm on CUDA with memory transfer.

    [Chart: Elapsed time (ms) vs. Size (128 to 1024, step 128);
    series: 2-GPU, 1-GPU, 4-CPU]

  • Slide 21

    Matrix multiplication on multi-GPU

    Sgemm on MKL vs. Sgemm on CUDA (Kernel Only).

    [Chart: Elapsed time (ms) vs. Size (128 to 1024, step 128);
    series: 2-GPU, 1-GPU, 4-CPU]

  • Slide 22

    Conclusion

    Parallelizing for loops with CUDA gives very high computational speed.

    For 3D Fourier transforms with a total element count of 48^3 or more,
    CUFFT is faster than FFTW3.

    For CUDA matrix multiplication, when the matrices are larger than
    1560x1560, using two GPUs is faster than using one.

    OpenMP + CUDA is the most convenient and fastest way to use multiple GPUs
    with CUDA on a single machine.

    The techniques above are used in an externally commissioned NCHC project on
    materials-science simulation, with an overall speedup of roughly 50x or
    more; since the project is still in progress, only some of the CUDA
    techniques are shared here.

  • Slide 23

    References

    CUDA Zone: http://www.nvidia.com/object/cuda_home.htm

    OpenMP: http://openmp.org/wp/

    Heresy's Home: http://heresy.spaces.live.com/blog/

    FFTW: http://www.fftw.org/

    Vasily Volkov and James W. Demmel, 2008. Benchmarking GPUs to tune dense
    linear algebra. SC '08.
    http://portal.acm.org/citation.cfm?doid=1413370.1413402

  • Slide 24

    Thank you
