ECE 734 PROJECT

Accelerating Spherical Harmonic Transforms on the NVIDIA® GPGPU

-Vikrant Soman

Agenda

• Problem Statement

• Motivation

• Introduction to SHT – analysis and synthesis

• Overview of GPU architecture

• CPU-GPU implementation

• Results

• Conclusions and Future work

• References and Acknowledgements

Problem Statement

• Spherical harmonic transforms (SHTs) are a critical computational kernel in numerical weather prediction, climate modeling and other global geopotential-related applications.

• Satellite resolution is improving, so enormous global datasets of very high degree and order are becoming available.

Motivation

• The computational aspects of SHTs have become challenging and time consuming.

Growing datasets make SHTs more DATA INTENSIVE and SLOWER!

• To the author's knowledge, no one has tried using a GPU for SHTs before.

Try a Google search for “Spherical Harmonic Transforms on GPU”!

Spherical Harmonic Transforms

• Spherical Harmonic Transforms (SHTs) are essentially Fourier transforms on the sphere.

• An SHT consists of an analysis step and a synthesis step.

• Analysis: Project grid point data on the sphere onto the spectral modes.

• Synthesis: Inverse transform reconstructs grid point data from the spectral information.

[Figure: block diagram of the analysis and synthesis steps. Box labels: FFT of grid-point data along longitudes (F) * Gaussian weights (G); spectral values (X); Legendre polynomial functions; spectral values (S); compute IFFT and normalize results.]
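For reference, the two steps can be written out as below. This is a sketch in one common convention and not necessarily the exact form used in the project. The symbols are assumptions introduced here: $f(\lambda_k,\mu_j)$ is the grid-point data at longitude $\lambda_k$ and Gaussian latitude $\mu_j$, $K$ and $J$ are the numbers of longitudes and latitudes, $G_j$ are the Gaussian weights, $\bar P_n^{m}$ are the normalized associated Legendre functions, $S_n^m$ are the spectral values, and $N$ is the truncation degree.

```latex
% Analysis: Fourier transform along each latitude circle,
% then Gaussian quadrature against the Legendre functions.
F^{m}(\mu_j) = \frac{1}{K} \sum_{k=0}^{K-1} f(\lambda_k, \mu_j)\, e^{-\mathrm{i} m \lambda_k},
\qquad
S_n^{m} = \sum_{j=1}^{J} G_j \, F^{m}(\mu_j)\, \bar P_n^{|m|}(\mu_j)

% Synthesis: Legendre sum over degree n, then inverse Fourier transform.
F^{m}(\mu_j) = \sum_{n=|m|}^{N} S_n^{m}\, \bar P_n^{|m|}(\mu_j),
\qquad
f(\lambda_k, \mu_j) = \sum_{m=-N}^{N} F^{m}(\mu_j)\, e^{\mathrm{i} m \lambda_k}

% Normalization assumed above:
\int_{-1}^{1} \bar P_n^{m}(\mu)\, \bar P_{n'}^{m}(\mu)\, d\mu = \delta_{n n'}
```

In this form the Fourier parts map onto FFT/IFFT calls, while the sums over $j$ and $n$ are the Legendre parts.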

GPU architecture - Overview

• Consists of 4 types of memory:

1. Global (device)

2. Shared

3. Constant

4. Texture

CUDA

• CUDA extends C by allowing the programmer to define C functions, called kernels.

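• When called, a kernel is executed N times in parallel by N different CUDA threads, as opposed to only once like a regular C function.

// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;    // each of the N threads handles one element
    C[i] = A[i] + B[i];
}

int main()
{
    // Kernel invocation: 1 block of N threads
    // (A, B, C are device pointers and N is defined elsewhere)
    vecAdd<<<1, N>>>(A, B, C);
}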
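The slide's snippet leaves out the host-side setup. A minimal sketch of the full round trip (allocate on the device, copy in, launch, copy back) might look like the following. The names `vecAddKernel` and `vecAddOnGpu`, the bounds check, and the block size of 256 are assumptions made for illustration, not part of the original code.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Kernel: each thread adds one pair of elements.
__global__ void vecAddKernel(const float* A, const float* B, float* C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {            // guard: the last block may be only partly used
        C[i] = A[i] + B[i];
    }
}

// Hypothetical host wrapper (assumes n > 0): copies the inputs to the
// device, launches the kernel, and copies the result back to the host.
// Returns true if every CUDA call succeeded.
bool vecAddOnGpu(const float* hA, const float* hB, float* hC, int n)
{
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess
           && cudaMalloc(&dB, bytes) == cudaSuccess
           && cudaMalloc(&dC, bytes) == cudaSuccess;
    if (ok) {
        ok = cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice) == cudaSuccess
          && cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    }
    if (ok) {
        int threads = 256;                          // assumed block size
        int blocks = (n + threads - 1) / threads;   // round up
        vecAddKernel<<<blocks, threads>>>(dA, dB, dC, n);
        // The device-to-host copy also waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA);   // freeing a null pointer is a no-op
    cudaFree(dB);
    cudaFree(dC);
    return ok;
}
```

A caller passes ordinary host arrays, for example `vecAddOnGpu(a, b, c, 4)`, and reads the sums from `c` afterwards.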

• One of the best parts of the GPGPU – Heterogeneous programming

• BLAS operation acceleration.

• Allows the hybrid CPU-GPU implementation used in this project.

Implementation Details

• Exploit the heterogeneous programming model

• CPU code implemented in MATLAB.

• Identified data intensive loops in the code.

• Map the loop indexing to GPGPU architecture to exploit parallelism

• Offload the computation to the GPU, then retrieve the data back to the CPU.

// Load one tile of slab k of A and one tile of B into shared memory
AS(ty, tx) = A[k*wA*wA + aBegin + wA * ty + tx];
BS(ty, tx) = B[bBegin_x + wB * ty + tx];
Csub(ty, tx) = 0;

// Synchronize to make sure the matrices are loaded
__syncthreads();

// Element-wise product of the two tiles
Csub(ty, tx) = AS(ty, tx) * BS(ty, tx);

// Write the result back into slab k of A
int c = bx*BLOCK_SIZE + by*BLOCK_SIZE*BLOCK_SIZE*(wA/BLOCK_SIZE);
A[k*wA*wA + c + tx + ty*wA] = Csub(ty, tx);

Part of the kernel program

for n = 0:nn
    Pn = (legendre(n, yg))';   % Note error in Matlab normalization
    for m = 0:n
        Nmn = (-1)^m * sqrt((2*n+1)/2 * factorial(n-m)/factorial(n+m));
        P(1:njo2, n+1, m+1) = Nmn * Pn(1:njo2, m+1);
    end
end

Loop mapped to GPU

Legendre polynomial calculation
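One way to map the normalization part of this loop nest onto CUDA threads is sketched below, with one thread per element P(j, n, m). This is an illustration under stated assumptions, not the project's kernel: the names `nmn`, `normalizeLegendre` and `normalizeOnGpu` are hypothetical, the array is assumed to already hold the unnormalized Legendre values in MATLAB's column-major layout, the factorial ratio is evaluated through `lgamma` (a substitution, chosen so large factorials do not overflow), and double precision is assumed to be available on the device, which was not the case on older cards such as the 8800 GT.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <cstddef>
#include <vector>

// Nmn = (-1)^m * sqrt((2n+1)/2 * (n-m)!/(n+m)!), with the factorial
// ratio computed as exp(lgamma(n-m+1) - lgamma(n+m+1)).
__host__ __device__ inline double nmn(int n, int m)
{
    double logRatio = lgamma(n - m + 1.0) - lgamma(n + m + 1.0);
    double mag = sqrt((2.0 * n + 1.0) / 2.0) * exp(0.5 * logRatio);
    return (m % 2 == 0) ? mag : -mag;
}

// One thread per element P(j, n, m), stored column-major:
// index = j + nj * (n + (nn + 1) * m). Entries with m > n are untouched.
__global__ void normalizeLegendre(double* P, int nj, int nn)
{
    int total = nj * (nn + 1) * (nn + 1);
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= total) return;
    int n = (idx / nj) % (nn + 1);      // degree
    int m = idx / (nj * (nn + 1));      // order
    if (m <= n) P[idx] *= nmn(n, m);
}

// Hypothetical host wrapper: P must hold nj * (nn+1) * (nn+1) values.
bool normalizeOnGpu(std::vector<double>& P, int nj, int nn)
{
    size_t bytes = P.size() * sizeof(double);
    double* dP = nullptr;
    if (cudaMalloc(&dP, bytes) != cudaSuccess) return false;
    bool ok = cudaMemcpy(dP, P.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int total = nj * (nn + 1) * (nn + 1);
        int threads = 256;                          // assumed block size
        normalizeLegendre<<<(total + threads - 1) / threads, threads>>>(dP, nj, nn);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(P.data(), dP, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dP);
    return ok;
}
```

Flattening the three loop indices into one thread index keeps the launch configuration one-dimensional; a 2D or 3D grid over (j, n, m) would be an equally reasonable choice.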

• Offload data intensive operation to GPU

Analysis step

• Compute FFT on CPU side.

• MATLAB has a highly optimized FFT operation.

Synthesis step

• The IFFT is again given to the CPU.

• A GPU FFT pays off only for very large transform sizes (more than about 10,000 points).

• CPU side – DELL, Intel Quad Core @ 2.5 GHz and 2.5 GB RAM

• GPU – NVIDIA® 8800 GT

• CPU side code on MATLAB

• GPU code written using NVMEX, the MATLAB extension provided by NVIDIA®

• Interfacing between CPU-GPU via plug-in for MATLAB.

Results

• For a grid size of 512, the speedup is almost 42x.

• The speedup shows an upward trend for larger grid sizes.

• Not much speedup for the analysis kernel.

• Values are comparable, though.
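The slides do not say how the timings behind these speedups were taken. One common way to time the GPU side is with CUDA events, sketched below. The kernel `scaleKernel` is only a stand-in workload and `timeScaleKernelMs` is a hypothetical helper; neither comes from the project.

```cuda
#include <cuda_runtime.h>

// Stand-in workload (hypothetical): scales every element by a.
__global__ void scaleKernel(float* x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Returns the kernel's elapsed time in milliseconds, or -1.0f on error.
// Assumes n > 0.
float timeScaleKernelMs(int n)
{
    float* dX = nullptr;
    if (cudaMalloc(&dX, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(dX, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;                  // assumed block size
    cudaEventRecord(start, 0);
    scaleKernel<<<(n + threads - 1) / threads, threads>>>(dX, 2.0f, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);         // wait until the kernel has finished

    float ms = -1.0f;
    bool ok = cudaGetLastError() == cudaSuccess
           && cudaEventElapsedTime(&ms, start, stop) == cudaSuccess;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dX);
    return ok ? ms : -1.0f;
}
```

A speedup figure is then the CPU time divided by the GPU time for the same work. For a fair comparison the GPU time would usually also include the host-to-device and device-to-host transfers, which this sketch leaves out.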

Conclusions and Future work

• Improves the on-the-fly Legendre polynomial calculation.

• Good speedup overall.

• Errors are low (less than 1E-10 on average).

• Need to look into performance for larger grid sizes.

• Complete the synthesis step results.

• Possible exchange of ideas with a PhD student at SMU, Dallas.

References

• Drake, J. B., Worley, P., and D’Azevedo, E. 2008. Algorithm 888: Spherical harmonic transform algorithms. ACM Trans. Math. Softw. 35, 3, Article 23 (October 2008)

• Akshara Kaginalkar, Sharad Purohit, Benchmarking of Medium Range Weather Forecasting Model on PARAM -A parallel machine, Center for Development of Advanced Computing (C-DAC), Pune University Campus, Pune 411007 India

• Martin J. Mohlenkamp, A Fast Transform for Spherical Harmonics, The Journal of Fourier Analysis and Applications, 1999

• Huadong Xiao, Yang Lu, Parallel computation for spherical harmonic synthesis and analysis, Computers & Geosciences, Volume 33, Issue 3, March 2007

• NVIDIA CUDA Programming Guide 2.0

“Special thanks to Prof. Dan Negrut and Makarand Datar, UW Mech department for access to their GPU machines”
