ECE 734 PROJECT
Accelerating Spherical Harmonic Transforms on the NVIDIA® GPGPU
-Vikrant Soman
Agenda
• Problem Statement
• Motivation
• Introduction to SHT: analysis and synthesis
• Overview of GPU architecture
• CPU-GPU implementation
• Results
• Conclusions and Future work
• References and Acknowledgements
Problem Statement
• SHTs are a critical computational kernel in numerical weather prediction, climate modeling, and other global geopotential applications.
• Satellite resolution is improving, so enormous global datasets of very high degree and order are becoming available.
Motivation
• The computational aspects of SHTs have become challenging and time consuming.
Larger datasets make the SHT more DATA INTENSIVE and SLOWER!
• To my knowledge, no one has tried using a GPU for SHTs before.
Try a Google search for “Spherical Harmonic Transforms on GPU”!
Spherical Harmonic Transforms
• Spherical Harmonic Transforms (SHTs) are essentially Fourier transforms on the sphere.
• An SHT consists of an Analysis step and a Synthesis step.
• Analysis: Project grid point data on the sphere onto the spectral modes.
• Synthesis: Inverse transform reconstructs grid point data from the spectral information.
[Figure: flow diagram of the Analysis and Synthesis steps. Labels in the diagram: FFT of grid-point data along longitudes (F) * Gaussian weights (G); Legendre polynomial functions; spectral values (S); spectral values (X); compute IFFT and normalize results.]
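The two steps can be sketched end to end on a tiny grid. The following is a minimal illustration in plain Python, not the project's MATLAB/CUDA code. It assumes a Gaussian grid in latitude, orthonormal associated Legendre functions, and a real-valued field. Direct DFT sums stand in for the FFT and IFFT, and all function names (`gauss_legendre`, `pbar`, `analysis`, `synthesis`) are hypothetical.

```python
import cmath
import math

def gauss_legendre(nj):
    """Gauss-Legendre nodes and weights on [-1, 1], found by Newton iteration."""
    nodes, weights = [], []
    for i in range(nj):
        x = math.cos(math.pi * (i + 0.75) / (nj + 0.5))  # initial guess
        for _ in range(100):
            p0, p1 = 1.0, x
            for k in range(2, nj + 1):  # three-term recurrence up to P_nj(x)
                p0, p1 = p1, ((2 * k - 1) * x * p1 - (k - 1) * p0) / k
            dp = nj * (x * p1 - p0) / (x * x - 1.0)  # derivative of P_nj
            dx = p1 / dp
            x -= dx
            if abs(dx) < 1e-15:
                break
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights

def pbar(n, m, x):
    """Orthonormal associated Legendre function Nmn * P_n^m(x), 0 <= m <= n."""
    s = math.sqrt((1.0 - x) * (1.0 + x))
    pmm = 1.0
    for k in range(1, m + 1):
        pmm *= -(2 * k - 1) * s  # P_m^m with the (-1)^m phase
    p = pmm
    if n > m:
        prev, p = pmm, (2 * m + 1) * x * pmm
        for l in range(m + 2, n + 1):
            prev, p = p, ((2 * l - 1) * x * p - (l + m - 1) * prev) / (l - m)
    nmn = (-1) ** m * math.sqrt(
        (2 * n + 1) / 2 * math.factorial(n - m) / math.factorial(n + m))
    return nmn * p

def analysis(f, mu, w, nn):
    """Grid values f[j][i] (latitude j, longitude i) -> spectral values S[(n, m)]."""
    nlon = len(f[0])
    S = {}
    for m in range(nn + 1):
        # Fourier coefficient along each latitude circle (the FFT step)
        Fm = [sum(row[i] * cmath.exp(-2j * math.pi * m * i / nlon)
                  for i in range(nlon)) / nlon for row in f]
        # Gaussian quadrature against the Legendre functions
        for n in range(m, nn + 1):
            S[(n, m)] = sum(w[j] * Fm[j] * pbar(n, m, mu[j])
                            for j in range(len(mu)))
    return S

def synthesis(S, mu, nlon, nn):
    """Spectral values S[(n, m)] -> real grid values f[j][i]."""
    f = []
    for x in mu:
        row = [0.0] * nlon
        for m in range(nn + 1):
            Fm = sum(S.get((n, m), 0.0) * pbar(n, m, x) for n in range(m, nn + 1))
            c = 1.0 if m == 0 else 2.0  # m < 0 terms are complex conjugates
            for i in range(nlon):
                row[i] += c * (Fm * cmath.exp(2j * math.pi * m * i / nlon)).real
        f.append(row)
    return f

nn = 3                             # maximum degree (small, for illustration)
mu, w = gauss_legendre(nn + 1)     # nn + 1 Gaussian latitudes
nlon = 2 * nn + 2                  # enough longitudes to avoid aliasing
S = {(0, 0): 1.0, (2, 1): 0.5 + 0.25j, (3, 3): -0.3j}
f = synthesis(S, mu, nlon, nn)
S2 = analysis(f, mu, w, nn)
err = max(abs(S2[k] - S.get(k, 0.0)) for k in S2)
print(err)
```

With `nn + 1` Gaussian latitudes and at least `2*nn + 1` longitudes, both the quadrature and the DFT are exact for a field truncated at degree `nn`, so the round trip should recover the coefficients up to rounding error.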
GPU architecture - Overview
• Consists of 4 types of memory:
1. Global (Device)
2. Shared
3. Constant
4. Texture
CUDA
• CUDA extends C by allowing the programmer to define C functions, called kernels.
• A kernel is executed N times in parallel by N different CUDA threads, as opposed to only once like a regular C function.
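// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;   // each thread handles one element
    C[i] = A[i] + B[i];
}

int main()
{
    // Kernel invocation: 1 block of N threads
    // (A, B, C are device pointers; allocation is omitted here)
    vecAdd<<<1, N>>>(A, B, C);
}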
• One of the best parts of the GPGPU – Heterogeneous programming
• BLAS operation acceleration.
• Allows a combined CPU-GPU implementation, which is the approach I have used.
Implementation Details
• Exploit the heterogeneous programming model
• CPU code implemented in MATLAB.
• Identify the data intensive loops in the code.
• Map the loop indexing to the GPGPU architecture to exploit parallelism.
• Offload the computation to the GPU, then retrieve the results back to the CPU.
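// Each thread loads one element of A and one of B into the tiles AS and BS
AS(ty, tx) = A[k*wA*wA + aBegin + wA * ty + tx];
BS(ty, tx) = B[bBegin_x + wB * ty + tx];
Csub(ty,tx) = 0;
// Synchronize to make sure the matrices are loaded
__syncthreads();
// Element-wise product of the two tiles
Csub(ty,tx) = AS(ty,tx) * BS(ty,tx);
// Offset of this block's tile within the current wA x wA slab
int c = bx*BLOCK_SIZE + by*BLOCK_SIZE*BLOCK_SIZE*(wA/BLOCK_SIZE);
// Write the result back to global memory
A[k*wA*wA + c + tx + ty*wA] = Csub(ty,tx);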
Part of the kernel program
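The index arithmetic in the kernel excerpt can be checked on the CPU. The sketch below emulates the block and thread indices in plain Python and confirms that `c + tx + ty*wA` is the row-major index of element (by*BLOCK_SIZE + ty, bx*BLOCK_SIZE + tx). The values of `BLOCK_SIZE` and `wA`, the square grid of blocks, and the helper name `flat_index` are assumptions made for illustration. The `k*wA*wA` slab offset is left out.

```python
BLOCK_SIZE = 4   # hypothetical tile width
wA = 12          # hypothetical matrix width, a multiple of BLOCK_SIZE

def flat_index(bx, by, tx, ty):
    """Same arithmetic as the kernel excerpt, without the k*wA*wA slab offset."""
    c = bx * BLOCK_SIZE + by * BLOCK_SIZE * BLOCK_SIZE * (wA // BLOCK_SIZE)
    return c + tx + ty * wA

blocks = wA // BLOCK_SIZE  # assumed: a blocks x blocks grid of thread blocks
seen = []
for by in range(blocks):
    for bx in range(blocks):
        for ty in range(BLOCK_SIZE):
            for tx in range(BLOCK_SIZE):
                idx = flat_index(bx, by, tx, ty)
                row = by * BLOCK_SIZE + ty
                col = bx * BLOCK_SIZE + tx
                assert idx == row * wA + col  # row-major position in the slab
                seen.append(idx)

# Every element of the wA x wA slab is written by exactly one thread.
covered = sorted(seen) == list(range(wA * wA))
print(covered)
```

Because `BLOCK_SIZE*BLOCK_SIZE*(wA/BLOCK_SIZE)` equals `BLOCK_SIZE*wA`, the `by` term simply skips whole rows of tiles. Under these assumptions no two threads write the same output element.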
for n = 0:nn
    Pn = (legendre(n,yg))'; % Note error in Matlab normalization
    for m = 0:n
        Nmn = (-1)^m * sqrt((2*n+1)/2 * factorial(n-m)/factorial(n+m));
        P(1:njo2,n+1,m+1) = Nmn*Pn(1:njo2,m+1);
    end
end
Loop mapped to GPU
Legendre polynomial calculation
• Offload data intensive operation to GPU
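As a CPU-side reference for the loop that is mapped to the GPU, the normalization can be sketched in plain Python. This is only an illustration, not the project's code. It uses 0-based indices instead of MATLAB's 1-based ones, and a standard three-term recurrence stands in for MATLAB's `legendre`. It is written to follow the same convention, including the (-1)^m phase. The function names `legendre_unnormalized` and `normalized_table` are hypothetical.

```python
import math

def legendre_unnormalized(n, m, x):
    """Associated Legendre function P_n^m(x), including the (-1)^m phase."""
    s = math.sqrt((1.0 - x) * (1.0 + x))
    pmm = 1.0
    for k in range(1, m + 1):
        pmm *= -(2 * k - 1) * s            # closed form for P_m^m
    if n == m:
        return pmm
    prev, cur = pmm, (2 * m + 1) * x * pmm  # P_{m+1}^m
    for l in range(m + 2, n + 1):           # upward recurrence in degree
        prev, cur = cur, ((2 * l - 1) * x * cur - (l + m - 1) * prev) / (l - m)
    return cur

def normalized_table(nn, yg):
    """P[j][n][m] = Nmn * P_n^m(yg[j]) for 0 <= m <= n <= nn."""
    P = [[[0.0] * (nn + 1) for _ in range(nn + 1)] for _ in yg]
    for n in range(nn + 1):
        for m in range(n + 1):
            # Same normalization factor as in the MATLAB loop
            nmn = (-1) ** m * math.sqrt(
                (2 * n + 1) / 2 * math.factorial(n - m) / math.factorial(n + m))
            for j, x in enumerate(yg):
                P[j][n][m] = nmn * legendre_unnormalized(n, m, x)
    return P

P = normalized_table(2, [0.0, 0.5])  # two sample latitudes (sin of latitude)
print(P[0][1][1], P[1][2][0])
```

Each (j, n, m) entry here depends only on its own indices, which is the kind of independence that makes a loop a candidate for one-thread-per-entry mapping. For large degrees the factorials overflow double precision in MATLAB (171! already exceeds the double range), so a scaled or log-based form of `Nmn` may be needed at higher grid sizes.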
Analysis step
• Compute FFT on CPU side.
• MATLAB has a highly optimized FFT operation.
Synthesis step
• The IFFT is again computed on the CPU.
• A GPU FFT pays off only for very large transforms (more than about 10,000 points)!
• CPU side – DELL, Intel Quad Core @ 2.5 GHz and 2.5 GB RAM
• GPU – NVIDIA® 8800 GT
• CPU side code on MATLAB
• GPU code written using NVMEX, the MATLAB extensions provided by NVIDIA®
• Interfacing between CPU-GPU via plug-in for MATLAB.
Results
• For a grid size of 512, the speedup is almost 42x!
• The speedup shows an upward trend for larger grid sizes.
• Not much speedup for the analysis kernel.
• Values are comparable, though.
Conclusions and Future work
• Improves the on-the-fly Legendre polynomial calculation.
• Good speed up overall.
• Errors are low (less than E-10 on average).
• Need to look into performance for higher grid sizes.
• Complete synthesis step results.
• Possible exchange of ideas with PhD student at SMU, Dallas.
References
• Drake, J. B., Worley, P., and D’Azevedo, E. 2008. Algorithm 888: Spherical harmonic transform algorithms. ACM Trans. Math. Softw. 35, 3, Article 23 (October 2008)
• Akshara Kaginalkar, Sharad Purohit, Benchmarking of Medium Range Weather Forecasting Model on PARAM -A parallel machine, Center for Development of Advanced Computing (C-DAC), Pune University Campus, Pune 411007 India
• Martin J. Mohlenkamp, A Fast Transform for Spherical Harmonics, The Journal of Fourier Analysis and Applications, 1999
• Huadong Xiao, Yang Lu, Parallel computation for spherical harmonic synthesis and analysis, Computers & Geosciences, Volume 33, Issue 3, March 2007
• NVIDIA CUDA Programming Guide 2.0
“Special thanks to Prof. Dan Negrut and Makarand Datar, UW Mech department for access to their GPU machines”