computing using gpus

DESCRIPTION
Slides for a BarCamp Bangalore 9 session. We discussed computing using GPUs. The audience was predominantly not familiar with GPUs.

TRANSCRIPT

Computing Using Graphics Cards
Shree Kumar, Hewlett-Packard
http://www.shreekumar.in/
Speaker Intro
• High Performance Computing @ Hewlett-Packard
  – VizStack (http://vizstack.sourceforge.net)
  – GPU Computing
• Big 3D enthusiast
• Travels a lot
• Blogs at http://www.shreekumar.in/
What we will cover
• GPUs and their history
• Why use GPUs
• Architecture
• Getting started with GPU programming
• Challenges, techniques & pitfalls
• Where not to use GPUs?
• Resources
• The future
What is a GPU
• Graphics Processing Unit
  – Term coined in 1999 by NVidia
  – Specialized add-on board
• Accelerates interactive 3D rendering
  – 60 image updates per second (or more) on large data
  – Solves an embarrassingly parallel problem
  – Game-driven volume economics
• NVidia v/s ATI, just like Intel v/s AMD
• Demand for better effects led to
  – programmable GPUs
  – floating-point capabilities
  – this led to General Purpose GPU (GPGPU) computation
History of GPUs: a GPGPU Perspective

| Date | Product         | Transistors | Cores | FLOPS                   | Technology |
|------|-----------------|-------------|-------|-------------------------|------------|
| 1997 | RIVA 128        | 3 M         |       |                         | Rasterization |
| 1999 | GeForce 256     | 25 M        |       |                         | Transform & Lighting |
| 2001 | GeForce 3       | 60 M        |       |                         | Programmable shaders |
| 2002 | GeForce FX      | 125 M       |       |                         | 16- and 32-bit FP, long shaders |
| 2004 | GeForce 6800    | 222 M       |       |                         | Infinite-length shaders, branching |
| 2006 | GeForce 8800    | 681 M       | 128   |                         | Unified graphics & compute, CUDA, 64-bit FP |
| 2008 | GeForce GTX 280 | 1.4 B       | 240   | 933 G (SP), 78 G (DP)   | IEEE FP, CUDA C, OpenCL and DirectCompute, PCI-Express Gen 2 |
| 2009 | Tesla M2050     | 3.0 B       | 512   | 1.03 T (SP), 515 G (DP) | Improved 64-bit perf, caching, ECC memory, 64-bit unified addressing, asynchronous bidirectional data transfer, multiple kernels |

Source: Nickolls J., Dally W.J., “The GPU Computing Era”, IEEE Micro, March-April 2010
The GPU Advantage
• 30x CPU FLOPS on latest GPUs
• 10x memory bandwidth
• Energy efficient: 5x performance/Watt
• Add to these a 3x performance/$

All graphs from: GPU4Vision : http://gpu4vision.icg.tugrz.at/
People use GPUs for…
Source : Nickolls J. , Dally W.J. “The GPU Computing Era”, IEEE Micro, March-April 2010
More “why to use GPUs”
• Proliferation of GPUs
  – Mobile devices will have capable GPUs soon!
• Make more things possible
  – Make things real-time
    • From seconds to real-time interactive performance
  – Reduce offline processing overhead
• Research opportunities
  – New & efficient algorithms
  – Pairing multi-core CPUs and massively multi-threaded GPUs
GPU Computing 1-2-3
1. A GPU isn’t a CPU replacement!
2. There ain’t no such thing as a FREE lunch!
3. You don’t always “port” a CPU algorithm to a GPU!
CPU versus GPU
• CPU
  – Optimized for latency
  – Speedup techniques
    • Vectorization (MMX, SSE, …)
    • Coarse-grained parallelism using multiple CPUs and cores
  – Memory approaching a TB
• GPU
  – Optimized for throughput
  – Speedup techniques
    • Massive multithreading
    • Fine-grained parallelism
  – A few GBs of memory max
Getting Started
• Software
  – CUDA (NVidia specific)
  – OpenCL (cross-platform, GPU/CPU)
  – DirectCompute (MS specific)
• Hardware
  – A system equipped with a GPU
• OS no bar
  – But Windows and RedHat Enterprise Linux seem better supported
CUDA
• Compute Unified Device Architecture
• Most popular GPGPU toolkit
• CUDA C extends C with constructs
  – Easy to write programs
• Lower-level “driver” API is available
  – Provides more control
  – Use multiple GPUs in the same application
  – Mix graphics & compute code
• Language bindings available
  – PyCUDA, Java, .NET
• Toolkit provides conveniences
Source: NVIDIA CUDA Architecture, Introduction and Overview
CUDA Toolkit
CUDA Architecture
• 1 or more streaming multiprocessors (“cores”)
• Thread blocks
  – Single Instruction, Multiple Thread (SIMT)
  – Hide latency by parallelism
• Memory hierarchy
  – Fermi GPUs can access system memory
• Primitives for
  – Thread synchronization
  – Atomic operations on memory
Source : The GPU Computing Era
Simple Example: Vector Addition

C/C++ – serial code:

```c
void VecAdd(const float *A, const float *B, float *C, int N) {
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

VecAdd(A, B, C, N);
```

C/C++ with OpenMP – thread-level parallelism (the loop needs an enclosing parallel region, so the combined `omp parallel for` pragma is used):

```c
void VecAdd(const float *A, const float *B, float *C, int N) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

VecAdd(A, B, C, N);
```
Vector Addition using CUDA

CUDA C – element-level parallelism:

```c
__global__ void VecAdd(const float *A, const float *B, float *C, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}
```

Invoking the function:

```c
/* Allocate memory on the GPU */
cudaMalloc((void **)&d_A, size);
cudaMalloc((void **)&d_B, size);
cudaMalloc((void **)&d_C, size);

/* Copy arrays to the GPU */
cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);

/* Invoke the function */
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

/* Copy the result back to main memory */
cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

/* Free GPU memory */
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
```

Compilation:

```shell
# nvcc vectorAdd.cu -I ../../common/inc
```
GPU Programming Challenges
• Need high “occupancy” for best performance
• Extracting parallelism with limited resources
  – Limited registers
  – Limited shared memory
• Preferred approach
  – Small kernels
  – Multiple passes if needed
• Decompose problem into parallel pieces
  – Write once, scale & perform everywhere!
GPU Programming
• Use shared memory when possible
  – Cooperation between threads in a block
  – Reduce access to global memory
• Reduce data transfer over the bus
• It’s still a GPU!
  – Use textures to your advantage
  – Use vector data types if you can
• Watch out for GPU capability differences!
Enough Theory!
Demo Time & Let’s do some programming
Watch out for
• Portability of programs across GPUs
  – Capabilities vary from GPU to GPU
  – Memory usage
• Arithmetic differences in the result
• Pay careful attention to demos…
Resources
• CUDA
  – Tools on NVIDIA Developer Site: http://developer.nvidia.com/object/gpucomputing.html
  – CUDPP: http://code.google.com/p/cudpp/
• OpenCL
• Google Search!
The Future
• Better throughput
  – More GPU cores, scaling by Moore’s law
  – PCIe Gen 3
• Easier to program
• Arbitrary control and data access patterns

Questions?