computing using gpus

DESCRIPTION
Slides for a BarCamp Bangalore 9 session. We discussed computing using GPUs. The audience was predominantly not familiar with GPUs.

TRANSCRIPT

Computing Using Graphics Cards
Shree Kumar, Hewlett-Packard
http://www.shreekumar.in/
Speaker Intro
• High Performance Computing @ Hewlett-Packard
  – VizStack (http://vizstack.sourceforge.net)
  – GPU Computing
• Big 3D enthusiast
• Travels a lot
• Blogs at http://www.shreekumar.in/
What we will cover
• GPUs and their history
• Why use GPUs
• Architecture
• Getting started with GPU programming
• Challenges, techniques & pitfalls
• Where not to use GPUs?
• Resources
• The future
What is a GPU
• Graphics Processing Unit
  – Term coined in 1999 by NVidia
  – Specialized add-on board
• Accelerates interactive 3D rendering
  – 60 image updates per second (or more) on large data
  – Solves an embarrassingly parallel problem
  – Game-driven volume economics
• NVidia v/s ATI, just like Intel v/s AMD
• Demand for better effects led to
  – programmable GPUs
  – floating-point capabilities
  – this led to General Purpose GPU (GPGPU) computation
History of GPUs: a GPGPU Perspective

| Date | Product         | Transistors | Cores | FLOPS                   | Technology |
|------|-----------------|-------------|-------|-------------------------|------------|
| 1997 | RIVA 128        | 3 M         |       |                         | Rasterization |
| 1999 | GeForce 256     | 25 M        |       |                         | Transform & Lighting |
| 2001 | GeForce 3       | 60 M        |       |                         | Programmable shaders |
| 2002 | GeForce FX      | 125 M       |       |                         | 16- and 32-bit FP, long shaders |
| 2004 | GeForce 6800    | 222 M       |       |                         | Infinite-length shaders, branching |
| 2006 | GeForce 8800    | 681 M       | 128   |                         | Unified graphics & compute, CUDA, 64-bit FP |
| 2008 | GeForce GTX 280 | 1.4 B       | 240   | 933 G (SP), 78 G (DP)   | IEEE FP, CUDA C, OpenCL and DirectCompute, PCI-Express Gen 2 |
| 2009 | Tesla M2050     | 3.0 B       | 512   | 1.03 T (SP), 515 G (DP) | Improved 64-bit perf, caching, ECC memory, 64-bit unified addressing, asynchronous bidirectional data transfer, multiple kernels |

Source: Nickolls J., Dally W.J., “The GPU Computing Era”, IEEE Micro, March-April 2010
The GPU Advantage
• 30x CPU FLOPS on latest GPUs
• 10x memory bandwidth
• Energy efficient: 5x performance/Watt
• Add to these a 3x performance/$

All graphs from: GPU4Vision : http://gpu4vision.icg.tugrz.at/
People use GPUs for…
Source : Nickolls J. , Dally W.J. “The GPU Computing Era”, IEEE Micro, March-April 2010
More “why to use GPUs”
• Proliferation of GPUs
  – Mobile devices will have capable GPUs soon!
• Make more things possible
  – Make things real-time
    • From seconds to real-time interactive performance
  – Reduce offline processing overhead
• Research opportunities
  – New & efficient algorithms
  – Pairing multi-core CPUs and massively multi-threaded GPUs
GPU Computing 1-2-3
1. A GPU isn’t a CPU replacement!
2. There ain’t no such thing as a FREE lunch!
3. You don’t always “port” a CPU algorithm to a GPU!
CPU versus GPU
• CPU
  – Optimized for latency
  – Speedup techniques
    • Vectorization (MMX, SSE, …)
    • Coarse-grained parallelism using multiple CPUs and cores
  – Memory approaching a TB
• GPU
  – Optimized for throughput
  – Speedup techniques
    • Massive multithreading
    • Fine-grained parallelism
  – A few GBs of memory max
Getting Started
• Software
  – CUDA (NVidia specific)
  – OpenCL (cross-platform, GPU/CPU)
  – DirectCompute (MS specific)
• Hardware
  – A system equipped with a GPU
• OS no bar
  – But Windows and RedHat Enterprise Linux seem better supported
CUDA
• Compute Unified Device Architecture
• Most popular GPGPU toolkit
• CUDA C extends C with constructs
  – Easy to write programs
• Lower-level “driver” API is available
  – Provides more control
  – Use multiple GPUs in the same application
  – Mix graphics & compute code
• Language bindings available
  – PyCUDA, Java, .NET
• Toolkit provides conveniences
Source: NVIDIA CUDA Architecture, Introduction and Overview
CUDA Toolkit
CUDA Architecture
• 1 or more streaming multiprocessors (“cores”)
• Thread blocks
  – Single Instruction, Multiple Thread (SIMT)
  – Hide latency by parallelism
• Memory hierarchy
  – Fermi GPUs can access system memory
• Primitives for
  – Thread synchronization
  – Atomic operations on memory
Source : The GPU Computing Era
Simple Example: Vector Addition

C/C++ – serial code:

```c
void VecAdd(const float *A, const float *B, float *C, int N) {
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

VecAdd(A, B, C, N);
```

C/C++ with OpenMP – thread-level parallelism (the loop needs an enclosing parallel region, so the combined `omp parallel for` pragma is used):

```c
void VecAdd(const float *A, const float *B, float *C, int N) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

VecAdd(A, B, C, N);
```
Vector Addition using CUDA

CUDA C – element-level parallelism:

```c
__global__ void VecAdd(const float *A, const float *B, float *C, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}
```

Invoking the function:

```c
/* Allocate memory on the GPU */
cudaMalloc((void **)&d_A, size);
cudaMalloc((void **)&d_B, size);
cudaMalloc((void **)&d_C, size);

/* Copy arrays to the GPU */
cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);

/* Invoke the function */
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

/* Copy the result back to main memory */
cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

/* Free GPU memory */
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
```

Compilation:

```shell
# nvcc vectorAdd.cu -I ../../common/inc
```
GPU Programming Challenges
• Need high “occupancy” for best performance
• Extracting parallelism with limited resources
  – Limited registers
  – Limited shared memory
• Preferred approach
  – Small kernels
  – Multiple passes if needed
• Decompose problem into parallel pieces
  – Write once, scale & perform everywhere!
GPU Programming
• Use shared memory when possible
  – Cooperation between threads in a block
  – Reduce access to global memory
• Reduce data transfer over the bus
• It’s still a GPU!
  – Use textures to your advantage
  – Use vector data types if you can
• Watch out for GPU capability differences!
Enough Theory!
Demo Time & Let’s do some programming
Watch out for
• Portability of programs across GPUs
  – Capabilities vary from GPU to GPU
  – Memory usage
• Arithmetic differences in the result
• Pay careful attention to demos…
Resources
• CUDA
  – Tools on NVIDIA Developer Site: http://developer.nvidia.com/object/gpucomputing.html
  – CUDPP: http://code.google.com/p/cudpp/
• OpenCL
• Google Search!
The Future
• Better throughput
  – More GPU cores, scaling by Moore’s law
  – PCIe Gen 3
• Easier to program
• Arbitrary control and data access patterns

Questions?