Computing Using GPUs

Post on 15-Jun-2015


DESCRIPTION

Slides from a BarCamp Bangalore 9 session on computing using GPUs. The audience was predominantly unfamiliar with GPUs.

TRANSCRIPT

Computing Using Graphics Cards

Shree Kumar, Hewlett-Packard
http://www.shreekumar.in/

Speaker Intro

• High Performance Computing @ Hewlett-Packard
  – VizStack (http://vizstack.sourceforge.net)
  – GPU Computing
• Big 3D enthusiast
• Travels a lot
• Blogs at http://www.shreekumar.in/

What we will cover

• GPUs and their history
• Why use GPUs
• Architecture
• Getting started with GPU programming
• Challenges, techniques & pitfalls
• Where not to use GPUs?
• Resources
• The future

What is a GPU

• Graphics Processing Unit
  – Term coined in 1999 by NVidia
  – Specialized add-on board
• Accelerates interactive 3D rendering
  – 60 image updates (or more) per second on large data
  – Solves an embarrassingly parallel problem
  – Game-driven volume economics
• NVidia vs. ATI, just like Intel vs. AMD
• Demand for better effects led to
  – programmable GPUs
  – floating-point capabilities
  – this led to General Purpose GPU (GPGPU) computation

Presenter notes:
The term “GPU” was coined by NVidia to coincide with the launch of their “GeForce 256” graphics card (around Aug 1999). IMO, the idea was to latch onto the very popular “CPU”. Earlier graphics cards handled rasterization of 2D primitives. 3D transformations and lighting computations were done using the host CPU. The GeForce 256 changed this by doing T&L in the graphics card.

History of GPUs : a GPGPU Perspective

| Date | Product         | Transistors | Cores | FLOPS          | Technology |
|------|-----------------|-------------|-------|----------------|------------|
| 1997 | RIVA 128        | 3 M         |       |                | Rasterization |
| 1999 | GeForce 256     | 25 M        |       |                | Transform & lighting |
| 2001 | GeForce 3       | 60 M        |       |                | Programmable shaders |
| 2002 | GeForce FX      | 125 M       |       |                | 16-, 32-bit FP, long shaders |
| 2004 | GeForce 6800    | 222 M       |       |                | Infinite-length shaders, branching |
| 2006 | GeForce 8800    | 681 M       | 128   |                | Unified graphics & compute, CUDA, 64-bit FP |
| 2008 | GeForce GTX 280 | 1.4 B       | 240   | 933 G / 78 G   | IEEE FP, CUDA C, OpenCL and DirectCompute, PCI-Express Gen 2 |
| 2009 | Tesla M2050     | 3.0 B       | 512   | 1.03 T / 515 G | Improved 64-bit perf, caching, ECC memory, 64-bit unified addressing, asynchronous bidirectional data transfer, multiple kernels |

Source : Nickolls J. , Dally W.J. “The GPU Computing Era”, IEEE Micro, March-April 2010


The GPU Advantage

• 30x CPU FLOPS on latest GPUs
• 10x memory bandwidth
• Energy efficient: 5x performance/watt
• Add to these a 3x performance/$

All graphs from GPU4Vision: http://gpu4vision.icg.tugrz.at/

People use GPUs for…

Source : Nickolls J. , Dally W.J. “The GPU Computing Era”, IEEE Micro, March-April 2010

More “why to use GPUs”

• Proliferation of GPUs
  – Mobile devices will have capable GPUs soon!
• Make more things possible
  – Make things real-time
    • From seconds to real-time interactive performance
  – Reduce offline processing overhead
• Research opportunities
  – New & efficient algorithms
  – Pairing multi-core CPUs and massively multi-threaded GPUs

GPU Computing 1-2-3

1. A GPU isn’t a CPU replacement!
2. There ain’t no such thing as a FREE lunch!
3. You don’t always “port” a CPU algorithm to a GPU!

CPU versus GPU

• CPU
  – Optimized for latency
  – Speedup techniques
    • Vectorization (MMX, SSE, …)
    • Coarse-grained parallelism using multiple CPUs and cores
  – Memory approaching a TB
• GPU
  – Optimized for throughput
  – Speedup techniques
    • Massive multithreading
    • Fine-grained parallelism
  – A few GBs of memory max

Getting Started

• Software
  – CUDA (NVidia-specific)
  – OpenCL (cross-platform, GPU/CPU)
  – DirectCompute (MS-specific)
• Hardware
  – A system equipped with a GPU
• OS no bar
  – But Windows and Red Hat Enterprise Linux seem better supported

CUDA

• Compute Unified Device Architecture
• Most popular GPGPU toolkit
• CUDA C extends C with constructs
  – Easy to write programs
• Lower-level “driver” API is available
  – Provides more control
  – Use multiple GPUs in the same application
  – Mix graphics & compute code
• Language bindings available
  – PyCUDA, Java, .NET
• Toolkit provides conveniences

Source: NVIDIA CUDA Architecture, Introduction and Overview

CUDA Toolkit

CUDA Architecture

• 1 or more streaming multiprocessors (“cores”)
• Thread blocks
  – Single Instruction, Multiple Thread (SIMT)
  – Hide latency by parallelism
• Memory hierarchy
  – Fermi GPUs can access system memory
• Primitives for
  – Thread synchronization
  – Atomic operations on memory

Source: The GPU Computing Era

Simple Example : Vector Addition

C/C++ – serial code:

    void VecAdd(const float *A, const float *B, float *C, int N) {
        for (unsigned int i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }
    VecAdd(A, B, C, N);

C/C++ with OpenMP – thread-level parallelism:

    void VecAdd(const float *A, const float *B, float *C, int N) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }
    VecAdd(A, B, C, N);

Vector Addition using CUDA

CUDA C – element-level parallelism:

    __global__ void VecAdd(const float *A, const float *B, float *C, int N) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < N)
            C[i] = A[i] + B[i];
    }

Invoking the function:

    // Allocate memory on the GPU
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);
    // Copy arrays to the GPU
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    // Invoke the function
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    // Copy result back to main memory
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
    // Free GPU memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

Compilation:

    # nvcc vectorAdd.cu -I ../../common/inc

GPU Programming Challenges

• Need high “occupancy” for best performance
• Extracting parallelism with limited resources
  – Limited registers
  – Limited shared memory
• Preferred approach
  – Small kernels
  – Multiple passes if needed
• Decompose problem into parallel pieces
  – Write once, scale & perform everywhere!

GPU Programming

• Use shared memory when possible
  – Cooperation between threads in a block
  – Reduce access to global memory
• Reduce data transfer over the bus
• It’s still a GPU!
  – Use textures to your advantage
  – Use vector data types if you can
• Watch out for GPU capability differences!

Enough Theory!

Demo Time & Let’s do some programming

Watch out for

• Portability of programs across GPUs
  – Capabilities vary from GPU to GPU
  – Memory usage
• Arithmetic differences in the result
• Pay careful attention to demos…

Resources

• CUDA
  – Tools on NVIDIA Developer Site: http://developer.nvidia.com/object/gpucomputing.html
  – CUDPP: http://code.google.com/p/cudpp/
• OpenCL
• Google Search!

The Future

• Better throughput
  – More GPU cores, scaling by Moore’s law
  – PCIe Gen 3
• Easier to program
• Arbitrary control and data access patterns

Questions ?

shree.shree@gmail.com
