Computing using GPUs

Computing Using Graphics Cards
Shree Kumar, Hewlett Packard
http://www.shreekumar.in/


DESCRIPTION

Slides from a BarCamp Bangalore 9 session where we discussed computing using GPUs. The audience was predominantly unfamiliar with GPUs.

TRANSCRIPT

Page 1: Computing using GPUs

Computing Using Graphics Cards

Shree Kumar, Hewlett Packard
http://www.shreekumar.in/

Page 2: Computing using GPUs

Speaker Intro

• High Performance Computing @ Hewlett-Packard
  – VizStack (http://vizstack.sourceforge.net)
  – GPU Computing

• Big 3D enthusiast
• Travels a lot
• Blogs at http://www.shreekumar.in/

Page 3: Computing using GPUs

What we will cover

• GPUs and their history
• Why use GPUs
• Architecture
• Getting Started with GPU Programming
• Challenges, Techniques & Pitfalls
• Where not to use GPUs?
• Resources
• The Future

Page 4: Computing using GPUs

What is a GPU

• Graphics Processing Unit
  – Term coined in 1999 by NVidia
  – Specialized add-on board

• Accelerates interactive 3D rendering
  – 60 (or more) image updates per second on large data
  – Solves an embarrassingly parallel problem
  – Game-driven volume economics

• NVidia v/s ATI, just like Intel v/s AMD

• Demand for better effects led to
  – programmable GPUs
  – floating point capabilities
  – which in turn enabled General Purpose GPU (GPGPU) computation

Presentation Notes:
The term “GPU” was coined by NVidia to coincide with the launch of their “GeForce 256” graphics card (around Aug 1999). IMO, the idea was to latch onto the very popular “CPU”. Earlier graphics cards handled rasterization of 2D primitives. 3D transformations and lighting computations were done using the host CPU. The GeForce 256 changed this by doing T&L in the graphics card.
Page 5: Computing using GPUs

History of GPUs : a GPGPU Perspective

Date | Product         | Transistors | Cores | FLOPS (SP / DP) | Technology
1997 | RIVA 128        | 3 M         |       |                 | Rasterization
1999 | GeForce 256     | 25 M        |       |                 | Transform & Lighting
2001 | GeForce 3       | 60 M        |       |                 | Programmable shaders
2002 | GeForce FX      | 125 M       |       |                 | 16- and 32-bit FP, long shaders
2004 | GeForce 6800    | 222 M       |       |                 | Infinite-length shaders, branching
2006 | GeForce 8800    | 681 M       | 128   |                 | Unified graphics & compute, CUDA, 64-bit FP
2008 | GeForce GTX 280 | 1.4 B       | 240   | 933 G / 78 G    | IEEE FP, CUDA C, OpenCL and DirectCompute, PCI-Express Gen 2
2009 | Tesla M2050     | 3.0 B       | 512   | 1.03 T / 515 G  | Improved 64-bit perf, caching, ECC memory, 64-bit unified addressing, asynchronous bidirectional data transfer, multiple kernels

Source: Nickolls J., Dally W.J., “The GPU Computing Era”, IEEE Micro, March-April 2010

Presentation Notes:
A lot of the material is sourced from Nickolls J., Dally W.J., “The GPU Computing Era”, IEEE Micro, March-April 2010.
Page 6: Computing using GPUs

The GPU Advantage

• 30x CPU FLOPS on latest GPUs
• 10x memory bandwidth
• Energy efficient: 5x performance/Watt
• Add to these a 3x performance/$

All graphs from GPU4Vision: http://gpu4vision.icg.tugrz.at/

Page 7: Computing using GPUs

People use GPUs for…

Source: Nickolls J., Dally W.J., “The GPU Computing Era”, IEEE Micro, March-April 2010

Page 8: Computing using GPUs

More “why to use GPUs”

• Proliferation of GPUs
  – Mobile devices will have capable GPUs soon!

• Make more things possible
  – Make things real-time
    • From seconds to real-time interactive performance
  – Reduce offline processing overhead

• Research opportunities
  – New & efficient algorithms
  – Pairing multi-core CPUs and massively multi-threaded GPUs

Page 9: Computing using GPUs

GPU Computing 1-2-3

A GPU isn’t a CPU replacement!

Page 10: Computing using GPUs

GPU Computing 1‐2‐3

There ain’t no such thing as a FREE Lunch!

Page 11: Computing using GPUs

GPU Computing 1‐2‐3

You don’t always “port” a CPU algorithm to a GPU!

Page 12: Computing using GPUs

CPU versus GPU

• CPU
  – Optimized for latency
  – Speedup techniques
    • Vectorization (MMX, SSE, …)
    • Coarse-grained parallelism using multiple CPUs and cores
  – Memory approaching a TB

• GPU
  – Optimized for throughput
  – Speedup techniques
    • Massive multithreading
    • Fine-grained parallelism
  – A few GBs of memory at most

Page 13: Computing using GPUs

Getting Started

• Software
  – CUDA (NVidia-specific)
  – OpenCL (cross-platform, GPU/CPU)
  – DirectCompute (MS-specific)

• Hardware
  – A system equipped with a GPU (a device-query sketch follows this list)

• OS no bar
  – But Windows and Red Hat Enterprise Linux seem better supported
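A natural first step on such a system is to enumerate the GPUs. The sketch below is my illustration, not from the slides; it uses the CUDA runtime API and mirrors what the SDK’s deviceQuery sample prints.

// Minimal device enumeration with the CUDA runtime API.
// Compile with: nvcc devicequery.cu
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);              // number of CUDA-capable GPUs
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);   // capabilities of GPU i
        printf("GPU %d: %s, compute capability %d.%d, %zu MB global memory\n",
               i, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem / (1024 * 1024));
    }
    return 0;
}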

Page 14: Computing using GPUs

CUDA

• Compute Unified Device Architecture
• Most popular GPGPU toolkit
• CUDA C extends C with constructs
  – Easy to write programs
• A lower-level “driver” API is available (a sketch follows this slide)
  – Provides more control
  – Use multiple GPUs in the same application
  – Mix graphics & compute code
• Language bindings available
  – PyCUDA, Java, .NET
• Toolkit provides conveniences

Source: NVIDIA CUDA Architecture, Introduction and Overview
(Figure: the CUDA Toolkit stack)
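To make the “lower-level driver API” point concrete, here is a minimal sketch of its explicit setup (my illustration, not from the slides; it needs cuda.h and linking with -lcuda). The runtime API used in the later examples does all of this implicitly; the explicit context is where the extra control comes from.

#include <cstdio>
#include <cuda.h>

int main()
{
    cuInit(0);                          // must precede any other driver-API call

    CUdevice dev;
    cuDeviceGet(&dev, 0);               // take the first GPU

    char name[128];
    cuDeviceGetName(name, sizeof(name), dev);
    printf("Using %s\n", name);

    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);          // explicit context: the "more control" part

    // ... cuModuleLoad() a compiled .ptx/.cubin and cuLaunchKernel() here ...

    cuCtxDestroy(ctx);
    return 0;
}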

Page 15: Computing using GPUs

CUDA Architecture

• 1 or more streaming multiprocessors (“cores”)
• Thread blocks
  – Single Instruction, Multiple Thread (SIMT)
  – Hide latency by parallelism
• Memory hierarchy
  – Fermi GPUs can access system memory
• Primitives for (see the toy kernel after this slide)
  – Thread synchronization
  – Atomic operations on memory

Source: The GPU Computing Era
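To illustrate those two primitives, here is a toy kernel of my own (not from the slides): __syncthreads() acts as a barrier for all threads in a block, and atomicAdd() gives race-free updates. Tallying per block in shared memory means only one atomic touches global memory per block.

// Count the positive entries of an array using block-level cooperation.
__global__ void CountPositives(const float *in, int *count, int N)
{
    __shared__ int blockCount;               // shared by all threads in this block
    if (threadIdx.x == 0)
        blockCount = 0;
    __syncthreads();                         // wait until blockCount is initialized

    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N && in[i] > 0.0f)
        atomicAdd(&blockCount, 1);           // per-block tally, no races
    __syncthreads();                         // wait until all threads have tallied

    if (threadIdx.x == 0)
        atomicAdd(count, blockCount);        // one global update per block
}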

Page 16: Computing using GPUs

Simple Example : Vector Addition

C/C++ serial code:

void VecAdd(const float *A, const float *B, float *C, int N)
{
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

VecAdd(A, B, C, N);

C/C++ with OpenMP, thread-level parallelism:

void VecAdd(const float *A, const float *B, float *C, int N)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

VecAdd(A, B, C, N);

Page 17: Computing using GPUs

Vector Addition using CUDA

CUDA C, element-level parallelism:

__global__ void VecAdd(const float *A, const float *B, float *C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;  // global thread index
    if (i < N)
        C[i] = A[i] + B[i];
}

Invoking the function from host code:

// Allocate memory on the GPU
cudaMalloc((void **)&d_A, size);
cudaMalloc((void **)&d_B, size);
cudaMalloc((void **)&d_C, size);

// Copy input arrays to the GPU
cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);

// Invoke the kernel: enough blocks to cover all N elements
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

// Copy the result back to main memory
cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

// Free GPU memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);

Compilation:

# nvcc vectorAdd.cu -I ../../common/inc
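One thing the snippet above omits is error handling: every CUDA runtime call returns a cudaError_t. A common pattern (my sketch, not from the slides) is a small checking macro:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err));                      \
            exit(1);                                               \
        }                                                          \
    } while (0)

// Usage, wrapping the calls above:
// CUDA_CHECK(cudaMalloc((void **)&d_A, size));
// CUDA_CHECK(cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice));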

Page 18: Computing using GPUs

GPU Programming Challenges

• Need high “occupancy” for best performance (see the sketch after this list)
• Extracting parallelism with limited resources
  – Limited registers
  – Limited shared memory
• Preferred approach
  – Small kernels
  – Multiple passes if needed
• Decompose the problem into parallel pieces
  – Write once, scale & perform everywhere!
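To make “occupancy” concrete: a later addition to the CUDA runtime, cudaOccupancyMaxPotentialBlockSize (CUDA 6.5, well after this talk), lets the driver pick an occupancy-maximizing block size instead of hard-coding 256. A minimal sketch, reusing the VecAdd kernel from earlier:

#include <cuda_runtime.h>

__global__ void VecAdd(const float *A, const float *B, float *C, int N);

void LaunchVecAdd(const float *d_A, const float *d_B, float *d_C, int N)
{
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for the block size that maximizes occupancy for VecAdd.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, VecAdd, 0, 0);

    int blocksPerGrid = (N + blockSize - 1) / blockSize;  // cover all N elements
    VecAdd<<<blocksPerGrid, blockSize>>>(d_A, d_B, d_C, N);
}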

Page 19: Computing using GPUs

GPU Programming

• Use shared memory when possible (a sketch follows this list)
  – Cooperation between threads in a block
  – Reduce access to global memory
• Reduce data transfer over the bus
• It’s still a GPU!
  – Use textures to your advantage
  – Use vector data types if you can
• Watch out for GPU capability differences!
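As a sketch of the “use shared memory” advice (my illustration, not from the slides): a block-level sum reduction, where the threads of a block cooperate through fast shared memory so that each input element is read from global memory exactly once.

// Each block sums its slice of the input; launch with 256 threads per
// block to match the cache size, then sum blockSums on the host or GPU.
__global__ void BlockSum(const float *in, float *blockSums, int N)
{
    __shared__ float cache[256];                 // one slot per thread
    int i   = blockDim.x * blockIdx.x + threadIdx.x;
    int tid = threadIdx.x;

    cache[tid] = (i < N) ? in[i] : 0.0f;         // one global read per thread
    __syncthreads();

    // Tree reduction entirely within shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = cache[0];        // one global write per block
}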

Page 20: Computing using GPUs

Enough Theory!

Demo Time & Let’s do some programming

Page 21: Computing using GPUs

Watch out for

• Portability of programs across GPUs
  – Capabilities vary from GPU to GPU (a capability-check sketch follows this list)
  – Memory usage
• Arithmetic differences in the result
• Pay careful attention to demos…
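A small sketch of guarding against capability differences (my illustration, not from the slides): query the device’s compute capability at runtime before relying on a feature such as double precision, which requires compute capability 1.3 or newer.

#include <cuda_runtime.h>

bool DeviceSupportsDoubles(int device)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    // Double precision arrived with compute capability 1.3 (GT200).
    return prop.major > 1 || (prop.major == 1 && prop.minor >= 3);
}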

Page 22: Computing using GPUs

Resources

• CUDA
  – Tools on the NVIDIA Developer Site: http://developer.nvidia.com/object/gpucomputing.html
  – CUDPP: http://code.google.com/p/cudpp/
• OpenCL
• Google Search!

Page 23: Computing using GPUs

The Future

• Better throughput
  – More GPU cores, scaling by Moore’s law
  – PCIe Gen 3
• Easier to program
• Arbitrary control and data access patterns

Page 24: Computing using GPUs

Questions?

[email protected]