Computing Using GPUs

Post on 15-Jun-2015


DESCRIPTION

Slides from a BarCamp Bangalore 9 session on computing using GPUs. The audience was predominantly unfamiliar with GPUs.

TRANSCRIPT

Computing Using Graphics Cards

Shree Kumar, Hewlett-Packard
http://www.shreekumar.in/

Speaker Intro

• High Performance Computing @ Hewlett-Packard
  – VizStack (http://vizstack.sourceforge.net)
  – GPU Computing
• Big 3D enthusiast
• Travels a lot
• Blogs at http://www.shreekumar.in/

What we will cover

• GPUs and their history
• Why use GPUs
• Architecture
• Getting started with GPU programming
• Challenges, techniques & pitfalls
• Where not to use GPUs?
• Resources
• The future

What is a GPU

• Graphics Processing Unit
  – Term coined in 1999 by NVidia
  – Specialized add-on board
• Accelerates interactive 3D rendering
  – 60 image updates (or more) per second on large data
  – Solves an embarrassingly parallel problem
  – Game-driven volume economics
• NVidia vs. ATI, just like Intel vs. AMD
• Demand for better effects led to
  – programmable GPUs
  – floating-point capabilities
  – this led to General Purpose GPU (GPGPU) computation

Presenter notes:
The term “GPU” was coined by NVidia to coincide with the launch of their “GeForce 256” graphics card (around Aug 1999). IMO, the idea was to latch onto the very popular “CPU”. Earlier graphics cards handled rasterization of 2D primitives. 3D transformations and lighting computations were done using the host CPU. The GeForce 256 changed this by doing T&L in the graphics card.

History of GPUs : a GPGPU Perspective

| Date | Product         | Transistors | Cores | FLOPS          | Technology |
|------|-----------------|-------------|-------|----------------|------------|
| 1997 | RIVA 128        | 3 M         |       |                | Rasterization |
| 1999 | GeForce 256     | 25 M        |       |                | Transform & lighting |
| 2001 | GeForce 3       | 60 M        |       |                | Programmable shaders |
| 2002 | GeForce FX      | 125 M       |       |                | 16-, 32-bit FP, long shaders |
| 2004 | GeForce 6800    | 222 M       |       |                | Infinite-length shaders, branching |
| 2006 | GeForce 8800    | 681 M       | 128   |                | Unified graphics & compute, CUDA, 64-bit FP |
| 2008 | GeForce GTX 280 | 1.4 B       | 240   | 933 G / 78 G   | IEEE FP, CUDA C, OpenCL and DirectCompute, PCI-Express Gen 2 |
| 2009 | Tesla M2050     | 3.0 B       | 512   | 1.03 T / 515 G | Improved 64-bit perf, caching, ECC memory, 64-bit unified addressing, asynchronous bidirectional data transfer, multiple kernels |

Source : Nickolls J. , Dally W.J. “The GPU Computing Era”, IEEE Micro, March-April 2010


The GPU Advantage

• 30x CPU FLOPS on latest GPUs
• 10x memory bandwidth
• Energy efficient: 5x performance/watt
• Add to these a 3x performance/$

All graphs from GPU4Vision: http://gpu4vision.icg.tugrz.at/

People use GPUs for…

Source : Nickolls J. , Dally W.J. “The GPU Computing Era”, IEEE Micro, March-April 2010

More “why to use GPUs”

• Proliferation of GPUs
  – Mobile devices will have capable GPUs soon!
• Make more things possible
  – Make things real-time
    • From seconds to real-time interactive performance
  – Reduce offline processing overhead
• Research opportunities
  – New & efficient algorithms
  – Pairing multi-core CPUs and massively multi-threaded GPUs

GPU Computing 1-2-3

1. A GPU isn’t a CPU replacement!
2. There ain’t no such thing as a FREE lunch!
3. You don’t always “port” a CPU algorithm to a GPU!

CPU versus GPU

• CPU
  – Optimized for latency
  – Speedup techniques
    • Vectorization (MMX, SSE, …)
    • Coarse-grained parallelism using multiple CPUs and cores
  – Memory approaching a TB
• GPU
  – Optimized for throughput
  – Speedup techniques
    • Massive multithreading
    • Fine-grained parallelism
  – A few GBs of memory max

Getting Started

• Software
  – CUDA (NVidia-specific)
  – OpenCL (cross-platform, GPU/CPU)
  – DirectCompute (MS-specific)
• Hardware
  – A system equipped with a GPU
• OS no bar
  – But Windows and Red Hat Enterprise Linux seem better supported

CUDA

• Compute Unified Device Architecture
• Most popular GPGPU toolkit
• CUDA C extends C with constructs
  – Easy to write programs
• Lower-level “driver” API is available
  – Provides more control
  – Use multiple GPUs in the same application
  – Mix graphics & compute code
• Language bindings available
  – PyCUDA, Java, .NET
• Toolkit provides conveniences

Source: NVIDIA CUDA Architecture, Introduction and Overview

CUDA Toolkit

CUDA Architecture

• 1 or more streaming multiprocessors (“cores”)
• Thread blocks
  – Single Instruction, Multiple Thread (SIMT)
  – Hide latency by parallelism
• Memory hierarchy
  – Fermi GPUs can access system memory
• Primitives for
  – Thread synchronization
  – Atomic operations on memory

Source: The GPU Computing Era

Simple Example : Vector Addition

C/C++ – serial code:

    void VecAdd(const float *A, const float *B, float *C, int N) {
        for (unsigned int i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }
    VecAdd(A, B, C, N);

C/C++ with OpenMP – thread-level parallelism:

    void VecAdd(const float *A, const float *B, float *C, int N) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }
    VecAdd(A, B, C, N);

Vector Addition using CUDA

CUDA C – element-level parallelism:

    __global__ void VecAdd(const float *A, const float *B, float *C, int N) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < N)
            C[i] = A[i] + B[i];
    }

Invoking the function:

    // Allocate memory on the GPU
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);
    // Copy arrays to the GPU
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    // Invoke the function
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    // Copy result back to main memory
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
    // Free GPU memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

Compilation:

    # nvcc vectorAdd.cu -I ../../common/inc

GPU Programming Challenges

• Need high “occupancy” for best performance
• Extracting parallelism with limited resources
  – Limited registers
  – Limited shared memory
• Preferred approach
  – Small kernels
  – Multiple passes if needed
• Decompose problem into parallel pieces
  – Write once, scale & perform everywhere!

GPU Programming

• Use shared memory when possible
  – Cooperation between threads in a block
  – Reduce access to global memory
• Reduce data transfer over the bus
• It’s still a GPU!
  – Use textures to your advantage
  – Use vector data types if you can
• Watch out for GPU capability differences!

Enough Theory!

Demo Time & Let’s do some programming

Watch out for

• Portability of programs across GPUs
  – Capabilities vary from GPU to GPU
  – Memory usage
• Arithmetic differences in the result
• Pay careful attention to demos…

Resources

• CUDA
  – Tools on NVIDIA Developer Site: http://developer.nvidia.com/object/gpucomputing.html
  – CUDPP: http://code.google.com/p/cudpp/
• OpenCL
• Google Search!

The Future

• Better throughput
  – More GPU cores, scaling by Moore’s law
  – PCIe Gen 3
• Easier to program
• Arbitrary control and data access patterns

Questions ?

shree.shree@gmail.com
