gpu tutorial: how to program for gpus

25
High Performance Computing with GPUs: An Introduction Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010. LSST All Hands Meeting 2010, Tucson, AZ GPU Tutorial: How To Program for GPUs Krešimir Ćosić 1 , (1) University of Split, Croatia

Upload: didier

Post on 05-Feb-2016

82 views

Category:

Documents


1 download

DESCRIPTION

GPU Tutorial: How To Program for GPUs. Kre šimir Ć osi ć 1 , (1) University of Split, Croatia. TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A A A A A A A A A A. Overview. CUDA Hardware architecture Programming model Convolution on GPU. CUDA. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: GPU Tutorial: How To Program for GPUs

High Performance Computing with GPUs: An Introduction Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010.

LSST All Hands Meeting 2010, Tucson, AZ

GPU Tutorial: How To Program for GPUsKrešimir Ćosić1,

(1) University of Split, Croatia

Page 2: GPU Tutorial: How To Program for GPUs

Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010.

LSST All Hands Meeting 2010, Tucson, AZ

High Performance Computing with GPUs: An Introduction

Overview

CUDA Hardware architecture

Programming model Convolution on GPU

Page 3: GPU Tutorial: How To Program for GPUs

Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010.

LSST All Hands Meeting 2010, Tucson, AZ

High Performance Computing with GPUs: An Introduction

CUDA

‘Compute Unified Device Architecture’– Hardware and software architecture for issuing

and managing computations on GPU• Massively parallel architecture

– over 8000 threads is common• C for CUDA (C++ for CUDA)

– C/C++ language with some additions and restrictions

• Enables GPGPU – ‘General Purpose Computing on GPUs’

Page 4: GPU Tutorial: How To Program for GPUs

Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010.

LSST All Hands Meeting 2010, Tucson, AZ

High Performance Computing with GPUs: An Introduction

GPU: a multithreaded coprocessor

SMstreaming multiprocessor

32xSP (or 16, 48 or more)

Fast local ‘shared memory’(shared between SPs)

16 KiB (or 64 KiB)

GLOBAL MEMORY(ON DEVICE)

SMSP SP SP SP

SP SP SP SP

SP SP SP SP

SP SP SP SP

SHARED MEMORY

SP: scalar processor ‘CUDA core’

Executes one thread

Page 5: GPU Tutorial: How To Program for GPUs

Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010.

LSST All Hands Meeting 2010, Tucson, AZ

High Performance Computing with GPUs: An Introduction

GDDR memory

512 MiB - 6 GiB

GPU: SMs

o 30xSM on GT200, o 14xSM on Fermi

For example, GTX 480: 14 SMs x 32 cores

= 448 cores on a GPU

GLOBAL MEMORY(ON DEVICE)

SMSP SP SP SP

SP SP SP SP

SP SP SP SP

SP SP SP SP

SHARED MEMORY

Page 6: GPU Tutorial: How To Program for GPUs

Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010.

LSST All Hands Meeting 2010, Tucson, AZ

High Performance Computing with GPUs: An Introduction

How To Program For GPUs

Parallelization Decomposition to threads

Memory shared memory, global memory

GLOBAL MEMORY(ON DEVICE)

SMSP SP SP SP

SP SP SP SP

SP SP SP SP

SP SP SP SP

SHARED MEMORY

Page 7: GPU Tutorial: How To Program for GPUs

Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010.

LSST All Hands Meeting 2010, Tucson, AZ

High Performance Computing with GPUs: An Introduction

Important Things To Keep In Mind

Avoid divergent branches Threads of single SM must be

executing the same code Code that branches heavily and

unpredictably will execute slowly Threads shoud be independent as

much as possible Synchronization and

communication can be done efficiently only for threads of single multiprocessor

SMSP SP SP SP

SP SP SP SP

SP SP SP SP

SP SP SP SP

SHARED MEMORY

Page 8: GPU Tutorial: How To Program for GPUs

Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010.

LSST All Hands Meeting 2010, Tucson, AZ

High Performance Computing with GPUs: An Introduction

How To Program For GPUs

Parallelization Decomposition to threads

Memory shared memory, global memory

Enormous processing power Avoid divergence

Thread communication Synchronization, no

interdependencies GLOBAL MEMORY(ON DEVICE)

SMSP SP SP SP

SP SP SP SP

SP SP SP SP

SP SP SP SP

SHARED MEMORY

Page 9: GPU Tutorial: How To Program for GPUs

Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010.

LSST All Hands Meeting 2010, Tucson, AZ

High Performance Computing with GPUs: An Introduction

Programming model

Page 10: GPU Tutorial: How To Program for GPUs

Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010.

LSST All Hands Meeting 2010, Tucson, AZ

High Performance Computing with GPUs: An Introduction

Thread blocks

Threads grouped in thread blocks 128, 192 or 256

threads in a block

• One thread block executes on one SM– All threads sharing the ‘shared memory’– 32 threads are executed simultaneously (‘warp’)

BLOCK 1

THREAD(0,0)

THREAD(0,1)

THREAD(0,2)

THREAD(1,0)

THREAD(1,1)

THREAD(1,2)

Page 11: GPU Tutorial: How To Program for GPUs

Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010.

LSST All Hands Meeting 2010, Tucson, AZ

High Performance Computing with GPUs: An Introduction

Thread blocks

Blocks execute on SMs - execute in parallel - execute independently!

BLOCK 1

THREAD(0,0)

THREAD(0,1)

THREAD(0,2)

THREAD(1,0)

THREAD(1,1)

THREAD(1,2)

Grid

BLOCK 0 BLOCK 1 BLOCK 2

BLOCK 3 BLOCK 4 BLOCK 5

BLOCK 6 BLOCK 7 BLOCK 8

• Blocks form a GRID• Thread ID

unique within block• Block ID

unique within grid

Page 12: GPU Tutorial: How To Program for GPUs

Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010.

LSST All Hands Meeting 2010, Tucson, AZ

High Performance Computing with GPUs: An Introduction

Code that executes on GPU: Kernels

Kernel - a simple C function - executes on GPU - Executes in parallel

as many times as there are threads

The keyword __global__ tells the compiler to make a function a kernel (and compile it for the GPU, instead of the CPU)

Page 13: GPU Tutorial: How To Program for GPUs

Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010.

LSST All Hands Meeting 2010, Tucson, AZ

High Performance Computing with GPUs: An Introduction

Convolution

To get one pixel of output image:

- multiply (pixelwise) mask with image at corresponding position

- sum the products

Page 14: GPU Tutorial: How To Program for GPUs

Kernel - Example code pt 1

__global__ void Convolve( float* img, int imgW, int imgH, float* filt, int filtW, int filtH, float* out){ const int nThreads = blockDim.x * gridDim.x; const int idx = blockIdx.x * blockDim.x + threadIdx.x;

const int outW = imgW – filtW + 1; const int outH = imgH – filtH + 1; const int nPixels = outW * outH;

for(int curPixel = idx; curPixel < nPixels; curPixel += nThreads)

{ int x = curPixel % outW; int y = curPixel / outW;

float sum = 0; for (int filtY = 0; filtY < filtH; filtY++) for (int filtX = 0; filtX < filtW; filtX++) { int sx = x + filtX; int sy = y + filtY; sum+= img[sy*imgW + sx] * filt[filtY*filtW + filtX]; } out[y * outW + x] = sum; }}

for (int y = 0; y < outH; y++) for (int x = 0; x < outW; x++) {

Page 15: GPU Tutorial: How To Program for GPUs

Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010.

LSST All Hands Meeting 2010, Tucson, AZ

High Performance Computing with GPUs: An Introduction

Setup and data transfer

cudaMemcpy transfer data to and from GPU (global memory)

cudaMalloc Allocate memory on GPU (global memory)

GPU is the ‘device’, CPU is the ‘host’

Kernel call syntax

Page 16: GPU Tutorial: How To Program for GPUs

Examle setup and data transfer 1

int main() { ... float* img ... int imgW, imgH ...

float* imgGPU; cudaMalloc((void**)& imgGPU, imgW * imgH * sizeof(float)); cudaMemcpy( imgGPU, // Destination img, // Source imgW * imgH * sizeof(float), // Size in bytes cudaMemcpyHostToDevice // Direction

);

float* filter ... int filterW, filterH ...

float* filterGPU; cudaMalloc((void**)& filterGPU, filterW * filterH * sizeof(float)); cudaMemcpy( filterGPU, // Destination filter, // Source filterW * filterH * sizeof(float), // Size in bytes cudaMemcpyHostToDevice // Direction

);

Page 17: GPU Tutorial: How To Program for GPUs

Examle setup and data transfer 2

int resultW = imgW – filterW + 1; int resultH = imgH – filterH + 1; float* result = (float*) malloc(resultW * resultH * sizeof(float)); float* resultGPU; cudaMalloc((void**) &resultGPU, resultW * resultH * sizeof(float)); /* Call the GPU kernel */ dim3 block(128); dim3 grid(30); Convolve<<<grid, block>>> ( imgGPU, imgW, imgH, filterGPU, filterW, filterH, resultGPU ); cudaMemcpy( result, // Desination resultGPU, // Source resultW * resultH * sizeof(float), // Size in bytes cudaMemcpyDeviceToHost // Direction );

cudaThreadExit(); ... }

Page 18: GPU Tutorial: How To Program for GPUs

Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010.

LSST All Hands Meeting 2010, Tucson, AZ

High Performance Computing with GPUs: An Introduction

Page 19: GPU Tutorial: How To Program for GPUs

Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010.

LSST All Hands Meeting 2010, Tucson, AZ

High Performance Computing with GPUs: An Introduction

Speedup

Linear combination of 3 filters sized 15x15 Image size: 2k x 2k

CPU: Core 2 @ 2.0 GHz (1 core) GPU: Tesla S1070 (GT200 )

30xSM, 240 CUDA cores, 1.3 GHz

CPU: 6.58 s 0.89 Mpixels/s GPU: 0.21 s 27.99 Mpixels/s

31 times faster!

Page 20: GPU Tutorial: How To Program for GPUs

Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010.

LSST All Hands Meeting 2010, Tucson, AZ

High Performance Computing with GPUs: An Introduction

Page 21: GPU Tutorial: How To Program for GPUs

Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010.

LSST All Hands Meeting 2010, Tucson, AZ

High Performance Computing with GPUs: An Introduction

CUDA capabilities

1.0 GeForce 8800 Ultra/GTX/GTS

1.1 GeForce 9800 GT, GTX, GTS 250+ atomic instructions …

1.2 GeForce GT 220

1.3 Tesla S1070, C1060, GeForce GTX 275,285+ double precision (slow) …

2.0 Tesla C2050, GeForce GTX 480, 470+ ECC, L1 and L2 cache, faster IMUL, faster atomics,

faster double precision on Tesla cards …

Page 22: GPU Tutorial: How To Program for GPUs

Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010.

LSST All Hands Meeting 2010, Tucson, AZ

High Performance Computing with GPUs: An Introduction

CUDA essentials

developer.nvidia.com/object/cuda_3_1_downloads.html

Download Driver Toolkit (compiler nvcc) SDK (examples) (recommended)

CUDA Programmers guide

Page 23: GPU Tutorial: How To Program for GPUs

Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010.

LSST All Hands Meeting 2010, Tucson, AZ

High Performance Computing with GPUs: An Introduction

Other tools

‘Emulator’ Executes on CPU Slow

Simple profiler cuda-gdb (Linux) Paralel Nsight (Vista)

simple profiler on-device debugger

Page 24: GPU Tutorial: How To Program for GPUs

Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010.

LSST All Hands Meeting 2010, Tucson, AZ

High Performance Computing with GPUs: An Introduction

...

...

Page 25: GPU Tutorial: How To Program for GPUs

Krešimir Ćosić <[email protected]>, Thursday, August 12th, 2010.

LSST All Hands Meeting 2010, Tucson, AZ

High Performance Computing with GPUs: An Introduction

Logical thread hierarchy

Thread ID – unique within block Block ID – unique within grid

To get globally unique thread ID: Combine block ID and thread ID

Threads can access both shared and global memory