Source: martinsa.at.ifi.uio.no/files/cuda_by_example.pdf

CUDA by Example

The University of Mississippi

Computer Science Seminar Series

[email protected]

SINTEF ICT

Department of Applied Mathematics

April 28, 2010

The GPU cudaPI CUDA MJPEG-encoding Summary

Outline

1 The GPU

2 cudaPI

3 CUDA MJPEG-encoding

4 Summary

Applied Mathematics 2/50

Moore's Law

Moore's Law states that "the number of transistors that can be placed inexpensively on an integrated circuit has doubled approximately every two years".

Why GPUs?

Three problems for serial computing:

The power wall

The von Neumann bottleneck

The Instruction Level Parallelism wall

We need more flops, and serial computing is no longer capable of delivering them. The clock frequency of new processors is no longer increasing.

One possible solution: Compute on multiple cores!

GPU history - 1981

GPU history - 1992

GPU history - 2001

GPU history - 2009

The GPU today

The graphics card

GPU - massively parallel processor (480 cores)

Memory: 1.5 GB (177.4 GB/s)

Connected to the host through a PCIe x16 Gen2 bus (8 GB/s)

Processor clock @ 1401 MHz

Figure: The NVIDIA GeForce GTX 480 card.

The GPU vs. the CPU

A lot more transistors for floating point operations!

But: Algorithms cannot simply be "translated" 1-to-1 from serial CPU-code to GPU-code.

You have to parallelize the whole or parts of your algorithm to achieve effective GPU-code.

A heterogeneous architecture

On a heterogeneous architecture several types of processors work together, each one solving the task for which it is best suited, e.g., the CPU in cooperation with the GPU.

The CPU is running the show, and the GPU launches programs on many cores in parallel, as requested by the CPU-program.

Do data parallel processing on the GPU, and high-level logic on the CPU.

CUDA - Compute Unified Device Architecture

Small set of extensions to the C language

Made to expose the GPU to the programmer, and allow easy access to the computational resources of the GPU

Does support some C++-features, like templates, and more are being added

A Fortran compiler is also available, developed by NVIDIA in cooperation with The Portland Group

The CUDA tools

What you need to get started:

1 CUDA driver (device driver for the graphics card)

2 CUDA toolkit (CUDA compiler, runtime library etc.)

3 CUDA SDK (software development kit, with code examples)

All this can be found at http://nvidia.com/cuda

OpenCL

Open Computing Language

Initially developed by Apple

Has a lot in common with CUDA

Released in December 2008 (version 1.0)

Our first example

A HelloWorld-type of application using CUDA.

Calculate decimals of π by using geometric observations.

This is not the best way to calculate π, but it serves us well as an example of parallel programming with CUDA.

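Let p be the ratio of the area of a circle with radius r to the area of its enclosing square with sides 2r:

$$p = \frac{\pi r^2}{4r^2} \qquad (1)$$

$$\pi = 4p \qquad (2)$$

We can calculate the ratio p to get an estimate of π:

1 By placing n dots within the square, some number k of these dots will lie inside the circle.

2 This gives us $p \approx \frac{k}{n}$ and $\pi \approx \frac{4k}{n}$.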
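
The estimate above can be tried out serially before involving the GPU. The following is a minimal C++ sketch, not code from the talk: the function name `estimatePi` is hypothetical, and we assume the dots are placed on a regular lattice over the unit square and tested against the quarter circle of radius 1, which has the same area ratio π/4.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the slides): estimates pi by testing a
// regular pointsPerSide x pointsPerSide lattice of the unit square against
// the quarter circle x^2 + y^2 < 1.
double estimatePi(int pointsPerSide) {
    assert(pointsPerSide > 0);
    const double step = 1.0 / pointsPerSide;
    std::int64_t inside = 0;  // k: number of points inside the circle
    for (int row = 0; row < pointsPerSide; ++row) {
        const double y = row * step;
        for (int col = 0; col < pointsPerSide; ++col) {
            const double x = col * step;
            if (x * x + y * y < 1.0) {
                ++inside;
            }
        }
    }
    // n: total number of points; pi is approximately 4k/n.
    const double total = static_cast<double>(pointsPerSide) * pointsPerSide;
    return 4.0 * static_cast<double>(inside) / total;
}
```

Calling `estimatePi` with a larger `pointsPerSide` refines the lattice, so the estimate should move closer to π, at the cost of more work.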

The CUDA grid

We set our block size to 16x16.

The block size is a difficult parameter to set, and it can have a big impact on performance.

Then we set our grid to be 8x8.

This gives us (16 ∗ 8) x (16 ∗ 8) = 16384 threads in total.

If we use a larger grid we get a finer resolution, and a better estimate of π.

The thread workload

We could let each thread represent one pixel, but this would not give each thread much of a workload.

Therefore, we let each thread calculate several points.

Setup on the CPU (host)

The dim3-variables are used later when we call our CUDA-program.

    const int width = 128;
    const int height = 128;

    // set the block size
    const dim3 blockSize(16, 16);

    // determine the size of the grid
    const size_t gridWidth = width / blockSize.x;
    const size_t gridHeight = height / blockSize.y;
    const dim3 gridSize(gridWidth, gridHeight);

    // number of points per thread (=m^2)
    const int m = 10;
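
The integer divisions above are exact because 128 is a multiple of 16. For sizes that are not a multiple of the block size, a common approach is to round the grid size up. The plain C++ sketch below illustrates the arithmetic only; `blocksNeeded` and `totalThreads` are hypothetical helpers that do not appear in the talk.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the slides): number of blocks of
// blockExtent threads needed to cover `extent` elements, rounded up.
std::size_t blocksNeeded(std::size_t extent, std::size_t blockExtent) {
    assert(blockExtent > 0);
    return (extent + blockExtent - 1) / blockExtent;
}

// Total number of threads launched for a 2D grid of 2D blocks.
std::size_t totalThreads(std::size_t width, std::size_t height,
                         std::size_t blockWidth, std::size_t blockHeight) {
    const std::size_t gridWidth = blocksNeeded(width, blockWidth);
    const std::size_t gridHeight = blocksNeeded(height, blockHeight);
    return (gridWidth * blockWidth) * (gridHeight * blockHeight);
}
```

For a 128x128 domain with 16x16 blocks this gives an 8x8 grid and 16384 threads, matching the numbers in the talk. With a rounded-up grid, some threads fall outside the domain, so a kernel would need a bounds check before writing its result.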

Allocating device memory

First we allocate memory on the device (the GPU), which we will use to save our intermediate results to.

cudaMallocPitch allocates at least width*height*sizeof(float) bytes of linear memory on the device. It may pad each row, and returns the resulting row size in bytes (the pitch) in d_pitch.

The padding ensures efficient memory access when the address is updated from row to row.

    float* d_results;
    size_t d_pitch;

    // allocate device memory
    CUDA_SAFE_CALL(cudaMallocPitch((void**)&d_results,
                                   &d_pitch,
                                   width * sizeof(float),
                                   height));
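
To illustrate what a pitch is, the sketch below mimics pitched addressing on the host in plain C++. It is an illustration under assumptions, not a description of how the CUDA runtime works internally: the alignment value passed in, `roundUpToAlignment`, and `elementAt` are all hypothetical, and the pitch actually chosen by cudaMallocPitch depends on the device.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical: round a row width in bytes up to a multiple of `alignment`.
std::size_t roundUpToAlignment(std::size_t rowBytes, std::size_t alignment) {
    assert(alignment > 0);
    return ((rowBytes + alignment - 1) / alignment) * alignment;
}

// Address element (row, col) in a pitched buffer of floats. The pitch is in
// bytes, so the row offset is applied to a char pointer before casting back.
float* elementAt(void* base, std::size_t pitchBytes,
                 std::size_t row, std::size_t col) {
    char* rowStart = static_cast<char*>(base) + row * pitchBytes;
    return reinterpret_cast<float*>(rowStart) + col;
}
```

The same byte-wise row offset appears in the kernel as `(float*) ((char*) results + j * pitch)`.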

Allocating host memory

We also need to allocate memory on the host, so we have somewhere to download our intermediate results to.

    // allocate host memory
    float* h_results = new float[width * height];

The CUDA kernel

The __global__-keyword defines a CUDA kernel.

    __global__ void cudaPIkernel(float* results, size_t pitch) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;

        float dx = 1.0f / (float) (gridDim.x * blockDim.x);
        float dy = 1.0f / (float) (gridDim.y * blockDim.y);

        int inside = 0;
        ...

The CUDA kernel cont'd

Calculate whether the points (x, y) are inside or outside the unit circle.

        ...
        for (int l = 0; l < m; ++l) {
            float y = j / (float) (gridDim.y * blockDim.y);
            y += dy * (l / (float) m);
            for (int k = 0; k < m; ++k) {
                float x = i / (float) (gridDim.x * blockDim.x);
                x += dx * (k / (float) m);
                if (x * x + y * y < 1.0f) {
                    // point is inside circle!
                    ++inside;
                }
            }
        }

        // the pitch is in bytes, so step to row j through a char pointer
        float* elementPtr = (float*) ((char*) results + j * pitch);
        elementPtr[i] = inside;
    }

Calling the CUDA kernel

Now we know enough to write our call to the CUDA-kernel.

    // run kernel
    cudaPIkernel<<<gridSize, blockSize>>>(d_results, d_pitch);
    CUT_CHECK_ERROR("cudaPIkernel");

Device to host copy

Now we copy the intermediate results from the GPU to the CPU.

    // fetch results from device
    CUDA_SAFE_CALL(cudaMemcpy2D(h_results, width * sizeof(float),
                                d_results, d_pitch,
                                width * sizeof(float),
                                height,
                                cudaMemcpyDeviceToHost));
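
Conceptually, a 2D copy moves a number of rows of equal width, while source and destination each advance by their own pitch. The host-only C++ sketch below illustrates that idea with `std::memcpy`; `copy2D` is a hypothetical stand-in whose argument order mirrors the call above (destination first), and it does not touch the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical host-only stand-in for a 2D copy: copies `height` rows of
// `widthBytes` bytes each. Source and destination advance by their own
// pitch, so row padding is skipped rather than copied.
void copy2D(void* dst, std::size_t dstPitch,
            const void* src, std::size_t srcPitch,
            std::size_t widthBytes, std::size_t height) {
    assert(widthBytes <= dstPitch && widthBytes <= srcPitch);
    char* dstRow = static_cast<char*>(dst);
    const char* srcRow = static_cast<const char*>(src);
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(dstRow, srcRow, widthBytes);
        dstRow += dstPitch;
        srcRow += srcPitch;
    }
}
```

In the call above, the host buffer is densely packed, so its pitch is simply `width * sizeof(float)`, while the device buffer uses the pitch returned by cudaMallocPitch.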

The host code

Lastly, we calculate π from the intermediate results we fetched from the device.

    // sum up results from each CUDA-thread
    float k = 0.0;
    for (int i = 0; i < width * height; ++i) {
        k += h_results[i];
    }

    // calculate final result
    int n = (width * m) * (height * m);
    float pi = 4 * ((float) k / n);
    cout << "PI is (approximately): " << pi << endl;
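
To check the kernel logic without a GPU, the per-thread work and the host-side summation can be emulated serially. This is a sketch under assumptions, not the presenter's code: `countInside` and `emulateCudaPi` are hypothetical names, and we assume the launch covers exactly width x height threads, so that gridDim.x * blockDim.x equals width (and likewise for height).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for one CUDA thread (i, j) of the kernel above:
// counts how many of its m x m sample points fall inside the unit circle.
int countInside(int i, int j, int width, int height, int m) {
    const float dx = 1.0f / static_cast<float>(width);
    const float dy = 1.0f / static_cast<float>(height);
    int inside = 0;
    for (int l = 0; l < m; ++l) {
        const float y = j * dy + dy * (l / static_cast<float>(m));
        for (int k = 0; k < m; ++k) {
            const float x = i * dx + dx * (k / static_cast<float>(m));
            if (x * x + y * y < 1.0f) {
                ++inside;
            }
        }
    }
    return inside;
}

// Runs every "thread" serially, then sums the per-thread counts as the
// host code does.
float emulateCudaPi(int width, int height, int m) {
    assert(width > 0 && height > 0 && m > 0);
    std::vector<int> results(static_cast<std::size_t>(width) * height);
    for (int j = 0; j < height; ++j) {
        for (int i = 0; i < width; ++i) {
            results[static_cast<std::size_t>(j) * width + i] =
                countInside(i, j, width, height, m);
        }
    }
    long long k = 0;
    for (int count : results) {
        k += count;
    }
    const long long n = static_cast<long long>(width) * m *
                        static_cast<long long>(height) * m;
    return 4.0f * static_cast<float>(k) / static_cast<float>(n);
}
```

Calling `emulateCudaPi(128, 128, 10)` uses the same sizes as the setup code, so its result can serve as a reference value when testing the GPU version.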

Optimization: Allocating host memory using the CUDA API

We should also allocate memory on the host using the CUDA API.

This ensures fast memory transfers between the device and the host.

cudaMallocHost allocates page-locked memory that is accessible to the device.

    float* h_results;

    // allocate host memory
    CUDA_SAFE_CALL(cudaMallocHost((void**)&h_results,
                                  width * height * sizeof(float)));

Demo

JPEG

A lossy compression method for photographic images

Motion-JPEG

A video format where each frame is individually compressed as a JPEG image

Uses no inter-frame compression (motion compensation)

Used by many portable devices, e.g., webcams and other devices that stream video

CPU-version of the algorithm

We already have a CPU-version of the algorithm for M-JPEG encoding.

It is often the case that you want to accelerate some already existing algorithm.

Understanding the algorithm in detail is important.

M-JPEG encoding

Figure: The M-JPEG encoding steps (split up image into macroblocks; RGB -> YUV; DCT; quantize; zig-zag image, truncate and do Huffman).

Split the image into macroblocks of size 8x8.

Subtract 128 to shift the values so that they are centered around zero.

Perform a two-dimensional DCT on each macroblock, converting them into a frequency-domain representation.

Reduce the amount of information in the high frequency components, by simply dividing each component by a predefined constant.

The human eye is not so good at distinguishing the exact strength of a high frequency brightness variation.

Go through the macroblock in a zigzag-pattern.

Remove the trailing zeros.

Huffman-encoding is a lossless compression algorithm which further compresses the result.
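
The zigzag traversal and the removal of trailing zeros can be sketched in plain C++ as follows. This is an illustration, not the encoder from the talk: `zigzagOrder` and `zigzagAndTruncate` are hypothetical names, the block is assumed to be stored row-major as 64 quantized values, and the Huffman step is left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds the zigzag visiting order for an 8x8 block: entry n holds the
// row-major index (row * 8 + col) of the n-th coefficient visited.
std::array<int, 64> zigzagOrder() {
    std::array<int, 64> order{};
    std::size_t n = 0;
    for (int diagonal = 0; diagonal <= 14; ++diagonal) {
        const int firstRow = diagonal < 8 ? 0 : diagonal - 7;
        const int lastRow = diagonal < 8 ? diagonal : 7;
        if (diagonal % 2 == 1) {
            // Odd anti-diagonals run downwards to the left.
            for (int row = firstRow; row <= lastRow; ++row) {
                order[n++] = row * 8 + (diagonal - row);
            }
        } else {
            // Even anti-diagonals run upwards to the right.
            for (int row = lastRow; row >= firstRow; --row) {
                order[n++] = row * 8 + (diagonal - row);
            }
        }
    }
    return order;
}

// Reorders a quantized block in zigzag order and drops the trailing zeros.
std::vector<std::int16_t> zigzagAndTruncate(
        const std::array<std::int16_t, 64>& block) {
    const std::array<int, 64> order = zigzagOrder();
    std::vector<std::int16_t> scanned(64);
    for (std::size_t n = 0; n < 64; ++n) {
        scanned[n] = block[static_cast<std::size_t>(order[n])];
    }
    while (!scanned.empty() && scanned.back() == 0) {
        scanned.pop_back();
    }
    return scanned;
}
```

Because quantization tends to zero out the high frequency coefficients, and the zigzag order visits those last, the truncated sequence is often much shorter than 64 values.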

Mapping an algorithm to the GPU

1 Identify which parts are most computationally demanding (by profiling)

2 Investigate if all or some of these parts can be done in a parallel fashion

3 Write CUDA-kernels for these parts, which are called from the host code

4 Repeat

Profiling the CPU-version

By running a profiler like gprof (remember to compile your code with -pg), we can determine which functions are dominating our program's runtime.

We find that the DCT accounts for well over 90% of the runtime.

Overview

Figure: Overview CPU/GPU. CPU: read frame from file; split up image into macroblocks; zigzag, truncate and Huffman; write to file. GPU: DCT; quantize.

DCT on the CPU

Let us take a look at the CPU-version's DCT.

For each macroblock, the following code is executed:

    for (int u = 0; u < 8; ++u) {
        for (int v = 0; v < 8; ++v) {
            float dct = 0;
            for (int j = 0; j < 8; ++j) {
                for (int i = 0; i < 8; ++i) {
                    float coeff =
                        in_data[(y + j) * width + (x + i)] - 128.0f;
                    dct += coeff
                        * (float) (cos((2 * i + 1) * u * PI / 16.0f)
                        * cos((2 * j + 1) * v * PI / 16.0f));
                }
            }

            float a1 = !u ? ISQRT2 : 1.0f;
            float a2 = !v ? ISQRT2 : 1.0f;

            /* Scale according to normalizing function */
            dct *= a1 * a2 / 4.0f;
            ...

Quantization on the CPU

...and the quantization:

            ...
            /* Quantize */
            out_data[(y + v) * width + (x + u)] =
                (int16_t) (floor(0.5f + dct /
                    (float) (quantization[v * 8 + u])));
        }
    }
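
The two fragments above can be combined into one self-contained function for a single macroblock, which is convenient as a CPU reference when testing a GPU port. This is a sketch under assumptions rather than the original program: `dctQuantizeBlock` is a hypothetical name, the block is passed as 64 row-major floats instead of being addressed inside a frame, and `kPi` and `kInvSqrt2` stand in for the PI and ISQRT2 constants whose definitions are not shown in the talk.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical self-contained variant of the loops above for a single 8x8
// block stored row-major; kPi and kInvSqrt2 stand in for PI and ISQRT2.
void dctQuantizeBlock(const float in[64], const float quantization[64],
                      std::int16_t out[64]) {
    const float kPi = 3.14159265358979f;
    const float kInvSqrt2 = 0.70710678118655f;
    for (int v = 0; v < 8; ++v) {
        for (int u = 0; u < 8; ++u) {
            float dct = 0.0f;
            for (int j = 0; j < 8; ++j) {
                for (int i = 0; i < 8; ++i) {
                    // Shift samples so they are centered around zero.
                    const float coeff = in[j * 8 + i] - 128.0f;
                    dct += coeff
                         * std::cos((2 * i + 1) * u * kPi / 16.0f)
                         * std::cos((2 * j + 1) * v * kPi / 16.0f);
                }
            }
            // Scale according to the normalizing function.
            const float a1 = (u == 0) ? kInvSqrt2 : 1.0f;
            const float a2 = (v == 0) ? kInvSqrt2 : 1.0f;
            dct *= a1 * a2 / 4.0f;
            // Quantize: divide and round to the nearest integer.
            out[v * 8 + u] = static_cast<std::int16_t>(
                std::floor(0.5f + dct / quantization[v * 8 + u]));
        }
    }
}
```

A uniform block is a handy sanity check: all of its energy ends up in the coefficient at u = v = 0, and every other output is zero after quantization.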

The GPU compute grid

Figure: One video frame and the GPU compute grid. (Labels: video frame and GPU compute grid; macroblock and GPU compute block; pixel and thread.)
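
The mapping suggested by the figure labels can be walked through on the CPU. The sketch below assumes one compute block per 8x8 macroblock and one thread per pixel, using the same index arithmetic as the kernel later in the talk (x = bx * 8 + tx, y = by * 8 + ty); `visitPixels` is a hypothetical helper, and the frame dimensions are assumed to be multiples of 8.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU walk over the compute grid: one block per 8x8 macroblock,
// one thread per pixel. Returns how often each pixel was visited.
std::vector<int> visitPixels(int width, int height) {
    assert(width > 0 && height > 0);
    assert(width % 8 == 0 && height % 8 == 0);
    std::vector<int> visits(static_cast<std::size_t>(width) * height, 0);
    for (int by = 0; by < height / 8; ++by) {        // block row
        for (int bx = 0; bx < width / 8; ++bx) {     // block column
            for (int ty = 0; ty < 8; ++ty) {         // thread row
                for (int tx = 0; tx < 8; ++tx) {     // thread column
                    const int x = bx * 8 + tx;  // pixel column in the frame
                    const int y = by * 8 + ty;  // pixel row in the frame
                    ++visits[static_cast<std::size_t>(y) * width + x];
                }
            }
        }
    }
    return visits;
}
```

Every pixel is visited exactly once, which is what lets each thread write its own output value without coordinating with other threads.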

Quantization matrices

We store the three quantization matrices (for Y, U and V) in constant memory on the device.

    /**
     * Quantization matrices
     */
    __constant__ float yquanttbl[] =
    {
         16,  11,  12,  14,  12,  10,  16,  14,
         13,  14,  18,  17,  16,  19,  24,  40,
         26,  24,  22,  22,  24,  49,  35,  37,
         29,  40,  58,  51,  61,  30,  57,  51,
         56,  55,  64,  72,  92,  78,  64,  68,
         87,  69,  55,  56,  80, 109,  81,  87,
         95,  98, 103, 104, 103,  62,  77, 113,
        121, 112, 100, 120,  92, 101, 103,  99
    };
    ...

The CUDA-kernel

Reads one macroblock into shared memory.

__syncthreads() is a barrier, which makes threads wait until all threads in the block have reached this point in the kernel.

    __global__ void cudaDct(float* output, size_t output_pitch,
                            float* input, size_t input_pitch) {
        const int bx = blockIdx.x;
        const int by = blockIdx.y;
        const int tx = threadIdx.x;
        const int ty = threadIdx.y;

        const int x = (bx * 8) + tx;
        const int y = (by * 8) + ty;

        __shared__ float macroblock[8][8];

        float* inputPtr = (float*) ((char*) input + y * input_pitch);
        macroblock[ty][tx] = inputPtr[x];

        __syncthreads();
        ...

The CUDA-kernel cont'd

__cosf is an intrinsic function (faster, but less accurate, than cosf).

        ...
        float dct = 0;
        for (int j = 0; j < 8; ++j) {
            for (int i = 0; i < 8; ++i) {
                float coeff = macroblock[j][i] - 128.0f;
                dct += coeff * __cosf((2 * i + 1) * tx * PI / 16.0f)
                             * __cosf((2 * j + 1) * ty * PI / 16.0f);
            }
        }

        float a1 = !tx ? ISQRT2 : 1.0f;
        float a2 = !ty ? ISQRT2 : 1.0f;

        /* Scale according to normalizing function */
        dct *= a1 * a2 / 4.0f;

        /* Quantize */
        float* outputPtr = (float*) ((char*) output + y * output_pitch);
        outputPtr[x] = dct / yquanttbl[ty * 8 + tx];
    }

Moving data between host and device

Using cudaMemcpy2D, similar to the cudaPI-example:

    // copy image to device (destination and its pitch come first)
    CUDA_SAFE_CALL(cudaMemcpy2D(d_in_data, d_pitch_in,
                                h_in_data, h_pitch_in,
                                width * sizeof(float),
                                height,
                                cudaMemcpyHostToDevice));

    // launch kernel
    cudaDct<<<gridSize, blockSize>>>(d_out_data, d_pitch_out,
                                     d_in_data, d_pitch_in);
    CUT_CHECK_ERROR("cudaDct");

    // copy image back to host
    CUDA_SAFE_CALL(cudaMemcpy2D(h_out_data, h_pitch_out,
                                d_out_data, d_pitch_out,
                                width * sizeof(float),
                                height,
                                cudaMemcpyDeviceToHost));

Optimizations

Run the kernel on four consecutive macroblocks to achieve coalescing.

Optimizations cont'd

Threaded I/O on host

Double-buffering input and output buffers on the CPU

Makes it possible to do file IO and GPU memory transfers simultaneously

Can be implemented using threading on the CPU, e.g., boost::thread

Could also run the Huffman-encoding in a separate thread, since this is a compute intensive algorithm

Optimizations cont'd

Streams

Utilizing CUDA streams

Uses asynchronous memory transfers

Allows overlapping of uploading, kernel invocation, and downloading

Is GPU-computing worth the effort?

It takes time to learn a new technology and a new platform.

It will take time to rewrite your existing algorithms.

But: You will get significant speed ups (5-100X) of algorithms that benefit from parallelism.

Once you have your parallel algorithms, they will scale on newer generations of parallel hardware.

CUDA Documentation

Programming Guide

Reference Manual

Best Practices Guide

Pay special attention to the "Performance Guidelines"-section of the programming guide to get the most performance out of your CUDA-apps

http://gpgpu.org

Good luck with your CUDA-programming, and thank you :-)

Contact: [email protected]

http://martinsa.at.ifi.uio.no