Page 1

GPU Computing Overview

UCAR SEA 2013

Tony Scudiero

Page 2

© NVIDIA

Outline

Evolution of GPU Computing

Parallelism

Heterogeneous Computing Concepts

Using GPUs to Accelerate Applications

Accelerated Libraries

Compiler Directives

Programming Languages

Page 3

Once upon a time . . .

GPU: Graphics Processing Unit

Originated as specialized hardware for 3D games.

Why a different processor?

Rendering is the most computationally intense part of a game.

CPU is not an ideal device for computer graphics rendering

Freed CPU allows more complex AI, dynamic world generation, realistic dynamics.

Quake Software Rendering

Quake Hardware Rendering

Page 4

Programmable GPUs

Graphics research diverged from OpenGL pipeline.

2001 - Programmable vertex and fragment shaders (GeForce3)

OpenGL 2.0 (2004) updates pipeline

GPGPU era begins

NVIDIA GeForce3

Page 5

GPGPU: “Because It’s There”

Warning: Nerdy Movie Reference

Page 6

GPU Computing Era: G80 and Fermi

G80

Unified Shader Architecture

Double Precision

CUDA Introduced

Fermi (GF100)

ECC

Enhanced Double precision

Memory hierarchy and expanded caching

Page 7

12 years later

NVIDIA Kepler

1.31 TFLOPS double precision

3.95 TFLOPS single precision

250 GB/s memory bandwidth

2,688 Functional Units (cores)

~= #1 on Top500 in 1997

Can still play video games

NVIDIA GK110 - Kepler

NVIDIA GeForce Titan

Page 8

Not A Video Game

18,688 NVIDIA GK110 + AMD Interlagos

Rpeak: 27,112.5 TFLOPS

Top500 Rank: #1 Nov’12

Page 9

Science Uses GPUs

Medical Imaging: U of Utah

Molecular Dynamics: U of IL, Urbana

Matlab Computing: AccelerEyes

Astrophysics: RIKEN

Financial Simulation: Oxford

Linear Algebra: Universidad Jaime

3D Ultrasound: Techniscan

Quantum Chemistry: U of Illinois, Urbana

Gene Sequencing: U of Maryland

Video Transcoding: Elemental Tech

Page 10

Outline

Evolution of GPU Computing

Parallelism

Heterogeneous Computing Concepts

Using GPUs to Accelerate Applications

Accelerated Libraries

Compiler Directives

Programming Languages

Page 11

Latency vs. Throughput

Throughput

Work per unit time

Matrix operations, FFTs, signal processing, data analytics, rendering/visualization

Latency

Time per unit work

Databases, web servers, transaction servers, Internet infrastructure.

GPU CPU

Page 12

A GPU is a Throughput Optimized Processor

GPU Achieves high throughput by parallel execution

2,688 cores (GK110)

Millions of resident threads

GPU threads are much lighter weight than CPU threads such as pthreads

Processing in Parallel is how GPU achieves performance

Page 13

A GPU is a Throughput Optimized Processor

GPU Achieves high throughput by parallel execution

2,688 cores (GK110)

Millions of resident threads

GPU threads are much lighter weight than CPU threads such as pthreads

Processing in parallel is how HPC achieves performance

Page 14

Parallelism

Parallel Poll

Ideal Serial Processor: IPC 1.0, 8 GHz, perfect pipeline: 8 GFLOPS/core

Yellowstone: 1.5 PFLOPS @ 2.6 GHz (#13 on Top500 Nov’12!)

4,512 Nodes

2 Sockets per node

8 cores per socket

Each Core: 2 x 256-bit-wide pipelined FP units @ 2.6 GHz

Theoretical Peak: 20.8 GFLOPS / core

Levels of Parallelism

Page 15

Parallelism

Exposing Parallelism is . . .

. . . a necessary part of GPU Computing

. . . essential to utilize current HPC systems

. . . going to be more important on future systems

Page 16

Outline

Evolution of GPU Computing

Parallelism

Heterogeneous Computing Concepts

Using GPUs to Accelerate Applications

Accelerated Libraries

Compiler Directives

Programming Languages

Page 17

Heterogeneous Computing

The CPU and its memory (host)

The GPU and its memory (device)

CPU GPU

Page 18

Heterogeneous Computing

#include <iostream>
#include <algorithm>
#include <cstdlib>

using namespace std;

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

// parallel fn: runs on the GPU, one thread per output element
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

void fill_ints(int *x, int n) {
    fill_n(x, n, 1);
}

int main(void) {
    // serial code
    int *in, *out;       // host copies of in, out
    int *d_in, *d_out;   // device copies of in, out
    int size = (N + 2*RADIUS) * sizeof(int);

    // Alloc space for host copies and setup values
    in = (int *)malloc(size); fill_ints(in, N + 2*RADIUS);
    out = (int *)malloc(size); fill_ints(out, N + 2*RADIUS);

    // Alloc space for device copies
    cudaMalloc((void **)&d_in, size);
    cudaMalloc((void **)&d_out, size);

    // Copy to device
    cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    // parallel code: launch stencil_1d() kernel on GPU
    stencil_1d<<<N/BLOCK_SIZE,BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    // serial code: copy result back to host
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
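A serial reference is a convenient way to check what the stencil kernel on this slide should produce. The sketch below is plain host C++ and is not part of the talk. The name `stencil_1d_reference` is illustrative, and it assumes the same layout as the slide: the input carries `radius` halo elements on each side.

```cpp
#include <vector>

// Serial reference for a 1D stencil: out[i] = sum of in[i-R .. i+R].
// `in` holds n + 2*radius values (halo on both sides); out[i] lines up
// with in[i + radius], mirroring the d_in + RADIUS offset at launch.
std::vector<int> stencil_1d_reference(const std::vector<int>& in, int n, int radius) {
    std::vector<int> out(n, 0);
    for (int i = 0; i < n; ++i) {
        int result = 0;
        for (int offset = -radius; offset <= radius; ++offset)
            result += in[i + radius + offset];
        out[i] = result;
    }
    return out;
}
```

With the slide's all-ones input and a radius of 3, every output element is 7 (the sum over 2*3 + 1 neighbours), which gives a quick sanity check for the GPU result.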

Page 19

GPU Computing

Application Code

GPU: Use GPU to Parallelize Compute-Intensive Functions

+

CPU: Rest of Sequential CPU Code

Page 20

Simple Processing Flow

1. Copy input data from CPU memory to GPU memory

PCI Bus

Page 21

Simple Processing Flow

1. Copy input data from CPU memory to GPU memory

2. Load GPU program and execute. Results stored in GPU memory.

PCI Bus

Page 22

Simple Processing Flow

1. Copy input data from CPU memory to GPU memory

2. Load GPU program and execute. Results stored in GPU memory.

3. Copy results from GPU memory to CPU memory

PCI Bus
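The three steps can be mimicked on the CPU alone to make the data movement explicit. This is only a stand-in under stated assumptions: the `device_*` vectors play the role of GPU memory, a plain loop plays the role of the GPU program, and `run_flow` and its scaling operation are hypothetical names chosen for illustration. No GPU or CUDA call is involved.

```cpp
#include <cstddef>
#include <vector>

// CPU-only illustration of the copy-in / execute / copy-out pattern.
std::vector<float> run_flow(const std::vector<float>& host_in, float scale) {
    // 1. Copy input data from CPU memory to "GPU" memory.
    std::vector<float> device_in(host_in);

    // 2. Execute the program; results stay in "GPU" memory.
    std::vector<float> device_out(device_in.size());
    for (std::size_t i = 0; i < device_in.size(); ++i)
        device_out[i] = scale * device_in[i];

    // 3. Copy results from "GPU" memory back to CPU memory.
    std::vector<float> host_out(device_out);
    return host_out;
}
```

In a real CUDA program, steps 1 and 3 correspond to the `cudaMemcpy` calls and step 2 to the kernel launch, as in the stencil example earlier in the deck.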

Page 23

Outline

Evolution of GPU Computing

Parallelism

Heterogeneous Computing Concepts

Using GPUs to Accelerate Applications

Accelerated Libraries

Compiler Directives

Programming Languages

Page 24

Simple Example: SAXPY

BLAS1 function

y = a*x + y

x and y are length-N vectors

a is a scalar

single precision data (float)

void saxpy(int n, float a,
           float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

int N = 1<<20;

// Perform SAXPY on 1M elements
saxpy(N, 2.0, x, y);

Page 25

3 Ways to Accelerate Applications

Applications

Libraries: “Drop-in” Acceleration

OpenACC Directives: Easily Accelerate Applications

Programming Languages: Maximum Flexibility


Page 27

GPU Accelerated Libraries

NVIDIA cuBLAS

NVIDIA cuRAND

NVIDIA cuSPARSE

NVIDIA NPP

NVIDIA cuFFT

Vector Signal Image Processing

GPU Accelerated Linear Algebra

Matrix Algebra on GPU and Multicore

C++ STL Features for CUDA

Sparse Linear Algebra

IMSL Library

Building-block Algorithms for CUDA

ArrayFire Matrix Computations

Page 28

SAXPY in cuBLAS

Serial BLAS Code

int N = 1<<20;
...
// Use your choice of blas library
// Perform SAXPY on 1M elements
blas_saxpy(N, 2.0, x, 1, y, 1);

Parallel cuBLAS Code

int N = 1<<20;
cublasInit();
cublasSetVector(N, sizeof(x[0]), x, 1, d_x, 1);
cublasSetVector(N, sizeof(y[0]), y, 1, d_y, 1);
// Perform SAXPY on 1M elements
cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);
cublasGetVector(N, sizeof(y[0]), d_y, 1, y, 1);
cublasShutdown();

http://developer.nvidia.com/cublas

You can also call cuBLAS from Fortran, C++, Python, and other languages
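The `1` arguments in the BLAS and cuBLAS calls above are element strides (`incx`, `incy`). A small host-side reference makes their meaning concrete. This is a sketch, not library code: `ref_saxpy` is an illustrative name, and only positive strides are handled.

```cpp
// Reference SAXPY with BLAS-style strides:
//   y[i*incy] = a * x[i*incx] + y[i*incy]   for i = 0 .. n-1
// Assumption: incx and incy are positive.
void ref_saxpy(int n, float a, const float* x, int incx, float* y, int incy) {
    for (int i = 0; i < n; ++i)
        y[i * incy] = a * x[i * incx] + y[i * incy];
}
```

With `incx = incy = 1`, as on the slide, this reduces to the contiguous loop from the SAXPY example. A stride of 2 would read every other element of `x`.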

Page 29

3 Ways to Accelerate Applications

Applications

Libraries: “Drop-in” Acceleration

OpenACC Directives: Easily Accelerate Applications

Programming Languages: Maximum Flexibility

Page 30

OpenACC Compiler Directives

Program myscience
   ... serial code ...
!$acc kernels
   do k = 1,n1
      do i = 1,n2
         ... parallel code ...
      enddo
   enddo
!$acc end kernels
   ...
End Program myscience

CPU GPU

OpenACC Compiler Hint

Your original Fortran or C code

Simple Compiler hints

Compiler Parallelizes code

Works on many-core GPUs & multicore CPUs

Page 31

SAXPY with OpenACC Directives

Parallel C Code

void saxpy(int n,
           float a,
           float *x,
           float *y)
{
#pragma acc kernels
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...

Parallel Fortran Code

subroutine saxpy(n, a, x, y)
  real :: x(:), y(:), a
  integer :: n, i
!$acc kernels
  do i=1,n
    y(i) = a*x(i)+y(i)
  enddo
!$acc end kernels
end subroutine saxpy

...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...

http://developer.nvidia.com/openacc or http://openacc.org

Page 32

OpenACC will be covered Friday

Exposing Parallelism with OpenACC

Profiling and Tuning OpenACC Applications

Coding and profiling hands-on

A Shameless Plug

Page 33

3 Ways to Accelerate Applications

Applications

Libraries: “Drop-in” Acceleration

OpenACC Directives: Easily Accelerate Applications

Programming Languages: Maximum Flexibility

Page 34

CUDA C/C++

Class hierarchies

__device__ methods

Templates

Operator overloading

Functors (function objects)

Device-side new/delete

More…

template <typename T>
struct Functor {
    __device__ Functor(T _a) : a(_a) {}
    __device__ T operator()(T x) { return a*x; }
    T a;
};

template <typename T, typename Oper>
__global__ void kernel(T *output, int n) {
    Oper op(3.7);
    output = new T[n]; // dynamic allocation
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        output[i] = op(i); // apply functor
}

http://developer.nvidia.com/cuda-toolkit

CUDA C++ features enable sophisticated and flexible applications and middleware
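The functor pattern on this slide can be tried in ordinary host C++ before moving it to the device. The sketch below is an assumption-laden stand-in, not CUDA code: `Scale` and `apply_to_indices` are illustrative names, and the loop replaces the per-thread index that a kernel would compute.

```cpp
#include <vector>

// Function object that multiplies its argument by a stored factor.
template <typename T>
struct Scale {
    explicit Scale(T a_) : a(a_) {}
    T operator()(T x) const { return a * x; }
    T a;
};

// Host stand-in for a templated kernel: applies op to each index 0..n-1.
template <typename T, typename Oper>
std::vector<T> apply_to_indices(int n, Oper op) {
    std::vector<T> output(n);
    for (int i = 0; i < n; ++i)
        output[i] = op(static_cast<T>(i));
    return output;
}
```

For example, `apply_to_indices<double>(4, Scale<double>(2.0))` returns the values 0, 2, 4, 6. Because the operation is a template parameter, the same loop body works with any function object, which is the flexibility the slide points to.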

Page 35

CUDA Fortran

Program GPU using Fortran

Key language for HPC

Simple language extensions

Kernel functions

Thread / block IDs

Device & data management

Parallel loop directives

Familiar syntax

Use allocate, deallocate

Copy CPU-to-GPU with assignment (=)

module mymodule
contains
  attributes(global) subroutine saxpy(n,a,x,y)
    real :: x(:), y(:), a
    integer :: n, i
    attributes(value) :: a, n
    i = threadIdx%x+(blockIdx%x-1)*blockDim%x
    if (i<=n) y(i) = a*x(i) + y(i)
  end subroutine saxpy
end module mymodule

program main
  use cudafor; use mymodule
  real, device :: x_d(2**20), y_d(2**20)
  real :: y(2**20)   ! host copy of the result
  x_d = 1.0; y_d = 2.0
  call saxpy<<<4096,256>>>(2**20,3.0,x_d,y_d)
  y = y_d
  write(*,*) 'max error=', maxval(abs(y-5.0))
end program main

http://developer.nvidia.com/cuda-fortran

Page 36

CUDA C/C++ will be covered Thursday

Optimization and GPU Architecture

Hands on development, optimization, and profiling

YASP: Yet Another Shameless Plug

Page 37

SAXPY in CUDA C

Standard C

void saxpy(int n, float a,
           float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

int N = 1<<20;

// Perform SAXPY on 1M elements
saxpy(N, 2.0, x, y);

CUDA C

__global__
void saxpy(int n, float a,
           float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

int N = 1<<20;

// cudaMemcpy takes a size in bytes
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

// Perform SAXPY on 1M elements
saxpy<<<4096,256>>>(N, 2.0, d_x, d_y);

cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

http://developer.nvidia.com/cuda-toolkit
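The launch configuration `<<<4096,256>>>` and the index expression `blockIdx.x*blockDim.x + threadIdx.x` can be checked on the CPU by looping over every block and thread in turn. The sketch below is an emulation in plain C++, not CUDA. `blocks_for` and `saxpy_emulated` are illustrative names, and real GPU threads run concurrently rather than in this loop order.

```cpp
#include <vector>

// Number of blocks needed so that blocks * threads_per_block >= n.
int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// CPU emulation of saxpy<<<blocks, threads_per_block>>>(n, a, x, y):
// the two loops visit every (blockIdx.x, threadIdx.x) pair once.
void saxpy_emulated(int blocks, int threads_per_block, int n, float a,
                    const std::vector<float>& x, std::vector<float>& y) {
    for (int block = 0; block < blocks; ++block) {
        for (int thread = 0; thread < threads_per_block; ++thread) {
            int i = block * threads_per_block + thread;  // global index
            if (i < n)                                   // guard for the last block
                y[i] = a * x[i] + y[i];
        }
    }
}
```

For N = 1<<20 and 256 threads per block, `blocks_for` returns 4096, matching the slide. When N is not a multiple of the block size, the rounded-up grid contains a few extra threads, and the `if (i < n)` guard is what keeps them from writing past the end of the arrays.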

Page 38

SAXPY in CUDA Fortran

Standard Fortran

module mymodule
contains
  subroutine saxpy(n, a, x, y)
    real :: x(:), y(:), a
    integer :: n, i
    do i=1,n
      y(i) = a*x(i)+y(i)
    enddo
  end subroutine saxpy
end module mymodule

program main
  use mymodule
  real :: x(2**20), y(2**20)
  x = 1.0; y = 2.0

  ! Perform SAXPY on 1M elements
  call saxpy(2**20, 2.0, x, y)
end program main

CUDA Fortran

module mymodule
contains
  attributes(global) subroutine saxpy(n, a, x, y)
    real :: x(:), y(:), a
    integer :: n, i
    attributes(value) :: a, n
    i = threadIdx%x+(blockIdx%x-1)*blockDim%x
    if (i<=n) y(i) = a*x(i)+y(i)
  end subroutine saxpy
end module mymodule

program main
  use cudafor; use mymodule
  real, device :: x_d(2**20), y_d(2**20)
  x_d = 1.0; y_d = 2.0

  ! Perform SAXPY on 1M elements
  call saxpy<<<4096,256>>>(2**20, 2.0, x_d, y_d)
end program main

http://developer.nvidia.com/cuda-fortran

Page 39

Takeaways

GPUs accelerate science

Increase throughput on parallel computations common to scientific applications.

GPUs have proven their worth in accelerating real science applications.

Three ways to make use of GPUs in your application:

1. Accelerated Libraries

2. Compiler Directives – OpenACC in C, C++, Fortran

Tutorial – Friday

3. Programming Languages – CUDA C/C++, CUDA Fortran

Tutorial – Thursday

Page 40

Final Words:

We live in a parallel universe

Think about your problem in parallel

Page 41

|===============----------|

Conference Progress: 60%

|=========================|

Plenary Sessions: 100%