Performance Analysis: C vs Cuda

Post on 18-May-2015


DESCRIPTION

Some tests comparing n-queens solutions on the CPU, using C, and on the GPU, using Cuda.

TRANSCRIPT

n-Queens Problem: A Comparison Between CPU and GPU using C++ and Cuda

Vitor Pamplona <vitor@vitorpamplona.com>

Copyright Vitor F. Pamplona

Goals

● Learn Cuda and its limitations
● Implement some n-Queens solutions
  ● Cuda version
  ● C++ version
● Compare performance
● Check for possible papers
  ● Parallel processing
  ● Computer graphics


N by N Queens Problem

http://en.wikipedia.org/wiki/Eight_queens_puzzle


Possibilities vs Solutions

Board Size | Possibilities (N^N)         | Solutions
 1         | 1                           | 1
 2         | 4                           | 0
 3         | 27                          | 0
 4         | 256                         | 2
 5         | 3,125                       | 10
 6         | 46,656                      | 4
 7         | 823,543                     | 40
 8         | 16,777,216                  | 92
 9         | 387,420,489                 | 352
10         | 10,000,000,000              | 724
11         | 285,311,670,611             | 2,680
12         | 8,916,100,448,256           | 14,200
13         | 302,875,106,592,253         | 73,712
14         | 11,112,006,825,558,016      | 365,596
15         | 437,893,890,380,859,375     | 2,279,184
16         | 18,446,744,073,709,551,616  | 14,772,512
17         | 827,240,261,886,336,764,177 | 95,815,104
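The solution counts above can be reproduced for small boards with a short brute-force counter. A minimal C++ sketch in the spirit of the talk's depth-first approach (the function name countQueens is mine, not from the slides):

```cpp
#include <cstdlib>

// Depth-first brute force: place one queen per column, pruning rows and
// diagonals already attacked. rows[c] holds the row of the queen in column c.
static long countQueens(int n, int col, int* rows) {
    if (col == n) return 1;              // all columns filled: one solution
    long count = 0;
    for (int r = 0; r < n; ++r) {
        bool ok = true;
        for (int c = 0; c < col && ok; ++c)
            if (rows[c] == r || std::abs(rows[c] - r) == col - c) ok = false;
        if (ok) { rows[col] = r; count += countQueens(n, col + 1, rows); }
    }
    return count;
}
```

Calling countQueens(8, 0, rows) walks the full 8x8 tree and returns the 92 from the table.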


Cu... what?

● Compute Unified Device Architecture
● C-style language and compiler
● Designed for parallel solutions
● Not a graphics API
● Runs on current graphics hardware
  ● nVidia GeForce 8+
● Faster transfers between CPU and GPU
● Compiler for CPU and GPU


Hardware Architecture

[Diagrams, built up over slides 7-18: the CPU side has a host, a cache, and host memory; the GPU side has device memory and multiprocessors. Each multiprocessor runs warps of threads (L), each thread with its own local memory, sharing banked shared memory. The device also exposes constant memory (64 kB, cached), global memory, and texture memory (optimized for 2D access, cached).]

Memory Access

Basics of Programming

Hardware Architecture

[Slides 19-24 are figure-only; no text content survives beyond the titles.]

Libraries and Access

[Diagram, built up over slides 25-29: the software stack on the CPU side is Application, then CUDA Libraries, then CUDA Runtime, then CUDA Driver, with the driver talking to the GPU.]

Startup

● Special Windows/Linux drivers
● CUDA Toolkit
● CUDA Developer SDK, which includes
  ● API documentation
  ● Programming guide
  ● Compiler (nvcc)
  ● Libraries (CUFFT, CUBLAS)
  ● Source code examples


Host Example

float *pHostData = (float*) malloc(sizeof(float) * 256);
// fill in the data array...

// allocate global memory
float *pInput, *pOutput;
cudaMalloc((void**) &pInput, sizeof(float) * 256);
cudaMalloc((void**) &pOutput, sizeof(float) * 256);

// host memory to global memory
cudaMemcpy(pInput, pHostData, sizeof(float) * 256, cudaMemcpyHostToDevice);

dim3 nDimGrid(1, 1, 1);   // 1 block only
dim3 nDimBlock(32, 1, 1); // 32 threads per block
int nSharedMemBytes = sizeof(float) * 32;
MyKernel<<<nDimGrid, nDimBlock, nSharedMemBytes>>>(pInput, pOutput);

// global memory to host memory
cudaMemcpy(pHostData, pOutput, sizeof(float) * 256, cudaMemcpyDeviceToHost);

free(pHostData);
cudaFree(pInput);  // device pointers are released with cudaFree, not free
cudaFree(pOutput);


Kernel Example

__global__ void MyKernel(float* pInData, float* pOutData)
{
    extern __shared__ float sharedData[];
    const unsigned int tid = threadIdx.x;
    const unsigned int num_threads = blockDim.x;

    // global memory to shared memory
    sharedData[tid] = pInData[tid];
    __syncthreads();

    // do something
    sharedData[tid] = (float) num_threads * sharedData[tid];
    __syncthreads();

    // shared memory to global memory
    pOutData[tid] = sharedData[tid];
}


Competitors

● AMD/ATI Close to Metal (CTM)
● RapidMind
● Acceleware
● PeakStream
  ● Unavailable since its acquisition by Google
● BrookGPU
● OpenGL/Direct3D + GLSL/HLSL/Cg
● BSGP


Back to Work

● Brute force implementations
● 3 solutions for CPU
  ● Monothread depth-first recursive
  ● Monothread depth-first plain
  ● N-threads depth-first plain
● 3 solutions for GPU
  ● Step-based breadth-first static memory
  ● Step-based breadth-first dynamic memory
  ● Plain depth-first dynamic memory version


CPU Monothread Depth-first Plain

● Optimized implementation
● Single thread
● Depth-first approach
● No recursion, no function calls
● Memory buffers :)
● Fast, really fast!

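A minimal sketch of what such a plain depth-first search can look like: one thread, no recursion and no function calls in the search loop, just a row buffer walked forward and backward. This illustrates the idea from the slide; it is not the author's implementation, and the name countQueensPlain is mine:

```cpp
// Plain depth-first n-queens: single thread, iterative backtracking.
// row[c] is the row currently tried for the queen in column c.
static long countQueensPlain(int n) {
    int row[32];
    long solutions = 0;
    int c = 0;
    row[0] = -1;                                  // "no row tried yet" marker
    while (c >= 0) {
        bool placed = false;
        for (int r = row[c] + 1; r < n && !placed; ++r) {
            bool ok = true;
            for (int p = 0; p < c && ok; ++p)     // rows and both diagonals
                if (row[p] == r || row[p] - r == c - p || r - row[p] == c - p)
                    ok = false;
            if (ok) { row[c] = r; placed = true; }
        }
        if (!placed)    { --c; continue; }        // column exhausted: backtrack
        if (c == n - 1) { ++solutions; continue; }// full board: count, try next row
        row[++c] = -1;                            // advance to the next column
    }
    return solutions;
}
```

The whole search state is the row buffer plus one index, which is what makes the "memory buffers" version so cache-friendly and fast.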


CPU N-threads Depth-first Plain

● N threads, where N is the board size
● First column filled in the main thread
● Create N Linux pthreads
  ● One thread for each row
  ● Each thread processes the remaining N-1 columns
● Critical section
  ● solutions++;
  ● saveSolution(board);

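The N-threads scheme above can be sketched as follows. The slides use Linux pthreads and a critical section around solutions++; this illustration uses std::thread and an atomic counter instead, and the names are mine:

```cpp
#include <atomic>
#include <cstdlib>
#include <thread>
#include <vector>

// Recursive completion of a board whose first columns are already fixed.
static void search(int n, int col, int* rows, long* local) {
    if (col == n) { ++*local; return; }
    for (int r = 0; r < n; ++r) {
        bool ok = true;
        for (int c = 0; c < col && ok; ++c)
            if (rows[c] == r || std::abs(rows[c] - r) == col - c) ok = false;
        if (ok) { rows[col] = r; search(n, col + 1, rows, local); }
    }
}

// N threads, one per row of the first column; each thread searches the
// remaining N-1 columns and makes a single synchronized update at the end.
static long countQueensThreaded(int n) {
    std::atomic<long> solutions{0};   // stands in for the slides' critical section
    std::vector<std::thread> pool;
    for (int first = 0; first < n; ++first)
        pool.emplace_back([n, first, &solutions] {
            int rows[32];
            rows[0] = first;          // first column fixed by the main thread
            long local = 0;
            search(n, 1, rows, &local);
            solutions += local;
        });
    for (auto& t : pool) t.join();
    return solutions.load();
}
```

Accumulating into a thread-local counter and synchronizing once per thread keeps contention on the shared counter negligible.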


GPU Step Breadth-first

[Diagrams, slides 42-46: step 1 takes the empty board as input and launches threads 1..N, one per row of the first column. Each later step takes the surviving partial solutions as input and launches one thread per (partial solution, row) pair, writing the next frontier as output. Threads = Num. Solutions * N.]
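The step-based breadth-first scheme can be mimicked on the CPU to make the thread mapping concrete: each step expands every surviving partial solution by one column, with one logical thread per (partial solution, row) pair. A sketch under those assumptions (names mine):

```cpp
#include <vector>

// One breadth-first step: expand every valid k-column prefix by one column.
// Each (prefix, r) pair corresponds to one GPU thread in the slides, so the
// step launches Num. Solutions * N threads.
static std::vector<std::vector<int>>
expand(int n, const std::vector<std::vector<int>>& in) {
    std::vector<std::vector<int>> out;
    for (const auto& prefix : in) {
        int col = (int)prefix.size();
        for (int r = 0; r < n; ++r) {            // one logical thread per (prefix, r)
            bool ok = true;
            for (int c = 0; c < col && ok; ++c)
                if (prefix[c] == r || prefix[c] - r == col - c || r - prefix[c] == col - c)
                    ok = false;
            if (ok) { out.push_back(prefix); out.back().push_back(r); }
        }
    }
    return out;
}

static long countQueensBreadth(int n) {
    std::vector<std::vector<int>> frontier(1);   // start from one empty prefix
    for (int step = 0; step < n; ++step)         // N steps, one column per step
        frontier = expand(n, frontier);
    return (long)frontier.size();                // surviving boards are solutions
}
```

Every kernel invocation does only a tiny amount of work per thread, which is exactly why this shape avoids the driver's time limit on long-running kernels.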

Why a Breadth-first Solution?

● Graphics processors are not Intel/AMD
  ● Slow: 650 MHz
  ● Driver can kill time-expensive kernels
● Lots of threads
  ● Good for GPU
● Easy solution-thread mapping by indexes
● Fast kernels
  ● Good for GPU


GPU Step Breadth-first

● Static memory version
  ● Bad: one sort of the output for each step
  ● Good for GPU
● Dynamic memory version
  ● Bad: synchronized memory access
  ● Bad: global last-output index



Plain Depth-first Dynamic

● Best case: N^4 threads
● Thread indexes fill the first 4 columns
● Depth-first approach
● Synchronized global memory access

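A CPU sketch of that N^4-threads mapping, for boards with N >= 4: each linear index decodes into rows for the first four columns, indexes that decode to an attacked prefix contribute nothing, and valid prefixes are completed depth-first. Illustrative only; the names are mine, not the author's:

```cpp
#include <cstdlib>

// Depth-first completion of a fixed prefix: rows[0..col-1] already placed.
static long completions(int n, int col, int* rows) {
    if (col == n) return 1;
    long count = 0;
    for (int r = 0; r < n; ++r) {
        bool ok = true;
        for (int c = 0; c < col && ok; ++c)
            if (rows[c] == r || std::abs(rows[c] - r) == col - c) ok = false;
        if (ok) { rows[col] = r; count += completions(n, col + 1, rows); }
    }
    return count;
}

// Mimics the best-case N^4-threads version: each linear index 0..N^4-1
// decodes (base N) into the first four columns, as a GPU thread index would.
static long countQueensIndexed(int n) {
    long total = 0;
    const long nThreads = (long)n * n * n * n;
    for (long idx = 0; idx < nThreads; ++idx) {  // one logical thread per idx
        int rows[32];
        long v = idx;
        bool ok = true;
        for (int c = 0; c < 4 && ok; ++c) {
            int r = (int)(v % n);
            v /= n;
            for (int p = 0; p < c && ok; ++p)    // reject attacked prefixes
                if (rows[p] == r || std::abs(rows[p] - r) == c - p) ok = false;
            rows[c] = r;
        }
        if (ok) total += completions(n, 4, rows);
    }
    return total;
}
```

Most indexes die immediately at the prefix check, which on the GPU shows up as many threads returning early while a few do the real depth-first work.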

Implementations and Threads

Solution                              | Threads
GPU-breadth-first static mem          | Sol * N
GPU-breadth-first dynamic mem         | Sol * N
GPU-depth-first 1-Thread              | 1
GPU-depth-first n-Threads             | N
GPU-depth-first n-grids               | N
GPU-depth-first n*n-grids             | N*N
GPU-depth-first n*n-grids*n-threads   | N*N*N
GPU-depth-first n*n-grids*n*n-threads | N*N*N*N
GPU-depth-first FULL threads          | N^N
CPU-Plain                             | 1
CPU-Recursive                         | 1
CPU-Plain-Threads                     | N


Test platforms

● CPU: Intel Quad Core 2.4 GHz
  ● Ubuntu
  ● 4 GB RAM
● GPU: GeForce 9600 GT
  ● 8 multiprocessors
  ● 64 processors at 650 MHz
  ● 512 MB RAM at 900 MHz
  ● Cuda 1.0


Results: CPU

[Chart, board sizes 12-14, runtime 0-9000: CPU-Plain, CPU-Recursive, CPU-Plain-Threads]

Results: GPU: Static vs Dynamic

[Chart, board sizes 11-12, runtime 0-7000: breadth-first static, breadth-first dynamic, CPU-Plain, CPU-Recursive, CPU-Plain-Threads]

Results: Same Number of Threads

[Chart, board sizes 12-13, runtime 0-9000: depth-first n-Threads, depth-first n-Grids, CPU-Plain-Threads]

Results: Only 1 Thread

[Chart, board sizes 10-12, runtime 0-8000: depth-first 1-Thread, CPU-Recursive, CPU-Plain]

Results: Dynamic vs Depth

[Chart, board size 12, runtime 0-1800: breadth-first dynamic, depth-first n-Threads, depth-first n-Grids, CPU-Plain, CPU-Recursive, CPU-Plain-Threads]

Results: Depth vs CPU

[Chart, board size 12, runtime 0-1800: depth-first n-Threads, depth-first n-Grids, depth-first n*n-grids, depth-first n*n-grids*n-threads, depth-first n*n-grids*n*n-threads, CPU-Plain, CPU-Recursive, CPU-Plain-Threads]

Results: GPU N^N Solution

[Chart, board sizes 7-9, runtime 0-12000: depth-first n^n, CPU-Plain, CPU-Recursive, CPU-Plain-Threads]

Results: Dynamic, Depth, CPU

[Chart, board sizes 10-13, runtime 0-1600: breadth-first dynamic, depth-first N*N*N*N, CPU-Plain, CPU-Recursive, CPU-Plain-Threads]

Results: Depth vs CPU Threads

[Chart, board sizes 14-16, runtime 0-140000: depth-first N*N*N*N, CPU-Plain, CPU-Plain-Threads]

Results

Solution                  | Threads | board size 1..9
GPU-breadth-first static  | Sol * N | 171 171 171 174 174 174 178 184 220
GPU-breadth-first dynamic | Sol * N | 171 171 171 173 173 173 173 173 174
GPU-depth-first 1-Thread  | 1       | 171 171 171 171 171 171 171 185 227
GPU-depth-first n-Threads | N       | 171 171 171 172 172 173 173 175 230
GPU-depth-first n-grids   | N       | 171 171 171 171 171 173 173 173 177
GPU-depth-first n*n-grids | N*N     | 172 172 172 172 172 172 172 172 174
GPU-depth-first N^3       | N^3     | 171 172 172 172 172 172 172 172 174
GPU-depth-first N^4       | N^4     | 171 171 171 171 171 171 171 171 171
GPU-depth-first FULL      | N^N     | 171 171 172 172 172 172 230 1682 11420
CPU-Plain                 | 1       | 2 2 2 2 2 2 2 2 3
CPU-Recursive             | 1       | 2 2 2 2 2 2 2 2 3
CPU-Plain-Threads         | N       | 2 2 2 2 2 2 2 2 5


Results

Solution                  | Threads | 11   | 12   | 13   | 14   | 15    | 16    | 17
GPU-breadth-first static  | Sol * N | 1234 | 6184 | Mem  | Mem  | Mem   | Mem   | Mem
GPU-breadth-first dynamic | Sol * N | 218  | 407  | 1481 | 7886 | Mem   | Mem   |
GPU-depth-first 1-Thread  | 1       | 1463 | 7198 |      |      |       |       |
GPU-depth-first n-Threads | N       | 441  | 1561 | 7827 |      |       |       |
GPU-depth-first n-grids   | N       | 301  | 824  | 3604 |      |       |       |
GPU-depth-first n*n-grids | N*N     | 216  | 424  | 1425 | 7025 |       |       |
GPU-depth-first N^3       | N^3     | 192  | 267  | 661  | 2937 |       |       |
GPU-depth-first N^4       | N^4     | 181  | 199  | 360  | 1369 | 7562  | 43488 | 05:38.99
GPU-depth-first FULL      | N^N     |      |      |      |      |       |       |
CPU-Plain                 | 1       | 18   | 91   | 502  | 3020 | 19685 |       |
CPU-Recursive             | 1       | 35   | 198  | 1225 | 8283 | 58493 |       |
CPU-Plain-Threads         | N       | 17   | 84   | 290  | 1393 | 8578  | 32010 | 04:40.95


Conclusions

● Cuda is slow
  ● Low use of GPU graphics resources
  ● GLSL, HLSL and Cg are faster
  ● Compiler needs improvements
  ● More documentation on assembly optimization
● Unstable
  ● The GPU kills some processes (I don't know why)
● Performance depends on the implementation
● Good for mixed solutions: CPU + GPU


Conclusions

● %, * and / are slow
● threadIdx and blockIdx are fantastic
● __shared__ memory helps
● Cuda locks the screen while processing
  ● No inter-process scheduling
● Synchronized architecture
  ● Think synchronized
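The "%, * and / are slow" point has a classic workaround when the other operand is a power of two: replace the arithmetic with bitwise operations. A tiny example of the idiom, not something shown in the slides:

```cpp
// On hardware where integer %, * and / are expensive, power-of-two cases
// reduce to masks and shifts (here for 32, a common warp-related stride).
static unsigned mod32(unsigned x) { return x & 31u; } // same as x % 32
static unsigned div32(unsigned x) { return x >> 5;  } // same as x / 32
static unsigned mul32(unsigned x) { return x << 5;  } // same as x * 32
```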

Questions?

Vitor Pamplona <vitor@vitorpamplona.com>
