Performance Analysis: C vs Cuda

Post on 18-May-2015


DESCRIPTION

Some tests comparing n-queens solutions on the CPU, using C, and on the GPU, using Cuda.

TRANSCRIPT

n-Queens Problem: A Comparison Between CPU and GPU using C++ and Cuda

Vitor Pamplona <vitor@vitorpamplona.com>

Copyright Vitor F. Pamplona

Goals

● Learn Cuda and its limitations
● Implement some n-Queens solutions
  ● Cuda version
  ● C++ version
● Compare performance
● Check for possible papers
  ● Parallel processing
  ● Computer graphics


N by N Queens Problem

http://en.wikipedia.org/wiki/Eight_queens_puzzle


Possibilities vs Solutions

Board Size | Possibilities (N^N)         | Solutions
 1         | 1                           | 1
 2         | 4                           | 0
 3         | 27                          | 0
 4         | 256                         | 2
 5         | 3,125                       | 10
 6         | 46,656                      | 4
 7         | 823,543                     | 40
 8         | 16,777,216                  | 92
 9         | 387,420,489                 | 352
10         | 10,000,000,000              | 724
11         | 285,311,670,611             | 2,680
12         | 8,916,100,448,256           | 14,200
13         | 302,875,106,592,253         | 73,712
14         | 11,112,006,825,558,016      | 365,596
15         | 437,893,890,380,859,375     | 2,279,184
16         | 18,446,744,073,709,551,616  | 14,772,512
17         | 827,240,261,886,336,764,177 | 95,815,104
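The solution counts above can be reproduced for small boards with a short brute-force counter. A minimal C++ sketch in the spirit of the talk's depth-first approach (the function name countQueens is mine, not from the slides):

```cpp
#include <cstdlib>

// Depth-first brute force: place one queen per column, pruning rows and
// diagonals already attacked. rows[c] holds the row of the queen in column c.
static long countQueens(int n, int col, int* rows) {
    if (col == n) return 1;              // all columns filled: one solution
    long count = 0;
    for (int r = 0; r < n; ++r) {
        bool ok = true;
        for (int c = 0; c < col && ok; ++c)
            if (rows[c] == r || std::abs(rows[c] - r) == col - c) ok = false;
        if (ok) { rows[col] = r; count += countQueens(n, col + 1, rows); }
    }
    return count;
}
```

Calling countQueens(8, 0, rows) walks the full 8x8 tree and returns the 92 from the table.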


Cu... what?

● Compute Unified Device Architecture
● C-style language and compiler
● Designed for parallel solutions
● Not a graphics API
● Runs on current graphics hardware
  ● nVidia GeForce 8+
● Faster transfers between CPU and GPU
● Compiler for CPU and GPU


Hardware Architecture

[Diagrams, built up over slides 7-18: the CPU side has a host, a cache, and host memory; the GPU side has device memory and multiprocessors. Each multiprocessor runs warps of threads (L), each thread with its own local memory, sharing banked shared memory. The device also exposes constant memory (64 kB, cached), global memory, and texture memory (optimized for 2D access, cached).]

Memory Access

Basics of Programming

Hardware Architecture

[Slides 19-24 are figure-only; no text content survives beyond the titles.]

Libraries and Access

[Diagram, built up over slides 25-29: the software stack on the CPU side is Application, then CUDA Libraries, then CUDA Runtime, then CUDA Driver, with the driver talking to the GPU.]

Startup

● Special Windows/Linux drivers
● CUDA Toolkit
● CUDA Developer SDK, which includes
  ● API documentation
  ● Programming guide
  ● Compiler (nvcc)
  ● Libraries (CUFFT, CUBLAS)
  ● Source code examples


Host Example

float *pHostData = (float*) malloc(sizeof(float) * 256);
// fill in the data array...

// allocate global memory
float *pInput, *pOutput;
cudaMalloc((void**) &pInput, sizeof(float) * 256);
cudaMalloc((void**) &pOutput, sizeof(float) * 256);

// host memory to global memory
cudaMemcpy(pInput, pHostData, sizeof(float) * 256, cudaMemcpyHostToDevice);

dim3 nDimGrid(1, 1, 1);   // 1 block only
dim3 nDimBlock(32, 1, 1); // 32 threads per block
int nSharedMemBytes = sizeof(float) * 32;
MyKernel<<<nDimGrid, nDimBlock, nSharedMemBytes>>>(pInput, pOutput);

// global memory to host memory
cudaMemcpy(pHostData, pOutput, sizeof(float) * 256, cudaMemcpyDeviceToHost);

free(pHostData);
cudaFree(pInput);  // device pointers are released with cudaFree, not free
cudaFree(pOutput);


Kernel Example

__global__ void MyKernel(float* pInData, float* pOutData)
{
    extern __shared__ float sharedData[];
    const unsigned int tid = threadIdx.x;
    const unsigned int num_threads = blockDim.x;

    // global memory to shared memory
    sharedData[tid] = pInData[tid];
    __syncthreads();

    // do something
    sharedData[tid] = (float) num_threads * sharedData[tid];
    __syncthreads();

    // shared memory to global memory
    pOutData[tid] = sharedData[tid];
}


Competitors

● AMD/ATI Close to Metal (CTM)
● RapidMind
● Acceleware
● PeakStream
  ● Unavailable since its acquisition by Google
● BrookGPU
● OpenGL/Direct3D + GLSL/HLSL/Cg
● BSGP


Back to Work

● Brute force implementations
● 3 solutions for CPU
  ● Monothread depth-first recursive
  ● Monothread depth-first plain
  ● N-threads depth-first plain
● 3 solutions for GPU
  ● Step-based breadth-first static memory
  ● Step-based breadth-first dynamic memory
  ● Plain depth-first dynamic memory version


CPU Monothread Depth-first Plain

● Optimized implementation
● Single thread
● Depth-first approach
● No recursion, no function calls
● Memory buffers :)
● Fast, really fast!

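A minimal sketch of what such a plain depth-first search can look like: one thread, no recursion and no function calls in the search loop, just a row buffer walked forward and backward. This illustrates the idea from the slide; it is not the author's implementation, and the name countQueensPlain is mine:

```cpp
// Plain depth-first n-queens: single thread, iterative backtracking.
// row[c] is the row currently tried for the queen in column c.
static long countQueensPlain(int n) {
    int row[32];
    long solutions = 0;
    int c = 0;
    row[0] = -1;                                  // "no row tried yet" marker
    while (c >= 0) {
        bool placed = false;
        for (int r = row[c] + 1; r < n && !placed; ++r) {
            bool ok = true;
            for (int p = 0; p < c && ok; ++p)     // rows and both diagonals
                if (row[p] == r || row[p] - r == c - p || r - row[p] == c - p)
                    ok = false;
            if (ok) { row[c] = r; placed = true; }
        }
        if (!placed)    { --c; continue; }        // column exhausted: backtrack
        if (c == n - 1) { ++solutions; continue; }// full board: count, try next row
        row[++c] = -1;                            // advance to the next column
    }
    return solutions;
}
```

The whole search state is the row buffer plus one index, which is what makes the "memory buffers" version so cache-friendly and fast.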


CPU N-threads Depth-first Plain

● N threads, where N is the board size
● First column filled in the main thread
● Create N Linux pthreads
  ● One thread for each row
  ● Each thread processes the remaining N-1 columns
● Critical section
  ● solutions++;
  ● saveSolution(board);

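The N-threads scheme above can be sketched as follows. The slides use Linux pthreads and a critical section around solutions++; this illustration uses std::thread and an atomic counter instead, and the names are mine:

```cpp
#include <atomic>
#include <cstdlib>
#include <thread>
#include <vector>

// Recursive completion of a board whose first columns are already fixed.
static void search(int n, int col, int* rows, long* local) {
    if (col == n) { ++*local; return; }
    for (int r = 0; r < n; ++r) {
        bool ok = true;
        for (int c = 0; c < col && ok; ++c)
            if (rows[c] == r || std::abs(rows[c] - r) == col - c) ok = false;
        if (ok) { rows[col] = r; search(n, col + 1, rows, local); }
    }
}

// N threads, one per row of the first column; each thread searches the
// remaining N-1 columns and makes a single synchronized update at the end.
static long countQueensThreaded(int n) {
    std::atomic<long> solutions{0};   // stands in for the slides' critical section
    std::vector<std::thread> pool;
    for (int first = 0; first < n; ++first)
        pool.emplace_back([n, first, &solutions] {
            int rows[32];
            rows[0] = first;          // first column fixed by the main thread
            long local = 0;
            search(n, 1, rows, &local);
            solutions += local;
        });
    for (auto& t : pool) t.join();
    return solutions.load();
}
```

Accumulating into a thread-local counter and synchronizing once per thread keeps contention on the shared counter negligible.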


GPU Step Breadth-first

[Diagrams, slides 42-46: step 1 takes the empty board as input and launches threads 1..N, one per row of the first column. Each later step takes the surviving partial solutions as input and launches one thread per (partial solution, row) pair, writing the next frontier as output. Threads = Num. Solutions * N.]
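The step-based breadth-first scheme can be mimicked on the CPU to make the thread mapping concrete: each step expands every surviving partial solution by one column, with one logical thread per (partial solution, row) pair. A sketch under those assumptions (names mine):

```cpp
#include <vector>

// One breadth-first step: expand every valid k-column prefix by one column.
// Each (prefix, r) pair corresponds to one GPU thread in the slides, so the
// step launches Num. Solutions * N threads.
static std::vector<std::vector<int>>
expand(int n, const std::vector<std::vector<int>>& in) {
    std::vector<std::vector<int>> out;
    for (const auto& prefix : in) {
        int col = (int)prefix.size();
        for (int r = 0; r < n; ++r) {            // one logical thread per (prefix, r)
            bool ok = true;
            for (int c = 0; c < col && ok; ++c)
                if (prefix[c] == r || prefix[c] - r == col - c || r - prefix[c] == col - c)
                    ok = false;
            if (ok) { out.push_back(prefix); out.back().push_back(r); }
        }
    }
    return out;
}

static long countQueensBreadth(int n) {
    std::vector<std::vector<int>> frontier(1);   // start from one empty prefix
    for (int step = 0; step < n; ++step)         // N steps, one column per step
        frontier = expand(n, frontier);
    return (long)frontier.size();                // surviving boards are solutions
}
```

Every kernel invocation does only a tiny amount of work per thread, which is exactly why this shape avoids the driver's time limit on long-running kernels.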

Why a Breadth-first Solution?

● Graphics processors are not Intel/AMD
  ● Slow: 650 MHz
  ● Driver can kill time-expensive kernels
● Lots of threads
  ● Good for GPU
● Easy solution-thread mapping by indexes
● Fast kernels
  ● Good for GPU


GPU Step Breadth-first

● Static memory version
  ● Bad: one sort of the output for each step
  ● Good for GPU
● Dynamic memory version
  ● Bad: synchronized memory access
  ● Bad: global last-output index



Plain Depth-first Dynamic

● Best case: N^4 threads
● Thread indexes fill the first 4 columns
● Depth-first approach
● Synchronized global memory access

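A CPU sketch of that N^4-threads mapping, for boards with N >= 4: each linear index decodes into rows for the first four columns, indexes that decode to an attacked prefix contribute nothing, and valid prefixes are completed depth-first. Illustrative only; the names are mine, not the author's:

```cpp
#include <cstdlib>

// Depth-first completion of a fixed prefix: rows[0..col-1] already placed.
static long completions(int n, int col, int* rows) {
    if (col == n) return 1;
    long count = 0;
    for (int r = 0; r < n; ++r) {
        bool ok = true;
        for (int c = 0; c < col && ok; ++c)
            if (rows[c] == r || std::abs(rows[c] - r) == col - c) ok = false;
        if (ok) { rows[col] = r; count += completions(n, col + 1, rows); }
    }
    return count;
}

// Mimics the best-case N^4-threads version: each linear index 0..N^4-1
// decodes (base N) into the first four columns, as a GPU thread index would.
static long countQueensIndexed(int n) {
    long total = 0;
    const long nThreads = (long)n * n * n * n;
    for (long idx = 0; idx < nThreads; ++idx) {  // one logical thread per idx
        int rows[32];
        long v = idx;
        bool ok = true;
        for (int c = 0; c < 4 && ok; ++c) {
            int r = (int)(v % n);
            v /= n;
            for (int p = 0; p < c && ok; ++p)    // reject attacked prefixes
                if (rows[p] == r || std::abs(rows[p] - r) == c - p) ok = false;
            rows[c] = r;
        }
        if (ok) total += completions(n, 4, rows);
    }
    return total;
}
```

Most indexes die immediately at the prefix check, which on the GPU shows up as many threads returning early while a few do the real depth-first work.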

Implementations and Threads

Solution                              | Threads
GPU-breadth-first static mem          | Sol * N
GPU-breadth-first dynamic mem         | Sol * N
GPU-depth-first 1-Thread              | 1
GPU-depth-first n-Threads             | N
GPU-depth-first n-grids               | N
GPU-depth-first n*n-grids             | N*N
GPU-depth-first n*n-grids*n-threads   | N*N*N
GPU-depth-first n*n-grids*n*n-threads | N*N*N*N
GPU-depth-first FULL threads          | N^N
CPU-Plain                             | 1
CPU-Recursive                         | 1
CPU-Plain-Threads                     | N


Test platforms

● CPU: Intel Quad Core 2.4 GHz
  ● Ubuntu
  ● 4 GB RAM
● GPU: GeForce 9600 GT
  ● 8 multiprocessors
  ● 64 processors at 650 MHz
  ● 512 MB RAM at 900 MHz
  ● Cuda 1.0


Results: CPU

[Chart, board sizes 12-14, runtime 0-9000: CPU-Plain, CPU-Recursive, CPU-Plain-Threads]

Results: GPU: Static vs Dynamic

[Chart, board sizes 11-12, runtime 0-7000: breadth-first static, breadth-first dynamic, CPU-Plain, CPU-Recursive, CPU-Plain-Threads]

Results: Same Number of Threads

[Chart, board sizes 12-13, runtime 0-9000: depth-first n-Threads, depth-first n-Grids, CPU-Plain-Threads]

Results: Only 1 Thread

[Chart, board sizes 10-12, runtime 0-8000: depth-first 1-Thread, CPU-Recursive, CPU-Plain]

Results: Dynamic vs Depth

[Chart, board size 12, runtime 0-1800: breadth-first dynamic, depth-first n-Threads, depth-first n-Grids, CPU-Plain, CPU-Recursive, CPU-Plain-Threads]

Results: Depth vs CPU

[Chart, board size 12, runtime 0-1800: depth-first n-Threads, depth-first n-Grids, depth-first n*n-grids, depth-first n*n-grids*n-threads, depth-first n*n-grids*n*n-threads, CPU-Plain, CPU-Recursive, CPU-Plain-Threads]

Results: GPU N^N Solution

[Chart, board sizes 7-9, runtime 0-12000: depth-first n^n, CPU-Plain, CPU-Recursive, CPU-Plain-Threads]

Results: Dynamic, Depth, CPU

[Chart, board sizes 10-13, runtime 0-1600: breadth-first dynamic, depth-first N*N*N*N, CPU-Plain, CPU-Recursive, CPU-Plain-Threads]

Results: Depth vs CPU Threads

[Chart, board sizes 14-16, runtime 0-140000: depth-first N*N*N*N, CPU-Plain, CPU-Plain-Threads]

Results

Solution                  | Threads | board size 1..9
GPU-breadth-first static  | Sol * N | 171 171 171 174 174 174 178 184 220
GPU-breadth-first dynamic | Sol * N | 171 171 171 173 173 173 173 173 174
GPU-depth-first 1-Thread  | 1       | 171 171 171 171 171 171 171 185 227
GPU-depth-first n-Threads | N       | 171 171 171 172 172 173 173 175 230
GPU-depth-first n-grids   | N       | 171 171 171 171 171 173 173 173 177
GPU-depth-first n*n-grids | N*N     | 172 172 172 172 172 172 172 172 174
GPU-depth-first N^3       | N^3     | 171 172 172 172 172 172 172 172 174
GPU-depth-first N^4       | N^4     | 171 171 171 171 171 171 171 171 171
GPU-depth-first FULL      | N^N     | 171 171 172 172 172 172 230 1682 11420
CPU-Plain                 | 1       | 2 2 2 2 2 2 2 2 3
CPU-Recursive             | 1       | 2 2 2 2 2 2 2 2 3
CPU-Plain-Threads         | N       | 2 2 2 2 2 2 2 2 5


Results

Solution                  | Threads | 11   | 12   | 13   | 14   | 15    | 16    | 17
GPU-breadth-first static  | Sol * N | 1234 | 6184 | Mem  | Mem  | Mem   | Mem   | Mem
GPU-breadth-first dynamic | Sol * N | 218  | 407  | 1481 | 7886 | Mem   | Mem   |
GPU-depth-first 1-Thread  | 1       | 1463 | 7198 |      |      |       |       |
GPU-depth-first n-Threads | N       | 441  | 1561 | 7827 |      |       |       |
GPU-depth-first n-grids   | N       | 301  | 824  | 3604 |      |       |       |
GPU-depth-first n*n-grids | N*N     | 216  | 424  | 1425 | 7025 |       |       |
GPU-depth-first N^3       | N^3     | 192  | 267  | 661  | 2937 |       |       |
GPU-depth-first N^4       | N^4     | 181  | 199  | 360  | 1369 | 7562  | 43488 | 05:38.99
GPU-depth-first FULL      | N^N     |      |      |      |      |       |       |
CPU-Plain                 | 1       | 18   | 91   | 502  | 3020 | 19685 |       |
CPU-Recursive             | 1       | 35   | 198  | 1225 | 8283 | 58493 |       |
CPU-Plain-Threads         | N       | 17   | 84   | 290  | 1393 | 8578  | 32010 | 04:40.95


Conclusions

● Cuda is slow
  ● Low use of GPU graphics resources
  ● GLSL, HLSL and Cg are faster
  ● Compiler needs improvements
  ● More documentation on assembly optimization
● Unstable
  ● The GPU kills some processes (I don't know why)
● Performance depends on the implementation
● Good for mixed solutions: CPU + GPU


Conclusions

● %, * and / are slow
● threadIdx and blockIdx are fantastic
● __shared__ memory helps
● Cuda locks the screen while processing
  ● No inter-process scheduling
● Synchronized architecture
  ● Think synchronized
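The "%, * and / are slow" point has a classic workaround when the other operand is a power of two: replace the arithmetic with bitwise operations. A tiny example of the idiom, not something shown in the slides:

```cpp
// On hardware where integer %, * and / are expensive, power-of-two cases
// reduce to masks and shifts (here for 32, a common warp-related stride).
static unsigned mod32(unsigned x) { return x & 31u; } // same as x % 32
static unsigned div32(unsigned x) { return x >> 5;  } // same as x / 32
static unsigned mul32(unsigned x) { return x << 5;  } // same as x * 32
```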

Questions?

Vitor Pamplona <vitor@vitorpamplona.com>
