Performance Analysis: C vs CUDA

n-Queens Problem: A Comparison Between CPU and GPU Using C++ and CUDA
Vitor Pamplona ([email protected])


DESCRIPTION

Some tests comparing n-Queens solutions on the CPU (in C) and on the GPU (in CUDA).

TRANSCRIPT

Page 1: Performance Analysis: C vs CUDA

n-Queens Problem: A Comparison Between CPU and GPU Using C++ and CUDA

Vitor Pamplona ([email protected])

Page 2: Performance Analysis: C vs CUDA

Goals

● Learn CUDA and its limitations
● Implement some n-Queens solutions
  ● CUDA version
  ● C++ version
● Compare performance
● Check for possible papers
  ● Parallel processing
  ● Computer graphics

Page 3: Performance Analysis: C vs CUDA


N by N Queens Problem

http://en.wikipedia.org/wiki/Eight_queens_puzzle

Page 4: Performance Analysis: C vs CUDA

Possibilities vs Solutions

Board Size | Possibilities | Solutions
1  | 1 | 1
2  | 4 | 0
3  | 27 | 0
4  | 256 | 2
5  | 3,125 | 10
6  | 46,656 | 4
7  | 823,543 | 40
8  | 16,777,216 | 92
9  | 387,420,489 | 352
10 | 10,000,000,000 | 724
11 | 285,311,670,611 | 2,680
12 | 8,916,100,448,256 | 14,200
13 | 302,875,106,592,253 | 73,712
14 | 11,112,006,825,558,016 | 365,596
15 | 437,893,890,380,859,375 | 2,279,184
16 | 18,446,744,073,709,551,616 | 14,772,512
17 | 827,240,261,886,336,764,177 | 95,815,104
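The gap between the two columns is what drives everything that follows: there are N^N ways to drop one queen in each column, but only a tiny fraction of those placements are valid. A minimal C++ sketch of a brute-force counter (illustrative only, not the code benchmarked in this deck):

#include <cstdio>
#include <vector>

// true if a queen at (col, row) is attacked by the queens already placed
// in columns 0..col-1 (board[c] holds the row of the queen in column c)
static bool attacked(const std::vector<int>& board, int col, int row) {
    for (int c = 0; c < col; ++c) {
        int d = col - c;
        if (board[c] == row || board[c] == row - d || board[c] == row + d)
            return true;                       // same row or same diagonal
    }
    return false;
}

// depth-first count of all solutions for an n x n board
static long long countSolutions(std::vector<int>& board, int col, int n) {
    if (col == n) return 1;
    long long count = 0;
    for (int row = 0; row < n; ++row) {
        if (!attacked(board, col, row)) {
            board[col] = row;
            count += countSolutions(board, col + 1, n);
        }
    }
    return count;
}

int main() {
    for (int n = 1; n <= 10; ++n) {
        std::vector<int> board(n);
        std::printf("N=%2d  solutions=%lld\n", n, countSolutions(board, 0, n));
    }
    return 0;
}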

Page 5: Performance Analysis: C vs CUDA

Cu... what?

● Compute Unified Device Architecture
● C-style language and compiler
● Designed for parallel solutions
● Not a graphics API
● Runs on current graphics hardware
  ● nVidia GeForce 8+
● Faster transfers between CPU and GPU
● Compiler for CPU and GPU

Pages 6-17: Performance Analysis: C vs CUDA

Hardware Architecture

[Diagram sequence building up the CPU and GPU architectures: the host (CPU) with its processor, cache and host memory; the device (GPU) with its own processors and device memory; device threads grouped into warps, each thread with local memory and register banks; and the additional device memory spaces: constant memory (64 kB), global memory, and texture memory, which is cached and optimized for 2D access.]
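As a rough illustration of how these memory spaces appear in code (a sketch only, not taken from the slides), a kernel can touch constant, shared, local/register and global memory in a few lines:

__constant__ float scale[64];              // constant memory: read-only, cached, 64 kB total

// in/out point to global memory; assumes a launch with 32 threads per block
__global__ void memorySpaces(const float* in, float* out)
{
    __shared__ float tile[32];             // shared memory: per block, organized in banks
    const unsigned int tid = threadIdx.x;
    const unsigned int i = blockIdx.x * blockDim.x + tid;

    tile[tid] = in[i];                     // global -> shared
    __syncthreads();

    float tmp = tile[tid] * scale[tid];    // per-thread register/local memory
    out[i] = tmp;                          // register -> global
}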

Pages 18-23: Performance Analysis: C vs CUDA

Memory Access

Basics of Programming

Hardware Architecture

[Figure-only slides; the diagrams are not reproduced in this transcript.]

Pages 24-28: Performance Analysis: C vs CUDA

Libraries and Access

[Diagram build-up of the software stack: the Application calls the CUDA Libraries, the CUDA Runtime and the CUDA Driver on the CPU, and through them accesses the GPU.]

Page 29: Performance Analysis: C vs CUDA

Startup

● Special Windows/Linux drivers
● CUDA Toolkit
● CUDA Developer SDK, which includes:
  ● API documentation
  ● Programming guide
  ● Compiler (nvcc)
  ● Libraries (CUFFT, CUBLAS)
  ● Source code examples
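Host and device code live in the same .cu file and nvcc splits and compiles the two sides; a typical build line (illustrative, assuming the toolkit's nvcc is on the PATH and a hypothetical nqueens.cu source file):

nvcc -o nqueens nqueens.cu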

Page 30: Performance Analysis: C vs CUDA

Host Example

float *pHostData = (float*) malloc(sizeof(float) * 256);
// fill in the data array...

// allocate global memory
float *pInput, *pOutput;
cudaMalloc((void**) &pInput,  sizeof(float) * 256);
cudaMalloc((void**) &pOutput, sizeof(float) * 256);

// host memory to global memory
cudaMemcpy(pInput, pHostData, sizeof(float) * 256, cudaMemcpyHostToDevice);

dim3 nDimGrid(1, 1, 1);   // 1 block only
dim3 nDimBlock(32, 1, 1); // 32 threads per block
int nSharedMemBytes = sizeof(float) * 32;
MyKernel<<<nDimGrid, nDimBlock, nSharedMemBytes>>>(pInput, pOutput);

// global memory to host memory
cudaMemcpy(pHostData, pOutput, sizeof(float) * 256, cudaMemcpyDeviceToHost);

free(pHostData);
cudaFree(pInput);   // device allocations are released with cudaFree, not free
cudaFree(pOutput);
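The slide's code has no error handling; a minimal check after the launch (a sketch using the runtime's standard error query) could be added:

// a kernel launch is asynchronous and returns no status directly,
// so ask the runtime for the last error after launching
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    fprintf(stderr, "MyKernel launch failed: %s\n", cudaGetErrorString(err));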

Page 31: Performance Analysis: C vs CUDA

Kernel Example

__global__ void MyKernel(float* pInData, float* pOutData)
{
    extern __shared__ float sharedData[];
    const unsigned int tid = threadIdx.x;
    const unsigned int num_threads = blockDim.x;

    // global memory to shared memory
    sharedData[tid] = pInData[tid];
    __syncthreads();

    // do something
    sharedData[tid] = (float) num_threads * sharedData[tid];
    __syncthreads();

    // shared memory to global memory
    pOutData[tid] = sharedData[tid];
}

Page 32: Performance Analysis: C vs CUDA

Competitors

● AMD/ATI Close to Metal (CTM)
● RapidMind
● Acceleware
● PeakStream
  ● Unavailable since its acquisition by Google
● BrookGPU
● OpenGL/Direct3D + GLSL/HLSL/Cg
● BSGP

Page 33: Performance Analysis: C vs CUDA

Back to Work

● Brute force implementations
● 3 solutions for CPU
  ● Monothread depth-first recursive
  ● Monothread depth-first plain
  ● N-threads depth-first plain

Page 34: Performance Analysis: C vs CUDA

Back to Work

● Brute force implementations
● 3 solutions for CPU
  ● Monothread depth-first recursive
  ● Monothread depth-first plain
  ● N-threads depth-first plain
● 3 solutions for GPU
  ● Step-based breadth-first, static memory
  ● Step-based breadth-first, dynamic memory
  ● Plain depth-first, dynamic memory

Page 35: Performance Analysis: C vs CUDA

Back to Work

● Brute force implementations
● 3 solutions for CPU
  ● Monothread depth-first recursive
  ● Monothread depth-first plain
  ● N-threads depth-first plain
● 3 solutions for GPU
  ● Step-based breadth-first, static memory
  ● Step-based breadth-first, dynamic memory
  ● Plain depth-first, dynamic memory

Page 36: Performance Analysis: C vs CUDA

Back to Work

● Brute force implementations
● 3 solutions for CPU
  ● Monothread depth-first recursive
  ● Monothread depth-first plain
  ● N-threads depth-first plain
● 3 solutions for GPU
  ● Step-based breadth-first, static memory
  ● Step-based breadth-first, dynamic memory
  ● Plain depth-first, dynamic memory

Page 37: Performance Analysis: C vs CUDA

CPU Monothread Depth-first Plain

● Optimized implementation
● Single thread
● Depth-first approach
● No recursion, no function calls
● Memory buffers :)
● Fast, really fast!
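A sketch of what such a plain (non-recursive) depth-first counter can look like, with an explicit column buffer standing in for the call stack (illustrative, not the deck's actual source):

#include <cstdio>

// iterative depth-first n-Queens counter: board[col] is the row currently
// tried in that column; backtracking just moves col back and forth
long long countSolutionsPlain(int n) {
    int board[32];                           // row chosen per column (n <= 32)
    long long solutions = 0;
    int col = 0;
    board[0] = -1;

    while (col >= 0) {
        int row = ++board[col];              // try the next row in this column
        if (row == n) { --col; continue; }   // column exhausted: backtrack

        bool ok = true;
        for (int c = 0; c < col; ++c) {
            int d = col - c;
            if (board[c] == row || board[c] == row - d || board[c] == row + d) {
                ok = false;
                break;
            }
        }
        if (!ok) continue;

        if (col == n - 1) { ++solutions; continue; }   // full board placed
        board[++col] = -1;                   // advance to the next column
    }
    return solutions;
}

int main() {
    std::printf("N=8  solutions=%lld\n", countSolutionsPlain(8));   // expect 92
    return 0;
}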

Page 38: Performance Analysis: C vs CUDA

Back to Work

● Brute force implementations
● 3 solutions for CPU
  ● Monothread depth-first recursive
  ● Monothread depth-first plain
  ● N-threads depth-first plain
● 3 solutions for GPU
  ● Step-based breadth-first, static memory
  ● Step-based breadth-first, dynamic memory
  ● Plain depth-first, dynamic memory

Page 39: Performance Analysis: C vs CUDA

CPU N-threads Depth-first Plain

● N threads, where N is the board size
● First column filled in the main thread
● Creates N Linux pthreads
  ● One thread for each row of the first column
  ● Each thread processes the remaining N-1 columns
● Critical section:
  ● solutions++;
  ● saveSolution(board);
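A sketch of that layout with Linux pthreads (illustrative names; recursion is used here for brevity where the deck's version is iterative, and each worker enters the critical section once at the end instead of once per solution):

#include <pthread.h>
#include <cstdio>

static const int N = 8;                      // board size = number of threads
static long long solutions = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

// depth-first count for columns col..N-1, given board[0..col-1] already placed
static long long searchFrom(int* board, int col) {
    if (col == N) return 1;
    long long count = 0;
    for (int row = 0; row < N; ++row) {
        bool ok = true;
        for (int c = 0; c < col; ++c) {
            int d = col - c;
            if (board[c] == row || board[c] == row - d || board[c] == row + d) { ok = false; break; }
        }
        if (ok) { board[col] = row; count += searchFrom(board, col + 1); }
    }
    return count;
}

// one worker per row of the first column
static void* worker(void* arg) {
    int board[N];
    board[0] = (int)(long)arg;               // first column fixed by the main thread
    long long local = searchFrom(board, 1);  // remaining N-1 columns
    pthread_mutex_lock(&lock);               // critical section
    solutions += local;
    pthread_mutex_unlock(&lock);
    return 0;
}

int main() {
    pthread_t threads[N];
    for (long row = 0; row < N; ++row)
        pthread_create(&threads[row], 0, worker, (void*)row);
    for (int row = 0; row < N; ++row)
        pthread_join(threads[row], 0);
    std::printf("solutions = %lld\n", solutions);
    return 0;
}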

Page 40: Performance Analysis: C vs CUDA

Back to Work

● Brute force implementations
● 3 solutions for CPU
  ● Monothread depth-first recursive
  ● Monothread depth-first plain
  ● N-threads depth-first plain
● 3 solutions for GPU
  ● Step-based breadth-first, static memory
  ● Step-based breadth-first, dynamic memory
  ● Plain depth-first, dynamic memory

Pages 41-45: Performance Analysis: C vs CUDA

GPU Step Breadth-first

[Diagram sequence: at each step, every partial solution in the In buffer is expanded by N threads (Thread 1 ... Thread N), one per candidate row for the next column, and the surviving partial solutions are written to the Out buffer, which becomes the next step's In. The number of threads per step is Threads = Num. Solutions * N.]

Page 46: Performance Analysis: C vs CUDA

Why a Breadth-first Solution?

● Graphics processors are not Intel/AMD CPUs
  ● Slow: 650 MHz
  ● The driver can kill time-expensive kernels
● Lots of threads
  ● Good for the GPU
  ● Easy solution-to-thread mapping by indexes
● Fast kernels
  ● Good for the GPU

Page 47: Performance Analysis: C vs CUDA

GPU Step Breadth-first

● Static memory version
  ● Bad: one sort of the output for each step
  ● Good for the GPU
● Dynamic memory version
  ● Bad: synchronized memory access
  ● Bad: global last-output index
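A sketch of what one expansion step of the dynamic-memory version might look like (names are illustrative; atomicAdd, available from compute capability 1.1, stands in for the slide's global last-output index). The host relaunches the kernel once per column with Num. Solutions * N threads and swaps the In and Out buffers between steps:

// one breadth-first expansion step (illustrative sketch, not the original code)
// in:  numPartial boards, each with `col` queens placed (n ints per board)
// out: boards with col+1 queens; *outCount is the shared output index
__global__ void expandStep(const int* in, int numPartial, int col, int n,
                           int* out, int* outCount)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per (board, candidate row)
    if (idx >= numPartial * n) return;

    int boardIdx = idx / n;                  // which partial solution to extend
    int row      = idx % n;                  // which row to try in column `col`

    const int* board = in + boardIdx * n;
    for (int c = 0; c < col; ++c) {          // attack test against placed queens
        int d = col - c;
        if (board[c] == row || board[c] == row - d || board[c] == row + d)
            return;                          // invalid: this thread writes nothing
    }

    int slot = atomicAdd(outCount, 1);       // global last-output index
    int* dst = out + slot * n;
    for (int c = 0; c < col; ++c) dst[c] = board[c];
    dst[col] = row;
}

After the last column, *outCount holds the number of complete solutions. The static-memory variant instead writes each candidate to a fixed slot, which is why it needs the per-step sort of the output mentioned above.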

Page 48: Performance Analysis: C vs CUDA

Back to Work

● Brute force implementations
● 3 solutions for CPU
  ● Monothread depth-first recursive
  ● Monothread depth-first plain
  ● N-threads depth-first plain
● 3 solutions for GPU
  ● Step-based breadth-first, static memory
  ● Step-based breadth-first, dynamic memory
  ● Plain depth-first, dynamic memory

Page 49: Performance Analysis: C vs CUDA

Plain Depth-first Dynamic

● Best case: N^4 threads
  ● Thread indexes fill the first 4 columns
  ● Depth-first approach for the remaining columns
● Synchronized global memory access
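A sketch of the index mapping described above (illustrative, not the original code): with an N x N grid of N x N blocks, the four built-in indexes fix the queens in the first four columns and each surviving thread finishes the board with a small iterative depth-first search.

// plain depth-first kernel sketch: blockIdx/threadIdx fix the first four
// columns, the rest is searched depth-first per thread (assumes 4 <= n <= 32)
__global__ void depthFirstN4(int n, unsigned int* solutions)
{
    int board[32];                           // per-thread board, lives in local memory
    board[0] = blockIdx.x;
    board[1] = blockIdx.y;
    board[2] = threadIdx.x;
    board[3] = threadIdx.y;

    for (int col = 1; col < 4; ++col)        // discard threads with an invalid prefix
        for (int c = 0; c < col; ++c) {
            int d = col - c;
            if (board[c] == board[col] || board[c] == board[col] - d ||
                board[c] == board[col] + d)
                return;
        }
    if (n == 4) { atomicAdd(solutions, 1u); return; }

    unsigned int found = 0;                  // iterative depth-first over columns 4..n-1
    int col = 4;
    board[col] = -1;
    while (col >= 4) {
        int row = ++board[col];
        if (row == n) { --col; continue; }
        bool ok = true;
        for (int c = 0; c < col; ++c) {
            int d = col - c;
            if (board[c] == row || board[c] == row - d || board[c] == row + d) { ok = false; break; }
        }
        if (!ok) continue;
        if (col == n - 1) { ++found; continue; }
        board[++col] = -1;
    }
    if (found) atomicAdd(solutions, found);
}

// launch: depthFirstN4<<<dim3(n, n), dim3(n, n)>>>(n, dSolutionCount);   // N^4 threads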

Page 50: Performance Analysis: C vs CUDA

Implementations and Threads

Implementation | Threads
GPU-breadth-first static mem | Sol * N
GPU-breadth-first dynamic mem | Sol * N
GPU-depth-first 1-Thread | 1
GPU-depth-first n-Threads | N
GPU-depth-first n-grids | N
GPU-depth-first n*n-grids | N*N
GPU-depth-first n*n-grids*n-threads | N*N*N
GPU-depth-first n*n-grids*n*n-threads | N*N*N*N
GPU-depth-first FULL threads | N^N
CPU-Plain | 1
CPU-Recursive | 1
CPU-Plain-Threads | N

Page 51: Performance Analysis: C vs CUDA

Test Platforms

● CPU: Intel Quad Core, 2.4 GHz
  ● Ubuntu
  ● 4 GB RAM
● GPU: GeForce 9600 GT
  ● 8 multiprocessors
  ● 64 processors at 650 MHz
  ● 512 MB RAM at 900 MHz
  ● CUDA 1.0

Page 52: Performance Analysis: C vs CUDA

Results: CPU

[Chart: execution time vs. board size (12-14) for CPU-Plain, CPU-Recursive and CPU-Plain-Threads.]

Page 53: Performance Analysis: C vs CUDA

Results: GPU: Static vs Dynamic

[Chart: execution time vs. board size (11-12) for breadth-first static, breadth-first dynamic, CPU-Plain, CPU-Recursive and CPU-Plain-Threads.]

Page 54: Performance Analysis: C vs CUDA

Results: Same Number of Threads

[Chart: execution time vs. board size (12-13) for depth-first n-Threads, depth-first n-Grids and CPU-Plain-Threads.]

Page 55: Performance Analysis: C vs CUDA

Results: Only 1 Thread

[Chart: execution time vs. board size (10-12) for depth-first 1-Thread, CPU-Recursive and CPU-Plain.]

Page 56: Performance Analysis: C vs CUDA

Results: Dynamic vs Depth

[Chart: execution time at board size 12 for breadth-first dynamic, depth-first n-Threads, depth-first n-Grids, CPU-Plain, CPU-Recursive and CPU-Plain-Threads.]

Page 57: Performance Analysis: C vs CUDA

Results: Depth vs CPU

[Chart: execution time at board size 12 for depth-first n-Threads, depth-first n-Grids, depth-first n*n-grids, depth-first n*n-grids*n-threads, depth-first n*n-grids*n*n-threads, CPU-Plain, CPU-Recursive and CPU-Plain-Threads.]

Page 58: Performance Analysis: C vs CUDA

Results: GPU N^N Solution

[Chart: execution time vs. board size (7-9) for depth-first n^n, CPU-Plain, CPU-Recursive and CPU-Plain-Threads.]

Page 59: Performance Analysis: C vs CUDA

Results: Dynamic, Depth, CPU

[Chart: execution time vs. board size (10-13) for breadth-first dynamic, depth-first N*N*N*N, CPU-Plain, CPU-Recursive and CPU-Plain-Threads.]

Page 60: Performance Analysis: C vs CUDA

Results: Depth vs CPU Threads

[Chart: execution time vs. board size (14-16) for depth-first N*N*N*N, CPU-Plain and CPU-Plain-Threads.]

Page 61: Performance Analysis: C vs CUDA

Results

Solution | Threads | N=1 | N=2 | N=3 | N=4 | N=5 | N=6 | N=7 | N=8 | N=9
GPU-breadth-first static | Sol * N | 171 | 171 | 171 | 174 | 174 | 174 | 178 | 184 | 220
GPU-breadth-first dynamic | Sol * N | 171 | 171 | 171 | 173 | 173 | 173 | 173 | 173 | 174
GPU-depth-first 1-Thread | 1 | 171 | 171 | 171 | 171 | 171 | 171 | 171 | 185 | 227
GPU-depth-first n-Threads | N | 171 | 171 | 171 | 172 | 172 | 173 | 173 | 175 | 230
GPU-depth-first n-grids | N | 171 | 171 | 171 | 171 | 171 | 173 | 173 | 173 | 177
GPU-depth-first n*n-grids | N*N | 172 | 172 | 172 | 172 | 172 | 172 | 172 | 172 | 174
GPU-depth-first N^3 | N^3 | 171 | 172 | 172 | 172 | 172 | 172 | 172 | 172 | 174
GPU-depth-first N^4 | N^4 | 171 | 171 | 171 | 171 | 171 | 171 | 171 | 171 | 171
GPU-depth-first FULL | N^N | 171 | 171 | 172 | 172 | 172 | 172 | 230 | 1682 | 11420
CPU-Plain | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3
CPU-Recursive | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3
CPU-Plain-Threads | N | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 5

Page 62: Performance Analysis: C vs CUDA

Results

Solution | Threads | N=11 | N=12 | N=13 | N=14 | N=15 | N=16 | N=17
GPU-breadth-first static | Sol * N | 1234 | 6184 | Mem | Mem | Mem | Mem | Mem
GPU-breadth-first dynamic | Sol * N | 218 | 407 | 1481 | 7886 | Cont | Mem | Mem
GPU-depth-first 1-Thread | 1 | 1463 | 7198
GPU-depth-first n-Threads | N | 441 | 1561 | 7827
GPU-depth-first n-grids | N | 301 | 824 | 3604
GPU-depth-first n*n-grids | N*N | 216 | 424 | 1425 | 7025
GPU-depth-first N^3 | N^3 | 192 | 267 | 661 | 2937
GPU-depth-first N^4 | N^4 | 181 | 199 | 360 | 1369 | 7562 | 43488 | 05:38.99
GPU-depth-first FULL | N^N |
CPU-Plain | 1 | 18 | 91 | 502 | 3020 | 19685
CPU-Recursive | 1 | 35 | 198 | 1225 | 8283 | 58493
CPU-Plain-Threads | N | 17 | 84 | 290 | 1393 | 8578 | 32010 | 04:40.95

Page 63: Performance Analysis: C vs CUDA

Conclusions

● CUDA is slow
  ● Low use of the GPU's graphics resources
  ● GLSL, HLSL and Cg are faster
  ● The compiler needs improvements
  ● More documentation on assembly optimization is needed
● Unstable
  ● The GPU kills some processes (I don't know why)
● Performance depends on the implementation
● Good for mixed solutions: CPU + GPU

Page 64: Performance Analysis: C vs CUDA

Conclusions

● %, * and / are slow
● threadIdx and blockIdx are fantastic
● __shared__ memory helps
● CUDA locks the screen while processing
  ● No inter-process scheduling
  ● Synchronized architecture
● Think synchronized
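For the slow integer % and / mentioned above, a common workaround (a general trick, not something the deck claims to have used) is to replace them with shifts and masks when the divisor is a power of two:

// division/modulo by a power of two rewritten as shift/mask, e.g. for
// decomposing a thread index into a (board, row) pair when n is padded to 16
__device__ void splitIndex(int idx, int* boardIdx, int* row)
{
    *boardIdx = idx >> 4;                    // instead of idx / 16
    *row      = idx & 15;                    // instead of idx % 16
}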

Page 65: Performance Analysis: C vs CUDA

Questions?

Vitor Pamplona ([email protected])