ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 22, 2013. MemCoalescing.ppt
Memory Coalescing
These notes will demonstrate the effects of memory coalescing
Use of matrix transpose to improve matrix multiplication performance
Memory coalescing combines separate memory accesses into a single access. The GPU does this when the threads of a warp access consecutive locations in global memory.
Consider setting the elements of a two-dimensional array to given data values.
This could be done across rows or down columns.
In the following code, we will demonstrate the effects of each approach.
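As a rough host-side illustration in plain C (not part of the original slides), the sketch below computes how far apart in memory the elements touched by two neighbouring threads are when a row-major N x N array is traversed across a row versus down a column. The function names `row_major_offset`, `stride_across_row` and `stride_down_column` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Offset (in elements) of element (row, col) in a row-major n x n array. */
size_t row_major_offset(size_t row, size_t col, size_t n) {
    assert(row < n && col < n);
    return row * n + col;
}

/* Distance in elements between the locations touched by two neighbouring
   threads when they move along a row: columns 0 and 1 of row 0. */
size_t stride_across_row(size_t n) {
    return row_major_offset(0, 1, n) - row_major_offset(0, 0, n);
}

/* Same distance when neighbouring threads move down a column:
   rows 0 and 1 of column 0. Requires n >= 2, like the function above. */
size_t stride_down_column(size_t n) {
    return row_major_offset(1, 0, n) - row_major_offset(0, 0, n);
}
```

For a 32 x 32 array, `stride_across_row(32)` is 1 element while `stride_down_column(32)` is 32 elements; only the first pattern gives the consecutive locations that coalescing needs.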
Approach
Load numbers into a two-dimensional array.
The flattened global thread ID of each thread is loaded into the array element it accesses, so one can tell which thread accessed which location when the array is printed out.
Accesses are done across rows and also down columns, and the execution times are compared. In practice, a problem may dictate the access order.
GPU structure: one or more 2-D blocks in a 2-D grid. Each block is fixed at 2-D 32 x 32 threads (the maximum for compute capability 2.x).
One way (access across rows)
__global__ void gpu_Comput1 (int *h, int N, int T) {
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    int threadID = col + row * N;   // flattened global thread ID
    int index = col + row * N;      // array index: neighbouring threads in x access adjacent elements
    for (int t = 0; t < T; t++)     // loop to reduce other time effects
        h[index] = threadID;        // load array with global thread ID
}
Another way (access down columns)
__global__ void gpu_Comput2 (int *h, int N, int T) {
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    int threadID = col + row * N;   // flattened global thread ID
    int index = row + col * N;      // array index: neighbouring threads in x access elements N apart
    for (int t = 0; t < T; t++)     // loop to reduce other time effects
        h[index] = threadID;        // load array with global thread ID
}
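To make the difference between the two index formulas concrete, here is a small host-side model in plain C (a sketch, not code from the slides). It counts how many distinct memory segments one warp touches under each formula. The 128-byte segment size, the 4-byte element size, an array base aligned to a segment boundary, and the helper names are all assumptions made for illustration; the warp is taken to be 32 threads that share a row and have col = 0..31.

```c
#include <assert.h>

#define WARP_SIZE 32        /* threads per warp */
#define SEGMENT_BYTES 128   /* assumed size of one memory transaction */
#define ELEMENT_BYTES 4     /* assumed size of one int element */

/* Index formula of gpu_Comput1: neighbouring threads differ by 1. */
long index_rowwise(long row, long col, long n) { return col + row * n; }

/* Index formula of gpu_Comput2: neighbouring threads differ by n. */
long index_colwise(long row, long col, long n) { return row + col * n; }

/* Count the distinct segments touched by one warp whose 32 threads
   share `row` and have col = 0..31. Assumes the array starts on a
   segment boundary. */
int segments_per_warp(long (*index_of)(long, long, long), long row, long n) {
    long seen[WARP_SIZE];
    int count = 0;
    assert(n >= WARP_SIZE && row >= 0 && row < n);
    for (long col = 0; col < WARP_SIZE; col++) {
        long segment = index_of(row, col, n) * ELEMENT_BYTES / SEGMENT_BYTES;
        int found = 0;
        for (int i = 0; i < count; i++)
            if (seen[i] == segment)
                found = 1;
        if (!found)
            seen[count++] = segment;
    }
    return count;
}
```

Under these assumptions, `segments_per_warp(index_rowwise, 0, 32)` returns 1 and `segments_per_warp(index_colwise, 0, 32)` returns 32: the row-wise warp fits in one segment, while the column-wise warp spreads over 32 of them.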
/* ------------------------- GPU Computation 1 -----------------------------------*/
gpu_Comput1<<< Grid, Block >>>(dev_h, N, T);              // launch kernel once outside timing
cudaEventRecord( start, 0 );                              // measure start time
gpu_Comput1<<< Grid, Block >>>(dev_h, N, T);
cudaEventRecord( stop, 0 );                               // measure end time
cudaEventSynchronize( stop );                             // wait until the stop event has been recorded
cudaEventElapsedTime( &elapsed_time_ms1, start, stop );
cudaMemcpy(h, dev_h, size, cudaMemcpyDeviceToHost);       // copy results back to check them
printf("\nComputation with memory coalescing possible\n");
printArray(h,N);
printf("\nTime to calculate results on GPU: %f ms.\n", elapsed_time_ms1);
Computation 2 is similar.
Some results
A grid of one block and one iteration
Array 32 x 32
No speedup recorded because the time of other operations dominates the execution time.
A grid of one block and 1000000 iterations
Array 32 x 32
Speedup = 17.16
Repeat just to check results are consistent
A grid of 16 x 16 blocks and 10000 iterations
Array 512x512
Speedup = 12.08
Different numbers of iterations produce similar results
Different Array Sizes
1000 iterations. Block size 32 x 32. Number of blocks to suit array size.

| Array size | Speedup |
|---|---|
| 32 x 32 | 16.70 |
| 64 x 64 | 15.11 |
| 128 x 128 | 15.12 |
| 256 x 256 | 11.85 |
| 512 x 512 | 12.03 |
| 1024 x 1024 | 11.75 |
| 2048 x 2048 | 11.80 |
| 4096 x 4096 | 11.90 |
Effects of memory access in matrix multiplication
One thread is responsible for computing one result Cij and needs to access a row of A and a column of B.
Each thread accesses one row of A and one column of B. There are N^2 row/column combinations, so N^2 threads.
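The work of a single thread can be sketched on the host in plain C. This is an illustrative stand-in, not the course's kernel: `compute_element` and `matrixmult_host` are hypothetical names, the matrices are assumed to be N x N arrays of `int` stored in row-major order, and the double loop plays the role of the N^2 threads.

```c
#include <assert.h>

/* Work of the one thread responsible for C[row][col]: walk along a row
   of A and down a column of B. Matrices are n x n, row-major. */
int compute_element(const int *a, const int *b, int n, int row, int col) {
    int sum = 0;
    assert(row >= 0 && row < n && col >= 0 && col < n);
    for (int k = 0; k < n; k++)
        sum += a[row * n + k] * b[k * n + col];
    return sum;
}

/* Host stand-in for the n*n threads: one call per (row, col) pair. */
void matrixmult_host(const int *a, const int *b, int *c, int n) {
    for (int row = 0; row < n; row++)
        for (int col = 0; col < n; col++)
            c[row * n + col] = compute_element(a, b, n, row, col);
}
```

On the GPU, `row` and `col` would come from the thread and block indices instead of the two loops.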
Seen another way, in the first time period each thread accesses the first element in a row of A:
Thread 0, … Thread I, … Thread N-1, …
Consider those threads that access different rows. Given the row-major order in which A is stored, the locations those threads access are not consecutive.
Bad: memory coalescing cannot be done.
Question: how many threads access the same location?
Next, each thread accesses the first element in a column of B:
Thread 0, … Thread I, … Thread N-1, …
Consider those threads that access different columns. Given the row-major order in which B is stored, the locations those threads access are consecutive.
Good! Memory coalescing can be done.
Question: how many threads access the same location?
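The access pattern at a given step can be checked with a little index arithmetic. The plain C sketch below is a host-side model, not code from the slides. It assumes one thread per element of C, with thread (row, col) reading A[row][k] and B[k][col] at step k from row-major N x N arrays; all function names are hypothetical. It also counts, by brute force, how many of the N^2 threads read the same element as a given thread.

```c
#include <assert.h>

/* Element of A read at step k by any thread working on `row`. */
long a_index(long row, long k, long n) { return row * n + k; }

/* Element of B read at step k by any thread working on `col`. */
long b_index(long k, long col, long n) { return k * n + col; }

/* Number of the n*n threads that read the same element of A at step k
   as the threads working on `row`. */
long threads_sharing_a(long row, long k, long n) {
    long count = 0;
    assert(row >= 0 && row < n && k >= 0 && k < n);
    for (long r = 0; r < n; r++)
        for (long c = 0; c < n; c++)
            if (a_index(r, k, n) == a_index(row, k, n))
                count++;
    return count;
}

/* Number of the n*n threads that read the same element of B at step k
   as the threads working on `col`. */
long threads_sharing_b(long col, long k, long n) {
    long count = 0;
    assert(col >= 0 && col < n && k >= 0 && k < n);
    for (long r = 0; r < n; r++)
        for (long c = 0; c < n; c++)
            if (b_index(k, c, n) == b_index(k, col, n))
                count++;
    return count;
}
```

Under this mapping, threads on neighbouring rows read elements of A that are N apart, threads on neighbouring columns read adjacent elements of B, and each element is read by N threads at every step.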
How can we get better memory accesses and memory coalescing?
1. Transpose one array. Copy all rows of A to columns and all columns of A to rows before accessing A, and modify the program accordingly.
(Not mentioned in the course textbook or other NVIDIA books, although it appears to be an obvious approach. See the following slides for whether it works!)
Sequential code for a transpose using the same array:
for (i = 0; i < N; i++)
    for (j = 0; j < i; j++) {
        temp = B[i][j];
        B[i][j] = B[j][i];
        B[j][i] = temp;
    }
(In my code, I use separate arrays.)
Could be done on the host prior to copying to the device. What would the code look like if done on the device?
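A transpose into a separate array might look like the following plain C sketch. It is not the author's `transposeArray`, whose body is not shown in these notes; the name `transpose_out_of_place`, the `int` element type and the flattened row-major layout are assumptions.

```c
#include <assert.h>

/* Out-of-place transpose of a row-major n x n array: dst[j][i] = src[i][j].
   src and dst must not overlap. */
void transpose_out_of_place(const int *src, int *dst, int n) {
    assert(n > 0 && src != dst);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}
```

Every (i, j) iteration writes a different element of `dst` and only reads `src`, so the iterations are independent of one another. That independence is what would allow a device version to give each thread one element, although such a kernel is not shown here.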
/* ------ COMPUTATION DONE ON GPU USING A TRANSPOSED ARRAY-----*/
transposeArray(a, a_T, N);                                // transpose array
cudaEventRecord(start, 0);                                // timing starts before the host-device copy, so it excludes the transpose
// cudaEventSynchronize(start);                           // Needed?
cudaMemcpy(dev_a, a_T, size, cudaMemcpyHostToDevice);     // copy transposed A
cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);       // copy B
gpu_matrixmult_T<<<Grid,Block>>>(dev_a,dev_b,dev_c,N);
cudaMemcpy(c_T, dev_c, size, cudaMemcpyDeviceToHost);
cudaEventRecord(stop, 0);                                 // measure end time
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time_ms2, start, stop);
printf("Time to calculate results on GPU with transposed array: %f ms.\n", elapsed_time_ms2); // print out execution time
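The kernel `gpu_matrixmult_T` itself is not listed in these notes. As a guess at the index change it needs, the plain C sketch below multiplies using a transposed copy of A on the host. The name `matrixmult_transposed_host`, the `int` element type and the row-major layout are assumptions, and the double loop stands in for the grid of threads.

```c
#include <assert.h>

/* Compute C = A * B when only a_t, the transpose of A, is available.
   Element A[row][k] is stored at a_t[k][row], so the inner loop now
   walks down a column of a_t as well as down a column of b. */
void matrixmult_transposed_host(const int *a_t, const int *b, int *c, int n) {
    assert(n > 0);
    for (int row = 0; row < n; row++)
        for (int col = 0; col < n; col++) {
            int sum = 0;
            for (int k = 0; k < n; k++)
                sum += a_t[k * n + row] * b[k * n + col];
            c[row * n + col] = sum;
        }
}
```

The result is the same product as the untransposed version; only the order in which the elements of A are read from memory changes.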
Some results
8 x 8 array
1 block of 8 x 8 threads
Speedup = 1.62 over not transposing array
Some results
32 x 32 array
1 block of 32 x 32 threads
Speedup = 1.17 over not transposing array
Some results
256 x 256 array
8 blocks of 32 x 32 threads
Speedup = 0.89!! over not transposing array
Some results
1024 x 1024 array
32 blocks of 32 x 32 threads
Speedup = 0.93!! over not transposing array
Questions