A crash course on (C-)CUDA programming for (multi) GPU
Massimo Bernaschi, [email protected]
http://www.iac.cnr.it/~massimo


Page 1:

A crash course on (C-)CUDA programming for (multi) GPU

Massimo Bernaschi [email protected]

http://www.iac.cnr.it/~massimo

Page 2:

Single GPU part of the class based on:

Code samples available from (wget): http://twin.iac.rm.cnr.it/cccsample.tgz
Slides available from (wget): http://twin.iac.rm.cnr.it/CCC.pdf

Let us start with a few basic facts

Page 3:

A Graphics Processing Unit is not a CPU!

And there is no GPU without a CPU…

[Diagram: a CPU (control logic, cache, a few ALUs, DRAM) next to a GPU (many ALUs, DRAM)]

Page 4:

Graphics Processing Unit
• Originally dedicated to the specific operations required by video cards for accelerating graphics, GPUs have become flexible general-purpose computational engines (GPGPU):
• Optimized for high memory throughput
• Very well suited to data-parallel processing
• Traditionally GPU programming has been tricky, but CUDA, OpenCL and now OpenACC have made it affordable.

Page 5:

What is CUDA? §  CUDA=Compute Unified Device Architecture

—  Expose general-purpose GPU computing as a first-class capability

—  Retain traditional DirectX/OpenGL graphics performance

§  CUDA C
  —  Based on industry-standard C
  —  A handful of language extensions to allow heterogeneous programs
  —  Straightforward APIs to manage devices, memory, etc.

Page 6:

CUDA Programming Model
• The GPU is viewed as a compute device that:
  – has its own RAM (device memory)
  – runs data-parallel portions of an application as kernels, using many threads
• GPU vs. CPU threads
  – GPU threads are extremely lightweight
    • Very little creation overhead
  – The GPU needs 1000s of threads for full efficiency
    • A multi-core CPU needs only a few (basically one thread per core)

Page 7:

CUDA C Jargon: The Basics
§ Host – the CPU and its memory (host memory)
§ Device – the GPU and its memory (device memory)

Page 8:

What the Programmer Expresses in CUDA
• Computation partitioning (where does computation occur?)
  • Declarations on functions: __host__, __global__, __device__
  • Mapping of thread programs to the device: compute<<<gs, bs>>>(<args>)
• Data partitioning (where does data reside, who may access it and how?)
  • Declarations on data: __shared__, __device__, __constant__, …
• Data management and orchestration
  • Copying to/from the host: e.g., cudaMemcpy(h_obj, d_obj, size, cudaMemcpyDeviceToHost)
• Concurrency management
  • e.g., __syncthreads()

[Diagram: HOST (CPU) with processors P and memory M, DEVICE (GPU) with processors P and memory M, and an interconnect between devices and memories]

Page 9:

Code Samples  

1. Connect to the system: ssh -X [email protected]
2. Get the code samples:
   wget http://twin.iac.rm.cnr.it/ccc.tgz
   wget http://twin.iac.rm.cnr.it/cccsample.tgz
3. Unpack the samples: tar zxvf cccsample.tgz
4. Go to the source directory: cd cccsample/source
5. Get the CUDA C Quick Reference: wget http://twin.iac.rm.cnr.it/CUDA_C_QuickRef.pdf

Page 10:

Hello, World!

#include <stdio.h>

int main( void ) {
    printf( "Hello, World!\n" );
    return 0;
}

• To compile: nvcc -o hello_world hello_world.cu
• To execute: ./hello_world
• This basic program is just standard C that runs on the host
• NVIDIA's compiler (nvcc) will not complain about CUDA programs with no device code
• At its simplest, CUDA C is just C!

Page 11:

Hello, World! with Device Code

__global__ void kernel( void ) {
}

int main( void ) {
    kernel<<<1,1>>>();
    printf( "Hello, World!\n" );
    return 0;
}

• Two notable additions to the original "Hello, World!"

To compile: nvcc -o simple_kernel simple_kernel.cu
To execute: ./simple_kernel

Page 12:

Hello, World! with Device Code

__global__ void kernel( void ) {
}

• The CUDA C keyword __global__ indicates that a function
  – Runs on the device
  – Is called from host code
• nvcc splits the source file into host and device components
  – NVIDIA's compiler handles device functions like kernel()
  – A standard host compiler handles host functions like main()
    • gcc, icc, …
    • Microsoft Visual C

Page 13:

Hello, World! with Device Code

int main( void ) {
    kernel<<< 1, 1 >>>();
    printf( "Hello, World!\n" );
    return 0;
}

• Triple angle brackets mark a call from host code to device code
  – A "kernel launch" in CUDA jargon
  – We'll discuss the parameters inside the angle brackets later
• This is all that's required to execute a function on the GPU!
• The function kernel() does nothing, so let us run something a bit more useful:

Page 14:

A kernel to make an addition

__global__ void add( int a, int b, int *c ) {
    *c = a + b;
}

• Compile: nvcc -o addint addint.cu
• Run: ./addint
• Look at line 28 of the addint.cu code. How do we pass the first two arguments?
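As a hint (a sketch only; the actual addint.cu code may differ), a and b can be passed by value directly in the kernel launch, while c must point to device memory allocated with cudaMalloc:

    int c;                        // host copy of the result
    int *dev_c;                   // device pointer for the result
    cudaMalloc( (void**)&dev_c, sizeof( int ) );

    add<<< 1, 1 >>>( 2, 7, dev_c );   // a and b are plain ints, passed by value

    cudaMemcpy( &c, dev_c, sizeof( int ), cudaMemcpyDeviceToHost );
    cudaFree( dev_c );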

Page 15:

A More Complex Example
• Another kernel to add two integers:

__global__ void add( int *a, int *b, int *c ) {
    *c = *a + *b;
}

• As before, __global__ is a CUDA C keyword meaning
  – add() will execute on the device
  – add() will be called from the host

Page 16:

A More Complex Example
• Notice that now we use pointers for all of our variables:

__global__ void add( int *a, int *b, int *c ) {
    *c = *a + *b;
}

• add() runs on the device… so a, b, and c must point to device memory
• How do we allocate memory on the GPU?

Page 17:

Memory Management
• Up to CUDA 4.0, host and device memory were distinct entities from the programmer's viewpoint
  – Device pointers point to GPU memory
    • May be passed to and from host code
    • (In general) may not be dereferenced in host code
  – Host pointers point to CPU memory
    • May be passed to and from device code
    • (In general) may not be dereferenced in device code
  Starting with CUDA 4.0 there is a Unified Virtual Addressing feature.
• Basic CUDA API for dealing with device memory
  – cudaMalloc(&p, size), cudaFree(p), cudaMemcpy(t, s, size, direction)
  – Similar to their C equivalents: malloc(), free(), memcpy()
  – Note that cudaMalloc takes a pointer to a pointer

Page 18:

A More Complex Example: main()

int main( void ) {
    int a, b, c;                    // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;     // device copies of a, b, c
    int size = sizeof( int );       // we need space for an integer

    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, size );

    a = 2;
    b = 7;

    // copy inputs to device
    cudaMemcpy( dev_a, &a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, &b, size, cudaMemcpyHostToDevice );

    // launch add() kernel on GPU, passing parameters
    add<<< 1, 1 >>>( dev_a, dev_b, dev_c );

    // copy device result back to host copy of c
    cudaMemcpy( &c, dev_c, size, cudaMemcpyDeviceToHost );

    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );
    return 0;
}

Page 19:

Parallel Programming in CUDA C
• But wait… GPU computing is about massive parallelism
• So how do we run code in parallel on the device?
• The solution lies in the parameters between the triple angle brackets:

  add<<< 1, 1 >>>( dev_a, dev_b, dev_c );
  add<<< N, 1 >>>( dev_a, dev_b, dev_c );

• Instead of executing add() once, add() is executed N times in parallel

Page 20:

Parallel Programming in CUDA C
• With add() running in parallel… let's do vector addition
• Terminology: each parallel invocation of add() is referred to as a block
• The kernel can refer to its block's index with the variable blockIdx.x
• Each block adds a value from a[] and b[], storing the result in c[]:

__global__ void add( int *a, int *b, int *c ) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• By using blockIdx.x to index the arrays, each block handles a different index
• blockIdx.x is the first example of a CUDA predefined variable.

Page 21:

Parallel Programming in CUDA C
§ We write this code:

__global__ void add( int *a, int *b, int *c ) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

§ This is what runs in parallel on the device:

  Block 0: c[0] = a[0] + b[0];
  Block 1: c[1] = a[1] + b[1];
  Block 2: c[2] = a[2] + b[2];
  Block 3: c[3] = a[3] + b[3];

Page 22:

Parallel Addition: main()

#define N 512

int main( void ) {
    int *a, *b, *c;                 // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;     // device copies of a, b, c
    int size = N * sizeof( int );   // we need space for 512 integers

    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, size );

    a = (int*)malloc( size );
    b = (int*)malloc( size );
    c = (int*)malloc( size );

    random_ints( a, N );
    random_ints( b, N );

Page 23:

Parallel Addition: main() (cont)

    // copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );

    // launch add() kernel with N parallel blocks
    add<<< N, 1 >>>( dev_a, dev_b, dev_c );

    // copy device result back to host copy of c
    cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );

    free( a ); free( b ); free( c );
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );
    return 0;
}

Page 24:

Review
§ Difference between "host" and "device"
  — Host = CPU
  — Device = GPU
§ Using __global__ to declare a function as device code
  — Runs on the device
  — Called from the host

§  Passing parameters from host code to a device function

Page 25:

Review (cont)
§ Basic device memory management
  — cudaMalloc()
  — cudaMemcpy()
  — cudaFree()
§ Launching parallel kernels
  — Launch N copies of add() with: add<<< N, 1 >>>();
  — blockIdx.x gives access to each block's index

•  Exercise: look at, compile and run the add_simple_blocks.cu code

Page 26:

Threads
§ Terminology: a block can be split into parallel threads
§ Let's change vector addition to use parallel threads instead of parallel blocks:

__global__ void add( int *a, int *b, int *c ) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

§ We use threadIdx.x instead of blockIdx.x in add()
§ main() will require one change as well…

Page 27:

Parallel Addition (Threads): main()

#define N 512

int main( void ) {
    int *a, *b, *c;                 // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;     // device copies of a, b, c
    int size = N * sizeof( int );   // we need space for 512 integers

    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, size );

    a = (int*)malloc( size );
    b = (int*)malloc( size );
    c = (int*)malloc( size );

    random_ints( a, N );
    random_ints( b, N );

Page 28:

Parallel Addition (Threads): main() (cont)

    // copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );

    // launch add() kernel with N threads: <<< 1, N >>>
    // (the blocks version used <<< N, 1 >>>)
    add<<< 1, N >>>( dev_a, dev_b, dev_c );

    // copy device result back to host copy of c
    cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );

    free( a ); free( b ); free( c );
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );
    return 0;
}

Exercise: compile and run the add_simple_threads.cu code

Page 29:

Using Threads And Blocks
§ We've seen parallel vector addition using
  — Many blocks with 1 thread apiece
  — 1 block with many threads
§ Let's adapt vector addition to use lots of both blocks and threads
§ After using threads and blocks together, we'll talk about why threads

§  First let’s discuss data indexing…

Page 30:

Indexing Arrays With Threads & Blocks
• No longer as simple as just using threadIdx.x or blockIdx.x as an index
• To index an array with 1 thread per entry (using 8 threads/block):

  blockIdx.x = 0: threadIdx.x = 0 1 2 3 4 5 6 7
  blockIdx.x = 1: threadIdx.x = 0 1 2 3 4 5 6 7
  blockIdx.x = 2: threadIdx.x = 0 1 2 3 4 5 6 7
  blockIdx.x = 3: threadIdx.x = 0 1 2 3 4 5 6 7

• If we have M threads/block, a unique array index for each entry is given by

  int index = threadIdx.x + blockIdx.x * M;

  (analogous to 2D indexing: int index = x + y * width;)

Page 31:

Indexing Arrays: Example
• In this example, the highlighted entry (thread 5 of block 2) has an index of 21:

  int index = threadIdx.x + blockIdx.x * M
            = 5 + 2 * 8
            = 21;

  with M = 8 threads/block and blockIdx.x = 2
  [Diagram: array entries laid out in consecutive blocks of 8 threads; entry 21 is highlighted]

Page 32:

Indexing Arrays: other examples (4 blocks with 4 threads per block)

Page 33:

Addition with Threads and Blocks
§ blockDim.x is a built-in variable for the number of threads per block:

  int index = threadIdx.x + blockIdx.x * blockDim.x;

§ gridDim.x is a built-in variable for the number of blocks in a grid
§ A combined version of our vector addition kernel using both blocks and threads:

__global__ void add( int *a, int *b, int *c ) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

§  So what changes in main() when we use both blocks and threads?

Page 34:

Parallel Addition (Blocks/Threads): main()

#define N (2048*2048)
#define THREADS_PER_BLOCK 512

int main( void ) {
    int *a, *b, *c;                 // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;     // device copies of a, b, c
    int size = N * sizeof( int );   // we need space for N integers

    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, size );

    a = (int*)malloc( size );
    b = (int*)malloc( size );
    c = (int*)malloc( size );

    random_ints( a, N );
    random_ints( b, N );

Page 35:

Parallel Addition (Blocks/Threads): main() (cont)

    // copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );

    // launch add() kernel with blocks and threads
    add<<< N/THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c );

    // copy device result back to host copy of c
    cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );

    free( a ); free( b ); free( c );
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );
    return 0;
}

Exercise: compile and run the add_simple.cu code

Page 36:

Exercise: array reversal
• Start from the arrayReversal.cu code. The code must fill an array d_out with the contents of an input array d_in in reverse order, so that if d_in is [100, 110, 200, 220, 300] then d_out must be [300, 220, 200, 110, 100].
• You must complete lines 13, 14 and 34.
• Remember that:
  – blockDim.x is the number of threads per block;
  – gridDim.x is the number of blocks in a grid.

Page 37:

Array Reversal: Solution

__global__ void reverseArrayBlock( int *d_out, int *d_in ) {
    int in  = blockDim.x * blockIdx.x + threadIdx.x;
    int out = (blockDim.x * gridDim.x - 1) - in;
    d_out[out] = d_in[in];
}

Page 38:

Why Bother With Threads?
§ Threads seem unnecessary
  — Added a level of abstraction and complexity
  — What did we gain?
§ Unlike parallel blocks, parallel threads have mechanisms to
  — Communicate
  — Synchronize

§  Let’s see how…

Page 39:

Dot Product
§ Unlike vector addition, the dot product is a reduction from vectors to a scalar

c = a · b

c = (a0, a1, a2, a3) · (b0, b1, b2, b3)

c = a0 b0 + a1 b1 + a2 b2 + a3 b3

Page 40:

Dot Product
• Parallel threads have no problem computing the pairwise products:
• So we can start a dot product CUDA kernel by doing just that:

__global__ void dot( int *a, int *b, int *c ) {
    // Each thread computes a pairwise product
    int temp = a[threadIdx.x] * b[threadIdx.x];

Page 41:

Dot Product
• But we need to share data between threads to compute the final sum:

__global__ void dot( int *a, int *b, int *c ) {
    // Each thread computes a pairwise product
    int temp = a[threadIdx.x] * b[threadIdx.x];

    // Can't compute the final sum here:
    // each thread's copy of 'temp' is private!
}

Page 42:

Sharing Data Between Threads
§ Terminology: a block of threads shares memory called… shared memory
§ Extremely fast, on-chip memory (user-managed cache)
§ Declared with the __shared__ CUDA keyword
§ Not visible to threads in other blocks running in parallel

[Diagram: Block 0, Block 1, Block 2, …, each with its own Threads and Shared Memory]

Page 43:

Parallel Dot Product: dot()
§ We perform parallel multiplication, serial addition:

#define N 512

__global__ void dot( int *a, int *b, int *c ) {
    // Shared memory for the results of the multiplication
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

    // Thread 0 sums the pairwise products
    if( 0 == threadIdx.x ) {
        int sum = 0;
        for( int i = N-1; i >= 0; i-- ) {
            sum += temp[i];
        }
        *c = sum;
    }
}

Page 44:

Parallel Dot Product Recap §  We perform parallel, pairwise multiplications

§  Shared memory stores each thread’s result

§  We sum these pairwise products from a single thread

§  Sounds good…

But…

Exercise: compile and run dot_simple_threads.cu. Does it work as expected?

Page 45:

Faulty Dot Product Exposed! §  Step 1: In parallel, each thread writes a pairwise product

§  Step 2: Thread 0 reads and sums the products

§  But there’s an assumption hidden in Step 1…

[Diagram: the __shared__ int temp[] array is written in Step 1 and read in Step 2]

Page 46:

Read-Before-Write Hazard §  Suppose thread 0 finishes its write in step 1

§  Then thread 0 reads index 12 in step 2

§  Before thread 12 writes to index 12 in step 1

This read returns garbage!

Page 47:

Synchronization
§ We need threads to wait between the sections of dot():

__global__ void dot( int *a, int *b, int *c ) {
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

    // * NEED THREADS TO SYNCHRONIZE HERE *
    // No thread can advance until all threads
    // have reached this point in the code

    // Thread 0 sums the pairwise products
    if( 0 == threadIdx.x ) {
        int sum = 0;
        for( int i = N-1; i >= 0; i-- ) {
            sum += temp[i];
        }
        *c = sum;
    }
}

Page 48:

__syncthreads()

§ We can synchronize threads with the function __syncthreads()

§  Threads in the block wait until all threads have hit the __syncthreads()

§  Threads are only synchronized within a block!

[Diagram: Thread 0, Thread 1, Thread 2, Thread 3, Thread 4, … each reaching __syncthreads() before any of them proceeds]

Page 49:

Parallel Dot Product: dot()

__global__ void dot( int *a, int *b, int *c ) {
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

    __syncthreads();

    if( 0 == threadIdx.x ) {
        int sum = 0;
        for( int i = N-1; i >= 0; i-- ) {
            sum += temp[i];
        }
        *c = sum;
    }
}

§ With a properly synchronized dot() routine, let's look at main()

Page 50:

Parallel Dot Product: main()

#define N 512

int main( void ) {
    int *a, *b, *c;                 // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;     // device copies of a, b, c
    int size = N * sizeof( int );   // we need space for N integers

    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, sizeof( int ) );

    a = (int *)malloc( size );
    b = (int *)malloc( size );
    c = (int *)malloc( sizeof( int ) );

    random_ints( a, N );
    random_ints( b, N );

Page 51:

Parallel Dot Product: main() (cont)

    // copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );

    // launch dot() kernel with 1 block and N threads
    dot<<< 1, N >>>( dev_a, dev_b, dev_c );

    // copy device result back to host copy of c
    cudaMemcpy( c, dev_c, sizeof( int ), cudaMemcpyDeviceToHost );

    free( a ); free( b ); free( c );
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );
    return 0;
}

Exercise: insert __syncthreads() in the dot kernel. Compile and run.

Page 52:

Review
§ Launching kernels with parallel threads
  — Launch add() with N threads: add<<< 1, N >>>();
  — Used threadIdx.x to access each thread's index
§ Using both blocks and threads
  — Used (threadIdx.x + blockIdx.x * blockDim.x) to index input/output
  — N/THREADS_PER_BLOCK blocks and THREADS_PER_BLOCK threads gave us N threads in total

Page 53:

Review (cont)
§ Using __shared__ to declare memory as shared memory
  — Data is shared among threads in a block
  — Not visible to threads in other parallel blocks
§ Using __syncthreads() as a barrier
  — No thread executes instructions after __syncthreads() until all threads have reached the __syncthreads()
  — Needs to be used to prevent data hazards

Page 54:

Multiblock Dot Product
• Recall our dot product launch:

  // launch dot() kernel with 1 block and N threads
  dot<<< 1, N >>>( dev_a, dev_b, dev_c );

• Launching with one block will not utilize much of the GPU
• Let's write a multiblock version of the dot product

Page 55:

Multiblock Dot Product: Algorithm §  Each block computes a sum of its pairwise products like before:

Page 56:

Multiblock Dot Product: Algorithm §  And then contributes its sum to the final result:

Page 57:

Multiblock Dot Product: dot()

#define THREADS_PER_BLOCK 512
#define N (1024*THREADS_PER_BLOCK)

__global__ void dot( int *a, int *b, int *c ) {
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];

    __syncthreads();

    if( 0 == threadIdx.x ) {
        int sum = 0;
        for( int i = 0; i < THREADS_PER_BLOCK; i++ ) {
            sum += temp[i];
        }
        *c += sum;
    }
}

§ But we have a race condition on *c += sum; … Compile and run dot_simple_multiblock.cu
§ We can fix it with one of CUDA's atomic operations: replace *c += sum; with atomicAdd( c, sum ); in the dot kernel.
  Compile with: nvcc -o dot_simple_multiblock dot_simple_multiblock.cu -arch=sm_35

Page 58:

Race Conditions
§ Terminology: a race condition occurs when program behavior depends upon the relative timing of two (or more) event sequences
§ What actually takes place to execute the line in question: *c += sum;
  — Read value at address c
  — Add sum to value
  — Write result to address c
  Terminology: Read-Modify-Write
§ What if two threads are trying to do this at the same time?
§ Thread 0, Block 0
  — Read value at address c
  — Add sum to value
  — Write result to address c
§ Thread 0, Block 1
  — Read value at address c
  — Add sum to value
  — Write result to address c

Page 59:

Global Memory Contention
*c += sum;   (c starts at 0; Block 0 has sum = 3, Block 1 has sum = 4)

  Block 0 (Read-Modify-Write): reads c (0), computes 0+3 = 3, writes 3
  Block 1 (Read-Modify-Write): reads c (3), computes 3+4 = 7, writes 7
  Value of c over time: 0, 0, 3, 3, 7

Here the two Read-Modify-Write sequences happen not to overlap, so the result is correct.

Page 60:

Global Memory Contention (continued)
*c += sum;   (c starts at 0; Block 0 has sum = 3, Block 1 has sum = 4)

  Block 0 (Read-Modify-Write): reads c (0), computes 0+3 = 3, writes 3
  Block 1 (Read-Modify-Write): reads c (0), computes 0+4 = 4, writes 4
  Value of c over time: 0, 0, 0, 3, 4

Here the two Read-Modify-Write sequences interleave: Block 1 reads c before Block 0 has written, so the final result is 4 instead of 7.

Page 61:

Atomic Operations
§ Terminology: a read-modify-write operation is uninterruptible when it is atomic
§ Many atomic operations on memory are available in CUDA C
§ Predictable results when simultaneous access to memory is required
§ We need to atomically add sum to c in our multiblock dot product
§ Available atomics include:
  atomicAdd(), atomicSub(), atomicMin(), atomicMax(),
  atomicInc(), atomicDec(), atomicExch(),
  atomicCAS()  (compare-and-swap: old == compare ? val : old)
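As an aside (a sketch, not part of the course code): atomicCAS() can be used to build atomic updates that are not provided directly. A well-known pattern is an atomic add for double precision on hardware that lacks it (the function name here is arbitrary):

    __device__ double atomicAddDouble( double *address, double val )
    {
        // Reinterpret the address as a 64-bit integer so atomicCAS can be used
        unsigned long long int *address_as_ull = (unsigned long long int *)address;
        unsigned long long int old = *address_as_ull, assumed;
        do {
            assumed = old;
            // Try to store (assumed + val); succeeds only if *address still holds 'assumed'
            old = atomicCAS( address_as_ull, assumed,
                             __double_as_longlong( val + __longlong_as_double( assumed ) ) );
        } while( assumed != old );   // retry if another thread updated *address first
        return __longlong_as_double( old );
    }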

Page 62:

Multiblock Dot Product: dot()

__global__ void dot( int *a, int *b, int *c ) {
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];

    __syncthreads();

    if( 0 == threadIdx.x ) {
        int sum = 0;
        for( int i = 0; i < THREADS_PER_BLOCK; i++ ) {
            sum += temp[i];
        }
        atomicAdd( c, sum );
    }
}

§ Now let's fix up main() to handle a multiblock dot product

Page 63:

Parallel Dot Product: main()

#define THREADS_PER_BLOCK 512
#define N (1024*THREADS_PER_BLOCK)

int main( void ) {
    int *a, *b, *c;                 // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;     // device copies of a, b, c
    int size = N * sizeof( int );   // we need space for N ints

    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, sizeof( int ) );

    a = (int *)malloc( size );
    b = (int *)malloc( size );
    c = (int *)malloc( sizeof( int ) );

    random_ints( a, N );
    random_ints( b, N );

Page 64:

Parallel Dot Product: main() (cont)

    // copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );

    // launch dot() kernel
    dot<<< N/THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c );

    // copy device result back to host copy of c
    cudaMemcpy( c, dev_c, sizeof( int ), cudaMemcpyDeviceToHost );

    free( a ); free( b ); free( c );
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );
    return 0;
}

Page 65:

Review
§ Race conditions
  — Behavior depends upon the relative timing of multiple event sequences
  — Can occur when an implied read-modify-write is interruptible
§ Atomic operations
  — CUDA provides read-modify-write operations guaranteed to be atomic
  — Atomics ensure correct results when multiple threads modify memory
  — To use atomic operations, a suitable option (-arch=…) must be used at compile time

Page 66:

Eratosthenes' Sieve
• A sieve has holes in it and is used to filter out the juice.
• Eratosthenes' sieve filters out numbers to find the prime numbers.

Page 67:

Definition
• Prime number: a number that has only two factors, itself and 1.
  Example: 7 is prime because the only numbers that will divide into it evenly are 1 and 7.

Page 68:

The Eratosthenes' Sieve

Page 69:

The prime numbers from 1 to 120 are the following:
2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113

Download (wget): http://twin.iac.rm.cnr.it/eratostenePr.cu

Page 70:

CUDA Thread Organization: Grids and Blocks
• A kernel is executed as a 1D, 2D or 3D grid of thread blocks.
  – All threads share the global memory
• A thread block is a 1D, 2D or 3D batch of threads that can cooperate with each other by:
  – Synchronizing their execution
    • For hazard-free shared memory accesses
  – Efficiently sharing data through a low-latency shared memory
• Thread blocks are independent of each other and can be executed in any order!

[Diagram: the Host launches Kernel 1 as Grid 1 (blocks (0,0)…(2,1)) and Kernel 2 as Grid 2 on the Device; Block (1,1) is expanded to show its threads (0,0)…(4,2). Courtesy: NVIDIA]

Page 71:

Built-in Variables to Manage Grids and Blocks
• dim3 gridDim;
  – Dimensions of the grid in blocks
• dim3 blockDim;
  – Dimensions of the block in threads
• dim3 blockIdx;
  – Block index within the grid
• dim3 threadIdx;
  – Thread index within the block

dim3: a datatype defined by CUDA: struct dim3 { unsigned int x, y, z; }; three unsigned ints where any unspecified component defaults to 1.
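As a small illustrative sketch (the kernel name is a placeholder), dim3 values are typically used on the host to configure a launch:

    dim3 block( 8, 8 );     // 8x8 = 64 threads per block; z defaults to 1
    dim3 grid( 16, 16 );    // 16x16 = 256 blocks in the grid
    kernel<<< grid, block >>>( /* arguments */ );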

Page 72:

Bi-dimensional thread configuration by example: set the elements of a square matrix (assume the matrix is a single block of memory!)

__global__ void kernel( int *a, int dimx, int dimy )
{
    int ix  = blockIdx.x*blockDim.x + threadIdx.x;
    int iy  = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;
    a[idx] = idx+1;
}

int main()
{
    int dimx = 16;
    int dimy = 16;
    int num_bytes = dimx*dimy*sizeof(int);

    int *d_a = 0, *h_a = 0;   // device and host pointers

    h_a = (int*)malloc( num_bytes );
    cudaMalloc( (void**)&d_a, num_bytes );

    dim3 grid, block;
    block.x = 4;
    block.y = 4;
    grid.x  = dimx / block.x;
    grid.y  = dimy / block.y;

    kernel<<<grid, block>>>( d_a, dimx, dimy );

    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for( int row = 0; row < dimy; row++ ) {
        for( int col = 0; col < dimx; col++ )
            printf( "%d ", h_a[row*dimx+col] );
        printf( "\n" );
    }

    free( h_a );
    cudaFree( d_a );
    return 0;
}

Exercise: compile and run: setmatrix.cu

Page 73:

Matrix multiply with one thread block
• One block of threads computes the matrix dP
  – Each thread computes one element of dP
• Each thread
  – Loads a row of matrix dM
  – Loads a column of matrix dN
  – Performs one multiply and addition for each pair of dM and dN elements
• The size of the matrix is limited by the number of threads allowed in a thread block!

[Diagram: Grid 1, Block 1; Thread (2,2) multiplies the dM row (3, 2, 5, 4) by the dN column (2, 4, 2, 6) to produce the dP element 48; WIDTH is the matrix dimension]

Page 74:

Exercise: matrix-matrix multiply
• Start from MMGlob.cu
• Complete the function MatrixMulOnDevice and the kernel MatrixMulKernel
• Use one bi-dimensional block
• Threads will have an x (threadIdx.x) and a y identifier (threadIdx.y)
• Max size of the matrix: 16
Compile: nvcc -o MMGlob MMGlob.cu
Run: ./MMGlob N

__global__ void MatrixMulKernel(DATA* dM, DATA* dN, DATA* dP, int Width)
{
    DATA Pvalue = 0.0;
    for (int k = 0; k < Width; ++k) {
        DATA Melement = dM[      ];
        DATA Nelement = dN[      ];
        Pvalue += Melement * Nelement;
    }
    dP[      ] = Pvalue;
}
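One possible way to fill in the blanks (a sketch, not necessarily the reference solution of MMGlob.cu): with a single bi-dimensional block, threadIdx.y selects the row of dM and threadIdx.x the column of dN, and the matrices are assumed to be stored row-major in a single array of Width*Width elements:

    __global__ void MatrixMulKernel(DATA* dM, DATA* dN, DATA* dP, int Width)
    {
        DATA Pvalue = 0.0;
        for (int k = 0; k < Width; ++k) {
            DATA Melement = dM[ threadIdx.y*Width + k ];   // element (row, k) of dM
            DATA Nelement = dN[ k*Width + threadIdx.x ];   // element (k, col) of dN
            Pvalue += Melement * Nelement;
        }
        dP[ threadIdx.y*Width + threadIdx.x ] = Pvalue;    // element (row, col) of dP
    }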

Page 75:

CUDA Compiler: nvcc basic options
• -arch=sm_13 (or sm_20, sm_35): enable double precision (on compatible hardware)
• -G: enable debugging for device code
• --ptxas-options=-v: show register and memory usage
• --maxrregcount <N>: limit the number of registers
• -use_fast_math: use the fast math library
• -O3: enable compiler optimization
• -ccbin compiler_path: use a different C compiler (e.g., Intel instead of gcc)

Exercises:
1. Re-compile the MMGlob.cu program and check how many registers the kernel uses:
   nvcc -o MMGlob MMGlob.cu --ptxas-options=-v
2. Edit MMGlob.cu and modify DATA from float to double, then compile for double precision and run

Page 76:

CUDA Error Reporting to the CPU
• All CUDA calls return an error code:
  – except kernel launches
  – cudaError_t type
• cudaError_t cudaGetLastError(void)
  – returns the code for the last error (cudaSuccess means "no error")
• char* cudaGetErrorString(cudaError_t code)
  – returns a null-terminated character string describing the error

printf( "%s\n", cudaGetErrorString( cudaGetLastError() ) );
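In practice it is convenient to wrap every call in a small checking helper. A minimal sketch (the macro name is arbitrary, not part of the course code):

    #include <stdio.h>
    #include <stdlib.h>

    #define CUDA_CHECK( call )                                             \
        do {                                                               \
            cudaError_t err = (call);                                      \
            if( err != cudaSuccess ) {                                     \
                fprintf( stderr, "CUDA error at %s:%d: %s\n",              \
                         __FILE__, __LINE__, cudaGetErrorString( err ) );  \
                exit( EXIT_FAILURE );                                      \
            }                                                              \
        } while( 0 )

    /* usage: wrap API calls, and query cudaGetLastError() after a kernel launch */
    CUDA_CHECK( cudaMalloc( (void**)&dev_a, size ) );
    kernel<<< grid, block >>>( dev_a );
    CUDA_CHECK( cudaGetLastError() );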

Page 77:

CUDA Event API
• Events are inserted (recorded) into CUDA call streams
• Usage scenarios:
  – measure elapsed time (in milliseconds) for CUDA calls (resolution ~0.5 microseconds)
  – query the status of an asynchronous CUDA call
  – block the CPU until CUDA calls prior to the event are completed

cudaEvent_t start, stop;
float et;
cudaEventCreate( &start );
cudaEventCreate( &stop );

cudaEventRecord( start, 0 );

kernel<<< grid, block >>>( ... );

cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );

cudaEventElapsedTime( &et, start, stop );

cudaEventDestroy( start );
cudaEventDestroy( stop );
printf( "Kernel execution time=%f\n", et );

Read the function MatrixMulOnDevice of the MMGlobWT.cu program. Compile and run.

Page 78:

Execution Model (Software → Hardware)
• Thread → Scalar Processor
  – Threads are executed by scalar processors
• Thread Block → Multiprocessor
  – Thread blocks are executed on multiprocessors
  – Thread blocks do not migrate
  – Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)
• Grid → Device
  – A kernel is launched as a grid of thread blocks

Page 79:

Transparent Scalability
• The hardware is free to assign blocks to any processor at any time
  – A kernel scales across any number of parallel processors

[Diagram: the same kernel grid of Blocks 0-7 runs over time on a device with 2 SMs as four waves of 2 blocks, and on a device with 4 SMs as two waves of 4 blocks]

Each block can execute in any order relative to other blocks!

Page 80:

Warps and Half Warps
• A thread block consists of 32-thread warps
• A warp is executed physically in parallel (SIMD) on a multiprocessor
• A half-warp of 16 threads can coordinate global memory accesses into a single transaction

[Diagram: a thread block split into 32-thread warps on a multiprocessor; each warp split into two 16-thread half-warps accessing device memory (global and local) in DRAM]

Page 81:

Executing Thread Blocks
• Threads are assigned to Streaming Multiprocessors (SM) at block granularity
  – Up to 16 blocks per SM on Kepler, 8 blocks per Fermi SM
  – A Fermi SM can take up to 1536 threads
    • Could be 256 (threads/block) * 6 blocks or 192 (threads/block) * 8 blocks but, for instance, 128 (threads/block) * 12 blocks is not allowed!
  – A Kepler SM can take up to 2048 threads
• Threads run concurrently
  – The SM manages/schedules thread execution

[Diagram: blocks of threads t0, t1, t2, …, tm assigned to SM 0 and SM 1, each SM with its scalar processors (SP), instruction unit (MT IU) and Shared Memory]

The number of threads in a block depends on the capability (i.e., the version) of the GPU. On a Fermi or Kepler GPU a block may have up to 1024 threads.

Page 82:

Block Granularity Considerations
• For a kernel running on Fermi and using multiple blocks, should we use 8x8, 16x16 or 32x32 blocks?
  – For 8x8, we have 64 threads per block. Since each Fermi SM can take up to 1536 threads, that would be 24 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM!
  – For 32x32, we have 1024 threads per block. Only one block can fit into an SM!
  – For 16x16, we have 256 threads per block. Since each Fermi SM can take up to 1536 threads, it can take up to 6 blocks and achieve full capacity, unless other resource considerations overrule.

Repeat the analysis for a Kepler card!

Page 83:

Programmer View of the Register File
• There are up to 32768 (32-bit) registers in each Fermi SM
• There are up to 65536 (32-bit) registers in each Kepler SM
  – This is an implementation decision, not part of CUDA
  – Registers are dynamically partitioned across all blocks assigned to the SM
  – Once assigned to a block, a register is NOT accessible by threads in other blocks
  – Each thread in the same block only accesses registers assigned to itself

[Diagram: the register file partitioned among 4 blocks vs. 3 blocks]

Page 84:

Registers "occupancy" example
• If a block has 16x16 threads and each thread uses 30 registers, how many blocks can run on each Fermi SM?
  – Each block requires 30*256 = 7680 registers
  – 32768/7680 = 4 + change (2048)
  – So, 4 blocks can run on an SM as far as registers are concerned
• How about if each thread increases its use of registers by 10%?
  – Each block now requires 33*256 = 8448 registers
  – 32768/8448 = 3 + change (7424)
  – Only three blocks can run on an SM: a 25% reduction in parallelism!!!

Repeat the analysis for a Kepler card!

Page 85:

Optimizing threads per block
• Choose threads per block as a multiple of the warp size (32)
  • Avoid wasting computation on under-populated warps
  • Facilitates efficient memory access (coalescing)
• Run as many warps as possible per multiprocessor (hide latency)
  • An SM can run up to 16 blocks at a time (on Kepler)
• Heuristics
  • Minimum: 64 threads per block
  • 192 or 256 threads is a better choice
    • Usually still enough registers to compile and invoke successfully
• The right tradeoff depends on your computation, so experiment, experiment, experiment!!!

Page 86:

Review of the Memory Hierarchy
• Local Memory: per-thread
  – Private to each thread, but slow!
  – Auto variables, register spill
• Shared Memory: per-block
  – Shared by the threads of the same block
  – Fast inter-thread communication
• Global Memory: per-application
  – Shared by all threads
  – Inter-grid communication

[Diagram: a Thread with its Local Memory, a Block with its Shared Memory, and sequential Grids in time sharing the Global Memory]

Page 87:

Optimizing Global Memory Access
• Memory bandwidth is very high (~150 GBytes/sec.) but…
• Latency is also very high (hundreds of cycles)!
• Local memory has the same latency!
• Reduce the number of memory transactions by applying a proper memory access pattern.

Thread locality is better than data locality on the GPU!
• Array of structures (data locality!)
• Structure of arrays (thread locality!)
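A small illustrative sketch of the two layouts (struct and kernel names are hypothetical): with an array of structures, consecutive threads read addresses strided by the structure size; with a structure of arrays, consecutive threads read consecutive words, which coalesces well:

    // Array of structures: thread i touches p[i].x, strided by sizeof(ParticleAoS)
    struct ParticleAoS { float x, y, z; };
    __global__ void scaleAoS( ParticleAoS *p, float s, int n ) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if( i < n ) p[i].x *= s;
    }

    // Structure of arrays: thread i touches x[i], a consecutive float (coalesced)
    struct ParticleSoA { float *x, *y, *z; };
    __global__ void scaleSoA( ParticleSoA p, float s, int n ) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if( i < n ) p.x[i] *= s;
    }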

Page 88:

Global Memory Operations
• Two types of loads:
  – Caching
    • Default mode
    • Attempts to hit in L1, then L2, then GMEM
    • Load granularity is a 128-byte line
  – Non-caching
    • Compile with the -Xptxas -dlcm=cg option to nvcc
    • Attempts to hit in L2, then GMEM
      – Does not hit in L1; invalidates the line if it is in L1 already
    • Load granularity is 32 bytes
• Stores:
  – Invalidate L1, write-back for L2

Page 89:

Caching Load
• Warp requests 32 aligned, consecutive 4-byte words
• Addresses fall within 1 cache-line
  – Warp needs 128 bytes
  – 128 bytes move across the bus on a miss
  – Bus utilization: 100%

[Diagram: addresses from a warp falling within one 128-byte line of the memory address range 0-448]

Page 90:

Non-caching Load
• Warp requests 32 aligned, consecutive 4-byte words
• Addresses fall within 4 segments
  – Warp needs 128 bytes
  – 128 bytes move across the bus on a miss
  – Bus utilization: 100%

[Diagram: addresses from a warp falling within four 32-byte segments of the memory address range 0-448]

Page 91:

Caching Load
• Warp requests 32 aligned, permuted 4-byte words
• Addresses fall within 1 cache-line
  – Warp needs 128 bytes
  – 128 bytes move across the bus on a miss
  – Bus utilization: 100%

[Diagram: permuted addresses from a warp falling within one 128-byte line]

Page 92:

Non-caching Load
• Warp requests 32 aligned, permuted 4-byte words
• Addresses fall within 4 segments
  – Warp needs 128 bytes
  – 128 bytes move across the bus on a miss
  – Bus utilization: 100%

[Diagram: permuted addresses from a warp falling within four 32-byte segments]

Page 93:

Caching Load
• Warp requests 32 misaligned, consecutive 4-byte words
• Addresses fall within 2 cache-lines
  – Warp needs 128 bytes
  – 256 bytes move across the bus on misses
  – Bus utilization: 50%

[Diagram: addresses from a warp straddling two 128-byte lines]

Page 94:

Non-caching Load
• Warp requests 32 misaligned, consecutive 4-byte words
• Addresses fall within at most 5 segments
  – Warp needs 128 bytes
  – 160 bytes move across the bus on misses
  – Bus utilization: at least 80%
• Some misaligned patterns will fall within 4 segments, so 100% utilization

[Diagram: addresses from a warp spanning up to five 32-byte segments]


Caching Load
•  All threads in a warp request the same 4-byte word
•  Addresses fall within a single cache line
   –  Warp needs 4 bytes
   –  128 bytes move across the bus on a miss
   –  Bus utilization: 3.125%
[Figure: all addresses from the warp point to the same word inside one 128-byte line of the 0–448 memory-address range]


Non-caching Load
•  All threads in a warp request the same 4-byte word
•  Addresses fall within a single segment
   –  Warp needs 4 bytes
   –  32 bytes move across the bus on a miss
   –  Bus utilization: 12.5%
[Figure: all addresses from the warp point to the same word inside one 32-byte segment of the 0–448 memory-address range]


Caching Load
•  Warp requests 32 scattered 4-byte words
•  Addresses fall within N cache lines
   –  Warp needs 128 bytes
   –  N*128 bytes move across the bus on a miss
   –  Bus utilization: 128 / (N*128)
[Figure: the addresses from the warp are scattered over N 128-byte lines of the 0–448 memory-address range]


Non-caching Load
•  Warp requests 32 scattered 4-byte words
•  Addresses fall within N segments
   –  Warp needs 128 bytes
   –  N*32 bytes move across the bus on a miss
   –  Bus utilization: 128 / (N*32)
[Figure: the addresses from the warp are scattered over N 32-byte segments of the 0–448 memory-address range]
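The patterns above can be reproduced with two toy kernels. The following sketch (names and launch parameters are illustrative, not part of the course samples) contrasts a unit-stride access, which keeps bus utilization high, with a strided one, which spreads each warp over many lines or segments:

__global__ void copy_coalesced(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];            // a warp touches one 128-byte line
}

__global__ void copy_strided(float *out, const float *in, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];            // a warp scatters over many lines/segments
}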


Shared Memory
•  On-chip memory
   –  2 orders of magnitude lower latency than global memory
   –  Much higher bandwidth than global memory
   –  Up to 48 KByte per multiprocessor (on Fermi and Kepler)
•  Allocated per thread-block
•  Accessible by any thread in the thread-block
   –  Not accessible to other thread-blocks
•  Several uses:
   –  Sharing data among threads in a thread-block
   –  User-managed cache (reducing global memory accesses)

   __shared__ float As[16][16];


Shared Memory Example
•  Compute adjacent_difference
   –  Given an array d_in, produce a d_out array containing the differences:
      d_out[0] = d_in[0];
      d_out[i] = d_in[i] - d_in[i-1];  /* i > 0 */

Exercise: complete the shared_variables.cu code
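One possible solution sketch (this is not the shared_variables.cu template, just an illustration of the staging pattern): each block loads its tile of d_in into shared memory so that the d_in[i-1] reads of all interior threads come from SMEM instead of GMEM.

__global__ void adj_diff(int *d_out, const int *d_in, int n) {
    __shared__ int s_in[BLOCK_SIZE];          // BLOCK_SIZE == blockDim.x is assumed
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        s_in[threadIdx.x] = d_in[i];          // one global load per thread
        __syncthreads();                      // wait until the whole tile is loaded
        if (i == 0)
            d_out[0] = d_in[0];
        else if (threadIdx.x > 0)
            d_out[i] = s_in[threadIdx.x] - s_in[threadIdx.x - 1];
        else
            d_out[i] = d_in[i] - d_in[i - 1]; // first thread of a block: the left
                                              // neighbour lives in the previous tile
    }
}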


N-Body problems
An N-body simulation numerically approximates the evolution of a system of bodies in which each body continuously interacts with every other body.
The all-pairs approach to N-body simulation is a brute-force technique that evaluates all pair-wise interactions among the N bodies.


N-Body problems
We use the gravitational potential to illustrate the basic form of computation in an all-pairs N-body simulation.


N-Body problems
To integrate over time, we need the acceleration a_i = F_i / m_i to update the position and velocity of body i.

Download http://twin.iac.rm.cnr.it/mini_nbody.tgz and unpack:
   tar zxvf mini_nbody.tgz
then look at nbody.c


N-Body: main loop

void bodyForce(Body *p, float dt, int n) {
  for (int i = 0; i < n; i++) {
    float Fx = 0.0f; float Fy = 0.0f; float Fz = 0.0f;
    for (int j = 0; j < n; j++) {
      float dx = p[j].x - p[i].x;
      float dy = p[j].y - p[i].y;
      float dz = p[j].z - p[i].z;
      float distSqr = dx*dx + dy*dy + dz*dz + SOFTENING;
      float invDist = 1.0f / sqrtf(distSqr);
      float invDist3 = invDist * invDist * invDist;
      Fx += dx * invDist3; Fy += dy * invDist3; Fz += dz * invDist3;
    }
    p[i].vx += dt*Fx; p[i].vy += dt*Fy; p[i].vz += dt*Fz;
  }
}


N-Body: parallel processing
•  We may think of the all-pairs algorithm as calculating each entry f_ij in an NxN grid of all pair-wise forces.
•  Each entry can be computed independently, so there is O(N²) available parallelism.
•  We may assign each body to a thread
   –  If the block size is, say, 256, then the number of blocks is: (nBodies + 255) / 256
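In code this is the usual launch-configuration idiom (a sketch; variable names are illustrative):

int blockSize = 256;
int numBlocks = (nBodies + blockSize - 1) / blockSize;   // same as (nBodies + 255) / 256
bodyForce<<<numBlocks, blockSize>>>(p, dt, nBodies);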


N-Body: a first CUDA code

__global__ void bodyForce(Body *p, float dt, int n) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n) {
    float Fx = 0.0f; float Fy = 0.0f; float Fz = 0.0f;
    for (int j = 0; j < n; j++) {
      float dx = p[j].x - p[i].x;
      float dy = p[j].y - p[i].y;
      float dz = p[j].z - p[i].z;
      float distSqr = dx*dx + dy*dy + dz*dz + SOFTENING;
      float invDist = rsqrtf(distSqr);
      float invDist3 = invDist * invDist * invDist;
      Fx += dx * invDist3; Fy += dy * invDist3; Fz += dz * invDist3;
    }
    p[i].vx += dt*Fx; p[i].vy += dt*Fy; p[i].vz += dt*Fz;
  }
}


N-Body: a first CUDA code
•  Do we expect this code to be efficient?
   –  All threads read all data from the global memory.
•  It is better to serialize some of the computations to achieve the data reuse needed to reach the peak performance of the arithmetic units and to reduce the required memory bandwidth.
•  To that purpose let us introduce the notion of a computational tile.


N-Body: Computational tile
•  A tile is a square region of the grid of pair-wise forces consisting of p rows and p columns.
   –  Only 2p body descriptions are required to evaluate all p² interactions in the tile (p of which can be reused later).
   –  These body descriptions can be stored in shared memory or in registers.


N-Body: Computational tile
•  Each thread in a block loads a body.
•  All threads in a block compute the interactions with the bodies that have been loaded.


N-Body: using Shared Memory (1)

__global__ void bodyForce(float4 *p, float4 *v, float dt, int n) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n) {
    float Fx = 0.0f; float Fy = 0.0f; float Fz = 0.0f;
    for (int tile = 0; tile < gridDim.x; tile++) {
      __shared__ float3 spos[BLOCK_SIZE];
      float4 tpos = p[tile * blockDim.x + threadIdx.x];
      spos[threadIdx.x] = make_float3(tpos.x, tpos.y, tpos.z);
      __syncthreads();

      for (int j = 0; j < BLOCK_SIZE; j++) {
        …    /* pair-wise interactions with spos[j], see next slide */
      }
      __syncthreads();
    }
    v[i].x += dt*Fx; v[i].y += dt*Fy; v[i].z += dt*Fz;
  }
}


N-Body: using Shared Memory (2)

__global__ void bodyForce(float4 *p, float4 *v, float dt, int n) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n) {
    ....
      for (int j = 0; j < BLOCK_SIZE; j++) {
        float dx = spos[j].x - p[i].x;
        float dy = spos[j].y - p[i].y;
        float dz = spos[j].z - p[i].z;
        float distSqr = dx*dx + dy*dy + dz*dz + SOFTENING;
        float invDist = rsqrtf(distSqr);
        float invDist3 = invDist * invDist * invDist;
        Fx += dx * invDist3; Fy += dy * invDist3; Fz += dz * invDist3;
      }
    ….
  }
}


N-Body: Further optimizations
•  Tuning of the block size
•  Unrolling of the inner loop (a sketch follows):

   for (int j = 0; j < BLOCK_SIZE; j++) { }
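One common way to unroll that loop (an illustration; the course's nbody-unroll.cu may do it differently) is to ask the compiler to unroll it by a fixed factor with a pragma:

#pragma unroll 4
for (int j = 0; j < BLOCK_SIZE; j++) {
    /* pair-wise interaction with spos[j], exactly as in the shared-memory kernel */
}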

•  Test:
   1.  shmoo-cpu-nbody.sh
   2.  cuda/shmoo-cuda-nbody-orig.sh
   3.  cuda/shmoo-cuda-nbody-block.sh
   4.  cuda/shmoo-cuda-nbody-unroll.sh


Shared Memory
•  Organization:
   –  32 banks, 4-byte wide banks
   –  Successive 4-byte words belong to different banks
•  Performance:
   –  4 bytes per bank per 2 clocks per multiprocessor
   –  smem accesses are issued per 32 threads (warp)
   –  serialization: if N threads of 32 access different 4-byte words in the same bank, the N accesses are executed serially
   –  multicast: N threads access the same word in one fetch
      •  Could be different bytes within the same word


Bank Addressing Examples
•  No Bank Conflicts: threads 0–31 access banks 0–31 one-to-one
•  No Bank Conflicts: threads 0–31 access banks 0–31 one-to-one (permuted mapping)
[Figure: two diagrams, each mapping thread 0 … thread 31 onto bank 0 … bank 31 with no bank accessed twice]


Bank Addressing Examples
•  2-way Bank Conflicts: two threads access different words in the same bank
•  8-way Bank Conflicts (x8): eight threads access different words in the same bank
[Figure: left, threads 0–31 mapped onto banks with multiplicity 2; right, threads 0–31 mapped onto a few banks with multiplicity 8]


Shared Memory: Avoiding Bank Conflicts
•  32x32 SMEM array
•  Warp accesses a column:
   –  32-way bank conflicts (all threads in the warp access the same bank)
[Figure: a 32x32 shared-memory array; each row holds elements 0, 1, 2, …, 31 starting in bank 0, so a column access by warps 0, 1, 2, …, 31 hits a single bank]


Shared Memory: Avoiding Bank Conflicts
•  Add a column for padding:
   –  32x33 SMEM array
•  Warp accesses a column:
   –  32 different banks, no bank conflicts
[Figure: with the padding column, consecutive rows are shifted by one bank, so a column access by warps 0, 1, 2, …, 31 touches banks 0–31]
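The corresponding declaration is a one-word change (a sketch, assuming a float array and a column-wise access pattern as above):

__shared__ float tile[32][33];        // 33, not 32: the extra padding column is never read

float v = tile[threadIdx.x][col];     // "column" access now falls in 32 different banks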


Tree-Based Parallel Reductions in Shared Memory
•  CUDA solves this by exposing on-chip shared memory
   –  Reduce blocks of data in shared memory to save bandwidth
[Figure: tree reduction of the values 3 1 7 0 4 1 6 3; pairwise sums 4 7 5 9, then 11 14, then the final result 25]


Parallel Reduction: Interleaved Addressing
[Figure: 16 values in shared memory (10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2). Step 1, stride 1: threads 0–7 each add a neighbouring pair of elements. Step 2, stride 2; Step 3, stride 4; Step 4, stride 8: the partial sums are combined until element 0 holds the total, 41.]

Interleaved addressing results in bank conflicts.
•  Arbitrarily bad bank conflicts
•  Requires barriers if N > warpsize
•  Supports non-commutative operators


Parallel Reduction: Sequential Addressing
[Figure: the same 16 values in shared memory (10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2). Step 1, stride 8: threads 0–7 each add the element stride positions away. Step 2, stride 4; Step 3, stride 2; Step 4, stride 1: element 0 ends up holding the total, 41.]

Sequential addressing is conflict free.
•  No bank conflicts
•  Requires barriers if N > warpsize
•  Does not support non-commutative operators
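A minimal kernel implementing the sequential-addressing scheme above (a sketch: one partial sum per block, assuming a power-of-two block size and blockDim.x * sizeof(int) bytes of dynamic shared memory at launch):

__global__ void reduceSum(const int *g_in, int *g_out, int n) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? g_in[i] : 0;
    __syncthreads();
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];   // consecutive words: no bank conflicts
        __syncthreads();                         // barrier needed while stride >= warpsize
    }
    if (tid == 0) g_out[blockIdx.x] = sdata[0];  // one partial sum per block
}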


Parallel Prefix Sum (Scan)
•  Given an array A = [a0, a1, …, an-1] and a binary associative operator ⊕ with identity I,
•  scan(A) = [I, a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-2)]
•  Example: if ⊕ is addition, then scan on the set
   –  [3 1 7 0 4 1 6 3]
•  returns the set
   –  [0 3 4 11 11 15 16 22]


O(n log n) Scan
•  Step efficient (log n steps)
•  Not work efficient (n log n work)
•  Requires barriers at each step (WAR dependencies)
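For reference, a minimal single-block version of this O(n log n) scheme (a Hillis–Steele-style sketch, double-buffered in shared memory because of the WAR dependencies; it produces an inclusive scan, which can be shifted by one element to obtain the exclusive scan defined above; assumes n <= blockDim.x and 2*n ints of dynamic shared memory):

__global__ void scan_naive(const int *g_in, int *g_out, int n) {
    extern __shared__ int temp[];            // 2 * n ints
    int tid = threadIdx.x;
    int pout = 0, pin = 1;
    if (tid < n) temp[tid] = g_in[tid];
    __syncthreads();
    for (int offset = 1; offset < n; offset *= 2) {
        pout = 1 - pout; pin = 1 - pin;      // swap buffers
        if (tid < n) {
            if (tid >= offset)
                temp[pout*n + tid] = temp[pin*n + tid] + temp[pin*n + tid - offset];
            else
                temp[pout*n + tid] = temp[pin*n + tid];
        }
        __syncthreads();                     // barrier at each step (WAR dependency)
    }
    if (tid < n) g_out[tid] = temp[pout*n + tid];   // inclusive result
}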


Constant Memory
ü  Both scalar and array values can be stored
ü  Constants are stored in DRAM and cached on chip
   –  L1 per SM
ü  A constant value can be broadcast to all threads in a warp
ü  Extremely efficient way of accessing a value that is common to all threads in a block!
[Figure: multiprocessor block diagram with instruction cache, multithreaded instruction buffer, register file, constant cache, L1/shared memory, operand select, MAD and SFU units]

   cudaMemcpyToSymbol(ds, &hs, sizeof(int), 0, cudaMemcpyHostToDevice);
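The usual pattern around that call looks like this (a sketch; ds and hs are the names used above, everything else is illustrative):

__constant__ int ds;                          // device-side constant symbol

void set_constant(int hs) {
    cudaMemcpyToSymbol(ds, &hs, sizeof(int), 0, cudaMemcpyHostToDevice);
}

__global__ void use_constant(int *out) {
    out[threadIdx.x] = ds;                    // the value is broadcast to the whole warp
}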


Texture Memory (and co-stars)
•  Another type of memory system, featuring:
   –  Spatially-cached read-only access
   –  Avoid coalescing worries
   –  Interpolation
   –  (Other) fixed-function capabilities
   –  Graphics interoperability


Traditional Caching
•  When reading, cache "nearby elements" (i.e. the cache line)
   –  Memory is linear!
   –  Applies to CPU caches, GPU L1/L2 cache, etc.


Texture-Memory Caching
•  Can cache "spatially!" (2D, 3D)
   –  Specify dimensions (1D, 2D, 3D) on creation
•  1D applications:
   –  Interpolation, clipping (later)
   –  Caching when, e.g., coalesced access is infeasible


Texture Memory
•  "Memory is just an unshaped bucket of bits" (CUDA Handbook)
•  Need a texture reference in order to:
   –  Interpret the data
   –  Deliver it to registers


Interpolation
•  Can "read between the lines!"

http://cuda-programming.blogspot.com/2013/02/texture-memory-in-cuda-what-is-texture.html

Download http://twin.iac.rm.cnr.it/readTexels.cu


Clamping
•  Seamlessly handle reads beyond the region!

http://cuda-programming.blogspot.com/2013/02/texture-memory-in-cuda-what-is-texture.html


“CUDA  Arrays”  

•  So  far,  we’ve  used  standard  linear  arrays  

•  "CUDA arrays":
   –  Different addressing calculation
      •  Contiguous addresses have 2D/3D locality!
   –  Not pointer-addressable
   –  (Designed specifically for texturing)


Texture  Memory  

•  A texture reference can be attached to:
   –  an ordinary device-memory array
   –  a "CUDA array"
•  Many more capabilities
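A minimal sketch of the (legacy) texture-reference path for ordinary linear device memory; newer CUDA releases favour texture objects, so treat this purely as an illustration and all names as made up:

texture<float, 1, cudaReadModeElementType> texRef;      // file-scope texture reference

__global__ void copy_via_texture(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch(texRef, i);          // cached, read-only fetch
}

/* host side:
     cudaBindTexture(NULL, texRef, d_in, n * sizeof(float));
     copy_via_texture<<<blocks, threads>>>(d_out, n);
     cudaUnbindTexture(texRef);                          */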


Texturing  Example  (2D)  

http://www.math.ntu.edu.tw/~wwang/mtxcomp2010/download/cuda_04_ykhung.pdf


Texturing  Example  (2D)  

http://www.math.ntu.edu.tw/~wwang/mtxcomp2010/download/cuda_04_ykhung.pdf


Texturing  Example  (2D)  

http://www.math.ntu.edu.tw/~wwang/mtxcomp2010/download/cuda_04_ykhung.pdf


Texturing  Example  (2D)  


Arithmetic Instruction Throughput

Number of operations per clock cycle per multiprocessor


Control Flow Instructions
•  The main performance concern with branching is divergence
   –  Threads within a single warp (group of 32) take different paths
   –  The different execution paths are serialized
      •  The control paths taken by the threads in a warp are traversed one at a time until there are no more.
•  Avoid divergence when the branch condition is a function of the thread ID
   –  Example with divergence:
      •  if (threadIdx.x > 7) { }
      •  This creates two different control paths for threads in a block
      •  Branch granularity < warp size; threads 0, 1, …, 7 follow a different path than the rest of the threads 8, 9, …, 31 in the same warp.
   –  Example without divergence:
      •  if (threadIdx.x / WARP_SIZE > 2) { }
      •  Also creates two different control paths for threads in a block
      •  Branch granularity is a whole multiple of the warp size; all threads in any given warp follow the same path


Warp Vote Functions and other useful primitives
•  int __all(predicate)
   returns true iff the predicate evaluates as true for all threads of a warp
•  int __any(predicate)
   returns true iff the predicate evaluates as true for any thread of a warp
•  unsigned int __ballot(predicate)
   returns an unsigned int whose Nth bit is set iff the predicate evaluates as true for the Nth thread of the warp
•  int __popc(int x)
   POPulation Count: returns the number of bits that are set to 1 in the 32-bit integer x
•  int __clz(int x)
   Count Leading Zeros: returns the number of consecutive zero bits beginning at the most significant bit of the 32-bit integer x
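A tiny usage sketch combining two of these primitives (pre-CUDA 9 style, matching the non-_sync intrinsics listed above; pred is any per-thread condition):

int pred = (x > 0);                            // per-thread predicate
unsigned int mask = __ballot(pred);            // one bit per lane with a true predicate
int lanes_true = __popc(mask);                 // how many lanes of the warp satisfy it
int everybody  = __all(pred);                  // 1 iff all 32 lanes satisfy it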


Shuffle Instruction
•  Instruction to exchange data in a warp
•  Threads can "read" other threads' registers
•  No shared memory is needed
•  It is available starting from SM 3.0
•  Kepler's shuffle instruction (SHFL) enables a thread to directly read a register from another thread in the same warp


Shuffle Instruction
•  4 variants (idx, up, down, bfly), exposed as the __shfl(), __shfl_up(), __shfl_down() and __shfl_xor() intrinsics


A simple shuffle example (1)

int __shfl_down(int var, unsigned int delta, int width=warpSize);

•  __shfl_down() calculates a source lane ID by adding delta to the caller's lane ID
   –  the lane ID is a thread's index within its warp, from 0 to 31. It can be computed as threadIdx.x % 32;
•  The value of var held by the resulting lane ID is returned: this has the effect of shifting var down the warp by delta lanes
•  If the source lane ID is out of range or the source thread has exited, the calling thread's own var is returned.


A simple shuffle example (2)

int i = threadIdx.x % 32;
int j = __shfl_down(i, 2, 8);

With width = 8 the warp is treated as four sub-sections of 8 lanes: each lane receives the value of the lane 2 positions above it within its own sub-section, while the last 2 lanes of each sub-section keep their own value.


Using shuffle for a reduce operation

__inline__ __device__ int warpReduceSum(int val) {
  for (int offset = warpSize/2; offset > 0; offset /= 2) {
    val += __shfl_down(val, offset);
  }
  return val;
}


Using shuffle for an all-reduce operation

If all threads in the warp need the result, it is possible to implement an "all reduce" within a warp using __shfl_xor() instead of __shfl_down(), like the following:

__inline__ __device__ int warpAllReduceSum(int val) {
  for (int mask = warpSize/2; mask > 0; mask /= 2) {
    val += __shfl_xor(val, mask);
  }
  return val;
}


Block reduce by using shuffle
•  First reduce within warps.
•  Then the first thread of each warp writes its partial sum to shared memory.
•  Finally, after synchronizing, the first warp reads from shared memory and reduces again.

__inline__ __device__ int blockReduceSum(int val) {
  static __shared__ int shared[32];  // shared mem for 32 partial sums
  int lane = threadIdx.x % warpSize;
  int wid  = threadIdx.x / warpSize;
  val = warpReduceSum(val);          // each warp performs a partial reduction
  if (lane == 0) shared[wid] = val;  // write the reduced value to shared memory
  __syncthreads();                   // wait for all partial reductions
  // read from shared memory only if that warp existed
  val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0;
  if (wid == 0) val = warpReduceSum(val);  // final reduce within the first warp
  return val;
}


Reducing Large Arrays
•  Reducing across blocks means global communication, which means synchronizing the grid
•  Two possible approaches:
   –  Breaking the computation into two separate kernel launches
   –  Using atomic operations
•  Exercise (home work!): write a deviceReduce that works across a complete grid (many thread blocks).
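One possible shape of the atomics-based approach (a sketch, not the assigned solution): every block reduces the elements it visits with blockReduceSum from the previous slide, then its thread 0 adds the block's partial sum into a single global accumulator (zeroed before the launch).

__global__ void deviceReduceSum(const int *in, int *out, int n) {
    int sum = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)            // grid-stride loop over the array
        sum += in[i];
    sum = blockReduceSum(sum);                   // per-block reduction
    if (threadIdx.x == 0)
        atomicAdd(out, sum);                     // combine the block partial sums
}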


Implement SHFL for 64-bit numbers

Generic SHFL: https://github.com/BryanCatanzaro/generics
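The usual trick (a sketch; the generics library above provides a general version) is to split the 64-bit value into two 32-bit halves, shuffle each half, and recombine:

__device__ __forceinline__ double shfl_down64(double var, unsigned int delta) {
    int hi = __double2hiint(var);
    int lo = __double2loint(var);
    hi = __shfl_down(hi, delta);
    lo = __shfl_down(lo, delta);
    return __hiloint2double(hi, lo);
}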


Optimized Filtering
Let us have a source array, src, containing n elements, and a predicate, and we need to copy all elements of src satisfying the predicate into the destination array, dst. For the sake of simplicity, assume that dst has a length of at least n and that the order of elements in the dst array does not matter. If the predicate is src[i] > 0, a simple CPU solution is:

int filter(int *dst, const int *src, int n) {
    int nres = 0;
    for (int i = 0; i < n; i++)
        if (src[i] > 0)
            dst[nres++] = src[i];
    // return the number of elements copied
    return nres;
}


Optimized Filtering on GPU: global memory

A straightforward approach is to use a single global counter and atomically increment it for each new element written into the dst array:

__global__
void filter_k(int *dst, int *nres, const int *src, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n && src[i] > 0)
        dst[atomicAdd(nres, 1)] = src[i];
}

Here the atomic operations are clearly a bottleneck, and need to be removed or reduced to increase performance.


Optimized Filtering on GPU: using shared memory

__global__
void filter_shared_k(int *dst, int *nres, const int *src, int n) {
    __shared__ int l_n;
    int i = blockIdx.x * (NPER_THREAD * BS) + threadIdx.x;

    for (int iter = 0; iter < NPER_THREAD; iter++) {
        // zero the counter
        if (threadIdx.x == 0) l_n = 0;
        __syncthreads();

        // get the value, evaluate the predicate, and
        // increment the counter if needed
        int d, pos;
        if (i < n) {
            d = src[i];
            if (d > 0) pos = atomicAdd(&l_n, 1);
        }
        __syncthreads();

        // leader increments the global counter
        if (threadIdx.x == 0) l_n = atomicAdd(nres, l_n);
        __syncthreads();

        // threads with true predicates write their elements
        if (i < n && d > 0) {
            pos += l_n;   // increment local pos by the global counter
            dst[pos] = d;
        }
        __syncthreads();

        i += BS;
    }
}


Optimized Filtering on GPU: warp-aggregated atomics

Warp aggregation is the process of combining the atomic operations from multiple threads in a warp into a single atomic:
1.  Threads in the warp elect a leader thread.
2.  Threads in the warp compute the total atomic increment for the warp.
3.  The leader thread performs an atomic add to compute the offset for the warp.
4.  The leader thread broadcasts the offset to all other threads in the warp.
5.  Each thread adds its own index within the warp to the warp offset to get its position in the output array.


Step 1: elect a leader

The thread identifier within a warp can be computed as

#define WARP_SZ 32
__device__ inline int lane_id(void) { return threadIdx.x % WARP_SZ; }

A reasonable choice is to select as leader the lowest active lane:

int mask = __ballot(1);          // mask of active lanes
int leader = __ffs(mask) - 1;    // -1 for 0-based indexing

__ffs(int v) returns the 1-based position of the lowest bit that is set in v, or 0 if no bit is set (ffs stands for find first set).


Step 2: compute the total increment

The total increment for the warp is equal to the number of active lanes which hold true predicates. The value can be obtained by using the __popc intrinsic, which returns the number of bits set in the binary representation of its argument:

int change = __popc(mask);


Step 3: performing the atomic

Only the leader lane performs the atomic operation:

int res;
if (lane_id() == leader)
    res = atomicAdd(ctr, change);   // ctr is the pointer to the counter


Step 4: broadcasting the result

We can implement the broadcast efficiently using the shuffle intrinsic:

__device__ int warp_bcast(int v, int leader) { return __shfl(v, leader); }

 


Step 5: computing the index for each lane

This is a bit tricky: we must compute the output index for each lane, by adding the broadcast counter value for the warp to the lane's intra-warp offset.
•  The intra-warp offset is the number of active lanes with IDs lower than the current lane ID.
•  We can compute it by applying a mask that selects only the lower lane IDs, and then applying the __popc() intrinsic to the result:

res += __popc(mask & ((1 << lane_id()) - 1));


Final code

// warp-aggregated atomic increment
__device__ int atomicAddInc(int *ctr) {
    int mask = __ballot(1);
    // select the leader
    int leader = __ffs(mask) - 1;
    // leader does the update
    int res;
    if (lane_id() == leader)
        res = atomicAdd(ctr, __popc(mask));
    // broadcast result
    res = warp_bcast(res, leader);
    // each thread computes its own value
    return res + __popc(mask & ((1 << lane_id()) - 1));
} // atomicAddInc

Download http://twin.iac.rm.cnr.it/filter.cu
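For context, this is how the aggregated increment typically replaces the per-thread atomic in the filter kernel (a sketch; filter.cu above is the reference version): only the threads whose predicate is true reach the call, so __ballot(1) inside atomicAddInc sees exactly those lanes.

__global__ void filter_warp_k(int *dst, int *nres, const int *src, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n && src[i] > 0)
        dst[atomicAddInc(nres)] = src[i];   // one atomic per warp instead of one per thread
}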


Exercise
Write a shuffle-based kernel for the N-body problem.
Hint: start from the nbody-unroll.cu code


CUDA Dynamic Parallelism
•  A facility introduced by NVIDIA in their GK110 chip/architecture
•  Allows a kernel to call another kernel from within it without returning to the host.
•  Each such kernel call has the same calling construction: the grid and block sizes and dimensions are set at the time of the call.
•  The facility allows computations to be done with dynamically altering grid structures and recursion, to suit the computation.
•  For example, it allows a 2/3D simulation mesh to be non-uniform, with increased precision at places of interest; see previous slides.


Dynamic Parallelism: kernels calling other kernels

Host code:
    kernel1<<< G1, B1 >>>(…);

Device code:
    __global__ void kernel1(…) {
        …
        kernel2<<< G2, B2 >>>(…);
        …
        return;
    }

    __global__ void kernel2(…) {
        …
        kernel3<<< G3, B3 >>>(…);
        …
        return;
    }

Notice the kernel call uses the standard syntax and allows each kernel call to have different grid/block structures.
The nesting depth is limited by memory and is <= 63 or 64.


Dynamic Parallelism: recursion is allowed

Host code:
    kernel1<<< G1, B1 >>>(…);

Device code:
    __global__ void kernel1(…) {
        …
        kernel1<<< G2, B2 >>>(…);   // a kernel may launch itself
        …
        return;
    }


Parent-Child Launch Nesting (derived from Fig. 20.4 of Kirk and Hwu, 2nd Ed.)

[Figure: timeline in which the host (CPU) thread launches Grid A (the parent); a thread of Grid A launches Grid B (the child); Grid B completes before Grid A completes.]

"Implicit synchronization between parent and child forcing parent to wait until all children exit before it can exit."


Host-Device Data Transfers
ü  Device-to-host memory bandwidth is much lower than device-to-device bandwidth
   ü  8 GB/s peak (PCI-e x16 Gen 2) vs. ~150 GB/s peak
ü  Minimize transfers
   ü  Intermediate data can be allocated, operated on, and deallocated without ever copying it to host memory
ü  Group transfers
   ü  One large transfer is usually much better than many small ones


Synchronizing GPU and CPU
•  All kernel launches are asynchronous!
   –  control returns to the CPU immediately
   –  be careful measuring times!
   –  the kernel starts executing once all previous CUDA calls have completed
•  Memcopies are synchronous
   –  control returns to the CPU once the copy is complete
   –  the copy starts once all previous CUDA calls have completed
•  cudaDeviceSynchronize()
   –  CPU code
   –  blocks until all previous CUDA calls complete
•  Asynchronous CUDA calls provide:
   –  non-blocking memcopies
   –  the ability to overlap memcopies and kernel execution


Page-Locked Data Transfers
ü  cudaMallocHost() allows allocation of page-locked ("pinned") host memory
ü  Same syntax as cudaMalloc() but works with CPU pointers and memory
ü  Enables the highest cudaMemcpy performance
   ü  3.2 GB/s on PCI-e x16 Gen1
   ü  6.5 GB/s on PCI-e x16 Gen2
ü  Use with caution!!!
   ü  Allocating too much page-locked memory can reduce overall system performance

See and test the bandwidth.cu sample
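A minimal usage sketch of the pinned-memory allocation (N is illustrative):

float *h_a;
cudaMallocHost((void**)&h_a, N * sizeof(float));  // page-locked ("pinned") host buffer
// ... fill h_a, then cudaMemcpy / cudaMemcpyAsync it to the device ...
cudaFreeHost(h_a);                                // pinned memory has its own free call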


Overlapping Data Transfers and Computation
ü  The Async Copy and Stream APIs allow overlap of H2D or D2H data transfers with computation
   ü  CPU computation can overlap data transfers on all CUDA-capable devices
   ü  Kernel computation can overlap data transfers on devices with "Concurrent copy and execution" (compute capability >= 1.1)
ü  Stream = sequence of operations that execute in order on the GPU
   ü  Operations from different streams can be interleaved
   ü  The stream ID is used as an argument to async calls and kernel launches


Asynchronous Data Transfers
ü  An asynchronous host-device memory copy returns control immediately to the CPU
   ü  cudaMemcpyAsync(dst, src, size, dir, stream);
   ü  requires pinned host memory (allocated with cudaMallocHost)
ü  Overlap CPU computation with the data transfer
   ü  0 = default stream

cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
kernel<<<grid, block>>>(a_d);
cpuFunction();        // overlapped: the CPU work runs while the copy and kernel execute on the GPU


Overlapping kernel and data transfer
ü  Requires:
   ü  the kernel and the transfer use different, non-zero streams
      ü  for the creation of a stream: cudaStreamCreate
   ü  remember that a CUDA call to stream 0 blocks until all previous calls complete and cannot be overlapped
ü  Example:

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

cudaMemcpyAsync(dst, src, size, dir, stream1);
kernel<<<grid, block, 0, stream2>>>(…);      // overlapped with the copy in stream1

Exercise: read, compile and run simpleStreams.cu


Device Management
•  The CPU can query and select GPU devices
   –  cudaGetDeviceCount(int *count)
   –  cudaSetDevice(int device)
   –  cudaGetDevice(int *current_device)
   –  cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
   –  cudaChooseDevice(int *device, cudaDeviceProp *prop)
•  Multi-GPU setup
   –  device 0 is used by default
   –  starting with CUDA 4.0 a CPU thread can control more than one GPU
   –  multiple CPU threads can control the same GPU
      –  the driver regulates the access.


Device  Management  (sample  code)  

int cudadevice;
struct cudaDeviceProp prop;
cudaGetDevice(&cudadevice);
cudaGetDeviceProperties(&prop, cudadevice);
mpc = prop.multiProcessorCount;
mtpb = prop.maxThreadsPerBlock;
shmsize = prop.sharedMemPerBlock;
printf("Device %d: number of multiprocessors %d\n"
       "max number of threads per block %d\n"
       "shared memory per block %d\n",
       cudadevice, mpc, mtpb, shmsize);

Exercise:  compile  and  run  the  enum_gpu.cu  code  


CUDA libraries
ü  CUDA includes a number of widely used libraries
   –  CUBLAS: BLAS implementation
   –  CUFFT: FFT implementation
   –  CURAND: random number generation
   –  etc.
ü  Thrust: http://docs.nvidia.com/cuda/thrust
   –  Reduction
   –  Scan
   –  Sort
   –  etc.


CUDA libraries
ü  There are two types of runtime math operations
   –  __func(): direct mapping to the hardware ISA
      •  Fast but low accuracy
      •  Examples: __sinf(x), __expf(x), __powf(x,y)
   –  func(): compiles to multiple instructions
      •  Slower but higher accuracy (5 ulp, units in the last place, or less)
      •  Examples: sin(x), exp(x), pow(x,y)
ü  The -use_fast_math compiler option forces every func() to compile to __func()


CURAND Library
•  Library for generating random numbers
•  Features:
   –  XORWOW pseudo-random generator
   –  Sobol' quasi-random number generators
   –  Host API for generating random numbers in bulk
   –  Inline implementation allows use inside GPU functions/kernels
   –  Single- and double-precision, uniform, normal and log-normal distributions


CURAND Features
–  Pseudo-random numbers: George Marsaglia. Xorshift RNGs. Journal of Statistical Software, 8(14), 2003. Available at http://www.jstatsoft.org/v08/i14/paper.
–  Quasi-random 32-bit and 64-bit Sobol' sequences with up to 20,000 dimensions.
–  Host API: call a kernel from the host, generate numbers on the GPU, consume numbers on the host or on the GPU.
–  GPU API: generate and consume numbers during kernel execution.


CURAND use
1.  Create a generator:
    curandCreateGenerator()
2.  Set a seed:
    curandSetPseudoRandomGeneratorSeed()
3.  Generate the data from a distribution:
    curandGenerateUniform() / curandGenerateUniformDouble(): Uniform
    curandGenerateNormal() / curandGenerateNormalDouble(): Gaussian
    curandGenerateLogNormal() / curandGenerateLogNormalDouble(): Log-Normal
4.  Destroy the generator:
    curandDestroyGenerator()
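Put together, the host-API sequence looks like this (a sketch: d_rand is assumed to be a cudaMalloc'ed array of n floats):

#include <curand.h>

curandGenerator_t gen;
curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);    // 1. create (XORWOW by default)
curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);          // 2. seed
curandGenerateUniform(gen, d_rand, n);                     // 3. n uniform floats in device memory
curandDestroyGenerator(gen);                               // 4. destroy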


Example CURAND Program: computing π with the Monte Carlo method

π = 4 * (Σ red points) / (Σ points)
1.  Generate 0 <= x < 1
2.  Generate 0 <= y < 1
3.  Compute z = x² + y²
4.  If z <= 1 the point is red

Accuracy increases slowly…


Example CURAND Program: computing π with the Monte Carlo method

Compare the speed of pigpu and picpu:
   cd Pi
   gcc -o picpu -O3 picpu.c -lm
   nvcc -o pigpu -O3 pigpu.cu -arch=sm_13 -lm
Run for 100000 iterations.
Look at pigpu and try to understand why it is so SLOW…


Multi-GPU Memory
ü  GPUs do not share global memory
ü  But starting with CUDA 4.0 one GPU can copy data from/to another GPU's memory directly, if the GPUs are connected to the same PCIe switch
ü  Inter-GPU communication
   ü  Application code is responsible for copying/moving data between GPUs
   ü  Data travel across the PCIe bus
      ü  Even when the GPUs are connected to the same PCIe switch!


Multi-GPU Environment
ü  GPUs have consecutive integer IDs, starting with 0
ü  Starting with CUDA 4.0, a host thread can maintain more than one GPU context at a time
   ü  cudaSetDevice allows changing the "active" GPU (and so the context)
   ü  Device 0 is chosen when cudaSetDevice is not called
ü  Remember that multiple host threads can establish contexts with the same GPU
   ü  The driver handles time-sharing and resource partitioning unless the GPU is in exclusive mode


Multi-GPU Environment: a simple approach

// Run an independent kernel on each CUDA device
int numDevs = 0;
cudaGetDeviceCount(&numDevs);
...
for (int d = 0; d < numDevs; d++) {
    cudaSetDevice(d);
    kernel<<<blocks, threads>>>(args);
}

Exercise: compare the dot_simple_multiblock.cu and dot_multigpu.cu programs. Compile and run.
Attention! Add the -arch=sm_35 option:
   nvcc -o dot_multigpu dot_multigpu.cu -arch=sm_35


General Multi-GPU Programming Pattern
•  The goal is to hide the communication cost
   –  Overlap it with computation
•  So, every time-step, each GPU should:
   –  Compute the parts (halos) to be sent to its neighbors
   –  Compute the internal region (bulk)        (overlapped)
   –  Exchange halos with neighbors             (overlapped)
•  Linear scaling as long as the internal computation takes longer than the halo exchange
   –  Actually, the separate halo computation adds some overhead


Example:  Two  Subdomains  


Example: Two Subdomains

[Figure: the domain is split between GPU-0 (green subdomain) and GPU-1 (grey subdomain). Phase 1: on each GPU, a send and a compute proceed together; Phase 2: each GPU computes the remaining part.]


CUDA Features Useful for Multi-GPU
•  Control multiple GPUs with a single CPU thread
   –  Simpler coding: no need for CPU multithreading
•  Streams:
   –  Enable executing kernels and memcopies concurrently
   –  Up to 2 concurrent memcopies: to/from the GPU
•  Peer-to-Peer (P2P) GPU memory copies
   –  Transfer data between GPUs using PCIe P2P support
   –  Done by the GPU DMA hardware; the host CPU is not involved
      •  Data traverses the PCIe links without touching CPU memory
   –  Disjoint GPU-pairs can communicate simultaneously


Peer-to-Peer GPU Communication
•  Up to CUDA 4.0, GPUs could not exchange data directly.
•  With CUDA >= 4.0, two GPUs connected to the same PCI Express switch can exchange data directly:

cudaStreamCreate(&stream_on_gpu_0);
cudaSetDevice(0);
cudaDeviceEnablePeerAccess(1, 0);
cudaMalloc(&d_0, num_bytes);        /* create array on GPU 0 */
cudaSetDevice(1);
cudaMalloc(&d_1, num_bytes);        /* create array on GPU 1 */
cudaMemcpyPeerAsync(d_0, 0, d_1, 1, num_bytes, stream_on_gpu_0);
                                    /* copy d_1 from GPU 1 to d_0 on GPU 0: pull copy */

Example: read p2pcopy.cu, compile with
   nvcc -o p2pcopy p2pcopy.cu -arch=sm_35


What does the code look like?
•  Arguments of the cudaMemcpyPeerAsync() call:
   –  dest address, dest GPU, src address, src GPU, num bytes, stream ID
•  The middle loop isn't necessary for correctness
   –  It improves performance by preventing the two stages from interfering with each other (15 vs. 11 GB/s in a 4-GPU example)

for (int i = 0; i < num_gpus-1; i++)   // "right" stage
    cudaMemcpyPeerAsync(d_a[i+1], gpu[i+1], d_a[i], gpu[i], num_bytes, stream[i]);

for (int i = 0; i < num_gpus; i++)
    cudaStreamSynchronize(stream[i]);

for (int i = 1; i < num_gpus; i++)     // "left" stage
    cudaMemcpyPeerAsync(d_b[i-1], gpu[i-1], d_b[i], gpu[i], num_bytes, stream[i]);


Code Pattern for Multi-GPU
•  Every time-step each GPU will:
   –  Compute the halos to be sent to its neighbors (stream h)
   –  Compute the internal region (stream i)
   –  Exchange halos with neighbors (stream h)


Code For Looping through Time

for (int istep = 0; istep < nsteps; istep++) {
    // compute halos (stream_halo) and the internal subdomain (stream_internal)
    for (int i = 0; i < num_gpus; i++) {
        cudaSetDevice(gpu[i]);
        kernel<<<..., stream_halo[i]>>>( ... );
        kernel<<<..., stream_halo[i]>>>( ... );
        cudaStreamQuery(stream_halo[i]);
        kernel<<<..., stream_internal[i]>>>( ... );
    }
    // exchange halos
    for (int i = 0; i < num_gpus-1; i++)
        cudaMemcpyPeerAsync( ..., stream_halo[i] );
    for (int i = 0; i < num_gpus; i++)
        cudaStreamSynchronize(stream_halo[i]);
    for (int i = 1; i < num_gpus; i++)
        cudaMemcpyPeerAsync( ..., stream_halo[i] );
    // synchronize before the next step
    for (int i = 0; i < num_gpus; i++) {
        cudaSetDevice(gpu[i]);
        cudaDeviceSynchronize();
        // swap input/output pointers
    }
}


Code For Looping through Time (the same loop as on the previous slide)
•  Two CUDA streams per GPU
   –  halo and internal
   –  each in an array indexed by GPU
   –  calls issued to different streams can overlap
•  Prior to the time-loop:
   –  create the streams
   –  enable P2P access

for( int istep=0; istep<nsteps; istep++) {

    for( int i=0; i<num_gpus; i++ ) {
        cudaSetDevice( gpu[i] );
        kernel<<<..., stream_halo[i]>>>( ... );
        kernel<<<..., stream_halo[i]>>>( ... );
        cudaStreamQuery( stream_halo[i] );
        kernel<<<..., stream_internal[i]>>>( ... );
    }

    for( int i=0; i<num_gpus-1; i++ )
        cudaMemcpyPeerAsync( ..., stream_halo[i] );
    for( int i=0; i<num_gpus; i++ )
        cudaStreamSynchronize( stream_halo[i] );
    for( int i=1; i<num_gpus; i++ )
        cudaMemcpyPeerAsync( ..., stream_halo[i] );

    for( int i=0; i<num_gpus; i++ ) {
        cudaSetDevice( gpu[i] );
        cudaDeviceSynchronize();
        /* swap input/output pointers */
    }

}

Page 190

GPUs  in  Different  Cluster  Nodes  

•  Add network communication to the halo exchange
   – Transfer data from the "edge" GPUs of each node to the CPU
   – Hosts exchange via MPI (or another API)
   – Transfer data from the CPU to the "edge" GPUs of each node

•  With CUDA >= 4.0, transparent interoperability between CUDA and InfiniBand:
     export CUDA_NIC_INTEROP=1
   lets MPI use, for Remote Direct Memory Access (RDMA) operations, memory pinned by the GPU. A sketch of the host-staged exchange is shown below.
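A minimal sketch of one step of the host-staged halo exchange (the buffer names h_send, h_recv, d_halo_send, d_halo_recv, the rank neighbour and the size halo_bytes are illustrative, not from the course code):

/* stage the outgoing halo through host memory (ideally allocated with cudaMallocHost) */
cudaMemcpy( h_send, d_halo_send, halo_bytes, cudaMemcpyDeviceToHost );
MPI_Sendrecv( h_send, halo_bytes, MPI_CHAR, neighbour, 0,
              h_recv, halo_bytes, MPI_CHAR, neighbour, 0,
              MPI_COMM_WORLD, MPI_STATUS_IGNORE );
cudaMemcpy( d_halo_recv, h_recv, halo_bytes, cudaMemcpyHostToDevice );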

Page 191

Inter-GPU Communication with MPI
•  Example: simpleMPI.c

   – Generate some random numbers on one node.
   – Dispatch them to all nodes.
   – Compute their square root on each node's GPU.
   – Compute the average of the results by using MPI.

To compile:

$nvcc -c simpleCUDAMPI.cu

$mpic++ -o simpleMPI simpleMPI.c simpleCUDAMPI.o
        -L/usr/local/cuda/lib64/ -lcudart

To run:

$mpirun -np 2 simpleMPI

On  a  single  line!  
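For orientation only, a minimal sketch of how such a host/device split might look (this is not the actual simpleMPI source; names and sizes are illustrative):

/* simpleCUDAMPI.cu (device side, sketch) */
__global__ void sqrt_kernel( float *data, int n ) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i < n ) data[i] = sqrtf( data[i] );
}
extern "C" void gpu_sqrt( float *h_data, int n ) {
    float *d_data;
    cudaMalloc( &d_data, n * sizeof(float) );
    cudaMemcpy( d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice );
    sqrt_kernel<<< (n + 255)/256, 256 >>>( d_data, n );
    cudaMemcpy( h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost );
    cudaFree( d_data );
}

/* simpleMPI.c (host side, sketch): scatter the data, call the GPU, reduce the result */
MPI_Scatter( all_data, n_per_rank, MPI_FLOAT, my_data, n_per_rank, MPI_FLOAT, 0, MPI_COMM_WORLD );
gpu_sqrt( my_data, n_per_rank );           /* each rank uses its own GPU */
MPI_Reduce( &local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD );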

Page 192

GPU-­‐aware  MPI  

•  Support GPU to GPU communication through standard MPI interfaces:
   – e.g. enable MPI_Send, MPI_Recv from/to GPU memory
   – Made possible by Unified Virtual Addressing (UVA) in CUDA 4.0

•  Provide high performance without exposing low-level details to the programmer:
   – Pipelined data transfer that automatically provides optimizations inside the MPI library without user tuning

From: MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters, H. Wang, S. Potluri, M. Luo, A. K. Singh, S. Sur, D. K. Panda

Page 193

Naive  code  with  “old  style”  MPI    

•  Sender:
     cudaMemcpy( s_buf, s_device, size, cudaMemcpyDeviceToHost );
     MPI_Send( s_buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD );

•  Receiver:
     MPI_Recv( r_buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &status );
     cudaMemcpy( r_device, r_buf, size, cudaMemcpyHostToDevice );

Page 194

Optimized code with "old style" MPI

•  Pipelining at user level with non-blocking MPI and CUDA functions

•  Sender:
     for( j = 0; j < pipeline_len; j++ )
         cudaMemcpyAsync( s_buf + j*block_sz, s_device + j*block_sz, … );
     for( j = 0; j < pipeline_len; j++ ) {
         while( result != cudaSuccess ) {
             result = cudaStreamQuery( … );
             if( j > 0 ) MPI_Test( … );
         }
         MPI_Isend( s_buf + j*block_sz, block_sz, MPI_CHAR, 1, 1, … );
     }
     MPI_Waitall( … );

Page 195

Code with GPU-aware MPI

•  Standard MPI interfaces from device buffers

   Sender:
     MPI_Send( s_dev, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD );

   Receiver:
     MPI_Recv( r_dev, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &status );

•  The MPI library can automatically differentiate between device memory and host memory
•  Overlap CUDA copy and RDMA transfer
•  Pipeline DMA of data from the GPU and InfiniBand RDMA

Page 196


Major  GPU  Performance  Detractors  

ü  Sometimes it is better to recompute than to cache
   ü The GPU spends its transistors on ALUs, not on cache!

ü  Do more computation on the GPU to avoid costly data transfers
   ü Even low-parallelism computations can at times be faster than transferring data back and forth to the host
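A rough estimate makes the point (assuming ~8 GB/s for a PCIe 2.0 x16 link and ~200 GB/s device-memory bandwidth on a Kepler-class GPU): copying a 100 MB array to the host and back costs about 2 × 12.5 ms = 25 ms, whereas streaming the same 100 MB through a kernel from device memory takes roughly 0.5 ms, so even a poorly parallelized on-GPU computation can beat the round trip over the bus.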

Page 197


Major  GPU  Performance  Detractors  

ü  Partition your computation to keep the GPU multiprocessors equally busy
   ü Many threads, many thread blocks

ü  Keep resource usage low enough to support multiple active thread blocks per multiprocessor

•  Max number of blocks per SM: 16 (Kepler)
•  Max number of threads per SM: 2048 (Kepler)
•  Max number of threads per block: 1024 (Fermi & Kepler)
•  Total number of registers per SM: 65536 (Kepler)
•  Size of shared memory per SM: 48 KB (Fermi & Kepler)
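A quick worked example with the figures above: with 256-thread blocks, a Kepler SM can host at most 2048 / 256 = 8 resident blocks, but only if each block stays within 65536 / 8 = 8192 registers (32 registers per thread) and 48 KB / 8 = 6 KB of shared memory; exceeding either budget reduces the number of resident blocks and, with it, the ability to hide memory latency.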

Page 198

A useful tool: the CUDA Profiler
ü What it does:

   ü Provide access to hardware performance monitors
   ü Pinpoint performance issues and compare across implementations

ü Two interfaces:
   ü Text-based:

      ü Built-in and included with the compiler
        export CUDA_PROFILE=1 (bash) or setenv CUDA_PROFILE 1 (tcsh)
        export CUDA_PROFILE_LOG=XXXXX or setenv CUDA_PROFILE_LOG XXXXX
        where XXXXX is any file name you want

   ü GUI: specific for each platform

Exercise:  run  the  MMGlob  program  under  the  control  of    the  profiler  
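One possible way to do the exercise (bash syntax; the log file name is just an illustrative choice):

$export CUDA_PROFILE=1
$export CUDA_PROFILE_LOG=mmglob_profile.log
$./MMGlob
$cat mmglob_profile.log

The text log typically reports, for each kernel launch and memory copy, fields such as GPU time, CPU time and occupancy.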

Page 199

Final  CUDA  Highlights    

•  The CUDA API is a minimal extension of the ANSI C programming language
   → Low learning curve

•  The memory hierarchy is directly exposed to the programmer to allow optimal control of data flow
   → Huge differences in performance with, apparently, tiny changes to the data organization
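A classic illustration of such a "tiny" change (a sketch, not taken from the course samples): moving from an array of structures (AoS) to a structure of arrays (SoA) turns strided, poorly coalesced loads into coalesced ones.

struct Particle { float x, y, z, w; };

/* AoS: thread i reads p[i].x, so consecutive threads touch addresses
   16 bytes apart -> strided, poorly coalesced accesses */
__global__ void scale_aos( struct Particle *p, int n ) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i < n ) p[i].x *= 2.0f;
}

/* SoA: consecutive threads read consecutive floats -> fully coalesced accesses */
__global__ void scale_soa( float *x, int n ) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i < n ) x[i] *= 2.0f;
}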

Page 200

What is missing?
•  ECC effect
•  Debugging and advanced profiling
•  Single virtual address space and other variants of memory copy operations
•  Many GPU libraries

Thanks!