GPU Computing with CUDA: Lecture 1 - Introduction
Christopher Cooper Boston University
August 2011, UTFSM, Valparaíso, Chile
General Ideas
‣Objectives
- Learn CUDA
- Recognize CUDA friendly algorithms and practices
‣ Requirements
- C/C++ (hopefully)
‣ Lab-oriented course!
Resources
‣Material available at http://www.bu.edu/pasi/materials/post-pasi-training/
‣Machines to be used
- Cluster USM
- BUNGEE/BUDGE at Boston University
‣Main references
- Kirk, D. and Hwu, W. Programming Massively Parallel Processors
- Sanders, J. and Kandrot, E. CUDA by Example
- CUDA C Programming Guide
- CUDA C Best Practices Guide
Outline of course
‣Week 1
- Basic CUDA
‣Week 2
- Optimization
‣Week 3
- CUDA Libraries
‣Week 4
- Applications
Outline
‣Understanding the need for multicore architectures
‣Overview of the GPU hardware
‣ Introduction to CUDA
- Main features
- Thread hierarchy
- Simple example
- Concepts behind a CUDA friendly code
Moore’s Law
‣ Transistor count of integrated circuits doubles every two years
Getting performance
‣We have more and more transistors; so what?
‣How can we get more performance?
- Increase processor speed
‣ Clock rates have gone from MHz to GHz over the last 30 years
‣ Power scales roughly as frequency³!
‣ Memory needs to catch up
- Parallel execution
‣ Concurrency, multithreading
Moore’s Law
‣ Serial performance scaling reached its peak
‣ Processors are not getting faster, but wider
‣Challenge: parallel thinking
Graphics Processing Units (GPU)
Graphics Processing Units (GPU)
‣The GPU is hardware specially designed for highly parallel applications (graphics)
[Figure 1-1, CUDA C Programming Guide: Floating-Point Operations per Second and Memory Bandwidth for the CPU and GPU]
Graphics Processing Units (GPU)
‣ Fast processing must come with high bandwidth!
[Figure 1-1, CUDA C Programming Guide: Floating-Point Operations per Second and Memory Bandwidth for the CPU and GPU]
Graphics Processing Units (GPU)
‣Graphics market is hungry for better and faster rendering
‣Development is pushed by this HUGE industry
- High quality product
- Cheap!
‣ Proven to be a real alternative for scientific applications
Top 500 list. June 2011
GPU chip design
‣CPU vs GPU
[Figure 1-2, CUDA C Programming Guide: The GPU Devotes More Transistors to Data Processing. Schematic CPU vs. GPU chip layouts showing Control, ALU, Cache, and DRAM areas]
GPU devotes more transistors to data processing
GPU chip design
‣A glance at the GeForce 8800
[Block diagram of the GeForce 8800: Host, Input Assembler, and Thread Execution Manager feed an array of streaming multiprocessors, each containing streaming processors, a Parallel Data Cache, and Texture units; Load/Store units connect to Global Memory]
GPU chip design
‣Cores?
- 3 billion transistors
- 512 CUDA cores (32 in each of 16 SMs)
- 64 kB of on-chip RAM
- High bandwidth
[Annotated die photo: SM, SP]
GPU chip design
‣ The GPU core is the streaming processor (SP)
‣ Streaming processors are grouped into streaming multiprocessors (SMs)
- An SM is basically a SIMD processor (Single Instruction, Multiple Data)
GPU chip design
‣Core ideas
- GPUs consist of many “simple” cores
‣ Designed for many simple tasks: high throughput
‣ Fewer control units: higher latency per thread
GPU: maximize throughput of all threads
CPU: minimize latency of a single thread
Parallel thinking
‣ Latency? Throughput?
- Latency: time to complete a task (execution time)
- Throughput: number of tasks completed in a fixed time (bandwidth)
‣Decisions!?!
- Depends on the problem
“If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?”
Seymour Cray
Parallel thinking
‣ 1024 chickens? We had better have a strategy!
- Rethink our algorithms to be more parallel friendly
- Massively parallel:
‣ Data parallelism
‣ Load balancing
‣ Regular computations
‣ Data access
‣ Avoid conflicts
‣ and so on...
Compute Unified Device Architecture (CUDA)
‣ Parallel computer architecture developed by NVIDIA
‣General purpose programming model
‣ Specially designed for general-purpose GPU computing:
- Offers an API designed for computing
- Explicit GPU memory management
CUDA enabled GPUs
‣What is the difference between GPUs?
- CUDA enabled GPUs
- GPUs are classified according to compute capability
See the CUDA C Programming Guide, Appendices A and F
CUDA - Main features
‣C/C++ with extensions
‣Heterogeneous programming model:
- Code runs on both the CPU (host) and the GPU (device)
CUDA - Threads
‣ In CUDA, a kernel is executed by many threads
- A thread is a single sequence of execution
- Multi-threading: many threads run at the same time
The CUDA kernel (one thread per element):

```c
__global__ void vec_add(float *A, float *B, float *C, int N)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i >= N) { return; }
    C[i] = A[i] + B[i];
}
```

The equivalent serial C version:

```c
void vec_add(float *A, float *B, float *C, int N)
{
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}
```
CUDA - Threads
‣ Threads are grouped into thread blocks
- Programming abstraction
- All threads within a thread block run on the same SM
- Threads of the same block can communicate
‣ Thread blocks form a grid
CUDA - Threads
‣CUDA virtualizes the physical hardware
- A thread is a virtualized scalar processor
- A thread block is a virtualized multiprocessor
‣ Thread blocks need to be independent
- They run to completion
- There is no guarantee of the order in which they run
CUDA - Threads
‣ Provides automatic scalability across GPUs
- Thread hierarchy
- Shared memories
- Barrier synchronization
CUDA - Threads
‣ Threads have a unique combination of block ID and thread ID
- We can operate in different parts of the data
- SIMT: Single Instruction Multiple Threads
- Threads: 1D, 2D or 3D
- Blocks: 1D or 2D
CUDA - Extensions to C
‣ Minimal effort from C to CUDA
‣ Function declarations

```c
__global__ void kernel()
__device__ float function()
```

‣ Variable qualifiers

```c
__shared__ float array[5];
__constant__ float array[5];
```

‣ Execution configuration

```c
dim3 dim_grid(100, 100);   // 10000 blocks
dim3 dim_block(16, 16);    // 256 threads per block, in 2D
kernel<<<dim_grid, dim_block>>>();  // Launch kernel
```

‣ Others: threadIdx.x, blockIdx.x
CUDA - Program execution
Allocate and initialize data on CPU
Allocate data on GPU
Transfer data from CPU to GPU
Run kernel
Transfer data from GPU to CPU
CUDA - Vector add example
```c
__global__ void vec_add(float *A, float *B, float *C, int N)
{
    // Thread indexing: using 1D thread ID and block ID
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    // Perform the operation
    if (i >= N) { return; }
    C[i] = A[i] + B[i];
}
```

```c
int main()
{
    ....
    // Launch kernel with N/256 blocks of 256 threads
    int blocks = int(N - 0.5)/256 + 1;
    vec_add<<<blocks, 256>>>(A_d, B_d, C_d, N);
}
```
CUDA - Vector add example
```c
int N = 200;

// Allocate and initialize on the host (CPU)
float *A_h = new float[N];
float *B_h = new float[N];
float *C_h = new float[N];

for (int i = 0; i < N; i++) { A_h[i] = 1.3f; B_h[i] = 2.0f; }

// Allocate memory on the GPU
float *A_d, *B_d, *C_d;
cudaMalloc((void**)&A_d, N*sizeof(float));
cudaMalloc((void**)&B_d, N*sizeof(float));
cudaMalloc((void**)&C_d, N*sizeof(float));

// Copy host memory to device
cudaMemcpy(A_d, A_h, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B_h, N*sizeof(float), cudaMemcpyHostToDevice);
```
CUDA - Vector add example
```c
// Copy host memory to device
cudaMemcpy(A_d, A_h, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B_h, N*sizeof(float), cudaMemcpyHostToDevice);

// Launch kernel: grid of N/256 blocks of 256 threads each
int blocks = int(N - 0.5)/256 + 1;
vec_add<<<blocks, 256>>>(A_d, B_d, C_d, N);

// Copy back from device to host
cudaMemcpy(C_h, C_d, N*sizeof(float), cudaMemcpyDeviceToHost);

// Free device memory
cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
```
CUDA - Vector add example
```c
__global__ void vec_add(float *A, float *B, float *C, int N)
{
    // Using 1D thread ID and block ID
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i >= N) { return; }
    C[i] = A[i] + B[i];
}
```

All threads execute the same kernel in parallel: thread 0 fetches A[0] and B[0], operates, and writes C[0]; thread i fetches A[i] and B[i], operates, and writes C[i]; and so on up to thread N-1.
CUDA - To keep in mind
‣ Keep executions simple
- Leave tough things to the CPU
‣ Performance:
- Load balance
- Data parallelism
- Avoid conflicts
- Keep the GPU busy!
‣How fast can we go?
CUDA - To keep in mind
‣ Load balance
- All threads should do the same amount of work
‣Data Parallelism (Massive)
- Arrange your algorithm so it is data-parallel friendly
- Look for regularity
‣Avoid conflicts
- Data access conflicts will serialize your application or give you wrong answers!
CUDA - To keep in mind
‣ Keep the GPU busy!
- High peak compute throughput compared to bandwidth
- Fermi:
‣ 1 Tflop/s peak throughput, 144 GB/s peak off-chip memory bandwidth (36 Gfloats per second)
‣ Feeding the peak 1 Tflop/s with data would require 4 bytes × 1000 Gflop/s = 4000 GB/s!
‣ 1000/36 ≈ 28 operations per fetched value
- Need to hide latency!
CUDA - To keep in mind
‣How fast can we go?
- Keep Amdahl’s law in mind
- Know how much parallelization can be done in your application
- Measure independent parts of your algorithm before going to the GPU
P: parallel portion of the runtime
S: speedup achieved on the parallel portion

speedup = 1 / ((1 − P) + P/S)
CUDA - Applications
‣ Ultrasound imaging
‣ Molecular dynamics
‣ Seismic migration
‣ Astrophysics simulations
‣ Graphics
‣ ...
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009. ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign.