CS179: GPU Programming
Lecture 1: Introduction
Today
- Course summary
- Administrative details
- Brief history of GPU computing
- Introduction to CUDA
Course Summary: GPU Programming
What: the GPU (graphics processing unit), highly parallel hardware, and the APIs for programming it
Why: parallel processing
Course Summary: Why GPU?
Course Summary: Why GPU? How many cores, exactly?
- GeForce 8800 Ultra (2007): 128
- GeForce GTX 260 (2008): 192
- GeForce GTX 295 (2009): 480*
- GeForce GTX 480 (2010): 480
- GeForce GTX 590 (2011): 1024*
- GeForce GTX 690 (2012): 3072*
- GeForce GTX Titan Z (2014): 5760*

* These cards ship with 2 GPUs, effectively doubling the cores.
Course Summary: Why GPU? What kinds of speedups do we get?
Course Summary: Overview
What will you learn?
- CUDA
- Parallelizing problems
- Optimizing GPU code
- CUDA libraries

What will we not cover?
- OpenGL
- C/C++
Administrative: Course Details
Website: http://courses.cms.caltech.edu/cs179/
Course instructors/TAs:
- Connor DeFanti ([email protected])
- Kevin Yuh ([email protected])
Overseeing instructor: Al Barr ([email protected])
Class time: MWF 5:00-5:55 PM
Administrative: Assignments
Homework:
- 8 assignments, each worth 10% of your grade (100 pts. each)

Final project:
- 2 weeks for a custom final project; the details are up to you!
- 20% of your grade (200 pts.)
Administrative: Assignments
Assignments are due Wednesday at 5 PM. Extensions may be granted: talk to the TAs beforehand!
Office hours (104 ANB):
- Connor: Tuesday, 8-10 PM
- Kevin: Monday, 8-10 PM
Administrative: Assignments
Doing the assignments:
- A CUDA-capable machine (i.e. one with an NVIDIA GPU) is required
- Setting up the environment can be tricky
- Three options:
  - DIY with your own setup
  - Use the provided instructions with the given environment
  - Use the lab machines
Administrative: Assignments
Submitting assignments:
- Due date: Wednesday at 5 PM
- Submit each assignment as a .tar/.zip or similar
- Include a README file: name, compilation instructions, answers to the sets' conceptual questions, etc.
- Submit all assignments to [email protected]

Receiving graded assignments:
- Assignments should come back one week after submission
- We will email you back with your grade and comments
GPU History: Early Days
Before GPUs:
- All graphics ran on the CPU, with each pixel drawn in series
- Super slow! (CS171, anyone?)

Early GPUs:
- 1980s: blitters (fixed image sprites) allowed fast image memory transfers
- 1990s: DirectX and OpenGL were introduced, bringing the fixed-function pipeline for rendering
GPU History: Early Days
The fixed-function pipeline:
- "Fixed" OpenGL states: Phong or Gouraud shading? Render as wireframe or solid?
- Very limiting, and it made early games look similar
GPU History: Shaders
Early 2000s: shaders were introduced, allowing for much more interesting shading models.
GPU History: Shaders
Shaders greatly expanded the world of rendering:
- Vertex shaders: apply operations per vertex
- Fragment shaders: apply operations per pixel
- Geometry shaders: apply operations that add new geometry
GPU History: Shaders
These are great when dealing with graphics data (vertices, faces, pixels, etc.), but what about general-purpose computation?
- We can trick the GPU into doing it
- The DirectX "compute" shader may be an option
- Anything slicker?
GPU History: CUDA
2007: NVIDIA introduces CUDA
- A C-style programming API for the GPU
- Easier GPGPU
- Easier memory handling
- Better tools, libraries, etc.
GPU History: CUDA
New advantages on the table:
- Scattered reads
- Shared memory
- Faster memory transfers to/from the GPU
GPU History: Other APIs
Plenty of other APIs exist for GPGPU:
- OpenCL/WebCL
- DirectX compute shaders
- Others
Using the GPU
The GPU is used for the highly parallelizable parts of computational problems.
A Simple Problem…
Add two arrays: A[] + B[] -> C[]
On the CPU:
    (allocate memory for C)
    for (i from 1 to array length)
        C[i] <- A[i] + B[i]
Operates sequentially… can we do better?
A Simple Problem…
On the CPU (multi-threaded):
    (allocate memory for C)
    create a number of threads equal to the number of cores (around 2, 4, perhaps 8)
    (allocate portions of A, B, C to each thread...)
    in each thread:
        for (each i in the region assigned to this thread)
            C[i] <- A[i] + B[i]  // lots of waiting on memory reads and writes
    wait for the threads to synchronize...
Slightly faster: 2-8x (slightly more with other tricks)
A Simple Problem…
How many threads? How does performance scale?
Context switching carries a high penalty on the CPU!
A Simple Problem…
On the GPU:
    (allocate memory for A, B, C on the GPU)
    create the "kernel": each thread will perform one (or a few) additions
    specify the following kernel operation:
        for (all i's assigned to this thread)
            C[i] <- A[i] + B[i]
    start ~20,000 (!) threads
    wait for the threads to synchronize...
Speedup: very high! (e.g. 10x, 100x)
GPU: Strengths Revealed
- Parallelism
- Low context-switch penalty!
We can "cover up" performance loss by creating more threads!
GPU Computing: Step by Step
1. Set up inputs on the host (CPU-accessible memory)
2. Allocate memory for inputs on the GPU
3. Copy inputs from host to GPU
4. Allocate memory for outputs on the host
5. Allocate memory for outputs on the GPU
6. Start the GPU kernel
7. Copy the output from GPU to host
(Copying can be asynchronous)
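These steps can be sketched in CUDA host code. This is a minimal illustration, not the course's reference solution; the kernel `add_kernel` and the launch sizes are assumptions, and error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread adds one element, with a bounds check.
__global__ void add_kernel(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// Host-side sequence following the steps above.
void gpu_add(const float *h_a, const float *h_b, float *h_c, int n) {
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;

    // Allocate memory for inputs and outputs on the GPU.
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Copy inputs from host to GPU.
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Start the GPU kernel: enough blocks to cover all n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    add_kernel<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Copy the output from GPU back to the host (this call synchronizes).
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}
```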
GPU: Internals
Blocks: groups of threads
- Can cooperate via shared memory
- Can synchronize with each other
- Max size: 512 or 1024 threads (hardware-dependent)

Warps: subgroups of threads within a block
- Execute "in-step" (in lockstep)
- Size: 32 threads
GPU: Internals
(Diagram: a block, its warps, and the SIMD processing unit they run on)
The Kernel
Our "parallel" function
A simple implementation (won't work for lots of values)
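The slide's code is only an image in the source; a minimal sketch of the kind of kernel it describes (the name is an assumption) uses one thread per element with no bounds check, which is exactly why it breaks once the array outgrows a single block's worth of threads:

```cuda
// Simplest possible vector-add kernel: one thread per element within a
// single block, no bounds check. Fails for arrays longer than the block
// size (at most 512 or 1024 threads, depending on hardware).
__global__ void add(const float *a, const float *b, float *c) {
    int i = threadIdx.x;  // this thread's index within its block
    c[i] = a[i] + b[i];
}
```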
Indexing
We can get a block ID and a thread ID within the block: combined, they give a unique thread ID!
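In CUDA these IDs come from built-in variables; a sketch (the kernel name and body are illustrative):

```cuda
__global__ void indexed_kernel(float *data) {
    // threadIdx.x: this thread's index within its block
    // blockIdx.x:  which block this thread belongs to
    // blockDim.x:  the number of threads per block
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread ID
    data[global_id] = (float)global_id;
}
```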
Calling the Kernel
…
Calling the Kernel (2)
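The launch code on these slides is also an image in the source; the CUDA launch syntax they illustrate looks like the fragment below (the kernel name, pointer names, and sizes are assumptions):

```cuda
// kernel<<<number_of_blocks, threads_per_block>>>(arguments);
int threadsPerBlock = 512;
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up to cover n
add<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c);
cudaDeviceSynchronize();  // wait for all GPU threads to finish
```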