![Page 1: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/1.jpg)
CS 380 - GPU and GPGPU Programming
Lecture 8+9: GPU Architecture 7+8
Markus Hadwiger, KAUST
![Page 2: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/2.jpg)
Reading Assignment #5 (until March 12)
Read (required):
• Programming Massively Parallel Processors book, Chapter 3 (Introduction to CUDA)
• Programming Massively Parallel Processors book, Chapter 4 (CUDA Threads) up to and including 4.3
Read (optional):
• NVIDIA Fermi graphics (GF100) and compute white papers:
http://www.nvidia.com/object/IO_86775.html
http://www.nvidia.com/object/IO_86776.html
• NVIDIA Kepler (GK110) white papers:
http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf
http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
• NVIDIA Maxwell (GM107) white paper:
http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce-GTX-750-Ti-Whitepaper.pdf
![Page 3: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/3.jpg)
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
From Shader Code to a Teraflop: How Shader Cores Work
Kayvon Fatahalian, Stanford University
![Page 4: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/4.jpg)
![Page 5: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/5.jpg)
My chip!
16 cores
8 mul-add ALUs per core (128 total)
16 simultaneous instruction streams
64 concurrent (but interleaved) instruction streams
512 concurrent fragments
= 256 GFLOPs (@ 1GHz)
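The figures on this slide follow from a few multiplications. The sketch below reproduces them, assuming each mul-add ALU counts as 2 floating-point operations per clock and that the 64 interleaved streams are spread evenly over the cores (4 per core). The function and variable names are illustrative and not taken from the original talk.

```python
def peak_gflops(cores: int, alus_per_core: int, clock_ghz: float) -> float:
    """Peak throughput in GFLOPS, counting one mul-add as 2 floating-point ops."""
    flops_per_alu_per_cycle = 2  # multiply + add (assumption stated in the lead-in)
    return cores * alus_per_core * flops_per_alu_per_cycle * clock_ghz


cores = 16
alus_per_core = 8
streams_per_core = 4  # derived: 64 interleaved streams / 16 cores

total_alus = cores * alus_per_core                 # mul-add ALUs on the chip
concurrent_streams = cores * streams_per_core      # interleaved instruction streams
concurrent_fragments = concurrent_streams * alus_per_core  # one fragment per ALU lane per stream

print(total_alus, concurrent_streams, concurrent_fragments)
print(peak_gflops(cores, alus_per_core, clock_ghz=1.0))
```

The same function applied to 32 cores with 16 ALUs each at 1 GHz gives about 1 TFLOP, matching the "enthusiast" chip.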
![Page 6: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/6.jpg)
My “enthusiast” chip!
32 cores, 16 ALUs per core (512 total) = 1 TFLOP (@ 1 GHz)
![Page 7: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/7.jpg)
![Page 8: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/8.jpg)
![Page 9: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/9.jpg)
![Page 10: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/10.jpg)
![Page 11: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/11.jpg)
KAUST King Abdullah University of Science and Technology
![Page 12: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/12.jpg)
![Page 13: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/13.jpg)
![Page 14: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/14.jpg)
![Page 15: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/15.jpg)
![Page 16: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/16.jpg)
![Page 17: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/17.jpg)
![Page 18: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/18.jpg)
NVIDIA G80/GT200 Architecture
• Streaming Processor (SP)
• Streaming Multiprocessor (SM)
• Texture/Processing Cluster (TPC)
Courtesy AnandTech
![Page 19: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/19.jpg)
NVIDIA G80/GT200 Architecture
• G80/G92: 8 TPCs * ( 2 * 8 SPs ) = 128 SPs
• GT200: 10 TPCs * ( 3 * 8 SPs ) = 240 SPs
• Arithmetic intensity has increased (ALUs vs. texture units)
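As a quick check of the SP counts above, here is a minimal sketch that multiplies out the TPC / SM / SP hierarchy. The per-TPC growth factor is only a proxy for the ALU-to-texture ratio, under the assumption that the texture hardware attached to each TPC stays roughly constant between the two generations. The helper names are made up for illustration.

```python
SPS_PER_SM = 8  # streaming processors per SM on both G80 and GT200


def sps_per_tpc(sms_per_tpc: int) -> int:
    """Streaming processors that share one texture/processing cluster."""
    return sms_per_tpc * SPS_PER_SM


def total_sps(tpcs: int, sms_per_tpc: int) -> int:
    """Total streaming processors on the chip."""
    return tpcs * sps_per_tpc(sms_per_tpc)


g80 = total_sps(tpcs=8, sms_per_tpc=2)
gt200 = total_sps(tpcs=10, sms_per_tpc=3)

# ALUs per TPC grew from 16 to 24, i.e. by a factor of 1.5.
alu_growth_per_tpc = sps_per_tpc(3) / sps_per_tpc(2)

print(g80, gt200, alu_growth_per_tpc)
```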
G80 / G92, GT200
Courtesy AnandTech
![Page 20: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/20.jpg)
NVIDIA GT200 GPGPU Hardware
NVIDIA Tesla 10-series
• Based on GT200 architecture
• 1 Teraflop / device
• 4 GB RAM / device
• Multiple devices per node / machine
Tesla C1060
Tesla S1070
![Page 21: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/21.jpg)
NVIDIA Fermi / GF100 Hardware
GeForce GTX 580
• 512 CUDA cores (16 SMs)
• 1.5 GB memory
Tesla 20-series
• Cards: M2070/C2070, ...
• Blades: S2050/S2070
• 3 GB or 6 GB / GPU, ECC memory
![Page 22: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/22.jpg)
NVIDIA Fermi / GF100 Features
Names
• Compute: Fermi; product: Tesla-20 series
• Graphics: GF100 (product: GeForce GTX 480, 580, ...)
Compute capability 2.1 / 2.0; PTX ISA 3.0 / 2.x
• http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ptx_isa_3.0.pdf
• http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/ptx_isa_2.0.pdf
L1 and L2 caches
More CUDA cores (up to 512)
Faster double-precision floating-point performance, faster atomics, floating-point atomics
DirectX 11 and OpenGL 4 functionality
• New shader types, scatter writes to images, ...
![Page 23: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/23.jpg)
NVIDIA Fermi / GF100 Stats
![Page 24: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/24.jpg)
Streaming Multiprocessor
Streaming processors are now CUDA cores
32 CUDA cores per Fermi streaming multiprocessor (SM)
16 SMs = 512 CUDA cores
CPU-like cache hierarchy
• L1 cache / shared memory
• L2 cache
Texture units and caches now in SM (instead of with TPC = multiple SMs in GT200)
![Page 25: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/25.jpg)
Dual Warp Schedulers
![Page 26: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/26.jpg)
Graphics Processor Clusters (GPC)
(instead of TPC on GT200)
4 Streaming Multiprocessors (SMs) per GPC
32 CUDA cores / SM
4 SMs / GPC = 128 cores / GPC
Decentralized rasterization and geometry
• 4 raster engines
• 16 "PolyMorph" engines
![Page 27: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/27.jpg)
NVIDIA Fermi / GF100 Structure
Full size
• 4 GPCs
• 4 SMs each
• 6 64-bit memory controllers (= 384 bit)
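The full-size GF100 numbers can be derived from the hierarchy on this and the preceding slides. The following is a small sketch under that reading, using 32 CUDA cores per SM as stated earlier. `ChipLayout` and its field names are hypothetical and do not correspond to any NVIDIA API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipLayout:
    """Hypothetical description of a GPU's top-level structure."""
    gpcs: int
    sms_per_gpc: int
    cores_per_sm: int
    memory_controllers: int
    controller_width_bits: int

    @property
    def sms(self) -> int:
        return self.gpcs * self.sms_per_gpc

    @property
    def cuda_cores(self) -> int:
        return self.sms * self.cores_per_sm

    @property
    def bus_width_bits(self) -> int:
        return self.memory_controllers * self.controller_width_bits


# Full-size GF100 as listed on the slides.
gf100 = ChipLayout(gpcs=4, sms_per_gpc=4, cores_per_sm=32,
                   memory_controllers=6, controller_width_bits=64)

print(gf100.sms, gf100.cuda_cores, gf100.bus_width_bits)
```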
![Page 28: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/28.jpg)
NVIDIA Fermi / GF100 Die
Full size
• 4 GPCs
• 4 SMs each
![Page 29: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/29.jpg)
Compute Capability 2.0
• 1024 threads / block
• More threads / SM
• 32K registers / SM
• New synchronization functions
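The two numeric limits above interact: the more registers each thread uses, the fewer threads fit on one SM. Below is a rough sketch of that bound, assuming registers are the only limiting resource. It ignores register allocation granularity, the per-SM cap on resident threads, and shared memory usage, all of which matter on real hardware. The function names are illustrative.

```python
REGISTERS_PER_SM = 32 * 1024   # "32K registers / SM" from the slide
MAX_THREADS_PER_BLOCK = 1024   # "1024 threads / block" from the slide


def max_threads_by_registers(regs_per_thread: int) -> int:
    """Upper bound on resident threads per SM if registers were the only limit."""
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    return REGISTERS_PER_SM // regs_per_thread


def block_fits(threads_per_block: int, regs_per_thread: int) -> bool:
    """True if a single block respects both limits listed on the slide."""
    within_block_limit = threads_per_block <= MAX_THREADS_PER_BLOCK
    within_register_file = threads_per_block * regs_per_thread <= REGISTERS_PER_SM
    return within_block_limit and within_register_file


print(max_threads_by_registers(32))
print(block_fits(1024, 32), block_fits(1024, 33))
```

A full 1024-thread block therefore only fits if each thread uses at most 32 registers.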
![Page 30: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/30.jpg)
L1 Cache vs. Shared Memory
Two different configs
• 64KB total
• 16KB shared, 48KB L1 cache
• 48KB shared, 16KB L1 cache
• Set per kernel
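One way to think about the per-kernel choice: pick the split that leaves the most L1 cache while still covering the kernel's shared memory needs. The sketch below models only that decision. The config names and `pick_config` are hypothetical. In CUDA itself the preference is requested through the runtime call `cudaFuncSetCacheConfig`, which is not shown here.

```python
TOTAL_KB = 64

# name -> (shared memory KB, L1 cache KB); the two splits listed on the slide
CONFIGS = {
    "prefer_shared": (48, 16),
    "prefer_l1": (16, 48),
}


def pick_config(shared_kb_needed: int) -> str:
    """Return the split with the largest L1 that still fits the shared memory need."""
    fitting = [(l1_kb, name)
               for name, (shared_kb, l1_kb) in CONFIGS.items()
               if shared_kb >= shared_kb_needed]
    if not fitting:
        raise ValueError("kernel needs more shared memory than any config offers")
    return max(fitting)[1]


# Both splits add up to the 64 KB of on-chip memory per SM.
assert all(shared + l1 == TOTAL_KB for shared, l1 in CONFIGS.values())

print(pick_config(8), pick_config(32))
```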
![Page 31: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/31.jpg)
Global Memory Access
Cached on Fermi
L1 cache per SM
Global L2 cache
Compile-time flag can choose:
• Caching in both L1 and L2
• Caching only in L2
Cache line size (L1, L2):
• 128 bytes
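The 128-byte line size explains why access patterns matter: the fewer cache lines a group of threads touches, the fewer memory transactions are needed. The sketch below counts distinct lines for a strided access, assuming a group of 32 threads (the CUDA warp size, which is not stated on this slide) each reading one 4-byte element. It is a simplified model and not a description of the actual hardware coalescing logic.

```python
CACHE_LINE_BYTES = 128  # from the slide
WARP_SIZE = 32          # assumption: threads that issue a memory access together


def lines_touched(base: int, stride_bytes: int, elem_bytes: int = 4,
                  threads: int = WARP_SIZE) -> int:
    """Count distinct 128-byte cache lines covered by one strided access."""
    lines = set()
    for t in range(threads):
        first = base + t * stride_bytes   # first byte read by thread t
        last = first + elem_bytes - 1     # last byte read by thread t
        lines.update(range(first // CACHE_LINE_BYTES,
                           last // CACHE_LINE_BYTES + 1))
    return len(lines)


print(lines_touched(base=0, stride_bytes=4))    # contiguous and aligned
print(lines_touched(base=4, stride_bytes=4))    # contiguous but misaligned
print(lines_touched(base=0, stride_bytes=128))  # one line per thread
```

Contiguous, aligned accesses fit in a single line, a small misalignment spills into a second line, and a 128-byte stride touches a separate line for every thread.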
![Page 32: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/32.jpg)
NVIDIA Kepler Architecture
Two different versions
• GK104, compute capability 3.0
– GeForce GTX 680, …
– Quadro K5000
– Tesla K10 series
• GK110, compute capability 3.5
– GeForce GTX Titan (just released!)
– Tesla K20 series
![Page 33: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/33.jpg)
GF100 Graphics Pipeline
• Input Assembler
• Vertex Shader
• Hull Shader
• Tessellator
• Domain Shader
• Geometry Shader
• Stream Output
• Rasterizer
• Pixel Shader
• Output Merger
![Page 34: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/34.jpg)
NVIDIA Kepler / GK104 Structure
Full size
• 4 GPCs
• 2 SMXs each
= 8 SMXs, 1536 CUDA cores
![Page 35: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/35.jpg)
GK104 SMX
• 192 CUDA cores
• 32 LD/ST units
• 16 SFUs
• 16 texture units
![Page 36: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/36.jpg)
NVIDIA Kepler / GK110 Structure
Full size
• 15 SMXs
• 2880 CUDA cores
![Page 37: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/37.jpg)
GK110 SMX
• 192 CUDA cores
• 64 DP units
• 32 LD/ST units
• 16 SFUs
• 16 texture units
New read-only data cache (48KB)
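The per-SMX unit counts scale up to chip totals by simple multiplication. The sketch below does this for both Kepler chips from the slides and derives the ratio of double-precision units to CUDA cores on GK110. The variable names are illustrative, and unit counts alone say nothing about clocks or achieved throughput.

```python
from fractions import Fraction

CUDA_CORES_PER_SMX = 192  # same for GK104 and GK110

# GK104: 4 GPCs with 2 SMXs each
gk104_smxs = 4 * 2
gk104_cores = gk104_smxs * CUDA_CORES_PER_SMX

# GK110: 15 SMXs, each with 64 double-precision units
gk110_smxs = 15
dp_units_per_smx = 64
gk110_cores = gk110_smxs * CUDA_CORES_PER_SMX
gk110_dp_units = gk110_smxs * dp_units_per_smx

# Ratio of DP units to CUDA cores within one GK110 SMX
dp_to_sp_ratio = Fraction(dp_units_per_smx, CUDA_CORES_PER_SMX)

print(gk104_cores, gk110_cores, gk110_dp_units, dp_to_sp_ratio)
```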
![Page 38: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/38.jpg)
Compute Capabilities 2.0 – 3.5
![Page 39: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/39.jpg)
Maxwell vs. Kepler Architecture
GM107
![Page 40: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/40.jpg)
Maxwell vs. Kepler Architecture
GK107
vs.
GM107
![Page 41: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture](https://reader033.vdocuments.us/reader033/viewer/2022060812/609078a86fcd3a4c0558db42/html5/thumbnails/41.jpg)
Thank you.