nvidia fermi architecture - penn engineering · 2011. 11. 4. · fermi: unified address space...
NVIDIA Fermi Architecture
Joseph Kider
University of Pennsylvania
CIS 565 - Fall 2011
Administrivia
Project checkpoint on Monday
Sources
Patrick Cozzi, Spring 2011
NVIDIA CUDA Programming Guide
CUDA by Example
Programming Massively Parallel Processors
G80, GT200, and Fermi
November 2006: G80
June 2008: GT200
March 2010: Fermi (GF100)
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
New GPU Generation
What are the technical goals for a new GPU generation?
Improve existing application performance. How?
Advance programmability. In what ways?
Fermi: What’s More?
More total cores (SPs) – not SMs though
More registers: 32K per SM
More shared memory: up to 48 KB per SM
More Special Function Units (SFUs)
Fermi: What’s Faster?
Faster double precision – 8x over GT200
Faster atomic operations – 5-20x. What for?
Faster context switches:
Between applications – 10x
Between graphics and compute, e.g., OpenGL and CUDA
Fermi: What’s New?
L1 and L2 caches. For compute or graphics?
Dual warp scheduling
Concurrent kernel execution
C++ support
Full IEEE 754-2008 support in hardware
Unified address space
Error Correcting Code (ECC) memory support
Fixed function tessellation for graphics
G80, GT200, and Fermi
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
GT200 and Fermi
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Fermi Block Diagram
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
GF100
16 SMs
Each with 32 cores
512 total cores
Each SM hosts up to 48 warps, or 1,536 threads
In flight, up to 24,576 threads
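The figures on this slide follow from a few multiplications. A minimal sketch, assuming the usual CUDA warp size of 32 threads (the slide does not state it); the constant names are illustrative only:

```python
# Figures taken from the slide (GF100).
SMS = 16                # streaming multiprocessors
CORES_PER_SM = 32       # cores (SPs) per SM
MAX_WARPS_PER_SM = 48   # resident warps per SM

# Assumption: a warp is 32 threads (standard in CUDA, not stated on the slide).
THREADS_PER_WARP = 32

total_cores = SMS * CORES_PER_SM
threads_per_sm = MAX_WARPS_PER_SM * THREADS_PER_WARP
threads_in_flight = SMS * threads_per_sm

print(total_cores)        # 512
print(threads_per_sm)     # 1536
print(threads_in_flight)  # 24576
```

Note that each SM can hold 1,536 resident threads but has only 32 cores, so most resident threads are waiting at any given moment; that surplus is what lets the SM hide latency.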
Fermi SM
Why 32 cores per SM instead of 8?
Why not more SMs?
G80 – 8 cores
GT200 – 8 cores
GF100 – 32 cores
Fermi SM
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Dual warp scheduling. Why?
32K registers
32 cores
Floating point and integer unit per core
16 load/store units
4 SFUs
Fermi SM
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
16 SMs * 32 cores/SM = 512 floating point operations per cycle
Why not in practice?
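To turn the per-cycle figure into a peak rate, multiply by the clock. This is a rough sketch. The 1.0 GHz clock and the 60% utilization are placeholder values chosen for illustration, not measured or published numbers. Counting a fused multiply-add as two floating point operations is a common convention, not something the slide states.

```python
SMS = 16
CORES_PER_SM = 32

def peak_gflops(clock_ghz, flops_per_core_per_cycle=1):
    """Theoretical peak in GFLOP/s if every core retires work every cycle."""
    return SMS * CORES_PER_SM * flops_per_core_per_cycle * clock_ghz

CLOCK_GHZ = 1.0     # hypothetical clock, for illustration only
UTILIZATION = 0.6   # hypothetical fraction of cycles doing useful work

peak = peak_gflops(CLOCK_GHZ)                                  # one op per core per cycle
peak_fma = peak_gflops(CLOCK_GHZ, flops_per_core_per_cycle=2)  # FMA counted as two ops
achieved = peak * UTILIZATION                                  # what a real kernel might see

print(peak, peak_fma, achieved)
```

The utilization factor stands in for everything that keeps cores idle in practice, such as waiting on memory, divergent branches within a warp, and instructions that go to the load/store units or SFUs instead of the cores.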
Fermi SM
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Each SM
64 KB on-chip memory
48 KB shared memory / 16 KB L1 cache, or
16 KB shared memory / 48 KB L1 cache
Configurable by CUDA developer
Fermi Dual Warp Scheduling
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Slide from: http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_CUDA_luebke_Intro.pdf
Fermi Caches
Slide from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Fermi: Unified Address Space
64-bit virtual addresses
40-bit physical addresses (currently)
CUDA 4: Shared address space with CPU. Why?
No explicit CPU/GPU copies
Direct GPU-GPU copies
Direct I/O device to GPU copies
Fermi ECC
ECC protected: register file, L1, L2, DRAM
Uses redundancy to ensure data integrity against cosmic rays flipping bits
For example, 64 bits are stored as 72 bits
Fix single-bit errors, detect double-bit errors
What are the applications?
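One way to see where the 72 bits come from is a textbook extended Hamming code: 7 check bits are enough to locate any single flipped bit among positions 1 to 71, and one extra overall parity bit distinguishes single from double errors, giving 64 + 7 + 1 = 72. The sketch below illustrates that general technique only; it is not claimed to be the code Fermi hardware uses, and the function names are made up for this example.

```python
# Textbook extended Hamming (72,64) SECDED code; illustrative only.
CHECK_POSITIONS = [1 << i for i in range(7)]   # 1, 2, 4, ..., 64
DATA_POSITIONS = [p for p in range(1, 72) if p not in CHECK_POSITIONS]  # 64 slots

def syndrome(bits):
    """XOR of the positions (1..71) that hold a 1; zero for a valid codeword."""
    s = 0
    for pos in range(1, 72):
        if bits[pos]:
            s ^= pos
    return s

def encode(word):
    """Encode a 64-bit integer as a list of 72 bits."""
    bits = [0] * 72
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (word >> i) & 1
    s = syndrome(bits)
    for pos in CHECK_POSITIONS:
        if s & pos:
            bits[pos] = 1            # drives the syndrome to zero
    bits[0] = sum(bits[1:]) & 1      # overall parity bit: total parity becomes even
    return bits

def decode(bits):
    """Return (word, status); word is None when the error cannot be corrected."""
    bits = list(bits)
    s = syndrome(bits)
    odd_parity = sum(bits) & 1
    if odd_parity:                   # odd number of flips: treat as a single error
        if s >= 72:
            return None, "uncorrectable"
        bits[s] ^= 1                 # s == 0 means the parity bit itself flipped
        status = "corrected"
    elif s:                          # even parity but nonzero syndrome
        return None, "double-bit error detected"
    else:
        status = "ok"
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        word |= bits[pos] << i
    return word, status

word = 0x0123456789ABCDEF
codeword = encode(word)
codeword[37] ^= 1                    # simulate one flipped bit
recovered, status = decode(codeword)
print(recovered == word, status)     # True corrected
```

Flipping a second bit in the same codeword makes `decode` report a detected double-bit error instead of returning data.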
Fermi Tessellation
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Fixed function hardware on each SM for graphics
Texture filtering
Texture cache
Tessellation
Vertex Fetch / Attribute Setup
Stream Output
Viewport Transform
Why?
Observations
Becoming easier to port CPU code to the GPU
Recursion, fast atomics, L1/L2 caches, faster global memory
In fact… GPUs are starting to look like CPUs
Beefier SMs, L1 and L2 caches, dual warp scheduling, double precision, fast atomics