![Page 1: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/1.jpg)
Computer Graphics and Imaging UC Berkeley CS184/284A
Lecture 23:
How GPUs work: from shader codeto a TeraFLOP
(slides by Kayvon Fatahalian)
![Page 2: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/2.jpg)
Unreal Engine Kite Demo (Epic Games 2015)
Goal: Highly Complex 3D Scenes in Realtime• Complex vertex and fragment shader computations • 100’s of thousands to millions of triangles in a scene • High resolution (2-4 megapixel + supersampling) • 30-60 frames per second (even higher for VR)
![Page 3: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/3.jpg)
A diffuse reflectance shader
�3
samplermySamp; Texture2D<float3>myTex; float3lightDir;
float4diffuseShader(float3norm,float2uv) { float3kd; kd=myTex.Sample(mySamp,uv); kd*=clamp(dot(lightDir,norm),0.0,1.0); returnfloat4(kd,1.0); }
How much compute is this?
4 multiply-adds & 1 texture fetch
4K = 8 MPixels x 5x Overdraw = 40 MPixels / frame x 60hz = 2.4 GPixels/sec
~10 GFLOPS ~10 GB/sec
A real game is 10s to 100s of times more!
![Page 4: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/4.jpg)
This lecture
Three major ideas that make GPU processing cores run fast How can we exploit massive parallelism to run shaders fast?
Closer look at a real GPU design NVIDIA GTX 1080
The GPU memory hierarchy: moving data to processors
�4
![Page 5: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/5.jpg)
Part 1: throughput processing
Three key concepts behind how modern GPU processing cores run code
Knowing these concepts will help you: 1. Understand space of GPU core (and throughput CPU core) designs 2. Understand how “GPU” cores do (and don’t!) differ from “CPU” cores 3. Optimize shaders/compute kernels 4. Establish intuition: what workloads might benefit from the design of these
architectures?
�5
![Page 6: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/6.jpg)
What’s in a GPU?
�6
Shader Core
Shader Core
Shader Core
Shader Core
Shader Core
Shader Core
Shader Core
Shader Core
Tex
Tex
Tex
Tex
Input Assembly
Rasterizer
Output Blend
Video Decode
Work Distributor
A GPU is a heterogeneous chip multi-processor (highly tuned for graphics)
![Page 7: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/7.jpg)
A diffuse reflectance shader
�7
samplermySamp; Texture2D<float3>myTex; float3lightDir;
float4diffuseShader(float3norm,float2uv) { float3kd; kd=myTex.Sample(mySamp,uv); kd*=clamp(dot(lightDir,norm),0.0,1.0); returnfloat4(kd,1.0); }
Shader programming model: Fragments are processed independently, but there is no explicit parallel programming Key architectural ideas: How can we exploit parallelism to run faster?
![Page 8: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/8.jpg)
Compile shader
�8
<diffuseShader>:
sampler0,v4,t0,s0
mulr3,v0,cb0[0]
maddr3,v1,cb0[1],r3
maddr3,v2,cb0[2],r3
clmpr3,r3,l(0.0),l(1.0)
mulo0,r0,r3
mulo1,r1,r3
mulo2,r2,r3
movo3,l(1.0)
1 unshaded fragment input record
1 shaded fragment output record
samplermySamp; Texture2D<float3>myTex; float3lightDir;
float4diffuseShader(float3norm,float2uv) { float3kd; kd=myTex.Sample(mySamp,uv); kd*=clamp(dot(lightDir,norm),0.0,1.0); returnfloat4(kd,1.0); }
![Page 9: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/9.jpg)
Execute shader
�9
<diffuseShader>:
sampler0,v4,t0,s0
mulr3,v0,cb0[0]
maddr3,v1,cb0[1],r3
maddr3,v2,cb0[2],r3
clmpr3,r3,l(0.0),l(1.0)
mulo0,r0,r3
mulo1,r1,r3
mulo2,r2,r3
movo3,l(1.0)
Fetch/ Decode
Execution Context
ALU (Execute)
![Page 10: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/10.jpg)
Execute shader
�10
<diffuseShader>:
sampler0,v4,t0,s0 mulr3,v0,cb0[0]
maddr3,v1,cb0[1],r3
maddr3,v2,cb0[2],r3
clmpr3,r3,l(0.0),l(1.0)
mulo0,r0,r3
mulo1,r1,r3
mulo2,r2,r3
movo3,l(1.0)
ALU (Execute)
Fetch/ Decode
Execution Context
![Page 11: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/11.jpg)
Execute shader
�11
<diffuseShader>:
sampler0,v4,t0,s0
mulr3,v0,cb0[0]
maddr3,v1,cb0[1],r3
maddr3,v2,cb0[2],r3
clmpr3,r3,l(0.0),l(1.0)
mulo0,r0,r3
mulo1,r1,r3
mulo2,r2,r3
movo3,l(1.0)
Fetch/ Decode
Execution Context
ALU (Execute)
![Page 12: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/12.jpg)
Execute shader
�12
<diffuseShader>:
sampler0,v4,t0,s0
mulr3,v0,cb0[0]
maddr3,v1,cb0[1],r3
maddr3,v2,cb0[2],r3
clmpr3,r3,l(0.0),l(1.0)
mulo0,r0,r3
mulo1,r1,r3
mulo2,r2,r3
movo3,l(1.0)
Fetch/ Decode
Execution Context
ALU (Execute)
![Page 13: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/13.jpg)
Execute shader
�13
<diffuseShader>:
sampler0,v4,t0,s0
mulr3,v0,cb0[0]
maddr3,v1,cb0[1],r3
maddr3,v2,cb0[2],r3
clmpr3,r3,l(0.0),l(1.0)
mulo0,r0,r3
mulo1,r1,r3
mulo2,r2,r3
movo3,l(1.0)
Fetch/ Decode
Execution Context
ALU (Execute)
![Page 14: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/14.jpg)
Execute shader
�14
<diffuseShader>:
sampler0,v4,t0,s0
mulr3,v0,cb0[0]
maddr3,v1,cb0[1],r3
maddr3,v2,cb0[2],r3
clmpr3,r3,l(0.0),l(1.0)
mulo0,r0,r3
mulo1,r1,r3
mulo2,r2,r3
movo3,l(1.0)
Fetch/ Decode
Execution Context
ALU (Execute)
![Page 15: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/15.jpg)
Execute shader
�15
<diffuseShader>:
sampler0,v4,t0,s0
mulr3,v0,cb0[0]
maddr3,v1,cb0[1],r3
maddr3,v2,cb0[2],r3
clmpr3,r3,l(0.0),l(1.0)
mulo0,r0,r3
mulo1,r1,r3
mulo2,r2,r3
movo3,l(1.0)
Fetch/ Decode
Execution Context
ALU (Execute)
![Page 16: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/16.jpg)
“CPU-style” cores
�16
Fetch/ Decode
Execution Context
ALU (Execute)
Data cache (a big one)
Out-of-order control logic
Fancy branch predictor
Memory pre-fetcher
![Page 17: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/17.jpg)
Slimming down
�17
Fetch/ Decode
Execution Context
ALU (Execute)
Idea #1: Remove components that help a single instruction stream run fast
![Page 18: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/18.jpg)
Two cores (two fragments in parallel)
�18
Fetch/ Decode
Execution Context
ALU (Execute)
Fetch/ Decode
Execution Context
ALU (Execute)
<diffuseShader>: sampler0,v4,t0,s0 mulr3,v0,cb0[0] maddr3,v1,cb0[1],r3 maddr3,v2,cb0[2],r3 clmpr3,r3,l(0.0),l(1.0) mulo0,r0,r3 mulo1,r1,r3 mulo2,r2,r3 movo3,l(1.0)
fragment 1
<diffuseShader>: sampler0,v4,t0,s0 mulr3,v0,cb0[0] maddr3,v1,cb0[1],r3 maddr3,v2,cb0[2],r3 clmpr3,r3,l(0.0),l(1.0) mulo0,r0,r3 mulo1,r1,r3 mulo2,r2,r3 movo3,l(1.0)
fragment 2
![Page 19: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/19.jpg)
Four cores (four fragments in parallel)
�19
Fetch/ Decode
Execution Context
ALU (Execute)
Fetch/ Decode
Execution Context
ALU (Execute)
Fetch/ Decode
Execution Context
ALU (Execute)
Fetch/ Decode
Execution Context
ALU (Execute)
![Page 20: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/20.jpg)
Sixteen cores (sixteen fragments in parallel)
�20
16 cores = 16 simultaneous instruction streams
![Page 21: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/21.jpg)
Instruction stream sharing
�22
But, many fragments should be able to share an instruction stream!
<diffuseShader>:
sampler0,v4,t0,s0
mulr3,v0,cb0[0]
maddr3,v1,cb0[1],r3
maddr3,v2,cb0[2],r3
clmpr3,r3,l(0.0),l(1.0)
mulo0,r0,r3
mulo1,r1,r3
mulo2,r2,r3
movo3,l(1.0)
![Page 22: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/22.jpg)
Fetch/ Decode
Recall: simple processing core
�23
Execution Context
ALU (Execute)
![Page 23: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/23.jpg)
Add ALUs
�24
Fetch/ Decode
Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs
SIMD processingCtx Ctx Ctx Ctx
Ctx Ctx Ctx Ctx
Shared Ctx Data
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
![Page 24: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/24.jpg)
Modifying the shader
�25
Fetch/ Decode
Ctx Ctx Ctx Ctx
Ctx Ctx Ctx Ctx
Shared Ctx Data
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
<diffuseShader>:
sampler0,v4,t0,s0
mulr3,v0,cb0[0]
maddr3,v1,cb0[1],r3
maddr3,v2,cb0[2],r3
clmpr3,r3,l(0.0),l(1.0)
mulo0,r0,r3
mulo1,r1,r3
mulo2,r2,r3
movo3,l(1.0)
Original compiled shader:
Processes one fragment using scalar ops on scalar registers
![Page 25: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/25.jpg)
Modifying the shader
�26
Fetch/ Decode
Ctx Ctx Ctx Ctx
Ctx Ctx Ctx Ctx
Shared Ctx Data
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
New compiled shader:
Processes eight fragments using vector ops on vector registers
<VEC8_diffuseShader>:
VEC8_samplevec_r0,vec_v4,t0,vec_s0
VEC8_mulvec_r3,vec_v0,cb0[0]
VEC8_maddvec_r3,vec_v1,cb0[1],vec_r3
VEC8_maddvec_r3,vec_v2,cb0[2],vec_r3
VEC8_clmpvec_r3,vec_r3,l(0.0),l(1.0)
VEC8_mulvec_o0,vec_r0,vec_r3
VEC8_mulvec_o1,vec_r1,vec_r3
VEC8_mulvec_o2,vec_r2,vec_r3
VEC8_movo3,l(1.0)
![Page 26: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/26.jpg)
Modifying the shader
�27
Fetch/ Decode
Ctx Ctx Ctx Ctx
Ctx Ctx Ctx Ctx
Shared Ctx Data
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
1 2 3 45 6 7 8
<VEC8_diffuseShader>:
VEC8_samplevec_r0,vec_v4,t0,vec_s0
VEC8_mulvec_r3,vec_v0,cb0[0]
VEC8_maddvec_r3,vec_v1,cb0[1],vec_r3
VEC8_maddvec_r3,vec_v2,cb0[2],vec_r3
VEC8_clmpvec_r3,vec_r3,l(0.0),l(1.0)
VEC8_mulvec_o0,vec_r0,vec_r3
VEC8_mulvec_o1,vec_r1,vec_r3
VEC8_mulvec_o2,vec_r2,vec_r3
VEC8_movo3,l(1.0)
![Page 27: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/27.jpg)
128 fragments in parallel
�28
16 cores = 128 ALUs , 16 simultaneous instruction streams
![Page 28: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/28.jpg)
128 [ ] in parallel
�29
vertices/fragments primitives
OpenCL work items CUDA threads
fragments
vertices
primitives
![Page 29: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/29.jpg)
But what about branches?
�30
ALU 1 ALU 2 . . . ALU 8. . . Time (clocks) 2 . . . 1 . . . 8
if(x>0){
}else{
}
<unconditionalshadercode>
<resumeunconditionalshadercode>
y=pow(x,exp);
y*=Ks;
refl=y+Ka;
x=0;
refl=Ka;
![Page 30: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/30.jpg)
But what about branches?
�31
ALU 1 ALU 2 . . . ALU 8. . . Time (clocks) 2 . . . 1 . . . 8
if(x>0){
}else{
}
<unconditionalshadercode>
<resumeunconditionalshadercode>
y=pow(x,exp);
y*=Ks;
refl=y+Ka;
x=0;
refl=Ka;
T T T F FF F F
![Page 31: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/31.jpg)
But what about branches?
�32
ALU 1 ALU 2 . . . ALU 8. . . Time (clocks) 2 . . . 1 . . . 8
if(x>0){
}else{
}
<unconditionalshadercode>
<resumeunconditionalshadercode>
y=pow(x,exp);
y*=Ks;
refl=y+Ka;
x=0;
refl=Ka;
T T T F FF F F
Not all ALUs do useful work! Worst case: 1/8 peak performance
![Page 32: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/32.jpg)
But what about branches?
�33
ALU 1 ALU 2 . . . ALU 8. . . Time (clocks) 2 . . . 1 . . . 8
if(x>0){
}else{
}
<unconditionalshadercode>
<resumeunconditionalshadercode>
y=pow(x,exp);
y*=Ks;
refl=y+Ka;
x=0;
refl=Ka;
T T T F FF F F
![Page 33: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/33.jpg)
Clarification
Option 1: explicit vector instructions x86 SSE, Intel Larrabee Option 2: scalar instructions, implicit HW vectorization HW determines instruction stream sharing across ALUs (amount of sharing hidden from software) NVIDIA GeForce (“SIMT” warps), ATI Radeon architectures (“wavefronts”)
�34
SIMD processing does not imply SIMD instructions
In practice: 16 to 64 fragments share an instruction stream.
![Page 34: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/34.jpg)
�35
Stalls!
Texture access latency = 100’s to 1000’s of cycles
We’ve removed the fancy caches and logic that helps avoid stalls.
Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.
![Page 35: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/35.jpg)
�36
But we have LOTS of independent fragments.
Idea #3: Interleave processing of many fragments on a single
core to avoid stalls caused by high latency operations.
![Page 36: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/36.jpg)
Hiding shader stalls
�37
Time (clocks) Frag 1 … 8
Fetch/ Decode
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
Ctx Ctx Ctx Ctx
Ctx Ctx Ctx Ctx
Shared Ctx Data
![Page 37: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/37.jpg)
Hiding shader stalls
�38
Time (clocks)
Fetch/ Decode
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
Frag 9 … 16 Frag 17 … 24 Frag 25 … 32Frag 1 … 8
1 2 3 4
1 2
3 4
![Page 38: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/38.jpg)
Hiding shader stalls
�39
Time (clocks)
Frag 9 … 16 Frag 17 … 24 Frag 25 … 32Frag 1 … 8
1 2 3 4
Stall
Runnable
![Page 39: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/39.jpg)
Hiding shader stalls
�40
Time (clocks)
Frag 9 … 16 Frag 17 … 24 Frag 25 … 32Frag 1 … 8
1 2 3 4
Stall
Runnable
Stall
Stall
Stall
![Page 40: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/40.jpg)
Throughput!
�41
Time (clocks)
Frag 9 … 16 Frag 17 … 24 Frag 25 … 32Frag 1 … 8
1 2 3 4
Stall
Runnable
Stall
Runnable
Stall
Runnable
Stall
Runnable
Done!
Done!
Done!
Done!
Start
Start
Start
Increase run time of one group to increase throughput of many groups
![Page 41: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/41.jpg)
Storing contexts
�42
Fetch/ Decode
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
Pool of context storage 128 KB
![Page 42: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/42.jpg)
Eighteen small contexts
�43
Fetch/ Decode
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
(maximal latency hiding)
![Page 43: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/43.jpg)
Twelve medium contexts
�44
Fetch/ Decode
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
![Page 44: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/44.jpg)
Four large contexts
�45
Fetch/ Decode
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
1 2
3 4
(low latency hiding ability)
![Page 45: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/45.jpg)
Our chip
�46
16 cores
8 mul-add ALUs per core (128 total)
16 simultaneous instruction streams
64 concurrent (but interleaved) instruction streams
512 concurrent fragments
= 256 GFLOPs (@ 1GHz)
![Page 46: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/46.jpg)
Our “enthusiast” chip
�47
32 cores, 16 ALUs per core (512 total) = 1 TFLOP (@ 1 GHz)
![Page 47: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/47.jpg)
Summary: three key ideas to exploit parallelism for performance1. Use many “slimmed down cores” to run in parallel
2. Pack cores full of ALUs (by sharing instruction stream across groups of fragments)
Option 1: Explicit SIMD vector instructions Option 2: Implicit sharing managed by hardware
3. Avoid latency stalls by interleaving execution of many groups of fragments
When one group stalls, work on another group�48
![Page 48: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/48.jpg)
�49
Part 2: Putting the three ideas into practice:
A closer look at real GPUs
NVIDIA GeForce GTX 1080
![Page 49: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/49.jpg)
NVIDIA GeForce GTX 1080
NVIDIA-speak: 2560 stream processors (“CUDA cores”) “SIMT execution”
Generic speak: 20 cores 4 groups of 32 SIMD functional units per core
�50
![Page 50: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/50.jpg)
NVIDIA GeForce GTX 1080 “core”
�51
= SIMD function unit, control shared across 32 units (1 MUL-ADD per clock)
“Shared” memory (96 KB)
Execution contexts (registers) (256 KB)
• Groups of 32 [fragments/vertices/CUDA threads] share an instruction stream
• Up to 64 groups are simultaneously interleaved
• Up to 2,048 individual contexts can be stored
Source: NVIDIA Pascal tuning guide
Fetch/ Decode
Fetch/ Decode
Fetch/ Decode
Fetch/ Decode
Fetch/ Decode
Fetch/ Decode
Fetch/ Decode
Fetch/ Decode
![Page 51: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/51.jpg)
NVIDIA GeForce GTX 1080
�52
There are 20 of these things on the GTX 1080
That’s 40,960 fragments!
(Or 40,960 “CUDA threads”)
![Page 52: Lecture 23: How GPUs work - cs184/284a · Unreal Engine Kite Demo (Epic Games 2015) Goal: Highly Complex 3D Scenes in Realtime • Complex vertex and fragment shader computations](https://reader035.vdocuments.us/reader035/viewer/2022070812/5f0b26bb7e708231d42f19d7/html5/thumbnails/52.jpg)
�67
Thank you
Original Slides: Kayvon Fatahalian
Contributors: Kurt Akeley Solomon Boulos Mike Doggett Pat Hanrahan Mike Houston Jeremy Sugerman