![Page 1: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/1.jpg)
Tuning GPGPU Applications for Performance
Justin Hensley and Jason YangGraphics Product Group
![Page 2: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/2.jpg)
Overview
• Goals:– How to approach the optimization process– Taking advantage of hardware features
• GPGPU from real world applications– Decoding H.264 Video– Radial Distortion Correction
• Cap ‘N’ Stream Demo• Conclusion and Questions
2
![Page 3: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/3.jpg)
H.264 Decoding:A Performance Case Study
When fast is not good enough!
![Page 4: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/4.jpg)
Where to begin?
• You’re done writing code, now what?• Does it work?• Is it fast?• What does fast mean?
– 60x faster than the CPU is pretty good– What are you leaving on the table?
• How close is it to theoretical?
4
![Page 5: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/5.jpg)
Breaking down GPGPU performance
• GPGPU applications generally progress through the hardware in a predictable fashion, unlike rendering
• Theoretical performance can be calculated• What we know…
– # of ALU, TEX, and Write operations– GPUShaderAnalyzer
• What do we need to find out?
5
ALU Memory Tex Output
![Page 6: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/6.jpg)
• Radeon X1900XT (R580)– 8:48:16:16 (VER:ALU:TEX:ROP) – 256 bit memory bus– 625/750 MHz Engine/Memory Clocks
• ALU Performance– 1 ALU Shader
Calculating Theoretical Performance
6
ms07.0(625Mhz)(48)
(1)1088)9201(speed) engine (3D (alu/clk)
ns)instructioalu (# pixels)(#=
×
××=
×
×
![Page 7: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/7.jpg)
Calculating Theoretical Performance (2)
• Texture Performance– 1 ALU + TEX Shader– ALU and TEX operations occur in parallel
• Memory Performance– 1 Byte in 1 Byte out (Copy)
7
ms085.02DDR)Mhz (750(256)(16)1088)9201(
speed)(memory width)(buspixel)per bitsoutput (input pixels)(#
=××
××=
×
+×
ms21.0Mhz) (625(16)
(1)1088)9201(speed) engine (3D (tex/clk)
ns)instructiotex (# pixels)(#=
×
××=
×
×
![Page 8: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/8.jpg)
Calculating Theoretical Performance (3)
• Overall Theoretical Performance – max(ALU,TEX,Memory)= max(0.07, 0.21, 0.085) = 0.21 ms
• Remember, this is only a starting point!– ALU and TEX calculation is reasonable– Memory is peak
8
![Page 9: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/9.jpg)
Optimization Example: H.264
• Interprediction – filter data from previously decoded frames
• Deblocking – filter out block edges• Today: Loosely based on DX9 HW
Implementation9
In-LoopDeblocking
Intraprediction
Interprediction
inversetransform
Render
![Page 10: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/10.jpg)
Problem Domain
• Max Resolution 1080p = 1920x1088 “RGBA” pixels (3Bpp)– Luma (Y) = 1920x1088 pels (1Bpp)– Chroma (U and V) = 960x544 pels each
10
Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y…Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y…Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y…Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y…Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y……
![Page 11: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/11.jpg)
Interprediction
• Decode each pel by interpolating a subregion of previously decoded frames
• Example case: 6x6 fetches per pel– Kernel not aligned to x4 boundary
11
Y Y Y Y Y Y Y Y …Y Y Y Y Y Y Y Y …Y Y Y Y Y Y Y Y …Y Y Y Y Y Y Y Y …Y Y Y Y Y Y Y Y ……
……Y Y Y Y Y Y Y Y Y Y Y Y ……Y Y Y Y Y Y Y Y Y Y Y Y ……Y Y Y Y Y Y Y Y Y Y Y Y ……Y Y Y Y Y Y Y Y Y Y Y Y ……Y Y Y Y Y Y Y Y Y Y Y Y ……Y Y Y Y Y Y Y Y Y Y Y Y ……Y Y Y Y Y Y Y Y Y Y Y Y …
…
Source
Target
![Page 12: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/12.jpg)
Do more work per pixel• Problem: Shader has a lot of texture fetches• Share already fetched data by processing four pels in one
pixel (thread)– Consolidate texture fetches & Consolidate ALU instructions
– Consolidate writes (four bytes per pixel)
– Increase number of threads
12
Y Y Y Y Y Y Y Y …Y Y Y Y Y Y Y Y …Y Y Y Y Y Y Y Y …Y Y Y Y Y Y Y Y …Y Y Y Y Y Y Y Y ……
……Y Y Y Y Y Y Y Y Y Y Y Y ……Y Y Y Y Y Y Y Y Y Y Y Y ……Y Y Y Y Y Y Y Y Y Y Y Y ……Y Y Y Y Y Y Y Y Y Y Y Y ……Y Y Y Y Y Y Y Y Y Y Y Y ……Y Y Y Y Y Y Y Y Y Y Y Y ……Y Y Y Y Y Y Y Y Y Y Y Y …
…
Source
TargetSaved 96 fetches!
![Page 13: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/13.jpg)
Optimizing the algorithm
• Theoretical Model for 4 pels per pixel– 115 ALU, 56 TEX, 66 Bytes In/Out– 480x1088 pixels processed– From above equations:
• ALU time = 2 ms• TEX time = 3 ms• Mem time = 0.67 ms• Theoretical time = 3 ms
• Using the theoretical model we can find the problems and fix it
13
![Page 14: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/14.jpg)
Do even more work per pixel!
• Use multiple render targets– 16 pels per pixel
14
Y Y Y Y Y Y Y Y …Y Y Y Y Y Y Y Y …Y Y Y Y Y Y Y Y …Y Y Y Y Y Y Y Y …Y Y Y Y Y Y Y Y ……
……Y Y Y Y Y Y Y Y Y Y Y Y ……Y Y Y Y Y Y Y Y Y Y Y Y ……Y Y Y Y Y Y Y Y Y Y Y Y ……Y Y Y Y Y Y Y Y Y Y Y Y ……Y Y Y Y Y Y Y Y Y Y Y Y ……Y Y Y Y Y Y Y Y Y Y Y Y ……Y Y Y Y Y Y Y Y Y Y Y Y …
…
Source
Target
Saved 522 fetches!
![Page 15: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/15.jpg)
What happens to the model?
• Theoretical Model for 16 pels per pixel– 304 ALU, 85 TEX, 107 Bytes In/Out– 480x272 pixels processed– From above equations:
• ALU time = 1.32 ms• TEX time = 1.12 ms• Mem time = 0.29 ms• Theoretical time = 1.32 ms
• Shader goes from TEX bound to ALU bound
15
![Page 16: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/16.jpg)
Oops...
• Render targets split the image
16
Could do a copy Could do a copy shader shader at the end, orat the end, or……
0
1
2
3
0
1
2
3
![Page 17: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/17.jpg)
Interleave Render Targets
• CAL API supports interleaved render targets• Four surfaces appear as one
• Scatter is slow. MRTs are fast.
17
0
1
2
3
0123012
0123
30123…
![Page 18: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/18.jpg)
Back to the real world…
• Compare actual performance with theoretical
• Easy way to determine next step– If you’re close, you probably have to redesign if
you’ve missed your performance target– If you’re off, time to look at the hardware
18
![Page 19: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/19.jpg)
Interprediction Filtering
• Each block of 4x4 pels are processed differently– Up to 6 filter cases– Up to 2 prediction directions– Up to 2 block types (frame/field)– Up to 16 input textures (max 4 for 1080p)
• With up to up to 384 possible cases, that’s a pretty big shader!
19
![Page 20: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/20.jpg)
Branching isn’t perfect
• Shader length• Branch overhead• Divergent paths between pixel blocks
– One pixel branch can stall the others– “Branch Granularity”
20
![Page 21: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/21.jpg)
Our old friend the Z-Buffer
• Use a fast z-pass to initialize the buffer• Render each “case” separately
• Fast because of top-of-the-pipe pixel kills• Very little branching overhead• Potentially shorter shaders• More threads
21
![Page 22: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/22.jpg)
Deblocking
• Filters block edges• Performance can suffer due to render
dependencies– Input to next pixel is output from previous pixel– Frame divided into MB of 16x16 pels– 4 vertical and 4 horizontal passes per MB– For 1080p frame, up to 748 passes
• Example…
22
![Page 23: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/23.jpg)
Vertical Deblocking
23
0 1 2 3 0 1 2 3 0 1 2 3 1 2 30
Edge 0 Edge 1 Edge 2 Edge 3
8 pels
Previous macroblock
![Page 24: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/24.jpg)
Do more work in a pixel revisited
• Save on number of passes– If I have the data, why not use it?– Filter two edges in one pixel
24
0 1 2 3
Edge 0 and 1
16 pels
Render as 1x16 rectangleto 3 MRT
![Page 25: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/25.jpg)
Hmmm….
• Performance improves, but is still off– Hardware units arranged in 2x2 pattern – 1x16 strip only uses half the pipes– Rearrange the rendering
25
![Page 26: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/26.jpg)
Read/Write to one texture
• Because of dependencies the output is the input to the next pass.
• CAL allows reading and writing from the same texture
• No need to reset states
26
![Page 27: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/27.jpg)
Getting more info
• Look at performance counters– HW Utilization– Texture cache miss– Z-buffer statistics
27
![Page 28: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/28.jpg)
Cache and Memory Tips
• Rearrange the data• Play with the memory layout
– CAL API allows more fine grain control
• Memory performance is hard to predict– Access pattern– Memory subsystem is unknown
• Tip: Try replacing input textures with a very small texture (1x1)
• Doesn’t help with dependent texture fetches
28
![Page 29: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/29.jpg)
Radial Distortion Correction
![Page 30: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/30.jpg)
Radial Correction
System Requirements:– Capture video from one or more cameras. – Transfer images to GPU– Convert bayer-pattern images to RGB images– Remove lens distortions– Return to host for further processing
• System needs to use limited power– Mobile GPU
• Want to minimize correction time– Images further processed in a large real-time system
30
![Page 31: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/31.jpg)
Naive Implementation
31
Input images(system memory)
Output images(system memory)
Input images(GPU memory)
Intermediate(GPU memory)
Output images(GPU memory)
copy
Debayerkernel
Undistortkernel
copy
??? Output images(system memory)
![Page 32: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/32.jpg)
Output images(system memory)
Faster Implementation
32
Undistortkernel
???
Intermediate(GPU memory)
Debayerkernel
Input images(system memory)
Output images(system memory)
Accessing system memory directly is 1.8x faster than
naive implementation! (2.74 ms vs. 4.96 ms)
![Page 33: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/33.jpg)
Caveat: System Memory
• Performance is chipset dependent• Rasterizers optimized for texture cache
performance when rendering graphics
33
![Page 34: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/34.jpg)
Caveat: System Memory
• Performance is chipset dependent• Rasterizers optimized for texture cache
performance when rendering graphics
34
Not a system memory “friendly”
traversal
![Page 35: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/35.jpg)
“Forcing” Raster pattern
• “Force” the rasterizer to be more friendly with system memory traversal– Use strips of geometry (CTM can directly stamp quads)
35
![Page 36: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/36.jpg)
Effect of “Forcing” Raster pattern
• 2048x2048 float32x4 “copy” shader– Reads input in local GPU memory, writes to system
memory– RD580 with an R580
• Full screen quad time = 45.51 ms – ~1.5 GB/sec readback
• “Raster-Blocks” time = 26.53 ms– ~2.5 GB/sec readback
• Technique could also be used to optimize shaders with non-standard to local memory accesses
36
1.7x faster
![Page 37: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/37.jpg)
GPU HD Encoding Demo
• Cap ‘N’ Stream• DX9 game displaying at 1280x768• 720p MPEG2 encoding on the GPU
– Motion estimation– Multi-GPU capable (not required)– Quality scales with HW performance
37
![Page 38: Tuning GPGPU Applications for Performance - … Decoding: A Performance Case Study When fast is not good enough!](https://reader036.vdocuments.us/reader036/viewer/2022062907/5aa201917f8b9a1f6d8ca4ca/html5/thumbnails/38.jpg)
Conclusion and Questions
• Extremely useful to estimate performance• Direct access to system memory• “Forcing” raster pattern can be helpful
• Useful tools– GPU Shader Analyzer, GPU Perf Studio Tool– http://ati.amd.com/developer/
• Information Contact:
38