Optimizing Texture Transfers - GPU Technology Conference 2012
![Page 2: Optimizing Texture Transfers - Nvidiadeveloper.download.nvidia.com › ... › S0356-GTC2012-Texture-Transf… · Optimized Texture Transfers - GPU Technology Conference 2012 Author:](https://reader034.vdocuments.us/reader034/viewer/2022052306/5f0cd2817e708231d4374d62/html5/thumbnails/2.jpg)
Outline
Definitions
— Upload : Host (CPU) -> Device (GPU)
— Readback: Device (GPU) -> Host (CPU)
Focus on OpenGL graphics
— Implementing various transfer methods
— Multi-threading and Synchronization
— Debugging transfers
— Best Practices & Results
Applications
Streaming video and time-varying geometry or volumes
— Broadcast, real-time fluid simulations, etc.
Level of detail
— Out-of-core image viewers, terrain engines
— Bricks paged in as needed
Parallel rendering
— Fast communication between multiple GPUs for scaling data/render
Remote graphics
— Read back GPU results quickly and stream them over the network
[Figure: CPU with RAM and GPU with graphics memory, connected over PCIe. Bandwidth labels: 8GB/s, 100GB/s, 5-10GB/s.]
OpenGL Graphics – Streaming Data
Previous approaches
— Synchronous – CPU and GPU idle during transfer
— CPU Asynchronous
GPU and CPU Asynchronous with Copy Engines
— Application layout
— Use cases
— Results
Synchronous Transfers
Straightforward
— Upload texture every frame
— Driver does all the copying
Copy, upload and draw are sequential
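[Figure: bricks pData[0] … pData[nBricks] are read from disk into main memory and transferred with glTexSubImage to the texture texID in graphics memory.]
[Timeline (CPU, bus, GPU over time): in each frame the CPU upload, the bus copy, and the GPU draw follow one another; other work runs only outside glTexSubImage.]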
CPU Asynchronous Transfers
Transfers that do not block the CPU, using Pixel Buffer Objects (PBOs)
— Ping-pong PBOs for optimal throughput
— Data must be in GPU native format
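[Figure: bricks pData[0] … pData[nBricks] are read from disk into main memory. The next brick (Data_next) is copied with memcpy into one of two PBOs (PBO0, PBO1) in OpenGL-controlled memory, while the current brick (Data_cur) is transferred with glTexSubImage from the other PBO into the texture texID in graphics memory.]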
Example – 3D Texture + Ping-Pong PBOs
GLuint pbo[2]; // ping-pong PBOs; generate and initialize them ahead of time
unsigned int curPBO = 0;
// Bind the current PBO for the app->PBO transfer
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo[curPBO]);
GLubyte* ptr = (GLubyte*)glMapBufferRange(GL_PIXEL_UNPACK_BUFFER_ARB, 0, size,
    GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
memcpy(ptr, pData[curBrick], xdim*ydim*zdim); // one byte per voxel (GL_LUMINANCE, GL_UNSIGNED_BYTE)
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB);
// Copy pixels from the other PBO to the texture object
glBindTexture(GL_TEXTURE_3D, texId);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo[1-curPBO]);
glTexSubImage3D(GL_TEXTURE_3D, 0, 0, 0, 0, xdim, ydim, zdim, GL_LUMINANCE, GL_UNSIGNED_BYTE, 0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
glBindTexture(GL_TEXTURE_3D, 0);
curPBO = 1 - curPBO; // swap roles for the next frame
// Call drawing code here
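The index bookkeeping in the example can be checked without a GL context. The following is a minimal sketch, assuming the same scheme as the slide (curPBO starts at 0 and flips once per frame). The names PingPongRoles, pingPongRoles, and sourceHasData are hypothetical and do not come from the talk.

```cpp
#include <cassert>

// Hypothetical helper (not from the slides): the role each PBO plays in a
// given frame of the ping-pong scheme.
struct PingPongRoles {
    unsigned fillIndex;    // PBO the application maps and memcpy's into
    unsigned sourceIndex;  // PBO that glTexSubImage3D reads from
};

// Mirrors the slide's bookkeeping: curPBO starts at 0 and flips every frame.
PingPongRoles pingPongRoles(unsigned frame) {
    const unsigned curPBO = frame % 2;
    return PingPongRoles{curPBO, 1u - curPBO};
}

// True when the source PBO was filled by an earlier frame. In frame 0 the
// source PBO (index 1) has not been written by the application yet.
bool sourceHasData(unsigned frame) {
    return frame >= 1;
}
```

The PBO filled in frame n is the one the texture update reads in frame n+1, so the texture lags the CPU copy by one frame. In frame 0 the source PBO holds no application data yet, so an application may want to skip that first texture update or pre-fill the buffer.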
CPU Async - Execution Timeline
[Timeline (CPU, bus, GPU over time): CPU row: Upload t0:PBO0, Upload t1:PBO1, Upload t2:PBO0. Bus row: Copy t0:PBO0, Copy t1:PBO1, Copy t2:PBO0. GPU row: Draw t0, Draw t1, Draw t2.]
CPU Async: analysis with GPUView (http://graphics.stanford.edu/~mdfisher/GPUView.html)
[GPUView trace with rows labeled GLDriver, GPU, CPU, and App.]
Results – Synchronous vs CPU Async
[Chart: PBO vs Synchronous uploads - Quadro 6000. Series: PBO (MB/s), TexSubImage (MB/s). X axis: Texture Size: 16^3 (4KB), 32^3 (32KB), 64^3 (256KB), 128^3 (2MB), 256^3 (16MB). Y axis: Bandwidth (MB/s), 0 to 4500.]
- Transfers only
- Adding rendering will reduce bandwidth, GPU can’t do both
- Ideally – want to sustain bandwidth with render, need GPU overlap
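The texture sizes on the chart are consistent with one byte per voxel (16^3 voxels is 4KB, 256^3 voxels is 16MB). As a rough sketch of the arithmetic, the helpers below compute a brick's size and the time a transfer would take at a given sustained bandwidth. The function names are hypothetical, and the bandwidth passed in is whatever figure you measure; nothing here is taken from the talk's results.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Bytes in a cubic brick of dim^3 voxels. One byte per voxel is an assumption
// that matches the sizes printed on the chart (16^3 -> 4KB, 256^3 -> 16MB).
std::size_t brickBytes(std::size_t dim, std::size_t bytesPerVoxel = 1) {
    return dim * dim * dim * bytesPerVoxel;
}

// Milliseconds needed to move `bytes` at `mbPerSec`, with 1 MB = 1024*1024 bytes.
double transferMillis(std::size_t bytes, double mbPerSec) {
    const double bytesPerSec = mbPerSec * 1024.0 * 1024.0;
    return static_cast<double>(bytes) / bytesPerSec * 1000.0;
}
```

For example, at a hypothetical sustained 4000 MB/s, transferMillis(brickBytes(256), 4000.0) comes to about 4 ms, roughly a quarter of a 60 Hz frame. This is why overlapping the transfer with rendering matters once rendering is added.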
Achieving Overlap - Copy Engines
Fermi and later GPUs have copy engines (CEs)
— GeForce, low-end Quadro: 1 CE
— Quadro 4000 and above: 2 CEs
Allows copy-to-host, compute, and copy-to-device to run simultaneously
Graphics/OpenGL
— Use PBOs in multiple threads
— Handle synchronization
GPU Asynchronous Transfers
Downloads/uploads run in separate threads
— Using OpenGL PBOs
ARB_sync is used for synchronization between contexts
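[Figure: the main app thread performs Init and Draw; separate Upload and Readback threads are connected to it through shared textures. Timelines labeled "Using PBO" and "Using CE" show the CPU uploads (Upload t0:PBO0, Upload t1:PBO1, Upload t2:PBO0), the bus copies (Copy t0:PBO0, Copy t1:PBO1, Copy t2:PBO0), and the GPU draws (Draw t0, Draw t1, Draw t2).]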
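Upload-Render: Application Layout
[Figure: bricks pData[0] … pData[nBricks] are read from disk into main memory. The upload thread (context uploadGLRC) copies the next brick with memcpy into one of two PBOs (PBO0, PBO1) in OpenGL-controlled memory (Data_next) and transfers the current brick with glTexSubImage (Data_cur) into srcTex[numTextures] in graphics memory. The render thread (context mainGLRC) binds those textures with glBindTexture.]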
Multi-threaded Context Creation
Sharing textures between multiple contexts
— Don’t use wglShareLists
— Use WGL_ARB_create_context / GLX_ARB_create_context instead
— Turn the OpenGL debug flag on
static const int contextAttribs[] =
{
    WGL_CONTEXT_FLAGS_ARB, WGL_CONTEXT_DEBUG_BIT_ARB,
    0
};
// Main (render) context
mainGLRC = wglCreateContextAttribsARB(winDC, 0, contextAttribs);
wglMakeCurrent(winDC, mainGLRC);
glGenTextures(numTextures, srcTex);
// Passing mainGLRC as the share context makes uploadGLRC share its textures with mainGLRC
uploadGLRC = wglCreateContextAttribsARB(winDC, mainGLRC, contextAttribs);
// Create the upload thread
// Repeat the above for a readback context if one is used
Synchronization using ARB_SYNC
OpenGL commands are asynchronous
— When glDrawXXX returns, it does not mean the command has completed
The sync object GLsync (ARB_sync) is used by multi-threaded apps that need synchronization
— E.g. rendering a texture waits for its upload to complete
A fence is inserted in the unsignaled state and changes to signaled once the preceding commands complete.
//Upload
glTexSubImage(texID,..)
GLsync fence = glFenceSync(..) // unsignaled until the upload completes
//Render
glWaitSync(fence); // proceeds once the fence is signaled
glBindTexture(.., texID);
Upload-Render Synchronization
An additional CPU event is needed to coordinate waiting for the GPU sync: a thread can only call glWaitSync on a fence that the other thread has already created.
GLsync startUpload[MAX_BUFFERS], endUpload[MAX_BUFFERS]; //GPU fence sync objects
HANDLE startUploadValid, endUploadValid; //CPU events to coordinate waiting for the GPU sync
//Upload thread, working on srcTex[2]
WaitForSingleObject(startUploadValid)
glWaitSync(startUpload[2])
glBindTexture(srcTex[2])
glTexSubImage(..)
endUpload[2] = glFenceSync(…)
SetEvent(endUploadValid)
//Render thread, working on srcTex[0]
WaitForSingleObject(endUploadValid)
glWaitSync(endUpload[0])
glBindTexture(srcTex[0])
//Draw
startUpload[0] = glFenceSync(…)
SetEvent(startUploadValid);
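The handoff of texture slots between the two threads can be modeled without OpenGL or Win32. The sketch below is a single-threaded model, not the talk's implementation. A slot marked "filled" stands in for a signaled endUpload fence, and a slot marked "not filled" stands in for a signaled startUpload fence. The class name TextureRing and its methods are hypothetical, and the model assumes every slot starts out free for upload.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Single-threaded model of the slot handoff (no OpenGL or Win32 calls).
// All names are hypothetical. Assumes slots > 0 and that every slot starts
// out free, as if all startUpload fences were already signaled.
class TextureRing {
public:
    explicit TextureRing(std::size_t slots) : filled_(slots, false) {}

    // Upload side: succeeds only if the render side has released the slot.
    // Returns the slot written, or -1 where the real thread would wait.
    int tryUpload() {
        if (filled_[uploadIdx_]) return -1;
        filled_[uploadIdx_] = true;  // models endUpload[slot] = glFenceSync(...)
        const int slot = static_cast<int>(uploadIdx_);
        uploadIdx_ = (uploadIdx_ + 1) % filled_.size();
        return slot;
    }

    // Render side: succeeds only if the slot has been uploaded.
    // Returns the slot drawn, or -1 where the real thread would wait.
    int tryRender() {
        if (!filled_[renderIdx_]) return -1;
        filled_[renderIdx_] = false;  // models startUpload[slot] = glFenceSync(...)
        const int slot = static_cast<int>(renderIdx_);
        renderIdx_ = (renderIdx_ + 1) % filled_.size();
        return slot;
    }

private:
    std::vector<bool> filled_;
    std::size_t uploadIdx_ = 0;
    std::size_t renderIdx_ = 0;
};
```

With three slots, the upload side can run at most three textures ahead before it has to wait for the render side to release one. That is the back-pressure the paired fences and events provide in the threaded version.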
Analysis with GPUView
Upload and Render in separate threads
— Map to distinct hardware queues on the GPU
— Executed concurrently
— Will serialize on pre-Fermi hardware
Adding Readback
[Figure: the render thread (context mainGLRC) draws into resultTex[numTextures] in graphics memory, attached with glFramebufferTexture(GL_DRAW_FRAMEBUFFER, …). The readback thread (context readbackGLRC) reads the current frame with glGetTexImage (Frame_cur) into one of two PBOs (PBO0, PBO1) in OpenGL-controlled memory, while the previous frame (Frame_prev) is copied with memcpy from the other PBO into Images[0] … Images[nFrames] in main memory.]
Use glGetTexImage, not glReadPixels, between threads
Render-Readback Synchronization
GLsync startReadback[MAX_BUFFERS], endReadback[MAX_BUFFERS]; //GPU fence sync objects
HANDLE startReadbackValid, endReadbackValid; //CPU events to coordinate waiting for the GPU sync
//Render thread, working on resultTex[2]
WaitForSingleObject(endReadbackValid)
glWaitSync(endReadback[2])
glFramebufferTexture(resultTex[2])
//Draw
startReadback[2] = glFenceSync(…)
SetEvent(startReadbackValid)
//Readback thread, working on resultTex[0]
WaitForSingleObject(startReadbackValid)
glWaitSync(startReadback[0])
glGetTexImage(resultTex[0])
//Read pixels to ping-pong PBO
endReadback[0] = glFenceSync(…)
SetEvent(endReadbackValid);
GeForce vs Quadro Readbacks
Readbacks on GeForce are 3x slower than Quadro
[Chart: Render-Download Bandwidth for Quadro vs GeForce. Series: GeForce GTX 570, Quadro 6000. X axis: Texture Size (256K, 1MB, 8MB, 32MB). Y axis: PCI-e bandwidth (MB/s), 0 to 5000.]
Upload-Render-Readback pipeline
// Capture thread
// Wait for signal to start upload
CPUWait(startUploadValid);
glWaitSync(startUpload[2]);
// Bind texture object
glBindTexture(capTex[2]);
// Upload
glTexSubImage(texID…);
// Signal upload complete
endUpload[2] = glFenceSync(…);
CPUSignal(endUploadValid);

// Render thread
// Wait for download to complete
CPUWait(endDownloadValid);
glWaitSync(endDownload[3]);
// Wait for upload to complete
CPUWait(endUploadValid);
glWaitSync(endUpload[0]);
// Bind render target
glFramebufferTexture(playTex[3]);
// Bind video capture source texture
glBindTexture(capTex[0]);
// Draw
// Signal next upload
startUpload[0] = glFenceSync(…);
CPUSignal(startUploadValid);
// Signal next download
startDownload[3] = glFenceSync(…);
CPUSignal(startDownloadValid);

// Playout thread
CPUWait(startDownloadValid);
glWaitSync(startDownload[2]);
// Readback
glGetTexImage(playTex[2]);
// Read pixels to PBO
// Signal download complete
endDownload[2] = glFenceSync(…);
CPUSignal(endDownloadValid);

Source: True, S0328, Best Practices in GPU-based Video Processing, GTC 2012 Proceedings
[Figure: two rings of four buffers each, indexed [0] to [3], shared between the capture, render, and playout threads.]
GPUView trace showing 3-way overlap
[GPUView traces with Upload, Render, and Readback rows over one frame time, for two cases:]
— Balanced render, upload and readback times
— Render time larger than upload and readback
[Annotation on the trace: copy engines are idle.]
Debugging Transfers
Some OpenGL calls may not overlap between the transfer and render threads
— E.g. non-transfer-related OpenGL calls in the transfer thread
— Driver generates the debug message “Pixel transfer is synchronized with 3D rendering”
— Application uses ARB_debug_output to check the OpenGL debug log
— OpenGL 4.0 and above
GL_ARB_debug_output - http://www.opengl.org/registry/specs/ARB/debug_output.txt
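One way to act on that message is to count how often it appears. The sketch below only does the string matching and counting, so it needs no GL context. In a real application recordDebugMessage would be called from the callback registered with glDebugMessageCallbackARB. The struct and function names are hypothetical, and the match assumes the driver's text contains the sentence exactly as quoted on the slide.

```cpp
#include <cassert>
#include <cstring>

// Message text as quoted on the slide; the exact driver wording is an assumption.
static const char* const kTransferSyncText =
    "Pixel transfer is synchronized with 3D rendering";

// Hypothetical counters an application could print at shutdown.
struct TransferDebugStats {
    unsigned syncWarnings = 0;   // transfers that serialized with rendering
    unsigned otherMessages = 0;  // everything else, including null messages
};

// True if the debug message reports a transfer that did not overlap.
bool isTransferSyncWarning(const char* message) {
    return message != nullptr &&
           std::strstr(message, kTransferSyncText) != nullptr;
}

// Intended to be called from the ARB_debug_output callback with its message text.
void recordDebugMessage(TransferDebugStats& stats, const char* message) {
    if (isTransferSyncWarning(message)) {
        ++stats.syncWarnings;
    } else {
        ++stats.otherMessages;
    }
}
```

A nonzero syncWarnings count after a run suggests that some call in the transfer thread forced the driver to serialize with rendering.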
Copy Engine Results – Best Case
[Chart: Performance Scaling from CPU Asynchronous Transfers, Quadro 6000. Series: Upload-Render Scaling, Render-Download Scaling. X axis: Texture Size (256KB, 1MB, 8MB, 32MB). Y axis: Scaling Factor, 0 to 2, with reference lines marked Perfect Scaling and No Scaling. Data labels: 4.2 GB/s, 3.2 GB/s, 1.4 GB/s, 900 MB/s.]
Conclusion
Presented different transfer methods
Keep the transfer method simple
— Look at your application transfer needs and render times
— Tradeoff in scaling vs application complexity
Future
— Debugging multi-threaded transfers made much easier with Nsight Visual Studio Edition (http://developer.nvidia.com/nvidia-nsight-visual-studio-edition)
References
Venkataraman, Fermi Asynchronous Texture Transfers, OpenGL Insights, 2012
— Source code (around SIGGRAPH 2012) – https://github.com/organizations/OpenGLInsights
Related GTC Talks
— S0328, Thomas True, Best Practices in GPU-based Video Processing
— S0049, Alina Alt & Tom True, Using the GPU Direct for Video API
— S0353, S Venkataraman, Programming Multi-GPUs for Scalable Rendering