
SIGGRAPH 2013 Shaping the Future of Visual Computing

Multi-GPU Programming for Visual Computing Wil Braithwaite - NVIDIA Applied Engineering

Talk Outline

Compute and Graphics API Interoperability.

— The basics.

— Interoperability Methodologies & driver optimizations.

Interoperability at a system level.

— Fine-grained control for managing scaling.

Demo.

Peer-to-peer & UVA, NUMA considerations

Scaling beyond one-to-one = single-compute : single-render

— Enumerating Graphics & Compute Resources

— Many-to-many = multiple-compute : multiple-render.


NVIDIA Maximus Initiative

Mixing Tesla and Quadro GPUs

— Tight integration with OEMs and System Integrators

— Optimized driver paths for GPU-GPU communication

[Figure: a traditional workstation pairs a Visualize GPU with Simulate on a cluster; an NVIDIA® MAXIMUS™ workstation runs Simulate0–Simulate3 alongside Visualize0–Visualize2 in one box.]

Multi-GPU Compute+Graphics Use Cases

Image processing

— Multiple compute GPUs and a low-end display GPU for blitting.

Mixing rasterization with compute

— Polygonal rendering done in OpenGL and fed as input to Compute for further processing.

Visualization for HPC simulation

— Numerical simulation distributed across multiple compute GPUs, possibly on a remote supercomputer.

[Images: Seismic Interpretation (NVIDIA Index); Real-time CFD Viz (Morpheus Medical); Interactive Fluid-dynamics.]

Compute/Graphics interoperability

Setup the objects in the graphics context.

Register the objects with the compute context.

Map / unmap the objects from the compute context.

[Diagram: the application creates a Buffer Object or Texture Object in OpenGL/DX; through the interop API these appear to CUDA as a CUDA Buffer or a CUDA Array.]

Code Sample – Simple buffer interop

Setup and Registration of Buffer Objects:

GLuint vboId;
cudaGraphicsResource_t vboRes;

// OpenGL buffer creation...
glGenBuffers(1, &vboId);
glBindBuffer(GL_ARRAY_BUFFER, vboId);
glBufferData(GL_ARRAY_BUFFER, vboSize, 0, GL_STREAM_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, 0);

// Registration with CUDA.
cudaGraphicsGLRegisterBuffer(&vboRes, vboId, cudaGraphicsRegisterFlagsNone);


Code Sample – Simple buffer interop

Mapping between contexts:

float* vboPtr;
while (!done)
{
    cudaGraphicsMapResources(1, &vboRes, 0);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    runCUDA(vboPtr, 0);
    cudaGraphicsUnmapResources(1, &vboRes, 0);
    runGL(vboId);
}


Resource Behavior: Single-GPU

The resource is shared.

The context switch is fast and independent of data size.

[Figure: on a single GPU, the CUDA context and the GL context share one copy of the data through the interop API.]

Code Sample – Simple buffer interop

Mapping between contexts (same loop as above): the context-switching happens when the map/unmap commands are processed.

Timeline: Single-GPU

Driver-interop

[Timeline figure: COMPUTE and RENDER rows over frames 1–4; each frame issues map, kernel (runCUDA), unmap, then GL (runGL) on the same GPU.]

Code Sample – Simple buffer interop

Adding synchronization for analysis:

float* vboPtr;
while (!done)
{
    cudaGraphicsMapResources(1, &vboRes, 0);
    cudaStreamSynchronize(0);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    runCUDA(vboPtr, 0);
    cudaGraphicsUnmapResources(1, &vboRes, 0);
    cudaStreamSynchronize(0);
    runGL(vboId);
}


Timeline: Single-GPU

Driver-interop, synchronous*

— (we synchronize after map and unmap calls)

[Timeline figure: map*, kernel, unmap*, GL per frame. Synchronizing after "map" waits for GL to finish before the context-switch; synchronizing after "unmap" waits for CUDA (& GL) to finish.]

Resource Behavior: Multi-GPU

Each GPU has a copy of the resource.

The context-switch is dependent on data size, because the driver must copy the data.

[Figure: the CUDA context on the compute-GPU and the GL context on the render-GPU each hold a copy of the data; the interop API copies render-GPU to compute-GPU on MAP, and compute-GPU to render-GPU on UNMAP.]

Timeline: Multi-GPU

Driver-interop, synchronous*

— SLOWER! (Tasks are serialized).

[Timeline figure: each frame issues map* (GLtoCU copy), kernel, unmap* (CUtoGL copy), then GL. Resources are mirrored and synchronized across the GPUs; "map" has to wait for GL to complete before it synchronizes the resource.]

Interoperability Methodologies

READ-ONLY

— GL produces... and CUDA consumes.

e.g. Post-process the GL render in CUDA. (A registration sketch follows this list.)

WRITE-DISCARD

— CUDA produces... and GL consumes.

e.g. CUDA simulates fluid, and GL renders result.

READ & WRITE

— Useful if you want to use the rasterization pipeline.

e.g. Feedback loop:

— runGL(texture) → framebuffer

— runCUDA(framebuffer) → texture
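As a sketch of the READ-ONLY path (GL produces, CUDA consumes): register the GL texture with the read-only flag and map it to a CUDA array. texId and postprocessLaunch are illustrative names, not from the talk:

cudaGraphicsResource_t texRes;
cudaArray_t texArray;

// Register an existing GL texture, hinting that CUDA will only read it.
cudaGraphicsGLRegisterImage(&texRes, texId, GL_TEXTURE_2D,
                            cudaGraphicsRegisterFlagsReadOnly);

cudaGraphicsMapResources(1, &texRes, 0);
cudaGraphicsSubResourceGetMappedArray(&texArray, texRes, 0, 0);
postprocessLaunch(texArray);               // hypothetical CUDA consumer
cudaGraphicsUnmapResources(1, &texRes, 0);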

Code Sample – WRITE-DISCARD

CUDA produces... and OpenGL consumes. The map-flag hints that we do not care about the previous contents of the buffer:

float* vboPtr;
cudaGraphicsResourceSetMapFlags(vboRes, cudaGraphicsMapFlagsWriteDiscard);
while (!done)
{
    cudaGraphicsMapResources(1, &vboRes, 0);
    cudaStreamSynchronize(0);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    runCUDA(vboPtr, 0);
    cudaGraphicsUnmapResources(1, &vboRes, 0);
    cudaStreamSynchronize(0);
    runGL(vboId);
}


Timeline: Single-GPU

Driver-interop, synchronous*, WRITE-DISCARD

[Timeline figure: as before, the context switch forces serialization. Synchronize after "map" waits for GL to finish before the context-switch; synchronize after "unmap" waits for CUDA (& GL) to finish.]

Timeline: Multi-GPU

Driver-interop, synchronous*, WRITE-DISCARD

[Timeline figure: when multi-GPU, "map" does nothing; each frame runs kernel, then the CUtoGL copy at unmap*, then GL. Compute & render can overlap as they are on different GPUs; synchronize after "unmap" waits for CUDA & GL to finish.]

Timeline: Multi-GPU

Driver-interop, synchronous*, WRITE-DISCARD

— if render is long...

[Timeline figure: with a long render, "unmap" will wait for GL.]

Driver-Interop: System View

[Figures: single-GPU and multi-GPU system views of driver interop.]

Manual-Interop: System View

[Figure: multi-GPU system view, with data staged between the compute-GPU and the render-GPU.]

Manual-Interop

Driver-Interop hides all the complexity, but sometimes...

— We don't want to transfer all the data: the kernel may only compute subregions.

— We may have a complex system with multiple compute-GPUs and/or render-GPUs.

— We have application-specific pipelining and multi-buffering.

— There may be some CPU code in the algorithm between compute and graphics.

Code Sample – Manual-Interop

Create a temporary buffer in pinned host-memory:

cudaMalloc((void**)&d_data, vboSize);
cudaHostAlloc((void**)&h_data, vboSize, cudaHostAllocPortable);
while (!done) {
    // Compute data in the temp buffer, and copy to the host...
    runCUDA(d_data, 0);
    cudaMemcpyAsync(h_data, d_data, vboSize, cudaMemcpyDeviceToHost, 0);
    cudaStreamSynchronize(0);
    // Map the render-GPU's resource and upload the host buffer...
    cudaSetDevice(renderGPU);
    cudaGraphicsMapResources(1, &vboRes, 0);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    cudaMemcpyAsync(vboPtr, h_data, size, cudaMemcpyHostToDevice, 0);
    cudaGraphicsUnmapResources(1, &vboRes, 0);
    cudaStreamSynchronize(0);
    cudaSetDevice(computeGPU);
    runGL(vboId);
}


Timeline: Multi-GPU

Manual-interop, synchronous*, WRITE-DISCARD

[Timeline figure: per frame, kernel, CUtoH* download, HtoGL* upload, then GL; compute, copies, and render are all serialized.]

Code Sample – Manual-Interop (Async)

Same as above, but without the final synchronization:

cudaMalloc((void**)&d_data, vboSize);
cudaHostAlloc((void**)&h_data, vboSize, cudaHostAllocPortable);
while (!done) {
    // Compute data in the temp buffer, and copy to the host...
    runCUDA(d_data, 0);
    cudaMemcpyAsync(h_data, d_data, vboSize, cudaMemcpyDeviceToHost, 0);
    cudaStreamSynchronize(0);
    // Map the render-GPU's resource and upload the host buffer...
    cudaSetDevice(renderGPU);
    cudaGraphicsMapResources(1, &vboRes, 0);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    cudaMemcpyAsync(vboPtr, h_data, size, cudaMemcpyHostToDevice, 0);
    cudaGraphicsUnmapResources(1, &vboRes, 0);
    cudaSetDevice(computeGPU);
    // cudaStreamSynchronize(0);
    runGL(vboId);
}


Timeline: Multi-GPU

Manual-interop, asynchronous, WRITE-DISCARD

[Timeline figure: the HtoGL upload of one frame now overlaps the kernel and CUtoH download of the next.]

Timeline: Multi-GPU

Manual-interop, asynchronous, WRITE-DISCARD

— if render is long...

[Timeline figure: with a long render we are downloading while uploading, drifting out of sync!]

Timeline: Multi-GPU (safe Async)

Manual-interop, asynchronous, WRITE-DISCARD

— if render is long...

[Timeline figure: synchronization must also wait for HtoGL to finish.]

Code Sample – Manual-Interop (safe Async)

while (!done) {
    // Compute the data in a temp buffer, and copy to a host buffer...
    runCUDA(d_data, 0);
    cudaStreamWaitEvent(0, uploadFinished, 0);
    cudaMemcpyAsync(h_data, d_data, vboSize, cudaMemcpyDeviceToHost, 0);
    cudaStreamSynchronize(0);
    // Map the render-GPU's resource and upload the host buffer...
    // (All commands must be asynchronous.)
    cudaSetDevice(renderGPU);
    cudaGraphicsMapResources(1, &vboRes, 0);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    cudaMemcpyAsync(vboPtr, h_data, size, cudaMemcpyHostToDevice, 0);
    cudaGraphicsUnmapResources(1, &vboRes, 0);
    cudaEventRecord(uploadFinished, 0);
    cudaSetDevice(computeGPU);
    runGL(vboId);
}


Timeline: Multi-GPU

Manual-interop, WRITE-DISCARD, flipping CUDA

[Timeline figure: with double-buffered (flipping) CUDA buffers A/B, kernel[B] and CUtoH[B] overlap the H[A]toGL upload and GL render of the previous frame.]

Code Sample – Manual-Interop (flipping)

int read = 1, write = 0;
while (!done) {
    // Compute the data into the write buffer, and copy it to a host buffer...
    cudaStreamWaitEvent(custream[write], kernelFinished[read], 0);
    runCUDA(d_data[write], custream[write]);
    cudaEventRecord(kernelFinished[write], custream[write]);
    cudaStreamWaitEvent(custream[write], uploadFinished[read], 0);
    cudaMemcpyAsync(h_data[write], d_data[write], vboSize, cudaMemcpyDeviceToHost, custream[write]);
    cudaEventRecord(downloadFinished[write], custream[write]);
    // Map the render-GPU's resource and upload the host buffer...
    cudaSetDevice(renderGPU);
    cudaGraphicsMapResources(1, &vboRes, glstream);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    cudaStreamWaitEvent(glstream, downloadFinished[read], 0);
    cudaMemcpyAsync(vboPtr, h_data[read], size, cudaMemcpyHostToDevice, glstream);
    cudaGraphicsUnmapResources(1, &vboRes, glstream);
    cudaEventRecord(uploadFinished[read], glstream);
    cudaStreamSynchronize(glstream); // Sync for easier analysis!
    cudaSetDevice(computeGPU);
    runGL(vboId);
    swap(&read, &write);
}


Timeline: Multi-GPU

Manual-interop, WRITE-DISCARD, flipping CUDA

— if render is long...

[Timeline figure: flipping CUDA with a long render; the GL renders dominate the render-GPU timeline.]

Timeline: Multi-GPU

Manual-interop, WRITE-DISCARD, flipping CUDA & GL

— if render is long... we have a problem.

[Timeline figure: flipping both CUDA and GL buffers. We can't overlap contexts! The H[A]toGL[A]/H[B]toGL[B] uploads and the GL[A]/GL[B] renders serialize on the render-GPU.]

Timeline: Multi-GPU

Manual-interop, WRITE-DISCARD, flipping CUDA & GL

— Use the GL context for upload.

[Timeline figure: per frame, cuSync*, then gl-map, memcpy*, gl-unmap issued on the GL context, then GL[A]/GL[B]; with the upload done from the GL context, it pipelines with the flipped CUDA kernels and CUtoH copies.]

Visual Studio Nsight

Demo

Demo - results

Workload: runCUDA (20 ms), runGL (10 ms), copy (10 ms).

Single-GPU

— Driver-interop = 30 ms.

Multi-GPU

— Driver-interop = 36 ms. (Too large a data size makes multi-GPU interop worse.)

— Async Manual-interop = 32 ms. (Overlapping the download helps us break even.)

— Flipped Manual-interop = 22 ms. (Using streams and flipping is a significant win!)

Some final thoughts on Interoperability.

Similar considerations apply when OpenGL is the producer and CUDA is the consumer.

— Use cudaGraphicsMapFlagsReadOnly.

Avoid synchronized GPUs for CUDA.

— Watch out for Windows WDDM's implicit synchronization on unmap!

CUDA-OpenGL interoperability can be slower if the OpenGL context spans multiple GPUs.

Context-switch performance varies with system configuration and OS.

Scaling Beyond Single Compute+Graphics

Scaling compute

— Divide tasks across multiple devices.

— When data does not fit into a single GPU's memory, distribute the data.

Scaling graphics

— Multiple displays, stereo.

— More complex rendering, e.g. ray tracing.

Higher compute density

— Amortize host or server costs, e.g. CPU, memory, and RAID shared between multiple GPUs.

Multi-GPU Image Processing - SagivTech

Multiple Compute : Single Render GPU

[Figure: CUDA contexts on compute devices GPU 1 and GPU 2 feed an auxiliary CUDA context on render device GPU 0, which maps/unmaps into the OpenGL context. Transfer options: explicit copies via host, cudaMemcpyPeer, or P2P memory access.]

Unified Virtual Addressing (UVA)

Easier programming with a single address space:

[Figure: without UVA there are multiple memory spaces, with CPU system memory, GPU0 memory, and GPU1 memory each having their own addresses; with UVA they occupy one address space, connected over PCIe.]
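As a minimal sketch of what UVA provides (assuming a UVA-capable driver and a 64-bit process), the runtime can identify which device owns any pointer:

void* d_ptr;
cudaMalloc(&d_ptr, 1 << 20);               // allocated on the current device

cudaPointerAttributes attr;
cudaPointerGetAttributes(&attr, d_ptr);    // works for host or device pointers
// attr.device now names the GPU that owns d_ptr.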

Peer-to-Peer (P2P) Communication

NUMA/Topology Considerations

Memory access is non-uniform.

— Local GPU access is faster than remote (extra QPI hop).

— This affects PCIe transfer throughput.

NUMA APIs

— Thread-affinity considerations.

Pitfalls of setting process affinity

— Does not work with graphics APIs.

[Figure: dual-socket topology, CPU0/IOH0 and CPU1/IOH1 joined by QPI with a GPU on each IOH; local PCIe transfers reach ~6 GB/s vs ~4 GB/s remote. SandyBridge integrates the IOH.]

Dale Southard. Designing and Managing GPU Clusters. Supercomputing 2011.
http://nvidia.fullviewmedia.com/fb/nv-sc11/tabscontent/archive/401-thu-southard.html

Peer-to-Peer Configurations

— Best P2P performance is between GPUs on the same PCIe switch, e.g. a K10 dual-GPU card (~11 GB/s).

— P2P communication is supported between GPUs on the same IOH (~5 GB/s).

— P2P is disabled over QPI: copying is staged via the host (P2H2P).

— P2P communication requires Linux or TCC; not WDDM.

[Figure: CPU0's IOH0 hosts two PCIe switches carrying Tesla GPU0–GPU3; CPU1's IOH1 hosts Quadro GPU4 and GPU5; the CPUs are joined by QPI.]

Peer-to-Peer Initialization

cudaDeviceCanAccessPeer(&isAccessible, srcGPU, dstGPU)

— Returns in the 1st arg whether srcGPU can access the memory of dstGPU.

— Needs to be checked in both directions.

cudaDeviceEnablePeerAccess(peerDevice, 0)

— Enables the current GPU to access peerDevice.

— Note that this is asymmetric!

cudaDeviceDisablePeerAccess

— P2P can be limited to a specific phase in order to reduce overhead and free resources. (A sketch follows.)
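A minimal sketch of symmetric enablement between devices 0 and 1 (error handling omitted):

int can01 = 0, can10 = 0;
cudaDeviceCanAccessPeer(&can01, 0, 1);     // can device 0 access device 1?
cudaDeviceCanAccessPeer(&can10, 1, 0);     // ...and the other direction?

if (can01 && can10) {
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);      // device 0 -> device 1
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);      // device 1 -> device 0
}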

Peer-to-Peer Copy

cudaMemcpyPeerAsync

— Will fall back to staging copies via the host for unsupported configurations, or if peer-access is disabled.

cudaMemcpyAsync

— Using UVA, regular memory copies will also work (see the sketch below).

NVIDIA CUDA Webinars – Multi-GPU Programming
http://developer.download.nvidia.com/CUDA/training/cuda_webinars_multi_gpu.pdf
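A sketch of both copy paths, assuming d0 and d1 were allocated on devices 0 and 1 and s is a stream on the calling device:

// Explicit peer copy: names both source and destination devices.
cudaMemcpyPeerAsync(d1, 1, d0, 0, numBytes, s);

// With UVA, a plain copy also works; the driver infers both
// locations from the pointer values.
cudaMemcpyAsync(d1, d0, numBytes, cudaMemcpyDefault, s);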

Code Sample – Manual-Interop (P2P)

cudaMalloc((void**)&d_data, vboSize);
while (!done) {
    // Compute data in the temp buffer...
    runCUDA(d_data, 0);
    cudaStreamSynchronize(0);
    // Map the render-GPU's resource and peer-copy into it...
    cudaSetDevice(renderGPU);
    cudaGraphicsMapResources(1, &vboRes, glstream);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    cudaMemcpyPeerAsync(vboPtr, renderGPU, d_data, computeGPU, size, glstream);
    cudaGraphicsUnmapResources(1, &vboRes, glstream);
    cudaStreamSynchronize(glstream);
    cudaSetDevice(computeGPU);
    runGL(vboId);
}


Timeline: Multi-GPU

Manual-interop, synchronous*, P2P, WRITE-DISCARD

[Timeline figure: per frame, kernel, then a P2P* copy at unmap, then GL. The copy is faster, but still serialized with the render.]

Code Sample – Manual-Interop (P2P flipping)

int read = 1, write = 0;
while (!done) {
    // Compute the data into the write buffer...
    cudaStreamWaitEvent(custream[write], peerFinished[read], 0);
    runCUDA(d_data[write], custream[write]);
    cudaEventRecord(kernelFinished[write], custream[write]);
    // Map the render-GPU's resource and P2P-copy the compute buffer...
    cudaSetDevice(renderGPU);
    cudaGraphicsMapResources(1, &vboRes, glstream);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    cudaStreamWaitEvent(glstream, kernelFinished[read], 0);
    cudaMemcpyPeerAsync(vboPtr, renderGPU, d_data[read], computeGPU, size, glstream);
    cudaGraphicsUnmapResources(1, &vboRes, glstream);
    cudaEventRecord(peerFinished[read], glstream);
    cudaStreamSynchronize(glstream); // Sync for easier analysis!
    cudaSetDevice(computeGPU);
    runGL(vboId);
    swap(&read, &write);
}


Timeline: Multi-GPU

Manual-interop, synchronous*, P2P flipping, WRITE-DISCARD

63

1 2 3

map

kernel[A]

runCUDA

unmap *

runGL

kernel[A]

4

CUtoH

P2P

P2P[A]

GL

P2P[A] P2P[B]

GL GL

kernel[B] kernel[B]

CO

MP

UT

E

RE

ND

ER

CUDA

CUDA

CUDA

GL

Demo


Peer-to-Peer Memory Access

Sometimes we don't want to explicitly copy, but want access to the entire space on all GPUs and the CPU.

Already possible for linear memory with UVA; the recent texture addition is useful for graphics.

Example: a large dataset that can't fit into one GPU's memory.

— Distribute the domain/data across GPUs.

— Each GPU now needs to access the other GPUs' data for halo/boundary exchange. (A sketch follows.)
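A minimal sketch of direct peer access through UVA. Peer access is assumed to be enabled already; exchangeHalo, N, and HALO are illustrative:

__global__ void exchangeHalo(float* dst, const float* remoteSrc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = remoteSrc[i];      // this read crosses the bus via P2P
}

const int N = 1 << 20, HALO = 16;
float *d0, *d1;
cudaSetDevice(0);  cudaMalloc(&d0, N * sizeof(float));
cudaSetDevice(1);  cudaMalloc(&d1, N * sizeof(float));

// Launch on device 1, passing a pointer into device 0's memory (UVA).
exchangeHalo<<<1, 256>>>(d1, d0 + (N - HALO), HALO);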

Mapping Algorithms to Hardware Topology

[Figure: a domain split across GPU0 and GPU1; data reduction, multi-phase problems.]

Won-Ki Jeong. S3308 – Fast Compressive Sensing MRI Reconstruction on a Multi-GPU System, GTC 2013.
Paulius M. S0515 – Implementing 3D Finite Difference Code on GPUs, GTC 2009.

Sharing texture across GPUs

A kernel running on device 0 accesses a texture from device 1. Peer access is unidirectional.

cudaSetDevice(0);
cudaMalloc3DArray(&d0_volume, ...);
cudaSetDevice(1);
cudaMalloc3DArray(&d1_volume, ...);

while (!done) {
    // Set CUDA device to COMPUTE DEVICE 0.
    cudaSetDevice(0);
    cudaBindTextureToArray(tex0, d0_volume);
    cudaBindTextureToArray(tex1, d1_volume);
    myKernel<<<...>>>(...);
}

__global__ void myKernel(...)
{
    float voxel0 = tex3D(tex0, u, v, w); // accesses gpu0
    float voxel1 = tex3D(tex1, u, v, w); // accesses gpu1 through p2p
}

Scaling Graphics - Multiple Quadro GPUs

Virtualized SLI case

— Multi-GPU Mosaic drives a large wall.

Access each GPU separately

— Parallel rendering.

— Multiple view-frustums, e.g. stereo, CAVEs.

[Images: Hurricane Sandy simulation showing multiple computed tracks for different input params (image credits: NOAA); Visible Human 14 GB texture data, with data distribution + render on GPU-0 through GPU-3 followed by sort + alpha composite.]

CUDA With SLI Mosaic

Use driver interop with the current CUDA device.

— The driver handles the optimized data transfer (work in progress).

The OpenGL context spans 2 CUDA devices.

[Diagram: CUDA enumeration lists [0] = K20, [1] = K5000, [2] = K5000; the OpenGL enumeration with SLI lists a single [0] = K5000 backing the GL context; cudaGLGetDevices maps between them.]

int glDeviceIndices[MAX_GPU];
unsigned int glDeviceCount;
cudaGLGetDevices(&glDeviceCount, glDeviceIndices, MAX_GPU, cudaGLDeviceListAll);
// glDeviceCount == 2: both K5000s back the single GL context.

Specifying Render GPU Explicitly

Linux

— Specify separate X screens using XOpenDisplay:

Display* dpy = XOpenDisplay(":0." + gpu);   // pseudocode: the screen index selects the GPU
GLXContext ctx = glXCreateContextAttribsARB(dpy, ...);

Windows

— Use the WGL_NV_gpu_affinity extension:

BOOL wglEnumGpusNV(UINT iGpuIndex, HGPUNV* phGpu);

// For each GPU enumerated:
GpuMask[0] = hGPU[0];
GpuMask[1] = NULL;
// Get an affinity DC based on the GPU.
HDC affinityDC = wglCreateAffinityDCNV(GpuMask);
setPixelFormat(affinityDC);
HGLRC affinityGLRC = wglCreateContext(affinityDC);

Mapping CUDA and Affinity GL Devices

How to map an OpenGL device to CUDA:

cudaWGLGetDevice(int* cudaDevice, HGPUNV oglGPUHandle)

Don't expect the same order.

— GL order is Windows-specific, while CUDA order is device-specific. (A sketch follows the diagram.)

[Diagram: CUDA enumeration lists [0] = K20, [1] = K5000, [2] = K5000; GL affinity enumeration lists [0] = K5000, [1] = K5000; GL affinity enumeration with SLI lists [0] = K5000. cudaGLGetDevices and cudaWGLGetDevice translate between the orderings.]
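A sketch of walking the affinity GPUs and mapping each one to its CUDA ordinal (Windows only; error handling omitted):

HGPUNV hGpu;
for (UINT i = 0; wglEnumGpusNV(i, &hGpu); ++i) {
    int cudaDev = -1;
    if (cudaWGLGetDevice(&cudaDev, hGpu) == cudaSuccess)
        printf("GL affinity GPU %u -> CUDA device %d\n", i, cudaDev);
    else
        printf("GL affinity GPU %u has no CUDA device\n", i);  // e.g. display-only GPU
}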

Compute & Multiple Render

GL_NV_copy_image: copy between GL devices with wglCopyImageSubDataNV and glXCopyImageSubDataNV.

— Texture objects & renderbuffers only.

— Can copy texture subregions. (A sketch follows.)
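As a sketch, assuming srcCtx/dstCtx are the two GL contexts (HGLRCs) and srcTex/dstTex are 2D textures; the parameter order follows the NV_copy_image extension:

// Copy a width x height region of mip level 0 across GPUs.
wglCopyImageSubDataNV(srcCtx, srcTex, GL_TEXTURE_2D, 0, 0, 0, 0,
                      dstCtx, dstTex, GL_TEXTURE_2D, 0, 0, 0, 0,
                      width, height, 1);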

[Figure: a CUDA context on compute GPU 1 feeds, via cudaMemcpyPeer, an auxiliary CUDA context on render GPU 2 that maps/unmaps into its OpenGL context; wglCopyImageSubData then copies to the OpenGL context on render GPU 0.]

Multithreading with OpenGL Contexts

Multi-level CPU and GPU sync primitives:

GLsync consumedFence[MAX_BUFFERS];
GLsync producedFence[MAX_BUFFERS];
HANDLE consumedFenceValid, producedFenceValid;

Thread – Source GPU0 (producer):

    // Wait for the consumer to release the texture.
    CPUWait(consumedFenceValid);
    glWaitSync(consumedFence[1]);
    // Bind render target.
    glFramebufferTexture2D(srcTex[1]);
    // Draw here…
    // Unbind.
    glFramebufferTexture2D(0);
    // Copy over to the consumer GPU.
    wglCopyImageSubDataNV(srcCtx, srcTex[1], ..., destCtx, destTex[1]);
    // Signal that the producer has completed.
    producedFence[1] = glFenceSync(…);
    CPUSignal(producedFenceValid);

Thread – Dest GPU1 (consumer):

    // Wait for the signal to start consuming.
    CPUWait(producedFenceValid);
    glWaitSync(producedFence[0]);
    // Bind texture object.
    glBindTexture(destTex[0]);
    // Use as needed...
    // Signal we are done with this texture.
    consumedFence[0] = glFenceSync(…);
    CPUSignal(consumedFenceValid);

[Figure: srcTex[0..1] on GPU0 are copied via glCopyImage into destTex[0..1] on GPU1.]

Shalini V. S0353 – Programming Multi-GPUs for Scalable Rendering, GTC 2012.

Some final thoughts...

Hardware architecture influences transfer and system throughput, e.g. NUMA.

Peer-to-peer copies/access simplify multi-GPU programming.

— But context-switching between CUDA and GL will serialize.

When scaling graphics

— Remember the device mapping between CUDA and OpenGL.

Conclusions & Resources

The driver can do all the heavy lifting, but...

Scalability and final performance are up to the developer.

— For fine-grained control and optimization, you might want to move the data manually.

CUDA samples/documentation:

— http://developer.nvidia.com/cuda-downloads

OpenGL Insights, Patrick Cozzi, Christophe Riccio, 2012. ISBN 1439893764.
www.openglinsights.com

Demo code

Code for the benchmarking demo (Win7 & Linux) is available at:

— http://bitbucket.org/wbraithwaite_nvidia/interopbenchmark