Advanced CUDA 02
TRANSCRIPT
-
7/30/2019 Advanced CUDA 02
1/49
NVIDIA Corporation 2010
Advanced CUDA
-
Agenda
System optimizations
Transfers between host and device
Kernel optimizations
Measuring performance: effective bandwidth
Coalescing
Shared memory
Constant memory
Textures
-
System Optimizations: Host Device Transfers
-
PCIe Performance
Device-to-host bandwidth is much lower than device-to-device (global memory) bandwidth
PCIe x16 Gen 2: 8 GB/s
C1060: 102 GB/s
Minimize transfers
Intermediate data can be allocated, operated on, and deallocated without ever copying to host memory
Group transfers together
One large transfer is better than many small transfers
-
Pinned (non-pageable) Memory
Pinned memory enables:
faster PCIe copies (~2x throughput on FSB systems)
memcopies asynchronous with CPU
memcopies asynchronous with GPU
Allocation
cudaHostAlloc() / cudaFreeHost()
i.e., instead of malloc / free
Implication:
Pinned memory is essentially removed from host virtual memory
-
Asynchronous API and Streams
Default API:
Kernel launches are asynchronous with CPU
Memcopies (D2H, H2D) block CPU thread
CUDA calls are serialized by the driver
Streams and async functions provide:
Memcopies (D2H, H2D) asynchronous with CPU
Ability to concurrently execute a kernel and a memcopy
Stream = sequence of operations that execute in issue-order on the GPU
Operations from different streams can be interleaved
A kernel and memcopy from different streams can be overlapped
-
Overlap CPU Work with Kernel
Kernel launches are asynchronous
cudaMemcpy(a_d, a_h, size,          // Blocks CPU until transfer completes
           cudaMemcpyHostToDevice);
kernel<<<grid, block>>>(a_d);       // Asynchronous launch
cpuFunction();                      // Overlaps kernel execution
cudaMemcpy(a_h, a_d, size,          // Blocks until kernel is complete,
           cudaMemcpyDeviceToHost); // then blocks until transfer completes
-
Overlap CPU Work with Kernel/Memcpy
Overlap the CPU with the memory transfer
Default stream is 0
cudaMemcpyAsync(a_d, a_h, size,   // Asynchronous transfer
                cudaMemcpyHostToDevice, 0);
kernel<<<grid, block>>>(a_d);     // Asynchronous launch (queued after the
                                  // memcpy since in the same stream)
cpuFunction();                    // Overlaps memcpy and kernel
-
Overlap Kernel with Memcpy
Requirements:
D2H or H2D memcpy from pinned memory
Device with compute capability 1.1 (G84 and later)
Kernel and memcpy in different, non-0 streams
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);       // Create stream handles
cudaStreamCreate(&stream2);
cudaMemcpyAsync(dst, src, size,   // Asynchronous transfer in stream1
                dir, stream1);
kernel<<<grid, block, 0, stream2>>>(); // Overlaps the transfer
-
Host Device Synchronization
Context based
Block until all outstanding CUDA commands complete
cudaThreadSynchronize(), cudaMemcpy(...)
Stream based
Block until all outstanding CUDA commands in stream complete
cudaStreamSynchronize(stream)
Test whether stream is idle (empty); returns cudaSuccess or cudaErrorNotReady (non-blocking)
cudaStreamQuery(stream)
-
Host Device Synchronization
Event based (within streams)
Record event in a stream
cudaEventRecord(event, stream)
Event is recorded (assigned a timestamp) when reached by the GPU
Block until event is recorded
cudaEventSynchronize(event)
Test whether event has been recorded; returns cudaSuccess or cudaErrorNotReady (non-blocking)
cudaEventQuery(event)
-
Zero-Copy Memory
Device directly accesses host memory (i.e. no explicit copy)
Check canMapHostMemory device property
Setup from host
cudaSetDeviceFlags(cudaDeviceMapHost);
...
cudaHostAlloc((void **)&a_h, size, cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&a_d, (void *)a_h, 0);
kernel<<<grid, block>>>(a_d, N);
-
Zero-Copy Memory Considerations
Always beneficial for integrated devices
Integrated devices access host memory directly
Test integrated device property
Typically beneficial if data only read/written once
Copy input data from CPU to GPU memory
Run one kernel
Copy output data back from GPU to CPU memory
Coalescing is even more important with zero-copy!
-
Kernel Optimizations: Measuring Performance
-
Theoretical Bandwidth
Device bandwidth for C1060
Memory clock: 800 MHz (800 × 10^6 Hz)
Memory interface: 512 bits (512 / 8 bytes)
DDR: yes, 2 transfers per clock
Theoretical bandwidth = (800 × 10^6) × (512 / 8) × 2 bytes/s
= 102 GB/s
Be consistent in the definition of giga (1024^3 or 10^9)
-
Effective Bandwidth
Measure time and calculate the effective bandwidth
For example, copying array of N floats
Size of data: N * sizeof(float) (N * 4)
Read and write: yes, 2 ops per word
Time: t
Effective bandwidth = (N * 4) * 2 / t bytes/s
-
Global Memory Throughput Metric
Many applications are memory throughput bound
When coding from scratch:
Start with memory operations first, achieve good throughput
Add the arithmetic, measuring performance as you go
When optimizing:
Measure effective memory throughput
Compare to the theoretical bandwidth
70-80% is very good, ~50% is good if arithmetic is nontrivial
Measuring throughput
From the app point of view (useful bytes)
From the hw point of view (actual bytes moved across the bus)
The two are likely to be different
Due to coalescing, discrete bus transaction sizes
-
Measuring Memory Throughput
Visual Profiler reports memory throughput
From the HW point of view
Based on counters for one TPC (3 multiprocessors)
Needs a GPU with compute capability 1.2 or higher
How throughput is calculated
Count load/store bus transactions of each size (32B, 64B, 128B) on the TPC
Extrapolate from one TPC to the entire GPU
Multiply by (total threadblocks / threadblocks on TPC)
i.e. (grid size / cta launched)
-
Kernel Optimizations: Coalescing
-
GMEM Coalescing: Compute Capability 1.2
Possible GPU memory bus transaction sizes:
32B, 64B, or 128B
Transaction segment must be aligned
First address = multiple of segment size
Hardware coalescing for each half-warp (16 threads):
Memory accesses are handled per half-warp
Carry out the smallest possible number of transactions
Reduce transaction size when possible
-
HW Steps when Coalescing for half-warp
Find the memory segment that contains the address requested by the lowest active thread:
32B segment for 8-bit data
64B segment for 16-bit data
128B segment for 32-, 64-, and 128-bit data
Find all other active threads whose requested address lies in the same segment
Reduce the transaction size, if possible:
If size == 128B and only the lower or upper half is used, reduce transaction to 64B
If size == 64B and only the lower or upper half is used, reduce transaction to 32B
Applied even if 64B was a reduction from 128B
Carry out the transaction, mark serviced threads as inactive
Repeat until all threads in the half-warp are serviced
-
Threads 0-15: 4-byte words at address 116
Thread 0 is lowest active, accesses address 116
128-byte segment: 0-127
(figure: address ruler 0-224 with the 128B segment 0-127 highlighted; threads t0-t15 begin at address 116)
-
Threads 0-15: 4-byte words at address 116
Thread 0 is lowest active, accesses address 116
128-byte segment: 0-127 (reduce to 64B)
(figure: same accesses with the transaction reduced to the 64B segment 64-127)
-
Threads 0-15: 4-byte words at address 116
Thread 0 is lowest active, accesses address 116
128-byte segment: 0-127 (reduce to 32B)
(figure: same accesses with the transaction reduced to the 32B segment 96-127)
-
Threads 0-15: 4-byte words at address 116
Thread 3 is lowest active, accesses address 128
128-byte segment: 128-255
(figure: remaining threads t3-t15 fall in the 128B segment 128-255)
-
Threads 0-15: 4-byte words at address 116
Thread 3 is lowest active, accesses address 128
128-byte segment: 128-255 (reduce to 64B)
(figure: the transaction is reduced to the 64B segment 128-191)
-
Comparing Compute Capabilities
Compute capability 1.0-1.1
Requires threads in a half-warp to:
Access a single aligned 64B, 128B, or 256B segment
Threads must issue addresses in sequence
If requirements are not satisfied:
Separate 32B transaction for each thread
Compute capability 1.2-1.3
Does not require sequential addressing by threads
Performance degrades gracefully when a half-warp addresses multiple segments
-
Experiment: Impact of Address Alignment
Assume half-warp accesses a contiguous region
Throughput is maximized when the region is aligned on its size:
100% of bytes in a bus transaction are useful
Impact of misaligned addressing:
32-bit words, streaming code, Quadro FX5800 (102 GB/s)
0 word offset: 76 GB/s (perfect alignment, typical perf)
8 word offset: 57 GB/s (75% of aligned case)
All others: 46 GB/s (61% of aligned case)
-
Address Alignment, 32-bit words
8-word (32B) offset from perfect alignment:
Observed 75% of the perfectly aligned perf
Segments starting at a multiple of 32B:
One 128B transaction (50% efficiency)
Segments starting at a multiple of 96B:
Two 32B transactions (100% efficiency)
(figure: address ruler 0-128 showing one 128B transaction vs. two 32B transactions)
-
Address Alignment, 32-bit words
4-word (16B) offset (other offsets have the same perf):
Observed 61% of the perfectly aligned perf
Two types of segments, based on starting address:
One 128B transaction (50% efficiency)
One 64B and one 32B transaction (67% efficiency)
(figure: address ruler 0-128 showing one 128B transaction vs. a 64B plus a 32B transaction)
-
Address Alignment, 64-bit words
Can be analyzed similarly to 32-bit case:
0B offset: 80 GB/s (perfectly aligned)
8B offset: 62 GB/s (78% of perfectly aligned)
16B offset: 62 GB/s (78% of perfectly aligned)
32B offset: 68 GB/s (85% of perfectly aligned)
64B offset: 76 GB/s (95% of perfectly aligned)
Compare 0 and 64B offset performance:
Both consume 100% of the bytes
64B: two 64B transactions
0B: single 128B transaction, slightly faster
-
Kernel Optimizations: Shared Memory
-
Shared Memory
Uses:
Inter-thread communication within a block
Cache data to reduce global memory accesses
Avoid non-coalesced access
Organization:
16 banks, 32-bit wide
Successive 32-bit words belong to different banks
Performance: 32 bits per bank per 2 clocks per multiprocessor
smem accesses are per 16-threads (half-warp)
Serialization: if n threads (out of 16) access the same bank, the n accesses are serialized
Broadcast: n threads access the same word in one fetch
-
Bank Addressing Examples
Left: linear addressing, thread i accesses bank i (no conflicts)
Right: random 1:1 permutation, each thread accesses a distinct bank (no conflicts)
(figure: threads 0-15 mapped one-to-one onto banks 0-15)
-
Bank Addressing Examples
Left: linear addressing with stride 2, two threads per bank (2-way conflicts)
Right: linear addressing with stride 8, eight threads per bank (8-way conflicts)
(figure: threads 0-15 colliding on banks, 2 and 8 threads per bank)
-
Shared Memory Bank Conflicts
warp_serialize profiler signal reflects conflicts
The fast case:
If all threads of a half-warp access different banks, there is no bank conflict
If all threads of a half-warp read the identical address, there is no bank conflict (broadcast)
The slow case:
Bank conflict: multiple threads in the same half-warp access the same bank
Must serialize the accesses
-
Assess Impact On Performance
Replace all SMEM indexes with threadIdx.x
Eliminates conflicts
Shows how much performance could be improved by eliminating conflicts
-
Kernel Optimizations: Constant Memory
-
Constant Memory
Data stored in global memory, read through a constant cache
__constant__ qualifier in declarations
Can only be read by GPU kernels
Limited to 64KB
To be used when all threads in a warp read the same address
Serializes otherwise (indicated by the warp_serialize counter in the profiler)
Throughput: 32 bits per warp per clock per multiprocessor
-
Kernel Optimizations: Textures
-
Textures in CUDA
Texture is an object for reading data
Benefits:
Data is cached
Helpful when coalescing is a problem
Filtering
Linear / bilinear / trilinear interpolation
Dedicated hardware
Wrap modes (for out-of-bounds addresses)
Clamp to edge / repeat
Addressable in 1D, 2D, or 3D
Using integer or normalized coordinates
-
Texture Addressing Modes
(figure: a texture with texels 0-3; example coordinates (5.5, 1.5), (2.5, 0.5), and (1.0, 1.0))
Wrap
Out-of-bounds coordinate is wrapped (modulo arithmetic)
Clamp
Out-of-bounds coordinate is replaced with the closest boundary
-
CUDA Texture Types
Bound to linear memory
Global memory address is bound to a texture
Only 1D
Integer addressing, no filtering, no addressing modes
Bound to CUDA arrays
Block linear CUDA array is bound to a texture
1D, 2D, or 3D
Float addressing (size-based or normalized)
Filtering and addressing modes (clamp, repeat) supported
Bound to pitch linear memory
Global memory address is bound to a 2D texture
Float/integer addressing, filtering, and clamp/repeat addressing modes supported
-
Using Textures
Host (CPU) code:
Allocate/obtain memory (global linear, pitch linear, or CUDA array)
Create a texture reference object (currently must be at file scope)
Bind the texture reference to memory/array
Unbind the texture reference when finished, free resources
Device (kernel) code:
Fetch using texture reference
Linear memory textures: tex1Dfetch()
Array textures: tex1D() or tex2D() or tex3D()
Pitch linear textures: tex2D()
-