Advanced CUDA 02


NVIDIA Corporation 2010


    Agenda

    System optimizations

    Transfers between host and device

    Kernel optimizations

Measuring performance: effective bandwidth

    Coalescing

    Shared memory

    Constant memory

    Textures


HOST-DEVICE TRANSFERS

System Optimizations


    PCIe Performance

Device-to-host bandwidth is much lower than device-to-device bandwidth

    PCIe x16 Gen 2: 8 GB/s

    C1060: 102 GB/s

    Minimize transfers

Intermediate data can be allocated, operated on, and deallocated without ever copying to host memory

    Group transfers together

    One large transfer is better than many small transfers
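A minimal sketch of the batching idea (the slides show no code; the array names and sizes here are illustrative). Two small host arrays are packed into one staging buffer so a single cudaMemcpy replaces two smaller ones:

    size_t bytesA = nA * sizeof(float), bytesB = nB * sizeof(float);
    size_t total  = bytesA + bytesB;

    char *staging_h = (char *)malloc(total);
    memcpy(staging_h,          a_h, bytesA);   // pack on the host
    memcpy(staging_h + bytesA, b_h, bytesB);

    char *staging_d;
    cudaMalloc((void **)&staging_d, total);
    cudaMemcpy(staging_d, staging_h, total,    // one transfer instead of two
               cudaMemcpyHostToDevice);

    float *a_d = (float *)staging_d;           // device views into the buffer
    float *b_d = (float *)(staging_d + bytesA);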


    Pinned (non-pageable) Memory

    Pinned memory enables:

    faster PCIe copies (~2x throughput on FSB systems)

    memcopies asynchronous with CPU

    memcopies asynchronous with GPU

    Allocation

    cudaHostAlloc() / cudaFreeHost()

i.e., instead of malloc() / free()

    Implication:

    Pinned memory is essentially removed from host virtual memory
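A minimal allocation sketch (error checking omitted; N is assumed defined elsewhere):

    float *a_h;
    size_t size = N * sizeof(float);

    cudaHostAlloc((void **)&a_h, size, cudaHostAllocDefault);  // pinned, instead of malloc()
    // ... use a_h as the source/target of (asynchronous) memcopies ...
    cudaFreeHost(a_h);                                         // instead of free()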


    Asynchronous API and Streams

    Default API:

    Kernel launches are asynchronous with CPU

    Memcopies (D2H, H2D) block CPU thread

    CUDA calls are serialized by the driver

    Streams and async functions provide:

    Memcopies (D2H, H2D) asynchronous with CPU

    Ability to concurrently execute a kernel and a memcopy

Stream = a sequence of operations that execute in issue order on the GPU

    Operations from different streams can be interleaved

    A kernel and memcopy from different streams can be overlapped


    Overlap CPU Work with Kernel

    Kernel launches are asynchronous

cudaMemcpy(a_d, a_h, size,          // Blocks CPU until transfer completes
           cudaMemcpyHostToDevice);
kernel<<<grid, block>>>(a_d);       // Asynchronous launch
cpuFunction();                      // Overlaps kernel execution
cudaMemcpy(a_h, a_d, size,          // Blocks until kernel is complete,
           cudaMemcpyDeviceToHost); // then until transfer completes


    Overlap CPU Work with Kernel/Memcpy

    Overlap the CPU with the memory transfer

    Default stream is 0

cudaMemcpyAsync(a_d, a_h, size,     // Asynchronous transfer
                cudaMemcpyHostToDevice, 0);
kernel<<<grid, block>>>(a_d);       // Asynchronous launch (queued after the
                                    // memcpy since in the same stream)
cpuFunction();                      // Overlaps memcpy and kernel


    Overlap Kernel with Memcpy

    Requirements:

    D2H or H2D memcpy from pinned memory

    Device with compute capability 1.1 (G84 and later)

    Kernel and memcpy in different, non-0 streams

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);         // Create stream handles
cudaStreamCreate(&stream2);

cudaMemcpyAsync(dst, src, size,     // Asynchronous copy in stream1
                dir, stream1);
kernel<<<grid, block, 0, stream2>>>(...);  // Overlaps the transfer in stream1


    Host Device Synchronization

    Context based

    Block until all outstanding CUDA commands complete

    cudaThreadSynchronize(), cudaMemcpy(...)

    Stream based

    Block until all outstanding CUDA commands in stream complete

    cudaStreamSynchronize(stream)

Test whether all outstanding commands in a stream have completed (non-blocking); returns cudaSuccess or cudaErrorNotReady

cudaStreamQuery(stream)


    Host Device Synchronization

    Event based (within streams)

    Record event in a stream

    cudaEventRecord(event, stream)

Event is recorded (assigned timestamp) when reached by the GPU

    Block until event is recorded

    cudaEventSynchronize(event)

Test whether event has been recorded (non-blocking); returns cudaSuccess or cudaErrorNotReady

cudaEventQuery(event)
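Putting the event calls together, a minimal timing sketch (the kernel and its launch configuration are placeholders):

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);          // enqueue in stream 0
    kernel<<<grid, block>>>(a_d);       // work to be timed
    cudaEventRecord(stop, 0);

    cudaEventSynchronize(stop);         // block until 'stop' is recorded
    float ms;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);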


    Zero-Copy Memory

Device directly accesses host memory (i.e., no explicit copy)

    Check canMapHostMemory device property

    Setup from host

cudaSetDeviceFlags(cudaDeviceMapHost);
...
cudaHostAlloc((void **)&a_h, size, cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&a_d, (void *)a_h, 0);
kernel<<<grid, block>>>(a_d, N);


    Zero-Copy Memory Considerations

    Always beneficial for integrated devices

    Integrated devices access host memory directly

    Test integrated device property

    Typically beneficial if data only read/written once

    Copy input data from CPU to GPU memory

    Run one kernel

    Copy output data back from GPU to CPU memory

    Coalescing is even more important with zero-copy!


MEASURING PERFORMANCE

Kernel Optimizations


    Theoretical Bandwidth

    Device bandwidth for C1060

Memory clock: 800 MHz (800 * 10^6 Hz)

Memory interface: 512 bits (512 / 8 = 64 bytes)

DDR: yes (2 transfers per clock)

Theoretical bandwidth = (800 * 10^6) * (512 / 8) * 2 bytes/s

= 102 GB/s

Be consistent in the definition of giga (1024^3 or 10^9)


    Effective Bandwidth

    Measure time and calculate the effective bandwidth

    For example, copying array of N floats

    Size of data: N * sizeof(float) (N * 4)

Read and write: 2 memory operations per word

    Time: t

    Effective bandwidth = (N * 4) * 2 / t bytes/s
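For example, a minimal sketch measuring the effective bandwidth of a copy kernel, with the events created as in the earlier timing sketch (copyKernel and the launch configuration are illustrative):

    cudaEventRecord(start, 0);
    copyKernel<<<N / 256, 256>>>(dst_d, src_d);   // reads and writes N floats
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);       // time in milliseconds
    double gbps = (double)N * 4 * 2 / (ms / 1e3) / 1e9;  // giga = 10^9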


    Global Memory Throughput Metric

    Many applications are memory throughput bound

When coding from scratch:

Start with memory operations first, achieve good throughput

    Add the arithmetic, measuring performance as you go

    When optimizing:

    Measure effective memory throughput

    Compare to the theoretical bandwidth

    70-80% is very good, ~50% is good if arithmetic is nontrivial

    Measuring throughput

    From the app point of view (useful bytes)

    From the hw point of view (actual bytes moved across the bus)

    The two are likely to be different

    Due to coalescing, discrete bus transaction sizes


    Measuring Memory Throughput

    Latest Visual Profilerreports memory throughput

    From HW point of viewBased on counters for one TPC (3 multiprocessors)

    Need compute capability 1.2 or higherGPU


    How throughput is calculated

    Count load/store bus transactions of each size (32B, 64B, 128B) on the TPC

    Extrapolate from one TPC to the entire GPU

Multiply by (total threadblocks / threadblocks on the TPC)

i.e., (grid size / CTAs launched on the TPC)


    COALESCING

    Kernel Optimizations


    GMEM Coalescing: Compute Capability 1.2

    Possible GPU memory bus transaction sizes:

32B, 64B, or 128B

    Transaction segment must be aligned

    First address = multiple of segment size

    Hardware coalescing for each half-warp (16 threads):

Memory accesses are handled per half-warp

Carry out the smallest possible number of transactions

    Reduce transaction size when possible


    HW Steps when Coalescing for half-warp

Find the memory segment that contains the address requested by the lowest numbered active thread:

32B segment for 8-bit data

64B segment for 16-bit data

128B segment for 32-, 64- and 128-bit data

Find all other active threads whose requested address lies in the same segment

    Reduce the transaction size, if possible:

If size == 128B and only the lower or upper half is used, reduce the transaction to 64B

    If size == 64B and only the lower or upper half is used, reduce transaction to 32B

    Applied even if 64B was a reduction from 128B

    Carry out the transaction, mark serviced threads as inactive

    Repeat until all threads in the half-warp are serviced
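For contrast with the worked examples that follow, a minimal sketch of an access pattern that coalesces under these rules (assuming in and out come from cudaMalloc, which returns at least 256B-aligned pointers):

    __global__ void coalescedRead(float *out, const float *in)
    {
        // Consecutive threads touch consecutive 4-byte words, so each
        // half-warp falls in a single aligned 64B segment: one transaction.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }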


    Threads 0-15: 4-byte words at address 116

    Thread 0 is lowest active, accesses address 116

    128-byte segment: 0-127

[Diagram: addresses 0-224 in 32B steps; threads t0-t15 request words inside the 128B segment 0-127]


    Threads 0-15: 4-byte words at address 116

    Thread 0 is lowest active, accesses address 116

    128-byte segment: 0-127 (reduce to 64B)

[Diagram: the transaction is reduced to the 64B segment spanning addresses 64-127]


    Threads 0-15: 4-byte words at address 116

    Thread 0 is lowest active, accesses address 116

    128-byte segment: 0-127 (reduce to 32B)

[Diagram: a 32B transaction spanning addresses 96-127 services threads t0-t2]


    Threads 0-15: 4-byte words at address 116

    Thread 3 is lowest active, accesses address 128

    128-byte segment: 128-255

[Diagram: the remaining threads fall in the 128B segment spanning addresses 128-255]


    Threads 0-15: 4-byte words at address 116

    Thread 3 is lowest active, accesses address 128

    128-byte segment: 128-255 (reduce to 64B)

[Diagram: a 64B transaction spanning addresses 128-191 services the remaining threads]


    Comparing Compute Capabilities

    Compute capability 1.0-1.1

Requires threads in a half-warp to:

Access a single aligned 64B, 128B, or 256B segment

    Threads must issue addresses in sequence

    If requirements are not satisfied:

    Separate 32B transaction for each thread

    Compute capability 1.2-1.3

    Does not require sequential addressing by threads

Performance degrades gracefully when a half-warp addresses multiple segments


    Experiment: Impact of Address Alignment

    Assume half-warp accesses a contiguous region

Throughput is maximized when the region is aligned on its size: 100% of the bytes in each bus transaction are useful

    Impact of misaligned addressing:

    32-bit words, streaming code, Quadro FX5800 (102 GB/s)

    0 word offset: 76 GB/s (perfect alignment, typical perf)

    8 word offset: 57 GB/s (75% of aligned case)

    All others: 46 GB/s (61% of aligned case)
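A sketch of the kind of kernel behind such an experiment (the slides do not show the code; offset shifts every access by a whole number of 4-byte words):

    __global__ void offsetCopy(float *out, const float *in, int offset)
    {
        // offset = 0 keeps every half-warp segment-aligned; other values
        // force larger or additional bus transactions, as analyzed below.
        int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
        out[i] = in[i];
    }

Timing this kernel for offset = 0..16 and converting to effective bandwidth reproduces the shape of the numbers above.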


    Address Alignment, 32-bit words

    8-word (32B) offset from perfect alignment:

Observed 75% of the perfectly aligned perf

Segments starting at a multiple of 32B:

One 128B transaction (50% efficiency)

Segments starting at a multiple of 96B:

Two 32B transactions (100% efficiency)

[Diagram: one 128B transaction covering addresses 0-128 vs. two 32B transactions]


    Address Alignment, 32-bit words

    4-word (16B) offset (other offsets have the same perf):

Observed 61% of the perfectly aligned perf

Two types of segments, based on starting address:

    One 128B transaction (50% efficiency)

    One 64B and one 32B transaction (67% efficiency)

[Diagram: one 128B transaction vs. one 64B plus one 32B transaction]


    Address Alignment, 64-bit words

    Can be analyzed similarly to 32-bit case:

0B offset: 80 GB/s (perfectly aligned)

8B offset: 62 GB/s (78% of perfectly aligned)

    16B offset: 62 GB/s (78% of perfectly aligned)

    32B offset: 68 GB/s (85% of perfectly aligned)

    64B offset: 76 GB/s (95% of perfectly aligned)

    Compare 0 and 64B offset performance:

    Both consume 100% of the bytes

    64B: two 64B transactions

    0B: single 128B transaction, slightly faster


    SHARED MEMORY

    Kernel Optimizations


    Shared Memory

    Uses:

Inter-thread communication within a block

Cache data to reduce global memory accesses

    Avoid non-coalesced access

    Organization:

    16 banks, 32-bit wide

    Successive 32-bit words belong to different banks

Performance: 32 bits per bank per 2 clocks per multiprocessor

smem accesses are issued per 16 threads (half-warp)

Serialization: if n threads (out of 16) access the same bank, the n accesses are carried out serially

Broadcast: n threads access the same word in one fetch
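A minimal sketch of the staging use case: reversing each block's elements, a pattern that would be non-coalesced if done directly in global memory (the 256-thread block size is illustrative):

    __global__ void reverseBlock(float *out, const float *in)
    {
        __shared__ float tile[256];   // one word per thread (256-thread blocks)
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = in[i];    // coalesced global read
        __syncthreads();              // whole tile staged before reuse

        // Reversed smem index: each thread hits a distinct bank, so no
        // conflict, and the global write remains coalesced.
        out[i] = tile[blockDim.x - 1 - threadIdx.x];
    }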


    Bank Addressing Examples

No bank conflicts (two cases):

[Diagram: threads 0-15 mapped to banks 0-15 with no two threads sharing a bank, in two different patterns]


    Bank Addressing Examples

2-way and 8-way bank conflicts:

[Diagram: left, two threads map to each bank (2-way conflict); right, eight threads map to the same bank (8-way conflict)]


    Shared Memory Bank Conflicts

    warp_serialize profiler signal reflects conflicts

    The fast case:

If all threads of a half-warp access different banks, there is no bank conflict

If all threads of a half-warp read the identical address, there is no bank conflict (broadcast)

The slow case:

Bank conflict: multiple threads in the same half-warp access the same bank

Must serialize the accesses


    Assess Impact On Performance

Replace all SMEM indexes with threadIdx.x

Eliminates conflicts

Shows how much performance could be improved by eliminating conflicts
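A sketch of the diagnostic (stride and the surrounding kernel are illustrative; the modified kernel computes wrong results and is used only as a performance probe):

    // Original, possibly bank-conflicted access:
    //   float x = smem[threadIdx.x * stride];

    // Diagnostic replacement: thread i reads bank i, conflict-free by
    // construction. The observed speedup bounds the gain available
    // from eliminating the conflicts.
    float x = smem[threadIdx.x];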


    CONSTANTS

    Kernel Optimizations


    Constant Memory

Data stored in global memory, read through a constant cache

__constant__ qualifier in declarations

Can only be read by GPU kernels

Limited to 64KB

To be used when all threads in a warp read the same address

Serializes otherwise (indicated by the warp_serialize counter in the profiler)

Throughput: 32 bits per warp per clock per multiprocessor
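A minimal sketch (the coefficient array and kernel are illustrative):

    __constant__ float coeffs[16];    // constant memory, 64KB total limit

    __global__ void scale(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // All threads in the warp read the same address: one fetch,
        // broadcast to the warp (the fast case).
        data[i] *= coeffs[0];
    }

    // Host side (kernels cannot write constants):
    //   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(coeffs));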


    TEXTURES

    Kernel Optimizations


    Textures in CUDA

    Texture is an object for reading data

Benefits:

Data is cached

    Helpful when coalescing is a problem

    Filtering

    Linear / bilinear / trilinear interpolation

    Dedicated hardware

    Wrap modes (for out-of-bounds addresses)

    Clamp to edge / repeat

    Addressable in 1D, 2D, or 3D

    Using integer or normalized coordinates


    Texture Addressing Modes

Wrap: out-of-bounds coordinate is wrapped (modulo arithmetic)

Clamp: out-of-bounds coordinate is replaced with the closest boundary value

[Diagram: a 4-wide texture addressed at out-of-range coordinates such as (5.5, 1.5) and (2.5, 0.5), shown under wrap and clamp modes]

CUDA Texture Types

Bound to linear memory:

Global memory address is bound to a texture

Only 1D; integer addressing, no filtering, no addressing modes

Bound to CUDA arrays:

Block-linear CUDA array is bound to a texture

1D, 2D, or 3D

Float addressing (size-based or normalized)

Filtering & addressing modes (clamp, repeat) supported

Bound to pitch linear memory:

Global memory address is bound to a 2D texture

Float/integer addressing, filtering, and clamp/repeat addressing modes as with CUDA arrays

Using Textures

    Host (CPU) code:

Allocate/obtain memory (global linear, pitch linear, or CUDA array)

Create a texture reference object (currently must be at file scope)

    Bind the texture reference to memory/array

    Unbind the texture reference when finished, free resources

Device (kernel) code:

Fetch using the texture reference

Linear memory textures: tex1Dfetch()

Array textures: tex1D(), tex2D(), or tex3D()

Pitch linear textures: tex2D()
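A minimal sketch using the texture reference API current at the time of these slides (identifiers are illustrative; error checks omitted):

    // File scope, as required for texture references:
    texture<float, 1, cudaReadModeElementType> texRef;

    __global__ void fetchKernel(float *out, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            out[i] = tex1Dfetch(texRef, i);   // cached read via the texture unit
    }

    // Host side:
    //   cudaBindTexture(NULL, texRef, d_in, N * sizeof(float));
    //   fetchKernel<<<grid, block>>>(d_out, N);
    //   cudaUnbindTexture(texRef);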
