GPUs and GPU Programming, by Bharadwaj Subramanian, Apollo Ellis, Keshav Pingali


Post on 14-Jan-2016


Page 1: GPUs and GPU Programming Bharadwaj  Subramanian, Apollo Ellis,  Keshav Pingali

GPUs and GPU Programming

Bharadwaj Subramanian, Apollo Ellis, Keshav Pingali

Imagery taken from Nvidia Dawn Demo

Slides on GPUs, CUDA and Programming Models by Apollo Ellis

Slides on OpenCL by Bharadwaj Subramanian

Page 2

A GPU is a Multi-core Architecture

• High throughput is prioritized over low-latency single-task execution
• Large collection of fixed-function and software-programmable resources

Page 3

Graphics Pipeline

• A virtual scene and a virtual camera are used to render
• Direct3D and OpenGL formulate the process as a pipeline of operations on fundamental entities
  – Vertices
  – Primitives
  – Fragments
  – Pixels
• Data flows in entity streams between pipeline stages

Page 4

Graphics Pipeline

Page 5

Graphics Pipeline

• GPU Front End
  – Otherwise known as the Vertex Generator
    • Takes in vertex descriptors: location plus type (line, triangle, quad, poly)
  – Attributes (normal, texture coordinate, color, etc.)
• Performs a prefetch on the vertex data and constructs a vertex stream

Page 6

Graphics Pipeline

• Vertex Processing
  – Programmable vertex shader executes
    • Typically converts from world space to camera space
    • Languages include Cg and HLSL
• Primitive Assembly
  – Converts from vertices to primitives
• Rasterization
  – Primitive sampler in screen space
  – Fragment generator
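The world-space to camera-space conversion mentioned for vertex processing can be sketched on the CPU in plain C++. This is only an illustration under simplifying assumptions: the camera is unrotated, so the view matrix is a pure translation, and the names `Vec4`, `Mat4`, `translation_view`, and `vertex_shader` are hypothetical, not part of Cg, HLSL, or any graphics API.

```cpp
// Minimal CPU sketch of the per-vertex work of a vertex shader.
struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[4][4]; };  // row-major storage

// Hypothetical helper: view matrix for an unrotated camera at (ex, ey, ez).
// Translating the world by -eye places the camera at the origin.
Mat4 translation_view(float ex, float ey, float ez) {
    return Mat4{{{1.0f, 0.0f, 0.0f, -ex},
                 {0.0f, 1.0f, 0.0f, -ey},
                 {0.0f, 0.0f, 1.0f, -ez},
                 {0.0f, 0.0f, 0.0f, 1.0f}}};
}

// Processes one vertex in isolation: camera_pos = view * world_pos.
Vec4 vertex_shader(const Mat4& view, const Vec4& world_pos) {
    const float in[4] = {world_pos.x, world_pos.y, world_pos.z, world_pos.w};
    float out[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[r] += view.m[r][c] * in[c];
    return Vec4{out[0], out[1], out[2], out[3]};
}
```

A vertex located exactly at the camera position maps to the camera-space origin, which is a quick sanity check for the matrix.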

Page 7

Graphics Pipeline

• Fragment Processing
  – Programmable fragment shader executes
    • Texture lookup and light interaction calculation
    • Cg and HLSL
• ROP
  – Raster operations (depth buffer cull, alpha blend)
  – Calculates each fragment's contribution to given pixels
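The two raster operations named above, depth buffer cull and alpha blend, can be sketched as follows. This is a simplified CPU model, assuming a "smaller depth is closer" convention and the common source-over blend; the `Fragment` and `Pixel` structs and the `rop` function are hypothetical names for illustration, and real hardware offers many more configurable modes.

```cpp
struct Fragment { float depth; float r, g, b, a; };
struct Pixel    { float depth; float r, g, b; };

// Applies one fragment to one pixel.
// Returns false if the fragment is culled by the depth test.
bool rop(Pixel& px, const Fragment& f) {
    if (f.depth >= px.depth) return false;   // depth cull: not closer than stored depth
    // Alpha blend: fragment contributes a, existing pixel keeps (1 - a).
    px.r = f.a * f.r + (1.0f - f.a) * px.r;
    px.g = f.a * f.g + (1.0f - f.a) * px.g;
    px.b = f.a * f.b + (1.0f - f.a) * px.b;
    px.depth = f.depth;                      // simplification: always write depth
    return true;
}
```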

Page 8

Shader Programming

• Fragment or vertex processing is defined by shader programs written in Cg, GLSL, or HLSL
• Compiled at runtime to binary
• Or compiled offline and then transformed at runtime
• C-like function that processes a single input and output in isolation
• Run in parallel on multiple shader cores
• Wide SIMD instructions due to instruction streaming
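The "single input and output in isolation" property is what makes shaders easy to run in parallel. A rough CPU analogy in C++, assuming a made-up brightness adjustment as the shader body (`shade` and `run_shader` are hypothetical names): because `shade` reads only its own input, the loop iterations could be distributed over any number of cores or SIMD lanes.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical shader body: depends only on its own input element.
float shade(float intensity) {
    float v = intensity * 0.5f + 0.25f;
    return std::max(0.0f, std::min(1.0f, v));  // clamp to [0, 1]
}

// Applies the shader to every element of the input stream.
// Executed serially here; each call is independent of all others.
std::vector<float> run_shader(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(), shade);
    return out;
}
```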

Page 9

Parallel Processing and Encapsulation

• Task parallelism is available across stages
  – E.g., vertices are processed while fragments are processed, etc.
• Data parallelism is available across stream entities
  – Each entity is independent of the others because of the task offloading onto the fixed function units
• Fixed function units encapsulate hard-to-parallelize work in optimized hardware components

Page 10

Still A Scheduling Problem

• Processing and on-chip resources must be dynamically reallocated to pipeline stages
• Depends on the current loads at different stages
• How to determine whether different stages get more cores or more cache becomes an issue
• Hardware multithreading provides a solution to thread stalls and distributes resources more evenly

Page 11

CUDA

• CUDA is a more general data parallel model
  – No pipeline
• Clusters of threads
• Scatter operations (multiple write)
• Gather operations (multiple read)
• Application based decomposition of threads
• Threads can share data and communicate with each other
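The difference between gather and scatter is only which side of the assignment is indexed indirectly. A serial C++ sketch (the function names `gather` and `scatter` are illustrative, not CUDA API calls; on a GPU each loop iteration would be one thread):

```cpp
#include <cstddef>
#include <vector>

// Gather: each output element reads from an arbitrary input location.
std::vector<int> gather(const std::vector<int>& in,
                        const std::vector<std::size_t>& idx) {
    std::vector<int> out(idx.size());
    for (std::size_t i = 0; i < idx.size(); ++i)
        out[i] = in[idx[i]];
    return out;
}

// Scatter: each input element writes to an arbitrary output location.
// Assumes every idx[i] < out_size; duplicate indices would race on a GPU.
std::vector<int> scatter(const std::vector<int>& in,
                         const std::vector<std::size_t>& idx,
                         std::size_t out_size) {
    std::vector<int> out(out_size, 0);
    for (std::size_t i = 0; i < idx.size(); ++i)
        out[idx[i]] = in[i];
    return out;
}
```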

Page 12

CUDA Programming Model

• GPU is viewed as a coprocessor with DRAM and many parallel threads
• Data parallel portions of applications can be offloaded onto this coprocessor
• C on the GPU
  – Global and shared variables
  – Pointers and explicit memory allocation
  – OpenGL and DirectX interoperability

Page 13

Tesla Architecture

• Scalable array of multithreaded Streaming Multiprocessors (SMs)
• 768 to 12,288 concurrent threads
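One way to read the 768 to 12,288 range is as a per-SM thread limit multiplied by the number of SMs. The sketch below assumes the commonly cited Tesla-generation figures of 32 threads per warp and 24 resident warps per SM, with configurations from 1 to 16 SMs; none of these per-SM numbers appear on the slide, so treat them as assumptions.

```cpp
// Assumed Tesla-generation limits (not stated on the slide).
constexpr int kWarpSize     = 32;                       // threads per warp
constexpr int kWarpsPerSM   = 24;                       // resident warps per SM
constexpr int kThreadsPerSM = kWarpSize * kWarpsPerSM;  // 768

// Upper bound on concurrently resident threads for a given SM count.
constexpr int max_concurrent_threads(int sm_count) {
    return sm_count * kThreadsPerSM;
}

static_assert(max_concurrent_threads(1) == 768, "smallest configuration");
static_assert(max_concurrent_threads(16) == 12288, "largest configuration");
```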

Page 14

Kernels

• C/C++ simple functions or full programs
• Consist of thread blocks and grids
  – Thread Block
    • Set of concurrent threads that cooperate through barriers and shared memory
  – Grid
    • Set of thread blocks that are independent from each other
• Multiple grids per kernel
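The grid and block hierarchy determines how each thread finds its own data element. The following is a serial CPU stand-in, not CUDA: `launch` and `scale` are hypothetical helpers that mimic a one-dimensional launch, and the index expression mirrors the usual CUDA idiom `blockIdx.x * blockDim.x + threadIdx.x`.

```cpp
#include <vector>

// Hypothetical CPU stand-in for a 1D kernel launch.
// Blocks are independent, so the outer loop could run in any order.
template <typename Kernel>
void launch(int grid_dim, int block_dim, Kernel kernel) {
    for (int block = 0; block < grid_dim; ++block)
        for (int thread = 0; thread < block_dim; ++thread)
            kernel(block, block_dim, thread);
}

// Multiplies every element by a, one logical thread per element.
std::vector<float> scale(const std::vector<float>& in, float a, int block_dim) {
    std::vector<float> out(in.size());
    const int n = static_cast<int>(in.size());
    const int grid_dim = (n + block_dim - 1) / block_dim;  // round up
    launch(grid_dim, block_dim, [&](int block, int bdim, int thread) {
        const int i = block * bdim + thread;  // global thread index
        if (i < n) out[i] = a * in[i];        // guard threads past the end
    });
    return out;
}
```

The bounds check matters because the grid is rounded up to a whole number of blocks, so the last block may contain threads with no element to process.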

Page 15

Syntax Example

__global__ void my_par_func(float a) {
    // do something with a
}

// Launch a grid of 256 blocks with 256 threads per block.
int dimGrid = 256, dimBlock = 256;
my_par_func<<<dimGrid, dimBlock>>>(5.0f);

Page 16

Execution

• SIMT: Single Instruction, Multiple Thread model
• Scheduler schedules warps, or sets of concurrent threads, on SM units
• Each warp is scheduled independently of other warps
• If a warp's threads diverge in control flow, each path is executed in turn, with the threads that are not on that path turned off
• No recursion is allowed, because of stack space problems
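The cost of divergence can be illustrated with a small CPU model of one warp. This is a sketch, not how hardware is implemented: it assumes a warp of 32 threads, an if/else on whether each thread's value is even, and hypothetical names `WarpResult` and `run_diverging_branch`. Each taken path costs one pass over the warp, with a per-thread mask deciding who is active.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;  // assumed warp width

struct WarpResult {
    std::array<int, kWarpSize> values;
    int passes;  // number of serialized branch paths executed
};

// Models `x = (x % 2 == 0) ? x / 2 : 3 * x + 1` for all threads of one warp.
WarpResult run_diverging_branch(std::array<int, kWarpSize> x) {
    std::array<bool, kWarpSize> takes_if{};
    bool any_if = false, any_else = false;
    for (std::size_t i = 0; i < kWarpSize; ++i) {
        takes_if[i] = (x[i] % 2 == 0);
        any_if = any_if || takes_if[i];
        any_else = any_else || !takes_if[i];
    }
    int passes = 0;
    if (any_if) {    // first path: threads on the else side are masked off
        ++passes;
        for (std::size_t i = 0; i < kWarpSize; ++i)
            if (takes_if[i]) x[i] /= 2;
    }
    if (any_else) {  // second path: threads on the if side are masked off
        ++passes;
        for (std::size_t i = 0; i < kWarpSize; ++i)
            if (!takes_if[i]) x[i] = 3 * x[i] + 1;
    }
    return WarpResult{x, passes};
}
```

When all threads agree on the branch, only one pass is needed; a single disagreeing thread forces a second pass for the whole warp.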

Page 17

SIMD vs SIMT

• CUDA utilizes the wide SIMD units
• However, SIMD is not exposed to the programmer
• Instead, SIMD units are used by multiple threads at once
• SIMT makes use of SIMD

Page 18

CUDA Wrap Up

• More general model using same hardware
• GPU is a CUDA coprocessor
• Tesla architecture: 768 to 12,000+ threads
• C/C++ syntax
• Serial branching
• No recursion
• SIMD used by SIMT

Page 19

Another Model GRAMPS

• General Runtime Architecture for Multicore Parallel Systems
• A programming model for graphics pipelines
• Allows for custom pipelines mixing fixed function and programmable stages
• Data is exchanged using queues and buffers
• Motivation comes from hybrid applications
  – REYES, rasterization, and ray tracing

Page 20

Execution Graphs

• Analog of a GPU pipeline
• Made up of stages
• Provides scheduling information
• Not limited to execution DAGs
  – Cycles are not forbidden
• Forward progress is not guaranteed
• Flexibility presumably outweighs the cost of giving up the assurance of well-behaved programs
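Since cycles are permitted, a tool built on such a model might at least detect them so the application knows forward progress is its own responsibility. The sketch below is not GRAMPS code: `ExecutionGraph`, `add_stage`, `add_queue`, and `has_cycle` are hypothetical names, stages are graph nodes, queues are directed edges, and the check is a standard depth-first search for a back edge.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical execution graph: stages are nodes, queues are directed edges.
struct ExecutionGraph {
    std::vector<std::vector<std::size_t>> edges;  // edges[s] = stages fed by s

    std::size_t add_stage() {
        edges.emplace_back();
        return edges.size() - 1;
    }
    void add_queue(std::size_t from, std::size_t to) {
        edges[from].push_back(to);
    }
};

enum class Mark { Unvisited, InProgress, Done };

// Depth-first search; reaching an InProgress stage means a back edge.
bool dfs_finds_cycle(const ExecutionGraph& g, std::size_t s, std::vector<Mark>& mark) {
    mark[s] = Mark::InProgress;
    for (std::size_t t : g.edges[s]) {
        if (mark[t] == Mark::InProgress) return true;
        if (mark[t] == Mark::Unvisited && dfs_finds_cycle(g, t, mark)) return true;
    }
    mark[s] = Mark::Done;
    return false;
}

// True if the graph is not a DAG. A cycle is legal, just not guaranteed to drain.
bool has_cycle(const ExecutionGraph& g) {
    std::vector<Mark> mark(g.edges.size(), Mark::Unvisited);
    for (std::size_t s = 0; s < g.edges.size(); ++s)
        if (mark[s] == Mark::Unvisited && dfs_finds_cycle(g, s, mark)) return true;
    return false;
}
```

A ray tracer is a natural example of a cyclic graph: a shading stage may feed secondary rays back into the intersection stage.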

Page 21

Stages

• Types: SHADER, THREAD, FIXEDFUNCTION
• Operate asynchronously, exposing parallelism
• Indicate similarities in data access and execution characteristics for efficient processing
• Useful when the benefits of coherent execution outweigh the cost of deferred processing

Page 22

Shader

• Short-lived, run-to-completion computations
• Per-element programs
• Push operation introduced for conditional output
• Otherwise queue inputs and outputs are managed automatically
• Shader instances are scheduled in packets, similar to GPU execution
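The role of push can be sketched as a per-element program that decides for itself whether to emit output. This is a loose C++ analogy, not the GRAMPS interface: `Queue`, `cull_shader`, and `run_packet` are hypothetical names, and the depth threshold is an arbitrary example condition.

```cpp
#include <deque>
#include <vector>

using Queue = std::deque<float>;

// Hypothetical per-element shader: emits output only for elements it keeps.
void cull_shader(float depth, Queue& out) {
    if (depth < 1.0f)
        out.push_back(depth);  // "push": conditional output
}

// Hypothetical runtime loop: one shader instance per element of a packet.
// The shader never touches the input queue itself.
template <typename Shader>
void run_packet(const std::vector<float>& packet, Queue& out, Shader shader) {
    for (float element : packet)
        shader(element, out);
}
```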

Page 23

Threads and Fixed Function

• Threads
  – Similar to CPU threads, designed for task parallelism
  – Must be manually parallelized by the application
  – Useful for repacking data between Shader stages and processing bulk chunks of data where sharing or cross communication is needed
• Fixed Function
  – Hardware unit wrappers

Page 24

Buffers and Queues

• Buffers
  – Essentially shared memory across stages
• Queues
  – Packets are the primitive data format of the queue, defined at creation
  – Opaque packets: for data chunks which need not be interpreted
  – Collection packets: for shader group dispatch
• Queue Manipulation
  – Thread/Fixed stages
  – Shader stages

Page 25

Thread Fixed Stages

• reserve-commit
  – reserve: returns the caller a reference to one or more contiguous packets; a reservation is also acquired
  – commit: a notification that releases the referenced data back to the system
  – Input commit means the packet has been consumed
  – Output commit means the packet can go downstream
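The reserve-commit protocol can be sketched with a small single-producer, single-consumer queue. This is an illustrative model only, assuming integer packets, one outstanding reservation per side, and no reuse of consumed slots; the class `PacketQueue` and its method names are hypothetical and do not reproduce the GRAMPS API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical packet queue with reserve-commit access on both ends.
// Simplification: storage is linear, so consumed slots are never recycled.
class PacketQueue {
public:
    explicit PacketQueue(std::size_t capacity) : slots_(capacity) {}

    // Output reserve: window of `count` contiguous free packets, or nullptr.
    int* reserve_output(std::size_t count) {
        if (write_reserved_ != 0 || committed_ + count > slots_.size()) return nullptr;
        write_reserved_ = count;
        return slots_.data() + committed_;
    }
    // Output commit: the reserved packets may now go downstream.
    void commit_output() { committed_ += write_reserved_; write_reserved_ = 0; }

    // Input reserve: window of `count` contiguous committed packets, or nullptr.
    const int* reserve_input(std::size_t count) {
        if (read_reserved_ != 0 || consumed_ + count > committed_) return nullptr;
        read_reserved_ = count;
        return slots_.data() + consumed_;
    }
    // Input commit: the reserved packets have been consumed.
    void commit_input() { consumed_ += read_reserved_; read_reserved_ = 0; }

    // Packets committed by the producer and not yet consumed.
    std::size_t available() const { return committed_ - consumed_; }

private:
    std::vector<int> slots_;
    std::size_t committed_ = 0, consumed_ = 0;
    std::size_t write_reserved_ = 0, read_reserved_ = 0;
};
```

Separating reserve from commit lets a stage fill or read packets in place, without copying, while the queue still knows which packets are safe to hand to the next stage.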

Page 26

Shader Stages

• Queue ops are transparent to the user
• As input packets arrive, output reservations are obtained
• When all shader instances for a collection packet are done, the commits happen automatically
• Queue Sets are introduced
  – Groups of queues viewed as single queues for sharing among shaders

Page 27

Summary GRAMPS

• Application creates stages, queues, and buffers
• Queues and buffers are bound to stages
• Computation proceeds according to execution graphs
• Computation graphs are fully programmable
• Dynamic aggregation of work at runtime
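The create, bind, and run steps in the summary can be sketched end to end. This is a toy serial model, not the GRAMPS runtime: `Queue`, `Stage`, and `run` are hypothetical names, packets are plain integers, and the scheduler simply drains any stage that has pending input.

```cpp
#include <deque>
#include <functional>
#include <vector>

using Queue = std::deque<int>;

// Hypothetical stage: bound to one input queue and one output queue.
struct Stage {
    Queue* in;
    Queue* out;
    std::function<int(int)> work;  // per-packet computation
};

// Hypothetical scheduler: keeps running stages until no stage has input left.
// On a cyclic graph that keeps producing work, this loop would not terminate.
void run(std::vector<Stage>& stages) {
    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (Stage& s : stages) {
            while (!s.in->empty()) {
                int packet = s.in->front();
                s.in->pop_front();
                s.out->push_back(s.work(packet));
                progressed = true;
            }
        }
    }
}
```

An application would create the queues first, bind them to stages in the order it wants data to flow, seed the first queue, and then call `run`.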
