Graphics and Computing GPUs
Jehan-François Pâris
[email protected]
TRANSCRIPT
Why bother? (I)
• Yesterday's fastest computer was the Sequoia supercomputer
– Can crunch 16.32 quadrillion calculations per second (16.32 petaflops)
– 98,304 compute nodes
• Each compute node is a 16-core PowerPC A2 processor
Why bother? (II)
• Today's fastest computer is the Cray XK7
– Hits 17.59 petaflops on the LINPACK benchmark
– Features 560,640 processors, including 261,632 Nvidia K20x accelerating cores
• Supercomputing version of consumer-oriented Kepler GPUs
Why bother (III)
• Most techniques developed for high-speed computing end up trickling down to mass markets
History (I)
• Up to the late 90's
– No GPUs
– Much simpler VGA controller
• Consisted of
– A memory controller
– Display generator + DRAM
• DRAM was either shared with the CPU or private
History (II)
• By 1997
– More complex VGA controllers
• Incorporated 3D acceleration functions in hardware
– Triangle setup and rasterization
– Texture mapping and shading
Rasterization
• Converting
– An image described in a vector graphics format as a combination of shapes
• Lines, polygons, letters, …
into
– A raster image consisting of individual pixels
History (III)
• By 2000
– Single-chip graphics processors incorporated nearly all functions of the graphics pipeline of high-end workstations
• Beginning of the end of the high-end workstation market
– The VGA controller was renamed the Graphics Processing Unit (GPU)
Current trends (I)
• Graphics processing standards
– Well-defined APIs
– OpenGL: open standard for 3D graphics programming
– DirectX: set of MS multimedia programming interfaces (Direct3D for 3D graphics)
• The Xbox was named after it!
Current trends (II)
• Frequent doubling of GPU speeds
– Every 12 to 18 months
• New paradigm:
– Visual computing stands at the intersection of graphics processing and parallel computing
• Can implement novel graphics algorithms
• Use GPUs for non-conventional applications
Two results
• Triumph of heterogeneous architectures
– Combining the powers of CPU and GPU
• GPUs become scalable parallel processors
– Moving from hardware-defined pipelined architectures to more flexible programmable architectures
From GPGPU to CUDA
• GPGPU
– General-Purpose computing on GPUs
– Uses the traditional graphics API and graphics pipeline
From GPGPU to CUDA
• CUDA
– Compute Unified Device Architecture
– Parallel computing platform and programming model
• C/C++
• Invented by NVIDIA
– Single Program Multiple Data (SPMD) approach
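The SPMD idea can be sketched in a few lines of CUDA: every thread executes the same program, distinguished only by its computed index. This is a minimal sketch, assuming a CUDA-capable GPU; error checking is omitted and unified memory (`cudaMallocManaged`) is used for brevity.

```cuda
#include <cstdio>

// Same program for every thread; each operates on a different element.
__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));   // visible to CPU and GPU
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    add<<<(n + 255) / 256, 256>>>(a, b, c, n);  // grid of 256-thread blocks
    cudaDeviceSynchronize();                    // wait for the GPU to finish

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```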
Old School Approach
[Diagram: the CPU connects through the North Bridge to RAM and to the South Bridge (with a UART); the VGA controller, with its own frame buffer, sits on the PCI bus and drives the VGA display.]
Variations
• Unified Memory Architecture (UMA):
– GPU shares RAM with the CPU
– Lower memory bandwidth, higher latency
– Cheap, low-end solution
• Scalable Link Interface (SLI):
– NVIDIA
– Allows multiple GPUs
– High-end solution
Game console
• Similar architectures
• Architectures evolve over time
• Objective is to reduce costs while maintaining performance
GPU interfaces and drivers
• GPU attached to CPU via PCI-Express
– Replaces older AGP
• Interfaces such as OpenGL and Direct3D use the GPU as a coprocessor
– Send commands, programs and data to the GPU through a specific GPU device driver
• They are often buggy!
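The coprocessor model looks like this from a CUDA program's point of view: the host allocates device memory, pushes data across PCI-Express, issues a launch command through the driver, and pulls results back. A sketch only; `some_kernel` is a hypothetical kernel and error checking is omitted.

```cuda
// Host-side fragment illustrating the command/data flow to the GPU.
float host_in[256], host_out[256], *dev_buf;

cudaMalloc(&dev_buf, sizeof(host_in));              // allocate on the device
cudaMemcpy(dev_buf, host_in, sizeof(host_in),
           cudaMemcpyHostToDevice);                 // CPU -> GPU over PCIe
some_kernel<<<1, 256>>>(dev_buf);                   // launch command, sent via the driver
cudaMemcpy(host_out, dev_buf, sizeof(host_out),
           cudaMemcpyDeviceToHost);                 // GPU -> CPU
cudaFree(dev_buf);
```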
Graphics logical pipeline
• Input Assembler → Vertex Shader → Geometry Shader → Setup & Rasterizer → Pixel Shader → Raster Operations & Merger
• These functions must be mapped onto a programmable GPU
Basic Unified GPU Architecture
• Programmable processor array
– Tightly integrated with fixed-function processors for texture filtering, rasterization, raster operations
– Emphasis is on a very high level of parallelism
Example architecture
• Tesla architecture (NVIDIA GeForce 8800)
• 112 streaming processor (SP) cores
– Organized as 14 multithreaded streaming multiprocessors (SMs)
• Each SP core
– Manages 96 concurrent threads
• Thread state is maintained in hardware
• The GPU connects to four 64-bit-wide DRAM partitions
Example architecture
• Each SM has
– 8 SP cores
– 2 special function units
– Separate caches for instructions and constants
– A multithreaded instruction unit
– Shared memory (NUMA?)
Key idea
• Must decompose the problem into a set of parallel computations
– Ideally two-level, to match the GPU organization
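In CUDA the two levels map directly onto the grid/block hierarchy: blocks are the "small arrays," threads within a block are the "tiny" pieces. A minimal sketch, assuming CUDA; the doubling operation stands in for real per-element work.

```cuda
// Two-level decomposition: grid of blocks (level 1), threads per block
// (level 2). Launch as, e.g., two_level<<<(n+255)/256, 256>>>(data, n);
__global__ void two_level(float *data, int n)
{
    // Level 1: which small array (block) this thread belongs to
    int base = blockIdx.x * blockDim.x;
    // Level 2: which tiny piece (thread) within that block
    int i = base + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f;   // stand-in for real per-element work
}
```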
Example
[Diagram: the data start in one big array, which is decomposed into several small arrays; each small array is further decomposed into tiny pieces.]
CUDA
• CUDA programs are written in C
• Provides three abstractions
– Hierarchy of thread groups
– Shared memory
– Barrier synchronization
Barrier synchronization
• Barriers let threads
– Wait for the completion of a computation step by other cores so they can
• Exchange results
• Start the next step
Example
[Diagram: rows of tiny tasks run in parallel; at each barrier the threads wait for each other and exchange partial results before starting the next row of tasks.]
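The wait-and-exchange pattern in the diagram corresponds to `__syncthreads()` in CUDA: a barrier over all threads of a block. The sketch below sums a block's elements in shared memory, with a barrier between steps so partial results can be exchanged safely. A sketch only, assuming the block size is a power of two; the kernel must be launched with `blockDim.x * sizeof(float)` bytes of dynamic shared memory.

```cuda
// Tree reduction in shared memory: each step halves the number of
// active threads; __syncthreads() is the barrier between steps.
__global__ void block_sum(const float *in, float *out)
{
    extern __shared__ float partial[];          // one slot per thread
    int tid = threadIdx.x;
    partial[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                            // wait: all loads done

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                        // wait: exchange complete
    }
    if (tid == 0) out[blockIdx.x] = partial[0]; // one result per block
}
```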