Graphics and Computing GPUs
Jehan-François Pâris
[email protected]
TRANSCRIPT
Why bother? (I)
• Yesterday's fastest computer was the Sequoia supercomputer
– Can crunch 16.32 quadrillion calculations per second (16.32 petaflops)
– 98,304 compute nodes
• Each compute node is a 16-core PowerPC A2 processor
Why bother? (II)
• Today's fastest computer is the Cray XK7
– Hits 17.59 petaflops on the LINPACK benchmark
– Features 560,640 processors, including 261,632 Nvidia K20x accelerating cores
• Supercomputing version of consumer-oriented Kepler GPUs
Why bother (III)
• Most techniques developed for high-speed computing end up trickling down to mass markets
History (I)
• Up to the late 90's
– No GPUs
– Much simpler VGA controller
• Consisted of
– A memory controller
– Display generator + DRAM
• DRAM was either shared with the CPU or private
History (II)
• By 1997
– More complex VGA controllers
• Incorporated 3D acceleration functions in hardware
– Triangle setup and rasterization
– Texture mapping and shading
Rasterization
• Converting
– An image described in a vector graphics format as a combination of shapes
• Lines, polygons, letters, …
into
– A raster image consisting of individual pixels
History (III)
• By 2000
– Single-chip graphics processors incorporated nearly all functions of the graphics pipeline of high-end workstations
• Beginning of the end of the high-end workstation market
– The VGA controller was renamed the Graphics Processing Unit (GPU)
Current trends (I)
• Graphics processing standards
– Well-defined APIs
– OpenGL: open standard for 3D graphics programming
– DirectX: set of MS multimedia programming interfaces (Direct3D for 3D graphics)
• The Xbox was named after it!
Current trends (II)
• Frequent doubling of GPU speeds
– Every 12 to 18 months
• New paradigm:
– Visual computing stands at the intersection of graphics processing and parallel computing
• Can implement novel graphics algorithms
• Use GPUs for non-conventional applications
Two results
• Triumph of heterogeneous architectures
– Combining the powers of CPU and GPU
• GPUs become scalable parallel processors
– Moving from hardware-defined pipelined architectures to more flexible programmable architectures
From GPGPU to CUDA
• GPGPU
– General-Purpose computing on GPUs
– Uses the traditional graphics API and graphics pipeline
From GPGPU to CUDA
• CUDA
– Compute Unified Device Architecture
– Parallel computing platform and programming model
• C/C++
• Invented by NVIDIA
– Single Program Multiple Data (SPMD) approach
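The SPMD idea can be sketched in a few lines of CUDA: every thread executes the same program, distinguished only by its computed index. This is a minimal sketch, assuming a CUDA-capable GPU; error checking is omitted and unified memory (`cudaMallocManaged`) is used for brevity.

```cuda
#include <cstdio>

// Same program for every thread; each operates on a different element.
__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));   // visible to CPU and GPU
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    add<<<(n + 255) / 256, 256>>>(a, b, c, n);  // grid of 256-thread blocks
    cudaDeviceSynchronize();                    // wait for the GPU to finish

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```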
Old School Approach
[Diagram: the CPU connects through the North Bridge to RAM and to the South Bridge (with a UART); the VGA controller, with its own frame buffer, sits on the PCI bus and drives the VGA display.]
Variations
• Unified Memory Architecture (UMA):
– GPU shares RAM with the CPU
– Lower memory bandwidth, higher latency
– Cheap, low-end solution
• Scalable Link Interface (SLI):
– NVIDIA
– Allows multiple GPUs
– High-end solution
Game console
• Similar architectures
• Architectures evolve over time
• Objective is to reduce costs while maintaining performance
GPU interfaces and drivers
• GPU attached to CPU via PCI-Express
– Replaces older AGP
• Interfaces such as OpenGL and Direct3D use the GPU as a coprocessor
– Send commands, programs and data to the GPU through a specific GPU device driver
• They are often buggy!
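The coprocessor model looks like this from a CUDA program's point of view: the host allocates device memory, pushes data across PCI-Express, issues a launch command through the driver, and pulls results back. A sketch only; `some_kernel` is a hypothetical kernel and error checking is omitted.

```cuda
// Host-side fragment illustrating the command/data flow to the GPU.
float host_in[256], host_out[256], *dev_buf;

cudaMalloc(&dev_buf, sizeof(host_in));              // allocate on the device
cudaMemcpy(dev_buf, host_in, sizeof(host_in),
           cudaMemcpyHostToDevice);                 // CPU -> GPU over PCIe
some_kernel<<<1, 256>>>(dev_buf);                   // launch command, sent via the driver
cudaMemcpy(host_out, dev_buf, sizeof(host_out),
           cudaMemcpyDeviceToHost);                 // GPU -> CPU
cudaFree(dev_buf);
```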
Graphics logical pipeline
• Input Assembler → Vertex Shader → Geometry Shader → Setup & Rasterizer → Pixel Shader → Raster Operations & Merger
• These functions must be mapped onto a programmable GPU
Basic Unified GPU Architecture
• Programmable processor array
– Tightly integrated with fixed-function processors for texture filtering, rasterization, raster operations
– Emphasis is on a very high level of parallelism
Example architecture
• Tesla architecture (NVIDIA GeForce 8800)
• 112 streaming processor (SP) cores
– Organized as 14 multithreaded streaming multiprocessors (SMs)
• Each SP core
– Manages 96 concurrent threads
• Thread state is maintained in hardware
• The GPU connects to four 64-bit-wide DRAM partitions
Example architecture
• Each SM has
– 8 SP cores
– 2 special function units
– Separate caches for instructions and constants
– A multithreaded instruction unit
– Shared memory (NUMA?)
Key idea
• Must decompose the problem into a set of parallel computations
– Ideally two-level, to match the GPU organization
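In CUDA the two levels map directly onto the grid/block hierarchy: blocks are the "small arrays," threads within a block are the "tiny" pieces. A minimal sketch, assuming CUDA; the doubling operation stands in for real per-element work.

```cuda
// Two-level decomposition: grid of blocks (level 1), threads per block
// (level 2). Launch as, e.g., two_level<<<(n+255)/256, 256>>>(data, n);
__global__ void two_level(float *data, int n)
{
    // Level 1: which small array (block) this thread belongs to
    int base = blockIdx.x * blockDim.x;
    // Level 2: which tiny piece (thread) within that block
    int i = base + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f;   // stand-in for real per-element work
}
```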
Example
[Diagram: the data start in one big array, which is decomposed into several small arrays; each small array is further decomposed into tiny pieces.]
CUDA
• CUDA programs are written in C
• Provides three abstractions
– Hierarchy of thread groups
– Shared memory
– Barrier synchronization
Barrier synchronization
• Barriers let threads
– Wait for the completion of a computation step by other cores so they can
• Exchange results
• Start the next step
Example
[Diagram: rows of tiny tasks run in parallel; at each barrier the threads wait for each other and exchange partial results before starting the next row of tasks.]
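The wait-and-exchange pattern in the diagram corresponds to `__syncthreads()` in CUDA: a barrier over all threads of a block. The sketch below sums a block's elements in shared memory, with a barrier between steps so partial results can be exchanged safely. A sketch only, assuming the block size is a power of two; the kernel must be launched with `blockDim.x * sizeof(float)` bytes of dynamic shared memory.

```cuda
// Tree reduction in shared memory: each step halves the number of
// active threads; __syncthreads() is the barrier between steps.
__global__ void block_sum(const float *in, float *out)
{
    extern __shared__ float partial[];          // one slot per thread
    int tid = threadIdx.x;
    partial[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                            // wait: all loads done

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                        // wait: exchange complete
    }
    if (tid == 0) out[blockIdx.x] = partial[0]; // one result per block
}
```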