GPU
Ben de Waal, Summer 2008

TRANSCRIPT

Page 1: CScADS, cscads.rice.edu/deWaal--NVIDIA--GPUs.pdf

GPU

Ben de Waal, Summer 2008

Page 2:

Agenda

Quick Roadmap

A few observations

And a few positions

© NVIDIA Corporation 2007

Page 3:

GPUs are Great at Graphics

Hellgate: London © 2005-2006 Flagship Studios, Inc. Licensed by NAMCO BANDAI Games America, Inc.

Crysis © 2006 Crytek / Electronic Arts

Full Spectrum Warrior: Ten Hammers © 2006 Pandemic Studios, LLC. All rights reserved. © 2006 THQ Inc. All rights reserved.

Page 4:

GPUs are Great at Other Things!

An expanding trend over the last few years

Successful applications in many areas:
Computational geometry, biology, chemistry, physics, finance…
Computer vision
Database management
Signal processing
Physics simulation

Page 5:

Standard C Programming, New Architecture for Computing

dim3 DimGrid(100, 50);       // 5000 thread blocks
dim3 DimBlock(4, 8, 8);      // 256 threads per block
size_t SharedMemBytes = 64;  // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

[Figure, shown twice: a Thread Execution Manager feeds several ALUs, each with its own Control; a Parallel Data Cache holds Shared Data; DRAM holds particle state P1,V1 through P5,V5; each thread computes P' = P + V * t.]

C for the GPU: CUDA & GPU Computing

New Applications, Unprecedented Performance
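The launch configuration above can be fleshed out into a complete program. This is a minimal sketch, not from the slides: the body of KernelFunc, its parameter, and the index arithmetic are assumptions for illustration; only the grid/block/shared-memory dimensions come from the transcript.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel matching the slide's launch configuration:
// 100*50 = 5000 blocks, 4*8*8 = 256 threads per block.
__global__ void KernelFunc(float *out)
{
    extern __shared__ float smem[];   // the 64 bytes of dynamic shared memory
    // Flatten the 3D thread index and the 2D block index into one global id.
    int tid = threadIdx.z * (blockDim.y * blockDim.x)
            + threadIdx.y *  blockDim.x + threadIdx.x;   // 0..255 within the block
    int bid = blockIdx.y * gridDim.x + blockIdx.x;       // 0..4999 within the grid
    int gid = bid * (blockDim.x * blockDim.y * blockDim.z) + tid;
    smem[tid % 16] = (float)tid;      // touch the 16 floats (= 64 bytes) reserved at launch
    out[gid] = (float)gid;
}

int main()
{
    const int nThreads = 100 * 50 * 4 * 8 * 8;           // 1,280,000 threads total
    float *d_out;
    cudaMalloc(&d_out, nThreads * sizeof(float));

    dim3 DimGrid(100, 50);       // 5000 thread blocks
    dim3 DimBlock(4, 8, 8);      // 256 threads per block
    size_t SharedMemBytes = 64;  // 64 bytes of shared memory
    KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(d_out);

    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```

The third launch parameter is optional; it sizes the `extern __shared__` array visible to every thread of a block.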

Page 6:

Heterogeneous Computing

70M CUDA GPUs
60K CUDA Developers

[Figure: four CPUs alongside four GPUs]

Oil & Gas, Finance, Medical, Biophysics, Numerics, Audio, Video, Imaging

Page 7:

GeForce GTX 280 Parallel Computing Architecture

[Figure: a Thread Scheduler feeds the processor array; four groups of Atomic, Tex, and L2 units sit above eight Memory partitions.]

Page 8:

CUDA Terminology: Grids, Blocks, and Threads

[Figure: the CPU launches Kernel 1 onto the GPU device as Grid 1, a 3x2 array of Blocks (0,0) through (2,1); Kernel 2 then launches as Grid 2. Block (1,1) is expanded into a 5x3 array of Threads (0,0) through (4,2). Kernels run in sequence.]

Programmer partitions problem into a sequence of kernels.

A kernel executes as a grid of thread blocks

A thread block is an array of threads that can cooperate

Threads within the same block synchronize and share data in Shared Memory

Execute thread blocks on multithreaded multiprocessor SM cores
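The cooperation described above can be sketched in code. This is an illustrative example, not from the slides: each block loads a tile of the input into Shared Memory, synchronizes at a barrier, and then writes its tile out reversed. The kernel name and sizes are assumptions.

```cuda
#include <cuda_runtime.h>

// Threads within one block cooperate through Shared Memory;
// __syncthreads() is the intra-block barrier.
__global__ void reverseTile(const float *in, float *out)
{
    __shared__ float tile[256];                      // per-block Shared Memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global element index
    tile[threadIdx.x] = in[i];
    __syncthreads();                                 // wait until the whole tile is loaded
    out[i] = tile[blockDim.x - 1 - threadIdx.x];     // read a neighbor's element safely
}

int main()
{
    // A grid of 6 thread blocks, 256 threads each, mirroring the terminology above.
    const int n = 6 * 256;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    reverseTile<<< 6, 256 >>>(d_in, d_out);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Note that the barrier only spans one block: threads in different blocks cannot synchronize this way, which is why the problem is partitioned into independent blocks in the first place.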

Page 9:

CUDA Programming Model: Thread Memory Spaces

Each kernel thread can read:
Thread Id      per thread
Block Id       per block
Constants      per grid
Texture        per grid

Each thread can read and write:
Registers      per thread
Local memory   per thread
Shared memory  per block
Global memory  per grid

Host CPU can read and write:
Constants      per grid
Texture        per grid
Global memory  per grid

[Figure: a kernel thread program, written in C, with access to Registers, Local Memory, Shared Memory, Constants, Texture, and Global Memory.]
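The memory spaces listed above map onto CUDA C qualifiers. The following is a sketch under assumptions (kernel name, sizes, and data flow are invented for illustration; texture access is omitted):

```cuda
#include <cuda_runtime.h>

__constant__ float scale;   // Constants: per grid, written by the host, read by threads

__global__ void memorySpaces(const float *gmem_in, float *gmem_out)
{
    __shared__ float smem[128];                 // Shared memory: per block, read/write
    float reg = gmem_in[threadIdx.x];           // Registers: per thread
    float local[4];                             // may spill to Local memory: per thread
    local[0] = reg * scale;
    smem[threadIdx.x] = local[0];
    __syncthreads();
    gmem_out[threadIdx.x] = smem[threadIdx.x];  // Global memory: per grid
}

int main()
{
    // Host CPU writes Constants and Global memory before the launch.
    float h_scale = 2.0f;
    cudaMemcpyToSymbol(scale, &h_scale, sizeof(float));

    float *d_in, *d_out;
    cudaMalloc(&d_in,  128 * sizeof(float));
    cudaMalloc(&d_out, 128 * sizeof(float));
    memorySpaces<<< 1, 128 >>>(d_in, d_out);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```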

Page 10:

Trends and Observations

Core counts generally double per family
Low end has substantially fewer cores than high end
Ranges from 8 to 100s

Memory hierarchy will likely remain

Evolving:
Processor expressiveness
Ease of programming
Reducing performance cliffs
Hierarchical scheduling & partitioning
Nested parallelism

Heterogeneous computing:
Algorithms vary
Run them on the most suitable processor

Page 11:

Autotuning – Super Languages

One possible extreme outcome:

People program in an expressive enough language that maps fairly cleanly onto the installed base of processors

Programmer driven

Just very simple machine translation needed

As an example, CUDA’s programming paradigm also scales with CPU cores:

Data parallel

Memory hierarchy is explicit

i.e., it reflects an architectural superset of several different designs

Page 12:

Heterogeneous Tuning Space

Cache hit architectures
Like traditional CPUs
Thread driven execution
NUMA / cost of global coherence
Cache cliffs (hits, misses, aliasing, etc.)
Scalar / Vector (SIMD)

Cache miss architectures
Like many GPUs
Data driven execution
Wide range of cores
NUMA / sometimes no global coherence
Memory technology exposure (banks, etc.)
Vector / Scalar

Page 13:

Autotuning – Really Smart Code

Another extreme outcome:

Genetic programming style autotuners

Evolves optimal code for any (local) architecture

Potential to find a diamond in the state of Texas

Somehow still generalize


Good news: It’s parallelizable!

Detour: Circuit Synthesis

Similar Problem

Remarkable success

Remarkable exploitation

Genetic Programming III, John R. Koza et al., 1999, Chapter 25

Page 14:

Autotuning

Both extremes seem to be too good to be true

We’ll probably end up in the middle

Programmer will do some of the parameterization:
Identify blocks
Memory tradeoffs
Serial code

Autotuners explore a smaller space

Page 15:

Composition is key

Tuners likely need access to complete code base

Need a powerful/expressive enough IL that isn't source

Allow investment

Client side must be smart upfront, or binary ships its own brains

Smart client:
can have IL logic for the local system, supplied perhaps by IHVs

Smart binary:
more flexible but may not understand the target

Seems desirable for the IL to include high level expression

Page 16:

Compiling CUDA

[Figure: a C/C++ CUDA Application is compiled by NVCC into CPU Code plus PTX Code (targeting the Virtual ISA); the PTX-to-Target Translator then produces Target code for each GPU.]

Page 17:

Virtual to Target ISA Translation

Parallel Thread eXecution (PTX)
Virtual machine and ISA
Distribution format for applications
Install-time translation
“fat binary” caches target-specific versions

PTX Code:

ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
mad.f32 $f1, $f5, $f3, $f1;

The PTX-to-Target Translator emits target code per GPU, e.g.:

0x103c8009 0x0fffffff
0xd00e0609 0xa0c00780
0x100c8009 0x00000003
0x21000409 0x07800780

Target-specific translation optimizes for:

ISA differences
Resource allocation
Performance

Page 18:

Interesting Architectures

Do more on GPUs

Millions out there

Compact, well suited for server farms

Plenty of tuning parameters

A very hard problem

Represents many issues many-core CPUs are going to face

It's like the future, today

Page 19:

Interesting Architectures

Heterogeneous Tuning

Figuring out how to divide work appropriately among asymmetrical cores

E.g. partitioning a problem to map serial code onto an aggressive out-of-order mono-core CPU plus parallel parts of the problem onto a plenty-core GPU.

Page 20:

Questions?


Ben de Waal

[email protected]