TRANSCRIPT
SIGGRAPH Asia 2009 GPU Computing Master Class
Future Direction in GPU Computing
Toru Baji / NVIDIA Solution Architect, 16th Dec., 2009
© 2009 NVIDIA Corporation
Contents
- Why GPU Computing?
  - Single-threaded CPU performance is no longer scaling
  - CUDA can exploit "Performance = Parallelism"
- Evolution of NVIDIA GPU
- Key Milestone in GPU Computing: 'Fermi'
  - Processor / Memory Architecture
  - Enhanced Thread Management
  - CUDA 3.0
  - Integrated Development Environment 'Nexus'
- An NVIDIA ExaScale Machine in 2017
- Conclusion
Single-threaded CPU performance is no longer scaling (1): Moore's Law
In 1965, Gordon Moore predicted that the number of transistors would grow at a rate of 2x per year, later revised to 2x per 18 months.
Moore's Law makes no prediction about CPU performance.
(Moore, Electronics 38(8), April 1965)
Single-threaded CPU performance is no longer scaling (2): The End of ILP (Instruction-Level Parallelism) Scaling
Growing transistor counts were used to increase the parallelism of CPU instruction execution (e.g. superscalar, pipelining) and for other speed-ups.
[Chart: CPU performance over time, measured as the time required to execute one instruction (ps/inst).]
(Bill Dally et al., The Last Classical Computer, ISAT Study, 2001)
Single-threaded CPU performance is no longer scaling (3): Explicit Parallelism is Now Attractive
Growing transistor counts are now used to increase the number of processors, and the gap between explicitly parallel performance and single-thread performance is growing.
(Bill Dally et al., The Last Classical Computer, ISAT Study, 2001)
CUDA can exploit "Performance = Parallelism"
It is now clear that "Performance = Parallelism". What is required to exploit this parallelism?
- Many efficient processors
- A programming system that abstracts the parallelism
CUDA (Compute Unified Device Architecture):
- The GPU has hundreds of full-featured processors
- A multi-threading architecture makes the best use of these massively parallel cores
- CUDA started as an extension of the familiar C language, and has since expanded to Fortran, OpenCL and DirectCompute
- CUDA isolates the programmer from the details of parallel programming
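To make that abstraction concrete, here is a minimal CUDA C sketch (mine, not from the talk; the kernel name vecAdd and the sizes are illustrative): the programmer writes per-thread code and declares a hierarchy of thread blocks, and the runtime maps those blocks onto however many SMs the GPU has, so the same code scales across GPU sizes.

    // Minimal sketch: one thread per output element. blockIdx/threadIdx are
    // supplied by CUDA; the grid/block sizes declare the thread hierarchy.
    #include <cuda_runtime.h>

    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    int main()
    {
        const int n = 1 << 20;
        float *a, *b, *c;
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));
        cudaMalloc(&c, n * sizeof(float));

        // Launch enough 256-thread blocks to cover n elements; the hardware
        // scheduler distributes the blocks over the available SMs.
        vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
        cudaDeviceSynchronize();

        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }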
CUDA GPU Computing Visual Demo (separate animation tool used)
- Hierarchical thread definition
- Scalability to n SMs
- Hiding memory latency
Ease of Programming and Performance Increase
- Ease of programming is represented as "development time"
- Performance achievement is measured in GFLOPS
[Chart comparing development time against achieved GFLOPS for: GPU with CUDA, the Cell processor, CPU with SSE SIMD, and a high-level algorithm development tool on the CPU. Source: Nicolas Pinto, MIT]
Evolution of NVIDIA GPU - Number of Cores
[Chart: number of programmable shaders per die for major NVIDIA products, from separate vertex + pixel shaders (GeForce3/4, e.g. Ti4800 with 2+4; GeForce FX, 6 and 7 series, e.g. 7900GT with 8+16) to unified shaders with the Tesla architecture (GeForce 8/9 and GT200 series, e.g. 8800GTX 128, GTX285 240, Tesla C870 128, C1060 240) and on to the Fermi architecture with 512 cores, programmable via CUDA, OpenCL, DirectX Compute etc.
- The vertical axis shows the number of cores, not performance
- Only major products are shown]
The Gap in GPU vs. CPU Performance and Memory Bandwidth is Growing
[Chart: NVIDIA GPU vs. x86 CPU performance and memory bandwidth over time; the GPU reaches 1 TFLOPS single precision with 4 GB of memory, and Fermi adds 8x double precision, ECC, and L1/L2 caches.]
Key Milestone in GPU Computing: 'Fermi'
- Improved peak performance
- Improved throughput efficiency
- Broader applicability
- Full integration within a modern software development environment
[Fermi die diagram: 16 SMs around a shared L2 cache, with six DRAM interfaces, the host interface, and the GigaThread scheduler.]
SM Multiprocessor Architecture
[SM block diagram: instruction cache; dual warp scheduler and dispatch units; register file; 32 CUDA cores; 16 load/store units; 4 special function units; interconnect network; 64 KB configurable shared memory / L1 cache; uniform cache.]
- 32 CUDA cores per SM (512 total)
- 8x peak FP64 performance; FP64 runs at 50% of the peak FP32 rate
- Dual thread scheduler
- 64 KB of RAM per SM, configurable between shared memory and L1 cache

Ops/clk per SM: FP32 32 | FP64 16 | INT 32 | SFU 4 | LD/ST 16
IEEE 754-2008 Floating Point
- IEEE 754-2008 results: 64-bit double precision, 32-bit single precision, full-speed denormal operands & results, NaNs, +/- Infinity
- IEEE 754-2008 rounding modes: nearest even, toward zero, +inf, -inf
- IEEE 754-2008 Fused Multiply-Add (FMA): D = A*B + C with no loss of precision; IEEE divide & sqrt are built on FMA
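As a concrete illustration (my sketch, not from the talk): CUDA exposes the fused operation through the standard fma()/fmaf() math functions and intrinsics such as __fmaf_rn(). The product a*b is not rounded before the add, so only one rounding error is incurred.

    // Sketch: fused multiply-add in a kernel. __fmaf_rn() fuses a*b + c with a
    // single round-to-nearest-even step, as required by IEEE 754-2008.
    __global__ void fmaDemo(const float *a, const float *b, const float *c,
                            float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            d[i] = __fmaf_rn(a[i], b[i], c[i]);  // double version: fma(a, b, c)
    }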
Cached Memory Hierarchy
- Configurable L1 cache per SM: 16 KB L1 / 48 KB shared memory, or 48 KB L1 / 16 KB shared memory
- Shared 768 KB L2 cache
Motivation:
- Caching captures locality and amplifies bandwidth
- Caching is more effective than shared-memory RAM for irregular or unpredictable access (ray tracing, sparse matrix multiply, physics kernels, ...)
- Caching helps latency-sensitive cases
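In CUDA, the L1/shared split can be requested per kernel through the runtime API; a hedged sketch follows (the kernel name is illustrative):

    // Sketch: choosing the L1/shared split for a kernel on a Fermi-class GPU.
    #include <cuda_runtime.h>

    __global__ void sparseKernel(float *out, const float *in) { /* ... */ }

    void configure()
    {
        // Irregular, cache-friendly kernels: prefer 48 KB L1 / 16 KB shared.
        cudaFuncSetCacheConfig(sparseKernel, cudaFuncCachePreferL1);
        // Kernels with heavy explicit data sharing would instead use
        // cudaFuncCachePreferShared (16 KB L1 / 48 KB shared).
    }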
Larger, Faster Memory Interface
- GDDR5 memory interface: 2x improvement in peak speed over GDDR3
- Up to 1 terabyte of memory attached to the GPU: operate on large data sets
Unified Load/Store Addressing
Global, shared and local memory are reached through a single unified address space, so one load/store instruction (and one C/C++ pointer) can access any of them (see CUDA C 3.0 below).
ECC Memory Protection
- All major internal memories are protected by ECC: register file, L1 cache, L2 cache
- External DRAM is protected by ECC
- ECC is a must-have for many computing applications (clear customer feedback)
GigaThread™ Hardware Thread Scheduler (HTS)
- Hierarchically manages tens of thousands of simultaneously active threads
- 10x faster context switching on Fermi
- Overlapping kernel execution
Overlapping Kernel Execution
[Figure: two timelines. With serial kernel execution, kernels 1-5 run strictly one after another; with parallel kernel execution, independent kernels run concurrently, shortening the total time.]
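In CUDA this is expressed with streams: kernels launched into different streams have no implied ordering, so Fermi may run them concurrently. A minimal sketch (kernel names assumed, not from the talk):

    #include <cuda_runtime.h>

    __global__ void kernelA(float *x) { x[threadIdx.x] += 1.0f; }
    __global__ void kernelB(float *y) { y[threadIdx.x] *= 2.0f; }

    int main()
    {
        float *x, *y;
        cudaMalloc(&x, 64 * sizeof(float));
        cudaMalloc(&y, 64 * sizeof(float));

        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // Independent kernels in different streams: Fermi can execute up to 16
        // such kernels concurrently, filling SMs a small kernel leaves idle.
        kernelA<<<1, 64, 0, s1>>>(x);
        kernelB<<<1, 64, 0, s2>>>(y);

        cudaStreamSynchronize(s1);
        cudaStreamSynchronize(s2);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFree(x); cudaFree(y);
        return 0;
    }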
GigaThread Streaming Data Transfer (SDT) Engine
- Dual DMA engines
- Simultaneous CPU→GPU and GPU→CPU data transfer
- Fully overlapped with CPU/GPU processing
[Figure: timeline with kernels 0-3 running on the GPU while the two SDT engines (SDT0, SDT1) stream data between CPU and GPU in both directions, overlapped with CPU and GPU processing.]
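The standard way to drive both DMA engines from CUDA is to split the work into chunks and issue cudaMemcpyAsync() transfers and kernel launches in per-chunk streams, so uploads, kernel work, and downloads overlap. A sketch under assumed names and sizes:

    #include <cuda_runtime.h>

    __global__ void process(float *buf, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] *= 2.0f;
    }

    int main()
    {
        const int n = 1 << 20, chunks = 4, cn = n / chunks;
        float *h, *d;
        cudaMallocHost(&h, n * sizeof(float));  // pinned memory, required for async copies
        cudaMalloc(&d, n * sizeof(float));

        cudaStream_t s[chunks];
        for (int k = 0; k < chunks; ++k) cudaStreamCreate(&s[k]);

        // Per chunk: upload, process, download. With dual DMA engines, one
        // chunk's download can proceed while the next chunk uploads.
        for (int k = 0; k < chunks; ++k) {
            float *hc = h + k * cn, *dc = d + k * cn;
            cudaMemcpyAsync(dc, hc, cn * sizeof(float), cudaMemcpyHostToDevice, s[k]);
            process<<<(cn + 255) / 256, 256, 0, s[k]>>>(dc, cn);
            cudaMemcpyAsync(hc, dc, cn * sizeof(float), cudaMemcpyDeviceToHost, s[k]);
        }
        cudaDeviceSynchronize();

        for (int k = 0; k < chunks; ++k) cudaStreamDestroy(s[k]);
        cudaFreeHost(h);
        cudaFree(d);
        return 0;
    }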
Feature | G80 | GT200 | Fermi
Transistors | 681 million | 1.4 billion | 3.0 billion
CUDA Cores | 128 | 240 | 512
Double Precision Floating Point | - | 30 FMA ops/clock | 256 FMA ops/clock
Single Precision Floating Point | 128 MAD ops/clock | 240 MAD ops/clock | 512 FMA ops/clock
Special Function Units (per SM) | 2 | 2 | 4
Warp Schedulers (per SM) | 1 | 1 | 2
Shared Memory (per SM) | 16 KB | 16 KB | Configurable 48/16 KB
L1 Cache (per SM) | - | - | Configurable 16/48 KB
L2 Cache | - | - | 768 KB
ECC Memory Support | - | - | Yes
Concurrent Kernels | - | - | Up to 16
Load/Store Address Width | 32-bit | 32-bit | 64-bit
CUDA Parallel Computing Architecture
[Software stack: GPU Computing Applications on top; C, C++, OpenCL™, DirectCompute, Fortran, Java and Python on the CUDA Parallel Computing Architecture; the NVIDIA GPU with the CUDA Parallel Computing Architecture underneath.]
(OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.)
CUDA C 3.0
- Unified addressing for C and C++ pointers: global, shared and local addresses; enables 3rd-party GPU-callable libraries and dynamic linking; one 40-bit address space for load/store instructions
- Compiling for native 64-bit addressing
- IEEE 754-2008 single & double precision
- C99 math.h support
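To show what unified addressing buys the programmer, here is a small sketch (mine, not from the talk; names are illustrative): one __device__ function dereferences a pointer regardless of whether it refers to global or shared memory, which is what makes ordinary pointer-based C/C++ libraries callable on the GPU.

    // Sketch: the same pointer-taking function works on global and shared data,
    // because loads/stores go through one unified address space on Fermi.
    __device__ float sumFirst4(const float *p)  // p may be global or shared
    {
        return p[0] + p[1] + p[2] + p[3];
    }

    __global__ void demo(const float *gdata, float *out)
    {
        __shared__ float sdata[4];
        if (threadIdx.x < 4)
            sdata[threadIdx.x] = gdata[threadIdx.x] * 2.0f;
        __syncthreads();

        if (threadIdx.x == 0)
            out[0] = sumFirst4(gdata) + sumFirst4(sdata);  // one code path for both
    }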
NVIDIA Integrated Development Environment, Code-named 'Nexus'
- Industry's first IDE for massively parallel applications
- Accelerates development of CPU + GPU co-processing applications
- Complete Visual Studio-integrated development environment: C, C++, OpenCL & DirectCompute platforms; DirectX and OpenGL graphics APIs
Fermi Summary
- Third-generation CUDA architecture
- Improved peak performance
- Improved throughput efficiency
- Broader applicability
- Full integration within a modern development environment
Future: ExaScale Computing
- ExaScale (10^18) is the next milestone following PetaScale (10^15) computing
- DARPA, HP, France and others are planning for ExaScale computing facilities
- The DARPA report (28th Sept., 2008) described four major challenges: power consumption, memory, concurrency/locality, and resiliency
- DARPA reports that the number one issue is power: extrapolation indicates over 100 MW for an ExaFLOP machine
  (available at www.darpa.mil/personnel/docs/ExaScale_Study_Initial.pdf)
A Possible NVIDIA ExaScale Machine in 2017
by Bill Dally, Chief Scientist & Sr. VP of Research, NVIDIA
(a projection based on Moore's Law, not a committed roadmap)

GPU Node (GPU + CPU + memory + supply), ~300 W:
- 2,400 throughput cores (7,200 FPUs) and 16 CPUs on a single chip
- 40 TFLOPS (SP), 13 TFLOPS (DP)

Node Memory:
- 128 GB DRAM, 2 TB/s bandwidth
- 512 GB phase-change Flash for checkpoint and scratch

Cabinet, ~100 kW:
- 384 nodes, 15.7 PFLOPS (SP), 50 TB DRAM
- Dragonfly network, 1 TB/s node bandwidth

System, ~10 MW:
- 128 cabinets, 2 ExaFLOPS (SP), 6.8 PB DRAM
- Dragonfly network with active optical links
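A quick sanity check of these figures (my arithmetic, not from the slide): 384 nodes x 40 TFLOPS ≈ 15.4 PFLOPS and 384 x 128 GB ≈ 49 TB per cabinet, and 128 cabinets x 15.7 PFLOPS ≈ 2.0 ExaFLOPS at the system level; the quoted 15.7 PFLOPS, 50 TB and 6.8 PB are close to these products, with the small gaps presumably due to rounding in the per-node numbers.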
Conclusion
- Single-threaded processor performance is saturating
- Performance = Parallelism
- NVIDIA is committed to offering massively parallel GPUs now and beyond
- CUDA GPU computing abstracts parallel programming
- 'Fermi' is the key milestone in GPU computing
Thank you for your attention