TRANSCRIPT
SIGGRAPH Asia 2009 GPU Computing Master Class
Future Direction in GPU Computing
Toru Baji / NVIDIA Solution Architect, 16th Dec., 2009
© 2009 NVIDIA Corporation
Contents
- Why GPU Computing?
  - Single-threaded CPU performance is no longer scaling
  - CUDA can exploit "Performance = Parallelism"
- Evolution of NVIDIA GPU
- Key Milestone in GPU Computing: 'Fermi'
  - Processor / Memory Architecture
  - Enhanced Thread Management
  - CUDA 3.0
  - Integrated Development Environment 'Nexus'
- An NVIDIA ExaScale Machine in 2017
- Conclusion
Single-threaded CPU performance is no longer scaling (1): Moore's Law
In 1965, Gordon Moore predicted that the number of transistors would grow at a rate of 2x per year, later revised to 2x per 18 months.
Moore's Law makes no prediction about CPU performance.
(Moore, Electronics 38(8), April 1965)
Single-threaded CPU performance is no longer scaling (2): The End of ILP (Instruction-Level Parallelism) Scaling
Growing transistor counts were used to increase the parallelism of CPU instruction execution (e.g. superscalar, pipelining) and for other speed-ups.
[Chart: CPU performance over time, measured as the time required to execute one instruction (ps/inst).]
(Bill Dally et al., The Last Classical Computer, ISAT Study, 2001)
Single-threaded CPU performance is no longer scaling (3): Explicit Parallelism is Now Attractive
Growing transistor counts are now used to increase the number of processors, and the gap between explicitly parallel performance and single-thread performance is growing.
(Bill Dally et al., The Last Classical Computer, ISAT Study, 2001)
CUDA can exploit "Performance = Parallelism"
It is now clear that "Performance = Parallelism". What is required to exploit this parallelism?
- Many efficient processors
- A programming system that abstracts the parallelism
CUDA (Compute Unified Device Architecture):
- The GPU has hundreds of full-featured processors
- A multi-threading architecture makes the best use of these massively parallel cores
- CUDA started as an extension of the familiar C language, and has since expanded to Fortran, OpenCL and DirectCompute
- CUDA isolates the programmer from the details of parallel programming
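To make that abstraction concrete, here is a minimal CUDA C sketch (mine, not from the talk; the kernel name vecAdd and the sizes are illustrative): the programmer writes per-thread code and declares a hierarchy of thread blocks, and the runtime maps those blocks onto however many SMs the GPU has, so the same code scales across GPU sizes.

    // Minimal sketch: one thread per output element. blockIdx/threadIdx are
    // supplied by CUDA; the grid/block sizes declare the thread hierarchy.
    #include <cuda_runtime.h>

    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    int main()
    {
        const int n = 1 << 20;
        float *a, *b, *c;
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));
        cudaMalloc(&c, n * sizeof(float));

        // Launch enough 256-thread blocks to cover n elements; the hardware
        // scheduler distributes the blocks over the available SMs.
        vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
        cudaDeviceSynchronize();

        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }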
CUDA GPU Computing Visual Demo (separate animation tool used)
- Hierarchical thread definition
- Scalability to n SMs
- Hiding memory latency
Ease of Programming and Performance Increase
- Ease of programming is represented as "development time"
- Performance achievement is measured in GFLOPS
[Chart comparing development time against achieved GFLOPS for: GPU with CUDA, the Cell processor, CPU with SSE SIMD, and a high-level algorithm development tool on the CPU. Source: Nicolas Pinto, MIT]
Evolution of NVIDIA GPU - Number of Cores
[Chart: number of programmable shaders per die for major NVIDIA products, from separate vertex + pixel shaders (GeForce3/4, e.g. Ti4800 with 2+4; GeForce FX, 6 and 7 series, e.g. 7900GT with 8+16) to unified shaders with the Tesla architecture (GeForce 8/9 and GT200 series, e.g. 8800GTX 128, GTX285 240, Tesla C870 128, C1060 240) and on to the Fermi architecture with 512 cores, programmable via CUDA, OpenCL, DirectX Compute etc.
- The vertical axis shows the number of cores, not performance
- Only major products are shown]
The Gap in GPU vs. CPU Performance and Memory Bandwidth is Growing
[Chart: NVIDIA GPU vs. x86 CPU performance and memory bandwidth over time; the GPU reaches 1 TFLOPS single precision with 4 GB of memory, and Fermi adds 8x double precision, ECC, and L1/L2 caches.]
Key Milestone in GPU Computing: 'Fermi'
- Improved peak performance
- Improved throughput efficiency
- Broader applicability
- Full integration within a modern software development environment
[Fermi die diagram: 16 SMs around a shared L2 cache, with six DRAM interfaces, the host interface, and the GigaThread scheduler.]
SM Multiprocessor Architecture
[SM block diagram: instruction cache; dual warp scheduler and dispatch units; register file; 32 CUDA cores; 16 load/store units; 4 special function units; interconnect network; 64 KB configurable shared memory / L1 cache; uniform cache.]
- 32 CUDA cores per SM (512 total)
- 8x peak FP64 performance; FP64 runs at 50% of the peak FP32 rate
- Dual thread scheduler
- 64 KB of RAM per SM, configurable between shared memory and L1 cache

Ops/clk per SM: FP32 32 | FP64 16 | INT 32 | SFU 4 | LD/ST 16
IEEE 754-2008 Floating Point
- IEEE 754-2008 results: 64-bit double precision, 32-bit single precision, full-speed denormal operands & results, NaNs, +/- Infinity
- IEEE 754-2008 rounding modes: nearest even, toward zero, +inf, -inf
- IEEE 754-2008 Fused Multiply-Add (FMA): D = A*B + C with no loss of precision; IEEE divide & sqrt are built on FMA
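As a concrete illustration (my sketch, not from the talk): CUDA exposes the fused operation through the standard fma()/fmaf() math functions and intrinsics such as __fmaf_rn(). The product a*b is not rounded before the add, so only one rounding error is incurred.

    // Sketch: fused multiply-add in a kernel. __fmaf_rn() fuses a*b + c with a
    // single round-to-nearest-even step, as required by IEEE 754-2008.
    __global__ void fmaDemo(const float *a, const float *b, const float *c,
                            float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            d[i] = __fmaf_rn(a[i], b[i], c[i]);  // double version: fma(a, b, c)
    }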
Cached Memory Hierarchy
- Configurable L1 cache per SM: 16 KB L1 / 48 KB shared memory, or 48 KB L1 / 16 KB shared memory
- Shared 768 KB L2 cache
Motivation:
- Caching captures locality and amplifies bandwidth
- Caching is more effective than shared-memory RAM for irregular or unpredictable access (ray tracing, sparse matrix multiply, physics kernels, ...)
- Caching helps latency-sensitive cases
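In CUDA, the L1/shared split can be requested per kernel through the runtime API; a hedged sketch follows (the kernel name is illustrative):

    // Sketch: choosing the L1/shared split for a kernel on a Fermi-class GPU.
    #include <cuda_runtime.h>

    __global__ void sparseKernel(float *out, const float *in) { /* ... */ }

    void configure()
    {
        // Irregular, cache-friendly kernels: prefer 48 KB L1 / 16 KB shared.
        cudaFuncSetCacheConfig(sparseKernel, cudaFuncCachePreferL1);
        // Kernels with heavy explicit data sharing would instead use
        // cudaFuncCachePreferShared (16 KB L1 / 48 KB shared).
    }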
Larger, Faster Memory Interface
- GDDR5 memory interface: 2x improvement in peak speed over GDDR3
- Up to 1 terabyte of memory attached to the GPU: operate on large data sets
Unified Load/Store Addressing
Global, shared and local memory are reached through a single unified address space, so one load/store instruction (and one C/C++ pointer) can access any of them (see CUDA C 3.0 below).
ECC Memory Protection
- All major internal memories are protected by ECC: register file, L1 cache, L2 cache
- External DRAM is protected by ECC
- ECC is a must-have for many computing applications (clear customer feedback)
GigaThread™ Hardware Thread Scheduler (HTS)
- Hierarchically manages tens of thousands of simultaneously active threads
- 10x faster context switching on Fermi
- Overlapping kernel execution
Overlapping Kernel Execution
[Figure: two timelines. With serial kernel execution, kernels 1-5 run strictly one after another; with parallel kernel execution, independent kernels run concurrently, shortening the total time.]
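In CUDA this is expressed with streams: kernels launched into different streams have no implied ordering, so Fermi may run them concurrently. A minimal sketch (kernel names assumed, not from the talk):

    #include <cuda_runtime.h>

    __global__ void kernelA(float *x) { x[threadIdx.x] += 1.0f; }
    __global__ void kernelB(float *y) { y[threadIdx.x] *= 2.0f; }

    int main()
    {
        float *x, *y;
        cudaMalloc(&x, 64 * sizeof(float));
        cudaMalloc(&y, 64 * sizeof(float));

        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // Independent kernels in different streams: Fermi can execute up to 16
        // such kernels concurrently, filling SMs a small kernel leaves idle.
        kernelA<<<1, 64, 0, s1>>>(x);
        kernelB<<<1, 64, 0, s2>>>(y);

        cudaStreamSynchronize(s1);
        cudaStreamSynchronize(s2);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFree(x); cudaFree(y);
        return 0;
    }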
GigaThread Streaming Data Transfer (SDT) Engine
- Dual DMA engines
- Simultaneous CPU→GPU and GPU→CPU data transfer
- Fully overlapped with CPU/GPU processing
[Figure: timeline with kernels 0-3 running on the GPU while the two SDT engines (SDT0, SDT1) stream data between CPU and GPU in both directions, overlapped with CPU and GPU processing.]
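The standard way to drive both DMA engines from CUDA is to split the work into chunks and issue cudaMemcpyAsync() transfers and kernel launches in per-chunk streams, so uploads, kernel work, and downloads overlap. A sketch under assumed names and sizes:

    #include <cuda_runtime.h>

    __global__ void process(float *buf, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] *= 2.0f;
    }

    int main()
    {
        const int n = 1 << 20, chunks = 4, cn = n / chunks;
        float *h, *d;
        cudaMallocHost(&h, n * sizeof(float));  // pinned memory, required for async copies
        cudaMalloc(&d, n * sizeof(float));

        cudaStream_t s[chunks];
        for (int k = 0; k < chunks; ++k) cudaStreamCreate(&s[k]);

        // Per chunk: upload, process, download. With dual DMA engines, one
        // chunk's download can proceed while the next chunk uploads.
        for (int k = 0; k < chunks; ++k) {
            float *hc = h + k * cn, *dc = d + k * cn;
            cudaMemcpyAsync(dc, hc, cn * sizeof(float), cudaMemcpyHostToDevice, s[k]);
            process<<<(cn + 255) / 256, 256, 0, s[k]>>>(dc, cn);
            cudaMemcpyAsync(hc, dc, cn * sizeof(float), cudaMemcpyDeviceToHost, s[k]);
        }
        cudaDeviceSynchronize();

        for (int k = 0; k < chunks; ++k) cudaStreamDestroy(s[k]);
        cudaFreeHost(h);
        cudaFree(d);
        return 0;
    }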
Feature | G80 | GT200 | Fermi
Transistors | 681 million | 1.4 billion | 3.0 billion
CUDA Cores | 128 | 240 | 512
Double Precision Floating Point | - | 30 FMA ops/clock | 256 FMA ops/clock
Single Precision Floating Point | 128 MAD ops/clock | 240 MAD ops/clock | 512 FMA ops/clock
Special Function Units (per SM) | 2 | 2 | 4
Warp Schedulers (per SM) | 1 | 1 | 2
Shared Memory (per SM) | 16 KB | 16 KB | Configurable 48/16 KB
L1 Cache (per SM) | - | - | Configurable 16/48 KB
L2 Cache | - | - | 768 KB
ECC Memory Support | - | - | Yes
Concurrent Kernels | - | - | Up to 16
Load/Store Address Width | 32-bit | 32-bit | 64-bit
CUDA Parallel Computing Architecture
[Software stack: GPU Computing Applications on top; C, C++, OpenCL™, DirectCompute, Fortran, Java and Python on the CUDA Parallel Computing Architecture; the NVIDIA GPU with the CUDA Parallel Computing Architecture underneath.]
(OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.)
CUDA C 3.0
- Unified addressing for C and C++ pointers: global, shared and local addresses; enables 3rd-party GPU-callable libraries and dynamic linking; one 40-bit address space for load/store instructions
- Compiling for native 64-bit addressing
- IEEE 754-2008 single & double precision
- C99 math.h support
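To show what unified addressing buys the programmer, here is a small sketch (mine, not from the talk; names are illustrative): one __device__ function dereferences a pointer regardless of whether it refers to global or shared memory, which is what makes ordinary pointer-based C/C++ libraries callable on the GPU.

    // Sketch: the same pointer-taking function works on global and shared data,
    // because loads/stores go through one unified address space on Fermi.
    __device__ float sumFirst4(const float *p)  // p may be global or shared
    {
        return p[0] + p[1] + p[2] + p[3];
    }

    __global__ void demo(const float *gdata, float *out)
    {
        __shared__ float sdata[4];
        if (threadIdx.x < 4)
            sdata[threadIdx.x] = gdata[threadIdx.x] * 2.0f;
        __syncthreads();

        if (threadIdx.x == 0)
            out[0] = sumFirst4(gdata) + sumFirst4(sdata);  // one code path for both
    }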
NVIDIA Integrated Development Environment, Code-named 'Nexus'
- Industry's first IDE for massively parallel applications
- Accelerates development of CPU + GPU co-processing applications
- Complete Visual Studio-integrated development environment: C, C++, OpenCL & DirectCompute platforms; DirectX and OpenGL graphics APIs
Fermi Summary
- Third-generation CUDA architecture
- Improved peak performance
- Improved throughput efficiency
- Broader applicability
- Full integration within a modern development environment
Future: ExaScale Computing
- ExaScale (10^18) is the next milestone following PetaScale (10^15) computing
- DARPA, HP, France and others are planning for ExaScale computing facilities
- The DARPA report (28th Sept., 2008) described four major challenges: power consumption, memory, concurrency/locality, and resiliency
- DARPA reports that the number one issue is power: extrapolation indicates over 100 MW for an ExaFLOP machine
  (available at www.darpa.mil/personnel/docs/ExaScale_Study_Initial.pdf)
A Possible NVIDIA ExaScale Machine in 2017
by Bill Dally, Chief Scientist & Sr. VP of Research, NVIDIA
(a projection based on Moore's Law, not a committed roadmap)

GPU Node (GPU + CPU + memory + supply), ~300 W:
- 2,400 throughput cores (7,200 FPUs) and 16 CPUs on a single chip
- 40 TFLOPS (SP), 13 TFLOPS (DP)

Node Memory:
- 128 GB DRAM, 2 TB/s bandwidth
- 512 GB phase-change Flash for checkpoint and scratch

Cabinet, ~100 kW:
- 384 nodes, 15.7 PFLOPS (SP), 50 TB DRAM
- Dragonfly network, 1 TB/s node bandwidth

System, ~10 MW:
- 128 cabinets, 2 ExaFLOPS (SP), 6.8 PB DRAM
- Dragonfly network with active optical links
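A quick sanity check of these figures (my arithmetic, not from the slide): 384 nodes x 40 TFLOPS ≈ 15.4 PFLOPS and 384 x 128 GB ≈ 49 TB per cabinet, and 128 cabinets x 15.7 PFLOPS ≈ 2.0 ExaFLOPS at the system level; the quoted 15.7 PFLOPS, 50 TB and 6.8 PB are close to these products, with the small gaps presumably due to rounding in the per-node numbers.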
Conclusion
- Single-threaded processor performance is saturating
- Performance = Parallelism
- NVIDIA is committed to offering massively parallel GPUs now and beyond
- CUDA GPU computing abstracts parallel programming
- 'Fermi' is the key milestone in GPU computing
Thank you for your attention