Schedule
1. Introduction, performance metrics & analysis
2. Many-core hardware
3. CUDA class 1: basics
4. CUDA class 2: advanced
5. Case study: LOFAR telescope with many-cores
What are many-cores?
From Wikipedia: “A many-core processor is a multi-core processor in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient — largely because of issues with congestion in supplying instructions and data to the many processors.”
How many is many?
Several tens of cores
How are they different from multi-core CPUs?
Non-uniform memory access (NUMA)
Private memories
Network-on-chip
Examples
Multi-core CPUs (48-core AMD Magny-Cours)
Graphics Processing Units (GPUs)
Cell processor (PlayStation 3)
Server processors (Sun Niagara)
Many-core questions
The search for performance:
Build hardware: what architectures?
Evaluate hardware: what metrics? How do we measure?
Use it: what workloads? Expected performance?
Program it: how to program? How to optimize?
Benchmark: how to analyze performance?
Today’s Topics
Introduction
Why many-core programming?
History
Hardware introduction
Performance model:
Arithmetic Intensity and Roofline
Why do we need many-cores?
[Chart: single-chip peak GFLOPS over time, NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) pulling away from Intel CPUs (3 GHz dual-core P4, 3 GHz Core2 Duo, 3 GHz quad-core Xeon, Westmere)]
China's Tianhe-1A
#2 in the Top500 list
4.701 PFLOPS peak
2.566 PFLOPS maximal (Linpack)
14,336 Intel Xeon X5670 processors
7,168 NVIDIA Tesla M2050 GPUs × 448 cores = 3,211,264 GPU cores
Power efficiency
Graphics in 1980
Graphics in 2000
Graphics now: GPU movie
Why do we need many-cores?
Performance
Large scale parallelism
Power Efficiency
Use transistors more efficiently
Price (GPUs)
Huge market, bigger than Hollywood
Mass production, economy of scale
“spotty teenagers” pay for our HPC needs!
GPU transistor counts, 1995–2010:
RIVA 128: 3M transistors
GeForce 256: 23M transistors
GeForce 3: 60M transistors
GeForce FX: 125M transistors
GeForce 8800: 681M transistors
“Fermi”: 3B transistors
GPGPU History
Use Graphics primitives for HPC
Ikonas [England 1978]
Pixel Machine [Potmesil & Hoffert 1989]
Pixel-Planes 5 [Rhoades, et al. 1992]
Programmable shaders, around 1998
DirectX / OpenGL
Map application onto graphics domain!
GPGPU
Brook (2004), CUDA (2007), OpenCL (Dec 2008), ...
CUDA C/C++: Continuous Innovation (2007–2010)
CUDA Toolkit 1.0 (July 2007): C compiler, C extensions, single precision, BLAS, FFT, SDK with 40 examples
CUDA Toolkit 1.1 (Nov 2007): Win XP 64, atomics support, multi-GPU support
CUDA Toolkit 2.0 (2008): double precision, compiler optimizations, Vista 32/64, Mac OS X, 3D textures, HW interpolation
CUDA Toolkit 2.3 (July 2009): DP FFT, 16–32 conversion intrinsics, performance enhancements
Tools along the way: CUDA Visual Profiler 2.2, cuda-gdb HW debugger
Parallel Nsight Beta and CUDA Toolkit 3.0 (Mar 2010): C++ inheritance, Fermi arch support, tools updates, driver/RT interop
CUDA Tools
Parallel Nsight: Visual Studio
Visual Profiler: for Linux
cuda-gdb: for Linux
Many-core hardware introduction
The search for performance
We have M(o)ore transistors …
Bigger cores?
We are hitting the walls!
power, memory, instruction-level parallelism (ILP)
How do we use them?
Large-scale parallelism
Many-cores !
Choices …
Core type(s):
Fat or slim ?
Vectorized (SIMD) ?
Homogeneous or heterogeneous?
Number of cores:
Few or many ?
Memory
Shared-memory or distributed-memory?
Parallelism
Instruction-level parallelism, threads, vectors, …
A taxonomy
Based on “field-of-origin”:
General-purpose
Intel, AMD
Graphics Processing Units (GPUs)
NVIDIA, ATI
Gaming/Entertainment
Sony/Toshiba/IBM
Embedded systems
Philips/NXP, ARM
Servers
Oracle, IBM, Intel
High Performance Computing
Intel, IBM, …
General Purpose Processors
Architecture
Few fat cores
Vectorization (SSE, AVX)
Homogeneous
Stand-alone
Memory
Shared, multi-layered
Per-core cache and shared cache
Programming
Processes (OS Scheduler)
Message passing
Multi-threading
Coarse-grained parallelism
Server-side
General-purpose-like with more hardware threads
Lower performance per thread
Examples
Sun Niagara II
8 cores x 8 threads
high throughput
IBM POWER7
8 cores x 4 threads
Intel SCC
48 cores, all can run their own OS
Graphics Processing Units
Architecture
Hundreds/thousands of slim cores
Homogeneous
Accelerator
Memory
Very complex hierarchy
Both shared and per-core
Programming
Off-load model
Many fine-grained symmetrical threads
Hardware scheduler
Cell/B.E.
Architecture
Heterogeneous
8 vector-processors (SPEs) + 1 trimmed PowerPC (PPE)
Memory
Per-core memory, network-on-chip
Programming
User-controlled scheduling
6 levels of parallelism, all under user control
Fine- and coarse-grain parallelism
Take-home message
Variety of platforms
Core types & counts
Memory architecture & sizes
Parallelism layers & types
Scheduling
Open questions:
Why so many?
How many platforms do we need?
Can any application run on any platform?
Hardware performance metrics
Clock frequency [GHz] = absolute hardware speed
Memories, CPUs, interconnects
Operational speed [GFLOPS]
Operations per cycle
Memory bandwidth [GB/s]
Differs a lot between different memories on chip
Power
Derived metrics
FLOP/Byte, FLOP/Watt
Theoretical peak performance
Peak = chips * cores * vectorWidth * FLOPs/cycle * clockFrequency
Examples from DAS-4:
Intel Core i7 CPU:
2 chips * 4 cores * 4-way vectors * 2 FLOPs/cycle * 2.4 GHz = 154 GFLOPS
NVIDIA GTX 580 GPU:
1 chip * 16 SMs * 32 cores * 2 FLOPs/cycle * 1.544 GHz = 1581 GFLOPS
ATI HD 6970:
1 chip * 24 SIMD engines * 16 cores * 4-way vectors * 2 FLOPs/cycle * 0.880 GHz = 2703 GFLOPS
DRAM memory bandwidth
Throughput = memory bus frequency * bits per cycle * bus width
Memory clock != CPU clock!
Result is in bits/s; divide by 8 for GB/s
Examples:
Intel Core i7 DDR3: 1.333 * 2 * 64 = 21 GB/s
NVIDIA GTX 580 GDDR5: 1.002 * 4 * 384 = 192 GB/s
ATI HD 6970 GDDR5: 1.375 * 4 * 256 = 176 GB/s
Memory bandwidths
On-chip memory can be orders of magnitude faster
Registers, shared memory, caches, …
Other memories: depends on the interconnect
Intel’s technology: QPI (Quick Path Interconnect)
25.6 GB/s
AMD’s technology: HT3 (Hyper Transport 3)
19.2 GB/s
Accelerators: PCI-e 2.0
8 GB/s
Power
Chip manufacturers specify Thermal Design Power (TDP)
We can measure dissipated power
Whole system
Typically (much) lower than TDP
Power efficiency
FLOPS / Watt
Examples (with theoretical peak and TDP)
Intel Core i7: 154 / 160 = 1.0 GFLOPS/W
NVIDIA GTX 580: 1581 / 244 = 6.5 GFLOPS/W
ATI HD 6970: 2703 / 250 = 10.8 GFLOPS/W
Summary
Platform          Cores  Threads/ALUs   GFLOPS  Bandwidth (GB/s)  FLOPs/Byte
Cell/B.E.             8             8   204.80              25.6         8.0
Intel Nehalem EE      4             8    57.60              25.5         2.3
Intel Nehalem EX      8            16   170.00              63           2.7
Sun Niagara           8            32     9.33              20           0.5
Sun Niagara 2         8            64    11.20              76           0.1
AMD Barcelona         4             8    37.00              21.4         1.7
AMD Istanbul          6             6    62.40              25.6         2.4
AMD Magny-Cours      12            12   124.80              25.6         4.9
IBM Power 7           8            32   264.96             68.22         3.9
NVIDIA GTX 580       16           512  1581                192           8.2
ATI HD 6970         384          1536  2703                176          15.4
Absolute hardware performance
Only achieved in the optimal conditions:
Processing units 100% used
All parallelism 100% exploited
All data transfers at maximum bandwidth
No application is like this
Even difficult to write micro-benchmarks
Performance analysis
Operational Intensity and the Roofline model
Software performance metrics (3 P’s)
Performance
Execution time
Speed-up vs. best available sequential application
Achieved GFLOPS
Computational efficiency
Achieved GB/s
Memory efficiency
Productivity and Portability
Production costs
Maintenance costs
Arithmetic intensity
The number of arithmetic (floating-point) operations per byte of memory that is accessed
Is the program compute-intensive or data-intensive on a particular architecture?
RGB to gray
for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        Pixel pixel = RGB[y][x];
        gray[y][x] = 0.30 * pixel.R
                   + 0.59 * pixel.G
                   + 0.11 * pixel.B;
    }
}
2 additions, 3 multiplies = 5 operations
3 reads, 1 write = 4 memory accesses
AI = 5/4 = 1.25
Compute or memory intensive?
[Bar chart: FLOPs/Byte balance (range 0–10) of Cell/B.E., Intel Nehalem EE/EX, Sun Niagara 1/2, AMD Barcelona, Istanbul, Magny-Cours, IBM Power 7, ATI Radeon 4890, ATI HD5870, NVIDIA G80, GT200, GTX580, compared against the RGB-to-gray arithmetic intensity]
Applications' arithmetic intensity
O(1): SpMV, BLAS 1 & 2, stencils (PDEs), lattice methods
O(log N): FFTs
O(N): dense linear algebra (BLAS 3), particle methods
Operational intensity
The number of operations per byte of DRAM traffic
Differences from Arithmetic Intensity:
Operations, not just arithmetic
Caches: counted “after they have been filtered by the cache hierarchy”
Not between processor and cache, but between cache and DRAM memory
Attainable performance
Attainable GFLOPS = min(peak floating-point performance, peak memory bandwidth * Operational Intensity)
The Roofline model
AMD Opteron X2: 17.6 GFLOPS, 15 GB/s, ridge point at 1.17 FLOPs/Byte
Roofline: comparing architectures
AMD Opteron X2: 17.6 GFLOPS, 15 GB/s, ridge point at 1.17 FLOPs/Byte
AMD Opteron X4: 73.6 GFLOPS, 15 GB/s, ridge point at 4.9 FLOPs/Byte
Roofline: computational ceilings
Roofline: bandwidth ceilings
Roofline: optimization regions
Use the Roofline model
Determine what to do first to gain performance
Increase memory streaming rate
Apply in-core optimizations
Increase arithmetic intensity
Reader
Samuel Williams, Andrew Waterman, David Patterson: “Roofline: An Insightful Visual Performance Model for Multicore Architectures”, Communications of the ACM, 2009.