Schedule
1. Introduction, performance metrics & analysis
2. Many-core hardware
3. CUDA class 1: basics
4. CUDA class 2: advanced
5. Case study: LOFAR telescope with many-cores
What are many-cores?
From Wikipedia: “A many-core processor is a multi-core processor in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient — largely because of issues with congestion in supplying instructions and data to the many processors.”
How many is many?
Several tens of cores
How are they different from multi-core CPUs?
Non-uniform memory access (NUMA)
Private memories
Network-on-chip
Examples
Multi-core CPUs (48-core AMD Magny-Cours)
Graphics Processing Units (GPUs)
Cell processor (PlayStation 3)
Server processors (Sun Niagara)
Many-core questions
The search for performance:
Build hardware: what architectures?
Evaluate hardware: what metrics? How do we measure?
Use it: what workloads? Expected performance?
Program it: how to program? How to optimize?
Benchmark: how to analyze performance?
Today’s Topics
Introduction
Why many-core programming?
History
Hardware introduction
Performance model:
Arithmetic Intensity and Roofline
Why do we need many-cores?
[Chart: single-chip peak GFLOPS over time, NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) pulling away from Intel CPUs (3 GHz dual-core P4, 3 GHz Core2 Duo, 3 GHz quad-core Xeon, Westmere)]
China's Tianhe-1A
#2 in the Top500 list
4.701 PFLOPS peak
2.566 PFLOPS maximal (Linpack)
14,336 Intel Xeon X5670 processors
7,168 NVIDIA Tesla M2050 GPUs × 448 cores = 3,211,264 GPU cores
Power efficiency
Graphics in 1980
Graphics in 2000
Graphics now: GPU movie
Why do we need many-cores?
Performance
Large scale parallelism
Power Efficiency
Use transistors more efficiently
Price (GPUs)
Huge market, bigger than Hollywood
Mass production, economy of scale
“spotty teenagers” pay for our HPC needs!
GPU transistor counts, 1995–2010:
RIVA 128: 3M transistors
GeForce 256: 23M transistors
GeForce 3: 60M transistors
GeForce FX: 125M transistors
GeForce 8800: 681M transistors
“Fermi”: 3B transistors
GPGPU History
Use Graphics primitives for HPC
Ikonas [England 1978]
Pixel Machine [Potmesil & Hoffert 1989]
Pixel-Planes 5 [Rhoades, et al. 1992]
Programmable shaders, around 1998
DirectX / OpenGL
Map application onto graphics domain!
GPGPU
Brook (2004), CUDA (2007), OpenCL (Dec 2008), ...
CUDA C/C++: Continuous Innovation (2007–2010)
CUDA Toolkit 1.0 (July 2007): C compiler, C extensions, single precision, BLAS, FFT, SDK with 40 examples
CUDA Toolkit 1.1 (Nov 2007): Win XP 64, atomics support, multi-GPU support
CUDA Toolkit 2.0 (2008): double precision, compiler optimizations, Vista 32/64, Mac OS X, 3D textures, HW interpolation
CUDA Toolkit 2.3 (July 2009): DP FFT, 16–32 conversion intrinsics, performance enhancements
Tools along the way: CUDA Visual Profiler 2.2, cuda-gdb HW debugger
Parallel Nsight Beta and CUDA Toolkit 3.0 (Mar 2010): C++ inheritance, Fermi arch support, tools updates, driver/RT interop
CUDA Tools
Parallel Nsight: Visual Studio
Visual Profiler: for Linux
cuda-gdb: for Linux
Many-core hardware introduction
The search for performance
We have M(o)ore transistors …
Bigger cores?
We are hitting the walls!
power, memory, instruction-level parallelism (ILP)
How do we use them?
Large-scale parallelism
Many-cores !
Choices …
Core type(s):
Fat or slim ?
Vectorized (SIMD) ?
Homogeneous or heterogeneous?
Number of cores:
Few or many ?
Memory
Shared-memory or distributed-memory?
Parallelism
Instruction-level parallelism, threads, vectors, …
A taxonomy
Based on “field-of-origin”:
General-purpose
Intel, AMD
Graphics Processing Units (GPUs)
NVIDIA, ATI
Gaming/Entertainment
Sony/Toshiba/IBM
Embedded systems
Philips/NXP, ARM
Servers
Oracle, IBM, Intel
High Performance Computing
Intel, IBM, …
General Purpose Processors
Architecture
Few fat cores
Vectorization (SSE, AVX)
Homogeneous
Stand-alone
Memory
Shared, multi-layered
Per-core cache and shared cache
Programming
Processes (OS Scheduler)
Message passing
Multi-threading
Coarse-grained parallelism
Server-side
General-purpose-like with more hardware threads
Lower performance per thread
Examples
Sun Niagara II
8 cores x 8 threads
high throughput
IBM POWER7
8 cores x 4 threads
Intel SCC
48 cores, all can run their own OS
Graphics Processing Units
Architecture
Hundreds/thousands of slim cores
Homogeneous
Accelerator
Memory
Very complex hierarchy
Both shared and per-core
Programming
Off-load model
Many fine-grained symmetrical threads
Hardware scheduler
Cell/B.E.
Architecture
Heterogeneous
8 vector-processors (SPEs) + 1 trimmed PowerPC (PPE)
Memory
Per-core memory, network-on-chip
Programming
User-controlled scheduling
6 levels of parallelism, all under user control
Fine- and coarse-grain parallelism
Take-home message
Variety of platforms
Core types & counts
Memory architecture & sizes
Parallelism layers & types
Scheduling
Open questions:
Why so many?
How many platforms do we need?
Can any application run on any platform?
Hardware performance metrics
Clock frequency [GHz] = absolute hardware speed
Memories, CPUs, interconnects
Operational speed [GFLOPS]
Operations per cycle
Memory bandwidth [GB/s]
Differs a lot between different memories on chip
Power
Derived metrics
FLOP/Byte, FLOP/Watt
Theoretical peak performance
Peak = chips * cores * vectorWidth * FLOPs/cycle * clockFrequency
Examples from DAS-4:
Intel Core i7 CPU:
2 chips * 4 cores * 4-way vectors * 2 FLOPs/cycle * 2.4 GHz = 154 GFLOPS
NVIDIA GTX 580 GPU:
1 chip * 16 SMs * 32 cores * 2 FLOPs/cycle * 1.544 GHz = 1581 GFLOPS
ATI HD 6970:
1 chip * 24 SIMD engines * 16 cores * 4-way vectors * 2 FLOPs/cycle * 0.880 GHz = 2703 GFLOPS
DRAM memory bandwidth
Throughput = memory bus frequency * bits per cycle * bus width
Memory clock != CPU clock!
Result is in bits/s; divide by 8 for GB/s
Examples:
Intel Core i7 DDR3: 1.333 * 2 * 64 = 21 GB/s
NVIDIA GTX 580 GDDR5: 1.002 * 4 * 384 = 192 GB/s
ATI HD 6970 GDDR5: 1.375 * 4 * 256 = 176 GB/s
Memory bandwidths
On-chip memory can be orders of magnitude faster
Registers, shared memory, caches, …
Other memories: depends on the interconnect
Intel’s technology: QPI (Quick Path Interconnect)
25.6 GB/s
AMD’s technology: HT3 (Hyper Transport 3)
19.2 GB/s
Accelerators: PCI-e 2.0
8 GB/s
Power
Chip manufacturers specify Thermal Design Power (TDP)
We can measure dissipated power
Whole system
Typically (much) lower than TDP
Power efficiency
FLOPS / Watt
Examples (with theoretical peak and TDP)
Intel Core i7: 154 / 160 = 1.0 GFLOPS/W
NVIDIA GTX 580: 1581 / 244 = 6.5 GFLOPS/W
ATI HD 6970: 2703 / 250 = 10.8 GFLOPS/W
Summary
Platform          Cores  Threads/ALUs   GFLOPS  Bandwidth (GB/s)  FLOPs/Byte
Cell/B.E.             8             8   204.80              25.6         8.0
Intel Nehalem EE      4             8    57.60              25.5         2.3
Intel Nehalem EX      8            16   170.00              63           2.7
Sun Niagara           8            32     9.33              20           0.5
Sun Niagara 2         8            64    11.20              76           0.1
AMD Barcelona         4             8    37.00              21.4         1.7
AMD Istanbul          6             6    62.40              25.6         2.4
AMD Magny-Cours      12            12   124.80              25.6         4.9
IBM Power 7           8            32   264.96             68.22         3.9
NVIDIA GTX 580       16           512  1581                192           8.2
ATI HD 6970         384          1536  2703                176          15.4
Absolute hardware performance
Only achieved in the optimal conditions:
Processing units 100% used
All parallelism 100% exploited
All data transfers at maximum bandwidth
No application is like this
Even difficult to write micro-benchmarks
Performance analysis
Operational Intensity and the Roofline model
Software performance metrics (3 P’s)
Performance
Execution time
Speed-up vs. best available sequential application
Achieved GFLOPS
Computational efficiency
Achieved GB/s
Memory efficiency
Productivity and Portability
Production costs
Maintenance costs
Arithmetic intensity
The number of arithmetic (floating-point) operations per byte of memory that is accessed
Is the program compute-intensive or data-intensive on a particular architecture?
RGB to gray
for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        Pixel pixel = RGB[y][x];
        gray[y][x] = 0.30 * pixel.R
                   + 0.59 * pixel.G
                   + 0.11 * pixel.B;
    }
}
2 additions, 3 multiplies = 5 operations
3 reads, 1 write = 4 memory accesses
AI = 5/4 = 1.25
Compute or memory intensive?
[Bar chart: FLOPs/Byte balance (range 0–10) of Cell/B.E., Intel Nehalem EE/EX, Sun Niagara 1/2, AMD Barcelona, Istanbul, Magny-Cours, IBM Power 7, ATI Radeon 4890, ATI HD5870, NVIDIA G80, GT200, GTX580, compared against the RGB-to-gray arithmetic intensity]
Applications' arithmetic intensity
O(1): SpMV, BLAS 1 & 2, stencils (PDEs), lattice methods
O(log N): FFTs
O(N): dense linear algebra (BLAS 3), particle methods
Operational intensity
The number of operations per byte of DRAM traffic
Differences from Arithmetic Intensity:
Operations, not just arithmetic
Caches: counted “after they have been filtered by the cache hierarchy”
Not between processor and cache, but between cache and DRAM memory
Attainable performance
Attainable GFLOPS = min(peak floating-point performance, peak memory bandwidth * Operational Intensity)
The Roofline model
AMD Opteron X2: 17.6 GFLOPS, 15 GB/s, ridge point at 1.17 FLOPs/Byte
Roofline: comparing architectures
AMD Opteron X2: 17.6 GFLOPS, 15 GB/s, ridge point at 1.17 FLOPs/Byte
AMD Opteron X4: 73.6 GFLOPS, 15 GB/s, ridge point at 4.9 FLOPs/Byte
Roofline: computational ceilings
Roofline: bandwidth ceilings
Roofline: optimization regions
Use the Roofline model
Determine what to do first to gain performance
Increase memory streaming rate
Apply in-core optimizations
Increase arithmetic intensity
Reader
Samuel Williams, Andrew Waterman, David Patterson: “Roofline: An Insightful Visual Performance Model for Multicore Architectures”, Communications of the ACM, 2009.