PARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort [email protected]


Page 1:

PARALLEL PROGRAMMING

MANY-CORE COMPUTING:

INTRO (1/5)

Rob van Nieuwpoort

[email protected]

Page 2:

Schedule

1. Introduction, performance metrics & analysis

2. Many-core hardware

3. CUDA class 1: basics

4. CUDA class 2: advanced

5. Case study: LOFAR telescope with many-cores

Page 3:

What are many-cores?

Page 4:

What are many-cores?

From Wikipedia: “A many-core processor is a multi-core processor in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient — largely because of issues with congestion in supplying instructions and data to the many processors.”

Page 5:

What are many-cores?

How many is many?

Several tens of cores

How are they different from multi-core CPUs?

Non-uniform memory access (NUMA)

Private memories

Network-on-chip

Examples

Multi-core CPUs (48-core AMD Magny-Cours)

Graphics Processing Units (GPUs)

Cell processor (PlayStation 3)

Server processors (Sun Niagara)

Page 6:

Many-core questions

The search for performance:

Build hardware
  What architectures?

Evaluate hardware
  What metrics?
  How do we measure?

Use it
  What workloads?
  Expected performance?

Program it
  How to program?
  How to optimize?

Benchmark
  How to analyze performance?

Page 7:

Today’s Topics

Introduction

Why many-core programming?

History

Hardware introduction

Performance model: Arithmetic Intensity and Roofline

Page 8:

Why do we need many-cores?

Page 9:

Why do we need many-cores?

[Chart: peak GFLOPS over time, with NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) pulling away from Intel CPUs (3 GHz dual-core P4, 3 GHz Core2 Duo, 3 GHz quad-core Xeon, Westmere)]

Page 10:

Why do we need many-cores?

Page 11:

Why do we need many-cores?

Page 12:

China's Tianhe-1A

#2 in top500 list

4.701 PFLOPS peak

2.566 PFLOPS max (Linpack)

14,336 Xeon X5670 processors

7168 NVIDIA Tesla M2050 GPUs x 448 cores = 3,211,264 GPU cores

Page 13:

Power efficiency

Page 14:

Graphics in 1980

Page 15:

Graphics in 2000

Page 16:

Graphics now: GPU movie

Page 17:

Why do we need many-cores?

Performance

Large scale parallelism

Power Efficiency

Use transistors more efficiently

Price (GPUs)

Huge market, bigger than Hollywood

Mass production, economy of scale

“spotty teenagers” pay for our HPC needs!


Page 18:

GPGPU history

GPU transistor counts, 1995-2010:
  RIVA 128: 3M transistors
  GeForce 256: 23M transistors
  GeForce 3: 60M transistors
  GeForce FX: 125M transistors
  GeForce 8800: 681M transistors
  “Fermi”: 3B transistors

Page 19:

GPGPU History

Use Graphics primitives for HPC

Ikonas [England 1978]

Pixel Machine [Potmesil & Hoffert 1989]

Pixel-Planes 5 [Rhoades, et al. 1992]

Programmable shaders, around 1998

DirectX / OpenGL

Map application onto graphics domain!

GPGPU

Brook (2004), CUDA (2007), OpenCL (Dec 2008), ...

Page 20:

CUDA C/C++ Continuous Innovation

CUDA Toolkit 1.0 (July 07)
  C compiler and C extensions
  Single precision
  BLAS, FFT
  SDK with 40 examples

CUDA Toolkit 1.1 (Nov 07)
  Win XP 64
  Atomics support
  Multi-GPU support

CUDA Toolkit 2.0 (Aug 08)
  Double precision
  Compiler optimizations
  Vista 32/64, Mac OS X
  3D textures, HW interpolation

CUDA Visual Profiler 2.2, cuda-gdb HW debugger

CUDA Toolkit 2.3 (July 09)
  DP FFT
  16-32 conversion intrinsics
  Performance enhancements

Parallel Nsight Beta, CUDA Toolkit 3.0 (Mar 10)
  C++ inheritance
  Fermi arch support
  Tools updates
  Driver / RT interop

Page 21:

CUDA Tools

Parallel Nsight (Visual Studio)

Visual Profiler (for Linux)

cuda-gdb (for Linux)

Page 22:

Many-core hardware introduction

Page 23:

The search for performance

Page 24:

The search for performance

We have M(o)ore transistors …

Bigger cores?

We are hitting the walls!

power, memory, instruction-level parallelism (ILP)

How do we use them?

Large-scale parallelism

Many-cores!

Page 25:

Choices …

Core type(s):
  Fat or slim?
  Vectorized (SIMD)?
  Homogeneous or heterogeneous?

Number of cores:
  Few or many?

Memory:
  Shared-memory or distributed-memory?

Parallelism:
  Instruction-level parallelism, threads, vectors, …

Page 26:

A taxonomy

Based on “field-of-origin”:
  General-purpose: Intel, AMD
  Graphics Processing Units (GPUs): NVIDIA, ATI
  Gaming/Entertainment: Sony/Toshiba/IBM
  Embedded systems: Philips/NXP, ARM
  Servers: Oracle, IBM, Intel
  High Performance Computing: Intel, IBM, …

Page 27:

General-Purpose Processors

Architecture:
  Few fat cores
  Vectorization (SSE, AVX)
  Homogeneous
  Stand-alone

Memory:
  Shared, multi-layered
  Per-core cache and shared cache

Programming:
  Processes (OS scheduler)
  Message passing
  Multi-threading
  Coarse-grained parallelism

Page 28:

Server-side

General-purpose-like, with more hardware threads
  Lower performance per thread

Examples:
  Sun Niagara II: 8 cores x 8 threads, high throughput
  IBM POWER7: 8 cores x 4 threads
  Intel SCC: 48 cores, each able to run its own OS

Page 29:

Graphics Processing Units

Architecture:
  Hundreds/thousands of slim cores
  Homogeneous
  Accelerator

Memory:
  Very complex hierarchy
  Both shared and per-core

Programming:
  Off-load model
  Many fine-grained symmetrical threads
  Hardware scheduler

Page 30:

Cell/B.E.

Architecture:
  Heterogeneous: 8 vector processors (SPEs) + 1 trimmed PowerPC (PPE)

Memory:
  Per-core memory, network-on-chip

Programming:
  User-controlled scheduling
  6 levels of parallelism, all under user control
  Fine- and coarse-grained parallelism

Page 31:

Take-home message

Variety of platforms:
  Core types & counts
  Memory architecture & sizes
  Parallelism layers & types
  Scheduling

Open questions:
  Why so many?
  How many platforms do we need?
  Can any application run on any platform?

Page 32:

Hardware performance metrics

Page 33:

Hardware performance metrics

Clock frequency [GHz]: absolute hardware speed
  Memories, CPUs, interconnects

Operational speed [GFLOPs]
  Operations per cycle

Memory bandwidth [GB/s]
  Differs a lot between the different memories on a chip

Power

Derived metrics:
  FLOP/Byte, FLOP/Watt

Page 34:

Theoretical peak performance

Peak = chips * cores * vectorWidth * FLOPs/cycle * clockFrequency

Examples from DAS-4:

Intel Core i7 CPU:
  2 chips * 4 cores * 4-way vectors * 2 FLOPs/cycle * 2.4 GHz = 154 GFLOPs

NVIDIA GTX 580 GPU:
  1 chip * 16 SMs * 32 cores * 2 FLOPs/cycle * 1.544 GHz = 1581 GFLOPs

ATI HD 6970:
  1 chip * 24 SIMD engines * 16 cores * 4-way vectors * 2 FLOPs/cycle * 0.880 GHz = 2703 GFLOPs
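The peak formula is easy to check in a few lines of Python (a sketch; the device parameters are the ones from the slide):

```python
def peak_gflops(chips, cores, vector_width, flops_per_cycle, clock_ghz):
    """Theoretical peak = chips * cores * vectorWidth * FLOPs/cycle * clock (GHz)."""
    return chips * cores * vector_width * flops_per_cycle * clock_ghz

# DAS-4 examples from the slide:
print(peak_gflops(2, 4, 4, 2, 2.4))          # Intel Core i7: ~154 GFLOPs
print(peak_gflops(1, 16 * 32, 1, 2, 1.544))  # NVIDIA GTX 580: ~1581 GFLOPs
print(peak_gflops(1, 24 * 16, 4, 2, 0.880))  # ATI HD 6970: ~2703 GFLOPs
```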

Page 35:

DRAM memory bandwidth

Throughput = memory bus frequency * bits per cycle * bus width

Note: the memory clock != the CPU clock!

The result is in bits; divide by 8 to get GB/s

Examples:

Intel Core i7 DDR3: 1.333 * 2 * 64 = 21 GB/s

NVIDIA GTX 580 GDDR5: 1.002 * 4 * 384 = 192 GB/s

ATI HD 6970 GDDR5: 1.375 * 4 * 256 = 176 GB/s
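The same check works for the bandwidth formula (a sketch; parameters are the slide's bus clocks in GHz, transfers per cycle, and bus widths in bits):

```python
def mem_bandwidth_gbs(bus_clock_ghz, bits_per_cycle, bus_width_bits):
    """Throughput = bus frequency * bits per cycle * bus width, then bits -> bytes."""
    return bus_clock_ghz * bits_per_cycle * bus_width_bits / 8

print(mem_bandwidth_gbs(1.333, 2, 64))   # Core i7 DDR3: ~21 GB/s
print(mem_bandwidth_gbs(1.002, 4, 384))  # GTX 580 GDDR5: ~192 GB/s
print(mem_bandwidth_gbs(1.375, 4, 256))  # HD 6970 GDDR5: 176 GB/s
```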

Page 36:

Memory bandwidths

On-chip memory can be orders of magnitude faster
  Registers, shared memory, caches, …

Other memories: depends on the interconnect
  Intel’s technology: QPI (QuickPath Interconnect), 25.6 GB/s
  AMD’s technology: HT3 (HyperTransport 3), 19.2 GB/s
  Accelerators: PCIe 2.0, 8 GB/s

Page 37:

Power

Chip manufacturers specify Thermal Design Power (TDP)

We can measure dissipated power
  Whole system
  Typically (much) lower than TDP

Power efficiency: FLOPS / Watt

Examples (using theoretical peak and TDP):
  Intel Core i7: 154 / 160 = 1.0 GFLOPs/W
  NVIDIA GTX 580: 1581 / 244 = 6.5 GFLOPs/W
  ATI HD 6970: 2703 / 250 = 10.8 GFLOPs/W

Page 38:

Summary

Platform         | Cores | Threads/ALUs | GFLOPS  | Bandwidth (GB/s) | FLOPs/Byte
Cell/B.E.        |     8 |            8 |  204.80 |            25.6  |   8.0
Intel Nehalem EE |     4 |            8 |   57.60 |            25.5  |   2.3
Intel Nehalem EX |     8 |           16 |  170.00 |            63    |   2.7
Sun Niagara      |     8 |           32 |    9.33 |            20    |   0.5
Sun Niagara 2    |     8 |           64 |   11.20 |            76    |   0.1
AMD Barcelona    |     4 |            8 |   37.00 |            21.4  |   1.7
AMD Istanbul     |     6 |            6 |   62.40 |            25.6  |   2.4
AMD Magny-Cours  |    12 |           12 |  124.80 |            25.6  |   4.9
IBM Power 7      |     8 |           32 |  264.96 |            68.22 |   3.9
NVIDIA GTX 580   |    16 |          512 | 1581    |           192    |   8.2
ATI HD 6970      |   384 |         1536 | 2703    |           176    |  15.4

Page 39:

Absolute hardware performance

Peak numbers are only achieved under optimal conditions:
  Processing units 100% used
  All parallelism 100% exploited
  All data transfers at maximum bandwidth

No application is like this
  It is even difficult to write micro-benchmarks that get close

Page 40:

Performance analysis: Operational Intensity and the Roofline model

Page 41:

Software performance metrics (3 P’s)

Performance:
  Execution time
  Speed-up vs. the best available sequential implementation
  Achieved GFLOPs (computational efficiency)
  Achieved GB/s (memory efficiency)

Productivity and Portability:
  Production costs
  Maintenance costs

Page 42:

Arithmetic intensity

The number of arithmetic (floating-point) operations per byte of memory that is accessed.

Is the program compute-intensive or data-intensive on a particular architecture?

Page 43:

RGB to gray

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            Pixel pixel = RGB[y][x];
            gray[y][x] = 0.30 * pixel.R
                       + 0.59 * pixel.G
                       + 0.11 * pixel.B;
        }
    }

Page 44:

RGB to gray

(same kernel as on the previous slide)

2 additions, 3 multiplies = 5 operations
3 reads, 1 write = 4 memory accesses
AI = 5/4 = 1.25
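The per-pixel bookkeeping above can be written out as a tiny helper (a sketch; it counts accesses the way the slide does, one unit per channel read or write):

```python
def arithmetic_intensity(flops, mem_accesses):
    """Arithmetic intensity: floating-point operations per memory access."""
    return flops / mem_accesses

# RGB-to-gray, per pixel: 3 multiplies + 2 adds = 5 FLOPs,
# 3 channel reads + 1 gray write = 4 memory accesses.
ai = arithmetic_intensity(3 + 2, 3 + 1)
print(ai)  # 1.25
```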

Page 45:

Compute or memory intensive?

[Bar chart: FLOP/Byte balance (0-10) of Cell/B.E., Intel Nehalem EE/EX, Sun Niagara 1/2, AMD Barcelona/Istanbul/Magny-Cours, IBM Power 7, ATI Radeon 4890, ATI HD5870, NVIDIA G80/GT200/GTX580, compared against the RGB-to-gray kernel]

Page 46:

Applications AI

Arithmetic intensity of common kernels, from low to high:
  O(1): SpMV, BLAS 1 & 2, stencils (PDEs), lattice methods
  O(log N): FFTs
  O(N): dense linear algebra (BLAS3), particle methods

Page 47:

Operational intensity

The number of operations per byte of DRAM traffic.

Differences with arithmetic intensity:
  Operations, not just arithmetic ones
  Caches: bytes are counted “after they have been filtered by the cache hierarchy”
    Not between processor and cache, but between cache and DRAM memory

Page 48:

Attainable performance

Attainable GFLOPs/sec = min(Peak Floating-Point Performance,
                            Peak Memory Bandwidth * Operational Intensity)

Page 49:

The Roofline model

AMD Opteron X2: 17.6 GFLOPs, 15 GB/s, ops/byte = 1.17

Page 50:

Roofline: comparing architectures

AMD Opteron X2: 17.6 GFLOPs, 15 GB/s, ops/byte = 1.17
AMD Opteron X4: 73.6 GFLOPs, 15 GB/s, ops/byte = 4.9

Page 51:

Roofline: computational ceilings

Page 52:

Roofline: bandwidth ceilings

Page 53:

Roofline: optimization regions

Page 54:

Use the Roofline model

Determine what to do first to gain performance:
  Increase the memory streaming rate
  Apply in-core optimizations
  Increase arithmetic intensity
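The choice can be sketched as a small helper (hypothetical, not from the slides): a kernel whose operational intensity falls left of the ridge point, where the bandwidth and compute roofs meet, is memory-bound; otherwise it is compute-bound.

```python
def first_optimization(peak_gflops, peak_bw_gbs, operational_intensity):
    """Suggest where to start optimizing, based on the roofline ridge point."""
    ridge = peak_gflops / peak_bw_gbs  # OI where the two roofs intersect
    if operational_intensity < ridge:
        return "memory-bound: raise streaming rate or increase arithmetic intensity"
    return "compute-bound: apply in-core optimizations"

# Opteron X2 from the earlier slide: ridge = 17.6 / 15 ≈ 1.17
print(first_optimization(17.6, 15, 1.25))
```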

Reader:

Samuel Williams, Andrew Waterman, and David Patterson, “Roofline: An Insightful Visual Performance Model for Multicore Architectures”, Communications of the ACM, 2009.