TRANSCRIPT
Can you really learn to use accelerators in one morning?
John Urbanic, Parallel Computing Specialist
Pittsburgh Supercomputing Center
Copyright 2013
Can you really learn to use accelerators in one morning?
Yes! With OpenACC.
I am going to assume you are already an accelerator believer.
#  | Site | Manufacturer | Computer | Country | Cores | Rmax [Pflops] | Power [MW]
1  | Oak Ridge National Laboratory | Cray | Titan (Cray XK7, Opteron 16C 2.2GHz, Gemini, NVIDIA K20x) | USA | 560,640 | 17.6 | 8.21
2  | Lawrence Livermore National Laboratory | IBM | Sequoia (BlueGene/Q, Power BQC 16C 1.6GHz, Custom) | USA | 1,572,864 | 16.3 | 7.89
3  | RIKEN Advanced Institute for Computational Science | Fujitsu | K Computer (SPARC64 VIIIfx 2.0GHz, Tofu Interconnect) | Japan | 795,024 | 10.5 | 12.66
4  | Argonne National Laboratory | IBM | Mira (BlueGene/Q, Power BQC 16C 1.6GHz, Custom) | USA | 786,432 | 8.16 | 3.95
5  | Forschungszentrum Juelich (FZJ) | IBM | JuQUEEN (BlueGene/Q, Power BQC 16C 1.6GHz, Custom) | Germany | 393,216 | 4.14 | 1.97
6  | Leibniz Rechenzentrum | IBM | SuperMUC (iDataPlex DX360M4, Xeon E5 8C 2.7GHz, Infiniband FDR) | Germany | 147,456 | 2.90 | 3.52
7  | Texas Advanced Computing Center/UT | Dell | Stampede (PowerEdge C8220, Xeon E5 8C 2.7GHz, Intel Xeon Phi) | USA | 204,900 | 2.66 | (not listed)
8  | National SuperComputer Center in Tianjin | NUDT | Tianhe-1A (NUDT TH MPP, Xeon 6C, NVidia, FT-1000 8C) | China | 186,368 | 2.57 | 4.04
9  | CINECA | IBM | Fermi (BlueGene/Q, Power BQC 16C 1.6GHz, Custom) | Italy | 163,840 | 1.73 | 0.82
10 | IBM | IBM | DARPA Trial Subset (Power 775, Power7 8C 3.84GHz, Custom) | USA | 63,360 | 1.52 | 3.57
[Chart: Top500 performance growth and projection, 1994 through 2020, on a logarithmic scale.]
But what exactly is an "Accelerator"?

[Diagram: a CPU with its own CPU memory, connected over the PCI bus to a GPU with its own GPU memory.]
GPU Architecture: Two Main Components

Global memory
- Analogous to RAM in a CPU server
- Accessible by both GPU and CPU
- Currently up to 24 GB
- Bandwidth currently up to almost 288 GB/s for Tesla products
- ECC on/off option for Quadro and Tesla products

Streaming Multiprocessors (SMs)
- Perform the actual computations
- Each SM has its own control units, registers, execution pipelines, and caches
[Die diagram: DRAM interfaces, GigaThread engine, host interface, and L2 cache.]
GPU Architecture – Kepler: Streaming Multiprocessor (SMX)
- 192 SP CUDA cores per SMX: 192 fp32 ops/clock and 192 int32 ops/clock
- 64 DP CUDA cores per SMX: 64 fp64 ops/clock
- 4 warp schedulers: up to 2048 threads concurrently
- 32 special-function units
- 64 KB shared memory + L1 cache
- 48 KB read-only data cache
- 64K 32-bit registers
GPU Architecture – Kepler: CUDA Core
- Floating-point and integer unit
  - IEEE 754-2008 floating-point standard
  - Fused multiply-add (FMA) instruction for both single and double precision
- Logic unit
- Move, compare unit
- Branch unit

[Diagram: a GK110 Kepler CUDA core, showing the dispatch port, operand collector, FP unit, INT unit, and result queue.]
From a document you should read if you are interested in this:
http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
MIC Architecture
• Many cores on the die
• L1 and L2 cache
• Bidirectional ring network for L2, memory, and the PCIe connection
Courtesy Dan Stanzione, TACC
Rapid Evolution

                                     Fermi GF100  Fermi GF104  Kepler GK104  Kepler GK110
Compute Capability                   2.0          2.1          3.0           3.5
Threads / Warp                       32           32           32            32
Max Warps / Multiprocessor           48           48           64            64
Max Threads / Multiprocessor         1536         1536         2048          2048
Max Thread Blocks / Multiprocessor   8            8            16            16
32-bit Registers / Multiprocessor    32768        32768        65536         65536
Max Registers / Thread               63           63           63            255
Max Threads / Thread Block           1024         1024         1024          1024
Shared Memory Size Configurations    16k/48k      16k/48k      16k/32k/48k   16k/32k/48k
Max X Grid Dimension                 2^16         2^16         2^32          2^32
Hyper-Q                              No           No           No            Yes
Dynamic Parallelism                  No           No           No            Yes

From http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
• Shows healthy commitment by vendors to staying on the leading edge.
• Maybe the compiler knows more about this than you?
• We will see how we can gain the advantages without having to stay on top of this.
How Do We Accelerate Applications?

Three routes into an application:
• Libraries: "drop-in" acceleration
• OpenACC directives: incrementally accelerate applications
• Programming languages (CUDA): maximum flexibility
What is OpenACC?
It is a directive-based standard that allows developers to take advantage of accelerators such as GPUs from NVIDIA and AMD, Intel's Xeon Phi, FPGAs, and even DSP chips.
Directives

Your original Fortran or C code, with simple compiler hints from the coder. The compiler generates the parallel code: the serial parts run on the CPU, and the kernels region runs on the GPU.

Program myscience
   ... serial code ...
   !$acc kernels          <-- OpenACC compiler hint
   do k = 1,n1
      do i = 1,n2
         ... parallel code ...
      enddo
   enddo
   !$acc end kernels
   ...
End Program myscience
Familiar to OpenMP Programmers

OpenMP (runs on the CPU):

main() {
   double pi = 0.0; long i;
   #pragma omp parallel for reduction(+:pi)
   for (i=0; i<N; i++)
   {
      double t = (double)((i+0.05)/N);
      pi += 4.0/(1.0+t*t);
   }
   printf("pi = %f\n", pi/N);
}

OpenACC (runs on CPU and GPU):

main() {
   double pi = 0.0; long i;
   #pragma acc kernels
   for (i=0; i<N; i++)
   {
      double t = (double)((i+0.05)/N);
      pi += 4.0/(1.0+t*t);
   }
   printf("pi = %f\n", pi/N);
}
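If you want to try the OpenACC version as it stands, here is a self-contained, compilable sketch (not from the slide): N is given an arbitrary value, and the reduction on pi is spelled out with a parallel loop directive, whereas the kernels form above leaves that for the compiler to detect.

#include <stdio.h>

#define N 1000000   /* N is not defined on the slide; an illustrative value */

int main()
{
   double pi = 0.0; long i;

   /* explicit reduction clause; with "kernels" the compiler usually detects it on its own */
   #pragma acc parallel loop reduction(+:pi)
   for (i=0; i<N; i++)
   {
      double t = (double)((i+0.05)/N);
      pi += 4.0/(1.0+t*t);
   }

   printf("pi = %f\n", pi/N);
   return 0;
}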
A Simple Example: SAXPY

SAXPY in Fortran:

subroutine saxpy(n, a, x, y)
   real :: x(:), y(:), a
   integer :: n, i
   !$acc kernels
   do i=1,n
      y(i) = a*x(i)+y(i)
   enddo
   !$acc end kernels
end subroutine saxpy

...
! From main program
! call SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...

SAXPY in C:

void saxpy(int n,
           float a,
           float *x,
           float *restrict y)
{
   #pragma acc kernels
   for (int i = 0; i < n; ++i)
      y[i] = a*x[i] + y[i];
}

...
// Somewhere in main
// call SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...
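To see the single-source point concretely, a small driver like the following sketch (not on the slide; the initialization values are arbitrary) can be built for the GPU with an OpenACC compiler, for example pgcc -acc -Minfo=accel saxpy.c, while the same file compiles as an ordinary serial program when the -acc flag is left off.

#include <stdio.h>
#include <stdlib.h>

/* the saxpy() from the slide, repeated so this sketch stands alone */
void saxpy(int n, float a, float *x, float *restrict y)
{
   #pragma acc kernels
   for (int i = 0; i < n; ++i)
      y[i] = a*x[i] + y[i];
}

int main()
{
   int n = 1<<20;                        /* 1M elements, as on the slide */
   float *x = malloc(n * sizeof(float));
   float *y = malloc(n * sizeof(float));

   for (int i = 0; i < n; ++i) {         /* arbitrary test values */
      x[i] = 1.0f;
      y[i] = 2.0f;
   }

   saxpy(n, 2.0f, x, y);                 /* y[i] becomes 2*1 + 2 = 4 */

   printf("y[0] = %f (expect 4.0)\n", y[0]);
   free(x); free(y);
   return 0;
}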
Key Advantages Of This Approach

- High-level. No involvement of OpenCL, CUDA, etc.
- Single source. No forking off a separate GPU code. Compile the same program for accelerators or serial; non-GPU programmers can play along.
- Efficient. Experience shows very favorable comparison to low-level implementations of the same algorithms.
- Performance portable. Supports GPU accelerators and co-processors from multiple vendors, current and future versions.
- Incremental. Developers can port and tune parts of their application as resources and profiling dictate. No wholesale rewrite is required, and porting can be quick.
A Few Cases

- Reading DNA nucleotide sequences (Shanghai JiaoTong University): 1 week, 40x faster
- Designing circuits for quantum computing (UIST, Macedonia): 3 directives, 4.1x faster
- Extracting image features in real time (Aselsan): a few hours, 70x faster
- HydroC, galaxy formation (PRACE benchmark code, CAPS): 4 directives, 6.4x faster
- Real-time derivative valuation (Opel Blue, Ltd): 4 directives, 16x faster
- Matrix-matrix multiply (independent research scientist): 1 week, 3x faster
A Champion Case: S3D Fuel Combustion
- Goal: design alternative fuels with up to 50% higher efficiency.
- Titan: 10 days; Jaguar: 42 days.
- Modified less than 1% of the lines of code.
- 4x faster.
- Was recently the fastest scientific simulation ever done: 15 PF!
Comparison to Alternatives: Lattice-Boltzmann Example
True Standard
- Full OpenACC 1.0 and 2.0 specifications available online: http://www.openacc-standard.org
- Implementations available now from PGI, Cray, and CAPS
- OpenACC 2.0 spec just released
This looks easy! Too easy…

If it is this simple, why don't we just throw a kernels directive in front of every loop? Better yet, why doesn't the compiler do this for me?

The answer is that there are two general issues that prevent the compiler from simply parallelizing every loop automatically:

- Data dependencies in loops (a classic performance issue)
- Data movement

The compiler needs your higher-level perspective (in the form of directive hints) to get correct results and reasonable performance.
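Here is a minimal sketch of both issues in C; the function, arrays, and sizes are illustrative, not taken from the talk.

void example(int n, int iters, float *restrict a, float *restrict b)
{
   /* 1. Data dependency: a[i] needs a[i-1] from the previous iteration,
         so the compiler cannot safely run these iterations in parallel. */
   for (int i = 1; i < n; i++)
      a[i] = a[i-1] + b[i];

   /* 2. Data movement: the data region keeps a and b on the GPU across the
         whole outer loop instead of copying them over the PCI bus every
         time the kernels region is entered. */
   #pragma acc data copy(a[0:n]) copyin(b[0:n])
   {
      for (int iter = 0; iter < iters; iter++) {
         #pragma acc kernels
         for (int i = 0; i < n; i++)
            a[i] += b[i];               /* independent iterations: safe to parallelize */
      }
   }
}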
Our Foundation Exercise: Jacobi Iteration

- I've been using this for MPI and OpenMP for a long time (since 1992 for MPI/PVM).
- It is a great simulation problem, not rigged for OpenACC.
- In this most basic form, it solves the Laplace equation: ∇²f(x,y) = 0
- In our workshop example it is the steady-state heat equation.
- Students start with a realistic, normal serial code and parallelize it themselves.
- Speedup is typically 8 to 12 times serial speed.

[Figure: a metal plate with a heating element, shown at the initial conditions and at the final steady state.]
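For orientation, here is a minimal sketch of what the accelerated kernel can look like once directives are added. The names (Temperature, Temperature_last, ROWS, COLUMNS, TEMP_TOLERANCE, worst_dt) are illustrative guesses at the workshop code, not the code itself, and the boundary and initial conditions are assumed to be set elsewhere.

#include <math.h>

#define ROWS            1000
#define COLUMNS         1000
#define TEMP_TOLERANCE  0.01

double Temperature[ROWS+2][COLUMNS+2];       /* current iteration */
double Temperature_last[ROWS+2][COLUMNS+2];  /* previous iteration, holds the boundary conditions */

void jacobi()
{
   double worst_dt = 100.0;                  /* largest change seen in one sweep */

   while (worst_dt > TEMP_TOLERANCE) {

      /* each interior point becomes the average of its four neighbors */
      #pragma acc kernels
      for (int i = 1; i <= ROWS; i++)
         for (int j = 1; j <= COLUMNS; j++)
            Temperature[i][j] = 0.25 * (Temperature_last[i+1][j] + Temperature_last[i-1][j] +
                                        Temperature_last[i][j+1] + Temperature_last[i][j-1]);

      /* find the largest change (a max reduction the compiler can detect)
         and copy the new values into the "last" array */
      worst_dt = 0.0;
      #pragma acc kernels
      for (int i = 1; i <= ROWS; i++)
         for (int j = 1; j <= COLUMNS; j++) {
            worst_dt = fmax(fabs(Temperature[i][j] - Temperature_last[i][j]), worst_dt);
            Temperature_last[i][j] = Temperature[i][j];
         }
   }
}

As written, the arrays can end up being copied between host and device around each kernels region; wrapping the while loop in an acc data region, per the data movement issue noted above, is the usual follow-on step.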
Since I've mentioned MPI…
Can you learn it in half a day? Definitely not.
We have a two-day workshop, and expect a long latency before we hear success stories back from our students.
I also mentioned OpenMP
o We do have a 1-day OpenMP workshop.
o It does not cover the accelerator extensions (OpenMP 4.0).
o That will become its own extra day.

MPI, OpenMP and OpenACC are the fundamental technologies, for nodes, cores and accelerators, respectively.
They drive the big codes on the big platforms.
All are well worth learning.
OpenACC is the easiest, and it is the only one with alternatives: CUDA, OpenCL, OpenMP 4.0…
Comparing OpenACC with OpenMP 4.0 on NVIDIA & Phi

OpenMP 4.0 for Intel Xeon Phi:

#pragma omp target device(0) map(tofrom:B)
#pragma omp parallel for
for (i=0; i<N; i++)
   B[i] += sin(B[i]);

OpenMP 4.0 for NVIDIA GPU:

#pragma omp target device(0) map(tofrom:B)
#pragma omp teams num_teams(num_blocks) num_threads(bsize)
#pragma omp distribute
for (i=0; i<N; i += num_blocks)
   #pragma omp parallel for
   for (b = i; b < i+num_blocks; b++)
      B[b] += sin(B[b]);

OpenACC for NVIDIA GPU:

#pragma acc kernels
for (i=0; i<N; ++i)
   B[i] += sin(B[i]);

(First two examples courtesy of Christian Terboven.)
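If you want the OpenACC version to state its data movement as explicitly as the OpenMP map(tofrom:B) clause does, a data clause can be added; a small sketch, not on the original slide:

/* explicit data clause, mirroring map(tofrom:B) above */
#pragma acc kernels copy(B[0:N])
for (i=0; i<N; ++i)
   B[i] += sin(B[i]);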
How is it going?

We've been doing this a few years now, and really got the confidence (in our AV infrastructure) to ramp up the remote workshops this past year. Over that period we have worked 4 OpenACC workshops into our rotating monthly lineup and done others at special events.

- October 2012: 147 students on Keeneland (National Institute for Computational Sciences) [two-day]
- January 2013: 72 students on Keeneland [two-day]
- June 2013: 35 students on Keeneland (at the International Summer School on HPC) [morning]
- July 2013: 35 students on Keeneland (at XSEDE '13) [morning]
- August 2013: 113 students on Keeneland [5-hour]
- November 2013: 149 students on Blue Waters (National Center for Supercomputing Applications) [5-hour]
How is it really going?
o We use comprehensive anonymous student evaluations
   - Both comments and quantitative data
o We react to that feedback
   - Hence our 2-day workshop has become 5 hours
   - An advanced 2-day workshop (with OpenMP 4.0) is under development
o It has been very positive thus far
   - Better than 5/6 on review scales
   - The repeat-request rate from sites is far beyond our ability to accommodate
   - Student demand is more than we can accommodate (maximum seats available)
o Our latest workshop was the object of an in-depth XSEDE training evaluation
   - Conclusions pending
Upcoming opportunities (not counting this booth!)
- XSEDE calendar: https://portal.xsede.org/course-calendar
- OpenACC.org calendar: http://www.openacc.org/Education_Classes_and_Workshops
- or contact me