TRANSCRIPT
Can you really learn to use accelerators in one morning?
John Urbanic, Parallel Computing Specialist
Pittsburgh Supercomputing Center
Copyright 2013
Can you really learn to use accelerators in one morning?
Yes! With OpenACC.
I am going to assume you are already an accelerator believer.
#  | Site | Manufacturer | Computer | Country | Cores | Rmax [Pflops] | Power [MW]
1  | Oak Ridge National Laboratory | Cray | Titan (Cray XK7, Opteron 16C 2.2GHz, Gemini, NVIDIA K20x) | USA | 560,640 | 17.6 | 8.21
2  | Lawrence Livermore National Laboratory | IBM | Sequoia (BlueGene/Q, Power BQC 16C 1.6GHz, Custom) | USA | 1,572,864 | 16.3 | 7.89
3  | RIKEN Advanced Institute for Computational Science | Fujitsu | K Computer (SPARC64 VIIIfx 2.0GHz, Tofu Interconnect) | Japan | 795,024 | 10.5 | 12.66
4  | Argonne National Laboratory | IBM | Mira (BlueGene/Q, Power BQC 16C 1.6GHz, Custom) | USA | 786,432 | 8.16 | 3.95
5  | Forschungszentrum Juelich (FZJ) | IBM | JuQUEEN (BlueGene/Q, Power BQC 16C 1.6GHz, Custom) | Germany | 393,216 | 4.14 | 1.97
6  | Leibniz Rechenzentrum | IBM | SuperMUC (iDataPlex DX360M4, Xeon E5 8C 2.7GHz, Infiniband FDR) | Germany | 147,456 | 2.90 | 3.52
7  | Texas Advanced Computing Center/UT | Dell | Stampede (PowerEdge C8220, Xeon E5 8C 2.7GHz, Intel Xeon Phi) | USA | 204,900 | 2.66 | (not listed)
8  | National SuperComputer Center in Tianjin | NUDT | Tianhe-1A (NUDT TH MPP, Xeon 6C, NVidia, FT-1000 8C) | China | 186,368 | 2.57 | 4.04
9  | CINECA | IBM | Fermi (BlueGene/Q, Power BQC 16C 1.6GHz, Custom) | Italy | 163,840 | 1.73 | 0.82
10 | IBM | IBM | DARPA Trial Subset (Power 775, Power7 8C 3.84GHz, Custom) | USA | 63,360 | 1.52 | 3.57
[Chart: Top500 performance growth and projection, 1994 through 2020, on a logarithmic scale.]
But what exactly is an "Accelerator"?

[Diagram: a CPU with its own CPU memory, connected over the PCI bus to a GPU with its own GPU memory.]
GPU Architecture: Two Main Components

Global memory
- Analogous to RAM in a CPU server
- Accessible by both GPU and CPU
- Currently up to 24 GB
- Bandwidth currently up to almost 288 GB/s for Tesla products
- ECC on/off option for Quadro and Tesla products

Streaming Multiprocessors (SMs)
- Perform the actual computations
- Each SM has its own control units, registers, execution pipelines, and caches
[Die diagram: DRAM interfaces, GigaThread engine, host interface, and L2 cache.]
GPU Architecture – Kepler: Streaming Multiprocessor (SMX)
- 192 SP CUDA cores per SMX: 192 fp32 ops/clock and 192 int32 ops/clock
- 64 DP CUDA cores per SMX: 64 fp64 ops/clock
- 4 warp schedulers: up to 2048 threads concurrently
- 32 special-function units
- 64 KB shared memory + L1 cache
- 48 KB read-only data cache
- 64K 32-bit registers
GPU Architecture – Kepler: CUDA Core
- Floating-point and integer unit
  - IEEE 754-2008 floating-point standard
  - Fused multiply-add (FMA) instruction for both single and double precision
- Logic unit
- Move, compare unit
- Branch unit

[Diagram: a GK110 Kepler CUDA core, showing the dispatch port, operand collector, FP unit, INT unit, and result queue.]
From a document you should read if you are interested in this:
http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
MIC Architecture
• Many cores on the die
• L1 and L2 cache
• Bidirectional ring network for L2, memory, and the PCIe connection
Courtesy Dan Stanzione, TACC
Rapid Evolution

                                     Fermi GF100  Fermi GF104  Kepler GK104  Kepler GK110
Compute Capability                   2.0          2.1          3.0           3.5
Threads / Warp                       32           32           32            32
Max Warps / Multiprocessor           48           48           64            64
Max Threads / Multiprocessor         1536         1536         2048          2048
Max Thread Blocks / Multiprocessor   8            8            16            16
32-bit Registers / Multiprocessor    32768        32768        65536         65536
Max Registers / Thread               63           63           63            255
Max Threads / Thread Block           1024         1024         1024          1024
Shared Memory Size Configurations    16k/48k      16k/48k      16k/32k/48k   16k/32k/48k
Max X Grid Dimension                 2^16         2^16         2^32          2^32
Hyper-Q                              No           No           No            Yes
Dynamic Parallelism                  No           No           No            Yes

From http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
• Shows healthy commitment by vendors to staying on the leading edge.
• Maybe the compiler knows more about this than you?
• We will see how we can gain the advantages without having to stay on top of this.
How Do We Accelerate Applications?

Three routes into an application:
• Libraries: "drop-in" acceleration
• OpenACC directives: incrementally accelerate applications
• Programming languages (CUDA): maximum flexibility
What is OpenACC?
It is a directive-based standard that allows developers to take advantage of accelerators such as GPUs from NVIDIA and AMD, Intel's Xeon Phi, FPGAs, and even DSP chips.
Directives

Your original Fortran or C code, with simple compiler hints from the coder. The compiler generates the parallel code: the serial parts run on the CPU, and the kernels region runs on the GPU.

Program myscience
   ... serial code ...
   !$acc kernels          <-- OpenACC compiler hint
   do k = 1,n1
      do i = 1,n2
         ... parallel code ...
      enddo
   enddo
   !$acc end kernels
   ...
End Program myscience
Familiar to OpenMP Programmers

OpenMP (runs on the CPU):

main() {
   double pi = 0.0; long i;
   #pragma omp parallel for reduction(+:pi)
   for (i=0; i<N; i++)
   {
      double t = (double)((i+0.05)/N);
      pi += 4.0/(1.0+t*t);
   }
   printf("pi = %f\n", pi/N);
}

OpenACC (runs on CPU and GPU):

main() {
   double pi = 0.0; long i;
   #pragma acc kernels
   for (i=0; i<N; i++)
   {
      double t = (double)((i+0.05)/N);
      pi += 4.0/(1.0+t*t);
   }
   printf("pi = %f\n", pi/N);
}
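If you want to try the OpenACC version as it stands, here is a self-contained, compilable sketch (not from the slide): N is given an arbitrary value, and the reduction on pi is spelled out with a parallel loop directive, whereas the kernels form above leaves that for the compiler to detect.

#include <stdio.h>

#define N 1000000   /* N is not defined on the slide; an illustrative value */

int main()
{
   double pi = 0.0; long i;

   /* explicit reduction clause; with "kernels" the compiler usually detects it on its own */
   #pragma acc parallel loop reduction(+:pi)
   for (i=0; i<N; i++)
   {
      double t = (double)((i+0.05)/N);
      pi += 4.0/(1.0+t*t);
   }

   printf("pi = %f\n", pi/N);
   return 0;
}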
A Simple Example: SAXPY

SAXPY in Fortran:

subroutine saxpy(n, a, x, y)
   real :: x(:), y(:), a
   integer :: n, i
   !$acc kernels
   do i=1,n
      y(i) = a*x(i)+y(i)
   enddo
   !$acc end kernels
end subroutine saxpy

...
! From main program
! call SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...

SAXPY in C:

void saxpy(int n,
           float a,
           float *x,
           float *restrict y)
{
   #pragma acc kernels
   for (int i = 0; i < n; ++i)
      y[i] = a*x[i] + y[i];
}

...
// Somewhere in main
// call SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...
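To see the single-source point concretely, a small driver like the following sketch (not on the slide; the initialization values are arbitrary) can be built for the GPU with an OpenACC compiler, for example pgcc -acc -Minfo=accel saxpy.c, while the same file compiles as an ordinary serial program when the -acc flag is left off.

#include <stdio.h>
#include <stdlib.h>

/* the saxpy() from the slide, repeated so this sketch stands alone */
void saxpy(int n, float a, float *x, float *restrict y)
{
   #pragma acc kernels
   for (int i = 0; i < n; ++i)
      y[i] = a*x[i] + y[i];
}

int main()
{
   int n = 1<<20;                        /* 1M elements, as on the slide */
   float *x = malloc(n * sizeof(float));
   float *y = malloc(n * sizeof(float));

   for (int i = 0; i < n; ++i) {         /* arbitrary test values */
      x[i] = 1.0f;
      y[i] = 2.0f;
   }

   saxpy(n, 2.0f, x, y);                 /* y[i] becomes 2*1 + 2 = 4 */

   printf("y[0] = %f (expect 4.0)\n", y[0]);
   free(x); free(y);
   return 0;
}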
Key Advantages Of This Approach

- High-level. No involvement of OpenCL, CUDA, etc.
- Single source. No forking off a separate GPU code. Compile the same program for accelerators or serial; non-GPU programmers can play along.
- Efficient. Experience shows very favorable comparison to low-level implementations of the same algorithms.
- Performance portable. Supports GPU accelerators and co-processors from multiple vendors, current and future versions.
- Incremental. Developers can port and tune parts of their application as resources and profiling dictate. No wholesale rewrite is required, and porting can be quick.
A Few Cases

- Reading DNA nucleotide sequences (Shanghai JiaoTong University): 1 week, 40x faster
- Designing circuits for quantum computing (UIST, Macedonia): 3 directives, 4.1x faster
- Extracting image features in real time (Aselsan): a few hours, 70x faster
- HydroC, galaxy formation (PRACE benchmark code, CAPS): 4 directives, 6.4x faster
- Real-time derivative valuation (Opel Blue, Ltd): 4 directives, 16x faster
- Matrix-matrix multiply (independent research scientist): 1 week, 3x faster
A Champion Case: S3D Fuel Combustion
- Goal: design alternative fuels with up to 50% higher efficiency.
- Titan: 10 days; Jaguar: 42 days.
- Modified less than 1% of the lines of code.
- 4x faster.
- Was recently the fastest scientific simulation ever done: 15 PF!
Comparison to Alternatives: Lattice-Boltzmann Example
True Standard
- Full OpenACC 1.0 and 2.0 specifications available online: http://www.openacc-standard.org
- Implementations available now from PGI, Cray, and CAPS
- OpenACC 2.0 spec just released
This looks easy! Too easy…

If it is this simple, why don't we just throw a kernels directive in front of every loop? Better yet, why doesn't the compiler do this for me?

The answer is that there are two general issues that prevent the compiler from simply parallelizing every loop automatically:

- Data dependencies in loops (a classic performance issue)
- Data movement

The compiler needs your higher-level perspective (in the form of directive hints) to get correct results and reasonable performance.
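Here is a minimal sketch of both issues in C; the function, arrays, and sizes are illustrative, not taken from the talk.

void example(int n, int iters, float *restrict a, float *restrict b)
{
   /* 1. Data dependency: a[i] needs a[i-1] from the previous iteration,
         so the compiler cannot safely run these iterations in parallel. */
   for (int i = 1; i < n; i++)
      a[i] = a[i-1] + b[i];

   /* 2. Data movement: the data region keeps a and b on the GPU across the
         whole outer loop instead of copying them over the PCI bus every
         time the kernels region is entered. */
   #pragma acc data copy(a[0:n]) copyin(b[0:n])
   {
      for (int iter = 0; iter < iters; iter++) {
         #pragma acc kernels
         for (int i = 0; i < n; i++)
            a[i] += b[i];               /* independent iterations: safe to parallelize */
      }
   }
}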
Our Foundation Exercise: Jacobi Iteration

- I've been using this for MPI and OpenMP for a long time (since 1992 for MPI/PVM).
- It is a great simulation problem, not rigged for OpenACC.
- In this most basic form, it solves the Laplace equation: ∇²f(x,y) = 0
- In our workshop example it is the steady-state heat equation.
- Students start with a realistic, normal serial code and parallelize it themselves.
- Speedup is typically 8 to 12 times serial speed.

[Figure: a metal plate with a heating element, shown at the initial conditions and at the final steady state.]
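For orientation, here is a minimal sketch of what the accelerated kernel can look like once directives are added. The names (Temperature, Temperature_last, ROWS, COLUMNS, TEMP_TOLERANCE, worst_dt) are illustrative guesses at the workshop code, not the code itself, and the boundary and initial conditions are assumed to be set elsewhere.

#include <math.h>

#define ROWS            1000
#define COLUMNS         1000
#define TEMP_TOLERANCE  0.01

double Temperature[ROWS+2][COLUMNS+2];       /* current iteration */
double Temperature_last[ROWS+2][COLUMNS+2];  /* previous iteration, holds the boundary conditions */

void jacobi()
{
   double worst_dt = 100.0;                  /* largest change seen in one sweep */

   while (worst_dt > TEMP_TOLERANCE) {

      /* each interior point becomes the average of its four neighbors */
      #pragma acc kernels
      for (int i = 1; i <= ROWS; i++)
         for (int j = 1; j <= COLUMNS; j++)
            Temperature[i][j] = 0.25 * (Temperature_last[i+1][j] + Temperature_last[i-1][j] +
                                        Temperature_last[i][j+1] + Temperature_last[i][j-1]);

      /* find the largest change (a max reduction the compiler can detect)
         and copy the new values into the "last" array */
      worst_dt = 0.0;
      #pragma acc kernels
      for (int i = 1; i <= ROWS; i++)
         for (int j = 1; j <= COLUMNS; j++) {
            worst_dt = fmax(fabs(Temperature[i][j] - Temperature_last[i][j]), worst_dt);
            Temperature_last[i][j] = Temperature[i][j];
         }
   }
}

As written, the arrays can end up being copied between host and device around each kernels region; wrapping the while loop in an acc data region, per the data movement issue noted above, is the usual follow-on step.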
Since I've mentioned MPI…
Can you learn it in half a day? Definitely not.
We have a two-day workshop, and expect a long latency before we hear success stories back from our students.
I also mentioned OpenMP
o We do have a 1-day OpenMP workshop.
o It does not cover the accelerator extensions (OpenMP 4.0).
o That will become its own extra day.

MPI, OpenMP and OpenACC are the fundamental technologies, for nodes, cores and accelerators, respectively.
They drive the big codes on the big platforms.
All are well worth learning.
OpenACC is the easiest, and it is the only one with alternatives: CUDA, OpenCL, OpenMP 4.0…
Comparing OpenACC with OpenMP 4.0 on NVIDIA & Phi

OpenMP 4.0 for Intel Xeon Phi:

#pragma omp target device(0) map(tofrom:B)
#pragma omp parallel for
for (i=0; i<N; i++)
   B[i] += sin(B[i]);

OpenMP 4.0 for NVIDIA GPU:

#pragma omp target device(0) map(tofrom:B)
#pragma omp teams num_teams(num_blocks) num_threads(bsize)
#pragma omp distribute
for (i=0; i<N; i += num_blocks)
   #pragma omp parallel for
   for (b = i; b < i+num_blocks; b++)
      B[b] += sin(B[b]);

OpenACC for NVIDIA GPU:

#pragma acc kernels
for (i=0; i<N; ++i)
   B[i] += sin(B[i]);

(First two examples courtesy of Christian Terboven.)
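If you want the OpenACC version to state its data movement as explicitly as the OpenMP map(tofrom:B) clause does, a data clause can be added; a small sketch, not on the original slide:

/* explicit data clause, mirroring map(tofrom:B) above */
#pragma acc kernels copy(B[0:N])
for (i=0; i<N; ++i)
   B[i] += sin(B[i]);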
How is it going?

We've been doing this a few years now, and really got the confidence (in our AV infrastructure) to ramp up the remote workshops this past year. Over that period we have worked 4 OpenACC workshops into our rotating monthly lineup and done others at special events.

- October 2012: 147 students on Keeneland (National Institute for Computational Sciences) [two-day]
- January 2013: 72 students on Keeneland [two-day]
- June 2013: 35 students on Keeneland (at the International Summer School on HPC) [morning]
- July 2013: 35 students on Keeneland (at XSEDE '13) [morning]
- August 2013: 113 students on Keeneland [5-hour]
- November 2013: 149 students on Blue Waters (National Center for Supercomputing Applications) [5-hour]
How is it really going?
o We use comprehensive anonymous student evaluations
   - Both comments and quantitative data
o We react to that feedback
   - Hence our 2-day workshop has become 5 hours
   - An advanced 2-day workshop (with OpenMP 4.0) is under development
o It has been very positive thus far
   - Better than 5/6 on review scales
   - The repeat-request rate from sites is far beyond our ability to accommodate
   - Student demand is more than we can accommodate (maximum seats available)
o Our latest workshop was the object of an in-depth XSEDE training evaluation
   - Conclusions pending
Upcoming opportunities (not counting this booth!)
- XSEDE calendar: https://portal.xsede.org/course-calendar
- OpenACC.org calendar: http://www.openacc.org/Education_Classes_and_Workshops
- or contact me