TRANSCRIPT
Graphics Processors and the Exascale: Parallel Mappings, Scalability and Application Lifespan
Rob Farber, Senior Scientist, PNNL
Questions 1 and 2
1. Looking forward in the 2-5 year timeframe, will we continue to need new languages, compiler directives, or language extensions to use accelerators?
Absolutely, as will be discussed in the next few slides.
Will compiler technology advance sufficiently to seamlessly use accelerators, as when the 8087 was added to the 8086 in the early days of the x86 architecture, or when instruction sets were extended to include SSE or AltiVec and compilers eventually generated code for them?
Oh, I wish! However, there is hope for data-parallel problems.
2. What is your vision of what a unified heterogeneous HPC ecosystem should encompass? What languages, libraries, and frameworks? Should debuggers and profiling tools be integrated across heterogeneous architectures?
• Humans are the weak link
• A scalable, globally unified file system is essential
• Yes to a unified set of debugger and profiling tools
• I'd like to say any language, but many semantics and assumptions will not scale!
A perfect storm of opportunities and technology
(Summary of Farber, Scientific Computing, "Realizing the Benefits of Affordable Teraflop-capable Hardware")
Multi-threaded software is a must-have because manufacturers were forced to move to multi-core CPUs.
The failure of Dennard scaling meant processor manufacturers had to add cores to increase performance and entice customers. This is a new model for a huge body of legacy code!
Multi-core is disruptive to single-threaded and poorly scaling legacy apps. GPGPUs, the Cray XMT, and Blue Waters have changed the numbers, and commodity systems are catching up. Massive threading is the future. Research efforts will not benefit from new hardware unless they invest in scalable, multi-threaded software; lack of investment risks stagnation and losing to the competition.
Competition is fierce, and the new technology is readily available and inexpensive! Which software and models? Look to the successes that are widely adopted and have withstood the test of time. Briefly examine CUDA, OpenCL, and data-parallel extensions.
GPGPUs: an existing capability
Market forces evolved GPUs into massively parallel GPGPUs (General-Purpose Graphics Processing Units).
NVIDIA quotes a 100+ million installed base of CUDA-enabled GPUs
GPUs put supercomputing in the hands of the masses.
December 1996: ASCI Red became the first teraflop supercomputer. Today, kids buy GPUs with flop rates comparable to the systems available to scientists with supercomputer access in the mid-to-late 1990s.
Remember that Finnish kid who wrote some software to understand operating systems? Inexpensive commodity hardware enables:
• New thinking
• A large educated base of developers
GPU                  Peak 32-bit (TF/s)   Peak 64-bit (GF/s)   Cost
GeForce GTX 480      1.35                 168                  < $500
AMD Radeon HD 5870   2.72                 544                  < $380
Meeting the need. CUDA was adopted quickly!
February 2007: The initial CUDA SDK was made public.
Now: CUDA-based GPU computing is part of the curriculum at more than 200 universities, including MIT, Harvard, Cambridge, Oxford, the Indian Institutes of Technology, National Taiwan University, and the Chinese Academy of Sciences.
Application speed tells the story: the fastest 100 apps in the NVIDIA Showcase, Sept. 8, 2010.
[Chart: Speedup (orders of magnitude) ranked by project from best to worst. Fastest: 2600x; Median: 253x; Slowest: 98x.]
URL: http://www.nvidia.com/object/cuda_apps_flash_new.html (click on "Sort by Speed Up")
GPGPUs are not a one-trick pony
Used on a wide range of computational, data-driven, and real-time applications
Exhibit knife-edge performance
Balance ratios can help map problems
Can really be worth the effort
10x can make computational workflows more interactive (even poorly performing GPU apps are useful).
100x is disruptive and has the potential to fundamentally affect scientific research by removing time-to-discovery barriers.
1000x and greater has been achieved through the use of optimized transcendental functions and/or multiple GPUs.
Three rules for fast GPU codes
1. Get the data on the GPU (and keep it there!)
   • PCIe x16 v2.0 bus: 8 GiB/s in a single direction
   • 20-series GPUs: 140-200 GiB/s
2. Give the GPU enough work to do
   • Assume 1 μs of latency and a 1 TF/s device
   • Can waste (10^-6 × 10^12) = 1M operations
3. Reuse and locate data to avoid global memory bandwidth bottlenecks
   • 10^12 flop/s hardware delivers only ~10^10 flop/s when global-memory limited
   • Can cause a 100x slowdown!
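A minimal sketch of rules 1 and 2, assuming Thrust and made-up sizes (not from the original slides): the data is copied to the device once, reworked there across many passes, and only a scalar result crosses the PCIe bus.

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <cstdio>

int main(void)
{
    // Rule 1: copy the data to the GPU once and keep it resident there.
    thrust::device_vector<float> d_data(1 << 20, 1.0f);

    // Rule 2: every pass works on a million elements, plenty of work per launch.
    for (int iter = 0; iter < 100; ++iter)
        thrust::transform(d_data.begin(), d_data.end(), d_data.begin(),
                          thrust::negate<float>());

    // Only a single scalar crosses the PCIe bus back to the host.
    float sum = thrust::reduce(d_data.begin(), d_data.end(), 0.0f, thrust::plus<float>());
    std::printf("sum = %f\n", sum);
    return 0;
}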
Tough for people. Tools need heuristics that can work on incomplete data and adjust for bad decisions. It’s even worse in a distributed and non-failsafe environment.
Results presented at SC09 (courtesy TACC)
Application lifespan. SIMD: a key from the past
Farber: a general SIMD mapping from the 1980s.
Acknowledgements: work performed at or funded by the Santa Fe Institute, the Theoretical Division at Los Alamos National Laboratory, and various NSF, DOE, and other funding sources, including the Texas Advanced Computing Center.
This mapping for neural networks on the Connection Machine was the "most efficient implementation to date" (Singer 1990; Thearling 1995).
Observed peak effective rate vs. number of Ranger cores:
[Chart: Effective rate (TF/s) vs. number of Barcelona cores on Ranger. 60,000 cores: 363 TF/s measured; 62,796 cores: 386 TF/s (projected).]
The parallel mapping: energy = objFunc(p1, p2, …, pn)
[Diagram: An optimization method (Powell, conjugate gradient, or other) proposes the parameters p1, p2, …, pn.
Step 1: Broadcast the parameters to every GPU (GPU 1 through GPU 4 each receive p1, p2, …, pn).
Step 2: Each GPU calculates partials over its own block of examples (examples 0 to N-1, N to 2N-1, 2N to 3N-1, and 3N to 4N-1).
Step 3: Sum the partials to get the energy, which is returned to the optimization method.]
A code sketch of this mapping follows the results below.
Results = The Connection Machine × C_NVIDIA (where C_NVIDIA >> 1)
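A minimal single-GPU sketch of steps 2 and 3 of this mapping, assuming Thrust and a made-up one-parameter least-squares objective (the parameter broadcast and the multi-GPU exchange of partial sums are omitted):

#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <cstdio>

// Partial energy for one example: squared error against a one-parameter model (hypothetical).
struct partialEnergy {
    float p1; // parameter broadcast from the optimization method
    __host__ __device__ float operator()(float example) const {
        float err = example - p1;
        return err * err;
    }
};

int main(void)
{
    // The examples stay resident in device memory for the whole optimization run.
    thrust::device_vector<float> d_examples(1 << 20, 2.0f);

    // Step 1: the optimizer (Powell, conjugate gradient, ...) proposes a parameter value.
    partialEnergy op;
    op.p1 = 1.5f;

    // Steps 2 and 3: calculate a partial per example and sum the partials on the GPU.
    float energy = thrust::transform_reduce(d_examples.begin(), d_examples.end(),
                                            op, 0.0f, thrust::plus<float>());

    std::printf("energy = %f\n", energy); // returned to the optimization method
    return 0;
}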
Nonlinear PCA (average of 100 iterations, seconds)
  8x core*:  0.877923
  C2050**:   0.021667
  Speedup:   40x (295x measured vs. 1 core)

Linear PCA (average of 100 iterations, seconds)
  8x core*:  0.164605
  C2050**:   0.020173
  Speedup:   8x (57x measured vs. 1 core)

*  2x Intel quad-core E5540 @ 2.53 GHz, OpenMP, SSE enabled via g++
** Includes all data transfer overhead ("effective flops")
What is C_NVIDIA for modern x86_64 machines?
Scalability across GPU/CPU cluster nodes (big hybrid supercomputers are coming)
Oak Ridge National Laboratory looks to NVIDIA “Fermi” architecture for new supercomputer
NERSC experimental GPU cluster: Dirac
EMSL experimental GPU cluster: Barracuda
CUDA IB cluster speedup with two GPUs per node (courtesy NVIDIA Corp)
[Chart: Speedup over one GPU vs. number of GPUs (2 to 14).]
Looking into my crystal ball: I predict a long life for GPGPU applications.
Why? SIMD/SPMD/MIMD mappings translate well to new architectures, and CUDA/OpenCL provide an excellent way to create these codes.
Will these applications always be written in these languages?
Data-parallel extensions are hot!
Data-parallel extensions
URL: http://code.google.com/p/thrust/
Example from the website:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <cstdlib>

int main(void)
{
    // generate random data on the host
    thrust::host_vector<int> h_vec(100);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer to device and compute sum
    thrust::device_vector<int> d_vec = h_vec;
    int x = thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::plus<int>());
    return 0;
}
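A usage note, assuming the Thrust headers are on the include path: code that uses thrust::device_vector compiles with the CUDA toolkit's nvcc, for example nvcc example.cu -o example (example.cu is a hypothetical file name).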
OpenCL has potential (but is still very new).
x86: the dominant architecture, with more cores, greater memory bandwidth, and lower power.
POWER7: Blue Waters, with over 1 million concurrent threads of execution in a petabyte of shared memory, and innovative design features to avoid SMP scaling bottlenecks.
Hybrid architectures: CPU/GPU clusters.
Problems dominated by irregular access in large data: the Cray XMT is specialized for large graph problems.
Question 3: Will we need a whole new computational execution model for exascale systems, e.g. something like LSU's ParalleX?
It certainly sounds wonderful!
• A new model of parallel computation
• Semantics for state objects, functions, parallel flow control, and distributed interactions
• Unbounded policies for implementation technology, structure, and mechanism
• Intrinsic system-wide latency hiding
• Near fine-grain global parallelism
• Global unified parallel programming
• Humans are the weak link
• A scalable, globally unified file system is essential
• Yes to a unified set of debugger and profiling tools
• Many language semantics and assumptions will not scale!