TRANSCRIPT
Graphics Processors and the Exascale: Parallel Mappings, Scalability and Application Lifespan
Rob Farber, Senior Scientist, PNNL
Questions 1 and 2
1. Looking forward in the 2-5 year timeframe, will we continue to need new languages, compiler directives, or language extensions to use accelerators?
Absolutely, as will be discussed in the next few slides.
Will compiler technology advance sufficiently to seamlessly use accelerators, as when the 8087 was added to the 8086 in the early days of the x86 architecture, or when instruction sets were extended to include SSE or AltiVec and compilers eventually generated code for them?
Oh, I wish! However, there is hope for data-parallel problems.
2. What is your vision of what a unified heterogeneous HPC ecosystem should encompass? What languages, libraries, and frameworks? Should debuggers and profiling tools be integrated across heterogeneous architectures?
• Humans are the weak link
• A scalable, globally unified file system is essential
• Yes to a unified set of debugger and profiling tools
• I'd like to say any language, but many semantics and assumptions will not scale!
A perfect storm of opportunities and technology
(Summary of Farber, Scientific Computing, "Realizing the Benefits of Affordable Teraflop-capable Hardware")
Multi-threaded software is a must-have because manufacturers were forced to move to multi-core CPUs.
The failure of Dennard scaling meant processor manufacturers had to add cores to increase performance and entice customers. This is a new model for a huge body of legacy code!
Multi-core is disruptive to single-threaded and poorly scaling legacy apps. GPGPUs, the Cray XMT, and Blue Waters have changed the numbers, and commodity systems are catching up. Massive threading is the future. Research efforts will not benefit from new hardware unless they invest in scalable, multi-threaded software; lack of investment risks stagnation and losing to the competition.
Competition is fierce, and the new technology is readily available and inexpensive! Which software and models? Look to the successes that are widely adopted and have withstood the test of time. Briefly examine CUDA, OpenCL, and data-parallel extensions.
GPGPUs: an existing capability
Market forces evolved GPUs into massively parallel GPGPUs (General-Purpose Graphics Processing Units).
NVIDIA quotes a 100+ million installed base of CUDA-enabled GPUs
GPUs put supercomputing in the hands of the masses.
December 1996: ASCI Red became the first teraflop supercomputer. Today, kids buy GPUs with flop rates comparable to the systems available to scientists with supercomputer access in the mid-to-late 1990s.
Remember that Finnish kid who wrote some software to understand operating systems? Inexpensive commodity hardware enables:
• New thinking
• A large educated base of developers
GPU                  Peak 32-bit (TF/s)   Peak 64-bit (GF/s)   Cost
GeForce GTX 480      1.35                 168                  < $500
AMD Radeon HD 5870   2.72                 544                  < $380
Meeting the need. CUDA was adopted quickly!
February 2007: The initial CUDA SDK was made public.
Now: CUDA-based GPU computing is part of the curriculum at more than 200 universities, including MIT, Harvard, Cambridge, Oxford, the Indian Institutes of Technology, National Taiwan University, and the Chinese Academy of Sciences.
Application speed tells the story: the fastest 100 apps in the NVIDIA Showcase, Sept. 8, 2010.
[Chart: Speedup (orders of magnitude) ranked by project from best to worst. Fastest: 2600x; Median: 253x; Slowest: 98x.]
URL: http://www.nvidia.com/object/cuda_apps_flash_new.html (click on "Sort by Speed Up")
GPGPUs are not a one-trick pony
Used on a wide range of computational, data-driven, and real-time applications
Exhibit knife-edge performance
Balance ratios can help map problems
Can really be worth the effort
10x can make computational workflows more interactive (even poorly performing GPU apps are useful).
100x is disruptive and has the potential to fundamentally affect scientific research by removing time-to-discovery barriers.
1000x and greater has been achieved through the use of optimized transcendental functions and/or multiple GPUs.
Three rules for fast GPU codes
1. Get the data on the GPU (and keep it there!)
   • PCIe x16 v2.0 bus: 8 GiB/s in a single direction
   • 20-series GPUs: 140-200 GiB/s
2. Give the GPU enough work to do
   • Assume 1 μs of latency and a 1 TF/s device
   • Can waste (10^-6 × 10^12) = 1M operations
3. Reuse and locate data to avoid global memory bandwidth bottlenecks
   • 10^12 flop/s hardware delivers only ~10^10 flop/s when global-memory limited
   • Can cause a 100x slowdown!
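A minimal sketch of rules 1 and 2, assuming Thrust and made-up sizes (not from the original slides): the data is copied to the device once, reworked there across many passes, and only a scalar result crosses the PCIe bus.

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <cstdio>

int main(void)
{
    // Rule 1: copy the data to the GPU once and keep it resident there.
    thrust::device_vector<float> d_data(1 << 20, 1.0f);

    // Rule 2: every pass works on a million elements, plenty of work per launch.
    for (int iter = 0; iter < 100; ++iter)
        thrust::transform(d_data.begin(), d_data.end(), d_data.begin(),
                          thrust::negate<float>());

    // Only a single scalar crosses the PCIe bus back to the host.
    float sum = thrust::reduce(d_data.begin(), d_data.end(), 0.0f, thrust::plus<float>());
    std::printf("sum = %f\n", sum);
    return 0;
}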
Tough for people. Tools need heuristics that can work on incomplete data and adjust for bad decisions. It’s even worse in a distributed and non-failsafe environment.
Results presented at SC09 (courtesy TACC)
Application lifespan. SIMD: a key from the past
Farber: a general SIMD mapping from the 1980s.
Acknowledgements: work performed at or funded by the Santa Fe Institute, the Theoretical Division at Los Alamos National Laboratory, and various NSF, DOE, and other funding sources, including the Texas Advanced Computing Center.
This mapping for neural networks on the Connection Machine was the "most efficient implementation to date" (Singer 1990; Thearling 1995).
Observed peak effective rate vs. number of Ranger cores:
[Chart: Effective rate (TF/s) vs. number of Barcelona cores on Ranger. 60,000 cores: 363 TF/s measured; 62,796 cores: 386 TF/s (projected).]
The parallel mapping: energy = objFunc(p1, p2, …, pn)
[Diagram: An optimization method (Powell, conjugate gradient, or other) proposes the parameters p1, p2, …, pn.
Step 1: Broadcast the parameters to every GPU (GPU 1 through GPU 4 each receive p1, p2, …, pn).
Step 2: Each GPU calculates partials over its own block of examples (examples 0 to N-1, N to 2N-1, 2N to 3N-1, and 3N to 4N-1).
Step 3: Sum the partials to get the energy, which is returned to the optimization method.]
A code sketch of this mapping follows the results below.
Results = The Connection Machine × C_NVIDIA (where C_NVIDIA >> 1)
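A minimal single-GPU sketch of steps 2 and 3 of this mapping, assuming Thrust and a made-up one-parameter least-squares objective (the parameter broadcast and the multi-GPU exchange of partial sums are omitted):

#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <cstdio>

// Partial energy for one example: squared error against a one-parameter model (hypothetical).
struct partialEnergy {
    float p1; // parameter broadcast from the optimization method
    __host__ __device__ float operator()(float example) const {
        float err = example - p1;
        return err * err;
    }
};

int main(void)
{
    // The examples stay resident in device memory for the whole optimization run.
    thrust::device_vector<float> d_examples(1 << 20, 2.0f);

    // Step 1: the optimizer (Powell, conjugate gradient, ...) proposes a parameter value.
    partialEnergy op;
    op.p1 = 1.5f;

    // Steps 2 and 3: calculate a partial per example and sum the partials on the GPU.
    float energy = thrust::transform_reduce(d_examples.begin(), d_examples.end(),
                                            op, 0.0f, thrust::plus<float>());

    std::printf("energy = %f\n", energy); // returned to the optimization method
    return 0;
}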
Nonlinear PCA (average of 100 iterations, seconds)
  8x core*:  0.877923
  C2050**:   0.021667
  Speedup:   40x (295x measured vs. 1 core)

Linear PCA (average of 100 iterations, seconds)
  8x core*:  0.164605
  C2050**:   0.020173
  Speedup:   8x (57x measured vs. 1 core)

*  2x Intel quad-core E5540 @ 2.53 GHz, OpenMP, SSE enabled via g++
** Includes all data transfer overhead ("effective flops")
What is C_NVIDIA for modern x86_64 machines?
Scalability across GPU/CPU cluster nodes (big hybrid supercomputers are coming)
Oak Ridge National Laboratory looks to NVIDIA “Fermi” architecture for new supercomputer
NERSC experimental GPU cluster: Dirac
EMSL experimental GPU cluster: Barracuda
CUDA IB cluster speedup with two GPUs per node (courtesy NVIDIA Corp)
[Chart: Speedup over one GPU vs. number of GPUs (2 to 14).]
Looking into my crystal ball: I predict a long life for GPGPU applications.
Why? SIMD/SPMD/MIMD mappings translate well to new architectures, and CUDA/OpenCL provide an excellent way to create these codes.
Will these applications always be written in these languages?
Data-parallel extensions are hot!
Data-parallel extensions
URL: http://code.google.com/p/thrust/
Example from the website:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <cstdlib>

int main(void)
{
    // generate random data on the host
    thrust::host_vector<int> h_vec(100);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer to device and compute sum
    thrust::device_vector<int> d_vec = h_vec;
    int x = thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::plus<int>());
    return 0;
}
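A usage note, assuming the Thrust headers are on the include path: code that uses thrust::device_vector compiles with the CUDA toolkit's nvcc, for example nvcc example.cu -o example (example.cu is a hypothetical file name).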
OpenCL has potential (but is still very new).
x86: the dominant architecture, with more cores, greater memory bandwidth, and lower power.
POWER7: Blue Waters, with over 1 million concurrent threads of execution in a petabyte of shared memory, and innovative design features to avoid SMP scaling bottlenecks.
Hybrid architectures: CPU/GPU clusters.
Problems dominated by irregular access in large data: the Cray XMT is specialized for large graph problems.
Question 3: Will we need a whole new computational execution model for exascale systems, e.g. something like LSU's ParalleX?
It certainly sounds wonderful!
• A new model of parallel computation
• Semantics for state objects, functions, parallel flow control, and distributed interactions
• Unbounded policies for implementation technology, structure, and mechanism
• Intrinsic system-wide latency hiding
• Near fine-grain global parallelism
• Global unified parallel programming
• Humans are the weak link
• A scalable, globally unified file system is essential
• Yes to a unified set of debugger and profiling tools
• Many language semantics and assumptions will not scale!