Thinking Outside of the Tera-Scale Box (Piotr Luszczek)

TRANSCRIPT

  • Slide 1

Thinking Outside of the Tera-Scale Box
  Piotr Luszczek

  • Slide 2

Brief History of Tera-flop: 1997
  1997: ASCI Red

  • Slide 3

Brief History of Tera-flop: 2007
  1997: ASCI Red
  2007: Intel Polaris

  • Slide 4

Brief History of Tera-flop: GPGPU
  1997: ASCI Red
  2007: Intel Polaris
  2008: GPGPU

  • Slide 5

Tera-flop: a Dictionary Perspective
  1997 (supercomputer)                     | 2008 (GPU)
  Belongs to a supercomputer expert        | Belongs to a gaming geek
  Consumes 500 kW                          | Consumes under 500 W
  Costs over $1M                           | Costs under $200
  Requires distributed-memory programming  | Requires only a CUDA kernel
  Memory: 1 GiB or more                    | Memory: 1 GiB or more

  • Slide 6

Double-Dip Recession?
  (Chart: productivity over time, 2005 to 2007; multicore, then hybrid multicore)
  Cilk, Intel TBB, Intel Ct, Apple GCD, Microsoft ConcRT, HMPP, PGI Accelerator, PathScale?

  • Slide 7

Locality Space in HPC Challenge
  HPL, Mat-Mat-Mul, Parallel-Transpose, STREAM, FFT, RandomAccess

  • Slide 8

Locality Space in HPC Challenge: Bounds
  HPL, Mat-Mat-Mul, Parallel-Transpose, STREAM, FFT, RandomAccess
  Latency-bound, bandwidth-bound, compute-bound
  AORSA2D, PCIe

  • Slide 9

CUDA vs. OpenCL: Similar Software Stacks

  • Slide 10

Matrix-Matrix Multiply: the Simple Form
  for i = 1:M
    for j = 1:N
      for k = 1:T
        C(i, j) += A(i, k) * B(k, j)
  CUBLAS, Volkov et al., MAGMA, Nakasato

  • Slide 11

Fermi C2050: from CUDA to OpenCL

  • Slide 12

Fermi C2050: Simple OpenCL Porting

  • Slide 13

ATI 5870: Simple OpenCL Porting

  • Slide 14

Making the Case for Autotuning: Use of Texture

  • Slide 15

Making the Case for Autotuning: Blocking Factors
  (Blocking factors provided by NVIDIA)

  • Slide 16

Mat-Mul Portability Summary (achieved performance)
                          Single precision   Double precision
  ATI OpenCL              52%                57%
  ATI IL                  74%                87%
  NVIDIA CUDA             63%                58%
  NVIDIA OpenCL           57%                49%
  Multicore (vendor)      79%
  Multicore (autotuned)   46%                40%

  • Slide 17

Now, let's think outside of the box

  • Slide 18

Recipe for Performance Before Multicore
  Wider superscalar instruction issue
  Increased clock frequency
  Increased number of transistors
  Load/store instruction scheduling

  • Slide 19

Recipe for Performance After Multicore
  More cores
  Parallel software

  • Slide 20

Recipe for Future Performance
  More cores
  Better sequential software
  DAG runtime

  • Slide 21

From Assembly to DAG: PLASMA
  Typical loop in assembly (sequential instruction stream, instruction scheduling):
    load R0, #1000
    divf F1, F2, F3
    load R1, #1000
    divf F2, F4, F5
    dec  R1
    jump
  LU factorization (sequential task stream, DAG scheduling):
    for i = 1:N
      GETRF2(A[i, i])
      for j = i+1:N
        TRSM(A[i, i], A[i, j])
        for k = i+1:N
          GEMM(A[k, i], A[i, j], A[k, j])

  • Slide 22

Scalability of LU Factorization on Multicore
  PLASMA vs. MKL 11.0 vs. LAPACK
  6 cores = 1 socket, 24 cores = 4 sockets, 48 cores = 8 sockets
  AMD Istanbul, 2.8 GHz

  • Slide 23

Combining Multicore and GPU for QR Factorization
  AMD Istanbul 2.8 GHz (24 cores) and Fermi GTX 480

  • Slide 24

Performance on Distributed Memory
  QR, LU, Cholesky
  ORNL Kraken, Cray XT5: AMD Istanbul 2.6 GHz, dual-socket 6-core
  Interconnect: SeaStar, 3D torus
  Cray LibSci vs. DPLASMA vs. peak

  • Slide 25

People and Projects
  MAGMA, PLASMA: Jack Dongarra, Mathieu Faverge, Jakub Kurzak, Hatem Ltaief, Stan Tomov, Asim YarKhan, Peng Du, Rick Weber
  DPLASMA and DAGuE: George Bosilca, Aurelien Bouteiller, Anthony Danalis, Thomas Herault, Pierre Lemarinier
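
The simple triple-loop form from the Mat-Mat-Mul slide translates directly to C. The sketch below is a minimal, naive version; the matrix dimensions M, N, T, the row-major layout, and the function name are illustrative, and this plain form is the baseline that tuned CUBLAS/MAGMA-style kernels improve upon with blocking and autotuning:

```c
#define M 2
#define N 2
#define T 3

/* Naive triple-loop matrix multiply, C += A * B:
   A is M x T, B is T x N, C is M x N (0-based, row-major). */
static void matmul(double A[M][T], double B[T][N], double C[M][N]) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < T; k++)
                C[i][j] += A[i][k] * B[k][j];
}
```

The loop order (i, j, k) matches the slide; real kernels reorder and tile these loops for locality, which is exactly the compute-bound corner of the locality space discussed earlier.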
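
The "From Assembly to DAG" slide contrasts a sequential instruction stream with a sequential task stream that a DAG runtime reschedules by data dependences rather than program order. A minimal sketch of that task stream in C: instead of executing GETRF2/TRSM/GEMM, it only records them in program order, which is what a PLASMA-style runtime would consume. The tile count NT, the string-based task log, and the function names are illustrative, not the PLASMA API:

```c
#include <stdio.h>

/* Enumerate the sequential task stream of the tile LU loop over an
   NT x NT tile matrix. A DAG runtime would take these tasks and
   schedule them by data dependences instead of program order. */
#define NT 2
#define MAXTASKS 16

static int ntasks;
static char tasks[MAXTASKS][32];

/* Record one task, e.g. "TRSM(0,1)", in submission order. */
static void emit(const char *name, int a, int b) {
    snprintf(tasks[ntasks++], sizeof tasks[0], "%s(%d,%d)", name, a, b);
}

static void tile_lu(void) {
    ntasks = 0;
    for (int i = 0; i < NT; i++) {
        emit("GETRF2", i, i);            /* factor diagonal tile A[i][i]      */
        for (int j = i + 1; j < NT; j++) {
            emit("TRSM", i, j);          /* solve with A[i][i] into A[i][j]   */
            for (int k = i + 1; k < NT; k++)
                emit("GEMM", k, j);      /* update A[k][j] with A[k][i]*A[i][j] */
        }
    }
}
```

For NT = 2 the stream is GETRF2(0,0), TRSM(0,1), GEMM(1,1), GETRF2(1,1); the DAG scheduler's job is to notice, for larger NT, that many of these tasks touch disjoint tiles and can run concurrently.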