TRANSCRIPT
Automated Code Generation and Optimization for GPU Kernels
GTC, May 2012
Alexey Titov, Ivan Ufimtsev, Nathan Luehr and Todd Martinez
Department of Chemistry
Stanford University
GPU computing ecosystem evolution (2007 - 2012)
NVIDIA: G80, GT200, Fermi, Kepler
AMD: Cayman, Tahiti
Intel: MIC
TeraChem: selected feature list
• Restricted, unrestricted, and restricted open-shell Hartree-Fock and grid-based Kohn-Sham energy and gradient calculations
• Full support of s-, p-, and d-type basis functions
• Various DFT functionals, including range-corrected and Coulomb-attenuated functionals (BLYP, B3LYP, PBE, PBE0, ωPBE, ωPBEh, ωB97, ωB97x, camB3LYP, etc.) and DFT grids (800 - 80,000 grid points per atom)
• Static and dynamical DFT grids
• Empirical dispersion correction (DFT-D3 and DFT-D2)
• Geometry optimization (L-BFGS, conjugate gradient, steepest descent) and transition state search
• Reaction path and transition state search (through DL-FIND, Kastner)
• Ab initio molecular dynamics (NVE, NVT ensembles)
• Time-reversible Born-Oppenheimer dynamics
• Spherical boundary conditions
• Support for multiple-GPU systems
• Single/dynamical/double precision accuracy
• QM/MM treatment of surrounding water molecules using the TIP3P force field
• QM/MM with TeraChem/Amber (w/ Ross Walker, UCSD/SDSC)
• Natural bond orbital analysis through integration with NBO6
• Polarizabilities for HF and closed-shell DFT methods
Quantum chemistry with TeraChem (2009 - 2011)
3 journal covers, 8 peer-reviewed papers, 4000+ downloads of free beta
Riding Advances in GPU Hardware:
• Can codes be easily retuned for new and older architectures for better performance?
• How can transitions between architectures (e.g., Fermi -> Kepler) be simplified?
• How can complex kernels be implemented so that they perform efficiently on GPUs?
• What about other hardware architectures (Cayman, Tahiti, MIC, etc.)?
[Figure: axes "Molecule size" and "Increase computational capabilities"; annotation "+ 26 more elements"]
Managing d-functions
• Increased number of kernels to calculate electron repulsion integrals over Gaussian-type orbitals χ(r): J kernels grow from 9 to 36, K kernels from 10 to 45
• Increased depth of calculation for the J kernel:
  ssss integrals batch: 63 lines of code (30 flops)
  pppp integrals batch: 306 lines of code (387 flops)
  dddd integrals batch: 2094 lines of code (3584 flops)
Our ideal case: automate kernel generation and optimization
The electron repulsion integral over basis functions χ(r):
(\mu\nu|\lambda\sigma) = \iint \chi_\mu(\mathbf{r}_1)\,\chi_\nu(\mathbf{r}_1)\,\frac{1}{|\mathbf{r}_1 - \mathbf{r}_2|}\,\chi_\lambda(\mathbf{r}_2)\,\chi_\sigma(\mathbf{r}_2)\,d\mathbf{r}_1\,d\mathbf{r}_2
Opening ‘combination’ lock for multiple targets
voidJSSSPclne = voidJSSSPclne;
dbllData0 = 0.0e0;
loopentry = loopentry;
Rs = Rlev(R0000, R0001);
gamm = Gamma1(R0000, R0001, T);
R0001multi = -0.20e1 * alp1;
fltR[-1][-1][0][-1] = c * R[-1][-1][-1][0];
fltR[-1][0][-1][-1] = b * R[-1][-1][-1][0];
fltR[0][-1][-1][-1] = a * R[-1][-1][-1][0];
tmp = Temp(tmp0);
blockindex = 0; collect0 = R[-1][-1][-1][-1] * P0;
blockindex = 1; collect0minus = R[0][-1][-1][-1] * P1;
blockindex = 2; collect0minus = R[-1][0][-1][-1] * P2;
blockindex = 3; collect0minus = R[-1][-1][0][-1] * P3;
Lambda = Lambda;
lData0plus = tmp0 * Lambda;
gthidXplus = BSIZEX;
logg = logg;
lData0contraction1 = clps;
clse = clse;
Batch of integrals going through the generation pipeline (tools: Maple, sed):
intermediate representation -> C++/CUDA
Autogenerated J kernel example: dddd batch
…
for ( [bra • ket] > ε ) {
    // load data
    Gamma8(…)  // calculate a, b, c and auxiliary functions R000j
    float R0010 = c * R0001;
    …
    float R3000 = a * R2001 + 2.0f * R1001;
    …
    float R0080 = c * R0071 + 7.0f * R0061;
    float P0 = fetch_data(preproP, g_thidX);
    tmp0 += R0000 * P0;
    tmp1 += R1000 * P0;
    …
    tmp34 += R0040 * P0;
    float P34 = fetch_data(preproP, g_thidX + ne*34);
    …
    // accumulate tmps in DP
}
// collect integrals and upload to global memory
[Kernel listing annotations: 35 lines; × 35 = 1300 lines; 486 lines; total 2090 lines]
Per-thread cost: 1880 bytes (registers + local memory); Mops: 47; Flops: 3583
[Figure: JDDDD performance. Series: GPU vs. i7 (1 core); GPU vs. i7 (8 cores, SSE); GPU orig.; GPU orig. + volatile. Annotations: ~0.63, 3×, ~13.33]
Code variant #1
Legend:
  Red: density matrix elements
  RXXXX: variables containing values of auxiliary functions
  Blue: Hermite expansion coefficients (ket pair)
  Green: Hermite expansion coefficients (bra pair)
  Bold: density contracted with ket coefficients
  Italic: intermediates

float R1000 = a * R0001;
tmp_0 += R1000 * (
    (-PBx * (Pxz * PAx + PAz * Pzz + PAy * Pyz) * QDz
     - PBx * (PAy * Pyx + PAz * Pzx + Pxx * PAx) * QCx
     - PBx * (PAy * Pyx + PAz * Pzx + Pxx * PAx) * QDx
     - PBx * (PAz * Pzy + PAy * Pyy + Pxy * PAx) * QDy) * rtaq
  + ((-Pxz * QDz + PAz * Pzx - Pxy * QDy - Pxx * QDx - Pxx * QCx
      + Pxx * PBx + Pxx * PAx + PAy * Pyx) * rtaq
   + ((PAy * Pyx + PAz * Pzx + Pxx * PBx + Pxx * PAx) * QDx
    + (Pxy * PAx + PAz * Pzy + PAy * Pyy + Pxy * PBx) * QDy
    + (PAy * Pyz + Pxz * PAx + PAz * Pzz + Pxz * PBx) * QDz) * QCx) * rtap);
Architecture tuning: Empirically test different pathways
float t650 = Pxy * R2100 + Pxz * R2010 + Pxx * R3000;
float t652 = Pxy * R1100 + Pxz * R1010 + Pxx * R2000;
float t639 = t650 * rtap + t652 * PAx;
float t658 = -rtap * R2000 - PAx * R1000;
float t660 = Pxx * QDx + QDy * Pxy + QDz * Pxz;
float t659 = -Pxx * R1000 - Pxy * R0100 - Pxz * R0010;
float t656 = rtap * R1000 + PAx * R0000;
float t641 = t656 * PBx + (R0000 - t658) * rtap;
float t640 = -t652 * rtap + t659 * PAx;
float t624 = t640 * PBx + (t659 - t639) * rtap;
tmp_0 += ((t639 * PBx
           + ((Pxy * R3100 + Pxz * R3010 + Pxx * R4000) * rtap
              + t650 * PAx + t652) * rtap) * rtaq
          + t624 * QCx + t641 * Pxx) * rtaq
       + t660 * ((t658 * PBx
                  + (-rtap * R3000 - PAx * R2000 - R1000) * rtap) * rtaq
                 + t641 * QCx);
Code variant #2
float PP_0 = Pxx * QDx + Pxy * QDy + Pxz * QDz;
float PP_1 = Pxx * rtaq;
float PP_2 = Pxy * rtaq;
float PP_3 = Pxz * rtaq;
tmp_0 += PBx * (
      PAx * ( QCx * (R0000*PP_0 - R1000*PP_1 - R0100*PP_2 - R0010*PP_3)
            - rtaq * (R1000*PP_0 - R2000*PP_1 - R1100*PP_2 - R1010*PP_3)
            + R0000 * PP_1)
    + rtap * ( QCx * (R1000*PP_0 - R2000*PP_1 - R1100*PP_2 - R1010*PP_3)
             - rtaq * (R2000*PP_0 - R3000*PP_1 - R2100*PP_2 - R2010*PP_3)
             + R1000 * PP_1))
  + rtap * (
      PAx * ( QCx * (R1000*PP_0 - R2000*PP_1 - R1100*PP_2 - R1010*PP_3)
            - rtaq * (R2000*PP_0 - R3000*PP_1 - R2100*PP_2 - R2010*PP_3)
            + R1000 * PP_1)
    + rtap * ( QCx * (R2000*PP_0 - R3000*PP_1 - R2100*PP_2 - R2010*PP_3)
             - rtaq * (R3000*PP_0 - R4000*PP_1 - R3100*PP_2 - R3010*PP_3)
             + R2000 * PP_1))
  + rtap * ( QCx * (R0000*PP_0 - R1000*PP_1 - R0100*PP_2 - R0010*PP_3)
           - rtaq * (R1000*PP_0 - R2000*PP_1 - R1100*PP_2 - R1010*PP_3)
           + R0000 * PP_1);
Code variant #3
Colors: different auto-generated kernels
Empirical testing of code variants
Code variant | C1060 timing (ms) | C2050 timing (ms) | Registers^a | FLOPs^b
1            | 1025.26           |  822.57           | 115         | 1049
2            | 1042.57           |  823.99           | 115         | 1083
3            | 1112.64           |  988.97           | 114         | 1218
4            | 1117.36           | 1151.79           | 120         | 2124
5            | 2303.17           | 2511.44           | 145         | 1185
6            | 2523.15           | 2780.31           | 171         | 2012
7            | 2077.94           | 2852.86           | 141         | 1931
[Diagram: two software stacks. Conventional: C/C++ / CUDA / OpenCL -> assembly language -> hardware. Proposed: a Computer Algebra System layered on top of C/C++ / CUDA / OpenCL -> assembly language -> hardware.]
Development & execution stages
[Diagram: two workflows, each with data input, a numerical part, an algebraic part, and data output.
 (a) Computation flow expressed and computed in the C language.
 (b) Computational flow designed algebraically and then expressed and computed in the C language.]
Conclusions
• Performance is sensitive to architecture-specific optimizations.
• There is no direct, meaningful relationship between performance and FLOP count on GPUs.
• Automatic code generation and performance tuning provide code portability and enable performance portability across architectures, whether from the same vendor or different ones.
Funding: STTR - AFOSR
Acknowledgements: Jason Quenneville (Spectral Sciences), Vlad Kindratenko, Guochun Shi
Ivan Ufimtsev, "The Boss", Nathan Luehr
Not shown: Jeff Gour, Ed Hohenstein