TRANSCRIPT
Automated Code Generation and Optimization for GPU Kernels
GTC, May 2012
Alexey Titov, Ivan Ufimtsev, Nathan Luehr and Todd Martinez
Department of Chemistry
Stanford University
GPU computing ecosystem evolution (2007 - 2012)
NVIDIA: G80, GT200, Fermi, Kepler
AMD: Cayman, Tahiti
Intel: MIC
TeraChem: selected feature list
• Restricted, unrestricted, and restricted open-shell Hartree-Fock and grid-based Kohn-Sham energy and gradient calculations
• Full support of s-, p-, and d-type basis functions
• Various DFT functionals, including range-corrected and Coulomb-attenuated functionals (BLYP, B3LYP, PBE, PBE0, ωPBE, ωPBEh, ωB97, ωB97x, camB3LYP, etc.) and DFT grids (800 - 80,000 grid points per atom)
• Static and dynamical DFT grids
• Empirical dispersion correction (DFT-D3 and DFT-D2)
• Geometry optimization (L-BFGS, conjugate gradient, steepest descent) and transition state search
• Reaction path and transition state search (through DL-FIND, Kastner)
• Ab initio molecular dynamics (NVE, NVT ensembles)
• Time-reversible Born-Oppenheimer dynamics
• Spherical boundary conditions
• Support for multiple-GPU systems
• Single/dynamical/double precision accuracy
• QM/MM treatment of surrounding water molecules using the TIP3P force field
• QM/MM with TeraChem/Amber (w/ Ross Walker, UCSD/SDSC)
• Natural bond orbital analysis through integration with NBO6
• Polarizabilities for HF and closed-shell DFT methods
Quantum chemistry with TeraChem (2009 - 2011)
3 journal covers, 8 peer-reviewed papers, 4000+ downloads of free beta
Riding Advances in GPU Hardware:
• Can codes be easily retuned for new and older architectures for better performance?
• How can transitions between architectures (e.g., Fermi -> Kepler) be simplified?
• How can complex kernels be implemented so that they perform efficiently on GPUs?
• What about other hardware architectures (Cayman, Tahiti, MIC, etc.)?
[Figure: axes "Molecule size" and "Increase computational capabilities"; annotation "+ 26 more elements"]
Managing d-functions
• Increased number of kernels to calculate electron repulsion integrals over Gaussian-type orbitals χ(r): J kernels grow from 9 to 36, K kernels from 10 to 45
• Increased depth of calculation for the J kernel:
  ssss integrals batch: 63 lines of code (30 flops)
  pppp integrals batch: 306 lines of code (387 flops)
  dddd integrals batch: 2094 lines of code (3584 flops)
Our ideal case: automate kernel generation and optimization
The electron repulsion integral over basis functions χ(r):
(\mu\nu|\lambda\sigma) = \iint \chi_\mu(\mathbf{r}_1)\,\chi_\nu(\mathbf{r}_1)\,\frac{1}{|\mathbf{r}_1 - \mathbf{r}_2|}\,\chi_\lambda(\mathbf{r}_2)\,\chi_\sigma(\mathbf{r}_2)\,d\mathbf{r}_1\,d\mathbf{r}_2
Opening ‘combination’ lock for multiple targets
voidJSSSPclne = voidJSSSPclne;
dbllData0 = 0.0e0;
loopentry = loopentry;
Rs = Rlev(R0000, R0001);
gamm = Gamma1(R0000, R0001, T);
R0001multi = -0.20e1 * alp1;
fltR[-1][-1][0][-1] = c * R[-1][-1][-1][0];
fltR[-1][0][-1][-1] = b * R[-1][-1][-1][0];
fltR[0][-1][-1][-1] = a * R[-1][-1][-1][0];
tmp = Temp(tmp0);
blockindex = 0; collect0 = R[-1][-1][-1][-1] * P0;
blockindex = 1; collect0minus = R[0][-1][-1][-1] * P1;
blockindex = 2; collect0minus = R[-1][0][-1][-1] * P2;
blockindex = 3; collect0minus = R[-1][-1][0][-1] * P3;
Lambda = Lambda;
lData0plus = tmp0 * Lambda;
gthidXplus = BSIZEX;
logg = logg;
lData0contraction1 = clps;
clse = clse;
Batch of integrals going through the generation pipeline (tools: Maple, sed):
intermediate representation -> C++/CUDA
Autogenerated J kernel example: dddd batch
…
for ( [bra • ket] > ε ) {
    // load data
    Gamma8(…)  // calculate a, b, c and auxiliary functions R000j
    float R0010 = c * R0001;
    …
    float R3000 = a * R2001 + 2.0f * R1001;
    …
    float R0080 = c * R0071 + 7.0f * R0061;
    float P0 = fetch_data(preproP, g_thidX);
    tmp0 += R0000 * P0;
    tmp1 += R1000 * P0;
    …
    tmp34 += R0040 * P0;
    float P34 = fetch_data(preproP, g_thidX + ne*34);
    …
    // accumulate tmps in DP
}
// collect integrals and upload to global memory
[Kernel listing annotations: 35 lines; × 35 = 1300 lines; 486 lines; total 2090 lines]
Per-thread cost: 1880 bytes (registers + local memory); Mops: 47; Flops: 3583
[Figure: JDDDD performance. Series: GPU vs. i7 (1 core); GPU vs. i7 (8 cores, SSE); GPU orig.; GPU orig. + volatile. Annotations: ~0.63, 3×, ~13.33]
Code variant #1
Legend:
  Red: density matrix elements
  RXXXX: variables containing values of auxiliary functions
  Blue: Hermite expansion coefficients (ket pair)
  Green: Hermite expansion coefficients (bra pair)
  Bold: density contracted with ket coefficients
  Italic: intermediates

float R1000 = a * R0001;
tmp_0 += R1000 * (
    (-PBx * (Pxz * PAx + PAz * Pzz + PAy * Pyz) * QDz
     - PBx * (PAy * Pyx + PAz * Pzx + Pxx * PAx) * QCx
     - PBx * (PAy * Pyx + PAz * Pzx + Pxx * PAx) * QDx
     - PBx * (PAz * Pzy + PAy * Pyy + Pxy * PAx) * QDy) * rtaq
  + ((-Pxz * QDz + PAz * Pzx - Pxy * QDy - Pxx * QDx - Pxx * QCx
      + Pxx * PBx + Pxx * PAx + PAy * Pyx) * rtaq
   + ((PAy * Pyx + PAz * Pzx + Pxx * PBx + Pxx * PAx) * QDx
    + (Pxy * PAx + PAz * Pzy + PAy * Pyy + Pxy * PBx) * QDy
    + (PAy * Pyz + Pxz * PAx + PAz * Pzz + Pxz * PBx) * QDz) * QCx) * rtap);
Architecture tuning: Empirically test different pathways
float t650 = Pxy * R2100 + Pxz * R2010 + Pxx * R3000;
float t652 = Pxy * R1100 + Pxz * R1010 + Pxx * R2000;
float t639 = t650 * rtap + t652 * PAx;
float t658 = -rtap * R2000 - PAx * R1000;
float t660 = Pxx * QDx + QDy * Pxy + QDz * Pxz;
float t659 = -Pxx * R1000 - Pxy * R0100 - Pxz * R0010;
float t656 = rtap * R1000 + PAx * R0000;
float t641 = t656 * PBx + (R0000 - t658) * rtap;
float t640 = -t652 * rtap + t659 * PAx;
float t624 = t640 * PBx + (t659 - t639) * rtap;
tmp_0 += ((t639 * PBx
           + ((Pxy * R3100 + Pxz * R3010 + Pxx * R4000) * rtap
              + t650 * PAx + t652) * rtap) * rtaq
          + t624 * QCx + t641 * Pxx) * rtaq
       + t660 * ((t658 * PBx
                  + (-rtap * R3000 - PAx * R2000 - R1000) * rtap) * rtaq
                 + t641 * QCx);
Code variant #2
float PP_0 = Pxx * QDx + Pxy * QDy + Pxz * QDz;
float PP_1 = Pxx * rtaq;
float PP_2 = Pxy * rtaq;
float PP_3 = Pxz * rtaq;
tmp_0 += PBx * (
      PAx * ( QCx * (R0000*PP_0 - R1000*PP_1 - R0100*PP_2 - R0010*PP_3)
            - rtaq * (R1000*PP_0 - R2000*PP_1 - R1100*PP_2 - R1010*PP_3)
            + R0000 * PP_1)
    + rtap * ( QCx * (R1000*PP_0 - R2000*PP_1 - R1100*PP_2 - R1010*PP_3)
             - rtaq * (R2000*PP_0 - R3000*PP_1 - R2100*PP_2 - R2010*PP_3)
             + R1000 * PP_1))
  + rtap * (
      PAx * ( QCx * (R1000*PP_0 - R2000*PP_1 - R1100*PP_2 - R1010*PP_3)
            - rtaq * (R2000*PP_0 - R3000*PP_1 - R2100*PP_2 - R2010*PP_3)
            + R1000 * PP_1)
    + rtap * ( QCx * (R2000*PP_0 - R3000*PP_1 - R2100*PP_2 - R2010*PP_3)
             - rtaq * (R3000*PP_0 - R4000*PP_1 - R3100*PP_2 - R3010*PP_3)
             + R2000 * PP_1))
  + rtap * ( QCx * (R0000*PP_0 - R1000*PP_1 - R0100*PP_2 - R0010*PP_3)
           - rtaq * (R1000*PP_0 - R2000*PP_1 - R1100*PP_2 - R1010*PP_3)
           + R0000 * PP_1);
Code variant #3
Colors: different auto-generated kernels
Empirical testing of code variants
Code variant | C1060 timing (ms) | C2050 timing (ms) | Registers^a | FLOPs^b
1            | 1025.26           |  822.57           | 115         | 1049
2            | 1042.57           |  823.99           | 115         | 1083
3            | 1112.64           |  988.97           | 114         | 1218
4            | 1117.36           | 1151.79           | 120         | 2124
5            | 2303.17           | 2511.44           | 145         | 1185
6            | 2523.15           | 2780.31           | 171         | 2012
7            | 2077.94           | 2852.86           | 141         | 1931
[Diagram: two software stacks. Conventional: C/C++ / CUDA / OpenCL -> assembly language -> hardware. Proposed: a Computer Algebra System layered on top of C/C++ / CUDA / OpenCL -> assembly language -> hardware.]
Development & execution stages
[Diagram: two workflows, each with data input, a numerical part, an algebraic part, and data output.
 (a) Computation flow expressed and computed in the C language.
 (b) Computational flow designed algebraically and then expressed and computed in the C language.]
Conclusions
• Performance is sensitive to architecture-specific optimizations.
• There is no direct, meaningful relationship between performance and FLOP count on GPUs.
• Automatic code generation and performance tuning provide code portability and enable performance portability across architectures, whether from the same vendor or different ones.
Funding: STTR - AFOSR
Acknowledgements: Jason Quenneville (Spectral Sciences), Vlad Kindratenko, Guochun Shi
Ivan Ufimtsev, "The Boss", Nathan Luehr
Not shown: Jeff Gour, Ed Hohenstein