VASP Accelerated with GPUs: Capabilities, Methods, and Road-Map
Max Hutchinson
University of Chicago; Carnegie Mellon University
GTC, May 17th, 2012
Max Hutchinson (UChicago and CMU) GPU VASP GTC 5/17/12 1 / 44
Acknowledgements
The rest of our team:
Michael Widom
James Komianos
The real VASP team:
Georg Kresse
Martijn Marsman
Jurgen Hafner
This work was supported by the PETTT project PP-CCM-KY02-123-P3.
This research was supported in part by the National Science Foundation through TeraGrid resources provided by Pittsburgh Supercomputing Center.
References
M. Hutchinson, M. Widom, VASP on a GPU: Application to exact-exchange calculations of the stability of elemental boron, Computer Physics Communications, Volume 183, Issue 7, July 2012, Pages 1422-1426
Table of Contents
1 Context
    Motivating Science
    DFT and VASP
2 Capabilities and Performance
    Low-Level Ports
    High-Level Ports
    System Capabilities, Requirements
3 Design Decisions and Methods
    Guiding Principles
    Development Cycle
    Examples
    Tips
4 Road-Map
    Our plans
    Your part
Context
Context Motivating Science
Quantum Chemistry, Hard Condensed Matter
Modern model for atomic physics has non-classical elements
Electron correlation, exchange energy
Discretization of energy, angular momentum
Practical understanding of some materials requires quantum models
Nano-scale electronics
Surface effects
High-resolution spectroscopy
Low-temperature structure
Context DFT and VASP
Scientific Perspective
Start by approximating n-body quantum system with the single-particle Kohn-Sham equation.
Density functional theory (DFT) approximates correlation and exchange energies as functionals of the electron density.
Functionals form a ‘ladder’ of increasing accuracy and computational cost.
Eigenvalue solvers then used to find the wave-functions.
One example: Boron
The low temperature structure of elemental boron is not known.
         E_βα    E_β′α
LDA     47.83    15.48
PBE     26.63    -0.17
PKZB    37.02     8.53
HF      46.74     8.06
Table: Structural energies (units meV/atom). Here β refers to the ideal hR105 structure, β′ refers to the 107 atom optimized variant of B.hR141. Energies of α are obtained from the super cell hR12x8. All values are given for the 3x3x3 k-point mesh.
Computational Perspective
DFT is nominally O(n^2 ln n) or O(n^3), depending on system size.
Exact-exchange is more expensive: O(n^3 ln n) or O(n^4).
Operations have high fine-grain data parallelism
BLAS
FFT
Scatter-Gather
Iterations are long (on the order of a second)
All adds up to a great GPU candidate
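As a rough worked illustration of what these scalings mean in practice (a sketch only: it takes the asymptotic forms at face value and ignores all prefactors), doubling the system size n multiplies the cost by:

```latex
\begin{align*}
O(n^3):\quad       & \frac{(2n)^3}{n^3} = 8, \\
O(n^4):\quad       & \frac{(2n)^4}{n^4} = 16, \\
O(n^2 \ln n):\quad & \frac{(2n)^2 \ln(2n)}{n^2 \ln n} = 4\left(1 + \frac{\ln 2}{\ln n}\right), \\
O(n^3 \ln n):\quad & \frac{(2n)^3 \ln(2n)}{n^3 \ln n} = 8\left(1 + \frac{\ln 2}{\ln n}\right).
\end{align*}
```

In each pair the exact-exchange form carries one extra factor of n, so it grows twice as fast under doubling as the corresponding DFT form.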
Capabilities and Performance
Capabilities and Performance Low-Level Ports
FFT Port
FFTs contribute 30-50% of CPU time.
FFT calls funneled through kernels (4 of them)
Previously used to switch between FFTW and custom FFTs
Simple copy, compute, copy-back used
Cores        CPU    + 1 GPU    Ratio
1        2749.54    1314.54      2.1
2        1224.20     723.58      1.7
4         665.72     418.05      1.6
8         410.93     321.26      1.3
Table: PdO benchmark (87 ions, 496 bands, 822 electrons) on Dirac (NERSC)
BLAS Port
BLAS calls contribute 15-40% of CPU time.
BLAS calls are made inline, but there aren’t too many important ones
Again, simple copy, compute, copy-back used
Performance was poor (20% worse), so this was abandoned early on.
Advances in CUBLAS might make this profitable
Capabilities and Performance High-Level Ports
Exact-Exchange (HF) Port
Hybrid functionals, or “exact-exchange,” are very intensive
> 98% of runtime
Factor of 2 in memory use
Includes ‘interaction’ between bands
Adds a linear order (one factor of n) to the previous complexities
VASP implementation is somewhat compartmentalized
Calls funnel through two routines
Once per k-point per iteration
HF Port Performance: Workstation vs Workstation
Structure              hR12               hR12x8                 hR105
Platform            cpu     gpu        cpu       gpu         cpu       gpu
FOCK_ACC (s)      409.9    59.9    5,093.8     387.3    10,467.2     487.8
FOCK_FORCE (s)    789.1   290.1   10,714.9   1,199.3    22,144.5   1,435.5
Other (s)          26.9    27.5      117.8     134.6       216.2     142.2
Overall (hr)       9.64    1.66     121.04      9.77      248.88     12.20
Speedup               5.82x             12.39x                 20.41x
Table: Run-times of components of VASP exact-exchange runs. Overall times are projected assuming a total of 5 ionic minimization steps and 75 electronic minimization steps. CPU runs are single-core and GPU runs are single-device.
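The caption does not spell out how the projection is formed. A minimal sketch, assuming FOCK_ACC is charged once per electronic step, FOCK_FORCE once per ionic step, and "Other" once per run; this weighting is an assumption, although it reproduces the tabulated Overall values to the two decimals shown. The function names are hypothetical.

```c
#include <assert.h>

/* Step counts stated in the table caption. */
#define ELECTRONIC_STEPS 75
#define IONIC_STEPS 5

/*
 * Projected overall run-time in hours from per-call timings in seconds.
 * ASSUMPTION (not stated in the talk): FOCK_ACC runs once per electronic
 * step, FOCK_FORCE once per ionic step, and "Other" is paid once.
 */
double projected_hours(double fock_acc_s, double fock_force_s, double other_s)
{
    double total_s = ELECTRONIC_STEPS * fock_acc_s
                   + IONIC_STEPS * fock_force_s
                   + other_s;
    return total_s / 3600.0;
}

/* Speedup of a GPU run over a CPU run, both given in hours. */
double speedup(double cpu_hours, double gpu_hours)
{
    assert(gpu_hours > 0.0);
    return cpu_hours / gpu_hours;
}
```

For the hR105 column, `projected_hours(10467.2, 22144.5, 216.2)` and `projected_hours(487.8, 1435.5, 142.2)` give about 248.88 and 12.20 hours, and their ratio is about 20.4.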
HF Port Performance: Workstation vs Supercomputer
Struct.   k     T-1C1G     T-2C2G      B-16C      B-32C      B-64C     B-128C
hR12      1       90.3       64.8       43.3       47.8       60.5      172.4
hR12x8    2    1,650.7      983.6    1,964.8    1,206.0    1,070.7    1,160.3
hR105     2    2,097.2    1,075.2    2,157.0    1,201.1    1,039.7    1,221.0
hR105     3   20,489.9   10,318.0   21,080.4   10,741.3    7,794.9    5,817.5
aP107     2    3,748.4    2,168.4    4,452.5    2,515.4    1,900.9    1,816.5
Table: Actual run-times of truncated runs (reduced NELM and NSW) of different structures on different platforms. T is tirith, B is blacklight; the attribute mCnG indicates m CPU cores and n GPU devices.
Capabilities and Performance System Capabilities, Requirements
Other Capabilities
Compute capability 2.0 or higher
Arbitrary CPU:GPU ratios
Round-robin
Uses File I/O (I’m sorry)
Mixed or full double precision
FFTs in single or double
Everything else in double
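The talk does not show how the round-robin assignment is implemented. A minimal sketch, assuming it means that CPU process i uses device i modulo the number of devices; the helper name `device_for_process` is hypothetical and is not taken from the port.

```c
#include <assert.h>

/*
 * Map a CPU process index to a GPU device index, round-robin.
 * HYPOTHETICAL helper: assumes processes are numbered from 0 and that
 * every device is equally suitable.
 */
int device_for_process(int process_index, int num_devices)
{
    assert(process_index >= 0);
    assert(num_devices > 0);
    return process_index % num_devices;
}
```

With 4 processes and 2 devices this yields devices 0, 1, 0, 1, so any CPU:GPU ratio is accepted. In a CUDA build the result would typically be passed to `cudaSetDevice`.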
Design Decisions and Methods
Design Decisions and Methods Guiding Principles
Guiding Principles
1 Performance: ultimately, this is our primary concern
    Intercept high in the call tree
    Write/use good kernels
2 Programmability: programmer time is a limited quantity
    Be maximally compartmental, minimally intrusive
    Don’t get too clever
3 Portability: why write something that can’t be used?
    Use standard languages (FORTRAN, C[, Python])
    Use standard libraries (CUBLAS, CUFFT)
    Don’t add system assumptions
[Figure: development cycle. Stages shown: CPU (Profile, Optimize), Translate, GPU (Profile, Optimize, Validate, Debug)]
Design Decisions and Methods Development Cycle
Incremental Ports
Our technique has been to climb up callgraphs.
Pros:
Important work is done first
Debugging is [more] palatable
Provides rough numerical validation
Cons:
Divergent efforts can require merges
Inherit high-level structure from CPU code
Perturbation method.
Intercepts
#ifdef CUDA
/* Assumptions */
USE_CUDA = (condition1 && condition2 && ...);
if (USE_CUDA) {
    fun_cu(foo, bar);  // intercept (not a kernel)
} else {
#endif

/* Function to be intercepted */
fun(foo, bar);

#ifdef CUDA
}
#endif
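The pattern above can be made concrete with a small host-only sketch. Everything here is illustrative: `scale_add`, `scale_add_cu`, `OFFLOAD_MIN_N`, and `SKETCH_USE_ACCEL` are hypothetical names, the single size check stands in for the unspecified conditions, and the "accelerated" wrapper simply calls the CPU routine so that the example runs without a GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Build-time switch, playing the role of the CUDA macro (hypothetical name). */
#ifndef SKETCH_USE_ACCEL
#define SKETCH_USE_ACCEL 1
#endif

/* Hypothetical threshold: below this size offloading is assumed not to pay off. */
#define OFFLOAD_MIN_N 1024

/* Counts how often the accelerated path was taken (for illustration only). */
int accelerated_calls = 0;

/* Function to be intercepted: y[i] += a * x[i] on the CPU. */
void scale_add(size_t n, double a, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* Intercept (not a kernel). A real port would copy data to the device and
 * launch kernels here; this placeholder reuses the host arithmetic. */
void scale_add_cu(size_t n, double a, const double *x, double *y)
{
    ++accelerated_calls;
    scale_add(n, a, x, y);
}

/* Call site: take the intercept only when the assumptions hold. */
void scale_add_dispatch(size_t n, double a, const double *x, double *y)
{
#if SKETCH_USE_ACCEL
    int use_accel = (n >= OFFLOAD_MIN_N);   /* assumptions */
    if (use_accel) {
        scale_add_cu(n, a, x, y);
        return;
    }
#endif
    scale_add(n, a, x, y);
}
```

Compiling with `-DSKETCH_USE_ACCEL=0` removes the intercept entirely, which keeps the original call path untouched when the accelerated build is not wanted.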
Validation
./vasp_test.py -e ../exes/vasp-pgk -t PdO-v/ -n 1
======================================================
Test Name: PdO-v/
Run on: 2012.05.16 In: ./tests/3F0T
Result Parameter Test vs Expected
------------------------------------------------------
passed energy -5.725595e+02 vs -5.725596e+02
passed ext. pressure -5.922600e+02 vs -5.922700e+02
passed volume 1.895670e+03 vs 1.895670e+03
passed stress (xx) -7.125278e+02 vs -7.125343e+02
passed stress (yy) -7.009820e+02 vs -7.010164e+02
passed stress (zz) -6.887410e+02 vs -6.887406e+02
passed stress (xy) 0.000000e+00 vs 0.000000e+00
passed stress (yz) 0.000000e+00 vs 0.000000e+00
passed stress (zx) 6.749650e+00 vs 6.753080e+00
------------------------------------------------------
1.44x loop time 179.9700 vs 258.5200
------------------------------------------------------
0.95x setdij time 0.1900 vs 0.1800
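The pass criterion used by vasp_test.py is not shown. A minimal sketch of one plausible check, assuming a relative tolerance (the value 1e-3 used in the note below is an assumption, not the script's setting) and assuming the timing factor is expected time divided by test time, which matches the 1.44x and 0.95x figures above. Function names are hypothetical.

```c
#include <assert.h>

/* Absolute value without pulling in the math library. */
static double abs_d(double v)
{
    return v < 0.0 ? -v : v;
}

/*
 * Returns 1 if test and expected agree to within rel_tol, relative to the
 * larger magnitude of the two. Two exact zeros count as agreement.
 * ASSUMPTION: the real script's criterion and tolerance are not shown.
 */
int within_rel_tol(double test, double expected, double rel_tol)
{
    assert(rel_tol >= 0.0);
    double scale = abs_d(test) > abs_d(expected) ? abs_d(test) : abs_d(expected);
    if (scale == 0.0)
        return 1;
    return abs_d(test - expected) <= rel_tol * scale;
}

/* Timing factor as printed in the report: expected time over test time. */
double time_factor(double test_s, double expected_s)
{
    assert(test_s > 0.0);
    return expected_s / test_s;
}
```

With a tolerance of 1e-3, the stress (zx) pair 6.749650 vs 6.753080 passes, and `time_factor(179.97, 258.52)` is about 1.44.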
CUDA Profiler
Tot Num Avg %
"method=": 4105133156.2 22642281 181.3 100.0
A_kernel: 1000726933.4 823872 1214.7 24.4
gemm: 747718784.3 9121548 82.0 18.2
double_: 369791403.9 62552 5911.7 9.0
crrexp_mul_wave_k: 339633562.0 4056780 83.7 8.3
aug_charge_trace_k: 326918471.3 15632 20913.4 8.0
mul_vec_k: 289512103.3 13076 22140.7 7.1
4charge_trace_k: 239677493.5 15632 15332.5 5.8
racc0_combine_k: 217330923.3 1641360 132.4 5.3
calc_dllmm_k: 214525210.6 38636 5552.5 5.2
apply_gfac_der_k: 85949460.5 552096 155.7 2.1
apply_gfac_k: 86839195.7 15632 5555.2 2.1
eccp_nl_fock: 68893474.7 23004 2994.8 1.7
memcpy: 32428350.8 278353 116.5 0.8
rpro1_combine_k: 24683417.5 4056780 6.1 0.6
split_complex_k: 15031156.0 1641360 9.2 0.4
else: 44754728.3 194852 229.7 1.1
CUDA Profiler
Tot Num Avg %
"method=": 16699996.2 457575 36.5 100.0
memcpy: 7460230.4 117277 63.6 44.7
A_kernel: 6306412.3 246386 25.6 37.8
B_kernel: 1106599.3 35198 31.4 6.6
memset32: 443551.5 23516 18.9 2.7
else: 1383202.7 35198 39.3 8.3
gemm: 0.0 0 0.0 0.0
crrexp_mul_wave_k: 0.0 0 0.0 0.0
racc0_combine_k: 0.0 0 0.0 0.0
4charge_trace_k: 0.0 0 0.0 0.0
aug_charge_trace_k: 0.0 0 0.0 0.0
apply_gfac_der_k: 0.0 0 0.0 0.0
apply_gfac_k: 0.0 0 0.0 0.0
eccp_nl_fock: 0.0 0 0.0 0.0
double_: 0.0 0 0.0 0.0
mul_vec_k: 0.0 0 0.0 0.0
rpro1_combine_k: 0.0 0 0.0 0.0
split_complex_k: 0.0 0 0.0 0.0
Design Decisions and Methods Examples
“Persistent” pointers
/** void pointer */
typedef struct void_p {
    unsigned int size;
    void* ptr;
} void_p;

/** double pointer */
typedef struct double_p {
    unsigned int size;
    double* ptr;
} double_p;
/** Assign a chunk of GPU mem to a chunk of CPU mem */
static inline void assign_cu(void_p* dest,       //!< destination
                             void* src,          //!< source
                             unsigned int size   //!< size (in bytes)
                             ){
    /* Do we need to resize? */
    if (dest->ptr == NULL || dest->size < size){
        if (dest->ptr != NULL)
            cudaFree(dest->ptr);
        cudaMalloc((void**)&dest->ptr, size);
        dest->size = size;
    }
    /* Do the actual copy */
    cudaMemcpy(dest->ptr, src, size, cudaMemcpyHostToDevice);
}
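The point of the "persistent" pointer is that the buffer is reallocated only when a request outgrows it. A host-only sketch of the same grow-only reuse, using `malloc` and `memcpy` as stand-ins for `cudaMalloc` and `cudaMemcpy` so that it runs without a GPU; `host_buf`, `assign_host`, and `host_buf_allocs` are hypothetical names, and the return code is an addition not present in the routine above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Host-side analogue of void_p: a buffer that remembers its capacity. */
typedef struct host_buf {
    size_t size;  /* capacity in bytes */
    void  *ptr;   /* NULL until first use */
} host_buf;

/* Number of allocations performed, to make the reuse visible. */
int host_buf_allocs = 0;

/*
 * Copy size bytes from src into dest, growing dest only when it is too
 * small. Returns 0 on success and -1 if the allocation fails.
 */
int assign_host(host_buf *dest, const void *src, size_t size)
{
    assert(dest != NULL);
    if (size == 0)
        return 0;                     /* nothing to copy */
    /* Do we need to resize? */
    if (dest->ptr == NULL || dest->size < size) {
        free(dest->ptr);              /* free(NULL) is a no-op */
        dest->ptr = malloc(size);     /* stand-in for cudaMalloc */
        if (dest->ptr == NULL) {
            dest->size = 0;
            return -1;
        }
        dest->size = size;
        ++host_buf_allocs;
    }
    /* Do the actual copy */
    memcpy(dest->ptr, src, size);     /* stand-in for cudaMemcpy */
    return 0;
}
```

Calling `assign_host` first with a large array and then with a smaller one performs a single allocation; the second call reuses the existing buffer, which is the saving the pattern is after when the same routine is hit once per k-point per iteration.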
Structs
typedef struct four_vector {
    int t;
    int x;
    int y;
    int z;
} four_vector;
four_vector events[N];

Improves locality for elemental functions. Mechanism is deep memory caches.

typedef struct four_vectors {
    int t[N];
    int x[N];
    int y[N];
    int z[N];
} four_vectors;
four_vectors events;

Improves memory bandwidth for vector functions. Mechanism is wide memory bus.
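A small host-only sketch of the two layouts and the access pattern each one favors. The type and function names (`event_aos`, `events_soa`, `sum_components`, `sum_t`, `aos_to_soa`) and the array length are hypothetical; the sketch only illustrates which memory each kind of function touches and makes no performance measurement.

```c
#include <assert.h>
#include <stddef.h>

#define N_EVENTS 4

/* Array-of-structs element: the four fields of one event are adjacent. */
typedef struct event_aos {
    int t, x, y, z;
} event_aos;

/* Struct-of-arrays: each field is contiguous across all events. */
typedef struct events_soa {
    int t[N_EVENTS];
    int x[N_EVENTS];
    int y[N_EVENTS];
    int z[N_EVENTS];
} events_soa;

/* Elemental function: reads every field of ONE event (suits AoS). */
int sum_components(const event_aos *e)
{
    return e->t + e->x + e->y + e->z;
}

/* Vector function: reads ONE field of every event (suits SoA),
 * walking memory with unit stride. */
long sum_t(const events_soa *s)
{
    long acc = 0;
    for (size_t i = 0; i < N_EVENTS; ++i)
        acc += s->t[i];
    return acc;
}

/* Convert N_EVENTS array-of-structs entries into struct-of-arrays form. */
void aos_to_soa(const event_aos *in, events_soa *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < N_EVENTS; ++i) {
        out->t[i] = in[i].t;
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}
```

On a GPU, consecutive threads reading `s->t[i]` for consecutive `i` access adjacent addresses, which is the wide-bus case the second layout targets.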
Design Decisions and Methods Tips
Intercepts vs Overhauls
Intercepts and overhauls have the same theoretical peak performance.
Maximal intercept is 2 codes
One is usually easier than the other.
Difficulty of intercepts is governed by
Loop position: must intercept above fine-grain loops
Data structures: must pass data and context to GPU
Difficulty of overhauls is governed by
Size, complexity of auxiliary code
‘State’ of the original code
Overhaul has side-benefits. Intercepts have side-costs.
Road-Map
Road-Map Our plans
Non-HF Port
Port will use the same scheme as HF port
Climbing up many of the non-HF versions of CPU routines
Trying to get all the way up to minimization routine (e.g. RMM-DIIS)
You can expect performance approaching HF performance
Less parallelism for systems of the same size
More rapid iteration
Mitigated by larger quantum systems
Our goal is beta by sometime this summer
Merge with VASP Core
Our code is generally available to VASP license holders
Must request access through Vienna
Distribution through our website and git repo
This scheme is inadequate (doesn’t scale).
We hope to put the ports in VASP 5.3, which will have some other architectural changes.
Road-Map Your part
Wish List
Users, to do science
It’s all about science
Find the kinks in our implementation
Input, to direct effort and validate results
Scientifically relevant systems
Requests for functionality
Effort, to write the ports
Current VASP users with time to contribute
VASP is a large code
Conclusions
We’ve ported HF functionality in VASP to CUDA.
Up to 20x performance over single core
Up to 64 core performance compared to supercomputers
Callgraph climbing port method is effective
Accelerate specific functionality of large codes
Can inform future decisions about dedicated ports
Accelerating scientific codes enables new science.
Thank you
Questions?