
VASP Accelerated with GPUs: Capabilities, Methods, and Road-Map

Max Hutchinson

University of Chicago; Carnegie Mellon University

GTC, May 17th, 2012

Max Hutchinson (UChicago and CMU) GPU VASP GTC 5/17/12 1 / 44


Acknowledgements

The rest of our team:

Michael Widom

James Komianos

The real VASP team:

Georg Kresse

Martijn Marsman

Jürgen Hafner

This work was supported by the PETTT project PP-CCM-KY02-123-P3.

This research was supported in part by the National Science Foundation through TeraGrid resources provided by Pittsburgh Supercomputing Center.


References

M. Hutchinson, M. Widom, VASP on a GPU: Application to exact-exchange calculations of the stability of elemental boron, Computer Physics Communications, Volume 183, Issue 7, July 2012, Pages 1422-1426


Table of Contents

1 Context: Motivating Science; DFT and VASP

2 Capabilities and Performance: Low-Level Ports; High-Level Ports; System Capabilities, Requirements

3 Design Decisions and Methods: Guiding Principles; Development Cycle; Examples; Tips

4 Road-Map: Our plans; Your part


Context Motivating Science

Quantum Chemistry, Hard Condensed Matter

Modern model for atomic physics has non-classical elements

Electron correlation, exchange energy

Discretization of energy, angular momentum

Practical understanding of some materials requires quantum models

Nano-scale electronics

Surface effects

High-resolution spectroscopy

Low-temperature structure


Context DFT and VASP

Scientific Perspective

Start by approximating n-body quantum system with the single-particle Kohn-Sham equation.

Density functional theory (DFT) approximates correlation and exchange energies as functionals of the electron density.

Functionals form a ‘ladder’ of increasing accuracy and computational cost.

Eigenvalue solvers then used to find the wave-functions.
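For reference, the single-particle Kohn-Sham equation mentioned above is usually written in the following textbook form. This is not taken from the slides; here the symbol v_eff is assumed to collect the external, Hartree, and exchange-correlation potentials, and "occ" denotes the occupied states.

```latex
\left[ -\frac{\hbar^2}{2m}\nabla^2 + v_{\mathrm{eff}}[n](\mathbf{r}) \right] \psi_i(\mathbf{r})
  = \varepsilon_i \, \psi_i(\mathbf{r}),
\qquad
n(\mathbf{r}) = \sum_{i}^{\mathrm{occ}} \left| \psi_i(\mathbf{r}) \right|^2
```

Because v_eff depends on the density n, which in turn depends on the wave-functions, the equation is solved self-consistently, which is where the iterative eigenvalue solvers come in.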


One example: Boron

The low temperature structure of elemental boron is not known.

Functional   Eβα     Eβ′α
LDA          47.83   15.48
PBE          26.63   -0.17
PKZB         37.02    8.53
HF           46.74    8.06

Table: Table of structural energies (units meV/atom). Here β refers to the ideal hR105 structure, β′ refers to the 107 atom optimized variant of B.hR141. Energies of α are obtained from the super cell hR12x8. All values are given for the 3x3x3 k-point mesh.


Computational Perspective

DFT is nominally O(n^2 ln n) or O(n^3), depending on system size.

Exact-exchange is more expensive: O(n^3 ln n) or O(n^4).

Operations have high fine-grain data parallelism

BLAS

FFT

Scatter-Gather

Iterations are long (on the order of a second)

All adds up to a great GPU candidate
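To make the scaling figures above concrete, here is a minimal sketch that evaluates how the modelled cost grows when the system size n doubles. The functions `cost` and `growth` are illustrative helpers, not part of VASP, and the model ignores all constant factors.

```python
import math

def cost(n, power, with_log=False):
    """Relative cost model: n**power, optionally multiplied by ln(n)."""
    return n ** power * (math.log(n) if with_log else 1.0)

def growth(n_old, n_new, power, with_log=False):
    """Factor by which the modelled cost grows from n_old to n_new."""
    return cost(n_new, power, with_log) / cost(n_old, power, with_log)

if __name__ == "__main__":
    models = [
        ("DFT   O(n^2 ln n)", 2, True),
        ("DFT   O(n^3)", 3, False),
        ("Exact O(n^3 ln n)", 3, True),
        ("Exact O(n^4)", 4, False),
    ]
    for label, power, with_log in models:
        print(f"{label}: x{growth(100, 200, power, with_log):.2f} when n doubles")
```

Under this model, doubling n multiplies an O(n^3) cost by 8 and an O(n^4) cost by 16, which is why exact-exchange runs dominate so quickly.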


Capabilities and Performance Low-Level Ports

FFT Port

FFTs contribute 30-50% of CPU time.

FFT calls funneled through kernels (4 of them)

Previously used to switch between FFTW and custom FFTs

Simple copy, compute, copy-back used

Cores   CPU       + 1 GPU   Ratio
1       2749.54   1314.54   2.1
2       1224.20    723.58   1.7
4        665.72    418.05   1.6
8        410.93    321.26   1.3

Table: PdO benchmark (87 ions, 496 bands, 822 electrons) on Dirac (NERSC)
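The Ratio column of the PdO table can be reproduced as the CPU-only time divided by the time with one GPU added. The sketch below does that arithmetic; the name `PDO_RUNS` and the function `speedup_ratios` are made up for illustration, and the numbers are copied from the table.

```python
# (cores, CPU-only time, CPU + 1 GPU time), copied from the PdO table
PDO_RUNS = [
    (1, 2749.54, 1314.54),
    (2, 1224.20, 723.58),
    (4, 665.72, 418.05),
    (8, 410.93, 321.26),
]

def speedup_ratios(runs):
    """Map core count to CPU-only time / (CPU + 1 GPU) time, one decimal."""
    return {cores: round(cpu / gpu, 1) for cores, cpu, gpu in runs}

if __name__ == "__main__":
    for cores, ratio in speedup_ratios(PDO_RUNS).items():
        print(f"{cores} core(s): {ratio}x")
```

The ratio falls from 2.1 to 1.3 as more cores share the single GPU, so the FFT-only port pays off most at low core counts.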


BLAS Port

BLAS calls contribute 15-40% of CPU time.

BLAS calls are made inline, but there aren’t too many important ones

Again, simple copy, compute, copy-back used

Performance was poor (20% worse), so this was abandoned early on.

Advances in CUBLAS might make this profitable


Capabilities and Performance High-Level Ports

Exact-Exchange (HF) Port

Hybrid functionals, or “exact-exchange,” are very intensive

> 98% of runtime

Factor of 2 in memory use

Includes ‘interaction’ between bands

Adds a linear factor (one power of n) to the previous complexities

VASP implementation is somewhat compartmentalized

Calls funnel through two routines

Once per k-point per iteration


HF Port Performance: Workstation vs Workstation

Structure         hR12               hR12x8                 hR105
Platform          cpu      gpu       cpu        gpu         cpu        gpu
FOCK_ACC (s)      409.9    59.9      5,093.8    387.3       10,467.2   487.8
FOCK_FORCE (s)    789.1    290.1     10,714.9   1,199.3     22,144.5   1,435.5
Other (s)         26.9     27.5      117.8      134.6       216.2      142.2
Overall (hr)      9.64     1.66      121.04     9.77        248.88     12.20
Speedup                    5.82x                12.39x                 20.41x

Table: Run-times of components of VASP exact-exchange runs. Overall times are projected assuming a total of 5 ionic minimization steps and 75 electronic minimization steps. CPU runs are single-core and GPU runs are single-device.
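The projected Overall row appears to be reproducible from the component rows if FOCK_ACC is counted once per electronic step (75 times), FOCK_FORCE once per ionic step (5 times), and Other once. That weighting is an inference from the numbers, not something the slides state, so the sketch below should be read as a consistency check rather than the authors' formula.

```python
N_ELECTRONIC_STEPS = 75  # from the table caption
N_IONIC_STEPS = 5        # from the table caption

def projected_hours(fock_acc_s, fock_force_s, other_s):
    """Projected wall time in hours under the assumed per-step weighting."""
    total_s = (N_ELECTRONIC_STEPS * fock_acc_s
               + N_IONIC_STEPS * fock_force_s
               + other_s)
    return total_s / 3600.0

def speedup(cpu_components, gpu_components):
    """CPU projected time divided by GPU projected time."""
    return projected_hours(*cpu_components) / projected_hours(*gpu_components)

if __name__ == "__main__":
    hr105_cpu = (10467.2, 22144.5, 216.2)
    hr105_gpu = (487.8, 1435.5, 142.2)
    print(f"hR105 cpu: {projected_hours(*hr105_cpu):.2f} hr")
    print(f"hR105 gpu: {projected_hours(*hr105_gpu):.2f} hr")
    print(f"speedup:   {speedup(hr105_cpu, hr105_gpu):.2f}x")
```

With these weights the hR105 column comes out at roughly 248.88 hr on the CPU and 12.20 hr on the GPU, matching the table to the printed precision.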


HF Port Performance: Workstation vs Supercomputer

Struct.   k   T-1C1G     T-2C2G     B-16C      B-32C      B-64C     B-128C
hR12      1   90.3       64.8       43.3       47.8       60.5      172.4
hR12x8    2   1,650.7    983.6      1,964.8    1,206.0    1,070.7   1,160.3
hR105     2   2,097.2    1,075.2    2,157.0    1,201.1    1,039.7   1,221.0
hR105     3   20,489.9   10,318.0   21,080.4   10,741.3   7,794.9   5,817.5
aP107     2   3,748.4    2,168.4    4,452.5    2,515.4    1,900.9   1,816.5

Table: Actual run-times of truncated runs (reduced NELM and NSW) of different structures on different platforms. T is tirith, B is blacklight; the attribute mCnG indicates m CPU cores and n GPU devices.


Capabilities and Performance System Capabilities, Requirements

Other Capabilities

Compute capability 2.0 or higher

Arbitrary CPU:GPU ratios

Round-robin

Uses File I/O (I’m sorry)

Mixed or full double precision

FFTs in single or double

Everything else in double
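One simple way to support arbitrary CPU:GPU ratios with a round-robin policy is to map each CPU rank to a device index by taking the rank modulo the number of devices. The slides do not show the actual mapping, so `assign_devices` below is a hypothetical sketch, and it does not model the file-based coordination mentioned above.

```python
def assign_devices(n_cpu_ranks, n_gpus):
    """Return a list where entry i is the GPU index assumed for CPU rank i."""
    if n_gpus < 1:
        raise ValueError("at least one GPU is required")
    return [rank % n_gpus for rank in range(n_cpu_ranks)]

if __name__ == "__main__":
    # Five CPU ranks sharing two GPUs: ranks alternate between devices.
    print(assign_devices(5, 2))
```

With this scheme any ratio works: extra ranks simply wrap around and share a device.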


Design Decisions and Methods Guiding Principles

Guiding Principles

1 Performance: ultimately, this is our primary concern

Intercept high in the call tree

Write/use good kernels

2 Programmability: programmer time is a limited quantity

Be maximally compartmental, minimally intrusive

Don’t get too clever

3 Portability: why write something that can’t be used?

Use standard languages (FORTRAN, C[, Python])

Use standard libraries (CUBLAS, CUFFT)

Don’t add system assumptions


Figure: development-cycle flowchart with the stages CPU, Profile, Optimize, Translate, GPU, Validate, and Debug.


Design Decisions and Methods Development Cycle

Incremental Ports

Our technique has been to climb up callgraphs.

Pros:

Important work is done first

Debugging is [more] palatable

Provides rough numerical validation

Cons:

Divergent efforts can require merges

Inherit high-level structure from CPU code

Perturbation method.


Intercepts

#ifdef CUDA
    /* Assumptions */
    USE_CUDA = (condition1 && condition2 && ...);
    if (USE_CUDA) {
        fun_cu(foo, bar);  // intercept (not a kernel)
    } else {
#endif

        /* Function to be intercepted */
        fun(foo, bar);

#ifdef CUDA
    }
#endif
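The control flow of the intercept above can be sketched in Python as a dispatcher that falls back to the original routine whenever any assumption fails. Everything here is a stand-in: `make_intercept`, the toy `fun` and `fun_cu`, and the `large_enough` condition are hypothetical and only model the branching, not real GPU execution.

```python
def make_intercept(cpu_fn, gpu_fn, assumptions):
    """Return a function that calls gpu_fn when every assumption holds, else cpu_fn."""
    def dispatch(*args):
        use_gpu = all(check(*args) for check in assumptions)
        return gpu_fn(*args) if use_gpu else cpu_fn(*args)
    return dispatch

def fun(values):
    """Stand-in for the original CPU routine."""
    return ("cpu", sum(values))

def fun_cu(values):
    """Stand-in for the GPU intercept; same result, different path."""
    return ("gpu", sum(values))

def large_enough(values):
    """Hypothetical assumption: only offload inputs with at least 4 items."""
    return len(values) >= 4

fun_intercepted = make_intercept(fun, fun_cu, [large_enough])

if __name__ == "__main__":
    print(fun_intercepted([1, 2, 3, 4]))  # takes the GPU path
    print(fun_intercepted([1, 2]))        # falls back to the CPU path
```

Keeping the CPU routine as the fallback is what makes the intercept minimally intrusive: callers are unchanged and unsupported cases still run.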


Validation

./vasp_test.py -e ../exes/vasp-pgk -t PdO-v/ -n 1
======================================================
Test Name: PdO-v/
Run on: 2012.05.16 In: ./tests/3F0T
Result  Parameter       Test vs Expected
------------------------------------------------------
passed  energy          -5.725595e+02 vs -5.725596e+02
passed  ext. pressure   -5.922600e+02 vs -5.922700e+02
passed  volume           1.895670e+03 vs  1.895670e+03
passed  stress (xx)     -7.125278e+02 vs -7.125343e+02
passed  stress (yy)     -7.009820e+02 vs -7.010164e+02
passed  stress (zz)     -6.887410e+02 vs -6.887406e+02
passed  stress (xy)      0.000000e+00 vs  0.000000e+00
passed  stress (yz)      0.000000e+00 vs  0.000000e+00
passed  stress (zx)      6.749650e+00 vs  6.753080e+00
------------------------------------------------------
1.44x   loop time       179.9700 vs 258.5200
------------------------------------------------------
0.95x   setdij time     0.1900 vs 0.1800
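A check of this kind can be sketched as a relative-tolerance comparison plus a timing ratio. The internals of vasp_test.py are not shown in the talk, so the functions `compare` and `speedup` and the tolerance of 1e-3 are assumptions chosen so that the values in the report above pass.

```python
import math

def compare(name, test, expected, rel_tol=1e-3):
    """Format one report line; 'passed' if test is within rel_tol of expected."""
    ok = math.isclose(test, expected, rel_tol=rel_tol, abs_tol=1e-12)
    status = "passed" if ok else "FAILED"
    return f"{status:8s}{name:16s}{test: .6e} vs {expected: .6e}"

def speedup(test_time, expected_time):
    """Timing ratio as printed in the report: expected time / test time."""
    return expected_time / test_time

if __name__ == "__main__":
    print(compare("energy", -5.725595e+02, -5.725596e+02))
    print(compare("stress (zx)", 6.749650e+00, 6.753080e+00))
    print(f"{speedup(179.97, 258.52):.2f}x  loop time")
```

The stress (zx) entry differs from its reference by about 5e-4 in relative terms, so a tolerance much tighter than 1e-3 would flag it; how strict to be is a judgment call for mixed-precision runs.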


CUDA Profiler

Method                Tot            Num        Avg       %
"method=":            4105133156.2   22642281   181.3     100.0
A_kernel:             1000726933.4   823872     1214.7    24.4
gemm:                 747718784.3    9121548    82.0      18.2
double_:              369791403.9    62552      5911.7    9.0
crrexp_mul_wave_k:    339633562.0    4056780    83.7      8.3
aug_charge_trace_k:   326918471.3    15632      20913.4   8.0
mul_vec_k:            289512103.3    13076      22140.7   7.1
4charge_trace_k:      239677493.5    15632      15332.5   5.8
racc0_combine_k:      217330923.3    1641360    132.4     5.3
calc_dllmm_k:         214525210.6    38636      5552.5    5.2
apply_gfac_der_k:     85949460.5     552096     155.7     2.1
apply_gfac_k:         86839195.7     15632      5555.2    2.1
eccp_nl_fock:         68893474.7     23004      2994.8    1.7
memcpy:               32428350.8     278353     116.5     0.8
rpro1_combine_k:      24683417.5     4056780    6.1       0.6
split_complex_k:      15031156.0     1641360    9.2       0.4
else:                 44754728.3     194852     229.7     1.1


CUDA Profiler

Method                Tot          Num      Avg    %
"method=":            16699996.2   457575   36.5   100.0
memcpy:               7460230.4    117277   63.6   44.7
A_kernel:             6306412.3    246386   25.6   37.8
B_kernel:             1106599.3    35198    31.4   6.6
memset32:             443551.5     23516    18.9   2.7
else:                 1383202.7    35198    39.3   8.3
gemm:                 0.0          0        0.0    0.0
crrexp_mul_wave_k:    0.0          0        0.0    0.0
racc0_combine_k:      0.0          0        0.0    0.0
4charge_trace_k:      0.0          0        0.0    0.0
aug_charge_trace_k:   0.0          0        0.0    0.0
apply_gfac_der_k:     0.0          0        0.0    0.0
apply_gfac_k:         0.0          0        0.0    0.0
eccp_nl_fock:         0.0          0        0.0    0.0
double_:              0.0          0        0.0    0.0
mul_vec_k:            0.0          0        0.0    0.0
rpro1_combine_k:      0.0          0        0.0    0.0
split_complex_k:      0.0          0        0.0    0.0
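The Avg and % columns in both profiler listings follow from Tot and Num: Avg is Tot divided by Num, and % is Tot as a share of the "method=" total. The helper below, a hypothetical `summarize` function rather than part of any profiler tool, reproduces the memcpy rows of the two listings.

```python
def summarize(total, calls, grand_total):
    """Return (average per call, percent of grand total), one decimal each."""
    avg = total / calls if calls else 0.0
    pct = 100.0 * total / grand_total
    return round(avg, 1), round(pct, 1)

if __name__ == "__main__":
    # memcpy row of the first listing, then of the second listing
    print(summarize(32428350.8, 278353, 4105133156.2))
    print(summarize(7460230.4, 117277, 16699996.2))
```

In the first listing memcpy accounts for 0.8% of the profiled time, while in the second it accounts for 44.7%, which illustrates how much of a run can go to data transfer when little compute happens between copies.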


Design Decisions and Methods Examples

“Persistent” pointers

/** void pointer */
typedef struct void_p {
    unsigned int size;
    void* ptr;
} void_p;

/** double pointer */
typedef struct double_p {
    unsigned int size;
    double* ptr;
} double_p;


/** Assign a chunk of GPU mem to a chunk of CPU mem */
static inline void assign_cu(void_p* dest,       //!< destination
                             void* src,          //!< source
                             unsigned int size   //!< size (in bytes)
                            ){
    /* Do we need to resize? */
    if (dest->ptr == NULL || dest->size < size) {
        if (dest->ptr != NULL)
            cudaFree(dest->ptr);
        cudaMalloc((void**)&dest->ptr, size);
        dest->size = size;
    }
    /* Do the actual copy */
    cudaMemcpy(dest->ptr, src, size, cudaMemcpyHostToDevice);
}
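The point of the "persistent" pointer is that the device allocation survives between calls and is only replaced when a larger buffer is needed. The pure-Python model below mimics that grow-only policy with a bytearray standing in for device memory; the class `PersistentBuffer` and its `allocations` counter are hypothetical and involve no GPU calls.

```python
class PersistentBuffer:
    """Grow-only buffer: reallocates only when the incoming data does not fit."""

    def __init__(self):
        self.data = None       # stands in for the device pointer
        self.size = 0          # capacity in bytes
        self.allocations = 0   # how many times we had to (re)allocate

    def assign(self, src):
        n = len(src)
        # Do we need to resize? (models cudaFree + cudaMalloc)
        if self.data is None or self.size < n:
            self.data = bytearray(n)
            self.size = n
            self.allocations += 1
        # Do the actual copy (models cudaMemcpy host to device)
        self.data[:n] = src

if __name__ == "__main__":
    buf = PersistentBuffer()
    for payload in (b"abcd", b"xy", b"12345678", b"abc"):
        buf.assign(payload)
    print(buf.allocations, buf.size)
```

Four assignments trigger only two allocations here, since the smaller payloads reuse the existing capacity. The trade-off is that the buffer never shrinks, so its size tracks the largest request seen.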


Structs

typedef struct four_vector {
    int t;
    int x;
    int y;
    int z;
} four_vector;
four_vector events[N];

Improves locality for elemental functions. Mechanism is deep memory caches.

typedef struct four_vectors {
    int t[N];
    int x[N];
    int y[N];
    int z[N];
} four_vectors;
four_vectors events;

Improves memory bandwidth for vector functions. Mechanism is wide memory bus.
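The two layouts above can be modelled in Python to show what changes for a "vector function" that touches one field of every event. This is only an illustration of the data arrangement: the names `events_aos`, `to_soa`, and the sample values are invented, and Python itself will not exhibit the bandwidth effects described on the slide.

```python
from array import array

N = 4  # hypothetical number of events

# Array of structs: each event's (t, x, y, z) fields sit together.
events_aos = [(i, 10 * i, 20 * i, 30 * i) for i in range(N)]

def to_soa(events):
    """Convert to struct of arrays: one contiguous int array per field."""
    fields = ("t", "x", "y", "z")
    return {name: array("i", (e[k] for e in events))
            for k, name in enumerate(fields)}

events_soa = to_soa(events_aos)

def sum_x_aos(events):
    """Vector function on AoS: strides over whole records to reach each x."""
    return sum(e[1] for e in events)

def sum_x_soa(soa):
    """Vector function on SoA: reads one contiguous array."""
    return sum(soa["x"])

if __name__ == "__main__":
    print(sum_x_aos(events_aos), sum_x_soa(events_soa))
```

Both functions return the same value; the difference is that the SoA version walks a single contiguous array, which is the access pattern a wide memory bus rewards.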


Design Decisions and Methods Tips

Intercepts vs Overhauls

Intercepts and overhauls have the same theoretical peak performance.

Maximal intercept is 2 codes

One is usually easier than the other.

Difficulty of intercepts is governed by

Loop position: must intercept above fine-grain loops

Data structures: must pass data and context to GPU

Difficulty of overhauls is governed by

Size, complexity of auxiliary code

‘State’ of the original code

Overhaul has side-benefits. Intercepts have side-costs.


Road-Map Our plans

Non-HF Port

Port will use the same scheme as HF port

Climbing up many of the non-HF versions of CPU routines

Trying to get all the way up to minimization routine (e.g. RMM-DIIS)

You can expect performance approaching HF performance

Less parallelism for systems of the same size

More rapid iteration

Mitigated by larger quantum systems

Our goal is beta by sometime this summer


Merge with VASP Core

Our code is generally available to VASP license holders

Must request access through Vienna

Distribution through our website and git repo

This scheme is inadequate (doesn’t scale).

We hope to put the ports in VASP 5.3, which will have some other architectural changes.


Road-Map Your part

Wish List

Users, to do science

It’s all about science

Find the kinks in our implementation

Input, to direct effort and validate results

Scientifically relevant systems

Requests for functionality

Effort, to write the ports

Current VASP users with time to contribute

VASP is a large code


Conclusions

We’ve ported HF functionality in VASP to CUDA.

Up to 20x performance over single core

Up to 64 core performance compared to supercomputers

Callgraph climbing port method is effective

Accelerate specific functionality of large codes

Can inform future decisions about dedicated ports

Accelerating scientific codes enables new science.


Thank you

Questions?
