Member of the Helmholtz Association (Mitglied der Helmholtz-Gemeinschaft)

S4283 - Elmar Westphal - Subdivide, Preprocess and Conquer

S4283 - Subdivide, Preprocess and Conquer: Micromagnetism FEM/BEM-Simulations on Single-Node/Multi-GPU Systems

Elmar Westphal - Forschungszentrum Jülich GmbH

Contents

• Micromagnetism

• TetraMag, a FEM/BEM Micromagnetism Simulator

• Porting TetraMag to CUDA

• Porting TetraMag to multi-GPU CUDA

• Benchmarks

Micromagnetism

In micromagnetism, we investigate

• The structure of ferromagnetic domains

• The structure, dynamics and motion of domain walls

• The structure and dynamics of magnetic vortices

• Spin waves, etc.

As a mesoscopic theory, it can provide a link between simulation and experiment.

Magnetism on Different Length Scales

• Quantum theory: electronic structure

• Micromagnetism: continuum theory, domain walls, vortices

• Domain theory: subdivision into domains, details of the magnetic structure are neglected

• Macroscopic models: hysteresis models, response to external parameters

• Heisenberg model: atomistic effects, spin chains

magnetic nanostructures

Magnetism on Different Time Scales

Most Recent Achievement

• Discovery of the “Spin Cherenkov Effect” [1] (the magnetic equivalent of the sonic boom)

Geometry:

• 2 µm x 1 µm x 1 µm Permalloy prism

• 5 nm resolution (100 million tetrahedrons, 16 million discretisation nodes)

[1] M. Yan, A. Kákay, C. Andreas, and R. Hertel. Spin Cherenkov effect and magnonic Mach cones. Physical Review B (Rapid Communications) 88, 220412(R) (2013).
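As a rough plausibility check of the geometry figures, the following sketch counts nodes and tetrahedra for a regular grid. It assumes cubic cells with a 5 nm edge, each cut into six tetrahedra; the actual mesh used for the simulation is not described in the talk, and the name `mesh_size` is made up for this illustration.

```python
def mesh_size(dims_nm, h_nm, tets_per_cell=6):
    """Count nodes and tetrahedra of a regular grid over a prism.

    Assumption (not from the talk): cubic cells with edge h_nm,
    each split into tets_per_cell tetrahedra.
    """
    cells = 1
    nodes = 1
    for d in dims_nm:
        n = round(d / h_nm)   # number of cells along this axis
        cells *= n
        nodes *= n + 1        # one more node than cells per axis
    return nodes, cells * tets_per_cell


nodes, tets = mesh_size((2000, 1000, 1000), 5)
print(f"{nodes:,} nodes, {tets:,} tetrahedra")
```

Under these assumptions the prism has roughly 16.2 million nodes and 96 million tetrahedra, which is in line with the rounded numbers quoted above.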

About TetraMag

• Code started by Riccardo Hertel [1], extended and ported by Attila Kakay and Elmar Westphal [2][3]

• Upcoming: details about

• Calculation steps

• Matrix properties

• Older CUDA versions

• New challenges

[1] R. Hertel. Micromagnetic simulations of magnetostatically coupled Nickel nanowires. Journal of Applied Physics 90(11), 5752-5758 (2001).

[2] A. Kakay, E. Westphal, and R. Hertel. Speedup of FEM micromagnetic simulations with graphical processing units. IEEE Transactions on Magnetics 46(6), 2303-2306 (2010).

[3] http://www.fz-juelich.de/pgi/pgi-6/DE/Forschung/MagnetizationDynamics/Simulations/_node.html

Calculation Steps

• Calculate magnetostatic fields with a scalar potential U

• Split U into two parts: U = U1 + U2

• U1 is the solution of the inhomogeneous Neumann problem

• U2 satisfies Laplace’s equation with Dirichlet boundary conditions

• Solve/integrate the Landau-Lifshitz-Gilbert equation of motion
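For orientation, the steps above can be written out as equations. This is a sketch in SI units using one common sign convention; the symbols Ω (magnetic region), n (outward normal), γ, α, M_s and H_eff are not defined on the slides and are assumptions here, as is the exact form TetraMag uses.

```latex
\begin{align}
\mathbf{H}_d &= -\nabla U, \qquad U = U_1 + U_2, \\
\nabla^2 U_1 &= \nabla \cdot \mathbf{M} \quad \text{in } \Omega,
  \qquad \frac{\partial U_1}{\partial n} = \mathbf{n} \cdot \mathbf{M} \quad \text{on } \partial\Omega, \\
\nabla^2 U_2 &= 0 \quad \text{in } \Omega,
  \qquad U_2 \ \text{prescribed on } \partial\Omega, \\
\frac{\partial \mathbf{M}}{\partial t} &= -\gamma\, \mathbf{M} \times \mathbf{H}_{\mathrm{eff}}
  + \frac{\alpha}{M_s}\, \mathbf{M} \times \frac{\partial \mathbf{M}}{\partial t}.
\end{align}
```

The boundary values of U2 come from a boundary integral over U1, which is where the BEM part of the method enters.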

Magnetostatic Field

• Calculated in 3 steps:

• Iterative solution of a sparse linear system for U1

• Dense/hierarchical matrix-vector multiplication to obtain U2 in the boundary regions (not covered in this talk)

• Iterative solution of a sparse linear system for U2 within the magnetic region

• Sparse systems are solved using multi-GPU linbcg and bicgstab solvers
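To show why the sparse matrix-vector multiplication (MVM) dominates the rest of this talk, here is a minimal single-threaded sketch of the textbook BiCGSTAB iteration on a CSR matrix. It is not TetraMag's solver: it is unpreconditioned, written in plain Python, and uses a small 1D Laplacian as a stand-in matrix. All function names are made up for this illustration.

```python
def csr_matvec(indptr, indices, data, x):
    """y = A @ x for a matrix stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def bicgstab(matvec, b, tol=1e-10, max_iter=200):
    """Unpreconditioned BiCGSTAB with start vector x0 = 0."""
    n = len(b)
    x = [0.0] * n
    r = list(b)          # residual b - A @ x0
    r_hat = list(b)      # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    limit = tol * dot(b, b) ** 0.5
    for iteration in range(1, max_iter + 1):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = [r[i] + beta * (p[i] - omega * v[i]) for i in range(n)]
        v = matvec(p)                        # first MVM of the iteration
        alpha = rho_new / dot(r_hat, v)
        s = [r[i] - alpha * v[i] for i in range(n)]
        if dot(s, s) ** 0.5 <= limit:        # converged after the half step
            return [x[i] + alpha * p[i] for i in range(n)], iteration
        t = matvec(s)                        # second MVM of the iteration
        omega = dot(t, s) / dot(t, t)
        x = [x[i] + alpha * p[i] + omega * s[i] for i in range(n)]
        r = [s[i] - omega * t[i] for i in range(n)]
        rho = rho_new
        if dot(r, r) ** 0.5 <= limit:
            return x, iteration
    return x, max_iter


# 1D Laplacian (symmetric, tridiagonal) as a small stand-in matrix.
n = 5
indptr, indices, data = [0], [], []
for i in range(n):
    for j, a in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
        if 0 <= j < n:
            indices.append(j)
            data.append(a)
    indptr.append(len(indices))

x_true = [1.0, 2.0, 3.0, 4.0, 5.0]
b = csr_matvec(indptr, indices, data, x_true)
x, iterations = bicgstab(lambda vec: csr_matvec(indptr, indices, data, vec), b)
```

Each iteration performs two MVMs and a handful of dot products and vector updates, so distributing the MVM efficiently is the central problem when the solver runs on several GPUs.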

Time Integration

• Includes field calculations and vector operations

• Performed using CVODE from the SUNDIALS package

• CVODE’s “NVector” structure and its operations have been ported to CUDA for single-host, multi-GPU systems

• Memory-consuming (~1 KB/node, the limiting factor for GPU usage)

• Field calculations use a sparse matrix and several field vectors

• CVODE internally needs many helper vectors
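The ~1 KB/node figure translates directly into a limit on problem size per device. The sketch below does that arithmetic; the 6 GiB of device memory is an assumed example value, "1 KB" is read as 1024 bytes, and the function names are hypothetical.

```python
BYTES_PER_NODE = 1024  # "~1 KB/node" from the slide, read here as 1024 bytes


def required_gib(nodes, bytes_per_node=BYTES_PER_NODE):
    """Memory needed for a problem with the given number of nodes, in GiB."""
    return nodes * bytes_per_node / 2**30


def max_nodes(device_gib, n_gpus=1, bytes_per_node=BYTES_PER_NODE):
    """Largest node count that fits, ignoring any other memory use."""
    return int(device_gib * 2**30 * n_gpus) // bytes_per_node


need = required_gib(16_000_000)      # the 16 million node problem
fit_one = max_nodes(6)               # one device with an assumed 6 GiB
fit_four = max_nodes(6, n_gpus=4)    # four such devices
print(f"{need:.1f} GiB needed, {fit_one:,} nodes fit on one device")
```

With these assumptions a 16 million node problem needs about 15.3 GiB, while a single 6 GiB device holds roughly 6.3 million nodes.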

Properties of TetraMag’s Sparse Matrices

• Contain ~15 non-zero elements per row/column

• Distribution of elements depends on the underlying mesh (regular to seemingly random)

• Symmetric (for magnetostatic field calculation)

OR

• Asymmetric (for exchange field calculation)

• Static over the whole program run

TetraMagCUDA

• Around since ~2009 (see GTC 2010 poster [1])

• CUDA parts were piggy-backed onto CPU routines

• Tries to copy as many sparse matrices to device memory as possible; the remainder stays on the CPU

• GPU-only execution limited to problem sizes of ~1M nodes

• Sufficient for most use cases at the time

[1] http://www.gputechconf.com/content/GTC/posters/2010/Q02-Massively-Parallel-Micromagnetic-FEM-Calculations-with-Graphical-Processing-Units.pdf

New Challenges

• Simulations at experimental scale (µm) require larger meshes (10M+ nodes)

• A single matrix plus the solver/integrator vectors often exceeds device memory

• Copying matrices sequentially is not sufficient

Possible solutions:

• Reduce memory footprint

• Distribute problem over multiple GPUs

Reduce Memory Footprint

• Use symmetry

• Effective, but not possible for all matrices

• Often needs atomic operations (slow)

• Reduce precision of off-diagonal elements

• Moderately effective (8+4 -> 4+4 bytes/value)

• May lead to unacceptable loss of precision
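The link between symmetric storage and atomic operations can be seen in a small sketch. Only entries on or above the diagonal are stored, and each off-diagonal entry updates two result rows. This is a sequential illustration with hypothetical names, not the TetraMag kernel.

```python
def symmetric_matvec(n, rows, cols, vals, x):
    """y = A @ x for symmetric A, storing only entries with col >= row."""
    y = [0.0] * n
    for i, j, a in zip(rows, cols, vals):
        y[i] += a * x[j]        # stored entry A[i][j]
        if i != j:
            y[j] += a * x[i]    # mirrored entry A[j][i]: a scattered write
    return y


# Upper triangle of [[4, 1, 0], [1, 3, 2], [0, 2, 5]]
rows = [0, 0, 1, 1, 2]
cols = [0, 1, 1, 2, 2]
vals = [4.0, 1.0, 3.0, 2.0, 5.0]
y = symmetric_matvec(3, rows, cols, vals, [1.0, 2.0, 3.0])
```

With one GPU thread per stored row, the scattered write to `y[j]` can collide with writes from other threads, which is why a parallel version typically needs atomic additions.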

Reduce Memory Footprint (cont.)

• Extract diagonals

• Good for performance (coalesced access)

• Very good results for symmetric matrices based on regular meshes (2 x (8+4) -> 8 bytes)

• Mixed results otherwise

• Combinations of some/all of the above
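A minimal sketch of diagonal extraction: well-filled diagonals are stored as dense arrays without column indices, and everything else stays in a sparse remainder. The fill threshold `min_fill` and all names are assumptions for illustration, and the sketch does not additionally exploit symmetry.

```python
from collections import defaultdict


def extract_diagonals(n, entries, min_fill=0.5):
    """Split {(row, col): value} into dense diagonals and a sparse remainder."""
    by_offset = defaultdict(dict)
    for (i, j), a in entries.items():
        by_offset[j - i][i] = a
    diagonals, remainder = {}, {}
    for offset, items in by_offset.items():
        length = n - abs(offset)            # number of slots on this diagonal
        if len(items) >= min_fill * length:
            # Dense array indexed by row; no column index is stored.
            diagonals[offset] = [items.get(i, 0.0) for i in range(n)]
        else:
            for i, a in items.items():
                remainder[(i, i + offset)] = a
    return diagonals, remainder


def matvec(n, diagonals, remainder, x):
    y = [0.0] * n
    for offset, diag in diagonals.items():
        # Consecutive rows read consecutive elements of diag and x.
        for i in range(max(0, -offset), min(n, n - offset)):
            y[i] += diag[i] * x[i + offset]
    for (i, j), a in remainder.items():
        y[i] += a * x[j]
    return y


n = 5
entries = {}
for i in range(n):
    entries[(i, i)] = 2.0
    if i + 1 < n:
        entries[(i, i + 1)] = -1.0
        entries[(i + 1, i)] = -1.0
entries[(0, 2)] = 0.5                       # one stray off-diagonal entry

diagonals, remainder = extract_diagonals(n, entries)
y = matvec(n, diagonals, remainder, [1.0, 2.0, 3.0, 4.0, 5.0])
```

Here the three main diagonals are extracted and the stray entry remains in the sparse part. For a symmetric matrix, the diagonals at offsets +k and -k hold the same values, which is where the further saving quoted above comes from.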

Preprocessing Approaches for Matrix Distribution

• Matrices can be preprocessed for GPU(s) in different ways

• Traditional single GPU calculation

• “Naive” multi-GPU distribution

• Checkerboard-style distribution

“Traditional” MVM on a Single GPU

[Figure: matrix x vector = result vector, computed on a single GPU]

Distribution over Multiple GPUs

• Naive approach:

• Divide matrix into NGPU sub-matrices with Nrow/NGPU rows

• Copy one sub-matrix to each GPU

• Copy vector to all GPUs

• Perform partial multiplications

• Copy partial results to all other GPUs

• Repeat (if needed)
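The naive scheme can be simulated on the host to make the data movement explicit. In this sketch each "GPU" is just a list, the copies are plain list copies, and the function names are hypothetical.

```python
def split_rows(n_rows, n_gpu):
    """Return [start, end) row ranges, one per simulated GPU."""
    base, extra = divmod(n_rows, n_gpu)
    bounds, start = [], 0
    for g in range(n_gpu):
        end = start + base + (1 if g < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def naive_multi_gpu_matvec(rows, x, n_gpu):
    """rows[i] is a list of (col, value) pairs for matrix row i."""
    bounds = split_rows(len(rows), n_gpu)
    # "Copy vector to all GPUs": every device holds the complete vector.
    x_copies = [list(x) for _ in range(n_gpu)]
    partial = []
    for g, (start, end) in enumerate(bounds):
        xg = x_copies[g]
        partial.append([sum(a * xg[c] for c, a in rows[i])
                        for i in range(start, end)])
    # "Copy partial results to all other GPUs": the concatenation stands in
    # for the all-to-all exchange needed before the next multiplication.
    return [value for part in partial for value in part]


n = 5
rows = [[(j, a) for j, a in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0))
         if 0 <= j < n] for i in range(n)]
bounds = split_rows(n, 2)
y = naive_multi_gpu_matvec(rows, [1.0, 2.0, 3.0, 4.0, 5.0], 2)
```

Every multiplication starts only after each device has received the full vector, so no computation can overlap with the transfers.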

“Naive” Distribution, Preparation and First Multiplication

[Figure, built up in three steps: copy partial matrices to other GPUs, copy vectors to other GPUs, calculate]

“Naive” Distribution, Subsequent Multiplications

[Figure, built up in two steps: copy vectors to other GPUs, calculate]

Naive Distribution Approach

• Pro

• Easy to implement

• Con

• All sub-matrices need vector data from all GPUs at the beginning of the calculation

• Data transfer overhead is about as expensive as the actual calculation -> performance often below that of the single-GPU solution

Checkerboard Approach

• Split each sub-matrix into NGPU sub-sub-matrices

• Each of these needs vector values from only one GPU

• Perform multiplication of the first sub-sub-matrix with the partial vector

• At the same time, copy the vector part needed for the next multiplication into a (double) buffer in a different stream

• Repeat for the other sub-sub-matrices
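The order of operations in the checkerboard scheme can be sketched sequentially. Each simulated GPU starts with the sub-sub-matrix that needs only its own vector part, and the buffer for the next step is fetched before the current multiplication runs. Real overlap would require CUDA streams and asynchronous copies; this host-side version only shows the data flow, and its names are hypothetical.

```python
def checkerboard_matvec(rows, x, n_gpu):
    """rows[i] is a list of (col, value) pairs for matrix row i."""
    n = len(rows)
    size = -(-n // n_gpu)   # rows (and vector elements) per GPU, rounded up
    # Preprocessing: blocks[g][s] holds the entries of GPU g's rows whose
    # columns refer to vector elements stored on GPU s.
    blocks = [[[] for _ in range(n_gpu)] for _ in range(n_gpu)]
    for i, row in enumerate(rows):
        for c, a in row:
            blocks[i // size][c // size].append((i, c, a))
    x_parts = [x[g * size:(g + 1) * size] for g in range(n_gpu)]

    y = [0.0] * n
    for g in range(n_gpu):
        buffer = x_parts[g]                 # local part: available at once
        for step in range(n_gpu):
            src = (g + step) % n_gpu
            # On real hardware this copy would run in a second stream while
            # the multiplication below is executing.
            next_buffer = x_parts[(src + 1) % n_gpu]
            for i, c, a in blocks[g][src]:
                y[i] += a * buffer[c - src * size]
            buffer = next_buffer
    return y


n = 5
rows = [[(j, a) for j, a in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0))
         if 0 <= j < n] for i in range(n)]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y2 = checkerboard_matvec(rows, x, 2)
y4 = checkerboard_matvec(rows, x, 4)
```

Because the first multiplication uses only local data, the first transfer is hidden behind useful work instead of preceding it.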

“Checkerboard”-Style MVM

[Figure: 4 x 4 checkerboard of sub-sub-matrices for GPUs 0-3. Rows: partial matrix on GPU; columns: needs vector data from. Also shown: partial vectors, double buffer, partial results]

Matrix Preprocessing

• Original matrix in CSR-like format

• Blocks of Nwarpsize rows are transposed to enable coalesced memory access

• Distribution of data destroys uniformity of row lengths

• Zero padding may be necessary -> wasted memory

• Rows are sorted and re-indexed by number of non-zero elements -> minimal padding
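The effect of sorting rows before building the transposed blocks can be demonstrated with a small sketch. The block size of 4 stands in for the warp size, and the layout is a generic padded, transposed block format; whether it matches TetraMag's storage in detail is an assumption, as are the names.

```python
def build_blocks(rows, warp_size=4, sort_rows=True):
    """Group rows into blocks and store each block slot-major (transposed)."""
    order = list(range(len(rows)))
    if sort_rows:
        # Re-index rows by descending number of non-zero elements.
        order.sort(key=lambda i: len(rows[i]), reverse=True)
    blocks, padding = [], 0
    for start in range(0, len(order), warp_size):
        block_rows = order[start:start + warp_size]
        width = max(len(rows[i]) for i in block_rows)
        # Slot k of all rows in the block is contiguous, so neighbouring
        # threads would read neighbouring memory locations.
        cols = [[rows[i][k][0] if k < len(rows[i]) else 0
                 for i in block_rows] for k in range(width)]
        vals = [[rows[i][k][1] if k < len(rows[i]) else 0.0
                 for i in block_rows] for k in range(width)]
        padding += sum(width - len(rows[i]) for i in block_rows)
        blocks.append((block_rows, cols, vals))
    return blocks, padding


def blocked_matvec(blocks, x, n):
    y = [0.0] * n
    for block_rows, cols, vals in blocks:
        for col_slot, val_slot in zip(cols, vals):
            for lane, i in enumerate(block_rows):
                y[i] += val_slot[lane] * x[col_slot[lane]]
    return y


# Alternating short (1 entry) and long (3 entries) rows.
rows = []
for i in range(8):
    if i % 2:
        rows.append([(i - 1, -1.0), (i, 2.0), ((i + 1) % 8, -1.0)])
    else:
        rows.append([(i, 1.0)])
x = [float(i + 1) for i in range(8)]

blocks_sorted, pad_sorted = build_blocks(rows, sort_rows=True)
blocks_plain, pad_plain = build_blocks(rows, sort_rows=False)
y_sorted = blocked_matvec(blocks_sorted, x, len(rows))
y_plain = blocked_matvec(blocks_plain, x, len(rows))
```

In this example the unsorted layout needs 8 padding entries and the sorted layout needs none, while both produce the same result vector.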

Optimizing Data Transfers

• Depending on the problem, sub-sub-matrices can get very small and access only very few vector elements

• Multiplication time is short

• Transfer of potentially unneeded elements takes much longer

• Solution: transfer only those elements that are really needed
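Which elements are "really needed" follows from the sparsity pattern alone, so it can be determined once during preprocessing. The sketch below collects, for every (target GPU, source GPU) pair, the vector elements the target reads from the source. The even block distribution and the function name are assumptions.

```python
def needed_elements(rows, n_gpu):
    """Map (target_gpu, source_gpu) -> sorted list of required vector indices."""
    n = len(rows)
    size = -(-n // n_gpu)   # rows (and vector elements) per GPU, rounded up
    needed = {}
    for i, row in enumerate(rows):
        target = i // size
        for c, _ in row:
            source = c // size
            if source != target:
                needed.setdefault((target, source), set()).add(c)
    return {pair: sorted(cols) for pair, cols in needed.items()}


n = 8
rows = [[(j, a) for j, a in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0))
         if 0 <= j < n] for i in range(n)]
needed = needed_elements(rows, 4)
selective = sum(len(cols) for cols in needed.values())
full_parts = 4 * (n - n // 4)   # every GPU receives all foreign elements
```

For this tridiagonal example on four GPUs, selective transfer moves 6 elements per multiplication, compared with 24 when every GPU receives the complete vector parts of all others.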

Creating Export Vectors

• Each GPU needs an individual set of vector elements from every other GPU:

• Approach one: create NGPU x (NGPU-1) export vectors

• Consumes much memory

• Approach two: rewrite export vector NGPU x (NGPU-1) times

• May need to read/write all elements NGPU x (NGPU-1) times

• Approach three: do some (lots of) work in preprocessing

Building the Perfect(?) Set of Export Vectors

• During preprocessing:

• Build a key value that encodes which other GPUs need a certain vector element

• Using one bit for each target GPU results in at most 2^(NGPU-1) key values (usually NGPU

Building the Perfect(?) Set of Export Vectors (cont.)

• During multiplication:

• Build export vector according to the index generated in preprocessing

• During loop over all sub-sub-matrices:

• Loop over all export blocks

• Asynchronously copy blocks with elements needed from next GPU into buffer

Key values are calculated relative to the originating GPU:

[Figure: sparsity pattern of a matrix distributed over GPUs 0-3, with three columns per GPU]

• Vector elements accessed in cols 0-2 are originally on GPU 0; relative key values (binary) for GPUs 0-3: 0, 1 (001), 2 (010), 4 (100)

• Element needed by GPU 3 only: 0 + 0 + 0 + 4 = 4

• Element needed by GPUs 1 and 2: 0 + 1 + 2 + 0 = 3

• Vector elements accessed in cols 6-8 are originally on GPU 2; relative key values (binary) for GPUs 0-3: 2 (010), 4 (100), 0, 1 (001)

• Element needed by GPUs 1 and 3: 0 + 4 + 0 + 1 = 5
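The relative key values in this example follow a simple pattern: the bit for a target GPU depends on its cyclic distance from the originating GPU. The rule used below (bit index = distance - 1) is inferred from the two cases shown on the slide rather than stated in the talk, and the function names are hypothetical.

```python
def relative_bit(origin, target, n_gpu):
    """Bit value that marks `target` in keys of elements stored on `origin`."""
    distance = (target - origin) % n_gpu
    return 0 if distance == 0 else 1 << (distance - 1)


def element_key(origin, targets, n_gpu):
    """Key of a vector element on `origin` that is needed by `targets`."""
    key = 0
    for target in targets:
        key |= relative_bit(origin, target, n_gpu)
    return key


bits_from_gpu0 = [relative_bit(0, t, 4) for t in range(4)]
bits_from_gpu2 = [relative_bit(2, t, 4) for t in range(4)]
key_a = element_key(0, [3], 4)      # on GPU 0, needed by GPU 3 only
key_b = element_key(0, [1, 2], 4)   # on GPU 0, needed by GPUs 1 and 2
key_c = element_key(2, [1, 3], 4)   # on GPU 2, needed by GPUs 1 and 3
```

With relative keys, every GPU uses the same NGPU-1 bits regardless of its own index, so the export logic is identical on all devices.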


Build export buffer during preprocessing:

Keys from matrix for GPU 0:

index      0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
ref. key   4   2   6   3   0   4   0   6   7   6   1   2   5   4   4   0   3   1   2   3

Reorder to resulting export index (sorted by key; elements with key 0 are not exported):

export index  10  17   1  11  18   3  16  19   0   5  13  14  12   2   7   9   8   -   -   -
ref. key       1   1   2   2   2   3   3   3   4   4   4   4   5   6   6   6   7   -   -   -
binary       001 001 010 010 010 011 011 011 100 100 100 100 101 110 110 110 111   -   -   -

Blocks sent to export streams during multiplication loop:

• Exported to GPU 1 (bit 001): the blocks whose key has bit 001 set (keys 1, 3, 5, 7)

• Correspondingly for GPU 2 (bit 010) and GPU 3 (bit 100)
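The reordering step can be reproduced with a stable sort by key, after which the elements for each target GPU form a few contiguous blocks of the export vector. This sketch uses the key table for GPU 0 from the slide; the function names are made up for illustration.

```python
def build_export_index(keys):
    """Indices of elements with a non-zero key, stably sorted by key."""
    exported = [i for i, key in enumerate(keys) if key != 0]
    return sorted(exported, key=lambda i: keys[i])


def blocks_for_target(keys, export_index, bit):
    """Contiguous [start, end) ranges of the export vector with `bit` set."""
    blocks = []
    for pos, i in enumerate(export_index):
        if keys[i] & bit:
            if blocks and blocks[-1][1] == pos:
                blocks[-1] = (blocks[-1][0], pos + 1)   # extend current block
            else:
                blocks.append((pos, pos + 1))           # start a new block
    return blocks


keys = [4, 2, 6, 3, 0, 4, 0, 6, 7, 6, 1, 2, 5, 4, 4, 0, 3, 1, 2, 3]
export_index = build_export_index(keys)
to_gpu1 = blocks_for_target(keys, export_index, 0b001)
to_gpu2 = blocks_for_target(keys, export_index, 0b010)
to_gpu3 = blocks_for_target(keys, export_index, 0b100)
```

The export vector is written once per multiplication in this order, and each target then receives only its blocks, for example a single block covering all keys from 4 to 7 for bit 100.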






Pros and Cons

• Pro:

• Export vector is only built once per multiplication

• No element needs to be stored more than once

• No element is transferred more often than necessary

• Con:

• Limited number of GPUs, because number of blocks grows exponentially

• Time-consuming pre-processing

Further Optimisations

• Sub-sub-matrix multiplications are ordered by size of matrix (number of non-zero elements)

• More likely to correlate larger transfers with longer calculations

• Export vectors are copied to host memory during initial calculations

• Allows parallel import on devices with only one copy engine

Preprocessing Time

• Preprocessing involves multiple expensive indexing and sorting steps

• May take 10-20 seconds per matrix (with n ~20M)

• Depends on the number of GPUs used

• Happens only once, because matrices are static

• A typical run includes millions of solver iterations/matrix multiplications

Benchmarks

• Solver iteration times for

• Cube with regular mesh, very diagonal-dominant matrices, little data transfer

• Two problems of similar size and nature:

• Round disk with irregular mesh structure, few extractable diagonals, lots of data transfer

• Round disk with partially regular mesh structure, some extractable diagonals, moderate data transfer

Time per solver iteration for cube 1 µm (8.1M nodes, regular mesh)

[Figure: time per solver iteration [ms] (axis 0-30) vs. number of GPUs (1-8); series: Titan, GTX 690]

Comparison: 2 x E5-2680 (10 cores, 2.8 GHz): 134 ms

Time per solver iteration for disk (6.x M nodes, different mesh structures)

[Figure: time per solver iteration [ms] (axis 0-40) vs. number of GPUs (1-8); series: Titan, irreg. mesh; Titan, reg. mesh; GTX 690, irreg. mesh; GTX 690, reg. mesh; GTX 690, 1 GPU/card, irreg. mesh]

Comparison: 2 x E5-2680 (10 cores, 2.8 GHz): 100 ms (reg.), 118 ms (irreg.)

Conclusions

• Distribution of matrices and vectors over multiple GPUs allows us to simulate significantly larger samples

• Performance scaling depends largely on the amount of data exchanged between GPUs

• Optimising the mesh-structure is very important in multi-GPU setups

Questions?