S4283 - Subdivide, Preprocess and Conquer: Micromagnetism FEM/BEM-Simulations on Single-Node/Multi-GPU Systems

Elmar Westphal - Forschungszentrum Jülich GmbH (Member of the Helmholtz Association)


Post on 31-Jan-2021



  • Contents

    • Micromagnetism

    • TetraMag, a FEM/BEM Micromagnetism Simulator

    • Porting TetraMag to CUDA

    • Porting TetraMag to multi-GPU CUDA

    • Benchmarks

  • Micromagnetism

    In micromagnetism, we investigate:

    • The structure of ferromagnetic domains

    • The structure, dynamics and motion of domain walls

    • The structure and dynamics of magnetic vortices

    • Spin waves, etc.

    As a mesoscopic theory, it can provide a link between simulation and experiment.

  • Magnetism on Different Length Scales

    • Quantum theory: electronic structure

    • Micromagnetism: continuum theory, domain walls, vortices

    • Domain theory: subdivision into domains, details of the magnetic structure are neglected

    • Macroscopic models: hysteresis models, response to external parameters

    • Heisenberg model: atomistic effects, spin chains

    Figure label: magnetic nanostructures

  • Magnetism on Different Time Scales

  • Most Recent Achievement

    • Discovery of the “Spin Cherenkov Effect” [1] (the magnetic equivalent of the sonic boom)

    Geometry:

    • 2 µm x 1 µm x 1 µm Permalloy prism

    • 5 nm resolution (100 million tetrahedrons, 16 million discretisation nodes)

    [1] M. Yan, A. Kákay, C. Andreas, & R. Hertel. Spin Cherenkov effect and magnonic Mach cones. Physical Review B, Rapid Communications, 88, 220412(R) (2013)

  • About TetraMag

    • Code started by Riccardo Hertel [1], extended and ported by Attila Kakay and Elmar Westphal [2][3]

    • Upcoming: details about

        • Calculation steps

        • Matrix properties

        • Older CUDA versions

        • New challenges

    [1] Hertel, R. (2001). Micromagnetic simulations of magnetostatically coupled Nickel nanowires. Journal of Applied Physics, 90(11), 5752-5758.

    [2] Kakay, A., Westphal, E., & Hertel, R. (2010). Speedup of FEM micromagnetic simulations with graphical processing units. Magnetics, IEEE Transactions on, 46(6), 2303-2306.

    [3] http://www.fz-juelich.de/pgi/pgi-6/DE/Forschung/MagnetizationDynamics/Simulations/_node.html

  • Calculation Steps

    • Calculate magnetostatic fields with a scalar potential U

        • Split into two parts, U = U1 + U2

        • U1 is the solution of the inhomogeneous Neumann problem

        • U2 satisfies Laplace’s equation with Dirichlet boundary conditions

    • Solve/integrate the Landau-Lifshitz-Gilbert equation of motion
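For reference, one common way to write the equations named on this slide is sketched below. The slides do not give the formulas, so the unit system (SI, with the demagnetising field taken as H = -∇U), the sign convention, and the symbols γ (gyromagnetic ratio), α (Gilbert damping), M_s (saturation magnetisation) and H_eff (effective field) are assumptions made here for illustration.

```latex
\begin{aligned}
U &= U_1 + U_2, \\
\Delta U_1 &= \nabla \cdot \mathbf{M} \quad \text{inside the magnet}, \qquad
\frac{\partial U_1}{\partial n} = \mathbf{n} \cdot \mathbf{M} \quad \text{on its surface}, \\
\Delta U_2 &= 0 \quad \text{with Dirichlet values prescribed on the surface}, \\
\frac{\partial \mathbf{M}}{\partial t} &= -\gamma\, \mathbf{M} \times \mathbf{H}_{\mathrm{eff}}
  + \frac{\alpha}{M_s}\, \mathbf{M} \times \frac{\partial \mathbf{M}}{\partial t}.
\end{aligned}
```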

  • Magnetostatic Field

    • Calculated in 3 steps:

        • Iterative solution of a sparse linear system for U1

        • Dense/hierarchical matrix-vector multiplication to obtain U2 in the boundary regions (not covered in this talk)

        • Iterative solution of a sparse linear system for U2 within the magnetic region

    • Sparse systems are solved using multi-GPU linbcg and bicgstab solvers
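To make the role of the sparse solver concrete, here is a minimal single-threaded sketch of an unpreconditioned BiCGStab iteration in plain Python. It is not TetraMag's multi-GPU solver. The CSR helper, the function names and the small tridiagonal test matrix are all hypothetical and chosen only for illustration. The point to notice is that each iteration needs two sparse matrix-vector products, which is why the rest of the talk concentrates on that operation.

```python
def csr_matvec(indptr, indices, data, x):
    """y = A @ x for a matrix stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def bicgstab(matvec, b, tol=1e-10, max_iter=200):
    """Unpreconditioned BiCGStab starting from x = 0; returns (x, iterations)."""
    n = len(b)
    x = [0.0] * n
    r = list(b)              # residual b - A x for the zero start vector
    r_hat = list(r)          # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    b_norm = dot(b, b) ** 0.5 or 1.0
    for it in range(1, max_iter + 1):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = [r[i] + beta * (p[i] - omega * v[i]) for i in range(n)]
        v = matvec(p)                                   # first matrix-vector product
        alpha = rho_new / dot(r_hat, v)
        s = [r[i] - alpha * v[i] for i in range(n)]
        if dot(s, s) ** 0.5 <= tol * b_norm:
            return [x[i] + alpha * p[i] for i in range(n)], it
        t = matvec(s)                                   # second matrix-vector product
        omega = dot(t, s) / dot(t, t)
        x = [x[i] + alpha * p[i] + omega * s[i] for i in range(n)]
        r = [s[i] - omega * t[i] for i in range(n)]
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, it
        rho = rho_new
    return x, max_iter


def tridiagonal_csr(n, diag, off):
    """CSR arrays of an n x n tridiagonal test matrix (illustrative only)."""
    indptr, indices, data = [0], [], []
    for i in range(n):
        for j, a in ((i - 1, off), (i, diag), (i + 1, off)):
            if 0 <= j < n:
                indices.append(j)
                data.append(a)
        indptr.append(len(indices))
    return indptr, indices, data


indptr, indices, data = tridiagonal_csr(5, 4.0, -1.0)
x_true = [1.0, 2.0, 3.0, 4.0, 5.0]
b = csr_matvec(indptr, indices, data, x_true)
x, iterations = bicgstab(lambda vec: csr_matvec(indptr, indices, data, vec), b)
```

The solver only sees the matrix through the `matvec` callback, so a distributed multiplication could be swapped in without touching the iteration itself.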

  • Time Integration

    • Includes field calculations and vector operations

    • Performed using CVODE from the Sundials package

    • CVODE’s “NVector” structure and its operations have been ported to CUDA on single-host multi-GPU systems

    • Memory consuming (~1 KB/node, limiting factor for GPU usage)

        • Field calculations use a sparse matrix and several field vectors

        • CVODE internally needs many helper vectors

  • Properties of TetraMag’s Sparse Matrices

    • Contain ~15 non-zero elements per row/column

    • Distribution of elements depends on the underlying mesh (regular to seemingly random)

    • Symmetric (for magnetostatic field calculation) or asymmetric (for exchange field calculation)

    • Static over the whole program run

  • TetraMagCUDA

    • Around since ~2009 (see GTC 2010 poster [1])

    • CUDA parts were piggy-backed onto CPU routines

    • Tries to copy as many sparse matrices to device memory as possible; the remainder stays on the CPU

    • GPU-only execution limited to problem sizes of ~1M nodes

    • Sufficient for most use cases at the time

    [1] http://www.gputechconf.com/content/GTC/posters/2010/Q02-Massively-Parallel-Micromagnetic-FEM-Calculations-with-Graphical-Processing-Units.pdf

  • New Challenges

    • Simulating samples at experimental scale (µm) requires larger simulations (10M+ nodes)

    • A single matrix plus the solver/integrator vectors often exceeds device memory

    • Copying matrices sequentially is not sufficient

    Possible solutions:

    • Reduce memory footprint

    • Distribute problem over multiple GPUs
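A rough back-of-the-envelope sketch of why problem size runs into device memory, using the "~1 KB/node" estimate quoted for time integration. The 6 GiB device size is a hypothetical value chosen for the example, not a figure from the slides, and the function names are illustrative.

```python
import math

BYTES_PER_NODE = 1024    # the "~1 KB/node" estimate quoted in the talk
DEVICE_MEM_GIB = 6.0     # hypothetical card size, not taken from the slides


def state_memory_gib(n_nodes, bytes_per_node=BYTES_PER_NODE):
    """Approximate solver/integrator memory in GiB."""
    return n_nodes * bytes_per_node / 2**30


def min_gpus(n_nodes, device_mem_gib):
    """Smallest number of equal GPUs whose combined memory holds the data."""
    return max(1, math.ceil(state_memory_gib(n_nodes) / device_mem_gib))


for n_nodes in (1_000_000, 10_000_000, 16_000_000):
    print(n_nodes, round(state_memory_gib(n_nodes), 2), min_gpus(n_nodes, DEVICE_MEM_GIB))
```

Under these assumptions a 1M-node problem needs about 1 GiB, while 10M nodes need roughly 9.5 GiB and no longer fit on one such card.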

  • Reduce Memory Footprint

    • Use symmetry

        • Effective, but not possible for all matrices

        • Often needs atomic operations (slow)

    • Reduce precision of off-diagonal elements

        • Moderately effective (8+4 -> 4+4 bytes/value)

        • May lead to unacceptable loss of precision
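The symmetry idea can be sketched as follows, as a serial illustration rather than the CUDA kernel: only the diagonal and the strict upper triangle are stored, and every stored off-diagonal value is applied twice. The data layout (`diag` list plus `(i, j, value)` triples) is an assumption made for this example. The second update is a scattered write, which is the step that would need atomic operations when many GPU threads run concurrently.

```python
def symmetric_matvec(diag, upper, x):
    """y = A @ x for symmetric A, storing only diag and entries (i, j, a_ij) with i < j."""
    y = [d * xi for d, xi in zip(diag, x)]
    for i, j, a in upper:
        y[i] += a * x[j]    # contribution of A[i][j]
        y[j] += a * x[i]    # mirrored contribution of A[j][i]; scattered write
    return y


# 3 x 3 example: [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
diag = [2.0, 2.0, 2.0]
upper = [(0, 1, -1.0), (1, 2, -1.0)]
y = symmetric_matvec(diag, upper, [1.0, 2.0, 3.0])

# Bytes per stored off-diagonal entry, following the figures on the slide
full_precision = 8 + 4      # double value + 32-bit column index
reduced_precision = 4 + 4   # float value + 32-bit column index
```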

  • Reduce Memory Footprint (cont.)

    • Extract diagonals

        • Good for performance (coalesced access)

        • Very good results for symmetric matrices based on regular meshes (2 x (8+4) -> 8 bytes)

        • Mixed results otherwise

    • Combinations of some/all of the above
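A minimal sketch of diagonal extraction for a symmetric matrix, assuming a simple layout in which each extracted diagonal is a dense list keyed by its offset (the layout and names are illustrative). One stored 8-byte value then stands for both A[i][i+k] and A[i+k][i], and no column index is needed because the position follows from the offset. A GPU kernel would more likely gather per output row to keep accesses coalesced; the scatter form below is used only for brevity.

```python
def sym_dia_matvec(n, diagonals, x):
    """y = A @ x; diagonals maps offset k >= 0 to d with d[i] = A[i][i+k] = A[i+k][i]."""
    y = [0.0] * n
    for k, d in diagonals.items():
        for i in range(n - k):
            y[i] += d[i] * x[i + k]
            if k > 0:
                y[i + k] += d[i] * x[i]   # mirrored element below the main diagonal
    return y


# Tridiagonal 4 x 4 example with 2 on the diagonal and -1 next to it
diagonals = {0: [2.0, 2.0, 2.0, 2.0], 1: [-1.0, -1.0, -1.0]}
y = sym_dia_matvec(4, diagonals, [1.0, 2.0, 3.0, 4.0])

bytes_indexed_pair = 2 * (8 + 4)   # two mirrored entries, each value + index
bytes_extracted = 8                # one value, position implied by the diagonal
```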

  • Preprocessing Approaches for Matrix Distribution

    • Matrices can be preprocessed for GPU(s) in different ways:

        • Traditional single-GPU calculation

        • “Naive” multi-GPU distribution

        • Checkerboard-style distribution

  • “Traditional” MVM on a Single GPU

    [Figure: matrix x vector = result vector, computed on a single GPU]

  • Distribution over Multiple GPUs

    • Naive approach:

        • Divide matrix into N_GPU sub-matrices with N_row/N_GPU rows

        • Copy one sub-matrix to each GPU

        • Copy vector to all GPUs

        • Perform partial multiplications

        • Copy partial results to all other GPUs

        • Repeat (if needed)
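The naive scheme listed above can be sketched on the CPU as follows, with each "GPU" simulated by one loop pass. The row format (`rows[i]` as a list of `(column, value)` pairs) and all function names are assumptions for illustration. `elements_received` counts how many vector elements have to cross between devices before every multiplication, which is the overhead this approach suffers from.

```python
def split_rows(n_rows, n_gpu):
    """Row ranges [start, stop) of the n_gpu sub-matrices."""
    base, extra = divmod(n_rows, n_gpu)
    ranges, start = [], 0
    for g in range(n_gpu):
        stop = start + base + (1 if g < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def naive_matvec(rows, x, n_gpu):
    """Row-block distributed y = A @ x; every simulated GPU holds the full vector."""
    result = []
    for start, stop in split_rows(len(rows), n_gpu):   # one pass per simulated GPU
        x_local = list(x)                              # "copy vector to all GPUs"
        result.extend(sum(v * x_local[c] for c, v in rows[r]) for r in range(start, stop))
    return result


def elements_received(n_rows, n_gpu):
    """Vector elements copied between GPUs before each multiplication."""
    return sum(n_rows - (stop - start) for start, stop in split_rows(n_rows, n_gpu))


# Small tridiagonal test matrix, rows[i] = [(column, value), ...]
n = 7
rows = [[(j, 2.0 if j == i else -1.0) for j in (i - 1, i, i + 1) if 0 <= j < n]
        for i in range(n)]
x = [float(i + 1) for i in range(n)]
y = naive_matvec(rows, x, n_gpu=3)
```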

  • “Naive” Distribution, Preparation and First Multiplication

    [Figure: matrix x vector = result vector, built up in three steps]

    • Copy partial matrices to other GPUs

    • Copy vectors to other GPUs

    • Calculate

  • “Naive” Distribution, Subsequent Multiplications

    [Figure: result vector = matrix x vector, built up in two steps]

    • Copy vectors to other GPUs

    • Calculate

  • Naive Distribution Approach

    • Pro

        • Easy to implement

    • Con

        • All sub-matrices need vector data from all GPUs at the beginning of the calculation

        • Data transfer overhead is about as expensive as the actual calculation -> performance often below the single-GPU solution

  • Checkerboard Approach

    • Split each sub-matrix into N_GPU sub-sub-matrices

    • Each of these needs vector values from only one GPU

    • Perform multiplication of the first sub-sub-matrix with its partial vector

    • At the same time, copy the vector part needed for the next multiplication into a (double) buffer in a different stream

    • Repeat for the other sub-sub-matrices
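A serial sketch of the checkerboard loop described above, under the assumption that rows and vector parts are split into equal contiguous ranges (the names, the row format and the round-robin visiting order are illustrative, not taken from the talk). Each simulated GPU starts with the block that needs its own vector part. The assignment to `next_part` marks the place where a real implementation would issue an asynchronous copy into the second half of a double buffer while the current block is multiplied.

```python
def build_blocks(rows, n_gpu):
    """blocks[g][s]: entries of GPU g's rows whose columns are owned by GPU s."""
    size = -(-len(rows) // n_gpu)                 # rows per GPU, rounded up
    blocks = [[[] for _ in range(n_gpu)] for _ in range(n_gpu)]
    for r, entries in enumerate(rows):
        for c, v in entries:
            blocks[r // size][c // size].append((r, c, v))
    return blocks, size


def checkerboard_matvec(rows, x, n_gpu):
    blocks, size = build_blocks(rows, n_gpu)
    x_parts = [x[g * size:(g + 1) * size] for g in range(n_gpu)]   # one part per GPU
    y = [0.0] * len(rows)
    for g in range(n_gpu):                        # one pass per simulated GPU
        buffer = x_parts[g]                       # first block uses the local vector part
        for step in range(n_gpu):
            src = (g + step) % n_gpu              # GPU that owns the part now in use
            next_part = x_parts[(src + 1) % n_gpu]    # would be copied asynchronously
            for r, c, v in blocks[g][src]:
                y[r] += v * buffer[c - src * size]
            buffer = next_part                    # swap the double buffer
    return y


n = 8
rows = [[(j, 2.0 if j == i else -1.0) for j in (i - 1, i, i + 1) if 0 <= j < n]
        for i in range(n)]
x = [float(i + 1) for i in range(n)]
y = checkerboard_matvec(rows, x, n_gpu=4)
```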

  • “Checkerboard”-Style MVM

    [Figure: the matrix is split into a 4 x 4 grid of sub-sub-matrices. Row label: partial matrix on GPU 0-3. Column label: needs vector data from GPU 0-3. Further labels: partial vectors, double buffer, partial results.]

  • Matrix Preprocessing

    • Original matrix in CSR-like format

    • Blocks of N_warpsize rows are transposed to enable coalesced memory access

    • Distribution of data destroys uniformity of row lengths

    • Zero padding may be necessary -> wasted memory

    • Rows are sorted and re-indexed by number of non-zero elements -> minimal padding
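The effect of sorting rows before padding can be illustrated with a small sketch. The block size of 4 stands in for the warp size and the row lengths are made-up numbers; the function names are hypothetical. `block_column_major` shows what "transposing" a block means here: element k of every row in the block is stored contiguously, so neighbouring threads read neighbouring addresses.

```python
def padded_slots(row_lengths, block_rows):
    """Stored slots when each block of rows is padded to its longest row."""
    total = 0
    for start in range(0, len(row_lengths), block_rows):
        total += max(row_lengths[start:start + block_rows]) * block_rows
    return total


def sort_rows_by_length(row_lengths):
    """Re-indexing permutation: rows ordered by descending number of non-zeros."""
    return sorted(range(len(row_lengths)), key=lambda r: -row_lengths[r])


def block_column_major(block, width, pad=0.0):
    """Transposed layout of one block: element k of all rows is contiguous."""
    return [row[k] if k < len(row) else pad for k in range(width) for row in block]


row_lengths = [1, 7, 2, 7, 1, 2, 7, 1]           # made-up non-zero counts per row
order = sort_rows_by_length(row_lengths)
unsorted_slots = padded_slots(row_lengths, block_rows=4)
sorted_slots = padded_slots([row_lengths[r] for r in order], block_rows=4)
layout = block_column_major([[1.0, 2.0], [3.0]], width=2)
```

With these numbers the matrix holds 28 non-zeros; padding the unsorted rows stores 56 slots, while the sorted order needs 36.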

  • Optimizing Data Transfers

    • Depending on the problem, sub-sub-matrices can get very small and access only very few vector elements

        • Multiplication time is short

        • Transfer of potentially unneeded elements takes a much longer time

    • Solution: transfer only those elements that are really needed

  • Creating Export Vectors

    • Each GPU needs an individual set of vector elements from every other GPU:

    • Approach one: create N_GPU x (N_GPU-1) export vectors

        • Consumes much memory

    • Approach two: rewrite export vector N_GPU x (N_GPU-1) times

        • May need to read/write all elements N_GPU x (N_GPU-1) times

    • Approach three: do some (lots of) work in preprocessing

  • Building the Perfect(?) Set of Export Vectors

    • During preprocessing:

        • Build a key value that encodes which other GPUs need a given vector element

        • Using one bit for each target GPU results in at most 2^(N_GPU-1) key values (usually N_GPU…

  • Building the Perfect(?) Set of Export Vectors (cont.)

    • During multiplication:

        • Build export vector according to the index generated in preprocessing

        • During loop over all sub-sub-matrices:

            • Loop over all export blocks

            • Asynchronously copy blocks with elements needed from next GPU into buffer

  • Key values are calculated relative to the originating GPU

    [Figure: sparse matrix sketch with row blocks assigned to GPUs 0-3]

    • Vector elements accessed in cols 0-2 are originally on GPU 0; relative key bit values for GPUs 0, 1, 2, 3: 0, 1 (001), 2 (010), 4 (100)

        • 0 + 0 + 0 + 4 = 4

        • 0 + 1 + 2 + 0 = 3

    • Cols 6-8 are originally on GPU 2; relative key bit values for GPUs 0, 1, 2, 3: 2 (010), 4 (100), 0, 1 (001)

        • 0 + 4 + 0 + 1 = 5
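The relative key values in this example follow a simple rule that can be sketched as below: a target GPU that is d steps "after" the originating GPU (counting cyclically) contributes the bit 2^(d-1), and the originating GPU itself contributes nothing. The function names are hypothetical; the rule is inferred from the numbers on the slide.

```python
def relative_bit(origin_gpu, target_gpu, n_gpu):
    """Bit value of target_gpu in the key of an element owned by origin_gpu."""
    distance = (target_gpu - origin_gpu) % n_gpu
    return 0 if distance == 0 else 1 << (distance - 1)


def element_key(origin_gpu, reading_gpus, n_gpu):
    """Key of a vector element: OR of the bits of all GPUs that read it."""
    key = 0
    for gpu in reading_gpus:
        key |= relative_bit(origin_gpu, gpu, n_gpu)
    return key


bits_from_gpu0 = [relative_bit(0, g, 4) for g in range(4)]
bits_from_gpu2 = [relative_bit(2, g, 4) for g in range(4)]
key_a = element_key(0, [3], 4)       # element on GPU 0, read by GPU 3
key_b = element_key(0, [1, 2], 4)    # element on GPU 0, read by GPUs 1 and 2
key_c = element_key(2, [1, 3], 4)    # element on GPU 2, read by GPUs 1 and 3
```

Because the bits are relative, every GPU can use the same block layout and the same loop over export blocks, regardless of its own number.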

  • Build export buffer during preprocessing

    Keys from matrix for GPU 0:

    index      0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
    ref. key   4   2   6   3   0   4   0   6   7   6   1   2   5   4   4   0   3   1   2   3

    Reordered to the resulting export index:

    index     10  17   1  11  18   3  16  19   0   5  13  14  12   2   7   9   8   -   -   -
    ref. key   1   1   2   2   2   3   3   3   4   4   4   4   5   6   6   6   7   -   -   -
    binary   001 001 010 010 010 011 011 011 100 100 100 100 101 110 110 110 111   -   -   -

    Blocks sent to export streams during the multiplication loop:

    • Exported to GPU 1 (bit 001): blocks with keys 1, 3, 5, 7

    • Exported to GPU 2 (bit 010): blocks with keys 2, 3, 6, 7

    • Exported to GPU 3 (bit 100): blocks with keys 4, 5, 6, 7
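The reordering step can be sketched in a few lines, using the reference keys listed for GPU 0 in this example. Elements with key 0 are not needed by any other GPU and are left out; the rest are grouped by key, so that all elements with the same set of recipients form one contiguous block. The function names and the `(offset, length)` block representation are assumptions made for illustration.

```python
def build_export_index(keys):
    """Indices of exported elements, grouped by key (stable within each key)."""
    return sorted((i for i, k in enumerate(keys) if k != 0), key=lambda i: keys[i])


def export_blocks(keys, export_index):
    """Map each key to the (offset, length) of its block in the export vector."""
    blocks = {}
    for pos, i in enumerate(export_index):
        offset, length = blocks.get(keys[i], (pos, 0))
        blocks[keys[i]] = (offset, length + 1)
    return blocks


def blocks_for_target(blocks, distance):
    """Blocks whose key contains the bit of the GPU `distance` steps away."""
    bit = 1 << (distance - 1)
    return [blocks[k] for k in sorted(blocks) if k & bit]


keys = [4, 2, 6, 3, 0, 4, 0, 6, 7, 6, 1, 2, 5, 4, 4, 0, 3, 1, 2, 3]
export_index = build_export_index(keys)
blocks = export_blocks(keys, export_index)
to_gpu1 = blocks_for_target(blocks, 1)   # bit 001
to_gpu2 = blocks_for_target(blocks, 2)   # bit 010
to_gpu3 = blocks_for_target(blocks, 3)   # bit 100
```

In this ordering the blocks for the farthest GPU (keys 4 to 7) are adjacent, so they could be sent as a single contiguous copy; the other targets need several smaller copies.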

  • Pros and Cons

    • Pro:

        • Export vector is only built once per multiplication

        • No element needs to be stored more than once

        • No element is transferred more often than necessary

    • Con:

        • Limited number of GPUs, because the number of blocks grows exponentially

        • Time-consuming preprocessing

  • Further Optimisations

    • Sub-sub-matrix multiplications are ordered by size of matrix (number of non-zero elements)

        • More likely to correlate larger transfers with longer calculations

    • Export vectors are copied to host memory during initial calculations

        • Allows parallel import on devices with only one copy engine

  • Preprocessing Time

    • Preprocessing involves multiple expensive indexing and sorting steps

        • May take up to 10-20 seconds per matrix (with n ~20M)

        • Depends on the number of GPUs used

    • Happens only once, because matrices are static

    • Typical run includes millions of solver iterations/matrix multiplications

  • Benchmarks

    • Solver iteration times for:

        • Cube with regular mesh, very diagonal-dominant matrices, little data transfer

        • Two problems of similar size and nature:

            • Round disk with irregular mesh structure, few extractable diagonals, lots of data transfer

            • Round disk with partially regular mesh structure, some extractable diagonals, moderate data transfer

  • Time per solver iteration for cube 1 µm (8.1M nodes, regular mesh)

    [Chart: time per solver iteration [ms] (axis 0-30) versus number of GPUs (1-8); series: Titan, GTX 690]

    Comparison: 2 x E5-2680 (10 cores, 2.8 GHz): 134 ms

  • Time per solver iteration for disk (6.x M nodes, different mesh structures)

    [Chart: time per solver iteration [ms] (axis 0-40) versus number of GPUs (1-8); series: Titan, irreg. mesh; Titan, reg. mesh; GTX 690, irreg. mesh; GTX 690, reg. mesh; GTX 690, 1 GPU/card, irreg. mesh]

    Comparison: 2 x E5-2680 (10 cores, 2.8 GHz): 100 ms (reg.), 118 ms (irreg.)

  • Conclusions

    • Distribution of matrices and vectors over multiple GPUs allows us to simulate significantly larger samples

    • Performance scaling depends largely on the amount of data exchanged between GPUs

    • Optimising the mesh structure is very important in multi-GPU setups

  • Questions?