a new asynchronous solver for banded linear systems · 2017. 12. 24. · michael jandron &...
Michael Jandron & Anthony Ruffa – Naval Undersea Warfare Center // Approved for Public Release 1
A New Asynchronous Solver for Banded Linear Systems
URI Applied Math Seminar
November 12, 2015
Michael Jandron & Anthony Ruffa
Naval Undersea Warfare Center, Newport, RI
Raymond Roberts, NUWC, Newport, RI
Michael Warnock, NUWC, Newport, RI
Eric Blake, NUWC, Newport, RI
James Baglama, University of Rhode Island, Kingston, RI
Motivation
• Large sparse problems take a while to solve (days, months, years)
  – Direct methods are still useful
  – In FEA, substructuring, Schur complement, and multi-frontal methods are common and rely on a Gaussian elimination backbone, which is difficult to parallelize
  – Always looking for ways to increase levels of parallelization and decrease the communication bound
Looking for new techniques to complement these tried-and-true methods
Image source: simulia.com
Outline
Part I: Modified Forward Substitution
• Tridiagonal solver
  – Limitations and what it’s good for
• Pentadiagonal solver
• General banded solver
  – Theoretical speedup predictions
  – Development
  – Numerical implementation
  – Numerical benchmark against MKL PARDISO
• A method to do forward and backward substitution
  – Numerical benchmark against MKL DGTSV & DPTSV
• Summary
Part II: Modified Block LU
• Mechanics of approach
• Examples: Pentadiagonal to FEA
• Where we’re headed
Part I
Modified Forward Substitution
Method for Tridiagonal Systems
Given the following linear system
Augment an unknown to the system [1-3]
[1] Ruffa, A., “A Solution Approach for Lower Hessenberg Linear Systems,” ISRN Applied Mathematics (2011)
[2] Jandron, M., Ruffa, A., Baglama, J., “An Asynchronous Direct Solver for Banded Linear Systems,” Numerical Algorithms (2015, Submitted)
[3] Ruffa, A., Jandron, M., Toni, B., “Parallelized Solution of Banded Linear Systems,” STEAM-H Springer Series Contribution (2015, Submitted)
Method for Tridiagonal Systems
Split into two tasks: (1) and (2)
Principle of superposition applies
Last equation gives:
Final vectorized superposition
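The two-task split and the superposition step can be sketched in a few lines. This is a minimal sketch under assumptions that are not stated on the slides: the first unknown is the one chosen arbitrarily, the superdiagonal entries are nonzero, and the names `tmfs` and `sweep` are hypothetical. The test matrix (ones on all three diagonals) is picked only because its forward recurrence does not grow.

```python
import numpy as np

def tmfs(a, d, c, b):
    """Tridiagonal solve by two forward sweeps plus superposition.
    a: subdiagonal (a[0] unused), d: diagonal, c: superdiagonal (c[-1] unused)."""
    n = len(d)

    def sweep(x_first, rhs):
        # Rows 0..n-2 determine x[1..n-1] once x[0] is fixed.
        x = np.zeros(n)
        x[0] = x_first
        x[1] = (rhs[0] - d[0] * x[0]) / c[0]
        for i in range(1, n - 1):
            x[i + 1] = (rhs[i] - a[i] * x[i - 1] - d[i] * x[i]) / c[i]
        return x

    p = sweep(0.0, b)             # task 1: true RHS, x[0] = 0
    h = sweep(1.0, np.zeros(n))   # task 2: zero RHS, x[0] = 1
    # The last equation fixes the weight of the homogeneous solution.
    alpha = (b[-1] - a[-1] * p[-2] - d[-1] * p[-1]) / (a[-1] * h[-2] + d[-1] * h[-1])
    return p + alpha * h          # vectorized superposition

n = 100
a = d = c = np.ones(n)
b = np.random.default_rng(0).standard_normal(n)
x = tmfs(a, d, c, b)
A = np.diag(d) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
residual = np.max(np.abs(A @ x - b))
```

The two sweeps share no data, so they can run on separate threads; only the scalar weight couples them.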
System Details for Tridiagonal Systems
For each of tasks (1) and (2):
• Underdetermined system – solution to within a constant
• Choose one unknown arbitrarily and solve for the remaining unknowns
Limitations of Modified Forward Sub
[Figure: Backslash vs. MFS on two tridiagonal test cases, each showing error (b-Ax) and solution (x) against unknown (k) for k from 0 to 100; the error axis is scaled by 10^-14 in the first case and spans about -1 to 0.5 in the second]
Alternate methods?
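One way to see when a forward sweep loses accuracy is to look at how fast its homogeneous recurrence grows. The sketch below assumes a constant-coefficient tridiagonal matrix with rows [1, d, 1], which is an illustrative choice and not the matrix behind the plots; `growth_factor` is a hypothetical helper.

```python
import numpy as np

def growth_factor(d):
    """Largest |r| solving r^2 + d*r + 1 = 0, i.e. the per-step growth of
    the homogeneous recurrence x[i+1] = -d*x[i] - x[i-1]."""
    return float(np.max(np.abs(np.roots([1.0, d, 1.0]))))

g_mild = growth_factor(1.0)        # roots on the unit circle: no growth
g_dominant = growth_factor(4.0)    # diagonally dominant row: 2 + sqrt(3) per step
amplification = g_dominant ** 100  # growth accumulated over 100 unknowns
```

With d = 4 the sweep amplifies rounding error far beyond double precision range within 100 unknowns, which is the regime where pivoting, extra precision, or another method is needed.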
Solution Options
Option 1: A modified forward substitution scheme
Fast, but can be unreliable in some cases without a form of pivoting or precision control
Option 2: Using the pseudoinverse
General, but can be slower and memory intensive
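Option 2 can be illustrated with dense NumPy calls. This is only a sketch of the idea, assuming the underdetermined system is the first n-1 equations and that the last equation sets the weight of the nullspace vector; the function name is hypothetical and nothing here is taken from the authors' implementation.

```python
import numpy as np

def solve_with_pseudoinverse(A, b):
    top = A[:-1, :]                            # n-1 equations, n unknowns
    particular = np.linalg.pinv(top) @ b[:-1]  # one solution of the top block
    null_vec = np.linalg.svd(top)[2][-1]       # last right singular vector spans the nullspace
    # The remaining equation determines how much of the nullspace vector to add.
    weight = (b[-1] - A[-1] @ particular) / (A[-1] @ null_vec)
    return particular + weight * null_vec

n = 50
A = 4.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)  # rows [1, 4, 1]: forward sweeps grow quickly
b = np.ones(n)
x = solve_with_pseudoinverse(A, b)
error = np.max(np.abs(x - np.linalg.solve(A, b)))
```

The dense pseudoinverse and SVD cost far more time and memory than a sweep, which matches the trade-off listed for this option.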
Method for Pentadiagonal Systems
Given the following linear system
Add two variables
Method for Pentadiagonal Systems
Split into three tasks: (1), (2), and (3)
Method for Pentadiagonal Systems
Principle of superposition:
Last two equations give a constraint linear system:
How does it work for general banded systems?
Extension to Banded Systems
Independent linear systems
Extra variables
All related through superposition
Constraint Matrix
Final solution through superposition
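The banded steps (independent sweeps, extra variables, constraint matrix, superposition) can be sketched as follows. The assumptions are mine: dense storage for brevity, the first q unknowns treated as the extra variables, nonzero entries on the outermost superdiagonal, and hypothetical names `bmfs` and `sweep`. The test size is kept small because this particular matrix makes the sweeps grow.

```python
import numpy as np

def bmfs(A, b, p, q):
    """Banded solve with p subdiagonals and q superdiagonals via q+1 forward sweeps."""
    n = len(b)

    def sweep(x_init, rhs):
        # Row i determines x[i+q] from the unknowns already computed.
        x = np.zeros(n)
        x[:q] = x_init
        for i in range(n - q):
            lo = max(0, i - p)
            x[i + q] = (rhs[i] - A[i, lo:i + q] @ x[lo:i + q]) / A[i, i + q]
        return x

    P = sweep(np.zeros(q), b)                                         # true RHS, extra variables zero
    H = np.column_stack([sweep(e, np.zeros(n)) for e in np.eye(q)])   # one sweep per extra variable
    C = A[n - q:] @ H                                                 # q x q constraint matrix from the last q rows
    w = np.linalg.solve(C, b[n - q:] - A[n - q:] @ P)
    return P + H @ w                                                  # final superposition

n, p, q = 16, 2, 2
A = 5.0 * np.eye(n) + sum(np.eye(n, k=k) for k in (-2, -1, 1, 2))
b = np.arange(1.0, n + 1.0)
x = bmfs(A, b, p, q)
error = np.max(np.abs(x - np.linalg.solve(A, b)))
```

With q = 1 this reduces to the two-task tridiagonal case, and with q = 2 to the three tasks and 2 x 2 constraint system of the pentadiagonal case.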
Numerical Implementation
Master thread:
• Request solution
• Broadcast to each available core
• Begin asynchronous forward substitution as it arrives
• Send extra variables back as they are formed
• Once all extra variables come back, tackle constraint matrix using any dense solver
• Level 1 superposition to get final solution
Even the constraint matrix can be split up if desired
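The master/worker flow above can be mimicked with Python's standard thread pool. This sketch only illustrates the orchestration; the reported implementation is FORTRAN 90 with OpenMP, and Python threads will not show the speedups discussed. The names `sweep` and `bmfs_async`, the dense storage, and the choice of the first q unknowns as extra variables are assumptions.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor, as_completed

def sweep(A, rhs, x_init, p, q):
    # One independent forward substitution; needs no data from other sweeps.
    n = len(rhs)
    x = np.zeros(n)
    x[:q] = x_init
    for i in range(n - q):
        lo = max(0, i - p)
        x[i + q] = (rhs[i] - A[i, lo:i + q] @ x[lo:i + q]) / A[i, i + q]
    return x

def bmfs_async(A, b, p, q, workers=4):
    n = len(b)
    H = np.zeros((n, q))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Master: broadcast the particular task (tag -1) and q homogeneous tasks.
        jobs = {pool.submit(sweep, A, b, np.zeros(q), p, q): -1}
        for k in range(q):
            e = np.zeros(q)
            e[k] = 1.0
            jobs[pool.submit(sweep, A, np.zeros(n), e, p, q)] = k
        # Collect results in whatever order the workers finish.
        for fut in as_completed(jobs):
            k = jobs[fut]
            if k < 0:
                P = fut.result()
            else:
                H[:, k] = fut.result()
    # All sweeps are back: dense q x q constraint solve, then superposition.
    w = np.linalg.solve(A[n - q:] @ H, b[n - q:] - A[n - q:] @ P)
    return P + H @ w

n, p, q = 16, 2, 2
A = 5.0 * np.eye(n) + sum(np.eye(n, k=k) for k in (-2, -1, 1, 2))
b = np.arange(1.0, n + 1.0)
x = bmfs_async(A, b, p, q)
error = np.max(np.abs(x - np.linalg.solve(A, b)))
```

Because the result does not depend on completion order, no synchronization is needed until the constraint solve.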
Banded Systems Expected Speedups
Symbols: speedup (X), number of superdiagonals (q), number of subdiagonals, number of unknowns (n)
[Cost table: banded Gaussian elimination, forward / backward substitution, dense constraint matrix solve, superposition; compared for Seq. LU, Seq. BMFS, and q-core BMFS, with entries marked "Same cost" and "Speedup"]
• Tridiagonal should be ~2x faster than sequential LU
• Pentadiagonal should be ~8x faster than sequential LU
• Heptadiagonal should be ~18x faster than sequential LU
Banded Systems Expected Speedups
Anticipated speedup over sequential LU using various numbers of cores
[Figure: speedup (X) vs. q for 1-core, 8-core, and q-core BMFS, in two panels: n = 1,000,000 and n = 1,000,000,000]
1 core is 0.5X; 8-core is 4X
For the same number of cores, LU (e.g. multi-frontal) must scale to these levels in order to match speed
Banded Systems Expected Speedups
We know optimal locations for max speedup over sequential LU
For the same number of cores, LU (e.g. multi-frontal) must scale to these levels in order to match speed
Symbols: speedup (X), number of superdiagonals (q), number of subdiagonals, number of unknowns (n)
Numerical Benchmarks
Tests dependence without exponential growth
For simplicity let’s just consider symmetric cases
Implementation
• FORTRAN 90
• OpenMP with 8 cores
• PARDISO 5.0.0 [1-3] solver using 8 cores
[1] M. Luisier, O. Schenk, et al., Fast Methods for Computing Selected Elements of the Green's Function in Massively Parallel Nanoelectronic Device Simulations, Euro-Par 2013, LNCS 8097, F. Wolf, B. Mohr, and D. an Mey (Eds.), Springer-Verlag Berlin Heidelberg, pp. 533–544, 2013.
[2] O. Schenk, M. Bollhoefer, and R. Roemer, On large-scale diagonalization techniques for the Anderson model of localization. Featured SIGEST paper in the SIAM Review selected "on the basis of its exceptional interest to the entire SIAM community". SIAM Review 50 (2008), pp. 91-112.
[3] O. Schenk, A. Waechter, and M. Hagemann, Matching-based Preprocessing Algorithms to the Solution of Saddle-Point Problems in Large-Scale Nonconvex Interior-Point Optimization. Journal of Computational Optimization and Applications, pp. 321-341, Volume 36, Numbers 2-3 / April, 2007.
Numerical Results with 8-cores
Wall time was less than PARDISO in certain cases without even scaling
Symbols: speedup (X), number of superdiagonals (q), number of unknowns (n)
[Figure: FORTRAN OpenMP speedup results, PARDISO 8 cores vs. BMFS 8 cores. Curves: BMFS 8-core; BMFS 8-core q x q solve; BMFS q-core scaled; PARDISO 8-core]
Numerical Results with 8-cores
Increased error likely due to round off and errors inherent in constraint matrix solve
Symbols: speedup (X), number of superdiagonals (q), number of unknowns (n)
[Figure: FORTRAN OpenMP speedup results, PARDISO 8 cores vs. BMFS 8 cores, in four panels: n = 100,000; n = 500,000; n = 1,000,000; n = 5,000,000. Curves: BMFS 8-core; BMFS 8-core q x q solve; BMFS q-core scaled; PARDISO 8-core]
Speedup over PARDISO Solver
Symbols: speedup (X), number of superdiagonals (q), number of unknowns (n)
From actual wall times: 8-core BMFS vs. 8-core PARDISO
By scaling BMFS to q-cores (not q x q solve part) vs. 8-core PARDISO
A Method to Split the Tridiagonal System
Split at equation
Consider in top half and in lower half
Split into four independent tasks: (1), (2), (3), (4)
A modified forward substitution process for (1) and (2)
A modified backward substitution process for (3) and (4)
A Method to Split the Tridiagonal System
Still built on superposition principle
-th equation forms constraint
Goal to determine weights of these two solutions
From
From
Final superposition
or modified fwd/back sub.
or modified fwd/back sub.
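The four-task split can be sketched as two forward sweeps from the top and two backward sweeps from the bottom, joined at a split equation. The details below are one possible reading, not the slide's formulas: the split index `s` (0-based), the use of both a matching condition at x[s] and equation s itself to fix the two weights, and the name `tmfbs` are all assumptions.

```python
import numpy as np

def tmfbs(a, d, c, b, s):
    """a: subdiagonal (a[0] unused), d: diagonal, c: superdiagonal (c[-1] unused),
    s: 0-based split equation with 1 <= s <= n-2."""
    n = len(d)

    def forward(x0, rhs):       # rows 0..s-1 give x[1..s] from x[0]
        x = np.zeros(s + 1)
        x[0] = x0
        x[1] = (rhs[0] - d[0] * x[0]) / c[0]
        for i in range(1, s):
            x[i + 1] = (rhs[i] - a[i] * x[i - 1] - d[i] * x[i]) / c[i]
        return x

    def backward(xn, rhs):      # rows n-1..s+1 give x[n-2..s] from x[n-1]
        x = np.zeros(n)
        x[n - 1] = xn
        x[n - 2] = (rhs[n - 1] - d[n - 1] * x[n - 1]) / a[n - 1]
        for i in range(n - 2, s, -1):
            x[i - 1] = (rhs[i] - d[i] * x[i] - c[i] * x[i + 1]) / a[i]
        return x

    zero = np.zeros(n)
    pt, ht = forward(0.0, b), forward(1.0, zero)     # tasks (1), (2)
    pb, hb = backward(0.0, b), backward(1.0, zero)   # tasks (3), (4)
    # Two conditions: both halves agree on x[s], and equation s is satisfied.
    M = np.array([[ht[s], -hb[s]],
                  [a[s] * ht[s - 1] + d[s] * ht[s], c[s] * hb[s + 1]]])
    r = np.array([pb[s] - pt[s],
                  b[s] - a[s] * pt[s - 1] - d[s] * pt[s] - c[s] * pb[s + 1]])
    alpha, beta = np.linalg.solve(M, r)
    x = pb + beta * hb                # lower half (entries above s are overwritten next)
    x[:s + 1] = pt + alpha * ht       # upper half, including x[s]
    return x

n = 100
a = d = c = np.ones(n)
b = np.random.default_rng(0).standard_normal(n)
x = tmfbs(a, d, c, b, s=50)
A = np.diag(d) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
residual = np.max(np.abs(A @ x - b))
```

Each sweep covers only half of the unknowns, which is where the additional factor of two over the forward-only version would come from.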
Tridiagonal Numerical Experiments
• TMFS uses 2 parallel threads
• TMFBS uses 4 parallel threads
• 1.6X and 3.2X faster than MKL
  – DPTSV (sym, pos-def solver – LDLT)
  – DGTSV (Gauss elim with PP)
• Slightly more error in Euclidean norm using fwd/back sub.
• Parallelization efficiency near optimal (value of unity)
Can incorporate this into banded algorithm for increased speed
Summary
• Developed a direct solver that can skip the Gaussian elimination process while solving banded linear systems [1,2]
• Built on a superposition principle
• Fastest for banded systems without exponential growth
  – Observed speedup of over 20x versus 8-thread PARDISO when using 8 threads
  – 1.6X faster than sequential MKL DPTSV when using 2 threads
• Can handle exponential growth by incorporating nullspace and pseudoinverse calculations, but this becomes slower
• Splitting the system saw a near-ideal 2x speed increase for large n
  – 3.2X faster than sequential MKL DPTSV when using 4 threads
• Future work involves:
  – Distributed memory/MPI/GPU computing
  – Further partitioning?
  – Extension to arbitrary bandwidth?
This leads into Part II
[1] Jandron, M., Ruffa, A., Baglama, J., “An Asynchronous Direct Solver for Banded Linear Systems,” Numerical Algorithms (Submitted)
[2] Ruffa, A., Jandron, M., Toni, B., “Parallelized Solution of Banded Linear Systems,” STEAM-H Springer Series Contribution (Submitted)
Part II
Modified Block LU Approach
A Method to Split the Tridiagonal System
The forward and backward substitution approach works for tridiagonal systems when each solution involves only one RHS term. (Solutions can be performed in parallel without any communication between processors.) However, when there is more than one superdiagonal, there are complications…
[Figure labels: forward substitution; backward substitution; row corresponding to the single RHS term]
A Method to Split the Pentadiagonal System
The forward and backward substitution approach applied to a pentadiagonal system leads to three remaining rows and three unconstrained RHS terms. There are several ways to remove the additional two RHS terms, but they are all complicated…
[Figure labels: forward substitution; backward substitution; three remaining rows & RHS terms]
Example: The Beam Vibration Problem
Single RHS term Oscillatory response
Three remaining RHS terms Evanescent response
Modular/Block Solution
Introduce two RHS terms so that the solution is computed only between those equations. This suppresses the exponential terms: the introduced RHS terms can be made close enough to allow the sinusoidal terms to dominate. However, a method is needed to remove the fictitious RHS terms…
[Figure labels: compute the solution here only; introduce fictitious RHS term to this row (two rows); solution is zero in these regions]
Modular/Block Solution
• The beam vibration problem (solved via finite differences) leads to a Toeplitz system having row structure [1, -4, 6+γ, -4, 1].
• Consider a system having 3001 nodes and a single nonzero RHS term in equation 2001. Fictitious nonzero RHS terms are introduced into equations 1751 & 2151, leading to
  – Equation 1751: x_1749 - 4x_1750 + (6+γ)x_1751 - 4x_1752 + x_1753 = b_1751
  – Equation 2151: x_2149 - 4x_2150 + (6+γ)x_2151 - 4x_2152 + x_2153 = b_2151
• Introducing the nonzero b_1751 term allows us to set x_1753 = b_1751 and then set x_i = 0 for 1 ≤ i ≤ 1752. In the same way, we can set x_2149 = b_2151 and then set x_i = 0 for 2150 ≤ i ≤ 3001.
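The row arithmetic behind the two fictitious terms can be checked numerically. The sketch below applies the [1, -4, 6+γ, -4, 1] rows to a vector that is zero outside nodes 1753 to 2149 and confirms that equations 1751 and 2151 reduce to b_1751 = x_1753 and b_2151 = x_2149. The value of γ, the random interior values, and the helper name `apply_beam_rows` are assumptions for illustration; the interior values are not the beam solution.

```python
import numpy as np

def apply_beam_rows(x, gamma):
    """Return b = A x for Toeplitz rows [1, -4, 6+gamma, -4, 1] (rows truncated at the ends)."""
    x = np.asarray(x, dtype=float)
    b = (6.0 + gamma) * x
    b[1:] -= 4.0 * x[:-1]
    b[:-1] -= 4.0 * x[1:]
    b[2:] += x[:-2]
    b[:-2] += x[2:]
    return b

gamma = 0.1                      # hypothetical value
n = 3001
x = np.zeros(n)
# Slides use 1-based numbering: x_1753..x_2149 are x[1752..2148] here.
x[1752:2149] = np.random.default_rng(0).standard_normal(2149 - 1752)
b = apply_beam_rows(x, gamma)

b_1751, x_1753 = b[1750], x[1752]
b_2151, x_2149 = b[2150], x[2148]
# Equations 1..1750 and 2152..3001 only touch zero unknowns.
outside_is_zero = (not np.any(b[:1750])) and (not np.any(b[2151:]))
```

Equations 1752 and 2150 still couple to the interior, so in the actual method the interior values must be chosen so that those rows have zero RHS as well.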
Modular/Block Solution
This figure shows the modular solution for the beam problem, with RHS terms introduced in equations 1751 & 2151. The oscillatory solution component outside nodes 1753 & 2149 is exactly zero, leaving only an evanescent solution component. We filed an invention disclosure to apply this approach towards vibration suppression…
Modular Solution Using Backslash (1)
A key insight: implement the modular solution with existing solvers, e.g., MATLAB “backslash.” This makes the approach general so that it can be used for any banded or block banded system.
[Figure labels: compute the solution here only; introduce fictitious RHS terms corresponding to these terms (two places)]
Example: Beam Vibration Problem
n=300; p=101; q=200
Confined Solution
RHS Vector
Modular Solution Using Backslash (2)
We can solve the upper and lower systems in parallel. Each is a solution to the system. There are four overlapping RHS terms. We compute four solutions and then develop weights to superimpose them to get the specified RHS vector.
[Figure labels: “Upper” solution, introduce RHS terms to upper solution corresponding to these terms; “Lower” solution, introduce RHS terms to lower solution corresponding to these terms]
Solving Two Smaller Systems in Parallel
“Lower” solution: additional terms associated with the lower solution
“Upper” solution: additional terms associated with the upper solution
Confined Solution + RHS Vector
Upper solution RHS vector for upper solution
Lower solution RHS vector for lower solution
Formal Idea of Modified Block LU
• Efficient implementation performs LU decomposition on the upper and lower systems, then computes the solution for each right-hand side in parallel
• Matrix multiplications are also performed in parallel
• Constraint matrix easy to solve using Schur complement / static condensation
• Weighted superposition
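A two-block version of this idea can be sketched with standard block elimination. The sketch is not taken from the slides: it assumes a dense matrix with half bandwidth q, a split after row m, and nonsingular diagonal blocks, and the name `two_block_solve` is hypothetical. For a pentadiagonal matrix (q = 2) it uses 2q = 4 coupling columns, which matches the four overlapping RHS terms mentioned earlier.

```python
import numpy as np

def two_block_solve(A, b, q, m):
    """Solve A x = b by two independent block solves plus a 2q x 2q constraint system."""
    A11, A22 = A[:m, :m], A[m:, m:]
    E1 = A[:m, m:m + q]     # upper rows coupled to the first q lower unknowns
    E2 = A[m:, m - q:m]     # lower rows coupled to the last q upper unknowns
    # Independent solves: one factorization per block, q+1 right-hand sides each.
    Y1 = np.linalg.solve(A11, np.column_stack([b[:m], E1]))
    Y2 = np.linalg.solve(A22, np.column_stack([b[m:], E2]))
    g1, V1 = Y1[:, 0], Y1[:, 1:]
    g2, V2 = Y2[:, 0], Y2[:, 1:]
    # With s = x1[-q:] and t = x2[:q]:  x1 = g1 - V1 t,  x2 = g2 - V2 s.
    C = np.block([[np.eye(q), V1[-q:]],
                  [V2[:q], np.eye(q)]])
    st = np.linalg.solve(C, np.concatenate([g1[-q:], g2[:q]]))
    s, t = st[:q], st[q:]
    return np.concatenate([g1 - V1 @ t, g2 - V2 @ s])   # weighted superposition

n, q, m = 40, 2, 20
gamma = 1.0   # arbitrary positive value so the test matrix is well conditioned
A = ((6.0 + gamma) * np.eye(n)
     - 4.0 * (np.eye(n, k=1) + np.eye(n, k=-1))
     + np.eye(n, k=2) + np.eye(n, k=-2))
b = np.random.default_rng(0).standard_normal(n)
x = two_block_solve(A, b, q, m)
error = np.max(np.abs(x - np.linalg.solve(A, b)))
```

The two `np.linalg.solve` calls on the blocks are independent of each other and of the right-hand-side columns, so they are the part that parallelizes.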
Pentadiagonal Case
Key: only need solutions
Symbols: number of subdiagonals, number of superdiagonals
[Figure: solution (x) and error (b-Ax) vs. unknown (k), Backslash vs. Modified Block LU; error axis scaled by 10^-11]
Demonstration with Notional FEA Model
Notional FEA model: 2261 linear 3D shell elements, 2280 nodes
• Original system 13,476 x 13,476
• Reordered to half bandwidth q=124
• Constraint matrix 250 x 250
• 250 parallel solutions; optimal with 250 cores
[Figure: error (b-Ax) vs. unknown (k), Backslash vs. Modified Block LU; error axis scaled by 10^-14]
Pentadiagonal Case with 3 x 3 Grid
[Figure: error (b-Ax) and solution (x) vs. unknown (k), Backslash vs. Modified Block LU; error axis scaled by 10^-10]
3D Structured Mesh Case with 3 x 3 Grid
[Figure: solution (x) and error (b-Ax) vs. unknown (k), Backslash vs. Modified Block LU; solution axis scaled by 10^-8, error axis scaled by 10^-12]
3D Tetrahedral Mesh Case with 3 x 3 Grid
[Figure: error (b-Ax) and solution (x) vs. unknown (k), Backslash vs. Modified Block LU; error axis scaled by 10^-14, solution axis scaled by 10^-11]
Conclusions
• Developing a general Block LU solver built on a superposition principle
• When implemented in parallel the cost should be less than existing methods
• Future Work:
  – Thorough cost analysis
  – GPU/MPI implementation
  – Extension to arbitrary partitioning
  – Comparison against banded solver SPIKE as well as sparse solvers PARDISO, MUMPS
• End goal is to develop a competitive direct solver for banded systems with an eye on FEA applications
BACKUP
Algorithm for Tridiagonal Modified Forward Substitution (TMFS)
Algorithm for Tridiagonal Modified Forward and Backward Substitution (TMFBS)
Can be implemented recursively
Requires parallel tasks where is the half bandwidth of the system and is the number of levels
Gray regions denote unreferenced regions at that particular level
[Figure panels: First Level, Second Level, Third Level]