generic compressed matrix insertion p eter g ottschling – s mart s oft /tud d ag l indbo – k...
DESCRIPTION
Generic library for high-performance numeric operations in mathematical notation Many new techniques as implicit enable-if and meta- tuning Most modern iterative solvers Focus on high-performance simulation: FEM/XFEM/FVM/FDM Commercial version in preparation Parallel version in progress Multi-core, GPU support and multigrid in near future Matrix Template Library 4TRANSCRIPT
Generic Compressed Matrix InsertionPETER GOTTSCHLING – SMARTSOFT/TUD
DAG LINDBO – KUNGLIGA TEKNISKA HÖGSKOLAN
SmartSoft – TU [email protected].: +49 (0) 351 463 34018
• Software libraries• MTL4• FEniCS
• Compressed sparse matrices• Insertion• Benchmarks• Vision
Overview
• Generic library for high-performance numeric operations in mathematical notation
• Many new techniques as implicit enable-if and meta-tuning
• Most modern iterative solvers• Focus on high-performance simulation:
FEM/XFEM/FVM/FDM• Commercial version in preparation
• Parallel version in progress• Multi-core, GPU support and multigrid in near future
Matrix Template Library 4
Innovative Produktentwicklung durchFinite-Elemente-Methode (FEM)
Innovative Produktentwicklung durch
template < class LinearOperator, class HilbertSpaceX, class HilbertSpaceB, class Preconditioner, class Iteration >int cg(const LinearOperator& A, HilbertSpaceX& x, const HilbertSpaceB& b, const Preconditioner& M, Iteration& iter){ typedef typename mtl::Collection<HilbertSpaceX>::value_type Scalar; Scalar rho, rho_1, alpha, beta; HilbertSpaceX p(size(x)), q(size(x)), r(size(x)), z(size(x)); r = b - A*x; while (! iter.finished(r)) { z = solve(M, r); rho = dot(r, z); if (iter.first()) p = z; else { beta = rho / rho_1; p = z + beta * p; } q = A * p; alpha = rho / dot(p, q); x += alpha * p; r -= alpha * q; rho_1 = rho; ++iter; } return iter;}
Linearer Gleichungslöser
• Free software for solving differential equations• FFC – FEniCS Form Compiler
• High-level math language for formulating differential equations
• Generate C++ code • DOLFIN – generic FEM kernel
• C++ library for FEM cores: assembler, mesh and function abstraction
• Interface to uBLAS, PETSc, Trillinos, and MTL4
• Paper focus in matrix assembly
FEniCS
Compressed Sparse Row Format
• Most common general-purpose sparse format
• Entries sorted• Kind of run-
length encoding on rows
In-Flight Insertion
• Very simple use• Like dense
matrices• Simple realization• Extremely
expensive• All following entries
are changed• Quadratic
complexity
A[0][1]= 6;
• Dedicated insertion phase• Matrix is available after terminating insertion• Later modification impossible• Works for distributed matrices as well
• Used in PETSc, includes construction of communication buffers for dist. SpMVP
• Janus derives its name from it (two faces)
Two-phase Insertion
• Inserter = object providing operations to set up other objects, e.g. matrices or vectors, efficiently
• Insertion phase lasts as long as inserter lives• Insert within a scope (block, function)
• Matrix ready when inserter destroyed• Later insertion possible with another inserter• Extends to distributed matrices and vectors• MTL4 inserters have minimal memory usage
Inserter Concept in MTL4
int main(int argc, char* argv[]){
compressed2D<float> A(3, 5); { matrix::inserter<compressed2D<float> > ins(A);ins[0][0] << 1.0; ins[0][2] << 2.0;ins[1][3] << 3.0;ins[2][1] << 4.0; ins[2][4] << 5.0; } std::cout << "A is\n" << A << '\n'; return 0;
}
Using Inserters
Direct Insertion
• Reserve s entries per row
• Find insert position• By linear or binary
search• Move remainder in
row• Linear in s
• That is constant
A[0][1]= 6;
Indirect Insertion
• For saturated rows use “spare” container
• std::map of index pair• Logarithmic in number
of spare entries• Additional allocation• About 10 times slower
than direct insertion
A[0][4]= 7;
• Assemble CRS matrix• Row order important, and order within row• Performance measure: number of non-zeros inserted per second• Reassembly• Three libraries: uBLAS (including vector-of-vector), MTL4, PETSc• Ordinary workstation (Intel)• All benchmarks run in a simple interface routine for each library, e.g.
Benchmark
void insert row(Matrix& A, int row_idx, int cols_idx, double a, int n)∗ ∗{
for(int j=0; j<n; j++) A(row_idx , cols_idx[j]) += a[j];
}
• 10,000 rows, 5 non-zeros/row• MTL4: 46 million entries per second• uBLAS: 5.9 million entries per second• uBLAS (gov): 2 million entries per second• PETSc: 22 million entries per second
Benchmark: Assembly rate with ascending rows
• 100,000 rows, 50 non-zeros/row• MTL4: 29.6 million entries per second• uBLAS: 6.5 million entries per second• uBLAS (gov): 2.8 million entries per second• PETSc: 32.3 million entries per second
Benchmark: Assembly rate with ascending rows
• 10,000 rows, 5 non-zeros/row• MTL4: 41.4 million entries per second• uBLAS: 31,300 entries per second• uBLAS (gov): 1.9 million entries per second• PETSc: 19.9 million entries per second
Benchmark: Assembly rate with random rows
• 100,000 rows, 50 non-zeros/row• MTL4: 25.6 million entries per second• uBLAS: measuring abandonned• uBLAS (gov): 2.7 million entries per second• PETSc: 25.6 million entries per second
Benchmark: Assembly rate with random rows
• 10,000 rows, 5 non-zeros/row• MTL4: 4.8 million entries per second• uBLAS: 16,700 entries per second• uBLAS (gov): 1.8 million entries per second• PETSc: 15,900 entries per second
Benchmark: Assembly rate with entirely random entries
• 10,000 rows, 50 non-zeros/row• MTL4: 2.9 million entries per second• uBLAS: 3,340 entries per second• uBLAS (gov): 1.7 million entries per second• PETSc: 13,400 entries per second
Benchmark: Assembly rate with random rows
How to do Science in Silicon?
Graphic application
CPUGPU
ScientificSoftware
Scientific application
CPU
GPU Multi-Core Par. Arch. Scien. Proc.
• Introduced new approach for setting and modifying compressed sparse matrices
• Does not need preparation phase• Minimal memory footprint• Optimal performance• Tuned block-insertion under progress• Extends to distributed data structures
Conclusions