
Generic Compressed Matrix Insertion

PETER GOTTSCHLING – SMARTSOFT/TUD

DAG LINDBO – KUNGLIGA TEKNISKA HÖGSKOLAN

SmartSoft – TU [email protected]

Tel.: +49 (0) 351 463 34018

• Software libraries
• MTL4
• FEniCS
• Compressed sparse matrices
• Insertion
• Benchmarks
• Vision

Overview

• Generic library for high-performance numeric operations in mathematical notation

• Many new techniques, such as implicit enable-if and meta-tuning

• Most modern iterative solvers

• Focus on high-performance simulation: FEM/XFEM/FVM/FDM

• Commercial version in preparation

• Parallel version in progress

• Multi-core, GPU support and multigrid in near future

Matrix Template Library 4

Innovative product development through the finite element method (FEM)


template < class LinearOperator, class HilbertSpaceX, class HilbertSpaceB,
           class Preconditioner, class Iteration >
int cg(const LinearOperator& A, HilbertSpaceX& x, const HilbertSpaceB& b,
       const Preconditioner& M, Iteration& iter)
{
    typedef typename mtl::Collection<HilbertSpaceX>::value_type Scalar;
    Scalar rho, rho_1, alpha, beta;
    HilbertSpaceX p(size(x)), q(size(x)), r(size(x)), z(size(x));

    r = b - A*x;                    // initial residual
    while (! iter.finished(r)) {
        z = solve(M, r);            // apply the preconditioner
        rho = dot(r, z);
        if (iter.first())
            p = z;                  // first search direction
        else {
            beta = rho / rho_1;     // rho_1 is set at the end of the previous pass
            p = z + beta * p;
        }
        q = A * p;
        alpha = rho / dot(p, q);    // step length
        x += alpha * p;
        r -= alpha * q;
        rho_1 = rho;
        ++iter;
    }
    return iter;
}

Linear Equation Solver

• Free software for solving differential equations

• FFC – FEniCS Form Compiler
  • High-level math language for formulating differential equations
  • Generates C++ code

• DOLFIN – generic FEM kernel
  • C++ library for FEM cores: assembler, mesh and function abstraction
  • Interface to uBLAS, PETSc, Trilinos, and MTL4

• Paper focuses on matrix assembly

FEniCS

Compressed Sparse Row Format

• Most common general-purpose sparse format

• Entries sorted

• Kind of run-length encoding on rows
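The format can be sketched with three arrays: row offsets, column indices, and values. The following is a minimal illustration, not the MTL4 data layout; the names `Csr`, `csr_at`, and `example_matrix` are hypothetical. The sample data is the 3x5 matrix that the inserter example later in this document assembles.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative CSR container (hypothetical names, not the MTL4 layout).
struct Csr {
    std::size_t rows = 0, cols = 0;
    std::vector<std::size_t> row_start;  // size rows + 1; row r occupies [row_start[r], row_start[r+1])
    std::vector<std::size_t> col_index;  // column of each stored entry, ascending within a row
    std::vector<double>      value;      // value of each stored entry
};

// Read entry (r, c); entries that are not stored are zero.
double csr_at(const Csr& A, std::size_t r, std::size_t c) {
    assert(r < A.rows && c < A.cols);
    for (std::size_t k = A.row_start[r]; k < A.row_start[r + 1]; ++k) {
        if (A.col_index[k] == c) return A.value[k];
        if (A.col_index[k] > c) break;   // sorted row: c cannot appear later
    }
    return 0.0;
}

// 3x5 sample: (0,0)=1 (0,2)=2 | (1,3)=3 | (2,1)=4 (2,4)=5
Csr example_matrix() {
    Csr A;
    A.rows = 3;
    A.cols = 5;
    A.row_start = {0, 2, 3, 5};
    A.col_index = {0, 2, 3, 1, 4};
    A.value     = {1.0, 2.0, 3.0, 4.0, 5.0};
    return A;
}
```

The row offsets play the role of the run lengths: row r holds `row_start[r + 1] - row_start[r]` entries, so the row index of an entry is never stored explicitly.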

In-Flight Insertion

• Very simple use
  • Like dense matrices

• Simple realization

• Extremely expensive
  • All following entries are changed
  • Quadratic complexity

A[0][1]= 6;
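The cost can be made visible with a small sketch that writes straight into the compressed arrays and reports how many stored entries it had to shift. This is an illustration under assumed names (`Csr`, `set_in_flight`), not code from any of the libraries discussed here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative CSR arrays (hypothetical names, not a library API).
struct Csr {
    std::size_t rows = 0;
    std::vector<std::size_t> row_start;  // size rows + 1
    std::vector<std::size_t> col_index;
    std::vector<double>      value;
    explicit Csr(std::size_t r) : rows(r), row_start(r + 1, 0) {}
};

// Set A(r, c) = v directly in the compressed arrays.
// Returns the number of stored entries that had to be shifted.
std::size_t set_in_flight(Csr& A, std::size_t r, std::size_t c, double v) {
    assert(r < A.rows);
    std::size_t k = A.row_start[r];
    const std::size_t end = A.row_start[r + 1];
    while (k < end && A.col_index[k] < c) ++k;        // find position in the sorted row
    if (k < end && A.col_index[k] == c) {             // entry exists: overwrite in place
        A.value[k] = v;
        return 0;
    }
    const std::size_t moved = A.value.size() - k;     // everything behind position k moves
    const auto offset = static_cast<std::ptrdiff_t>(k);
    A.col_index.insert(A.col_index.begin() + offset, c);
    A.value.insert(A.value.begin() + offset, v);
    for (std::size_t i = r + 1; i <= A.rows; ++i)     // all later rows start one slot later
        ++A.row_start[i];
    return moved;
}
```

Each new entry shifts every entry stored behind it and touches the offsets of all later rows, so assembling n entries this way costs on the order of n squared operations in the worst case.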

• Dedicated insertion phase

• Matrix is available after terminating insertion

• Later modification impossible

• Works for distributed matrices as well

• Used in PETSc, includes construction of communication buffers for distributed SpMVP (sparse matrix-vector product)

• Janus derives its name from it (two faces)

Two-phase Insertion

• Inserter = object providing operations to set up other objects, e.g. matrices or vectors, efficiently

• Insertion phase lasts as long as the inserter lives
  • Insert within a scope (block, function)

• Matrix ready when the inserter is destroyed

• Later insertion possible with another inserter

• Extends to distributed matrices and vectors

• MTL4 inserters have minimal memory usage

Inserter Concept in MTL4

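#include <iostream>
#include <boost/numeric/mtl/mtl.hpp>

using namespace mtl;

int main(int argc, char* argv[])
{
    compressed2D<float> A(3, 5);
    {
        // Insertion phase: lasts as long as ins is alive
        matrix::inserter<compressed2D<float> > ins(A);
        ins[0][0] << 1.0; ins[0][2] << 2.0;
        ins[1][3] << 3.0;
        ins[2][1] << 4.0; ins[2][4] << 5.0;
    }   // ins is destroyed here: A is ready to use
    std::cout << "A is\n" << A << '\n';
    return 0;
}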
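The scope-bound behaviour can be imitated without MTL4. The sketch below is a simplified stand-in, not the MTL4 implementation: the hypothetical `ScopedInserter` buffers every update in a `std::map` and rebuilds the compressed arrays in its destructor, whereas the MTL4 inserter uses the direct and indirect insertion scheme described on the following slides. The names `Csr`, `ScopedInserter`, and `add` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Illustrative CSR matrix (hypothetical, not MTL4's compressed2D).
struct Csr {
    std::size_t rows, cols;
    std::vector<std::size_t> row_start;
    std::vector<std::size_t> col_index;
    std::vector<double>      value;
    Csr(std::size_t r, std::size_t c) : rows(r), cols(c), row_start(r + 1, 0) {}
};

// Scope-bound inserter: buffers updates while alive, compresses on destruction.
class ScopedInserter {
public:
    explicit ScopedInserter(Csr& A) : A_(A) {
        // Start from the entries already stored, so a later inserter can modify them.
        for (std::size_t r = 0; r < A.rows; ++r)
            for (std::size_t k = A.row_start[r]; k < A.row_start[r + 1]; ++k)
                buffer_[{r, A.col_index[k]}] = A.value[k];
    }
    ScopedInserter(const ScopedInserter&) = delete;
    ScopedInserter& operator=(const ScopedInserter&) = delete;

    // Accumulate v into entry (r, c).
    void add(std::size_t r, std::size_t c, double v) {
        assert(r < A_.rows && c < A_.cols);
        buffer_[{r, c}] += v;
    }

    // The map iterates in (row, column) order, which is exactly CSR order.
    ~ScopedInserter() {
        A_.row_start.assign(A_.rows + 1, 0);
        A_.col_index.clear();
        A_.value.clear();
        for (const auto& e : buffer_) {
            ++A_.row_start[e.first.first + 1];      // count entries per row
            A_.col_index.push_back(e.first.second);
            A_.value.push_back(e.second);
        }
        for (std::size_t r = 0; r < A_.rows; ++r)   // prefix sum turns counts into offsets
            A_.row_start[r + 1] += A_.row_start[r];
    }

private:
    Csr& A_;
    std::map<std::pair<std::size_t, std::size_t>, double> buffer_;
};
```

Usage follows the same pattern as the MTL4 example: open a block, create a `ScopedInserter` for the matrix, call `add` in any order, and close the block. A second inserter created later picks up the stored entries and can update them.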

Using Inserters

Direct Insertion

• Reserve s entries per row

• Find insert position
  • By linear or binary search

• Move remainder in row
  • Linear in s, which is a constant

A[0][1]= 6;
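A minimal sketch of this step, assuming each row owns a fixed block of s reserved slots; `SlackRows` and `set_direct` are hypothetical names, and the actual MTL4 inserter may organise its storage differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative row-slack storage: every row owns `slots` reserved positions.
struct SlackRows {
    std::size_t rows, slots;
    std::vector<std::size_t> used;       // entries currently stored in each row
    std::vector<std::size_t> col_index;  // rows * slots, sorted within the used part of a row
    std::vector<double>      value;      // rows * slots
    SlackRows(std::size_t r, std::size_t s)
        : rows(r), slots(s), used(r, 0), col_index(r * s, 0), value(r * s, 0.0) {}
};

// Set A(r, c) = v. Returns false if the row is saturated and c is not yet stored.
bool set_direct(SlackRows& A, std::size_t r, std::size_t c, double v) {
    assert(r < A.rows);
    auto first = A.col_index.begin() + static_cast<std::ptrdiff_t>(r * A.slots);
    auto last  = first + static_cast<std::ptrdiff_t>(A.used[r]);
    auto pos   = std::lower_bound(first, last, c);     // binary search within the row
    auto vpos  = A.value.begin() + (pos - A.col_index.begin());
    if (pos != last && *pos == c) {                    // already stored: overwrite
        *vpos = v;
        return true;
    }
    if (A.used[r] == A.slots) return false;            // no slack left in this row
    auto vlast = A.value.begin() + (last - A.col_index.begin());
    std::copy_backward(pos, last, last + 1);           // shift at most slots - 1 entries
    std::copy_backward(vpos, vlast, vlast + 1);
    *pos  = c;
    *vpos = v;
    ++A.used[r];
    return true;
}
```

Only the tail of one row moves, so the work per insertion is bounded by s regardless of the matrix size. A `false` return marks the saturated case that needs a separate overflow path.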

Indirect Insertion

• For saturated rows use “spare” container

• std::map of index pair

• Logarithmic in number of spare entries

• Additional allocation

• About 10 times slower than direct insertion

A[0][4]= 7;
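The overflow path can be sketched as follows. For brevity the direct part is modelled with one small vector per row instead of reserved slots in a contiguous array; `SpareInserter`, `set`, and `get` are hypothetical names, not the MTL4 interface.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Illustrative inserter state: bounded rows plus a "spare" map for overflow.
struct SpareInserter {
    using Entry = std::pair<std::size_t, double>;                  // (column, value)
    std::size_t slots;                                             // reserved entries per row
    std::vector<std::vector<Entry>> row;                           // direct storage, sorted by column
    std::map<std::pair<std::size_t, std::size_t>, double> spare;   // (row, column) -> value

    SpareInserter(std::size_t rows, std::size_t s) : slots(s), row(rows) {}

    // Set entry (r, c) = v; returns true if the spare map had to be used.
    bool set(std::size_t r, std::size_t c, double v) {
        assert(r < row.size());
        auto& entries = row[r];
        auto pos = entries.begin();
        while (pos != entries.end() && pos->first < c) ++pos;      // linear search
        if (pos != entries.end() && pos->first == c) { pos->second = v; return false; }
        if (entries.size() < slots) { entries.insert(pos, Entry(c, v)); return false; }
        spare[{r, c}] = v;                                         // saturated row: map update
        return true;
    }

    // Look up (r, c) in both stores; missing entries are zero.
    double get(std::size_t r, std::size_t c) const {
        assert(r < row.size());
        for (const Entry& e : row[r])
            if (e.first == c) return e.second;
        auto it = spare.find({r, c});
        return it == spare.end() ? 0.0 : it->second;
    }
};
```

The map keeps its keys in (row, column) order, so when the insertion phase ends the spare entries can be merged row by row with the directly stored ones to produce the final compressed arrays.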

• Assemble CRS matrix
• Row order important, and order within row
• Performance measure: number of non-zeros inserted per second
• Reassembly
• Three libraries: uBLAS (including vector-of-vector), MTL4, PETSc
• Ordinary workstation (Intel)
• All benchmarks run in a simple interface routine for each library, e.g.

Benchmark

void insert_row(Matrix& A, int row_idx, int* cols_idx, double* a, int n)
{
    for (int j = 0; j < n; j++)
        A(row_idx, cols_idx[j]) += a[j];
}
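A rate in entries per second can be obtained by timing such a routine over all rows. The harness below is only a sketch of the ascending-row case and not the benchmark code behind the figures that follow; `assembly_rate` and the column pattern are assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Calls insert_row(row, cols, vals, n) once per row in ascending order and
// returns the measured rate in inserted entries per second.
template <class InsertRow>
double assembly_rate(InsertRow insert_row, int rows, int nnz_per_row) {
    assert(rows > 0 && nnz_per_row > 0);
    std::vector<int>    cols(static_cast<std::size_t>(nnz_per_row));
    std::vector<double> vals(static_cast<std::size_t>(nnz_per_row), 1.0);
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < rows; ++r) {
        for (int j = 0; j < nnz_per_row; ++j)
            cols[static_cast<std::size_t>(j)] = (r + j) % rows;  // hypothetical column pattern
        insert_row(r, cols.data(), vals.data(), nnz_per_row);
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    const double entries = static_cast<double>(rows) * nnz_per_row;
    return elapsed.count() > 0.0 ? entries / elapsed.count() : 0.0;
}
```

A call would pass a lambda that forwards to the library-specific `insert_row`. Note that this sketch generates the column indices inside the timed loop; a careful measurement would prepare them beforehand so that only the insertion itself is timed.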

• 10,000 rows, 5 non-zeros/row
• MTL4: 46 million entries per second
• uBLAS: 5.9 million entries per second
• uBLAS (gov): 2 million entries per second
• PETSc: 22 million entries per second

Benchmark: Assembly rate with ascending rows

• 100,000 rows, 50 non-zeros/row
• MTL4: 29.6 million entries per second
• uBLAS: 6.5 million entries per second
• uBLAS (gov): 2.8 million entries per second
• PETSc: 32.3 million entries per second

Benchmark: Assembly rate with ascending rows

• 10,000 rows, 5 non-zeros/row
• MTL4: 41.4 million entries per second
• uBLAS: 31,300 entries per second
• uBLAS (gov): 1.9 million entries per second
• PETSc: 19.9 million entries per second

Benchmark: Assembly rate with random rows

• 100,000 rows, 50 non-zeros/row
• MTL4: 25.6 million entries per second
• uBLAS: measurement abandoned
• uBLAS (gov): 2.7 million entries per second
• PETSc: 25.6 million entries per second


• 10,000 rows, 5 non-zeros/row
• MTL4: 4.8 million entries per second
• uBLAS: 16,700 entries per second
• uBLAS (gov): 1.8 million entries per second
• PETSc: 15,900 entries per second

Benchmark: Assembly rate with entirely random entries

• 10,000 rows, 50 non-zeros/row
• MTL4: 2.9 million entries per second
• uBLAS: 3,340 entries per second
• uBLAS (gov): 1.7 million entries per second
• PETSc: 13,400 entries per second


How to do Science in Silicon?

[Diagram labels: Graphic application (CPU, GPU); Scientific application (Scientific Software; CPU, GPU, Multi-Core, Par. Arch., Scien. Proc.)]

• Introduced new approach for setting and modifying compressed sparse matrices

• Does not need preparation phase

• Minimal memory footprint

• Optimal performance

• Tuned block-insertion in progress

• Extends to distributed data structures

Conclusions
