TRANSCRIPT
Introduction to Scientific Computing
J.-F. Remacle
Université catholique de Louvain
Mathematical and Computational Concepts
Dongarra’s top ten list
• 1946: The Monte Carlo method for modeling probabilistic phenomena
• 1947: The Simplex method for linear optimization problems
• 1950: The Krylov subspace iteration method for fast linear solvers and eigensolvers
• 1957: The FORTRAN compiler that liberated scientists and engineers from programming in assembly
Dongarra’s top ten list
• 1959-1961: The QR algorithm for computing many eigenvalues
• 1962: The quicksort algorithm (divide and conquer)
• 1965: The fast Fourier transform (FFT) to reduce the operation count in Fourier series representations
• 1977: The integer relation detection algorithm, which is useful for bifurcations and in quantum field theory
• 1987: The fast multipole algorithm for N-body problems
Dongarra’s top ten list
FORTRAN has been used extensively in the past.
Yet, most recent scientific computing software has been written in C++.
As an example, Gmsh 2.1.1 is written in C++ and has about 120,000 lines of code.
Mathematical and Computational Concepts
Binary numbers and roundoffs.
$$126 = 1 \times 10^2 + 2 \times 10^1 + 6 \times 10^0.$$
In the base 2 system,
$$126 = 01111110_2 = 0 \times 2^7 + 1 \times 2^6 + 1 \times 2^5 + 1 \times 2^4 + 1 \times 2^3 + 1 \times 2^2 + 1 \times 2^1 + 0 \times 2^0.$$
In computing, we call each place in a binary number a bit.
We call a group of 8 bits a byte.
Similarly, we call 1,024 bytes a kilobyte (kB).
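A quick way to check the binary representation above in C++ (an illustration, not from the original slides) is std::bitset:

// binary.cpp -- print the binary representation of 126
#include <bitset>
#include <iostream>
int main(void){
  std::cout << std::bitset<8>(126) << std::endl; // prints 01111110
  return 0;
}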
Mathematical and Computational Concepts
Scientific notation:
$$126 = \underbrace{+}_{\text{sign}} \underbrace{.126}_{\text{coefficient}} \times \underbrace{10^3}_{\text{base and exponent}}$$
The IEEE Standard for Floating-Point Arithmetic (IEEE 754):
$$a = (-1)^s \times c \times b^q$$
Name       Common name           Base  Digits  Emin    Emax
binary16   Half precision        2     10+1    -14     +15
binary32   Single precision      2     23+1    -126    +127
binary64   Double precision      2     52+1    -1022   +1023
binary128  Quadruple precision   2     112+1   -16382  +16383
C++ float, 32 bits, 4 bytes: 23 bits for the fraction, 1 bit for the sign and 8 bits for the exponent.
C++ double, 64 bits, 8 bytes: 52 bits for the fraction, 1 bit for the sign and 11 bits for the exponent.
Mathematical and Computational Concepts
// --- numeric_limits.cpp -- show extreme numeric values
#include <limits>
#include <iostream>
int main(void){
  std::cout << std::numeric_limits<float>::max() << std::endl;
  std::cout << std::numeric_limits<float>::min() << std::endl;  // smallest positive normalized float
  std::cout << std::numeric_limits<double>::max() << std::endl;
  std::cout << std::numeric_limits<double>::min() << std::endl; // smallest positive normalized double
  return 0;
}
Output ...
3.40282e+38
1.17549e-38
1.79769e+308
2.22507e-308
Easy to see that e.g. $1.17549 \times 10^{-38} \simeq 2^{-126}$.
Mathematical and Computational Concepts
The effective zero or machine epsilon is the largest value $1/2^p$ such that
$$1.0 + \frac{1}{2^p} = 1.0$$
in floating-point arithmetic.
#include <iostream>

template <class T>
T MachineEpsilon(){
  T eps = 1.0, test = 1.0 + eps;
  while (1.0 != test){
    eps /= 2.0;
    test = 1.0 + eps;
  }
  return eps;
}

int main(void){
  std::cout << MachineEpsilon<float>() << std::endl;
  std::cout << MachineEpsilon<double>() << std::endl;
  return 0;
}
Output ...
5.96046e-08
1.11022e-16
Mathematical and Computational Concepts
Name       Common name           Base  Digits  ε
binary16   Half precision        2     10+1    $2^{-11}$
binary32   Single precision      2     23+1    $2^{-24}$
binary64   Double precision      2     52+1    $2^{-53}$
binary128  Quadruple precision   2     112+1   $2^{-113}$
A round-off error, also called rounding error, is the difference between
the calculated approximation of a number and its exact mathematical
value.
Associativity is not preserved on a computer:
$$-1.0 + (1.0 + \varepsilon) \neq (-1.0 + 1.0) + \varepsilon.$$
The machine epsilon is much bigger than the lower numeric limits (in double precision, $2^{-53} \gg 2^{-1022}$).
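A minimal check of this (an illustration, not from the original slides):

// associativity.cpp -- floating point addition is not associative
#include <iostream>
#include <limits>
int main(void){
  double eps = std::numeric_limits<double>::epsilon() / 2.; // 2^-53
  std::cout << -1.0 + (1.0 + eps) << std::endl; // prints 0: 1.0 + eps rounds to 1.0
  std::cout << (-1.0 + 1.0) + eps << std::endl; // prints 1.11022e-16
  return 0;
}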
Mathematical and Computational Concepts
In numerical analysis, the condition number associated with a function $\phi(x)$ is a measure of that function’s amenability to digital computation, that is, how numerically well-conditioned the function is. To first order,
$$\phi(x + dx) \simeq \phi(x) + \phi'(x)\,dx.$$
The relative change in function value is
$$\frac{|\phi(x + dx) - \phi(x)|}{|\phi(x)|} \simeq \frac{|\phi'(x)|\,|x|}{|\phi(x)|} \times \frac{|dx|}{|x|}.$$
The condition number is defined as
$$\kappa = \frac{|\phi'(x)|\,|x|}{|\phi(x)|}.$$
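Two quick examples (not on the original slide): for $\phi(x) = \sqrt{x}$,
$$\kappa = \frac{\left|\frac{1}{2\sqrt{x}}\right| |x|}{|\sqrt{x}|} = \frac{1}{2},$$
so the square root is well conditioned for all $x > 0$. For $\phi(x) = x - a$,
$$\kappa = \frac{|1|\,|x|}{|x - a|} = \frac{|x|}{|x - a|},$$
which blows up as $x \to a$: subtracting nearly equal numbers is ill conditioned (cancellation).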
Mathematical and Computational Concepts
The norm $\|x\|$ of a vector $x^T = \{x_1, x_2, \ldots, x_n\}$ is a scalar that obeys
$$\|x\| \geq 0,$$
$$\|x\| = 0 \leftrightarrow x = 0,$$
$$\|\alpha x\| = |\alpha|\,\|x\|,$$
$$\|x + y\| \leq \|x\| + \|y\|.$$
Examples (a small implementation sketch follows the list):
• $L_\infty$ norm: $\|x\|_\infty = \max_i |x_i|$.
• $L_2$ norm: $\|x\|_2 = \sqrt{\sum_i x_i^2}$.
• $L_1$ norm: $\|x\|_1 = \sum_i |x_i|$.
• $L_p$ norm: $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$.
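A minimal sketch of the first three norms (an illustration, not from the original slides; norm1, norm2 and normInf are hypothetical helper names):

// norms.cpp -- L1, L2 and L-infinity norms of a vector
#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>
double norm1(const std::vector<double> &x){
  double s = 0.;
  for (double xi : x) s += std::fabs(xi); // sum of absolute values
  return s;
}
double norm2(const std::vector<double> &x){
  double s = 0.;
  for (double xi : x) s += xi * xi;       // sum of squares ...
  return std::sqrt(s);                    // ... then square root
}
double normInf(const std::vector<double> &x){
  double m = 0.;
  for (double xi : x) m = std::max(m, std::fabs(xi)); // largest entry
  return m;
}
int main(void){
  std::vector<double> x = {3., -4.};
  std::cout << norm1(x) << " " << norm2(x) << " " << normInf(x) << std::endl; // 7 5 4
  return 0;
}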
Mathematical and Computational Concepts
Matrix norm generated by a vector norm:
$$\|A\|_p = \max_{x \neq 0} \frac{\|Ax\|_p}{\|x\|_p}.$$
The $L_\infty$ norm generates $\|A\|_\infty = \max_i \sum_{j=1}^{n} |a_{ij}|$, the maximum row sum.
The $L_1$ norm generates $\|A\|_1 = \max_j \sum_{i=1}^{n} |a_{ij}|$, the maximum column sum.
The $L_2$ norm generates $\|A\|_2 = \sqrt{\max \lambda(A^TA)}$, the square root of the largest eigenvalue of the SPD matrix $A^TA$.
$$\|A + B\| \leq \|A\| + \|B\|,$$
$$\|Ax\| \leq \|A\|\,\|x\|.$$
A sketch of the row-sum norm follows.
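A minimal sketch of the maximum row sum $\|A\|_\infty$ (an illustration, not from the original slides):

// matnorm.cpp -- maximum row sum norm of a matrix
#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>
double normInf(const std::vector<std::vector<double> > &a){
  double m = 0.;
  for (const std::vector<double> &row : a){
    double s = 0.;                               // row sum of absolute values
    for (double aij : row) s += std::fabs(aij);
    m = std::max(m, s);                          // keep the maximum row sum
  }
  return m;
}
int main(void){
  std::vector<std::vector<double> > a = {{1., -2.}, {3., 4.}};
  std::cout << normInf(a) << std::endl; // prints 7
  return 0;
}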
Mathematical and Computational Concepts
The condition number associated with the linear system of equations $Ax = b$.
Consider a change $b + db$ in the right-hand side. The change $dx$ in the solution is computed as
$$A(x + dx) = b + db$$
or
$$dx = A^{-1}\,db.$$
We have
$$\|dx\| \leq \|A^{-1}\|\,\|db\|.$$
Relative error: knowing that
$$\|b\| \leq \|A\|\,\|x\|,$$
we have
$$\frac{\|dx\|}{\|x\|} \leq \frac{\|A^{-1}\|\,\|db\|}{\|x\|} \leq \|A^{-1}\|\,\|A\|\,\frac{\|db\|}{\|b\|} = \kappa(A)\,\frac{\|db\|}{\|b\|}.$$
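A quick worked example (not on the original slide), using the $L_\infty$ norm:
$$A = \begin{pmatrix} 1 & 1 \\ 1 & 1.0001 \end{pmatrix}, \qquad A^{-1} = \frac{1}{10^{-4}} \begin{pmatrix} 1.0001 & -1 \\ -1 & 1 \end{pmatrix},$$
so $\|A\|_\infty = 2.0001$, $\|A^{-1}\|_\infty = 2.0001 \times 10^4$ and $\kappa_\infty(A) \simeq 4 \times 10^4$: a relative perturbation of $10^{-6}$ on $b$ may be amplified into a relative error of a few percent on $x$.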
Mathematical and Computational Concepts
A matrix $A$ is normal if $A^TA = AA^T$. Symmetric, skew-symmetric and orthogonal matrices are normal.
A matrix $A$ is unitary if $A^TA = I$.
If $A$ is normal, $\kappa(A) = \left|\frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}\right|$.
If $A$ is not normal, $\kappa(A) = \left|\frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}\right|$, where $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ are the extreme singular values of $A$. A matrix $A$ can always be decomposed in the form $UDW$, where $U$ and $W$ are unitary matrices and $D$ is a diagonal matrix with the singular values lying on the diagonal.
A normal matrix is well conditioned when its eigenvalues are clustered and bounded away from zero.
Introduction to Scientific Computing
J.-F. Remacle
Université catholique de Louvain
C++, BLAS, OpenMP, MPI, LAPACK ...
Computer Architecture
Computer memory consists of a linearly addressable space (JFR does a nice drawing on the blackboard).
// see gmsh/Numeric/fullMatrix.h for a complete version
template <class scalar>
class fullVector {
 private:
  int _r;          // size of the vector
  scalar *_data;   // the linearly addressable data
 public:
  fullVector(int r) : _r(r){
    _data = new scalar[_r];
    scale(0.);
  }
  ~fullVector(){ if(_data) delete [] _data; }
  inline scalar operator () (int i) const { return _data[i]; }
  inline scalar & operator () (int i){ return _data[i]; }
  inline int size() const { return _r; }
  // abbreviated here: zeroes the vector for s = 0., scales it otherwise
  void scale(scalar s){
    for(int i = 0; i < _r; i++) _data[i] = (s == scalar(0.)) ? scalar(0.) : _data[i] * s;
  }
};
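For illustration (not from the slides), a small usage sketch, assuming the class above plus #include <iostream>:

int main(void){
  fullVector<double> v(3);          // zero-initialized by the constructor
  v(0) = 1.; v(1) = 2.; v(2) = 3.;
  std::cout << v(1) << " " << v.size() << std::endl; // prints "2 3"
  return 0;
}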
Computer Architecture
How are two-dimensional arrays stored in memory?
Memory is linearly addressable, so we have to decide how to decompose the matrix in one-dimensional units.
template <class scalar>
class fullMatrix {
 private:
  int _r, _c;    // number of rows and columns
  scalar *_data; // the linearly addressable data
};
There are two obvious ways (JFR does a nice drawing on the blackboard), illustrated in the sketch below:
1. Row-major ordering: $a_{ij}$ = _data[j + _c * i];
2. Column-major ordering: $a_{ij}$ = _data[i + _r * j];
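A minimal sketch (not from the slides) showing that both formulas address the same 2 × 3 matrix:

// ordering.cpp -- row-major vs column-major indexing
#include <iostream>
int main(void){
  const int r = 2, c = 3;
  double rowMajor[r * c], colMajor[r * c];
  for (int i = 0; i < r; i++)
    for (int j = 0; j < c; j++){
      rowMajor[j + c * i] = 10. * i + j; // store a_ij = 10 i + j ...
      colMajor[i + r * j] = 10. * i + j; // ... in both layouts
    }
  // both layouts recover the same entry a_12 = 12
  std::cout << rowMajor[2 + c * 1] << " " << colMajor[1 + r * 2] << std::endl;
  return 0;
}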
Computer Architecture
Both orderings use the same amount of memory.
Both orderings start with a00.
C++ uses row-major ordering while FORTRAN uses column-major ordering.
// A is a linear space in memory of size 16 x sizeof(double),
// ordered row-major.
double A[4][4];
The Gmsh fullMatrix class stores matrices in column-major ordering because it allows direct access to BLAS!
Computer Architecture
Intel Xeon ISA: x86 64 - Streaming SIMD Extensions (SSE)
:-) 2 × 128 bit FPU double and single precision packed operations (FMUL/FADD), chained
:-( No fused multiply-add (FMA) instruction
A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access memory: L1: 32kB(I) + 32kB(D), L2: 8MB.
Theoretical peak performance:
• double precision: 4 flops per clock cycle (e.g. 4 × 2.5 GHz) → ≃ 10 GFlops / core,
• single precision: 8 flops per clock cycle → ≃ 20 GFlops / core.
Computer Architecture
Matrix-matrix multiplication $A = B\,C$: $A$ $(n \times n)$, $B$ $(n \times m)$ and $C$ $(m \times n)$.
Each of the $n^2$ entries requires $2m$ floating point operations:
$$a_{ij} = \sum_{k=1}^{m} b_{ik}\,c_{kj}.$$
The total number of floating point operations $N_{flop}$ is therefore
$$N_{flop} = 2 \times m \times n^2.$$
Consider $n = m = 1000$: a naive implementation (sketched below) requires $\Delta t = 6.02$ seconds. The number of floating point operations per second is
$$N_{flops} = \frac{N_{flop}}{\Delta t} = \frac{2 \times 10^9}{6.02} \simeq 0.33\ \text{GFlops}.$$
This is about 30 times slower than the peak performance of the machine!
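A minimal sketch of such a naive triple loop (an illustration, not the Gmsh code; naiveGemm and the row-major layout are assumptions):

// naive_gemm.cpp -- naive matrix-matrix product a = b * c
// b is n x m, c is m x n, a is n x n, all stored row-major
#include <vector>
void naiveGemm(int n, int m, const std::vector<double> &b,
               const std::vector<double> &c, std::vector<double> &a)
{
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++){
      double sum = 0.;
      for (int k = 0; k < m; k++)         // 2m flops per entry
        sum += b[k + m * i] * c[j + n * k];
      a[j + n * i] = sum;                 // n^2 entries -> 2 m n^2 flops
    }
}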
Basic Linear Algebra Subroutines (BLAS)
BLAS3
Matrix-matrix multiplications: $N_d = 3n^2$ data, $N_{flop} = 2n^3$,
$$\frac{N_{flop}}{N_d} = O(n).$$
BLAS2
Matrix-vector multiplications: $N_d = n^2 + 2n$ data, $N_{flop} = 2n^2$,
$$\frac{N_{flop}}{N_d} = O(1).$$
BLAS1
Vector-vector operations (axpy): $N_d = 2n$ data, $N_{flop} = 2n$,
$$\frac{N_{flop}}{N_d} = O(1).$$
Basic Linear Algebra Subroutines (BLAS1)
#define F77NAME(x) (x##_)
// (*this) <-- alpha*x + (*this)
template <class scalar>
class fullVector
{
 public:
  void axpy(fullVector<scalar> &x, scalar alpha=1.)
#if !defined(HAVE_BLAS)
  {
    for (int i = 0; i < _r; i++) _data[i] += alpha * x._data[i];
  }
#endif
  ;
};

// with BLAS (HAVE_BLAS defined), the specialization for double calls Fortran daxpy
template<>
void fullVector<double>::axpy(fullVector<double> &x, double alpha)
{
  int M = _r, INCX = 1, INCY = 1;
  F77NAME(daxpy)(&M, &alpha, x._data, &INCX, _data, &INCY);
}
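For this to link, the Fortran symbol must also be declared somewhere; a minimal sketch of the declaration (the exact header depends on the BLAS installation):

extern "C" {
  // Fortran BLAS daxpy: y <- alpha*x + y
  void daxpy_(int *n, double *alpha, double *x, int *incx,
              double *y, int *incy);
}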
Basic Linear Algebra Subroutines (BLAS1)
Consider two vectors of size N that we “daxpy” 10000 times. We use three versions of daxpy: the Netlib “native” Linux BLAS, ATLAS and MKL.
[Figure: axpy(n): $y \leftarrow \alpha x + y$, $x, y \in \mathbb{R}^n$. GFlops versus N (0 to 6000) for MKL, ATLAS and native BLAS; left panel: single precision (saxpy), right panel: double precision (daxpy). From K. Hillewaert, "Efficient implicit DGM".]
Basic Linear Algebra Subroutines (BLAS1)
Consider two vectors of size N that we “daxpy” 10000 times. This time the access to one of the vectors (y) is shifted!
[Figure: axpy(n): $y \leftarrow \alpha x + y$, $x, y \in \mathbb{R}^n$, shifted access (y). GFlops versus N (0 to 6000) for MKL, ATLAS and native BLAS; left: single precision (saxpy), right: double precision (daxpy). From K. Hillewaert, "Efficient implicit DGM".]
Basic Linear Algebra Subroutines (BLAS1)
Efficiency is driven by L1 cache size and operation packing.
[Figure: axpy(n): $y \leftarrow \alpha x + y$, $x, y \in \mathbb{R}^n$, influence of operation packing. GFlops versus N (1210 to 1240) for MKL, ATLAS and native BLAS; left: single precision (saxpy), right: double precision (daxpy). From K. Hillewaert, "Efficient implicit DGM".]
Basic Linear Algebra Subroutines (BLAS2)
template <class scalar>
class fullMatrix
{
 public:
  // y <-- (*this) * x
  void mult(const fullVector<scalar> &x, fullVector<scalar> &y)
#if !defined(HAVE_BLAS)
  {
    y.scale(0.);
    for(int i = 0; i < _r; i++)
      for(int j = 0; j < _c; j++)
        y._data[i] += (*this)(i, j) * x(j);
  }
#endif
  ;
};
Basic Linear Algebra Subroutines (BLAS2)
Direct call to BLAS.
// y := alpha*(*this)*x + beta*y
template<>
void fullMatrix<double>::mult(const fullVector<double> &x,
                              fullVector<double> &y)
{
  int M = _r, N = _c, LDA = _r, INCX = 1, INCY = 1;
  double alpha = 1., beta = 0.;
  // "N": do not transpose the matrix; LDA is the leading dimension
  F77NAME(dgemv)("N", &M, &N, &alpha, _data, &LDA, x._data, &INCX,
                 &beta, y._data, &INCY);
}
Basic Linear Algebra Subroutines (BLAS2)
Consider two vectors of size N and a matrix of size N × N that we “dgemv” 10000 times. We use three versions of dgemv: the Netlib “native” Linux BLAS, ATLAS and MKL.
[Figure: gemv(n, n): $y \leftarrow \alpha A \cdot x + y$, $x, y \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times n}$. GFlops versus N (0 to 400) for MKL, ATLAS and native BLAS; left: single precision (sgemv), right: double precision (dgemv). From K. Hillewaert, "Efficient implicit DGM".]
Basic Linear Algebra Subroutines (BLAS2)
The matrix is changed every time.
Efficiency is driven by RAM access (because of the random access to matrices) and by the L2 cache (because of the larger size of the data).
[Figure: gemv(n, n): $y \leftarrow \alpha A \cdot x + y$, random access. GFlops versus N (0 to 400) for MKL, ATLAS and native BLAS; left: single precision (sgemv), right: double precision (dgemv). From K. Hillewaert, "Efficient implicit DGM".]
Basic Linear Algebra Subroutines (BLAS3)
// (*this) = a * b * alpha + (*this) * beta
template <class scalar>
class fullMatrix
{
 public:
  void gemm(const fullMatrix<scalar> &a,
            const fullMatrix<scalar> &b,
            scalar alpha=1., scalar beta=1.)
#if !defined(HAVE_BLAS)
  {
    gemm_naive(a, b, alpha, beta);
  }
#endif
  ;
};
Basic Linear Algebra Subroutines (BLAS3)
Direct call to BLAS.
// (*this) := alpha*a*b + beta*(*this); "N","N": neither matrix transposed
template<>
void fullMatrix<double>::gemm(const fullMatrix<double> &a,
                              const fullMatrix<double> &b,
                              double alpha, double beta)
{
  int M = size1(), N = size2(), K = a.size2();
  int LDA = a.size1(), LDB = b.size1(), LDC = size1();
  F77NAME(dgemm)("N", "N", &M, &N, &K, &alpha, a._data, &LDA,
                 b._data, &LDB, &beta, _data, &LDC);
}
Basic Linear Algebra Subroutines (BLAS3)
Consider two matrices of size N × N that we “dgemm” 10000 times. We use three versions of dgemm: the Netlib “native” Linux BLAS, ATLAS and MKL.
[Figure: gemm(n, n, n): $A \leftarrow \alpha B \cdot C + \beta A$, $A, B, C \in \mathbb{R}^{n \times n}$. GFlops versus N (0 to 400) for MKL, ATLAS and native BLAS; left: single precision (sgemm), right: double precision (dgemm). From K. Hillewaert, "Efficient implicit DGM".]
Basic Linear Algebra Subroutines (BLAS3)
No cache effects on efficiency. The behavior is erratic (cache line size?).
[Figure: gemm(n, n, n): $A \leftarrow \alpha B \cdot C + \beta A$, zoom on N from 150 to 200. GFlops versus N for MKL, ATLAS and native BLAS; left: single precision (sgemm), right: double precision (dgemm). From K. Hillewaert, "Efficient implicit DGM".]