TRANSCRIPT
Introduction to Scientific Computing
J.-F. Remacle
Université catholique de Louvain
Mathematical and Computational Concepts
Dongarra’s top ten list
• 1946: The Monte Carlo method for modeling probabilistic phenomena
• 1947: The Simplex method for linear optimization problems
• 1950: The Krylov subspace iteration method for fast linear solvers and eigensolvers
• 1957: The FORTRAN compiler that liberated scientists and engineers from programming in assembly
Dongarra’s top ten list
• 1959-1961: The QR algorithm for computing many eigenvalues
• 1962: The quicksort algorithm (divide and conquer)
• 1965: The fast Fourier transform (FFT) to reduce the operation count in Fourier series representations
• 1977: The integer relation detection algorithm, which is useful for bifurcations and in quantum field theory
• 1987: The fast multipole algorithm for N-body problems
Dongarra’s top ten list
FORTRAN has been used extensively in the past.
Yet, most recent scientific computing software has been written in C++.
As an example, Gmsh 2.1.1 is written in C++ and has about 120,000 lines of code.
Mathematical and Computational Concepts
Binary numbers and roundoffs.
$$126 = 1 \times 10^2 + 2 \times 10^1 + 6 \times 10^0.$$
In the base 2 system,
$$126 = 01111110_2 = 0 \times 2^7 + 1 \times 2^6 + 1 \times 2^5 + 1 \times 2^4 + 1 \times 2^3 + 1 \times 2^2 + 1 \times 2^1 + 0 \times 2^0.$$
In computing, we call each place in a binary number a bit.
We call a group of 8 bits a byte.
Similarly, we call 1,024 bytes a kilobyte (kB).
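A quick way to check the binary representation above in C++ (an illustration, not from the original slides) is std::bitset:

// binary.cpp -- print the binary representation of 126
#include <bitset>
#include <iostream>
int main(void){
  std::cout << std::bitset<8>(126) << std::endl; // prints 01111110
  return 0;
}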
Mathematical and Computational Concepts
Scientific notation:
$$126 = \underbrace{+}_{\text{sign}} \underbrace{.126}_{\text{coefficient}} \times \underbrace{10^3}_{\text{base and exponent}}$$
The IEEE Standard for Floating-Point Arithmetic (IEEE 754):
$$a = (-1)^s \times c \times b^q$$
Name       Common name           Base  Digits  Emin    Emax
binary16   Half precision        2     10+1    -14     +15
binary32   Single precision      2     23+1    -126    +127
binary64   Double precision      2     52+1    -1022   +1023
binary128  Quadruple precision   2     112+1   -16382  +16383
C++ float, 32 bits, 4 bytes: 23 bits for the fraction, 1 bit for the sign and 8 bits for the exponent.
C++ double, 64 bits, 8 bytes: 52 bits for the fraction, 1 bit for the sign and 11 bits for the exponent.
Mathematical and Computational Concepts
// --- numeric_limits.cpp -- show extreme numeric values
#include <limits>
#include <iostream>
int main(void){
  std::cout << std::numeric_limits<float>::max() << std::endl;
  std::cout << std::numeric_limits<float>::min() << std::endl;  // smallest positive normalized float
  std::cout << std::numeric_limits<double>::max() << std::endl;
  std::cout << std::numeric_limits<double>::min() << std::endl; // smallest positive normalized double
  return 0;
}
Output ...
3.40282e+38
1.17549e-38
1.79769e+308
2.22507e-308
Easy to see that e.g. $1.17549 \times 10^{-38} \simeq 2^{-126}$.
Mathematical and Computational Concepts
The effective zero or machine epsilon is the largest value $1/2^p$ such that
$$1.0 + \frac{1}{2^p} = 1.0$$
in floating-point arithmetic.
#include <iostream>

template <class T>
T MachineEpsilon(){
  T eps = 1.0, test = 1.0 + eps;
  while (1.0 != test){
    eps /= 2.0;
    test = 1.0 + eps;
  }
  return eps;
}

int main(void){
  std::cout << MachineEpsilon<float>() << std::endl;
  std::cout << MachineEpsilon<double>() << std::endl;
  return 0;
}
Output ...
5.96046e-08
1.11022e-16
Mathematical and Computational Concepts
Name       Common name           Base  Digits  ε
binary16   Half precision        2     10+1    $2^{-11}$
binary32   Single precision      2     23+1    $2^{-24}$
binary64   Double precision      2     52+1    $2^{-53}$
binary128  Quadruple precision   2     112+1   $2^{-113}$
A round-off error, also called rounding error, is the difference between
the calculated approximation of a number and its exact mathematical
value.
Associativity is not preserved on a computer:
$$-1.0 + (1.0 + \varepsilon) \neq (-1.0 + 1.0) + \varepsilon.$$
The machine epsilon is much bigger than the lower numeric limits (in double precision, $2^{-53} \gg 2^{-1022}$).
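A minimal check of this (an illustration, not from the original slides):

// associativity.cpp -- floating point addition is not associative
#include <iostream>
#include <limits>
int main(void){
  double eps = std::numeric_limits<double>::epsilon() / 2.; // 2^-53
  std::cout << -1.0 + (1.0 + eps) << std::endl; // prints 0: 1.0 + eps rounds to 1.0
  std::cout << (-1.0 + 1.0) + eps << std::endl; // prints 1.11022e-16
  return 0;
}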
Mathematical and Computational Concepts
In numerical analysis, the condition number associated with a function $\phi(x)$ is a measure of that function’s amenability to digital computation, that is, how numerically well-conditioned the function is. To first order,
$$\phi(x + dx) \simeq \phi(x) + \phi'(x)\,dx.$$
The relative change in function value is
$$\frac{|\phi(x + dx) - \phi(x)|}{|\phi(x)|} \simeq \frac{|\phi'(x)|\,|x|}{|\phi(x)|} \times \frac{|dx|}{|x|}.$$
The condition number is defined as
$$\kappa = \frac{|\phi'(x)|\,|x|}{|\phi(x)|}.$$
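Two quick examples (not on the original slide): for $\phi(x) = \sqrt{x}$,
$$\kappa = \frac{\left|\frac{1}{2\sqrt{x}}\right| |x|}{|\sqrt{x}|} = \frac{1}{2},$$
so the square root is well conditioned for all $x > 0$. For $\phi(x) = x - a$,
$$\kappa = \frac{|1|\,|x|}{|x - a|} = \frac{|x|}{|x - a|},$$
which blows up as $x \to a$: subtracting nearly equal numbers is ill conditioned (cancellation).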
Mathematical and Computational Concepts
The norm $\|x\|$ of a vector $x^T = \{x_1, x_2, \ldots, x_n\}$ is a scalar that obeys
$$\|x\| \geq 0,$$
$$\|x\| = 0 \leftrightarrow x = 0,$$
$$\|\alpha x\| = |\alpha|\,\|x\|,$$
$$\|x + y\| \leq \|x\| + \|y\|.$$
Examples (a small implementation sketch follows the list):
• $L_\infty$ norm: $\|x\|_\infty = \max_i |x_i|$.
• $L_2$ norm: $\|x\|_2 = \sqrt{\sum_i x_i^2}$.
• $L_1$ norm: $\|x\|_1 = \sum_i |x_i|$.
• $L_p$ norm: $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$.
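A minimal sketch of the first three norms (an illustration, not from the original slides; norm1, norm2 and normInf are hypothetical helper names):

// norms.cpp -- L1, L2 and L-infinity norms of a vector
#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>
double norm1(const std::vector<double> &x){
  double s = 0.;
  for (double xi : x) s += std::fabs(xi); // sum of absolute values
  return s;
}
double norm2(const std::vector<double> &x){
  double s = 0.;
  for (double xi : x) s += xi * xi;       // sum of squares ...
  return std::sqrt(s);                    // ... then square root
}
double normInf(const std::vector<double> &x){
  double m = 0.;
  for (double xi : x) m = std::max(m, std::fabs(xi)); // largest entry
  return m;
}
int main(void){
  std::vector<double> x = {3., -4.};
  std::cout << norm1(x) << " " << norm2(x) << " " << normInf(x) << std::endl; // 7 5 4
  return 0;
}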
Mathematical and Computational Concepts
Matrix norm generated by a vector norm:
$$\|A\|_p = \max_{x \neq 0} \frac{\|Ax\|_p}{\|x\|_p}.$$
The $L_\infty$ norm generates $\|A\|_\infty = \max_i \sum_{j=1}^{n} |a_{ij}|$, the maximum row sum.
The $L_1$ norm generates $\|A\|_1 = \max_j \sum_{i=1}^{n} |a_{ij}|$, the maximum column sum.
The $L_2$ norm generates $\|A\|_2 = \sqrt{\max \lambda(A^TA)}$, the square root of the largest eigenvalue of the SPD matrix $A^TA$.
$$\|A + B\| \leq \|A\| + \|B\|,$$
$$\|Ax\| \leq \|A\|\,\|x\|.$$
A sketch of the row-sum norm follows.
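A minimal sketch of the maximum row sum $\|A\|_\infty$ (an illustration, not from the original slides):

// matnorm.cpp -- maximum row sum norm of a matrix
#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>
double normInf(const std::vector<std::vector<double> > &a){
  double m = 0.;
  for (const std::vector<double> &row : a){
    double s = 0.;                               // row sum of absolute values
    for (double aij : row) s += std::fabs(aij);
    m = std::max(m, s);                          // keep the maximum row sum
  }
  return m;
}
int main(void){
  std::vector<std::vector<double> > a = {{1., -2.}, {3., 4.}};
  std::cout << normInf(a) << std::endl; // prints 7
  return 0;
}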
Mathematical and Computational Concepts
The condition number associated with the linear system of equations $Ax = b$.
Consider a change $b + db$ in the right-hand side. The change $dx$ in the solution is computed as
$$A(x + dx) = b + db$$
or
$$dx = A^{-1}\,db.$$
We have
$$\|dx\| \leq \|A^{-1}\|\,\|db\|.$$
Relative error: knowing that
$$\|b\| \leq \|A\|\,\|x\|,$$
we have
$$\frac{\|dx\|}{\|x\|} \leq \frac{\|A^{-1}\|\,\|db\|}{\|x\|} \leq \|A^{-1}\|\,\|A\|\,\frac{\|db\|}{\|b\|} = \kappa(A)\,\frac{\|db\|}{\|b\|}.$$
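A quick worked example (not on the original slide), using the $L_\infty$ norm:
$$A = \begin{pmatrix} 1 & 1 \\ 1 & 1.0001 \end{pmatrix}, \qquad A^{-1} = \frac{1}{10^{-4}} \begin{pmatrix} 1.0001 & -1 \\ -1 & 1 \end{pmatrix},$$
so $\|A\|_\infty = 2.0001$, $\|A^{-1}\|_\infty = 2.0001 \times 10^4$ and $\kappa_\infty(A) \simeq 4 \times 10^4$: a relative perturbation of $10^{-6}$ on $b$ may be amplified into a relative error of a few percent on $x$.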
Mathematical and Computational Concepts
A matrix $A$ is normal if $A^TA = AA^T$. Symmetric, skew-symmetric and orthogonal matrices are normal.
A matrix $A$ is unitary if $A^TA = I$.
If $A$ is normal, $\kappa(A) = \left|\frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}\right|$.
If $A$ is not normal, $\kappa(A) = \left|\frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}\right|$, where $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ are the extreme singular values of $A$. A matrix $A$ can always be decomposed in the form $UDW$, where $U$ and $W$ are unitary matrices and $D$ is a diagonal matrix with the singular values lying on the diagonal.
A normal matrix is well conditioned when its eigenvalues are clustered and bounded away from zero.
Introduction to Scientific Computing
J.-F. Remacle
Université catholique de Louvain
C++, BLAS, OpenMP, MPI, LAPACK ...
Computer Architecture
Computer memory consists of a linearly addressable space (JFR does a nice drawing on the blackboard).
// see gmsh/Numeric/fullMatrix.h for a complete version
template <class scalar>
class fullVector {
 private:
  int _r;          // size of the vector
  scalar *_data;   // the linearly addressable data
 public:
  fullVector(int r) : _r(r){
    _data = new scalar[_r];
    scale(0.);
  }
  ~fullVector(){ if(_data) delete [] _data; }
  inline scalar operator () (int i) const { return _data[i]; }
  inline scalar & operator () (int i){ return _data[i]; }
  inline int size() const { return _r; }
  // abbreviated here: zeroes the vector for s = 0., scales it otherwise
  void scale(scalar s){
    for(int i = 0; i < _r; i++) _data[i] = (s == scalar(0.)) ? scalar(0.) : _data[i] * s;
  }
};
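For illustration (not from the slides), a small usage sketch, assuming the class above plus #include <iostream>:

int main(void){
  fullVector<double> v(3);          // zero-initialized by the constructor
  v(0) = 1.; v(1) = 2.; v(2) = 3.;
  std::cout << v(1) << " " << v.size() << std::endl; // prints "2 3"
  return 0;
}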
Computer Architecture
How are two-dimensional arrays stored in memory?
Memory is linearly addressable, so we have to decide how to decompose the matrix in one-dimensional units.
template <class scalar>
class fullMatrix {
 private:
  int _r, _c;    // number of rows and columns
  scalar *_data; // the linearly addressable data
};
There are two obvious ways (JFR does a nice drawing on the blackboard), illustrated in the sketch below:
1. Row-major ordering: $a_{ij}$ = _data[j + _c * i];
2. Column-major ordering: $a_{ij}$ = _data[i + _r * j];
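A minimal sketch (not from the slides) showing that both formulas address the same 2 × 3 matrix:

// ordering.cpp -- row-major vs column-major indexing
#include <iostream>
int main(void){
  const int r = 2, c = 3;
  double rowMajor[r * c], colMajor[r * c];
  for (int i = 0; i < r; i++)
    for (int j = 0; j < c; j++){
      rowMajor[j + c * i] = 10. * i + j; // store a_ij = 10 i + j ...
      colMajor[i + r * j] = 10. * i + j; // ... in both layouts
    }
  // both layouts recover the same entry a_12 = 12
  std::cout << rowMajor[2 + c * 1] << " " << colMajor[1 + r * 2] << std::endl;
  return 0;
}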
Computer Architecture
Both orderings use the same amount of memory.
Both orderings start with a00.
C++ uses row-major ordering while FORTRAN uses column-major ordering.
// A is a linear space in memory of size 16 x sizeof(double),
// ordered row-major.
double A[4][4];
The Gmsh fullMatrix class stores matrices in column-major ordering because it allows direct access to BLAS!
Computer Architecture
Intel Xeon ISA: x86 64 - Streaming SIMD Extensions (SSE)
:-) 2 × 128 bit FPU double and single precision packed operations (FMUL/FADD), chained
:-( No fused multiply-add (FMA) instruction
A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access memory: L1: 32kB(I) + 32kB(D), L2: 8MB.
Theoretical peak performance:
• double precision: 4 flops per clock cycle (e.g. 4 × 2.5 GHz) → ≃ 10 GFlops / core,
• single precision: 8 flops per clock cycle → ≃ 20 GFlops / core.
Computer Architecture
Matrix-matrix multiplication $A = B\,C$: $A$ $(n \times n)$, $B$ $(n \times m)$ and $C$ $(m \times n)$.
Each of the $n^2$ entries requires $2m$ floating point operations:
$$a_{ij} = \sum_{k=1}^{m} b_{ik}\,c_{kj}.$$
The total number of floating point operations $N_{flop}$ is therefore
$$N_{flop} = 2 \times m \times n^2.$$
Consider $n = m = 1000$: a naive implementation (sketched below) requires $\Delta t = 6.02$ seconds. The number of floating point operations per second is
$$N_{flops} = \frac{N_{flop}}{\Delta t} = \frac{2 \times 10^9}{6.02} \simeq 0.33\ \text{GFlops}.$$
This is about 30 times slower than the peak performance of the machine!
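A minimal sketch of such a naive triple loop (an illustration, not the Gmsh code; naiveGemm and the row-major layout are assumptions):

// naive_gemm.cpp -- naive matrix-matrix product a = b * c
// b is n x m, c is m x n, a is n x n, all stored row-major
#include <vector>
void naiveGemm(int n, int m, const std::vector<double> &b,
               const std::vector<double> &c, std::vector<double> &a)
{
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++){
      double sum = 0.;
      for (int k = 0; k < m; k++)         // 2m flops per entry
        sum += b[k + m * i] * c[j + n * k];
      a[j + n * i] = sum;                 // n^2 entries -> 2 m n^2 flops
    }
}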
Basic Linear Algebra Subroutines (BLAS)
BLAS3
Matrix-matrix multiplications: $N_d = 3n^2$ data, $N_{flop} = 2n^3$,
$$\frac{N_{flop}}{N_d} = O(n).$$
BLAS2
Matrix-vector multiplications: $N_d = n^2 + 2n$ data, $N_{flop} = 2n^2$,
$$\frac{N_{flop}}{N_d} = O(1).$$
BLAS1
Vector-vector operations (axpy): $N_d = 2n$ data, $N_{flop} = 2n$,
$$\frac{N_{flop}}{N_d} = O(1).$$
Basic Linear Algebra Subroutines (BLAS1)
#define F77NAME(x) (x##_)
// (*this) <-- alpha*x + (*this)
template <class scalar>
class fullVector
{
 public:
  void axpy(fullVector<scalar> &x, scalar alpha=1.)
#if !defined(HAVE_BLAS)
  {
    for (int i = 0; i < _r; i++) _data[i] += alpha * x._data[i];
  }
#endif
  ;
};

// with BLAS (HAVE_BLAS defined), the specialization for double calls Fortran daxpy
template<>
void fullVector<double>::axpy(fullVector<double> &x, double alpha)
{
  int M = _r, INCX = 1, INCY = 1;
  F77NAME(daxpy)(&M, &alpha, x._data, &INCX, _data, &INCY);
}
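For this to link, the Fortran symbol must also be declared somewhere; a minimal sketch of the declaration (the exact header depends on the BLAS installation):

extern "C" {
  // Fortran BLAS daxpy: y <- alpha*x + y
  void daxpy_(int *n, double *alpha, double *x, int *incx,
              double *y, int *incy);
}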
Basic Linear Algebra Subroutines (BLAS1)
Consider two vectors of size N that we “daxpy” 10000 times. We use three versions of daxpy: the Netlib “native” Linux BLAS, ATLAS and MKL.
[Figure: axpy(n): $y \leftarrow \alpha x + y$, $x, y \in \mathbb{R}^n$. GFlops versus N (0 to 6000) for MKL, ATLAS and native BLAS; left panel: single precision (saxpy), right panel: double precision (daxpy). From K. Hillewaert, "Efficient implicit DGM".]
Basic Linear Algebra Subroutines (BLAS1)
Consider two vectors of size N that we “daxpy” 10000 times. This time the access to one of the vectors (y) is shifted!
[Figure: axpy(n): $y \leftarrow \alpha x + y$, $x, y \in \mathbb{R}^n$, shifted access (y). GFlops versus N (0 to 6000) for MKL, ATLAS and native BLAS; left: single precision (saxpy), right: double precision (daxpy). From K. Hillewaert, "Efficient implicit DGM".]
Basic Linear Algebra Subroutines (BLAS1)
Efficiency is driven by L1 cache size and operation packing.
[Figure: axpy(n): $y \leftarrow \alpha x + y$, $x, y \in \mathbb{R}^n$, influence of operation packing. GFlops versus N (1210 to 1240) for MKL, ATLAS and native BLAS; left: single precision (saxpy), right: double precision (daxpy). From K. Hillewaert, "Efficient implicit DGM".]
Basic Linear Algebra Subroutines (BLAS2)
template <class scalar>
class fullMatrix
{
 public:
  // y <-- (*this) * x
  void mult(const fullVector<scalar> &x, fullVector<scalar> &y)
#if !defined(HAVE_BLAS)
  {
    y.scale(0.);
    for(int i = 0; i < _r; i++)
      for(int j = 0; j < _c; j++)
        y._data[i] += (*this)(i, j) * x(j);
  }
#endif
  ;
};
Basic Linear Algebra Subroutines (BLAS2)
Direct call to BLAS.
// y := alpha*(*this)*x + beta*y
template<>
void fullMatrix<double>::mult(const fullVector<double> &x,
                              fullVector<double> &y)
{
  int M = _r, N = _c, LDA = _r, INCX = 1, INCY = 1;
  double alpha = 1., beta = 0.;
  // "N": do not transpose the matrix; LDA is the leading dimension
  F77NAME(dgemv)("N", &M, &N, &alpha, _data, &LDA, x._data, &INCX,
                 &beta, y._data, &INCY);
}
Basic Linear Algebra Subroutines (BLAS2)
Consider two vectors of size N and a matrix of size N × N that we “dgemv” 10000 times. We use three versions of dgemv: the Netlib “native” Linux BLAS, ATLAS and MKL.
[Figure: gemv(n, n): $y \leftarrow \alpha A \cdot x + y$, $x, y \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times n}$. GFlops versus N (0 to 400) for MKL, ATLAS and native BLAS; left: single precision (sgemv), right: double precision (dgemv). From K. Hillewaert, "Efficient implicit DGM".]
Basic Linear Algebra Subroutines (BLAS2)
The matrix is changed every time.
Efficiency is driven by RAM access (because of the random access to matrices) and by the L2 cache (because of the larger size of the data).
[Figure: gemv(n, n): $y \leftarrow \alpha A \cdot x + y$, random access. GFlops versus N (0 to 400) for MKL, ATLAS and native BLAS; left: single precision (sgemv), right: double precision (dgemv). From K. Hillewaert, "Efficient implicit DGM".]
Basic Linear Algebra Subroutines (BLAS3)
// (*this) = a * b * alpha + (*this) * beta
template <class scalar>
class fullMatrix
{
 public:
  void gemm(const fullMatrix<scalar> &a,
            const fullMatrix<scalar> &b,
            scalar alpha=1., scalar beta=1.)
#if !defined(HAVE_BLAS)
  {
    gemm_naive(a, b, alpha, beta);
  }
#endif
  ;
};
Basic Linear Algebra Subroutines (BLAS3)
Direct call to BLAS.
// (*this) := alpha*a*b + beta*(*this); "N","N": neither matrix transposed
template<>
void fullMatrix<double>::gemm(const fullMatrix<double> &a,
                              const fullMatrix<double> &b,
                              double alpha, double beta)
{
  int M = size1(), N = size2(), K = a.size2();
  int LDA = a.size1(), LDB = b.size1(), LDC = size1();
  F77NAME(dgemm)("N", "N", &M, &N, &K, &alpha, a._data, &LDA,
                 b._data, &LDB, &beta, _data, &LDC);
}
Basic Linear Algebra Subroutines (BLAS3)
Consider two matrices of size N × N that we “dgemm” 10000 times. We use three versions of dgemm: the Netlib “native” Linux BLAS, ATLAS and MKL.
[Figure: gemm(n, n, n): $A \leftarrow \alpha B \cdot C + \beta A$, $A, B, C \in \mathbb{R}^{n \times n}$. GFlops versus N (0 to 400) for MKL, ATLAS and native BLAS; left: single precision (sgemm), right: double precision (dgemm). From K. Hillewaert, "Efficient implicit DGM".]
Basic Linear Algebra Subroutines (BLAS3)
No cache effects on efficiency. The behavior is erratic (cache line size?).
[Figure: gemm(n, n, n): $A \leftarrow \alpha B \cdot C + \beta A$, zoom on N from 150 to 200. GFlops versus N for MKL, ATLAS and native BLAS; left: single precision (sgemm), right: double precision (dgemm). From K. Hillewaert, "Efficient implicit DGM".]