
Introduction to Scientific Computing

J.-F. Remacle

Université catholique de Louvain

Mathematical and Computational Concepts


Dongarra's top ten list

• 1946: The Monte Carlo method for modeling probabilistic phenomena

• 1947: The Simplex method for linear optimization problems

• 1950: The Krylov subspace iteration method for fast linear solvers and eigensolvers

• 1957: The FORTRAN compiler, which liberated scientists and engineers from programming in assembly


Dongarra's top ten list

• 1959-1961: The QR algorithm for computing many eigenvalues

• 1962: The quicksort algorithm (divide and conquer)

• 1965: The fast Fourier transform (FFT) to reduce operation count in Fourier series representation

• 1977: The integer relation detection algorithm, which is useful for bifurcations and in quantum field theory

• 1987: The fast multipole algorithm for N-body problems


Dongarra's top ten list

FORTRAN has been used extensively in the past.

Yet most recent scientific computing software has been written in C++.

As an example, Gmsh 2.1.1 is written in C++ and has about 120,000 lines of code.


Mathematical and Computational Concepts

Binary numbers and roundoffs.

$$126 = 1 \times 10^2 + 2 \times 10^1 + 6 \times 10^0.$$

In the base 2 system,

$$126 = 01111110_2 = 0 \times 2^7 + 1 \times 2^6 + 1 \times 2^5 + 1 \times 2^4 + 1 \times 2^3 + 1 \times 2^2 + 1 \times 2^1 + 0 \times 2^0.$$

In computing, we call each place in a binary number a bit.

We call a group of 8 bits a byte.

Similarly, we call 1,024 bytes a kilobyte (kB).


Mathematical and Computational Concepts

Scientific notation:

$$126 = \underbrace{+}_{\text{sign}} \; \underbrace{.126}_{\text{coefficient}} \times \underbrace{10^{3}}_{\text{base and exponent}}$$

The IEEE Standard for Floating-Point Arithmetic (IEEE 754):

$$a = (-1)^s \times c \times b^q$$

Name       Common name          Base  Digits  Emin    Emax
binary16   Half precision       2     10+1    -14     +15
binary32   Single precision     2     23+1    -126    +127
binary64   Double precision     2     52+1    -1022   +1023
binary128  Quadruple precision  2     112+1   -16382  +16383

C++ float, 32 bits, 4 bytes: 23 bits for the fraction, 1 bit for the sign and 8 bits for the exponent.

C++ double, 64 bits, 8 bytes: 52 bits for the fraction, 1 bit for the sign and 11 bits for the exponent.


Mathematical and Computational Concepts

    // --- numeric_limits.cpp -- show extreme numeric values
    #include <limits>
    #include <iostream>

    int main(void){
      // max() is the largest finite value,
      // min() is the smallest positive normalized value
      std::cout << std::numeric_limits<float>::max() << std::endl;
      std::cout << std::numeric_limits<float>::min() << std::endl;
      std::cout << std::numeric_limits<double>::max() << std::endl;
      std::cout << std::numeric_limits<double>::min() << std::endl;
    }

Output:

    3.40282e+38
    1.17549e-38
    1.79769e+308
    2.22507e-308

It is easy to see that, e.g., $1.17549 \times 10^{-38} \simeq 2^{-126}$.


Mathematical and Computational Concepts

The effective zero, or machine epsilon, is the largest value $1/2^p$ such that, in floating-point arithmetic,

$$1.0 + \frac{1}{2^p} = 1.0.$$

    #include <iostream>

    // Halve eps until adding it to 1.0 no longer changes the result.
    template <class T>
    T MachineEpsilon(){
      T eps = 1.0, test = 1.0 + eps;
      while (1.0 != test){
        eps /= 2.0;
        test = 1.0 + eps;
      }
      return eps;
    }

    int main(void) {
      std::cout << MachineEpsilon<float> () << std::endl;
      std::cout << MachineEpsilon<double> () << std::endl;
      return 0;
    }

Output:

    5.96046e-08
    1.11022e-16


Mathematical and Computational Concepts

Name       Common name          Base  Digits  ε
binary16   Half precision       2     10+1    2^-11
binary32   Single precision     2     23+1    2^-24
binary64   Double precision     2     52+1    2^-53
binary128  Quadruple precision  2     112+1   2^-113

A round-off error, also called rounding error, is the difference between the calculated approximation of a number and its exact mathematical value.

Associativity is not preserved on a computer:

$$-1.0 + (1.0 + \varepsilon) \neq (-1.0 + 1.0) + \varepsilon$$

Machine epsilon is much larger than the smallest positive representable number.


Mathematical and Computational Concepts

In numerical analysis, the condition number associated with a function $\varphi(x)$ is a measure of that function's amenability to digital computation, that is, how numerically well-conditioned the function is.

A first-order Taylor expansion gives

$$\varphi(x + dx) \simeq \varphi(x) + \varphi'(x)\,dx.$$

The relative change in function value is

$$\frac{|\varphi(x + dx) - \varphi(x)|}{|\varphi(x)|} \simeq \frac{|\varphi'(x)|\,|x|}{|\varphi(x)|} \times \frac{|dx|}{|x|}.$$

The condition number is defined as

$$\kappa = \frac{|\varphi'(x)|\,|x|}{|\varphi(x)|}.$$


Mathematical and Computational Concepts

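The norm $\|x\|$ of a vector $x^T = \{x_1, x_2, \ldots, x_n\}$ is a scalar that obeys

$$\|x\| \geq 0,$$

$$\|x\| = 0 \leftrightarrow x = 0,$$

$$\|\alpha x\| = |\alpha| \, \|x\|,$$

$$\|x + y\| \leq \|x\| + \|y\|.$$

Examples:

• $L_\infty$ norm: $\|x\|_\infty = \max_i |x_i|$.

• $L_2$ norm: $\|x\|_2 = \sqrt{\sum_i x_i^2}$.

• $L_1$ norm: $\|x\|_1 = \sum_i |x_i|$.

• $L_p$ norm: $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$.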


Mathematical and Computational Concepts

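Matrix norm generated by a vector norm:

$$\|A\|_p = \max_{x \neq 0} \frac{\|Ax\|_p}{\|x\|_p}$$

The $L_\infty$ norm generates $\|A\|_\infty = \max_i \sum_{j=1}^{n} |a_{ij}|$, the maximum row sum.

The $L_1$ norm generates $\|A\|_1 = \max_j \sum_{i=1}^{n} |a_{ij}|$, the maximum column sum.

The $L_2$ norm generates $\|A\|_2 = \sqrt{\max \lambda(A^T A)}$, the square root of the largest eigenvalue of the symmetric positive semi-definite matrix $A^T A$.

Generated matrix norms also satisfy

$$\|A + B\| \leq \|A\| + \|B\|,$$

$$\|Ax\| \leq \|A\| \, \|x\|.$$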


Mathematical and Computational Concepts

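The condition number associated with the linear system of equations $Ax = b$.

Consider a change $b + db$ in the right-hand side. The change $dx$ in the solution is computed as

$$A(x + dx) = b + db$$

or, subtracting $Ax = b$,

$$dx = A^{-1} db.$$

We have

$$\|dx\| \leq \|A^{-1}\| \, \|db\|.$$

Relative error: knowing that

$$\|b\| = \|Ax\| \leq \|A\| \, \|x\|,$$

we have

$$\frac{\|dx\|}{\|x\|} \leq \frac{\|A^{-1}\| \, \|db\|}{\|x\|} \leq \|A^{-1}\| \, \|A\| \, \frac{\|db\|}{\|b\|} = \kappa(A) \frac{\|db\|}{\|b\|}.$$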


Mathematical and Computational Concepts

A matrix $A$ is normal if $A^T A = A A^T$. Symmetric, skew-symmetric and orthogonal matrices are normal.

A matrix $A$ is unitary if $A^T A = I$.

If $A$ is normal,

$$\kappa(A) = \left| \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)} \right|.$$

If $A$ is not normal,

$$\kappa(A) = \left| \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)} \right|,$$

where $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ are the extreme singular values of $A$. A matrix $A$ can always be decomposed in the form $UDW$, where $U$ and $W$ are unitary matrices and $D$ is a diagonal matrix with the singular values lying on the diagonal.

A normal matrix is well conditioned when its eigenvalues are clustered and bounded away from zero.


Introduction to Scientific Computing

J.-F. Remacle

Université catholique de Louvain

C++, BLAS, OpenMP, MPI, LAPACK ...


Computer Architecture

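Computer memory consists of a linearly addressable space (JFR does a nice drawing on the blackboard).

    // see gmsh/Numeric/fullMatrix.h for a complete version
    template <class scalar>
    class fullVector {
     private:
      int _r;         // number of entries
      scalar *_data;  // contiguous storage for the _r entries
     public:
      fullVector(int r) : _r(r){
        _data = new scalar[_r];
        scale(0.);
      }
      ~fullVector() { if(_data) delete [] _data; }
      // minimal stand-in for the scale() used above (the slide omits it):
      // scaling by zero resets all entries
      inline void scale(const scalar s){
        for(int i = 0; i < _r; i++){
          if(s == scalar(0)) _data[i] = scalar(0);
          else _data[i] *= s;
        }
      }
      inline scalar operator () (int i) const{ return _data[i];}
      inline scalar & operator () (int i){return _data[i];}
      inline int size() const { return _r; }
    };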


Computer Architecture

How are two-dimensional arrays stored in memory?

Memory is linearly addressable, so we have to decide how to decompose the matrix into one-dimensional units.

    template <class scalar>
    class fullMatrix {
     private:
      int _r, _c;     // number of rows and columns
      scalar *_data;  // the linearly addressable data
    };

There are two obvious ways (JFR does a nice drawing on the blackboard):

1. Row-major ordering: aij = data [ j + c * i ] ;

2. Column-major ordering: aij = data [ i + r * j ] ;


Computer Architecture

Both orderings use the same amount of memory.

Both orderings start with a00.

C++ uses row-major ordering while FORTRAN uses column-major ordering.

    // A is a linear space in memory of size 16 x sizeof(double),
    // ordered row-major.
    double A[4][4];

The Gmsh fullMatrix class stores matrices in column-major ordering because it allows direct access to BLAS!


Computer Architecture

Intel Xeon ISA: x86-64 with Streaming SIMD Extensions (SSE)

:-) 2 × 128-bit FPU, double and single precision packed operations (FMUL/FADD), chained

:-( Has no combined floating-point multiplication and addition (FMA)

A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access memory: L1: 32 kB (I) + 32 kB (D), L2: 8 MB.

Theoretical peak performance (a 128-bit register holds 2 doubles or 4 floats, and one packed multiplication and one packed addition can be issued per cycle):

• double precision: 4 times clock speed → ≃ 10 GFlops / core,

• single precision: 8 times clock speed → ≃ 20 GFlops / core.


Computer Architecture

Matrix-matrix multiplications: $A(n \times n)$, $B(n \times m)$ and $C(m \times n)$.

Each of the $n^2$ entries requires $2m$ floating-point operations:

$$a_{ij} = \sum_{k=1}^{m} b_{ik} c_{kj}.$$

The total number of floating-point operations $N_{flop}$ is therefore

$$N_{flop} = 2 \times m \times n^2.$$

Consider $n = m = 1000$: a naive implementation requires $\Delta t = 6.02$ seconds. The number of floating-point operations per second is

$$N_{flops} = \frac{N_{flop}}{\Delta t} = \frac{2 \times 10^9}{6} = 0.33 \text{ Gflops}.$$

This is about 30 times slower than the peak performance of the machine!


Basic Linear Algebra Subprograms (BLAS)

BLAS3

Matrix-matrix multiplications: $N_d = 3n^2$ data, $N_{flop} = 2n^3$,

$$\frac{N_{flop}}{N_d} = O(n).$$

BLAS2

Matrix-vector multiplications: $N_d = n^2 + 2n$ data, $N_{flop} = 2n^2$,

$$\frac{N_{flop}}{N_d} = O(1).$$

BLAS1

Vector-vector operations (axpy): $N_d = 2n$ data, $N_{flop} = 2n$,

$$\frac{N_{flop}}{N_d} = O(1).$$


Basic Linear Algebra Subprograms (BLAS1)

    #define F77NAME(x) (x##_)

    // (*this) <-- alpha*x + (*this)
    template <class scalar>
    class fullVector
    {
      // ... data members _r and _data as shown earlier ...
     public:
      void axpy(fullVector<scalar> &x, scalar alpha=1.)
    #if !defined(HAVE_BLAS)
      {
        for (int i = 0; i < _r; i++) _data[i] += alpha * x._data[i];
      }
    #endif
      ;
    };

    // Specialization used when HAVE_BLAS is defined:
    // calls the Fortran BLAS routine daxpy
    template<>
    void fullVector<double>::axpy(fullVector<double> &x, double alpha)
    {
      int M = _r, INCX = 1, INCY = 1;
      F77NAME(daxpy)(&M, &alpha, x._data, &INCX, _data, &INCY);
    }


Basic Linear Algebra Subprograms (BLAS1)

Consider two vectors of size N that we “daxpy” 10000 times. We use three versions of daxpy: the Netlib “native” Linux BLAS, ATLAS and MKL.

[Figure: axpy(n): $y \leftarrow \alpha x + y$, with $x, y \in \mathbb{R}^n$. GFlops (0 to 10) versus N (0 to 6000) for MKL, ATLAS and the native BLAS, in two panels: single precision (saxpy) and double precision (daxpy). Source: K. Hillewaert, "Efficient implicit DGM".]


Basic Linear Algebra Subprograms (BLAS1)

Consider two vectors of size N that we “daxpy” 10000 times. The first vector is shifted!

[Figure: axpy(n): $y \leftarrow \alpha x + y$, with $x, y \in \mathbb{R}^n$, shifted access (y). GFlops (0 to 10) versus N (0 to 6000) for MKL, ATLAS and the native BLAS, in two panels: single precision (saxpy) and double precision (daxpy). Source: K. Hillewaert, "Efficient implicit DGM".]


Basic Linear Algebra Subprograms (BLAS1)

Efficiency is driven by L1 cache size and operation packing.

[Figure: axpy(n): $y \leftarrow \alpha x + y$, with $x, y \in \mathbb{R}^n$, influence of operation packing. GFlops (0 to 10) versus N (1210 to 1240) for MKL, ATLAS and the native BLAS, in two panels: single precision (saxpy) and double precision (daxpy). Source: K. Hillewaert, "Efficient implicit DGM".]


Basic Linear Algebra Subprograms (BLAS2)

    // y <-- (*this) * x
    template <class scalar>
    class fullMatrix
    {
      // ... data members _r, _c and _data as shown earlier ...
     public:
      void mult(const fullVector<scalar> &x,
                fullVector<scalar> &y)
    #if !defined(HAVE_BLAS)
      {
        y.scale(0.);
        for(int i = 0; i < _r; i++)
          for(int j = 0; j < _c; j++)
            y._data[i] += (*this)(i, j) * x(j);
      }
    #endif
      ;
    };


Basic Linear Algebra Subprograms (BLAS2)

Direct call to BLAS.

    // y := alpha*(*this)*x + beta*y
    // Specialization used when HAVE_BLAS is defined; LDA = _r because
    // the matrix is stored column-major.
    template<>
    void fullMatrix<double>::mult(const fullVector<double> &x,
                                  fullVector<double> &y)
    {
      int M = _r, N = _c, LDA = _r, INCX = 1, INCY = 1;
      double alpha = 1., beta = 0.;
      F77NAME(dgemv)("N", &M, &N, &alpha, _data, &LDA, x._data, &INCX,
                     &beta, y._data, &INCY);
    }


Basic Linear Algebra Subprograms (BLAS2)

Consider two vectors of size N and a matrix of size N × N that we “dgemv” 10000 times. We use three versions of dgemv: the Netlib “native” Linux BLAS, ATLAS and MKL.

[Figure: gemv(n, n): $y \leftarrow \alpha A \cdot x + y$, with $x, y \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times n}$. GFlops (0 to 8) versus N (0 to 400) for MKL, ATLAS and the native BLAS, in two panels: single precision (sgemv) and double precision (dgemv). Source: K. Hillewaert, "Efficient implicit DGM".]


Basic Linear Algebra Subprograms (BLAS2)

The matrix is changed every time.

Efficiency is driven by RAM access (because of the random access to matrices) and L2 cache (because of the larger size of the data).

[Figure: gemv(n, n): $y \leftarrow \alpha A \cdot x + y$, with $x, y \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times n}$, random access. GFlops (0 to 8) versus N (0 to 400) for MKL, ATLAS and the native BLAS, in two panels: single precision (sgemv) and double precision (dgemv). Source: K. Hillewaert, "Efficient implicit DGM".]


Basic Linear Algebra Subprograms (BLAS3)

    // (*this) = a * b * alpha + (*this) * beta
    template <class scalar>
    class fullMatrix
    {
      // ... data members _r, _c and _data as shown earlier ...
     public:
      void gemm(const fullMatrix<scalar> &a,
                const fullMatrix<scalar> &b,
                scalar alpha=1., scalar beta=1.)
    #if !defined(HAVE_BLAS)
      {
        gemm_naive(a, b, alpha, beta);
      }
    #endif
      ;
    };


Basic Linear Algebra Subprograms (BLAS3)

Direct call to BLAS.

    // Specialization used when HAVE_BLAS is defined; the leading dimensions
    // are the numbers of rows because storage is column-major.
    template<>
    void fullMatrix<double>::gemm(const fullMatrix<double> &a,
                                  const fullMatrix<double> &b,
                                  double alpha, double beta)
    {
      int M = size1(), N = size2(), K = a.size2();
      int LDA = a.size1(), LDB = b.size1(), LDC = size1();
      F77NAME(dgemm)("N", "N", &M, &N, &K, &alpha, a._data,
                     &LDA, b._data, &LDB,
                     &beta, _data, &LDC);
    }


Basic Linear Algebra Subprograms (BLAS3)

Consider two matrices of size N × N that we “dgemm” 10000 times. We use three versions of dgemm: the Netlib “native” Linux BLAS, ATLAS and MKL.

[Figure: gemm(n, n, n): $A \leftarrow \alpha B \cdot C + \beta A$, with $A, B, C \in \mathbb{R}^{n \times n}$. GFlops (0 to 20) versus N (0 to 400) for MKL, ATLAS and the native BLAS, in two panels: single precision (sgemm) and double precision (dgemm). Source: K. Hillewaert, "Efficient implicit DGM".]


Basic Linear Algebra Subprograms (BLAS3)

No cache effects on efficiency. Erratic behavior (cache line size?)

[Figure: gemm(n, n, n): $A \leftarrow \alpha B \cdot C + \beta A$, with $A, B, C \in \mathbb{R}^{n \times n}$. GFlops (0 to 20) versus N (150 to 200) for MKL, ATLAS and the native BLAS, in two panels: single precision (sgemm) and double precision (dgemm). Source: K. Hillewaert, "Efficient implicit DGM".]
