Parallel and Cluster Computing 1
Parallel and Cluster Computing 2
Array Libraries
Function extensions of C and Fortran 77 with array or vector libraries
– The libraries are supposed to be optimised for each particular computer
– Regular compilers can be used => no need for dedicated optimising compilers
One of the most well-known and well-designed array libraries is the Basic Linear Algebra Subprograms (BLAS)
– Provides basic array operations for numerical linear algebra
– Available for most modern VP and SP computers
Parallel and Cluster Computing 3
BLAS
All BLAS routines are divided into 3 main categories:
– Level 1 BLAS addresses scalar and vector operations
– Level 2 BLAS addresses matrix-vector operations
– Level 3 BLAS addresses matrix-matrix operations
Routines of Level 1 do
– vector reduction operations
– vector rotation operations
– element-wise and combined vector operations
– data movement with vectors
Parallel and Cluster Computing 4
Level 1 BLAS A vector reduction operation
– The addition of the scaled dot product of two real vectors x and y into a scaled scalar r
$$r \leftarrow \alpha x^T y + \beta r = \alpha \sum_{i=0}^{n-1} x_i y_i + \beta r$$
The C interface of the routine implementing the operation is
void BLAS_ddot( enum blas_conj_type conj, int n, double alpha, const double *x, int incx, double beta, const double *y, int incy, double *r );
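A minimal usage sketch of this routine (assuming the binding above is exposed by a header such as blas_dense.h and that the enum constant blas_no_conj is available — both names are taken from the BLAS Technical Forum C interface and may differ between implementations):
#include <stdio.h>
#include <blas_dense.h>   /* assumed header exposing the prototype above */
int main(void) {
  double x[] = {1.0, 2.0, 3.0};
  double y[] = {4.0, 5.0, 6.0};
  double r = 10.0;                               /* r is updated in place        */
  /* r <- 1.0*(x^T y) + 0.5*r = 32.0 + 5.0 = 37.0 */
  BLAS_ddot(blas_no_conj, 3, 1.0, x, 1, 0.5, y, 1, &r);
  printf("r = %f\n", r);
  return 0;
}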
Parallel and Cluster Computing 5
Level 1 BLAS (ctd)
Other routines doing reduction operations
– Compute different vector norms of vector x
– Compute the sum of the entries of vector x
– Find the smallest or biggest component of vector x
– Compute the sum of squares of the entries of vector x
Routines doing rotation operations
– Generate Givens plane rotation
– Generate Jacobi rotation
– Generate Householder transformation
Parallel and Cluster Computing 6
Level 1 BLAS (ctd) An element-wise vector operation
– The scaled addition of two real vectors x and y:
$$w \leftarrow \alpha x + \beta y$$
The C interface of the routine implementing the operation is
void BLAS_dwaxpby( int n, double alpha, const double *x, int incx, double beta, const double *y, int incy, double *w, int incw );
Function BLAS_cwaxpby does the same operation but on complex vectors
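A similar sketch for BLAS_dwaxpby, under the same assumptions about the header and bindings:
#include <stdio.h>
#include <blas_dense.h>   /* assumed header exposing the prototype above */
int main(void) {
  double x[] = {1.0, 2.0, 3.0};
  double y[] = {10.0, 20.0, 30.0};
  double w[3];
  /* w <- 2.0*x + 0.5*y = {7.0, 14.0, 21.0} */
  BLAS_dwaxpby(3, 2.0, x, 1, 0.5, y, 1, w, 1);
  printf("%f %f %f\n", w[0], w[1], w[2]);
  return 0;
}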
Parallel and Cluster Computing 7
Level 1 BLAS (ctd) Other routines doing element-wise operations
– Scale the entries of a vector x by the real scalar 1/α
– Scale a vector x by α and a vector y by β, add these two vectors to one another and store the result in the vector y
– Combine a scaled vector accumulation and a dot product
– Apply a plane rotation to vectors x and y
Parallel and Cluster Computing 8
Level 1 BLAS (ctd) An example of data movement with vectors
– The interchange of real vectors x and y
The C interface of the routine implementing the operation is
void BLAS_dswap( int n, double *x, int incx, double *y, int incy );
Function BLAS_cswap does the same operation but on complex vectors
Parallel and Cluster Computing 9
Level 1 BLAS (ctd) Other routines doing data movement with vectors
– Copy vector x into vector y
– Sort the entries of real vector x in increasing or decreasing order and overwrite this vector x with the sorted vector as well as compute the corresponding permutation vector p
– Scale the entries of a vector x by the real scalar 1/α
– Permute the entries of vector x according to permutation vector p
Parallel and Cluster Computing 10
Level 2 BLAS
Routines of Level 2– Compute different matrix vector products– Do addition of scaled matrix vector products– Compute multiple matrix vector products– Solve triangular equations– Perform rank one and rank two updates– Some operations use symmetric or triangular matrices
Parallel and Cluster Computing 11
Level 2 BLAS (ctd) To store matrices, the following schemes are used
– Column-based and row-based storage
– Packed storage for symmetric or triangular matrices
– Band storage for band matrices
Conventional storage
– An n×n matrix A is stored in a one-dimensional array a
» a_ij => a[i+j*s] (C, column-wise storage with stride s)
» a_ij => a[j+i*s] (C, row-wise storage with stride s)
– If s=n, rows (columns) will be contiguous in memory
– If s>n, there will be a gap of (s-n) memory elements between two successive rows (columns)
– Only significant elements of symmetric/triangular matrices need be set
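The two mappings can be illustrated with a small self-contained C sketch (the helper names are illustrative, not part of BLAS):
#include <stdio.h>
/* element (i,j) of an n x n matrix stored column-wise with stride (leading dimension) s */
double get_colmajor(const double *a, int s, int i, int j) { return a[i + j*s]; }
/* element (i,j) of the same matrix stored row-wise with stride s */
double get_rowmajor(const double *a, int s, int i, int j) { return a[j + i*s]; }
int main(void) {
  /* the 2 x 2 matrix {{1,2},{3,4}} stored both ways with s = 2 */
  double col[] = {1.0, 3.0, 2.0, 4.0};   /* columns are contiguous */
  double row[] = {1.0, 2.0, 3.0, 4.0};   /* rows are contiguous    */
  printf("%f %f\n", get_colmajor(col, 2, 0, 1), get_rowmajor(row, 2, 0, 1)); /* both print 2.0 */
  return 0;
}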
Parallel and Cluster Computing 12
Packed Storage Packed storage
– The relevant triangle of a symmetric/triangular matrix is packed by columns or rows in a one-dimensional array
– The upper triangle of an n×n matrix A may be stored in a one-dimensional array a
» a_ij (i ≤ j) => a[j+i*(2*n-i-1)/2] (C, row-wise storage)
Example.
$$\begin{pmatrix} a_{00} & a_{01} & a_{02} \\ 0 & a_{11} & a_{12} \\ 0 & 0 & a_{22} \end{pmatrix} \;\Rightarrow\; \{\, a_{00},\ a_{01},\ a_{02},\ a_{11},\ a_{12},\ a_{22} \,\}$$
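The packed index mapping above can be sketched as a small C helper (illustrative, not part of BLAS):
#include <stdio.h>
/* index of a_ij (i <= j) in the row-wise packed upper triangle of an n x n matrix */
int packed_upper_index(int n, int i, int j) {
  return j + i*(2*n - i - 1)/2;
}
int main(void) {
  int n = 3;
  /* packed array from the example: {a00, a01, a02, a11, a12, a22} */
  printf("%d %d %d\n",
         packed_upper_index(n, 0, 0),   /* 0 -> a00 */
         packed_upper_index(n, 1, 2),   /* 4 -> a12 */
         packed_upper_index(n, 2, 2));  /* 5 -> a22 */
  return 0;
}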
Parallel and Cluster Computing 13
Band Storage
Band storage
– A compact storage scheme for band matrices
Consider Fortran and a column-wise storage scheme
– An m×n band matrix A with l subdiagonals and u superdiagonals may be stored in a 2-dimensional array A with l+u+1 rows and n columns
– Columns of matrix A are stored in corresponding columns of array A
– Diagonals of matrix A are stored in rows of array A
» a_ij => A(u+i-j, j) for max(0, j-u) ≤ i ≤ min(m-1, j+l)
Example.
$$\begin{pmatrix}
a_{00} & a_{01} & 0 & 0 & 0 \\
a_{10} & a_{11} & a_{12} & 0 & 0 \\
a_{20} & a_{21} & a_{22} & a_{23} & 0 \\
0 & a_{31} & a_{32} & a_{33} & a_{34} \\
0 & 0 & a_{42} & a_{43} & a_{44}
\end{pmatrix}
\;\Rightarrow\;
\begin{pmatrix}
* & a_{01} & a_{12} & a_{23} & a_{34} \\
a_{00} & a_{11} & a_{22} & a_{33} & a_{44} \\
a_{10} & a_{21} & a_{32} & a_{43} & * \\
a_{20} & a_{31} & a_{42} & * & *
\end{pmatrix}$$
(a 5×5 band matrix with l = 2 subdiagonals and u = 1 superdiagonal; asterisks mark array elements that correspond to no matrix entry)
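A sketch in C of packing a band matrix according to the mapping above, using 0-based indices and an array AB with l+u+1 rows and n columns (the constants and helper code are illustrative):
#include <stdio.h>
#define M 5
#define N 5
#define L 2   /* subdiagonals   */
#define U 1   /* superdiagonals */
int main(void) {
  double A[M][N] = {0};          /* full matrix, only the band is nonzero */
  double AB[L+U+1][N] = {0};     /* band storage: diagonals become rows   */
  int i, j;
  for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
      if (i - j <= L && j - i <= U)
        A[i][j] = 10*i + j;      /* some values inside the band */
  /* pack the band: a_ij goes to AB[U+i-j][j] for max(0,j-U) <= i <= min(M-1,j+L) */
  for (j = 0; j < N; j++) {
    int lo = (j - U > 0) ? j - U : 0;
    int hi = (j + L < M - 1) ? j + L : M - 1;
    for (i = lo; i <= hi; i++)
      AB[U + i - j][j] = A[i][j];
  }
  printf("%g %g\n", AB[U][0], AB[U][N-1]);  /* the main diagonal is row U: a00 and a44 */
  return 0;
}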
Parallel and Cluster Computing 14
Level 2 BLAS (ctd) An example of matrix vector multiplication operation
– The scaled addition of a real m-length vector y and the product of a general real m×n matrix A and a real n-length vector x:
$$y \leftarrow \alpha A x + \beta y$$
The C interface of the routine implementing this operation is
void BLAS_dgemv( enum blas_order_type order, enum blas_trans_type trans, int m, int n, double alpha, const double *a, int stride, const double *x, int incx, double beta, double *y, int incy );
Parameters
order => blas_rowmajor or blas_colmajor
trans => blas_no_trans (do not transpose A)
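A minimal usage sketch of BLAS_dgemv, under the same assumptions about the header and enum constants as above:
#include <stdio.h>
#include <blas_dense.h>   /* assumed header exposing the prototype above */
int main(void) {
  /* 2 x 2 matrix A = {{1,2},{3,4}} stored row-major with stride 2 */
  double a[] = {1.0, 2.0, 3.0, 4.0};
  double x[] = {1.0, 1.0};
  double y[] = {10.0, 20.0};
  /* y <- 1.0*A*x + 1.0*y = {3,7} + {10,20} = {13.0, 27.0} */
  BLAS_dgemv(blas_rowmajor, blas_no_trans, 2, 2, 1.0, a, 2, x, 1, 1.0, y, 1);
  printf("%f %f\n", y[0], y[1]);
  return 0;
}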
Parallel and Cluster Computing 15
Level 2 BLAS (ctd) If matrix A is a general band matrix with l subdiagonals and u
superdiagonals, the function
void BLAS_dgbmv( enum blas_order_type order, enum blas_trans_type trans, int m, int n, int l, int u, double alpha, const double *a, int stride, const double *x, int incx, double beta, double *y, int incy );
makes better use of memory. It assumes that a band storage scheme is used to store matrix A.
Parallel and Cluster Computing 16
Level 2 BLAS (ctd) Other routines of Level 2 perform the following operations
$$y \leftarrow \alpha A x + \beta y, \text{ where } A = A^T$$
$$x \leftarrow \alpha A x \text{ or } x \leftarrow \alpha A^T x, \text{ where } A \text{ is triangular}$$
$$y \leftarrow \alpha A x + \beta B x$$
$$x \leftarrow A^T y, \quad w \leftarrow A x$$
$$x \leftarrow A^T y, \quad w \leftarrow A z, \text{ where } A \text{ is triangular}$$
as well as many others For any matrix-vector operation with a specific matrix operand
(triangular, symmetric, banded, etc.), there is a routine for each storage scheme that can be used to store the operand
Parallel and Cluster Computing 17
Level 3 BLAS
Routines of Level 3 do
– O(n²) matrix operations
» norms, diagonal scaling, scaled accumulation and addition
» different storage schemes to store matrix operands are supported
– O(n³) matrix-matrix operations
» multiplication, solving matrix equations, symmetric rank k and 2k updates
– Data movement with matrices
Parallel and Cluster Computing 18
Level 3 BLAS (ctd)
An example of O(n²) matrix operation, which scales two real m×n matrices A and B and stores their sum in a matrix C, is
$$C \leftarrow \alpha A + \beta B$$
The C interface of the routine implementing this operation, under the assumption that the matrices A, B and C are of the general form, is
void BLAS_dge_add( enum blas_order_type order, int m, int n, double alpha, const double *a, int stride_a, double beta, const double *b, int stride_b, double *c, int stride_c );
There are 15 other routines performing this operation for different types and forms of the matrices A, B and C
Parallel and Cluster Computing 19
Level 3 BLAS (ctd)
An example of O(n³) matrix-matrix operation involving a real m×n matrix A, a real n×k matrix B, and a real m×k matrix C is
$$C \leftarrow \alpha A B + \beta C$$
The C routine implementing the operation for matrices A, B and C in the general form is
void BLAS_dgemm( enum blas_order_type order, enum blas_trans_type trans_a, enum blas_trans_type trans_b, int m, int n, int k, double alpha, const double *a, int stride_a, const double *b, int stride_b, double beta, double *c, int stride_c );
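A minimal usage sketch of BLAS_dgemm (header and enum names assumed as above; square matrices are used so the exact interpretation of m, n and k does not matter here):
#include <stdio.h>
#include <blas_dense.h>   /* assumed header exposing the prototype above */
int main(void) {
  /* square 2 x 2 matrices, row-major, stride 2 */
  double a[] = {1.0, 2.0,
                3.0, 4.0};
  double b[] = {1.0, 0.0,
                0.0, 1.0};        /* identity */
  double c[] = {0.0, 0.0,
                0.0, 0.0};
  /* C <- 1.0*A*B + 0.0*C = A */
  BLAS_dgemm(blas_rowmajor, blas_no_trans, blas_no_trans,
             2, 2, 2, 1.0, a, 2, b, 2, 0.0, c, 2);
  printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
  return 0;
}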
Parallel and Cluster Computing 20
Level 3 BLAS (ctd) Data movement with matrices includes
– Copying matrix A or its transpose with storing the result in matrix B
$B \leftarrow A$ or $B \leftarrow A^T$
– Transposition of a square matrix A with the result overwriting matrix A
$A \leftarrow A^T$
– Permutation of the rows or columns of matrix A by a permutation matrix P
$A \leftarrow PA$ or $A \leftarrow AP$
– Different types and forms of matrix operands as well as different storage schemes are supported
Parallel and Cluster Computing 21
Sparse BLAS
Sparse BLAS
– Provides routines for unstructured sparse matrices
– Poorer functionality compared to Dense and Banded BLAS
» only some basic array operations used in solving large sparse linear equations using iterative techniques
» matrix multiply, triangular solve, sparse vector update, dot product, gather/scatter
– Does not specify methods to store a sparse matrix
» storage format is dependent on the algorithm, the original sparsity pattern, the format in which the data already exists, etc.
» sparse matrix arguments are a placeholder, or handle, which refers to an abstract representation of a matrix, not the actual data components
Parallel and Cluster Computing 22
Sparse BLAS (ctd) Several routines provided to create sparse matrices
– The internal representation is implementation dependent» Sparse BLAS applications are independent of the matrix storage
scheme, relying on the scheme provided by each implementation
A typical Sparse BLAS application
– Creates an internal sparse matrix representation and returns its handle
– Uses the handle as a parameter in computational Sparse BLAS routines
– Calls a cleanup routine to free resources associated with the handle, when the matrix is no longer needed
Parallel and Cluster Computing 23
Example Example. Consider a C program using Sparse BLAS
performing the matrix-vector operation $y \leftarrow Ax$, where
$$A = \begin{pmatrix} 1.1 & 0 & 0 & 0 \\ 0 & 2.2 & 0 & 2.4 \\ 0 & 0 & 3.3 & 0 \\ 4.1 & 0 & 0 & 4.4 \end{pmatrix}, \qquad x = \begin{pmatrix} 1.0 \\ 1.0 \\ 1.0 \\ 1.0 \end{pmatrix}$$
Parallel and Cluster Computing 24
Example (ctd)
#include <blas_sparse.h>
int main() {
  const int n = 4, nonzeros = 6;
  double values[] = {1.1, 2.2, 2.4, 3.3, 4.1, 4.4};
  int index_i[] = {0, 1, 1, 2, 3, 3};
  int index_j[] = {0, 1, 3, 2, 0, 3};
  double x[] = {1.0, 1.0, 1.0, 1.0}, y[] = {0.0, 0.0, 0.0, 0.0};
  blas_sparse_matrix A;
  int k;
  double alpha = 1.0;
  A = BLAS_duscr_begin(n, n);                 // Create Sparse BLAS handle
  for(k=0; k < nonzeros; k++)                 // Insert entries one by one
    BLAS_duscr_insert_entry(A, values[k], index_i[k], index_j[k]);
  BLAS_uscr_end(A);                           // Complete construction of sparse matrix
  // Compute matrix-vector product y = A*x
  BLAS_dusmv(blas_no_trans, alpha, A, x, 1, y, 1);
  BLAS_usds(A);                               // Release matrix handle
}
Parallel and Cluster Computing 25
Parallel and Cluster Computing 26
Parallel Languages
C and Fortran 77 do not reflect some essential features of VP and SP architectures
– They cannot play for VPs and SPs the role they play for serial scalar processors
Optimizing compilers
– Only for a simple and limited class of applications
Array libraries
– Cover a limited class of array operations
» Other array operations can only be expressed as a combination of the locally-optimized library array operations
» This excludes global optimization of combined array operations
Parallel and Cluster Computing 27
Parallel Languages (ctd)
Parallel extensions of C and Fortran 77 allow programmers
– To explicitly express any array operation in a portable form
» The compiler does not need to recognize code to parallelize
» Global optimisation of operations on arrays is possible
– We consider 2 parallel supersets of C and Fortran 77
» Fortran 90
» C[]
Parallel and Cluster Computing 28
Fortran 90
Fortran 90 is a new Fortran standard released in 1991– Widely implemented since then
Two categories of new features
– Modernization of Fortran according to the state-of-the-art in serial programming languages
– Support for explicit expression of operations on arrays
Parallel and Cluster Computing 29
Fortran 90 (ctd)
Serial extensions include – Free-format source code and some other simple
improvements – Dynamic memory allocation (automatic arrays,
allocatable arrays, and pointers and associated heap storage management)
– User-defined data types (structures)– Generic user-defined procedures (functions and
subroutines) and operators
Parallel and Cluster Computing 30
Fortran 90 (ctd)
Serial extensions (ctd) – Recursive procedures– New control structures to support structured
programming – A new program unit, MODULE, for encapsulation
of data and a related set of procedures
We focus on parallel extensions
Parallel and Cluster Computing 31
Fortran 90 (ctd)
Fortran 90 considers arrays first-class objects – Whole-array operations, assignments, and functions – Operations and assignments are extended in an
obvious way, on an element-by-element basis – Intrinsic functions are array-valued for array
arguments» operate element-wise if given an array as their argument
– Array expressions may include scalar constants and variables, which are replicated (or expanded) to the required number of elements
Parallel and Cluster Computing 32
Fortran 90 (ctd)
Example.
REAL, DIMENSION(3,4,5) :: a, b, c, d
...
c = a + b
d = SQRT(a)
c = a + 2.0
Parallel and Cluster Computing 33
WHERE Structure
Sometimes, some elements of arrays in an array-valued expression should be treated specially – Division by zero in a = 1./a should be avoided
WHERE statement
WHERE (a /= 0.) a = 1./a
WHERE construct
WHERE (a /= 0.)
  a = 1./a
ELSEWHERE
  a = HUGE(a)
END WHERE
Parallel and Cluster Computing 34
Fortran 90 (ctd)
All the array operands in an array-valued expression or array assignment must be conformable, i.e., they must have the same shape
– the same number of axes
– the same number of elements along each axis
Example. REAL :: a(3,4,5), b(0:2,4,5), c(3,4,-1:3)
Arrays a, b, and c have the same rank of 3, extents of 3,4, and 5, shape of {3,4,5}, size of 60 – Only differ in the lower and upper dimension bounds
Parallel and Cluster Computing 35
Array Section
An array section can be used everywhere in array assignments and array-valued expressions where a whole array is allowed
An array section may be specified with subscripts in the triplet form lower:upper:stride
It designates an ordered set $i_1, \dots, i_k$ such that
– $i_1 = lower$
– $i_{j+1} = i_j + stride$ ( $j = 1, \dots, k-1$ )
– $|i_k - upper| < |stride|$
Parallel and Cluster Computing 36
Array Section (ctd)
Example. REAL :: a(50,50) What sections are designated by the following
expressions? What are the rank and shape for each section? – a(i,1:50:1), a(i,1:50) – a(i,:)– a(i,1:50:3)– a(i,50:1:-1)– a(11:40,j)– a(1:10,1:10)
Parallel and Cluster Computing 37
Array Section (ctd)
Vector subscripts may also be used to specify array sections
– Any expression whose value is a rank 1 integer array may be used as a vector subscript
Example.
REAL :: a(5,5), b(5)
INTEGER :: index(5)
index = (/5,4,3,2,1/)
b = a(index,1)
Parallel and Cluster Computing 38
Array Section (ctd)
Whole arrays and array sections of the same shape can be mixed in expressions and assignments
Note that, unlike a whole array, an array section may not occupy contiguous storage locations
Parallel and Cluster Computing 39
Array Constants
Fortran 90 introduces array constants, or array constructors
– The simplest form is just a list of elements enclosed in (/ and /)
– May contain lists of scalars, lists of arrays, and implied-DO loops
Examples.
(/ (0, i=1,50) /)
(/ (3.14*i, i=4,100,3) /)
(/ ( (/ 5,4,3,2,1 /), i=1,5 ) /)
Parallel and Cluster Computing 40
Array Constants (ctd)
The array constructors can only produce 1-dimensional arrays– Function RESHAPE can be used to construct
arrays of higher rank
REAL :: a(500,500)
a = RESHAPE( (/ (0., i=1,250000) /), (/ 500,500 /) )
Parallel and Cluster Computing 41
Assumed-Shape and Automatic Arrays
Consider the user-defined procedure operating on arrays
SUBROUTINE swap(a,b)
  REAL, DIMENSION(:,:) :: a, b
  REAL, DIMENSION(SIZE(a,1), SIZE(a,2)) :: temp
  temp = a
  a = b
  b = temp
END SUBROUTINE swap
Parallel and Cluster Computing 42
Assumed-Shape and Automatic Arrays (ctd)
Formal array arguments a and b are of assumed shape– Only the type and rank are specified– The actual shape is taken from that of the actual array
arguments
The local array temp is an example of the automatic array– Its size is set at runtime– It stops existing as soon as control leaves the procedure
Parallel and Cluster Computing 43
Intrinsic Array Functions
Intrinsic array functions include– Extension of such intrinsic functions as SQRT, SIN, etc. to array arguments
– Specific array intrinsic functions Specific array intrinsic functions do the
following– Compute the scalar product of two vectors
(DOT_PRODUCT) and the matrix product of two matrices (MATMUL)
Parallel and Cluster Computing 44
Specific Intrinsic Array Functions
– Perform diverse reduction operations on an array
» logical multiplication (ALL) and addition (ANY)
» counting the number of true elements in the array (COUNT)
» arithmetical multiplication (PRODUCT) and addition (SUM) of its elements
» finding the smallest (MINVAL) or the largest (MAXVAL) element
Parallel and Cluster Computing 45
Specific Intrinsic Array Functions (ctd)
– Return diverse attributes of an array» its shape (SHAPE)» the lower dimension bounds of the array (LBOUND)» the upper dimension bounds (UBOUND)» the number of elements (SIZE)» the allocation status of the array (ALLOCATED)
Parallel and Cluster Computing 46
Specific Intrinsic Array Functions (ctd)
– Construct arrays by means of» merging two arrays under mask (MERGE)» packing an array into a vector (PACK)» replication of an array by adding a dimension
(SPREAD)» unpacking a vector (a rank 1 array) into an array
under mask (UNPACK)
Parallel and Cluster Computing 47
Specific Intrinsic Array Functions (ctd)
– Reshape arrays (RESHAPE)
– Move array elements performing» the circular shift (CSHIFT)» the end-off shift (EOSHIFT)» the transpose of a rank 2 array (TRANSPOSE)
– Locate the first maximum (MAXLOC) or minimum (MINLOC) element in an array
Parallel and Cluster Computing 48
C[]
C[] (C “brackets”) is a strict ANSI C superset allowing programmers to explicitly describe operations on arrays
Vector value, or vector– An ordered set of values (or vector values) of any
one type– Any vector type is characterised by
» the number of elements
» the type of elements
Parallel and Cluster Computing 49
Vector Value and Vector Object
Vector object– A region of data storage, the contents of which can
represent vector values– An ordered sequence of objects (or vector objects)
of any one type
Unlike ANSI C, C[] defines the notion of the value of an array object
– This value is a vector
Parallel and Cluster Computing 50
Vector Value and Vector Object (ctd)
Example. The value of the array
int a[3][2] = {0,1,2,3,4,5};
is the vector
{ {0,1}, {2,3}, {4,5} }
This vector has the shape {3,2}. This vector type is named int[3][2]
The shape of an array = that of its vector value
In C[], an array object is a particular case of a vector object
Parallel and Cluster Computing 51
Arrays and Pointers
C array is a contiguously allocated set of elements of any one type of object
C[] array is a set of elements of any one type of object sequentially allocated with a positive stride
– The stride is a distance between successive elements of the array measured in units equal to the size of the array element
– If the stride is not specified, it is assumed to be 1
Parallel and Cluster Computing 52
Arrays and Pointers (ctd)
C[] array has at least three attributes:– the type of elements– the number of elements– the allocation stride
Parallel and Cluster Computing 53
Arrays and Pointers (ctd)
Example 1.
int a[3];
int a[3:1];
[storage layout: in both cases the elements a[0], a[1], a[2] are contiguous in memory]
Example 2.
int a[3:3];
The slot between successive array elements is of 2*sizeof(int) bytes
Parallel and Cluster Computing 54
Arrays and Pointers (ctd)
In C, a pointer has only one attribute– The type of object it points to– It is needed to correctly interpret
» the value of the object it points to
» the address operators + and - (operand(s) and result should point into the same array)
In C[], a pointer has an additional attribute, stride– If stride is not specified, it is assumed to be 1
Parallel and Cluster Computing 55
Arrays and Pointers (ctd)
Example 1. The declarations
int a[] = {0,1,2,3,4};
int * p1 = (void*)a;
int *:2 p2 = (void*)&a[4];
form the following structure of storage
[figure: p1 points to a[0] and steps through a with stride 1; p2 points to a[4] and steps through a with stride 2]
p1+2 and p2-1 point to the same array element, a[2]
Parallel and Cluster Computing 56
Arrays and Pointers (ctd)
Expressions e1[e2] or (e2)[e1] provide access to the e2-th element of an array e1
– Identical to (*((e1)+(e2)))
– e2 is an integer expression
– e1 is an lvalue that has the type “array of type”
» converted to an expression of the type “pointer to type” pointing to the initial element of the array object
» the attribute stride of this pointer is identical to that of the array object
Parallel and Cluster Computing 57
Arrays and Pointers (ctd)
C[] allows dynamic arrays
typedef int (*pDiag)[n:n+1];
int a[n][n];
int j;
pDiag p = (void*)a;
...
for(j=0; j<n; j++)
  (*p)[j] = 1;     /* the stride n+1 steps along the main diagonal of a, setting it to 1 */
Parallel and Cluster Computing 58
Blocking Operator
In C[], the value of an aray object is a vector– The i-th element of the vector is the value of the i-th
element of the array object The postfix [] operator (the blocking operator)
– Supports access to an array as the whole – Its operand has the type “array of type”– Blocks the conversion of the operand to a pointer
Example. int a[5], b[5:2], c[5:3]; » a[], b[], and c[] designate arrays a, b, and c as a whole» c[]=a[]+b[]
Parallel and Cluster Computing 59
Lvector
In C, an lvalue is an expression designating an object
Example. int d[5][5];– d[i][j], d and d[0] are lvalues– d[i][j]+1 and &d[0] are not.
Modifiable lvalue– May be the left operand of an assignment operator
»d[i][j] is a modifiable lvalue»d and d[0] are not
Parallel and Cluster Computing 60
Lvector (ctd)
In C[], an lvector is an expression designating a vector object
Modifiable lvector– May be the left operand of an assignment operator
Example. int d[5][5];– d, d[0], d[], and d[0][] are lvectors
»d[], and d[0][] are modifiable»d and d[0] are not modifiable
Parallel and Cluster Computing 61
Lvector (ctd) Example. int a[4][4];
(*(int(*)[4:5])a)[]
[figure: the 4×4 array a with the main diagonal a[0][0], a[1][1], a[2][2], a[3][3] highlighted — the expression above designates this diagonal as a vector of 4 elements taken with stride 5]
Parallel and Cluster Computing 62
Subarray
An object belongs to an array if– It is an element of the array, or– It belongs to an element of the array
Subarray – A set of objects belonging to an array– An array itself
Example (ctd). The main diagonal is a subarray
– It is an array object of the type int[4:5]
Parallel and Cluster Computing 63
Subarray (ctd) Example. int a[4][4];
(*(int(*)[3:5])(&a[0][1]))[]
[figure: the 4×4 array a with the superdiagonal a[0][1], a[1][2], a[2][3] highlighted — the expression above designates these 3 elements taken with stride 5 starting from &a[0][1]]
Parallel and Cluster Computing 64
Subarray (ctd)
Not every regular set of objects belonging to an array makes up its subarray
Example. int a[5][5];
[figure: the 5×5 array a with the inner 3×3 square of elements a[i][j], 1 ≤ i,j ≤ 3, highlighted]
No constant modifiable lvector designates this “inner square”
Parallel and Cluster Computing 65
Array Section
The [:] operator (the grid operator) – Supports access to array sections of general form
Syntax. e[l:r:s] – Expression e may have type “array of type” or
“pointer to type”– Expressions l, r, and s have integer types and denote
» the left bound» the right bound» the stride
Parallel and Cluster Computing 66
Array Section (ctd)
Semantics. e[l:r:s] – A vector object of (r-s)/l+1 elements of type type
» Its i-th element is e[l+s*i]
– Expression e[l:r:s] is lvector– Expression e[l:r:s] is a modifiable lvector if
» All expressions {e[l+s*i] }i=0,1,… are modifiable lvectors/lvalues
– e[l:r:1] e[l:r]
Parallel and Cluster Computing 67
Array Section (ctd)
Operand e in e[l:r:s] may have a vector type – Operator [:] is applied element-wise
– Let the vector value of e be {u1,…,uk}
» e[l:r:s] will designate a vector of k vectors
» The i-th element of the j-th vector will be uj[l+s*i] ( j=1,…,k )
Parallel and Cluster Computing 68
Example
Example. int a[5][5];
a[1:3][1:3]
[figure: the 5×5 array a with the elements a[i][j], 1 ≤ i,j ≤ 3, highlighted — a[1:3][1:3] designates this 3×3 block]
Parallel and Cluster Computing 69
Example (ctd)
a[1:3]
[figure: the 5×5 array a with rows 1–3 highlighted — a[1:3] designates these three whole rows]
a[1:3][1:3]
[figure: the 5×5 array a with the elements a[i][j], 1 ≤ i,j ≤ 3, highlighted — a[1:3][1:3] designates this 3×3 block]
Parallel and Cluster Computing 70
Array Section (ctd)
Operands l and/or r in e[l:r:s] may be omitted – If l is omitted, the left bound is set to 0 – If r is omitted, the right bound is
» set to n-1, if the first operand e is an n-element array
» determined from the context, if e is a pointer
Example. int a[5][5];
– a[:][:] ≡ a[]
Parallel and Cluster Computing 71
Element-Wise Vector Operators
The operand of the cast operator and the unary &, *, +, -, ~, !, ++, and –- operators may have a vector type– The operators are applied element-wise
Example.
int j, k, l, m, n;
int *p[5] = { &j, &k, &l, &m, &n };
– *(p[1:3]) designates a vector object consisting of three integer variables k, l, and m.
Parallel and Cluster Computing 72
Element-Wise Vector Operators (ctd)
Binary operators *, /, %, +, -, <<, >>, <, >, <=, >=, ==, !=, &, ^, |, &&, and || may have vector operands
If the operands have the same shape, then the operator is executed element-wise producing the result of this shape
Parallel and Cluster Computing 73
Element-Wise Vector Operators (ctd)
In general, the operands may have different shapes but they must be conformable – 2 operands are conformable iff the beginning
of the shape of one operand is identical to the shape of the other operand» Vectors having shapes {9,8,7,6} and {9,8} are
conformable» A non-vector operand is conformable with any
vector operand (why?)
Parallel and Cluster Computing 74
Element-Wise Vector Operators (ctd)
Let operands a and b be conformable, and rank(a) < rank(b)
The execution of the operator starts from conformable extension of the value of a to the shape of b – The conformable extension just replicates the value
by adding dimensions
Parallel and Cluster Computing 75
Element-Wise Vector Operators (ctd)
Example. Conformable extension of vector {{1,2,3},{4,5,6}} of shape {2,3} to shape {2,3,2} is vector {{{1,1},{2,2},{3,3}},{{4,4},{5,5},{6,6}}}
Then the operator is applied element-wise to the result of the conformable extension of the value of a and the value of b, producing the result of the same shape as that of b
Parallel and Cluster Computing 76
Element-Wise Vector Operators (ctd)
The assignment operators =, *=, +=, etc. may have vector operands – The left operand shall be a modifiable lvector – Its rank shall not be less than that of the right operand – The operands shall be conformable
Two-step execution– The right operand is conformably extended to the
shape of the left one– The assignment is executed element-wise
Parallel and Cluster Computing 77
Element-Wise Vector Operators (ctd)
Example.
int a[m][n], b[m];
...
a[:][:] = b[:];
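For reference, a plain C sketch of what this C[] assignment means (a fragment reusing a, b, m and n from the example; a C[] compiler is not required to generate exactly these loops):
/* plain C equivalent of the C[] assignment a[:][:] = b[:] */
for (int i = 0; i < m; i++)      /* b is extended from shape {m} to {m,n} ... */
  for (int j = 0; j < n; j++)
    a[i][j] = b[i];              /* ... by replicating b[i] along the added dimension */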
Parallel and Cluster Computing 78
Example LU-factorization of the square matrix a by using the
Gaussian elimination
double a[n][n], t;
int i, j;
...
for(i=0; i<n; i++) {
  for(j=i+1; j<n; j++) {
    t = a[j][i]/a[i][i];
    if(a[j][i]!=0.)
      a[j][i:n-1] -= t*a[i][i:n-1];
  }
}
Parallel and Cluster Computing 79
Element-Wise Vector Operators (ctd)
By definition, e1[e2] is identical to (*((e1)+(e2)))– Therefore, e1 and e2 may be of vector type
The programmer can construct lvectors that designate irregular array sections
Example.
int a[m][n], ind[] = {0,1,6,18};
...
a[ind[:]][:] = 0;
This code zeros the elements of the 0-th, 1-st, 6-th, and 18-th rows of array a
Parallel and Cluster Computing 80
Element-Wise Vector Operators (ctd)
The first operand of the . operator may have a vector type
The second operand shall name a member of a structure or union type
The operator is executed element-wise– The result will have the same shape as the first
operand e->id is identical to (*e).id
Parallel and Cluster Computing 81
Reduction Operators
Reduction operators [*], [+], [&], [^], [|], [&&], [||], [?<], and [?>]– Unary operators
– Only applicable to vector operands
– If v1,…,vn are the elements of the vector value of
the expression e, then the value of the expression [*]e is that of the expression (v1*…*vn)
Parallel and Cluster Computing 82
Examples
Example 1. Dot product of the vectors a and b.
double a[n];
double b[n];
double c;
...
c = [+](a[]*b[]);
Parallel and Cluster Computing 83
Examples (ctd)
Example 2. Maximal element of the matrix a
int a[m][n];
int max;
...
max = [?>][?>]a[];
Parallel and Cluster Computing 84
Examples (ctd)
Example 3. Multiplication of matrices a and b
double a[m][l];
double b[l][n];
double c[m][n];
int i;
...
for(i=0; i<m; i++)
  c[i][] = [+](a[i][]*b[]);
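For reference, a plain C sketch of the loop nest this C[] code expresses (a fragment reusing a, b, c, m, n and l from the example): the product a[i][]*b[] conformably extends a[i][] over the columns of b, and [+] reduces over the l dimension:
/* plain C equivalent of c[i][] = [+](a[i][]*b[]) for all i */
for (int i = 0; i < m; i++)
  for (int j = 0; j < n; j++) {
    c[i][j] = 0;
    for (int k = 0; k < l; k++)
      c[i][j] += a[i][k] * b[k][j];   /* the reduction over k produced by [+] */
  }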
Parallel and Cluster Computing 85
Memory Hierarchy
Parallel programming systems for VPs and SPs take into account their modern memory structure
– Optimal memory management often gives a bigger performance gain than optimal usage of the instruction execution units (IEUs)
– Approaches to optimal memory management appear surprisingly similar to optimisation of parallel facilities
Simple two-level memory model– Small and fast register memory
– Large and relatively slow main memory
Parallel and Cluster Computing 86
Memory Hierarchy (ctd)
A simple modern memory hierarchy– Register memory – Cache memory – Main memory – Disk memory
Cache memory– A buffer memory between main memory and
registers – Holds copies of some data from the main memory
Parallel and Cluster Computing 87
Memory Hierarchy (ctd)
Execution of instruction reading a data item from the main memory into a register– Check if a copy of the data item is already in the
cache– If so, the data item will be actually transferred into
the register from the cache– If not, the data item will be transferred into the
register from the main memory, and a copy of the item will appear in the cache
Parallel and Cluster Computing 88
Cache
Cache– Partitioned into cache lines
» Cache line is a minimum unit of data transfer between the cache and the main memory
» Scalars may be transferred only as a part of a cache line
– Much smaller than the main memory» The same cache line may reflect different data blocks
from the main memory
Parallel and Cluster Computing 89
Cache (ctd)
Types of cache memory– Direct mapped
» each block of the main memory has only one place it can appear in the cache
– Fully associative » a block can be placed anywhere in the cache
– Set associative » a block can be placed in a restricted set of places» a set is a group of two or more cache lines» n-way associative cache
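A small sketch of how a direct-mapped cache chooses the single line a memory block can occupy (the line size and number of lines are illustrative):
#include <stdio.h>
#define LINE_SIZE  64    /* bytes per cache line  (assumed) */
#define NUM_LINES  512   /* lines in the cache    (assumed) */
/* in a direct-mapped cache, a block's line index is fully determined by its address */
unsigned long line_index(unsigned long address) {
  return (address / LINE_SIZE) % NUM_LINES;
}
int main(void) {
  /* two addresses NUM_LINES*LINE_SIZE bytes apart map to the same line and evict
     each other; an n-way set-associative cache relaxes exactly this restriction */
  printf("%lu %lu\n", line_index(0x10000), line_index(0x10000 + NUM_LINES*LINE_SIZE));
  return 0;
}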
Parallel and Cluster Computing 90
Cache (ctd)
Cache miss is the situation when a data item being referenced is not in the cache
Minimization of cache misses can significantly accelerate execution of the program
Programs intensively using basic operations on arrays are natural candidates for this type of optimization
Parallel and Cluster Computing 91
Loop Tiling
The main specific optimization minimizing the number of cache misses is loop tiling
Consider the loop nest
for(i=0; i<m; i++)        /* loop 1 */
  for(j=0; j<n; j++)      /* loop 2 */
    if(i==0) b[j]=a[i][j];
    else b[j]+=a[i][j];
b[j] are repeatedly used by successive iterations of loop 1
Parallel and Cluster Computing 92
Loop Tiling (ctd)
If n is large enough, the data items may be flushed from the cache by the moment of their repeated use
To minimize the flushing of repeatedly used data items, the number of iterations of loop 2 may be decreased
To keep the total number of iterations of this loop nest unchanged, an additional controlling loop is introduced
Parallel and Cluster Computing 93
Loop Tiling (ctd)
The transformed loop nest is:
for(k=0; k<n; k+=T)                 // additional controlling loop 0
  for(i=0; i<m; i++)                // loop 1
    for(j=k; j<min(k+T,n); j++)     // loop 2
      if(i==0) b[j]=a[i][j];
      else b[j]+=a[i][j];
This transformation is called tiling. T is the tile size
Parallel and Cluster Computing 94
Loop Tiling (ctd)
In general, the loop tiling is applied to loop nests of the form
for(i1=...)              /* loop 1 */
  for(i2=...)            /* loop 2 */
    ...
    for(in=...) {        /* loop n */
      ... e[i2]…[in] ...
    }
The goal is to minimize the number of cache misses for reference e[i2]…[in], which is repeatedly used by successive iterations of loop 1
Parallel and Cluster Computing 95
Loop Tiling and Optimising Compilers
The recognition of the loop nests that can be tiled is the most difficult problem to be solved by optimising C and Fortran 77 compilers
– Based on the analysis of data dependencies in loop nests
Theorem. The loop tiling is legally applicable (to the above loop nest) iff the loops from loop 2 to loop n are fully interchangeable
– To prove the interchangeability, an analysis of data dependence between different iterations of the loop nest is needed
Parallel and Cluster Computing 96
Loop Tiling and Array Libraries
Level 3 BLAS is specified to support block algorithms of matrix-matrix operations
Partitioning matrices into blocks and performing the computation on the blocks maximizes the reuse of data held in the upper levels of memory hierarchy
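A plain C sketch of the idea behind such block algorithms: a blocked (tiled) matrix multiplication in which each block computation reuses data that fit in cache (BS and N are illustrative; N is assumed to be a multiple of BS):
#define N  512
#define BS 32                        /* block (tile) size, assumed to fit in cache */
/* C = C + A*B, computed block by block to maximize reuse of cached data */
void blocked_mm(double A[N][N], double B[N][N], double C[N][N]) {
  for (int ii = 0; ii < N; ii += BS)
    for (int jj = 0; jj < N; jj += BS)
      for (int kk = 0; kk < N; kk += BS)
        /* multiply the (ii,kk) block of A by the (kk,jj) block of B */
        for (int i = ii; i < ii + BS; i++)
          for (int j = jj; j < jj + BS; j++)
            for (int k = kk; k < kk + BS; k++)
              C[i][j] += A[i][k] * B[k][j];
}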
Parallel and Cluster Computing 97
Loop Tiling and Parallel Languages
Compilers for parallel languages do not need to recognize loops suitable for tiling
They can translate explicit operations on arrays into loop nests with the best possible temporal locality
Parallel and Cluster Computing 98
Virtual memory
Instructions address virtual memory rather than the real physical memory
The virtual memory is partitioned into pages of a fixed size
– Each page is stored on a disk until it is needed
– When the page is needed, it is copied to main memory, with the virtual addresses mapping into real addresses
– This copying is known as paging or swapping
Parallel and Cluster Computing 99
Virtual memory (ctd)
Programs processing large enough arrays have data that do not fit into main memory
– The swapping takes place each time the required data are not in the main memory
– The swapping is a very expensive operation
– Minimization of the number of swappings can significantly accelerate the programs
The problem is similar to minimization of cache misses and can be, therefore, approached similarly
Parallel and Cluster Computing 100
Vector and Superscalar Processors: Summary
VPs and SPs provide instruction-level parallelism, which is best exploited by applications with intensive operations on arrays
Such applications can be written in a serial programming language and compiled by dedicated optimizing compilers performing some specific loop optimizations
– Modular, portable, and reliable programming are supported
– Efficiency and portable efficiency are also supported but only for a limited class of programs
Parallel and Cluster Computing 101
Vector and Superscalar Processors: Summary (ctd)
Array libraries allow the programmers to avoid the use of dedicated compilers– The programmers express operations on arrays
directly using calls to carefully implemented subroutines
– Modular, portable, and reliable programming are supported
– Limited efficiency and portable efficiency» Excludes global optimization of combined array
operations
Parallel and Cluster Computing 102
Vector and Superscalar Processors: Summary (ctd)
Parallel languages combine advantages of the first and second approaches– Operations on arrays can be explicitly expressed
» No need for sophisticated algorithms to recognize parallelizable loops
– Global optimisation of combined array operations is possible
– They support general-purpose programming (unlike existing array libraries)