Parallel and Cluster Computing 1
Parallel and Cluster Computing 2
Array Libraries
Function extensions of C and Fortran 77 with array or vector libraries
– The libraries are supposed to be optimised for each particular computer
– Regular compilers can be used => no need for dedicated optimising compilers
One of the most well-known and well-designed array libraries is the Basic Linear Algebra Subprograms (BLAS)
– Provides basic array operations for numerical linear algebra
– Available for most modern VP and SP computers
Parallel and Cluster Computing 3
BLAS
All BLAS routines are divided into 3 main categories:
– Level 1 BLAS addresses scalar and vector operations
– Level 2 BLAS addresses matrix-vector operations
– Level 3 BLAS addresses matrix-matrix operations
Routines of Level 1 do
– vector reduction operations
– vector rotation operations
– element-wise and combined vector operations
– data movement with vectors
Parallel and Cluster Computing 4
Level 1 BLAS A vector reduction operation
– The addition of the scaled dot product of two real vectors x and y into a scaled scalar r
$$r \leftarrow \alpha x^T y + \beta r = \alpha \sum_{i=0}^{n-1} x_i y_i + \beta r$$
The C interface of the routine implementing the operation is
void BLAS_ddot( enum blas_conj_type conj, int n, double alpha, const double *x, int incx, double beta, const double *y, int incy, double *r );
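A minimal usage sketch of this routine (assuming the binding above is exposed by a header such as blas_dense.h and that the enum constant blas_no_conj is available — both names are taken from the BLAS Technical Forum C interface and may differ between implementations):
#include <stdio.h>
#include <blas_dense.h>   /* assumed header exposing the prototype above */
int main(void) {
  double x[] = {1.0, 2.0, 3.0};
  double y[] = {4.0, 5.0, 6.0};
  double r = 10.0;                               /* r is updated in place        */
  /* r <- 1.0*(x^T y) + 0.5*r = 32.0 + 5.0 = 37.0 */
  BLAS_ddot(blas_no_conj, 3, 1.0, x, 1, 0.5, y, 1, &r);
  printf("r = %f\n", r);
  return 0;
}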
Parallel and Cluster Computing 5
Level 1 BLAS (ctd)
Other routines doing reduction operations
– Compute different vector norms of vector x
– Compute the sum of the entries of vector x
– Find the smallest or biggest component of vector x
– Compute the sum of squares of the entries of vector x
Routines doing rotation operations
– Generate Givens plane rotation
– Generate Jacobi rotation
– Generate Householder transformation
Parallel and Cluster Computing 6
Level 1 BLAS (ctd) An element-wise vector operation
– The scaled addition of two real vectors x and y:
$$w \leftarrow \alpha x + \beta y$$
The C interface of the routine implementing the operation is
void BLAS_dwaxpby( int n, double alpha, const double *x, int incx, double beta, const double *y, int incy, double *w, int incw );
Function BLAS_cwaxpby does the same operation but on complex vectors
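A similar sketch for BLAS_dwaxpby, under the same assumptions about the header and bindings:
#include <stdio.h>
#include <blas_dense.h>   /* assumed header exposing the prototype above */
int main(void) {
  double x[] = {1.0, 2.0, 3.0};
  double y[] = {10.0, 20.0, 30.0};
  double w[3];
  /* w <- 2.0*x + 0.5*y = {7.0, 14.0, 21.0} */
  BLAS_dwaxpby(3, 2.0, x, 1, 0.5, y, 1, w, 1);
  printf("%f %f %f\n", w[0], w[1], w[2]);
  return 0;
}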
Parallel and Cluster Computing 7
Level 1 BLAS (ctd) Other routines doing element-wise operations
– Scale the entries of a vector x by the real scalar 1/α
– Scale a vector x by α and a vector y by β, add these two vectors to one another and store the result in the vector y
– Combine a scaled vector accumulation and a dot product
– Apply a plane rotation to vectors x and y
Parallel and Cluster Computing 8
Level 1 BLAS (ctd) An example of data movement with vectors
– The interchange of real vectors x and y
The C interface of the routine implementing the operation is
void BLAS_dswap( int n, double *x, int incx, double *y, int incy );
Function BLAS_cswap does the same operation but on complex vectors
Parallel and Cluster Computing 9
Level 1 BLAS (ctd) Other routines doing data movement with vectors
– Copy vector x into vector y
– Sort the entries of real vector x in increasing or decreasing order and overwrite this vector x with the sorted vector as well as compute the corresponding permutation vector p
– Scale the entries of a vector x by the real scalar 1/α
– Permute the entries of vector x according to permutation vector p
Parallel and Cluster Computing 10
Level 2 BLAS
Routines of Level 2– Compute different matrix vector products– Do addition of scaled matrix vector products– Compute multiple matrix vector products– Solve triangular equations– Perform rank one and rank two updates– Some operations use symmetric or triangular matrices
Parallel and Cluster Computing 11
Level 2 BLAS (ctd) To store matrices, the following schemes are used
– Column-based and row-based storage
– Packed storage for symmetric or triangular matrices
– Band storage for band matrices
Conventional storage
– An n×n matrix A is stored in a one-dimensional array a
» a_ij => a[i+j*s] (C, column-wise storage with stride s)
» a_ij => a[j+i*s] (C, row-wise storage with stride s)
– If s=n, rows (columns) will be contiguous in memory
– If s>n, there will be a gap of (s-n) memory elements between two successive rows (columns)
– Only significant elements of symmetric/triangular matrices need be set
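The two mappings can be illustrated with a small self-contained C sketch (the helper names are illustrative, not part of BLAS):
#include <stdio.h>
/* element (i,j) of an n x n matrix stored column-wise with stride (leading dimension) s */
double get_colmajor(const double *a, int s, int i, int j) { return a[i + j*s]; }
/* element (i,j) of the same matrix stored row-wise with stride s */
double get_rowmajor(const double *a, int s, int i, int j) { return a[j + i*s]; }
int main(void) {
  /* the 2 x 2 matrix {{1,2},{3,4}} stored both ways with s = 2 */
  double col[] = {1.0, 3.0, 2.0, 4.0};   /* columns are contiguous */
  double row[] = {1.0, 2.0, 3.0, 4.0};   /* rows are contiguous    */
  printf("%f %f\n", get_colmajor(col, 2, 0, 1), get_rowmajor(row, 2, 0, 1)); /* both print 2.0 */
  return 0;
}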
Parallel and Cluster Computing 12
Packed Storage Packed storage
– The relevant triangle of a symmetric/triangular matrix is packed by columns or rows in a one-dimensional array
– The upper triangle of an n×n matrix A may be stored in a one-dimensional array a
» a_ij (i ≤ j) => a[j+i*(2*n-i-1)/2] (C, row-wise storage)
Example.
$$\begin{pmatrix} a_{00} & a_{01} & a_{02} \\ 0 & a_{11} & a_{12} \\ 0 & 0 & a_{22} \end{pmatrix} \;\Rightarrow\; \{\, a_{00},\ a_{01},\ a_{02},\ a_{11},\ a_{12},\ a_{22} \,\}$$
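The packed index mapping above can be sketched as a small C helper (illustrative, not part of BLAS):
#include <stdio.h>
/* index of a_ij (i <= j) in the row-wise packed upper triangle of an n x n matrix */
int packed_upper_index(int n, int i, int j) {
  return j + i*(2*n - i - 1)/2;
}
int main(void) {
  int n = 3;
  /* packed array from the example: {a00, a01, a02, a11, a12, a22} */
  printf("%d %d %d\n",
         packed_upper_index(n, 0, 0),   /* 0 -> a00 */
         packed_upper_index(n, 1, 2),   /* 4 -> a12 */
         packed_upper_index(n, 2, 2));  /* 5 -> a22 */
  return 0;
}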
Parallel and Cluster Computing 13
Band Storage
Band storage
– A compact storage scheme for band matrices
Consider Fortran and a column-wise storage scheme
– An m×n band matrix A with l subdiagonals and u superdiagonals may be stored in a 2-dimensional array A with l+u+1 rows and n columns
– Columns of matrix A are stored in corresponding columns of array A
– Diagonals of matrix A are stored in rows of array A
» a_ij => A(u+i-j, j) for max(0, j-u) ≤ i ≤ min(m-1, j+l)
Example.
$$\begin{pmatrix}
a_{00} & a_{01} & 0 & 0 & 0 \\
a_{10} & a_{11} & a_{12} & 0 & 0 \\
a_{20} & a_{21} & a_{22} & a_{23} & 0 \\
0 & a_{31} & a_{32} & a_{33} & a_{34} \\
0 & 0 & a_{42} & a_{43} & a_{44}
\end{pmatrix}
\;\Rightarrow\;
\begin{pmatrix}
* & a_{01} & a_{12} & a_{23} & a_{34} \\
a_{00} & a_{11} & a_{22} & a_{33} & a_{44} \\
a_{10} & a_{21} & a_{32} & a_{43} & * \\
a_{20} & a_{31} & a_{42} & * & *
\end{pmatrix}$$
(a 5×5 band matrix with l = 2 subdiagonals and u = 1 superdiagonal; asterisks mark array elements that correspond to no matrix entry)
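A sketch in C of packing a band matrix according to the mapping above, using 0-based indices and an array AB with l+u+1 rows and n columns (the constants and helper code are illustrative):
#include <stdio.h>
#define M 5
#define N 5
#define L 2   /* subdiagonals   */
#define U 1   /* superdiagonals */
int main(void) {
  double A[M][N] = {0};          /* full matrix, only the band is nonzero */
  double AB[L+U+1][N] = {0};     /* band storage: diagonals become rows   */
  int i, j;
  for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
      if (i - j <= L && j - i <= U)
        A[i][j] = 10*i + j;      /* some values inside the band */
  /* pack the band: a_ij goes to AB[U+i-j][j] for max(0,j-U) <= i <= min(M-1,j+L) */
  for (j = 0; j < N; j++) {
    int lo = (j - U > 0) ? j - U : 0;
    int hi = (j + L < M - 1) ? j + L : M - 1;
    for (i = lo; i <= hi; i++)
      AB[U + i - j][j] = A[i][j];
  }
  printf("%g %g\n", AB[U][0], AB[U][N-1]);  /* the main diagonal is row U: a00 and a44 */
  return 0;
}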
Parallel and Cluster Computing 14
Level 2 BLAS (ctd) An example of matrix vector multiplication operation
– The scaled addition of a real m-length vector y and the product of a general real m×n matrix A and a real n-length vector x:
$$y \leftarrow \alpha A x + \beta y$$
The C interface of the routine implementing this operation is
void BLAS_dgemv( enum blas_order_type order, enum blas_trans_type trans, int m, int n, double alpha, const double *a, int stride, const double *x, int incx, double beta, double *y, int incy );
Parameters
order => blas_rowmajor or blas_colmajor
trans => blas_no_trans (do not transpose A)
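A minimal usage sketch of BLAS_dgemv, under the same assumptions about the header and enum constants as above:
#include <stdio.h>
#include <blas_dense.h>   /* assumed header exposing the prototype above */
int main(void) {
  /* 2 x 2 matrix A = {{1,2},{3,4}} stored row-major with stride 2 */
  double a[] = {1.0, 2.0, 3.0, 4.0};
  double x[] = {1.0, 1.0};
  double y[] = {10.0, 20.0};
  /* y <- 1.0*A*x + 1.0*y = {3,7} + {10,20} = {13.0, 27.0} */
  BLAS_dgemv(blas_rowmajor, blas_no_trans, 2, 2, 1.0, a, 2, x, 1, 1.0, y, 1);
  printf("%f %f\n", y[0], y[1]);
  return 0;
}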
Parallel and Cluster Computing 15
Level 2 BLAS (ctd) If matrix A is a general band matrix with l subdiagonals and u
superdiagonals, the function
void BLAS_dgbmv( enum blas_order_type order, enum blas_trans_type trans, int m, int n, int l, int u, double alpha, const double *a, int stride, const double *x, int incx, double beta, double *y, int incy );
makes better use of memory. It assumes that a band storage scheme is used to store matrix A.
Parallel and Cluster Computing 16
Level 2 BLAS (ctd) Other routines of Level 2 perform the following operations
$$y \leftarrow \alpha A x + \beta y, \text{ where } A = A^T$$
$$x \leftarrow \alpha A x \text{ or } x \leftarrow \alpha A^T x, \text{ where } A \text{ is triangular}$$
$$y \leftarrow \alpha A x + \beta B x$$
$$x \leftarrow A^T y, \quad w \leftarrow A x$$
$$x \leftarrow A^T y, \quad w \leftarrow A z, \text{ where } A \text{ is triangular}$$
as well as many others For any matrix-vector operation with a specific matrix operand
(triangular, symmetric, banded, etc.), there is a routine for each storage scheme that can be used to store the operand
Parallel and Cluster Computing 17
Level 3 BLAS
Routines of Level 3 do
– O(n²) matrix operations
» norms, diagonal scaling, scaled accumulation and addition
» different storage schemes to store matrix operands are supported
– O(n³) matrix-matrix operations
» multiplication, solving matrix equations, symmetric rank k and 2k updates
– Data movement with matrices
Parallel and Cluster Computing 18
Level 3 BLAS (ctd)
An example of O(n²) matrix operation, which scales two real m×n matrices A and B and stores their sum in a matrix C, is
$$C \leftarrow \alpha A + \beta B$$
The C interface of the routine implementing this operation, under the assumption that the matrices A, B and C are of the general form, is
void BLAS_dge_add( enum blas_order_type order, int m, int n, double alpha, const double *a, int stride_a, double beta, const double *b, int stride_b, double *c, int stride_c );
There are 15 other routines performing this operation for different types and forms of the matrices A, B and C
Parallel and Cluster Computing 19
Level 3 BLAS (ctd)
An example of O(n³) matrix-matrix operation involving a real m×n matrix A, a real n×k matrix B, and a real m×k matrix C is
$$C \leftarrow \alpha A B + \beta C$$
The C routine implementing the operation for matrices A, B and C in the general form is
void BLAS_dgemm( enum blas_order_type order, enum blas_trans_type trans_a, enum blas_trans_type trans_b, int m, int n, int k, double alpha, const double *a, int stride_a, const double *b, int stride_b, double beta, double *c, int stride_c );
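A minimal usage sketch of BLAS_dgemm (header and enum names assumed as above; square matrices are used so the exact interpretation of m, n and k does not matter here):
#include <stdio.h>
#include <blas_dense.h>   /* assumed header exposing the prototype above */
int main(void) {
  /* square 2 x 2 matrices, row-major, stride 2 */
  double a[] = {1.0, 2.0,
                3.0, 4.0};
  double b[] = {1.0, 0.0,
                0.0, 1.0};        /* identity */
  double c[] = {0.0, 0.0,
                0.0, 0.0};
  /* C <- 1.0*A*B + 0.0*C = A */
  BLAS_dgemm(blas_rowmajor, blas_no_trans, blas_no_trans,
             2, 2, 2, 1.0, a, 2, b, 2, 0.0, c, 2);
  printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
  return 0;
}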
Parallel and Cluster Computing 20
Level 3 BLAS (ctd) Data movement with matrices includes
– Copying matrix A or its transpose with storing the result in matrix B
$B \leftarrow A$ or $B \leftarrow A^T$
– Transposition of a square matrix A with the result overwriting matrix A
$A \leftarrow A^T$
– Permutation of the rows or columns of matrix A by a permutation matrix P
$A \leftarrow PA$ or $A \leftarrow AP$
– Different types and forms of matrix operands as well as different storage schemes are supported
Parallel and Cluster Computing 21
Sparse BLAS
Sparse BLAS
– Provides routines for unstructured sparse matrices
– Poorer functionality compared to Dense and Banded BLAS
» only some basic array operations used in solving large sparse linear equations using iterative techniques
» matrix multiply, triangular solve, sparse vector update, dot product, gather/scatter
– Does not specify methods to store a sparse matrix
» storage format is dependent on the algorithm, the original sparsity pattern, the format in which the data already exists, etc.
» sparse matrix arguments are a placeholder, or handle, which refers to an abstract representation of a matrix, not the actual data components
Parallel and Cluster Computing 22
Sparse BLAS (ctd) Several routines provided to create sparse matrices
– The internal representation is implementation dependent» Sparse BLAS applications are independent of the matrix storage
scheme, relying on the scheme provided by each implementation
A typical Sparse BLAS application
– Creates an internal sparse matrix representation and returns its handle
– Uses the handle as a parameter in computational Sparse BLAS routines
– Calls a cleanup routine to free resources associated with the handle, when the matrix is no longer needed
Parallel and Cluster Computing 23
Example Example. Consider a C program using Sparse BLAS
performing the matrix-vector operation $y \leftarrow Ax$, where
$$A = \begin{pmatrix} 1.1 & 0 & 0 & 0 \\ 0 & 2.2 & 0 & 2.4 \\ 0 & 0 & 3.3 & 0 \\ 4.1 & 0 & 0 & 4.4 \end{pmatrix}, \qquad x = \begin{pmatrix} 1.0 \\ 1.0 \\ 1.0 \\ 1.0 \end{pmatrix}$$
Parallel and Cluster Computing 24
Example (ctd)
#include <blas_sparse.h>
int main() {
  const int n = 4, nonzeros = 6;
  double values[] = {1.1, 2.2, 2.4, 3.3, 4.1, 4.4};
  int index_i[] = {0, 1, 1, 2, 3, 3};
  int index_j[] = {0, 1, 3, 2, 0, 3};
  double x[] = {1.0, 1.0, 1.0, 1.0}, y[] = {0.0, 0.0, 0.0, 0.0};
  blas_sparse_matrix A;
  int k;
  double alpha = 1.0;
  A = BLAS_duscr_begin(n, n);                 // Create Sparse BLAS handle
  for(k=0; k < nonzeros; k++)                 // Insert entries one by one
    BLAS_duscr_insert_entry(A, values[k], index_i[k], index_j[k]);
  BLAS_uscr_end(A);                           // Complete construction of sparse matrix
  // Compute matrix-vector product y = A*x
  BLAS_dusmv(blas_no_trans, alpha, A, x, 1, y, 1);
  BLAS_usds(A);                               // Release matrix handle
}
Parallel and Cluster Computing 25
Parallel and Cluster Computing 26
Parallel Languages
C and Fortran 77 do not reflect some essential features of VP and SP architectures
– They cannot play for VPs and SPs the role they play for serial scalar processors
Optimizing compilers
– Only for a simple and limited class of applications
Array libraries
– Cover a limited class of array operations
» Other array operations can only be expressed as a combination of the locally-optimized library array operations
» This excludes global optimization of combined array operations
Parallel and Cluster Computing 27
Parallel Languages (ctd)
Parallel extensions of C and Fortran 77 allow programmers
– To explicitly express any array operation in a portable form
» The compiler does not need to recognize code to parallelize
» Global optimisation of operations on arrays is possible
– We consider 2 parallel supersets of C and Fortran 77
» Fortran 90
» C[]
Parallel and Cluster Computing 28
Fortran 90
Fortran 90 is a new Fortran standard released in 1991– Widely implemented since then
Two categories of new features
– Modernization of Fortran according to the state-of-the-art in serial programming languages
– Support for explicit expression of operations on arrays
Parallel and Cluster Computing 29
Fortran 90 (ctd)
Serial extensions include – Free-format source code and some other simple
improvements – Dynamic memory allocation (automatic arrays,
allocatable arrays, and pointers and associated heap storage management)
– User-defined data types (structures)– Generic user-defined procedures (functions and
subroutines) and operators
Parallel and Cluster Computing 30
Fortran 90 (ctd)
Serial extensions (ctd) – Recursive procedures– New control structures to support structured
programming – A new program unit, MODULE, for encapsulation
of data and a related set of procedures
We focus on parallel extensions
Parallel and Cluster Computing 31
Fortran 90 (ctd)
Fortran 90 considers arrays first-class objects – Whole-array operations, assignments, and functions – Operations and assignments are extended in an
obvious way, on an element-by-element basis – Intrinsic functions are array-valued for array
arguments» operate element-wise if given an array as their argument
– Array expressions may include scalar constants and variables, which are replicated (or expanded) to the required number of elements
Parallel and Cluster Computing 32
Fortran 90 (ctd)
Example.
REAL, DIMENSION(3,4,5) :: a, b, c, d
...
c = a + b
d = SQRT(a)
c = a + 2.0
Parallel and Cluster Computing 33
WHERE Structure
Sometimes, some elements of arrays in an array-valued expression should be treated specially – Division by zero in a = 1./a should be avoided
WHERE statement
WHERE (a /= 0.) a = 1./a
WHERE construct
WHERE (a /= 0.)
  a = 1./a
ELSEWHERE
  a = HUGE(a)
END WHERE
Parallel and Cluster Computing 34
Fortran 90 (ctd)
All the array operands in an array-valued expression or array assignment must be conformable, i.e., they must have the same shape
– the same number of axes
– the same number of elements along each axis
Example. REAL :: a(3,4,5), b(0:2,4,5), c(3,4,-1:3)
Arrays a, b, and c have the same rank of 3, extents of 3,4, and 5, shape of {3,4,5}, size of 60 – Only differ in the lower and upper dimension bounds
Parallel and Cluster Computing 35
Array Section
An array section can be used everywhere in array assignments and array-valued expressions where a whole array is allowed
An array section may be specified with subscripts in the triplet form lower:upper:stride
It designates an ordered set $i_1, \dots, i_k$ such that
– $i_1 = lower$
– $i_{j+1} = i_j + stride$ ( $j = 1, \dots, k-1$ )
– $|i_k - upper| < |stride|$
Parallel and Cluster Computing 36
Array Section (ctd)
Example. REAL :: a(50,50) What sections are designated by the following
expressions? What are the rank and shape for each section? – a(i,1:50:1), a(i,1:50) – a(i,:)– a(i,1:50:3)– a(i,50:1:-1)– a(11:40,j)– a(1:10,1:10)
Parallel and Cluster Computing 37
Array Section (ctd)
Vector subscripts may also be used to specify array sections
– Any expression whose value is a rank 1 integer array may be used as a vector subscript
Example.
REAL :: a(5,5), b(5)
INTEGER :: index(5)
index = (/5,4,3,2,1/)
b = a(index,1)
Parallel and Cluster Computing 38
Array Section (ctd)
Whole arrays and array sections of the same shape can be mixed in expressions and assignments
Note that, unlike a whole array, an array section may not occupy contiguous storage locations
Parallel and Cluster Computing 39
Array Constants
Fortran 90 introduces array constants, or array constructors
– The simplest form is just a list of elements enclosed in (/ and /)
– May contain lists of scalars, lists of arrays, and implied-DO loops
Examples.
(/ (0, i=1,50) /)
(/ (3.14*i, i=4,100,3) /)
(/ ( (/ 5,4,3,2,1 /), i=1,5 ) /)
Parallel and Cluster Computing 40
Array Constants (ctd)
The array constructors can only produce 1-dimensional arrays– Function RESHAPE can be used to construct
arrays of higher rank
REAL :: a(500,500)
a = RESHAPE( (/ (0., i=1,250000) /), (/ 500,500 /) )
Parallel and Cluster Computing 41
Assumed-Shape and Automatic Arrays
Consider the user-defined procedure operating on arrays
SUBROUTINE swap(a,b)
  REAL, DIMENSION(:,:) :: a, b
  REAL, DIMENSION(SIZE(a,1), SIZE(a,2)) :: temp
  temp = a
  a = b
  b = temp
END SUBROUTINE swap
Parallel and Cluster Computing 42
Assumed-Shape and Automatic Arrays (ctd)
Formal array arguments a and b are of assumed shape– Only the type and rank are specified– The actual shape is taken from that of the actual array
arguments
The local array temp is an example of the automatic array– Its size is set at runtime– It stops existing as soon as control leaves the procedure
Parallel and Cluster Computing 43
Intrinsic Array Functions
Intrinsic array functions include– Extension of such intrinsic functions as SQRT, SIN, etc. to array arguments
– Specific array intrinsic functions Specific array intrinsic functions do the
following– Compute the scalar product of two vectors
(DOT_PRODUCT) and the matrix product of two matrices (MATMUL)
Parallel and Cluster Computing 44
Specific Intrinsic Array Functions
– Perform diverse reduction operations on an array
» logical multiplication (ALL) and addition (ANY)
» counting the number of true elements in the array (COUNT)
» arithmetical multiplication (PRODUCT) and addition (SUM) of its elements
» finding the smallest (MINVAL) or the largest (MAXVAL) element
Parallel and Cluster Computing 45
Specific Intrinsic Array Functions (ctd)
– Return diverse attributes of an array» its shape (SHAPE)» the lower dimension bounds of the array (LBOUND)» the upper dimension bounds (UBOUND)» the number of elements (SIZE)» the allocation status of the array (ALLOCATED)
Parallel and Cluster Computing 46
Specific Intrinsic Array Functions (ctd)
– Construct arrays by means of» merging two arrays under mask (MERGE)» packing an array into a vector (PACK)» replication of an array by adding a dimension
(SPREAD)» unpacking a vector (a rank 1 array) into an array
under mask (UNPACK)
Parallel and Cluster Computing 47
Specific Intrinsic Array Functions (ctd)
– Reshape arrays (RESHAPE)
– Move array elements performing» the circular shift (CSHIFT)» the end-off shift (EOSHIFT)» the transpose of a rank 2 array (TRANSPOSE)
– Locate the first maximum (MAXLOC) or minimum (MINLOC) element in an array
Parallel and Cluster Computing 48
C[]
C[] (C “brackets”) is a strict ANSI C superset allowing programmers to explicitly describe operations on arrays
Vector value, or vector– An ordered set of values (or vector values) of any
one type– Any vector type is characterised by
» the number of elements
» the type of elements
Parallel and Cluster Computing 49
Vector Value and Vector Object
Vector object– A region of data storage, the contents of which can
represent vector values– An ordered sequence of objects (or vector objects)
of any one type
Unlike ANSI C, C[] defines the notion of the value of an array object
– This value is a vector
Parallel and Cluster Computing 50
Vector Value and Vector Object (ctd)
Example. The value of the array
int a[3][2] = {0,1,2,3,4,5};
is the vector
{ {0,1}, {2,3}, {4,5} }
This vector has the shape {3,2}. This vector type is named int[3][2]
The shape of an array = that of its vector value
In C[], an array object is a particular case of a vector object
Parallel and Cluster Computing 51
Arrays and Pointers
C array is a contiguously allocated set of elements of any one type of object
C[] array is a set of elements of any one type of object sequentially allocated with a positive stride
– The stride is a distance between successive elements of the array measured in units equal to the size of the array element
– If the stride is not specified, it is assumed to be 1
Parallel and Cluster Computing 52
Arrays and Pointers (ctd)
C[] array has at least three attributes:– the type of elements– the number of elements– the allocation stride
Parallel and Cluster Computing 53
Arrays and Pointers (ctd)
Example 1.
int a[3];
int a[3:1];
[storage layout: in both cases the elements a[0], a[1], a[2] are contiguous in memory]
Example 2.
int a[3:3];
The slot between successive array elements is of 2*sizeof(int) bytes
Parallel and Cluster Computing 54
Arrays and Pointers (ctd)
In C, a pointer has only one attribute– The type of object it points to– It is needed to correctly interpret
» the value of the object it points to
» the address operators + and - (operand(s) and result should point into the same array)
In C[], a pointer has an additional attribute, stride– If stride is not specified, it is assumed to be 1
Parallel and Cluster Computing 55
Arrays and Pointers (ctd)
Example 1. The declarations
int a[] = {0,1,2,3,4};
int * p1 = (void*)a;
int *:2 p2 = (void*)&a[4];
form the following structure of storage
[figure: p1 points to a[0] and steps through a with stride 1; p2 points to a[4] and steps through a with stride 2]
p1+2 and p2-1 point to the same array element, a[2]
Parallel and Cluster Computing 56
Arrays and Pointers (ctd)
Expressions e1[e2] or (e2)[e1] provide access to the e2-th element of an array e1
– Identical to (*((e1)+(e2)))
– e2 is an integer expression
– e1 is an lvalue that has the type “array of type”
» converted to an expression of the type “pointer to type” pointing to the initial element of the array object
» the attribute stride of this pointer is identical to that of the array object
Parallel and Cluster Computing 57
Arrays and Pointers (ctd)
C[] allows dynamic arrays
typedef int (*pDiag)[n:n+1];
int a[n][n];
int j;
pDiag p = (void*)a;
...
for(j=0; j<n; j++)
  (*p)[j] = 1;     /* the stride n+1 steps along the main diagonal of a, setting it to 1 */
Parallel and Cluster Computing 58
Blocking Operator
In C[], the value of an aray object is a vector– The i-th element of the vector is the value of the i-th
element of the array object The postfix [] operator (the blocking operator)
– Supports access to an array as the whole – Its operand has the type “array of type”– Blocks the conversion of the operand to a pointer
Example. int a[5], b[5:2], c[5:3]; » a[], b[], and c[] designate arrays a, b, and c as a whole» c[]=a[]+b[]
Parallel and Cluster Computing 59
Lvector
In C, an lvalue is an expression designating an object
Example. int d[5][5];– d[i][j], d and d[0] are lvalues– d[i][j]+1 and &d[0] are not.
Modifiable lvalue– May be the left operand of an assignment operator
»d[i][j] is a modifiable lvalue»d and d[0] are not
Parallel and Cluster Computing 60
Lvector (ctd)
In C[], an lvector is an expression designating a vector object
Modifiable lvector– May be the left operand of an assignment operator
Example. int d[5][5];– d, d[0], d[], and d[0][] are lvectors
»d[], and d[0][] are modifiable»d and d[0] are not modifiable
Parallel and Cluster Computing 61
Lvector (ctd) Example. int a[4][4];
(*(int(*)[4:5])a)[]
[figure: the 4×4 array a with the main diagonal a[0][0], a[1][1], a[2][2], a[3][3] highlighted — the expression above designates this diagonal as a vector of 4 elements taken with stride 5]
Parallel and Cluster Computing 62
Subarray
An object belongs to an array if– It is an element of the array, or– It belongs to an element of the array
Subarray – A set of objects belonging to an array– An array itself
Example (ctd). The main diagonal is a subarray
– It is an array object of the type int[4:5]
Parallel and Cluster Computing 63
Subarray (ctd) Example. int a[4][4];
(*(int(*)[3:5])(&a[0][1]))[]
[figure: the 4×4 array a with the superdiagonal a[0][1], a[1][2], a[2][3] highlighted — the expression above designates these 3 elements taken with stride 5 starting from &a[0][1]]
Parallel and Cluster Computing 64
Subarray (ctd)
Not every regular set of objects belonging to an array makes up its subarray
Example. int a[5][5];
[figure: the 5×5 array a with the inner 3×3 square of elements a[i][j], 1 ≤ i,j ≤ 3, highlighted]
No constant modifiable lvector designates this “inner square”
Parallel and Cluster Computing 65
Array Section
The [:] operator (the grid operator) – Supports access to array sections of general form
Syntax. e[l:r:s] – Expression e may have type “array of type” or
“pointer to type”– Expressions l, r, and s have integer types and denote
» the left bound» the right bound» the stride
Parallel and Cluster Computing 66
Array Section (ctd)
Semantics. e[l:r:s] – A vector object of (r-s)/l+1 elements of type type
» Its i-th element is e[l+s*i]
– Expression e[l:r:s] is lvector– Expression e[l:r:s] is a modifiable lvector if
» All expressions {e[l+s*i] }i=0,1,… are modifiable lvectors/lvalues
– e[l:r:1] e[l:r]
Parallel and Cluster Computing 67
Array Section (ctd)
Operand e in e[l:r:s] may have a vector type – Operator [:] is applied element-wise
– Let the vector value of e be {u1,…,uk}
» e[l:r:s] will designate a vector of k vectors
» The i-th element of the j-th vector will be uj[l+s*i] ( j=1,…,k )
Parallel and Cluster Computing 68
Example
Example. int a[5][5];
a[1:3][1:3]
[figure: the 5×5 array a with the elements a[i][j], 1 ≤ i,j ≤ 3, highlighted — a[1:3][1:3] designates this 3×3 block]
Parallel and Cluster Computing 69
Example (ctd)
a[1:3]
[figure: the 5×5 array a with rows 1–3 highlighted — a[1:3] designates these three whole rows]
a[1:3][1:3]
[figure: the 5×5 array a with the elements a[i][j], 1 ≤ i,j ≤ 3, highlighted — a[1:3][1:3] designates this 3×3 block]
Parallel and Cluster Computing 70
Array Section (ctd)
Operands l and/or r in e[l:r:s] may be omitted – If l is omitted, the left bound is set to 0 – If r is omitted, the right bound is
» set to n-1, if the first operand e is an n-element array
» determined from the context, if e is a pointer
Example. int a[5][5];
– a[:][:] ≡ a[]
Parallel and Cluster Computing 71
Element-Wise Vector Operators
The operand of the cast operator and the unary &, *, +, -, ~, !, ++, and –- operators may have a vector type– The operators are applied element-wise
Example.
int j, k, l, m, n;
int *p[5] = { &j, &k, &l, &m, &n };
– *(p[1:3]) designates a vector object consisting of three integer variables k, l, and m.
Parallel and Cluster Computing 72
Element-Wise Vector Operators (ctd)
Binary operators *, /, %, +, -, <<, >>, <, >, <=, >=, ==, !=, &, ^, |, &&, and || may have vector operands
If the operands have the same shape, then the operator is executed element-wise producing the result of this shape
Parallel and Cluster Computing 73
Element-Wise Vector Operators (ctd)
In general, the operands may have different shapes but they must be conformable – 2 operands are conformable iff the beginning
of the shape of one operand is identical to the shape of the other operand» Vectors having shapes {9,8,7,6} and {9,8} are
conformable» A non-vector operand is conformable with any
vector operand (why?)
Parallel and Cluster Computing 74
Element-Wise Vector Operators (ctd)
Let operands a and b be conformable, and rank(a) < rank(b)
The execution of the operator starts from conformable extension of the value of a to the shape of b – The conformable extension just replicates the value
by adding dimensions
Parallel and Cluster Computing 75
Element-Wise Vector Operators (ctd)
Example. Conformable extension of vector {{1,2,3},{4,5,6}} of shape {2,3} to shape {2,3,2} is vector {{{1,1},{2,2},{3,3}},{{4,4},{5,5},{6,6}}}
Then the operator is applied element-wise to the result of the conformable extension of the value of a and the value of b, producing the result of the same shape as that of b
Parallel and Cluster Computing 76
Element-Wise Vector Operators (ctd)
The assignment operators =, *=, +=, etc. may have vector operands – The left operand shall be a modifiable lvector – Its rank shall not be less than that of the right operand – The operands shall be conformable
Two-step execution– The right operand is conformably extended to the
shape of the left one– The assignment is executed element-wise
Parallel and Cluster Computing 77
Element-Wise Vector Operators (ctd)
Example.
int a[m][n], b[m];
...
a[:][:] = b[:];
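For reference, a plain C sketch of what this C[] assignment means (a fragment reusing a, b, m and n from the example; a C[] compiler is not required to generate exactly these loops):
/* plain C equivalent of the C[] assignment a[:][:] = b[:] */
for (int i = 0; i < m; i++)      /* b is extended from shape {m} to {m,n} ... */
  for (int j = 0; j < n; j++)
    a[i][j] = b[i];              /* ... by replicating b[i] along the added dimension */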
Parallel and Cluster Computing 78
Example LU-factorization of the square matrix a by using the
Gaussian elimination
double a[n][n], t;
int i, j;
...
for(i=0; i<n; i++) {
  for(j=i+1; j<n; j++) {
    t = a[j][i]/a[i][i];
    if(a[j][i]!=0.)
      a[j][i:n-1] -= t*a[i][i:n-1];
  }
}
Parallel and Cluster Computing 79
Element-Wise Vector Operators (ctd)
By definition, e1[e2] is identical to (*((e1)+(e2)))– Therefore, e1 and e2 may be of vector type
The programmer can construct lvectors that designate irregular array sections
Example.
int a[m][n], ind[] = {0,1,6,18};
...
a[ind[:]][:] = 0;
This code zeros the elements of the 0-th, 1-st, 6-th, and 18-th rows of array a
Parallel and Cluster Computing 80
Element-Wise Vector Operators (ctd)
The first operand of the . operator may have a vector type
The second operand shall name a member of a structure or union type
The operator is executed element-wise– The result will have the same shape as the first
operand e->id is identical to (*e).id
Parallel and Cluster Computing 81
Reduction Operators
Reduction operators [*], [+], [&], [^], [|], [&&], [||], [?<], and [?>]– Unary operators
– Only applicable to vector operands
– If v1,…,vn are the elements of the vector value of
the expression e, then the value of the expression [*]e is that of the expression (v1*…*vn)
Parallel and Cluster Computing 82
Examples
Example 1. Dot product of the vectors a and b.
double a[n];
double b[n];
double c;
...
c = [+](a[]*b[]);
Parallel and Cluster Computing 83
Examples (ctd)
Example 2. Maximal element of the matrix a
int a[m][n];
int max;
...
max = [?>][?>]a[];
Parallel and Cluster Computing 84
Examples (ctd)
Example 3. Multiplication of matrices a and b
double a[m][l];
double b[l][n];
double c[m][n];
int i;
...
for(i=0; i<m; i++)
  c[i][] = [+](a[i][]*b[]);
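For reference, a plain C sketch of the loop nest this C[] code expresses (a fragment reusing a, b, c, m, n and l from the example): the product a[i][]*b[] conformably extends a[i][] over the columns of b, and [+] reduces over the l dimension:
/* plain C equivalent of c[i][] = [+](a[i][]*b[]) for all i */
for (int i = 0; i < m; i++)
  for (int j = 0; j < n; j++) {
    c[i][j] = 0;
    for (int k = 0; k < l; k++)
      c[i][j] += a[i][k] * b[k][j];   /* the reduction over k produced by [+] */
  }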
Parallel and Cluster Computing 85
Memory Hierarchy
Parallel programming systems for VPs and SPs take into account their modern memory structure
– Optimal memory management often gives a bigger performance gain than optimal usage of the instruction execution units (IEUs)
– Approaches to optimal memory management appear surprisingly similar to optimisation of parallel facilities
Simple two-level memory model– Small and fast register memory
– Large and relatively slow main memory
Parallel and Cluster Computing 86
Memory Hierarchy (ctd)
A simple modern memory hierarchy– Register memory – Cache memory – Main memory – Disk memory
Cache memory– A buffer memory between main memory and
registers – Holds copies of some data from the main memory
Parallel and Cluster Computing 87
Memory Hierarchy (ctd)
Execution of instruction reading a data item from the main memory into a register– Check if a copy of the data item is already in the
cache– If so, the data item will be actually transferred into
the register from the cache– If not, the data item will be transferred into the
register from the main memory, and a copy of the item will appear in the cache
Parallel and Cluster Computing 88
Cache
Cache– Partitioned into cache lines
» Cache line is a minimum unit of data transfer between the cache and the main memory
» Scalars may be transferred only as a part of a cache line
– Much smaller than the main memory» The same cache line may reflect different data blocks
from the main memory
Parallel and Cluster Computing 89
Cache (ctd)
Types of cache memory– Direct mapped
» each block of the main memory has only one place it can appear in the cache
– Fully associative » a block can be placed anywhere in the cache
– Set associative » a block can be placed in a restricted set of places» a set is a group of two or more cache lines» n-way associative cache
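A small sketch of how a direct-mapped cache chooses the single line a memory block can occupy (the line size and number of lines are illustrative):
#include <stdio.h>
#define LINE_SIZE  64    /* bytes per cache line  (assumed) */
#define NUM_LINES  512   /* lines in the cache    (assumed) */
/* in a direct-mapped cache, a block's line index is fully determined by its address */
unsigned long line_index(unsigned long address) {
  return (address / LINE_SIZE) % NUM_LINES;
}
int main(void) {
  /* two addresses NUM_LINES*LINE_SIZE bytes apart map to the same line and evict
     each other; an n-way set-associative cache relaxes exactly this restriction */
  printf("%lu %lu\n", line_index(0x10000), line_index(0x10000 + NUM_LINES*LINE_SIZE));
  return 0;
}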
Parallel and Cluster Computing 90
Cache (ctd)
Cache miss is the situation when a data item being referenced is not in the cache
Minimization of cache misses can significantly accelerate execution of the program
Programs intensively using basic operations on arrays are natural candidates for this type of optimization
Parallel and Cluster Computing 91
Loop Tiling
The main specific optimization minimizing the number of cache misses is loop tiling
Consider the loop nest
for(i=0; i<m; i++)        /* loop 1 */
  for(j=0; j<n; j++)      /* loop 2 */
    if(i==0) b[j]=a[i][j];
    else b[j]+=a[i][j];
b[j] are repeatedly used by successive iterations of loop 1
Parallel and Cluster Computing 92
Loop Tiling (ctd)
If n is large enough, the data items may be flushed from the cache by the moment of their repeated use
To minimize the flushing of repeatedly used data items, the number of iterations of loop 2 may be decreased
To keep the total number of iterations of this loop nest unchanged, an additional controlling loop is introduced
Parallel and Cluster Computing 93
Loop Tiling (ctd)
The transformed loop nest is:
for(k=0; k<n; k+=T)                 // additional controlling loop 0
  for(i=0; i<m; i++)                // loop 1
    for(j=k; j<min(k+T,n); j++)     // loop 2
      if(i==0) b[j]=a[i][j];
      else b[j]+=a[i][j];
This transformation is called tiling. T is the tile size
Parallel and Cluster Computing 94
Loop Tiling (ctd)
In general, the loop tiling is applied to loop nests of the form
for(i1=...)              /* loop 1 */
  for(i2=...)            /* loop 2 */
    ...
    for(in=...) {        /* loop n */
      ... e[i2]…[in] ...
    }
The goal is to minimize the number of cache misses for reference e[i2]…[in], which is repeatedly used by successive iterations of loop 1
Parallel and Cluster Computing 95
Loop Tiling and Optimising Compilers
The recognition of the loop nests that can be tiled is the most difficult problem to be solved by optimising C and Fortran 77 compilers
– Based on the analysis of data dependencies in loop nests
Theorem. The loop tiling is legally applicable (to the above loop nest) iff the loops from loop 2 to loop n are fully interchangeable
– To prove the interchangeability, an analysis of data dependence between different iterations of the loop nest is needed
Parallel and Cluster Computing 96
Loop Tiling and Array Libraries
Level 3 BLAS is specified to support block algorithms of matrix-matrix operations
Partitioning matrices into blocks and performing the computation on the blocks maximizes the reuse of data held in the upper levels of memory hierarchy
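A plain C sketch of the idea behind such block algorithms: a blocked (tiled) matrix multiplication in which each block computation reuses data that fit in cache (BS and N are illustrative; N is assumed to be a multiple of BS):
#define N  512
#define BS 32                        /* block (tile) size, assumed to fit in cache */
/* C = C + A*B, computed block by block to maximize reuse of cached data */
void blocked_mm(double A[N][N], double B[N][N], double C[N][N]) {
  for (int ii = 0; ii < N; ii += BS)
    for (int jj = 0; jj < N; jj += BS)
      for (int kk = 0; kk < N; kk += BS)
        /* multiply the (ii,kk) block of A by the (kk,jj) block of B */
        for (int i = ii; i < ii + BS; i++)
          for (int j = jj; j < jj + BS; j++)
            for (int k = kk; k < kk + BS; k++)
              C[i][j] += A[i][k] * B[k][j];
}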
Parallel and Cluster Computing 97
Loop Tiling and Parallel Languages
Compilers for parallel languages do not need to recognize loops suitable for tiling
They can translate explicit operations on arrays into loop nests with the best possible temporal locality
Parallel and Cluster Computing 98
Virtual memory
Instructions address virtual memory rather than the real physical memory
The virtual memory is partitioned into pages of a fixed size
– Each page is stored on a disk until it is needed
– When the page is needed, it is copied to main memory, with the virtual addresses mapping into real addresses
– This copying is known as paging or swapping
Parallel and Cluster Computing 99
Virtual memory (ctd)
Programs processing large enough arrays have data that do not fit into main memory
– The swapping takes place each time the required data are not in the main memory
– The swapping is a very expensive operation
– Minimization of the number of swappings can significantly accelerate the programs
The problem is similar to minimization of cache misses and can be, therefore, approached similarly
Parallel and Cluster Computing 100
Vector and Superscalar Processors: Summary
VPs and SPs provide instruction-level parallelism, which is best exploited by applications with intensive operations on arrays
Such applications can be written in a serial programming language and compiled by dedicated optimizing compilers performing some specific loop optimizations
– Modular, portable, and reliable programming are supported
– Efficiency and portable efficiency are also supported but only for a limited class of programs
Parallel and Cluster Computing 101
Vector and Superscalar Processors: Summary (ctd)
Array libraries allow the programmers to avoid the use of dedicated compilers– The programmers express operations on arrays
directly using calls to carefully implemented subroutines
– Modular, portable, and reliable programming are supported
– Limited efficiency and portable efficiency» Excludes global optimization of combined array
operations
Parallel and Cluster Computing 102
Vector and Superscalar Processors: Summary (ctd)
Parallel languages combine advantages of the first and second approaches– Operations on arrays can be explicitly expressed
» No need for sophisticated algorithms to recognize parallelizable loops
– Global optimisation of combined array operations is possible
– They support general-purpose programming (unlike existing array libraries)