Programming with CellSs

BSC


Page 1:

Programming with CellSs

BSC

Page 2:

Outline

• StarSs Programming Model

• CellSs runtime

• CellSs syntax

• CellSs compiler

• Programming examples

• Performance analysis using Paraver

• Conclusions

Page 3:

StarSs programming model

Basic idea

...
for (i=0; i<N; i++){
   T1 (data1, data2);
   T2 (data4, data5);
   T3 (data2, data5, data6);
   T4 (data7, data8);
   T5 (data6, data8, data9);
}
...

Sequential Application

[Figure: each iteration of the loop is turned into task instances T1_0, T2_0, T3_0, T4_0, T5_0, T1_1, T2_1, ...; the instances form a task graph whose tasks are executed on Resource 1 .. Resource N]

Steps:

• Task selection + parameters direction (input, output, inout)

• Task graph creation based on data precedence

• Scheduling, data transfer, task execution

• Synchronization, results transfer

Parallel resources (multicore, SMP, cluster, grid)
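To make the basic idea concrete, a minimal sketch of how the tasks of the loop above could be annotated so that the runtime can derive the graph sketched in the figure; the parameter directions and the block size BS are assumptions chosen for illustration, not part of the original example:

#pragma css task input(data1) output(data2)
void T1 (float data1[BS], float data2[BS]);

#pragma css task input(data4) output(data5)
void T2 (float data4[BS], float data5[BS]);

#pragma css task input(data2, data5) output(data6)
void T3 (float data2[BS], float data5[BS], float data6[BS]);

#pragma css task input(data7) output(data8)
void T4 (float data7[BS], float data8[BS]);

#pragma css task input(data6, data8) output(data9)
void T5 (float data6[BS], float data8[BS], float data9[BS]);

With these directions, in every iteration T3 must wait for T1 and T2, and T5 must wait for T3 and T4, which is exactly the dependence pattern drawn above.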

Page 4:

CellSs: Syntax example - matrix multiply

int main (int argc, char **argv) {
   int i, j, k;
   initialize(A, B, C);
   for (i=0; i < NB; i++)
      for (j=0; j < NB; j++)
         for (k=0; k < NB; k++)
            block_addmultiply (C[i][j], A[i][k], B[k][j]);
}

static void block_addmultiply (float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
   int i, j, k;
   for (i=0; i < BS; i++)
      for (j=0; j < BS; j++)
         for (k=0; k < BS; k++)
            C[i][j] += A[i][k] * B[k][j];
}

[Figure: C, A and B are NB x NB matrices of BS x BS blocks]

Page 5:

CellSs: Syntax example - matrix multiply

int main (int argc, char **argv) {
   int i, j, k;
   initialize(A, B, C);
   for (i=0; i < NB; i++)
      for (j=0; j < NB; j++)
         for (k=0; k < NB; k++)
            block_addmultiply( C[i][j], A[i][k], B[k][j]);
}

#pragma css task input(A, B) inout(C)
static void block_addmultiply( float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
   int i, j, k;
   for (i=0; i < BS; i++)
      for (j=0; j < BS; j++)
         for (k=0; k < BS; k++)
            C[i][j] += A[i][k] * B[k][j];
}

[Figure: C, A and B are NB x NB matrices of BS x BS blocks]
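For reference, a minimal sketch of the data layout that the calls C[i][j], A[i][k], B[k][j] assume (NB x NB matrices of BS x BS blocks); the concrete values of NB and BS and the body of initialize are assumptions added here for illustration:

#define NB 16   /* blocks per dimension (assumed value) */
#define BS 64   /* block size in floats (assumed value) */

float A[NB][NB][BS][BS];
float B[NB][NB][BS][BS];
float C[NB][NB][BS][BS];

/* Illustrative initialization: fill A and B, clear C */
void initialize (float A[NB][NB][BS][BS], float B[NB][NB][BS][BS],
                 float C[NB][NB][BS][BS])
{
   for (int i = 0; i < NB; i++)
      for (int j = 0; j < NB; j++)
         for (int ii = 0; ii < BS; ii++)
            for (int jj = 0; jj < BS; jj++) {
               A[i][j][ii][jj] = 1.0f;
               B[i][j][ii][jj] = 2.0f;
               C[i][j][ii][jj] = 0.0f;
            }
}

With this layout each block is a contiguous BS x BS tile, which is what a task receives and what the runtime moves to and from the SPE Local Store.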

Page 6:

CellSs: Runtime

[Figure: CellSs runtime structure. The PPE runs the user main program and the CellSs PPU library: the main thread runs the user code while the helper thread performs data dependence analysis, data renaming (renaming table), scheduling and work assignment. Each SPE (SPE0, SPE1, SPE2, ...) runs the CellSs SPU library together with the original task code, repeatedly performing DMA in, task execution, DMA out and synchronization. Tasks, task control buffers, stage in/out data, user data, synchronization and the finalization signal are exchanged through main memory.]

Page 7:

CellSs: Runtime - argument renaming

• False dependences (WaW and WaR) are removed with dynamic

renaming of arguments

for (i=0; i<N; i++) {

T1 (…,…, block1);

T2 (block1, …, block2[i]);

T3 (block2[i],…,…);

}

Block1 is output from task T1

Block1 is input to task T2

[Figure: without renaming, every iteration's tasks T1_i, T2_i, T3_i reuse the same block1, so WaW and WaR dependences link consecutive iterations T1_1..T3_1, T1_2..T3_2, ... T1_N..T3_N]

Page 8:

CellSs: Runtime - argument renaming

• False dependences (WaW and WaR) are removed with dynamic

renaming of arguments

for (i=0; i<N; i++) {

T1 (…,…, block1);

T2 (block1, …, block2[i]);

T3 (block2[i],…,…);

}

Block1 is output from task T1

Block1 is input to task T2

[Figure: with renaming, iteration i writes its own copy block1_1, block1_2, ... block1_N, so the WaW and WaR dependences between iterations disappear and the iterations can overlap]
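A source-level way of picturing what the renaming achieves (a sketch for illustration only; with CellSs the user keeps the original code and the runtime performs the renaming dynamically):

for (i=0; i<N; i++) {
   /* conceptually, the runtime gives each iteration its own instance
      of block1, written here as block1_i                             */
   T1 (..., ..., block1_i);          /* writes block1_i */
   T2 (block1_i, ..., block2[i]);    /* reads  block1_i */
   T3 (block2[i], ..., ...);
}
/* T1 of iteration i+1 no longer has to wait until T2 of iteration i
   has finished reading block1: only the true (read-after-write)
   dependences remain, as in the second version of the figure.        */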

Page 9:

CellSs: Runtime – scheduling

• Scheduling strategy

• Critical path

• Locality

[Figure: examples of bundles: a bundle of dependent tasks (data locality in the SPE), a bundle of independent tasks, and a mixed bundle]

Page 10:

CellSs: Runtime

• Paraver view of the runtime behavior

Bundle

Main thread: runs user code and adds and removes tasks to the task graph

SPEs: execute the tasks' code

Helper thread: schedules tasks and synchronizes with the SPEs

Page 11:

CellSs: Runtime – specific SPE library features

• Data dependence analysis, data renaming and task scheduling are performed in the CellSs PPE runtime library

• The CellSs SPE runtime library implements specific features that assist the PPE runtime library, but operate independently of it:

• Early callback

• Minimal stage-out

• Software cache in the SPE Local Store

• Double buffering

Page 12:

CellSs: Runtime – specific SPE library features

• Early callback

• Initially, completion of tasks is communicated on a per-bundle basis

• There are cases where this limits the application (task A in the example)

• An early callback right after the limiting task enables the scheduling of new bundles

• Condition: the task has more than one outgoing dependence

Page 13:

CellSs: Runtime – specific SPE library features

• Minimal stage-out

• For each task in a bundle, its outputs are written back to main memory

• If a later task inside the same bundle rewrites the same output, there is no need to write it back to main memory

• The case in the figure cannot happen, thanks to renaming

• Example: matmul

• C[i][j] += A[i][k]*B[k][j]

[Figure: bundles of tasks X, Y, X, Z annotated with "writes A", "writes A" / "writes A'" and "reads A", illustrating which outputs need to be staged out]

Page 14:

CellSs: Runtime – specific SPE library features

• Software cache in the SPE Local Store

• Maintained by the SPE runtime

• LRU replacement strategy

• PPE scheduling is not aware of this behavior
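A minimal sketch of the software-cache idea described above; the data structure, the dma_get helper and the sizes are assumptions for illustration and are much simpler than the actual CellSs SPU library:

#define CACHE_ENTRIES 8                 /* blocks kept in the Local Store (assumed) */

extern void dma_get (void *ls, unsigned long long ea, unsigned size);  /* hypothetical DMA helper */

typedef struct {
   unsigned long long ea;        /* main-memory address of the cached block              */
   void              *ls;        /* copy of the block in the Local Store (pre-allocated) */
   unsigned           last_use;  /* timestamp driving the LRU replacement                */
} cache_entry_t;

static cache_entry_t cache[CACHE_ENTRIES];
static unsigned      now;

/* Return a Local Store copy of the block at address ea: reuse it on a hit,
   otherwise evict the least recently used entry and DMA the block in.      */
void *cache_get (unsigned long long ea, unsigned size)
{
   int victim = 0;
   for (int i = 0; i < CACHE_ENTRIES; i++) {
      if (cache[i].ea == ea) {                  /* hit: no transfer needed */
         cache[i].last_use = ++now;
         return cache[i].ls;
      }
      if (cache[i].last_use < cache[victim].last_use)
         victim = i;                            /* remember the LRU entry  */
   }
   dma_get (cache[victim].ls, ea, size);        /* miss: stage the block in */
   cache[victim].ea       = ea;
   cache[victim].last_use = ++now;
   return cache[victim].ls;
}

Because the cache lives entirely in the SPE, the PPE scheduler cannot rely on it: a hit is a bonus, not something the bundle scheduling assumes.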

Page 15:

CellSs: Runtime – specific SPE library features

...

#pragma css task input(A, B) inout(C)

block_addmultiply( C[i][j], A[i][k], B[k][j])

C[i][j]

A[i][k] B[k][j]

• For each operation, two blocks of data are fetched from main memory to the SPE Local Store

• Clusters of dependent tasks are scheduled to the same SPE

The inout block is kept in the Local Store and written back to main memory only once (reuse)

Page 16:

CellSs: Runtime - specific SPE library features

• Double buffering

• CellSs overlaps DMA transfers with computations

[Figure: timeline of one bundle on an SPE. For each task in the bundle (task 1, task 2, ... task N): DMA programming to read the task control buffer, waiting for the DMA transfer, DMA programming to read the data, task execution overlapped with data transfers, DMA programming to write the data; finally, synchronization with the helper thread]
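The overlap in the timeline above can be pictured with a classic double-buffering loop. This is only a sketch: dma_get, dma_put, dma_wait, task_t and execute_task are placeholders for the real DMA primitives and task structures of the SPU library, and BS is an assumed block size.

#define BS 64                                   /* assumed block size */

extern void dma_get  (void *ls, unsigned long long ea, unsigned size, int tag);  /* hypothetical */
extern void dma_put  (void *ls, unsigned long long ea, unsigned size, int tag);  /* hypothetical */
extern void dma_wait (int tag);                                                  /* hypothetical */

typedef struct { unsigned long long data_ea; } task_t;   /* simplified task record        */
extern void execute_task (task_t *t, void *block);       /* placeholder for the task code */

void run_bundle (task_t *task, int ntasks)
{
   static float buf[2][BS][BS];                 /* two Local Store buffers */
   int cur = 0;

   dma_get (buf[0], task[0].data_ea, sizeof buf[0], 0);      /* prefetch the first input */
   for (int t = 0; t < ntasks; t++) {
      int nxt = cur ^ 1;
      if (t + 1 < ntasks) {
         dma_wait (nxt);                        /* buffer free? (previous stage-out done)  */
         dma_get (buf[nxt], task[t+1].data_ea, sizeof buf[nxt], nxt);  /* next stage-in    */
      }
      dma_wait (cur);                           /* wait only for this task's own data      */
      execute_task (&task[t], buf[cur]);        /* compute while the next transfer runs    */
      dma_put (buf[cur], task[t].data_ea, sizeof buf[cur], cur);       /* stage out        */
      cur = nxt;
   }
   dma_wait (0);
   dma_wait (1);                                /* drain the remaining stage-out DMAs      */
}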

Page 17:

CellSs: Runtime - specific SPE library features

• Double buffering: paraver view

SPE reads data

SPE executes task

Page 18:

CellSs: Runtime - specific SPE library features

• Double buffering: paraver view

DMA programming DMA programming

SPE waits for DMA in

Page 19:

CellSs: Runtime - specific SPE library features

• Double buffering: paraver view

DMA out programming

DMA in programming

SPE waits for DMA in

Page 20:

CellSs: Runtime - specific SPE library features

• Double buffering: paraver view

DMA out programming

SPE waits for DMA out (all)

Page 21:

CellSs: Syntax

• pragmas' syntax:

#pragma css task [input (<input parameters>)] \

[output (<output parameters>)] \

[inout (<input/output parameters>)] \

[highpriority]

void task(<parameters>) { ...

#pragma css wait on(<data address>)

#pragma css barrier

#pragma css start

#pragma css finish

Page 22:

CellSs: Syntax

• Examples: task selection

#pragma css task input(A, B) inout(C)
void block_addmultiply( float C[N][N], float A[N][N], float B[N][N] ) { ...

#pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS])
void block_addmultiply( float *C, float *A, float *B ) { ...

#pragma css task input(A[BS][BS], B[BS][BS], BS) inout(C[BS][BS])
void block_addmultiply( float *C, float *A, float *B, int BS ) { ...

• Examples: waiting for data

#pragma css task input (ref_block, to_comp) output (mse)
void are_blocks_equal (float ref_block[BS][BS], float to_comp[BS][BS], float *mse) { ...

...
are_blocks_equal (X[ii][jj], Y[ii][jj], &sq_error);
#pragma css wait on (sq_error)
if (sq_error > 0.0000001)
...

Page 23:

CellSs: Syntax

• Examples: synchronization

for (i=0; i < NB; i++)

for (j=0; j < NB; j++)

for (k=0; k < NB; k++)

block_addmultiply( C[i][j], A[i][k], B[k][j]);

#pragma css barrier

• Examples: prioritization

#pragma css task input(lefthalo[32], tophalo[32], righthalo[32], \

bottomhalo[32]) inout(A[32][32]) highpriority

void jacobi (float *lefthalo, float *tophalo, float *righthalo, float *bottomhalo, float *A)

{

...

}

Page 24:

CellSs: Syntax

• Examples: CellSs program boundary

#pragma css start

for (i=0; i < NB; i++)

for (j=0; j < NB; j++)

for (k=0; k < NB; k++)

block_addmultiply( C[i][j], A[i][k], B[k][j]);

#pragma css finish

Page 25:

CellSs: Syntax in Fortran

subroutine example()
   ...
   interface
      !$CSS TASK
      subroutine block_add_multiply(C, A, B, BS)
         implicit none
         integer, intent(in) :: BS
         real, intent(in) :: A(BS,BS), B(BS,BS)
         real, intent(inout) :: C(BS,BS)
      end subroutine
   end interface
   ...
   !$CSS START
   ...
   call block_add_multiply(C, A, B, BLOCK_SIZE)
   ...
   !$CSS FINISH
   ...
end subroutine

!$CSS TASK
subroutine block_add_multiply(C, A, B, BS)
   ...
end subroutine

Page 26:

CellSs compiler: Compiler phase

[Figure: compiler phase of CELLSS-CC. The source file app.c goes through code translation (mcc), which generates cellss-spu-cc_app.c (SPU-specific code), cellss-ppu-cc_app.c (PPU-specific code) and app.tasks (tasks list). The SPE compiler produces cellss-spu-cc_app.o and the PPE compiler produces cellss-ppu-cc_app.o; pack combines the objects and app.tasks into app.o]

Page 27:

CellSs compiler: Compiler phase

• Files

• app.c: User code, with CellSs annotations

• cellss-spu-cc_app.c: specific code generated for the spu (tasks code)

• cellss-ppu-cc_app.c: specific code generated for the ppu (main program)

• app.tasks: list of annotated tasks

• Compilation steps

• mcc: Mercurium compiler (BSC), source-to-source compiler

• SPE compiler: Generic SPE compiler (IBM SDK)

• PPE compiler: Generic PPE compiler (IBM SDK)

• pack: Specific CellSs module that combines objects (BSC)
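In practice the whole chain is driven by one command. A hedged example of what an invocation can look like; the driver name is taken from the CELLSS-CC box in the figure, and the exact spelling and options depend on the installed distribution (only the -t flag, which selects the instrumented library, is mentioned later in this tutorial):

> cellss-cc -o matmul matmul.c
> cellss-cc -t -o matmul matmul.c     (link against the instrumented runtime so a Paraver trace is produced)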

Page 28:

CellSs compiler: Linker phase

[Figure: linker phase of CELLSS-CC. Inputs: app.o, app.tasks, app-adapters.c, libCellSS.so and libCellSS-spu.a. Steps: unpack recovers cellss-spu-cc_app.o, cellss-ppu-cc_app.o and app.tasks from each object; the glue code generator produces exec-adapters.c and exec-registration.c; the SPE compiler and SPE linker build exec-spu, the SPE embedder turns it into exec-spu.o, and the PPE compiler and PPE linker combine everything into the final executable exec]

Page 29:

CellSs compiler: Linker phase

• Files

• exec-adapters.c: code generated for each of the annotated tasks to uniformly

call them (“stubs”).

• exec-registration.c: code generated to register the annotated tasks

• Linker steps

• unpack: unpacks objects

• glue code generator: from all the *.tasks files of an application generates a

single “adapters” file and a single “registration” file per executable

• SPE, PPE compilers and linkers and SPE embedder (IBM SDK)

Page 30:

CellSs: Programming examples

• Cholesky factorization

• Common matrix operation used to solve normal equations in linear

least squares problems.

• Calculates a triangular matrix (L) from a symmetric, positive definite matrix A.

Cholesky(A) = L

L · L^t = A

• Different possible implementations, depending on how the matrix is

traversed (by rows, by columns, left-looking, right-looking)

• It can be decomposed into block operations

Page 31:

CellSs: Programming examples

• In each iteration red and blue blocks are updated

• SPOTRF: Computes the Cholesky factorization of the diagonal block

• STRSM: Computes the column panel

• SSYRK: Computes the row panel

• SGEMM: Updates the rest of the matrix
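Restated as block update rules (this is just the comments of the code on the next slide written out; i, j, k range over the DIM x DIM grid of blocks):

SGEMM:   A[i][j] <- A[i][j] - A[i][k] * (A[j][k])^t      (k < j, i > j)
SSYRK:   A[j][j] <- A[j][j] - A[j][i] * (A[j][i])^t      (i < j)
SPOTRF:  A[j][j] <- Cholesky(A[j][j])
STRSM:   A[i][j] <- A[i][j] * (A[j][j])^-t               (i > j)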

Page 32:

CellSs: Programming examples

main () {
   ...
   for (int j = 0; j < DIM; j++) {
      for (int k = 0; k < j; k++) {
         for (int i = j+1; i < DIM; i++) {
            // A[i,j] = A[i,j] - A[i,k] * (A[j,k])^t
            css_sgemm_tile( A[i][k], A[j][k], A[i][j] );
         }
      }
      for (int i = 0; i < j; i++) {
         // A[j,j] = A[j,j] - A[j,i] * (A[j,i])^t
         css_ssyrk_tile( A[j][i], A[j][j] );
      }
      // Cholesky factorization of A[j,j]
      css_spotrf_tile( A[j][j] );
      for (int i = j+1; i < DIM; i++) {
         // A[i,j] <- X, where A[i,j] = X * (A[j,j])^t
         css_strsm_tile( A[j][j], A[i][j] );
      }
   }
   ...
   for (int i = 0; i < DIM; i++) {
      for (int j = 0; j < DIM; j++) {
         #pragma css wait on (A[i][j])
         print_block(A[i][j]);
      }
   }
   ...
}

#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])

void sgemm_tile(float *A, float *B, float *C)

#pragma css task input (T[64][64]) inout(B[64][64])

void strsm_tile(float *T, float *B)

#pragma css task input(A[64][64]) inout(C[64][64])

void ssyrk_tile(float *A, float *C)

#pragma css task inout(A[64][64])

void spotrf_tile(float *A)

[Figure: A is a DIM x DIM matrix of 64 x 64 blocks]

Cholesky factorization

Page 33:

CellSs: Programming examples

• Sparse LU

• A more general factorization than Cholesky

• Deals with non-symmetric matrices

• Calculates a lower triangular matrix (L) and an upper triangular matrix (U) whose product equals a row permutation of the original matrix

Perm(A) = L * U

• Difficult to program for the Cell, since some operations work on columns (not blocks)

• The example shown here is a simplified version (without pivoting) based on

an initial sparse matrix

Page 34:

CellSs: Programming examples

int main(int argc, char **argv) {

int ii, jj, kk;

for (kk=0; kk<NB; kk++) {

lu0(A[kk][kk]);

for (jj=kk+1; jj<NB; jj++)

if (A[kk][jj] != NULL)

fwd(A[kk][kk], A[kk][jj]);

for (ii=kk+1; ii<NB; ii++)

if (A[ii][kk] != NULL) {

bdiv (A[kk][kk], A[ii][kk]);

for (jj=kk+1; jj<NB; jj++)

if (A[kk][jj] != NULL) {

if (A[ii][jj]==NULL)

A[ii][jj]=allocate_clean_block();

bmod(A[ii][kk], A[kk][jj], A[ii][jj]);

}

}

}

}

[Figure: A is an NB x NB matrix of B x B blocks (NULL entries for empty blocks)]

void lu0(float *diag);

void bdiv(float *diag, float *row);

void bmod(float *row, float *col, float *inner);

void fwd(float *diag, float *col);

Sparse LU

Page 35:

Dynamic main memory allocation

Data dependent parallelism

int main(int argc, char **argv) {

int ii, jj, kk;

for (kk=0; kk<NB; kk++) {

lu0(A[kk][kk]);

for (jj=kk+1; jj<NB; jj++)

if (A[kk][jj] != NULL)

fwd(A[kk][kk], A[kk][jj]);

for (ii=kk+1; ii<NB; ii++)

if (A[ii][kk] != NULL) {

bdiv (A[kk][kk], A[ii][kk]);

for (jj=kk+1; jj<NB; jj++)

if (A[kk][jj] != NULL) {

if (A[ii][jj]==NULL)

A[ii][jj]=allocate_clean_block();

bmod(A[ii][kk], A[kk][jj], A[ii][jj]);

}

}

}

}

CellSs: Programming examples

#pragma css task inout(diag[B][B]) highpriority

void lu0(float *diag);

#pragma css task input(diag[B][B]) inout(row[B][B])

void bdiv(float *diag, float *row);

#pragma css task input(row[B][B],col[B][B]) inout(inner[B][B])

void bmod(float *row, float *col, float *inner);

#pragma css task input(diag[B][B]) inout(col[B][B])

void fwd(float *diag, float *col);

Page 36:

CellSs: Programming examples

int main(int argc, char* argv[])

{

...

copy_mat (A, origA);

LU (A);

split_mat (A, L, U);

clean_mat (A);

sparse_matmult (L, U, A);

compare_mat (origA, A);

}

#pragma css task input(Src) output(Dst)

void copy_block (float Src[BS][BS], float Dst[BS][BS]);

void copy_mat (float *Src,float *Dst)

{

...

for (ii=0; ii<NB; ii++)

for (jj=0; jj<NB; jj++)

...

copy_block(Src[ii][jj],block);

...

}

#pragma css task input(A) output(L,U)

void split_block (float A[BS][BS], float L[BS][BS], float U[BS][BS]);

void split_mat (float *LU[NB][NB],float *L[NB][NB],float *U[NB][NB])

{

...

for (ii=0; ii<NB; ii++)

for (jj=0; jj<NB; jj++){

...

split_block (LU[ii][ii],L[ii][ii],U[ii][ii]);

...

}

}

Checking LU

Page 37:

CellSs: Programming examples

int main(int argc, char* argv[])

{

...

copy_mat (A, origA);

LU (A);

split_mat (A, L, U);

clean_mat (A);

sparse_matmult (L, U, A);

compare_mat (origA, A);

}

Checking LU

void clean_mat (p_block_t Src[NB][NB])

{

int ii, jj;

for (ii=0; ii<NB; ii++)

for (jj=0; jj<NB; jj++)

if (Src[ii][jj] != NULL) {

free (Src[ii][jj]);

Src[ii][jj]=NULL;

}

}

#pragma css task output(Dst)

void clean_block (float Dst[BS][BS] );

void clean_mat (p_block_t Src[NB][NB])

{

int ii, jj;

for (ii=0; ii<NB; ii++)

for (jj=0; jj<NB; jj++)

if (Src[ii][jj] != NULL) {

clean_block(Src[ii][jj]);

}

}

Page 38:

CellSs: Programming examples

int main(int argc, char* argv[])

{

...

copy_mat (A, origA);

LU (A);

split_mat (A, L, U);

clean_mat (A);

sparse_matmult (L, U, A);

compare_mat (origA, A);

}

Checking LU

void sparse_matmult (float *A[NB][NB], float *B[NB][NB], float

*C[NB][NB])

{

int ii, jj, kk;

for (ii=0; ii<NB; ii++)

for (jj=0; jj<NB; jj++)

for (kk=0; kk<NB; kk++)

if ((A[ii][kk]!= NULL) && (B[kk][jj] !=NULL )) {

if (C[ii][jj] == NULL)

C[ii][jj] = allocate_clean_block();

block_matmul (A[ii][kk], B[kk][jj], C[ii][jj]);

}

}

#pragma css task input(a,b) inout(c)

void block_matmul(float a[BS][BS], float b[BS][BS], float c[BS][BS])

{

int i, j, k;

for (i=0; i<BS; i++)

for (j=0; j<BS; j++)

for (k=0; k<BS; k++)

c[i][j] += a[i][k]*b[k][j];

}

Page 39:

CellSs: Programming examples

int main(int argc, char* argv[])

{

...

copy_mat (A, origA);

LU (A);

split_mat (A, L, U);

clean_mat (A);

sparse_matmult (L, U, A);

compare_mat (origA, A);

}

Checking LU

#pragma css task input (ref_block, to_comp) output (mse)
void are_blocks_equal (float ref_block[BS][BS], float to_comp[BS][BS], float *mse);

void compare_mat (p_block_t X[NB][NB], p_block_t Y[NB][NB], struct

timeval *stop)

{

...

Zero_block = allocate_clean_block();

for (ii = 0; ii < NB; ii++)

for (jj = 0; jj < NB; jj++) {

if (X[ii][jj] == NULL)

if (Y[ii][jj] == NULL) sq_error[ii][jj] = 0.0f;

else are_blocks_equal(Zero_block, Y[ii][jj], &sq_error[ii][jj]);

else are_blocks_equal(X[ii][jj], Y[ii][jj],&sq_error[ii][jj]);

}

#pragma css finish

for (ii = 0; ii < NB; ii++)

for (jj = 0; jj < NB; jj++)

if (sq_error[ii][jj] >0.0000001L) {

printf ("block [%d, %d]: detected mse = % 20lf\n", ii,

jj,sq_error[ii][jj]);

some_difference =TRUE;

}

if ( some_difference == FALSE) printf ("matrices are identical\n");

}

Page 40:

CellSs: Programming examples

int main(int argc, char* argv[])

{

...

copy_mat (A, origA);

LU (A);

split_mat (A, L, U);

clean_mat (A);

sparse_matmult (L, U, A);

compare_mat (origA, A);

}

Checking LU

#pragma css task input (ref_block, to_comp) output (mse)
void are_blocks_equal (float ref_block[BS][BS], float to_comp[BS][BS], float *mse);

void compare_mat (p_block_t X[NB][NB], p_block_t Y[NB][NB], struct

timeval *stop)

{

...

Zero_block = allocate_clean_block();

for (ii = 0; ii < NB; ii++)

for (jj = 0; jj < NB; jj++) {

if (X[ii][jj] == NULL)

if (Y[ii][jj] == NULL) sq_error[ii][jj] = 0.0f;

else are_blocks_equal(Zero_block, Y[ii][jj], &sq_error[ii][jj]);

else are_blocks_equal(X[ii][jj], Y[ii][jj],&sq_error[ii][jj]);

}

for (ii = 0; ii < NB; ii++)

for (jj = 0; jj < NB; jj++)

#pragma css wait on (&sq_error[ii][jj])

if (sq_error[ii][jj] >0.0000001L) {

printf ("block [%d, %d]: detected mse = % 20lf\n", ii,

jj,sq_error[ii][jj]);

some_difference =TRUE;

}

if ( some_difference == FALSE) printf ("matrices are identical\n");

}

Page 41:

CellSs: Programming examples

copy_mat (A, origA);

LU (A);

split_mat (A, L, U);

clean_mat(A);

sparse_matmult (L, U, A);

compare_mat (origA, A);

[Figure: behavior of the Checking LU code, without CellSs and with CellSs (for an NB=4 matrix)]

Page 42:

CellSs: Programming examples

Behavior Checking LU

0: are_blocks_equal, 1: bdiv_adapte, 2: block_mpy_add, 3: bmod, 4: clean_block, 5: copy_block, 6: fwd, 7: lu0, 8: split_block

Page 43:

CellSs: Programming examples

• Molecular dynamics: Argon simulation

• Simulates the mobility of argon atoms in the gas state, in a constant volume at T = 300 K

• All electrostatic forces on each atom due to the other atoms are considered (Fi)

• Newton's second law is then applied to each atom

Fi = m * ai

• The initial velocities are random but reasonable for argon

atoms at 300K

• To maintain a constant temperature throughout the process, the Berendsen algorithm is applied
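A short restatement of the temperature control used in the code on the next slides (the simple velocity-rescaling form shown here is what the lam1 computation in the Fortran code implements; treating it as the limiting case of Berendsen's scheme is an assumption):

T_inst = (1/N) * sum_i  m * v_i^2 / (3 * k_B)
lambda = sqrt(T / T_inst)
v_i <- lambda * v_i      (presumably applied block by block inside update_position)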

Page 44:

CellSs: Programming examples

program argon

...

!$CSS START

do step=1,niter

do ii=1, N, BSIZE

do jj=1, N, BSIZE

call velocity(BSIZE, ii, jj, x(ii), y(ii),

z(ii), x(jj), y(jj), z(jj), vx(ii),

vy(ii), vz(ii))

enddo

enddo

do jj=1, N, BSIZE

call v_mod(BSIZE, v(jj), vx(jj), vy(jj), vz(jj))

enddo

!$CSS BARRIER

tins=0.e0

do i=1,N

tins=mkg*v(i)**2/3.e0/kb+tins

enddo

tins=tins/N

lam1=sqrt(t/tins)

do ii=1, N, BSIZE

call update_position(BSIZE, lam1, vx(ii), vy(ii),

vz(ii), x(ii), y(ii), z(ii))

enddo

enddo

!$CSS FINISH

end

program argon

...

interface

!$CSS TASK

subroutine velocity(BSIZE, ii, jj, xi, yi, zi, xj, yj, zj,

vx, vy, vz)

implicit none

integer, intent(in) :: BSIZE, ii, jj

real, intent(in), dimension(BSIZE) :: xi, yi, zi, xj, yj, zj

real, intent(inout), dimension(BSIZE) :: vx, vy, vz

end subroutine

!$CSS TASK

subroutine update_position(BSIZE, lam1, vx, vy, vz, x, y, z)

implicit none

integer, intent(in) :: BSIZE

real, intent(in) :: lam1

real, intent(inout), dimension(BSIZE) :: vx, vy, vz

real, intent(inout), dimension(BSIZE) :: x, y, z

end subroutine

!$CSS TASK

subroutine v_mod(BSIZE, v, vx, vy, vz)

implicit none

integer, intent(in) :: BSIZE

real, intent(out) :: v(BSIZE)

real, intent(in), dimension(BSIZE) :: vx, vy, vz

end subroutine

end interface

Molecular dynamics: Argon simulation

Page 45:

CellSs: Programming examples

program argon

...

!$CSS START

do step=1,niter

do ii=1, N, BSIZE

do jj=1, N, BSIZE

call velocity(BSIZE, ii, jj, x(ii), y(ii),

z(ii), x(jj), y(jj), z(jj), vx(ii),

vy(ii), vz(ii))

enddo

enddo

do jj=1, N, BSIZE

call v_mod(BSIZE, v(jj), vx(jj), vy(jj), vz(jj))

enddo

!$CSS BARRIER

tins=0.e0

do i=1,N

tins=mkg*v(i)**2/3.e0/kb+tins

enddo

tins=tins/N

lam1=sqrt(t/tins)

do ii=1, N, BSIZE

call update_position(BSIZE, lam1, vx(ii), vy(ii),

vz(ii), x(ii), y(ii), z(ii))

enddo

enddo

!$CSS FINISH

end

!$CSS TASK

subroutine velocity(BSIZE, ii, jj, xi, yi, zi, xj, yj, zj, vx, vy, vz)

! subroutine code

end subroutine

!$CSS TASK

subroutine update_position(BSIZE, lam1, vx, vy, vz, x, y, z)

! subroutine code

end subroutine

!$CSS TASK

subroutine v_mod(BSIZE, v, vx, vy, vz)

! subroutine code

end subroutine

Molecular dynamics: Argon simulation

Page 46:

CellSs: Performance Analysis with Paraver

• Paraver

• Flexible performance visualization and analysis tool that can be used to analyze:

• MPI, OpenMP, MPI+OpenMP

• Java

• Hardware counters profile

• Operating system activity

• ... and many other things you may think of

• Generally it uses external trace file generators. Example for MPI:

> mpitrace mpirun -n 10 my_mpi-binary

• For CellSs, the libraries have been instrumented.

• When installing the distribution, two libraries are generated: normal and instrumented

• Flag -t links with instrumented version

• Available for free from the BSC website: www.bsc.es/paraver

Page 47:

CellSs: Performance Analysis with Paraver

• Running paraver

paraver tracefile-0001.prv

Page 48:

CellSs: Performance Analysis with Paraver

• Configuration files

Configuration file          Feature shown
2dh inbw.cfg                Histogram of the bandwidth achieved by individual DMA IN transfers.
2dh inbytes.cfg             Histogram of bytes read by the stage-in DMA transfers.
2dh outbw.cfg               Histogram of the bandwidth achieved by individual DMA OUT transfers.
2dh outbytes.cfg            Histogram of bytes written by the stage-out DMA transfers.
3dh duration phase.cfg      Histogram of duration for each of the runtime phases.
3dh duration tasks.cfg      Histogram of duration of SPU tasks.
DMA bw.cfg                  DMA (in + out) bandwidth per SPU.
DMA bytes.cfg               Bytes being DMAed (in + out) by each SPU.
execution phases.cfg        Profile of the percentage of time spent by each thread in each of the major phases.

Page 49:

CellSs: Performance Analysis with Paraver

• Configuration files

Configuration file              Feature shown
flushing.cfg                    Intervals (dark blue) where each SPU is flushing its local trace buffer to main memory.
general.cfg                     Mix of timelines.
stage in out phase.cfg          Identification of DMA in (grey) and DMA out (green) phases.
task.cfg                        Outlined function being executed by each SPU.
task distance histogram.cfg     Histogram of the task distance between dependent tasks.
task number.cfg                 Number of the task being executed by each SPU.
Task profile.cfg                Time (microseconds) each SPU spent executing the different tasks.
task repetitions.cfg            Shows which SPU executed each task and the number of times that the task was executed.
Total DMA bw.cfg                Total DMA (in + out) bandwidth to memory.

Page 50:

CellSs: Performance Analysis with Paraver

[Figure: Paraver timeline showing clustering (a group of 8 tasks, 23 us, block size 64x64 floats), DMA in/out, data re-use, and the main and helper threads]

Page 51:

CellSs: Performance Analysis with Paraver

• Demo

• matmul application

• first view, explain what is seen

• show cfgs, explain that they are in the distribution and where

• $(CELLSS_HOME)/share/cellss/paraver_cfgs/

• matmul

• show execution phases, tasks (and task type), task number

• show flushing

• size of DMA in

• another cholesky

• execution phases, tasks, task number, order of tasks that copy data from

memory

Page 52:

CellSs: Performance Analysis with Paraver

Another Cholesky

Page 53:

CellSs performance evolution & scalability

[Chart: Matmul performance. GFlops vs. number of SPUs (0-8) for CellSs versions from March 2007, July 2007, Nov 2007, December 2008 and May 2009]

[Chart: Cholesky performance evolution. GFlops vs. matrix size (up to 4096) for versions from Apr 2007, Jul 2007, Set 2008 and May 2009]

[Chart: Cholesky scalability. Speed-up vs. number of SPUs (0-8) for matrix sizes 1024, 2048 and 4096]

[Chart: Cholesky performance. GFlops vs. number of SPUs (0-8) for matrix sizes 1024, 2048 and 4096]

Page 54:

CellSs: issues and ongoing efforts

• CellSs programming model

• Array regions, subobject accesses

• Blocks larger than Local Store

• Access to global memory by tasks

• CellSs runtime system

• Further optimization of overheads (insert task and remove task)

• By-passing (SPE to SPE transfers)

• Scheduling algorithms: overhead, locality

• Lazy renaming

• Other members of the family: SMPSs, GPUSs, hierarchical (SMPSs + CellSs)

• Convergence with OpenMP 3.0

Page 55:

Conclusions

• The road to new multi-core and many-core chips is open

• New programming models that can deal with the complexity of the

hardware are now more needed than ever

• StarSs

• Simple

• Portable

• Enough performance

• Enabled for different architectures: CellSs, SMPSs, GPUSs

Page 56:

CellSs and SMPSs websites

• CellSs

• www.bsc.es/cellsuperscalar

• SMPSs

• www.bsc.es/smpsuperscalar

• Both available for download (open source, GPL and LGPL)