Extensions of the HPF language for implementing parallel programs manipulating irregular data structures

Frédéric Brégier
Thesis presented at the Université de Bordeaux I
21 December 1999

Frédéric Brégier - LaBRI 1
Frame of Work
•Parallel programming by compilation
•HPF: the standard for data-parallel (regular) programs
•Irregular programs still need investment: efficiencies remain poor
•Optimizations at compile-time
•Optimizations at run-time (generated at compile-time)
Plan
•Optimizations at compile-time
•Irregular Data Structure (IDS)
•A Tree to represent an IDS
•Optimizations at run-time
•Inspection-Execution principles
•Irregular communications: irregular active processor sets
•Irregular iteration spaces
•Scheduling of loops with partial loop-carried dependencies
•New data-parallel irregular operation: progressive irregular prefix operation
•Conclusion and Perspectives
HPF (High Performance Fortran): data-parallel language
May 1993: HPF 1.0; January 1997: HPF 2.0

• Fortran 95 source code + structured comments (!HPF$) (distributions + parallel properties)
• Target code: SPMD parallel code
• « Owner computes » rule
• Run-time guards and communication generation

A(I) = B(J) + X

IF (B(J) is local) THEN
  Send(B(J) to Owner(A(I)))
END IF
IF (A(I) is local) THEN
  Receive(in TMP from Owner(B(J)))
  A(I) = TMP + X
END IF

(Figure: A and B distributed over the processors, X and Y replicated on each.)
Optimizations at compile-time: loop iteration space

!HPF$ INDEPENDENT
DO I = 1, N
  A(I) = A(I) + 1
END DO

•Affine expression: local loop bounds

! Cyclic distribution case
DO I = PID+1, N, NOP
  A(I) = A(I) + 1
END DO

! Block distribution case (N divisible by NOP)
LB = BLOC * PID + 1
UB = min(N, LB + BLOC - 1)
DO I = LB, UB
  A(I) = A(I) + 1
END DO

•Indirect distribution: not optimizable

! Indirect distribution
DO I = 1, N
  IF (A(I) is local) THEN
    A(I) = A(I) + 1
  END IF
END DO

•Irregular = « what is not regular », not optimizable
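The localized loop bounds above can be modeled directly. This is an illustrative Python sketch (the names N, NOP, PID follow the slide; the functions themselves are assumptions, not actual compiler output):

```python
# Sketch (not from the thesis): the local iteration sets an HPF compiler
# derives for "DO I = 1, N" under the two regular distributions.

def cyclic_local_iters(n, nop, pid):
    # CYCLIC: processor PID owns indices PID+1, PID+1+NOP, ... (1-based)
    return list(range(pid + 1, n + 1, nop))

def block_local_iters(n, nop, pid):
    # BLOCK: contiguous chunk of size ceil(N/NOP) per processor
    bloc = (n + nop - 1) // nop
    lb = bloc * pid + 1
    ub = min(n, lb + bloc - 1)
    return list(range(lb, ub + 1))

# With N=8 on NOP=4 processors, every index is executed exactly once:
n, nop = 8, 4
assert sorted(i for p in range(nop) for i in cyclic_local_iters(n, nop, p)) == list(range(1, 9))
assert sorted(i for p in range(nop) for i in block_local_iters(n, nop, p)) == list(range(1, 9))
```

No such closed-form bounds exist for an INDIRECT distribution, which is exactly why the guard `IF (A(I) is local)` remains in the generated code.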
Irregular Data Structure (IDS)
•Standard irregular format: indirect access arrays, example: CSC
  (columns numbered I to VIII, rows numbered 1 to 8)

JA(1:9)  = 1 3 5 6 9 12 16 18 21                        (column pointers)
IA(1:20) = 1 5 2 5 3 4 6 8 1 2 5 4 6 7 8 6 7 4 6 8     (row indices)
DA(1:20) = non-zero values of A

A(1,1) ↔ DA(JA(1))          (IA(JA(1)) = 1)
A(6,4) ↔ DA(JA(4)+1)        (IA(JA(4)+1) = 6)
A(:,4) ↔ DA(JA(4):JA(5)-1)
•Irregular distribution formats:
!HPF$ DISTRIBUTE JA(BLOCK)
!HPF$ DISTRIBUTE IA(GEN_BLOCK(/5, 10, 5/))
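As a sanity check of the CSC indexing above, here is an illustrative Python sketch using the slide's JA/IA values (DA is filled with the positions 1..20 as stand-in values; the helper `col` is my own, not part of the thesis):

```python
# Illustrative CSC lookup, keeping the slide's 1-based pointer convention.
JA = [1, 3, 5, 6, 9, 12, 16, 18, 21]                       # column pointers JA(1:9)
IA = [1,5, 2,5, 3, 4,6,8, 1,2,5, 4,6,7,8, 6,7, 4,6,8]      # row indices IA(1:20)
DA = list(range(1, 21))                                    # stand-in nonzero values

def col(i):
    """Nonzeros of column i: DA(JA(i) : JA(i+1)-1), returned with their rows."""
    lo, hi = JA[i - 1], JA[i] - 1          # 1-based bounds, as on the slide
    return [(IA[k - 1], DA[k - 1]) for k in range(lo, hi + 1)]

# A(6,4) is the 2nd nonzero of column IV: DA(JA(4)+1), and IA(JA(4)+1) = 6.
rows_of_col_4 = [r for r, _ in col(4)]
assert rows_of_col_4 == [4, 6, 8]
```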
Problems at compile-time
•Distribution: unknown alignment between arrays of the IDS
•Data accesses: unknown indexes (indirection)

DA(JA(4)+1)   with JA(4) = ? (unknown before run-time)

•Implies additional run-time guards and communications
•Inefficient SPMD code
Related Works
•Regular-to-irregular compilation
  •Bik and Wijshoff: « Sparse Compiler »
    •Sparse matrix with known topology
    •Regular analysis + known topology
    •IDS chosen by the compiler
  •Pingali et al.
    •Relational description (between components and access functions)
    •Non-standard and difficult notations
•Compilation of irregular programs
  •Vienna Fortran Compilation System: SPARSE directive
    •Storage format specification
    •Limited to storage formats known by the compiler
The Tree: a generic data structure with hierarchical access

•From the data to a tree: columns I to VIII, rows 1 to 8

•Representation in HPF2: derived data types of Fortran 95

type level2
  integer ROW   !row number
  real VAL      !non zero value
end type level2

type level1
  type (level2), pointer :: COL(:)   !column
end type level1

type (level1), allocatable :: A(:)   !matrix with a hierarchical access by column
!HPF$ TREE
Tree               Matrix   CSC
A(i)%COL(j)%VAL    A(j,i)   DA(JA(i)+j-1)
A(i)%COL(:)%VAL    A(:,i)   DA(JA(i):JA(i+1)-1)
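A rough Python analogue of this Tree/Matrix/CSC mapping (an assumption for illustration, not the thesis code): A(i)%COL(j)%VAL becomes A[i-1][j-1] here, with each column holding its (ROW, VAL) pairs.

```python
# Build the hierarchical "tree" view from the CSC arrays of the example.
JA = [1, 3, 5, 6, 9, 12, 16, 18, 21]
IA = [1,5, 2,5, 3, 4,6,8, 1,2,5, 4,6,7,8, 6,7, 4,6,8]
DA = list(range(1, 21))                      # stand-in nonzero values

# One entry per column i, holding (ROW, VAL) pairs: mirrors level1/level2.
A = [[(IA[k - 1], DA[k - 1]) for k in range(JA[i - 1], JA[i])]
     for i in range(1, len(JA))]

# A(i)%COL(j)%VAL  <->  DA(JA(i)+j-1): column IV, 2nd nonzero is row 6, DA(7).
assert A[3][1] == (6, 7)
```

The hierarchical form removes the explicit JA indirection from every access: the compiler sees one nesting level instead of an unknown index expression.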
Distribution of a TREE
!HPF$ DISTRIBUTE A(BLOCK)
!HPF$ DISTRIBUTE A(INDIRECT(/1,2,3,2,1,2,3,1/))
Example of improvement

!HPF$ DISTRIBUTE A(BLOCK)
!HPF$ INDEPENDENT
      FORALL (I = 3:N-2)
        A(I)%COL(:)%VAL = A(I-2)%COL(:)%VAL + A(I+2)%COL(:)%VAL
      END FORALL

With the CSC arrays:
!HPF$ DISTRIBUTE DA(GEN_BLOCK(array))
!HPF$ INDEPENDENT
      FORALL (I = 3:N-2)
        DA(IA(I):IA(I+1)-1) = DA(IA(I-2):IA(I-1)-1) + DA(IA(I+2):IA(I+3)-1)
      END FORALL

Generated code, CSC version: global copy + broadcast of DA
TMP(:) = Global Copy with BCAST(DA(:))
DO I = 3, N-2
  local_bound(DA(IA(I):IA(I+1)-1), lb, ub)
  DO J = lb, ub
    DA(J) = TMP(J1) + TMP(J2)
  END DO
END DO
IA(I-2) = ?? : IA(I-1)-1 = ??

Generated code, TREE version: communications on frontiers only, as SHADOW in HPF2
local_bound(A(:), lb, ub)
TMP(lb:ub) = Local Copy of Local Part(A(lb:ub))
Shadow_Update(TMP(:), -2, +2)
local_bound(A(3:N-2), lb, ub)
DO I = lb, ub
  A(I)%COL(:)%VAL = TMP(I-2)%COL(:)%VAL + TMP(I+2)%COL(:)%VAL
END DO
Runtime support layers:
•Arrays: DALIB over MPI
•Trees/Derived types: DALIB + TriDenT over MPI
(Figures, Matrix-Vector Product: « Serial Product », times in seconds for F90 Derived Type, F90 ADAPTOR/Matrix (F77) and F90 ADAPTOR/TriDenT; « Parallel Product (dense notations) », relative efficiencies in % on 1 to 16 processors for HPF2/Matrix and HPF2/TREE. IBM SP2-LaBRI, 4096x4096.)
•Advantages:
  •Fewer indirections
  •Fewer unknown alignments
  •Better compile-time analysis (locality and dependence)
  •Generic (defined by the user)
  •Low overhead
•Disadvantages:
  •Not necessarily implemented in HPF compilers: portability
  •Need to rewrite irregular code (with derived types)
Inspection-Execution
Inspection: scan the program to analyze, in order to collect the useful information.
Execution: execute the true computations according to the optimized scheme induced by the inspected information.
Original code:
DO I = 1, N
  A(I) = B(INDEX(I))
END DO
Modify B

Inspection:
DO I = 1, N
  if (A(I) is local) then
    Add INDEX(I) to local_index
  end if
END DO
Exchange info on local_index (what indexes to send, to receive)

Execution:
Gather (B(local_index(:)) into Copy_B)
I_local = 1
DO I = 1, N
  if (A(I) is local) then
    A(I) = Copy_B(I_local)
    I_local = I_local + 1
  end if
END DO
Modify B

Often iterative schemes: an outer DO STEP = 1, S loop surrounds both phases, so a single inspection is amortized over the S execution steps.
Related works:
•PARTI: iterative scheme
•CHAOS: iterative and adaptive scheme (by steps)
  Integrated in Fortran D and the Vienna Fortran Compilation System
•PILAR: iterative and multi-phase scheme, basic element = section (PARADIGM compiler)
•ADAPTOR: TRACE directive, dynamic adaptive scheme
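A minimal inspector-executor sketch in Python, from one processor's point of view (the helpers `inspect`/`execute` and the ownership predicate are illustrative assumptions, not the PARTI/CHAOS or DALIB API):

```python
# For A(I) = B(INDEX(I)): the inspector records which B indices this
# processor needs; the executor reuses that gather schedule every step
# as long as INDEX is unchanged.

def inspect(index, is_local):
    """Inspection: collect the B indices needed by locally owned A(I)."""
    return [index[i] for i in range(len(index)) if is_local(i)]

def execute(a, b, index, is_local, schedule):
    """Execution: consume the pre-gathered copy of B in loop order."""
    copy_b = [b[j] for j in schedule]     # stands in for Gather(...)
    k = 0
    for i in range(len(index)):
        if is_local(i):
            a[i] = copy_b[k]
            k += 1

# Processor owning the even I, with INDEX = [3, 0, 2, 1]:
index, owner = [3, 0, 2, 1], (lambda i: i % 2 == 0)
sched = inspect(index, owner)             # done once
a, b = [0] * 4, [10, 20, 30, 40]
execute(a, b, index, owner, sched)        # repeated each step
assert a == [40, 0, 30, 0]
```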
HPF2: communication optimizations with active processor sets

•ON HOME directive: to control the computation mapping

!HPF$ ALIGN (I) WITH A(I) :: B, C
!HPF$ INDEPENDENT
      DO I = 1, N
!HPF$ ON HOME (A(I))
        C(INDEX(I)) = A(I) * B(I)
      END DO

Without ON HOME (owner computes on C(INDEX(I))):
DO I = 1, N
  if (A(I) is local) then
    call Send(A(I) to Owner( C(INDEX(I)) ))
    call Send(B(I) to Owner( C(INDEX(I)) ))
  end if
  if (C(INDEX(I)) is local) then
    call Receive(TMP1 from Owner( A(I) ))
    call Receive(TMP2 from Owner( A(I) ))
    C(INDEX(I)) = TMP1 * TMP2
  end if
END DO

With ON HOME (A(I)) (compute locally, send only the result):
DO I = 1, N
  if (A(I) is local) then
    TMP = A(I) * B(I)
    call Send(TMP to Owner( C(INDEX(I)) ))
  end if
  if (C(INDEX(I)) is local) then
    call Receive(TMP from Owner( A(I) ))
    C(INDEX(I)) = TMP
  end if
END DO
Irregular Active Processor Sets
•Fewer active processors in collective communications
•Fewer communications (reduction or broadcast)
•Fewer synchronizations

ON HOME A(1,I) + ON HOME A(1,V)
ON HOME A(2,II) + ON HOME A(2,V)
ON HOME A(3,III)

Extensions to the ON HOME directive:
!HPF$ ON HOME (A(K,:))
!HPF$ ON HOME (A(K,INDEX(K)))

FORALL (J=I:VIII, J .eq. K .or. A(K,J) .ne. 0.0)
!HPF$ ON HOME (A(K,J), J=I:VIII, J .eq. K .or. A(K,J) .ne. 0.0)

!HPF$ ALIGN A(*,K) with B(K)
      B(K) = Sum(A(K,:))
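How an inspector could derive such an irregular active processor set is sketched below (Python; `active_set`, the ownership function and the dependency table are assumptions for illustration, not HPF syntax):

```python
# The active set of "ON HOME (A(K,J), J=..., J == K .or. TEST(K,J))" is
# exactly the set of owners of the referenced elements.

def active_set(owner, k, n, test):
    """Owners needed for column K: J == K or TEST(K, J) holds."""
    return sorted({owner(j) for j in range(1, n + 1) if j == k or test(k, j)})

# Cyclic ownership over 4 processors; dependencies of the running example,
# where iteration 9 reads columns 1, 4, 5 and 8:
deps = {9: {1, 4, 5, 8}}
owner = lambda j: (j - 1) % 4 + 1
assert active_set(owner, 9, 11, lambda k, j: j in deps.get(k, set())) == [1, 4]
```

Only P1 and P4 take part in the reduction for column 9; the other processors neither communicate nor synchronize.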
Cholesky Example: TREE and Set (matrix with 65024 columns)

DO K = 1, N
!HPF$ ON HOME (A(K,J), J = 1:K, J.eq.K .or. A(K,J) .ne. 0.0), NEW(TMP), BEGIN
  allocate (TMP(N))
  TMP(:) = 0.0
!HPF$ INDEPENDENT, REDUCTION (TMP(:))
  DO J = 1, K-1
    IF (A(K,J) .ne. 0.0) THEN
      CMOD (TMP, A(:,J))
    END IF
  END DO
  A(:,K) = A(:,K) + TMP(:)
  CDIV (A(:,K))
!HPF$ END ON
END DO
(Figure: times in seconds on 1 to 16 processors for V0 and Vset. IBM SP2-LaBRI, 2D-Grid 255x255.)
Frédéric Brégier - LaBRI 22
Plan
•Optimizations at compile-time
•Irregular Data Structure (IDS)
•A Tree to represent an IDS
•Optimizations at run-time
•Inspection-Execution principles
•Irregular communications: irregular active processor sets
•Irregular iteration spaces
•Scheduling of loops with irregular loop-carried dependencies
•New data-parallel irregular operation: progressive irregular
prefix operation
•Conclusion and Perspectives
Frédéric Brégier - LaBRI 23
Irregular Iteration Space

!HPF$ DISTRIBUTE A(:,BLOCK)
!HPF$ INDEPENDENT, REDUCTION(B)
      DO J = 1, K-1
        IF (A(K,J) .ne. 0.0) THEN
          …
        END IF
      END DO
(Figure « Cholesky »: times in seconds on 1 to 16 processors for Vset and Vset+Loop. IBM SP2-LaBRI, 2D-Grid 255x255.)
Frédéric Brégier - LaBRI 24
Plan
•Optimizations at compile-time
•Irregular Data Structure (IDS)
•A Tree to represent an IDS
•Optimizations at run-time
•Inspection-Execution principles
•Irregular communications: irregular active processor sets
•Irregular iteration spaces
•Scheduling of loops with partial loop-carried dependencies
•New data-parallel irregular operation: progressive irregular
prefix operation
•Conclusion and Perspectives
Frédéric Brégier - LaBRI 25
Loop with Partial Loop-Carried Dependencies
•Loop-carried dependencies:
DO I = 1, N
  DO J = 1, I-1
    A(I) = A(I) + A(J)
  END DO
END DO

•Partial loop-carried dependencies:
DO I = 1, N
  DO J = 1, I-1
    IF (TEST(I,J)) THEN
      A(I) = A(I) + A(J)
    END IF
  END DO
END DO

•Precomputable partial loop-carried dependencies (PPLD loop): TEST is never modified
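Since TEST is never modified, the dependence graph of a PPLD loop can be built once by an inspector. A Python sketch (the dependency pattern is the 11-iteration example used on the following slides):

```python
# Iteration I depends on every earlier iteration J with TEST(I, J); with
# TEST precomputable, an inspector can build this DAG once and reuse it.

def ppld_dag(n, test):
    """preds[i] = set of iterations that must complete before i."""
    return {i: {j for j in range(1, i) if test(i, j)} for i in range(1, n + 1)}

# Dependency pattern of the 11-iteration example:
DEPS = {2: {1}, 3: {1}, 5: {1, 4}, 6: {2, 3}, 7: {3}, 8: {4},
        9: {1, 4, 5, 8}, 10: {4, 5, 7, 9}, 11: {6, 7}}
dag = ppld_dag(11, lambda i, j: j in DEPS.get(i, set()))
assert dag[9] == {1, 4, 5, 8} and dag[1] == set()
```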
PPLD Loop
DO I = 1, N
!HPF$ ON HOME (A(J), J=I .or. TEST(I,J))
  B = 0.0
!HPF$ INDEPENDENT, REDUCTION(B)
  DO J = 1, I-1
    IF (TEST(I,J)) THEN
      B = B + A(J)
    END IF
  END DO
  A(I) = A(I) + B
!HPF$ END ON
END DO

Dependencies and ownership:
 I | Owner(A(I)) | J with TEST(I,J) = TRUE
 1 | 1 | -
 2 | 2 | 1
 3 | 3 | 1
 4 | 4 | -
 5 | 1 | 1 4
 6 | 2 | 2 3
 7 | 3 | 3
 8 | 4 | 4
 9 | 1 | 1 4 5 8
10 | 2 | 4 5 7 9
11 | 3 | 6 7

Active processor sets:
Set(1)=P1; Set(2)=P1 P2; Set(3)=P1 P3; Set(4)=P4; Set(5)=P1 P4; Set(6)=P2 P3;
Set(7)=P3; Set(8)=P4; Set(9)=P1 P4; Set(10)=P1 P2 P3 P4; Set(11)=P2 P3

Without scheduling (one iteration per step, all processors synchronized):
Steps | P1 P2 P3 P4
  1   |  1  1  1  1
  2   |  2  2  2  2
  3   |  3  3  3  3
  4   |  4  4  4  4
  5   |  5  5  5  5
  6   |  6  6  6  6
  7   |  7  7  7  7
  8   |  8  8  8  8
  9   |  9  9  9  9
 10   | 10 10 10 10
 11   | 11 11 11 11

With scheduling by steps:
Steps | P1 P2 P3 P4
  1   |  1  -  -  4
  2   |  2  2  -  -
  3   |  3  -  3  -
  4   |  5  6  6  5
  5   |  -  -  7  8
  6   |  9  -  -  9
  7   | 10 10 10 10
  8   |  - 11 11  -
PPLD Loop Scheduling
•Associates one iteration with one task
•Precomputable partial loop-carried dependencies = task graph
•Scheduling problem in the HPF context:
  •Known mapping (HPF data distribution => task mapping)
  •Data distribution => possible multi-processor tasks
  •« Scheduling multi-processor tasks on dedicated processors »

Related work:
•Complexity: Drozdowski 97, Krämer 95: NP-hard problem
•Wennink 95: scheduling algorithm
•PYRROS / RAPID libraries: precomputable task graph with mono-processor tasks (inspection-execution)
Scheduling Tasks Associated to a PPLD Loop
1) DAG generation: new SCHEDULE directive
2) Scheduling: simple and Wennink's scheduling
3) Execution: static / dynamic execution, single-thread / multi-thread execution
4) Experimental Results
SCHEDULE directive

Dependencies between iterations (inspection-execution):

DO I = 1, N
!HPF$ SCHEDULE (J = 1:I-1, TEST(I,J))
!HPF$ ON HOME (A(J), J=I .or. TEST(I,J))
  B = 0.0
!HPF$ INDEPENDENT, REDUCTION(B)
  DO J = 1, I-1
    IF (TEST(I,J)) THEN
      B = B + A(J)
    END IF
  END DO
  A(I) = A(I) + B
!HPF$ END ON
END DO

 I | J with TEST(I,J) = TRUE
 1 | -
 2 | 1
 3 | 1
 4 | -
 5 | 1 4
 6 | 2 3
 7 | 3
 8 | 4
 9 | 1 4 5 8
10 | 4 5 7 9
11 | 6 7

(Figure: the resulting DAG, levels bottom-up {1,4}, {2,3,5,8}, {6,7,9}, {10,11}.)
Distributed Scheduling Algorithms
•Simple Scheduling: local tasks only
(Figure: the DAG annotated, for each task, with the processors involved.)

Simple scheduling by steps:
Steps | P1 P2 P3 P4
  1   |  1  -  -  4
  2   |  2  2  -  8
  3   |  3  -  3  -
  4   |  5  6  6  5
  5   |  9  -  7  9
  6   | 10 10 10 10
  7   |  - 11 11  -

Order in task scheduling: priority criteria based on the critical path
List for task execution on P1: 1 2 3 5 9 10

Problem of scheduling coherence between processors: prevent deadlock
=> by-step scheduling algorithm
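The by-step list scheduling described above can be sketched as follows (Python; unit task costs, priority simplified to the task rank, and the data layout is illustrative, not the TriDenT implementation):

```python
# At each step, ready tasks are considered in priority order and scheduled
# only if their whole (multi-processor) set is still free: processors make
# the same decisions independently, which prevents deadlock.

def schedule_by_steps(preds, proc_set, priority):
    done, steps = set(), []
    while len(done) < len(preds):
        busy, step = set(), {}
        ready = [t for t in preds if t not in done and preds[t] <= done]
        for t in sorted(ready, key=priority):
            if not (proc_set[t] & busy):      # whole processor set free?
                step[t] = proc_set[t]
                busy |= proc_set[t]
        done |= set(step)
        steps.append(step)
    return steps

PREDS = {1: set(), 4: set(), 2: {1}, 3: {1}, 5: {1, 4}, 6: {2, 3}, 7: {3},
         8: {4}, 9: {1, 4, 5, 8}, 10: {4, 5, 7, 9}, 11: {6, 7}}
SETS = {1: {1}, 2: {1, 2}, 3: {1, 3}, 4: {4}, 5: {1, 4}, 6: {2, 3},
        7: {3}, 8: {4}, 9: {1, 4}, 10: {1, 2, 3, 4}, 11: {2, 3}}
steps = schedule_by_steps(PREDS, SETS, priority=lambda t: t)
assert len(steps) == 7 and set(steps[0]) == {1, 4}
```

On the example this reproduces the 7-step simple schedule of the slide, against 11 steps for the fully synchronized version.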
Scheduling
•Wennink's scheduling: multi-processor tasks + insertion principle

Task lists on P1: Simple: 1 2 3 5 9 10; Wennink: 1 3 2 5 9 10

Steps | P1 P2 P3 P4
  1   |  1  -  -  4
  2   |  3  -  3  8
  3   |  2  2  7  -
  4   |  5  6  6  5
  5   |  9 11 11  9
  6   | 10 10 10 10

Complexity:     Simple      Wennink
Computations    O(N log N)  O(N²)
Memory          O(|E|)      O(N² + |E|)
Static execution / Dynamic execution
•HPF context: task costs not known at compile-time => unit costs
•Static critical path = longest path (in edges) to the virtual « End » vertex
(Figure: the DAG annotated with unit-cost priorities.)

Static scheduling: static order of execution
P1: 1 2 3 5 9 10
P2: 2 6 10 11
P3: 3 6 7 10 11
P4: 4 8 5 9 10
•Iterative program: the first iteration records the task times, then re-scheduling => dynamic scheduling
(Figure: the same DAG weighted with the times t1 to t11 recorded during the first iteration; re-scheduling yields new per-processor orders.)
P1: 1 3 2 5 9 10
P2: 2 6 11 10
P3: 3 7 6 11 10
P4: 4 8 5 9 10
Single-Thread / Multi-Thread execution

•2 independent tasks on the same processor, same priority: which task first?
•Single thread: the lower rank first
•Multi-thread: both
•User-mode thread system: Marcel from PM² HighPerf

(Figure: Gantt charts of tasks K and K' (computations, waiting for communication, communications) showing the overlapping of communications by computations with threads.)
Experimental Results: matrix with 261121 columns

•Cholesky on sparse matrix with column-block access
•Irregular data structure: TREE
•Distribution: INDIRECT (minimizing communications)

•VSet: V0 + Set
•Stat: VSet + SCHEDULE (static simple scheduling)
•Dyn: VSet + SCHEDULE (dynamic simple scheduling)
•Stat_th: Stat + threads
•W: VSet + SCHEDULE (dynamic Wennink's scheduling)
(Figures: « Relative efficiencies (global time) » and « Relative Efficiencies (Re-execution only) », in % versus Vset, on 1 to 16 processors, for Vset, Stat, Dyn, Stat_th and W. IBM SP2-LaBRI, 2D-Grid 511x511.)
Irregular Progressive PREFIX Operation
•Irregular Progressive PREFIX Operation: found in PPLD Loop
X_i = f(X_i, g(X_k, k ∈ B_i))   with B_i ⊆ [1, i[

•Irregular coefficient: average of |B_i| / i over i ∈ [1, n], in %

•Exploit independencies with specific communication schemes
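Under my reading of the garbled slide formula, the irregular coefficient averages |B_i|/i over the iterations, expressed in percent; an illustrative Python sketch on the 11-iteration example (the exact denominator is an assumption):

```python
# Fraction of the candidate predecessors [1, i[ that really carry a
# dependency, averaged over all iterations and expressed in percent.

def irregular_coefficient(B, n):
    return 100.0 * sum(len(B.get(i, ())) / i for i in range(1, n + 1)) / n

DEPS = {2: {1}, 3: {1}, 5: {1, 4}, 6: {2, 3}, 7: {3}, 8: {4},
        9: {1, 4, 5, 8}, 10: {4, 5, 7, 9}, 11: {6, 7}}
coef = irregular_coefficient(DEPS, 11)
assert 0.0 < coef < 100.0     # the example is far denser than the 0.1%
                              # coefficient of the Cholesky test matrix
```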
Irregular Progressive PREFIX Operation

(Figure: contribution DAG with tasks 1 to 6, executed with asynchronous communications in the progressive PREFIX scheme versus a synchronous REDUCTION.)
Irregular Progressive PREFIX Operation
PREFIX directive/clause: differs from the REDUCTION clause

REDUCTION form (pull the contributions):
      DO I = 1, N
        B = 0.0
!HPF$ INDEPENDENT, REDUCTION(B)
        DO J = 1, I-1
          IF (TEST(I,J)) THEN
            B = B + A(J)
          END IF
        END DO
        A(I) = A(I) + B
      END DO

PREFIX form (push the contributions):
!HPF$ PREFIX(B)
      DO I = 1, N
!HPF$ INDEPENDENT, PREFIX(B)
        DO J = I+1, N
          IF (TEST(J,I)) THEN
            A(J) = A(J) + A(I)
          END IF
        END DO
      END DO

Generated code, PREFIX (send when ready):
Inspection(A,TEST)
DO I = lb, ub (ON HOME A(I))
  Finalize(A(I))   (receive contributions previously sent)
  DO J = I+1, N
    IF (TEST(J,I)) THEN
      A'(J) = A'(J) + A(I)   (send when ready)
    END IF
  END DO
END DO

Generated code, REDUCTION:
DO I = 1, N (Set(I))
  B = 0.0
  DO J = lb, ub (ON HOME A(J))
    IF (TEST(I,J)) THEN
      B = B + A(J)
    END IF
  END DO
  A(I) = A(I) + REDUCTION(B)
END DO
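A sequential Python model of the two formulations (illustrative only: the real PREFIX version sends the contributions asynchronously, whereas here the push is simply done in program order):

```python
# Once A(I) is final, its contribution is pushed to every later A(J) with
# TEST(J, I): this is the "send when ready" shape of the PREFIX scheme.
# The result must equal the REDUCTION form that pulls all A(J), J in B_i.

def prefix_progressive(a, test):
    a = list(a)
    for i in range(len(a)):               # A(I) is final here ...
        for j in range(i + 1, len(a)):
            if test(j, i):
                a[j] += a[i]              # ... push it forward immediately
    return a

def reduction_pull(a, test):
    a = list(a)
    for i in range(len(a)):               # earlier a[j] already finalized
        a[i] += sum(a[j] for j in range(i) if test(i, j))
    return a

t = lambda later, earlier: (later - earlier) % 3 == 0   # arbitrary TEST
x = [1, 2, 3, 4, 5]
assert prefix_progressive(x, t) == reduction_pull(x, t)
```

The two give identical results; the benefit of PREFIX is that contributions leave as soon as they are ready instead of waiting for a synchronous reduction.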
(Figure « Comparisons: PREFIX vs REDUCTION »: ratio between 0 and 2 as a function of the irregular coefficient (TEST), with the equality line at 1. IBM SP2-LaBRI.)
Irregular Progressive PREFIX Operation: Cholesky example (irregular coef. = 0.1%)

(Figures: « Global Time » and « Re-Execution Time », times versus V1 (V1/T, in %) on 1 to 16 processors, for Vset, VsetP, Stat, StatP and PaSTiX. IBM SP2-LaBRI, 2D-Grid 511x511.)
Conclusion
•TREE: irregular data structure, more information at compile-time
  Locality and dependence analysis => TriDenT
•Inspection/Execution: for information still unknown at compile-time => CoLUMBO
•Irregular active processor sets: fundamental inspection/execution
  Up to a factor of 10
•Irregular iteration space: minor improvement
•Loop with partial loop-carried dependencies:
  •DAG associated with loop iterations
  •Semi-automatic task scheduling at run-time
  •PREFIX operation
  •Inspection costs repaid within a single iteration
•Experimental results: efficiency close to hand-made codes (time ratio between 1.25 and 2.5)
Perspectives
•Integration in an HPF compiler: preliminary experiments
  •TREE: ADAPTOR
  •Set inspection/execution, PREFIX inspection/execution: NESTOR (Silber 98)
•Transposition to other parallel languages:
  •Irregular data structures: always a problem => TREE
  •Irregular iteration space
  •OpenMP: virtual shared memory => data distribution, irregular active processor sets