Collective Operations
Dr. Stephen Tse (stse@forbin.qc, 908-872-2108)
Lesson 12
Collective Communication
• A collective communication is:
  – A communication pattern that involves all the processes in a communicator
  – It involves more than two processes
• Different collective communication operations:
  – Broadcast
  – Gather and Scatter
  – Allgather
  – Alltoall
Consider the following Arrangement
              ----> Data
    0 | A0 A1 A2 A3 A4 . . .
    1 |
    2 |
    : |
    n |
    v
  Processes
Broadcast
• A broadcast is a collective communication in which a single process sends the same data to every process in the communicator.

              ----> Data
    | A0                    A0
    |        =======>       A0
    |         bcast         A0
    |                       A0
    v
  Processes
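The broadcast semantics can be sketched in plain Python (a simulation of what the collective leaves in each process's buffer, not real MPI calls; the function name is illustrative only):

```python
def bcast(root_data, nprocs):
    """Simulate a broadcast: after the call, every rank holds a copy
    of the root's data."""
    return [list(root_data) for _ in range(nprocs)]

buffers = bcast(["A0"], 5)
print(buffers)   # [['A0'], ['A0'], ['A0'], ['A0'], ['A0']]
```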
Matrix-Vector Product
• If A = (aij) is an m x n matrix and x = (x0, x1, ..., xn-1)^T is an n-dimensional vector, then the matrix-vector product is y = Ax.

(Figure: A x = y, with the rows of A distributed block-wise across processes 0-3.)
A Gather
• A collective communication in which a root process receives data from every other process.
• In order to form the dot product of each row of A with x, we need to gather all of x onto each process.

(Figure: x0, x1, x2, x3 block-distributed across processes 0-3, alongside the rows of A.)
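The gather-then-multiply idea can be sketched in plain Python (each "process" is just an index; the data and names are illustrative, not real MPI):

```python
def gather_blocks(local_blocks):
    """Collect the distributed blocks of x in rank order,
    as a gather would on the root."""
    return [v for blk in local_blocks for v in blk]

A = [[1, 2, 3, 4],          # row owned by "process" 0
     [5, 6, 7, 8],          # process 1
     [9, 10, 11, 12],       # process 2
     [13, 14, 15, 16]]      # process 3
local_x = [[1], [0], [2], [1]]      # block-distributed x = (1, 0, 2, 1)^T

x = gather_blocks(local_x)          # now the full x is available
y = [sum(a * b for a, b in zip(row, x)) for row in A]
print(y)   # [11, 27, 43, 59]
```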
A Scatter
• A collective communication in which a fixed root process sends a distinct collection of data to every other process.
• Scatter each row of A across the processes.

(Figure: process 0 holds row a00 a01 a02 a03; the remaining rows of A are scattered to processes 1-3.)
Gather and Scatter
              ----> Data
    | A0 A1 A2 A3 A4     =====>      A0
    |                    Scatter     A1
    |                                A2
    |                    <=====      A3
    |                    Gather      A4
    v
  Processes
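The scatter/gather pair can be sketched as a plain-Python simulation of the semantics (not real MPI calls):

```python
def scatter(sendbuf, nprocs):
    """The root cuts its buffer into nprocs equal parts, one per rank."""
    chunk = len(sendbuf) // nprocs
    return [sendbuf[i * chunk:(i + 1) * chunk] for i in range(nprocs)]

def gather(recvbufs):
    """The root concatenates the per-rank parts back in rank order."""
    return [item for buf in recvbufs for item in buf]

data = ["A0", "A1", "A2", "A3", "A4"]   # root's buffer, one item per process
parts = scatter(data, 5)                 # process i receives ["Ai"]
print(parts[2])                          # ['A2']
assert gather(parts) == data             # gather inverts scatter
```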
Allgather

              ----> Data
    | A0          A0 B0 C0 D0 E0
    | B0          A0 B0 C0 D0 E0
    | C0          A0 B0 C0 D0 E0
    | D0          A0 B0 C0 D0 E0
    | E0          A0 B0 C0 D0 E0
    v
  Processes

• Simultaneously gathers all of x onto each process.
• Gathers a distributed array to every process.
• It gathers the contents of each process's send_data into each process's recv_data.
• After the function returns, all the processes in the communicator have the result stored in the memory referenced by recv_data.
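The allgather effect can be simulated in plain Python (not real MPI): every rank ends up with every rank's send block, in rank order.

```python
def allgather(send_blocks):
    """Simulate an allgather: concatenate all blocks in rank order,
    then give every process its own copy."""
    combined = [item for blk in send_blocks for item in blk]
    return [list(combined) for _ in send_blocks]

send_data = [["A0"], ["B0"], ["C0"], ["D0"], ["E0"]]
recv_data = allgather(send_data)
print(recv_data[3])   # ['A0', 'B0', 'C0', 'D0', 'E0'] — same on every process
```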
Alltoall (transpose)

              ----> Data
    | A0 A1 A2 A3 A4        A0 B0 C0 D0 E0
    | B0 B1 B2 B3 B4        A1 B1 C1 D1 E1
    | C0 C1 C2 C3 C4        A2 B2 C2 D2 E2
    | D0 D1 D2 D3 D4        A3 B3 C3 D3 E3
    | E0 E1 E2 E3 E4        A4 B4 C4 D4 E4
    v
  Processes

• The heart of the redistribution of the keys is each process sending its original local keys to the appropriate process.
• This is a collective communication operation in which each process sends a distinct collection of data to every other process.
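The transpose effect of alltoall can be simulated in plain Python (not real MPI): block j sent by process i lands in slot i of process j.

```python
def alltoall(send_grid):
    """Simulate an alltoall: transpose the process-by-block grid."""
    n = len(send_grid)
    return [[send_grid[i][j] for i in range(n)] for j in range(n)]

send = [["A0", "A1", "A2"],
        ["B0", "B1", "B2"],
        ["C0", "C1", "C2"]]
recv = alltoall(send)
print(recv[1])   # ['A1', 'B1', 'C1']
```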
Tree-Structured Communication
• To improve the coding, we should focus on the distribution of the input data.
• How can we divide the work more evenly among processes?
• We can think of the processes as forming a tree, with process 0 as the root.
• During the 1st stage of the data distribution, 0 sends data to 1.
• During the 2nd stage, 0 sends data to 2 while 1 sends data to 3.
• During the 3rd stage, 0 sends to 4, while 1 sends to 5, 2 sends to 6, and 3 sends to 7.
• So we reduce the input distribution loop from 7 stages to 3 stages.
• In general, if we have p processes, this procedure allows us to distribute the input data in ceil(log2(p)) stages, where ceil(log2(p)) is the smallest whole number greater than or equal to log2(p), called the ceiling of that number.
(See the process configuration tree.)
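The doubling scheme above can be sketched in plain Python; the helper below simulates the stages and confirms the ceil(log2(p)) stage count (the sender-to-receiver pairing is one common doubling scheme, equivalent in stage count to the slide's ordering):

```python
import math

def tree_broadcast_stages(p):
    """Simulate the doubling distribution: at each stage, every rank that
    already holds the data sends to one rank that does not."""
    have = {0}
    stages = 0
    while len(have) < p:
        size = len(have)
        have |= {r + size for r in have if r + size < p}
        stages += 1
    return stages

print(tree_broadcast_stages(8))   # 3 stages for p = 8, vs 7 stages linearly
assert tree_broadcast_stages(7) == math.ceil(math.log2(7))
```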
Data Distribution Stages

(Figure: the distribution tree rooted at process 0. Stage 1: 0 -> 1; stage 2: 0 -> 2 and 1 -> 3; stage 3: 0 -> 4, 1 -> 5, 2 -> 6, 3 -> 7.)
1. This distribution reduces the original p-1 stages.
2. If p = 7, it reduces the time required for the program to complete the data distribution from 6 stages to 3, i.e. by a factor of 2 (50%).
3. There is no canonical choice of ordering.
4. We have to know the topology of the system in order to choose a better scheme.
Reduce the Burden of the Final Sum
• In the final summation phase, process 0 always gets a disproportionate amount of work, i.e. computing the global sum of the results from all the other processes.
• To accelerate the final phase, we can use the tree concept in reverse to reduce the load on process 0.
• Distribute the work as:
  – Stage 1:
    1. 4 sends to 0; 5 sends to 1; 6 sends to 2; 7 sends to 3.
    2. 0 adds the integral from 4; 1 adds the integral from 5; 2 adds the integral from 6; 3 adds the integral from 7.
  – Stage 2:
    1. 2 sends to 0; 3 sends to 1.
    2. 0 adds the integral from 2; 1 adds the integral from 3.
  – Stage 3:
    1. 1 sends to 0.
    2. 0 adds the integral from 1.
(See the reverse-tree process configuration.)
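The three stages above can be simulated in plain Python (not real MPI): at each stage the upper half of the active ranks sends its partial sum to the lower half.

```python
def tree_reduce(values):
    """Simulate the reverse-tree sum; values[r] is rank r's partial result."""
    vals = list(values)
    active = len(vals)
    while active > 1:
        half = (active + 1) // 2
        for r in range(half, active):
            vals[r - half] += vals[r]   # rank r sends to rank r - half
        active = half
    return vals[0]                       # rank 0 holds the global sum

print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8]))   # 36
```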
Reverse-Tree Process Configuration

(Figure: the reduction tree. Stage 1: 4 -> 0, 5 -> 1, 6 -> 2, 7 -> 3; stage 2: 2 -> 0, 3 -> 1; stage 3: 1 -> 0.)
Reduction Operations
• The "global sum" calculation belongs to a general class of collective communication operations called reduction operations.
• In a global reduction operation, all the processes in a communicator contribute data, and those data are combined using a binary operation.
• Typical operations are addition, max, min, logical and, etc.
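A reduction can be sketched in plain Python (not real MPI): each rank contributes a vector, and the vectors are combined element-wise with a binary operation.

```python
import operator
from functools import reduce as fold

def reduce_vectors(contribs, op):
    """Combine each rank's vector element-wise with a binary op."""
    return [fold(op, column) for column in zip(*contribs)]

contribs = [[1, 4, 2],    # rank 0's contribution
            [3, 1, 5],    # rank 1
            [2, 2, 2]]    # rank 2
print(reduce_vectors(contribs, operator.add))   # [6, 7, 9]
print(reduce_vectors(contribs, max))            # [3, 4, 5]
```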
Simple Reduce
              ----> Data
    | A0 A1 A2        A0+B0+C0  A1+B1+C1  A2+B2+C2
    | B0 B1 B2
    | C0 C1 C2
    v
  Processes
Allreduce
• In the simple reduce, only process 0 receives the global sum; the result buffers of the other processes are not meaningful.
• If we want to use the result for subsequent calculations, we would like each process to end up with the same correct result.
• The obvious approach is to follow the call to MPI_Reduce with a call to MPI_Bcast.
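The reduce-then-broadcast composition can be sketched in plain Python (a simulation of the semantics, not real MPI calls):

```python
import operator
from functools import reduce as fold

def allreduce(contribs, op):
    """Simulate MPI_Reduce followed by MPI_Bcast: every rank ends up
    with the same combined result."""
    result = [fold(op, col) for col in zip(*contribs)]   # reduce to the root
    return [list(result) for _ in contribs]              # broadcast to all

recv = allreduce([[1, 2], [3, 4], [5, 6]], operator.add)
print(recv[0])   # [9, 12], and every other process holds the same
```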
Every Process Has the Same Result

              ----> Data          Results
    | A0 A1 A2        A0+B0+C0  A1+B1+C1  A2+B2+C2
    | B0 B1 B2        A0+B0+C0  A1+B1+C1  A2+B2+C2
    | C0 C1 C2        A0+B0+C0  A1+B1+C1  A2+B2+C2
    v
  Processes
Implementation in MPI - MPI_Gather

1. MPI_Gather(sendbuffer, sendcount, sendtype,
              recvbuffer, recvcount, recvtype,
              root rank, comm)

Remarks:
1. All processes in "comm", including the root, send their "sendbuffer" to the root.
2. The root collects these "sendbuffer" contents and puts them in rank order in "recvbuffer".
3. "recvbuffer" is ignored in all processes except the root.
4. Its inverse operation is MPI_Scatter().
Implementation in MPI - MPI_Scatter

2. MPI_Scatter(sendbuffer, sendcount, sendtype,
               recvbuffer, recvcount, recvtype,
               root rank, comm)

Remarks:
1. The root sends pieces of "sendbuffer" to all processes, including itself.
2. The pieces are assigned in rank order and land in each process's "recvbuffer".
3. The root cuts its message into "n" equal parts and then sends them to the "n" processes.
Implementation in MPI - MPI_Gatherv

3. MPI_Gatherv(sendbuffer, sendcount, sendtype,
               recvbuffer, recvcounts,
               displacements,  /* integer array of displacements */
               recvtype,
               root rank, comm)

Remarks:
1. This is a more general and more flexible function.
2. It allows a varying count of data from each process.
3. The variation is described by "recvcounts" and "displacements", which are "n-"dimensional arrays (one entry per process).
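The role of the displacement array can be sketched in plain Python (a simulation of what the root does with the incoming blocks, not real MPI):

```python
def gatherv(send_blocks, recvcounts, displs):
    """Simulate a gatherv on the root: block i (of length recvcounts[i])
    is copied into the receive buffer starting at offset displs[i]."""
    total = max(d + c for d, c in zip(displs, recvcounts))
    recvbuf = [None] * total
    for blk, count, d in zip(send_blocks, recvcounts, displs):
        recvbuf[d:d + count] = blk[:count]
    return recvbuf

# three "processes" contributing 1, 3, and 2 items, packed back to back
blocks = [["a"], ["b", "c", "d"], ["e", "f"]]
counts = [1, 3, 2]
displs = [0, 1, 4]          # running prefix sums of the counts
print(gatherv(blocks, counts, displs))   # ['a', 'b', 'c', 'd', 'e', 'f']
```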
Implementation in MPI - MPI_Allgather

4. MPI_Allgather(sendbuffer, sendcount, sendtype,
                 recvbuffer, recvcount, recvtype,
                 comm)

Remarks:
1. This operation is similar to an all-to-all operation.
2. Instead of specifying a "root", every process sends its data to all other processes.
3. The "j-th" block of data from each process is received by every process and is placed in the "j-th" block of the buffer "recvbuf".
Implementation in MPI - MPI_Allgatherv

5. MPI_Allgatherv(sendbuffer, sendcount, sendtype,
                  recvbuffer, recvcounts,
                  displacements,
                  recvtype,
                  comm)

Remarks:
(1) This operation is similar to an all-to-all operation.
(2) Instead of specifying a "root", every process sends its data to all other processes.
(3) The "j-th" block of data from each process is received by every process and is placed in the "j-th" block of the buffer "recvbuf".
(4) But the blocks from different processes need not be uniform in size.
Implementation in MPI - MPI_Alltoall

6. MPI_Alltoall(sendbuffer, sendcount, sendtype,
                recvbuffer, recvcount, recvtype,
                comm)

Remarks:
(1) This is an all-to-all operation.
(2) The "j-th" block sent from process "i" is placed in the "i-th" location of process "j"'s "recv" buffer.
Implementation in MPI - MPI_Alltoallv

7. MPI_Alltoallv(sendbuffer, sendcounts, s-displacements, sendtype,
                 recvbuffer, recvcounts, r-displacements, recvtype,
                 comm)

Remarks:
(1) This is an all-to-all operation.
(2) The "j-th" block sent from process "i" is placed in the "i-th" location of process "j"'s "recv" buffer.