Collective Operations
Dr. Stephen Tse (stse@forbin.qc, 908-872-2108)
Lesson 12
Collective Communication
• A collective communication is:
  – A communication pattern that involves all the processes in a communicator
  – It involves more than two processes
• Different collective communication operations:
  – Broadcast
  – Gather and Scatter
  – Allgather
  – Alltoall
Consider the following Arrangement
              ----> Data
    0 | A0 A1 A2 A3 A4 . . .
    1 |
    2 |
    : |
    n |
    v
  Processes
Broadcast
• A broadcast is a collective communication in which a single process sends the same data to every process in the communicator.

              ----> Data
    | A0                    A0
    |        =======>       A0
    |         bcast         A0
    |                       A0
    v
  Processes
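The broadcast semantics can be sketched in plain Python (a simulation of what the collective leaves in each process's buffer, not real MPI calls; the function name is illustrative only):

```python
def bcast(root_data, nprocs):
    """Simulate a broadcast: after the call, every rank holds a copy
    of the root's data."""
    return [list(root_data) for _ in range(nprocs)]

buffers = bcast(["A0"], 5)
print(buffers)   # [['A0'], ['A0'], ['A0'], ['A0'], ['A0']]
```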
Matrix-Vector Product
• If A = (aij) is an m x n matrix and x = (x0, x1, ..., xn-1)^T is an n-dimensional vector, then the matrix-vector product is y = Ax.

(Figure: A x = y, with the rows of A distributed block-wise across processes 0-3.)
A Gather
• A collective communication in which a root process receives data from every other process.
• In order to form the dot product of each row of A with x, we need to gather all of x onto each process.

(Figure: x0, x1, x2, x3 block-distributed across processes 0-3, alongside the rows of A.)
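The gather-then-multiply idea can be sketched in plain Python (each "process" is just an index; the data and names are illustrative, not real MPI):

```python
def gather_blocks(local_blocks):
    """Collect the distributed blocks of x in rank order,
    as a gather would on the root."""
    return [v for blk in local_blocks for v in blk]

A = [[1, 2, 3, 4],          # row owned by "process" 0
     [5, 6, 7, 8],          # process 1
     [9, 10, 11, 12],       # process 2
     [13, 14, 15, 16]]      # process 3
local_x = [[1], [0], [2], [1]]      # block-distributed x = (1, 0, 2, 1)^T

x = gather_blocks(local_x)          # now the full x is available
y = [sum(a * b for a, b in zip(row, x)) for row in A]
print(y)   # [11, 27, 43, 59]
```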
A Scatter
• A collective communication in which a fixed root process sends a distinct collection of data to every other process.
• Scatter each row of A across the processes.

(Figure: process 0 holds row a00 a01 a02 a03; the remaining rows of A are scattered to processes 1-3.)
Gather and Scatter
              ----> Data
    | A0 A1 A2 A3 A4     =====>      A0
    |                    Scatter     A1
    |                                A2
    |                    <=====      A3
    |                    Gather      A4
    v
  Processes
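The scatter/gather pair can be sketched as a plain-Python simulation of the semantics (not real MPI calls):

```python
def scatter(sendbuf, nprocs):
    """The root cuts its buffer into nprocs equal parts, one per rank."""
    chunk = len(sendbuf) // nprocs
    return [sendbuf[i * chunk:(i + 1) * chunk] for i in range(nprocs)]

def gather(recvbufs):
    """The root concatenates the per-rank parts back in rank order."""
    return [item for buf in recvbufs for item in buf]

data = ["A0", "A1", "A2", "A3", "A4"]   # root's buffer, one item per process
parts = scatter(data, 5)                 # process i receives ["Ai"]
print(parts[2])                          # ['A2']
assert gather(parts) == data             # gather inverts scatter
```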
Allgather

              ----> Data
    | A0          A0 B0 C0 D0 E0
    | B0          A0 B0 C0 D0 E0
    | C0          A0 B0 C0 D0 E0
    | D0          A0 B0 C0 D0 E0
    | E0          A0 B0 C0 D0 E0
    v
  Processes

• Simultaneously gathers all of x onto each process.
• Gathers a distributed array to every process.
• It gathers the contents of each process's send_data into each process's recv_data.
• After the function returns, all the processes in the communicator have the result stored in the memory referenced by recv_data.
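The allgather effect can be simulated in plain Python (not real MPI): every rank ends up with every rank's send block, in rank order.

```python
def allgather(send_blocks):
    """Simulate an allgather: concatenate all blocks in rank order,
    then give every process its own copy."""
    combined = [item for blk in send_blocks for item in blk]
    return [list(combined) for _ in send_blocks]

send_data = [["A0"], ["B0"], ["C0"], ["D0"], ["E0"]]
recv_data = allgather(send_data)
print(recv_data[3])   # ['A0', 'B0', 'C0', 'D0', 'E0'] — same on every process
```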
Alltoall (transpose)

              ----> Data
    | A0 A1 A2 A3 A4        A0 B0 C0 D0 E0
    | B0 B1 B2 B3 B4        A1 B1 C1 D1 E1
    | C0 C1 C2 C3 C4        A2 B2 C2 D2 E2
    | D0 D1 D2 D3 D4        A3 B3 C3 D3 E3
    | E0 E1 E2 E3 E4        A4 B4 C4 D4 E4
    v
  Processes

• The heart of the redistribution of the keys is each process sending its original local keys to the appropriate process.
• This is a collective communication operation in which each process sends a distinct collection of data to every other process.
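The transpose effect of alltoall can be simulated in plain Python (not real MPI): block j sent by process i lands in slot i of process j.

```python
def alltoall(send_grid):
    """Simulate an alltoall: transpose the process-by-block grid."""
    n = len(send_grid)
    return [[send_grid[i][j] for i in range(n)] for j in range(n)]

send = [["A0", "A1", "A2"],
        ["B0", "B1", "B2"],
        ["C0", "C1", "C2"]]
recv = alltoall(send)
print(recv[1])   # ['A1', 'B1', 'C1']
```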
Tree-Structured Communication
• To improve the coding, we should focus on the distribution of the input data.
• How can we divide the work more evenly among processes?
• We can think of the processes as forming a tree, with process 0 as the root.
• During the 1st stage of the data distribution, 0 sends data to 1.
• During the 2nd stage, 0 sends data to 2 while 1 sends data to 3.
• During the 3rd stage, 0 sends to 4, while 1 sends to 5, 2 sends to 6, and 3 sends to 7.
• So we reduce the input distribution loop from 7 stages to 3 stages.
• In general, if we have p processes, this procedure allows us to distribute the input data in ceil(log2(p)) stages, where ceil(log2(p)) is the smallest whole number greater than or equal to log2(p), called the ceiling of that number.
(See the process configuration tree.)
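The doubling scheme above can be sketched in plain Python; the helper below simulates the stages and confirms the ceil(log2(p)) stage count (the sender-to-receiver pairing is one common doubling scheme, equivalent in stage count to the slide's ordering):

```python
import math

def tree_broadcast_stages(p):
    """Simulate the doubling distribution: at each stage, every rank that
    already holds the data sends to one rank that does not."""
    have = {0}
    stages = 0
    while len(have) < p:
        size = len(have)
        have |= {r + size for r in have if r + size < p}
        stages += 1
    return stages

print(tree_broadcast_stages(8))   # 3 stages for p = 8, vs 7 stages linearly
assert tree_broadcast_stages(7) == math.ceil(math.log2(7))
```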
Data Distribution Stages

(Figure: the distribution tree rooted at process 0. Stage 1: 0 -> 1; stage 2: 0 -> 2 and 1 -> 3; stage 3: 0 -> 4, 1 -> 5, 2 -> 6, 3 -> 7.)
1. This distribution reduces the original p-1 stages.
2. If p = 7, it reduces the time required for the program to complete the data distribution from 6 stages to 3, i.e. by a factor of 2 (50%).
3. There is no canonical choice of ordering.
4. We have to know the topology of the system in order to choose a better scheme.
Reduce the Burden of the Final Sum
• In the final summation phase, process 0 always gets a disproportionate amount of work, i.e. computing the global sum of the results from all the other processes.
• To accelerate the final phase, we can use the tree concept in reverse to reduce the load on process 0.
• Distribute the work as:
  – Stage 1:
    1. 4 sends to 0; 5 sends to 1; 6 sends to 2; 7 sends to 3.
    2. 0 adds the integral from 4; 1 adds the integral from 5; 2 adds the integral from 6; 3 adds the integral from 7.
  – Stage 2:
    1. 2 sends to 0; 3 sends to 1.
    2. 0 adds the integral from 2; 1 adds the integral from 3.
  – Stage 3:
    1. 1 sends to 0.
    2. 0 adds the integral from 1.
(See the reverse-tree process configuration.)
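The three stages above can be simulated in plain Python (not real MPI): at each stage the upper half of the active ranks sends its partial sum to the lower half.

```python
def tree_reduce(values):
    """Simulate the reverse-tree sum; values[r] is rank r's partial result."""
    vals = list(values)
    active = len(vals)
    while active > 1:
        half = (active + 1) // 2
        for r in range(half, active):
            vals[r - half] += vals[r]   # rank r sends to rank r - half
        active = half
    return vals[0]                       # rank 0 holds the global sum

print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8]))   # 36
```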
Reverse-Tree Process Configuration

(Figure: the reduction tree. Stage 1: 4 -> 0, 5 -> 1, 6 -> 2, 7 -> 3; stage 2: 2 -> 0, 3 -> 1; stage 3: 1 -> 0.)
Reduction Operations
• The "global sum" calculation belongs to a general class of collective communication operations called reduction operations.
• In a global reduction operation, all the processes in a communicator contribute data, and those data are combined using a binary operation.
• Typical operations are addition, max, min, logical and, etc.
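A reduction can be sketched in plain Python (not real MPI): each rank contributes a vector, and the vectors are combined element-wise with a binary operation.

```python
import operator
from functools import reduce as fold

def reduce_vectors(contribs, op):
    """Combine each rank's vector element-wise with a binary op."""
    return [fold(op, column) for column in zip(*contribs)]

contribs = [[1, 4, 2],    # rank 0's contribution
            [3, 1, 5],    # rank 1
            [2, 2, 2]]    # rank 2
print(reduce_vectors(contribs, operator.add))   # [6, 7, 9]
print(reduce_vectors(contribs, max))            # [3, 4, 5]
```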
Simple Reduce
              ----> Data
    | A0 A1 A2        A0+B0+C0  A1+B1+C1  A2+B2+C2
    | B0 B1 B2
    | C0 C1 C2
    v
  Processes
Allreduce
• In the simple reduce, only process 0 receives the global sum; the result buffers of the other processes are not meaningful.
• If we want to use the result for subsequent calculations, we would like each process to end up with the same correct result.
• The obvious approach is to follow the call to MPI_Reduce with a call to MPI_Bcast.
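The reduce-then-broadcast composition can be sketched in plain Python (a simulation of the semantics, not real MPI calls):

```python
import operator
from functools import reduce as fold

def allreduce(contribs, op):
    """Simulate MPI_Reduce followed by MPI_Bcast: every rank ends up
    with the same combined result."""
    result = [fold(op, col) for col in zip(*contribs)]   # reduce to the root
    return [list(result) for _ in contribs]              # broadcast to all

recv = allreduce([[1, 2], [3, 4], [5, 6]], operator.add)
print(recv[0])   # [9, 12], and every other process holds the same
```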
Every Process Has the Same Result

              ----> Data          Results
    | A0 A1 A2        A0+B0+C0  A1+B1+C1  A2+B2+C2
    | B0 B1 B2        A0+B0+C0  A1+B1+C1  A2+B2+C2
    | C0 C1 C2        A0+B0+C0  A1+B1+C1  A2+B2+C2
    v
  Processes
Implementation in MPI - MPI_Gather

1. MPI_Gather(sendbuffer, sendcount, sendtype,
              recvbuffer, recvcount, recvtype,
              root rank, comm)

Remarks:
1. All processes in "comm", including the root, send their "sendbuffer" to the root.
2. The root collects these "sendbuffer" contents and puts them in rank order in "recvbuffer".
3. "recvbuffer" is ignored in all processes except the root.
4. Its inverse operation is MPI_Scatter().
Implementation in MPI - MPI_Scatter

2. MPI_Scatter(sendbuffer, sendcount, sendtype,
               recvbuffer, recvcount, recvtype,
               root rank, comm)

Remarks:
1. The root sends pieces of "sendbuffer" to all processes, including itself.
2. The pieces are assigned in rank order and land in each process's "recvbuffer".
3. The root cuts its message into "n" equal parts and then sends them to the "n" processes.
Implementation in MPI - MPI_Gatherv

3. MPI_Gatherv(sendbuffer, sendcount, sendtype,
               recvbuffer, recvcounts,
               displacements,  /* integer array of displacements */
               recvtype,
               root rank, comm)

Remarks:
1. This is a more general and more flexible function.
2. It allows a varying count of data from each process.
3. The variation is described by "recvcounts" and "displacements", which are "n-"dimensional arrays (one entry per process).
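The role of the displacement array can be sketched in plain Python (a simulation of what the root does with the incoming blocks, not real MPI):

```python
def gatherv(send_blocks, recvcounts, displs):
    """Simulate a gatherv on the root: block i (of length recvcounts[i])
    is copied into the receive buffer starting at offset displs[i]."""
    total = max(d + c for d, c in zip(displs, recvcounts))
    recvbuf = [None] * total
    for blk, count, d in zip(send_blocks, recvcounts, displs):
        recvbuf[d:d + count] = blk[:count]
    return recvbuf

# three "processes" contributing 1, 3, and 2 items, packed back to back
blocks = [["a"], ["b", "c", "d"], ["e", "f"]]
counts = [1, 3, 2]
displs = [0, 1, 4]          # running prefix sums of the counts
print(gatherv(blocks, counts, displs))   # ['a', 'b', 'c', 'd', 'e', 'f']
```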
Implementation in MPI - MPI_Allgather

4. MPI_Allgather(sendbuffer, sendcount, sendtype,
                 recvbuffer, recvcount, recvtype,
                 comm)

Remarks:
1. This operation is similar to an all-to-all operation.
2. Instead of specifying a "root", every process sends its data to all other processes.
3. The "j-th" block of data from each process is received by every process and is placed in the "j-th" block of the buffer "recvbuf".
Implementation in MPI - MPI_Allgatherv

5. MPI_Allgatherv(sendbuffer, sendcount, sendtype,
                  recvbuffer, recvcounts,
                  displacements,
                  recvtype,
                  comm)

Remarks:
(1) This operation is similar to an all-to-all operation.
(2) Instead of specifying a "root", every process sends its data to all other processes.
(3) The "j-th" block of data from each process is received by every process and is placed in the "j-th" block of the buffer "recvbuf".
(4) But the blocks from different processes need not be uniform in size.
Implementation in MPI - MPI_Alltoall

6. MPI_Alltoall(sendbuffer, sendcount, sendtype,
                recvbuffer, recvcount, recvtype,
                comm)

Remarks:
(1) This is an all-to-all operation.
(2) The "j-th" block sent from process "i" is placed in the "i-th" location of process "j"'s "recv" buffer.
Implementation in MPI - MPI_Alltoallv

7. MPI_Alltoallv(sendbuffer, sendcounts, s-displacements, sendtype,
                 recvbuffer, recvcounts, r-displacements, recvtype,
                 comm)

Remarks:
(1) This is an all-to-all operation.
(2) The "j-th" block sent from process "i" is placed in the "i-th" location of process "j"'s "recv" buffer.