On Optimizing Collective Communication

UT / Texas Advanced Computing Center
UT / Computer Science

Avi Purkayastha
Ernie Chan, Marcel Heinrich, Robert van de Geijn

ScicomP 10, August 9–13, Austin, TX
Outline
• Model of Parallel Computation
• Collective Communications
• Algorithms
• Performance Results
• Conclusions and Future work
Model of Parallel Computation
• Target Architectures
  – distributed-memory parallel architectures
• Indexing
  – p nodes, indexed 0 … p − 1
  – each node has one computational processor
• often logically viewed as a linear array

0  1  2  3  4  5  6  7  8
Model of Parallel Computation
• Logically Fully Connected
  – a node can send directly to any other node
• Communicating Between Nodes
  – a node can simultaneously receive and send
• Network Conflicts
  – sending over a path between two nodes that is completely occupied
Model of Parallel Computation
• Cost of Communication
  – sending a message of length n between any two nodes costs α + nβ
  – α is the startup cost (latency); β is the per-item transmission cost (bandwidth)
• Cost of Computation
  – the cost to perform one arithmetic operation is γ
  – reduction operations: sum, prod, min, max
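As a concrete sketch of this cost model, the helpers below use invented constants for α, β, and γ; real values are machine-specific and must be measured.

```python
# Sketch of the alpha-beta-gamma cost model above, with invented constants --
# real values are machine-specific and must be measured.
ALPHA = 10e-6   # startup (latency) cost per message, seconds (assumed)
BETA = 1e-9     # transmission cost per vector item, seconds (assumed)
GAMMA = 0.5e-9  # cost per arithmetic (reduction) operation, seconds (assumed)

def send_cost(n):
    """Cost of sending a message of length n between any two nodes: alpha + n*beta."""
    return ALPHA + n * BETA

def reduce_cost(n):
    """Cost of combining n items with a reduction operator (sum, prod, min, max)."""
    return n * GAMMA

# Latency dominates short messages; bandwidth dominates long ones.
print(send_cost(1), send_cost(10**8))
```

Under these (assumed) constants a one-item message is almost pure latency, while a 10⁸-item message is almost pure bandwidth, which is why the short- and long-vector cases need different algorithms.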
Outline
• Model of Parallel Computation
• Collective Communications
• Algorithms
• Performance Results
• Conclusions and Future work
Collective Communications
• Broadcast
• Reduce(-to-one)
• Scatter
• Gather
• Allgather
• Reduce-scatter
• Allreduce
Lower Bounds (Latency)
• Broadcast: ⌈log(p)⌉ α
• Reduce(-to-one): ⌈log(p)⌉ α
• Scatter/Gather: ⌈log(p)⌉ α
• Allgather: ⌈log(p)⌉ α
• Reduce-scatter: ⌈log(p)⌉ α
• Allreduce: ⌈log(p)⌉ α
Lower Bounds (Bandwidth)
• Broadcast: nβ
• Reduce(-to-one): nβ + ((p−1)/p) nγ
• Scatter/Gather: ((p−1)/p) nβ
• Allgather: ((p−1)/p) nβ
• Reduce-scatter: ((p−1)/p) nβ + ((p−1)/p) nγ
• Allreduce: 2((p−1)/p) nβ + ((p−1)/p) nγ
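These standard lower bounds can be transcribed into a small helper for later comparisons; the function names and constants are illustrative, not part of the original library.

```python
from math import ceil, log2

def latency_lower_bound(p, alpha):
    """All seven collectives share the latency lower bound ceil(log2(p)) * alpha."""
    return ceil(log2(p)) * alpha

def bandwidth_lower_bound(op, p, n, beta, gamma):
    """Bandwidth/computation lower bounds for each collective on p nodes,
    vector length n, per-item costs beta (transmission) and gamma (reduction)."""
    f = (p - 1) / p
    return {
        "broadcast":      n * beta,
        "reduce":         n * beta + f * n * gamma,
        "scatter":        f * n * beta,
        "gather":         f * n * beta,
        "allgather":      f * n * beta,
        "reduce-scatter": f * n * beta + f * n * gamma,
        "allreduce":      2 * f * n * beta + f * n * gamma,
    }[op]

print(latency_lower_bound(100, 1.0))  # ceil(log2(100)) = 7 steps minimum
```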
Outline
• Model of Parallel Computation
• Collective Communications
• Algorithms
• Performance Results
• Conclusions and Future work
Motivating Example
• We will illustrate the different types of algorithms and implementations using the Reduce-scatter operation.
A building block approach to library implementation
• Short-vector case
• Long-vector case
• Hybrid algorithms
Short-vector case
• Primary concern:
  – algorithms must have low latency cost
• Secondary concerns:
  – algorithms must work for an arbitrary number of nodes
    • in particular, not just for power-of-two numbers of nodes
  – algorithms should avoid network conflicts
    • not absolutely necessary, but nice if possible
Minimum-Spanning Tree based algorithms
• We will show how the following building blocks
  – broadcast/reduce
  – scatter/gather
• can be implemented using minimum spanning trees embedded in the logical linear array while attaining
  – minimal latency
  – implementation for arbitrary numbers of nodes
  – no network conflicts
General principles
• message starts on one processor
General principles
• divide logical linear array in half
General principles
• send message to the half of the network that does not contain the current node (root) that holds the message
General principles
• continue recursively in each of the two halves
General principles
• The demonstrated technique directly applies to
  – broadcast
  – scatter
• The technique can be applied in reverse to
  – reduce
  – gather
General principles
• This technique can be used to implement the following building blocks:
  – broadcast/reduce
  – scatter/gather
• using a minimum spanning tree embedded in the logical linear array while attaining
  – minimal latency
  – implementation for arbitrary numbers of nodes
  – no network conflicts? Yes, on linear arrays
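The recursive halving above can be sketched in a few lines; `send(src, dst)` is an invented callback that records one point-to-point message, and the sends at the same recursion depth can proceed concurrently.

```python
def mst_broadcast(nodes, root, send):
    """Sketch of the recursive-halving MST broadcast on a logical linear array
    `nodes` (any length, not just powers of two). `send(src, dst)` is a
    caller-supplied callback recording one message; `root` holds the data."""
    if len(nodes) <= 1:
        return
    mid = len(nodes) // 2
    left, right = nodes[:mid], nodes[mid:]
    if root in left:
        dest = right[0]                      # send to the half without the root
        send(root, dest)
        mst_broadcast(left, root, send)      # the two halves proceed in parallel
        mst_broadcast(right, dest, send)
    else:
        dest = left[0]
        send(root, dest)
        mst_broadcast(right, root, send)
        mst_broadcast(left, dest, send)

msgs = []
mst_broadcast(list(range(8)), 0, lambda s, d: msgs.append((s, d)))
print(msgs)  # 7 messages; sends at the same depth overlap, so the broadcast
             # completes in ceil(log2(8)) = 3 steps
```

Running the technique in reverse (child sends to parent, combining on arrival) gives the reduce; sending only the relevant half of the vector at each split gives the scatter.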
Reduce-scatter (short vector)
Reduce
Reduce-scatter (short vector)
Scatter
Reduce-scatter (short vector)
Reduce

[Diagram: before, each node holds a full vector; after, the root holds the elementwise sum (+) of all the vectors.]
Cost of Minimum-Spanning Tree Reduce
⌈log(p)⌉ (α + nβ + nγ)

number of steps × cost per step
Notice: attains lower bound for latency component
Scatter

[Diagram: before, the root holds the full vector; after, each node holds only its own subvector.]
Cost of Minimum-Spanning Tree Scatter
Σk=1..log(p) (α + (n/2^k) β) = log(p) α + ((p−1)/p) nβ

• Assumption: power-of-two number of nodes
Notice: attains lower bound for latency and bandwidth components
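The bandwidth term comes from a geometric sum, which the sketch below checks numerically (power-of-two p assumed; the function name is invented for illustration).

```python
from math import log2

def mst_scatter_traffic(p, n):
    """Total traffic along the critical path of the MST scatter: at step k the
    root forwards n / 2**k items, for k = 1 .. log2(p) (p a power of two)."""
    return sum(n / 2**k for k in range(1, int(log2(p)) + 1))

# The geometric sum telescopes to ((p - 1) / p) * n, the bandwidth term above.
p, n = 64, 1024.0
print(mst_scatter_traffic(p, n), (p - 1) / p * n)  # both 1008.0
```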
Cost of Reduce/Scatter Reduce-scatter
log(p) (α + nβ + nγ)          reduce
+ log(p) α + ((p−1)/p) nβ     scatter
= 2 log(p) α + (((p−1)/p) + log(p)) nβ + log(p) nγ

• Assumption: power-of-two number of nodes
Notice: does not attain lower bound for latency or bandwidth components
Recap

• Reduce: log(p) (α + nβ + nγ)
• Scatter: log(p) α + ((p−1)/p) nβ
• Broadcast: log(p) (α + nβ)
• Gather: log(p) α + ((p−1)/p) nβ
• Allreduce: 2 log(p) α + log(p) n (2β + γ)
• Reduce-scatter: 2 log(p) α + log(p) n (β + γ) + ((p−1)/p) nβ
• Allgather: 2 log(p) α + log(p) nβ + ((p−1)/p) nβ
A building block approach to library implementation
• Short-vector case
• Long-vector case
• Hybrid algorithms
Long-vector case
• Primary concerns:
  – algorithms must have low cost due to vector length
  – algorithms must avoid network conflicts
• Secondary concern:
  – algorithms must work for an arbitrary number of nodes
    • in particular, not just for power-of-two numbers of nodes
Long-vector building blocks
• We will show how the following building blocks
  – allgather/reduce-scatter
• can be implemented using “bucket” algorithms while attaining
  – minimal cost due to length of vectors
  – implementation for arbitrary numbers of nodes
  – no network conflicts
• A logical ring can be embedded in a physical linear array
General principles
• This technique can be used to implement the following building blocks:
  – allgather/reduce-scatter
• Send subvectors of data around the ring at each step until all data is collected, like a “bucket”, attaining
  – minimal cost due to length of vectors
  – implementation for arbitrary numbers of nodes
  – no network conflicts
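The bucket idea can be simulated in a few lines; `ring_allgather` and its data layout are invented for illustration, not the library's API.

```python
def ring_allgather(blocks):
    """Toy simulation of the bucket allgather: p nodes on a logical ring, each
    starting with one block; for p - 1 steps every node forwards the block it
    most recently received (initially its own) to its right neighbor."""
    p = len(blocks)
    held = [{i: blocks[i]} for i in range(p)]  # node i starts with block i only
    carrying = list(range(p))                  # block index each node sends next
    for _ in range(p - 1):
        incoming = [carrying[(i - 1) % p] for i in range(p)]
        for i, b in enumerate(incoming):
            held[i][b] = blocks[b]             # receive from the left neighbor
        carrying = incoming
    return held

result = ring_allgather(["a", "b", "c", "d"])
print(result[0])  # after p - 1 = 3 steps, node 0 holds all four blocks
```

Because every node sends only n/p items per step and all links carry traffic simultaneously, there are no network conflicts on the ring; running the same pattern with a reduction at each receive gives the bucket reduce-scatter.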
Reduce-scatter
[Diagram: before, each node holds a full vector; after, each node holds the elementwise sum (+) of one subvector, combined across all nodes.]
Cost of Bucket Reduce-scatter
(p−1) (α + (n/p) β + (n/p) γ) = (p−1) α + ((p−1)/p) nβ + ((p−1)/p) nγ

number of steps × cost per step
Notice: attains lower bound for bandwidth and computation components
Recap

• Reduce-scatter: (p−1) α + ((p−1)/p) n (β + γ)
• Scatter: log(p) α + ((p−1)/p) nβ
• Allgather: (p−1) α + ((p−1)/p) nβ
• Gather: log(p) α + ((p−1)/p) nβ
• Allreduce: 2(p−1) α + ((p−1)/p) n (2β + γ)
• Reduce: (p−1 + log(p)) α + ((p−1)/p) n (2β + γ)
• Broadcast: (log(p) + p−1) α + 2((p−1)/p) nβ
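Comparing the short-vector (MST) and long-vector (bucket) reduce-scatter costs under invented machine constants shows the crossover that motivates the hybrid case; the constants below are illustrative only, not measured values.

```python
from math import log2

ALPHA, BETA, GAMMA = 10e-6, 1e-9, 0.5e-9  # invented constants for illustration

def mst_reduce_scatter_cost(p, n):
    """Short-vector cost: MST reduce followed by MST scatter (first recap)."""
    return (2 * log2(p) * ALPHA
            + ((p - 1) / p + log2(p)) * n * BETA
            + log2(p) * n * GAMMA)

def bucket_reduce_scatter_cost(p, n):
    """Long-vector cost: bucket reduce-scatter (recap above)."""
    return (p - 1) * ALPHA + (p - 1) / p * n * (BETA + GAMMA)

p = 64
for n in (1e2, 1e4, 1e6, 1e8):
    winner = ("MST" if mst_reduce_scatter_cost(p, n) < bucket_reduce_scatter_cost(p, n)
              else "bucket")
    print(f"n = {n:>10.0f}: {winner} algorithm is cheaper")
```

With these constants the MST algorithm wins for short vectors (2 log(p) α beats (p−1) α) and the bucket algorithm wins for long ones (its bandwidth term is roughly log(p)/1 times smaller); the actual crossover point depends on the machine.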
A building block approach to library implementation
• Short-vector case
• Long-vector case
• Hybrid algorithms
Hybrid algorithms (intermediate-length case)

• Algorithms must balance latency, cost due to vector length, and network conflicts
General principles
• View the p nodes as a 2-dimensional mesh
  – p = r × c, with row and column dimensions
• Perform the operation in each dimension
  – e.g., reduce-scatter within columns, followed by reduce-scatter within rows
General principles
• Many different combinations of short- and long-vector algorithms are possible
• Generally, try to reduce the vector length used at the next dimension
• More than 2 dimensions can be used
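A minimal sketch of the 2D view; the rule for picking r and c and the function names are invented for illustration (the factorization is a tuning decision the slides leave open).

```python
from math import sqrt

def mesh_factors(p):
    """View p nodes as an r x c mesh; here r is the largest divisor of p not
    exceeding sqrt(p) -- one simple choice among many (assumed, not prescribed)."""
    r = max(d for d in range(1, int(sqrt(p)) + 1) if p % d == 0)
    return r, p // r

def node_coords(i, r, c):
    """Row-major coordinates of node i in the r x c mesh: node i is in
    row i // c and column i % c, so rows and columns partition the nodes."""
    return divmod(i, c)

r, c = mesh_factors(12)
print(r, c)                  # 3 x 4 mesh
print(node_coords(7, r, c))  # node 7 sits at row 1, column 3
```

A column collective then runs among the r nodes sharing a column index, and a row collective among the c nodes sharing a row index, each on vectors shortened by the previous phase.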
Example: 2D Scatter/Allgather Broadcast

• Scatter in columns
• Scatter in rows
• Allgather in rows
• Allgather in columns
Cost of 2D Scatter/Allgather Broadcast
(log(p) + r + c − 2) α + 2((p−1)/p) nβ
Cost comparison
• Option 1: MST broadcast in columns, then MST broadcast in rows
  log(p) α + log(p) nβ
• Option 2: scatter in columns, MST broadcast in rows, allgather in columns
  (log(p) + c − 1) α + ((2(c−1) + log(r)) / c) nβ
• Option 3: scatter in columns, scatter in rows, allgather in rows, allgather in columns
  (log(p) + r + c − 2) α + 2((p−1)/p) nβ
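Evaluating the three options under assumed constants shows how the best choice shifts with vector length; the formulas follow the slide, while the α and β values are invented.

```python
from math import log2

ALPHA, BETA = 10e-6, 1e-9  # invented constants; real values are machine-specific

def option1(p, n, r, c):
    """MST broadcast in columns, then MST broadcast in rows."""
    return log2(p) * ALPHA + log2(p) * n * BETA

def option2(p, n, r, c):
    """Scatter in columns, MST broadcast in rows, allgather in columns."""
    return (log2(p) + c - 1) * ALPHA + (2 * (c - 1) + log2(r)) / c * n * BETA

def option3(p, n, r, c):
    """Scatter in both dimensions, then allgather in both dimensions."""
    return (log2(p) + r + c - 2) * ALPHA + 2 * (p - 1) / p * n * BETA

p, r, c = 64, 8, 8
for n in (1.0, 1e4, 1e8):
    costs = {f.__name__: f(p, n, r, c) for f in (option1, option2, option3)}
    print(n, min(costs, key=costs.get))
```

With these constants Option 1 wins for tiny vectors (lowest latency term) and Option 3 wins for long vectors (its bandwidth term approaches 2nβ instead of log(p) nβ), with Option 2 in between.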
Outline
• Model of Parallel Computation
• Collective Communications
• Algorithms
• Performance Results
• Conclusions and Future work
Testbed Architecture

• Cray-Dell PowerEdge Linux Cluster (Lonestar)
  – Texas Advanced Computing Center, J. J. Pickle Research Campus, The University of Texas at Austin
  – 856 3.06 GHz Xeon processors
  – 410 Dell dual-processor PowerEdge 1750 compute nodes
  – 2 GB of memory per compute node
  – Myrinet-2000 switch fabric
  – mpich-gm library 1.2.5..12a
Performance Results

• Log-log graphs
  – show performance better than linear-linear graphs
• Used one processor per node
• Graph legend
  – ‘send-min’ is the minimum time to send between nodes
  – ‘MPI’ is the MPICH implementation
  – ‘short’ is the minimum-spanning-tree algorithm
  – ‘long’ is the bucket algorithm
  – ‘hybrid’ is a 3-dimensional hybrid algorithm
    • the two higher dimensions use the ‘long’ algorithm
    • the lowest dimension uses the ‘short’ algorithm
[Performance graphs appeared here.]
Conclusions and Future work
• We have a working prototype of an optimized collective communication library, using optimal algorithms for short, intermediate, and long vector lengths.
• We need heuristics for the crossover points between the short, hybrid, and long vector algorithms, independent of architecture.
• We need to complete this approach for all MPI data types and operations.
• We need to generalize the approach, or develop heuristics, for large SMP nodes and small n-way node clusters.