Collective Communication on Architectures that Support Simultaneous Communication over Multiple Links
Ernie Chan
Authors
Ernie Chan Robert van de Geijn
Department of Computer Sciences
The University of Texas at Austin
William Gropp Rajeev Thakur
Mathematics and Computer Science Division
Argonne National Laboratory
Testbed Architecture
IBM Blue Gene/L
  3D torus point-to-point interconnect network
One rack
  1024 dual-processor nodes
  Two 8 x 8 x 8 midplanes
Special feature to send simultaneously
  Use multiple calls to MPI_Isend
Outline
Testbed Architecture
Model of Parallel Computation
Sending Simultaneously
Collective Communication
Generalized Algorithms
Performance Results
Conclusion
Model of Parallel Computation
Target Architectures
  Distributed-memory parallel architectures
Indexing
  p computational nodes
  Indexed 0 … p - 1
Logically Fully Connected
  A node can send directly to any other node
Model of Parallel Computation
Topology
  N-dimensional torus
[Figure: example with 16 nodes, labeled 0 to 15]
Model of Parallel Computation
Old Model of Communicating Between Nodes
  Unidirectional sending or receiving
  Simultaneous sending and receiving
  Bidirectional exchange
Model of Parallel Computation
Communicating Between Nodes
  A node can send or receive with 2N other nodes simultaneously along its 2N different links
  Cannot perform bidirectional exchange on any link while sending or receiving simultaneously with multiple nodes
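To make the 2N links concrete, here is a minimal sketch that lists the neighbors of a node on an N-dimensional torus. The row-major mapping from node index to coordinates and the function names are assumptions made for illustration; the slides do not say how nodes are numbered. The sketch also assumes every dimension has at least 3 nodes, so that the two neighbors along each axis are distinct.

```python
def rank_to_coords(rank, dims):
    """Convert a node index 0..p-1 to torus coordinates (assumed row-major)."""
    coords = []
    for size in reversed(dims):
        coords.append(rank % size)
        rank //= size
    return tuple(reversed(coords))


def coords_to_rank(coords, dims):
    """Inverse of rank_to_coords under the same assumed row-major mapping."""
    rank = 0
    for c, size in zip(coords, dims):
        rank = rank * size + c
    return rank


def torus_neighbors(rank, dims):
    """Return the nodes reachable over the 2N links of `rank`."""
    coords = rank_to_coords(rank, dims)
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, +1):
            shifted = list(coords)
            # The modulo implements the wraparound link of the torus.
            shifted[axis] = (coords[axis] + step) % size
            neighbors.append(coords_to_rank(shifted, dims))
    return neighbors
```

On a 4 x 4 torus (N = 2), torus_neighbors(5, (4, 4)) returns the four nodes [1, 9, 4, 6]; on an 8 x 8 x 8 midplane (N = 3) every node has six neighbors.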
Model of Parallel Computation
Cost of Communication
α + nβ
α: startup time (latency)
n: number of bytes to communicate
β: transmission time per byte (inverse of bandwidth)
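A minimal sketch of this cost model, to show how the two terms trade off. The function name and the numeric values of α and β are illustrative assumptions, not measurements from the Blue Gene/L testbed.

```python
def p2p_cost(n, alpha, beta):
    """Modeled time to send n bytes to one node: alpha + n * beta."""
    return alpha + n * beta


if __name__ == "__main__":
    alpha = 3.0e-6   # assumed startup time in seconds (illustrative only)
    beta = 1.0e-8    # assumed time per byte in seconds (illustrative only)
    for n in (8, 1024, 4 * 1024 * 1024):
        total = p2p_cost(n, alpha, beta)
        share = alpha / total
        print(f"n = {n:>8} bytes: cost = {total:.3e} s, startup share = {share:.1%}")
```

For short messages the α term dominates and for long messages the nβ term dominates, which is why short-vector and long-vector algorithms are treated separately.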
Sending Simultaneously
Old Cost of Communication with Sends to Multiple Nodes
  Cost to send to m separate nodes
(α + nβ) m
Sending Simultaneously
New Cost of Communication with Simultaneous Sends
(α + nβ) m
can be replaced with
(α + nβ) + (α + nβ) (m - 1) τ
  (α + nβ): cost of one send
  (α + nβ) (m - 1) τ: cost of extra sends
0 ≤ τ ≤ 1
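The meaning of the bounds on τ can be read off by factoring the new cost. A short worked derivation, using only the symbols already defined on these slides:

```latex
\begin{align*}
(\alpha + n\beta) + (\alpha + n\beta)(m - 1)\tau
  &= (\alpha + n\beta)\bigl(1 + (m - 1)\tau\bigr) \\
\tau = 0 &: \quad (\alpha + n\beta)
  \quad \text{(the extra sends are free)} \\
\tau = 1 &: \quad (\alpha + n\beta)\, m
  \quad \text{(no gain over the old cost)}
\end{align*}
```

Any measured τ strictly between 0 and 1 therefore means that simultaneous sends are cheaper than m separate sends but not free.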
Sending Simultaneously
Benchmarking Sending Simultaneously
  Logarithmic-logarithmic (log-log) timing graphs
  Midplane: 512 nodes
  Sending simultaneously with 1 to 6 neighbors
  Message sizes from 8 bytes to 4 MB
Sending Simultaneously
Cost of Communication with Simultaneous Sends
(α + nβ) (1 + (m - 1) τ)
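A minimal sketch of this formula, comparing it against the old cost of m separate sends. The function names and all numeric values are assumptions chosen for illustration; in particular τ = 0.25 is not a measured value.

```python
def sequential_send_cost(n, m, alpha, beta):
    """Old model: m separate sends, each costing alpha + n * beta."""
    return (alpha + n * beta) * m


def simultaneous_send_cost(n, m, alpha, beta, tau):
    """New model: (alpha + n * beta) * (1 + (m - 1) * tau), with 0 <= tau <= 1."""
    if m < 1:
        raise ValueError("m must be at least 1")
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return (alpha + n * beta) * (1 + (m - 1) * tau)


if __name__ == "__main__":
    alpha, beta, tau = 3.0e-6, 1.0e-8, 0.25   # illustrative values only
    n = 1024
    for m in range(1, 7):                      # 1 to 6 neighbors, as benchmarked
        old = sequential_send_cost(n, m, alpha, beta)
        new = simultaneous_send_cost(n, m, alpha, beta, tau)
        print(f"m = {m}: old = {old:.3e} s, new = {new:.3e} s, ratio = {old / new:.2f}")
```

With m = 1 the two models agree, and the gap widens as more neighbors are sent to at once.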
Collective Communication
Broadcast (Bcast)
  Motivating example
[Figure: data on the nodes before and after the broadcast]
Generalized Algorithms
Short-Vector Algorithms
  Minimum-Spanning Tree
Long-Vector Algorithms
  Bucket Algorithm
Generalized Algorithms
Minimum-Spanning Tree
Generalized Algorithms
Minimum-Spanning Tree
  Divide p nodes into N+1 partitions
Generalized Algorithms
Minimum-Spanning Tree
  Disjoint partitions on N-dimensional mesh
[Figure: 16 nodes, labeled 0 to 15]
Generalized Algorithms
Minimum-Spanning Tree
  Divide dimensions by a decrementing counter from N+1
[Figure: 16 nodes, labeled 0 to 15]
Generalized Algorithms
Minimum-Spanning Tree
  Now divide into 2N+1 partitions
[Figure: 16 nodes, labeled 0 to 15]
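The partitioning idea can be sketched as a small simulation: in each round, every node that holds the data splits its range into k partitions and sends to one node in each of the other partitions. This is a simplified sketch under stated assumptions, not the implementation from the talk: nodes are treated as a linear range 0 … p - 1, the torus topology and link conflicts are ignored, the receiver in each partition is arbitrarily taken to be its first node, and the function name is hypothetical.

```python
def mst_bcast_rounds(p, k, root=0):
    """Simulate a k-way minimum-spanning-tree broadcast over nodes 0..p-1.

    Returns (rounds, schedule), where schedule[r] is the list of
    (sender, receiver) pairs that are active in round r.
    """
    schedule = []

    def recurse(lo, hi, root, depth):
        # Broadcast within the inclusive range [lo, hi], starting from root.
        size = hi - lo + 1
        if size <= 1:
            return
        parts = min(k, size)
        # Split [lo, hi] into contiguous partitions of near-equal size.
        bounds = []
        start = lo
        for i in range(parts):
            length = size // parts + (1 if i < size % parts else 0)
            bounds.append((start, start + length - 1))
            start += length
        while len(schedule) <= depth:
            schedule.append([])
        for a, b in bounds:
            if a <= root <= b:
                sub_root = root          # root keeps serving its own partition
            else:
                sub_root = a             # assumed choice: first node of the partition
                schedule[depth].append((root, sub_root))
            recurse(a, b, sub_root, depth + 1)

    recurse(0, p - 1, root, 0)
    return len(schedule), schedule


if __name__ == "__main__":
    for k in (2, 3, 5):                  # 2-way, N+1 and 2N+1 partitions for N = 2
        rounds, _ = mst_bcast_rounds(16, k)
        print(f"p = 16, k = {k}: {rounds} rounds")
```

For p = 16, two partitions need 4 rounds, N+1 = 3 partitions need 3 rounds, and 2N+1 = 5 partitions need 2 rounds. Each sender issues at most k - 1 sends per round, so under the cost model above a round with m = k - 1 simultaneous sends would be charged roughly (α + nβ) (1 + (m - 1) τ).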
Performance Results
Single point-to-point communication
Performance Results
my-bcast-MST
Conclusion
IBM Blue Gene/L supports functionality of sending simultaneously
  Benchmarking along with model checking verifies this claim
New generalized algorithms show clear performance gains
Conclusion
Future Directions
  Room for optimization to reduce implementation overhead
  What if not using MPI_COMM_WORLD?
  Possible new algorithm for Bucket Algorithm
Questions?