
Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Hideo Saito, Kenjiro Taura
The University of Tokyo
May 16, 2007

2

Background

Increase in the bandwidth of WANs ➭ more opportunities to perform parallel computation using multiple clusters.

[Figure: multiple clusters connected over a WAN]

3

Requirements for Wide-area MPI

Wide-area connectivity: because of firewalls and private addresses, only some nodes can connect to each other; perform routing using the connections that happen to be possible.

[Figure: clusters behind a firewall and a NAT]

4

Reqs. for Wide-area MPI (2)

Scalability: the number of conns. must be limited in order to scale to thousands of nodes, given the various allocation limits of the system (e.g., memory, file descriptors, router sessions).

Simplistic schemes that may potentially result in O(n^2) connections won't scale. Lazy connect strategies work for many apps, but not for those that involve all-to-all communication.

5

Reqs. for Wide-area MPI (3)

Locality awareness: to achieve high performance with few conns., select conns. in a locality-aware manner, i.e., many connections with nearby nodes, few connections with faraway nodes.

[Figure: many conns. within a cluster, few conns. between clusters]

6

Reqs. for Wide-area MPI (4)

Application awareness: select connections according to the application's communication pattern, and assign ranks* according to the application's communication pattern.

Adaptivity: automatically, without tedious manual configuration.

* rank = process ID in MPI

7

Contributions of Our Work

Locality-aware connection management: uses latency and traffic information obtained from a short profiling run.

Locality-aware rank assignment: uses the same info. to discover rank-process mappings with low comm. overhead.

➭ Multi-Cluster MPI (MC-MPI), a wide-area-enabled MPI library

8

Outline

1. Introduction
2. Related Work
3. Proposed Method (Profiling Run, Connection Management, Rank Assignment)
4. Experimental Results
5. Conclusion

9

Grid-enabled MPI Libraries

MPICH-G2 [Karonis et al. '03], MagPIe [Kielmann et al. '99]

Locality-aware communication optimizations, e.g., wide-area-aware collective operations (broadcast, reduction, ...)

Don't work with firewalls

10

Grid-enabled MPI Libraries (cont’d)

MPICH/MADIII [Aumage et al. '03], StaMPI [Imamura et al. '00]

Forwarding mechanisms that allow nodes to communicate even in the presence of FWs

Manual configuration: the amount of necessary config. becomes overwhelming as more resources are used

[Figure: messages forwarded across a firewall]

11

P2P Overlays

Pastry [Rowstron et al. '00]: each node maintains just O(log n) connections; messages are routed using those connections.

Highly scalable, but the routing properties are unfavorable for high-performance computing: few connections between nearby nodes; messages between nearby nodes need to be forwarded, causing large latency penalties.

12

Adaptive MPI [Huang et al. '06]

Performs load balancing by migrating virtual processors: balance the exec. times of the physical processors and minimize inter-processor communication.

Adapts to apps. by tracking the amount of communication performed between procs.

Assumes that the communication cost of every processor pair is the same; MC-MPI takes differences in communication costs into account.

[Figure: virtual processors mapped onto physical processors]

13

Lazy Connect Strategies

MPICH [Gropp et al. '96], Scalable MPI over InfiniBand [Yu et al. '06]: establish connections only on demand.

Reduces the number of conns. if each proc. only communicates with a few other procs.; some apps. generate all-to-all comm. patterns, resulting in many connections (e.g., IS in the NAS Parallel Benchmarks).

Doesn't extend to wide-area environments where some communication may be blocked.

14

Outline

1. Introduction
2. Related Work
3. Proposed Method (Profiling Run, Connection Management, Rank Assignment)
4. Experimental Results
5. Conclusion

15

Overview of Our Method

Short profiling run ➭ latency matrix (L) and traffic matrix (T) ➭ locality-aware connection management and locality-aware rank assignment ➭ optimized real run

16

Outline

1. Introduction
2. Related Work
3. Proposed Method (Profiling Run, Connection Management, Rank Assignment)
4. Experimental Results
5. Conclusion

17

Latency Matrix

Latency matrix L = {lij}; lij: latency between processes i and j in the target environment.

Each process autonomously measures the RTT between itself and other processes.

Reduce the num. of measurements by using the triangular inequality to estimate RTTs: if rttpr > α·rttrq, then take rttpq = rttpr (α: constant).

[Figure: triangle of processes p, q and r with the RTTs rttpq, rttpr and rttrq]
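A minimal sketch (Python, not part of MC-MPI) of how this estimation rule could be applied; the value of ALPHA and the helper name are assumptions for illustration only.

```python
# Sketch only: estimate rtt(p, q) without measuring it, using the rule that
# if rtt(p, r) > alpha * rtt(r, q), then rtt(p, q) is approximately rtt(p, r).
ALPHA = 4.0  # hypothetical value; the slide only says "alpha: constant"

def estimate_rtt(p, q, measured, alpha=ALPHA):
    """measured: dict (i, j) -> RTT for the pairs that were actually measured."""
    if (p, q) in measured:
        return measured[(p, q)]          # direct measurement available
    for (src, r), rtt_pr in measured.items():
        if src != p:
            continue
        rtt_rq = measured.get((r, q))
        if rtt_rq is not None and rtt_pr > alpha * rtt_rq:
            return rtt_pr                # r and q are close, so rtt(p, q) ~ rtt(p, r)
    return None                          # no estimate; fall back to measuring
```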

18

Traffic Matrix

Traffic matrix T = {tij}; tij: traffic between ranks i and j in the target application.

Many applications repeat similar communication patterns ➭ execute the application for a short amount of time (e.g., one iteration of an iterative app.) and make tij the number of transmitted messages.
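A small illustrative sketch (not the MC-MPI profiler) of building T during the profiling run: every send is counted per (source rank, destination rank) pair. The class and method names are hypothetical.

```python
class TrafficProfiler:
    """Count messages between rank pairs during a short profiling run
    to build the traffic matrix T = {tij}."""
    def __init__(self, n_ranks):
        self.t = [[0] * n_ranks for _ in range(n_ranks)]

    def record_send(self, src_rank, dst_rank):
        # tij = number of messages transmitted from rank i to rank j
        self.t[src_rank][dst_rank] += 1

# Usage idea: wrap the library's send path so that every send issued during
# the profiling run (e.g., one iteration) calls record_send(my_rank, dest).
```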

19

Outline

1. Introduction
2. Related Work
3. Proposed Method (Profiling Run, Connection Management, Rank Assignment)
4. Experimental Results
5. Conclusion

20

Connection Management

During MPI_Init: select candidate connections, build the bounding graph, and build the spanning tree.

During the application body: establish candidate connections on demand (lazy connection establishment).

21

Selection of Candidate Connections

Each process selects O(log n) neighbors based on L and T; a parameter controls the connection density, and n is the number of processes.

[Figure: the other processes ordered from near to far; the selection density falls off geometrically with distance (halving from one group to the next), so many nearby processes and few faraway processes are selected.]
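The exact selection rule is not spelled out on the slide; the sketch below is one plausible reading of the figure (a selection fraction that halves with each farther group of processes), with hypothetical names, giving roughly O(log n) neighbors per process. Weighting by the traffic matrix T is omitted for brevity.

```python
import random

def select_candidates(latencies, density=1.0):
    """latencies: dict peer_id -> RTT from the latency matrix L.
    Returns candidate neighbors: many nearby peers, few faraway ones."""
    peers = sorted(latencies, key=latencies.get)   # nearest first
    chosen, start, group_size, frac = [], 0, 1, density
    while start < len(peers):
        group = peers[start:start + group_size]
        k = min(len(group), max(1, round(frac * len(group))))
        chosen.extend(random.sample(group, k))     # pick a fraction of this distance band
        start += group_size
        group_size *= 2                            # bands double in size with distance...
        frac /= 2                                  # ...while the selection fraction halves
    return chosen
```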

22

Bounding Graph

Procs. try to establish temporary conns. to their selected neighbors; the collective set of successful connections ➭ bounding graph (some conns. may fail due to FWs).

[Figure: temporary connections forming the bounding graph]
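A sketch of the idea on this slide, under assumptions: try_temporary_connect is a hypothetical helper that returns a connection object or None when the attempt is blocked; the attempts that succeed are this process's edges in the bounding graph.

```python
def build_bounding_graph_edges(candidates, try_temporary_connect):
    """Attempt temporary connections to the selected neighbors; the attempts
    that succeed become this process's edges in the bounding graph."""
    temp_conns = {}
    for peer in candidates:
        conn = try_temporary_connect(peer)   # may fail behind a FW or NAT
        if conn is not None:
            temp_conns[peer] = conn          # temporary (small-buffer) connection
    return temp_conns
```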

23

Routing Table Construction

Construct a routing table using just the bounding graph, then close the temporary connections.

Conns. of the bounding graph are reestablished lazily as "real" conns.; temporary conns. use small bufs., real conns. use large bufs.

[Figure: bounding graph]
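The slide does not say how the routing table is computed; one reasonable sketch, assuming shortest-path routing over the bounding graph's latency-weighted edges, records the first hop toward each destination with Dijkstra's algorithm.

```python
import heapq
from itertools import count

def build_routing_table(me, graph):
    """graph: dict node -> {neighbor: latency} over bounding-graph edges.
    Returns dict destination -> first hop to use from `me`."""
    dist, first_hop, tie = {me: 0.0}, {}, count()
    pq = [(0.0, next(tie), me, None)]
    while pq:
        d, _, u, hop = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue                          # stale queue entry
        if hop is not None:
            first_hop.setdefault(u, hop)      # best route to u starts with this hop
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, next(tie), v, hop if hop is not None else v))
    return first_hop
```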

24

Lazy Connection Establishment

When a lazy connect attempt fails due to a FW, the connect request is sent to the peer using the spanning tree, and the peer connects in the reverse direction.

[Figure: bounding graph and spanning tree spanning a firewall; a blocked lazy connect is replaced by a connect request sent over the spanning tree and a connection established in the reverse direction.]
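A minimal sketch of the sequence on this slide; try_direct_connect and send_via_spanning_tree are hypothetical helpers standing in for whatever transport the library uses.

```python
def lazy_connect(my_id, peer, try_direct_connect, send_via_spanning_tree):
    """Establish a candidate connection on demand; if the direct attempt is
    blocked by a FW, ask the peer (over the spanning tree) to connect back."""
    conn = try_direct_connect(peer)
    if conn is not None:
        return conn                                   # direct lazy connect succeeded
    send_via_spanning_tree(peer, {"type": "connect_request", "from": my_id})
    return None   # the peer will open the connection in the reverse direction
```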

25

Outline

1. Introduction
2. Related Work
3. Proposed Method (Profiling Run, Connection Management, Rank Assignment)
4. Experimental Results
5. Conclusion

26

Commonly-used Method

Sort the processes by host name (or IP address) and assign ranks in that order.

Assumptions: most communication takes place between processes with close ranks; the communication cost between processes with close host names is low.

However, applications have various comm. patterns, and host names don't necessarily have a correlation to communication costs.
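For contrast with the scheme on the next slide, the commonly-used method above amounts to something like this minimal sketch (hypothetical helper; hostnames indexed by process id):

```python
def hostname_rank_assignment(hostnames):
    """Assign ranks in sorted-host-name order: rank r goes to the process
    with the r-th smallest host name (or IP address)."""
    order = sorted(range(len(hostnames)), key=lambda p: hostnames[p])
    return {rank: proc for rank, proc in enumerate(order)}
```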

27

Our Rank Assignment Scheme

Find a rank-process mapping with low communication overhead.

Map the rank assignment problem to the Quadratic Assignment Problem (QAP).

QAP: given two n×n cost matrices, L and T, find a permutation p of {0, 1, ..., n-1} that minimizes Σi Σj tij · lp(i)p(j).

28

Solving QAPs

NP-hard, but there are heuristics for finding good suboptimal solutions.

Library based on GRASP [Resende et al. '96]; tested against QAPLIB [Burkard et al. '97] instances of up to n = 256, using n processors for problem size n.

Approximate solutions were within one to two percent of the best known solution, in under one second.
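The talk uses a GRASP-based QAP library; the sketch below is not that library, just an illustration of the objective from the previous slide plus a naive pairwise-swap local search over random restarts.

```python
import itertools, random

def qap_cost(perm, L, T):
    """Sum over rank pairs (i, j) of traffic tij times the latency between
    the processes perm[i] and perm[j] that ranks i and j are mapped to."""
    n = len(perm)
    return sum(T[i][j] * L[perm[i]][perm[j]] for i in range(n) for j in range(n))

def qap_local_search(L, T, restarts=10):
    n, best_perm, best_cost = len(T), None, float("inf")
    for _ in range(restarts):
        perm = list(range(n))
        random.shuffle(perm)                       # random starting assignment
        cost, improved = qap_cost(perm, L, T), True
        while improved:
            improved = False
            for i, j in itertools.combinations(range(n), 2):
                perm[i], perm[j] = perm[j], perm[i]        # try swapping two ranks
                new_cost = qap_cost(perm, L, T)
                if new_cost < cost:
                    cost, improved = new_cost, True
                else:
                    perm[i], perm[j] = perm[j], perm[i]    # undo the swap
        if cost < best_cost:
            best_perm, best_cost = perm[:], cost
    return best_perm, best_cost
```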

29

Outline

1. Introduction
2. Related Work
3. Profiling Run
4. Connection Management
5. Rank Assignment
6. Experimental Results
7. Conclusion

30

Experimental Environment

Xeon/Pentium M nodes running Linux; intra-cluster RTT: 60-120 microsecs; TCP send/recv bufs: 256 KB each.

[Figure: four 64-node clusters (sheepXX, istbsXXX, hongoXXX, chibaXXX), one of them behind a firewall, with inter-cluster RTTs of 0.3, 4.3, 4.4, 6.8, 6.9 and 10.8 ms]

31

Experiment 1: Conn. Management

Measure the performance of the NPB with limited numbers of connections.

MC-MPI: limit the number of connections to 10%, 20%, ..., 100% by varying the density parameter.

Random: establish a comparable number of connections randomly.

32

BT, LU, MG and SP

[Figure: relative performance (50%-100%) vs. maximum % of connections for LU (Lower-Upper, Successive Over-Relaxation), MC-MPI vs. Random]

33

BT, LU, MG and SP (2)

[Figure: relative performance vs. maximum % of connections for MG (Multi-Grid) and BT (Block Tridiagonal), MC-MPI vs. Random]

34

BT, LU, MG and SP (3)

[Figure: relative performance vs. maximum % of connections for SP (Scalar Pentadiagonal), MC-MPI vs. Random]

The % of connections actually established was lower than that shown by the x-axis, because of lazy connection establishment (to be discussed in more detail later).

35

EP

[Figure: relative performance vs. maximum % of connections for EP (Embarrassingly Parallel), MC-MPI vs. Random]

EP involves very little communication.

36

IS

[Figure: for IS (Integer Sort), relative performance vs. maximum % of connections (MC-MPI vs. Random), and relative performance vs. buffer size (KB) at 20%, 60% and 100% of connections]

Performance decrease due to congestion!

37

Experiment 2: Lazy Conn. Establish.

Compare our lazy conn. establishment method with an MPICH-like method.

MC-MPI: select the density parameter so that the maximum number of allowed connections is 30%.

MPICH-like: establish connections on demand without preselecting candidate connections (equivalently, all connections are preselected).

38

Experiment 2: Results

[Figure: connections established (%) and relative performance (%) for BT, EP, IS, MG, LU and SP, MC-MPI vs. MPICH-like]

Comparable number of conns. except for IS; comparable performance except for IS.

39

Experiment 3: Rank Assignment

Compare 3 assignment algorithms:

Random

Hostname (24 patterns): real host names (1), and what if istbsXXX were named sheepXX, etc. (23)

MC-MPI (QAP)

[Figure: the four clusters chibaXXX, sheepXX, hongoXXX and istbsXXX]

40

LU and MG

[Figure: speedup vs. Random for LU and MG under the Random, Hostname (Best), Hostname (Worst) and MC-MPI (QAP) assignments]

41

BT and SP

[Figure: speedup vs. Random for BT and SP under the Random, Hostname (Best), Hostname (Worst) and MC-MPI (QAP) assignments]

42

BT and SP (cont’d)

[Figure: the traffic matrix (source rank vs. destination rank) and the rank assignments produced by Hostname and MC-MPI (QAP), showing which ranks are placed on Clusters A-D]

43

EP and IS

[Figure: speedup vs. Random for EP and IS under the Random, Hostname (Best), Hostname (Worst) and MC-MPI (QAP) assignments]

44

Outline

1. Introduction
2. Related Work
3. Profiling Run
4. Connection Management
5. Rank Assignment
6. Experimental Results
7. Conclusion

45

Conclusion

MC-MPI

Connection management: high performance with connections between just 10% of all process pairs.

Rank assignment: up to 300% faster than locality-unaware assignments.

Future Work

An API to perform profiling within a single run.

Integration of adaptive collectives.
