
Madeleine

Olivier Aumage

Runtime Project, INRIA – LaBRI

Bordeaux, France

Objective

Rational task assignment in high-performance communication stacks

[Figure: software stack. Application, programming environment, middle-level interface, low-level interface, network, ranging from model and abstraction down to hardware control]

Madeleine

A communication support for clusters and multi-clusters

Features

Abstract interface

Programming by contract Specification of constraints Freedom for optimization

Active software support Dynamic optimization Adaptivity Transparency

Interface

Definitions

Connection Uni-directional point-to-point link FIFO ordering

Channel Graph of connections Multiplexing unit Network virtualization

[Figure: processes linked by connections, grouped into a channel]

Communication model

Characteristics

Model Message passing Incremental message building

Expressiveness Control of data blocks by flags Contract between the programmer and the interface


Primitives

Main commands

Send mad_begin_packing mad_pack … mad_pack mad_end_packing

Receive mad_begin_unpacking mad_unpack … mad_unpack mad_end_unpacking

Message building

Commands Mad_pack(cnx, buffer, len, pack_mode, unpack_mode) Mad_unpack(cnx, buffer, len, pack_mode, unpack_mode)

Send contract options (send modes) Send_CHEAPER Send_SAFER Send_LATER

Receive contract options (receive modes) Receive_CHEAPER Receive_EXPRESS

Constraints Strictly symmetrical pack/unpack sequences

Triplets (len, pack_mode, unpack_mode) identical for send and for receive Data consistency
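The symmetry constraint above can be illustrated with a small check. This is only a sketch in Python, not the Madeleine C API: the `Block` type and the `first_mismatch` helper are hypothetical names introduced here, and the mode strings simply reuse the spellings from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """One pack/unpack call, reduced to its (len, pack_mode, unpack_mode) triplet."""
    length: int
    pack_mode: str    # "Send_CHEAPER", "Send_SAFER" or "Send_LATER"
    unpack_mode: str  # "Receive_CHEAPER" or "Receive_EXPRESS"


def first_mismatch(sent, received):
    """Return the index of the first asymmetric block, or None if symmetric."""
    for i, (s, r) in enumerate(zip(sent, received)):
        if s != r:
            return i
    if len(sent) != len(received):
        # One side issued more calls than the other.
        return min(len(sent), len(received))
    return None


sender = [Block(4, "Send_CHEAPER", "Receive_EXPRESS"),
          Block(1024, "Send_CHEAPER", "Receive_CHEAPER")]
good_receiver = list(sender)
bad_receiver = [sender[0], Block(1024, "Send_CHEAPER", "Receive_EXPRESS")]
```

Here `first_mismatch(sender, good_receiver)` finds nothing to report, while `bad_receiver` differs on its second block because the unpack mode does not match the one used on the send side.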

[Figure: send timeline. Pack, modification of the buffer, End_packing; behaviour under Send_SAFER, Send_LATER and Send_CHEAPER]

Contract between the programmer and the interface

Send_SAFER / Send_LATER / Send_CHEAPER

Control of data transfer Optimization amount

Promises of programmer Data consistency

Special services Delayed send Buffer reuse

Specification at semantical level Independency: request / implementation
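A toy model may help to picture what the three send modes promise. It is a sketch under assumptions, not Madeleine's implementation: the `ToySender` class is hypothetical, and treating Send_CHEAPER like a reference to the caller's buffer is just one choice an implementation would be free to make, since in that mode the programmer promises not to modify the data.

```python
class ToySender:
    """Toy model of the send modes (hypothetical, for illustration only)."""

    def __init__(self):
        self._pending = []  # blocks recorded by pack()

    def pack(self, buf: bytearray, mode: str) -> None:
        if mode == "Send_SAFER":
            # Snapshot now: the caller may reuse the buffer immediately.
            self._pending.append(bytes(buf))
        elif mode in ("Send_LATER", "Send_CHEAPER"):
            # Keep a reference: the buffer is only read at end_packing().
            self._pending.append(buf)
        else:
            raise ValueError(f"unknown send mode: {mode}")

    def end_packing(self):
        """Read every pending block and return what goes on the wire."""
        wire = [bytes(block) for block in self._pending]
        self._pending.clear()
        return wire


s = ToySender()
a = bytearray(b"AAAA")
b = bytearray(b"BBBB")
s.pack(a, "Send_SAFER")
a[:] = b"aaaa"   # buffer reuse: must not affect the message
s.pack(b, "Send_LATER")
b[:] = b"bbbb"   # delayed send: the updated value must be sent
sent = s.end_packing()
```

In this model the Send_SAFER block goes out with its value at pack time, and the Send_LATER block with its value at end_packing time.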

[Figure: receive timeline. Unpack, after Unpack, End_unpacking. Receive_EXPRESS: data available after Unpack. Receive_CHEAPER: availability not guaranteed after Unpack, data available after End_unpacking]

Message structuring

Receive_CHEAPER / Receive_EXPRESS

Receive_EXPRESS Mandatory immediate receive Interpretation/extraction of message

Receive_CHEAPER Free reception of block Message contents
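The difference between the two receive modes can be sketched as follows. The `ToyReceiver` class is a hypothetical Python model, not the Madeleine API, and it deliberately delays every Receive_CHEAPER block until `end_unpacking()` to make the weakest guarantee visible.

```python
class ToyReceiver:
    """Toy model of the receive modes (hypothetical, for illustration only)."""

    def __init__(self, incoming):
        self._incoming = list(incoming)  # blocks in send order
        self._deferred = []              # (destination, data) for CHEAPER blocks

    def unpack(self, dest: bytearray, mode: str) -> None:
        data = self._incoming.pop(0)
        if mode == "Receive_EXPRESS":
            dest[:] = data                       # usable as soon as unpack returns
        elif mode == "Receive_CHEAPER":
            self._deferred.append((dest, data))  # may arrive any time up to the end
        else:
            raise ValueError(f"unknown receive mode: {mode}")

    def end_unpacking(self) -> None:
        for dest, data in self._deferred:
            dest[:] = data
        self._deferred.clear()


r = ToyReceiver([(8).to_bytes(4, "big"), b"12345678"])
header = bytearray(4)
r.unpack(header, "Receive_EXPRESS")
n = int.from_bytes(header, "big")   # the header is needed to size the body
body = bytearray(n)
r.unpack(body, "Receive_CHEAPER")
body_before = bytes(body)           # not guaranteed to hold the data yet
r.end_unpacking()
```

The header is extracted in express mode because its value drives the rest of the extraction; the body can be received in the cheaper mode because nothing reads it before `end_unpacking()`.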


Organization

Two-layered model Buffer management

Data processing code reuse Hardware abstraction

Modular approach Buffer management modules Drivers Transmission modules

[Figure: two-layer architecture. Interface; buffer management layer with buffer management modules (BMM); network management layer with transmission modules (TM) and drivers; network]

Network management layer

Data transfers Send, receive Group transfers

Transfer method selection Choice function

Transmission modules

Depends on the network

One module per transfer method GM driver: 2 TM BIP driver: 2 TM SCI driver: 3 TM VIA driver: 3 TM

Associated to a buffer management module
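As an illustration of transfer method selection, a choice function might look like the sketch below. Everything in it is assumed for the example: the two module names, the 512-byte threshold, the `buffer_type` field and the rule that looks only at the block length are hypothetical and do not describe any actual Madeleine driver.

```python
from typing import NamedTuple


class TransmissionModule(NamedTuple):
    name: str
    max_len: int      # largest block this method accepts (hypothetical)
    buffer_type: str  # "static" or "dynamic": the kind of BMM it is paired with


# Hypothetical driver exposing two transfer methods.
TMS = [
    TransmissionModule("short", 512, "static"),
    TransmissionModule("bulk", 2**31, "dynamic"),
]


def choose_tm(length, pack_mode, unpack_mode, tms=TMS):
    """Return the first transmission module able to carry the block.

    This toy rule only uses the length; the modes are accepted so that a
    richer rule could take them into account.
    """
    for tm in tms:
        if length <= tm.max_len:
            return tm
    raise ValueError("no transmission module accepts this block")
```

With these assumptions, `choose_tm(64, "Send_CHEAPER", "Receive_EXPRESS")` selects the `short` module and a 4096-byte block selects `bulk`.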

Transmission modules

[Figure: data path of a Pack between thread/process and network, through the Madeleine interface, the BMMs and the TMs]

Buffers

Generic management layer

Virtual buffers Static Dynamic

Groups Aggregations Splitting

Buffer management modules

Buffer type Static/dynamic

Aggregation mode Without Sequential aggregation Half-sequential aggregation

Aggregation shape Symmetrical/non-symmetrical

Implementation

Status

Network drivers Quadrics, MX, GM, SISCI, MPI, TCP, VRP, VIA, UDP, SBP, BIP

Distribution Licence GPL

Availability

Linux IA32, IA64, x86-64, Alpha, Sparc, PowerPC

MacOS/X G4

Solaris IA32, Sparc

Aix PowerPC

Windows NT IA32

Tests – current platform

Test environment

Cluster of PC bi-Pentium IV HT 2.66 GHz, 1 GB Giga-Ethernet SISCI/SCI MX & GM /Myrinet Quadrics Elan4

Testing procedure

Test: 1000 x (send + receive) Result: ½ x average of 5 tests
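One possible reading of this procedure, given here as an assumption, is that each test times 1000 round trips, the per-round-trip time is averaged over 5 tests, and half of that average is reported as the one-way figure. The sketch below applies that reading to made-up timings; the function name and the numbers are illustrative only.

```python
def one_way_latency_us(elapsed_us, iterations=1000):
    """elapsed_us: total duration of each test run, in microseconds."""
    round_trips = [t / iterations for t in elapsed_us]   # one send + receive
    return 0.5 * sum(round_trips) / len(round_trips)     # half of the average


runs = [8000.0, 8200.0, 7900.0, 8100.0, 7800.0]  # made-up timings, not measurements
latency = one_way_latency_us(runs)
```

With these invented timings the average round trip is about 8 µs, so the reported latency is about 4 µs.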

Latency

[Figure: latency (µs, log scale) versus packet size (4 to 8192 bytes) for Mad/SISCI, Mad/GM, Mad/MX and Mad/Quadrics]

Bandwidth

[Figure: bandwidth (MB/s, log scale) versus transfer size (bytes) for Mad/SISCI, Mad/GM, Mad/MX and Mad/Quadrics]

Tests – older platform

Testing environments

Cluster of PC bi-Pentium II 450 MHz, 128 MB Fast-Ethernet SISCI/SCI BIP/Myrinet

Testing procedure


SISCI/SCI – latency

[Figure: latency (µs, log scale) versus packet size (bytes) for Mad/SISCI and raw SISCI]

SISCI/SCI – bandwidth

[Figure: bandwidth (MB/s, log scale) versus packet size (bytes) for Mad/SISCI and raw SISCI]

SISCI/SCI – latency Packs/messages

[Figure: latency (µs, log scale) versus packet size (bytes) for Mad/SISCI, with curves for 2, 4, 8, 16, 32, 64, 128 and 256 msgs and for the same numbers of packs]

SISCI/SCI – bandwidth Packs/messages

[Figure: bandwidth (MB/s, log scale) versus packet size (bytes) for Mad/SISCI, with curves for 2, 4, 8, 16, 32, 64, 128 and 256 msgs and for the same numbers of packs]

Users – MPICH/Madeleine

MPI API Generic interface: point-to-point communication, collective communication, group building

Abstract Device Interface (ADI) Generic interface: datatype management, request queue management

SMP_PLUG Local communication

CH_SELF Local loops

CH_MAD Communication, polling loops, internal MPICH protocols

Madeleine Communication, multi-protocol support

QSNET, TCP, UDP, BIP, MX, GM, SISCI

MPICH/Mad/SCI – latency

[Figure: latency (µs, log scale) versus packet size (bytes) for Mad, MPICH/Mad, SCI-MPICH and SCA MPI]

MPICH/Mad/SCI – bandwidth

[Figure: bandwidth (MB/s, log scale) versus packet size (bytes) for Mad, MPICH/Mad, SCI-MPICH and SCA MPI]

Users – Padico

[Figure: Padico software stack. Application; MPI, JVM, ORB; Circuit, VSock; Padico Task Manager; Padico Core; Padico micro-kernel; thread manager; Net Access; Marcel; Madeleine (communication, multi-protocol support); networks: QSNET, TCP, UDP, BIP, MX, GM, SISCI]

Padico – latency

[Figure: latency (µs, log scale) versus packet size (bytes) for Madeleine, VSock, MPI and CORBA]

Padico – bandwidth

[Figure: bandwidth (MB/s, log scale) versus packet size (bytes) for Madeleine, VSock, MPI, CORBA and Java]

Conclusion

Unified communication support

Abstract interface Contract-based programming Modular/adaptive architecture Dynamic optimization Transparent multi-cluster support

On-going/future work

Programming interface Message structuring Exploitation of near-future information Reduction of pathological cases Fault tolerance

Communication sequences processing Code specialization, compilation

Session management Deployment Dynamicity Fault-tolerance Scaling

?

Madeleine I Madeleine II Madeleine III Madeleine IV

Some limitations of Madeleine (version III)

Objectives for a new Madeleine

Some optimizations are out of reach for Madeleine The optimization range is too narrow

Need information about what is coming in the near future Need to be more liberal in allowing permutations in the packet flow

Optimization strategies involve too much work from the driver programmer Need to share more of the strategic code Need to easily evaluate and even mix various strategies

Optimization sequences are synchronous with the application program Need to synchronize optimization sequences with the NIC

Proposal: Madeleine IV

[Figure: Madeleine IV architecture. Optimizer thread (tracks, tactics, strategies, constraints, hardware-specific parameters), sender thread, driver, network]

Concepts

Definitions

Tracks Mapping onto hardware multiplexing units (tags)

Main track: control packets, small packets, …

Optional auxiliary tracks: other traffic (large messages, …)

Tactics Basic optimization operations: permutation, aggregation, piggybacking, association, splitting, track change

Strategies Set of tactics towards one optimization goal

Constraints Tactics compatibility Send/receive modes
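These definitions can be pictured with a small model in which a tactic is a function that rewrites the queue of pending packets and a strategy is an ordered list of tactics. The sketch is hypothetical throughout: `Packet`, `aggregate_small`, `group_by_dest` and `run_strategy` are names invented here, and a real aggregation would also need packet headers so that the receiver can split the merged payload again.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Packet:
    dest: int
    payload: bytes


Tactic = Callable[[List[Packet]], List[Packet]]


def aggregate_small(limit: int) -> Tactic:
    """Tactic: merge neighbouring packets for the same destination while the
    merged payload stays within `limit` bytes."""
    def apply(queue):
        out = []
        for p in queue:
            last = out[-1] if out else None
            if last and last.dest == p.dest and len(last.payload) + len(p.payload) <= limit:
                out[-1] = Packet(p.dest, last.payload + p.payload)
            else:
                out.append(p)
        return out
    return apply


def group_by_dest(queue):
    """Tactic: permutation bringing packets for one destination together.
    sorted() is stable, so FIFO order per destination is preserved."""
    return sorted(queue, key=lambda p: p.dest)


def run_strategy(tactics, queue):
    """Strategy: apply a set of tactics in order towards one goal."""
    for tactic in tactics:
        queue = tactic(queue)
    return queue


queue = [Packet(1, b"ab"), Packet(2, b"cd"), Packet(1, b"ef")]
result = run_strategy([group_by_dest, aggregate_small(limit=64)], queue)
```

The permutation only becomes useful because the aggregation follows it, which is the kind of tactic combination a strategy is meant to express.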



Packet headers

Giving up a little bit of raw efficiency to get much more flexibility

Opportunistic packet aggregation/permutation Inside a single packet flow Across multiple packet flows

Side effects Control packets

Rendez-vous ACKs

Piggybacking Multiplexing

Concurrent communication progression

Communication scheduling

The NIC is responsible for requesting work

Packets are built when the NIC is ready

The optimizer gets more time to gather up-to-date optimization clues
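The scheduling idea above, where packets are built only when the NIC asks for work, can be sketched with a toy scheduler. The `ToyScheduler` class, its `mtu` parameter and the plain concatenation used for aggregation are assumptions made for this example, not the Madeleine IV design.

```python
from collections import deque


class ToyScheduler:
    """Toy NIC-driven scheduler (hypothetical, for illustration only)."""

    def __init__(self, mtu: int):
        self.mtu = mtu
        self.pending = deque()  # payloads submitted by the application

    def submit(self, payload: bytes) -> None:
        """Application side: queue the data, build nothing yet."""
        self.pending.append(payload)

    def nic_ready(self):
        """Called when the NIC requests work: build one packet from whatever
        has accumulated so far, up to mtu bytes. A payload larger than mtu
        is passed through unchanged in this toy."""
        if not self.pending:
            return None
        packet = self.pending.popleft()
        while self.pending and len(packet) + len(self.pending[0]) <= self.mtu:
            packet += self.pending.popleft()
        return packet


s = ToyScheduler(mtu=8)
s.submit(b"aa")
s.submit(b"bbb")
s.submit(b"cccccc")
first = s.nic_ready()
second = s.nic_ready()
third = s.nic_ready()
```

Because nothing is built at submit time, the two short payloads that arrived while the NIC was busy end up in the same packet; the third payload does not fit and goes out on the next request.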

Tests

Test environment

Cluster of PC bi-Pentium IV HT 2.66 GHz, 1 GB MX / Myrinet

Testing procedure


Test – Latency

[Figure: latency (µs, log scale) versus packet size (4 to 2048 bytes) for MX, Mad3 and Mad4]

Test – Bandwidth

[Figure: bandwidth (MB/s, log scale) versus packet size (4 to 1048576 bytes) for MX, Mad3 and Mad4]

Test – Latency when aggregating short packets

[Figure: latency (µs, log scale) versus packet size (4 to 256 bytes) for Mad3 and Mad4]

Opportunistic aggregation on RDV

Aggregating a short packet with a RDV request for a long packet

No gain with MX/Myrinet

Madeleine III Latency: 310 µs Bandwidth: 201 MB/s

Madeleine IV Latency: 314 µs Bandwidth: 200 MB/s

MX flow control gets in the way

Conclusion

A new architecture for optimizing communication Wider optimization spectrum Better interactions between software and hardware

A platform for experimenting optimizations Optimization tactics

A prototype implemented on top of MX/Myrinet Proof of concept

On-going and future work

Optimization Tactic combinations Automatic strategy selection External strategies (plug-ins)

Interface expressiveness Extended packs One-sided communication

Load-balancing, multi-rail Benefit from all available links



Cluster architectures

Characteristics

A set of computers Regular off-the-shelf PCs

A "classical" network Slow Administration Service

A fast network Low latency High bandwidth Applications

[Figure: a cluster with its fast network and its slow network]

Three programming models

Programming environments

Message passing: PVM, MPI

Service invocation: RPC (Sun, OVM, PM2, etc.), RSR (Nexus), Java RMI, CORBA

Distributed shared memory: TreadMarks, DSMThreads, DSM-PM2

Each model has its use

Research theme

Interfacing programming environments with networking technologies

[Figure: application processes on top of programming environments (message passing; service invocation: RPC, RMI; distributed shared memory), a communication support layer marked "?", and networks (Ethernet, Myrinet, SCI, Quadrics, Infiniband)]

Features needed

A generic communication interface

Neutrality Independence with respect to the target programming model: message passing, service invocation (RPC, RMI), distributed shared memory

Portability Independence with respect to hardware: computing hardware, networking hardware

Efficiency Raw performance (latency, bandwidth, reactivity), application performance

Available solutions

High level network interfaces?

Example MPI

Advantages Portability, standardization Rich features Efficiency

The interface is not adapted to complex communication schemes Relations between pieces of data in a communication message? Lack of expressiveness

Problem example

Remote service invocation Request

Header: service descriptor Body: service arguments

First option – two messages

[Figure: first option. The client sends the request over an MPI connection as two MPI messages, one for the header and one for the body; the server receives the header, then the body]

Problem example

Remote service invocation Second option – one copy

In both cases, MPI is not expressive enough
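The cost of each option can be made concrete with a toy model. It does not use real MPI: `ToyLink` is a hypothetical stand-in that only records messages, and the byte counts simply reflect what the Python concatenation copies.

```python
class ToyLink:
    """Stand-in for an MPI-like connection; it only records messages."""

    def __init__(self):
        self.messages = []

    def send(self, data: bytes) -> None:
        self.messages.append(data)


def send_two_messages(link, header: bytes, body: bytes) -> int:
    """First option: no copy, but two separate messages."""
    link.send(header)
    link.send(body)
    return 0  # bytes copied by the sender


def send_one_copy(link, header: bytes, body: bytes) -> int:
    """Second option: one message, paid for with a copy."""
    request = header + body  # builds a new contiguous buffer
    link.send(request)
    return len(request)      # bytes copied by the sender


link_a, link_b = ToyLink(), ToyLink()
copied_a = send_two_messages(link_a, b"HDR:", b"argument")
copied_b = send_one_copy(link_b, b"HDR:", b"argument")
```

Neither variant lets the sender state that the header must be extracted first while the body may arrive later, which is the missing expressiveness the slide points at.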

[Figure: second option. Header and body travel from client to server in a single MPI message over the MPI connection, at the cost of one copy]

Available solutions (cont’d)

Low level interfaces?

Examples BIP, GAMMA, GM, SISCI, VIA

Advantages Efficiency Exploitation of hardware potential

Hardware dependency Limited abstraction level

Difficult development Limited potential for code reuse

Short-lived development?

Available solutions (end)

Middle-level communication interface?

Examples Nexus, Active Messages, Fast Messages

Advantages Abstraction Efficiency Relative portability

Neutrality? Expressiveness? Active message (or similar) programming model Unnecessary additional processing Problem of approach

Objective

Proposal for a generic middle-level communication interface

Independence with respect to programming environments Neutral programming model

Independence with respect to networking technology Performance portability

[Figure: programming environments Env 1 to Env n and networks Net 1 to Net m, with the layer between them marked "?"]

Objectives

Widen the scope of optimizations

Make it easy to implement and evaluate different strategies, in a portable way

Optimize network card activity NIC-driven transfers Transfer balancing across several NICs

New optimization tactics

Example 1 Two consecutive packets whose receive mode is express

Example 2 A packet that requires a rendez-vous has the express receive mode and is followed by a packet that does not need one

With Madeleine 3

Short-message aggregation tactic

With Madeleine 3

Rendez-vous aggregation tactic

Example

Send

begin_send(dest)

pack(data, long, r_express)

pack(index1, short, r_express)

pack(index2, short, r_express)

end_send()

Receive

begin_recv()

unpack(data, long, r_express)

unpack(index1, short, r_express)

unpack(index2, short, r_express)

total = data[index1] + data[index2]

end_recv()

[Figure: optimizer between the applications and the network, with send and receive paths, packets awaiting acknowledgement, unexpected packets and acks]

Static buffers

Buffer managers Filling

Drivers Allocation/free

Buffer managers Allocation/free Copy (when necessary) Aggregation by affinity

Dynamic buffers

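Aggregation

[Figure: sequential aggregation across TM1 and TM2, with flush points]

Aggregation

[Figure: half-sequential aggregation with TM 1, TM 2 and the main flow, with flush points]

Aggregation shape

[Figure: symmetrical versus non-symmetrical aggregation, showing the flush points on the send side and on the receive side]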

Special cases

Send_LATER / Receive_CHEAPER Automatic half-sequential aggregation

[Figure: half-sequential aggregation over TM 1, TM 2 and the main flow, up to End_packing]

Special cases

Send_LATER / Receive_EXPRESS Half sequential aggregation for everybody Send delayed until end_packing call

[Figure sequence: Send_LATER / Receive_EXPRESS. Pack on the send side and Unpack on the receive side; the receiver waits for the expected data while the send is delayed; at End_packing the buffer is filled and the transfer takes place]



Grids?

Heterogeneity

Grids

Idea

A grid

A computer An interconnected set of grids

Multi-cluster support

Exploiting clusters of clusters

Fast cluster networks Fast inter-cluster networks Network-level heterogeneity

[Figure: three high-performance networks]

Idea

Physical channels Related to a physical network Do not necessarily cover every node of the session

Virtual channels Cover every node of the session Contain one or more physical channels

[Figure: a virtual channel spanning a Myrinet channel and an SCI channel]

Integration

Generic transmission module Limited stack traversal on forwarding nodes
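Forwarding over a virtual channel implies that some node belonging to two physical channels relays the data. The sketch below computes such a path with a breadth-first search; the node names, the `route` function and the dictionary layout are assumptions made for the example and say nothing about how Madeleine actually builds its routing tables.

```python
from collections import deque


def route(physical_channels, src, dst):
    """Shortest node path from src to dst, or None if there is none.

    physical_channels maps a channel name to the set of nodes it covers;
    two nodes are neighbours when they share a physical channel.
    """
    neighbours = {}
    for nodes in physical_channels.values():
        for n in nodes:
            neighbours.setdefault(n, set()).update(nodes - {n})
    previous = {src: None}
    todo = deque([src])
    while todo:
        node = todo.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in sorted(neighbours.get(node, ())):
            if nxt not in previous:
                previous[nxt] = node
                todo.append(nxt)
    return None


# Hypothetical session: two clusters sharing one gateway node.
channels = {"sci": {"a0", "a1", "gw"}, "myrinet": {"gw", "b0", "b1"}}
path = route(channels, "a0", "b1")
```

In this example the only node present on both networks is `gw`, so every path between the two clusters goes through it and it is the node where forwarding takes place.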

[Figure: forwarding architecture. Interface, buffer management (BMM), generic TM, regular TMs and drivers; a forwarding module with its threads relays data between Network 1 and Network 2]

Bandwidth preservation

Pipeline Concurrent receive and re-send using two buffers

One copy Same buffer for receive and re-send

[Figure: Buffer 1 and Buffer 2 alternating between receive and re-send; LANai]

Deployment

Session spawning – Léonie

Sessions

Flexibility: multi-cluster, unified launch, grouped spawns

Extensibility: support for optimized distributed process launchers

Network

Information table generation: processes directory, routing tables for virtual channels

Ordering: NIC initializations, channel opening

Madeleine

Léonie

Virtual connections – latency SISCI+BIP

[Figure: latency (µs, log scale) versus packet size (bytes) for BIP+SISCI and SISCI+BIP, across Myrinet and SCI]

Virtual connections – bandwidth SISCI+BIP

[Figure: bandwidth (MB/s, log scale) versus packet size (4 to 1048576 bytes) for BIP+SISCI and SISCI+BIP]
