Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Upload: cordelia-quinn

Post on 04-Jan-2016

Page 1: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Madeleine

Olivier Aumage

Runtime Project, INRIA – LaBRI

Bordeaux, France

Page 2: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Objective

Rational task assignment in high-performance communication stacks

[Diagram: software stack, from top to bottom: application, programming environment, middle-level interface, low-level interface, network. Side labels: model, abstraction, hardware control.]

Page 3: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Madeleine

A communication support for clusters and multi-clusters

Page 4: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Features

Abstract interface

Programming by contract: specification of constraints, freedom for optimization

Active software support: dynamic optimization, adaptivity, transparency

Page 5: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Interface

Definitions

Connection: unidirectional point-to-point link, FIFO ordering

Channel: graph of connections, multiplexing unit, network virtualization

[Diagram labels: Connection, Process, Channel.]

Page 6: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Communication model

Characteristics

Model: message passing, incremental message building

Expressiveness: control of data blocks by flags, contract between the programmer and the interface

Page 7: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Primitives

Main commands

Send: mad_begin_packing, mad_pack, …, mad_pack, mad_end_packing

Receive: mad_begin_unpacking, mad_unpack, …, mad_unpack, mad_end_unpacking

Page 8: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Message building

Commands: mad_pack(cnx, buffer, len, pack_mode, unpack_mode), mad_unpack(cnx, buffer, len, pack_mode, unpack_mode)

Send contract options (send modes): Send_CHEAPER, Send_SAFER, Send_LATER

Receive contract options (receive modes): Receive_CHEAPER, Receive_EXPRESS

Constraints: strictly symmetrical pack/unpack sequences

Triplets (len, pack_mode, unpack_mode) identical for send and for receive; data consistency
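The symmetry constraint can be illustrated with a toy model. The sketch below is not the Madeleine API: the `Connection` class and its methods are hypothetical Python stand-ins that only record (len, pack_mode, unpack_mode) triplets in memory and reject an unpack sequence that does not mirror the pack sequence.

```python
from collections import deque

SEND_MODES = {"Send_CHEAPER", "Send_SAFER", "Send_LATER"}
RECEIVE_MODES = {"Receive_CHEAPER", "Receive_EXPRESS"}


class Connection:
    """Toy in-memory connection (hypothetical, not the real Madeleine API)."""

    def __init__(self):
        self._messages = deque()   # completed messages, FIFO order
        self._outgoing = None      # message currently being packed
        self._incoming = None      # message currently being unpacked

    def begin_packing(self):
        self._outgoing = []

    def pack(self, data, pack_mode, unpack_mode):
        assert pack_mode in SEND_MODES and unpack_mode in RECEIVE_MODES
        self._outgoing.append((len(data), pack_mode, unpack_mode, bytes(data)))

    def end_packing(self):
        self._messages.append(deque(self._outgoing))
        self._outgoing = None

    def begin_unpacking(self):
        self._incoming = self._messages.popleft()

    def unpack(self, length, pack_mode, unpack_mode):
        sent_len, sent_pack, sent_unpack, data = self._incoming.popleft()
        if (sent_len, sent_pack, sent_unpack) != (length, pack_mode, unpack_mode):
            raise ValueError("pack/unpack sequences are not symmetrical")
        # Simplification: the toy returns the bytes at once, whatever the mode.
        return data

    def end_unpacking(self):
        if self._incoming:
            raise ValueError("message has blocks that were never unpacked")
        self._incoming = None


def demo():
    cnx = Connection()
    cnx.begin_packing()
    cnx.pack((5).to_bytes(4, "big"), "Send_CHEAPER", "Receive_EXPRESS")  # header: body size
    cnx.pack(b"hello", "Send_CHEAPER", "Receive_CHEAPER")                # body
    cnx.end_packing()

    cnx.begin_unpacking()
    header = cnx.unpack(4, "Send_CHEAPER", "Receive_EXPRESS")
    size = int.from_bytes(header, "big")
    body = cnx.unpack(size, "Send_CHEAPER", "Receive_CHEAPER")
    cnx.end_unpacking()
    return size, body
```

Calling `demo()` builds a two-block message incrementally and reads it back with the same triplets; changing any length or mode on the receive side makes `unpack` raise.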

Page 9: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Send

[Timeline diagram: Pack, Modification, End_packing, compared for Send_SAFER, Send_LATER and Send_CHEAPER.]

Page 10: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Contract between the programmer and the interface

Send_SAFER / Send_LATER / Send_CHEAPER

Control of data transfer: amount of optimization

Programmer's promises: data consistency

Special services: delayed send, buffer reuse

Specification at the semantic level: independence between request and implementation
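A minimal sketch of how the three send contracts might differ, under an assumed reading of the mode names: Send_SAFER lets the buffer be reused right after the pack call, Send_LATER defers reading the buffer until the end of packing, and Send_CHEAPER relies on the programmer leaving the buffer untouched. The `Packer` class is a hypothetical model, not Madeleine code.

```python
class Packer:
    """Toy sender illustrating the three send contracts (hypothetical model)."""

    def __init__(self):
        self._blocks = []

    def pack(self, buffer, mode):
        if mode == "Send_SAFER":
            # The buffer may be reused right after pack(): snapshot it now.
            self._blocks.append(bytes(buffer))
        elif mode in ("Send_LATER", "Send_CHEAPER"):
            # Keep a reference only; the bytes are read at end_packing().
            # A real Send_CHEAPER could transmit at any earlier time instead.
            self._blocks.append(buffer)
        else:
            raise ValueError(mode)

    def end_packing(self):
        wire = b"".join(bytes(block) for block in self._blocks)
        self._blocks = []
        return wire


def demo():
    safer = bytearray(b"AAAA")
    later = bytearray(b"BBBB")
    cheaper = bytearray(b"CCCC")
    packer = Packer()
    packer.pack(safer, "Send_SAFER")
    packer.pack(later, "Send_LATER")
    packer.pack(cheaper, "Send_CHEAPER")
    safer[:] = b"xxxx"   # allowed: the data was protected at pack time
    later[:] = b"yyyy"   # allowed: the updated value is the one to send
    # cheaper is left untouched: that is the programmer's promise
    return packer.end_packing()
```

In this model `demo()` sends the original Send_SAFER bytes, the updated Send_LATER bytes and the unmodified Send_CHEAPER bytes.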

Page 11: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Receive

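[Timeline diagram: Unpack, After Unpack, End_unpacking. With Receive_EXPRESS the data is available right after Unpack; with Receive_CHEAPER availability is uncertain until End_unpacking, after which the data is available.]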

Page 12: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Message structuring

Receive_CHEAPER / Receive_EXPRESS

Receive_EXPRESS: mandatory immediate receive, for data needed to interpret or extract the message

Receive_CHEAPER: free reception of the block, for message contents
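The difference between the two receive modes can be sketched as follows, assuming that Receive_EXPRESS data must be usable as soon as the unpack call returns and that Receive_CHEAPER data is only guaranteed after the end of unpacking. The `Unpacker` class is a hypothetical model that reads from an in-memory byte string.

```python
class Unpacker:
    """Toy receiver illustrating the two receive contracts (hypothetical model)."""

    def __init__(self, wire):
        self._wire = wire
        self._offset = 0
        self._deferred = []   # (destination buffer, start offset) pairs

    def unpack(self, dest, mode):
        start = self._offset
        self._offset += len(dest)
        if mode == "Receive_EXPRESS":
            # Data must be available when unpack() returns.
            dest[:] = self._wire[start:start + len(dest)]
        elif mode == "Receive_CHEAPER":
            # Delivery may be postponed until end_unpacking().
            self._deferred.append((dest, start))
        else:
            raise ValueError(mode)

    def end_unpacking(self):
        for dest, start in self._deferred:
            dest[:] = self._wire[start:start + len(dest)]
        self._deferred = []


def demo():
    wire = (5).to_bytes(4, "big") + b"hello"
    unpacker = Unpacker(wire)
    header = bytearray(4)
    unpacker.unpack(header, "Receive_EXPRESS")
    size = int.from_bytes(header, "big")   # usable immediately
    body = bytearray(size)
    unpacker.unpack(body, "Receive_CHEAPER")
    before = bytes(body)                   # not guaranteed to hold the data yet
    unpacker.end_unpacking()
    return size, before, bytes(body)
```

The header is unpacked in express mode because its value is needed to size the body buffer; the body can use the cheaper mode since nothing reads it before the end of the message.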

Page 13: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Organization

Two-layered model: buffer management (reuse of data-processing code) and hardware abstraction

Modular approach: buffer management modules (BMM), drivers, transmission modules (TM)

[Diagram: the interface sits on top of a buffer management layer made of BMMs and a network management layer made of TMs grouped by driver, above the network.]

Page 14: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Drivers

Network management layer

Data transfers: send, receive, group transfers

Transfer method selection: choice function
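As an illustration of what a driver's choice function could look like, here is a minimal sketch. The `TransmissionModule` type, the two example modules and the 1024-byte threshold are assumptions made for the example; the slides do not give the actual selection criteria.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class TransmissionModule:
    name: str
    # Predicate on (len, pack_mode, unpack_mode); hypothetical signature.
    accepts: Callable[[int, str, str], bool]


def choose(modules: List[TransmissionModule], length, pack_mode, unpack_mode):
    """Choice function: return the first module that accepts the block."""
    for module in modules:
        if module.accepts(length, pack_mode, unpack_mode):
            return module
    raise LookupError("no transmission module accepts this block")


# Hypothetical driver with two transfer methods; the limit is an arbitrary value.
SHORT_LIMIT = 1024
EXAMPLE_DRIVER = [
    TransmissionModule("short", lambda n, p, u: n <= SHORT_LIMIT),
    TransmissionModule("long", lambda n, p, u: True),
]
```

For example, `choose(EXAMPLE_DRIVER, 16, "Send_CHEAPER", "Receive_EXPRESS")` selects the "short" module, while a 4096-byte block falls through to "long".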

Page 15: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Transmission modules

Depends on the network

One module per transfer method: GM driver, 2 TMs; BIP driver, 2 TMs; SCI driver, 3 TMs; VIA driver, 3 TMs

Associated with a buffer management module

Page 16: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Transmission modules

[Diagram labels: Process, Thread, Pack, Madeleine, Interface, BMM, TM, Network.]

Page 17: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Buffers

Generic management layer

Virtual buffers: static, dynamic

Groups: aggregations, splitting

Page 18: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Buffer management modules

Buffer type: static/dynamic

Aggregation mode: none, sequential aggregation, half-sequential aggregation

Aggregation shape: symmetrical/non-symmetrical

Page 19: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Implementation

Status

Network drivers: Quadrics, MX, GM, SISCI, MPI, TCP, VRP, VIA, UDP, SBP, BIP

Distribution: GPL licence

Availability

Linux: IA32, IA64, x86-64, Alpha, Sparc, PowerPC

Mac OS X: G4

Solaris: IA32, Sparc

AIX: PowerPC

Windows NT: IA32

Page 20: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Tests – current platform

Test environment

Cluster of dual-processor Pentium IV HT 2.66 GHz PCs, 1 GB

Networks: Gigabit Ethernet, SISCI/SCI, MX and GM/Myrinet, Quadrics Elan4

Testing procedure

Test: 1000 × (send + receive)

Result: ½ × average of 5 tests
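The testing procedure can be sketched as a ping-pong measurement. This is an assumed reading of the slide: each test times 1000 send + receive round trips, and the reported one-way latency is half the average over 5 tests. The `send`, `receive` and `clock` parameters are hypothetical callables supplied by the caller.

```python
import time
from statistics import mean


def run_test(send, receive, payload, iterations=1000, clock=time.perf_counter):
    """Mean round-trip time of `iterations` send + receive exchanges."""
    start = clock()
    for _ in range(iterations):
        send(payload)
        receive()
    return (clock() - start) / iterations


def one_way_latency(send, receive, payload, tests=5, iterations=1000,
                    clock=time.perf_counter):
    """Half of the average round-trip time over `tests` runs."""
    round_trips = [run_test(send, receive, payload, iterations, clock)
                   for _ in range(tests)]
    return 0.5 * mean(round_trips)


def bandwidth(payload_length, latency):
    """Bytes per second for a payload of `payload_length` bytes."""
    return payload_length / latency
```

Halving the round trip gives a one-way figure without requiring synchronized clocks on the two nodes.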

Page 21: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Latency

[Figure: latency (µs, log scale, 1 to 100) versus packet size (bytes, 4 to 8192) for Mad/SISCI, Mad/GM, Mad/MX and Mad/Quadrics.]

Page 22: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Bandwidth

[Figure: bandwidth (MB/s, log scale, 0.1 to 1000) versus transfer size (bytes) for Mad/SISCI, Mad/GM, Mad/MX and Mad/Quadrics.]

Page 23: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Tests – older platform

Testing environments

Cluster of dual-processor Pentium II 450 MHz PCs, 128 MB

Networks: Fast Ethernet, SISCI/SCI, BIP/Myrinet

Testing procedure

Test: 1000 × (send + receive)

Result: ½ × average of 5 tests

Page 24: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

SISCI/SCI – latency

[Figure: latency (µs, log scale, 1 to 100000) versus packet size (bytes) for Mad/SISCI and SISCI.]

Page 25: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

SISCI/SCI – bandwidth

[Figure: bandwidth (MB/s, log scale, 0.1 to 100) versus packet size (bytes) for Mad/SISCI and SISCI.]

Page 26: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

SISCI/SCI – latency, packs/messages

[Figure: latency (µs, log scale, 1 to 100000) versus packet size (bytes) for Mad/SISCI, comparing 2, 4, 8, 16, 32, 64, 128 and 256 messages with the same numbers of packs.]

Page 27: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

SISCI/SCI – bandwidth, packs/messages

[Figure: bandwidth (MB/s, log scale, 0.1 to 100) versus packet size (bytes) for Mad/SISCI, comparing 2, 4, 8, 16, 32, 64, 128 and 256 messages with the same numbers of packs.]

Page 28: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Users – MPICH/Madeleine

[Diagram: MPICH/Madeleine software stack, from top to bottom.]

MPI API, generic interface: point-to-point communication, collective communication, group building

Abstract Device Interface (ADI), generic interface: data type management, request queue management

Devices: SMP_PLUG (local communication), CH_SELF (local loops), CH_MAD (communication, polling loops, internal MPICH protocols)

Madeleine: communication, multi-protocol support

Networks: QSNET, TCP, UDP, BIP, MX, GM, SISCI

Page 29: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

MPICH/Mad/SCI – Latency

[Figure: latency (µs, log scale, 1 to 100000) versus packet size (bytes) for Mad, MPICH/Mad, SCI-MPICH and SCA MPI.]

Page 30: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

MPICH/Mad/SCI – bandwidth

[Figure: bandwidth (MB/s, log scale, 0.1 to 100) versus packet size (bytes) for Mad, MPICH/Mad, SCI-MPICH and SCA MPI.]

Page 31: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Users – Padico

[Diagram: Padico software stack, from top to bottom.]

Application

MPI, JVM, ORB

Circuit, VSock

Padico Task Manager (Padico Core): thread manager, Padico micro-kernel, Net Access

Marcel; Madeleine: communication, multi-protocol support

Networks: QSNET, TCP, UDP, BIP, MX, GM, SISCI

Page 32: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Padico – latency

[Figure: latency (µs, log scale, 1 to 10000) versus packet size (bytes) for Madeleine, VSock, MPI and CORBA.]

Page 33: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Padico – bandwidth

[Figure: bandwidth (MB/s, log scale, 0.01 to 1000) versus packet size (bytes) for Madeleine, VSock, MPI, CORBA and Java.]

Page 34: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Conclusion

Unified communication support

Abstract interface; contract-based programming; modular/adaptive architecture; dynamic optimization; transparent multi-cluster support

Page 35: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

On-going/future work

Programming interface: message structuring, exploitation of near-future information, reduction of pathological cases, fault tolerance

Communication sequence processing: code specialization, compilation

Session management: deployment, dynamicity, fault tolerance, scaling

Page 36: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

?

Madeleine I Madeleine II Madeleine III Madeleine IV

Page 37: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Some limitations of Madeleine (version III)

Objectives for a new Madeleine

Some optimizations are out of reach for Madeleine: the optimization range is too narrow

Need information about what is coming in the near future; need to be more liberal in allowing permutations in the packet flow

Optimization strategies involve too much work from the driver programmer

Need to share more of the strategic code; need to easily evaluate and even mix various strategies

Optimization sequences are synchronous with the application program

Need to synchronize optimization sequences with the NIC

Page 38: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Proposal: Madeleine IV

[Diagram: Madeleine IV architecture, with an optimizer thread (tracks, tactics, strategies, constraints, hardware-specific parameters), a sender thread, the driver and the network.]

Page 39: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Concepts

Definitions

Tracks: mapping of hardware multiplexing units (tags); a main track (control packets, small packets, …); optional auxiliary tracks (other traffic: large messages, …)

Tactics: basic optimization operations (permutation, aggregation, piggybacking, association, splitting, track change)

Strategies: set of tactics towards one optimization goal

Constraints: tactics compatibility, send/receive modes
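The relation between tactics and strategies can be sketched with plain functions. Everything below is a hypothetical model: a tactic is a function that rewrites a list of pending packets, a strategy is an ordered list of tactics, and the only constraint modelled is a maximum aggregated size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    dest: int
    payload: bytes


def permute_by_destination(packets):
    """Tactic: permutation. Groups packets per destination while keeping
    the original order inside each destination (FIFO per connection)."""
    groups = {}
    for packet in packets:
        groups.setdefault(packet.dest, []).append(packet)
    return [packet for group in groups.values() for packet in group]


def aggregate(limit):
    """Tactic factory: aggregation of neighbouring packets for the same
    destination, as long as the merged payload fits in `limit` bytes."""
    def tactic(packets):
        out = []
        for packet in packets:
            if (out and out[-1].dest == packet.dest
                    and len(out[-1].payload) + len(packet.payload) <= limit):
                out[-1] = Packet(packet.dest, out[-1].payload + packet.payload)
            else:
                out.append(packet)
        return out
    return tactic


def apply_strategy(strategy, packets):
    """Strategy: apply each tactic in turn to the pending packets."""
    for tactic in strategy:
        packets = tactic(packets)
    return packets
```

With the flow `[Packet(1, b"a"), Packet(2, b"b"), Packet(1, b"c")]`, the strategy `[permute_by_destination, aggregate(8)]` first brings the two packets for destination 1 together and then merges them, so two packets are sent instead of three.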

Page 40: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Proposal: Madeleine IV

[Diagram: Madeleine IV architecture, with an optimizer thread (tracks, tactics, strategies, constraints, hardware-specific parameters), a sender thread, the driver and the network.]

Page 41: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Packet headers

Giving up a little bit of raw efficiency to get much more flexibility

Opportunistic packet aggregation/permutation: inside a single packet flow, across multiple packet flows

Side effects: control packets (rendezvous, ACKs), piggybacking, multiplexing

Page 42: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Concurrent communication progression

Communication scheduling

The NIC is responsible for requesting work

Packets are built when the NIC is ready

The optimizer gets more time to gather up-to-date optimization clues
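A minimal sketch of this NIC-driven scheduling, under the assumption that the application only queues requests and that a packet is assembled when the NIC reports it is ready. The `Scheduler` class, its method names and the 64-byte packet limit are hypothetical.

```python
from collections import deque


class Scheduler:
    """Toy model: packets are built on NIC request, not on application request."""

    def __init__(self, max_packet=64):
        self.pending = deque()
        self.max_packet = max_packet

    def submit(self, payload):
        """Called by the application: only queues the request."""
        self.pending.append(payload)

    def on_nic_ready(self):
        """Called when the NIC asks for work: build one packet from the queue."""
        if not self.pending:
            return None
        packet = self.pending.popleft()
        # Everything that accumulated while the NIC was busy can be aggregated.
        while self.pending and len(packet) + len(self.pending[0]) <= self.max_packet:
            packet += self.pending.popleft()
        return packet
```

If three 16-byte requests are submitted while the NIC is busy, the next call to `on_nic_ready()` returns a single 48-byte packet, which is the sense in which a later decision point gives the optimizer more to work with.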

Page 43: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Tests

Test environment

Cluster of dual-processor Pentium IV HT 2.66 GHz PCs, 1 GB

Network: MX/Myrinet

Testing procedure

Test: 1000 × (send + receive)

Result: ½ × average of 5 tests

Page 44: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Test – Latency

[Figure: latency (µs, log scale, 1 to 100) versus packet size (bytes, 4 to 2048) for MX, Mad3 and Mad4.]

Page 45: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Test – Bandwidth

[Figure: bandwidth (MB/s, log scale, 0.1 to 1000) versus packet size (bytes, 4 to 1048576) for MX, Mad3 and Mad4.]

Page 46: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Test – Latency when aggregating short packets

[Figure: latency (µs, log scale, 1 to 1000) versus packet size (bytes, 4 to 256) for Mad3 and Mad4.]

Page 47: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Opportunistic aggregation on rendezvous (RDV)

Aggregating a short packet with an RDV request for a long packet

No gain with MX/Myrinet

Madeleine III: latency 310 µs, bandwidth 201 MB/s

Madeleine IV: latency 314 µs, bandwidth 200 MB/s

MX flow control gets in the way

Page 48: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Conclusion

A new architecture for optimizing communication: wider optimization spectrum, better interactions between software and hardware

A platform for experimenting with optimizations: optimization tactics

A prototype implemented on top of MX/Myrinet: proof of concept

Page 49: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

On-going and future work

Optimization: tactic combinations, automatic strategy selection, external strategies (plug-ins)

Interface expressiveness: extended packs, one-sided communication

Load balancing, multi-rail: benefit from all available links

Page 50: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Proposal: Madeleine IV

[Diagram: Madeleine IV architecture, with an optimizer thread (tracks, tactics, strategies, constraints, hardware-specific parameters), a sender thread, the driver and the network.]

Page 51: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France
Page 52: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Cluster architectures

Characteristics

A set of computers: regular off-the-shelf PCs

A "classical" network: slow; used for administration and services

A fast network: low latency, high bandwidth; used by applications

[Diagram: a cluster whose nodes are connected by both a fast network and a slow network.]

Page 53: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Three programming models

Programming environments

Message passing: PVM, MPI

Service invocation: RPC (Sun RPC, OVM, PM2, etc.), RSR (Nexus), Java RMI, CORBA

Distributed shared memory: TreadMarks, DSMThreads, DSM-PM2

Each model has its use

Page 54: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Research theme

Interfacing programming environments with networking technologies

[Diagram: application processes run on programming environments (message passing; service invocation: RPC, RMI; distributed shared memory), which rely on a communication support layer, marked "?", above the network (Ethernet, Myrinet, SCI, Quadrics, Infiniband).]

Page 55: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Features needed

A generic communication interface

Neutrality: independence with respect to the target programming model (message passing; service invocation: RPC, RMI; distributed shared memory)

Portability: independence with respect to hardware (computing hardware, networking hardware)

Efficiency: raw performance (latency, bandwidth, reactivity); application performance

Page 56: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Available solutions

High level network interfaces?

Example: MPI

Advantages: portability, standardization, rich features, efficiency

The interface is not suited to complex communication schemes: how can relations between pieces of data in a message be expressed? Lack of expressiveness

Page 57: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Problem example

Remote service invocation

Request: header (service descriptor) and body (service arguments)

First option – two messages

[Diagram: the client sends the request to the server over an MPI connection as two MPI messages, one carrying the header and one carrying the body.]

Page 58: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Problem example

Remote service invocation

Second option – one copy

[Diagram: the client copies the header and the body of the request into a single MPI message, sent to the server over the MPI connection.]

In both cases, MPI is not expressive enough

Page 59: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Available solutions (cont’d)

Low level interfaces?

Examples: BIP, GAMMA, GM, SISCI, VIA

Advantages: efficiency, exploitation of hardware potential

Hardware dependency; limited abstraction level

Difficult development; limited potential for code reuse

Short-lived development?

Page 60: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Available solutions (end)

Middle-level communication interface?

Examples: Nexus, Active Messages, Fast Messages

Advantages: abstraction, efficiency, relative portability

Neutrality? Expressiveness? Active message (or similar) programming model; unnecessary additional processing; problem of approach

Page 61: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Objective

Proposal for a generic middle-level communication interface

Independence with respect to programming environments: neutral programming model

Independence with respect to networking technology: performance portability

[Diagram, shown twice: programming environments Env 1 to Env n above networks Net 1 to Net m, with a "?" marking the layer between them.]

Page 62: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France
Page 63: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Objectives

Broaden the scope of optimizations

Make it easy to implement and evaluate different strategies, in a portable way

Optimize the activity of the network cards: transfers driven by the card, balancing of transfers across several cards

Page 64: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

New optimization tactics

Example 1: two consecutive packets whose receive mode is express

With Madeleine 3

Tactic: aggregation of short messages

Example 2: a packet that requires a rendezvous has the express receive mode and is followed by a packet that does not require one

With Madeleine 3

Tactic: rendezvous aggregation

Page 65: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Example

Send

begin_send(dest)
pack(data, long, r_express)
pack(index1, short, r_express)
pack(index2, short, r_express)
end_send()

Receive

begin_recv()
unpack(data, long, r_express)
unpack(index1, short, r_express)
unpack(index2, short, r_express)
total = data[index1] + data[index2]
end_recv()

Page 66: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

[Diagram: applications, optimizer and network, with a send side and a receive side; labels: packets to acknowledge, unexpected packets, Ack.]

Page 67: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France
Page 68: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Static buffers

Buffer managers: filling

Drivers: allocation/free

Page 69: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Dynamic buffers

Buffer managers: allocation/free, copy (when necessary), aggregation by affinity

Page 70: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Aggregation

Sequential

[Diagram labels: TM1, TM2, Flush.]

Page 71: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Aggregation

Half-sequential

[Diagram labels: TM 1, TM 2, Main, Flush.]

Page 72: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Aggregation shape

Symmetrical

Non-symmetrical

[Diagram: flush points on the send side and on the receive side for each shape.]

Page 73: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Special cases

Send_LATER / Receive_CHEAPER: automatic half-sequential aggregation

[Diagram labels: TM 1, TM 2, Main, End_packing.]

Page 74: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Special cases

Send_LATER / Receive_EXPRESS: half-sequential aggregation for everybody; send delayed until the end_packing call

[Animation frame: send side at Pack, receive side at Unpack.]

Page 75: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Special cases

Send_LATER / Receive_EXPRESS: half-sequential aggregation for everybody; send delayed until the end_packing call

[Animation frame: send side at Pack, receive side at Unpack.]

Page 76: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Special cases

Send_LATER / Receive_EXPRESS: half-sequential aggregation for everybody; send delayed until the end_packing call

[Animation frame: send side at Pack, receive side at Unpack; labels: expected data, delayed send.]

Page 77: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Special cases

Send_LATER / Receive_EXPRESS: half-sequential aggregation for everybody; send delayed until the end_packing call

[Animation frame: send side at Pack, receive side at Unpack; labels: expected data, delayed send.]

Page 78: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Special cases

Send_LATER / Receive_EXPRESS: half-sequential aggregation for everybody; send delayed until the end_packing call

[Animation frame: send side at End_packing, receive side at Unpack; labels: expected data, delayed send.]

Page 79: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Special cases

Send_LATER / Receive_EXPRESS: half-sequential aggregation for everybody; send delayed until the end_packing call

[Animation frame: send side at Fill, receive side at Unpack; labels: delayed send, expected data.]

Page 80: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Special cases

Send_LATER / Receive_EXPRESS: half-sequential aggregation for everybody; send delayed until the end_packing call

[Animation frame: transfer from the send side to the receive side.]

Page 81: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Tests – first part

Testing environments

Cluster of dual-processor Pentium II 450 MHz PCs, 128 MB

Networks: Fast Ethernet, SISCI/SCI, BIP/Myrinet

Testing procedure

Test: 1000 × (send + receive)

Result: ½ × average of 5 tests

Page 82: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France
Page 83: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Grids?

Heterogeneity

Page 84: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Grids

Idea

A grid is either a computer or an interconnected set of grids

Page 85: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Multi-cluster support

Exploiting clusters of clusters

Fast cluster networks; fast inter-cluster networks; network-level heterogeneity

[Diagram: several clusters, each with its own high-performance network, linked by a high-performance network.]

Page 86: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Idea

Physical channels: tied to a physical network; do not necessarily cover every node of the session

Virtual channels: cover every node of the session; contain one or more physical channels

[Diagram: a virtual channel spanning a Myrinet physical channel and an SCI physical channel.]
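One way to picture a virtual channel is as a routing table computed over its physical channels. The sketch below assumes that nodes sharing a physical channel can reach each other directly and that a node present on two physical channels can act as a gateway; the function name, the data layout and the breadth-first search are illustrative choices, not the actual Madeleine mechanism.

```python
from collections import deque


def routing_table(physical_channels, source):
    """Map each reachable node to (next hop, physical channel) from `source`.

    `physical_channels` maps a channel name to the list of nodes it covers.
    """
    # neighbours[n] lists (neighbour, channel name) pairs reachable in one hop.
    neighbours = {}
    for name, nodes in physical_channels.items():
        for a in nodes:
            for b in nodes:
                if a != b:
                    neighbours.setdefault(a, []).append((b, name))

    table = {}
    seen = {source}
    queue = deque()
    # Direct neighbours: the next hop is the destination itself.
    for node, channel in sorted(neighbours.get(source, [])):
        if node not in seen:
            seen.add(node)
            table[node] = (node, channel)
            queue.append(node)
    # Farther nodes inherit the first hop of the path that reaches them.
    while queue:
        node = queue.popleft()
        for nxt, _ in sorted(neighbours.get(node, [])):
            if nxt not in seen:
                seen.add(nxt)
                table[nxt] = table[node]
                queue.append(nxt)
    return table
```

With `{"sci": ["a", "b", "gw"], "myrinet": ["gw", "c", "d"]}`, node `a` reaches `b` and `gw` directly over SCI, and reaches `c` and `d` by first sending to `gw` over SCI.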

Page 87: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Integration

Generic transmission module: limited stack traversal on forwarding nodes

[Diagram: the layered architecture (interface, buffer management with BMMs, drivers with TMs, network) extended with a generic TM.]

Page 88: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Forwarding module

[Diagram: forwarding inside a gateway process between Network 1 and Network 2; labels: Madeleine, Interface, BMM, TM, Thread, Threads, Process.]

Page 89: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Bandwidth preservation

Pipeline: concurrent receive and re-send using two buffers

One copy: same buffer for receive and re-send

[Diagram: Buffer 1 and Buffer 2 alternate between receive and re-send; label: LANai.]
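The benefit of the two-buffer pipeline can be estimated with a simple timing model. This is a sketch under stated assumptions: the message is forwarded as equal fragments, each fragment takes a fixed time to receive and a fixed time to re-send, a fragment occupies one buffer from the start of its receive to the end of its re-send, and receives (like re-sends) are serialized. The function and parameter names are hypothetical.

```python
def forward_time(n_fragments, recv_time, send_time, n_buffers):
    """Completion time of forwarding `n_fragments` on a gateway."""
    recv_done = []   # time at which fragment i is fully received
    send_done = []   # time at which fragment i is fully re-sent
    for i in range(n_fragments):
        # A receive starts once the previous receive is over...
        start = recv_done[-1] if recv_done else 0.0
        # ...and once a buffer has been released by an earlier re-send.
        if i >= n_buffers:
            start = max(start, send_done[i - n_buffers])
        recv_done.append(start + recv_time)
        # A re-send starts once the fragment is here and the link is free.
        send_start = max(recv_done[i], send_done[-1] if send_done else 0.0)
        send_done.append(send_start + send_time)
    return send_done[-1]
```

With one buffer, receive and re-send alternate and 3 fragments at 1 time unit each way take 6 units; with two buffers the receive of one fragment overlaps the re-send of the previous one and the same transfer takes 4 units, so throughput approaches that of the slower of the two networks.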

Page 90: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Deployment

Session spawning – Léonie

Sessions: flexibility (multi-cluster), unified launch (grouped spawns), extensibility (support for optimized distributed process launchers)

Network: information table generation (process directory, routing tables for virtual channels), ordering (NIC initializations, channel opening)

[Diagram labels: Madeleine, Léonie.]

Page 91: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Virtual connections – latency, SISCI+BIP

[Figure: latency (µs, log scale, 10 to 100000) versus packet size (bytes) for BIP+SISCI and SISCI+BIP; diagram labels: Myrinet, SCI.]

Page 92: Madeleine Olivier Aumage Runtime Project INRIA – LaBRI Bordeaux, France

Virtual connections – bandwidth, SISCI+BIP

[Figure: bandwidth (MB/s, log scale, 0.1 to 100) versus packet size (bytes, 4 to 1048576) for BIP+SISCI and SISCI+BIP.]