towards realizing a dynamic and mpi application-aware ... · representation of communication...

31
The 26th Workshop on Sustained Simulation Performance (WSSP26) Towards Realizing a Dynamic and MPI Application-aware Interconnect with SDN Keichi Takahashi Cybermedia Center, Osaka University

Upload: others

Post on 15-Oct-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

Towards Realizing a Dynamic and MPI Application-aware Interconnect with SDN

Keichi Takahashi Cybermedia Center, Osaka University

Page 2: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

Introduction

‣ Why do we need an MPI Application-aware Interconnect?

SDN-accelerated MPI Primitives

‣ Is our idea feasible at all?

A Coordination Mechanism of Computation and Communication

‣ How do we reconfigure the interconnect in accordance with the execution of applications?

A Toolset for Analyzing Application-aware Dynamic Interconnects

‣ How will the proposed architecture perform with various types of applications and clusters?

Agenda

2

Page 3: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

Challenges in Future Interconnects

Over-provisioned designs might not scale well

‣ Interconnects can consume up to 50% of total power [1] and 1/3 of total budget of a cluster [2]

‣ Properties such as full bisection bandwidth and non-blocking may become increasingly difficult to achieve

‣ Need to improve the utilization of the interconnect

Our proposal is to adopt:

‣ Dynamic (adaptive) routing

‣ Application-awareness network control

3

[1] J. Kim et al.“Flattened Butterfly : A Cost-Efficient Topology for High-Radix Networks,” ISCA, vol. 35, no. 2, pp. 126–137, 2007. [2] D. Abts et al., “Energy proportional datacenter networks,” ACM SIGARCH Comput. Archit. News, vol. 38, no. 3, p. 338, 2010.

Page 4: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

Source of Inefficiency in Current Interconnect

4

Communication Pattern of Applications

Topology of the Interconnect

[3] S. Kamil et al. ,“Communication Requirements and Interconnect Optimization for High-End Scientific Applications,” IEEE Trans. Parallel Distrib. Syst., 2010.

Lower utilization Higher congestion Lower communication performance

[3]

Mismatch

Page 5: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

SDN-enhanced MPI Framework

Our prototype framework that integrates SDN into MPI

‣ Dynamically controls the interconnect based on the communication pattern of MPI applications

‣ Uses Software-Defined Networking (SDN) as a key technology to realize dynamic interconnect control (e.g. dynamic routing)

‣ Successfully accelerated several MPI primitives (e.g. MPI_Bcast and MPI_Allreduce)

5

Page 6: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

SDN | Software-Defined Networking

6

Feature

Control Plane

Data Plane

Conventional Networking

Southbound API (e.g. OpenFlow)

Northbound API

App

App App

Control Plane

Data Plane

Feature

Software Defined Networking

Disaggregation

Page 7: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

OpenFlow | Standard implementation of SDN

7

Src MAC Dst MAC … Instructions

aa:aa:aa:… ff:ff:ff:… Flood

bb:bb:bb:… aa:aa:aa:… OutputtoPortX

aa:aa:aa:… bb:bb:bb:… SetDstIPtoY,…

Src MAC Dst MAC … Instructions

aa:aa:aa:… ff:ff:ff:… Flood

bb:bb:bb:… aa:aa:aa:… OutputtoPortX

aa:aa:aa:… bb:bb:bb:… SetDstIPtoY,…

Src MAC Dst MAC … Instructions

aa:aa:aa:… ff:ff:ff:… Flood

bb:bb:bb:… aa:aa:aa:… OutputtoPortX

aa:aa:aa:… bb:bb:bb:… SetDstIPtoY,…

Control Plane

Data Plane

Add/Modify/Delete flow entries Inject packets into data planeNotify flow entry misses

Flow Table (Collection of flow entries)

OpenFlow Controller

OpenFlow Messages

Page 8: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

Basic Idea of SDN-enhanced MPI

8

Interconnect

Computing Nodes

1 2

3

0

Communication Pattern

0 1

2 3

Interconnect Control Seq.

Proactive reconfiguration using OpenFlow

Extract via Tracer/ProfilerStatic Analysis

Resource (Path, Bandwidth, etc.) Allocation

Page 9: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

Related Work

SDN-enhanced InfiniBand (Lee et al. SC16)

‣ Enhancement to InfiniBand that allows dynamic and per-flow level network control

Conditional OpenFlow (Benito et al. HiPC 2015)

‣ Enhanced OpenFlow that allows users to add flow entries that are activated when an Ethernet Pause (IEEE 802.3x) occurs

‣ Primary goal is to implement non-minimal adaptive routing on Ethernet

Quantized Congestion Notification Switch (Benito et al. HiPINEB 2017)

‣ Another enhancement to OpenFlow that uses received QCNs (802.1 Qau Quantized Congestion Notification) to probabilistically determine which path to select

9

Page 10: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

Introduction

‣ Why do we need an MPI Application-aware Interconnect?

SDN-accelerated MPI Primitives

‣ Can we accelerate MPI primitives based on our idea?

A Coordination Mechanism of Computation and Communication

‣ How do we reconfigure the interconnect in accordance with the execution of applications?

A Toolset for Analyzing Application-aware Dynamic Interconnects

‣ How will our idea on various types of applications and clusters?

Agenda

10

Page 11: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

MPI broadcast leveraging hardware-multicast

‣ Multicast rules are dynamically installed using OpenFlow

‣ Considers background traffic from other jobs to construct optimal delivery tree

SDN-accelerated MPI_Bcast

11

SW1 SW2

SW3 SW4 SW5 SW6

P0 P1 P2

P3

P0

SW3

SW2 SW6

P0 P0 P0

Delivery Tree

OpenFlow Ctrl.

Page 12: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

Introduction

‣ Why do we need an MPI Application-aware Interconnect?

SDN-accelerated MPI Primitives

‣ Is our idea feasible at all?

A Coordination Mechanism of Computation and Communication

‣ How do we reconfigure the interconnect in accordance with the execution of applications?

A Toolset for Analyzing Application-aware Dynamic Interconnects

‣ How will our idea on various types of applications and clusters?

Agenda

12

Page 13: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

Synchronizing Computation and Communication

13

#include<mpi.h>

intmain(){MPI_Init(&argc,&argv);

MPI_Bcast(buf,count,…);

/*…*/

MPI_Allreduce(buf,count,…);

MPI_Finalize();}

Src MAC Dst MAC … Instructions

aa:aa:aa:… ff:ff:ff:… Flood

bb:bb:bb:… aa:aa:aa:… OutputtoPortX

aa:aa:aa:… bb:bb:bb:… SetDstIPtoY,…

Time-varying Communication Pattern

Reconfiguration of the Interconnect

How do we synchronize these two?

Exec

utio

n

Page 14: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

Our Idea: Embed encoded MPI envelope into each packet

‣ Current implementation uses virtual MAC addresses to represent tags

UnisonFlow

14

MPI PacketTagCustom Kernel Module

Kernel Space

User Space

MPI Library

MPI Application

ioct

l

MPI Packet

Tag Instructions

A Output to port X

B Output to port Y

… …

MPI

MPI Packet

TagEmbed Packet Flow Controlled

Based on Tag Value

MPI Envelope: Rank, Primitive Type,Communicator, etc.

[5] Keichi Takahashi, "Concept and Design of SDN-enhanced MPI Framework", EWSDN 2015

Page 15: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

Introduction

‣ Why do we need an MPI Application-aware Interconnect?

SDN-accelerated MPI Primitives

‣ Is our idea feasible at all?

A Coordination Mechanism of Computation and Communication

‣ How do we reconfigure the interconnect in accordance with the execution of applications?

A Toolset for Analyzing Application-aware Dynamic Interconnects

‣ How will our idea on various types of applications and clusters?

Agenda

15

Page 16: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

PEs PEs PEs

Need for a Holistic Analysis in SDN-enhanced MPI

16

0 1 2 3 4 5

2

4

5

10

3

Job Queue

j1j2j3j4

Communication Pattern

Job Scheduling Node Selection

Process Mapping

Routing

Page 17: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

Q1: Impact of the Communication Pattern

How does the traffic load in the interconnect change for diverse applications?

‣ What kind of application benefits most from SDN-enhanced MPI?

‣ What happens if the number of processes scales out?

17

Page 18: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

Q2: Impact of the Cluster Configuration

How does the traffic load in the interconnect change under diverse clusters with different configurations?

‣ How do job scheduling, node selection and process mapping affect the performance of applications?

‣ How does the topology of the interconnect impact the performance?

‣ What happens if the size of cluster scales out?

18

0

Node Selection (i.e. which node should be allocated to a given job?)

Process Placement (i.e. which node should execute a process?)

j1j2j3j41 2 3

Job Scheduling (i.e. which job should be executed next?)

Page 19: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

We aim to develop a toolset to help answer these questions

‣ How does the traffic load in the interconnect change for diverse applications?

‣ How does the traffic load in the interconnect change under diverse clusters with different configurations?

Simulator-based approach is taken to allow rapid assessment

‣ Requirements for the toolset are summarized as:

Requirements for the Interconnect Analysis Toolset

19

1. Support for application-aware dynamic routing

2. Support for communication patterns of real-world applications

3. Support for diverse cluster configurations

Page 20: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

PFProf (profiler) and PFSim (simulator) constitute PFAnalyzer [6]

‣ PFProf

- Fine-grained MPI profiler for observing network activity caused by MPI function calls (Requirement 2)

‣ PFSim

- Lightweight simulator to simulate traffic load in the interconnect targeting application-aware dynamic interconnects(Requirement 1, 2, 3)

Overview of PFAnalyzer

20

PFSimPFProfApplication

Profile Result

[6] Keichi Takahashi et al., "A Toolset for Analyzing Application-aware Dynamic Interconnects", HPCMASPA 2017

Page 21: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

PFProf: Motivation

Existing profilers do not capture the underlying pt2pt communication of collective communication

‣ They are designed to support code tuning and optimization, not network traffic analysis.

‣ MPI Profiling Interface (PMPI) only captures individual MPI function calls.

21

1 2 3 4 5 6 7

1

4

3

2

5

7

6

Actual communication performed

0

0

Behavior of MPI_Bcast as seen from applications

21

Page 22: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

PFProf: Implementation

MPI Performance Revealing Extension Interface (PERUSE) is utilized

‣ PERUSE exposes internal information of MPI library

‣ Notifies you when a request is posted/completed, a transfer begins/ends, etc.

22

PFProf

MPI Application

MPI Library

• MPI_Init• MPI_Finalize• MPI_Comm_create• MPI_Comm_dup• MPI_Comm_free

Call MPI Functions

Notify PERUSE Events

Subscribe to PERUSE Events

Hook MPI Functions with PMPI

Page 23: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

Representation of Communication Pattern

‣ Defined as a matrix T of which element Tij is equal to the volume of traffic sent from rank i to rank j

‣ Implies that the volume of traffic between processes as constant during the execution of a job

23

0 50 100

Sender Rank

0

25

50

75

100

125

Rec

eive

rRan

k

0.0

0.2

0.4

0.6

0.8

1.0

Sen

tB

ytes

⇥108

An example obtained from running the NERSC MILC benchmark with 128

processes

The communication pattern of an application is represented using its traffic matrix

Page 24: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

PFProf: Overhead Evaluation

24

101 103 105 107

Message Size [B]

0.0

0.2

0.4

0.6

0.8

1.0

Rel

ativ

eT

hro

ugh

put

100

101

102

Thro

ugh

put

[MB

/s]

w/o profiler

w/ profiler

101 103 105 107

Message Size [B]

0.0

0.2

0.4

0.6

0.8

1.0

1.2

Rel

ativ

eLat

ency

102

103

104

Lat

ency

[µs]

w/o profiler

w/ profiler

Throughput (osu_bw) Latency (osu_latency)

Measured throughput and latency of pt2pt communication with and without PFProf using the OSU Microbenchmark

Page 25: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

PFSim: Overview

25

PFSim InterconnectUsage

PerformanceMetric Plot

SimulationLog

Output

SimulationScenario

Cluster Configuration

Communication Patterns

Input

Scheduling

Plugin

Node Selection

Process Placement

RoutingPFProf

For Requirement 3

For Requirement 1

For Requirement 3

Cluster Topology

For Requirement 2

Page 26: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

PFSim: Architecture

26

Event Event

Event Queue

j1j2j3j4

Job Submitted

Event Handlers

Job Started

Job Finished

Job Queue

Simulator State

Update

Interconnect

Computing Nodes

Event

Dispatch

Customized via Plugins

Page 27: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

PFSim: Example Input & Output

27

topology:topologies/milk.graphmloutput:output/milk-cg-dmodkalgorithms:scheduler:-pfsim.scheduler.FCFSSchedulernode_selector:-pfsim.node_selector.LinearNodeSelector-pfsim.node_selector.RandomNodeSelectorprocess_mapper:-pfsim.process_mapper.LinearProcessMapper-pfsim.process_mapper.CyclicProcessMapperrouter:-pfsim.router.DmodKRouter-pfsim.router.GreedyRouter-pfsim.router.GreedyRouter2jobs:-submit:distribution:pfsim.math.ExponentialDistributionparams:lambd:0.1trace:traces/cg-c-128.tar.gz

Cluster Configuration (YAML)Interconnect Utilization

(Output GraphML visualized with Cytoscape)

High Traffic Load

Less Traffic Load

Page 28: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

Simulated Configurations

28

Node Selection Process Placement Routing

Linear

Random

Linear

Cyclic

D-mod-K

Dynamic

0 1 2 3 4 5

0 3 1 4 2 5

Path selected solely based on the destination of flow

0 50 100

Sender Rank

0

25

50

75

100

125

Rec

eive

rRan

k

0.0

0.2

0.4

0.6

0.8

1.0

Sen

tByt

es

⇥108

Path allocated based on communication pattern (heavy pairs first)

Page 29: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

Simulation Results

Maximum traffic load on all links is plotted as a performance indicator

29

Linear

/Bloc

k/Dmod

K

Linear

/Bloc

k/Dyna

mic

Linear

/Cycl

ic/Dmod

K

Linear

/Cycl

ic/Dyna

mic

Random

/Bloc

k/Dmod

K

Random

/Bloc

k/Dyna

mic

Random

/Cycl

ic/Dmod

K

Random

/Cycl

ic/Dyna

mic0.0

0.5

1.0

1.5

2.0

2.5

Max

imum

Tra

�c

(Nor

mal

ized

)

Linear

/Bloc

k/Dmod

K

Linear

/Bloc

k/Dyna

mic

Linear

/Cycl

ic/Dmod

K

Linear

/Cycl

ic/Dyna

mic

Random

/Bloc

k/Dmod

K

Random

/Bloc

k/Dyna

mic

Random

/Cycl

ic/Dmod

K

Random

/Cycl

ic/Dyna

mic0.00

0.25

0.50

0.75

1.00

1.25

1.50

Max

imum

Tra

�c

(Nor

mal

ized

)

NAS CG Benchmark (128 ranks) NERC MILC Benchmark (128 ranks)

D-mod-K Dynamic

Page 30: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

Further Challenges

Simulation-based study of large-scale clusters with different topologies

‣ Currently, our institution owns only a small-scale experimental clusteremployed with SDN

Integrate interconnect controller with scheduler and MPI runtime

‣ To support multiple jobs running in parallel

‣ To investigate the effect of node allocation and process placement

Better application-aware routing algorithms

‣ Currently, a simple greedy like algorithm is used

‣ How about optimization or machine learning?

30

Page 31: Towards Realizing a Dynamic and MPI Application-aware ... · Representation of Communication Pattern ‣ Defined as a matrix T of which element T ij is equal to the volume of traffic

The 26th Workshop on Sustained Simulation Performance (WSSP26)

Summary

Current static and over-provisioned interconnects might not scale well

‣ SDN allows us to build a more dynamic and application-aware interconnects

‣ Such architecture could improve the utilization of the interconnect and communication performance

Our achievements so far include:

‣ SDN-accelerated MPI primitives such as Bcast and Allreduce

‣ UnisonFlow, a coordination mechanism of computation and communication

‣ PFAnalyzer, a toolset for analyzing application-aware dynamic interconnects

31