An SDN-Based Application-Aware Network Provisioning Platform
TRANSCRIPT
Application Aware SDN Network
Provisioning Platform
• YARN Architecture and Concept
• When SDN Meets Big Data & Cloud
• Service Profile based SDN Platform
• Open Discussion
Agenda
YARN ARCHITECTURE
AND CONCEPT
1st Generation Hadoop: Batch Focus
HADOOP 1.0: Built for Web-Scale Batch Apps
Single App: BATCH (HDFS)
Single App: INTERACTIVE
Single App: BATCH (HDFS)
All other usage patterns MUST leverage the same infrastructure
Forces creation of silos to manage mixed workloads:
Single App: BATCH (HDFS)
Single App: ONLINE
Hadoop 1 Architecture
JobTracker
Manage Cluster Resources & Job Scheduling
TaskTracker
Per-node agent
Manage Tasks
Hadoop 1 Limitations
Scalability
Max Cluster size ~5,000 nodes
Max concurrent tasks ~40,000
Coarse Synchronization in JobTracker
Availability
Failure Kills Queued & Running Jobs
Hard partition of resources into map and reduce slots
Non-optimal Resource Utilization
Lacks Support for Alternate Paradigms and Services
Iterative applications in MapReduce are 10x slower
YARN (Yet Another Resource Negotiator)
• Apache Hadoop YARN is a cluster
management technology;
• One of the key features in second-
generation Hadoop;
• Next-generation compute and
resource management framework in
Apache Hadoop;
Hadoop 2 - YARN Architecture
ResourceManager (RM)
Manages and allocates cluster resources
Central agent
NodeManager (NM)
Manage Tasks, Enforce Allocations
Per-Node Agent
Diagram labels: ResourceManager, Client (Job Submission), NodeManagers, Containers, App Mstr, Node Status, Resource Request, MapReduce Status
Data Processing Engines Run Natively IN Hadoop
BATCH MapReduce
INTERACTIVE Tez
STREAMING Storm, S4, …
GRAPH Giraph
MICROSOFT REEF
SAS LASR, HPA
ONLINE HBase
OTHERS
Apache YARN
HDFS2: Redundant, Reliable Storage
YARN: Cluster Resource Management
Flexible: enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming
Efficient: double processing IN Hadoop on the same hardware while providing predictable performance & quality of service
Shared: provides a stable, reliable, secure foundation and shared operational services across multiple workloads
The Data Operating System for Hadoop 2.0
5 Key Benefits of YARN
1. New Applications & Services
2. Improved cluster utilization
3. Scale
4. Experimental Agility
5. Shared Services
Key Improvements in YARN
Framework supporting multiple applications
– Separate generic resource brokering from application logic
– Define protocols/libraries and provide a framework for custom
application development
– Share same Hadoop Cluster across applications
Application Agility and Innovation
– Using Protocol Buffers for RPC gives wire compatibility
– MapReduce becomes an application in user space, unlocking safe innovation
– Multiple versions of an app can co-exist leading to
experimentation
– Easier upgrade of framework and applications
Key Improvements in YARN
Scalability
– Removed complex app logic from RM, scale further
– State machine, message passing based loosely coupled design
Cluster Utilization
– Generic resource container model replaces fixed Map/Reduce
slots. Container allocations based on locality, memory (CPU
coming soon)
– Sharing cluster among multiple applications
Reliability and Availability
– Simpler RM state makes it easier to save and restart (work in
progress)
– Application checkpoint can allow an app to be restarted.
MapReduce application master saves state in HDFS.
YARN as Cluster Operating System
Diagram: the ResourceManager (Scheduler) allocates containers across NodeManagers for concurrent applications: Batch (map 1.1, map 1.2, reduce 1.1), Interactive SQL (vertex 1.1.1, vertex 1.1.2, vertex 1.2.1, vertex 1.2.2), and Real-Time (nimbus0, nimbus1, nimbus2)
YARN APIs & Client Libraries
Application Client Protocol: Client to RM interaction
–Library: YarnClient
–Application Lifecycle control
–Access Cluster Information
Application Master Protocol: AM – RM interaction
–Library: AMRMClient / AMRMClientAsync
–Resource negotiation
–Heartbeat to the RM
Container Management Protocol: AM to NM interaction
–Library: NMClient/NMClientAsync
–Launching allocated containers
–Stop Running containers
Use external frameworks like Weave/REEF/Spring
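To make the client-side protocol concrete, here is a minimal, hedged sketch of using the YarnClient library to query cluster information and submit an application; the application name, resource sizes, and the placeholder launch command are illustrative and not from the talk.

```java
// Minimal sketch of the Application Client Protocol via the YarnClient library,
// assuming a standard Hadoop 2.x client classpath and configuration.
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.api.records.*;

public class SubmitSketch {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Access cluster information through the same protocol.
    System.out.println("NodeManagers: "
        + yarnClient.getYarnClusterMetrics().getNumNodeManagers());

    // Application lifecycle control: ask the RM for an application id,
    // fill in the submission context, and submit.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("demo-app");                      // illustrative name
    ctx.setResource(Resource.newInstance(1024, 1));          // AM memory (MB) and vcores
    ctx.setAMContainerSpec(ContainerLaunchContext.newInstance(
        null, null, java.util.Collections.singletonList("echo hello"), null, null, null));
    ApplicationId appId = yarnClient.submitApplication(ctx);

    System.out.println("Submitted " + appId + ", state: "
        + yarnClient.getApplicationReport(appId).getYarnApplicationState());
    yarnClient.stop();
  }
}
```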
YARN Application Flow
Diagram: the Application Client uses YarnClient (Application Client Protocol) to talk to the ResourceManager and an application-specific API to talk to its Application Master; the Application Master uses AMRMClient (Application Master Protocol) toward the ResourceManager and NMClient (Container Management Protocol) toward the NodeManager, which hosts the App Container
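Below is a hedged sketch of the other two legs of this flow: an Application Master registering with the ResourceManager via AMRMClient, negotiating one container, and launching it through NMClient. The container size, priority, and the placeholder command are assumptions for illustration; real AM credentials and error handling are omitted.

```java
// Illustrative sketch of the Application Master Protocol (AMRMClient) and the
// Container Management Protocol (NMClient) on a Hadoop 2.x classpath.
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.*;

public class AppMasterSketch {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();

    // Register with the ResourceManager; heartbeats happen on allocate().
    AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
    rm.init(conf);
    rm.start();
    rm.registerApplicationMaster("", 0, "");

    NMClient nm = NMClient.createNMClient();
    nm.init(conf);
    nm.start();

    // Resource negotiation: ask for one 1 GB / 1 vcore container anywhere.
    Priority pri = Priority.newInstance(0);
    Resource cap = Resource.newInstance(1024, 1);
    rm.addContainerRequest(new ContainerRequest(cap, null, null, pri));

    int launched = 0;
    while (launched < 1) {
      AllocateResponse resp = rm.allocate(0.1f);   // heartbeat + pick up allocations
      for (Container c : resp.getAllocatedContainers()) {
        // Container Management Protocol: launch the allocated container on its NM.
        ContainerLaunchContext clc = ContainerLaunchContext.newInstance(
            null, null, java.util.Collections.singletonList("sleep 30"), null, null, null);
        nm.startContainer(c, clc);
        launched++;
      }
      Thread.sleep(1000);
    }

    rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", null);
  }
}
```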
YARN Best Practices
Use provided Client libraries
Resource Negotiation
–You may ask but you may not get what you want - immediately.
–Locality requests may not always be met.
–Resources like memory/CPU are guaranteed.
Failure handling
–Remember, anything can fail (or YARN can preempt your containers)
–AM failures handled by YARN but container failures handled by the
application.
Checkpointing
–Check-point AM state for AM recovery.
–If tasks are long running, check-point task state.
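As a rough illustration of the checkpointing advice (the slides note that the MapReduce AM saves its state in HDFS), the following sketch persists and recovers an opaque AM state blob via the HDFS FileSystem API; the checkpoint path and the payload format are hypothetical.

```java
// Hedged sketch of saving/restoring AM state in HDFS for recovery after an AM restart.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckpointSketch {
  private static final Path CKPT = new Path("/app/checkpoints/am-state"); // hypothetical location

  // Persist a serialized snapshot of the AM's progress so a restarted attempt can resume.
  static void saveCheckpoint(Configuration conf, byte[] state) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    try (FSDataOutputStream out = fs.create(CKPT, true)) {
      out.write(state);
    }
  }

  static byte[] loadCheckpoint(Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    if (!fs.exists(CKPT)) return null;                 // first attempt: nothing to recover
    byte[] buf = new byte[(int) fs.getFileStatus(CKPT).getLen()];
    try (FSDataInputStream in = fs.open(CKPT)) {
      in.readFully(buf);
    }
    return buf;
  }
}
```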
YARN Best Practices
Cluster Dependencies
–Try to make zero assumptions on the cluster.
–Your application bundle should deploy everything required using YARN's local resources (see the sketch after this slide).
Client-only installs if possible
–Simplifies cluster deployment, and multi-version support
Securing your Application
–YARN does not secure communications between the AM and its
containers.
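Relating to the "deploy everything via YARN's local resources" advice above, here is a hedged sketch that stages an application jar into HDFS and describes it as a LocalResource for container localization; the local and HDFS paths are hypothetical.

```java
// Sketch of shipping the application bundle via YARN local resources (Hadoop 2.x APIs).
import java.util.Collections;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class LocalResourceSketch {
  // Upload the app jar to HDFS and describe it as a LocalResource that YARN
  // will localize onto every node that runs one of our containers.
  static Map<String, LocalResource> appJarResource(Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    Path src = new Path("my-app.jar");                     // hypothetical local bundle
    Path dst = new Path("/apps/demo/my-app.jar");          // hypothetical HDFS staging path
    fs.copyFromLocalFile(false, true, src, dst);

    FileStatus st = fs.getFileStatus(dst);
    LocalResource jar = LocalResource.newInstance(
        ConverterUtils.getYarnUrlFromPath(dst),
        LocalResourceType.FILE,
        LocalResourceVisibility.APPLICATION,
        st.getLen(), st.getModificationTime());
    return Collections.singletonMap("my-app.jar", jar);    // key = name in the container's cwd
  }
}
```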
YARN Future Work
ResourceManager High Availability and Work-preserving restart
–Work-in-Progress
Scheduler Enhancements
–SLA Driven Scheduling, Low latency allocations
–Multiple resource types – disk/network/GPUs/affinity
Rolling upgrades
Long running services
–Better support for running services like HBase
–Discovery of services, upgrades without downtime
More utilities/libraries for Application Developers
–Failover/Checkpointing
• http://hadoop.apache.org
When SDN Meets Big Data
and Cloud Computing
Challenge Using Big Data & Cloud for
SDN
• The tools are not available yet
• Do we need standards?
• Once you've mined big data, then
what?
Different types of traffic in
Hadoop Clusters
• Background Traffic
– Bulk transfers
– Control messages
• Active Traffic (used by jobs)
– HDFS read/writes
– Partition-Aggregate traffic
Typical Traffic Patterns
– Patterns used by Big Data Analytics
– You can optimize specifically for these
Diagram: typical patterns among Map tasks, Reduce tasks, and HDFS: Shuffle, Broadcast, Incast
Approach Optimizing the
Network to Improve Performance
• Helios, Hedera, MicroTE, c-thru
– Congestion leads to bad performance
– Eliminate congestion
Gather Network Demand
Determine paths with minimal congestion
Install New paths
Disadvantage of Existing
Approach
• Demand gathering at the network is ineffective
– Assumes that past demand will predict future demand
– Many small jobs in the cluster, so prediction is ineffective
• May require expensive instrumentation to gather
– Switch modifications
– Or end host modification to gather information
Application Aware Run Time
Network Configuration Practice
• Topology construction and routing for
aggregation, shuffling, and overlapping
aggregation traffic patterns;
Traffic Demand Estimation
Network-aware Job Scheduling
Topology and Routing
Integrated Network Control for
Big Data Applications
Examples of Topology and Routing
How Can That Be Done?
• Reactively
o Job tracker places the task; it knows the locations
• Check the Hadoop logs for the locations
• Modify the job tracker to directly inform application
• Proactively
o Have the SDN controller tell the job tracker where to
place the end-points
• Rack aware placement: reduce inter-rack transfers
• Congestion aware placement: reduce loss
Reactive Approach
• A reactive attempt to integrate big data + SDN
– No changes to the application
– Learn information by looking at logs to determine file sizes and end-points
– Learn information by running agents on the end hosts that determine start times
Reactive Architecture
Excerpt shown on the slide (from the FlowComb paper):

Table 1: Average job processing time increases as the network capacity decreases.
Link capacity (Mbps)   Avg. processing time (min)
100                    39
50                     53 (x1.3)
25                     67 (x1.7)
10                     146 (x3.7)
The results represent averages over 10 runs of sorting a 10 GB workload on a 14-node Hadoop cluster. The network topology is presented in Figure 2. The numbers within parentheses represent increases from the baseline 100 Mbps network.

Processing time grows from 39 to 146 min (almost four times) when the link capacity is reduced from 100 Mbps to 10 Mbps. The results, summarized in Table 1, match intuition and indicate that finding paths with unused capacity in the network and redirecting congested flows along these paths could improve performance.

2.3 Our goal
As the network plays an important role in the performance of distributed data processing, it is crucial to tune it to the demands of applications. Obtaining accurate demand information is difficult [5]. Requiring users to specify the demand is unrealistic because changes in demands may be unknown to users. Instrumenting applications to give the instant demand is better but is intrusive and deters deployment because it requires modifications to applications [11]. Finally, inferring demand from switch counters [3] does not place any burden on the user or application but gives only current and past statistics without revealing future demand. In addition, polling switch counters must be carefully scheduled to maintain scalability, which may lead to stale information.
Our goal is to build a network management platform for distributed data processing applications that is both transparent to the applications and quick and accurate in detecting their demand. We propose to use application domain knowledge to detect network transfers (possibly before they start) and software-defined networking to update the network paths to support these transfers without creating congestion. Our vision bridges the gap between the network and the application by introducing a middle layer that collects demand information transparently and scalably from both the application (data transfers) and the network (current network utilization) and adapts the network to the needs of the application.

3 Design
FlowComb improves job processing times and averts network congestion in Hadoop MapReduce clusters by predicting network transfers and scheduling them dynamically on paths with sufficient available bandwidth. Figure 1 highlights the three main components of FlowComb: flow prediction, flow scheduling, and flow control (Predictor, Scheduler, and Controller, with Agents on the Hadoop cluster).

3.1 Prediction
FlowComb detects data transfers between nodes in a Hadoop cluster using domain knowledge about the interaction between Hadoop components.
Hadoop operation. When a map task finishes, it writes its output to disk and notifies the job tracker, which in turn notifies the reduce tasks. Each reduce task then retrieves from the mapper the data corresponding to its own key space. However, not all transfers start immediately. To avoid overloading the same mapper with many simultaneous requests and burdening themselves with concurrent transfers from many mappers, reducers start a limited number of transfers (5 by default). When a transfer ends, the reducer starts retrieving data from another mapper chosen at random. Hadoop makes information about the transfer (e.g., source, destination, volume) available through its logs or through a web-based API.
Agents. To obtain information about data transfers without modifying Hadoop, software agents are installed on each server in the cluster. An agent performs two simple tasks: 1) it periodically scans Hadoop logs and queries Hadoop nodes to find which map tasks have finished and which transfers have begun (or already finished), and 2) it sends this information to FlowComb's flow scheduling module. To detect the size of a map output, an agent learns the IDs of the local mappers from the job tracker and queries each mapper using the web API. Essentially, the agent performs the same sequence of calls as a reducer that tries to obtain information about where to retrieve data. In addition, the agent scans the local Hadoop logs to learn whether a transfer has already started.
• Agents on servers
– Detect start/end of map tasks
– Detect start/end of transfers
• Predictor
– Determines size of intermediate data (queries mappers via API)
– Aggregates information from agents and sends it to the scheduler
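A purely illustrative sketch of the agent idea described above: scan Hadoop logs for finished map tasks and report the expected shuffle transfers so that the predictor/scheduler can pre-provision paths. The log location, the log-line pattern, and the reporting step are assumptions, not FlowComb's actual implementation.

```java
// Hypothetical log-scanning agent for the reactive approach.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShuffleAgentSketch {
  // Hypothetical pattern for a "map task finished" log line carrying task id and output size.
  private static final Pattern MAP_DONE =
      Pattern.compile("MapTask (\\S+) finished, outputBytes=(\\d+)");

  public static void main(String[] args) throws IOException {
    Path log = Paths.get("/var/log/hadoop/userlogs/current.log");  // hypothetical log location
    for (String line : Files.readAllLines(log)) {
      Matcher m = MAP_DONE.matcher(line);
      if (m.find()) {
        // Each finished map output will be fetched by reducers: report the
        // (task, bytes) pair so the controller can plan shuffle paths ahead of time.
        System.out.printf("report transfer: task=%s bytes=%s%n", m.group(1), m.group(2));
      }
    }
  }
}
```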
Reactive Architecture
• Scheduler
– Examines each flow that has started
– For each flow, what is the ideal rate?
– Is the flow currently bottlenecked?
• Move to the next shortest path with available capacity
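A hedged sketch of the scheduling decision above: if a flow's observed rate falls well below its ideal rate, move it to the next shortest path with enough spare capacity. The Flow/PathInfo types and the bottleneck threshold are illustrative placeholders, not an existing scheduler API.

```java
// Illustrative flow-rescheduling step for the reactive scheduler.
import java.util.List;

public class FlowSchedulerSketch {
  static class Flow { double idealMbps; double currentMbps; }
  static class PathInfo { List<String> hops; double spareMbps; }

  // Returns the path the flow should be moved to, or null to leave it in place.
  static PathInfo reschedule(Flow f, List<PathInfo> pathsByLength, double demandMbps) {
    boolean bottlenecked = f.currentMbps < 0.8 * f.idealMbps;   // illustrative threshold
    if (!bottlenecked) return null;
    for (PathInfo p : pathsByLength) {                          // assumed sorted: shortest first
      if (p.spareMbps >= demandMbps) return p;                  // first fit with enough headroom
    }
    return null;                                                // no better path available
  }
}
```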
Proactive Approach
• Modify the applications
– Have them directly inform the network of their intent
• Applications inform the network of co-flows
– Group of flows bound by app level
semantics
– Controls network path, transfer
times, and transfer rate
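To illustrate the proactive idea, the sketch below shows an application declaring a co-flow (a group of semantically related transfers) to the controller before the transfers start; the CoflowIntent structure and the way it would be sent to the controller are assumptions, not an existing API.

```java
// Hypothetical co-flow declaration from the application to the SDN controller.
import java.util.List;

public class CoflowIntentSketch {
  record FlowSpec(String srcHost, String dstHost, long bytes) {}
  record CoflowIntent(String jobId, long deadlineMillis, List<FlowSpec> flows) {}

  public static void main(String[] args) {
    CoflowIntent intent = new CoflowIntent(
        "job_001", 60_000,
        List.of(new FlowSpec("node-1", "node-7", 2_000_000_000L),
                new FlowSpec("node-2", "node-7", 1_500_000_000L)));
    // In a real system this would be sent to the SDN controller, which could then
    // pick paths, schedule transfer times, and rate-limit the group as a unit.
    System.out.println("declare co-flow: " + intent);
  }
}
```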
A PROPOSAL FOR
SERVICE PROFILE BASED
SDN PLATFORM
What do we intend to do?
• Traditional network models construct elements such as switches, subnets, and ACLs without application awareness, which correspondingly causes over-provisioning;
• A service-level network profile model provides higher-level connectivity and policy abstractions;
• An SDN controller platform supports the service-profile model as an integral part of the network planning and provisioning process;
Network Profile Abstraction Model
• Declaratively define a logical network topology model that specifies logical connectivity and policies or services;
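As one possible way to picture such a declarative profile, the sketch below models a service-level profile as logical endpoint groups, logical links, and attached policies; the types and field names are invented for illustration and do not correspond to a specific product.

```java
// Hypothetical data model for a service-level network profile.
import java.util.List;
import java.util.Map;

public class NetworkProfileSketch {
  record EndpointGroup(String name, List<String> members) {}
  record Policy(String type, Map<String, String> params) {}            // e.g. bandwidth, ACL
  record Link(String from, String to, List<Policy> policies) {}
  record NetworkProfile(String service, List<EndpointGroup> groups, List<Link> topology) {}

  public static void main(String[] args) {
    NetworkProfile p = new NetworkProfile(
        "hadoop-analytics",
        List.of(new EndpointGroup("mappers", List.of("rack1/*")),
                new EndpointGroup("reducers", List.of("rack2/*"))),
        List.of(new Link("mappers", "reducers",
                List.of(new Policy("min-bandwidth", Map.of("mbps", "500"))))));
    // The platform would translate this logical intent into device-level configuration.
    System.out.println(p);
  }
}
```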
SDN Networking Platform Architecture
Application Integration Layer
• Present applications with a network model and
associated APIs that expose the information
needed to interact with the network;
• Provide network services to applications via a query API, which allows an application to request abstract topology views or gather performance metrics and status for specific parts of the network;
Network Abstraction Layer
• Perform a logical-to-physical translation of commands issued through the abstraction layer and convert these API calls into the appropriate series of commands;
• Provide a set of network-wide services to applications, such as views of the topology, notifications of changes in link availability or utilization, and path computation according to different routing algorithms;
• Coordinate network requests issued by applications and map those requests onto the network, for example by selecting among the multiple mechanisms available to achieve a given operation, such as setting up a virtual network using an overlay;
Network Driver Layer
• Enable the SDN controller to interface with various
network technologies or tools;
• The orchestration layer uses these drivers to issue commands on specific devices, e.g. an OpenFlow-capable network driver could allow insertion of flow rules in physical or virtual switches;
• Support other drivers to enable virtual network
creation using overlays and topology data gathered
by 3rd party network management tools such as IBM
Tivoli Network Manager, HP Openview;
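A hedged sketch of the driver abstraction: a common interface that the orchestration layer calls, with a technology-specific implementation behind it (here an OpenFlow-style driver that would translate abstract rules into southbound messages). The interface and rule format are illustrative only, not a real SDK.

```java
// Illustrative driver-layer abstraction for the SDN platform.
public class NetworkDriverSketch {
  record FlowRule(String match, String action, int priority) {}

  interface NetworkDriver {
    void installFlowRule(String deviceId, FlowRule rule);   // e.g. push to an OpenFlow switch
    void removeFlowRule(String deviceId, FlowRule rule);
  }

  // One possible driver: translate the abstract rule into OpenFlow-style messages.
  static class OpenFlowDriver implements NetworkDriver {
    public void installFlowRule(String deviceId, FlowRule rule) {
      // Placeholder for a real southbound call to the switch identified by deviceId.
      System.out.printf("flow-mod (add) -> %s : %s%n", deviceId, rule);
    }
    public void removeFlowRule(String deviceId, FlowRule rule) {
      System.out.printf("flow-mod (delete) -> %s : %s%n", deviceId, rule);
    }
  }
}
```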
Network Services Provided
• Network Planning & Design: implements network services as one or more plans and provides a workflow mechanism for scheduling tasks;
• Network Topology Deployment: manages the execution and state of proposed plans via multiple states, such as validate, install, undo, resume;
• Maintenance Service: provides a monitoring view of the dynamic set of underlying network resources;