An SDN-Based Application-Aware Network Provisioning Platform
TRANSCRIPT
Application Aware SDN Network
Provisioning Platform
• YARN Architecture and Concept
• When SDN Meets Big Data & Cloud
• Service Profile based SDN Platform
• Open Discussion
Agenda
YARN ARCHITECTURE
AND CONCEPT
1st Generation Hadoop: Batch Focus
HADOOP 1.0: Built for Web-Scale Batch Apps
Single App: BATCH (HDFS)
Single App: INTERACTIVE
Single App: BATCH (HDFS)
All other usage patterns MUST leverage the same infrastructure
Forces creation of silos to manage mixed workloads:
Single App: BATCH (HDFS)
Single App: ONLINE
Hadoop 1 Architecture
JobTracker
Manage Cluster Resources & Job Scheduling
TaskTracker
Per-node agent
Manage Tasks
Hadoop 1 Limitations
Scalability
Max Cluster size ~5,000 nodes
Max concurrent tasks ~40,000
Coarse Synchronization in JobTracker
Availability
Failure Kills Queued & Running Jobs
Hard partition of resources into map and reduce slots
Non-optimal Resource Utilization
Lacks Support for Alternate Paradigms and Services
Iterative applications in MapReduce are 10x slower
YARN (Yet Another Resource Negotiator)
• Apache Hadoop YARN is a cluster
management technology;
• One of the key features in second-
generation Hadoop;
• Next-generation compute and
resource management framework in
Apache Hadoop;
Hadoop 2 - YARN Architecture
ResourceManager (RM)
Manages and allocates cluster resources
Central agent
NodeManager (NM)
Manage Tasks, Enforce Allocations
Per-Node Agent
Diagram labels: ResourceManager, Client (Job Submission), NodeManagers, Containers, App Mstr, Node Status, Resource Request, MapReduce Status
Data Processing Engines Run Natively IN Hadoop
BATCH MapReduce
INTERACTIVE Tez
STREAMING Storm, S4, …
GRAPH Giraph
MICROSOFT REEF
SAS LASR, HPA
ONLINE HBase
OTHERS
Apache YARN
HDFS2: Redundant, Reliable Storage
YARN: Cluster Resource Management
Flexible: enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming
Efficient: double processing IN Hadoop on the same hardware while providing predictable performance & quality of service
Shared: provides a stable, reliable, secure foundation and shared operational services across multiple workloads
The Data Operating System for Hadoop 2.0
5 Key Benefits of YARN
1. New Applications & Services
2. Improved cluster utilization
3. Scale
4. Experimental Agility
5. Shared Services
Key Improvements in YARN
Framework supporting multiple applications
– Separate generic resource brokering from application logic
– Define protocols/libraries and provide a framework for custom
application development
– Share same Hadoop Cluster across applications
Application Agility and Innovation
– Using Protocol Buffers for RPC gives wire compatibility
– MapReduce becomes an application in user space, unlocking safe innovation
– Multiple versions of an app can co-exist leading to
experimentation
– Easier upgrade of framework and applications
Key Improvements in YARN
Scalability
– Removed complex app logic from RM, scale further
– State machine, message passing based loosely coupled design
Cluster Utilization
– Generic resource container model replaces fixed Map/Reduce
slots. Container allocations based on locality, memory (CPU
coming soon)
– Sharing cluster among multiple applications
Reliability and Availability
– Simpler RM state makes it easier to save and restart (work in
progress)
– Application checkpoint can allow an app to be restarted.
MapReduce application master saves state in HDFS.
YARN as Cluster Operating System
Diagram: the ResourceManager (Scheduler) allocates containers across NodeManagers for concurrent applications: Batch (map 1.1, map 1.2, reduce 1.1), Interactive SQL (vertex 1.1.1, vertex 1.1.2, vertex 1.2.1, vertex 1.2.2), and Real-Time (nimbus0, nimbus1, nimbus2)
YARN APIs & Client Libraries
Application Client Protocol: Client to RM interaction
–Library: YarnClient
–Application Lifecycle control
–Access Cluster Information
Application Master Protocol: AM – RM interaction
–Library: AMRMClient / AMRMClientAsync
–Resource negotiation
–Heartbeat to the RM
Container Management Protocol: AM to NM interaction
–Library: NMClient/NMClientAsync
–Launching allocated containers
–Stop Running containers
Use external frameworks like Weave/REEF/Spring
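To make the client-side protocol concrete, here is a minimal, hedged sketch of using the YarnClient library to query cluster information and submit an application; the application name, resource sizes, and the placeholder launch command are illustrative and not from the talk.

```java
// Minimal sketch of the Application Client Protocol via the YarnClient library,
// assuming a standard Hadoop 2.x client classpath and configuration.
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.api.records.*;

public class SubmitSketch {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Access cluster information through the same protocol.
    System.out.println("NodeManagers: "
        + yarnClient.getYarnClusterMetrics().getNumNodeManagers());

    // Application lifecycle control: ask the RM for an application id,
    // fill in the submission context, and submit.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("demo-app");                      // illustrative name
    ctx.setResource(Resource.newInstance(1024, 1));          // AM memory (MB) and vcores
    ctx.setAMContainerSpec(ContainerLaunchContext.newInstance(
        null, null, java.util.Collections.singletonList("echo hello"), null, null, null));
    ApplicationId appId = yarnClient.submitApplication(ctx);

    System.out.println("Submitted " + appId + ", state: "
        + yarnClient.getApplicationReport(appId).getYarnApplicationState());
    yarnClient.stop();
  }
}
```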
YARN Application Flow
Diagram: the Application Client uses YarnClient (Application Client Protocol) to talk to the ResourceManager and an application-specific API to talk to its Application Master; the Application Master uses AMRMClient (Application Master Protocol) toward the ResourceManager and NMClient (Container Management Protocol) toward the NodeManager, which hosts the App Container
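Below is a hedged sketch of the other two legs of this flow: an Application Master registering with the ResourceManager via AMRMClient, negotiating one container, and launching it through NMClient. The container size, priority, and the placeholder command are assumptions for illustration; real AM credentials and error handling are omitted.

```java
// Illustrative sketch of the Application Master Protocol (AMRMClient) and the
// Container Management Protocol (NMClient) on a Hadoop 2.x classpath.
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.*;

public class AppMasterSketch {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();

    // Register with the ResourceManager; heartbeats happen on allocate().
    AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
    rm.init(conf);
    rm.start();
    rm.registerApplicationMaster("", 0, "");

    NMClient nm = NMClient.createNMClient();
    nm.init(conf);
    nm.start();

    // Resource negotiation: ask for one 1 GB / 1 vcore container anywhere.
    Priority pri = Priority.newInstance(0);
    Resource cap = Resource.newInstance(1024, 1);
    rm.addContainerRequest(new ContainerRequest(cap, null, null, pri));

    int launched = 0;
    while (launched < 1) {
      AllocateResponse resp = rm.allocate(0.1f);   // heartbeat + pick up allocations
      for (Container c : resp.getAllocatedContainers()) {
        // Container Management Protocol: launch the allocated container on its NM.
        ContainerLaunchContext clc = ContainerLaunchContext.newInstance(
            null, null, java.util.Collections.singletonList("sleep 30"), null, null, null);
        nm.startContainer(c, clc);
        launched++;
      }
      Thread.sleep(1000);
    }

    rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", null);
  }
}
```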
YARN Best Practices
Use provided Client libraries
Resource Negotiation
–You may ask but you may not get what you want - immediately.
–Locality requests may not always be met.
–Resources like memory/CPU are guaranteed.
Failure handling
–Remember, anything can fail (or YARN can preempt your containers)
–AM failures handled by YARN but container failures handled by the
application.
Checkpointing
–Check-point AM state for AM recovery.
–If tasks are long running, check-point task state.
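As a rough illustration of the checkpointing advice (the slides note that the MapReduce AM saves its state in HDFS), the following sketch persists and recovers an opaque AM state blob via the HDFS FileSystem API; the checkpoint path and the payload format are hypothetical.

```java
// Hedged sketch of saving/restoring AM state in HDFS for recovery after an AM restart.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckpointSketch {
  private static final Path CKPT = new Path("/app/checkpoints/am-state"); // hypothetical location

  // Persist a serialized snapshot of the AM's progress so a restarted attempt can resume.
  static void saveCheckpoint(Configuration conf, byte[] state) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    try (FSDataOutputStream out = fs.create(CKPT, true)) {
      out.write(state);
    }
  }

  static byte[] loadCheckpoint(Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    if (!fs.exists(CKPT)) return null;                 // first attempt: nothing to recover
    byte[] buf = new byte[(int) fs.getFileStatus(CKPT).getLen()];
    try (FSDataInputStream in = fs.open(CKPT)) {
      in.readFully(buf);
    }
    return buf;
  }
}
```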
YARN Best Practices
Cluster Dependencies
–Try to make zero assumptions on the cluster.
–Your application bundle should deploy everything required using YARN's local resources (see the sketch after this slide).
Client-only installs if possible
–Simplifies cluster deployment, and multi-version support
Securing your Application
–YARN does not secure communications between the AM and its
containers.
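Relating to the "deploy everything via YARN's local resources" advice above, here is a hedged sketch that stages an application jar into HDFS and describes it as a LocalResource for container localization; the local and HDFS paths are hypothetical.

```java
// Sketch of shipping the application bundle via YARN local resources (Hadoop 2.x APIs).
import java.util.Collections;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class LocalResourceSketch {
  // Upload the app jar to HDFS and describe it as a LocalResource that YARN
  // will localize onto every node that runs one of our containers.
  static Map<String, LocalResource> appJarResource(Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    Path src = new Path("my-app.jar");                     // hypothetical local bundle
    Path dst = new Path("/apps/demo/my-app.jar");          // hypothetical HDFS staging path
    fs.copyFromLocalFile(false, true, src, dst);

    FileStatus st = fs.getFileStatus(dst);
    LocalResource jar = LocalResource.newInstance(
        ConverterUtils.getYarnUrlFromPath(dst),
        LocalResourceType.FILE,
        LocalResourceVisibility.APPLICATION,
        st.getLen(), st.getModificationTime());
    return Collections.singletonMap("my-app.jar", jar);    // key = name in the container's cwd
  }
}
```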
YARN Future Work
ResourceManager High Availability and Work-preserving restart
–Work-in-Progress
Scheduler Enhancements
–SLA Driven Scheduling, Low latency allocations
–Multiple resource types – disk/network/GPUs/affinity
Rolling upgrades
Long running services
–Better support for running services like HBase
–Discovery of services, upgrades without downtime
More utilities/libraries for Application Developers
–Failover/Checkpointing
• http://hadoop.apache.org
When SDN Meets Big Data
and Cloud Computing
Challenge Using Big Data & Cloud for
SDN
• The tools are not available yet
• Do we need standards?
• Once you've mined big data, then
what?
Different types of traffic in
Hadoop Clusters
• Background Traffic
– Bulk transfers
– Control messages
• Active Traffic (used by jobs)
– HDFS read/writes
– Partition-Aggregate traffic
Typical Traffic Patterns
– Patterns used by Big Data Analytics
– You can optimize specifically for these
Diagram: typical patterns among Map tasks, Reduce tasks, and HDFS: Shuffle, Broadcast, Incast
Approach Optimizing the
Network to Improve Performance
• Helios, Hedera, MicroTE, c-thru
– Congestion leads to bad performance
– Eliminate congestion
Gather Network Demand
Determine paths with minimal congestion
Install New paths
Disadvantage of Existing
Approach
• Demand gathering at the network is ineffective
– Assumes that past demand will predict future demand
– Many small jobs in the cluster, so prediction is ineffective
• May require expensive instrumentation to gather
– Switch modifications
– Or end host modification to gather information
Application Aware Run Time
Network Configuration Practice
• Topology construction and routing for
aggregation, shuffling, and overlapping
aggregation traffic patterns;
Traffic Demand Estimation
Network-aware Job Scheduling
Topology and Routing
Integrated Network Control for
Big Data Applications
Examples of Topology and Routing
How Can That Be Done?
• Reactively
o Job tracker places the task; it knows the locations
• Check the Hadoop logs for the locations
• Modify the job tracker to directly inform application
• Proactively
o Have the SDN controller tell the job tracker where to
place the end-points
• Rack aware placement: reduce inter-rack transfers
• Congestion aware placement: reduce loss
Reactive Approach
• A reactive attempt to integrate big data + SDN
– No changes to the application
– Learn information by looking at logs to determine file sizes and end-points
– Learn information by running agents on the end hosts that determine start times
Reactive Architecture
Excerpt shown on the slide (from the FlowComb paper):

Table 1: Average job processing time increases as the network capacity decreases.
Link capacity (Mbps)   Avg. processing time (min)
100                    39
50                     53 (x1.3)
25                     67 (x1.7)
10                     146 (x3.7)
The results represent averages over 10 runs of sorting a 10 GB workload on a 14-node Hadoop cluster. The network topology is presented in Figure 2. The numbers within parentheses represent increases from the baseline 100 Mbps network.

Processing time grows from 39 to 146 min (almost four times) when the link capacity is reduced from 100 Mbps to 10 Mbps. The results, summarized in Table 1, match intuition and indicate that finding paths with unused capacity in the network and redirecting congested flows along these paths could improve performance.

2.3 Our goal
As the network plays an important role in the performance of distributed data processing, it is crucial to tune it to the demands of applications. Obtaining accurate demand information is difficult [5]. Requiring users to specify the demand is unrealistic because changes in demands may be unknown to users. Instrumenting applications to give the instant demand is better but is intrusive and deters deployment because it requires modifications to applications [11]. Finally, inferring demand from switch counters [3] does not place any burden on the user or application but gives only current and past statistics without revealing future demand. In addition, polling switch counters must be carefully scheduled to maintain scalability, which may lead to stale information.
Our goal is to build a network management platform for distributed data processing applications that is both transparent to the applications and quick and accurate in detecting their demand. We propose to use application domain knowledge to detect network transfers (possibly before they start) and software-defined networking to update the network paths to support these transfers without creating congestion. Our vision bridges the gap between the network and the application by introducing a middle layer that collects demand information transparently and scalably from both the application (data transfers) and the network (current network utilization) and adapts the network to the needs of the application.

3 Design
FlowComb improves job processing times and averts network congestion in Hadoop MapReduce clusters by predicting network transfers and scheduling them dynamically on paths with sufficient available bandwidth. Figure 1 highlights the three main components of FlowComb: flow prediction, flow scheduling, and flow control (Predictor, Scheduler, and Controller, with Agents on the Hadoop cluster).

3.1 Prediction
FlowComb detects data transfers between nodes in a Hadoop cluster using domain knowledge about the interaction between Hadoop components.
Hadoop operation. When a map task finishes, it writes its output to disk and notifies the job tracker, which in turn notifies the reduce tasks. Each reduce task then retrieves from the mapper the data corresponding to its own key space. However, not all transfers start immediately. To avoid overloading the same mapper with many simultaneous requests and burdening themselves with concurrent transfers from many mappers, reducers start a limited number of transfers (5 by default). When a transfer ends, the reducer starts retrieving data from another mapper chosen at random. Hadoop makes information about the transfer (e.g., source, destination, volume) available through its logs or through a web-based API.
Agents. To obtain information about data transfers without modifying Hadoop, software agents are installed on each server in the cluster. An agent performs two simple tasks: 1) it periodically scans Hadoop logs and queries Hadoop nodes to find which map tasks have finished and which transfers have begun (or already finished), and 2) it sends this information to FlowComb's flow scheduling module. To detect the size of a map output, an agent learns the IDs of the local mappers from the job tracker and queries each mapper using the web API. Essentially, the agent performs the same sequence of calls as a reducer that tries to obtain information about where to retrieve data. In addition, the agent scans the local Hadoop logs to learn whether a transfer has already started.
• Agents on servers
– Detect start/end of map tasks
– Detect start/end of transfers
• Predictor
– Determines size of intermediate data (queries mappers via API)
– Aggregates information from agents and sends it to the scheduler
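A purely illustrative sketch of the agent idea described above: scan Hadoop logs for finished map tasks and report the expected shuffle transfers so that the predictor/scheduler can pre-provision paths. The log location, the log-line pattern, and the reporting step are assumptions, not FlowComb's actual implementation.

```java
// Hypothetical log-scanning agent for the reactive approach.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShuffleAgentSketch {
  // Hypothetical pattern for a "map task finished" log line carrying task id and output size.
  private static final Pattern MAP_DONE =
      Pattern.compile("MapTask (\\S+) finished, outputBytes=(\\d+)");

  public static void main(String[] args) throws IOException {
    Path log = Paths.get("/var/log/hadoop/userlogs/current.log");  // hypothetical log location
    for (String line : Files.readAllLines(log)) {
      Matcher m = MAP_DONE.matcher(line);
      if (m.find()) {
        // Each finished map output will be fetched by reducers: report the
        // (task, bytes) pair so the controller can plan shuffle paths ahead of time.
        System.out.printf("report transfer: task=%s bytes=%s%n", m.group(1), m.group(2));
      }
    }
  }
}
```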
Reactive Architecture
• Scheduler
– Examines each flow that has started
– For each flow, what is the ideal rate?
– Is the flow currently bottlenecked?
• Move to the next shortest path with available capacity
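A hedged sketch of the scheduling decision above: if a flow's observed rate falls well below its ideal rate, move it to the next shortest path with enough spare capacity. The Flow/PathInfo types and the bottleneck threshold are illustrative placeholders, not an existing scheduler API.

```java
// Illustrative flow-rescheduling step for the reactive scheduler.
import java.util.List;

public class FlowSchedulerSketch {
  static class Flow { double idealMbps; double currentMbps; }
  static class PathInfo { List<String> hops; double spareMbps; }

  // Returns the path the flow should be moved to, or null to leave it in place.
  static PathInfo reschedule(Flow f, List<PathInfo> pathsByLength, double demandMbps) {
    boolean bottlenecked = f.currentMbps < 0.8 * f.idealMbps;   // illustrative threshold
    if (!bottlenecked) return null;
    for (PathInfo p : pathsByLength) {                          // assumed sorted: shortest first
      if (p.spareMbps >= demandMbps) return p;                  // first fit with enough headroom
    }
    return null;                                                // no better path available
  }
}
```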
Proactive Approach
• Modify the applications
– Have them directly inform the network of their intent
• Applications inform the network of co-flows
– Group of flows bound by app level
semantics
– Controls network path, transfer
times, and transfer rate
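To illustrate the proactive idea, the sketch below shows an application declaring a co-flow (a group of semantically related transfers) to the controller before the transfers start; the CoflowIntent structure and the way it would be sent to the controller are assumptions, not an existing API.

```java
// Hypothetical co-flow declaration from the application to the SDN controller.
import java.util.List;

public class CoflowIntentSketch {
  record FlowSpec(String srcHost, String dstHost, long bytes) {}
  record CoflowIntent(String jobId, long deadlineMillis, List<FlowSpec> flows) {}

  public static void main(String[] args) {
    CoflowIntent intent = new CoflowIntent(
        "job_001", 60_000,
        List.of(new FlowSpec("node-1", "node-7", 2_000_000_000L),
                new FlowSpec("node-2", "node-7", 1_500_000_000L)));
    // In a real system this would be sent to the SDN controller, which could then
    // pick paths, schedule transfer times, and rate-limit the group as a unit.
    System.out.println("declare co-flow: " + intent);
  }
}
```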
A PROPOSAL FOR
SERVICE PROFILE BASED
SDN PLATFORM
What do we intend to do?
• Traditional network models construct elements such as switches, subnets, and ACLs without application awareness, which correspondingly causes over-provisioning;
• A service-level network profile model provides higher-level connectivity and policy abstractions;
• An SDN controller platform supports the service-profile model as an integral part of the network planning and provisioning process;
Network Profile Abstraction Model
• Declaratively define a logical network topology model that specifies logical connectivity and policies or services;
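As one possible way to picture such a declarative profile, the sketch below models a service-level profile as logical endpoint groups, logical links, and attached policies; the types and field names are invented for illustration and do not correspond to a specific product.

```java
// Hypothetical data model for a service-level network profile.
import java.util.List;
import java.util.Map;

public class NetworkProfileSketch {
  record EndpointGroup(String name, List<String> members) {}
  record Policy(String type, Map<String, String> params) {}            // e.g. bandwidth, ACL
  record Link(String from, String to, List<Policy> policies) {}
  record NetworkProfile(String service, List<EndpointGroup> groups, List<Link> topology) {}

  public static void main(String[] args) {
    NetworkProfile p = new NetworkProfile(
        "hadoop-analytics",
        List.of(new EndpointGroup("mappers", List.of("rack1/*")),
                new EndpointGroup("reducers", List.of("rack2/*"))),
        List.of(new Link("mappers", "reducers",
                List.of(new Policy("min-bandwidth", Map.of("mbps", "500"))))));
    // The platform would translate this logical intent into device-level configuration.
    System.out.println(p);
  }
}
```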
SDN Networking Platform Architecture
Application Integration Layer
• Present applications with a network model and
associated APIs that expose the information
needed to interact with the network;
• Provide network services to applications via a query API, which allows an application to request abstract topology views or gather performance metrics and status for specific parts of the network;
Network Abstraction Layer
• Perform a logical-to-physical translation of commands issued through the abstraction layer and convert these API calls into the appropriate series of commands;
• Provide a set of network-wide services to applications, such as views of the topology, notifications of changes in link availability or utilization, and path computation according to different routing algorithms;
• Coordinate network requests issued by applications and map those requests onto the network, for example by selecting among the multiple mechanisms available to achieve a given operation, such as setting up a virtual network using an overlay;
Network Driver Layer
• Enable the SDN controller to interface with various
network technologies or tools;
• The orchestration layer uses these drivers to issue commands on specific devices, e.g. an OpenFlow-capable network driver could allow insertion of flow rules in physical or virtual switches;
• Support other drivers to enable virtual network
creation using overlays and topology data gathered
by 3rd party network management tools such as IBM
Tivoli Network Manager, HP Openview;
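A hedged sketch of the driver abstraction: a common interface that the orchestration layer calls, with a technology-specific implementation behind it (here an OpenFlow-style driver that would translate abstract rules into southbound messages). The interface and rule format are illustrative only, not a real SDK.

```java
// Illustrative driver-layer abstraction for the SDN platform.
public class NetworkDriverSketch {
  record FlowRule(String match, String action, int priority) {}

  interface NetworkDriver {
    void installFlowRule(String deviceId, FlowRule rule);   // e.g. push to an OpenFlow switch
    void removeFlowRule(String deviceId, FlowRule rule);
  }

  // One possible driver: translate the abstract rule into OpenFlow-style messages.
  static class OpenFlowDriver implements NetworkDriver {
    public void installFlowRule(String deviceId, FlowRule rule) {
      // Placeholder for a real southbound call to the switch identified by deviceId.
      System.out.printf("flow-mod (add) -> %s : %s%n", deviceId, rule);
    }
    public void removeFlowRule(String deviceId, FlowRule rule) {
      System.out.printf("flow-mod (delete) -> %s : %s%n", deviceId, rule);
    }
  }
}
```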
Network Services Provided
• Network Planning & Design: implements network services as one or more plans and provides a workflow mechanism for scheduling tasks;
• Network Topology Deployment: manages the execution and state of proposed plans via multiple states, such as validate, install, undo, resume;
• Maintenance Service: provides a monitoring view of the dynamic set of underlying network resources;