

DARD: Distributed Adaptive Routing for Datacenter Networks

Xin Wu, Xiaowei Yang
Dept. of Computer Science, Duke University

    Duke-CS-TR-2011-01

ABSTRACT

Datacenter networks typically have many paths connecting each host pair to achieve high bisection bandwidth for arbitrary communication patterns. Fully utilizing the bisection bandwidth may require flows between the same source and destination pair to take different paths to avoid hot spots. However, the existing routing protocols have little support for load-sensitive adaptive routing. We propose DARD, a Distributed Adaptive Routing architecture for Datacenter networks. DARD allows each end host to move traffic from overloaded paths to underloaded paths without central coordination. We use an OpenFlow implementation and simulations to show that DARD can effectively use a datacenter network's bisection bandwidth under both static and dynamic traffic patterns. It outperforms previous solutions based on random path selection by 10%, and performs similarly to previous work that assigns flows to paths using a centralized controller. We use competitive game theory to show that DARD's path selection algorithm makes progress in every step and converges to a Nash equilibrium in finite steps. Our evaluation results suggest that DARD can achieve a close-to-optimal solution in practice.

1. INTRODUCTION

Datacenter network applications, e.g., MapReduce and network storage, often demand high intra-cluster bandwidth [11] to transfer data among distributed components. This is because the components of an application cannot always be placed on machines close to each other (e.g., within a rack) for two main reasons. First, applications may share common services provided by the datacenter network, e.g., DNS, search, and storage. These services are not necessarily placed on nearby machines. Second, the auto-scaling feature offered by a datacenter network [1, 5] allows an application to create dynamic instances when its workload increases. Where those instances will be placed depends on machine availability, and is not guaranteed to be close to the application's other instances.

Therefore, it is important for a datacenter network to have high bisection bandwidth to avoid hot spots between any pair of hosts. To achieve this goal, today's datacenter networks often use commodity Ethernet switches to form multi-rooted tree topologies [21] (e.g., fat-tree [10] or Clos topology [16]) that have multiple equal-cost paths connecting any host pair. A flow (a flow refers to a TCP connection in this paper) can use an alternative path if one path is overloaded.

However, legacy transport protocols such as TCP lack the ability to dynamically select paths based on traffic load. To overcome this limitation, researchers have advocated a variety of dynamic path selection mechanisms to take advantage of the multiple paths connecting any host pair. At a high level, these mechanisms fall into two categories: centralized dynamic path selection and distributed traffic-oblivious load balancing. A representative example of centralized path selection is Hedera [11], which uses a central controller to compute an optimal flow-to-path assignment based on dynamic traffic load. Equal-Cost Multi-Path forwarding (ECMP) [19] and VL2 [16] are examples of traffic-oblivious load balancing. With ECMP, routers hash flows based on flow identifiers to multiple equal-cost next hops. VL2 [16] uses edge switches to forward a flow to a randomly selected core switch to achieve valiant load balancing [23].

Each of these two design paradigms has merit and improves the available bisection bandwidth between a host pair in a datacenter network. Yet each has its limitations. A centralized path selection approach introduces a potential scaling bottleneck and a centralized point of failure. When a datacenter scales to a large size, the control traffic sent to and from the controller may congest the link that connects the controller to the rest of the datacenter network. Distributed traffic-oblivious load balancing scales well to large datacenter networks, but may create hot spots, as its flow assignment algorithms do not consider the dynamic traffic load on each path.

In this paper, we aim to explore the design space that uses end-to-end distributed load-sensitive path selection to fully use a datacenter's bisection bandwidth. This design paradigm has a number of advantages. First, placing the path selection logic at an end system rather than inside a switch facilitates deployment, as it does not require special hardware or replacing commodity switches. One can also upgrade or extend the path selection logic later by applying software patches rather than upgrading switching hardware. Second, a distributed design can be more robust and scale better than a centralized approach.

This paper presents DARD, a lightweight, distributed, end-system-based path selection system for datacenter networks. DARD's design goal is to fully utilize bisection bandwidth and dynamically balance the traffic among the multiple paths between any host pair. A key design challenge DARD faces is how to achieve dynamic distributed load balancing.


Unlike in a centralized approach, with DARD no end system or router has a global view of the network. Each end system can only select a path based on its local knowledge, which makes achieving close-to-optimal load balancing a challenging problem.

To address this challenge, DARD uses a selfish path selection algorithm that provably converges to a Nash equilibrium in finite steps (Appendix B). Our experimental evaluation shows that the equilibrium's gap to the optimal solution is small. To facilitate path selection, DARD uses hierarchical addressing to represent an end-to-end path with a pair of source and destination addresses, as in [27]. Thus, an end system can switch paths by switching addresses.

We have implemented a DARD prototype on DeterLab [6] and an ns-2 simulator. We use static traffic patterns to show that DARD converges to a stable state in two to three control intervals. We use dynamic traffic patterns to show that DARD outperforms ECMP, VL2 and TeXCP, and that its performance gap to centralized scheduling is small. Under dynamic traffic patterns, DARD maintains stable link utilization. About 90% of the flows change their paths fewer than four times in their life cycles. Evaluation results also show that the bandwidth taken by DARD's control traffic is bounded by the size of the topology.

DARD is a scalable and stable end-host-based approach to load-balancing datacenter traffic. We make every effort to leverage existing infrastructure and to make DARD practically deployable. The rest of this paper is organized as follows. Section 2 introduces background knowledge and discusses related work. Section 3 describes DARD's design goals and system components. In Section 4, we introduce the system implementation details. We evaluate DARD in Section 5. Section 6 concludes our work.

2. BACKGROUND AND RELATED WORK

In this section, we first briefly introduce what a datacenter network looks like and then discuss related work.

2.1 Datacenter Topologies

Recent proposals [10, 16, 24] suggest using multi-rooted tree topologies to build datacenter networks. Figure 1 shows a 3-stage multi-rooted tree topology. The topology has three vertical layers: Top-of-Rack (ToR), aggregation, and core. A pod is a management unit. It represents a replicable building block consisting of a number of servers and switches sharing the same power and management infrastructure.

An important design parameter of a datacenter network is the oversubscription ratio at each layer of the hierarchy, which is computed as a layer's downstream bandwidth divided by its upstream bandwidth, as shown in Figure 1. The oversubscription ratio is usually designed to be larger than one, assuming that not all downstream devices will be active concurrently.

We design DARD to work for arbitrary multi-rooted tree topologies.

Figure 1: A multi-rooted tree topology for a datacenter network. The aggregation layer's oversubscription ratio is defined as BW_down / BW_up.

But for ease of exposition, we mostly use the fat-tree topology to illustrate DARD's design, unless otherwise noted. Therefore, we briefly describe what a fat-tree topology is.

Figure 2 shows a fat-tree topology example. A p-pod fat-tree topology (in Figure 2, p = 4) has p pods in the horizontal direction. It uses 5p^2/4 p-port switches and supports non-blocking communication among p^3/4 end hosts. A pair of end hosts in different pods have p^2/4 equal-cost paths connecting them. Once the two end hosts choose a core switch as the intermediate node, the path between them is uniquely determined.
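To make these counts concrete, here is a minimal sketch (our own helper, not part of DARD) that evaluates the formulas above; for the 4-pod fat-tree of Figure 2 it gives 20 switches, 16 hosts, and 4 equal-cost inter-pod paths, and for the p = 16 fat-tree used later in the evaluation it gives 1024 hosts and 64 paths.

```python
def fat_tree_size(p):
    """Switch, host, and inter-pod path counts for a p-pod fat-tree of p-port switches (p even)."""
    switches = 5 * p * p // 4       # p^2/4 core switches, plus p/2 aggregation and p/2 ToR per pod
    hosts = p ** 3 // 4             # p/2 hosts per ToR, p/2 ToRs per pod, p pods
    inter_pod_paths = p * p // 4    # one equal-cost inter-pod path per core switch
    return switches, hosts, inter_pod_paths

for p in (4, 16):
    print(p, fat_tree_size(p))      # 4 -> (20, 16, 4);  16 -> (320, 1024, 64)
```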

Figure 2: A 4-pod fat-tree topology.

In this paper, we use the term elephant flow to refer to a continuous TCP connection longer than a threshold defined in the number of transferred bytes. We discuss how to choose this threshold in Section 3.

2.2 Related Work

Related work falls into three broad categories: adaptive path selection mechanisms, end-host-based multipath transmission, and traffic engineering protocols.

Adaptive path selection. Adaptive path selection mechanisms [11, 16, 19] can be further divided into centralized and distributed approaches. Hedera [11] adopts a centralized approach at the granularity of a flow. In Hedera, edge switches detect and report elephant flows to a centralized controller. The controller calculates a path arrangement and periodically updates switches' routing tables. Hedera can almost fully utilize a network's bisection bandwidth, but a recent datacenter traffic measurement suggests that this centralized approach needs parallelism and fast route computation to support dynamic traffic patterns [12].


Equal-Cost Multi-Path forwarding (ECMP) [19] is a distributed flow-level path selection approach. An ECMP-enabled switch is configured with multiple next hops for a given destination and forwards a packet according to a hash of selected fields of the packet header. It can split traffic to each destination across multiple paths. Since packets of the same flow share the same hash value, they take the same path from the source to the destination and maintain the packet order. However, multiple large flows can collide on their hash values and congest an output port [11].

VL2 [16] is another distributed path selection mechanism. Different from ECMP, it places the path selection logic at edge switches. In VL2, an edge switch first forwards a flow to a randomly selected core switch, which then forwards the flow to the destination. As a result, multiple elephant flows can still collide on the same output port, as with ECMP.

DARD belongs to the distributed adaptive path selection family. It differs from ECMP and VL2 in two key aspects. First, its path selection algorithm is load sensitive. If multiple elephant flows collide on the same path, the algorithm will shift flows from the collided path to more lightly loaded paths. Second, it places the path selection logic at end systems rather than at switches to facilitate deployment. A path selection module running at an end system monitors path state and switches paths according to path load (Section 3.5). A datacenter network can deploy DARD by upgrading its end systems' software stack rather than updating switches.

Multi-path transport protocols. A different design paradigm, multipath TCP (MPTCP) [26], enables an end system to simultaneously use multiple paths to improve TCP throughput. However, it requires applications to use MPTCP rather than legacy TCP to take advantage of underutilized paths. In contrast, DARD is transparent to applications, as it is implemented as a path selection module under the transport layer (Section 3). Therefore, legacy applications need not be upgraded to take advantage of multiple paths in a datacenter network.

Traffic engineering protocols. Traffic engineering protocols such as TeXCP [20] were originally designed to balance traffic in an ISP network, but can be adopted by datacenter networks. However, because these protocols are not end-to-end solutions, they forward traffic along different paths at the granularity of a packet rather than a TCP flow. This can cause TCP packet reordering, harming a TCP flow's performance. In addition, different from DARD, they also place the path selection logic at switches and therefore require upgrading switches.

3. DARD DESIGN

In this section, we describe DARD's design. We first highlight the system design goals. Then we present an overview of the system. We present more design details in the following subsections.

    3.1 Design Goals

DARD's essential goal is to effectively utilize a datacenter's bisection bandwidth with practically deployable mechanisms and limited control overhead. We elaborate on the design goals in more detail below.

1. Efficiently utilizing the bisection bandwidth. Given the large bisection bandwidth in datacenter networks, we aim to take advantage of the multiple paths connecting each host pair and fully utilize the available bandwidth. Meanwhile, we desire to prevent any systematic design risk that may cause packet reordering and decrease the system goodput.

2. Fairness among elephant flows. We aim to provide fairness among elephant flows so that concurrent elephant flows can evenly share the available bisection bandwidth. We focus our work on elephant flows for two reasons. First, existing work shows that ECMP and VL2 already perform well on scheduling a large number of short flows [16]. Second, elephant flows occupy a significant fraction of the total bandwidth (more than 90% of bytes are in the 1% of largest flows [16]).

3. Lightweight and scalable. We aim to design a lightweight and scalable system. We desire to avoid a centralized scaling bottleneck and to minimize the amount of control traffic and computation needed to fully utilize bisection bandwidth.

4. Practically deployable. We aim to make DARD compatible with existing datacenter infrastructure so that it can be deployed without significant modifications or upgrades to existing infrastructure.

3.2 Overview

In this section, we present an overview of the DARD design. DARD uses three key mechanisms to meet the above system design goals. First, it uses a lightweight distributed end-system-based path selection algorithm to move flows from overloaded paths to underloaded paths to improve efficiency and prevent hot spots (Section 3.5). Second, it uses hierarchical addressing to facilitate efficient path selection (Section 3.3). Each end system can use a pair of source and destination addresses to represent an end-to-end path, and vary paths by varying addresses. Third, DARD places the path selection logic at the end system to facilitate practical deployment, as a datacenter network can upgrade its end systems by applying software patches. It only requires that a switch support the OpenFlow protocol, and such switches are commercially available [?].

Figure 3 shows DARD's system components and how it works. Since we choose to place the path selection logic at the end system, a switch in DARD has only two functions: (1) it forwards packets to the next hop according to a pre-configured routing table; (2) it keeps track of the Switch State (SS, defined in Section 3.4) and replies to end systems' Switch State Requests (SSR). Our design implements these functions using the OpenFlow protocol.

An end system has three DARD components, as shown in Figure 3: the Elephant Flow Detector, the Path State Monitor, and the Path Selector.


Figure 3: DARD's system overview. There are multiple paths connecting each source and destination pair. DARD is a distributed system running on every end host. It has three components: the Elephant Flow Detector detects elephant flows; the Path State Monitor monitors traffic load on each path by periodically querying the switches; the Path Selector moves flows from overloaded paths to underloaded paths.

The Elephant Flow Detector monitors all the outgoing flows and treats a flow as an elephant once its size grows beyond a threshold. We use 100KB as the threshold in our implementation, because according to a recent study, more than 85% of flows in a datacenter are less than 100KB [16]. The Path State Monitor sends SSRs to the switches on all the paths and assembles the SS replies into Path States (PS, as defined in Section 3.4). The path state indicates the load on each path. Based on both the path state and the detected elephant flows, the Path Selector periodically reassigns flows from overloaded paths to underloaded paths.
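As a rough sketch of the detection step, the snippet below flags a flow once its byte count crosses the 100KB threshold; the prototype actually obtains per-connection byte counts from TCPTrack, and the class and method names here are hypothetical.

```python
ELEPHANT_THRESHOLD_BYTES = 100 * 1024   # 100 KB threshold, following [16]

class ElephantFlowDetector:
    """Flags an outbound TCP flow as an elephant once its byte count exceeds the threshold."""
    def __init__(self, threshold=ELEPHANT_THRESHOLD_BYTES):
        self.threshold = threshold
        self.bytes_sent = {}            # flow id (src ip, src port, dst ip, dst port) -> bytes
        self.elephants = set()

    def on_bytes_sent(self, flow_id, nbytes):
        """Update the counter for one flow and report whether it is now an elephant."""
        total = self.bytes_sent.get(flow_id, 0) + nbytes
        self.bytes_sent[flow_id] = total
        if total > self.threshold:
            self.elephants.add(flow_id)
        return flow_id in self.elephants
```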

The rest of this section presents more design details of DARD, including how to use hierarchical addressing to select paths at an end system (Section 3.3), how to actively monitor all paths' state in a scalable fashion (Section 3.4), and how to reassign flows from overloaded paths to underloaded paths to improve efficiency and prevent hot spots (Section 3.5).

3.3 Addressing and Routing

To fully utilize the bisection bandwidth and, at the same time, to prevent retransmissions caused by packet reordering (Goal 1), we allow a flow to take different paths during its life cycle to reach the destination. However, a flow can use only one path at any given time. Since we are exploring the design space of putting as much control logic as possible at the end hosts, we decided to leverage the datacenter's hierarchical structure to enable an end host to actively select paths for a flow.

A datacenter network is usually constructed as a multi-rooted tree. Take Figure 4 as an example: all the switches and end hosts highlighted by the solid circles form a tree rooted at core1. Three other similar trees exist in the same topology. This strictly hierarchical structure facilitates adaptive routing through some customized addressing rules [10]. We borrow the idea from NIRA [27] to split an end-to-end path into uphill and downhill segments and encode a path in the source and destination addresses. In DARD, each of the core switches obtains a unique prefix and then allocates nonoverlapping subdivisions of the prefix to each of its sub-trees. The sub-trees recursively allocate nonoverlapping subdivisions of their prefixes to lower hierarchies. Through this hierarchical prefix allocation, each network device receives multiple IP addresses, each of which represents the device's position in one of the trees.

As shown in Figure 4, we use core_i to refer to the i-th core switch and aggr_ij to refer to the j-th aggregation switch in the i-th pod. We follow the same rule to interpret ToR_ij for the top-of-rack switches and E_ij for the end hosts. We use the device names prefixed with the letter P and delimited by dots to illustrate how prefixes are allocated along the hierarchies. The first core is allocated the prefix Pcore1. It then allocates the nonoverlapping prefixes Pcore1.Paggr11 and Pcore1.Paggr21 to two of its sub-trees. The sub-tree rooted at aggr11 will further allocate four prefixes to lower hierarchies.

For a general multi-rooted tree topology, datacenter operators can generate a similar address assignment schema and allocate the prefixes along the topology hierarchies. In case more IP addresses than network cards are assigned to each end host, we propose to use IP aliases to configure multiple IP addresses on one network interface. The latest operating systems support a large number of IP aliases per network interface, e.g., Linux kernel 2.6 sets the limit to 256K IP aliases per interface [3], and Windows NT 4.0 has no limitation on the number of IP addresses that can be bound to a network interface [4].

One nice property of this hierarchical addressing is that one host address uniquely encodes the sequence of upper-level switches that allocated that address. For example, in Figure 4, E11's address Pcore1.Paggr11.PToR11.PE11 uniquely encodes the allocation sequence core1 → aggr11 → ToR11. A source and destination address pair can further uniquely identify a path, e.g., in Figure 4, we can use the source and destination pair highlighted by dotted circles to uniquely encode the dotted path from E11 to E21 through core1. We call the partial path encoded by the source address the uphill path and the partial path encoded by the destination address the downhill path. To move a flow to a different path, we can simply use a different source and destination address combination, without dynamically reconfiguring the routing tables.
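To illustrate how a source and destination address pair encodes a path, the sketch below models each hierarchical address as the chain of devices that allocated its prefix and recovers the uphill and downhill segments from the two addresses. This representation is our own simplification for illustration, not DARD's wire format.

```python
# A hierarchical address is modeled as the allocation chain from a core down to a host,
# e.g. E11's address under core1 is ("core1", "aggr11", "ToR11", "E11").
def path_from_addresses(src_addr, dst_addr):
    """Return the switch-level path encoded by a (source, destination) address pair."""
    assert src_addr[0] == dst_addr[0], "both addresses must be allocated from the same core"
    uphill = list(reversed(src_addr[1:-1]))     # source host's upstream switches: ToR, aggregation
    downhill = list(dst_addr[1:-1])             # destination side: aggregation, ToR
    return uphill + [src_addr[0]] + downhill

print(path_from_addresses(("core1", "aggr11", "ToR11", "E11"),
                          ("core1", "aggr21", "ToR21", "E21")))
# -> ['ToR11', 'aggr11', 'core1', 'aggr21', 'ToR21'], the dotted path of Figure 4
```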

To forward a packet, each switch stores a downhill table and an uphill table, as described in [27]. The uphill table keeps the entries for the prefixes allocated from upstream switches, and the downhill table keeps the entries for the prefixes allocated to the downstream switches.


Figure 4: DARD's addressing and routing. E11's address Pcore1.Paggr11.PToR11.PE11 encodes the uphill path ToR11-aggr11-core1. E21's address Pcore1.Paggr21.PToR21.PE21 encodes the downhill path core1-aggr21-ToR21.

Table 1 shows switch aggr11's downhill and uphill tables. The port indexes are marked in Figure 4. When a packet arrives, a switch first looks up the destination address in the downhill table using the longest prefix matching algorithm. If a match is found, the packet is forwarded to the corresponding downstream switch. Otherwise, the switch looks up the source address in the uphill table to forward the packet to the corresponding upstream switch. A core switch has only the downhill table.

downhill table
Prefix                      Port
Pcore1.Paggr11.PToR11       1
Pcore1.Paggr11.PToR12       2
Pcore2.Paggr11.PToR11       1
Pcore2.Paggr11.PToR12       2

uphill table
Prefix      Port
Pcore1      3
Pcore2      4

Table 1: aggr11's downhill and uphill routing tables.
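The lookup just described can be sketched as follows, using aggr11's tables from Table 1. Representing prefixes as allocation chains and reducing longest-prefix matching to chain matching is our simplification for illustration only.

```python
def forward(downhill_table, uphill_table, src_addr, dst_addr):
    """Return the output port: try the destination in the downhill table first,
    then fall back to the source in the uphill table, as in the DARD/NIRA scheme."""
    def longest_prefix_port(table, addr):
        best = None
        for prefix, port in table.items():
            if addr[:len(prefix)] == prefix and (best is None or len(prefix) > len(best[0])):
                best = (prefix, port)
        return best[1] if best else None

    port = longest_prefix_port(downhill_table, dst_addr)
    return port if port is not None else longest_prefix_port(uphill_table, src_addr)

# aggr11's tables (Table 1), with prefixes written as allocation chains:
downhill = {("core1", "aggr11", "ToR11"): 1, ("core1", "aggr11", "ToR12"): 2,
            ("core2", "aggr11", "ToR11"): 1, ("core2", "aggr11", "ToR12"): 2}
uphill = {("core1",): 3, ("core2",): 4}

# A packet from E11 to E21 via core1 finds no downhill match at aggr11, so it goes up on port 3.
print(forward(downhill, uphill,
              ("core1", "aggr11", "ToR11", "E11"),
              ("core1", "aggr21", "ToR21", "E21")))   # -> 3
```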

In fact, the downhill-uphill lookup is not necessary for a fat-tree topology, since a core switch in a fat-tree uniquely determines the entire path. However, not all multi-rooted trees share this property, e.g., a Clos network. The downhill-uphill lookup modifies the current switch forwarding algorithm, but an increasing number of switches support highly customized forwarding policies. In our implementation we use OpenFlow-enabled switches to support this forwarding algorithm. All switches' uphill and downhill tables are automatically configured during their initialization. These configurations are static unless the topology changes.

Each network component is also assigned a location-independent IP address, its ID, which uniquely identifies the component and is used for making TCP connections. The mapping from IDs to underlying IP addresses is maintained by a DNS-like system and cached locally. To deliver a packet from a source to a destination, the source encapsulates the packet with a proper source and destination address pair. Switches in the middle forward the packet according to the encapsulated packet header. When the packet arrives at the destination, it is decapsulated and passed to upper layer protocols.

    3.4 Path Monitoring

To achieve load-sensitive path selection at end hosts, DARD informs every end host of the traffic load in the network. Each end host then selects paths for its outbound elephant flows accordingly. From a high-level perspective, there are two options for informing an end host of the network traffic load. The first is a mechanism similar to NetFlow [15], in which traffic is logged at the switches and stored at a centralized server; we can then either pull or push the log to the end hosts. The second is an active measurement mechanism, in which each end host actively probes the network and collects replies. TeXCP [20] chooses the second option. Since we desire that DARD not rely on any conceptually centralized component, to prevent any potential scalability issue, an end host in DARD also uses the active probing method to monitor the traffic load in the network. This function is performed by the Path State Monitor shown in Figure 3. This section first describes a straw-man design of the Path State Monitor which enables an end host to monitor the traffic load in the network. Then we improve the straw-man design by decreasing the control traffic.

We first define some terms before describing the straw-man design. We use C_l to denote output link l's capacity, and N_l to denote the number of elephant flows on output link l. We define link l's fair share S_l = C_l / N_l as the bandwidth each elephant flow would get if they fairly shared that link (S_l = 0 if N_l = 0). Output link l's Link State (LS_l) is defined as a triple [C_l, N_l, S_l]. A switch r's Switch State (SS_r) is defined as {LS_l | l is r's output link}. A path p refers to the set of links that connect a source and destination ToR switch pair. If link l has the smallest S_l among all the links on path p, we use LS_l to represent p's Path State (PS_p).
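A minimal sketch of how these definitions compose, assuming the switch state replies have already been parsed into per-link (capacity, elephant-flow count) pairs:

```python
from typing import NamedTuple

class LinkState(NamedTuple):
    capacity: float          # C_l, e.g. in Mbps
    flows: int               # N_l, number of elephant flows on link l

    @property
    def fair_share(self):    # S_l = C_l / N_l, and 0 if N_l = 0, as defined above
        return 0.0 if self.flows == 0 else self.capacity / self.flows

def path_state(links):
    """PS_p: the link state with the smallest fair share among the links of path p."""
    return min(links, key=lambda ls: ls.fair_share)

p = [LinkState(1000.0, 2), LinkState(1000.0, 5), LinkState(1000.0, 1)]
print(path_state(p), path_state(p).fair_share)   # LinkState(capacity=1000.0, flows=5) 200.0
```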

The Path State Monitor is a DARD component running on an end host and is responsible for monitoring the states of all the paths to other hosts. In the straw-man design, each switch keeps track of its switch state (SS) locally.


    Figure 5: The set of switches a source sends switch state requests to.

The path state monitor of an end host periodically sends Switch State Requests (SSR) to every switch and assembles the switch state replies into path states (PS). These path states indicate the traffic load on each path. This straw-man design requires every switch to have a customized flow counter and the capability of replying to SSRs. These two functions are already supported by OpenFlow-enabled switches [17].

In the straw-man design, every end host periodically sends SSRs to all the switches in the topology, and the switches reply to the requests. The control traffic in every control interval (we discuss this control interval in Section 5) can be estimated using formula (1), where pkt_size refers to the sum of the request and response packet sizes:

num_of_servers × num_of_switches × pkt_size    (1)

Even though the above control traffic is bounded by the size of the topology, we can still improve the straw-man design by decreasing the control traffic. There are two intuitions behind the optimizations. First, if an end host is not sending any elephant flow, it is not necessary for that end host to monitor the traffic load, since DARD is designed to select paths only for elephant flows. Second, as shown in Figure 5, E21 is sending an elephant flow to E31. The switches highlighted by the dotted circles are the ones E21 needs to send SSRs to. The remaining switches are not on any path from the source to the destination. We do not highlight the destination ToR switch (ToR31), because its output link (the one connected to E31) is shared by all four paths from the source to the destination, so DARD cannot move flows away from that link anyway. Based on these two observations, we limit the number of switches each source sends SSRs to. For any multi-rooted tree topology with three hierarchies, this set of switches includes (1) the source ToR switch, (2) the aggregation switches directly connected to the source ToR switch, (3) all the core switches, and (4) the aggregation switches directly connected to the destination ToR switch.
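As a rough sketch of the savings, the helper below (our own, for a fat-tree and a single destination in a different pod) counts the switches in this reduced set against the straw man that probes every switch:

```python
def probe_set_size(p, straw_man=False):
    """Switches one source queries per control interval in a p-pod fat-tree."""
    if straw_man:
        return 5 * p * p // 4                  # every switch in the topology
    cores = p * p // 4
    aggr_per_pod = p // 2
    # source ToR + aggregation switches above it + all cores
    # + aggregation switches above the destination ToR (destination assumed in another pod)
    return 1 + aggr_per_pod + cores + aggr_per_pod

for p in (4, 16):
    print(p, probe_set_size(p, straw_man=True), probe_set_size(p))
# p = 4: 20 switches total vs. 9 probed;  p = 16: 320 vs. 81
```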

3.5 Path Selection

As shown in Figure 3, a Path Selector running on an end host takes the detected elephant flows and the path state as input and periodically moves flows from overloaded paths to underloaded paths. We desire to design a stable and distributed algorithm that improves the system's efficiency. This section first introduces the intuition behind DARD's path selection and then describes the algorithm in detail.

In Section 3.4 we define a link's fair share to be the link's bandwidth divided by the number of elephant flows on it.

Algorithm: selfish path selection
1:  for each src-dst ToR switch pair P do
2:      max_index = 0; max_S = 0.0;
3:      min_index = 0; min_S = ∞;
4:      for each i ∈ [1, P.PV.length] do
5:          if P.FV[i] > 0 and max_S < P.PV[i].S then
6:              max_S = P.PV[i].S;
7:              max_index = i;
8:          else if min_S > P.PV[i].S then
9:              min_S = P.PV[i].S;
10:             min_index = i;
11:         end if
12:     end for
13: end for
14: if max_index ≠ min_index then
15:     estimation = P.PV[max_index].bandwidth / (P.PV[max_index].flow_numbers + 1)
16:     if estimation - min_S > δ then
17:         move one elephant flow from path min_index to path max_index
18:     end if
19: end if

A path's fair share is defined as the smallest link fair share along the path. Given an elephant flow's elastic traffic demand and the small delays in datacenters, elephant flows tend to fully and fairly utilize their bottlenecks. As a result, moving one flow from a path with a small fair share to a path with a large fair share pushes both the small and large fair shares toward the middle and thus improves fairness to some extent. Based on this observation, we propose DARD's path selection algorithm, whose high-level idea is to enable every end host to selfishly increase the minimum fair share it can observe. The selfish path selection algorithm above illustrates one round of the path selection process.

In DARD, every source and destination pair maintains two vectors: the path state vector (PV), whose i-th item is the state of the i-th path, and the flow vector (FV), whose i-th item is the number of elephant flows the source is sending along the i-th path. Line 15 estimates the fair share of the max_index-th path if another elephant flow is moved to it. The δ in line 16 is a positive threshold used to decide whether to move a flow. If we set δ to 0, line 16 ensures that the algorithm will not decrease the global minimum fair share. If we set δ to be larger than 0, the algorithm converges as soon as the estimation is close enough to the current minimum fair share. In general, a small δ will evenly split elephant flows among different paths and a large δ will accelerate the algorithm's convergence.
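The following is a simplified, runnable rendering of the algorithm's high-level rule rather than a line-by-line transcription of the pseudocode: pick the lowest-fair-share path among those carrying the sender's own elephant flows (so that there is a flow to move), pick the highest-fair-share path overall, and move one flow if the target path's projected fair share still exceeds the current minimum by more than δ. The function and variable names are ours.

```python
def select_move(PV, FV, delta):
    """One selfish path-selection round for a single src-dst ToR pair (simplified).
    PV[i] = (bottleneck_bandwidth, elephant_flow_count) of path i, from the Path State
    Monitor; FV[i] = number of this host's elephant flows currently on path i.
    Returns (from_path, to_path) if one flow should be moved, else None."""
    def share(i):
        bw, n = PV[i]
        return 0.0 if n == 0 else bw / n       # S = C/N, and 0 if the path carries no elephants

    used = [i for i in range(len(PV)) if FV[i] > 0]
    if not used:
        return None                            # no elephant flow of ours to move
    src = min(used, key=share)                 # our worst path (smallest fair share)
    dst = max(range(len(PV)), key=share)       # best path overall (largest fair share)
    if src == dst:
        return None
    bw, n = PV[dst]
    estimation = bw / (n + 1)                  # fair share of dst if one more flow joins it
    if estimation - share(src) > delta:        # the δ test of line 16
        return (src, dst)
    return None

# Two equal-capacity paths; our only elephant flow sits on the congested one.
print(select_move(PV=[(1000.0, 5), (1000.0, 1)], FV=[1, 0], delta=10.0))   # -> (0, 1)
```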

Existing work shows that load-sensitive adaptive routing protocols can lead to oscillations and instability [22]. Figure 6 shows an example of how an oscillation might happen with the selfish path selection algorithm. There are three source and destination pairs: (E1, E2), (E3, E4) and (E5, E6).


Each pair has two paths and two elephant flows. The source in each pair runs the path state monitor and the path selector independently, without knowing the other two's behaviors. In the beginning, the shared path (link switch1-switch2) has no elephant flows on it. According to the selfish path selection algorithm, the three sources will all move flows to it, increasing the number of elephant flows on the shared path to three, larger than one. This will in turn cause the three sources to move flows away from the shared path. The three sources repeat this process, causing permanent oscillation and bandwidth underutilization.

    Figure 6: Path oscillation example.

The reason for path oscillation is that different sources move flows to under-utilized paths in a synchronized manner. As a result, in DARD, the interval between two adjacent flow movements of the same end host consists of a fixed span of time plus a random span of time. According to the evaluation in Section 5.3.3, simply adding this randomness to the control interval is sufficient to prevent path oscillation.
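A minimal sketch of the jittered interval; the 5 s base and [0 s, 5 s] jitter are the prototype settings given in Section 4.1.

```python
import random

BASE_INTERVAL_S = 5.0    # fixed span between two adjacent flow movements (prototype setting)
JITTER_S = 5.0           # random span added on top, drawn independently by every end host

def next_movement_delay():
    """Jittered control interval that de-synchronizes the end hosts' flow movements."""
    return BASE_INTERVAL_S + random.uniform(0.0, JITTER_S)
```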

4. IMPLEMENTATION

To test DARD's performance in real datacenter networks, we implemented a prototype and deployed it on a 4-pod fat-tree topology in DeterLab [6]. We also implemented a simulator on ns-2 to evaluate DARD's performance on different types of topologies.

4.1 Test Bed

We set up a fat-tree topology using 4-port PCs acting as the switches and configure IP addresses according to the hierarchical addressing scheme described in Section 3.3. All PCs run the Ubuntu 10.04 LTS standard image. All switches run OpenFlow 1.0. An OpenFlow-enabled switch allows us to access and customize the forwarding table. It also maintains per-flow and per-port statistics. Different vendors, e.g., Cisco and HP, have implemented OpenFlow in their products. We implement our prototype on the existing OpenFlow platform to show that DARD is practical and readily deployable.

We implement a NOX [18] component to configure all switches' flow tables during their initialization. This component allocates the downhill table to OpenFlow's flow table 0 and the uphill table to OpenFlow's flow table 1 to enforce a higher priority for the downhill table. All entries are set to be permanent. NOX is often used as a centralized controller for OpenFlow-enabled networks. However, DARD does not rely on it; we use NOX only once to initialize the static flow tables. Each link's bandwidth is configured as 100Mbps.

A daemon program runs on every end host. It has the three components shown in Figure 3. The Elephant Flow Detector leverages TCPTrack [9] at each end host to monitor TCP connections and detects an elephant flow if a TCP connection grows beyond 100KB [16]. The Path State Monitor tracks the fair share of all the equal-cost paths connecting the source and destination ToR switches, as described in Section 3.4. It queries switches for their states using the aggregate flow statistics interfaces provided by the OpenFlow infrastructure [8]. The query interval is set to 1 second. This interval causes an acceptable amount of control traffic, as shown in Section 5.3.4. We leave exploring the impact of varying this query interval to future work. The Path Selector moves elephant flows from overloaded paths to underloaded paths according to the selfish path selection algorithm, where we set δ to 10Mbps. This number is a tradeoff between maximizing the minimum flow rate and fast convergence. The flow movement interval is 5 seconds plus a random time from [0s, 5s]. Because a significant number of elephant flows last for more than 10s [21], this interval setting prevents an elephant flow from completing without ever having the chance to be moved to a less congested path, and at the same time this conservative interval setting limits the frequency of flow movement. We use Linux IP-in-IP tunneling as the encapsulation/decapsulation module. All the mappings from IDs to underlying IP addresses are kept at every end host.

4.2 Simulator

To evaluate DARD's performance in larger topologies, we build a DARD simulator on ns-2, which captures the system's packet-level behavior. The simulator supports fat-tree, Clos network [16] and 3-tier topologies whose oversubscription is larger than 1 [2]. The topology and traffic patterns are passed in as tcl configuration files. A link's bandwidth is 1Gbps and its delay is 0.01ms. The queue size is set to the delay-bandwidth product. TCP New Reno is used as the transport protocol. We use the same settings as the test bed for the rest of the parameters.

5. EVALUATION

This section describes the evaluation of DARD using the DeterLab test bed and ns-2 simulation. We focus this evaluation on four aspects. (1) Can DARD fully utilize the bisection bandwidth and prevent elephant flows from colliding at hot spots? (2) How fast can DARD converge to a stable state given different static traffic patterns? (3) Will DARD's distributed algorithm cause any path oscillation? (4) How much control overhead does DARD introduce to a datacenter?

5.1 Traffic Patterns

Due to the absence of commercial datacenter network traces, we use the three traffic patterns introduced in [10] for both our test bed and simulation evaluations.


(1) Stride, where an end host with index E_ij sends elephant flows to the end host with index E_kj, where k = ((i + 1) mod num_pods) + 1. This traffic pattern emulates the worst case, where a source and a destination are in different pods. As a result, the traffic stresses the links between the core and the aggregation layers. (2) Staggered(ToRP, PodP), where an end host sends elephant flows to another end host connected to the same ToR switch with probability ToRP, to any other end host in the same pod with probability PodP, and to end hosts in different pods with probability 1 - ToRP - PodP. In our evaluation ToRP is 0.5 and PodP is 0.3. This traffic pattern emulates the case where an application's instances are close to each other and most intra-cluster traffic stays within the same pod or even the same ToR switch. (3) Random, where an end host sends elephant flows to any other end host in the topology with uniform probability. This traffic pattern emulates an average case where applications are randomly placed in datacenters.
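A sketch of destination selection for the three patterns, assuming hosts are indexed E_ij with pod i in [1, num_pods] and per-pod host index j; the staggered probabilities are the ToRP = 0.5 and PodP = 0.3 used in the evaluation, and the topology parameters are illustrative.

```python
import random

def stride_dst(i, j, num_pods):
    """Stride: E_ij sends to E_kj with k = ((i + 1) mod num_pods) + 1."""
    return ((i + 1) % num_pods) + 1, j

def staggered_dst(pod, tor, host, num_pods, tors_per_pod, hosts_per_tor,
                  tor_p=0.5, pod_p=0.3):
    """Staggered(ToRP, PodP): same ToR w.p. tor_p, same pod w.p. pod_p, else another pod."""
    r = random.random()
    if r < tor_p:                              # another host under the same ToR switch
        return pod, tor, random.choice([h for h in range(1, hosts_per_tor + 1) if h != host])
    if r < tor_p + pod_p:                      # another ToR in the same pod
        other_tor = random.choice([t for t in range(1, tors_per_pod + 1) if t != tor])
        return pod, other_tor, random.randint(1, hosts_per_tor)
    other_pod = random.choice([p for p in range(1, num_pods + 1) if p != pod])
    return other_pod, random.randint(1, tors_per_pod), random.randint(1, hosts_per_tor)

def random_dst(pod, host, num_pods, hosts_per_pod):
    """Random: any other end host in the topology, chosen uniformly."""
    while True:
        p, h = random.randint(1, num_pods), random.randint(1, hosts_per_pod)
        if (p, h) != (pod, host):
            return p, h
```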

The above three traffic patterns can be either static or dynamic. Static traffic refers to a number of permanent elephant flows from the source to the destination. Dynamic traffic means the elephant flows between a source and a destination start at different times and transfer large files of different sizes. Two key parameters for the dynamic traffic are the flow inter-arrival time and the file size. According to [21], the distribution of inter-arrival times between two flows at an end host has periodic modes spaced by 15ms. Given that 20% of the flows are elephants [21], we set the inter-arrival time between two elephant flows to 75ms. Because 99% of the flows are smaller than 100MB and 90% of the bytes are in flows between 100MB and 1GB [21], we set an elephant flow's size to be uniformly distributed between 100MB and 1GB.

We do not include any short-term TCP flows in the evaluation because elephant flows occupy a significant fraction of the total bandwidth (more than 90% of bytes are in the 1% of largest flows [16, 21]). We leave the short flows' impact on DARD's performance to future work.
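A sketch of the dynamic elephant-flow workload just described: flows arrive with a 75 ms inter-arrival time and carry sizes drawn uniformly from [100 MB, 1 GB]. Whether the 75 ms gap is a fixed spacing or a mean is not spelled out above, so the sketch simply uses a fixed spacing.

```python
import random

INTER_ARRIVAL_S = 0.075                     # 75 ms between successive elephant flows
MIN_SIZE, MAX_SIZE = 100 * 10**6, 10**9     # file sizes uniform in [100 MB, 1 GB]

def elephant_flow_schedule(duration_s):
    """Yield (start_time_s, size_bytes) for one source-destination pair's dynamic traffic."""
    t = 0.0
    while t < duration_s:
        yield t, random.randint(MIN_SIZE, MAX_SIZE)
        t += INTER_ARRIVAL_S

for start, size in elephant_flow_schedule(0.3):
    print(f"{start:.3f}s  {size / 1e6:.0f} MB")
```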

5.2 Test Bed Results

To evaluate whether DARD can fully utilize the bisection bandwidth, we use the static traffic patterns, and for each source and destination pair a TCP connection transfers an infinite file. We constantly measure the incoming bandwidth at every end host. The experiment lasts for one minute. We use the results from the middle 40 seconds to calculate the average bisection bandwidth.

We also implement a static hash-based ECMP and a modified version of flow-level VLB in the test bed. In the ECMP implementation, a flow is forwarded according to a hash of the source and destination's IP addresses and TCP ports. Because flow-level VLB randomly chooses a core switch to forward a flow, it can also introduce collisions at output ports, as ECMP does. In the hope of smoothing out the collisions through randomness, our flow-level VLB implementation randomly picks a new core every 10 seconds for an elephant flow. This 10s control interval is set roughly the same as DARD's control interval; we do not explore other choices. We denote this implementation as periodical VLB (pVLB). We do not implement other approaches, e.g., Hedera, TeXCP and MPTCP, in the test bed. We compare DARD with these approaches in the simulation.

Figure 7 shows the comparison of DARD, ECMP and pVLB's bisection bandwidths under different static traffic patterns. DARD outperforms both ECMP and pVLB. One observation is that the bisection bandwidth gap between DARD and the other two approaches increases in the order of staggered, random and stride. This is because flows through the core have more path diversity than flows inside a pod. Compared with ECMP and pVLB, DARD's strategic path selection can converge to a better flow allocation than simply relying on randomness.

Figure 7: DARD, ECMP and pVLB's bisection bandwidths under different static traffic patterns. Measured on the 4-pod fat-tree test bed.

We also measure DARD's large file transfer times under dynamic traffic patterns. We vary each source-destination pair's flow generating rate from 1 to 10 flows per second. Each elephant flow is a TCP connection transferring a 128MB file. We use a fixed file length for all the flows instead of lengths uniformly distributed between 100MB and 1GB, because we need to differentiate whether finishing a flow earlier is due to a better path selection or a smaller file. The experiment lasts for five minutes. We track the start and end times of every elephant flow and calculate the average file transfer time. We run the same experiment with ECMP and calculate the improvement of DARD over ECMP using formula (2), where avg_T_ECMP is the average file transfer time using ECMP, and avg_T_DARD is the average file transfer time using DARD.

improvement = (avg_T_ECMP - avg_T_DARD) / avg_T_ECMP    (2)

Figure 8 shows the improvement vs. the flow generating rate under different traffic patterns. For the stride traffic pattern, DARD outperforms ECMP because DARD moves flows from overloaded paths to underloaded ones and increases the minimum flow throughput in every step.


We find that the random and staggered traffic patterns share an interesting behavior. When the flow generating rate is low, ECMP and DARD have almost the same performance because the bandwidth is over-provisioned. As the flow generating rate increases, cross-pod flows congest the switch-to-switch links, in which case DARD reallocates the flows sharing the same bottleneck and improves the average file transfer time. When the flow generating rate becomes even higher, the host-switch links are occupied by flows within the same pod and thus become the bottlenecks, in which case DARD helps little.

Figure 8: Improvement of average file transfer time vs. flow generating rate for each src-dst pair (flows/s), under the staggered, random and stride traffic patterns. Measured on the test bed.

5.3 Simulation Results

To fully understand DARD's advantages and disadvantages, besides comparing DARD with ECMP and pVLB, we also compare DARD with Hedera, TeXCP and MPTCP in our simulation. We implement both the demand-estimation and the simulated annealing algorithms described in Hedera and set its scheduling interval to 5 seconds [11]. In the TeXCP implementation, each ToR switch pair maintains the utilizations of all the available paths connecting the two of them by periodic probing (the default probe interval is 200ms; however, since the RTT in a datacenter is on the order of 1ms or even smaller, we decrease this probe interval to 10ms). The control interval is five times the probe interval [20]. We do not implement the flowlet [25] mechanism in the simulator. As a result, each ToR switch schedules traffic at the packet level. We use MPTCP's ns-2 implementation [7] to compare with DARD. Each MPTCP connection uses all the simple paths connecting the source and the destination.

    5.3.1 Performance Improvement

We use the static traffic patterns on a fat-tree with 1024 hosts (p = 16) to evaluate whether DARD can fully utilize the bisection bandwidth in a larger topology. Figure 9 shows the result. DARD achieves higher bisection bandwidth than both ECMP and pVLB under all three traffic patterns. This is because DARD monitors all the paths connecting a source and destination pair and moves flows to underloaded paths to smooth out the collisions caused by ECMP. TeXCP and DARD achieve similar bisection bandwidth; we compare these two approaches in detail later. As a centralized method, Hedera outperforms DARD under both the stride and random traffic patterns. However, given the staggered traffic pattern, Hedera achieves less bisection bandwidth than DARD. This is because the current Hedera only schedules the flows going through the core. When intra-pod traffic is dominant, Hedera degrades to ECMP. We believe this issue can be addressed by introducing new neighbor-generating functions in Hedera. MPTCP outperforms DARD by completely exploring the path diversity. However, it achieves less bisection bandwidth than Hedera. We suspect this is because the current MPTCP ns-2 implementation does not support MPTCP-level retransmission; thus, lost packets are always retransmitted on the same path regardless of how congested the path is. We leave a further comparison between DARD and MPTCP to future work.

Figure 9: Bisection bandwidth under different static traffic patterns. Simulated on a p = 16 fat-tree.

We also measure the large file transfer time to compare DARD with other approaches on a fat-tree topology with 1024 hosts (p = 16). We assign each elephant flow from a dynamic random traffic pattern a unique index and compare its transmission time under the different traffic scheduling approaches. Each experiment lasts for 120s in ns-2. We define T_i^m to be flow i's transmission time when the underlying traffic scheduling approach is m, e.g., DARD or ECMP. We use each T_i^ECMP as the reference and calculate the improvement of file transfer time for every traffic scheduling approach according to formula (3). The improvement_i^ECMP is 0 for all the flows.

improvement_i^m = (T_i^ECMP - T_i^m) / T_i^ECMP    (3)

Figure 10 shows the CDF of the above improvement. ECMP and pVLB have essentially the same performance, since ECMP transfers half of the flows faster than pVLB and the other half slower. Hedera outperforms MPTCP because Hedera achieves higher bisection bandwidth. Even though DARD and TeXCP achieve the same bisection bandwidth under the dynamic random traffic pattern (Figure 9), DARD still outperforms TeXCP.

We further measure every elephant flow's retransmission rate, defined as the number of retransmitted packets over the number of unique packets. Figure 11 shows that TeXCP has a higher retransmission rate than DARD.


In other words, even though TeXCP can achieve a high bisection bandwidth, some of its packets are retransmitted because of reordering, and thus its goodput is not as high as DARD's.

Figure 10: CDF of the improvement of large file transfer time (relative to ECMP) under the dynamic random traffic pattern, for pVLB, TeXCP, DARD, MPTCP and Hedera. Simulated on a p = 16 fat-tree.

Figure 11: CDF of DARD's and TeXCP's TCP retransmission rates. Simulated on a p = 16 fat-tree.

    5.3.2 Convergence Speed

As a distributed path selection algorithm, DARD provably converges to a Nash equilibrium (Appendix B). However, if the convergence takes a significant amount of time, the network will be underutilized during the process. As a result, we measure how fast DARD converges to a stable state, in which every flow stops changing paths. We use the static traffic patterns on a fat-tree with 1024 hosts (p = 16). For each source and destination pair, we vary the number of elephant flows from 1 to 64, the latter being the number of core switches. We start these elephant flows simultaneously and track the time when all the flows stop changing paths. Figure 12 shows the CDF of DARD's convergence time, from which we can see that DARD converges in less than 25s in more than 80% of the cases. Given that DARD's control interval at each end host is roughly 10s, the entire system converges in less than three control intervals.

    5.3.3 Stability

Load-sensitive adaptive routing can lead to oscillation. The main reason is that different sources move flows to underloaded paths in a highly synchronized manner. To prevent this oscillation, DARD adds a random span of time to each end host's control interval. This section evaluates the effect of this simple mechanism.

We use the dynamic random traffic pattern on a fat-tree with 128 end hosts (p = 8) and track the output link utilizations at the core, since a core's output link is usually the bottleneck for an inter-pod elephant flow [11].

Figure 12: CDF of DARD's convergence time under the static traffic patterns (staggered, random, stride): DARD converges to a stable state in two or three control intervals. Simulated on a p = 16 fat-tree.

Figure 13 shows the link utilizations on the 8 output ports of the first core switch. We can see that after the initial oscillation, the link utilizations stabilize.

Figure 13: The first core switch's output port utilizations over time under the dynamic random traffic pattern. Simulated on a p = 8 fat-tree.

However, we cannot simply conclude that DARD does not cause path oscillations, because link utilization is an aggregated metric and misses each individual flow's behavior. We first disable the random time span added to the control interval and log every flow's path selection history. We find that even though the link utilizations are stable, certain flows are constantly moved between two paths, e.g., one 512MB elephant flow was moved between two paths 23 times in its life cycle. This indicates that path oscillation does exist in load-sensitive adaptive routing.

After many attempts, we chose to add a random span of time to the control interval to address the above problem. Figure 14 shows the CDF of how many times flows change their paths during their life cycles. For the staggered traffic, around 90% of the flows stick to their original path assignment. This indicates that when most of the flows are within the same pod or even the same ToR switch, the bottleneck is most likely located at the host-switch links, in which case little path diversity exists. On the other hand, for the stride traffic, where all flows are inter-pod, around 50% of the flows do not change their paths, and the other 50% change their paths fewer than four times. This small number of path changes indicates that DARD is stable and no flow changes its path back and forth.

    5.3.4 Control Overhead

To evaluate DARD's communication overhead, we trace the control messages for both DARD and Hedera on a fat-tree with 128 hosts (p = 8) under the static random traffic pattern.


Figure 14: CDF of the number of times flows change their paths under the dynamic traffic patterns (staggered, random, stride). Simulated on a p = 8 fat-tree.

DARD's communication overhead is mainly introduced by the periodic probes, including both queries from hosts and replies from switches. This communication overhead is bounded by the size of the topology, because in the worst case the system needs to process all-pair probes. However, for Hedera's centralized scheduling, ToR switches report elephant flows to the centralized controller and the controller further updates some switches' flow tables. As a result, the communication overhead is bounded by the number of flows.

Figure 15 shows how much bandwidth is taken by control messages given different numbers of elephant flows. As the number of elephant flows increases, there are three stages. In the first stage (between 0K and 1.5K in this example), DARD's control messages take less bandwidth than Hedera's. This is mainly because Hedera's control messages are larger than DARD's (in Hedera, the payload of a message from a ToR switch to the controller is 80 bytes and the payload of a message from the controller to a switch is 72 bytes; the corresponding numbers are 48 bytes and 32 bytes for DARD). In the second stage (between 1.5K and 3K in this example), DARD's control messages take more bandwidth, because the sources are probing for the states of all the paths to their destinations. In the third stage (more than 3K in this example), DARD's probe traffic is bounded by the topology size. Hedera's communication overhead, however, does not increase proportionally to the number of elephant flows. This is mainly because when the traffic pattern is dense enough, even the centralized scheduling cannot easily find an improved flow allocation, and thus few messages are sent from the controller to the switches.

Figure 15: DARD's and Hedera's (simulated annealing) control overhead (MB/s) vs. the peak number of elephant flows. Simulated on a p = 8 fat-tree.

6. CONCLUSION

This paper proposes DARD, a readily deployable, lightweight distributed adaptive routing system for datacenter networks. DARD allows each end host to selfishly move elephant flows from overloaded paths to underloaded paths. Our analysis shows that this algorithm converges to a Nash equilibrium in finite steps. Test bed emulation and ns-2 simulation show that DARD outperforms random flow-level scheduling when the bottlenecks are not at the edge, and outperforms centralized scheduling when intra-pod traffic is dominant.

7. ACKNOWLEDGEMENT

This material is based upon work supported by the National Science Foundation under Grant No. 1040043.

8. REFERENCES

[1] Amazon Elastic Compute Cloud. http://aws.amazon.com/ec2.
[2] Cisco Data Center Infrastructure 2.5 Design Guide. http://www.cisco.com/en/US/docs/solutions/Enterprise/Data_Center/DC_Infra2_5/DCI_SRND_2_5_book.html.
[3] IP alias limitation in Linux kernel 2.6. http://lxr.free-electrons.com/source/net/core/dev.c#L935.
[4] IP alias limitation in Windows NT 4.0. http://support.microsoft.com/kb/149426.
[5] Microsoft Windows Azure. http://www.microsoft.com/windowsazure.
[6] DeterLab. http://www.isi.deterlab.net/.
[7] ns-2 implementation of MPTCP. http://www.jp.nishida.org/mptcp/.
[8] OpenFlow switch specification, version 1.0.0. http://www.openflowswitch.org/documents/openflow-spec-v1.0.0.pdf.
[9] TCPTrack. http://www.rhythm.cx/~steve/devel/tcptrack/.
[10] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. SIGCOMM Comput. Commun. Rev., 38(4):63-74, 2008.
[11] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic flow scheduling for data center networks. In Proceedings of the 7th ACM/USENIX Symposium on Networked Systems Design and Implementation (NSDI), San Jose, CA, Apr. 2010.
[12] T. Benson, A. Akella, and D. A. Maltz. Network traffic characteristics of data centers in the wild. In Proceedings of the 10th Annual Conference on Internet Measurement, IMC '10, pages 267-280, New York, NY, USA, 2010. ACM.
[13] J.-Y. Le Boudec. Rate adaptation, congestion control and fairness: A tutorial, 2000.
[14] C. Busch and M. Magdon-Ismail. Atomic routing games on maximum congestion. Theor. Comput. Sci., 410(36):3337-3347, 2009.
[15] B. Claise. Cisco Systems NetFlow Services Export Version 9. RFC 3954, 2004.
[16] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: a scalable and flexible data center network. SIGCOMM Comput. Commun. Rev., 39(4):51-62, 2009.
[17] T. Greene. Researchers show off advanced network control technology. http://www.networkworld.com/news/2008/102908-openflow.html.
[18] N. Gude, T. Koponen, J. Pettit, B. Pfaff, M. Casado, N. McKeown, and S. Shenker. NOX: towards an operating system for networks. SIGCOMM Comput. Commun. Rev., 38(3):105-110, 2008.
[19] C. Hopps. Analysis of an Equal-Cost Multi-Path Algorithm. RFC 2992, 2000.
[20] S. Kandula, D. Katabi, B. Davie, and A. Charny. Walking the tightrope: Responsive yet stable traffic engineering. In Proc. ACM SIGCOMM, 2005.
[21] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken. The nature of data center traffic: measurements & analysis. In IMC '09: Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement, pages 202-208, New York, NY, USA, 2009. ACM.
[22] A. Khanna and J. Zinky. The revised ARPANET routing metric. In Symposium Proceedings on Communications Architectures & Protocols, SIGCOMM '89, pages 45-56, New York, NY, USA, 1989. ACM.
[23] M. K. Lakshman, T. V. Lakshman, and S. Sengupta. Efficient and robust routing of highly variable traffic. In Proceedings of the Third Workshop on Hot Topics in Networks (HotNets-III), 2004.
[24] R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat. PortLand: a scalable fault-tolerant layer 2 data center network fabric. SIGCOMM Comput. Commun. Rev., 39(4):39-50, 2009.
[25] S. Sinha, S. Kandula, and D. Katabi. Harnessing TCP's burstiness using flowlet switching. In 3rd ACM SIGCOMM Workshop on Hot Topics in Networks (HotNets), San Diego, CA, November 2004.
[26] D. Wischik, C. Raiciu, A. Greenhalgh, and M. Handley. Design, implementation and evaluation of congestion control for multipath TCP. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI '11, Berkeley, CA, USA, 2011. USENIX Association.
[27] X. Yang. NIRA: a new internet routing architecture. In FDNA '03: Proceedings of the ACM SIGCOMM Workshop on Future Directions in Network Architecture, pages 301-312, New York, NY, USA, 2003. ACM.


    Appendix

A. EXPLANATION OF THE OBJECTIVE

We assume TCP is the dominant transport protocol in datacenters, and that it tries to achieve max-min fairness when combined with fair queuing. Each end host moves flows from overloaded paths to underloaded ones to increase its observed minimum fair share (link l's fair share is defined as the link capacity, C_l, divided by the number of elephant flows, N_l). This section explains that, given a max-min fair bandwidth allocation, the global minimum fair share is a lower bound on the global minimum flow rate; thus, increasing the minimum fair share actually increases the global minimum flow rate.

Theorem 1. Given a max-min fair bandwidth allocation for any network topology and any traffic pattern, the global minimum fair share is a lower bound on the global minimum flow rate.

First we define a bottleneck link according to [13]. A link l is a bottleneck for a flow f if and only if (a) link l is fully utilized, and (b) flow f has the maximum rate among all the flows using link l.

Given a max-min fair bandwidth allocation, link l_i has fair share S_i = C_i / N_i. Suppose link l_0 has the minimum fair share S_0. Flow f has the minimum flow rate, min_rate, and link l_f is flow f's bottleneck. Theorem 1 claims min_rate ≥ S_0. We prove this theorem by contradiction.

According to the bottleneck definition, min_rate is the maximum flow rate on link l_f, and link l_f is fully utilized, so the average rate of the N_f flows on l_f, C_f / N_f, cannot exceed that maximum; thus C_f / N_f ≤ min_rate. Supposing min_rate < S_0, we get

C_f / N_f < S_0    (A1)

But C_f / N_f is link l_f's fair share, so (A1) contradicts S_0 being the minimum fair share. As a result, the minimum fair share is a lower bound on the global minimum flow rate.

In DARD, every end host tries to increase its observed minimum fair share in each round; thus the global minimum fair share increases, and so does the global minimum flow rate.

B. CONVERGENCE PROOF

We now formalize DARD's flow scheduling algorithm and prove that this algorithm converges to a Nash equilibrium in finite steps.

The proof is a special case of a congestion game [14], which is defined as (F, G, {p^f}_{f∈F}). F is the set of all the flows. G = (V, E) is a directed graph. p^f is the set of paths that can be used by flow f. A strategy s = [p^{f_1}_{i_1}, p^{f_2}_{i_2}, ..., p^{f_{|F|}}_{i_{|F|}}] is a collection of paths, in which the i_k-th path in p^{f_k}, denoted p^{f_k}_{i_k}, is used by flow f_k.

For a strategy s and a link j, the link state LS_j(s) is a triple (C_j, N_j, S_j), as defined in Section 3.4. For a path p, the path state PS_p(s) is the link state with the smallest fair share over all links in p. The system state SysS(s) is the link state with the smallest fair share over all links in E. A flow state FS_f(s) is the corresponding path state, i.e., FS_f(s) = PS_p(s), where flow f is using path p.

The notation s_{-k} refers to the strategy s without p^{f_k}_{i_k}, i.e., [p^{f_1}_{i_1}, ..., p^{f_{k-1}}_{i_{k-1}}, p^{f_{k+1}}_{i_{k+1}}, ..., p^{f_{|F|}}_{i_{|F|}}], and (s_{-k}, p^{f_k}_{i'_k}) refers to the strategy [p^{f_1}_{i_1}, ..., p^{f_{k-1}}_{i_{k-1}}, p^{f_k}_{i'_k}, p^{f_{k+1}}_{i_{k+1}}, ..., p^{f_{|F|}}_{i_{|F|}}]. Flow f_k is locally optimal in strategy s if

FS_{f_k}(s).S ≥ FS_{f_k}(s_{-k}, p^{f_k}_{i'_k}).S    (B1)

for all p^{f_k}_{i'_k} ∈ p^{f_k}. A Nash equilibrium is a state where all flows are locally optimal. A strategy s is globally optimal if, for any strategy s', SysS(s').S ≤ SysS(s).S.

Theorem 2. If there is no synchronized flow scheduling, the selfish path selection algorithm will increase the minimum fair share round by round and converge to a Nash equilibrium in finite steps. The globally optimal strategy is also a Nash equilibrium strategy.

For a strategy s, the state vector SV(s) = [v_0(s), v_1(s), v_2(s), ...], where v_k(s) is the number of links whose fair share lies in [kδ, (k+1)δ), and δ is a positive parameter, e.g., 10Mbps, used to cluster links into groups. As a result, Σ_k v_k(s) = |E|. A small δ groups the links at a fine granularity and increases the minimum fair share; a large δ improves the convergence speed. Suppose s and s' are two strategies with SV(s) = [v_0(s), v_1(s), v_2(s), ...] and SV(s') = [v_0(s'), v_1(s'), v_2(s'), ...]. We define s = s' when v_k(s) = v_k(s') for all k ≥ 0, and s < s' when there exists some K such that v_K(s) < v_K(s') and v_k(s) ≤ v_k(s') for all k < K. It is easy to show that given three strategies s, s' and s'', if s ≤ s' and s' ≤ s'', then s ≤ s''.

Given a congestion game (F, G, {p^f}_{f∈F}) and δ, there are only a finite number of state vectors. According to the definitions of = and <, we can find at least one strategy s* that is the smallest, i.e., for any strategy s, s* ≤ s. It is easy to see that this s* has the largest minimum fair share, or has the fewest links with the minimum fair share, and thus is globally optimal.

If only one flow f selfishly changes its route to improve its fair share, changing the strategy from s to s', this action decreases the number of links with small fair shares and increases the number of links with larger fair shares; in other words, s' < s. This indicates that asynchronous and selfish flow movements actually increase the global minimum fair share round by round until all flows reach their locally optimal states. Since the number of state vectors is finite, the number of steps needed to converge to a Nash equilibrium is finite. What is more, because s* is the smallest strategy, no flow can make a further movement to decrease s*, i.e., every flow is in its locally optimal state. Hence this globally optimal strategy s* is also a Nash equilibrium strategy.