
DARD: Distributed Adaptive Routing for Datacenter Networks

Xin Wu and Xiaowei Yang, Dept. of Computer Science, Duke University

Duke-CS-TR-2011-01

ABSTRACT

Datacenter networks typically have many paths connecting

each host pair to achieve high bisection bandwidth for arbi-

trary communication patterns. Fully utilizing the bisection

bandwidth may require flows between the same source and

destination pair to take different paths to avoid hot spots.

However, the existing routing protocols have little support

for load-sensitive adaptive routing. We propose DARD, a

Distributed Adaptive Routing architecture for Datacenter net-

works. DARD allows each end host to move traffic from

overloaded paths to underloaded paths without central co-

ordination. We use an OpenFlow implementation and sim-

ulations to show that DARD can effectively use a datacen-

ter network’s bisection bandwidth under both static and dy-

namic traffic patterns. It outperforms previous solutions based

on random path selection by 10%, and performs similarly to

previous work that assigns flows to paths using a centralized

controller. We use competitive game theory to show that

DARD’s path selection algorithm makes progress in every

step and converges to a Nash equilibrium in finite steps. Our

evaluation results suggest that DARD can achieve a close-

to-optimal solution in practice.

1. INTRODUCTION

Datacenter network applications, e.g., MapReduce and net-

work storage, often demand high intra-cluster bandwidth [11]

to transfer data among distributed components. This is be-

cause the components of an application cannot always be

placed on machines close to each other (e.g., within a rack)

for two main reasons. First, applications may share com-

mon services provided by the datacenter network, e.g., DNS,

search, and storage. These services are not necessarily placed

in nearby machines. Second, the auto-scaling feature offered

by a datacenter network [1, 5] allows an application to cre-

ate dynamic instances when its workload increases. Where

those instances will be placed depends on machine avail-

ability, and is not guaranteed to be close to the application’s

other instances.

Therefore, it is important for a datacenter network to have

high bisection bandwidth to avoid hot spots between any pair

of hosts. To achieve this goal, today’s datacenter networks

often use commodity Ethernet switches to form multi-rooted

tree topologies [21] (e.g., fat-tree [10] or Clos topology [16])

that have multiple equal-cost paths connecting any host pair.

A flow (a flow refers to a TCP connection in this paper) can

use an alternative path if one path is overloaded.

However, legacy transport protocols such as TCP lack the

ability to dynamically select paths based on traffic load. To

overcome this limitation, researchers have advocated a va-

riety of dynamic path selection mechanisms to take advan-

tage of the multiple paths connecting any host pair. At a

high-level view, these mechanisms fall into two categories:

centralized dynamic path selection, and distributed traffic-

oblivious load balancing. A representative example of cen-

tralized path selection is Hedera [11], which uses a central

controller to compute an optimal flow-to-path assignment

based on dynamic traffic load. Equal-Cost-Multi-Path for-

warding (ECMP) [19] and VL2 [16] are examples of traffic-

oblivious load balancing. With ECMP, routers hash flows

based on flow identifiers to multiple equal-cost next hops.

VL2 [16] uses edge switches to forward a flow to a randomly

selected core switch to achieve valiant load balancing [23].

Each of these two design paradigms has merit and im-

proves the available bisection bandwidth between a host pair

in a datacenter network. Yet each has its limitations. A

centralized path selection approach introduces a potential

scaling bottleneck and a centralized point of failure. When

a datacenter scales to a large size, the control traffic sent

to and from the controller may congest the link that con-

nects the controller to the rest of the datacenter network. Dis-
tributed traffic-oblivious load balancing scales well to large

datacenter networks, but may create hot spots, as their flow

assignment algorithms do not consider dynamic traffic load

on each path.

In this paper, we aim to explore the design space that

uses end-to-end distributed load-sensitive path selection to

fully use a datacenter’s bisection bandwidth. This design

paradigm has a number of advantages. First, placing the

path selection logic at an end system rather than inside a

switch facilitates deployment, as it does not require special

hardware or replacing commodity switches. One can also

upgrade or extend the path selection logic later by applying

software patches rather than upgrading switching hardware.

Second, a distributed design can be more robust and scale

better than a centralized approach.

This paper presents DARD, a lightweight, distributed, end

system based path selection system for datacenter networks.

DARD’s design goal is to fully utilize bisection bandwidth

and dynamically balance the traffic among the multiple paths

between any host pair. A key design challenge DARD faces

is how to achieve dynamic distributed load balancing. Un-


like in a centralized approach, with DARD, no end system or

router has a global view of the network. Each end system can

only select a path based on its local knowledge, which makes
achieving close-to-optimal load balancing a challenging

problem.

To address this challenge, DARD uses a selfish path se-

lection algorithm that provably converges to a Nash equilib-

rium in finite steps (Appendix B). Our experimental evalua-

tion shows that the equilibrium’s gap to the optimal solution

is small. To facilitate path selection, DARD uses hierarchi-

cal addressing to represent an end-to-end path with a pair of

source and destination addresses, as in [27]. Thus, an end

system can switch paths by switching addresses.

We have implemented a DARD prototype on DeterLab [6]

and an ns-2 simulator. We use static traffic patterns to show

that DARD converges to a stable state in two to three con-

trol intervals. We use dynamic traffic patterns to show that

DARD outperforms ECMP, VL2 and TeXCP and its per-

formance gap to the centralized scheduling is small. Under

dynamic traffic pattern, DARD maintains stable link utiliza-

tion. About 90% of the flows change their paths fewer than 4 times in their life cycles. Evaluation results also show that

the bandwidth taken by DARD’s control traffic is bounded

by the size of the topology.

DARD is a scalable and stable end host based approach

to load-balance datacenter traffic. We make every effort to

leverage existing infrastructures and to make DARD prac-

tically deployable. The rest of this paper is organized as

follows. Section 2 introduces background knowledge and

discusses related work. Section 3 describes DARD’s design

goals and system components. In Section 4, we introduce

the system implementation details. We evaluate DARD in

Section 5. Section 6 concludes our work.

2. BACKGROUND AND RELATED WORK

In this section, we first briefly introduce what a datacenter

network looks like and then discuss related work.

2.1 Datacenter Topologies

Recent proposals [10, 16, 24] suggest to use multi-rooted

tree topologies to build datacenter networks. Figure 1 shows

a 3-stage multi-rooted tree topology. The topology has three

vertical layers: Top-of-Rack (ToR), aggregation, and core. A

pod is a management unit. It represents a replicable building

block consisting of a number of servers and switches sharing

the same power and management infrastructure.

An important design parameter of a datacenter network

is the oversubscription ratio at each layer of the hierarchy,

which is computed as a layer's downstream bandwidth di-
vided by its upstream bandwidth, as shown in Figure 1. The

oversubscription ratio is usually designed to be larger than one,

assuming that not all downstream devices will be active con-

currently.

We design DARD to work for arbitrary multi-rooted tree

topologies. But for ease of exposition, we mostly use the

Figure 1: A multi-rooted tree topology for a datacenter network. The aggregation layer's oversubscription ratio is defined as BW_down / BW_up.

fat-tree topology to illustrate DARD’s design, unless other-

wise noted. Therefore, we briefly describe what a fat-tree

topology is.

Figure 2 shows a fat-tree topology example. A p-pod fat-

tree topology (in Figure 2, p = 4) has p pods in the horizontal
direction. It uses 5p^2/4 p-port switches and supports non-
blocking communication among p^3/4 end hosts. A pair of
end hosts in different pods have p^2/4 equal-cost paths con-

necting them. Once the two end hosts choose a core switch

as the intermediate node, the path between them is uniquely

determined.

Figure 2: A 4-pod fat-tree topology.

In this paper, we use the term elephant flow to refer to

a continuous TCP connection that transfers more bytes than a
given threshold. We discuss how

to choose this threshold in § 3.

2.2 Related work

Related work falls into three broad categories: adaptive

path selection mechanisms, end host based multipath trans-

mission, and traffic engineering protocols.

Adaptive path selection. Adaptive path selection mecha-

nisms [11,16,19] can be further divided into centralized and

distributed approaches. Hedera [11] adopts a centralized ap-

proach at the granularity of a flow. In Hedera, edge switches

detect and report elephant flows to a centralized controller.

The controller calculates a path arrangement and periodi-

cally updates switches’ routing tables. Hedera can almost

fully utilize a network’s bisection bandwidth, but a recent

data center traffic measurement suggests that this central-

ized approach needs parallelism and fast route computation


to support dynamic traffic patterns [12].

Equal-Cost-Multi-Path forwarding (ECMP) [19] is a dis-

tributed flow-level path selection approach. An ECMP-enabled

switch is configured with multiple next hops for a given des-

tination and forwards a packet according to a hash of se-

lected fields of the packet header. It can split traffic to each

destination across multiple paths. Since packets of the same

flow share the same hash value, they take the same path from

the source to the destination and maintain the packet order.

However, multiple large flows can collide on their hash val-

ues and congest an output port [11].

VL2 [16] is another distributed path selection mechanism.

Different from ECMP, it places the path selection logic at

edge switches. In VL2, an edge switch first forwards a flow

to a randomly selected core switch, which then forwards the

flow to the destination. As a result, multiple elephant flows

can still collide on the same output port, as in ECMP.

DARD belongs to the distributed adaptive path selection

family. It differs from ECMP and VL2 in two key aspects.

First, its path selection algorithm is load sensitive. If mul-

tiple elephant flows collide on the same path, the algorithm

will shift flows from the collided path to more lightly loaded

paths. Second, it places the path selection logic at end sys-

tems rather than at switches to facilitate deployment. A path

selection module running at an end system monitors path

state and switches paths according to path load (§ 3.5). A

datacenter network can deploy DARD by upgrading its end

system’s software stack rather than updating switches.

Multi-path Transport Protocol. A different design paradigm,

multipath TCP (MPTCP) [26], enables an end system to si-

multaneously use multiple paths to improve TCP through-

put. However, it requires applications to use MPTCP rather

than legacy TCP to take advantage of underutilized paths. In

contrast, DARD is transparent to applications as it is imple-

mented as a path selection module under the transport layer

(§ 3). Therefore, legacy applications need not upgrade to

take advantage of multiple paths in a datacenter network.

Traffic Engineering Protocols. Traffic engineering proto-

cols such as TeXCP [20] are originally designed to balance

traffic in an ISP network, but can be adopted by datacenter

networks. However, because these protocols are not end-to-

end solutions, they forward traffic along different paths at
the granularity of a packet rather than a TCP flow. There-
fore, they can cause TCP packet reordering, harming a TCP

flow’s performance. In addition, different from DARD, they

also place the path selection logic at switches and therefore

require upgrading switches.

3. DARD DESIGN

In this section, we describe DARD’s design. We first high-

light the system design goals. Then we present an overview

of the system. We present more design details in the follow-

ing sub-sections.

3.1 Design Goals

DARD’s essential goal is to effectively utilize a datacen-

ter’s bisection bandwidth with practically deployable mech-

anisms and limited control overhead. We elaborate on the de-
sign goals in more detail below.

1. Efficiently utilizing the bisection bandwidth. Given the

large bisection bandwidth in datacenter networks, we aim to

take advantage of the multiple paths connecting each host

pair and fully utilize the available bandwidth. Meanwhile we

desire to prevent any systematic design risk that may cause

packet reordering and decrease the system goodput.

2. Fairness among elephant flows. We aim to provide fair-

ness among elephant flows so that concurrent elephant flows

can evenly share the available bisection bandwidth. We fo-

cus our work on elephant flows for two reasons. First, ex-

isting work shows ECMP and VL2 already perform well on

scheduling a large number of short flows [16]. Second, ele-

phant flows occupy a significant fraction of the total band-

width (more than 90% of bytes are in the 1% of the largest

flows [16]).

3. Lightweight and scalable. We aim to design a lightweight

and scalable system. We desire to avoid a centralized scaling

bottleneck and minimize the amount of control traffic and

computation needed to fully utilize bisection bandwidth.

4. Practically deployable. We aim to make DARD com-

patible with existing datacenter infrastructures so that it can

be deployed without significant modifications or upgrade of

existing infrastructures.

3.2 Overview

In this section, we present an overview of the DARD de-

sign. DARD uses three key mechanisms to meet the above

system design goals. First, it uses a lightweight distributed

end-system-based path selection algorithm to move flows

from overloaded paths to underloaded paths to improve effi-

ciency and prevent hot spots (§ 3.5). Second, it uses hierar-

chical addressing to facilitate efficient path selection (§ 3.3).

Each end system can use a pair of source and destination ad-

dresses to represent an end-to-end path, and vary paths by

varying addresses. Third, DARD places the path selection

logic at an end system to facilitate practical deployment, as

a datacenter network can upgrade its end systems by apply-

ing software patches. It only requires that a switch support

the OpenFlow protocol, and such switches are commercially
available.

Figure 3 shows DARD’s system components and how it

works. Since we choose to place the path selection logic

at an end system, a switch in DARD has only two func-

tions: (1) it forwards packets to the next hop according to a

pre-configured routing table; (2) it keeps track of the Switch

State (SS, defined in § 3.4) and replies to end systems’ Switch

State Request (SSR). Our design implements this function

using the OpenFlow protocol.

An end system has three DARD components as shown in

Figure 3: Elephant Flow Detector, Path State Monitor and


Figure 3: DARD’s system overview. There are multiple paths con-

necting each source and destination pair. DARD is a distributed system

running on every end host. It has three components. The Elephant

Flow Detector detects elephant flows. The Path State Monitor monitors

traffic load on each path by periodically querying the switches. The

Path Selector moves flows from overloaded paths to underloaded paths.

Path Selector. The Elephant Flow Detector monitors all the

output flows and treats one flow as an elephant once its size

grows beyond a threshold. We use 100KB as the threshold

in our implementation. This is because according to a re-

cent study, more than 85% of flows in a datacenter are less

than 100 KB [16]. The Path State Monitor sends SSR to the

switches on all the paths and assembles the SS replies in Path

State (PS, as defined in § 3.4). The path state indicates the

load on each path. Based on both the path state and the de-

tected elephant flows, the Path Selector periodically moves
flows from overloaded paths to underloaded paths.
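As a minimal sketch of the elephant detection step above (the class and callback names are illustrative only, not DARD's actual interfaces), an end host can keep a per-connection byte counter and promote a connection to elephant status once it crosses the 100KB threshold:

ELEPHANT_THRESHOLD = 100 * 1024   # bytes; the 100KB threshold described above

class ElephantFlowDetector:
    """Hedged sketch: promote a TCP connection to elephant status once it has
    sent more than the threshold."""
    def __init__(self, on_new_elephant):
        self.bytes_sent = {}          # (src, dst, sport, dport) -> bytes sent
        self.elephants = set()
        self.on_new_elephant = on_new_elephant

    def account(self, flow_id, nbytes):
        """Called for every outgoing segment of a monitored connection."""
        total = self.bytes_sent.get(flow_id, 0) + nbytes
        self.bytes_sent[flow_id] = total
        if flow_id not in self.elephants and total > ELEPHANT_THRESHOLD:
            self.elephants.add(flow_id)
            self.on_new_elephant(flow_id)   # hand the flow to the Path Selector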

The rest of this section presents more design details of

DARD, including how to use hierarchical addressing to se-

lect paths at an end system (§ 3.3), how to actively monitor

all paths’ state in a scalable fashion (§ 3.4), and how to as-

sign flows from overloaded paths to underloaded paths to

improve efficiency and prevent hot spots (§ 3.5).

3.3 Addressing and Routing

To fully utilize the bisection bandwidth and, at the same

time, to prevent retransmissions caused by packet reorder-

ing (Goal 1), we allow a flow to take different paths in its

life cycle to reach the destination. However, one flow can

use only one path at any given time. Since we are exploring

the design space of putting as much control logic as possi-

ble at the end hosts, we decided to leverage the datacenter's

hierarchical structure to enable an end host to actively select

paths for a flow.

A datacenter network is usually constructed as a multi-

rooted tree. Take Figure 4 as an example: all the switches

and end hosts highlighted by the solid circles form a tree

with its root core1. Three other similar trees exist in the

same topology. This strictly hierarchical structure facili-

tates adaptive routing through some customized addressing

rules [10]. We borrow the idea from NIRA [27] to split

an end-to-end path into uphill and downhill segments and

encode a path in the source and destination addresses. In

DARD, each of the core switches obtains a unique prefix

and then allocates nonoverlapping subdivisions of the prefix

to each of its sub-trees. The sub-trees will recursively allo-

cate nonoverlapping subdivisions of their prefixes to lower

hierarchies. By this hierarchical prefix allocation, each net-

work device receives multiple IP addresses, each of which

represents the device’s position in one of the trees.

As shown in Figure 4, we use core_i to refer to the i-th core and aggr_ij to refer to the j-th aggregation switch in the i-th pod. We follow the same rule to interpret ToR_ij for the
top-of-rack switches and E_ij for the end hosts. We use

the device names prefixed with letter P and delimited by

periods to illustrate how prefixes are allocated along the hi-

erarchies. The first core is allocated with prefix Pcore1. It

then allocates nonoverlapping prefixes Pcore1.Paggr11 and

Pcore1.Paggr21 to two of its sub-trees. The sub-tree rooted

from aggr11 will further allocate four prefixes to lower hier-

archies.

For a general multi-rooted tree topology, the datacenter

operators can generate a similar address assignment schema

and allocate the prefixes along the topology hierarchies. In

case more IP addresses than network cards are assigned to

each end host, we propose to use IP aliasing to configure mul-
tiple IP addresses on one network interface. The latest oper-
ating systems support a large number of IP aliases per
network interface; e.g., Linux kernel 2.6 sets the
limit to 256K IP aliases per interface [3], and Windows NT 4.0 has no limitation on the number of IP addresses that can be

bound to a network interface [4].

One nice property of this hierarchical addressing is that

one host address uniquely encodes the sequence of upper-

level switches that allocate that address, e.g., in Figure 4,

E11’s addressPcore1.Paggr11.PToR11.PE11 uniquely en-

codes the address allocation sequence core1 → aggr11 →ToR11. A source and destination address pair can further

uniquely identify a path, e.g., in Figure 4, we can use the

source and destination pair highlighted by dotted circles to

uniquely encode the dotted path from E11 to E21 through

core1. We call the partial path encoded by source address

the uphill path and the partial path encoded by destination

address the downhill path. To move a flow to a different

path, we can simply use different source and destination ad-

dress combinations without dynamically reconfiguring the

routing tables.
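To make the address-pair encoding concrete, the sketch below enumerates the candidate paths between two hosts from their address lists and switches paths simply by picking a different pair; the concrete prefix strings are illustrative, since in DARD they are assigned by the operator as described above.

from itertools import product

# Hedged sketch: each host owns one address per tree (core) it can be reached
# through. A path is selected by picking one local and one remote address
# that sit under the same core prefix; no routing table changes are needed.
E11_addrs = ["Pcore1.Paggr11.PToR11.PE11", "Pcore2.Paggr11.PToR11.PE11",
             "Pcore3.Paggr12.PToR11.PE11", "Pcore4.Paggr12.PToR11.PE11"]
E21_addrs = ["Pcore1.Paggr21.PToR21.PE21", "Pcore2.Paggr21.PToR21.PE21",
             "Pcore3.Paggr22.PToR21.PE21", "Pcore4.Paggr22.PToR21.PE21"]

def candidate_paths(src_addrs, dst_addrs):
    """Each (src, dst) address pair sharing a core prefix encodes one path."""
    return [(s, d) for s, d in product(src_addrs, dst_addrs)
            if s.split(".")[0] == d.split(".")[0]]

paths = candidate_paths(E11_addrs, E21_addrs)   # p^2/4 = 4 paths for p = 4
# Moving a flow to another path is just re-encapsulating its packets with a
# different (source, destination) address pair.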

To forward a packet, each switch stores a downhill table

and an uphill table as described in [27]. The uphill table

keeps the entries for the prefixes allocated from upstream

switches and the downhill table keeps the entries for the pre-


Figure 4: DARD’s addressing and routing. E11’s address Pcore1.Paggr11.PToR11.PE11 encodes the uphill path ToR11-aggr11-core1. E21’s

address Pcore1.Paggr21.PToR21.PE21 encodes the downhill path core1-aggr21-ToR21 .

fixes allocated to the downstream switches. Table 1 shows

the switch aggr11’s downhill and uphill table. The port in-

dexes are marked in Figure 4. When a packet arrives, a

switch first looks up the destination address in the downhill

table using the longest prefix matching algorithm. If a match

is found, the packet will be forwarded to the corresponding

downstream switch. Otherwise, the switch will look up the

source address in the uphill table to forward the packet to the

corresponding upstream switch. A core switch only has the

downhill table.

downhill table
  Prefix                     Port
  Pcore1.Paggr11.PToR11      1
  Pcore1.Paggr11.PToR12      2
  Pcore2.Paggr11.PToR11      1
  Pcore2.Paggr11.PToR12      2

uphill table
  Prefix                     Port
  Pcore1                     3
  Pcore2                     4

Table 1: aggr11's downhill and uphill routing tables.
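The sketch below walks through this lookup using Table 1's entries; longest-prefix matching is simplified to matching on dotted labels, and the code is an illustration rather than the OpenFlow rules DARD actually installs.

# Hedged sketch of aggr11's forwarding decision using Table 1's entries.
downhill = {"Pcore1.Paggr11.PToR11": 1, "Pcore1.Paggr11.PToR12": 2,
            "Pcore2.Paggr11.PToR11": 1, "Pcore2.Paggr11.PToR12": 2}
uphill   = {"Pcore1": 3, "Pcore2": 4}

def longest_prefix_match(table, addr):
    labels = addr.split(".")
    # Try the longest dotted prefix first, then progressively shorter ones.
    for n in range(len(labels), 0, -1):
        prefix = ".".join(labels[:n])
        if prefix in table:
            return table[prefix]
    return None

def forward(src_addr, dst_addr):
    # Downhill table first: if the destination sits below this switch, go down.
    port = longest_prefix_match(downhill, dst_addr)
    if port is None:
        # Otherwise climb toward the core encoded in the source address.
        port = longest_prefix_match(uphill, src_addr)
    return port

# E11 -> E21 via core1: no downhill match for E21's address, so the uphill
# entry for Pcore1 (port 3) is used.
assert forward("Pcore1.Paggr11.PToR11.PE11", "Pcore1.Paggr21.PToR21.PE21") == 3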

In fact, the downhill-uphill lookup is not necessary

for a fat-tree topology, since a core switch in a fat-tree uniquely

determines the entire path. However, not all the multi-rooted

trees share the same property, e.g., a Clos network. The

downhill-uphill lookup modifies a switch's standard forward-

ing algorithm. However, an increasing number of switches

support highly customized forwarding policies. In our im-

plementation we use OpenFlow enabled switches to support

this forwarding algorithm. All switches’ uphill and down-

hill tables are automatically configured during their initial-

ization. These configurations are static unless the topology

changes.

Each network component is also assigned a location inde-

pendent IP address, ID, which uniquely identifies the com-

ponent and is used for making TCP connections. The map-

ping from IDs to underlying IP addresses is maintained by

a DNS-like system and cached locally. To deliver a packet

from a source to a destination, the source encapsulates the

packet with a proper source and destination address pair.

Switches in the middle will forward the packet according to

the encapsulated packet header. When the packet arrives at

the destination, it will be decapsulated and passed to upper

layer protocols.
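A minimal sketch of this ID-to-locator indirection is shown below; the mapping table and packet representation are simplified placeholders (the prototype in § 4 uses Linux IP-in-IP tunneling and a local mapping cache for the same purpose).

# Hedged sketch: the source keeps ID -> locator-address mappings (cached from
# a DNS-like service) and encapsulates each packet with the locator pair that
# encodes the currently selected path. The dictionary entries are placeholders.
id_to_locators = {
    "E11": ["Pcore1.Paggr11.PToR11.PE11", "Pcore2.Paggr11.PToR11.PE11"],
    "E21": ["Pcore1.Paggr21.PToR21.PE21", "Pcore2.Paggr21.PToR21.PE21"],
}

def encapsulate(packet, src_id, dst_id, path_index):
    """Wrap a packet with the outer (source, destination) locator pair."""
    return {"outer_src": id_to_locators[src_id][path_index],
            "outer_dst": id_to_locators[dst_id][path_index],
            "inner": packet}

def decapsulate(encapsulated):
    """At the destination, strip the outer header and hand the inner packet up."""
    return encapsulated["inner"]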

3.4 Path Monitoring

To achieve load-sensitive path selection at an end host,

DARD informs every end host of the traffic load in the
network. Each end host will accordingly select paths for its
outbound elephant flows. From a high-level perspective, there
are two options to inform an end host of the network traf-
fic load. The first is a mechanism similar to NetFlow [15], in

which traffic is logged at the switches and stored at a cen-

tralized server. We can then either pull or push the log to

the end hosts. The second is an active measurement mechanism,

in which each end host actively probes the network and col-

lects replies. TeXCP [20] chooses the second option. Since

we desire to make DARD not to rely on any conceptually

centralized component to prevent any potential scalability is-

sue, an end host in DARD also uses the active probe method

to monitor the traffic load in the network. This function is

done by the Path State Monitor shown in Figure 3. This

section first describes a straw man design of the Path State

Monitor which enables an end host to monitor the traffic load

in the network. Then we improve the straw man design by

decreasing the control traffic.

We first define some terms before describing the straw man design. We use C_l to denote the output link l's capacity, and N_l to denote the number of elephant flows on the output link l. We define link l's fair share S_l = C_l / N_l as the bandwidth each elephant flow would get if the flows fairly shared that link (S_l = 0 if N_l = 0). Output link l's Link State (LS_l) is defined as the triple [C_l, N_l, S_l]. A switch r's Switch State (SS_r) is defined as {LS_l | l is r's output link}. A path p refers to the set of links that connect a source and destination ToR switch pair. If link l has the smallest S_l among all the links on path p, we use LS_l to represent p's Path State (PS_p).
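These definitions translate directly into code. The sketch below, with illustrative data structures rather than DARD's wire format, computes a path's state as the state of its bottleneck link:

from dataclasses import dataclass

@dataclass
class LinkState:
    capacity: float        # C_l, e.g. in Mbps
    num_elephants: int     # N_l

    @property
    def fair_share(self):  # S_l = C_l / N_l, and 0 if N_l = 0
        return self.capacity / self.num_elephants if self.num_elephants else 0.0

def path_state(links):
    """PS_p: the link state with the smallest fair share along path p."""
    return min(links, key=lambda ls: ls.fair_share)

# Example: a two-link path whose second link carries two elephant flows; its
# path state is the second link's state, with a fair share of 500.0 Mbps.
example = path_state([LinkState(1000.0, 1), LinkState(1000.0, 2)])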

The Path State Monitor is a DARD component running on

an end host and is responsible for monitoring the states of all the

paths to other hosts. In its straw man design, each switch

keeps track of its switch state (SS) locally. The path state


Figure 5: The set of switches a source sends switch state requests to.

monitor of an end host periodically sends the Switch State

Requests (SSR) to every switch and assembles the switch

state replies into path states (PS). These path states indi-

cate the traffic load on each path. This straw man design

requires every switch to have a customized flow counter and

the capability of replying to SSR. These two functions are

already supported by OpenFlow enabled switches [17].

In the straw man design, every end host periodically sends

SSR to all the switches in the topology. The switches will

then reply to the requests. The control traffic in every control

interval (we will discuss this control interval in § 5) can be

estimated using formula (1), where pkt_size refers to the
sum of the request and response packet sizes.

num_of_servers × num_of_switches × pkt_size    (1)
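For example, in a p = 16 fat-tree (p^3/4 = 1024 servers and 5p^2/4 = 320 switches), and assuming a combined request and reply size of roughly 100 bytes (an illustrative figure, not a measurement from this paper), formula (1) gives about 1024 × 320 × 100 bytes ≈ 33 MB of control traffic per control interval.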

Even though the above control traffic is bounded by the

size of the topology, we can still improve the straw man de-

sign by decreasing the control traffic. There are two intu-

itions for the optimizations. First, if an end host is not send-

ing any elephant flow, it is not necessary for the end host

to monitor the traffic load, since DARD is designed to se-

lect paths for only elephant flows. Second, as shown in Fig-

ure 5, E21 is sending an elephant flow to E31. The switches

highlighted by the dotted circles are the ones E21 needs to

send SSR to. The remaining switches are not on any path from the

source to the destination. We do not highlight the destination

ToR switch (ToR31), because its output link (the one con-

nected to E31) is shared by all the four paths from the source

to the destination, thus DARD cannot move flows away from

that link anyway. Based on these two observations, we limit

the number of switches each source sends SSR to. For any

multi-rooted tree topology with three hierarchies, this set of

switches includes (1) the source ToR switch, (2) the aggre-

gation switches directly connected to the source ToR switch,

(3) all the core switches and (4) the aggregation switches

directly connected to the destination ToR switch.
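A sketch of this reduced probe set for a three-layer multi-rooted tree is given below; the topology helpers (aggs_of, cores) are hypothetical, and DARD's prototype derives the same information from its hierarchical addresses.

# Hedged sketch: the set of switches a source probes for one destination,
# following the four cases above.
def probe_set(topology, src_tor, dst_tor):
    switches = {src_tor}                         # (1) the source ToR switch
    switches |= set(topology.aggs_of(src_tor))   # (2) aggs above the source ToR
    switches |= set(topology.cores)              # (3) all core switches
    switches |= set(topology.aggs_of(dst_tor))   # (4) aggs above the destination ToR
    return switches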

3.5 Path Selection

As shown in Figure 3, a Path Selector running on an end

host takes the detected elephant flows and the path state as

the input and periodically moves flows from overloaded paths

to underloaded paths. We desire to design a stable and dis-

tributed algorithm to improve the system’s efficiency. This

section first introduces the intuition of DARD’s path selec-

tion and then describes the algorithm in detail.

In § 3.4 we define a link’s fair share to be the link’s band-

Algorithm: selfish path selection

 1: for each src-dst ToR switch pair P do
 2:     max_index = 0; max_S = 0.0;
 3:     min_index = 0; min_S = ∞;
 4:     for each i ∈ [1, P.PV.length] do
 5:         if P.FV[i] > 0 and max_S < P.PV[i].S then
 6:             max_S = P.PV[i].S;
 7:             max_index = i;
 8:         else if min_S > P.PV[i].S then
 9:             min_S = P.PV[i].S;
10:             min_index = i;
11:         end if
12:     end for
13: end for
14: if max_index ≠ min_index then
15:     estimation = P.PV[max_index].bandwidth / (P.PV[max_index].flow_numbers + 1)
16:     if estimation − min_S > δ then
17:         move one elephant flow from path min_index to path max_index
18:     end if
19: end if

width divided by the number of elephant flows. A path’s

fair share is defined as the smallest link fair share along the

path. Given an elephant flow’s elastic traffic demand and

small delays in datacenters, elephant flows tend to fully and

fairly utilize their bottlenecks. As a result, moving one flow

from a path with small fair share to a path with large fair

share will push both the small and large fair shares toward

the middle and thus improve fairness to some extent. Based

on this observation, we propose DARD’s path selection al-

gorithm, whose high level idea is to enable every end host to

selfishly increase the minimum fair share they can observe.

The selfish path selection algorithm illustrates one round

of the path selection process.

In DARD, every source and destination pair maintains two

vectors: the path state vector (PV), whose i-th item is the
state of the i-th path, and the flow vector (FV), whose i-th
item is the number of elephant flows the source is sending
along the i-th path. Line 15 estimates the fair share of the
max_index-th path if another elephant flow is moved to it.

The δ in line 16 is a positive threshold to decide whether to

move a flow. If we set δ to 0, line 16 is to make sure this

algorithm will not decrease the global minimum fair share.

If we set δ to be larger than 0, the algorithm will converge

as soon as the estimation is close enough to the current

minimum fair share. In general, a small δ will evenly split

elephant flows among different paths and a large δ will ac-

celerate the algorithm’s convergence.
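For concreteness, the sketch below implements one round of this selection for a single source-destination pair. It is a simplified interpretation of the algorithm above: PV is reduced to (bottleneck bandwidth, bottleneck flow count) pairs, an idle path is treated as offering its full bandwidth, the donor path is explicitly required to carry at least one of the host's own flows, and δ is set to the 10Mbps value later used in the prototype (§ 4.1).

DELTA = 10.0  # Mbps; the prototype sets delta to 10Mbps (section 4.1)

def select_move(PV, FV):
    """One round of selfish path selection for a single src-dst pair.
    PV[i] = (bottleneck bandwidth, bottleneck elephant count) of path i,
    FV[i] = number of this host's elephant flows on path i.
    Returns (from_path, to_path), or None if no flow should be moved."""
    # Fair share of each path; an idle path is treated as offering its full
    # bandwidth (a simplifying assumption of this sketch).
    fair = [bw / n if n else bw for bw, n in PV]
    to_path = max(range(len(PV)), key=lambda i: fair[i])
    # Candidate donor paths: only paths this host actually uses.
    used = [i for i in range(len(PV)) if FV[i] > 0]
    if not used:
        return None
    from_path = min(used, key=lambda i: fair[i])
    if from_path == to_path:
        return None
    bw, n = PV[to_path]
    estimation = bw / (n + 1)   # to_path's fair share after gaining one flow
    if estimation - fair[from_path] > DELTA:
        return (from_path, to_path)
    return None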

Existing work shows that load-sensitive adaptive routing

protocols can lead to oscillations and instability [22]. Fig-

ure 6 shows an example of how an oscillation might happen

using the selfish path selection algorithm. There are three

source and destination pairs, (E1, E2), (E3, E4) and (E5,


E6). Each of the pairs has two paths and two elephant flows.

The source in each pair will run the path state monitor and

the path selector independently without knowing the other

two’s behaviors. In the beginning, the shared path, (link

switch1-switch2), has not elephant flows on it. Accord-

ing to the selfish path selection algorithm, the three sources

will move flows to it and increase the number of elephant

flows on the shared path to three, larger than one. This will

further cause the three sources to move flows away from the

shared paths. The three sources repeat this process and cause

permanent oscillation and bandwidth underutilization.

Figure 6: Path oscillation example.

The reason for path oscillation is that different sources

move flows to under-utilized paths in a synchronized man-

ner. As a result, in DARD, the interval between two adjacent

flow movements of the same end host consists of a fixed span

of time and a random span of time. According to the evalua-

tion in § 5.3.3, simply adding this randomness in the control

interval can prevent path oscillation.
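A minimal sketch of this jittered control interval, using the constants later chosen for the prototype (§ 4.1: a 5-second fixed span plus a random span drawn from [0s, 5s]), is:

import random

FIXED_SPAN = 5.0    # seconds; fixed part of the control interval (section 4.1)
RANDOM_SPAN = 5.0   # seconds; random part drawn uniformly from [0s, 5s]

def next_move_delay():
    """Jitter the interval between two flow movements to desynchronize hosts."""
    return FIXED_SPAN + random.uniform(0.0, RANDOM_SPAN)

# In the Path Selector's loop (sketch): sleep for next_move_delay() seconds,
# then run one round of selfish path selection.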

4. IMPLEMENTATION

To test DARD’s performance in real datecenter networks,

we implemented a prototype and deployed it in a 4-pod fat-

tree topology in DeterLab [6]. We also implemented a simu-

lator on ns-2 to evaluate DARD’s performance on different

types of topologies.

4.1 Test Bed

We set up a fat-tree topology using 4-port PCs acting as

the switches and configure IP addresses according to the hi-

erarchical addressing scheme described in § 3.3. All PCs

run the Ubuntu 10.04 LTS standard image. All switches run

OpenFlow 1.0. An OpenFlow enabled switch allows us to

access and customize the forwarding table. It also maintains

per flow and per port statistics. Different vendors, e.g., Cisco

and HP, have implemented OpenFlow in their products. We

implement our prototype based on existing OpenFlow plat-

form to show DARD is practical and readily deployable.

We implement a NOX [18] component to configure all

switches’ flow tables during their initialization. This com-

ponent allocates the downhill table to OpenFlow’s flow table

0 and the uphill table to OpenFlow’s flow table 1 to enforce

a higher priority for the downhill table. All entries are set to

be permanent. NOX is often used as a centralized controller

for OpenFlow enabled networks. However, DARD does not

rely on it. We use NOX only once to initialize the static flow

tables. Each link’s bandwidth is configured as 100Mbps.

A daemon program is running on every end host. It has

the three components shown in Figure 3. The Elephant Flow

Detector leverages TCPTrack [9] at each end host to

monitor TCP connections and detects an elephant flow if a

TCP connection grows beyond 100KB [16]. The Path State

Monitor tracks the fair share of all the equal-cost paths con-

necting the source and destination ToR switches as described

in § 3.4. It queries switches for their states using the aggre-

gate flow statistics interfaces provided by OpenFlow infras-

tructure [8]. The query interval is set to 1 second. This inter-

val causes an acceptable amount of control traffic as shown in

§ 5.3.4. We leave exploring the impact of varying this query

interval to our future work. A Path Selector moves elephant

flows from overloaded paths to underloaded paths according

to the selfish path selection algorithm, where we set δ to

be 10Mbps. This number is a tradeoff between maximiz-

ing the minimum flow rate and fast convergence. The flow

movement interval is 5 seconds plus a random time from

[0s, 5s]. Because a significant fraction of elephant flows last
for more than 10s [21], this interval setting prevents an ele-
phant flow from completing before it ever has a chance to
be moved to a less congested path, and at the same time,
this conservative interval limits the frequency of flow

movement. We use the Linux IP-in-IP tunneling as the en-

capsulation/decapsulation module. All the mappings from

IDs to underlying IP addresses are kept at every end host.

4.2 Simulator

To evaluate DARD’s performance in larger topologies, we

build a DARD simulator on ns-2, which captures the sys-

tem’s packet level behavior. The simulator supports fat-tree,

Clos network [16] and the 3-tier topologies whose oversub-

scription is larger than 1 [2]. The topology and traffic pat-

terns are passed in as tcl configuration files. A link’s band-

width is 1Gbps and its delay is 0.01ms. The queue size is

set to be the delay-bandwidth product. TCP New Reno is

used as the transport protocol. We use the same settings as

the test bed for the rest of the parameters.

5. EVALUATION

This section describes the evaluation of DARD using De-

terLab test bed and ns-2 simulation. We focus this evalua-

tion on four aspects. (1) Can DARD fully utilize the bisec-

tion bandwidth and prevent elephant flows from colliding at

some hot spots? (2) How fast can DARD converge to a stable

state given different static traffic patterns? (3) Will DARD’s

distributed algorithm cause any path oscillation? (4) How

much control overhead does DARD introduce to a datacen-

ter?

5.1 Traffic Patterns

Due to the absence of commercial datacenter network traces,


we use the three traffic patterns introduced in [10] for both

our test bed and simulation evaluations. (1) Stride, where an

end host with index Eij sends elephant flows to the end host

with index Ekj, where k = ((i + 1) mod (num_pods)) + 1.

This traffic pattern emulates the worst case where a source

and a destination are in different pods. As a result, the traffic

stresses out the links between the core and the aggregation

layers. (2) Staggered(ToRP,PodP), where an end host sends

elephant flows to another end host connecting to the same

ToR switch with probability ToRP , to any other end host in

the same pod with probability PodP and to the end hosts in

different pods with probability 1 − ToRP − PodP . In our

evaluation ToRP is 0.5 and PodP is 0.3. This traffic pattern

emulates the case where an application’s instances are close

to each other and most intra-cluster traffic is in the same

pod or even in the same ToR switch. (3) Random, where

an end host sends elephant flows to any other end host in

the topology with a uniform probability. This traffic pattern

emulates an average case where applications are randomly

placed in datacenters.

The above three traffic patterns can be either static or

dynamic. The static traffic refers to a number of perma-

nent elephant flows from the source to the destination. The

dynamic traffic means the elephant flows between a source

and a destination start at different times. The elephant flows

transfer large files of different sizes. Two key parameters for

the dynamic traffic are the flow inter-arrival time and the file

size. According to [21], the distribution of inter-arrival times

between two flows at an end host has periodic modes spaced

by 15ms. Given 20% of the flows are elephants [21], we set

the inter-arrival time between two elephant flows to 75ms.

Because 99% of the flows are smaller than 100MB and 90% of the bytes are in flows between 100MB and 1GB [21], we

set an elephant flow’s size to be uniformly distributed be-

tween 100MB and 1GB.
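A sketch of this dynamic elephant-flow workload generator is shown below; start_flow and pick_destination are hypothetical callbacks standing in for however the test bed or simulator launches a transfer and chooses a destination under the given traffic pattern.

import random

MB = 1024 * 1024
INTER_ARRIVAL = 0.075   # 75 ms between elephant flows at a host [21]

def generate_elephants(duration_s, start_flow, pick_destination):
    """Start elephant flows with fixed inter-arrival times and sizes drawn
    uniformly between 100MB and 1GB, for duration_s seconds."""
    t = 0.0
    while t < duration_s:
        size = random.uniform(100 * MB, 1024 * MB)
        start_flow(at=t, dst=pick_destination(), nbytes=int(size))
        t += INTER_ARRIVAL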

We do not include any short term TCP flows in the eval-

uation because elephant flows occupy a significant fraction

of the total bandwidth (more than 90% of bytes are in the

1% of flows [16, 21]). We leave the short flow’s impact on

DARD’s performance to our future work.

5.2 Test Bed Results

To evaluate whether DARD can fully utilize the bisection

bandwidth, we use the static traffic patterns and for each

source and destination pair, a TCP connection transfers an

infinite file. We constantly measure the incoming bandwidth

at every end host. The experiment lasts for one minute. We

use the results from the middle 40 seconds to calculate the

average bisection bandwidth.

We also implement a static hash-based ECMP and a mod-

ified version of flow-level VLB in the test bed. In the ECMP

implementation, a flow is forwarded according to a hash of

the source and destination’s IP addresses and TCP ports.

Because a flow-level VLB randomly chooses a core switch

to forward a flow, it can also introduce collisions at output

ports, as in ECMP. In the hope of smoothing out the collisions by

randomness, our flow-level VLB implementation randomly

picks a core every 10 seconds for an elephant flow. This

10s control interval is set roughly the same as DARD’s con-

trol interval. We do not explore other choices. We refer to this
implementation as periodic VLB (pVLB). We do not imple-

ment other approaches, e.g., Hedera, TeXCP and MPTCP, in

the test bed. We compare DARD and these approaches in

the simulation.

Figure 7 shows the comparison of DARD, ECMP and

pVLB’s bisection bandwidths under different static traffic

patterns. DARD outperforms both ECMP and pVLB. One

observation is that the bisection bandwidth gap between DARD

and the other two approaches increases in the order of stag-

gered, random and stride. This is because flows through the

core have more path diversities than the flows inside a pod.

Compared with ECMP and pVLB, DARD’s strategic path

selection can converge to a better flow allocation than sim-

ply relying on randomness.

Figure 7: DARD, ECMP and pVLB’s bisection bandwidths under

different static traffic patterns. Measured on 4-pod fat-tree test bed.

We also measure DARD’s large file transfer times under

dynamic traffic patterns. We vary each source-destination

pair’s flow generating rate from 1 to 10 per second. Each

elephant flow is a TCP connection transferring a 128MB

file. We use a fixed file length for all the flows instead of

lengths uniformly distributed between 100MB and 1GB. This
is because we need to distinguish whether a flow finishes
earlier because of better path selection or simply because of a smaller file.

The experiment lasts for five minutes. We track the start and

the end time of every elephant flow and calculate the aver-

age file transfer time. We run the same experiment on ECMP

and calculate the improvement of DARD over ECMP using

formula (2), where avg_T_ECMP is the average file transfer
time using ECMP, and avg_T_DARD is the average file trans-
fer time using DARD.

improvement = (avg_T_ECMP − avg_T_DARD) / avg_T_ECMP    (2)

Figure 8 shows the improvement vs. the flow generating

rate under different traffic patterns. For the stride traffic

pattern, DARD outperforms ECMP because DARD moves

flows from overloaded paths to underloaded ones and in-


creases the minimum flow throughput in every step. We find

both random traffic and staggered traffic share an interest-

ing pattern. When the flow generating rate is low, ECMP

and DARD have almost the same performance because the

bandwidth is over-provisioned. As the flow generating rate in-

creases, cross-pod flows congest the switch-to-switch links,

in which case DARD reallocates the flows sharing the same

bottleneck and improves the average file transfer time. When

flow generating rate becomes even higher, the host-switch

links are occupied by flows within the same pod and thus

become the bottlenecks, in which case DARD helps little.

Figure 8: Improvement of average file transfer time vs. the flow generating rate for each src-dst pair (number_of_flows / s), for staggered, random, and stride traffic. Measured on the test bed.

5.3 Simulation Results

To fully understand DARD’s advantages and disadvan-

tages, besides comparing DARD with ECMP and pVLB, we

also compare DARD with Hedera, TeXCP and MPTCP in

our simulation. We implement both the demand-estimation

and the simulated annealing algorithm described in Hedera

and set its scheduling interval to 5 seconds [11]. In the

TeXCP implementation each ToR switch pair maintains the

utilizations for all the available paths connecting the two of

them by periodic probing (the default probe interval is
200ms; however, since the RTT in a datacenter is on the order
of 1ms or even smaller, we decrease this probe interval to
10ms). The control interval is five times the probe inter-
val [20]. We do not implement the flowlet [25] mechanism

in the simulator. As a result, each ToR switch schedules

traffic at packet level. We use MPTCP’s ns-2 implementa-

tion [7] to compare with DARD. Each MPTCP connection

uses all the simple paths connecting the source and the des-

tination.

5.3.1 Performance Improvement

We use static traffic pattern on a fat-tree with 1024 hosts

(p = 16) to evaluate whether DARD can fully utilize the

bisection bandwidth in a larger topology. Figure 9 shows

the result. DARD achieves higher bisection bandwidth than

both ECMP and pVLB under all the three traffic patterns.

This is because DARD monitors all the paths connecting a

source and destination pair and moves flows to underloaded

paths to smooth out the collisions caused by ECMP. TeXCP

and DARD achieve similar bisection bandwidth. We will

compare these two approaches in detail later. As a central-

ized method, Hedera outperforms DARD under both stride

and random traffic patterns. However, given staggered traf-

fic pattern, Hedera achieves less bisection bandwidth than

DARD. This is because current Hedera only schedules the

flows going through the core. When intra-pod traffic is dom-

inant, Hedera degrades to ECMP. We believe this issue can

be addressed by introducing new neighbor generating func-

tions in Hedera. MPTCP outperforms DARD by completely

exploring the path diversities. However, it achieves less bi-

section bandwidth than Hedera. We suspect this is because

the current MPTCP’s ns-2 implementation does not support

MPTCP level retransmission. Thus, lost packets are always

retransmitted on the same path regardless of how congested the

path is. We leave a further comparison between DARD and

MPTCP as our future work.

Figure 9: Bisection bandwidth under different static traffic patterns.

Simulated on p = 16 fat-tree.

We also measure the large file transfer time to compare

DARD with other approaches on a fat-tree topology with

1024 hosts (p = 16). We assign each elephant flow from

a dynamic random traffic pattern a unique index, and com-

pare its transmission times under different traffic scheduling

approaches. Each experiment lasts for 120s in ns-2. We de-

fine T_i^m to be flow i's transmission time when the underly-
ing traffic scheduling approach is m, e.g., DARD or ECMP.
We use each T_i^ECMP as the reference and calculate the im-
provement of file transfer time for every traffic scheduling
approach according to formula (3). The improvement_i^ECMP
is 0 for all the flows, since ECMP is the baseline.

improvement_i^m = (T_i^ECMP − T_i^m) / T_i^ECMP    (3)

Figure 10 shows the CDF of the above improvement. ECMP

and pVLB essentially have the same performance, since ECMP

transfers half of the flows faster than pVLB and transfers

the other half slower than pVLB. Hedera outperforms MPTCP

because Hedera achieves higher bisection bandwidth. Even

though DARD and TeXCP achieve the same bisection band-

width under dynamic random traffic pattern (Figure 9), DARD

still outperforms TeXCP.

We further measure every elephant flow’s retransmission

rate, which is defined as the number of retransmitted packets

over the number of unique packets. Figure 11 shows TeXCP

has a higher retransmission rate than DARD. In other words,

even though TeXCP can achieve a high bisection bandwidth,


some of the packets are retransmitted because of reordering

and thus its goodput is not as high as DARD's.

−5% 0 5% 10% 15% 20% 25% 30%0

20%

40%

60%

80%

100%

Improvement of file transfer time

CD

F

pVLBTeXCPDARDMPTCPHedera

Figure 10: Improvement of large file transfer time under dynamic

random traffic pattern. Simulated on p = 16 fat-tree.

Figure 11: CDF of DARD and TeXCP's TCP retransmission rates. Simulated on p = 16 fat-tree.

5.3.2 Convergence Speed

Being a distributed path selection algorithm, DARD is

provably converges to a Nash equilibrium (Appendix B).

However, if the convergence takes a significant amount of

time, the network will be underutilized during the process.

As a result, we measure how fast DARD can converge to a
stable state, in which every flow stops changing paths. We

use the static traffic patterns on a fat-tree with 1024 hosts

(p = 16). For each source and destination pair, we vary the

number of elephant flows from 1 to 64, which is the number

of core switches. We start these elephant flows simultane-

ously and track the time when all the flows stop changing

paths. Figure 12 shows the CDF of DARD's convergence

time, from which we can see DARD converges in less than

25s for more than 80% of the cases. Given DARD’s control

interval at each end host is roughly 10s, the entire system

converges in less than three control intervals.

5.3.3 Stability

Load-sensitive adaptive routing can lead to oscillation. The

main reason is that different sources move flows to un-

derloaded paths in a highly synchronized manner. To prevent

this oscillation, DARD adds a random span of time to each

end host’s control interval. This section evaluates the effects

of this simple mechanism.

We use dynamic random traffic pattern on a fat-tree with

128 end hosts (p = 8) and track the output link utilizations at

the core, since a core’s output link is usually the bottleneck

Figure 12: CDF of DARD's convergence time under the static staggered, random, and stride traffic patterns; DARD converges to a stable state in two or three control intervals. Simulated on p = 16 fat-tree.

for an inter-pod elephant flow [11]. Figure 13 shows the link

utilizations on the 8 output ports at the first core switch. We

can see that after the initial oscillation the link utilizations

stabilize.

Figure 13: Link utilization over time on the first core switch's eight output ports under dynamic random traffic pattern. Simulated on p = 8 fat-tree.

However, we cannot simply conclude that DARD does not
cause path oscillations, because the link utilization is an ag-
gregate metric and misses every single flow's behavior. We

first disable the random time span added to the control inter-

val and log every single flow’s path selection history. As a

result, we find that even though the link utilizations are sta-

ble, Certain flows are constantly moved between two paths,

e.g., one 512MB elephant flow are moved between two paths

23 times in its life cycle. This indicates path oscillation does

exist in load-sensitive adaptive routing.

After many attempts, we choose to add a random span of

time to the control interval to address above problem. Fig-

ure 14 shows the CDF of how many times flows change their

paths in their life cycles. For the staggered traffic, around

90% of the flows stick to their original path assignment. This

indicates when most of the flows are within the same pod or

even the same ToR switch, the bottleneck is most likely lo-

cated at the host-switch links, in which case few path diver-

sities exist. On the other hand, for the stride traffic, where all

flows are inter-pod, around 50% of the flows do not change

their paths. Another 50% change their paths for less than 4times. This small number of path changing times indicates

that DARD is stable and no flow changes its paths back and

forth.

5.3.4 Control Overhead

To evaluate DARD's communication overhead, we trace the control messages of both DARD and Hedera on a fat-tree with 128 hosts (p = 8) under the static random traffic pattern.

Figure 14: CDF of the number of times flows change their paths under the dynamic traffic patterns (staggered, random, and stride). Simulated on p = 8 fat-tree.

DARD's communication overhead is mainly introduced by the periodic probes, including both queries from hosts and replies from switches. This overhead is bounded by the size of the topology, because in the worst case the system needs to process all-pair probes. For Hedera's centralized scheduling, on the other hand, ToR switches report elephant flows to the centralized controller and the controller further updates some switches' flow tables; as a result, the communication overhead is bounded by the number of flows.

Figure 15 shows how much bandwidth is taken by control messages for different numbers of elephant flows. As the number of elephant flows increases, we observe three stages. In the first stage (between 0K and 1.5K in this example), DARD's control messages take less bandwidth than Hedera's, mainly because Hedera's control messages are larger than DARD's: in Hedera, the payload of a message from a ToR switch to the controller is 80 bytes and the payload of a message from the controller to a switch is 72 bytes, whereas these two numbers are 48 bytes and 32 bytes for DARD. In the second stage (between 1.5K and 3K in this example), DARD's control messages take more bandwidth, because the sources are probing the states of all the paths to their destinations. In the third stage (more than 3K in this example), DARD's probe traffic is bounded by the topology size. Hedera's communication overhead also stops increasing proportionally to the number of elephant flows, mainly because when the traffic pattern is dense enough, even centralized scheduling cannot easily find an improved flow allocation, and thus few messages are sent from the controller to the switches.

Figure 15: DARD's and Hedera's (simulated annealing) control overhead in MB/s versus the peak number of elephant flows. Simulated on p = 8 fat-tree.
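To make the scaling comparison concrete, the following back-of-the-envelope estimator uses the message payload sizes quoted above; the pair, path, and flow counts in the example are illustrative assumptions, and this is a sketch rather than a reproduction of Figure 15.

```python
# Back-of-the-envelope estimators for control traffic per control interval,
# using the payload sizes quoted in the text (DARD: 48 B query / 32 B reply;
# Hedera: 80 B report / 72 B update). The pair, path, and flow counts below
# are illustrative assumptions; absolute numbers depend on probe frequency
# and traffic pattern.

def dard_bytes_per_interval(num_src_dst_pairs, paths_per_pair):
    """Worst case: every source probes every path to each destination once."""
    return num_src_dst_pairs * paths_per_pair * (48 + 32)   # query + reply

def hedera_bytes_per_interval(num_elephant_flows, updates_per_flow=1):
    """ToR switches report each elephant flow; the controller pushes updates."""
    return num_elephant_flows * (80 + updates_per_flow * 72)

# Example: a p = 8 fat-tree has 128 hosts and 16 equal-cost paths between
# any inter-pod host pair.
print(dard_bytes_per_interval(128 * 127, 16))   # bounded by topology size
print(hedera_bytes_per_interval(3000))          # grows with elephant flows
```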

6. CONCLUSION

This paper proposes DARD, a readily deployable, lightweight

distributed adaptive routing system for datacenter networks.

DARD allows each end host to selfishly move elephant flows

from overloaded paths to underloaded paths. Our analysis

shows that this algorithm converges to a Nash equilibrium

in finite steps. Test bed emulation and ns-2 simulation show

that DARD outperforms random flow-level scheduling when

the bottlenecks are not at the edge, and outperforms central-

ized scheduling when intra-pod traffic is dominant.

7. ACKNOWLEDGEMENT

This material is based upon work supported by the Na-

tional Science Foundation under Grant No. 1040043.

8. REFERENCES
[1] Amazon elastic compute cloud. http://aws.amazon.com/ec2.

[2] Cisco Data Center Infrastructure 2.5 Design Guide.

http://www.cisco.com/en/US/docs/solutions/Enterprise/Data_Center/DC_Infra2_5/DCI_SRND_2_5_book.html.

[3] IP alias limitation in linux kernel 2.6.

http://lxr.free-electrons.com/source/net/core/dev.c#L935.

[4] IP alias limitation in windows nt 4.0.

http://support.microsoft.com/kb/149426.

[5] Microsoft Windows Azure. http://www.microsoft.com/windowsazure.

[6] DETERlab testbed. http://www.isi.deterlab.net/.

[7] ns-2 implementation of mptcp. http://www.jp.nishida.org/mptcp/.

[8] OpenFlow switch specification, version 1.0.0. http://www.openflowswitch.org/documents/openflow-spec-v1.0.0.pdf.

[9] TCPTrack. http://www.rhythm.cx/~steve/devel/tcptrack/.

[10] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center

network architecture. SIGCOMM Comput. Commun. Rev., 38(4):63–74, 2008.

[11] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat.

Hedera: Dynamic flow scheduling for data center networks. In Proceedings of

the 7th ACM/USENIX Symposium on Networked Systems Design and

Implementation (NSDI), San Jose, CA, Apr. 2010.

[12] T. Benson, A. Akella, and D. A. Maltz. Network traffic characteristics of data

centers in the wild. In Proceedings of the 10th annual conference on Internet

measurement, IMC ’10, pages 267–280, New York, NY, USA, 2010. ACM.

[13] J.-Y. Boudec. Rate adaptation, congestion control and fairness: A tutorial, 2000.

[14] C. Busch and M. Magdon-Ismail. Atomic routing games on maximum

congestion. Theor. Comput. Sci., 410(36):3337–3347, 2009.

[15] B. Claise. Cisco Systems Netflow Services Export Version 9. RFC 3954, 2004.

[16] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A.

Maltz, P. Patel, and S. Sengupta. Vl2: a scalable and flexible data center

network. SIGCOMM Comput. Commun. Rev., 39(4):51–62, 2009.

[17] T. Greene. Researchers show off advanced network control technology.

http://www.networkworld.com/news/2008/102908-openflow.html.

[18] N. Gude, T. Koponen, J. Pettit, B. Pfaff, M. Casado, N. McKeown, and

S. Shenker. Nox: towards an operating system for networks. SIGCOMM

Comput. Commun. Rev., 38(3):105–110, 2008.

[19] C. Hopps. Analysis of an Equal-Cost Multi-Path Algorithm. RFC 2992, 2000.

[20] S. Kandula, D. Katabi, B. Davie, and A. Charny. Walking the tightrope:

Responsive yet stable traffic engineering. In In Proc. ACM SIGCOMM, 2005.

[21] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken. The nature of

data center traffic: measurements & analysis. In IMC ’09: Proceedings of the

9th ACM SIGCOMM conference on Internet measurement conference, pages

202–208, New York, NY, USA, 2009. ACM.

[22] A. Khanna and J. Zinky. The revised arpanet routing metric. In Symposium

proceedings on Communications architectures & protocols, SIGCOMM ’89,

pages 45–56, New York, NY, USA, 1989. ACM.

[23] M. K. Lakshman, T. V. Lakshman, and S. Sengupta. Efficient and robust routing

of highly variable traffic. In In Proceedings of Third Workshop on Hot Topics in

Networks (HotNets-III), 2004.

[24] R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri,

S. Radhakrishnan, V. Subramanya, and A. Vahdat. Portland: a scalable

fault-tolerant layer 2 data center network fabric. SIGCOMM Comput. Commun.

Rev., 39(4):39–50, 2009.

[25] S. Sinha, S. Kandula, and D. Katabi. Harnessing TCPs Burstiness using Flowlet

Switching. In 3rd ACM SIGCOMM Workshop on Hot Topics in Networks

(HotNets), San Diego, CA, November 2004.

[26] D. Wischik, C. Raiciu, A. Greenhalgh, and M. Handley. Design,

implementation and evaluation of congestion control for multipath tcp. In

Proceedings of the 8th USENIX conference on Networked systems design and


implementation, NSDI’11, pages 8–8, Berkeley, CA, USA, 2011. USENIX

Association.

[27] X. Yang. Nira: a new internet routing architecture. In FDNA ’03: Proceedings

of the ACM SIGCOMM workshop on Future directions in network architecture,

pages 301–312, New York, NY, USA, 2003. ACM.


Appendix

A. EXPLANATION OF THE OBJECTIVE

We assume TCP is the dominant transport protocol in datacenters, and that it tries to achieve max-min fairness when combined with fair queuing. Each end host moves flows from overloaded paths to underloaded ones to increase its observed minimum fair share (link $l$'s fair share is defined as the link capacity divided by the number of elephant flows on it, i.e., $S_l = C_l/N_l$). This section explains that, given max-min fair bandwidth allocation, the global minimum fair share is a lower bound on the global minimum flow rate; thus increasing the minimum fair share actually increases the global minimum flow rate.

Theorem 1. Given max-min fair bandwidth allocation for any network topology and any traffic pattern, the global minimum fair share is a lower bound on the global minimum flow rate.

First we define a bottleneck link according to [13]: a link $l$ is a bottleneck for a flow $f$ if and only if (a) link $l$ is fully utilized, and (b) flow $f$ has the maximum rate among all the flows using link $l$. Given max-min fair bandwidth allocation, link $l_i$ has the fair share $S_i = C_i/N_i$. Suppose link $l_0$ has the minimum fair share $S_0$, flow $f$ has the minimum flow rate $\mathit{min\_rate}$, and link $l_f$ is flow $f$'s bottleneck. Theorem 1 claims $\mathit{min\_rate} \ge S_0$. We prove this by contradiction.

According to the bottleneck definition, $\mathit{min\_rate}$ is the maximum flow rate on the fully utilized link $l_f$, so it is at least the average rate on that link, i.e., $C_f/N_f \le \mathit{min\_rate}$, where $C_f$ and $N_f$ are link $l_f$'s capacity and elephant-flow count. Suppose $\mathit{min\_rate} < S_0$; then

$$C_f/N_f < S_0 \qquad \text{(A1)}$$

But (A1) contradicts $S_0$ being the minimum fair share, since $C_f/N_f$ is itself a fair share. As a result, the minimum fair share is a lower bound on the global minimum flow rate.

In DARD, every end host tries to increase its observed minimum fair share in each round; thus the global minimum fair share increases, and so does the global minimum flow rate.
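As a small illustrative example (the capacities and flow counts below are ours, chosen only to make the bound concrete): consider two independent links, link $a$ with $C_a = 10$ Gbps carrying $N_a = 2$ elephant flows and link $b$ with $C_b = 6$ Gbps carrying $N_b = 3$. The fair shares are $S_a = 5$ Gbps and $S_b = 2$ Gbps, so the global minimum fair share is $S_0 = 2$ Gbps. Under max-min fair allocation every flow on $a$ receives 5 Gbps and every flow on $b$ receives 2 Gbps, so the global minimum flow rate is 2 Gbps $\ge S_0$, as Theorem 1 states.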

B. CONVERGENCE PROOF

We now formalize DARD’s flow scheduling algorithm and

prove that this algorithm can converge to a Nash equilibrium

in finite steps.

Our setting is a special case of a congestion game [14], which is defined as $(F, G, \{p^f\}_{f \in F})$. $F$ is the set of all flows. $G = (V, E)$ is a directed graph. $p^f$ is the set of paths that can be used by flow $f$. A strategy $s = [p^{f_1}_{i_1}, p^{f_2}_{i_2}, \ldots, p^{f_{|F|}}_{i_{|F|}}]$ is a collection of paths, in which the $i_k$th path in $p^{f_k}$, denoted $p^{f_k}_{i_k}$, is used by flow $f_k$.

For a strategy $s$ and a link $j$, the link state $LS_j(s)$ is a triple $(C_j, N_j, S_j)$, as defined in Section 3.4. For a path $p$, the path state $PS_p(s)$ is the link state with the smallest fair share over all links in $p$. The system state $SysS(s)$ is the link state with the smallest fair share over all links in $E$. A flow state $FS_f(s)$ is the corresponding path state, i.e., $FS_f(s) = PS_p(s)$, where flow $f$ is using path $p$.

The notation $s_{-k}$ refers to the strategy $s$ without $p^{f_k}_{i_k}$, i.e., $[p^{f_1}_{i_1}, \ldots, p^{f_{k-1}}_{i_{k-1}}, p^{f_{k+1}}_{i_{k+1}}, \ldots, p^{f_{|F|}}_{i_{|F|}}]$, and $(s_{-k}, p^{f_k}_{i_k'})$ refers to the strategy $[p^{f_1}_{i_1}, \ldots, p^{f_{k-1}}_{i_{k-1}}, p^{f_k}_{i_k'}, p^{f_{k+1}}_{i_{k+1}}, \ldots, p^{f_{|F|}}_{i_{|F|}}]$. Flow $f_k$ is locally optimal in strategy $s$ if

$$FS_{f_k}(s).S \ge FS_{f_k}(s_{-k}, p^{f_k}_{i_k'}).S \qquad \text{(B1)}$$

for all $p^{f_k}_{i_k'} \in p^{f_k}$. A Nash equilibrium is a state where all flows are locally optimal. A strategy $s^*$ is globally optimal if for any strategy $s$, $SysS(s^*).S \ge SysS(s).S$.

Theorem 2. If there is no synchronized flow scheduling, the selfish flow scheduling algorithm increases the minimum fair share round by round and converges to a Nash equilibrium in a finite number of steps. The globally optimal strategy is also a Nash equilibrium strategy.

For a strategy $s$, the state vector is $SV(s) = [v_0(s), v_1(s), v_2(s), \ldots]$, where $v_k(s)$ stands for the number of links whose fair share lies in $[k\delta, (k+1)\delta)$, and $\delta$ is a positive parameter, e.g., 10 Mbps, used to cluster links into groups. As a result, $\sum_k v_k(s) = |E|$. A small $\delta$ groups the links at a fine granularity and increases the minimum fair share; a large $\delta$ improves the convergence speed. Suppose $s$ and $s'$ are two strategies with state vectors $SV(s) = [v_0(s), v_1(s), v_2(s), \ldots]$ and $SV(s') = [v_0(s'), v_1(s'), v_2(s'), \ldots]$. We define $s = s'$ when $v_k(s) = v_k(s')$ for all $k \ge 0$, and $s < s'$ when there exists some $K$ such that $v_K(s) < v_K(s')$ and $\forall k < K$, $v_k(s) \le v_k(s')$. It is easy to show that given three strategies $s$, $s'$, and $s''$, if $s \le s'$ and $s' \le s''$, then $s \le s''$.

Given a congestion game $(F, G, \{p^f\}_{f \in F})$ and $\delta$, there are only a finite number of state vectors. According to the definitions of "$=$" and "$<$", we can find at least one strategy $\tilde{s}$ that is the smallest, i.e., for any strategy $s$, $\tilde{s} \le s$. It is easy to see that this $\tilde{s}$ has the largest minimum fair share, or the fewest links at the minimum fair share, and thus is globally optimal.

If only one flow $f$ selfishly changes its route to improve its fair share, making the strategy change from $s$ to $s'$, this action decreases the number of links with small fair shares and increases the number of links with larger fair shares; in other words, $s' < s$. This indicates that asynchronous and selfish flow movements increase the global minimum fair share round by round until all flows reach their locally optimal states. Since the number of state vectors is finite, the number of steps needed to converge to a Nash equilibrium is also finite. What is more, because $\tilde{s}$ is the smallest strategy, no single flow movement can produce a strictly smaller strategy, i.e., every flow is in its locally optimal state. Hence the globally optimal strategy $\tilde{s}$ is also a Nash equilibrium strategy.
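To make the convergence argument concrete, the following toy simulation sketches the selfish rescheduling step: each flow moves to an alternative path only when doing so strictly increases its observed minimum fair share, and the loop stops once no flow moves. The topology, capacities, and flow path sets are made up for illustration; this is a sketch, not DARD's implementation.

```python
# Toy congestion-game simulation of the selfish flow scheduling step: each
# flow switches to the candidate path with the largest minimum fair share
# (link capacity / number of elephant flows), one flow at a time, until no
# flow can strictly improve. Topology, capacities, and path sets are
# made-up illustrations.
import random
from collections import Counter

def fair_shares(flows, capacity):
    """Fair share C_l / N_l of every loaded link, given current paths."""
    load = Counter(link for path in flows.values() for link in path)
    return {l: capacity[l] / load[l] for l in load}

def path_share(candidate, flows, capacity, moving_flow):
    """Minimum fair share on `candidate` if `moving_flow` moved onto it."""
    trial = dict(flows)
    trial[moving_flow] = candidate
    shares = fair_shares(trial, capacity)
    return min(shares[l] for l in candidate)

def selfish_schedule(paths, capacity, seed=0):
    """Apply asynchronous selfish moves until a Nash equilibrium is reached."""
    rng = random.Random(seed)
    flows = {f: cands[0] for f, cands in paths.items()}   # initial assignment
    moved = True
    while moved:
        moved = False
        for f in rng.sample(list(paths), len(paths)):     # asynchronous order
            current = path_share(flows[f], flows, capacity, f)
            best = max(paths[f], key=lambda p: path_share(p, flows, capacity, f))
            if path_share(best, flows, capacity, f) > current:
                flows[f] = best
                moved = True
    return flows

# Two core links of equal capacity; three flows may use either one.
capacity = {"c0": 10.0, "c1": 10.0}
paths = {"f1": [("c0",), ("c1",)],
         "f2": [("c0",), ("c1",)],
         "f3": [("c0",), ("c1",)]}
final = selfish_schedule(paths, capacity)
print(final)                          # flows spread across both core links
print(fair_shares(final, capacity))   # every link's fair share after convergence
```

Because every successful move strictly improves the moving flow's minimum fair share, the loop terminates, mirroring the potential-function argument above.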