
Computer Networks 57 (2013) 1758–1773


Free-scaling your data center

1389-1286/$ - see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.comnet.2013.03.005

⇑ Corresponding author. Tel.: +34 931 233 071. E-mail addresses: [email protected] (L. Gyarmati), [email protected] (A. Gulyás), [email protected] (B. Sonkoly), [email protected] (T.A. Trinh), [email protected] (G. Biczók).

László Gyarmati a,⇑, András Gulyás b,d, Balázs Sonkoly b,d,e,f, Tuan A. Trinh b, Gergely Biczók c

a Telefonica Research, Plaza de Ernest Lluch i Martín, 5, floor 15, 08019 Barcelona, Spain
b Budapest University of Technology and Economics, Department of Telecommunications and Media Informatics, 2 Magyar tudósok körútja, Budapest, H-1117, Hungary
c Norwegian University of Science and Technology, Department of Telematics, O.S. Bragstads plass 2a, N-7491 Trondheim, Norway
d Hungarian Academy of Science (MTA) Information System Research Group, Széchenyi István tér 9., 1051 Budapest, Hungary
e MTA-BME Future Internet Research Group, Széchenyi István tér 9., 1051 Budapest, Hungary
f Inter-University Centre for Telecommunications and Informatics, Kassai út 26., 4028 Debrecen, Hungary

Article info

Article history: Received 2 February 2012; Received in revised form 8 January 2013; Accepted 12 March 2013; Available online 26 March 2013

Keywords: Data center; Scale-free; Routing; Fault-tolerance; Asymmetric topology

Abstract

The increasing popularity of both small and large private clouds and expanding public clouds poses new requirements to data center (DC) architectures. First, DC architectures should be incrementally scalable, allowing the creation of DCs of arbitrary size with consistent performance characteristics. Second, initial DC deployments should be incrementally expandable, supporting small-scale upgrades without decreasing operation efficiency. A DC architecture possessing both properties satisfies the requirement of free-scaling.

Recent work in DC design focuses on traditional performance and scalability characteristics, therefore resulting in symmetric topologies whose upgradability is coarse-grained at best. In our earlier work we proposed Scafida, an asymmetric, scale-free network inspired DC topology which scales incrementally and has favorable structural characteristics. In this paper, we build on Scafida and propose a full-fledged DC architecture achieving free-scaling called FScafida. Our main contribution is threefold. First, we propose an organic expansion algorithm for FScafida; this, combined with Scafida's flexible original design, results in a freely scalable architecture. Second, we introduce the Efficient Source Routing mechanism that provides near-shortest paths, multi-path and multicast capability, and low signaling overhead by exploiting the benefits of the FScafida topology. Third, we show, based on extensive simulations and a prototype implementation, that FScafida is capable of handling the traffic patterns characteristic of both enterprise and cloud data centers, tolerates network equipment failures to a high degree, and allows for high bisection bandwidth.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Fueled by the paradigm change to cloud computing, data centers (DCs) are all the talk both in networking practice and research today. So far, large public cloud data centers have been dominating the scene, with the likes of Google, Amazon, and Microsoft building out and maintaining data centers consisting of tens or hundreds of thousands of servers, serving the needs of billions of users. However, another trend is gaining momentum: more and more organizations decide to consolidate their computing resources into small- or medium-scale private clouds [1]. There are multiple reasons behind the surge of private clouds. First, security and privacy issues of using a public infrastructure can be prohibitive for certain organizations, such as governments [2]. Second, private cloud operation policies and management procedures can be tailor-made


to the owner's liking [1]. Finally, of course, cost is always a deciding factor; perhaps surprisingly, operating a private cloud can be advantageous in the long(er) run [3]. As a consequence, the increasing proliferation of both large and small data centers is highly likely.

Both large and small cloud structures would greatly benefit from free-scaling, i.e., to be able to build and operate a data center of arbitrary size, and constantly upgrade it during its lifetime. It is straightforward to see the benefit for small companies: they would be able to start a private data center quickly and with a limited budget, and build it out incrementally [4]. Note that even if servers are co-located at a data center provider, SMEs still have to face growing costs as the number of servers increases. On the other hand, a growing user base (private, e.g., Facebook) or deployment of more demanding cloud applications and getting more corporate customers (public, e.g., Amazon) triggers frequent upgrades in large data centers. As a well-known real-world example, Facebook has upgraded its data center facilities frequently and step-by-step [5], resulting in a doubling of servers over the course of 7 months in 2010 [6]. Free-scaling demands two separate properties from a data center architecture. First, the architecture should be incrementally scalable, i.e., it should be possible to build a data center with 1000 or 1100 nodes, while retaining the same performance indicators. Second, the data center should be incrementally expandable, i.e., frequent small-scale upgrades, by adding for example 10 servers or larger switches, should be feasible and result in a comparably efficient data center. If coupled with the ability to use commodity switches for connecting servers, such a free-scaling architecture would allow system administrators to build actual DC networks in a do-it-yourself manner. Moreover, if the processing and storage requirements of the given organization change over time, it would be possible to reuse existing equipment when expanding the network.

Albeit networking infrastructure issues inside a DC have received tremendous attention from the research community in recent years, as evidenced by innovative designs both for network fabric [7,8] and topology [9–11], with scalability, fault-tolerance, and high bisection bandwidth in focus, far less emphasis has been put on incremental properties. For example, in the case of the recursive DCell [10], it is required that ''the minimal quantum of machines added at one time be much larger than one''. Generally speaking, state-of-the-art DC topologies need to preserve their symmetry in order to achieve their demonstrated high performance. Symmetry in turn, especially for sophisticated topology concepts, results in the type of scalability which only manifests itself at well-defined network sizes. Although MDCube [12] was designed to provide a way to create DCs of different sizes, its coarse-grained scale of around 1000 servers per box, differing requirements, and the foreseeable price of hardware repair and software upgrade, likely carried out by transporting whole containers back to their respective vendors, can be prohibitive for SMEs on a budget [13]. LEGUP [14] leverages hardware heterogeneity to reduce the cost of upgrades in fat-tree based data centers. Jellyfish [15] proposes a random graph topology to facilitate the upgradability of data centers; an interesting idea, however, Jellyfish sacrifices structure, creating a significant challenge for the routing mechanism. Another recent proposal suggests using an unorthodox topology for data centers based on small-world networks [16].

In our earlier work we proposed Scafida, an asymmetric, scale-free network (i.e., the degree distribution of the network's nodes has power law characteristics) inspired data center topology [17]. At the structural level, Scafida exploits the inherent properties and advantages of scale-free networks, namely small diameter and fault-tolerance towards random errors, while it copes with the number and physical constraints (i.e., available ports) of available servers and switches. We limit the node degrees by extending the original preferential attachment algorithm of Barabási and Albert [18]. The Scafida structure preserves the favorable properties of scale-free networks and performs comparably well to the recently proposed symmetric DC topologies at various network sizes. Moreover, we can adjust the size of the network on a fine-grained scale. Scafida satisfies the incremental scalability requirement; however, in [17] we focused on the justification of the topology itself, leaving practical data center operational aspects such as incremental expansion, routing, and implementation for future work.

In this paper we address these challenging issues and extend the Scafida topology to a full-blown data center architecture called FScafida. Our main contribution is threefold:

• We propose an organic expansion algorithm with which it is possible to upgrade an existing Scafida data center in fine-grain increments. The expanded data center performs almost identically to a Scafida network that was originally generated with the extended set of devices. Put together with its incremental scalability, FScafida satisfies the requirements of free-scaling.
• At the networking level, we propose Efficient Source Routing (ESR), inspired by efficient paths in complex networks, that provides near-shortest paths, multi-path capability, efficient multicasting, and low signaling overhead. The proposed routing mechanism together with the underlying topology demonstrates the ability to handle efficiently the traffic pattern characteristics of both cloud and enterprise data centers, to tolerate network equipment failures to a high degree, and to allow for high bisection bandwidth.
• We design and implement a prototype for FScafida using the OpenFlow framework. We validate the operation of the prototype both in a virtual test network with several hundreds of nodes and a live testbed with firmware-modified commodity switches.

The rest of the paper is organized as follows. Section 2 gives an overview of the Scafida topology and its favorable structural properties for flexible data center usage. We propose an incremental expansion mechanism in Section 3. In Section 4 we describe ESR, including addressing and forwarding with Bloom filters, the basic routing algorithm and its fault-tolerance. We show simulation results demonstrating the routing performance of our design. We describe our prototype implementation and provide validation experiments in Section 5. We discuss important cost implications of our design in Section 6, while we review related work in Section 7. Finally, we conclude in Section 8.

2. Basics of the Scafida structure

This section presents the basic properties of the Scafida data center structure. First, we discuss the Scafida topology generation algorithm and the properties of the created networks. Afterwards, we deal with Scafida's scalability in both the traditional and the incremental sense.

2.1. Basic properties

The goal of this part is to summarize the basic properties of Scafida data center topologies, including structural performance characteristics; for additional details please refer to [17]. By structural, we mean properties which are independent of routing, virtualization techniques, and actual application types.

2.1.1. Structure generation

Our topology generation algorithm is a modified version of the preferential attachment method of Barabási and Albert [18]. We build the network structure iteratively, i.e., we add nodes one by one. We attach a new node to an existing one with a probability proportional to the existing node's degree. Our method artificially constrains the number of links that a node can have, namely the maximal degree of the nodes, to meet the port number of servers and switches, as commodity switches usually come with 4, 8, 16, 24, or 48 ports. In an effort to reveal maximum node degrees, we generated 300 scale-free networks with 1000, 10,000, and 100,000 nodes, using preferential attachment [18]. The average maximum node degrees proved to be 80.45, 257.5, and 816.35, respectively. These numbers indicate that we have to constrain the degrees of the networks artificially. We illustrate the difference between scale-free and Scafida networks in Fig. 1.

Fig. 1. Scale-free vs. degree-limited Scafida [17] networks with 100 nodes.

Next we present the algorithm we use to generate the topologies. We apply the notations of the Barabási–Albert model, thus n_{t_0} denotes the number of servers. In the original algorithm the network is growing iteratively, i.e., the nodes are added one by one to the network. Every newly added node has m links, whose targets are selected proportionally to node degrees. In addition to the original model, we introduce new parameters: a t_i type switch has p_{t_i} ports, the number of t_i type switches is n_{t_i} (i = 1, ..., k), while p_{t_0} denotes the number of ports a server has. We assume without loss of generality that the switches are ordered based on their port numbers, i.e., if t_i < t_j then p_{t_i} < p_{t_j}. We note that we use the term node in order to emphasize that only at the end of the algorithm it turns out what the port number of each node is. Of course, the degrees of the nodes match the degrees of the initial set of devices. In other words, the algorithm creates a logical network that can be physically realized at the end of the algorithm.

We present the Scafida algorithm in detail in Fig. 2. First, we add m initial nodes to the graph (line 2). Next, similar to the original algorithm, we add a new node to the network; we denote this node with index m. This node gets m links whose targets are the initially added vertices (line 6). Except for the initial nodes ({0, ..., m - 1}), every newly added node gets m links. In biological networks, the value of m is 2 or 3. These values are applicable in the case of data centers as well; however, the operator is able to set the m parameter to influence the performance of the topology. We assume that m ≤ p_{t_0} holds; this assumption does not constrain the algorithm in general.

Afterwards, we add nodes to the graph one by one. Every newly added node gets m links (line 14). We select their targets based on the preferential attachment principle. Namely, we pick the target nodes randomly from the list R, where we store the indices of nodes (line 16). R may store an index multiple times, i.e., if node v has d_v links we store it d_v times. Accordingly, we select the target nodes proportionally to their degrees. At the end of a cycle we have to determine whether we can still increase the degree of the selected target node (line 21). If the actual degree of the target node equals p_{t_i}, then we can create the link only if there is still at least one larger available switch not allocated yet (line 28). Otherwise, we cannot use the selected node as a target; furthermore, we remove its index from the R list. Finally, before dealing with the next node, we update the indices in the list R to reflect the changed node degrees (lines 35–36). The algorithm terminates if the network reaches the required number of nodes, i.e., |V| = \sum_{i=0}^{k} n_{t_i}.

Fig. 2. The Scafida algorithm.
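To make the construction concrete, the following Python sketch is one possible rendering of the degree-limited preferential attachment described above; it is our simplification, not the authors' reference code. In particular, it replaces the per-switch bookkeeping of lines 21 and 28 with an equivalent realizability check: a tentative link is kept only if the resulting degree sequence can still be matched to the available port counts.

```python
import random

def realizable(degrees, caps_sorted_desc):
    """The current degree sequence can be mapped onto the available devices
    iff, with both sorted in decreasing order, every degree fits under the
    corresponding port count (simplified stand-in for lines 21 and 28)."""
    return all(d <= c for d, c in
               zip(sorted(degrees, reverse=True), caps_sorted_desc))

def scafida(port_counts, m=2, seed=None):
    """Degree-limited preferential attachment (sketch).

    port_counts lists the port number of every device, servers included,
    e.g. [2]*85 + [4]*8 + [8]*4 + [16]*2 + [48]*1.  Returns an edge list."""
    rng = random.Random(seed)
    n = len(port_counts)
    caps = sorted(port_counts, reverse=True)
    degrees = [0] * n
    edges, R = [], []            # R holds node v repeated deg(v) times

    for v in range(m):           # node m is linked to the m seed nodes
        edges.append((m, v))
        degrees[m] += 1
        degrees[v] += 1
        R += [m, v]

    for u in range(m + 1, n):    # remaining nodes join one by one
        chosen, attempts = set(), 0
        while len(chosen) < m and attempts < 50 * n:
            attempts += 1
            v = rng.choice(R)    # preferential attachment: degree-weighted pick
            if v in chosen:
                continue
            degrees[u] += 1
            degrees[v] += 1
            if realizable(degrees, caps):
                edges.append((u, v))
                R += [u, v]
                chosen.add(v)
            else:                # v is saturated: undo and stop proposing it
                degrees[u] -= 1
                degrees[v] -= 1
                R = [x for x in R if x != v]
    return edges
```

Under this simplification, the final node degrees can afterwards be matched greedily to the sorted device port counts to decide which physical device realizes which node.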

2.1.2. Path length

As the servers of a data center communicate with each other, the average length of the paths between the nodes can fundamentally impact the performance of the data center network. In Table 2, we present the impact of limiting node degrees on the average shortest path lengths. We computed this by dividing the sum of the shortest path lengths by the number of paths. We averaged the presented values over 50 topologies for each case with m = 2. We use the port numbers of commodity switches to constrain the maximal node degrees. Irrespective of the size of the networks, the average path lengths increase only moderately with stricter constraints on the maximum degree; e.g., the 48-port constrained topology performs very close to the case without degree limitation (NL). We note that servers may be used as forwarding nodes if their degrees are larger than one.

2.1.3. Bisection bandwidth

Some applications (e.g., MapReduce [19]) require intensive communication among the servers of the data center; bottlenecks in the topology would cause performance degradation. The throughput capability of a data center can be measured with bisection bandwidth. Table 2 shows the impact of degree limitation on the distribution of bisection bandwidth in case of different size data centers. In the table we refer to the mean of the bisection bandwidth as ''BB mean''. Bisection bandwidth values are similar irrespective of degree limitation. Accordingly, in the case of Scafida, degree limitation does not affect the bisection bandwidth values of the topologies significantly.

2.1.4. Fault tolerance

Due to their specific structure, scale-free networks tolerate the random failure of nodes well [20]. This resilience comes in handy in data centers, where random failures are bound to happen due to the large number of nodes. On top of that, constrained Scafida demonstrates even better behavior since there is no high-degree switch that can fail; therefore, a possible failure affects only a smaller part of the network. As Scafida data centers can be built using commodity switches similar to state-of-the-art DCs, Scafida is not more vulnerable to network failures, including attacks, than state-of-the-art systems.

2.1.5. Parameter sensitivity

The Scafida structure has two key parameters which have a significant impact on data center characteristics. The first one is the number of initial links, denoted by m, that a new node gets when attached to the network. For example, if m = 1 the servers have only a single link that connects them to the data center. The total number of links increases with m, making the structure more and more resilient, improving bisection bandwidth, and shortening the paths between servers. We present simulation results from 1000-node Scafida structures, including pairwise shortest path length, diameter, and bisection bandwidth, in Table 1. In the case of shortest path length, both the mean value and the variance decrease as m grows. Also, bisection bandwidth is directly proportional to m.

Table 1
The impact of the number of links a newly added node has (m); the data centers have 1000 nodes.

Link number (m)       1         2         3
SPL mean              6.87      3.83      3.37
SPL std.              2.41      0.82      0.66
Diameter              14        7         6
Bisection BW mean     525.19    1046.29   1573.30
Bisection BW std.     15.21     24.18     29.78

The second important lever is the number of ports a server has (p_{t_0}). The average shortest path length values from a 5000-node topology (with varying p_{t_0} values) are 4.632 (p_{t_0} = 2), 4.940 (p_{t_0} = 3), and 5.066 (p_{t_0} = 4), respectively. Somewhat surprisingly, there is a moderate but noticeable increase in path length as p_{t_0} grows. Bear in mind that we fix the number of links per newly added node at m = 2; therefore, the total number of links is the same in every scenario. Now if p_{t_0} = m, servers are always attached to switches, resulting in a compact structure with short server-to-server paths. On the contrary, if p_{t_0} > m, servers have empty ports after they are added to the network. Therefore, direct connections between two servers can be formed with some, albeit low, probability. When realized, this phenomenon results in the given server being further from most other servers (except for the directly connected one and its neighbors), since it does not benefit from the better-connectedness of a switch. In other words, more hops have to be made in order to reach the switches from these servers, which results in increased distances. Naturally, when p_{t_0} and m are increased simultaneously, the resiliency of the resulting topology improves due to the larger possible number of disjoint paths between server pairs.

2.1.6. Comparison to the state-of-the-art

Recently proposed data center architectures trade off incremental scalability and expandability for performance. Scafida can achieve comparable structural performance, while still being freely scalable. Due to its flexibility, Scafida networks can be created out of the parts of given state-of-the-art data centers. Accordingly, the performance of the topologies can be compared. Scafida has similar bisection bandwidth and average shortest path lengths as state-of-the-art architectures [17].

2.2. Scalability

2.2.1. Traditional scalability

Scafida scales extremely well in the traditional sense. In order to quantify the impact of network size on important benchmarks, we carry out extensive simulations on data center topologies of different sizes and degree constraints. Mean values of average shortest path length (SPL), network diameter (D) and bisection bandwidth (BB) are shown in Table 2 (50 simulation runs). Note that standard deviation values of SPL and D are around 1 and independent of network size, while standard deviation values of BB are around 21 (1000 nodes), 50 (5000 nodes), and 325 (10,000 nodes), respectively, and independent of degree constraints. Both average shortest path length and diameter scale logarithmically, while bisection bandwidth scales linearly with respect to network size. On the other hand, constraining maximum node degrees has only a moderate effect on performance; the comparison of the two extremes (eight ports vs. no limitation) shows only minor differences.

Table 2
Scaling properties of Scafida with different degree constraints (SPL mean, D mean, BB mean). Applied parameters: m = 2 and p_{t_0} = 2.

Degree limitation    1000 nodes             5000 nodes             10,000 nodes
8                    5.03, 9.5, 1000.4      6.15, 11.1, 5007.7     6.62, 12.0, 10339.7
16                   4.58, 8.9, 999.6       5.54, 10.0, 5009.5     5.93, 11.0, 10298.4
24                   4.41, 8.7, 1000.0      5.30, 9.9, 5008.0      5.67, 10.2, 10314.6
48                   4.18, 8.0, 1001.3      5.02, 9.8, 5008.3      5.37, 9.9, 10313.4
None                 4.06, 7.9, 1001.79     4.73, 8.9, 5010.7      5.01, 9.9, 10299.5

2.2.2. Incremental scalability

The first requirement of free-scaling is that a proper topology should be generated out of any reasonable number and types of servers and switches. Due to its iterative, node-by-node topology construction, and its capability to take hardware constraints into account, Scafida satisfies this requirement at the structural level. Note that we assume a reasonable ratio of switches and servers and a satisfactory number of ports (requirements for any data center topology). Specifically, the total number of ports of the switches and servers has to be at least 2mn, the total number of ports used by the links in the network.
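As a quick sanity check of this condition, the snippet below (our own illustration with a hypothetical device inventory) verifies that the available devices provide at least the 2mn port endpoints consumed by the roughly m·n generated links.

```python
def enough_ports(port_counts, m=2):
    """Necessary condition from Section 2.2.2: the summed port count of all
    servers and switches must be at least 2*m*n, the number of port
    endpoints used by the links of the generated topology."""
    n = len(port_counts)                    # n = total number of nodes
    return sum(port_counts) >= 2 * m * n

# Hypothetical inventory: 85 two-port servers and 15 switches.
inventory = [2] * 85 + [16] * 8 + [24] * 4 + [48] * 3
print(enough_ports(inventory, m=2))         # True: 538 ports >= 400
```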

After reviewing the basic properties of the Scafida DC architecture, in the following we propose several techniques, from incremental expansion through routing to implementation, to design a full-fledged data center. We start in the next section by demonstrating that Scafida can also be expanded incrementally without sacrificing the beneficial properties of the topology.

3. Incremental expansion

The structure of a freely scalable data center should also allow for organic expandability: any reasonable number of switches and servers could be added to an already existing network while retaining its preferable characteristics. Next, we propose an iterative algorithm that increases the size of Scafida data centers efficiently. The method extends a Scafida topology by adding any given set of new switches and servers to the network. We do not add the additional switches to the edge of the topology; instead, they replace existing devices with fewer ports. We present the pseudocode of the expansion algorithm in Fig. 3. The algorithm works on the logical topology represented by the set of nodes ordered by their degrees (from servers through small switches to large switches). In a bottom-up manner, it determines which new switches should replace devices lower in the degree-hierarchy, demoting the replaced switches to the next tier. Sequential expansion points are determined probabilistically, giving a much larger chance to higher degree nodes (this facilitates placing larger switches in the middle of the topology). Contrary to a top-down approach, the bottom-up construction assures that even servers could be replaced with a large switch at the end of a replacement series; however, the probability of this event is low.

We note that if a Scafida topology has only identical switches, the expansion algorithm simply selects a random one of them and replaces it with a switch which has more ports. If all the switches, including the newly added ones, are identical, i.e., having the same number of ports, the algorithm replaces some servers with the new switches. Hence, the ''core'' of the network expands. Afterwards, in the second phase of the algorithm we apply the preferential attachment paradigm to put the replaced servers back into the network. Furthermore, as a result of the iterative nature of the proposed algorithm, the properties of the expanded topology are the same whether we add a new set of servers and switches to the network (i) at once in a bunch or (ii) repeatedly in small amounts.
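The replacement cascade can be sketched as follows; this is a simplified Python rendering, not the pseudocode of Fig. 3. The degree-weighted choice of expansion points and the handling of displaced servers follow the description above, while the helper names (place(), expand()) are our own.

```python
import random

def preferential_targets(adj, m, rng):
    """Pick m distinct attachment targets with probability ~ node degree."""
    pool = [v for v, nbrs in adj.items() for _ in nbrs]
    targets = set()
    while len(targets) < m:
        targets.add(rng.choice(pool))
    return targets

def expand(adj, ports, new_switch_ports, new_servers,
           server_ports=2, m=2, rng=random):
    """adj: node -> set of neighbours (integer node ids); ports: node -> port
    count.  Every new switch displaces, tier by tier, a degree-weighted
    victim slot from the next lower tier; the finally displaced servers and
    the brand-new servers are re-attached with preferential attachment."""
    displaced = 0

    def place(port_count):
        nonlocal displaced
        smaller = [v for v in adj if ports[v] < port_count]
        if not smaller:
            return
        top = max(ports[v] for v in smaller)            # next lower tier
        tier = [v for v in smaller if ports[v] == top]
        victim = rng.choices(tier, weights=[len(adj[v]) for v in tier])[0]
        old = ports[victim]
        ports[victim] = port_count                      # slot upgraded in place
        if old == server_ports:
            displaced += 1                              # a server was pushed out
        else:
            place(old)                                  # cascade continues below

    for p in sorted(new_switch_ports, reverse=True):
        place(p)

    next_id = max(adj) + 1                              # re-attach servers
    for _ in range(displaced + new_servers):
        u, next_id = next_id, next_id + 1
        adj[u], ports[u] = set(), server_ports
        for v in preferential_targets(adj, m, rng):
            adj[u].add(v)
            adj[v].add(u)
    return adj, ports
```

One call per new switch reproduces the example of Section 3.1: adding a 16-port switch to a network of 4- and 8-port switches upgrades one 8-port slot to 16 ports, one 4-port slot to 8 ports, one server slot to 4 ports, and re-attaches the displaced server with m preferential links.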

3.1. Example

We expand a topology consisting of switches with 4 and 8 ports (Fig. 4) with a 16-port switch. The switch replacement process happens in a bottom-up manner. Namely, first we determine randomly which server will be replaced with a 4-port switch (server A). Next, we replace a randomly selected 4-port switch (switch B in the figure) with an 8-port switch. Finally, we replace a randomly selected 8-port switch (switch B again) with the 16-port switch. After we have added the new switches to the topology, we can deploy additional servers by connecting them to the topology based on the preferential attachment principle.

To quantify the goodness of the method, we incrementally expand the size of 1000-server FScafida data centers by adding a set of new switches and servers, and we compare the properties of the expanded data centers with FScafida networks freshly generated out of the same set of network elements. The results in Fig. 5 indicate that FScafida data centers expand organically and efficiently, as the performance properties of the networks are nearly identical.

Put together, the demonstrated incremental scalability and organic expandability of the FScafida structure add up to free-scaling.

4. Efficient source routing

With the structure set, a routing mechanism playing to the strengths of the underlying topology is required. Because of the small diameter of the network, paths are generally short. While nice to have, shortest paths can be stretched to a certain degree if it otherwise benefits the system. Here we propose the Efficient Source Routing (ESR) mechanism that exploits the favorable properties of the Scafida topology, while addressing the routing challenge posed by the asymmetric and heterogeneous structure. ESR applies the paradigm of source routing, motivated by the following reasons. First, source routing offers tight performance management and multipath capability for load-balancing. Second, typically a single organization operates a data center. Therefore, there are no prohibitive, Internet-related security problems, and the operator knows the global topology. Third, the topology of a data center is static, especially compared to the Internet.


Fig. 3. The expansion algorithm of FScafida.


In this section, first, we introduce the applied addressing scheme. Second, we describe the path selection procedure and error recovery. Finally, we evaluate routing performance by simulation.

4.1. Addressing and forwarding: Bloom filters

Since FScafida data centers have stochastically heterogeneous topologies, routing on traditional IP addresses would be inefficient in them. As node addresses cannot be aggregated properly, routing table sizes would be large, which contradicts the goal of applying low-end, commodity network equipment.

Accordingly, ESR addressing is based on Bloom filters [21]. We generate unique identifiers for every output link in the topology: these are the Bloom IDs of the links. In order to create a Bloom address for a target server, we sum up every single link identifier along the chosen path, from source to target, using the bitwise OR operator. The resulting Bloom address, stored in the packet header, contains the routing information for the packet, i.e., we store the used links implicitly. Bloom addresses have local validity, meaning that a demand can be routed to the specific target only from the source server where we generate the Bloom address. Note that we fix the address size for all path lengths.

As the packet traverses the network, along with its routing information coded into the Bloom address, intermediate switches inspect it. Any given switch compares the address of the packet with the IDs of its outgoing interfaces. More exactly, the switch executes the bitwise AND operation between the Bloom address (stored in the packet) and the Bloom ID (e.g., stored in a flow table of the switch) for all output links. If the result of the operation equals the Bloom ID, then the Bloom address contains the given ID, hence the packet has to be forwarded to the corresponding port. The bitwise AND operation is executed for all output ports, and the switch forwards the packet on the matching links, except for the incoming one, in order to avoid loops. Note that the forwarding procedure inherently supports multicast; the source node just has to code all desired link IDs into the Bloom address. Network switches execute the bitwise AND operation exclusively, which has modest resource requirements. This is in line with the implications of recent data center traffic measurements [22], i.e., due to the high arrival rate of the flows, forwarding/routing decisions have to be made quickly.

Fig. 4. Example of the topology expansion mechanism.

Fig. 5. FScafida can be expanded efficiently (average shortest path length, diameter, and bisection bandwidth of original vs. expanded topologies, as the number of servers grows from 1100 to 2000).

Fig. 6. Path computation in Efficient Source Routing.
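The two bitwise operations are easy to model; the snippet below is an illustrative sketch (the link IDs and port numbers are made up, and real FScafida IDs identify directed output links), showing how a source ORs link IDs into an address and how a switch AND-matches it against its ports.

```python
def build_bloom_address(link_ids):
    """Source side: OR together the Bloom IDs of every link on the chosen
    path; the whole route fits into one fixed-size header field."""
    address = 0
    for link_id in link_ids:
        address |= link_id
    return address

def forwarding_ports(address, port_to_link_id, in_port=None):
    """Switch side: forward on every port whose link ID is fully contained
    in the address (AND equals the ID), except the incoming port; several
    simultaneous matches naturally realize multicast."""
    return [port for port, link_id in port_to_link_id.items()
            if port != in_port and address & link_id == link_id]

# Toy 6-bit IDs loosely following the worked example of Section 4.2:
links = {"A-3": 0b000001, "3-4": 0b000010, "4-5": 0b001000,
         "5-B": 0b010000, "5-C": 0b100000}
unicast = build_bloom_address([links[l] for l in ("A-3", "3-4", "4-5", "5-B")])
print(format(unicast, "06b"))                      # 011011
# switch 5: port 1 leads back to switch 4, port 2 to B, port 3 to C
switch5 = {1: links["4-5"], 2: links["5-B"], 3: links["5-C"]}
print(forwarding_ports(unicast, switch5, in_port=1))    # [2]
multicast = unicast | links["5-C"]
print(format(multicast, "06b"))                    # 111011
print(forwarding_ports(multicast, switch5, in_port=1))  # [2, 3]
```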

False positive matches can occur when applying Bloom filters. Should such a false positive arise at a switch, the packet would also be forwarded on an additional link, creating unnecessary overhead, and it may also cause forwarding loops. The probability of such an occurrence depends on the size of the ID space. Namely, if we label the links with longer IDs, the probability of a false positive match decreases. More specifically, the probability of a false positive match is

P = \left(1 - \left(1 - \frac{1}{m}\right)^{nk}\right)^{k} \quad (1)

where m denotes the number of bits of the identifiers, k denotes the number of utilized bits, and n denotes the number of elements in the Bloom addresses. As an illustration, we compute the applicable Bloom ID lengths for different data center sizes assuming that the probability of false positive matches cannot exceed 10^{-4}. Thus, the lengths of the Bloom IDs are 59, 73, 87, and 102 in the case of DCs with 500, 1000, 5000, and 10,000 servers, respectively. Even a relatively small Bloom filter results in a small false positive probability. In addition, FScafida can benefit from false-positive-free operation by exclusively choosing false-positive-free paths as in [23], by employing smart techniques to avoid loops [24], or by utilizing soft states in switches.
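Eq. (1) can be evaluated directly; the helper below is our own illustration (the k and n values used in the calls are arbitrary and not taken from the paper), and the second function scans for the smallest ID length that keeps the false positive probability below a chosen bound.

```python
def false_positive_probability(m_bits, k_set_bits, n_links):
    """Eq. (1): probability that an ID not on the path is nevertheless fully
    covered by a Bloom address built from n_links IDs with k bits set each."""
    return (1.0 - (1.0 - 1.0 / m_bits) ** (n_links * k_set_bits)) ** k_set_bits

def min_id_length(k_set_bits, n_links, bound=1e-4):
    """Smallest ID length m whose false positive probability stays below bound."""
    m = 1
    while false_positive_probability(m, k_set_bits, n_links) > bound:
        m += 1
    return m

print(false_positive_probability(m_bits=102, k_set_bits=3, n_links=6))
print(min_id_length(k_set_bits=3, n_links=6))
```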

Clearly, Bloom filters accommodate efficiently the variable length of paths in FScafida data centers. In addition, the application of Bloom filters offers built-in multicast support. Multicast emerges as a crucial routing level capability for data centers, since numerous cloud applications include routines where the same data is shared with multiple other servers (1-to-n pattern). MapReduce [19] is the best known example. In FScafida, creating multicast packets is as easy as creating unicast packets. We incorporate the Bloom IDs of all desired links into a single Bloom address.

4.2. Path computation

Scale-free networks inspired the Scafida structure. Thus, the path computation mechanism of ESR is based on the efficient routing algorithm [25] that has been designed for networks generated with the Barabási–Albert method. It avoids overloading high-degree network nodes (so-called hubs) by weighting links with the degree of the nodes; β denotes the weighting factor. Accordingly, we spread the traffic load in the network, resulting in improved available bandwidth conditions and hence decreasing oversubscription, a key metric in today's data center networks.

In order to exploit and adapt to the properties of the Scafida structure, we slightly modify and extend the efficient routing mechanism. First, we design ESR to be a source routing scheme, where the servers exclusively carry out the computation of end-to-end paths and Bloom addresses. This modification facilitates the use of commodity switches inside the data center by pushing computationally intensive tasks to the edges, while switches only perform simple bitwise operations. Second, ESR determines not only the shortest efficient paths between source and destination based on the weighted topology, but also derives the paths that are only reasonably longer than the shortest paths. We denote the path length scaling factor (stretch) by c. Based on c, ESR calculates multiple paths between the end-points; some paths may have an additional hop compared to the shortest efficient paths, but this does not affect routing performance noticeably. This extension assures that the ESR method derives multiple paths even if the data center is built out of identical switches, where the weighting of the original efficient routing alone would not assure the spreading of traffic load. Finally, we pick two paths randomly from the feasible paths: one for routing the packets of the flow and one as a backup that we can use immediately in case of failures. We route all packets of a given flow on the same path. In contrast, we spread the load of multiple flows over multiple paths. Accordingly, ESR achieves adequate load-balancing in FScafida data centers. We define the ESR algorithm in Fig. 6. Note that the size of routing tables is manageable, as each server is only required to store entries for other servers to which an active flow exists.
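A compact sketch of this path selection logic is given below; it is our own rendering using networkx, where the per-link weight (deg(u)^β + deg(v)^β)/2 approximates the node-degree cost of efficient routing [25], and the parameter names stretch and k_candidates are ours.

```python
import random
import networkx as nx

def esr_paths(G, src, dst, beta=0.5, stretch=1.1, k_candidates=8, rng=random):
    """Return a (primary, backup) pair of paths: enumerate the cheapest simple
    paths under degree-based weights, keep those within 'stretch' of the best
    cost, and pick two of them at random (they may coincide if only one
    candidate exists)."""
    for u, v in G.edges():
        # approximate the node-cost sum of efficient routing with edge weights
        G[u][v]["w"] = (G.degree(u) ** beta + G.degree(v) ** beta) / 2.0
    candidates, best = [], None
    for path in nx.shortest_simple_paths(G, src, dst, weight="w"):
        cost = sum(G[a][b]["w"] for a, b in zip(path, path[1:]))
        best = cost if best is None else best
        if cost > stretch * best or len(candidates) >= k_candidates:
            break
        candidates.append(path)
    return rng.choice(candidates), rng.choice(candidates)
```

With a larger β the degree-based weights steer flows away from high-degree switches, as in the worked example that follows.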

A recent measurement study [22] revealed that very short flows dominate data center traffic. In such a dynamic environment, a source routing scheme which probes the network for good end-to-end paths when a new flow is initiated is ineffective (e.g., BCube's BSR [11]). The results of such probes are probably not valid by the time packets start to traverse the chosen route. ESR does not require any kind of signaling during normal operation. Instead, ESR relies on its load-balancing enabled path computation, and achieves good results. As a bonus, commodity switches do not have to respond to probes, hence we save valuable resources.

Example. We illustrate the Efficient Source Routing method in Fig. 7; we show only links of interest while we denote the actual switch degrees in parentheses. Server A wants to send packets to server B. Server A runs the ESR method to determine the path that the packets will traverse. Let us suppose that we operate ESR with the parameters β = 1 and c = 1.1. In this case, the length of the upper path is 16 + 4 + 4 + 16 = 40 while that of the lower path is 16 + 24 + 16 = 56. Accordingly, the upper path is the only feasible path, as the lower path is longer than 40 × 1.1 = 44. By selecting the upper path, the flow avoids those links where high utilization is more likely, as they are connected to switches with higher degrees. The impact of c can be illustrated if c = 1.4 is used; in this case both paths are feasible and therefore one is picked randomly. On the contrary, in the case of β = 0 we select the lower path as it has only four hops as opposed to the five hops of the upper path.

Assume the latter scenario, hence the path computation procedure chooses the path A → 3 → 4 → 5 → B. Next, we aggregate the Bloom IDs of the affected links into a single Bloom address of 011011. The packet first arrives at switch 3, where the in-packet Bloom filter matches only the link that goes towards switch 4. Similarly, switch 4 compares the Bloom address of the packet to its link IDs. As the packet only matches the 01000 pattern (other links not shown), switch 4 forwards the packet towards switch 5. There, the address matches the link ID towards server B; therefore, the packet follows that link and arrives at server B. Now, if server A wants to send a multicast packet to both server B and server C, it has to compute the in-packet Bloom filter accordingly. Let us assume that the path computation procedure selects the same path towards server B, and A → 3 → 4 → 5 → C for server C. As the link 5 → C has the Bloom ID of 100000, the resulting Bloom address is 111011. This address matches both links depicted at switch 5. Thus, switch 5 delivers the packet to both destination servers.

Fig. 7. Illustration of the Efficient Source Routing mechanism. We show the degrees of the switches in parentheses and the Bloom IDs on the links.

4.3. Failure handling

The Scafida structure inherently tolerates network failures owing to its scale-free network inspired design, as noted in Section 2. At the routing level, we handle network failures as follows. The two neighboring nodes detect the failure of a network link based on the loss of connection at the link layer. All neighboring nodes detect the failure of a switch/server as multiple link failures. Upon detecting a failure, a node generates a network failure (NF) control packet and sends it on all of its active links. We propagate the control message in the network based on a pruned broadcast method. Accordingly, all servers are aware of the network failure. We switch the active paths which are affected by the failure to their backup paths. In the meanwhile, the ESR method routes new flows based on the updated network topology.

The NF control message has a Bloom address of 11...1; thus, it matches every possible Bloom ID at the switches. We set the source of the packet as the Bloom ID of the unavailable network link. Every switch stores the source addresses of the control messages for a given, short time period (as a soft state). A switch forwards an incoming NF packet only if it is not present in the failed links table; otherwise, the switch drops the packet. We compute an adequate timeout period for the failure soft states as the average propagation delay of the links multiplied by twice the diameter of the network. On the other hand, if a network failure is corrected, we broadcast a network recovery (NR) control packet in the network, consequently notifying the servers about the availability of the restored network equipment.
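The soft-state pruning and the timeout rule can be modeled in a few lines; the following is an illustrative sketch (class and field names are ours), not the prototype's switch code.

```python
import time

class FailureSoftState:
    """Per-switch soft state for network failure (NF) notifications: remember
    recently seen failed-link Bloom IDs and forward a notification only the
    first time it is seen, which prunes the broadcast."""

    def __init__(self, avg_link_delay_s, diameter):
        # timeout rule from Section 4.3: average link propagation delay
        # multiplied by twice the network diameter
        self.timeout = avg_link_delay_s * 2 * diameter
        self.seen = {}                       # failed-link Bloom ID -> expiry

    def should_forward(self, failed_link_id, now=None):
        now = time.monotonic() if now is None else now
        self.seen = {k: t for k, t in self.seen.items() if t > now}  # expire
        if failed_link_id in self.seen:
            return False                     # duplicate notification: drop it
        self.seen[failed_link_id] = now + self.timeout
        return True                          # first copy: flood on active links

# e.g. 5 microsecond average link delay and diameter 8 -> 80 microsecond timeout
state = FailureSoftState(avg_link_delay_s=5e-6, diameter=8)
```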

The proposed failure handling method does not have a significant impact on the performance of FScafida data centers: the generated control traffic overhead is proportional to the number of failures in the network. Based on [7], the number of simultaneous network failures is usually below a few hundred in very large data centers; thus, the aggregate load of the failure control messages is negligible compared to the throughput capability of the data center networks.

4.4. Routing performance

We evaluate the proposed ESR mechanism in a discrete event simulator. We simulate the traffic at the flow level, assuming fair sharing of link bandwidth among competing flows. Here, we investigate two different scenarios: the impact of the ESR weighting parameter β, and ESR fault tolerance.

First, we create 1000- and 5000-server FScafida topologies with m = 2, p_{t_0} = 2, and switch size distributions of (60, 50, 40, 30, 17) for the 1000-server and (300, 250, 200, 150, 85) for the 5000-server topology, with switches having 4, 8, 16, 24, or 48 ports. We simulate 1000 flows with exponentially distributed inter-arrival times (λ = 1000 to ensure significant competition among flows) and lognormal (ln N(10, 2)) flow size distribution, based on the findings of [22] on the traffic characteristics of operational DCs. We direct the 1000 flows to a randomly chosen group of 100 servers to induce meaningful cross-traffic. We set the link capacities uniformly to 1 Gbit/s, while ESR's stretch bound is c = 1.1. We average the results in Fig. 8 over 20 simulation runs per parameter setting. β has a profound effect on per-flow throughput: the β = 0 case is shortest path routing, and it is clearly inferior to efficient routing with some larger weighting parameter. Also, too large β values result in congestion, hence lower per-flow throughput. Note that the throughput values are higher in the larger topology because of the increased number of possible efficient paths.

Fig. 8. The effect of the weighting factor β on per-flow throughput (average per-flow throughput in Mbit/s, for the 1000- and 5000-server topologies).

Fig. 9. Flow abortion ratio as a function of network failures (percentage of failed demands vs. percentage of link failures).

Second, we investigate the fault tolerance properties of ESR (λ = 500 for inter-arrival times, β = 0.5, c = 1.3). Results in Fig. 9 show that ESR reacts well to network equipment failures, with almost 90% of flows completed even at a 20% link failure ratio. To sum up, ESR utilizes the inherent fault tolerance of the underlying Scafida network in an efficient manner.

5. Implementation and validation

As a proof of concept we implement a prototype of FScafida in OpenFlow [26]; the implementation is available on request. Note that we choose OpenFlow because we are familiar with it; however, no part of the architecture is specific to OpenFlow. Besides illustrating the basic functionality of FScafida, in the following we briefly describe the implementation of several advanced features like multi-path routing, multicasting, and failure discovery.

5.1. Main operation

For bootstrapping the system we use the NOX controller [26] to configure all the switches in the network. In our prototype implementation, the initial topology of the data center is stored in a simple configuration file on the controller's machine, describing the connections between the nodes and the Bloom IDs of the output ports. When a switch joins the controller, this triggers the download of the configuration data in the form of permanent wildcarded flow entries dedicated to each port. A single flow entry contains a Bloom ID in the destination MAC address field and an output action to the appertaining port, while all other fields are wildcarded. If a given port connects the switch to a host, we initiate a MAC address modifying action to ensure packet delivery to the host. By the end of the configuration process, the flow tables of the switches have as many entries as the number of their ports, containing the appropriate Bloom ID to port mappings.

In OpenFlow v1.0, Bloom filter based forwarding needs vendor extensions in the switches, or the mechanism has to be implemented as a controller application. We use both approaches in our prototypes. First, we have extended Open vSwitch [27], a widely used OpenFlow-capable software switch, to support Bloom filter based forwarding. These switches can be used in virtual testbeds, e.g., in Mininet [28]. It is worth noting that OpenFlow v1.1 switches can be used for Bloom filter based forwarding without vendor extensions [29]. Second, we have implemented a dedicated NOX controller application for the special forwarding. This approach is suitable for real testbeds and hardware switches that support only OpenFlow v1.0. This means that the NOX controller gets the first packet of every new flow (i.e., one with a destination address not seen yet), applies the AND operation, and sets up this new forwarding rule in the switch. The subsequent packets of the flow are handled without the intervention of the controller.

After we configure the switches, the FScafida data center is up and running and ready to route packets between the host machines. On each host, we run an FScafida daemon keeping track of the correct Bloom filter encodings of the paths towards all possible destinations. The configuration file describing the initial topology is sent to the servers via a management network, and the FScafida daemon loads it when it starts. This daemon is aware of the actual topology, and it sets the ARP cache of the host to contain the destination IP to Bloom filter mappings. Therefore, upon sending a packet the host consults its ARP cache and retrieves the appropriate Bloom filter as the MAC address of the destination. In each switch, the Bloom filter matches only one entry (note that we choose false-positive-free routes exclusively), which indicates the appropriate output port. The last switch on the route replaces the destination MAC address of the packet with the MAC address of the host, and the packet gets delivered. The backward process works the same way. Note that the NOX controller is not involved in the forwarding operations.

This type of operation also makes it easy to introduce flow-level multi-path communication. The ESR routing implemented in the FScafida daemon updates the ARP cache frequently. Thus, incoming flows retrieve different Bloom filters for the same IP address from the cache, thereby communicating on multiple paths with the same host.
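Conceptually, the daemon only has to hand out a different precomputed Bloom address per new flow; the toy sketch below (class and method names are ours, and the real daemon writes the chosen address into the host's ARP cache as the destination MAC) cycles through the ESR paths of each destination.

```python
import itertools

class BloomPathCache:
    """Host-side mapping kept by the FScafida daemon: several precomputed
    Bloom addresses per destination, handed out round-robin so successive
    flows to the same host follow different paths."""

    def __init__(self):
        self._cycles = {}                    # destination IP -> address iterator

    def install(self, dst_ip, bloom_addresses):
        self._cycles[dst_ip] = itertools.cycle(bloom_addresses)

    def next_address(self, dst_ip):
        return next(self._cycles[dst_ip])

cache = BloomPathCache()
cache.install("10.0.0.7", [0b011011, 0b010111])       # two ESR paths to one host
print(format(cache.next_address("10.0.0.7"), "06b"))  # first flow
print(format(cache.next_address("10.0.0.7"), "06b"))  # next flow, other path
```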

5.2. Multicast

The application of Bloom filters makes it possible to implement multicast in a simple way. If a Bloom filter matches multiple flow table entries in a switch, then forwarding on all matching output ports naturally realizes multicast. Since OpenFlow supports routing based only on the first matching entry in the flow table, we extend the implementation of Open vSwitch [27] to be able to seek multiple matches in the flow table. In order to have both multicast and the high performance of first-match based forwarding, we mark multicast packets in the header, and multiple matching is triggered exclusively when receiving a multicast header. This way the processing of a unicast packet happens much the same way as before, maintaining high-performance unicast forwarding.

The implementation of the multiple matching process relies on processing lists of matched rules. When a switch receives a packet, it scans through the complete classifier flow table for matching entries. If there is a match, we concatenate the associated classifier rule to a list; we note that this list is empty at the beginning of the process. When the scan passes the last entry in the flow table, we finish the multiple matching process, and the actions associated with the matched rules can be put together in a meta-rule to be performed. At the end, we install the meta-rule on the datapath. This modification affects the ofproto.c and classifier.c source files of the Open vSwitch implementation.
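The multiple matching logic can be modeled independently of the Open vSwitch internals; the sketch below (our own illustration with a toy flow table, not the modified C code) collects every matching action into one meta-rule for multicast-marked packets and keeps first-match behaviour for unicast.

```python
def collect_actions(flow_table, bloom_address, is_multicast):
    """flow_table: list of (bloom_id, action) pairs installed per port.
    Scan the whole table; for multicast packets merge all matching actions
    into a meta-rule, for unicast packets stop at the first match."""
    actions = []
    for bloom_id, action in flow_table:
        if bloom_address & bloom_id == bloom_id:       # Bloom filter match
            actions.append(action)
            if not is_multicast:
                break                                  # ordinary first match
    return actions

table = [(0b001000, "output:1"), (0b010000, "output:2"), (0b100000, "output:3")]
print(collect_actions(0b110011, table, is_multicast=True))   # ['output:2', 'output:3']
print(collect_actions(0b011011, table, is_multicast=False))  # ['output:1']
```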

5.3. Failure discovery

Discovering link failures and notifying end hosts are basic processes in FScafida, needed to calculate the source route Bloom addresses in an up-to-date manner. To discover link failures, the FScafida prototype uses heartbeat messages and timers according to the standard Link Layer Discovery Protocol (LLDP) [30]. If a switch does not receive an LLDP packet on a given port before its timer expires, then it assumes a link failure and triggers an intelligent flooding of this information. If the switch receives an LLDP packet through a port for which the timer has previously expired, it assumes a link recovery, which also triggers pruned flooding. The switch that receives the link failure/recovery message creates a non-permanent wildcarded flow entry containing the Bloom ID of the appertaining port in the source MAC address field. Afterwards, the switch wildcards all fields except the source MAC address and associates a drop action with it. This way, upon receiving this notification multiple times, the switch drops the message. Unfortunately, OpenFlow currently implements LLDP only in the controller; therefore, this feature scales poorly in the prototype.

5.4. Measurements

To validate and evaluate the implementation, we establish a 19-node FScafida data center in a laboratory environment. The testbed consists of two types of commodity switches, namely two 24-port HP ProCurve 6600 switches and seven TP-Link TL-WR841ND 4+1-port switches, plus lab PCs. More precisely, we generate an FScafida topology according to the algorithm presented in Section 2.1, with input parameters describing the number of different types of switches and hosts. Our testbed topology consists of seven 4-port switches (TP-Link devices), three 8-port switches (a 24-port HP switch was logically separated), one 24-port switch (HP device), and eight servers (laboratory PCs). The links between the nodes are also given by our algorithm. The commodity switches need special firmware to support the OpenFlow protocol. In the case of the HP switches, we use the v2.02s (K.14.77) OpenFlow-capable firmware, while the TP-Link devices are operated by OpenWRT [31] Backfire 10.03.1-rc4 with the OpenFlow extension (this TP-Link model requires a slight modification in the kernel driver of the switch; more exactly, the MAC learning function has to be disabled in the corresponding source file, ag71xx_ar7240.c). We use the NOX v0.8.0 OpenFlow controller operating in out-of-band control mode in a separate management network, and we run our NOX controller application implementing the Bloom filter based forwarding mechanism.

Next, we demonstrate the failure discovery mechanism of the network. We establish a single flow between two servers and after a while unplug a cable on the communication route. In Fig. 10a, we plot the IDs of the received packets against the timestamps at the receiver side. The system detects the link failure and propagates the information to the end hosts; thus, the sender changes the Bloom address to send the packets on a different path. The recovery mechanism is performed within 4 s in this scenario and the transmission is continued. The time parameters of the discovery application of NOX affect this behavior (specifically, the TIMEOUT_CHECK_PERIOD and LINK_TIMEOUT parameters are set to 1 s and 4.5 s, respectively); inadequate parameters can lead to oscillation of the network. We emphasize that the efficiency and scalability of the mechanism can be enhanced by implementing LLDP in the switches.

Fig. 10. Measurements: (a) IDs of the received packets vs. timestamps, showing the cable being unplugged and the flow being restored; (b) CDF of the number of ports vs. the number of packets sent/received, with and without multipath.

To investigate multipath efficiency we run a measurement on a 484-node Mininet [28] network, an emulated OpenFlow testbed. Here, we use Open vSwitches with our extensions to support the special forwarding mechanism. This topology is also generated by our algorithm, with input parameters describing the number of servers and switches with port numbers of 4, 8, 16, 24, and 48, respectively. The result of the algorithm is a 484-node virtual testbed. We start 100 flows with 100 packets each between two random hosts of the DC, which traverse diverse routes according to ESR. Fig. 10b shows the CDF of the tx/rx statistics of the ports over all switches in the network, compared to the case when no multipath is used. Here, tx/rx values give the number of packets that traversed the corresponding port. The plot gives information on the number of ports having more than a given number of packets sent or received. The tx/rx statistics of the ports also contain the control packets of the network, e.g., the packets that are necessary to set up the topology itself. This is why we have an identical number of ports with less than 10 packets both with and without the multipath scheme. On the one hand, one can see that the 10,000 packets travel along the same path when there is no multipath capability, which results in large tx/rx values; the plot shows that we have several ports with high traffic intensity. On the other hand, ESR effectively distributes the load among numerous ports in the network, and we have more ports with moderate traffic and only a few ports with high load.

6. Cost considerations

Simulation results and experimental evaluation show that our approach satisfies its design goals and is worth considering in production networks, especially for SMEs and NPOs (non-profit organizations) with ad hoc network expansion opportunities and on a strict budget. With that said, there are certain areas of expenditure that merit deeper consideration.

6.1. Customizing for a set budget

Throughout the paper we assumed a fixed set of available commodity hardware, over which a data center should be built. While this is a plausible scenario, an organization can also have a fixed budget to spend on hardware and zero available equipment. Upon gathering pricing information, our simulation tools could be used to choose an optimal set of switches and servers resulting in the best-performing FScafida data center. A recent cost comparison of several data center architectures discusses cases where special topologies are preferred from an expenditure point of view [32]. Due to its flexible structure and competitive structural properties, FScafida can be used as a single data center architecture in all the cases. Thus, FScafida allows stakeholders to decide about the aggregate expenditure of the data center instead of selecting a specific topology based on assumed future market scenarios (e.g., the price of switches vs. servers).
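As a rough illustration of such a budget-driven selection, the Python sketch below enumerates switch mixes that fit a given budget and spends the remainder on servers; the catalogue reuses the prototype prices of Section 6.4, while the server price and the feasibility rule are assumed placeholders, and a real tool would rank the candidate mixes by simulated FScafida performance.

from itertools import product

# Prototype device prices from Section 6.4; the server price is an assumed placeholder.
CATALOGUE = {"tp-link-4port": (4, 34), "hp6600-24port": (24, 3800)}   # ports, unit price (USD)
SERVER_PRICE = 400

def best_mix(budget, max_units=50):
    # Enumerate switch mixes within the budget, spend the remainder on servers,
    # and keep the mix that hosts the most servers while leaving at least as
    # many switch ports again for switch-to-switch links (a crude feasibility
    # proxy; a real tool would rank mixes by simulated FScafida performance).
    best = None
    names = list(CATALOGUE)
    for counts in product(range(max_units + 1), repeat=len(names)):
        cost = sum(c * CATALOGUE[n][1] for c, n in zip(counts, names))
        if cost > budget:
            continue
        servers = int((budget - cost) // SERVER_PRICE)
        ports = sum(c * CATALOGUE[n][0] for c, n in zip(counts, names))
        if servers == 0 or ports < 2 * servers:
            continue
        key = (servers, ports - servers)   # more servers first, then spare ports
        if best is None or key > best[0]:
            best = (key, dict(zip(names, counts)), servers)
    return best

print(best_mix(budget=20000))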

6.2. Cabling

Our experience with small enterprise and university networks suggests that usually very little emphasis is put on proper, clearly structured racking and cabling. While ad hoc solutions can certainly work on a small scale, thousands of data center nodes would certainly benefit from order. In our case, this implies finding subgraphs in the topology that can be placed in the same rack. Thus, one issue with the cabling of FScafida is its complexity: it requires somewhat chaotic cabling that is error-prone to set up and hard to debug. Addressing this constitutes important future work for us. From a cost viewpoint, connecting the equipment of an FScafida DC requires intensive manual labor; however, as the absolute cost of cabling is marginal compared to the aggregate cost of DC architectures [32], FScafida deployment would not be hindered by cabling costs.

Regarding cabling errors, we refer to the Jellyfish architecture [15]. We assume that in an FScafida DC, human workers connect cables by following a "cabling map", generated automatically based on the topology and the physical DC layout (racks). While some human errors may happen in cabling, these are easy to detect and fix (e.g., a suitable layer-2 discovery protocol is proposed in [33]). Fixing such miscablings is relatively cheap: the labor cost of cabling is approx. 10% of total cabling costs [34].
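Such a cabling map is straightforward to derive from the generated link list once nodes are assigned to racks, as the following illustrative Python sketch shows; the rack assignment is assumed to be given.

def cabling_map(links, rack_of):
    # Turn a generated link list into a human-readable cabling map:
    # intra-rack cables first, then cross-rack cables grouped by rack pair.
    # `rack_of` maps a node name to its rack label (the physical layout input).
    intra, inter = [], {}
    for a, b in links:
        ra, rb = rack_of[a], rack_of[b]
        if ra == rb:
            intra.append((ra, a, b))
        else:
            inter.setdefault(tuple(sorted((ra, rb))), []).append((a, b))
    lines = ["[%s] %s <-> %s" % (r, a, b) for r, a, b in sorted(intra)]
    for (ra, rb), cables in sorted(inter.items()):
        lines += ["[%s <-> %s] %s <-> %s" % (ra, rb, a, b) for a, b in cables]
    return "\n".join(lines)

# Toy usage with the link list from the generator sketch and a made-up rack assignment:
# print(cabling_map(links, rack_of={"sw0": "R1", "srv0": "R1", "sw1": "R2", ...}))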

As for designing the cabling layout, we could use the optimization technique proposed in [9]. The key idea is that in a high-performance FScafida topology, the largest portion of cables run between switches; hence, placing the switches close to each other reduces cable length and labor costs. This placement also simplifies recabling due to expansion or miscabling. Very large container DCs are out of scope for FScafida for now: the main focus groups are SMEs and NPOs.

6.3. Power consumption

FScafida data centers have the potential to be energy efficient due to their scalable and flexible structure. As FScafida can realize structures of any size, the resulting data center is energy proportional, i.e., its power consumption is proportional to the number of servers. In contrast, the power consumption of state-of-the-art data center architectures is not energy proportional [35,36]. Current energy price trends suggest that energy proportionality will be a crucial property with respect to DC expenditures, as the price of power tends to increase. On top of this, operators can utilize network-wide power managers such as ElasticTree [37] inside FScafida DCs, switching off unnecessary servers and network elements without sacrificing throughput.
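The following back-of-the-envelope Python sketch illustrates what energy proportionality means in practice; the per-device power figures are assumed placeholders rather than measured values.

# Assumed per-device power figures (placeholders, not measurements).
SERVER_W = 250                            # watts per active server
SWITCH_W = {4: 6, 8: 10, 24: 60}          # watts per active switch, by port count

def dc_power(active_servers, active_switch_ports):
    # Energy-proportional estimate: only powered-on devices contribute.
    return active_servers * SERVER_W + sum(SWITCH_W[p] for p in active_switch_ports)

# Full testbed vs. an off-peak subset with unused equipment powered down.
full = dc_power(8, [4] * 7 + [8] * 3 + [24])
off_peak = dc_power(3, [4] * 2 + [24])
print(full, off_peak)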

6.4. Commodity switches

We design FScafida to create DCs out of commodity switches using a flexible topology. FScafida has a specialized routing method to exploit the benefits of its topology; however, this does not increase the cost of its deployment. The firmware of some off-the-shelf network equipment supported by OpenWrt can be easily modified [31]; thus, we can create an FScafida DC using cheap commodity switches. For example, the cost of the devices in our prototype is very low: the TP-Link switches cost $34 according to Amazon, and even the price of the 24-port HP switches was moderate ($3800 in 2011).

7. Related work

There is a sizable and highly practical body of work related to data center networking. Several data center architectures have been proposed recently; most of them are based on symmetric structures. Data centers based on fat-tree topologies [9,7,8] are built using commodity switches arranged in three layers, namely core, medium, and server layers. The structure is also known as Clos topology. The LEGUP proposal tackles the problem of efficiently expanding existing data centers, adding heterogeneity to Clos networks [14].

The symmetric structure of the BCube [11] data center architecture is designed using a recursive algorithm. BCube is intended to be used in container-based, modular data centers with a few thousand servers. The MDCube architecture proposes a method to interconnect these containers to create a mega-data center [12]. Similarly to BCube, DCell is also built recursively [10]. DCell's recursive construction scales up rapidly; thus, DCell can support an enormous number of servers with few structural levels and switch ports.

PortLand [8] is a scalable, fault-tolerant layer-2 data center fabric built on multi-rooted tree topologies (including the fat tree) that supports easy migration of virtual machines. VL2 [7] is an agile data center fabric with properties such as performance isolation between services, load balancing, and flat addressing.

Jellyfish [15], a recent DC architecture proposal, suggests using random networks as the underlying DC topology. Due to the random structure, incremental expandability of such a DC is attainable, and Jellyfish further allows the construction of arbitrary-size networks. However, routing in Jellyfish DCs is challenging and can only be solved in a cross-layer manner involving Multi-Path TCP. Despite comparable (incremental) scalability properties, FScafida has an advantage over Jellyfish owing to its structured randomness inherited from scale-free networks: routing inside an FScafida DC can be handled by our proposed ESR method without considerable effort. Furthermore, FScafida exhibits better structural resilience.

Small-world network based data center topologies have also been proposed recently [16]. Such a topology consists of network parts with regular patterns, like rings or tori, that are interconnected randomly. Similar to our proposal, small-world data centers have favorable properties compared to conventional data center networks.

A recent study provides key insights from production data centers [22]. Among others, it shows that short and small flows dominate data center traffic, that predominantly short flow inter-arrival times require quick forwarding decisions in switches, and that centralized fabric managers need to be implemented in hardware to scale with growing network size.

Various studies have dealt with the properties of scale-free (or complex) networks over the past decade; as a starting point, we refer the reader to [38].

8. Conclusion

We presented the design, implementation, and multi-faceted evaluation of the FScafida data center network architecture. FScafida's basic structure consists of servers and commodity switches, connected into a topology inspired by scale-free networks. We presented a detailed evaluation of the underlying network topology. Moreover, in this paper we proposed an organic expansion algorithm for FScafida resulting in a freely scalable data center network structure with favorable incremental properties (scalability and expandability) and performance characteristics (short paths, high bisection bandwidth, and fault-tolerance). On top of its interconnection structure, FScafida runs a fault-tolerant source routing protocol, ESR, performing low-stretch, load-balanced, multi-path, multicast-enabled routing, while requiring minimal computation in switches. We also implemented an OpenFlow-based FScafida prototype, and validated its main features via Mininet emulations and live testbed measurements. We made the source code of our prototype implementation, simulator, and additional evaluation results available on request. We believe that both small and large, continuously expanding data centers could benefit from our findings.

Acknowledgement

The authors thank HP Networking for providing the HP ProCurve 6600 switches for the prototype. The authors would like to thank the reviewers for their insightful comments, which greatly helped to improve the paper. This work was partially performed in the High Speed Networks Laboratory at BME-TMIT. This work was supported by the OTKA–KTIA grant CNK77802. The publication was supported by the TÁMOP-4.2.2.C-11/1/KONV-2012-0001 project. This work was partially supported by the European Union and the European Social Fund through project FuturICT.hu (grant no.: TAMOP-4.2.2.C-11/1/KONV-2012-0013).

References

[1] Forrester Research, Market Overview: Private Cloud Solutions, Q2 2011. <http://www.forrester.com/rb/Research/market_overview_private_cloud_solutions%2C_q2_2011/q/id/58924/t/2>.

[2] Defense Systems, DoD Tackles Information Security in the Cloud. <http://defensesystems.com/articles/2011/01/24/defense-it-1-dod-cloud-computing-security-issues.aspx>.

[3] Advanced Systems Group, Private vs. Public Cloud Financial Analysis. <http://www.virtual.com/solutions/cloud-computing/cloud-financial-analysis>.

[4] A. Licis, HP, Data Center Planning, Design and Optimization: A Global Perspective, June 2008. <http://goo.gl/Sfydq>.

[5] R. Miller, Facebook Now has 30,000 Servers, November 2009. <http://goo.gl/EGD2D>.

[6] R. Miller, Facebook Server Count: 60,000 or More, June 2010. <http://goo.gl/79J4>.

[7] A. Greenberg, J.R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D.A. Maltz, P. Patel, S. Sengupta, VL2: a scalable and flexible data center network, in: SIGCOMM '09, 2009, pp. 51–62. doi:http://doi.acm.org/10.1145/1592568.1592576.


[8] R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, A. Vahdat, PortLand: a scalable fault-tolerant layer 2 data center network fabric, in: SIGCOMM '09, 2009, pp. 39–50. doi:http://doi.acm.org/10.1145/1592568.1592575.

[9] M. Al-Fares, A. Loukissas, A. Vahdat, A scalable, commodity data center network architecture, in: SIGCOMM '08, 2008, pp. 63–74. doi:http://doi.acm.org/10.1145/1402958.1402967.

[10] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, S. Lu, DCell: a scalable and fault-tolerant network structure for data centers, in: SIGCOMM '08, 2008, pp. 75–86. doi:http://doi.acm.org/10.1145/1402958.1402968.

[11] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, S. Lu, BCube: a high performance, server-centric network architecture for modular data centers, in: SIGCOMM '09, 2009, pp. 63–74. doi:http://doi.acm.org/10.1145/1592568.1592577.

[12] H. Wu, G. Lu, D. Li, C. Guo, Y. Zhang, MDCube: a high performance network structure for modular data center interconnection, in: CoNEXT '09, ACM, New York, NY, USA, 2009, pp. 25–36. doi:http://doi.acm.org/10.1145/1658939.1658943.

[13] R. Katz, Tech titans building boom, IEEE Spectrum 46 (2) (2009) 40–54, http://dx.doi.org/10.1109/MSPEC.2009.4768855.

[14] A.R. Curtis, S. Keshav, A. Lopez-Ortiz, LEGUP: using heterogeneity to reduce the cost of data center network upgrades, in: CoNEXT '10, ACM, New York, NY, USA, 2010, pp. 14:1–14:12. doi:http://doi.acm.org/10.1145/1921168.1921187.

[15] A. Singla, C. Hong, L. Popa, P. Godfrey, Jellyfish: networking data centers, randomly, in: USENIX NSDI '12, 2011.

[16] J.-Y. Shin, B. Wong, E.G. Sirer, Small-world datacenters, in: Proceedings of the 2nd ACM Symposium on Cloud Computing, SOCC '11, ACM, New York, NY, USA, 2011, pp. 2:1–2:13. doi:http://doi.acm.org/10.1145/2038916.2038918.

[17] L. Gyarmati, T.A. Trinh, Scafida: a scale-free network inspired data center architecture, SIGCOMM Comput. Commun. Rev. 40 (2010) 4–12. doi:http://doi.acm.org/10.1145/1880153.1880155.

[18] A.-L. Barabási, R. Albert, Emergence of scaling in random networks, Science 286 (5439) (1999) 509–512, http://dx.doi.org/10.1126/science.286.5439.509. <http://www.sciencemag.org/cgi/doi/10.1126/science.286.5439.509>.

[19] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, in: OSDI '04, San Francisco, 2004.

[20] R. Albert, H. Jeong, A.-L. Barabási, Error and attack tolerance of complex networks, Nature 406 (6794) (2000) 378–382, http://dx.doi.org/10.1038/35019019. <http://www.ncbi.nlm.nih.gov/pubmed/10935628>.

[21] A. Broder, M. Mitzenmacher, Network applications of Bloom filters: a survey, Internet Mathematics 1 (4) (2004) 485–509.

[22] T. Benson, A. Akella, D. Maltz, Network traffic characteristics of data centers in the wild, in: IMC '10, 2010, pp. 267–280. <http://pages.cs.wisc.edu/~akella/papers/dc-meas-imc10.pdf>.

[23] C. Rothenberg, C. Macapuna, F. Verdi, M. Magalhães, A. Zahemszky, Data center networking with in-packet Bloom filters, in: SBRC, Gramado, Brazil, 2010.

[24] M. Sarela, C. Rothenberg, T. Aura, A. Zahemszky, P. Nikander, J. Ott, Forwarding anomalies in Bloom filter-based multicast, in: INFOCOM 2011 Proceedings, IEEE, 2011.

[25] G. Yan, T. Zhou, B. Hu, Z. Fu, B. Wang, Efficient routing on complex networks, Phys. Rev. E 73 (4) (2006) 46108.

[26] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, J. Turner, OpenFlow: enabling innovation in campus networks, SIGCOMM Comput. Commun. Rev. 38 (2) (2008) 69–74.

[27] Open vSwitch. <http://openvswitch.org/>.

[28] B. Lantz, B. Heller, N. McKeown, A network in a laptop: rapid prototyping for software-defined networks, in: HotNets '10, ACM, 2010, p. 19.

[29] F. Németh, A. Stipkovits, B. Sonkoly, A. Gulyás, Towards SmartFlow: case studies on enhanced programmable forwarding in OpenFlow switches, in: ACM SIGCOMM, Helsinki, Finland, 2012, pp. 85–86.

[30] IEEE 802.1AB, Station and media access control connectivity discovery.

[31] OpenWrt. <http://openwrt.org/>.

[32] L. Popa, S. Ratnasamy, G. Iannaccone, A. Krishnamurthy, I. Stoica, A cost comparison of datacenter network architectures, in: ACM CoNEXT '10, ACM, New York, NY, USA, 2010, pp. 16:1–16:12. <http://doi.acm.org/10.1145/1921168.1921189>.

[33] Microsoft, Link Layer Topology Discovery Protocol. <http://goo.gl/bAcZ5>.

[34] J. Mudigonda, P. Yalagandula, J.C. Mogul, Taming the flying cable monster: a topology design and optimization framework for data-center networks, in: Proceedings of the 2011 USENIX Annual Technical Conference, USENIX ATC '11, USENIX Association, Berkeley, CA, USA, 2011, pp. 8–8. <http://dl.acm.org/citation.cfm?id=2002181.2002189>.

[35] L. Gyarmati, A. Trinh, Energy efficiency of data centers, in: Green IT: Technologies and Applications, Springer-Verlag, 2011, pp. 229–244.

[36] L. Gyarmati, T.A. Trinh, How can architecture help to reduce energy consumption in data center networking?, in: e-Energy '10, ACM, New York, NY, USA, 2010, pp. 183–186. doi:http://doi.acm.org/10.1145/1791314.1791343.

[37] B. Heller, S. Seetharaman, P. Mahadevan, Y. Yiakoumis, P. Sharma, S. Banerjee, N. McKeown, ElasticTree: saving energy in data center networks, in: NSDI '10, 2010.

[38] A. Barabási, Scale-free networks: a decade and beyond, Science 325 (5939) (2009) 412.

László Gyarmati is an associate researcher at Telefonica Research. He received his Ph.D. and M.Sc. in Computer Science from the Budapest University of Technology and Economics, Hungary, in 2011 and 2008, respectively. His research interests include network economics, in particular the application of game theoretic methods, online social networks, and the energy efficiency of data centers. In addition, he has an M.Sc. degree in biomedical engineering.

András Gulyás received M.Sc. and Ph.D. degrees in Computer Science from the Budapest University of Technology and Economics, Hungary, in 2002 and 2008, respectively. He is an assistant professor at the Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics. His current research areas span novel routing mechanisms, social and complex networks, and self-organizing networks.

Balázs Sonkoly is an assistant professor at the Department of Telecommunications and Media Informatics, BME. He received his Ph.D. (2010) and M.Sc. (2002) degrees in Computer Science from the Budapest University of Technology and Economics, Hungary. He has participated in several national projects and has been involved in EU projects as well. His research activity focuses on OpenFlow, novel routing mechanisms, congestion control, traffic modeling, and intelligent transportation systems.


Tuan Anh Trinh received his M.Sc. and Ph.D. in Computer Science from the Budapest University of Technology and Economics, Hungary, in 2000 and 2005, respectively. He is a co-leader of the Network Economics Group at BUTE. His research interests include game theoretic modeling of communication systems, economics-inspired system design, and fairness issues in resource allocation problems in the Internet.

Gergely Biczók is a Postdoctoral Fellow at the Norwegian University of Science and Technology. He earned his Ph.D. and M.Sc. degrees in Computer Science from the Budapest University of Technology and Economics, Hungary, in 2010 and 2003, respectively. He was a Fulbright Scholar to Northwestern University in 2008, and held a researcher position at Ericsson Research from 2003 to 2007. His current research interests include data centers, network economics, and network coding.