Replication for Web Hosting Systems

SWAMINATHAN SIVASUBRAMANIAN, MICHAŁ SZYMANIAK, GUILLAUME PIERRE, and MAARTEN VAN STEEN

Vrije Universiteit, Amsterdam

Replication is a well-known technique to improve the accessibility of Web sites. It generally offers reduced client latencies and increases a site's availability. However, applying replication techniques is not trivial, and various Content Delivery Networks (CDNs) have been created to facilitate replication for digital content providers. The success of these CDNs has triggered further research efforts into developing advanced Web replica hosting systems. These are systems that host the documents of a website and manage replication automatically. To identify the key issues in designing a wide-area replica hosting system, we present an architectural framework. The framework assists in characterizing different systems in a systematic manner. We categorize different research efforts and review their relative merits and demerits. As an important side-effect, this review and characterization shows that there are a number of interesting research questions that have not received much attention yet, but which deserve exploration by the research community.

Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems; C.4 [Performance of Systems]: Design Studies; H.3.5 [Information Storage and Retrieval]: Online Information Service—Web-based services

General Terms: Design, Measurement, Performance

Additional Key Words and Phrases: Web replication, content delivery networks

1. INTRODUCTION

Replication is a technique that allows us to improve the quality of distributed services. In the past few years, it has been increasingly applied to Web services, notably for hosting Web sites. In such cases, replication involves creating copies of a site's Web documents, and placing these document copies at well-chosen locations. In addition, various measures are taken to ensure (possibly different levels of) consistency when a replicated document is updated. Finally, effort is put into redirecting a client to a server hosting a document copy such that the client is optimally served. Replication can lead to reduced client latency and network traffic by redirecting client requests to a replica closest to that client. It can also improve the availability of the system, as the failure of one replica does not result in an entire service outage.

These advantages motivate many Web content providers to offer their services using systems that apply replication techniques. We refer to systems providing such hosting services as replica hosting systems. The design space for replica hosting systems is large and seemingly complex. In this article, we concentrate on organizing this design space and review several important research efforts concerning the development of Web replica hosting systems. A typical example of such a system is a Content Delivery Network (CDN) [Hull 2002; Rabinovich and Spatscheck 2002; Verma 2002].

There exists a wide range of articles discussing selected aspects of Web replication. However, to the best of our knowledge, there is no single framework that aids in understanding, analyzing and comparing the efforts conducted in this area. In this article, we provide a framework that covers the important issues that need to be addressed in the design of a Web replica hosting system. The framework is built around an objective function, a general method for evaluating the system performance. Using this objective function, we define the role of the different system components that address separate issues in building a replica hosting system.

The Web replica hosting systems we consider are scattered across a large geographical area, notably the Internet. When designing such a system, at least the following five issues need to be addressed:

(1) How do we select and estimate the metrics for taking replication decisions?
(2) When do we replicate a given Web document?
(3) Where do we place the replicas of a given document?
(4) How do we ensure consistency of all replicas of the same document?
(5) How do we route client requests to appropriate replicas?

Each of these five issues is to a large extent independent from the others. Once grouped together, they address all the issues constituting a generalized framework of a Web replica hosting system. Given this framework, we compare and combine several existing research efforts, and identify problems that have not been addressed by the research community before.

Another issue that should also be addressed separately is selecting the objects to replicate. Object selection is directly related to the granularity of replication. In practice, whole websites are taken as the unit for replication, but Chen et al. [2002b, 2003] show that grouping Web documents can considerably improve the performance of replication schemes at relatively low costs. However, as not much work has been done in this area, we have chosen to exclude object selection from our study.

We further note that Web caching is an area closely related to replication. In caching, whenever a client requests a document for the first time, the client process or the local server handling the request will fetch a copy from the document's server. Before passing it to the client, the document is stored locally in a cache. Whenever that document is requested again, it can be fetched from the cache locally. In replication, a document's server proactively places copies of the document at various servers, anticipating that enough clients will make use of these copies. Caching and replication thus differ only in the method of creation of copies. Hence, we perceive caching infrastructures (like, e.g., Akamai [Dilley et al. 2002]) also as replica hosting systems, as document distribution is initiated by the server. For more information on traditional Web caching, see Wang [1999]. A survey on hierarchical and distributed Web caching can be found in Rodriguez et al. [2001].

A complete design of a Web replica hosting system cannot restrict itself to addressing the above five issues, but should also consider other aspects. The two most important ones are security and fault tolerance. From a security perspective, Web replica hosting systems provide a solution to denial-of-service (DoS) attacks. By simply replicating content, it becomes much more difficult to prevent access to specific Web content. On the other hand, making a Web replica hosting system secure is currently done by using the same techniques as available for website security. Obviously, widespread Web content replication poses numerous security issues, but these have not been sufficiently studied to warrant inclusion in a survey such as this.

In contrast, when considering fault tolerance, we face problems that have been extensively studied in the past decades, with many solutions that are now being incorporated into highly replicated systems such as those studied here. Notably, the solutions for achieving high availability and fault tolerance of a single site are orthogonal to achieving higher performance and accessibility in Web replica hosting systems. These solutions have been extensively documented in the literature [Schneider 1990; Jalote 1994; Pradhan 1996; Alvisi and Marzullo 1998; Elnozahy et al. 2002], for which reason we do not explicitly address them in our current study.

1.1. Motivating Example

To obtain a first impression of the size of the design space, let us consider a few existing systems. We adopt the following model. An object encapsulates a (partial) implementation of a service. Examples of an object are a collection of Web documents (such as HTML pages and images) and server scripts (such as ASPs, PHPs) along with their databases. An object can exist in multiple copies, called replicas. Replicas are stored on replica servers, which together form a replica hosting system. The replica hosting system delivers appropriate responses to its clients. The response usually comprises a set of documents generated by the replica, statically or dynamically. The time between a client issuing a request and receiving the corresponding response is defined as client latency. When the state of a replica is updated, this update needs to propagate to other replicas so that no stale document is delivered to the clients. We refer to the process of ensuring that all replicas of an object have the same state amidst updates as consistency enforcement.

Now consider the following four different CDNs, which we discuss in more detail below. Akamai is one of the largest CDNs currently deployed, with tens of thousands of replica servers placed all over the Internet [Dilley et al. 2002]. To a large extent, Akamai uses well-known technology to replicate content, notably a combination of DNS-based redirection and proxy-caching techniques. A different approach is followed by Radar, a CDN developed at AT&T [Rabinovich and Aggarwal 1999]. A distinguishing feature of Radar is that it can also perform migration of content in addition to replication. SPREAD [Rodriguez and Sibal 2000] is a CDN also developed at AT&T that attempts to transparently replicate content through interception of network traffic. Its network-level approach leads to the application of different techniques. Finally, the Globule system is a research CDN aiming to support large-scale user-initiated content distribution [Pierre and van Steen 2001, 2003].

Table I gives an overview of how these four systems deal with the design issues of replica placement, consistency enforcement, adaptation triggering, and request routing. Let us consider each of these entries in more detail.

Table I. Summary of Strategies Adopted by Four Different CDNs

Replica placement:
  Akamai  - Caching
  Radar   - Content is shipped to replica servers located closer to clients
  SPREAD  - Replicas created on paths between clients and the original document server
  Globule - Selects the best replication strategy regularly upon evaluating different strategies

Consistency enforcement:
  Akamai  - Consistency based on replica versioning
  Radar   - Primary-copy approach
  SPREAD  - Different strategies, chosen on a per-document basis
  Globule - Adaptive consistency policies

Adaptation triggering:
  Akamai  - Primarily server-triggered
  Radar   - Server-triggered
  SPREAD  - Router-triggered
  Globule - Server-triggered

Request routing:
  Akamai  - Two-tier DNS redirection combined with URL rewriting, considering server load and network-related metrics
  Radar   - Proprietary redirection mechanism, considering server load and proximity
  SPREAD  - Packet-handoff, considering hop-based proximity
  Globule - Single-tier DNS redirection, considering AS-based proximity

When considering replica placement, it turns out that this is not really an issue for Akamai because it essentially follows a caching strategy. However, with this scheme, client redirection is an important issue as we shall see. Radar follows a strategy that assumes that servers are readily available. In this case, replica placement shifts to the problem of deciding which server is the best one to place content on. Radar simply replicates or migrates content close to where many clients are. SPREAD considers the network path between clients and the original document server and makes an attempt to create replicas on that path. Globule, finally, also assumes that servers are readily available, but performs a global evaluation taking into account, for example, that the number of replicas needs to be restricted. Moreover, it does such evaluations on a per-document basis.

Consistency enforcement deals with the techniques that are used to keep replicas consistent. Akamai uses an elegant versioning scheme by which the version is encoded in a document's name. This approach creates a cache miss in the replica server whenever a new version is created, so that the replica server will fetch it from the document's main server. Radar simply applies a primary-copy protocol to keep replicas consistent. SPREAD deploys different techniques, and has in common with Globule that different techniques can be used for different documents. Globule, finally, evaluates whether consistency requirements continue to be met, and, if necessary, switches to a different technique for consistency enforcement.
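The name-based versioning used by Akamai, mentioned at the start of the previous paragraph, can be sketched in a few lines: encoding a version tag in the object name turns every new version into a fresh cache key, and therefore into a guaranteed cache miss at the replica. The naming convention and helper functions below are our own illustration and do not describe Akamai's actual scheme.

    # Sketch: encode the version in the object name so that a replica's
    # cache automatically misses on a new version (illustrative only).
    import hashlib

    def versioned_name(path, version):
        tag = hashlib.md5(str(version).encode()).hexdigest()[:8]
        return f"{path}?v={tag}"

    cache = {}   # replica server cache: versioned name -> content

    def serve_from_replica(name, origin):
        if name not in cache:              # new version => new name => miss
            cache[name] = origin[name]     # fetch once from the main server
        return cache[name]

    origin = {versioned_name("/logo.gif", 1): b"old bytes",
              versioned_name("/logo.gif", 2): b"new bytes"}
    print(serve_from_replica(versioned_name("/logo.gif", 2), origin))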

Adaptation triggering concerns selecting the component that is responsible for monitoring system conditions, and, if necessary, initiating adaptations to, for example, consistency enforcement and replica placement. All systems except SPREAD put servers in control for monitoring and adaptation triggering. SPREAD uses routers for this purpose.

Request routing deals with redirecting a client to the best replica for handling its request. There are several ways to do this redirection. Akamai and Globule both use the Domain Name System (DNS [Mockapetris 1987a]) for this purpose, but in a different way. Akamai takes into account server load and various network metrics, whereas Globule currently measures only the number of autonomous systems that a request needs to pass through. Radar deploys a proprietary solution taking into account server load and a Radar-specific proximity measure. Finally, SPREAD deploys a network packet-handoff mechanism, actually using router hops as its distance metric.

Without having gone into the details of these systems, it can be observed that approaches to organizing a CDN differ widely. As we discuss in this article, there are numerous solutions to the various design questions that need to be addressed when developing a replica hosting service. These questions and their solutions are the topic of this article.

1.2. Contributions and Organization

The main contributions of this article are three-fold. First, we identify and enumerate the issues in building a Web replica hosting system. Second, we propose an architectural framework for such systems and review important research work in the context of the proposed framework. Third, we identify some of the problems that have not been addressed by the research community until now, but whose solutions can be of value in building a large-scale replica hosting system.

The rest of the article is organized as follows. In Section 2, we present our framework of wide-area replica hosting systems. In Sections 3 to 7, we discuss each of the above-mentioned five problems forming the framework. For each problem, we refer to some of the significant related research efforts, and show how the problem was tackled. We draw our conclusions in Section 8.

2. FRAMEWORK

The goal of a replica hosting system is to provide its clients with the best available performance while consuming as few resources as possible. For example, hosting replicas on many servers spread throughout the Internet can decrease the client end-to-end latency, but is bound to increase the operational cost of the system. Replication can also introduce costs and difficulties in maintaining consistency among replicas, but the system should always continue to meet application-specific consistency constraints. The design of a replica hosting system is the result of compromises between performance, cost, and application requirements.

2.1. Objective Function

In a sense, we are dealing with an optimization problem, which can be modeled by means of an abstract objective function, Fideal, whose value λ is dependent on many input parameters:

λ = Fideal(p1, p2, p3, . . . , pn).

In our case, the objective function takes two types of input parameters. The first type consists of uncontrollable system parameters, which cannot be directly controlled by the replica hosting system. Typical examples of such uncontrollable parameters are client request rates, update rates for Web documents, and available network bandwidth. The second type of input parameters are those whose value can be controlled by the system.

Fig. 1. The feedback control loop for a replica hosting system.

Examples of such parameters include the number of replicas, the location of replicas, and the adopted consistency protocols.

One of the problems that replica hosting systems are confronted with is that, to achieve optimal performance, only the controllable parameters can be manipulated. As a result, continuous feedback is necessary, resulting in a traditional feedback control system as shown in Figure 1.

Unfortunately, the actual objective function Fideal represents an ideal situation, in the sense that the function is generally only implicitly known. For example, the actual dimension of λ may be a complex combination of monetary revenues, network performance metrics, and so on. Moreover, the exact relationship between input parameters and the observed value λ may be impossible to derive. Therefore, a different approach is always followed by constructing an objective function F whose output λ is compared to an assumed optimal value λ∗ of Fideal. The closer λ is to λ∗, the better. In general, the system is considered to be in an acceptable state if |λ∗ − λ| ≤ δ, for some system-dependent value δ.
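As a concrete illustration of the feedback control loop of Figure 1, the following minimal sketch checks whether λ stays within δ of λ∗ and, if not, adjusts a controllable parameter. The objective function and all parameter values are invented for illustration; none of the names correspond to an actual system.

    # Minimal sketch of the feedback loop: keep lambda within delta of
    # lambda_star by adjusting a controllable parameter.
    LAMBDA_STAR = 1.0   # assumed optimal value of F_ideal
    DELTA = 0.05        # system-dependent tolerance

    def objective(controllable, uncontrollable):
        # Invented objective: penalize high load per replica and replica cost.
        latency_penalty = uncontrollable["request_rate"] / (100.0 * controllable["num_replicas"])
        cost_penalty = 0.01 * controllable["num_replicas"]
        return 1.0 / (1.0 + latency_penalty + cost_penalty)

    def control_step(controllable, uncontrollable):
        lam = objective(controllable, uncontrollable)
        if abs(LAMBDA_STAR - lam) > DELTA:
            # Adaptation triggering: e.g., add a replica during a flash crowd.
            controllable["num_replicas"] += 1
        return lam

    controllable = {"num_replicas": 1}
    uncontrollable = {"request_rate": 500.0}   # requests per second
    for _ in range(5):
        print(round(control_step(controllable, uncontrollable), 3), controllable)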

We perceive any large-scale Web replica hosting system to be constantly adjusting its controllable parameters to keep λ within the acceptable interval around λ∗. For example, during a flash crowd (a sudden and huge increase in the client request rate), a server's load increases, in turn increasing the time needed to service a client. These effects may result in λ falling out of the acceptable interval, so that the system must adjust its controllable parameters to bring λ back to an acceptable value. Possible actions on the controllable parameters include increasing the number of replicas, or placing replicas close to the locations that generate most requests. The exact definition of the objective function F, its input parameters, the optimal value λ∗, and the value of δ are defined by the system designers and will generally be based on application requirements and constraints such as cost.

Fig. 2. A framework for evaluating wide-area replica hosting systems.

In this article, we use this notion of an objective function to describe the different components of a replica hosting system, corresponding to the different parts of the system design. These components cooperate with each other to optimize λ. They operate on the controllable parameters of the objective function, or observe its uncontrollable parameters.

2.2. Framework Elements

We identify five main issues that have to be considered during the design of a replica hosting system: metric determination, adaptation triggering, replica placement, consistency enforcement, and request routing. These issues can be treated as chronologically ordered steps that have to be taken when transforming a centralized service into a replicated one. Our proposed framework of a replica hosting system matches these five issues as depicted in Figure 2. Below, we discuss the five issues and show how each of them is related to the objective function.

In metric determination, we address the question of how to find and estimate the metrics required by different components of the system. Metric determination is the problem of estimating the value of the objective function parameters. We discuss two important issues related to metric estimation that need to be addressed to build a good replica hosting system. The first issue is metric identification: the process of identifying the metrics that constitute the objective function the system aims to optimize. For example, a system might want to minimize client latency to attract more customers, or might want to minimize the cost of replication. The other important issue is the process of metric estimation. This involves the design of mechanisms and services related to estimation or measurement of metrics in a scalable manner. As a concrete example, measuring the latency to every client is generally not scalable. In this case, we need to group clients into clusters and measure client-related metrics on a per-cluster basis instead of on a per-client basis (we call this process of grouping clients client clustering). In general, the metric estimation component measures various metrics needed by other components of the replica hosting system.

Adaptation triggering addresses the question of when to adjust or adapt the system configuration. In other words, we define when and how we can detect that λ has drifted too much from λ∗. Consider a flash crowd causing poor client latency. The system must identify such a situation and react, for example, by increasing the number of replicas to handle the increase in the number of requests. Similarly, congestion in a network where a replica is hosted can result in poor accessibility of that replica. The system must identify such a situation and possibly move that replica to another server. The adaptation-triggering mechanisms do not form an input parameter of the objective function. Instead, they form the heart of the feedback element in Figure 1, thus indirectly controlling λ and maintaining the system in an acceptable state.

With replica placement, we address the question of where to place replicas. This issue mainly concerns two problems: selection of locations to install replica servers that can host replicas (replica server placement) and selection of replica servers to host replicas of a given object (replica content placement). The server placement problem must be addressed during the initial infrastructure installation and during hosting infrastructure upgrades. The replica content placement algorithms are executed to ensure that content placement results in an acceptable value of λ, given a set of replica servers. Replica placement components use metric estimation services to get the values of the metrics required by their placement algorithms. Both replica server placement and replica content placement form controllable input parameters of the objective function.
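Content placement algorithms are discussed in detail later in this article; as a first impression, the following sketch shows one common family of heuristics, greedy placement, in which replica servers are chosen one at a time so as to minimize the total rate-weighted client cost. This is only an illustration of the general idea, with made-up inputs, not the algorithm of any particular system described here.

    # Greedy content placement sketch: repeatedly pick the candidate
    # server that most reduces the total (rate-weighted) client cost.
    def greedy_placement(candidates, clients, dist, k):
        """candidates: server ids; clients: {client_id: request_rate};
        dist: {(client_id, server_id): latency}; k: number of replicas."""
        chosen = []
        for _ in range(k):
            def total_cost(extra):
                servers = chosen + [extra]
                return sum(rate * min(dist[(c, s)] for s in servers)
                           for c, rate in clients.items())
            best = min((s for s in candidates if s not in chosen), key=total_cost)
            chosen.append(best)
        return chosen

    clients = {"c1": 10.0, "c2": 1.0}
    dist = {("c1", "s1"): 5.0, ("c1", "s2"): 50.0,
            ("c2", "s1"): 60.0, ("c2", "s2"): 4.0}
    print(greedy_placement(["s1", "s2"], clients, dist, k=1))   # ['s1']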

With consistency enforcement, we consider how to keep the replicas of a given object consistent. Maintaining consistency among replicas adds overhead to the system, particularly when the application requires strong consistency (meaning clients are intolerant to stale data) and the number of replicas is large. The problem of consistency enforcement is defined as follows. Given certain application consistency requirements, we must decide what consistency models, consistency policies and content distribution mechanisms can meet these requirements. A consistency model dictates the consistency-related properties of content delivered by the system to its clients. These models define consistency properties of objects based on time, value, or the order of transactions executed on the object. A consistency model is usually adopted by consistency policies, which define how, when, and which content distribution mechanisms must be applied. The content distribution mechanisms specify the protocols by which replica servers exchange updates. For example, a system can adopt a time-based consistency model and employ a policy where it guarantees its clients that it will never serve a replica that is more than an hour older than the most recent state of the object. This policy can be enforced by different mechanisms.
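The hour-old guarantee from the example above can be sketched as a simple check performed before a replica answers a request; pulling the fresh copy on demand is just one of the possible enforcement mechanisms, and all names and values below are illustrative.

    # Time-based policy sketch: never serve content that lags the origin
    # by more than an hour; refresh on demand if it does.
    import time

    MAX_STALENESS = 3600.0   # seconds

    origin  = {"content": "v2", "last_update": time.time()}
    replica = {"content": "v1", "last_refresh": time.time() - 7200}

    def serve():
        if origin["last_update"] - replica["last_refresh"] > MAX_STALENESS:
            # One possible content distribution mechanism: pull on demand.
            replica["content"] = origin["content"]
            replica["last_refresh"] = time.time()
        return replica["content"]

    print(serve())   # refreshes first, then returns "v2"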

Request routing is about deciding how to direct clients to the replicas they need. We choose from a variety of redirection policies and redirection mechanisms. Whereas the mechanisms provide a method for informing clients about replica locations, the policies are responsible for determining which replica must serve a client. The request routing problem is complementary to the placement problem, as the assumptions made when solving the latter are implemented by the former. For example, we can place replica servers close to our clients, assuming that the redirection policy directs the clients to their nearby replica servers. However, deliberately drifting away from these assumptions can sometimes help in optimizing the objective function. For example, we may decide to direct some client requests to more distant replica servers to offload the client-closest one. Therefore, we treat request routing as one of the (controllable) objective function parameters.
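A minimal sketch of such a redirection policy is shown below: prefer the closest replica, but divert to a more distant one when the closest is overloaded. The threshold and data structures are assumptions made for the example; real redirection policies, discussed later in this survey, are considerably more refined.

    # Redirection policy sketch: closest replica first, unless overloaded.
    LOAD_THRESHOLD = 0.8   # assumed maximum acceptable utilization

    def select_replica(client, replicas, dist, load):
        """replicas: ids; dist: {(client, replica): distance};
        load: {replica: utilization in [0, 1]}."""
        ranked = sorted(replicas, key=lambda r: dist[(client, r)])
        for r in ranked:
            if load[r] < LOAD_THRESHOLD:
                return r
        return ranked[0]   # everything overloaded: fall back to closest

    dist = {("c1", "r1"): 10.0, ("c1", "r2"): 40.0}
    load = {"r1": 0.95, "r2": 0.30}
    print(select_replica("c1", ["r1", "r2"], dist, load))   # 'r2'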

Each of the above design issues corresponds to a single logical system component. How each of them is actually realized can be very different. The five components together should form a scalable Web replica hosting system. The interaction between these components is depicted in Figure 3, which is a refinement of our initial feedback control system shown in Figure 1. We assume that λ∗ is a function of the uncontrollable input parameters, that is:

λ∗ = min F(p1, . . . , pk, pk+1, . . . , pn),

where the minimum is taken over the controllable parameters pk+1, . . . , pn, and p1, . . . , pk denote the uncontrollable parameters.

Its value is used for adaptation triggering. If the difference with the computed value λ is too high, the triggering component initiates changes in one or more of the three control components: replica placement, consistency enforcement, or request routing. These different components strive to maintain λ close to λ∗. They manage the controllable parameters of the objective function, now represented by the actually built system. Of course, the system conditions are also influenced by the uncontrollable parameters. The system condition is measured by the metric estimation services. They produce the current system value λ, which is then passed to the adaptation triggering component for subsequent comparison. This process of adaptation continues throughout the system's lifetime.

Fig. 3. Interactions between different components of a wide-area replica hosting system.

Note that the metric estimation services are also used by the components for replica placement, consistency enforcement, and request routing, respectively, for deciding on the quality of their decisions. These interactions are not shown in the figure for the sake of clarity.

3. METRIC DETERMINATION

All replica hosting systems need to adapt their configuration in an effort to maintain high performance while meeting application constraints at minimum cost. The metric determination component is required to measure the system condition, and thus allow the system to detect when the system quality drifts away from the acceptable interval so that the system can adapt its configuration if necessary.

Another purpose of the metric determination component is to provide each of the three control components with measurements of their input data. For example, replica placement algorithms may need latency measurements in order to generate a placement that is likely to minimize the average latency suffered by the clients. Similarly, consistency enforcement algorithms might require information on object staleness in order to react by switching to stricter consistency mechanisms. Finally, request routing policies may need to know the current load of replica servers in order to distribute the requests currently targeting heavily loaded servers to less loaded ones.

In this section, we discuss three issues that have to be addressed to enable scalable metric determination. The first issue is metric selection. Depending on the performance optimization criteria, a number of metrics must be carefully selected to accurately reflect the behavior of the system. Section 3.1 discusses metrics related to client latency, network distance, network usage, object hosting cost, and consistency enforcement.

The second issue is client clustering. Some client-related metrics should ideally be measured separately for each client. However, this can lead to scalability problems as the number of clients for a typical wide-area replica hosting system can be in the order of millions. A common solution to address this problem is to group clients into clusters and measure client-related metrics on a per-cluster basis. Section 3.2 discusses various client clustering schemes.

The third issue is metric estimation itself. We must choose mechanisms to collect metric data. These mechanisms typically use client-clustering schemes to estimate client-related metrics. Section 3.3 discusses some popular mechanisms that collect metric data.

3.1. Choice of Metrics

The choice of metrics must reflect all aspects of the desired performance. First of all, the system must evaluate all metrics that take part in the computation of the objective function. Additionally, the system also needs to measure some extra metrics needed by the control components. For example, a map of host-to-host distances may help the replica placement algorithms, although it does not have to be used by the objective function.

There exists a wide range of metrics that can reflect the requirements of both the system's clients and the system's operator. For example, the metrics related to latency, distance, and consistency can help evaluate the client-perceived performance. Similarly, the metrics related to network usage and object hosting cost are required to control the overall system maintenance cost, which should remain within bounds defined by the system's operator. We distinguish five classes of metrics, shown in Table II, which are discussed in the following sections.

3.1.1. Temporal Metrics. An important class of metrics is related to the time it takes for peers to communicate, generally referred to as latency metrics. Latency can be defined in different ways. To explain, we consider a client-server system and follow the approach described in Dykes et al. [2000] by modeling the total time to process a request, as seen from the client's perspective, as

T = TDNS + Tconn + Tres + Trest.

Table II. Five Different Classes of Metrics Used to Evaluate Performance in Replica Hosting Systems

  Temporal    - The metric reflects how long a certain action takes
  Spatial     - The metric is expressed in terms of a distance that is related to the topology of the underlying network, or the region in which the network lies
  Usage       - The metric is expressed in terms of usage of resources of the underlying network, notably consumed bandwidth
  Financial   - Financial metrics are expressed in terms of a monetary unit, reflecting the monetary costs of deploying or using services of the replica hosting system
  Consistency - The metrics express to what extent a replica's value may differ from the master copy

TDNS is the DNS lookup time needed to find the server's network address. As reported by Cohen and Kaplan [2001], DNS lookup time can vary tremendously due to cache misses (i.e., the client's local DNS server does not have the address of the requested host in its cache), although in many cases it stays below 500 milliseconds.

Tconn is the time needed to establish a TCP connection, which, depending on the type of protocols used in a replica hosting system, may be relevant to take into account. Zari et al. [2001] report that Tconn will often be below 200 milliseconds, but that, like in the DNS case, very high values up to even 10 seconds may also occur.

Tres is the time needed to transfer a request from the client to the server and receive the first byte of the response. This metric is comparable to measuring the round-trip time (RTT) between two nodes, but includes the time the server needs to handle the incoming request. Finally, Trest is the time needed to complete the transfer of the entire response.

When considering latency, two different versions are often considered. The end-to-end latency is taken as the time needed to send a request to the server, and is often taken as 0.5Tres, possibly including the time Tconn to set up a connection. The client-perceived latency is defined as T − Trest. This latter latency metric reflects the real delay observed by a user.
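With the decomposition above, both latency variants follow directly from the measured components; the numbers in the following sketch are purely illustrative.

    # Latency decomposition (values in seconds, made up for illustration).
    t_dns, t_conn, t_res, t_rest = 0.120, 0.080, 0.200, 0.350

    total            = t_dns + t_conn + t_res + t_rest   # T
    end_to_end       = 0.5 * t_res                       # optionally + t_conn
    client_perceived = total - t_rest                    # T - T_rest

    print(total, end_to_end, client_perceived)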

Obtaining accurate values for latency metrics is not a trivial task as it may require specialized mechanisms, or even a complete infrastructure. One particular problem is predicting client-perceived latency, which not only involves measuring the round-trip delay between two nodes (which is independent of the size of the response), but may also require measuring bandwidth to determine Trest. The latter has proven to be particularly cumbersome, requiring sophisticated techniques [Lai and Baker 1999]. We discuss the problem of latency measurement further in Section 3.3.

3.1.2. Spatial Metrics. As an alternative to temporal metrics, many systems consider a spatial metric such as the number of network-level hops or hops between autonomous systems, or even the geographical distance between two nodes. In these cases, the underlying assumption is generally that there exists a map of the network in which the spatial metric can be expressed.

Maps can have different levels of accuracy. Some maps depict the Internet as a graph of Autonomous Systems (ASes), thus unifying all machines belonging to the same AS. They are used, for example, by Pierre et al. [2002]. The graph of ASes is relatively simple and easy to operate on. However, because ASes significantly vary in size, this approach can suffer from inaccuracy. Other maps depict the Internet as a graph of routers, thus unifying all machines connected to the same router [Pansiot and Grad 1998]. These maps are more detailed than the AS-based ones, but are not satisfactory predictors of latency. For example, Huffaker et al. [2002] found that the number of router hops is accurate in selecting the closest server in only 60% of the cases. The accuracy of router-level maps can be expected to decrease in the future with the adoption of new routing technologies such as Multi-Protocol Label Switching (MPLS) [Pepelnjak and Guichard 2001], which can hide the routing paths within a network. Finally, some systems use proprietary distance calculation schemes, for example by combining the two above approaches [Rabinovich and Aggarwal 1999].

Huffaker et al. [2002] examined to what extent geographical distance could be used instead of latency metrics. They showed that there is generally a close correlation between geographical distance and RTT. An earlier study using simple network measurement tools by Ballintijn et al. [2000], however, reported only a weak correlation between geographical distance and RTT. This difference may be caused by the fact that many more monitoring points outside the U.S. were used, while many physical connections actually cross through networks located in the U.S. This phenomenon also caused large deviations in the results by Huffaker et al. [2002].

An interesting approach based on geographical distance is followed in the Global Network Positioning (GNP) project [Ng and Zhang 2002]. In this case, the Internet is modeled as an N-dimensional geometric space. GNP is used to estimate the latency between two arbitrary nodes. We describe GNP and its several variants in more detail in Section 3.3 when discussing metric estimation services.
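The essence of this approach can be sketched in a few lines: once hosts have been assigned coordinates in the geometric space (GNP derives them from probes against landmark hosts, a step omitted here), the latency between two arbitrary hosts is predicted by their geometric distance. The coordinates below are made up.

    # GNP-style latency prediction sketch: the geometric distance between
    # host coordinates serves as the latency estimate.
    import math

    def predicted_latency(coord_a, coord_b):
        return math.dist(coord_a, coord_b)

    host_a = (12.0, 3.5, 40.1)   # hypothetical 3-dimensional coordinates
    host_b = (10.2, 9.0, 38.4)
    print(predicted_latency(host_a, host_b))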

Constructing and exploiting a map of the Internet is easier than running an infrastructure for latency measurements. The maps can be derived, for example, from routing tables. Interestingly, Crovella and Carter [1995] reported that the correlation between the distance in terms of hops and the latency is quite poor. However, other studies show that the situation has changed. McManus [1999] shows that the number of hops between ASes can be used as an indicator for client-perceived latency. Research reported in Obraczka and Silva [2000] revealed that the correlation between the number of network-level or AS-level hops and round-trip times (RTTs) has further increased, but that RTT still remains the best choice when a single latency metric is needed for measuring client-perceived performance.

3.1.3. Network Usage Metrics. Another important metric is the total amount of consumed network resources. Such resources could include routers and other network elements, but often only the consumed bandwidth is considered. The total network usage can be classified into two types. Internal usage is caused by the communication between replica servers to keep replicas consistent. External usage is caused by communication between clients and replica servers. Preferably, the ratio between external and internal usage is high, as internal usage can be viewed as a form of overhead introduced merely to keep replicas consistent. On the other hand, overall network usage may decrease in comparison to the nonreplicated case, but determining this may require measuring more than, for example, consumed bandwidth only.

To see the problem at hand, consider a nonreplicated document of size s bytes that is requested r times per second. The total consumed bandwidth in this case is r · s, plus the cost of r separate connections to the server. The cost of a connection can typically be expressed as a combination of setup costs and the average distance that each packet associated with that connection needs to travel. Assume that the distance is measured in the number of hops (which is reasonable when considering network usage). In that case, if lr is the average length of a connection, we can also express the total consumed bandwidth for reading the document as r · s · lr.

On the other hand, suppose the document is updated w times per second and that updates are always immediately pushed to, say, n replicas. If the average path length of a connection from the server to a replica is lw, update costs are w · s · lw. However, the average path length of a connection for reading a document will now be lower in comparison to the nonreplicated case. If we assume that lr = lw, the total network usage may change by a factor w/r in comparison to the nonreplicated case.
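The calculation below instantiates this model with made-up numbers, following the formulas from the text (r · s · lr for reads without replication and w · s · lw for pushing updates); it merely illustrates the w/r relationship when lr = lw.

    # Bandwidth model worked example (all values are illustrative).
    s   = 50_000   # document size in bytes
    r   = 100.0    # reads per second
    w   = 2.0      # updates per second
    l_r = 12.0     # average read path length (hops), nonreplicated case
    l_w = 12.0     # average path length of an update push to a replica

    read_usage_nonreplicated = r * s * l_r   # r * s * l_r
    update_usage             = w * s * l_w   # w * s * l_w (internal usage)

    # With l_r == l_w, the update traffic relates to the original read
    # traffic by the factor w / r.
    print(update_usage / read_usage_nonreplicated, w / r)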

Of course, more precise models should be applied in this case, but the example illustrates that merely measuring consumed bandwidth may not be enough to properly determine network usage. This aspect becomes even more important given that pricing may be an issue for providing hosting services, an aspect that we discuss next.

3.1.4. Financial Metrics. Of a completely different nature are metrics that deal with the economics of content delivery networks. To date, such metrics form a relatively unexplored area, although there is clearly interest in increasing our insight (see, e.g., Janiga et al. [2001]). We need to distinguish at least two different roles. First, the owner of the hosting service is confronted with costs for developing and maintaining the hosting service. In particular, costs will be concerned with server placement, server capacity, and network resources (see, e.g., Chandra et al. [2001]). This calls for metrics aimed at the hosting service provider.

The second role is that of customers of the hosting service. Considering that we are dealing with shared resources that are managed by a service provider, accounting management, by which a precise record of resource consumption is developed, is important for billing customers [Aboba et al. 2000]. However, developing pricing models is not trivial, and it may turn out that simple pricing schemes will dominate the sophisticated ones, even if applying the latter is cheaper for customers [Odlyzko 2001]. For example, Akamai uses peak consumed bandwidth as its pricing metric.

The pricing model for hosting an object can directly affect the control components. For example, a model can mandate that the number of replicas of an object is constrained by the money paid by the object owner. Likewise, there exist various models that help in determining object hosting costs. Examples include a model with a flat base fee and a price linearly increasing along with the number of object replicas, and a model charging for the total number of clients serviced by all the object replicas.
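As formulas, the two example models mentioned above are straightforward; the rates in the sketch below are invented purely for illustration.

    # Two simple object-hosting pricing models (rates are made up).
    def cost_flat_plus_replicas(num_replicas, base_fee=100.0, per_replica=15.0):
        """Flat base fee plus a price increasing linearly with replicas."""
        return base_fee + per_replica * num_replicas

    def cost_per_client_served(clients_served, per_client=0.002):
        """Charge for the total number of clients serviced by all replicas."""
        return per_client * clients_served

    print(cost_flat_plus_replicas(8))          # 220.0
    print(cost_per_client_served(1_500_000))   # 3000.0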

We observe that neither financial metrics for the hosting service provider nor those for consumers have actually been established other than in some ad hoc and specific fashion. We believe that developing such metrics, and notably the models to support them, is one of the more challenging and interesting areas for content delivery networks.

3.1.5. Consistency Metrics. Consistency metrics indicate to what extent the replicas retrieved by the clients are consistent with the replica version that was up-to-date at the moment of retrieval. Many consistency metrics have been proposed and are currently in use, but they are usually quantified along three different axes.

In time-based consistency models, the difference between two replicas A and B is measured as the time between the latest update on A and the one on B. In effect, time-based models measure the staleness of a replica in comparison to another, more recently updated replica. Taking time as a consistency metric is popular in Web-hosting systems as it is easy to implement and independent of the semantics of the replicated object. Because updates are generally performed at only a single primary copy from where they are propagated to secondaries, it is easy to associate a single timestamp with each update and to subsequently measure the staleness of a replica.

In value-based models, it is assumed that each replica has an associated numerical value that represents its current content. Consistency is then measured as the numerical difference between two replicas. This metric requires that the semantics of the replicated object are known, as otherwise it would be impossible to associate and compare object values. An example of where value-based metrics can be applied is a stock-market Web document containing the current values of shares. In such a case, we could define a Web document to be inconsistent if at least one of the displayed shares differs by more than 2% from its most recent value.
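The stock-market example translates directly into a value-based consistency check; the 2% threshold comes from the text, while the function and price data below are our own illustration.

    # Value-based consistency check: a replica is inconsistent if any
    # displayed share price deviates more than 2% from the latest value.
    def is_consistent(replica_prices, latest_prices, tolerance=0.02):
        return all(abs(replica_prices[sym] - latest_prices[sym])
                   <= tolerance * abs(latest_prices[sym])
                   for sym in latest_prices)

    latest  = {"ACME": 104.0, "FOO": 25.00}
    replica = {"ACME": 103.0, "FOO": 26.00}   # FOO deviates by 4%
    print(is_consistent(replica, latest))     # False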

Finally, in order-based models, reads and writes are perceived as transactions and replicas can differ only in the order of execution of write transactions according to certain constraints. These constraints can be defined as the allowed number of out-of-order transactions, but can also be based on the dependencies between transactions, as is commonly the case for distributed shared-memory systems [Mosberger 1993], or client-centric consistency models as introduced in Bayou [Terry et al. 1994].

3.1.6. Metric Classification. Metrics can be classified into two types: static and dynamic. Static metrics are those whose estimates do not vary with time, as opposed to dynamic metrics. Metrics such as the geographical distance are static in nature, whereas metrics such as end-to-end latency, number of router hops, or network usage are dynamic. The estimation of dynamic metrics can be a difficult problem as it must be performed regularly to be accurate. Note that determining how often a metric should be estimated is a problem by itself. For example, Paxson [1997a] found that the time periods over which end-to-end routes persist vary from seconds to days.

Dynamic metrics can be more useful when selecting a replica for a given client as they estimate the current situation. For example, Crovella and Carter [1995] conclude that the use of a dynamic metric instead of a static one is more useful for replica selection as the former can also account for dynamic factors such as network congestion. Static metrics, in turn, are likely to be exploited by replica server placement algorithms as they tend to be more directed toward a global, long-lasting situation than an instantaneous one.

In general, however, any combination of metrics can be used by any control component. For example, the placement algorithms proposed by Radoslavov et al. [2001] and Qiu et al. [2001] use dynamic metrics (end-to-end latency and network usage). Also Dilley et al. [2002] and Rabinovich and Aggarwal [1999] use end-to-end latency as a primary metric for determining the replica location. Finally, the request-routing algorithm described in Szymaniak et al. [2003] exploits network distance measurements. We observe that the existing systems tend to support a small set of metrics, and use all of them in each control component.

3.2. Client Clustering

As we noticed before, some metrics should ideally be measured on a per-client basis. In a wide-area replica hosting system, for which we can expect millions of clients, this poses a scalability problem to the estimation services as well as the components that need to use them. Hence, there is a need for scalable mechanisms for metric estimation.

A popular approach by which scalability is achieved is client clustering, in which clients are grouped into clusters. Metrics are then estimated on a per-cluster basis instead of on a per-client basis. Although this solution allows metrics to be estimated in a scalable manner, the efficiency of the estimation depends on the accuracy of the clustering mechanisms. The underlying assumption here is that the metric value computed for a cluster is representative of values that would be computed for each individual client in that cluster. Accurate clustering schemes are those which keep this assumption valid.

The choice of a clustering scheme depends on the metric it aims to estimate. Below, we present different kinds of clustering schemes that have been proposed in the literature.

3.2.1. Local Name Servers. Each Internet client contacts its local DNS server to resolve a service host name to its IP address(es). The clustering scheme based on local name servers unifies clients contacting the same name server, as they are assumed to be located in the same network-topological region. This is a useful abstraction as DNS-based request-routing schemes are already used in the Internet. However, the success of these schemes relies on the assumption that clients and local name servers are close to each other. Shaikh et al. [2001] performed a study on the proximity of clients and their name servers based on the HTTP logs from several commercial websites. Their study concludes that clients are typically eight or more hops from their representative name servers. The authors also measured the round-trip times both from the name servers to the servers (name-server latency) and from the clients to the servers (client latency). It turns out that the correlation between the name-server latency and the actual client latency is quite poor. They conclude that the latency measurements to the name servers are only a weak approximation of the latency to actual clients. These findings have been confirmed by Mao et al. [2002].

3.2.2. Autonomous Systems. The Internet has been built as a graph of individual network domains, called Autonomous Systems (ASes). The AS clustering scheme groups together clients located in the same AS, as is done, for example, in Pierre et al. [2002]. This scheme naturally matches the AS-based distance metric. Further clustering can be achieved by grouping ASes into a hierarchy, as proposed by Barford et al. [2001], which in turn can be used to place caches.

Although an AS is usually formed out of a set of networks belonging to a single administrative domain, it does not necessarily mean that these networks are proximal to each other. Therefore, estimating latencies with an AS-based clustering scheme can lead to poor results. Furthermore, since ASes are global in scope, multiple ASes may cover the same geographical area. It is often the case that some IP hosts are very close to each other (either in terms of latency or hops) but belong to different ASes, while other IP hosts are very far apart but belong to the same AS. This makes the AS-based clustering schemes not very effective for proximity-based metric estimations.

3.2.3. Client Proxies. In some cases, clients connect to the Internet through proxies, which provide them with services such as Web caching and prefetching. Client proxy-based clustering schemes group together all clients using the same proxy into a single cluster. Proxy-based schemes can be useful to measure latency if the clients are close to their proxy servers. An important problem with this scheme is that many clients in the Internet do not use proxies at all. Thus, this clustering scheme will create many clusters consisting of only a single client, which is inefficient with respect to achieving scalability for metric estimation.

3.2.4. Network-Aware Clustering. Researchers have proposed another scheme for clustering Web clients, which is based on client-network characteristics. Krishnamurthy and Wang [2000] evaluate the effectiveness of a simple mechanism that groups clients having the same first three bytes of their IP addresses into a single cluster. However, this simple mechanism fails in more than 50% of the cases when checking whether grouped clients actually belong to the same network. The authors identify two reasons for failure. First, their scheme wrongly merges small clusters that share the same first three bytes of IP addresses as a single class-C network. Second, it splits several class-A, class-B, and CIDR networks into multiple class-C networks. Therefore, the authors propose a novel method to identify clusters by using the prefix and network mask information extracted from Border Gateway Protocol routing tables [Rekhter and Li 1995]. The proposed mechanism consists of the following steps (see the sketch after this list):

(1) Creating a merged prefix table from routing table snapshots.
(2) Performing longest prefix matching on each client IP address (as routers do) using the constructed prefix table.
(3) Classifying all the clients which have the same longest prefix into a single cluster.
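A minimal sketch of the matching and classification steps is given below, using Python's standard ipaddress module; the prefixes are made-up documentation ranges standing in for entries derived from BGP routing-table snapshots.

    # Network-aware clustering sketch: longest-prefix matching of client
    # IP addresses against a (toy) prefix table.
    import ipaddress

    prefix_table = [ipaddress.ip_network(p) for p in
                    ("192.0.2.0/24", "192.0.2.128/25", "203.0.113.0/24")]

    def longest_prefix(client_ip):
        addr = ipaddress.ip_address(client_ip)
        matches = [p for p in prefix_table if addr in p]
        return max(matches, key=lambda p: p.prefixlen, default=None)

    clusters = {}
    for ip in ("192.0.2.20", "192.0.2.200", "203.0.113.9"):
        clusters.setdefault(longest_prefix(ip), []).append(ip)
    print(clusters)   # 192.0.2.200 falls in the more specific /25 cluster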

The authors demonstrate the effectiveness of their approach by showing a success rate of 99.99% in their validation tests.
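
To illustrate, the following is a minimal sketch of such prefix-based clustering in Python, assuming the merged prefix table has already been extracted from BGP routing table snapshots into a plain list of CIDR strings; the function and variable names are ours and not part of the original mechanism.

import ipaddress
from collections import defaultdict

def cluster_clients(client_ips, prefix_table):
    """Group client IPs by their longest matching BGP prefix."""
    prefixes = [ipaddress.ip_network(p) for p in prefix_table]
    clusters = defaultdict(list)
    for ip_str in client_ips:
        ip = ipaddress.ip_address(ip_str)
        # Longest prefix match, as a router would do.
        matches = [p for p in prefixes if ip in p]
        if matches:
            best = max(matches, key=lambda p: p.prefixlen)
            clusters[best].append(ip_str)
        else:
            clusters[None].append(ip_str)  # no covering prefix found
    return clusters

# Example: clients fall into clusters derived from two BGP prefixes.
table = ["130.37.0.0/16", "192.0.2.0/24"]
print(cluster_clients(["130.37.20.20", "130.37.24.8", "192.0.2.7"], table))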

3.2.5. Hierarchical Clustering. Most clustering schemes aim at achieving a scalable manner of metric estimation. However, if the clusters are too coarse grained, the accuracy of measurement decreases, simply because the underlying assumption that the difference between the metric estimated to a cluster and to a client is negligible no longer holds. Hierarchical clustering schemes help in estimating metrics at different levels (such as intracluster and intercluster), thereby aiming at improving the accuracy of measurement, as in IDMaps [Francis et al. 2001] and Radar [Rabinovich and Aggarwal 1999]. Performing metric estimations at different levels results not only in better accuracy, but also in better scalability.

Note that there can be other possible schemes of client clustering, based not only on the clients' network addresses or their geographical proximity, but also on their content interests (see, e.g., Xiao and Zhang [2001]). However, such clustering is not primarily related to improving scalability through replication, for which reason we further exclude it from our study.

3.3. Metric Estimation Services

Once the clients are grouped into their respective clusters, the next step is to obtain the values for metrics (such as latency or network overhead). Estimation of metrics on a wide-area scale such as the Internet is not a trivial task and has been addressed by several research initiatives before [Francis et al. 2001; Moore et al. 1996]. In this section, we discuss the challenges involved in obtaining the values for these metrics.

Metric estimation services are responsible for providing values for the various metrics required by the system. These services aid the control components in taking their decisions. For example, these services can provide replica placement algorithms with a map of the Internet. Also, metric estimation services can use client-clustering schemes to achieve scalability.

Metric estimation schemes can be divided into two groups: active and passive schemes. Active schemes obtain the respective metric data by simulating clients and measuring the performance observed by these simulated clients. Active schemes are usually highly accurate, but these simulations introduce additional load to the replica hosting system. Examples of active mechanisms are Cprobes [Carter and Crovella 1997] and Packet Bunch Mode [Paxson 1997b]. Passive mechanisms obtain the metric data from observations of existing system behavior. Passive schemes do not introduce additional load to the network, but deriving the metric data from past events can suffer from poor accuracy. Examples of passive mechanisms include SPAND [Stemm et al. 2000] and EtE [Fu et al. 2002].

Different metrics are by nature estimated in different manners. For example, metric estimation services are commonly used to measure client latency or network distance. The consistency-related metrics are not measured by a separate metric estimation service, but are usually measured by instrumenting client applications. In this section, our discussion of existing research efforts mainly covers services that estimate network-related metrics.

3.3.1. IDMaps. IDMaps is an active service that aims at providing an architecture for measuring and disseminating distance information across the Internet [Francis et al. 1999, 2001]. IDMaps uses programs called tracers that collect and advertise distance information as so-called distance maps. IDMaps builds its own client-clustering scheme. It groups different geographical regions as boxes and constructs distance maps between these boxes. The number of boxes in the Internet is relatively small (in the order of thousands). Therefore, building a distance table between these boxes is inexpensive. To measure client-server distance, an IDMaps client must calculate the distance to its own box and the distance from the target server to this server's box. Given these two calculations, and the distance between the boxes calculated based on distance maps, the client can discover its real distance to the server. It must be noted that the efficiency of IDMaps heavily depends on the size and placement of boxes.
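
The following is a minimal sketch of this box-based distance composition, with illustrative data structures of our own (IDMaps itself does not expose such an interface):

# Distance maps between boxes, plus each host's distance to its own box,
# are assumed to be available as plain dictionaries (hypothetical data).
box_of = {"client": "box_A", "server": "box_B"}
dist_to_own_box = {"client": 5.0, "server": 3.0}   # milliseconds
box_distance = {("box_A", "box_B"): 40.0}           # taken from the distance maps

def estimate_distance(host1, host2):
    """Approximate host-to-host distance as the sum of three segments."""
    b1, b2 = box_of[host1], box_of[host2]
    between = 0.0 if b1 == b2 else box_distance[(b1, b2)]
    return dist_to_own_box[host1] + between + dist_to_own_box[host2]

print(estimate_distance("client", "server"))  # 5 + 40 + 3 = 48 ms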

3.3.2. King. King is an active metric estimation method [Gummadi et al. 2002]. It exploits the global infrastructure of DNS servers to measure the latency between two arbitrary hosts. King approximates the latency between two hosts, H1 and H2, with the latency between their local DNS servers, S1 and S2.

Assume that a host X needs to calculate the latency between hosts H1 and H2. The latency between their local DNS servers, $L_{S_1S_2}$, is calculated based on the round-trip times (RTTs) of two DNS queries. With the first query, host X queries the DNS server S1 about some non-existing DNS name that belongs to a domain hosted by S1 (see Figure 4(a)). In this way, X discovers its latency to S1:

$L_{XS_1} = \frac{1}{2} RTT_1.$

By querying about a nonexisting name, X ensures that the response is retrieved from S1, as no cached copy of that response can be found anywhere in the DNS.

With the second query, host X queries the DNS server S1 about another nonexisting DNS name that this time belongs to a domain hosted by S2 (see Figure 4(b)). In this way, X measures the latency of its route to S2 that goes through S1:

$L_{XS_2} = \frac{1}{2} RTT_2.$

A crucial observation is that this latency is a sum of two partial latencies, one between X and S1, and the other between S1 and S2: $L_{XS_2} = L_{XS_1} + L_{S_1S_2}$. Since $L_{XS_1}$ has been measured by the first DNS query, X may subtract it from the total latency $L_{XS_2}$ to determine the latency between the DNS servers:

$L_{S_1S_2} = L_{XS_2} - L_{XS_1} = \frac{1}{2} RTT_2 - \frac{1}{2} RTT_1.$

Fig. 4. The two DNS queries of King.

Note that S1 will forward the second query to S2 only if S1 is configured to accept so-called "recursive" queries from X [Mockapetris 1987b].

In essence, King is actively probing with DNS queries. A potential problem with this approach is that extensive use of King may result in overloading the global infrastructure of DNS servers. In that case, the efficiency of DNS is likely to decrease, which can degrade the performance of the entire Internet. Also, the DNS specification recommends rejecting recursive DNS queries that come from nonlocal clients, which renders many DNS servers unusable for King [Mockapetris 1987a].
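
The measurement logic itself is compact. The following sketch captures it, assuming a caller-supplied helper measure_rtt(dns_server, qname) that issues one recursive DNS query and returns its round-trip time; the helper and all names are hypothetical, not part of King:

import uuid

def king_latency(measure_rtt, s1, domain_of_s1, domain_of_s2):
    """Estimate the latency between name servers S1 and S2, King-style.

    measure_rtt(dns_server, qname) must send one recursive DNS query for
    qname to dns_server and return the observed round-trip time (any DNS
    library can be plugged in here)."""
    # Random labels guarantee cache misses, so answers really come from S1/S2.
    q1 = f"{uuid.uuid4().hex}.{domain_of_s1}"   # resolvable only by S1 itself
    q2 = f"{uuid.uuid4().hex}.{domain_of_s2}"   # forces S1 to forward to S2
    l_x_s1 = 0.5 * measure_rtt(s1, q1)          # L_XS1 = RTT1 / 2
    l_x_s2 = 0.5 * measure_rtt(s1, q2)          # L_XS2 = RTT2 / 2 (path via S1)
    return l_x_s2 - l_x_s1                      # L_S1S2 = L_XS2 - L_XS1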

Fig. 5. Positioning in GNP.

3.3.3. Network Positioning. The idea of network positioning has been proposed in Ng and Zhang [2002], where it is called Global Network Positioning (GNP). GNP is a novel approach to the problem of network distance estimation, where the Internet is modeled as an N-dimensional geometric space [Ng and Zhang 2002]. GNP approximates the latency between any two hosts as the Euclidean distance between their corresponding positions in the geometric space.

GNP relies on the assumption that latencies can be triangulated in the Internet. The position of any host X is computed based on the measured latencies between X and k landmark hosts, whose positions have been computed earlier (k ≥ N + 1, to ensure that the calculated position is unique). By treating these latencies as distances, GNP triangulates the position of X (see Figure 5). The triangulation is implemented by means of Simplex-downhill, which is a classical optimization method for multidimensional functions [Nelder and Mead 1965].
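
As an illustration, the following sketch positions a single host given landmark coordinates and measured latencies, using SciPy's Nelder-Mead implementation as a stand-in for the Simplex-downhill procedure mentioned above; the coordinates and latencies are invented for the example:

import numpy as np
from scipy.optimize import minimize

def position_host(landmark_coords, measured_latencies, dim=2):
    """Place a host in the geometric space so that its Euclidean distances
    to the landmarks match the measured latencies as closely as possible."""
    landmarks = np.asarray(landmark_coords, dtype=float)
    latencies = np.asarray(measured_latencies, dtype=float)

    def error(x):
        # Sum of squared differences between modeled and measured distances.
        dists = np.linalg.norm(landmarks - x, axis=1)
        return np.sum((dists - latencies) ** 2)

    result = minimize(error, x0=np.zeros(dim), method="Nelder-Mead")
    return result.x

# Three landmarks (k >= N + 1 for N = 2) with example latencies in milliseconds.
coords = [[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]]
print(position_host(coords, [70.0, 75.0, 45.0]))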

The most important limitation of GNP is that the set of landmarks can never change. If any of them becomes unavailable, the latency to that landmark cannot be measured and GNP is no longer able to position any more hosts. It makes GNP sensitive to landmark failures.

Fig. 6. Positioning in Lighthouses.

This limitation is removed in the Lighthouses system [Pias et al. 2003]. The authors have shown that hosts can be accurately positioned relative to any previously positioned hosts, acting as "local" landmarks. This eliminates the need for contacting the original landmarks each time a host is positioned (see Figure 6). It also makes it possible to improve the positioning accuracy, by selecting some of the local landmarks close to the positioned host [Castro et al. 2003a].

SCoLE further improves the scalability of the system by allowing each host to select its own positioning parameters, construct its own "private" space, and position other hosts in that space [Szymaniak et al. 2004]. This effectively removes the necessity of a global negotiation to determine positioning parameters, such as the space dimension, the selection of global landmarks, and the positioning algorithm. Such an agreement is difficult to reach in large-scale systems, where different hosts can have different requirements with respect to the latency estimation process. Latency estimates performed in different private spaces have been shown to be highly correlated, even though these spaces have completely different parameters.

Another approach is to position all hosts simultaneously as a result of a global optimization process [Cox et al. 2004; Shavitt and Tankel 2003; Waldvogel and Rinaldi 2003]. In this case, there is no need to choose landmarks, since every host is in fact considered to be a landmark. The global optimization approach is generally faster than its iterative counterpart, which positions hosts one by one. The authors also claim that it leads to better accuracy, and that it is easy to implement in a completely distributed fashion. However, because it operates on all the latencies simultaneously, it potentially has to be rerun every time new latency measurements become available. Such a rerun is likely to be computationally expensive in large-scale systems, where the number of performed latency measurements is high.

3.3.4. SPAND. SPAND is a shared passive network performance measurement service [Stemm et al. 2000]. This service aims at providing network-related measures such as client end-to-end latency, available bandwidth, or even application-specific performance details such as the access time for a Web object. The components of SPAND are client applications that can log their performance details, a packet-capturing host that logs performance details for SPAND-unaware clients, and performance servers that process the logs sent by the above two components. The performance servers can reply to queries concerning various network-related and application-specific metrics. SPAND has the advantage of being able to produce accurate application-specific metrics if there are several clients using that application in the same shared network. Further, since it employs passive measurement, it does not introduce any additional traffic.

3.3.5. Network Weather Service. The Network Weather Service (NWS) is an active measurement service [Wolski et al. 1999]. It is primarily used in Grid computing, where decisions regarding the scheduling of distributed computations are made based on the knowledge of server loads and several network performance metrics, such as available bandwidth and end-to-end latency. Apart from measuring these metrics, it also employs prediction mechanisms to forecast their values based on past events. In NWS, the metrics are measured using special sensor processes, deployed on every potential server node. Further, to measure end-to-end latency, active probes are sent between these sensors. NWS uses an adaptive forecasting approach, in which the service dynamically identifies the model that gives the least prediction error. NWS has also been used for replica selection [McCune and Andresen 1998]. However, exploiting NWS directly by a wide-area replica hosting system can be difficult, as this service does not scale to the level of the Internet. This is due to the fact that it runs sensors on every node and does not use any explicit client clustering schemes. On the other hand, when combined with a good client clustering scheme and careful sensor placement, NWS may become a useful metric estimation service.
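
The adaptive forecasting idea can be summarized in a few lines. The sketch below selects, at each step, whichever simple predictor has accumulated the least error on past measurements; the predictors shown are illustrative and not NWS's actual forecasting models:

def adaptive_forecast(history):
    """Pick the forecaster with the lowest cumulative error on past samples
    and use it to predict the next measurement."""
    forecasters = {
        "last_value": lambda xs: xs[-1],
        "mean":       lambda xs: sum(xs) / len(xs),
        "sliding_5":  lambda xs: sum(xs[-5:]) / len(xs[-5:]),
    }
    errors = {name: 0.0 for name in forecasters}
    # Replay history: at step i, predict sample i from samples [0, i).
    for i in range(1, len(history)):
        past, actual = history[:i], history[i]
        for name, f in forecasters.items():
            errors[name] += abs(f(past) - actual)
    best = min(errors, key=errors.get)
    return best, forecasters[best](history)

latencies = [42.0, 44.0, 41.0, 90.0, 43.0, 42.5]   # ms, example measurements
print(adaptive_forecast(latencies))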

3.3.6. Akamai Metric Estimation. Commercial replica hosting systems often use their own monitoring or metric estimation services. Akamai has built its own distributed monitoring service to monitor server resources, network-related metrics, and overall system performance. The monitoring system places monitors in every replica server to measure server resources. The monitoring system simulates clients to determine if the overall system performance is in an acceptable state as perceived by clients. It measures network-related information by employing agents that communicate with border routers in the Internet as peers and derive the distance-related metrics to be used for its placement decisions [Dilley et al. 2002].

3.3.7. Other Systems. In addition to the above wide-area metric estimation systems, there are different classes of systems that measure service-related metrics such as content popularity, client-aborted transfers, and the amount of consumed bandwidth. These kinds of systems perform estimation on a smaller scale, and mostly measure metrics as seen by a single server.

Web page instrumentation and associated code (e.g., in Javascript) is being used in various commercial tools for measuring service-related metrics. In these schemes, instrumentation code is downloaded by the client browser, after which it tracks the download time for individual objects and reports performance characteristics to the website.

EtE is a passive system used for measuring metrics such as access latency and content popularity for the contents hosted by a server [Fu et al. 2002]. This is done by running a special module near the analyzed server that monitors all the service-related traffic. It is capable of determining sources of delay (distinguishing between network and server delays), content popularity, client-aborted transfers, and the impact of remote caching on the server performance.

3.4. Discussion

In this section, we discussed three issues related to metric estimation: metric selection, client clustering, and metric estimation services.

Metric selection deals with deciding which metrics are important to evaluate system performance. In most cases, optimizing latency is considered to be most important, but the problem is that actually measuring latencies can be quite difficult. For this reason, simpler spatial metrics are used under the assumption that, for example, a low number of network-level hops between two nodes also implies a relatively low latency between those two nodes. Spatial metrics are easier to measure, but turn out to be fairly inaccurate as estimators for latency.

An alternative metric is consumed bandwidth, which is also used to measure the efficiency of a system. However, in order to measure the efficiency of a consistency protocol as expressed by the ratio between the bandwidth consumed for replica updates and the bandwidth delivered to clients, some distance metric needs to be taken into account as well. When it comes to consistency metrics, three different types need to be considered: those related to time, value, and the ordering of operations. This differentiation appears to be fairly complete, leaving the actual implementation of consistency models and the enforcement of consistency as the main problem to solve.

An interesting observation is that hardly any systems today use financial metrics to evaluate system performance. Designing and applying such metrics is not obvious, but extremely important in the context of system evaluation.

A scalability problem that these metrics introduce is that they, in theory, require measurements on a per-client basis. With millions of potential clients, such measurements are impossible. Therefore, it is necessary to resort to client-clustering schemes. The goal is to design a cluster such that measurement at a single client is representative for any client in that cluster (in a mathematical sense, the clients should form an equivalence class). Finding the right metric for clustering, and one that can also be easily established, has proven to be difficult. However, network-aware clustering, by which a prefix of the network address is taken as the criterion for clustering, has led to very accurate results.

Once a metric has been chosen, its value needs to be determined. This is where metric estimation services come into play. Various services exist, including some very recent ones that can handle difficult problems such as estimating the latency between two arbitrary nodes in the Internet. Again, metric estimation services can turn out to be rather complicated, partly due to the lack of sufficient acquisition points in the Internet. In this sense, the approach of modeling the Internet as a Euclidean N-dimensional space is very powerful, as it allows local computations concerning remote nodes. However, this approach can be applied only where the modeled metric can be triangulated, making it more difficult to apply when measuring, for example, bandwidth.

4. ADAPTATION TRIGGERING

The performance of a replica hosting system changes with variations in uncontrollable system parameters such as client access patterns and network conditions. These changes make the current value of the system's λ drift away from the optimal value λ∗, and fall out of the acceptable interval. The system needs to maintain a desired level of performance by keeping λ in an acceptable range amidst these changes. The adaptation triggering component of the system is responsible for identifying changes in the system and for adapting the system configuration to bound λ within the acceptable range. This adaptation consists of a combination of changes in replica placement, consistency policy, and request routing policy.

We classify adaptation triggering components along two criteria. First, they can be classified based on their timing nature. Second, they can be classified based on which element of the system actually performs the triggering.

4.1. Time-Based Classification

Taking timing into account, we distinguish three different triggering mechanisms: periodic triggers, aperiodic triggers, and triggers that combine these two.

4.1.1. Periodic Triggers. A periodic triggering component analyzes a number of input variables, or λ itself, at fixed time intervals. If the analysis reveals that λ is too far from λ∗, the system triggers the adaptation. Otherwise, it allows the system to continue with the same configuration. Such a periodic evaluation scheme can be effective for systems that have relatively stable uncontrollable parameters. However, if the uncontrollable parameters fluctuate a lot, then it may become very hard to determine a good evaluation periodicity. A period that is too short will lead to considerable adaptation overhead, whereas a period that is too long will result in slow reactions to important changes.

4.1.2. Aperiodic Triggers. Aperiodic triggers can trigger adaptation at any time. A trigger is usually due to an event indicating a possible drift of λ from the acceptable interval. Such events are often defined as changes of the uncontrollable parameters, such as the client request rates or end-to-end latency, which may reflect issues that the system has to deal with.

The primary advantage of aperiodic triggers is their responsiveness to emergency situations, such as flash crowds, where the system must be adapted quickly. However, they require continuous monitoring of metrics that can indicate events in the system, such as server load or client request rate.

4.1.3. Hybrid Triggers. Periodic and aperiodic triggers have opposite qualities and drawbacks. Periodic triggers are well suited for detecting slow changes in the system that aperiodic triggers may not detect, whereas aperiodic triggers are well suited to detect emergency situations where immediate action is required. Consequently, a good approach may be a combination of periodic and aperiodic triggering schemes. For example, Radar and Globule use both periodic and aperiodic triggers, which give them the ability to perform global optimizations and to react to emergency situations.
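
A minimal sketch of such a hybrid scheme is given below, combining a fixed evaluation period with an aperiodic load threshold; the threshold values, function names, and callbacks are illustrative only:

import time

EVAL_PERIOD = 3600.0       # seconds between periodic evaluations (illustrative)
LOAD_THRESHOLD = 0.85      # server load that triggers immediate adaptation

def run_triggering(get_load, evaluate_lambda, adapt):
    """Loop forever, firing adaptation periodically or on a load emergency."""
    last_eval = time.monotonic()
    while True:
        # Aperiodic part: react immediately to an overload (e.g., a flash crowd).
        if get_load() > LOAD_THRESHOLD:
            adapt(reason="overload")
        # Periodic part: global re-evaluation of lambda at fixed intervals.
        if time.monotonic() - last_eval >= EVAL_PERIOD:
            if not evaluate_lambda():          # lambda drifted out of range
                adapt(reason="periodic drift")
            last_eval = time.monotonic()
        time.sleep(1.0)                        # monitoring granularity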

In Radar [Rabinovich and Aggarwal 1999], every replica server periodically runs a replication algorithm that checks the number of client accesses to a particular replica and the server load. An object is deleted when client accesses are low, and a migration/replication component is invoked if the server load is above a threshold. In addition, a replica server can detect that it is overloaded and ask its replication-managing component to offload it. Adaptation in this case consists either of distributing the load over other replica servers, or of propagating the request to another replication-managing component in case there are not enough replica servers available.

In Globule [Pierre and van Steen 2001], each primary server periodically evaluates recent client access logs. The need for adapting the replication and consistency policies is determined by this evaluation. Similarly to Radar, each replica server also monitors its request rate and response times. When a server is overloaded, it can request its primary server to reevaluate the replication strategy.

4.2. Source-Based Classification

Adaptation triggering mechanisms also vary with respect to which part of the system actually performs the triggering. We describe three different kinds of mechanisms.

4.2.1. Server-Triggered Adaptation. Server-triggered adaptation schemes consider that replica servers are in a good position to evaluate metrics from the system. Therefore, the decision that adaptation is required is taken by one or more replica servers.

Radar and Globule use server-triggered adaptation, as they make replica servers monitor and possibly react to system conditions.

Server-triggered adaptation is also well suited for reacting to internal server conditions, such as overloads resulting from flash crowds or denial-of-service (DoS) attacks. For example, Jung et al. [2002] studied the characteristics of flash crowds and DoS attacks. They propose an adaptation scheme where servers can differentiate between these two kinds of events and react accordingly: increase the number of replicas in case of a flash crowd, or invoke security mechanisms in case of a DoS attack.

Server-triggered adaptation is effective as the servers are in a good position to determine the need for changes in their strategies in view of other constraints, such as total system cost. Also, these mechanisms do not require running triggering components on elements (hosts, routers) that may not be under the control of the replica hosting system.

4.2.2. Client-Triggered Adaptation. Adaptation can be triggered by the clients. In client-triggered schemes, clients or client representatives can notice that they experience poor quality of service and request the system to take the appropriate measures. Sayal et al. [2003] describe such a system, where smart clients provide the servers with feedback information to help take replication decisions.

Client-triggered adaptation can be efficient in terms of preserving a client's QoS. However, it has three important drawbacks. First, the clients or client representatives must cooperate with the replica hosting system. Second, client transparency is lost, as clients or their representatives need to monitor events and take explicit action. Third, by relying on individual clients to trigger adaptation, this scheme may suffer from poor scalability in a wide-area network, unless efficient client clustering methods are used.

4.2.3. Router-Triggered Adaptation. In router-triggered schemes, adaptation is initiated by the routers, which can inform the system of network congestion, link and network failures, or degraded end-to-end request latencies. These schemes observe network-related metrics and operate on them.

Such an adaptation scheme is used in SPREAD [Rodriguez and Sibal 2000]. In SPREAD, every network has one special router with a distinguished proxy attached to it. If the router notices a TCP communication from a client to retrieve data from a primary server, it intercepts this communication and redirects the request to the proxy attached to it. The proxy gets a copy of the referenced object from the primary server and services this client and all future requests passing through its network. By using the network layer to implement replication, this scheme builds an architecture that is transparent to the client.

Router-triggered schemes have the advantage that routers are in a good position to observe network-related metrics, such as end-to-end latency and consumed bandwidth, while preserving client transparency. Such schemes are useful to detect network congestion or dead links, and thus may trigger changes in replica location. However, they suffer from two disadvantages. First, they require the support of routers, which may not be available to every enterprise building a replica hosting system in the Internet. Second, they introduce an overhead to the network infrastructure, as they need to isolate the traffic targeting Web hosting systems, which involves processing all packets received by the routers.

4.3. Discussion

Deciding when to trigger system adaptation is difficult because explicitly computing λ and λ∗ may be expensive, if not impossible. This calls for schemes that are both responsive enough to detect the drift of λ from the acceptable interval and computationally inexpensive. This is usually realized by monitoring simple metrics which are believed to significantly influence λ.

Another difficulty is posed by the fact that it is not obvious which adaptive components should be triggered. Depending on the origin of the performance drift, the optimal adaptation may be any combination of changes in replica placement, request routing, or consistency policies.

5. REPLICA PLACEMENT

The task of replica placement algorithms is to find good locations to host replicas. As noted earlier, replica placement forms a controllable input parameter of the objective function. Changes in uncontrollable parameters, such as client request rates or client latencies, may warrant changing the replica locations. In such a case, the adaptation triggering component triggers the replica placement algorithms, which subsequently adapt the current placement to the new conditions.

The problem of replica placement consists of two subproblems: replica server placement and replica content placement.

Replica server placement is the problem of finding suitable locations for replica servers. Replica content placement is the problem of selecting the replica servers that should host replicas of an object. Both these placements can be adjusted by the system to optimize the objective function value λ.

There are some fundamental differences between the server and content placement problems. Server placement concerns the selection of locations that are good for hosting replicas of many objects, whereas content placement deals with the selection of locations that are good for replicas of a single object. Furthermore, these two problems differ in how often and by whom their respective solutions need to be applied. The server placement algorithms are used on a larger time scale than the content placement algorithms. They are usually used by the system operator during installation of the server infrastructure or while upgrading the hosting infrastructure, and run once every few months. The content placement algorithms are run more often, as they need to react to possibly rapidly changing situations such as flash crowds.

We note that Karlsson et al. [2002] present a framework for evaluating replica placement algorithms for content delivery networks, and also in other fields such as distributed file systems and databases. Their framework can be used to classify and qualitatively compare the performance of various algorithms using a generic set of primitives covering problem definition and heuristics. They also provide an analytical model to predict the decision times of each algorithm. Their framework is useful for evaluating the relative performance of different replica placement algorithms, and as such, complements the material discussed in this section.

5.1. Replica Server Placement

The problem of replica server placement is to select K servers out of N potential sites such that the objective function is optimized for a given network topology, client population, and access patterns. The objective function used by the server placement algorithms operates on some of the metrics defined in Section 3. These metrics may include, for example, client latencies for the objects hosted by the system, and the financial cost of the server infrastructure.

The problem of determining the number and locations of replica servers, given a network topology, can be modeled as the center placement problem. Two variants used for modeling it are the facility location problem and the minimum K-median problem. Both these problems are NP-hard. They are defined in Shmoys et al. [1997] and Qiu et al. [2001], and we describe them here again for the sake of completeness.

Facility Location Problem. Given a set of candidate server locations i in which the replica servers ("facilities") may be installed, running a server in location i incurs a cost of $f_i$. Each client j must be assigned to one replica server, incurring a cost of $d_j c_{ij}$, where $d_j$ denotes the demand of node j, and $c_{ij}$ denotes the distance between i and j. The objective is to find the number and locations of replica servers that minimize the overall cost.

Minimum K-Median Problem. Given N candidate server locations, we must select K of them (called "centers"), and then assign each client j to its closest center. A client j assigned to a center i incurs a cost of $d_j c_{ij}$. The goal is to select the K centers so that the overall cost is minimal.

The difference between the minimum K-median problem and the facility location problem is that the former associates no cost with opening a center (unlike a facility, which has an operating cost of $f_i$). Furthermore, in the minimum K-median problem, the number of servers is bounded by K.
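
For reference, both problems can be written compactly as integer programs in the notation above, where $y_i$ indicates whether a server is opened at location i and $x_{ij}$ indicates that client j is assigned to location i (these indicator variables are our own notational addition):

\begin{align*}
\text{Facility location:}\quad & \min_{x,y}\ \sum_i f_i\, y_i + \sum_{i,j} d_j\, c_{ij}\, x_{ij}\\
\text{Minimum $K$-median:}\quad & \min_{x,y}\ \sum_{i,j} d_j\, c_{ij}\, x_{ij}
  \quad\text{subject to}\quad \sum_i y_i \le K\\
\text{both with}\quad & \sum_i x_{ij} = 1\ \ \forall j,\qquad
  x_{ij} \le y_i,\qquad x_{ij},\, y_i \in \{0,1\}.
\end{align*}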

Some initial work on the problem of replica server placement has been done in da Cunha [1997]. However, the problem has otherwise been seldom addressed by the research community and only few solutions have been proposed.

Li et al. [1999] propose a placement algorithm based on the assumption that the underlying network topology is a tree, and solve it using dynamic programming techniques. The algorithm is designed for Web proxy placement but is also relevant to server placement. The algorithm works by dividing a tree T into smaller subtrees $T_i$; the authors show that the best way to place t proxies is by placing $t_i$ proxies in each $T_i$ such that $\sum_i t_i = t$. The algorithm is shown to be optimal if the underlying network topology is a tree. However, this algorithm has the following limitations: (i) it cannot be applied to a wide-area network such as the Internet, whose topology is not a tree, and (ii) it has a high computational complexity of $O(N^3 K^2)$, where K is the number of proxies and N is the number of candidate locations. We note that the first limitation of this algorithm stems from its assumption that there is a single origin server for which hosting servers must be found, which allows the construction of a tree topology with this origin server as root. However, a typical Web replica hosting system will host documents from multiple origins, invalidating this assumption. This kind of problem formulation is more relevant for content placement, where every document has a single origin Web server.

Qiu et al. [2001] model the replica placement problem as a minimum K-median problem and propose a greedy algorithm. In each iteration, the algorithm selects the one server that offers the least cost, where cost is defined as the average distance between the server and its clients. In the i-th iteration, the algorithm evaluates the cost of hosting a replica at each of the remaining N − i + 1 potential sites in the presence of the already selected i − 1 servers. The computational cost of the algorithm is $O(N^2 K)$. The authors also present a hot-spot algorithm, in which the replicas are placed close to the clients generating the most requests. The computational complexity of the hot-spot algorithm is $N^2 + \min(N \log N, NK)$. The authors evaluate the performance of these two algorithms and compare each with the algorithm proposed in Li et al. [1999]. Their analysis shows that the greedy algorithm performs better than the other two algorithms and that its performance is only 1.1 to 1.5 times worse than the optimal solution. The authors note that placement algorithms need to incorporate client topology information and access pattern information, such as client end-to-end distances and request rates.
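
A minimal sketch of such a greedy placement is shown below, assuming a precomputed client-to-site distance matrix and per-client demands (request rates); it follows the strategy described above rather than any published code:

def greedy_placement(distance, demand, candidates, k):
    """Pick k replica sites one at a time, each time adding the candidate that
    minimizes the total demand-weighted distance from clients to their
    closest already-chosen site.

    distance[c][s]: distance from client c to candidate site s
    demand[c]:      request rate of client c
    """
    chosen = []
    for _ in range(k):
        best_site, best_cost = None, float("inf")
        for site in candidates:
            if site in chosen:
                continue
            trial = chosen + [site]
            cost = sum(demand[c] * min(distance[c][s] for s in trial)
                       for c in distance)
            if cost < best_cost:
                best_site, best_cost = site, cost
        chosen.append(best_site)
    return chosen

# Example: three clients, three candidate sites, two replicas to place.
dist = {"c1": {"s1": 1, "s2": 9, "s3": 5},
        "c2": {"s1": 8, "s2": 2, "s3": 5},
        "c3": {"s1": 9, "s2": 9, "s3": 1}}
dem = {"c1": 10, "c2": 20, "c3": 5}
print(greedy_placement(dist, dem, ["s1", "s2", "s3"], k=2))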

Radoslavov et al. [2001] propose two replica server placement algorithms that do not require knowledge of the client locations, but decide on replica locations based on the network topology alone. The proposed algorithms are max router fanout and max AS/max router fanout. The first algorithm selects the servers closest to the routers having the maximum fanout in the network. The second algorithm first selects the Autonomous System (AS) with the highest fanout, and then selects a server within that AS that is closest to the router having the maximum fanout. Performance studies show that the second algorithm performs only 1.1 to 1.2 times worse than the greedy algorithm proposed in Qiu et al. [2001]. Based on this, the authors argue that knowledge of client locations is not essential. However, it must be noted that these topology-aware algorithms assume that the clients are spread uniformly throughout the network, which may not be true. If clients are not spread uniformly throughout the network, then the algorithms can select replica servers that are close to routers with the highest fanout but distant from most of the clients, resulting in poor client-perceived performance.

5.2. Replica Content Placement

The problem of replica content placement consists of two sub-problems: content placement and replica creation. The first problem concerns the selection of a set of replica servers that must hold the replica of a given object. The second problem concerns the selection of a mechanism to inform a replica server about the creation of new replicas.

5.2.1. Content Placement. The content placement problem consists of selecting K out of N replica servers to host replicas of an object, such that the objective function is optimized under a given client access pattern and replica update pattern. The content placement algorithms select replica servers in an effort to improve the quality of service provided to the clients and to minimize the object hosting cost.

Similarly to server placement, the content placement problem can be modeled as a facility location problem. However, such solutions can be computationally expensive, making them difficult to apply here, as the content placement algorithms are run far more often than their server-related counterparts. Therefore, existing replica hosting systems exploit simpler solutions.

In Radar [Rabinovich and Aggarwal 1999], every host runs the replica placement algorithm, which defines two client request rate thresholds: $R_{rep}$ for replication, and $R_{del}$ for object deletion, where $R_{del} < R_{rep}$. A document is deleted if its client request rate drops below $R_{del}$. The document is replicated if its client request rate exceeds $R_{rep}$. For request rates falling between $R_{del}$ and $R_{rep}$, documents are migrated to a replica server located closer to the clients that issue more than half of the requests. The distance is calculated using a Radar-specific metric called preference paths. These preference paths are computed by the servers based on information periodically extracted from routers.
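
The threshold logic can be sketched as follows; the threshold values are illustrative, and the "preferred server" stands in for the outcome of Radar's preference-path computation, which is not reproduced here:

R_DEL = 5      # requests per interval below which a replica is deleted (illustrative)
R_REP = 50     # requests per interval above which a new replica is created

def placement_decision(request_rate, preferred_server, current_server):
    """Return the action a Radar-style placement algorithm would take for one document."""
    if request_rate < R_DEL:
        return "delete replica"
    if request_rate > R_REP:
        return "replicate to additional servers"
    # In between: migrate towards the server closest to the majority of clients.
    if preferred_server != current_server:
        return f"migrate to {preferred_server}"
    return "keep as is"

print(placement_decision(30, "server-paris", "server-tokyo"))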

In SPREAD, the replica servers periodically calculate the expected number of requests for an object. Servers decide to create a local replica if the number of requests exceeds a certain threshold [Rodriguez and Sibal 2000]. These servers remove a replica if the popularity of the object decreases. If required, the total number of replicas of an object can be restricted by the object owner.

Chen et al. [2002a] propose a dynamic replica placement algorithm for scalable content delivery. This algorithm uses a dissemination-tree-based infrastructure for content delivery and a peer-to-peer location service provided by Tapestry for locating objects [Zhao et al. 2004]. The algorithm works as follows. It first organizes the replica servers holding replicas of the same object into a load-balanced tree. Then, it starts receiving client requests, which target the origin server and contain certain latency constraints. The origin server services the client if the server's capacity constraints and the client's latency constraints are met. If either of these conditions fails, it searches for another server in the dissemination tree that satisfies these two constraints and creates a replica at that server. The algorithm aims to achieve better scalability by quickly locating objects using the peer-to-peer location service. The algorithm is good at preserving client latency and server capacity constraints. On the other hand, it has a considerable overhead caused by checking the QoS requirements for every client request. In the worst case, a single client request may result in creating a new replica. This can significantly increase the request servicing time.

Kangasharju et al. [2001a] model the content placement problem as an optimization problem. The problem is to place K objects on some of N servers, in an effort to minimize the average number of inter-AS hops a request must traverse to be serviced, while meeting the storage constraints of each server. The problem is shown to be NP-complete, and three heuristics are proposed to address it. The first heuristic uses the popularity of an object as the only criterion, and every server decides upon the objects it needs to host based on the objects' popularity. The second heuristic uses a cost function defined as the product of object popularity and the distance of the server from the origin server. In this heuristic, each server selects the objects with the highest cost to host. The intuition behind this heuristic is that each server hosts objects that are highly popular and also far away from their origin server, in an effort to minimize the client latency. This heuristic always tries to minimize the distance of a replica from its origin server, oblivious of the presence of other replicas. The third heuristic overcomes this limitation and uses a coordinated replication strategy, where replica locations are decided in a global, coordinated fashion for all objects. This heuristic uses a cost function that is the product of the total request rate for a server, popularity, and the shortest distance of a server to a copy of the object. The central server selects the object and replica pairs that yield the best cost at every iteration and recomputes the shortest distance between servers for each object. Using simulations, the authors show that the global heuristic outperforms the other two heuristics. The drawback is its high computational complexity.
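
As an illustration of the second, per-server heuristic, the following sketch greedily fills a server's storage budget with the objects of highest popularity-times-distance cost; the data structures and values are assumed for the example:

def select_objects(objects, storage_budget):
    """Greedily fill a server's storage with the objects of highest cost,
    where cost = popularity * distance to the object's origin server.

    objects: list of dicts with keys "name", "size", "popularity", "origin_distance"
    """
    ranked = sorted(objects,
                    key=lambda o: o["popularity"] * o["origin_distance"],
                    reverse=True)
    hosted, used = [], 0
    for obj in ranked:
        if used + obj["size"] <= storage_budget:
            hosted.append(obj["name"])
            used += obj["size"]
    return hosted

catalog = [{"name": "a.html", "size": 1, "popularity": 0.6, "origin_distance": 12},
           {"name": "b.mpg",  "size": 8, "popularity": 0.3, "origin_distance": 30},
           {"name": "c.png",  "size": 2, "popularity": 0.1, "origin_distance": 4}]
print(select_objects(catalog, storage_budget=9))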

5.2.2. Replica Creation Mechanisms. Various mechanisms can be used to inform a replica server about the creation of a new replica that it needs to host. The most widely used mechanisms for this purpose are pull-based caching and push replication.

In pull-based caching, replica servers are not explicitly informed of the creation of a new replica. When a replica server receives a request for a document it does not own, it treats it as a miss and fetches the replica from the master. As a consequence, the creation of a new replica is delayed until the first request for this replica. This scheme is adopted in Akamai [Dilley et al. 2002]. Note that in this case, pull-based caching is used only as a mechanism for replica creation. The decision to place a replica on that server is taken by the system when redirecting client requests to replica servers.

In push replication, a replica server is informed of a replica creation by explicitly pushing the replica contents to that server. A similar scheme is used in Globule [Pierre and van Steen 2001] and Radar [Rabinovich and Aggarwal 1999].

5.3. Discussion

The problem of replica server and content placement is not regularly addressed by the research community. A few recent works have proposed solutions to these problems [Qiu et al. 2001; Radoslavov et al. 2001; Chen et al. 2002a; Kangasharju et al. 2001a]. We note that an explicit distinction between server and content placement is generally not made. Rather, work has concentrated on finding server locations to host the content of a single content provider. However, separate solutions for server placement and content placement would be more useful in a replica hosting system, as these systems are intended to host different content with varying client access patterns.

Choosing the best performing content placement algorithm is not trivial, as it depends on the access characteristics of the Web content. Pierre and van Steen [2001] showed that there is no single best performing replica placement strategy, and that it must be selected on a per-document basis based on individual access patterns. Karlsson and Karamanolis [2004] propose a scheme where different placement heuristics are evaluated off-line and the best performing heuristic is selected on a per-document basis. In Pierre and van Steen [2001] and Sivasubramanian et al. [2003], the authors propose to perform this heuristic selection dynamically, where the system adapts to changes in access patterns by switching the documents' replication strategies on-the-fly.

Furthermore, the existing server placement algorithms improve client QoS by minimizing client latency or distance [Qiu et al. 2001; Radoslavov et al. 2001]. Even though client QoS is important for making placement decisions, in practice the selection of replica servers is constrained by administrative reasons, such as the business relationship with an ISP and the financial cost of installing a replica server. Such a situation introduces a necessary tradeoff between financial cost and performance gain, which are not directly comparable entities. This drives the need for server placement solutions that not only take into account the financial cost of a particular server facility but can also translate the performance gains into potential monetary benefits. To the best of our knowledge, little work has been done in this area, which requires building economic models that translate the performance of a replica hosting system into the monetary profit gained. Such economic models are imperative to enable system designers to make better judgments in server placement and to provide server placement solutions that can be applied in practice.

6. CONSISTENCY ENFORCEMENT

The consistency enforcement problem concerns selecting consistency models and implementing them using various consistency policies, which in turn can use several content distribution mechanisms. A consistency model is a contract between a replica hosting system and its clients that dictates the consistency-related properties of the content delivered by the system. A consistency policy defines how, when, and to which object replicas the various content distribution mechanisms are applied. For each object, a policy adheres to a certain consistency model defined for that object. A single model can be implemented using different policies. A content distribution mechanism is a method by which replica servers exchange replica updates. It defines in what form replica updates are transferred, who initiates the transfer, and when updates take place.

Although consistency models and mechanisms are usually well defined, choosing a valid one for a given object is a nontrivial task. The selection of consistency models, policies, and mechanisms must ensure that the required level of consistency (defined by the various consistency metrics discussed in Section 3) is met, while keeping the communication overhead as low as possible.

6.1. Consistency Models

Consistency models differ in their strictness of enforcing consistency. By strong consistency, we mean that the system guarantees that all replicas are identical from the perspective of the system's clients. If a given replica is not consistent with the others, it cannot be accessed by clients until it is brought up to date. Due to high replica synchronization costs, strong consistency is seldom used in wide-area systems. Weak consistency, in turn, allows replicas to differ, but ensures that all updates reach all replicas after some (bounded) time. Since this model is resistant to delays in update propagation and incurs less synchronization overhead, it fits better in wide-area systems.

6.1.1. Single vs. Multiple Master. Depending on whether updates originate from a single site or from several ones, consistency models can be classified as single-master or multi-master, respectively. The single-master models define one machine to be responsible for holding an up-to-date version of a given object. These models are simple and fit well in applications where the objects by nature have a single source of changes. They are also commonly used in existing replica hosting systems, as these systems usually deliver some centrally managed data. For example, Radar assumes that most of its objects are static Web objects that are rarely modified, and uses primary-copy mechanisms for enforcing consistency. The multi-master models allow more than one server to modify the state of an object. These models are applicable to replicated Web objects whose state can be modified as a result of a client access. However, they introduce new problems, such as the necessity of solving update conflicts. Little work has been done on multi-master models in the context of Web replica hosting systems.

6.1.2. Types of Consistency. As we explained, consistency models usually define consistency along three different axes: time, value, and order.

Time-based consistency models were formalized in Torres-Rojas et al. [1999] and define consistency based on real time. These models require a content distribution mechanism to ensure that an update to a replica at time t is visible to the other replicas and clients before time $t + \Delta$. Cate [1992] adopts a time-based consistency model for maintaining the consistency of FTP caches. The consistency policy in this system guarantees that the only updates that might not yet be reflected at a site are the ones that have happened in the last 10% of the reported age of the file. Time-based consistency is applicable to all kinds of objects. It can be enforced using different content distribution mechanisms, such as polling (where a client or replica polls often to see if there is an update) or server invalidation (where a server invalidates the copies held by other replicas and clients if the object gets updated). These mechanisms are explained in detail in the next section.

Value-based consistency schemes ensure that the difference between the value of a replica and that of the other replicas (and its clients) is no greater than a certain $\Delta$. Value-based schemes can be applied only to objects that have a precise definition of value. For example, an object encompassing the details about the number of seats booked in an aircraft can use such a model. This scheme can be implemented by using polling or server invalidation mechanisms. Examples of value-based consistency schemes and content distribution mechanisms can be found in Bhide et al. [2002].

Order-based consistency schemes are generally exploited in replicated databases. These models perceive every read/write operation as a transaction and allow the replicas to operate in different states if the out-of-order transactions adhere to the rules defined by these policies. For example, Krishnakumar and Bernstein [1994] introduce the concept of N-ignorant transactions, where a transaction can be carried out on a replica while it is ignorant of N prior transactions at other replicas. The rules constraining the order of execution of transactions can also be defined based on dependencies among transactions. Implementing order-based consistency policies requires content distribution mechanisms to exchange the transactions among all replicas, and transactions need to be timestamped using mechanisms such as logical clocks [Raynal and Singhal 1996]. This consistency scheme is applicable to a group of objects that jointly constitute a regularly updated database.

A continuous consistency model, integrating the above three schemes, is presented by Yu and Vahdat [2002]. The underlying premise of this model is that there is a continuum between strong and weak consistency models that is semantically meaningful for a wide range of replicated services, as opposed to traditional consistency models, which explore either strong or weak consistency [Bernstein and Goodman 1983]. The authors explore the space between these two extremes by making applications specify their desired level of consistency using conits. A conit is defined as a physical or logical unit of consistency. The model uses a three-dimensional vector to quantify consistency: (numerical error, order error, staleness). Numerical error is used to define and implement value-based consistency, order error is used to define and implement order-based consistency schemes, and staleness is used for time-based consistency. If each of these metrics is bound to zero, then the model implements strong consistency. Similarly, if there are no bounds, then the model does not provide any consistency at all. The conit-based model allows a broad range of applications to express their consistency requirements. Also, it can precisely describe guarantees or bounds with respect to differences between replicas on a per-replica basis. This enables replicas having poor network connectivity to implement relaxed consistency, whereas replicas with better connectivity can still benefit from stronger consistency. The mechanisms implementing this conit-based model are described in Yu and Vahdat [2000, 2002].
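
The following sketch illustrates how a replica might check a conit's three bounds before deciding to synchronize; the field names and bound values are illustrative and not taken from Yu and Vahdat's implementation:

from dataclasses import dataclass

@dataclass
class ConitState:
    numerical_error: float   # accumulated weight of writes not yet seen locally
    order_error: int         # number of locally applied but unordered writes
    staleness: float         # seconds since the replica last synchronized

@dataclass
class ConitBounds:
    max_numerical_error: float
    max_order_error: int
    max_staleness: float

def needs_synchronization(state: ConitState, bounds: ConitBounds) -> bool:
    """A replica must synchronize as soon as any of the three bounds is exceeded."""
    return (state.numerical_error > bounds.max_numerical_error
            or state.order_error > bounds.max_order_error
            or state.staleness > bounds.max_staleness)

# Bounds of (0, 0, 0) would correspond to strong consistency.
print(needs_synchronization(ConitState(2.5, 1, 30.0), ConitBounds(5.0, 3, 60.0)))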

6.2. Content Distribution Mechanisms

Content distribution mechanisms define how replica servers exchange updates. These mechanisms differ in two aspects: the form of the updates and the direction in which updates are triggered (from the source of the update to the other replicas, or vice versa). The decisions on these two aspects influence the system's attainable level of consistency as well as the communication overhead introduced to maintain consistency.

6.2.1. Update Forms. Replica updates can be transferred in three different forms. In the first form, called state shipping, a whole replica is sent. The advantage of this approach is its simplicity. On the other hand, it may incur significant communication overhead, especially noticeable when a small update is performed on a large object.

In the second update form, called delta shipping, only the differences with the previous state are transmitted. It generally incurs less communication overhead compared to state shipping, but it requires each replica server to have the previous replica version available. Furthermore, delta shipping assumes that the differences between two object versions can be computed quickly.

In the third update form, called function shipping, replica servers exchange the actions that cause the changes. It generally incurs the least communication overhead, as the size of the description of an action is usually independent of the object state and size. However, it forces each replica server to carry out a certain, possibly computationally demanding, operation.

The update form is usually dictated by the exploited replication scheme and the object characteristics. For example, in active replication, requests targeting an object are processed by all replicas of this object. In such a case, function shipping is the only choice. In passive replication, in turn, requests are first processed by a single replica, and the remaining ones are then brought up to date. In such a case, the choice of update form depends on the object characteristics and the change itself: whether the object structure allows changes to be easily expressed as an operation (which suggests function shipping), whether the object size is large compared to the size of the changed part (which suggests delta shipping), and finally, whether the object was simply replaced with a completely new version (which suggests state shipping).

In general, it is the job of a system designer to select the update form that minimizes the overall communication overhead. In most replica hosting systems, updating simply means replacing the whole replica with its new version. However, it has been shown that updating Web objects using delta shipping could reduce the communication overhead by up to 22% compared to the commonly used state shipping [Mogul et al. 1997].

6.2.2. Update Direction. The update transfer can be initiated either by a replica server that is in need of a new version and wants to pull it from one of its peers, or by the replica server that holds a new replica version and wants to push it to its peers. It is also possible to combine both mechanisms to achieve a better result.

Pull. In one version of the pull-based approach, every piece of data is associated with a Time To Refresh (TTR) attribute, which denotes the next time the data should be validated. The value of TTR can be a constant, or can be calculated from the update rate of the data. It may also depend on the consistency requirements of the system. Data with high update rates and strong consistency requirements require a small TTR, whereas data with fewer updates can have a large TTR. Such a mechanism is used in Cate [1992]. The advantage of the pull-based scheme is that it does not require replica servers to store state information, offering the benefit of higher fault tolerance. On the other hand, enforcing stricter consistency depends on a careful estimation of TTR: small TTR values will provide good consistency, but at the cost of unnecessary transfers when the document was not updated.
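One simple way to adapt the TTR to the observed update rate is to shrink it whenever a validation finds the object changed and to grow it otherwise. The sketch below is illustrative; the constants and the doubling/halving rule are ours, not those of Cate [1992].

```python
def next_ttr(current_ttr: float, changed: bool,
             ttr_min: float = 5.0, ttr_max: float = 3600.0) -> float:
    """Adapt the Time To Refresh: shrink it when the object turned out to have
    changed (the copy went stale too early), grow it when the copy was fresh."""
    ttr = current_ttr / 2.0 if changed else current_ttr * 2.0
    return max(ttr_min, min(ttr_max, ttr))

# A replica server would validate its copy only once the TTR expires,
# then reschedule the next validation:
ttr = 60.0
for was_updated in [False, False, True, False]:
    ttr = next_ttr(ttr, was_updated)
    print(f"next validation in {ttr:.0f}s")
```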

In another pull-based approach, HTTP requests targeting an object are extended with the HTTP if-modified-since field. This field contains the modification date of a cached copy of the object. Upon receiving such a request, a Web server compares this date with the modification date of the original object. If the Web server holds a newer version, the entire object is sent as the response. Otherwise, only a header is sent, notifying that the cached copy is still valid. This approach allows for implementing strong consistency. On the other hand, it can impose large communication overhead, as the object home server has to be contacted for each request, even if the cached copy is valid.
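A minimal client-side sketch of this validation, using only the Python standard library, is shown below. The URL and cached values are placeholders; a replica server would keep them in its cache metadata.

```python
import urllib.request
from email.utils import formatdate
from urllib.error import HTTPError

def revalidate(url: str, cached_body: bytes, cached_mtime: float):
    """Ask the origin server whether a cached copy is still valid.
    Returns (body, served_from_cache)."""
    req = urllib.request.Request(url, headers={
        # HTTP-date of the cached copy's last known modification time
        "If-Modified-Since": formatdate(cached_mtime, usegmt=True),
    })
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read(), False      # origin shipped a newer version
    except HTTPError as err:
        if err.code == 304:                # Not Modified: only headers were sent
            return cached_body, True
        raise
```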

In practice, a combination of TTR and checking the validity of a document at the server is used. Only after the TTR value expires will the server contact the document's origin server to see whether the cached copy is still valid. If it is, a fresh TTR value is assigned to it and the next validation check is postponed until the TTR value expires again.

Push. The push-based scheme ensures that communication occurs only when there is an update. The key advantage of this approach is that it can meet strong consistency requirements without introducing the communication overhead known from the "if-modified-since" approach: since the replica server that initiates the update transfer is aware of the changes, it can precisely determine which changes to push and when. An important constraint of the push-based scheme is that the object home server needs to keep track of all replica servers to be informed. Although storing this list may seem costly, it has been shown that it can be done in an efficient way [Cao and Liu 1998]. A more important problem is that the replica holding the state becomes a potential single point of failure, as the failure of this replica affects the consistency of the system until it is fully recovered.

Push-based content distribution schemes can be associated with leases [Gray and Cheriton 1989]. In such approaches, a replica server registers its interest in a particular object for an associated lease time. The replica server remains registered at the object home server until the lease time expires. During the lease time, the object home server pushes all updates of the object to the replica server. When the lease expires, the replica server can either consider its copy as potentially stale or register at the object home server again.

Leases can be divided into three groups: age-based, renewal-frequency-based, and load-based ones [Duvvuri et al. 2000]. In age-based leases, the lease time depends on the last time the object was modified. The underlying assumption is that objects that have not been modified for a long time will remain unmodified for some time to come. In renewal-frequency-based leases, the object home server gives longer leases to replica servers that ask for replica validation more often. In this way, the object server prefers replica servers used by clients expressing more interest in the object. Finally, with load-based leases, the object home server tends to give away shorter lease times when it becomes overloaded. By doing that, the object home server reduces the number of replica servers to which the object updates have to be pushed, which is expected to reduce the size of the state held at the object home server.
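The three lease families differ only in how the lease duration is computed. The sketch below illustrates one possible computation for each family; the formulas and constants are ours, not those of Duvvuri et al. [2000].

```python
import time

def lease_duration(policy: str, *, last_modified: float = 0.0,
                   renewals_per_hour: float = 0.0, server_load: float = 0.0,
                   base: float = 300.0, cap: float = 3600.0) -> float:
    """Illustrative lease-time computation for the three policy families."""
    if policy == "age":
        # Objects that have been unmodified for a long time get longer leases.
        age = time.time() - last_modified
        return min(cap, base + 0.1 * age)
    if policy == "renewal-frequency":
        # Replica servers that revalidate often get longer leases.
        return min(cap, base * (1.0 + renewals_per_hour))
    if policy == "load":
        # An overloaded origin hands out shorter leases to shrink its state.
        return max(30.0, base * (1.0 - min(server_load, 0.9)))
    raise ValueError(f"unknown policy: {policy}")

print(lease_duration("age", last_modified=time.time() - 86_400))
print(lease_duration("load", server_load=0.8))
```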

Other Schemes. The pull and push approaches can be combined in different ways. Bhide et al. [2002] propose three different combinations of push and pull. The first scheme, called Push-and-Pull (PaP), simultaneously employs push and pull to exchange updates and has tunable parameters to control the extent of push and pull. The second scheme, Push-or-Pull (PoP), allows a server to adaptively choose between a push- or pull-based approach for each connection. This scheme allows a server to characterize which clients (other replica servers or proxies to which updates need to be propagated) should use either of these two approaches. The characterization can be based on system dynamics. By default, clients are forced to use the pull-based approach. PoP is a more effective solution than PaP, as the server can determine the moment of switching between push and pull, depending on its resource availability. The third scheme, called PoPoPaP, is an extended version of PoP that chooses from Push, Pull, and PaP. PoPoPaP improves the resilience of the server (compared to PoP), offers graceful degradation, and can maintain strong consistency.

Another way of combining push and pull is to allow the former to trigger the latter. This can be done either explicitly, by means of invalidations, or implicitly, with versioning. Invalidations are pushed by an object's origin server to a replica server. They inform the replica server or the clients that the replica it holds is outdated. In case the replica server needs the current version, it pulls it from the origin server. Invalidations may reduce the network overhead, compared to pushing regular updates, as the replica servers do not have to hold the current version at all times and can delay its retrieval until it is really needed. This is particularly useful for often-updated, rarely-accessed objects.

Versioning techniques are exploited in Akamai [Dilley et al. 2002; Leighton and Lewin 2000]. In this approach, every object is assigned a version number, increased after each update. The parent document that contains a reference to the object is rewritten after each update as well, so that it points to the latest version. The consistency problem is thus reduced to maintaining the consistency of the parent document. Each time a client retrieves the document, the object reference is followed and a replica server is queried for that object. If the replica server notices that it does not have a copy of the referenced version, the new version is pulled in from the origin server.

Scalable Mechanisms. None of the aforesaid content distribution mechanisms scale to large numbers of replicas (say, in the order of thousands). In this case, push-based mechanisms suffer from the overhead of storing the state of each replica and of updating all replicas through separate unicast connections. Pull-based mechanisms suffer from the disadvantage of creating a hot spot around the origin server, with thousands of replicas periodically requesting updates from the origin server (again through separate connections). Both mechanisms suffer from excessive network traffic when updating a large number of replicas, as the same updates are sent to different replicas over separate connections. This also introduces considerable overhead on the server, in addition to increasing the network overhead. These scalability limitations call for scalable mechanisms.

Scalable content distribution mechanisms proposed in the literature aim to solve the scalability problems of conventional push and pull mechanisms by building a content distribution hierarchy of replicas or by clustering objects.

The first approach is adopted in Ninan et al. [2002], Tewari et al. [2002], and Fei [2001]. In this approach, a content distribution tree of replicas is built for each object. The origin server sends its update only to the root of the tree (instead of to the entire set of replicas), which in turn forwards the update to the next level of nodes in the tree, and so on. The content distribution tree can be built using either network multicasting or application-level multicasting solutions. This approach drastically reduces the overall amount of data shipped by the origin server. In Ninan et al. [2002], the authors propose a scalable lease-based consistency mechanism where leases are made with a replica group (with the same consistency requirement) instead of with individual replicas. Each lease group has its own content distribution hierarchy to send its replica updates. Similarly, Tewari et al. [2002] propose a mechanism that builds a content distribution hierarchy and also uses object clustering to improve scalability.
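As an application-level illustration of such a dissemination tree (a minimal sketch under our own assumptions, not the mechanism of any of the systems above), the origin ships an update once to the root, and every node forwards it to its children:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReplicaNode:
    name: str
    children: List["ReplicaNode"] = field(default_factory=list)

    def deliver(self, update: bytes) -> None:
        """Apply the update locally, then forward it down the tree, so the
        origin only ever ships each update once (to the root)."""
        print(f"{self.name}: applying {len(update)} bytes")
        for child in self.children:
            child.deliver(update)

# origin -> root -> two regional replicas -> edge replicas (names are placeholders)
leaf1, leaf2 = ReplicaNode("edge-1"), ReplicaNode("edge-2")
root = ReplicaNode("root", [ReplicaNode("eu", [leaf1]), ReplicaNode("us", [leaf2])])
root.deliver(b"new object version")
```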

Fei [2001] proposes a mechanism that chooses between update propagation (through a multicast tree) and invalidation on a per-object basis, periodically, based on each object's update and access rates. The basic intuition of the mechanism is to choose propagation if an object is accessed more often than it is updated (thereby reducing the pull traffic), and invalidation otherwise (as the overhead of shipping updates is then higher than that of pulling the object only when it is accessed). The mechanism computes the traffic overhead of the two methods for maintaining the consistency of an object, given its past update and access rates. It chooses the one that introduces the least overhead as the mechanism to be adopted for that object.
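This intuition can be sketched as a comparison of the expected traffic under both schemes. The cost model below is a simplification under our own assumptions; Fei [2001] computes the overheads from rates measured over a past period.

```python
def choose_scheme(update_rate: float, access_rate: float,
                  object_size: int, invalidation_size: int = 64) -> str:
    """Pick propagation or invalidation per object, whichever is expected to
    ship fewer bytes per unit of time to a single replica."""
    # Propagation: every update is shipped, whether or not it is ever read.
    propagation_traffic = update_rate * object_size
    # Invalidation: each update ships a tiny invalidation; the object itself is
    # pulled only when it is accessed after having been invalidated.
    effective_pulls = min(access_rate, update_rate)
    invalidation_traffic = (update_rate * invalidation_size
                            + effective_pulls * object_size)
    return "propagation" if propagation_traffic <= invalidation_traffic else "invalidation"

print(choose_scheme(update_rate=1.0, access_rate=50.0, object_size=10_000))   # propagation
print(choose_scheme(update_rate=10.0, access_rate=0.1, object_size=10_000))   # invalidation
```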

Object clustering is the process of grouping objects with similar properties (update and/or request patterns) and treating them as a single clustered object. It reduces the connection initiation overhead during the transmission of replica updates from a per-object level to a per-cluster level, as updates for a cluster are sent over a single connection instead of individual connections for each object (note that the amount of update data transferred is the same with both mechanisms). Clustering also reduces the number of objects to be maintained by a server, which can help in reducing the adaptation overhead, as a single decision will affect more objects. To our knowledge, object clustering is not used in any well-known replica hosting system.

6.3. Mechanisms for Dynamically Generated Objects

In recent years, there has been a sharp increase in providing Web content using technologies like ASPs or CGIs that dynamically generate Web objects, possibly from an underlying database. These dynamically generated objects affect Web server performance, as their generation may require many CPU cycles [Iyengar et al. 1997].

Several systems cache the generated pages [Challenger et al. 1999; Labrinidis and Roussopoulos 2000]. Assuming that the temporal locality of requests is high and updates to the underlying data are infrequent, these systems avoid generating the same document multiple times. Challenger et al. [1999] additionally propose to maintain a dependency graph between cached pages and the underlying data that are used to generate them. Upon data updates, all pages depending on the updated data are invalidated.
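The dependency-graph idea can be sketched in a few lines. This is a minimal illustration under our own assumptions; the system of Challenger et al. tracks dependencies and regenerates pages far more elaborately.

```python
from collections import defaultdict

class PageCache:
    """Cache of generated pages, invalidated via a data-to-pages dependency graph."""
    def __init__(self):
        self.pages = {}                      # page id -> generated HTML
        self.dependents = defaultdict(set)   # data item -> page ids built from it

    def store(self, page_id: str, html: str, depends_on: set) -> None:
        self.pages[page_id] = html
        for item in depends_on:
            self.dependents[item].add(page_id)

    def data_updated(self, item: str) -> None:
        # Drop every cached page that was generated from the updated data item.
        for page_id in self.dependents.pop(item, set()):
            self.pages.pop(page_id, None)

cache = PageCache()
cache.store("/catalog", "<html>...</html>", {"db:products"})
cache.data_updated("db:products")     # /catalog is now invalidated
```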

Unfortunately, many applications receive large numbers of unique requests or must handle frequent data updates. Such applications can be distributed only through replication, where the application code is executed at the replica servers. This avoids the wide-area network latency for each request and ensures quicker response time to clients. Replicating a dynamic Web object requires replicating both the code (e.g., EJBs, CGI scripts, PHPs) and the data that the code acts upon (databases or files). This can reduce the latency of requests, as the requests can be answered by the application hosted by the server located close to the clients.

Replicating applications is relatively easy provided that the code does not modify the data [Cao et al. 1998; Rabinovich et al. 2003]. However, most Web applications do modify their underlying data. In this case, it becomes necessary to manage data consistency across all replicas. To our knowledge, there are very few existing systems that handle consistency among the replicated data for dynamic Web objects. Gao et al. [2003] propose an application-specific edge service architecture, where the object is assumed to take care of its own replication and consistency. In such a system, access to the shared data is abstracted by object interfaces. This system aims to achieve scalability by using weaker consistency models tailored to the application. However, this requires the application developer to be aware of the application's consistency and distribution semantics.

Further research on these topics would be very beneficial to replica hosting systems, as the popularity of dynamic document techniques is increasing.

6.4. Discussion

In this section, we discussed two important components of consistency enforcement, namely consistency models and content distribution mechanisms. For consistency models, we listed different types of models, based on time, value, or transaction order. In addition to these models, we also discussed the continuous consistency model, in which different network applications can express their consistency constraints at any point in the consistency spectrum. This model is useful to capture the consistency requirements of the broad range of applications being hosted by a replica hosting system. However, the mechanisms proposed to enforce its policies do not scale with an increasing number of replicas. Similar models need to be developed for Web replica hosting systems that can provide bounds on inconsistent access to their replicas with no loss of scalability.

Table III. A Comparison of Approaches for Enforcing Consistency

System/Protocol        Push (Inv.)  Push (Prop.)  Pull  Variants  Comments
Akamai                 X                          X               Uses push-based invalidation for consistency and pull for distribution
Radar                               X                             Uses primary-copy
SPREAD                                                  X         Chooses strategy on a per-object basis, based on its access and update rate
[Pierre et al. 2002]                                    X         Chooses strategy on a per-object basis, based on its access and update rate
[Duvvuri et al. 2000]  X                                          Invalidates content until lease is valid
Adaptive Push-Pull                  X             X               Chooses between push and pull strategy on a per-object basis
[Fei 2001]             X            X                             Chooses between propagation and invalidation on a per-object basis

For content distribution mechanisms, we discussed the advantages and disadvantages of push, pull, and other adaptive mechanisms. These mechanisms can be broadly classified as server-driven and client-driven consistency mechanisms, depending on who is responsible for enforcing consistency. At the outset, client-driven mechanisms seem to be the more scalable option for large-scale hosting systems, as in this case the server is not overloaded with the responsibility of enforcing consistency. However, Yin et al. [2002] have shown that server-driven consistency protocols can meet the scalability requirements of large-scale dynamic Web services delivering both static and dynamic Web content.

We note that existing systems and protocols concentrate only on time-based consistency models, and very little has been done on other consistency models. Hence, in our summary table of consistency approaches adopted by various systems and protocols (Table III), we discuss only the content distribution mechanisms adopted by them.

Maintaining consistency among dynamic documents requires special mechanisms. Existing replication solutions usually incur high overhead because they need global synchronization upon updates of their underlying replicated data. We consider that optimistic replication mechanisms, such as those used in Gao et al. [2003], may allow massively scalable content distribution mechanisms for dynamically generated documents. An in-depth survey on optimistic replication techniques can be found in Saito and Shapiro [2003].

7. REQUEST ROUTING

In request routing, we address the problem of deciding which replica server shall best service a given client request, in terms of the metrics selected in Section 3. These metrics can be, for example, replica server load (where we choose the replica server with the lowest load), end-to-end latency (where we choose the replica server that offers the shortest response time to the client), or distance (where we choose the replica server that is closest to the client).

Selecting a replica is difficult because the conditions at the replica servers (e.g., load) and in the network (e.g., link congestion, and thus latency) change continuously. These changing conditions may lead to different replica selections, depending on when and for which client these selections are made. In other words, a replica that is optimal for a given client may not necessarily remain optimal for the same client forever. Similarly, even if two clients request the same document simultaneously, they may be directed to different replicas. In this section, we refer to these two kinds of conditions together as "system conditions."

The entire request routing problem can be split into two parts: devising a redirection policy and selecting a redirection mechanism. A redirection policy defines how to select a replica in response to a given client request. It is essentially an algorithm that is invoked when a client request arrives. A redirection mechanism, in turn, is a means of informing the client about this selection. It first invokes the redirection policy, and then provides the client with the redirecting response that the policy generates.

A redirection system can be deployed on the client side, on the server side, or somewhere in the network between the two. It is also possible to combine client-side and server-side techniques to achieve better performance [Karaul et al. 1998]. Interestingly, a study by Rodriguez et al. [2000] suggests that clients may easily circumvent the problem of replica selection by simultaneously retrieving their data from several replica servers. This claim is disputed by Kangasharju et al. [2001b], who notice that the delay caused by opening connections to multiple servers can outweigh the actual gain in content download time. In this article, we assume that the client side is left unmodified, as the only software that usually runs there is a Web browser. We therefore do not discuss the details of client-side server-selection techniques, which can be found in Conti et al. [2002]. Finally, we do not discuss the various Web caching schemes, which have been thoroughly described in Rodriguez et al. [2001], as caches are by nature deployed on the client side.

In this section, we examine redirection policies and redirection mechanisms separately. For each of them, we discuss several related research efforts, and summarize with a comparison of these efforts.

7.1. Redirection Policies

A redirection policy can be either adaptive or nonadaptive. The former considers current system conditions while selecting a replica, whereas the latter does not. Adaptive redirection policies are usually more complex than nonadaptive ones, but this effort is likely to pay off with higher system performance. The systems we discuss below usually implement both types of policies, and can be configured to use any combination of them.

7.1.1. Nonadaptive Policies. Nonadaptive redirection policies select a replica that a client should access without monitoring the current system conditions. Instead, they exploit heuristics that assume certain properties of these conditions, of which we discuss examples below. Although nonadaptive policies are usually easier to implement, the system works efficiently only when the assumptions made by the heuristics are met.

An example of a nonadaptive policy is round-robin. It aims at balancing the load of replica servers by evenly distributing all the requests among these servers [Delgadillo 1999; Radware 2002; Szymaniak et al. 2003]. The assumption here is that all the replica servers have similar processing capabilities, and that any of them can service any client request. This simple policy has proved to work well in clusters, where all the replica servers are located in the same place [Pai et al. 1998]. In wide-area systems, however, replica servers are usually distant from each other. Since round-robin ignores this aspect, it cannot prevent directing client requests to more distant replica servers. If that happens, the client-perceived performance may turn out to be poor. Another problem is that the aim of load balancing itself is not necessarily achieved, as processing different requests can involve significantly different computational costs.

A nonadaptive policy exploited in Radar is the following. All replica servers are ranked according to their predicted load, which is derived from the number of requests each of them has serviced so far. The policy then redirects clients so that the load is balanced across the replica servers, and so that (additionally) the client-server distance is as low as possible. The assumption here is that the replica server load and the client-server distance are the main factors influencing the efficiency of request processing. Aggarwal and Rabinovich [1998] observe that this simple policy often performs nearly as well as its adaptive counterpart, which we describe below. However, as both of them ignore network congestion, the resulting client-perceived performance may still turn out to be poor.

Several interesting nonadaptive policies are implemented in Cisco DistributedDirector [Delgadillo 1999]. The first one defines the percentage of all requests that each replica server receives. In this way, the system can send more requests to more powerful replica servers and achieve better resource utilization. Another policy allows for defining preferences of one replica server over another. It may be used to temporarily relieve a replica server from service (for maintenance purposes, for example), and delegate the requests it would normally service to another server. Finally, DistributedDirector enables random request redirection, which can be used for comparison purposes during system efficiency tests. Although all these policies are easy to implement, they completely ignore current system conditions, making them inadequate for reacting to emergency situations.

One can imagine a nonadaptive redirection policy that statically assigns clients to replicas based on their geographical location. The underlying assumptions are that the clients are evenly distributed over the world, and that the geographical distance to a server reflects the network latency to that server. Although the former assumption is not likely to be valid in the general case, the latter has been verified positively, as we discussed earlier. According to Huffaker et al. [2002], the correlation between geographical distance and network latency reaches up to 75%. Still, since this policy ignores the load of replica servers, it can redirect clients to overloaded replica servers, which may lead to a substantially degraded client experience.

Finally, an interesting nonadaptive policy, which has later been used in developing the Chord peer-to-peer system, is consistent hashing [Karger et al. 1999]. The idea is straightforward: a URL is hashed to a value h from a large space of identifiers. That value is then used to efficiently route a request along a logical ring consisting of cache servers with IDs from that same space. The cache server with the smallest ID larger than h is responsible for holding copies of the referenced data. Variations of this scheme have been extensively researched in the context of peer-to-peer file sharing systems [Balikrishnan et al. 2003], including those that take network proximity into account (see, e.g., Castro et al. [2003b]).
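A minimal sketch of the idea follows. Production systems (and Chord itself) additionally place many virtual points per server on the ring to balance load and to handle server arrivals and departures; the server names below are placeholders.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map URLs to cache servers: the server with the smallest ID larger than
    the URL's hash (wrapping around the ring) is responsible for it."""
    def __init__(self, servers):
        self._ring = sorted((self._hash(s), s) for s in servers)
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def server_for(self, url: str) -> str:
        idx = bisect.bisect_right(self._keys, self._hash(url)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.server_for("http://example.com/logo.png"))
```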

7.1.2. Adaptive Policies. Adaptive redirection policies discover the current system conditions by means of the metric estimation mechanisms discussed in Section 3. In this way, they are able to adjust their behavior to situations that normally do not occur, like flash crowds, and ensure high system robustness [Wang et al. 2002].

The information that adaptive policies obtain from metric estimation mechanisms may include, for example, the load of replica servers or the congestion of selected network links. Apart from these data, a policy may also need to know some request-related information. The bare minimum is what object is requested and where the client is located. More advanced replica selection can also take client QoS requirements into account.

Knowing the system conditions and the client-related information, adaptive policies first determine a set of replica servers that are capable of handling the request (i.e., they store a replica of the document and can offer the required quality of service). Then, these policies select one (or more) of these servers, according to the metrics they exploit. Adaptive policies may exploit more than one metric. More importantly, a selection based on one metric is not necessarily optimal in terms of the others. For example, Johnson et al. [2001] observed that most CDNs do not always select the replica server closest to the client.

The adaptive policy used by Globule selects the replica servers that are closest to the client in terms of network distance [Szymaniak et al. 2003]. Globule employs the AS-path length metric, originally proposed by McManus [1999], and determines the distance based on a periodically refreshed, AS-based map of the Internet. Since this approach uses passive metric estimation services, it does not introduce any additional traffic into the network. We consider it to be adaptive, because the map of the Internet is periodically rebuilt, which results in (slow) adaptation to network topology changes. Unfortunately, the AS-based distance calculations, although simple to perform, are not very accurate [Huffaker et al. 2002].

A distance-based adaptive policy is also implicitly exploited by SPREAD [Rodriguez and Sibal 2000]. In this system, routers simply intercept requests on their path toward the object home server and redirect them to a nearby replica server. Consequently, requests reach their closest replica servers, and the resulting client-server paths are shortened. This policy naturally adapts to changes in routing. Its biggest disadvantage is the high cost of deployment, as it requires modifying many routers.

A combined policy, considering both replica server load and client-server distance, is implemented in Radar. The policy first isolates the replica servers whose load is below a certain threshold. Then, from these servers, the one closest to the client is selected. The Radar redirection policy adapts to changing replica server loads and tries to direct clients to their closest replica servers. However, by ignoring network congestion and end-to-end latencies, Radar focuses more on load balancing than on improving the client-perceived performance.
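The two-step selection described above can be sketched as follows. The data layout and the fallback for the case where every server is overloaded are our own assumptions, not details of Radar.

```python
def select_replica(servers, load_threshold: float) -> str:
    """servers: list of (name, load, distance_to_client) tuples.
    First drop servers whose load exceeds the threshold, then pick the one
    closest to the client; fall back to the least loaded server if all of
    them are overloaded."""
    eligible = [s for s in servers if s[1] < load_threshold]
    if not eligible:
        eligible = [min(servers, key=lambda s: s[1])]
    return min(eligible, key=lambda s: s[2])[0]

servers = [("ams", 0.9, 10), ("nyc", 0.4, 80), ("sfo", 0.3, 120)]
print(select_replica(servers, load_threshold=0.7))   # "nyc": closest non-overloaded
```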

Adaptive policies based on client-server latency have been proposed by Ardaiz et al. [2001] and Andrews et al. [2002]. Based either on client access logs or on passive server-side latency measurements, respectively, these policies redirect a client to the replica server that has recently reported the minimal latency to that client. The most important advantage of these schemes is that they exploit latency measurements, which are the best indicator of actual client experience [Huffaker et al. 2002]. On the other hand, both of them require maintaining a central database of measurements, which limits the scalability of the systems that exploit these schemes.

A set of adaptive policies is supported by Web Server Director [Radware 2002]. It monitors the number of clients and the amount of network traffic serviced by each replica server. It also takes advantage of performance metrics specific to Windows NT, which are included in the Management Information Base (MIB). Since this information is only provided in a commercial white paper, it is difficult to evaluate the efficiency of these solutions.

Another set of adaptive policies is implemented in Cisco DistributedDirector [Delgadillo 1999]. This system supports many different metrics, including inter-AS distance, intra-AS distance, and end-to-end latency. The redirection policy can determine the replica server based on a weighted combination of these three metrics. Although this policy is clearly more flexible than a policy that uses only one metric, measuring all the metrics requires deploying an "agent" on every replica server. Also, the exploited active latency measurements introduce additional traffic into the Internet. Finally, because DistributedDirector is kept separate from the replica servers, it cannot probe their load; the load can only be approximated with the nonadaptive policies discussed above.

A complex adaptive policy is used in Akamai [Dilley et al. 2002]. It considers a few additional metrics, like replica server load, the reliability of routes between the client and each of the replica servers, and the bandwidth that is currently available to a replica server. Unfortunately, the actual policy is subject to trade secret and cannot be found in the published literature.


7.2. Redirection Mechanisms

Redirection mechanisms provide clients with the information generated by the redirection policies. Redirection mechanisms can be classified according to several criteria. For example, Barbir et al. [2003] classify redirection mechanisms into DNS-based, transport-level, and application-level ones. The authors use the term "request routing" to refer to what we call "redirection" in this article. Such a classification is dictated by the diversity of request processing stages where redirection can be incorporated: name resolution, packet routing, and application-specific redirection implementation.

In this section, we distinguish between transparent, nontransparent, and combined mechanisms. Transparent redirection mechanisms hide the redirection from the clients. In other words, a client cannot determine which replica server is servicing it. In nontransparent redirection mechanisms, the redirection is visible to the client, which can then explicitly refer to the replica server it is using. Combined redirection mechanisms combine the two previous types, taking the best from each and eliminating their disadvantages.

As we only focus on wide-area systems, we do not discuss solutions that are applicable only to local environments. An example of such a solution is packet hand-off, which is thoroughly discussed in a survey of load-balancing techniques by Cardellini et al. [1999]. A more recent survey by the same authors covers other techniques for local-area Web clusters [Cardellini et al. 2002].

7.2.1. Transparent Mechanisms. Transparent redirection mechanisms perform client request redirection in a transparent manner. Therefore, they do not introduce explicit bindings between clients and replica servers, even if the clients store references to replicas. This is particularly important for mobile clients and for dynamically changing network environments, as in both these cases a replica server that is now optimal for a given client can become suboptimal shortly afterwards.

Several transparent redirection mechanisms are based on DNS [Delgadillo 1999; Cardellini et al. 2003; Rabinovich and Aggarwal 1999; Radware 2002; Szymaniak et al. 2003]. They exploit specially modified DNS servers. When a modified DNS server receives a resolution query for a replicated service, a redirection policy is invoked to generate one or more service IP addresses, which are returned to the client. The policy chooses the replica servers based on the IP address of the query sender. In DNS-based redirection, transparency is achieved assuming that services are referred to by means of their DNS names, and not their IP addresses. This redirection mechanism is extremely popular because of its simplicity and its independence from the actual replicated service: as it is incorporated in the name resolution service, it can be used by any Internet application.
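The following toy sketch shows where the redirection policy plugs into name resolution. It is an illustration only, not the behavior of any of the cited systems; the host names, IP addresses (taken from documentation address ranges), and the distance table are placeholders. A real deployment would embed this logic in DNS server software.

```python
def resolve(hostname: str, client_ip: str, policy) -> list:
    """Toy DNS-side redirection: for a replicated service name, let the
    redirection policy pick replica IP addresses based on the requester's
    address (in practice, the address of the client's local DNS server)."""
    replicated = {"www.example.com"}        # names hosted by the replica system
    if hostname in replicated:
        return policy(client_ip)
    return ["198.51.100.1"]                 # non-replicated names resolve normally

def nearest_first_policy(client_ip: str) -> list:
    # Placeholder: a real policy would consult the metric estimation services.
    replicas = {"192.0.2.10": 5, "192.0.2.20": 40}   # replica IP -> estimated distance
    return sorted(replicas, key=replicas.get)

print(resolve("www.example.com", "203.0.113.7", nearest_first_policy))
```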

On the other hand, DNS-based redirection has some limitations [Shaikh et al. 2001]. The most important ones are poor client identification and coarse redirection granularity. The poor client identification is caused by the fact that a DNS query does not carry the address of the querying client. The query can pass through several DNS servers before it reaches the one that knows the answer. However, each of these DNS servers knows only the DNS server with which it directly communicates, and not the querying client. Consequently, using DNS-based redirection mechanisms forces the system to use the clustering scheme based on local DNS servers, which was discussed in Section 3. The coarse redirection granularity is caused by the granularity of DNS itself: as it deals only with machine names, it can redirect based only on the part of an object URL that is related to the machine name. Therefore, as long as two URLs refer to the same machine name, they are identical to the DNS-based redirection mechanism, which makes it difficult to use different distribution schemes for different objects.


A scalable version of the DNS-based redirection is implemented in Akamai. This system improves the scalability of the redirection mechanism by maintaining two groups of DNS servers: top-level and low-level ones. Whereas the former share one location, the latter are scattered over several Internet data centers and are usually accompanied by replica servers. A top-level DNS server redirects a client query to a low-level DNS server proximal to the query sender. Then, the low-level DNS server redirects the sender to an actual replica server, usually placed in the same Internet data center. What is important, however, is that the top-to-low-level redirection occurs only periodically (about once per hour) and remains valid during all that time. For this reason, the queries are usually handled by proximal low-level DNS servers, which results in short name-resolution latency. Also, because the low-level DNS servers and the replica servers share the same Internet data center, the former may have accurate system-condition information about the latter. Therefore, the low-level DNS servers may quickly react to sudden changes, such as flash crowds or replica server failures.

An original transparent redirection scheme is exploited in SPREAD, which makes proxies responsible for client redirection [Rodriguez and Sibal 2000]. SPREAD assumes the existence of a distributed infrastructure of proxies, each handling all HTTP traffic in its neighborhood. Each proxy works as follows. It inspects the HTTP-carrying IP packets and isolates those that are targeting replicated services. All other packets are routed traditionally. If the requested replica is not available locally, the service-related packets are forwarded to another proxy along the path toward the original service site. Otherwise, the proxy services them and generates IP packets carrying the response. The proxy rewrites the source addresses in these packets, so that the client believes that the response originates from the original service site. The SPREAD scheme can be perceived as a distributed packet hand-off. It is transparent to the clients, but it requires a whole infrastructure of proxies.

7.2.2. Nontransparent Mechanisms. Nontransparent redirection mechanisms reveal the redirection to the clients. In this way, these mechanisms introduce an explicit binding between a client and a given replica server. On the other hand, nontransparent redirection mechanisms are easier to implement than their transparent counterparts. They also offer fine redirection granularity (per object), thus allowing for more flexible content management.

The simplest method that gives the effect of nontransparent redirection is to allow a client to choose from a list of available replica servers. This approach is called "manual redirection" and can often be found on Web services of widely known corporations. However, since this method is entirely manual, it is of little use for replica hosting systems, which require an automated client redirection scheme.

Nontransparent redirection can be implemented with HTTP. It is another redirection mechanism supported by Web Server Director [Radware 2002]. An HTTP-based mechanism can redirect clients by rewriting object URLs inside HTML documents, so that these URLs point at object replicas stored on some replica servers. It is possible to treat each object URL separately, which allows for using virtually any replica placement. The two biggest advantages of HTTP-based redirection are its flexibility and simplicity. Its biggest drawback is the lack of transparency.
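URL rewriting amounts to a per-object substitution inside the served document, as the deliberately simplified sketch below shows. The host names are placeholders and the regular expression is far less robust than a production rewriter, which would parse the HTML properly.

```python
import re

def rewrite_urls(html: str, replica_host: str) -> str:
    """Rewrite embedded object URLs so that they point at a replica server."""
    return re.sub(r'(src|href)="http://origin\.example\.com/',
                  rf'\1="http://{replica_host}/', html)

page = '<img src="http://origin.example.com/img/logo.png">'
print(rewrite_urls(page, "images.cdn.example.net"))
# <img src="http://images.cdn.example.net/img/logo.png">
```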

Cisco DistributedDirector also supports HTTP-based redirection, although in a different manner [Delgadillo 1999]. Instead of rewriting URLs, this system exploits the HTTP 302 (moved temporarily) response code. In this way, the redirecting machine does not need to store any service-related content: all it does is activate the redirection policy and redirect the client to a replica server. On the other hand, this solution can efficiently redirect only per entire Web service, and not per object.
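A minimal sketch of such a 302-based redirector, using Python's standard http.server module, is shown below. The policy stub and host names are placeholders; this is an illustration of the mechanism, not of DistributedDirector itself.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def pick_replica(client_ip: str) -> str:
    return "http://replica1.example.net"   # stub: a real redirection policy runs here

class Redirector(BaseHTTPRequestHandler):
    def do_GET(self):
        # Redirect the whole service: the path is preserved, only the host changes.
        self.send_response(302)
        self.send_header("Location", pick_replica(self.client_address[0]) + self.path)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), Redirector).serve_forever()
```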


7.2.3. Combined Mechanisms. It is possible to combine transparent and nontransparent redirection mechanisms to achieve a better result. Such approaches are followed by Akamai [Dilley et al. 2002], Cisco DistributedDirector [Delgadillo 1999], and Web Server Director [Radware 2002]. These systems redirect clients using a "cascade" of different redirection mechanisms.

The first mechanism in the cascade is HTTP. A replica server may rewrite URLs inside an HTML document so that the URLs of different embedded objects contain different DNS names. Each DNS name identifies a group of replica servers that store a given object.

Although it is in general not recommended to scatter objects embedded in a single Web page over too many servers [Kangasharju et al. 2001b], it may sometimes be beneficial to host objects of different types on separate groups of replica servers. For example, as video hosting may require specialized replica server resources, it may be reasonable to serve video streams with dedicated video servers, while providing images with other, regular ones. In such cases, video-related URLs contain a different DNS name (like "video.cdn.com") than the image-related URLs (like "images.cdn.com").

URL rewriting weakens the transparency, as the clients are able to discover that the content is retrieved from different replica servers. However, because the rewritten URLs contain DNS names that point to groups of replica servers, the clients are not bound to any single replica server. In this way, the system preserves the most important property of transparent redirection systems.

The second mechanism in the cascade is DNS. The DNS redirection system chooses the best replica server within each group by resolving the group-corresponding DNS name. In this way, the same DNS-based mechanism can be used to redirect a client to its several best replica servers, each belonging to a separate group.

By using DNS, the redirection system remains scalable, as it does in the case of pure DNS-based mechanisms. By combining DNS with URL rewriting, however, the system may offer finer redirection granularity and thus allow for more flexible replica placement strategies.

The third mechanism in the cascade is packet hand-off. The processing capabilities of a replica server may be improved by deploying the replica server as a cluster of machines that share the same IP address. In this case, packet hand-off is implemented locally to scatter client requests across several machines.

Similarly to pure packet hand-off techniques, this part of the redirection cascade remains transparent to the clients. However, since packet hand-off is implemented only locally, the scalability of the redirection system is maintained.

As can be observed, combining different redirection mechanisms leads to a redirection system that is simultaneously fine-grained, transparent, and scalable. The only potential problem is that deploying and maintaining such a mechanism is a complex task. In practice, however, this problem turns out to be just one more task of a replica hosting system operator. Duties like maintaining a set of reliable replica servers, managing multiple replicas of many objects, and keeping these replicas consistent are likely to be at a similar (if not higher) level of complexity.

7.3. Discussion

The problem of request routing can be divided into two subproblems: devising a redirection policy and selecting a redirection mechanism. The policy decides to which replica a given client should be redirected, whereas the mechanism takes care of delivering this decision to the client.

Table IV. The Comparison of Representative Implementations of a Redirection System

System        Adaptive policies     Nonadaptive policies   Redirection mechanisms
Akamai        DST, NLD, CPU, OTH    RR                     DNS (2LV), HTTP, Comb.
Globule       DST                   RR                     DNS (1LV), HTTP
Radar         DST, CPU              PLD                    DNS (1LV)
SPREAD        DST                   --                     TCP (DST)
Cisco DD      DST, LAT              RR, %RQ, PRF, RND      DNS (1LV), HTTP, Comb.
Web Direct    NLD, CPU, USR, OTH    RR                     DNS (1LV), HTTP, Comb.

DST: network distance; LAT: end-to-end latency; NLD: network load; CPU: replica server CPU load; USR: number of users; OTH: other metrics; RR: round robin; %RQ: percentage of requests; PRF: server preference; RND: random selection; PLD: predicted load; TCP (CNT/DST): centralized/distributed packet hand-off; DNS (1LV/2LV): one-level/two-level DNS redirection; Comb.: combined mechanisms.

We classify redirection policies into two groups: adaptive and nonadaptive ones. Nonadaptive policies perform well only when the system conditions do not change. If they do change, the system performance may turn out to be poor. Adaptive policies solve this problem by monitoring the system conditions and adjusting their behavior accordingly. However, they make the system more complex, as they need specialized metric estimation services. We note that all the investigated systems implement both adaptive and nonadaptive policies (see Table IV).

We classify redirection mechanisms into three groups: transparent, nontransparent, and combined ones. Transparent mechanisms can be based on DNS or on packet hand-off. As can be observed in Table IV, DNS-based mechanisms are very popular. Among them, particularly interesting is the scalable DNS-based mechanism built by Akamai. As for packet hand-off, its traditional limitation to clusters can be alleviated by means of a global infrastructure, as is done in SPREAD. Nontransparent mechanisms are based on HTTP. They achieve finer redirection granularity, at the cost of introducing an explicit binding between a client and a replica server. Transparent and nontransparent mechanisms can be combined. The resulting hybrids offer fine-grained, transparent, and scalable redirection at the cost of higher complexity.

We observe that the request routing component has to cooperate with the metric estimation services to work efficiently. Consequently, the quality of the request routing component depends on the accuracy of the data provided by the metric estimation service.

Further, we note that simple, nonadaptive policies sometimes work nearly as efficiently as their adaptive counterparts. Although this phenomenon may justify using only nonadaptive policies in simple systems, we do believe that monitoring system conditions is of key value for efficient request routing in large infrastructures. Moreover, combining several different metrics in the process of replica selection may additionally improve the system performance.

Finally, we are convinced that using combined redirection mechanisms is inevitable for large-scale wide-area systems. These mechanisms offer fine-grained, transparent, and scalable redirection at the cost of higher complexity. The resulting complexity, however, is not significantly larger than that of other parts of a replica hosting system. Since the ability to support millions of clients can be of fundamental importance, using a combined redirection mechanism is definitely worth the effort.

8. CONCLUSION

In this article, we have discussed the most important aspects of replica hosting system development. We have provided a generalized framework for such systems, which consists of five components: metric estimation, adaptation triggering, replica placement, consistency enforcement, and request routing. The framework has been built around an objective function, which allows the goals of replica hosting system development to be expressed formally. For each of the framework components, we have discussed its corresponding problems, described several solutions for them, and reviewed some representative research efforts.

We observe that the work done on various problems related to a single component tends to be imbalanced. In metric estimation, for example, many papers focus on network-related metrics, such as distance and latency, whereas very little has been written about financial and consistency metrics. Similarly, in consistency enforcement, usually only time-based schemes are investigated, although their value- and order-based counterparts can also be useful. The reason for this situation may be that these ignored problems are rather non-technical, and addressing them requires knowledge exceeding the field of computer systems.

Another observation is that researchers seldom distinguish between the problems of server and content placement. Consequently, the proposed solutions are harder to apply to each of these problems separately. As these problems are usually tackled in very different circumstances, it seems unlikely that a single solution will solve both of them. We find that a clear distinction between server and content placement should be more common among future research efforts.

Furthermore, there is a need for building economic models that enable the comparison of orthogonal metrics like performance gain and financial returns. Such a model would enable the designer of a replica hosting system to provide mechanisms that can capture the tradeoff between financial cost (of running and maintaining a new replica server) and possible performance gain (obtained by the new infrastructure).

We also notice that the amount of research effort devoted to the different framework components is imbalanced as well. Particularly neglected are object selection and adaptation triggering, for which hardly any papers can be found. These problems can only be investigated in the context of an entire replica hosting system (and the applications that exploit it). This drives the need for more research efforts that do not assume the components to be independent of each other. Instead, they should investigate the components in the context of an entire system, with special attention paid to the interactions and dependencies among them.

REFERENCES

ABOBA, B., ARKKO, J., AND HARRINGTON, D. 2000. Introduction to Accounting Management. RFC 2975.

AGGARWAL, A. AND RABINOVICH, M. 1998. Performance of replication schemes for an internet hosting service. Tech. Rep. HA6177000-981030-01-TM, AT&T Research Labs, Florham Park, NJ. Oct.

ALVISI, L. AND MARZULLO, K. 1998. Message logging: Pessimistic, optimistic, causal, and optimal. IEEE Trans. Softw. Eng. 24, 2 (Feb.), 149–159.

ANDREWS, M., SHEPHERD, B., SRINIVASAN, A., WINKLER, P., AND ZANE, F. 2002. Clustering and server selection using passive monitoring. In Proc. 21st INFOCOM Conference (New York, NY). IEEE Computer Society Press, Los Alamitos, CA.

ARDAIZ, O., FREITAG, F., AND NAVARRO, L. 2001. Improving the service time of web clients using server redirection. In Proc. 2nd Workshop on Performance and Architecture of Web Servers (Cambridge, MA). ACM, New York, NY.

BALIKRISHNAN, H., KAASHOEK, M. F., KARGER, D., MORRIS, R., AND STOICA, I. 2003. Looking up data in P2P systems. Commun. ACM 46, 2 (Feb.), 43–48.

BALLINTIJN, G., VAN STEEN, M., AND TANENBAUM, A. 2000. Characterizing internet performance to support wide-area application development. Oper. Syst. Rev. 34, 4 (Oct.), 41–47.

BARBIR, A., CAIN, B., NAIR, R., AND SPATSCHECK, O. 2003. Known content network (CN) request-routing mechanisms. RFC 3568.

BARFORD, P., CAI, J.-Y., AND GAST, J. 2001. Cache placement methods based on client demand clustering. Tech. Rep. TR1437, University of Wisconsin at Madison. July.

BERNSTEIN, P. A. AND GOODMAN, N. 1983. The failure and recovery problem for replicated databases. In Proc. 2nd Symposium on Principles of Distributed Computing (Montreal, Ont., Canada). ACM, New York, 114–122.

BHIDE, M., DEOLASEE, P., KATKAR, A., PANCHBUDHE, A., RAMAMRITHAM, K., AND SHENOY, P. 2002. Adaptive push-pull: Disseminating dynamic web data. IEEE Trans. Comput. 51, 6 (June), 652–668.


CAO, P. AND LIU, C. 1998. Maintaining strong cache consistency in the world wide web. IEEE Trans. Comput. 47, 4 (Apr.), 445–457.

CAO, P., ZHANG, J., AND BEACH, K. 1998. Active cache: Caching dynamic contents on the web. In Proc. Middleware '98 (The Lake District, England). Springer-Verlag, Berlin, 373–388.

CARDELLINI, V., CASALICCHIO, E., COLAJANNI, M., AND YU, P. 2002. The state of the art in locally distributed web-server systems. ACM Comput. Surv. 34, 2 (June), 263–311.

CARDELLINI, V., COLAJANNI, M., AND YU, P. 1999. Dynamic load balancing on web-server systems. IEEE Internet Comput. 3, 3 (May), 28–39.

CARDELLINI, V., COLAJANNI, M., AND YU, P. S. 2003. Request redirection algorithms for distributed web systems. IEEE Trans. Paral. Distrib. Syst. 14, 4 (Apr.), 355–368.

CARTER, R. L. AND CROVELLA, M. E. 1997. Dynamic server selection using bandwidth probing in wide-area networks. In Proc. 16th INFOCOM Conference (Kobe, Japan). IEEE Computer Society Press, Los Alamitos, CA, 1014–1021.

CASTRO, M., COSTA, M., KEY, P., AND ROWSTRON, A. 2003a. PIC: Practical internet coordinates for distance estimation. Tech. Rep. MSR-TR-2003-53, Microsoft Research Laboratories. Sept.

CASTRO, M., DRUSCHEL, P., HU, Y. C., AND ROWSTRON, A. 2003b. Proximity neighbor selection in tree-based structured peer-to-peer overlays. Tech. Rep. MSR-TR-2003-52, Microsoft Research, Cambridge, UK.

CATE, V. 1992. Alex - A global file system. In Proc. File Systems Workshop (Ann Arbor, MI). USENIX, Berkeley, CA, 1–11.

CHALLENGER, J., DANTZIG, P., AND IYENGAR, A. 1999. A scalable system for consistently caching dynamic web data. In Proc. 18th INFOCOM Conference (New York, NY). IEEE Computer Society Press, Los Alamitos, CA, 294–303.

CHANDRA, P., CHU, Y.-H., FISHER, A., GAO, J., KOSAK, C., NG, T. E., STEENKISTE, P., TAKAHASHI, E., AND ZHANG, H. 2001. Darwin: Customizable resource management for value-added network services. IEEE Netw. 1, 15 (Jan.), 22–35.

CHEN, Y., KATZ, R., AND KUBIATOWICZ, J. 2002a. Dynamic replica placement for scalable content delivery. In Proc. 1st International Workshop on Peer-to-Peer Systems (Cambridge, MA). Lecture Notes in Computer Science, vol. 2429. Springer-Verlag, Berlin, 306–318.

CHEN, Y., QIU, L., CHEN, W., NGUYEN, L., AND KATZ, R. H. 2002b. Clustering web content for efficient replication. In Proc. 10th International Conference on Network Protocols (Paris, France). IEEE Computer Society Press, Los Alamitos, CA.

CHEN, Y., QIU, L., CHEN, W., NGUYEN, L., AND KATZ, R. H. 2003. Efficient and adaptive web replication using content clustering. IEEE J. Sel. Areas Commun. 21, 6 (Aug.), 979–994.

COHEN, E. AND KAPLAN, H. 2001. Proactive caching of DNS records: Addressing a performance bottleneck. In Proc. 1st Symposium on Applications and the Internet (San Diego, CA). IEEE Computer Society Press, Los Alamitos, CA.

CONTI, M., GREGORI, E., AND LAPENNA, W. 2002. Replicated web services: A comparative analysis of client-based content delivery policies. In Proc. Networking 2002 Workshops (Pisa, Italy). Lecture Notes in Computer Science, vol. 2376. Springer-Verlag, Berlin, 53–68.

COX, R., DABEK, F., KAASHOEK, F., LI, J., AND MORRIS, R. 2004. Practical, distributed network coordinates. ACM Comput. Commun. Rev. 34, 1 (Jan.), 113–118.

CROVELLA, M. AND CARTER, R. 1995. Dynamic server selection in the internet. In Proc. 3rd Workshop on High Performance Subsystems (Mystic, Connecticut). IEEE Computer Society Press, Los Alamitos, CA.

DA CUNHA, C. R. 1997. Trace analysis and its applications to performance enhancements of distributed information systems. Ph.D. dissertation, Boston University, Boston, MA.

DELGADILLO, K. 1999. Cisco Distributed Director. Tech. rep., Cisco Systems, Inc. June.

DILLEY, J., MAGGS, B., PARIKH, J., PROKOP, H., SITARAMAN, R., AND WEIHL, B. 2002. Globally distributed content delivery. IEEE Internet Computing 6, 5 (Sept.), 50–58.

DUVVURI, V., SHENOY, P., AND TEWARI, R. 2000. Adaptive leases: A strong consistency mechanism for the world wide web. In Proc. 19th INFOCOM Conference (Tel Aviv, Israel). IEEE Computer Society Press, Los Alamitos, CA, 834–843.

DYKES, S. G., ROBBINS, K. A., AND JEFFREY, C. L. 2000. An empirical evaluation of client-side server selection. In Proc. 19th INFOCOM Conference (Tel Aviv, Israel). IEEE Computer Society Press, Los Alamitos, CA, 1361–1370.

ELNOZAHY, E., ALVISI, L., WANG, Y.-M., AND JOHNSON, D. 2002. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34, 3 (Sept.), 375–408.

FEI, Z. 2001. A novel approach to managing consistency in content distribution networks. In Proc. 6th Web Caching Workshop (Boston, MA). North-Holland, Amsterdam.

FRANCIS, P., JAMIN, S., JIN, C., JIN, Y., RAZ, D., SHAVITT, Y., AND ZHANG, L. 2001. IDMaps: Global internet host distance estimation service. IEEE/ACM Trans. Netw. 9, 5 (Oct.), 525–540.

FRANCIS, P., JAMIN, S., PAXSON, V., ZHANG, L., GRYNIEWICZ, D., AND JIN, Y. 1999. An architecture for a global internet host distance estimation service. In Proc. 18th INFOCOM Conference (New York, NY). IEEE Computer Society Press, Los Alamitos, CA, 210–217.

FU, Y., CHERKASOVA, L., TANG, W., AND VAHDAT, A. 2002.EtE: Passive end-to-end internet service per-formance monitoring. In Proc. USENIX Annual

ACM Computing Surveys, Vol. 36, No. 3, September 2004.

Page 42: Replication for Web Hosting Systems · Replication for Web Hosting Systems SWAMINATHAN SIVASUBRAMANIAN, MICHAL SZYMANIAK, GUILLAUME PIERRE, and MAARTEN VAN STEEN Vrije Universiteit,

332 S. Sivasubramanian et al.

Technical Conference (Monterey, CA). USENIX,Berkeley, CA. 115–130.

GAO, L., DAHLIN, M., NAYATE, A., ZHENG, J., AND IYEN-GAR, A. 2003. Application specific data repli-cation for edge services. In Proc. 12th Interna-tional World Wide Web Conference (Budapest,Hungary). ACM Press, New York.

GRAY, C. AND CHERITON, D. 1989. Leases: An effi-cient fault-tolerant mechanism for distributedfile cache consistency. In Proc. 12th Symposiumon Operating System Principles (Litchfield Park,AZ). ACM Press, New York, 202–210.

GUMMADI, K. P., SAROIU, S., AND GRIBBLE, S. D. 2002.King: Estimating latency between arbitrary in-ternet end hosts. In Proc. 2nd Internet Measure-ment Workshop (Marseille, France). ACM Press,New York, 5–18.

HUFFAKER, B., FOMENKOV, M., PLUMMER, D. J., MOORE,D., AND CLAFFY, K. 2002. Distance metrics inthe internet. In Proc. International Telecommu-nications Symposium (Natal RN, Brazil). IEEEComputer Society Press, Los Alamitos, CA.

HULL, S. 2002. Content Delivery Networks.McGraw-Hill, New York.

IYENGAR, A., MACNAIR, E., AND NGUYEN, T. 1997. Ananalysis of web server performance. In Proc.Globecom (Phoenix, AZ). IEEE Computer Soci-ety Press, Los Alamitos, CA.

JALOTE, P. 1994. Fault Tolerance in DistributedSystems. Prentice Hall, Englewood Cliffs, N.J.

JANIGA, M. J., DIBNER, G., AND GOVERNALI, F. J. 2001.Internet infrastructure: Content delivery. Gold-man Sachs Global Equity Research.

JOHNSON, K. L., CARR, J. F., DAY, M. S., AND KAASHOEK,M. F. 2001. The measured performance ofcontent distribution networks. Computer Com-mun. 24, 2 (Feb.), 202–206.

JUNG, J., KRISHNAMURTHY, B., AND RABINOVICH, M.2002. Flash crowds and denial of service at-tacks: Characterization and implications forCDNs and web sites. In Proc. 11th InternationalWorld Wide Web Conference (Honololu, Hawaii).ACM, New York, 293–304.

KANGASHARJU, J., ROBERTS, J., AND ROSS, K. 2001a.Object replication strategies in content distribu-tion networks. In Proc. 6th Web Caching Work-shop (Boston, MA). North-Holland, Amsterdam,The Netherlands.

KANGASHARJU, J., ROSS, K., AND ROBERTS, J. 2001b.Performance evaluation of redirection schemesin content distribution networks. Comput. Com-mun. 24, 2 (Feb.), 207–214.

KARAUL, M., KORILIS, Y., AND ORDA, A. 1998. Amarket-based architecture for managementof geographically dispersed, replicated webservers. In Proc. 1st International Conferenceon Information and Computation Economics(Charleston, SC). ACM, New York, 158–165.

KARGER, D., SHERMAN, A., BERKHEIMER, A., BOGSTAD, B.,DHANIDINA, R., IWAMOTO, K., KIM, B., MATKINS, L.,AND YERUSHALMI, Y. 1999. Web caching with

consistent hashing. In Proc. 8th InternationalWorld Wide Web Conference (Toronto, Ont.,Canada).

KARLSSON, M. AND KARAMANOLIS, C. 2004. Choosingreplica placement heuristics for wide-area sys-tems. In Proc. 24th International Conference onDistributed Computing Systems (Tokyo, Japan).IEEE Computer Society Press, Los Alamitos,CA.

KARLSSON, M., KARAMANOLIS, C., AND MAHALINGAM, M.2002. A Framework for Evaluating ReplicaPlacement Algorithms. Tech. rep., HP Labora-tories, Palo Alto, CA.

KRISHNAKUMAR, N. AND BERNSTEIN, A. J. 1994.Bounded Ignorance: A Technique for Increas-ing Concurrency in a Replicated System. ACMTrans. Datab. Syst. 4, 19, 586–625.

KRISHNAMURTHY, B. AND WANG, J. 2000. Onnetwork-aware clustering of web clients. InProc. SIGCOMM (Stockholm, Sweden). ACMPress, New York, 97–110.

LABRINIDIS, A. AND ROUSSOPOULOS, N. 2000. Web-view materialization. In Proc. SIGMOD Interna-tional Conference on Management of Data (Dal-las, TX). ACM Press, New York, 367–378.

LAI, K. AND BAKER, M. 1999. Measuring Band-width. In Proc. 18th INFOCOM Conference (NewYork, N.Y.). IEEE Computer Society Press, LosAlamitos, CA. 235–245.

LEIGHTON, F. AND LEWIN, D. 2000. Global Host-ing System. United States Patent, Number6,108,703.

LI, B., GOLIN, M. J., ITALIANO, G. F., AND DENG, X. 1999.On the optimal placement of web proxies in theinternet. In Proc. 18th INFOCOM Conference(New York, N.Y.). IEEE Computer Society Press,Los Alamitos, CA. 1282–1290.

MAO, Z. M., CRANOR, C. D., DOUGLIS, F., RABINOVICH,M., SPATSCHECK, O., AND WANG, J. 2002. A pre-cise and Efficient Evaluation of the proximity be-tween web clients and their local DNS servers.In Proc. USENIX Annual Technical Conference(Monterey, CA). USENIX, Berkeley, CA. 229–242.

MCCUNE, T. AND ANDRESEN, D. 1998. Towards ahierarchical scheduling system for distributedWWW server clusters. In Proc. 7th InternationalSymposium on High Performance DistributedComputing (Chicago, IL). IEEE Computer Soci-ety Press, Los Alamitos, CA. 301–309.

MCMANUS, P. R. 1999. A passive system for serverselection within mirrored resource environ-ments using AS path length heuristics. Whitepaper. Appplied Theory, Inc., June.

MOCKAPETRIS, P. 1987a. Domain Names—Conceptsand Facilities. RFC 1034.

MOCKAPETRIS, P. 1987b. Domain Names—Implementation and Specification. RFC 1035.

MOGUL, J. C., DOUGLIS, F., FELDMANN, A., AND

KRISHNAMURTHY, B. 1997. Potential benefits ofdelta encoding and data compression for HTTP.

ACM Computing Surveys, Vol. 36, No. 3, September 2004.

Page 43: Replication for Web Hosting Systems · Replication for Web Hosting Systems SWAMINATHAN SIVASUBRAMANIAN, MICHAL SZYMANIAK, GUILLAUME PIERRE, and MAARTEN VAN STEEN Vrije Universiteit,

Replication for Web Hosting Systems 333

In Proc. SIGCOMM (Cannes, France). ACM,New York, 181–194.

MOORE, K., COX, J., AND GREEN, S. 1996. Sonar—Anetwork proximity service. Internet-draft. [On-line] http://www.netlib.org/utk/projects/sonar/.

MOSBERGER, D. 1993. Memory consistency models.Oper. Syst. Rev. 27, 1 (Jan.), 18–26.

NELDER, J. A. AND MEAD, R. 1965. A simplexmethod for function minimization. Comput.J. 4, 7, 303–318.

NG, E. AND ZHANG, H. 2002. Predicting internetnetwork distance with coordinates-based ap-proaches. In Proc. 21st INFOCOM Conference(New York, NY). IEEE Computer Society Press,Los Alamitos, CA.

NINAN, A., KULKARNI, P., SHENOY, P., RAMAMRITHAM, K.,AND TEWARI, R. 2002. Cooperative leases: Scal-able consistency maintenance in content dis-tribution networks. In Proc. 11th InternationalWorld Wide Web Conference (Honolulu, HA).ACM, New York, 1–12.

OBRACZKA, K. AND SILVA, F. 2000. Network latencymetrics for server proximity. In Proc. Globecom(San Francisco, CA). IEEE Computer SocietyPress, Los Alamitos, CA.

ODLYZKO, A. 2001. Internet Pricing and the His-tory of Communications. Comput. Netw. 36, 493–517.

PAI, V., ARON, M., BANGA, G., SVENDSEN, M., DR-USCHEL, P., ZWAENEPOEL, W., AND NAHUM, E. 1998.Locality-aware request distribution in cluster-based network servers. In Proc. 8th Interna-tional Conference on Architectural Support forProgramming Languages and Operating Sys-tems (San Jose, CA). ACM, New York, 205–216.

PANSIOT, J. AND GRAD, D. 1998. On routes and mul-ticast trees in the internet. ACM Comput. Com-mun. Rev. 28, 1, 41–50.

PAXSON, V. 1997a. End-to-end routing behavior inthe internet. IEEE/ACM Trans. Netw. 5, 5 (Oct.),601–615.

PAXSON, V. 1997b. Measurements and analysisof end-to-end internet dynamics. Tech. Rep.UCB/CSD-97-945, University of California atBerkeley. Apr.

PEPELNJAK, I. AND GUICHARD, J. 2001. MPLS andVPN Architectures. Cisco Press, Indianapolis,IN.

PIAS, M., CROWCROFT, J., WILBUR, S., HARRIS, T., AND

BHATTI, S. 2003. Lighthouses for scalabledistributed location. In Proc. 2nd InternationalWorkshop on Peer-to-Peer Systems (Berkeley,CA). Vol. 2735. Springer-Verlag, Berlin,Germany, 278–291.

PIERRE, G. AND VAN STEEN, M. 2001. Globule: Aplatform for self-replicating web documents. InProc. Protocols for Multimedia Systems. LectureNotes on Computer Science, vol. 2213. Springer-Verlag, Berlin, Germany, 1–11.

PIERRE, G. AND VAN STEEN, M. 2003. Design and im-

plementation of a user-centered content deliverynetwork. In Proc. 3rd Workshop on Internet Ap-plications (San Jose, CA). IEEE Computer Soci-ety Press, Los Alamitos, CA.

PIERRE, G., VAN STEEN, M., AND TANENBAUM, A.2002. Dynamically selecting optimal distribu-tion strategies for web documents. IEEE Trans.Comput. 51, 6 (June), 637–651.

PRADHAN, D. 1996. Fault-Tolerant Computer Sys-tem Design. Prentice Hall, Englewood Cliffs, N.J.

QIU, L., PADMANABHAN, V., AND VOELKER, G. 2001. Onthe placement of web server replicas. In Proc.20th INFOCOM Conference (Anchorage, AK).IEEE Computer Society Press, Los Alamitos,CA., 1587–1596.

RABINOVICH, M. AND AGGARWAL, A. 1999. Radar: Ascalable architecture for a global web hostingservice. Comput. Netw. 31, 11–16, 1545–1561.

RABINOVICH, M. AND SPASTSCHECK, O. 2002. WebCaching and Replication. Addison-Wesley, Read-ing, MA.

RABINOVICH, M., XIAO, Z., AND AGGARWAL, A. 2003.Computing on the edge: A platform for repli-cating internet applications. In Proc. 8th WebCaching Workshop (Hawthorne, NY).

RADOSLAVOV, P., GOVINDAN, R., AND ESTRIN, D. 2001.Topology-informed internet replica place-ment. In Proc. 6th Web Caching Workshop(Boston, MA). North-Holland, Amsterdam, TheNetherlands.

RADWARE. 2002. Web Server Director. Tech. Rep.,Radware, Inc. Aug.

RAYNAL, M. AND SINGHAL, M. 1996. Logical time:Capturing causality in distributed systems.Computer 29, 2 (Feb.), 49–56.

REKHTER, Y. AND LI, T. 1995. A border gateway pro-tocol 4 (BGP-4). RFC 1771.

RODRIGUEZ, P., KIRPAL, A., AND BIERSACK, E. 2000.Parallel-access for mirror sites in the internet.In Proc. 19th INFOCOM Conference (Tel Aviv,Israel). IEEE Computer Society Press, LosAlamitos, CA., 864–873.

RODRIGUEZ, P. AND SIBAL, S. 2000. SPREAD: Scal-able platform for reliable and efficient auto-mated distribution. Comput. Netw. 33, 1–6, 33–46.

RODRIGUEZ, P., SPANNER, C., AND BIERSACK, E. 2001.Analysis of web caching architecture: Hierarchi-cal and distributed caching. IEEE/ACM Trans.Networking 21, 4 (Aug.), 404–418.

SAITO, Y. AND SHAPIRO, M. 2003. Optimistic replica-tion. Tech. Rep. MSR-TR-2003-60, Microsoft Re-search Laboratories. Sept.

SAYAL, M., SHEUERMANN, P., AND VINGRALEK, R. 2003.Content replication in web++. In Proc. 2nd In-ternational Symposium on Network Computingand Applications (Cambridge, MA). IEEE Com-puter Society Press, Los Alamitos, CA.

SCHNEIDER, F. 1990. Implementing fault-tolerantservices using the state machine approach: A tu-torial. ACM Comput. Surv. 22, 4 (Dec.), 299–320.

ACM Computing Surveys, Vol. 36, No. 3, September 2004.

Page 44: Replication for Web Hosting Systems · Replication for Web Hosting Systems SWAMINATHAN SIVASUBRAMANIAN, MICHAL SZYMANIAK, GUILLAUME PIERRE, and MAARTEN VAN STEEN Vrije Universiteit,

334 S. Sivasubramanian et al.

SHAIKH, A., TEWARI, R., AND AGRAWAL, M. 2001. Onthe Effectiveness of DNS-based Server Selec-tion. In Proc. 20th INFOCOM Conference (An-chorage, AK). IEEE Computer Society Press, LosAlamitos, CA. 1801–1810.

SHAVITT, Y. AND TANKEL, T. 2003. Big-bang sim-ulation for embedding network distances ineuclidean space. In Proc. 22nd INFOCOM Con-ference (San Francisco, CA). IEEE Computer So-ciety Press, Los Alamitos, CA.

SHMOYS, D., TARDOS, E., AND AARDAL, K. 1997.Approximation algorithms for facility locationproblems. In Proc. 29th Symposium on Theory ofComputing (El Paso, TX). ACM, New York, NY,265–274.

SIVASUBRAMANIAN, S., PIERRE, G., AND VAN STEEN, M.2003. A case for dynamic selection of replica-tion and caching strategies. In Proc. 8th WebCaching Workshop (Hawthorne, NY).

STEMM, M., KATZ, R., AND SESHAN, S. 2000. A net-work measurement architecture for adaptive ap-plications. In Proc. 19th INFOCOM Conference(New York, NY). IEEE Computer Society Press,Los Alamitos, CA. 285–294.

SZYMANIAK, M., PIERRE, G., AND VAN STEEN, M.2003. Netairt: A DNS-based redirection sys-tem for apache. In Proc. International Confer-ence WWW/Internet (Algarve, Portugal).

SZYMANIAK, M., PIERRE, G., AND VAN STEEN, M. 2004.Scalable cooperative latency estimation. In Proc.10th International Conference on Parallel andDistributed Systems (Newport Beach, CA). IEEEComputer Society Press, Los Alamitos, CA.

TERRY, D., DEMERS, A., PETERSEN, K., SPREITZER,M., THEIMER, M., AND WELSH, B. 1994. Ses-sion guarantees for weakly consistent replicateddata. In Proc. 3rd International Conference onParallel and Distributed Information Systems(Austin, TX). IEEE Computer Society Press, LosAlamitos, CA. 140–149.

TEWARI, R., NIRANJAN, T., AND RAMAMURTHY, S. 2002.WCDP: A protocol for web cache consistency. InProc. 7th Web Caching Workshop (Boulder, CO).

TORRES-ROJAS, F. J., AHAMAD, M., AND RAYNAL, M.1999. Timed consistency for shared distributedobjects. In Proc. 18th Symposium on Principlesof Distributed Computing (Atlanta, GA). ACM,New York, 163–172.

VERMA, D. C. 2002. Content Distribution Net-works: An Engineering Approach. John Wiley,New York.

WALDVOGEL, M. AND RINALDI, R. 2003. Efficienttopology-aware overlay network. ACM Comput.Commun. Rev. 33, 1 (Jan.), 101–106.

WANG, J. 1999. A survey of web caching schemesfor the internet. ACM Comput. Commun.Rev. 29, 5 (Oct.), 36–46.

WANG, L., PAI, V., AND PETERSON, L. 2002. The effec-tiveness of request redirection on CDN robust-ness. In Proc. 5th Symposium on Operating Sys-tem Design and Implementation (Boston, MA).USENIX, Berkeley, CA.

WOLSKI, R., SPRING, N., AND HAYES, J. 1999. Thenetwork weather service: A distributed resourceperformance forecasting service for metacom-puting. Future Gen. Comput. Syst. 15, 5-6 (Oct.),757–768.

XIAO, J. AND ZHANG, Y. 2001. Clustering of webusers using session-based similarity measures.In Proc. International Conference on ComputerNetworks and Mobile Computing (Beijing). IEEEComputer Society Press, Los Alamitos, CA. 223–228.

YIN, J., ALVISI, L., DAHLIN, M., AND IYENGAR, A. 2002.Engineering web cache consistency. ACM Trans.Internet Tech. 2, 3 (Aug.), 224–259.

YU, H. AND VAHDAT, A. 2000. Efficient numeri-cal error bounding for replicated network ser-vices. In Proc. 26th International Conference onVery Large Data Bases (Cairo, Egypt), A. E.Abbadi, M. L. Brodie, S. Chakravarthy, U. Dayal,N. Kamel, G. Schlageter, and K.-Y. Whang,Eds. Morgan-Kaufman, San Mateo, CA. 123–133.

YU, H. AND VAHDAT, A. 2002. Design and evalu-ation of a conit-based continuous consistencymodel for replicated services. ACM Trans. Com-put. Syst. 20, 3, 239–282.

ZARI, M., SAIEDIAN, H., AND NAEEM, M. 2001. Un-derstanding and reducing web delays. Com-puter 34, 12 (Dec.), 30–37.

ZHAO, B., HUANG, L., STRIBLING, J., RHEA, S., JOSEPH,A., AND KUBIATOWICZ, J. 2004. Tapestry: A re-silient global-scale overlay for service deploy-ment. IEEE J. Sel. Areas Commun. 22, 1 (Jan.),41–53.

Received May 2003; revised June 2004; accepted September 2004
