


Reliability of Heterogeneous Distributed Computing Systems in the Presence of Correlated Failures

Jorge E. Pezoa, Member, IEEE, and Majeed M. Hayat, Senior Member, IEEE

• J. E. Pezoa and M. M. Hayat are with the Department of Electrical and Computer Engineering, University of New Mexico, Albuquerque, NM 87131, USA; E-mail: [email protected], [email protected].
• J. E. Pezoa is also with the Electrical Engineering Department and the Center for Optics and Photonics, Universidad de Concepción, Concepción, Chile.
• M. M. Hayat is also with the Center for High Technology Materials, University of New Mexico, Albuquerque, NM, USA.

Abstract—While the reliability of distributed-computing systems (DCSs) has been widely studied under the assumption that computing elements (CEs) fail independently, the impact of correlated failures of CEs on the reliability remains an open question. Here, the problem of modeling and assessing the impact of stochastic, correlated failures on the service reliability of applications running on DCSs is tackled. The service reliability is modeled using an integrated analytical and Monte-Carlo (MC) approach. The analytical component of the model comprises a generalization of a previously developed model for the reliability of non-Markovian DCSs to a setting where specific patterns of simultaneous failures in CEs are allowed. The analytical model is complemented by an MC-based procedure to draw correlated-failure patterns using the recently reported concept of probabilistic shared risk groups (PSRGs). The reliability model is further utilized to develop and optimize a novel class of dynamic task reallocation (DTR) policies that maximize the reliability of DCSs in the presence of correlated failures. Theoretical predictions, MC simulations, and results from an emulation testbed show that the reliability can be improved when DTR policies correctly account for correlated failures. The impact of correlated failures of CEs on the reliability and the key dependence of DTR policies on the type of correlated failures are also investigated.

Index Terms—distributed computing, load balancing, reliability, non-Markovian process, spatially correlated failures, shared risk group

1 INTRODUCTION

Traditionally, the problem of modeling and analyzing reliability in distributed-computing systems (DCSs) in the presence of computing element (CE) failures has been tackled under the assumption that failures among CEs occur in a mutually independent fashion. Under the assumption of independent failures, the reliability of a DCS depends on both the number of CEs composing the system and their individual likelihoods of failure. A vast amount of work has been developed under this assumption, and several approaches to improve the reliability of applications executed on such systems have been developed [1]–[3].

The assumption of independent failures of CEs greatly simplifies the analysis; however, such an assumption may not be realistic for the types of failures occurring in modern DCSs, which may include heterogeneous CEs, non-negligible communication delays, unreliable CEs, unreliable communication links, and a dynamic topology that changes in a random fashion. For instance, Schroeder and Gibson [4], Kondo et al. [5], Gallet et al. [6], and Joshi et al. [7] analyzed different failure traces from large-scale high-performance computing (HPC) systems, Internet distributed systems, as well as DCSs, and all of them concluded independently that all those systems are affected by frequent, correlated machine crashes and network failures that reduce the reliability of the entire system. Further, it was stated that to improve, at no extra cost, the reliability of large-scale DCSs, the software managing applications must provide means to compensate for correlated failures [5], [7].

DCSs that extend over large geographical areas, as in the case of donation grids and peer-to-peer (P2P) networks for instance, can be vulnerable to large-scale failures resulting from massive communication-network malfunctions, wide-area power outages, wide-area natural disasters, or deliberate wide-area attacks on the system infrastructure, such as those inflicted by weapons of mass destruction (WMD) and high-power electromagnetic pulses. These stress events occur at specific geographical locations and may induce correlated failures that disrupt specific parts of the DCS. Correlated failures may not only disturb the system's availability; they may also induce further failures in other servers as a result of the lack of reliable communication between the DCS components, especially in situations where data exchanges take large communication times [8]. In this work, we are interested in assessing the service reliability of DCSs in scenarios where servers fail, without recovery, in a correlated manner.

This paper has two contributions: (1) modeling the service reliability of applications executed on DCSs in the presence of correlated component failures by means of a hybrid analytical and Monte-Carlo (MC)-based approach, and (2) optimizing the service reliability by means of dynamic task reallocation (DTR) policies. The service reliability is modeled by extending our analytical non-Markovian model in [9] to include specific group failures of CEs at each failure event. This extension enables us to calculate the reliability conditional on the occurrence of a specific realization of correlated CE failures. By averaging the conditional reliability over a large number of correlated-failure realizations, the average service reliability of an application in the presence of correlated failures can be estimated. To develop a statistical model for correlated failures, we have adopted the concept of the probabilistic shared risk link group (SRLG), which was developed by the network-routing community, and have used it to introduce correlation in a meaningful and practical manner by defining sets of CEs that may suffer from a common stress event. To maximize the reliability of DCSs in the presence of correlated failures, a novel class of DTR policies is also developed. The DTR policies exploit statistical knowledge of correlated failures to preemptively redistribute tasks among the CEs with the goal of maximizing the service reliability of the application. Results show that the benefit of DTR in improving reliability can be elevated when policies account for the effects of correlated failures.

The rest of this paper is organized as follows. Section 2 presents a brief survey of related work on modeling correlated failures and assessing reliability in DCSs. In Section 3 we build a model for correlated failures and introduce the hybrid analytical and MC-based approach for predicting the service reliability of a DCS. The concept of correlated-failure-aware DTRs is introduced in Section 4. Section 5 presents our simulation results. Finally, our conclusions are given in Section 6.

2 RELATED WORK

Correlated failures have been extensively studied in other areas outside the context of DCSs. A simple taxonomy, based on the type of correlation exhibited by the failures, classifies correlated failures into temporal, spatial, logical, or any combination of these.

Temporal failure correlations have been analyzed mostly in an empirical manner. In [10] the effects of failure patterns on the availability of a DCS's monitoring service were studied, and simple techniques for improving the robustness of the monitoring services were developed. In [6] the time-varying behavior of failures in large-scale DCSs was empirically modeled from failure traces obtained from production systems. Zhang and Fu analyzed node, cluster, and system-wide failure behaviors to predict and capture temporal failure correlations (at different time scales) in a coalition cluster environment [11]. Spatially correlated failures have been modeled in large-scale systems as well. In [6] spatially correlated failures were identified from real data using the following intuitive approach: groups of failures occurring within a short time interval across the CEs were assumed to be spatially correlated. Other modeling approaches have assumed that spatially correlated failures are induced by massive events where a region containing several CEs is physically damaged [12]. Spatially correlated failures have also been modeled in wireless sensor networks, where spatial-failure patterns were modeled assuming the simultaneous failure of all the sensor nodes in a specific region [13].

Logical failure correlations have also been studied in DCSs, and have been obtained either from the logical data dependencies of the applications or from the logical interconnection between hardware components. Weatherspoon et al. analyzed logically correlated failures in P2P networks and developed a framework for discovering groups of CEs that are maximally independent in their failure characteristics, clustering them to compensate for correlated failures [14]. In [15] and [16], a software reliability modeling framework capable of incorporating the dependencies among successive software runs was reported. In [17], Dai et al. evaluated the reliability of a grid computing system considering the failure correlation of different subtasks executed by the grid; however, component failures were assumed to be independent. Recently, approximate analytical expressions for reliability in on-demand systems exhibiting correlated failures were presented [18].

Traces of real failures from several parallel, high-performance (HP), and distributed computing (DC) environments have become available to researchers in recent years [4], [19]. In [4], Schroeder and Gibson statistically analyzed, and made public, traces of nine years of failures from a large HPC center. They noticed that the failure time of CEs follows a Weibull distribution, while their recovery time follows a lognormal distribution. In the context of correlated failures, Schroeder and Gibson noticed a certain degree of correlation between the failure rate of a CE and the type and intensity of the workload running on it. Iosup et al. analyzed traces of the long-term availability in a large-scale experimental grid environment [20]. The authors studied the effect of correlated failures in time, and built a failure model for the grid with no spatial correlation between the occurrence of failures at its different sites. Kondo et al. characterized the time availability in an Internet-based DCS, focusing on identifying patterns of correlated availability [5]. They also modeled the availability and failure times in diverse DCSs; however, their analysis did not consider the effect of correlated failures [19].

3 MODELING RELIABILITY OF DCSs IN THE PRESENCE OF CORRELATED FAILURES

3.1 Problem definition

This paper tackles the problem of improving the reliability of non-Markovian DCSs, by means of task reallocation, when failures of the CEs exhibit spatial correlation. We are particularly interested in improving the service reliability, which is defined as the probability that a given application can be entirely executed by a DCS. The DCS is assumed to be composed of $n$ heterogeneous CEs, whose processing capabilities are of the processor-consistent type [21]; that is, the random time taken by any server to process any task follows a general distribution and depends only upon the random service time of the server executing such a task. Parallel applications served by the DCS are assumed to belong to the class of applications with no data-dependence constraints between operations. Moreover, applications are supposed to be partitioned, at time $t = 0$, into $M$ atomic tasks by an off-line application scheduler that allocates $m_j$ tasks to the queue of the $j$th server. Further, all the CEs perform a synchronous DTR action at the prescribed time $t = t_b$.

Finally, we have also assumed that the exchange of any task, or any group of tasks, among any pair of servers experiences a stochastic communication delay. Such stochastic delays follow general distributions and depend upon both the number of tasks exchanged among the servers as well as network-related parameters such as heterogeneous end-to-end propagation times. We shall also assume that computing servers may fail permanently in a correlated fashion (to be described later) at any random instant, following the so-called crash-stop failure model, where tasks cannot be recovered from a failed server [22]. As a consequence, the application being executed on the DCS cannot be completed if at least one task remains unprocessed at a failed CE. Additionally, we further assume that small fixed-size failure-notice (FN) messages are exchanged over the network in order to detect and isolate failed servers. These FN messages too experience stochastic end-to-end transfer delays that depend only on the end-to-end propagation time of each communication link. Finally, we adopt the simplifying assumption that servers employ a reliable message-passing protocol to guarantee that tasks are not discarded in situations such as a server failing while exchanging tasks with other servers.
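To make the model ingredients concrete, the following minimal sketch collects the quantities introduced above into a single structure. All names here (`DCSModel`, `service_time`, `transfer_time`, `fn_delay`) are hypothetical, and the general (non-Markovian) distributions are left as user-supplied callables; this is an illustration of the model parameters, not code from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class DCSModel:
    """Parameters of the non-Markovian DCS model described above (hypothetical names)."""
    n: int                                   # number of heterogeneous CEs
    m0: List[int]                            # initial allocation: m0[j] tasks queued at server j at t = 0
    t_b: float                               # prescribed instant of the synchronous DTR action
    service_time: List[Callable[[], float]]  # per-server draw of a random task-service time
    transfer_time: Callable[[int, int, int], float]  # delay for moving l tasks from server i to server j
    fn_delay: Callable[[int, int], float]    # end-to-end delay of a failure-notice (FN) message

    def total_tasks(self) -> int:
        # M: the application is partitioned into M atomic tasks at t = 0
        return sum(self.m0)
```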

3.2 Reliability in the presence of correlated failures

3.2.1 Shared risk groups and correlated failures

We focus on modeling the service reliability in scenarios where servers fail without recovery. Specifically, we are interested in correlated failures triggered by large-scale geographical attacks on the DCS infrastructure, which diminish the ability of the DCS to reallocate tasks among CEs. Thus, we model the class of correlated failures resulting from real-world massive disruptions and/or physical attacks on the DCS infrastructure. To do so, we have taken from the network-routing community the concept of SRLGs and adjusted it here to introduce correlation in a meaningful and practical manner.

The concept of the SRLG has successfully been used to address, in a systematic manner, the survivability of network topologies in the presence of multiple correlated communication-link failures [23]. The key idea in SRLGs is that multiple yet different telecommunication services may be affected by a common network failure under the proviso that they share a common failure risk, such as a fiber-optic link, a routing device, a routing domain, etc. The consequence of a common-risk failure is that all the services sharing the same risk would be affected or even totally interrupted. In [24], the concept of the probabilistic SRLG was introduced as a generalization of the traditional SRLG to model stochastic failure events affecting the network topology; upon the occurrence of an SRLG failure event, the communication links associated with the group fail with some probability. Here we take both concepts and redefine them for a DC environment.

Definition 1. A shared risk group (SRG) in a DCS is a set of servers that may be affected by a common failure of the infrastructure of the DCS under the condition that they share a common failure risk.

In DCSs, examples of common failure risks are: (1) infrastructure anomalies, such as power outages or spikes; (2) hardware failures, such as failures in memory modules, CPUs, or even fans; (3) input/output errors, such as failures at disk drives or drive controllers; (4) network failures, such as failures at Fast Ethernet or Gigabit Ethernet switches; and (5) software failures, such as failures at schedulers or distributed file systems. We note that all these types of failures are logged by DCSs following the trace format of The Failure Trace Archive [25]. In fact, by examining traces in [25] from the HPC system at Los Alamos National Laboratory, we have observed that the first failure triggered by a power outage produced a correlated failure at the nodes identified as 655 and 782. Other examples of real-world correlated failures found in the traces were triggered, for instance, by failures at UPSs and fiber drives. Examples of common failure risks of interest to this paper are groups of CEs sharing a close geographical area, groups of CEs within the same facility, groups of CEs facing a cyber attack on either the DCS, their distributed operating system, their communication network, or their Internet service provider, etc. [26].

Suppose now that there exists a set $\mathcal{A}$ of SRG events that may induce correlated failures in the DCS. Suppose also that each event $A \in \mathcal{A}$ has a probability of occurrence $\pi(A)$. Further, assume that the underlying infrastructure of an $n$-server DCS is abstracted by a connected, undirected graph $G = (V, E)$, where $V = \{1, 2, \ldots, n\}$ is the set of CEs and $E \subset V \times V$ is the set of communication links. Consequently, when the SRG event $A$ happens, the set of servers $V$ can be partitioned into two sets, $V_A$ and $V_A^c$, where the former denotes the collection of all servers sharing the common risk associated with the SRG event $A$ and the latter denotes all those servers unaffected by the event $A$.

Definition 2. A probabilistic shared risk group (PSRG) in a DCS is a set of servers that fail with a positive probability in the event of an SRG failure. More precisely, the failure probability of the $i$th server, conditional on the SRG failure event $A$, is denoted by $p_i^A$ and satisfies $p_i^A > 0$ for all $i \in V_A$ and $p_i^A = 0$ otherwise.

Following [24], we assume that only one PSRG event may occur at a time, meaning that the PSRG failure events are mutually exclusive; consequently, the following relationship holds: $\sum_{A \in \mathcal{A}} \pi(A) = 1$. This otherwise arbitrary definition has been effectively used in the routing community and makes sense in the context of the class of failures regarded here [23], [24], [27].

Definition 3. We say that servers $i$ and $j$ belonging to a DCS are correlated if $p_i^A$ and $p_j^A$ are both positive for the PSRG event $A$. Moreover, upon the occurrence of the SRG event $A$, the failures occurring with probabilities $p_i^A$ and $p_j^A$ are mutually independent for all pairs of servers in $V_A$.

Suppose now that $X_i$ is a binary random variable representing whether the $i$th server has failed ("1") or not ("0"). By arranging the $n$ binary random variables in vector form, we introduce the failure vector $\mathbf{X} = (X_1, X_2, \ldots, X_n)$, which takes values in $\{0,1\}^n$, as the random vector defining the failure state of the DCS. Also, a realization of the failure vector $\mathbf{X}^A$ is denoted by the binary vector $\mathbf{x}^A$ and is termed a failure. Finally, as a practical matter, to generate samples of correlated failures given a specific probabilistic SRG event, it suffices to specify the probabilities $p_i^A$, independently generate realizations of the $X_i^A$ random variables, and form the vector $\mathbf{x}^A$.
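In other words, sampling one correlated-failure pattern reduces to $n$ independent Bernoulli draws with the conditional probabilities $p_i^A$. A minimal sketch of that step follows; the function name is hypothetical, and `p_A` holds the conditional failure probabilities of the servers in $V_A$ (servers absent from the map fail with probability zero, per Definition 2):

```python
import random
from typing import Dict, List

def draw_failure_pattern(n: int, p_A: Dict[int, float], rng: random.Random) -> List[int]:
    """Draw one realization x^A of the failure vector X^A, conditional on PSRG event A."""
    return [1 if rng.random() < p_A.get(i, 0.0) else 0 for i in range(n)]

# Example: a PSRG event affecting servers 2, 5 and 7, each with p_i^A = 0.35
rng = random.Random(1)
x_A = draw_failure_pattern(10, {2: 0.35, 5: 0.35, 7: 0.35}, rng)
```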

3.2.2 Service reliability and correlated failures

In order to calculate the service reliability in the presence of spatially correlated failures, a hybrid analytical and MC approach is presented. By drawing samples of correlated failures from the PSRG model, a large number of realizations of correlated-failure patterns can be generated. Thus, conditional on a particular failure pattern, an analytical model for the conditional service reliability of a DCS can be derived, as shown in Appendix A. More precisely, conditional on a sample, say $\mathbf{X}^A = \mathbf{x}^A$, of correlated failures induced by the occurrence of the PSRG event $A$, the system of recursive integral equations (9), with initial conditions (10), presented in Appendix A can be used to compute the conditional service reliability.

In brief, we can estimate the service reliability of a DCS in the presence of correlated failures induced by a PSRG event $A$ as follows. Let $k$ index the samples of correlated failures induced by the occurrence of the PSRG event $A$, with $k = 1, 2, \ldots, \kappa$. Let also $R_{\ell_0,k}(t_b \mid \mathbf{X}^A = \mathbf{x}^A)$ be the service reliability of the DCS when the servers perform a synchronous DTR action at time $t = t_b$, the initial system configuration is as specified by $\ell_0$, and the $k$th sample of correlated failures induced by the occurrence of the PSRG event $A$ is as specified by $\mathbf{X}^A = \mathbf{x}^A$. The service reliability of the DCS in the presence of correlated failures induced by the occurrence of the PSRG event $A$, $R^A_{\ell_0}(t_b)$, can be estimated by simply averaging over the $\kappa$ samples of correlated failures:

$$R^A_{\ell_0}(t_b) \approx \frac{1}{\kappa} \sum_{k=1}^{\kappa} R_{\ell_0,k}(t_b \mid \mathbf{X}^A = \mathbf{x}^A). \tag{1}$$
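Equation (1) is a plain Monte-Carlo average, so the estimation loop is straightforward. In the sketch below, `conditional_reliability` is a hypothetical placeholder standing in for the solver of the recursive equations (9)–(10) of Appendix A; everything else follows directly from (1):

```python
import random
from typing import Callable, Dict, List

def estimate_reliability(n: int,
                         p_A: Dict[int, float],
                         t_b: float,
                         conditional_reliability: Callable[[float, List[int]], float],
                         kappa: int = 10_000,
                         seed: int = 0) -> float:
    """MC estimate of R^A_{l0}(t_b) per Eq. (1): average the conditional
    reliability over kappa correlated-failure patterns drawn from the PSRG model."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(kappa):
        # One Bernoulli draw per server, with p_i^A = 0 outside V_A (Definition 2)
        x_A = [1 if rng.random() < p_A.get(i, 0.0) else 0 for i in range(n)]
        total += conditional_reliability(t_b, x_A)  # R_{l0,k}(t_b | X^A = x^A)
    return total / kappa
```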

4 CORRELATED-FAILURE-AWARE DISTRIBUTED TASK REALLOCATION POLICY

In [9], [28] we developed a flexible class of DTR policies. Each policy in the class estimates, at $t = t_b$, the amount of load imbalance, $L^{\mathrm{ex}}_j(t_b)$, that each server has with respect to the estimated total system load, $M_i(t_b)$. The imbalance-estimation criterion considers a general parameter, denoted $\Lambda_j$, which represents different choices for the imbalance criterion, such as the relative computing power and the individual reliability of the CEs. Once the imbalance criterion is defined, each unbalanced server determines the initial number of tasks to reallocate among the remaining servers in the system. This step is carried out by partitioning the excess load among all the candidate task-receiver servers:

$$l^{(0)}_{ij} \equiv l^{(0)}_{ij}(t_b) = \Big\lfloor m_i(t_b) - \Lambda_j M_i(t_b) \Big/ \sum_{\ell \in W_j} \Lambda_\ell \Big\rfloor, \tag{2}$$

where $\lfloor\cdot\rfloor$ is the floor function and $m_j(t_b)$ is the load at the $j$th server at time $t = t_b$. The values $l^{(0)}_{ij}$ are initial values for the partition of tasks at an unbalanced server.

To develop a DTR policy accounting for correlated failures, the ideas of PSRGs and correlation introduced in Definitions 2 and 3 must be considered. Here we have modified the general DTR policy and created a correlated-failure-aware policy by proposing the following reallocation criterion:

$$\Lambda_j = \lambda^d_j \Big(1 - \lambda^f_j \Big/ \sum_{k \in V} \lambda^f_k\Big)\big(1 - \pi(A)\big)\big(1 - p^A_j p^A_i\big), \tag{3}$$

where $\lambda^d_j$ and $\lambda^f_j$ are, respectively, the processing speed and the failure rate of the $j$th CE. The idea behind this definition of the reallocation policy is to favor the migration of tasks from overloaded servers to those CEs that are less vulnerable to failing in a correlated manner given the PSRG event $A$, while simultaneously penalizing the migration of tasks to those CEs that are correlated, in the sense of Definition 3. Note that the processing speeds of the servers as well as the failure rates are still considered in the definition of the reallocation criterion. When failures are uncorrelated, the term $p^A_j p^A_i$ is zero as a consequence of Definition 2, and the proposed policy becomes proportional to: (1) the likelihood that the PSRG event $A$ does not occur; and (2) $\lambda^d_j \big(1 - \lambda^f_j / \sum_{k \in V} \lambda^f_k\big)$, which is exactly the reallocation policy for the independent-failure case defined in [28], Section 2.2.
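As an illustration, the criterion (3) can be evaluated per candidate receiver as sketched below. The names are hypothetical; `p_A` holds the conditional failure probabilities $p_i^A$ of the PSRG event under consideration, with servers outside $V_A$ omitted:

```python
from typing import Dict, List

def reallocation_weight(j: int, i: int,
                        speed: List[float],      # lambda^d_k: processing speeds
                        fail_rate: List[float],  # lambda^f_k: failure rates
                        pi_A: float,             # pi(A): probability of the PSRG event
                        p_A: Dict[int, float]) -> float:
    """Correlated-failure-aware criterion Lambda_j of Eq. (3), for moving tasks
    from overloaded server i to candidate receiver j."""
    rel_fail = fail_rate[j] / sum(fail_rate)                 # lambda^f_j / sum_k lambda^f_k
    corr_penalty = 1.0 - p_A.get(j, 0.0) * p_A.get(i, 0.0)   # penalize correlated pairs (Def. 3)
    return speed[j] * (1.0 - rel_fail) * (1.0 - pi_A) * corr_penalty
```

Note how the two failure-related factors act multiplicatively: a receiver correlated with the sender is penalized through $1 - p^A_j p^A_i$, while a rarely occurring PSRG event ($\pi(A)$ small) leaves the independent-failure criterion of [28] nearly unchanged.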

Note that in the development of the hybrid analytical and MC model for the service reliability, the DTR policy executed by the servers at time $t_b$ has been treated as a parameter. We include this parameterization in the notation of the service reliability as $R^A_{\ell_0}(t_b; \mathbf{L}) \equiv R^A_{\ell_0}(t_b)$, where $\mathbf{L}$ is an $n$-by-$n$ matrix whose $(i,j)$th element, $l_{ij}$, denotes the number of tasks to be reallocated from the $i$th to the $j$th server at time $t_b$. More importantly, we can exploit this parameterization to pose an optimization problem that allows us to optimally migrate tasks among the CEs such that the service reliability of the DCS, in the presence of correlated failures, is maximized.


Mathematically, the following mixed-integer optimization problem can be stated:

$$(t_b^*, \mathbf{L}^*) = \arg\max_{(t_b, \mathbf{L})} R^A_{\ell_0}(t_b; \mathbf{L}), \tag{4}$$

subject to

$$\sum_{j=1,\, j \neq i}^{n} l_{ij} \le m_i, \quad i = 1, \ldots, n, \tag{5}$$

$$l_{ij} \in \{0, 1, \ldots, m_i\}, \quad i, j = 1, \ldots, n,\ i \neq j, \tag{6}$$

$$t_b \ge 0. \tag{7}$$

The problem has $n(n-1)$ non-negative integer-valued variables, one non-negative real-valued variable, and $n^2 + 1$ constraints. This type of optimization problem is known to be NP-hard due to the combinatorial explosion of the search space; the efficient search algorithm based on a pairwise decomposition of the DCS presented in [28] has been employed here to find feasible DTR policies maximizing the service reliability of a DCS in the presence of correlated failures.
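For intuition, on a toy two-server instance the problem (4)–(7) can be solved by brute force over a discretized grid of balancing instants; in practice the pairwise-decomposition algorithm of [28] replaces this exhaustive search. In the sketch below, `reliability` is a hypothetical stand-in for the MC estimator of $R^A_{\ell_0}(t_b; \mathbf{L})$:

```python
from itertools import product
from typing import Callable, List, Tuple

def brute_force_dtr(m: List[int],
                    t_grid: List[float],
                    reliability: Callable[[float, List[List[int]]], float]
                    ) -> Tuple[float, List[List[int]]]:
    """Exhaustively solve (4)-(7) for n = 2 servers: enumerate t_b over a grid,
    l_12 in {0,...,m_1} and l_21 in {0,...,m_2}; constraints (5)-(7) then hold
    by construction."""
    best_r, best_tb, best_L = -1.0, None, None
    for t_b, l12, l21 in product(t_grid, range(m[0] + 1), range(m[1] + 1)):
        L = [[0, l12], [l21, 0]]
        r = reliability(t_b, L)
        if r > best_r:
            best_r, best_tb, best_L = r, t_b, L
    return best_tb, best_L
```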

We note that the proposed algorithm scales linearly in both the number of servers in the DCS and the number of tasks queued at the overloaded server. This claim is justified as follows. Suppose that the $j$th server is unbalanced and must reallocate tasks to $\eta$ servers, where $0 < \eta < n$. Since the DTR policy executed by the servers is distributed, each one must solve (9) and (10) independently. For $n = 2$ servers, the complexity of solving such equations is a function of the number of tasks queued at the $j$th server, that is, $O(f(m_j))$. Since the $j$th server decomposes the DCS into $\eta$ pairs of DCSs, the overloaded server must solve the optimization problem (4) for $n = 2$ at most $\eta$ times. Further, by construction the algorithm must solve such an optimization problem, at the $j$th server, no more than $N$ times. From this, we observe that the complexity of the algorithm is $O(N(n-1)f(m_j))$. In addition, if an exhaustive search over the number of tasks to reallocate is conducted, then $f(m_j)$ is bounded by $m_j$ because no more than $m_j$ tasks need be reallocated.

5 RESULTS

5.1 Small-scale experiments

In this subsection the service reliability of a small-scale DCS with a representative network topology has been analyzed. The DCS corresponds to a 20-node, nationwide distributed system where servers are located in several cities in the US, as shown in Fig. 1. In our calculations we have considered two classes of CEs: HP servers and standard servers. Since in a DCS HP servers are expected to serve more tasks than standard CEs, they are supplied with a larger number of communication links. In particular, all those servers with five or more communication links in the topologies depicted in Fig. 1 are regarded as HP servers. Due to geographical proximity, in the 20-node DCS the following PSRGs have been defined: PSRG-1, composed of servers 10, 11, 17, and 19; PSRG-2, composed of servers 1 and 7; PSRG-3, composed of servers 6 and 8; and PSRG-4, composed of servers 4, 5, 9, and 18. For simulation purposes, we suppose that each DCS may be affected by only four different PSRG events, each associated with one of PSRGs 1 to 4 and each having a likelihood of occurrence of $\pi(A) = 0.25$.

Fig. 1: Topology of a sample small-scale DCS.

The non-Markovian stochastic dynamics of DCSs have been simulated assuming that both the service and the task-transfer times follow Pareto distributions. Pareto distributions have been selected because, after experimentally characterizing the dynamics of the testbed DCS described in Appendix B, the empirical probability density functions (pdfs) of service and task-transfer times are best fitted by Pareto distributions. The average task-processing time of the HP servers was set to 1 s, while the standard deviation of the task-processing time was set to 0.25 s. The average task-processing time of the standard servers was set to ten times the average task-processing time of the HP servers, and their standard deviation was set to 4 seconds. For the task-transfer times, we follow [28] to introduce meaningful communication delays and define the average task-transfer time to be five times the average service time of the standard (slowest) servers. Further, the mean transfer time of $l_{ij}$ tasks from the $i$th server to the $j$th, $T_{ij}$, is calculated as $T_{ij} = a_{ij} l_{ij} + b_{ij}$, where $a_{ij}$ and $b_{ij}$ are positive constants that depend on the link connecting the $i$th and the $j$th servers. The parameters $a_{ij}$ and $b_{ij}$ were set to 1 second per task and 2 seconds, respectively, as in [28].
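Since a Pareto(I) distribution is fully determined by its shape $\alpha$ and scale $x_m$, the simulated service times can be generated by moment matching: from the textbook relations $\mu = \alpha x_m/(\alpha-1)$ and $\sigma^2/\mu^2 = 1/(\alpha(\alpha-2))$ (valid for $\alpha > 2$), one obtains $\alpha = 1 + \sqrt{1 + \mu^2/\sigma^2}$. A sketch under these standard relations (not the paper's code) follows:

```python
import math
import random

def pareto_from_moments(mean: float, std: float):
    """Return (alpha, x_m) of a Pareto(I) law with the given mean and std (needs alpha > 2)."""
    alpha = 1.0 + math.sqrt(1.0 + (mean / std) ** 2)
    x_m = mean * (alpha - 1.0) / alpha
    return alpha, x_m

def sample_pareto(alpha: float, x_m: float, rng: random.Random) -> float:
    # Inverse-CDF sampling: X = x_m * U^(-1/alpha) for U ~ Uniform(0, 1]
    u = 1.0 - rng.random()
    return x_m * u ** (-1.0 / alpha)

# HP servers in the simulations: mean 1 s, std 0.25 s
rng = random.Random(0)
alpha, x_m = pareto_from_moments(1.0, 0.25)
t_service = sample_pareto(alpha, x_m, rng)
```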

Regarding failure times, we follow the results in [19], [20] and assume that the failure times in both cases, correlated and independent failures, follow Weibull distributions with the same average failure times; that is, the average failure time of a server that crashes independently is 600 s, while the average failure time of all those servers in a PSRG crashing simultaneously upon the occurrence of a PSRG event is also 600 s. In addition, we have considered a scenario of relatively small and uniform failure probability for the servers conditional on a PSRG, namely, $p_i^A = 0.35$ for all $i$. To make fair comparisons, in our simulations we have adjusted the average number of failed servers to be the same for the cases of correlated and independent failures.
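A Weibull failure time with a prescribed mean can likewise be parameterized via the standard relation $\text{mean} = \lambda\,\Gamma(1 + 1/k)$, where $k$ is the shape and $\lambda$ the scale. In the brief sketch below, the shape value 0.8 is a hypothetical assumption made only for illustration (the paper does not state a shape):

```python
import math
import random

def weibull_scale_for_mean(mean: float, shape: float) -> float:
    """Scale lambda such that Weibull(shape, lambda) has the given mean,
    using mean = lambda * Gamma(1 + 1/shape)."""
    return mean / math.gamma(1.0 + 1.0 / shape)

# Failure times with a 600-s mean, for an assumed (hypothetical) shape of 0.8
rng = random.Random(0)
shape = 0.8
scale = weibull_scale_for_mean(600.0, shape)
t_fail = rng.weibullvariate(scale, shape)  # random.weibullvariate(alpha=scale, beta=shape)
```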


TABLE 1: The service reliability of small- and large-scale DCSs in the presence of both independent and spatially correlated failures.

20-node DCS
Initial       PSRG-1          PSRG-2          PSRG-3          PSRG-4
Allocation    Indep.  Corr.   Indep.  Corr.   Indep.  Corr.   Indep.  Corr.
Uniform 1     0.835   0.830   0.869   0.681   0.866   0.694   0.822   0.775
Uniform 2     0.787   0.785   0.812   0.653   0.829   0.649   0.794   0.701
Uniform 3     0.820   0.818   0.850   0.673   0.852   0.682   0.813   0.746
Comp-Pwr.     0.877   0.874   0.897   0.681   0.889   0.691   0.874   0.757

Grid5000 DCS
Initial       PSRG-1          PSRG-2          PSRG-3          PSRG-4
Allocation    Indep.  Corr.   Indep.  Corr.   Indep.  Corr.   Indep.  Corr.
Uniform       0.791   0.527   0.762   0.646   0.781   0.629   0.794   0.584
Comp-Pwr.     0.913   0.602   0.905   0.692   0.927   0.688   0.916   0.657

Regarding the workload processed by the DCSs, we assume in our calculations that an application composed of $M = 5000$ tasks is allocated onto the CEs at time $t = 0$. We have considered four different initial allocations. The initial task allocations labeled Uniform-1, Uniform-2, and Uniform-3 correspond, respectively, to an initial uniform allocation of tasks onto all the CEs, onto the HP servers only, and onto the standard CEs only, while the allocation labeled Computing-Power corresponds to an initial allocation of tasks proportional to the relative processing speed of the CEs. In addition, a DTR policy with a reallocation criterion based solely on the relative processing speed of the CEs has been considered. All the estimated values reported here correspond to centers of intervals with 95% confidence, over which the estimates do not differ from the true value by more than 5%.

Table 1 lists the optimal service reliability for the four different initial task allocations considered, and for both independent and correlated failures. In the case of independent failures, the optimal service reliability was calculated by means of the pairwise decomposition of a DCS presented in [28]. In the case of correlated failures, the optimal service reliability was approximated by first generating a sample of PSRG-correlated failures; next, the conditional service reliability was calculated analytically by solving (9) and (10); and finally (1) was used to compute the estimated service reliability for a fixed DTR policy. It can be observed from Table 1 that, even though the DTR policy and the average number of failures are the same, PSRG-correlated failures diminish the service reliability as compared to the case of independent failures. For the cases presented here, as a result of correlation in the failures, the service reliability is reduced by up to 21% for the 20-node DCS. The reduction in the service reliability is explained by the fact that the likelihood of failure of an HP server increases when correlated failures induced by a PSRG affect a DCS, as compared to the case of independent failures. For independent failures, any server in the system may fail; however, when a PSRG event affects a DCS, only a specific subset of servers is prone to fail. Recall that the average number of failures is the same for both independent and correlated failures, and that each PSRG, with the exception of PSRG-1 in the 20-node DCS, contains one or two HP servers. Thus, it can be easily observed that the likelihood of failure of an HP server increases in the presence of correlated failures induced by a PSRG. The same ideas also explain why independent and correlated failures yield approximately the same service reliability for the 20-node DCS in the case of the PSRG-1 event. In Appendix C, the negative effects of correlated failures induced by PSRG events on the average fraction of tasks served by the DCS are presented in detail as an additional result.

We show in Fig. 2(a) the service reliability of the DCS as a function of the DTR policy, when the initial task allocation is Uniform-2 and the PSRG-2 event induces correlated failures in the system. The DTR policy is represented in the figure as the ratio of tasks exchanged among all the CEs. Since in the Uniform-2 allocation tasks initially sit at the standard CEs, the DTR policy shown in the figure corresponds to the case of transferring tasks from standard servers to HP servers. Results suggest that the service reliability, in the presence of correlated failures generated by a PSRG event, may drop by between 5 and 25% as compared to the case of independent failures. Once again, this result is attributed to the fact that, when correlated failures occur, it is more likely that an HP server fails, as compared to the likelihood of failure of HP servers when independent failures affect the DCS. Moreover, Fig. 2(a) also illustrates the effect of properly selecting the number of tasks to migrate among the CEs: when the task-transfer delays are negligible compared to the average service times of tasks, the optimal DTR policy corresponds to the initial partition specified by (2). However, when the task-transfer delays are not negligible, as in the example shown here, such a selection is no longer optimal, and by transferring only 95% (90%) of the tasks specified by (2), a maximal service reliability of 0.812 (0.653) is achieved when independent (correlated) failures affect the DCS's dynamics. In Fig. 2(b) the service reliability of the DCS in the presence of both independent and correlated failures is depicted as a function of the instant when the DTR policy is executed by the CEs. Results suggest that an excessive delay in executing the DTR policy considerably reduces the service reliability regardless of the type of failure affecting the DCS.

Fig. 2: The service reliability of the small-scale DCS as a function of: (a) the fraction of tasks reallocated; and (b) the instant when the DTR policy is executed.

In order to experimentally validate our theory, we coded the DTR policy for maximizing the service reliability on the testbed DCS described in Appendix B and emulated the 20-node DCS shown in Fig. 1. In order to yield predictions for the service reliability, the random times driving the dynamics of the testbed must first be experimentally characterized. To do so, Pareto distributions were fitted to the service times and transfer times of the testbed. The average service times, the average transfer times, and the $a_{ij}$ and $b_{ij}$ parameters were estimated using maximum-likelihood estimators, as in [28]. From the experimental characterization, we have adjusted the DC application so that the average service times at the CEs are between 5 and 10 seconds for the standard servers and between 1 and 1.5 seconds for the HP servers. Also, traffic shapers have been set so that the estimated $a_{ij}$ and $b_{ij}$ parameters are between 5 and 10 seconds per task, and 1 to 3 seconds, respectively. The initial workload in the DCS was set to 2000 tasks, and the failure times of the CEs failing in a correlated manner were assumed to follow Weibull distributions with average failure times of 400 s. As in the simulations, we considered a scenario of relatively small and uniform failure probability for the CEs, conditional on a PSRG, i.e., $p_i^A = 0.35$ for all $i$. Also, to make fair comparisons, we have adjusted the average number of failed CEs to be the same for the cases of correlated and independent failures. In the experimental setup, the reliability was calculated by averaging a total of 500 independent realizations of failure patterns for each policy, while in the case of the theoretical predictions 10,000 realizations of independent failure patterns were employed for each policy.
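For reference, the maximum-likelihood estimators for a Pareto(I) sample are closed-form: $\hat{x}_m = \min_i x_i$ and $\hat{\alpha} = n / \sum_i \ln(x_i/\hat{x}_m)$. A sketch of the fitting step using these standard estimators (not the paper's code) is:

```python
import math
from typing import List, Tuple

def fit_pareto_mle(samples: List[float]) -> Tuple[float, float]:
    """Closed-form MLE for Pareto(I): x_m_hat = min(samples),
    alpha_hat = n / sum(log(x / x_m_hat)). Assumes samples are not all equal."""
    x_m = min(samples)
    log_sum = sum(math.log(x / x_m) for x in samples)
    alpha = len(samples) / log_sum
    return alpha, x_m
```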

Figure 3 shows the theoretical predictions and the experimental results for the service reliability in the presence of PSRG-2 correlated failures, as a function of the fraction of tasks reallocated among the servers, when the Uniform-1 initial task allocation was employed to partition the application at time $t = 0$. As in the simulations, in Fig. 3 two types of DTR policies have been considered: one that disregards the fact that failures occur in a correlated manner (labeled "CF-Unaware") and another policy accounting for correlated failures (labeled "CF-Aware"). First, observe that Fig. 3 visually suggests a fairly good agreement between the theoretical predictions and the experimental results.

TABLE 2: Maximal service reliability for two types of DTR policies and different initial allocations of tasks.

              Maximal Service Reliability
              Theoretical            Emulated
Initial load  CF-Aware  CF-Unaware   CF-Aware  CF-Unaware
Uniform 1     0.795     0.653        0.772     0.667
Uniform 2     0.786     0.640        0.791     0.669
Uniform 3     0.751     0.622        0.772     0.614
Comp-Pwr.     0.702     0.608        0.726     0.615

This subjective assessment is confirmed by the fact that maximum absolute errors smaller than 4% are obtained between the theoretical curves and the curves obtained from the emulated DCS. As before, we see that if a DTR policy does not take into account the effects of correlated component failures, then the service reliability is reduced by up to 22%. The largest reduction in the service reliability is observed in an operational regime where more tasks are exchanged among servers. Last, in Table 2 we list the maximal service reliability for both the correlated-failure-unaware and the correlated-failure-aware DTR policies for different initial allocations of the 2000 tasks, when PSRG-2 events induce correlated failures. The optimal values for the service reliability have been obtained by solving the optimization problem (4). Table 2 shows that, regardless of how the tasks have been initially allocated onto the servers, the service reliability is indeed greater when a DTR policy exploits the knowledge about the correlated component failures occurring in the system.

Finally, we present in Appendix C experiments conducted on a 38-node DCS, which show the diminishing effect of correlated failures on the service reliability.

Fig. 3: The service reliability as a function of the fraction of tasks reallocated among the CEs, when independent and correlated failures affect an emulated DCS.

5.2 Large-scale experiments

In order to study the effect of correlated failures in a more realistic scenario, we have analyzed the large-scale grid computing system called Grid5000 [29]. Grid5000 is a computing grid geographically distributed across 9 cities, termed sites, and composed of 15 clusters with a total of 1288 nodes. For simplicity, we have studied here Grid5000's behavior in the presence of correlated failures at the cluster level; that is, we considered Grid5000 as a 15-server DCS where each of the 15 servers summarizes the behavior of all the nodes forming the corresponding cluster. Since Grid5000 is geographically distributed across 9 different sites, which host one or more clusters each, we have defined PSRGs for all those sites containing two or more clusters. Thus, the so-called PSRG-1 is composed of clusters 1 to 4, PSRG-2 is composed of clusters 7 and 8, PSRG-3 is composed of clusters 9 and 10, while PSRG-4 is composed of clusters 11 and 12. The remaining clusters (5, 6, 13, 14, and 15) are not associated with any SRG, and we have assumed that they fail independently. Based on this definition of PSRGs, we suppose that Grid5000 may be affected by four different PSRG events, each associated with one of PSRGs 1 to 4. The likelihood of occurrence of each PSRG event is $\pi(A) = 0.25$.

To model the dynamics of the 15 clusters in Grid5000, we downloaded traces from The Grid Workload Archive [30] and The Failure Trace Archive [25]. We pooled samples of all the nodes forming a cluster and employed parametric as well as non-parametric pdf-estimation methods to fit proper distributions for both the task-execution times and the failure times at the cluster level. Heterogeneity was introduced in the task-execution times by filtering the data, using the "Application class" field provided in the traces, and considering only the 15 application classes with the largest number of samples. Table 6 in Appendix C lists statistics and the fitted distributions for the servers' task-execution times in Grid5000. Note that, at the cluster level, average task-service times are highly heterogeneous, as observed in Table 6 and as indicated by a coefficient of variation of 1.985. The heterogeneity in the service times is also observed by noting that such times seem to follow different parametric distributions, such as Gamma, Extreme Value, and Pareto, as well as non-parametric distributions, such as mixtures of 2 and 4 Gaussian kernels.

Regarding failure times, we also pooled samples of all the nodes in a cluster and filtered the data to consider only the availability times of the 15 clusters. Table 6 lists statistics as well as the fitted distributions for the failure times in Grid5000. We note that, at the cluster level, the average failure times in Grid5000 are more-or-less homogeneous, with a coefficient of variation of 0.645. To establish fair comparisons and simplify our simulations, the failure time of all the clusters within a PSRG is assumed to follow the distribution with the largest average failure time. In addition, as in the previous examples, we have considered a scenario of relatively small yet uniform failure probability for the clusters, conditional on a PSRG: we assumed that the $i$th cluster in a PSRG may fail with probability $p_i^A = 0.35$ for all $i$. Regarding task-transfer times, we safely assumed that communication delays are negligible because sites in Grid5000 are interconnected by high-speed VLANs at a speed of 1 Gbps. Further, we also assumed that the topology of the network is a full mesh.

As in the previous simulations, and in order to establish fair comparisons, we have adjusted the average number of failed servers to be the same for the cases of correlated and independent failures. The workload to be processed by the Grid5000 DCS is composed of $M = 100{,}000$ tasks, which are allocated onto the servers, at time $t = 0$, using a uniform allocation of tasks, labeled "Uniform," and an allocation proportional to the relative processing speed of the clusters, labeled "Computing-Power." Note that, on average, the simulated Grid5000 DCS is capable of serving a workload of approximately 190,000 tasks. Regarding the DTR policy, we have again employed a reallocation criterion based on the relative processing speed of the servers.

In Table 1 we have also listed the optimal service reliability for the two different initial task allocations considered, when independent and correlated failures affect the Grid5000 DCS. As in the case of the 20-node DCS, results in Table 1 show that, regardless of the DTR policy employed by the system, correlated failures heavily reduce the service reliability as compared to the case of independent failures. Results also show that the largest reduction in the service reliability occurs when PSRG-1 is affected by correlated failures. This is attributed to the following factors. First, a correlated failure occurring at PSRG-1 produces the simultaneous failure of four clusters, meaning that about 25% of the total number of clusters in the DCS becomes unavailable. Second, the four clusters in PSRG-1 have a combined processing power of 0.886 tasks per second, which represents 60% of the total computing power of the DCS. Third, the DTR policy used by the DCS reallocates more workload onto clusters 3, 4, and 12, since they have the largest processing speeds. Consequently, a correlated failure at PSRG-1 reduces the likelihood of serving the workload because of the simultaneous reduction in the number of available clusters, the large reduction in the processing power of the Grid5000 DCS, and the incorrect reallocation of workload onto clusters that are prone to fail in a correlated manner. We note that the second-largest reduction in the service reliability occurs when correlated failures affect PSRG-4. This reduction is explained by the facts that: (1) PSRG-4 has a combined processing power of 0.550 tasks per second, which represents about 37% of the total computing power of the DCS; and (2) PSRG-4 contains cluster 12, which has the second-largest processing speed in the system, making that cluster an excellent candidate to receive reallocated workload.

5.3 Discussion

The reduction in the service reliability in the presence of correlated failures is a consequence of using an unsuitable DTR policy, one that is not aware of the type of correlation present in the failure patterns. In order to observe the effect on the service reliability of including the information about the correlation induced by PSRGs in the failure patterns, we compare the DTR reallocation criterion based solely on the processing speed of the servers (labeled "CF-Unaware") with the correlated-failure-aware DTR policy (labeled "CF-Aware") that we proposed in (3). In this example (as in the previous cases), the workload is distributed at time $t = 0$ using four different allocations. The comparison results are listed in Table 3. It can be clearly observed that including the information about the correlation induced by the PSRGs in a DTR policy yields a considerable increase in the service reliability as compared to a policy that neglects such information. Moreover, by comparing Tables 1 and 3 it can be observed that, for the 20-node DCS, the optimal values for reallocating tasks dictated by the correlated-failure-aware DTR policy increase the service reliability to a level which is almost the same as in the case of independent failures, thereby compensating for the negative impact induced by correlated failures on the system's reliability. This compensation effect is achieved because the correlated-failure-aware policy: (1) smartly moves the computations (tasks) away from those servers that are prone to fail in a correlated manner to all those servers that do not share the same likelihood of failure; and (2) simultaneously leaves a small fraction of the load on those servers that may fail in a correlated manner, thereby exploiting their computing capabilities. These two features of the correlated-failure-aware DTR policy devised in this paper are key, since they represent the fundamental trade-off arising in DC in the presence of correlated failures. On one hand, migrating fewer tasks among the CEs is suitable in scenarios where CEs work independently; as such, the effect of correlated failures can be partially compensated. On the other hand, migrating more tasks among the CEs is suitable for cooperative work and increases the coupling between the CEs, which, in turn, has a negative effect on the reliability if correlated failures are not accounted for in a DTR policy. As an additional result, in Appendix C we compare the CF-Aware and CF-Unaware DTR policies and their effect on the service reliability for the case of the Uniform-2 initial task allocation.

TABLE 3: Service reliability of small- and large-scale DCSs achieved by correlated-failure-aware and -unaware DTR policies.

20-node DCS
Initial       PSRG-1                PSRG-2                PSRG-3                PSRG-4
Allocation    CF-Aware  CF-Unaware  CF-Aware  CF-Unaware  CF-Aware  CF-Unaware  CF-Aware  CF-Unaware
Uniform 1     0.841     0.830       0.823     0.681       0.831     0.694       0.846     0.775
Uniform 2     0.775     0.785       0.801     0.653       0.809     0.649       0.828     0.701
Uniform 3     0.819     0.818       0.810     0.673       0.817     0.682       0.814     0.746
Comp-Pwr.     0.877     0.874       0.837     0.681       0.825     0.691       0.829     0.757

Grid5000 DCS
Initial       PSRG-1                PSRG-2                PSRG-3                PSRG-4
Allocation    CF-Aware  CF-Unaware  CF-Aware  CF-Unaware  CF-Aware  CF-Unaware  CF-Aware  CF-Unaware
Uniform       0.631     0.527       0.660     0.646       0.671     0.629       0.644     0.584
Comp-Pwr.     0.781     0.602       0.796     0.692       0.791     0.688       0.795     0.657

For the Grid5000 DCS, we note that the use of the correlated-failure-aware DTR policy enhances the service reliability of the system by up to 28%, as compared to the use of a policy that unwisely disregards the effect of correlated failures. Unlike the case of the 20-node DCS, the maximal service reliability yielded by the correlated-failure-aware DTR policy does not achieve the same level of reliability as in the case of independent failures. This is due to the fact that the clusters with the largest individual and combined processing capabilities are more likely to fail in a correlated manner, because they belong to PSRG-1 and PSRG-4. Thus, the correlated-failure-aware policy is severely constrained and cannot reallocate enough tasks to the fastest clusters in the system; consequently, it is capable of only partially compensating for the negative impact of correlated failures on the system's reliability.

6 CONCLUSIONS

This paper sheds light on the impact of spatially correlated failures on the reliability of DCSs and presents strategies for mitigating the adverse effects of correlated failures using simple preemptive DTR policies that are aware of the failure statistics. The concept of the SRLG, which is used in the routing community as an effective mechanism to model and simulate correlated failures, has been adapted in this paper to introduce the idea of PSRGs. Under this concept, correlated failures resulting from real-world massive disruptions and/or physical attacks on the DCS infrastructure can be modeled, and correlation can be introduced by defining or identifying those CEs that may suffer from a common stress event.

The effects of correlated failures on the service reliability have been investigated thoroughly using the proposed reliability model. Results show that correlated failures reduce both the service reliability and the average number of tasks executed by a DCS as compared to scenarios of independent failures. Notably, a correlated-failure–aware DTR policy has been proposed in order to enhance the service reliability of the system. Moreover, it has been observed that the optimal selection of the number of tasks to be reallocated among the CEs depends upon the degree of correlation in the failures.


ACKNOWLEDGMENT

This work was supported by the Defense Threat Reduction Agency (Combating WMD Basic Research Program). J. E. Pezoa was also supported by CONICYT, FONDECYT Iniciacion Folio 11110078. The authors wish to thank the Center for Advanced Research Computing at the University of New Mexico for its help.

REFERENCES

[1] M. Litzkow et al., "Condor - a hunter of idle workstations," in Proc. 8th Int. Conf. Dist. Comp. Systems, 1988, pp. 104–111.

[2] D. Vidyarthi et al., "Maximizing reliability of a distributed computing system with task allocation using simple genetic algorithm," J. Syst. Arch., vol. 47, pp. 549–554, 2001.

[3] J. Palmer et al., "Empirical and analytical evaluation of systems with multiple unreliable servers," in Proc. Int. Conf. Dependable Syst. and Nets., Philadelphia, PA, 2006, pp. 517–525.

[4] B. Schroeder et al., "A large-scale study of failures in high-performance computing systems," in Proc. Int. Conf. Dependable Syst. & Nets., Washington, DC, USA, 2006, pp. 249–258.

[5] D. Kondo et al., "On correlated availability in internet-distributed systems," in Proc. 9th IEEE/ACM Int. Conf. on Grid Computing, ser. GRID '08, Washington, DC, USA, 2008, pp. 276–283.

[6] M. Gallet et al., "A model for space-correlated failures in large-scale distributed systems," in Proc. 16th Euro-Par Conf. on Parallel Processing, 2010, pp. 88–100.

[7] P. Joshi et al., "PREFAIL: A programmable failure-injection framework," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2011-30, Apr. 2011. [Online]. Available: http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-30.html

[8] Y. S. Dai et al., "Reliability analysis of grid computing systems," in Proc. Pacific Rim Int. Symposium on Dependable Computing, 2002, pp. 97–103.

[9] J. E. Pezoa et al., "Performance and reliability of non-Markovian heterogeneous distributed computing systems," IEEE TPDS, vol. 23, no. 7, pp. 1288–1301, 2012.

[10] S. Nath et al., "Tolerating correlated failures in wide-area monitoring services," Intel Research, Tech. Rep., 2004.

[11] Z. Zhang et al., "Proactive failure management for high availability computing in computer clusters," in Proc. 3rd Int. Joint Conf. Computational Science and Optimization, 2010, pp. 377–381.

[12] K. Kim et al., "Assessing the impact of geographically correlated failures on overlay-based data dissemination," in GLOBECOM, 2010, pp. 1–5.

[13] N. Hamed Azimi et al., "Data preservation under spatial failures in sensor networks," in Proc. 11th ACM Int. Symp. Mobile Ad Hoc Networking and Computing, 2010, pp. 171–180.

[14] H. Weatherspoon et al., "Introspective failure analysis: Avoiding correlated failures in peer-to-peer systems," in Proc. IEEE Symp. Reliable Distributed Systems, 2002.

[15] K. Goseva-Popstojanova et al., "Failure correlation in software reliability models," IEEE Trans. Reliability, vol. 49, pp. 37–48, 2000.

[16] Y. Dai et al., "Modeling and analysis of correlated software failures of multiple types," IEEE Trans. Reliability, vol. 54, pp. 100–106, 2005.

[17] Y.-S. Dai et al., "Performance and reliability of tree-structured grid services considering data dependence and failure correlation," IEEE Transactions on Computers, vol. 56, pp. 925–936, 2007.

[18] L. Fiondella et al., "Estimating system reliability with correlated component failures," International Journal of Reliability and Safety, vol. 4, no. 2–3, pp. 188–205, 2010.

[19] D. Kondo et al., "The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems," in Proc. 10th IEEE/ACM Int. Conf. Cluster, Cloud and Grid Computing, ser. CCGRID '10, Washington, DC, USA, 2010, pp. 398–407.

[20] A. Iosup et al., "On the dynamic resource availability in grids," in Proc. 8th IEEE/ACM Int. Conf. on Grid Computing, ser. GRID '07, Washington, DC, USA, 2007, pp. 26–33.

[21] V. Shestak et al., "Robust sequential resource allocation in heterogeneous distributed systems with random compute node failures," in Proc. Int. Parallel and Dist. Proc. Symposium, 2009.

[22] T. Ma et al., "Evaluation of the QoS of crash-recovery failure detection," in Proc. ACM Symposium on Applied Computing, 2007, pp. 538–542.

[23] D. Papadimitriou et al., "Inference of shared risk link groups," IETF, Internet draft, 2002. [Online]. Available: http://www3.tools.ietf.org/html/draft-many-inference-srlg-02

[24] K. Lee et al., "Cross-layer survivability in WDM-based networks," IEEE/ACM Trans. Netw., vol. 19, no. 4, pp. 1000–1013, 2011.

[25] INRIA, "The Failure Trace Archive," http://fta.inria.fr, 2012. [Online; accessed May 2012].

[26] M. J. Ciaraldi et al., "Risks in anonymous distributed computing systems," in Proc. International Network Conference, ser. INC 2000, 2000.

[27] S. Neumayer et al., "Assessing the vulnerability of the fiber infrastructure to disasters," IEEE/ACM Transactions on Networking, vol. PP, no. 99, p. 1, 2010.

[28] J. E. Pezoa et al., "Maximizing service reliability in distributed computing systems with random failures: Theory and implementation," IEEE Trans. Parallel and Dist. Systems, vol. 21, no. 10, pp. 1531–1544, 2010.

[29] INRIA, "Grid5000," http://www.grid5000.fr, 2012. [Online; accessed May 2012].

[30] PDS Group, TU Delft, "The Grid Workload Archive," http://gwa.ewi.tudelft.nl, 2012. [Online; accessed May 2012].

[31] M. Dodge et al., The Atlas of Cyberspace. Addison Wesley, 2008.

Jorge E. Pezoa (S'08-M'10) received his B.Sc. and M.S. degrees in electronics engineering and electrical engineering in 1999 and 2003, respectively, from the Universidad de Concepcion, Concepcion, Chile. He received the Ph.D. degree in electrical engineering from the University of New Mexico in 2010. Dr. Pezoa is currently an Assistant Professor with the Electrical Engineering Department, Universidad de Concepcion, Concepcion, Chile. He is a member of the Society of Photo-optical Instrumentation Engineers, the Optical Society of America, and the Association for Computing Machinery. His areas of interest are distributed computing, pattern recognition, statistical signal processing, network optimization, and hyperspectral image and signal processing for industrial processes.

Majeed M. Hayat (S'89-M'92-SM'00) was born in Kuwait in 1963. He received the B.Sc. degree (summa cum laude) in electrical engineering from the University of the Pacific, Stockton, CA, in 1985 and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of Wisconsin, Madison, in 1988 and 1992, respectively.

He is currently a Professor of electrical and computer engineering with the University of New Mexico, Albuquerque, where he is an Associate Director of the Center for High Technology Materials and the General Chair of the Optical Science and Engineering Program. From 2004 to 2009, he was an Associate Editor of Optics Express. He has authored or coauthored over 75 peer-reviewed articles (H-index of 26) and has four issued patents, three of which have been licensed. His research activities cover a broad range of topics, including avalanche photodiodes, signal and image processing, algorithms for infrared spectral sensing and imaging, optical communication, and modeling and optimization of distributed systems and networks.

Dr. Hayat is a Fellow of the Society of Photo-optical Instrumentation Engineers and the Optical Society of America, and he is a Senior Member of the IEEE. He is currently the Chair of the Topical Committee of Photodetectors, Sensors, Systems and Imaging within the IEEE Photonics Society. He was a recipient of the National Science Foundation Early Faculty Career Award (in 1998) and the Chief Scientist Award for Excellence (in 2006) from the National Consortium for Measures and Signatures Intelligence Research and the Defense Intelligence Agency.


Supplementary file for:

RELIABILITY OF HETEROGENEOUS DISTRIBUTED COMPUTING SYSTEMS IN THE PRESENCE OF CORRELATED FAILURES

Jorge E. Pezoa and Majeed M. Hayat

APPENDIX A
SERVICE RELIABILITY IN NON-MARKOVIAN DCSs

A.1 Age-dependent service reliability

In [9] we presented a continuous-time, discrete-state-space, distributed queuing model and characterized the stochastic time taken by the DCS in executing a workload under the assumption that all the random times driving the system dynamics are mutually independent and may follow any general probability distribution. For completeness, we review here the state-space representation for the DCS and the analytical model for reliability developed in [9].

Following the formalism in [9], the configuration of an n-server DCS can be described using four time-varying quantities. The first quantity is the n-dimensional column vector $\mathbf{m}(t)$ whose ith component specifies the number of tasks queued at the ith server at time t. The second quantity is the n-by-n binary matrix $\mathbf{F}(t)$ whose ijth element describes the failed ("0") or functional ("1") state of the jth server as perceived by the ith server at time t. The third quantity is the network-state matrix $\mathbf{C}(t)$ that specifies the number of tasks in transit over the network at time t. Specifically, the network-state matrix is an n-by-n matrix whose ijth element is a non-negative integer determining the number of tasks being transferred from the ith to the jth server at time t. Since the vector $\mathbf{m}(t)$ and the matrices $\mathbf{F}(t)$ and $\mathbf{C}(t)$ have finite dimensions and take values on the discrete sets $\Omega_1 = \{0, 1, \ldots, M\}^{n}$, $\Omega_2 = \{0, 1\}^{n^2}$, and $\Omega_3 = \{0, 1, \ldots, M\}^{n^2}$, respectively, it will be useful to define a one-to-one mapping $h : \Omega \to I$, where $\Omega = \Omega_1 \times \Omega_2 \times \Omega_3$, such that for each possible value of the concatenated matrix $(\mathbf{m}, \mathbf{F}, \mathbf{C})$ in $\Omega$, $h(\mathbf{m}, \mathbf{F}, \mathbf{C})$ assigns a positive integer in the index set $I = \{1, 2, \ldots, \kappa\}$, where $\kappa$ is the cardinality of $\Omega$.
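As a concrete illustration of the mapping h, the following Python sketch encodes a state $(\mathbf{m}, \mathbf{F}, \mathbf{C})$ as a single mixed-radix integer. The encoding is our own choice for illustration; [9] only requires that some bijection onto the index set exists.

```python
def state_index(m, F, C, M):
    """Illustrative one-to-one mapping h(m, F, C) -> {1, ..., kappa}.

    Each entry of m and C takes values in {0, ..., M} (radix M + 1)
    and each entry of F is binary (radix 2), so concatenating the
    digits yields a distinct integer for every state.
    """
    idx = 0
    for v in m:                      # n digits, radix M + 1
        idx = idx * (M + 1) + v
    for row in F:                    # n^2 digits, radix 2
        for b in row:
            idx = idx * 2 + b
    for row in C:                    # n^2 digits, radix M + 1
        for v in row:
            idx = idx * (M + 1) + v
    return idx + 1                   # shift so indices start at 1

# Example: a 2-server system with at most M = 3 tasks per queue.
m = [2, 1]
F = [[1, 1], [1, 1]]
C = [[0, 0], [1, 0]]
print(state_index(m, F, C, M=3))
```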

The fourth quantity in the formalism described in [9] is the concatenated matrix $\mathbf{a} = (\mathbf{a}_M, \mathbf{a}_F, \mathbf{a}_C)$, termed the system-age matrix. This matrix contains real-valued age variables, which are responsible for keeping track of the memory of all those random times whose probability distributions have memory. The system-age matrix is composed of: (i) n age variables, $a_{M_i}$, associated with the first stochastic service time of a task at the ith server; (ii) n age variables, $a_{F_{ii}}$, associated with the random failure time of the ith server; (iii) $n^2 - n$ age variables, $a_{F_{ij}}$, associated with the random transfer time of an FN packet from the ith to the jth server; and (iv) up to $n^2 - n$ age variables, $a_{C_{ik}}$, associated with the random transfer time of $l_{ik}$ tasks from the ith to the kth server. In [9] the state-space representation was based on the age-dependent system-state matrix $\mathbf{S}(t) = (\mathbf{m}(t), \mathbf{F}(t), \mathbf{C}(t), \mathbf{a}(t))$, which fully describes the state of an n-server DCS at time t.

As in [9], the stochastic process $\{\mathbf{S}(t), t \geq 0\}$ characterizing the stochastic dynamics of the DCS is introduced. This process is employed to mathematically define the service time of an application as the random time taken by the DCS to execute the entire application, when servers perform a synchronous DTR action at time $t = t_b$, the initial system configuration is as specified by $\mathbf{m}_0$, $\mathbf{F}_0$, $\mathbf{C}_0$, and the initial age of the DCS is as specified by $\mathbf{a}_{M_0}$, $\mathbf{a}_{F_0}$, $\mathbf{a}_{C_0}$. More precisely, the age-dependent stochastic service time is defined as $T_{\ell_0}(t_b, \mathbf{a}_0) \triangleq \inf\{t > 0 : \mathbf{m}(t) = \mathbf{0} \text{ and } \mathbf{C}(t) = \mathbf{0}\}$, where $\ell_0 = h(\mathbf{m}_0, \mathbf{F}_0, \mathbf{C}_0)$ and $\mathbf{a}_0 = (\mathbf{a}_{M_0}, \mathbf{a}_{F_0}, \mathbf{a}_{C_0})$. In addition, the service reliability was precisely defined as the probability that the application can be entirely executed by the system, that is,

$$R_{\ell_0}(t_b, \mathbf{a}_0) \triangleq P\{T_{\ell_0}(t_b, \mathbf{a}_0) < \infty\}. \qquad (8)$$

(For brevity, when the system age is zero we will omit explicit reference to the age in the expression for the service reliability, e.g., see (1) in this paper.) Note that the application service time is infinite when at least one task remains queued at a CE that has already failed. Therefore, the service reliability is a reasonable metric only in settings where servers can fail without recovery and/or in settings where applications cannot continue their execution after a failure.

With all these definitions at hand, an age-dependent stochastic regeneration theory was developed to characterize the service reliability of an n-server DCS. For completeness, we reiterate here the characterization of the service reliability stated in Corollaries 1 and 2 in [9]. To do so, we need to introduce a last set of definitions. First, the age-dependent regeneration time, denoted by $\tau_a$, is defined as the minimum of the following four random variables: the time to the first task service by any CE, the time to the first occurrence of a failure at any CE, the time to the first arrival of an FN packet at any CE, or the time to the first arrival of a group of tasks at any CE. Second, the term $G_X(\alpha)$ is defined as $G_X(\alpha) \triangleq P\{X = \tau_a \mid \tau_a = \alpha\} f_{\tau_a}(\alpha)$, with $X$ any of the random times associated with $\mathbf{m}$, $\mathbf{F}$, or $\mathbf{C}$, and $f_{\tau_a}(\alpha)$ the pdf of the age-dependent regeneration time $\tau_a$. Third, $P\{X = \tau_a \mid \tau_a = \alpha\}$ is the probability that the regeneration event is $\{\tau_a = X\}$ conditional on the event $\{\tau_a = \alpha\}$. Finally, the vector $\mathbf{v}^{(i)}$ (respectively, the matrix $\mathbf{A}^{(ij)}$) is identical to the vector $\mathbf{v}$ (respectively, the matrix $\mathbf{A}$) but with its ith (respectively, ijth) element set to zero.

Theorem 1 (Age-dependent characterization for the service reliability, Pezoa and Hayat [9]). The service reliability in executing an application using an n-server non-Markovian DCS satisfies the following system of recursive, coupled integral equations in the time variable $\xi$, which represents any arbitrary task reallocation instant:

$$
\begin{aligned}
R_\ell(\xi, \mathbf{a}_M, \mathbf{a}_F, \mathbf{a}_C) ={} & \int_0^\xi \Big[ \sum_{i=1}^n G_{W_{i1}}(\alpha)\, R_{\ell_i}\big(\xi-\alpha,\, (\mathbf{a}_M+\alpha)^{(i)},\, \mathbf{a}_F+\alpha,\, \mathbf{a}_C+\alpha\big) \\
& + \sum_{i=1}^n G_{Y_i}(\alpha)\, R_{\ell'_i}\big(\xi-\alpha,\, \mathbf{a}_M+\alpha,\, (\mathbf{a}_F+\alpha)^{(ii)},\, \mathbf{a}_C+\alpha\big) \\
& + \sum_{i=1}^n \sum_{\substack{j=1 \\ j\neq i}}^n G_{X_{ij}}(\alpha)\, R_{\ell_{ij}}\big(\xi-\alpha,\, \mathbf{a}_M+\alpha,\, \mathbf{a}_F+\alpha,\, \mathbf{a}_C+\alpha\big) \Big]\, d\alpha \\
& + \big(1 - F_{\tau_a}(\xi)\big)\, R_\ell(0, \mathbf{a}_M, \mathbf{a}_F, \mathbf{a}_C), \qquad (9)
\end{aligned}
$$

where $\ell \in I$, $\ell'_i = h(\mathbf{m}, \mathbf{F}^{(ii)}, \mathbf{0})$, $\ell_{ij} = h(\mathbf{m}, \mathbf{F}^{(ji)}, \mathbf{0})$, and $R_\ell(0, \mathbf{a}_M, \mathbf{a}_F, \mathbf{a}_C)$ is the initial condition related to the $\ell$th integral equation. Moreover, the initial condition of the service reliability satisfies the system of recursive, coupled integral equations:

$$
\begin{aligned}
R_\ell(0, \mathbf{a}_M, \mathbf{a}_F, \mathbf{a}_C) ={} & \int_0^\infty \Big[ \sum_{i=1}^n G_{W_{i1}}(\alpha)\, R_{\ell_i}\big(0,\, (\mathbf{a}_M+\alpha)^{(i)},\, \mathbf{a}_F+\alpha,\, \mathbf{a}_C+\alpha\big) \\
& + \sum_{i=1}^n G_{Y_i}(\alpha)\, R_{\ell'_i}\big(0,\, \mathbf{a}_M+\alpha,\, (\mathbf{a}_F+\alpha)^{(ii)},\, \mathbf{a}_C+\alpha\big) \\
& + \sum_{i=1}^n \sum_{\substack{j=1 \\ j\neq i}}^n G_{Z_{ji}}(\alpha)\, R_{\ell_{ji}}\big(0,\, \mathbf{a}_M+\alpha,\, \mathbf{a}_F+\alpha,\, \mathbf{a}_C+\alpha\big) \\
& + \sum_{i=1}^n \sum_{\substack{j=1 \\ j\neq i}}^n G_{X_{ij}}(\alpha)\, R_{\ell_{ij}}\big(0,\, \mathbf{a}_M+\alpha,\, \mathbf{a}_F+\alpha,\, \mathbf{a}_C+\alpha\big) \Big]\, d\alpha, \qquad (10)
\end{aligned}
$$

with $\ell \in I$. Proof and further details can be found in [9].

In Subsection A.2 we extend the aforementioned model for service reliability to account for correlated failures. To do so, we employ the concept of PSRGs and build a model for the service reliability in the presence of correlated failures.

A.2 Service reliability in the face of spatially correlated failures

The service reliability in the presence of spatially correlated failures can be calculated by means of a hybrid analytical and MC approach. To do so, the same principles presented in [9] can be applied to obtain regenerative equations for the conditional service reliability of a DCS. Note, however, that the analytical model for the service reliability requires the specification of the probability distribution for the random time when a given set of CEs fails in a correlated manner. Consequently, the following assumptions must be added to the set of assumptions listed in Section 2 in [9]: (i) the pdf of the random time $Y^R$, which represents the time when a set of CEs fails in a correlated manner, is known to follow a general distribution denoted as $f_{Y^R}(t)$; and (ii) the random time $Y^R$ is mutually independent of all the random times listed in Section 2 in [9]. These assumptions are reasonable and allow a tractable model for the service reliability to be developed.
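For illustration only, one MC draw consistent with assumptions (i) and (ii) could look as follows. The per-CE failure probabilities p_fail inside a stressed PSRG and the Weibull choice for $f_{Y^R}$ are assumptions of this sketch, not prescriptions of the model.

```python
import random

def draw_failure_pattern(psrg_events, n):
    """Sketch of one MC draw of a correlated-failure pattern.

    psrg_events: list of (prob_event, members, p_fail) tuples, where
        prob_event is pi(A), members are the CEs covered by the PSRG,
        and p_fail is the (assumed) probability that a member actually
        fails when the event occurs.
    Returns a binary pattern x of length n and a correlated failure
    time Y drawn here, as an example, from a Weibull distribution.
    """
    r, acc = random.random(), 0.0
    pattern = [0] * n
    # Select which PSRG event occurs according to pi(A); if none is
    # selected, the pattern stays all-zero (independent failures only).
    for prob_event, members, p_fail in psrg_events:
        acc += prob_event
        if r < acc:
            for i in members:
                if random.random() < p_fail:
                    pattern[i] = 1
            break
    # Time at which the stressed CEs fail together; any general
    # distribution is admissible for f_{Y^R}.
    y = random.weibullvariate(1000.0, 1.5)
    return pattern, y
```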

Conditional on a correlated failure pattern, we can now formally define the conditional service time of an application as the random time taken by the DCS to execute the entire application when CEs perform a synchronous DTR action at time $t = t_b$, the initial system configuration is as specified by $\mathbf{S}(0) = (\mathbf{m}_0, \mathbf{F}_0, \mathbf{C}_0, \mathbf{a}_{M_0}, \mathbf{a}_{F_0}, \mathbf{a}_{C_0})$, and the correlated failure pattern is specified as $\mathbf{X}^A = \mathbf{x}^A$. More precisely, for $\ell_0 = h(\mathbf{m}_0, \mathbf{F}_0, \mathbf{C}_0)$ we define $T_{\ell_0}(t_b, \mathbf{a}_0) = \inf\{t > 0 : \mathbf{m}(t) = \mathbf{0} \text{ and } \mathbf{C}(t) = \mathbf{0}\}$, with the proviso that the CEs specified by $\mathbf{x}^A$ will fail simultaneously at the random time $Y^A$, conditional on the occurrence of the A PSRG event. Thus, the conditional service reliability can be precisely defined as the probability that the application can be entirely executed by the system when a set of CEs fails in a correlated manner, that is,

$$R_{\ell_0}\big(t_b, \mathbf{a}_0 \mid \mathbf{X}^A = \mathbf{x}^A\big) \triangleq P\{T_{\ell_0}(t_b, \mathbf{a}_0) < \infty \mid \mathbf{X}^A = \mathbf{x}^A\}. \qquad (11)$$

Following the derivation presented in Appendixes A and B of [9] (pages 15 to 17), it is not difficult to prove that, conditional on the failure pattern $\mathbf{X}^A = \mathbf{x}^A$, the service reliability can be characterized by means of a system of recursive integral equations and its corresponding set of initial conditions, which is almost identical to the one presented in Theorem 1, with the sole exception that the service reliability, R, will depend on a particular realization, $\mathbf{X}^A = \mathbf{x}^A$, of a correlated failure triggered by the A PSRG event. More specifically, the dynamics of the service reliability of an n-server DCS, conditional on the failure pattern $\mathbf{X}^A = \mathbf{x}^A$, with an arbitrary initial system configuration denoted as $\mathbf{S} = (\mathbf{m}, \mathbf{F}, \mathbf{C}, \mathbf{a})$ and whose CEs perform a synchronous DTR action at the time $\xi \geq 0$, can be calculated using the recursions in the discrete vector $\mathbf{m}$ and the discrete matrices $\mathbf{F}$ and $\mathbf{C}$ presented in Theorem 1. That is, conditional on a sample of correlated failures induced by the occurrence of the A PSRG event, the system of recursive integral equations (9) with initial conditions (10) can be used to compute the conditional service reliability. Consequently, the service reliability of the DCS in the presence of correlated failures induced by the occurrence of the A PSRG event, $R^A_{\ell_0}(t_b)$, can be estimated by simply averaging over the $\kappa$ samples of correlated failures, as stated in (1).
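The averaging step then reduces to a few lines. In the sketch below, conditional_reliability stands in for a hypothetical solver of the recursions (9)-(10) conditioned on a drawn pattern; its implementation is not shown.

```python
def estimate_reliability(kappa, draw_pattern, conditional_reliability, t_b):
    """Hybrid analytical/MC estimate of the service reliability.

    draw_pattern():  draws one correlated-failure pattern x^A (MC part).
    conditional_reliability(t_b, x):  placeholder for a solver of the
        recursive integral equations (9)-(10) conditioned on x.
    """
    total = 0.0
    for _ in range(kappa):
        x = draw_pattern()
        total += conditional_reliability(t_b, x)
    return total / kappa   # average over the kappa drawn patterns
```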

APPENDIX B
DISTRIBUTED COMPUTING EMULATION TESTBED

In this paper, we have modified the small-scale DCS testbed presented in [28] to emulate the class of correlated failures resulting from real-world massive disruptions and/or physical attacks to the DCS infrastructure. The emulation testbed has been implemented on a dedicated cluster of twenty computers. The occurrence of failures at the CEs is emulated by software. In the case of independent failures, each CE randomly generates a time to fail. In the case of correlated failures, the CE in charge of both providing the initial task allocation and coordinating the beginning of the DC must also draw a failure pattern. Next, such a failure pattern is broadcast to all the CEs in the system using the same data packet employed to trigger the beginning of the execution of the application. Once the failure pattern is received, each CE receiving a failure indication independently generates a random failure time. Upon the occurrence of a failure, a CE is switched from the so-called working state to the failed state. If a CE is in the failed state then it cannot process tasks. Since the testbed has been implemented in a cluster environment, the actual topology of the communication network is a full mesh and the communication delays are negligible because the CEs are connected using Fast Ethernet network cards. Thus, in order to simulate a DCS in a meaningful fashion, the communication delays and the underlying network topology have been emulated as well. In the case of communication delays, artificial latency is introduced into the network by means of traffic-shaping applications. A traffic shaper may reduce the actual transfer speed of the network interfaces to slow speeds such as 1024 to 512 Kbps. The network topology is transformed from a full mesh into a connected graph by simply providing each CE, at the application layer, with a subset of CEs it is able to communicate with.

The software architecture of the emulation testbed is divided into three layers: application, task reallocation, and communication. Layers are implemented in software using POSIX threads. Since we are interested in the class of parallel applications that have no data-dependence constraints between operations, the application selected to illustrate the DC is matrix multiplication. The service of a task has been defined as the multiplication of one row by a static matrix, which is duplicated in all the CEs. To achieve variability in the processing time of the CEs, randomness is introduced by randomly choosing both the number of significant digits and the number of digits in the exponent forming the floating-point representation of each element of the vectors and matrices to be multiplied by the emulated application. Also, the number of significant digits and exponents may follow any arbitrarily specified probability distribution. The application layer also switches the state of a CE from working to failed. The same layer maintains, at each CE, two vectors of n − 1 components which track the failed or working state of the other CEs in the DCS. The first vector stores the number of tasks queued at the other CEs using a long-integer representation. The second vector is binary and indicates which CEs remain functional. The task reallocation layer executes the DTR policy defined for each type of emulation conducted. This layer schedules and triggers the reallocation instants when task reallocation is performed. It also: (i) determines if a CE is overloaded with respect to the other CEs in the system; (ii) selects which CEs are candidates for receiving tasks; and (iii) computes the number of tasks to exchange among the CEs. Finally, the communication layer of each CE handles the transfer of tasks as well as the transfer of FN packets. Each CE uses the UDP transport protocol to transfer an FN packet to the other CEs, while the TCP transport protocol is used to transfer tasks between the CEs.
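A minimal sketch of the failure-emulation logic just described is given below. The Weibull failure-time parameters are placeholders, since the testbed admits any specified distribution, and the class structure is ours rather than the testbed's actual code.

```python
import random
import threading

class EmulatedCE:
    """Illustrative software emulation of a CE's working/failed state."""

    def __init__(self, ce_id):
        self.ce_id = ce_id
        self.working = True          # working/failed state flag

    def on_failure_pattern(self, pattern):
        # Called when the broadcast failure pattern arrives; a CE that
        # is marked in the pattern draws its own random failure time.
        if pattern[self.ce_id]:
            t_fail = random.weibullvariate(300.0, 1.2)  # placeholder pdf
            threading.Timer(t_fail, self._fail).start()

    def _fail(self):
        # Once failed, the CE stops processing tasks.
        self.working = False
```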

APPENDIX C
ADDITIONAL SMALL-SCALE EXPERIMENTS

[Figure: MC estimates of the percentage of tasks served (y-axis, 0-100%) versus the fraction of tasks exchanged (x-axis, 0-1); two curves, labeled Independent and SRG-correlated.]

Fig. 4: The average fraction of tasks served by the DCS as a function of the fraction of tasks reallocated among the CEs in the DCS, for the 20-node DCS.

Here we present two additional results for the 20-node DCS evaluated in Section 5.1. The first additional result is presented in Fig. 4, where the average percentage of tasks executed by the DCS is plotted as a function of the fraction of tasks exchanged among the CEs. This metric was estimated by means of MC simulations, and the results presented in Fig. 4 correspond to centers of 95%-confidence intervals over which the average number of served tasks does not differ from the true value by more than 5%. It can be seen that, on average, a larger number of tasks is served by the DCS when independent failures affect the behavior of the system. This result is expected because values of service reliability closer to one imply that a DCS is able to execute the entire application more times before it fails. The second additional result for the 20-node DCS is shown in Fig. 5. The figure depicts the service reliability of the DCS, for the case of the Uniform-2 initial task allocation, as a function of the total number of tasks exchanged among the CEs. The service reliability has been obtained by means of the correlated-failure–unaware and the correlated-failure–aware DTR policies, when correlated failures generated by a PSRG-2 event affect the dynamics of the DCS. Recall that the presence of correlated failures generated by a PSRG event reduced the service reliability by between 5 and 25% as compared to the case of independent failures. When the correlated-failure–aware policy is employed to migrate tasks among the CEs, the service reliability is not affected by the PSRG-induced correlated failures and such a large reduction in the reliability is not observed.
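The 5% precision criterion used for these intervals can be reproduced with a standard sequential MC scheme. The batch size and the normal-approximation confidence interval in the sketch below are our choices for illustration.

```python
import math

def mc_mean_with_precision(sample_fn, rel_err=0.05, z=1.96,
                           batch=1000, max_samples=10**6):
    """Run MC replications until the 95% confidence half-width of the
    sample mean is within rel_err of the mean.  sample_fn() returns one
    replication, e.g. the fraction of tasks served in one simulated run.
    """
    n, s, s2 = 0, 0.0, 0.0
    while n < max_samples:
        for _ in range(batch):
            x = sample_fn()
            n += 1
            s += x
            s2 += x * x
        mean = s / n
        var = max(s2 / n - mean * mean, 0.0)   # sample variance
        half = z * math.sqrt(var / n)          # CI half-width
        if mean > 0 and half / mean <= rel_err:
            break
    return mean, half
```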

As another additional example of the effect of spatially correlated failures on the service reliability, we have analyzed a 38-node DCS whose underlying communication network is a modified version of the AT&T IP backbone network 2Q2000 [31]. The topology of the small-scale DCS is depicted in Fig. 6. Due to geographical proximity, the following PSRGs have been defined: PSRG-1, composed of servers 4, 5, 6, 7, and 8; PSRG-2, composed of servers 17, 18, and 19; PSRG-3, composed of servers 15, 16, 21, 22, 23, and 24; and PSRG-4, composed of servers 27, 28, 29, 30, and 34. In our simulations we suppose that the 38-node DCS may be affected by only four different PSRG events, each associated with one of the PSRGs 1 to 4 and each having a likelihood of occurrence of π(A) = 0.25.

[Figure: service reliability (y-axis, 0-1) versus the fraction of tasks exchanged (x-axis, 0-1); three curves, labeled Indep. Failures, CF-Unaware Policy, and CF-Aware Policy.]

Fig. 5: Service reliability as a function of the fraction of tasks reallocated among the CEs, when independent and correlated failures affect the CEs.

Fig. 6: Sample DCS interconnected by the AT&T IP backbone network 2Q2000 [31].

Table 4 lists the optimal service reliability for the four different initial task allocations considered, and for both cases of independent and correlated failures. It can be observed from the table that, as in the case of the 20-node DCS, even though the DTR and the average number of failures are the same, PSRG-correlated failures diminish the service reliability as compared to the case of independent failures. Specifically, for the 38-node DCS the service reliability is reduced by up to 18%. We have also compared the correlated-failure–aware policy (labeled as "CF-Aware") proposed in (3) to the correlated-failure–unaware policy (labeled as "CF-Unaware") for the case of the 38-node DCS. In the simulations, four different initial distributions of the workload were considered. The comparison results are listed in Table 5. It can be clearly observed that, as in the

TABLE 4: Service reliability in the presence of independent and correlated failures (38-node DCS).

Initial       PSRG-1          PSRG-2          PSRG-3          PSRG-4
Allocation    Indep.  Corr.   Indep.  Corr.   Indep.  Corr.   Indep.  Corr.
Uniform 1     0.705   0.601   0.671   0.651   0.623   0.502   0.709   0.608
Uniform 2     0.683   0.588   0.602   0.578   0.598   0.487   0.688   0.590
Uniform 3     0.692   0.596   0.657   0.603   0.617   0.493   0.691   0.599
Comp-Pwr.     0.715   0.622   0.699   0.662   0.634   0.511   0.707   0.619

TABLE 5: Service reliability for correlated-failure–aware and correlated-failure–unaware DTR policies when correlated failures affect the CEs (38-node DCS).

Initial       PSRG-1                  PSRG-2
Allocation    CF-Aware  CF-Unaware    CF-Aware  CF-Unaware
Uniform 1     0.729     0.601         0.698     0.651
Uniform 2     0.735     0.588         0.627     0.578
Uniform 3     0.707     0.596         0.684     0.603
Comp-Pwr.     0.728     0.622         0.679     0.662

Initial       PSRG-3                  PSRG-4
Allocation    CF-Aware  CF-Unaware    CF-Aware  CF-Unaware
Uniform 1     0.675     0.502         0.721     0.608
Uniform 2     0.682     0.487         0.746     0.590
Uniform 3     0.693     0.493         0.733     0.599
Comp-Pwr.     0.701     0.511         0.755     0.619

other two cases presented, by including the information about the correlation induced by the PSRGs into a DTR policy, a significant increase in the service reliability can be obtained as compared to a policy that ignores such information.


TABLE 6: Summary statistics and fitted distributions for the task execution times and failure times of the clusters in Grid5000.

Execution time of workloads, s
PSRG/Server  Samples  Data Range    Mean      Std. Dev.  Fitted Distr.
1/1          295535   [2,228741]    589.12    2268.12    Mixture 2
1/2          101792   [0,2866]      57.53     65.20      Mixture 2
1/3          93204    [0,6884]      5.42      93.74      Gamma
1/4          61501    [0,4865]      1.47      27.97      Gamma
–/5          31673    [0,15765]     282.35    453.91     Pareto
–/6          30114    [0,35476]     718.67    2150.89    Pareto
2/7          27590    [0,2659]      387.49    72.18      Extreme Value
2/8          23739    [0,706550]    5341.43   41539.29   Pareto
3/9          23664    [8,13911]     233.71    405.24     Lognormal
3/10         15212    [0,10676]     281.70    655.98     Lognormal
4/11         13266    [0,518400]    12152.92  22905.75   Mixture 4
4/12         5844     [0,61]        1.82      6.96       Exponential
–/13         4751     [0,1623934]   12617.51  58527.49   Mixture 2
–/14         2846     [3,3443]      77.72     140.20     Shft. Exp.
–/15         2674     [1,143]       114.47    6.63       Extreme Value

Failure time of clusters, s
PSRG/Server  Samples  Data Range     Mean       Std. Dev.   Fitted Distr.
1/1          13067    [4,3019024]    155779.54  400084.68   Weibull
1/2          25463    [0,2076562]    80123.09   147926.31   Weibull
1/3          25953    [1,2425914]    94106.21   224171.62   Weibull
1/4          4745     [1,5016838]    137627.75  423754.93   Lognormal
–/5          8208     [0,4950459]    148933.36  394790.03   Weibull
–/6          5804     [2,6586822]    369748.47  707692.00   Lognormal
2/7          34640    [0,36567830]   123835.84  306017.51   Weibull
2/8          2929     [1,3570913]    265545.44  460572.17   Lognormal
3/9          5398     [1,12268635]   364429.36  1140752.44  Lognormal
3/10         2205     [1,1374232]    95297.83   245267.62   Weibull
4/11         13002    [4,9611907]    339971.81  767908.27   Weibull
4/12         1137     [0,2378828]    129359.64  358888.71   Weibull
–/13         21165    [1,1663515]    52456.56   122340.10   Gamma
–/14         123240   [0,3015458]    82291.76   168538.36   Weibull
–/15         7362     [0,3020840]    99435.33   260929.85   Weibull