ieee transactions on parallel and · pdf filethis new approach was implemented and evaluated...

13
An Online Data Access Prediction and Optimization Approach for Distributed Systems Renato Porfirio Ishii and Rodrigo Fernandes de Mello Abstract—Current scientific applications have been producing large amounts of data. The processing, handling and analysis of such data require large-scale computing infrastructures such as clusters and grids. In this area, studies aim at improving the performance of data-intensive applications by optimizing data accesses. In order to achieve this goal, distributed storage systems have been considering techniques of data replication, migration, distribution, and access parallelism. However, the main drawback of those studies is that they do not take into account application behavior to perform data access optimization. This limitation motivated this paper which applies strategies to support the online prediction of application behavior in order to optimize data access operations on distributed systems, without requiring any information on past executions. In order to accomplish such a goal, this approach organizes application behaviors as time series and, then, analyzes and classifies those series according to their properties. By knowing properties, the approach selects modeling techniques to represent series and perform predictions, which are, later on, used to optimize data access operations. This new approach was implemented and evaluated using the OptorSim simulator, sponsored by the LHC- CERN project and widely employed by the scientific community. Experiments confirm this new approach reduces application execution time in about 50 percent, specially when handling large amounts of data. Index Terms—Distributed computing, distributed file system, data access optimization, time series analysis, prediction. Ç 1 INTRODUCTION S CIENTIFIC applications tend to produce and handle large amounts of data. Examples of such applications include the Pan-STARRS project, 1 which captures at about 2.5 petabytes (PB) of data per year, and Large Hadron Collider project (LHC), 2 that generates from 50 to 100 PB of data every year. Those applications tend to consider distributed computing tools to deal with the need for high performance and storage requirements. Such tools have great potential for solving a vast class of complex problems; however, they are still limited in terms of manipulating large amounts of data [1]. Those applications rely on cluster and grid environ- ments, which are the main large-scale computing infra- structures currently available. Clusters are composed of a set of workstations or personal computers with similar architecture, connected via a high-speed local area net- work, in which every station usually executes the same operating system. Grids are organized as federations of computers, in which one federation is a different admin- istrative domain sharing its own resources. This type of infrastructure is characterized by hardware and software heterogeneity [2]. These infrastructures meet the demands imposed by several types of applications, however, there is still room for significative theoretical and practical improvements when dealing with large amounts of data. Those improvements would certainly motivate the adoption of distributed computing in several domains such as high energy physics, financial modeling, earthquake simulations, and climate modeling [3], [4], [5], [6]. However, there are still great challenges to meet such improvements, which comprise dealing with heterogeneous resources, access latency varia- tions, fault detection, and recovery, etc. All those challenges have been motivating studies in different subjects such as job scheduling, load balancing, communication protocols, high-performance hardware ar- chitectures, and data access optimization [4], [7]. This later subject, i.e., data access optimization, is particularly appeal- ing to the Computer Science area, mainly due to the possibility to provide high performance to data-intensive applications [5], [6], [8]. This possibility has motivated the study of the Data Access Problem (DAP) in distributed computing environ- ments [9]. The complexity of this NP-complete problem is justified by the optimization of data location and consis- tency, which involve operations such as data distribution, migration and replication [10]. Data distribution aims at efficiently allocating file chunks on the computing environ- ment attempting to improve the performance of read-and- write operations. Migration is also considered in order to IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 23, NO. 6, JUNE 2012 1017 . R.P. Ishii is with the Faculty of Computer Science, Federal University of Mato Grosso do Sul-UFMS, Cidade Universita´ria, Caixa Postal 549, Campo Grande, MS 79070-900, Brazil. E-mail: [email protected]. . R.F. de Mello is with the Department of Computer Science, Institute of Mathematics and Computer Sciences, University of Sa˜o Paulo-USP, Avenida Trabalhador sa˜o-carlense, 400, Centro, Caixa Postal 668, Sa˜o Carlos, SP 13560-970, Brazil. E-mail: [email protected]. Manuscript received 17 May 2010; revised 19 Aug. 2011; accepted 23 Sept. 2011; published online 19 Oct. 2011. Recommended for acceptance by I. Stojmenovic. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPDS-2010-05-0299. Digital Object Identifier no. 10.1109/TPDS.2011.256. 1. http://pan-starrs.ifa.hawaii.edu/public. 2. http://public.web.cern.ch/public/en/LHC/LHC-en.html. 1045-9219/12/$31.00 ß 2012 IEEE Published by the IEEE Computer Society

Upload: nguyentu

Post on 21-Mar-2018

215 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: IEEE TRANSACTIONS ON PARALLEL AND · PDF fileThis new approach was implemented and evaluated using the OptorSim simulator, ... Iyer [21] propose a ... prediction techniques to model

An Online Data Access Predictionand Optimization Approach

for Distributed SystemsRenato Porfirio Ishii and Rodrigo Fernandes de Mello

Abstract—Current scientific applications have been producing large amounts of data. The processing, handling and analysis of such

data require large-scale computing infrastructures such as clusters and grids. In this area, studies aim at improving the performance of

data-intensive applications by optimizing data accesses. In order to achieve this goal, distributed storage systems have been

considering techniques of data replication, migration, distribution, and access parallelism. However, the main drawback of those

studies is that they do not take into account application behavior to perform data access optimization. This limitation motivated this

paper which applies strategies to support the online prediction of application behavior in order to optimize data access operations on

distributed systems, without requiring any information on past executions. In order to accomplish such a goal, this approach organizes

application behaviors as time series and, then, analyzes and classifies those series according to their properties. By knowing

properties, the approach selects modeling techniques to represent series and perform predictions, which are, later on, used to optimize

data access operations. This new approach was implemented and evaluated using the OptorSim simulator, sponsored by the LHC-

CERN project and widely employed by the scientific community. Experiments confirm this new approach reduces application execution

time in about 50 percent, specially when handling large amounts of data.

Index Terms—Distributed computing, distributed file system, data access optimization, time series analysis, prediction.

Ç

1 INTRODUCTION

SCIENTIFIC applications tend to produce and handle largeamounts of data. Examples of such applications

include the Pan-STARRS project,1 which captures at about2.5 petabytes (PB) of data per year, and Large HadronCollider project (LHC),2 that generates from 50 to 100 PBof data every year. Those applications tend to considerdistributed computing tools to deal with the need for highperformance and storage requirements. Such tools havegreat potential for solving a vast class of complexproblems; however, they are still limited in terms ofmanipulating large amounts of data [1].

Those applications rely on cluster and grid environ-ments, which are the main large-scale computing infra-structures currently available. Clusters are composed of aset of workstations or personal computers with similararchitecture, connected via a high-speed local area net-work, in which every station usually executes the sameoperating system. Grids are organized as federations of

computers, in which one federation is a different admin-istrative domain sharing its own resources. This type ofinfrastructure is characterized by hardware and softwareheterogeneity [2].

These infrastructures meet the demands imposed byseveral types of applications, however, there is still room forsignificative theoretical and practical improvements whendealing with large amounts of data. Those improvementswould certainly motivate the adoption of distributedcomputing in several domains such as high energy physics,financial modeling, earthquake simulations, and climatemodeling [3], [4], [5], [6]. However, there are still greatchallenges to meet such improvements, which comprisedealing with heterogeneous resources, access latency varia-tions, fault detection, and recovery, etc.

All those challenges have been motivating studies indifferent subjects such as job scheduling, load balancing,communication protocols, high-performance hardware ar-chitectures, and data access optimization [4], [7]. This latersubject, i.e., data access optimization, is particularly appeal-ing to the Computer Science area, mainly due to thepossibility to provide high performance to data-intensiveapplications [5], [6], [8].

This possibility has motivated the study of the DataAccess Problem (DAP) in distributed computing environ-ments [9]. The complexity of this NP-complete problem isjustified by the optimization of data location and consis-tency, which involve operations such as data distribution,migration and replication [10]. Data distribution aims atefficiently allocating file chunks on the computing environ-ment attempting to improve the performance of read-and-write operations. Migration is also considered in order to

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 23, NO. 6, JUNE 2012 1017

. R.P. Ishii is with the Faculty of Computer Science, Federal University ofMato Grosso do Sul-UFMS, Cidade Universitaria, Caixa Postal 549,Campo Grande, MS 79070-900, Brazil. E-mail: [email protected].

. R.F. de Mello is with the Department of Computer Science, Institute ofMathematics and Computer Sciences, University of Sao Paulo-USP,Avenida Trabalhador sao-carlense, 400, Centro, Caixa Postal 668, SaoCarlos, SP 13560-970, Brazil. E-mail: [email protected].

Manuscript received 17 May 2010; revised 19 Aug. 2011; accepted 23 Sept.2011; published online 19 Oct. 2011.Recommended for acceptance by I. Stojmenovic.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference IEEECS Log Number TPDS-2010-05-0299.Digital Object Identifier no. 10.1109/TPDS.2011.256.

1. http://pan-starrs.ifa.hawaii.edu/public.2. http://public.web.cern.ch/public/en/LHC/LHC-en.html.

1045-9219/12/$31.00 � 2012 IEEE Published by the IEEE Computer Society

Page 2: IEEE TRANSACTIONS ON PARALLEL AND · PDF fileThis new approach was implemented and evaluated using the OptorSim simulator, ... Iyer [21] propose a ... prediction techniques to model

bring data closer to where it is requested. In that sense, it

must analyze data transfer impacts to reduce costs of access

operations. Replication takes decisions about where and

when to copy files (create replicas) as a way to increase

performance and resilience against failures [11]. Comple-

mentarily, consistency aims at coordinating the update of

multiple file replicas [12], providing a single data view to

distributed applications.All those factors have been studied and possible

solutions proposed, as presented in Section 2, however

they still lack in three substantial points: 1) most works only

consider read operations, what tends to limit their applic-

ability in real-world scenarios; 2) they employ static

approaches to optimize access, i.e., mechanisms that do

not consider the dynamic system behavior; finally, 3) works

still need to employ an extra effort in reducing application

execution time, what depends on data access patterns. In

this context, it is very important to understand, analyze,

estimate, and predict application behavior in order to

optimize read-and-write operations, consequently, redu-

cing execution time. The term behavior in here characterizes

data access operations over time, including, for example,

the operation type (read or write), which and when a file is

accessed, etc.Given such drawbacks, this paper employs strategies to

support the online prediction of application behavior in

attempt to optimize data access operations on distributed

systems. In order to accomplish this goal, our approach

considers the following steps

1. Application knowledge acquisition;2. Organization of process behaviors as time series;3. Analysis of time series generation processes;3

4. Selection of techniques to model times series;5. Definition of how many future observations will be

predicted;6. Prediction of observations, and, finally,7. Execution of the optimization heuristic in attempt to

reduce the time consumed in data access operations.

We designed this approach in order to automatize all those

steps, avoiding or at least expressively reducing human

intervention.Experiments were conducted using the OptorSim simu-

lator [13], sponsored by the LHC/CERN project and widely

employed by the scientific community [9], [14], [15], [16],

[17], [18], [19]. Results confirm this new approach reduces

application execution time in about 50 percent, specially

when dealing with large amounts of data.The remainder of this paper is organized as follows:

related works are presented in Section 2; Section 3 presents

the time series classification approach and experiments on

real data to confirm it is capable of selecting adequate

modeling techniques [20]; the new optimization heuristic

is presented in Section 4 and experimental results are

shown in Section 5; Finally, we present concluding

remarks and references.

2 RELATED WORKS

This section presents related works on the prediction ofapplication behavior and data access optimization ap-proaches.

2.1 Prediction of Application Behavior

Several studies have been considering statistical techniquesto analyze data and construct probabilistic models tocharacterize and predict application workloads in distrib-uted environments. Those models have been employed toassist fault diagnosis, resource allocation, and systemperformance optimizations.

As one of the first works in this area, Devarakonda andIyer [21] propose a statistical approach to predict theconsumption of CPU, file system I/O, and memory. Thisstudy models the behaviors of processes using automata,storing those models in databases. When a new processarrives at the system, the approach verifies if there is anyautomaton capable of representing it. If so, this automatonis used to estimate resource requirements for this nextprocess. Trace-driven experiments confirm a strong correla-tion in between next and past executions.

Faerman et al. [22] propose the Adaptive RegressionModeling (AdRM) to estimate file transferring events ofdistributed applications. AdRM was evaluated consideringtwo data-intensive applications: the Synthetic ApertureRadar Atlas (SARA) and the Storage Resource Broker(SRB). SARA handles a large number of small files (from1 to 3 MB), while SRB deals with files at about 16 MB.Experiments compare AdRM against the Raw Bandwidthmodel (RBW). Results confirm AdRM presents a goodprediction accuracy, however, this work lacks in compar-isons and analysis to other well-known models.

Wang et al. [23] employ machine learning techniques tomodel requests to storage devices, based on ClassificationAnd Regression Trees (CART). This work proposes twoprediction approaches: 1) the first considers one predictionmodel for every data access request; 2) the second modelsand predicts observations based on an average behavior ofrequests. Experiments show that both models present anaverage error of 17 and 38 percent, respectively. The first iscomputationally costly though.

Senger et al. [24] propose an online approach to acquire,classify, and extract process behavior. The informationacquisition is obtained by instrumenting the Linux kernel.In this sense, one does not need to modify nor recompileuser applications in order to monitor them. This informa-tion is submitted to an ART-2A neural network architecture,which groups and labels process behavioral states. Resultsconfirm the approach is capable of automatically modelingprocess behaviors, also presenting good computationalperformance, stability, and plasticity.

Senger et al. [25] also propose an approach to predictexecution times of parallel applications, aiming at improv-ing scheduling decisions in large-scale environments. Thisapproach provides application knowledge to the systemscheduler, which is responsible for optimizing resourceallocation. According to experiments, this approach pre-dicts the average behavior of applications presenting errorsin range [38, 57] percent.

1018 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 23, NO. 6, JUNE 2012

3. In Time Series Analysis, a system is assumed to produce a particulartime series. In that context, term generation process is related to suchsystem properties, i.e., stochasticity, linearity and stationarity.

Page 3: IEEE TRANSACTIONS ON PARALLEL AND · PDF fileThis new approach was implemented and evaluated using the OptorSim simulator, ... Iyer [21] propose a ... prediction techniques to model

Mello [26] employ Chaos-Theory concepts and nonlinearprediction techniques to model and predict processbehavior over time. Results confirm behaviors can bemodeled using Dynamical Systems and, particularly, thatRadial Basis Functions are good estimators for futureprocess states. Schedulers can take advantage of that studyto improve resource allocation.

The studies mentioned in this section demonstratedperformance improvements in many scenarios, such asresource allocation, fault diagnosis, and others. The maingoal of such works is to apply different approaches toanalyze data and construct probabilistic models to char-acterize and predict application workloads in distributedenvironments. One of the main drawbacks of those studiesis that they do not evaluate properties related to applicationbehaviors (such as stochasticity, linearity, and stationarity),therefore, they do not provide or select adequate tools tomodel and predict such behaviors. Moreover, they do notconsider an automatic and complete approach frombehavior extraction to optimization. Such drawbacks arehere addressed as detailed in next sections.

2.2 Data Access Optimization

Several studies have been attempting to improve dataaccess on distributed environments. Such works are mainlyfocused on data replication, distribution, and consistency.In this context, Oldfield and Kotz [27] proposed the Armadaframework to execute, control, and monitor applications.Armada builds graph structures to represent processingand data flows. Graphs, representing process versus datadependencies, support decisions on moving data towardprocesses, thus reducing execution time. Experimentscompare applications running on a traditional environmentand also on Armada. In that scenario, Armada improvesnetwork throughput in around 40 percent.

Kim et al. [28] propose a heuristic to improve data access,i.e., reduce execution time of data-intensive applications.This heuristic attempts to find the lowest cost data sourcefor every process. Authors consider a trace-based syntheticscenario on PlanetLab to evaluate their approach. Resultsconfirm that heuristic outperforms conventional latency-based and random allocation approaches.

Chervenak et al. [29] propose a framework called ReplicaLocation Service (RLS) which maintains and providesinformation on physical locations of replicas. RLS is usedin a variety of production environments such as the LaserInterferometer Gravitational Wave Observatory (LIGO)[30], Earth System Grid (ESG) [31], and Pegasus [32].Authors presented a performance study demonstrating thatthe individual RLS servers perform and scale up well whencompared to the native MySQL using ODBC clients.

AL-Mistarihi and Yong [14] propose an approach toselect the best replica, i.e., which presents the lowestaccess cost, for applications. Authors consider the Analy-tical Hierarchy Process (AHP) to solve that problem, theyevaluate this approach using the OptorSim simulator andcompare it to a random one. Unfortunately, they did notcompare the approach to the most common ones alreadyincluded in OptorSim: LRU, LFU, and the EconomicModel (ECO).

The main drawback of those works is that they do nottake into account the dynamic behavior of applications,predictions nor estimations to perform data access optimi-zation. These issues are considered in this paper.

3 TIME SERIES ANALYSIS

By modeling the outputs produced by real-world systems,we can study and, therefore, understand how they work andbehave under different circumstances. This is speciallyinteresting to support the prediction of behavior and,consequently, support decision making, what is particularlyrequired in certain application domains. Here, we areinterested in predicting read-and-write operations in at-tempt to optimize data access on distributed environments.

Outputs produced by real-world systems present a strongtemporal dependency, i.e., adjacent observations are depen-dent [33]. This dependency strongly reduces the modelingaccuracy of conventional techniques. In order to overcomethis problem, a new area was developed, called Time SeriesAnalysis [34], in which data are commonly organized interms of variables and their observations over time.

Before modeling, nevertheless, it is necessary to under-stand the implicit features embedded in data such as thestationarity, linearity, and stochasticity. By understandingthose features, we can better select modeling techniques toperform prediction. A time series is said to be stationarywhen its observations are in a particular state of statisticalequilibrium, i.e., they evolve over time around a constantaverage and variance [34]. On the other side, in linear timeseries, observations are composed of a linear combination ofpast occurrences and noises [35]. Finally, stochastic timeseries are formed by random observations and relations,which follow probability density functions and may changeover time [34].

The identification of these features, however, is not asimple task and requires not only specialists’ opinions butalso additional information on how series were obtained,what supports the characterization of the time seriesgeneration process, i.e., the system used to produce suchseries. By knowing such generation process, we can selectan adequate technique to model and study time series.However, the main problem of this approach is thesubjectivity imposed by specialists, which can lead tofailures. Moreover, during time series modeling, mainlywhen considering those obtained from real-world systems,researchers neither have further information nor specialistsavailable to support this task [33].

These issues have been constantly limiting the appro-priate modeling of series, mainly when researchers are notspecialists in time series analysis or belong to other areasnot related to statistics or mathematics, such as pharma-ceutical research, biotechnology, medicine, public health,agriculture, and climate modeling. This is also the situationobserved when studying process behavior, as in this paper.Such behavior can be organized as time series and, byunderstanding those series, we can better model and,consequently, predict observations.

The need for specialists and the lack of a step-by-stepanalysis on how to analyze time series motivated Ishii et al.[20] to propose an automatic and systematic approach to

ISHII AND DE MELLO: AN ONLINE DATA ACCESS PREDICTION AND OPTIMIZATION APPROACH FOR DISTRIBUTED SYSTEMS 1019

Page 4: IEEE TRANSACTIONS ON PARALLEL AND · PDF fileThis new approach was implemented and evaluated using the OptorSim simulator, ... Iyer [21] propose a ... prediction techniques to model

evaluate and classify time series generation processes. Thisapproach combines dynamical systems, statistical tools, andtests to support time series classification. All tools areorganized according to a tree view (Fig. 1) and also indicatemodels for every branch.

At the first tree level (Fig. 1), there is the time series. Thesecond level separates deterministic from stochastic series.When a series is deterministic, it is better studied usingChaos-Theory tools, making unnecessary further evalua-tions. Still in the second level, we have the stochastic type ofseries. When under that branch, series still must beevaluated in order to select the best modeling approach.Thus, we proceed with the third level which verifies serieslinearity. Nonlinear series can be addressed by usingnonlinear modeling approaches [36], [37], [38]. When linear,we still need to evaluate whether the series is stationary ornot. When the series is stochastic, linear, and stationary, itcan be modeled by using techniques such as Autoregressivemodels (AR ), Moving averages (MA), or Autoregressivemoving average models (ARMA) [34]. When the series isstochastic, linear, and nonstationary, it can be modeled byusing Autoregressive integrated moving average models(ARIMA) [34].

By using this tree (Fig. 1), Ishii et al. [20] suggestadequate modeling techniques for every branch, which arecapable of dealing with the time series generation process.

In this paper, process behavior is represented by the timeseries having information on read-and-write operations.The generation processes of those time series are thenanalyzed as before mentioned and, thus, modeling techni-ques are selected according to the approach proposed byIshii et al. [20]. Such techniques are employed to predictprocess behaviors over time.

4 ONLINE DATA ACCESS PREDICTION AND

OPTIMIZATION APPROACH FOR DISTRIBUTED

SYSTEMS

This paper proposes an approach to support the onlineprediction of application behavior in an attempt to optimizedata access operations on distributed systems. At a firststep, this approach employs time series analysis tools,detailed in Section 3, to evaluate process behavior and selectmodeling techniques, which are used to predict application

behaviors. At last, predictions are used as input to a newdata access optimization heuristic.

This approach is composed of the following steps (Fig. 2)

1. Application knowledge acquisition;2. Organization of process behaviors as time series;3. Analysis of time series generation processes;4. Selection of techniques to model times series;5. Definition of how many future observations will be

estimated;6. Prediction of observations; and, finally,7. Execution of the optimization heuristic in attempt to

reduce the time consumed in data access operations.

The first step, i.e., application knowledge acquisition, isresponsible for monitoring process behavior by using eventinterception. The interception mechanism is associated withthe process under execution. At every function call, thismechanism is informed, being capable of taking furtherdecisions or simply log information. Different libraries andtools provide interception, the most common ones are theUnix DLSym [39] and Unix Ptrace [40].

DLSym intercepts only calls to dynamic functions, i.e.,functions made available by shared libraries. When aprogram calls a function, DLSym intercepts the call andinjects any code instead. In our approach, we simply inject acode to log all information about the call, i.e., functionname, arguments, and return value. By using DLSym wecan inject any code even in precompiled programs, thus,there is no need to modify and compile the applicationsource code. Consequently, any procedure in a dynamicshared library can be intercepted, what helps to buildmonitoring tools. The second mechanism is Ptrace whichtransparently intercepts process signals and system calls. Itis usually employed to build diagnosis and debug tools.Differently from DLSym, Ptrace can intercept any systemcall or signals from applications.

This paper considered the DLSym mechanism because itefficiently extracts information of data access such asoperation type, file identifier, etc., (see Table 1). Moreover,DLSym has lower impacts on capturing information thanPtrace, as depicted in [9].

After extracting the application behavior, we transformthe sequence of read-and-write events in a multidimen-sional time series. Every event is described by quintupletr ¼ fpid; inode; amt; time; opg in which pid is the processidentifier that performed the operation, inode is the file

1020 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 23, NO. 6, JUNE 2012

Fig. 1. Classification of time series generation processes, adaptedfrom [20].

Fig. 2. The main steps of online predictive approach.

Page 5: IEEE TRANSACTIONS ON PARALLEL AND · PDF fileThis new approach was implemented and evaluated using the OptorSim simulator, ... Iyer [21] propose a ... prediction techniques to model

identifier, amt is the amount of data read or written to disk,time represents the time interval in between consecutiveoperations, and op is the operation type, i.e., read or write.A series containing u samples is, then, defined asTR ¼ ftr0; tr1; . . . ; tru�1g. After this organization, the multi-dimensional time series (see example in Table 1) is stored ina trace file.

The third step evaluates the generation process of timeseries TR according to specific properties: stochasticity,linearity, and stationarity (see Section 3). Having thisevaluation, we are able to select an adequate tool to modelseries and, afterwards, predict observations.

As presented in Section 3, Ishii et al. [20] discuss andpropose an automatic and systematic approach to evaluateand classify time series generation processes. In thatstudy, authors confirm: 1) Recurrence Plots (RP) [41] areuseful to verify stochastic levels of time series; 2) theWhite Neural Network (WNN) test [42] is useful toevaluate the time series linearity; and 3) Space-TimeSeparation Plots (STP) [43] evaluates series stationarity.That approach and tools are here considered to evaluatethe generation process of time series.

Based on the evaluation of the time series generationprocess, we select an adequate modeling technique asdetailed in Section 3. According to Ishii et al. [20], when theseries is deterministic, a reconstruction is conducted byconsidering the Takens’ immersion theorem [44], whichrelates series observations over time. On the other hand,when the series is stochastic and nonlinear, the suggestedtechniques are radial basis function approximation, artifi-cial neural networks, Autoregressive Conditional Hetero-skedasticity (ARCH) [35], filters, and polynomialapproaches [38]. When stochastic, linear, and stationary,the series can be modeled by using AR, MA, or ARMA [34].Otherwise, when the series is stochastic, linear, andnonstationary, it is better modeled using ARIMA [34].

In the next step, we consider the adaptive sliding-window (ASW) mechanism proposed in [9] to estimate thenumber of time series observations to be predicted, basedon process behavior changes. As observed in [9], if we donot consider such estimator and, thus, predict less observa-tions, the optimization heuristic will be more frequentlyexecuted. On the other hand, if we predict more observa-tions, process behavior may change and the exceedingpredictions become inaccurate enough, risking optimizationdecisions. ASW considers the type, number of operations,

and � parameter to adapt the window length according toconsecutive homogeneous operations (i.e., reading orwriting). Equation (1) defines the ASW, where Wtþ1 is thenext window length, Wt is the current window length, Op isthe number of homogeneous operations, and � is the factorwhich determines modifications in the window length

Wtþ1 ¼Wt � ð1� �Þ þOp2�OpWt � �: ð1Þ

For example, given W1 ¼ 10, Op ¼ 4, and � ¼ 0:10, wewould obtain a next window length equals to W2 ¼ 9. Notethat the window adapts according to the number ofoperations under the same type, see Fig. 3. On the otherhand, when Op is equal to Wt (see W4 in Fig. 3) the nextwindow length increases considerably, W5 ¼ 11. Parameter� is experimentally defined (as presented in [9]).

Fig. 3 presents a sample trace file with process behaviorinformation (we focused only in reading (r) and writing(w) operations). The first part of the Fig. 3 shows the initialwindow length (W1 ¼ 10) and all consecutive operations.The second part shows how to compute of the next slidingwindow length based on (1). Finally, the last part showsthe subsequent windows W2 ¼ 9;W3 ¼ 8;W4 ¼ 7, andW5 ¼ 11, which clearly demonstrates the adaptation ofwindow length according to the behavior of reading andwriting operations.

After the previous steps, the prediction is performed onthe time series, which represents process behaviors. Themeasurement considered to evaluate prediction results wasthe Normalized Root Mean Squared Error (NRMSE),defined in (2), in which xmax is the maximum and xmin isthe minimum observed values, xi is the expected value attime instant i, xi is the obtained value at time instant i, i.e.,the predicted observation, and n is the number of predictedobservations [45]. Hypothesis tests were also performed onall prediction results, considering the critical value of thestatistic z ¼ 1:96 for the confidence level of 95 percent

ISHII AND DE MELLO: AN ONLINE DATA ACCESS PREDICTION AND OPTIMIZATION APPROACH FOR DISTRIBUTED SYSTEMS 1021

TABLE 1Example of Trace File

Fig. 3. Example of the Adaptive Sliding Window with � ¼ 0:10.

Page 6: IEEE TRANSACTIONS ON PARALLEL AND · PDF fileThis new approach was implemented and evaluated using the OptorSim simulator, ... Iyer [21] propose a ... prediction techniques to model

NRMSE ¼

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiPn

i¼1ðxi�xiÞ2

n

r

xmax � xmin: ð2Þ

Finally, our heuristic (Algorithm 1), inspired in [9], isexecuted to optimize decisions based on predictions, i.e.,considering the amount of data to be read or written overtime. This heuristic takes decisions on data replication,migration, and consistency. Differently from other ap-proaches that simply consider historical data [9], [16], thisheuristic considers the automatic and online prediction ofread-and-write operations performed by processes.

Algorithm 1. The proposed heuristic

Require:

Set of sites S ¼ fs0; s1; . . . ; sn�1g;Set of process P ¼ fp0; p1; . . . ; ph�1g;Set of files F ¼ ff0; f1; . . . ; fz�1g;Set of prediction values PB; == in which PB contains all

predicted events based on TR.

PB ¼ fpb0;pb1; . . . ;pbu�1g;fASWgp ¼ initial value is defined by a constant; == This

parameter can be seen as a vector containing the window

length of next predictions for every process p 2 P , this is

why it is indexed by p.

Op ¼ 1; == Number of consecutive homogeneousoperations.

� ¼ 0:10; == Adaptive parameter.

Ensure:

Set of replicas F 0.

1: for each p allocated on grid site s 2 S do

2: for each pb 2 PB do

3: p getProcessðpbÞ4: if p.op ¼ “Read” then

5: Read(pb)

6: else if p.op ¼ “Write” then

7: Invalidate(f)

8: Write(pb)

9: else

10: PB predict(p; fASWgp;TR) == accordingto Section 3.

11: organize predictions as time series

12: while fASWgp 6¼ Empty do

13: RetrieveðfÞ14: end while

15: end if

16: fASWgp fASWgp � ð1� �Þ þOp2� OpfASWgp � �

17: end for

18: end for

This heuristic maps the grid environment into a graphGðS; LÞ, with grid sites s 2 S and edges l 2 L representcommunication links between elements in S. Every grid siteruns a set of processes, previously scheduled according toany policy. Files f 2 F are distributed in sites s 2 S. Thisheuristic also considers u future observations given by PB,which is the predicted behavior for all processes in Porganized as a sequence of next accesses over time. Thisalgorithm also depends on an initial length for the adaptive

sliding window, which estimates the number of observa-

tions to be predicted (parameter � is used by this estimator).An instance of this heuristic is run at every grid site

s 2 S. It periodically verifies if a process p will access data,

considering predictions pb 2 PB. If pb confirms that p will

access data, then the heuristic needs to define the operation

type: read or write (from lines 4 to 8). Otherwise, from lines

12 to 14, the heuristic assesses the feasibility of replication,

migration, consistency, and retrieval of files, according to

the predicted events, lines 10-11. Line 16 computes the next

window length for a process.Every process p 2 P has its own ASW (represented in

Algorithm 1 as a vector of prediction values fASWgp). PB

is a subset of the multidimensional time series TR,

considering only field amt (or amount) in a unidimensional

time series. This field was selected due to it provides the

amount of data considered in replication, migration, and

consistency operations.The method RetrieveðfÞ, Algorithm 2 establishes a

new local copy (replication) of the remote file f and assesses

the copy removal option (for data migration purposes). The

method Read(pb), Algorithm 3 receives a quintuple pb

and, if file f is invalid then it applies the method

Retrieve(f) and finally reads it. The method Write(pb)

works similarly and is presented in Algorithm 4. The

method Invalidate(f), Algorithm 5 receives a file and

updates all replicas as invalid.

Algorithm 2. Retrieve(f)

Require:

The f file.

Ensure:

f 0 replica file retrieved from nearest site.

1: A � S== subset of sites where there are copies of f .

2: for each a 2 A do

3: evaluate costs on replication, migration and

consistency

4: end for

5: return f 0

Algorithm 3. Read(pb)

Require:

pb.

Ensure:

Reading data from file.

1: if f is “invalid” then

2: Retrieve ðfÞ3: end if

4: reads

Algorithm 4. Write(pb)

Require:

pb

Ensure:

Writing data to file.

1: if f is “invalid” then

2: Retrieve ðfÞ3: end if

4: writes

1022 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 23, NO. 6, JUNE 2012

Page 7: IEEE TRANSACTIONS ON PARALLEL AND · PDF fileThis new approach was implemented and evaluated using the OptorSim simulator, ... Iyer [21] propose a ... prediction techniques to model

Algorithm 5. Invalidate(f)

Require:

The f file.

Ensure:

Invalidate the subset of S having outdated replicas of f .

1: A � S== subset of sites where there are copies of f .

2: for each a 2 A do

3: update copy ca of f as “invalid” in site a.

4: end for

5: return S.

5 EXPERIMENTS

This section introduces the data sets we considered toconduct experiments as well as results. Results include:1) the assessment of the generation process of time series(Section 3), which supports the selection of adequatemodeling techniques; 2) the assessment of time seriesprediction, i.e., the forecasting of process behavior inorder to optimize future data access operations; and 3) thedata access optimization results using the proposedheuristic and comparisons against other common consid-ered approaches.

5.1 The Data Sets

In order to validate our approach, we consider some real-world traces provided by Storage Networking IndustryAssociation (SNIA) [46], which were produced by real datacenter servers (enterprise servers at Microsoft Research atCambridge) during one week of observation. Such datacenter has 13 servers, 36 logical volumes, and 179 harddisks. Table 2 describes the monitored servers.4

Every logical volume was monitored and all block-levelread and write operations were captured. Traces werecollected using the Event Tracing for Windows (ETW)during a period of 168 hours (1 week). Each trace eventdescribes the I/O request which includes the timestamp,disk number, initial logical block number, number of blockstransferred, and type of the operation (read or write). Inorder to proceed with experiments we selected three of

those traces (boldface rows in Table 2): volume proj withwrite-only operations, volume hm with read-only, andvolume mds with read-only operations. Afterwards, wefiltered only the first 2,000 relevant events for our model tocompose data sets: 1) timestamps; 2) number of blockstransferred; and 3) type of operations. Those data sets werethen evaluated in order to obtain the time series generationprocess (Section 5.2) and, consequently, we selected themodeling technique to permit the prediction of processbehaviors. Such behaviors were then considered to optimizedata access operations.

5.2 Classifying Data Sets: Assessing the TimeSeries Generation Process

In this paper, we organize process behavior as time series,and, afterwards, predict observations which are input to adata access optimization approach. However, first of all, weneed to understand the generation process of those timeseries, what supports the selection of adequate modelingtechniques. In order to accomplish that, we employ theapproach by Ishii et al. [20] on the three data sets to assesstheir stochasticity, linearity, and stationarity. After definingsuch properties for every data set, we then select the mostadequate model to represent every one. Recurrence Plot [41]was considered to evaluate stochasticity. White NeuralNetwork [35] was considered to evaluate linearity. Space-Time Separation Plot [47] and Autocorrelation Function(ACF) [48] was considered to evaluate stationarity.

RP evaluates all possible states that can be visited by aseries trajectory (f~xgNi¼1, in which N is the number ofconsidered states) in phase space [41], [44]. Moreover, basedon RP, Zbilut and Webber [49] propose several measure-ments (RQA or Recurrence Quantification Analysis) toquantify RP visual structures, such as: the rate orpercentage of recurrence (RR), degree of determinism(DET), maximum length of a diagonal line (Lmax), degreeof divergence (DIV), and the Shannon Entropy of theprobability density function of diagonal lines (ENTR). Suchmeasurements produce a time-dependent behavior whichallow the analysis and study of series transitions.

WNN is a test to verify the linearity of time series byusing an artificial neural network [35]. The test considers astochastic process Z ¼ fZt; t ¼ 1; . . . ; ng; Zt ¼ ðYt;XtÞ, inwhich Yt is a scalar and Xt is a vector, and measures theconditional probability, i.e., the probability of Yt given Xt,denoted by EðYtjXtÞ, to conclude about series linearity.Formally, this is represented by a regression function gðxÞ ¼EðYtjXtÞ which is found after training a multilayer artificialneural network. The objective of that test is to verify if theWNN architecture is capable of mapping function gð:Þ.

STP [47] is a correlation test to evaluate the geometrictrajectories (called global) or fractal structures (called local)of series. STP explicitly considers the time separation ofstates, i.e., dispersion (distance) between states versustheir separation over time. It generates contours thatdetermine the ranges of correlation for a given embeddeddimension in a series. This correlation represents theprobability that a pair of randomly selected states is lowerthan a distance r in phase space [44]. This distance and theseparation dimension reflect the position in neighborhoodat a given time instant.

ISHII AND DE MELLO: AN ONLINE DATA ACCESS PREDICTION AND OPTIMIZATION APPROACH FOR DISTRIBUTED SYSTEMS 1023

4. Further information is available at http://www.snia.org.

TABLE 2Data Center: 13 Servers, 36 Logical Volumes

and 179 Hard Disks Were Monitoredover One Week

Page 8: IEEE TRANSACTIONS ON PARALLEL AND · PDF fileThis new approach was implemented and evaluated using the OptorSim simulator, ... Iyer [21] propose a ... prediction techniques to model

ACF [48] measures the degree of correlation among

time series observations, i.e., we compute the separation

distance of an observation at time instant t against another

tþ k, where k is the time lag. Formally, ACF(k) of a time

series X with average �, is defined in (3), in which E½:� is

the expected average value of the expression and �2 is the

variance of X

ACFðkÞ ¼ E½ðXt � �ÞðXtþk � �Þ��2

: ð3Þ

We simply introduce RP, WNN, STP, and ACF to

support result analysis. However, we suggest the reading

of Ishii et al. [20] in order to better understand such

techniques and also issues related to stochasticity, linearity,

and stationarity.In this context, we started assessing the generation

process of data set proj, which contains write-only opera-

tions. That data set presented the RQA measurements

showed in Table 3. By analyzing results, we observe this

series presents low recurrence rate and that diagonals

confirm it presents low divergence, complexity, and

determinism. Furthermore, when analyzing the RP chart

(Fig. 4), we confirm the low frequency of diagonals. Given

such circumstances, proj can be classified as a stochastic

time series.5

Table 4 presents RQA measurements for data set hm,

which considers read-only operations. Based on results

(Table 4), hm presents low recurrence rate. Diagonal

measurements confirm this series presents moderate diver-gence, low complexity, and degree of determinism. More-over, by analyzing the RP chart (Fig. 5), we confirm the lowfrequency of diagonals. Considering such analysis, hm canbe classified as stochastic.

Table 5 presents RQA measurements for data set mds,which considers read-only operations. By analyzing results(Table 5), we conclude this series presents a very highrecurrence rate and the formation of diagonals. Thosediagonals confirm this series presents low divergence,moderate complexity, and high degree of determinism.Furthermore, when analyzing the RP chart (Fig. 6), we alsoconfirm the high frequency of diagonals. Given suchcircumstances, this series is classified as deterministic.

Then, we applied the WNN test [20], [42] to evaluatetime series linearity. Table 6 summarizes the WNN test andindicates the classification of the series under study. In thissituation, two series are linear and one is not. However,even after applying the linearity test, we still need toidentify the degrees of stationarity for each series. Thus, theSTP test [20], [38] is applied.

Time series proj presents the ACF and STP chartspresented in Fig. 7. ACF demonstrates proj has a constant

1024 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 23, NO. 6, JUNE 2012

5. It is relevant to mention that we assume a time series is deterministiconly if DET � 95%, otherwise we conclude it tends to be stochastic.

Fig. 4. Recurrence Plot: Data set proj.

TABLE 3Recurrence Quantification Analysis measurements:

Data Set proj with Time Delay 6, EmbeddingDimension 14 and " ¼ 1:0

Fig. 5. Recurrence Plot: Data set hm.

TABLE 4Recurrence Quantification Analysis Measurements:

Data Set hm with Time Delay 4, EmbeddingDimension 2, and " ¼ 1:0

TABLE 5Recurrence Quantification Analysis Measurements:

Data Set mds with Time Delay 5, EmbeddingDimension 2, and " ¼ 1:0

Page 9: IEEE TRANSACTIONS ON PARALLEL AND · PDF fileThis new approach was implemented and evaluated using the OptorSim simulator, ... Iyer [21] propose a ... prediction techniques to model

autocovariance structure which is maintained over time.STP results confirm this autocovariance structure, whosedistance (y-axis) values attenuate slowly over time (x-axis).Every curve in STP chart represents a probability p such asdefined in [20], [38]. By analyzing results, we conclude proj

is stationary.Time series hm has the ACF and STP charts presented in

Fig. 8. The ACF values slowly decay at the beginning andpresent abrupt changes which are maintained over time,i.e., this does not present a constant autocovariancestructure. STP results confirm high values of distance overtime. Based on such analysis, we consider hm as anonstationary time series.

Stationarity feature was not evaluated in the mds timeseries because previous results conducted on this seriesallow to classify it as deterministic and nonlinear, asdefined in the methodology proposed in [20].

All results previously presented permitted to classify theproperties of data sets. Now having such properties, we canbetter select modeling techniques and, therefore, improveprediction results as seen in [20].

5.3 Predicting Process Behaviors

In the next step, data sets were divided in two sets in orderto evaluate process behavior predictions: 1) the first with80 percent of observations for training our approach, and2) the remainder 20 percent to evaluate predictions results.

Data set proj was the first to be studied. According to ourexperiments, it was classified as stochastic, linear, andstationary. By considering such classification, this series

would be better modeled by using statistic tools, such as ARand ARMA models [20]. This modeling is verified by resultspresented in Fig. 9, which confirms our hypothesis. In thatsense, ARMA(1,2) was capable of modeling proj with thesmallest error among all evaluated approaches, indicatinggood prediction results. ARMA (1,2) was obtained by usingthe auto.arima tool provided in R statistical software [50],which returns the best order for the arima model to fit data.

Now, consider data set hm which is, according to ourexperiments, classified as stochastic, linear, and nonsta-tionary. According to proposed classification (see Fig. 1)[20], this series would be better modeled by statistic tools,such as ARIMA, which was confirmed by the resultspresented in Fig. 10. ARIMA(2,1,2) was capable of modelinghm with lower errors.

Finally, we consider data set mds which is, according toexperiments, classified as deterministic and nonlinear.

ISHII AND DE MELLO: AN ONLINE DATA ACCESS PREDICTION AND OPTIMIZATION APPROACH FOR DISTRIBUTED SYSTEMS 1025

TABLE 6WNN Test for Time Series Linearity Considering

the SNIA Repository

Fig. 7. Autocorrelation Function and Space-Time Separation Plot testsfor Data set proj.

Fig. 6. Recurrence Plot: Data set mds.

Fig. 8. Autocorrelation Function and Space-Time Separation Plot testsfor Data set hm.

Page 10: IEEE TRANSACTIONS ON PARALLEL AND · PDF fileThis new approach was implemented and evaluated using the OptorSim simulator, ... Iyer [21] propose a ... prediction techniques to model

Following the proposed classification (see Fig. 1), this seriesis better modeled by using chaos-theory or dynamicalsystem tools. Fig. 11 confirms our hypothesis, in which theseries is better modeled by using chaos-theory tools(unfolding the behavior in phase space and modeling usingRBF). The Phase Space/RBF approach was capable ofmodeling mds at lower errors, thus, obtaining goodprediction results.

At the end of this stage we have all modeling techniquesselected in order to proceed with predictions and the dataaccess optimization.

5.4 Employing the Proposed Optimization HeuristicConsidering Predicted Behaviors

All behavior predicted is according to the previous sectionsis made available to our optimization heuristic, whichconsiders the possibility to anticipate read and writeoperations as well as consistency verification.

In order to evaluate our proposed heuristic whichconsiders the prediction of process behaviors, we con-ducted experiments by using the OptorSim simulator,which is part of the European DataGrid EDG Project [13].This simulator was originally developed to model thedynamic effects of data replication for the Large HadronCollider Computing Grid (LCG), and it is also considered inother works [9], [14], [15], [16], [17], [18], [19].

Fig. 12 depicts the execution of heuristic in the OptorSimsimulator (describes in Section 5.4), considering the online

prediction of process behaviors. During execution, ourheuristic continuously collects past operations and invokesthe Tisean package6 to conduct predictions according tothe modeling techniques selected in the previous phase. Insuch invocation, the heuristic informs the number of futureevents to be predicted, which is maintained in the adaptivesliding window (ASW in Algorithm 1). Tisean returnspredicted observations to our heuristic which considerssuch operations to take decisions on replication, migration,and consistency. The adaptive sliding window is modifiedaccording to homogeneous or heterogeneous operations,following the idea proposed in [9]. When the length of thiswindow increases, then the heuristic invokes Tisean topredict more observations.

After presenting our proposed heuristic, which opti-mizes data access operations using online prediction, weconduct several experiments to compare it with othercommonly considered techniques. When presenting results,we refer to our heuristic as H-Pred.

We evaluated five optimization techniques: LRU, LFU,Economic Model, H-Hist, and H-Pred (showed in Algo-rithm 1). LRU replicates files when jobs need them,

removing the least-recently-used replicas. LFU replicatesfiles when jobs need them, removing the least-frequently-used replicas. ECO considers an economic model toreplicate files. Replicas are removed according to a Zipf

1026 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 23, NO. 6, JUNE 2012

Fig. 10. Data set hm: Normalized Root Mean Squared Error.

Fig. 11. Time series mds: Normalized Root Mean Squared Error.Fig. 9. Data set proj: Normalized Root Mean Squared Error.

Fig. 12. Optimization Heuristic: The main steps.

6. Time series analysis project: it considers techniques based onDynamical Systems and Chaos Theory [51] http://www.mpipks-dresden.mpg.de/~tisean/Tisean_3.0.1/index.html.

Page 11: IEEE TRANSACTIONS ON PARALLEL AND · PDF fileThis new approach was implemented and evaluated using the OptorSim simulator, ... Iyer [21] propose a ... prediction techniques to model

estimation function. H-Hist was previously proposed in [9],and it was considered in this paper for comparativepurposes. The first three techniques are available in thedefault version of the OptorSim simulator. The last twotechniques were also implemented in OptorSim.

The first experiment evaluates the performance of dataoptimization techniques on the hm data set. The perfor-mance metric considered was the average execution time (inseconds). As shown in Fig. 13, both heuristics (H-Hist andH-Pred) were capable of reducing the job execution time inaround 60 percent when compared to other data replicationtechniques. The H-Pred heuristic adopts the ARIMA (2,1,2)model to predict data access operations (see Section 5.2).

The next experiment evaluates the performance of dataoptimization techniques considering an environmentwhich adopts the mds data set with read only operations.In Fig. 14, H-Hist and H-Pred heuristics were capable ofreducing the job execution time in around 57 percentwhen compared to other replication techniques. In addi-tion, the H-Pred heuristic adopts the Phase Space/RBFtechnique to predict data access operations.

The last experiment evaluates the performance of dataoptimization techniques considering an environment

which adopts the proj data set with write only operations.In Fig. 15, ECO technique has better performance than theH-Hist in around 2 � 3%. In addition, the H-Pred heuristic,which considered ARMA(1,2) model to predict data accessoperations, has lower performance than the ECO techniquein around 3 � 4%.

The H-Pred heuristic produces equivalent results or, insome cases (Fig. 15), lower than the H-Hist technique.However, this predictive approach (H-Pred) does notrequire an initial execution of the application, avoidsstorage of long historical, and efficiently adapts accordingto variations on the behavior of applications.

Moreover, it is observed in all charts in this section that,besides increasing the applications performance, historical(H-Hist) and predictive (H-Pred) approaches have lowvariability in response time when compared to other dataoptimization techniques, i.e., have low standard deviation.This feature makes both approaches more reliable underdifferent scenarios, which reinforces the use of applicationknowledge to optimize data access operations.

6 CONCLUDING REMARKS

This paper has presented a data access optimizationapproach which uses predictive techniques for distributedcomputing environments. Our main objective is to mini-mize the application execution time by optimizing dataaccesses and, therefore, improving decisions on replication,migration, and consistency. From that, data access opera-tions are transformed into time series. By modeling thoseseries, we can understand the behavior of applications and,therefore, predict future observations. Such predictionsupports to take decisions beforehand. However, thismodeling is related to specific aspects of each time seriessuch as the stochasticity, linearity, and stationarity.

The efficiency of our approach was confirmed throughexperiments using real-world data. We evaluated ourapproach to select modeling techniques for real systemsdata (SNIA data sets [46]). By conducting additional tests,we confirmed that the time series classification can indeedbe used as a way to select the most appropriate set ofmodeling techniques.

ISHII AND DE MELLO: AN ONLINE DATA ACCESS PREDICTION AND OPTIMIZATION APPROACH FOR DISTRIBUTED SYSTEMS 1027

Fig. 13. Comparative results between the optimization techniques LRU,LFU, ECO, H-Hist, and H-Pred for the hm data set.

Fig. 14. Comparative results between the optimization techniques LRU,LFU, ECO, H-Hist, and H-Pred for the mds data set.

Fig. 15. Comparative results between the optimization techniques LRU,LFU, ECO, H-Hist, and H-Pred for the proj data set.

Page 12: IEEE TRANSACTIONS ON PARALLEL AND · PDF fileThis new approach was implemented and evaluated using the OptorSim simulator, ... Iyer [21] propose a ... prediction techniques to model

Simulations were conducted to study the efficiency ofour heuristic which consider predictive mechanisms, underthree distinct environments. Experimental results confirmthat this approach outperforms other commonly consideredones (e.g., LRU, LFU, and Economic Model) in approxi-mately 50 percent. In addition, our approach is online,avoids the storage of long historical and efficiently adaptsaccording to variations on the behavior of applications.

ACKNOWLEDGMENTS

This paper is based upon work supported by FAPESP—SaoPaulo Research Foundation, Brazil, under grant no. 2011/02655-9, CNPq—National Council for Scientific and Tech-nological Development research funding agency undergrants no. 304338/2008-7 and 470739/2008-8. Any opinions,findings, and conclusions or recommendations expressed inthis material are those of the authors and do not necessarilyreflect the views of FAPESP and CNPq.

REFERENCES

[1] M. Parashar and S. Hariri, Autonomic Computing: Concepts,Infrastructure, and Applications, Manish Parashar and Salim Haririeds. Taylor & Francis, Inc., 2007.

[2] H. Stockinger, “Defining the Grid: A Snapshot on the CurrentView,” The J. Supercomputing, vol. 42, pp. 3-17, http://www.springerlink.com/content/906476116167673m, 2007.

[3] T. Hey and A.E. Trefethen, “Cyberinfrastructure for E-Science,”Science, vol. 308, no. 5723, pp. 817-821, http://www.sciencemag.org/cgi/content/abstract/308/5723/817, 2005.

[4] R. Ranjan, A. Harwood, and R. Buyya, “A Study on Peer-to-PeerBased Discovery of Grid Resource Information,” technical report,Dept. of Computer Science and Software Eng., Univ. ofMelbourne, Dec. 2006.

[5] J. Gray, “eScience: A Transformed Scientific Method,” The FourthParadigm: Data-Intensive Scientific Discovery, T. Hey, S. Tansley,and K. Tolle, eds., Microsoft Research, 2009.

[6] J.P. Collins, “Sailing on an Ocean of 0s and 1s,” Science, vol. 327,pp. 1455-1456, Mar. 2010.

[7] G. Fox and D. Gannon, “Computational Grids,” Computing inScience and Eng., vol. 3, no. 4, pp. 74-77, 2001.

[8] M. Nielsen, “A Guide to the Day of Big Data,” Nature, vol. 462,pp. 722-723, Dec. 2009.

[9] R.P. Ishii and R.F. de Mello, “An Adaptive and HistoricalApproach to Optimize Data Access in Grid Computing Environ-ments,” INFOCOMP J. Computer Science, vol. 10, no. 2, pp. 26-43,http://www.dcc.ufla.br/infocomp/, 2011.

[10] N.N. Dang and S.B. Lim, “Combination of Replication andScheduling in Data Grids,” Int’l J. Computer Science and NetworkSecurity, vol. 7, no. 3, pp. 304-308, Mar. 2007.

[11] R.M. Rahman, K. Barker, and R. Alhajj, “A Predictive Techniquefor Replica Selection in Grid Environment,” Proc. IEEE SeventhInt’l Symp. Cluster Computing and Grid, pp. 163-170, May 2007.

[12] Y. Sun and Z. Xu, “Grid Replication Coherence Protocol,” Proc. 18thInt’l Symp. Parallel and Distributed Processing, pp. 232-239, Apr. 2004.

[13] W.H. Bell, D.G. Cameron, A.P. Millar, L. Capozza, K. Stockinger,and F. Zini, “Optorsim: A Grid Simulator for Studying DynamicData Replication Strategies,” Int’l J. High Performance ComputingApplications, vol. 17, no. 4, pp. 403-416, http://hpc.sagepub.com/cgi/content/abstract/17/4/403, 2003.

[14] H.H.E. AL-Mistarihi and C.H. Yong, “On Fairness, OptimizingReplica Selection in Data Grids,” IEEE Trans. Parallel DistributedSystems, vol. 20, no. 8, pp. 1102-1111, Aug. 2009.

[15] G. Belalem, “Economic Model for Consistency Management ofReplicas in Data Grids with Optorsim Simulator,” Proc. SecondInt’l Conf. Networks for Grid Applications (GridNets), O. Akan, P.Bellavista, J. Cao, F. Dressler, D. Ferrari, M. Gerla, H. Kobayashi,S. Palazzo, S. Sahni, X. S. Shen, M. Stan, J. Xiaohua, A. Zomaya, G.Coulson, P. Vicat-Blanc Primet, T. Kudoh, and J. Mambretti, eds.,pp. 121-129, 10.1007/978-3-642-0280-3_13. http://dx.doi.org/10.1007/978-3-642-02080-3_13, 2009.

[16] R.P. Ishii and R.F. de Mello, “A History-Based Heuristic toOptimize Data Access in Distributed Environments,” Proc. 21stIASTED Int’l Conf. Parallel and Distributed Computing and Systems(PDCS ’09), Nov. 2009.

[17] X.Y. Ren, R.C. Wang, and Q. Kong, “Using Optorsim to EfficientlySimulate Replica Placement Strategies,” The J. China Univ. of Postsand Telecomm., vol. 17, no. 1, pp. 111-119, http://www.sciencedirect.com/science/article/pii/S1005888509604349, 2010.

[18] K. Jain, A.V. Vidhate, V. Wangikar, and S. Shah, “Design of FileSize and Type of Access Based Replication Algorithm for DataGrid,” Proc. Int’l Conf. and Workshop Emerging Trends in Technology(ICWET ’11), pp. 315-319, http://doi.acm.org/10.1145/1980022.1980093, 2011.

[19] K. Sashi and A.S. Thanamani, “Dynamic Replication in a DataGrid Using a Modified Bhr Region Based Algorithm,” FutureGeneration Computer Systems, vol. 27, pp. 202-210, http://dx.doi.org/10.1016/j.future.2010.08.011, Feb. 2011.

[20] R.P. Ishii, R.A. Rios, and R.F. de Mello, “Classification of TimeSeries Generation Processes Using Experimental Tools: A Surveyand Proposal of an Automatic and Systematic Approach,” Int’l J.Computational Science and Eng., vol. 6, pp. 217-237, http://www.inderscience.com/browse/index.php?journalCODE=ijcse,2011.

[21] M. Devarakonda and R. Iyer, “Predictability of Process ResourceUsage: A Measurement-Based Study on Unix,” IEEE Trans.Software Eng., vol. 15, no. 2, pp. 1579-1586, http://dx.doi.org/10.1109/32.58769, Dec. 1989.

[22] M. Faerman, A. Su, R. Wolski, and F. Berman, “AdaptivePerformance Prediction for Distributed Data-intensive Applica-tions,” Proc. ACM/IEEE Conf. Supercomputing (Supercomputing ’99),p. 36, 1999.

[23] M. Wang, K. Au, A. Ailamaki, A. Brockwell, C. Faloutsos, andG.R. Ganger, “Storage Device Performance Prediction with CartModels,” Proc. IEEE CS 12th Ann. Int’l Symp. Modeling, Analysis,and Simulation of Computer and Telecomm. Systems (MASCOTS ’04),pp. 588-595, 2004.

[24] L. Senger, R.F. Mello, M.J. Santana, and R.H.C. Santana, “An On-Line Approach for Classifying and Extracting ApplicationBehavior on Linux,” High Performance Computing: Paradigm andInfrastructure, pp. 381-401, John Wiley and Sons Inc., 2005.

[25] L.J. Senger, M. Santana, and R. Santana, “An Instance-basedLearning Approach for Predicting Parallel Applications ExecutionTimes,” Proc. Third Int’l Information and Telecomm. TechnologiesSymp., pp. 9-15, Dec. 2005.

[26] R.F. de Mello, “Sistemas dinamicos e tecnicas inteligentes para apredicao de Comportamento de Processos: Uma Abordagem ParaOtimizacao de Escalonamento em Grades Computacionais,” PhDdissertation, Intituto de Cincias Matematicas e de Computa�cao -USP, Jan. 2009.

[27] R. Oldfield and D. Kotz, “Improving Data Access for Computa-tional Grid Applications,” Cluster Computing, vol. 9, no. 1, pp. 79-99, Jan. 2006.

[28] J. Kim, A. Chandra, and J.B. Weissman, “Using Data Accessibilityfor Resource Selection in Large-Scale Distributed Systems,” IEEETrans. Parallel Distributed Systems, vol. 20, no. 6, pp. 788-801, June2009.

[29] A.L. Chervenak, R. Schuler, M. Ripeanu, M.A. Amer, S. Bharathi,I. Foster, A. Iamnitchi, and C. Kesselman, “The Globus ReplicaLocation Service: Design and Experience,” IEEE Trans. ParallelDistributed Systems, vol. 20, no. 9, pp. 1260-1272, Sept. 2009.

[30] A. Abramovici, W.E. Althouse, R.W.P. Drever, Y. Grsel, S.Kawamura, F.J. Raab, D. Shoemaker, L. Sievers, R.E. Spero, K.S.Thorne, R.E. Vogt, R. Weiss, S.E. Whitcomb, and M.E. Zucker,“LIGO: The Laser Interferometer Gravitational-Wave Observa-tory,” Science, vol. 256, no. 5055, pp. 325-333, 1992.

[31] F.A.R. Nat’l Center, “The Earth System Grid-ESG Project,”www.earthsystemgrid.org, 2005.

[32] E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, K.Blackburn, A. Lazzarini, A. Arbree, R. Cavanaugh, and S.Koranda, “Mapping Abstract Complex Workflows onto GridEnvironments,” J. Grid Computing, vol. 1, no. 1, pp. 25-39, Mar.2003.

[33] R.H. Shumway and D.S. Stoffer, Time Series Analysis and ItsApplications: With R Examples, second ed. Springer, May 2006.

[34] G. Box, G.M. Jenkins, and G. Reinsel, Time Series Analysis:Forecasting & Control, third ed. Prentice Hall, Feb. 1994.

1028 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 23, NO. 6, JUNE 2012

Page 13: IEEE TRANSACTIONS ON PARALLEL AND · PDF fileThis new approach was implemented and evaluated using the OptorSim simulator, ... Iyer [21] propose a ... prediction techniques to model

[35] H. White, “A heteroskedasticity-Consistent Covariance MatrixEstimator and a Direct Test for Heteroskedasticity,” Econometrica,vol. 48, no. 4, pp. 817-838, http://www.jstor.org/stable/1912934,1980.

[36] T.M. Mitchell, Machine Learning. McGraw-Hill, 1997.[37] S. Haykin, Neural Networks and Learning Machines, third ed.

Prentice-Hall, 2008.[38] M. Casdagli, “Nonlinear Prediction of Chaotic Time Series,”

Physica D: Nonlinear Phenomena, vol. 35, no. 3, pp. 335-356, May1989.

[39] C. Jung, D.-K. Woo, K. Kim, and S.-S. Lim, “PerformanceCharacterization of Prelinking and Preloadingfor EmbeddedSystems,” Proc. ACM and IEEE Seventh Int’l Conf. EmbeddedSoftware, pp. 213-220, 2007.

[40] R.P. Spillane, C.P. Wright, G. Sivathanu, and E. Zadok, “RapidFile System Development Using Ptrace,” Proc. Workshop Experi-mental Computer Science, p. 22, 2007.

[41] N. Marwan, M. Carmen Romano, M. Thiel, and J. Kurths,“Recurrence Plots for the Analysis of Complex Systems,” PhysicsReports, vol. 438, nos. 5/6, pp. 237-329, http://dx.doi.org/10.1016/j.physrep.2006.11.001, Jan. 2007.

[42] T.-H. Lee, H. White, and C.W.J. Granger, “Testing for NeglectedNonlinearity in Time Series Models: A Comparison of NeuralNetwork Methods and Alternative Tests,” J. Econometrics, vol. 56,pp. 269-290, 1993.

[43] D. Yu, W. Lu, and R.G. Harrison, “Space Time-Index Plots forProbing Dynamical Nonstationarity,” Physics Letters A, vol. 250,nos. 4-6, pp. 323-327, 1998.

[44] F. Takens, “Detecting Strange Attractors in Turbulence,” Dynami-cal Systems and Turbulence, vol. 1, pp. 366-381, http://dx.doi.org/10.1007/BFb0091924, 1981.

[45] M. Anderson and W. Woessner, Applied Groundwater Modeling:Simulation of Flow and Advective Transport, second ed. AcademicPress., 1992.

[46] D. Narayanan, A. Donnelly, and A. Rowstron, “Write Off-Loading: Practical Power Management for Enterprise Storage,”Trans. Storage, vol. 4, no. 3, pp. 1-23, 2008.

[47] A. Provanzale, L.A. Smith, R. Vio, and G. Murante, “Distinguish-ing Between Low-Dimensional Dynamics and Randomness inMeasured Time Series,” Phys. D, vol. 58, nos. 1-4, pp. 31-49, 1992.

[48] G. Box, G.M. Jenkins, and G. Reinsel, Time Series Analysis:Forecasting & Control, third ed. Prentice Hall, Feb. 1994.

[49] J.P. Zbilut and C.L. Webber Jr., “Embeddings and Delays asDerived from Quantification of Recurrence Plots,” Physics LettersA, vol. 171, nos. 3/4, pp. 199-203, Dec. 1992.

[50] R.J. Hyndman and Y. Khandakar, “Automatic Time SeriesForecasting: The Forecast Package for r,” J. Statistical Software,vol. 27, no. 3, pp. 1-22, http://www.jstatsoft.org/v27/i03, 2008.

[51] R. Hegger, H. Kantz, and T. Schreiber, “Practical Implementationof Nonlinear Time Series Methods: The TISEAN Package,” Chaos,vol. 9, pp. 413-435, 1998.

Renato Porfirio Ishii received the PhD degreeat the University of Sao Paulo, Sao Carlos in2010. He is currently an assistant professor atthe Faculty of Computer Science, FederalUniversity of Mato Grosso do Sul, CampoGrande, Brazil. His research interests includeautonomic computing, cluster and grid comput-ing, bioinspired computing, and time seriesanalysis.

Rodrigo Fernandes de Mello received the PhDdegree from the University of Sao Paulo, SaoCarlos in 2003. He is currently an associateprofessor at the Department of ComputerScience, Institute of Mathematics and ComputerSciences, University of Sao Paulo, Sao Carlos,Brazil. His research interests include autonomiccomputing, machine learning, scheduling, andbioinspired computing.

. For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.

ISHII AND DE MELLO: AN ONLINE DATA ACCESS PREDICTION AND OPTIMIZATION APPROACH FOR DISTRIBUTED SYSTEMS 1029