
Research Article

Risk Intelligence: Making Profit from Uncertainty in Data Processing System

Si Zheng, Xiangke Liao, and Xiaodong Liu

School of Computer, National University of Defense Technology, China

Correspondence should be addressed to Si Zheng; [email protected]

Received 27 February 2014; Accepted 19 March 2014; Published 24 April 2014

Academic Editors: N. Barsoum, V. N. Dieu, P. Vasant, and G.-W. Weber

The Scientific World Journal, Volume 2014, Article ID 398235, 16 pages; http://dx.doi.org/10.1155/2014/398235

Copyright © 2014 Si Zheng et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In extreme scale data processing systems, fault tolerance is an essential and indispensable part. Proactive fault tolerance schemes (such as the speculative execution in the MapReduce framework) are introduced to dramatically improve the response time of job executions when failure becomes a norm rather than an exception. Efficient proactive fault tolerance schemes require precise knowledge of task executions, which has been an open challenge for decades. To address the issue, in this paper we design and implement RiskI, a profile-based prediction algorithm in conjunction with a risk-aware task assignment algorithm, to accelerate task executions while taking the uncertain nature of tasks into account. Our design demonstrates that this natural uncertainty brings not only great challenges but also new opportunities: with a careful design, we can benefit from such uncertainties. We implement the idea in Hadoop 0.21.0, and the experimental results show that, compared with the traditional LATE algorithm, the response time can be improved by 46% with the same system throughput.

1. Introduction

Nowadays we are witnessing extreme scale data processing systems that employ thousands of servers to coordinately accomplish jobs [1]. In systems of such a large scale, hardware failure becomes a norm rather than an exception, and fault tolerance becomes an indispensable part of system scheduling [2]. For example, in an early report in March 2006, Google reported that at least one server would fail every day and each job would experience five failures. Moreover, a single failure in a job could increase its completion time by up to 50% [3, 4]. When the system scales, the problem becomes even more severe.

In traditional approaches, fault tolerance is provided in a reactive manner: failed jobs are reexecuted until they succeed. As systems scale up, this simple reactive scheduling scheme becomes inefficient because failures become increasingly frequent. In the extreme case, a job may execute forever due to continuous failures.

To address this issue, a more promising approach is the proactive fault tolerance scheme. For example, the MapReduce computational model [5, 6] employs speculative execution, which monitors the execution status of the subjobs (called tasks) and predicts execution failures. Once a task is obviously slower than the others (called a straggler [6] or outlier [7]), a backup task is executed to ensure the successful execution of the task. Similar proactive mechanisms are employed by other large scale data processing systems such as Mesos [8], Dryad [9], and Spark [10].

Speculative execution can dramatically reduce the execution time of tasks and brings great advantages in system response time. It comes, however, at the cost of system throughput. To allow speculative executions, a certain amount of system resources is allocated to executing backup tasks rather than new tasks, reducing the system throughput that depends on those resources. In other words, there is a fundamental tradeoff between the overall system throughput and the response time of individual tasks. A well designed fault tolerance scheme should strike the best tradeoff between the two objectives.

Efficient fault tolerance schemes raise great challenges for system designers. First, optimal scheduling requires precise knowledge of the execution time of each task [5, 6], which is very difficult, if not impossible, to obtain in practice [11]. Robust scheduling algorithms can tolerate inaccurate execution time information, but at the cost of scheduling optimality.



Input: Task x, Node a
Output: Prediction time [t_min, t_max], weighting scheme W_k

(1)  d ← 0, Task z
(2)  for each Task y in the profile database do
(3)      compute d(x, y)
(4)      if d(x, y) > d then
(5)          d ← d(x, y)
(6)          z ← y
(7)      end if
(8)  end for
(9)  T(z) ← [t_min, t_max]
(10) run Task x
(11) T(x) ← t
(12) compute R_ij
(13) while sum_{i=1}^{n} sum_{j=i+1}^{n} (L_ij^2 + M_ij^2) is not minimal do
(14)     adjust weighting scheme W_k = (w_i, w_j, w_k, w_l, ...)
(15) end while

Algorithm 1: Algorithm for task execution time prediction.

The widely adopted virtualization technology (e.g., [12]) makes the problem even more challenging.

Second, our empirical experiments (in Section 4) demonstrate that the task execution time at such a large scale is not a precise value but naturally contains uncertainties: it can only be confined to a range of time, even when the same task is executed in the same environment.

Third, even assuming the execution time is available for every task, a slower task is not necessarily the straggler, because tasks launch at different times [5]. The very nature of the MapReduce environment makes it troublesome to find the correct stragglers, and their successful identification requires much more design intelligence [6].

To address these challenges and provide an efficient fault tolerance scheme, in this paper we propose a novel approach that exploits the very nature of MapReduce-like large scale data processing systems. Our approach, called RiskI, is inspired by two observations. First, in such extreme scale data processing systems, jobs are periodically executed, and thus we can collect the history information of a task to predict its future execution time. Second, the task execution time is naturally uncertain, and thus we can draw on risk management theory to assign tasks; for example, a node that is faster on average is not necessarily a better node unless it can reduce the execution time of the whole job. To summarize, the main contributions of this paper are as follows.

First, we collect real traces from a MapReduce production system and identify the uncertain nature of task execution time in such systems. We conduct comprehensive experiments to investigate the factors influencing task uncertainties.

Second, we define a similarity measure to quantify the difference between executed tasks in the system, based on which we design a task execution time prediction algorithm (Algorithm 1). Note that the output of the prediction is a range of time with a probability distribution, rather than a precise time.

Third, being aware of the task execution uncertainty, we design a risk-management-based task assignment algorithm and implement it on top of a Hadoop 0.21.0 production system. We conduct comprehensive experiments to evaluate the performance. Compared with the state-of-the-art scheduling algorithm LATE [6], the job response time can be improved by up to 46% with the same system throughput, and compared with the native MapReduce scheduling with no fault tolerance capability, the performance degradation of our scheduling is negligibly small.

The remainder of the paper is organized as follows. In Section 2, we review related work on performance prediction and task scheduling in distributed systems. In Section 3, we introduce the background of modern extreme scale data processing systems and the motivation of this work. In Section 4, we present our empirical study results, revealing the very nature of task execution uncertainty. We present our designs for task execution time prediction and risk-management-based scheduling in Sections 5 and 6, respectively. We present the performance evaluation in Section 7 and conclude with future work directions in the last section.

2. Related Work

Fault tolerance and task assignment have been widely studied, especially in traditional distributed systems [13]. Research topics mainly include performance modeling and prediction [11], task assignment [14], heuristic algorithms [15], resource allocation [8], and fairness [16]. In our work, we mainly focus on two aspects, namely, performance prediction and task assignment.

Performance modeling is a conventional way of predicting the execution time of tasks in advance of execution [17]. The design philosophy is to have a comprehensive understanding of machine capabilities and application features, so that the execution behavior of the machines can be well characterized. Snavely et al. [11] proposed a framework for performance modeling and prediction on large HPC systems; their work focused on characterizing machine profiles and application signatures, with prediction made by a proper convolution method. Modeling-based prediction works well only given a comprehensive understanding of the machines and applications and a complicated convolution method, and these factors directly affect prediction accuracy. Some works employed queuing network models [18-21] to represent and analyze resource sharing systems: the model is a collection of interacting service centers representing system resources and a set of customers representing the users sharing the resources. Another model is the simulation model [22], which is actually the most flexible and general analysis technique; its main drawback is its development and execution cost. Petri nets can also be used to answer performance-related questions, since they can verify the correctness of synchronization between various activities of concurrent systems [20, 23, 24].

Our work differs significantly from this literature, as we make our prediction based on historical execution records. In the heterogeneous environments where MapReduce runs, obtaining the information that modeling requires is costly and difficult. First, collecting and analyzing sufficient characterizations of machine capability is inefficient: it requires running low-level benchmarks to gather performance attributes, which may be applicable in an HPC system where the environment is homogeneous but is not practical in heterogeneous environments. Second, virtualization makes some resource sharing invisible. Third, capturing and analyzing application features introduces extra overhead. It is therefore rarely possible to employ modeling-based methods to predict task execution time in MapReduce.

Task assignment is also a very active research area, especially in traditional distributed systems [25-28]. Much of this work leverages precise time predictions for scheduling decisions [29, 30]. Other works concentrate on load balancing [31, 32]; however, in a heterogeneous environment, the blind pursuit of load balance no longer brings better performance.

Real-time scheduling [29] is related work. Most such works use the worst-case time for scheduling, to meet real-time requirements and avoid losses from uncertainty. Our work is also related to coscheduling [30] in multicore systems, particularly with processor heterogeneity. Jiang et al. proposed a scheduling policy based on the prior knowledge that accurate thread execution times were known to the system; with this advance knowledge, better scheduling decisions can be made. In contrast, our scheduling has no such precise and global information: instead of making scheduling decisions based on precise time estimates, we take into account the fact that our prediction is a range of time.

Task assignment in MapReduce is still in its infancy, since MapReduce adopted a simple scheduling policy that is easy to implement: the master assigns a task to a tasktracker without considering its capability or the likely execution time, aiming only to launch the task as early as possible. Run-time stragglers have been identified by past work. To reduce the response time of a job, LATE [6] schedules a duplicate task for the straggler, which is named speculative execution. However, since no distinction is made between tasktrackers, LATE cannot guarantee that a speculative execution will succeed, so the effect of speculative execution is entirely a matter of chance. Delay scheduling [33] is the closest work to our own that we know of: it begins to distinguish tasktrackers by data locality, scheduling a task as far as possible onto a tasktracker that satisfies data locality. Mantri [7] uses the progress rate instead of the task duration to remain agnostic to skews in work assignment among tasks.

Our work differs from the above in that we recognize the performance distinctions between working nodes. First, we investigated how to estimate the response time of a task on different working nodes: instead of a modeling-based method, we make our prediction by learning from experience. Second, since our prediction is a range of time combined with a confidence interval, we employ methods from risky decision making to help make scheduling decisions.

3. Motivation

In this section, we provide background on MapReduce-like computing environments. We first describe the working mechanism of MapReduce and elaborate how the speculative execution mechanism works. With simple examples, we point out the limitations of existing systems.

3.1. MapReduce Background. MapReduce is a typical representative of modern parallel computing environments that provide extreme scale support. Hadoop is an open source implementation of MapReduce. Besides MapReduce and Hadoop, there are many other similar MapReduce-like computing frameworks, such as Mesos [8], Dryad [9], and Spark [10]. All these frameworks share a common idea.

Roughly speaking, a computational system contains a master node and a number of working nodes. The master node splits the job into a number of nearly independent tasks and assigns them to working nodes. As these tasks are nearly independent, parallelism can be fully exploited. As illustrated in Figure 1, a computational job mainly has three stages: the Map tasks that process the raw data, the shuffle stage that transfers the intermediate values, and the Reduce tasks that produce the final results. Each working node has a number of slots for running tasks; when a task finishes, the slot is freed for use by another task. The toy example below makes these stages concrete.
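A minimal, single-process sketch of the three-stage dataflow just described (plain Python, not the Hadoop API; the input splits and function names are illustrative only):

```python
from collections import defaultdict

def map_task(split):
    # Map: process raw data, emitting (key, value) pairs.
    return [(word, 1) for word in split.split()]

def shuffle(pairs):
    # Shuffle: transfer and group the intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    # Reduce: aggregate the grouped values into the final result.
    return key, sum(values)

splits = ["the quick brown fox", "the lazy dog the fox"]  # illustrative input splits
intermediate = [p for s in splits for p in map_task(s)]   # Map phase (parallel in reality)
print(dict(reduce_task(k, v) for k, v in shuffle(intermediate).items()))
```

In a real deployment each map task runs in a slot on some working node, which is exactly why the slot assignment and "waves" discussed later arise.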

In extreme scale data processing systems, the traditional reactive fault tolerance scheme becomes ineffective because of the large number of failures and the long response time; thus, the MapReduce framework adopts a proactive fault tolerance scheme. More specifically, when a task in execution is much slower than other tasks in the same job, speculative execution is invoked to cope with this slowest task (called the straggler [6]). In fact, because there is little dependence between tasks, the response time of a job is determined only by its straggler. To reduce the job response time and provide efficient proactive fault tolerance, we only need to accelerate the straggler while not delaying the other tasks so much that new stragglers appear.


[Figure 1: MapReduce computation framework. The job is split into two kinds of tasks, Map and Reduce; different tasks are nearly independent, so they can be executed in parallel. (Diagram: master, HDFS, and work nodes across the Map, Shuffle, and Reduce phases.)]


Speculative execution is based on a simple principle. When the straggler is identified, a backup task is executed for it, in the hope that the backup finishes earlier than the straggler. As soon as one wins, that is, either the original straggler or the backup finishes, the other is killed. In essence, system resources (and thus overall throughput) are traded to fuel the execution of individual jobs.

3.2. Challenges. In practice, efficient speculative execution faces many challenges. A crucial fact is that the backup task will not necessarily be the winner (finish earlier). Speculative execution may fail, meaning that, though the backup is invoked, the original straggler finishes earlier than the backup. When this happens, we in effect waste the resources allocated to the backup task while gaining no benefit in response time. In other words, not all speculative executions are profitable. We thus argue that the efficiency of speculative execution highly depends on the winning rate of the backup tasks, defined as the percentage of backup tasks that finish earlier than their stragglers. The main objective of this paper then becomes to increase the winning rate of backup tasks.

This is a very challenging issue in practice. First, stragglers can only be identified after they have been executing for a while, and thus the backup task always starts at a later time; the backup must be sufficiently faster than the straggler to be the winner. This requires accurate knowledge of the execution times of both stragglers and backup tasks. In the literature, however, accurate prediction of task execution has been an open challenge for decades in traditional parallel systems. This is partially because traditional approaches [11] assume a comprehensive understanding of machine capability and job workload, which requires a great deal of measurement effort that is prohibitively expensive in a highly dynamic and heterogeneous environment such as ours. The application of virtualization makes the problem even more challenging, as resources become virtualized in an invisible manner: the real capability of each machine (virtual machine or virtual storage) is hard to measure.

Second, stragglers are difficult to identify. We face a dilemma: on one hand, we desire early identification of stragglers, so that the backup tasks have more room for improvement and their winning becomes easier; on the other hand, later identifications of stragglers are more likely to be accurate, as the stragglers may change during execution.

Third, optimal speculative execution requires the precise execution time of tasks, for example, that a task will finish in 15 seconds. Unfortunately, uncertainty is inherent in MapReduce: the execution time of a task falls within a range, even under the same computing environment. Such uncertainty is a double-edged sword. On the one hand, no assignment scheme is knowably optimal prior to execution, so finding the optimal one in advance is impossible. On the other hand, recall that the straggler determines the response time of a job, and thus there are multiple assignment schemes that perform exactly the same as the optimal assignment. In other words, even when our assignment scheme is not the optimal one, performance may not be degraded. This property greatly eases the design challenges. In later sections, we show how to benefit from this by applying the theory of decision making under risk and uncertainty.

4. Observation

In this section, we study the fundamental characteristics of task execution in MapReduce environments. We first use experimental results to reveal the uncertain nature of task executions and then demonstrate the repeatable execution property of tasks. Finally, we show that a simple similarity-based scheme can dramatically reduce the degree of execution time uncertainty.

4.1. Uncertainty of Task Executions. In this part, we use experimental results to reveal the uncertain nature of MapReduce. We configure a standard Hadoop system, version 0.21.0 (Hadoop is an open source implementation of MapReduce), with four machines and run a WordCount application [34]. In our experiments, 128 Map tasks and one Reduce task are executed, and the results are presented in Figure 2.

Figure 2 shows that the execution time ranges from 39 s to 159 s, with an average of 80 s; the standard deviation (Std) is 27.42 s. Note that the MapReduce system splits the Map tasks equally, so the workloads of these 128 map tasks are nearly identical. We thus argue that uncertainty is inherent in MapReduce. On the other hand, even when identical tasks are executed on identical machines, or the same task is run repeatedly under the same environment, the execution time is not identical, but it is stable.


[Figure 2: Execution time of tasks. There are 128 map tasks in total, and the execution time of each task is plotted. (Axes: time (s) versus task index, 0-128.)]

Figure 4(a) shows the CDF of the response time of a task run repeatedly 20 times under the same conditions (we chose two tasks from two different jobs). We can see that the execution time is also uncertain. Given this nature, traditional model-based task execution prediction algorithms are costly and will fail, as their implicit assumption is that a task behaves the same under the same environment. However, we also find that the execution time is stable, meaning that most execution times fall within a small range. We therefore argue that a reasonable prediction for a task should be a range of time with a probability distribution, as sketched below.
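As an illustration of what such a prediction could look like, the following sketch summarizes repeated runs of a task as a range plus an empirical distribution (the function name and sample values are ours, not measured data):

```python
import statistics

def prediction_range(history):
    # Summarize past runs of a task as a range of time plus an empirical
    # distribution, rather than a single point estimate.
    return {
        "range": (min(history), max(history)),
        "mean": statistics.mean(history),
        "std": statistics.stdev(history),
        "samples": sorted(history),  # kept for the probability estimates in Section 6
    }

# 20 hypothetical repeated runs of one task (seconds); not the paper's data.
history = [39, 42, 45, 41, 44, 40, 47, 43, 46, 42,
           44, 41, 45, 43, 48, 42, 40, 44, 46, 43]
print(prediction_range(history))
```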

4.2. Repeated Execution of MapReduce Tasks. In MapReduce environments, most jobs are batch jobs for extreme scale data processing. The spirit is that moving the program is less costly than moving the data [5]. Consequently, the same jobs are repeatedly invoked to process different data [35]. For example, Facebook, Yahoo, and eBay process terabytes of data and event logs per day on their MapReduce clusters for spam detection, business intelligence, and various optimizations. Moreover, such jobs are rarely updated. An interesting observation here is that the past executions of a job can be of great value for the future: we can build a profile for each task and refer to this profile to predict the job's future executions.

Besides the similarity between tasks of different jobs, in some instances similarity also exists between tasks of the same job. Because the number of tasks may exceed the number of slots, tasks may start and finish in "waves" [35]. Figure 6 shows the progress of the Map and Reduce tasks of a WordCount job with 8 GB of data. The x-axis is time, and the y-axis covers the 34 map slots and 10 reduce slots. The block size of the file is 64 MB, so there are 8 GB / 64 MB = 128 input splits. As each split is processed by a different map task, the job consists of 128 map tasks. As seen in the figure, since the number of map tasks is greater than the number of provided map slots, the map stage proceeds in multiple rounds of slot assignment, and the reduce stage proceeds in 3 waves. From the figure, we find that tasks assigned to the same working node have close execution times: the similarity exists between different waves on each node in the map stage.

In Figure 3, we reorganize the execution times of the tasks from Figure 2, this time grouping together tasks executed on the same machine. Table 1 gives the maximum, minimum, average, and Std of the tasks in each group. Compared with the original configuration, the Std is reduced

[Figure 3: Reorganized execution time of tasks. Tasks in the same node (Nodes 1-4) are plotted together. (Axes: time (s) versus task index, 0-128.)]

Table 1: Execution time statistics (seconds) for the 128 map tasks of the wordcount job.

         Avg      Std     Max   Min
Total    80       27.42   159   32
Node 1   108.61   18.46   159   80
Node 2   92.66    9.68    116   75
Node 3   73.61    5.61    84    63
Node 4   43.5     4.94    54    32

dramatically, for example, by 33%, from 27.42 to 18.46. This is because the four machines have different hardware, and tasks on different machines experience different resource sharing conditions; by contrast, tasks on the same machine experience a more "similar" environment. This shows that a simple approach that takes the similarity of tasks into account can dramatically reduce the uncertainty. Next, we show how to exploit such similarity.

It should be pointed out that the execution environment mentioned above includes two parts. Apart from the hardware configurations of different machines, tasks running in parallel on the same execution unit (known as a tasktracker in Hadoop) also affect the execution environment. Depending on the configuration, one or more tasks run on the same tasktracker, which causes resource contention and impacts execution time.

4.3. Design Overview. Keeping the uncertainty and task similarity in mind, in this paper we propose a novel proactive fault tolerance scheme for MapReduce environments. As illustrated in Figure 5, the design, named RiskI, consists of two components: (i) a profile-based execution time prediction scheme and (ii) a risk-aware backup task assignment algorithm. The basic idea is that we build a profile for each task executed in history and refer to the profile of a similar task (in terms of its own properties and its environment) to predict a new task in execution. As the uncertainty is inherent and cannot be completely avoided, we design a risk-aware assignment algorithm that can benefit from such uncertainty by exploring the theory of decision making under uncertainty. Our goal is to maximize the assignment profit while minimizing the risk.


[Figure 4: Execution time. (a) CDF of execution time for a task from BBP; (b) CDF of execution time for a task from wordcount. (Axes: CDF versus time (s).)]

[Figure 5: RiskI system architecture. RiskI has two main parts: a profile-based prediction algorithm for task execution time, fed by profiles of periodically running jobs, and a risk-aware backup task assignment algorithm that makes scheduling decisions based on a range of time, maximizing profit while minimizing risk.]

5. Execution Time Prediction

In this section, we introduce how to leverage historical information to predict the execution time of tasks. We first present the architecture of the prediction algorithm and then introduce its components, respectively. Finally, we use pseudocode to present the details of the design.

5.1. Prediction Algorithm Architecture. The prediction of job execution is based on a simple idea: for each executed task, we maintain a profile recording its execution information. When predicting a new task, we refer to these historical profiles and look for the most similar task; its execution time then becomes our prediction result. Notice that, due to the uncertain nature of job execution, the output of our prediction is also a range of time, indicating the maximum and minimum of the execution time, rather than a single value.

The prediction algorithm mainly consists of four steps, as illustrated in Figure 7. We first perform a similarity search, looking for the most similar task in history. Tasks are then assigned and executed according to our predictions.

[Figure 6: "Waves" in execution. Since the number of tasks is greater than the number of task slots, tasks are executed in "waves". In our example, there are 9 waves in node 4 and 3 waves in node 1. (Axes: task slots (nodes 1-4) versus time (s), 0-500; legend: first to fourth wave, shuffle, sort, reduce.)]

As soon as the execution finishes, we adjust some critical control parameters and maintain the profiles for the next tasks. Noticing that no two tasks are absolutely identical, the key to the prediction algorithm design is a definition of similarity between any two given tasks that yields the best prediction results. Next, we give details of these steps.

5.2. Similarity Scheme. The similarity of two tasks can be affected by many factors, such as the number of bytes read and the number of bytes written. In our work, we apply all these factors, while our method has the adaptive capability that factors of little impact are ignored automatically.


[Figure 7: Our prediction has four steps. When a new task is to be predicted, we first look for the most similar task in history (step one: similarity search) and then make a prediction based on that similar task (step two: making prediction). We then let the task run (step three: running task), record the running information, and adjust the weighting scheme (step four: record and adjust similarity).]

This is done by applying an appropriate similarity scheme, that is, a rule for computing the similarity from these factors.

In the literature, there are several approaches that measure the similarity between tasks in different respects. For example, the weighted Euclidean distance (WED) [36] measures the actual distance between two vectors, the cosine similarity measures direction information, and the Jaccard similarity measures the duplication ratio. Considering the requirements of our work, we adopt WED and leave more design options for future work.

Definition 1. Given two tasks X and Y with affecting factor vectors (x_1, ..., x_n) and (y_1, ..., y_n) and a weight scheme w_i, the similarity between the two tasks can be calculated as follows:

\[ d(X, Y) = \sqrt{\sum_{i=1}^{n} w_i \, (x_i - y_i)^2}, \tag{1} \]

where n is the number of affecting factors and w_i is a weight scheme that reflects the importance of the different affecting factors to the similarity.
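A minimal sketch of Definition 1 and the profile lookup it enables (the factor vectors, weights, and record fields are hypothetical; we return the record at the smallest WED distance, i.e., the most similar task, which is the intent of the search loop in Algorithm 1):

```python
import math

def wed(x, y, w):
    # Weighted Euclidean distance of formula (1); a smaller distance
    # means the two affecting-factor vectors are more similar.
    return math.sqrt(sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)))

def predict(task_factors, profiles, w):
    # Scan the profile database for the executed task closest to the
    # new one and return its recorded execution-time range.
    best = min(profiles, key=lambda rec: wed(task_factors, rec["factors"], w))
    return best["t_min"], best["t_max"]

# Hypothetical profiles; factors = (input size in MB, parallel tasks on the node).
profiles = [
    {"factors": (64, 2),  "t_min": 40, "t_max": 55},
    {"factors": (64, 4),  "t_min": 70, "t_max": 110},
    {"factors": (128, 2), "t_min": 80, "t_max": 95},
]
w = (0.8, 0.2)  # weight scheme, learned as in Section 5.3
print(predict((70, 2), profiles, w))  # -> (40, 55)
```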

5.3. Weighting Scheme. The weighting scheme determines the impact of the affecting factors on the similarity measure. In practice, different machines may have different weighting schemes because of their different hardware. For example, on a CPU-rich machine, IO may become the bottleneck of task execution, and therefore IO-related factors are more important, while IO-rich machines may favor CPU-related factors. To address the issue, we apply a quadratic programming method to set the weighting dynamically.

Quadratic programming is a mature method for weight-scheme setting in machine learning [15]. Our aim is to minimize the differences between the similarities calculated from our WED and the real differences obtained by comparing real execution times.

Table 2: A rough weight scheme for job wordcount.

Factors                        Weight
Job name
Work node
Input size                     34.5
Environment:
  Task 1 (randomwrite)         33.6
  Task 2 (wordcount map)       15.8
  Task 3 (BBP)                 14.1
Others                         2

Suppose there are n task records on the node and each has q affecting factors. The problem's constraints are presented in formulas (2) and (3), in which S_ijk is the similarity in the kth attribute between task i and task j, the sum of S_ijk W_k over k is the similarity between task i and task j calculated using formula (1), and R_ij is the real similarity between task i and task j calculated from their real execution times. L_ij is the amount by which the calculated similarity falls short of the real similarity, and M_ij is the amount by which it exceeds the real similarity:

\[ \text{Minimize} \quad \sum_{i=1}^{n} \sum_{j=i+1}^{n} \left( L_{ij}^{2} + M_{ij}^{2} \right) \tag{2} \]

\[ \text{Subject to} \quad \sum_{k=1}^{q} S_{ijk} W_{k} + L_{ij} - M_{ij} = R_{ij} \quad (i, j = 1, \ldots, n;\ i < j) \tag{3} \]

\[ R_{ij} = 1 - \frac{\min\left( \left| t - t_{\max} \right|, \left| t_{\min} - t \right| \right)}{\max\left( t, t_{\mathrm{avg}} \right)}, \tag{4} \]

where t is the real execution time, the prediction time is [t_min, t_max], and t_avg is the average time of the prediction.

As there is no comprehensive analysis of the complexity of quadratic programming problems, our analysis can only indicate how large this quadratic programming problem is, not its actual complexity. From the above formalization, the quadratic programming problem has n(n − 1) + q variables and n(n − 1) constraints.
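Note that, because the slack pair (L_ij, M_ij) simply absorbs the residual between the calculated and real similarities, minimizing the sum of their squares is achieved exactly when the squared residuals are minimized; under that observation, the program reduces to an ordinary least-squares fit of W. A sketch of this reduction, with illustrative numbers rather than measured similarities:

```python
import numpy as np

# Each row of S holds the per-factor similarities S_ijk for one task pair
# (i, j); r holds the corresponding real similarities R_ij from formula (4).
# The constraint fixes L_ij - M_ij = R_ij - sum_k S_ijk * W_k, so minimizing
# sum(L^2 + M^2) over W amounts to minimizing the squared residuals.
S = np.array([[0.9, 0.2, 0.7],   # illustrative values, not measured data
              [0.4, 0.8, 0.5],
              [0.6, 0.6, 0.9],
              [0.8, 0.3, 0.4]])
r = np.array([0.85, 0.55, 0.70, 0.65])

w, *_ = np.linalg.lstsq(S, r, rcond=None)
print("learned weight scheme:", np.round(w, 3))
```

A full quadratic programming solver would also let one add constraints such as nonnegative weights, which the least-squares shortcut does not enforce.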

The general factors used in calculating similarity include job name, node name, input size, workload, output size, and the parallel tasks running on the same node. Table 2 shows a rough weight scheme for the wordcount job on one node. From the table, we can see that the input size and IO-heavy parallel tasks such as randomwrite influence the execution time most.

6. Risk-Aware Assignment

In this section, we introduce our risk-aware assignment algorithm, which attempts to benefit from the uncertainty of task executions. We first characterize the unique features and design overview of task assignment under uncertainty, then present the theory of decision making under uncertainty, which is the theoretical foundation of our risk-aware assignment, and finally introduce our assignment algorithm based on that theory.


[Figure 8: Flowchart of risk-aware assignment: starting from finding the straggler, we find available nodes, make predictions, and apply risk-aware decision making; if the profit is positive and maximal, the backup task is assigned, and the original task is additionally killed when the speculative execution is determined to succeed.]


6.1. Design Overview. The method proceeds as follows. In the first step, by comparing the average execution times of all running tasks with their elapsed running times, we find the straggler. In step two, we check all the available work nodes. In step three, we predict the execution time of the straggler running on each available node. In step four, with the help of the theory of decision making under risk, the assignment with the highest expected profit is chosen for the speculative execution. If the new assignment can guarantee speculative success, the original task can be killed. Figure 8 is a flowchart of our risk-aware assignment design. The main challenge is how to leverage the theory of risky decision making to assign tasks.

6.2. Theory of Risk Decision. Since our prediction is a range of time, how to compare two ranges of time is another issue for decision making. A direct method is to compare the average times of the two ranges, the distance between the two averages being the time profit or loss. The average time works well when the time profit is uniform per unit of time; however, this assumption is broken in MapReduce. As we mentioned in Section 3, because of the existence of the straggler, the time profit is related only to the last finish time and cares nothing about any time before it. In other words, even if some tasks finish faster, this does not help the execution time of the job, because the job finishes only after all its tasks have executed. Under this condition, the time profit differs per unit of time, and decision making under risk is introduced.

In many fields, risky choice and the selection criterion are what people seek to optimize. Risky decision making has two parts: risk assessment, which evaluates risk or benefit, and risk management, which takes appropriate actions to control risk and maximize profit. The expected value criterion, also called the Bayesian principle, incorporates the probabilities of the states of nature, computes the expected value under each action, and then picks the action with the largest expected value. We can calculate the potential value of each option with the following equation:

\[ E(X) = \sum_{x} P(x)\, U(x), \tag{5} \]

where E(X) stands for the expected value. The probability P(x) of an event x indicates how likely that event is to happen, and U(x) denotes the profit-and-loss value of x. The summation ranges over every number x that is a possible value of the random variable X.

6.3. Profit Function and Unique Feature. First, the function P(x) is easy to obtain from historical statistics. As mentioned before, P(x) is the probability of the occurrence of state x. In our case, P(t_i, t_j) represents the probability that the real execution time falls in the range [t_i, t_j]. This probability can be learned from historical statistics. For example, the output of the prediction is a set T = {t_min, ..., t_i, ..., t_j, ..., t_max}, in which the elements are sorted by value. Since the prediction is a range of time with a probability distribution, we can exploit that distribution to compute P(t_i, t_j) with the following equation:

\[ P(t_i, t_j) = \frac{\left| [t_i, t_j] \cap T \right|}{\left| T \right|}, \tag{6} \]

where |T| is the number of elements in the set T and |[t_i, t_j] ∩ T| is the number of elements falling into the range [t_i, t_j].

Next, we take an example to explain our definition of the

profit function U(x). As shown in Figure 9(a), the execution time of a task is inherently uncertain, and the assignment scheme should be aware of the potential risks. In Figure 9(b), because of the existence of the straggler, the time profit is related only to the last finish time, denoted as t_s. Under this condition, the time profit differs per unit of time and is divided into two parts, part one [t_i, t_s] and part two [t_s, t_j], plotted in Figure 9(c). The profit function U(x) can be calculated as

\[ U(t_i, t_j) = \begin{cases} \operatorname{avg}(t_i, t_s) - t_s, & [t_i, t_s], \\ \operatorname{avg}(t_i, t_j) - t_s, & [t_s, t_j], \end{cases} \tag{7} \]

where avg(t_i, t_s) is the average time over [t_i, t_s] and avg(t_i, t_j) is the average time over [t_i, t_j].

Let T denote an assignment option, and let T_1 and T_2 denote the two ranges of time divided by t_s. The expected profit can be calculated as

\[ E(T) = P(T_1)\, U(T_1) + P(T_2)\, U(T_2). \tag{8} \]
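A sketch of how formulas (6)-(8) could be combined in code, assuming the prediction is available as a list of historical execution-time samples. The sign convention here is chosen so that a positive value means the backup is expected to beat the straggler's finish time t_s (boundary handling at exactly t_s is simplified, and all names and numbers are ours):

```python
def probability(samples, lo, hi):
    # P(t_i, t_j) of formula (6): the fraction of historical execution
    # times that fall inside the range [lo, hi].
    return sum(lo <= t <= hi for t in samples) / len(samples)

def avg(samples, lo, hi):
    # Average of the historical times falling inside [lo, hi].
    inside = [t for t in samples if lo <= t <= hi]
    return sum(inside) / len(inside)

def expected_profit(samples, t_s):
    # Formula (8) with the two parts of formula (7): time saved when the
    # backup finishes before t_s, time lost when it finishes after.
    t_min, t_max = min(samples), max(samples)
    profit = 0.0
    if t_min < t_s:  # part one: the backup wins
        profit += probability(samples, t_min, t_s) * (t_s - avg(samples, t_min, t_s))
    if t_max > t_s:  # part two: the backup loses
        profit -= probability(samples, t_s, t_max) * (avg(samples, t_s, t_max) - t_s)
    return profit

backup = [30, 35, 40, 45, 70]             # predicted times for a backup (illustrative)
print(expected_profit(backup, t_s=50))    # 6.0 > 0: launching the backup pays off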


Input: available Nodes N = (n_1, n_2, n_3, ...)

(1)  find the straggler task s by prediction
(2)  for each available node n_i in N do
(3)      make prediction T = [t_min, ..., t_i, ..., t_max]
(4)      compute profit P_i ← E(T)
(5)      record the maximal P_i and its node n
(6)  end for
(7)  if remaining time t_left > the worst reexecution time t_max then
(8)      kill task: Task.kill(s)
(9)  end if
(10) assign backup task: Task.assign(s, n)

Algorithm 2: Algorithm for risk-aware assignment.
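A Python rendering of Algorithm 2, reusing expected_profit from the sketch above; the predict_times callback, the node list, and the kill stub are placeholders for the corresponding Hadoop-side mechanisms, not the paper's implementation:

```python
def kill(task):
    # Placeholder for the JobTracker killing the original attempt.
    print("kill original attempt of", task)

def risk_aware_assign(straggler, nodes, predict_times, t_s, t_left):
    # Pick the node whose backup has the largest positive expected profit;
    # when even the backup's worst case beats the straggler's remaining
    # time, speculative success is guaranteed and the straggler is killed.
    best_node, best_profit, best_samples = None, 0.0, None
    for node in nodes:
        samples = predict_times(straggler, node)  # Section 5 prediction
        p = expected_profit(samples, t_s)
        if p > best_profit:
            best_node, best_profit, best_samples = node, p, samples
    if best_node is None:
        return None                 # no profitable backup: do nothing
    if t_left > max(best_samples):
        kill(straggler)             # backup is certain to win
    return best_node                # launch the backup task here

# Hypothetical usage: two candidate nodes with different predicted ranges.
preds = {"node-a": [30, 35, 40, 45, 70], "node-b": [60, 65, 70]}
choice = risk_aware_assign("task_42", list(preds), lambda s, n: preds[n],
                           t_s=50, t_left=55)
print("launch backup on:", choice)  # -> node-a
```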

[Figure 9: An example of different time profit per unit. The time profit is divided by t_s into two parts, which are considered separately. (a) Execution times of the straggler and optional backup tasks; (b) a prediction range [t_i, t_j] around the straggler's finish time t_s; (c) part one [t_i, t_s] and part two [t_s, t_j]; (d) the special case where the straggler can be killed. Legend: straggler, optional backup tasks, average execution time.]

If E(T) > 0, the assignment is a profitable option, since it is likely to reduce the completion time of the job. If there are multiple options for the backup task assignment, the one with the maximal E(T) is the best choice.

A special case is when t_max is less than the best time of the straggler, as shown in Figure 9(d). In this case, the straggler can be killed immediately to release its resources.

The detailed algorithm is shown in Algorithm 2. We first compare the average execution times of all running tasks with their elapsed running times to find the straggler. We then check all the available work nodes and predict the execution time of the straggler on each available node. Finally, with the help of the theory of decision making under risk, the assignment with the highest expected profit is chosen for the speculative execution; if the new assignment can guarantee speculative success, the original task is killed.

Table 3: Our testbed used in evaluation.

VMs     Map slots  Reduce slots  VM configuration
Node 1  0          1             1 core, 2 GB RAM
Node 2  4          0             4 cores, 2 GB RAM
Node 3  4          4             4 cores, 2 GB RAM
Node 4  4          4             4 cores, 2 GB RAM
Node 5  1          1             4 cores, 2 GB RAM
Total   13         10            17 cores, 10 GB RAM

7. Evaluation

We have conducted a series of experiments to evaluate the performance of RiskI and LATE on a variety of combinations of job sizes. Since environmental variability may result in high variance in the experimental results, we also performed evaluations on different heterogeneous environment configurations. We ran our first experiment on a small private cluster with 5 working nodes and a master. Our private cluster occupies two whole Dell PowerEdge R710 machines; each machine has 16 Xeon E5620 2.4 GHz cores, 12 GB DDR3 RAM, and 4 × 2.5 TB disks. We use the Xen virtualization software to manage virtual machines. Table 3 lists the hardware configurations of our testbed. We installed our modified version of Hadoop 0.21.0 on the cluster, and the block size was configured as 64 MB. Different numbers of task slots were set on each node according to the diversity of VM configurations.

7.1. Performance on Different Job Sizes. To show the advantage of RiskI, we first ran some simple workloads. In this configuration, all jobs in a round were set to one size. We evaluated three types of job size: small, medium, and large. A small job contains only 2 maps, a medium job has 10 maps, and a large job has 20 maps. We chose the Bailey-Borwein-Plouffe (BBP) job for evaluation; there are 20 jobs in total, with an arrival interval of 60 s.

Figure 10(a) shows a CDF of job execution time for large jobs. We see that about 70% of the large jobs are significantly improved under our scheduling. Figure 10(b) illustrates the corresponding running details, including the total number of speculative executions and those that succeeded or were killed.


[Figure 10: CDFs of execution times of jobs at various job sizes. RiskI greatly improves performance for all job sizes, especially for large and medium jobs. (a) Large jobs; (b) statistics of speculative execution for large jobs, including backup tasks failed, succeeded, and killed, for LATE and RiskI; (c) medium jobs; (d) small jobs. (Axes: CDF versus time (s); curves: LATE, RiskI.)]

From Figure 10(b), we find that RiskI brings two main benefits to scheduling. First, with the help of estimation, we are able to cut off unnecessary speculative executions and thus release more resources: since most speculative executions failed under LATE, the total number of speculative executions was reduced from 181 to 34 under RiskI. Second, the success ratio of speculation increased: in our scheduler, only 6 of the 34 speculative executions failed, a success ratio of 81%.

Figures 10(c) and 10(d) plot the results for medium and small job sizes. The performance on large jobs stands out from that on smaller jobs because small jobs cause less resource contention under low system workload.

7.2. Performance on Mixed Job Sizes. In this section, we consider a workload with a more complex job combination. We use Gridmix2 [37], a default benchmark for Hadoop clusters, which generates an arrival sequence of jobs from a mix of synthetic jobs covering the typical possibilities.

In order to make the synthetic workload simulate a real trace, we generated job sets that follow the Facebook trace used in the evaluation of delay scheduling [33].

Table 4: Distribution of job sizes.

Job size  Maps  Reduces  Jobs  Jobs (%)
Small     2     2        19    38
Medium    10    5        19    38
Large     20    10       12    24
Total     468   253      50

Delay scheduling divided the jobs into 9 bins of different job sizes, each with its own proportion. For simplicity, we generalize these job sizes into 3 groups: small, medium, and large. Jobs in bin 1 take up about 38% of that trace, and we assign the same proportion to the small jobs. We grouped bins 2-4 as medium jobs, which also hold 38% in total. The remaining bins are grouped into large jobs, accounting for about 24%. Table 4 summarizes the job size configuration.


[Figure 11: The Gridmix workload has 50 jobs in total. The first 13 jobs are medium size; jobs 14-23 are small; small and large jobs arrive alternately from job 26 to job 40; large jobs appear consecutively from job 41 to job 45; and the last 5 jobs are medium.]

Table 5: The configuration details for BBP, including start digit, end digit, and the workload per map task.

Job size  Start digit  ndigits  Workload/map
Small     1            25000    156283174
Medium    1            55000    151332558
Large     1            70000    122287914

Besides the job size proportions, the arrival interval is also very important in generating a job arrival trace, since it may lead to different scheduling behaviors and subsequently affect system throughput. In our experiments, we set the arrival interval to 60 s and 120 s. Figure 11 gives the generated job sequence.

Finally, we ran the workload described above under two schedulers: LATE (with the Hadoop default FIFO scheduler) and RiskI. Each job was submitted by a single user.

7.2.1. Results for BBP. The detailed configuration of the BBP jobs is shown in Table 5. Figure 12 plots the execution time of each job with interval 60 s.

For ease of analysis, we divided the job sequence into parts. The first part is from job 1 to job 13, where the jobs are medium. The second part is from job 14 to job 23; the jobs in this part are all small. Jobs 24 to 40 form the third part, consisting mainly of small and large jobs. The last 10 jobs are equally divided into two parts, large and medium, respectively.

In the first part (M), the response time of jobs accumulates from the beginning because the workload gradually becomes heavier. The resource contention led to more stragglers and more speculative executions, and more resources were used for backup tasks, so the following jobs had to wait until resources were released. However, for lack of a delicate scheduling design, most backup tasks failed: the system sacrificed throughput but gained no benefit. From Figure 12(b), we can see that there are 48 backup tasks under LATE and only 19 succeeded. Compared to LATE, RiskI cut the number of backup tasks from 48 to 13. The statistics are plotted in Figure 12(c). Many invalid speculative executions are avoided, saving throughput. Since some backup tasks are certain to finish earlier, we kill the first attempt of such a task immediately in order to save resources. As a result, only 2 backup tasks failed. The same conditions occurred in the remaining 4 parts.

However, the situation is a little different when the job arrival interval is 120 s. We plot the execution time of each job with interval 120 s in Figure 13. Little difference exists in the first 4 parts. The reason is that the system workload is light, because each job has 120 s before the next job arrives, so the system has ample free resources for running backup tasks. We can see from Figure 13(b) that no backup task succeeded in the first two parts; even when a backup task failed, it caused no loss. In RiskI, since none of them would succeed, no backup task was launched for speculative execution at all, as plotted in Figure 13(c). Most of the difference lies in the part L, where we reduced the speculative execution count from 39 to 12, with only 1 failed speculative execution.

As a result, our scheduler performs much better with the 60 s interval than with the 120 s interval.

7.2.2. Results for Streamsort. We also evaluated the streamsort job by running the configured Gridmix2, though we had to modify Hadoop 0.21.0 to support the different arrival intervals. The performance of sort is likewise improved more when the workload is heavy, since RiskI not only shortens the response time of jobs but also kills some tasks to release resources earlier. When the workload is light, our scheduler only brings the advantage of reduced job response time.

Another conclusion is that the average improvement for BBP is better than for sort. The main reason is that, compared with BBP, the task prediction for streamsort has a longer range, which means more uncertainty.

7.3. Impact of the Degree of Heterogeneity. We change the testbed and Hadoop configurations to evaluate RiskI in different execution environments. We ran our experiments in two other environments: the original environment above is denoted as testbed 1, and the new environments


[Figure 12: Details of execution information with interval 60 s. (a) Execution time for the 50 jobs under LATE and RiskI, by job index over the parts M, S, S, L, L, M; (b) speculative execution counts for LATE (succeeded, failed) per part; (c) speculative execution counts for RiskI (succeeded, failed, killed) per part.]

are denoted as testbeds 2 and 3. Testbed 3 is 4 times more heterogeneous than testbed 2 and 8 times more than testbed 1; we measure heterogeneity by the number of virtual machines on the same physical machine. The results are shown in Figure 14. When the environment is more heterogeneous, the performance of RiskI degrades much less than that of LATE.

7.4. Stability of Prediction. Since we make our prediction by searching for similar executions of a task in history, we have to demonstrate that similar executions have stable response times. Obviously, stable response times make our prediction meaningful; otherwise, our prediction is useless for scheduling. Therefore, we conducted the following experiments to validate our prediction.

First, we selected a four-slot work node, which implies that at most four tasks can run on the node in parallel. We then investigated the task running time under all possible execution conditions. These conditions consist of various combinations of tasks selected from wordcount, randomwrite, and Bailey-Borwein-Plouffe, so there are 9 conditions in total. Figure 15(a) shows our results for the running time of a task from the BBP job; the task ran 20 times in every environment. We can see that the execution time is relatively stable within a range. For further analysis, Figure 4(a) shows the CDF of the response time distribution between the longest and shortest times under one


[Figure 13: Details of execution information with interval 120 s. (a) Execution time for the 50 jobs under LATE and RiskI, by job index over the parts M, S, S, L, L, M; (b) speculative execution counts for LATE (succeeded, failed) per part; (c) speculative execution counts for RiskI (succeeded, failed, killed) per part.]

condition. It demonstrates that most execution times are centrally distributed. The results for the task from wordcount, shown in Figures 15(b) and 4(b), confirm the same conclusion.

Figure 16 shows how the prediction evolves as the number of executions increases. We selected the two conditions under which the prediction is most stable and most unstable, and the evolutionary process of the prediction is plotted.

8. Conclusion

We propose a framework named RiskI for task assignment in a fault-tolerant data processing system. We suggest that, by finding the most similar task in history, a profile-based method can predict the execution time as a range of time. RiskI then makes risk-aware task assignments, seeking more profit while reducing the risk introduced by uncertainty. Extensive evaluations have shown that RiskI provides good performance in all conditions, and it performs much better as the system exhibits more heterogeneity. We believe that the performance of RiskI can improve further if we better understand the behavior of the system and find better similarity algorithms.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

[Figure 14: CDFs of execution time in different testbeds (LATE vs. RiskI): (a) testbed 1; (b) testbed 2; (c) testbed 3.]

[Figure 15: Execution time prediction (time in seconds vs. background index): (a) task prediction for BBP; (b) task prediction for wordcount.]

[Figure 16: Prediction evolution (time in seconds vs. number of executions): (a) best case; (b) worst case.]

Acknowledgments

This paper is supported by NSFC no. 61272483 and no. 61202430 and by National Lab Fund no. KJ-12-06.

References

[1] K. V. Vishwanath and N. Nagappan, "Characterizing cloud computing hardware reliability," in Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC '10), pp. 193–203, Citeseer, June 2010.
[2] P. Gill, N. Jain, and N. Nagappan, "Understanding network failures in data centers: measurement, analysis, and implications," ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 350–361, 2011.
[3] Q. Zheng, "Improving MapReduce fault tolerance in the cloud," in Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, Workshops and Phd Forum (IPDPSW '10), pp. 1–6, IEEE, April 2010.
[4] Q. Zhang, L. Cheng, and R. Boutaba, "Cloud computing: state-of-the-art and research challenges," Journal of Internet Services and Applications, vol. 1, no. 1, pp. 7–18, 2010.
[5] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[6] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, pp. 29–42, 2008.
[7] G. Ananthanarayanan, S. Kandula, A. Greenberg et al., "Reining in the outliers in map-reduce clusters using Mantri," in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, pp. 1–16, USENIX Association, 2010.
[8] B. Hindman, A. Konwinski, M. Zaharia et al., "Mesos: a platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, pp. 22–22, USENIX Association, 2011.
[9] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," ACM SIGOPS Operating Systems Review, vol. 41, no. 3, pp. 59–72, 2007.
[10] M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10–10, USENIX Association, 2010.
[11] A. Snavely, N. Wolter, and L. Carrington, "Modeling application performance by convolving machine signatures with application profiles," in Proceedings of the IEEE International Workshop on Workload Characterization (WWC-4 '01), pp. 149–156, IEEE, 2001.
[12] R. Uhlig, G. Neiger, D. Rodgers et al., "Intel virtualization technology," Computer, vol. 38, no. 5, pp. 48–56, 2005.
[13] P. Jalote, Fault Tolerance in Distributed Systems, Prentice-Hall, 1994.
[14] A. Salman, I. Ahmad, and S. Al-Madani, "Particle swarm optimization for task assignment problem," Microprocessors and Microsystems, vol. 26, no. 8, pp. 363–371, 2002.
[15] M. Frank and P. Wolfe, "An algorithm for quadratic programming," Naval Research Logistics Quarterly, vol. 3, no. 1-2, pp. 95–110, 1956.
[16] R. Jain, D. Chiu, and W. Hawe, A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer System, Eastern Research Laboratory, Digital Equipment Corporation, 1984.
[17] S. Balsamo, A. Di Marco, P. Inverardi, and M. Simeoni, "Model-based performance prediction in software development: a survey," IEEE Transactions on Software Engineering, vol. 30, no. 5, pp. 295–310, 2004.
[18] L. Kleinrock, Queueing Systems, Volume 1: Theory, Wiley-Interscience, 1975.
[19] E. Lazowska, J. Zahorjan, G. Graham, and K. Sevcik, Quantitative System Performance: Computer System Analysis Using Queueing Network Models, Prentice-Hall, 1984.
[20] K. Kant and M. Srinivasan, Introduction to Computer System Performance Evaluation, McGraw-Hill College, 1992.
[21] K. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Wiley, New York, NY, USA, 2002.
[22] J. Banks and J. Carson, Discrete-Event System Simulation, Prentice-Hall, 1984.
[23] M. Marsan, G. Balbo, and G. Conte, Performance Models of Multiprocessor Systems, MIT Press, 1986.
[24] F. Baccelli, G. Balbo, R. Boucherie, J. Campos, and G. Chiola, "Annotated bibliography on stochastic Petri nets," in Performance Evaluation of Parallel and Distributed Systems: Solution Methods, vol. 105, CWI Tract, 1994.
[25] M. Harchol-Balter, "Task assignment with unknown duration," Journal of the ACM, vol. 49, no. 2, pp. 260–288, 2002.
[26] D. W. Pentico, "Assignment problems: a golden anniversary survey," European Journal of Operational Research, vol. 176, no. 2, pp. 774–793, 2007.
[27] C. Shen and W. Tsai, "A graph matching approach to optimal task assignment in distributed computing systems using a minimax criterion," IEEE Transactions on Computers, vol. 100, no. 3, pp. 197–203, 1985.
[28] M. Harchol-Balter, M. E. Crovella, and C. D. Murta, "On choosing a task assignment policy for a distributed server system," Journal of Parallel and Distributed Computing, vol. 59, no. 2, pp. 204–228, 1999.
[29] R. I. Davis and A. Burns, "A survey of hard real-time scheduling for multiprocessor systems," ACM Computing Surveys, vol. 43, no. 4, article 35, 2011.
[30] Y. Jiang, X. Shen, J. Chen, and R. Tripathi, "Analysis and approximation of optimal co-scheduling on chip multiprocessors," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT '08), pp. 220–229, ACM, October 2008.
[31] R. Bunt, D. Eager, G. Oster, and C. Williamson, "Achieving load balance and effective caching in clustered web servers," in Proceedings of the 4th International Web Caching Workshop, April 1999.
[32] P. B. Godfrey and I. Stoica, "Heterogeneity and load balance in distributed hash tables," in Proceedings of the IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM '05), vol. 1, pp. 596–606, IEEE, March 2005.
[33] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling," in Proceedings of the 5th European Conference on Computer Systems (EuroSys '10), pp. 265–278, ACM, April 2010.
[34] http://wiki.apache.org/hadoop/WordCount
[35] A. Verma, L. Cherkasova, and R. H. Campbell, "ARIA: automatic resource inference and allocation for MapReduce environments," in Proceedings of the 8th ACM International Conference on Autonomic Computing, pp. 235–244, ACM, June 2011.
[36] J. de Leeuw and S. Pruzansky, "A new computational method to fit the weighted Euclidean distance model," Psychometrika, vol. 43, no. 4, pp. 479–490, 1978.
[37] http://hadoop.apache.org/docs/mapreduce/current/gridmix.html


Page 2: Research Article Risk Intelligence: Making Profit from ...downloads.hindawi.com/journals/tswj/2014/398235.pdf · scale data processing systems such as Mesos [ ], Dryad [ ], and Spark

2 The Scientific World Journal

Input Task 119909 Node 119886Output Prediction time [119905min 119905max] weighting scheme 119882119896(1) 119889 larr 0 Task 119911(2) for each Task 119910 in database profile do(3) compute 119889 (119909 119910)(4) if 119889 (119909 119910) gt 119889 then(5) 119889 larr 119889 (119909 119910)(6) 119911 larr 119910(7) end if(8) end for(9) 119879 (119911) larr [119905min 119905max](10) Running Task 119909(11) 119879 (119909) larr 119905(12) compute 119877119894119895(13) while not the minimum sum

119899

119894=1sum119899

119895=119894+1(119871 119894119895

2+ 119872119894119895

2) do

(14) adjust weighting scheme 119882119896 = (119908119894 119908119895 119908119896 119908119897 )(15) end while

Algorithm 1 Algorithm for task execution time prediction

execution time information but on the cost of the schedulingoptimality The widely adopted virtualization technology(eg [12]) makes the problem even more challenging

Second our later empirical experiments (in Section 3)will demonstrate that the task execution time in such a largescale is not a precise time but naturally contains uncertaintiesIt can only be confined to a range of time even when the sametask is executed in the same environment

Third assuming the execution time is available for everytask a slower task is not necessary the straggler because ofthe different launch time of tasks [5] The very nature of theMapReduce environment makes trouble in finding correctstragglers The successful identification of stragglers requiresmuch more design intelligence [6]

To well address these challenges and provide an efficientfault tolerance scheme in this paper we propose a novelapproach that exploits the very nature of the MapReduce-like large scale data processing systems Our approach calledRiskI is inspired by two observations First in such extremescale data processing systems jobs are periodically executedand thus we can collect the history information of a task topredict its future execution time Second the task executiontime is naturally uncertain and thus we can explore therisk management theory to assign tasks For example afaster node in average is not necessary to be a better nodeunless it can reduce the execution time for the whole jobTo summarize the main contributions of this paper are asfollows

First we collect real traces from aMapReduce productionsystem and find the uncertain nature of task execution timein such systems We conduct comprehensive experiments toinvestigate the impact factors of the task uncertainties

Second we define a similarity measure to quantifythe difference between executed tasks in system based onwhich we design a task execution time prediction algorithm(Algorithm 1) Note that the output of the prediction is arange of time with probability distribution rather than aprecise time

Third being aware of the task execution uncertainty wedesign a risk management based task assignment algorithmand implement it on top of a Hadoop 0210 productionsystem We conduct comprehensive experiments to evaluatethe performance Compared with the state-of-art schedulingalgorithm LATE [6] the job response time can be improvedby up to 46with the same system throughput and comparedwith the nativeMapReduce schedulingwith no fault tolerancecapability the performance degradation of our scheduling isneglected small

The remainder of the paper is organized as followsIn Section 2 we give a review of the related works onperformance prediction and task scheduling in distributedsystems In Section 3 we first introduce the backgroundinformation of modern extreme scale data processing sys-tem and motivation of this work In Section 4 we presentour empirical study results revealing the very nature ofthe task execution uncertainty We present our design ontask execution time prediction and the risk managementbased scheduling in Sections 5 and 6 respectively We willpresent the performance evaluation in Section 7 and draw aconclusion with the future work directions in the last section

2 Related WorkFault tolerance and task assignments have been widely stud-ied especially in traditional distributed system [13] Researchtopics mainly include performance modeling and prediction[11] task assignment [14] heuristic algorithm [15] resourceallocation [8] and fairness [16] In our work wemainly focuson two aspects namely the performance prediction and taskassignment

Performance modeling is a conventional way of predict-ing the execution time of tasks in advance of executions [17]The design philosophy is to have a comprehensive under-standing on machine capabilities and application featuresso that the execution behavior of the machines can be wellcharacterized Snavely et al [11] proposed a framework for

The Scientific World Journal 3

performance modeling and prediction on large HPC systemTheir work focused on characterizing machine profiles andapplication signatures Prediction is made by a proper convo-lution method Modeling-based prediction works well underthe condition of comprehensive understanding the machinesand applications complicated convolution method Thesefactors above may have the effect on prediction accuracySome works employed queuing network models [18ndash21] torepresent and analyze resource sharing systems The modelis a collection of interacting services centers representingsystem resources and a set of customers representing theusers sharing the resources Another model is a simulationmodel [22] and it is actually the most flexible and generalanalysis technique The main drawback is its developmentand execution cost Petri nets can also be used to answerperformance-related questions since they can verify thecorrectness of synchronization between various activities ofconcurrent systems [20 23 24]

Our work differs significantly from this literature as wemake our prediction based on historical execution recordBecause of considering a heterogeneous environment inMapReduce getting such information is much costly anddifficult First collecting and analyzing sufficient charac-terizations of the ability of the machine are an inefficientwork It needs to run low-level benchmarks for gatheringperformance attributes It may be applicable for HPC sys-tem where the environment is homogeneous but it is notpractical in practical heterogeneous environments Secondvirtualization in computer makes some resource sharinginvisible Third capturing and analyzing application featureswill introduce extra overhead So it is rarely possible toemploy modeling-based method to predict execution time oftask in MapReduce anymore

Task assignment problem is also a very active researcharea especially in traditional distributed system [25ndash28]Much of works leverage precise time prediction for schedul-ing decision making [29 30] Other works concentrate onthe target of load balance [31 32] However in heterogeneousenvironment the blind pursuit of load balance does not bringbetter performance any more

Real-time scheduling [29] is relative work Most worksemploy theworst time for scheduling tomeet the requirementof real time and avoid the loss of uncertainty Our work isalso related to coscheduling [30] in multicore system par-ticularly with processor heterogeneity Jiang et al proposeda scheduling police based on prior knowledge that accuratethreads execution times were known to the system Withthis advanced knowledge they can make better schedulingdecisions In contrast our scheduling has no such precise andglobal information Instead of making scheduling decisionsbased on precise time estimation we take into account thefact that our prediction is a range of time

Task assignment problem in MapReduce is still in thebeginning since MapReduce adopted a simple schedulingpolice which is easy to implement The master will assigna task to a tasktracker without considering its ability andpossible execution time It is only to lunch the task asearly as possible Run-time stragglers have been identifiedby past work For the sake of reducing the response time of

job LATE [6] schedules a duplicate task for the stragglerwhich is named as speculative execution However sinceno distinction is made between tasktrackers LATE cannotguarantee the speculative execution to be successful So theeffect of speculative execution is by chance totally Delayscheduling [33] is another closed work we know of to ourown They begin to distinguish the tasktrackers with datalocality It schedules task as far as possible to the tasktrackerwho meets the need of data locally Mantri [7] uses theprogress rate instead of the task duration to remain agnosticto skews in work assignment among tasks

Our work is different from above as we recognized thedistinction of performance between working nodes Firstwe investigated how to estimate the response time for atask on different working nodes Instead of modeling-basedmethod we make our prediction through learning fromexperience Second since our prediction is a range of timewhich combines with confidence interval we employed themethod from risk of decision making to help us makescheduling decisions

3 Motivation

In this section we provide some background information ofthe MapReduce-like computing environments as the back-ground information We first describe the working mecha-nism of the MapReduce and elaborate how the speculativeexecutionmechanismworksWith simple examples we pointout the limitations in existing systems

31 MapReduce Background MapReduce is a typical rep-resentative of the modern parallel computing environmentthat provides extreme scale support Hadoop is an opensource implementation of MapReduce Besides MapReduceand Hadoop there are many other similar MapReduce-likecomputing frameworks such as theMesos [8] Dryad [9] andSpark [10] All these frameworks share a common idea

Roughly speaking a computational system contains amaster node and a number of working nodes The masternode splits the job to a number of nearly independent tasksand assigns them to working nodes As these tasks arenearly independent the parallelism can be fully exploitedAs illustrated in Figure 1 a computation job mainly has threestages the Map tasks for processing the raw data the shufflestage is to transfer the intermediate value and the Reducetasks that come out with the final results Each working nodehas a number of slots for running the tasks When a taskfinishes the slot will be freed for the other task usage

In extreme scale data processing systems the traditionalreactive fault tolerance scheme becomes ineffective becauseof the large number of failures and the long responseAnd thus MapReduce framework adopts a proactive faulttolerance scheme More specifically when a task in executionis much slower than other tasks in the same job speculativeexecution will be invoked to cope with this slowest task(called straggler [6]) In fact because there is little depen-dence between tasks the response time of a job is onlydetermined by its straggler To reduce the job response timeand provide efficient proactive fault tolerance we only need

4 The Scientific World Journal

Master

HDFS Work nodes

Work nodesHDFS

Map phase Shuffle phase Reduce phase

Figure 1 MapReduce computation framework that the job is splitto two kinds of tasks Map and Reduce Different tasks are nearlyindependent so that different tasks can be executed in a parallelmanner

to accelerate its straggler while not delaying the other taskstoo much so that new stragglers appear

Speculative execution is based on a simple principleWhen the straggler is identified a backup task will beexecuted for the straggler hoping the backup can be finishedearlier than the straggler As long as one wins that is eitherthe original straggler or the backup one finishes the otherwillbe killed It is essential to trade the system resources (and thusthe overall throughput) to fuel the executions of individualjobs

32 Challenges In practice efficient speculative executionfaces many challenges A crucial fact is that the backuptask is not necessary to be the winner (finishes earlier)Speculative execution may fail meaning that though thebackup is invoked the original straggler finishes earlierthan the backup When this happens we in effect wastethe resource allocated to backup task but make no benefiton response time In other words not all the speculativeexecutions are profitable We thus argue that the efficiencyof the speculative execution highly depends on the winingrate of the backup tasks defined as the percentage of thebackup task that is finished earlier than the straggler Themain objective of this paper then becomes to increase thewining rate of backup tasks

This is a very challenging issue in practice First stragglerscan only be identified when they have been executed for awhile and thus the backup task starts always at later timeThe backup must be sufficiently faster than the stragglerto be the winner This requires the accurate knowledge onthe execution time of both stragglers and backup tasks Inliterature however accurate prediction for task executionshas been an open challenge for decades in traditional parallelsystems It is partially because in traditional approaches [11]it is assumed that we have the comprehensive understanding

of the machine capability and the job workloadThis requiresa great deal of measurement efforts which are prohibitivelyhigh in a highly dynamic and heterogeneous environmentas we are in The application of virtualization makes theproblem even challenging as resources become virtualizedin an invisible manner The real capability of each machine(virtual machine or virtual storage) is hard to measure

Second stragglers are difficult to identify We are facingthe dilemma that on one hand we desire an early iden-tification of the stragglers so that the backup tasks have alarger improvement space and their wining becomes easierOn the other hand later identifications of stragglers are morelikely to be accurate as the stragglers may change during theexecution

Third optimal speculative execution requires the preciseexecution time of tasks for example a task will be finishedin 15 seconds Unfortunately uncertainty is a nature inMapReduce The execution time of a task is within a rangeof time even under the same computing environmentsSuch an uncertainty is a double-blade sword On the onehand there will be no optimal assignment schemes in priorexecution and thus to find the optimal one in advanceis impossible On the other hand recall that the stragglerdetermines the response time of jobs and thus there will bemultiple assignment schemes that perform exactly the sameas the optimal assignment In other words even when ourassignment scheme is not the optimal one the performancemay not be degraded This property greatly eases the designchallenges In later sections we will show how to benefit fromthis by applying the theory of decisionmaking under risk anduncertainty

4 Observation

In this section we study the fundamental characteristics ofthe task execution in MapReduce environments We firstuse experimental results to reveal the uncertainty natureof task executions and then demonstrate the repeatableexecution property of tasks In the last we will show that by asimple similarity-based scheme the degree of execution timeuncertainty can be dramatically reduced

41 Uncertainty of Task Executions In this part we useexperimental results to reveal the uncertainty nature ofthe MapReduce We configure a standard Hadoop systemversion 0210 (Hadoop is an open sourced implementationof MapReduce) with four machines and run a WordCountapplication [34] In our experiments 128 Map tasks and onereduce task are executed and the results are presented inFigure 2

Figure 2 shows that the execution time ranges from 39 sto 159 s with the average of 80 s The standard deviation(Std) is 2742 s Note that the MapReduce system will equallysplit the Map tasks and thus the workload of these 128map tasks is nearly identical We thus argue that uncertaintyis a nature in MapReduce On the other hand even ifidentical tasks were executed in the identical machines or thesame task repeatedly running under the same environmentthe execution time is not identical but stable Figure 4(a)

The Scientific World Journal 5

1601401201008060402000 16 32 48 64 80 96 112 128

Tim

e (s)

Task index

Figure 2 Execution time of tasks There are 128 map tasks in totaland each execution time of task was plotted

shows the CDF of response time of the task which wasrepeatedly running for 20 times in the same condition (wechoose two tasks from two different jobs) We can figurethat the execution time is also uncertain By this nature thetraditional model-based task execution prediction algorithmis costly and will fail as the implicit assumption of thisalgorithm is that a task will behave the same under the sameenvironment However we also find that the execution time isstable whichmeansmost time is within a small range of timeSo we argue that a reasonable prediction for the task shouldbe a range of time with probability distribution

42 Repeated Execution of MapReduce Tasks In MapReduceenvironments most of the jobs are batch for extreme scaledata processing The spirit is that moving the program isless costly than moving the data [5] Consequently the samejobs are repeatedly invoked to process different data [35]For example Facebook Yahoo and eBay process terabytesof data and event logs per day on their MapReduce clustersfor spam detection business intelligence and various opti-mizations Moreover such jobs rarely update An interestingobservation here is that the executions of a job in the past canbe of great value for futureWe can build profiles for each taskand refer such profile to predict the job future executions

Besides the similarity between tasks of different jobsin some instance similarity also exits in tasks of the samejob Because the number of tasks may be more than thenumber of slots tasks may start and finish in ldquowavesrdquo [35]Figure 6 shows the progress of the Map and Reduce tasks of aWordCount job with 8GB dataThe 119909-axis is the time and the119910-axis is for the 34 map slots and 10 reduce slots The blocksize of the file is 64MB and there are 8GB64MB = 128 inputsplits As each split is processed by a different map task thejob consists of 64 map tasks As seen in the figure since thenumber of map tasks is greater than the number of providedmap slots the map stage proceeds in multiple rounds of slotassignment and the reduce stage proceeds in 3 waves Fromthe figure we can find that the tasks assigned to the sameworking node have the close execution time The similarityexists between different waves on each node in map stage

In Figure 2 we reorganize the execution time of tasks inFigure 3 while this time tasks executed in the same machinewill be grouped together Table 1 gives the maximal andminimal average and Std of these tasks in each group Com-pared with that in the original configuration the Std reduces

Node 1Node 2

Node 3Node 4

1601401201008060402000 16 32 48 64 80 96 112 128

Tim

e (s)

Task index

Figure 3 Reorganized execution time of tasks Tasks in the samenode were plotted together

Table 1 Execution time statistics for 128 map tasks from job word-count

Avg Std Max MinTotal 80 2742 159 32Node 1 10861 1846 159 80Node 2 9266 968 116 75Node 3 7361 561 84 63Node 4 435 494 54 32

dramatically for example by 33 as from 2742 to 1846 Thisis because the four machines have the different hardwareand tasks in different machines will experience the differentresource sharing conditions Nevertheless tasks in the samemachine will experience a more ldquosimilarrdquo environment Thisshows that a simple approach with the similarity of taskstaken into account can dramatically reduce the uncertaintyNext we will show how to exploit such similarity

It should be pointed out that the execution environ-ment we mentioned above includes two parts Except fromthe hardware configurations from different machines taskswhich are running parallel on the same execution unit (asknown as tasktracker in Hadoop) can also affect executionenvironment Based on different configuration one or moretasks will run on the same tasktracker which will causeresource contention and impact on execution time

43 Design Overview Keeping the uncertainty and taskssimilarity nature in mind in this paper we propose a novelproactive fault tolerance scheme inMapReduce environmentAs illustrated in Figure 5 the design named as RiskI isconstituted of two components (i) a profile-based executiontime prediction scheme and (ii) a risk-aware backup taskassignment algorithm The basic idea is that we build aprofile for each executed task in history and refer the profileof a similar task (in terms of its own property and theenvironment) to predict a new task in execution As theuncertainty is a nature that cannot be completely avoidedwe design a risk-aware assignment algorithm that can benefitfrom such an uncertainty by exploring the theory of decisionmaking under uncertainty Our goal is to maximize theassignment profiting and minimize the risk meanwhile

6 The Scientific World Journal

1

08

06

04

02

00 15 30 45 60

CDF

Time (s)

(a) CDF of execution time for task from BBP

1

08

06

04

02

00 10 2 0 30 40

CDF

Time (s)

(b) CDF of execution time for task from wordcount

Figure 4 Execution time

Jobs

Periodically running

ProfileMaximize

Minimize

profit

risk

Risk-awareassignment

RiskI system architecture

Profile-basedprediction

Execution time

Figure 5 RiskI has two main parts a profile-based executionalgorithm for task execution time prediction and a risk-awarebackup task assignment algorithm used to make scheduling basedon range of time

5 Execution Time Prediction

In this section we will introduce how to leverage historicalinformation to predict the execution time for tasks We firstpresent the architecture of the prediction algorithm and thenintroduce the components respectively In the last we use apseudocode to present details of the design

51 Prediction Algorithm Architecture The prediction of thejob execution is based on a simple idea For each executedtask we will maintain a profile to record its executioninformation When predicting a new task we refer to thesehistorical profiles and look for the most similar task Itsexecution time will then be our prediction result Noticethat due to the uncertainty nature of the job execution theoutput of our prediction is also a range of time indicating themaximum and minimum of the execution time rather thana single value

The prediction algorithm mainly consists of four stepsas illustrated in Figure 7 We first make a similarity searchand look for the most similar task in history Tasks will thenbe assigned and executed according to our predictions As

3432

30282624

2220181614

12108642

0 50 100 150 200 250 300 350 400 450 500

Task

slot

s

Time (s)

First waveSecond wave

Third waveFourth wave

Suffle Sort ReduceNode 4

Node 3

Node 2

Node 1

Figure 6 ldquoWavesrdquo in execution Since the number of tasks is morethan task slots tasks were executed in ldquowavesrdquo In our example thereare 9 waves in node 4 and 3 waves in node 1

long as the execution is finished we will adjust some criticalcontrol parameters and maintain the profiles for the nexttasks Noticing that there is no absolute the same betweentasks the key in the prediction algorithm design is thesimilarity of definition for any given two tasks that can yieldthe best prediction results Next we will give details on thesesteps

52 Similarity Scheme The similarity of two tasks can beaffected by many factors such as the number of read bytesand the number of write bytes In our work we apply all thesefactors while our method has the adaptive capability thatfor factors of little impact they will be ignored automatically

The Scientific World Journal 7

Step one Step two

Step three Step four

Record andadjust

similarity

Similaritysearch

Runningtask

Makingprediction

Figure 7 Our prediction has four steps Upon a new task is inprediction we firstly look for the most similar task in history andthen make a prediction based on similar task Then we wait for thetask running and record the running information and adjust weightsetting scheme

This is done by applying an appropriate similarity schemethat is how to obtain the similarity based on these factors

In literature there are several approaches to measurethe similarity that measures the similarity between tasksin different aspects For example the weighted Euclideandistance (WED) [36] measures the actual distance betweentwo vectors the cosine similarity measures the directioninformation and the Jaccard similaritymeasures the duplica-tion ratio Considering the requirement of ourworkwe adoptWED and leave more design options for future work

Definition 1 Given two tasks 119883 and 119884 with their affectingfactor vectors 119909119895 and 119910119895 and a weight scheme 119908119895 thesimilarity between the two tasks can be calculated as follows

119889 (119883 119884) =2radic

119899

sum

119894=1

119908119894(119909119894 minus 119910119894)2 (1)

where 119899 is the number of affecting factors and 119908119895 is a weightscheme that reflects the importance of the different affectingfactors on the similarity

53 Weighting Scheme Weighting scheme determines theimpact of the affecting factors on the similarity measureIn practice different machines may have different weightingschemes because of the different hardware For example fora CPU-rich machine the IO may become the bottleneckof the task execution and therefore IO-related factors aremore important while IO-rich machines may desire CPU-related factors To well address the issue we apply a quadraticprogramming method to dynamically set the weighting

Quadratic programming is a mature method for weightscheme setting in machine learning [15] Our aim is tominimize the differences between the similarities calcu-lated from our WED and the real differences obtained

Table 2 A rough weight scheme for job wordcount

Factors WeightJob nameWork nodeInput size 345Environment

Task 1 (randomwrite) 336Task 2 (wordcount map) 158Task 3 (BBP) 141

Others 2

by comparing the real execution time Supposing thereare 119899 task records in the node and each has 119902 affectingfactors the constrains in the problem can be presentedin formulas (2) and (3) in which 119878119894119895119896 is the similarity onthe 119896th attribute between task 119894 and task 119895 sum

119902

119896=1119878119894119895119896119882119896 is

the similarity between task 119894 and task 119895 calculated usingformula (1) and 119877119894119895 is the real similarity between task 119894 andtask 119895 calculated by real execution time 119871 119894119895 is the value bywhich the calculated similarity is less than the real similarityand 119872119894119895 is the value by which the calculated similarity isgreater than the real similarity

Minimize119899

sum

119894=1

119899

sum

119895=119894+1

(1198712

119894119895+ 1198722

119894119895) (2)

Subject to119902

sum

119896=1

119878119894119895119896119882119896 + 119871 119894119895 minus 119872119894119895 = 119877119894119895

(119894 119895 = 1 119899 119894 lt 119895)

(3)

119877119894119895 = 1 minus

min 1003816100381610038161003816119905 minus 119905max

10038161003816100381610038161003816100381610038161003816119905min minus 119905

1003816100381610038161003816

max 119905 119905avg (4)

where 119905 is the real execution time and the prediction timeis [119905min 119905max] 119905avg presents the average time of prediction

As there is no comprehensive analysis of the complexityof quadratic programming problems our analysis only canindicate how large this quadratic programming problem canbe but not the actual complexity of the problem From theabove formalization the quadratic programming problemhas 119899 lowast (119899 minus 1) + 119902 variables and 119899 lowast (119899 minus 1) constraints

The general factors used in calculating similarity includejob name node name input size workload output size andthe parallel task that is running on the same node We figurea rough weight scheme for job wordcount in one node Weshowed it in Table 2 From the table we can see that the factorinput size and IO-heavy task like randomwrite influence theexecution time most

6 Risk-Aware Assignment

In this section we introduce our risk-aware assignmentalgorithm that attempts to benefit from the uncertainty oftask executions We first characterize the unique featuresand design overview of task assignment with uncertainty

8 The Scientific World Journal

Start

Findingstraggler

Findingavailable node

Risk-awaredecision making

Yes

Yes

No

NoCan speculative executiondetermined success

Assign backuptask

Assign backup taskand kill original task End

Makingprediction

Profit gt 0 andmaximum profit

Figure 8 Flowchart of risk-aware assignment

and then present the theory of decision making underuncertainty which is the theoretical foundation for our risk-aware assignment At last we introduce our assignmentalgorithm based on the theory

61 Design Overview Themethod is presented as follows inthe first step by comparing the average execution time ofall the running tasks and the passed running time we findout the straggler in step two we check out all the availablework nodes and make the execution time prediction in stepthree wemake the execution time prediction for the stragglerrunning on the available node in step four with the help ofthe theory of decision making under risk the most expectedprofiting assignment was chosen for speculative executionIf the new assignment can guarantee the speculative successthe original task can be killed Figure 8 is a flowchart of ourrisk-aware assignment design The main challenge is how toleverage theory of risk decision to help us assign task

62 Theory of Risk Decision Since our prediction is a rangeof time how to compare two ranges of time is another issueto help decision making A direct method is comparing theaverage time of two ranges and the distance of two averagevalues is the time profit or time loss The average time workswell in the condition where time profit is uniformed per unitof time However this assumption is broken in MapReduceAs we mentioned in Section 3 because of the existence ofstraggler the time profit is only related to the last finish timebut care nothing about the time before last finish time Inother words may be some tasks would finish faster it is stillhelpless for the execution time of the job because the jobwould be finished after all the tasks were executed Underthis condition time profit is different per unit time and thedecision making under risk was introduced

In many fields risky choice and the selection criterionare what people seek to optimize There are two parts

included in risk of decision making risk assessment andrisk management Risk assessment is in charge of evaluatingrisk or benefit and risk management takes responsibilityfor taking appropriate actions to control risk and maximizeprofit Expected value criterion is also called Bayesian princi-ple It incorporates the probabilities of the states of naturecomputes the expected value under each action and thenpicks the action with the largest expected value We cancalculate the potential value of each option with the followingequation

119864 (119883) = sum

119909

119875 (119909)119880 (119909) (5)

where 119864(119883) stands for the expected value Theprobability 119875(119909) of an event 119909 is an indication of howlikely that event is to happen In the equation 119880(119909) denotesthe profit and loss value of 119909 The summation range includesevery number of 119909 that is a possible value of the randomvariable 119883

63 Profit Function and Unique Feature First the function119875(119909) is easy to get from historical statistic As mentionedbefore 119875(119909) is the probability of the occurrence of state 119909 Inour case 119875(119905119894 119905119895) represents the probability that the real exe-cution time falls in the area between [119905119894 119905119895] This probabilitycan learn from statistic information in history For examplethe output of prediction is a set 119879 = 119905min 119905119894 119905119895 119905max inwhich the 119905119894 was sorted by its value Since the prediction is arange of time with a probability distribution we can exploitthe probability distribution to compute the 119875(119905119894 119905119895) with thefollowing equation

119875 (119905119894 119905119895) =

10038161003816100381610038161003816[119905119894 119905119895]⋂119879

10038161003816100381610038161003816

|119879|

(6)

where |119879| is the number of elements in the set119879 and |[119905119894 119905119895]⋂

119879| is the number of elements falling into range [119905119894 119905119895]Next we will take an example to explain our definition for

profit function 119880(119909) As shown in Figure 9(a) the executiontime of task is a nature uncertainty and the assignmentscheme should be aware of the potential risks In Figure 9(b)because of the existence of straggler the time profit is onlyrelated to the last finish time which was denoted as 119905119904 Underthis condition time profit is different per unit time and isdivided into two parts which are part one [119905119894 119905119904] and parttwo [119905119904 119905119895] and we plotted them in Figure 9(c) The profitfunction 119880(119909) can be calculated as

119880(119905119894 119905119895) =

avg (119905119894 119905119904) minus 119905119904 [119905119894 119905119904]

avg (119905119894 119905119895) minus 119905119904 [119905119904 119905119895]

(7)

where avg(119905119894 119905119904) is the average time of [119905119894 119905119904] and avg(119905119894 119905119895) isthe average time of [119905119894 119905119895]

Let 119879 denote an assignment option Let1198791 and 1198792 denotetwo ranges of time divided by 119905119904 The expected profit can becalculated as

119864 (119879) = 119875 (1198791)119880 (1198791) + 119875 (1198792) 119880 (1198792) (8)

The Scientific World Journal 9

Input available Nodes 119873 = (1198991 1198992 1198993 )(1) Finding the straggler task 119904 by prediction(2) for each available node 119899119894 in 119873 do(3) Make prediction 119879 = [119905min 119905119894 119905max](4) Compute profit 119875119894 larr 119864(119879)(5) Record the maximal 119875119894 and node 119899(6) end for(7) if remaining time 119905left gt the worst reexecution time 119905max then(8) kill task 119905119886119904119896 sdot 119896119894119897119897 (119904)(9) end if(10) assign backup task 119879119886119904119896119886119904119904119894119892119899 (119904 119899)

Algorithm 2 Algorithm for risk-aware assignment

Tim

e (s)

(a)

tjts

ti

(b)

Tim

e (s)

Part one

Part two

ts

StragglerOptional backup tasksAverage execution time

(c)

ts

Kill

StragglerOptional backup tasksAverage execution time

(d)

Figure 9 An example for different time profit per unit The timeprofit is divided into two parts So we considered separately thesetwo parts

If 119864(119879) gt 0 119864(119879) would be a better option since it islikely more close to reduce the completion time of a jobIf there are more options for backup task assignment themax 119864(119879) would be the best choice

A special case is that 119905max is less than the best time ofstraggler as shown in Figure 9(d) In this case the stragglercan be killed immediately to release resources

The detailed algorithm is shown in Algorithm 2 We firstcompare the average execution time of all the running tasksand the passed running time and then we find out thestraggler And then we check out all the available work nodesand make the execution time prediction In step three wemake the execution time prediction for the straggler runningon the available node In step four with the help of theoryof decision making under risk the most expected profitingassignment was chosen for speculative execution If the newassignment can guarantee the speculative success the originaltask can be killed

Table 3 Our testbed used in evaluation

VMs Map slots Reduce slots VM configurationNode 1 0 1 1 core 2G RAMNode 2 4 0 4 core 2 G RAMNode 3 4 4 4 core 2 G RAMNode 4 4 4 4 core 2 G RAMNode 5 1 1 4 core 2 G RAMTotal 13 10 17 core 10G RAM

7 EvaluationWe have conducted a series of experiments to evaluate theperformance of RiskI and LATE in a variety combination ofdifferent job size Since the environmental variability mayresult in high variance in the experiment result we alsoperformed evaluations on different heterogeneous environ-ment configuration We ran our first experiment on a smallprivate cluster which has 5 working nodes and a masterOur private cluster occupies two whole Dell PowerEdgeR710 machines Each machine has 16 Xeon E5620 24GHzCPUs 12GB DDR3 RAM and 4 times 25119879 disks We use Xenvirtualization software to manage virtual machines Table 3lists the hardware configurations of our testbed We installedour modified version of Hadoop 0210 to cluster and theblock size was configured as 64MB Different numbers oftask slots were set to each node according to diversity of VMconfigurations

71 Performance on Different Job Size To identify the pre-dominance of RiskI we first ran some simple workloads Inthis configuration all jobs were set with one size in a roundWe performed evaluation on three different types of job sizewhich are small medium and large respectively The smalljob only contains 2 maps the medium job has 10 maps andthe large job has 20 maps We chose Bailey-Borwein-Plouffe(BBP) job for evaluation and there are 20 jobs in total witharrival interval 60 s

Figure 10(a) shows a CDF of job execution time for largejobs We see that about 70 of the large jobs are significantlyimproved under our scheduling Figure 10(b) illustrates thecorresponding running details including the total numberof speculative executions those succeeded and being killed

10 The Scientific World Journal

0 400 800 1200 1600

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(a) Large jobs

200

150

100

50

0

RiskI

FailedSucceededKilled

LATE

(b) Statistics of speculative execution for large jobs includingbackup task failed succeeded and killed

100 250 400 550 700

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(c) Medium jobs

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

90 93 96 99 102

(d) Small jobs

Figure 10 CDFs of execution times of jobs in various job sizes RiskI greatly improves performance for all size of jobs especially for largeand medium jobs

From Figure 10(b) we can find that RiskI has two main ben-efits on scheduling First with the help of estimation we areable to cut off some unnecessary speculative executions thusrelease more resources Since most speculative executionswere failed in LATE the total speculative executions werereduced from 181 to 34 in RiskI Second the successful ratioof speculative increased In our scheduler only 6 speculativeexecutions failed out of 34 and the success ratio is 81

Figures 10(c) and 10(d) plotted the results on medium jobsize and small job size The performance on large job size isdistinguished from those on smaller jobs because the smalljobs cause less resource contention for low system workload

72 Performance onMix Job Size In this section we considera workload with more complex job combination We useGridmix2 [37] a default benchmarks for Hadoop clusterwhich can generate an arrival sequence of jobs with the inputof a mix of synthetic jobs considering all typical possibility

In order to simulate the real trace for synthetic workloadwe generated job sets that follow the trace formFacebook that

Table 4 Distribution of job sizes

Job size Map Reduce Jobs JobsSmall 2 2 19 38Medium 10 5 19 38Large 20 10 12 24Total 468 253 50

were used in the evaluation of delay scheduling [33] Delayscheduling divided the jobs into 9 bins on different job sizeand assigned different proportion respectively For the sakeof simplicity we generalize these job sizes into 3 groups smallmedium and large respectively Jobs in bin 1 take up about38 in that trace and we assign the same proportion to thesmall jobs We grouped bins 2ndash4 as medium jobs which alsohold 38 in total The rest of bins are grouped into largejobs with about 24 amount Table 4 summarizes the job sizeconfiguration

The Scientific World Journal 11

M M M M M M M M M M M M M

M

M M M M M

S S

S S S S S S S S S

S S S S S S S S

1 2 3 4 5 6 7 8 9 10 11 12 13

14 15 16 17 18 19 20 21 22 23

24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

L L L

L L L L L L L

L L41 42 43 44 45 46 47 48 49 50

Figure 11 The Gridmix job has 50 jobs in total The first 13 jobs are medium size The jobs from 14 to 23 are small The small job and largejob are alternate arrived from job 26 to job 40 The large jobs appear consecutive form job 41 to job 45 and the last 5 jobs are medium

Table 5The configuration details for BBP including start digit enddigit and the workload per map task

Job size Start Digit ndigit WorkloadmapSmall 1 25000 156283174Medium 1 55000 151332558Large 1 70000 122287914

Except for the job size proportion arrival interval is alsovery important in generating a job arrival trace since it maylead to different schedule behaviors and subsequently affectsystem throughput In our experiment we set the arrivalinterval to 60 s and 120 s Figure 11 gives the generated jobsequences

Finally we ran the workload described above undertwo schedulers LATE (Hadoop default scheduler FIFO) andRiskI We submitted each job by only one user

721 Result for BBP Our detailed configuration of BBP jobswas shown in Table 5 Figure 12 plotted the execution time ofeach job with interval 60 s

For ease of analysis we divided the job sequence into 5parts The first part is from job 1 to job 13 where the jobs aremedium The second part is from job 14 to job 23 The jobsin the second part are all small Job 24 to job 50 is the thirdpart mainly consists of small and large jobs The last 10 jobsare equally divided into two parts and the jobs are large andmedium

In the first part M the response time of jobs is accumu-lated from the beginning because the workload is becomingheavier gradually The resource contention led to morestragglers and more speculative executions More resourceswere used for backup task The following jobs had to waituntil resources were released However because of lack ofdelicate design of scheduling most backup tasks failed Thesystem sacrificed the throughput but did not get benefitFrom Figure 12(b) we can see that there are 48 backup tasksand only 19 succeeded Compared to LATE RiskI cut thenumber of backup task from 48 to 13 The static result was

plotted in Figure 12(c) Many invalid speculative executionsare avoided and saved the throughput Since some backuptasks are certainly finished earlier we killed the first attemptof task immediately in order to save resources As a resultonly 2 backup tasks failed The same conditions occurred inthe left 4 parts

However the situation is a little different when the jobarrival interval is 120 s We also plotted the execution timeof each job with interval 120 s in Figure 13 There are littledifferent exits in the first 4 parts The reason is that thesystem workload is light because each job has 120 s beforethe next job arrives The system has much free resourcesfor backup task running We can see from Figure 13(b) thatno backup task succeeded in the first two parts Even if thebackup task failed it will not cause any loss But in RiskI nobackup task was launched for speculative execution whichwas plotted in Figure 13(c) Because none of them wouldsucceed most differences lied in the part L where we reducedthe speculative execution count from 39 to 12 and only 1speculative execution failed

As a result our scheduler performs much better ininterval 60 than interval 120

722 Result for Streamsort We evaluated the job of stream-sort by running configured gridmix2 also But we have tomodify the Hadoop 0210 for different arrival interval Theperformance of sort is also improved better when workloadis heavy since RiskI not only shortened the response timeof job but also killed some tasks to release resource earlierWhen workload is light our scheduler only takes advantageof reducing job response time

Another conclusion is that the average performance ofBBP is better than sort The main reason is that compared toBBP prediction of task for streamsort has longer range whichmeans more uncertainty

73 Impact of Heterogeneous Degree We change the testbedand Hadoop configurations to evaluate RiskI in differentexecution environments We run our experiments in twoother different environments The original environmentabove is denoted as testbed 1 And the new environments

12 The Scientific World Journal

1200

900

600

300

01 6 11 16 21 26 31 36 41 46

LATERiskI

Tim

e (s)

Job index

M S S L L M

(a) Execution time for 50 jobs

80

60

40

20

0

M S S L L M

SucceededFailed

(b) Speculative count for LATE

M S S L L M

SucceededFailed

Killed

25

20

15

10

5

0

(c) Speculative count for RiskI

Figure 12 Details of execution information with interval 60 (a) is static of execution time for each job (b and c) are statistics of speculativeexecution count in different parts

were denoted as testbeds 2 and 3 The testbed 3 is 4times more heterogeneous than testbed 2 and 8 timesthan testbed 1 We measure heterogeneity by its numberof virtual machines on the same physical machine Theresult was shown in Figure 14 When the environment ismore heterogeneous RiskI perform degrades much less thanLATE

74 Stable of Prediction Since we make our prediction bysearching similar executions of task in history we have todemonstrate that the similar executions have stable responsetime It is obvious that stable response time will make ourprediction significant Otherwise our prediction is helpless

for scheduling Therefore we did the following experimentsto proof our prediction

First we selected a four-slot work node which impliesat most four tasks that can run on the node in parallelAnd then we investigated the task running time under allthe possible execution conditions These conditions consistof various combinations of tasks which are selected fromwordcount randomwrite and Bailey-Borwein-Plouffe Sothere are 9 conditions in total Figure 15(a) shows our resultson running time of task from BBP job that the task ran 20times in every environment We can see that the executiontime is relatively stable in a range time For further analysisFigure 4(a) shows the CDF of response time distributionbetween the longest time and the shortest time in one

[Figure 13: Details of execution information with interval 120 s. (a) Execution time for each of the 50 jobs (LATE vs. RiskI, parts M S S L L M); (b) speculative execution count for LATE (succeeded/failed); (c) speculative execution count for RiskI (succeeded/failed/killed).]

It demonstrates that most execution times follow a concentrated distribution. The result for the task from wordcount is shown in Figures 15(b) and 4(b), where the same conclusion can be verified.
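To make the stability check concrete, the sketch below groups recorded execution times by background condition and reports the relative width of each range; the timings are hypothetical (the real experiment used 20 runs per condition), and only 2 of the 9 conditions are listed.

# Per-condition stability of task execution times (cf. Section 7.4).
# Each key is the mix of tasks co-running on the 4-slot node.
from statistics import mean, stdev

runs = {
    ("bbp", "bbp", "bbp"): [22.1, 23.0, 22.6, 23.4, 22.8],
    ("bbp", "wordcount", "randomwrite"): [27.9, 29.3, 28.4, 30.1, 28.8],
    # ... 7 more of the 9 conditions
}

for condition, times in runs.items():
    spread = (max(times) - min(times)) / mean(times)  # relative range width
    print(condition, f"mean={mean(times):.1f}s",
          f"std={stdev(times):.1f}s", f"spread={spread:.0%}")

A small relative spread per condition is exactly what makes the profile-based range prediction meaningful for scheduling.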

Figure 16 shows how the prediction evolves as the number of executions increases. We selected the two conditions in which the prediction is most stable and most unstable, and the evolutionary process of the prediction is plotted.
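For illustration, here is a minimal sketch (with hypothetical observations) of how such an evolving range prediction can be maintained: each new execution of a similar task updates the predicted [t_min, t_max] interval, which is what Figure 16 tracks.

# Evolving [t_min, t_max] prediction as executions accumulate (cf. Figure 16).
def evolve_prediction(observations):
    history = []
    for t in observations:
        history.append(t)
        yield min(history), max(history)  # current predicted range

for i, (lo, hi) in enumerate(evolve_prediction([25.2, 24.8, 26.1, 25.5, 25.9]), 1):
    print(f"after {i} runs: predict [{lo:.1f} s, {hi:.1f} s]")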

8. Conclusion

We propose a framework named RiskI for task assignment in a fault-tolerant data processing system. We suggest that, by finding the most similar task in history, a profile-based method can predict the execution time as a range of time. RiskI then makes risk-aware task assignments to seek more profit while reducing the risk introduced by uncertainty. Extensive evaluations have shown that RiskI provides good performance in all conditions and performs much better when the system exhibits more heterogeneity. We believe that the performance of RiskI will improve further as we better understand the system behavior and find better similarity algorithms.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

[Figure 14: CDFs of execution time in different testbeds (LATE vs. RiskI). (a) Testbed 1; (b) testbed 2; (c) testbed 3.]

[Figure 15: Execution time prediction (time vs. background index). (a) Task prediction for BBP; (b) task prediction for wordcount.]

[Figure 16: Prediction evolution (time vs. number of executions). (a) Best case; (b) worst case.]

Acknowledgments

This paper is supported by NSFC no. 61272483 and no. 61202430 and National Lab Fund no. KJ-12-06.

References

[1] K. V. Vishwanath and N. Nagappan, "Characterizing cloud computing hardware reliability," in Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC '10), pp. 193–203, Citeseer, June 2010.

[2] P. Gill, N. Jain, and N. Nagappan, "Understanding network failures in data centers: measurement, analysis, and implications," ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 350–361, 2011.

[3] Q. Zheng, "Improving MapReduce fault tolerance in the cloud," in Proceedings of the IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW '10), pp. 1–6, IEEE, April 2010.

[4] Q. Zhang, L. Cheng, and R. Boutaba, "Cloud computing: state-of-the-art and research challenges," Journal of Internet Services and Applications, vol. 1, no. 1, pp. 7–18, 2010.

[5] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[6] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, pp. 29–42, 2008.

[7] G. Ananthanarayanan, S. Kandula, A. Greenberg, et al., "Reining in the outliers in map-reduce clusters using Mantri," in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, pp. 1–16, USENIX Association, 2010.

[8] B. Hindman, A. Konwinski, M. Zaharia, et al., "Mesos: a platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, pp. 22–22, USENIX Association, 2011.

[9] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," ACM SIGOPS Operating Systems Review, vol. 41, no. 3, pp. 59–72, 2007.

[10] M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10–10, USENIX Association, 2010.

[11] A. Snavely, N. Wolter, and L. Carrington, "Modeling application performance by convolving machine signatures with application profiles," in Proceedings of the IEEE International Workshop on Workload Characterization (WWC-4 '01), pp. 149–156, IEEE, 2001.

[12] R. Uhlig, G. Neiger, D. Rodgers, et al., "Intel virtualization technology," Computer, vol. 38, no. 5, pp. 48–56, 2005.

[13] P. Jalote, Fault Tolerance in Distributed Systems, Prentice-Hall, 1994.

[14] A. Salman, I. Ahmad, and S. Al-Madani, "Particle swarm optimization for task assignment problem," Microprocessors and Microsystems, vol. 26, no. 8, pp. 363–371, 2002.

[15] M. Frank and P. Wolfe, "An algorithm for quadratic programming," Naval Research Logistics Quarterly, vol. 3, no. 1-2, pp. 95–110, 1956.

[16] R. Jain, D. Chiu, and W. Hawe, A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer System, Eastern Research Laboratory, Digital Equipment Corporation, 1984.

[17] S. Balsamo, A. Di Marco, P. Inverardi, and M. Simeoni, "Model-based performance prediction in software development: a survey," IEEE Transactions on Software Engineering, vol. 30, no. 5, pp. 295–310, 2004.

[18] L. Kleinrock, Queueing Systems, Volume 1: Theory, Wiley-Interscience, 1975.

[19] E. Lazowska, J. Zahorjan, G. Graham, and K. Sevcik, Quantitative System Performance: Computer System Analysis Using Queueing Network Models, Prentice-Hall, 1984.

[20] K. Kant and M. Srinivasan, Introduction to Computer System Performance Evaluation, McGraw-Hill College, 1992.

[21] K. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Wiley, New York, NY, USA, 2002.

[22] J. Banks and J. Carson, Discrete-Event System Simulation, Prentice-Hall, 1984.

[23] M. Marsan, G. Balbo, and G. Conte, Performance Models of Multiprocessor Systems, MIT Press, 1986.

[24] F. Baccelli, G. Balbo, R. Boucherie, J. Campos, and G. Chiola, "Annotated bibliography on stochastic Petri nets," in Performance Evaluation of Parallel and Distributed Systems: Solution Methods, vol. 105 of CWI Tract, 1994.

[25] M. Harchol-Balter, "Task assignment with unknown duration," Journal of the ACM, vol. 49, no. 2, pp. 260–288, 2002.

[26] D. W. Pentico, "Assignment problems: a golden anniversary survey," European Journal of Operational Research, vol. 176, no. 2, pp. 774–793, 2007.

[27] C. Shen and W. Tsai, "A graph matching approach to optimal task assignment in distributed computing systems using a minimax criterion," IEEE Transactions on Computers, vol. 100, no. 3, pp. 197–203, 1985.

[28] M. Harchol-Balter, M. E. Crovella, and C. D. Murta, "On choosing a task assignment policy for a distributed server system," Journal of Parallel and Distributed Computing, vol. 59, no. 2, pp. 204–228, 1999.

[29] R. I. Davis and A. Burns, "A survey of hard real-time scheduling for multiprocessor systems," ACM Computing Surveys, vol. 43, no. 4, article 35, 2011.

[30] Y. Jiang, X. Shen, J. Chen, and R. Tripathi, "Analysis and approximation of optimal co-scheduling on chip multiprocessors," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT '08), pp. 220–229, ACM, October 2008.

[31] R. Bunt, D. Eager, G. Oster, and C. Williamson, "Achieving load balance and effective caching in clustered web servers," in Proceedings of the 4th International Web Caching Workshop, April 1999.

[32] P. B. Godfrey and I. Stoica, "Heterogeneity and load balance in distributed hash tables," in Proceedings of the IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM '05), vol. 1, pp. 596–606, IEEE, March 2005.

[33] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling," in Proceedings of the 5th European Conference on Computer Systems (EuroSys '10), pp. 265–278, ACM, April 2010.

[34] http://wiki.apache.org/hadoop/WordCount.

[35] A. Verma, L. Cherkasova, and R. H. Campbell, "ARIA: automatic resource inference and allocation for MapReduce environments," in Proceedings of the 8th ACM International Conference on Autonomic Computing, pp. 235–244, ACM, June 2011.

[36] J. de Leeuw and S. Pruzansky, "A new computational method to fit the weighted Euclidean distance model," Psychometrika, vol. 43, no. 4, pp. 479–490, 1978.

[37] http://hadoop.apache.org/docs/mapreduce/current/gridmix.html.




[28] M Harchol-Balter M E Crovella and C D Murta ldquoOnchoosing a task assignment policy for a distributed serversystemrdquo Journal of Parallel and Distributed Computing vol 59no 2 pp 204ndash228 1999

[29] R I Davis and A Burns ldquoA survey of hard real-time schedulingfor multiprocessor systemsrdquo ACM Computing Surveys vol 43no 4 article 35 2011

[30] Y Jiang X Shen J Chen and R Tripathi ldquoAnalysis and approx-imation of optimal co-scheduling on chip multiprocessorsrdquoin Proceedings of the 17th International Conference on ParallelArchitectures and Compilation Techniques (PACT rsquo08) pp 220ndash229 ACM October 2008

[31] R Bunt D Eager G Oster and C Williamson ldquoAchievingload balance and effective caching in clustered web serversrdquoin Proceedings of the 4th International Web Caching WorkshopApril 1999

[32] P B Godfrey and I Stoica ldquoHeterogeneity and load balance indistributed hash tablesrdquo in Proceedings of the IEEE 24th AnnualJoint Conference of the IEEE Computer and CommunicationsSocieties (INFOCOM rsquo05) vol 1 pp 596ndash606 IEEE March2005

[33] M Zaharia D Borthakur J Sen Sarma K Elmeleegy SShenker and I Stoica ldquoDelay scheduling a simple techniquefor achieving locality and fairness in cluster schedulingrdquo in Pro-ceedings of the 5th European Conference on Computer Systems(EuroSys rsquo10) pp 265ndash278 ACM April 2010

[34] httpwikiapacheorghadoopWordCount[35] A Verma L Cherkasova and R H Campbell ldquoARIA auto-

matic resource inference and allocation formapreduce environ-mentsrdquo in Proceedings of the 8th ACM International Conferenceon Autonomic Computing pp 235ndash244 ACM June 2011

[36] J de Leeuw and S Pruzansky ldquoA new computationalmethod tofit the weighted euclidean distance modelrdquo Psychometrika vol43 no 4 pp 479ndash490 1978

[37] httphadoopapacheorgdocsmapreducecurrentgridmixhtml

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 4: Research Article Risk Intelligence: Making Profit from ...downloads.hindawi.com/journals/tswj/2014/398235.pdf · scale data processing systems such as Mesos [ ], Dryad [ ], and Spark

4 The Scientific World Journal

Figure 1: The MapReduce computation framework. A job is split into two kinds of tasks, Map and Reduce; the Map, Shuffle, and Reduce phases run between HDFS and the work nodes, and different tasks are nearly independent, so they can be executed in parallel.

to accelerate its straggler while not delaying the other tasks so much that new stragglers appear.

Speculative execution is based on a simple principle. When a straggler is identified, a backup task is launched for it, in the hope that the backup finishes earlier than the straggler. As soon as one wins, that is, either the original straggler or the backup finishes, the other is killed. In essence, it trades system resources (and thus the overall throughput) for the execution speed of individual jobs.

3.2. Challenges. In practice, efficient speculative execution faces many challenges. A crucial fact is that the backup task will not necessarily be the winner (finish earlier). Speculative execution may fail, meaning that though the backup is invoked, the original straggler finishes earlier than the backup. When this happens, we in effect waste the resources allocated to the backup task but gain no benefit in response time. In other words, not all speculative executions are profitable. We thus argue that the efficiency of speculative execution highly depends on the winning rate of the backup tasks, defined as the percentage of backup tasks that finish earlier than their stragglers. The main objective of this paper then becomes to increase the winning rate of backup tasks.

This is a very challenging issue in practice. First, stragglers can only be identified after they have been executing for a while, so the backup task always starts at a later time. The backup must be sufficiently faster than the straggler to be the winner. This requires accurate knowledge of the execution time of both stragglers and backup tasks. In the literature, however, accurate prediction of task executions has been an open challenge for decades in traditional parallel systems. This is partially because traditional approaches [11] assume a comprehensive understanding of the machine capability and the job workload, which requires a great deal of measurement effort that is prohibitively expensive in a highly dynamic and heterogeneous environment such as ours. The application of virtualization makes the problem even more challenging, as resources become virtualized in an invisible manner; the real capability of each machine (virtual machine or virtual storage) is hard to measure.

Second, stragglers are difficult to identify. We face a dilemma: on one hand, we desire an early identification of stragglers, so that the backup tasks have a larger improvement space and winning becomes easier; on the other hand, later identifications of stragglers are more likely to be accurate, as the stragglers may change during the execution.

Third, optimal speculative execution requires the precise execution time of tasks, for example, that a task will finish in 15 seconds. Unfortunately, uncertainty is inherent in MapReduce: the execution time of a task falls within a range of time even under the same computing environment. Such uncertainty is a double-edged sword. On the one hand, no assignment scheme is optimal prior to execution, and thus finding the optimal one in advance is impossible. On the other hand, recall that the straggler determines the response time of a job, so there are multiple assignment schemes that perform exactly the same as the optimal assignment. In other words, even when our assignment scheme is not the optimal one, performance may not be degraded. This property greatly eases the design challenges. In later sections we show how to benefit from this by applying the theory of decision making under risk and uncertainty.

4. Observation

In this section we study the fundamental characteristics of task execution in MapReduce environments. We first use experimental results to reveal the uncertainty of task executions and then demonstrate the repeatable execution property of tasks. Finally, we show that a simple similarity-based scheme can dramatically reduce the degree of execution time uncertainty.

4.1. Uncertainty of Task Executions. In this part we use experimental results to reveal the uncertainty of MapReduce. We configure a standard Hadoop system, version 0.21.0 (Hadoop is an open-source implementation of MapReduce), with four machines and run a WordCount application [34]. In our experiments, 128 Map tasks and one Reduce task are executed, and the results are presented in Figure 2.

Figure 2 shows that the execution time ranges from 39 s to 159 s, with an average of 80 s. The standard deviation (Std) is 27.42 s. Note that the MapReduce system splits the Map tasks equally, so the workload of these 128 Map tasks is nearly identical. We thus argue that uncertainty is inherent in MapReduce. On the other hand, even when identical tasks are executed on identical machines, or the same task runs repeatedly under the same environment, the execution time is not identical but stable.

Figure 2: Execution time of tasks. There are 128 map tasks in total, and the execution time of each task is plotted.

Figure 4(a) shows the CDF of the response time of a task that was repeatedly run 20 times under the same condition (we choose two tasks from two different jobs). We can see that the execution time is also uncertain. Given this nature, a traditional model-based task execution prediction algorithm is costly and will fail, as its implicit assumption is that a task behaves the same under the same environment. However, we also find that the execution time is stable, meaning that most times fall within a small range. So we argue that a reasonable prediction for a task should be a range of time with a probability distribution.

4.2. Repeated Execution of MapReduce Tasks. In MapReduce environments, most jobs are batch jobs for extreme-scale data processing. The spirit is that moving the program is less costly than moving the data [5]. Consequently, the same jobs are repeatedly invoked to process different data [35]. For example, Facebook, Yahoo, and eBay process terabytes of data and event logs per day on their MapReduce clusters for spam detection, business intelligence, and various optimizations. Moreover, such jobs are rarely updated. An interesting observation here is that the past executions of a job can be of great value for the future: we can build a profile for each task and consult such profiles to predict future executions of the job.

Besides the similarity between tasks of different jobs, in some instances similarity also exists among tasks of the same job. Because the number of tasks may exceed the number of slots, tasks may start and finish in "waves" [35]. Figure 6 shows the progress of the Map and Reduce tasks of a WordCount job with 8 GB of data. The x-axis is the time, and the y-axis is for the 34 map slots and 10 reduce slots. The block size of the file is 64 MB, so there are 8 GB/64 MB = 128 input splits. As each split is processed by a different map task, the job consists of 128 map tasks. As seen in the figure, since the number of map tasks is greater than the number of provided map slots, the map stage proceeds in multiple rounds of slot assignment, and the reduce stage proceeds in 3 waves. From the figure we find that tasks assigned to the same working node have close execution times; the similarity exists between different waves on each node in the map stage.

Figure 3: Reorganized execution time of tasks. Tasks on the same node are plotted together.

Table 1: Execution time statistics (in seconds) for the 128 map tasks of job wordcount.

         Avg     Std    Max  Min
Total    80      27.42  159  32
Node 1   108.61  18.46  159  80
Node 2   92.66   9.68   116  75
Node 3   73.61   5.61   84   63
Node 4   43.5    4.94   54   32

We reorganize the execution times of the tasks from Figure 2 in Figure 3, this time grouping together tasks executed on the same machine. Table 1 gives the maximum, minimum, average, and Std of the tasks in each group. Compared with the original configuration, the Std reduces dramatically, for example, by 33% (from 27.42 to 18.46) for Node 1. This is because the four machines have different hardware, and tasks on different machines experience different resource sharing conditions. Nevertheless, tasks on the same machine experience a more "similar" environment. This shows that a simple approach that takes the similarity of tasks into account can dramatically reduce the uncertainty. Next we show how to exploit such similarity.

It should be pointed out that the execution environment mentioned above includes two parts. Apart from the hardware configurations of different machines, tasks running in parallel on the same execution unit (known as a tasktracker in Hadoop) also affect the execution environment. Depending on the configuration, one or more tasks run on the same tasktracker, which causes resource contention and impacts execution time.

4.3. Design Overview. Keeping the uncertainty and task similarity in mind, in this paper we propose a novel proactive fault tolerance scheme for the MapReduce environment. As illustrated in Figure 5, the design, named RiskI, consists of two components: (i) a profile-based execution time prediction scheme and (ii) a risk-aware backup task assignment algorithm. The basic idea is that we build a profile for each task executed in history and refer to the profile of a similar task (in terms of its own properties and the environment) to predict a new task in execution. As the uncertainty cannot be completely avoided, we design a risk-aware assignment algorithm that can benefit from such uncertainty by exploiting the theory of decision making under uncertainty. Our goal is to maximize the assignment profit while minimizing the risk.

Figure 4: Execution time. (a) CDF of execution time for a task from BBP; (b) CDF of execution time for a task from wordcount.

Figure 5: The RiskI system architecture. RiskI has two main parts: a profile-based prediction algorithm for task execution time, and a risk-aware backup task assignment algorithm that makes scheduling decisions based on a range of time.

5. Execution Time Prediction

In this section we introduce how to leverage historical information to predict the execution time of tasks. We first present the architecture of the prediction algorithm and then introduce its components, respectively. Finally, we use pseudocode to present the details of the design.

5.1. Prediction Algorithm Architecture. The prediction of the job execution is based on a simple idea. For each executed task, we maintain a profile recording its execution information. When predicting a new task, we refer to these historical profiles and look for the most similar task; its execution time then becomes our prediction result. Notice that, due to the uncertainty of job execution, the output of our prediction is also a range of time, indicating the maximum and minimum of the execution time, rather than a single value.

The prediction algorithm mainly consists of four steps, as illustrated in Figure 7. We first perform a similarity search and look for the most similar task in history. Tasks are then assigned and executed according to our predictions. As soon as the execution finishes, we adjust some critical control parameters and maintain the profiles for the next tasks. Since no two tasks are absolutely the same, the key in the prediction algorithm design is a similarity definition for any given two tasks that yields the best prediction results. Next we give details on these steps; a small sketch of the whole loop follows.

Figure 6: "Waves" in execution. Since the number of tasks is larger than the number of task slots, tasks are executed in "waves". In our example there are 9 waves in node 4 and 3 waves in node 1.

Figure 7: Our prediction has four steps. Upon a new task being predicted, we first look for the most similar task in history and then make a prediction based on that similar task. Then we wait for the task to run, record the running information, and adjust the weight setting scheme.
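The sketch below is our minimal illustration of this loop, not the authors' Hadoop implementation; the class and method names (TaskProfile, Predictor, predict, record) are hypothetical, and the weight scheme is assumed to be supplied by the quadratic program of Section 5.3.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// A sketch of the profile-based prediction loop of Section 5.1.
// Names are illustrative; this is not the paper's Hadoop code.
class TaskProfile {
    final double[] factors;                          // affecting factors (input size, parallel tasks, ...)
    final List<Double> runTimes = new ArrayList<>(); // recorded execution times
    TaskProfile(double[] factors) { this.factors = factors; }
}

class Predictor {
    private final List<TaskProfile> history = new ArrayList<>();
    private final double[] weights;                  // assumed fitted by the QP of Section 5.3

    Predictor(double[] weights) { this.weights = weights; }

    // Weighted Euclidean distance of equation (1).
    private double distance(double[] x, double[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            sum += weights[i] * (x[i] - y[i]) * (x[i] - y[i]);
        }
        return Math.sqrt(sum);
    }

    // Steps 1 and 2: find the most similar historical task; its recorded
    // times, sorted ascending, serve as the predicted range [t_min, ..., t_max].
    List<Double> predict(double[] factors) {
        TaskProfile best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (TaskProfile p : history) {
            double d = distance(factors, p.factors);
            if (d < bestDistance) { bestDistance = d; best = p; }
        }
        if (best == null) return new ArrayList<>();
        List<Double> range = new ArrayList<>(best.runTimes);
        Collections.sort(range);
        return range;
    }

    // Steps 3 and 4: once the task finishes, record the observed time so
    // the profiles (and, periodically, the weights) can be adjusted.
    void record(TaskProfile profile, double observedTime) {
        profile.runTimes.add(observedTime);
        if (!history.contains(profile)) history.add(profile);
    }
}
```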

5.2. Similarity Scheme. The similarity of two tasks can be affected by many factors, such as the number of read bytes and the number of write bytes. In our work we apply all these factors, while our method has the adaptive capability that factors of little impact are ignored automatically. This is done by applying an appropriate similarity scheme, that is, a rule for how to obtain the similarity based on these factors.

In the literature there are several approaches that measure the similarity between tasks in different aspects. For example, the weighted Euclidean distance (WED) [36] measures the actual distance between two vectors, the cosine similarity measures the direction information, and the Jaccard similarity measures the duplication ratio. Considering the requirements of our work, we adopt WED and leave more design options for future work.

Definition 1. Given two tasks X and Y with affecting factor vectors x_i and y_i and a weight scheme w_i, the similarity between the two tasks can be calculated as follows:

d(X, Y) = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2},   (1)

where n is the number of affecting factors and w_i is a weight scheme that reflects the importance of the different affecting factors to the similarity.
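As an illustrative example (with assumed numbers, not from the paper), take two affecting factors with weights w = (0.7, 0.3). For a new task with factor vector x = (3, 5) and a historical task with y = (1, 5), the distance is d = \sqrt{0.7 \cdot (3-1)^2 + 0.3 \cdot 0^2} = \sqrt{2.8} \approx 1.67, while the same gap in the second factor alone would give \sqrt{0.3 \cdot 4} \approx 1.10; the more heavily weighted factor thus dominates the match, as intended.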

5.3. Weighting Scheme. The weighting scheme determines the impact of the affecting factors on the similarity measure. In practice, different machines may have different weighting schemes because of their different hardware. For example, for a CPU-rich machine, the I/O may become the bottleneck of task execution, and therefore I/O-related factors are more important, while I/O-rich machines may favor CPU-related factors. To address this issue, we apply a quadratic programming method to set the weighting dynamically.

Quadratic programming is a mature method for weight scheme setting in machine learning [15]. Our aim is to minimize the differences between the similarities calculated from our WED and the real differences obtained by comparing the real execution times. Supposing there are n task records in the node and each has q affecting factors, the constraints of the problem can be presented as in formulas (2) and (3), in which S_{ijk} is the similarity on the kth attribute between task i and task j, \sum_{k=1}^{q} S_{ijk} W_k is the similarity between task i and task j calculated using formula (1), and R_{ij} is the real similarity between task i and task j calculated from real execution times. L_{ij} is the value by which the calculated similarity is less than the real similarity, and M_{ij} is the value by which the calculated similarity is greater than the real similarity.

Minimize \sum_{i=1}^{n} \sum_{j=i+1}^{n} \left( L_{ij}^2 + M_{ij}^2 \right)   (2)

Subject to \sum_{k=1}^{q} S_{ijk} W_k + L_{ij} - M_{ij} = R_{ij}, \quad (i, j = 1, \ldots, n; \; i < j)   (3)

R_{ij} = 1 - \frac{\min\left(|t - t_{\max}|, |t_{\min} - t|\right)}{\max(t, t_{\mathrm{avg}})},   (4)

where t is the real execution time, the prediction is the range [t_min, t_max], and t_avg is the average time of the prediction.

As there is no comprehensive analysis of the complexity of quadratic programming problems, our analysis can only indicate how large this quadratic programming problem is, not its actual complexity. From the above formalization, the quadratic programming problem has n(n-1) + q variables and n(n-1) constraints.
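As a concrete reading of formula (4), the snippet below computes the real similarity R_ij from one finished task's observed time t and the other's predicted range; the class name is ours, and the weights W_k would then be obtained by handing objective (2) and constraints (3) to any off-the-shelf quadratic programming solver, which we do not show here.

```java
// Our reading of formula (4): the "real" similarity derived from the
// observed execution time t and the predicted range [tMin, tMax] whose
// average is tAvg. Illustrative only; not the paper's code.
final class RealSimilarity {
    static double of(double t, double tMin, double tMax, double tAvg) {
        double deviation = Math.min(Math.abs(t - tMax), Math.abs(tMin - t));
        return 1.0 - deviation / Math.max(t, tAvg);
    }
}
```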

The general factors used in calculating similarity include the job name, node name, input size, workload, output size, and the parallel tasks running on the same node. We derive a rough weight scheme for job wordcount on one node, shown in Table 2. From the table we can see that the input size and an I/O-heavy parallel task like randomwrite influence the execution time most.

Table 2: A rough weight scheme for job wordcount.

Factors                      Weight (%)
Job name                     —
Work node                    —
Input size                   34.5
Environment:
  Task 1 (randomwrite)       33.6
  Task 2 (wordcount map)     15.8
  Task 3 (BBP)               14.1
Others                       2

6. Risk-Aware Assignment

In this section we introduce our risk-aware assignment algorithm, which attempts to benefit from the uncertainty of task executions. We first characterize the unique features and give a design overview of task assignment under uncertainty, then present the theory of decision making under uncertainty, which is the theoretical foundation of our risk-aware assignment. At last, we introduce our assignment algorithm based on this theory.

Figure 8: Flowchart of the risk-aware assignment. The scheduler finds the straggler, finds the available nodes, makes predictions, and performs risk-aware decision making; a backup task is assigned when the profit is positive and maximal, and the original task is additionally killed when speculative success is guaranteed.

6.1. Design Overview. The method proceeds as follows. In the first step, by comparing the average execution time of all the running tasks with the elapsed running time, we find the straggler. In step two, we check all the available work nodes. In step three, we predict the execution time of the straggler running on each available node. In step four, with the help of the theory of decision making under risk, the assignment with the most expected profit is chosen for speculative execution; if the new assignment can guarantee speculative success, the original task can be killed. Figure 8 is a flowchart of our risk-aware assignment design. The main challenge is how to leverage the theory of risk decision to help us assign tasks.

6.2. Theory of Risk Decision. Since our prediction is a range of time, how to compare two ranges of time is another issue in decision making. A direct method is to compare the average times of the two ranges; the distance between the two average values is then the time profit or time loss. The average time works well when the time profit is uniform per unit of time. However, this assumption is broken in MapReduce. As we mentioned in Section 3, because of the existence of the straggler, the time profit is only related to the last finish time and cares nothing about the time before the last finish time. In other words, even if some tasks finish faster, it does not help the execution time of the job, because the job finishes only after all its tasks are executed. Under this condition, the time profit differs per unit of time, and decision making under risk is introduced; the example below illustrates why plain averages are not enough.
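As an illustrative example (with assumed numbers), suppose the straggler's last finish time is t_s = 60 s, and two backup options have predicted ranges [40 s, 80 s] and [50 s, 70 s]. Both ranges average 60 s, so comparing averages cannot separate them; yet what actually matters is how much probability mass of each range lies below t_s, because only finishing before the straggler shortens the job. This is exactly what the expected value criterion below captures.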

In many fields, risky choice and the selection criterion are what people seek to optimize. Decision making under risk has two parts: risk assessment and risk management. Risk assessment is in charge of evaluating risk or benefit, while risk management takes responsibility for taking appropriate actions to control risk and maximize profit. The expected value criterion, also called the Bayesian principle, incorporates the probabilities of the states of nature, computes the expected value under each action, and then picks the action with the largest expected value. We can calculate the potential value of each option with the following equation:

E(X) = \sum_{x} P(x) U(x),   (5)

where E(X) stands for the expected value. The probability P(x) of an event x is an indication of how likely that event is to happen. In the equation, U(x) denotes the profit-and-loss value of x. The summation ranges over every number x that is a possible value of the random variable X.

6.3. Profit Function and Unique Feature. First, the function P(x) is easy to obtain from historical statistics. As mentioned before, P(x) is the probability of the occurrence of state x. In our case, P(t_i, t_j) represents the probability that the real execution time falls in the range [t_i, t_j]. This probability can be learned from historical statistics. For example, the output of the prediction is a set T = {t_min, ..., t_i, ..., t_j, ..., t_max}, in which the elements t_i are sorted by value. Since the prediction is a range of time with a probability distribution, we can exploit the probability distribution to compute P(t_i, t_j) with the following equation:

P(t_i, t_j) = \frac{\left| [t_i, t_j] \cap T \right|}{|T|},   (6)

where |T| is the number of elements in the set T and |[t_i, t_j] \cap T| is the number of elements falling into the range [t_i, t_j].
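A direct implementation of formula (6) just counts historical samples: given the prediction set T, the probability that the real time falls in [t_i, t_j] is the fraction of elements of T inside that range. The sketch below (class name ours) assumes T is held as an array of times.

```java
// Sketch of formula (6): the empirical probability that the execution
// time falls in [ti, tj], given the prediction set T of historical samples.
final class RangeProbability {
    static double of(double[] T, double ti, double tj) {
        int inRange = 0;
        for (double t : T) {
            if (t >= ti && t <= tj) inRange++;   // counts |[ti, tj] ∩ T|
        }
        return (double) inRange / T.length;      // divided by |T|
    }
}
```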

Next, we take an example to explain our definition of the profit function U(x). As shown in Figure 9(a), the execution time of a task is naturally uncertain, and the assignment scheme should be aware of the potential risks. In Figure 9(b), because of the existence of the straggler, the time profit is only related to the last finish time, denoted by t_s. Under this condition, the time profit differs per unit of time and is divided into two parts, part one [t_i, t_s] and part two [t_s, t_j], plotted in Figure 9(c). The profit function U(x) can be calculated as

U(t_i, t_j) = \begin{cases} \mathrm{avg}(t_i, t_s) - t_s, & [t_i, t_s] \\ \mathrm{avg}(t_i, t_j) - t_s, & [t_s, t_j] \end{cases}   (7)

where avg(t_i, t_s) is the average time of [t_i, t_s] and avg(t_i, t_j) is the average time of [t_i, t_j].

Let T denote an assignment option, and let T_1 and T_2 denote the two ranges of time divided by t_s. The expected profit can be calculated as

E(T) = P(T_1) U(T_1) + P(T_2) U(T_2).   (8)

Input: available nodes N = (n_1, n_2, n_3, ...)
(1) Find the straggler task s by prediction
(2) for each available node n_i in N do
(3)     Make the prediction T = [t_min, ..., t_i, ..., t_max]
(4)     Compute the profit P_i <- E(T)
(5)     Record the maximal P_i and its node n
(6) end for
(7) if the remaining time t_left > the worst reexecution time t_max then
(8)     kill the task: task.kill(s)
(9) end if
(10) assign the backup task: Task.assign(s, n)

Algorithm 2: Algorithm for risk-aware assignment.

Figure 9: An example of different time profit per unit of time. (a) The execution times of the straggler and the optional backup tasks are uncertain. (b) The straggler determines the last finish time t_s. (c) The time profit is divided into part one [t_i, t_s] and part two [t_s, t_j], which we consider separately. (d) A backup task whose worst time is below the straggler's best time allows the straggler to be killed.

If E(T) > 0, T would be a better option, since it is more likely to reduce the completion time of the job. If there are multiple options for the backup task assignment, the one with the maximal E(T) is the best choice.

A special case is when t_max is less than the best time of the straggler, as shown in Figure 9(d). In this case the straggler can be killed immediately to release resources.
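Putting formulas (6)-(8) together, one way to score a single backup option is sketched below. We split the predicted samples at the straggler's last finish time t_s, weight each part by its empirical probability, and, to keep the "E(T) > 0 is better" rule intuitive, we express profit as time saved relative to t_s, that is, the negation of the printed differences in formula (7); the guard at the top is the special case of Figure 9(d). This is our hedged reconstruction, not the paper's code.

```java
import java.util.Arrays;

// A sketch scoring one backup option with formulas (6)-(8).
// samples: predicted execution times of the backup on a candidate node
// (the set T); ts: the straggler's expected last finish time.
final class RiskScore {

    static double expectedProfit(double[] samples, double ts) {
        // Special case of Figure 9(d): even the worst backup time beats
        // the straggler, so the straggler can be killed immediately.
        double tMax = Arrays.stream(samples).max().orElse(Double.MAX_VALUE);
        if (tMax < ts) return ts - mean(samples);

        double[] partOne = Arrays.stream(samples).filter(t -> t <= ts).toArray();
        double pOne = (double) partOne.length / samples.length; // formula (6)
        double pTwo = 1.0 - pOne;

        // Formula (7), written as time saved relative to ts. The printed
        // second branch uses avg(t_i, t_j), the mean of the whole range,
        // which we follow here.
        double uOne = partOne.length == 0 ? 0 : ts - mean(partOne);
        double uTwo = ts - mean(samples);

        return pOne * uOne + pTwo * uTwo;                        // formula (8)
    }

    private static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(0);
    }
}
```

A scheduler following Algorithm 2 would evaluate such a score for every available node and keep the maximum.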

The detailed algorithm is shown in Algorithm 2. We first compare the average execution time of all the running tasks with the elapsed running time to find the straggler. We then check all the available work nodes and predict the execution time of the straggler on each of them. Finally, with the help of the theory of decision making under risk, the assignment with the most expected profit is chosen for speculative execution; if the new assignment can guarantee speculative success, the original task is killed.

Table 3: Our testbed used in the evaluation.

VMs     Map slots  Reduce slots  VM configuration
Node 1  0          1             1 core, 2 GB RAM
Node 2  4          0             4 cores, 2 GB RAM
Node 3  4          4             4 cores, 2 GB RAM
Node 4  4          4             4 cores, 2 GB RAM
Node 5  1          1             4 cores, 2 GB RAM
Total   13         10            17 cores, 10 GB RAM

7. Evaluation

We have conducted a series of experiments to evaluate the performance of RiskI and LATE on a variety of combinations of different job sizes. Since environmental variability may result in high variance in the experimental results, we also performed evaluations on different heterogeneous environment configurations. We ran our first experiment on a small private cluster with 5 working nodes and a master. Our private cluster occupies two whole Dell PowerEdge R710 machines; each machine has 16 Xeon E5620 2.4 GHz CPUs, 12 GB DDR3 RAM, and 4 × 2.5 TB disks. We use Xen virtualization software to manage the virtual machines. Table 3 lists the hardware configurations of our testbed. We installed our modified version of Hadoop 0.21.0 on the cluster, and the block size was configured as 64 MB. Different numbers of task slots were set for each node according to the diversity of VM configurations.

7.1. Performance on Different Job Sizes. To identify the advantages of RiskI, we first ran some simple workloads. In this configuration, all jobs in a round were set to one size. We performed the evaluation on three different job sizes: small, medium, and large. A small job contains only 2 maps, a medium job has 10 maps, and a large job has 20 maps. We chose the Bailey-Borwein-Plouffe (BBP) job for the evaluation, and there are 20 jobs in total with an arrival interval of 60 s.

Figure 10(a) shows a CDF of the job execution time for large jobs. We see that about 70% of the large jobs are significantly improved under our scheduling. Figure 10(b) illustrates the corresponding running details, including the total number of speculative executions and those that succeeded or were killed.

Figure 10: CDFs of execution times of jobs of various sizes. RiskI greatly improves performance for all sizes of jobs, especially for large and medium jobs. (a) Large jobs; (b) statistics of speculative execution for large jobs, including backup tasks failed, succeeded, and killed; (c) medium jobs; (d) small jobs.

From Figure 10(b) we can see that RiskI has two main benefits for scheduling. First, with the help of estimation, we are able to cut off some unnecessary speculative executions and thus release more resources. Since most speculative executions failed in LATE, the total number of speculative executions was reduced from 181 in LATE to 34 in RiskI. Second, the success ratio of speculation increased: under our scheduler only 6 of the 34 speculative executions failed, a success ratio of 81%.

Figures 10(c) and 10(d) plot the results for the medium and small job sizes. The performance on the large job size is distinguished from that on smaller jobs because the small jobs cause less resource contention under a low system workload.

7.2. Performance on Mixed Job Sizes. In this section we consider a workload with a more complex job combination. We use Gridmix2 [37], a default benchmark for Hadoop clusters, which can generate an arrival sequence of jobs from an input mix of synthetic jobs covering all typical possibilities.

In order to simulate a real trace with the synthetic workload, we generated job sets that follow the Facebook trace used in the evaluation of delay scheduling [33]. Delay scheduling divided the jobs into 9 bins of different job sizes and assigned each a different proportion. For the sake of simplicity, we generalize these job sizes into 3 groups: small, medium, and large. Jobs in bin 1 take up about 38% of that trace, and we assign the same proportion to the small jobs. We grouped bins 2-4 as medium jobs, which also hold 38% in total. The rest of the bins are grouped into large jobs, with about 24%. Table 4 summarizes the job size configuration.

Table 4: Distribution of job sizes.

Job size  Maps  Reduces  Jobs  Jobs (%)
Small     2     2        19    38
Medium    10    5        19    38
Large     20    10       12    24
Total     468   253      50

Figure 11: The Gridmix job sequence has 50 jobs in total. The first 13 jobs are medium size; jobs 14 to 23 are small; small and large jobs arrive alternately from job 26 to job 40; large jobs appear consecutively from job 41 to job 45; and the last 5 jobs are medium.

Table 5: The configuration details for BBP, including the start digit, the end digit, and the workload per map task.

Job size  Start digit  ndigits  Workload/map
Small     1            25000    156283174
Medium    1            55000    151332558
Large     1            70000    122287914

Except for the job size proportion, the arrival interval is also very important in generating a job arrival trace, since it may lead to different schedule behaviors and subsequently affect the system throughput. In our experiments we set the arrival interval to 60 s and 120 s. Figure 11 gives the generated job sequence.

Finally, we ran the workload described above under two schedulers: LATE (with the Hadoop default FIFO scheduler) and RiskI. Each job was submitted by a single user.

7.2.1. Results for BBP. The detailed configuration of the BBP jobs is shown in Table 5. Figure 12 plots the execution time of each job with interval 60 s.

For ease of analysis, we divided the job sequence into 5 parts. The first part is from job 1 to job 13, where the jobs are medium. The second part is from job 14 to job 23; the jobs in this part are all small. Jobs 24 to 40 form the third part, mainly consisting of alternating small and large jobs. The last 10 jobs are equally divided into two parts, whose jobs are large and medium, respectively.

In the first part (M), the response time of jobs accumulates from the beginning because the workload gradually becomes heavier. The resource contention led to more stragglers and more speculative executions, and more resources were used for backup tasks; the following jobs had to wait until resources were released. However, for lack of a delicate scheduling design, most backup tasks failed: the system sacrificed throughput but gained no benefit. From Figure 12(b) we can see that there are 48 backup tasks and only 19 succeeded. Compared to LATE, RiskI cuts the number of backup tasks from 48 to 13; the statistics are plotted in Figure 12(c). Many invalid speculative executions are avoided, saving throughput. Since some backup tasks were certain to finish earlier, we killed the first attempt of the task immediately in order to save resources. As a result, only 2 backup tasks failed. The same conditions occurred in the remaining 4 parts.

However, the situation is a little different when the job arrival interval is 120 s. We also plot the execution time of each job with interval 120 s, in Figure 13. There are small differences in the first 4 parts. The reason is that the system workload is light, because each job has 120 s before the next job arrives, so the system has much free resource for running backup tasks. We can see from Figure 13(b) that no backup task succeeded in the first two parts; even though these backup tasks failed, they caused no loss. In RiskI, however, no backup task was launched for speculative execution at all, as plotted in Figure 13(c), because none of them would succeed. Most of the difference lies in the part L, where we reduced the speculative execution count from 39 to 12, with only 1 failed speculative execution.

As a result, our scheduler performs much better with interval 60 than with interval 120.

7.2.2. Results for Streamsort. We also evaluated the streamsort job by running the configured Gridmix2, though we had to modify Hadoop 0.21.0 for the different arrival intervals. The performance of sort likewise improves more when the workload is heavy, since RiskI not only shortens the response time of jobs but also kills some tasks to release resources earlier. When the workload is light, our scheduler only takes advantage of reducing the job response time.

Another conclusion is that the average performance on BBP is better than on sort. The main reason is that, compared to BBP, the prediction for a streamsort task has a longer range, which means more uncertainty.

Figure 12: Details of execution information with interval 60. (a) Execution time for each of the 50 jobs; (b) and (c) statistics of the speculative execution counts in the different parts, for LATE and RiskI, respectively.

7.3. Impact of Heterogeneity Degree. We change the testbed and Hadoop configurations to evaluate RiskI in different execution environments. We ran our experiments in two other environments. The original environment above is denoted testbed 1, and the new environments are denoted testbeds 2 and 3. Testbed 3 is 4 times more heterogeneous than testbed 2 and 8 times more than testbed 1; we measure heterogeneity by the number of virtual machines on the same physical machine. The results are shown in Figure 14: the more heterogeneous the environment, the less RiskI's performance degrades compared with LATE's.

7.4. Stability of the Prediction. Since we make our prediction by searching for similar executions of a task in history, we have to demonstrate that similar executions have stable response times. Obviously, stable response times make our prediction meaningful; otherwise our prediction is helpless for scheduling. We therefore did the following experiments to support our prediction.

First, we selected a four-slot work node, which implies that at most four tasks can run on the node in parallel. We then investigated the task running time under all the possible execution conditions. These conditions consist of various combinations of tasks selected from wordcount, randomwrite, and Bailey-Borwein-Plouffe, so there are 9 conditions in total. Figure 15(a) shows our results on the running time of a task from a BBP job, where the task ran 20 times in every environment. We can see that the execution time is relatively stable within a range of time. For further analysis, Figure 4(a) shows the CDF of the response time distribution between the longest time and the shortest time in one condition.

Figure 13: Details of execution information with interval 120. (a) Execution time for each of the 50 jobs; (b) and (c) statistics of the speculative execution counts in the different parts, for LATE and RiskI, respectively.

It demonstrates that most times are centrally distributed. The results for a task from wordcount are shown in Figures 15(b) and 4(b); the same conclusion can be certified.

We show the prediction evolution as the number of executions increases in Figure 16. We selected the two conditions where the prediction is most stable and most unstable, and the evolutionary process of the prediction is plotted.

8. Conclusion

We propose a framework named RiskI for task assignment in a fault-tolerant data processing system. We suggest that, by finding the most similar task in history, a profile-based method can predict the execution time as a range of time. RiskI then makes a risk-aware assignment for the task, to seek more profit while reducing the risk introduced by uncertainty. Extensive evaluations have shown that RiskI provides good performance in all conditions, and RiskI performs much better when the system exhibits more heterogeneity. We believe that the performance of RiskI will improve further if we better understand the behavior of the system and find better similarity algorithms.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Figure 14: CDFs of execution time in the different testbeds. (a) Testbed 1; (b) testbed 2; (c) testbed 3.

Figure 15: Execution time prediction. (a) Task prediction for BBP; (b) task prediction for wordcount.

Figure 16: Prediction evolution. (a) Best case; (b) worst case.

Acknowledgments

This paper is supported by NSFC No. 61272483 and No. 61202430 and National Lab Fund No. KJ-12-06.

References

[1] K. V. Vishwanath and N. Nagappan, "Characterizing cloud computing hardware reliability," in Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC '10), pp. 193-203, June 2010.
[2] P. Gill, N. Jain, and N. Nagappan, "Understanding network failures in data centers: measurement, analysis, and implications," ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 350-361, 2011.
[3] Q. Zheng, "Improving MapReduce fault tolerance in the cloud," in Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, Workshops and PhD Forum (IPDPSW '10), pp. 1-6, IEEE, April 2010.
[4] Q. Zhang, L. Cheng, and R. Boutaba, "Cloud computing: state-of-the-art and research challenges," Journal of Internet Services and Applications, vol. 1, no. 1, pp. 7-18, 2010.
[5] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[6] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, pp. 29-42, 2008.
[7] G. Ananthanarayanan, S. Kandula, A. Greenberg, et al., "Reining in the outliers in map-reduce clusters using Mantri," in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, pp. 1-16, USENIX Association, 2010.
[8] B. Hindman, A. Konwinski, M. Zaharia, et al., "Mesos: a platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, pp. 22-22, USENIX Association, 2011.
[9] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," ACM SIGOPS Operating Systems Review, vol. 41, no. 3, pp. 59-72, 2007.
[10] M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10-10, USENIX Association, 2010.
[11] A. Snavely, N. Wolter, and L. Carrington, "Modeling application performance by convolving machine signatures with application profiles," in Proceedings of the IEEE International Workshop on Workload Characterization (WWC-4 '01), pp. 149-156, IEEE, 2001.
[12] R. Uhlig, G. Neiger, D. Rodgers, et al., "Intel virtualization technology," Computer, vol. 38, no. 5, pp. 48-56, 2005.
[13] P. Jalote, Fault Tolerance in Distributed Systems, Prentice-Hall, 1994.
[14] A. Salman, I. Ahmad, and S. Al-Madani, "Particle swarm optimization for task assignment problem," Microprocessors and Microsystems, vol. 26, no. 8, pp. 363-371, 2002.
[15] M. Frank and P. Wolfe, "An algorithm for quadratic programming," Naval Research Logistics Quarterly, vol. 3, no. 1-2, pp. 95-110, 2006.
[16] R. Jain, D. Chiu, and W. Hawe, A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer System, Eastern Research Laboratory, Digital Equipment Corporation, 1984.
[17] S. Balsamo, A. Di Marco, P. Inverardi, and M. Simeoni, "Model-based performance prediction in software development: a survey," IEEE Transactions on Software Engineering, vol. 30, no. 5, pp. 295-310, 2004.
[18] L. Kleinrock, Queueing Systems, Volume 1: Theory, Wiley-Interscience, 1975.
[19] E. Lazowska, J. Zahorjan, G. Graham, and K. Sevcik, Quantitative System Performance: Computer System Analysis Using Queueing Network Models, Prentice-Hall, 1984.
[20] K. Kant and M. Srinivasan, Introduction to Computer System Performance Evaluation, McGraw-Hill College, 1992.
[21] K. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, Wiley, New York, NY, USA, 2002.
[22] J. Banks and J. Carson, Discrete-Event System Simulation, Prentice-Hall, 1984.
[23] M. Marsan, G. Balbo, and G. Conte, Performance Models of Multiprocessor Systems, MIT Press, 1986.
[24] F. Baccelli, G. Balbo, R. Boucherie, J. Campos, and G. Chiola, "Annotated bibliography on stochastic Petri nets," in Performance Evaluation of Parallel and Distributed Systems: Solution Methods, vol. 105, CWI Tract, 1994.
[25] M. Harchol-Balter, "Task assignment with unknown duration," Journal of the ACM, vol. 49, no. 2, pp. 260-288, 2002.
[26] D. W. Pentico, "Assignment problems: a golden anniversary survey," European Journal of Operational Research, vol. 176, no. 2, pp. 774-793, 2007.
[27] C. Shen and W. Tsai, "A graph matching approach to optimal task assignment in distributed computing systems using a minimax criterion," IEEE Transactions on Computers, vol. 100, no. 3, pp. 197-203, 1985.
[28] M. Harchol-Balter, M. E. Crovella, and C. D. Murta, "On choosing a task assignment policy for a distributed server system," Journal of Parallel and Distributed Computing, vol. 59, no. 2, pp. 204-228, 1999.
[29] R. I. Davis and A. Burns, "A survey of hard real-time scheduling for multiprocessor systems," ACM Computing Surveys, vol. 43, no. 4, article 35, 2011.
[30] Y. Jiang, X. Shen, J. Chen, and R. Tripathi, "Analysis and approximation of optimal co-scheduling on chip multiprocessors," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT '08), pp. 220-229, ACM, October 2008.
[31] R. Bunt, D. Eager, G. Oster, and C. Williamson, "Achieving load balance and effective caching in clustered web servers," in Proceedings of the 4th International Web Caching Workshop, April 1999.
[32] P. B. Godfrey and I. Stoica, "Heterogeneity and load balance in distributed hash tables," in Proceedings of the IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM '05), vol. 1, pp. 596-606, IEEE, March 2005.
[33] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling," in Proceedings of the 5th European Conference on Computer Systems (EuroSys '10), pp. 265-278, ACM, April 2010.
[34] http://wiki.apache.org/hadoop/WordCount.
[35] A. Verma, L. Cherkasova, and R. H. Campbell, "ARIA: automatic resource inference and allocation for MapReduce environments," in Proceedings of the 8th ACM International Conference on Autonomic Computing, pp. 235-244, ACM, June 2011.
[36] J. de Leeuw and S. Pruzansky, "A new computational method to fit the weighted Euclidean distance model," Psychometrika, vol. 43, no. 4, pp. 479-490, 1978.
[37] http://hadoop.apache.org/docs/mapreduce/current/gridmix.html.


Page 5: Research Article Risk Intelligence: Making Profit from ...downloads.hindawi.com/journals/tswj/2014/398235.pdf · scale data processing systems such as Mesos [ ], Dryad [ ], and Spark

The Scientific World Journal 5

1601401201008060402000 16 32 48 64 80 96 112 128

Tim

e (s)

Task index

Figure 2 Execution time of tasks There are 128 map tasks in totaland each execution time of task was plotted

shows the CDF of response time of the task which wasrepeatedly running for 20 times in the same condition (wechoose two tasks from two different jobs) We can figurethat the execution time is also uncertain By this nature thetraditional model-based task execution prediction algorithmis costly and will fail as the implicit assumption of thisalgorithm is that a task will behave the same under the sameenvironment However we also find that the execution time isstable whichmeansmost time is within a small range of timeSo we argue that a reasonable prediction for the task shouldbe a range of time with probability distribution

42 Repeated Execution of MapReduce Tasks In MapReduceenvironments most of the jobs are batch for extreme scaledata processing The spirit is that moving the program isless costly than moving the data [5] Consequently the samejobs are repeatedly invoked to process different data [35]For example Facebook Yahoo and eBay process terabytesof data and event logs per day on their MapReduce clustersfor spam detection business intelligence and various opti-mizations Moreover such jobs rarely update An interestingobservation here is that the executions of a job in the past canbe of great value for futureWe can build profiles for each taskand refer such profile to predict the job future executions

Besides the similarity between tasks of different jobsin some instance similarity also exits in tasks of the samejob Because the number of tasks may be more than thenumber of slots tasks may start and finish in ldquowavesrdquo [35]Figure 6 shows the progress of the Map and Reduce tasks of aWordCount job with 8GB dataThe 119909-axis is the time and the119910-axis is for the 34 map slots and 10 reduce slots The blocksize of the file is 64MB and there are 8GB64MB = 128 inputsplits As each split is processed by a different map task thejob consists of 64 map tasks As seen in the figure since thenumber of map tasks is greater than the number of providedmap slots the map stage proceeds in multiple rounds of slotassignment and the reduce stage proceeds in 3 waves Fromthe figure we can find that the tasks assigned to the sameworking node have the close execution time The similarityexists between different waves on each node in map stage

In Figure 3 we reorganize the execution times of the tasks shown in Figure 2, this time grouping together tasks executed on the same machine. Table 1 gives the maximum, minimum, average, and standard deviation (Std) of the tasks in each group. Compared with the original configuration, the Std drops

Figure 3: Reorganized execution time of tasks. Tasks on the same node (Nodes 1-4) are plotted together.

Table 1: Execution time statistics for the 128 map tasks from job wordcount (seconds).

          Avg      Std     Max   Min
Total     80       27.42   159   32
Node 1    108.61   18.46   159   80
Node 2    92.66    9.68    116   75
Node 3    73.61    5.61    84    63
Node 4    43.5     4.94    54    32

dramatically, for example, by 33% on Node 1, from 27.42 to 18.46. This is because the four machines have different hardware, and tasks on different machines experience different resource sharing conditions, whereas tasks on the same machine experience a more "similar" environment. This shows that a simple approach that takes the similarity of tasks into account can dramatically reduce the uncertainty. Next we show how to exploit such similarity.
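The size of the reduction is easy to check from the Table 1 statistics alone (our arithmetic, not additional measurement):

# Standard deviations from Table 1 (seconds).
std_total = 27.42
std_per_node = {"Node 1": 18.46, "Node 2": 9.68, "Node 3": 5.61, "Node 4": 4.94}

for node, std in std_per_node.items():
    reduction = 1 - std / std_total
    print(f"{node}: Std {std:5.2f} s, {reduction:.0%} lower than the pooled Std")
# Node 1: Std 18.46 s, 33% lower than the pooled Std, and so on.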

It should be pointed out that the execution environment mentioned above includes two parts. Besides the hardware configurations of the different machines, the tasks running in parallel on the same execution unit (known as a tasktracker in Hadoop) also affect the execution environment. Depending on the configuration, one or more tasks may run on the same tasktracker, which causes resource contention and impacts execution time.

4.3. Design Overview. Keeping the uncertainty and task-similarity nature in mind, in this paper we propose a novel proactive fault tolerance scheme for MapReduce environments. As illustrated in Figure 5, the design, named RiskI, consists of two components: (i) a profile-based execution time prediction scheme and (ii) a risk-aware backup task assignment algorithm. The basic idea is that we build a profile for each task executed in the past and consult the profile of a similar task (in terms of its own properties and its environment) to predict a new task's execution. As the uncertainty is inherent and cannot be completely avoided, we design a risk-aware assignment algorithm that benefits from such uncertainty by exploiting the theory of decision making under uncertainty. Our goal is to maximize the assignment profit while minimizing the risk.


Figure 4: Execution time CDFs. (a) CDF of execution time for a task from BBP. (b) CDF of execution time for a task from wordcount.

Figure 5: RiskI system architecture. RiskI has two main parts: a profile-based prediction algorithm for task execution time, and a risk-aware backup task assignment algorithm that makes scheduling decisions based on a range of time, maximizing profit while minimizing risk.

5. Execution Time Prediction

In this section we introduce how to leverage historical information to predict the execution time of tasks. We first present the architecture of the prediction algorithm and then introduce its components. Finally, we use pseudocode to present the details of the design.

5.1. Prediction Algorithm Architecture. The prediction of job execution is based on a simple idea. For each executed task, we maintain a profile recording its execution information. When predicting a new task, we consult these historical profiles and look for the most similar task; its execution time then becomes our prediction result. Notice that, due to the uncertain nature of job execution, the output of our prediction is also a range of time, indicating the maximum and minimum of the execution time, rather than a single value.
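A minimal sketch of this profile lookup (our illustration; the record fields and the distance argument are assumptions, not the authors' implementation) could look like:

from dataclasses import dataclass, field

@dataclass
class TaskProfile:
    # Factors assumed to affect execution time (an illustrative subset).
    job_name: str
    node: str
    input_mb: float
    # All observed execution times of this task, in seconds.
    times: list[float] = field(default_factory=list)

def predict_range(new_task: TaskProfile, history: list[TaskProfile], distance) -> tuple[float, float]:
    """Return (t_min, t_max) of the most similar historical task.

    `distance` is any dissimilarity function over two profiles, e.g.
    the weighted Euclidean distance of Section 5.2; smaller = more similar.
    """
    best = min(history, key=lambda p: distance(new_task, p))
    return min(best.times), max(best.times)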

The prediction algorithm mainly consists of four steps, as illustrated in Figure 7. We first perform a similarity search to look for the most similar task in history. Tasks are then assigned and executed according to our predictions.

Figure 6: "Waves" in execution. Since the number of tasks is greater than the number of task slots, tasks were executed in "waves" (shuffle, sort, and reduce phases shown per node). In our example there are 9 waves on node 4 and 3 waves on node 1.

As soon as an execution finishes, we adjust some critical control parameters and maintain the profiles for subsequent tasks. Noticing that no two tasks are absolutely identical, the key in the prediction algorithm design is a definition of similarity between any two given tasks that yields the best prediction results. Next we give the details of these steps.

5.2. Similarity Scheme. The similarity of two tasks can be affected by many factors, such as the number of bytes read and the number of bytes written. In our work we apply all these factors, while our method has the adaptive capability that factors of little impact are ignored automatically.


Figure 7: Our prediction has four steps. When a new task is to be predicted, we first search for the most similar task in history and then make a prediction based on that task. We then wait for the task to run, record the running information, and adjust the weight setting scheme.

This is done by applying an appropriate similarity scheme, that is, a rule for computing the similarity from these factors.

In the literature there are several approaches that measure the similarity between tasks in different aspects. For example, the weighted Euclidean distance (WED) [36] measures the actual distance between two vectors, the cosine similarity measures direction information, and the Jaccard similarity measures the duplication ratio. Considering the requirements of our work, we adopt WED and leave other design options for future work.

Definition 1. Given two tasks X and Y with their affecting factor vectors x_i and y_i and a weight scheme w_i, the similarity between the two tasks can be calculated as follows:

d(X, Y) = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2}    (1)

where n is the number of affecting factors and w_i is a weight scheme that reflects the importance of the different affecting factors to the similarity.
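As a concrete illustration (ours, not the paper's code), the WED of equation (1) is only a few lines of Python; the factor vectors and weights here are hypothetical:

import math

def wed(x: list[float], y: list[float], w: list[float]) -> float:
    """Weighted Euclidean distance between two factor vectors, eq. (1)."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)))

# Hypothetical factors: [input MB, MB read, MB written]; weights sum to 1.
print(wed([64.0, 120.0, 30.0], [64.0, 118.0, 45.0], [0.5, 0.3, 0.2]))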

5.3. Weighting Scheme. The weighting scheme determines the impact of the affecting factors on the similarity measure. In practice, different machines may need different weighting schemes because of their different hardware. For example, on a CPU-rich machine the I/O may become the bottleneck of task execution, and therefore I/O-related factors are more important, while I/O-rich machines may favor CPU-related factors. To address this issue, we apply a quadratic programming method to set the weights dynamically.

Quadratic programming is a mature method for weight scheme setting in machine learning [15]. Our aim is to minimize the differences between the similarities calculated from our WED and the real similarities obtained by comparing the real execution times.

Table 2: A rough weight scheme for job wordcount.

Factors                      Weight
Job name                     —
Work node                    —
Input size                   34.5
Environment:
  Task 1 (randomwrite)       33.6
  Task 2 (wordcount map)     15.8
  Task 3 (BBP)               14.1
Others                       2

Supposing there are n task records on the node and each has q affecting factors, the constraints of the problem can be presented as in formulas (2) and (3), in which S_{ijk} is the similarity in the k-th attribute between task i and task j, \sum_{k=1}^{q} S_{ijk} W_k is the similarity between task i and task j calculated using formula (1), and R_{ij} is the real similarity between task i and task j calculated from the real execution times. L_{ij} is the amount by which the calculated similarity falls short of the real similarity, and M_{ij} is the amount by which the calculated similarity exceeds the real similarity:

Minimize \sum_{i=1}^{n} \sum_{j=i+1}^{n} (L_{ij}^2 + M_{ij}^2)    (2)

Subject to \sum_{k=1}^{q} S_{ijk} W_k + L_{ij} - M_{ij} = R_{ij}, \quad i, j = 1, \ldots, n, \; i < j    (3)

R_{ij} = 1 - \frac{\min(|t - t_{\max}|, |t_{\min} - t|)}{\max(t, t_{\mathrm{avg}})}    (4)

where t is the real execution time, the predicted range is [t_{\min}, t_{\max}], and t_{\mathrm{avg}} denotes the average time of the prediction.

As there is no comprehensive analysis of the complexity of quadratic programming problems, our analysis can only indicate how large this quadratic programming problem can be, not its actual complexity. From the above formalization, the quadratic programming problem has n(n − 1) + q variables and n(n − 1) constraints.
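The following sketch (ours, with randomly generated per-factor similarities standing in for real profiles, and cvxpy as an off-the-shelf solver front end) sets up exactly this program:

import cvxpy as cp
import numpy as np

# Hypothetical data: n task records, q affecting factors.
# S[i, j, k] is the per-factor similarity between records i and j;
# R[i, j] is the "real" similarity from observed execution times, eq. (4).
n, q = 6, 4
rng = np.random.default_rng(42)
S = rng.uniform(0, 1, size=(n, n, q))
R = rng.uniform(0, 1, size=(n, n))

pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
W = cp.Variable(q, nonneg=True)           # factor weights to learn (nonnegativity is our assumption)
L = cp.Variable(len(pairs), nonneg=True)  # shortfall vs. the real similarity
M = cp.Variable(len(pairs), nonneg=True)  # excess vs. the real similarity

# One equality constraint per task pair, matching formula (3).
constraints = [S[i, j] @ W + L[p] - M[p] == R[i, j] for p, (i, j) in enumerate(pairs)]

# Objective (2): minimize the squared deviations.
prob = cp.Problem(cp.Minimize(cp.sum_squares(L) + cp.sum_squares(M)), constraints)
prob.solve()
print("learned weights:", np.round(W.value, 3))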

The general factors used in calculating similarity include job name, node name, input size, workload, output size, and the parallel tasks running on the same node. We computed a rough weight scheme for job wordcount on one node, shown in Table 2. From the table we can see that the input size and I/O-heavy parallel tasks such as randomwrite influence the execution time most.

6. Risk-Aware Assignment

In this section we introduce our risk-aware assignment algorithm, which attempts to benefit from the uncertainty of task executions.


Figure 8: Flowchart of risk-aware assignment. After finding a straggler and the available nodes, predictions are made and a risk-aware decision is taken; if the expected profit is positive and maximal, a backup task is assigned, and if the speculative execution is guaranteed to succeed, the original task is also killed.

We first characterize the unique features and give a design overview of task assignment under uncertainty, then present the theory of decision making under uncertainty, which is the theoretical foundation for our risk-aware assignment, and finally introduce our assignment algorithm based on that theory.

6.1. Design Overview. The method proceeds as follows. In the first step, by comparing the average execution time of all the running tasks with their elapsed running time, we find the straggler. In step two, we check all the available work nodes. In step three, we predict the execution time of the straggler on each available node. In step four, with the help of the theory of decision making under risk, the assignment with the highest expected profit is chosen for speculative execution. If the new assignment can guarantee speculative success, the original task can be killed. Figure 8 is a flowchart of our risk-aware assignment design. The main challenge is how to leverage risk decision theory to guide task assignment.

6.2. Theory of Risk Decision. Since our prediction is a range of time, how to compare two ranges of time is another issue in decision making. A direct method is to compare the average times of the two ranges, taking the distance between the two averages as the time profit or time loss. The average works well when the time profit is uniform per unit of time. However, this assumption is broken in MapReduce. As mentioned in Section 3, because of the existence of stragglers, the time profit is only related to the last finish time and is indifferent to anything before it. In other words, even if some tasks finish faster, it does not help the execution time of the job, because the job finishes only after all of its tasks have executed. Under this condition the time profit differs per unit of time, and decision making under risk is introduced.

In many fields, risky choice and the selection criterion are what people seek to optimize. Decision making under risk has two parts: risk assessment and risk management. Risk assessment is in charge of evaluating risk or benefit, and risk management takes responsibility for appropriate actions to control risk and maximize profit. The expected value criterion, also called the Bayesian principle, incorporates the probabilities of the states of nature, computes the expected value under each action, and then picks the action with the largest expected value. We can calculate the potential value of each option with the following equation:

E(X) = \sum_{x} P(x) U(x)    (5)

where E(X) stands for the expected value, the probability P(x) of an event x indicates how likely that event is to happen, and U(x) denotes the profit-and-loss value of x. The summation ranges over every possible value x of the random variable X.

6.3. Profit Function and Unique Feature. First, the function P(x) is easy to obtain from historical statistics. As mentioned before, P(x) is the probability of the occurrence of state x. In our case, P(t_i, t_j) represents the probability that the real execution time falls in the interval [t_i, t_j]. This probability can be learned from historical statistics. For example, the output of a prediction is a set T = {t_min, ..., t_i, ..., t_j, ..., t_max}, in which the t_i are sorted by value. Since the prediction is a range of time with a probability distribution, we can exploit that distribution to compute P(t_i, t_j) with the following equation:

P(t_i, t_j) = \frac{|[t_i, t_j] \cap T|}{|T|}    (6)

where |T| is the number of elements in the set T and |[t_i, t_j] ∩ T| is the number of elements falling into the range [t_i, t_j].

Next we take an example to explain our definition of the

profit function U(x). As shown in Figure 9(a), the execution time of a task is inherently uncertain, and the assignment scheme should be aware of the potential risks. In Figure 9(b), because of the existence of the straggler, the time profit is only related to the last finish time, denoted t_s. Under this condition the time profit differs per unit of time and is divided into two parts, part one [t_i, t_s] and part two [t_s, t_j], plotted in Figure 9(c). The profit function U(x) can be calculated as:

U(t_i, t_j) = \begin{cases} \mathrm{avg}(t_i, t_s) - t_s, & [t_i, t_s] \\ \mathrm{avg}(t_i, t_j) - t_s, & [t_s, t_j] \end{cases}    (7)

where avg(t_i, t_s) is the average time over [t_i, t_s] and avg(t_i, t_j) is the average time over [t_i, t_j].

Let T denote an assignment option, and let T_1 and T_2 denote the two ranges of time divided by t_s. The expected profit can be calculated as:

E(T) = P(T_1) U(T_1) + P(T_2) U(T_2)    (8)
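Putting equations (6)-(8) together, the sketch below (ours, with an illustrative predicted set T and straggler finish time t_s; we orient U as time saved relative to t_s, so that positive values mean benefit, and use per-part averages as a simplified reading of (7)):

def expected_profit(T: list[float], ts: float) -> float:
    """Expected profit E(T) of a backup assignment, per eqs. (6)-(8).

    T  -- sorted predicted execution times of the backup task
    ts -- projected finish time of the straggler (the last finish time)
    """
    part1 = [t for t in T if t <= ts]   # outcomes that beat the straggler
    part2 = [t for t in T if t > ts]    # outcomes that do not
    e = 0.0
    for part in (part1, part2):
        if part:
            p = len(part) / len(T)          # eq. (6), empirical probability
            u = ts - sum(part) / len(part)  # eq. (7), oriented as time saved
            e += p * u                      # eq. (8)
    return e

# 20 hypothetical predicted times; straggler projected to finish at t_s = 60 s.
T = [42, 44, 45, 47, 48, 50, 51, 52, 53, 55, 56, 57, 58, 59, 61, 63, 66, 70, 75, 90]
print(expected_profit(T, ts=60.0))  # > 0 means the backup is expected to pay off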


Input: available nodes N = (n_1, n_2, n_3, ...)
(1)  Find the straggler task s by prediction
(2)  for each available node n_i in N do
(3)      Make prediction T = [t_min, ..., t_i, ..., t_max]
(4)      Compute profit P_i ← E(T)
(5)      Record the maximal P_i and its node n
(6)  end for
(7)  if remaining time t_left > the worst reexecution time t_max then
(8)      kill task: task.kill(s)
(9)  end if
(10) assign backup task: Task.assign(s, n)

Algorithm 2: Algorithm for risk-aware assignment.
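A runnable rendering of Algorithm 2 (our sketch; predict, expected_profit, kill, and assign are assumed interfaces, as are the straggler attributes ts and t_left, not Hadoop APIs):

def risk_aware_assignment(straggler, nodes, predict, expected_profit, kill, assign):
    """Pick the backup node with the highest expected profit (Algorithm 2)."""
    best_node, best_profit, best_worst_case = None, float("-inf"), None
    for node in nodes:
        T = predict(straggler, node)          # sorted predicted times [t_min..t_max]
        p = expected_profit(T, straggler.ts)  # E(T), eq. (8)
        if p > best_profit:
            best_node, best_profit, best_worst_case = node, p, max(T)
    # If even the worst-case backup time beats the straggler's remaining time,
    # the backup is guaranteed to win, so the straggler can be killed early.
    if best_node is not None and straggler.t_left > best_worst_case:
        kill(straggler)
    if best_node is not None and best_profit > 0:
        assign(straggler, best_node)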

Figure 9: An example of different time profit per unit. (a) Execution times of the straggler and the optional backup tasks are uncertain. (b) A predicted range [t_i, t_j] relative to the last finish time t_s. (c) The time profit is divided by t_s into part one and part two, which are considered separately. (d) The special case in which the backup task is certain to beat the straggler, so the straggler is killed.

If E(T) > 0, the assignment is a good option, since it is likely to reduce the completion time of the job. If there are multiple options for backup task assignment, the one with the maximum E(T) is the best choice.

A special case is when t_max is less than the best time of the straggler, as shown in Figure 9(d). In this case the straggler can be killed immediately to release resources.

The detailed algorithm is shown in Algorithm 2. We first compare the average execution time of all the running tasks with their elapsed running time to find the straggler. We then check all the available work nodes and predict the straggler's execution time on each of them. Finally, with the help of the theory of decision making under risk, the assignment with the highest expected profit is chosen for speculative execution; if the new assignment can guarantee speculative success, the original task is killed.

Table 3: Our testbed used in evaluation.

VMs      Map slots   Reduce slots   VM configuration
Node 1   0           1              1 core, 2 GB RAM
Node 2   4           0              4 cores, 2 GB RAM
Node 3   4           4              4 cores, 2 GB RAM
Node 4   4           4              4 cores, 2 GB RAM
Node 5   1           1              4 cores, 2 GB RAM
Total    13          10             17 cores, 10 GB RAM

7. Evaluation

We have conducted a series of experiments to evaluate the performance of RiskI and LATE on a variety of job-size combinations. Since environmental variability may result in high variance in the experimental results, we also performed evaluations on different heterogeneous environment configurations. We ran our first experiment on a small private cluster with 5 working nodes and a master. Our private cluster occupies two Dell PowerEdge R710 machines; each machine has 16 Xeon E5620 2.4 GHz CPUs, 12 GB DDR3 RAM, and 4 × 2.5 TB disks. We use Xen virtualization software to manage the virtual machines. Table 3 lists the hardware configuration of our testbed. We installed our modified version of Hadoop 0.21.0 on the cluster, and the block size was configured as 64 MB. Different numbers of task slots were set on each node according to the diversity of VM configurations.

7.1. Performance on Different Job Sizes. To identify the advantage of RiskI, we first ran some simple workloads. In this configuration, all jobs in a round were set to one size. We evaluated three different job sizes: small, medium, and large. A small job contains only 2 maps, a medium job has 10 maps, and a large job has 20 maps. We chose the Bailey-Borwein-Plouffe (BBP) job for evaluation, with 20 jobs in total and an arrival interval of 60 s.

Figure 10(a) shows a CDF of job execution time for large jobs. We see that about 70% of the large jobs are significantly improved under our scheduling. Figure 10(b) illustrates the corresponding running details, including the total number of speculative executions and those that succeeded or were killed.


Figure 10: CDFs of execution times of jobs of various sizes, under LATE and RiskI. (a) Large jobs. (b) Statistics of speculative execution for large jobs, including backup tasks failed, succeeded, and killed. (c) Medium jobs. (d) Small jobs. RiskI greatly improves performance for all job sizes, especially for large and medium jobs.

From Figure 10(b) we find that RiskI brings two main scheduling benefits. First, with the help of estimation, we are able to cut off some unnecessary speculative executions and thus release more resources: since most speculative executions failed under LATE, the total number of speculative executions was reduced from 181 to 34 under RiskI. Second, the success ratio of speculation increased: under our scheduler only 6 of the 34 speculative executions failed, a success ratio of 81%.

Figures 10(c) and 10(d) plot the results for medium and small job sizes. The performance on large jobs is distinguished from that on smaller jobs because small jobs cause less resource contention under a low system workload.

7.2. Performance on Mixed Job Sizes. In this section we consider a workload with a more complex job combination. We use Gridmix2 [37], a default benchmark for Hadoop clusters, which can generate an arrival sequence of jobs from a mix of synthetic jobs covering the typical possibilities.

In order to simulate a realistic trace for the synthetic workload, we generated job sets that follow the Facebook trace used in the evaluation of delay scheduling [33].

Table 4: Distribution of job sizes.

Job size   Maps   Reduces   Jobs   Jobs (%)
Small      2      2         19     38
Medium     10     5         19     38
Large      20     10        12     24
Total      468    253       50     100

Delay scheduling divided the jobs into 9 bins by job size and assigned a different proportion to each bin. For the sake of simplicity, we generalize these job sizes into 3 groups: small, medium, and large. Jobs in bin 1 take up about 38% of that trace, and we assign the same proportion to the small jobs. We grouped bins 2-4 as medium jobs, which also hold 38% in total. The remaining bins are grouped into large jobs, accounting for about 24%. Table 4 summarizes the job size configuration.


Figure 11: The generated Gridmix sequence has 50 jobs in total. The first 13 jobs are medium size, and jobs 14 to 23 are small. Small and large jobs arrive alternately from job 26 to job 40, large jobs appear consecutively from job 41 to job 45, and the last 5 jobs are medium.

Table 5: Configuration details for BBP, including the start digit, the number of digits, and the workload per map task.

Job size   Start digit   ndigits   Workload/map
Small      1             25000     156283174
Medium     1             55000     151332558
Large      1             70000     122287914

Besides the job size proportion, the arrival interval is also very important in generating a job arrival trace, since it may lead to different scheduling behaviors and subsequently affect system throughput. In our experiments we set the arrival interval to 60 s and 120 s. Figure 11 gives the generated job sequences.

Finally, we ran the workload described above under two schedulers: LATE (with the Hadoop default FIFO scheduler) and RiskI. All jobs were submitted by a single user.

7.2.1. Result for BBP. The detailed configuration of the BBP jobs is shown in Table 5. Figure 12 plots the execution time of each job with interval 60 s.

For ease of analysis, we divided the job sequence into 5 parts. The first part is from job 1 to job 13, where the jobs are medium. The second part, from job 14 to job 23, consists of small jobs. Jobs 24 to 40 form the third part, consisting mainly of small and large jobs. The last 10 jobs are equally divided into two parts, large and medium.

In the first part (M), the response times of jobs accumulate from the beginning because the workload gradually becomes heavier. The resource contention led to more stragglers and more speculative executions, and more resources were used for backup tasks; the following jobs had to wait until resources were released. However, for lack of a delicate scheduling design, most backup tasks failed: the system sacrificed throughput but got no benefit. From Figure 12(b) we can see that there are 48 backup tasks and only 19 succeeded. Compared to LATE, RiskI cut the number of backup tasks from 48 to 13.

The statistics are plotted in Figure 12(c). Many invalid speculative executions were avoided, saving throughput. Since some backup tasks were certain to finish earlier, we killed the first attempt of the task immediately in order to save resources. As a result, only 2 backup tasks failed. The same conditions occurred in the remaining 4 parts.

However, the situation is a little different when the job arrival interval is 120 s. We plot the execution time of each job with interval 120 s in Figure 13. There is little difference in the first 4 parts. The reason is that the system workload is light, because each job has 120 s before the next job arrives, so the system has plenty of free resources for backup tasks. We can see from Figure 13(b) that no backup task succeeded in the first two parts; even though the backup tasks failed, they caused no loss. Under RiskI, no backup task was launched for speculative execution in these parts, as plotted in Figure 13(c), because none of them would succeed. Most of the difference lies in the part L, where we reduced the speculative execution count from 39 to 12, and only 1 speculative execution failed.

As a result, our scheduler performs much better at interval 60 s than at interval 120 s.

7.2.2. Result for Streamsort. We also evaluated the streamsort job by running the configured Gridmix2, though we had to modify Hadoop 0.21.0 for the different arrival intervals. The performance of sort is likewise improved more when the workload is heavy, since RiskI not only shortens the response time of jobs but also kills some tasks to release resources earlier. When the workload is light, our scheduler only gains from reducing job response time.

Another conclusion is that the average performance on BBP is better than on sort. The main reason is that, compared to BBP, the predictions for streamsort tasks have longer ranges, which means more uncertainty.

7.3. Impact of Heterogeneity Degree. We changed the testbed and Hadoop configurations to evaluate RiskI in different execution environments, running our experiments in two other environments. The original environment above is denoted testbed 1, and the new environments are denoted testbeds 2 and 3.


Figure 12: Details of execution information with interval 60 s. (a) Execution time for each of the 50 jobs under LATE and RiskI. (b) and (c) Speculative execution counts (succeeded/failed/killed) in the different parts for LATE and RiskI, respectively.

Testbed 3 is 4 times more heterogeneous than testbed 2 and 8 times more than testbed 1; we measure heterogeneity by the number of virtual machines on the same physical machine. The results are shown in Figure 14: when the environment is more heterogeneous, RiskI's performance degrades much less than LATE's.

7.4. Stability of Prediction. Since we make our prediction by searching for similar executions of a task in history, we have to demonstrate that similar executions have stable response times. Clearly, a stable response time makes our prediction meaningful; otherwise, our prediction is useless for scheduling. We therefore ran the following experiments to validate our prediction.

First, we selected a four-slot work node, which means at most four tasks can run on the node in parallel. We then investigated the task running time under all possible execution conditions. These conditions consist of various combinations of tasks selected from wordcount, randomwrite, and Bailey-Borwein-Plouffe, so there are 9 conditions in total. Figure 15(a) shows the running times of a task from the BBP job, run 20 times in every environment. We can see that the execution time is relatively stable within a range. For further analysis, Figure 4(a) shows the CDF of the response time distribution between the longest and shortest times in one condition, demonstrating that most times are centrally distributed.


Figure 13: Details of execution information with interval 120 s. (a) Execution time for each of the 50 jobs under LATE and RiskI. (b) and (c) Speculative execution counts (succeeded/failed/killed) in the different parts for LATE and RiskI, respectively.

The results for a task from wordcount are shown in Figures 15(b) and 4(b), and the same conclusion holds.

Figure 16 shows how the prediction evolves as the number of executions increases, for the two conditions in which the prediction is most stable and most unstable.

8. Conclusion

We propose a framework named RiskI for task assignment in a fault-tolerant data processing system. We suggest that, by finding the most similar task in history, a profile-based method can predict the execution time as a range of time. RiskI then makes risk-aware task assignments to seek more profit while reducing the risk introduced by uncertainty. Extensive evaluations have shown that RiskI provides good performance in all conditions, and RiskI performs much better when the system exhibits more heterogeneity. We believe that the performance of RiskI can improve further with a better understanding of system behavior and better similarity algorithms.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Figure 14: CDFs of execution time under LATE and RiskI in different testbeds. (a) Testbed 1. (b) Testbed 2. (c) Testbed 3.

Figure 15: Execution time prediction across background conditions. (a) Task prediction for BBP. (b) Task prediction for wordcount.


Figure 16: Prediction evolution as the number of executions increases. (a) Best case. (b) Worst case.

Acknowledgments

This paper is supported by NSFC no. 61272483 and no. 61202430 and National Lab Fund no. KJ-12-06.

References

[1] K. V. Vishwanath and N. Nagappan, "Characterizing cloud computing hardware reliability," in Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC '10), pp. 193-203, June 2010.

[2] P. Gill, N. Jain, and N. Nagappan, "Understanding network failures in data centers: measurement, analysis, and implications," ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 350-361, 2011.

[3] Q. Zheng, "Improving MapReduce fault tolerance in the cloud," in Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, Workshops and Phd Forum (IPDPSW '10), pp. 1-6, IEEE, April 2010.

[4] Q. Zhang, L. Cheng, and R. Boutaba, "Cloud computing: state-of-the-art and research challenges," Journal of Internet Services and Applications, vol. 1, no. 1, pp. 7-18, 2010.

[5] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.

[6] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, pp. 29-42, 2008.

[7] G. Ananthanarayanan, S. Kandula, A. Greenberg et al., "Reining in the outliers in map-reduce clusters using Mantri," in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, pp. 1-16, USENIX Association, 2010.

[8] B. Hindman, A. Konwinski, M. Zaharia et al., "Mesos: a platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, pp. 22-22, USENIX Association, 2011.

[9] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," ACM SIGOPS Operating Systems Review, vol. 41, no. 3, pp. 59-72, 2007.

[10] M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10-10, USENIX Association, 2010.

[11] A. Snavely, N. Wolter, and L. Carrington, "Modeling application performance by convolving machine signatures with application profiles," in Proceedings of the IEEE International Workshop on Workload Characterization (WWC-4 '01), pp. 149-156, IEEE, 2001.

[12] R. Uhlig, G. Neiger, D. Rodgers et al., "Intel virtualization technology," Computer, vol. 38, no. 5, pp. 48-56, 2005.

[13] P. Jalote, Fault Tolerance in Distributed Systems, Prentice-Hall, 1994.

[14] A. Salman, I. Ahmad, and S. Al-Madani, "Particle swarm optimization for task assignment problem," Microprocessors and Microsystems, vol. 26, no. 8, pp. 363-371, 2002.

[15] M. Frank and P. Wolfe, "An algorithm for quadratic programming," Naval Research Logistics Quarterly, vol. 3, no. 1-2, pp. 95-110, 1956.

[16] R. Jain, D. Chiu, and W. Hawe, A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer System, Eastern Research Laboratory, Digital Equipment Corporation, 1984.

[17] S. Balsamo, A. Di Marco, P. Inverardi, and M. Simeoni, "Model-based performance prediction in software development: a survey," IEEE Transactions on Software Engineering, vol. 30, no. 5, pp. 295-310, 2004.

[18] L. Kleinrock, Queueing Systems, Volume 1: Theory, Wiley-Interscience, 1975.

[19] E. Lazowska, J. Zahorjan, G. Graham, and K. Sevcik, Quantitative System Performance: Computer System Analysis Using Queueing Network Models, Prentice-Hall, 1984.

[20] K. Kant and M. Srinivasan, Introduction to Computer System Performance Evaluation, McGraw-Hill College, 1992.

[21] K. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Wiley, New York, NY, USA, 2002.

[22] J. Banks and J. Carson, Discrete-Event System Simulation, Prentice-Hall, 1984.

[23] M. Marsan, G. Balbo, and G. Conte, Performance Models of Multiprocessor Systems, MIT Press, 1986.

[24] F. Baccelli, G. Balbo, R. Boucherie, J. Campos, and G. Chiola, "Annotated bibliography on stochastic Petri nets," in Performance Evaluation of Parallel and Distributed Systems: Solution Methods, vol. 105, CWI Tract, 1994.

[25] M. Harchol-Balter, "Task assignment with unknown duration," Journal of the ACM, vol. 49, no. 2, pp. 260-288, 2002.

[26] D. W. Pentico, "Assignment problems: a golden anniversary survey," European Journal of Operational Research, vol. 176, no. 2, pp. 774-793, 2007.

[27] C. Shen and W. Tsai, "A graph matching approach to optimal task assignment in distributed computing systems using a minimax criterion," IEEE Transactions on Computers, vol. 100, no. 3, pp. 197-203, 1985.

[28] M. Harchol-Balter, M. E. Crovella, and C. D. Murta, "On choosing a task assignment policy for a distributed server system," Journal of Parallel and Distributed Computing, vol. 59, no. 2, pp. 204-228, 1999.

[29] R. I. Davis and A. Burns, "A survey of hard real-time scheduling for multiprocessor systems," ACM Computing Surveys, vol. 43, no. 4, article 35, 2011.

[30] Y. Jiang, X. Shen, J. Chen, and R. Tripathi, "Analysis and approximation of optimal co-scheduling on chip multiprocessors," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT '08), pp. 220-229, ACM, October 2008.

[31] R. Bunt, D. Eager, G. Oster, and C. Williamson, "Achieving load balance and effective caching in clustered web servers," in Proceedings of the 4th International Web Caching Workshop, April 1999.

[32] P. B. Godfrey and I. Stoica, "Heterogeneity and load balance in distributed hash tables," in Proceedings of the IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM '05), vol. 1, pp. 596-606, IEEE, March 2005.

[33] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling," in Proceedings of the 5th European Conference on Computer Systems (EuroSys '10), pp. 265-278, ACM, April 2010.

[34] http://wiki.apache.org/hadoop/WordCount

[35] A. Verma, L. Cherkasova, and R. H. Campbell, "ARIA: automatic resource inference and allocation for MapReduce environments," in Proceedings of the 8th ACM International Conference on Autonomic Computing, pp. 235-244, ACM, June 2011.

[36] J. de Leeuw and S. Pruzansky, "A new computational method to fit the weighted Euclidean distance model," Psychometrika, vol. 43, no. 4, pp. 479-490, 1978.

[37] http://hadoop.apache.org/docs/mapreduce/current/gridmix.html


Page 6: Research Article Risk Intelligence: Making Profit from ...downloads.hindawi.com/journals/tswj/2014/398235.pdf · scale data processing systems such as Mesos [ ], Dryad [ ], and Spark

6 The Scientific World Journal

1

08

06

04

02

00 15 30 45 60

CDF

Time (s)

(a) CDF of execution time for task from BBP

1

08

06

04

02

00 10 2 0 30 40

CDF

Time (s)

(b) CDF of execution time for task from wordcount

Figure 4 Execution time

Jobs

Periodically running

ProfileMaximize

Minimize

profit

risk

Risk-awareassignment

RiskI system architecture

Profile-basedprediction

Execution time

Figure 5 RiskI has two main parts a profile-based executionalgorithm for task execution time prediction and a risk-awarebackup task assignment algorithm used to make scheduling basedon range of time

5 Execution Time Prediction

In this section we will introduce how to leverage historicalinformation to predict the execution time for tasks We firstpresent the architecture of the prediction algorithm and thenintroduce the components respectively In the last we use apseudocode to present details of the design

51 Prediction Algorithm Architecture The prediction of thejob execution is based on a simple idea For each executedtask we will maintain a profile to record its executioninformation When predicting a new task we refer to thesehistorical profiles and look for the most similar task Itsexecution time will then be our prediction result Noticethat due to the uncertainty nature of the job execution theoutput of our prediction is also a range of time indicating themaximum and minimum of the execution time rather thana single value

The prediction algorithm mainly consists of four stepsas illustrated in Figure 7 We first make a similarity searchand look for the most similar task in history Tasks will thenbe assigned and executed according to our predictions As

3432

30282624

2220181614

12108642

0 50 100 150 200 250 300 350 400 450 500

Task

slot

s

Time (s)

First waveSecond wave

Third waveFourth wave

Suffle Sort ReduceNode 4

Node 3

Node 2

Node 1

Figure 6 ldquoWavesrdquo in execution Since the number of tasks is morethan task slots tasks were executed in ldquowavesrdquo In our example thereare 9 waves in node 4 and 3 waves in node 1

long as the execution is finished we will adjust some criticalcontrol parameters and maintain the profiles for the nexttasks Noticing that there is no absolute the same betweentasks the key in the prediction algorithm design is thesimilarity of definition for any given two tasks that can yieldthe best prediction results Next we will give details on thesesteps

52 Similarity Scheme The similarity of two tasks can beaffected by many factors such as the number of read bytesand the number of write bytes In our work we apply all thesefactors while our method has the adaptive capability thatfor factors of little impact they will be ignored automatically

The Scientific World Journal 7

Step one Step two

Step three Step four

Record andadjust

similarity

Similaritysearch

Runningtask

Makingprediction

Figure 7 Our prediction has four steps Upon a new task is inprediction we firstly look for the most similar task in history andthen make a prediction based on similar task Then we wait for thetask running and record the running information and adjust weightsetting scheme

This is done by applying an appropriate similarity schemethat is how to obtain the similarity based on these factors

In literature there are several approaches to measurethe similarity that measures the similarity between tasksin different aspects For example the weighted Euclideandistance (WED) [36] measures the actual distance betweentwo vectors the cosine similarity measures the directioninformation and the Jaccard similaritymeasures the duplica-tion ratio Considering the requirement of ourworkwe adoptWED and leave more design options for future work

Definition 1 Given two tasks 119883 and 119884 with their affectingfactor vectors 119909119895 and 119910119895 and a weight scheme 119908119895 thesimilarity between the two tasks can be calculated as follows

119889 (119883 119884) =2radic

119899

sum

119894=1

119908119894(119909119894 minus 119910119894)2 (1)

where 119899 is the number of affecting factors and 119908119895 is a weightscheme that reflects the importance of the different affectingfactors on the similarity

53 Weighting Scheme Weighting scheme determines theimpact of the affecting factors on the similarity measureIn practice different machines may have different weightingschemes because of the different hardware For example fora CPU-rich machine the IO may become the bottleneckof the task execution and therefore IO-related factors aremore important while IO-rich machines may desire CPU-related factors To well address the issue we apply a quadraticprogramming method to dynamically set the weighting

Quadratic programming is a mature method for weightscheme setting in machine learning [15] Our aim is tominimize the differences between the similarities calcu-lated from our WED and the real differences obtained

Table 2 A rough weight scheme for job wordcount

Factors WeightJob nameWork nodeInput size 345Environment

Task 1 (randomwrite) 336Task 2 (wordcount map) 158Task 3 (BBP) 141

Others 2

by comparing the real execution time Supposing thereare 119899 task records in the node and each has 119902 affectingfactors the constrains in the problem can be presentedin formulas (2) and (3) in which 119878119894119895119896 is the similarity onthe 119896th attribute between task 119894 and task 119895 sum

119902

119896=1119878119894119895119896119882119896 is

the similarity between task 119894 and task 119895 calculated usingformula (1) and 119877119894119895 is the real similarity between task 119894 andtask 119895 calculated by real execution time 119871 119894119895 is the value bywhich the calculated similarity is less than the real similarityand 119872119894119895 is the value by which the calculated similarity isgreater than the real similarity

Minimize119899

sum

119894=1

119899

sum

119895=119894+1

(1198712

119894119895+ 1198722

119894119895) (2)

Subject to119902

sum

119896=1

119878119894119895119896119882119896 + 119871 119894119895 minus 119872119894119895 = 119877119894119895

(119894 119895 = 1 119899 119894 lt 119895)

(3)

119877119894119895 = 1 minus

min 1003816100381610038161003816119905 minus 119905max

10038161003816100381610038161003816100381610038161003816119905min minus 119905

1003816100381610038161003816

max 119905 119905avg (4)

where 119905 is the real execution time and the prediction timeis [119905min 119905max] 119905avg presents the average time of prediction

As there is no comprehensive analysis of the complexityof quadratic programming problems our analysis only canindicate how large this quadratic programming problem canbe but not the actual complexity of the problem From theabove formalization the quadratic programming problemhas 119899 lowast (119899 minus 1) + 119902 variables and 119899 lowast (119899 minus 1) constraints

The general factors used in calculating similarity includejob name node name input size workload output size andthe parallel task that is running on the same node We figurea rough weight scheme for job wordcount in one node Weshowed it in Table 2 From the table we can see that the factorinput size and IO-heavy task like randomwrite influence theexecution time most

6 Risk-Aware Assignment

In this section we introduce our risk-aware assignmentalgorithm that attempts to benefit from the uncertainty oftask executions We first characterize the unique featuresand design overview of task assignment with uncertainty

8 The Scientific World Journal

Start

Findingstraggler

Findingavailable node

Risk-awaredecision making

Yes

Yes

No

NoCan speculative executiondetermined success

Assign backuptask

Assign backup taskand kill original task End

Makingprediction

Profit gt 0 andmaximum profit

Figure 8 Flowchart of risk-aware assignment

and then present the theory of decision making underuncertainty which is the theoretical foundation for our risk-aware assignment At last we introduce our assignmentalgorithm based on the theory

61 Design Overview Themethod is presented as follows inthe first step by comparing the average execution time ofall the running tasks and the passed running time we findout the straggler in step two we check out all the availablework nodes and make the execution time prediction in stepthree wemake the execution time prediction for the stragglerrunning on the available node in step four with the help ofthe theory of decision making under risk the most expectedprofiting assignment was chosen for speculative executionIf the new assignment can guarantee the speculative successthe original task can be killed Figure 8 is a flowchart of ourrisk-aware assignment design The main challenge is how toleverage theory of risk decision to help us assign task

62 Theory of Risk Decision Since our prediction is a rangeof time how to compare two ranges of time is another issueto help decision making A direct method is comparing theaverage time of two ranges and the distance of two averagevalues is the time profit or time loss The average time workswell in the condition where time profit is uniformed per unitof time However this assumption is broken in MapReduceAs we mentioned in Section 3 because of the existence ofstraggler the time profit is only related to the last finish timebut care nothing about the time before last finish time Inother words may be some tasks would finish faster it is stillhelpless for the execution time of the job because the jobwould be finished after all the tasks were executed Underthis condition time profit is different per unit time and thedecision making under risk was introduced

In many fields risky choice and the selection criterionare what people seek to optimize There are two parts

included in risk of decision making risk assessment andrisk management Risk assessment is in charge of evaluatingrisk or benefit and risk management takes responsibilityfor taking appropriate actions to control risk and maximizeprofit Expected value criterion is also called Bayesian princi-ple It incorporates the probabilities of the states of naturecomputes the expected value under each action and thenpicks the action with the largest expected value We cancalculate the potential value of each option with the followingequation

119864 (119883) = sum

119909

119875 (119909)119880 (119909) (5)

where 119864(119883) stands for the expected value Theprobability 119875(119909) of an event 119909 is an indication of howlikely that event is to happen In the equation 119880(119909) denotesthe profit and loss value of 119909 The summation range includesevery number of 119909 that is a possible value of the randomvariable 119883

63 Profit Function and Unique Feature First the function119875(119909) is easy to get from historical statistic As mentionedbefore 119875(119909) is the probability of the occurrence of state 119909 Inour case 119875(119905119894 119905119895) represents the probability that the real exe-cution time falls in the area between [119905119894 119905119895] This probabilitycan learn from statistic information in history For examplethe output of prediction is a set 119879 = 119905min 119905119894 119905119895 119905max inwhich the 119905119894 was sorted by its value Since the prediction is arange of time with a probability distribution we can exploitthe probability distribution to compute the 119875(119905119894 119905119895) with thefollowing equation

119875 (119905119894 119905119895) =

10038161003816100381610038161003816[119905119894 119905119895]⋂119879

10038161003816100381610038161003816

|119879|

(6)

where |119879| is the number of elements in the set119879 and |[119905119894 119905119895]⋂

119879| is the number of elements falling into range [119905119894 119905119895]Next we will take an example to explain our definition for

profit function 119880(119909) As shown in Figure 9(a) the executiontime of task is a nature uncertainty and the assignmentscheme should be aware of the potential risks In Figure 9(b)because of the existence of straggler the time profit is onlyrelated to the last finish time which was denoted as 119905119904 Underthis condition time profit is different per unit time and isdivided into two parts which are part one [119905119894 119905119904] and parttwo [119905119904 119905119895] and we plotted them in Figure 9(c) The profitfunction 119880(119909) can be calculated as

119880(119905119894 119905119895) =

avg (119905119894 119905119904) minus 119905119904 [119905119894 119905119904]

avg (119905119894 119905119895) minus 119905119904 [119905119904 119905119895]

(7)

where avg(119905119894 119905119904) is the average time of [119905119894 119905119904] and avg(119905119894 119905119895) isthe average time of [119905119894 119905119895]

Let 119879 denote an assignment option Let1198791 and 1198792 denotetwo ranges of time divided by 119905119904 The expected profit can becalculated as

119864 (119879) = 119875 (1198791)119880 (1198791) + 119875 (1198792) 119880 (1198792) (8)

The Scientific World Journal 9

Input available Nodes 119873 = (1198991 1198992 1198993 )(1) Finding the straggler task 119904 by prediction(2) for each available node 119899119894 in 119873 do(3) Make prediction 119879 = [119905min 119905119894 119905max](4) Compute profit 119875119894 larr 119864(119879)(5) Record the maximal 119875119894 and node 119899(6) end for(7) if remaining time 119905left gt the worst reexecution time 119905max then(8) kill task 119905119886119904119896 sdot 119896119894119897119897 (119904)(9) end if(10) assign backup task 119879119886119904119896119886119904119904119894119892119899 (119904 119899)

Algorithm 2 Algorithm for risk-aware assignment

Tim

e (s)

(a)

tjts

ti

(b)

Tim

e (s)

Part one

Part two

ts

StragglerOptional backup tasksAverage execution time

(c)

ts

Kill

StragglerOptional backup tasksAverage execution time

(d)

Figure 9 An example for different time profit per unit The timeprofit is divided into two parts So we considered separately thesetwo parts

If 119864(119879) gt 0 119864(119879) would be a better option since it islikely more close to reduce the completion time of a jobIf there are more options for backup task assignment themax 119864(119879) would be the best choice

A special case is that 119905max is less than the best time ofstraggler as shown in Figure 9(d) In this case the stragglercan be killed immediately to release resources

The detailed algorithm is shown in Algorithm 2 We firstcompare the average execution time of all the running tasksand the passed running time and then we find out thestraggler And then we check out all the available work nodesand make the execution time prediction In step three wemake the execution time prediction for the straggler runningon the available node In step four with the help of theoryof decision making under risk the most expected profitingassignment was chosen for speculative execution If the newassignment can guarantee the speculative success the originaltask can be killed

Table 3 Our testbed used in evaluation

VMs Map slots Reduce slots VM configurationNode 1 0 1 1 core 2G RAMNode 2 4 0 4 core 2 G RAMNode 3 4 4 4 core 2 G RAMNode 4 4 4 4 core 2 G RAMNode 5 1 1 4 core 2 G RAMTotal 13 10 17 core 10G RAM

7 EvaluationWe have conducted a series of experiments to evaluate theperformance of RiskI and LATE in a variety combination ofdifferent job size Since the environmental variability mayresult in high variance in the experiment result we alsoperformed evaluations on different heterogeneous environ-ment configuration We ran our first experiment on a smallprivate cluster which has 5 working nodes and a masterOur private cluster occupies two whole Dell PowerEdgeR710 machines Each machine has 16 Xeon E5620 24GHzCPUs 12GB DDR3 RAM and 4 times 25119879 disks We use Xenvirtualization software to manage virtual machines Table 3lists the hardware configurations of our testbed We installedour modified version of Hadoop 0210 to cluster and theblock size was configured as 64MB Different numbers oftask slots were set to each node according to diversity of VMconfigurations

71 Performance on Different Job Size To identify the pre-dominance of RiskI we first ran some simple workloads Inthis configuration all jobs were set with one size in a roundWe performed evaluation on three different types of job sizewhich are small medium and large respectively The smalljob only contains 2 maps the medium job has 10 maps andthe large job has 20 maps We chose Bailey-Borwein-Plouffe(BBP) job for evaluation and there are 20 jobs in total witharrival interval 60 s

Figure 10(a) shows a CDF of job execution time for largejobs We see that about 70 of the large jobs are significantlyimproved under our scheduling Figure 10(b) illustrates thecorresponding running details including the total numberof speculative executions those succeeded and being killed

10 The Scientific World Journal

0 400 800 1200 1600

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(a) Large jobs

200

150

100

50

0

RiskI

FailedSucceededKilled

LATE

(b) Statistics of speculative execution for large jobs includingbackup task failed succeeded and killed

100 250 400 550 700

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(c) Medium jobs

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

90 93 96 99 102

(d) Small jobs

Figure 10 CDFs of execution times of jobs in various job sizes RiskI greatly improves performance for all size of jobs especially for largeand medium jobs

From Figure 10(b) we can see that RiskI has two main benefits for scheduling. First, with the help of estimation, we are able to cut off some unnecessary speculative executions and thus release more resources. Since most speculative executions failed under LATE, the total number of speculative executions was reduced from 181 under LATE to 34 under RiskI. Second, the success ratio of speculative execution increased: under our scheduler, only 6 of the 34 speculative executions failed, a success ratio of 81%.

Figures 10(c) and 10(d) plot the results for the medium and small job sizes. The performance on the large job size is distinguished from that on the smaller sizes because smaller jobs cause less resource contention under a low system workload.

7.2. Performance on Mixed Job Sizes. In this section we consider a workload with a more complex job combination. We use Gridmix2 [37], a default benchmark for Hadoop clusters, which can generate an arrival sequence of jobs from an input mix of synthetic jobs covering all typical cases.

In order to simulate a real trace in the synthetic workload, we generated job sets that follow the trace from Facebook that was used in the evaluation of delay scheduling [33]. Delay scheduling divided the jobs into 9 bins of different job sizes and assigned each bin a different proportion. For the sake of simplicity, we generalize these job sizes into 3 groups: small, medium, and large. Jobs in bin 1 take up about 38% of that trace, and we assign the same proportion to the small jobs. We grouped bins 2-4 as medium jobs, which also hold 38% in total. The remaining bins are grouped into large jobs, with about 24%. Table 4 summarizes the job size configuration.

Table 4: Distribution of job sizes.

    Job size   Maps   Reduces   Jobs   Jobs (%)
    Small      2      2         19     38
    Medium     10     5         19     38
    Large      20     10        12     24
    Total      468    253       50     100

Figure 11: The Gridmix workload has 50 jobs in total. The first 13 jobs are medium size. The jobs from 14 to 23 are small. Small and large jobs arrive alternately from job 26 to job 40. The large jobs appear consecutively from job 41 to job 45, and the last 5 jobs are medium.

Table 5: The configuration details for BBP, including the start digit, the number of digits, and the workload per map task.

    Job size   Start digit   ndigit   Workload/map
    Small      1             25000    156283174
    Medium     1             55000    151332558
    Large      1             70000    122287914

Besides the job size proportions, the arrival interval is also very important in generating a job arrival trace, since it may lead to different scheduling behaviors and subsequently affect system throughput. In our experiment, we set the arrival interval to 60 s and 120 s. Figure 11 shows the generated job sequences.
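
For illustration, the short sketch below generates a comparable arrival trace from the Table 4 proportions and a fixed arrival interval. This is our own minimal sketch: the actual sequence in Figure 11 is a fixed Gridmix2 configuration rather than a random draw, and names such as JOB_MIX and generate_trace are ours.

    import random

    # Job mix from Table 4: size -> (maps, reduces, proportion of jobs)
    JOB_MIX = {
        "small":  (2, 2, 0.38),
        "medium": (10, 5, 0.38),
        "large":  (20, 10, 0.24),
    }

    def generate_trace(n_jobs=50, interval=60, seed=0):
        # Return one (arrival_time_s, size, maps, reduces) tuple per job;
        # jobs arrive at a fixed interval (60 s or 120 s in our experiments).
        rng = random.Random(seed)
        sizes = list(JOB_MIX)
        weights = [JOB_MIX[s][2] for s in sizes]
        trace = []
        for i in range(n_jobs):
            size = rng.choices(sizes, weights=weights)[0]
            maps, reduces, _ = JOB_MIX[size]
            trace.append((i * interval, size, maps, reduces))
        return trace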

Finally, we ran the workload described above under two schedulers: LATE (with the Hadoop default FIFO job scheduler) and RiskI. All jobs were submitted by a single user.

7.2.1. Results for BBP. The detailed configuration of the BBP jobs is shown in Table 5. Figure 12 plots the execution time of each job with arrival interval 60 s.

For ease of analysis, we divided the job sequence into 5 parts. The first part is from job 1 to job 13, where the jobs are medium. The second part is from job 14 to job 23; the jobs in this part are all small. The third part, from job 24 to job 40, mainly consists of small and large jobs. The last 10 jobs are equally divided into two parts, containing large and medium jobs, respectively.

In the first part (M), the response time of the jobs accumulates from the beginning because the workload gradually becomes heavier. The resource contention led to more stragglers and more speculative executions, so more resources were used for backup tasks, and the following jobs had to wait until resources were released. However, for lack of a delicate scheduling design, most backup tasks failed: the system sacrificed throughput but gained no benefit. From Figure 12(b) we can see that there are 48 backup tasks, of which only 19 succeeded. Compared to LATE, RiskI cut the number of backup tasks from 48 to 13; the statistics are plotted in Figure 12(c). Many invalid speculative executions were avoided, which saved throughput. Since some backup tasks were certain to finish earlier, we killed the first attempt of such a task immediately in order to save resources. As a result, only 2 backup tasks failed. The same conditions occurred in the remaining 4 parts.

However, the situation is a little different when the job arrival interval is 120 s. We plot the execution time of each job with interval 120 s in Figure 13. There is little difference in the first 4 parts. The reason is that the system workload is light, because each job has 120 s before the next job arrives, so the system has plenty of free resources for running backup tasks. We can see from Figure 13(b) that no backup task succeeded in the first two parts; even though these backup tasks failed, they caused no loss. Under RiskI, however, no backup task was launched for speculative execution in these parts, as plotted in Figure 13(c), because none of them would succeed. Most of the difference lies in part L, where we reduced the speculative execution count from 39 to 12, with only 1 failed speculative execution.

As a result, our scheduler performs much better with arrival interval 60 s than with 120 s.

7.2.2. Results for Streamsort. We also evaluated the streamsort job by running the configured Gridmix2, though we had to modify Hadoop 0.21.0 to vary the arrival interval. The performance of sort also improves more when the workload is heavy, since RiskI not only shortens the response time of jobs but also kills some tasks to release resources earlier. When the workload is light, our scheduler benefits only from the reduced job response time.

Another conclusion is that the average improvement for BBP is better than that for sort. The main reason is that, compared to BBP, the task predictions for streamsort have a longer range, which means more uncertainty.

7.3. Impact of Heterogeneity Degree. We changed the testbed and Hadoop configurations to evaluate RiskI in different execution environments, running our experiments in two additional environments. The original environment above is denoted as testbed 1.

Figure 12: Details of execution information with arrival interval 60 s: (a) execution time for each of the 50 jobs; (b) and (c) statistics of the speculative execution counts (succeeded, failed, and, for RiskI, killed) in the different parts for LATE and RiskI, respectively.

The new environments are denoted as testbeds 2 and 3. Testbed 3 is 4 times more heterogeneous than testbed 2 and 8 times more heterogeneous than testbed 1, where we measure heterogeneity by the number of virtual machines on the same physical machine. The results are shown in Figure 14: the more heterogeneous the environment, the less RiskI's performance degrades compared with LATE's.

7.4. Stability of Prediction. Since we make our prediction by searching for similar executions of a task in history, we have to demonstrate that similar executions have stable response times. Obviously, a stable response time makes our prediction meaningful; otherwise, our prediction is useless for scheduling. We therefore conducted the following experiments to validate our prediction.

First, we selected a four-slot work node, meaning that at most four tasks can run on the node in parallel. We then investigated the task running time under all possible execution conditions, which consist of the various combinations of tasks selected from wordcount, randomwrite, and Bailey-Borwein-Plouffe, giving 9 conditions in total. Figure 15(a) shows our results on the running time of a task from the BBP job, where the task ran 20 times in every condition. We can see that the execution time is relatively stable within a time range. For further analysis, Figure 4(a) shows the CDF of the response time distribution between the longest time and the shortest time in one condition.

Figure 13: Details of execution information with arrival interval 120 s: (a) execution time for each of the 50 jobs; (b) and (c) statistics of the speculative execution counts (succeeded, failed, and, for RiskI, killed) in the different parts for LATE and RiskI, respectively.

It demonstrates that most times follow a concentrated distribution. The results for a task from wordcount, shown in Figures 15(b) and 4(b), confirm the same conclusion.
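
The statistics behind these figures are straightforward to reproduce. The following is a minimal sketch under our own assumptions (a dictionary mapping each of the 9 background conditions to its 20 measured times; the names stability and report are ours) of the range and concentration measures that the CDFs visualize:

    from statistics import median

    def stability(times, tolerance=0.1):
        # For the runs of one condition, report the observed range and the
        # fraction of runs within +/-10% of the median, i.e. how concentrated
        # (and therefore how predictable) the response time is.
        m = median(times)
        near = sum(abs(t - m) <= tolerance * m for t in times)
        return min(times), max(times), near / len(times)

    def report(history):
        # history: {condition_name: [20 measured times in seconds]} for the
        # 9 background combinations of wordcount, randomwrite, and BBP.
        for cond, times in sorted(history.items()):
            lo, hi, frac = stability(times)
            print(f"{cond}: range [{lo:.1f}, {hi:.1f}] s, {frac:.0%} near median")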

Figure 16 shows how the prediction evolves as the number of executions increases. We selected the two conditions in which the prediction is most stable and most unstable, and we plotted the evolutionary process of the prediction.
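
One simple way to replay such an evolution is to recompute the predicted range from a sliding window of the most recent executions after every run. This is only a proxy of our own for illustration; the actual predictor searches the whole history with adjusted weights:

    def prediction_evolution(times, window=5):
        # After each execution, predict the next completion time as the
        # [min, max] range over the last `window` observations; a stable
        # condition yields the narrow, steady band of Figure 16(a), an
        # unstable one the drifting band of Figure 16(b).
        for n in range(1, len(times) + 1):
            recent = times[max(0, n - window):n]
            yield n, min(recent), max(recent)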

8. Conclusion

We propose a framework named RiskI for task assignment in a fault-tolerant data processing system. We suggest that, by finding the most similar task in history, a profile-based method can predict the execution time as a range of time. RiskI then makes risk-aware task assignments that seek more profit while reducing the risk introduced by uncertainty. Extensive evaluations have shown that RiskI provides good performance in all conditions, and that RiskI performs much better when the system exhibits more heterogeneity. We believe that the performance of RiskI will improve further as we better understand system behavior and find better similarity algorithms.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Figure 14: CDFs of execution time in different testbeds: (a) testbed 1; (b) testbed 2; (c) testbed 3.

Figure 15: Execution time prediction: (a) task prediction for BBP; (b) task prediction for wordcount.

Figure 16: Prediction evolution: (a) best case; (b) worst case.

Acknowledgments

This paper is supported by NSFC nos. 61272483 and 61202430 and by National Lab Fund no. KJ-12-06.

References

[1] K. V. Vishwanath and N. Nagappan, "Characterizing cloud computing hardware reliability," in Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC '10), pp. 193–203, Citeseer, June 2010.
[2] P. Gill, N. Jain, and N. Nagappan, "Understanding network failures in data centers: measurement, analysis, and implications," ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 350–361, 2011.
[3] Q. Zheng, "Improving MapReduce fault tolerance in the cloud," in Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, Workshops and PhD Forum (IPDPSW '10), pp. 1–6, IEEE, April 2010.
[4] Q. Zhang, L. Cheng, and R. Boutaba, "Cloud computing: state-of-the-art and research challenges," Journal of Internet Services and Applications, vol. 1, no. 1, pp. 7–18, 2010.
[5] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[6] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, pp. 29–42, 2008.
[7] G. Ananthanarayanan, S. Kandula, A. Greenberg, et al., "Reining in the outliers in map-reduce clusters using Mantri," in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, pp. 1–16, USENIX Association, 2010.
[8] B. Hindman, A. Konwinski, M. Zaharia, et al., "Mesos: a platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, pp. 22–22, USENIX Association, 2011.
[9] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," ACM SIGOPS Operating Systems Review, vol. 41, no. 3, pp. 59–72, 2007.
[10] M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10–10, USENIX Association, 2010.
[11] A. Snavely, N. Wolter, and L. Carrington, "Modeling application performance by convolving machine signatures with application profiles," in Proceedings of the IEEE International Workshop on Workload Characterization (WWC-4 '01), pp. 149–156, IEEE, 2001.
[12] R. Uhlig, G. Neiger, D. Rodgers, et al., "Intel virtualization technology," Computer, vol. 38, no. 5, pp. 48–56, 2005.
[13] P. Jalote, Fault Tolerance in Distributed Systems, Prentice-Hall, 1994.
[14] A. Salman, I. Ahmad, and S. Al-Madani, "Particle swarm optimization for task assignment problem," Microprocessors and Microsystems, vol. 26, no. 8, pp. 363–371, 2002.
[15] M. Frank and P. Wolfe, "An algorithm for quadratic programming," Naval Research Logistics Quarterly, vol. 3, no. 1-2, pp. 95–110, 2006.
[16] R. Jain, D. Chiu, and W. Hawe, A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer System, Eastern Research Laboratory, Digital Equipment Corporation, 1984.
[17] S. Balsamo, A. Di Marco, P. Inverardi, and M. Simeoni, "Model-based performance prediction in software development: a survey," IEEE Transactions on Software Engineering, vol. 30, no. 5, pp. 295–310, 2004.
[18] L. Kleinrock, Queueing Systems. Volume 1: Theory, Wiley-Interscience, 1975.
[19] E. Lazowska, J. Zahorjan, G. Graham, and K. Sevcik, Quantitative System Performance: Computer System Analysis Using Queueing Network Models, Prentice-Hall, 1984.
[20] K. Kant and M. Srinivasan, Introduction to Computer System Performance Evaluation, McGraw-Hill College, 1992.
[21] K. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, Wiley, New York, NY, USA, 2002.
[22] J. Banks and J. Carson, Discrete-Event System Simulation, Prentice-Hall, 1984.
[23] M. Marsan, G. Balbo, and G. Conte, Performance Models of Multiprocessor Systems, MIT Press, 1986.
[24] F. Baccelli, G. Balbo, R. Boucherie, J. Campos, and G. Chiola, "Annotated bibliography on stochastic Petri nets," in Performance Evaluation of Parallel and Distributed Systems: Solution Methods, vol. 105 of CWI Tract, 1994.
[25] M. Harchol-Balter, "Task assignment with unknown duration," Journal of the ACM, vol. 49, no. 2, pp. 260–288, 2002.
[26] D. W. Pentico, "Assignment problems: a golden anniversary survey," European Journal of Operational Research, vol. 176, no. 2, pp. 774–793, 2007.
[27] C. Shen and W. Tsai, "A graph matching approach to optimal task assignment in distributed computing systems using a minimax criterion," IEEE Transactions on Computers, vol. 100, no. 3, pp. 197–203, 1985.
[28] M. Harchol-Balter, M. E. Crovella, and C. D. Murta, "On choosing a task assignment policy for a distributed server system," Journal of Parallel and Distributed Computing, vol. 59, no. 2, pp. 204–228, 1999.
[29] R. I. Davis and A. Burns, "A survey of hard real-time scheduling for multiprocessor systems," ACM Computing Surveys, vol. 43, no. 4, article 35, 2011.
[30] Y. Jiang, X. Shen, J. Chen, and R. Tripathi, "Analysis and approximation of optimal co-scheduling on chip multiprocessors," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT '08), pp. 220–229, ACM, October 2008.
[31] R. Bunt, D. Eager, G. Oster, and C. Williamson, "Achieving load balance and effective caching in clustered web servers," in Proceedings of the 4th International Web Caching Workshop, April 1999.
[32] P. B. Godfrey and I. Stoica, "Heterogeneity and load balance in distributed hash tables," in Proceedings of the IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM '05), vol. 1, pp. 596–606, IEEE, March 2005.
[33] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling," in Proceedings of the 5th European Conference on Computer Systems (EuroSys '10), pp. 265–278, ACM, April 2010.
[34] http://wiki.apache.org/hadoop/WordCount.
[35] A. Verma, L. Cherkasova, and R. H. Campbell, "ARIA: automatic resource inference and allocation for MapReduce environments," in Proceedings of the 8th ACM International Conference on Autonomic Computing, pp. 235–244, ACM, June 2011.
[36] J. de Leeuw and S. Pruzansky, "A new computational method to fit the weighted Euclidean distance model," Psychometrika, vol. 43, no. 4, pp. 479–490, 1978.
[37] http://hadoop.apache.org/docs/mapreduce/current/gridmix.html.

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 8: Research Article Risk Intelligence: Making Profit from ...downloads.hindawi.com/journals/tswj/2014/398235.pdf · scale data processing systems such as Mesos [ ], Dryad [ ], and Spark

8 The Scientific World Journal

Start

Findingstraggler

Findingavailable node

Risk-awaredecision making

Yes

Yes

No

NoCan speculative executiondetermined success

Assign backuptask

Assign backup taskand kill original task End

Makingprediction

Profit gt 0 andmaximum profit

Figure 8 Flowchart of risk-aware assignment

and then present the theory of decision making underuncertainty which is the theoretical foundation for our risk-aware assignment At last we introduce our assignmentalgorithm based on the theory

61 Design Overview Themethod is presented as follows inthe first step by comparing the average execution time ofall the running tasks and the passed running time we findout the straggler in step two we check out all the availablework nodes and make the execution time prediction in stepthree wemake the execution time prediction for the stragglerrunning on the available node in step four with the help ofthe theory of decision making under risk the most expectedprofiting assignment was chosen for speculative executionIf the new assignment can guarantee the speculative successthe original task can be killed Figure 8 is a flowchart of ourrisk-aware assignment design The main challenge is how toleverage theory of risk decision to help us assign task

62 Theory of Risk Decision Since our prediction is a rangeof time how to compare two ranges of time is another issueto help decision making A direct method is comparing theaverage time of two ranges and the distance of two averagevalues is the time profit or time loss The average time workswell in the condition where time profit is uniformed per unitof time However this assumption is broken in MapReduceAs we mentioned in Section 3 because of the existence ofstraggler the time profit is only related to the last finish timebut care nothing about the time before last finish time Inother words may be some tasks would finish faster it is stillhelpless for the execution time of the job because the jobwould be finished after all the tasks were executed Underthis condition time profit is different per unit time and thedecision making under risk was introduced

In many fields risky choice and the selection criterionare what people seek to optimize There are two parts

included in risk of decision making risk assessment andrisk management Risk assessment is in charge of evaluatingrisk or benefit and risk management takes responsibilityfor taking appropriate actions to control risk and maximizeprofit Expected value criterion is also called Bayesian princi-ple It incorporates the probabilities of the states of naturecomputes the expected value under each action and thenpicks the action with the largest expected value We cancalculate the potential value of each option with the followingequation

119864 (119883) = sum

119909

119875 (119909)119880 (119909) (5)

where 119864(119883) stands for the expected value Theprobability 119875(119909) of an event 119909 is an indication of howlikely that event is to happen In the equation 119880(119909) denotesthe profit and loss value of 119909 The summation range includesevery number of 119909 that is a possible value of the randomvariable 119883

63 Profit Function and Unique Feature First the function119875(119909) is easy to get from historical statistic As mentionedbefore 119875(119909) is the probability of the occurrence of state 119909 Inour case 119875(119905119894 119905119895) represents the probability that the real exe-cution time falls in the area between [119905119894 119905119895] This probabilitycan learn from statistic information in history For examplethe output of prediction is a set 119879 = 119905min 119905119894 119905119895 119905max inwhich the 119905119894 was sorted by its value Since the prediction is arange of time with a probability distribution we can exploitthe probability distribution to compute the 119875(119905119894 119905119895) with thefollowing equation

$$P(t_i, t_j) = \frac{\left|[t_i, t_j] \cap T\right|}{|T|}, \qquad (6)$$

where $|T|$ is the number of elements in the set $T$ and $|[t_i, t_j] \cap T|$ is the number of elements falling into the range $[t_i, t_j]$.

Next, we take an example to explain our definition of the profit function $U(x)$. As shown in Figure 9(a), the execution time of a task is naturally uncertain, and the assignment scheme should be aware of the potential risks. In Figure 9(b), because of the existence of the straggler, the time profit is related only to the last finish time, denoted $t_s$. Under this condition, the time profit differs per unit of time and is divided into two parts, part one $[t_i, t_s]$ and part two $[t_s, t_j]$, plotted in Figure 9(c). The profit function $U(x)$ can be calculated as

$$U(t_i, t_j) = \begin{cases} \operatorname{avg}(t_i, t_s) - t_s, & \text{for } [t_i, t_s], \\ \operatorname{avg}(t_i, t_j) - t_s, & \text{for } [t_s, t_j], \end{cases} \qquad (7)$$

where $\operatorname{avg}(t_i, t_s)$ is the average time of $[t_i, t_s]$ and $\operatorname{avg}(t_i, t_j)$ is the average time of $[t_i, t_j]$.

Let $T$ denote an assignment option, and let $T_1$ and $T_2$ denote the two ranges of time divided by $t_s$. The expected profit can then be calculated as

$$E(T) = P(T_1)\,U(T_1) + P(T_2)\,U(T_2). \qquad (8)$$
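The following sketch is our reading of equations (6) through (8), not the authors' implementation: it computes the expected profit of one assignment option from its predicted time set and the current last finish time, with the sign convention following equation (7) as printed.

```python
# Expected profit of a backup assignment, per equations (6)-(8).
# T:   predicted execution times of the backup task (the prediction set);
# t_s: the last finish time that currently determines the job's completion.
def expected_profit(T, t_s):
    T = sorted(T)
    part1 = [t for t in T if t <= t_s]     # T1: predictions in [t_i, t_s]
    part2 = [t for t in T if t > t_s]      # T2: predictions in [t_s, t_j]
    p1, p2 = len(part1) / len(T), len(part2) / len(T)     # equation (6)
    # Equation (7): profits are measured against the last finish time t_s.
    u1 = sum(part1) / len(part1) - t_s if part1 else 0.0  # avg(t_i, t_s) - t_s
    u2 = sum(T) / len(T) - t_s if part2 else 0.0          # avg(t_i, t_j) - t_s
    return p1 * u1 + p2 * u2                              # equation (8)
```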


Input: available nodes $N = (n_1, n_2, n_3, \ldots)$
(1) find the straggler task $s$ by prediction
(2) for each available node $n_i$ in $N$ do
(3)     make the prediction $T = [t_{\min}, \ldots, t_i, \ldots, t_{\max}]$
(4)     compute the profit $P_i \leftarrow E(T)$
(5)     record the maximal $P_i$ and its node $n$
(6) end for
(7) if the remaining time $t_{\text{left}} >$ the worst reexecution time $t_{\max}$ then
(8)     kill the straggler: task.kill($s$)
(9) end if
(10) assign the backup task: Task.assign($s$, $n$)

Algorithm 2: Algorithm for risk-aware assignment.
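A compact sketch of Algorithm 2 might look as follows; `predict`, `kill`, and `assign` stand in for the scheduler's prediction and task-control hooks, which we assume here rather than take from the paper, and `expected_profit` is the helper from the sketch above.

```python
# Risk-aware backup assignment (Algorithm 2): evaluate every available node,
# pick the one with the maximal expected profit, and kill the straggler only
# when even the worst predicted backup time beats its remaining time.
def risk_aware_assign(straggler, nodes, t_s, t_left, predict, kill, assign):
    best_node, best_profit, best_T = None, float("-inf"), None
    for node in nodes:                               # steps (2)-(6)
        T = predict(straggler, node)                 # T = [t_min, ..., t_max]
        profit = expected_profit(T, t_s)             # equations (6)-(8)
        if profit > best_profit:
            best_node, best_profit, best_T = node, profit, T
    if best_T is not None and t_left > max(best_T):  # steps (7)-(9)
        kill(straggler)                              # backup is sure to win
    assign(straggler, best_node)                     # step (10): launch backup
```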

Figure 9: An example of different time profit per unit. The time profit is divided into two parts, so we consider the two parts separately. ((a) task execution times; (b) a prediction range $[t_i, t_j]$ against the last finish time $t_s$; (c) the two profit parts split at $t_s$, showing the straggler, the optional backup tasks, and the average execution time; (d) the special kill case.)

If $E(T) > 0$, the assignment is a better option, since it is likely to reduce the completion time of the job. If there are multiple options for backup task assignment, the one with the maximal $E(T)$ is the best choice.

A special case arises when $t_{\max}$ is less than the best time of the straggler, as shown in Figure 9(d). In this case, the straggler can be killed immediately to release resources.

The detailed algorithm is shown in Algorithm 2. We first compare the average execution time of all the running tasks with the elapsed running time to find the straggler. We then check all the available work nodes and, in step three, predict the execution time of the straggler on each available node. In step four, with the help of the theory of decision making under risk, the assignment with the largest expected profit is chosen for speculative execution. If the new assignment can guarantee speculative success, the original task can be killed.

Table 3: Our testbed used in evaluation.

VMs      Map slots   Reduce slots   VM configuration
Node 1   0           1              1 core, 2 GB RAM
Node 2   4           0              4 cores, 2 GB RAM
Node 3   4           4              4 cores, 2 GB RAM
Node 4   4           4              4 cores, 2 GB RAM
Node 5   1           1              4 cores, 2 GB RAM
Total    13          10             17 cores, 10 GB RAM

7. Evaluation. We have conducted a series of experiments to evaluate the performance of RiskI and LATE on a variety of combinations of different job sizes. Since environmental variability may result in high variance in the experimental results, we also performed evaluations on different heterogeneous environment configurations. We ran our first experiment on a small private cluster with 5 working nodes and a master. Our private cluster occupies two whole Dell PowerEdge R710 machines. Each machine has 16 Xeon E5620 2.4 GHz CPUs, 12 GB DDR3 RAM, and 4 × 2.5 TB disks. We use the Xen virtualization software to manage virtual machines. Table 3 lists the hardware configuration of our testbed. We installed our modified version of Hadoop 0.21.0 on the cluster, and the block size was configured as 64 MB. Different numbers of task slots were set on each node according to the diversity of VM configurations.

7.1. Performance on Different Job Sizes. To identify the advantage of RiskI, we first ran some simple workloads. In this configuration, all jobs in a round were set to one size. We evaluated three job sizes: small, medium, and large. A small job contains only 2 maps, a medium job has 10 maps, and a large job has 20 maps. We chose the Bailey-Borwein-Plouffe (BBP) job for evaluation; there are 20 jobs in total with an arrival interval of 60 s.

Figure 10(a) shows a CDF of job execution times for large jobs. We see that about 70% of the large jobs are significantly improved under our scheduling. Figure 10(b) illustrates the corresponding running details, including the total number of speculative executions and those that succeeded or were killed.


Figure 10: CDFs of execution times of jobs at various job sizes. RiskI greatly improves performance for all job sizes, especially for large and medium jobs. ((a) large jobs; (b) statistics of speculative execution for large jobs, including backup tasks failed, succeeded, and killed; (c) medium jobs; (d) small jobs.)

From Figure 10(b) we find that RiskI brings two main benefits to scheduling. First, with the help of estimation, we are able to cut off some unnecessary speculative executions and thus release more resources: since most speculative executions failed under LATE, the total number of speculative executions was reduced from 181 to 34 under RiskI. Second, the success ratio of speculation increased: in our scheduler only 6 speculative executions out of 34 failed, and the success ratio is 81%.

Figures 10(c) and 10(d) plot the results for the medium and small job sizes. The performance on large jobs is distinguished from that on smaller jobs because small jobs cause less resource contention under a low system workload.

7.2. Performance on Mixed Job Sizes. In this section we consider a workload with a more complex job combination. We use Gridmix2 [37], a default benchmark for Hadoop clusters, which can generate an arrival sequence of jobs from an input mix of synthetic jobs covering the typical possibilities.

To simulate a real trace with the synthetic workload, we generated job sets that follow the Facebook trace used in the evaluation of delay scheduling [33]. Delay scheduling divided the jobs into 9 bins by job size and assigned each bin a different proportion. For the sake of simplicity, we generalize these job sizes into 3 groups: small, medium, and large. Jobs in bin 1 take up about 38% of that trace, and we assign the same proportion to the small jobs. We grouped bins 2–4 as medium jobs, which also hold 38% in total. The rest of the bins are grouped into large jobs, with about 24%. Table 4 summarizes the job size configuration, and a small sketch of generating such a mix follows the table.

Table 4: Distribution of job sizes.

Job size   Maps   Reduces   Jobs   Jobs (%)
Small      2      2         19     38
Medium     10     5         19     38
Large      20     10        12     24
Total      468    253       50     100
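As an illustration only (the actual sequence used in our experiments is the fixed one shown in Figure 11), a job mix matching the Table 4 proportions could be drawn as below; the size names and map/reduce counts mirror the table, while everything else is our invention.

```python
# Draw a 50-job mix with the Table 4 proportions: 38% small, 38% medium,
# 24% large. This sketch only shows how such a mix can be sampled.
import random

SIZES = {"small": (2, 2), "medium": (10, 5), "large": (20, 10)}  # (maps, reduces)

def draw_job_mix(n_jobs=50, seed=0):
    rng = random.Random(seed)
    names = ["small", "medium", "large"]
    weights = [0.38, 0.38, 0.24]
    return [rng.choices(names, weights=weights)[0] for _ in range(n_jobs)]

mix = draw_job_mix()
print({s: mix.count(s) for s in SIZES})  # roughly 19/19/12 jobs
```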


Figure 11: The Gridmix workload has 50 jobs in total. The first 13 jobs are medium size; jobs 14 to 23 are small; small and large jobs arrive alternately from job 26 to job 40; the large jobs appear consecutively from job 41 to job 45; and the last 5 jobs are medium.

Table 5: The configuration details for BBP, including start digit, number of digits, and the workload per map task.

Job size   Start digit   ndigit   Workload/map
Small      1             25000    156283174
Medium     1             55000    151332558
Large      1             70000    122287914

Besides the job size proportions, the arrival interval is also very important in generating a job arrival trace, since it may lead to different scheduling behaviors and subsequently affect system throughput. In our experiments we set the arrival interval to 60 s and 120 s. Figure 11 gives the generated job sequence.

Finally, we ran the workload described above under two schedulers: LATE (on Hadoop's default FIFO scheduler) and RiskI. All jobs were submitted by a single user.

7.2.1. Result for BBP. The detailed configuration of the BBP jobs is shown in Table 5. Figure 12 plots the execution time of each job with interval 60 s.

For ease of analysis, we divided the job sequence into 5 parts. The first part is from job 1 to job 13, where the jobs are medium. The second part, from job 14 to job 23, consists entirely of small jobs. Jobs 24 to 40 form the third part, which mainly consists of small and large jobs. The last 10 jobs are equally divided into two parts, large and medium respectively.

In the first part (M), the response times of jobs accumulate from the beginning because the workload gradually becomes heavier. The resource contention led to more stragglers and more speculative executions, and more resources were used for backup tasks, so the following jobs had to wait until resources were released. However, for lack of a delicate scheduling design, most backup tasks failed: the system sacrificed throughput but gained no benefit. From Figure 12(b) we can see that there are 48 backup tasks and only 19 succeeded. Compared to LATE, RiskI cut the number of backup tasks from 48 to 13; the statistics are plotted in Figure 12(c). Many invalid speculative executions were avoided, preserving throughput. Since some backup tasks were certain to finish earlier, we killed the first attempt of the task immediately in order to save resources. As a result, only 2 backup tasks failed. The same pattern occurred in the remaining 4 parts.

However, the situation is a little different when the job arrival interval is 120 s. We plot the execution time of each job with interval 120 s in Figure 13. There is little difference in the first 4 parts. The reason is that the system workload is light, because each job has 120 s before the next job arrives, so the system has plenty of free resources for running backup tasks. We can see from Figure 13(b) that no backup task succeeded in the first two parts; even though those backup tasks failed, they caused no loss. In RiskI, by contrast, no backup task was launched for speculative execution in those parts, as plotted in Figure 13(c), because none of them would succeed. Most of the difference lies in part L, where we reduced the speculative execution count from 39 to 12 and only 1 speculative execution failed.

As a result, our scheduler performs much better at interval 60 s than at interval 120 s.

7.2.2. Result for Streamsort. We also evaluated the streamsort job by running the configured Gridmix2, modifying Hadoop 0.21.0 for the different arrival intervals. The performance of sort is likewise improved more when the workload is heavy, since RiskI not only shortened job response times but also killed some tasks to release resources earlier. When the workload is light, our scheduler only benefits from reducing job response time.

Another conclusion is that the average improvement for BBP is better than for sort. The main reason is that, compared to BBP, the task prediction for streamsort has a longer range, which means more uncertainty.

7.3. Impact of Heterogeneity Degree. We changed the testbed and Hadoop configurations to evaluate RiskI in different execution environments, running our experiments in two other environments. The original environment above is denoted as testbed 1.


Figure 12: Details of execution information with interval 60 s. ((a) execution time for each of the 50 jobs; (b) speculative counts for LATE (succeeded and failed); (c) speculative counts for RiskI (succeeded, failed, and killed), by part.)

The two new environments are denoted as testbeds 2 and 3. Testbed 3 is 4 times more heterogeneous than testbed 2 and 8 times more heterogeneous than testbed 1; we measure heterogeneity by the number of virtual machines on the same physical machine. The results are shown in Figure 14. When the environment is more heterogeneous, the performance of RiskI degrades much less than that of LATE.

7.4. Stability of Prediction. Since we make our prediction by searching for similar executions of a task in history, we have to demonstrate that similar executions have stable response times. Obviously, a stable response time makes our prediction meaningful; otherwise, our prediction is useless for scheduling. Therefore, we carried out the following experiments to validate our prediction.

First, we selected a four-slot work node, which implies that at most four tasks can run on the node in parallel. We then investigated the task running time under all the possible execution conditions. These conditions consist of various combinations of background tasks selected from wordcount, randomwrite, and Bailey-Borwein-Plouffe, giving 9 conditions in total. Figure 15(a) shows our results on the running time of a task from a BBP job; the task ran 20 times in every environment. We can see that the execution time is relatively stable within a range. For further analysis, Figure 4(a) shows the CDF of the response time distribution between the longest and the shortest times in one condition; a small measurement sketch follows.
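As a sketch of this stability check (our illustration; `run_task` stands in for actually executing the benchmark task under a given background condition), one could collect the 20 repetitions per condition and report the observed spread:

```python
# For each background condition, repeat the task and record the range of its
# response times; a narrow (min, max) range means the prediction is stable.
def measure_stability(run_task, conditions, repeats=20):
    spread = {}
    for cond in conditions:
        times = [run_task(cond) for _ in range(repeats)]
        spread[cond] = (min(times), max(times))
    return spread
```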


Figure 13: Details of execution information with interval 120 s. ((a) execution time for each of the 50 jobs; (b) speculative counts for LATE (succeeded and failed); (c) speculative counts for RiskI (succeeded, failed, and killed), by part.)

It demonstrates that most response times are centrally distributed. The results for a task from wordcount are shown in Figures 15(b) and 4(b), and the same conclusion holds.

Figure 16 shows how the prediction evolves as the number of executions increases. We selected the two conditions where the prediction is most stable and most unstable, and plot the evolutionary process of the prediction.

8. Conclusion

We propose a framework named RiskI for task assignment in a fault-tolerant data processing system. We suggest that, by finding the most similar task in history, a profile-based method can predict the execution time as a range of time. RiskI then makes risk-aware task assignments to seek more profit while reducing the risk introduced by uncertainty. Extensive evaluations have shown that RiskI provides good performance in all conditions, and that RiskI performs much better when the system exhibits more heterogeneity. We believe that the performance of RiskI will improve further as we better understand the system's behavior and find a better similarity algorithm.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Figure 14: CDFs of execution time in the different testbeds ((a) testbed 1; (b) testbed 2; (c) testbed 3).

Figure 15: Execution time prediction ((a) task prediction for BBP; (b) task prediction for wordcount).


Figure 16: Prediction evolution ((a) best case; (b) worst case).

Acknowledgments

This paper is supported by NSFC nos. 61272483 and 61202430 and National Lab Fund no. KJ-12-06.

References

[1] K. V. Vishwanath and N. Nagappan, "Characterizing cloud computing hardware reliability," in Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC '10), pp. 193–203, June 2010.
[2] P. Gill, N. Jain, and N. Nagappan, "Understanding network failures in data centers: measurement, analysis, and implications," ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 350–361, 2011.
[3] Q. Zheng, "Improving MapReduce fault tolerance in the cloud," in Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, Workshops and Phd Forum (IPDPSW '10), pp. 1–6, IEEE, April 2010.
[4] Q. Zhang, L. Cheng, and R. Boutaba, "Cloud computing: state-of-the-art and research challenges," Journal of Internet Services and Applications, vol. 1, no. 1, pp. 7–18, 2010.
[5] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[6] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, pp. 29–42, 2008.
[7] G. Ananthanarayanan, S. Kandula, A. Greenberg, et al., "Reining in the outliers in map-reduce clusters using Mantri," in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, pp. 1–16, USENIX Association, 2010.
[8] B. Hindman, A. Konwinski, M. Zaharia, et al., "Mesos: a platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, pp. 22–22, USENIX Association, 2011.
[9] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," ACM SIGOPS Operating Systems Review, vol. 41, no. 3, pp. 59–72, 2007.
[10] M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10–10, USENIX Association, 2010.
[11] A. Snavely, N. Wolter, and L. Carrington, "Modeling application performance by convolving machine signatures with application profiles," in Proceedings of the IEEE International Workshop on Workload Characterization (WWC-4 '01), pp. 149–156, IEEE, 2001.
[12] R. Uhlig, G. Neiger, D. Rodgers, et al., "Intel virtualization technology," Computer, vol. 38, no. 5, pp. 48–56, 2005.
[13] P. Jalote, Fault Tolerance in Distributed Systems, Prentice-Hall, 1994.
[14] A. Salman, I. Ahmad, and S. Al-Madani, "Particle swarm optimization for task assignment problem," Microprocessors and Microsystems, vol. 26, no. 8, pp. 363–371, 2002.
[15] M. Frank and P. Wolfe, "An algorithm for quadratic programming," Naval Research Logistics Quarterly, vol. 3, no. 1-2, pp. 95–110, 2006.
[16] R. Jain, D. Chiu, and W. Hawe, A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer System, Eastern Research Laboratory, Digital Equipment Corporation, 1984.
[17] S. Balsamo, A. Di Marco, P. Inverardi, and M. Simeoni, "Model-based performance prediction in software development: a survey," IEEE Transactions on Software Engineering, vol. 30, no. 5, pp. 295–310, 2004.
[18] L. Kleinrock, Queueing Systems, Volume 1: Theory, Wiley-Interscience, 1975.
[19] E. Lazowska, J. Zahorjan, G. Graham, and K. Sevcik, Quantitative System Performance: Computer System Analysis Using Queueing Network Models, Prentice-Hall, 1984.
[20] K. Kant and M. Srinivasan, Introduction to Computer System Performance Evaluation, McGraw-Hill College, 1992.
[21] K. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Wiley, New York, NY, USA, 2002.
[22] J. Banks and J. Carson, Discrete-Event System Simulation, Prentice-Hall, 1984.
[23] M. Marsan, G. Balbo, and G. Conte, Performance Models of Multiprocessor Systems, MIT Press, 1986.
[24] F. Baccelli, G. Balbo, R. Boucherie, J. Campos, and G. Chiola, "Annotated bibliography on stochastic Petri nets," in Performance Evaluation of Parallel and Distributed Systems: Solution Methods, vol. 105, CWI Tract, 1994.
[25] M. Harchol-Balter, "Task assignment with unknown duration," Journal of the ACM, vol. 49, no. 2, pp. 260–288, 2002.
[26] D. W. Pentico, "Assignment problems: a golden anniversary survey," European Journal of Operational Research, vol. 176, no. 2, pp. 774–793, 2007.
[27] C. Shen and W. Tsai, "A graph matching approach to optimal task assignment in distributed computing systems using a minimax criterion," IEEE Transactions on Computers, vol. 100, no. 3, pp. 197–203, 1985.
[28] M. Harchol-Balter, M. E. Crovella, and C. D. Murta, "On choosing a task assignment policy for a distributed server system," Journal of Parallel and Distributed Computing, vol. 59, no. 2, pp. 204–228, 1999.
[29] R. I. Davis and A. Burns, "A survey of hard real-time scheduling for multiprocessor systems," ACM Computing Surveys, vol. 43, no. 4, article 35, 2011.
[30] Y. Jiang, X. Shen, J. Chen, and R. Tripathi, "Analysis and approximation of optimal co-scheduling on chip multiprocessors," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT '08), pp. 220–229, ACM, October 2008.
[31] R. Bunt, D. Eager, G. Oster, and C. Williamson, "Achieving load balance and effective caching in clustered web servers," in Proceedings of the 4th International Web Caching Workshop, April 1999.
[32] P. B. Godfrey and I. Stoica, "Heterogeneity and load balance in distributed hash tables," in Proceedings of the IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM '05), vol. 1, pp. 596–606, IEEE, March 2005.
[33] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling," in Proceedings of the 5th European Conference on Computer Systems (EuroSys '10), pp. 265–278, ACM, April 2010.
[34] http://wiki.apache.org/hadoop/WordCount
[35] A. Verma, L. Cherkasova, and R. H. Campbell, "ARIA: automatic resource inference and allocation for MapReduce environments," in Proceedings of the 8th ACM International Conference on Autonomic Computing, pp. 235–244, ACM, June 2011.
[36] J. de Leeuw and S. Pruzansky, "A new computational method to fit the weighted Euclidean distance model," Psychometrika, vol. 43, no. 4, pp. 479–490, 1978.
[37] http://hadoop.apache.org/docs/mapreduce/current/gridmix.html


Page 9: Research Article Risk Intelligence: Making Profit from ...downloads.hindawi.com/journals/tswj/2014/398235.pdf · scale data processing systems such as Mesos [ ], Dryad [ ], and Spark

The Scientific World Journal 9

Input available Nodes 119873 = (1198991 1198992 1198993 )(1) Finding the straggler task 119904 by prediction(2) for each available node 119899119894 in 119873 do(3) Make prediction 119879 = [119905min 119905119894 119905max](4) Compute profit 119875119894 larr 119864(119879)(5) Record the maximal 119875119894 and node 119899(6) end for(7) if remaining time 119905left gt the worst reexecution time 119905max then(8) kill task 119905119886119904119896 sdot 119896119894119897119897 (119904)(9) end if(10) assign backup task 119879119886119904119896119886119904119904119894119892119899 (119904 119899)

Algorithm 2 Algorithm for risk-aware assignment

Tim

e (s)

(a)

tjts

ti

(b)

Tim

e (s)

Part one

Part two

ts

StragglerOptional backup tasksAverage execution time

(c)

ts

Kill

StragglerOptional backup tasksAverage execution time

(d)

Figure 9 An example for different time profit per unit The timeprofit is divided into two parts So we considered separately thesetwo parts

If 119864(119879) gt 0 119864(119879) would be a better option since it islikely more close to reduce the completion time of a jobIf there are more options for backup task assignment themax 119864(119879) would be the best choice

A special case is that 119905max is less than the best time ofstraggler as shown in Figure 9(d) In this case the stragglercan be killed immediately to release resources

The detailed algorithm is shown in Algorithm 2 We firstcompare the average execution time of all the running tasksand the passed running time and then we find out thestraggler And then we check out all the available work nodesand make the execution time prediction In step three wemake the execution time prediction for the straggler runningon the available node In step four with the help of theoryof decision making under risk the most expected profitingassignment was chosen for speculative execution If the newassignment can guarantee the speculative success the originaltask can be killed

Table 3 Our testbed used in evaluation

VMs Map slots Reduce slots VM configurationNode 1 0 1 1 core 2G RAMNode 2 4 0 4 core 2 G RAMNode 3 4 4 4 core 2 G RAMNode 4 4 4 4 core 2 G RAMNode 5 1 1 4 core 2 G RAMTotal 13 10 17 core 10G RAM

7 EvaluationWe have conducted a series of experiments to evaluate theperformance of RiskI and LATE in a variety combination ofdifferent job size Since the environmental variability mayresult in high variance in the experiment result we alsoperformed evaluations on different heterogeneous environ-ment configuration We ran our first experiment on a smallprivate cluster which has 5 working nodes and a masterOur private cluster occupies two whole Dell PowerEdgeR710 machines Each machine has 16 Xeon E5620 24GHzCPUs 12GB DDR3 RAM and 4 times 25119879 disks We use Xenvirtualization software to manage virtual machines Table 3lists the hardware configurations of our testbed We installedour modified version of Hadoop 0210 to cluster and theblock size was configured as 64MB Different numbers oftask slots were set to each node according to diversity of VMconfigurations

71 Performance on Different Job Size To identify the pre-dominance of RiskI we first ran some simple workloads Inthis configuration all jobs were set with one size in a roundWe performed evaluation on three different types of job sizewhich are small medium and large respectively The smalljob only contains 2 maps the medium job has 10 maps andthe large job has 20 maps We chose Bailey-Borwein-Plouffe(BBP) job for evaluation and there are 20 jobs in total witharrival interval 60 s

Figure 10(a) shows a CDF of job execution time for largejobs We see that about 70 of the large jobs are significantlyimproved under our scheduling Figure 10(b) illustrates thecorresponding running details including the total numberof speculative executions those succeeded and being killed

10 The Scientific World Journal

0 400 800 1200 1600

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(a) Large jobs

200

150

100

50

0

RiskI

FailedSucceededKilled

LATE

(b) Statistics of speculative execution for large jobs includingbackup task failed succeeded and killed

100 250 400 550 700

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(c) Medium jobs

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

90 93 96 99 102

(d) Small jobs

Figure 10 CDFs of execution times of jobs in various job sizes RiskI greatly improves performance for all size of jobs especially for largeand medium jobs

From Figure 10(b) we can find that RiskI has two main ben-efits on scheduling First with the help of estimation we areable to cut off some unnecessary speculative executions thusrelease more resources Since most speculative executionswere failed in LATE the total speculative executions werereduced from 181 to 34 in RiskI Second the successful ratioof speculative increased In our scheduler only 6 speculativeexecutions failed out of 34 and the success ratio is 81

Figures 10(c) and 10(d) plotted the results on medium jobsize and small job size The performance on large job size isdistinguished from those on smaller jobs because the smalljobs cause less resource contention for low system workload

72 Performance onMix Job Size In this section we considera workload with more complex job combination We useGridmix2 [37] a default benchmarks for Hadoop clusterwhich can generate an arrival sequence of jobs with the inputof a mix of synthetic jobs considering all typical possibility

In order to simulate the real trace for synthetic workloadwe generated job sets that follow the trace formFacebook that

Table 4 Distribution of job sizes

Job size Map Reduce Jobs JobsSmall 2 2 19 38Medium 10 5 19 38Large 20 10 12 24Total 468 253 50

were used in the evaluation of delay scheduling [33] Delayscheduling divided the jobs into 9 bins on different job sizeand assigned different proportion respectively For the sakeof simplicity we generalize these job sizes into 3 groups smallmedium and large respectively Jobs in bin 1 take up about38 in that trace and we assign the same proportion to thesmall jobs We grouped bins 2ndash4 as medium jobs which alsohold 38 in total The rest of bins are grouped into largejobs with about 24 amount Table 4 summarizes the job sizeconfiguration

The Scientific World Journal 11

M M M M M M M M M M M M M

M

M M M M M

S S

S S S S S S S S S

S S S S S S S S

1 2 3 4 5 6 7 8 9 10 11 12 13

14 15 16 17 18 19 20 21 22 23

24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

L L L

L L L L L L L

L L41 42 43 44 45 46 47 48 49 50

Figure 11 The Gridmix job has 50 jobs in total The first 13 jobs are medium size The jobs from 14 to 23 are small The small job and largejob are alternate arrived from job 26 to job 40 The large jobs appear consecutive form job 41 to job 45 and the last 5 jobs are medium

Table 5The configuration details for BBP including start digit enddigit and the workload per map task

Job size Start Digit ndigit WorkloadmapSmall 1 25000 156283174Medium 1 55000 151332558Large 1 70000 122287914

Except for the job size proportion arrival interval is alsovery important in generating a job arrival trace since it maylead to different schedule behaviors and subsequently affectsystem throughput In our experiment we set the arrivalinterval to 60 s and 120 s Figure 11 gives the generated jobsequences

Finally we ran the workload described above undertwo schedulers LATE (Hadoop default scheduler FIFO) andRiskI We submitted each job by only one user

721 Result for BBP Our detailed configuration of BBP jobswas shown in Table 5 Figure 12 plotted the execution time ofeach job with interval 60 s

For ease of analysis we divided the job sequence into 5parts The first part is from job 1 to job 13 where the jobs aremedium The second part is from job 14 to job 23 The jobsin the second part are all small Job 24 to job 50 is the thirdpart mainly consists of small and large jobs The last 10 jobsare equally divided into two parts and the jobs are large andmedium

In the first part M the response time of jobs is accumu-lated from the beginning because the workload is becomingheavier gradually The resource contention led to morestragglers and more speculative executions More resourceswere used for backup task The following jobs had to waituntil resources were released However because of lack ofdelicate design of scheduling most backup tasks failed Thesystem sacrificed the throughput but did not get benefitFrom Figure 12(b) we can see that there are 48 backup tasksand only 19 succeeded Compared to LATE RiskI cut thenumber of backup task from 48 to 13 The static result was

plotted in Figure 12(c) Many invalid speculative executionsare avoided and saved the throughput Since some backuptasks are certainly finished earlier we killed the first attemptof task immediately in order to save resources As a resultonly 2 backup tasks failed The same conditions occurred inthe left 4 parts

However the situation is a little different when the jobarrival interval is 120 s We also plotted the execution timeof each job with interval 120 s in Figure 13 There are littledifferent exits in the first 4 parts The reason is that thesystem workload is light because each job has 120 s beforethe next job arrives The system has much free resourcesfor backup task running We can see from Figure 13(b) thatno backup task succeeded in the first two parts Even if thebackup task failed it will not cause any loss But in RiskI nobackup task was launched for speculative execution whichwas plotted in Figure 13(c) Because none of them wouldsucceed most differences lied in the part L where we reducedthe speculative execution count from 39 to 12 and only 1speculative execution failed

As a result our scheduler performs much better ininterval 60 than interval 120

722 Result for Streamsort We evaluated the job of stream-sort by running configured gridmix2 also But we have tomodify the Hadoop 0210 for different arrival interval Theperformance of sort is also improved better when workloadis heavy since RiskI not only shortened the response timeof job but also killed some tasks to release resource earlierWhen workload is light our scheduler only takes advantageof reducing job response time

Another conclusion is that the average performance ofBBP is better than sort The main reason is that compared toBBP prediction of task for streamsort has longer range whichmeans more uncertainty

73 Impact of Heterogeneous Degree We change the testbedand Hadoop configurations to evaluate RiskI in differentexecution environments We run our experiments in twoother different environments The original environmentabove is denoted as testbed 1 And the new environments

12 The Scientific World Journal

1200

900

600

300

01 6 11 16 21 26 31 36 41 46

LATERiskI

Tim

e (s)

Job index

M S S L L M

(a) Execution time for 50 jobs

80

60

40

20

0

M S S L L M

SucceededFailed

(b) Speculative count for LATE

M S S L L M

SucceededFailed

Killed

25

20

15

10

5

0

(c) Speculative count for RiskI

Figure 12 Details of execution information with interval 60 (a) is static of execution time for each job (b and c) are statistics of speculativeexecution count in different parts

were denoted as testbeds 2 and 3 The testbed 3 is 4times more heterogeneous than testbed 2 and 8 timesthan testbed 1 We measure heterogeneity by its numberof virtual machines on the same physical machine Theresult was shown in Figure 14 When the environment ismore heterogeneous RiskI perform degrades much less thanLATE

74 Stable of Prediction Since we make our prediction bysearching similar executions of task in history we have todemonstrate that the similar executions have stable responsetime It is obvious that stable response time will make ourprediction significant Otherwise our prediction is helpless

for scheduling Therefore we did the following experimentsto proof our prediction

First we selected a four-slot work node which impliesat most four tasks that can run on the node in parallelAnd then we investigated the task running time under allthe possible execution conditions These conditions consistof various combinations of tasks which are selected fromwordcount randomwrite and Bailey-Borwein-Plouffe Sothere are 9 conditions in total Figure 15(a) shows our resultson running time of task from BBP job that the task ran 20times in every environment We can see that the executiontime is relatively stable in a range time For further analysisFigure 4(a) shows the CDF of response time distributionbetween the longest time and the shortest time in one

The Scientific World Journal 13

01 6 11 16 21 26 31 36 41 46

Tim

e (s)

Job index

M S S L L M

LATERiskI

500

400

300

200

100

(a) Execution time for 50 jobs

80

60

40

20

0

M S S L L M

SucceededFailed

100

(b) Speculative count for LATE

M S S L L M

SucceededFailed

Killed

15

10

5

0

(c) Speculative count for RiskI

Figure 13 Details of execution information with interval 120 (a) is static of execution time for each job (b and c) are statistics of speculativeexecution count in different parts

condition It demonstrated that most times are centralizeddistribution The result of task from wordcount is shownin Figures 15(b) and 4(b) The conclusion can be alsocertified

We showed the prediction evolution with the execu-tion times increased in Figure 16 We selected two con-ditions where the prediction is most stable and mostunstable The evolutionary process of prediction is plot-ted

8 Conclusion

We propose a framework named RiskI for task assignmentin a fault tolerance data processing system We suggest thatby finding the most similar task in history profile-based

method can predict the execution time with a range of timeThen it makes risk-aware assignment for task to seek moreprofit while reducing the risk introduced by uncertaintyExtensive evaluations have shown that RiskI can providegood performance in all conditions And RiskI performsmuch better when the system puts upmore heterogeneityWebelieve that the performance of RiskI will improve further ifwe understand the behavior of system and find better similaralgorithm

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

14 The Scientific World Journal

0 250 500 750 1000

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(a) Testbed 1

0 250 500 750 1000

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(b) Testbed 2

0 250 500 750 1000

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(c) Testbed 3

Figure 14 CDFs of execution time in different testbeds

70

56

42

28

14

0 2 4 6 8 10 12

Tim

e (s)

Background index

(a) Task prediction for BBP

60

48

36

24

12

00 2 4 6 8 10 12

Tim

e (s)

Background index

(b) Task prediction for wordcount

Figure 15 Execution time prediction

The Scientific World Journal 15

29

27

25

23

210 4 8 12 16 20

Tim

e (s)

Execution times

(a) Best case

0 4 8 12 16 20

Tim

e (s)

Execution times

65

60

55

50

45

(b) Worst case

Figure 16 Prediction evolution

Acknowledgments

This paper is supported by NSFC no 61272483 and no61202430 and National Lab Fund no KJ-12-06

References

[1] K V Vishwanath and N Nagappan ldquoCharacterizing cloudcomputing hardware reliabilityrdquo in Proceedings of the 1st ACMSymposium on Cloud Computing (SoCC rsquo10) pp 193ndash203Citeseer June 2010

[2] P Gill N Jain and N Nagappan ldquoUnderstanding network fail-ures in data centers measurement analysis and implicationsrdquoACMSIGCOMMComputer Communication Review vol 41 no4 pp 350ndash361 2011

[3] Q Zheng ldquoImprovingMapReduce fault tolerance in the cloudrdquoin Proceedings of the IEEE International Symposium on Par-allel and Distributed Processing Workshops and Phd Forum(IPDPSW rsquo10) pp 1ndash6 IEEE April 2010

[4] Q Zhang L Cheng and R Boutaba ldquoCloud computing state-of-the-art and research challengesrdquo Journal of Internet Servicesand Applications vol 1 no 1 pp 7ndash18 2010

[5] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[6] M Zaharia A Konwinski A Joseph R Katz and I StoicaldquoImproving mapreduce performance in heterogeneous envi-ronmentsrdquo in Proceedings of the 8th USENIX Conference onOperating Systems Design and Implementation pp 29ndash42 2008

[7] G Ananthanarayanan S Kandula A Greenberg et al ldquoReiningin the outliers in map-reduce clusters using mantrirdquo in Proceed-ings of the 9th USENIX Conference on Operating Systems Designand Implementation pp 1ndash16 USENIX Association 2010

[8] B Hindman A Konwinski M Zaharia et al ldquoMesos aplatform for fine-grained resource sharing in the data centerrdquo inProceedings of the 8thUSENIXConference onNetworked SystemsDesign and Implementation pp 22ndash22 USENIX Association2011

[9] M Isard M Budiu Y Yu A Birrell and D Fetterly ldquoDryaddistributed data-parallel programs from sequential buildingblocksrdquo ACM SIGOPS Operating Systems Review vol 41 no 3pp 59ndash72 2007

[10] M Zaharia M Chowdhury M Franklin S Shenker andI Stoica ldquoSpark cluster computing with working setsrdquo inProceedings of the 2nd USENIX Conference on Hot Topics inCloud Computing pp 10ndash10 USENIX Association 2010

[11] A Snavely NWolter and L Carrington ldquoModeling applicationperformance by convolving machine signatures with applica-tion profilesrdquo in Proceedings of the IEEE InternationalWorkshoponWorkload Characterization (WWC-4 rsquo01) pp 149ndash156 IEEE2001

[12] R Uhlig G Neiger D Rodgers et al ldquoIntel virtualizationtechnologyrdquo Computer vol 38 no 5 pp 48ndash56 2005

[13] P Jalote Fault Tolerance in Distributed Systems Prentice-Hall1994

[14] A Salman I Ahmad and S Al-Madani ldquoParticle swarmoptimization for task assignment problemrdquoMicroprocessors andMicrosystems vol 26 no 8 pp 363ndash371 2002

[15] M Frank and P Wolfe ldquoAn algorithm for quadratic program-mingrdquoNaval Research Logistics Quarterly vol 3 no 1-2 pp 95ndash110 2006

[16] R Jain D Chiu and W Hawe A Quantitative Measure ofFairness and Discrimination for Resource Allocation in SharedComputer System Eastern Research Laboratory Digital Equip-ment Corporation 1984

[17] S Balsamo A DiMarco P Inverardi andM Simeoni ldquoModel-based performance prediction in software development asurveyrdquo IEEE Transactions on Software Engineering vol 30 no5 pp 295ndash310 2004

[18] L Kleinrock Queueing Systems Volume 1 Theory Wiley-Interscience 1975

[19] E Lazowska J Zahorjan G Graham and K Sevcik Quan-titative System Performance Computer System Analysis UsingQueueing Network Models Prentice-Hall 1984

[20] K Kant and M Srinivasan Introduction to Computer SystemPerformance Evaluation McGraw-Hill College 1992

[21] K Trivedi Probability and Statistics with Reliability Queuingand Computer Science Applications Wiley New York NY USA2002

[22] J Banks and J Carson Event System Simulation Prentice-Hall1984

[23] M Marsan G Balbo and G Conte Performance Models ofMultiprocessor Systems MIT Press 1986

[24] F Baccelli G Balbo R Boucherie J Campos and G ChiolaldquoAnnotated bibliography on stochastic petri netsrdquo in Perfor-mance Evaluation of Parallel and Distributed Systems SolutionMethods vol 105 CWI Tract 1994

[25] M Harchol-Balter ldquoTask assignment with unknown durationrdquoJournal of the ACM vol 49 no 2 pp 260ndash288 2002

[26] D W Pentico ldquoAssignment problems a golden anniversarysurveyrdquo European Journal of Operational Research vol 176 no2 pp 774ndash793 2007

16 The Scientific World Journal

[27] C Shen and W Tsai ldquoA graph matching approach to optimaltask assignment in distributed computing systems using aminimax criterionrdquo IEEE Transactions on Computers vol 100no 3 pp 197ndash203 1985

[28] M Harchol-Balter M E Crovella and C D Murta ldquoOnchoosing a task assignment policy for a distributed serversystemrdquo Journal of Parallel and Distributed Computing vol 59no 2 pp 204ndash228 1999

[29] R I Davis and A Burns ldquoA survey of hard real-time schedulingfor multiprocessor systemsrdquo ACM Computing Surveys vol 43no 4 article 35 2011

[30] Y Jiang X Shen J Chen and R Tripathi ldquoAnalysis and approx-imation of optimal co-scheduling on chip multiprocessorsrdquoin Proceedings of the 17th International Conference on ParallelArchitectures and Compilation Techniques (PACT rsquo08) pp 220ndash229 ACM October 2008

[31] R Bunt D Eager G Oster and C Williamson ldquoAchievingload balance and effective caching in clustered web serversrdquoin Proceedings of the 4th International Web Caching WorkshopApril 1999

[32] P B Godfrey and I Stoica ldquoHeterogeneity and load balance indistributed hash tablesrdquo in Proceedings of the IEEE 24th AnnualJoint Conference of the IEEE Computer and CommunicationsSocieties (INFOCOM rsquo05) vol 1 pp 596ndash606 IEEE March2005

[33] M Zaharia D Borthakur J Sen Sarma K Elmeleegy SShenker and I Stoica ldquoDelay scheduling a simple techniquefor achieving locality and fairness in cluster schedulingrdquo in Pro-ceedings of the 5th European Conference on Computer Systems(EuroSys rsquo10) pp 265ndash278 ACM April 2010

[34] httpwikiapacheorghadoopWordCount[35] A Verma L Cherkasova and R H Campbell ldquoARIA auto-

matic resource inference and allocation formapreduce environ-mentsrdquo in Proceedings of the 8th ACM International Conferenceon Autonomic Computing pp 235ndash244 ACM June 2011

[36] J de Leeuw and S Pruzansky ldquoA new computationalmethod tofit the weighted euclidean distance modelrdquo Psychometrika vol43 no 4 pp 479ndash490 1978

[37] httphadoopapacheorgdocsmapreducecurrentgridmixhtml

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 10: Research Article Risk Intelligence: Making Profit from ...downloads.hindawi.com/journals/tswj/2014/398235.pdf · scale data processing systems such as Mesos [ ], Dryad [ ], and Spark

10 The Scientific World Journal

0 400 800 1200 1600

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(a) Large jobs

200

150

100

50

0

RiskI

FailedSucceededKilled

LATE

(b) Statistics of speculative execution for large jobs includingbackup task failed succeeded and killed

100 250 400 550 700

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(c) Medium jobs

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

90 93 96 99 102

(d) Small jobs

Figure 10 CDFs of execution times of jobs in various job sizes RiskI greatly improves performance for all size of jobs especially for largeand medium jobs

From Figure 10(b) we can find that RiskI has two main ben-efits on scheduling First with the help of estimation we areable to cut off some unnecessary speculative executions thusrelease more resources Since most speculative executionswere failed in LATE the total speculative executions werereduced from 181 to 34 in RiskI Second the successful ratioof speculative increased In our scheduler only 6 speculativeexecutions failed out of 34 and the success ratio is 81

Figures 10(c) and 10(d) plotted the results on medium jobsize and small job size The performance on large job size isdistinguished from those on smaller jobs because the smalljobs cause less resource contention for low system workload

72 Performance onMix Job Size In this section we considera workload with more complex job combination We useGridmix2 [37] a default benchmarks for Hadoop clusterwhich can generate an arrival sequence of jobs with the inputof a mix of synthetic jobs considering all typical possibility

In order to simulate the real trace for synthetic workloadwe generated job sets that follow the trace formFacebook that

Table 4 Distribution of job sizes

Job size Map Reduce Jobs JobsSmall 2 2 19 38Medium 10 5 19 38Large 20 10 12 24Total 468 253 50

were used in the evaluation of delay scheduling [33] Delayscheduling divided the jobs into 9 bins on different job sizeand assigned different proportion respectively For the sakeof simplicity we generalize these job sizes into 3 groups smallmedium and large respectively Jobs in bin 1 take up about38 in that trace and we assign the same proportion to thesmall jobs We grouped bins 2ndash4 as medium jobs which alsohold 38 in total The rest of bins are grouped into largejobs with about 24 amount Table 4 summarizes the job sizeconfiguration

The Scientific World Journal 11

M M M M M M M M M M M M M

M

M M M M M

S S

S S S S S S S S S

S S S S S S S S

1 2 3 4 5 6 7 8 9 10 11 12 13

14 15 16 17 18 19 20 21 22 23

24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

L L L

L L L L L L L

L L41 42 43 44 45 46 47 48 49 50

Figure 11 The Gridmix job has 50 jobs in total The first 13 jobs are medium size The jobs from 14 to 23 are small The small job and largejob are alternate arrived from job 26 to job 40 The large jobs appear consecutive form job 41 to job 45 and the last 5 jobs are medium

Table 5The configuration details for BBP including start digit enddigit and the workload per map task

Job size Start Digit ndigit WorkloadmapSmall 1 25000 156283174Medium 1 55000 151332558Large 1 70000 122287914

Except for the job size proportion arrival interval is alsovery important in generating a job arrival trace since it maylead to different schedule behaviors and subsequently affectsystem throughput In our experiment we set the arrivalinterval to 60 s and 120 s Figure 11 gives the generated jobsequences

Finally we ran the workload described above undertwo schedulers LATE (Hadoop default scheduler FIFO) andRiskI We submitted each job by only one user

721 Result for BBP Our detailed configuration of BBP jobswas shown in Table 5 Figure 12 plotted the execution time ofeach job with interval 60 s

For ease of analysis we divided the job sequence into 5parts The first part is from job 1 to job 13 where the jobs aremedium The second part is from job 14 to job 23 The jobsin the second part are all small Job 24 to job 50 is the thirdpart mainly consists of small and large jobs The last 10 jobsare equally divided into two parts and the jobs are large andmedium

In the first part M the response time of jobs is accumu-lated from the beginning because the workload is becomingheavier gradually The resource contention led to morestragglers and more speculative executions More resourceswere used for backup task The following jobs had to waituntil resources were released However because of lack ofdelicate design of scheduling most backup tasks failed Thesystem sacrificed the throughput but did not get benefitFrom Figure 12(b) we can see that there are 48 backup tasksand only 19 succeeded Compared to LATE RiskI cut thenumber of backup task from 48 to 13 The static result was

plotted in Figure 12(c) Many invalid speculative executionsare avoided and saved the throughput Since some backuptasks are certainly finished earlier we killed the first attemptof task immediately in order to save resources As a resultonly 2 backup tasks failed The same conditions occurred inthe left 4 parts

However the situation is a little different when the jobarrival interval is 120 s We also plotted the execution timeof each job with interval 120 s in Figure 13 There are littledifferent exits in the first 4 parts The reason is that thesystem workload is light because each job has 120 s beforethe next job arrives The system has much free resourcesfor backup task running We can see from Figure 13(b) thatno backup task succeeded in the first two parts Even if thebackup task failed it will not cause any loss But in RiskI nobackup task was launched for speculative execution whichwas plotted in Figure 13(c) Because none of them wouldsucceed most differences lied in the part L where we reducedthe speculative execution count from 39 to 12 and only 1speculative execution failed

As a result our scheduler performs much better ininterval 60 than interval 120

722 Result for Streamsort We evaluated the job of stream-sort by running configured gridmix2 also But we have tomodify the Hadoop 0210 for different arrival interval Theperformance of sort is also improved better when workloadis heavy since RiskI not only shortened the response timeof job but also killed some tasks to release resource earlierWhen workload is light our scheduler only takes advantageof reducing job response time

Another conclusion is that the average performance ofBBP is better than sort The main reason is that compared toBBP prediction of task for streamsort has longer range whichmeans more uncertainty

73 Impact of Heterogeneous Degree We change the testbedand Hadoop configurations to evaluate RiskI in differentexecution environments We run our experiments in twoother different environments The original environmentabove is denoted as testbed 1 And the new environments

12 The Scientific World Journal

1200

900

600

300

01 6 11 16 21 26 31 36 41 46

LATERiskI

Tim

e (s)

Job index

M S S L L M

(a) Execution time for 50 jobs

80

60

40

20

0

M S S L L M

SucceededFailed

(b) Speculative count for LATE

M S S L L M

SucceededFailed

Killed

25

20

15

10

5

0

(c) Speculative count for RiskI

Figure 12 Details of execution information with interval 60 (a) is static of execution time for each job (b and c) are statistics of speculativeexecution count in different parts

were denoted as testbeds 2 and 3 The testbed 3 is 4times more heterogeneous than testbed 2 and 8 timesthan testbed 1 We measure heterogeneity by its numberof virtual machines on the same physical machine Theresult was shown in Figure 14 When the environment ismore heterogeneous RiskI perform degrades much less thanLATE

7.4. Stability of Prediction. Since we make our prediction by searching for similar executions of a task in history, we have to demonstrate that similar executions have stable response times. Obviously, a stable response time is what makes our prediction meaningful; otherwise, our prediction is useless for scheduling. Therefore, we conducted the following experiments to validate our prediction.
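As context for these experiments, the sketch below shows the kind of history lookup such a prediction performs: pick the most similar past execution under a weighted Euclidean distance and reuse its observed response-time range. The profile features and weights here are illustrative assumptions, not the exact profile format used by RiskI.

```python
import math

# Sketch of profile-based range prediction: the nearest historical
# execution (weighted Euclidean distance over profile features)
# supplies the predicted [min, max] response time.

def distance(p, q, weights):
    return math.sqrt(sum(w * (p[k] - q[k]) ** 2 for k, w in weights.items()))

def predict_range(profile, history, weights):
    # history: list of (profile, observed runtimes) for finished tasks
    _, runtimes = min(history, key=lambda h: distance(profile, h[0], weights))
    return min(runtimes), max(runtimes)

weights = {"input_mb": 1.0, "co_running_tasks": 2.0}    # assumed features
history = [
    ({"input_mb": 64, "co_running_tasks": 3}, [22.1, 25.4, 23.8]),
    ({"input_mb": 128, "co_running_tasks": 1}, [41.0, 43.2, 42.5]),
]
print(predict_range({"input_mb": 70, "co_running_tasks": 3}, history, weights))
# -> (22.1, 25.4): the range of the most similar past execution
```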

First, we selected a work node with four slots, which implies that at most four tasks can run on the node in parallel. We then investigated the task running time under all the possible execution conditions. These conditions consist of various combinations of background tasks selected from wordcount, randomwrite, and Bailey-Borwein-Plouffe (BBP), giving 9 conditions in total. Figure 15(a) shows our results for the running time of a task from a BBP job, where the task ran 20 times under every condition. We can see that the execution time is relatively stable within a range. For further analysis, Figure 4(a) shows the CDF of the response time distribution between the longest and the shortest time in one condition.

Figure 13: Details of execution information with interval 120 s. (a) shows the execution time of each of the 50 jobs under LATE and RiskI; (b) and (c) show the speculative execution counts in the different parts for LATE and RiskI, respectively.

It demonstrates that most response times are concentrated around the center of the distribution. The results for a task from wordcount, shown in Figures 15(b) and 4(b), confirm the same conclusion.
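A stability check of this kind can be reproduced with a short script: gather the repeated runtimes measured for one condition, report the observed range, and build the empirical CDF that the figures plot. The runtimes below are made-up placeholders, not measurements from the paper.

```python
# Sketch of the stability analysis: range and empirical CDF of the
# response times collected for one execution condition.

def empirical_cdf(samples):
    xs = sorted(samples)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

runtimes = [23.1, 22.8, 24.0, 23.5, 22.9, 23.3]  # placeholder samples
lo, hi = min(runtimes), max(runtimes)
print(f"observed range: [{lo}, {hi}] s ({(hi - lo) / lo:.1%} spread)")
for t, p in empirical_cdf(runtimes):
    print(f"P(time <= {t:.1f} s) = {p:.2f}")
```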

Figure 16 shows how the prediction evolves as the number of executions increases. We selected the two conditions under which the prediction is most stable and most unstable, and plotted the evolution of the prediction for each.

8. Conclusion

We propose a framework named RiskI for task assignment in a fault-tolerant data processing system. We suggest that, by finding the most similar task in history, a profile-based method can predict the execution time of a task as a time range. RiskI then makes a risk-aware assignment for the task, seeking more profit while reducing the risk introduced by uncertainty. Extensive evaluations have shown that RiskI provides good performance in all conditions and performs much better when the system exhibits more heterogeneity. We believe that the performance of RiskI can be improved further with a deeper understanding of system behavior and a better similarity algorithm.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Figure 14: CDFs of execution time under LATE and RiskI in the different testbeds: (a) testbed 1; (b) testbed 2; (c) testbed 3.

Figure 15: Execution time prediction (time versus background index): (a) task prediction for BBP; (b) task prediction for wordcount.

Figure 16: Prediction evolution (time versus number of executions): (a) best case; (b) worst case.

Acknowledgments

This paper is supported by NSFC nos. 61272483 and 61202430 and by National Lab Fund no. KJ-12-06.

References

[1] K. V. Vishwanath and N. Nagappan, "Characterizing cloud computing hardware reliability," in Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC '10), pp. 193–203, Citeseer, June 2010.

[2] P. Gill, N. Jain, and N. Nagappan, "Understanding network failures in data centers: measurement, analysis, and implications," ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 350–361, 2011.

[3] Q. Zheng, "Improving MapReduce fault tolerance in the cloud," in Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, Workshops and PhD Forum (IPDPSW '10), pp. 1–6, IEEE, April 2010.

[4] Q. Zhang, L. Cheng, and R. Boutaba, "Cloud computing: state-of-the-art and research challenges," Journal of Internet Services and Applications, vol. 1, no. 1, pp. 7–18, 2010.

[5] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[6] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, pp. 29–42, 2008.

[7] G. Ananthanarayanan, S. Kandula, A. Greenberg, et al., "Reining in the outliers in map-reduce clusters using Mantri," in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, pp. 1–16, USENIX Association, 2010.

[8] B. Hindman, A. Konwinski, M. Zaharia, et al., "Mesos: a platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, pp. 22–22, USENIX Association, 2011.

[9] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," ACM SIGOPS Operating Systems Review, vol. 41, no. 3, pp. 59–72, 2007.

[10] M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10–10, USENIX Association, 2010.

[11] A. Snavely, N. Wolter, and L. Carrington, "Modeling application performance by convolving machine signatures with application profiles," in Proceedings of the IEEE International Workshop on Workload Characterization (WWC-4 '01), pp. 149–156, IEEE, 2001.

[12] R. Uhlig, G. Neiger, D. Rodgers, et al., "Intel virtualization technology," Computer, vol. 38, no. 5, pp. 48–56, 2005.

[13] P. Jalote, Fault Tolerance in Distributed Systems, Prentice-Hall, 1994.

[14] A. Salman, I. Ahmad, and S. Al-Madani, "Particle swarm optimization for task assignment problem," Microprocessors and Microsystems, vol. 26, no. 8, pp. 363–371, 2002.

[15] M. Frank and P. Wolfe, "An algorithm for quadratic programming," Naval Research Logistics Quarterly, vol. 3, no. 1-2, pp. 95–110, 2006.

[16] R. Jain, D. Chiu, and W. Hawe, A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer System, Eastern Research Laboratory, Digital Equipment Corporation, 1984.

[17] S. Balsamo, A. Di Marco, P. Inverardi, and M. Simeoni, "Model-based performance prediction in software development: a survey," IEEE Transactions on Software Engineering, vol. 30, no. 5, pp. 295–310, 2004.

[18] L. Kleinrock, Queueing Systems, Volume 1: Theory, Wiley-Interscience, 1975.

[19] E. Lazowska, J. Zahorjan, G. Graham, and K. Sevcik, Quantitative System Performance: Computer System Analysis Using Queueing Network Models, Prentice-Hall, 1984.

[20] K. Kant and M. Srinivasan, Introduction to Computer System Performance Evaluation, McGraw-Hill College, 1992.

[21] K. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, Wiley, New York, NY, USA, 2002.

[22] J. Banks and J. Carson, Discrete-Event System Simulation, Prentice-Hall, 1984.

[23] M. Marsan, G. Balbo, and G. Conte, Performance Models of Multiprocessor Systems, MIT Press, 1986.

[24] F. Baccelli, G. Balbo, R. Boucherie, J. Campos, and G. Chiola, "Annotated bibliography on stochastic Petri nets," in Performance Evaluation of Parallel and Distributed Systems: Solution Methods, vol. 105 of CWI Tract, 1994.

[25] M. Harchol-Balter, "Task assignment with unknown duration," Journal of the ACM, vol. 49, no. 2, pp. 260–288, 2002.

[26] D. W. Pentico, "Assignment problems: a golden anniversary survey," European Journal of Operational Research, vol. 176, no. 2, pp. 774–793, 2007.

[27] C. Shen and W. Tsai, "A graph matching approach to optimal task assignment in distributed computing systems using a minimax criterion," IEEE Transactions on Computers, vol. 100, no. 3, pp. 197–203, 1985.

[28] M. Harchol-Balter, M. E. Crovella, and C. D. Murta, "On choosing a task assignment policy for a distributed server system," Journal of Parallel and Distributed Computing, vol. 59, no. 2, pp. 204–228, 1999.

[29] R. I. Davis and A. Burns, "A survey of hard real-time scheduling for multiprocessor systems," ACM Computing Surveys, vol. 43, no. 4, article 35, 2011.

[30] Y. Jiang, X. Shen, J. Chen, and R. Tripathi, "Analysis and approximation of optimal co-scheduling on chip multiprocessors," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT '08), pp. 220–229, ACM, October 2008.

[31] R. Bunt, D. Eager, G. Oster, and C. Williamson, "Achieving load balance and effective caching in clustered web servers," in Proceedings of the 4th International Web Caching Workshop, April 1999.

[32] P. B. Godfrey and I. Stoica, "Heterogeneity and load balance in distributed hash tables," in Proceedings of the IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM '05), vol. 1, pp. 596–606, IEEE, March 2005.

[33] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling," in Proceedings of the 5th European Conference on Computer Systems (EuroSys '10), pp. 265–278, ACM, April 2010.

[34] http://wiki.apache.org/hadoop/WordCount.

[35] A. Verma, L. Cherkasova, and R. H. Campbell, "ARIA: automatic resource inference and allocation for MapReduce environments," in Proceedings of the 8th ACM International Conference on Autonomic Computing, pp. 235–244, ACM, June 2011.

[36] J. de Leeuw and S. Pruzansky, "A new computational method to fit the weighted Euclidean distance model," Psychometrika, vol. 43, no. 4, pp. 479–490, 1978.

[37] http://hadoop.apache.org/docs/mapreduce/current/gridmix.html.


Page 11: Research Article Risk Intelligence: Making Profit from ...downloads.hindawi.com/journals/tswj/2014/398235.pdf · scale data processing systems such as Mesos [ ], Dryad [ ], and Spark

The Scientific World Journal 11

M M M M M M M M M M M M M

M

M M M M M

S S

S S S S S S S S S

S S S S S S S S

1 2 3 4 5 6 7 8 9 10 11 12 13

14 15 16 17 18 19 20 21 22 23

24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

L L L

L L L L L L L

L L41 42 43 44 45 46 47 48 49 50

Figure 11 The Gridmix job has 50 jobs in total The first 13 jobs are medium size The jobs from 14 to 23 are small The small job and largejob are alternate arrived from job 26 to job 40 The large jobs appear consecutive form job 41 to job 45 and the last 5 jobs are medium

Table 5The configuration details for BBP including start digit enddigit and the workload per map task

Job size Start Digit ndigit WorkloadmapSmall 1 25000 156283174Medium 1 55000 151332558Large 1 70000 122287914

Except for the job size proportion arrival interval is alsovery important in generating a job arrival trace since it maylead to different schedule behaviors and subsequently affectsystem throughput In our experiment we set the arrivalinterval to 60 s and 120 s Figure 11 gives the generated jobsequences

Finally we ran the workload described above undertwo schedulers LATE (Hadoop default scheduler FIFO) andRiskI We submitted each job by only one user

721 Result for BBP Our detailed configuration of BBP jobswas shown in Table 5 Figure 12 plotted the execution time ofeach job with interval 60 s

For ease of analysis we divided the job sequence into 5parts The first part is from job 1 to job 13 where the jobs aremedium The second part is from job 14 to job 23 The jobsin the second part are all small Job 24 to job 50 is the thirdpart mainly consists of small and large jobs The last 10 jobsare equally divided into two parts and the jobs are large andmedium

In the first part M the response time of jobs is accumu-lated from the beginning because the workload is becomingheavier gradually The resource contention led to morestragglers and more speculative executions More resourceswere used for backup task The following jobs had to waituntil resources were released However because of lack ofdelicate design of scheduling most backup tasks failed Thesystem sacrificed the throughput but did not get benefitFrom Figure 12(b) we can see that there are 48 backup tasksand only 19 succeeded Compared to LATE RiskI cut thenumber of backup task from 48 to 13 The static result was

plotted in Figure 12(c) Many invalid speculative executionsare avoided and saved the throughput Since some backuptasks are certainly finished earlier we killed the first attemptof task immediately in order to save resources As a resultonly 2 backup tasks failed The same conditions occurred inthe left 4 parts

However the situation is a little different when the jobarrival interval is 120 s We also plotted the execution timeof each job with interval 120 s in Figure 13 There are littledifferent exits in the first 4 parts The reason is that thesystem workload is light because each job has 120 s beforethe next job arrives The system has much free resourcesfor backup task running We can see from Figure 13(b) thatno backup task succeeded in the first two parts Even if thebackup task failed it will not cause any loss But in RiskI nobackup task was launched for speculative execution whichwas plotted in Figure 13(c) Because none of them wouldsucceed most differences lied in the part L where we reducedthe speculative execution count from 39 to 12 and only 1speculative execution failed

As a result our scheduler performs much better ininterval 60 than interval 120

722 Result for Streamsort We evaluated the job of stream-sort by running configured gridmix2 also But we have tomodify the Hadoop 0210 for different arrival interval Theperformance of sort is also improved better when workloadis heavy since RiskI not only shortened the response timeof job but also killed some tasks to release resource earlierWhen workload is light our scheduler only takes advantageof reducing job response time

Another conclusion is that the average performance ofBBP is better than sort The main reason is that compared toBBP prediction of task for streamsort has longer range whichmeans more uncertainty

73 Impact of Heterogeneous Degree We change the testbedand Hadoop configurations to evaluate RiskI in differentexecution environments We run our experiments in twoother different environments The original environmentabove is denoted as testbed 1 And the new environments

12 The Scientific World Journal

1200

900

600

300

01 6 11 16 21 26 31 36 41 46

LATERiskI

Tim

e (s)

Job index

M S S L L M

(a) Execution time for 50 jobs

80

60

40

20

0

M S S L L M

SucceededFailed

(b) Speculative count for LATE

M S S L L M

SucceededFailed

Killed

25

20

15

10

5

0

(c) Speculative count for RiskI

Figure 12 Details of execution information with interval 60 (a) is static of execution time for each job (b and c) are statistics of speculativeexecution count in different parts

were denoted as testbeds 2 and 3 The testbed 3 is 4times more heterogeneous than testbed 2 and 8 timesthan testbed 1 We measure heterogeneity by its numberof virtual machines on the same physical machine Theresult was shown in Figure 14 When the environment ismore heterogeneous RiskI perform degrades much less thanLATE

74 Stable of Prediction Since we make our prediction bysearching similar executions of task in history we have todemonstrate that the similar executions have stable responsetime It is obvious that stable response time will make ourprediction significant Otherwise our prediction is helpless

for scheduling Therefore we did the following experimentsto proof our prediction

First we selected a four-slot work node which impliesat most four tasks that can run on the node in parallelAnd then we investigated the task running time under allthe possible execution conditions These conditions consistof various combinations of tasks which are selected fromwordcount randomwrite and Bailey-Borwein-Plouffe Sothere are 9 conditions in total Figure 15(a) shows our resultson running time of task from BBP job that the task ran 20times in every environment We can see that the executiontime is relatively stable in a range time For further analysisFigure 4(a) shows the CDF of response time distributionbetween the longest time and the shortest time in one

The Scientific World Journal 13

01 6 11 16 21 26 31 36 41 46

Tim

e (s)

Job index

M S S L L M

LATERiskI

500

400

300

200

100

(a) Execution time for 50 jobs

80

60

40

20

0

M S S L L M

SucceededFailed

100

(b) Speculative count for LATE

M S S L L M

SucceededFailed

Killed

15

10

5

0

(c) Speculative count for RiskI

Figure 13 Details of execution information with interval 120 (a) is static of execution time for each job (b and c) are statistics of speculativeexecution count in different parts

condition It demonstrated that most times are centralizeddistribution The result of task from wordcount is shownin Figures 15(b) and 4(b) The conclusion can be alsocertified

We showed the prediction evolution with the execu-tion times increased in Figure 16 We selected two con-ditions where the prediction is most stable and mostunstable The evolutionary process of prediction is plot-ted

8 Conclusion

We propose a framework named RiskI for task assignmentin a fault tolerance data processing system We suggest thatby finding the most similar task in history profile-based

method can predict the execution time with a range of timeThen it makes risk-aware assignment for task to seek moreprofit while reducing the risk introduced by uncertaintyExtensive evaluations have shown that RiskI can providegood performance in all conditions And RiskI performsmuch better when the system puts upmore heterogeneityWebelieve that the performance of RiskI will improve further ifwe understand the behavior of system and find better similaralgorithm

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

14 The Scientific World Journal

0 250 500 750 1000

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(a) Testbed 1

0 250 500 750 1000

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(b) Testbed 2

0 250 500 750 1000

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(c) Testbed 3

Figure 14 CDFs of execution time in different testbeds

70

56

42

28

14

0 2 4 6 8 10 12

Tim

e (s)

Background index

(a) Task prediction for BBP

60

48

36

24

12

00 2 4 6 8 10 12

Tim

e (s)

Background index

(b) Task prediction for wordcount

Figure 15 Execution time prediction

The Scientific World Journal 15

29

27

25

23

210 4 8 12 16 20

Tim

e (s)

Execution times

(a) Best case

0 4 8 12 16 20

Tim

e (s)

Execution times

65

60

55

50

45

(b) Worst case

Figure 16 Prediction evolution

Acknowledgments

This paper is supported by NSFC no 61272483 and no61202430 and National Lab Fund no KJ-12-06

References

[1] K V Vishwanath and N Nagappan ldquoCharacterizing cloudcomputing hardware reliabilityrdquo in Proceedings of the 1st ACMSymposium on Cloud Computing (SoCC rsquo10) pp 193ndash203Citeseer June 2010

[2] P Gill N Jain and N Nagappan ldquoUnderstanding network fail-ures in data centers measurement analysis and implicationsrdquoACMSIGCOMMComputer Communication Review vol 41 no4 pp 350ndash361 2011

[3] Q Zheng ldquoImprovingMapReduce fault tolerance in the cloudrdquoin Proceedings of the IEEE International Symposium on Par-allel and Distributed Processing Workshops and Phd Forum(IPDPSW rsquo10) pp 1ndash6 IEEE April 2010

[4] Q Zhang L Cheng and R Boutaba ldquoCloud computing state-of-the-art and research challengesrdquo Journal of Internet Servicesand Applications vol 1 no 1 pp 7ndash18 2010

[5] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[6] M Zaharia A Konwinski A Joseph R Katz and I StoicaldquoImproving mapreduce performance in heterogeneous envi-ronmentsrdquo in Proceedings of the 8th USENIX Conference onOperating Systems Design and Implementation pp 29ndash42 2008

[7] G Ananthanarayanan S Kandula A Greenberg et al ldquoReiningin the outliers in map-reduce clusters using mantrirdquo in Proceed-ings of the 9th USENIX Conference on Operating Systems Designand Implementation pp 1ndash16 USENIX Association 2010

[8] B Hindman A Konwinski M Zaharia et al ldquoMesos aplatform for fine-grained resource sharing in the data centerrdquo inProceedings of the 8thUSENIXConference onNetworked SystemsDesign and Implementation pp 22ndash22 USENIX Association2011

[9] M Isard M Budiu Y Yu A Birrell and D Fetterly ldquoDryaddistributed data-parallel programs from sequential buildingblocksrdquo ACM SIGOPS Operating Systems Review vol 41 no 3pp 59ndash72 2007

[10] M Zaharia M Chowdhury M Franklin S Shenker andI Stoica ldquoSpark cluster computing with working setsrdquo inProceedings of the 2nd USENIX Conference on Hot Topics inCloud Computing pp 10ndash10 USENIX Association 2010

[11] A Snavely NWolter and L Carrington ldquoModeling applicationperformance by convolving machine signatures with applica-tion profilesrdquo in Proceedings of the IEEE InternationalWorkshoponWorkload Characterization (WWC-4 rsquo01) pp 149ndash156 IEEE2001

[12] R Uhlig G Neiger D Rodgers et al ldquoIntel virtualizationtechnologyrdquo Computer vol 38 no 5 pp 48ndash56 2005

[13] P Jalote Fault Tolerance in Distributed Systems Prentice-Hall1994

[14] A Salman I Ahmad and S Al-Madani ldquoParticle swarmoptimization for task assignment problemrdquoMicroprocessors andMicrosystems vol 26 no 8 pp 363ndash371 2002

[15] M Frank and P Wolfe ldquoAn algorithm for quadratic program-mingrdquoNaval Research Logistics Quarterly vol 3 no 1-2 pp 95ndash110 2006

[16] R Jain D Chiu and W Hawe A Quantitative Measure ofFairness and Discrimination for Resource Allocation in SharedComputer System Eastern Research Laboratory Digital Equip-ment Corporation 1984

[17] S Balsamo A DiMarco P Inverardi andM Simeoni ldquoModel-based performance prediction in software development asurveyrdquo IEEE Transactions on Software Engineering vol 30 no5 pp 295ndash310 2004

[18] L Kleinrock Queueing Systems Volume 1 Theory Wiley-Interscience 1975

[19] E Lazowska J Zahorjan G Graham and K Sevcik Quan-titative System Performance Computer System Analysis UsingQueueing Network Models Prentice-Hall 1984

[20] K Kant and M Srinivasan Introduction to Computer SystemPerformance Evaluation McGraw-Hill College 1992

[21] K Trivedi Probability and Statistics with Reliability Queuingand Computer Science Applications Wiley New York NY USA2002

[22] J Banks and J Carson Event System Simulation Prentice-Hall1984

[23] M Marsan G Balbo and G Conte Performance Models ofMultiprocessor Systems MIT Press 1986

[24] F Baccelli G Balbo R Boucherie J Campos and G ChiolaldquoAnnotated bibliography on stochastic petri netsrdquo in Perfor-mance Evaluation of Parallel and Distributed Systems SolutionMethods vol 105 CWI Tract 1994

[25] M Harchol-Balter ldquoTask assignment with unknown durationrdquoJournal of the ACM vol 49 no 2 pp 260ndash288 2002

[26] D W Pentico ldquoAssignment problems a golden anniversarysurveyrdquo European Journal of Operational Research vol 176 no2 pp 774ndash793 2007

16 The Scientific World Journal

[27] C Shen and W Tsai ldquoA graph matching approach to optimaltask assignment in distributed computing systems using aminimax criterionrdquo IEEE Transactions on Computers vol 100no 3 pp 197ndash203 1985

[28] M Harchol-Balter M E Crovella and C D Murta ldquoOnchoosing a task assignment policy for a distributed serversystemrdquo Journal of Parallel and Distributed Computing vol 59no 2 pp 204ndash228 1999

[29] R I Davis and A Burns ldquoA survey of hard real-time schedulingfor multiprocessor systemsrdquo ACM Computing Surveys vol 43no 4 article 35 2011

[30] Y Jiang X Shen J Chen and R Tripathi ldquoAnalysis and approx-imation of optimal co-scheduling on chip multiprocessorsrdquoin Proceedings of the 17th International Conference on ParallelArchitectures and Compilation Techniques (PACT rsquo08) pp 220ndash229 ACM October 2008

[31] R Bunt D Eager G Oster and C Williamson ldquoAchievingload balance and effective caching in clustered web serversrdquoin Proceedings of the 4th International Web Caching WorkshopApril 1999

[32] P B Godfrey and I Stoica ldquoHeterogeneity and load balance indistributed hash tablesrdquo in Proceedings of the IEEE 24th AnnualJoint Conference of the IEEE Computer and CommunicationsSocieties (INFOCOM rsquo05) vol 1 pp 596ndash606 IEEE March2005

[33] M Zaharia D Borthakur J Sen Sarma K Elmeleegy SShenker and I Stoica ldquoDelay scheduling a simple techniquefor achieving locality and fairness in cluster schedulingrdquo in Pro-ceedings of the 5th European Conference on Computer Systems(EuroSys rsquo10) pp 265ndash278 ACM April 2010

[34] httpwikiapacheorghadoopWordCount[35] A Verma L Cherkasova and R H Campbell ldquoARIA auto-

matic resource inference and allocation formapreduce environ-mentsrdquo in Proceedings of the 8th ACM International Conferenceon Autonomic Computing pp 235ndash244 ACM June 2011

[36] J de Leeuw and S Pruzansky ldquoA new computationalmethod tofit the weighted euclidean distance modelrdquo Psychometrika vol43 no 4 pp 479ndash490 1978

[37] httphadoopapacheorgdocsmapreducecurrentgridmixhtml

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 12: Research Article Risk Intelligence: Making Profit from ...downloads.hindawi.com/journals/tswj/2014/398235.pdf · scale data processing systems such as Mesos [ ], Dryad [ ], and Spark

12 The Scientific World Journal

1200

900

600

300

01 6 11 16 21 26 31 36 41 46

LATERiskI

Tim

e (s)

Job index

M S S L L M

(a) Execution time for 50 jobs

80

60

40

20

0

M S S L L M

SucceededFailed

(b) Speculative count for LATE

M S S L L M

SucceededFailed

Killed

25

20

15

10

5

0

(c) Speculative count for RiskI

Figure 12 Details of execution information with interval 60 (a) is static of execution time for each job (b and c) are statistics of speculativeexecution count in different parts

were denoted as testbeds 2 and 3 The testbed 3 is 4times more heterogeneous than testbed 2 and 8 timesthan testbed 1 We measure heterogeneity by its numberof virtual machines on the same physical machine Theresult was shown in Figure 14 When the environment ismore heterogeneous RiskI perform degrades much less thanLATE

74 Stable of Prediction Since we make our prediction bysearching similar executions of task in history we have todemonstrate that the similar executions have stable responsetime It is obvious that stable response time will make ourprediction significant Otherwise our prediction is helpless

for scheduling Therefore we did the following experimentsto proof our prediction

First we selected a four-slot work node which impliesat most four tasks that can run on the node in parallelAnd then we investigated the task running time under allthe possible execution conditions These conditions consistof various combinations of tasks which are selected fromwordcount randomwrite and Bailey-Borwein-Plouffe Sothere are 9 conditions in total Figure 15(a) shows our resultson running time of task from BBP job that the task ran 20times in every environment We can see that the executiontime is relatively stable in a range time For further analysisFigure 4(a) shows the CDF of response time distributionbetween the longest time and the shortest time in one

The Scientific World Journal 13

01 6 11 16 21 26 31 36 41 46

Tim

e (s)

Job index

M S S L L M

LATERiskI

500

400

300

200

100

(a) Execution time for 50 jobs

80

60

40

20

0

M S S L L M

SucceededFailed

100

(b) Speculative count for LATE

M S S L L M

SucceededFailed

Killed

15

10

5

0

(c) Speculative count for RiskI

Figure 13 Details of execution information with interval 120 (a) is static of execution time for each job (b and c) are statistics of speculativeexecution count in different parts

condition It demonstrated that most times are centralizeddistribution The result of task from wordcount is shownin Figures 15(b) and 4(b) The conclusion can be alsocertified

We showed the prediction evolution with the execu-tion times increased in Figure 16 We selected two con-ditions where the prediction is most stable and mostunstable The evolutionary process of prediction is plot-ted

8 Conclusion

We propose a framework named RiskI for task assignmentin a fault tolerance data processing system We suggest thatby finding the most similar task in history profile-based

method can predict the execution time with a range of timeThen it makes risk-aware assignment for task to seek moreprofit while reducing the risk introduced by uncertaintyExtensive evaluations have shown that RiskI can providegood performance in all conditions And RiskI performsmuch better when the system puts upmore heterogeneityWebelieve that the performance of RiskI will improve further ifwe understand the behavior of system and find better similaralgorithm

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

14 The Scientific World Journal

0 250 500 750 1000

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(a) Testbed 1

0 250 500 750 1000

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(b) Testbed 2

0 250 500 750 1000

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(c) Testbed 3

Figure 14 CDFs of execution time in different testbeds

70

56

42

28

14

0 2 4 6 8 10 12

Tim

e (s)

Background index

(a) Task prediction for BBP

60

48

36

24

12

00 2 4 6 8 10 12

Tim

e (s)

Background index

(b) Task prediction for wordcount

Figure 15 Execution time prediction

The Scientific World Journal 15

29

27

25

23

210 4 8 12 16 20

Tim

e (s)

Execution times

(a) Best case

0 4 8 12 16 20

Tim

e (s)

Execution times

65

60

55

50

45

(b) Worst case

Figure 16 Prediction evolution

Acknowledgments

This paper is supported by NSFC no 61272483 and no61202430 and National Lab Fund no KJ-12-06

References

[1] K V Vishwanath and N Nagappan ldquoCharacterizing cloudcomputing hardware reliabilityrdquo in Proceedings of the 1st ACMSymposium on Cloud Computing (SoCC rsquo10) pp 193ndash203Citeseer June 2010

[2] P Gill N Jain and N Nagappan ldquoUnderstanding network fail-ures in data centers measurement analysis and implicationsrdquoACMSIGCOMMComputer Communication Review vol 41 no4 pp 350ndash361 2011

[3] Q Zheng ldquoImprovingMapReduce fault tolerance in the cloudrdquoin Proceedings of the IEEE International Symposium on Par-allel and Distributed Processing Workshops and Phd Forum(IPDPSW rsquo10) pp 1ndash6 IEEE April 2010

[4] Q Zhang L Cheng and R Boutaba ldquoCloud computing state-of-the-art and research challengesrdquo Journal of Internet Servicesand Applications vol 1 no 1 pp 7ndash18 2010

[5] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[6] M Zaharia A Konwinski A Joseph R Katz and I StoicaldquoImproving mapreduce performance in heterogeneous envi-ronmentsrdquo in Proceedings of the 8th USENIX Conference onOperating Systems Design and Implementation pp 29ndash42 2008

[7] G Ananthanarayanan S Kandula A Greenberg et al ldquoReiningin the outliers in map-reduce clusters using mantrirdquo in Proceed-ings of the 9th USENIX Conference on Operating Systems Designand Implementation pp 1ndash16 USENIX Association 2010

[8] B Hindman A Konwinski M Zaharia et al ldquoMesos aplatform for fine-grained resource sharing in the data centerrdquo inProceedings of the 8thUSENIXConference onNetworked SystemsDesign and Implementation pp 22ndash22 USENIX Association2011

[9] M Isard M Budiu Y Yu A Birrell and D Fetterly ldquoDryaddistributed data-parallel programs from sequential buildingblocksrdquo ACM SIGOPS Operating Systems Review vol 41 no 3pp 59ndash72 2007

[10] M Zaharia M Chowdhury M Franklin S Shenker andI Stoica ldquoSpark cluster computing with working setsrdquo inProceedings of the 2nd USENIX Conference on Hot Topics inCloud Computing pp 10ndash10 USENIX Association 2010

[11] A Snavely NWolter and L Carrington ldquoModeling applicationperformance by convolving machine signatures with applica-tion profilesrdquo in Proceedings of the IEEE InternationalWorkshoponWorkload Characterization (WWC-4 rsquo01) pp 149ndash156 IEEE2001

[12] R Uhlig G Neiger D Rodgers et al ldquoIntel virtualizationtechnologyrdquo Computer vol 38 no 5 pp 48ndash56 2005

[13] P Jalote Fault Tolerance in Distributed Systems Prentice-Hall1994

[14] A Salman I Ahmad and S Al-Madani ldquoParticle swarmoptimization for task assignment problemrdquoMicroprocessors andMicrosystems vol 26 no 8 pp 363ndash371 2002

[15] M Frank and P Wolfe ldquoAn algorithm for quadratic program-mingrdquoNaval Research Logistics Quarterly vol 3 no 1-2 pp 95ndash110 2006

[16] R Jain D Chiu and W Hawe A Quantitative Measure ofFairness and Discrimination for Resource Allocation in SharedComputer System Eastern Research Laboratory Digital Equip-ment Corporation 1984

[17] S Balsamo A DiMarco P Inverardi andM Simeoni ldquoModel-based performance prediction in software development asurveyrdquo IEEE Transactions on Software Engineering vol 30 no5 pp 295ndash310 2004

[18] L Kleinrock Queueing Systems Volume 1 Theory Wiley-Interscience 1975

[19] E Lazowska J Zahorjan G Graham and K Sevcik Quan-titative System Performance Computer System Analysis UsingQueueing Network Models Prentice-Hall 1984

[20] K Kant and M Srinivasan Introduction to Computer SystemPerformance Evaluation McGraw-Hill College 1992

[21] K Trivedi Probability and Statistics with Reliability Queuingand Computer Science Applications Wiley New York NY USA2002

[22] J Banks and J Carson Event System Simulation Prentice-Hall1984

[23] M Marsan G Balbo and G Conte Performance Models ofMultiprocessor Systems MIT Press 1986

[24] F Baccelli G Balbo R Boucherie J Campos and G ChiolaldquoAnnotated bibliography on stochastic petri netsrdquo in Perfor-mance Evaluation of Parallel and Distributed Systems SolutionMethods vol 105 CWI Tract 1994

[25] M Harchol-Balter ldquoTask assignment with unknown durationrdquoJournal of the ACM vol 49 no 2 pp 260ndash288 2002

[26] D W Pentico ldquoAssignment problems a golden anniversarysurveyrdquo European Journal of Operational Research vol 176 no2 pp 774ndash793 2007

16 The Scientific World Journal

[27] C Shen and W Tsai ldquoA graph matching approach to optimaltask assignment in distributed computing systems using aminimax criterionrdquo IEEE Transactions on Computers vol 100no 3 pp 197ndash203 1985

[28] M Harchol-Balter M E Crovella and C D Murta ldquoOnchoosing a task assignment policy for a distributed serversystemrdquo Journal of Parallel and Distributed Computing vol 59no 2 pp 204ndash228 1999

[29] R I Davis and A Burns ldquoA survey of hard real-time schedulingfor multiprocessor systemsrdquo ACM Computing Surveys vol 43no 4 article 35 2011

[30] Y Jiang X Shen J Chen and R Tripathi ldquoAnalysis and approx-imation of optimal co-scheduling on chip multiprocessorsrdquoin Proceedings of the 17th International Conference on ParallelArchitectures and Compilation Techniques (PACT rsquo08) pp 220ndash229 ACM October 2008

[31] R Bunt D Eager G Oster and C Williamson ldquoAchievingload balance and effective caching in clustered web serversrdquoin Proceedings of the 4th International Web Caching WorkshopApril 1999

[32] P B Godfrey and I Stoica ldquoHeterogeneity and load balance indistributed hash tablesrdquo in Proceedings of the IEEE 24th AnnualJoint Conference of the IEEE Computer and CommunicationsSocieties (INFOCOM rsquo05) vol 1 pp 596ndash606 IEEE March2005

[33] M Zaharia D Borthakur J Sen Sarma K Elmeleegy SShenker and I Stoica ldquoDelay scheduling a simple techniquefor achieving locality and fairness in cluster schedulingrdquo in Pro-ceedings of the 5th European Conference on Computer Systems(EuroSys rsquo10) pp 265ndash278 ACM April 2010

[34] httpwikiapacheorghadoopWordCount[35] A Verma L Cherkasova and R H Campbell ldquoARIA auto-

matic resource inference and allocation formapreduce environ-mentsrdquo in Proceedings of the 8th ACM International Conferenceon Autonomic Computing pp 235ndash244 ACM June 2011

[36] J de Leeuw and S Pruzansky ldquoA new computationalmethod tofit the weighted euclidean distance modelrdquo Psychometrika vol43 no 4 pp 479ndash490 1978

[37] httphadoopapacheorgdocsmapreducecurrentgridmixhtml

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 13: Research Article Risk Intelligence: Making Profit from ...downloads.hindawi.com/journals/tswj/2014/398235.pdf · scale data processing systems such as Mesos [ ], Dryad [ ], and Spark

The Scientific World Journal 13

01 6 11 16 21 26 31 36 41 46

Tim

e (s)

Job index

M S S L L M

LATERiskI

500

400

300

200

100

(a) Execution time for 50 jobs

80

60

40

20

0

M S S L L M

SucceededFailed

100

(b) Speculative count for LATE

M S S L L M

SucceededFailed

Killed

15

10

5

0

(c) Speculative count for RiskI

Figure 13 Details of execution information with interval 120 (a) is static of execution time for each job (b and c) are statistics of speculativeexecution count in different parts

condition It demonstrated that most times are centralizeddistribution The result of task from wordcount is shownin Figures 15(b) and 4(b) The conclusion can be alsocertified

We showed the prediction evolution with the execu-tion times increased in Figure 16 We selected two con-ditions where the prediction is most stable and mostunstable The evolutionary process of prediction is plot-ted

8 Conclusion

We propose a framework named RiskI for task assignmentin a fault tolerance data processing system We suggest thatby finding the most similar task in history profile-based

method can predict the execution time with a range of timeThen it makes risk-aware assignment for task to seek moreprofit while reducing the risk introduced by uncertaintyExtensive evaluations have shown that RiskI can providegood performance in all conditions And RiskI performsmuch better when the system puts upmore heterogeneityWebelieve that the performance of RiskI will improve further ifwe understand the behavior of system and find better similaralgorithm

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

14 The Scientific World Journal

0 250 500 750 1000

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(a) Testbed 1

0 250 500 750 1000

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(b) Testbed 2

0 250 500 750 1000

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(c) Testbed 3

Figure 14 CDFs of execution time in different testbeds

70

56

42

28

14

0 2 4 6 8 10 12

Tim

e (s)

Background index

(a) Task prediction for BBP

60

48

36

24

12

00 2 4 6 8 10 12

Tim

e (s)

Background index

(b) Task prediction for wordcount

Figure 15 Execution time prediction

The Scientific World Journal 15

29

27

25

23

210 4 8 12 16 20

Tim

e (s)

Execution times

(a) Best case

0 4 8 12 16 20

Tim

e (s)

Execution times

65

60

55

50

45

(b) Worst case

Figure 16 Prediction evolution

Acknowledgments

This paper is supported by NSFC no 61272483 and no61202430 and National Lab Fund no KJ-12-06

References

[1] K V Vishwanath and N Nagappan ldquoCharacterizing cloudcomputing hardware reliabilityrdquo in Proceedings of the 1st ACMSymposium on Cloud Computing (SoCC rsquo10) pp 193ndash203Citeseer June 2010

[2] P Gill N Jain and N Nagappan ldquoUnderstanding network fail-ures in data centers measurement analysis and implicationsrdquoACMSIGCOMMComputer Communication Review vol 41 no4 pp 350ndash361 2011

[3] Q Zheng ldquoImprovingMapReduce fault tolerance in the cloudrdquoin Proceedings of the IEEE International Symposium on Par-allel and Distributed Processing Workshops and Phd Forum(IPDPSW rsquo10) pp 1ndash6 IEEE April 2010

[4] Q Zhang L Cheng and R Boutaba ldquoCloud computing state-of-the-art and research challengesrdquo Journal of Internet Servicesand Applications vol 1 no 1 pp 7ndash18 2010

[5] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[6] M Zaharia A Konwinski A Joseph R Katz and I StoicaldquoImproving mapreduce performance in heterogeneous envi-ronmentsrdquo in Proceedings of the 8th USENIX Conference onOperating Systems Design and Implementation pp 29ndash42 2008

[7] G Ananthanarayanan S Kandula A Greenberg et al ldquoReiningin the outliers in map-reduce clusters using mantrirdquo in Proceed-ings of the 9th USENIX Conference on Operating Systems Designand Implementation pp 1ndash16 USENIX Association 2010

[8] B Hindman A Konwinski M Zaharia et al ldquoMesos aplatform for fine-grained resource sharing in the data centerrdquo inProceedings of the 8thUSENIXConference onNetworked SystemsDesign and Implementation pp 22ndash22 USENIX Association2011

[9] M Isard M Budiu Y Yu A Birrell and D Fetterly ldquoDryaddistributed data-parallel programs from sequential buildingblocksrdquo ACM SIGOPS Operating Systems Review vol 41 no 3pp 59ndash72 2007

[10] M Zaharia M Chowdhury M Franklin S Shenker andI Stoica ldquoSpark cluster computing with working setsrdquo inProceedings of the 2nd USENIX Conference on Hot Topics inCloud Computing pp 10ndash10 USENIX Association 2010

[11] A Snavely NWolter and L Carrington ldquoModeling applicationperformance by convolving machine signatures with applica-tion profilesrdquo in Proceedings of the IEEE InternationalWorkshoponWorkload Characterization (WWC-4 rsquo01) pp 149ndash156 IEEE2001

[12] R Uhlig G Neiger D Rodgers et al ldquoIntel virtualizationtechnologyrdquo Computer vol 38 no 5 pp 48ndash56 2005

[13] P Jalote Fault Tolerance in Distributed Systems Prentice-Hall1994

[14] A Salman I Ahmad and S Al-Madani ldquoParticle swarmoptimization for task assignment problemrdquoMicroprocessors andMicrosystems vol 26 no 8 pp 363ndash371 2002

[15] M Frank and P Wolfe ldquoAn algorithm for quadratic program-mingrdquoNaval Research Logistics Quarterly vol 3 no 1-2 pp 95ndash110 2006

[16] R Jain D Chiu and W Hawe A Quantitative Measure ofFairness and Discrimination for Resource Allocation in SharedComputer System Eastern Research Laboratory Digital Equip-ment Corporation 1984

[17] S Balsamo A DiMarco P Inverardi andM Simeoni ldquoModel-based performance prediction in software development asurveyrdquo IEEE Transactions on Software Engineering vol 30 no5 pp 295ndash310 2004

[18] L Kleinrock Queueing Systems Volume 1 Theory Wiley-Interscience 1975

[19] E Lazowska J Zahorjan G Graham and K Sevcik Quan-titative System Performance Computer System Analysis UsingQueueing Network Models Prentice-Hall 1984

[20] K Kant and M Srinivasan Introduction to Computer SystemPerformance Evaluation McGraw-Hill College 1992

[21] K Trivedi Probability and Statistics with Reliability Queuingand Computer Science Applications Wiley New York NY USA2002

[22] J Banks and J Carson Event System Simulation Prentice-Hall1984

[23] M Marsan G Balbo and G Conte Performance Models ofMultiprocessor Systems MIT Press 1986

[24] F Baccelli G Balbo R Boucherie J Campos and G ChiolaldquoAnnotated bibliography on stochastic petri netsrdquo in Perfor-mance Evaluation of Parallel and Distributed Systems SolutionMethods vol 105 CWI Tract 1994

[25] M Harchol-Balter ldquoTask assignment with unknown durationrdquoJournal of the ACM vol 49 no 2 pp 260ndash288 2002

[26] D W Pentico ldquoAssignment problems a golden anniversarysurveyrdquo European Journal of Operational Research vol 176 no2 pp 774ndash793 2007

16 The Scientific World Journal

[27] C Shen and W Tsai ldquoA graph matching approach to optimaltask assignment in distributed computing systems using aminimax criterionrdquo IEEE Transactions on Computers vol 100no 3 pp 197ndash203 1985

[28] M Harchol-Balter M E Crovella and C D Murta ldquoOnchoosing a task assignment policy for a distributed serversystemrdquo Journal of Parallel and Distributed Computing vol 59no 2 pp 204ndash228 1999

[29] R I Davis and A Burns ldquoA survey of hard real-time schedulingfor multiprocessor systemsrdquo ACM Computing Surveys vol 43no 4 article 35 2011

[30] Y Jiang X Shen J Chen and R Tripathi ldquoAnalysis and approx-imation of optimal co-scheduling on chip multiprocessorsrdquoin Proceedings of the 17th International Conference on ParallelArchitectures and Compilation Techniques (PACT rsquo08) pp 220ndash229 ACM October 2008

[31] R Bunt D Eager G Oster and C Williamson ldquoAchievingload balance and effective caching in clustered web serversrdquoin Proceedings of the 4th International Web Caching WorkshopApril 1999

[32] P B Godfrey and I Stoica ldquoHeterogeneity and load balance indistributed hash tablesrdquo in Proceedings of the IEEE 24th AnnualJoint Conference of the IEEE Computer and CommunicationsSocieties (INFOCOM rsquo05) vol 1 pp 596ndash606 IEEE March2005

[33] M Zaharia D Borthakur J Sen Sarma K Elmeleegy SShenker and I Stoica ldquoDelay scheduling a simple techniquefor achieving locality and fairness in cluster schedulingrdquo in Pro-ceedings of the 5th European Conference on Computer Systems(EuroSys rsquo10) pp 265ndash278 ACM April 2010

[34] httpwikiapacheorghadoopWordCount[35] A Verma L Cherkasova and R H Campbell ldquoARIA auto-

matic resource inference and allocation formapreduce environ-mentsrdquo in Proceedings of the 8th ACM International Conferenceon Autonomic Computing pp 235ndash244 ACM June 2011

[36] J de Leeuw and S Pruzansky ldquoA new computationalmethod tofit the weighted euclidean distance modelrdquo Psychometrika vol43 no 4 pp 479ndash490 1978

[37] httphadoopapacheorgdocsmapreducecurrentgridmixhtml

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 14: Research Article Risk Intelligence: Making Profit from ...downloads.hindawi.com/journals/tswj/2014/398235.pdf · scale data processing systems such as Mesos [ ], Dryad [ ], and Spark

14 The Scientific World Journal

0 250 500 750 1000

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(a) Testbed 1

0 250 500 750 1000

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(b) Testbed 2

0 250 500 750 1000

1

08

06

04

02

0

Time (s)

CDF

LATERiskI

(c) Testbed 3

Figure 14 CDFs of execution time in different testbeds

70

56

42

28

14

0 2 4 6 8 10 12

Tim

e (s)

Background index

(a) Task prediction for BBP

60

48

36

24

12

00 2 4 6 8 10 12

Tim

e (s)

Background index

(b) Task prediction for wordcount

Figure 15 Execution time prediction

The Scientific World Journal 15

29

27

25

23

210 4 8 12 16 20

Tim

e (s)

Execution times

(a) Best case

0 4 8 12 16 20

Tim

e (s)

Execution times

65

60

55

50

45

(b) Worst case

Figure 16 Prediction evolution

Acknowledgments

This paper is supported by NSFC no 61272483 and no61202430 and National Lab Fund no KJ-12-06

References

[1] K. V. Vishwanath and N. Nagappan, "Characterizing cloud computing hardware reliability," in Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC '10), pp. 193–203, June 2010.
[2] P. Gill, N. Jain, and N. Nagappan, "Understanding network failures in data centers: measurement, analysis, and implications," ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 350–361, 2011.
[3] Q. Zheng, "Improving MapReduce fault tolerance in the cloud," in Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, Workshops and PhD Forum (IPDPSW '10), pp. 1–6, IEEE, April 2010.
[4] Q. Zhang, L. Cheng, and R. Boutaba, "Cloud computing: state-of-the-art and research challenges," Journal of Internet Services and Applications, vol. 1, no. 1, pp. 7–18, 2010.
[5] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[6] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, pp. 29–42, 2008.
[7] G. Ananthanarayanan, S. Kandula, A. Greenberg et al., "Reining in the outliers in map-reduce clusters using Mantri," in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, pp. 1–16, USENIX Association, 2010.
[8] B. Hindman, A. Konwinski, M. Zaharia et al., "Mesos: a platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, pp. 22–22, USENIX Association, 2011.
[9] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," ACM SIGOPS Operating Systems Review, vol. 41, no. 3, pp. 59–72, 2007.
[10] M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10–10, USENIX Association, 2010.
[11] A. Snavely, N. Wolter, and L. Carrington, "Modeling application performance by convolving machine signatures with application profiles," in Proceedings of the IEEE International Workshop on Workload Characterization (WWC-4 '01), pp. 149–156, IEEE, 2001.
[12] R. Uhlig, G. Neiger, D. Rodgers et al., "Intel virtualization technology," Computer, vol. 38, no. 5, pp. 48–56, 2005.
[13] P. Jalote, Fault Tolerance in Distributed Systems, Prentice-Hall, 1994.
[14] A. Salman, I. Ahmad, and S. Al-Madani, "Particle swarm optimization for task assignment problem," Microprocessors and Microsystems, vol. 26, no. 8, pp. 363–371, 2002.
[15] M. Frank and P. Wolfe, "An algorithm for quadratic programming," Naval Research Logistics Quarterly, vol. 3, no. 1-2, pp. 95–110, 1956.
[16] R. Jain, D. Chiu, and W. Hawe, A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer System, Eastern Research Laboratory, Digital Equipment Corporation, 1984.
[17] S. Balsamo, A. Di Marco, P. Inverardi, and M. Simeoni, "Model-based performance prediction in software development: a survey," IEEE Transactions on Software Engineering, vol. 30, no. 5, pp. 295–310, 2004.
[18] L. Kleinrock, Queueing Systems, Volume 1: Theory, Wiley-Interscience, 1975.
[19] E. Lazowska, J. Zahorjan, G. Graham, and K. Sevcik, Quantitative System Performance: Computer System Analysis Using Queueing Network Models, Prentice-Hall, 1984.
[20] K. Kant and M. Srinivasan, Introduction to Computer System Performance Evaluation, McGraw-Hill College, 1992.
[21] K. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Wiley, New York, NY, USA, 2002.
[22] J. Banks and J. Carson, Discrete-Event System Simulation, Prentice-Hall, 1984.
[23] M. Marsan, G. Balbo, and G. Conte, Performance Models of Multiprocessor Systems, MIT Press, 1986.
[24] F. Baccelli, G. Balbo, R. Boucherie, J. Campos, and G. Chiola, "Annotated bibliography on stochastic Petri nets," in Performance Evaluation of Parallel and Distributed Systems: Solution Methods, vol. 105 of CWI Tract, 1994.
[25] M. Harchol-Balter, "Task assignment with unknown duration," Journal of the ACM, vol. 49, no. 2, pp. 260–288, 2002.
[26] D. W. Pentico, "Assignment problems: a golden anniversary survey," European Journal of Operational Research, vol. 176, no. 2, pp. 774–793, 2007.
[27] C. Shen and W. Tsai, "A graph matching approach to optimal task assignment in distributed computing systems using a minimax criterion," IEEE Transactions on Computers, vol. 100, no. 3, pp. 197–203, 1985.
[28] M. Harchol-Balter, M. E. Crovella, and C. D. Murta, "On choosing a task assignment policy for a distributed server system," Journal of Parallel and Distributed Computing, vol. 59, no. 2, pp. 204–228, 1999.
[29] R. I. Davis and A. Burns, "A survey of hard real-time scheduling for multiprocessor systems," ACM Computing Surveys, vol. 43, no. 4, article 35, 2011.
[30] Y. Jiang, X. Shen, J. Chen, and R. Tripathi, "Analysis and approximation of optimal co-scheduling on chip multiprocessors," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT '08), pp. 220–229, ACM, October 2008.
[31] R. Bunt, D. Eager, G. Oster, and C. Williamson, "Achieving load balance and effective caching in clustered web servers," in Proceedings of the 4th International Web Caching Workshop, April 1999.
[32] P. B. Godfrey and I. Stoica, "Heterogeneity and load balance in distributed hash tables," in Proceedings of the IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM '05), vol. 1, pp. 596–606, IEEE, March 2005.
[33] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling," in Proceedings of the 5th European Conference on Computer Systems (EuroSys '10), pp. 265–278, ACM, April 2010.
[34] http://wiki.apache.org/hadoop/WordCount.
[35] A. Verma, L. Cherkasova, and R. H. Campbell, "ARIA: automatic resource inference and allocation for MapReduce environments," in Proceedings of the 8th ACM International Conference on Autonomic Computing, pp. 235–244, ACM, June 2011.
[36] J. de Leeuw and S. Pruzansky, "A new computational method to fit the weighted Euclidean distance model," Psychometrika, vol. 43, no. 4, pp. 479–490, 1978.
[37] http://hadoop.apache.org/docs/mapreduce/current/gridmix.html.
