

1045-9219 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPDS.2017.2650988, IEEE Transactions on Parallel and Distributed Systems


CLAP: Component-level Approximate Processing for Low Tail Latency and High Result Accuracy in Cloud Online Services

Rui Han, Siguang Huang, Zhentao Wang, Jianfeng Zhan

Abstract—Modern latency-critical online services such as search engines often process requests by consulting large input data spanning massive parallel components. Hence the tail latency of these components determines the service latency. To trade off result accuracy for tail latency reduction, existing techniques use the components responding before a specified deadline to produce approximate results. However, they skip a large proportion of components when load gets heavier, thus incurring large accuracy losses. In this paper, we propose CLAP to enable component-level approximate processing of requests for low tail latency and small accuracy losses. CLAP aggregates information of input data to create small aggregated data points. Using these points, CLAP reduces the latency variance of parallel components and allows them to produce initial results quickly; CLAP also identifies the parts of the input data most related to requests' result accuracies, and first uses these parts to improve the produced results so as to minimize accuracy losses. We evaluated CLAP using real services and datasets. The results show: (i) CLAP reduces tail latency by 6.46 times with accuracy losses of 2.2% compared to existing exact processing techniques; (ii) at the same latency, CLAP reduces accuracy losses by 31.58 times compared to existing approximate processing techniques.

Index Terms—Cloud online services, tail latency, result accuracy, component-level approximate processing, aggregated data points


1 INTRODUCTION

Providing quick responsiveness (within 100ms) in request processing is crucial for today's online services such as e-commerce sites and web search engines, as their potential profits are closely tied to service latency (request response time) [24], [35], which includes both the request queueing delay and the processing time. This paper focuses on a wide class of highly parallel services, in which the processing of each request needs to consult a large input dataset by parallelizing sub-operations across hundreds or thousands of service components. Each component needs to process a subset of the input dataset to produce a result, and hence the tail latency (e.g. the 99.9th percentile latency) of these components determines the overall service latency [22], [44]. Example services are: (1) at an e-commerce site, a user-based collaborative filtering (CF) recommender system predicts an active user's rating on an unknown item (product) by scanning millions of existing ratings from similar-minded users in a user-item rating matrix [49]; (2) a web search engine uses an inverted index to organize millions of web pages. For each query, the search engine returns the web pages that are most related to the query by calculating the pages' similarity scores to the query words (terms) and ranking them in descending order according to their scores; and (3) other examples include handwritten digit

• Rui Han, Zhentao Wang, and Jianfeng Zhan (corresponding author) are with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. Siguang Huang is with the School of Software, Tsinghua University, Beijing, China. E-mail: [email protected], [email protected], {wangzhentao, zhanjianfeng}@ict.ac.cn. This work is supported by the National Natural Science Foundation of China (Grant No. 61502451) and the National Key Research and Development Plan of China (Grant No. 2016YFB1000601).

recognition, financial and data analysis, and online machine translation and speech recognition [19], [41].

When delivering online services in a cloud platform, service providers usually have limited budgets, namely limited resources, to maintain the quality of service (QoS) requirements of their services. Hence under resource and response time constraints, a widely applied solution is to produce approximate results in request processing in order to trade off result accuracy (correctness) for service latency reduction [35], [22], [41], [59]. For example, in CF-based recommender systems and search engines, the result accuracies are the errors between predicted and actual ratings and the proportion of the actual top k web pages (that is, the top 10 pages that represent the best answers to the query terms) in the returned top k pages [19], respectively. As a small degree of inaccuracy cannot be evidently perceived and is thus tolerable by service users [41], [16], efficiently and successfully applying such approximate processing requires reducing component tail latency without incurring large accuracy losses compared to the accuracy of exact results.

This task is especially difficult for highly parallel services deployed in a cloud environment, in which service components hosted across different nodes usually have large performance variance. This variance stems from both hardware and software causes [22] as well as from frequently changing performance interference from co-located workloads such as short-running MapReduce jobs [18], [29]. Furthermore, such performance variance is significantly amplified by request queuing delays when considering service load variations, thus incurring high component tail latency [44]. Existing techniques reduce tail latency by using results only from the part of the components responding before a specified deadline to produce approximate results [32], [35], [33], [23],


[59], [22]. However, they do not address issues relating to reducing individual components' latencies themselves. This means that under heavy loads, these techniques have to skip results from a large proportion of slow components to maintain low tail latency. These skipped results may cause large accuracy losses because processing the input data on all components potentially contributes to result accuracy.

In this paper, we argue that to provide small accuracy losses while maintaining low component tail latency in cloud online services, we need to perform approximate processing on all individual components, instead of only using results from quick components to produce approximate results. To this end, this paper proposes CLAP, a framework to enable fine-grained approximate processing at the component level. The basic approach taken by CLAP is to pre-create small aggregated data points, each of which aggregates the information of a part of similar data points in the input data. At runtime, CLAP maintains low tail latency by dynamically allocating aggregated data points among multiple components of different performance to reduce their latency variance, and by allowing all components to produce initial approximate results quickly using the allocated points. At the same time, CLAP minimizes result accuracy losses by using aggregated data points to estimate the correlations between different parts of the input data and arbitrary requests' result accuracies, thus first using the most accuracy-related parts of the input data to improve the produced results. Note that the proposed framework is not intended to replace, but rather to complement, the existing tail latency reduction techniques based on producing exact results (see Table 2). CLAP also differs from traditional techniques that pre-compute structures (e.g. samples or wavelets) of input data based on past query templates and use these structures to answer certain types of requests with both accuracy and latency bounds [16]. In contrast, CLAP needs no prior knowledge about the requests to be processed and it can support arbitrary requests in services.

Our approach is proposed for online services under the assumption that the small proportion of critical input data that contributes to large parts of result accuracy can be efficiently identified and balanced at the individual component level. To demonstrate the effectiveness of our approach, we implement it and modify two online services, namely a CF-based recommender system [10] and a web search engine [2], to adapt their request processing using CLAP. These services represent two typical types of applications that fall in the scope of our assumption: (1) the recommender system uses numeric datasets as input data, in which the critical data is the users with high weights (correlations) to active users; (2) the search engine uses text datasets as input data, in which the critical data is the web pages with high scores. The evaluation results using real-world datasets in both services show that by processing the generated aggregated data points, the parts of input data with higher estimated correlations are indeed more related to different requests' result accuracy. We further compare CLAP against existing tail latency reduction techniques, using both the synthetic workload in the recommender system and the realistic search engine workload derived from a real-world user query log in the Sogou search engine [14]. The evaluation results show: (i) compared to the request reissue technique

based on producing exact results [35], [22], [50], CLAP reduces the component tail latency by an average of 6.46 times with small accuracy losses of 2.2%; (ii) compared to the partial execution technique based on producing approximate results [32], [35], [33], [23], [59], [22], CLAP reduces the accuracy loss by 31.58 times when using the same latency.

The remainder of this paper is organized as follows: Section 2 introduces our approach and Section 3 presents its implementation. Section 4 evaluates the proposed approach. Section 5 discusses the related work, and finally, Section 6 concludes the paper.

2 CLAP

In this section, we first present an overview of the CLAP framework in Section 2.1, followed by an explanation of its modules in Sections 2.2, 2.3, and 2.4.

2.1 Overview

For an online service, CLAP enables component-level approximate processing using both offline and online modules, as shown in Figure 1.

Offline synopsis management. Two modules are responsible for pre-creating and updating aggregated data points in the offline mode. Suppose the service's input data consists of m subsets; the synopsis creation module organizes each subset using a proper data structure to transform it into an index file and a synopsis. The synopsis consists of multiple aggregated data points, each of which aggregates the information of multiple original input data points that have similar feature values. For each synopsis, an index file records the mapping relationship between each aggregated data point and the original data points aggregated by it. Note that the synopsis creation process is only applied once, and the synopsis updating module periodically updates the created synopses in an incremental fashion to keep pace with input data changes while the service is running.

Online component-level approximate processing. Based on the synopses, the two online modules are designed to provide low tail latency and high result accuracy in request processing. First, the predictive input data allocator is applied on a group of n parallel components deployed across different nodes. The allocator reduces the service time variance of these components at run-time by collecting the resource contentions on them using the resource monitors and periodically updating the data points (including both aggregated data points and their corresponding original data points) allocated to them. In an allocation, components suffering from less resource contention are predicted to have quicker service times and are therefore allocated larger numbers of points to reduce component latency variance.

Secondly, for each request, the accuracy-aware approximate processor is applied on each component to produce a result in two stages. The first stage produces an initial approximate result using the aggregated points allocated to the component. The number of aggregated points is sufficiently small that this production process incurs only a low latency even when handling heavy loads. By processing the aggregated points, this stage also estimates the correlations


Fig. 1. The Overview of CLAP

between different parts of the input data and the request's result accuracy. The second stage then iteratively improves the produced result within a specified service deadline. The most accuracy-related parts are first used in the improvement to minimize the request's accuracy loss.

2.2 Offline Synopsis Management

The basic idea of synopsis creation is to group similar data points in a subset of input data and store their aggregated information in a synopsis to preserve data similarity. To this end, an R-tree is used in synopsis creation and updating for three reasons. First, an R-tree can effectively structure data points in a bottom-up fashion, and points close in feature values are allocated to the same node in tree construction. Second, an R-tree is a depth-balanced tree, which means the nodes at the same depth of the tree have a similar approximation level to the subset and the nodes at different levels represent the subset at multiple levels of granularity. Third, an R-tree is an index structure that supports dynamic insertion and deletion of leaf nodes, thus enabling the incremental updating of an existing synopsis. Based on the R-tree, the synopsis creation process has three steps.

Step 1. Dimensionality reduction of the subset. As R-trees work effectively in low-dimensional spaces, this step employs the singular value decomposition (SVD) dimensionality reduction technique to transform the subset into a low-dimensional and dense dataset. SVD can transform a u × v dataset into a u × j dataset where j is much smaller than v, while minimizing the difference (distance) between the two datasets. CLAP uses the incremental SVD [26], whose execution time is independent of the dataset size, and hence the transformation process can be completed quickly (within a few seconds) even when dealing with large-scale datasets. Note that the above step works on numeric datasets. A text dataset such as a collection of web pages first needs to be transformed into a numeric dataset, in which each numeric data point extracts the feature values of its corresponding text data. For example, a web page can be transformed into a numeric data point whose attributes are all the words in the collection of web pages and the value of an attribute is the number of occurrences of the corresponding word in the page.
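To make Step 1 concrete, the following is a minimal sketch of a gradient-descent matrix factorization in the style of incremental SVD, training one latent dimension at a time; the class name, learning rate, and initialization are illustrative assumptions rather than details taken from [26].

```java
// Minimal sketch (assumed details): factorize a u-by-v matrix T into
// P (u-by-j) and Q (v-by-j) by gradient descent, one latent dimension
// at a time; P is the low-dimensional dataset passed to Step 2.
public final class IncrementalSvdSketch {
    public static double[][] reduce(double[][] t, int j, int iters, double lr) {
        int u = t.length, v = t[0].length;
        double[][] p = new double[u][j], q = new double[v][j];
        java.util.Random rnd = new java.util.Random(42);
        for (double[] row : p) for (int f = 0; f < j; f++) row[f] = 0.1 * rnd.nextGaussian();
        for (double[] row : q) for (int f = 0; f < j; f++) row[f] = 0.1 * rnd.nextGaussian();
        for (int f = 0; f < j; f++) {            // train dimension f
            for (int it = 0; it < iters; it++) { // e.g. 100 iterations per dimension
                for (int r = 0; r < u; r++) {
                    for (int c = 0; c < v; c++) {
                        double pred = 0;
                        for (int g = 0; g <= f; g++) pred += p[r][g] * q[c][g];
                        double err = t[r][c] - pred;   // residual for this entry
                        double pOld = p[r][f];
                        p[r][f] += lr * err * q[c][f]; // gradient step on dimension f
                        q[c][f] += lr * err * pOld;
                    }
                }
            }
        }
        return p; // the u-by-j dense, low-dimensional dataset
    }
}
```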

Step 2. Similar data points organization. This step operates on the low-dimensional dataset and groups similar data points in it by constructing an R-tree. It then outputs an index file by selecting a depth of R-tree nodes that decides the mapping relationship between aggregated and original data points. In the R-tree, a node enclosing multiple original data points corresponds to an aggregated data point, and all the nodes at one depth of the tree correspond to the aggregated data points in the synopsis.

Step 3. Information aggregation of original data points. According to the index file, the final step obtains each aggregated data point's corresponding original data points (without feature reduction) and aggregates their information to generate the synopsis. Depending on the type of dataset, there are two ways to perform such aggregation. (1) For a numeric dataset, the aggregated information can be the mean of the original data points' feature values. For example, in CF-based recommender systems, suppose an aggregated user (data point) corresponds to a set U of original users, in which a subset Ui ⊆ U of users have rated an item i. The aggregated user's rating on item i is the average rating on i over the users in set Ui. (2) For a text dataset, the aggregated information can be the merged information of multiple text data points. For example, in a search engine, suppose an aggregated web page corresponds to a set of web pages (data points); the aggregated page then contains all the contents of these pages.
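For the numeric case, a sketch of this aggregation might look as follows; the index-file and rating representations (plain maps) are our own illustrative assumption, not the paper's data structures.

```java
// Sketch: build a synopsis for a numeric (user-item rating) subset.
// indexFile: aggregated-point id -> ids of the original users it covers;
// ratings:   original user id -> (item id -> rating).
import java.util.*;

final class SynopsisAggregation {
    static Map<Integer, Map<Integer, Double>> aggregate(
            Map<Integer, List<Integer>> indexFile,
            Map<Integer, Map<Integer, Double>> ratings) {
        Map<Integer, Map<Integer, Double>> synopsis = new HashMap<>();
        indexFile.forEach((adId, userIds) -> {
            Map<Integer, double[]> sumCount = new HashMap<>(); // item -> {sum, count}
            for (int u : userIds)
                ratings.getOrDefault(u, Collections.emptyMap()).forEach((item, r) -> {
                    double[] sc = sumCount.computeIfAbsent(item, k -> new double[2]);
                    sc[0] += r; sc[1] += 1;
                });
            Map<Integer, Double> agg = new HashMap<>();
            sumCount.forEach((item, sc) -> agg.put(item, sc[0] / sc[1])); // mean rating
            synopsis.put(adId, agg);
        });
        return synopsis;
    }
}
```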

Figure 2 shows an example synopsis creation process. Step 1 transforms a 12 × 5 input dataset t into a 12 × 2 dataset t′. We can see that data points with similar feature values (e.g. points d1 and d2) in t still have similar features in t′. Step 2 organizes the 12 data points in t′ by constructing an R-tree, in which similar data points are grouped in the same leaf node. Leaf and non-leaf nodes are then recursively grouped together following the same principle to preserve data similarity. Step 2 selects nodes N5 and N6 to generate an index file. Finally, step 3 creates a synopsis according to the index file. This synopsis consists of two aggregated data points ad1 and ad2, each of which aggregates the information of six original data points.

Note that in synopsis creation, the selection of the R-tree depth decides the number of nodes (i.e. the number of aggregated data points). In CLAP, this number influences space consumption, component latency, and result accuracy of the online request processing. On the one hand, in addition to the original input data, using aggregated data points to produce initial results requires extra memory space to store these points. Hence the number of aggregated data points should be much smaller than the number of original data points, thus reducing the extra space consumption and also guaranteeing the quick production of initial results (that is, low component latency) even when handling large service loads. On the other hand, a sufficient number of aggregated data points allows fine-grained differentiation of the different parts of the input data represented by these points, thus enabling the efficient identification of the most accuracy-related parts and first using these parts to improve result accuracy.

Motivated by the fact that the input data of online services continually changes, synopsis updating is designed to periodically update the existing set of synopses. To minimize the updating overheads, this module detects changes in input data and only updates the synopsis parts influenced by the changes. This updating strategy is built upon the dynamic insertion and deletion of leaf nodes in an R-tree and it considers two situations of input data changes. In the


Fig. 2. An example of offline synopsis creation and updating

first situation, new input data points are added; this module thus adds new leaf nodes to incorporate these points into the R-tree. In the second situation, the feature values or contents of a proportion of existing data points change; this module thus deletes the leaf nodes containing these data points and inserts new leaf nodes to represent the changed points. In both situations, synopsis updating identifies the parts of nodes influenced by the newly inserted leaf nodes and updates their corresponding aggregated data points in the synopsis. For example, Figure 2(d) shows that when a new data point d13 is added, a new leaf node is added to node N5 and only aggregated data point ad1 is updated.
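A sketch of this update path is shown below; the RTree and Synopsis interfaces are hypothetical stand-ins for the R-tree package and the Step-3 aggregation described above, not APIs from the paper.

```java
// Sketch: incremental synopsis updating. Only the aggregated data point
// whose R-tree subtree is touched by an insert/delete is re-generated.
interface RTree {
    int insertLeaf(double[] point);      // returns id of the affected aggregated point
    int deleteLeaf(int originalPointId); // returns id of the affected aggregated point
}
interface Synopsis {
    void regenerate(int aggregatedPointId); // re-run Step-3 aggregation for one point
}

final class SynopsisUpdater {
    // first situation: a new input data point is added
    static void onNewPoint(RTree tree, Synopsis syn, double[] p) {
        syn.regenerate(tree.insertLeaf(p));
    }
    // second situation: an existing data point changes (delete, then re-insert)
    static void onChangedPoint(RTree tree, Synopsis syn, int id, double[] p) {
        syn.regenerate(tree.deleteLeaf(id));
        syn.regenerate(tree.insertLeaf(p));
    }
}
```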

2.3 Predictive input data allocator

Given a set of k aggregated data points and their corresponding subsets of original datasets to be allocated to n components (this information is recorded in the index file), the steps of the predictive input data allocator are detailed in Algorithm 1. At each allocation interval (e.g. 1 minute), the algorithm first collects each component c's resource contention information using the monitors deployed on c's node n (line 1). The monitors detect the usages of shared resources including CPU and I/O bandwidths: (1) multi-core processor usage Uprocessor: the ratio of time spent running instructions on the cores; this resource contention causes degradation of computation performance; (2) disk I/O bandwidth Udisk: measured by the amount of read/write megabytes per second; and (3) network I/O bandwidth Unetwork: measured by the amount of send/receive megabytes per second. The latter two resource contentions cause degradation of communication performance.

In the calculation of component c's resource contention, the monitors consider two factors. The first factor is the monitored resource usages from c's co-running programs on the same node. The second factor is node n's capacities (that is, the maximal resource amounts) of CPU and I/O bandwidths. Formally, let the resource usages

Algorithm 1 Predictive allocation of input data on a group of components

Require:
k: the total number of aggregated data points;
Di: the set of original data points represented by adi;
|Di|: the number of data points in Di;
{ad1, ad2, ..., adk}: the k ranked aggregated data points, where |Di| ≤ |Di+1| (i = 1, ..., k − 1);
n: the number of components belonging to a group;
cj: the jth component (j = 1, ..., n);
Uj: cj's resource contention vector;
RG(Uj): the regression method used to predict cj's relative service time according to Uj;
xj: cj's predicted relative service time;
oj: the ratio of aggregated data points allocated to cj.

1. Monitor all n components' resource contention vectors {U1, U2, ..., Un} once every allocation interval;
2. for (j = 1; j ≤ n; j++) do
3.   Predict component cj's relative service time xj using Uj: xj = RG(Uj);
4. end for
5. Calculate {o1, o2, ..., on} such that x1 × o1 = x2 × o2 = ... = xn × on and o1 + o2 + ... + on = 1;
6. Rank the n components {c1, c2, ..., cn} in ascending order according to their ratios, where oj ≤ oj+1 (j = 1, ..., n − 1);
7. j = 1; // j is the index of components
8. for (i = 1; i ≤ k; i++) do
9.   // i is the index of aggregated data points
10.  while (oj × k < 1 and j < n) do
11.    // the allocation to cj is completed
12.    j++;
13.  end while
14.  Allocate adi and Di to component cj;
15.  oj = oj − 1/k;
16.  j++;
17.  if (j > n) then
18.    j = 1;
19.  end if
20. end for


be Uprocessor, Udisk, and Unetwork, and node n's resource capacities be CAprocessor, CAdisk, and CAnetwork; c's resource contention information is then represented by the vector $U = (\frac{U_{processor}}{CA_{processor}}, \frac{U_{disk}}{CA_{disk}}, \frac{U_{network}}{CA_{network}})$.

Using contention information U, the algorithm uses a regression model RG to predict component c's performance degradation due to the performance interference it suffers from (lines 2 to 4). This degradation is measured by the relative service time x, which is c's service time under resource contention U divided by c's service time without resource contention. The regression model RG describes the relationship between c's contention vector U and its relative service time x: x = RG(U). The training of the model takes a set of v samples {(U1, x1), ..., (Uv, xv)} as input, which can be obtained from profiling runs of the service or its historical running logs.
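A fitted model of this kind can be evaluated cheaply at each allocation interval. The sketch below assumes a second-order polynomial over the three normalized contention dimensions; the coefficient values are placeholders, since the real ones would come from the offline fit.

```java
// Sketch: evaluate x = RG(U) for U = (u_p, u_d, u_n), the normalized
// processor, disk, and network contentions. Coefficients are placeholders.
final class RelativeServiceTimeModel {
    // terms: 1, u_p, u_d, u_n, u_p^2, u_d^2, u_n^2 (placeholder values)
    private final double[] beta = {1.0, 0.9, 0.6, 0.4, 0.8, 0.3, 0.2};

    double predict(double up, double ud, double un) {
        return beta[0] + beta[1] * up + beta[2] * ud + beta[3] * un
             + beta[4] * up * up + beta[5] * ud * ud + beta[6] * un * un;
    }
}
```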

Based on the predicted service times of all n components in a group, the algorithm calculates the ratios of the k (k ≥ n) aggregated data points to be allocated to these components such that a component's ratio is inversely proportional to its relative service time (line 5). That is, a component with a smaller relative service time (that is, higher performance) is allocated a larger ratio of aggregated data points. The calculation of ratios is based on the assumption that the n parallel components run the same processing algorithm and their service time is proportional to the input data size. In CLAP, this size is represented by the number of aggregated data points because processing these points is the mandatory task that must be completed by each component to produce an initial approximate result. After calculating the ratios of all n components, these components are ranked in ascending order according to their ratios (line 6). The k aggregated data points are also ranked in ascending order according to the numbers of the original data points they represent. Finally, the algorithm sequentially allocates the k ranked aggregated data points and their corresponding sets of original data points to the n ranked components (lines 7 to 20). This allocation assigns more aggregated and original data points to components with larger ratios.

Figure 3 shows an example of allocating five aggregated data points to two components c1 and c2 deployed on different nodes. At an allocation interval, the relative service times of c1 and c2 are predicted as 1.8 and 1.2 using the monitored resource contention vectors U1 and U2, respectively. Hence the ratios of input data allocated to c1 and c2 are 0.4 and 0.6, respectively. The allocator then sequentially allocates the five ranked aggregated data points {ad1, ad2, ad3, ad4, ad5} and their corresponding sets {D1, D2, D3, D4, D5} to c1 and c2 using five iterations.
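These ratios follow directly from line 5 of Algorithm 1: $x_1 \times o_1 = x_2 \times o_2$ and $o_1 + o_2 = 1$ give $o_j = \frac{1/x_j}{\sum_{i=1}^{n} 1/x_i}$, hence $o_1 = \frac{1/1.8}{1/1.8 + 1/1.2} = 0.4$ and $o_2 = \frac{1/1.2}{1/1.8 + 1/1.2} = 0.6$.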

The implementation of the input data allocator invokes Algorithm 1 periodically under the assumption that the overheads of one invocation cause only slight interruption to the running service. At each allocation, after deciding the allocation of the k aggregated data points to the n components (the time complexity is O(k × n)), conducting this allocation has two overheads. The first overhead is transmitting the ids of the allocated aggregated and original data points to each component. The second overhead is updating the input data processed by each component. The updating involves three steps: identifying the difference of

Fig. 3. An example of predictive input data allocation

the allocated data points between the last allocation and the current allocation; removing the unallocated data points; and loading the newly allocated data points from disk into memory. To make the allocation algorithm applicable to large services with massive parallel components and input datasets, the proposed framework divides the whole service into groups, thus maintaining small allocation overheads by restricting the size of data points and the number of components in each allocation.

2.4 Accuracy-aware approximate processor

On each component, the steps of accuracy-aware approximate processing are detailed in Algorithm 2. For each request, an initial result ar is first produced using the aggregated data points allocated to the component (line 1). Meanwhile, the algorithm estimates the correlations between different parts of the input data and the request's result accuracy to enable accuracy-aware request processing. This estimation is based on the processing of the aggregated data points (line 1). Processing an aggregated data point adi gives an estimate ci of the correlation between this point and the request's result accuracy. For example, in recommender systems and search engines, the correlations are the weight between an aggregated user and an active user and the similarity score between an aggregated web page and the query terms in a request, respectively. As adi contains the aggregated information of a set Di of original data points with similar feature values, we assume a linear dependency between ci and set Di's correlation to result accuracy. That is, a higher value of ci means a larger accuracy improvement brought by processing the data points in Di. For example, in search engines, a higher similarity score ci means that the original web pages in set Di have higher similarity scores on average. Hence processing the web pages in Di has a higher probability of finding the actual top k web pages and bringing a larger improvement in result accuracy.

Based on the estimated correlations, the algorithm first ranks the aggregated data points (line 2), and then uses the ranking order of each aggregated data point to determine the ranking order of its corresponding set of original data points (line 3). Subsequently, the algorithm sequentially uses the ranked sets to improve result ar (lines 4 to 15). The improvement process iterates under two conditions: (1) the elapsed time lelapse is smaller than the specified deadline ldeadline; (2) the number i of the processed


Algorithm 2 Accuracy-aware approximate processing on a component

Require:
m: the number of aggregated data points allocated to the component;
ci: adi's correlation to the request's result accuracy (1 ≤ i ≤ m);
Di: the set of original data points represented by adi;
ldeadline: the specified deadline of service latency;
lelapse: the elapsed service time since the request submission time;
ar: the approximate result;
εmax: the refinement threshold.

1. Process the m aggregated data points to obtain the initial result ar and the m correlations c1 to cm;
2. Rank the m aggregated data points in descending order according to their correlations to ar's result accuracy;
3. Obtain the ranked sets {D′1, D′2, ..., D′m} according to the ranking orders of the aggregated data points;
4. i = 1; // i is the index of aggregated data points
5. Obtain the current elapsed time lelapse;
6. while (i ≤ εmax) do
7.   for each original data point d ∈ D′i do
8.     Process d to improve ar;
9.     Obtain the current elapsed time lelapse;
10.    if (lelapse ≥ ldeadline) then
11.      Return ar;
12.    end if
13.  end for
14.  i = i + 1;
15. end while
16. Return ar.

sets is smaller than or equal to the refinement threshold εmax, i.e. the maximal number of sets of original data points to be processed in the improvement. The second condition is based on the observation that in some cases, processing the original data points in a proportion of the top ranked sets determines most of the result accuracy. For example, in search engines, the original web pages in the top 40% ranked sets contain over 98% of the actual top 10 web pages for different requests. Hence, the threshold avoids the unnecessary processing of less accuracy-related data points.
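A compact Java rendering of this improvement loop (lines 6 to 15 of Algorithm 2) is sketched below; the Result and DataPoint types are illustrative assumptions.

```java
// Sketch of Algorithm 2's refinement loop: ranked sets of original data
// points improve the initial result until the deadline or the refinement
// threshold is reached.
import java.util.List;

interface DataPoint {}
interface Result { Result refineWith(DataPoint d); }

final class AccuracyAwareRefinement {
    static Result improve(Result ar, List<List<DataPoint>> rankedSets,
                          long deadlineNanos, int epsilonMax) {
        long start = System.nanoTime();
        int bound = Math.min(rankedSets.size(), epsilonMax);
        for (int i = 0; i < bound; i++) {          // most accuracy-related sets first
            for (DataPoint d : rankedSets.get(i)) {
                ar = ar.refineWith(d);             // process one original data point
                if (System.nanoTime() - start >= deadlineNanos)
                    return ar;                     // deadline reached: return early
            }
        }
        return ar; // threshold reached: remaining sets add little accuracy
    }
}
```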

3 IMPLEMENTATION

CLAP is implemented in Java and is currently targeted at services running in cloud infrastructures and Linux environments. Its offline modules are implemented based on open source packages of R-tree and SVD (Section 3.1). Its online modules are incorporated into two typical online services: a recommender system and a search engine (Section 3.2).

3.1 Offline Synopsis Management Modules

CLAP currently supports R-tree based synopsis creation and updating, which operate on a service's input data and are independent of online request processing. First of all, step 1 of the synopsis creation module is implemented based on the incremental SVD method [6]. This step treats the dimensionality reduction process as a gradient descent optimization problem. Suppose a v-dimensional dataset is transformed into a j-dimensional one (v > j); the time complexity of this step is O(j × i), where i is the number of iterations for each dimension. Step 2 is implemented using the standard R-tree package [7]. Given a dataset with k data points, the time complexity of constructing an R-tree is O(k × log k). Finally, step 3 (i.e. information aggregation) is the most computationally expensive step, whose time complexity of generating a synopsis from a k × v dataset is O(k × v).

Once a synopsis is generated, the R-tree and the index file are stored and can be used as the starting point of synopsis updating. To minimize the updating overhead, CLAP uses a low-priority strategy to perform synopsis updating: it monitors the overall resource utilization and triggers the periodic synopsis updating when the resource is underutilized, thus ensuring little interruption to the running service. The updating module detects changes in the original input data or newly arrived data, dynamically updates the R-tree and the index file, and only re-generates the parts of the synopsis affected by the changes in the updated index file.

3.2 Online Component-level Approximate Processing Modules

In CLAP, the resource monitors are implemented for Linux based systems. On a VM, a monitor collects the resource usages of different types (processor cores, and disk and network I/O bandwidths) by accessing the information stored in the /proc filesystem. The monitor obtains the information once every second and writes the average resource usages to a Redis data store [11] once per minute. This measurement method guarantees low monitoring overheads and does not affect the service performance.
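As an illustration of the per-second sampling, the sketch below computes a CPU busy ratio from two /proc/stat readings; disk and network counters (/proc/diskstats, /proc/net/dev) would be sampled analogously, and the Redis write is omitted. The field layout follows the standard /proc/stat format.

```java
// Sketch: one monitor sample of multi-core processor usage from /proc/stat.
import java.nio.file.*;

final class ProcStatMonitor {
    // returns {idle jiffies, total jiffies} from the aggregate "cpu" line
    static long[] readCpu() throws Exception {
        String[] f = Files.readAllLines(Paths.get("/proc/stat"))
                          .get(0).trim().split("\\s+");
        long idle = Long.parseLong(f[4]); // 4th value after "cpu" is idle time
        long total = 0;
        for (int i = 1; i < f.length; i++) total += Long.parseLong(f[i]);
        return new long[]{idle, total};
    }

    static double cpuBusyRatio() throws Exception {
        long[] a = readCpu();
        Thread.sleep(1000);               // one-second sampling interval
        long[] b = readCpu();
        return 1.0 - (double) (b[0] - a[0]) / (b[1] - a[1]);
    }
}
```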

For a component c in a service, the monitored resource contention information is the aggregated resource usage collected from all of c's co-located VMs on the same node. At the end of each allocation interval (1 minute), the predictive input data allocator predicts the service times of all components in a group using their latest resource contention information and enforces the allocation of data points for the next interval. The regression model used in the prediction is constructed from training samples obtained in profiling runs of service components. In these runs, we test the service time of a component when using a synopsis as the input data. The component is co-located with two microbenchmarks that emulate different types and intensities of resource contentions: (1) the Linux stress test tool emulates CPU and disk resource contentions, with a CPU contention doing integer and floating point calculations and a disk contention doing disk reading and writing operations; (2) the Isolation Benchmark Suite (IBS) [5] emulates network resource contentions by doing network sending and receiving operations. Note that only one of all the homogeneous parallel components needs to be profiled, thus avoiding the scalability issue associated with service profiling. Using the training samples, a polynomial regression model is constructed using Matlab R2016a [8].

In addition, incorporating the accuracy-aware approximate processor module into a service does not require any modification of the request processing algorithm, only control of the input dataset (either aggregated or original data


points) fed to the algorithm. The implementation is therefore independent of the type of requests to be processed at runtime. In order to test CLAP with services that use input data of different types, we incorporated its online modules into a CF-based recommender system [10] and a Lucene web search engine [2]. The distributed versions of both services are implemented on Storm [3] (a real-time distributed processing platform); specifically, the enterprise version of Storm, Alibaba JStorm [1], is used. We now introduce the two services.

The CF-based recommender system. In e-commerce sites such as Amazon and eBay, the user-based CF algorithm is a predominant technique applied in many recommender systems [49]. In a CF-based recommendation system, a user-item rating matrix is the input dataset used for storing the users' historical ratings (preference scores) for different products (items).

For a request from an active user u, the system predicts u's rating p(u, i) on a target item i using two steps. The first step calculates the weight (Pearson's correlation coefficient [47]) w(u, v) between user u and any neighborhood user v who has rated the same item i in the matrix. After calculating the weights, the second step generates the prediction of user u's rating on item i by taking a weighted average of all ratings of item i from user u's neighborhood users:

$p(u, i) = \bar{r}_u + \frac{\sum_{v \in I} w(u,v) \times (r_{v,i} - \bar{r}_v)}{\sum_{v \in I} |w(u,v)|}$,

where I is the set of all users that have rated item i, $r_{v,i}$ denotes user v's rating of item i, $\bar{r}_u$ is the average rating of all items rated by user u, and |w(u, v)| denotes the absolute value of the weight w(u, v).
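The two steps map to code along the following lines; the array layout (one rating vector per user, Double.NaN for unrated items) is an illustrative assumption rather than the paper's data structure.

```java
// Sketch: Pearson weight w(u,v) over co-rated items, then the weighted
// average prediction p(u,i) over neighborhood users who rated item i.
final class UserCfSketch {
    static double pearson(double[] ru, double[] rv) {
        double su = 0, sv = 0; int n = 0;
        for (int i = 0; i < ru.length; i++)
            if (!Double.isNaN(ru[i]) && !Double.isNaN(rv[i])) { su += ru[i]; sv += rv[i]; n++; }
        if (n == 0) return 0;
        double mu = su / n, mv = sv / n, num = 0, du = 0, dv = 0;
        for (int i = 0; i < ru.length; i++)
            if (!Double.isNaN(ru[i]) && !Double.isNaN(rv[i])) {
                num += (ru[i] - mu) * (rv[i] - mv);
                du  += (ru[i] - mu) * (ru[i] - mu);
                dv  += (rv[i] - mv) * (rv[i] - mv);
            }
        return (du == 0 || dv == 0) ? 0 : num / Math.sqrt(du * dv);
    }

    static double predict(double[] u, double[][] neighbors, int item) {
        double num = 0, den = 0;
        for (double[] v : neighbors) {
            if (Double.isNaN(v[item])) continue;   // v must have rated item i
            double w = pearson(u, v);
            num += w * (v[item] - mean(v));
            den += Math.abs(w);
        }
        return mean(u) + (den == 0 ? 0 : num / den);
    }

    static double mean(double[] r) {
        double s = 0; int n = 0;
        for (double x : r) if (!Double.isNaN(x)) { s += x; n++; }
        return n == 0 ? 0 : s / n;
    }
}
```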

The Lucene web search engine. In today's Internet, web search engines such as Google, Bing, and Baidu are among the most heavily used web services, and we study the open source Lucene search engine [2] as an example. At the offline web page collection stage, the web crawler crawls web pages and builds the inverted index, which includes a vocabulary containing all the words in the crawled pages. At the online request processing stage, if a query request does not hit the query cache, the search engine scans its index file to find web pages that match the query terms in the request, and then ranks these pages according to their similarity scores to the terms. The service then returns the ranked web pages as the result, in which a small number of top ranked pages (e.g. the top 10 pages) usually stand for the answers to the query terms [19].

4 EXPERIMENTAL EVALUATION

In this section, we first evaluate the CLAP offline modules using large datasets from real services (Section 4.2). We then compare the CLAP online modules against existing tail latency reduction techniques using both synthetic and realistic workloads (Section 4.3). Finally, we discuss each module's influence on tail latency and accuracy in approximate result production (Section 4.4).

4.1 Experimental Settings

Experiment platform. The experiments were conducted on a cluster with 30 nodes. Each node has two 6-core Intel Xeon E5645 processors, 32 GB of DRAM, and eight 1 TB 7200RPM SATA disk drives. The nodes in the cluster are connected through 1 Gb Ethernet network cards. The operating system of both physical machines and VMs is Linux Ubuntu 14.04.1. The KVM and JDK versions are 1.7.91 and 1.7.0, respectively. In the JStorm distribution, the versions of JStorm, Python, and Zookeeper are 0.9.6.3, 2.7.6, and 3.4.6, respectively. The version of the Hadoop distribution is 1.2.1.

Workloads. We test two service workloads based on the implementation of CLAP on the two online services in Section 3.2. For the recommender system workload, the input data is a real and large-scale dataset, MovieLens [9], which contains 138,000 users and their 20 million ratings on different items. For the search engine workload, the input dataset is a collection of 3.6 million web pages released by the Sogou web search engine [12].

Compared techniques. The basic approach without any tail latency reduction techniques and two tail latency reduction techniques widely used in today's online services are compared. (1) Request reissue [35], [22], [50]: if some straggling sub-operations of a request have been executed for more than a high percentile of the expected latency for this class of sub-operations, a copy of each straggling sub-operation is started on a different machine and the result that returns first is used. The percentile is set to the 95th in our evaluations. (2) Partial execution [32], [35], [33], [23], [59], [22]: for each request, this technique only uses the sub-operations that complete before a specified deadline to produce its approximate result and skips the other sub-operations.

Evaluation metrics. Both performance and accuracy metrics are used to evaluate the online services. The performance metric is each request's component tail latency, i.e. the latency of the slowest component among all parallel components. This latency determines the request's response time. The accuracy metric is the percentage of accuracy loss, which denotes the percentage decrease in accuracy of approximate results compared to the accuracy of exact results produced by full computation over the entire input data.

In recommender systems, the accuracy is measured by the root-mean-square error (RMSE) [49], which denotes the error between the predicted and actual values of ratings. Formally, RMSE is a weighted average error that measures the prediction accuracy over all the target items in a test set T:

$RMSE = \sqrt{\frac{\sum_{i \in T} (p(u,i) - r_{u,i})^2}{n_T}}$,

where $n_T$ represents the number of items in set T, p(u, i) is item i's predicted rating and $r_{u,i}$ is its actual rating. In search engines, the accuracy is measured by the proportion of the actual top 10 web pages (i.e. the 10 pages with the highest similarity scores when searching all web pages) in the returned top 10 pages [32], [19].
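Both metrics are straightforward to compute; the sketch below shows one plausible form, with container shapes of our choosing.

```java
// Sketch of the two accuracy metrics: RMSE over a test set and the
// proportion of the actual top-10 pages found in the returned top-10.
import java.util.*;

final class AccuracyMetrics {
    // RMSE = sqrt( sum_{i in T} (p(u,i) - r_{u,i})^2 / n_T )
    static double rmse(double[] predicted, double[] actual) {
        double s = 0;
        for (int i = 0; i < predicted.length; i++)
            s += (predicted[i] - actual[i]) * (predicted[i] - actual[i]);
        return Math.sqrt(s / predicted.length);
    }

    // fraction of the actual top-10 pages present in the returned top-10
    static double top10Overlap(List<String> returnedTop10, Set<String> actualTop10) {
        long hits = returnedTop10.stream().filter(actualTop10::contains).count();
        return (double) hits / actualTop10.size();
    }
}
```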

4.2 Evaluation of Offline Synopsis Management

4.2.1 Overheads of synopsis creation and updating

Evaluation settings. The input dataset in either service is equally divided into 6 subsets. In the recommender system, each subset has approximately 3.33 million ratings. In the search engine, each subset has 0.6 million web pages. For each subset, a synopsis is generated.


Fig. 4. Evaluation of synopsis updating with CLAP

Synopsis creation. We tested the three steps of synopsis generation on one node. At step 1, a subset is transformed into a 3-dimensional dataset. In the SVD transformation, each dimension has 100 iterations. At step 2, the 3-dimensional dataset is organized using an R-tree to generate an index file. In the recommender system, each aggregated user corresponds to an average of 117.64 original users. In the search engine, each aggregated web page corresponds to an average of 42.55 original pages. At step 3, the information of the subset is aggregated to generate a synopsis. For the recommender system and the search engine, a synopsis was created in about 60 seconds and 50 minutes, respectively. Note that the experiments that follow use the above synopses.

Synopsis updating. We designed two categories of scenarios to evaluate how a synopsis is updated under different changes in the input dataset. In the first category, each subset has a proportion of i% new data points (users or web pages) added. In the second category, each subset has a proportion of i% existing data points changed. In each scenario, 10 values of i (i = 1, 2, ..., 10) were tested on one node. Each test was repeated 10 times for consistency and the average is reported in Figure 4. The evaluation results show: (i) all the updating processes completed much faster than the synopsis creation processes; (ii) the first category of synopsis updating completed faster than the second category. This is because in the incremental updating of a synopsis, the first category of scenarios only needs to add new R-tree nodes. In contrast, the second category of scenarios needs to delete existing nodes and add new R-tree nodes, thus leading to longer updating times.

4.2.2 Effectiveness of the Generated Synopses

Evaluation settings. In the CF-based recommender system, the data points are users and the requests are active users. A user's influence on an active user's predicted rating depends on the weight between this user and the active user. Hence an aggregated user's correlation to an active user (request) u's result accuracy is denoted by the weight between this user and u. An original user v is viewed as being highly related to u's result accuracy if the weight w(u, v) between u and v is large. Since the value of the weight (Pearson's correlation coefficient) ranges between -1 and 1, we consider two types of large weights: the absolute value of the weight is equal to or larger than 0.8 (|w(u, v)| ≥ 0.8) or 0.9 (|w(u, v)| ≥ 0.9). In the search engine, the data points are web pages and the requests are queries. An aggregated web page's correlation to a query's result accuracy is denoted by this page's ranking order, which is decided by its similarity

Fig. 5. Evaluation of identifying highly related original data points with synopses

scores to the query terms. An original web page is viewed as being highly related to a request's result accuracy if this page belongs to the query's actual top 10 web pages. In this evaluation, we randomly selected 1,000 active users and 1,000 queries to represent different requests.

Evaluation results. In Figure 5, the x axis lists the ranked aggregated data points divided into 10 sections, and the y axis shows each section's average percentage of highly related original data points (users or web pages) when testing the 1,000 requests. In the recommender system, the aggregated users are ranked and divided according to their weights to the requests. We can see in Figure 5(a) that in the first section, the percentages of highly related original users whose absolute weights are equal to or larger than 0.8 and 0.9 are about 85% and 77%, respectively. These two percentages are much smaller (about 25% and 16%) in the second section and they are smaller than 18% and 12% in the remaining eight sections. In the search engine, the aggregated web pages are ranked and divided according to their ranking orders. We can see in Figure 5(b) that the first section contains about 80% of the actual top 10 web pages. This percentage is much smaller (about 14%) in the second section and it is less than 5% in the remaining eight sections.

Results. Using the synopses, the aggregated data points with higher ranks indeed correspond to the parts of input data more related to different requests' result accuracies.

4.3 Evaluation of Online Approximate Processing

4.3.1 Evaluation settings

Deployment settings. We compare all the techniques under the same deployment setting. We deployed the service (either the recommender system or the search engine) in a cluster of KVM VMs [40], each of which has 2 CPU cores and 2 GB memory and hosts a service component. The service components include one component for accepting and partitioning requests, 60 parallel components for processing the input data, and one component for composing results to produce responses to end users. The 60 parallel components are divided into six groups; each group has 10 components and CLAP allocates the aggregated data points in a synopsis among these components.

To reflect the performance variance of parallel components in cloud environments, the 10 components in a group are hosted across four different nodes n1 to n4. Specifically, nodes n1, n2, n3, and n4 host four, four, one, and one components, respectively. The components deployed on either


node n1 or n2 co-locate with a VM with 4 CPU cores and4 GB memory and this VM runs the Hadoop MapReduceworkloads that cause continually changing performance in-terferences to the components. Hence in the group, the eightcomponents hosted on nodes n1 and n2 suffer from largerperformance interferences than those of the components onnodes n3 and n4.
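For reference, this deployment can be summarized in a small configuration sketch (our own notation, with hypothetical names):

    # One group of 10 parallel components spread unevenly across 4 nodes;
    # n1 and n2 also host an interfering 4-core/4 GB VM running MapReduce jobs.
    GROUP_TOPOLOGY = {
        "n1": {"components": 4, "interfering_vm": True},
        "n2": {"components": 4, "interfering_vm": True},
        "n3": {"components": 1, "interfering_vm": False},
        "n4": {"components": 1, "interfering_vm": False},
    }
    NUM_GROUPS = 6                                 # 6 groups x 10 = 60 parallel components
    COMPONENT_VM = {"cpu_cores": 2, "mem_gb": 2}   # per-component KVM VM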

MapReduce workloads. Two types of MapReduce jobs, namely a CPU-intensive job (WordCount) and an I/O-intensive job (Sort), are used, and their input data sizes range from 1 MB to 10 GB. The tested MapReduce workloads thus have different resource demands, and they represent a large fraction of short-running batch jobs in the cloud. The MapReduce workloads are generated using BigDataBench-MT [4], a benchmark tool that replays workloads according to real-world traces. The arrival pattern (that is, jobs' submission times, types, and input data) of the MapReduce workloads follows the Facebook production traces provided by the Statistical Workload Injector for MapReduce (SWIM) [18], [13].

Training of the performance predictor. For a parallel component in the service, we trained its performance predictor (i.e., a regression model) using training samples collected from profile runs. In profile runs, we ran the component on a VM with 2 cores and 2 GB memory, and used another VM with 4 cores and 4 GB memory co-located on the same node to run the two microbenchmark tools (stress and IBS), which generate contention on different shared resources (processors, disk bandwidth, and network bandwidth). For each shared resource, profiling tests are run while the component suffers from different resource contentions. We tested 10 resource contentions whose intensities increase stepwise until the resource utilization reaches the capacity of the node. For each resource contention, we tested a set of 1,000 requests (i.e., 1,000 active users or queries) to estimate the component's service time when processing a synopsis. We also tested the same requests to estimate the component's service time without any performance interference in order to calculate its relative service times under different resource contentions.
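The paper treats the predictor as a generic regression model; the sketch below (our own illustration, with made-up sample values) assumes a simple linear model from contention intensities to the relative service time:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Each profiling sample: contention intensities on the shared resources
    # (0.0 = idle, 1.0 = node capacity) and the measured relative service
    # time (contended service time / interference-free service time).
    X_train = np.array([
        # [cpu, disk, network]
        [0.1, 0.0, 0.0],
        [0.5, 0.0, 0.0],
        [0.9, 0.0, 0.0],
        [0.0, 0.5, 0.0],
        [0.0, 0.0, 0.5],
    ])
    y_train = np.array([1.05, 1.40, 2.10, 1.25, 1.15])  # relative service times

    predictor = LinearRegression().fit(X_train, y_train)

    def predict_service_time(base_time_ms, contention):
        # Scale the interference-free service time by the predicted slowdown.
        slowdown = float(predictor.predict(np.array([contention]))[0])
        return base_time_ms * max(slowdown, 1.0)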

Metrics used in comparison. For the basic approach and request reissue, which produce exact results, we compare performance (tail latencies) between them and CLAP. We also show the accuracy losses of the approximate results produced by CLAP. For partial execution, which produces approximate results, we compare accuracy losses between it and CLAP. To make our comparisons fair, the same tail latencies are permitted when generating the approximate results with the two compared algorithms.

4.3.2 Comparison to Existing Tail Latency Reduction Techniques Using Synthetic and Realistic Workloads

Comparison using synthetic workload. In this evaluation, we test the recommendation system workload using six different request arrival rates, namely 50, 75, 100, 125, 150, and 175 requests/second. The test set consists of 1,000 active users randomly selected from the MovieLens dataset. For each active user, 20% of the items are randomly selected to form the test set, while the remaining 80% of the items and the items from all the other users form the training set. In CLAP, the refinement threshold is set to 100% in order to process as many original data points as possible within the specified deadline (100ms). This is because all these points potentially contribute to result accuracy according to Figure 5(a).

Evaluation results. Table 1(a) shows the tail latencies of the three techniques under different request arrival rates. When the load is light (arrival rate of 50), request reissue has the smallest latency. When the load gradually increases, CLAP provides the lowest tail latencies. Table 1(b) lists the accuracy losses of partial execution and CLAP. We can see that in all cases, CLAP causes accuracy losses of less than 2%, and these losses are much smaller than those of partial execution.

TABLE 1
Comparison of different techniques using the CF-based recommender workload

(a) Tail latency (ms) of the basic approach, request reissue, and CLAP

Technique          | Request arrival rate (requests/second)
                   | 50      | 75       | 100      | 125      | 150      | 175
Basic              | 138.86  | 1.56×10³ | 5.15×10⁴ | 7.62×10⁴ | 1.10×10⁵ | 1.18×10⁵
Request reissue    | 55.04   | 194.97   | 536.12   | 849.94   | 1306.36  | 1817.35
CLAP               | 65.02   | 107.54   | 126.00   | 132.90   | 163.68   | 192.74

(b) Percentages of accuracy losses in partial execution and CLAP

Technique          | 50      | 75       | 100      | 125      | 150      | 175
Partial execution  | 1.96    | 5.50     | 15.10    | 37.13    | 99.50    | 99.72
CLAP               | 0.23    | 0.65     | 0.85     | 1.28     | 1.45     | 1.84

Comparison using realistic workload. In this evaluation, we tested the search engine workload, in which both the query terms and their arrival patterns (that is, queries' submission times, arrival rates, and sequences) are derived from a real-world 24-hour user query log collected by the Sogou search engine [14]. In the log, the load of the search engine follows a typical diurnal variation over the 24 hours: the load is lightest at midnight (hours 1 to 7, i.e., 1:00 a.m. to 7:00 a.m.) and gradually increases in the morning (hours 9 to 11); in the afternoon (hours 12 to 18), the load reaches its heaviest value and then gradually decreases in the evening (hours 19 to 24). We therefore select three typical test periods, namely hours 9 to 11, hours 16 to 18, and hours 23 to 1, which represent queries with increasing, the largest, and decreasing arrival rates, respectively. We test each period separately using 180 sessions, each lasting 1 minute (the average rate of each session is shown in Figures 6(a), (e), and (i)), and report the average. In CLAP, the refinement threshold is set to 40% in order to process at most the original data points from the top 40% ranked aggregated data points within the specified deadline (100ms). This is because these original data points include over 98% of the actual top 10 web pages according to Figure 5(b).

Evaluation results. Figure 6 shows the fluctuation of tail latencies in the three compared techniques. We can see that the basic approach (Figures 6(b), (f), and (j)) causes the highest tail latencies, which grow ever longer as the load increases because the queueing time of the slowest component continuously increases. In contrast, request reissue (Figures 6(c), (g), and (k)) significantly decreases tail latencies by reducing the latencies of a small proportion of the slowest components. However, this technique still causes much longer tail latencies than CLAP (Figures 6(d), (h), and (l)) when the service is stressed by heavy workloads. In addition, Figure 7 shows that in both approximate processing techniques, the accuracy losses fluctuate with the request arrival rates because heavy loads mean less input data can be processed, thus incurring larger losses. CLAP displays an obvious superiority over partial execution by causing much smaller accuracy losses.

Fig. 7. Comparison of percentages of accuracy losses using the search engine workload of three test periods

Analysis of the compared techniques. Request reissue works best when the load is light and parallel components have different performances. This technique reduces tail latency by reissuing replicas of straggling sub-operations on slow components and executing them on quick components. However, under heavy loads, all components are stressed and have high latencies due to long request queueing delays, which weakens the reissue mechanism. Evaluation results of both service loads show that under heavy loads, the tail latencies of request reissue are 6 to 12 times larger than those of CLAP. This is because CLAP achieves consistently low tail latencies using two techniques. First, it allows all components to produce the initial approximate results quickly using the small aggregated data points. Second, it restricts the latencies of these components in the refinement processing using both the deadline and the refinement threshold.

In partial execution, each component still performs full computations on the entire input data. Hence under heavy loads, the component's service rate may be smaller than the request arrival rate, and the request queue will grow longer and longer. This means a majority of the components may have longer latencies than the specified deadline, and partial execution has to skip the processing results on these components, thus incurring large losses in result accuracy. In contrast, although CLAP only processes a small proportion of the input data on each component in order to provide low latencies under heavy loads, the processed data points are the ones most related to accuracy for each request (e.g., searching only 20% of the top-ranked web pages finds over 92% of the actual top 10 web pages), thereby causing only small accuracy losses.

Analysis of CLAP. In the evaluation of both service loads using CLAP, the actual latency is longer than the specified deadline (100ms) under heavy loads. There are two situations in which a request's latency exceeds 100ms on a component. The first and major situation is when the request waits a long time in the queue before being served, after which the production of the initial approximate result causes a delay longer than 100ms. The second situation arises from intense resource contention on the node, which causes the component's serving process to be suspended; when the process is resumed, its latency may violate the specified deadline. Although the evaluation results show that in CLAP only a small proportion of parallel components violate the specified deadline (less than 1% under light loads and about 5% under heavy loads), the latencies of these violating components decide the tail latencies and the requests' whole response times in highly parallel services. On the other hand, a majority (95%) of the components can still meet the specified deadline under heavy loads. This means sufficient numbers of the most accuracy-related original data points can be processed on these components, thus bringing small accuracy losses in the produced approximate results.

Results. Compared to request reissue, CLAP reduces the component tail latency by an average of 6.46 times with small accuracy losses of 2.2% in the evaluations of the recommender system and the search engine workloads. Using the same service latency, CLAP reduces the result accuracy losses by an average of 31.58 times compared to partial execution in the evaluations of the two workloads.

4.4 Discussion of the Three CLAP Modules

The approximate processing methodology provides the notion of a trade-off between latency and accuracy. CLAP exploits this trade-off on individual service components, in which both the tail latency and the accuracy loss of approximate result production are influenced by the settings of CLAP's offline and online modules. In this section, we take the recommendation system workload as an example and design experiments to demonstrate the influence of each CLAP module. The experiment of each module is conducted under the same experimental settings as Section 4.3.

4.4.1 Discussion of Synopses of Different Sizes

In the offline synopsis creation module, the selection of the R-tree depth decides the size (i.e., the number of aggregated data points) of a synopsis. In previous evaluations of the recommendation system workload, the synopsis (called s10) is generated using the nodes at depth 10 of the R-tree. Selecting a lower depth results in a smaller synopsis that represents the input data at a coarser level of granularity, thus influencing both tail latency and accuracy loss in approximate result production. To illustrate this, we test another synopsis (called s5) that is generated using the nodes at depth 5 of the R-tree.
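Depth-based synopsis extraction can be sketched as follows (our own illustration; the R-tree node interface with is_leaf, children, and aggregate() is hypothetical):

    def synopsis_at_depth(root, depth):
        # Collect one aggregated data point per R-tree node at the given
        # depth; a shallower depth yields a smaller, coarser synopsis
        # (e.g., s5 versus s10 in the text).
        synopsis, frontier = [], [(root, 0)]
        while frontier:
            node, d = frontier.pop()
            if d == depth or node.is_leaf:
                synopsis.append(node.aggregate())
            else:
                frontier.extend((child, d + 1) for child in node.children)
        return synopsis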

Fig. 6. Comparison of tail latency (ms) using the search engine workload of three test periods

Figure 8(a) compares the tail latencies of approximate results produced based on synopses s5 and s10. We can see that using s5, CLAP consistently produces results with latencies of 100ms (the specified deadline). This is because producing initial results using the aggregated data points is the mandatory task on all components, and it is the major reason for exceeding the deadline when the load gets heavier. Using s5, whose size is 6.18 times smaller than that of s10, this production process can be completed much faster, thus considerably reducing the chance of exceeding the specified deadline.

On the other hand, Figure 8(b) shows that synopsis s5 incurs larger accuracy losses than s10 for two reasons. First, the initial approximate results produced using s5 have a larger accuracy loss (28.73%) than that (16.78%) of s10. Second, when refining the initial results based on s5, the average percentage of processed original data points is only 69.43% of the average percentage based on s10. This is because the numbers of original data points represented by the aggregated data points in s5 have a much larger discrepancy than those in s10. In s5, just the two largest aggregated points correspond to 37.67% and 15.17% of all the original data points, while these percentages are only 1.44% and 1.37% in s10, respectively. Hence using s5, the input data allocated to different components is highly unequal. This means that at the refinement stage, the components that are allocated the two largest aggregated points can only process parts (about 60%) of their original data points even under light loads, thus incurring larger accuracy losses.

It is easy to see that using a synopsis s′ whose size is larger than that of s10 will increase tail latency, because the production time of the initial approximate result increases. In addition, the accuracy loss when using s′ is similar to that of s10 for two reasons. First, the initial approximate results produced using s′ and s10 have similar accuracy losses. We tested four synopses generated from the nodes at depths 11, 12, 13, and 14 of the R-tree, and the accuracy losses of the initial approximate results produced using these synopses decrease by less than 2% compared to that of s10. Second, both s′ and s10 contain a sufficient number of aggregated data points to enable fine-grained allocation of original data points, and hence the percentages of processed original data points when using synopses s′ and s10 are similar.

Fig. 8. Comparison of tail latencies and percentages of accuracy losses when using synopses of different sizes

4.4.2 Discussion of Predictive Input Data Allocation

The effectiveness of the predictive input data allocator is considerably impacted by the accuracy of the predicted relative service time. To evaluate this accuracy, we ran a parallel component on a VM with 2 cores and 2 GB memory, and used another VM with 4 cores and 4 GB memory co-located on the same node to run a Hadoop job of different sizes. In the evaluation, we tested two Hadoop jobs (WordCount and Sort) with 13 different input sizes ranging from 1 MB to 10 GB, thus generating different performance interferences with the component's service time. For each input size, the test was repeated 10 times for consistency, and the prediction errors of the performance predictor are reported using error bars. From Figure 9, we can see that in all tests the average prediction errors are smaller than 6%. In over 95% of the cases (within 2 standard deviations (SDs) of the mean), the prediction errors are smaller than 8.12%, indicating that the performance predictor keeps good track of the increased service time due to resource contention.

Fig. 9. Prediction errors of the predictive input data allocator model under different resource contentions

Fig. 10. Comparison of tail latencies and accuracy losses with and without predictive input data allocation

Furthermore, we tested CLAP using the recommendation system workload with and without predictive input data allocation (without meaning that both the aggregated and original data points are equally allocated to the parallel components). Figure 10(a) compares the tail latency of the two allocation strategies. We can see that when the load becomes heavier and the components become stressed, dynamically allocating data to the parallel components according to their performance considerably decreases their latency variance, thus reducing the component tail latency. This is because straggling components with lower performance are allocated fewer aggregated and original data points. The initial approximate results can thus be produced more quickly on these components to reduce their latencies under heavy loads. In addition, Figure 10(b) shows that with predictive input data allocation, the produced results also have smaller accuracy losses. This is because the allocation also allows more original data points to be processed in the refinement.
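One plausible form of such performance-aware allocation is to give each component a share of points inversely proportional to its predicted slowdown; the following is our own minimal sketch, not the paper's exact policy:

    def allocate_points(points, components, predicted_slowdown):
        # Slower components (larger predicted relative service time) receive
        # proportionally fewer aggregated/original data points.
        speeds = [1.0 / predicted_slowdown(c) for c in components]
        total = sum(speeds)
        allocation, start = {}, 0
        for comp, speed in zip(components, speeds):
            share = int(len(points) * speed / total)
            allocation[comp] = points[start:start + share]
            start += share
        allocation[components[-1]].extend(points[start:])  # leftover points
        return allocation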

4.4.3 Discussion of Different Refinement Thresholds

In the accuracy-aware approximate processor, the refinement threshold is a key parameter that determines the maximal percentage of original data points that can be processed to refine the result. Variations in the threshold therefore influence both the latency and the accuracy of the approximate results. In previous evaluations of the recommendation workload, the threshold is set to 100%, which allows components to process as many original data points as possible. In this evaluation, we tested three thresholds: 30%, 40%, and 50%. As shown in Figure 11(a), smaller thresholds always bring lower tail latencies under different loads, because smaller percentages of original data points can be processed in the refinement. This also incurs larger accuracy losses, as shown in Figure 11(b). Hence, by adjusting the value of the refinement threshold, CLAP can also make trade-offs between latency and accuracy while only causing small accuracy losses.

Fig. 11. Comparison of tail latencies and percentages of the processed original data points using different refinement thresholds
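The interplay of the deadline and the refinement threshold can be sketched as follows (our own illustration; the per-point update method is hypothetical):

    import time

    def refine(initial_result, ranked_originals, deadline_s, threshold):
        # Refine the initial approximate result, stopping at the deadline or
        # after the threshold fraction of ranked original points, whichever
        # comes first.
        budget = int(len(ranked_originals) * threshold)
        start = time.monotonic()
        result = initial_result
        for point in ranked_originals[:budget]:
            if time.monotonic() - start >= deadline_s:
                break                      # deadline hit: return what we have
            result = result.update(point)  # hypothetical per-point refinement
        return result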

5 RELATED WORK

Reducing tail latency in large-scale distributed computing and storage systems has attracted much attention in recent years [22]. Table 2 summarizes existing techniques based on producing exact results; they complement our work, and we do not explain them in detail here. In this section, we discuss related work based on producing approximate results. Performing approximate processing in environments with limited execution time and resources has been the subject of extensive research that makes trade-offs between processing latency and result accuracy. Different techniques place different emphasis on the contradictory metrics of latency and accuracy.

TABLE 2
Tail latency reduction techniques based on producing exact results

Use fixed resources and no replicas:
- Solve specific sources of tail latency by modifying hardware/software systems, including DVFS [38], the OS kernel [43], [44], and system software [54], [31]
- Dynamically manage services for resource efficiency [29], [45] or energy efficiency [51], [34] while maintaining low tail latency

Use more resources:
- Give a high priority in the allocation of CPU [60], [56], cache [39], and network resources [62], [27], [35]
- Increase the degree of parallelism [37], [30], [35], [58], [36] or add new components [61]

Use redundant replicas:
- Execute multiple replicas for all tasks: approaches applied in Hadoop [17] and cloud storage systems [55]; theoretical analyses of the redundancy mechanism [52], [53], [48]
- Execute redundant replicas only for straggling tasks or sub-operations in requests: approaches applied in online services [22], [35], [50] and Hadoop [57]

Approximate processing with accuracy targets. These techniques re-structure a computation process into mandatory and optional computations and then leverage the optional computations to perform approximate processing using either hardware or software approaches. The hardware approaches (e.g., approximate computing [19], [41]) execute optional computations on faster but less reliable (possibly arithmetically erroneous) hardware while maintaining a desired accuracy during approximate processing. The software approaches (e.g., online aggregation [20], [46], [42], [15]) produce multiple approximate results using optional computations. By estimating accuracy bounds of these results based on statistical sampling theory, they stop the computation process once the estimated accuracy reaches a specified target. However, the processing latency of both approaches depends on the desired accuracy, making them unsuitable for systems with real-time latency requirements.

Approximate processing with accuracy and latency bounds. Based on the workload characteristics of past queries, some techniques pre-compute specialized structures (e.g., samples, histograms, or wavelets) of input datasets. Each structure can be used to answer a specific type of query request with both accuracy and latency bounds [25], [21], [16]. Although these techniques can provide low latency for requests with certain attributes (e.g., the high-frequency terms in a search engine), they are impractical for processing online services' arbitrary requests, in which the combinations of attributes are unpredictable. Hence these techniques are orthogonal to CLAP, which uses pre-computed synopses that aggregate the entire information of all attributes to support arbitrary requests.

Partial execution for tail latency reduction. In large-scale online services, the partial execution technique [32], [35], [33], [23], [59], [22] only uses the results from the part of parallel components responding before a deadline to produce approximate results and skips the results on the other components. Although providing low tail latency, this technique does not change the output of individual components, because each component still produces exact results using the entire input data. Hence under heavy loads, when requests have long queueing delays, this technique has to skip results on a large proportion of components to guarantee low tail latency, thus causing large accuracy losses because all the skipped results potentially contribute to result accuracy. In contrast, CLAP performs computations over sufficiently small synopses to produce quick initial results on all components. Within a specified deadline, it improves the results using the parts of the input data most related to result accuracy, thus achieving high accuracy while maintaining low tail latency even under heavy loads.

Difference from the conference version. Portions of the work presented in this paper appeared in a previous conference paper [28]. We have substantially revised and improved the article as follows. First, evolving from the previous work, which focuses on approximate processing on individual components, we develop a new module (the predictive input data allocator in Section 2.3) that balances input data among multiple components according to their monitored performance, thus further reducing the tail latency and accuracy loss of online services. Second, whereas the previous accuracy-aware approximate processor controls the refinement process at the granularity of a set of original data points, this work provides a more precise control strategy at the granularity of individual original data points (Section 2.4). Third, with the improved approach implemented for Linux-based environments (Section 3.2), we redo all experiments with heavier service loads (Sections 4.2 and 4.3). We also add separate evaluations of the three CLAP modules, demonstrating each module's influence on tail latency and accuracy loss (Section 4.4). Finally, the related work has been substantially extended to reflect recent advances in tail latency reduction and approximate processing techniques (Section 5).

6 CONCLUSION

In this paper, we presented CLAP, a component-level approximate processing framework for both low tail latency and high result accuracy in cloud online services. CLAP is based on two key ideas. First, it aggregates information of similar input data to create small aggregated data points offline. By periodically allocating these points among parallel components at run time, it reduces the component latency variance and allows all components to produce initial results quickly despite handling heavy loads. Second, it estimates the correlations between different parts of the input data and arbitrary requests' result accuracies using the aggregated data points, thus minimizing accuracy losses by first using the most accuracy-related input data to refine the produced results. Evaluation results using workloads from real services demonstrate the effectiveness of CLAP at maintaining low tail latency with small accuracy losses.

REFERENCES

[1] Alibaba JStorm. https://github.com/alibaba/jstorm.
[2] Apache Lucene search engine. http://lucene.apache.org/.
[3] Apache Storm. http://storm.apache.org/.
[4] BigDataBench-MT. http://prof.ict.ac.cn/BigDataBench/multi-tenancyversion/.
[5] IBS bench. http://web2.clarkson.edu/class/cs644/isolation/.
[6] Incremental SVD method. http://sifter.org/~simon/journal/20061211.html.
[7] JSI (Java Spatial Index) RTree library. http://jsi.sourceforge.net/.
[8] MATLAB. http://www.mathworks.com/products/matlab/.
[9] MovieLens 20 million dataset. http://grouplens.org/datasets/movielens/.
[10] Recommender systems. http://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/index.html.
[11] Redis data store. http://redis.io/.
[12] Sogou web pages collection (in Chinese). http://www.sogou.com/labs/resource/t.php.
[13] Statistical Workload Injector for MapReduce (SWIM). https://github.com/SWIMProjectUCB/SWIM/wiki.
[14] User query logs in Sogou search engine (in Chinese). http://www.sogou.com/labs/resource/q.php.
[15] Sameer Agarwal, Henry Milner, Ariel Kleiner, Ameet Talwalkar, Michael Jordan, Samuel Madden, Barzan Mozafari, and Ion Stoica. Knowing when you're wrong: building fast and reliable approximate query processing systems. In SIGMOD'14, pages 481–492. ACM, 2014.
[16] Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and Ion Stoica. BlinkDB: queries with bounded errors and bounded response times on very large data. In EuroSys'13, pages 29–42. ACM, 2013.
[17] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. Effective straggler mitigation: Attack of the clones. In NSDI'13, volume 13, pages 185–198, 2013.
[18] Yanpei Chen, Sara Alspaugh, and Randy Katz. Interactive analytical processing in big data systems: A cross-industry study of MapReduce workloads. VLDB'12, 5(12):1802–1813, 2012.
[19] Vinay K Chippa, Srimat T Chakradhar, Kaushik Roy, and Anand Raghunathan. Analysis and characterization of inherent application resilience for approximate computing. In DAC'13, page 113. ACM, 2013.
[20] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M Hellerstein, Khaled Elmeleegy, and Russell Sears. MapReduce online. In NSDI'10, volume 10, page 20, 2010.
[21] Graham Cormode, Minos Garofalakis, Peter J Haas, and Chris Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1–3):1–294, 2012.
[22] Jeffrey Dean and Luiz Andre Barroso. The tail at scale. Communications of the ACM, 56(2):74–80, 2013.
[23] Zhihui Du, Hongyang Sun, Yuxiong He, Yu He, David A Bader, and Huazhe Zhang. Energy-efficient scheduling for best-effort interactive services to achieve high response quality. In IPDPS'13, pages 637–648. IEEE, 2013.
[24] Dan Farber. Google's Marissa Mayer: speed wins. ZDNet Between the Lines, 2006.
[25] Minos N Garofalakis and Phillip B Gibbons. Approximate query processing: Taming the terabytes. In VLDB'01, 2001.
[26] Genevieve Gorrell. Generalized Hebbian algorithm for incremental singular value decomposition in natural language processing. In EACL'06, 2006.
[27] Matthew P Grosvenor, Malte Schwarzkopf, Ionel Gog, Robert NM Watson, Andrew W Moore, Steven Hand, and Jon Crowcroft. Queues don't matter when you can jump them! In NSDI'15, 2015.
[28] Rui Han, Siguang Huang, Fei Tang, Fugui Chang, and Jianfeng Zhan. AccuracyTrader: Accuracy-aware approximate processing for low tail latency and high result accuracy in cloud online services. In ICPP'16 (in press). IEEE, 2016.
[29] Rui Han, Junwei Wang, Siguang Huang, Chenrong Shao, Shulin Zhan, Jianfeng Zhan, and Jose Luis Vazquez-Poletti. PCS: Predictive component-level scheduling for reducing tail latency in cloud online services. In ICPP'15, pages 490–499. IEEE, 2015.
[30] Md E Haque, Yong hun Eom, Yuxiong He, Sameh Elnikety, Ricardo Bianchini, and Kathryn S McKinley. Few-to-many: Incremental parallelism for reducing tail latency in interactive services. In ASPLOS'15, pages 161–175. ACM, 2015.
[31] Jun He, Duy Nguyen, Andrea C Arpaci-Dusseau, and Remzi H Arpaci-Dusseau. Reducing file system tail latencies with Chopper. In FAST'15, pages 119–133. USENIX Association, 2015.
[32] Yuxiong He, Sameh Elnikety, James Larus, and Chenyu Yan. Zeta: scheduling interactive services with partial execution. In SoCC'12, page 12. ACM, 2012.
[33] Yuxiong He, Sameh Elnikety, and Hongyang Sun. Tians scheduling: Using partial processing in best-effort applications. In ICDCS'11, pages 434–445. IEEE, 2011.
[34] Chang-Hong Hsu, Yunqi Zhang, Michael Laurenzano, David Meisner, Thomas Wenisch, Jason Mars, Lingjia Tang, Ronald G Dreslinski, et al. Adrenaline: Pinpointing and reining in tail queries with quick voltage boosting. In HPCA'15, pages 271–282. IEEE, 2015.
[35] Virajith Jalaparti, Peter Bodik, Srikanth Kandula, Ishai Menache, Mikhail Rybalkin, and Chenyu Yan. Speeding up distributed request-response workflows. In SIGCOMM'13, pages 219–230. ACM, 2013.
[36] Myeongjae Jeon, Yuxiong He, Hwanju Kim, Sameh Elnikety, Scott Rixner, and Alan L Cox. TPC: Target-driven parallelism combining prediction and correction to reduce tail latency in interactive services. In ASPLOS'16, pages 129–141. ACM, 2016.
[37] Myeongjae Jeon, Saehoon Kim, Seung-won Hwang, Yuxiong He, Sameh Elnikety, Alan L Cox, and Scott Rixner. Predictive parallelization: Taming tail latencies in web search. In SIGIR'14, pages 253–262. ACM, 2014.
[38] Svilen Kanev, Kim Hazelwood, Gu-Yeon Wei, and David Brooks. Tradeoffs between power management and tail latency in warehouse-scale applications. Power, 20:40, 2014.
[39] Harshad Kasture and Daniel Sanchez. Ubik: efficient cache sharing with strict QoS for latency-critical workloads. In ASPLOS'14, pages 729–742. ACM, 2014.
[40] Avi Kivity, Yaniv Kamay, Dor Laor, Uri Lublin, and Anthony Liguori. kvm: the Linux virtual machine monitor. In Proceedings of the Linux Symposium, volume 1, pages 225–230, 2007.
[41] L. Kugler. Is 'good enough' computing good enough? Communications of the ACM, 58(5):12–14, 2015.
[42] Nikolay Laptev, Kai Zeng, and Carlo Zaniolo. Early accurate results for advanced analytics on MapReduce. VLDB'12, 5(10):1028–1039, 2012.
[43] Jacob Leverich and Christos Kozyrakis. Reconciling high server utilization and sub-millisecond quality-of-service. In EuroSys'14, page 4. ACM, 2014.
[44] Jialin Li, Naveen Kr Sharma, Dan RK Ports, and Steven D Gribble. Tales of the tail: Hardware, OS, and application-level sources of tail latency. In SoCC'14, pages 1–14. ACM, 2014.
[45] David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, and Christos Kozyrakis. Heracles: improving resource efficiency at scale. In ISCA'15, pages 450–462. ACM, 2015.
[46] Niketan Pansare, Vinayak R Borkar, Chris Jermaine, and Tyson Condie. Online aggregation for large MapReduce jobs. VLDB'11, 4(11):1135–1145, 2011.
[47] Karl Pearson. Notes on the history of correlation. Biometrika, 13(1):25–45, 1920.
[48] Zhan Qiu, Juan F Perez, and Peter G Harrison. Beyond the mean in fork-join queues: Efficient approximation for response-time tails. Performance Evaluation, 91:99–116, 2015.
[49] Xiaoyuan Su and Taghi M Khoshgoftaar. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009:4, 2009.
[50] Lalith Suresh, Marco Canini, Stefan Schmid, and Anja Feldmann. C3: Cutting tail latency in cloud data stores via adaptive replica selection. In NSDI'15, 2015.
[51] Balajee Vamanan, Hamza Bin Sohail, Jahangir Hasan, and TN Vijaykumar. TimeTrader: Exploiting latency tail to save datacenter energy for online search. In MICRO'15, 2015.
[52] Ashish Vulimiri, Philip Brighten Godfrey, Radhika Mittal, Justine Sherry, Sylvia Ratnasamy, and Scott Shenker. Low latency via redundancy. In CoNEXT'13, pages 283–294. ACM, 2013.
[53] Da Wang, Gauri Joshi, and Gregory Wornell. Efficient task replication for fast response times in parallel computation. In SIGMETRICS'14, pages 599–600. ACM, 2014.
[54] Qingyang Wang, Yasuhiko Kanemasa, Jack Li, Deepal Jayasinghe, Toshihiro Shimizu, Masazumi Matsubara, Motoyuki Kawaba, and Calton Pu. Detecting transient bottlenecks in n-tier applications through fine-grained analysis. In ICDCS'13, pages 31–40. IEEE, 2013.
[55] Zhe Wu, Curtis Yu, and Harsha V Madhyastha. CosTLO: Cost-effective redundancy for lower latency variance on cloud storage services. In NSDI'15, 2015.
[56] Yunjing Xu, Zachary Musgrave, Brian Noble, and Michael Bailey. Bobtail: avoiding long tails in the cloud. In NSDI'13, pages 329–342, 2013.
[57] Neeraja J Yadwadkar, Ganesh Ananthanarayanan, and Randy Katz. Wrangler: Predictable and faster jobs using fewer resources. In SoCC'14, pages 1–14. ACM, 2014.
[58] Qing Yang, Xiaoxiao Li, Hongyi Yao, Ji Fang, Kun Tan, Wenjun Hu, Jiansong Zhang, and Yongguang Zhang. BigStation: Enabling scalable real-time signal processing in large MU-MIMO systems. In ACM SIGCOMM Computer Communication Review, volume 43, pages 399–410. ACM, 2013.
[59] Jeong-Min Yun, Yuxiong He, Sameh Elnikety, and Shaolei Ren. Optimal aggregation policy for reducing tail latency of web search. In SIGIR'15, pages 63–72. ACM, 2015.
[60] Xiao Zhang, Eric Tune, Robert Hagmann, Rohit Jnagal, Vrigo Gokhale, and John Wilkes. CPI2: CPU performance isolation for shared compute clusters. In EuroSys'13, pages 379–391. ACM, 2013.
[61] Jieming Zhu, Zibin Zheng, and Michael R Lyu. DR2: Dynamic request routing for tolerating latency variability in online cloud applications. In CLOUD'13, pages 589–596. IEEE, 2013.
[62] Timothy Zhu, Alexey Tumanov, Michael A Kozuch, Mor Harchol-Balter, and Gregory R Ganger. PriorityMeister: Tail latency QoS for shared networked storage. In SoCC'14, pages 1–14. ACM, 2014.

Rui Han is an assistant professor at the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS). He received his MSc with honors in 2010 from Tsinghua University, China, and his PhD degree in 2014 from the Department of Computing, Imperial College London, UK. His research interests are data center and cloud computing, large-scale parallel and distributed systems, and big data benchmarks.

Siguang Huang is a graduate student at the School of Software, Tsinghua University. His work focuses on large-scale and real-time systems.

Zhentao Wang is a full-time software engineer at ICT, CAS. He has rich experience in developing big data systems.

Jianfeng Zhan is a full professor at ICT, CAS. His research work is driven by interesting problems. He enjoys building new systems and collaborating with researchers from different backgrounds.