


arXiv:1504.06736v2 [cs.DC] 3 Sep 2015

ROBUS: Fair Cache Allocation for Multi-tenant Data-parallel Workloads

Mayuresh Kunjir, Brandon Fain, Kamesh Munagala, Shivnath Babu
Duke University

{mayuresh, btfain, kamesh, shivnath}@cs.duke.edu

ABSTRACT

Systems for processing big data—e.g., Hadoop, Spark, and massively parallel databases—need to run workloads on behalf of multiple tenants simultaneously. The abundant disk-based storage in these systems is usually complemented by a smaller, but much faster, cache. Cache is a precious resource: tenants who get to use the cache can see two orders of magnitude performance improvement. Cache is also a limited and hence shared resource: unlike a resource like a CPU core, which can be used by only one tenant at a time, a cached data item can be accessed by multiple tenants at the same time. Cache, therefore, has to be shared by a multi-tenancy-aware policy across tenants, each having a unique set of priorities and workload characteristics.

In this paper, we develop cache allocation strategies that speed up the overall workload while being fair to each tenant. We build a novel fairness model targeted at the shared resource setting that incorporates not only the more standard concepts of Pareto-efficiency and sharing incentive, but also defines envy freeness via the notion of core from cooperative game theory. Our cache management platform, ROBUS, uses randomization over small time batches, and we develop a proportionally fair allocation mechanism that satisfies the core property in expectation. We show that this algorithm and related fair algorithms can be approximated to arbitrary precision in polynomial time. We evaluate these algorithms on a ROBUS prototype implemented on Spark with the RDD store used as cache. Our evaluation on an industry-standard workload shows that our algorithms provide a speedup close to performance-optimal algorithms while guaranteeing fairness across tenants.

1. INTRODUCTION

Two recent trends in data processing are: (i) the use of multi-tenant clusters for analyzing large and diverse datasets, and (ii) the aggressive use of memory to speed up processing by caching datasets. The growing popularity of systems like Apache Spark [4], SAP HANA [24], Hadoop with Discardable Distributed Memory [1], and Tachyon [45] highlights these trends.

For example, Spark introduces an abstraction called Resilient Distributed Dataset (RDD) to represent any data relevant to modern analytics: files (on a local or distributed file-system), tables (horizontally or vertically partitioned), vertices or edges of graphs, statistical models learned from data, etc. A user can create an RDD directly from data residing on a local or distributed file-system, or by applying a transformation to one or more other RDDs. The user can then direct the system to cache the RDD in memory. Figure 1 gives an example. Computations done on RDDs cached in memory run 10-100x faster than when the data resides on disk [4].

Figure 1: A sample Spark program

Can User-Directed Caching and Multi-tenancy Coexist? User-directed caching brings some major challenges in a multi-tenant data analytics cluster:

• A precious resource: Tenants who get to use the in-memory cache see many orders of magnitude of performance improvement. However, the cache is also a limited resource, since the total size of memory in a cluster is usually orders of magnitude smaller than the data sizes stored and queried in the cluster.

• Complications from sharing: Unlike a resource like a CPU core, which is used by one tenant at a time, a cached data item can simultaneously benefit a high-priority and a low-priority tenant.

• Avoiding cache hogs: Low-priority tenants should not be able to hog the available cache and prevent other tenants from getting the performance benefits they deserve.

• Utilities differ: Different tenants have different utilities for datasets that could be placed in the cache.

When faced with such challenges, traditional cache allocation policies can lead to user dissatisfaction, poor or unpredictable performance, and low resource utilization. We will illustrate the problems and opportunities through an example. Consider a social-networking company, SpaceBook, that runs a multi-tenant cluster for analyzing datasets about how its users are using the service.

Multiple tenants: The predominant practice in the industry is to group similar users—e.g., users in the same department—into queues (or, pools). Each queue forms a tenant in the cluster. The cluster at SpaceBook is used by three tenants: (i) Analyst, the business analysts in the company, (ii) Engineer, the developers in the company who develop data-driven applications such as recommendation models, and (iii) VP, the top-level management in the company, such as the CEO and the Chief Security Officer, who look at hourly and daily reports.

Cacheable entities: These three tenants will benefit from caching one or more of three views—R, S, and P—each of size M bytes.


Throughout this paper, "view" refers to any data item that can be cached to give a performance benefit. For SQL workloads, a view corresponds to a SQL expression, like any candidate view generated by a materialized view selection algorithm [32, 58, 8, 42]. For broader data analytics—e.g., machine learning and graph processing—a view corresponds to a dataset on which the user has put a cache directive (recall the example Spark program from Figure 1).

Tenant    R  S  P
Analyst   2  1  0
Engineer  2  1  0
VP        0  1  2

Table 1: Utilities of cached views to tenants at SpaceBook

Utilities: The matrix in Table 1 shows the utility that each tenant gets if the corresponding view were to be cached in memory. A simple definition of utility we will use in this paper is the savings in I/O because data is read from the in-memory cache versus disk. For example, if view R is cached in memory, then tenant Analyst will get a utility of two units. One common pattern in multi-tenant clusters that we bring out in Table 1 is that view R could be the detailed logs that business analysts and developers access quite often; view P could be a table that only the top-level management has access to; while view S could be a materialized view with aggregated information shared by all tenants.

Scenario 1: Debbie, the cluster DBA, is responsible for allocating resources so that the tenants get good performance while all resources are used effectively. Suppose the in-memory cache at SpaceBook has a total size of M bytes. Debbie first configures a static and equal partitioning of the cache so that each tenant is entitled to M/3 bytes of cache memory. Recall that each of the views R, S, and P is M bytes; so none of them will fit in its M/3 bytes of cache. Hence, resource utilization will be poor and none of the tenants will receive any performance boost.

Scenario 2: Next, Debbie switches to a more common cache allocation policy, Least Recently Used (LRU). The view R is used the most at SpaceBook, so it will likely remain cached most of the time. Thus, the Analyst and Engineer tenants will see performance speedups. However, the VP tenant's workload will see poor performance, causing these users—including Zuck, SpaceBook's CEO—to complain that important reports needed for their decision-making are not being generated on time.

Scenario 3: Debbie decides to give the VP tenant 50% higher priority than the other tenants. So, she assigns weights to the Analyst, Engineer, and VP tenants in the ratio 1 : 1 : 1.5. Debbie now switches to a policy that allocates the cache based on the weighted utility of the tenants. She tells Zuck that his reports will now be generated faster. Unfortunately, Zuck will not see any performance improvement: view R will still be the only one cached, since it has the highest weighted utility of 4 (= 2×1 + 2×1); higher than view S's weighted utility of 3.5 (= 1×1 + 1×1 + 1×1.5), and view P's weighted utility of 3 (= 2×1.5).

Scenario 4: To improve the poor performance seen by the VP tenant, Zuck gives Debbie the money needed to double the cache memory in the cluster. Now two views will fit in the 2M-sized cache. However, even after this massive investment, the VP tenant will only see a minor increase in performance compared to the Analyst and Engineer tenants: views R and S will now be cached, since together they have the highest weighted utility of 7.5 (4 for R + 3.5 for S); higher than 7 for R and P, and 6.5 for S and P.
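To make the arithmetic behind Scenarios 3 and 4 concrete, here is a small Python sketch that recomputes the weighted utilities from Table 1. The dictionaries and the exhaustive search over configurations are illustrative assumptions; they simply stand in for whatever weighted-utility policy Debbie deploys.

    # Utilities from Table 1 and Debbie's weights from Scenario 3.
    utilities = {
        "Analyst":  {"R": 2, "S": 1, "P": 0},
        "Engineer": {"R": 2, "S": 1, "P": 0},
        "VP":       {"R": 0, "S": 1, "P": 2},
    }
    weights = {"Analyst": 1.0, "Engineer": 1.0, "VP": 1.5}

    def weighted_utility(config):
        # Sum each tenant's utility over the cached views, scaled by its weight.
        return sum(w * sum(utilities[t][v] for v in config)
                   for t, w in weights.items())

    # Scenario 3 (cache fits one view): R wins with 4.0 > 3.5 (S) > 3.0 (P).
    print(max([{"R"}, {"S"}, {"P"}], key=weighted_utility))
    # Scenario 4 (cache fits two views): {R, S} wins with 7.5 > 7.0 > 6.5.
    print(max([{"R", "S"}, {"R", "P"}, {"S", "P"}], key=weighted_utility))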

Scenario 5: Zuck is very unhappy with Debbie. She now tries to improve things by switching back to static partitioning of 2M/3 for every tenant, causing everyone to get poor performance because none of the views fit. In desperation, Debbie now has to go to the Analyst and Engineer tenants to request them to stop adding cache directives to their workloads. The whole multi-tenant situation becomes extremely messy.

Better scenarios: Let us consider what Debbie would have ideally wanted. An alternative in Scenario 3 is to cache view S instead of R. While S has a slightly lower weighted utility of 3.5, compared to 4 for R, all three tenants will see performance improvements from caching S. An alternative in Scenario 4 is to cache R and P, which will also give performance benefits to all three tenants while only being slightly lower in overall weighted utility than caching R and S. In particular, the VP tenant will now see major benefits from doubling the cache size.

The above example shows the non-trivial nature of the problem of cache allocation in multi-tenant setups. There is a need to make principled choices when it comes to picking data items to cache. This motivates the main challenge we address:

Develop a cache allocation policy that provides near-optimal performance speedups for the tenants' workloads while simultaneously achieving near-optimal fairness in terms of the tenants' performance.

Our Contributions.

• In Section 2, we propose ROBUS, a platform to optimize multi-tenant query workloads in an online manner using cache for speedup. This framework groups queries in small time-based batches and employs a randomized cache allocation policy on each batch.

• In Section 3, we consider the abstract setting of shared resource allocation within a batch, and enumerate properties that we desire from any allocation scheme. We show that the notion of core from cooperative game theory captures the fairness properties in a succinct fashion. We show that when restricted to randomized allocation policies within a batch, a simple algorithm termed proportional fairness generates an allocation which satisfies the fairness properties in expectation for that batch.

• The policies we construct are based on convex programming formulations of exponential size. Nevertheless, in Section 4, we show that these policies admit arbitrarily good approximations in polynomial time using the multiplicative weight update method. We present implementations of two fair policies: max-min fairness and proportional fairness. We also present faster and more practical heuristics for computing these solutions.

• We show a proof-of-concept implementation of ROBUS on a multi-tenant Spark cluster. Motivated by practical use cases, we develop a synthetic workload generator to create various scenarios. Implementation details and evaluation are provided in Section 5. Results show that our policies provide desirable throughput and fairness in a comprehensive set of setups.

• Finally, our policies are specified abstractly, and as such easily extend to other resource allocation settings. We discuss this in Section 3.4.

2. ROBUS PLATFORM

ROBUS (Random Optimized Batch Utility Sharing), shown in Figure 2, is the cache management platform we have developed for multi-tenant data-parallel workloads. ROBUS is designed to be easily pluggable in systems like Hadoop and Spark. Each tenant submits its workload in an online fashion to a designated queue which is characterized by a weight indicating the tenant's fair share of system resources. (Recall our example from Section 1.)

Figure 2: ROBUS platform. [Figure: the ROBUS workflow over weighted tenant queues—a batch of query workload feeds a view selection algorithm whose inputs are the candidate views, a tenant utility model with a utility estimation model, and the cache budget; the recommended views drive a cache update, query rewrite, and scheduling for execution—running atop cluster resources: CPU cores, memory split into cache and heap, disk I/O, and network I/O.]

As illustrated in Figure 2, ROBUS processes the workload in batches by running five steps in a repeated loop. Step 1 removes a batch of queries that were submitted in a fixed time interval into the tenants' queues. Step 2 runs an algorithm over this entire batch to select a set of views to cache. This computation simultaneously optimizes performance and fairness; designing this algorithm is the main focus of this paper.

Step 2 takes three inputs: (i) a set of candidate views for the query batch, (ii) a utility estimation model for cached views, and (iii) the total cache budget (i.e., memory available for caching). The candidate view generation in ROBUS is a pluggable module. By default, the candidate views for a SQL query are the base tables accessed by the query. For workloads like machine learning and graph processing, the candidate views are datasets on which the user has put a cache directive (recall the example Spark program from Figure 1).

Any candidate view selection algorithm from the literature can be plugged in to ROBUS [32, 58, 8, 42]. To support this feature, ROBUS has a pluggable Step 4 where a query can be rewritten to use the views selected for caching in Step 2 before being run in Step 5. We make use of ROBUS's pluggability in Section 5 to run a candidate view selection algorithm that considers different vertical projections of input tables in the workload. In future, we plan to extend Step 4 to support re-optimization of the query based on the cached views. Re-optimization may change the query plan entirely.

The utility estimation model is used in the view selection procedure to estimate the utility provided to a query by any cached view. ROBUS currently models these utilities as savings in disk I/O costs if the view were to be read from the in-memory cache instead of disk. This approach keeps the models simple and widely applicable. In future, we plan to incorporate richer utility models from the literature that can account for more complex candidate view selection algorithms that consider interactions among views (e.g., use of views can completely change the query plan for a query [44]). The total utility of a cache configuration to a tenant is computed by summing up the estimated utilities of the queries submitted by the tenant. The tenant utilities thus computed are used by the view selection algorithm to recommend an optimal set of views to cache.

In Step 3, the cache is updated with the views selected by our algorithm (if they are not already in the cache). The query batch is then run via Steps 4 and 5. Every query runs as data-parallel tasks in a system like Hadoop or Spark. Our current prototype of ROBUS runs on Spark. A task scheduler (e.g., [2, 27]) is responsible for allocating system resources to the tasks. Cluster memory is divided into two parts: a heap space for run-time objects and a cache for the selected views. While the heap is divided across tasks and is allocated by the task scheduler, the cache is shared by all the queries in the batch simultaneously and managed by ROBUS.

3. FAIRNESS PROPERTIES AND POLICIES FOR SINGLE BATCH

In this section we study various notions of fairness when restricted to view selection for queries from a single batch. We consider policies that compute allocations that simultaneously provide large utility to many tenants and enforce a rigorous notion of fairness between the tenants. Since this is very related to other resource allocation problems in economics [18, 6, 16], we draw heavily on that work for inspiration. However, the key difference from standard resource allocation problems is that in our setting, the resources (or views) are simultaneously shared by tenants. In contrast, the resource allocation settings in economics have typically considered partitioning resources between tenants. As we shall see below, this leads to interesting differences in the notions of fairness.

3.1 Fairness and Randomization

It is well-known in economics [19] that the combination of fairness and indivisible resources (in our case, the cache and views) necessitates randomization. To develop intuition, we present two examples.

First consider a simple fair allocation scheme that, for N tenants, simply allows each tenant to use 1/N of the total cache for her preferred view(s). It is plausible that some tenants prefer a large view that does not fit in this partition but does fit in the cache. Therefore, letting tenants have a 1/N probability of using the whole cache can have arbitrarily larger expected utility than the scheme which with probability 1 lets them use a 1/N fraction of the whole cache.

Next, consider a batch wherein two tenants each request a different large view such that only one can fit into the cache. In this case, there can be no deterministic allocation scheme that does not ignore one of the tenants. Using randomization, we can easily ensure that each tenant has the same utility in expectation. In fact, utility in expectation will be the per-batch guarantee we seek, which over the long time horizon of a workload will lead to deterministic fairness.

Notation for Single Batch.

Since our view selection policy works on individual batches at a time, the notation and discussion below is specific to queries within a batch. Let N denote the total number of tenants. Define:

DEFINITION 1. A configuration S is a set of views that is feasible, in that the sum of the view sizes ∑_{S_i ∈ S} |S_i| is at most the cache size.

U_i(S) denotes the utility to tenant i that would result from caching S, which is defined as the sum over all queries in i's queue of the utility for that query.

ROBUS generates a set Q of configurations which by definition can fit in the cache, and assigns a probability x_S to cache each configuration S ∈ Q. Define the vector of all such probabilities as:

DEFINITION 2. An allocation x is the vector of the probabilities x_S of choosing configuration S, normalized so that ‖x‖ = ∑_{S∈Q} x_S = 1.

We denote by U_i(x) = ∑_{S∈Q} x_S U_i(S) the expected utility of tenant i under allocation x. ROBUS implements allocation x by sampling a configuration from this probability distribution.

For each tenant i, let U*_i = max_S U_i(S) denote the maximum possible utility tenant i can obtain if it were the only tenant in the system. For allocation x, we define the scaled utility of i as V_i(x) = U_i(x)/U*_i. We will use this concept crucially in defining our fairness notions.
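As a concrete reading of Definitions 1 and 2, the short Python sketch below encodes an allocation as a map from configurations to probabilities. The dictionary encoding and the per-tenant utility callback U_i are assumptions of this sketch, not ROBUS interfaces.

    import random

    def expected_utility(x, U_i):
        # U_i(x) = sum over S in Q of x_S * U_i(S), as in Definition 2.
        return sum(p * U_i(S) for S, p in x.items())

    def scaled_utility(x, U_i, configs):
        # V_i(x) = U_i(x) / U*_i, where U*_i = max_S U_i(S).
        u_star = max(U_i(S) for S in configs)
        return expected_utility(x, U_i) / u_star

    def sample_configuration(x):
        # ROBUS realizes an allocation by drawing one configuration per batch.
        configs, probs = zip(*x.items())
        return random.choices(configs, weights=probs, k=1)[0]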

3.2 Basic Fairness Desiderata

The first question to ask when designing a fair allocation algorithm is what properties define fairness. There has been much recent work in economics and computer science on heterogeneous resource allocation problems, and we begin by considering the properties that this related work examines [27, 52, 39]. Note that because we work within a randomized model, all of these properties are framed in terms of the expected utility of tenants.

• Pareto Efficiency (PE): An allocation is Pareto-efficient if no other allocation simultaneously improves the expected utility of at least one tenant and does not decrease the expected utility of any tenant.

• Sharing Incentive (SI): This property is termed individual rationality in economics. For N tenants, each tenant should expect higher utility in the shared allocation setting than she would expect from simply always having access to 1/N of the resources. Since our allocations are randomized, allocation x satisfies SI if for all allocations y with ||y|| ≤ 1/N and for all tenants i, U_i(x) ≥ U_i(y). In other words, V_i(x) ≥ 1/N for all tenants i, where V_i(x) is the scaled utility function defined above.

One property that is widely studied in other resource allocation contexts is strategy-proofness on the part of the tenants (the notion that no tenant should benefit from lying). In our case, since the queries are seen by the query optimizer, strategy-proofness is not an issue. The above desiderata also omit envy-freeness (that no tenant should prefer the allocation of another tenant), which is something we revisit later.

We now consider a progression of view selection mechanisms on a single batch, from very simple to more sophisticated. As a running example, suppose there is a cache of capacity 1. There are three views R, S, and P that are demanded by N tenants. Each view has unit size, so that we can cache only one view at any time. Note that this is a drastically simplified example setup only intended to build intuition about why certain view selection algorithms might fail or are superior to others; our results and experiments do not only have unit views, are not limited to three tenants, and may have arbitrarily complex utilities compared to these examples.

We can summarize the input information our view selection might see in a given batch in a table (e.g., Table 2) where the numbers represent the utilities tenants get from the views. An allocation here is a three-dimensional vector x with ‖x‖ = 1 that gives the probabilities x_R, x_S, x_P in our randomized framework for selecting the views.

Tenant  R  S  P
A       1  0  0
B       0  1  0
C       0  0  1

Table 2: Every tenant gets utility from a different view

Static Partitioning.

Recall that static partitioning is the algorithm that simply deterministically allows each of the N tenants to use 1/N of the shared resource. This algorithm does not take advantage of randomization. For the example in Table 2, this algorithm cannot cache anything because each user only gets to decide on the use of 1/3 of the cache. The algorithm satisfies sharing incentive in the standard deterministic setting, but is trivially not Pareto-efficient and is not sharing incentive in expectation either. As mentioned previously, such examples motivate the randomization framework to start with.

Random Serial Dictatorship.

A natural progression from static partitioning is to consider random serial dictatorship (RSD), a mechanism that is widely considered [16, 6] for problems such as house allocation and school choice. We order the tenants in a random permutation. Each tenant sequentially computes the best set of views to cache (in the residual cache space) to maximize its own utility. In the example in Table 2, each tenant gets a 1/3 chance of picking her preferred resource (since in a random permutation each tenant has a 1/3 chance of appearing first), so the allocation is x = <x_R = 1/3, x_S = 1/3, x_P = 1/3>, where each tenant has the same utility in expectation. In fact, it is easy to prove that RSD is always SI: each tenant has a 1/N chance of being first in the random ordering, so its scaled utility is at least 1/N.

However, in contrast with resource partitioning problems, our problem has a shared aspect that RSD fails to capture. For example, consider the situation in Table 3; RSD computes the same allocation as in the example in Table 2, x = <x_R = 1/3, x_S = 1/3, x_P = 1/3>. However, on this example, though RSD is SI, it is not Pareto-efficient (PE). Tenants A and C have expected utility of 1 (a 1/3 chance of getting 2 if they come first in the permutation and a 1/3 chance of getting 1 if B does) and tenant B has expected utility of 1/3 with this allocation. However, if we used allocation x = <x_R = 0, x_S = 1, x_P = 0>, then tenants A, B, and C all have utility 1, which is strictly better for tenant B and as good for tenants A and C. RSD fails to capture the fact that while each tenant may have different top preferences, many tenants may share secondary preferences.

Tenant  R  S  P
A       2  1  0
B       0  1  0
C       0  1  2

Table 3: Every tenant gets utility from the same view
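A minimal sketch of one RSD draw, assuming a caller-supplied utility(tenant, views) function; each dictator's best-response knapsack is approximated greedily here, which is exact for the unit-size, capacity-1 examples of Tables 2 and 3. Averaging tenant utilities over many draws reproduces the 1/3-each allocation computed above.

    import random

    def rsd_draw(tenants, views, sizes, cache_size, utility):
        # Tenants in a random permutation greedily extend the cached set
        # within the residual cache space.
        cached, budget = set(), cache_size
        for t in random.sample(list(tenants), len(tenants)):
            # Greedy stand-in for the dictator's best-response knapsack.
            for v in sorted(set(views) - cached, key=lambda v: -utility(t, {v})):
                if sizes[v] <= budget and utility(t, {v}) > 0:
                    cached.add(v)
                    budget -= sizes[v]
        return cached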

Utility Maximization Mechanism.

We next consider the mechanism which simply maximizes the total expected utility of an allocation, i.e., argmax_x ∑_i U_i(x). It is easy to check that this mechanism can ignore tenants who do not contribute enough to the overall utility. In other words, it cannot be SI.

Max-min Fairness (MMF).

In this algorithm we combine previous insights to optimize performance subject to fairness constraints, to get a mechanism that is both SI and PE. For allocation x, let v(x) = (V_1(x), V_2(x), ..., V_N(x)) denote the vector of scaled utilities of the tenants. We choose an allocation x so that the vector v(x) is lexicographically max-min fair. This means the smallest value in v(x) is as large as possible; subject to this, the next smallest value is as large as possible, and so on. We present algorithms to compute these allocations in Section 4.

THEOREM 1. The MMF mechanism is both PE and SI.

PROOF. The RSD mechanism guarantees a scaled utility of at least 1/N to each tenant. Since the MMF allocation is lexicographically max-min, the minimum scaled utility it obtains is at least the minimum scaled utility in RSD, which is at least 1/N. To show PE, note that if there were an allocation that yielded at least as large utility for all tenants, and strictly higher utility for one tenant, the new allocation would be lexicographically larger, contradicting the definition of MMF.

Tenant  R  S
T_1     1  0
T_2     1  0
...     .. ..
T_N     0  1

Table 4: All tenants except one get utility from the same view

Consider the example in Table 4. It is easy to see that the MMF value is 1/2 and can be achieved with the allocation <x_R = 1/2, x_S = 1/2>. This allocation is both SI and PE.

3.3 Envy-freeness and the Core

The above discussion omits one important facet of fairness. A fair allocation has to be envy free, meaning no tenant has to envy how the allocation treats another tenant. In the case where resources are partitioned between tenants, such a notion is easy to define: no tenant must derive higher utility from the allocation to another tenant. However, in our setting, resources (views) are shared between tenants, and the only common factor is the cache space. In any allocation x, each tenant derives utility from certain views, and we can term the expected size of these views the cache share of this user.

One could try to define envy-freeness in terms of cache space as follows: no tenant should be able to improve expected utility by obtaining the cache share of another tenant. But this simply means all tenants have the same expected cache share. Such an allocation need not be Sharing Incentive. Consider the example in Table 5, where each of the views R and S has size 1 and the cache has size 1. The only allocation that equalizes cache share caches S entirely. But this is not SI for tenant B.

Tenant  R    S
A       0    1
B       100  1

Table 5: Counterexample for perfect Envy-freeness

This motivates taking the utility of tenants into account in defining envy. However, this quickly gets tricky, since the utility can be a complex function of the entire configuration, and not of individual views. In order to develop more insight, we use an analogy to public projects. The tenants are members of a society who contribute an equal amount of tax. The total tax is the cache space. Each view is a public project whose cost is equal to its size. Users derive utility from the subset of projects built (or views cached). In a societal context, users are envious if they perceive an inordinate fraction of tax dollars being spent on making a small number of users happy. In other words, if they perceive a bridge to nowhere being built. Let us revisit the example in Table 4. Here, the MMF allocation sets x = <x_R = 1/2, x_S = 1/2> and ignores the fact that an arbitrarily large number of tenants want R, compared to just one tenant who wants S. If we treat R as a school and S as a park, an arbitrarily large number of users want a school compared to a park, yet half the money is spent on the school, and half on the park. This will be perceived as unfair on a societal level.

Randomized Core.

In order to formalize this intuition, we borrow the notion of core from cooperative game theory and exchange market economics [29, 55, 13, 22]. We treat each user as bringing a rate endowment of 1/N to the system. If they were the only user in the system, we would produce an allocation x with ||x|| = 1/N and maximize their utility. An allocation x over all tenants lies in the core if no subset of tenants can deviate and obtain better utilities for all participants by pooling together their rate endowments. More formally,

DEFINITION 3. An allocation x is said to lie in the (randomized) core if for any subset T of the N tenants, there is no feasible allocation y with ||y|| = |T|/N for which U_i(y) ≥ U_i(x) for all i ∈ T and U_j(y) > U_j(x) for at least one j ∈ T.

It is easy to check that any allocation in the core is both SI and PE, by considering sets T of size 1 and N respectively. In the above example (Table 4), the allocation x = <x_R = (N−1)/N, x_S = 1/N> lies in the core. Tenant T_N gets its SI amount of utility and cache space. The more demanded view R is cached by a proportionally larger amount. In societal terms, each user perceives his tax dollars as being spent fairly. Similarly, in the example in Table 5, the allocation x = <x_R = 1/2, x_S = 1/2> lies in the core.
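Definition 3 can be verified by brute force on toy instances such as Tables 4 and 5. The checker below is purely illustrative and exponential in N; feasible_allocs(mass) is an assumed caller-supplied enumerator over (a grid of) allocations y with ||y|| equal to the given mass.

    from itertools import chain, combinations

    def in_core(x, tenants, U, feasible_allocs):
        # x: {config: prob}; U[t][config]: utility of that config to tenant t.
        def util(t, alloc):
            return sum(p * U[t][S] for S, p in alloc.items())
        N = len(tenants)
        subsets = chain.from_iterable(
            combinations(tenants, k) for k in range(1, N + 1))
        for T in subsets:
            for y in feasible_allocs(len(T) / N):     # ||y|| = |T| / N
                gains = [util(t, y) - util(t, x) for t in T]
                if min(gains) >= 0 and max(gains) > 0:
                    return False    # coalition T can deviate and improve
        return True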

In the context of provisioning public goods, there are two solution concepts that are known to lie in the core: the first, termed a Lindahl equilibrium [25, 46], attempts to find per-tenant prices that implement a Walrasian equilibrium, while the second, termed a ratio equilibrium [38], attempts to find per-tenant ratios of cache shares. However, these concepts are shown to exist using fixed-point theorems, which don't lend themselves to efficient algorithmic implementations. We sidestep this difficulty by using randomization to our advantage, and show that a simple mechanism finds an allocation in the core.

Proportional Fairness.

DEFINITION 4. An allocation x is proportionally fair (PF) if it is a solution to:

    Maximize ∑_{i=1}^N log(U_i(x)) subject to ||x|| ≤ 1    (1)

We show the following theorem using the KKT (Karush-Kuhn-Tucker) conditions [43]. The proof also follows easily from the classic first-order optimality condition of PF [50]; however, we present the entire proof for completeness. Subsequently, in Section 4, we show how to compute this allocation efficiently.

THEOREM 2. Proportionally fair allocations satisfy the core property.

PROOF. Let x denote the optimal solution to (PF). Let d denote the dual variable for the constraint ||x|| ≤ 1. By the KKT conditions, we have:

    x_S > 0  ⟹  ∑_i U_i(S)/U_i(x) = d
    x_S = 0  ⟹  ∑_i U_i(S)/U_i(x) ≤ d

Multiplying the first set of identities by x_S and summing them, we have

    d = d (∑_S x_S) = ∑_i ∑_S x_S U_i(S)/U_i(x) = ∑_i 1 = N

This fixes the value of d. Next, consider a subset T of users with |T| = K, along with some allocation y with ||y|| = K/N. First note that the KKT conditions imply:

    ∑_i U_i(S)/U_i(x) ≤ N   for all S

Multiplying by y_S and summing, we have:

    ∑_i U_i(y)/U_i(x) ≤ N ∑_S y_S = K

Therefore,

    ∑_{i∈T} U_i(y)/U_i(x) ≤ K

Therefore, if U_i(y) > U_i(x) for some i ∈ T, then there exists j ∈ T for which U_j(y) < U_j(x). This shows that no subset T can deviate to improve their utility, so that the (PF) allocation lies in the core.

3.4 Discussion

Our notion of core easily extends to tenants having weights. Suppose tenant i has weight λ_i. Then an allocation x belongs to the core if for all subsets T of tenants, there does not exist y with ||y|| = (∑_{i∈T} λ_i) / (∑_{i=1}^N λ_i) such that for all tenants i ∈ T, U_i(x) ≤ U_i(y), and U_j(x) < U_j(y) for at least one j ∈ T. The proportional fairness algorithm is modified to maximize ∑_i λ_i log U_i(x) subject to ||x|| ≤ 1.

We note that the PF algorithm finds an allocation in the (randomized) core for any resource allocation game that can be specified as follows: the goal is to choose a randomization over feasible configurations of resources. Each configuration yields an arbitrary utility to each tenant. This model is fairly general. For instance, consider the setting in [27, 52], where resources can be partitioned fractionally between agents, and an agent's rate (utility) depends on the minimum resource requirement satisfied in any dimension. If we treat each agent as being endowed with a 1/N fraction of the supply of resources in all dimensions, the above result shows that the (PF) allocation satisfies the property that no subset of users can pool their endowments together to achieve higher rates for all participants.

Utilities under MMF and PF.

We now compare the total utility ∑_i V_i(x) for the optimal MMF and (PF) solutions. We present results showing that (PF) has larger utility than MMF in certain canonical scenarios. Our first scenario defines the following grouped instance: there are k views, 1, 2, ..., k, each of unit size. The cache also has size 1. There are k groups of tenants; group i has N_i tenants, all of which want view i.

LEMMA 1. The total utility of (PF) is at least the total utility of MMF for any grouped instance.

PROOF. On grouped instances, MMF sets rate 1/k for each tenant, yielding a total utility of N/k for N tenants. The (PF) algorithm sets rate x_i = N_i/N for all tenants in group i. This yields total utility ∑_i N_i²/N. Next note that

    (∑_{i=1}^k N_i²) / k ≥ ((∑_{i=1}^k N_i) / k)²

Noting that ∑_i N_i = N, it is now easy to verify that (PF) yields larger utility.

In fact, the ratio of the utilities of MMF and PF is precisely the Jain's index [37] of the vector ⟨N_1, N_2, ..., N_k⟩. By setting k = N/2 + 1, and N_2 = N_3 = ... = N_k = 1, this shows that (PF) can have Ω(N) times larger total utility than MMF. Our next scenario focuses on arbitrary instances with only two tenants.

LEMMA 2. For two tenants, the total utility of (PF) is at least the total utility of MMF.

PROOF. Let the utilities of the two tenants be a, b in (PF) and A, B in MMF. Assume a ≤ b. Since MMF maximizes the minimum utility, we have a ≤ min(A, B). Let α = A/a and β = B/b, so that α ≥ 1. Since log(a) + log(b) = log(ab) is maximized by definition of PF and log is an increasing function, we have ab ≥ AB, so αβ ≤ 1. Since α ≥ 1, this implies 1/β ≥ α ≥ 1. Therefore

    b − B = B(1/β − 1) ≥ a(1/β − 1) ≥ a(α − 1) = A − a

This shows a + b ≥ A + B, completing the proof.

Summary of Fairness Properties.

In summary, Table 6 shows the fairness properties that hold for all of our candidate algorithms. We abbreviate the properties SI for sharing incentive and PE for Pareto efficiency. Based on this analysis, we suggest that proportional fairness is likely to be a preferable view selection algorithm for our ROBUS framework. The theoretical properties of proportional fairness suggest that it should perform fairly and efficiently.

Algorithm                   SI  PE  CORE
Random Serial Dictatorship  X
Utility Maximization            X
Max-Min Fairness            X   X
Proportional Fairness       X   X   X

Table 6: Fairness properties of mechanisms

4. APPROXIMATELY COMPUTING PF AND MMF ALLOCATIONS

In this section, we show that the PF and MMF allocations can be computed to arbitrary precision. We then present fast heuristic algorithms for approximately computing PF and MMF allocations, which we implement in our prototype.

One key issue in computation is that the number of configurations is exponential in the number of views and tenants, so that the convex programming formulations have exponentially many variables. Nevertheless, since the programs have O(N) constraints, we use the multiplicative weight method [11, 26] to solve them approximately in time polynomial in N and the accuracy parameter 1/ε. These algorithms assume access to a welfare maximization subroutine that we term WELFARE.

DEFINITION 5. Given a weight vector w, WELFARE(w) computes a configuration S that maximizes weighted scaled utilities, i.e., solves argmax_S ∑_{i=1}^N w_i V_i(S).

The scaled utilities are computed using the tenant utility model described in Section 2. In our presentation, we assume WELFARE solves the welfare maximization problem exactly. Our algorithms will make polynomially many calls to WELFARE.


Multiplicative Weight Method.

We first detail the multiplicative weight method, which will serve as a common subroutine to all our provably good algorithms. This classical framework [11, 26] uses a Lagrangian update to decide feasibility of linear constraints to arbitrary precision.

We first define the generic problem of deciding the feasibility of a set of linear constraints: given a convex set P ⊆ R^s and an r × s matrix A,

    LP(A, b, P): ∃ x ∈ P such that Ax ≥ b?

Let y ≥ 0 be an r-dimensional dual vector for the constraints Ax ≥ b. We assume the existence of an efficient ORACLE of the form:

    ORACLE C(A, y) = max { y^T A z : z ∈ P }.

The ORACLE can be interpreted as follows: suppose we take a linear combination of the rows of Ax, multiplying row a_i x by y_i. Suppose we maximize this as a function of x ∈ P, and it turns out to be smaller than y^T b. Then, there is no feasible way to satisfy all constraints in Ax ≥ b, since a feasible solution x would make y^T Ax ≥ y^T b. On the other hand, suppose we find a feasible x. Then, we check which constraints are violated by this x, and increase the dual multipliers y_i for these constraints. On the other hand, if a constraint is too slack, we decrease the dual multipliers. We iterate this process until either we find a y which proves Ax ≥ b is infeasible, or the process roughly converges.

More formally, we present the Arora-Hazan-Kale (AHK) procedure [11] for deciding the feasibility of LP(A, b, P). The running time is quantified in terms of the WIDTH, defined as:

    ρ = max_i max_{x∈P} |a_i x − b_i|

Algorithm 1 AHK Algorithm

1: Let K ← 4ρ² log r / δ²; y_1 = (1, 1, ..., 1)
2: for t = 1 to K do
3:    Find x_t using ORACLE C(A, y_t).
4:    if C(A, y_t) < y_t^T b then
5:        Declare LP(A, b, P) infeasible and terminate.
6:    end if
7:    for i = 1 to r do
8:        M_it = a_i x_t − b_i                        ⊲ Slack in constraint i.
9:        y_i,t+1 ← y_i,t (1 − δ)^(M_it/ρ) if M_it ≥ 0.
10:       y_i,t+1 ← y_i,t (1 + δ)^(−M_it/ρ) if M_it < 0.   ⊲ Multiplicatively update y.
11:   end for
12:   Normalize y_{t+1} so that ||y_{t+1}|| = 1.
13: end for
14: Return x = (1/K) ∑_{t=1}^K x_t.

This procedure has the following guarantee [11]:

THEOREM 3. If LP(A, b, P) is feasible, the AHK procedure never declares infeasibility, and the final x satisfies:

    (a_i x − b_i) + δ ≥ 0   for all i
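For concreteness, the following NumPy transcription of Algorithm 1 may help; it is a sketch under the assumption that the caller supplies oracle(y), a routine returning argmax_{z∈P} y^T A z as required by ORACLE C(A, y).

    import numpy as np

    def ahk(A, b, oracle, rho, delta):
        # Decide feasibility of {x in P : Ax >= b}, following Algorithm 1.
        r = len(b)
        K = int(np.ceil(4 * rho**2 * np.log(r) / delta**2))
        y = np.ones(r)
        xs = []
        for _ in range(K):
            x = oracle(y)                      # maximize y^T A z over z in P
            if y @ (A @ x) < y @ b:
                return None                    # LP(A, b, P) is infeasible
            xs.append(x)
            M = A @ x - b                      # slack in each constraint
            y = y * np.where(M >= 0,
                             (1 - delta) ** (M / rho),
                             (1 + delta) ** (-M / rho))
            y /= y.sum()                       # normalize so ||y|| = 1
        # By Theorem 3, the average satisfies a_i x >= b_i - delta for all i.
        return np.mean(xs, axis=0)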

4.1 Proportional Fairness

Our algorithm uses the AHK algorithm as a subroutine and considers dual weights to find an additive ε-approximate solution. The primary result is the following theorem:

THEOREM 4. An approximation algorithm computes an additive ε approximation to (PF) with O(4N⁴ log² N / ε²) calls to WELFARE, and polynomial additional running time.

Proof. For allocation x, let B(x) = ∑_i log V_i(x). Let Q* = max_x B(x) denote the optimal value of (PF), and let x* denote the optimal allocation. We first present a Lipschitz-type condition, whose proof we omit from this version.

LEMMA 3. Let y satisfy B(y) ≥ Q* − ε for ε ∈ (0, 1/6). Then, for all i, V_i(y) ≥ V_i(x*)/2.

The proof idea is to use the concavity of the log function to exhibit a convex combination of x* and y whose value exceeds Q*, which is a contradiction. It is therefore sufficient to find Q* to an additive approximation in order to achieve at least half the welfare of (PF) for all tenants. Towards this end, for a parameter Q, we write (PF) as a feasibility problem PFFEAS(Q) as follows:

DEFINITION 6. PFFEAS(Q) decides the feasibility of the constraints

    (F):  ∑_S x_S V_i(S) − γ_i ≥ 0   for all i

subject to the constraints:

    (P1): ∑_S x_S ≤ 1,  x_S ≥ 0 for all S
    (P2): ∑_i log γ_i ≥ Q,  γ_i ∈ [1/N, 1] for all i

The above formulation is not an obvious one, and is related to virtual welfare approaches recently proposed in Bayesian mechanism design [20, 15]. The key idea is to connect expected values (utility) to their realizations in each configuration via the expected value variables γ_i. The constraints (P2) and (P1) are over expected values and realizations respectively. The ORACLE computation in the multiplicative weight procedure will decouple into optimizing expected value variables over (P2), and optimizing WELFARE over (P1) respectively, and both these problems will be easily solvable.

We note that (P2) has the additional constraints γ_i ∈ [1/N, 1] for all i. These are in order to reduce the width of the constraints (F). Note that otherwise, γ_i could take on unbounded values while still being feasible for (P2), and this would make the width of (F) unbounded. The lower bound of 1/N on γ_i is to control the approximation error introduced. We argue below that these constraints do not change our problem.

LEMMA 4. Let Q* denote the optimal value of the proportional fair allocation (PF). Then PFFEAS(Q) is feasible if and only if Q ≤ Q*.

PROOF. In the formulation PFFEAS(Q), the quantity γ_i is simply the scaled utility of tenant i. Consider the proportionally fair allocation x. For this allocation, all scaled utilities lie in [1/N, 1] since the allocation is SI. Therefore, x is feasible for PFFEAS(Q*). On the other hand, if y is feasible for PFFEAS(Q) for Q > Q*, then y is also feasible for (PF) with objective value greater than Q*, contradicting the optimality of x.


We will therefore search for the largest Q for which PFFEAS(Q) is feasible. Since each γ_i ∈ [1/N, 1], we have Q ∈ [−N log N, 0]. Therefore, obtaining an additive ε approximation to Q* by binary search requires O(log N) evaluations of PFFEAS(Q) for various Q, assuming constant ε > 0.

Solving PFFEAS(Q). We now fix a value Q and apply the AHK procedure to decide the feasibility of PFFEAS(Q). To map to the description in the AHK procedure, we have b = 0, and A is the LHS of the constraints (F). We have r = N. Since any V_i(S) ≤ 1 and γ_i ∈ [1/N, 1], the width ρ of (F) is at most 1. Finally, for a small constant ε > 0, we will set δ = ε/N². Therefore, K = 4N⁴ log N / ε².

For dual weights w, the oracle subproblem C(A, w) is the following:

    Max_{x,γ} ∑_i w_i (V_i(x) − γ_i)

subject to (P1) and (P2). This separates into two optimization problems.

The first sub-problem maximizes ∑_i w_i V_i(x) subject to x satisfying (P1). This is simply WELFARE(w). The second sub-problem is the following:

    Minimize ∑_i w_i γ_i

subject to γ satisfying (P2). Let L denote the dual multiplier for the constraint ∑_i log γ_i ≥ Q. Consider the Lagrangian problem:

    Minimize ∑_i (w_i γ_i − L log γ_i)

subject to γ_i ∈ [1/N, 1] for all i. The optimal solution sets γ_i(L) = max(1/N, min(1, L/w_i)), which is a non-decreasing function of L. We check if ∑_i log γ_i(L) < Q. If so, we increase L until we satisfy the constraint with equality. This parametric search takes polynomial time, and solves the second sub-problem.

The AHK procedure now gives the following guarantee: either we declare PFFEAS(Q) infeasible, or we find (x, γ) such that for all i, we have:

    ∑_S x_S V_i(S) ≥ γ_i − ε/N² ≥ γ_i (1 − ε/N)

Since ∑_i log γ_i ≥ Q, the above implies:

    B(x) = ∑_i log V_i(x) ≥ Q + ∑_i log(1 − ε/N) ≥ Q − ε

so that the value Q − ε is achievable with the allocation x.

Binary Search. To complete the analysis, since PFFEAS(Q*) is feasible, the procedure will never declare infeasibility when run with Q = Q*, and will find an x with B(x) ≥ Q* − ε, yielding an additive ε approximation. This binary search over Q takes O(log N) iterations.

Thus, we arrive at the result of Theorem 4.

4.2 Max-min Fairness

We present an algorithm SIMPLEMMF that computes an allocation x maximizing min_i V_i(x). The MMF allocation can be computed by applying this procedure iteratively as in [28]; we omit the simple details from this version. We note that the idea of applying the multiplicative weight method to compute max-min utility also appeared in [21].

We write the problem of deciding feasibility as SIMPLEMMF(λ):

    (F):  ∑_S V_i(S) x_S ≥ λ   for all i

subject to the constraints:

    (P):  ∑_S x_S ≤ 1,  x_S ≥ 0 for all S

We have λ* ∈ [1/N, 1], where λ* = max_x min_i V_i(x). Therefore, the width ρ ≤ 1. Further, we can set δ = ε/N. We can now compute K from the AHK procedure, so that K = 4N² log N / ε², in order to approximate λ* to a factor of (1 − ε). The procedure is described in Algorithm 2.

Algorithm 2 Approximation Algorithm for SIMPLEMMF

1: Let ε denote a small constant < 1.
2: T ← 4N² log N / ε²
3: w_1 ← 1/N                          ⊲ Initial weights
4: x ← 0                              ⊲ Probability distribution over set of views
5: for k ∈ 1, 2, ..., T do
6:    Let S be the solution to WELFARE(w_k).
7:    w_i,k+1 ← w_i,k exp(−ε U_i(S)/U*_i)
8:    Normalize w_{k+1} so that ||w_{k+1}|| = 1.
9:    x_S ← x_S + 1/T                  ⊲ Add S to the collection
10: end for
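A compact NumPy rendering of Algorithm 2, under the assumption that welfare(w) returns a hashable configuration S together with its scaled-utility vector (V_1(S), ..., V_N(S)):

    import numpy as np

    def simple_mmf(welfare, N, eps):
        # Multiplicative-weights approximation of max_x min_i V_i(x).
        T = int(np.ceil(4 * N**2 * np.log(N) / eps**2))
        w = np.ones(N) / N                     # initial weights
        x = {}                                 # sparse distribution over configs
        for _ in range(T):
            S, V = welfare(w)                  # V[i] = U_i(S) / U*_i
            w = w * np.exp(-eps * V)           # damp weights of well-served tenants
            w /= w.sum()                       # renormalize
            x[S] = x.get(S, 0.0) + 1.0 / T     # add S to the collection
        return x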

In order to compute MMF allocations, we use a similar idea to decide feasibility, except that we have to perform O(N²) invocations. This blows up the running time to O(4N⁴ log N / ε²) invocations of WELFARE.

The algorithm gives the following result:

THEOREM 5. An approximation algorithm for SIMPLEMMF (Algorithm 2) finds a solution x such that min_i V_i(x) ≥ λ*(1 − ε) using 4N² log N / ε² calls to WELFARE.

4.3 Fast Heuristics

In this section, we present heuristic algorithms that directly work with the exponential-size convex programs. We directly implement these algorithms in software to gather our experimental results.

Configuration Pruning.

For M = O(N²), generate M random N-dimensional unit vectors w_k, k = 1, 2, ..., M. For each w_k, let S_k be the configuration corresponding to WELFARE(w_k). Denote this set of configurations by S. We restrict the convex programming formulations of PF and MMF to just the set of configurations S, and solve these programs directly, as we describe below. The intuition behind this pruning step is the following: the approximation algorithms for PF and MMF find convex combinations of configurations that are optimal for WELFARE(w) for some w's that are computed by the multiplicative weight procedure. Instead of this, we generate random such Pareto-optimal configurations, giving sufficient coverage so that each tenant has a high probability of having the maximum weight at least once.
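A sketch of the pruning step, with two assumptions flagged: welfare(w) returns a hashable configuration, and the random unit vectors are drawn with non-negative entries (one natural reading here, since WELFARE weights tenant utilities):

    import numpy as np

    def pruned_configurations(welfare, N, M=None, seed=0):
        # Collect S_k = WELFARE(w_k) for M = O(N^2) random unit weight vectors.
        rng = np.random.default_rng(seed)
        M = M if M is not None else N * N
        configs = set()
        for _ in range(M):
            w = np.abs(rng.standard_normal(N))  # random non-negative direction
            w /= np.linalg.norm(w)              # scale to a unit vector
            configs.add(welfare(w))             # Pareto-optimal config for w
        return configs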

We compared two algorithms for SIMPLEMMF, one using the multiplicative weight procedure (Algorithm 2), and the other solving the linear program (Program (3) below) restricted to random optimal configurations. When run on 200 batches with five tenants, using 5 weight vectors gives a 10.4% approximation error on the objective of SIMPLEMMF. With 25 random weight vectors, the approximation error is 1.4%, and with 50 random weights, the approximation error drops to 0.6%. This shows that a small set S of configurations that are optimal solutions to WELFARE(w) for random vectors w is sufficient to generate good approximations to our convex programs. In our implementation, we set S to be the union of these configurations along with the configurations generated by the SIMPLEMMF algorithm (Algorithm 2).

Proportional Fairness.

We first note that (PF) is equivalent to the following; the proof of equivalence follows from Theorem 2, where the dual variable corresponding to the constraint ∑_S x_S = 1 is precisely N.

    Max g(x) = ∑_{i=1}^N log(V_i(x)) − N‖x‖  s.t.  x ≥ 0    (2)

Given a configuration space S, we can solve program (2) using gradient ascent, as shown in Algorithm 3. As precomputation, for each configuration S ∈ S, we precompute V_i(S). Then V_i(x) = ∑_{S∈S} V_i(S) x_S.

Algorithm 3 Proportional Fairness Heuristic

1: Let M = |S|. Set t = 1.
2: Let x_1 = (1/M, 1/M, ..., 1/M).
3: repeat
4:    y = ∇g(x) evaluated at x = x_t.
5:    r* = argmax_r g(x_t + r y)
6:    x_{t+1} = x_t + r* y
7:    Project x_{t+1} as: x_d = max(x_d, 0) for all dimensions d ∈ {1, 2, ..., M}.
8: until x_t converges
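The same heuristic in runnable form: a NumPy sketch of Algorithm 3 over a precomputed N × M matrix V with V[i, S] = V_i(S). The exact line search of step 5 is replaced by simple backtracking, an assumption of this sketch.

    import numpy as np

    def pf_heuristic(V, iters=500, tol=1e-8):
        N, M = V.shape
        x = np.full(M, 1.0 / M)

        def g(z):
            # g(z) = sum_i log(V_i(z)) - N * ||z||, as in program (2).
            Vz = V @ z
            return np.sum(np.log(Vz)) - N * z.sum() if (Vz > 0).all() else -np.inf

        for _ in range(iters):
            grad = (V / (V @ x)[:, None]).sum(axis=0) - N   # gradient of g
            r, best = 1.0, g(x)
            # Backtracking stand-in for r* = argmax_r g(x + r * grad).
            while r > 1e-12 and g(np.maximum(x + r * grad, 0.0)) <= best:
                r /= 2.0
            x_new = np.maximum(x + r * grad, 0.0)           # project onto x >= 0
            if np.linalg.norm(x_new - x) < tol:
                break
            x = x_new
        return x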

Max-min Fairness.

Using the precomputed configuration space S, we solve SIMPLEMMF using the following linear program:

    max { λ | ∑_{S∈S} V_i(S) x_S ≥ λ for all i,  ∑_{S∈S} x_S ≤ 1,  x ≥ 0 }    (3)

This can be solved using any off-the-shelf LP solver (our implementation uses the open-source lpsolve package [14]). In order to compute the MMF allocation, we iteratively compute the lexicographically max-min allocation using the above LP. The details are standard; see for instance [28]. Briefly, in each iteration a value of λ is computed. All tenants whose rate cannot be increased beyond λ without decreasing the rate of another tenant are considered saturated, and the rate of λ for these tenants is a constraint in the next iteration of the LP. The solution to the final LP, for which all tenants are saturated, is the MMF solution.
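Over the pruned set S, program (3) is a small LP. Below is a sketch using scipy.optimize.linprog rather than lpsolve (the choice of solver is immaterial); V is an N × M matrix of precomputed scaled utilities V_i(S).

    import numpy as np
    from scipy.optimize import linprog

    def simple_mmf_lp(V):
        # Solve max { lambda : V x >= lambda, sum(x) <= 1, x >= 0 }.
        N, M = V.shape
        c = np.zeros(M + 1)
        c[-1] = -1.0                            # variables (x_1..x_M, lambda)
        A_ub = np.vstack([
            np.hstack([-V, np.ones((N, 1))]),   # lambda - V_i . x <= 0
            np.append(np.ones(M), 0.0),         # sum_S x_S <= 1
        ])
        b_ub = np.append(np.zeros(N), 1.0)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                      bounds=[(0, None)] * (M + 1))
        assert res.success, res.message
        return res.x[:-1], res.x[-1]            # (allocation x, lambda*)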

5. EVALUATION

We evaluate cache allocation policies on a variety of practical setups of multi-tenant analytics clusters. The setups may differ in the number of tenants, workload arrival patterns, data access patterns, etc. Some example setups are listed below.

• Analysts: Tenants correspond to various BI analysts in an enterprise that run a similar workload. Some of the datasets are frequently accessed by all tenants, suggesting a good opportunity for shared optimization.

• ETL+Analysts: All analysts have similar data access patterns as above. But additionally, a tenant runs an ETL workload that may touch different datasets than the BI tenants.

• Production+Engineering: The engineering workload is of a bursty nature. Depending on the time of day, engineering queues have different amounts of work, whereas production queues, running pre-scheduled workflows, have similar amounts of work.

We replicate various combinations of these setups on a small-scale Spark cluster and run controlled experiments using a mix of the TPC-H benchmark [?] workload and a synthetically generated scan-based workload.

5.1 Setup and Methodology

Figure 2 has presented the architecture of ROBUS. We use Apache Spark [4] to build a system prototype. Spark is a natural choice for the evaluation since it supports a distributed memory-based abstraction in the form of Resilient Distributed Datasets (RDDs). In our prototype, a long-running Spark context is shared among multiple queues, with each queue corresponding to a tenant. The Spark context has access to the entire RDD cache in the cluster. Spark's internal fair share scheduler is configured with a dedicated pool for each queue; the fair share properties of the pool are set proportional to the weight of the corresponding queue.

Spark version           1.1.1
Number of worker nodes  10
Instance type of nodes  c3.2xlarge
Total number of cores   80
Executor memory         80GB

Table 7: Test cluster setup on Amazon EC2

Table 7 presents our test cluster setup. We generate two types of data to reflect two types of uses observed in typical multi-tenant clusters: (a) a set of 30 datasets with varying sizes, each matching the schema of the "sales" tables—store_sales, catalog_sales, and web_sales—from the TPC-DS benchmark [5] data, and (b) all TPC-H benchmark [?] datasets generated at scale 5.

The first category of data represents raw fact/log data that comes into the cluster from the OLTP/operational databases in a company. This data is processed by synthetically-generated ETL and exploratory SQL queries, each performing scans and aggregations over a dataset. We refer to this category of queries as the Sales workload. The total size of Sales data on disk is 600GB. We create a vertical projection view on each dataset on its most frequently accessed columns and use it whenever possible to answer queries. The sizes of these views when loaded into the cache range from 118MB to 3.6GB, as can be seen in Figure 3.

The second category represents data in the cluster after it has been processed by ETL. Note that this data is typically much smaller in size compared to fact/log data. In our experiments, this data is queried by standard TPC-H benchmark queries, which consist of a suite of business-oriented analytics and involve more complex operations, such as joins, compared to the Sales workload. All of the queries in our evaluation are submitted using SparkSQL APIs.

We set the cache size to 8GB, 10% of the total executor memory, leaving aside the rest as heap space. Only 6GB of the cache is used to carry out our optimizations, in order to avoid memory management issues our Spark installation experienced while evicting from a near-full cache.

The tenant utility model we use to estimate the utility of a cache configuration in our evaluation is based on observations made from real-life clusters in [9].


Figure 3: Cache size estimates of candidate Sales views (sizes in GB)

Figure 4: Workload generation for ROBUS (generators 1..N feed queues 1..N; the ROBUS optimizer drives the Spark scheduler; a monitor records execution time, waiting time, and cache size; metadata tracks candidate views with their storage paths and cache-size estimates; workloads can be replayed from a replay file)

If all the datasets that a query needs are cached, then the query is assigned a utility equal to the total size of data it reads, which corresponds to the savings in disk read I/O. Otherwise, we assign a utility of zero. It is observed in [9] that queries do not benefit much from in-memory caching if any part of their working set is not cached. A minimal sketch of this model is shown below.
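The all-or-nothing utility model is simple to express in code; function and variable names here are illustrative.

```python
def query_utility(required_views, cache_config, view_sizes):
    """All-or-nothing utility model based on [9]: a query benefits
    only when its entire working set is cached, and its utility is
    then the disk read I/O saved (the total size of data it reads)."""
    if all(view in cache_config for view in required_views):
        return sum(view_sizes[view] for view in required_views)
    return 0.0
```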

The workflow was already described in Section 2. Here we add that the cache update phase in our evaluation setup only marks datasets for caching or uncaching using Spark's cache directives. Spark lazily updates the cache when the first query requesting cached data from the batch is scheduled for execution.

Workload Arrival and Data Access.

Figure 4 shows our workload generation process. Several studies have established that query arrival times follow a Poisson distribution [31, 54]. We use the same in our prototype. Previous studies have also indicated that the data accessed by analytical workloads follows a Zipf distribution [31, 53]: a small number of datasets are more popular than others, while there is a long tail of datasets that are only sporadically accessed. To replicate such data access, our synthetic Sales workload generator picks a dataset from a Zipfian distribution provided at configuration time and adds grouping and aggregation predicates from a probability distribution defined for the chosen dataset. The TPC-H workload generator, on the other hand, picks a benchmark query from a probability distribution over the queries provided at configuration time.

Further, [53] also shows that 90% of recently accessed data is re-accessed within the hour following first access. This makes sense because users typically want to drill further into a dataset in response to some interesting observation obtained in the previous run. To support such scenarios, we pick a small window in time from a Normal distribution. Over this window, a small subset of datasets is chosen from the Zipfian distribution g. This subset forms the candidates for the duration of the window. Each query to be generated picks one dataset from the candidates uniformly at random. This technique is taken from [31], which terms the values used in the local window "cold" values, to differentiate them from globally popular "hot" values. The generated workload still follows the Zipfian distribution g globally. The local distribution is optional; if not provided, datasets are picked from the Zipfian distribution g at all times. A sketch of this generator appears below.
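A condensed sketch of the Sales generator described above. The Zipf exponent, window size, and function names are illustrative choices, and the Normal-distributed window duration is simplified here to a fixed number of queries per window.

```python
import numpy as np

def sales_workload(n_queries, n_datasets=30, zipf_exponent=1.5,
                   mean_interarrival_s=20.0, window_size=3,
                   queries_per_window=10, seed=42):
    """Yields (arrival_time, dataset_id): Poisson arrivals, with each
    query drawn uniformly from a local window of candidates that is
    itself sampled from a global Zipfian distribution g."""
    rng = np.random.default_rng(seed)
    ranks = np.arange(1, n_datasets + 1)
    g = ranks ** -zipf_exponent                  # global Zipfian g
    g /= g.sum()

    t, window = 0.0, None
    for q in range(n_queries):
        if q % queries_per_window == 0:          # refresh the local window
            window = rng.choice(n_datasets, size=window_size,
                                replace=False, p=g)
        t += rng.exponential(mean_interarrival_s)  # Poisson arrivals
        yield t, int(rng.choice(window))         # uniform over candidates
```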

5.2 Performance Metrics

We gather several performance metrics while executing a workload. They are defined next. We emphasize that these metrics are over long time horizons.

1. Throughput. This is simple to define:

$$\text{Throughput} = \frac{\text{number of queries served}}{\text{total time taken}} \qquad (4)$$

2. Fairness Index. For job schedulers, a performance-based fairness index is defined in terms of the variance in slowdowns of jobs in a shared cluster compared to a baseline case where every job receives all the resources [34]. As our work is about speeding up queries, we derive fairness from relative speedups across queries. The baseline is the case of a statically partitioned cache. Here, $X_i$ is the mean speedup for tenant $i$, and $\lambda_i$ is the weight of tenant $i$; a small computation sketch follows this list.

$$\text{Fairness index} = \frac{\left(\sum_{i=1}^{n} X_i \lambda_i\right)^2}{n \sum_{i=1}^{n} \left(X_i \lambda_i\right)^2} \qquad (5)$$

3. Average Cache Utilization. This is simply the average fraction of cache utilized during workload execution.

4. Hit Ratio. The fraction of queries served off cached views.
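The fairness index of Eq. (5) is straightforward to compute; a minimal sketch:

```python
import numpy as np

def fairness_index(speedups, weights):
    """Weighted Jain-style fairness index from Eq. (5): speedups[i]
    is tenant i's mean speedup X_i over the STATIC baseline and
    weights[i] its weight lambda_i. Equal weighted speedups give 1.0;
    maximal imbalance approaches 1/n."""
    z = np.asarray(speedups, dtype=float) * np.asarray(weights, dtype=float)
    return float(z.sum() ** 2 / (len(z) * np.square(z).sum()))

# For example, equal speedups give a perfect score:
# fairness_index([1.3, 1.3], [1.0, 1.0]) == 1.0
```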

Some of the other metrics we collect include flow time, meanexecution time, mean wait time, and wait time fairness index. Theyare not included due to space constraints.

5.3 Algorithm Evaluation

In this section, we evaluate four view selection algorithms on various setups. Each algorithm processes a batch of query workload in an offline manner, as detailed in Section 2. Section 3 discussed several possible algorithms. Here, we compare the following:

1. STATIC: Cache is partitioned in proportion to the weights of the tenants. We treat this as the baseline when evaluating the fairness index.

2. MMF: Max-min fairness implementation described in Section 4.3.

3. FASTPF: Proportional fairness implementation described in Section 4.3.

4. OPTP: The only goal is to optimize for query performance; the workload from a batch is treated as if it belongs to a single tenant (a special case of either MMF or FASTPF).

To compare these algorithms across various settings, we vary the following parameters independently in our experiments:

1. Data sharing among tenants (Section 5.3.1);

2. Workload arrival rate (Section 5.3.2); and

3. Number of tenants (Section 5.3.3).


Setup    Distributions used by four tenants
G1       h1, h1, h1, h1
G2       h1, h1, h1, g1
G3       h1, h1, g1, g2
G4       h1, g1, g2, g3

Table 8: Data access distributions used in evaluation on a mixed workload

Setup    Distributions used by four tenants
G1       g1, g1, g1, g1
G2       g1, g1, g1, g2
G3       g1, g1, g2, g3
G4       g1, g2, g3, g4

Table 9: Data access distributions used in evaluation on Sales workload

5.3.1 Effect of data sharing among tenants

To study the impact of different data sharing patterns on the performance of the algorithms, we create four different workload distributions: h1 picks queries uniformly at random over a set of 15 TPC-H benchmark queries; g1-g3 create three different Zipf distributions over the 30 Sales datasets over which scan-and-aggregation queries are generated. Each of the distributions is skewed towards a different subset of datasets. Using these distributions, we create four test setups allowing different levels of data sharing, as listed in Table 8. The batch size is set to 40 seconds; the inter-query arrival time distribution for all the tenants is given by Poisson(20); and we run 30 batches of workload for every data point.

Figure 5 shows how the different algorithms perform in each of these setups. Throughput goes down with heterogeneity in data access. The STATIC policy fails to cache any dataset for the TPC-H workload because each of the queries we generate reads the largest table, lineitem, which amounts to ≈3.8GB, much larger than the cache at the disposal of STATIC. The other three policies, on the other hand, can serve every query off cache in setup G1. As a result, they exhibit a throughput of more than 2x over STATIC. However, as the heterogeneity in data access increases, the gap in throughput narrows. Even though the shared policies cache more data, frequent updates to the cache configuration per batch cause additional delays. We explore the possibility of retaining cache state in Section 5.4.

Among the shared cache policies, OPTP scores high on throughput but very low on the fairness index. It uses the cache exclusively for the TPC-H tenants at the cost of degrading the Sales tenants' performance. The MMF and FASTPF policies, on the other hand, show much better tradeoffs between performance and fairness to tenants.

We also repeated the same experiment on Sales data alone. We first create four different Zipf distributions over candidate views: g1, g2, g3, g4. Each of the distributions is skewed towards a different subset of views. We create four test setups, each allowing a different level of data sharing, as listed in Table 9. The other common parameters are listed in Table 10.

Figure 6 shows how the different algorithms perform in each of these setups. Throughput goes down with heterogeneity in data access. STATIC performs poorly in all the setups, its performance being 30%-40% worse than the others.

Parameter                        Value
Query inter-arrival rates (sec)  20 ∀ tenant
Batch size (sec)                 40
Number of batches                30

Table 10: Data sharing experiment setup

Figure 7: Fraction of time the popular views in setup G2 were cached (top 3 views from g1 and from g2; MMF, FASTPF, OPTP)

Its lower cache utilization and lower hit ratio are further indicators of why STATIC is not the right choice for cache allocation. There is very little to distinguish among the three cache-sharing algorithms. This shows that our fair algorithms can provide a throughput close to the optimal. In terms of fairness, the OPTP algorithm gives the most inconsistent performance. It scores high in the setup with the most heterogeneity, but fails when data sharing is involved. MMF and FASTPF, on the other hand, score high in all the setups.

Interestingly, the performance of MMF falls alarmingly low in the second setup. This is clearly an outcome of the data sharing pattern, wherein three of the four tenants largely share the same subset of views. Recalling the example presented in Table 4, MMF tries to share the cache (probabilistically) equally between the two sets of tenants, effectively producing an allocation outside the core. Figure 7 shows the fraction of time the most popular views were cached by MMF, FASTPF, and OPTP. The top three views in each of g1 and g2 serve 25%, 13%, and 8% of the queries, respectively. It can be seen that while MMF caches the topmost view from the two distributions roughly equally, FASTPF and OPTP favor the topmost view from g1 more, since it is shared by three tenants. MMF tries to compensate the three tenants by caching their second-best view more, but this view has a lower utility, both due to lower access frequency and smaller size. So the overall performance of MMF suffers in this case.

5.3.2 Effect of variance in query arrival rates

To replicate bursty-tenant scenarios, we vary the query inter-arrival rates of tenants in a two-tenant setup. We create three setups (low, mid, and high) with query inter-arrival rates as listed in Table 11. The other parameters used in each of the setups are listed in Table 12.

Setup   Poisson mean, λ1   Poisson mean, λ2
low     12                 12
mid     18                 8
high    24                 6

Table 11: Query inter-arrival rates for different setups

Figure 8 shows the impact of variance in query arrival rate on various metrics. The performance of STATIC remains below the other three algorithms, as can be seen from the first three graphs.

Parameter                  Value
Data access distributions  g1, g2
Batch size (sec)           72
Number of batches          30

Table 12: Query arrival rate experiment setup


Figure 5: Effect of data sharing changes on four equi-paced tenants on a mixed workload (throughput/min, average cache utilization, hit ratio, and fairness index for STATIC, MMF, FASTPF, and OPTP across setups G1-G4)

Figure 6: Effect of data sharing changes on four equi-paced tenants on Sales workload (same metrics and algorithms as Figure 5)

Figure 8: Effect of variance in query arrival rates (throughput/min, average cache utilization, hit ratio, and fairness index across setups low, mid, and high)


Figure 9: Mean speedups provided by different algorithms over the STATIC policy for the two tenants in the setup high (bars for MMF, FASTPF, and OPTP; one tenant's speedup under OPTP is 0.67, below the STATIC baseline)

The performance gap, however, is small because the cache is partitioned into only two parts for STATIC, each part being large enough to serve 80% of the queries that could be served off the unpartitioned cache. When it comes to the fairness index, all the algorithms except OPTP get a near-perfect score. OPTP favors the faster tenant in both the mid and high setups so much that the slower tenant's performance degrades. Figure 9 shows the speedups for MMF, FASTPF, and OPTP relative to STATIC under the setup high. It can be seen that the first tenant sees a performance degradation with OPTP, empirically demonstrating that OPTP does not satisfy the sharing-incentive property.

5.3.3 Effect of number of tenants

Setup   Poisson mean, λ
2       10
4       20
8       40

Table 13: Query inter-arrival rates for a tenant under different setups

To further stress the utility of optimizing the entire cache as a shared resource, we experiment with an increasing number of tenants. Specifically, we consider scenarios with 2, 4, and 8 tenants, all using the same distribution over dataset access. We keep the number of queries per batch roughly the same by doubling the query inter-arrival rate with each doubling of the number of tenants, the batch size remaining the same across the setups. Table 13 lists the query inter-arrival rates we used. The other parameters common across the setups are listed in Table 14.

Figure 10 shows the behavior of the algorithms under these scenarios. The gap in throughput between STATIC and the other algorithms is large (35%-45%). As the number of tenants goes up, the average cache utilization of STATIC drops sharply, whereas the average cache utilization of the other algorithms remains largely stable. This can be attributed to the static partitioning of the cache in STATIC. The hit ratio shows a similar pattern, again showing why STATIC is not the best choice. In terms of the fairness index, OPTP finds it increasingly harder to provide a fair solution. With an increase in the number of tenants, the number of queries per tenant per batch goes down, which makes the locally optimal choices of OPTP less likely to provide equal speedups. In contrast, MMF and FASTPF, with their randomized choices, score over 0.9 in all the scenarios, exhibiting their superiority.

5.4 Discussion and Future Work

Parameter                  Value
Data access distributions  g1 ∀ tenant
Batch size (sec)           40
Number of batches          30

Table 14: Number of tenants experiment setup

Figure 11: Fairness index as a function of number of batches (MMF and FASTPF)

Our evaluation on practical setups brings up some interesting insights that open up multiple possibilities for future work. We discuss some of the challenges and directions here.

Our experiments show that across all setups, FASTPF and MMF provide far better trade-offs between throughput and fairness compared to STATIC and OPTP. We note that in comparing the max-min fair and proportional fair implementations, there is no clear winner. We believe this is a second-order difference that a more precise cost model and implementations of the exact algorithms (for instance, the algorithm in Section 4.1 for proportional fairness) will bring out. However, even given similar empirical results, PF has the advantage of the core property as a succinct and easy-to-explain notion of fairness.

We next note that the running time of our algorithms is polynomial in the number of tenants. In most typical industry setups, including the ones we evaluated, there are only a handful of tenants. Therefore, we expect our algorithms to be fast in the wild as well. To quantify query wait times, we observed them to be on the order of tens of milliseconds in most cases.

Convergence Properties.

As our algorithms are randomized in nature, it is important to study how long they take to converge to solutions that yield fairness across time. After running several workloads, we find that the number of batches needed to achieve convergence is very small, on the order of 15-25. In Figure 11, we present results of a four-tenant workload with 50 batches, optimized once using MMF and once using FASTPF. The fairness index was computed after every 2-4 batches. It can be seen that both algorithms converge to their respective optimal values at around 20 batches. As future work, we plan to systematically study which parameters determine the rate of convergence of the algorithms.

Batch Size and Cache State.

Our batched processing architecture introduces additional optimization choices. Primarily, there are two ways of tuning a view selection algorithm:

1. Controlling the batch size, and
2. Managing the state of the cache across batches.

The first option needs no elaboration. The second option is whether the cache is treated as stateful or as stateless when optimizing a batch. In the former case, the estimated benefit of views that are already in cache is boosted by a factor γ > 1, as in the sketch below. This influences the next cache allocation and makes it more likely for these views to stay in the cache. The latter case ignores the state of the cache when considering the next batch. All the results presented so far have used the stateless cache.
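As a one-line illustration of the stateful variant (the function and argument names are ours):

```python
def boost_cached_views(benefits, cached_views, gamma=2.0):
    """Stateful variant: multiply the estimated benefit of every view
    already in cache by gamma > 1 before optimizing the next batch
    (we use gamma = 2 in our experiments). `benefits` maps a view to
    its estimated utility; `cached_views` is the current cache set."""
    return {v: b * gamma if v in cached_views else b
            for v, b in benefits.items()}
```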

We empirically compared how the algorithms react to these parameters.


Figure 10: Effect of changing number of tenants (throughput/min, average cache utilization, hit ratio, and fairness index for STATIC, MMF, FASTPF, and OPTP with 2, 4, and 8 tenants)

Figure 12: Effect of batch size on the four equi-paced tenants setup (throughput/min and fairness index for MMFSL, MMFSF, FASTPFSL, and FASTPFSF across batch sizes of 20-80 seconds)

Figure 12 shows the effect of a change in batch size on two versions each of MMF and FASTPF: one treating the cache as stateless (MMFSL and FASTPFSL), and the other treating it as stateful (MMFSF and FASTPFSF), with γ = 2. It can be seen that both versions provide similar throughput in all cases. The stateful algorithms score higher on fairness for the smallest batch size, but there is no clear pattern when the batch size is larger. This makes sense, since small batch sizes do not give enough choices for fair configurations of the cache, and maintaining state results in an artificial increase of the batch size, thereby providing better configurations. As future work, we plan to explore these trade-offs at a larger scale to devise better guidelines on parameter tuning.

Engineering Issues.

We now highlight some challenges in scaling up our experiments to industry scale. These challenges are tied to engineering issues in current implementations of systems such as Spark, and will get ironed out over time. Most common multi-tenant Spark setups use a separate Spark context for each tenant, effectively partitioning the cache. In fact, most current multi-tenant data warehouse systems recommend splitting memory across queues. This is in part due to multi-thread management challenges that result in unpredictable behavior, such as premature eviction of cached data blocks. Another engineering issue, specific to Spark, is the inordinately long delays in garbage collection as the cluster scales up. We should be able to see a much better impact of ROBUS optimization once these practical issues get resolved.

Code Base.

The code base of ROBUS has been open-sourced [3], and our entire experimental setup can be replicated by following a simple set of instructions provided with the code.

6. RELATED WORK

Physical design tuning and multi-query optimization.

Classical view materialization algorithms in databases [32, 58, 8, 30, 47] treat the entire workload as a set and optimize towards one or more of flow time, space budget, and view maintenance costs. Online physical design tuning approaches [17, 42, 23, 44], on the other hand, adapt to changes in the query workload by modifying the physical design. None of the aforementioned approaches support multi-tenant workloads and therefore cannot be used in selecting views for caching. However, some of the techniques used, in particular candidate view enumeration, view matching, and query rewrite, can be applied in the ROBUS framework.

Batched optimization of queries was proposed in [56] and is used in many work-sharing approaches [59, 7, 51]. ROBUS likewise employs batched query optimization, but crucially also ensures that each tenant gets their fair share of the benefit.

Fairness theory.

The proportional fairness algorithm is widely studied in economics [50, 36, 18] as well as in scheduling theory [27, 52, 41, 40, 33, 57, 10]. In the context of resource partitioning problems (or exchange economies) [13, 22], it is well known that a convex program, called the Eisenberg-Gale convex program [36], computes prices that implement a Walrasian equilibrium (or market clearing solution). Our shared resource allocation problem is different from allocation problems where resources need to be partitioned, and it is not clear how to specify prices for resources (or views) in our setting. Nevertheless, we show that there is an exponential-size convex program using configurations as variables, whose solution implements proportional fairness in a randomized sense.

In scheduling theory, the focus is on analyzing delay properties [41, 40, 33] assuming jobs have durations. Our focus is instead on utility maximization, which has also been considered in the context of wireless scheduling in [57, 10]. The latter work focuses on long-term fairness for partitioned resources, where the utility of a tenant is defined as a sum of discounted utilities across time. The resulting algorithms, though simple, only provide guarantees assuming job arrivals are ergodic and tenants exist forever. They do not provide per-epoch guarantees. In contrast, we focus on obtaining per-epoch fairness in a randomized sense without ergodic assumptions, and on defining the right fairness concepts when resources are shared. We finally note that [39] presents dynamic schemes for achieving envy-freeness across time; however, these techniques are specific to resource partitioning problems and do not directly apply to our shared resource setting.

Multi-tenant architectures.

Traditionally, the notion of multi-tenancy in databases deals with sharing database system resources, viz. hardware, process, and schema,


among users [35, 12, 48]. Each tenant only accesses data owned by them. Emerging multi-tenant big data architectures, on the other hand, allow the entire cluster's data to be shared among tenants. This sharing of data is critical in our work, as it allows the cache to be used much more efficiently.

A critical component of modern multi-tenant architectures, such as Apache Hadoop, Apache Spark, and Cloudera Impala, is a fair scheduler/resource allocator [2, 27, 34]. The resource pools considered by these schedulers do not differentiate the cache resource from the heap resource and, as a result, divide the cache among tenants. As seen in our work, partitioned cache setups severely reduce optimization opportunities. Some recent approaches treat cache as an independent resource when running multiple jobs. PACMan [9] exploits the multi-wave execution workflow of Hadoop jobs to make caching decisions at the granularity of the tasks of a job that run in parallel. In another work, LRU policies for buffer pool memory are extended to meet SLA guarantees of multiple tenants [49]. However, none of these approaches exploit the opportunities presented by the shared nature of the cache resource. In another advancement, distributed analytics systems are starting to support a distributed cache store shared by multiple tenants [45, 1]. The ROBUS optimizer will be a natural fit for such systems.

7. CONCLUSION

Emerging big data multi-tenant analytics systems complement abundant disk-based storage with a smaller, but much faster, cache in order to optimize workloads by materializing views in the cache. The cache is a shared resource, i.e., cached data can be accessed by all tenants. In this paper, we presented ROBUS, a cache management platform for achieving both a fair allocation of cache and near-optimal performance in such architectures. We defined notions of fairness for the shared setting using randomization over small batches as a key tool. We presented a fairness model that incorporates Pareto-efficiency and sharing incentive, and also achieves envy-freeness via the notion of core from cooperative game theory. We showed a proportionally fair mechanism to satisfy the core property in expectation. Further, we developed efficient algorithms for two fair mechanisms and implemented them in a ROBUS prototype built on a Spark cluster. Our experiments on various practical setups show that it is possible to achieve near-optimal fairness, while simultaneously preserving near-optimal performance speedups, using the algorithms we developed.

Our framework is quite general and applies to any setting where resource allocations are shared across agents. As future work, we plan to explore other applications of this framework.

8. REFERENCES

[1] Discardable distributed memory: Supporting memory storage in HDFS. http://hortonworks.com/blog/ddm/.
[2] Hadoop fair scheduler. http://hadoop.apache.org/docs/r1.2.1/fair_scheduler.html.
[3] ROBUS Cache Planner. https://github.com/dukedbgroup/ROBUS.
[4] Spark: Lightning fast cluster computing. https://spark.apache.org/.
[5] TPC-DS benchmark. http://www.tpc.org/tpcds/.
[6] A. Abdulkadiroglu and T. Sonmez. Random serial dictatorship and the core from random endowments in house allocation problems. Econometrica, 66(3):689-701, 1998.
[7] P. Agrawal, D. Kifer, and C. Olston. Scheduling shared scans of large data files. Proc. VLDB Endow., 1(1):958-969, Aug. 2008.
[8] S. Agrawal, S. Chaudhuri, and V. R. Narasayya. Automated selection of materialized views and indexes in SQL databases. In Proceedings of the 26th International Conference on Very Large Data Bases, VLDB '00, pages 496-505, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
[9] G. Ananthanarayanan, A. Ghodsi, A. Wang, D. Borthakur, S. Kandula, S. Shenker, and I. Stoica. PACMan: Coordinated memory caching for parallel jobs. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI'12, pages 20-20, Berkeley, CA, USA, 2012. USENIX Association.
[10] M. Andrews. Instability of the proportional fair scheduling algorithm for HDR. Wireless Communications, IEEE Transactions on, 3(5):1422-1426, 2004.
[11] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: A meta algorithm and applications. Theory of Computing, 8:121-164, 2005.
[12] S. Aulbach, T. Grust, D. Jacobs, A. Kemper, and J. Rittinger. Multi-tenant databases for software as a service: Schema-mapping techniques. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1195-1206, New York, NY, USA, 2008. ACM.
[13] R. J. Aumann. Markets with a continuum of traders. Econometrica: Journal of the Econometric Society, pages 39-50, 1964.
[14] M. Berkelaar, K. Eikland, and P. Notebaert. lpsolve: Open source (Mixed-Integer) Linear Programming system.
[15] A. Bhalgat, S. Gollapudi, and K. Munagala. Optimal auctions via the multiplicative weight method. In Proceedings of the Fourteenth ACM Conference on Electronic Commerce, pages 73-90. ACM, 2013.
[16] A. Bogomolnaia and H. Moulin. A new solution to the random assignment problem. Journal of Economic Theory, 100(2):295-328, 2001.
[17] N. Bruno and S. Chaudhuri. An online approach to physical design tuning. In Proceedings of the International Conference on Data Engineering (ICDE), 2007.
[18] E. Budish. The combinatorial assignment problem: Approximate competitive equilibrium from equal incomes. Journal of Political Economy, 119(6):1061-1103, 2011.
[19] E. Budish, Y.-K. Che, F. Kojima, and P. Milgrom. Designing random allocation mechanisms: Theory and applications. American Economic Review, 103(2):585-623, 2013.
[20] Y. Cai, C. Daskalakis, and S. M. Weinberg. Optimal multi-dimensional mechanism design: Reducing revenue to welfare maximization. In FOCS, 2012.
[21] D. Chakrabarty, A. Mehta, and V. Vazirani. Design is as easy as optimization. In M. Bugliesi, B. Preneel, V. Sassone, and I. Wegener, editors, Automata, Languages and Programming, volume 4051 of Lecture Notes in Computer Science, pages 477-488. Springer Berlin Heidelberg, 2006.
[22] G. Debreu and H. Scarf. A limit theorem on the core of an economy. International Economic Review, 4(3):235-246, 1963.
[23] I. Elghandour and A. Aboulnaga. ReStore: Reusing results of MapReduce jobs. Proc. VLDB Endow., 5(6):586-597, Feb. 2012.
[24] F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner. SAP HANA database: Data management for modern business applications. SIGMOD Rec., 40(4):45-51, Jan. 2012.
[25] D. K. Foley. Lindahl's solution and the core of an economy with public goods. Econometrica: Journal of the Econometric Society, pages 66-72, 1970.
[26] Y. Freund and R. E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79-103, 1999.
[27] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource fairness: Fair allocation of multiple resource types. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI'11, pages 323-336, Berkeley, CA, USA, 2011. USENIX Association.
[28] A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica. Choosy: Max-min fair sharing for datacenter jobs with constraints. In Proceedings of the 8th ACM European Conference on Computer Systems, pages 365-378. ACM, 2013.
[29] D. Gillies. Some Theorems on n-Person Games. PhD thesis, Princeton University, 1953.
[30] J. Goldstein and P.-A. Larson. Optimizing queries using materialized views: A practical, scalable solution. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, SIGMOD '01, pages 331-342, New York, NY, USA, 2001. ACM.
[31] J. Gray, K. Baclawski, P. Sundaresan, and S. Englert. Quickly generating billion-record synthetic databases. Association for Computing Machinery, Inc., January 1994.
[32] H. Gupta and I. S. Mumick. Selection of views to materialize in a data warehouse. IEEE Transactions on Knowledge and Data Engineering, 17(1):24-43, 2005.
[33] S. Im, J. Kulkarni, and K. Munagala. Competitive algorithms from competitive equilibria: Non-clairvoyant scheduling under polyhedral constraints. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, STOC '14, pages 313-322, 2014.
[34] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair scheduling for distributed computing clusters. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP '09, pages 261-276, New York, NY, USA, 2009. ACM.
[35] D. Jacobs, S. Aulbach, and T. U. München. Ruminations on multi-tenant databases. In BTW Proceedings, volume 103 of LNI, pages 514-521. GI, 2007.
[36] K. Jain and V. V. Vazirani. Eisenberg-Gale markets: Algorithms and structural properties. In Proceedings of the Thirty-ninth Annual ACM Symposium on Theory of Computing, pages 364-373. ACM, 2007.
[37] R. K. Jain, D.-M. W. Chiu, and W. R. Hawe. A quantitative measure of fairness and discrimination for resource allocation in shared computer systems. Technical report, Digital Equipment Corporation, Sept. 1984.
[38] M. Kaneko. The ratio equilibrium and a voting game in a public goods economy. Journal of Economic Theory, 16(2):123-136, 1977.
[39] I. Kash, A. D. Procaccia, and N. Shah. No agent left behind: Dynamic fair division of multiple resources. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, AAMAS '13, pages 351-358, Richland, SC, 2013. International Foundation for Autonomous Agents and Multiagent Systems.
[40] F. Kelly, L. Massoulié, and N. Walton. Resource pooling in congested networks: Proportional fairness and product form. Queueing Systems, 63(1-4):165-194, 2009.
[41] F. P. Kelly, A. K. Maulloo, and D. K. H. Tan. Rate control for communication networks: Shadow prices, proportional fairness and stability. The Journal of the Operational Research Society, 49(3):237-252, 1998.
[42] Y. Kotidis and N. Roussopoulos. DynaMat: A dynamic view management system for data warehouses. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, SIGMOD '99, pages 371-382, New York, NY, USA, 1999. ACM.
[43] H. W. Kuhn and A. W. Tucker. Nonlinear programming. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, pages 481-492, Berkeley, Calif., 1951. University of California Press.
[44] J. LeFevre, J. Sankaranarayanan, H. Hacigumus, J. Tatemura, N. Polyzotis, and M. J. Carey. MISO: Souping up big data query processing with a multistore system. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 1591-1602, New York, NY, USA, 2014. ACM.
[45] H. Li, A. Ghodsi, M. Zaharia, E. Baldeschwieler, S. Shenker, and I. Stoica. Tachyon: Memory throughput I/O for cluster computing frameworks. LADIS, 18:1, 2013.
[46] A. Mas-Colell and J. Silvestre. Cost share equilibria: A Lindahlian approach. Journal of Economic Theory, 47(2):239-256, 1989.
[47] H. Mistry, P. Roy, S. Sudarshan, and K. Ramamritham. Materialized view selection and maintenance using multi-query optimization. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, SIGMOD '01, pages 307-318, New York, NY, USA, 2001. ACM.
[48] V. Narasayya, S. Das, M. Syamala, S. Chaudhuri, F. Li, and H. Park. A demonstration of SQLVM: Performance isolation in multi-tenant relational database-as-a-service. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 1077-1080, New York, NY, USA, 2013. ACM.
[49] V. Narasayya, I. Menache, M. Singh, F. Li, M. Syamala, and S. Chaudhuri. Sharing buffer pool memory in multi-tenant relational database-as-a-service. Proc. VLDB Endow., 8(7):726-737, Feb. 2015.
[50] J. Nash. The bargaining problem. Econometrica, 18(2):155-162, 1950.
[51] T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and N. Koudas. MRShare: Sharing across multiple queries in MapReduce. Proc. VLDB Endow., 3(1-2):494-505, Sept. 2010.
[52] D. C. Parkes, A. D. Procaccia, and N. Shah. Beyond dominant resource fairness: Extensions, limitations, and indivisibilities. In Proceedings of the 13th ACM Conference on Electronic Commerce, EC '12, pages 808-825, New York, NY, USA, 2012. ACM.
[53] K. Ren, Y. Kwon, M. Balazinska, and B. Howe. Hadoop's adolescence: An analysis of Hadoop usage in scientific workloads. Proc. VLDB Endow., 6(10):853-864, Aug. 2013.
[54] Z. Ren, X. Xu, J. Wan, W. Shi, and M. Zhou. Workload characterization on a production Hadoop cluster: A case study on Taobao. In IISWC'12, pages 3-13, 2012.
[55] H. E. Scarf. The core of an n person game. Econometrica, 35(1):50-69, 1967.
[56] T. K. Sellis. Multiple-query optimization. ACM Trans. Database Syst., 13(1):23-52, Mar. 1988.
[57] A. L. Stolyar. On the asymptotic optimality of the gradient scheduling algorithm for multiuser throughput allocation. Oper. Res., 53(1):12-25, 2005.
[58] J. Yang, K. Karlapalem, and Q. Li. Algorithms for materialized view design in data warehousing environment. In Proceedings of the 23rd International Conference on Very Large Data Bases, VLDB '97, pages 136-145, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.
[59] M. Zukowski, S. Héman, N. Nes, and P. Boncz. Cooperative scans: Dynamic bandwidth sharing in a DBMS. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB '07, pages 723-734. VLDB Endowment, 2007.

APPENDIX

A. EXPERIMENT RESULTS

A.1 Results of experiments on effect of data sharing on mixed workload

Metric             STATIC   MMF    FASTPF   OPTP
Throughput (/min)  7.80     19.2   19.2     19.2
Avg cache util.    0.00     0.83   0.83     0.83
Hit ratio          0.00     1.00   1.00     1.00
Fairness index     1.00     0.71   0.71     0.71

Table 15: Performance of algorithms on setup G1

Metric             STATIC   MMF    FASTPF   OPTP
Throughput (/min)  7.20     9.00   10.2     16.2
Avg cache util.    0.08     0.81   0.87     0.92
Hit ratio          0.08     0.54   0.68     0.83
Fairness index     1.00     0.83   0.79     0.75

Table 16: Performance of algorithms on setup G2

Metric             STATIC   MMF    FASTPF   OPTP
Throughput (/min)  7.20     7.50   7.80     9.60
Avg cache util.    0.16     0.96   0.98     1.00
Hit ratio          0.19     0.53   0.55     0.67
Fairness index     1.00     0.77   0.66     0.50

Table 17: Performance of algorithms on setup G3

Metric             STATIC   MMF    FASTPF   OPTP
Throughput (/min)  5.40     5.40   5.40     4.80
Avg cache util.    0.24     0.91   0.93     0.96
Hit ratio          0.26     0.43   0.47     0.46
Fairness index     1.00     0.81   0.80     0.38

Table 18: Performance of algorithms on setup G4

A.2 Results of experiments on effect of data sharing on Sales workload

Metric             STATIC   MMF    FASTPF   OPTP
Throughput (/min)  6.00     9.42   9.42     10.08
Avg cache util.    0.34     0.87   0.86     0.88
Hit ratio          0.42     0.67   0.67     0.68
Fairness index     1.00     0.98   0.94     0.84

Table 19: Performance of algorithms on setup G1

Metric             STATIC   MMF    FASTPF   OPTP
Throughput (/min)  5.70     7.20   7.44     8.24
Avg cache util.    0.34     0.93   0.90     0.94
Hit ratio          0.43     0.57   0.61     0.63
Fairness index     1.00     0.96   0.92     0.78

Table 20: Performance of algorithms on setup G2


Metric             STATIC   MMF    FASTPF   OPTP
Throughput (/min)  5.34     7.44   7.38     7.92
Avg cache util.    0.30     0.93   0.93     0.94
Hit ratio          0.38     0.60   0.59     0.58
Fairness index     1.00     0.98   0.92     0.72

Table 21: Performance of algorithms on setup G3

Metric             STATIC   MMF    FASTPF   OPTP
Throughput (/min)  4.20     5.64   5.76     6.00
Avg cache util.    0.28     0.89   0.88     0.92
Hit ratio          0.34     0.50   0.56     0.55
Fairness index     1.00     0.96   0.96     0.99

Table 22: Performance of algorithms on setup G4

A.3 Results of experiments on effect of query arrival rate

Metric             STATIC   MMF    FASTPF   OPTP
Throughput (/min)  5.76     6.42   6.72     6.90
Avg cache util.    0.77     0.93   0.93     0.94
Hit ratio          0.40     0.50   0.49     0.51
Fairness index     1.00     1.00   0.99     0.97

Table 23: Performance of algorithms on setup low

Metric             STATIC   MMF    FASTPF   OPTP
Throughput (/min)  6.12     6.78   6.96     6.96
Avg cache util.    0.72     0.90   0.89     0.90
Hit ratio          0.44     0.49   0.49     0.56
Fairness index     1.00     1.00   0.98     0.87

Table 24: Performance of algorithms on setup mid

Metric             STATIC   MMF    FASTPF   OPTP
Throughput (/min)  5.52     6.12   6.30     6.54
Avg cache util.    0.69     0.90   0.91     0.91
Hit ratio          0.39     0.48   0.48     0.51
Fairness index     1.00     1.00   1.00     0.89

Table 25: Performance of algorithms on setup high

A.4 Results of experiments on effect of number of tenants

Metric             STATIC   MMF     FASTPF   OPTP
Throughput (/min)  7.00     10.00   9.70     10.40
Avg cache util.    0.67     0.93    0.93     0.97
Hit ratio          0.50     0.68    0.68     0.68
Fairness index     1.00     0.98    1.00     1.00

Table 26: Performance of algorithms on setup with 2 tenants

Metric             STATIC   MMF    FASTPF   OPTP
Throughput (/min)  6.00     9.40   9.40     10.10
Avg cache util.    0.34     0.87   0.86     0.88
Hit ratio          0.42     0.67   0.67     0.68
Fairness index     1.00     0.98   0.94     0.84

Table 27: Performance of algorithms on setup with 4 tenants

Metric             STATIC   MMF    FASTPF   OPTP
Throughput (/min)  5.34     8.34   8.22     9.18
Avg cache util.    0.07     0.82   0.82     0.87
Hit ratio          0.26     0.65   0.65     0.68
Fairness index     1.00     0.94   0.91     0.78

Table 28: Performance of algorithms on setup with 8 tenants