


Online Scheduling of Spark Workloads with Mesos using Different Fair Allocation Algorithms*

Y. Shan, A. Jain, G. Kesidis, B. Urgaonkar
School of EECS, PSU, State College, PA
{yxs182,axj182,gik2,buu1}@psu.edu

J. Khamse-Ashari and I. Lambadaris
SCE Dept, Carleton Univ., Ottawa, Canada
{jalalkhamseashari,ioannis}@sce.carleton.ca

August 13, 2018

1 Introduction

In the following, we present an illustrative example and experimental results comparing fair schedulers allocating resources (indexed r) from multiple servers (indexed i, with resource capacities c_{i,r}) to distributed application frameworks (indexed n, with resource demands per task d_{n,r}). Resources are allocated so that at least one resource r is exhausted in every server.

Schedulers considered include DRF (DRFH) and Best-Fit DRF (BF-DRF) [1, 11], TSF [10], and PS-DSF [2]. We also consider server selection under Randomized Round Robin (RRR) and based on servers' residual (unreserved) resources. In the following, we consider cases with frameworks of equal priority and without server-preference constraints. We first give typical results of an illustrative numerical study and then give typical results of a study involving Spark workloads on Mesos, which we have modified and open-sourced to prototype different schedulers.

2 Illustrative numerical study of fair scheduling by progressive filling

In this section, we consider the following typical example of our numerical study with two heterogeneous distributed application frameworks (n = 1, 2) having resource demands per unit workload:

d_{1,1} = 5, d_{1,2} = 1, d_{2,1} = 1, d_{2,2} = 5;  (1)

* This work supported in part by an NSF CSR research grant, a gift from IBM, and a gift of AWS credits.


arXiv:1803.00922v3 [cs.PF] 20 Apr 2018


sched. \ (n,i)   (1,1)   (1,2)   (2,1)   (2,2)   total
DRF [1, 11]       6.55    4.69    4.69    6.55   22.48
TSF [10]          6.5     4.7     4.7     6.5    22.4
RRR-PS-DSF       19.44    1.15    1.07   19.42   41.08
BF-DRF [11]      20       2       0      19      41
PS-DSF [2]       19       0       2      20      41
rPS-DSF          19       2       2      19      42

Table 1: Workload allocations x_{n,i} for different schedulers under progressive filling for the illustrative example with parameters (1) and (2). Averaged values over 200 trials reported for the first three schedulers operating under RRR server selection.

and two heterogeneous servers (i = 1, 2) having two different resources with capacities:

c_{1,1} = 100, c_{1,2} = 30, c_{2,1} = 30, c_{2,2} = 100.  (2)

For DRF and TSF, the servers i are chosen in round-robin fashion, where the server order is randomly permuted in each round; DRF under such randomized round-robin (RRR) server selection is the default Mesos scheduler, cf. the next section. One can also formulate PS-DSF under RRR, wherein RRR selects the server and the PS-DSF criterion only selects the framework for that server. Frameworks n are chosen by progressive filling with integer-valued tasking (x), i.e., whole tasks are scheduled.
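As a concrete sketch of this procedure, the following simplified Python simulation performs progressive filling with integer tasking under RRR server selection, using pooled-capacity dominant shares for the DRF criterion. The function and variable names are ours, not Mesos', and this is an illustration of the mechanism rather than the prototype implementation.

```python
import random

# Parameters (1) and (2): per-task demands d[n] and server capacities c[i].
d = {1: [5, 1], 2: [1, 5]}
c = {1: [100, 30], 2: [30, 100]}

def drf_progressive_fill(d, c, seed=None):
    """Progressive filling with integer tasking: repeatedly offer one whole
    task to the framework with the smallest dominant share (pooled-capacity
    DRF), placing it on a server chosen by randomized round robin (RRR)."""
    rng = random.Random(seed)
    residual = {i: list(cap) for i, cap in c.items()}
    x = {(n, i): 0 for n in d for i in c}
    pooled = [sum(cap[r] for cap in c.values()) for r in range(2)]
    while True:
        # Dominant share of each framework over the pooled capacities.
        share = {n: max(sum(x[n, i] for i in c) * d[n][r] / pooled[r]
                        for r in range(2)) for n in d}
        progressed = False
        for n in sorted(d, key=share.get):      # lowest dominant share first
            servers = list(c)
            rng.shuffle(servers)                # RRR server selection
            for i in servers:
                if all(residual[i][r] >= d[n][r] for r in range(2)):
                    for r in range(2):
                        residual[i][r] -= d[n][r]
                    x[n, i] += 1
                    progressed = True
                    break
            if progressed:
                break
        if not progressed:   # no framework fits a whole task on any server
            return x, residual
```

At termination, no framework can fit one more whole task on any server, which is the plain-text analogue of "at least one resource is exhausted in every server" under integer tasking.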

Numerical results for scheduled workloads for this illustrative example are given in Tables 1 & 2, and unused resources are given in Tables 3 and 4. 200 trials were performed for DRF, TSF and PS-DSF under RRR server selection, so using Table 2 we can obtain confidence intervals for the averaged quantities given in Table 1 for schedulers under RRR. For example, the 95% confidence interval for the task allocation of the first framework on the second server (i.e., (n, i) = (1, 2)) under TSF is

(6.5 − 2 · 0.46/√200, 6.5 + 2 · 0.46/√200) = (6.43, 6.57).
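This interval can be checked directly: it is a two-standard-error interval around the sample mean, using the sample standard deviation from Table 2 and the 200 trials.

```python
from math import sqrt

# Sample mean (Table 1) and sample standard deviation (Table 2), 200 trials.
mean, sd, trials = 6.5, 0.46, 200
half_width = 2 * sd / sqrt(trials)   # two standard errors of the mean
lo, hi = mean - half_width, mean + half_width
print(round(lo, 2), round(hi, 2))    # 6.43 6.57
```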

Note how PS-DSF's performance under RRR is comparable to that when frameworks and servers are jointly selected [2], and with low variance in allocations. We also found that RRR-rPS-DSF performed just as well as rPS-DSF over 200 trials.

We found task efficiencies improve using residual forms of the fairness criterion. For example, the residual PS-DSF (rPS-DSF) criterion is

K̃_{n,j}(x_j) = x_n max_r  d_{n,r} / [ φ_n ( c_{j,r} − Σ_{n'} x_{n',j} d_{n',r} ) ].

That is, this criterion makes scheduling decisions by progressive filling using current residual (unreserved) capacities based on the current allocations x. From Table 1, we see the improvement is modest for the case of PS-DSF.
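Read directly off the formula, the criterion can be rendered in code as follows. This is our own sketch, not the prototype implementation: the helper name is ours, φ_n is taken as a scalar per-framework weight, and a server with no positive residual in some needed resource is given an infinite score (i.e., never selected).

```python
def rps_dsf_score(n, j, x, d, c, phi):
    """rPS-DSF criterion for framework n on server j: x_n times the maximum,
    over resources r, of d[n][r] divided by phi[n] times server j's residual
    capacity in r under the current integer allocations x[(n', j)]."""
    R = range(len(c[j]))
    residual = [c[j][r] - sum(x.get((m, j), 0) * d[m][r] for m in d) for r in R]
    x_n = sum(x.get((n, i), 0) for i in c)   # framework n's total tasks
    return x_n * max((d[n][r] / (phi[n] * residual[r])
                      for r in R if residual[r] > 0), default=float('inf'))
```

Under progressive filling, the scheduler would repeatedly grant one task to a framework of minimal score, which is how the residual capacities enter each decision.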



sched. \ (n,i)   (1,1)   (1,2)   (2,1)   (2,2)
DRF [1, 11]       2.31    0.46    0.46    2.31
TSF [10]          2.29    0.46    0.46    2.29
RRR-PS-DSF        0.59    0.99    1       0.49

Table 2: Sample standard deviation of allocations x_{n,i} for different schedulers under RRR server selection, over 200 trials.

sched. \ (i,r)   (1,1)   (1,2)   (2,1)   (2,2)
DRF [11]         62.56    0       0      62.56
TSF [10]         62.8     0       0      62.8
RRR-PS-DSF        1.8     4.6     4.86    1.92
BF-DRF [11]       0      10       1       3
PS-DSF [2]        3       1      10       0
rPS-DSF           3       1       1       3

Table 3: Unused capacities c_{i,r} − Σ_n x_{n,i} d_{n,r} for different schedulers under progressive filling for the illustrative example with parameters (1) and (2). Averaged values over 200 trials reported under RRR server selection.

sched. \ (i,r)   (1,1)   (1,2)   (2,1)   (2,2)
DRF [1, 11]      11.09    0       0      11.09
TSF [10]         10.99    0       0      10.99
RRR-PS-DSF        0.59    0.99    1       0.49

Table 4: Sample standard deviation of unused capacities c_{i,r} − Σ_n x_{n,i} d_{n,r} for different schedulers under RRR server selection over 200 trials.



Improvements are also obtained by best-fit server selection. For example, best-fit DRF (BF-DRF) first selects framework n by DRF and then selects the server whose residual capacity most closely matches the framework's resource demands {d_{n,r}}_r [11].
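The server-selection half of this can be sketched as below. The "closeness" metric here is Euclidean distance between the residual-capacity vector and the demand vector, which is our illustrative choice; the metric in [11] may differ.

```python
def best_fit_server(demand, residual):
    """Among servers whose residual capacities can still fit one task of
    `demand`, pick the one whose residual vector most closely matches the
    demand vector (Euclidean distance; an illustrative choice of metric)."""
    R = range(len(demand))
    feasible = [i for i, cap in residual.items()
                if all(cap[r] >= demand[r] for r in R)]
    if not feasible:
        return None   # no server can fit another whole task
    return min(feasible,
               key=lambda i: sum((residual[i][r] - demand[r]) ** 2 for r in R))
```

For the capacities (2), a fresh framework with demands (5, 1) is steered to server 1 and one with demands (1, 5) to server 2, i.e., best fit packs each framework onto the server whose capacity profile matches its bottleneck.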

3 Online experiments with Mesos

The execution traces presented in the figures are typical of the multiple trials we ran.

3.1 Introduction including background on Mesos

The Mesos master (including its resource allocator, see [4]) works in a dynamic/online environment with churn in the distributed computing/application frameworks it manages. When all or part of a Mesos agent¹ becomes available, a framework is selected by Mesos and a resource allocation for it is performed. The framework accepts the offered allocation in whole or in part. When a framework's tasks are completed, the Mesos master may be notified that the corresponding resources of the agents are released, and then the master will make new allocation decisions for existing or new frameworks. Newly arrived frameworks with no allocations are given priority. We consider two implementations of fair resource scheduling algorithms in Mesos.

In oblivious² allocation, the allocator is not aware of the resource demands of the frameworks³. A framework running an uncharacterized application may be configured to accept all resources offered to it.

In workload-characterized allocation, each active framework n simply informs the Mesos allocator of its resource demands per task, {d_{n,r}}_r. The Mesos allocator selects a framework and allocates a single task's worth of resources from a given agent with unassigned (released) resources.

In the following, we compare different scheduling algorithms implemented as the Mesos allocator. Given a pool of agents with unused resources, PS-DSF [2], rPS-DSF and best-fit (BF) [11] allocations will depend on the particular agents. When a Mesos framework (Spark job) completes, its resources from different agents are released. We have observed that at times the Mesos allocator sequentially schedules agents with available resources (i.e., the agents are released according to some order), while at other times the released agents are scheduled as a pool, so that the agent-selection mechanism would be relevant. Initially, the agents are always scheduled by the Mesos allocator as a pool.

In our Mesos implementation, the workflow of these two different allocations is shown in Figure 1.

3.2 Running Spark on Mesos

In our experiments, the frameworks operate under the distributed programming platform Spark in client mode. Each Spark job (Mesos framework) is divided into multiple tasks (threads). Multiple Spark executors will be allocated for a Spark job. The executors can simultaneously run up to a certain maximum number of tasks, depending on how many cores the executor has and how many cores are required per task; when a task completes, the executor informs the driver to request another, i.e., executors pull in work. Each executor is a Mesos task in the default "coarse-grained" mode [7], and an executor resides in a container of a Mesos agent [3]. Plural executors can simultaneously reside on a single Mesos agent. An executor usually terminates when the entire Spark job terminates [6]. When starting a Spark job, the resources required to start an executor (d) and the maximum number of executors that can be created to execute the tasks of the job may be specified. The Spark driver will attempt to use as much of its allocated resources as possible.

Footnotes:
1. An agent is a.k.a. server, slave or worker, and is typically a virtual machine.
2. Called "coarse grain" in Mesos.
3. Indeed, the frameworks themselves may not be aware.

Figure 1: Flowchart of Coarse-grained/Oblivious and Fine-grained/Workload-Characterized Allocation.

In a typical configuration, Spark employs three classical parallel-computing techniques: jobs are divided into microtasks (typically based on a fine partition of the dataset on which work is performed); when underbooked, executors pull work (tasks) from the driver; and the driver employs time-outs at program barriers⁴ to detect straggler tasks and relaunch them on new executors (speculative execution) [8]. In this way, Spark can reduce (synchronization) delays at barriers while not needing to know either the execution speed of the executors or the resources required to achieve a particular execution time of the tasks. On the other hand, microtasking does incur potentially significant overhead compared to an approach with larger tasks whose resource needs have been better characterized, i.e., as {d_{n,r}} resources per task⁵.

3.3 Experiment Configuration

In our experiments, there are two Spark submission groups ("roles" in Mesos' jargon): group Pi submits jobs that estimate π = 3.1415... via Monte Carlo simulation; group WordCount submits word-count jobs for a 700 MB+ document. The executors of Pi require 2 CPUs and about 2 GB memory (Pi is CPU bottlenecked), while those of WordCount require 1 CPU and about 3.5 GB memory (WordCount is memory bottlenecked). Each group has five job submission queues, which means there could be ten jobs running on the cluster at the same time. Each queue initially has fifty jobs. Again, each job is divided into tasks, and tasks are run in plural Spark executors (Mesos tasks) running on different Mesos agents.

The Mesos agents run on six servers (AWS c3.2xlarge virtual-machine instances), two each of three types in our cluster. A type-1 server provides 4 CPUs and 14 GB memory, so it would be well utilized by 4 WordCount tasks. A type-2 server provides 8 CPUs and 8 GB memory, so it would be well utilized by 4 Pi tasks. A type-3 server provides 6 CPUs and 11 GB memory, so it would be well utilized by 2 Pi and 2 WordCount tasks. The Mesos master operates in a c3.2xlarge with 8 cores and 15 GB memory.
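The "well utilized" claims can be verified arithmetically. Below we take the nominal per-executor demands exactly (Pi: 2 CPUs, 2 GB; WordCount: 1 CPU, 3.5 GB, per Sec. 3.3, where the memory figures are stated as approximate).

```python
# Nominal per-executor demands (CPUs, GB memory), per Sec. 3.3.
pi, wc = (2, 2.0), (1, 3.5)

def used(n_pi, n_wc):
    """Total (CPUs, GB) consumed by n_pi Pi and n_wc WordCount executors."""
    return (n_pi * pi[0] + n_wc * wc[0], n_pi * pi[1] + n_wc * wc[1])

assert used(0, 4) == (4, 14.0)   # type-1 server: 4 CPUs, 14 GB
assert used(4, 0) == (8, 8.0)    # type-2 server: 8 CPUs, 8 GB
assert used(2, 2) == (6, 11.0)   # type-3 server: 6 CPUs, 11 GB
```

In each case the stated mix exhausts both resources of the matching server type, so each server type is fully packed by exactly one of the three mixes.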

The experiment setup is illustrated in Figure 2.

3.4 Prototype implementation

We modified the allocator module of Mesos (version 1.5.0) to use different scheduling criteria; in particular, criteria depending on the agent/server, so that agents are not necessarily selected in RRR fashion when a pool of them is available. We also modified the driver in Spark to pass on a framework n's resource needs per task ({d_{n,r}}) in workload-characterized mode. Our code is available here [5, 9].

Footnotes:
4. Where parallel-executed tasks all need to complete before the program can proceed.
5. What may be called "coarse grain" in the context of Spark.

Figure 2: Experiment setup.

3.5 Experimental Results for Different Schedulers

We ran the same total workload for the four different Mesos allocators, all under Randomized Round-Robin (RRR) agent selection: oblivious DRF (the Mesos default), oblivious PS-DSF, workload-characterized DRF, and workload-characterized PS-DSF. (In this section, we drop the "RRR" qualifier.) A summary of our results is that overall execution time is improved under workload characterization and under allocations that are agent/server specific.

3.5.1 DRF vs. PS-DSF in oblivious mode

The resource allocations under different fairness criteria are shown in Figure 3. It can be seen that PS-DSF achieves higher resource utilization than DRF because it "packs" tasks better into heterogeneous servers. As a result, the entire job-batch under PS-DSF finishes earlier. Also note that at the end of the experiment, there is a sudden drop in the allocated memory percentage. This is because the memory-intensive Spark WordCount jobs finish earlier and CPU is the bottleneck resource for the remaining Spark Pi jobs.

3.5.2 Schedulers in workload-characterized mode

The experimental results under workload-characterized mode, shown in Fig. 4, are consistent with their oblivious counterparts: PS-DSF has higher resource utilization than DRF. Also note that the resource utilizations in workload-characterized mode have less variance than those in oblivious mode, which will be explained in Sec. 3.5.3.



Figure 3: Comparison between DRF and PS-DSF in oblivious mode.

In Figure 5, we compare TSF [10] under RRR⁶, rPS-DSF (under RRR), and BF-DRF (again, "best fit" is an agent-selection mechanism when there is a pool of agents to choose from). From the figure, the execution times of BF-DRF and rPS-DSF are comparable to that of PS-DSF (but cf. Section 3.7) and shorter than that of TSF (which is comparable to DRF).

3.5.3 Oblivious versus Workload Characterized modes

We also compared oblivious and workload-characterized allocation for the same scheduling algorithm. Again, when a Spark job finishes, its executors may not simultaneously release resources, from the Mesos allocator's point of view. So under oblivious allocation, it is possible that multiple Spark frameworks share the same server, as is typically the case under workload-characterized scheduling. However, oblivious allocation is a coarse-grained enforcement of progressive filling, where the resources are less evenly distributed among the frameworks: some frameworks may receive the entire remaining resources on an agent in a single offer, leaving nothing available for others. From Figures 6-7, note how under oblivious allocation the amount of allocated resources drops more sharply when a Spark job ends, and the variance of utilized resources under oblivious allocation is larger than under workload-characterized allocation. Consequently, the entire job-batch tends to finish sooner under the workload-characterized allocator, as we see in Figures 6-7.

Footnote:
6. Note that [10] also describes experimental results for a Mesos implementation of TSF.

Figure 4: Comparison between DRF and PS-DSF in workload-characterized mode.

3.6 With Homogeneous Servers

We also did experiments in a cluster with six type-3 servers (6 CPUs, 11 GB memory). In Figure 8 we show that DRF and PS-DSF have nearly identical performance with homogeneous servers.



Figure 5: Comparison among TSF, Best-fit DRF and rPS-DSF (in the workload-characterized mode).

3.7 BF-DRF versus rPS-DSF

Finally, with a different experimental set-up, we compare BF-DRF (which first selects the framework and then selects the "best fit" from among available agents/servers) and a representative of a family of server-specific schedulers, rPS-DSF under RRR. Consider a case where there are three servers, one of each of the above server types (types 1-3).

Suppose under a current allocation, we have one Spark-Pi and two Spark-WordCount executors on the type-1 server, two Spark-Pi and one Spark-WordCount executor on the type-2 server, and two Spark-Pi and two Spark-WordCount executors on the type-3 server. So, whenever a Pi or WordCount framework releases its executor's resources back to the cluster, its DRF "score" is reduced, so the scheduler will always send a resource offer to the same client framework in this scenario. On the other hand, rPS-DSF will make a decision considering the amount of (remaining) resources on the server, and so will make a more efficient allocation.
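This reasoning can be checked numerically with a pooled-capacity DRF dominant share, a simplification of the allocator's actual bookkeeping; the per-executor demands are the nominal values from Sec. 3.3 and the capacities are one server of each type.

```python
# Pooled capacities of one server of each type: (4+8+6) CPUs, (14+8+11) GB.
CAP = (18, 33.0)
pi, wc = (2, 2.0), (1, 3.5)      # per-executor (CPUs, GB), nominal values

def dominant_share(n_exec, demand):
    """Pooled-capacity DRF dominant share of a framework holding n_exec
    executors with the given per-executor demand vector."""
    return max(n_exec * demand[r] / CAP[r] for r in range(2))

before = dominant_share(5, pi)   # Pi holds 1 + 2 + 2 = 5 executors
wc_now = dominant_share(5, wc)   # WordCount holds 2 + 1 + 2 = 5 executors
after = dominant_share(4, pi)    # Pi has just released one executor
# The release drops Pi's dominant share below WordCount's, so DRF offers
# the freed resources straight back to Pi, regardless of which server freed up.
assert after < wc_now < before
```

The same inequality holds symmetrically when WordCount releases an executor, which is why DRF keeps returning each freed slot to the releasing framework in this scenario.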

We illustrate this with the example of Figure 9. In this experiment, we let each group submit their Spark jobs through five queues with 20 jobs each. To create the above scenario, instead of exposing all the servers to the client frameworks, we register servers one by one from type-1 to type-3. From the figure, note that both rPS-DSF and BF-DRF have an initially inefficient memory allocation, but rPS-DSF is able to adapt and quickly increase its memory efficiency, while BF-DRF does not.

Figure 6: Comparison between oblivious and workload-characterized modes under DRF.

Figure 7: Comparison between oblivious and workload-characterized modes under PS-DSF.

Figure 8: Workload-characterized DRF and PS-DSF with homogeneous servers.

Figure 9: Performance of Best-fit DRF and rPS-DSF given initial suboptimal allocation.

References

[1] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource fairness: Fair allocation of multiple resource types. In Proc. USENIX NSDI, 2011.

[2] J. Khamse-Ashari, I. Lambadaris, G. Kesidis, B. Urgaonkar, and Y.Q. Zhao. Per-Server Dominant-Share Fairness (PS-DSF): A Multi-Resource Fair Allocation Mechanism for Heterogeneous Servers. In Proc. IEEE ICC, Paris, May 2017.

[3] Apache Mesos - Containerizers. http://mesos.apache.org/documentation/latest/containerizer-internals/.

[4] Apache Mesos - Mesos Architecture. http://mesos.apache.org/documentation/latest/architecture/.

[5] Mesos multi-scheduler. https://github.com/yuquanshan/mesos/tree/multi-scheduler.

[6] Apache Spark - Dynamic Resource Allocation. https://spark.apache.org/docs/latest/job-scheduling.html.

[7] Apache Spark - Running Spark on Mesos. https://spark.apache.org/docs/latest/running-on-mesos.html.

[8] Apache Spark - Spark Configuration. https://spark.apache.org/docs/latest/configuration.html.

[9] Spark with resource demand vectors. https://github.com/yuquanshan/spark/tree/d-vector.

[10] W. Wang, B. Li, B. Liang, and J. Li. Multi-resource fair sharing for datacenter jobs with placement constraints. In Proc. Supercomputing, Salt Lake City, Utah, 2016.

[11] W. Wang, B. Liang, and B. Li. Multi-resource fair allocation in heterogeneous cloud computing systems. IEEE Transactions on Parallel and Distributed Systems, 26(10):2822-2835, Oct. 2015.