omega: flexible, scalable schedulers for large compute ... · cluster machines (10,000s) cluster...
TRANSCRIPT
![Page 1: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/1.jpg)
Google Confidential and Proprietary
Omega: flexible, scalable schedulers for large compute clusters
Malte Schwarzkopf (University of Cambridge Computer Lab)
Andy Konwinski (UC Berkeley)
Michael Abd-El-Malek (Google)
John Wilkes (Google)
original: April 17th, 2013this version: Oct 2013
EuroSys 20131
![Page 2: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/2.jpg)
Google Confidential and Proprietary
http://www.google.com/about/datacenters/inside/locations/
We own and operate data centers around the world
![Page 3: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/3.jpg)
![Page 4: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/4.jpg)
![Page 5: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/5.jpg)
![Page 6: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/6.jpg)
Google Confidential and Proprietary
the scheduling problem
EuroSys 2013
Tasks
Job
Machines
2
![Page 7: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/7.jpg)
Google Confidential and Proprietary
trends observed
Diverse workloads
Increasing cluster sizes
Growing job arrival rates
EuroSys 20133
![Page 8: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/8.jpg)
Google Confidential and Proprietary
why is this a problem?
EuroSys 2013
Cluster machines (10,000s)
Cluster scheduler
Arriving jobs and tasks (1,000s)
scheduling logic
4
60+ seconds!
![Page 9: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/9.jpg)
Google Confidential and Proprietary
why is this a problem?
EuroSys 2013
Cluster machines(10,000s)
Cluster scheduler
Arriving jobs and tasks (1,000s)
scheduling logic
Hence:Break up into independent schedulers.
Increasing complexity!
But: How do we arbitrate resources between schedulers?
![Page 10: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/10.jpg)
Google Confidential and Proprietary
existing approaches
static partitioning
● poor utilization● inflexible
S0 S1 S2
monolithic scheduler
SCHEDULER
● hard to diversify● code growth● scalability bottleneck
EuroSys 20136
![Page 11: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/11.jpg)
Google Confidential and Proprietary
existing approaches
EuroSys 2013
S0 S1 S2
CLUSTER STATE
shared-state
e.g. UCB Mesos [NSDI 2011]
● hoarding● information hiding
S0 S1 S2
RESOURCE MANAGER
two-level
7
![Page 12: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/12.jpg)
Google Confidential and Proprietary
how does omega work?
EuroSys 2013
S0 S1
8
![Page 13: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/13.jpg)
Google Confidential and Proprietary9
how does omega work?
EuroSys 2013
S0 S1
![Page 14: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/14.jpg)
Google Confidential and ProprietaryEuroSys 2013
how does omega work?
EuroSys 2013
S0 S1
![Page 15: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/15.jpg)
Google Confidential and Proprietary11
how does omega work?
EuroSys 2013
S0 S1
Conflict!
![Page 16: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/16.jpg)
Google Confidential and Proprietary
how does omega work?
EuroSys 2013
S0 S1
failure! success!
12
![Page 17: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/17.jpg)
Google Confidential and Proprietary
overview
1) intro & motivation
2) workload characterization
3) comparison of approaches
4) trace-based simulation
5) flexibility case study
EuroSys 201313
![Page 18: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/18.jpg)
Google Confidential and Proprietary
workload: batch/service split
Batch Service
EuroSys 201314
![Page 19: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/19.jpg)
Google Confidential and Proprietary
workload: batch/service split
Jobs/tasks: countsCPU/RAM: resource seconds [i.e. resource job runtime in sec.]
Cluster AMedium sizeMedium utilization
Cluster BLarge sizeMedium utilization
Cluster CMedium (12k mach.)High utilizationPublic trace
EuroSys 201315
![Page 20: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/20.jpg)
Google Confidential and Proprietary
workload: batch/service split
Jobs/tasks: countsCPU/RAM: resource seconds [i.e. resource job runtime in sec.]
Cluster AMedium size
Medium utilization
Cluster BLarge size
Medium utilization
Cluster CMedium size
High utilization
Public trace
EuroSys 2013
TAKEAWAY
Most jobs are batch, but most resources are consumed by service jobs.
16
![Page 21: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/21.jpg)
Google Confidential and Proprietary
workload: job runtime distributions
BatchService
EuroSys 2013
Frac
tion
of jo
bs ru
nnin
g fo
r les
s th
an X
![Page 22: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/22.jpg)
Google Confidential and Proprietary
workload: inter-arrival time distributions
ServiceBatch
Frac
tion
of in
ter-
arriv
al g
aps
less
than
X
EuroSys 2013
BatchService
![Page 23: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/23.jpg)
Google Confidential and Proprietary
workload: batch/service split
EuroSys 2013
Batch jobs Service jobs80th %ile runtime
80th %ile inter-arrival time
12-20 min. 29 days
4-7 sec. 2-15 min.
17
![Page 24: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/24.jpg)
Google Confidential and Proprietary
overview
1) intro & motivation
2) workload characterization
3) comparison of approaches
4) trace-based simulation
5) flexibility case study
EuroSys 201318
![Page 25: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/25.jpg)
Google Confidential and Proprietary
methodology: simulation
simulation using
empirical workload parameters distributions
EuroSys 2013
Code available:
http://code.google.com/p/cluster-scheduler-simulator
19
![Page 26: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/26.jpg)
Google Confidential and Proprietary
parameters
Scheduler decision time
n: num. tasks
decision time
EuroSys 2013
ttask: per-task(usually 0.005s
per task)
(usually 0.1s per job)
20
![Page 27: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/27.jpg)
Google Confidential and Proprietary
scheduling policies
EuroSys 201348
Why might scheduling take 60 seconds?
● Large jobs (tens of thousands of tasks)
● Optimization algorithms (constraints, bin packing with knock-on preemption)
● Picky jobs in a full cluster
● Monte Carlo simulations (fault tolerance)
![Page 28: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/28.jpg)
Open issues: failure toleranceTopology-aware scheduling for concurrent outages● a fault tree
![Page 29: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/29.jpg)
Open issues: failure toleranceTopology-aware schedulingfor concurrent outages● a fault tree● a fault DAG
![Page 30: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/30.jpg)
Open issues: failure toleranceTopology-aware schedulingfor concurrent outages● a fault tree● a partially redundant
fault DAG
![Page 31: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/31.jpg)
Open issues: failure tolerance● real fault, or
lost touch?● time to detect vs.
false positives?● multiple
information sources for correlated failures?
![Page 32: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/32.jpg)
Google Confidential and Proprietary
How do does the shared-state design compare with other architectures?
EuroSys 2013
Experiment details:● all clusters, 7 simulated days● 2 schedulers● varying Service scheduler
Experiment 1:
21
![Page 33: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/33.jpg)
Google Confidential and ProprietaryEuroSys 2013
t job for ALL
jobsttask for ALL jobs
schedulerbusyness
monolithic, uniform decision time (single logic)
time spent scheduling total time
blue => all jobs were scheduled
red => unscheduled jobs remained
22
![Page 34: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/34.jpg)
Google Confidential and ProprietaryEuroSys 2013
t job for se
rvice jobs
ttask for service jobs
schedulerbusyness
monolithic, fast-path batch decision time
head-of-lineblocking
23
![Page 35: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/35.jpg)
Google Confidential and ProprietaryEuroSys 2013
t job for se
rvice jobs
ttask for service jobs
schedulerbusyness
mesos v0.9 (of May 2012)
Ooops...
24
![Page 36: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/36.jpg)
Google Confidential and Proprietary
3. Blue receives tiny offer.
EuroSys 2013
mesos
S1 S2
1. Green receives offer of all available resources.
2. Blue's task finishes.
4. Blue cannot use it.
[repeat many times]
5. Green finishes scheduling.
6. Blue receives large offer.
By now, it has given up.
RESOURCE MANAGER
25
![Page 37: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/37.jpg)
Google Confidential and ProprietaryEuroSys 2013
t job for se
rvice jobs
ttask for service jobs
schedulerbusyness
omega, no optimizations
26
![Page 38: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/38.jpg)
Google Confidential and ProprietaryEuroSys 2013
t job for se
rvice jobs
ttask for service jobs
schedulerbusyness
omega, optimized
27
![Page 39: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/39.jpg)
Google Confidential and ProprietaryEuroSys 2013
omega, optimized
TAKEAWAY
The Omega shared-state model performs as well as a (complex) monolithic multi-path
scheduler.
Monolithic Mesos Omega
28
![Page 40: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/40.jpg)
Google Confidential and Proprietary
Does the shared-state design scale to many schedulers?
EuroSys 2013
Experiment details:● cluster B, 7 simulated days● 2 schedulers● varying job arrival rate and number of schedulers
Experiment 2:
29
![Page 41: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/41.jpg)
Google Confidential and ProprietaryEuroSys 2013
scaling to many schedulers
30
![Page 42: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/42.jpg)
Google Confidential and Proprietary
overview
1) intro & motivation
2) workload characterization
3) comparison of approaches
4) trace-based simulation
5) flexibility case study
EuroSys 201331
![Page 43: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/43.jpg)
Google Confidential and ProprietaryEuroSys 2013
simulator comparison
homogeneousmachines
lightweight simulator
high-fidelity simulator
job parameters empirical distribution
real-world
workload trace
constraints
scheduling algorithm
runtime
not supported supported
random first fitGoogle
algorithm
fast (24h ≃ 5min) slow (24h ≃ 2h)
32
![Page 44: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/44.jpg)
Google Confidential and Proprietary
Experiment details:● cluster C, 29 days● 2 schedulers,
non-uniform decision time● varying Service scheduler
EuroSys 2013
Experiment 3:
How much scheduler interference do we see with real Google workloads?
33
![Page 45: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/45.jpg)
Google Confidential and Proprietary
conflict fraction
EuroSys 2013
num. conflictstotal num. transactions
![Page 46: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/46.jpg)
Google Confidential and Proprietary
scheduler busyness
overhead due to conflicts
EuroSys 201334
sche
dule
r bus
ynes
s
![Page 47: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/47.jpg)
Google Confidential and Proprietary
scheduler busyness
overhead due to conflicts
EuroSys 2013
TAKEAWAY
Interference is higher for real-world settings.
35
![Page 48: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/48.jpg)
Google Confidential and Proprietary
1. Fine-grained conflict detection
optimizations
2. Incremental commits
EuroSys 201336
1st
2nd
#89693
69
![Page 49: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/49.jpg)
Google Confidential and Proprietary
Experiment details:● cluster C, 29 days● 2 schedulers,
non-uniform decision time● varying Service scheduler
EuroSys 2013
Experiment 4:
How do the optimizations affect performance?
37
![Page 50: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/50.jpg)
Google Confidential and Proprietary
impact on scheduler utilization
EuroSys 201338
![Page 51: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/51.jpg)
Google Confidential and Proprietary
practical implications – scheduler utilization
overhead due to conflicts
EuroSys 201339
sche
dule
r bus
ynes
s
![Page 52: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/52.jpg)
Google Confidential and Proprietary
practical implications – scheduler utilization
overhead due to conflicts
EuroSys 2013
TAKEAWAY
We can make simple improvements that significantly improve scalability.
40
![Page 53: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/53.jpg)
Google Confidential and Proprietary
Case study
MapReduce scheduler with opportunistic extra resources
EuroSys 201341
![Page 54: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/54.jpg)
Google Confidential and Proprietary
11
50
200
1000
8000
100 450
3
5
workers in MR jobs
Number of workers [log10]Snapshot over 29 days
Cou
nt o
f job
s w
ith X
wor
kers
![Page 55: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/55.jpg)
Google Confidential and Proprietary
case study: a MapReduce scheduler
EuroSys 2013cluster C, 29 days
Relative speedup [log10]
Frac
tion
of M
R jo
bs w
ith s
peed
up <
X
60% of MapReduces
43
3-4x speedup!
better
![Page 56: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/56.jpg)
Google Confidential and Proprietary
case study: a MapReduce scheduler
EuroSys 2013cluster C, 29 days
Relative speedup [log10]
Frac
tion
of M
R jo
bs w
ith s
peed
up <
X
60% of MapReduces
3-4x speedup!
TAKEAWAY
The Omega approach gives us the flexibility to easily support custom policies.
44
better
![Page 57: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/57.jpg)
Google Confidential and Proprietary
conclusion
TAKEAWAYS
Flexibility and scale require parallelism,
parallel scheduling works if you do it right, and
using shared state is the way to do it right!
EuroSys 201345
![Page 58: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/58.jpg)
Google Confidential and Proprietary
BACKUP SLIDES
EuroSys 2013
![Page 59: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/59.jpg)
Google Confidential and ProprietaryEuroSys 2013
centralized resource-allocator (not fault-tolerant)
per-job “application manager”(MapReduce calls it a “controller”)
YARN
Apache Hadoop YARN: Yet Another Resource Negotiator. ACM SoCC’13
![Page 60: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/60.jpg)
Google Confidential and Proprietary
methodology: simulation
empiricaldistribution
Event-driven simulator
...
Batch
Service
MapReduce
Workload
Experiment configuration
Cluster state
Initial cluster state
EuroSys 2013
Code available:
http://code.google.com/p/cluster-scheduler-simulator
19
![Page 61: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/61.jpg)
Google Confidential and Proprietary
Frac
tion
of jo
bs ru
nnin
g fo
r les
s th
an X
workload: job runtime distributions
BatchService
EuroSys 2013
TAKEAWAY
Service jobs, once scheduled, run for much longer than batch jobs do.
![Page 62: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/62.jpg)
Google Confidential and Proprietary
workload: inter-arrival time distributions
ServiceBatch
Frac
tion
of in
ter-
arriv
al g
aps
less
than
X
EuroSys 2013
TAKEAWAY
Service jobs arrive much less frequently than batch jobs do.
![Page 63: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/63.jpg)
Google Confidential and Proprietary
CLUSTER STATE
Shared state
the omega approach
Optimistic concurrency
S0 S1 S2● Deltas against shared state
● Easy to develop & maintain
● Heterogeneous schedulers OK
● No explicit coordination required
● Interference resolution (not prevention)
● Scales wellEuroSys 2013
![Page 64: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/64.jpg)
Google Confidential and Proprietary
impact on conflict fraction
EuroSys 2013
![Page 65: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/65.jpg)
Google Confidential and Proprietary
case study: a MapReduce scheduler
EuroSys 2013
50% of MapReduces
4.5x speedup
Relative speedup [log10]
cluster A, 29 days
Frac
tion
of M
R jo
bs w
ith s
peed
up <
X
![Page 66: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5efdd38130663f40e654e/html5/thumbnails/66.jpg)
Google Confidential and Proprietary
caveats, or when this won't work well
● aggressive, systematically adverse workloads or schedulers
● small clusters with high overcommit
EuroSys 2013
deal with using out-of-band or post-facto enforcement mechanisms
Possible problems...