![Page 1: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/1.jpg)
Learning Scheduling Algorithms for Data Processing Clusters
Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, Mohammad Alizadeh
![Page 2: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/2.jpg)
Motivation
Scheduling is a fundamental task in computer systems
• Cluster management (e.g., Kubernetes, Mesos, Borg)
• Data analytics frameworks (e.g., Spark, Hadoop)
• Machine learning (e.g., TensorFlow)
An efficient scheduler matters for large datacenters
• A small improvement can save millions of dollars at scale
![Page 3: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/3.jpg)
Designing Optimal Schedulers is Intractable
Must consider many factors for optimal performance:
• Job dependency structure
• Modeling complexity
• Placement constraints
• Data locality
• …
Graphene [OSDI ’16], Carbyne [OSDI ’16], Tetris [SIGCOMM ’14], Jockey [EuroSys ’12], TetriSched [EuroSys ’16], device placement [NIPS ’17], Delay Scheduling [EuroSys ’10], …
Practical deployment:
• Ignore complexity → resort to simple heuristics
• Sophisticated system → complex configurations and tuning
No “one-size-fits-all” solution: the best algorithm depends on the specific workload and system
![Page 4: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/4.jpg)
Can machine learning help tame the complexity of efficient schedulers for data processing jobs?
![Page 5: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/5.jpg)
Decima: A Learned Cluster Scheduler
• Learns workload-specific scheduling algorithms for jobs with dependencies (represented as DAGs)
Job DAG
Job 2 Job 3
Scheduler
Executor 1
Executor m
Executor 2
“Stages”: Identical tasks that can run in parallel
Data dependencies
![Page 6: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/6.jpg)
Decima: A Learned Cluster Scheduler
• Learns workload-specific scheduling algorithms for jobs with dependencies (represented as DAGs)
Job 1
Scheduler
Server 1
Server m
Server 2
![Page 7: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/7.jpg)
Design overview
State
Job DAG 1 Job DAG n
Executor 1 Executor m
Scheduling Agent
Policy Network
Graph Neural Network
Environment
Schedulable Nodes
Objective
Reward
Observation of jobs and cluster status
![Page 8: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/8.jpg)
Number of servers working on this job
Scheduling policy: FIFO
Average Job Completion Time:
225 sec
Demo
![Page 9: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/9.jpg)
Scheduling policy: Shortest-Job-First
Average Job Completion Time:
135 sec
![Page 10: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/10.jpg)
Scheduling policy: Fair
Average Job Completion Time:
120 sec
![Page 11: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/11.jpg)
Shortest-Job-First: Average Job Completion Time: 135 sec
Fair: Average Job Completion Time: 120 sec
![Page 12: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/12.jpg)
Scheduling policy: Decima
Average Job Completion Time:
98 sec
![Page 13: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/13.jpg)
Decima: Average Job Completion Time: 98 sec
Fair: Average Job Completion Time: 120 sec
![Page 14: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/14.jpg)
Contributions
State
Job DAG 1 Job DAG n
Executor 1 Executor m
Scheduling Agent
Policy Network
Graph Neural Network
Environment
Schedulable Nodes
Objective
Reward
Observation of jobs and cluster status
1. First RL-based scheduler for complex data processing jobs
2. Scalable graph neural network to express scheduling policies
3. New learning methods that enable training with online job arrivals
![Page 15: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/15.jpg)
Encode scheduling decisions as actions
Job DAG 1
Job DAG n
Server 1
Server 2
Server 4
Server 3
Server m
Set of identical free executors
![Page 16: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/16.jpg)
Option 1: Assign all Executors in 1 Action
Problem: huge action space
Job DAG 1
Job DAG n
Server 1
Server 2
Server 4
Server 3
Server m
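To get a feel for why a single joint action is a problem, the count below is a rough illustration, assuming identical executors and that an action is any split of the free executors among the schedulable nodes. The function name and the example sizes are hypothetical and not taken from the talk.

```python
from math import comb


def joint_assignments(num_executors: int, num_nodes: int) -> int:
    """Number of ways to split identical executors among distinct nodes.

    Uses the stars-and-bars count C(m + n - 1, n - 1), which assumes every
    executor is assigned to some node and executors are interchangeable.
    """
    return comb(num_executors + num_nodes - 1, num_nodes - 1)


# Hypothetical sizes, chosen only to show how fast the count grows.
for executors, nodes in [(5, 3), (20, 10), (50, 20)]:
    print(executors, nodes, joint_assignments(executors, nodes))
```

Even at these modest sizes the number of distinct joint actions grows combinatorially, which is the difficulty the slide points to.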
![Page 17: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/17.jpg)
Option 2: Assign One Executor Per Action
Problem: long action sequences
Job DAG 1
Job DAG n
Server 1
Server 2
Server 4
Server 3
Server m
![Page 18: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/18.jpg)
Decima: Assign Groups of Executors per Action
Job DAG 1
Job DAG n
Server 1
Server 2
Server 4
Server 3
Server m
Use 3 servers
Use 1 server
Use 1 server
Action = (node, parallelism limit)
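A minimal sketch of how an action of the form (node, parallelism limit) could be applied, assuming the limit caps the total number of executors working on the chosen node. The `Node` class, `apply_action`, `schedule`, and the placeholder policy are hypothetical names introduced for illustration; the learned policy in the talk would take the place of `placeholder_policy`.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    pending_tasks: int   # tasks still waiting for an executor
    assigned: int = 0    # executors currently working on this node


def apply_action(node: Node, parallelism_limit: int, free_executors: int) -> int:
    """Assign executors to `node` until it hits the limit, runs out of
    pending tasks, or the free pool is empty. Returns executors still free."""
    grant = min(parallelism_limit - node.assigned, node.pending_tasks, free_executors)
    grant = max(grant, 0)
    node.assigned += grant
    node.pending_tasks -= grant
    return free_executors - grant


def schedule(nodes, policy, free_executors: int) -> int:
    """Repeatedly query the policy for (node, limit) actions until no
    executors are free or no node has pending work."""
    while free_executors > 0:
        candidates = [n for n in nodes if n.pending_tasks > 0]
        if not candidates:
            break
        node, limit = policy(candidates, free_executors)
        remaining = apply_action(node, limit, free_executors)
        if remaining == free_executors:  # action made no progress; stop
            break
        free_executors = remaining
    return free_executors


def placeholder_policy(candidates, free_executors):
    """Stand-in for a learned policy (assumption): pick the node with the
    most pending tasks and allow it at most 3 additional executors."""
    node = max(candidates, key=lambda n: n.pending_tasks)
    return node, node.assigned + min(3, node.pending_tasks)


nodes = [Node("A", 3), Node("B", 1), Node("C", 4)]
left = schedule(nodes, placeholder_policy, free_executors=5)
print(left, [(n.name, n.assigned) for n in nodes])
```

Each action hands out a group of executors at once, so the number of decisions per scheduling event stays small while each individual decision remains a choice over nodes and a single integer.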
![Page 19: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/19.jpg)
Job DAG 1
Job DAG n
Node features:
• # of tasks
• avg. task duration
• # of servers currently assigned to the node
• are free servers local to this job?
Arbitrary number of jobs
Process Job Information
![Page 20: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/20.jpg)
Graph Neural Network
Job DAG
Score on each node
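The sketch below illustrates one way a graph neural network could turn per-node features into a score on each node: messages flow from children to parents along the DAG, and a softmax over schedulable nodes gives selection probabilities. The single-layer `tanh` transforms, the random untrained weights, and the function names are assumptions made for illustration, not the network used in the talk.

```python
import numpy as np


def embed_nodes(features, children, f_w, g_w):
    """Compute one embedding per DAG node by passing messages child -> parent.

    features: (n, d) array of raw node features.
    children: dict mapping a node index to the indices of its children.
    Assumes nodes are indexed so every child has a larger index than its parent.
    """
    n, d = features.shape
    emb = np.zeros((n, d))
    for v in reversed(range(n)):              # children are embedded before parents
        msg = np.zeros(d)
        for u in children.get(v, []):
            msg += np.tanh(emb[u] @ f_w)      # transform each child's embedding
        emb[v] = np.tanh(msg @ g_w) + features[v]   # transform the sum, add own features
    return emb


def node_probabilities(emb, score_w, schedulable):
    """Score every node, then softmax over schedulable nodes only."""
    scores = emb @ score_w
    masked = np.where(schedulable, scores, -np.inf)
    exp = np.exp(masked - masked[schedulable].max())
    return exp / exp.sum()


# Hypothetical 4-node job DAG: 0 -> {1, 2}, 1 -> 3, 2 -> 3.
# Feature columns: # tasks, avg. task duration, # servers assigned, locality flag.
features = np.array([[4, 2.0, 0, 1],
                     [2, 1.0, 0, 1],
                     [3, 5.0, 1, 0],
                     [1, 0.5, 0, 1]], dtype=float)
children = {0: [1, 2], 1: [3], 2: [3]}
rng = np.random.default_rng(0)
f_w, g_w = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
score_w = rng.normal(size=4)

emb = embed_nodes(features, children, f_w, g_w)
print(node_probabilities(emb, score_w, np.array([False, True, True, False])))
```

Because the same small transforms are reused at every node, the number of parameters does not depend on the size or number of job DAGs.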
![Page 21: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/21.jpg)
Training
Decima agent cluster
Reinforcement learning training
Generate experience data
![Page 22: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/22.jpg)
Time
Number of backlogged jobs
Handle Online Job Arrival
The RL agent has to experience continuous job arrival during training.
→ inefficient if simply feeding long sequences of jobs
Initial random policy
![Page 23: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/23.jpg)
The RL agent has to experience continuous job arrival during training.
→ inefficient if simply feeding long sequences of jobs
Time
Number of backlogged jobs
Waste training time
Initial random policy
Handle Online Job Arrival
![Page 24: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/24.jpg)
The RL agent has to experience continuous job arrival during training.
→ inefficient if simply feeding long sequences of jobs
Time
Number of backlogged jobs
Early reset for initial training
Initial random policy
Handle Online Job Arrival
![Page 25: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/25.jpg)
The RL agent has to experience continuous job arrival during training.
→ inefficient if simply feeding long sequences of jobs
Time
Number of backlogged jobs
As training proceeds, stronger policy keeps the queue stable
Increase the reset time
Curriculum learning
Handle Online Job Arrival
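The curriculum idea above (reset early at first, then increase the reset time) can be sketched as a schedule of episode horizons. Sampling each horizon from an exponential distribution around a growing mean is an assumption here, as are the function name and the constants; the slides only state that the reset time increases as training proceeds.

```python
import random


def reset_time_schedule(num_iterations, initial_mean=500.0, growth=50.0, seed=0):
    """Yield (mean_reset_time, sampled_reset_time) for each training iteration.

    The mean grows linearly, so early episodes are cut short before a weak
    policy builds up a large backlog, and later episodes run longer.
    """
    rng = random.Random(seed)
    mean = initial_mean
    for _ in range(num_iterations):
        # expovariate takes a rate, i.e. 1 / mean
        yield mean, rng.expovariate(1.0 / mean)
        mean += growth


for mean, horizon in reset_time_schedule(5):
    print(f"mean reset time {mean:.0f}, this episode resets at {horizon:.1f}")
```

A training loop would run the simulated cluster until the sampled reset time, update the policy, and then draw the next horizon.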
![Page 26: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/26.jpg)
Variance from Job Sequences
RL agent needs to be robust to the variation in job arrival patterns.
→ huge variance can throw off the training process
![Page 27: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/27.jpg)
Job size
Time t
Future workload #1
Future workload #2
action $a_t$
Score for action $a_t$ = (return after $a_t$) − (average return) = $\sum_{t'=t}^{T} r_{t'} - b(s_t)$
Must consider the entire job sequence to score actions
Variance from Job Sequences
![Page 28: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/28.jpg)
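Average return for trajectories from state $s_t$ with job sequence $z_t, z_{t+1}, \ldots$
Score for action $a_t$ = $\sum_{t'=t}^{T} r_{t'} - b(s_t)$ (state-dependent baseline)
Score for action $a_t$ = $\sum_{t'=t}^{T} r_{t'} - b(s_t, z_t, z_{t+1}, \ldots)$ (input-dependent baseline)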
Input-Dependent Baseline
Broadly applicable to other systems with an external input process: adaptive video streaming, load balancing, caching, robotics with disturbances, …
• Variance reduction for reinforcement learning in input-driven environments. Hongzi Mao, Shaileshh Bojja Venkatakrishnan, Malte Schwarzkopf, Mohammad Alizadeh. International Conference on Learning Representations (ICLR), 2019.
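One simple way to approximate an input-dependent baseline is to replay the same job sequence in several rollouts and average their returns, so that the score of an action is compared only against outcomes under the same future workload. The sketch below assumes equal-length rollouts aligned step by step, which is a simplification; the function name and the toy rewards are hypothetical.

```python
import numpy as np


def advantages_same_input(rollout_rewards):
    """Score actions against a baseline computed from the same job sequence.

    rollout_rewards: (num_rollouts, T) rewards, where every rollout was
    driven by the SAME job arrival sequence (assumption of this sketch).
    Returns an array of the same shape: return-to-go minus the average
    return-to-go across rollouts at each step.
    """
    rewards = np.asarray(rollout_rewards, dtype=float)
    # Return-to-go: sum of rewards from step t to the end of the episode.
    returns = np.cumsum(rewards[:, ::-1], axis=1)[:, ::-1]
    baseline = returns.mean(axis=0, keepdims=True)
    return returns - baseline


# Two toy rollouts on one job sequence (made-up rewards).
print(advantages_same_input([[-1, -2, -3],
                             [-3, -2, -1]]))
```

Averaging within one job sequence removes the part of the return that is explained by the workload itself, leaving the part attributable to the actions.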
![Page 29: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/29.jpg)
• 20 TPC-H queries sampled at random; input sizes: 2, 5, 10, 20, 50, 100 GB
• Decima trained on simulator; tested on real Spark cluster
Decima improves average job completion time by 21% to 3.1× over baseline schemes
Decima vs. Baselines: Batched Arrivals
![Page 30: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/30.jpg)
Decima with Continuous Job Arrivals
1000 jobs arrive as a Poisson process with avg. inter-arrival time = 25 sec
Decima achieves 28% lower average JCT than the best heuristic, and 2× better JCT in overload
Better
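The arrival process described on this slide can be reproduced in a few lines: a Poisson process has exponentially distributed gaps between arrivals, so cumulative sums of exponential samples give the arrival times. The function name and seed are hypothetical; the job count and the 25-second mean come from the slide.

```python
import random


def poisson_arrival_times(num_jobs=1000, mean_interarrival=25.0, seed=0):
    """Arrival times (in seconds) of a Poisson process.

    Inter-arrival gaps are drawn from an exponential distribution whose
    mean is `mean_interarrival`.
    """
    rng = random.Random(seed)
    now = 0.0
    times = []
    for _ in range(num_jobs):
        now += rng.expovariate(1.0 / mean_interarrival)  # rate = 1 / mean
        times.append(now)
    return times


arrivals = poisson_arrival_times()
print(f"{len(arrivals)} jobs, last arrival at {arrivals[-1]:.0f} sec")
```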
![Page 31: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/31.jpg)
Understanding Decima
Tuned weighted fair
Decima
![Page 32: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/32.jpg)
Industrial trace (Alibaba): 20,000 jobs from production cluster
Multi-resource requirement: CPU cores + memory units
Flexibility: Multi-Resource Scheduling
![Page 33: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/33.jpg)
• Impact of each component in the learning algorithm
• Generalization to different workloads
• Training and inference speed
• Handling missing features
• Optimality gap
Other Evaluations
![Page 34: Learning Scheduling Algorithms for Data Processing Clusters · Observation of jobs and cluster status 1. First RL-based scheduler for complex data processing jobs 2. Scalable graph](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec5f0fcd2b31741e6002ca9/html5/thumbnails/34.jpg)
Summary
• Decima uses reinforcement learning to generate workload-specific scheduling algorithms
• Decima employs curriculum learning and variance reduction to enable training with stochastic job arrivals
• Decima leverages a scalable graph neural network to process an arbitrary number of job DAGs
• Decima outperforms existing heuristics and is flexible to apply to other applications
http://web.mit.edu/decima/