jockey - eurosys presentationcs.brown.edu/~adf/work/eurosys2012-talk.pdfjockey guaranteed job...
TRANSCRIPT
Jockey Guaranteed Job Latency in
Data Parallel Clusters
Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric Boutin, and Rodrigo Fonseca
2
DATA PARALLEL CLUSTERS
3
DATA PARALLEL CLUSTERS
4
DATA PARALLEL CLUSTERS Predictability
5
DATA PARALLEL CLUSTERS Deadline
6
DATA PARALLEL CLUSTERS Deadline
7
VARIABLE LATENCY
8
VARIABLE LATENCY
0 5 10 15 20 25 30 35 40
latency [minutes]
9
VARIABLE LATENCY
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20 25 30 35 40
CDF
latency [minutes]
10
VARIABLE LATENCY
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20 25 30 35 40
CDF
latency [minutes]
11
VARIABLE LATENCY
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20 25 30 35 40
CDF
latency [minutes]
12
VARIABLE LATENCY
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20 25 30 35 40
CDF
latency [minutes]
13
VARIABLE LATENCY
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20 25 30 35 40
CDF
latency [minutes]
14
VARIABLE LATENCY
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20 25 30 35 40
CDF
latency [minutes]
4.3x
15
Why does latency vary?
1. Pipeline complexity 2. Noisy execution environment
Cosmos
16
MICROSOFT’S DATA PARALLEL CLUSTERS
Cosmos
17
MICROSOFT’S DATA PARALLEL CLUSTERS
• CosmosStore
Cosmos
18
MICROSOFT’S DATA PARALLEL CLUSTERS
• CosmosStore • Dryad
Cosmos
19
MICROSOFT’S DATA PARALLEL CLUSTERS
• CosmosStore • Dryad • SCOPE
Cosmos
20
MICROSOFT’S DATA PARALLEL CLUSTERS
• CosmosStore • Dryad • SCOPE
21
DRYAD’S DAG WORKFLOW
Cosm
os Cl
uste
r
22
DRYAD’S DAG WORKFLOW
Cosm
os Cl
uste
r
23
DRYAD’S DAG WORKFLOW
Pipeline
Job
24
DRYAD’S DAG WORKFLOW
Deadline
Deadline
Deadline Deadline
Deadline
25
DRYAD’S DAG WORKFLOW
Deadline
Deadline
Deadline Deadline
Deadline
26
Stage
DRYAD’S DAG WORKFLOW
Job
27
Stage
DRYAD’S DAG WORKFLOW
Tasks
Job
28
29
EXPRESSING PERFORMANCE TARGETS
Priorities?
30
EXPRESSING PERFORMANCE TARGETS
Priorities? Not expressive enough
Weights?
31
EXPRESSING PERFORMANCE TARGETS
Priorities? Not expressive enough
Weights? Difficult for users to set
Utility curves?
32
EXPRESSING PERFORMANCE TARGETS
Priorities? Not expressive enough
Weights? Difficult for users to set
Utility curves? Capture deadline & penalty
33
OUR GOAL
34
OUR GOAL
Maximize utility
35
OUR GOAL
Maximize utility while minimizing resources
36
OUR GOAL
Maximize utility while minimizing resources
by dynamically adjusting the allocation
Jockey 37
Jockey 38
• Large clusters
Jockey 39
• Large clusters • Many users
Jockey 40
• Large clusters • Many users • Prior execution
41
JOCKEY – MODEL
f( job state, allocation) -> remaining run time
42
JOCKEY – MODEL
f( job state, allocation) -> remaining run time
43
JOCKEY – MODEL
f( job state, allocation) -> remaining run time
44
JOCKEY – MODEL
f( job state, allocation) -> remaining run time
45
JOCKEY – CONTROL LOOP
46
JOCKEY – CONTROL LOOP
47
JOCKEY – CONTROL LOOP
48
JOCKEY – MODEL
f( job state, allocation) -> remaining run time
49
JOCKEY – MODEL
f(progress, allocation) -> remaining run time
f( job state, allocation) -> remaining run time
50
JOCKEY – PROGRESS INDICATOR
51
JOCKEY – PROGRESS INDICATOR
52
JOCKEY – PROGRESS INDICATOR
53
JOCKEY – PROGRESS INDICATOR
total running
54
JOCKEY – PROGRESS INDICATOR
total running +
total queuing
55
JOCKEY – PROGRESS INDICATOR
stage
total running +
total queuing
56
JOCKEY – PROGRESS INDICATOR
total running +
total queuing
total running +
total queuing
total running +
total queuing
Stage 1
Stage 2
Stage 3
+
+
57
JOCKEY – PROGRESS INDICATOR
total running +
total queuing
total running +
total queuing
total running +
total queuing
# complete total tasks
# complete total tasks
# complete total tasks
Stage 1
Stage 2
Stage 3
+
+
58
JOCKEY – PROGRESS INDICATOR
59
JOCKEY – PROGRESS INDICATOR
0 10 20 30 40 50 60
time [min]
60
JOCKEY – PROGRESS INDICATOR
0
20
40
60
80
100
0 10 20 30 40 50 60
job
prog
ress
time [min]
61
JOCKEY – PROGRESS INDICATOR
0
20
40
60
80
100
0 10 20 30 40 50 60
job
prog
ress
time [min]
62
JOCKEY – PROGRESS INDICATOR
0
20
40
60
80
100
0
20
40
60
80
100
0 10 20 30 40 50 60
job
prog
ress
estim
ated
job
com
plet
ion
[min
]
time [min]
63
JOCKEY – PROGRESS INDICATOR
0
20
40
60
80
100
0
20
40
60
80
100
0 10 20 30 40 50 60
job
prog
ress
estim
ated
job
com
plet
ion
[min
]
time [min]
64
JOCKEY – CONTROL LOOP
65
JOCKEY – CONTROL LOOP
1% complete
2% complete
3% complete
4% complete
5% complete
Job model
66
JOCKEY – CONTROL LOOP
10 nodes 20 nodes 30 nodes
1% complete
2% complete
3% complete
4% complete
5% complete
Job model
67
JOCKEY – CONTROL LOOP
10 nodes 20 nodes 30 nodes
1% complete 60 minutes 40 minutes 25 minutes
2% complete 59 minutes 39 minutes 24 minutes
3% complete 58 minutes 37 minutes 22 minutes
4% complete 56 minutes 36 minutes 21 minutes
5% complete 54 minutes 34 minutes 20 minutes
Job model
68
JOCKEY – CONTROL LOOP
10 nodes 20 nodes 30 nodes
1% complete 60 minutes 40 minutes 25 minutes
2% complete 59 minutes 39 minutes 24 minutes
3% complete 58 minutes 37 minutes 22 minutes
4% complete 56 minutes 36 minutes 21 minutes
5% complete 54 minutes 34 minutes 20 minutes
Job model
Deadline: 50 min.
Completion: 1%
69
JOCKEY – CONTROL LOOP
Job model
Deadline: 50 min.
Completion: 1%
10 nodes 20 nodes 30 nodes
1% complete 60 minutes 40 minutes 25 minutes
2% complete 59 minutes 39 minutes 24 minutes
3% complete 58 minutes 37 minutes 22 minutes
4% complete 56 minutes 36 minutes 21 minutes
5% complete 54 minutes 34 minutes 20 minutes
70
JOCKEY – CONTROL LOOP
Job model 10 nodes 20 nodes 30 nodes
1% complete 60 minutes 40 minutes 25 minutes
2% complete 59 minutes 39 minutes 24 minutes
3% complete 58 minutes 37 minutes 22 minutes
4% complete 56 minutes 36 minutes 21 minutes
5% complete 54 minutes 34 minutes 20 minutes
Deadline: 40 min.
Completion: 3%
71
JOCKEY – CONTROL LOOP
Job model 10 nodes 20 nodes 30 nodes
1% complete 60 minutes 40 minutes 25 minutes
2% complete 59 minutes 39 minutes 24 minutes
3% complete 58 minutes 37 minutes 22 minutes
4% complete 56 minutes 36 minutes 21 minutes
5% complete 54 minutes 34 minutes 20 minutes
Deadline: 30 min.
Completion: 5%
72
JOCKEY – MODEL
f(progress, allocation) -> remaining run time
73
JOCKEY – MODEL
f(progress, allocation) -> remaining run time
analytic model?
74
JOCKEY – MODEL
f(progress, allocation) -> remaining run time
analytic model? machine learning?
75
JOCKEY – MODEL
f(progress, allocation) -> remaining run time
analytic model? machine learning?
simulator
76
JOCKEY
Problem Solution
77
JOCKEY
Problem Solution
Pipeline complexity
78
JOCKEY
Problem Solution
Pipeline complexity Use a simulator
79
JOCKEY
Problem Solution
Pipeline complexity Use a simulator
Noisy environment
80
JOCKEY
Problem Solution
Pipeline complexity Use a simulator
Noisy environment Dynamic control
Jockey in Action 81
Jockey in Action 82
• Real job
Jockey in Action 83
• Real job • Production cluster
Jockey in Action 84
• Real job • Production cluster • CPU load: ~80%
Jockey in Action
85
Jockey in Action
86
Jockey in Action
87
Jockey in Action
88
Initial deadline: 140 minutes
Jockey in Action
89
New deadline: 70 minutes
Jockey in Action
90
New deadline: 70 minutes
Release resources due to excess pessimism
Jockey in Action
91
“Oracle” allocation: Total allocation-hours
Deadline
Jockey in Action
92
“Oracle” allocation: Total allocation-hours
Deadline
Available parallelism less than allocation
Jockey in Action
93
“Oracle” allocation: Total allocation-hours
Deadline
Allocation above oracle
Evaluation 94
Evaluation 95
• Production cluster
Evaluation 96
• Production cluster • 21 jobs
Evaluation 97
• Production cluster • 21 jobs • SLO met?
Evaluation 98
• Production cluster • 21 jobs • SLO met? • Cluster impact?
Evaluation
99
Evaluation
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 110% 120% 130%
job completion time relative to deadline
deadline
Jobs which met the SLO
100
Evaluation
0%
20%
40%
60%
80%
100%
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 110% 120% 130%
CD
F
job completion time relative to deadline
deadline
Jobs which met the SLO
101
Evaluation
0%
20%
40%
60%
80%
100%
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 110% 120% 130%
CD
F
job completion time relative to deadline
Jockey
deadline
Jobs which met the SLO
102
Missed 1 of 94 deadlines
Evaluation
0%
20%
40%
60%
80%
100%
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 110% 120% 130%
CD
F
job completion time relative to deadline
Jockey
deadline
Jobs which met the SLO
103
Evaluation
0%
20%
40%
60%
80%
100%
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 110% 120% 130%
CD
F
job completion time relative to deadline
Jockey
deadline
Jobs which met the SLO
104
Evaluation
0%
20%
40%
60%
80%
100%
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 110% 120% 130%
CD
F
job completion time relative to deadline
Jockey
deadline
Jobs which met the SLO
105
1.4x
Evaluation
0%
20%
40%
60%
80%
100%
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 110% 120% 130%
CD
F
job completion time relative to deadline
max allocation Jockey
deadline
Jobs which met the SLO
106
Allocated too many resources
Missed 1 of 94 deadlines
Evaluation
0%
20%
40%
60%
80%
100%
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 110% 120% 130%
CD
F
job completion time relative to deadline
max allocation Jockey
Allocation fromsimulator
deadline
Jobs which met the SLO Allocated too many resources
107
Simulator made good predictions: 80% finish before deadline
Missed 1 of 94 deadlines
Evaluation
0%
20%
40%
60%
80%
100%
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 110% 120% 130%
CD
F
job completion time relative to deadline
max allocation Jockey
Allocation fromsimulator
Control loop only
deadline
Jobs which met the SLO Allocated too many resources
Simulator made good predictions: 80% finish before deadline
108
Control loop is stable
and successful
Missed 1 of 94 deadlines
Evaluation
109
Evaluation
110
0% 25% 50% 75% 100%
fraction of allocation above oracle
Evaluation
111
0%
5%
10%
15%
20%
0% 25% 50% 75% 100%
fract
ion
of d
eadl
ines
mis
sed
fraction of allocation above oracle
Evaluation
112
0%
5%
10%
15%
20%
0% 25% 50% 75% 100%
fract
ion
of d
eadl
ines
mis
sed
fraction of allocation above oracle
Jockey
Evaluation
113
0%
5%
10%
15%
20%
0% 25% 50% 75% 100%
fract
ion
of d
eadl
ines
mis
sed
fraction of allocation above oracle
max allocation
Jockey
Evaluation
114
0%
5%
10%
15%
20%
0% 25% 50% 75% 100%
fract
ion
of d
eadl
ines
mis
sed
fraction of allocation above oracle
Allocation from simulator
max allocation
Control loop only
Jockey
Conclusion 115
116
Data parallel jobs are complex,
117
Data parallel jobs are complex, yet users demand deadlines.
118
Data parallel jobs are complex, yet users demand deadlines.
Jobs run in shared, noisy clusters,
119
Data parallel jobs are complex, yet users demand deadlines.
Jobs run in shared, noisy clusters, making simple models inaccurate.
Jockey 120
simulator
121
control-loop
122
123
Deadline
Deadline
Deadline Deadline
Deadline
124
Questions? Andrew Ferguson [email protected]
125
Co-a
utho
rs • Peter Bodík
(Microsoft Research) • Srikanth Kandula
(Microsoft Research) • Eric Boutín
(Microsoft) • Rodrigo Fonseca
(Brown)
Questions? 126
Andrew Ferguson [email protected]
Backup Slides
127
!"
# $%&# '#!"#$%"&'()*+",$*+&)(
Utility Curves
Deadline
For single jobs, scale doesn’t matter
For multiple jobs, use financial penalties
128
129
Jockey Resource allocation control loop
1. Slack
2. Hysteresis
3. Dead Zone
Prediction Run Time Utility
130
Cosmos
• Resources are allocated with a form of fair sharing across business groups and their jobs. (Like Hadoop FairScheduler or CapacityScheduler)
• Each job is guaranteed a number of tokens as dictated by cluster policy; each running or initializing task uses one token. Token released on task completion.
• A token is a guaranteed share of CPU and memory • To increase efficiency, unused tokens are re-allocated to
jobs with available work
Resource sharing in Cosmos
131
Jockey Progress indicator • Can use many features of the job to build a progress
indicator • Earlier work (ParaTimer) concentrated on fraction of tasks
completed • Our indicator is very simple, but we found it performs
best for Jockey’s needs Total vertex initialization time
Total vertex run time Frac;on of completed ver;ces
132
Comparison with ARIA • ARIA uses analytic models • Designed for 3 stages: Map, Shuffle, Reduce • Jockey’s control loop is robust due to control-
theory improvements • ARIA tested on small (66-node) cluster without a
network bottleneck • We believe Jockey is a better match for production
DAG frameworks such as Hive, Pig, etc.
133
Jockey
Latency prediction: C(p, a) • Event-based simulator
– Same scheduling logic as actual Job Manager
– Captures important features of job progress
– Does not model input size variation or speculative re-execution of stragglers
– Inputs: job algebra, distributions of task timings, probabilities of failures, allocation
• Analytic model
– Inspired by Amdahl’s Law: T = S + P/N
– S is remaining work on critical path, P is all remaining work, N is number of machines
134
Jockey Resource allocation control loop • Executes in Dryad’s Job Manager
• Inputs: fraction of completed tasks in each stage, time job has spent running, utility function, precomputed values (for speedup)
• Output: Number of tokens to allocate
• Improved with techniques from control-theory
Jockey offline during job runtime
job profile
135
Jockey
simulator
offline during job runtime
job profile
136
Jockey
simulator
offline during job runtime
job stats
job profile
137
Jockey
simulator
offline during job runtime
job stats latency predictions
job profile
138
Jockey
simulator
offline during job runtime
utility function job stats latency
predictions
job profile
139
Jockey
simulator
offline during job runtime
running job
utility function job stats latency
predictions
resource allocation control loop
job profile
140