scheduling in distributed systems - andrii vozniuk
DESCRIPTION
My EPFL candidacy exam presentation: http://wiki.epfl.ch/edicpublic/documents/Candidacy%20exam/vozniuk_andrii_candidacy_writeup.pdf Here I present how schedulers work in three distributed data processing systems and their possible optimizations. I consider Gamma, a parallel database; MapReduce, a data-intensive system; and Condor, a compute-intensive system. This talk is based on the following papers: 1) Batch Scheduling in Parallel Database Systems by Manish Mehta, Valery Soloviev and David J. DeWitt 2) Improving MapReduce Performance in Heterogeneous Environments by Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz and Ion Stoica 3) Matchmaking: Distributed Resource Management for High Throughput Computing by Rajesh Raman, Miron Livny and Marvin Solomon
TRANSCRIPT
Scheduling In Distributed Systems
Andrii Vozniuk EPFL July 4, 2012
Candidacy exam
Data explosion: processing gets more complicated
Big Data
2
Resources of many computers should be used
Generates: 40 TB/day; stores: 20 PB/year
Generates: 25 TB/day; stores: 10 PB/year
Typical Data Processing Pipeline
3
Log data
Clean data
Query data
Analyze data
Sensor data
No one-size-fits-all system currently exists
ETL-like batch processing
Efficient query execution
Using resources of many organizations
Particle found!
User model
Outline
4
Condor - compute-intensive system
Gamma - parallel database
MapReduce - data-intensive system
Ɣ
Conclusions
Future Research
Scheduling policy: setting an ordering of tasks and assigning resources to tasks
Scheduling In Distributed Systems
5
Scheduling is challenging in distributed systems
How to match resources and tasks?
Matching Tasks With Resources
6
How is scheduling influenced by the data and execution models?
Perspectives: data model and execution model

System    | Data model    | Execution model
Gamma     | Relational    | Multioperator
MapReduce | Unconstrained | MapReduce
Condor    | Unconstrained | Unconstrained
Gamma: a pioneering parallel database
Data model: constrained. Relational data model; relations are horizontally partitioned
Execution model: constrained. Multioperator queries; operators employ hash-based algorithms
7
Ɣ
Gamma: Scheduler
8
Scheduling is done at the operator level
[Diagram: a query, e.g. SELECT r FROM R WHERE r < 'k', arrives at the host machine; the Query Manager uses the Catalog to optimize the query and compile a plan; a Scheduler Process schedules the operators; Operator Processes execute on the relevant nodes (Node 1 holds partition a-m, Node 2 holds partition n-z).]
Gamma: Batch Scheduling
9
Exploit sharing by scheduling queries in a batch; example of selection sharing
Batch scheduling trades latency for throughput
[Diagram: two separate selections σ1(A) and σ2(A) are merged into a single shared scan of A.]
Reads of A can be shared by applying the predicates in turn; the shared relation A is scanned only once
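The shared-scan idea can be sketched in a few lines of Python (an illustrative simplification, not Gamma's actual implementation; function and variable names are mine):

```python
def shared_scan(relation, predicates):
    """Scan `relation` once, returning one result list per predicate."""
    results = [[] for _ in predicates]
    for tup in relation:          # single pass over the shared relation
        for i, pred in enumerate(predicates):
            if pred(tup):
                results[i].append(tup)
    return results

# Two selections over A from different queries share one scan:
A = [1, 5, 9, 12, 20]
sel1, sel2 = shared_scan(A, [lambda r: r < 10, lambda r: r % 2 == 0])
```

The batch pays one I/O pass for A instead of one pass per query, which is exactly the latency-for-throughput trade mentioned above.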
Gamma: Batch Scheduling Joins
10
Sharing among joins reduces total execution time
Several hash joins in a batch of queries: the hash table for the same relation can be shared (the example assumes 100% selectivity of σ)
[Diagram: the joins σ(A) ⋈ σ(B), σ(A) ⋈ σ(C) and σ(B) ⋈ σ(A) from different queries all probe a single shared hash table built on A.]
Sharing reduces I/O and memory usage
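A minimal sketch of hash-table sharing across joins, under the slide's assumption that the build relation A is common to all joins (names and structure are my own, not Gamma's code):

```python
def build_hash_table(relation, key):
    """Build the hash table for the shared relation once."""
    table = {}
    for tup in relation:
        table.setdefault(key(tup), []).append(tup)
    return table

def probe(hash_table, relation, key):
    """Probe a prebuilt hash table; returns matching tuple pairs."""
    return [(a, b) for b in relation for a in hash_table.get(key(b), [])]

A = [(1, 'a1'), (2, 'a2')]
B = [(1, 'b1')]
C = [(2, 'c1'), (3, 'c2')]

shared = build_hash_table(A, key=lambda t: t[0])  # built once, reused
ab = probe(shared, B, key=lambda t: t[0])         # A join B
ac = probe(shared, C, key=lambda t: t[0])         # A join C
```

Building A's table once and probing it from both joins saves the second read of A and the memory for a second copy of the table.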
Limitations Of Gamma
Gamma offers: efficient query execution; sharing within a batch of queries
Gamma operates on structured data; it is not suitable for: unstructured data processing; ETL-type workloads; running at large scale
11
A different system for ETL processing is needed
Ɣ
MapReduce: a system for data-intensive applications
Execution model: constrained. A job is a set of map and reduce tasks; tasks are independent
Data model: unconstrained. Arbitrary data formats; files are partitioned into chunks; each chunk is replicated several times
12
MapReduce: Scheduling
13
Example: a MapReduce job with 4 map tasks and 2 reduce tasks
[Diagram: Map1..Map4 each read one input chunk (Chunk1..Chunk4) and write temporary outputs (Temp1..Temp4); the two reduce tasks consume them to produce Result1 and Result2.]
Fine-grained scheduling improves fault tolerance and elasticity
Tasks are scheduled close to their data; execution is scalable, fault-tolerant and elastic
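A locality-preferring assignment rule of the kind described above might look like this (a hypothetical simplification; a real MapReduce master's logic is more involved):

```python
def assign_task(chunk_locations, free_nodes):
    """Prefer a free node that already stores the task's input chunk."""
    for node in free_nodes:
        if node in chunk_locations:
            return node          # data-local assignment: no network read
    return free_nodes[0] if free_nodes else None  # fall back to any free node

# The chunk is replicated on n1 and n3; n2 and n3 are currently free.
chosen = assign_task({"n1", "n3"}, ["n2", "n3"])
```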
MapReduce: Speculative Execution
Nodes may become slow; speculative execution minimizes a job's response time
A backup copy of a task is launched if its progress is 20% less than the average
14
Speculative execution works well in a homogeneous environment
[Diagram: a straggler task on a temporarily slow node gets a backup copy on a normal node.]
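The 20%-below-average rule stated above can be sketched as follows (an assumption-laden simplification of the default speculation heuristic; the function name is mine):

```python
def tasks_to_backup(progress_scores):
    """Return tasks whose progress is 20% below the average progress.

    progress_scores: dict task_id -> progress score in [0, 1].
    """
    avg = sum(progress_scores.values()) / len(progress_scores)
    return [t for t, p in progress_scores.items() if p < avg - 0.2]

scores = {"t1": 0.9, "t2": 0.85, "t3": 0.3}
laggards = tasks_to_backup(scores)
```

On a homogeneous cluster this singles out genuinely slow nodes; the next slides show why the same rule misbehaves when nodes differ in speed.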
Emerging Heterogeneous Infrastructures
Causes: replacement of failed components; extending an existing cluster with new machines; virtualized data centers of cloud providers
In virtualized data centers, CPU and RAM are isolated, but there is contention for disk and network
15
In many real-life cases the infrastructure is heterogeneous
[Figure: I/O performance per VM (MB/s) as the number of VMs on a physical host grows from 1 to 7.]
MapReduce: Heterogeneous Cluster
Performance degrades on a heterogeneous cluster: slow nodes are wasted; backup tasks land on slow nodes; all straggling tasks are treated equally; thrashing occurs due to excessive speculative execution
16
Speculative execution should be improved for heterogeneous clusters
[Diagram: fast node vs. slow node]
MapReduce: LATE Scheduler
Idea: back up the task with the largest estimated finish time (Longest Approximate Time to End)
Thresholds: limit the number of backup tasks; launch backup tasks only on fast nodes; back up only sufficiently slow tasks
17
progress rate = progress score / execution time
estimated time left = (1 - progress score) / progress rate
LATE looks forward to prioritize tasks to speculate
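The two estimates, plus picking the running task with the largest estimated time left, can be sketched as (illustrative only; names and numbers are mine, not the paper's code):

```python
def estimated_time_left(progress_score, execution_time):
    # progress rate = progress score / execution time
    progress_rate = progress_score / execution_time
    # estimated time left = (1 - progress score) / progress rate
    return (1 - progress_score) / progress_rate

def pick_task_to_backup(tasks):
    """tasks: dict task_id -> (progress_score, execution_time in minutes)."""
    return max(tasks, key=lambda t: estimated_time_left(*tasks[t]))

# Two running tasks, both 2 minutes in (illustrative numbers):
tasks = {"A": (0.66, 2.0), "B": (0.05, 2.0)}
slowest = pick_task_to_backup(tasks)
```

Note the look-ahead: task B may have a decent progress score history, but its low rate means the largest remaining time, so it is the one worth backing up.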
MapReduce: LATE Example
Back up the task with the Longest Approximate Time to End
18
[Diagram: three tasks after 2 minutes; node 1 runs at 1 task/min, node 2 is 3x slower (progress = 66%, estimated time left: (1 - 0.66) / (1/3) = 1), node 3 is 1.9x slower (progress = 5.3%, estimated time left: (1 - 0.05) / (1/1.9) = 1.8).]
LATE correctly identifies the task which hurts the response time the most
Limitations Of MapReduce
19
MapReduce offers: high scalability; good fault tolerance; handling of unstructured data
MapReduce is not suitable for: running on a multi-organization infrastructure; harvesting idle resources within an organization
A different system for multi-organization infrastructures is needed
Condor
20
A compute-intensive system harvesting idle resources
Data model: arbitrary; execution model: arbitrary
[Diagram: jobs are submitted to idle machines.]
Increase resource utilization by scheduling jobs on idle machines
How to increase utilization while respecting the owners?
Condor Scheduler: Centralized?
21
[Diagram: one central scheduler receives all jobs.]
Efficient, but not reliable: the single scheduler is a possible bottleneck
Condor Scheduler: Distributed?
22
[Diagram: every node runs its own scheduler for its jobs.]
Reliable, but inefficient
Condor Scheduler: Hybrid!
23
Hybrid approach has the best of both worlds
[Diagram: per-node schedulers send information about nodes and tasks to a central Matchmaker; the Matchmaker notifies the matched parties, and the schedulers then contact each other directly to run the jobs.]
Job Description
[
  MyType = "Job"
  TargetType = "Machine"
  Department = "CompSci"
  Requirements = (other.OpSys == "LINUX" && other.Disk > 10000000)
  Rank = Memory
]

Machine Description
[
  MyType = "Machine"
  TargetType = "Job"
  Machine = "nostos.cs.wisc.edu"
  OpSys = "LINUX"
  Disk = 3076077
  Requirements = (LoadAvg <= 0.3) && (KeyboardIdle > (15*60))
  Rank = other.Department == self.Department
]
ClassAds: Describing Jobs and Resources
24
Matchmaker is suitable for heterogeneous shared clusters
The Requirements of both sides must be satisfied; the candidate with the highest Rank is returned
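A toy version of this matchmaking step, with ClassAds reduced to Python dicts and lambdas (my own simplification of the semantics, not Condor's implementation):

```python
def matchmake(job, machines):
    """Return the machine the job ranks highest among mutual matches."""
    # Both sides' requirements must hold for a machine to be a candidate.
    candidates = [m for m in machines
                  if job["requirements"](m) and m["requirements"](job)]
    if not candidates:
        return None
    return max(candidates, key=job["rank"])   # highest rank wins

job = {
    "department": "CompSci",
    "requirements": lambda m: m["opsys"] == "LINUX" and m["disk"] > 10_000_000,
    "rank": lambda m: m["memory"],            # like Rank = Memory
}
machines = [
    {"name": "a", "opsys": "LINUX", "disk": 20_000_000, "memory": 8,
     "requirements": lambda j: j["department"] == "CompSci"},
    {"name": "b", "opsys": "LINUX", "disk": 30_000_000, "memory": 16,
     "requirements": lambda j: True},
]
best = matchmake(job, machines)
```

Both machines satisfy the mutual requirements, so the job's rank expression (memory) breaks the tie.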
Scheduling is done at different levels
Gamma: operator-level scheduling enables sharing
MapReduce and Condor: arbitrary code, so sharing is hard
Condor: matchmaking gives control over job placement
Hybrid approaches are promising for big data processing
Scheduling in heterogeneous deployments is challenging
Conclusions
25
References
Matchmaking: Distributed Resource Management for High Throughput Computing by Rajesh Raman, Miron Livny and Marvin Solomon.
Batch Scheduling in Parallel Database Systems by Manish Mehta, Valery Soloviev and David J. DeWitt.
Improving MapReduce Performance in Heterogeneous Environments by Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz and Ion Stoica.
Slides 14 and 18 exploit presentation ideas from the LATE slides for OSDI 2008 by Matei Zaharia
27