Why static is bad! Hadoop, Pregel, MPI, shared cluster. Today: static partitioning; want dynamic sharing

Post on 19-Dec-2015


Why static is bad!

Hadoop

Pregel

MPI

Shared cluster

Today: static partitioning

Want dynamic sharing

Comparing Sharing Frameworks: Choice
• Choice of resources
  • Can a framework pick between all resources?
  • A predefined subset?
  • Or a randomly chosen subset?
• Why important?
  • Policies may need to be global (localization)
  • If you can preempt, you can get your preference

Comparing Sharing Frameworks: Interference
• Can frameworks try to use the same machines?
  • Can a framework pick between all resources?
• How to avoid this?
  • Offer resources to frameworks one at a time
  • Statically partition
  • Offer in parallel and arbitrate when a conflict arises

Comparing Sharing Frameworks: Granularity
• Allocation granularity
  • MPI tasks: gang-scheduled; a job can’t run until all slots are acquired.
  • Hadoop: elastic; a job can start running once it has allocated a few slots.
• Why important?
  • With gang scheduling, the framework will hoard slots until it gets all the slots it needs.
  • The cluster may or may not be underutilized.
• Cluster-wide behaviors
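The gang-scheduled versus elastic distinction on this slide can be illustrated with a small sketch. The `Job` class, the `grant` helper, and the slot counts are hypothetical names and numbers chosen for illustration; they are not part of MPI, Hadoop, or Mesos.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    slots_needed: int   # total slots the job wants
    gang: bool          # True: needs all slots before starting (MPI-style)
    slots_held: int = 0

    def can_run(self) -> bool:
        # A gang-scheduled job runs only once it holds every slot it needs;
        # an elastic job can make progress with at least one slot.
        if self.gang:
            return self.slots_held >= self.slots_needed
        return self.slots_held > 0


def grant(job: Job, free_slots: int) -> int:
    """Give the job as many free slots as it still wants; return what is left."""
    taken = min(free_slots, job.slots_needed - job.slots_held)
    job.slots_held += taken
    return free_slots - taken


mpi = Job("mpi", slots_needed=8, gang=True)
hadoop = Job("hadoop", slots_needed=8, gang=False)

# Assume only 3 slots are free for each job right now.
left_after_mpi = grant(mpi, 3)
left_after_hadoop = grant(hadoop, 3)

print(mpi.can_run())     # False: holds 3 slots but sits idle, i.e. hoarding
print(hadoop.can_run())  # True: starts working with the 3 slots it has
```

In this toy model the gang-scheduled job holds slots without doing work, which is the hoarding behavior the slide warns about.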

Mesos

Other Benefits of Mesos

Run multiple instances of the same framework
»Isolate production and experimental jobs
»Run multiple versions of the framework concurrently

Build specialized frameworks targeting particular problem domains

»Better performance than general-purpose abstractions

Goals

High utilization of resources

Support diverse frameworks (current & future)

Scalability to 10,000s of nodes

Reliability in face of failures

Resulting design: Small microkernel-like core that pushes scheduling logic to frameworks

Design Elements

Fine-grained sharing:
»Allocation at the level of tasks within a job
»Improves utilization, latency, and data locality

Resource offers:
»Simple, scalable application-controlled scheduling mechanism

Element 1: Fine-Grained Sharing

[Figure: two panels, "Coarse-Grained Sharing (HPC)" and "Fine-Grained Sharing (Mesos)", each showing Framework 1, Framework 2, and Framework 3 placed on cluster nodes above a storage system (e.g. HDFS).]

+ Improved utilization, responsiveness, data locality

Element 2: Resource Offers

Option: Global scheduler
»Frameworks express needs in a specification language, global scheduler matches them to resources

+ Can make optimal decisions
– Complex: language must support all framework needs
– Difficult to scale and to make robust
– Future frameworks may have unanticipated needs

Element 2: Resource Offers

Mesos: Resource offers
»Offer available resources to frameworks, let them pick which resources to use and which tasks to launch

+ Keeps Mesos simple, lets it support future frameworks
– Decentralized decisions might not be optimal

Mesos Architecture

[Diagram: an MPI job and a Hadoop job submit work to their framework schedulers (MPI scheduler, Hadoop scheduler). The Mesos master, through its allocation module, picks a framework to offer resources to and sends it a resource offer. Mesos slaves run MPI executors, which run tasks.]

Mesos Architecture

[Diagram: MPI and Hadoop jobs with their schedulers, the Mesos master with its allocation module, and Mesos slaves running MPI executors and tasks. The allocation module picks a framework to offer resources to, and the master sends that framework's scheduler a resource offer.]

Resource offer = list of (node, availableResources)

E.g. { (node1, <2 CPUs, 4 GB>), (node2, <3 CPUs, 2 GB>) }
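One way to picture the offer from this slide in code is as a plain list of (node, resources) pairs that a framework filters according to its own policy. The `Resources` type and the `pick_tasks` policy below are hypothetical and only illustrate the idea; they are not the Mesos API.

```python
from typing import NamedTuple


class Resources(NamedTuple):
    cpus: int
    mem_gb: int


# The offer from the slide: a list of (node, availableResources) pairs.
offer = [
    ("node1", Resources(cpus=2, mem_gb=4)),
    ("node2", Resources(cpus=3, mem_gb=2)),
]


def pick_tasks(offer, task_need: Resources):
    """Hypothetical framework-side policy: pack as many identical tasks as fit
    on each offered node, and return {node: task_count} for accepted nodes."""
    accepted = {}
    for node, avail in offer:
        fit = min(avail.cpus // task_need.cpus, avail.mem_gb // task_need.mem_gb)
        if fit > 0:
            accepted[node] = fit
    return accepted


# Assume each task needs 1 CPU and 2 GB of memory.
placement = pick_tasks(offer, Resources(cpus=1, mem_gb=2))
print(placement)  # {'node1': 2, 'node2': 1}
```

The point of the sketch is that the decision of what to accept lives in the framework, not in the component that produced the offer.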

Mesos Architecture

[Diagram: the Mesos master's allocation module picks a framework to offer resources to and sends a resource offer; the framework's scheduler (MPI scheduler, Hadoop scheduler) performs framework-specific scheduling; Mesos slaves launch and isolate executors (MPI executor, Hadoop executor), which run tasks.]
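The control flow in the architecture slides (the master picks a framework, sends an offer, the framework decides, the slaves launch) can be sketched as a toy simulation. Everything here is an assumption made for illustration: the `Framework` class, its greedy `on_offer` policy, and the fixed iteration order standing in for the allocation module do not reflect the real Mesos interfaces.

```python
class Framework:
    """Hypothetical framework scheduler: accepts offered CPUs until its demand is met."""

    def __init__(self, name: str, cpus_wanted: int):
        self.name = name
        self.cpus_wanted = cpus_wanted

    def on_offer(self, offer: dict) -> dict:
        # Framework-specific scheduling: decide which offered resources to use.
        accepted = {}
        for node, cpus in offer.items():
            take = min(cpus, self.cpus_wanted)
            if take > 0:
                accepted[node] = take
                self.cpus_wanted -= take
        return accepted


def master_round(free: dict, frameworks: list) -> list:
    """One pass of the master: offer free resources to one framework at a time."""
    launched = []
    for fw in frameworks:  # stand-in for the allocation module's choice
        offer = {node: cpus for node, cpus in free.items() if cpus > 0}
        for node, cpus in fw.on_offer(offer).items():
            free[node] -= cpus  # the slave would launch an executor/task here
            launched.append((fw.name, node, cpus))
    return launched


free = {"slave1": 2, "slave2": 3}
log = master_round(free, [Framework("mpi", 3), Framework("hadoop", 4)])
print(log)   # [('mpi', 'slave1', 2), ('mpi', 'slave2', 1), ('hadoop', 'slave2', 2)]
print(free)  # {'slave1': 0, 'slave2': 0}
```

Note that the second framework only ever sees what the first one left behind, which is the one-at-a-time offering discussed under the drawbacks.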

Drawbacks
• Poor fairness
  • Jobs with long tasks can dominate
  • There is NO preemption!!
• Sticky slots
  • Jobs with higher priority can dominate a set of preferred slots
  • Mesos uses lottery scheduling: the probability of being offered a slot is proportional to the framework's priority
• Head-of-line blocking
  • Mesos offers resources to one framework at a time
  • Prevents frameworks from trying to use the same slots
  • Based on the assumption that scheduling decisions are quick
  • Mesos revokes offers if a scheduler takes too long
  • Essentially leads to a queue
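The lottery scheduling mentioned above amounts to a weighted random draw. The following is a minimal sketch under that reading; the `pick_framework` helper and the priority values are hypothetical, and it is not taken from the Mesos code base.

```python
import random
from collections import Counter


def pick_framework(priorities: dict, rng: random.Random) -> str:
    """Lottery draw: each framework wins with probability proportional to its priority."""
    names = list(priorities)
    weights = [priorities[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]


priorities = {"hadoop": 3, "mpi": 1}  # hypothetical priorities
rng = random.Random(0)                 # fixed seed so runs are repeatable

wins = Counter(pick_framework(priorities, rng) for _ in range(10_000))
print(wins)  # roughly 3:1 in favour of "hadoop"
```

Over many draws the higher-priority framework receives most of the offers, which is how it can come to dominate a set of preferred slots.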

Omega

Omega
• Scales
  • Central layer only does optimistic conflict resolution
  • No head-of-line blocking
• Allows for flexible and evolvable scheduling
  • Framework can implement any arbitrary form of scheduling
  • Each framework has global view
  • Frameworks can preempt each other
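Optimistic conflict resolution over a shared, globally visible state can be sketched with a version check at commit time. This is an illustrative toy under stated assumptions: the `CellState` class, the per-node version counter, and the retry step are hypothetical and make no claim about how Omega actually detects conflicts.

```python
class CellState:
    """Hypothetical shared cluster state: free CPUs per node plus a version counter."""

    def __init__(self, free_cpus: dict):
        self.free_cpus = dict(free_cpus)
        self.version = {node: 0 for node in free_cpus}

    def snapshot(self):
        # Every scheduler sees the whole cluster, not a subset.
        return dict(self.free_cpus), dict(self.version)

    def commit(self, node: str, cpus: int, seen_version: int) -> bool:
        # Optimistic check: reject the claim if the node changed since the
        # scheduler took its snapshot, or if the CPUs are no longer free.
        if self.version[node] != seen_version or self.free_cpus[node] < cpus:
            return False
        self.free_cpus[node] -= cpus
        self.version[node] += 1
        return True


cell = CellState({"node1": 4})

# Two schedulers plan in parallel against the same snapshot.
_, versions_a = cell.snapshot()
_, versions_b = cell.snapshot()

ok_a = cell.commit("node1", 3, versions_a["node1"])  # first commit wins
ok_b = cell.commit("node1", 3, versions_b["node1"])  # conflict detected
print(ok_a, ok_b)  # True False

# The losing scheduler retries from a fresh snapshot with a smaller request.
free, versions_b = cell.snapshot()
ok_retry = cell.commit("node1", free["node1"], versions_b["node1"])
print(ok_retry, cell.free_cpus)  # True {'node1': 0}
```

Neither scheduler waits for the other while planning; the cost of a conflict is only paid by the scheduler that loses the race, which is why there is no head-of-line blocking in this scheme.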

Comparing Sharing Frameworks
• Choice of resources

• Interference

• Allocation Granularity

• Cluster-wide behaviors

Comparing Frameworks
