hadoop on-mesos

34
Hadoop on Mesos with a short history of distributed computing

Upload: henry-cai-

Post on 06-May-2015

10.376 views

Category:

Technology


3 download

TRANSCRIPT

Page 1: Hadoop on-mesos

Hadoop on Mesoswith a short history of distributed computing

Page 2: Hadoop on-mesos

Agenda

1. Introduction (to me)2. A short history of distributed computing3. Hadoop on Mesos4. Case study - Airbnb5. Final thoughts6. Q&A

Page 3: Hadoop on-mesos

About me - Brenden Matthews

● cyclist● runner● started computering before it was cool● free software advocate & contributor (Conky)● for a living, engineers software @ Airbnb

Page 4: Hadoop on-mesos

About me - Brenden Matthews

● cyclist● runner● started computering before it was cool● free software advocate & contributor (Conky)● for a living, engineers software @

I don't even like computers.

Page 5: Hadoop on-mesos

Von Neumann Bottleneck● Forever limited by memory and other I/O

bandwidth limitations● To do more, you must scale beyond a single

node● Even with SMP

systems, the samelimitations apply

A little history

Page 6: Hadoop on-mesos

Early days of distributed computing

● Working around the Von Neumann Bottleneck: scaling up & out (Cray, SGI, IBM)

● 'Supercomputers' only practical for organizations with budget multipliers that start with a 'B'

Page 7: Hadoop on-mesos

Who has time to build a datacentre?

● Xen hypervisor is released in 2003, paves the way for an 'abstract datacentre' through virtualization

● Amazon launches EC2 in 2006, kicks off the 'cloud computing' craze

Page 8: Hadoop on-mesos

DIY supercomputer; a novel approach

● Google's MapReduce papers formalized the concept of 'black-box' distributed computing (2004)

● Google's own infrastructure is built upon free software and commodity hardware

Page 9: Hadoop on-mesos

DIY supercomputer; a novel approach

● Hadoop: a free implementation of Google's infrastructure; 'big computing' for all (2005)○ Robust○ High tolerance of system failure

Page 10: Hadoop on-mesos

We're still left withmany incomplete solutions● EC2 doesn't solve some problems:

○ Virtualization delivers poor performance when compared to 'bare metal'; must compensate by adding more instances

○ Frequent instance failures (mystery reboots, etc)○ EC2 isn't 'application aware' (though some have

tried)

What else?● Supercomputers aren't affordable● Building a datacentre is not feasible for most● Existing 'application in the cloud' systems

are too restrictive

Page 11: Hadoop on-mesos

How can we overcome these problems?

Page 12: Hadoop on-mesos

The dream is alive.

Page 13: Hadoop on-mesos

Mesos is an operating system for your cluster that provides application level distributed computing

Mesos helps bridge the gap between the hardware and your application (or 'framework', in Mesos terms)

What's Mesos?

Page 14: Hadoop on-mesos

Why Mesos?

yes, but...

Page 15: Hadoop on-mesos

I enjoy doing things the hard way.

Page 16: Hadoop on-mesos

I really enjoy doing things the hard way.

Page 17: Hadoop on-mesos

Hadoop on Mesos: Why?

● Formalized, scalable distributed computing● Extensive toolset (Hive, Pig, Cascading,

Cascalog, ...)● Familiar to many ('gold standard')● Hadoop as a distributed application (a novel

concept!)● Multiple versions of Hadoop (upgrade path)● Why stop at Hadoop? There's more to do

with our cluster! (Chronos, Storm, Jenkins, Spark, ...) and who has time to manage it?

Page 18: Hadoop on-mesos

Hadoop on Mesos: Goals

● Avoid complexity: rely on existing, vetted systems, where possible

● Hadoop on Mesos should behave like any other Hadoop

● Realize high resource utilization● Minimize contention & starvation● Make Hadoop a first class framework on

Mesos

Page 19: Hadoop on-mesos

Hadoop terminology

● JobTracker: manages cluster resources, assigns tasks to TaskTrackers

● TaskTracker: manages individual map/reduce tasks, serves intermediate data amongst other TaskTrackers

● Job: collection of map and reduce tasks● Task: one unit of work for a job (be it map or

reduce)● Slot: a task executor, is either map or

reduce● HDFS: distributed filesystem (outside scope)

Page 20: Hadoop on-mesos

Hadoop on Mesos: Challenges

● Availability: JobTracker must ensure adequate map and reduce slots are available for current & future jobs

● Capacity: how do you estimate capacity? How do you profile jobs?

● Optimization: general case, or specific cases? Per job resource allocation policies? Separate JobTrackers for different job types?

Page 21: Hadoop on-mesos

Hadoop on Mesos: Challenges

○ Mesos reservations allow for reservation of slave resources for frameworks

○ Hadoop FairScheduler supports role fair sharing and task pre-emption within JobTracker

● Resource reservations: handling competing frameworks on the same cluster

Page 22: Hadoop on-mesos

Hadoop on Mesos: Challenges

Job Maps Reduces Duration Start

1 95 5 1h 0

2 5 100 1m 1m

3 10 10 30m 60m

4 50 0 20m 70m

5 100 5 1h 80m

Maps Reduces

95 5

48 52

10 10

60 10

90 10

Job Flow

With capacity for 100 slots

A contrived example

Maps Reduces

50 50

50 50

50 50

50 50

50 50

Ideal allocation Actual Hadoop

Page 23: Hadoop on-mesos

Hadoop on Mesos: What we did

● Mesos Scheduler is a thin layer atop the Hadoop scheduler

● JobTracker launches TaskTrackers for each job, using either a fixed or variable slot policy○ Fixed policy launches a fixed number of slots per

TaskTracker○ Variable policy attempts to launch an ideal number

of TaskTrackers and slots based on job queue● Task scheduling is left to the underlying

scheduler (i.e., Hadoop FairScheduler)

Page 24: Hadoop on-mesos

Suggested key configuration values

Hadoop on Mesos: How we did it

Name Value

mapred.tasktracker.map.tasks.maximum 50

mapred.tasktracker.reduce.tasks.maximum 50

mapred.mesos.slot.map.minimum 1000

mapred.mesos.slot.reduce.minimum 1000

mapred.mesos.scheduler.policy.fixed false

mapred.mesos.slot.cpus 0.95

mapred.mesos.slot.mem 1550

Page 25: Hadoop on-mesos

● Engineering & analytics departments use Hive, Pig, Cascading and other tools on Hadoop:○ Building search indices○ Pricing suggestion system○ Trust & safety, fraud detection○ Business analytics

● Dealing with hypergrowth

Case study: Airbnb

Page 26: Hadoop on-mesos

● Had previously been using EMR, Amazon's managed Hadoop as a service

● EMR suffers from:○ limited Hive/Pig features○ feature lag○ inability to patch or modify Hadoop

● Data infrastructure was prone to error due to significant complexity○ EMR clusters would be spun up & destroyed every

week○ accessing Hadoop required strange SSH 'hopping'

Case study: Airbnb, yesterday

Page 27: Hadoop on-mesos

Case study: Airbnb, today

● We run Chronos, Hadoop, and Storm on Mesos now

● Finished complete migration to Mesos from EMR (June 2013)

● ~500 Chronos jobs● ~20TiB of daily Hive data, ~1-2PiB of

archived data

Page 28: Hadoop on-mesos

● Data availability: all time high

● Eng. & analytics customer satisfaction through the roof

Case study: Airbnb, today

Page 29: Hadoop on-mesos

Action shots

Page 30: Hadoop on-mesos

Action shots

Page 31: Hadoop on-mesos

Next steps

● Locality awareness● HDFS on Mesos● HA JobTracker● JobTracker on Mesos

Page 33: Hadoop on-mesos

Thanks!

Page 34: Hadoop on-mesos

Questions?