hadoop on-mesos

Post on 06-May-2015

TRANSCRIPT

Hadoop on Mesos

with a short history of distributed computing

Agenda

1. Introduction (to me)
2. A short history of distributed computing
3. Hadoop on Mesos
4. Case study - Airbnb
5. Final thoughts
6. Q&A

About me - Brenden Matthews

● cyclist
● runner
● started computering before it was cool
● free software advocate & contributor (Conky)
● for a living, engineers software @ Airbnb

I don't even like computers.

Von Neumann Bottleneck

● Forever limited by memory and other I/O bandwidth limitations
● To do more, you must scale beyond a single node
● Even with SMP systems, the same limitations apply

A little history

Early days of distributed computing

● Working around the Von Neumann Bottleneck: scaling up & out (Cray, SGI, IBM)

● 'Supercomputers' only practical for organizations with budget multipliers that start with a 'B'

Who has time to build a datacentre?

● Xen hypervisor is released in 2003, paves the way for an 'abstract datacentre' through virtualization

● Amazon launches EC2 in 2006, kicks off the 'cloud computing' craze

DIY supercomputer; a novel approach

● Google's MapReduce papers formalized the concept of 'black-box' distributed computing (2004)

● Google's own infrastructure is built upon free software and commodity hardware

DIY supercomputer; a novel approach

● Hadoop: a free implementation of Google's infrastructure; 'big computing' for all (2005)
  ○ Robust
  ○ High tolerance of system failure

We're still left with many incomplete solutions

● EC2 doesn't solve some problems:
  ○ Virtualization delivers poor performance when compared to 'bare metal'; must compensate by adding more instances
  ○ Frequent instance failures (mystery reboots, etc.)
  ○ EC2 isn't 'application aware' (though some have tried)

What else?

● Supercomputers aren't affordable
● Building a datacentre is not feasible for most
● Existing 'application in the cloud' systems are too restrictive

How can we overcome these problems?

The dream is alive.

What's Mesos?

Mesos is an operating system for your cluster that provides application-level distributed computing

Mesos helps bridge the gap between the hardware and your application (or 'framework', in Mesos terms)

Why Mesos?

yes, but...

I enjoy doing things the hard way.

I really enjoy doing things the hard way.

Hadoop on Mesos: Why?

● Formalized, scalable distributed computing
● Extensive toolset (Hive, Pig, Cascading, Cascalog, ...)
● Familiar to many ('gold standard')
● Hadoop as a distributed application (a novel concept!)
● Multiple versions of Hadoop (upgrade path)
● Why stop at Hadoop? There's more to do with our cluster! (Chronos, Storm, Jenkins, Spark, ...) and who has time to manage it?

Hadoop on Mesos: Goals

● Avoid complexity: rely on existing, vetted systems, where possible

● Hadoop on Mesos should behave like any other Hadoop

● Realize high resource utilization
● Minimize contention & starvation
● Make Hadoop a first-class framework on Mesos

Hadoop terminology

● JobTracker: manages cluster resources, assigns tasks to TaskTrackers

● TaskTracker: manages individual map/reduce tasks, serves intermediate data amongst other TaskTrackers

● Job: collection of map and reduce tasks
● Task: one unit of work for a job (be it map or reduce)
● Slot: a task executor, is either map or reduce
● HDFS: distributed filesystem (outside scope)

Hadoop on Mesos: Challenges

● Availability: JobTracker must ensure adequate map and reduce slots are available for current & future jobs

● Capacity: how do you estimate capacity? How do you profile jobs?

● Optimization: general case, or specific cases? Per job resource allocation policies? Separate JobTrackers for different job types?

Hadoop on Mesos: Challenges

● Resource reservations: handling competing frameworks on the same cluster
  ○ Mesos reservations allow for reservation of slave resources for frameworks
  ○ Hadoop FairScheduler supports role fair sharing and task pre-emption within JobTracker

Hadoop on Mesos: Challenges

A contrived example, with capacity for 100 slots

Job  Maps  Reduces  Duration  Start
1    95    5        1h        0
2    5     100      1m        1m
3    10    10       30m       60m
4    50    0        20m       70m
5    100   5        1h        80m

Job Flow

Ideal allocation    Actual Hadoop
Maps  Reduces       Maps  Reduces
95    5             50    50
48    52            50    50
10    10            50    50
60    10            50    50
90    10            50    50
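The gap between a fixed 50/50 slot split and a demand-driven split can be worked through for the five jobs in the example. The sketch below is only an illustration: it treats each job in isolation (ignoring the overlap implied by the start times), and the proportional-sharing rule in `flexible_split` is an assumption, not something stated on the slides. It does not try to reproduce the 'Ideal allocation' column.

```python
# Jobs from the contrived example: (job id, map tasks, reduce tasks).
JOBS = [(1, 95, 5), (2, 5, 100), (3, 10, 10), (4, 50, 0), (5, 100, 5)]
CAPACITY = 100  # total slots in the cluster, as in the example


def fixed_split(maps, reduces, map_slots=50, reduce_slots=50):
    """Slots a job can occupy when map and reduce slots are fixed up front."""
    return min(maps, map_slots), min(reduces, reduce_slots)


def flexible_split(maps, reduces, capacity=CAPACITY):
    """Slots a job can occupy when any slot may run either task type.

    Assumption: when demand exceeds capacity, slots are shared in
    proportion to the number of pending map and reduce tasks.
    """
    demand = maps + reduces
    if demand <= capacity:
        return maps, reduces
    map_share = round(capacity * maps / demand)
    return map_share, capacity - map_share


if __name__ == "__main__":
    print("Job  fixed(m,r)  busy  flexible(m,r)  busy")
    for job, maps, reduces in JOBS:
        fm, fr = fixed_split(maps, reduces)
        xm, xr = flexible_split(maps, reduces)
        print(f"{job:<4} {fm:>3},{fr:<6} {fm + fr:>4}  {xm:>3},{xr:<9} {xm + xr:>4}")
```

Under these assumptions, job 1 can keep only 55 of the 100 slots busy with a fixed split, because 45 of its map tasks wait while 45 reduce slots sit idle.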

Hadoop on Mesos: What we did

● Mesos Scheduler is a thin layer atop the Hadoop scheduler

● JobTracker launches TaskTrackers for each job, using either a fixed or variable slot policy
  ○ Fixed policy launches a fixed number of slots per TaskTracker
  ○ Variable policy attempts to launch an ideal number of TaskTrackers and slots based on job queue
● Task scheduling is left to the underlying scheduler (i.e., Hadoop FairScheduler)
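To make the difference between the two slot policies concrete, here is a minimal sketch of how a TaskTracker might be sized from a resource offer. It is not the project's actual code: `TrackerPlan`, `plan_fixed`, `plan_variable`, and the idea of expressing an offer as a whole number of slots are all hypothetical names and simplifications introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrackerPlan:
    """Hypothetical description of one TaskTracker to launch."""
    map_slots: int
    reduce_slots: int


def plan_fixed(offer_slots: int, map_slots: int = 2,
               reduce_slots: int = 2) -> Optional[TrackerPlan]:
    """Fixed policy: every TaskTracker gets the same slot counts.

    Offers too small to hold the fixed layout are declined (None).
    """
    if offer_slots < map_slots + reduce_slots:
        return None
    return TrackerPlan(map_slots, reduce_slots)


def plan_variable(offer_slots: int, pending_maps: int,
                  pending_reduces: int) -> Optional[TrackerPlan]:
    """Variable policy: size the TaskTracker from the job queue.

    Assumption: slots are divided in proportion to pending map and
    reduce tasks, and never exceed the pending demand.
    """
    demand = pending_maps + pending_reduces
    if demand == 0 or offer_slots <= 0:
        return None
    usable = min(offer_slots, demand)
    map_slots = usable * pending_maps // demand
    reduce_slots = min(pending_reduces, usable - map_slots)
    return TrackerPlan(map_slots, reduce_slots)


if __name__ == "__main__":
    # A map-heavy queue gets a map-heavy TaskTracker under the variable policy.
    print(plan_fixed(10))
    print(plan_variable(10, pending_maps=60, pending_reduces=40))
```

In both cases the sketch only decides how many slots to launch; which task runs in which slot is still left to the Hadoop scheduler, as described above.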

Hadoop on Mesos: How we did it

Suggested key configuration values

Name                                      Value
mapred.tasktracker.map.tasks.maximum      50
mapred.tasktracker.reduce.tasks.maximum   50
mapred.mesos.slot.map.minimum             1000
mapred.mesos.slot.reduce.minimum          1000
mapred.mesos.scheduler.policy.fixed       false
mapred.mesos.slot.cpus                    0.95
mapred.mesos.slot.mem                     1550
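As a rough worked example of what the per-slot values imply, the sketch below estimates how many slots fit in a single resource offer. It assumes `mapred.mesos.slot.mem` is expressed in MB, ignores any overhead for the TaskTracker process itself, and uses a hypothetical helper name (`slots_in_offer`); the offer sizes in the usage lines are made up.

```python
import math

# Values taken from the suggested configuration.
SLOT_CPUS = 0.95       # mapred.mesos.slot.cpus
SLOT_MEM_MB = 1550     # mapred.mesos.slot.mem (unit assumed to be MB)
MAX_MAP_SLOTS = 50     # mapred.tasktracker.map.tasks.maximum
MAX_REDUCE_SLOTS = 50  # mapred.tasktracker.reduce.tasks.maximum


def slots_in_offer(offer_cpus: float, offer_mem_mb: float) -> int:
    """Number of slots that fit in one offer, limited by the scarcer resource.

    The result is capped at the combined per-TaskTracker maximum.
    """
    by_cpu = math.floor(offer_cpus / SLOT_CPUS)
    by_mem = math.floor(offer_mem_mb / SLOT_MEM_MB)
    return max(0, min(by_cpu, by_mem, MAX_MAP_SLOTS + MAX_REDUCE_SLOTS))


if __name__ == "__main__":
    # Hypothetical offers: (cpus, memory in MB).
    for cpus, mem in [(16, 32768), (8, 4096)]:
        print(f"{cpus} cpus, {mem} MB -> {slots_in_offer(cpus, mem)} slots")
```

With these numbers a 16-CPU, 32 GiB offer is CPU-bound (16 slots), while an 8-CPU, 4 GiB offer is memory-bound (2 slots).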

Case study: Airbnb

● Engineering & analytics departments use Hive, Pig, Cascading and other tools on Hadoop:
  ○ Building search indices
  ○ Pricing suggestion system
  ○ Trust & safety, fraud detection
  ○ Business analytics
● Dealing with hypergrowth

Case study: Airbnb, yesterday

● Had previously been using EMR, Amazon's managed Hadoop as a service
● EMR suffers from:
  ○ limited Hive/Pig features
  ○ feature lag
  ○ inability to patch or modify Hadoop
● Data infrastructure was prone to error due to significant complexity
  ○ EMR clusters would be spun up & destroyed every week
  ○ accessing Hadoop required strange SSH 'hopping'

Case study: Airbnb, today

● We run Chronos, Hadoop, and Storm on Mesos now

● Finished complete migration to Mesos from EMR (June 2013)

● ~500 Chronos jobs
● ~20TiB of daily Hive data, ~1-2PiB of archived data

● Data availability: all time high

● Eng. & analytics customer satisfaction through the roof

Action shots

Next steps

● Locality awareness
● HDFS on Mesos
● HA JobTracker
● JobTracker on Mesos

Links

● The code: https://github.com/airbnb/mesos
● Airbnb Engineering Blog: http://nerds.airbnb.com/
● My other stuff: https://github.com/brndnmtthws

brenden@diddyinc.com
brenden.matthews@airbnb.com

Thanks!

Questions?
