The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster

Milind Bhandarkar, Chief Scientist, Pivotal Software, Inc.

Upload: milind-bhandarkar

Post on 26-Jan-2015


Category: Data & Analytics



DESCRIPTION

The refactoring of the Hadoop MapReduce framework, which separated resource management (YARN) from job execution (MapReduce), has allowed multiple programming paradigms to take advantage of massive-scale Hadoop Distributed File System (HDFS) clusters. Hamster (Hadoop And Mpi on the same cluSTER) is a port of OpenMPI that uses YARN as its resource manager. Hamster allows applications written using MPI (Message Passing Interface) to run alongside other YARN applications and frameworks, such as MapReduce, on the same Hadoop cluster. In this talk, I will describe the architecture of Hamster and present a few MPI applications that have been demonstrated to run on Hadoop. GraphLab uses MPI as one of its supported communication libraries, and can read data from and write data to HDFS. I will describe how GraphLab runs on top of Hadoop using Hamster, and present a few graph analytics benchmarks comparing GraphLab with other machine learning frameworks.

TRANSCRIPT

Page 1: The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster

The Zoo Expands: Labrador 💛 Elephant, Thanks to Hamster

Milind Bhandarkar, Chief Scientist, Pivotal Software, Inc.

Page 2

About Me

• http://www.linkedin.com/in/milindb

• Founding member of Hadoop team at Yahoo! [2005-2010]

• Contributor to Apache Hadoop since v0.1

• Built and led Grid Solutions Team at Yahoo! [2007-2010]

• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)

• Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)

Page 3

Hamster

• Hadoop and MPI on the same cluster

• Runtime for OpenMPI applications on YARN

• Available on Pivotal HD

Page 4

Why MPI?

• Hadoop dataflow paradigms (MapReduce, Tez, etc.) are not suitable for iterative applications

• Message Passing Interface (MPI)

• Mature standard

• Used extensively in HPC

• Huge ecosystem
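
To make the point about iterative applications concrete, here is a minimal sketch in plain Python. It is a single-process simulation, not MPI code: the "ranks" are just list entries, and `allreduce_sum` and `iterative_mean` are hypothetical names chosen for illustration. The idea it tries to show is that each rank keeps its state in memory across iterations and only exchanges one small value per step, instead of launching a new job and re-reading data for every iteration.

```python
# Simulation only: "ranks" are list entries in one process, not real MPI processes.

def allreduce_sum(local_values):
    """Stand-in for an MPI-style allreduce: every rank receives the global sum."""
    total = sum(local_values)
    return [total] * len(local_values)


def iterative_mean(shards, steps=50, learning_rate=0.5):
    """Estimate the global mean by gradient descent on sum((x - m)^2) / (2n)."""
    n_total = sum(len(shard) for shard in shards)
    estimates = [0.0] * len(shards)  # per-rank state, kept in memory across steps
    for _ in range(steps):
        # Each rank computes a gradient over its own shard only.
        local_grads = [
            sum(estimates[rank] - x for x in shard)
            for rank, shard in enumerate(shards)
        ]
        # One collective per iteration; no job relaunch between iterations.
        global_grads = allreduce_sum(local_grads)
        estimates = [
            m - learning_rate * g / n_total
            for m, g in zip(estimates, global_grads)
        ]
    return estimates
```

Calling `iterative_mean([[1, 2, 3], [4, 5], [6]])` should leave every simulated rank with an estimate very close to the true mean of 3.5.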

Page 5

MPI in Science & Engineering

Earth Atmosphere

Chemistry

Biology

Math Nuclear

Page 6

MPI in Industry

Mechanical ļæ½ar

Finance/bank Oil Exploration Cryptography

Spacecraft

Page 7

OpenMPI

• Mature Open Source implementation of MPI 3.0 Standard (mpi-forum.org)

• New BSD license

• 30+ contributing organizations from academia, research and industry

• http://open-mpi.org

Page 8

OpenMPI Architecture

Page 9

Pluggable

Page 10

Hamster Design

• YARN as Resource Manager

• Hamster Application Manager

• Manages MPI jobs

• (tries to) Implement Gang-Scheduling

• Leverages OMPI/ORTE strengths

• Wire-up, Task monitoring, Fast Interconnect
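
Gang scheduling means an MPI job starts only when all of its processes can be placed at once. YARN hands containers to an application master incrementally, which is presumably why the slide says "tries to". The sketch below shows only the all-or-nothing idea in plain Python; `gang_allocate` and its arguments are hypothetical names, and this is not Hamster's actual allocation code.

```python
def gang_allocate(free_slots, requested):
    """All-or-nothing allocation: grant `requested` slots across nodes or none.

    free_slots: dict mapping a node name to its number of free slots.
    Returns a dict node -> slots granted, or None if the gang does not fit.
    """
    if requested <= 0:
        raise ValueError("requested must be positive")
    if sum(free_slots.values()) < requested:
        return None  # never hold a partial allocation; the MPI job cannot start
    grant = {}
    remaining = requested
    # Use nodes with the most free slots first so the job spans few nodes.
    for node, free in sorted(free_slots.items(), key=lambda kv: -kv[1]):
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            grant[node] = take
            remaining -= take
    return grant
```

A request that fits is granted in full; a request larger than the total free capacity returns `None` rather than a partial grant that would leave some MPI ranks waiting.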

Page 11

Hamster Architecture

[Architecture diagram. Components shown: Client; Resource Manager (Scheduler, AMService); Node Managers (Aux Srvcs, Framework Daemon, NS, Proc/Container); MPI AM (MPI Scheduler, HNP). Labeled links: Client-RM, RM-AM, AM-NM, RM-NodeManager.]

Page 12

Hamster AppMaster

• Master daemon for MPI (similar to JobTracker in MapReduce)

• Implements and participates in the YARN-RM App lifecycle protocol

• Maintains heartbeat with RM to ensure liveness

• MPI Scheduler - Negotiates resource allocation with YARN-RM

• Head Node Process (HNP) - manages job execution
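
The heartbeat bullet can be illustrated with a small liveness tracker. This is a generic sketch under stated assumptions, not YARN's or Hamster's implementation: the class name `LivenessTracker`, the timeout value, and passing the clock in as `now` (to keep the example deterministic) are all choices made here for illustration.

```python
class LivenessTracker:
    """Marks an application master as lost if its heartbeats stop arriving."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # app id -> time of the most recent heartbeat

    def heartbeat(self, app_id, now):
        """Record that `app_id` reported in at time `now` (seconds)."""
        self.last_seen[app_id] = now

    def expired(self, now):
        """Return the sorted ids whose last heartbeat is older than the timeout."""
        return sorted(
            app_id
            for app_id, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

The resource manager side would call `expired` periodically and treat any returned application master as failed; the application master side only has to call `heartbeat` more often than the timeout.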

Page 13

Hamster Node Service

• User-level daemon per MPI job

• Manages task execution

• Coarse-grained container management

• Bootstrapped by YARN-NM

• Implemented as YARN Auxiliary Service

Page 14
Page 15

Why GraphLab on Hadoop?

• Graph analytics & machine learning are only one stage in the end-to-end (E2E) data pipeline

• ETL/Preprocessing

• Building graphs from fact & dimension tables

• Publishing analytics results, post-processing

Page 16

GraphLab 2.2

• Communication patterns based on Data

• Several Toolkits (Graph Analytics + ML Algorithms) available

• Graph-Programming API

• Uses MPI for communication
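
GraphLab 2.x vertex programs are usually described in terms of gather, apply, and scatter steps. The sketch below mirrors the gather and apply steps for PageRank in plain Python, to give a feel for the graph-programming style. It is not the GraphLab C++ API: `pagerank_gas` is a hypothetical name, the loop is synchronous and single-process, and the scatter step (which would signal neighbors to recompute) is omitted because every vertex is recomputed on every iteration.

```python
def pagerank_gas(out_edges, iterations=20, damping=0.85):
    """PageRank written as per-vertex gather and apply steps (synchronous)."""
    vertices = set(out_edges)
    for targets in out_edges.values():
        vertices.update(targets)
    # Build the reverse adjacency so each vertex can gather over its in-edges.
    in_edges = {v: [] for v in vertices}
    for source, targets in out_edges.items():
        for target in targets:
            in_edges[target].append(source)
    out_degree = {v: len(out_edges.get(v, [])) for v in vertices}
    rank = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        new_rank = {}
        for v in vertices:
            # Gather: sum the contributions arriving over in-edges.
            gathered = sum(rank[u] / out_degree[u] for u in in_edges[v])
            # Apply: fold the gathered sum into the new vertex value.
            new_rank[v] = (1.0 - damping) + damping * gathered
        rank = new_rank
    return rank
```

On a three-vertex cycle such as `{"a": ["b"], "b": ["c"], "c": ["a"]}` every vertex keeps the same rank, which is a quick sanity check. The ranks here are unnormalized (they do not sum to 1), a simplification chosen for this sketch.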

Page 17

Pivotal HD

[Stack diagram with a legend distinguishing Apache and Pivotal components. Labels shown: HDFS; HBase; Pig, Hive, Mahout; MapReduce; Sqoop; Flume; Resource Management & Workflow; YARN; Zookeeper; Oozie; Command Center (Configure, Deploy, Monitor, Manage); Spring XD; Spring; Pivotal HD Enterprise; HAWQ – Advanced Database Services (Xtension Framework, Catalog Services, Query Optimizer, Dynamic Pipelining, ANSI SQL + Analytics); GemFire XD – Real-Time Database Services (Distributed In-memory Store, Query Transactions, Ingestion Processing, Hadoop Driver – Parallel with Compaction, ANSI SQL + In-Memory); MADlib Algorithms; Virtual Extensions; GraphLab, Open MPI.]

Page 18

Performance

Page 19

Test Environment

• Pivotal Analytics Workbench Cluster

• Pivotal HD 1.1 (Apache Hadoop 2.0.5)

• Hamster 1.0, OpenMPI 1.7.2

• 515 nodes

• 2x6-core Westmere, 48GB RAM, 12x2TB SATA, Mellanox FDR InfiniBand

Page 20

Null Job

• Measures overhead of launching MPI jobs

• Tests scalability of resource allocation, launching and wire-up

• Sub-linear scalability (slightly worse than O(log N))

• Overhead of launching 15000 processes = 1 minute
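
As a rough worked reading of these numbers, the sketch below computes the amortized launch cost per process and what a pure c * log2(N) curve would predict if it passed through the single reported point (15000 processes, about 60 seconds). Calibrating on that one point is an assumption made here for illustration, not a fit reported in the talk, and the function names are hypothetical. Since the slide says the real scaling is slightly worse than O(log N), the measured curve would not follow this model exactly.

```python
import math


def log_model_seconds(processes, ref_processes=15000, ref_seconds=60.0):
    """Launch time under a pure c * log2(N) model calibrated at one point."""
    if processes < 2 or ref_processes < 2:
        raise ValueError("need at least 2 processes for a log model")
    c = ref_seconds / math.log2(ref_processes)
    return c * math.log2(processes)


def amortized_seconds_per_process(processes=15000, total_seconds=60.0):
    """Average launch overhead per process for the reported data point."""
    return total_seconds / processes
```

The amortized cost works out to 60 / 15000 = 0.004 seconds (4 ms) per process. Under the log model, halving the process count from 15000 to 7500 would shave off only a few seconds, which is the practical meaning of logarithmic launch scaling.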

Page 21

Total Runtime

[Chart: E2E time in seconds (axis 5 to 60) versus number of processes (axis 0 to 16000).]

Page 22

Allocation Time

[Chart: allocation time in seconds (axis 1 to 6) versus number of processes (axis 0 to 16000).]

Page 23

Launch Time

[Chart: launch time in seconds (axis 0 to 30) versus number of processes (axis 0 to 16000).]

Page 24

Comparison with OpenMPI

• HPL (High-Performance Linpack, the Top500 benchmark)

• Number of processes: 50 to 1000

• Hamster 1% slower than OpenMPI

Page 25

HPL - Hamster vs OpenMPI

[Chart: time in seconds (axis 0 to 120); x-axis labels 1000, 500, 200, 50.]

Page 26

GraphLab ALS

• Wikipedia dataset

• 4.3M terms, 3.3M documents, 513M occurrences

• 17 Processes

• 5 Iterations

Page 27

GraphLab ALS

[Chart: time in seconds (axis 0 to 1340), Hamster versus OpenMPI.]

Page 28

GraphLab PageRank

• Twitter Dataset

• 4.1 M nodes, 1.4 B edges

• Data size: 26 GB

• NP = 17

• 50 iterations: 297 seconds

• 100 iterations: 339 seconds
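
The two timings allow a back-of-the-envelope split between fixed cost and per-iteration cost. The sketch assumes total time is linear in the iteration count (time = fixed + per_iteration * iterations); that linear model is an assumption made here, not something stated in the talk, and `fit_linear` is a hypothetical helper name. If later iterations are cheaper because most vertices have already converged, the true fixed cost would be lower than this estimate.

```python
def fit_linear(iters_a, secs_a, iters_b, secs_b):
    """Solve time = fixed + per_iteration * iterations from two measurements."""
    if iters_a == iters_b:
        raise ValueError("need two different iteration counts")
    per_iteration = (secs_b - secs_a) / (iters_b - iters_a)
    fixed = secs_a - per_iteration * iters_a
    return fixed, per_iteration


# Figures taken from the slide: 50 iterations in 297 s, 100 iterations in 339 s.
fixed_seconds, seconds_per_iteration = fit_linear(50, 297.0, 100, 339.0)
```

Under this model the extra 50 iterations cost 42 seconds, or about 0.84 seconds per iteration, leaving roughly 255 seconds of fixed cost, which would cover things like job launch, loading the 26 GB dataset, and graph construction.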

Page 29

Questions?