Hadoop: The Default Machine Learning Platform? Milind Bhandarkar, Chief Scientist, Pivotal (@techmilind), Wednesday, December 18, 2013

Upload: milind-bhandarkar

Post on 26-Jan-2015

DESCRIPTION

Apache Hadoop, since its humble beginnings as an execution engine for web crawling and building search indexes, has matured into a general-purpose distributed application platform and data store. Large Scale Machine Learning (LSML) techniques and algorithms have proved quite tricky for Hadoop to handle ever since we started offering Hadoop as a service at Yahoo in 2006. In this talk, I will discuss early experiments in implementing LSML algorithms on Hadoop at Yahoo. I will describe how this work changed Hadoop and led to the generalization of the Hadoop platform to accommodate programming paradigms other than MapReduce. I will unveil some of our recent efforts to incorporate diverse LSML runtimes into Hadoop, evolving it to become *THE* LSML platform. I will also make a case for an industry-standard LSML benchmark based on common deep analytics pipelines that use LSML workloads.

TRANSCRIPT

Page 1: Hadoop: The Default Machine Learning Platform ?

Hadoop: The Default Machine Learning Platform?

Milind Bhandarkar, Chief Scientist, Pivotal

@techmilind

Wednesday, December 18, 2013

Page 2

About Me

• http://www.linkedin.com/in/milindb

• Founding member of Hadoop team at Yahoo! [2005-2010]

• Contributor to Apache Hadoop since v0.1

• Built and led Grid Solutions Team at Yahoo! [2007-2010]

• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)

• Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)

Page 3

Acknowledgements

• Developers of various open-source and proprietary data platforms

• Ex-Colleagues at Yahoo! Research

• Colleagues at Data Science Team, Pivotal

• Vijay Narayanan, Microsoft

Page 5

Kryptonite: First Hadoop Cluster At Yahoo!

Page 6

M45: Academic Collaboration

Page 7

Analytics Workbench

Page 9

Hadoop 1.0

[Figure labels: BATCH, INTERACTIVE, ONLINE, HDFS]

(Image Courtesy Arun Murthy, Hortonworks)

Page 10

MapReduce 1.0

(Image Courtesy Arun Murthy, Hortonworks)

Page 16

ML in MapReduce: Why ?

• High data throughput: 100 TB/hr using 500 mappers

• Framework provides fault tolerance

• Monitors tasks and restarts them on other machines should one of the machines fail

• Excels in counting patterns over data records

• Built on relatively cheap, commodity hardware

• Large volumes of data already stored on Hadoop clusters running MapReduce

Page 17

ML in MapReduce: Why ?

• Learning can become limited by computation time and not data volume

• With large enough data and number of machines

• Reduces the need to down-sample data

• More accurate parameter estimates compared to learning on a single machine for the same amount of time

Page 18

Learning Models in MapReduce

• Data parallel algorithms are most appropriate for MapReduce implementations

• Not necessarily the optimal implementation for a specific algorithm

• Other specialized non-MapReduce implementations exist for some algorithms, which may be better

• MR may not be the appropriate framework for exact solutions of non-data-parallel (sequential) algorithms

• Approximate solutions using MR may be good enough

Page 19

Types of Learning in MapReduce

• Parallel training of multiple models

• Train either in Mappers or Reducers

• Ensemble training methods

• Train multiple models and combine them

• Distributed learning algorithms

• Learn using both Mappers and Reducers

Page 20

Parallel Training of Multiple Models

• Train multiple models simultaneously, using a learning algorithm whose training fits in memory

• Useful when individual models are trained on a subset, a filtered view, or a modification of the raw data

• Train 1 model in each reducer

• Map:

• Input: All data

• Filters subset of data relevant for each model training

• Output: <model_index, subset of data for training this model>

• Reduce

• Train model on data corresponding to that model_index
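The Map and Reduce roles above can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop code: the `shuffle` helper stands in for the framework's grouping step, the "model" is just a mean of labels, and the record layout and filter functions are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def map_record(record, model_filters):
    """Map: emit (model_index, record) for every model whose filter accepts the record."""
    for model_index, keep in enumerate(model_filters):
        if keep(record):
            yield model_index, record


def shuffle(pairs):
    """Stand-in for the framework's shuffle: group values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_train(model_index, records):
    """Reduce: train one in-memory model; here the 'model' is just the mean label."""
    return model_index, mean(label for _, label in records)


def train_models(data, model_filters):
    """Run map, shuffle, and reduce in sequence and return {model_index: model}."""
    pairs = (kv for record in data for kv in map_record(record, model_filters))
    return dict(reduce_train(k, v) for k, v in shuffle(pairs).items())


# Hypothetical data: (segment, label) records, one model per segment.
data = [("us", 1.0), ("us", 3.0), ("eu", 10.0)]
filters = [lambda r: r[0] == "us", lambda r: r[0] == "eu"]
models = train_models(data, filters)
```

Because each reducer sees only the records for its own `model_index`, the models train independently and in parallel on a real cluster.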

Page 21

Distributed Learning Algorithms

• Suitable for learning algorithms that are

• Compute-Intensive per data record

• One or few iterations for learning

• Do not transfer much data between iterations

• Typical algorithms

• Fit the Statistical query model (SQM)

• Divide and conquer
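As one illustration of an algorithm that fits the statistical query model, simple least-squares regression needs only a handful of sums over the data. The sketch below is an assumption-laden example, not code from the talk: each "mapper" reduces its partition to partial sums, and a single "reducer" adds them and solves for the line. The function names and the toy data are hypothetical.

```python
def map_partition(points):
    """Map: reduce one data partition to partial sums (n, Sx, Sy, Sxx, Sxy)."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in points:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def reduce_sums(partials):
    """Reduce: add the partial sums and solve for y = intercept + slope * x."""
    n, sx, sy, sxx, sxy = (sum(column) for column in zip(*partials))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Two hypothetical partitions of points lying on y = 1 + 2x.
partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
intercept, slope = reduce_sums(map_partition(p) for p in partitions)
```

Only five numbers per partition cross the network, which is why this class of algorithm needs a single pass and transfers little data.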

Page 22

k-Means Clustering

• Choose k samples as initial cluster centroids

• In each MapReduce Iteration:

• Assign membership of each point to closest cluster

• Re-compute new cluster centroids using assigned members

• Control program to

• Initialize the centroids

• random, initial clustering on sample etc.

• Run the MapReduce iterations

• Determine stopping criterion

Page 25

k-Means Clustering in MapReduce

• Map

• Input data points: x_i

• Input cluster centroids: c_i

• Assign each data point to closest cluster

• Output: <index of closest cluster, data point>

• Reduce

• Compute new centroids for each cluster
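A minimal single-process sketch of this scheme follows. It is an illustration under stated assumptions, not Hadoop or Mahout code: a dictionary stands in for the shuffle, Euclidean distance is assumed, and the `kmeans` control loop with its `tolerance` parameter is a hypothetical stand-in for the external controller.

```python
from collections import defaultdict
import math


def kmeans_map(point, centroids):
    """Map: emit (index of the closest centroid, point)."""
    closest = min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))
    return closest, point


def kmeans_reduce(points):
    """Reduce: the new centroid is the coordinate-wise mean of the assigned points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def kmeans_iteration(points, centroids):
    """One MapReduce pass; the dict stands in for the shuffle."""
    groups = defaultdict(list)
    for point in points:
        j, p = kmeans_map(point, centroids)
        groups[j].append(p)
    # A centroid that attracted no points keeps its old position.
    return [kmeans_reduce(groups[j]) if j in groups else centroids[j]
            for j in range(len(centroids))]


def kmeans(points, centroids, max_iterations=20, tolerance=1e-9):
    """Control program: run iterations until the centroids stop moving."""
    for _ in range(max_iterations):
        updated = kmeans_iteration(points, centroids)
        shift = max(math.dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= tolerance:
            break
    return centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Note that every call to `kmeans_iteration` rereads all of `points`; on MapReduce that reread is a full scan of the input per iteration.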

Page 26

Complexity of k-Means Clustering

• Each point is compared with each cluster centroid

• Complexity per iteration = N*K*O(d), where N is the number of points, K the number of clusters, and O(d) the cost of one distance computation

• Typical Euclidean distance is not a cheap operation

• Can reduce complexity using an initial canopy clustering to partition data cheaply

Page 27

Apache Mahout

• Goal

• Create scalable machine learning algorithms under the Apache license

• Contains both:

• Hadoop implementations of algorithms that scale linearly with data.

• Fast sequential (non MapReduce) algorithms

• Wiki:

• https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+Wiki

Page 28

Algorithms in Mahout

• Classification:

• Logistic Regression, Naïve Bayes, Complementary Naïve Bayes, Random Forests

• Clustering

• K-means, Fuzzy k-means, Canopy, Mean-shift clustering, Dirichlet Process clustering, Latent Dirichlet allocation, Spectral clustering

• Parallel FP growth

• Item based recommendations

• Stochastic Gradient Descent (sequential)

Page 29

Challenges for ML with MapReduce

• MapReduce is optimized for large batch data processing

• Assumes data parallelism

• Ideal for shared-nothing computing

• Many learning algorithms are iterative

• Incur significant overheads per iteration

Page 30

Challenges (contd.)

• Multiple scans of the same data

• Typically once per iteration: High I/O overhead reading data into mappers per iteration

• In some algorithms static data is read into mappers in each iteration

• e.g. input data in k-means clustering.

Page 31

Challenges (contd.)

• Need a separate controller outside the framework to:

• Coordinate the multiple MapReduce jobs for each iteration

• Perform some computations between iterations and at the end

• Measure and implement stopping criterion

Page 32

Challenges (contd.)

• Incur multiple task initialization overheads

• Setup and tear down mapper and reducer tasks per iteration

• Transfer/shuffle static data between mapper and reducer repeatedly

• Intermediate data is transferred through index/data files on local disks of mappers and pulled by reducers

Page 33

Challenges (contd.)

• Blocking architecture

• Reducers cannot start until all map tasks complete

• Availability of nodes in a shared environment

• Wait for mapper and reducer nodes to become available in each iteration in a shared computing cluster

Page 35

Iterative Algorithms in MapReduce

Overhead per Iteration:

• Job setup

• Data loading

• Disk I/O

Pass Result

Page 36

Enhancements to MapReduce

• Many proposals to overcome these challenges

• All try to retain the core strengths of data partitioning and fault tolerance of Hadoop to various degrees

• Proposed enhancements and alternatives to Hadoop

• Worker/Aggregator framework, HaLoop, MapReduce Online, iMapReduce, Spark, Twister, Hadoop ML, ...

Page 38

Worker/Aggregator

Advantages:

• Schedule once per job

• Data stays in memory

• P2P communication

Final Result

Page 39

HaLoop

• Programming model and architecture for iterations

• New APIs to express iterations in the framework

• Loop-aware task scheduling

• Physically co-locate tasks that use the same data in different iterations

• Remember association between data and node

• Assign each task to the node that holds its data in cache

Page 40

HaLoop (contd.)

• Caching for loop invariant data:

• Detect invariants in first iteration, cache on local disk to reduce I/O and shuffling cost in subsequent iterations

• Cache for Mapper inputs, Reducer Inputs, Reducer outputs

• Caching to support fixpoint evaluation:

• Avoids the need for a dedicated MR step on each iteration

Page 41

Spark

• Open Source Cluster Computing model:

• Different from MapReduce, but retains some basic character

• Optimized for:

• Iterative computations

• Applies to many learning algorithms

• Interactive data mining

• Load data once into multiple mappers and run multiple queries

Page 42

Spark (contd.)

• Programming model using working sets

• applications reuse intermediate results in multiple parallel operations

• preserves the fault tolerance of MapReduce

• Supports

• Parallel loops over distributed datasets

• Loads data into memory for (re)use in multiple iterations

• Shared variables accessible from multiple machines

• Implemented in Scala

• www.spark-project.org

Page 43

Hadoop 2.0

(Image Courtesy Arun Murthy, Hortonworks)

[Figure panels: HADOOP 1.0, HADOOP 2.0]

Page 44

YARN Platform

(Image Courtesy Arun Murthy, Hortonworks)

Page 45

YARN Architecture

(Image Courtesy Arun Murthy, Hortonworks)

Page 46

YARN

• Yet Another Resource Negotiator

• Resource Manager

• Node Managers

• Application Masters

• Specific to paradigm, e.g. MR Application master (aka JobTracker)

Page 47

MPP SQL On Hadoop

Page 48

SQL-on-Hadoop

• Pivotal HAWQ

• Cloudera Impala, Facebook Presto, Apache Drill, Cascading Lingual, Optiq, Hortonworks Stinger

• Hadapt, Jethrodata, IBM BigSQL, Microsoft PolyBase

• More to come...

Page 49

[Figure: HAWQ architecture]

• HAWQ & HDFS Master Servers: planning & dispatch

• Network Interconnect

• Segment Servers: query execution

• Storage: HDFS, HBase …

Page 50

[Figure: HAWQ on HDFS. Labels: Namenode, replication, Rack1, Rack2, Datanode, Read/Write, Master host, Meta Ops, HAWQ Interconnect, Segment host, Segment]

Page 51

HAWQ vs Hive

Lower is Better

Page 52

In-Database Analytics

Provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.

Page 53

MADlib Algorithms

Page 54

MADlib Functions

• Linear Regression

• Logistic Regression

• Multinomial Logistic Regression

• K-Means

• Association Rules

• Latent Dirichlet Allocation

• Naïve Bayes

• Elastic Net Regression

• Decision Trees / Random Forest

• Support Vector Machines

• Cox Proportional Hazards Regression

• Descriptive Statistics

• ARIMA

Page 55

k-Means Usage

SELECT * FROM madlib.kmeanspp (
    'customers',  -- name of the input table
    'features',   -- name of the feature array column
    2             -- k : number of clusters
);

centroids                                                               | objective_fn     | frac_reassigned | …
------------------------------------------------------------------------+------------------+-----------------+ …
{{68.01668579784,48.9667382972952},{28.1452167573446,84.5992507653263}} | 586729.010675982 | 0.001           | …

Page 56

Accessing HAWQ Through R

Page 57

Pivotal R

• Interface is R client

• Execution is in database

• Parallelism handled by PivotalR

• Supports a portion of R

R> x = db.data.frame("t1")
R> l = madlib.lm(interlocks ~ assets + nation, data = x)

Page 61

A wrapper of MADlib

• Linear regression

• Logistic regression

• Elastic Net

• ARIMA

• Table summary

• Categorical variable: as.factor()

• $ [ [[ $<- [<- [[<-

• is.na

• + - * / %% %/% ^

• & | !

• == != > < >= <=

• merge

• by

• db.data.frame

• as.db.data.frame

• preview

• sort

• c mean sum sd var min max length colMeans colSums

• db.connect db.disconnect db.list db.objects db.existsObject delete

• dim names

• content

And more ... (SQL wrapper)

• predict

Page 71

[Figure: PivotalR class hierarchy and conversions]

• db.obj: parent class of db.data.frame and db.Rquery

• db.data.frame (db.table, db.view): wrapper of objects in the database, e.g. x = db.data.frame("table")

• db.Rquery: resides in R only, e.g. x[,1:2], merge(x, y, by="column"), etc.

• Figure labels: Most operations, as.db.data.frame(...), MADlib wrapper functions, preview

Page 72

In-Database Execution

• All data stays in DB: R objects merely point to DB objects

• All model estimation and heavy lifting done in DB by MADlib

• R → SQL translation done in the R client

• Only strings of SQL and model output transferred across ODBC/DBI

Page 73

Beyond MapReduce

• Apache Giraph - BSP & Graph Processing

• Storm on Yarn - Streaming Computation

• HOYA - HBase on Yarn

• Hamster - MPI on Hadoop

• More to come ...

Page 74

Hamster

• Hadoop and MPI on the same cluster

• OpenMPI Runtime on Hadoop YARN

• Hadoop Provides: Resource Scheduling, Process monitoring, Distributed File System

• Open MPI Provides: Process launching, Communication, I/O forwarding

Page 75

Hamster Components

• Hamster Application Master

• Gang Scheduler, YARN Application Preemption

• Resource Isolation (lxc Containers)

• ORTE: Hamster Runtime

• Process launching, Wireup, Interconnect

Page 76

Hamster Architecture

[Figure labels: Resource Manager (Scheduler, AMService), Node Manager, Proc/Container, Framework Daemon, NS, MPI Scheduler, HNP, MPI AM, Aux Srvcs, Client; protocols: Client-RM, RM-AM, AM-NM, RM-NodeManager]

Page 77

Hamster Scalability

• Sufficient for small to medium HPC workloads

• Job launch time gated by YARN resource scheduler

          Launch    WireUp    Collectives  Monitor
OpenMPI   O(logN)   O(logN)   O(logN)      O(logN)
Hamster   O(N)      O(logN)   O(logN)      O(logN)
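To get a feel for the Launch column, the sketch below compares an idealized O(N) launch with an idealized O(logN) launch. It is a toy model with hypothetical names, not a measurement of Hamster or Open MPI: "linear" assumes processes are started one at a time, and "tree" assumes a binary fan-out in which every started process helps start others.

```python
import math


def launch_steps(num_processes, scheme):
    """Idealized number of sequential steps needed to start num_processes.

    'linear' models one-at-a-time launch, O(N); 'tree' models a binary
    fan-out in which every started process starts others, O(log N).
    """
    if scheme == "linear":
        return num_processes
    if scheme == "tree":
        return math.ceil(math.log2(num_processes)) if num_processes > 1 else 1
    raise ValueError(f"unknown scheme: {scheme}")


# Print the step counts side by side for a few job sizes.
for n in (16, 1024, 65536):
    print(n, launch_steps(n, "linear"), launch_steps(n, "tree"))
```

Under these assumptions the gap is small for tens of processes and large for tens of thousands, which is consistent with the slide's "small to medium HPC workloads" remark.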

Page 78

GraphLab + Hamster on Hadoop

Page 79

About GraphLab

• Graph-based, High-Performance distributed computation framework

• Started by Prof. Carlos Guestrin at CMU in 2009

• GraphLab Inc. was recently founded to commercialize GraphLab (graphlab.org)

Page 80

GraphLab Features

• Topic Modeling (e.g. LDA)

• Graph Analytics (Pagerank, Triangle counting)

• Clustering (K-Means)

• Collaborative Filtering

• Linear Solvers

• etc...

Page 81

Only Graphs are not Enough

• A full data-processing workflow requires ETL/postprocessing, visualization, data wrangling, and serving

• MapReduce excels at data wrangling

• OLTP/NoSQL Row-Based stores excel at Serving

• GraphLab should co-exist with other Hadoop frameworks

Page 82

Questions?