
Machine Learning in the Cloud with Spark

R212 Data Centric Systems and Networking
Open Source Project Study

by Haikal Pribadi

Machine Learning in the Cloud with Spark

Problem domain

Motivation and Contribution

Project Goal

Project Evaluation

Converging Trends

Motivation and Contribution

Why Scale Up to the Cloud?

Input data size
– Large training data instances
● e.g. DryadLINQ and MapReduce
– High input dimensionality
● e.g. GraphLab and GraphChi

Why Scale Up to the Cloud?

Complexity of data and computation
– Data complexity brings algorithm complexity
● e.g. PLANET (on top of MapReduce)

Why Scale Up to the Cloud?

Time constraints and parameter tuning
– Use distributed systems to increase throughput
– Model and hyper-parameter selection are repetitive and independent, so they parallelise naturally (see the sketch below)
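Because each hyper-parameter setting can be trained and validated independently, the search maps naturally onto Spark. The following is a minimal sketch of that idea, not the project's code: the stand-in error computation and the application name are assumptions, and a real run would call an MLlib/MLbase learner inside the map.

import org.apache.spark.{SparkConf, SparkContext}

object GridSearchSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("grid-search-sketch"))

    // Candidate regularisation strengths; each one is independent of the others.
    val lambdas = sc.parallelize(Seq(0.001, 0.01, 0.1, 1.0, 10.0))

    // One Spark task per candidate: "train a model, return its validation error".
    val scored = lambdas.map { lambda =>
      val validationError = math.abs(lambda - 0.1)   // stand-in for a real training call
      (validationError, lambda)
    }

    val (bestError, bestLambda) = scored.min()
    println(s"best lambda = $bestLambda (validation error $bestError)")
    sc.stop()
  }
}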

Why Scale Up to the Cloud?

Data Parallelism
– MapReduce, GraphLab

Task Parallelism
– Multicores, GPUs (CUDA), MPI

or perhaps Hybrid?
– Spark, GraphLab, DryadLINQ
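As an illustration of the data-parallel side of this spectrum, the sketch below partitions the training points across the cluster, has every partition compute partial logistic-loss gradients, and combines them with a reduce on the driver. The toy in-memory data, fixed step size, and plain gradient descent are assumptions for illustration, not anything MLbase provides.

import org.apache.spark.{SparkConf, SparkContext}

object DataParallelGradientSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dp-gradient-sketch"))

    // Toy training set: (label in {-1, +1}, single feature value).
    val data = sc.parallelize(Seq((1.0, 2.0), (-1.0, -1.5), (1.0, 3.0), (-1.0, -0.5))).cache()

    var w = 0.0                                   // model weight, kept on the driver
    for (_ <- 1 to 10) {
      // Each partition computes partial gradients; reduce sums them.
      val grad = data.map { case (y, x) =>
        -y * x / (1.0 + math.exp(y * w * x))      // gradient of the logistic loss
      }.reduce(_ + _)
      w -= 0.1 * grad                             // fixed-step gradient descent
    }
    println(s"learned weight: $w")
    sc.stop()
  }
}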

Problem?

With the various options of distributed architectures available, implementing machine learning systems becomes very task-specific

Different architectures bring different benefits and constraints

Spark + MLbase

Unified scalable machine learning

Suitable for many common Machine Learning problems

(project hypothesis)

Project Goal

Develop Mainstream ML

Evaluate Spark+MLbase by developing common Machine Learning applications

Develop Mainstream ML

Classification
– e.g. Bayesian classifier for spam filtering

Clustering
– e.g. k-means clustering for market segmentation (see the sketch below)

Regression
– e.g. linear regression for weather forecasting

Collaborative Filtering
– e.g. Alternating Least Squares for recommendation systems
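To make the clustering item concrete, here is a sketch of k-means for market segmentation on Spark. It uses the MLlib API that grew out of the MLbase effort, which may differ from the MLbase interface the project targeted; the input path and the choice of k = 5 are placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object MarketSegmentationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-sketch"))

    // Each input line holds one customer's features as whitespace-separated numbers.
    val points = sc.textFile("hdfs:///path/to/customers.txt")
      .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
      .cache()

    val k = 5                                     // number of market segments (placeholder)
    val maxIterations = 20
    val model = KMeans.train(points, k, maxIterations)

    model.clusterCenters.zipWithIndex.foreach { case (centre, i) =>
      println(s"segment $i centre: $centre")
    }
    sc.stop()
  }
}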

Project Goal

Platform
– Amazon EC2

Run time
– Spark

Application
– MLbase

Project Evaluation

Evaluation and Analysis

Parallelism
– Granularity of data parallelism and task parallelism

Algorithm complexity
– Complexity of customization

Programming paradigm
– Learning curve, expressiveness and dataflow

Dataset distribution
– Management of large datasets

Performance Comparison

Select the most suitable algorithm to serve as the benchmark

Compare performance with, e.g., GraphLab
– Speed (sequential runs)
– Scalability (efficiency increase with parallelism)
– Throughput (input size / time)
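One way these numbers could be collected is with a simple wall-clock harness like the sketch below; the benchmarked body, the input size, and the object name are placeholders rather than anything from the project.

object BenchmarkSketch {
  // Run a block once and return its result together with the elapsed seconds.
  def time[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e9)
  }

  def main(args: Array[String]): Unit = {
    val inputSize = 1000000L                      // number of training instances (placeholder)
    val (_, seconds) = time {
      // Placeholder for the job under test, e.g. a full KMeans.train(...) run.
      (1L to inputSize).map(i => math.sqrt(i.toDouble)).sum
    }
    println(f"wall-clock: $seconds%.2f s, throughput: ${inputSize / seconds}%.0f instances/s")
  }
}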

Thank you!
Questions are very welcome

Haikal Pribadi
hp356@cam.ac.uk
