distributed machine learning 101 using apache spark from a browser devoxx.be2015

Post on 12-Apr-2017

TRANSCRIPT

Distributed Machine Learning using Apache Spark from the Browser

Devoxx Belgium 2015, Antwerpen

Outline

● Distributed computing
● What is Machine Learning?
● Spark for machine learning?
● Spark MLlib by examples
● Spark and other libraries
● Wrap up

Data Fellas

Andy Petrella: Maths, Geospatial, Distributed Computing; Spark Notebook, Trainer Spark/Scala, Machine Learning

Xavier Tordoir: Physics, Bioinformatics, Distributed Computing; Scala (& Perl), Trainer Spark, Machine Learning

Distributed Computing: Why you must care, by Data Fellas

Andy Petrella & Xavier Tordoir

Traditionally, tasks are performed entirely on a single computer using three main resources.

Uba ga!

Computing

Processing Power, Memory, Storage

Oh no!

Hence performance is limited in time and space.

Processing Power, Memory, Storage

TIME, SPACE

Distributed computing: [...] A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages.

The components interact with each other in order to achieve a common goal. [...].

Ref: https://en.wikipedia.org/wiki/Distributed_computing

Distributing

Interesting

Consequences

Oh no!

Algorithms have to work on data partitions and with partial results

The entire dataset cannot be accessed at once
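A minimal sketch of what "partitions and partial results" can mean in practice, in plain Python rather than Spark. The helper names (`partial_sum_count`, `combine`, `distributed_mean`) and the toy data are illustrative assumptions, not something shown in the talk. Each partition only produces a small (sum, count) pair, and the mean is rebuilt from those pairs, so no step needs the whole dataset at once.

```python
from typing import List, Tuple


def partial_sum_count(partition: List[float]) -> Tuple[float, int]:
    """Partial result computed from a single partition only."""
    return (sum(partition), len(partition))


def combine(a: Tuple[float, int], b: Tuple[float, int]) -> Tuple[float, int]:
    """Merge two partial results; this is all that has to travel between nodes."""
    return (a[0] + b[0], a[1] + b[1])


def distributed_mean(partitions: List[List[float]]) -> float:
    """Mean of all values, built from per-partition (sum, count) pairs."""
    total = (0.0, 0)
    for part in partitions:
        total = combine(total, partial_sum_count(part))
    return total[0] / total[1]


# Toy dataset split into three partitions (hypothetical values).
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean = distributed_mean(partitions)
print(mean)
```

Only the small (sum, count) pairs are combined, which is the kind of algorithm shape the slide asks for.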

New resource!

Damned

Processing Power, Memory, Storage, Network

SPACE, TIME

The network will impact performance...

Oops did it again

Distributing

[Diagram: four nodes, each with its own processing, memory and storage, connected by a network]

Drawback: Partition

Huh?

[Diagram: the same four-node cluster connected by a network]

Hey, you sank my node!

[Diagram: three nodes remain connected by the network; the fourth node goes BOOM]

Ouch, my rack

Advantage: Elastic scaling

[Diagram: a four-node cluster, each node with processing, memory and storage, connected by a network]

What if this cluster happens to not be big enough?

That’s more reasonable

[Diagram: two four-node clusters, each with its own network, joined by a further network]

HPC: computationally intensive applications

Model: specialized hardware (CPU/GPU) and network

They are orchestrated by a scheduler that gathers their computing power and memory.

Yeah! what about?

What about HPC?

Drawbacks:

● Costs and upgrades by large blocks
● Decoupled storage

Storage latency = no streaming / no iteration

Got No Money and NO time

Why processing data if not to model?

Machine learning: iterative (streaming & batch)

Data is aggregated in the form of a model (parameters)

Data changes little, model is small

Do that baby!

Iterate

you gotta be kidding

[Diagram: processing/memory nodes and separate storage units]

Moving lots of data again and again...

Summary

Distributed computing allows cost-effective parallelism

Efficiency requires distributed storage, colocated with the processing units

What about programming models?

Interesting

Distributed storage

Partitions!

HDFS: Apache implementation of Google FS

● Natural fit for distributed storage
● Works as a service

Other chunked sources...

● Apache Cassandra, S3, Tachyon,...

Distributed storage

Split da

[Diagram: a 256 MB file is written with "put /data/f256.txt", replication factor 2, to a cluster with one Name Node and Data Nodes 1 to 4]

[Diagram: the file is split into four 64 MB blocks]

Everywhere

[Diagram: the 64 MB blocks, labelled like "put /data/f256.txt/part-r-00000", are spread across the four Data Nodes]

Replicate

[Diagram: the blocks are copied so that each one is stored on more than one Data Node]
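The split-and-replicate steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `split_into_blocks` and `place_replicas` are hypothetical helpers, and the round-robin placement is a simplification (in real HDFS the Name Node decides placement and is rack-aware). The numbers are the ones on the slides: a 256 MB file, 64 MB blocks, replication factor 2, four Data Nodes.

```python
from typing import Dict, List


def split_into_blocks(file_size_mb: int, block_size_mb: int) -> List[int]:
    """Sizes (in MB) of the blocks a file is cut into."""
    sizes = []
    remaining = file_size_mb
    while remaining > 0:
        size = min(block_size_mb, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


def place_replicas(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assumed round-robin placement: block b goes to `replication` consecutive nodes.

    Replicas of one block land on distinct nodes as long as
    replication <= len(nodes).
    """
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


blocks = split_into_blocks(256, 64)
nodes = ["Data Node 1", "Data Node 2", "Data Node 3", "Data Node 4"]
placement = place_replicas(len(blocks), nodes, replication=2)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id], "MB ->", holders)
```

With these inputs there are four blocks and eight stored copies in total, two per block.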

Map Reduce: High Level Execution

The rocket’s base

Load the data
[Diagram: four data parts]

The rocket’s engines

Map and Pair
[Diagram: each data part is processed by its own mapper]

The rocket’s trunk

Shuffle Pairs using Keys
[Diagram: the mapper outputs flow into GroupByKey]

The rocket’s cockpit

Values per key are Reduced
[Diagram: GroupByKey feeds three reducers]

The rocket’s tip

We collect the results
[Diagram: the reducer outputs are gathered into Results]

To the infinite and beyond!

The whole pipeline
[Diagram: data parts, mappers, GroupByKey, reducers, Results]
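The stages above (load, map and pair, shuffle by key, reduce, collect) can be imitated in plain Python with the classic word count. This is a single-process sketch, not Hadoop or Spark code; `mapper`, `group_by_key`, `reducer` and the toy `data_parts` are names chosen here for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(part: List[str]) -> Iterator[Tuple[str, int]]:
    """Map and pair: emit (word, 1) for every word of one data part."""
    for line in part:
        for word in line.split():
            yield (word, 1)


def group_by_key(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: gather all values that share the same key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values of one key into a single result."""
    return (key, sum(values))


# Load the data: two data parts (hypothetical content).
data_parts = [["spark is fast"], ["spark is fun", "fun is good"]]

# One mapper per data part, then shuffle, reduce and collect.
pairs = [pair for part in data_parts for pair in mapper(part)]
results = dict(reducer(key, values) for key, values in group_by_key(pairs).items())
print(results)
```

In a real cluster the mappers and reducers would run on different nodes and the shuffle would move pairs over the network; here everything runs in one process to keep the data flow visible.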

Map Reduce: Matrix-Vector Product

How about word count?

Back to school...

Wait, that’s maths

[Equation slides: the matrix-vector product written out; the formulas were images and did not survive in the transcript]

Where is the RAT?

Store the matrix as ordered [...]

Vector V loaded in memory as ordered [...]

Map function:

Each matrix element is mapped to a product

OK … I TAKE OVER

MAP

just a sum …

REDUCE
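A hedged sketch of the matrix-vector product as map and reduce. The slide formulas are missing from the transcript, so the representation used here is an assumption based on the common textbook formulation: the matrix is stored as (i, j, m_ij) triples, the vector v is held in memory, the map step emits (i, m_ij * v_j), and the reduce step sums per row i, giving x_i = sum over j of m_ij * v_j. The names `map_entry` and `reduce_by_key` are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed storage: one (row, column, value) triple per matrix element.
# This encodes M = [[1, 2], [3, 4]].
matrix_entries = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]

# Assumed storage: the vector fits in memory, indexed by position.
vector = {0: 10.0, 1: 20.0}


def map_entry(entry: Tuple[int, int, float], v: Dict[int, float]) -> Tuple[int, float]:
    """MAP: each matrix element is mapped to a product, keyed by its row."""
    i, j, m_ij = entry
    return (i, m_ij * v[j])


def reduce_by_key(pairs: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """REDUCE: just a sum of the products that share the same row key."""
    sums: Dict[int, float] = defaultdict(float)
    for i, product in pairs:
        sums[i] += product
    return dict(sums)


x = reduce_by_key(map_entry(entry, vector) for entry in matrix_entries)
print(x)
```

Each map call only needs one matrix element plus the in-memory vector, which is why the matrix itself can stay partitioned across nodes.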

Map Reduce: Summary

Summary == Reduce?

Simple abstraction of computations: Map and Reduce

Using a simple abstraction of data: key-value pairs

So what?

Brings transparent:

● parallelization
● distribution
● fault tolerance

Why Apache Spark: MapReduce on steroids

Man… Finally!

Uses

● Functional paradigm
● Lazy computations

Creates dependencies between task definitions and optimizes execution

Almost forgot that one

Can cache data in memory or on the local file system.

Far less IO or network traffic.

What is Machine Learning? Why you must care, by Data Fellas

Andy Petrella & Xavier Tordoir

you cannot prove a vague theory is wrong

[…] Also, if the process of computing the consequences is indefinite, then with a little skill any experimental result can be made to look like the expected consequences.

—Richard Feynman [1964]

What is Machine Learning? Science with data

Surely You’re Joking Mr…

What is Machine Learning? Overview

2nd law neither...

Machine learning you do with a Learning Machine

Take that Newton...

● Modelling without first principle…
● Modelling dependencies from the data

With some “a priori” knowledge

● What is the problem?
● Hypothesis?
● Data Generation Process?
● Collection and Preprocessing
● Interpretation

What is Machine Learning? Learning Machine…

You still need a domain expert…

Like me!

Machine learning you do with a Learning Machine

[Diagram: Samples Generator, System and Learning Machine boxes, with variables x, y and "z ?"]

● Estimate dependencies from data
● Minimize a risk functional over the set given the data

I like them so much in LaTeX2e

What is Machine Learning? Supervised learning

● Regression: continuous output
  ○ Risk = Prediction error
● Classification: categorical output
  ○ Risk = Probability of misclassification

$L(y, f(x, w)) = (y - f(x, w))^2$ …

WTF?
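To make the two risks less abstract, here is a small plain-Python sketch of an empirical risk, that is, the average loss over a finite sample. The squared loss matches the regression formula on the slide; the 0/1 loss for classification and all names (`squared_loss`, `zero_one_loss`, `empirical_risk`) and toy values are assumptions added for illustration.

```python
from typing import Callable, Sequence


def squared_loss(y: float, prediction: float) -> float:
    """Regression loss: L(y, f(x, w)) = (y - f(x, w))^2."""
    return (y - prediction) ** 2


def zero_one_loss(y: str, prediction: str) -> int:
    """Classification loss: 1 for a misclassification, 0 otherwise."""
    return 0 if y == prediction else 1


def empirical_risk(loss: Callable, ys: Sequence, predictions: Sequence) -> float:
    """Average loss over the sample, used as an estimate of the risk."""
    return sum(loss(y, p) for y, p in zip(ys, predictions)) / len(ys)


# Hypothetical targets and model outputs.
regression_risk = empirical_risk(squared_loss, [1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
classification_risk = empirical_risk(zero_one_loss, ["a", "b", "a", "b"], ["a", "a", "a", "b"])
print(regression_risk, classification_risk)
```

With the 0/1 loss the empirical risk is simply the fraction of misclassified samples, which is the sample estimate of the probability of misclassification.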

What is Machine Learning? Unsupervised learning: no output

I like clusters, especially with roasted nuts

● Clustering
  ○ Risk = Distortion error (distances to center)
● Density estimation (probability densities)

What is Machine Learning? Bias - Variance, Regression illustration

Playtime!

Notebook!

What is Machine Learning? Inductive principle

In principle, it should work.

An inductive principle tells what to do

Finite Data → Inductive principle → Model

Empirical risk minimization

● Function class not defined
● Loss not defined
● Optimization procedure not defined

Regularization

● Control on penalty strength
● Penalize complexity/a priori knowledge

Early stopping rules

● Iterative optimization
● Depends on initial params and algorithm
● Used for neural networks
● Penalize along a path

Structural Risk

● Analytic estimates of empirical risk

Bayesian inference

● Explicit a priori probabilities
● Learn mixtures
● Hard multidimensional integrations…

What is Machine Learning? Curse of dimensionality

In principle, it should work.

We want to control complexity

● Smoothness constraint in a neighborhood

Data density is key…

Finite data in a space → Inductive principle → Model complexity

e.g.
● 1-D, 0.1 m interval => 10 points/m
● 2-D, 0.1 m interval => 100 points/m^2
● d-D, 0.1 m interval => 10^d points/m^d

The same smoothness requires lots of data in high-dimensional spaces
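The density argument is plain arithmetic: keeping one point every 0.1 m along each axis means 10 points per axis, hence 10^d points per unit volume in d dimensions. A tiny sketch, with `points_needed` as a hypothetical helper name:

```python
def points_needed(points_per_axis: int, dimensions: int) -> int:
    """Points required per unit volume to keep the same spacing on every axis."""
    return points_per_axis ** dimensions


# A 0.1 m spacing corresponds to 10 points per metre along each axis.
for d in (1, 2, 3, 10):
    print(d, points_needed(10, d))
```

The count grows by a factor of 10 with every added dimension, which is the exponential growth the slide warns about.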

Sampling is hard…

e.g.
● 1-D, 10% sample => 0.1 x size
● 2-D, 10% sample => 0.32 x size
● 10-D, 10% sample => 0.79 x size

=> local estimates from samples are difficult
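The figures above follow from one formula, assuming the data is spread uniformly in a unit hypercube (that uniformity is an assumption made here for the sketch): a sub-cube holding a fraction p of the data needs an edge of length p^(1/d). The helper name `edge_fraction` is illustrative.

```python
def edge_fraction(sample_fraction: float, dimensions: int) -> float:
    """Edge length (as a fraction of the full range) of a sub-cube that
    contains `sample_fraction` of uniformly distributed data."""
    return sample_fraction ** (1.0 / dimensions)


for d in (1, 2, 10):
    print(d, edge_fraction(0.1, d))
```

In 10 dimensions a "local" 10% sample already spans most of the range of every feature, so it is not local at all.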

Data points are closer to edges…

Each data point “sees” itself as an outlier

=> Predictions require lots of extrapolation

Samples must increase exponentially

… or model complexity must be controlled

What is Machine Learning? Regularization in more detail

In principle, it should work.

Data driven penalized risk minimization

Loss functions

Regularizers

● L2 (ridge)
● L1 (lasso)
● Elastic net
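As a rough illustration of the three regularizers, here is one common way to write the penalties in plain Python. Conventions differ between libraries (some add a factor 1/2 to the L2 term, and the mixing parameter is named differently), so the exact form and the names `l2_penalty`, `l1_penalty`, `elastic_net_penalty` and `alpha` are assumptions, not the definitions used in the talk.

```python
from typing import Sequence


def l2_penalty(weights: Sequence[float]) -> float:
    """Ridge penalty: sum of squared weights."""
    return sum(w * w for w in weights)


def l1_penalty(weights: Sequence[float]) -> float:
    """Lasso penalty: sum of absolute weights."""
    return sum(abs(w) for w in weights)


def elastic_net_penalty(weights: Sequence[float], alpha: float) -> float:
    """Mix of both: alpha = 1 gives pure L1, alpha = 0 gives pure L2."""
    return alpha * l1_penalty(weights) + (1.0 - alpha) * l2_penalty(weights)


weights = [3.0, -4.0]
print(l2_penalty(weights), l1_penalty(weights), elastic_net_penalty(weights, 0.5))
```

The penalized risk is then the data loss plus lambda times one of these penalties, where lambda sets the penalty strength.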

Optimization (here comes the fun…)

Which algorithm to find a minimum in a distributed fashion?

Convex optimization methods (linear methods)
● Gradient descent
● Stochastic gradient descent
● Limited-memory BFGS

Gradient descent
● Efficient steps but needs to read through the whole data
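A minimal gradient descent sketch in plain Python, fitting a one-parameter model y = w * x with the squared loss. The toy data, the step size and the function names are assumptions chosen for illustration. Note how every single step sums over all data points, which is the "read through the whole data" cost.

```python
from typing import Sequence


def gradient(w: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Gradient of the mean squared error of y = w * x; reads every data point."""
    n = len(xs)
    return (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))


def gradient_descent(xs: Sequence[float], ys: Sequence[float],
                     step: float = 0.1, iterations: int = 100) -> float:
    """Plain (full-batch) gradient descent starting from w = 0."""
    w = 0.0
    for _ in range(iterations):
        w -= step * gradient(w, xs, ys)
    return w


# Toy data generated by y = 2 * x (hypothetical).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w_gd = gradient_descent(xs, ys)
print(w_gd)
```

In a distributed setting the sum inside `gradient` is the part that would be computed per partition and then combined.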

Stochastic gradient descent
● Samples data for each step but converges very slowly
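For contrast, a stochastic gradient descent sketch on a one-parameter model y = w * x: each step looks at a single randomly sampled point instead of the whole dataset. The toy data, step size, seed and names are assumptions. The data here is noise-free, so the iterates settle on the exact answer; with noisy data they would keep fluctuating around the minimum, which is where the slow convergence comes from.

```python
import random
from typing import Sequence


def stochastic_gradient_descent(xs: Sequence[float], ys: Sequence[float],
                                step: float = 0.05, iterations: int = 500,
                                seed: int = 0) -> float:
    """SGD for y = w * x with squared loss, one sampled point per step."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    w = 0.0
    for _ in range(iterations):
        i = rng.randrange(len(xs))
        # Gradient of (w * x_i - y_i)^2 with respect to w, for one point only.
        w -= step * 2.0 * (w * xs[i] - ys[i]) * xs[i]
    return w


# Toy data generated by y = 2 * x (hypothetical).
sgd_xs = [1.0, 2.0, 3.0]
sgd_ys = [2.0, 4.0, 6.0]
w_sgd = stochastic_gradient_descent(sgd_xs, sgd_ys)
print(w_sgd)
```

Each step is cheap because it touches one point, at the price of needing many more steps than full-batch gradient descent.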

L-BFGS
● Second-derivative (curvature) estimates by keeping several previous gradients in memory
● Fast convergence

What is Machine Learning? Model selection

all work and no play makes Jack a dull boy

Model complexity control: resampling

Selecting the right lambda…

… to minimize prediction risk

Enough theory boy!

The universe

Our data

Learning set (70%), validation set (30%)

Nice flag

K-Fold

K = 4
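A small plain-Python sketch of K-fold splitting with K = 4, working on sample indices only. The function name `k_fold_indices` and the interleaved way of assigning indices to folds are assumptions made for the example. In practice one would shuffle first, then train and validate a model once per split and average the validation risk for each candidate lambda.

```python
from typing import List, Tuple


def k_fold_indices(n: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return k (training, validation) index splits over n samples.

    Every sample is used for validation exactly once.
    """
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        training = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((training, validation))
    return splits


splits = k_fold_indices(8, 4)
for training, validation in splits:
    print("train:", training, "validate:", validation)
```

Compared with a single 70%/30% split, every sample contributes to both training and validation across the K rounds.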

MLlib: A library to learn them all...

What is Apache Spark?

Distributed computing framework

Large Scale Data Processing engine

I play BIG!

● SQL & DataFrames
● Streaming
● Graph Processing
● Machine Learning

With all colors!

● Optimize memory usage (FAST)
● Optimize computation execution (complex tasks)
● Easy programming model

Let the brain do the work...

● Interactive
● @ any scale

Breed mixin’

MLlib: Spark

In principle, it should work.

Intro to Spark… notebook

So we’ve seen…

● Basics of Spark data manipulation
● MLlib data representation
● Linear regression
● Regularization and k-fold cross validation

What else is there?

● Basic statistics
● Classification and regression
● Collaborative filtering
● Clustering
● Dimensionality reduction
● Feature extraction and transformation
● Frequent pattern mining
● Evaluation metrics
● …

http://spark.apache.org/docs/latest/mllib-guide.html

MLlib for Genomics? ADAM + MLlib (mixture K-Means+RF)

Playtime!

Some more examples

Genomics: The data

So… that’s what separates us huh?

1000 genomes: http://www.1000genomes.org/

● ~1000 samples
● ~30M genotypes per sample (features)

Please, don’t mind the colors...

Few samples => Machine Learning

Woooow, really, you must be kidding me… ahahahahah

Lots of Data => Distributed computing

Oh… damned… hum huh

Notebook!

What else? Old and new players are now integrating with Spark (and Scala)

Spark ML Pipeline (Higher API)
● Integrated with DataFrame
● Offers an API to create shareable/reusable pipeline constructions (PCA, …)

Spark ML Keystone (Higher API)
● Like Pipeline but type safe
● Chainable API (andThen-friendly)

H2O (Higher API)
● Memory implementation of “Map-Reduce”
● Highly optimised structures for the JVM
● Blazing fast convergent models

DL4J Spark ML (Higher API)

DAAL (Intel) (Higher API)
● Intel Data Analytics Acceleration Library

System ML (IBM) (Higher API)
● Declarative large-scale machine learning
● Optimization based on data and cluster characteristics

Needle (Higher API)
● Nitro's Extremely Exciting Deep Learning Engine
● MLP, RBM, LSTM and more to come

H2O: Sparkling & Deep Learning on genomics

water in fire

Learning structures using the H2O Deep Learning algorithm, integrated in Spark, in a Notebook, on an EC2 cluster

http://h2o.ai/product/sparkling-water/

H2O: Sparkling: in-memory data exchange

I remember things better when I remember them twice.

Wrap up: what we hope you have learned

Distributed computing: For machine learning

I am ready.

Data is exploding

Distributed Technologies are maturing

Scale up and down, interactivity

Distributed ML on Spark: What is available

What are my options by the way?

● Spark MLlib, H2O, DL4J, Needle
● EC2, GCE, Urika-XA
● Cloudera, MapR, Hortonworks
● HDFS, C*, Kafka

“Create” Cluster

Find sources (context, quality, semantic, …)

Connect to sources (structure, schema/types, …)

Create distributed data pipeline/Model

Tune accuracy

Tune performances

Write results to Sinks

Access Layer

User Access

Shar3 (Data Fellas)

[Diagram: role labels (ops, data, sci, web) attached to the steps]

Shar3 (Data Fellas)

● Analysis
● Production
● Distribution
● Rendering
● Discovery
● Catalog
● Project Generator
● Micro Service / Binary format
● Schema for output
● Metadata

That’s all folks: Thanks for listening/staying

Poke us on Twitter or via http://data-fellas.guru

@DataFellas @Shar3_Fellas @SparkNotebook @Xtordoir & @Noootsab

Building Distributed Pipelines for Data Science using Kafka, Spark, and Cassandra (form → @DataFellas)

Check also @TypeSafe: http://t.co/o1Bt6dQtgH
