distributed machine learning 101 using apache spark from a browser devoxx.be2015

Distributed Machine Learning using Apache Spark from the Browser

Devoxx Belgium 2015, Antwerpen

Outline

● Distributed computing

● What is Machine Learning?

● Spark for machine learning?

● Spark MLlib by examples

● Spark and other libraries

● Wrap up

Data Fellas

Andy Petrella

Maths, Geospatial, Distributed Computing

Spark Notebook, Trainer Spark/Scala, Machine Learning

Xavier Tordoir

Physics, Bioinformatics, Distributed Computing

Scala (& Perl), Trainer Spark, Machine Learning

Distributed Computing: Why you must care, by Data Fellas

Andy Petrella & Xavier Tordoir

Traditionally, tasks are performed entirely on a single computer using three main resources.

Uba ga!

Computing

Processing Power, Memory, Storage

Computing

Oh no!

Hence performance is limited in time and space

Processing Power, Memory, Storage

TIME, SPACE

Distributed computing: [...] A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages.

The components interact with each other in order to achieve a common goal. [...].

Ref: https://en.wikipedia.org/wiki/Distributed_computing

Distributing

Interesting

Consequences

Oh no!

Algorithms have to work on data partitions and with partial results

The entire dataset cannot be accessed at once
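As a minimal sketch of what "working on partitions with partial results" can look like, the snippet below computes a mean without ever seeing the whole dataset in one place. The partition contents and the helper names `partial_stats` and `merge_stats` are illustrative assumptions, not something from the talk.

```python
from functools import reduce

def partial_stats(partition):
    """Summarise one partition as (sum, count); only this small pair leaves the node."""
    return (sum(partition), len(partition))

def merge_stats(a, b):
    """Combine two partial results without touching the raw data again."""
    return (a[0] + b[0], a[1] + b[1])

# Hypothetical dataset, already split into three partitions.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

total, count = reduce(merge_stats, map(partial_stats, partitions))
mean = total / count
```

The mean is recovered from per-partition sums and counts, which is why the algorithm has to be expressed in terms of mergeable partial results.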

New resource!

Damned

Processing Power, Memory, Storage

SPACE

Network

The network will impact performance...

Oops did it again

Distributing

[Diagram: four Processing + Memory nodes and two Storage units connected by the network]

Drawback: Partition

[Diagram: four Processing + Memory nodes and two Storage units connected by the network]

Drawback: Partition

Hey, you sank my node!

[Diagram: the same cluster with one Processing + Memory node and its Storage separated from the rest of the network]

Ouch, my rack

Advantage: Elastic scaling

[Diagram: one cluster of four Processing + Memory nodes with Storage, connected by a storage network]

What if this cluster happens to not be big enough?

That’s more reasonable

Advantage: Elastic scaling

[Diagram: two such clusters, each with four Processing + Memory nodes, Storage and a storage network, joined by the network]

HPC: computationally intensive applications

Model: specialized hardware (CPU/GPU) and network

The machines are orchestrated by a scheduler that gathers their computing power and memory.

Yeah! what about?

What about HPC?

Drawbacks:

● Costs and upgrades by large blocks

● Decoupled storage

Storage latency = no streaming, no iteration

Got No Money and NO time

What about HPC?

Why process data if not to model?

Machine learning: iterative (streaming & batch)

Data is aggregated in the form of a model (parameters)

Data changes little, the model is small

Do that baby!

Iterate

you gotta be kidding

[Diagram: four Processing + Memory nodes between two Storage units]

Moving lots of data again and again...

Distributed computing allows cost-effective parallelism

Efficiency requires distributed storage, colocated with the processing units

What about programming models?

Summary

Interesting

Distributed storage

Partitions!

HDFS: Apache implementation of Google FS

● Natural fit for distributed storage

● Works as a service

Other chunked sources...

● Apache Cassandra, S3, Tachyon,...

Distributed storage

Split da

Name Node

put /data/f256.txt (256 MB)

replication factor 2

Data Node 1

Data Node 2

Data Node 4

Data Node 3

Distributed storage

Split da

Data Node 1

Data Node 2

Data Node 4

Data Node 3

Name Node

replication factor 2

64 MB

Distributed storage

Everywhere

Data Node 1

Data Node 2

Data Node 4

Data Node 3

Name Node

put /data/f256.txt

replication factor 2

put /data/f256.txt/part-r-00000 64 MB

Distributed storage

everywhere

replication factor 2

Data Node 1

Data Node 2

Data Node 4

Data Node 3

Name Node

put /data/f256.txt/part-r-00000 64Mb

Distributed storage

Replicate

Data Node 1

Data Node 2

Data Node 4

Data Node 3

Name Node

replication factor 2

put /data/f256.txt/part-r-00000 64 MB
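The arithmetic behind these slides can be sketched as follows: a 256 MB file cut into 64 MB parts gives four blocks, and a replication factor of 2 means eight block copies spread over the four data nodes. The round-robin placement below is a deliberate simplification for illustration and is an assumption; it is not how the HDFS Name Node actually chooses nodes. The function names are hypothetical.

```python
import math

def count_blocks(file_size_mb, block_size_mb):
    """Number of blocks needed to hold the file (the last block may be partial)."""
    return math.ceil(file_size_mb / block_size_mb)

def place_replicas(n_blocks, data_nodes, replication):
    """Toy round-robin placement: each block goes to `replication` distinct nodes.

    Assumption: real placement policies also consider racks and free space.
    """
    placement = {}
    for block in range(n_blocks):
        placement[block] = [
            data_nodes[(block + r) % len(data_nodes)] for r in range(replication)
        ]
    return placement

data_nodes = ["Data Node 1", "Data Node 2", "Data Node 3", "Data Node 4"]
n_blocks = count_blocks(256, 64)
placement = place_replicas(n_blocks, data_nodes, replication=2)
total_copies = sum(len(nodes) for nodes in placement.values())
```

With every block stored on two different nodes, losing any single data node still leaves one copy of each block available.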

Map Reduce: High Level Execution

The rocket’s base

data part

Load the data

The rocket’s engines

[Diagram: each data part goes to a mapper]

Map and Pair

The rocket’s trunk

[Diagram: data parts → mappers → group by key]

Shuffle Pairs using Keys

The rocket’s cockpit

[Diagram: data parts → mappers → group by key → reducers]

Values per key are reduced

The rocket’s tip

[Diagram: data parts → mappers → group by key → reducers → results]

We collect the results

To the infinite and beyond!

[Diagram: the full flow: data parts → mappers → group by key → reducers → results]

The whole#!
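The stages above (load, map and pair, shuffle by key, reduce, collect) can be sketched in plain Python with the classic word count. This is a single-process illustration under assumed input data; the function names are hypothetical and no MapReduce framework is involved.

```python
from collections import defaultdict

def mapper(part):
    """Map and pair: emit (word, 1) for every word in one data part."""
    for line in part:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return (key, sum(values))

# Hypothetical input, already split into two data parts.
data_parts = [["spark is fast"], ["spark is distributed"]]

pairs = [pair for part in data_parts for pair in mapper(part)]
results = dict(reducer(key, values) for key, values in shuffle(pairs).items())
```

In a real cluster each mapper and reducer would run on a different node, and the shuffle is the step that moves data over the network.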

Map Reduce Matrix-Vector Product

How about word count?

Back to school...

Wait, that’s maths

Where is the RAT?

Store Matrix as ordered

Vector V loaded in memory as ordered

Map function:

Each matrix element is mapped to a product

OK … I TAKE OVER

just a sum …

REDUCE
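A minimal sketch of the matrix-vector product as map and reduce steps. The storage layout is an assumption, since the slides do not spell it out here: the matrix is taken to be a list of (row, column, value) triples and the vector a plain in-memory list. The map step turns each matrix element into a (row, product) pair and the reduce step is just a sum per row.

```python
from collections import defaultdict

# Assumed layout: matrix entries as (i, j, m_ij) triples, vector v held in memory.
matrix = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
v = [10.0, 20.0]

def mapper(entry):
    """Map one matrix element to (row index, m_ij * v_j)."""
    i, j, m_ij = entry
    return (i, m_ij * v[j])

def reduce_by_key(pairs):
    """Reduce: sum the products that belong to the same row."""
    sums = defaultdict(float)
    for i, product in pairs:
        sums[i] += product
    return dict(sums)

result = reduce_by_key(map(mapper, matrix))
```

Each key in `result` is a row index and each value is the corresponding component of the product vector.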

Map Reduce: Summary

Summary ==

Reduce?

Simple abstraction of computations: Map and Reduce

Using a simple abstraction of data: key-value pairs

Map Reduce: Summary

So what?

Brings transparent:

● parallelization

● distribution

● fault tolerance

Why Apache Spark: MapReduce on steroids

Man… Finally!

● Functional paradigm

● Lazy computations

Creates dependencies between task definitions and optimizes execution
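To make "lazy computations" concrete, here is a toy sketch in plain Python. `LazyDataset` is a hypothetical class written for this illustration; it is not Spark's API. Transformations only record what should happen, and nothing is computed until `collect()` is called, which is the point at which an engine could inspect and optimize the whole chain.

```python
class LazyDataset:
    """Toy lazy collection: map/filter are recorded, collect() executes them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Record the step and return a new dataset; no work is done here.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Only now is the recorded chain of steps actually applied.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

calls = []

def square(x):
    calls.append(x)  # lets us observe when the function really runs
    return x * x

pipeline = LazyDataset(range(5)).map(square).filter(lambda x: x % 2 == 0)
calls_before_collect = len(calls)
result = pipeline.collect()
```

Building `pipeline` does not call `square` at all; the calls only happen inside `collect()`.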

Why Apache Spark: MapReduce on steroids

Almost forgot that one

Can cache data in memory or on the local file system.

Far less I/O and network traffic.

What is Machine Learning? Why you must care, by Data Fellas

Andy Petrella & Xavier Tordoir

you cannot prove a vague theory is wrong

[…] Also, if the process of computing the consequences is indefinite, then with a little skill any experimental result can be made to look like the expected consequences.

—Richard Feynman [1964]

What is Machine Learning? Science with data

Surely You’re Joking Mr…

● Modelling without first principles…

What is Machine Learning? Overview

2nd law neither...

Machine learning you do with a Learning Machine

Take that Newton...

● Modelling dependencies from the data

With some “a priori” knowledge

● What is the problem?

● Hypothesis?

● Data Generation Process?

● Collection and Preprocessing

● Interpretation

What is Machine Learning? Learning Machine…

You still need a domain expert…

Like me!

Learning Machine

● Estimate dependencies from data

Machine learning you do with a Learning Machine

Samples Generator

System

Learning Machine

● Estimate dependencies from data

● Minimize a risk functional over the set of candidate functions, given the data

I like them so much in LaTeX2e

Samples Generator

System

Learning Machine

● Regression: continuous output

○ Risk = Prediction error

● Classification: categorical output

○ Risk = Probability of misclassification

What is Machine Learning? Supervised learning

$L(y, f(x, w)) = (y - f(x, w))^2$ …
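A small sketch of the two supervised risks named on this slide, estimated on a finite sample. The choice of squared error for regression and 0/1 loss for classification is the usual textbook reading and is stated here as an assumption; the data values are made up for illustration.

```python
def squared_loss(y, y_hat):
    """Regression loss: squared prediction error."""
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    """Classification loss: 1 for a misclassification, 0 otherwise."""
    return 0.0 if y == y_hat else 1.0

def empirical_risk(loss, targets, predictions):
    """Average loss over the sample, an estimate of the true risk."""
    return sum(loss(y, p) for y, p in zip(targets, predictions)) / len(targets)

# Hypothetical targets and model outputs.
regression_risk = empirical_risk(squared_loss, [1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
classification_risk = empirical_risk(zero_one_loss, ["a", "b", "a", "b"], ["a", "a", "a", "b"])
```

The same `empirical_risk` function serves both cases; only the loss changes.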

What is Machine Learning? Unsupervised learning: no output

I like clusters, especially with roasted nuts

● Clustering

○ Risk = Error Distortion (distances to center)

● Density estimation (probability densities)
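For clustering, the risk mentioned above can be sketched as the average distance from each point to its nearest center. Using the squared Euclidean distance is an assumption made for this illustration, as are the points and centers.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def distortion(points, centers):
    """Mean squared distance from each point to its closest center."""
    total = 0.0
    for point in points:
        total += min(squared_distance(point, center) for center in centers)
    return total / len(points)

# Hypothetical 2-D data with two obvious groups.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good_centers = [(0.0, 1.0), (10.0, 1.0)]
poor_centers = [(5.0, 1.0)]

good = distortion(points, good_centers)
poor = distortion(points, poor_centers)
```

Centers that sit inside the groups give a much lower distortion than a single center placed between them.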

What is Machine Learning? Bias - Variance, Regression illustration

Playtime!

Notebook!

What is Machine Learning? Inductive principle

In principle, it should work.

An inductive principle tells what to do

Finite Data

Inductive principle

Empirical risk minimization

Finite Data Model

• Functions class not defined

• Loss not defined

• Optimization procedure not defined

Regularization

Finite Data Model

• Control on penalty strength

• Penalize complexity/a priori knowledge

Early stopping rules

Finite Data Model

• Iterative optimization

• Depends on initial params and algorithm

• Used for neural networks

• Penalize along a path

Structural Risk

Finite Data Model

• Analytic estimates of empirical risk

Bayesian inference

Finite Data Model

• Explicit a priori probabilities

• Learn mixtures

• Hard multidimensional integrations…

What is Machine Learning? Curse of dimensionality

We want to control complexity

Finite Data Model

• smoothness constraint in a neighborhood

What is Machine Learning? Curse of dimensionality

Data density is key…

Finite Data in a Space

Model Complexity

Inductive principle

Data density is key… e.g.

● 1-D, 0.1 m interval => 10 points/m

● 2-D, 0.1 m interval => 100 points/m^2

● d-D, 0.1 m interval => 10^d points/m^d

Same smoothness requires lots of data in high dimensional spaces

Sampling is hard… e.g.

● 1-D, 10% sample => 0.1 x size

● 2-D, 10% sample => 0.32 x size

● 10-D, 10% sample => 0.79 x size

=> local estimates from samples are difficult
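The numbers on these slides can be reproduced with two one-line formulas. The interpretation is an assumption: with one point every 0.1 m along each axis, a unit cube in d dimensions needs (1/0.1)^d points, and to capture a fraction f of uniformly spread data a cube needs an edge of f^(1/d) times the full range. The function names are hypothetical.

```python
def points_needed(spacing, dimensions):
    """Points per unit volume when points are `spacing` apart along every axis."""
    return round((1 / spacing) ** dimensions)

def edge_fraction(sample_fraction, dimensions):
    """Edge length (as a fraction of the full range) of a cube holding the sample."""
    return sample_fraction ** (1 / dimensions)

density_1d = points_needed(0.1, 1)
density_2d = points_needed(0.1, 2)
density_10d = points_needed(0.1, 10)

edge_1d = edge_fraction(0.10, 1)    # 0.1
edge_2d = edge_fraction(0.10, 2)    # about 0.316
edge_10d = edge_fraction(0.10, 10)  # about 0.794
```

In 10 dimensions a "local" 10% sample already spans roughly four fifths of every axis, so it is hardly local any more.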

Data points are closer to edges… Each data point “sees” itself as an outlier

=> Predictions require lots of extrapolation

Samples must increase exponentially

… or model complexity must be controlled

What is Machine Learning? Regularization in more detail

Data-driven penalized risk minimization

Loss functions

Regularizers

L2 (ridge)

L1 (lasso)

Elastic net
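A minimal sketch of the three regularizers and of a penalized risk. The elastic net is written here as a plain weighted mix of the L1 and L2 penalties; libraries differ in the exact scaling they use, so treat the `alpha` convention and the parameter name `lam` as assumptions of this example.

```python
def l2_penalty(weights):
    """Ridge: sum of squared weights."""
    return sum(w * w for w in weights)

def l1_penalty(weights):
    """Lasso: sum of absolute weights."""
    return sum(abs(w) for w in weights)

def elastic_net_penalty(weights, alpha):
    """Mix of both: alpha = 1 is pure L1, alpha = 0 is pure L2 (assumed convention)."""
    return alpha * l1_penalty(weights) + (1 - alpha) * l2_penalty(weights)

def penalized_risk(empirical_risk, weights, lam, penalty):
    """Empirical risk plus lam times the chosen complexity penalty."""
    return empirical_risk + lam * penalty(weights)

weights = [3.0, -4.0]
ridge = l2_penalty(weights)
lasso = l1_penalty(weights)
mixed = elastic_net_penalty(weights, alpha=0.5)
total = penalized_risk(1.0, weights, lam=0.1, penalty=l2_penalty)
```

The strength `lam` is the knob that model selection later has to tune.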

Optimization (here comes the fun…)

Which algorithm to find a minimum in a distributed fashion?

Convex optimization methods (linear methods)

● Gradient descent

● Stochastic gradient descent

● Limited-memory BFGS

Gradient descent

● Efficient steps but needs to read through the whole data

Stochastic gradient descent

● Samples data for each step but converges very slowly

L-BFGS

● Estimates second derivatives by keeping several previous gradients in memory

● Fast convergence
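As a sketch of why plain gradient descent "needs to read through the whole data" and how it still distributes: every step asks each partition for a partial gradient, and only those small sums are combined. The example fits a one-feature linear model by least squares; the data, learning rate and step count are illustrative assumptions, and everything runs in a single process.

```python
def partial_gradient(partition, w, b):
    """Sum of least-squares gradients over one partition, plus its row count."""
    grad_w = grad_b = 0.0
    for x, y in partition:
        error = (w * x + b) - y
        grad_w += error * x
        grad_b += error
    return grad_w, grad_b, len(partition)

def gradient_descent(partitions, learning_rate=0.1, steps=1000):
    """Full-batch gradient descent; each step visits every partition once."""
    w = b = 0.0
    for _ in range(steps):
        partials = [partial_gradient(p, w, b) for p in partitions]
        n = sum(count for _, _, count in partials)
        grad_w = sum(g for g, _, _ in partials) / n
        grad_b = sum(g for _, g, _ in partials) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical data generated from y = 2x + 1, split into two partitions.
partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
w, b = gradient_descent(partitions)
```

A stochastic variant would compute the gradient on a random sample of rows at each step instead of on every partition.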

What is Machine Learning? Model selection

all work and no play makes Jack a dull boy

Model Complexity control: Resampling

Selecting the right lambda…

… to minimize prediction risk

Enough theory boy!

The universe

Enough theory boy!

Our data

Enough theory boy!

Our data

Learning Set (70%)

Validation Set (30%)


Nice flag

K-Fold
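The two resampling schemes on these slides can be sketched as follows: a single 70% / 30% learning and validation split, and K-fold, where every fold serves once as the validation set. This is a plain Python illustration with assumed helper names; in practice the data would be shuffled first.

```python
def train_validation_split(data, train_fraction=0.7):
    """Single split: the first part is the learning set, the rest is validation."""
    cut = int(round(len(data) * train_fraction))
    return data[:cut], data[cut:]

def k_fold(data, k):
    """Yield (training, validation) pairs; each fold is the validation set once."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield training, validation

data = list(range(10))
learning_set, validation_set = train_validation_split(data)
splits = list(k_fold(data, k=5))
```

Averaging the validation error over the K splits gives the estimate used to pick the regularization strength.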

MLlib: A library to learn them all...

Distributed computing framework

Large Scale Data Processing engine

What is Apache Spark?

I play BIG!

● SQL & DataFrames

● Streaming

● Graph Processing

● Machine Learning

With all colors!

● Optimize memory usage (FAST)

● Optimize computation execution (Complex tasks)

● Easy programming model

Let the brain do the work...

● Interactive

● @ any scale

Breed mixin’

MLlib: Spark

Intro to Spark… notebook

MLlib: Spark

Intro to Spark… notebook

So we’ve seen…

● Basics of Spark data manipulation

● MLlib data representation

● Linear regression

● Regularization and k-fold cross validation

What else is there?

MLlib: Spark

● Basic statistics

● Classification and regression

● Collaborative filtering

● Clustering

● Dimensionality reduction

● Feature extraction and transformation

● Frequent pattern mining

● Evaluation metrics

● …

http://spark.apache.org/docs/latest/mllib-guide.html

MLlib for Genomics? ADAM + MLlib (mixture K-Means + RF)

Playtime!

Some more examples

Genomics: The data

So… that’s what separates us huh?

1000 genomes: http://www.1000genomes.org/

~1000 samples

~30M Genotypes per sample (features)

Genomics: The data

Please, don’t mind the colors...

~1000 samples

Few samples => Machine Learning

Genomics: The data

Woooow, really, you must be kidding me… ahahahahah

~1000 samples

~30M Genotypes per sample (features)

Few samples => Machine Learning

Lots of Data => Distributed computing

Genomics: The data

Oh… damned… hum huh

MLlib for Genomics? ADAM + MLlib (mixture K-Means + RF)

Playtime!

Notebook!

What else? Old and new players are now integrating with Spark (and Scala)

Spark ML Pipeline

Integrated with DataFrame

Offers an API to create shareable/reusable pipeline constructions (PCA, …)

Higher API

Like Pipeline but type safe

Chainable API (andThen-friendly)

Spark ML Keystone

Higher API

Memory implementation of “Map-Reduce”

Highly optimised structures for the JVM

blazing fast convergent models

Higher API

DL4J Spark ML

Higher API

Intel Data Analytics Acceleration Library

DAAL (Intel)

Higher API

Declarative large-scale machine learning

optimization based on data and cluster characteristics

System ML (IBM)

Higher API

Nitro's Extremely Exciting Deep Learning Engine

MLP, RBM, LSTM and more to come

Needle

Higher API

H2O: Sparkling & Deep Learning on genomics

water in fire

Learning structures using the H2O Deep Learning algorithm integrated in Spark, in a Notebook, on an EC2 cluster

http://h2o.ai/product/sparkling-water/

H2O: Sparkling, in-memory data exchange

I remember things better when I remember them twice.

Wrap up: what we hope you have learned

Distributed computing for machine learning

I am ready.

Data is exploding

Distributed Technologies are maturing

Scale up and down, interactivity

Distributed ML on Spark: What is available

What are my options by the way?

Spark MLlib, H2O, Needle

EC2, GCE, URIKA-XA

Cloudera, MapR, Hortonworks

HDFS, C*

“Create” Cluster

Find sources (context, quality, semantic, …)

Connect to sources (structure, schema/types, …)

Create distributed data pipeline/Model

Tune accuracy

Tune performance

Write results to Sinks

Access Layer

User Access

Shar3 (Data Fellas)

[Diagram: the steps above annotated with the roles involved: ops, data, sci, web]

Shar3 (Data Fellas)

Analysis

Production

Distribution

Rendering

Discovery

Catalog

Project Generator

Micro Service / Binary format

Schema for output

Metadata

That’s all folks: Thanks for listening/staying

Poke us on Twitter or via http://data-fellas.guru

@DataFellas @Shar3_Fellas @SparkNotebook @Xtordoir & @Noootsab

Building Distributed Pipelines for Data Science using Kafka, Spark, and Cassandra (form → @DataFellas)

Check also @TypeSafe: http://t.co/o1Bt6dQtgH
