joseph bradley, software engineer, databricks inc. at mlconf sea - 5/01/15

Spark DataFrames

and ML Pipelines

Joseph K. BradleyMay 1, 2015

MLconf Seattle

Who am I?

Joseph K. Bradley

Ph.D. in ML from CMU, postdoc at Berkeley

Apache Spark committer

Software Engineer @ Databricks Inc.

2

Databricks Inc.

3

Founded by the creators of Spark

& driving its development

Databricks Cloud: the best place to run Spark

Guess what…we’re hiring!

databricks.com/company/careers

https://databricks.com/company/careers

4

Concise APIs in Python, Java, Scala… and R in Spark 1.4!

500+ enterprises using or planning to use Spark in production (blog)

Spark

SparkSQL Streaming MLlib GraphX

Distributed computing engine• Built for speed, ease of use,

and sophisticated analytics• Apache open source

https://databricks.com/blog/2015/01/27/big-data-projects-are-hungry-for-simpler-and-more-powerful-tools-survey-validates-apache-spark-is-gaining-developer-traction.html

Beyond Hadoop

5

Early adopters (Data) Engineers

MapReduce &functional API

Data Scientists& Statisticians

Spark for Data Science

DataFrames

Intuitive manipulation of distributed structured data

6

Machine Learning Pipelines

Simple construction and tuning of ML workflows

Google Trends for “dataframe”

7

DataFrames

8

dept age name

Bio 48 H Smith

CS 54 A Turing

Bio 43 B Jones

Chem 61 M Kennedy

RDD API

DataFrame API

Data grouped into named columns

DataFrames

9

dept age name

Bio 48 H Smith

CS 54 A Turing

Bio 43 B Jones

Chem 61 M Kennedy

Data grouped into named columns

DSL for common tasks• Project, filter, aggregate, join, …• Metadata• UDFs

Spark DataFrames

10

API inspired by R and Python Pandas• Python, Scala, Java (+ R in dev)

• Pandas integration

Distributed DataFrame

Highly optimized

11

0 2 4 6 8 10

RDD Scala

RDD Python

Spark Scala DF

Spark Python DF

Runtime of aggregating 10 million int pairs (secs)

Spark DataFrames are fast

better

Uses SparkSQLCatalyst optimizer

12

Demo: DataFrames

Spark for Data Science

DataFrames

• Structured data

• Familiar API based on R & Python Pandas

• Distributed, optimized implementation

13


Simple construction and tuning of ML workflows

About Spark MLlib

Started @ Berkeley

• Spark 0.8

Now (Spark 1.3)

• Contributions from 50+ orgs, 100+ individuals

• Growing coverage of distributed algorithms

Spark

SparkSQL Streaming MLlib GraphX

14

About Spark MLlibClassification

• Logistic regression

• Naive Bayes

• Streaming logistic regression

• Linear SVMs

• Decision trees

• Random forests

• Gradient-boosted trees

Regression

• Ordinary least squares

• Ridge regression

• Lasso

• Isotonic regression

• Decision trees

• Random forests

• Gradient-boosted trees

• Streaming linear methods

15

Statistics

• Pearson correlation

• Spearman correlation

• Online summarization

• Chi-squared test

• Kernel density estimation

Linear algebra

• Local dense & sparse vectors & matrices

• Distributed matrices

• Block-partitioned matrix

• Row matrix

• Indexed row matrix

• Coordinate matrix

• Matrix decompositions

Frequent itemsets

• FP-growth

Model import/export

Clustering

• Gaussian mixture models

• K-Means

• Streaming K-Means

• Latent Dirichlet Allocation

• Power Iteration Clustering

Recommendation

• Alternating Least Squares

Feature extraction & selection

• Word2Vec

• Chi-Squared selection

• Hashing term frequency

• Inverse document frequency

• Normalizer

• Standard scaler

• Tokenizer

ML Workflows are complex

16

Image classification pipeline*

* Evan Sparks. “ML Pipelines.” amplab.cs.berkeley.edu/ml-pipelines

Specify pipeline Inspect & debug Re-run on new data Tune parameters

https://amplab.cs.berkeley.edu/ml-pipelines/

Example: Text Classification

17

Goal: Given a text document, predict its

topic.

Subject: Re: Lexan Polish?

Suggest McQuires #1 plastic

polish. It will help somewhat

but nothing will remove deep

scratches without making it

worse than it already is.

McQuires will do something...

1: about science0: not about science

LabelFeatures

Dataset: “20 Newsgroups”From UCI KDD Archive

ML Workflow

18

Train model

Evaluate

Load data

Extract features

Load Data

19

Train model

Evaluate

Load data

Extract features

built-in external

{ JSON }

JDBC

and more …

Data sources for DataFrames

Load Data

20

Train model

Evaluate

Load data

Extract features

label: Int

text: String

Current data schema

Extract Features

21

Train model

Evaluate

Load data

Extract features

label: Int

text: String

Current data schema

Extract Features

22

Train model

Evaluate

Load datalabel: Int

text: String

Current data schema

Tokenizer

Hashed Term Freq.features: Vector

words: Seq[String]

Transformer

Train a Model

23

Logistic Regression

Evaluate

label: Int

text: String

Current data schema

Tokenizer


words: Seq[String]

prediction: Int

Estimator

Load data

Transformer

Evaluate the Model

24

Logistic Regression

Evaluate

label: Int

text: String

Current data schema

Tokenizer


words: Seq[String]

prediction: Int

Load data

Transformer

Evaluator

Estimator

By default, always append new columns

Can go back & inspect intermediate results

Made efficient by DataFrameoptimizations

ML Pipelines

25

Logistic Regression

Evaluate

Tokenizer

Hashed Term Freq.

Load data

Pipeline

Test data

Logistic Regression

Tokenizer

Hashed Term Freq.

Evaluate

Re-run exactly the same way

Parameter Tuning

26

Logistic Regression

Evaluate

Tokenizer

Hashed Term Freq.

lr.regParam

{0.01, 0.1, 0.5}

hashingTF.numFeatures

{100, 1000, 10000} Given:

• Estimator

• Parameter grid

• Evaluator

Find best parameters

CrossValidator

27

Demo: ML Pipelines

Recap

DataFrames

• Structured data

• Familiar API based on R & Python

Pandas

• Distributed, optimized

implementation


• Integration with DataFrames

• Familiar API based on scikit-learn

• Simple parameter tuning 28

Composable & DAG Pipelines

Schema validation

User-defined Transformers & Estimators

Looking Ahead

Collaborations with UC Berkeley & others

• Auto-tuning models

29

DataFrames

• Further

optimization

• API for R

ML Pipelines

• More algorithms &

pluggability

• API for R

Thank you!

Spark documentationspark.apache.org

Pipelines blog postdatabricks.com/blog/2015/01/07

DataFrames blog postdatabricks.com/blog/2015/02/17

Databricks Cloud Platformdatabricks.com/product

Spark MOOCs on edXIntro to Spark & ML with Spark

Spark Packagesspark-packages.org

http://spark.apache.org/

https://databricks.com/blog/2015/01/07

https://databricks.com/blog/2015/02/17

https://databricks.com/product/

https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x

https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x

http://spark-packages.org/

joseph bradley, software engineer, databricks inc. at mlconf sea - 5/01/15

Technology

creators of spark

spark guess whatwere

java r

r python pandasdistributed

python pandaspython

development databricks

named columnsdsl

named columnsfor