joseph bradley, software engineer, databricks inc. at mlconf sea - 5/01/15
TRANSCRIPT
Spark DataFrames
and ML Pipelines
Joseph K. BradleyMay 1, 2015
MLconf Seattle
Who am I?
Joseph K. Bradley
Ph.D. in ML from CMU, postdoc at Berkeley
Apache Spark committer
Software Engineer @ Databricks Inc.
2
Databricks Inc.
3
Founded by the creators of Spark
& driving its development
Databricks Cloud: the best place to run Spark
Guess what…we’re hiring!
databricks.com/company/careers
4
Concise APIs in Python, Java, Scala… and R in Spark 1.4!
500+ enterprises using or planning to use Spark in production (blog)
Spark
SparkSQL Streaming MLlib GraphX
Distributed computing engine• Built for speed, ease of use,
and sophisticated analytics• Apache open source
Beyond Hadoop
5
Early adopters (Data) Engineers
MapReduce &functional API
Data Scientists& Statisticians
Spark for Data Science
DataFrames
Intuitive manipulation of distributed structured data
6
Machine Learning Pipelines
Simple construction and tuning of ML workflows
Google Trends for “dataframe”
7
DataFrames
8
dept age name
Bio 48 H Smith
CS 54 A Turing
Bio 43 B Jones
Chem 61 M Kennedy
RDD API
DataFrame API
Data grouped into named columns
DataFrames
9
dept age name
Bio 48 H Smith
CS 54 A Turing
Bio 43 B Jones
Chem 61 M Kennedy
Data grouped into named columns
DSL for common tasks• Project, filter, aggregate, join, …• Metadata• UDFs
Spark DataFrames
10
API inspired by R and Python Pandas• Python, Scala, Java (+ R in dev)
• Pandas integration
Distributed DataFrame
Highly optimized
11
0 2 4 6 8 10
RDD Scala
RDD Python
Spark Scala DF
Spark Python DF
Runtime of aggregating 10 million int pairs (secs)
Spark DataFrames are fast
better
Uses SparkSQLCatalyst optimizer
12
Demo: DataFrames
Spark for Data Science
DataFrames
• Structured data
• Familiar API based on R & Python Pandas
• Distributed, optimized implementation
13
Machine Learning Pipelines
Simple construction and tuning of ML workflows
About Spark MLlib
Started @ Berkeley
• Spark 0.8
Now (Spark 1.3)
• Contributions from 50+ orgs, 100+ individuals
• Growing coverage of distributed algorithms
Spark
SparkSQL Streaming MLlib GraphX
14
About Spark MLlibClassification
• Logistic regression
• Naive Bayes
• Streaming logistic regression
• Linear SVMs
• Decision trees
• Random forests
• Gradient-boosted trees
Regression
• Ordinary least squares
• Ridge regression
• Lasso
• Isotonic regression
• Decision trees
• Random forests
• Gradient-boosted trees
• Streaming linear methods
15
Statistics
• Pearson correlation
• Spearman correlation
• Online summarization
• Chi-squared test
• Kernel density estimation
Linear algebra
• Local dense & sparse vectors & matrices
• Distributed matrices
• Block-partitioned matrix
• Row matrix
• Indexed row matrix
• Coordinate matrix
• Matrix decompositions
Frequent itemsets
• FP-growth
Model import/export
Clustering
• Gaussian mixture models
• K-Means
• Streaming K-Means
• Latent Dirichlet Allocation
• Power Iteration Clustering
Recommendation
• Alternating Least Squares
Feature extraction & selection
• Word2Vec
• Chi-Squared selection
• Hashing term frequency
• Inverse document frequency
• Normalizer
• Standard scaler
• Tokenizer
ML Workflows are complex
16
Image classification pipeline*
* Evan Sparks. “ML Pipelines.” amplab.cs.berkeley.edu/ml-pipelines
Specify pipeline Inspect & debug Re-run on new data Tune parameters
Example: Text Classification
17
Goal: Given a text document, predict its
topic.
Subject: Re: Lexan Polish?
Suggest McQuires #1 plastic
polish. It will help somewhat
but nothing will remove deep
scratches without making it
worse than it already is.
McQuires will do something...
1: about science0: not about science
LabelFeatures
Dataset: “20 Newsgroups”From UCI KDD Archive
ML Workflow
18
Train model
Evaluate
Load data
Extract features
Load Data
19
Train model
Evaluate
Load data
Extract features
built-in external
{ JSON }
JDBC
and more …
Data sources for DataFrames
Load Data
20
Train model
Evaluate
Load data
Extract features
label: Int
text: String
Current data schema
Extract Features
21
Train model
Evaluate
Load data
Extract features
label: Int
text: String
Current data schema
Extract Features
22
Train model
Evaluate
Load datalabel: Int
text: String
Current data schema
Tokenizer
Hashed Term Freq.features: Vector
words: Seq[String]
Transformer
Train a Model
23
Logistic Regression
Evaluate
label: Int
text: String
Current data schema
Tokenizer
Hashed Term Freq.features: Vector
words: Seq[String]
prediction: Int
Estimator
Load data
Transformer
Evaluate the Model
24
Logistic Regression
Evaluate
label: Int
text: String
Current data schema
Tokenizer
Hashed Term Freq.features: Vector
words: Seq[String]
prediction: Int
Load data
Transformer
Evaluator
Estimator
By default, always append new columns
Can go back & inspect intermediate results
Made efficient by DataFrameoptimizations
ML Pipelines
25
Logistic Regression
Evaluate
Tokenizer
Hashed Term Freq.
Load data
Pipeline
Test data
Logistic Regression
Tokenizer
Hashed Term Freq.
Evaluate
Re-run exactly the same way
Parameter Tuning
26
Logistic Regression
Evaluate
Tokenizer
Hashed Term Freq.
lr.regParam
{0.01, 0.1, 0.5}
hashingTF.numFeatures
{100, 1000, 10000} Given:
• Estimator
• Parameter grid
• Evaluator
Find best parameters
CrossValidator
27
Demo: ML Pipelines
Recap
DataFrames
• Structured data
• Familiar API based on R & Python
Pandas
• Distributed, optimized
implementation
Machine Learning Pipelines
• Integration with DataFrames
• Familiar API based on scikit-learn
• Simple parameter tuning 28
Composable & DAG Pipelines
Schema validation
User-defined Transformers & Estimators
Looking Ahead
Collaborations with UC Berkeley & others
• Auto-tuning models
29
DataFrames
• Further
optimization
• API for R
ML Pipelines
• More algorithms &
pluggability
• API for R
Thank you!
Spark documentationspark.apache.org
Pipelines blog postdatabricks.com/blog/2015/01/07
DataFrames blog postdatabricks.com/blog/2015/02/17
Databricks Cloud Platformdatabricks.com/product
Spark MOOCs on edXIntro to Spark & ML with Spark
Spark Packagesspark-packages.org