
A Full Machine Learning Pipeline in Scikit-learn vs. Scala-Spark: Pros and Cons
Jose Quesada and David Anderson (@quesada, @alpinegizmo, @datascienceret)



Scala and Spark are very close: if you learn one, you learn the other. Spark is distributed Scala.

Why this talk?

How do you get from a single-machine workload to a fully distributed one?

Answer: Spark machine learning

Is there something I'm missing out on by staying with Python?

Scala and Spark are very close: if you learn one, you learn the other. Spark is distributed Scala. This has been possible for years, but nowadays it's not only possible but pleasant.

You attend a Retreat, not a training.

Mentors are world-class: CTOs, library authors, inventors, founders of fast-growing companies, etc. DSR accepts fewer than 5% of applications.

Strong focus on commercial awareness

5 years of working experience on average

30+ partner companies in Europe

DSR participants do a portfolio project

Why is DSR talking about Scala/Spark?

They are behind Scala. IBM is behind this. They hired us to make training materials.

Source: Spark 2015 infographic

[Chart: mindshare among data science badasses (subjective), over time]

A talk should give you a superpower. This one: being able to answer "Am I missing out?"

Scala

"Scala offers the easiest refactoring experience that I've ever had, due to the type system." (Jacob, Coursera engineer)

Spark
Basically distributed Scala
API: Scala, Java, Python, and R bindings

Libraries: SQL, streams, graph processing, machine learning

One of the most active open source projects


Spark will inevitably become the de-facto Big Data framework for Machine Learning and Data Science.

Dean Wampler, Lightbend

All under one roof (big Win)

[Diagram: Spark Core, with Spark SQL, Spark Streaming, Spark.ml (machine learning), and GraphX (graphs) on top. Source: Spark 2015 infographic]

Spark Programming Model: [Diagram: input goes to the Driver / SparkContext, which sends tasks to Workers and collects results]


Data is partitioned; code is sent to the data. [Same diagram, with the data partitions residing on the Workers]

Example: word count. The input lines ("hello world", "foo bar", "foo foo bar", "bye world") are immutable, and partitioned across the cluster.

We get things done by creating new, transformed copies of the data, in parallel: the lines are split into individual words (hello, world, foo, bar, foo, foo, bar, bye, world), which are then mapped to pairs: (hello, 1), (world, 1), (foo, 1), (bar, 1), (foo, 1), (foo, 1), (bar, 1), (bye, 1), (world, 1).

Some operations require a shuffle to group data together: the (word, 1) pairs are grouped by key and reduced to per-word counts: (hello, 1), (foo, 3), (bar, 2), (bye, 1), (world, 2).

Example: word count (PySpark)

lines = sc.textFile(input)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
# the transformations above are pipelined into the same Python executor
word_count.saveAsTextFile(output)
# nothing happens until this line: the "action" forces evaluation of the RDD

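Since "Spark is distributed Scala", the same job written directly in Scala is a near-literal translation. A minimal sketch, assuming input and output are placeholder paths:

val lines = sc.textFile(input)
val words = lines.flatMap(line => line.split(" "))
val wordCount = words.map(word => (word, 1))
                     .reduceByKey(_ + _)   // runs entirely inside the JVM workers
wordCount.saveAsTextFile(output)           // action: forces evaluation of the RDD lineage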

RDD: Resilient Distributed Dataset
An immutable, partitioned collection of elements that can be operated on in parallel
Lazy
Fault-tolerant

Fault-tolerant: missing partitions can be recomputed by using the lineage graph to rerun operations.

PySpark RDD Execution Model

Whenever you provide a lambda to operate on an RDD:
Each Spark worker forks a Python worker
Data is serialized and piped to those Python workers

When using Python, the SparkContext in Python is basically a proxy: Py4J is used to launch a JVM and create a native Spark context, and Py4J manages communication between the Python and Java SparkContext objects.

In the workers, some operations can be executed directly in the JVM. But, for example, if you've implemented a map function in Python, a Python process is forked to execute this user-supplied mapping. Each thread in the Spark worker will have its own Python sub-process.

When the Python wrapper calls the underlying Spark code (written in Scala, running on a JVM), the translation between two different environments and languages can be a source of bugs and issues.


Scala and Spark are very close: if you learn one, you learn the other. Spark is distributed Scala. This has been possible for years, but nowadays it's not only possible but pleasant.

Impact of this execution model
Worker overhead (forking, serialization)
The cluster manager isn't aware of Python's memory needs
Very confusing error messages


Spark DataFrames (and Datasets)
Based on RDDs, but tabular; something like SQL tables
Not Pandas
Rescues Python from serialization overhead: df.filter(df.col("color") == "red") is processed entirely in the JVM, vs. rdd.filter(lambda x: x.color == "red")
Python UDFs and maps still require serialization and piping to Python
You can write (and register) Scala code, and then call it from Python

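For reference, a minimal sketch of the same filter in the Scala DataFrame API (df is assumed to be an existing DataFrame with a color column):

import org.apache.spark.sql.functions.col

val reds = df.filter(col("color") === "red")   // planned by Catalyst and executed entirely in the JVM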

DataFrame execution: unified across languages. [Diagram: Python, Java/Scala, and R DataFrames all produce the same logical plan, which is then executed.] The API wrappers create a logical plan (a DAG); Catalyst optimizes the plan; Tungsten compiles the plan into executable code.

DataFrame performance

ML Workflow: Data Ingestion → Data Cleaning / Feature Engineering → Model Training → Testing and Validation → Deployment


Machine learning with scikit-learn
Easy to use
Rich ecosystem
Limited to one machine (but see the sparkit-learn package)


Machine learning with Hadoop (in short: NO)
Each iteration is a new MapReduce job
Each job must store its data in HDFS → lots of overhead


How Spark killed Hadoop MapReduce
Far easier to program

More cost-effective since less hardware can perform the same tasks much faster

Can do real-time processing as well as batch processing

Can do ML, graphs

Just one map / reduce step per job, but many algorithms are iterative; being disk-based means long startup times. Spark is a wholesale replacement for MapReduce that leverages lessons learned from MapReduce. The Hadoop community realized that a replacement for MR was needed. While MR has served the community well, it's a decade old and shows clear limitations and problems, as we've seen. In late 2013, Cloudera, the largest Hadoop vendor, officially embraced Spark as the replacement. Most of the other Hadoop vendors have followed suit.

When it comes to one-pass, ETL-like jobs, for example data transformation or data integration, then MapReduce is the deal; this is what it was designed for.

Advantages for Hadoop: security, staffing.

Machine learning with Spark
Spark was designed for ML workloads
Caching (reuse data)
Accumulators (keep state across iterations)
Functional, lazy, fault-tolerant
Many popular algorithms are supported out of the box
Simple to productionalize models
MLlib is RDD-based (the past); spark.ml is DataFrame-based (the future)

A sample use case for accumulators: gradient descent.
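A minimal sketch of that idea, assuming an RDD data of (features, label) pairs, a current weights vector, and a hypothetical predict function; the accumulator gathers each partition's contribution to the loss for one iteration (Spark 2.0+ API):

val lossAcc = sc.doubleAccumulator("squared loss")
data.foreach { case (features, label) =>
  val error = predict(weights, features) - label   // predict and weights are assumed to exist
  lossAcc.add(error * error)                       // each worker adds its partition's contribution
}
println(s"total squared loss this iteration: ${lossAcc.value}")   // read back on the driver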

Spark is an ecosystem of ML frameworks
Spark was designed by people who understood the needs of ML practitioners (unlike Hadoop)
MLlib, Spark.ml, SystemML (IBM), KeystoneML

Spark.ML: the basics
DataFrame: ML requires DataFrames holding vectors
Transformer: transforms one DataFrame into another
Estimator: fit on a DataFrame; produces a Transformer
Pipeline: a chain of Transformers and Estimators
Parameter: there is a unified API for specifying parameters
Evaluator: computes a metric over a model's output
CrossValidator: model selection via grid search

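A minimal sketch of the Estimator/Transformer relationship (trainingDF and testDF are assumed DataFrames with label and features columns):

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()            // an Estimator, configured via the unified parameter API
  .setMaxIter(10)
  .setRegParam(0.01)
val model = lr.fit(trainingDF)               // fit() returns a Transformer (a LogisticRegressionModel)
val predictions = model.transform(testDF)    // transform() adds prediction columns to the DataFrame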

Machine learning scaling challenges that Spark solves: hyper-parameter tuning, ETL/feature engineering, and the model itself.

Q: Hardest scaling problem in data science? A: Adding people
Spark.ml has a clean architecture and APIs that should encourage code sharing and reuse
Good first step: can you refactor some ETL code as a Transformer?

Don't see much sharing of components happening yet
Entire libraries, yes; components, not so much
Perhaps because Spark has been evolving so quickly
E.g., a pull request implementing non-linear SVMs has been stuck for a year

Structured types in Spark

                  SQL        DataFrames     Datasets (Java/Scala only)
Syntax errors     Runtime    Compile time   Compile time
Analysis errors   Runtime    Runtime        Compile time

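A minimal sketch of why Datasets catch more errors at compile time (spark is an assumed SparkSession, and the case class and column names are assumptions):

import spark.implicits._

case class Item(color: String, status: String)

val ds = df.as[Item]                      // Dataset[Item]: rows with a compile-time type
val reds = ds.filter(_.color == "red")    // a typo such as _.colour is a compile-time error
// the untyped equivalent, df.filter(col("colour") === "red"), only fails when the plan is analyzed at runtime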

User experience: Spark.ml vs. scikit-learn

Indexing categorical features
You are responsible for identifying and indexing categorical features:

val rfcd_indexer = new StringIndexer()
  .setInputCol("color")
  .setOutputCol("color_index")
  .fit(dataset)

val seo_indexer = new StringIndexer()
  .setInputCol("status")
  .setOutputCol("status_index")
  .fit(dataset)

Here, Spark.ml departs from scikit-learn quite a bit.

Assembling features
You must gather all of your features into one Vector, using a VectorAssembler:

val assembler = new VectorAssembler()
  .setInputCols(Array("color_index", "status_index", ...))
  .setOutputCol("features")


Spark.ml vs. scikit-learn: Pipelines (good news!)
Spark ML and scikit-learn take the same approach (see the sketch after this list):

Chain together Estimators and Transformers

Support non-linear pipelines (must be a DAG)

Unify parameter passing

Support for cross-validation and grid search

Can write your own custom pipeline stages

Spark.ml just like scikit-learn

Verdict: good.
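As a minimal sketch of those points, here is a pipeline that chains the indexers and assembler from the previous slides with a classifier, plus grid search via CrossValidator (column names, the dataset DataFrame, and the parameter values are assumptions):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val colorIndexer = new StringIndexer().setInputCol("color").setOutputCol("color_index")
val statusIndexer = new StringIndexer().setInputCol("status").setOutputCol("status_index")
val assembler = new VectorAssembler()
  .setInputCols(Array("color_index", "status_index"))
  .setOutputCol("features")
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

// chain Estimators and Transformers into one Pipeline (itself an Estimator)
val pipeline = new Pipeline().setStages(Array(colorIndexer, statusIndexer, assembler, lr))

// grid search over hyper-parameters, with cross-validation
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)

val cvModel = cv.fit(dataset)   // dataset is an assumed DataFrame with color, status, and label columns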

Transformer | Description | scikit-learn
Binarizer | Threshold numerical feature to binary | Binarizer
Bucketizer | Bucket numerical features into ranges | (none)
ElementwiseProduct | Scale each feature/column separately | (none)
HashingTF | Hash text/data to vector; scale by term frequency | FeatureHasher
IDF | Scale features by inverse document frequency | TfidfTransformer
Normalizer | Scale each row to unit norm | Normalizer
OneHotEncoder | Encode k-category feature as binary features | OneHotEncoder
PolynomialExpansion | Create higher-order features | PolynomialFeatures
RegexTokenizer | Tokenize text using regular expressions | (part of text methods)
StandardScaler | Scale features to 0 mean and/or unit variance | StandardScaler
StringIndexer | Convert String feature to 0-based indices | LabelEncoder
Tokenizer | Tokenize text on whitespace | (part of text methods)
VectorAssembler | Concatenate feature vectors | FeatureUnion
VectorIndexer | Identify categorical features, and index | (none)
Word2Vec | Learn vector representation of words | (none)

Spark.ml vs. scikit-learn: NLP tasks (thumbs up)

From https://databricks.com/blog/2015/07/29/new-features-in-machine-learning-pipelines-in-apache-spark-1-4.html

Graph stuff (GraphX, GraphFrames): not great
Extremely easy to run monster algorithms on a cluster

GraphX has no Python API

GraphFrames are cool, and should provide access to the graph tools in Spark from Python

In practice, it didn't work too well

Things we liked in Spark ML
The architecture encourages building reusable pieces

Type safety, plus types are driving optimizations

Model fitting returns an object that transforms the data

Uniform way of passing parameters

It's interesting to use the same platform for ETL and model fitting

Very easy to parallelize ETL and grid search, or work with huge models


Disappointments using Spark ML
Feature indexing and assembly can become tedious

Surprised by the maximum depth limit for trees: 30

Data exploration and visualization aren't easy in Scala

Wish list: non-linear SVMs, deep learning (but see Deeplearning4j)


What is new for machine learning in Spark 2.0
DataFrame-based machine learning API emerges as the primary ML API: with Spark 2.0, the spark.ml package, with its pipeline APIs, emerges as the primary machine learning API. While the original spark.mllib package is preserved, future development will focus on the DataFrame-based API.
Machine learning pipeline persistence: users can now save and load machine learning pipelines and models across all programming languages supported by Spark.
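A minimal sketch of pipeline persistence (pipelineModel is assumed to be a fitted PipelineModel, e.g. from pipeline.fit(dataset) in the earlier sketch, and the path is a placeholder):

import org.apache.spark.ml.PipelineModel

pipelineModel.write.overwrite().save("/models/color-status-pipeline")   // hypothetical path
val reloaded = PipelineModel.load("/models/color-status-pipeline")      // can also be loaded from Python or R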

What is new for data structures in Spark 2.0

Unifying the API for streams and static data: infinite datasets get the same interface as DataFrames.

What have Spark and Scala ever given us?

Other than distributed DataFrames, distributed machine learning, easy distributed grid search, distributed SQL, distributed stream analysis, better performance than MapReduce, an easier programming model, and easier deployment...

What have Spark and Scala ever given us?

Reminder: 25 videos explaining ML on Spark, for people who already know ML

http://datascienceretreat.com/videos/data-science-with-scala-and-spark

Thank you for your attention! @quesada, @datascienceret